Data Analysis

45 min Read
Video Summary

Next generation sequencing (NGS) data is extremely high throughput, allowing for exponentially higher amounts of data to be generated than the traditional Sanger Sequencing. This is made possible by procuring millions of sequence clusters in parallel, and reading the sequences of all of these clusters base by base, through cycles of nucleotide incorporation, fluorescence reading, and dye cleaving (1). The explosion of experimental data from NGS has driven the need for new paradigms for data computation and knowledge extraction. In this knowledge base, we describe the common formats of NGS raw data files and some downstream analysis of NGS data.



Raw Data Output of NGS

.bcl format

The raw output of all Illumina-based next-generation sequencing machines is the .bcl format. These files are named after, and represent base calls per cycle, which is a binary file that contains both the base call and the quality of that base call for every “tile” in every cycle (2). Each lane of the flow cell has a set amount of swaths on the top and bottom surface, with each swath containing a variable amount of tiles, with a variable amount of nucleotide clusters in each tile. Illumina sequencing platforms use flow cells that vary in the number of lanes, swaths per lane, and tiles per swath. For example, the high output flow cell for the NextSeq 500 system has four lanes, with three swaths on the top and bottom of the lanes, for a total of 864 tiles each containing thousands of template clusters.

The advantage of the .bcl file format is that each base calling is recorded as the machine actually calls that base. In contrast to the fastq file format, where the base call recording is made after the entire sequence is read, calling the bases for every cluster in a particular tile for a particular cycle number in .bcl format is a much more efficient process for the sequencing machine.

The .bcl format, however, is not useful for any other reason other than being convenient and effective for the sequencing machine. If multiple samples were run on the sequencer, then the raw data must be sorted to separate reads with different indices (each library would have a unique index), in a process called demultiplexing (2). In this demultiplexing process, the .bcl data is quickly converted to the universally used fastq format using a bcl2fastq program made by Illumina. This program will assign a name to each read, which includes its location on the flow cell as well as the index specified, and the base calls that constitute each read with their accompanying quality scores.


Next Generation Sequencing NGS - Flowchart of NGS workflow


Figure 1 – Flowchart of an NGS workflow.


Fastq format

TThe fastq file format is universally used to represent raw sequencing data in the bioinformatics community. This format consists of four lines for each read. The first line starts with an ‘@’ character, followed by the sequencer identifier name given to the read by the sequencer, as described above. The second line is the sequence of the read itself, the third line is simply a ‘+’ symbol, acting as a spacer, and the fourth and final line is the Phred quality scores of the bases in the second line (3). There is several variations on the quality score encoding, including Phred64, Phred33, and Sanger (3). These versions use different characters to represent different quality scores, which can make it difficult to know what the actual quality scores of reads are if the version used in the fastq file is unknown. The original fastq version is Sanger encoding, and Illumina has had several versions since they have released their sequencers, based on updated software and systems.


Next Generation Sequencing NGS - Fastq File Specification


Figure 2 – Flowchart of an NGS workflow.


The quality scores assigned to each base call are referred to as Phred scores, and are linked to the probability that the sequencer called that base incorrectly (3). This score can be expressed as a logarithmic function as seen below, where Q is the Phred quality score, and P is the probability of an incorrect base call. The actual probabilities are calculated by the machine by determining fluorescence peak shape, resolution, and any potential overlap at every base. These scores tend to drop near the end of reads, because fluorescent overlap due to incomplete dye cleavage becomes a bigger factor the longer the read is.


Q = -10 log10P


Phred Quality Score Probability of Incorrect Base Call Base Call Accuracy
10 1 in 10 90%
20 1 in 100 99%
30 1 in 1000 99.9%
40 1 in 10,000 99.99%
50 1 in 100,000 99.999%
60 1 in 1,000,000 99.9999%


Figure 3 – Phred quality score chart.

Cell Cluster Density Data

An important factor included in the .bcl raw data generation is the flow cell cluster density. Libraries must be loaded onto the flow cell at the proper concentrations to avoid risking over or under loading of the flow cell. Underloading would lead to a lower throughput of the run, since there would be fewer clusters to gather data from, although quality scores would be unaffected. Overloading of the sequencer however, depending on the severity, could potentially lead to significantly lower quality scores, rendering the entire sequencing run useless. This is because if clusters are packed too close together, there will be too much interference and background from surrounding clusters when the sequencer tries to image the fluorescent signal given off by the recently incorporated nucleotide for each cluster. This is an even more significant issue on a sequencer that uses two channel sequencing such as the NextSeq 500TM (rather than the traditional four channel sequencing). With two channel fluorescent imaging systems, clusters visualized in red or green are interpreted as C and T bases respectively (as with four channel systems), however G basecalls are represented by no fluorescence, and A basecalls show dual red and green fluorescence, which images as yellow (4). In this system, background noise from overclustering surrounding a G basecall or an A basecall would likely be especially problematic. Other statistical metrics taken by the sequencer include clusters passing filter (CPF), which refers to the passing/failure of reads against Illumina’s filter which removes clusters that have a dim or low quality signal.


Downstream Analysis

1) Whole Genome Sequencing Data

After generation of fastq data, many options for downstream analysis lie ahead, the most popular of these being alignment of the fastq data to a reference genome. Numerous mapping programs exist, prominent examples being BWA, Bowtie2, Maq, Stampy, and NovoAlign (5, 6, 7, 8, 9). These programs are exponentially better suited than tools such as BLAST, because they use heuristic (approximate) algorithms to make the alignment process extremely fast and able to deal with millions or billions of reads being mapped against very large reference genomes.

Such programs as Maq and Bowtie use a computational strategy known as ‘indexing’ to speed up their mapping algorithms (6). Like the index at the end of a book, an index of a large DNA sequence allows one to rapidly find shorter sequences embedded within it. Maq is based on a straightforward but effective strategy called spaced seed indexing. In this strategy, a read is divided into four segments of equal length, called the ‘seeds’. If the entire read aligns perfectly to the reference genome, then clearly all of the seeds will also align perfectly. If there is one or two mismatches however (perhaps due to SNPs), then it must fall within one or two of the four seeds, but the others will still match perfectly. Thus, by aligning all possible pairs of seeds (six possible pairs) against the reference, it is possible to narrow the list of candidate locations within the reference where the full read may map, allowing at most two mismatches. Maq's spaced seed index enables it to perform this narrowing operation very effectively. The resulting set of candidate reads is typically small enough that the rest of the read—that is, the other two seeds that might contain the mismatches—may be individually checked against the reference (6).

Bowtie takes a somewhat different approach, using a technique originally called the Burrows-Wheeler transform. Using this transform, the index for the entire human genome fits into less than two gigabytes of memory; in contrast to a spaced seed index, which may require more than fifty, and yet reads can still be aligned efficiently (6). Bowtie aligns a read one character at a time to the Burrows-Wheeler–transformed genome. Each successively aligned new character allows Bowtie to narrow the list of positions to which the read might map. If Bowtie cannot find a location where a read aligns perfectly, the algorithm backtracks to a previous character of the read, makes a substitution and resumes the search. This alignment algorithm is substantially more complicated than spaced-seed algorithms, but is much faster (6).


a. De novo Assembly

Alternatively, if the organism being sequenced does not have a reference genome available, then the reads must be aligned in a de novo manner. A large scale comparison of many of the top de novo alignment programs such as ABySS and SOAPdenovo was performed and attempts to illustrate the success of each program and the variability between them (10). With this type of alignment, the reads are examined against each other to check for overlap between them in order to build larger contiguous sequences called contigs. The goal of this approach is to build a single contig that encompasses the entire genome of the organism, however this is usually a pipedream without additional sequencing support. Paired end reads are extremely beneficial to de novo experiments, as they can potentially link two smaller contigs together that are separated by a region that is impossible to sequence, such as a homopolymer stretch. Mate-pair reads are also linked together, but by much larger distances, typically 5-10k bases. These would help in the same manner as paired end reads, linking contigs separated by regions of thousands of base pairs that cannot be aligned, such as large microsatellites.


Next Generation Sequencing NGS - Rainbow Sequence


Figure 4 – The sequencing of a beautiful rainbow.


b. Sequence Alignment

Alignment of sequenced fastq data through either reference or de novo methods will result in the generation of a SAM file, which is the universal file format for mapped sequence reads. This file type contains the sequence and quality scores of each read just as with the fastq file, but is much more detailed as it also specifies information about the location in the genome the read maps to and how confident the aligner was in mapping to that location, information about its mate read (if paired end sequenced), and a CIGAR string which outlines the presence of any SNP’s contained in the read. The SAM file also contains a few lines of header text which contain generic information about the alignment, such as the version, and information about the reference. A BAM file is the compressed binary version of SAM, and is otherwise identical. Complete information regarding the SAM format can be found here.


c. Variant Calling

After alignment to a reference genome, a common next step is variant calling, where a program examines your mapped data and the reference side by side, to determine the existence of SNPs, de novo SNVs, and INDELs. There are two major open source variant calling programs available to the bioinformatics community, those being SAMtools mpileup (11), and the Genome Analysis Toolkit (GATK) (12). These programs are highly similar, and both use Bayesian algorithms to compare your aligned sequence against the reference, to determine the presence or absence of SNPs and small INDELs. Detailed information about the algorithms these programs use can be found at their respective webpages.


d. Visualization of WGS Data

After both of these pipelines have been completed, visualization of the NGS data is usually one of the final steps in analysis. Two options exist for this process: Integrative Genomics Viewer (IGV) (13), and the UCSC Genome Browser (14). IGV is a downloadable program that allows you to view your data locally on your own computer, thus a better option for sensitive data that cannot be uploaded to the cloud. UCSC is an online tool that allows you to upload your own genome and various tracks (such as those generated by variant calling). Since it is cloud-based, you do not have to download a program, however the strict confidentiality of your data will cease to exist.


Next Generation Sequencing NGS - UCSC Genome Browser


Figure 5 – The UCSC Genome Browser illustrating the site of a single nucleotide variation (SNV) of T to C.


2) RNA Sequencing Data

RNA-Seq analysis is slightly different than pipelines used for WGS, since reads will only map to the areas of the genome that code for RNA transcripts, leaving the remaining regions of the genome uncovered. Within these transcripts are multiple exon/intron boundaries, of which reads will map only to the former, since introns are spliced out during transcript processing. Therefore, some reads will be split, with half covering both the last part of the first exon, and the start of the second exon, creating a problem that would give fits to a standard WGS alignment program such as BWA. To get around this, special mapping programs have been developed such as TopHat (15) and STAR (16) that include algorithms that help deal with this situation. A specialized type of RNA-Seq called small RNA or miRNA sequencing also exists, where the starting RNA material is enriched for small RNAs. This allows researchers to look at the presence of miRNAs and short non-coding RNAs.


3) Exome Sequencing Data

Another common type of sequencing is Exome-Seq, which is the sequencing of all the protein-coding genes in the genome. This is accomplished by using probes which target only the 1% of the genome which codes for protein, leaving the other 99% of the genome behind. This is a powerful approach because the vast majority of mutation that cause or could lead to disease are found in the exons of the genome, so this type of sequencing would be able to detect such mutations in a much more focused effort, without the high cost of whole genome sequencing.The same data analysis tools for WGS can be applied to exome-seq data.


Additional Software and Tools Available

Picard is a set of command line Java-based tools that manipulates NGS data in the SAM/BAM format (17). These tools include marking and/or removal of duplicates, conversion from SAM to BAM and vice-versa, and collection of various run metrics. This suite of tools is a mainstay in popular standard bioinformatics pipelines.


Outside of the common bioinformatics pipeline of alignment, variant calling, and viewing, there are additional software packages available that can help you manipulate or better understand your NGS data. FastQC is a tool that provides a quick method of quality control on fastq files before they are used for deeper analysis downstream (18). This program compiles important statistics on NGS data, such as quality scores per base and per sequence overall, sequence and GC content, and checks for duplicated and overrepresented sequences within the file. The figure below shows a report of a fastq file that shows slightly diminishing quality towards the end of the 75bp sequence, but otherwise appears to be of high quality.


Next Generation Sequencing - FastQC Report


Figure 6 – FastQC report for a 75 base pair read fastq file.

abm runs FastQC on all NGS sequencing orders, as a way to confirm high quality results.


References
  • Illumina. (2011). Quality scores for next-generation sequencing. Retrieved from http://www.illumina.com/documents/products/technotes/technote_Q-Scores.pdf
  • Illumina. (2013). Bcl2Fastq Conversion User Guide. Retrieved from https://support.illumina.com/content/dam/illumina-support/documents/documentation/
    software_documentation/bcl2fastq/bcl2fastq_letterbooklet_15038058brpmi.pdf
  • FastQ Format. (n.d.). In Wikipedia. Retrieved December 2, 2015, from https://en.wikipedia.org/wiki/FASTQ_format
  • Illumina. (2014). Illumina two-channel SBS sequencing technology. Retrieved from https://www.illumina.com/content/dam/illumina-marketing/documents/products/techspotlights/techspotlight_two-channel_sbs.pdf
  • Li, H., & Durbin, R. (2009). Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 25(14). 1754-1760. doi:10.1093/bioinformatics/btp324
  • Langmead, B., & Salzberg, S. L. (2012). Fast gapped-read alignment with Bowtie 2. Nature Methods. 9, 357-359. doi:10.1038/nmeth.1923
  • Li, H., Ruan, J., & Durbin, R. (2008). Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Research, 18(11),1851-1858. doi:10.1101/gr.078212.108
  • Lunter, G., & Goodson M. (2011). Stampy: A statistical algorithm for sensitive and fast mapping of Illumina sequence reads. Genome Research, 21(6), 936-939. doi:10.1101/gr.111120.110
  • Novocraft Technologies. (2015). Novoalign [Computer software]. Retrieved from http://www.novocraft.com/products/novoalign/
  • Bradnam, K. R., Fass, J. N., Alexandrov, A., Baranay, P., Bechner, M., Birol, I., … Korf, I. F. (2013). Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience, 2, 10. Retrieved from http://www.gigasciencejournal.com/
  • Wellcome Trust Sanger Institute. (2015). SAMtools [Computer software]. Retrieved from http://samtools.sourceforge.net/
  • Broad Institute. (2015). GATK [Computer software]. Retrieved from https://www.broadinstitute.org/gatk/
  • Robinson, J. T., Thorvaldsdóttir, H., Winckler, W., Guttman, M., Lander, E. S., Getz, G., & Mesirov, J. P. (2011).  Integrative Genomics Viewer. Nature Biotechnology, 29, 24–26. Retrieved from http://www.nature.com/nbt/index.html/
  • Kent, W. J., Sugnet, C. W., Furey, T. S., Roskin, K. M., Pringle, T. H., Zahler, A. M., & Haussler, D. (2002). The human genome browser at UCSC. Genome Research, 12(6), 996-1006. Retrieved from http://genome.cshlp.org/
  • Trapnell, C., Roberts, A., Goff, L., Pertea, G., Kim, D., Kelley, D. R., Pimentel, H., Salzberg, S. L., Rinn, J. L., & Pachter, L. (2012). Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nature Protocols, 7(3), 562-78. doi:10.1038/nprot.2012.016
  • Dobin, A., Davis, C. A., Schlesinger, F., Drenkow, J., Zaleski, C., Jha, S., Batut, P., Chaisson, M., & Gingeras, T. R. (2013) STAR: ultrafast universal RNA-seq aligner. Bioinformatics, 29(1), 15-21. doi:10.1093/bioinformatics/bts635
  • Broad Institute. (2015). Picardtools [Computer software]. Retrieved from http://broadinstitute.github.io/picard/
  • Babraham Bioinformatics. (2015). FastQC Tool [Computer software]. Retrieved from http://www.bioinformatics.babraham.ac.uk/projects/fastqc/
Top