Next Generation Sequencing (NGS) – Experimental Design

Next Generation Sequencing (NGS) is a useful tool for determining DNA sequence, information which is valuable in furthering our understanding of biological processes. Unlike some tools, NGS is flexible and can be applied in different situations, ranging from the exome to small RNAs. This flexibility means that there are parameters that need to be considered prior to running an NGS experiment. This section outlines some of these important considerations.

Coverage and What Coverage Should be Used

The current NGS platforms available on the market, although very accurate, are still prone to error. Even at accuracies of 99% and greater, a generated sequence may contain incorrect nucleotides. If a machine’s accuracy is 99%, on average one base out of every 100 is read incorrectly; since NGS platforms generate large amounts of output, these errors add up quickly. The way to circumvent this limitation is to sequence each nucleotide multiple times. The number of times a nucleotide is sequenced is referred to as “coverage”, or “depth” (1). Coverage may also refer to the percentage of target bases that have been sequenced a specific number of times (1).
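
The effect of repeated sequencing can be illustrated with a short calculation. The sketch below is a simplification and not a description of how any real variant caller works: it assumes that read errors are independent, that every read has the same error rate, and that a position is called by simple majority vote. The function names are hypothetical.

```python
from math import comb


def expected_errors(total_bases: int, accuracy: float) -> float:
    """Expected number of miscalled bases in `total_bases` of raw output."""
    return total_bases * (1.0 - accuracy)


def majority_error_probability(depth: int, per_read_error: float) -> float:
    """Probability that more than half of `depth` reads are wrong at one position.

    Assumes independent errors (binomial model). This overstates the risk
    slightly, because wrong reads need not agree on the same wrong base.
    """
    threshold = depth // 2 + 1  # strictly more than half of the reads
    return sum(
        comb(depth, k)
        * per_read_error ** k
        * (1.0 - per_read_error) ** (depth - k)
        for k in range(threshold, depth + 1)
    )


# Compare a single read with increasing depth at 99% per-base accuracy.
for depth in (1, 3, 15, 30):
    print(depth, majority_error_probability(depth, 0.01))
```

Under these assumptions the chance of a wrong majority call falls steeply as depth increases, which is the intuition behind sequencing each nucleotide multiple times.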

Coverage will vary depending on the type of NGS and the research application. Higher coverage tends to be used when searching for a variant that is rare (<1%) in a sample. An example is the detection of cancer mutations in tumour DNA circulating in the plasma of cancer patients (2). The appropriate coverage for an experiment is ultimately determined on a case-by-case basis. Coverage also varies depending on the NGS type (e.g. Whole Genome Sequencing). For instance, whole genome sequencing generally requires approximately 30x coverage, as this will detect 98% of heterozygous single nucleotide variants identified in a microarray. Coverage can be computed with the Lander-Waterman equation below (1).

C = LN/G

The equation consists of the following variables:

C = coverage
G = haploid genome length (in nucleotides)
L = read length (in nucleotides)
N = number of reads
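
As a worked example of the equation, the sketch below computes C from L, N and G, and rearranges the same equation to estimate the number of reads needed for a target coverage. The genome size and read length used in the example are illustrative assumptions, not values taken from this article.

```python
import math


def coverage(read_length: int, num_reads: int, genome_length: int) -> float:
    """Lander-Waterman coverage: C = L * N / G."""
    return read_length * num_reads / genome_length


def reads_needed(target_coverage: float, read_length: int, genome_length: int) -> int:
    """Rearranged equation: N = C * G / L, rounded up to a whole read."""
    return math.ceil(target_coverage * genome_length / read_length)


# Assumed example: a 3,000,000,000-nucleotide haploid genome, 150-nucleotide reads.
GENOME = 3_000_000_000
READ_LENGTH = 150

print(reads_needed(30, READ_LENGTH, GENOME))           # reads for 30x coverage
print(coverage(READ_LENGTH, 400_000_000, GENOME))      # coverage from 400 million reads
```

The result is an average over the whole genome; coverage at individual positions will vary around this value.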

For general coverage guidelines, please refer to the table below:

NGS Type	Application	Recommended Coverage (x) or Reads (millions)	References
Whole Genome Sequencing	Homozygous Single Nucleotide Variants (SNVs) – single nucleotide changes in genes where the alleles are identical.	15x	Bentley et al., 2008
	Heterozygous SNVs – single nucleotide changes in genes where the alleles are different from each other.	33x	Bentley et al., 2008
	Insertion/Deletion Mutations (INDELS) – mutations in the genome where nucleotides are inserted or removed.	60x	Feng et al., 2014
	Genotype calling - determination of an individual's genotype.	35x	Ajay et al., 2011
	Copy Number Variation (CNV) – variance in the number of copies of a gene between individuals	1-8x	Xie et al., 2009; Medvedev et al., 2010
Whole Exome Sequencing	Homozygous SNVs	100x (3x local read coverage)	Clark et al., 2011; Meynert et al., 2013
	Heterozygous SNVs	100x (13x local read coverage)	Clark et al., 2011; Meynert et al., 2013
	INDELs	Not recommended	Feng et al., 2014
RNA Sequencing - Transcriptome Sequencing	Differential expression profiling – quantitative measurement of gene expression across multiple genes to examine different levels of expression in the sample.	10-25 million	Liu et al., 2014; ENCODE 2011 RNA-Seq
	Alternative splicing – identification of different splice variants from mRNA transcripts.	50-100 million	Liu et al., 2014; ENCODE 2011 RNA-Seq
	Allele specific expression – transcript expression which is affected by a specific gene allele.	50-100 million	Liu et al., 2014; ENCODE 2011 RNA-Seq
	De novo assembly – construction of a transcriptome without use of a reference sequence.	>100 million	Liu et al., 2014; ENCODE 2011 RNA-Seq
RNA Sequencing - Small RNA (microRNA) Sequencing	Differential expression – quantitative measurement of small RNA expression to examine different levels of expression in the sample.	~1-2 million	Metpally et al., 2013; Campbell et al., 2015
RNA Sequencing - Small RNA (microRNA) Sequencing	Discovery of novel small RNAs.	~5-8 million	Metpally et al., 2013; Campbell et al., 2015
DNA Methylation Sequencing	Bisulfite Sequencing (Bisulfite-Seq) – sequencing which is done by treating genomic DNA with bisulfite to convert non-methylated cytosines to uracil.	5-15x per strand or per replicate; 30x total methylome	Ziller et al., 2015; Epigenomics Road Map

Adapted from Genohub website, “Table 1: Coverage and Read Recommendations by Application”

Library Preparation

Before a sample can be sequenced, it must be prepared into a sample library from genomic DNA or total RNA. A library is a collection of randomly sized DNA fragments that represent the sample input. However, different library preparation steps are taken depending on the NGS application. Four types of NGS applications are considered below: Whole Genome Sequencing (WGS), Exome Sequencing (Exome-Seq), RNA Sequencing (RNA-Seq), and Methylation Sequencing (Methyl-Seq). We will focus on the protocols used with the Illumina NGS platforms, as they use the most effective sequencing method, sequencing by synthesis, and generate the highest output of all the platforms currently on the market. For a more detailed explanation, please view our Next Generation Sequencing (NGS) – An Introduction knowledge base.

Library Preparation for Whole Genome Sequencing (WGS)

Whole Genome Sequencing, or WGS, refers to the sequencing of an organism’s entire genome. Sample library preparation for WGS is dependent on two considerations: 1) The genome size of the organism from which the sample was derived, and 2) the amount of sample available to be sequenced. Based on these two considerations, the method of sample library preparation can be specified.

1) Illumina TruSeq PCR-free Library Preparation Kit – Any Size Genome with Large Sample Input

The Illumina TruSeq PCR-free Library Preparation Kit is ideal if there is 1-2 μg of genomic DNA available, regardless of the genome size. The purpose of this particular kit is to avoid PCR amplification errors associated with the DNA polymerase working over long distances. Genomic DNA is isolated from the sample and fragmented physically or chemically, leaving random 5’ and 3’ end overhangs. The resulting DNA fragments are then purified for the desired size of 350bp or 550bp, using magnetic beads which bind to these fragment sizes. Size selection occurs by incubating specific ratios of magnetic beads with fragmented DNA; a higher ratio of magnetic beads to DNA results in a greater size range of purified DNA. Following that, the end overhangs created from fragmentation are repaired into blunt ends. This is achieved by using a combination of a 3’ to 5’ exonuclease and a 5’ to 3’ polymerase. The exonuclease removes the 3’ overhangs, while the polymerase fills in the 5’ overhangs. The 3’ ends of the fragments are additionally adenylated; this single-base overhang pairs with the 3’ thymine overhang of the adapters, which are then ligated to the fragments. This ligation step is critical for the sequencing reaction later on, as the adapters will enable the DNA to hybridize to the surface of the sequencing reaction chip. The collection of adapter-ligated fragments forms a library which can be sequenced. Before the library can be sequenced, it must be validated quantitatively and qualitatively. The library is validated quantitatively with qPCR. There are two reasons for this: 1) the primers used in the qPCR target the adapter sequences and will only allow amplification of adapter-ligated fragments, and 2) the library concentration is too low to be quantified fluorometrically as there was no PCR amplification. The library is additionally validated qualitatively with the Agilent Technologies 2100 Bioanalyzer, before optional pooling with other libraries.
Technical details regarding library validation instruments can be found in the Quality Control section below. For further details, please refer to Figure 1 presented below.

Figure 1 – General flow chart for the Illumina TruSeq PCR-free Library Preparation Kit protocol (adapted from TruSeq PCR-free Library Preparation Kit protocol).

2) Illumina TruSeq Nano DNA Library Prep Kit - Any Size Genome with Small Sample Input

The TruSeq Nano DNA Library Prep Kit is ideal if there is 100-200 ng of genomic DNA available. The protocol is almost identical to the TruSeq PCR-free Library Preparation Kit protocol, save for PCR amplification and library validation. Amplification occurs between the adapter ligation and library validation steps (see Figure 2). The purpose of PCR amplification is to enrich for adapter-ligated DNA fragments and increase the concentration of the library for sequencing. Library quantification and qualitative analysis are nearly the same as for the TruSeq PCR-free Library Preparation Kit. However, the high library concentration and, more importantly, the selective amplification of fragments ligated with correctly oriented adapters allow quantification to be done fluorometrically.

Figure 2 – General flow chart for the Illumina TruSeq Nano DNA Library Prep Kit protocol (adapted from TruSeq Nano DNA Library Prep Kit protocol).

3) Illumina Nextera DNA Library Prep Kit - Large Genome Size with Small Sample Input

The Nextera DNA Library Prep Kit is ideal for large, complex genomes (e.g. the human genome) and provides a shorter sample preparation time relative to the TruSeq PCR-free and Nano Library Prep Kits. The protocol is fairly similar to that of the TruSeq Nano DNA Library Prep Kit, with a few differences. Unlike the TruSeq kits, fragmentation and adapter ligation of genomic DNA, or “tagmentation”, occur together in the first step. This is done with an enzyme complex called a transposome. The transposome is a transposase-transposon complex; this means that it is able to make cuts in DNA like a transposase but also insert a portion of itself in the DNA sequence like a transposon. The Nextera transposome is unique as the transposon portion of the complex consists of adapter sequences. During tagmentation, the Nextera transposome simultaneously cleaves the DNA molecule and inserts the adapter sequences. A subsequent clean-up step removes any transposome still bound to the DNA so that it does not interfere with later steps. Because DNA fragmentation and tagging occur at the same time, there is no need for DNA fragment end repair or adapter ligation preparation. Library quantification is done solely fluorometrically, with Qubit. For further details, please refer to Figure 3 presented below.

Figure 3 – General flow chart for the Illumina Nextera DNA Library Prep kit (adapted from Nextera DNA Library Prep kit protocol).

4) Illumina Nextera DNA XT Library Prep Kit - Small Genome Size with Small Sample Input

The Nextera DNA XT Library Prep Kit is ideal for small genomes (e.g. bacteria) as well as plasmids and amplicons. The protocol is very similar to that of the Nextera Library Prep Kit, with two exceptions: there is neither a post-tagmentation clean-up nor a library quantification step.

Library Preparation for Exome Sequencing

Exome Sequencing, or Exome-Seq, is the sequencing of the coding portion of the genome. Currently this is a more affordable alternative to WGS as only about 2% of the whole genome is sequenced. Exome-Seq can be performed in two ways: 1) sequencing of only the exons, or 2) sequencing of all the exons, introns (non-protein coding regions), and regulatory regions such as the 5’ and 3’ untranslated regions (5’ and 3’-UTR) and microRNA (miRNA) sequences.

1) Illumina Nextera Rapid Capture Exome Kit

The Nextera Rapid Capture Exome Kit is ideal if only the exons are to be analyzed. Like the Nextera Library Prep Kit protocol for WGS, tagmentation of genomic DNA happens in the first step. This is followed by a clean-up step where the transposome is removed. The removal is necessary to prevent transposome interference in later steps. Adapter-ligated fragments are amplified with PCR to enrich for adapter-ligated DNA and to increase the concentration of the library. In addition, primers needed for sequencing and indexing are added in this first of three PCR enrichment steps. Once amplification is complete, the library is purified from non-amplified fragments with magnetic beads. The library is also quantified fluorometrically to determine if there is sufficient product. Next, exome-containing fragments are isolated. This is achieved by hybridizing the amplified fragments to biotinylated oligonucleotide probes which are complementary to the exome, followed by “capture” through non-covalent binding of the biotinylated sequences to streptavidin beads. During these steps, non-specifically bound DNA is removed with washes. The process of hybridization and capture is repeated a second time. Once this is complete, the DNA library has been enriched twice. The library is then purified with magnetic beads to give a pure sample for a final round of enrichment prior to sequencing. PCR enrichment is performed a third time, and then the library is purified. Finally, the library is validated quantitatively and qualitatively. Quantification is done using either qPCR or Qubit; qualitative analysis is performed with the Agilent Technologies 2100 Bioanalyzer. For further details, please refer to Figure 4 presented below.

Figure 4 – General flow chart for the Illumina Nextera Rapid Capture Exome Kit protocol (adapted from Nextera Rapid Capture Kit protocol).

2) Illumina Nextera Rapid Capture Expanded Exome Kit

The Nextera Rapid Capture Expanded Exome Kit is ideal if a more complete analysis of the exome, including UTRs and miRNA binding regions, is desired. The protocol is almost identical to the Nextera Rapid Capture Exome Kit, except for the addition of specific probes and related beads which bind and capture non-protein coding regions. Additional information about the protocol can be found in the Nextera Rapid Capture Exome Kit section (see link above).

Library Preparation for RNA-Seq

RNA Sequencing, or RNA-Seq, is the sequencing of the RNA transcripts present in a sample from an organism. This can cover the entire collection of transcripts or a subset of it, such as mRNA or small RNAs.

RNA-Seq is divided into three categories based on the RNA chosen to be sequenced: total RNA-Seq, mRNA-Seq, and small RNA-Seq. Each of these categories has a unique sample library preparation protocol.

1) Illumina TruSeq Stranded Total RNA Kit

The TruSeq Stranded Total RNA Kit is ideal if a complete view of the transcripts in a sample is desired. Ribosomal RNA (rRNA) is not a desired component of the total RNA sample library, so it must be depleted. The depletion of rRNA is done by binding it to magnetic beads carrying sequences complementary to rRNA. After hybridization, the magnetic beads are pulled out of the solution with a strong magnet and the supernatant is used in further preparation steps. The remaining RNA is cleaned, fragmented, and primed in a single step for cDNA synthesis. Using random primers, the first cDNA strand is then synthesized. During this step, the compound Actinomycin D is added; this is done to prevent second strand synthesis while the first strand is made. The RNA template is then degraded to ensure that only the second cDNA strand will be produced in the next synthesis step. Next, the second cDNA strand is synthesized, although dUTP nucleotides are used instead of dTTP nucleotides. The purpose of using dUTPs is to differentiate between the two strands of DNA once the second cDNA strand has been synthesized. The resulting double-stranded DNA is prepared for adapter ligation through adenylation of the 3’ ends; this makes the cDNA able to pair with the thymine on the 3’ end of the adapters. Once adenylation is complete, adapters are ligated onto the 3’ ends of the cDNA and dUTPs are enzymatically removed (see NEB website link provided here for an overview). The adapter-ligated cDNA fragments now lacking dUTPs are then enriched via PCR amplification. The resulting library is validated quantitatively with qPCR and qualitatively with the Agilent Technologies 2100 Bioanalyzer before normalization. If necessary, the library can be pooled with others for multiplexing. For further details, please refer to Figure 5 presented below. A more detailed protocol can be found on the link here.

Figure 5 – General flow chart for the Illumina TruSeq Stranded Total RNA Kit protocol (adapted from TruSeq Stranded Total RNA Kit protocol).

2) Illumina TruSeq Stranded mRNA Kit

The TruSeq Stranded mRNA Kit is ideal if the gene expression profile of a sample is desired. The protocol is identical to the TruSeq Stranded Total RNA kit, with the exception of mRNA enrichment instead of rRNA depletion in the first step. For further details, please refer to Figure 6 presented below.

Figure 6 – General flow chart for the Illumina TruSeq Stranded mRNA Kit protocol (adapted from TruSeq Stranded mRNA kit protocol).

3) Illumina TruSeq Small RNA Kit

The TruSeq Small RNA Kit is ideal if small, non-coding RNAs (e.g. miRNA) are to be analyzed. The protocol for this kit is very different from those of the TruSeq Stranded Total RNA and TruSeq Stranded mRNA kits. Unlike the other RNA library prep kits, the first step consists of sequential blunt-ended adapter ligation (3’ adapter then 5’ adapter) to total RNA. This protocol also does not involve either depletion or enrichment of RNA. The adapter-ligated RNAs are then subjected to RT-PCR to enrich for RNAs that have adapters ligated in the correct orientation. Products of RT-PCR are run on an agarose gel; the desired products are isolated at sizes of 147bp and 157bp. The purified library is only validated qualitatively, using the Agilent Technologies 2100 Bioanalyzer. For further details, please refer to Figure 7 presented below.

Figure 7 – General flow chart for the Illumina TruSeq Small RNA Library Prep Kit protocol (adapted from TruSeq Small RNA Library Prep Kit protocol).

Library Preparation for Methyl-Seq

Methylation Sequencing, or Methyl-Seq, is the sequencing of the methylated regions of the genome. One of the ways to perform Methyl-Seq is by treating genomic DNA with bisulfite to convert non-methylated cytosines to uracils. Methylated cytosines are retained and they can be analyzed for methylation patterns.
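
The principle of bisulfite conversion can be sketched in a few lines. This is a toy, single-strand illustration with hypothetical function names; it assumes complete conversion and ignores the complementary strand, sequencing errors and alignment, all of which matter in real Methyl-Seq analysis.

```python
def bisulfite_convert(sequence: str, methylated_positions: set[int]) -> str:
    """Convert every unmethylated cytosine to uracil; methylated cytosines are retained."""
    converted = []
    for index, base in enumerate(sequence.upper()):
        if base == "C" and index not in methylated_positions:
            converted.append("U")
        else:
            converted.append(base)
    return "".join(converted)


def call_methylated(reference: str, read: str) -> list[int]:
    """Positions where a reference C is still read as C, i.e. it was protected."""
    return [
        index
        for index, (ref_base, read_base) in enumerate(zip(reference.upper(), read.upper()))
        if ref_base == "C" and read_base == "C"
    ]


reference = "ACGTCCGA"
treated = bisulfite_convert(reference, methylated_positions={1})
# After amplification, uracil is read as thymine by the sequencer.
read = treated.replace("U", "T")
print(treated, read, call_methylated(reference, read))
```

Comparing the converted read with the untreated reference is what allows methylated cytosines to be located.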

1) Illumina TruSeq DNA Methylation Kit

The TruSeq DNA Methylation Kit is ideal if genome methylation is to be analyzed. The sample library preparation begins with fragmentation of the genome. Once complete, the fragments undergo a bisulfite treatment to convert non-methylated cytosines to uracils, while retaining those which are methylated. DNA is then amplified using random primers that carry the 5’ adapter sequence at their 5’ end. Next, the 3’ adapter tag is ligated. PCR enrichment for the adapter-ligated fragments is performed, and if desired, indexing primers for sequencing are added. The enriched library is purified using magnetic beads, before quantitative validation of the library with qPCR or a fluorometric method. Qualitative analysis is also performed with the Agilent Technologies 2100 Bioanalyzer. For further details, please refer to Figure 8 presented below.

Figure 8 – General flow chart for the Illumina TruSeq DNA Methylation Kit protocol (adapted from TruSeq DNA Methylation Kit protocol).

Quality Control

Prior to sequencing, the sample library must be validated quantitatively and qualitatively. This is performed to verify whether there is a sufficient amount of good quality DNA in the prepared library. Both quality and quantity play important roles in generating data. If the library contains more or less DNA than the optimal amount set by the library protocol, the sequencing reaction runs less efficiently. Too much DNA saturates the flow cell and causes read problems, while too little DNA reduces coverage; in both cases the data are of lower quality. In terms of quality, a good library is one that has a diverse set of DNA fragments with minimal duplicate fragments. This is important because the PCR amplification used in some sample library preparation protocols generates duplicates of fragments. The consequence of duplicate fragments is that the sequencing reaction will be biased towards these fragments (3). Rather than a wide range of fragments being sequenced, the same fragments are sequenced repeatedly; this results in their overrepresentation in the machine output.
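
Overrepresentation from duplicate fragments can be quantified after sequencing as a duplication rate. The sketch below is a minimal illustration that assumes each fragment has already been aligned and is described by a hypothetical (chromosome, start, end, strand) tuple; fragments with identical coordinates are treated as duplicates, which is a common simplification rather than a rule stated in this article.

```python
from collections import Counter


def duplication_rate(fragments: list[tuple[str, int, int, str]]) -> float:
    """Fraction of fragments that repeat an already-seen (chrom, start, end, strand)."""
    counts = Counter(fragments)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    duplicates = total - len(counts)  # every copy beyond the first of each fragment
    return duplicates / total


fragments = [
    ("chr1", 100, 450, "+"),
    ("chr1", 100, 450, "+"),  # same coordinates: counted as a duplicate
    ("chr1", 100, 450, "+"),
    ("chr1", 200, 550, "-"),
]
print(duplication_rate(fragments))
```

A high rate suggests a library with low diversity, where extra sequencing mostly re-reads the same fragments.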

Library quantification is performed using either qPCR or a fluorometric method like Qubit. Some libraries may only be quantified using one of the two methods. Sample library quality is then verified with the Agilent Technologies 2100 Bioanalyzer. Please refer to the Library Preparation section for further details.

qPCR

qPCR is a method of quantifying a sample library before sequencing. It is ideal when there is too little library for fluorometric quantification, commonly because there was no PCR amplification. It is also more sensitive than Qubit for quantifying the adapter-ligated fragments in a sample. qPCR selectively amplifies such fragments, so it avoids the inaccuracies of Qubit that result from being unable to distinguish between fragments which can and cannot be sequenced. The main drawback of this procedure is that it is very time-consuming.

Qubit

Qubit is an alternative to qPCR for quantifying a sample library. Relative to qPCR, it provides results faster; however, it is not applicable for cases where there is no PCR enrichment as it is less sensitive than qPCR and requires more sample. Quantification is performed by illuminating and detecting dyes which selectively bind to DNA or RNA. First, a standard must be measured with the appropriate assay. The sample, which may be diluted, is then mixed with the appropriate dye before being inserted into the machine. For further details, please refer to the Qubit product page here.
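
The standard-then-sample workflow described above can be sketched as a calibration followed by a dilution correction. The code assumes a linear fluorescence response between two standards and a fixed assay volume of 200 µL; both are illustrative assumptions, and the instrument's own curve fitting may differ. All names and numbers are hypothetical.

```python
from typing import Callable


def linear_calibration(
    low_standard: tuple[float, float],
    high_standard: tuple[float, float],
) -> Callable[[float], float]:
    """Build a fluorescence -> concentration function from two (concentration, fluorescence) standards."""
    (conc_low, fluo_low), (conc_high, fluo_high) = low_standard, high_standard
    if fluo_high == fluo_low:
        raise ValueError("standards must give different fluorescence readings")
    slope = (conc_high - conc_low) / (fluo_high - fluo_low)
    return lambda fluorescence: conc_low + slope * (fluorescence - fluo_low)


def stock_concentration(
    assay_concentration: float,
    sample_volume_ul: float,
    assay_volume_ul: float = 200.0,
) -> float:
    """Correct for dilution of the sample into the assay tube."""
    return assay_concentration * assay_volume_ul / sample_volume_ul


# Hypothetical readings: standards at 0 and 10 concentration units.
to_concentration = linear_calibration((0.0, 50.0), (10.0, 1050.0))
in_tube = to_concentration(550.0)
print(in_tube, stock_concentration(in_tube, sample_volume_ul=2.0))
```

The dilution correction matters because the dye reads the concentration in the assay tube, not in the original library.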

Agilent Technologies 2100 Bioanalyzer

The Bioanalyzer is used to check the size distribution of the library before the sequencing reaction. It is a way to verify that the fragment sizes selected for during sample library preparation are present. The Bioanalyzer consists of a machine that reads gel chips containing diluted samples in the wells. The chips work on the same principle as agarose gels, but in a smaller format. There are specific details for libraries prepared from DNA or RNA, but the protocols for both are very similar. The first step is to introduce the gel into the chip and pressurize it; this will evenly distribute the gel in the chip, minimizing errors in machine analysis later on. Once complete, markers, ladders, and samples (either diluted or undiluted) are loaded onto the chip. There may be additional reagents needed depending on the kit requirements. The chip is then vortexed for one minute at 2400 rpm before it is loaded onto the Bioanalyzer. The machine will monitor each well for sample; this is visualized with peaks on a graph. The location of the peaks will indicate the markers and the sample size distribution of the library, while the peak height shows the amount of fragments at a specific size. For further details, please refer to the Agilent DNA and RNA analysis kit as well as the Agilent Technologies 2100 Bioanalyzer product pages found here and here.
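
The size distribution and the fluorometric concentration are often combined to express a library in molar terms. The sketch below is illustrative only: it assumes the trace has been exported as hypothetical (size in bp, signal) pairs, takes a signal-weighted mean size, and uses the common approximation of 660 g/mol per base pair of double-stranded DNA.

```python
def mean_fragment_size(peaks: list[tuple[float, float]]) -> float:
    """Signal-weighted mean fragment size from (size_bp, signal) pairs."""
    total_signal = sum(signal for _, signal in peaks)
    if total_signal <= 0:
        raise ValueError("trace contains no signal")
    return sum(size * signal for size, signal in peaks) / total_signal


def library_molarity_nm(
    concentration_ng_per_ul: float,
    mean_size_bp: float,
    grams_per_mol_per_bp: float = 660.0,
) -> float:
    """Convert a mass concentration (ng/µL) to nM for double-stranded DNA."""
    return concentration_ng_per_ul * 1_000_000 / (grams_per_mol_per_bp * mean_size_bp)


# Hypothetical trace centred on 400 bp and a 2 ng/µL fluorometric reading.
trace = [(300.0, 1.0), (400.0, 2.0), (500.0, 1.0)]
size = mean_fragment_size(trace)
print(size, library_molarity_nm(2.0, size))
```

Because the conversion depends on fragment length, two libraries with the same mass concentration can differ considerably in molarity.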

Downstream Applications

Once the sample library has been prepared, validated, and sequenced, there are various applications for the data output.

Whole Genome Sequencing

There are two downstream applications for WGS:

1) De novo sequencing: creation of a full-length genome by assembling the reads generated from the sequencing reaction (4). This is useful if there is no reference sequence against which the output can be aligned.
2) Alignment of the output sequence with a reference sequence (5). Sequencing reads are aligned to a reference sequence of interest using different computer algorithms for the following applications:
- single nucleotide polymorphism detection
- small insertion/deletion mutations
- genome variation such as chromosome translocations and inversions
- gene annotation
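
To illustrate how aligned reads support variant detection, the toy sketch below builds a per-position pileup and reports positions where the reads consistently disagree with the reference. It assumes reads are already aligned without gaps, and the depth and fraction thresholds are arbitrary illustrative values; real aligners and variant callers are far more elaborate.

```python
from collections import Counter


def pileup(reference_length: int, aligned_reads: list[tuple[int, str]]) -> list[Counter]:
    """Count observed bases at each reference position from (start, sequence) reads."""
    columns = [Counter() for _ in range(reference_length)]
    for start, sequence in aligned_reads:
        for offset, base in enumerate(sequence):
            position = start + offset
            if 0 <= position < reference_length:
                columns[position][base] += 1
    return columns


def call_snvs(
    reference: str,
    aligned_reads: list[tuple[int, str]],
    min_depth: int = 3,
    min_fraction: float = 0.8,
) -> list[tuple[int, str, str, int]]:
    """Return (position, reference base, observed base, depth) for well-supported mismatches."""
    variants = []
    for position, counts in enumerate(pileup(len(reference), aligned_reads)):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to make a call at this position
        base, support = counts.most_common(1)[0]
        if base != reference[position] and support / depth >= min_fraction:
            variants.append((position, reference[position], base, depth))
    return variants


reference = "ACGTACGTAC"
reads = [(0, "ACGTTCG"), (1, "CGTTCGT"), (2, "GTTCGTA"), (3, "TTCGTAC")]
print(call_snvs(reference, reads))
```

The depth threshold in this sketch is where coverage enters the analysis: positions covered by too few reads are simply not called.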

RNA-Sequencing

There are two downstream applications for RNA-Seq:

1) De novo transcriptome analysis: generating a transcriptome by assembling the reads generated from the sequencing reaction (6). This is ideal when studying a non-model organism for which there is no reference genome and funds are insufficient to perform de novo whole genome sequencing (7).
2) Alignment of the output sequence with a reference sequence. By aligning the sequencing reads to a reference sequence, variants in the transcriptome or a portion of the transcriptome can be detected. This includes the following items:
- alternatively spliced genes
- non-coding RNAs
- gene expression profile (8)
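
A gene expression profile is usually built by normalizing the number of reads aligned to each gene. The sketch below uses transcripts per million (TPM), one common normalization chosen here purely for illustration; the gene names, counts and lengths are hypothetical.

```python
def tpm(counts: dict[str, int], lengths_bp: dict[str, int]) -> dict[str, float]:
    """Transcripts per million: length-normalize counts, then scale so values sum to one million."""
    rates = {gene: counts[gene] / lengths_bp[gene] for gene in counts}
    total_rate = sum(rates.values())
    if total_rate == 0:
        return {gene: 0.0 for gene in counts}
    return {gene: rate / total_rate * 1_000_000 for gene, rate in rates.items()}


counts = {"geneA": 100, "geneB": 300}
lengths = {"geneA": 1000, "geneB": 3000}
print(tpm(counts, lengths))
```

In this example geneB receives three times as many reads only because it is three times longer, so both genes end up with the same normalized expression value.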

Exome-Sequencing

Application of the output sequence generated from Exome-Seq is limited to alignment with a reference sequence. This is done to detect variations in the exome, such as coding variants and the variants underlying Mendelian disorders (9) (10).

Methyl-Seq

Application of the output generated from Methyl-Seq is limited to alignment with a reference sequence. The methylation pattern across the sequence is then used to study features such as DNA-protein interactions and cell lineages (9) (11).

References

1. Sequencing depth and coverage: key considerations in genomic analyses. Sims, D., et al. 2, 2014, Nature Reviews Genetics, Vol. 15, pp. 121-132.
2. Noninvasive identification and monitoring of cancer mutations by targeted deep sequencing of plasma DNA. Forshew, T., et al. 136, 2012, Science Translational Medicine, Vol. 4.
3. Library construction for next-generation sequencing: Overviews and challenges. Head, S.R., et al. 2, 2014, BioTechniques, Vol. 56, pp. 61-77.
4. De novo sequencing of plant genomes using second-generation technologies. Imelfort, M. and Edwards, D. 6, 2009, Briefings in Bioinformatics, Vol. 10, pp. 609-618.
5. Chapter 7 - Whole genome sequencing: new technologies, approaches, and applications. Mardis, E.M. In: G.S. Ginsburg and H.F. Willard (eds.), Genomic and Personalized Medicine (Second Edition). Waltham: Academic Press, 2013, pp. 87-93.
6. Rapid transcriptome characterization for a nonmodel organism using 454 pyrosequencing. Vera, J.C., et al. 7, 2008, Molecular Ecology, Vol. 17, pp. 1636-1647.
7. Bioinformatics challenges in de novo transcriptome assembly using short read sequences in the absence of a reference genome sequence. Góngora-Castillo, E. and Buell, C.R. 4, 2013, Natural Product Reports, Vol. 30, pp. 490-500.
8. RNA sequencing: advances, challenges and opportunities. Ozsolak, F. and Milos, P.M. 2, 2011, Nature Reviews Genetics, Vol. 12, pp. 87-98.
9. Exome array analysis identifies novel loci and low-frequency variants for insulin processing and secretion. Huyghe, J.R., et al. 2, 2013, Nature Genetics, Vol. 45, pp. 197-201.
10. Exome sequencing identifies the cause of a Mendelian disorder. Ng, S.B., et al. 1, 2010, Nature Genetics, Vol. 42, pp. 30-35.
11. Epigenetic restriction of embryonic cell lineage fate by methylation of Elf5. Ng, R.K., et al. 11, 2008, Nature Cell Biology, Vol. 10.
