Whole Genome Sequencing (WGS)

Next Generation Sequencing (NGS) is a useful tool for determining DNA sequence, information that is valuable in furthering our understanding of biological processes. NGS is flexible and can be applied in many situations, from the exome to small RNAs. This flexibility means that there are parameters that need to be considered before running an NGS experiment. This section outlines the applications of whole genome sequencing.

What is Whole Genome Sequencing?

Whole Genome Sequencing is a powerful technology that is rapidly gaining popularity as it becomes more and more affordable. With whole genome sequencing you can:

  • Assemble genomes de novo,
  • Compare the genome of your organism to a reference genome,
  • Explore the molecular evolution of a species or population,
  • Accurately track pathogen outbreaks,
  • Search for disease-causing mutations,
  • Sequence the whole genome of an individual cell,
  • And more.

WGS begins with library preparation, which is explained in the NGS Experimental Design article of our knowledge base. Briefly, there are three methods for library preparation:

  • Standard: for 100-200 ng of gDNA
  • PCR-free: to avoid PCR-induced biases
  • Tagmentation: for large and complex genomes

Libraries are sequenced on a next-generation platform, such as Illumina sequencing-by-synthesis technology, because it provides significantly higher coverage per run than other platforms.
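Since coverage drives the choice of platform and run size, it can help to estimate it before sequencing. The sketch below uses the usual approximation that average coverage equals the total number of sequenced bases divided by the genome size. The function names and the example numbers are illustrative assumptions, not values from any study discussed here.

```python
import math


def expected_coverage(num_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Average depth: total sequenced bases divided by genome size."""
    if genome_size_bp <= 0:
        raise ValueError("genome_size_bp must be positive")
    return num_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Smallest number of reads that reaches the target average depth."""
    if read_length_bp <= 0:
        raise ValueError("read_length_bp must be positive")
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


# Hypothetical run: 1,000,000 reads of 100 bp over a 5 Mb genome.
print(expected_coverage(1_000_000, 100, 5_000_000))

# Hypothetical target: 30x over a 3 Mb genome with 150 bp reads.
print(reads_needed(30, 150, 3_000_000))
```

This is an average only; real coverage is uneven, and reads lost to quality filtering reduce the usable depth.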

Applications of WGS: Case Studies

De novo Genome Assembly

Generally, a genome is assembled from NGS data by aligning reads to a reference genome. For most organisms, however, this is not possible because no reference is available. In these cases de novo genome assembly is performed.

When assembling a genome without a reference, it is essential to have a way to correlate sequences over long ranges; otherwise the assembly will be inaccurate. There are two ways to achieve this. Often a library of fosmids is constructed and sequenced by Sanger sequencing in parallel with NGS. While sequencing the fosmids does not provide the coverage of NGS, it gives very long reads and so allows the correlation of sequences that are far from each other. The drawback is that the large amount of Sanger sequencing required is both expensive and time-consuming. More recently, a new approach has been taken that relies entirely on Illumina’s short sequencing reads. Here many libraries are generated, some with very long insert lengths. The libraries then undergo paired-end sequencing to generate mate-pair sequences: two short reads of known sequence separated by a gap of known length but unknown sequence. These mate pairs allow long-range correlation between sequences and therefore a more accurate genome assembly.
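To see why a gap of known length is useful, consider a mate pair whose two reads land on different contigs. Because the library insert size is known, the distance between the contigs can be estimated. The sketch below is a simplified illustration, not the method of any particular assembler. It assumes both contigs are in the same orientation, contig A lies upstream of contig B, and read positions have already been normalized so the fragment runs from the start of read 1 to the end of read 2. All names and numbers are hypothetical.

```python
from statistics import median


def estimate_gap(insert_size_bp: int, contig_a_len: int,
                 read1_start: int, read2_end: int) -> int:
    """Estimate the gap between contig A and contig B from one mate pair.

    read1_start: 0-based start of read 1 on contig A.
    read2_end:   0-based exclusive end of read 2 on contig B.
    """
    bases_on_a = contig_a_len - read1_start  # fragment bases lying on contig A
    bases_on_b = read2_end                   # fragment bases lying on contig B
    return insert_size_bp - bases_on_a - bases_on_b


def consensus_gap(insert_size_bp: int, contig_a_len: int, pairs) -> float:
    """Median gap estimate over several (read1_start, read2_end) pairs."""
    gaps = [estimate_gap(insert_size_bp, contig_a_len, s, e) for s, e in pairs]
    return median(gaps)


# Hypothetical 5 kb library linking a 12 kb contig to its neighbour.
pairs = [(10_000, 1_500), (11_000, 2_400), (9_500, 1_100)]
print(consensus_gap(5_000, 12_000, pairs))
```

Taking the median over many pairs dampens the effect of insert-size variation within a library, which is why several linking pairs are normally required before two contigs are joined.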

This approach was used recently by an international collaboration to sequence the genome of the flax plant (Linum usitatissimum). Flax is an important crop for both food and textile production. Sequencing its genome will help agronomists develop better varieties and better understand the domestication of this crop. The authors generated seven libraries with insert lengths ranging from 300 bp to 10 kb and sequenced them using paired-end Illumina sequencing (Figure 1). This generated mate-pair and paired-end reads with 44-100 bp of known sequence and a spacer of defined length (Figure 1). The use of mate-pair reads with thousands of bases between them allowed sequences to be aligned over long ranges, enhancing the accuracy of the assembly.

Figure 1 – For flax genome assembly, libraries of 300 bp to 10 kb were prepared. These were sequenced as paired-end reads.

The first step in assembly was to remove low-quality reads, after which the coverage was determined to be 69x. After filtering, the reads were aligned to each other to generate 116,602 contigs (Figure 2). The contigs were further aligned to generate 88,384 scaffolds; the 132 largest scaffolds, each longer than 693.5 kb, together contained 50% of the assembly (Figure 2). The longest scaffold was 3.09 Mb. The assembly was found to represent 85% of the genome with an average of 45x coverage.
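The statement that a set of the largest scaffolds contains 50% of the assembly corresponds to what are commonly called the N50 (the length threshold) and L50 (the number of scaffolds) statistics. A minimal sketch of how these are computed from a list of scaffold lengths follows. The lengths used are toy values, not the flax data.

```python
def n50_l50(lengths):
    """Return (N50, L50) for a collection of scaffold or contig lengths.

    N50: length of the shortest scaffold in the smallest set of largest
         scaffolds that together cover at least half of the assembly.
    L50: number of scaffolds in that set.
    """
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= half_total:
            return length, count
    raise AssertionError("unreachable: running total always reaches half")


# Toy assembly with five scaffolds (lengths in bp).
n50, l50 = n50_l50([100, 80, 60, 40, 20])
print(n50, l50)
```

In these terms, the flax assembly figures quoted above read as an L50 of 132 scaffolds and an N50 of 693.5 kb.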

Figure 2 – Reads are aligned to make contigs, and contigs are then aligned to make scaffolds.

Expressed sequence tags (ESTs) are short sequences obtained from cDNA libraries. They represent expressed regions of the genome and so can be used to find genes. In this study the known ESTs of flax were aligned to the scaffolds. Ninety-three percent of the flax ESTs aligned to the WGS scaffolds with >95% sequence identity, indicating that the assembled genes were highly accurate. This study also performed many other analyses, including a comparison of the assembled flax genome to the genomes of other plants. For more information please see the original paper.
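The identity check described above can be illustrated with a small sketch that scores already-aligned sequence pairs and reports the fraction exceeding a threshold. It assumes the alignment has been produced by a separate tool, and it counts identity over all aligned columns, gaps included; other definitions of percent identity exist. The function names and the toy sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of aligned columns in which both sequences carry the same base."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Skip columns where both sequences have a gap.
    columns = [(a, b) for a, b in zip(aligned_a, aligned_b)
               if not (a == gap and b == gap)]
    if not columns:
        return 0.0
    matches = sum(1 for a, b in columns if a == b and a != gap)
    return 100.0 * matches / len(columns)


def fraction_passing(alignments, threshold: float = 95.0) -> float:
    """Fraction of (est, scaffold) aligned pairs above the identity threshold."""
    passing = sum(1 for a, b in alignments if percent_identity(a, b) > threshold)
    return passing / len(alignments)


# Toy aligned pairs: one perfect match and one with a single mismatch.
toy = [("ACGTACGT", "ACGTACGT"), ("ACGTACGT", "ACGTACGA")]
print(fraction_passing(toy))
```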

Pathogen Tracking

Whole Genome Sequencing can also be used to track pathogen outbreaks. At present, the gold standard for analyzing strains of pathogenic bacteria is pulsed-field gel electrophoresis (PFGE), which compares the banding patterns of genomes digested by a selected restriction enzyme. This approach is limited, however, because significant mutations are easily hidden when they do not affect the restriction sites or the relative sizes of the genomic fragments. At the same time, a single nucleotide mutation can result in the gain or loss of a restriction site and so give a different PFGE pattern between closely related strains. A proof-of-concept study by Revez et al. investigated WGS as a replacement for PFGE.
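Both limitations of PFGE can be reproduced with a toy in silico digest. The sketch below cuts a short linear sequence at every EcoRI site (GAATTC, cut after the G) and compares fragment sizes. The sequences are invented for illustration, real bacterial genomes are circular and far larger, and the enzyme is an arbitrary choice rather than the one used in the study.

```python
def digest(seq: str, site: str = "GAATTC", cut_offset: int = 1):
    """Fragment lengths (largest first) after cutting a linear sequence.

    cut_offset is the cut position within the recognition site
    (1 for EcoRI, which cuts between G and AATTC).
    """
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    bounds = [0] + cuts + [len(seq)]
    fragments = [b - a for a, b in zip(bounds, bounds[1:]) if b > a]
    return sorted(fragments, reverse=True)


def mutate(seq: str, pos: int, base: str) -> str:
    """Return seq with a single-base substitution at 0-based position pos."""
    return seq[:pos] + base + seq[pos + 1:]


# Toy 50 bp "genome" with two EcoRI sites.
reference = "A" * 10 + "GAATTC" + "C" * 20 + "GAATTC" + "T" * 8

hidden_snp = mutate(reference, 20, "G")  # SNP between the sites
site_snp = mutate(reference, 38, "C")    # SNP inside the second site

print(digest(reference))   # banding pattern of the reference
print(digest(hidden_snp))  # same pattern: the mutation is invisible
print(digest(site_snp))    # different pattern from a single base change
```

The first mutation changes the sequence without changing any fragment size, while the second destroys a restriction site and merges two fragments, so two strains differing by one base would look unrelated on a gel.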

Campylobacter jejuni is among the leading causes of foodborne illness in the world. It is naturally found in the guts of birds and cows. Humans are most likely to become infected by ingesting contaminated water. C. jejuni infection is debilitating but rarely fatal. In this study, samples from a Campylobacter jejuni outbreak in Europe in 2012 were reanalyzed by WGS and the results compared with the conclusions drawn from standard methodologies, to determine whether WGS has a similar or enhanced ability to track pathogen source and evolution during an outbreak.

Based on the PFGE patterns observed during the outbreak, it was concluded that there was a contamination event involving one strain and one water source. However, WGS revealed that this was not the case (Figure 3). Of the two human isolates shown here, one was found to be highly similar to the waterborne strain. The other human isolate is highly divergent, too much so to be the result of genetic drift during the course of the outbreak. In light of the WGS data, the authors conclude that either a single source of water was contaminated by multiple divergent strains or there were multiple sources of contamination. These results highlight the importance of more accurate WGS data during pathogen outbreaks, since conventional methodology misidentified a patient strain, potentially missing other sources of contamination.
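One simple way to express how similar or divergent isolates are is to count single-nucleotide differences between their aligned genome sequences. The sketch below builds a pairwise distance table for three toy sequences. It assumes the sequences are already aligned to equal length, and the isolate names and sequences are hypothetical stand-ins, not data from the outbreak.

```python
from itertools import combinations


def snp_distance(seq_a: str, seq_b: str, ignore: str = "N-") -> int:
    """Count positions where two aligned sequences differ.

    Positions where either sequence has an ambiguous base or a gap are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b)
               if a != b and a not in ignore and b not in ignore)


def distance_table(isolates: dict) -> dict:
    """Pairwise SNP distances keyed by (name, name), names in sorted order."""
    return {(m, n): snp_distance(isolates[m], isolates[n])
            for m, n in combinations(sorted(isolates), 2)}


# Hypothetical aligned fragments from a water isolate and two patient isolates.
isolates = {
    "water":     "ACGTACGTAC",
    "patient_a": "ACGTACGTAT",
    "patient_b": "TCGAACCTGC",
}
for pair, dist in distance_table(isolates).items():
    print(pair, dist)
```

In this toy data one patient isolate sits one SNP from the water isolate while the other is several SNPs away, mirroring the kind of pattern that led the authors to reject a single-strain explanation.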

Figure 3 – The relationships between strains determined by WGS were more accurate than those that could be observed by PFGE. For example, IHV116260 and 6237/12 are indistinguishable by PFGE, but WGS revealed that they are highly divergent.

Molecular Evolution

Whole genome sequencing is also an essential tool for studying molecular evolution. The study discussed here used WGS to examine the molecular evolution of the Ithaca, New York honeybee population in response to the introduction of the mite Varroa destructor. Specimens collected in 2010 were compared to museum specimens collected in 1977, before the introduction of the mite. Honeybees, Apis mellifera, are essential to human agriculture. Both feral and domestic populations exist in North America. Honeybees are a eusocial species; each colony contains a sexually mature queen bee, a few thousand haploid males and tens of thousands of sterile female worker bees (Figure 4). The mite Varroa destructor feeds on the hemolymph of the adult worker bees, weakening them and making them more susceptible to disease (Figure 4). It has been associated with colony collapse.

Figure 4 – Varroa destructor feeds on the hemolymph of the adult worker bees.

The authors found a drastic loss of mitochondrial haplotypes between 1977 and 2010, with an entire clade going extinct. This loss indicates a population bottleneck upon the introduction of Varroa destructor. However, they found no decrease in nuclear genetic diversity. This finding indicates that the modern population is descended from a small number of queens through high rates of outbreeding and polyandry. The ancestry of the modern bee population is similar to that of the museum bees, with a few differences: the modern bees carry traces of African and Arabian ancestry that were absent in the museum bees. The authors also found some genes under selection pressure in the modern population relative to the museum bees, which may play a role in resistance to the Varroa destructor parasite. For more details please see the original paper. This study demonstrates that WGS is a powerful tool for studying the molecular evolution of a population over time.
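A loss of mitochondrial haplotypes can be quantified with haplotype diversity, commonly computed as H = n/(n-1) × (1 − Σ pᵢ²), where pᵢ is the frequency of haplotype i among n sampled individuals. The sketch below applies this formula to two invented samples. The haplotype labels and counts are hypothetical, and this is not necessarily the statistic the authors used.

```python
from collections import Counter


def haplotype_diversity(haplotypes) -> float:
    """Haplotype diversity: n/(n-1) * (1 - sum of squared haplotype frequencies)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("at least two sampled individuals are required")
    counts = Counter(haplotypes)
    sum_sq = sum((c / n) ** 2 for c in counts.values())
    return n / (n - 1) * (1 - sum_sq)


# Hypothetical samples: three evenly represented haplotypes before a
# bottleneck, and a population dominated by one haplotype afterwards.
before = ["A", "A", "B", "B", "C", "C"]
after = ["A", "A", "A", "A", "A", "B"]

print(round(haplotype_diversity(before), 3))
print(round(haplotype_diversity(after), 3))
```

A sharp drop in this value between an older and a newer sample is the kind of signal that points to a bottleneck in the maternal lineages.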
