Whole genome sequencing
Whole genome sequencing, also known as full genome sequencing or just genome sequencing, is the process of determining the entirety of the DNA sequence of an organism's genome at a single time. This entails sequencing all of an organism's chromosomal DNA as well as DNA contained in the mitochondria and, for plants, in the chloroplast.
Whole genome sequencing has largely been used as a research tool, but began to be introduced to clinics in 2014. In the future of personalized medicine, whole genome sequence data may be an important tool to guide therapeutic intervention. Gene sequencing at the SNP level is also used to pinpoint functional variants from association studies and to improve the knowledge available to researchers interested in evolutionary biology, and hence may lay the foundation for predicting disease susceptibility and drug response.
Whole genome sequencing should not be confused with DNA profiling, which only determines the likelihood that genetic material came from a particular individual or group, and does not contain additional information on genetic relationships, origin or susceptibility to specific diseases. In addition, whole genome sequencing should not be confused with methods that sequence specific subsets of the genome – such methods include whole exome sequencing or SNP genotyping.
History
The DNA sequencing methods used in the 1970s and 1980s were manual; for example, Maxam–Gilbert sequencing and Sanger sequencing. Several whole bacteriophage and animal viral genomes were sequenced by these techniques, but the shift to more rapid, automated sequencing methods in the 1990s facilitated the sequencing of the larger bacterial and eukaryotic genomes.
The first virus to have its complete genome sequenced was bacteriophage MS2, by 1976. In 1992, yeast chromosome III was the first chromosome of any organism to be fully sequenced. The first organism whose entire genome was fully sequenced was Haemophilus influenzae in 1995. After it, the genomes of other bacteria and some archaea were first sequenced, largely due to their small genome size. H. influenzae has a genome of 1,830,140 base pairs of DNA. In contrast, eukaryotes, both unicellular and multicellular such as Amoeba dubia and humans respectively, have much larger genomes. Amoeba dubia has a genome of 700 billion nucleotide pairs spread across thousands of chromosomes. Humans contain fewer nucleotide pairs than A. dubia; however, their genome size far outweighs the genome size of individual bacteria.
The first bacterial and archaeal genomes, including that of H. influenzae, were sequenced by shotgun sequencing. In 1996, the first eukaryotic genome was sequenced. S. cerevisiae, a model organism in biology, has a genome of only around 12 million nucleotide pairs, and was the first unicellular eukaryote to have its whole genome sequenced. The first multicellular eukaryote, and animal, to have its whole genome sequenced was the nematode worm Caenorhabditis elegans in 1998. Eukaryotic genomes are sequenced by several methods, including shotgun sequencing of short DNA fragments and sequencing of larger DNA clones from DNA libraries such as bacterial artificial chromosomes and yeast artificial chromosomes.
In 1999, the entire DNA sequence of human chromosome 22, the second shortest human autosome, was published. By the year 2000, the second animal and second invertebrate genome was sequenced – that of the fruit fly Drosophila melanogaster – a popular choice of model organism in experimental research. The first plant genome – that of the model organism Arabidopsis thaliana – was also fully sequenced by 2000. By 2001, a draft of the entire human genome sequence was published. The genome of the laboratory mouse Mus musculus was completed in 2002.
In 2004, the Human Genome Project published an incomplete version of the human genome. In 2008, a group from Leiden, the Netherlands, reported the sequencing of the first female human genome.
Currently, thousands of genomes have been wholly or partially sequenced.
Experimental details
Cells used for sequencing
Almost any biological sample containing a full copy of the DNA (even a very small amount of DNA or ancient DNA) can provide the genetic material necessary for full genome sequencing. Such samples may include saliva, epithelial cells, bone marrow, hair, seeds, plant leaves, or anything else that has DNA-containing cells.
The genome sequence of a single cell selected from a mixed population of cells can be determined using techniques of single cell genome sequencing. This has important advantages in environmental microbiology in cases where a single cell of a particular microorganism species can be isolated from a mixed population by microscopy on the basis of its morphological or other distinguishing characteristics. In such cases the normally necessary steps of isolation and growth of the organism in culture may be omitted, thus allowing the sequencing of a much greater spectrum of organism genomes.
Single cell genome sequencing is being tested as a method of preimplantation genetic diagnosis, wherein a cell from the embryo created by in vitro fertilization is taken and analyzed before embryo transfer into the uterus. After implantation, cell-free fetal DNA can be taken by simple venipuncture from the mother and used for whole genome sequencing of the fetus.
Early techniques
Sequencing of nearly an entire human genome was first accomplished in 2000 partly through the use of shotgun sequencing technology. While full genome shotgun sequencing for small genomes was already in use in 1979, broader application benefited from pairwise end sequencing, known colloquially as double-barrel shotgun sequencing. As sequencing projects began to take on longer and more complicated genomes, multiple groups began to realize that useful information could be obtained by sequencing both ends of a fragment of DNA. Although sequencing both ends of the same fragment and keeping track of the paired data was more cumbersome than sequencing a single end of two distinct fragments, the knowledge that the two sequences were oriented in opposite directions and were about the length of a fragment apart from each other was valuable in reconstructing the sequence of the original target fragment.
The first published description of the use of paired ends was in 1990 as part of the sequencing of the human HPRT locus, although the use of paired ends was limited to closing gaps after the application of a traditional shotgun sequencing approach. The first theoretical description of a pure pairwise end sequencing strategy, assuming fragments of constant length, was in 1991. In 1995, the innovation of using fragments of varying sizes was introduced, and demonstrated that a pure pairwise end-sequencing strategy would be possible on large targets. The strategy was subsequently adopted by The Institute for Genomic Research to sequence the entire genome of the bacterium Haemophilus influenzae in 1995, and then by Celera Genomics to sequence the entire fruit fly genome in 2000, and subsequently the entire human genome. Applied Biosystems, now called Life Technologies, manufactured the automated capillary sequencers utilized by both Celera Genomics and The Human Genome Project.
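The geometric constraint that makes paired ends useful (two reads facing each other, roughly one fragment length apart) can be illustrated with a toy simulation. The following is a minimal sketch only: the function names, the fragment sizes, and the random test genome are hypothetical choices made for illustration, and real library preparation and sequencing introduce errors and size variation that are ignored here.

```python
import random

COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def reverse_complement(seq):
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def paired_end_reads(fragment, read_len):
    """Read both ends of one fragment; the second read comes from the opposite strand."""
    forward = fragment[:read_len]
    reverse = reverse_complement(fragment[-read_len:])
    return forward, reverse

def sample_pairs(genome, n_pairs, frag_sizes, read_len, seed=0):
    """Draw fragments of varying (assumed) sizes and record each read pair."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        size = rng.choice(frag_sizes)
        start = rng.randrange(0, len(genome) - size + 1)
        fwd, rev = paired_end_reads(genome[start:start + size], read_len)
        pairs.append({"start": start, "insert_size": size, "fwd": fwd, "rev": rev})
    return pairs

# Hypothetical 2,000-base random genome, used only to exercise the functions.
genome = "".join(random.Random(42).choice("ACGT") for _ in range(2000))
pairs = sample_pairs(genome, n_pairs=5, frag_sizes=[200, 500], read_len=25)
for p in pairs:
    print(p["start"], p["insert_size"], p["fwd"], p["rev"])
```

Because each pair records an approximate insert size, an assembler that places the forward read also learns roughly where, and on which strand, its mate must fall. This is the information that helps bridge gaps and repeats.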
Current techniques
While capillary sequencing was the first approach to successfully sequence a nearly full human genome, it is still too expensive and takes too long for commercial purposes. Since 2005, capillary sequencing has been progressively displaced by high-throughput sequencing technologies such as Illumina dye sequencing, pyrosequencing, and SMRT sequencing. All of these technologies continue to employ the basic shotgun strategy, namely, parallelization and template generation via genome fragmentation.
Other technologies have emerged, including nanopore technology. Though the sequencing accuracy of nanopore technology is lower than that of the technologies above, its read length is on average much longer. This generation of long reads is valuable especially in de novo whole-genome sequencing applications.
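The basic shotgun strategy mentioned above (fragment the genome, read the pieces independently, then reconstruct the sequence from overlaps) can be sketched with a toy greedy overlap assembler. This is an illustrative simplification, not the algorithm used by any particular platform or assembler: the 30-base genome, the fixed read length and step, and the function names are all hypothetical, and the reads are assumed to be error-free.

```python
import random

def fragment(genome, read_len, step):
    """Cut overlapping reads at a fixed step (idealized; real fragmentation is random)."""
    return [genome[i:i + read_len] for i in range(0, len(genome) - read_len + 1, step)]

def overlap(a, b, min_len):
    """Length of the longest suffix of a that is a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def greedy_assemble(reads, min_len=5):
    """Repeatedly merge the pair of sequences with the largest suffix/prefix overlap."""
    contigs = list(dict.fromkeys(reads))  # drop duplicate reads, keep order
    while len(contigs) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            break  # no remaining overlaps: leave separate contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs

# Hypothetical toy genome, chosen so that every 5-base substring is unique.
genome = "ATGCGTACCTGAAGTCTTAGGCACGATCCA"
reads = fragment(genome, read_len=10, step=5)
random.Random(0).shuffle(reads)  # the assembler is given no positional information
contigs = greedy_assemble(reads, min_len=5)
print(contigs)
```

The toy genome is deliberately free of repeats. In real genomes, repeated sequence produces ambiguous overlaps, which is one reason the long reads and paired ends discussed in this article are valuable for de novo assembly.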
Analysis
In principle, full genome sequencing can provide the raw nucleotide sequence of an individual organism's DNA at a single point in time. However, further analysis must be performed to provide the biological or medical meaning of this sequence, such as how this knowledge can be used to help prevent disease. Methods for analyzing sequencing data are being developed and refined.
Because sequencing generates a lot of data, its output is stored electronically and requires a large amount of computing power and storage capacity.
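To give a rough sense of the data volumes involved, the following back-of-the-envelope sketch estimates the size of uncompressed raw reads. The figures are assumptions for illustration only: 30-fold coverage, about two bytes per sequenced base (one character for the base call and one for its quality score, ignoring read headers and compression), and a human genome of roughly 3.2 billion base pairs. The H. influenzae size is the one given earlier in this article.

```python
def raw_read_bytes(genome_bp, coverage, bytes_per_base=2):
    """Approximate uncompressed size of raw reads (assumed: one base char + one quality char)."""
    return genome_bp * coverage * bytes_per_base

# Genome sizes in base pairs; the human figure and the 30x coverage are assumptions.
genomes = {
    "H. influenzae": 1_830_140,
    "human (approx.)": 3_200_000_000,
}

for name, size_bp in genomes.items():
    gigabytes = raw_read_bytes(size_bp, coverage=30) / 1e9
    print(f"{name}: about {gigabytes:.2f} GB of raw reads at 30x coverage")
```

Actual storage needs depend heavily on the platform, file format, and compression used, so these numbers indicate order of magnitude only.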
While analysis of WGS data can be slow, it is possible to speed up this step by using dedicated hardware.
Commercialization
A number of public and private companies are competing to develop a full genome sequencing platform that is commercially robust for both research and clinical use, including Illumina, Knome, Sequenom, 454 Life Sciences, Pacific Biosciences, Complete Genomics, Helicos Biosciences, GE Global Research, Affymetrix, IBM, Intelligent Bio-Systems, Life Technologies, Oxford Nanopore Technologies, and the Beijing Genomics Institute. These companies are heavily financed and backed by venture capitalists, hedge funds, and investment banks.
A commonly referenced commercial target for sequencing cost until the late 2010s was US$1,000; however, private companies are working to reach a new target of only US$100.