Exome sequencing


Exome sequencing, also known as whole exome sequencing (WES), is a genomic technique for sequencing all of the protein-coding regions of genes in a genome. It consists of two steps. The first step is to select only the subset of DNA that encodes proteins. These regions are known as exons; humans have about 180,000 exons, constituting about 1% of the human genome, or approximately 30 million base pairs. The second step is to sequence the exonic DNA using any high-throughput DNA sequencing technology.
The goal of this approach is to identify genetic variants that alter protein sequences, and to do this at a much lower cost than whole-genome sequencing. Since these variants can be responsible for both Mendelian and common polygenic diseases, such as Alzheimer's disease, whole exome sequencing has been applied both in academic research and as a clinical diagnostic.
[Figure: Exome sequencing workflow, part 2.]

Motivation and comparison to other approaches

Exome sequencing is especially effective in the study of rare Mendelian diseases, because it is an efficient way to identify the genetic variants in all of an individual's genes. These diseases are most often caused by very rare genetic variants that are only present in a tiny number of individuals; by contrast, techniques such as SNP arrays can only detect shared genetic variants that are common to many individuals in the wider population. Furthermore, because severe disease-causing variants are much more likely to be in the protein coding sequence, focusing on this 1% costs far less than whole genome sequencing but still detects a high yield of relevant variants.
In the past, clinical genetic tests were either chosen based on the clinical presentation of the patient or surveyed only certain types of variation, and they provided definitive genetic diagnoses in fewer than half of all patients. Exome sequencing is now increasingly used to complement these other tests, both to find mutations in genes already known to cause disease and to identify novel genes by comparing exomes from patients with similar features.

Technical methodology

Step 1: Target-enrichment strategies

Target-enrichment methods allow one to selectively capture genomic regions of interest from a DNA sample prior to sequencing. Several target-enrichment strategies have been developed since the original description of the direct genomic selection method in 2005.
Though many techniques have been described for targeted capture, only a few of these have been extended to capture entire exomes. The first target enrichment strategy to be applied to whole exome sequencing was the array-based hybrid capture method in 2007, but in-solution capture has gained popularity in recent years.

Array-based capture

Microarrays contain single-stranded oligonucleotides with sequences from the human genome, fixed to the surface, that tile the region of interest. Genomic DNA is sheared to form double-stranded fragments. The fragments undergo end-repair to produce blunt ends, and adaptors with universal priming sequences are added. These fragments are hybridized to oligos on the microarray.
Unhybridized fragments are washed away and the desired fragments are eluted. The fragments are then amplified using PCR.
Roche NimbleGen was the first to take the original direct genomic selection (DGS) technology and adapt it for next-generation sequencing. They developed the Sequence Capture Human Exome 2.1M Array to capture ~180,000 coding exons. This method is both time-saving and cost-effective compared to PCR-based methods. The Agilent Capture Array and the comparative genomic hybridization array are other methods that can be used for hybrid capture of target sequences. Limitations of this technique include the need for expensive hardware as well as a relatively large amount of DNA.

In-solution capture

To capture genomic regions of interest using in-solution capture, a pool of custom oligonucleotides (probes) is synthesized and hybridized in solution to a fragmented genomic DNA sample. The probes, which are labeled with beads, selectively hybridize to the genomic regions of interest, after which the beads (now carrying the DNA fragments of interest) can be pulled down and washed to clear excess material. The beads are then removed and the genomic fragments can be sequenced, allowing for selective DNA sequencing of the genomic regions of interest.
This method was developed to improve on the hybridization capture target-enrichment method. In solution capture, there is an excess of probes to target regions of interest over the amount of template required. The optimal target size is about 3.5 megabases, which yields excellent sequence coverage of the target regions. The preferred method depends on several factors, including the number of base pairs in the region of interest, the demands for reads on target, and the equipment available in house.
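The "reads on target" criterion can be made concrete with a small calculation. The following is a minimal sketch, assuming reads and capture targets are given as half-open (start, end) intervals on a single chromosome; the function names and the toy coordinates are hypothetical, and real pipelines usually compute this from alignment files with dedicated tools.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Extend the previous interval instead of starting a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def on_target_fraction(reads, targets):
    """Fraction of reads that overlap any target interval by at least 1 bp."""
    if not reads:
        return 0.0
    merged = merge_intervals(targets)
    starts = [start for start, _ in merged]
    hits = 0
    for read_start, read_end in reads:
        # Index of the last merged target that starts before the read ends.
        i = bisect_right(starts, read_end - 1) - 1
        if i >= 0 and merged[i][1] > read_start:
            hits += 1
    return hits / len(reads)


# Toy data (hypothetical coordinates, single chromosome).
targets = [(100, 200), (150, 300), (1000, 1100)]
reads = [(90, 140), (300, 350), (1050, 1150), (500, 600)]
fraction = on_target_fraction(reads, targets)
print(f"on-target fraction: {fraction:.2f}")
```

Merging the targets first means each read needs only one binary search, which matters when an exome design contains on the order of 180,000 target intervals.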

Step 2: Sequencing

There are many next-generation sequencing (NGS) platforms available, postdating classical Sanger sequencing methodologies. They include the Roche 454 sequencer, the Life Technologies SOLiD and Ion Torrent systems, and Illumina's Genome Analyzer II and subsequent MiSeq, HiSeq, and NovaSeq series instruments, all of which can be used for massively parallel exome sequencing. These 'short read' NGS systems are particularly well suited to analysing many relatively short stretches of DNA sequence, as found in human exons.

Comparison with other technologies

There are multiple technologies available that identify genetic variants. Each technology has advantages and disadvantages in terms of technical and financial factors. Two such technologies are microarrays and whole-genome sequencing.

Microarray-based genotyping

Microarrays use hybridization probes to test the prevalence of known DNA sequences; thus they cannot be used to identify unexpected genetic changes. In contrast, the high-throughput sequencing technologies used in exome sequencing directly provide the nucleotide sequences of DNA at the thousands of exonic loci tested. Hence, whole exome sequencing (WES) addresses some of the present limitations of hybridization genotyping arrays.
Although exome sequencing is more expensive than hybridization-based technologies on a per-sample basis, its cost has been decreasing due to the falling cost and increased throughput of whole genome sequencing.

Whole-genome sequencing

Exome sequencing is only able to identify those variants in the coding regions of genes that affect protein function. It is not able to identify the structural and non-coding variants associated with a disease, which can be found using other methods such as whole genome sequencing. About 99% of the human genome is not covered by exome sequencing. In return, exome sequencing allows the targeted portion of the genome to be sequenced in at least 20 times as many samples as whole genome sequencing. Translating identified rare variants into the clinic depends on sample size and on the ability to interpret the results well enough to provide a clinical diagnosis; with current knowledge in genetics, there are reports of exome sequencing being used to assist diagnosis. The cost of exome sequencing is typically lower than that of whole genome sequencing.
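The trade-off between target size and sample count can be illustrated with back-of-the-envelope arithmetic. This is only a sketch: the genome size of about 3 billion base pairs follows from the exome being roughly 30 million base pairs and 1% of the genome, but the mean depths (30x and 100x) and the 70% on-target fraction are illustrative assumptions, not figures from the text, and actual costs also include capture reagents and library preparation.

```python
def raw_bases_needed(target_bp, mean_depth, on_target_fraction=1.0):
    """Raw sequenced bases needed to reach mean_depth over target_bp.

    on_target_fraction is the share of sequenced bases that land on the
    target; it is 1.0 when no capture step is involved.
    """
    if not 0 < on_target_fraction <= 1:
        raise ValueError("on_target_fraction must be in (0, 1]")
    return target_bp * mean_depth / on_target_fraction


GENOME_BP = 3_000_000_000  # ~3 Gb, implied by exome = 30 Mb = ~1% of genome
EXOME_BP = 30_000_000      # ~30 Mb

# Hypothetical depths and capture efficiency, chosen only for illustration.
wgs_bases = raw_bases_needed(GENOME_BP, mean_depth=30)
wes_bases = raw_bases_needed(EXOME_BP, mean_depth=100, on_target_fraction=0.7)

samples_ratio = wgs_bases / wes_bases
print(f"exomes per genome-equivalent of sequencing: {samples_ratio:.1f}")
```

Under these assumed values, the sequencing output of one genome covers roughly 21 exomes, even though the exome is sequenced more deeply and capture is imperfect.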

Data analysis

The statistical analysis of the large quantity of data generated from sequencing approaches is a challenge. Even by only sequencing the exomes of individuals, a large quantity of data and sequence information is generated which requires a significant amount of data analysis. Challenges associated with the analysis of this data include changes in programs used to align and assemble sequence reads. Various sequencing technologies also have different error rates and generate various read-lengths which can pose challenges in comparing results from different sequencing platforms.
False positive and false negative findings are associated with genomic resequencing approaches and are critical issues. A few strategies have been developed to improve the quality of exome data such as:
  • Comparing the genetic variants identified between sequencing and array-based genotyping
  • Comparing the coding SNPs to a whole genome sequenced individual with the disorder
  • Comparing the coding SNPs with Sanger sequencing of HapMap individuals
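The first strategy in the list, comparing sequencing calls against array-based genotypes, amounts to a concordance calculation. Below is a minimal sketch, assuming each data set is a dictionary from a (chromosome, position) site to an unphased genotype string; the function name, the data layout, and the example genotypes are all hypothetical.

```python
def genotype_concordance(seq_calls, array_calls):
    """Compare genotypes at sites called by both methods.

    Returns (number of shared sites, number of matching genotypes,
    concordance rate). Genotypes are unphased, so "AG" equals "GA".
    """
    shared = seq_calls.keys() & array_calls.keys()
    matches = sum(
        1
        for site in shared
        if sorted(seq_calls[site]) == sorted(array_calls[site])
    )
    rate = matches / len(shared) if shared else float("nan")
    return len(shared), matches, rate


# Toy genotype calls (hypothetical sites and alleles).
seq_calls = {
    ("chr1", 1000): "AG",
    ("chr1", 2000): "CC",
    ("chr2", 500): "TT",
    ("chr3", 42): "GA",
}
array_calls = {
    ("chr1", 1000): "GA",
    ("chr1", 2000): "CT",
    ("chr2", 500): "TT",
    ("chr7", 9): "AA",
}

n_shared, n_match, rate = genotype_concordance(seq_calls, array_calls)
print(f"{n_match}/{n_shared} shared sites concordant ({rate:.2%})")
```

Only sites present in both data sets are compared, so the rate says nothing about variants that the array does not probe.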
Rare recessive disorders may not have single nucleotide polymorphisms in public databases such as dbSNP. More common recessive phenotypes would be more likely to have disease-causing variants reported in dbSNP. For example, the most common cystic fibrosis variant has an allele frequency of about 3% in most populations. Screening out such variants might erroneously exclude such genes from consideration. Genes for recessive disorders are usually easier to identify than those for dominant disorders because the genes are less likely to have more than one rare nonsynonymous variant. The system that screens common genetic variants relies on dbSNP, which may not have accurate information about the variation of alleles. Using lists of common variation from a study exome or a genome-wide sequenced individual would be more reliable. A challenge in this approach is that as the number of exomes sequenced increases, the number of uncommon variants in dbSNP will also increase. It will be necessary to develop thresholds to define the common variants that are unlikely to be associated with a disease phenotype.
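The effect of the frequency threshold can be sketched in a few lines. This is an illustration under stated assumptions: the `Variant` record, the variant names, and all frequencies except the 3% figure from the text are hypothetical, and `None` stands for a variant that is absent from the reference database.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Variant:
    name: str
    population_af: Optional[float]  # None = not present in the database


def filter_rare(variants, max_af):
    """Keep variants that are unreported or rarer than max_af."""
    return [
        v
        for v in variants
        if v.population_af is None or v.population_af < max_af
    ]


# Hypothetical candidate list for one patient.
variants = [
    Variant("novel_missense", None),
    Variant("recessive_carrier_variant", 0.03),  # ~3%, as in the example above
    Variant("common_polymorphism", 0.25),
    Variant("rare_known", 0.0005),
]

strict = [v.name for v in filter_rare(variants, max_af=0.01)]
lenient = [v.name for v in filter_rare(variants, max_af=0.05)]
print("kept at 1% threshold:", strict)
print("kept at 5% threshold:", lenient)
```

With a 1% cutoff the 3% variant is discarded along with the common polymorphism, which is the kind of erroneous exclusion described above; a 5% cutoff retains it at the price of a longer candidate list.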
Genetic heterogeneity and population ethnicity are also major limitations, as they may increase the number of false positive and false negative findings, which makes the identification of candidate genes more difficult. It is possible to reduce the stringency of the thresholds in the presence of heterogeneity and ethnicity, but this will also reduce the power to detect variants. Using a genotype-first approach to identify candidate genes might also offer a solution to overcome these limitations.
Unlike common variant analysis, the analysis of rare variants in whole-exome sequencing studies evaluates variant sets rather than single variants. Functional annotations predict the effect or function of rare variants and help prioritize rare functional variants. Incorporating these annotations can effectively boost the power of rare variant association analysis in sequencing studies. Some methods and tools have been developed to perform functionally-informed rare variant association analysis by incorporating functional annotations to empower analysis in whole exome sequencing studies.
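One simple way to evaluate a variant set rather than single variants is a weighted burden score, in which each sample's rare-allele counts within a gene are summed with annotation-derived weights. The sketch below illustrates that general idea only; it does not reproduce any specific published method or tool, and the weights and genotypes are hypothetical.

```python
def burden_scores(genotypes, weights):
    """Weighted burden score per sample for one variant set (e.g. one gene).

    genotypes: one list per sample with minor-allele counts (0, 1 or 2)
               for each variant in the set.
    weights:   one weight per variant, e.g. derived from a functional
               annotation score (hypothetical values here).
    """
    if any(len(row) != len(weights) for row in genotypes):
        raise ValueError("each sample needs one genotype per variant")
    return [sum(w * g for w, g in zip(weights, row)) for row in genotypes]


def mean(values):
    return sum(values) / len(values)


# Three rare variants in one gene; the third is weighted as non-functional.
weights = [1.0, 0.5, 0.0]
cases = [[1, 0, 0], [0, 1, 1], [1, 1, 0]]
controls = [[0, 0, 1], [0, 0, 0], [0, 1, 0]]

case_scores = burden_scores(cases, weights)
control_scores = burden_scores(controls, weights)
difference = mean(case_scores) - mean(control_scores)
print("case scores:", case_scores)
print("control scores:", control_scores)
print(f"mean difference: {difference:.3f}")
```

A real analysis would test the association between burden and phenotype statistically, typically adjusting for covariates, rather than comparing raw means.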