Sequence assembly
In bioinformatics, sequence assembly refers to aligning and merging fragments from a longer DNA sequence in order to reconstruct the original sequence. This is needed as DNA sequencing technology might not be able to 'read' whole genomes in one go, but rather reads small pieces of between 20 and 30,000 bases, depending on the technology used. Typically, the short fragments result from shotgun sequencing genomic DNA, or gene transcript.
The problem of sequence assembly can be compared to taking many copies of a book, passing each of them through a shredder with a different cutter, and piecing the text of the book back together just by looking at the shredded pieces. Besides the obvious difficulty of this task, there are some extra practical issues: the original may have many repeated paragraphs, and some shreds may be modified during shredding to have typos. Excerpts from another book may also be added in, and some shreds may be completely unrecognizable.
Types
There are three approaches to assembling sequencing data:- De-novo: assembling sequencing reads to create full-length sequences, without using a template
- Mapping/Aligning: assembling reads by aligning reads against a template. The assembled consensus may not be identical to the template.
- Reference-guided: grouping of reads by similarity to the most similar region within the reference. Reads within each group are then shortened down to mimic short reads quality. A typical method to do so is the k-mer approach. Reference-guided assembly is most useful using long-reads.
Assemblies
Genome
The first sequence assemblers began to appear in the late 1980s and early 1990s as variants of simpler sequence alignment programs to piece together vast quantities of fragments generated by automated sequencing instruments called DNA sequencers. As the sequenced organisms grew in size and complexity, the assembly programs used in these genome projects needed increasingly sophisticated strategies to handle:- terabytes of sequencing data which need processing on computing clusters;
- identical and nearly identical sequences which can, in the worst case, increase the time and space complexity of algorithms quadratically;
- DNA read errors in the fragments from the sequencing instruments, which can confound assembly.
EST
or EST assembly was an early strategy, dating from the mid-1990s to the mid-2000s, to assemble individual genes rather than whole genomes. The problem differs from genome assembly in several ways. The input sequences for EST assembly are fragments of the transcribed mRNA of a cell and represent only a subset of the whole genome. A number of algorithmical problems differ between genome and EST assembly. For instance, genomes often have large amounts of repetitive sequences, concentrated in the intergenic regions. Transcribed genes contain many fewer repeats, making assembly somewhat easier. On the other hand, some genes are expressed in very high numbers, which means that unlike whole-genome shotgun sequencing, the reads are not uniformly sampled across the genome.EST assembly is made much more complicated by features like alternative splicing, trans-splicing, single-nucleotide polymorphism, and post-transcriptional modification. Beginning in 2008 when RNA-Seq was invented, EST sequencing was replaced by this far more efficient technology, described under de novo transcriptome assembly.
De-novo vs mapping assembly
In terms of complexity and time requirements, de-novo assemblies are orders of magnitude slower and more memory intensive than mapping assemblies. This is mostly due to the fact that the assembly algorithm needs to compare every read with every other read. Current de-novo genome assemblers may use different types of graph-based algorithms, such as the:- Overlap/Layout/Consensus approach, which was typical of the Sanger-data assemblers and relies on an overlap graph;
- de Bruijn Graph approach, which is most widely applied to the short reads from the Solexa and SOLiD platforms. It relies on K-mer graphs, which performs well with vast quantities of short reads;
- Greedy graph-based approach, which may also use one of the OLC or DBG approaches. With greedy graph-based algorithms, the contigs, series of reads aligned together, grow by greedy extension, always taking on the read that is found by following the highest-scoring overlap.
Handling repeats in de-novo assembly requires the construction of a graph representing neighboring repeats. Such information can be derived from reading a long fragment covering the repeats in full or only its two ends. On the other hand, in a mapping assembly, parts with multiple or no matches are usually left for another assembling technique to look into.
Technological advances
The complexity of sequence assembly is driven by two major factors: the number of fragments and their lengths. While more and longer fragments allow better identification of sequence overlaps, they also pose problems as the underlying algorithms show quadratic or even exponential complexity behaviour to both number of fragments and their length. And while shorter sequences are faster to align, they also complicate the layout phase of an assembly as shorter reads are more difficult to use with repeats or near identical repeats.In the earliest days of DNA sequencing, scientists could only gain a few sequences of short length after weeks of work in laboratories. Hence, these sequences could be aligned in a few minutes by hand.
In 1975, the dideoxy termination method was invented and until shortly after 2000, the technology was improved up to a point where fully automated machines could churn out sequences in a highly parallelised mode 24 hours a day. Large genome centers around the world housed complete farms of these sequencing machines, which in turn led to the necessity of assemblers to be optimised for sequences from whole-genome shotgun sequencing projects where the reads
- are about 800–900 bases long
- contain sequencing artifacts like sequencing and cloning vectors
- have error rates between 0.5 and 10%
By 2004 / 2005, pyrosequencing had been brought to commercial viability by 454 Life Sciences. This new sequencing method generated reads much shorter than those of Sanger sequencing: initially about 100 bases, now 400–500 bases. Its much higher throughput and lower cost pushed the adoption of this technology by genome centers, which in turn pushed development of sequence assemblers that could efficiently handle the read sets. The sheer amount of data coupled with technology-specific error patterns in the reads delayed development of assemblers; at the beginning in 2004 only the Newbler assembler from 454 was available. Released in mid-2007, the hybrid version of the MIRA assembler by Chevreux et al. was the first freely available assembler that could assemble 454 reads as well as mixtures of 454 reads and Sanger reads. Assembling sequences from different sequencing technologies was subsequently coined hybrid assembly.
From 2006, the Illumina technology has been available and can generate about 100 million reads per run on a single sequencing machine. Compare this to the 35 million reads of the human genome project which needed several years to be produced on hundreds of sequencing machines. Illumina was initially limited to a length of only 36 bases, making it less suitable for de novo assembly, but newer iterations of the technology achieve read lengths above 100 bases from both ends of a 3–400bp clone. Announced at the end of 2007, the SHARCGS assembler by Dohm et al. was the first published assembler that was used for an assembly with Solexa reads. It was quickly followed by a number of others.
Later, new technologies like SOLiD from Applied Biosystems, Ion Torrent and SMRT were released and new technologies continue to emerge. Despite the higher error rates of these technologies they are important for assembly because their longer read length helps to address the repeat problem. It is impossible to assemble through a perfect repeat that is longer than the maximum read length; however, as reads become longer the chance of a perfect repeat that large becomes small. This gives longer sequencing reads an advantage in assembling repeats even if they have low accuracy.