Spaced seed
In bioinformatics, a spaced seed is a pattern of relevant and irrelevant positions in a biosequence and a method of approximate string matching that allows for substitutions. They are a straightforward modification to the earliest heuristic-based alignment efforts that allow for minor differences between the sequences of interest. Spaced seeds have been used in homology search., alignment, assembly, and metagenomics. They are usually represented as a sequence of zeroes and ones, where a one indicates relevance and a zero indicates irrelevance at the given position. Some visual representations use pound signs for relevant and dashes or asterisks for irrelevant positions.
Principle
Due to a number of functional and evolutionary constraints, nucleic acid sequences between individuals tend to be highly conserved, with the typical difference between two human genomes estimated on the order of 0.6%. Identification of highly similar regions in the genome may indicate functional importance, as mutations in these areas that would result in cessation of function or loss of regulatory ability would be evolutionary unfavorable. More observed differences between two sequences may arise as a result of stochastic sequencing errors. Similarly, when performing assembly of a previously characterized genome, an attempt is made to align the newly sequenced DNA fragments to the existing genome sequence.In both cases, it is useful to be able to directly compare nucleic acid sequences. Since the sequences are not expected to be exactly identical, however, it is beneficial to focus on smaller subsequences that are more likely to be locally identical. Spaced seeds allow for even more permissive local matching by allowing certain base pairs to mismatch without penalty, thus allowing algorithms that use the general "hit-extend" strategy of alignment to explore additional potential matches that would be otherwise ignored.
Example
As a simple example, consider the following DNA sequences:CTAAGTCACG
CTAACACACG
1111001111
Upon visual inspection, it's easy to see that there is a mismatch between the two sequences at the fifth and six base positions. However, the sequences still share 80% sequence similarity. The mismatches may be due to a real change or a sequencing error. In a non-spaced model, this putative match would be ignored if a seed size greater than 4 is specified. But a spaced seed of could be used to effectively zero-weighting the mismatch sites, treating the sequences as same for the purposes of hit identification. In reality, of course, we don't know the relative positioning of the "true" mismatches, so there can be different spaced seed patterns depending on where the mismatches are anticipated.
History
The concept of spaced seeds has been explored in literature under different names. One of the early uses was in sequence homology where the FLASH algorithm from 1993 referred to it as "non-contiguous sub-sequences of tokens" that were generated from all combinations of positions within a sequence window. In 1995, a similar concept was used in approximate string matching where "gapped tuples" of positions in a sequence were explored to identify common substrings between a large text and a query. The term "shape" was used in a 2001 paper to describe gapped q-grams where it refers to a set of relevant positions in a substring and soon after in 2002, PatternHunter introduced "spaced model" which was proposed as an improvement upon the consecutive seeds used in BLAST, which was ultimately adopted by newer versions of it. Finally, in 2003, PatternHunter II settled on the term "spaced seed" to refer to the approach used in PatternHunterSpaced versus Non-Spaced Models
Popular alignment algorithms such as BLAST and MegaBLAST use a non-spaced model, where the entire length of the seed is made of exact matches. Thus, any mismatching base pair along the length of the seed will result in the program ignoring the potential hit. In a spaced model, the matches are not necessarily consecutive.More formally, the difference in a spaced seed model as compared to a non-spaced model is the relative positioning of the matched bases. In a non-spaced model, the length of the seed model, and the weight, of the seeds are the same, as they must be consecutive while in a spaced model, the weight is not necessarily equal to the length of the seed model, since match positions may be non-consecutive. Therefore, a spaced seed model may be longer than a non-spaced seed model but have the same weight. For example, a non-spaced seed has the same weight as a spaced model, but their lengths differ.
The predicted number of hits can be calculated from PatternHunter using the following lemma:
Where is the length of the sequence the model is compared to, is the length of the seed model, is the probability of a match and is the weight of the seed used.