Longest common subsequence
A longest common subsequence is the longest subsequence common to all sequences in a set of sequences. It differs from the longest common substring: unlike substrings, subsequences are not required to occupy consecutive positions within the original sequences. The problem of computing longest common subsequences is a classic computer science problem. Because it is polynomial and has an efficient algorithm to solve it, it is employed to compare data and merge changes to files in programs such as the
diff utility and revision control systems such as Git. It has similar applications in computational linguistics and bioinformatics.For example, consider the sequences and. They have five length-2 common subsequences:,,,, and ; two length-3 common subsequences: and ; and no longer common subsequences. So and are their longest common subsequences.
Complexity
For the general case of an arbitrary number of input sequences, the problem is NP-hard. When the number of sequences is constant, the problem is solvable in polynomial time by dynamic programming.Given sequences of lengths, a naive search would test each of the subsequences of the first sequence to determine whether they are also subsequences of the remaining sequences; each subsequence may be tested in time linear in the lengths of the remaining sequences, so the time for this algorithm would be
For the case of two sequences of n and m elements, the running time of the dynamic programming approach is O. For an arbitrary number of input sequences, the dynamic programming approach gives a solution in
There exist methods with lower complexity,
which often depend on the length of the LCS, the size of the alphabet, or both.
The LCS is not necessarily unique; in the worst case, the number of common subsequences is exponential in the lengths of the inputs, so the algorithmic complexity of listing all common subsequences must be at least exponential.
Solution for two sequences
The LCS problem has an optimal substructure: the problem can be broken down into smaller, simpler subproblems, which can, in turn, be broken down into simpler subproblems, and so on, until, finally, the solution becomes trivial. LCS in particular has overlapping subproblems: the solutions to high-level subproblems often reuse solutions to lower level subproblems. Problems with these two properties are amenable to dynamic programming approaches, in which subproblem solutions are memoized, that is, the solutions of subproblems are saved for reuse.Prefixes
The prefix Sn of S is defined as the first n characters of S. For example, the prefixes of S = areLet LCS be a function that computes a longest subsequence common to X and Y. Such a function has two interesting properties.
First property
LCS = LCS^''A, for all strings X'', Y and all symbols A, where ^ denotes string concatenation. This allows one to simplify the LCS computation for two sequences ending in the same symbol.For example, LCS = LCS^"A", Continuing for the remaining common symbols, LCS = LCS^"ANA".
Second property
If A and B are distinct symbols, then LCS is one of the maximal-length strings in the set, for all strings X, Y.For example,
LCS is the longest string among LCS and LCS; if both happened to be of equal length, one of them could be chosen arbitrarily.
To realize the property, distinguish two cases:
- If LCS ends with a "G", then the final "K" cannot be in the LCS, hence LCS = LCS.
- If LCS does not end with a "G", then the final "G" cannot be in the LCS, hence LCS = LCS.
''LCS'' function defined
Let two sequences be defined as follows: and. The prefixes of are ; the prefixes of are. Let represent the set of longest common subsequence of prefixes and. This set of sequences is given by the following.To find the LCS of and, compare and. If they are equal, then the sequence is extended by that element,. If they are not equal, then the longest among the two sequences,, and, is retained. The base case, when either or is empty, is the empty string,.
Worked example
The longest subsequence common to R =, and C = will be found. Because the LCS function uses a "zeroth" element, it is convenient to define zero prefixes that are empty for these sequences: R0 = ε; and C0 = ε. All the prefixes are placed in a table with C in the first row and R in the first column.| ε | A | G | C | A | T | |
| ε | ε | ε | ε | ε | ε | ε |
| G | ε | |||||
| A | ε | |||||
| C | ε |
This table is used to store the LCS sequence for each step of the calculation. The second column and second row have been filled in with ε, because when an empty sequence is compared with a non-empty sequence, the longest common subsequence is always an empty sequence.
LCS is determined by comparing the first elements in each sequence. G and A are not the same, so this LCS gets the longest of the two sequences, LCS and LCS. According to the table, both of these are empty, so LCS is also empty, as shown in the table below. The arrows indicate that the sequence comes from both the cell above, LCS and the cell on the left, LCS.
LCS is determined by comparing G and G. They match, so G is appended to the upper left sequence, LCS, which is, giving, which is.
For LCS, G and C do not match. The sequence above is empty; the one to the left contains one element, G. Selecting the longest of these, LCS is. The arrow points to the left, since that is the longest of the two sequences.
LCS, likewise, is.
LCS, likewise, is.
| ε | A | G | C | A | T | |
| ε | ε | ε | ε | ε | ε | ε |
| G | ε | ε | ||||
| A | ε | |||||
| C | ε |
For LCS, A is compared with A. The two elements match, so A is appended to ε, giving.
For LCS, A and G do not match, so the longest of LCS, which is, and LCS, which is, is used. In this case, they each contain one element, so this LCS is given two subsequences: and.
For LCS, A does not match C. LCS contains sequences and ; LCS is, which is already contained in LCS. The result is that LCS also contains the two subsequences, and.
For LCS, A matches A, which is appended to the upper left cell, giving.
For LCS, A does not match T. Comparing the two sequences, and, the longest is, so LCS is.
| ε | A | G | C | A | T | |
| ε | ε | ε | ε | ε | ε | ε |
| G | ε | ε | ||||
| A | ε | & | & | |||
| C | ε |
For LCS, C and A do not match, so LCS gets the longest of the two sequences,.
For LCS, C and G do not match. Both LCS and LCS have one element. The result is that LCS contains the two subsequences, and.
For LCS, C and C match, so C is appended to LCS, which contains the two subsequences, and, giving and.
For LCS, C and A do not match. Combining LCS, which contains and, and LCS, which contains, gives a total of three sequences:,, and.
Finally, for LCS, C and T do not match. The result is that LCS also contains the three sequences,,, and.
| ε | A | G | C | A | T | |
| ε | ε | ε | ε | ε | ε | ε |
| G | ε | ε | ||||
| A | ε | & | & | |||
| C | ε | & | & | & & | & & |
The final result is that the last cell contains all the longest subsequences common to and ; these are,, and. The table also shows the longest common subsequences for every possible pair of prefixes. For example, for and, the longest common subsequence are and.
Traceback approach
Calculating the LCS of a row of the LCS table requires only the solutions to the current row and the previous row. Still, for long sequences, these sequences can get numerous and long, requiring a lot of storage space. Storage space can be saved by saving not the actual subsequences, but the length of the subsequence and the direction of the arrows, as in the table below.| ε | A | G | C | A | T | |
| ε | 0 | 0 | 0 | 0 | 0 | 0 |
| G | 0 | 0 | 1 | 1 | 1 | 1 |
| A | 0 | 1 | 1 | 1 | 2 | 2 |
| C | 0 | 1 | 1 | 2 | 2 | 2 |
The actual subsequences are deduced in a "traceback" procedure that follows the arrows backwards, starting from the last cell in the table. When the length decreases, the sequences must have had a common element. Several paths are possible when two arrows are shown in a cell. Below is the table for such an analysis, with numbers colored in cells where the length is about to decrease. The bold numbers trace out the sequence,.
| ε | A | G | C | A | T | |
| ε | 0 | 0 | 0 | 0 | 0 | 0 |
| G | 0 | 0 | 1 | 1 | 1 | 1 |
| A | 0 | 1 | 1 | 1 | 2 | 2 |
| C | 0 | 1 | 1 | 2 | 2 | 2 |
Relation to other problems
For two strings and, the length of the shortest common supersequence is related to the length of the LCS byThe edit distance when only insertion and deletion is allowed, or when the cost of the substitution is the double of the cost of an insertion or deletion, is:
Code for the dynamic programming solution
Computing the length of the LCS
The function below takes as input sequencesX and Y, computes the LCS between X and Y for all 1 ≤ i ≤ m and 1 ≤ j ≤ n, and stores it in C. C will contain the length of the LCS of X and Y.function LCSLength
C = array
for i := 0..m
C = 0
for j := 0..n
C = 0
for i := 1..m
for j := 1..n
if X = Y
C := C + 1
else
C := max
return C
Alternatively, memoization could be used.
Reading out a LCS
The following function backtracks the choices taken when computing theC table. If the last characters in the prefixes are equal, they must be in an LCS. If not, check what gave the largest LCS of keeping and, and make the same choice. Just choose one if they were equally long. Call the function with i=m and j=n.function backtrack
if i = 0 or j = 0
return ""
if X = Y
return backtrack + X
if C > C
return backtrack
return backtrack
Reading out all LCSs
If choosing and would give an equally long result, read out both resulting subsequences. This is returned as a set by this function. Notice that this function is not polynomial, as it might branch in almost every step if the strings are similar.function backtrackAll
if i = 0 or j = 0
return
if X = Y
return
R :=
if C ≥ C
R := backtrackAll
if C ≥ C
R := R ∪ backtrackAll
return R
Print the diff
This function will backtrack through the C matrix, and print the diff between the two sequences. Notice that you will get a different answer if you exchange≥ and <, with > and ≤ below.function printDiff
if i >= 0 and j >= 0 and X = Y
printDiff
print " " + X
else if j > 0 and
printDiff
print "+ " + Y
else if i > 0 and
printDiff
print "- " + X
else
print ""
Example
Let be “XMJYAUZ” and be “MZJAWXU”. The longest common subsequence between and is “MJAU”. The table C shown below, which is generated by the function LCSLength, shows the lengths of the longest common subsequences between prefixes of and. The th row and th column shows the length of the LCS between and.The highlighted numbers show the path the function
backtrack would follow from the bottom right to the top left corner, when reading out an LCS. If the current symbols in and are equal, they are part of the LCS, and we go both up and left. If not, we go up or left, depending on which cell has a higher number. This corresponds to either taking the LCS between and, or and.Code optimization
Several optimizations can be made to the algorithm above to speed it up for real-world cases.Reduce the problem set
The C matrix in the naive algorithm grows quadratically with the lengths of the sequences. For two 100-item sequences, a 10,000-item matrix would be needed, and 10,000 comparisons would need to be done. In most real-world cases, especially source code diffs and patches, the beginnings and ends of files rarely change, and almost certainly not both at the same time. If only a few items have changed in the middle of the sequence, the beginning and end can be eliminated. This reduces not only the memory requirements for the matrix, but also the number of comparisons that must be done.function LCS
start := 1
m_end := m
n_end := n
trim off the matching items at the beginning
while start ≤ m_end and start ≤ n_end and X = Y
start := start + 1
trim off the matching items at the end
while start ≤ m_end and start ≤ n_end and X = Y
m_end := m_end - 1
n_end := n_end - 1
C = array
only loop over the items that have changed
for i := start..m_end
for j := start..n_end
the algorithm continues as before...
In the best-case scenario, a sequence with no changes, this optimization would eliminate the need for the C matrix. In the worst-case scenario, a change to the very first and last items in the sequence, only two additional comparisons are performed.
Reduce the comparison time
Most of the time taken by the naive algorithm is spent performing comparisons between items in the sequences. For textual sequences such as source code, you want to view lines as the sequence elements instead of single characters. This can mean comparisons of relatively long strings for each step in the algorithm. Two optimizations can be made that can help to reduce the time these comparisons consume.Reduce strings to hashes
A hash function or checksum can be used to reduce the size of the strings in the sequences. That is, for source code where the average line is 60 or more characters long, the hash or checksum for that line might be only 8 to 40 characters long. Additionally, the randomized nature of hashes and checksums would guarantee that comparisons would short-circuit faster, as lines of source code will rarely be changed at the beginning.There are three primary drawbacks to this optimization. First, an amount of time needs to be spent beforehand to precompute the hashes for the two sequences. Second, additional memory needs to be allocated for the new hashed sequences. However, in comparison to the naive algorithm used here, both of these drawbacks are relatively minimal.
The third drawback is that of collisions. Since the checksum or hash is not guaranteed to be unique, there is a small chance that two different items could be reduced to the same hash. This is unlikely in source code, but it is possible. A cryptographic hash would therefore be far better suited for this optimization, as its entropy is going to be significantly greater than that of a simple checksum. However, the benefits may not be worth the setup and computational requirements of a cryptographic hash for small sequence lengths.