Lcs
The longest common subsequence (LCS) problem asks for the longest subsequence that is common to a set of sequences, where a subsequence is a sequence that can be derived from another sequence by deleting some elements without changing the order of the remaining elements. Suppose we are given two strings: string S of length n and string T of length m. Our goal is to produce their longest common subsequence, i.e. the longest sequence of characters that appear left-to-right (but not necessarily in a contiguous block) in both strings. For example, consider:

S = ABAZDC
T = BACBAD

In this case, the LCS has length 4 and is the string ABAD. Another way to look at it is that we are to find a one-to-one matching between some of the letters in S and some of the letters in T such that none of the edges in the matching cross each other. This type of problem arises all the time in genomics: given two DNA fragments, the LCS gives information about what they have in common and the best way to line them up.

We will now solve the LCS problem using Dynamic Programming. As subproblems, we will look at the LCS of a prefix of S and a prefix of T, running over all pairs of prefixes. For simplicity, let's first consider finding the length of the LCS; afterwards we can modify the algorithm to produce the actual sequence itself. So, here is the question: say LCS[i, j] is the length of the LCS of S[1..i] with T[1..j]. How can we solve for LCS[i, j] in terms of the LCSs of the smaller problems?
Case 1: what if S[i] ≠ T[j]? Then the desired subsequence has to ignore one of S[i] or T[j], so we have: LCS[i, j] = max(LCS[i - 1, j], LCS[i, j - 1]).
Case 2: what if S[i] = T[j]? Then the LCS of S[1..i] and T[1..j] might as well match them up. For instance, if someone gave you a common subsequence that matched S[i] to an earlier location in T, you could always match it to T[j] instead. So, in this case we have: LCS[i, j] = 1 + LCS[i - 1, j - 1].
So we can just do two loops (over values of i and j), filling in the LCS table using these rules, with LCS[i, 0] = LCS[0, j] = 0 as the base case for empty prefixes.
Here's what it looks like pictorially for the example above, with S along the leftmost column and T along the top row.

      B  A  C  B  A  D
   A  0  1  1  1  1  1
   B  1  1  1  2  2  2
   A  1  2  2  2  3  3
   Z  1  2  2  2  3  3
   D  1  2  2  2  3  4
   C  1  2  3  3  3  4
We just fill out this matrix row by row, doing a constant amount of work per entry, so this takes O(mn) time overall. The final answer (the length of the LCS of S and T) is in the lower-right corner.
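The two cases and the row-by-row fill can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function name `lcs_table` is hypothetical, and the code uses 0-based string indexing, so `s[i - 1]` plays the role of S[i] in the text. Row 0 and column 0 hold the base case for empty prefixes.

```python
def lcs_table(s, t):
    """Return the (len(s)+1) x (len(t)+1) table of LCS lengths.

    table[i][j] is the LCS length of the first i characters of s
    and the first j characters of t.
    """
    n, m = len(s), len(t)
    # Row 0 and column 0 stay 0: an empty prefix has an empty LCS.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                # Case 2: the last characters match, so pair them up.
                table[i][j] = 1 + table[i - 1][j - 1]
            else:
                # Case 1: drop the last character of s or of t.
                table[i][j] = max(table[i - 1][j], table[i][j - 1])
    return table


table = lcs_table("ABAZDC", "BACBAD")
print(table[-1][-1])  # lower-right corner: 4
```

The doubly nested loop does constant work per cell, which is where the O(mn) bound comes from.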
How can we now find the sequence? To find the sequence, we just walk backwards through the matrix starting from the lower-right corner. If either the cell directly above or the cell directly to the left contains a value equal to the value in the current cell, then move to that cell (if both do, then choose either one). If both such cells have values strictly less than the value in the current cell, then move diagonally up-left (this corresponds to applying Case 2), and output the associated character. This will output the characters in the LCS in reverse order. For instance, running on the matrix above, this outputs DABA.
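The backward walk can be sketched in Python as below. This is an illustrative sketch under a few assumptions: the function name `lcs` is hypothetical, the table is rebuilt inside the function so the example is self-contained, and ties are broken by preferring the cell above over the cell to the left (the text allows either choice).

```python
def lcs(s, t):
    """Return one longest common subsequence of s and t."""
    n, m = len(s), len(t)
    # Fill the table of LCS lengths using the two cases of the recurrence.
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s[i - 1] == t[j - 1]:
                table[i][j] = 1 + table[i - 1][j - 1]
            else:
                table[i][j] = max(table[i - 1][j], table[i][j - 1])

    # Walk backwards from the lower-right corner.
    i, j = n, m
    reversed_chars = []
    while i > 0 and j > 0:
        if table[i - 1][j] == table[i][j]:
            i -= 1  # cell above has the same value: move up
        elif table[i][j - 1] == table[i][j]:
            j -= 1  # cell to the left has the same value: move left
        else:
            # Both neighbours are strictly smaller, so Case 2 applied here.
            reversed_chars.append(s[i - 1])
            i -= 1
            j -= 1
    # Characters were collected last-to-first, so reverse them.
    return "".join(reversed(reversed_chars))


print(lcs("ABAZDC", "BACBAD"))  # ABAD
```

On the running example the walk emits D, A, B, A in that order, and reversing gives ABAD. A different tie-breaking rule may return a different subsequence of the same length when more than one LCS exists.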
More about LCS: Discussion and Extensions. An equivalent problem to LCS is the minimum edit distance problem, where the legal operations are insert and delete (e.g. the Unix diff command, where S and T are files, and the elements of S and T are lines of text). The minimum edit distance to transform S into T is achieved by doing |S| - LCS(S, T) deletes and |T| - LCS(S, T) inserts. In computational biology applications, one usually has a more general notion of sequence alignment. Many of these different problems allow for basically the same kind of Dynamic Programming solution.
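The relationship between LCS and insert/delete edit distance can be checked with a short Python sketch. The names `lcs_length` and `indel_distance` are hypothetical. As a variation not described in the text, `lcs_length` keeps only the previous row of the table, which is enough when just the length is needed.

```python
def lcs_length(s, t):
    """Length of the LCS of s and t, keeping only one previous table row."""
    prev = [0] * (len(t) + 1)
    for s_char in s:
        cur = [0]  # column 0: empty prefix of t
        for j, t_char in enumerate(t, start=1):
            if s_char == t_char:
                cur.append(prev[j - 1] + 1)           # Case 2: match
            else:
                cur.append(max(prev[j], cur[j - 1]))  # Case 1: skip one
        prev = cur
    return prev[-1]


def indel_distance(s, t):
    """Minimum number of single-element inserts and deletes turning s into t."""
    common = lcs_length(s, t)
    deletes = len(s) - common  # elements of s outside the LCS
    inserts = len(t) - common  # elements of t outside the LCS
    return deletes + inserts


print(indel_distance("ABAZDC", "BACBAD"))  # 2 deletes + 2 inserts = 4
```

Because the functions only index and compare elements, passing two lists of text lines instead of two strings gives a diff-style distance, in the spirit of the file comparison mentioned above.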