
M2-longest_common_subsequence

The document discusses the concept of the Longest Common Subsequence (LCS) in the context of comparing DNA sequences, highlighting its importance in biological applications. It outlines the definition of subsequences, the brute-force approach to finding LCS, and introduces dynamic programming as an efficient method to compute LCS length and construct the sequence. Additionally, it covers improvements to the algorithm to optimize time and space complexity.

Uploaded by

ssanjayreg
Copyright
© © All Rights Reserved

Longest Common Subsequence

Inspiration
• Biological applications often need to compare the DNA
of two (or more) different organisms
• A strand of DNA consists of a string of molecules called
bases, where the possible bases are adenine, guanine,
cytosine, and thymine
• Representing each of these bases by its initial letter, we can express a
strand of DNA as a string over the finite set {A, C, G, T}
Inspiration
• For example, the DNA of one organism may be S1=
ACCGGTCGAGTGCGCGGAAGCCGGCCGAA, and
the DNA of another organism may be S2=
GTCGTTCGGAATGCCGTTGCTCTGTAAA.
• One reason to compare two strands of DNA is to
determine how “similar” the two strands are, as some
measure of how closely related the two organisms are
Inspiration
• We can define similarity in many different ways
• First way: we can say that two DNA strands are similar
if one is a substring of the other
• In our example, neither S1 nor S2 is a substring of the
other.
• Second way: two strands are similar if the number of
changes needed to turn one into the other is small
Inspiration
• Third way: measure the similarity of strands S1 and S2 by
finding a third strand S3
• The bases in S3 appear in each of S1 and S2; these bases
must appear in the same order, but not necessarily
consecutively
• The longer the strand S3 we can find, the more similar S1 and S2 are
Inspiration
• S1 = ACCGGTCGAGTGCGCGGAAGCCGGCCGAA
• S2 = GTCGTTCGGAATGCCGTTGCTCTGTAAA
• S3 = GTCGTCGGAAGCCGGCCGAA
Problem Statement
• A subsequence of a given sequence is just the given
sequence with zero or more elements left out
• Formally, given a sequence X = <x1,x2,...,xm>, another
sequence Z = <z1,z2,...,zk> is a subsequence of X if there
exists a strictly increasing sequence <i1,i2,...,ik> of indices
of X such that for all j = 1,2,...,k, we have x_{i_j} = z_j
Problem Statement
• For example, Z = <B, C, D, B> is a subsequence of X = <A,
B, C, B, D, A, B> with corresponding index sequence <2, 3,
5, 7>
• Given two sequences X and Y , we say that a sequence Z is a
common subsequence of X and Y if Z is a subsequence of
both X and Y
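The two definitions above can be checked mechanically. The following is a minimal Python sketch, assuming sequences are represented as strings; the function names is_subsequence and is_common_subsequence are illustrative and do not come from the slides.

```python
def is_subsequence(z, x):
    """Return True if z can be obtained from x by leaving out zero or more elements."""
    it = iter(x)
    # `symbol in it` consumes the iterator until it finds symbol, so each
    # match is forced to occur at a strictly later position of x.
    return all(symbol in it for symbol in z)


def is_common_subsequence(z, x, y):
    """Return True if z is a subsequence of both x and y."""
    return is_subsequence(z, x) and is_subsequence(z, y)


X = "ABCBDAB"
Y = "BDCABA"
print(is_subsequence("BCDB", X))            # True (indices 2, 3, 5, 7)
print(is_common_subsequence("BCDB", X, Y))  # False: Y has no D after its C
```

The single shared iterator is what enforces the "strictly increasing indices" condition from the formal definition.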
Problem Statement
• For example, if X = <A, B, C, B, D, A, B> and Y = <B, D, C, A,
B, A>, the sequence <B, C, A> is a common subsequence of
both X and Y
• But it is not a longest common subsequence (LCS) of X and Y, since it has length 3
• The sequence <B, C, B, A>, which is also common to both X and
Y, has length 4
• It is an LCS of X and Y (as is <B, D, A, B>), since X and Y have no
common subsequence of length 5 or greater
Step 1: Characterizing a longest common subsequence
• Brute-force approach to solve the LCS problem:
• Enumerate all subsequences of X
• Check each subsequence to see whether it is also a subsequence of Y
• Keep track of the longest subsequence we find
• Each subsequence of X corresponds to a subset of the indices
{1, 2,...,m} of X
• Because X has 2^m subsequences, this approach requires
exponential time, making it impractical for long sequences
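The brute-force procedure described above can be sketched in Python as follows. This is only an illustration under the assumption that sequences are strings; the name lcs_brute_force is hypothetical, and the subsets of indices are generated with itertools.combinations.

```python
from itertools import combinations


def is_subsequence(z, x):
    """Return True if z is a subsequence of x."""
    it = iter(x)
    return all(symbol in it for symbol in z)


def lcs_brute_force(x, y):
    """Try every subsequence of x (2^m of them) and keep the longest one found in y."""
    best = ""
    for k in range(len(x) + 1):
        # Each increasing tuple of k indices selects one subsequence of x.
        for indices in combinations(range(len(x)), k):
            candidate = "".join(x[i] for i in indices)
            if len(candidate) > len(best) and is_subsequence(candidate, y):
                best = candidate
    return best


print(len(lcs_brute_force("ABCBDAB", "BDCABA")))  # 4
```

For m = 7 this inspects 2^7 = 128 index subsets; every additional element of X doubles that count, which is why the approach does not scale.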
Basis of Optimal substructure of an LCS
• Given a sequence X = <x1, x2,...,xm>, we define the ith prefix of
X, for i = 0,1,...,m, as X_i = <x1, x2,...,xi>
• For example, if X = <A, B, C, B, D, A, B>, then X4 = <A, B, C,
B> and X0 is the empty sequence
Theorem 15.1 Optimal substructure of an LCS
• Let X = <x1, x2,...,xm> and Y = <y1, y2,...,yn> be sequences, and let Z
= <z1, z2,..., zk> be any LCS of X and Y.
1. If x_m = y_n, then z_k = x_m = y_n and Z_{k-1} is an LCS of X_{m-1} and Y_{n-1}
2. If x_m ≠ y_n, then z_k ≠ x_m implies that Z is an LCS of X_{m-1} and Y
3. If x_m ≠ y_n, then z_k ≠ y_n implies that Z is an LCS of X and Y_{n-1}


Proof of Theorem 15.1
• (1) If z_k ≠ x_m, then we could append x_m = y_n to Z to obtain a
common subsequence of X and Y of length k + 1, contradicting
the supposition that Z is a longest common subsequence of X
and Y. Thus, we must have z_k = x_m = y_n.
• Now, the prefix Z_{k-1} is a length-(k-1) common subsequence of
X_{m-1} and Y_{n-1}
Proof of Theorem 15.1
• We wish to show that Z_{k-1} is an LCS of X_{m-1} and Y_{n-1}
• Suppose for the purpose of contradiction that there exists a
common subsequence W of X_{m-1} and Y_{n-1} with length greater
than k-1
• Then, appending x_m = y_n to W produces a common subsequence
of X and Y whose length is greater than k, which is a
contradiction

Proof of Theorem 15.1
• (2) If z_k ≠ x_m, then Z is a common subsequence of X_{m-1} and Y
• If there were a common subsequence W of X_{m-1} and Y with
length greater than k, then W would also be a common
subsequence of X_m and Y, contradicting the assumption that Z is
an LCS of X and Y
• (3) The proof is symmetric to (2)


Step 2: A recursive solution
• Theorem 15.1 implies that we should examine either one or two
subproblems when finding an LCS of X = <x1, x2,...,xm> and
Y= <y1, y2,...,yn>

• If x_m = y_n, we must find an LCS of X_{m-1} and Y_{n-1}
• Appending x_m = y_n to this LCS yields an LCS of X and Y
• If x_m ≠ y_n, then we must solve two subproblems: finding an LCS
of X_{m-1} and Y and finding an LCS of X and Y_{n-1}.
Step 2: A recursive solution
• Whichever of these two LCSs is longer is an LCS of X and Y
• Because these cases exhaust all possibilities, we know that one
of the optimal subproblem solutions must appear within an LCS
of X and Y .
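The case analysis above translates almost directly into a recursive function. The sketch below is a naive illustration in Python, assuming string inputs; the name lcs_recursive is hypothetical. It takes exponential time in the worst case because it solves the same prefixes repeatedly.

```python
def lcs_recursive(x, y):
    """Return one LCS of x and y by following the optimal-substructure cases."""
    if not x or not y:
        # An empty prefix has only the empty sequence in common with anything.
        return ""
    if x[-1] == y[-1]:
        # Last elements match: solve the problem for both shortened prefixes
        # and append the shared element.
        return lcs_recursive(x[:-1], y[:-1]) + x[-1]
    # Last elements differ: solve both subproblems and keep the longer answer.
    drop_x = lcs_recursive(x[:-1], y)
    drop_y = lcs_recursive(x, y[:-1])
    return drop_x if len(drop_x) >= len(drop_y) else drop_y


print(len(lcs_recursive("ABCBDAB", "BDCABA")))  # 4
```

Which LCS is returned when several exist depends on the tie-breaking rule in the last line; only the length is determined by the inputs.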
Step 2: Overlapping Subproblem
• To find an LCS of X and Y, we may need to find the LCSs of X
and Y_{n-1} and of X_{m-1} and Y
• But each of these subproblems has the subsubproblem of finding
an LCS of X_{m-1} and Y_{n-1}
• Many other subproblems share subsubproblems.
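To make the overlap concrete, one can count how often a naive recursion visits each pair of prefix lengths (i, j). The following Python sketch is an illustration only; the function name count_subproblem_calls is hypothetical, and the recursion computes just the LCS length.

```python
from collections import Counter


def count_subproblem_calls(x, y):
    """Run the naive length recursion and count visits to each subproblem (i, j)."""
    calls = Counter()

    def length(i, j):
        calls[(i, j)] += 1
        if i == 0 or j == 0:
            return 0
        if x[i - 1] == y[j - 1]:
            return length(i - 1, j - 1) + 1
        return max(length(i - 1, j), length(i, j - 1))

    length(len(x), len(y))
    return calls


calls = count_subproblem_calls("ABCBDAB", "BDCABA")
# Pairs visited more than once are the shared subsubproblems.
repeated = {pair: n for pair, n in calls.items() if n > 1}
print(len(calls), "distinct subproblems,", sum(calls.values()), "calls in total")
```

The number of distinct pairs can never exceed (m + 1)(n + 1), while the total number of calls can grow exponentially; that gap is what dynamic programming removes.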


Step 2: Overlapping Subproblem
• Let us define c[i, j] to be the length of an LCS of the sequences
X_i and Y_j
• If either i = 0 or j = 0, one of the sequences has length 0, and so the
LCS has length 0
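Combining this base case with the three cases of Theorem 15.1 gives the usual recurrence for c[i, j]. It is written out here in LaTeX as a reconstruction from the surrounding definitions, using the same symbols (x_i, y_j, c[i, j]):

```latex
c[i, j] =
\begin{cases}
0 & \text{if } i = 0 \text{ or } j = 0, \\
c[i-1, j-1] + 1 & \text{if } i, j > 0 \text{ and } x_i = y_j, \\
\max\bigl(c[i, j-1],\; c[i-1, j]\bigr) & \text{if } i, j > 0 \text{ and } x_i \neq y_j.
\end{cases}
```

The second case corresponds to part 1 of the theorem, and the third case covers parts 2 and 3 together.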
Step 3: Computing the length of an LCS
• The LCS problem has only Θ(mn) distinct subproblems, so
we can use dynamic programming to compute the solutions
bottom up.
• We maintain two 2D tables c and b for dynamic programming
• The c table maintains the lengths of the longest common subsequences of the prefixes
• The b table helps to construct the solution


Step 3: Computing the length of an LCS
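A bottom-up computation of the c and b tables could look like the Python sketch below. It is an illustration under stated assumptions rather than the slide's own pseudocode: the name lcs_length, the arrow strings stored in b, and the choice to prefer "↑" on ties are all assumptions made here.

```python
def lcs_length(x, y):
    """Fill tables c (LCS lengths of prefixes) and b (arrows) in row-major order."""
    m, n = len(x), len(y)
    # Row 0 and column 0 stay 0: an empty prefix has an LCS of length 0.
    c = [[0] * (n + 1) for _ in range(m + 1)]
    b = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
                b[i][j] = "↖"  # x_i = y_j belongs to the LCS
            elif c[i - 1][j] >= c[i][j - 1]:
                c[i][j] = c[i - 1][j]
                b[i][j] = "↑"  # answer comes from X_{i-1} and Y_j
            else:
                c[i][j] = c[i][j - 1]
                b[i][j] = "←"  # answer comes from X_i and Y_{j-1}
    return c, b


c, b = lcs_length("ABCBDAB", "BDCABA")
print(c[7][6])  # 4
```

Each of the (m + 1)(n + 1) entries is filled in constant time, so both the running time and the space are Θ(mn).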
Step 4: Constructing an LCS
• The b table returned by LCS-LENGTH enables us to quickly
construct an LCS for X = <x1, x2,...,xm> and Y = <y1, y2,...,yn>
• We simply begin at b[m, n] and trace through the table by
following the arrows
• Whenever we encounter a “↖” in entry b[i, j], it implies that x_i =
y_j is an element of the LCS that LCS-LENGTH found.


Step 4: Constructing an LCS
• With this method, we encounter the elements of this LCS in
reverse order.
• The following recursive procedure prints out an LCS of X and Y
in the proper, forward order
• The initial call is PRINT-LCS(b, X, X.length, Y.length)


• For the b table in Figure 15.8, this procedure prints BCBA
• The procedure takes time O(m + n), since it decrements at least one
of i and j in each recursive call
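A recursive trace through the b table can be sketched in Python as follows. The names build_arrows and trace_lcs are hypothetical, the arrow encoding is an assumption, and the table-building helper is repeated only so the example runs on its own. The elements are collected in a list rather than printed directly so the result is easy to inspect.

```python
def build_arrows(x, y):
    """Build the arrow table b (ties between the two subproblems go to "↑")."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    b = [[None] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j], b[i][j] = c[i - 1][j - 1] + 1, "↖"
            elif c[i - 1][j] >= c[i][j - 1]:
                c[i][j], b[i][j] = c[i - 1][j], "↑"
            else:
                c[i][j], b[i][j] = c[i][j - 1], "←"
    return b


def trace_lcs(b, x, i, j, out):
    """Follow the arrows from b[i][j]; append LCS elements to out in forward order."""
    if i == 0 or j == 0:
        return
    if b[i][j] == "↖":
        trace_lcs(b, x, i - 1, j - 1, out)
        out.append(x[i - 1])  # appended after the recursive call, hence forward order
    elif b[i][j] == "↑":
        trace_lcs(b, x, i - 1, j, out)
    else:
        trace_lcs(b, x, i, j - 1, out)


X, Y = "ABCBDAB", "BDCABA"
out = []
trace_lcs(build_arrows(X, Y), X, len(X), len(Y), out)
print("".join(out))  # BCBA
```

Because the element is appended only after the recursive call returns, the output comes out in forward order even though the trace walks the table backwards. For very long inputs the recursion depth of up to m + n may exceed Python's default limit, so an iterative loop would be the safer choice there.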
Improving the code
• Once you have developed an algorithm, you will often find that
you can improve on the time or space it uses

• Some changes can simplify the code and improve constant
factors but otherwise yield no asymptotic improvement in
performance.
• Others can yield substantial asymptotic savings in time and
space.
Improving the code
• In the LCS algorithm, for example, we can eliminate the b table
altogether. Each c[i, j] entry depends on only three other c table
entries: c[i -1, j- 1], c[i - 1, j], and c[i, j -1].

• Given the value of c[i, j], we can determine in O(1) time which
of these three values was used to compute c[i,j], without
inspecting table b.
Improving the code
• Thus, we can reconstruct an LCS in O(m + n) time using a
procedure similar to PRINT-LCS.
• Although we save Θ(mn) space by this method, the auxiliary
space requirement for computing an LCS does not
asymptotically decrease, since we need Θ(mn) space for the c
table anyway.
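Reconstruction from the c table alone might look like the Python sketch below. It is an illustration with hypothetical names (lcs_table, lcs_from_c); at each step it re-derives, in O(1) time, which of the three neighbouring entries produced c[i][j], using the same tie-breaking as a b table that prefers the entry above.

```python
def lcs_table(x, y):
    """Compute only the c table of LCS lengths; no b table is stored."""
    m, n = len(x), len(y)
    c = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                c[i][j] = c[i - 1][j - 1] + 1
            else:
                c[i][j] = max(c[i - 1][j], c[i][j - 1])
    return c


def lcs_from_c(c, x, y):
    """Walk back from c[m][n], deciding each step from x, y, and c alone."""
    i, j = len(x), len(y)
    reversed_lcs = []
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            # c[i][j] was computed from the diagonal entry c[i-1][j-1].
            reversed_lcs.append(x[i - 1])
            i, j = i - 1, j - 1
        elif c[i - 1][j] >= c[i][j - 1]:
            i -= 1  # value was copied from the entry above
        else:
            j -= 1  # value was copied from the entry to the left
    return "".join(reversed(reversed_lcs))


X, Y = "ABCBDAB", "BDCABA"
print(lcs_from_c(lcs_table(X, Y), X, Y))  # BCBA
```

Each loop iteration decreases i or j by at least one, so the walk takes O(m + n) steps.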
Improving the code
• We can, however, reduce the asymptotic space requirements for
LCS-LENGTH, since it needs only two rows of table c at a
time: the row being computed and the previous row.
• This improvement works if we need only the length of an LCS;
if we need to reconstruct the elements of an LCS, the smaller
table does not keep enough information to retrace our steps in
O(m + n) time
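The two-row idea can be sketched in Python as below. This is a minimal illustration with a hypothetical function name, lcs_length_two_rows; it returns only the length, in line with the limitation noted above.

```python
def lcs_length_two_rows(x, y):
    """Compute the LCS length while keeping only two rows of the c table."""
    prev = [0] * (len(y) + 1)  # row i-1 of c; initially row 0, which is all zeros
    for xi in x:
        curr = [0]  # column 0 of the row being computed
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr  # the finished row becomes the previous row
    return prev[-1]


print(lcs_length_two_rows("ABCBDAB", "BDCABA"))  # 4
```

The extra space is Θ(n) instead of Θ(mn). Passing the shorter sequence as y reduces it further to Θ(min(m, n)), since the LCS length is the same with the arguments swapped.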
