CSCI 3104, CU-Boulder
Profs. Clauset & Grochow
Lecture 7, Spring 2018

1 Aligning sequences with dynamic programming


Suppose we can represent some user input as a sequence of symbols x, where xi ∈ Σ with Σ denoting
the input alphabet. Our task is to identify a best match of that input to a library of sequences
Y = {y}. However, users are prone to input errors, which means that x may not be an exact match
to any of those in our reference set. Thus, in order to correctly identify the user's input, we must
first identify which y ∈ Y is the best match for the input x.

Examples of this type of problem are more common than we might imagine. For instance, search
engines often try to correct misspelled words in the query string; in speech recognition, the string
is a sequence of values representing the recorded sound wave, which must be matched to a known
word; and in forensic analysis, the input string is a DNA sequence, which may differ from those in
our reference set by some number of nucleic acid mutations, insertions or deletions.

In each case, we aim to align a pair of sequences so that we find the elements in each that
correspond exactly to each other, while ignoring the elements between these aligned parts. Here, we
will focus on what is called a global alignment, in which we aim to align the two sequences in their
entirety. In order to define an algorithm for finding such an alignment, we must also define a set ℰ
of edit operations and a cost c(E) ≥ 0 for each operation E ∈ ℰ.

The problem of sequence alignment is to find a minimal-cost set of edit operations that transforms
the sequence x into the sequence y. We will solve this problem using a dynamic programming
algorithm.

1.1 Edit operations and costs


Before we can write down an algorithm, we must define the set of edit operations E we may use.
Here, we will utilize three operations beyond the default “no-op” operation, which leaves a letter
unchanged.

• Substitution (sub): replace a letter xi with some other letter in the alphabet Σ, at the same
position as xi .
For instance, “so” and “do” are two strings that differ by a single substitution edit, and which
are commonly misspelled for each other on a keyboard because s and d are next to each other.

• Insertion and Deletion (indel ): insert some letter from the alphabet Σ into x, shifting all
subsequent letters one position later in the string; or, delete xi from x, shifting all subsequent
letters one position earlier in the string. Note that an insertion operation into one string is
equivalent to a deletion operation in the other string.
For instance, “grande” and “grand” are two strings that differ by a single indel operation.

• Transposition (swap): take two consecutive letters xi , xi+1 and exchange their positions, and
then substitute them into the aligned positions in y.[1]
For instance, both “their” / “thier” and “teh” / “the” are pairs of strings that differ by a
single transposition.

Given these operations, we must now also choose a cost function c(E). There are several choices
for this function, but here we choose the "edit distance" function (technically called the
Damerau-Levenshtein Distance)[2] which simply counts the number of these operations required to
transform x into y.[3] The one wrinkle is that transposition is actually three operations: one swap,
followed by two subs, for a total cost of up to 3 (a sub that leaves a letter unchanged is a no-op
and costs 0, as in the worked example below), while any single sub or indel costs 1.

1.2 An example
To illustrate how to compute the cost of a particular alignment, consider aligning the two strings
x = THEIR and y = THERE.

Alignment 1 : Substitute the last two characters, for a total cost of 2 sub operations:

THEIR
|||ss
THERE

Alignment 2 : Insert and delete so that the R lines up, for a total cost of 2 indel operations:

THEIR-
|||d|i
THE-RE

where “-” denotes a “gap” character, implying an insertion on the opposing string.

Alignment 3 : At worst, delete the entire first string, and insert the entire second string, for a total
cost of 10 indel operations:

THEIR-----
dddddiiiii
-----THERE

Clearly, the first two alignments are cheaper than the third alignment, and under the edit-distance
cost function, either of those would be an acceptable alignment.

[1] Generalizations exist that allow letters to be transposed more than once, or to allow longer
substrings to be transposed, but these algorithms are more complicated.
[2] Supposedly, these types of "edits" represent a large fraction, possibly 80% or more, of all human
misspellings, with the remainder presumably being confusion over which word to use in the first
place, e.g., "their" versus "they're".
[3] Other cost structures are certainly possible, depending on the application. For instance, a
transposition might be less costly than an insertion, etc. Furthermore, cost may depend on the
letters being changed, perhaps reflecting the probability of the error. For instance, adjacent
letters on a QWERTY keyboard may have lower costs for substitution or transposition than letters
far apart.
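The cost of each of the three alignments above can be checked mechanically. The following is a minimal sketch, assuming the annotation symbols used in this section ("|" for a no-op, "s" for a sub, "d" and "i" for the two kinds of indel, "-" for a gap); the function name `alignment_cost` is hypothetical, and transpositions are deliberately left out.

```python
GAP = "-"  # gap character, as in the alignments above

def alignment_cost(top: str, ops: str, bottom: str) -> int:
    """Validate a three-row alignment and return its edit-distance cost."""
    if not (len(top) == len(ops) == len(bottom)):
        raise ValueError("all three rows must have the same length")
    total = 0
    for a, op, b in zip(top, ops, bottom):
        if op == "|":      # no-op: identical letters, cost 0
            ok, cost = (a == b and a != GAP), 0
        elif op == "s":    # sub: two different letters, cost 1
            ok, cost = (a != b and GAP not in (a, b)), 1
        elif op == "d":    # indel: letter on top, gap below, cost 1
            ok, cost = (a != GAP and b == GAP), 1
        elif op == "i":    # indel: gap on top, letter below, cost 1
            ok, cost = (a == GAP and b != GAP), 1
        else:
            raise ValueError(f"unknown operation {op!r}")
        if not ok:
            raise ValueError(f"column ({a!r}, {op!r}, {b!r}) is inconsistent")
        total += cost
    return total

print(alignment_cost("THEIR", "|||ss", "THERE"))                  # Alignment 1
print(alignment_cost("THEIR-", "|||d|i", "THE-RE"))               # Alignment 2
print(alignment_cost("THEIR-----", "dddddiiiii", "-----THERE"))   # Alignment 3
```

The three calls print 2, 2 and 10, matching the costs stated for Alignments 1, 2 and 3.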

1.3 When can we apply dynamic programming?


Recall that in dynamic programming, we assemble the solution to a larger problem by utilizing
the exact solutions to smaller problems contained within our larger problem. In general, the
relationship between a problem and its subproblems defines a recursive structure that we can use
to build the full solution in a "bottom-up" fashion.[4]

A general requirement for dynamic programming is that there cannot be a cycle among subproblem
dependencies, such that solving some problem A requires eventually solving some B that requires
solving A. Thus, dynamic programming can be applied only if the space of subproblems can be
organized into a directed acyclic graph (a “DAG”), in which each subproblem is a vertex and an
arc i → j represents that solving j requires solving i first.

1.4 Dynamic programming solution


The ordered substructure in sequence alignment comes from the additive cost of making addi-
tional edit operations, as we move from left-to-right through the sequences. That is, the cost of
aligning two subsequences x1 x2 . . . xi = x1...i and y1 y2 . . . yj = y1...j is the cost of the edit oper-
ation for xi and yj plus the cost of aligning the subproblem that got us to needing to align xi and yj .

There are only three ways we could have gotten to needing to align xi and yj :

• the last op was sub, and we paid the cost of aligning x1...i−1 and y1...j−1 ,

• the last op was indel, and we paid the cost of aligning either x1...i and y1...j−1 or aligning
x1...i−1 and y1...j , or

• the last op was swap, and we paid the cost of aligning x1...i−2 and y1...j−2 .

Let cost(i, j) be the minimum cost of aligning x1...i and y1...j , where we define as a base case
cost(0, 0) = 0.

[4] There are additional requirements for dynamic programming to produce a polynomial-time
algorithm: the number of subproblems must be polynomial in size and the recursive function must
run in polynomial time.

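Thus, the recursive structure of the subproblems we identified above implies that cost(i, j) may be
computed recursively as

\[
\mathrm{cost}(i,j) = \min
\begin{cases}
\mathrm{cost}(i-2,\,j-2) + c(\mathrm{swap}) \\
\mathrm{cost}(i-1,\,j-1) + c(\mathrm{sub}) \\
\mathrm{cost}(i-1,\,j) + c(\mathrm{indel}) \\
\mathrm{cost}(i,\,j-1) + c(\mathrm{indel})
\end{cases}
\]

where we define c(sub) = 0 if xi = yj , i.e., a "no-op." A term is included only when both of its
indices are nonnegative. This function is equivalent to this DAG template: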

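[Figure: DAG template. The vertices (i−2, j−2), (i−1, j−1), (i−1, j) and (i, j−1) each have an arc
into (i, j), labeled swap, sub, indel and indel, respectively.]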

which represents the relationship between subproblems.
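To make the acyclicity requirement of Section 1.3 concrete for this template, the sketch below lists the subproblems each cell reads and checks whether a given fill order visits every dependency first. The helper names `dependencies` and `is_topological` are hypothetical, and the grid size is borrowed from the STEP/APE example later in these notes.

```python
def dependencies(i, j):
    """Subproblems that cost(i, j) reads, following the DAG template above."""
    deps = []
    if i >= 1 and j >= 1:
        deps.append((i - 1, j - 1))   # sub
    if i >= 1:
        deps.append((i - 1, j))       # indel
    if j >= 1:
        deps.append((i, j - 1))       # indel
    if i >= 2 and j >= 2:
        deps.append((i - 2, j - 2))   # swap
    return deps

def is_topological(order):
    """True if every cell appears after all of the cells it depends on."""
    seen = set()
    for cell in order:
        if any(dep not in seen for dep in dependencies(*cell)):
            return False
        seen.add(cell)
    return True

nx, ny = 4, 3  # lengths of STEP and APE
row_major = [(i, j) for i in range(nx + 1) for j in range(ny + 1)]
print(is_topological(row_major))        # row by row, left to right
print(is_topological(row_major[::-1]))  # the same cells in reverse order
```

Filling the cells row by row respects every arc, so the first check prints True; the reversed order starts at (4, 3) before any of its dependencies exist, so the second prints False.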

By memoizing the solutions (costs) to the subproblems for 0 ≤ i ≤ nx (length of x) and 0 ≤ j ≤ ny
(length of y), storing them in a 2-dimensional array S[i, j] = cost(i, j), we can recursively compute
the minimum cost of aligning x and y.
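A memoized, top-down version of this recursion might look as follows. This is a sketch under stated assumptions rather than the notes' own code: the name `align_cost` is hypothetical, `functools.lru_cache` plays the role of the array S, and c(swap) is taken to be 1 plus the cost of the two subs that follow the swap, which is how the worked example below charges it.

```python
from functools import lru_cache

def align_cost(x: str, y: str) -> int:
    """Minimum cost of aligning x with y under the recurrence above."""

    def sub(a, b):
        return 0 if a == b else 1       # c(sub), with the no-op costing 0

    @lru_cache(maxsize=None)            # memoization: each (i, j) solved once
    def cost(i, j):
        if i == 0 and j == 0:
            return 0                    # base case: two empty prefixes
        options = []
        if i >= 1 and j >= 1:           # last op was sub (or a no-op)
            options.append(cost(i - 1, j - 1) + sub(x[i - 1], y[j - 1]))
        if i >= 1:                      # last op deleted x_i
            options.append(cost(i - 1, j) + 1)
        if j >= 1:                      # last op inserted y_j
            options.append(cost(i, j - 1) + 1)
        if i >= 2 and j >= 2:           # last op was swap: 1 + two subs (assumed)
            swap = 1 + sub(x[i - 1], y[j - 2]) + sub(x[i - 2], y[j - 1])
            options.append(cost(i - 2, j - 2) + swap)
        return min(options)

    return cost(len(x), len(y))

print(align_cost("THEIR", "THERE"))
print(align_cost("TEH", "THE"))
```

The first call prints 2, matching Section 1.2, and the second prints 1, a single transposition whose two subs are no-ops.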

1.5 A small and fully worked example


Before tackling a large example, let us exhaustively do a small one. Consider aligning x = STEP
and y = APE.

We begin by writing out the cost matrix[5] S, and filling in the base case for aligning two empty
strings, which has cost(0, 0) = 0.

We may now immediately fill in the values for the 0th column and 0th row, which correspond to
the cost of aligning an empty string with x (column 0) or with y (row 0). In each of these cases,
the alignment consists of inserting each character in the target string into the empty string, and
thus the costs in the 0th row are S(0, j) = j for 1 ≤ j ≤ ny , and the costs in the 0th column are
S(i, 0) = i for 1 ≤ i ≤ nx .

[5] For convenience, we will assume this matrix is 0-indexed, meaning that the first element in a
row or a column is the 0th element.

base case:

x/y      A  P  E
      0
S
T
E
P

empty strings aligned:

x/y      A  P  E
      0  1  2  3
S     1
T     2
E     3
P     4

first character aligned:

x/y      A  P  E
      0  1  2  3
S     1  1  2  3
T     2  2
E     3  3
P     4  4

At the next step, we set i = 1 and j = 1 and align x1 = S with y1 = A. There are three subproblems
to consider (the fourth subproblem, corresponding to swap, isn't allowed yet):

• (Sub) We previously aligned the empty string from x with the empty string from y, for cost
S(0, 0) = 0.
Now we substitute S for A, which costs c(sub) = 1.
Cost = 1.

• (Delete) We previously aligned the empty string from x with A from y, for cost S(0, 1) = 1.
Now we delete S, which costs c(indel) = 1.
Cost = 2.

• (Insert) We previously aligned S from x with the empty string from y, for cost S(1, 0) = 1.
Now we insert A, which costs c(indel) = 1.
Cost = 2.

The minimum of these choices is uniquely the first one, and thus we record S(1, 1) = 1.

Next we consider i = 1 and j = {2, 3}, in which we align S with {AP, APE}. Although we could
write down the three subproblems for each of these, we may also simply recognize that S appears in
neither of these strings, so no letter can be matched for free: the cheapest alignment substitutes
S for one letter of y1...j and inserts the remaining j − 1 letters. Thus, we may record S(1, j) = j
for j = {2, 3}.

The same fact is true for i = {2, 3, 4} and j = 1, in which we align {ST, STE, STEP} with A. Thus,
we may record S(i, 1) = i for i = {2, 3, 4}. What now remains is to align the remaining cases of
substrings. We will treat each of the 6 cases, one at a time.

Set i, j = 2 and align ST with AP. There are four subproblems to consider:

• (Sub) Previously S → A. Now substitute T for P. Cost = S(1, 1) + 1 = 2

• (Delete) Previously S → AP. Now delete T. Cost = S(1, 2) + 1 = 3

• (Insert) Previously ST → A. Now insert P. Cost = S(2, 1) + 1 = 3

• (Swap) Previously (empty) → (empty). Now transpose ST and sub for AP. Cost = S(0, 0) + 3 = 3

Thus, we record S(2, 2) = 2.
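The four-way comparison for a single cell can be written as a small function that reads only entries of S that are already filled in. This is a sketch, with the hypothetical name `candidates`, using `None` for entries not yet computed and charging a swap as 1 plus its two subs (an assumption read off the costs used in this section).

```python
def candidates(S, x, y, i, j):
    """Cost of reaching cell (i, j) via each allowed last operation."""
    def sub(a, b):
        return 0 if a == b else 1

    out = {}
    if i >= 1 and j >= 1:
        out["sub"] = S[i - 1][j - 1] + sub(x[i - 1], y[j - 1])
    if i >= 1:
        out["delete"] = S[i - 1][j] + 1
    if j >= 1:
        out["insert"] = S[i][j - 1] + 1
    if i >= 2 and j >= 2:
        # one swap, then sub the swapped letters into y_{j-1} and y_j
        swap = 1 + sub(x[i - 1], y[j - 2]) + sub(x[i - 2], y[j - 1])
        out["swap"] = S[i - 2][j - 2] + swap
    return out

# State of S for x = STEP, y = APE just before S(2, 2) is computed.
S = [
    [0, 1, 2, 3],
    [1, 1, 2, 3],
    [2, 2, None, None],
    [3, 3, None, None],
    [4, 4, None, None],
]
options = candidates(S, "STEP", "APE", 2, 2)
print(options)
S[2][2] = min(options.values())
print(S[2][2])
```

This prints the four costs 2, 3, 3 and 3 for sub, delete, insert and swap, and then the recorded minimum S(2, 2) = 2.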

Now setting i = 2 and j = 3, we align ST with APE:

• (Sub) Previously S → AP . Now substitute T for E . Cost = S(1, 2) + 1 = 3

• (Delete) Previously S → APE . Now delete T . Cost = S(1, 3) + 1 = 4

• (Insert) Previously ST → AP . Now insert E . Cost = S(2, 2) + 1 = 3

• (Swap) Previously (empty) → A. Now transpose ST and sub for PE. Cost = S(0, 1) + 3 = 4

Thus, we record S(2, 3) = 3, which represents the cost of either of these subalignments:

S-T    ST-
sis    ssi
APE    APE

Now we set i = 3 and j = 2, in which we align STE with AP. Again, there are four subproblems to
consider:

• (Sub) Previously ST → A . Now substitute E for P . Cost = S(2, 1) + 1 = 3

• (Delete) Previously ST → AP . Now delete E . Cost = S(2, 2) + 1 = 3

• (Insert) Previously STE → A . Now insert P . Cost = S(3, 1) + 1 = 4

• (Swap) Previously S → (empty). Now transpose TE and sub for AP. Cost = S(1, 0) + 3 = 4

Thus, we record S(3, 2) = 3.

Now setting i = 3 and j = 3 and aligning STE with APE, we have:

• (Sub) Previously ST → AP . Now substitute E for E . Cost = S(2, 2) + 0 = 2

• (Delete) Previously ST → APE . Now delete E (from x). Cost = S(2, 3) + 1 = 4

• (Insert) Previously STE → AP . Now insert E (into y). Cost = S(3, 2) + 1 = 4

• (Swap) Previously S → A . Now transpose TE and sub for PE . Cost = S(1, 1) + 3 = 4

Thus, we record S(3, 3) = 2.

Penultimately, we consider i = 4 and j = 2 and align STEP with AP:

• (Sub) Previously STE → A . Now substitute P for P . Cost = S(3, 1) + 0 = 3

• (Delete) Previously STE → AP . Now delete P (from x). Cost = S(3, 2) + 1 = 4

• (Insert) Previously STEP → A . Now insert P (into y). Cost = S(4, 1) + 1 = 5

• (Swap) Previously ST → (empty). Now transpose EP and sub for AP. Cost = S(2, 0) + 3 = 5

Thus, we record S(4, 2) = 3.

And finally, we set i = 4 and j = 3 and align STEP with APE:

• (Sub) Previously STE → AP . Now substitute P for E . Cost = S(3, 2) + 1 = 4

• (Delete) Previously STE → APE . Now delete P (from x). Cost = S(3, 3) + 1 = 3

• (Insert) Previously STEP → AP . Now insert E (into y). Cost = S(4, 2) + 1 = 4

• (Swap) Previously ST → A. Now transpose EP and sub for PE. Cost = S(2, 1) + 1 = 3 (the swap
costs 1, and both subs are no-ops because the transposed letters PE already match)

Thus, we record S(4, 3) = 3, which gives the final minimum cost for aligning STEP with APE, via
any of these alignments:
STEP STEP STEP
ss|d dstt sdtt
APE- -APE A-PE

Here are the completed cost matrices:

align ST with y:

x/y      A  P  E
      0  1  2  3
S     1  1  2  3
T     2  2  2  3
E     3  3
P     4  4

align STE with y:

x/y      A  P  E
      0  1  2  3
S     1  1  2  3
T     2  2  2  3
E     3  3  3  2
P     4  4

align STEP with y:

x/y      A  P  E
      0  1  2  3
S     1  1  2  3
T     2  2  2  3
E     3  3  3  2
P     4  4  3  3
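The cell-by-cell work of this section can be reproduced with a short bottom-up routine. This is a minimal sketch, assuming a swap is charged as 1 plus its two subs (consistent with the costs used above); the function name `cost_matrix` is hypothetical.

```python
def cost_matrix(x: str, y: str):
    """Fill S[i][j] = cost(i, j) row by row, starting from S[0][0] = 0."""
    def sub(a, b):
        return 0 if a == b else 1

    nx, ny = len(x), len(y)
    S = [[0] * (ny + 1) for _ in range(nx + 1)]
    for i in range(nx + 1):
        for j in range(ny + 1):
            options = []
            if i >= 1 and j >= 1:
                options.append(S[i - 1][j - 1] + sub(x[i - 1], y[j - 1]))
            if i >= 1:
                options.append(S[i - 1][j] + 1)        # delete x_i
            if j >= 1:
                options.append(S[i][j - 1] + 1)        # insert y_j
            if i >= 2 and j >= 2:                      # swap, then two subs
                swap = 1 + sub(x[i - 1], y[j - 2]) + sub(x[i - 2], y[j - 1])
                options.append(S[i - 2][j - 2] + swap)
            if options:                                # (0, 0) keeps its 0
                S[i][j] = min(options)
    return S

x, y = "STEP", "APE"
S = cost_matrix(x, y)
print("x/y     " + "  ".join(y))
for label, row in zip(" " + x, S):
    print(label + "    " + "  ".join(str(v) for v in row))
```

The printed rows agree with the final matrix above, with S(4, 3) = 3 in the bottom-right corner.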

To extract the 3 minimum-cost alignments given above, we examine the sequences of choices we
made to arrive at S(4, 3) = 3. Specifically, there are three paths from S(0, 0) that all reach S(4, 3),
and each of these paths corresponds to a minimum-cost alignment. Horizontal or vertical moves
(right or down) represent indel operations, single-diagonal moves are a sub, and double-diagonal
moves are a swap.

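x/y      A  P  E
      0  1  2  3
S     1  1  2  3
T     2  2  2  3
E     3  3  3  2
P     4  4  3  3

[Figure: the completed matrix repeated three times, one copy per minimum-cost path. Read off the
three alignments above, the paths visit these (i, j) cells:]

ss|d: (0, 0) → (1, 1) → (2, 2) → (3, 3) → (4, 3)
dstt: (0, 0) → (1, 0) → (2, 1) → (4, 3)
sdtt: (0, 0) → (1, 1) → (2, 1) → (4, 3)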
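The path extraction can also be done programmatically by walking back from S(nx, ny) along every move whose cost accounts exactly for the value in the current cell. The sketch below is one way to do this, not the notes' own procedure; the names `cost_matrix` and `optimal_alignments` are hypothetical, and a swap is assumed to cost 1 plus its two subs.

```python
def sub(a, b):
    return 0 if a == b else 1

def swap_cost(x, y, i, j):
    # assumed: one swap (cost 1) followed by two subs into y_{j-1}, y_j
    return 1 + sub(x[i - 1], y[j - 2]) + sub(x[i - 2], y[j - 1])

def cost_matrix(x, y):
    S = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            opts = []
            if i >= 1 and j >= 1:
                opts.append(S[i - 1][j - 1] + sub(x[i - 1], y[j - 1]))
            if i >= 1:
                opts.append(S[i - 1][j] + 1)
            if j >= 1:
                opts.append(S[i][j - 1] + 1)
            if i >= 2 and j >= 2:
                opts.append(S[i - 2][j - 2] + swap_cost(x, y, i, j))
            if opts:
                S[i][j] = min(opts)
    return S

def optimal_alignments(x, y):
    """All minimum-cost alignments as (x row, operations, y row) triples."""
    S = cost_matrix(x, y)

    def walk(i, j):
        if i == 0 and j == 0:
            yield ("", "", "")
            return
        if i >= 1 and j >= 1:
            c = sub(x[i - 1], y[j - 1])
            if S[i - 1][j - 1] + c == S[i][j]:
                for t, o, b in walk(i - 1, j - 1):
                    yield (t + x[i - 1], o + ("s" if c else "|"), b + y[j - 1])
        if i >= 1 and S[i - 1][j] + 1 == S[i][j]:
            for t, o, b in walk(i - 1, j):
                yield (t + x[i - 1], o + "d", b + "-")
        if j >= 1 and S[i][j - 1] + 1 == S[i][j]:
            for t, o, b in walk(i, j - 1):
                yield (t + "-", o + "i", b + y[j - 1])
        if i >= 2 and j >= 2 and S[i - 2][j - 2] + swap_cost(x, y, i, j) == S[i][j]:
            for t, o, b in walk(i - 2, j - 2):
                yield (t + x[i - 2:i], o + "tt", b + y[j - 2:j])

    return list(walk(len(x), len(y)))

for top, ops, bottom in optimal_alignments("STEP", "APE"):
    print(top, ops, bottom)
```

For STEP and APE this yields exactly the three alignments listed earlier, with operation strings ss|d, dstt and sdtt.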

1.6 A large worked example


Consider aligning the strings x = EXPONENTIAL and y = POLYNOMIAL.[6] The full matrix S of costs
is shown below, which is produced by starting at i, j = 0 and applying cost(i, j) as given above
iteratively to each element. (Or, by starting at i = nx and j = ny and making the recursive calls.)
Let us focus on a small piece of the overall calculation: aligning EXP and POLY. The cost is given by

\[
\begin{aligned}
\mathrm{cost}(3,4) &= \min\{\mathrm{cost}(1,2)+3,\ \mathrm{cost}(2,3)+1,\ \mathrm{cost}(2,4)+1,\ \mathrm{cost}(3,3)+1\} \\
&= \min\{2+3,\ 3+1,\ 4+1,\ 3+1\} \\
&= \min\{5,\ 4,\ 5,\ 4\} \\
&= 4
\end{aligned}
\]
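The four terms in this calculation can be recomputed directly. The sketch below rebuilds the matrix with a hypothetical `cost_matrix` helper (a swap is again assumed to cost 1 plus its two subs, which gives 3 for turning XP into LY) and prints each term of cost(3, 4).

```python
def sub(a, b):
    return 0 if a == b else 1

def cost_matrix(x, y):
    """Bottom-up table with S[i][j] = cost(i, j)."""
    S = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            opts = []
            if i >= 1 and j >= 1:
                opts.append(S[i - 1][j - 1] + sub(x[i - 1], y[j - 1]))
            if i >= 1:
                opts.append(S[i - 1][j] + 1)
            if j >= 1:
                opts.append(S[i][j - 1] + 1)
            if i >= 2 and j >= 2:
                swap = 1 + sub(x[i - 1], y[j - 2]) + sub(x[i - 2], y[j - 1])
                opts.append(S[i - 2][j - 2] + swap)
            if opts:
                S[i][j] = min(opts)
    return S

x, y = "EXPONENTIAL", "POLYNOMIAL"
S = cost_matrix(x, y)

# The four ways to finish aligning EXP (i = 3) with POLY (j = 4).
terms = {
    "swap":   S[1][2] + 1 + sub("P", "L") + sub("X", "Y"),
    "sub":    S[2][3] + sub("P", "Y"),
    "delete": S[2][4] + 1,
    "insert": S[3][3] + 1,
}
print(terms)
print(min(terms.values()), S[3][4], S[len(x)][len(y)])
```

The terms come out as 5, 4, 5 and 4, so cost(3, 4) = 4, and the last value printed is the overall cost of 6.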

[Figures, from Aicher & Hand, CSCI 5454 Sequence Alignment Lecture, CU Boulder, April 2, 2013.
Left: subproblem alignment costs of EXPONENTIAL vs. POLYNOMIAL. Right: subproblem alignment path
of EXPONENTIAL vs. POLYNOMIAL.]

The overall minimum cost of 6 is in the bottom-right corner of S. Note, however, that our cost
matrix does not contain the corresponding alignment. Given the completed matrix, we may extract
the corresponding alignment by starting in the bottom-right corner and finding the minimum-cost
path backwards through the DAG to S(0, 0). The right-hand figure above shows this path, whose
corresponding alignment is

--POLYNOMIAL
ii||ss|ds|||
EXPONEN-TIAL

for a cost of 3 indels and 3 subs, or 6 overall.

[6] This example is taken from Dasgupta, Papadimitriou and Vazirani's excellent book Algorithms
(2006).

1.7 Correctness

We now prove that this algorithm is correct, i.e., finds a minimum-cost alignment. As usual with
recursive functions, we provide a proof by induction on the cost of aligning the leading substrings
of x and y.

Claim: Any alignment of strings x and y that satisfies the cost(i, j) function is a minimal-cost
alignment.

Proof : First, we dispense with the base case of aligning two strings of length 0. The cost here must
be 0 because there are no letters to align and there can be no edit operations. Thus, cost(0, 0) = 0.

Now, assume that we have calculated a minimum-cost alignment on x1...k and y1...ℓ , for all k ≤ i and
ℓ ≤ j with (k, ℓ) ≠ (i, j). There are only four possible previous subalignments to consider, each of
which corresponds to the last edit operation used:

• Transpose: First, we swap xi−1 and xi , and we then substitute them for yj−1 and yj respectively.
These three edits together cost c(swap), by definition.
The remaining cost is from aligning x1...i−2 with y1...j−2 , whose minimum cost is cost(i − 2, j − 2).
Therefore, the minimum cost ending with a swap is cost(i − 2, j − 2) + c(swap).

• Substitute: We substitute the value at xi for the value at yj . This costs c(sub) by definition.
The remaining cost is from aligning x1...i−1 with y1...j−1 , whose minimum cost is cost(i−1, j −
1). Therefore, the minimum cost ending with a sub is cost(i − 1, j − 1) + c(sub).

• Delete in x and Insert in y: We add a gap character after yj to match xi . This costs c(indel)
by definition.
The remaining cost is from aligning x1...i−1 with y1...j , whose minimum cost is cost(i − 1, j).
Therefore, the minimum cost ending with this indel is cost(i − 1, j) + c(indel).

• Insert in x and Delete in y: We add a gap character after xi to match yj . This costs c(indel)
by definition.
The remaining cost is from aligning x1...i with y1...j−1 , whose minimum cost is cost(i, j − 1).
Therefore, the minimum cost ending with this indel is cost(i, j − 1) + c(indel).

Because cost(i, j) is defined as the minimum cost over the four possibilities, and because these are
the only paths to aligning the substrings x1...i and y1...j , the recursion relation must give the
minimal cost for aligning them. ∎

1.8 Pseudocode and running time


Although a recursive algorithm that carries out the work of filling in the matrix S is easy to define,
an iterative algorithm is almost as easy to write down. Much like the iterative algorithm for the
0-1 Knapsack problem, the iterative sequence alignment algorithm begins at the base case and fills
in the elements in each column, and then repeats this for each row. Furthermore, without using
asymptotically more space than S, we may also construct the alignment itself in parallel with filling
in S. The algorithm below is a simple generalization of the one originally given by Needleman and
Wunsch in 1970.

input: x with length nx and y with length ny

initialize S, of dimensions nx+1 by ny+1
initialize p, of dimensions nx+1 by ny+1

S[0,0] = 0
p[0,0] = NULL
for i = 0 to nx                           // consider all letters of x
    for j = 0 to ny                       // consider all letters of y
        if i>0 or j>0                     // skip the base case
            S[i,j] = cost(i,j)            // minimum cost up to xi and yj
            p[i,j] = argmin of cost(i,j)  // record the branch we took
        end
    end
end
return S[nx,ny] and path starting from p[nx,ny]
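For readers who want something executable, the pseudocode translates into Python roughly as follows. This is a sketch under stated assumptions: the name `align` is hypothetical, p stores the chosen operation together with the predecessor cell, a swap is charged as 1 plus its two subs, and ties are broken in the order sub, delete, insert, swap, which is a choice made here and not something the pseudocode specifies.

```python
def align(x: str, y: str):
    """Return (minimum cost, operation string) for aligning x with y."""
    def sub(a, b):
        return 0 if a == b else 1

    nx, ny = len(x), len(y)
    S = [[0] * (ny + 1) for _ in range(nx + 1)]
    p = [[None] * (ny + 1) for _ in range(nx + 1)]   # p[0][0] stays None
    for i in range(nx + 1):
        for j in range(ny + 1):
            if i == 0 and j == 0:
                continue                              # skip the base case
            options = []                              # (cost, op, predecessor)
            if i >= 1 and j >= 1:
                c = sub(x[i - 1], y[j - 1])
                options.append((S[i - 1][j - 1] + c, "s" if c else "|", (i - 1, j - 1)))
            if i >= 1:
                options.append((S[i - 1][j] + 1, "d", (i - 1, j)))
            if j >= 1:
                options.append((S[i][j - 1] + 1, "i", (i, j - 1)))
            if i >= 2 and j >= 2:
                c = 1 + sub(x[i - 1], y[j - 2]) + sub(x[i - 2], y[j - 1])
                options.append((S[i - 2][j - 2] + c, "tt", (i - 2, j - 2)))
            best = min(options, key=lambda o: o[0])   # first minimum wins ties
            S[i][j] = best[0]
            p[i][j] = (best[1], best[2])              # record the branch we took

    ops = []
    i, j = nx, ny
    while p[i][j] is not None:                        # follow p back to (0, 0)
        op, (i, j) = p[i][j]
        ops.append(op)
    return S[nx][ny], "".join(reversed(ops))

print(align("STEP", "APE"))
print(align("EXPONENTIAL", "POLYNOMIAL")[0])
```

With this tie-breaking order the first call returns cost 3 with the operation string ss|d, and the second prints 6.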

where we have used the definition of cost(i, j) given above.

Assuming that each call to cost(i, j) takes constant time, and we carry out (nx + 1) × (ny + 1) − 1 =
O(nx ny ) = O(n^2) of them, then the running time is O(n^2). (Note that x and y are treated
symmetrically, and we may simply adopt the convention of naming the longer length to be n.) The
space requirement is given by the size of S and p, which are also O(n^2).

There are more space-efficient versions of this algorithm. For instance, notice that cost(i, j) only
ever refers to elements at most two rows up or two columns left of the current problem parameters.
Thus, we may calculate the final solution by only storing three rows of S. (Do you see why we need
three entire rows, rather than a 3 × 3 submatrix with S(i, j) as the bottom-right element?) Now,
the space requirement is only O(n), but we must also give up the matrix p which means we lose the
record of the optimal alignment. In 1975, Hirschberg gave a clever divide-and-conquer algorithm
that solves both problems.
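The three-row idea can be sketched as follows. This illustrates only the O(n)-space cost computation, not Hirschberg's algorithm; the function name `align_cost_three_rows` is hypothetical, and a swap is again assumed to cost 1 plus its two subs.

```python
def align_cost_three_rows(x: str, y: str) -> int:
    """Minimum alignment cost, keeping only rows i-2, i-1 and i of S."""
    def sub(a, b):
        return 0 if a == b else 1

    ny = len(y)
    prev2 = None                      # row i-2 (not needed until i >= 2)
    prev = list(range(ny + 1))        # row 0: S(0, j) = j
    for i in range(1, len(x) + 1):
        cur = [i] + [0] * ny          # S(i, 0) = i
        for j in range(1, ny + 1):
            best = min(
                prev[j - 1] + sub(x[i - 1], y[j - 1]),   # sub or no-op
                prev[j] + 1,                             # delete x_i
                cur[j - 1] + 1,                          # insert y_j
            )
            if i >= 2 and j >= 2:                        # swap reads row i-2
                swap = 1 + sub(x[i - 1], y[j - 2]) + sub(x[i - 2], y[j - 1])
                best = min(best, prev2[j - 2] + swap)
            cur[j] = best
        prev2, prev = prev, cur       # slide the three-row window down
    return prev[ny]

print(align_cost_three_rows("STEP", "APE"))
print(align_cost_three_rows("EXPONENTIAL", "POLYNOMIAL"))
```

The two calls print 3 and 6, the same costs as the full-matrix version, while storing only O(ny) numbers at any time.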

2 On your own
1. Read Chapter 15
