Csci3104 S2018 L7
Examples of this type of problem are more common than we might imagine. For instance, search
engines often try to correct misspelled words in the query string; in speech recognition, the string
is a sequence of values representing the recorded sound wave, which must be matched to a known
word; and in forensic analysis, the input string is a DNA sequence, which may differ from those in
our reference set by some number of nucleic acid mutations, insertions or deletions.
In each case, we aim to align a pair of sequences so that we find the elements in each that
correspond exactly to each other, while ignoring the elements between these aligned parts. Here, we
will focus on what is called a global alignment, in which we aim to align the two sequences in their entirety.
In order to define an algorithm for finding such an alignment, we must also define a set of edit
operations ℰ and a cost c(E) ≥ 0 for each operation E ∈ ℰ.
The problem of sequence alignment is to find a minimal-cost set of edit operations that transforms
the sequence x into the sequence y. We will solve this problem using a dynamic programming
algorithm.
• Substitution (sub): replace a letter xi with some other letter in the alphabet Σ, at the same
position as xi .
For instance, “so” and “do” are two strings that differ by a single substitution edit, and which
are commonly misspelled for each other on a keyboard because s and d are next to each other.
• Insertion and Deletion (indel ): insert some letter from the alphabet Σ into x, shifting all
subsequent letters one position later in the string; or, delete xi from x, shifting all subsequent
letters one position earlier in the string. Note that an insertion operation into one string is
equivalent to a deletion operation in the other string.
For instance, “grande” and “grand” are two strings that differ by a single indel operation.
CSCI 3104, CU-Boulder Profs. Clauset & Grochow
Lecture 7 Spring 2018
• Transposition (swap): take two consecutive letters xi , xi+1 and exchange their positions, and
then substitute them into the aligned positions in y.1
For instance, both “their” / “thier” and “teh” / “the” are pairs of strings that differ by a
single transposition.
Given these operations, we must now also choose a cost function c(E). There are several choices
for this function, but here we choose the “edit distance” function (technically called the Damerau-
Levenshtein Distance)2 which simply counts the number of these operations required to transform
x into y.3 The one wrinkle is that a transposition is actually three operations: one swap, followed by
two subs, for a total cost of up to 3 (a sub that leaves a letter unchanged is free), while any single
sub or indel costs 1.
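As a minimal sketch of this cost function in Python, the snippet below charges 1 for an indel, 1 for a sub unless the letters already match, and 1 plus its two follow-up subs for a swap. The names `INDEL_COST`, `sub_cost` and `swap_cost` are illustrative and not from the lecture, and treating a matching sub as free is an assumption based on the no-op convention these notes adopt for c(sub).

```python
INDEL_COST = 1  # any single insertion or deletion


def sub_cost(a: str, b: str) -> int:
    """Cost of substituting letter a with letter b (0 if they already match)."""
    return 0 if a == b else 1


def swap_cost(x_prev: str, x_cur: str, y_prev: str, y_cur: str) -> int:
    """Cost of transposing x_prev, x_cur and substituting them into y_prev, y_cur."""
    # One swap, followed by two subs; each sub is free when the letters match.
    return 1 + sub_cost(x_cur, y_prev) + sub_cost(x_prev, y_cur)


print(swap_cost("e", "h", "h", "e"))  # "teh" -> "the": prints 1
print(swap_cost("S", "T", "A", "P"))  # no letters match: prints 3
```

Under this reading, a swap only beats two plain subs when both transposed letters land on matching letters in y.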
1.2 An example
To illustrate how to compute the cost of a particular alignment, consider aligning the two strings
x = THEIR and y = THERE.
Alignment 1 : Substitute the last two characters, for a total cost of 2 sub operations:
THEIR
|||ss
THERE
Alignment 2 : Insert and delete so that the R lines up, for a total cost of 2 indel operations:
THEIR-
|||d|i
THE-RE
where “-” denotes a “gap” character, implying an insertion on the opposing string.
Alignment 3 : At worst, delete the entire first string, and insert the entire second string, for a total
cost of 10 indel operations:
THEIR-----
dddddiiiii
-----THERE
Clearly, the first two alignments are cheaper than the third alignment, and under the edit-distance
cost function, either of those would be an acceptable alignment.
1
Generalizations exist that allow letters to be transposed more than one, or to allow longer substrings to be
transposed, but these algorithms are more complicated.
2
Supposedly, these types of “edits” represent a large fraction, possibly 80% or more, of all human misspellings,
with the remaining presumably being confusion over which word to use in the first place, e.g., “their” versus “they’re”.
3
Other cost structures are certainly possible, depending on the application. For instance, a transposition might be
less costly than an insertion, etc. Furthermore, cost may depend on the letters being changed, perhaps reflecting the
probability of the error. For instance, adjacent letters on a QWERTY keyboard may have lower costs for substitution
or transposition than letters far apart.
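The costs of the three alignments of THEIR and THERE can be checked mechanically. The Python sketch below assumes the annotation letters used in the diagrams (`|` for a match, `s` for a sub, `d` and `i` for indels) and omits swaps, since none of the three alignments uses one. The helper name `alignment_cost` is illustrative.

```python
def alignment_cost(top: str, ops: str, bottom: str) -> int:
    """Sum the edit-distance cost of an alignment written as three equal-length rows."""
    assert len(top) == len(ops) == len(bottom)
    cost = 0
    for a, op, b in zip(top, ops, bottom):
        if op == "|":        # exact match: free
            assert a == b
        elif op == "s":      # substitution between two real letters
            assert a != b and "-" not in (a, b)
            cost += 1
        elif op == "d":      # deletion from x: gap in the bottom row
            assert b == "-"
            cost += 1
        elif op == "i":      # insertion into x: gap in the top row
            assert a == "-"
            cost += 1
        else:
            raise ValueError(f"unknown operation {op!r}")
    return cost


print(alignment_cost("THEIR", "|||ss", "THERE"))                 # prints 2
print(alignment_cost("THEIR-", "|||d|i", "THE-RE"))              # prints 2
print(alignment_cost("THEIR-----", "dddddiiiii", "-----THERE"))  # prints 10
```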
A general requirement for dynamic programming is that there cannot be a cycle among subproblem
dependencies, such that solving some problem A requires eventually solving some B that requires
solving A. Thus, dynamic programming can be applied only if the space of subproblems can be
organized into a directed acyclic graph (a “DAG”), in which each subproblem is a vertex and an
arc i → j represents that solving j requires solving i first.
There are only three ways we could have gotten to needing to align xi and yj :
• the last op was sub, and we paid the cost of aligning x1...i−1 and y1...j−1 ,
• the last op was indel, and we paid the cost of aligning either x1...i and y1...j−1 or aligning
x1...i−1 and y1...j , or
• the last op was swap, and we paid the cost of aligning x1...i−2 and y1...j−2 .
Let cost(i, j) be the minimum cost of aligning x1...i and y1...j , where we define as a base case
cost(0, 0) = 0.
4
There are additional requirements for dynamic programming to produce a polynomial-time algorithm: the number
of subproblems must be polynomial in size and the recursive function must run in polynomial time.
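Thus, the recursive structure of the subproblems we identified above implies that cost(i, j) may be
computed recursively as
\[
\mathrm{cost}(i, j) = \min
\begin{cases}
\mathrm{cost}(i-2,\, j-2) + c(\mathrm{swap}) \\
\mathrm{cost}(i-1,\, j-1) + c(\mathrm{sub}) \\
\mathrm{cost}(i-1,\, j) + c(\mathrm{indel}) \\
\mathrm{cost}(i,\, j-1) + c(\mathrm{indel})
\end{cases}
\]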
where we define c(sub) = 0 if xi = yj , i.e., a “no-op.” This function is equivalent to this DAG
template:
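[Figure: DAG template. Vertex (i, j) has incoming arcs from (i − 2, j − 2), labeled swap; from (i − 1, j − 1), labeled sub; and from (i − 1, j) and (i, j − 1), each labeled indel.]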
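The recurrence translates almost line for line into a memoized recursive function. The following Python sketch is one possible reading, not the lecture's own code: it assumes unit costs for sub and indel, treats a matching sub as a no-op, and charges a swap as 1 plus its two follow-up subs. The name `edit_cost` is illustrative.

```python
from functools import lru_cache


def edit_cost(x: str, y: str) -> int:
    """Minimum alignment cost of x and y under the four-way recurrence."""

    @lru_cache(maxsize=None)
    def cost(i: int, j: int) -> int:
        if i == 0 and j == 0:
            return 0                   # base case: two empty prefixes
        options = []
        if i >= 2 and j >= 2:          # swap: one transposition plus two subs
            c_swap = 1 + (x[i - 1] != y[j - 2]) + (x[i - 2] != y[j - 1])
            options.append(cost(i - 2, j - 2) + c_swap)
        if i >= 1 and j >= 1:          # sub (a no-op when the letters match)
            c_sub = 0 if x[i - 1] == y[j - 1] else 1
            options.append(cost(i - 1, j - 1) + c_sub)
        if i >= 1:                     # indel: delete x_i
            options.append(cost(i - 1, j) + 1)
        if j >= 1:                     # indel: insert y_j
            options.append(cost(i, j - 1) + 1)
        return min(options)

    return cost(len(x), len(y))


print(edit_cost("STEP", "APE"))     # prints 3
print(edit_cost("THEIR", "THERE"))  # prints 2
print(edit_cost("teh", "the"))      # prints 1
```

Memoization ensures each pair (i, j) is solved once, which is what the DAG view of the subproblems requires.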
We begin by writing out the cost matrix5 S, and filling in the base case for aligning two empty
strings, which has cost(0, 0) = 0.
We may now immediately fill in the values for the 0th column and 0th row, which correspond to
the cost of aligning an empty string with x (column 0) or with y (row 0). In each of these cases,
the alignment consists of inserting each character in the target string into the empty string, and
thus the costs in the 0th row are S(0, j) = j for 1 ≤ j ≤ ny , and the costs in the 0th column are
S(i, 0) = i for 1 ≤ i ≤ nx .
5
For convenience, we will assume this matrix is 0-indexed, meaning that the first element in a row or a column is
the 0th element.
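The base case and the 0th row and column can be set up as follows. This is a small Python sketch, assuming x = STEP and y = APE as in the worked example and using a list of lists for S; `None` is a placeholder chosen here for entries not yet computed.

```python
x, y = "STEP", "APE"
nx, ny = len(x), len(y)

# S has (nx + 1) rows and (ny + 1) columns; None marks entries not yet computed.
S = [[None] * (ny + 1) for _ in range(nx + 1)]
S[0][0] = 0                  # aligning two empty strings
for j in range(1, ny + 1):
    S[0][j] = j              # 0th row: insert the first j letters of y
for i in range(1, nx + 1):
    S[i][0] = i              # 0th column: delete the first i letters of x

for row in S:
    print(row)
```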
At the next step, we set i = 1 and j = 1 and align x′ = S with y′ = A. There are three subproblems
to consider (the fourth subproblem, corresponding to swap, isn’t allowed yet):
The minimum of these choices is uniquely the first one, and thus we record S(1, 1) = 1.
Next we consider i = 1 and j = {2, 3}, in which we align S with {AP, APE}. Although we could
write down the three subproblems for each of these, we may also simply recognize that S appears in
neither of these strings, and thus the minimum cost for each alignment will be the cost of deleting
S and inserting y1...j for j = 2, 3. Thus, we may record S(1, j) = j for j = {2, 3}.
The same fact is true for i = {2, 3, 4} and j = 1, in which we align {ST, STE, STEP} with A. Thus,
we may record S(i, 1) = i for i = {2, 3, 4}. What now remains is to align the remaining cases of
substrings. We will treat each of the 6 cases, one at a time.
Set i, j = 2 and align ST with AP. There are four subproblems to consider:
Thus, we record S(2, 3) = 3, which represents the cost of either of these subalignments:
S-T ST-
sis ssi
APE APE
Now we set i = 3 and j = 2, in which we align STE with AP. Again, there are four subproblems to
consider:
• (Delete) Previously STE → APE . Now delete P (from x). Cost = S(3, 3) + 1 = 3
Thus, we record S(4, 3) = 3, which gives the final minimum cost for aligning STEP with APE, via
any of these alignments:
STEP STEP STEP
ss|d dstt sdtt
APE- -APE A-PE
To extract the 3 minimum-cost alignments given above, we examine the sequences of choices we
made to arrive at S(4, 3) = 3. Specifically, there are three paths from S(0, 0) that all reach S(4, 3),
and each of these paths corresponds to a minimum-cost alignment. Left- or down- moves represent
indel operations, single-diagonal moves are a sub, and double-diagonal moves are a swap.
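The path-following idea can be sketched in Python by filling S and then walking backward from the bottom-right corner along every move that attains the recorded minimum. The function names (`moves`, `fill`, `optimal_alignments`) and the two-letter code `tt` for a swap are choices made for this sketch, and the swap is costed as 1 plus its two follow-up subs, which is an assumption about the intended cost function.

```python
def moves(S, x, y, i, j):
    """Yield (cost, prev_i, prev_j, ops) for every edit that can end at cell (i, j)."""
    if i >= 1 and j >= 1:                      # single-diagonal move: sub or no-op
        same = x[i - 1] == y[j - 1]
        yield S[i - 1][j - 1] + (0 if same else 1), i - 1, j - 1, "|" if same else "s"
    if i >= 1:                                 # delete x_i
        yield S[i - 1][j] + 1, i - 1, j, "d"
    if j >= 1:                                 # insert y_j
        yield S[i][j - 1] + 1, i, j - 1, "i"
    if i >= 2 and j >= 2:                      # double-diagonal move: swap
        c_swap = 1 + (x[i - 1] != y[j - 2]) + (x[i - 2] != y[j - 1])
        yield S[i - 2][j - 2] + c_swap, i - 2, j - 2, "tt"


def fill(x, y):
    """Fill the cost matrix row by row, so every predecessor is ready when needed."""
    S = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i in range(len(x) + 1):
        for j in range(len(y) + 1):
            if i or j:
                S[i][j] = min(c for c, _, _, _ in moves(S, x, y, i, j))
    return S


def optimal_alignments(x, y):
    """Return the minimum cost and every operation string that achieves it."""
    S = fill(x, y)

    def walk(i, j):
        if i == 0 and j == 0:
            yield ""
            return
        for c, pi, pj, ops in moves(S, x, y, i, j):
            if c == S[i][j]:                   # this move lies on an optimal path
                for prefix in walk(pi, pj):
                    yield prefix + ops

    return S[len(x)][len(y)], sorted(walk(len(x), len(y)))


print(optimal_alignments("STEP", "APE"))  # prints (3, ['dstt', 'sdtt', 'ss|d'])
```

Enumerating every optimal path can take exponential time in general; storing a single parent pointer per cell recovers just one alignment.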
The overall minimum cost of 6 is in the bottom-right corner of S. Note, however, that our cost
matrix does not contain the corresponding alignment. Given the completed matrix, we may extract
6
This example is taken from Dasgupta, Papadimitriou and Vazirani’s excellent book Algorithms (2006).
• Transpose: First, we swap xi−1 and xi , and we then substitute them for yj−1 and yj respectively.
These three edits together cost c(swap), by definition.
The remaining cost is from aligning x1...i−2 with y1...j−2 , whose minimum cost is cost(i − 2, j − 2).
Therefore, the minimum cost ending with a swap is cost(i − 2, j − 2) + c(swap).
• Substitute: We substitute the value at xi for the value at yj . This costs c(sub) by definition.
The remaining cost is from aligning x1...i−1 with y1...j−1 , whose minimum cost is cost(i − 1, j − 1).
Therefore, the minimum cost ending with a sub is cost(i − 1, j − 1) + c(sub).
• Delete in x and Insert in y: We add a gap character after yj to match xi . This costs c(indel)
by definition.
The remaining cost is from aligning x1...i−1 with y1...j , whose minimum cost is cost(i − 1, j).
Therefore, the minimum cost ending with this indel is cost(i − 1, j) + c(indel).
• Insert in x and Delete in y: We add a gap character after xi to match yj . This costs c(indel)
by definition.
The remaining cost is from aligning x1...i with y1...j−1 , whose minimum cost is cost(i, j − 1).
Therefore, the minimum cost ending with this indel is cost(i, j − 1) + c(indel).
Because cost(i, j) is defined as the minimum cost over the four possibilities, and because these are
the only paths to aligning substrings i, j, the recursion relation must give the minimal cost for
aligning i, j.
S[0,0] = 0
p[0,0] = NULL
for i = 0 to nx                      // consider all letters of x
    for j = 0 to ny                  // consider all letters of y
        if i>0 or j>0                // skip the base case
            S[i,j] = cost(i,j)       // minimum cost up to xi and yj
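One way to make this pseudocode runnable is sketched below in Python. It is an illustration under stated assumptions rather than the lecture's implementation: cost(i, j) is inlined as a minimum over table lookups, p stores a single parent cell and operation per entry (ties go to the first option listed), and a swap costs 1 plus its two follow-up subs. The function name `align` and the operation codes are illustrative.

```python
def align(x: str, y: str):
    """Fill S and p bottom-up, then follow p back to recover one optimal alignment."""
    nx, ny = len(x), len(y)
    S = [[0] * (ny + 1) for _ in range(nx + 1)]     # S[0][0] = 0 is the base case
    p = [[None] * (ny + 1) for _ in range(nx + 1)]  # p[0][0] = None: no predecessor
    for i in range(nx + 1):                         # consider all prefixes of x
        for j in range(ny + 1):                     # consider all prefixes of y
            if i == 0 and j == 0:                   # skip the base case
                continue
            options = []                            # (cost, previous cell, operation)
            if i >= 2 and j >= 2:
                c_swap = 1 + (x[i - 1] != y[j - 2]) + (x[i - 2] != y[j - 1])
                options.append((S[i - 2][j - 2] + c_swap, (i - 2, j - 2), "tt"))
            if i >= 1 and j >= 1:
                same = x[i - 1] == y[j - 1]
                c_sub = 0 if same else 1
                options.append((S[i - 1][j - 1] + c_sub, (i - 1, j - 1), "|" if same else "s"))
            if i >= 1:
                options.append((S[i - 1][j] + 1, (i - 1, j), "d"))
            if j >= 1:
                options.append((S[i][j - 1] + 1, (i, j - 1), "i"))
            best = min(options, key=lambda t: t[0])  # first minimal option wins ties
            S[i][j] = best[0]
            p[i][j] = best[1:]
    # Follow parent pointers back from the bottom-right corner.
    ops, cell = [], (nx, ny)
    while p[cell[0]][cell[1]] is not None:
        cell, op = p[cell[0]][cell[1]]
        ops.append(op)
    return S[nx][ny], "".join(reversed(ops))


print(align("STEP", "APE"))  # prints (3, 'dstt')
```

Each table entry is computed from at most four earlier entries, so the inner work is constant per cell.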
Assuming that each call to cost(i, j) takes constant time, and we carry out (nx + 1) × (ny + 1) − 1 =
O(nx ny) = O(n²) of them, then the running time is O(n²). (Note that x and y are treated
symmetrically, and we may simply adopt the convention of naming the longer length to be n.) The
space requirement is given by the size of S and p, which are also O(n²).
There are more space-efficient versions of this algorithm. For instance, notice that cost(i, j) only
ever refers to elements at most two rows up or two columns left of the current problem parameters.
Thus, we may calculate the final solution by only storing three rows of S. (Do you see why we need
three entire rows, rather than a 3 × 3 submatrix with S(i, j) as the bottom-right element?) Now,
the space requirement is only O(n), but we must also give up the matrix p which means we lose the
record of the optimal alignment. In 1975, Hirschberg gave a clever divide-and-conquer algorithm
that solves both problems.
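The three-row idea can be sketched in Python as follows. This version returns only the minimum cost, not the alignment, and is not Hirschberg's algorithm. The name `edit_cost_linear_space` is illustrative, and the cost convention (unit sub and indel, swap as 1 plus two subs) is the same assumption used for the full matrix.

```python
def edit_cost_linear_space(x: str, y: str) -> int:
    """Minimum alignment cost using only rows i - 2, i - 1 and i of the matrix."""
    ny = len(y)
    prev2 = None                   # row i - 2 (does not exist yet)
    prev1 = list(range(ny + 1))    # row 0: S(0, j) = j
    for i in range(1, len(x) + 1):
        cur = [i] + [0] * ny       # S(i, 0) = i
        for j in range(1, ny + 1):
            best = min(
                prev1[j - 1] + (x[i - 1] != y[j - 1]),  # sub or no-op
                prev1[j] + 1,                           # delete x_i
                cur[j - 1] + 1,                         # insert y_j
            )
            if i >= 2 and j >= 2:                       # swap reads row i - 2
                c_swap = 1 + (x[i - 1] != y[j - 2]) + (x[i - 2] != y[j - 1])
                best = min(best, prev2[j - 2] + c_swap)
            cur[j] = best
        prev2, prev1 = prev1, cur  # slide the three-row window down
    return prev1[ny]


print(edit_cost_linear_space("STEP", "APE"))     # prints 3
print(edit_cost_linear_space("THEIR", "THERE"))  # prints 2
```

Swapping the roles of x and y so that the shorter string indexes the columns would bound the stored rows by the shorter length.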
2 On your own
1. Read Chapter 15