Lecture 4
Minimum Edit Distance
Definition of Minimum Edit Distance
How similar are two strings?
• Spell correction
  • The user typed “graffe”. Which is closest?
  • graf
  • graft
  • grail
  • giraffe
• Computational Biology
  • Align two sequences of nucleotides:
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
  • Resulting alignment:
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
• Given two sequences, align each letter to a letter or gap
Other uses of Edit Distance in NLP
Minimum Edit as Search
• But the space of all edit sequences is huge!
• We can’t afford to navigate naïvely
• Lots of distinct paths wind up at the same state.
• We don’t have to keep track of all of them
• Just the shortest path to each of those revisited states.
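The idea of keeping only the cheapest path to each revisited state can be sketched as a memoized recursion. This is a minimal illustration, not code from the lecture: the name `min_edit_search` is hypothetical, a state is taken to be a pair of prefix lengths (i, j), and the costs (1 for insertion and deletion, 2 for substitution) are assumptions chosen for the example.

```python
from functools import lru_cache


def min_edit_search(x, y, sub_cost=2):
    """Cost of the cheapest edit sequence turning x into y (assumed costs)."""

    @lru_cache(maxsize=None)  # each state (i, j) is solved once and then reused
    def best(i, j):
        # best(i, j) is the edit distance between x[:i] and y[:j].
        if i == 0:
            return j  # only insertions remain
        if j == 0:
            return i  # only deletions remain
        diag = 0 if x[i - 1] == y[j - 1] else sub_cost
        return min(
            best(i - 1, j) + 1,         # delete x[i-1]
            best(i, j - 1) + 1,         # insert y[j-1]
            best(i - 1, j - 1) + diag,  # substitute, or copy if equal
        )

    return best(len(x), len(y))


for candidate in ["graf", "graft", "grail", "giraffe"]:
    print(candidate, min_edit_search("graffe", candidate))
```

Without the cache, the same (i, j) state would be recomputed along every path that reaches it; with it, the work is bounded by the number of distinct states.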
Defining Min Edit Distance
• For two strings
• X of length n
• Y of length m
• We define D(i,j)
• the edit distance between X[1..i] and Y[1..j]
• i.e., the first i characters of X and the first j characters of Y
• The edit distance between X and Y is thus D(n,m)
Minimum Edit Distance
Computing Minimum Edit Distance
Dynamic Programming for Minimum Edit Distance
• Dynamic programming: A tabular computation of D(n,m)
• Solving problems by combining solutions to subproblems.
• Bottom-up
• We compute D(i,j) for small i,j
• And compute larger D(i,j) based on previously computed smaller values
• i.e., compute D(i,j) for all i (0 ≤ i ≤ n) and j (0 ≤ j ≤ m)
Defining Min Edit Distance (Levenshtein)
• Initialization:
$$D(i,0) = i, \qquad D(0,j) = j$$
• Recurrence relation: for each i = 1…n and each j = 1…m,
$$D(i,j) = \min \begin{cases} D(i-1,j) + 1 \\ D(i,j-1) + 1 \\ D(i-1,j-1) + \begin{cases} 2 & \text{if } X(i) \neq Y(j) \\ 0 & \text{if } X(i) = Y(j) \end{cases} \end{cases}$$
• Termination:
D(n,m) is the distance
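The initialization, recurrence, and termination steps above can be sketched as a bottom-up table fill. This is a minimal sketch rather than a reference implementation: the function name `min_edit_table` is hypothetical, and Python strings are 0-indexed, so X(i) on the slide corresponds to `x[i - 1]` in the code.

```python
def min_edit_table(x, y, sub_cost=2):
    """Return the table D, where D[i][j] is the distance between x[:i] and y[:j]."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]

    # Initialization: D(i,0) = i and D(0,j) = j.
    for i in range(1, n + 1):
        D[i][0] = i
    for j in range(1, m + 1):
        D[0][j] = j

    # Recurrence: each cell depends only on its three already-filled neighbours.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = 0 if x[i - 1] == y[j - 1] else sub_cost
            D[i][j] = min(
                D[i - 1][j] + 1,         # deletion
                D[i][j - 1] + 1,         # insertion
                D[i - 1][j - 1] + diag,  # substitution or match
            )
    return D


D = min_edit_table("intention", "execution")
print(D[-1][-1])  # termination: D(n,m) is the distance, here 8
```

Returning the whole table rather than only D(n,m) makes it easy to compare intermediate cells against a hand-filled table.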
The Edit Distance Table
N  9
O  8
I  7
T  6
N  5
E  4
T  3
N  2
I  1
#  0  1  2  3  4  5  6  7  8  9
   #  E  X  E  C  U  T  I  O  N
The Edit Distance Table
N  9  8  9 10 11 12 11 10  9  8
O  8  7  8  9 10 11 10  9  8  9
I  7  6  7  8  9 10  9  8  9 10
T  6  5  6  7  8  9  8  9 10 11
N  5  4  5  6  7  8  9 10 11 10
E  4  3  4  5  6  7  8  9 10  9
T  3  4  5  6  7  8  7  8  9  8
N  2  3  4  5  6  7  8  7  8  7
I  1  2  3  4  5  6  7  6  7  8
#  0  1  2  3  4  5  6  7  8  9
   #  E  X  E  C  U  T  I  O  N
MinEdit with Backtrace
Adding Backtrace to Minimum Edit Distance
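One way to add a backtrace is to store, for every cell, which neighbour produced its minimum, and then follow those pointers from D(n,m) back to D(0,0). The sketch below is an assumption about how this could be done, not the lecture's own code: the name `align_with_backtrace`, the pointer labels, the tie-breaking order (diagonal first), and the substitution cost of 2 are all choices made for the example.

```python
def align_with_backtrace(x, y, sub_cost=2):
    """Return (distance, aligned_x, aligned_y), using '-' for gaps."""
    n, m = len(x), len(y)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0], ptr[i][0] = i, "up"      # deletions down the first column
    for j in range(1, m + 1):
        D[0][j], ptr[0][j] = j, "left"    # insertions along the first row

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = D[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else sub_cost)
            up = D[i - 1][j] + 1
            left = D[i][j - 1] + 1
            D[i][j] = min(diag, up, left)
            # Remember where the minimum came from (ties prefer the diagonal).
            if D[i][j] == diag:
                ptr[i][j] = "diag"
            elif D[i][j] == up:
                ptr[i][j] = "up"
            else:
                ptr[i][j] = "left"

    # Follow the pointers from the last cell back to the origin.
    top, bottom = [], []
    i, j = n, m
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "diag":
            top.append(x[i - 1]); bottom.append(y[j - 1]); i -= 1; j -= 1
        elif move == "up":
            top.append(x[i - 1]); bottom.append("-"); i -= 1
        else:
            top.append("-"); bottom.append(y[j - 1]); j -= 1
    return D[n][m], "".join(reversed(top)), "".join(reversed(bottom))


dist, top, bottom = align_with_backtrace("intention", "execution")
print(dist)
print(top)
print(bottom)
```

Several alignments can share the minimum cost, so a different tie-breaking order may print a different but equally cheap alignment.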
AGGCTATCACCTGACCTCCAGGCCGATGCCC
TAGCTATCACGACCGCGGTCGATTTGCCCGAC
-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC
Why sequence alignment?
• Comparing genes or regions from different species
• to find important regions
• determine function
• uncover evolutionary forces
• Assembling fragments to sequence DNA
• Comparing individuals to look for mutations
Alignments in two fields
• In Natural Language Processing
• We generally talk about distance (minimized)
• And weights
• In Computational Biology
• We generally talk about similarity (maximized)
• And scores
Longest Common Subsequence
x = ABCBDAB and y = BDCABA
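A longest common subsequence for the two example strings can be computed with the same tabular approach used for edit distance. The sketch below is illustrative only: the function name `lcs` and the tie-breaking in the backtrace are assumptions, and when several subsequences share the maximum length the code returns just one of them.

```python
def lcs(x, y):
    """Return one longest common subsequence of x and y."""
    n, m = len(x), len(y)
    # L[i][j] is the LCS length of x[:i] and y[:j].
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if x[i - 1] == y[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])

    # Walk back from the last cell, collecting matched characters.
    out = []
    i, j = n, m
    while i > 0 and j > 0:
        if x[i - 1] == y[j - 1]:
            out.append(x[i - 1])
            i -= 1
            j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return "".join(reversed(out))


result = lcs("ABCBDAB", "BDCABA")
print(result, len(result))  # the length is 4 for this pair
```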
Approaches to Pairwise Sequence Alignment
• Dot Matrix
• Dynamic Programming
Dot Matrix
• Simpler approach
• Visual alignment
• The two sequences to be matched are laid out as axes on a grid
• Rows: characters of the first sequence
• Columns: characters of the second sequence
Procedure
• A dot is placed wherever there is an exact match.
• Any region of similarity is revealed by a diagonal run of dots.
• It can be easy to visually identify certain sequence features such as insertions, deletions, and repeats.
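Dot Plots
Seq A: CTTAACT    Seq B: CGGATCAT
(* = exact match, . = no match)
  C G G A T C A T
C * . . . . * . .
T . . . . * . . *
T . . . . * . . *
A . . . * . . * .
A . . . * . . * .
C * . . . . * . .
T . . . . * . . *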
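The dot-matrix procedure is simple enough to sketch in a few lines. In this illustrative version the function name `dot_plot` and the plotting symbols (`*` for a match, `.` for an empty cell) are assumptions; rows follow Seq A and columns follow Seq B, as on the slide.

```python
def dot_plot(seq_a, seq_b, dot="*", empty="."):
    """Return the dot plot as text lines: rows from seq_a, columns from seq_b."""
    lines = ["  " + " ".join(seq_b)]  # column header
    for a in seq_a:
        # Place a dot wherever the row character equals the column character.
        cells = [dot if a == b else empty for b in seq_b]
        lines.append(a + " " + " ".join(cells))
    return lines


plot = dot_plot("CTTAACT", "CGGATCAT")
for line in plot:
    print(line)
```

Diagonal runs of `*` in the printed grid correspond to stretches where the two sequences agree.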
Dynamic Programming
Three steps:
• Initialization
• Matrix Fill (scoring)
• Traceback (alignment)
Initialization Step
For each row i, $S_{i,0}$ is set to $w \cdot i$
For each column j, $S_{0,j}$ is set to $w \cdot j$
(here the gap penalty is w = -4)
    -   G   A   A   T   T   C   A   G   T   T   A
-   0  -4  -8 -12 -16 -20 -24 -28 -32 -36 -40 -44
G  -4
G  -8
A -12
T -16
C -20
G -24
A -28
Matrix Fill Step
$$S_{i,j} = \max \begin{cases} S_{i-1,j-1} + s(a_i, b_j) & \text{(match/mismatch, diagonal)} \\ S_{i,j-1} + w & \text{(gap in sequence \#1)} \\ S_{i-1,j} + w & \text{(gap in sequence \#2)} \end{cases}$$
    -   G   A   A   T   T   C   A   G   T   T   A
-   0  -4  -8 -12 -16 -20 -24 -28 -32 -36 -40 -44
G  -4   5
G  -8
A -12
T -16
C -20
G -24
A -28
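The initialization and matrix fill steps can be sketched as follows. The scoring values (match +5, mismatch -3, gap w = -4) are read off the slide's tables and verification arithmetic; the function name `nw_fill` and the convention that `a` labels the rows and `b` the columns are assumptions made for this example.

```python
def nw_fill(a, b, match=5, mismatch=-3, gap=-4):
    """Fill the global-alignment score matrix S (rows: a, columns: b)."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]

    # Initialization: first column and first row are multiples of the gap penalty.
    for i in range(1, m + 1):
        S[i][0] = gap * i
    for j in range(1, n + 1):
        S[0][j] = gap * j

    # Matrix fill: best of diagonal (match/mismatch), left (gap), up (gap).
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(
                S[i - 1][j - 1] + s,
                S[i][j - 1] + gap,
                S[i - 1][j] + gap,
            )
    return S


S = nw_fill("GGATCGA", "GAATTCAGTTA")
print(S[1][1], S[1][2])  # first two filled cells: 5 1
print(S[-1][-1])         # score of the best global alignment: 11
```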
Matrix Fill Step – 2nd Cell
    -   G   A   A   T   T   C   A   G   T   T   A
-   0  -4  -8 -12 -16 -20 -24 -28 -32 -36 -40 -44
G  -4   5   1
G  -8
A -12
T -16
C -20
G -24
A -28
Completion of First Row
[Figure: scoring matrix with the first row filled in]
Matrix Fill
[Figure: scoring matrix with further rows filled in]
Filled Matrix
Traceback
The alignment with the maximum score is recovered by tracing back from position $S_{m,n}$
GAATTCAGTTA
| | || |  |
GGA-TC-G--A
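The traceback can be sketched by refilling the matrix and then walking from the bottom-right cell back to the origin, at each step choosing a neighbour that explains the current score. This is a minimal sketch under assumptions: the name `needleman_wunsch`, the tie-breaking order (diagonal, then left, then up), and the scoring values (+5, -3, -4, taken from the slide's arithmetic) are illustrative, and other equally good alignments may exist.

```python
def needleman_wunsch(a, b, match=5, mismatch=-3, gap=-4):
    """Return (score, aligned_b, aligned_a) for a global alignment."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        S[i][0] = gap * i
    for j in range(1, n + 1):
        S[0][j] = gap * j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(S[i - 1][j - 1] + s, S[i][j - 1] + gap, S[i - 1][j] + gap)

    # Traceback: start at S[m][n] and stop at S[0][0].
    top, bottom = [], []  # top: column sequence b, bottom: row sequence a
    i, j = m, n
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if a[i - 1] == b[j - 1] else mismatch
            if S[i][j] == S[i - 1][j - 1] + s:      # came from the diagonal
                top.append(b[j - 1]); bottom.append(a[i - 1])
                i -= 1; j -= 1
                continue
        if j > 0 and S[i][j] == S[i][j - 1] + gap:  # came from the left
            top.append(b[j - 1]); bottom.append("-"); j -= 1
        else:                                       # came from above
            top.append("-"); bottom.append(a[i - 1]); i -= 1
    return S[m][n], "".join(reversed(top)), "".join(reversed(bottom))


score, top, bottom = needleman_wunsch("GGATCGA", "GAATTCAGTTA")
print(score)
print(top)
print(bottom)
```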
Verification
The resulting global alignment is as follows (match +5, mismatch -3, gap -4):
GAATTCAGTTA
| | || |  |
GGA-TC-G--A
+-+-++-+--+
53545545445
5 - 3 + 5 - 4 + 5 + 5 - 4 + 5 - 4 - 4 + 5 = 11
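The verification arithmetic can be reproduced by scoring the alignment column by column. The helper name `column_scores` is hypothetical; the scoring values are the ones used in the sum above.

```python
def column_scores(top, bottom, match=5, mismatch=-3, gap=-4):
    """Score each column of an alignment written with '-' for gaps."""
    if len(top) != len(bottom):
        raise ValueError("aligned strings must have equal length")
    scores = []
    for p, q in zip(top, bottom):
        if p == "-" or q == "-":
            scores.append(gap)       # a letter aligned to a gap
        elif p == q:
            scores.append(match)     # identical letters
        else:
            scores.append(mismatch)  # different letters
    return scores


scores = column_scores("GAATTCAGTTA", "GGA-TC-G--A")
print(scores)
print(sum(scores))  # 11
```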
Local Alignment
Negative scores for mismatches
First row and first column initialized to 0:
    -   G   A   A   T   T   C   A   G   T   T   A
-   0   0   0   0   0   0   0   0   0   0   0   0
G   0
G   0
A   0
T   0
C   0
G   0
A   0
Matrix Fill Step
    -   G   A   A   T   T   C   A   G   T   T   A
-   0   0   0   0   0   0   0   0   0   0   0   0
G   0   5
G   0
A   0
T   0
C   0
G   0
A   0
Matrix Fill Step
    -   G   A   A   T   T   C   A   G   T   T   A
-   0   0   0   0   0   0   0   0   0   0   0   0
G   0   5   1
G   0
A   0
T   0
C   0
G   0
A   0
Completing the Remaining Cells
    -   G   A   A   T   T   C   A   G   T   T   A
-   0   0   0   0   0   0   0   0   0   0   0   0
G   0   5   1   0
G   0
A   0
T   0
C   0
G   0
A   0
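The local-alignment fill differs from the global one in two ways visible in the tables: the first row and column start at 0, and no cell drops below 0 (the third cell of the first row is 0, not -3). The sketch below follows the usual Smith-Waterman formulation, which is an assumption about the method behind the slides; the name `sw_fill` and the scoring values (+5, -3, -4) are likewise illustrative.

```python
def sw_fill(a, b, match=5, mismatch=-3, gap=-4):
    """Fill a local-alignment score matrix; return (S, best_score, best_cell)."""
    m, n = len(a), len(b)
    S = [[0] * (n + 1) for _ in range(m + 1)]  # first row and column stay 0
    best, best_cell = 0, (0, 0)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            S[i][j] = max(
                0,                    # a negative score restarts the alignment
                S[i - 1][j - 1] + s,
                S[i][j - 1] + gap,
                S[i - 1][j] + gap,
            )
            if S[i][j] > best:
                best, best_cell = S[i][j], (i, j)
    return S, best, best_cell


S, best, best_cell = sw_fill("GGATCGA", "GAATTCAGTTA")
print(S[1][:4])  # [0, 5, 1, 0], as in the table above
print(best)      # highest local score: 14
```

Traceback for a local alignment starts at a highest-scoring cell and stops at the first cell whose score is 0, rather than running from corner to corner.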
Filled Matrix
Traceback
C-A
| |
CGA
Verification
GAATTC-A
| || | |
GGAT-CGA
+-++-+-+
53554545
5 - 3 + 5 + 5 - 4 + 5 - 4 + 5 = 14
GAATTC-A
| | || |
GGA-TCGA
+-+-++-+
53545545
5 - 3 + 5 - 4 + 5 + 5 - 4 + 5 = 14