Edit Distance
[Figure: edit example with the strings actrest, acrest, and crest. Caption: Only 4 ways to edit source to target for this pair]
Levenshtein Distance
• Cost is fixed across characters
– Insertion cost is 1
– Deletion cost is 1
• Two different costs for substitutions
– Substitution cost is 1 (transformation)
– Substitution cost is 2 (one deletion + one
insertion)
What’s the edit distance?
[Photo: Vladimir Levenshtein (Левенштейн Владимир)]
Minimum Cost Edit Distance
• An alignment between target and source
Function MinEditDistance (target, source)
  n = length(target)
  m = length(source)
  Create matrix D of size (n+1, m+1)
  D[0,0] = 0
  for i = 1 to n
    D[i,0] = D[i-1,0] + insert-cost
  for j = 1 to m
    D[0,j] = D[0,j-1] + delete-cost
  for i = 1 to n
    for j = 1 to m
      D[i,j] = MIN(D[i-1,j] + insert-cost,
                   D[i-1,j-1] + subst/eq-cost,
                   D[i,j-1] + delete-cost)
  return D[n,m]
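The pseudocode above can be sketched in Python roughly as follows. The function name, the keyword arguments, and the default substitution cost of 2 are assumptions made for illustration; the slides allow a substitution cost of either 1 or 2.

```python
def min_edit_distance(target: str, source: str,
                      ins_cost: int = 1, del_cost: int = 1,
                      sub_cost: int = 2) -> int:
    """Minimum cost of editing `source` into `target`."""
    n, m = len(target), len(source)
    # D[i][j] = cost of editing the first j source chars
    # into the first i target chars.
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + ins_cost
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + del_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            same = target[i - 1] == source[j - 1]
            D[i][j] = min(D[i - 1][j] + ins_cost,
                          D[i - 1][j - 1] + (0 if same else sub_cost),
                          D[i][j - 1] + del_cost)
    return D[n][m]


print(min_edit_distance("gamble", "gumbo"))              # substitution cost 2 -> 5
print(min_edit_distance("gamble", "gumbo", sub_cost=1))  # substitution cost 1 -> 3
```

The two calls show how the choice between the two substitution costs changes the result for the same pair of strings.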
Consider two strings: target = g₁a₂m₃b₄l₅e₆, source = g₁u₂m₃b₄o₅
• We want to find D(6,5)
• We find this recursively using values of D(i,j) where i≤6, j≤5
• For example, consider how to compute D(4,3), where the target prefix is g₁a₂m₃b₄ and the source prefix is g₁u₂m₃
• Case 1: SUBSTITUTE b₄ for m₃
  – Use previously stored value for D(3,2)
  – Cost(g₁a₂m₃b₄ and g₁u₂m₃) = D(3,2) + cost(b≈m)
  – For substitution: D(i,j) = D(i-1,j-1) + cost(subst)
• Case 2: INSERT b₄
  – Use previously stored value for D(3,3)
  – Cost(g₁a₂m₃b₄ and g₁u₂m₃) = D(3,3) + cost(ins b)
  – For insertion: D(i,j) = D(i-1,j) + cost(ins)
• Case 3: DELETE m₃
  – Use previously stored value for D(4,2)
  – Cost(g₁a₂m₃b₄ and g₁u₂m₃) = D(4,2) + cost(del m)
  – For deletion: D(i,j) = D(i,j-1) + cost(del)
• The three stored neighbours of D(4,3) in the matrix:
    D(3,2)  D(4,2)
    D(3,3)  D(4,3)
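Edit distance matrix (target across the top, source down the side; the letters mark the minimum-cost path: e = equal, s = substitute, i = insert)

             g    a    m    b    l    e
        0    1    2    3    4    5    6
   g    1    0e   1    2    3    4    5
   u    2    1    2s   3    4    5    6
   m    3    2    3    2e   3    4    5
   b    4    3    4    3    2e   3i   4
   o    5    4    5    4    3    4    5s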
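The path marked in the matrix can be recovered by walking back from the final cell to D[0,0]. The sketch below is one way to do that, assuming insertion and deletion cost 1, substitution cost 2, and a tie-breaking rule (prefer the diagonal move) that is a choice made here for illustration. The function name `align` is hypothetical.

```python
def align(target, source, ins_cost=1, del_cost=1, sub_cost=2):
    """Return (cost, ops), with ops drawn from e, s, i, d."""
    n, m = len(target), len(source)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + ins_cost
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + del_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = 0 if target[i - 1] == source[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + ins_cost,
                          D[i - 1][j - 1] + diag,
                          D[i][j - 1] + del_cost)

    # Walk back from D[n][m] to D[0][0], preferring the diagonal on ties.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            diag = 0 if target[i - 1] == source[j - 1] else sub_cost
            if D[i][j] == D[i - 1][j - 1] + diag:
                ops.append("e" if diag == 0 else "s")
                i, j = i - 1, j - 1
                continue
        if i > 0 and D[i][j] == D[i - 1][j] + ins_cost:
            ops.append("i")   # target char inserted
            i -= 1
        else:
            ops.append("d")   # source char deleted
            j -= 1
    return D[n][m], ops[::-1]


print(align("gamble", "gumbo"))  # (5, ['e', 's', 'e', 'e', 'i', 's'])
```

Other minimum-cost paths may exist; a different tie-breaking rule can return a different operation sequence with the same total cost.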
Edit Distance and FSTs
• Algorithm using a Finite-state transducer:
– construct a finite-state transducer with all possible ways
to transduce source into target
– We do this transduction one char at a time
– A transition x:x gets zero cost and a transition on ε:x
(insertion) or x:ε (deletion) for any char x gets cost 1
– Finding minimum cost edit distance == Finding the
shortest path from start state to final state
• Let’s assume we want to edit the source string 1010
into the target string 1110
• The alphabet is just 1 and 0
SOURCE: (0) --1:1--> (1) --0:0--> (2) --1:1--> (3) --0:0--> (4)
• Construct a FST that allows strings to be edited
EDITS: a single state (0) with self-loop transitions
  1:1, 0:0, 1:<epsilon>, 0:<epsilon>, <epsilon>:1, <epsilon>:0
• Compose SOURCE and EDITS and TARGET
[Figure: the composed FST, with states 0 to 24 and transitions labeled 1:1, 0:0, 1:<epsilon>, 0:<epsilon>, <epsilon>:1, <epsilon>:0]
• The shortest path through the composed FST is the minimum-cost edit from
SOURCE (1010) to TARGET (1110)
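As a rough illustration of the shortest-path view, the sketch below runs Dijkstra's algorithm over states (i, j), where i source characters have been consumed and j target characters produced. This state encoding and the function name are assumptions for illustration, not the construction used to draw the figure. As in the EDITS machine, there are only x:x arcs (cost 0) and insertion or deletion arcs (cost 1).

```python
import heapq


def fst_shortest_path(source, target):
    """Cost of the cheapest path from state (0, 0) to the final state."""
    start, final = (0, 0), (len(source), len(target))
    best = {start: 0}
    heap = [(0, start)]
    while heap:
        cost, (i, j) = heapq.heappop(heap)
        if (i, j) == final:
            return cost
        if cost > best[(i, j)]:
            continue  # stale queue entry
        arcs = []
        if i < len(source):
            arcs.append((1, (i + 1, j)))      # x:<epsilon>, deletion
        if j < len(target):
            arcs.append((1, (i, j + 1)))      # <epsilon>:x, insertion
        if i < len(source) and j < len(target) and source[i] == target[j]:
            arcs.append((0, (i + 1, j + 1)))  # x:x, no cost
        for weight, nxt in arcs:
            if cost + weight < best.get(nxt, float("inf")):
                best[nxt] = cost + weight
                heapq.heappush(heap, (cost + weight, nxt))
    return None


print(fst_shortest_path("1010", "1110"))  # 2: one deletion plus one insertion
```

For two strings of length 4 this search space has 5 × 5 = 25 states, and because there is no direct substitution arc, changing one symbol costs 2.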
Edit distance
• Useful in many NLP applications
• In some cases, we need edits with multiple
characters, e.g. 2 chars deleted for one cost
• Comparing system output with human output, e.g.
input: ibm output: IBM vs. Ibm (TrueCasing of speech
recognition output)
• Error correction
• Defined over character edits or word edits, e.g. MT
evaluation:
– Foreign investment in Jiangsu ‘s agriculture on the increase
– Foreign investment in Jiangsu agricultural investment increased
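The same dynamic program works over words as well as characters. The sketch below is a minimal version that accepts any sequences and uses a substitution cost of 1 by default; the function name, the whitespace tokenization, and the ASCII apostrophe are assumptions made for illustration.

```python
def edit_distance(target, source, sub_cost=1):
    """Edit distance over any sequences (strings or lists of words)."""
    prev = list(range(len(source) + 1))  # row for the empty target prefix
    for i, t in enumerate(target, start=1):
        curr = [i]
        for j, s in enumerate(source, start=1):
            curr.append(min(prev[j] + 1,        # insert t
                            curr[j - 1] + 1,    # delete s
                            prev[j - 1] + (0 if t == s else sub_cost)))
        prev = curr
    return prev[-1]


reference = "Foreign investment in Jiangsu 's agriculture on the increase".split()
system = "Foreign investment in Jiangsu agricultural investment increased".split()

print(edit_distance(reference, system))  # word edits
print(edit_distance("IBM", "Ibm"))       # character edits
```

Only two rows of the matrix are kept at a time, which is enough when just the distance, and not the alignment, is needed.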
[Figure: Pronunciation dialect map of the Netherlands based on phonetic edit-distance (W. Heeringa, PhD thesis, 2004)]
Variable Cost Edit Distance
• So far, we have seen edit distance with uniform insert/delete costs
• In different applications, we might want different insert/delete costs for different items
• For example, consider the simple application of spelling correction
• Users typing on a QWERTY keyboard will make certain errors more frequently than others
• So we can define insert/delete costs in terms of the probability that a certain alignment occurs between the correct word and the typo word
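One way to sketch variable costs is to pass cost functions into the dynamic program instead of constants. Everything specific below is hypothetical: the function names, the small set of "adjacent" key pairs, and the numeric costs are made up for illustration. In a probabilistic setting the costs could instead be derived from estimated error probabilities, for example as negative log probabilities.

```python
def weighted_edit_distance(target, source, ins_cost, del_cost, sub_cost):
    """Edit distance where each cost is a function of the characters involved."""
    n, m = len(target), len(source)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + ins_cost(target[i - 1])
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + del_cost(source[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            t, s = target[i - 1], source[j - 1]
            D[i][j] = min(D[i - 1][j] + ins_cost(t),
                          D[i - 1][j - 1] + sub_cost(s, t),
                          D[i][j - 1] + del_cost(s))
    return D[n][m]


# Hypothetical costs: a few keys that sit next to each other on a
# QWERTY keyboard are treated as cheap to confuse.
ADJACENT = {frozenset("io"), frozenset("er"), frozenset("as")}


def keyboard_sub_cost(s, t):
    if s == t:
        return 0.0
    return 0.5 if frozenset((s, t)) in ADJACENT else 2.0


def unit_cost(ch):
    return 1.0


print(weighted_edit_distance("piton", "poton", unit_cost, unit_cost, keyboard_sub_cost))
print(weighted_edit_distance("piton", "pxton", unit_cost, unit_cost, keyboard_sub_cost))
```

With these made-up costs, the typo poton is closer to piton than pxton is, because o and i are treated as neighbouring keys.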
Spelling Correction
• Types of spelling correction
– non-word error detection
  e.g. hte for the
– isolated word error detection
  e.g. acres vs. access (cannot decide if it is the right word for the context)
– context-dependent error detection (real-word errors)
  e.g. she is a talented acres vs. she is a talented actress
• For simplicity, we will consider the case with exactly 1 error
Noisy Channel Model
Source → original input → Noisy Channel → noisy observation
Bayes Rule
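The equation from this slide did not survive extraction. The standard noisy channel decision rule, written with symbols assumed here (c for the original input, t for the noisy observation, matching the c and t used in the spelling example later on), is:

```latex
\hat{c} = \arg\max_{c} P(c \mid t)
        = \arg\max_{c} \frac{P(t \mid c)\, P(c)}{P(t)}
        = \arg\max_{c} P(t \mid c)\, P(c)
```

The denominator P(t) is the same for every candidate c, so it can be dropped from the maximization.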
Chain Rule
less variance
Single Error Spelling Correction
• Insertion (addition)
– acress vs. cress
• Deletion
– acress vs. actress
• Substitution
– acress vs. access
• Transposition (reversal)
– acress vs. caress
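The four error types suggest a simple way to propose corrections for a typo: undo each kind of single error at every position. The sketch below assumes a lowercase ASCII alphabet and a hypothetical function name. Each key names the error assumed to have produced the typo, so undoing an "insertion" means removing a character.

```python
import string


def single_edit_candidates(typo, alphabet=string.ascii_lowercase):
    """Strings that turn into `typo` through exactly one error of each type."""
    splits = [(typo[:k], typo[k:]) for k in range(len(typo) + 1)]
    return {
        # The typo has an extra character: remove one.
        "insertion": {a + b[1:] for a, b in splits if b},
        # The typo is missing a character: add one.
        "deletion": {a + ch + b for a, b in splits for ch in alphabet},
        # The typo has a wrong character: replace one.
        "substitution": {a + ch + b[1:]
                         for a, b in splits if b
                         for ch in alphabet if ch != b[0]},
        # The typo has two adjacent characters swapped: swap them back.
        "transposition": {a + b[1] + b[0] + b[2:]
                          for a, b in splits if len(b) > 1},
    }


candidates = single_edit_candidates("acress")
print("cress" in candidates["insertion"])       # True
print("actress" in candidates["deletion"])      # True
print("access" in candidates["substitution"])   # True
print("caress" in candidates["transposition"])  # True
```

In practice the candidate sets would be filtered against a dictionary, since most generated strings are not words.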
Noisy Channel Model for Spelling Correction
(Kernighan, Church and Gale, 1990)
t = poton (typo), c = piton (correction)
P(poton | piton) = sub[o,i] / chars[i] = 568 / 1406 ≈ .4039
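The channel probability above is a ratio of two counts, which can be sketched as below. The dictionary layout and the function name are assumptions for illustration; the two counts are the ones quoted on the slide.

```python
# sub[(typed, intended)]: how often `typed` was typed in place of `intended`.
sub = {("o", "i"): 568}
# chars[intended]: how often `intended` occurs overall.
chars = {"i": 1406}


def p_substitution(typed, intended):
    """Estimate P(typo | correction) for a single substitution error."""
    return sub[(typed, intended)] / chars[intended]


p = p_substitution("o", "i")
print(f"P(poton | piton) = {p:.6f}")
```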