Edit Dist

Minimum Cost Edit Distance aims to find the minimum cost way to edit a source string into a target string. It does this by calculating the edit distance recursively using a matrix where each cell represents the cost to edit the source to the target up to that point. The algorithm can handle variable edit costs such as different costs for insertions, deletions, and substitutions. It has applications in areas like spelling correction, text correction, and measuring differences between strings.

Uploaded by

AJAY ANEJA

Minimum Cost Edit Distance

•  Edit a source string into a target string


•  Each edit has a cost
•  Find the minimum cost edit(s)
•  Example (source: crest, target: actress):
   crest → insert(a) → acrest → insert(t) → actrest → delete(t) → actres → insert(s) → actress
•  Minimum cost edit distance can be accomplished in multiple ways
•  Only 4 ways to edit source to target for this pair
Minimum Cost Edit Distance
[Figure: the same chain of strings from crest up to actress, with actress labeled as the target and crest as the source. The minimum cost edit distance can be accomplished in multiple ways; only 4 ways to edit source to target for this pair.]
Levenshtein Distance
•  Cost is fixed across characters
–  Insertion cost is 1
–  Deletion cost is 1
•  Two different costs for substitutions
–  Substitution cost is 1 (transformation)
–  Substitution cost is 2 (one deletion + one insertion)
•  Named after Vladimir Levenshtein (Владимир Левенштейн)
•  What’s the edit distance?
Minimum Cost Edit Distance
•  An alignment between target and source

Find D(n,m) recursively

Function MinEditDistance(target, source)
    n = length(target)
    m = length(source)
    Create matrix D of size (n+1, m+1)
    D[0,0] = 0

    for i = 1 to n
        D[i,0] = D[i-1,0] + insert-cost

    for j = 1 to m
        D[0,j] = D[0,j-1] + delete-cost

    for i = 1 to n
        for j = 1 to m
            D[i,j] = MIN(D[i-1,j] + insert-cost,
                         D[i-1,j-1] + subst/eq-cost,
                         D[i,j-1] + delete-cost)
    return D[n,m]
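A minimal Python sketch of the pseudocode above. The function name and the default costs (insert 1, delete 1, substitute 2, match 0) are assumptions chosen for illustration; the slides leave the costs as parameters.

```python
def min_edit_distance(target, source, ins_cost=1, del_cost=1, sub_cost=2):
    """Return D[n][m], the minimum cost of editing source into target."""
    n, m = len(target), len(source)
    # D[i][j] holds the cost of editing source[:j] into target[:i].
    D = [[0] * (m + 1) for _ in range(n + 1)]

    for i in range(1, n + 1):          # build target[:i] from the empty string
        D[i][0] = D[i - 1][0] + ins_cost
    for j in range(1, m + 1):          # erase source[:j] down to the empty string
        D[0][j] = D[0][j - 1] + del_cost

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # subst/eq-cost: zero when the two characters already match
            diag = 0 if target[i - 1] == source[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + ins_cost,
                          D[i - 1][j - 1] + diag,
                          D[i][j - 1] + del_cost)
    return D[n][m]


print(min_edit_distance("gamble", "gumbo"))              # substitution cost 2
print(min_edit_distance("gamble", "gumbo", sub_cost=1))  # substitution cost 1
```

Passing `sub_cost=1` or `sub_cost=2` switches between the two substitution conventions listed on the Levenshtein Distance slide.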
Consider two strings: target = g1 a2 m3 b4 l5 e6
source = g1 u2 m3 b4 o5
•  We want to find D(6,5)
•  We find this recursively using values of D(i,j) where i≤6, j≤5
•  For example, consider how to compute D(4,3), where target = g1 a2 m3 b4 and source = g1 u2 m3
•  Case 1: SUBSTITUTE b4 for m3
–  Use previously stored value for D(3,2)
–  Cost(g1a2m3b4 and g1u2m3) = D(3,2) + cost(b≈m)
–  For substitution: D(i,j) = D(i-1,j-1) + cost(subst)
•  Case 2: INSERT b4
–  Use previously stored value for D(3,3)
–  Cost(g1a2m3b4 and g1u2m3) = D(3,3) + cost(ins b)
–  For insertion: D(i,j) = D(i-1,j) + cost(ins)
•  Case 3: DELETE m3
–  Use previously stored value for D(4,2)
–  Cost(g1a2m3b4 and g1u2m3) = D(4,2) + cost(del m)
–  For deletion: D(i,j) = D(i,j-1) + cost(del)
•  Cells involved (i across, j down):
D(3,2)  D(4,2)
D(3,3)  D(4,3)
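The three cases combine into a single recurrence. As a worked instance, assume insertion and deletion cost 1 and substitution costs 2 (an assumption, though it is the setting that reproduces the gamble/gumbo table). The table then gives D(3,3) = 2, D(3,2) = 3 and D(4,2) = 4, so:

```latex
D(i,j) = \min
\begin{cases}
D(i-1,j) + \mathrm{cost}(\mathrm{ins}) \\
D(i-1,j-1) + \mathrm{cost}(\mathrm{subst}) \\
D(i,j-1) + \mathrm{cost}(\mathrm{del})
\end{cases}
\qquad
D(4,3) = \min(2 + 1,\ 3 + 2,\ 4 + 1) = 3
```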
                     target
             g    a    m    b    l    e
        0    1    2    3    4    5    6
   g    1    0e   1    2    3    4    5
   u    2    1    2s   3    4    5    6
   m    3    2    3    2e   3    4    5
   b    4    3    4    3    2e   3i   4
   o    5    4    5    4    3    4    5s

(rows: source; columns: target; marks on the minimum cost path: s = substitution, e = equal, i = insertion)
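The s, e and i marks in the table trace one minimum cost path. The sketch below fills the same matrix and then walks back from D(n,m) to recover such a path. The costs (insert 1, delete 1, substitute 2) and the tie-breaking order (diagonal first, then insertion, then deletion) are assumptions; other tie-breaking orders can return a different path of equal cost.

```python
def edit_table(target, source, ins_cost=1, del_cost=1, sub_cost=2):
    """Fill the matrix D, indexed D[i][j] with i over target and j over source."""
    n, m = len(target), len(source)
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + ins_cost
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + del_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = 0 if target[i - 1] == source[j - 1] else sub_cost
            D[i][j] = min(D[i - 1][j] + ins_cost,
                          D[i - 1][j - 1] + diag,
                          D[i][j - 1] + del_cost)
    return D


def backtrace(target, source, ins_cost=1, del_cost=1, sub_cost=2):
    """Return (cost, ops); each op is (kind, source_char, target_char)."""
    D = edit_table(target, source, ins_cost, del_cost, sub_cost)
    i, j = len(target), len(source)
    ops = []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            diag = 0 if target[i - 1] == source[j - 1] else sub_cost
            if D[i][j] == D[i - 1][j - 1] + diag:
                # "e" = equal characters, "s" = substitution
                ops.append(("e" if diag == 0 else "s", source[j - 1], target[i - 1]))
                i, j = i - 1, j - 1
                continue
        if i > 0 and D[i][j] == D[i - 1][j] + ins_cost:
            ops.append(("i", "", target[i - 1]))   # insert a target character
            i -= 1
        else:
            ops.append(("d", source[j - 1], ""))   # delete a source character
            j -= 1
    ops.reverse()
    return D[len(target)][len(source)], ops


cost, ops = backtrace("gamble", "gumbo")
print(cost, [kind for kind, _, _ in ops])
```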
Edit Distance and FSTs
•  Algorithm using a Finite-state transducer:
–  construct a finite-state transducer with all possible ways
to transduce source into target
–  We do this transduction one char at a time
–  A transition x:x gets zero cost and a transition on ε:x
(insertion) or x:ε (deletion) for any char x gets cost 1
–  Finding minimum cost edit distance == Finding the
shortest path from start state to final state
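The shortest-path view can be sketched without an FST library. The code below does not build or compose transducers. It walks the state space that composing SOURCE, EDITS and TARGET would produce, where a state (j, i) means j source characters consumed and i target characters produced, and runs Dijkstra's algorithm over it. The function name and the state encoding are assumptions for illustration.

```python
import heapq


def fst_shortest_path(source, target, match_cost=0, ins_cost=1, del_cost=1):
    """Cost of the cheapest path from state (0, 0) to (len(source), len(target))."""
    start, final = (0, 0), (len(source), len(target))
    dist = {start: 0}
    heap = [(0, start)]
    while heap:
        d, (j, i) = heapq.heappop(heap)
        if (j, i) == final:
            return d
        if d > dist.get((j, i), float("inf")):
            continue  # stale queue entry
        arcs = []
        if j < len(source):
            arcs.append(((j + 1, i), del_cost))        # x:epsilon (deletion)
        if i < len(target):
            arcs.append(((j, i + 1), ins_cost))        # epsilon:x (insertion)
        if j < len(source) and i < len(target) and source[j] == target[i]:
            arcs.append(((j + 1, i + 1), match_cost))  # x:x (zero cost)
        for state, weight in arcs:
            nd = d + weight
            if nd < dist.get(state, float("inf")):
                dist[state] = nd
                heapq.heappush(heap, (nd, state))
    return None


print(fst_shortest_path("gumbo", "gamble"))
```

Because there is no x:y arc for differing characters, a substitution has to be spelled as one deletion plus one insertion, which corresponds to the substitution-cost-2 convention.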

Edit Distance and FSTs
•  Let's assume we want to edit the source string 1010
into the target string 1110
•  The alphabet is just 1 and 0

SOURCE: 0 --1:1--> 1 --0:0--> 2 --1:1--> 3 --0:0--> 4

TARGET: 0 --1:1--> 1 --1:1--> 2 --1:1--> 3 --0:0--> 4
Edit Distance and FSTs
•  Construct a FST that allows strings to be edited
EDITS: a single state 0 with self-loops 1:1, 0:0, 1:<epsilon>, 0:<epsilon>, <epsilon>:1, <epsilon>:0
Edit Distance and FSTs
•  Compose SOURCE and EDITS and TARGET

[Figure: the composed FST, with states 0 to 24 and arcs labeled 1:1, 0:0, 1:<epsilon>, 0:<epsilon>, <epsilon>:1 and <epsilon>:0]
Edit Distance and FSTs
•  The shortest path is the minimum edit FST from
SOURCE (1010) to TARGET (1110)

[Figure: a linear FST over states 6 5 4 3 2 1 0 with arcs 1:1, 0:<epsilon>, 1:1, 0:<epsilon>, <epsilon>:1, <epsilon>:0]
Edit distance
•  Useful in many NLP applications
•  In some cases, we need edits with multiple
characters, e.g. 2 chars deleted for one cost
•  Comparing system output with human output, e.g.
input: ibm output: IBM vs. Ibm (TrueCasing of speech
recognition output)
•  Error correction
•  Defined over character edits or word edits, e.g. MT
evaluation:
–  Foreign investment in Jiangsu ‘s agriculture on the increase
–  Foreign investment in Jiangsu agricultural investment increased
[Figure: pronunciation dialect map of the Netherlands based on phonetic edit distance (W. Heeringa, PhD thesis, 2004)]
Variable Cost Edit Distance
•  So far, we have seen edit distance with uniform insert/
delete cost
•  In different applications, we might want different insert/
delete costs for different items
•  For example, consider the simple application of spelling
correction
•  Users typing on a qwerty keyboard will make certain
errors more frequently than others
•  So we can consider insert/delete costs in terms of a
probability that a certain alignment occurs between the
correct word and the typo word
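A sketch of edit distance with per-character costs, assuming the costs are supplied as functions. The cost values and the QWERTY adjacency rule below are hypothetical, made up only to show the mechanism. In a probability-based setup, one common choice is to use negative log probabilities as costs.

```python
def weighted_edit_distance(target, source, ins_cost, del_cost, sub_cost):
    """Edit source into target; each cost is a function of the characters involved."""
    n, m = len(target), len(source)
    D = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + ins_cost(target[i - 1])
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + del_cost(source[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i][j] = min(D[i - 1][j] + ins_cost(target[i - 1]),
                          D[i - 1][j - 1] + sub_cost(source[j - 1], target[i - 1]),
                          D[i][j - 1] + del_cost(source[j - 1]))
    return D[n][m]


# Hypothetical cost model: hitting a neighbouring key on the same row is cheap.
QWERTY_ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")


def adjacent(a, b):
    """True if a and b sit next to each other on the same keyboard row."""
    for row in QWERTY_ROWS:
        if a in row and b in row:
            return abs(row.index(a) - row.index(b)) == 1
    return False


def qwerty_sub_cost(typed, intended):
    if typed == intended:
        return 0.0
    return 0.5 if adjacent(typed, intended) else 1.0  # illustrative values only


def unit(_char):
    return 1.0


print(weighted_edit_distance("the", "thw", unit, unit, qwerty_sub_cost))  # w is next to e
print(weighted_edit_distance("the", "thp", unit, unit, qwerty_sub_cost))  # p is not
```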
Spelling Correction
•  Types of spelling correction
–  non-word error detection
e.g. hte for the
–  isolated word error detection
e.g. acres vs. access (cannot decide if it is the right
word for the context)
–  context-dependent error detection (real-word errors)
e.g. she is a talented acres vs. she is a talented actress
•  For simplicity, we will consider the case with exactly 1 error
Noisy Channel Model
Source → (original input) → Noisy Channel → (noisy observation) → Decoder
Decoder computes P(original input | noisy obs)
Bayes Rule: computing P(orig | noisy)
•  let x = original input, y = noisy observation

Bayes Rule
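In the notation above, with x the original input and y the noisy observation, the standard statement of Bayes' rule is:

```latex
P(x \mid y) = \frac{P(y \mid x)\, P(x)}{P(y)}
```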

Chain Rule

Approximations: Bias vs. Variance


less bias

less variance
Single Error Spelling Correction
•  Insertion (addition)
–  acress vs. cress
•  Deletion
–  acress vs. actress
•  Substitution
–  acress vs. access
•  Transposition (reversal)
–  acress vs. caress
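The four single-error types can be sketched as a candidate generator: undo each possible error in the typo and keep the results that are vocabulary words. The function name and the tiny vocabulary are assumptions for illustration. Labels name the error that produced the typo, following the list above, so removing a letter from the typo corresponds to an insertion error.

```python
import string


def single_error_candidates(typo, vocabulary, alphabet=string.ascii_lowercase):
    """Map each vocabulary word one edit away from typo to the error type."""
    candidates = {}
    for k in range(len(typo) + 1):
        left, right = typo[:k], typo[k:]
        proposals = []
        if right:
            # the typo has an extra letter: remove it
            proposals.append((left + right[1:], "insertion"))
            # the typo has a wrong letter: replace it
            proposals += [(left + ch + right[1:], "substitution")
                          for ch in alphabet if ch != right[0]]
        if len(right) > 1 and right[0] != right[1]:
            # the typo has two adjacent letters swapped: swap them back
            proposals.append((left + right[1] + right[0] + right[2:], "transposition"))
        # the typo is missing a letter: put one back
        proposals += [(left + ch + right, "deletion") for ch in alphabet]
        for word, kind in proposals:
            if word in vocabulary and word != typo:
                candidates.setdefault(word, kind)
    return candidates


vocabulary = {"cress", "actress", "access", "caress", "acres", "the"}
for word, kind in sorted(single_error_candidates("acress", vocabulary).items()):
    print(word, kind)
```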
Noisy Channel Model for Spelling Correction
(Kernighan, Church and Gale, 1990)

•  t is the word with a single typo and c is the correct word
•  Bayes Rule
•  Find the best candidate for the correct word
•  C is all the words in the vocabulary; |C| = N
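Written out with t, c and C as defined here, the usual noisy channel decision rule is shown below. P(t) is dropped in the last step because it is the same for every candidate c.

```latex
\hat{c} = \operatorname*{argmax}_{c \in C} P(c \mid t)
        = \operatorname*{argmax}_{c \in C} \frac{P(t \mid c)\, P(c)}{P(t)}
        = \operatorname*{argmax}_{c \in C} P(t \mid c)\, P(c)
```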


Noisy Channel Model for Spelling Correction
(Kernighan, Church and Gale, 1990)
single error, condition on previous letter
•  t = poton, c = potion
–  del[t,i] = 427
–  chars[t,i] = 575
–  P(poton | potion) = .7426
•  t = poton, c = piton
–  sub[o,i] = 568
–  chars[i] = 1406
–  P(poton | piton) = .4039
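The two probabilities above are ratios of an error count to a context count. The sketch below redoes that arithmetic with the counts from the slide. The dictionary layout and function names are assumptions, not the layout used by Kernighan, Church and Gale.

```python
# Counts copied from the slide; keys and table names are illustrative.
del_counts = {("t", "i"): 427}     # "i" deleted after "t"
pair_counts = {("t", "i"): 575}    # occurrences of the pair "ti"
sub_counts = {("o", "i"): 568}     # "o" typed where "i" was intended
char_counts = {"i": 1406}          # occurrences of "i"


def p_deletion(prev_char, deleted_char):
    """Estimate P(typo | correct) for a single deletion after prev_char."""
    key = (prev_char, deleted_char)
    return del_counts[key] / pair_counts[key]


def p_substitution(typed_char, intended_char):
    """Estimate P(typo | correct) for a single substitution."""
    return sub_counts[(typed_char, intended_char)] / char_counts[intended_char]


print(f"P(poton | potion) = {p_deletion('t', 'i'):.4f}")
print(f"P(poton | piton)  = {p_substitution('o', 'i'):.4f}")
```

These are channel probabilities only. Ranking the candidates also needs the prior P(c) for each word, which the slide does not give.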
Noisy Channel model for Spelling Correction

•  The del, ins, sub, rev matrix values need data that contains known errors (training data)
e.g. Birkbeck spelling error corpus (from 1984!)
•  Accuracy on single errors on unseen data
(test data)

Noisy Channel model for Spelling Correction

•  Easily extended to multiple spelling errors in a word using the edit distance algorithm (however, using learned costs for ins, del, replace)
•  Experiments: 87% accuracy for machine vs. 98%
average human accuracy
•  What are the limitations of this model?
… was called a “stellar and versatile acress whose
combination of sass and glamour has defined her

KCG model best guess is acres