Today's Lecture: String Matching Algorithm Naïve / Brute Force RK
String matching Algorithm
The Knuth-Morris-Pratt Algorithm
Text = abcxabcyabcdabcrabcdabcz
Pattern= abcdabcz
Components of KMP algorithm
The prefix function, Π
The prefix function Π for a pattern encapsulates
knowledge about how the pattern matches against shifts
of itself. This information can be used to avoid useless
shifts of the pattern ‘p’. In other words, it lets the matcher
avoid backtracking on the string ‘S’.
The prefix function, Π
The following pseudocode computes the prefix function, Π:
Compute-Prefix-Function (p)
1 m ← length[p] // ‘p’ is the pattern to be matched
2 Π[1] ← 0
3 k ← 0
4 for q ← 2 to m {
5     while k > 0 and p[k+1] != p[q] {
6         do k ← Π[k] }
7     if p[k+1] = p[q] {
8         then k ← k + 1 }
9     Π[q] ← k
10 } end for
11 return Π
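The pseudocode above can be sketched in Python as follows. This is a minimal sketch, not the lecture's own code: the function name `compute_prefix_function` and the list name `pi` are illustrative choices, and the indices are 0-based, so `pi[q]` here corresponds to Π[q+1] on the slides.

```python
def compute_prefix_function(p: str) -> list[int]:
    """Return pi, where pi[q] is the length of the longest proper prefix
    of p[:q + 1] that is also a suffix of it (0-based version of Π)."""
    m = len(p)
    pi = [0] * m
    k = 0  # length of the prefix matched so far
    for q in range(1, m):
        # Fall back to shorter candidate prefixes until one can be
        # extended by p[q], or until nothing is left (k == 0).
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    return pi


print(compute_prefix_function("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```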
Example: compute Π for the pattern ‘p’ below:
q 1 2 3 4 5 6 7
p a b a b a c a
Initially: m = length[p] = 7
Π[1] = 0
k = 0
Step 1: q = 2, k = 0
Π[2] = 0
Π 0 0
Step 2: q = 3, k = 0
Π[3] = 1
Π 0 0 1
Step 3: q = 4, k = 1
Π[4] = 2
Π 0 0 1 2
Step 4: q = 5, k = 2
Π[5] = 3
Π 0 0 1 2 3
Step 5: q = 6, k = 3
Π[6] = 0
Π 0 0 1 2 3 0
Step 6: q = 7, k = 0
Π[7] = 1
Π 0 0 1 2 3 0 1
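As a cross-check of the table just computed, Π can also be obtained directly from its definition: Π[q] is the length of the longest proper prefix of p[1..q] that is also a suffix of p[1..q]. The sketch below is a slow brute-force version meant only for checking small examples; the function name `prefix_function_by_definition` is a hypothetical helper, not part of the lecture.

```python
def prefix_function_by_definition(p: str) -> list[int]:
    """Brute-force Π: for each prefix of p, find the longest proper
    prefix that is also a suffix. Far slower than the KMP routine."""
    pi = []
    for q in range(1, len(p) + 1):
        prefix = p[:q]
        best = 0
        # Try the longest candidate first; k < q keeps the prefix proper.
        for k in range(q - 1, 0, -1):
            if prefix[:k] == prefix[-k:]:
                best = k
                break
        pi.append(best)
    return pi


print(prefix_function_by_definition("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```

For example, at q = 5 the prefix is "ababa", whose longest proper prefix that is also a suffix is "aba", giving Π[5] = 3.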
Note: KMP finds every occurrence of ‘p’ in ‘S’. That is why KMP does not
terminate at step 12; instead, it searches the remainder of ‘S’ for any more
occurrences of ‘p’.
Illustration: given a String ‘S’ and pattern ‘p’ as follows:
S b a c b a b a b a b a c a a b
p a b a b a c a
Let us execute the KMP algorithm to find
whether ‘p’ occurs in ‘S’.
For ‘p’ the prefix function, Π was computed previously and is as follows:
q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0 1 2 3 0 1
Initially: n = size of S = 15; m = size of p = 7
q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0 1 2 3 0 1
Step 1: i = 1, k = 0
comparing p[1] with S[1]
S b a c b a b a b a b a c a a b
p a b a b a c a
p[1] does not match S[1].
Step 2: i = 2, k = 0
comparing p[1] with S[2]
S b a c b a b a b a b a c a a b
p a b a b a c a
p[1] matches S[2].
Step 3: i = 3, k = 1
Comparing p[2] with S[3]: p[2] does not match S[3]
S b a c b a b a b a b a c a a b
p a b a b a c a
Backtracking on p: k = Π[1] = 0, so p[1] is compared with S[3] next, and it does not match either
Step 4: i = 4, k = 0
Comparing p[1] with S[4]: p[1] does not match S[4]
S b a c b a b a b a b a c a a b
p a b a b a c a
Step 5: i = 5, k = 0
Comparing p[1] with S[5]: p[1] matches S[5]
S b a c b a b a b a b a c a a b
p a b a b a c a
q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0 1 2 3 0 1
Step 6: i = 6, k = 1
Comparing p[2] with S[6]: p[2] matches S[6]
S b a c b a b a b a b a c a a b
p a b a b a c a
Step 7: i = 7, k = 2
Comparing p[3] with S[7]: p[3] matches S[7]
S b a c b a b a b a b a c a a b
p a b a b a c a
Step 8: i = 8, k = 3
Comparing p[4] with S[8]: p[4] matches S[8]
S b a c b a b a b a b a c a a b
p a b a b a c a
q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0 1 2 3 0 1
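Step 9: i = 9, k = 4
Comparing p[5] with S[9]: p[5] matches S[9]
S b a c b a b a b a b a c a a b
p a b a b a c a
Step 10: i = 10, k = 5
Comparing p[6] with S[10]: p[6] does not match S[10]
S b a c b a b a b a b a c a a b
p a b a b a c a
Backtracking on p: comparing p[4] with S[10], because after the mismatch k = Π[5] = 3; p[4] matches S[10]
Step 11: i = 11, k = 4
Comparing p[5] with S[11]: p[5] matches S[11]
q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0 1 2 3 0 1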
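Step 12: i = 12, k = 5
Comparing p[6] with S[12]: p[6] matches S[12]
S b a c b a b a b a b a c a a b
p a b a b a c a
Step 13: i = 13, k = 6
Comparing p[7] with S[13]: p[7] matches S[13]
S b a c b a b a b a b a c a a b
p a b a b a c a
Now k = 7 = m, so the pattern occurs with shift i - m = 13 - 7 = 6.
q 1 2 3 4 5 6 7
p a b a b a c a
Π 0 0 1 2 3 0 1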
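The walk-through above can be reproduced with a short Python sketch of the matcher. This is a minimal sketch under stated assumptions: the names `compute_prefix_function` and `kmp_search` are illustrative, indices are 0-based, and the returned shift is the 0-based start of the match, which equals the slides' value i - m for 1-based i. The text used is the 15-character string S from the step-by-step illustration.

```python
def compute_prefix_function(p: str) -> list[int]:
    """0-based prefix function: pi[q] is the length of the longest proper
    prefix of p[:q + 1] that is also a suffix of it."""
    pi = [0] * len(p)
    k = 0
    for q in range(1, len(p)):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    return pi


def kmp_search(s: str, p: str) -> list[int]:
    """Return every shift (0-based start index) at which p occurs in s."""
    if not p:
        return []  # assumption: an empty pattern reports no matches
    pi = compute_prefix_function(p)
    shifts = []
    q = 0  # number of pattern characters matched so far
    for i, ch in enumerate(s):
        # On a mismatch, fall back on the pattern only; i never moves back.
        while q > 0 and p[q] != ch:
            q = pi[q - 1]
        if p[q] == ch:
            q += 1
        if q == len(p):
            shifts.append(i - len(p) + 1)
            q = pi[q - 1]  # keep scanning for further occurrences
    return shifts


print(kmp_search("bacbabababacaab", "ababaca"))  # [6]
```

The loop variable `i` only ever advances, which is the "no backtracking on S" property described earlier; all fallback happens through `pi` on the pattern side.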
Home Task Part 1
String
a a a a a b a a a a b y
Pattern
a a a a b
Home Task Part 2
Update the KMP algorithm so that it
also finds overlapping occurrences
of the pattern.
Example:
Text: ababazi
Pattern: aba
Old KMP:
Found aba
Updated KMP:
Found aba at positions 1 and 3
Home task example
String
a b c x a b c d a b x a b c d a b c a a b c y
Pattern
a b c d a b c a
Running-time analysis: O(m + n), which is O(n) when m ≤ n

Compute-Prefix-Function (p)
1 m ← length[p] // ‘p’ is the pattern to be matched
2 Π[1] ← 0
3 k ← 0
4 for q ← 2 to m
5     do while k > 0 and p[k+1] != p[q]
6         do k ← Π[k]
7     if p[k+1] = p[q]
8         then k ← k + 1
9     Π[q] ← k
10 return Π

In the above pseudocode for computing the prefix function, the for loop
from step 4 to step 9 runs m - 1 times. Steps 1 to 3 take constant time.
The while loop executes at most O(m) times in total, because it is bounded
by the total increase in k over all iterations of the for loop. Hence the
running time of the prefix-function computation is O(m).

KMP-Matcher (S, p)
1 n ← length[S]
2 m ← length[p]
3 Π ← Compute-Prefix-Function(p)
4 q ← 0
5 for i ← 1 to n
6     do while q > 0 and p[q+1] != S[i]
7         do q ← Π[q]
8     if p[q+1] = S[i]
9         then q ← q + 1
10     if q = m
11         then print “Pattern occurs with shift” i - m
12             q ← Π[q]

The for loop beginning in step 5 runs ‘n’ times, i.e., as long as the
length of the string ‘S’. Steps 1, 2 and 4 take constant time and step 3
takes O(m), so the running time of the matching phase is dominated by this
for loop. Thus the running time of the matching function is O(n), and
O(m + n) including the prefix-function computation.
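The amortized argument above (the while loops are bounded by the total increase in k and q) can be checked empirically. The sketch below is an instrumented variant written for illustration only: the function name `kmp_fallback_counts` and the two counters are assumptions, not part of the lecture's pseudocode, and a non-empty pattern is assumed. It counts how often each while loop body runs.

```python
def kmp_fallback_counts(s: str, p: str) -> tuple[int, int]:
    """Run KMP on (s, p) and return (prefix_fallbacks, matcher_fallbacks),
    i.e. how many times each while loop body executed."""
    if not p:
        raise ValueError("pattern must be non-empty")  # assumption of this sketch
    m = len(p)
    pi = [0] * m
    prefix_fallbacks = 0
    k = 0
    for q in range(1, m):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]
            prefix_fallbacks += 1
        if p[k] == p[q]:
            k += 1  # k grows at most m - 1 times in total
        pi[q] = k

    matcher_fallbacks = 0
    q = 0
    for ch in s:
        while q > 0 and p[q] != ch:
            q = pi[q - 1]
            matcher_fallbacks += 1
        if p[q] == ch:
            q += 1  # q grows at most len(s) times in total
        if q == m:
            q = pi[q - 1]
    return prefix_fallbacks, matcher_fallbacks


s, p = "a" * 1000 + "b", "a" * 10 + "b"
pf, mf = kmp_fallback_counts(s, p)
assert pf <= len(p) - 1  # fallbacks cannot exceed total growth of k
assert mf <= len(s)      # fallbacks cannot exceed total growth of q
```

Each fallback lowers k (or q) by at least one, and the value never drops below zero, so the number of fallbacks cannot exceed the number of increments. That is why both phases stay linear.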
Summary
Text editing programs frequently need to find all occurrences of a pattern in the text.
The KMP algorithm searches for occurrences of a "pattern" P within a main "text
string" S by employing the observation that when a mismatch occurs, the pattern itself
embodies sufficient information to determine where the next match could begin, thus
bypassing re-examination of previously matched characters.
In the real world, the KMP algorithm is used in applications where the pattern string
being searched for matches against shifts of itself (self-matching).
A relevant example is the DNA alphabet, which consists of only 4 symbols
(A, C, G, T). Consider how KMP can work in a "DNA pattern matching problem": it is
well suited because the same letters repeat many times, so less computation time
is wasted.
Questions?