
Slides - Chapter 32 - String Matching

This document discusses several algorithms for string matching:
1. Naive string matching compares characters one by one, with complexity O(nm).
2. Rabin-Karp string matching converts the pattern and text to numbers using hash functions to quickly filter out non-matches before full character comparison, reducing average complexity to O(n+m).
3. String matching with finite automata builds a state machine to efficiently represent all shifts where the pattern could match in the text.
4. The Knuth-Morris-Pratt algorithm computes a prefix function to optimize the shift of the pattern in each iteration, implementing the finite automata approach efficiently.

Uploaded by

SML

String Matching

Naive String Matching


Rabin-Karp Matcher
String Matching with Finite Automata
Knuth-Morris-Pratt Algorithm

Chapter 32: Slide – 1


String Matching

Input is a pattern P[1..m] and text T[1..n].
Find a shift s s.t. P[i] = T[s + i] for 1 ≤ i ≤ m.
Applications: document search, grep, DNA

Naive-String-Matcher(T, P)
    n ← T.length, m ← P.length
    for s ← 0 to n − m
        if P[1..m] = T[s + 1..s + m]
            pattern occurs with shift s

Naive-String-Matcher is O(nm).
Consider P = aaaaa and T = aaaabaaaab. . .
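
A minimal Python sketch of Naive-String-Matcher, assuming 0-based strings (so a returned shift s equals the slide's s). The function name and the sample strings are illustrative only.

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every shift s at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try each candidate shift; the slice comparison costs up to m steps.
    for s in range(n - m + 1):
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


print(naive_string_matcher("aaaabaaaab", "aaaab"))  # [0, 5]
```

Each of the n − m + 1 shifts may compare up to m characters, which is where the O(nm) bound comes from.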

Chapter 32: Slide – 2


Rabin-Karp String Matching
P[1..m] can be converted to a number:

$$p = \sum_{i=0}^{m-1} P[m-i] \cdot d^{i}$$

where each character is in radix-d notation.
Similarly, T[s + 1..s + m] can be converted:

$$t = \sum_{i=0}^{m-1} T[s+m-i] \cdot d^{i}$$

Idea: test p mod q = t mod q before the more expensive P[1..m] = T[s + 1..s + m].
Use a prime number for q.
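
A small sketch of the conversion, assuming decimal digit strings (d = 10) and a hypothetical small prime q = 13. The helper name `radix_value` is illustrative. It evaluates the sum with Horner's rule and shows that two different strings can agree mod q.

```python
def radix_value(digits: str, d: int = 10) -> int:
    """Value of a digit string in radix d, computed by Horner's rule.

    Assumes every character is a decimal digit (so d = 10 is the natural choice).
    """
    value = 0
    for ch in digits:
        value = d * value + int(ch)
    return value


q = 13                       # hypothetical small prime modulus
p = radix_value("31415")     # pattern value
t = radix_value("67399")     # value of some text window
print(p % q, t % q)          # 7 7
```

The residues agree even though the strings differ, so a match mod q still has to be confirmed by the full comparison.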

Chapter 32: Slide – 3


Rabin-Karp Example
[Figure: Rabin-Karp example]

Chapter 32: Slide – 4


Rabin-Karp Matcher
Rabin-Karp-Matcher(T, P, d, q)
    n ← T.length, m ← P.length
    p ← 0, t ← 0, h ← d^m mod q
    for i ← 1 to m
        p ← (d ∗ p + P[i]) mod q
        t ← (d ∗ t + T[i]) mod q
    for s ← 0 to n − m
        if s > 0
            t ← (d ∗ t + T[s + m] − T[s] ∗ h) mod q
        if p = t and P[1..m] = T[s + 1..s + m]
            pattern occurs with shift s
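
A Python sketch of the matcher above, under some assumptions not on the slide: characters are mapped to numbers with `ord`, the defaults d = 256 and q = 101 are arbitrary illustrative choices, and an empty pattern simply returns no shifts. Indices are 0-based, so the slide's T[s + m] and T[s] become `text[s + m - 1]` and `text[s - 1]`.

```python
def rabin_karp_matcher(text: str, pattern: str, d: int = 256, q: int = 101) -> list[int]:
    """Return every shift at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []                      # simplification for this sketch
    h = pow(d, m, q)                   # h = d^m mod q, as on the slide
    p = t = 0
    for i in range(m):                 # hash the pattern and the first window
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):
        if s > 0:
            # Roll the window: append the new character, drop the old one.
            t = (d * t + ord(text[s + m - 1]) - ord(text[s - 1]) * h) % q
        # Compare characters only when the hashes agree.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


print(rabin_karp_matcher("abracadabra", "abra"))  # [0, 7]
```

Python's `%` always returns a non-negative result for a positive modulus, so the subtraction in the rolling update needs no extra correction here.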

Chapter 32: Slide – 5


Average Case Analysis of Rabin-Karp
Probability Model:
Suppose there are v valid shifts.
If s is not a valid shift, then p mod q = t mod q with probability 1/q.

O(m) per valid shift. O(m) per “spurious hit.”
Expected number of spurious hits is O(n/q).
O(1) per iteration otherwise.

Expected running time is O(m(v + n/q) + n), which is O(mv + n) if q ≥ m.
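
The last simplification can be spelled out as follows, using only the assumption q ≥ m stated above:

```latex
\begin{align*}
O\!\left(m\left(v + \frac{n}{q}\right) + n\right)
  &= O\!\left(mv + \frac{m}{q}\,n + n\right) \\
  &= O(mv + n + n) && \text{since } \frac{m}{q} \le 1 \text{ when } q \ge m \\
  &= O(mv + n).
\end{align*}
```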

Chapter 32: Slide – 6


Finite State Automata
A finite state automaton is defined by:
    Q, a set of states
    q0 ∈ Q, the start state
    A ⊆ Q, the accepting states
    Σ, the input alphabet
    δ, the transition function, from Q × Σ to Q
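
One way to hold the five components in code is sketched below. The class name, field names, and the example automaton (which accepts strings over {a, b} that end in a) are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class FiniteAutomaton:
    states: frozenset      # Q
    start: int             # q0
    accepting: frozenset   # A
    alphabet: frozenset    # Σ
    delta: dict            # δ, stored as {(state, symbol): next_state}

    def accepts(self, word: str) -> bool:
        """Run the automaton on word and report whether it ends in A."""
        state = self.start
        for symbol in word:
            state = self.delta[(state, symbol)]
        return state in self.accepting


# Hypothetical example: accept strings over {a, b} whose last symbol is a.
ends_in_a = FiniteAutomaton(
    states=frozenset({0, 1}),
    start=0,
    accepting=frozenset({1}),
    alphabet=frozenset("ab"),
    delta={(0, "a"): 1, (0, "b"): 0, (1, "a"): 1, (1, "b"): 0},
)
print(ends_in_a.accepts("abba"))  # True
```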

Chapter 32: Slide – 7


FSA Idea for String Matching
Start in state q0.
Perform a transition from q0 to q1 if the next character of T = P[1].
State qi means the first i characters of P match.
Transition from qi to qi+1 if the next character of T = P[i + 1].

Transition Function for P = aba

    0 --a--> 1 --b--> 2 --a--> 3

               State
             0  1  2  3
    Input a  1  ?  3  ?
          b  ?  2  ?  ?

Chapter 32: Slide – 8


FSA Matcher (Incomplete)
FSA-Matcher(T, P)
    n ← T.length, m ← P.length
    q ← 0 // q is the state of the FSA.
    for s ← 1 to n
        if q < m and T[s] = P[q + 1]
            q ← q + 1
        else q ← ???
        if q = m
            pattern occurs with shift s − m

Cannot simply reset state to 0.


Consider P = ab and T = aab
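
The counterexample can be checked directly. The sketch below fills in the "???" with a reset to 0, purely to show why that choice is wrong; the function name is illustrative.

```python
def fsa_matcher_reset_to_zero(text: str, pattern: str) -> list[int]:
    """Incomplete FSA matcher with the '???' naively replaced by 0."""
    m = len(pattern)
    q = 0
    shifts = []
    for s, ch in enumerate(text, start=1):   # s is 1-based, as on the slide
        if q < m and ch == pattern[q]:
            q += 1
        else:
            q = 0                            # naive (incorrect) reset
        if q == m:
            shifts.append(s - m)
    return shifts


print(fsa_matcher_reset_to_zero("aab", "ab"))  # [] although ab occurs with shift 1
```

On the second a the matcher drops to state 0 and discards that character, even though it is itself a match for P[1], so the occurrence at shift 1 is missed.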

Chapter 32: Slide – 9


Delta Transition for String Matching

In terms of the text:

$$
\delta(q_i, T[s]) =
\begin{cases}
q_{i+1} & \text{if } i < m \text{ and } T[s] = P[i+1] \\
q_{j+1} & \text{if } j \text{ is the maximum value such that } T[s] = P[j+1] \text{ and } T[s-j..s-1] = P[1..j] \\
q_0 & \text{otherwise}
\end{cases}
$$

In terms of the pattern only, for an input character x:

$$
\delta(q_i, x) =
\begin{cases}
q_{i+1} & \text{if } i < m \text{ and } x = P[i+1] \\
q_{j+1} & \text{if } j \text{ is the maximum value such that } x = P[j+1] \text{ and } P[i-j+1..i] = P[1..j] \\
q_0 & \text{otherwise}
\end{cases}
$$
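
A direct, unoptimized Python sketch of the pattern-only definition of δ. The function name and the list-of-dicts representation are assumptions made for illustration. The first case of the definition is the j = i instance of the search, so a single downward scan over j covers both cases.

```python
def compute_transitions(pattern: str, alphabet: str) -> list[dict]:
    """Return delta as a list indexed by state; delta[i][x] is the next state."""
    m = len(pattern)
    delta = []
    for i in range(m + 1):
        row = {}
        for x in alphabet:
            nxt = 0                                   # the "otherwise" case
            # Largest j with x = P[j+1] and P[i-j+1..i] = P[1..j] (0-based slices).
            for j in range(min(i, m - 1), -1, -1):
                if pattern[j] == x and pattern[i - j:i] == pattern[:j]:
                    nxt = j + 1
                    break
            row[x] = nxt
        delta.append(row)
    return delta


delta = compute_transitions("bbbba", "ab")
print([row["a"] for row in delta])  # [0, 0, 0, 0, 5, 0]
print([row["b"] for row in delta])  # [1, 2, 3, 4, 4, 1]
```

This brute-force construction is far from the most efficient one, but it follows the definition line by line.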

Chapter 32: Slide – 10


FSA for Matching bbbba
[State diagram: 0 --b--> 1 --b--> 2 --b--> 3 --b--> 4 --a--> 5; the remaining edges are given in the table]

Transition Function

               State
             0  1  2  3  4  5
    Input a  0  0  0  0  5  0
          b  1  2  3  4  4  1

Chapter 32: Slide – 11


FSA for Matching ababa
[State diagram: 0 --a--> 1 --b--> 2 --a--> 3 --b--> 4 --a--> 5; the remaining edges are given in the table]

Transition Function

               State
             0  1  2  3  4  5
    Input a  1  1  3  1  5  1
          b  0  2  0  4  0  4

Chapter 32: Slide – 12


FSA for Matching abbaa
[State diagram: 0 --a--> 1 --b--> 2 --b--> 3 --a--> 4 --a--> 5; the remaining edges are given in the table]

Transition Function

               State
             0  1  2  3  4  5
    Input a  1  1  1  4  5  1
          b  0  2  3  0  2  2
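
With a complete transition table, the matcher needs one table lookup per text character. The sketch below hardcodes the table for P = abbaa from this slide. The names `DELTA`, `M`, and `fsa_matcher` and the sample text are illustrative assumptions.

```python
# Transition table for P = abbaa, copied from the slide: DELTA[symbol][state].
DELTA = {
    "a": [1, 1, 1, 4, 5, 1],
    "b": [0, 2, 3, 0, 2, 2],
}
M = 5  # the accepting state equals the pattern length


def fsa_matcher(text: str) -> list[int]:
    """Return every shift at which abbaa occurs in text (alphabet {a, b})."""
    q = 0
    shifts = []
    for i, ch in enumerate(text, start=1):   # i is 1-based, as in the pseudocode
        q = DELTA[ch][q]
        if q == M:
            shifts.append(i - M)
    return shifts


print(fsa_matcher("ababbaabbaa"))  # [2, 6]
```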

Chapter 32: Slide – 13


Knuth-Morris-Pratt Algorithm
The Knuth-Morris-Pratt algorithm efficiently implements finite state automata. It is based on computing a prefix function:

    π[q] = max{k : k < q and P_k is a suffix of P_q}

where 0 ≤ k < q ≤ m and P_k = P[1..k] and P_q = P[1..q].

P_k is a suffix of P_q if P_k = P[q − k + 1..q]
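
The definition can be evaluated literally, which is slow but useful for checking faster implementations. In this sketch (function name illustrative) the returned list is 0-based, so entry q − 1 holds π[q].

```python
def prefix_function_bruteforce(pattern: str) -> list[int]:
    """pi[q-1] = max k < q such that P[1..k] is a suffix of P[1..q]."""
    m = len(pattern)
    pi = []
    for q in range(1, m + 1):
        # k = 0 always qualifies (the empty prefix), so max() is well defined.
        pi.append(max(k for k in range(q) if pattern[:k] == pattern[q - k:q]))
    return pi


print(prefix_function_bruteforce("ababa"))  # [0, 0, 1, 2, 3]
```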

Chapter 32: Slide – 14


Computing the Prefix Function
Compute-Prefix-Function(P)
    m ← P.length
    π[1] ← 0
    k ← 0
    for q ← 2 to m
        while k > 0 and P[k + 1] ≠ P[q]
            k ← π[k]
        if P[k + 1] = P[q]
            k ← k + 1
        π[q] ← k
    return π
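
A Python sketch of Compute-Prefix-Function, assuming 0-based indexing: `pi[q]` here stores the slide's π[q + 1], so the slide's P[k + 1] becomes `pattern[k]` and k ← π[k] becomes `k = pi[k - 1]`.

```python
def compute_prefix_function(pattern: str) -> list[int]:
    """Return the prefix function of pattern as a 0-based list."""
    m = len(pattern)
    pi = [0] * m
    k = 0                                  # length of the current matched prefix
    for q in range(1, m):
        # Fall back to shorter prefixes until one can be extended (or k = 0).
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k
    return pi


print(compute_prefix_function("ababaca"))  # [0, 0, 1, 2, 3, 0, 1]
```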

Chapter 32: Slide – 15


Prefix Function Analysis
Running time is Θ(m). Count changes to k.
π[k] < k, so k ← π[k] decreases k.
k is incremented at most m − 1 times (at most once per iteration of the for loop) and k ≥ 0, so k can be decreased at most m − 1 times.

If P[q] = P[k + 1], then π[q] = k + 1.
If P[q] ≠ P[k + 1], then check π[k] next because P_{π[k]} is a suffix of both P_k and P_{q−1}.

Chapter 32: Slide – 16


KMP Matcher
KMP-Matcher(T, P)
    n ← T.length, m ← P.length
    π ← Compute-Prefix-Function(P)
    q ← 0 // q is the state of the FSA.
    for i ← 1 to n
        while q > 0 and P[q + 1] ≠ T[i]
            q ← π[q]
        if P[q + 1] = T[i] then q ← q + 1
        if q = m
            pattern occurs with shift i − m
            q ← π[q]
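
A self-contained Python sketch of KMP-Matcher with 0-based indexing. The handling of an empty pattern (every shift matches) is an assumption of this sketch, not something the slide specifies.

```python
def kmp_matcher(text: str, pattern: str) -> list[int]:
    """Return every shift at which pattern occurs in text."""
    m = len(pattern)
    if m == 0:
        return list(range(len(text) + 1))   # assumption: empty pattern matches everywhere

    # Prefix function, 0-based: pi[q] is the slide's π[q + 1].
    pi = [0] * m
    k = 0
    for q in range(1, m):
        while k > 0 and pattern[k] != pattern[q]:
            k = pi[k - 1]
        if pattern[k] == pattern[q]:
            k += 1
        pi[q] = k

    shifts = []
    q = 0                                    # number of pattern characters matched
    for i, ch in enumerate(text):
        while q > 0 and pattern[q] != ch:
            q = pi[q - 1]                    # fall back without moving i
        if pattern[q] == ch:
            q += 1
        if q == m:
            shifts.append(i - m + 1)
            q = pi[q - 1]                    # keep going to find overlapping matches
    return shifts


print(kmp_matcher("abababa", "aba"))  # [0, 2, 4]
```

The text index i never moves backward, which is the property that the running-time argument relies on.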

Chapter 32: Slide – 17


KMP Matcher Analysis
Running time is O(n+m). Count changes to q.
π[q] < q, so q ← π[q] decreases q.
q is incremented O(n) times and q ≥ 0, so q can be decreased at most O(n) times.

Show correctness of computation.
Loop invariant is P_q = T[i − q..i − 1].
This is true before the first iteration.
In while loop,
If P[q + 1] = T[i], then q is incremented.
If P[q + 1] ≠ T[i], then check π[q] next because P_{π[q]} is also a suffix of T_{i−1}.

Chapter 32: Slide – 18
