Lecture 34, 35 & 36 - String Matching Algorithms


String Matching Algorithms
Instructor: Dr. Zuhair Zafar

CS-854 Advanced Algorithm Analysis

Lecture # 34, 35 & 36


String Matching Algorithms
⚫ Applications
⚫ Text Editing
⚫ Spell Checkers
⚫ Plagiarism Detection
⚫ Searching content on the web or in the
documents
⚫ Searching patterns in the DNA sequence
String Matching Algorithms
⚫ Assume the text, T, is an array of length n, i.e.,
T[1..n].
⚫ The goal is to find the pattern, P, of length m, i.e.,
P[1..m], in the given text T, where m ≤ n.
⚫ The elements of P and T are characters taken
from a finite alphabet, Σ (sigma).
⚫ The character arrays P and T are called strings
of characters.
Notations

Symbols / Operators    Representation
Σ                      Sigma: represents all the characters that can appear in the text
Σ*                     Sigma star: the set of all finite-length strings formed using Σ
ε                      Epsilon: the zero-length (empty) string
|x|                    Length of string x
x+y                    Concatenation of strings x and y: the contents of x followed by y
⊏                      Prefix notation: w ⊏ x means w is a prefix of x
⊐                      Suffix notation: w ⊐ x means w is a suffix of x
String Matching Algorithms

1. Naïve-String Matching Algorithm


2. Rabin-Karp String Matching Algorithm
3. Knuth-Morris-Pratt String Matching Algorithm
Naïve String Matching Algorithm

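⚫ The naïve string matching procedure can be pictured as sliding a template
containing the pattern over the text.
⚫ In the figure's example, the pattern occurs at shift s = 2, which is a valid shift.
⚫ Shifts s = 0, s = 1 and s = 3 are invalid shifts, because the pattern does not occur there.
⚫ The goal is to find all the valid shifts.

Time complexity is O((n-m+1)m):
⚫ The outer loop tries every shift s = 0, ..., n-m, which is O(n-m+1) iterations.
⚫ Each shift compares up to m characters, which is O(m).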
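The sliding-template idea above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code: the function name `naive_string_matcher` and the sample text "acaabc" with pattern "aab" are assumptions, chosen so that the only valid shift is s = 2 as in the slide. Shifts are 0-based, matching the slide's use of s = 0.

```python
def naive_string_matcher(text: str, pattern: str) -> list[int]:
    """Return every valid shift s (0-based) at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    valid_shifts = []
    # Outer loop: n - m + 1 candidate shifts.
    for s in range(n - m + 1):
        # Inner loop: up to m character comparisons per shift.
        j = 0
        while j < m and text[s + j] == pattern[j]:
            j += 1
        if j == m:
            valid_shifts.append(s)
    return valid_shifts


# Hypothetical example input (not taken from the lost figure).
print(naive_string_matcher("acaabc", "aab"))  # [2]
```

The two nested loops correspond directly to the O(n-m+1) and O(m) factors of the bound.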
Rabin-Karp String Matching Algorithm

⚫ A string search algorithm which compares hash values
of strings, rather than the strings themselves.
⚫ For efficiency, the hash value of the next position in the
text is easily computed from the hash value of the
current position.
How Rabin-Karp Works?

⚫ Let the characters in both arrays T and P be decimal
digits, i.e., Σ = {0, 1, 2, ..., 9}.
⚫ Let p be the decimal value of the pattern P, read as an m-digit number.
⚫ Choose a prime number q that fits in a computer
word, which keeps the modular arithmetic fast.
⚫ Compute (p mod q)
⚫ The value of p mod q is what we will use to find all matches
of the pattern P in T.
How Rabin-Karp Works (continued)
⚫ The Rabin-Karp string searching algorithm calculates a
hash value for the pattern and for each m-character
substring of the text that is to be compared.
⚫ If the hash values are unequal, the algorithm moves on and
calculates the hash value of the next m-character
substring.
⚫ If the hash values are equal, the algorithm does a
brute-force comparison between the pattern and the
m-character substring.
⚫ In this way there is only one hash comparison per text
substring, and the brute-force comparison is only needed when
the hash values match.
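The steps above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the lecture's implementation: the function name `rabin_karp`, the radix `d = 256` (character codes via `ord`) and the prime `q = 101` are arbitrary choices for the example. The hash of the next window is derived from the current one in constant time, and a character-by-character check runs only when the hashes agree.

```python
def rabin_karp(text: str, pattern: str, d: int = 256, q: int = 101) -> list[int]:
    """Return all 0-based shifts where pattern occurs in text.

    d is the assumed radix and q an assumed prime modulus.
    """
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)  # weight of the leading character: d^(m-1) mod q
    p_hash = t_hash = 0
    for k in range(m):  # hash of the pattern and of the first window
        p_hash = (d * p_hash + ord(pattern[k])) % q
        t_hash = (d * t_hash + ord(text[k])) % q
    shifts = []
    for s in range(n - m + 1):
        # Brute-force check only when the hash values match.
        if p_hash == t_hash and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Rolling update: drop text[s], append text[s + m].
            t_hash = (d * (t_hash - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


print(rabin_karp("31415926535", "26"))  # [6]
```

A hash match that fails the brute-force check is a spurious hit; it costs O(m) extra work but never produces a wrong answer.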
Rabin-Karp Example
⚫ Given T = 31415926535 and P = 26.
⚫ We choose q = 11.
⚫ P mod q = 26 mod 11 = 4
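To make this example concrete, the sketch below computes the hash of every 2-digit window of T with a rolling update and separates valid shifts from spurious hits. The helper name `window_hashes` is hypothetical; the values T, P and q are the ones given above, with radix 10 because the alphabet is the decimal digits.

```python
def window_hashes(text: str, m: int, q: int, d: int = 10) -> list[int]:
    """Hash (value mod q) of every m-digit window of a decimal text."""
    h = pow(d, m - 1, q)          # d^(m-1) mod q
    t = int(text[:m]) % q         # hash of the first window
    hashes = [t]
    for s in range(len(text) - m):
        # Remove the leading digit, shift left, add the next digit.
        t = (d * (t - int(text[s]) * h) + int(text[s + m])) % q
        hashes.append(t)
    return hashes


T, P, q = "31415926535", "26", 11
hashes = window_hashes(T, len(P), q)
p_hash = int(P) % q                                      # 26 mod 11 = 4
hits = [s for s, t in enumerate(hashes) if t == p_hash]  # hash matches
valid = [s for s in hits if T[s:s + len(P)] == P]        # verified matches

print(hashes)  # [9, 3, 8, 4, 4, 4, 4, 10, 9, 2]
print(hits)    # [3, 4, 5, 6]
print(valid)   # [6]
```

The windows 15, 59 and 92 also hash to 4, so shifts 3, 4 and 5 (0-based) are spurious hits; only shift 6, the window 26, is a valid match.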
Rabin-Karp Example (Alphabets)
⚫ Let's assume that our alphabet consists of 10 letters,
i.e., Σ = {a, b, c, d, e, f, g, h, i, j}.
⚫ Let's assume that ‘a’ corresponds to ‘1’, ‘b’ corresponds
to ‘2’ and so on (see table).

Letter   Value
a        1
b        2
c        3
d        4
e        5
f        6
g        7
h        8
i        9
j        10

⚫ The value of the string “cah” would be
P = 3 × 10^2 + 1 × 10^1 + 8 × 10^0 = 318
⚫ General form:
P = p[1] × d^(m-1) + p[2] × d^(m-2) + ... + p[m] × d^(m-m)
⚫ Select a prime number q to take the modulus, e.g., 97.
⚫ Compute (P mod q)
318 % 97 = 27 (hash value for “cah”)
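The general form can be evaluated with Horner's rule, one multiplication and one addition per character. The following is a small sketch using the letter table from this slide; the names `LETTER_VALUE` and `pattern_value` are made up for the illustration.

```python
# Letter table from the slide: a -> 1, b -> 2, ..., j -> 10.
LETTER_VALUE = {ch: i + 1 for i, ch in enumerate("abcdefghij")}


def pattern_value(s: str, d: int = 10) -> int:
    """Evaluate p[1]*d^(m-1) + p[2]*d^(m-2) + ... + p[m]*d^0 by Horner's rule."""
    value = 0
    for ch in s:
        value = value * d + LETTER_VALUE[ch]
    return value


P = pattern_value("cah")
print(P)       # 318
print(P % 97)  # 27
```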
Rabin-Karp Complexity
⚫ The running time of the Rabin-Karp algorithm in the
worst case is O((n-m+1)m), but it has a good
average-case running time.
⚫ If a sufficiently large prime number is used for the hash
function, the hash values of two different patterns will
usually be distinct.
⚫ If the expected number of valid shifts is O(1), then the
Rabin-Karp algorithm can be expected to run in time
O(n+m) plus the time required to process the spurious
hits.
Knuth-Morris-Pratt String Matching Algorithm

⚫ Knuth, Morris and Pratt proposed a linear time algorithm
for the string matching problem and published it in 1977.
⚫ A matching time of O(n) is achieved by avoiding
comparisons with elements of ‘S’ that have previously
been involved in a comparison with some element of the
pattern ‘p’ to be matched, i.e., backtracking on the string
‘S’ never occurs.
Components of KMP Algorithm
⚫ The prefix function, Π
⚫ The prefix function, Π, for a pattern encapsulates knowledge
about how the pattern matches against shifts of itself.
⚫ This information can be used to avoid useless shifts of the
pattern ‘p’.
⚫ In other words, this enables avoiding backtracking on the string
‘S’.
⚫ The KMP Matcher
⚫ With string ‘S’, pattern ‘p’ and prefix function ‘Π’ as inputs, finds
the occurrence of ‘p’ in ‘S’ and returns the number of shifts of ‘p’
after which occurrence is found.
Pi Table or LPS Table
⚫ Pi (Π) table, or Longest Prefix Suffix (LPS) table: Π[q] is the length of the
longest proper prefix of p[1..q] that is also a suffix of p[1..q].

Example 1
Index      1  2  3  4  5  6  7  8  9  10
Pattern    a  b  c  d  a  b  e  a  b  f
Pi values  0  0  0  0  1  2  0  1  2  0

Example 2
Index      1  2  3  4  5  6  7  8  9  10
Pattern    a  a  b  c  a  d  a  a  b  e
Pi values  0  1  0  0  1  0  1  2  3  0

Example 3
Index      1  2  3  4  5  6  7  8  9  10 11
Pattern    a  b  c  d  e  a  b  f  a  b  c
Pi values  0  0  0  0  0  1  2  0  1  2  3

Example 4
Index      1  2  3  4  5  6  7  8  9
Pattern    a  a  a  a  b  a  a  c  d
Pi values  0  1  2  3  0  1  2  0  0
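The four tables can be checked mechanically. The sketch below is one common way to compute the pi values, written with 0-based Python indexing, so `pi[q]` in the code corresponds to the slide's Π[q+1]. The function name `compute_prefix_function` is borrowed from the pseudocode later in these slides; the code itself is an illustrative translation, not the lecture's.

```python
def compute_prefix_function(p: str) -> list[int]:
    """pi[q] = length of the longest proper prefix of p[:q+1] that is also its suffix."""
    m = len(p)
    pi = [0] * m
    k = 0  # length of the currently matched prefix
    for q in range(1, m):
        # Fall back to shorter borders until the next character fits.
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    return pi


print(compute_prefix_function("abcdabeabf"))   # [0, 0, 0, 0, 1, 2, 0, 1, 2, 0]
print(compute_prefix_function("aabcadaabe"))   # [0, 1, 0, 0, 1, 0, 1, 2, 3, 0]
print(compute_prefix_function("abcdeabfabc"))  # [0, 0, 0, 0, 0, 1, 2, 0, 1, 2, 3]
print(compute_prefix_function("aaaabaacd"))    # [0, 1, 2, 3, 0, 1, 2, 0, 0]
```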
How KMP Algorithm works
⚫ Illustration: given a string ‘S’ and pattern ‘p’ as follows:

Index      1  2  3  4  5  6  7  8  9  10 11 12 13 14 15
S:         a  b  a  b  c  a  b  c  a  b  a  b  a  b  d

Index      0  1  2  3  4  5
p:            a  b  a  b  d
Pi values     0  0  1  2  0

⚫ Step 1: Compute the pi table of the pattern, p.
⚫ Step 2: Iterator variables i and q are initialized (i = 1 points into S, q = 0 is the
number of pattern characters matched so far). S[i] is compared with p[q+1].
    ⚫ If the characters are equal, i and q are incremented.
    ⚫ If the characters are unequal and q is equal to 0, then only i is incremented.
    ⚫ If the characters are unequal and q is greater than 0, then q is moved back to the
    index given by the corresponding pi value, i.e., q becomes Π[q], and i stays where it is.
⚫ Step 3: If q == m, print (Pattern occurs with shift, i-m).
⚫ Step 4: Terminate if i > n.
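The walkthrough above can be replayed in code. This is a rough sketch that follows the step rules literally with explicit iterators i and q; the function name `kmp_trace` is hypothetical, and the pi table is hard-coded from the illustration. The shift is recorded as i - m at the moment the last pattern character matches, before i is advanced, which is an assumption about how the slide counts it.

```python
def kmp_trace(S: str, p: str, pi: list[int]) -> tuple[list[int], int]:
    """Follow Steps 2-4; return (shifts, number of character comparisons)."""
    n, m = len(S), len(p)
    i, q = 1, 0          # i is 1-based into S; q = characters of p matched so far
    shifts, comparisons = [], 0
    while i <= n:                       # Step 4: terminate if i > n
        comparisons += 1
        if S[i - 1] == p[q]:            # S[i] == p[q+1] in the slide's 1-based terms
            q += 1
            if q == m:                  # Step 3: full match
                shifts.append(i - m)
                q = pi[q - 1]
            i += 1
        elif q == 0:
            i += 1                      # unequal and q == 0: only i moves
        else:
            q = pi[q - 1]               # unequal and q > 0: q falls back to Π[q]
    return shifts, comparisons


S, p = "ababcabcabababd", "ababd"
pi = [0, 0, 1, 2, 0]                    # pi table from the illustration
shifts, comparisons = kmp_trace(S, p, pi)
print(shifts)  # [10]
```

Note that i never decreases, which is the "no backtracking on S" property; the comparison count stays below 2n for that reason.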
Running - time analysis
⚫ Compute-Prefix-Function (Π)

1   m ← length[p]        // ‘p’ is the pattern to be matched
2   Π[1] ← 0
3   k ← 0
4   for q ← 2 to m
5       do while k > 0 and p[k+1] != p[q]
6              do k ← Π[k]
7          if p[k+1] = p[q]
8              then k ← k + 1
9          Π[q] ← k
10  return Π

In the above pseudocode for computing the prefix function, the for loop in steps 4 to 9
runs m - 1 times, and steps 1 to 3 take constant time. k increases by at most 1 per
iteration (step 8) and every pass of the while loop (step 6) decreases it, so the while
loop runs fewer than m times in total. Hence the running time of the compute prefix
function is Θ(m).

⚫ KMP Matcher

1   n ← length[S]
2   m ← length[p]
3   Π ← Compute-Prefix-Function(p)
4   q ← 0
5   for i ← 1 to n
6       do while q > 0 and p[q+1] != S[i]
7              do q ← Π[q]
8          if p[q+1] = S[i]
9              then q ← q + 1
10         if q = m
11             then print “Pattern occurs with shift” i - m
12                  q ← Π[q]

The for loop beginning in step 5 runs n times, i.e., once per character of the string ‘S’.
Steps 1, 2 and 4 take constant time and step 3 takes Θ(m) as shown above. By the same
argument as for the prefix function, the while loop in step 6 runs at most n times in
total, so the running time of the matching function is Θ(n).
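The amortized argument can be observed directly by counting how often the inner while loops execute. The sketch below is an instrumented Python translation of the two procedures, using 0-based indexing; the names `prefix_function` and `kmp_matcher` and the `fallbacks` counters are additions for this illustration and are not part of the original pseudocode.

```python
def prefix_function(p: str) -> tuple[list[int], int]:
    """Return (pi table, number of while-loop fallbacks)."""
    pi, k, fallbacks = [0] * len(p), 0, 0
    for q in range(1, len(p)):
        while k > 0 and p[k] != p[q]:
            k = pi[k - 1]
            fallbacks += 1
        if p[k] == p[q]:
            k += 1
        pi[q] = k
    return pi, fallbacks


def kmp_matcher(S: str, p: str) -> tuple[list[int], int]:
    """Return (shifts, number of while-loop fallbacks during matching)."""
    if not p:
        return [], 0
    pi, _ = prefix_function(p)
    m, q, shifts, fallbacks = len(p), 0, [], 0
    for i, ch in enumerate(S, start=1):   # i is 1-based, as in the pseudocode
        while q > 0 and p[q] != ch:
            q = pi[q - 1]
            fallbacks += 1
        if p[q] == ch:
            q += 1
        if q == m:
            shifts.append(i - m)          # "Pattern occurs with shift i - m"
            q = pi[q - 1]
    return shifts, fallbacks


S, p = "ababcabcabababd", "ababd"
shifts, fallbacks = kmp_matcher(S, p)
print(shifts)                  # [10]
print(fallbacks <= len(S))     # True
```

Each fallback strictly decreases q, and q grows by at most one per text character, so the total number of fallbacks can never exceed n. That is why the nested while loop does not break the linear bound.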
