
KMP algorithm

The document discusses various string matching algorithms including the Naïve String-matching Algorithm, KMP algorithm, and Rabin-Karp Algorithm, along with the concept of suffix trees. It explains the pattern matching problem, applications, and provides examples of how these algorithms work, particularly focusing on the efficiency of the KMP algorithm which preprocesses the pattern to avoid unnecessary comparisons. Additionally, it covers the construction of the LPS table used in the KMP algorithm for optimizing the search process.

MODULE 3: STRING MATCHING

Naïve String-matching Algorithm, KMP algorithm, Rabin-Karp Algorithm, Suffix Trees


String Matching - Strings

• Pattern matching
• Exact matching algorithms
• A string is a sequence of characters (0-indexed)
• Examples of strings
  • C++ code
  • HTML document
  • DNA sequence
  • Digitized image
• Alphabet (∑): the set of possible characters for a family of strings
  • ASCII or Unicode
  • {0, 1}
  • {A, C, G, T}

String Matching - Strings

• Given a string P of size m
  • Substring P[i : j]: the contiguous run of characters of P with indexes from i to j, inclusive
  • Prefix: a substring P[0 : i]
  • Suffix: a substring P[i : m-1]
• Given strings T (text) and P (pattern), the pattern matching problem consists of finding a substring of T equal to P
• Applications
  • Text editors, compilers
  • Search engines
  • Biological research

String Matching: Example

Index:        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
Text (T):     A  A  B  A  A  C  A  A  D  A  A  B  A  A  B  A  A  D  A  A
Pattern (P):  A  A  B  A

Pattern found at indexes 0, 9, 12

Naïve String Matching Algorithm

#include <stdio.h>
#include <string.h>

void search(char* pat, char* txt)
{
    int M = (int)strlen(pat);
    int N = (int)strlen(txt);

    /* A loop to slide pat[] one by one */
    for (int i = 0; i <= N - M; i++) {
        int j;

        /* For current index i, check for pattern match */
        for (j = 0; j < M; j++) {
            if (txt[i + j] != pat[j])
                break;
        }

        if (j == M) /* pat[0...M-1] == txt[i, i+1, ...i+M-1] */
            printf("Pattern found at index %d \n", i);
    }
}

/* Driver's code */
int main()
{
    char txt[] = "AABAACAADAABAAABAADAA";
    char pat[] = "AABA";

    /* Function call */
    search(pat, txt);
    return 0;
}
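
Naïve String Searching Algorithm

• Slide the pattern (P) over the text (T) one position at a time and check for a match at each shift

Index:     0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
Text (T):  A  A  B  A  A  C  A  A  D  A  A  B  A  A  B  A  A  D  A  A
Shift 0:   A  A  B  A                                                     (match at 0)
Shift 1:      A  A  B  A
Shift 2:         A  A  B  A
Shift 3:            A  A  B  A
Shift 4:               A  A  B  A
Shift 5:                  A  A  B  A
Shift 6:                     A  A  B  A
Shift 7:                        A  A  B  A
Shift 8:                           A  A  B  A
Shift 9:                              A  A  B  A                          (match at 9) …

• Text size: n
• Pattern size: m
• Each shift takes up to m character comparisons
• There are n - m + 1 shifts, which is at most n
• Worst-case running time: O(mn)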

Naïve String Searching Algorithm

• Slide the pattern (P) over the text (T) one position at a time and check for a match; i is the text index and j is the pattern index

Index:        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
Text (T):     A  A  B  A  A  C  A  A  D  A  A  B  A  A  B  A  A  D  A  A
Pattern (P):  A  A  B  A

Shift   i       j
0       0-3     0-3     (match at 0)
1       1-4     0-3
2       2-5     0-3
3       3-6     0-3
4       4-7     0-3
5       5-8     0-3
6       6-9     0-3
7       7-10    0-3
8       8-11    0-3
9       9-12    0-3     (match at 9) …
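
Worst case scenario of Naïve approach

Index:     0  1  2  3  4  5  6  7
Text (T):  a  a  a  a  a  a  a  d
Shift 0:   a  a  a  d
Shift 1:      a  a  a  d
Shift 2:         a  a  a  d
Shift 3:            a  a  a  d
Shift 4:               a  a  a  d

Shift   i       j       Result
0       0-3     0-3     mismatch (X) at i = 3
1       1-4     0-3     mismatch (X) at i = 4
2       2-5     0-3     mismatch (X) at i = 5
3       3-6     0-3     mismatch (X) at i = 6
4       4-7     0-3     match

Found a match at index 4

The naïve method fares badly when the text and pattern contain many repeated characters: almost the whole pattern is re-compared at every shift before the mismatch shows up.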
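
The cost of this worst case can be checked by counting comparisons. The sketch below is a minimal illustration in C and is not part of the original slides; the function name naive_count_comparisons and its first_match out-parameter are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: runs the naive search over the whole text and
 * returns the number of character comparisons performed. The index of
 * the first match (or -1 if there is none) is written to *first_match. */
long naive_count_comparisons(const char *txt, const char *pat, long *first_match)
{
    assert(txt != NULL && pat != NULL && first_match != NULL);
    long n = (long)strlen(txt);
    long m = (long)strlen(pat);
    long comparisons = 0;
    *first_match = -1;

    for (long i = 0; i + m <= n; i++) {      /* one iteration per shift */
        long j = 0;
        while (j < m) {
            comparisons++;                   /* txt[i + j] compared with pat[j] */
            if (txt[i + j] != pat[j])
                break;
            j++;
        }
        if (j == m && *first_match < 0)
            *first_match = i;
    }
    return comparisons;
}
```

For T = "aaaaaaad" and P = "aaad" this performs 20 comparisons, 4 at each of the 5 shifts, which is m × (n - m + 1), and it reports the first match at index 4.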


How about this?

Index:     0  1  2  3  4  5  6  7  8  9
Text (T):  T  R  A  I  L  T  R  A  I  N
Try 1:     T  R  A  I  N
Try 2:                    T  R  A  I  N

Try     i       j       Result
1       0-4     0-4     mismatch (X) at i = 4
2       5-9     0-4     match

• Here index i moves to the position just after the mismatch, so i only ever moves forward
• Match found at index 5
• Does this always work?
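
How about this?

Index:     0  1  2  3  4  5  6  7  8  9 10
Text (T):  O  N  I  O  N  I  O  N  S  P  L
Try 1:     O  N  I  O  N  S
Try 2:                       O  N  I
Try 3:                                O
Try 4:                                   O

Try     i       j       Result
1       0-5     0-5     mismatch (X) at i = 5
2       6-8     0-2     mismatch (X) at i = 8
3       9       0       mismatch (X)
4       10      0       mismatch (X)

• Here index i moves to the position just after the mismatch
• No match is found, although the pattern actually occurs in the text
• Why does this NOT work? Overlapping sub-patterns: the real occurrence starts inside the part of the text that was already matched

Text (T):  O  N  I  O  N  I  O  N  S  P  L
(P):                O  N  I  O  N  S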
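
The failed attempt above can be reproduced in a few lines. This is a sketch of the flawed strategy only, written to show why it misses the match; the name skip_search is hypothetical and the restart rule is an interpretation of the slide (after a mismatch at text index i, restart the pattern at i + 1).

```c
#include <assert.h>
#include <string.h>

/* Flawed on purpose: i never moves backwards, and after a mismatch the
 * pattern is restarted from j = 0 at the next text position.
 * Returns the start index of the match it finds, or -1. */
long skip_search(const char *txt, const char *pat)
{
    assert(txt != NULL && pat != NULL);
    size_t n = strlen(txt);
    size_t m = strlen(pat);
    assert(m > 0);                    /* sketch assumes a non-empty pattern */

    size_t i = 0, j = 0;
    while (i < n) {
        if (txt[i] == pat[j]) {
            i++;
            j++;
            if (j == m)
                return (long)(i - m); /* whole pattern matched */
        } else {
            i++;                      /* skip past the mismatching character */
            j = 0;                    /* and forget everything matched so far */
        }
    }
    return -1;
}
```

On "TRAILTRAIN" with "TRAIN" it returns 5, but on "ONIONIONSPL" with "ONIONS" it returns -1 even though the pattern occurs at index 3.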
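
How about this?

Index:     0  1  2  3  4  5  6  7  8  9 10
Text (T):  O  N  I  O  N  I  O  N  S  P  L
Try 1:     O  N  I  O  N  S
Try 2:              O  N  I  O  N  S

Try     i       j       Result
1       0-5     0-5     mismatch (X) at i = 5
2       5-8     2-5     match

• If we could somehow continue matching the text from index i = 5 against the pattern from index j = 2, the occurrence would be found without moving i backwards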

Knuth – Morris – Pratt (KMP) Pattern Searching

• Linear-time algorithm for string matching: O(n + m)
• The text index i never backtracks
• The basic idea behind the KMP algorithm is:
  1. Whenever we detect a mismatch (after some matches),
  2. we already know some of the characters in the text of the next window.
  3. We take advantage of this information to avoid re-matching the characters that we know will match anyway.
• How do we know how many characters can be skipped? The KMP algorithm preprocesses the pattern and prepares an integer array lps[] that tells how many leading pattern characters are already known to match after a mismatch, and can therefore be skipped

KMP – Pre-processing

• LPS: longest proper prefix which is also a suffix
• A proper prefix is a prefix other than the whole string
  • Proper prefixes of “ABC” are “A” and “AB” (and the empty string)
  • “ABC” is a prefix of itself, but it is NOT a proper prefix
• Suffixes of “ABC” are “C”, “BC”, “ABC”
• The search for the LPS happens in sub-patterns
  • For each sub-pattern pat[0..i], where i = 0 to m-1, lps[i] stores the length of the longest proper prefix of pat[0..i] that is also a suffix of pat[0..i]
• In other words: does the beginning part of the pattern occur anywhere else in the pattern?
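
The definition of lps[i] can be checked directly. The sketch below tries every candidate length from longest to shortest; lps_by_definition is a hypothetical helper name, and this brute-force version is only meant to mirror the definition, not the way KMP normally builds the table.

```c
#include <assert.h>
#include <string.h>

/* Returns the length of the longest proper prefix of pat[0..i] that is
 * also a suffix of pat[0..i], by testing each length k = i, i-1, ..., 1. */
int lps_by_definition(const char *pat, int i)
{
    assert(pat != NULL && i >= 0 && (size_t)i < strlen(pat));
    int len = i + 1;                       /* length of the sub-pattern pat[0..i] */

    for (int k = len - 1; k > 0; k--) {    /* k < len keeps the prefix proper */
        /* prefix pat[0..k-1] against suffix pat[len-k..len-1] */
        if (memcmp(pat, pat + (len - k), (size_t)k) == 0)
            return k;
    }
    return 0;                              /* only the empty prefix qualifies */
}
```

For example, lps_by_definition("ABCDABEABF", 5) returns 2, because "AB" is both a proper prefix and a suffix of "ABCDAB".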

KMP – Pre-processing

• The KMP algorithm maintains a table of size m (the same as the size of the pattern)
• It is called the ∏ table or LPS table

Pattern:  A  B  C  D  A  B  E  A  B  F
LPS:      0  0  0  0  1  2  0  1  2  0

Pattern:  A  B  C  D  E  A  B  F  A  B  C
LPS:      0  0  0  0  0  1  2  0  1  2  3

Pattern:  A  A  B  C  A  D  A  A  B  E
LPS:      0  1  0  0  1  0  1  2  3  0

Pattern:  A  A  A  A  B  A  A  C  D
LPS:      0  1  2  3  0  1  2  0  0

KMP – Pre-processing

Pattern:  A  A  A  A
LPS:      0  1  2  3

Pattern:  A  A  A  C  A  A  A  A  A  C
LPS:      0  1  2  0  1  2  3  3  3  4

Pattern:  A  B  C  D  E
LPS:      0  0  0  0  0

Pattern:  A  A  A  B  A  A  A
LPS:      0  1  2  0  1  2  3

Pattern:  A  A  B  A  A  C  A  A  B  A  A
LPS:      0  1  0  1  2  0  1  2  3  4  5
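
Tables like these are usually built in a single left-to-right pass over the pattern. The following is a sketch of the commonly used linear-time construction, assuming a 0-indexed pattern; compute_lps and its parameter names are illustrative choices, not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Fills lps[0..m-1] for the pattern pat of length m (m > 0).
 * len is the length of the longest prefix-suffix found for the
 * previous position; it is extended or shortened as i advances. */
void compute_lps(const char *pat, int m, int *lps)
{
    assert(pat != NULL && lps != NULL && m > 0);
    int len = 0;
    int i = 1;
    lps[0] = 0;                    /* a single character has no proper prefix match */

    while (i < m) {
        if (pat[i] == pat[len]) {
            len++;                 /* the current prefix-suffix grows by one */
            lps[i] = len;
            i++;
        } else if (len > 0) {
            len = lps[len - 1];    /* fall back to a shorter candidate, keep i */
        } else {
            lps[i] = 0;            /* no prefix-suffix ends at position i */
            i++;
        }
    }
}
```

For the pattern AAACAAAAAC this fills lps with 0 1 2 0 1 2 3 3 3 4. The fallback step never moves i backwards, which is why the construction takes O(m) time.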

KMP Way

Index:        0  1  2  3  4  5  6  7  8  9 10
Text (T):     O  N  I  O  N  I  O  N  S  P  L
Pattern (P):  O  N  I  O  N  S

Pre-processing the pattern to prepare the LPS table:

Pattern (P):  O  N  I  O  N  S
LPS:          0  0  0  1  2  0
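
KMP Way…

Position:     1  2  3  4  5  6     (j = 0 means nothing is matched yet)
Pattern (P):  O  N  I  O  N  S
LPS:          0  0  0  1  2  0

Index:        0  1  2  3  4  5  6  7  8  9 10
Text (T):     O  N  I  O  N  I  O  N  S  P  L
Pass 1:       O  N  I  O  N  S
Pass 2:                O  N  I  O  N  S

Pass    i       j (0-based pattern index)    Result
1       0-5     0-5                          mismatch (X) at i = 5
2       5-8     2-5                          match at index 3

In the steps below, pattern positions are numbered from 1 and j is the number of pattern characters matched so far.

1. Compare T[i] and P[j+1].
2. If they match, advance (++) both i and j; continue as long as characters match.
   • T[0-4] and P[1-5] match
   • If the end of the pattern is reached, the match is successful
3. When a mismatch happens:
   • if (j == 0) advance i; else set j = LPS[j]; go to step 1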
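
The three steps can be sketched in C as follows. This is an illustrative 0-indexed version, not the slides' own code: the slide's P[j+1] corresponds to pat[j] here, and its j = LPS[j] corresponds to j = lps[j - 1]. The name kmp_search and the choice to return only the first match are assumptions made for this example.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Returns the index of the first occurrence of pat in txt, or -1 if
 * there is none (also -1 if the table cannot be allocated). */
long kmp_search(const char *txt, const char *pat)
{
    assert(txt != NULL && pat != NULL);
    size_t n = strlen(txt);
    size_t m = strlen(pat);
    if (m == 0) return 0;          /* empty pattern matches at index 0 */
    if (m > n) return -1;

    size_t *lps = malloc(m * sizeof *lps);
    if (lps == NULL) return -1;

    /* Pre-processing: build the LPS table for the pattern. */
    lps[0] = 0;
    for (size_t i = 1, len = 0; i < m; ) {
        if (pat[i] == pat[len])  lps[i++] = ++len;
        else if (len > 0)        len = lps[len - 1];
        else                     lps[i++] = 0;
    }

    /* Search: j is the number of pattern characters matched so far. */
    long result = -1;
    size_t i = 0, j = 0;
    while (i < n) {
        if (txt[i] == pat[j]) {            /* steps 1 and 2 */
            i++;
            j++;
            if (j == m) {                  /* end of pattern reached */
                result = (long)(i - m);
                break;
            }
        } else if (j > 0) {                /* step 3: reuse the matched prefix */
            j = lps[j - 1];                /* i stays where it is */
        } else {
            i++;                           /* step 3 with j == 0 */
        }
    }
    free(lps);
    return result;
}
```

Called as kmp_search("ONIONIONSPL", "ONIONS"), it returns 3: after the mismatch at i = 5 the search resumes with j = 2 instead of restarting the pattern.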
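
KMP Way …

Position:     1  2  3  4  5     (j = 0 means nothing is matched yet)
Pattern (P):  A  B  A  B  D
LPS:          0  0  1  2  0

Index:        0  1  2  3  4  5  6  7  8  9 10 11 12 13 14
Text (T):     A  B  A  B  C  A  B  C  A  B  A  B  A  B  D

i        j        Event
0-3      0 to 4   T[0-3] and P[1-4] match
4        4        mismatch (X): C vs D; j = LPS[4] = 2
4        2        mismatch (X): C vs A; j = LPS[2] = 0
4        0        mismatch (X): C vs A; j == 0, so move i to 5
5-6      0 to 2   T[5-6] and P[1-2] match
7        2        mismatch (X): C vs A; j = LPS[2] = 0
7        0        mismatch (X): C vs A; j == 0, so move i to 8
8-11     0 to 4   T[8-11] and P[1-4] match
12       4        mismatch (X): A vs D; j = LPS[4] = 2
12-14    2 to 5   T[12-14] and P[3-5] match; end of pattern reached

Match at index 10 (T[10-14])

In the steps below, pattern positions are numbered from 1 and j is the number of pattern characters matched so far.

1. Compare T[i] and P[j+1].
2. If they match, advance (++) both i and j; continue as long as characters match.
   • T[0-3] and P[1-4] match
   • If the end of the pattern is reached, the match is successful
3. When a mismatch happens:
   • if (j == 0) advance i; else set j = LPS[j]; go to step 1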
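
A trace like this one can be generated mechanically. The sketch below prints one line per mismatch, using 0-based text indices and j as the number of pattern characters matched so far; kmp_trace, MAX_PAT and the mismatches counter are hypothetical names introduced for this illustration.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_PAT 64   /* assumed upper bound on the pattern length in this sketch */

/* Runs KMP, prints every mismatch as "Mismatch i = ..., j = ...",
 * stores the number of mismatches in *mismatches, and returns the
 * start index of the first match, or -1 if there is none. */
long kmp_trace(const char *txt, const char *pat, int *mismatches)
{
    assert(txt != NULL && pat != NULL && mismatches != NULL);
    size_t n = strlen(txt);
    size_t m = strlen(pat);
    assert(m > 0 && m <= MAX_PAT);

    size_t lps[MAX_PAT];
    lps[0] = 0;
    for (size_t i = 1, len = 0; i < m; ) {   /* build the LPS table */
        if (pat[i] == pat[len])  lps[i++] = ++len;
        else if (len > 0)        len = lps[len - 1];
        else                     lps[i++] = 0;
    }

    *mismatches = 0;
    size_t i = 0, j = 0;
    while (i < n) {
        if (txt[i] == pat[j]) {
            i++;
            j++;
            if (j == m)
                return (long)(i - m);        /* match found */
        } else {
            printf("Mismatch i = %zu, j = %zu\n", i, j);
            (*mismatches)++;
            if (j > 0)
                j = lps[j - 1];              /* fall back in the pattern, keep i */
            else
                i++;                         /* nothing matched: advance the text */
        }
    }
    return -1;
}
```

Running it on the text and pattern of this example lists each (i, j) pair at which a comparison fails, which makes it easy to confirm that i never decreases between lines.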
