
KMP Algorithm 1

The document discusses various string matching algorithms: 1. A straightforward algorithm has worst-case complexity of O(nm) by comparing characters sequentially. 2. The Knuth-Morris-Pratt (KMP) algorithm improves this to O(n+m) by building a failure function to skip matching already seen prefixes/suffixes. 3. The Boyer-Moore algorithm further optimizes to sub-linear average time by jumping past sections of text where a match is impossible based on the pattern. It is often the preferred algorithm in practice.

Uploaded by

Anurag Yadav
Copyright
© Attribution Non-Commercial (BY-NC)

String Matching

detecting the occurrence of a particular substring (pattern) in another string (text)

A Straightforward Solution
The Knuth-Morris-Pratt Algorithm
The Boyer-Moore Algorithm

TECH
Computer Science

Straightforward solution
Algorithm: Simple string matching
Input: P and T, the pattern and text strings; m, the length of P. The pattern is assumed to be nonempty.
Output: The return value is the index in T where a copy of P begins, or -1 if no match for P is found.

int simpleScan(char[] P, char[] T, int m)
{
    int match;      // value to return
    int i, j, k;    // i: guessed start of P in T; j: index in T; k: index in P
    match = -1;
    j = 1; k = 1;
    i = j;
    while (endText(T, j) == false)
    {
        if (k > m)
            { match = i; break; }   // match found
        if (t_j == p_k)
            { j++; k++; }
        else
        {
            // Back up over matched characters.
            int backup = k - 1;
            j = j - backup;
            k = k - backup;
            // Slide pattern forward, start over.
            j++;
            i = j;
        }
    }
    return match;
}
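The pseudocode above uses 1-based indexes and an unspecified endText helper. A minimal runnable sketch of the same idea in Python follows; it assumes 0-based indexing, and the name simple_scan is illustrative rather than taken from the slides. It checks for a completed match before testing for the end of the text, so a match that ends on the last text character is still reported.

```python
def simple_scan(pattern: str, text: str) -> int:
    """Return the 0-based index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    i = 0  # current guess at where the pattern begins in the text
    k = 0  # number of pattern characters matched so far at this guess
    while i + m <= n:
        if k == m:
            return i  # all m characters matched
        if text[i + k] == pattern[k]:
            k += 1
        else:
            # Mismatch: slide the pattern forward by one and start over.
            i += 1
            k = 0
    return -1
```

Resetting k to 0 after a mismatch plays the role of the "back up" step: the text position i + k falls back to the character just after the previous guess.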

Analysis
Worst-case complexity is in Θ(mn).
The algorithm needs to back up in the text after a partial match.
It works quite well on average for natural language.
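To make the worst case concrete, the sketch below counts character comparisons made by the straightforward method. The function name count_comparisons and the test strings are illustrative assumptions. A text of repeated "a" with a pattern of the form "aa...ab" forces m comparisons at each of the n - m + 1 starting positions.

```python
def count_comparisons(pattern: str, text: str) -> tuple[int, int]:
    """Return (match index or -1, number of character comparisons made)."""
    m, n = len(pattern), len(text)
    comparisons = 0
    for i in range(n - m + 1):          # each candidate starting position
        for k in range(m):              # compare left to right
            comparisons += 1
            if text[i + k] != pattern[k]:
                break                   # mismatch: try the next start
        else:
            return i, comparisons       # inner loop finished: full match
    return -1, comparisons


# Worst-case style input: every start matches m - 1 characters, then fails.
worst = count_comparisons("aaab", "a" * 10)
# Natural-language style input: most starts fail on the first character.
typical = count_comparisons("cat", "the cat sat")
```

With n = 10 and m = 4 the first call performs (10 - 4 + 1) * 4 = 28 comparisons, while the second needs only 7 to find the match at index 4.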

Finite Automata
Terminology
Σ: the alphabet.
Σ*: the set of all finite-length strings formed using characters from Σ.
xy: concatenation of two strings x and y.
Prefix: a string w is a prefix of a string x if x = wy for some string y ∈ Σ*.
Suffix: a string w is a suffix of a string x if x = yw for some string y ∈ Σ*.
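As a small illustration of the prefix and suffix definitions, the following sketch lists every prefix and suffix of a string. The helper names are illustrative; note that under the definitions above the empty string and the string itself both count.

```python
def prefixes(x: str) -> list[str]:
    """All w such that x = w + y for some string y."""
    return [x[:i] for i in range(len(x) + 1)]


def suffixes(x: str) -> list[str]:
    """All w such that x = y + w for some string y."""
    return [x[i:] for i in range(len(x) + 1)]


def is_prefix(w: str, x: str) -> bool:
    return x.startswith(w)


def is_suffix(w: str, x: str) -> bool:
    return x.endswith(w)
```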

Finite Automata (contd)

Finite Automata, e.g.,

Algorithm
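One common way to match with a finite automaton is sketched below. It is an assumption about the construction, not necessarily the one on the original slide: state q means "the last q characters read equal the first q characters of P", and the transition table is built naively from that definition. The names build_automaton and automaton_scan are illustrative, and the pattern is assumed to be nonempty.

```python
def build_automaton(pattern: str, alphabet: set[str]) -> list[dict[str, int]]:
    """delta[q][c] = length of the longest prefix of pattern that is a suffix of pattern[:q] + c."""
    m = len(pattern)
    delta = []
    for q in range(m + 1):
        row = {}
        for c in alphabet:
            s = pattern[:q] + c
            k = min(m, q + 1)
            # Shrink k until the k-character prefix of the pattern is a suffix of s.
            while k > 0 and not s.endswith(pattern[:k]):
                k -= 1
            row[c] = k
        delta.append(row)
    return delta


def automaton_scan(pattern: str, text: str) -> int:
    """Return the 0-based index of the first match, or -1. Assumes a nonempty pattern."""
    m = len(pattern)
    delta = build_automaton(pattern, set(pattern) | set(text))
    q = 0
    for j, c in enumerate(text):
        q = delta[q][c]          # exactly one table lookup per text character
        if q == m:
            return j - m + 1
    return -1
```

The scan itself reads each text character once and never backs up; the cost has moved into building the table, which this naive version does slowly.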

The Knuth-Morris-Pratt algorithm

1. Skip outer iteration i = 3.

2. Skip the first inner iteration (testing n vs. n) at outer iteration i = 4.

Strategy
In general, if there is a partial match of j characters starting at i, then we know what is in positions T[i] ... T[i+j-1]. So we can save work in two ways:
1. Skip outer iterations (for which no match is possible).
2. Skip inner iterations (when there is no need to test known matches).

When a mismatch occurs, we want to slide P forward, but maintain the longest overlap of a prefix of P with a suffix of the part of the text that has matched the pattern so far. The KMP algorithm achieves linear-time performance by capitalizing on this observation, building a simplified finite automaton in which each node has only two links, success and fail.

Sliding the pattern for the KMP algorithm

The Knuth-Morris-Pratt Flowchart


Character labels are inside the nodes.
Each node has two arrows out to other nodes: a success link and a fail link.
The next character is read only after a success link.
A special node, node 0, called "get next char", reads in the next text character.
e.g. P = ABABCB

Construction of the KMP Flowchart


Definition: Fail links
We define fail[k] as the largest r (with r < k) such that p_1 ... p_{r-1} matches p_{k-r+1} ... p_{k-1}. That is, the (r-1)-character prefix of P is identical to the (r-1)-character substring ending at index k-1. Thus the fail links are determined by repetition within P itself.
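The definition can be checked directly with a brute-force sketch that tries each candidate r from largest to smallest. The function name fail_by_definition is illustrative, and fail[1] = 0 is taken as a convention since no r >= 1 satisfies r < 1. This is only for understanding the definition; it is much slower than the construction algorithm.

```python
def fail_by_definition(pattern: str) -> list[int]:
    """Return [fail[1], ..., fail[m]] computed literally from the definition."""
    m = len(pattern)
    fail = [0] * (m + 1)                 # index 0 unused, to keep 1-based indexes
    for k in range(2, m + 1):
        for r in range(k - 1, 0, -1):    # largest r with r < k first
            # p_1..p_{r-1} is pattern[:r-1]; p_{k-r+1}..p_{k-1} is pattern[k-r:k-1]
            if pattern[:r - 1] == pattern[k - r:k - 1]:
                fail[k] = r
                break
    return fail[1:]


links = fail_by_definition("ABABCB")
```

For P = ABABCB this gives the fail links 0, 1, 1, 2, 3, 1 for nodes 1 through 6.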

Algorithm: KMP flowchart construction


Input: P, a string of characters; m, the length of P.
Output: fail, the array of failure links, defined for indexes 1, ..., m. The array is passed in and the algorithm fills it.
void kmpSetup(char[] P, int m, int[] fail)
{
    int k, s;
    fail[1] = 0;
    for (k = 2; k <= m; k++)
    {
        s = fail[k-1];
        while (s >= 1)
        {
            if (p_s == p_{k-1})
                break;
            s = fail[s];
        }
        fail[k] = s + 1;
    }
}
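A direct Python transcription of kmpSetup might look like the sketch below. To keep the 1-based indexes of the pseudocode, it pads the pattern with a dummy character at index 0 and leaves fail[0] unused; both choices are assumptions made for readability, as is the name kmp_setup.

```python
def kmp_setup(pattern: str) -> list[int]:
    """Return the fail array with fail[1..m] filled in; fail[0] is unused."""
    m = len(pattern)
    p = " " + pattern            # pad so that p[1..m] follows the 1-based pseudocode
    fail = [0] * (m + 1)         # fail[1] = 0 already
    for k in range(2, m + 1):
        s = fail[k - 1]
        while s >= 1:
            if p[s] == p[k - 1]:
                break
            s = fail[s]          # follow the chain of shorter overlaps
        fail[k] = s + 1
    return fail


fail = kmp_setup("ABABCB")
```

Each fail[k] reuses fail[k-1] instead of searching from scratch, which is what keeps the construction linear in m.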

The Knuth-Morris-Pratt Scan Algorithm


int kmpScan(char[] P, char[] T, int m, int[] fail)
{
    int match, j, k;
    match = -1;
    j = 1; k = 1;
    while (endText(T, j) == false)
    {
        if (k > m)
            { match = j - m; break; }
        if (k == 0)
            { j++; k = 1; }
        else if (t_j == p_k)
            { j++; k++; }
        else
            k = fail[k];    // Follow fail arrow.
        // Continue loop.
    }
    return match;
}
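The scan can be sketched in runnable Python as follows. The block is self-contained, so it repeats the fail-link construction; it keeps the 1-based indexes internally by padding both strings and converts the result to a 0-based index on return. These conventions and the name kmp_scan are assumptions for the sketch. It tests for a completed match before testing for the end of the text, so a match that ends on the last text character is still reported.

```python
def kmp_scan(pattern: str, text: str) -> int:
    """Return the 0-based index of the first occurrence of pattern in text, or -1."""
    m, n = len(pattern), len(text)
    p = " " + pattern                # p[1..m], as in the pseudocode
    t = " " + text                   # t[1..n]

    # Build the fail links (same steps as kmpSetup).
    fail = [0] * (m + 1)
    for k in range(2, m + 1):
        s = fail[k - 1]
        while s >= 1 and p[s] != p[k - 1]:
            s = fail[s]
        fail[k] = s + 1

    j = k = 1
    while True:
        if k > m:
            return j - m - 1         # 1-based start is j - m
        if j > n:
            return -1                # end of text, no match
        if k == 0:
            j += 1                   # node 0: get next text character
            k = 1
        elif t[j] == p[k]:
            j += 1                   # success link
            k += 1
        else:
            k = fail[k]              # fail link; j does not move back
```

Note that j never decreases, which is why the scan does not need random access to the text.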

Analysis
KMP flowchart construction requires 2m - 3 character comparisons in the worst case.
The scan algorithm requires 2n character comparisons in the worst case.
Overall: worst-case complexity is in Θ(n+m).

The Boyer-Moore Algorithm

Algorithm: Computing jumps for the Boyer-Moore algorithm
Input: Pattern string P; m, the length of P; alphabet size alpha = |Σ|.
Output: Array charJump, defined on indexes 0, ..., alpha-1. The array is passed in and the algorithm fills it.
void computeJumps(char[] P, int m, int alpha, int[] charJump)
{
    char ch; int k;
    for (ch = 0; ch < alpha; ch++)
        charJump[ch] = m;
    for (k = 1; k <= m; k++)
        charJump[p_k] = m - k;
}
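The sketch below computes charJump in Python and uses it in a simplified right-to-left scan. Two assumptions are worth stating: a dictionary stands in for the alphabet-sized array (characters absent from it jump by m), and the scan uses charJump only, taking the larger of charJump and m - k + 1 so that the pattern always moves forward. It is therefore not the full Boyer-Moore scan, which also consults matchJump. The function names are illustrative.

```python
def compute_jumps(pattern: str) -> dict[str, int]:
    """char_jump[c] = distance from the rightmost occurrence of c in pattern to its end."""
    m = len(pattern)
    char_jump = {}
    for k in range(1, m + 1):            # later (rightmost) occurrences overwrite earlier ones
        char_jump[pattern[k - 1]] = m - k
    return char_jump


def char_jump_scan(pattern: str, text: str) -> int:
    """Simplified scan using charJump only. Returns a 0-based index or -1."""
    m, n = len(pattern), len(text)
    char_jump = compute_jumps(pattern)
    j = k = m                            # 1-based positions in text and pattern
    while j <= n:
        if k < 1:
            return j                     # 1-based start is j + 1, so 0-based is j
        if text[j - 1] == pattern[k - 1]:
            j -= 1                       # compare right to left
            k -= 1
        else:
            # Jump so the pattern moves forward by at least one position.
            j += max(char_jump.get(text[j - 1], m), m - k + 1)
            k = m
    return -1
```

When the mismatched text character does not occur in the pattern at all, the scan jumps a full m positions, which is the source of the sublinear average behaviour.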

Computing matchJump

Computing matchjump (e.g.,)

BoyerMooreScan Algorithm

Summary
Straightforward algorithm: O(nm)
Finite-automata algorithm: O(n)
KMP algorithm: O(n+m)
    Relatively easy to implement
    Does not require random access to the text
BM algorithm: O(n+m) in the worst case, sublinear on average
    Fewer character comparisons
    The algorithm of choice in practice for string matching
