The Rabin-Karp Algorithm: String Matching

The Rabin-Karp algorithm provides an efficient way to perform string matching by comparing hash values of strings rather than the strings themselves. It works by assigning numeric values to characters, computing a hash of the pattern, and then incrementally computing hashes of substrings in the text to compare against the pattern hash. While its worst-case performance is the same as naive string matching, it has better expected performance of O(n) when the number of hash matches is small.

Uploaded by

SML

The Rabin-Karp Algorithm
STRING MATCHING
Background
 String matching
 Naïve method
n ≡ size of input string
m ≡ size of pattern to be matched
 O( (n-m+1)m )
Θ( n² ) if m = floor( n/2 )
 We can do better
String Matching Problem
 We assume that the text is an array T[1..n] of
length n and that the pattern is an array P[1..m] of
length m, where m << n. We also assume that the
elements of P and T are characters in a finite
alphabet Σ (e.g., Σ = {a, b}).
 Example: find P = ‘aab’ in T = ‘abbaabaaaab’.
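The naïve method named on the Background slide can be sketched in Python as follows. This is a minimal illustration rather than code from the slides: the function name `naive_match` and the use of 0-based shifts are assumptions of this sketch.

```python
def naive_match(text: str, pattern: str) -> list[int]:
    """Return every 0-based shift at which pattern occurs in text."""
    n, m = len(text), len(pattern)
    shifts = []
    # Try each of the n - m + 1 possible alignments.
    for s in range(n - m + 1):
        # The slice comparison inspects up to m characters,
        # which gives O((n-m+1)m) work overall.
        if text[s:s + m] == pattern:
            shifts.append(s)
    return shifts


print(naive_match("abbaabaaaab", "aab"))  # [3, 8]
```

In the slides' 1-based array notation these two occurrences start at T[4] and T[9].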
String Matching Problem
(Continued)
 The idea of the string matching problem is that we
want to find all occurrences of the pattern P in the
given text T.
 We could use the brute force method for string
matching, which iterates over T. At each
position, we compare the next m characters against P until all
letters match or a mismatch is found.
 The worst-case running time is O(nm).
Definition of Rabin-Karp

A string search algorithm that compares the
hash values of strings, rather than the strings
themselves. For efficiency, the hash value
of the next position in the text is easily
computed from the hash value of the current
position.
How it works

 Consider a hashing scheme


 Each symbol in alphabet Σ can be
represented by an ordinal value { 0, 1, 2,
..., d-1 }
|Σ| = d
“Radix-d digits”
How it works
 Hash pattern P into a numeric value
 Let a string be represented by the sum of these
digits
 Horner’s rule (§ 30.1)
 Example
{ A, B, C, ..., Z } → { 0, 1, 2, ..., 25 }
 BAN → 1 + 0 + 13 = 14
 CARD → 2 + 0 + 17 + 3 = 22
Upper limits
 Problem
 For long patterns, or for large alphabets, the number representing a given string
may be too large to be practical
 Solution
 Use MOD operation
 When MOD q, values will be < q
 Example
 BAN = 1 + 0 + 13 = 14
 14 mod 13 = 1
 BAN → 1
 CARD = 2 + 0 + 17 + 3 = 22
 22 mod 13 = 9
 CARD → 9
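The two examples above can be reproduced with a few lines of Python. This is only a sketch of the simplified additive scheme used on these slides; the function name `additive_hash` is hypothetical, and it assumes uppercase A to Z mapped to 0 to 25.

```python
def additive_hash(word: str, q: int) -> int:
    """Sum the ordinal values of the letters (A=0, ..., Z=25), then reduce mod q."""
    total = sum(ord(ch) - ord("A") for ch in word.upper())
    return total % q


print(additive_hash("BAN", 13))   # 14 mod 13 = 1
print(additive_hash("CARD", 13))  # 22 mod 13 = 9
```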
How Rabin-Karp works

 Let characters in both arrays T and P be digits in


radix-d notation, where d = |Σ|.
 Let p be the numeric value of P read as a radix-d number
 Choose a prime number q such that dq fits within a
computer word to speed computations.
 Compute (p mod q)
 The value of p mod q is what we will be using to find
all matches of the pattern P in T.
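Computing p mod q without ever forming the full value of p is typically done with Horner's rule, reducing after every digit. The sketch below assumes the pattern is already given as a list of radix-d digits; the name `horner_mod` is hypothetical.

```python
def horner_mod(digits: list[int], d: int, q: int) -> int:
    """Evaluate the radix-d number given by digits, reduced mod q at every step."""
    value = 0
    for digit in digits:
        # Shift the running value one radix-d place to the left, add the next digit.
        value = (d * value + digit) % q
    return value


pattern = [2, 6]                    # P = 26 as decimal digits
print(horner_mod(pattern, 10, 11))  # 26 mod 11 = 4
```

Because the running value is reduced at every step, intermediate results stay below dq, which is why q is chosen so that dq fits in a machine word.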
How Rabin-Karp works
(continued)
 Compute (T[s+1, .., s+m] mod q) for s = 0 ..
n-m
 Test against P only those sequences in T
having the same (mod q) value
 (T[s+1, .., s+m] mod q) can be
incrementally computed by subtracting the
high-order digit, shifting, and adding the low-order
digit, all in modulo q arithmetic.
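As a worked instance of this incremental update, take decimal digits (d = 10), windows of length m = 2, and q = 11; these values are chosen here purely for illustration. Writing t_s for the hash of the window at shift s and h = d^(m-1) mod q = 10, moving the window over a text beginning 3 1 4 ... from 31 to 14 gives:

```latex
\begin{aligned}
t_0 &= 31 \bmod 11 = 9,\\
t_1 &= \bigl(d\,(t_0 - T[1]\,h) + T[3]\bigr) \bmod q\\
    &= \bigl(10\,(9 - 3\cdot 10) + 4\bigr) \bmod 11\\
    &= -206 \bmod 11 = 3 = 14 \bmod 11.
\end{aligned}
```

Here mod denotes the non-negative remainder. In a language whose remainder operator can return a negative result, q would need to be added before the final reduction.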
Searching
Spurious Hits
 Question
 Does a hash value match mean that the patterns match?
 Answer
 No – these are called “spurious hits”
 Possible cases
 MOD operation interfered with uniqueness of hash values
 14 mod 13 = 1
 27 mod 13 = 1
 MOD value q is usually chosen as a prime such that 10q just fits within 1 computer word
 Information is lost in generalization (addition)
 BAN → 1 + 0 + 13 = 14
 CAM → 2 + 0 + 12 = 14
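The BAN/CAM collision comes from the additive scheme ignoring letter order. A positional (radix-d) value computed with Horner's rule keeps the two words apart, although collisions can still occur once the value is reduced mod q. The sketch below contrasts the two; the helper names are hypothetical and d = 26 is assumed for the uppercase alphabet.

```python
def ordinals(word: str) -> list[int]:
    """Map letters to ordinal values A=0, ..., Z=25."""
    return [ord(ch) - ord("A") for ch in word]


def additive(word: str) -> int:
    """Order-insensitive hash: the plain sum of the ordinals."""
    return sum(ordinals(word))


def positional(word: str, d: int = 26) -> int:
    """Radix-d value of the word, computed with Horner's rule."""
    value = 0
    for digit in ordinals(word):
        value = d * value + digit
    return value


print(additive("BAN"), additive("CAM"))      # 14 14 (collision)
print(positional("BAN"), positional("CAM"))  # 689 1364 (distinct)
```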
Code
RABIN-KARP-MATCHER( T, P, d, q )
  n ← length[ T ]
  m ← length[ P ]
  h ← d^(m-1) mod q
  p ← 0
  t_0 ← 0
  for i ← 1 to m                      ► Preprocessing
      do p ← ( d*p + P[ i ] ) mod q
         t_0 ← ( d*t_0 + T[ i ] ) mod q
  for s ← 0 to n - m                  ► Matching
      do if p = t_s
            then if P[ 1..m ] = T[ s+1 .. s+m ]
                    then print “Pattern occurs with shift” s
         if s < n - m
            then t_(s+1) ← ( d * ( t_s - T[ s + 1 ] * h ) + T[ s + m + 1 ] ) mod q
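The pseudocode above translates fairly directly to Python. The following is a sketch under a few assumptions that are not in the slides: indices are 0-based, characters are mapped to integers with `ord`, and the defaults d = 256 and q = 101 are arbitrary illustrative choices.

```python
def rabin_karp(text: str, pattern: str, d: int = 256, q: int = 101) -> list[int]:
    """Return all 0-based shifts where pattern occurs in text."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)          # weight of the high-order digit: d^(m-1) mod q
    p = t = 0
    for i in range(m):            # preprocessing: hash the pattern and the first window
        p = (d * p + ord(pattern[i])) % q
        t = (d * t + ord(text[i])) % q
    shifts = []
    for s in range(n - m + 1):    # matching
        # Compare character by character only when the hashes agree.
        if p == t and text[s:s + m] == pattern:
            shifts.append(s)
        if s < n - m:
            # Drop text[s], shift left one digit, append text[s + m].
            t = (d * (t - ord(text[s]) * h) + ord(text[s + m])) % q
    return shifts


print(rabin_karp("31415926535", "26", d=10, q=11))  # [6]
print(rabin_karp("abbaabaaaab", "aab"))              # [3, 8]
```

Python's `%` always returns a non-negative result for a positive modulus, so the subtraction in the rolling update needs no extra correction here.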
A Rabin-Karp example

 Given T = 31415926535 and P = 26


 We choose q = 11
P mod q = 26 mod 11 = 4
[3 1] 4 1 5 9 2 6 5 3 5
31 mod 11 = 9 not equal to 4

3 [1 4] 1 5 9 2 6 5 3 5
14 mod 11 = 3 not equal to 4

3 1 [4 1] 5 9 2 6 5 3 5
41 mod 11 = 8 not equal to 4
Rabin-Karp example continued

3 1 4 [1 5] 9 2 6 5 3 5
15 mod 11 = 4 equal to 4 -> spurious hit

3 1 4 1 [5 9] 2 6 5 3 5
59 mod 11 = 4 equal to 4 -> spurious hit

3 1 4 1 5 [9 2] 6 5 3 5
92 mod 11 = 4 equal to 4 -> spurious hit

3 1 4 1 5 9 [2 6] 5 3 5
26 mod 11 = 4 equal to 4 -> an exact match!!

3 1 4 1 5 9 2 [6 5] 3 5
65 mod 11 = 10 not equal to 4
Rabin-Karp example continued
3 1 4 1 5 9 2 6 [5 3] 5
53 mod 11 = 9 not equal to 4

3 1 4 1 5 9 2 6 5 [3 5]
35 mod 11 = 2 not equal to 4

As we can see, when a hash match is found, the window is compared
character by character with P to confirm that it is a true match.
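The walk-through above can be checked mechanically. The sketch below is a hypothetical helper (not part of the slides) that separates true matches from spurious hits for digit strings. For clarity it re-hashes each window from scratch instead of rolling the hash.

```python
def classify_shifts(text: str, pattern: str, q: int) -> tuple[list[int], list[int]]:
    """Return (true matches, spurious hits) as lists of 0-based shifts.

    Digit strings only: each window is hashed as its decimal value mod q.
    """
    m = len(pattern)
    p = int(pattern) % q
    matches, spurious = [], []
    for s in range(len(text) - m + 1):
        window = text[s:s + m]
        if int(window) % q == p:
            # The hashes agree; only a direct comparison tells a match from a spurious hit.
            (matches if window == pattern else spurious).append(s)
    return matches, spurious


matches, spurious = classify_shifts("31415926535", "26", 11)
print(matches)   # [6]
print(spurious)  # [3, 4, 5]
```

With a larger prime such as q = 101, every two-digit window hashes to itself, so the spurious hits disappear for this text.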
Complexity

 The running time of the Rabin-Karp algorithm in


the worst-case scenario is O((n-m+1)m), but it has a
good average-case running time.
 If the expected number of valid shifts is small,
O(1), and the prime q is chosen to be quite large,
then the Rabin-Karp algorithm can be expected to
run in time O(n+m) plus the time required to
process spurious hits.
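The cost of spurious hits can be estimated under the heuristic assumption that reduction mod q behaves like a random mapping into {0, ..., q-1}, so that any given window collides with the pattern hash with probability about 1/q. With v (a symbol introduced here) denoting the number of valid shifts, the usual estimate is:

```latex
\mathbb{E}[\text{spurious hits}] = O(n/q),
\qquad
\text{expected matching time} = O(n) + O\bigl(m\,(v + n/q)\bigr).
```

If v = O(1) and q is chosen with q ≥ m, the second term is O(m), and the total including the Θ(m) preprocessing is O(n + m).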
Performance
 Preprocessing (determining each pattern hash)
 Θ( m )
 Worst case running time
 Θ( (n-m+1)m )
 No better than naïve method
 Expected case
 If we assume the number of hits is constant
(independent of n), we expect O( n )
 Only pattern-match “hits” are verified, not all shifts
