
Pattern Matching: Suffix Tree Applications

The document discusses applications of suffix trees, including exact string matching, longest common substrings, finding repeated substrings, and substring matching. It describes how a suffix tree can be constructed in linear time and space, and then patterns can be matched in linear time based on the tree structure. The document also covers the Aho-Corasick algorithm for exact set matching as an alternative to using suffix trees.

Uploaded by

iqcst
Copyright
© Attribution Non-Commercial (BY-NC)

Pattern Matching:

Suffix Tree Applications


Applications

 Exact string and substring matching


 Longest common substrings
 Finding and representing repeated substrings
efficiently
  Applications that lead to alternative, space-efficient implementations
 Matching statistics
 Suffix Arrays
Exact Set matching

  Input
  A set of patterns P = {P1, P2, …, Pk} with total length $\sum_{i=1}^{k} |P_i| = n$
 Text T of length m
 Output
 Positions of all occurrences of each pattern Pi in T
 Solution method
 Preprocess to create suffix tree for T
 O(m) time, O(m) space
 Maximally match each Pi in suffix tree
 O(|P1| ) +O(|P2|) … +O(|Pk| ) = O(n)
 Output all leaf positions below match point
 O(k) time where k is number of total matches
Exact set matching using Aho-Corasick
  The Aho-Corasick algorithm is a classical solution to exact set matching
 build keyword tree of set of patterns P
 A keyword tree for a pattern set P is a rooted tree T
such that:
 Each edge e is labeled by a character
  Any two edges out of a node have different labels
  Define L(v) of a node v as the concatenation of the edge labels on the path from the root to v
  For each Pi ∈ P there is a node v such that L(v) = Pi, and for each leaf v there is a Pi = L(v)
Example of Aho-Corasick

 Example P = {abce,abe,dce,ac}
[Figure: keyword tree for P = {abce, abe, dce, ac}, with numbered nodes and root 0.]
Example of Aho-Corasick

 Example P = {abce,ababc,abac}
[Figure: keyword tree for P = {abce, ababc, abac}, with numbered nodes and root 0.]

Resume Link
As in the KMP algorithm, when a mismatch occurs at node v, the comparison resumes by following the resume link of its parent.
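
The keyword tree and resume links can be sketched in a few lines of Python. This is a minimal illustration, not the deck's own code: the names build_keyword_tree and search are hypothetical, resume links are stored per node as in the usual failure-link formulation, and the test text is made up for the example.

```python
from collections import deque


def build_keyword_tree(patterns):
    """Build the keyword tree of the patterns plus one resume link per node."""
    goto = [{}]   # goto[v][ch] -> child of node v along the edge labeled ch
    out = [[]]    # out[v] -> indices of patterns that end at node v
    for idx, pat in enumerate(patterns):
        v = 0
        for ch in pat:
            if ch not in goto[v]:
                goto.append({})
                out.append([])
                goto[v][ch] = len(goto) - 1
            v = goto[v][ch]
        out[v].append(idx)

    fail = [0] * len(goto)            # resume link of every node (root = 0)
    queue = deque(goto[0].values())   # depth-1 nodes resume at the root
    while queue:                      # breadth-first, so shallower links exist
        v = queue.popleft()
        for ch, child in goto[v].items():
            f = fail[v]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            # Patterns ending at the resume target also end here.
            out[child] = out[child] + out[fail[child]]
            queue.append(child)
    return goto, fail, out


def search(text, patterns):
    """Return sorted (1-based start position, pattern) pairs found in text."""
    goto, fail, out = build_keyword_tree(patterns)
    hits = []
    v = 0
    for pos, ch in enumerate(text, start=1):
        while v and ch not in goto[v]:
            v = fail[v]               # mismatch: follow resume links
        v = goto[v].get(ch, 0)
        for idx in out[v]:
            pat = patterns[idx]
            hits.append((pos - len(pat) + 1, pat))
    return sorted(hits)
```

Calling search("abcedceabeac", ["abce", "abe", "dce", "ac"]) scans the text once and reports every pattern occurrence with its starting position.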
Aho-Corasick vs. Suffix tree
 Aho-Corasick Approach
 O(n) preprocess time and space
 to build keyword tree of set of patterns P
 O(m+k) search time
 Linear time by using the resume link
 Suffix Tree Approach
 O(m) preprocess time and space
 to build suffix tree of T
 O(n+k) search time
  Using matching statistics (defined later), this trade-off can be made similar to that of Aho-Corasick
Substring problem
 Input
 Pattern P of length n

 A set of Text Ti of total length m

 Output
 Position of all occurrences of P in each Text Ti

 Solution method
 Preprocess to create generalized suffix tree for {Ti}
 O(m) time, O(m) space
 Maximally match P in generalized suffix tree
 Output all leaf positions below match point
 O(n+k) time where k is number of total matches
Generalized suffix tree

  T1 = ababc#
  T2 = abd$
[Figure: generalized suffix tree for T1 and T2; edge labels include ab, b, abc#, c#, d$, #, and $.]
Longest Common Substring problem
 Input
 Strings S and T

 Output
  The longest common substring of S and T (and its position in S and T)
  Solution method
  Preprocess to create generalized suffix tree for {S,T}
  Mark each node by whether its subtree contains a leaf node of S, T, or both
  A simple postorder tree traversal does this
  The path label of the node marked "both" with the greatest string depth is the longest common substring of S and T
Common substrings of length k problem

 Input
 Strings S and T
 Integer k
 Output
  All common substrings of S and T (and their positions in S and T) of length at least k
  Solution method
  Same as previous problem
  Look for all nodes marked with both leaf labels (S and T) that have string depth at least k
Longest Common Substrings of more
than two Strings
  Definition: For a given set of K strings, l(j) for 2 <= j <= K is the length of the longest common substring belonging to at least j of the K strings
  Example: {abcedfg, cbcedfa, dbcedg, cbceg, acea}
j       2      3     4    5
l(j)    5      4     3    2
String  bcedf  bced  bce  ce
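
The table can be checked with a brute-force sketch in Python. It enumerates every substring instead of using a generalized suffix tree, so it only illustrates the definition of l(j); the function name longest_common_by_count is hypothetical.

```python
def longest_common_by_count(strings):
    """Return {j: (l(j), one witness substring)} for 2 <= j <= K, by brute force."""
    containing = {}   # substring -> set of indices of the strings that contain it
    for idx, s in enumerate(strings):
        for start in range(len(s)):
            for end in range(start + 1, len(s) + 1):
                containing.setdefault(s[start:end], set()).add(idx)

    result = {}
    for j in range(2, len(strings) + 1):
        candidates = [sub for sub, owners in containing.items() if len(owners) >= j]
        best = max(candidates, key=len, default="")
        result[j] = (len(best), best)
    return result
```

For the five example strings this gives l(2) = 5, l(3) = 4, l(4) = 3, and l(5) = 2, with witnesses bcedf, bced, bce, and ce.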
Longest Common Substrings of more
than two Strings
  Input
  Strings S1, …, SK, with total length $\sum_{i=1}^{K} |S_i| = n$
  Output
  l(j) (and positions in Si) for 2 <= j <= K
  Solution method
  Build a generalized suffix tree for the K strings
  Each string has a unique end character, so each leaf shows up only once
  Define c(v): number of distinct leaf labels in subtree rooted at node v, and d(v): string-depth from root to node v
  Given c(v) and d(v), do a simple traversal of the tree to find l(j) for j = 2, …, K and pointers to locations in substrings
  Computing c(v) efficiently
  The number of leaves is not correct, as some leaves may have the same label
  Use a length-K bit vector, 1 bit per string in the set
  OR your way up the tree
  Each OR op takes O(K) time, which gives O(Kn) running time
  Can be improved to O(n) later
Repeated Substrings
 Definition:
  A maximal pair in S is a pair of identical substrings α and β in S such that the character to the immediate left (right) of α is different from the character to the immediate left (right) of β.
  Add unique characters to the front and end of S to include prefixes and suffixes.
 Representation: (p1, p2, n’)
 starting positions and length of the maximal pair
 R(S) is the set of all triples representing maximal
pairs in S
Example of Repeated substrings
Position:  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
S =        a  x  y  z  v  e  x  y  z  b  c  d  h  x  y  z  v  c
 (2, 7, 3) is a maximal pair
 (7, 14, 3) is a maximal pair
 (2, 14, 3) is not a maximal pair
 (2, 14, 4) is a maximal pair
Repeated Substrings
  A maximal repeat α is a substring of S that is the substring defined by a maximal pair of S
  R’(S) is the set of maximal repeats, and |R’(S)| ≤ |R(S)|
  Previous example
  xyz and xyzv are maximal repeats of S
  |R’(S)| is smaller than |R(S)| here: xyz appears once in R’(S), but both (2, 7, 3) and (7, 14, 3) are in R(S)
Maximal Repeated Substrings

 Maximal repeats
 Input
 String S (length n)
 Output
 R’(S)
 Lemma
  If α is a maximal repeat in S, then α is the path-label of an internal node v in the suffix tree T of S
  α does not end in the middle of an edge
Maximal Repeated substrings
  Definition: the left character of position i is S[i-1]
  The left character of a leaf of a suffix tree T is the left character of the suffix position represented by that leaf
  A node v of T is called left diverse if at least 2 leaves in v’s subtree have different left characters
  Theorem
  String α labeling the path to an internal node v of T is a maximal repeat if and only if v is left diverse
  Left diversity captures that the character before α differs between occurrences
Example of left diverse
 S = ababc
[Figure: suffix tree of S = ababc, with root edges ab, b, and c; the left diverse node is marked.]
Maximal Repeated substrings

 Solution method
 Construct suffix tree for S
  There are at most n maximal repeats
  The suffix tree has n leaves
  All internal nodes except the root have at least two children
  Therefore there are at most n internal nodes, and each maximal repeat is the path label of one of them
Maximal Repeated substrings
 Find all left diverse nodes in linear time
 All nodes will have a left character label
 Leaf node:
 Label leaves with their left character
 Internal node v:
 If any child is left diverse, so is v
  If two children have different left character labels, v is left diverse
  Otherwise, take on left character value of children
  Compact representation
  Node v in T is a frontier node if:
  v is left diverse
 none of v’s children are left diverse
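
The bottom-up labeling can be sketched in Python as follows. To stay short, the sketch builds an uncompressed suffix trie in O(n^2) time instead of a linear-time suffix tree, and treats the branching trie nodes as the internal nodes of the suffix tree. The name maximal_repeats, the terminator $, and the placeholder left character ^ for the first suffix are assumptions for illustration; the recursion depth grows with the length of S, so this is meant for short strings only.

```python
def maximal_repeats(s):
    """Return the sorted maximal repeats of s, found via left diverse nodes."""
    text = s + "$"                      # terminator, assumed absent from s
    root = {}
    for start in range(len(text)):      # insert every suffix into the trie
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start            # 0-based start of this suffix
    repeats = []

    def visit(node, label):
        """Return the shared left character below node, or None if left diverse."""
        if "leaf" in node:
            start = node["leaf"]
            return text[start - 1] if start > 0 else "^"
        lefts = [visit(child, label + ch) for ch, child in node.items()]
        diverse = None in lefts or len(set(lefts)) > 1
        # Only branching nodes correspond to internal suffix tree nodes;
        # the root (empty label) is not a repeat.
        if diverse and label and len(node) > 1:
            repeats.append(label)
        return None if diverse else lefts[0]

    visit(root, "")
    return sorted(repeats)
```

For S = ababc the only left diverse internal node has path label ab, and for the earlier example string the result contains xyz and xyzv (plus the single character c, which also satisfies the definition).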
Maximal Repeated substrings

 Time complexity
  Construct suffix tree for S: O(n)
  Find all left diverse nodes: O(n)
  Compact representation: O(k), where k is the number of maximal pairs
Supermaximal repeated substrings
  A supermaximal repeat α is a maximal repeat of S that never occurs as a substring of another maximal repeat of S
 Previous example
 xyzv is a supermaximal repeat of S
 xyz is NOT a supermaximal repeat of S
Supermaximal repeated substrings
 Supermaximal repeats
 Input
 String S (length n)
 Output
 The set of supermaximal repeats of S
 Theorem:
  A left diverse node v represents a supermaximal repeat if and only if
 all of v’s children are leaves
 and each has a distinct left character
Matching Statistics
 Input:
 Pattern P of length n
 Text T of length m
 Output
 Compute ms(i) for 1 <=i <= m
 Definition of ms(i)
  For 1 <= i <= m, matching statistic ms(i) is the length of the longest substring of T starting at position i that matches a substring somewhere in P.
Matching Statistics

  With matching statistics, one can solve several problems with less space than a suffix tree
 Exact matching example
 We’ll show an O(n) preprocessing time and O(m) search
time solution matching the traditional methods
 P matches substring starting at i in T if and only if
ms(i) = |P|
Example of Matching Statistics
T = TTATTATTTAGC
P = TTAGC
[Figure: P aligned under T at each starting position i.]
i      1  2  3  4  5  6  7  8
ms(i)  3  2  1  3  2  1  2  5
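
The ms values in this example can be reproduced with a brute-force Python sketch. It compares against an explicit set of all substrings of P, so it illustrates the definition only and is not the linear-time suffix-tree method described next; the name matching_statistics is hypothetical.

```python
def matching_statistics(text, pattern):
    """Return [ms(1), ..., ms(m)] for text against pattern, by brute force."""
    substrings = {
        pattern[i:j]
        for i in range(len(pattern))
        for j in range(i + 1, len(pattern) + 1)
    }
    ms = []
    for i in range(len(text)):
        length = 0
        # Extend the match while the longer prefix still occurs in pattern.
        while i + length < len(text) and text[i:i + length + 1] in substrings:
            length += 1
        ms.append(length)
    return ms
```

For T = TTATTATTTAGC and P = TTAGC the first eight values match the table, and ms(8) = 5 = |P| marks the single exact occurrence of P in T.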
Matching Statistics

 Solution method
 Compute suffix tree of P retaining suffix links
 Adding location of substring in P
  p(i): a location in P such that the substring at p(i) matches the substring starting at T(i) for exactly ms(i) positions
  Before computing ms(i) values, mark each node in the suffix tree with the leaf number of one of its leaves
 Simply output this value when outputting ms(i) values
Matching Statistics
  Compute ms(1): match T against the tree
 Get ms(i+1) from ms(i)
 Assume we are at some node v in the tree
 If it is internal, follow suffix link to s(v)
 Else if it is a leaf, go up one level to its parent w
 If w is an internal node, follow suffix link to s(w)
 Traverse downwards using skip/count trick until we have
matched all the characters in edge label (w,v)
 Now match against T character by character till we have a
mismatch and can output ms(i+1)
Applying matching statistics to LCS
problem
 Input
 strings S and T
 Output
 longest common substring of S and T
 Solution method
 Compute suffix tree for shorter string, say S
 Compute ms(i) values for T
 Maximal ms(i) value identifies LCS
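
A small Python sketch of this idea, assuming a brute-force substring test in place of the suffix tree of S (so it shows the "maximal ms(i)" step, not the linear running time). The function name longest_common_substring is hypothetical.

```python
def longest_common_substring(s, t):
    """Return one longest common substring of s and t via matching statistics."""
    best_len, best_pos = 0, 0
    for i in range(len(t)):
        length = 0
        # ms(i): longest prefix of t[i:] that occurs somewhere in s.
        while i + length < len(t) and t[i:i + length + 1] in s:
            length += 1
        if length > best_len:           # keep the position of the maximal ms(i)
            best_len, best_pos = length, i
    return t[best_pos:best_pos + best_len]
```

For example, longest_common_substring("abcedfg", "cbcedfa") returns bcedf.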
Suffix Arrays
 Input
 Text T of length m
 Output
 Pos array
 Definition of Pos array
  A suffix array for T, called Pos, is an array of integers in the range 1 to m specifying the lexicographic order of the m suffixes of string T
 Pos[k] = i iff Ti is the kth smallest suffix in the m suffixes
 Add terminating character $ which is lexically smallest
Example of Suffix Arrays
  T = axfcaxgx#
Suffixes          Lexicographic order
1. axfcaxgx#      9. #
2. xfcaxgx#       1. axfcaxgx#
3. fcaxgx#        5. axgx#
4. caxgx#         4. caxgx#
5. axgx#          3. fcaxgx#
6. xgx#           7. gx#
7. gx#            8. x#
8. x#             2. xfcaxgx#
9. #              6. xgx#

k    1  2  3  4  5  6  7  8  9
Pos  9  1  5  4  3  7  8  2  6
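
The Pos array for this example can be reproduced by simply sorting the suffixes. The Python sketch below is quadratic or worse in the worst case and is not the suffix-tree construction described next; it relies on the terminator # comparing smaller than the letters, which holds for ASCII. The name suffix_array is hypothetical.

```python
def suffix_array(text):
    """Return Pos: 1-based suffix start positions in lexicographic order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])
```

Calling suffix_array("axfcaxgx#") yields the Pos row of the table.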
Suffix Arrays

 Solution method
 Compute suffix tree of T
  Do a lexical depth-first traversal of the suffix tree, filling Pos(k) with leaves in the order they are encountered
  Edge (v,u) is lexically smaller than edge (v,w) iff the first character of (v,u) is lexically smaller than the first character of (v,w)
Applying Suffix Arrays to exact pattern
matching
 Input
 Pattern P of length n
 Text T of length m
 Output
 All occurrences of P in T
 Solution method
 Compute suffix array Pos for T
 If P is in T, then all these locations will be grouped
consecutively in Pos
Applying Suffix Arrays to exact pattern
matching
  Using binary search, find the smallest index i’ such that P exactly matches the first n characters of suffix Pos(i’)
  Similarly, find the largest index i such that P exactly matches the first n characters of suffix Pos(i)
 Time complexity O(n log m)
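
The two binary searches can be sketched in Python as below. The suffix array is built by plain sorting for brevity, and each comparison looks at up to n characters, which gives the O(n log m) search bound once Pos is available. The names find_occurrences and first_index are hypothetical.

```python
def find_occurrences(text, pattern):
    """Return sorted 1-based positions where pattern occurs in text."""
    pos = sorted(range(len(text)), key=lambda i: text[i:])  # 0-based suffix array
    n = len(pattern)

    def first_index(strictly_greater):
        """First rank whose n-character prefix is >= (or >) the pattern."""
        lo, hi = 0, len(pos)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[pos[mid]:pos[mid] + n]
            if prefix < pattern or (strictly_greater and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    left = first_index(False)    # smallest rank whose suffix starts with pattern
    right = first_index(True)    # one past the largest such rank
    return sorted(p + 1 for p in pos[left:right])
```

With T = axfcaxgx#, the pattern ax is found at positions 1 and 5, which sit at consecutive ranks 2 and 3 of Pos.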
Longest common prefixes
 Input
 Text T of length m
 Output
 Max(Lcp(i,j)) ,for 1≤ i,j ≤ m and i ≠ j
  Definition of Lcp(i,j): Lcp(i,j) is the length of the longest common prefix of the suffixes of T beginning at Pos[i] and Pos[j].
 Example from Suffix Arrays
 T = axfcaxgx#, Pos[2] = 1 (axfcaxgx#), Pos[3] = 5 (axgx#)
 Lcp(2,3) = 2
Longest common prefixes

 Solution method
 We want to get Lcp in O(m) time
  However, there are potentially O(m^2) different possible pairs of Lcp values
  Crucial point
  Since this is binary search, there are only O(m) values that are ever needed, and these have a lot of structure
Longest common prefixes
  Lcp(i,i+1): string depth of the lowest common ancestor encountered during the lexical depth-first traversal of the suffix tree from the Pos(i) leaf to the Pos(i+1) leaf
  Other Lcp values
  For i < j, $\mathrm{Lcp}(i,j) = \min_{i \le k < j} \mathrm{Lcp}(k,k+1)$
  Take the min of the Lcp values of the children in the binary tree of needed Lcp values (not the suffix tree)
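
The relation between adjacent and general Lcp values can be illustrated with a short Python sketch. It computes the adjacent values by direct character comparison rather than from the suffix tree traversal, so it shows the min property only; the names lcp_adjacent and lcp are hypothetical.

```python
def lcp_adjacent(text):
    """Return [Lcp(1,2), Lcp(2,3), ..., Lcp(m-1,m)] for the suffix array of text."""
    pos = sorted(range(len(text)), key=lambda i: text[i:])  # 0-based suffix array

    def common_prefix(a, b):
        length = 0
        while (a + length < len(text) and b + length < len(text)
               and text[a + length] == text[b + length]):
            length += 1
        return length

    return [common_prefix(pos[k], pos[k + 1]) for k in range(len(pos) - 1)]


def lcp(adjacent, i, j):
    """Lcp(i, j) for 1-based ranks i < j, as the min of the adjacent values."""
    return min(adjacent[i - 1:j - 1])
```

For T = axfcaxgx#, lcp(lcp_adjacent(T), 2, 3) gives 2, in agreement with the earlier example.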
