Pattern Matching: Suffix Tree Applications
Input
A set of patterns P = {P1, P2, …, Pk} with total length |P1| + |P2| + … + |Pk| = n
Text T of length m
Output
Positions of all occurrences of each pattern Pi in T
Solution method
Preprocess to create suffix tree for T
O(m) time, O(m) space
Maximally match each Pi in suffix tree
O(|P1|) + O(|P2|) + … + O(|Pk|) = O(n)
Output all leaf positions below match point
O(k) time, where k here is the total number of matches (not the number of patterns)
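The match-then-collect-leaves idea can be sketched on a naive, uncompressed suffix trie. This is only an illustration: it takes O(m^2) time and space to build rather than the O(m) of a real suffix tree, it assumes "$" does not occur in the text, and the names `build_suffix_trie`, `find_all` and `match_set` are hypothetical.

```python
def new_node():
    return {"children": {}, "leaf": None}

def build_suffix_trie(text, end="$"):
    """Naive suffix trie: every suffix ends at a leaf holding its 1-based start."""
    root = new_node()
    for start in range(len(text)):
        node = root
        for ch in text[start:] + end:
            node = node["children"].setdefault(ch, new_node())
        node["leaf"] = start + 1
    return root

def leaves_below(node):
    """Collect the suffix positions of all leaves in the subtree of node."""
    found = [] if node["leaf"] is None else [node["leaf"]]
    for child in node["children"].values():
        found.extend(leaves_below(child))
    return found

def find_all(root, pattern):
    """Match pattern from the root, then report every leaf below the match point."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:
            return []          # pattern does not occur in the text
    return sorted(leaves_below(node))

def match_set(text, patterns):
    root = build_suffix_trie(text)   # preprocess the text once
    return {p: find_all(root, p) for p in patterns}
```

For example, `match_set("abcabxabcd", ["abc", "bx"])` reports positions 1 and 7 for abc and position 5 for bx.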
Exact set matching using Aho-Corasick
The Aho-Corasick algorithm is a classical solution to exact set matching
Build a keyword tree for the set of patterns P
A keyword tree for a pattern set P is a rooted tree such that:
Each edge e is labeled by a character
Any two edges out of a node have different labels
Define L(v), the label of a node v, as the concatenation of the edge labels on the path from the root to v
For each Pi ∈ P there is a node v such that L(v) = Pi, and for each leaf v there is a Pi ∈ P with L(v) = Pi
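A minimal sketch of the keyword tree as nested dictionaries, assuming one dictionary per node. The function names and the "pattern" field are illustrative choices, not part of the slides.

```python
def build_keyword_tree(patterns):
    """Return the root of a keyword tree (trie) for the given patterns."""
    root = {"children": {}, "pattern": None}
    for pattern in patterns:
        node = root
        for ch in pattern:
            # Edges out of a node have distinct labels: reuse the child if it exists.
            node = node["children"].setdefault(ch, {"children": {}, "pattern": None})
        node["pattern"] = pattern   # here L(node) == pattern
    return root

def path_labels(node, prefix=""):
    """Yield L(v) for every node v that ends a pattern."""
    if node["pattern"] is not None:
        yield prefix
    for ch, child in sorted(node["children"].items()):
        yield from path_labels(child, prefix + ch)

def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node["children"].values())
```

With P = {abce, abe, dce, ac}, the shared prefixes a and ab are stored once, so the tree has 10 nodes including the root.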
Example of Aho-Corasick
Example P = {abce, abe, dce, ac}
[Figure: keyword tree for P]
Example of Aho-Corasick
Example P = {abce, ababc, abac}
[Figure: keyword tree for P]
Resume Link
As in the KMP algorithm, when a mismatch occurs at node v, the comparison resumes by following the resume link of its parent instead of restarting from the root.
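A compact sketch of a search driven by resume links (usually called failure links in the literature). The array-based layout and the names `goto`, `fail`, `out`, `build_automaton` and `search` are assumptions made for this illustration.

```python
from collections import deque

def build_automaton(patterns):
    # goto[s] maps a character to the next state; state 0 is the root.
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(pattern)
    # Breadth-first pass: the link of a child is derived from the link of its parent.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]   # inherit patterns ending here
    return goto, fail, out

def search(text, patterns):
    """Return (1-based start position, pattern) for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    state, hits = 0, []
    for i, ch in enumerate(text, start=1):
        while state and ch not in goto[state]:
            state = fail[state]                    # resume instead of restarting
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((i - len(pattern) + 1, pattern))
    return hits
```

Each text character is consumed once, and the links only move toward the root, which is what gives the linear search time.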
Aho-Corasick vs. Suffix tree
Aho-Corasick Approach
O(n) preprocess time and space
to build keyword tree of set of patterns P
O(m+k) search time
Linear time by using the resume link
Suffix Tree Approach
O(m) preprocess time and space
to build suffix tree of T
O(n+k) search time
Using matching statistics (defined later), this trade-off can be made similar to that of Aho-Corasick
Substring problem
Input
Pattern P of length n
A set of texts {Ti} of total length m
Output
Positions of all occurrences of P in each text Ti
Solution method
Preprocess to create generalized suffix tree for {Ti}
O(m) time, O(m) space
Maximally match P in generalized suffix tree
Output all leaf positions below match point
O(n+k) time where k is number of total matches
Generalized suffix tree
T1 = ababc#
T2 = abd$
[Figure: generalized suffix tree for T1 and T2]
Longest Common Substring problem
Input
Strings S and T
Output
The longest common substring of S and T (and its position in S and T)
Solution method
Preprocess to create generalized suffix tree for {S,T}
Mark each node by whether its subtree contains a leaf of S, of T, or of both
A simple post-order tree traversal does this
The deepest node (by string depth) marked with both gives the longest common substring
Input
Strings S and T
Integer k
Output
All common substrings of S and T (and their positions in S and T) of length at least k
Solution method
Same as previous problem
Look for all nodes marked with both leaf labels whose string depth is at least k
Longest Common Substrings of more
than two Strings
Definition: For a given set of K strings, l(j) for 2 <= j <= K is the length of the longest common substring belonging to at least j of the K strings
Example: {abcedfg, cbcedfa, dbcedg, cbceg, acea}
j       2      3     4    5
l(j)    5      4     3    2
String  bcedf  bced  bce  ce
Longest Common Substrings of more
than two Strings
Input
Strings S1, …, SK, with total length |S1| + |S2| + … + |SK| = n
Output
l(j) (and positions in Si) for 2 <= j <= K
Solution method
Build a generalized suffix tree for the K strings
Each string has a unique end character, so each leaf shows up only once
Define c(v): the number of distinct leaf labels (string identifiers) in the subtree rooted at node v, and d(v): the string depth of node v
Given c(v) and d(v), a simple traversal of the tree finds l(j) for j = 2, …, K, with pointers to the locations in the strings
Computing c(v) efficiently
Counting leaves is not correct, as some leaves may carry the same label
Use a bit vector of length K, 1 bit per string in the set
OR your way up the tree
Each OR operation takes O(K) time, which gives O(Kn) running time
This can later be improved to O(n)
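The bit-vector computation of c(v) can be sketched on a naive generalized suffix trie. Assumptions: the trie is uncompressed, so it is quadratic in size and only suitable for small inputs, a Python integer plays the role of the K-bit vector, and the name `lcs_lengths` is hypothetical.

```python
def lcs_lengths(strings):
    """Return {j: l(j)} for 2 <= j <= K."""
    K = len(strings)
    root = {"children": {}, "mask": 0}
    for idx, s in enumerate(strings):
        for start in range(len(s)):
            node = root
            for ch in s[start:]:
                node = node["children"].setdefault(ch, {"children": {}, "mask": 0})
            node["mask"] |= 1 << idx       # a suffix of string idx ends here

    # best[c] = greatest depth d(v) among nodes v with c(v) == c
    best = [0] * (K + 1)

    def visit(node, depth):
        mask = node["mask"]
        for child in node["children"].values():
            mask |= visit(child, depth + 1)    # OR the way up the tree
        c = bin(mask).count("1")               # c(v): distinct strings below v
        best[c] = max(best[c], depth)
        return mask

    visit(root, 0)
    # l(j) allows "at least j" strings, so take a running maximum from K down to 2.
    result, running = {}, 0
    for j in range(K, 1, -1):
        running = max(running, best[j])
        result[j] = running
    return result
```

On the example set {abcedfg, cbcedfa, dbcedg, cbceg, acea} this yields l(2) = 5, l(3) = 4, l(4) = 3 and l(5) = 2.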
Repeated Substrings
Definition:
A maximal pair in S is a pair of identical substrings α and β in S such that the character to the immediate left (right) of α is different from the character to the immediate left (right) of β.
Add unique characters to the front and end of S so that prefixes and suffixes are included.
Representation: (p1, p2, n’)
The starting positions and the length of the maximal pair
R(S) is the set of all triples representing maximal pairs in S
Example of Repeated substrings
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
S= a x y z v e x y z b c d h x y z v c
(2, 7, 3) is a maximal pair
(7, 14, 3) is a maximal pair
(2, 14, 3) is not a maximal pair
(2, 14, 4) is a maximal pair
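The four claims above can be checked directly from the definition. This is a small sketch: the sentinels "^" and "$" are assumed not to occur in S, and the function name `is_maximal_pair` is illustrative.

```python
def is_maximal_pair(s, p1, p2, length):
    """Check whether (p1, p2, length) is a maximal pair of s (1-based positions)."""
    padded = "^" + s + "$"   # with the front sentinel, padded[p] is position p of s
    if p1 == p2 or min(p1, p2) < 1 or max(p1, p2) + length - 1 > len(s):
        return False
    if padded[p1:p1 + length] != padded[p2:p2 + length]:
        return False         # the two substrings are not identical
    left_differs = padded[p1 - 1] != padded[p2 - 1]
    right_differs = padded[p1 + length] != padded[p2 + length]
    return left_differs and right_differs

S = "axyzvexyzbcdhxyzvc"
print(is_maximal_pair(S, 2, 14, 3))   # False: both copies of xyz are followed by v
print(is_maximal_pair(S, 2, 14, 4))   # True
```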
Repeated Substrings
A maximal repeat is a substring of S that is the substring defined by a maximal pair of S
R’(S) is the set of maximal repeats, and |R’(S)| ≤ |R(S)|
Previous example
xyz and xyzv are maximal repeats of S
However, xyz appears only once in R’(S), while R(S) contains both (2, 7, 3) and (7, 14, 3)
So |R’(S)| is smaller than |R(S)| here
Maximal Repeated Substrings
Maximal repeats
Input
String S (length n)
Output
R’(S)
Lemma
If α is a maximal repeat in S, then α is the path-label of an internal node v in the suffix tree T of S
α does not end in the middle of an edge
Maximal Repeated substrings
Definition: the left character of position i is S[i-1]
The left character of a leaf of a suffix tree T is the left character of the suffix position represented by that leaf
A node v of T is called left diverse if at least 2 leaves in v’s subtree have different left characters
Theorem
The string labeling the path to an internal node v of T is a maximal repeat if and only if v is left diverse
Left diversity captures that the characters before the occurrences differ
Example of left diverse
S = ababc
[Figure: suffix tree of S with a left diverse node marked]
Maximal Repeated substrings
Solution method
Construct suffix tree for S
There are at most n maximal repeats
The tree has n leaves
All internal nodes except the root have at least two children
Therefore, there are at most n internal nodes, and each maximal repeat is the path-label of one of them
Maximal Repeated substrings
Find all left diverse nodes in linear time
Every node gets a left character label or is marked left diverse
Leaf node:
Label leaves with their left character
Internal node v:
If any child is left diverse, so is v
If two children have different left character labels, v is left diverse
Otherwise, v takes on the common left character of its children
Compact representation
Node v in T is a frontier node if:
v is left diverse
none of v’s children are left diverse
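The bottom-up labeling can be sketched on a naive suffix trie, where branching trie nodes stand in for the internal nodes of the suffix tree. Assumptions: the sentinels "^" and "$" do not occur in S, the trie is quadratic in size rather than linear, and the name `maximal_repeats` is hypothetical.

```python
def maximal_repeats(s, front="^", end="$"):
    text = s + end
    root = {"children": {}, "left": None}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "left": None})
        # Leaf: label it with the character to the left of this suffix.
        node["left"] = text[start - 1] if start > 0 else front

    found = []

    def visit(node, label):
        """Return the node's left character, or None if the node is left diverse."""
        if not node["children"]:
            return node["left"]
        labels = {visit(child, label + ch) for ch, child in node["children"].items()}
        diverse = None in labels or len(labels) > 1
        # Only branching nodes below the root correspond to internal suffix tree nodes.
        if diverse and len(node["children"]) > 1 and label:
            found.append(label)
        return None if diverse else labels.pop()

    visit(root, "")
    return sorted(found)
```

For S = ababc this reports only ab. For the earlier example S = axyzvexyzbcdhxyzvc it reports xyz and xyzv, and also c, whose two occurrences have different left characters and different right characters once the end sentinel is counted.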
Maximal Repeated substrings
Time complexity
Construct suffix tree for S O(n)
Find all left diverse nodes in linear time O(n)
Compact representation O(k), where k is the number of maximal pairs
Supermaximal repeated substrings
A supermaximal repeat is a maximal repeat of S that never occurs as a substring of another maximal repeat of S
Previous example
xyzv is a supermaximal repeat of S
xyz is NOT a supermaximal repeat of S
Supermaximal repeated substrings
Supermaximal repeats
Input
String S (length n)
Output
The set of supermaximal repeats of S
Theorem:
A left diverse node v represents a supermaximal repeat if and only if
all of v’s children are leaves
and each has a distinct left character
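For small inputs, the output can be cross-checked by brute force straight from the definitions. This sketch does not use the theorem or a suffix tree and is far from linear time. The sentinels "^" and "$" are assumed not to occur in S, and both function names are hypothetical.

```python
def maximal_repeats_bruteforce(s):
    """All substrings defined by some maximal pair of s."""
    padded = "^" + s + "$"      # padded[p] is position p of s (1-based)
    n = len(s)
    repeats = set()
    for length in range(1, n):
        for p1 in range(1, n - length + 2):
            for p2 in range(p1 + 1, n - length + 2):
                if (padded[p1:p1 + length] == padded[p2:p2 + length]
                        and padded[p1 - 1] != padded[p2 - 1]
                        and padded[p1 + length] != padded[p2 + length]):
                    repeats.add(padded[p1:p1 + length])
    return repeats

def supermaximal_repeats(s):
    """Maximal repeats that are not a substring of any other maximal repeat."""
    repeats = maximal_repeats_bruteforce(s)
    return sorted(r for r in repeats
                  if not any(r != other and r in other for other in repeats))
```

On S = axyzvexyzbcdhxyzvc, xyz is dropped because it occurs inside xyzv, while xyzv is kept.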
Matching Statistics
Input:
Pattern P of length n
Text T of length m
Output
Compute ms(i) for 1 <= i <= m
Definition of ms(i)
For 1 <= i <= m, the matching statistic ms(i) is the length of the longest substring of T starting at position i that matches a substring somewhere in P.
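The definition translates directly into a naive computation, shown here only to make ms(i) concrete. It relies on Python's substring test and is not the linear-time suffix-link method. The function name and the sample strings are illustrative.

```python
def matching_statistics(text, pattern):
    """Return [ms(1), ..., ms(m)] for text T against pattern P, by brute force."""
    ms = []
    for i in range(len(text)):
        length = 0
        # Extend while the substring of T starting at i still occurs somewhere in P.
        while i + length < len(text) and text[i:i + length + 1] in pattern:
            length += 1
        ms.append(length)
    return ms

print(matching_statistics("abcxabcdex", "wyabcwzqabcdw"))
# [3, 2, 1, 0, 4, 3, 2, 1, 0, 0]
```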
Matching Statistics
[Figure: the string T T A G C aligned at several positions beneath a longer string]
i      1 2 3 4 5 6 7 8
ms(i)  3 2 1 3 2 1 2 5
Matching Statistics
Solution method
Compute the suffix tree of P, retaining suffix links
Adding the location of the matching substring in P
p(i): a location in P such that the substring starting at p(i) matches the substring starting at position i of T for exactly ms(i) positions
Before computing the ms(i) values, mark each node of the suffix tree with the leaf number of one of its leaves
Simply output this value when outputting the ms(i) values
Matching Statistics
Compute ms(1): match T against the tree
Get ms(i+1) from ms(i)
Assume we are at some node v in the tree
If it is internal, follow the suffix link to s(v)
Else if it is a leaf, go up one level to its parent w
If w is an internal node, follow the suffix link to s(w)
Traverse downwards using the skip/count trick until all the characters of the edge label (w,v) have been matched
Now match against T character by character until a mismatch occurs, then output ms(i+1)
Applying matching statistics to LCS
problem
Input
strings S and T
Output
longest common substring of S and T
Solution method
Compute suffix tree for shorter string, say S
Compute ms(i) values for T
Maximal ms(i) value identifies LCS
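A brute-force sketch of this reduction: compute each ms(i) of T against S naively and keep the position with the largest value. A suffix tree with suffix links would replace the inner substring test; the function name is hypothetical.

```python
def longest_common_substring(s, t):
    """Return one longest common substring of s and t via naive matching statistics."""
    best_len, best_pos = 0, 0
    for i in range(len(t)):
        length = 0
        # ms(i): longest substring of t starting at i that occurs somewhere in s
        while i + length < len(t) and t[i:i + length + 1] in s:
            length += 1
        if length > best_len:
            best_len, best_pos = length, i
    return t[best_pos:best_pos + best_len]

print(longest_common_substring("abcedfg", "cbcedfa"))   # bcedf
```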
Suffix Arrays
Input
Text T of length m
Output
Pos array
Definition of Pos array
A suffix array for T, called Pos, is an array of integers in the range 1 to m specifying the lexicographic order of the m suffixes of string T
Pos[k] = i iff the suffix Ti starting at position i is the kth smallest of the m suffixes
Add a terminating character $ that is lexically smallest
Example of Suffix Arrays
T = axfcaxgx#
Suffixes        Lexical order
1. axfcaxgx#    9. #
2. xfcaxgx#     1. axfcaxgx#
3. fcaxgx#      5. axgx#
4. caxgx#       4. caxgx#
5. axgx#        3. fcaxgx#
6. xgx#         7. gx#
7. gx#          8. x#
8. x#           2. xfcaxgx#
9. #            6. xgx#
k    1 2 3 4 5 6 7 8 9
Pos  9 1 5 4 3 7 8 2 6
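The Pos array of this example can be reproduced by sorting the suffixes directly. This is a naive sketch, not the suffix tree traversal, and it relies on "#" sorting before the lowercase letters in Python's string order. The function name is illustrative.

```python
def suffix_array(text):
    """Return Pos as a list of 1-based suffix start positions in lexical order."""
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])

print(suffix_array("axfcaxgx#"))   # [9, 1, 5, 4, 3, 7, 8, 2, 6]
```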
Suffix Arrays
Solution method
Compute suffix tree of T
Do a lexical depth-first traversal of the suffix tree, filling Pos(k) with the leaves in the order they are encountered
Edge (v,u) is lexically smaller than edge (v,w) iff the first character of (v,u) is lexically smaller than the first character of (v,w)
Applying Suffix Arrays to exact pattern
matching
Input
Pattern P of length n
Text T of length m
Output
All occurrences of P in T
Solution method
Compute suffix array Pos for T
If P is in T, then all these locations will be grouped
consecutively in Pos
Applying Suffix Arrays to exact pattern
matching
Using binary search, find the smallest index i’ such that P exactly matches the first n characters of suffix Pos(i’)
Similarly, find the largest index i such that P exactly matches the first n characters of suffix Pos(i)
Time complexity O(n log m)
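The two binary searches can be sketched as follows. Each comparison inspects at most n characters, and there are O(log m) comparisons per search. The suffix array is built here by naive sorting, and the function names are hypothetical.

```python
def suffix_array(text):
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])

def find_occurrences(text, pos, pattern):
    """Return the sorted 1-based positions where pattern occurs in text."""
    n = len(pattern)

    def prefix(k):
        # First n characters of the suffix at rank k (k is a 0-based list index).
        start = pos[k] - 1
        return text[start:start + n]

    lo, hi = 0, len(pos)
    while lo < hi:                  # smallest rank whose prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(pos)
    while lo < hi:                  # smallest rank whose prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(pos[first:lo])    # the matching suffixes are consecutive in Pos

text = "axfcaxgx#"
pos = suffix_array(text)
print(find_occurrences(text, pos, "ax"))   # [1, 5]
```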
Longest common prefixes
Input
Text T of length m
Output
Max(Lcp(i,j)), for 1 ≤ i, j ≤ m and i ≠ j
Definition of Lcp(i,j): Lcp(i,j) is the length of the longest common prefix of the suffixes of T beginning at Pos[i] and Pos[j].
Example from Suffix Arrays
T = axfcaxgx#, Pos[2] = 1 (axfcaxgx#), Pos[3] = 5 (axgx#)
Lcp(2,3) = 2
Longest common prefixes
Solution method
We want to get Lcp in O(m) time
However, there are potentially O(m^2) different pairs of Lcp values
Crucial point
Since this is binary search, only O(m) values are ever needed, and these have a lot of structure
Longest common prefixes
Lcp(i,i+1): string depth of the lowest common ancestor encountered during the lexical depth-first traversal of the suffix tree from the Pos(i) leaf to the Pos(i+1) leaf
Other Lcp values
Lcp(i,j) for i < j: the minimum of Lcp(k,k+1) over k from i to j-1
Take the min of the Lcp values of the children in the binary tree of needed Lcp values (not the suffix tree)
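The relation between adjacent and general Lcp values can be checked with a small sketch. The adjacent values are computed here by direct character comparison rather than from the suffix tree traversal, and all function names are illustrative.

```python
def suffix_array(text):
    return sorted(range(1, len(text) + 1), key=lambda i: text[i - 1:])

def common_prefix_length(a, b):
    length = 0
    while length < min(len(a), len(b)) and a[length] == b[length]:
        length += 1
    return length

def adjacent_lcp(text, pos):
    """adj[k-1] = Lcp(k, k+1) for ranks k = 1 .. m-1."""
    return [common_prefix_length(text[pos[k] - 1:], text[pos[k + 1] - 1:])
            for k in range(len(pos) - 1)]

def lcp_between(adj, i, j):
    """Lcp(i, j) for 1-based ranks i < j, as the minimum of the adjacent values."""
    return min(adj[i - 1:j - 1])

text = "axfcaxgx#"
adj = adjacent_lcp(text, suffix_array(text))
print(lcp_between(adj, 2, 3))   # 2, the common prefix "ax"
```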