Pattern Matching: Suffix Tree Applications

The document discusses applications of suffix trees, including exact string matching, longest common substrings, finding repeated substrings, and substring matching. It describes how a suffix tree can be constructed in linear time and space, and then patterns can be matched in linear time based on the tree structure. The document also covers the Aho-Corasick algorithm for exact set matching as an alternative to using suffix trees.

Uploaded by

iqcst


Pattern Matching:

Suffix Tree Applications

Applications

 Exact string and substring matching
 Longest common substrings
 Finding and representing repeated substrings efficiently
 Applications that lead to alternative, space-efficient implementations:
   Matching statistics
   Suffix arrays
Exact Set matching

 Input
   A set of patterns P = {P1, P2, …, Pk} with total length $\sum_{i=1}^{k} |P_i| = n$
   Text T of length m
 Output
   Positions of all occurrences of each pattern Pi in T
 Solution method
   Preprocess to create a suffix tree for T
     O(m) time, O(m) space
   Maximally match each Pi in the suffix tree
     O(|P1|) + O(|P2|) + … + O(|Pk|) = O(n)
   Output all leaf positions below the match point
     O(k) time, where k here is the total number of matches
Exact set matching using Aho-Corasick
 The Aho-Corasick algorithm is a classical solution to exact set matching
   Build a keyword tree of the set of patterns P
 A keyword tree for a pattern set P is a rooted tree T such that:
   Each edge e is labeled by a character
   Any two edges out of a node have different labels
   L(v), the label of a node v, is the concatenation of the edge labels on the path from the root to v
   For each Pi ∈ P there is a node v such that L(v) = Pi, and for each leaf v there is a Pi = L(v)
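The keyword tree definition can be illustrated with a short sketch. The node layout (a list of dicts) and the names `build_keyword_tree` and `path_label` are assumptions made for this illustration; they do not come from the slides.

```python
def build_keyword_tree(patterns):
    """Return the keyword tree as a list of nodes; node 0 is the root.

    Each node is a dict: 'next' maps an edge character to a child index,
    'pattern' is the index of the pattern whose label ends here (or None).
    """
    nodes = [{"next": {}, "pattern": None}]
    for idx, pat in enumerate(patterns):
        v = 0
        for ch in pat:
            # Edges out of one node carry distinct characters, so reuse
            # an existing edge when there is one.
            if ch not in nodes[v]["next"]:
                nodes.append({"next": {}, "pattern": None})
                nodes[v]["next"][ch] = len(nodes) - 1
            v = nodes[v]["next"][ch]
        nodes[v]["pattern"] = idx
    return nodes


def path_label(nodes, target):
    """Recover L(v) by a search from the root (fine for a small demo)."""
    stack = [(0, "")]
    while stack:
        v, label = stack.pop()
        if v == target:
            return label
        for ch, child in nodes[v]["next"].items():
            stack.append((child, label + ch))
    return None


tree = build_keyword_tree(["abce", "abe", "dce", "ac"])
```

For the pattern set {abce, abe, dce, ac}, this sketch produces a root plus nine further nodes, because shared prefixes such as `ab` are stored only once.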
Example of Aho-Corasick

 Example P = {abce, abe, dce, ac}
 [Figure: keyword tree for P, with numbered nodes and edges labeled a, b, c, d, e]
Example of Aho-Corasick

 Example P = {abce, ababc, abac}
 [Figure: keyword tree for P, with numbered nodes and resume links]

Resume Link
As in the KMP algorithm, if a mismatch occurs at node v, the comparison resumes via the resume link of its parent.
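A minimal sketch of searching with resume links follows. It uses the common textbook formulation in which each node stores its own link (called `fail` here) and links are computed breadth-first; the names `build_automaton` and `find_all` and the list-based layout are assumptions for this illustration, not taken from the slides.

```python
from collections import deque


def build_automaton(patterns):
    # goto[v]: edge character -> child; fail[v]: resume link of v;
    # out[v]: indices of patterns that end at v or at a node reached by links.
    goto, fail, out = [{}], [0], [[]]
    for idx, pat in enumerate(patterns):
        v = 0
        for ch in pat:
            if ch not in goto[v]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[v][ch] = len(goto) - 1
            v = goto[v][ch]
        out[v].append(idx)
    # Breadth-first pass: links of depth-1 nodes stay at the root.
    queue = deque(goto[0].values())
    while queue:
        v = queue.popleft()
        for ch, u in goto[v].items():
            f = fail[v]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[u] = goto[f].get(ch, 0)
            out[u] = out[u] + out[fail[u]]
            queue.append(u)
    return goto, fail, out


def find_all(patterns, text):
    """Return sorted (1-based start position, pattern) pairs."""
    goto, fail, out = build_automaton(patterns)
    hits, v = [], 0
    for i, ch in enumerate(text):
        # On a mismatch, resume from the linked node instead of restarting.
        while v and ch not in goto[v]:
            v = fail[v]
        v = goto[v].get(ch, 0)
        for idx in out[v]:
            hits.append((i - len(patterns[idx]) + 2, patterns[idx]))
    return sorted(hits)
```

With the second example set, `find_all(["abce", "ababc", "abac"], "ababce")` reports `ababc` at position 1 and `abce` at position 3; the second match is found by following a resume link from the node for `ababc` to the node for `abc`.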
Aho-Corasick vs. Suffix tree
 Aho-Corasick approach
   O(n) preprocessing time and space to build the keyword tree of the pattern set P
   O(m+k) search time
     Linear time by using the resume links
 Suffix tree approach
   O(m) preprocessing time and space to build the suffix tree of T
   O(n+k) search time
 Using matching statistics (defined later), this trade-off can be made similar to that of Aho-Corasick
Substring problem
 Input
   Pattern P of length n
   A set of texts Ti of total length m
 Output
   Positions of all occurrences of P in each text Ti
 Solution method
   Preprocess to create a generalized suffix tree for {Ti}
     O(m) time, O(m) space
   Maximally match P in the generalized suffix tree
   Output all leaf positions below the match point
     O(n+k) time, where k is the total number of matches
Generalized suffix tree

 T1 = ababc#
 T2 = abd$
 [Figure: generalized suffix tree for T1 and T2, with edge labels such as ab, b, abc#, c#, d$, # and $]
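The substring problem can be tried out on the two texts above with a small sketch. It is not a linear-time generalized suffix tree: as a stand-in it builds an uncompressed suffix trie in quadratic time and space, and stores at every node the occurrences that a real suffix tree would read off the leaves below the match point. The function names are assumptions for this illustration, and the terminators # and $ are omitted because the trie does not need them.

```python
def build_generalized_suffix_trie(texts):
    """Quadratic-size stand-in for a generalized suffix tree.

    Every node keeps 'occ': the (text number, 1-based start) pairs of all
    suffixes that pass through it, i.e. the leaves below that point.
    """
    root = {"next": {}, "occ": []}
    for t_idx, text in enumerate(texts, start=1):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                if ch not in node["next"]:
                    node["next"][ch] = {"next": {}, "occ": []}
                node = node["next"][ch]
                node["occ"].append((t_idx, start + 1))
    return root


def occurrences(root, pattern):
    """Match the pattern from the root, then report everything below."""
    node = root
    for ch in pattern:
        node = node["next"].get(ch)
        if node is None:
            return []          # the pattern falls off the trie: no match
    return sorted(node["occ"])


trie = build_generalized_suffix_trie(["ababc", "abd"])
```

Here `occurrences(trie, "ab")` returns `[(1, 1), (1, 3), (2, 1)]`: two occurrences in T1 and one in T2.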
Longest Common Substring problem
 Input
   Strings S and T
 Output
   The longest common substring of S and T (and its position in S and T)
 Solution method
   Preprocess to create a generalized suffix tree for {S, T}
   Mark each node by whether its subtree contains a leaf of S, of T, or of both
     A simple post-order tree traversal does this
   The path label of the node marked "both" with the greatest string depth is the longest common substring of S and T
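The marking idea can be sketched as follows. To stay short, the sketch again uses an uncompressed suffix trie (quadratic, not the linear-time generalized suffix tree), and it relies on the assumption that children are always created after their parents, so a single reverse pass over the node list acts as the bottom-up traversal. All names are illustrative.

```python
def longest_common_substring(s, t):
    # Parallel arrays describing trie nodes; node 0 is the root.
    child, parent, char, depth, mark = [{}], [-1], [""], [0], [0]
    for bit, text in ((1, s), (2, t)):       # bit 1 marks S, bit 2 marks T
        for start in range(len(text)):
            v = 0
            for ch in text[start:]:
                if ch not in child[v]:
                    child.append({})
                    parent.append(v)
                    char.append(ch)
                    depth.append(depth[v] + 1)
                    mark.append(0)
                    child[v][ch] = len(child) - 1
                v = child[v][ch]
            mark[v] |= bit                   # a suffix of this string ends here
    # Bottom-up pass: children have larger indices than their parents.
    for v in range(len(child) - 1, 0, -1):
        mark[parent[v]] |= mark[v]
    # Deepest node whose subtree holds suffixes of both strings.
    best = max((v for v in range(len(child)) if mark[v] == 3),
               key=lambda v: depth[v], default=0)
    label = []
    while best > 0:
        label.append(char[best])
        best = parent[best]
    return "".join(reversed(label))
```

For example, `longest_common_substring("abcedfg", "cbcedfa")` returns `"bcedf"`. When several common substrings share the maximum length, this sketch returns just one of them.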
Common substrings of length k problem

 Input
   Strings S and T
   Integer k
 Output
   All common substrings of S and T of length at least k (and their positions in S and T)
 Solution method
   Same as the previous problem
   Look for all nodes of string depth at least k whose subtrees contain leaves of both S and T
Longest Common Substrings of more than two Strings
 Definition: For a given set of K strings, l(j) for 2 <= j <= K is the length of the longest substring common to at least j of the K strings
 Example: {abcedfg, cbcedfa, dbcedg, cbceg, acea}

 j       2      3     4    5
 l(j)    5      4     3    2
 String  bcedf  bced  bce  ce
Longest Common Substrings of more than two Strings
 Input
   Strings S1, …, SK with total length $\sum_{i=1}^{K} |S_i| = n$
 Output
   l(j) (and positions in Si) for 2 <= j <= K
 Solution method
   Build a generalized suffix tree for the K strings
     Each string has a unique end character, so each leaf shows up only once
Longest Common Substrings of more than two Strings
 Define c(v), the number of distinct leaf labels (string identifiers) in the subtree rooted at node v, and d(v), the string depth of node v
 Given c(v) and d(v), a simple traversal of the tree finds l(j) for j = 2, …, K, along with pointers to the locations in the strings
 Computing c(v) efficiently
   The number of leaves is not correct, as some leaves may carry the same string identifier
   Use a bit vector of length K, 1 bit per string in the set
   OR your way up the tree
   Each OR operation takes O(K) time, which gives O(Kn) running time
   This can be improved to O(n) (shown later)
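The bit-vector computation of c(v) can be sketched as below. As an assumption made to keep the code short, an uncompressed suffix trie replaces the generalized suffix tree, and a Python integer plays the role of the length-K bit vector; `common_substring_lengths` is a hypothetical name.

```python
def common_substring_lengths(strings):
    """Return {j: l(j)} for 2 <= j <= K."""
    K = len(strings)
    child, parent, depth, bits = [{}], [-1], [0], [0]
    for i, text in enumerate(strings):
        for start in range(len(text)):
            v = 0
            for ch in text[start:]:
                if ch not in child[v]:
                    child.append({})
                    parent.append(v)
                    depth.append(depth[v] + 1)
                    bits.append(0)
                    child[v][ch] = len(child) - 1
                v = child[v][ch]
            bits[v] |= 1 << i              # string i has a suffix ending here
    # "OR your way up the tree": children have larger indices than parents.
    for v in range(len(child) - 1, 0, -1):
        bits[parent[v]] |= bits[v]
    l = {j: 0 for j in range(2, K + 1)}
    for v in range(1, len(child)):
        c = bin(bits[v]).count("1")        # c(v): distinct strings below v
        for j in range(2, c + 1):
            l[j] = max(l[j], depth[v])     # d(v): string depth of v
    return l
```

Run on the example set {abcedfg, cbcedfa, dbcedg, cbceg, acea}, the sketch returns `{2: 5, 3: 4, 4: 3, 5: 2}`.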
Repeated Substrings
 Definition:
   A maximal pair in S is a pair of identical substrings α and β in S such that the character to the immediate left (right) of α is different from the character to the immediate left (right) of β
   Add unique characters to the front and end of S so that prefixes and suffixes are included
 Representation: (p1, p2, n')
   The starting positions and the length of the maximal pair
 R(S) is the set of all triples representing maximal pairs in S
Example of Repeated substrings
 Position  1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18
 S =       a x y z v e x y z b  c  d  h  x  y  z  v  c
 (2, 7, 3) is a maximal pair
 (7, 14, 3) is a maximal pair
 (2, 14, 3) is not a maximal pair
 (2, 14, 4) is a maximal pair
Repeated Substrings
 A maximal repeat α is a substring of S that is the substring defined by a maximal pair of S
 R'(S) is the set of maximal repeats, and |R'(S)| ≤ |R(S)|
 Previous example
   xyz and xyzv are maximal repeats of S
   xyz is represented only once in R'(S), whereas both (2, 7, 3) and (7, 14, 3) are in R(S)
   So |R'(S)| is smaller than |R(S)|
Maximal Repeated Substrings

 Maximal repeats
   Input
     String S (length n)
   Output
     R'(S)
 Lemma
   If α is a maximal repeat in S, then α is the path-label of an internal node v in T, the suffix tree of S
   That is, α does not end in the middle of an edge
Maximal Repeated substrings
 Definitions
   The left character of position i is S[i-1]
   The left character of a leaf of a suffix tree T is the left character of the suffix position represented by that leaf
   A node v of T is called left diverse if at least 2 leaves in v's subtree have different left characters
 Theorem
   The string α labeling the path to an internal node v of T is a maximal repeat if and only if v is left diverse
   Left diversity captures that the character before α differs between occurrences
Example of left diverse
 S = ababc
 [Figure: suffix tree for S with edge labels ab, b, abc and c; a node is marked as left diverse]
Maximal Repeated substrings

 Solution method
   Construct the suffix tree for S
   There are at most n maximal repeats
     The suffix tree has n leaves
     All internal nodes except the root have at least two children
     Therefore, there are at most n internal nodes, and each maximal repeat is the path-label of one of them
Maximal Repeated substrings
 Find all left diverse nodes in linear time
   Every node receives a left character label
   Leaf node:
     Label each leaf with its left character
   Internal node v:
     If any child is left diverse, so is v
     If two children have different left character labels, v is left diverse
     Otherwise, v takes on the left character label shared by its children
 Compact representation
   Node v in T is a frontier node if:
     v is left diverse
     none of v's children are left diverse
Maximal Repeated substrings

 Time complexity
   Construct the suffix tree for S: O(n)
   Find all left diverse nodes: O(n)
   Compact representation: O(k), where k is the number of maximal pairs
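The bottom-up labeling rule can be sketched in a few lines. The sketch does not run in O(n): as an assumption for brevity it builds an uncompressed suffix trie, and it treats trie nodes with at least two children as the internal nodes of the suffix tree. The terminator `$` and the left character `^` assigned to the first suffix are assumed not to occur in S.

```python
def maximal_repeats(s):
    text = s + "$"
    child, parent, char = [{}], [-1], [""]
    left = {}                                   # leaf node -> left character
    for start in range(len(text)):
        v = 0
        for ch in text[start:]:
            if ch not in child[v]:
                child.append({})
                parent.append(v)
                char.append(ch)
                child[v][ch] = len(child) - 1
            v = child[v][ch]
        left[v] = text[start - 1] if start > 0 else "^"
    DIVERSE = object()                          # marker for left diverse nodes
    label = [None] * len(child)
    for leaf, ch in left.items():
        label[leaf] = ch
    # Children have larger indices than parents, so a reverse pass is bottom-up.
    for v in range(len(child) - 1, 0, -1):
        p = parent[v]
        if label[v] is DIVERSE or (label[p] is not None and label[p] != label[v]):
            label[p] = DIVERSE                  # diverse child or differing labels
        elif label[p] is None:
            label[p] = label[v]                 # take on the child's label
    repeats = []
    for v in range(1, len(child)):
        if len(child[v]) >= 2 and label[v] is DIVERSE:
            chars, u = [], v
            while u > 0:
                chars.append(char[u])
                u = parent[u]
            repeats.append("".join(reversed(chars)))
    return sorted(repeats)
```

For S = axyzvexyzbcdhxyzvc the sketch returns `["c", "xyz", "xyzv"]`. The single character c qualifies as well, since its two occurrences (positions 11 and 18) differ on both sides.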
Supermaximal repeated substrings
 A supermaximal repeat α is a maximal repeat of S that never occurs as a substring of another maximal repeat of S
 Previous example
   xyzv is a supermaximal repeat of S
   xyz is NOT a supermaximal repeat of S
Supermaximal repeated substrings
 Supermaximal repeats
   Input
     String S (length n)
   Output
     The set of supermaximal repeats of S
 Theorem:
   A left diverse node v represents a supermaximal repeat if and only if
     all of v's children are leaves
     and each has a distinct left character
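As a cross-check of the definition (not of the suffix tree theorem), supermaximal repeats can be computed by brute force: collect the maximal repeats from all maximal pairs, then keep those that are not contained in another one. The function names and the sentinel characters are assumptions for this sketch, which is far from linear time.

```python
def maximal_repeats_bruteforce(s):
    padded = "\x00" + s + "\x01"      # distinct sentinels at both ends
    m = len(s)
    repeats = set()
    for p1 in range(1, m + 1):
        for p2 in range(p1 + 1, m + 1):
            if padded[p1 - 1] == padded[p2 - 1]:
                continue              # same left character: not maximal
            n = 0
            # Extend to the right as far as the two copies agree.
            while p2 + n <= m and padded[p1 + n] == padded[p2 + n]:
                n += 1
            if n > 0:
                repeats.add(padded[p1:p1 + n])
    return repeats


def supermaximal_repeats(s):
    repeats = maximal_repeats_bruteforce(s)
    # Keep repeats that are not a substring of a different maximal repeat.
    return sorted(r for r in repeats
                  if not any(r != q and r in q for q in repeats))
```

On the previous example this gives `["c", "xyzv"]`: xyz is dropped because it occurs inside xyzv.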
Matching Statistics
 Input
   Pattern P of length n
   Text T of length m
 Output
   ms(i) for 1 <= i <= m
 Definition of ms(i)
   For 1 <= i <= m, the matching statistic ms(i) is the length of the longest substring of T starting at position i that matches a substring somewhere in P
Matching Statistics

 With matching statistics, one can solve several problems with less space than a suffix tree
 Exact matching example
   We'll show a solution with O(n) preprocessing time and O(m) search time, matching the traditional methods
   P matches the substring starting at position i in T if and only if ms(i) = |P|
Example of Matching Statistics
 T = TTATTATTTAGC, P = TTAGC

 i      1 2 3 4 5 6 7 8
 T[i]   T T A T T A T T
 ms(i)  3 2 1 3 2 1 2 5

 ms(8) = 5 = |P|, so P occurs in T starting at position 8
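The values in this example can be reproduced straight from the definition of ms(i). The sketch below is deliberately naive (it re-tests substrings with Python's `in` operator) and is not the suffix-link algorithm described on the following slides; `matching_statistics` is a hypothetical name.

```python
def matching_statistics(t, p):
    """Return [ms(1), ..., ms(m)] for text t and pattern p."""
    ms = []
    for i in range(len(t)):
        length = 0
        # Grow the substring of t starting at i while it still occurs in p.
        while i + length < len(t) and t[i:i + length + 1] in p:
            length += 1
        ms.append(length)
    return ms


T, P = "TTATTATTTAGC", "TTAGC"
ms = matching_statistics(T, P)
# Exact matches are the positions where ms(i) equals |P| (1-based).
matches = [i + 1 for i, value in enumerate(ms) if value == len(P)]
```

The first eight entries of `ms` are 3, 2, 1, 3, 2, 1, 2, 5, and `matches` is `[8]`.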
Matching Statistics

 Solution method
   Compute the suffix tree of P, retaining suffix links
   Adding the location of the substring in P
     p(i): a location in P such that the substring starting at p(i) matches the substring starting at T(i) for exactly ms(i) positions
     Before computing the ms(i) values, mark each node of the suffix tree with the leaf number of one of its leaves
     Simply output this value when outputting the ms(i) values
Matching Statistics
 Compute ms(1): match T against the tree
 Get ms(i+1) from ms(i)
   Assume we are at some node v in the tree
     If v is internal, follow its suffix link to s(v)
     Else if v is a leaf, go up one level to its parent w
       If w is an internal node, follow its suffix link to s(w)
       Traverse downwards using the skip/count trick until all the characters in the edge label (w, v) have been matched
   Now match against T character by character until a mismatch occurs, and output ms(i+1)
Applying matching statistics to LCS problem
 Input
   Strings S and T
 Output
   Longest common substring of S and T
 Solution method
   Compute the suffix tree for the shorter string, say S
   Compute the ms(i) values for T
   The maximal ms(i) value identifies the LCS
Suffix Arrays
 Input
   Text T of length m
 Output
   Pos array
 Definition of Pos array
   A suffix array for T, called Pos, is an array of integers in the range 1 to m specifying the lexicographic order of the m suffixes of string T
   Pos[k] = i iff Ti, the suffix of T starting at position i, is the k-th smallest of the m suffixes
   Add a terminating character $ that is lexically smallest
Example of Suffix Arrays
 T = axfcaxgx#

 Suffixes (text order)    Suffixes (lexical order)
 1. axfcaxgx#             9. #
 2. xfcaxgx#              1. axfcaxgx#
 3. fcaxgx#               5. axgx#
 4. caxgx#                4. caxgx#
 5. axgx#                 3. fcaxgx#
 6. xgx#                  7. gx#
 7. gx#                   8. x#
 8. x#                    2. xfcaxgx#
 9. #                     6. xgx#

 k    1 2 3 4 5 6 7 8 9
 Pos  9 1 5 4 3 7 8 2 6
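The Pos array in this example can be reproduced by sorting the suffixes directly. This is a naive sketch, not the suffix tree traversal described next; it relies on the terminator `#` comparing smaller than the letters, which holds for ASCII. `suffix_array` is a hypothetical name.

```python
def suffix_array(t):
    """Return Pos as a list: element k-1 is the 1-based start of the
    k-th smallest suffix of t."""
    return sorted(range(1, len(t) + 1), key=lambda i: t[i - 1:])


pos = suffix_array("axfcaxgx#")
```

For T = axfcaxgx# this yields `[9, 1, 5, 4, 3, 7, 8, 2, 6]`, matching the table.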
Suffix Arrays

 Solution method
   Compute the suffix tree of T
   Do a lexical depth-first traversal of the suffix tree, filling Pos(k) with the leaves in the order they are encountered
   Edge (v,u) is lexically smaller than edge (v,w) iff the first character of (v,u) is lexically smaller than the first character of (v,w)
Applying Suffix Arrays to exact pattern matching
 Input
   Pattern P of length n
   Text T of length m
 Output
   All occurrences of P in T
 Solution method
   Compute the suffix array Pos for T
   If P occurs in T, then all of its occurrences are grouped consecutively in Pos
Applying Suffix Arrays to exact pattern matching
 Using binary search, find the smallest index i' such that P exactly matches the first n characters of suffix Pos(i')
 Similarly, find the largest index i such that P exactly matches the first n characters of suffix Pos(i)
 The occurrences of P are then Pos(i'), …, Pos(i)
 Time complexity: O(n log m)
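The two binary searches can be sketched as below. The suffix array is built by plain sorting to keep the example self-contained, and `find_occurrences` is a hypothetical name; each comparison looks at no more than n characters, which is where the O(n log m) bound comes from.

```python
def find_occurrences(t, pos, p):
    """Return the sorted 1-based positions of p in t using suffix array pos."""
    n = len(p)

    def prefix(k):
        # First n characters of the suffix ranked k (0-based rank).
        start = pos[k] - 1
        return t[start:start + n]

    lo, hi = 0, len(pos)
    while lo < hi:                    # smallest rank whose prefix is >= p
        mid = (lo + hi) // 2
        if prefix(mid) < p:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    hi = len(pos)
    while lo < hi:                    # smallest rank whose prefix is > p
        mid = (lo + hi) // 2
        if prefix(mid) <= p:
            lo = mid + 1
        else:
            hi = mid
    return sorted(pos[first:lo])      # the consecutive block of matches


T = "axfcaxgx#"
POS = sorted(range(1, len(T) + 1), key=lambda i: T[i - 1:])
```

`find_occurrences(T, POS, "ax")` returns `[1, 5]`, and `find_occurrences(T, POS, "x")` returns `[2, 6, 8]`.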
Longest common prefixes
 Input
   Text T of length m
 Output
   Max(Lcp(i,j)), for 1 ≤ i, j ≤ m and i ≠ j
 Definition of Lcp(i,j): Lcp(i,j) is the length of the longest common prefix of the suffixes of T beginning at Pos[i] and Pos[j]
 Example from Suffix Arrays
   T = axfcaxgx#, Pos[2] = 1 (axfcaxgx#), Pos[3] = 5 (axgx#)
   Lcp(2,3) = 2
Longest common prefixes

 Solution method
   We want to get the Lcp values in O(m) time
   However, there are potentially O(m^2) different pairs (i, j) with Lcp values
   Crucial point
     Since the search is a binary search, only O(m) values are ever needed, and these have a lot of structure
Longest common prefixes
 Lcp(i,i+1): the string depth of the lowest common ancestor encountered during the lexical depth-first traversal of the suffix tree from the Pos(i) leaf to the Pos(i+1) leaf
 Other Lcp values
   Lcp(i,j) = $\min_{i \le k \le j-1} \mathrm{Lcp}(k, k+1)$
   Take the min of the Lcp values of the children in the binary tree of needed Lcp values (not the suffix tree)
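The relation between adjacent and general Lcp values can be checked on the running example. In this sketch the adjacent values are computed by direct character comparison rather than from the suffix tree traversal, and all function names are assumptions.

```python
def lcp_len(a, b):
    """Length of the longest common prefix of strings a and b."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def adjacent_lcps(t, pos):
    """Element k-1 holds Lcp(k, k+1) for ranks k = 1 .. m-1."""
    return [lcp_len(t[pos[k] - 1:], t[pos[k + 1] - 1:])
            for k in range(len(pos) - 1)]


def lcp(adj, i, j):
    """Lcp(i, j) for 1-based ranks i < j, as the min of adjacent values."""
    return min(adj[i - 1:j - 1])


T = "axfcaxgx#"
POS = [9, 1, 5, 4, 3, 7, 8, 2, 6]
ADJ = adjacent_lcps(T, POS)
```

Here `ADJ` is `[0, 2, 0, 0, 0, 0, 1, 1]`, so `lcp(ADJ, 2, 3)` is 2 as in the example, and `lcp(ADJ, 7, 9)` is 1, which agrees with comparing x# and xgx# directly.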
