Notes 06 Text Indexing PDF
Sebastian Wild
9 March 2020
6 Text Indexing
6.1 Motivation
6.2 Suffix Trees
6.3 Applications
6.4 Longest Common Extensions
6.5 Suffix Arrays
6.6 Linear-Time Suffix Sorting
6.7 The LCP Array
6.1 Motivation
Inverted indices
(same as “indexes”)
- original indices in books: list of (key) words ↦ page numbers where they occur
Inverted index:
- collect all words in 𝑇
  - can be as simple as splitting 𝑇 at whitespace
  - actual implementations typically support stemming of words
    goes → go, cats → cat
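A minimal Python sketch of the inverted index described above. Splitting at whitespace follows the slide; the names `stem` and `build_inverted_index` and the toy stemming rule (strip a trailing "es" or "s") are illustrative assumptions, not a real stemmer.

```python
from collections import defaultdict


def stem(word):
    """Toy stemmer (assumption): lower-case, drop punctuation, strip 'es' or 's'."""
    w = word.lower().strip(".,;:!?")
    if w.endswith("es") and len(w) > 3:
        return w[:-2]
    if w.endswith("s") and len(w) > 2:
        return w[:-1]
    return w


def build_inverted_index(text):
    """Map each stemmed word to the list of word positions where it occurs."""
    index = defaultdict(list)
    for pos, word in enumerate(text.split()):   # split T at whitespace
        index[stem(word)].append(pos)
    return dict(index)


index = build_inverted_index("The cat goes out. Cats go in; the cat sleeps.")
print(index["cat"])   # positions of 'cat' and 'Cats'
print(index["go"])    # positions of 'goes' and 'go'
```

Word positions play the role of page numbers here; a real index would more likely store document identifiers or character offsets.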
Tries
- efficient dictionary data structure for strings
- NOT our standard tries, but version with compacted paths to leaves (see next page)
Trie construction (correct version)
Compact tries
- paths of nodes with exactly one child (=1 child) are compacted
[Figure: standard trie and compact trie for the strings aa$, aaab$, abaab$, abb$, abbab$, bba$, bbab$, bbb$]
Tries as inverted index
- simple
- fast lookup
- what if the ‘text’ does not even have words to begin with?!
  - biological sequences
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGC
CGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGC
CAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAA
TGCCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAA
  - binary streams
00000010101001111010111000001111100011111011111001101101000011100010011011110000010001101010
01101100001101011010000000100000000111010110000010000111101011101100100011001011011101111111
110001010001011001010000001110101010011000000001101100001100111110000101 0101011101111000011
10101110010010101010100000111110100110000001111001101010000000100100100000101100011000110111
6.2 Suffix Trees
Suffix trees – A ‘magic’ data structure
Appetizer: Longest common substring problem
- Given: strings 𝑆₁, …, 𝑆ₖ
  Example: 𝑆₁ = superiorcalifornialives, 𝑆₂ = sealiver
- Goal: find the longest substring that occurs in all 𝑘 strings → alive
“Although the longest common substring problem looks trivial now, given our knowledge of suffix trees,
it is very interesting to note that in 1970 Don Knuth conjectured that
a linear-time algorithm for this problem would be impossible.”
[Gusfield: Algorithms on Strings, Trees, and Sequences (1997)]
Suffix trees – Definition
- suffix tree T for text 𝑇 = 𝑇[0..𝑛 − 1] = compact trie of all suffixes of 𝑇$ (set 𝑇[𝑛] ≔ $)
Example: 𝑇 = bananaban$
  index: 0 1 2 3 4 5 6 7 8 9
  𝑇:     b a n a n a b a n $
[Figure: suffix tree of 𝑇 = bananaban$, first with the full suffixes written at the leaves, then with only the start index of each suffix at its leaf]
- also: edge labels like in compact trie
- (more readable form on slides to explain algorithms)
Suffix trees – Construction
- 𝑇[0..𝑛 − 1] has 𝑛 + 1 suffixes (starting at character 𝑖 ∈ [0..𝑛])
- We can build the suffix tree by inserting each suffix of 𝑇 into a compressed trie.
  But that takes time Θ(𝑛²). → not interesting!
- Linear-time construction is possible: same order of growth as reading the text!
- for now, take linear-time construction for granted. What can we do with suffix trees?
6.3 Applications
Applications of suffix trees
- In this section, always assume the suffix tree T for 𝑇 is given.
Application 1: Text Indexing / String Matching
- 𝑃 occurs in 𝑇 ⇐⇒ 𝑃 is a prefix of a suffix of 𝑇
- we have all suffixes in T!
- (try to) follow path with label 𝑃, until
  1. we get stuck
     at internal node (no node with next character of 𝑃)
     or inside edge (mismatch of next characters)
     → 𝑃 does not occur in 𝑇
  2. we run out of pattern:
     reach end of 𝑃 at internal node 𝑣 or inside edge towards 𝑣
     → 𝑃 occurs in 𝑇; the leaves below 𝑣 are its occurrences
[Figure: suffix tree of 𝑇 = bananaban$]
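The path-following idea can be sketched in a few lines of Python. To stay short, this sketch assumes an uncompacted suffix trie built naively (quadratic time and space) rather than a true suffix tree, and it assumes `$` does not occur in the text; the names `build_suffix_trie` and `occurrences` are hypothetical.

```python
def build_suffix_trie(text):
    """Insert every suffix of text + '$' into a trie of nested dicts."""
    t = text + "$"
    root = {}
    for i in range(len(t)):
        node = root
        for c in t[i:]:
            node = node.setdefault(c, {})
        node["leaf"] = i              # leaf stores the start index of its suffix
    return root


def occurrences(root, pattern):
    """Follow the path labeled pattern; report all leaves below its end."""
    node = root
    for c in pattern:
        if c not in node:             # case 1: we get stuck
            return []
        node = node[c]
    result, stack = [], [node]        # case 2: we ran out of pattern
    while stack:
        v = stack.pop()
        for key, child in v.items():
            if key == "leaf":
                result.append(child)
            else:
                stack.append(child)
    return sorted(result)


trie = build_suffix_trie("bananaban")
print(occurrences(trie, "ana"))   # start indices of all occurrences
```

In a real suffix tree each edge carries a substring instead of a single character, so the same walk takes time proportional to the pattern length plus the number of reported occurrences.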
Application 2: Longest repeated substring
- Goal: Find longest substring 𝑇[𝑖..𝑖 + ℓ) that occurs also at 𝑗 ≠ 𝑖: 𝑇[𝑗..𝑗 + ℓ) = 𝑇[𝑖..𝑖 + ℓ).
  e.g. for compression → Unit 7
- How can we efficiently check all possible substrings?
Generalized suffix trees
- longest repeated substring (of one string) feels very similar to
  longest common substring of several strings 𝑇^(1), …, 𝑇^(𝑘) with 𝑇^(𝑗) ∈ Σ^(𝑛_𝑗)
- can we solve that in the same way?
- could build the suffix tree for each 𝑇^(𝑗) … but doesn’t seem to help
Application 3: Longest common substring
- With that new idea, we can find longest common substrings:
  1. Compute generalized suffix tree T.
  2. Store with each node the subset of strings that contain its path label:
     2.1. Traverse T bottom-up.
     2.2. For a leaf (𝑗, 𝑖), the subset is {𝑗}.
     2.3. For an internal node, the subset is the union of its children’s subsets.
  3. In top-down traversal, compute string depths of nodes. (as above)
  4. Report deepest node (by string depth) whose subset is {1, …, 𝑘}.
- Each step takes time Θ(𝑛) for 𝑛 = 𝑛₁ + ⋯ + 𝑛ₖ the total length of all texts.
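The four steps for the longest common substring can be tried out with the following Python sketch. It is not linear time: as a simplifying assumption it uses an uncompacted generalized suffix trie (quadratic size), stores subsets as bitmasks, and combines the bottom-up and top-down passes into one recursive traversal. The function name is hypothetical.

```python
def longest_common_substring(strings):
    k = len(strings)
    # Step 1: generalized suffix trie; a node is [children dict, subset bitmask]
    root = [{}, 0]
    for j, s in enumerate(strings):
        for i in range(len(s)):
            node = root
            for c in s[i:]:
                node = node[0].setdefault(c, [{}, 0])
            node[1] |= 1 << j           # end of suffix i of string j: subset {j}

    best = ""

    def visit(node, label):
        nonlocal best
        # Step 2: subset of a node = union of the subsets of its children
        for c, child in node[0].items():
            node[1] |= visit(child, label + c)
        # Steps 3 and 4: string depth is len(label); keep the deepest
        # node whose subset contains all k strings
        if node[1] == (1 << k) - 1 and len(label) > len(best):
            best = label
        return node[1]

    visit(root, "")
    return best


print(longest_common_substring(["superiorcalifornialives", "sealiver"]))
print(longest_common_substring(["bcabcac", "aabca", "bcaa"]))
```

With a compact generalized suffix tree built in linear time, the same traversal gives the Θ(𝑛) bound stated above (for a fixed number of strings).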
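Longest common substring – Example
𝑇^(1) = bcabcac, 𝑇^(2) = aabca, 𝑇^(3) = bcaa
[Figure: generalized suffix tree of the three strings with end markers $1, $2, $3; nodes are annotated with their string depth and the subset of strings containing their path label]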
6.4 Longest Common Extensions
Application 4: Longest Common Extensions
- We implicitly used a special case of a more general, versatile idea:
  LCE(𝑖, 𝑗) is the length of the longest common prefix of the suffixes 𝑇ᵢ and 𝑇ⱼ
- in short: LCE(𝑖, 𝑗) = LCP(𝑇ᵢ, 𝑇ⱼ) = stringDepth(LCA(leaf 𝑖, leaf 𝑗))
Efficient LCA
How to find lowest common ancestors?
- Could walk up the tree to find LCA → Θ(𝑛) worst case
- Could store all LCAs in big table → Θ(𝑛²) space and preprocessing
Amazing result: Can compute data structure in Θ(𝑛) time and space
that finds any LCA in constant(!) time.
- a bit tricky to understand
- but a theoretical breakthrough
- and useful in practice
→ After linear preprocessing (time & space), we can find LCEs in 𝑂(1) time.
Application 5: Approximate matching
𝒌-mismatch matching:
- Input: text 𝑇[0..𝑛 − 1], pattern 𝑃[0..𝑚 − 1], 𝑘 ∈ [0..𝑚)
- Output: a position 𝑖 such that 𝑇[𝑖..𝑖 + 𝑚) and 𝑃 differ in at most 𝑘 characters (“Hamming distance ≤ 𝑘”)
Kangaroo Algorithm for approximate matching
procedure kMismatch(𝑇[0..𝑛 − 1], 𝑃[0..𝑚 − 1], 𝑘)
    // build LCE data structure for the suffixes of 𝑇 and 𝑃
    for 𝑖 := 0, …, 𝑛 − 𝑚 do
        mismatches := 0; 𝑡 := 𝑖; 𝑝 := 0
        while mismatches ≤ 𝑘 ∧ 𝑝 < 𝑚 do
            ℓ := LCE(𝑡, 𝑝) // jump over matching part of 𝑇[𝑡..] and 𝑃[𝑝..]
            𝑡 := 𝑡 + ℓ; 𝑝 := 𝑝 + ℓ
            if 𝑝 < 𝑚 then // 𝑇[𝑡] ≠ 𝑃[𝑝]: count and skip the mismatch
                mismatches := mismatches + 1
                𝑡 := 𝑡 + 1; 𝑝 := 𝑝 + 1
        if mismatches ≤ 𝑘 then
            return 𝑖
    return no match
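A runnable Python sketch of the kangaroo idea. The helper `lce` here is a naive character-by-character stand-in (an assumption made to keep the example self-contained); with a constant-time LCE data structure each alignment costs 𝑂(𝑘) jumps. The names `lce` and `k_mismatch` are hypothetical.

```python
def lce(text, pattern, t, p):
    """Length of the longest common prefix of text[t:] and pattern[p:].
    Naive stand-in: a real implementation would answer this in O(1)
    after linear preprocessing (suffix tree plus LCA queries)."""
    length = 0
    while (t + length < len(text) and p + length < len(pattern)
           and text[t + length] == pattern[p + length]):
        length += 1
    return length


def k_mismatch(text, pattern, k):
    """Return the first i where text[i:i+m] and pattern differ in <= k positions."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):
        mismatches, t, p = 0, i, 0
        while mismatches <= k and p < m:
            jump = lce(text, pattern, t, p)   # jump over the matching part
            t += jump
            p += jump
            if p < m:                         # text[t] != pattern[p]
                mismatches += 1
                t += 1                        # skip the mismatching character
                p += 1
        if mismatches <= k:
            return i
    return None


print(k_mismatch("bananaban", "nbn", 1))   # 'nan' at index 2 has one mismatch
```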
Suffix trees – Discussion
- Suffix trees were a threshold invention
6.5 Suffix Arrays
Putting suffix trees on a diet
- Observation: order of leaves in suffix tree = suffixes lexicographically sorted
- Idea: only store list of leaves 𝐿[0..𝑛]
- Enough to do efficient string matching!
  1. Use binary search for pattern 𝑃
  2. check if 𝑃 is prefix of suffix after found position
  ≤ (lg 𝑛 + 2) · 𝑚 character comparisons
[Figure: suffix tree of 𝑇 = bananaban$ next to its lexicographically sorted suffixes and the leaf list 𝐿[0..𝑛]]
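The two-step search can be sketched in Python as follows. The suffix array is built here by a plain comparison sort of all suffixes, which is an assumption for brevity and far from linear time; the names `suffix_array` and `find` are hypothetical, and the text is assumed to contain only characters larger than `$`.

```python
def suffix_array(text):
    """Naive construction: sort the start indices of all suffixes of text + '$'."""
    t = text + "$"
    return sorted(range(len(t)), key=lambda i: t[i:])


def find(text, L, pattern):
    """Return the start index of one occurrence of pattern, or None."""
    t = text + "$"
    m = len(pattern)
    lo, hi = 0, len(L)
    while lo < hi:                            # 1. binary search for pattern
        mid = (lo + hi) // 2
        if t[L[mid]:L[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # 2. check if pattern is a prefix of the suffix at the found position
    if lo < len(L) and t[L[lo]:L[lo] + m] == pattern:
        return L[lo]
    return None


L = suffix_array("bananaban")
print(L)
print(find("bananaban", L, "nab"))
```

Each comparison looks at no more than 𝑚 characters, which is where the bound of roughly lg 𝑛 · 𝑚 character comparisons comes from.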
Fat-pivot radix quicksort – Example
Input: she, sells, seashells, by, the, sea, shore, the, shells, she, sells, are, surely, seashells
After partitioning by the first character around the pivot she (first character s):
  < s: by, are
  = s: she, sells, seashells, sea, shore, shells, she, sells, surely, seashells
  > s: the, the
The < s and > s parts are sorted recursively on the same character, the = s part on the next character.
Fat-pivot radix quicksort
details in §5.1 of Sedgewick, Wayne Algorithms 4th ed. (2011), Pearson
- simple to code
- sort suffixes: fat-pivot radix quicksort finds suffix array in 𝑂(𝑛 log 𝑛) expected time
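A compact Python sketch of fat-pivot (three-way) radix quicksort on a list of strings. It assumes the first element of each range as pivot rather than a random one, and it copies suffixes as strings when building a suffix array, which a serious implementation would avoid; the function name is hypothetical.

```python
def radix_quicksort(strings):
    a = list(strings)

    def char_at(s, d):
        return ord(s[d]) if d < len(s) else -1   # end of string sorts first

    def sort(lo, hi, d):                          # sorts a[lo:hi] from character d on
        if hi - lo <= 1:
            return
        pivot = char_at(a[lo], d)
        lt, i, gt = lo, lo + 1, hi - 1
        while i <= gt:                            # fat partition: < pivot, = pivot, > pivot
            c = char_at(a[i], d)
            if c < pivot:
                a[lt], a[i] = a[i], a[lt]
                lt += 1
                i += 1
            elif c > pivot:
                a[i], a[gt] = a[gt], a[i]
                gt -= 1
            else:
                i += 1
        sort(lo, lt, d)                           # smaller: same character position
        if pivot >= 0:
            sort(lt, gt + 1, d + 1)               # equal: move on to next character
        sort(gt + 1, hi, d)                       # larger: same character position

    sort(0, len(a), 0)
    return a


words = ("she sells seashells by the sea shore the shells "
         "she sells are surely seashells").split()
print(radix_quicksort(words))

# Suffix array of bananaban$ by sorting its suffixes
t = "bananaban$"
L = [len(t) - len(s) for s in radix_quicksort([t[i:] for i in range(len(t))])]
print(L)
```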
6.6 Linear-Time Suffix Sorting
Linear-time suffix sorting
DC3 / Skew algorithm (starts from the suffixes 𝑇ᵢ whose start index 𝑖 is not a multiple of 3)
- Note: 𝐿 can easily be computed from the rank array 𝑅 in one pass, and vice versa.
- Can use whichever is more convenient.
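The one-pass conversion between 𝐿 and 𝑅 amounts to inverting a permutation. A small Python sketch, with hypothetical function names:

```python
def ranks_from_suffix_array(L):
    """R[i] = rank of suffix T_i, i.e. the position of i in L."""
    R = [0] * len(L)
    for r, i in enumerate(L):
        R[i] = r
    return R


def suffix_array_from_ranks(R):
    """Inverting a permutation twice gives it back, so the same pass works."""
    return ranks_from_suffix_array(R)


L = [9, 5, 7, 3, 1, 6, 0, 8, 4, 2]        # suffix array of bananaban$
R = ranks_from_suffix_array(L)
print(R)
print(suffix_array_from_ranks(R) == L)
```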
DC3 / Skew algorithm – Step 2: Inducing ranks
- Assume: rank array 𝑅₁,₂ known:
    𝑅₁,₂[𝑖] = rank of 𝑇ᵢ among 𝑇₁, 𝑇₂, 𝑇₄, 𝑇₅, 𝑇₇, 𝑇₈, …   for 𝑖 = 1, 2, 4, 5, 7, 8, …
    𝑅₁,₂[𝑖] = undefined                                   for 𝑖 = 0, 3, 6, 9, …
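A Python sketch of how known ranks 𝑅₁,₂ can induce the order of the remaining suffixes: each 𝑇ᵢ with 𝑖 a multiple of 3 is its first character followed by 𝑇ᵢ₊₁, whose rank is known. For the demonstration, 𝑅₁,₂ is computed by brute force and Python's comparison sort replaces the radix sort needed for linear time; both are assumptions of this sketch, as is the name `induce_zero_list`.

```python
def induce_zero_list(t, R12):
    """Sort the suffixes T_i with i % 3 == 0 by the pair (t[i], R12[i + 1])."""
    zero = [i for i in range(len(t)) if i % 3 == 0]
    # A position past the end stands for the empty suffix, which ranks lowest.
    return sorted(zero, key=lambda i: (t[i], R12.get(i + 1, -1)))


t = "hannahbansbananasman" + "$$$"
# Brute-force R12 for the demonstration only
one_two = sorted((i for i in range(len(t)) if i % 3 != 0), key=lambda i: t[i:])
R12 = {i: r for r, i in enumerate(one_two)}

zero_sorted = induce_zero_list(t, R12)
print(zero_sorted)
```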
DC3 / Skew algorithm – Inducing ranks example
𝑇 = hannahbansbananasman$$$ (append 3 $ markers)

𝑅₁,₂ (known), listed in rank order:
  𝑅₁,₂[22] =  0   𝑇₂₂ = $
  𝑅₁,₂[20] =  1   𝑇₂₀ = $$$
  𝑅₁,₂[ 4] =  2   𝑇₄  = ahbansbananasman$$$
  𝑅₁,₂[11] =  3   𝑇₁₁ = ananasman$$$
  𝑅₁,₂[13] =  4   𝑇₁₃ = anasman$$$
  𝑅₁,₂[ 1] =  5   𝑇₁  = annahbansbananasman$$$
  𝑅₁,₂[ 7] =  6   𝑇₇  = ansbananasman$$$
  𝑅₁,₂[10] =  7   𝑇₁₀ = bananasman$$$
  𝑅₁,₂[ 5] =  8   𝑇₅  = hbansbananasman$$$
  𝑅₁,₂[17] =  9   𝑇₁₇ = man$$$
  𝑅₁,₂[19] = 10   𝑇₁₉ = n$$$
  𝑅₁,₂[14] = 11   𝑇₁₄ = nasman$$$
  𝑅₁,₂[ 2] = 12   𝑇₂  = nnahbansbananasman$$$
  𝑅₁,₂[ 8] = 13   𝑇₈  = nsbananasman$$$
  𝑅₁,₂[16] = 14   𝑇₁₆ = sman$$$

Each suffix 𝑇ᵢ with 𝑖 a multiple of 3 is its first character followed by 𝑇ᵢ₊₁, e.g. 𝑇₁₅ = a sman$$$ = a𝑇₁₆ with 𝑅₁,₂[16] = 14:
  suffix  first character  rest   rank of rest
  𝑇₀      h                𝑇₁     05
  𝑇₃      n                𝑇₄     02
  𝑇₆      b                𝑇₇     06
  𝑇₉      s                𝑇₁₀    07
  𝑇₁₂     n                𝑇₁₃    04
  𝑇₁₅     a                𝑇₁₆    14
  𝑇₁₈     a                𝑇₁₉    10
  𝑇₂₁     $                𝑇₂₂    00
DC3 / Skew algorithm – Step 3: Merging
- Have:
  - sorted 1,2-list: 𝑇₁, 𝑇₂, 𝑇₄, 𝑇₅, 𝑇₇, 𝑇₈, 𝑇₁₀, 𝑇₁₁, …
  - sorted 0-list: 𝑇₀, 𝑇₃, 𝑇₆, 𝑇₉, …
- Task: Merge them!
  - use standard merging method from Mergesort
  - but speed up comparisons using 𝑅₁,₂

Example (𝑇 = hannahbansbananasman$$$):
  sorted 0-list:
    𝑇₂₁ $$
    𝑇₁₈ an$$$
    𝑇₁₅ asman$$$
    𝑇₆  bansbananasman$$$
    𝑇₀  hannahbansbananasman$$$
    𝑇₃  nahbansbananasman$$$
    𝑇₁₂ nanasman$$$
    𝑇₉  sbananasman$$$
  sorted 1,2-list:
    𝑇₂₂ $
    𝑇₂₀ $$$
    𝑇₄  ahbansbananasman$$$
    𝑇₁₁ ananasman$$$
    𝑇₁₃ anasman$$$
    𝑇₁  annahbansbananasman$$$
    𝑇₇  ansbananasman$$$
    𝑇₁₀ bananasman$$$
    𝑇₅  hbansbananasman$$$
    𝑇₁₇ man$$$
    𝑇₁₉ n$$$
    𝑇₁₄ nasman$$$
    𝑇₂  nnahbansbananasman$$$
    𝑇₈  nsbananasman$$$
    𝑇₁₆ sman$$$
  merged so far: 𝑇₂₂, 𝑇₂₁, 𝑇₂₀, 𝑇₄, 𝑇₁₈

- Compare 𝑇₁₅ to 𝑇₁₁. Idea: try same trick as before
    𝑇₁₅ = asman$$$ = a𝑇₁₆
    𝑇₁₁ = ananasman$$$ = a𝑇₁₂
  can’t compare 𝑇₁₆ and 𝑇₁₂ either! (𝑅₁,₂[12] is undefined)
- → Compare 𝑇₁₆ to 𝑇₁₂
    𝑇₁₆ = sman$$$ = s𝑇₁₇
    𝑇₁₂ = nanasman$$$ = n𝑇₁₃
  always at most 2 steps, then can use 𝑅₁,₂!
- → 𝑂(𝑛) time for merge
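The merge with constant-time comparisons can be sketched in Python like this. The sorted input lists and 𝑅₁,₂ are produced by brute force here, purely so the example is self-contained; the names `merge` and `zero_is_smaller` are hypothetical.

```python
def merge(t, R12, zero_sorted, one_two_sorted):
    def rank(i):
        return R12.get(i, -1)            # empty suffix past the end ranks lowest

    def ch(i):
        return t[i] if i < len(t) else ""

    def zero_is_smaller(i, j):
        """Compare T_i (i % 3 == 0) with T_j (j % 3 != 0) in O(1)."""
        if j % 3 == 1:                   # one step: i+1 and j+1 both have ranks
            return (ch(i), rank(i + 1)) < (ch(j), rank(j + 1))
        # j % 3 == 2: two steps, then i+2 and j+2 both have ranks
        return (ch(i), ch(i + 1), rank(i + 2)) < (ch(j), ch(j + 1), rank(j + 2))

    merged, a, b = [], 0, 0
    while a < len(zero_sorted) and b < len(one_two_sorted):
        if zero_is_smaller(zero_sorted[a], one_two_sorted[b]):
            merged.append(zero_sorted[a])
            a += 1
        else:
            merged.append(one_two_sorted[b])
            b += 1
    return merged + zero_sorted[a:] + one_two_sorted[b:]


t = "hannahbansbananasman" + "$$$"
# Brute-force inputs for the demonstration only
one_two = sorted((i for i in range(len(t)) if i % 3 != 0), key=lambda i: t[i:])
zero = sorted(range(0, len(t), 3), key=lambda i: t[i:])
R12 = {i: r for r, i in enumerate(one_two)}

suffix_order = merge(t, R12, zero, one_two)
print(suffix_order[:5])
```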
DC3 / Skew algorithm – Fix recursive call
- both step 2. and 3. doable in 𝑂(𝑛) time!
- Need a single string 𝑇′ to recurse on, from which we can deduce 𝑅₁,₂.
- redefine alphabet to be triples of characters abc (𝑋′ denotes 𝑋 read as a string of triples)
  Example: 𝑇 = bananaban$$$ → 𝑇′ = ban ana ban $$$
  suffixes of 𝑇′ ↔ 𝑇₀, 𝑇₃, 𝑇₆, 𝑇₉, …:
    ban ana ban $$$
    ana ban $$$
    ban $$$
    $$$
- 𝑇′ = 𝑇[1..𝑛)′ $$$ 𝑇[2..𝑛)′ $$$ ↔ 𝑇ᵢ with 𝑖 ≢ 0 (mod 3).
DC3 / Skew algorithm – Fix alphabet explosion
- Still does not quite work!
  - Each recursive step cubes 𝜎 by using triples!
  - (Eventually) cannot use linear-time sorting anymore!
DC3 / Skew algorithm – Step 3. Example
𝑇′ = 𝑇[1..𝑛)′ $$$ 𝑇[2..𝑛)′ $$$
- 𝑇 = hannahbansbananasman$, 𝑇₂ = nnahbansbananasman$
  𝑇′ = ann ahb ans ban ana sma n$$ $$$ nna hba nsb ana nas man $$$
- Occurring triples:
  ann ahb ans ban ana sma n$$ $$$ nna hba nsb nas man
  Rank    00  01  02  03  04  05  06  07  08  09  10  11  12
  Triple  $$$ ahb ana ann ans ban hba man n$$ nas nna nsb sma
- 𝑇′ = ann ahb ans ban ana sma n$$ $$$ nna hba nsb ana nas man $$$
  𝑇″ = 03  01  04  05  02  12  08  00  10  06  11  02  09  07  00
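The renaming of triples by their ranks can be reproduced with a short Python sketch. It assumes the text is given without end marker, appends `$$$`, and pads triples that run past the end with `$`; Python's comparison sort stands in for the radix sort that keeps this step linear. The name `rename_triples` is hypothetical.

```python
def rename_triples(text):
    """Return the start positions i (i % 3 != 0) and the rank-renamed triple string."""
    t = text + "$$$"

    def triple(i):
        return t[i:i + 3].ljust(3, "$")          # pad past the end with $

    positions = list(range(1, len(t), 3)) + list(range(2, len(t), 3))
    triples = [triple(i) for i in positions]
    rank = {tr: r for r, tr in enumerate(sorted(set(triples)))}
    return positions, [rank[tr] for tr in triples]


positions, renamed = rename_triples("hannahbansbananasman")
print(renamed)
```

The renamed string uses at most as many distinct symbols as it has characters, so the alphabet no longer grows with each level of recursion.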
Suffix array – Discussion
sleek data structure compared to suffix tree
6.7 The LCP Array
String depths of internal nodes
- Recall algorithm for longest repeated substring in suffix tree
  1. Compute string depth of nodes
  2. Find path label to node with maximal string depth
- Can we do this using suffix arrays?
- Yes, by enhancing the suffix array with the LCP array!
  LCP[1..𝑛]
  LCP[𝑟] = LCP(𝑇_𝐿[𝑟], 𝑇_𝐿[𝑟−1])
  length of the longest common prefix of suffixes of rank 𝑟 and 𝑟 − 1
[Figure: suffix tree of 𝑇 = bananaban$]
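A Python sketch of the longest repeated substring computed from the suffix array and the LCP array as defined above. Both arrays are computed naively here, which is an assumption for brevity; the function names are hypothetical.

```python
def lcp_array(t, L):
    """LCP[r] = length of the longest common prefix of T_L[r] and T_L[r-1]."""
    def lcp(i, j):
        length = 0
        while (i + length < len(t) and j + length < len(t)
               and t[i + length] == t[j + length]):
            length += 1
        return length

    return [None] + [lcp(L[r - 1], L[r]) for r in range(1, len(L))]   # LCP[0] unused


def longest_repeated_substring(t, L, LCP):
    """The maximal LCP entry is the string depth of the deepest internal node."""
    r = max(range(1, len(L)), key=lambda r: LCP[r])
    return t[L[r]:L[r] + LCP[r]]


t = "bananaban$"
L = sorted(range(len(t)), key=lambda i: t[i:])     # naive suffix array
LCP = lcp_array(t, L)
print(LCP)
print(longest_repeated_substring(t, L, LCP))
```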
LCP array and internal nodes
  𝑟   𝐿[𝑟]   𝑇_𝐿[𝑟]        LCP[𝑟]   common prefix with previous suffix
  0   9      $             -
  1   5      aban$         0        𝜺
  2   7      an$           1        a
  3   3      anaban$       2        an
  4   1      ananaban$     3        ana
  5   6      ban$          0        𝜺
  6   0      bananaban$    3        ban
  7   8      n$            0        𝜺
  8   4      naban$        1        n
  9   2      nanaban$      2        na
[Figure: suffix tree of 𝑇 = bananaban$ next to the LCP-intervals; each internal node corresponds to an interval of ranks sharing a common prefix]
- Leaf array 𝐿[0..𝑛] plus LCP array LCP[1..𝑛] encode full tree!
LCP array construction
- computing LCP[1..𝑛] naively too expensive
  - each value could take Θ(𝑛) time → Θ(𝑛²) in total
- but: seeing one large (= costly) LCP value → can find another large one!
- Example: 𝑇 = Buffalo␣buffalo␣buffalo␣buffalo$
  - first few suffixes in sorted order:
    𝑇_𝐿[0] = $
    𝑇_𝐿[1] = alo␣buffalo$
    𝑇_𝐿[2] = alo␣buffalo␣buffalo$
    𝑇_𝐿[3] = alo␣buffalo␣buffalo␣buffalo$
    common prefix of 𝑇_𝐿[2] and 𝑇_𝐿[3]: alo␣buffalo␣buffalo → LCP[3] = 19
  - Removing first character from 𝑇_𝐿[2] and 𝑇_𝐿[3] gives two new suffixes:
    𝑇_𝐿[?] = lo␣buffalo␣buffalo$
    𝑇_𝐿[?] = lo␣buffalo␣buffalo␣buffalo$
    common prefix: lo␣buffalo␣buffalo → LCP[?] = 18 (unclear where…)
  - Shortened suffixes might not be adjacent in sorted order! → no LCP entry for them!
Kasai’s algorithm – Example
- Kasai et al. used above observation systematically
- Dropping first character of adjacent suffixes might not lead to adjacent shorter suffixes,
  but the LCP entry of the shortened suffix can only be longer than the shortened common prefix: LCP[𝑅[𝑖 + 1]] ≥ LCP[𝑅[𝑖]] − 1.
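A Python sketch of Kasai's algorithm under the conventions used here (𝐿 is the suffix array, 𝑅 its inverse rank array). It processes suffixes in text order and reuses the previous match length minus one as a starting point, so the total number of character comparisons is linear. The suffix array is built naively for the demonstration, and the name `kasai` is hypothetical.

```python
def kasai(t, L):
    """Compute LCP[1..n] from text t (including '$') and suffix array L."""
    n = len(t)
    R = [0] * n
    for r, i in enumerate(L):            # rank array: inverse of L
        R[i] = r
    LCP = [0] * n                        # LCP[0] is unused
    ell = 0                              # length carried over from the previous suffix
    for i in range(n):                   # suffixes in text order T_0, T_1, ...
        r = R[i]
        if r == 0:                       # smallest suffix has no predecessor
            ell = 0
            continue
        j = L[r - 1]                     # predecessor of T_i in sorted order
        while i + ell < n and j + ell < n and t[i + ell] == t[j + ell]:
            ell += 1
        LCP[r] = ell
        if ell > 0:
            ell -= 1                     # dropping the first character loses at most 1
    return LCP


t = "bananaban$"
L = sorted(range(len(t)), key=lambda i: t[i:])   # naive suffix array
LCP = kasai(t, L)
print(LCP[1:])
```

The counter `ell` decreases by at most one per suffix and never exceeds the text length, which bounds the total work of the inner loop by 𝑂(𝑛).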
Kasai’s algorithm – Code
procedure computeLCP(𝑇[0..𝑛], 𝐿[0..𝑛], 𝑅[0..𝑛])
    // Assume 𝑇[𝑛] = $, 𝐿 and 𝑅 are suffix array and inverse
    ℓ := 0
    for 𝑖 := 0, . . . , 𝑛 − 1
        𝑟 := 𝑅[𝑖]
        // compute LCP[𝑟]; note that 𝑟 > 0 since 𝑅[𝑛] = 0
        𝑖₋₁ := 𝐿[𝑟 − 1]
        while 𝑇[𝑖 + ℓ] == 𝑇[𝑖₋₁ + ℓ] do
            ℓ := ℓ + 1
        LCP[𝑟] := ℓ
        ℓ := max{ℓ − 1, 0}
    return LCP[1..𝑛]

• remember length ℓ of induced common prefix
• use 𝐿 to get start index of suffixes

Analysis:
• dominant operation: character comparisons
• separately count those with outcomes “=” resp. “≠”
• each ≠ ends iteration of for-loop ⇒ ≤ 𝑛 cmps
• each = implies increment of ℓ, but ℓ ≤ 𝑛 and decremented ≤ 𝑛 times ⇒ ≤ 2𝑛 cmps
• Θ(𝑛) overall time
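The pseudocode translates almost line by line into Python. This is a sketch under the same assumption as the slide (the last character of the text is a unique sentinel `$` smaller than all others); `sa` and `rank` play the roles of 𝐿 and 𝑅, and the naive `suffix_array` helper is only there to make the example self-contained.

```python
def suffix_array(text):
    """Naive suffix array by sorting; stands in for a linear-time construction."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def kasai_lcp(text, sa):
    """Return LCP[1..n], where LCP[r] is the longest common prefix length of
    the suffixes at ranks r-1 and r. Assumes text[-1] is a unique sentinel."""
    n = len(text) - 1                  # text[n] is the sentinel
    rank = [0] * (n + 1)               # inverse suffix array (R on the slide)
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * (n + 1)
    ell = 0                            # length of the induced common prefix
    for i in range(n):                 # text order; the sentinel suffix is skipped
        r = rank[i]                    # r > 0 because rank[n] == 0
        j = sa[r - 1]                  # start of the suffix just before i in sa
        # The unique sentinel guarantees a mismatch before either index
        # leaves the text, so no explicit bounds check is needed.
        while text[i + ell] == text[j + ell]:
            ell += 1
        lcp[r] = ell
        ell = max(ell - 1, 0)          # at least ell - 1 characters carry over
    return lcp[1:]


text = "bananaban$"
print(kasai_lcp(text, suffix_array(text)))  # [0, 1, 2, 3, 0, 3, 0, 1, 2]
```

Note that `ell` is carried across iterations of the outer loop; resetting it to 0 each time would still be correct but would lose the linear bound on comparisons.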
Back to suffix trees
We can finally look into the black box of linear-time suffix-tree construction!
1. Compute suffix array for 𝑇.
[Figure: sorted suffixes of 𝑇 = bananaban$ with their starting indices (suffix array) and the LCP values between adjacent suffixes, shown alongside the corresponding tree paths.]
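One common way to go from a suffix array and an LCP array to a suffix tree is to insert the suffixes in sorted order while keeping the rightmost root-to-leaf path on a stack. The sketch below illustrates that idea; it is not taken from the slides, and the `Node` class and the function names are assumptions made for this example. It expects `lcp[r]` to be the LCP of the suffixes at ranks `r-1` and `r`, with `lcp[0] = 0`, and a text ending in a unique sentinel.

```python
class Node:
    """Suffix tree node, identified by its string depth (path-label length)."""
    def __init__(self, depth, suffix=None):
        self.depth = depth        # length of the path label from the root
        self.suffix = suffix      # start index for leaves, None for inner nodes
        self.children = []        # in lexicographic order of their edge labels


def build_suffix_tree(text, sa, lcp):
    root = Node(0)
    stack = [root]                # rightmost path, string depths increasing
    for r, start in enumerate(sa):
        h = lcp[r]
        last = None
        while stack[-1].depth > h:        # climb until depth <= h
            last = stack.pop()
        if stack[-1].depth < h:
            # No node of depth h on the path yet: split the edge to `last`.
            inner = Node(h)
            stack[-1].children.pop()      # `last` was the most recent child
            stack[-1].children.append(inner)
            inner.children.append(last)
            stack.append(inner)
        leaf = Node(len(text) - start, start)
        stack[-1].children.append(leaf)
        stack.append(leaf)
    return root


def inner_labels(text, root):
    """Path labels of all inner nodes (root included), sorted."""
    labels = []
    def visit(node):
        if node.suffix is not None:
            return node.suffix
        starts = [visit(child) for child in node.children]
        labels.append(text[starts[0]:starts[0] + node.depth])
        return starts[0]
    visit(root)
    return sorted(labels)


text = "bananaban$"
sa = [9, 5, 7, 3, 1, 6, 0, 8, 4, 2]
lcp = [0, 0, 1, 2, 3, 0, 3, 0, 1, 2]
print(inner_labels(text, build_suffix_tree(text, sa, lcp)))
# ['', 'a', 'an', 'ana', 'ban', 'n', 'na']
```

Every node is pushed and popped at most once, so given the two arrays this construction takes linear time in the number of suffixes.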
Conclusion
• (Enhanced) Suffix Arrays are the modern version of suffix trees