Notes 06 Text Indexing PDF

The longest common substring of "superiorcalifornialives" and "sealiver" is "alive".

Uploaded by

hussein hammoud


6 Text Indexing –

Searching whole genomes

Sebastian Wild
9 March 2020

version 2020-03-17 14:43

Outline

6 Text Indexing
6.1 Motivation
6.2 Suffix Trees
6.3 Applications
6.4 Longest Common Extensions
6.5 Suffix Arrays
6.6 Linear-Time Suffix Sorting
6.7 The LCP Array
6.1 Motivation
Inverted indices
(same as “indexes”)

• original indices in books: list of (key) words ↦ page numbers where they occur

• assumption: searches are only for whole (key) words

→ often reasonable for natural language text

Inverted index:
• collect all words in 𝑇
  • can be as simple as splitting 𝑇 at whitespace
  • actual implementations typically support stemming of words (goes → go, cats → cat)

• store mapping from words to a list of occurrences → how?
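The inverted-index idea can be sketched in a few lines of Python. This is only an illustration under simple assumptions: words are split at whitespace, lower-cased, and not stemmed, and the name `build_inverted_index` is hypothetical.

```python
from collections import defaultdict

def build_inverted_index(text):
    """Map each whitespace-separated word to the character offsets where it starts."""
    index = defaultdict(list)
    position = 0
    for word in text.split():
        # The next token is the first occurrence of `word` at or after `position`.
        position = text.index(word, position)
        index[word.lower()].append(position)
        position += len(word)
    return index

index = build_inverted_index("the cat saw the other cat")
print(index["cat"])   # [4, 22]
```

A plain dictionary answers whole-word queries only; the tries on the following slides are one way to organize the keys.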
Clicker Question

Do you know what a trie is?

A A what? No!

B I have heard the term, but don’t quite remember.

C I remember hearing about it in a module.

D Sure.

pingo.upb.de/622222
3
Tries
• efficient dictionary data structure for strings

• name from retrieval, but pronounced “try”

• tree based on symbol comparisons

• Assumption: stored strings are prefix-free (no string is a prefix of another), which holds if
  • all strings have the same length, or
  • strings have an “end-of-string” marker $ (some character ∉ Σ)

• Example:
  {aa$, aaab$, abaab$, abb$, abbab$, bba$, bbab$, bbb$}

[Figure: trie for the example set; edges are labeled with single characters (a, b, $) and each leaf is labeled with the stored string]

[Figure: trie construction (correct version). Note on the figure: NOT our standard tries, but version with compacted paths to leaves (see next page)]
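A standard trie over the example set can be sketched with nested dictionaries. This is a minimal illustration, not the course's reference implementation; the names `trie_insert` and `trie_contains` are hypothetical, and `$` is assumed not to occur inside the stored strings.

```python
END = "$"   # end-of-string marker, assumed not to occur in the stored strings

def trie_insert(root, word):
    """Follow or create one child per character of word + '$'."""
    node = root
    for ch in word + END:
        node = node.setdefault(ch, {})

def trie_contains(root, word):
    """Walk down one node per character; fail as soon as a child is missing."""
    node = root
    for ch in word + END:
        if ch not in node:
            return False
        node = node[ch]
    return True

root = {}
for w in ["aa", "aaab", "abaab", "abb", "abbab", "bba", "bbab", "bbb"]:
    trie_insert(root, w)

print(trie_contains(root, "abb"))   # True
print(trie_contains(root, "ab"))    # False: "ab" is only a prefix of stored strings
```

Because every stored string ends in `$`, a query visits at most one node per character of the query plus one for the marker.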
Clicker Question

Suppose we have a trie that stores 𝑛 strings over Σ = {A , . . . , Z}.

Each stored string consists of 𝑚 characters.
We now search for a query string 𝑄 with |𝑄| = 𝑞.
How many nodes in the trie are visited during this query?

A Θ(log 𝑛) F Θ(log 𝑚)

B Θ(log(𝑛𝑚)) G Θ(𝑞) ✓

C Θ(𝑚 · log 𝑛) H Θ(log 𝑞)

D Θ(𝑚 + log 𝑛) I Θ(𝑞 · log 𝑛)

E Θ(𝑚) J Θ(𝑞 + log 𝑛)

pingo.upb.de/622222
5
Clicker Question

Suppose we have a trie that stores 𝑛 strings over Σ = {A , . . . , Z}.

Each stored string consists of 𝑚 characters.
How many nodes does the trie have in total in the worst case?

A Θ(𝑛) D Θ(𝑛 log 𝑚)

B Θ(𝑛 + 𝑚) E Θ(𝑚)

C Θ(𝑛 · 𝑚) ✓ F Θ(𝑚 log 𝑛)

pingo.upb.de/622222
6
Compact tries
• compress paths of unary nodes (= 1 child) into single edge

• nodes store index of next character

[Figure: the standard trie and the compact trie for the example set {aa$, aaab$, abaab$, abb$, abbab$, bba$, bbab$, bbb$}, side by side; internal nodes of the compact trie store the index of the character to inspect next]

• searching slightly trickier, but same time complexity as in trie

• all internal nodes have ≥ 2 children → #internal nodes ≤ #leaves = #strings → linear space
Tries as inverted index
+ simple
+ fast lookup
− cannot handle more general queries:
  • search part of a word
  • search phrase (sequence of words)

What if the ‘text’ does not even have words to begin with?!
• biological sequences
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGC
CGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGC
CAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAA
TGCCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAA
• binary streams
00000010101001111010111000001111100011111011111001101101000011100010011011110000010001101010
01101100001101011010000000100000000111010110000010000111101011101100100011001011011101111111
110001010001011001010000001110101010011000000001101100001100111110000101 0101011101111000011
10101110010010101010100000111110100110000001111001101010000000100100100000101100011000110111

→ need new ideas
6.2 Suﬃx Trees
Suffix trees – A ‘magic’ data structure
Appetizer: Longest common substring problem
• Given: strings 𝑆1, ..., 𝑆𝑘    Example: 𝑆1 = superiorcalifornialives, 𝑆2 = sealiver

• Goal: find the longest substring that occurs in all 𝑘 strings → alive

Can we do this in time 𝑂(|𝑆1| + ··· + |𝑆𝑘|)? How??

Enter: suffix trees

• versatile data structure for index with full-text search
• linear time (for construction) and linear space
• allows efficient solutions for many advanced string problems

“Although the longest common substring problem looks trivial now, given our knowledge of suffix trees,
it is very interesting to note that in 1970 Don Knuth conjectured that
a linear-time algorithm for this problem would be impossible.” [Gusfield: Algorithms on Strings, Trees, and Sequences (1997)]
Suffix trees – Definition
• suffix tree T for text 𝑇 = 𝑇[0..𝑛 − 1] = compact trie of all suffixes of 𝑇$ (set 𝑇[𝑛] ≔ $)

• except: in leaves, store start index (instead of actual string)

• also: edge labels like in compact trie
  (more readable form on slides to explain algorithms)

Example:
𝑇 = bananaban$

 𝑖      0 1 2 3 4 5 6 7 8 9
 𝑇[𝑖]   b a n a n a b a n $

suffixes: {bananaban$, ananaban$, nanaban$, anaban$, naban$, aban$, ban$, an$, n$, $}

[Figure: suffix tree of 𝑇 = bananaban$; the leaves, from left to right, store the start indices 9, 5, 7, 3, 1, 6, 0, 8, 4, 2]
Suffix trees – Construction
• 𝑇[0..𝑛 − 1] has 𝑛 + 1 suffixes (starting at character 𝑖 ∈ [0..𝑛])

• We can build the suffix tree by inserting each suffix of 𝑇 into a compressed trie.
  But that takes time Θ(𝑛²). → not interesting!

Amazing result: Can construct the suffix tree of 𝑇 in Θ(𝑛) time!
(same order of growth as reading the text!)

• algorithms are a bit tricky to understand

• but were a theoretical breakthrough
• and they are efficient in practice (and heavily used)!

→ for now, take linear-time construction for granted. What can we do with them?
6.3 Applications
Applications of suffix trees
• In this section, always assume suffix tree T for 𝑇 given.

Recall: T is stored with indices as edge labels, but think about the version with explicit strings on the edges.

[Figure: both versions of the suffix tree of 𝑇 = bananaban$]

• Moreover: assume internal nodes store pointer to leftmost leaf in subtree.

• Notation: 𝑇𝑖 = 𝑇[𝑖..𝑛] (including $)
Application 1: Text Indexing / String Matching
• 𝑃 occurs in 𝑇 ⇐⇒ 𝑃 is a prefix of a suffix of 𝑇

• we have all suffixes in T!

→ (try to) follow path with label 𝑃, until
  1. we get stuck
     at internal node (no node with next character of 𝑃)
     or inside edge (mismatch of next characters)
     → 𝑃 does not occur in 𝑇
  2. we run out of pattern
     reach end of 𝑃 at internal node 𝑣 or inside edge towards 𝑣
     → 𝑃 occurs at all leaves in subtree of 𝑣
  3. we run out of tree
     reach a leaf ℓ with part of 𝑃 left → compare 𝑃 to ℓ.
     This cannot happen when testing edge labels since $ ∉ Σ,
     but needs check(s) in compact trie implementation!

• Finding first match (or NO_MATCH) takes 𝑂(|𝑃|) time!

Examples (𝑇 = bananaban$):
• 𝑃 = ann
• 𝑃 = ana
• 𝑃 = briar

[Figure: suffix tree of 𝑇 = bananaban$]
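The search procedure can be tried out on a simplified stand-in for the suffix tree. The sketch below builds an uncompacted suffix trie (quadratic size, so for illustration only, not the linear-space compact structure of the slides) and reports all leaves below the node reached by 𝑃. The names `build_suffix_trie` and `occurrences` and the `"leaf"` marker key are hypothetical.

```python
def build_suffix_trie(text):
    """Insert every suffix of text + '$'; each leaf stores the suffix start index."""
    text += "$"
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start   # "leaf" has 4 characters, so it never clashes with a text character
    return root

def occurrences(root, pattern):
    """Follow the path labeled pattern; collect all leaves in the subtree reached."""
    node = root
    for ch in pattern:
        if ch not in node:      # case 1: we get stuck
            return []
        node = node[ch]
    found, stack = [], [node]   # case 2: we ran out of pattern
    while stack:
        node = stack.pop()
        for key, child in node.items():
            if key == "leaf":
                found.append(child)
            else:
                stack.append(child)
    return sorted(found)

trie = build_suffix_trie("bananaban")
print(occurrences(trie, "ana"))    # [1, 3]
print(occurrences(trie, "ann"))    # []
print(occurrences(trie, "briar"))  # []
```

Case 3 (running out of tree) cannot arise here because every edge carries a single character.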
Application 2: Longest repeated substring
• Goal: Find longest substring 𝑇[𝑖..𝑖 + ℓ) that occurs also at 𝑗 ≠ 𝑖: 𝑇[𝑗..𝑗 + ℓ) = 𝑇[𝑖..𝑖 + ℓ).
  (e.g. for compression → Unit 7)
How can we efficiently check all possible substrings?

Repeated substrings = shared paths in suffix tree

• 𝑇5 = aban$ and 𝑇7 = an$ have longest common prefix ‘a’
  → ∃ internal node with path label ‘a’ (here single edge, can be longer path)

→ longest repeated substring = longest common prefix (LCP) of two suffixes (actually: adjacent leaves)

• Algorithm: compute the string depth (length of the path label) of every node and report an internal node of maximal string depth.
  Both can be done in depth-first traversal → Θ(𝑛) time

[Figure: suffix tree of 𝑇 = bananaban$]
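The observation that the answer is the longest common prefix of two adjacent leaves can be checked without a suffix tree. The sketch below sorts the suffixes naively (so it is far from linear time) and compares neighbours in sorted order; the function name is hypothetical.

```python
def longest_repeated_substring(text):
    """Longest LCP of two lexicographically adjacent suffixes (naive, not linear time)."""
    suffixes = sorted(text[i:] for i in range(len(text)))
    best = ""
    for a, b in zip(suffixes, suffixes[1:]):
        length = 0
        while length < min(len(a), len(b)) and a[length] == b[length]:
            length += 1
        if length > len(best):   # keep the first longest candidate in sorted order
            best = a[:length]
    return best

print(longest_repeated_substring("bananaban"))   # ana
```

For bananaban there are two repeated substrings of length 3 (ana and ban); this sketch reports the one that comes first in sorted order.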
Generalized suffix trees
• longest repeated substring (of one string) feels very similar to
  longest common substring of several strings 𝑇^(1), ..., 𝑇^(𝑘) with 𝑇^(𝑗) ∈ Σ^(𝑛_𝑗)
• can we solve that in the same way?

• could build the suffix tree for each 𝑇^(𝑗) ... but doesn’t seem to help

→ need a single/joint suffix tree for several texts

Enter: generalized suffix tree

• Define 𝑇 := 𝑇^(1) $1 𝑇^(2) $2 ··· 𝑇^(𝑘) $k for 𝑘 new end-of-word symbols

• Construct suffix tree T for 𝑇

→ $j-edges always lead to leaves → ∃ leaf (𝑗, 𝑖) for each suffix 𝑇^(𝑗)_𝑖 = 𝑇^(𝑗)[𝑖..𝑛_𝑗]
Application 3: Longest common substring
• With that new idea, we can find longest common substrings:
  1. Compute generalized suffix tree T.
  2. Store with each node the subset of strings that contain its path label:
     2.1. Traverse T bottom-up.
     2.2. For a leaf (𝑗, 𝑖), the subset is {𝑗}.
     2.3. For an internal node, the subset is the union of its children’s subsets.
  3. In top-down traversal, compute string depths of nodes. (as above)
  4. Report deepest node (by string depth) whose subset is {1, ..., 𝑘}.

• Each step takes time Θ(𝑛) for 𝑛 = 𝑛1 + ··· + 𝑛𝑘 the total length of all texts.
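The steps above can be imitated on an uncompacted generalized suffix trie. This is a sketch under simplifying assumptions: the trie has quadratic size, the subsets are filled in during insertion rather than in a separate bottom-up pass, and no $j separators are needed because each text is inserted on its own. All names are hypothetical.

```python
def longest_common_substring(texts):
    """Deepest trie node whose path label occurs in every text (quadratic sketch)."""
    def new_node():
        return {"children": {}, "ids": set()}

    root = new_node()
    for j, text in enumerate(texts):
        for start in range(len(text)):
            node = root
            for ch in text[start:]:
                node = node["children"].setdefault(ch, new_node())
                node["ids"].add(j)          # text j contains this node's path label

    everyone = set(range(len(texts)))
    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        if len(label) > len(best):
            best = label
        for ch, child in node["children"].items():
            if child["ids"] == everyone:    # subsets only shrink further down
                stack.append((child, label + ch))
    return best

print(longest_common_substring(["superiorcalifornialives", "sealiver"]))   # alive
print(longest_common_substring(["bcabcac", "aabca", "bcaa"]))              # bca
```

On a compact suffix tree the same traversal touches only Θ(𝑛) nodes, which is where the linear bound of the slides comes from.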
Longest common substring – Example
𝑇^(1) = bcabcac, 𝑇^(2) = aabca, 𝑇^(3) = bcaa

[Figure: generalized suffix tree of 𝑇^(1) $1 𝑇^(2) $2 𝑇^(3) $3; every node is annotated with the subset of {1, 2, 3} of strings that contain its path label, and leaves show their suffixes]
6.4 Longest Common Extensions
Application 4: Longest Common Extensions
• We implicitly used a special case of a more general, versatile idea:

The longest common extension (LCE) data structure:

• Given: String 𝑇[0..𝑛 − 1]
• Goal: Answer LCE queries, i.e.,
  given positions 𝑖, 𝑗 in 𝑇,
  how far can we read the same text from there?
  formally: LCE(𝑖, 𝑗) = max{ℓ : 𝑇[𝑖..𝑖 + ℓ) = 𝑇[𝑗..𝑗 + ℓ)}

→ use suffix tree of 𝑇!
  (longest common prefix of 𝑖th and 𝑗th suffix)

• In T: LCE(𝑖, 𝑗) = LCP(𝑇𝑖, 𝑇𝑗) → same thing, different name!
  = string depth of lowest common ancestor (LCA) of leaves 𝑖 and 𝑗

• in short: LCE(𝑖, 𝑗) = LCP(𝑇𝑖, 𝑇𝑗) = stringDepth(LCA(𝑖, 𝑗))

[Figure: suffix tree of 𝑇 = bananaban$]
Efficient LCA
How to find lowest common ancestors?
• Could walk up the tree to find LCA → Θ(𝑛) worst case

• Could store all LCAs in big table → Θ(𝑛²) space and preprocessing

Amazing result: Can compute data structure in Θ(𝑛) time and space
that finds any LCA in constant(!) time.
• a bit tricky to understand
• but a theoretical breakthrough
• and useful in practice

and suffix tree construction inside . . .

→ for now, use 𝑂(1) LCA as black box.

→ After linear preprocessing (time & space), we can find LCEs in 𝑂(1) time.
Application 5: Approximate matching
𝒌-mismatch matching:
• Input: text 𝑇[0..𝑛 − 1], pattern 𝑃[0..𝑚 − 1], 𝑘 ∈ [0..𝑚)
• Output:
  • smallest 𝑖 so that 𝑇[𝑖..𝑖 + 𝑚) and 𝑃 differ in at most 𝑘 characters (“Hamming distance ≤ 𝑘”)
  • or NO_MATCH if there is no such 𝑖

→ searching with typos

• Assume longest common extensions in 𝑇 $1 𝑃 $2 can be found in 𝑂(1)
  → generalized suffix tree T has been built
  → string depths of all internal nodes have been computed
  → constant-time LCA data structure for T has been built
Clicker Question

What is the Hamming distance between heart and beard?

pingo.upb.de/622222
21
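For readers working through the question, the Hamming distance of two equal-length strings is simply the number of positions at which they differ. A minimal sketch (the function name is hypothetical):

```python
def hamming_distance(a, b):
    """Number of positions where the equal-length strings a and b differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("heart", "beard"))   # 2 (h/b and t/d differ)
```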
Kangaroo Algorithm for approximate matching
procedure kMismatch(𝑇[0..𝑛 − 1], 𝑃[0..𝑚 − 1], 𝑘)
    // build LCE data structure for 𝑇 $1 𝑃 $2
    for 𝑖 := 0, . . . , 𝑛 − 𝑚 do               // every window 𝑇[𝑖..𝑖 + 𝑚)
        mismatches := 0; 𝑡 := 𝑖; 𝑝 := 0
        while 𝑝 < 𝑚 do
            ℓ := LCE(𝑡, 𝑝)                      // jump over matching part
            𝑡 := 𝑡 + ℓ; 𝑝 := 𝑝 + ℓ
            if 𝑝 < 𝑚 then                       // 𝑇[𝑡] ≠ 𝑃[𝑝]: a genuine mismatch
                mismatches := mismatches + 1
                if mismatches > 𝑘 then break
                𝑡 := 𝑡 + 1; 𝑝 := 𝑝 + 1          // skip the mismatching character
        if mismatches ≤ 𝑘 then
            return 𝑖
    return NO_MATCH

• Analysis: Θ(𝑛 + 𝑚) preprocessing + 𝑂(𝑛 · 𝑘) matching

→ very efficient for small 𝑘

• State of the art
  • 𝑂(𝑛 · 𝑘² log 𝑘 / 𝑚) possible with complicated algorithms
  • extensions for edit distance ≤ 𝑘 possible
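The kangaroo jumps can be tried out in Python. In this sketch a naive character-by-character `lce` stands in for the constant-time LCE data structure, so the running time is not 𝑂(𝑛 · 𝑘); only the control flow is illustrated. The names `lce` and `k_mismatch` are hypothetical, and `None` plays the role of NO_MATCH.

```python
def lce(a, i, b, j):
    """Naive LCE: length of the longest common prefix of a[i:] and b[j:]."""
    length = 0
    while (i + length < len(a) and j + length < len(b)
           and a[i + length] == b[j + length]):
        length += 1
    return length

def k_mismatch(text, pattern, k):
    """Smallest start of a window that differs from pattern in at most k positions."""
    n, m = len(text), len(pattern)
    for start in range(n - m + 1):
        mismatches, p = 0, 0
        while p < m:
            p += lce(text, start + p, pattern, p)   # jump over the matching part
            if p < m:                               # stopped at a genuine mismatch
                mismatches += 1
                if mismatches > k:
                    break
                p += 1                              # skip the mismatching character
        if mismatches <= k:
            return start
    return None

print(k_mismatch("bananaban", "bat", 1))   # 0 ("ban" vs "bat": one mismatch)
print(k_mismatch("bananaban", "nab", 0))   # 4
```

Each window performs at most 𝑘 + 1 jumps, which is the source of the 𝑂(𝑛 · 𝑘) matching bound once LCE queries cost 𝑂(1).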
Application 6: Matching with wildcards
• Allow a wildcard character in pattern; it stands for an arbitrary (single) character
  Example: 𝑃 = unit*, 𝑇 = in␣unit5␣we␣will

• similar algorithm as for 𝑘-mismatch → 𝑂(𝑛 · 𝑘 + 𝑚) when 𝑃 has 𝑘 wildcards

∗ ∗ ∗

Many more applications, in particular for problems on biological sequences

20+ described in Gusfield, Algorithms on Strings, Trees, and Sequences (1997)
Suffix trees – Discussion
• Suffix trees were a threshold invention
  + linear time and space
  + suddenly many questions efficiently solvable in theory

• construction of suffix trees:
  − linear time, but significant overhead
  − construction methods fairly complicated
  − many pointers in tree incur large space overhead
6.5 Suﬃx Arrays
Clicker Question

Recap: Check all correct statements about suffix tree T of 𝑇[0..𝑛).

A We require 𝑇 to end with $. ✓
B The size of T can be Ω(𝑛²) in the worst case.
C T is a standard trie of all suffixes of 𝑇$.
D T is a compact trie of all suffixes of 𝑇$. ✓
E The leaves of T store (a copy of) a suffix of 𝑇$.
F Naive construction of T takes Ω(𝑛²) (worst case). ✓
G T can be computed in 𝑂(𝑛) time (worst case). ✓
H T has 𝑛 leaves.

pingo.upb.de/622222
Putting suffix trees on a diet
• Observation: order of leaves in suffix tree = suffixes lexicographically sorted

• Idea: only store list of leaves 𝐿[0..𝑛]

• Enough to do efficient string matching!
  1. Use binary search for pattern 𝑃
  2. check if 𝑃 is prefix of suffix after found position
• Example: 𝑃 = ana

• 𝐿[0..𝑛] is called suffix array:
  𝐿[𝑟] = (start index of) 𝑟th suffix in sorted order

• using 𝐿, can do string matching with ≤ (lg 𝑛 + 2) · 𝑚 character comparisons

Example: 𝑇 = bananaban$

 𝑟   𝐿[𝑟]   𝑇𝐿[𝑟]
 0   9      $
 1   5      aban$
 2   7      an$
 3   3      anaban$
 4   1      ananaban$
 5   6      ban$
 6   0      bananaban$
 7   8      n$
 8   4      naban$
 9   2      nanaban$
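Both steps of the matching procedure fit in a short Python sketch. The suffix array is built here by a naive general-purpose sort of the suffixes, which is easy to write but not the linear-time construction discussed later. The names `suffix_array` and `occurrences` are hypothetical.

```python
def suffix_array(text):
    """Start indices of all suffixes of text in lexicographic order (naive sort)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def occurrences(text, sa, pattern):
    """Binary search for the first suffix >= pattern, then scan while pattern is a prefix."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    found = []
    while lo < len(sa) and text.startswith(pattern, sa[lo]):
        found.append(sa[lo])
        lo += 1
    return sorted(found)

text = "bananaban$"
L = suffix_array(text)
print(L)                              # [9, 5, 7, 3, 1, 6, 0, 8, 4, 2]
print(occurrences(text, L, "ana"))    # [1, 3]
```

The sketch relies on `$` sorting before the letters, which holds for ASCII.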
Clicker Question

Recap: Check all correct statements about suffix array 𝐿[0..𝑛] and
suffix tree T of text 𝑇[0..𝑛).

A 𝐿[0..𝑛] lists the start indices of leaves of T in left-to-right order. ✓
B 𝑇[𝐿[𝑟]..𝑛] is the path label in T to the leaf storing 𝑟.
C 𝑇[𝐿[𝑟]..𝑛] is the path label to the 𝑟th leaf in T. ✓
D 𝑇𝐿[𝑟] is the 𝑟th smallest suffix of 𝑇 (lexicographic order). ✓
E In terms of Θ-classes, T needs more space than 𝐿.
F 𝐿 (and 𝑇) suffice to solve the text indexing problem. ✓

pingo.upb.de/622222
Suffix arrays – Construction
How to compute 𝐿[0..𝑛]?
• from suffix tree
  + possible with traversal . . .
  − but we are trying to avoid constructing suffix trees!

• sorting the suffixes of 𝑇 using general purpose sort
  + trivial to code!
  − but: comparing two suffixes can take Θ(𝑛) character comparisons
    → Θ(𝑛² log 𝑛) time in worst case

→ we can do better!
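Fat-pivot radix quicksort – Example
Successive snapshots of the array, from left to right (blank cell: segment not shown again in that snapshot):

 input      | 1         | 2         | 3         | 4
 she        | by        | by        |           |
 sells      | are       | are       |           |
 seashells  | she       | sells     | sells     | seashells
 by         | sells     | seashells | seashells | sea
 the        | seashells | sea       | sea       | seashells
 sea        | sea       | sells     | sells     | sells
 shore      | shore     | seashells | seashells | sells
 the        | shells    | she       | she       | she$
 shells     | she       | shore     | shells    | shells
 she        | sells     | shells    | she       | she$
 sells      | surely    | she       | shore     |
 are        | seashells | surely    |           |
 surely     | the       | the       | the       | the
 seashells  | the       | the       | the       | the

...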
Fat-pivot radix quicksort
(details in §5.1 of Sedgewick, Wayne: Algorithms, 4th ed. (2011), Pearson)

• partition based on 𝒅th character only (initially 𝑑 = 0)

→ 3 segments: smaller, equal, or larger than 𝑑th symbol of pivot

• recurse on smaller and larger with same 𝑑, on equal with 𝑑 + 1
  → never compare equal prefixes twice

→ can show: ∼ 2 ln(2) · 𝑛 lg 𝑛 ≈ 1.39 𝑛 lg 𝑛 character comparisons in expectation (random pivots)

+ simple to code
+ efficient for sorting many lists of strings

• fat-pivot radix quicksort finds suffix array in 𝑂(𝑛 log 𝑛) expected time

but we can do 𝑂(𝑛) time worst case!
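The three-way partitioning on the 𝑑th character can be sketched as follows. This is an illustrative version in the spirit of the textbook algorithm, not a transcription of it; the names `char_at` and `radix_quicksort` are hypothetical, and the end of a string is treated as a character smaller than all others (playing the role of $).

```python
import random

def char_at(s, d):
    """d-th character as an integer, or -1 past the end of the string."""
    return ord(s[d]) if d < len(s) else -1

def radix_quicksort(strings, lo=0, hi=None, d=0):
    """Sort strings[lo..hi] in place; all of them are assumed to agree on the first d characters."""
    if hi is None:
        hi = len(strings) - 1
    if lo >= hi:
        return
    pivot = char_at(strings[random.randint(lo, hi)], d)   # random pivot
    lt, i, gt = lo, lo, hi
    while i <= gt:                                        # fat-pivot (3-way) partition
        c = char_at(strings[i], d)
        if c < pivot:
            strings[lt], strings[i] = strings[i], strings[lt]
            lt += 1
            i += 1
        elif c > pivot:
            strings[i], strings[gt] = strings[gt], strings[i]
            gt -= 1
        else:
            i += 1
    radix_quicksort(strings, lo, lt - 1, d)               # smaller: same d
    if pivot >= 0:                                        # equal: move on to character d + 1
        radix_quicksort(strings, lt, gt, d + 1)
    radix_quicksort(strings, gt + 1, hi, d)               # larger: same d

words = "she sells seashells by the sea shore the shells she sells are surely seashells".split()
result = list(words)
radix_quicksort(result)
print(result[:4])   # ['are', 'by', 'sea', 'seashells']
```

To sort suffixes with this scheme one would sort start indices and read characters from the text, rather than materializing the suffix strings.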
6.6 Linear-Time Suffix Sorting
Inverse suffix array: going left & right
• to understand the fastest algorithm, it is helpful to define the inverse suffix array:
• 𝑅[𝑖] = 𝑟 ⇐⇒ 𝐿[𝑟] = 𝑖      (𝐿 = leaf array)
  ⇐⇒ there are 𝑟 suffixes that come before 𝑇𝑖 in sorted order
  ⇐⇒ 𝑇𝑖 has (0-based) rank 𝑟   → call 𝑅[0..𝑛] the rank array

Sorting the suffixes turns the left table into the right table; 𝑅 maps left to right (e.g. 𝑅[0] = 6), 𝐿 maps right to left (e.g. 𝐿[8] = 4).

 𝑖   𝑅[𝑖]   𝑇𝑖              |   𝑟   𝐿[𝑟]   𝑇𝐿[𝑟]
 0   6      bananaban$      |   0   9      $
 1   4      ananaban$       |   1   5      aban$
 2   9      nanaban$        |   2   7      an$
 3   3      anaban$         |   3   3      anaban$
 4   8      naban$          |   4   1      ananaban$
 5   1      aban$           |   5   6      ban$
 6   5      ban$            |   6   0      bananaban$
 7   2      an$             |   7   8      n$
 8   7      n$              |   8   4      naban$
 9   0      $               |   9   2      nanaban$
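Since 𝑅 and 𝐿 are inverse permutations of each other, either one can be obtained from the other in a single pass. A small sketch (the function name is hypothetical):

```python
def inverse_permutation(perm):
    """Return inv with inv[perm[r]] == r for all r; turns L into R and R into L."""
    inv = [0] * len(perm)
    for r, i in enumerate(perm):
        inv[i] = r
    return inv

L = [9, 5, 7, 3, 1, 6, 0, 8, 4, 2]   # suffix array of bananaban$
R = inverse_permutation(L)
print(R)                             # [6, 4, 9, 3, 8, 1, 5, 2, 7, 0]
```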
Linear-time suffix sorting
DC3 / Skew algorithm

1. Compute rank array 𝑅1,2 for suffixes 𝑇𝑖 starting at 𝑖 ≢ 0 (mod 3) (i.e. 𝑖 not a multiple of 3) recursively.

2. Induce rank array 𝑅0 for suffixes 𝑇0, 𝑇3, 𝑇6, 𝑇9, . . . from 𝑅1,2.

3. Merge 𝑅1,2 and 𝑅0 using 𝑅1,2.

→ rank array 𝑅 for entire input

• We will show that steps 2. and 3. take Θ(𝑛) time

→ Total complexity is 𝑛 + (2/3)𝑛 + (2/3)²𝑛 + (2/3)³𝑛 + ··· ≤ 𝑛 · Σ_{𝑖≥0} (2/3)^𝑖 = 3𝑛 = Θ(𝑛)

• Note: 𝐿 can easily be computed from 𝑅 in one pass, and vice versa.
  → Can use whichever is more convenient.
DC3 / Skew algorithm – Step 2: Inducing ranks
• Assume: rank array 𝑅1,2 known:
  𝑅1,2[𝑖] = rank of 𝑇𝑖 among 𝑇1, 𝑇2, 𝑇4, 𝑇5, 𝑇7, 𝑇8, . . .   for 𝑖 = 1, 2, 4, 5, 7, 8, . . .
  𝑅1,2[𝑖] = undefined                                        for 𝑖 = 0, 3, 6, 9, . . .

• Task: sort the suffixes 𝑇0, 𝑇3, 𝑇6, 𝑇9, . . . in linear time (!)

• Suppose we want to compare 𝑇0 and 𝑇3.

  • Characterwise comparisons too expensive
  • but: after removing first character, we obtain 𝑇1 and 𝑇4
  • these two can be compared in constant time by comparing 𝑅1,2[1] and 𝑅1,2[4]!

→ 𝑇0 comes before 𝑇3 in lexicographic order
  iff pair (𝑇[0], 𝑅1,2[1]) comes before pair (𝑇[3], 𝑅1,2[4]) in lexicographic order
DC3 / Skew algorithm – Inducing ranks example
𝑇 = hannahbansbananasman$$$ (append 3 $ markers)

• 𝑅1,2 (known): ranks of the suffixes 𝑇𝑖 with 𝑖 ≢ 0 (mod 3)

𝑅1,2[22] = 0     𝑇22 = $
𝑅1,2[20] = 1     𝑇20 = $$$
𝑅1,2[ 4] = 2     𝑇4  = ahbansbananasman$$$
𝑅1,2[11] = 3     𝑇11 = ananasman$$$
𝑅1,2[13] = 4     𝑇13 = anasman$$$
𝑅1,2[ 1] = 5     𝑇1  = annahbansbananasman$$$
𝑅1,2[ 7] = 6     𝑇7  = ansbananasman$$$
𝑅1,2[10] = 7     𝑇10 = bananasman$$$
𝑅1,2[ 5] = 8     𝑇5  = hbansbananasman$$$
𝑅1,2[17] = 9     𝑇17 = man$$$
𝑅1,2[19] = 10    𝑇19 = n$$$
𝑅1,2[14] = 11    𝑇14 = nasman$$$
𝑅1,2[ 2] = 12    𝑇2  = nnahbansbananasman$$$
𝑅1,2[ 8] = 13    𝑇8  = nsbananasman$$$
𝑅1,2[16] = 14    𝑇16 = sman$$$

• Suffixes 𝑇𝑖 with 𝑖 ≡ 0 (mod 3), split into first character + 𝑇𝑖+1 and encoded as the pair (𝑇[𝑖], 𝑅1,2[𝑖+1])

𝑇0  = h 𝑇1     (𝑇1  = annahbansbananasman$$$)    pair: h 05
𝑇3  = n 𝑇4     (𝑇4  = ahbansbananasman$$$)       pair: n 02
𝑇6  = b 𝑇7     (𝑇7  = ansbananasman$$$)          pair: b 06
𝑇9  = s 𝑇10    (𝑇10 = bananasman$$$)             pair: s 07
𝑇12 = n 𝑇13    (𝑇13 = anasman$$$)                pair: n 04
𝑇15 = a 𝑇16    (𝑇16 = sman$$$, 𝑅1,2[16] = 14)    pair: a 14
𝑇18 = a 𝑇19    (𝑇19 = n$$$)                      pair: a 10
𝑇21 = $ 𝑇22    (𝑇22 = $)                         pair: $ 00

• Radix sort of the pairs gives 𝑅0

𝑇21  $00  →  𝑅0[21] = 0
𝑇18  a10  →  𝑅0[18] = 1
𝑇15  a14  →  𝑅0[15] = 2
𝑇6   b06  →  𝑅0[ 6] = 3
𝑇0   h05  →  𝑅0[ 0] = 4
𝑇3   n02  →  𝑅0[ 3] = 5
𝑇12  n04  →  𝑅0[12] = 6
𝑇9   s07  →  𝑅0[ 9] = 7

• sorting of pairs doable in 𝑂(𝑛) time by 2 iterations of counting sort
• → Obtain 𝑅0 in 𝑂(𝑛) time
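The inducing step can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `induce_r0` is hypothetical, `R12` is a dictionary copied from the example, and Python's stable `sorted` stands in for the two counting-sort passes, so the sketch itself runs in 𝑂(𝑛 log 𝑛) rather than 𝑂(𝑛).

```python
def induce_r0(T, R12):
    """Rank all suffixes T_i with i % 3 == 0, given the ranks R12 of the others.

    T_i is its first character followed by T_{i+1}, and i + 1 is never a
    multiple of 3, so the pair (T[i], R12[i + 1]) determines the order.
    """
    starts = [i for i in range(0, len(T), 3) if i + 1 in R12]
    # LSD radix sort on the pairs: stable sort by the second component,
    # then stable sort by the first component.
    # (Assumption: a linear-time version would use counting sort twice.)
    starts = sorted(starts, key=lambda i: R12[i + 1])
    starts = sorted(starts, key=lambda i: T[i])
    return {i: rank for rank, i in enumerate(starts)}


T = "hannahbansbananasman$$$"
# R_{1,2} as given in the example (suffix start index -> rank)
R12 = {22: 0, 20: 1, 4: 2, 11: 3, 13: 4, 1: 5, 7: 6, 10: 7,
       5: 8, 17: 9, 19: 10, 14: 11, 2: 12, 8: 13, 16: 14}
R0 = induce_r0(T, R12)
```

The resulting dictionary `R0` maps each start index 𝑖 ≡ 0 (mod 3) to its rank among the 0-suffixes.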
DC3 / Skew algorithm – Step 3: Merging
• Have:
  - sorted 0-list (the suffixes 𝑇0, 𝑇3, 𝑇6, 𝑇9, . . .):
      𝑇21  $$
      𝑇18  an$$$
      𝑇15  asman$$$
      𝑇6   bansbananasman$$$
      𝑇0   hannahbansbananasman$$$
      𝑇3   nahbansbananasman$$$
      𝑇12  nanasman$$$
      𝑇9   sbananasman$$$
  - sorted 1,2-list (the suffixes 𝑇1, 𝑇2, 𝑇4, 𝑇5, 𝑇7, 𝑇8, 𝑇10, 𝑇11, . . .):
      𝑇22  $
      𝑇20  $$$
      𝑇4   ahbansbananasman$$$
      𝑇11  ananasman$$$
      𝑇13  anasman$$$
      𝑇1   annahbansbananasman$$$
      𝑇7   ansbananasman$$$
      𝑇10  bananasman$$$
      𝑇5   hbansbananasman$$$
      𝑇17  man$$$
      𝑇19  n$$$
      𝑇14  nasman$$$
      𝑇2   nnahbansbananasman$$$
      𝑇8   nsbananasman$$$
      𝑇16  sman$$$
• Task: Merge them!
  - use standard merging method from Mergesort
  - but speed up comparisons using 𝑅1,2
• Merged output so far:
      𝑇22  $
      𝑇21  $$
      𝑇20  $$$
      𝑇4   ahbansbananasman$$$
      𝑇18  an$$$
• Next: Compare 𝑇15 to 𝑇11
  Idea: try same trick as before
      𝑇15 = asman$$$     = a 𝑇16
      𝑇11 = ananasman$$$ = a 𝑇12
  can't compare 𝑇16 and 𝑇12 either! (𝑇12 has no rank in 𝑅1,2)
• → Compare 𝑇16 to 𝑇12
      𝑇16 = sman$$$     = s 𝑇17
      𝑇12 = nanasman$$$ = n 𝑇13
  always at most 2 steps, then can use 𝑅1,2!
• → 𝑂(𝑛) time for merge
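The merge with the "at most 2 steps" comparison can be sketched as below. The names `merge_suffix_lists`, `rank`, `char`, and `zero_first` are hypothetical, and the two sorted input lists are produced here by naive sorting purely to have test data; in DC3 they would come from steps 1 and 2.

```python
def merge_suffix_lists(T, sorted0, sorted12):
    """Merge the sorted 0-list and the sorted 1,2-list of suffix start indices."""
    R12 = {i: r for r, i in enumerate(sorted12)}

    def rank(k):
        # Indices past the end of T stand for the empty suffix (smallest).
        return R12.get(k, -1)

    def char(k):
        return T[k] if k < len(T) else ""

    def zero_first(i, j):
        """True iff T_i < T_j, for i % 3 == 0 and j % 3 in (1, 2)."""
        if j % 3 == 1:
            # One step: i + 1 and j + 1 both have ranks in R12.
            return (char(i), rank(i + 1)) < (char(j), rank(j + 1))
        # Two steps: i + 2 and j + 2 both have ranks in R12.
        return ((char(i), char(i + 1), rank(i + 2))
                < (char(j), char(j + 1), rank(j + 2)))

    merged, a, b = [], 0, 0
    while a < len(sorted0) and b < len(sorted12):
        if zero_first(sorted0[a], sorted12[b]):
            merged.append(sorted0[a])
            a += 1
        else:
            merged.append(sorted12[b])
            b += 1
    return merged + sorted0[a:] + sorted12[b:]


T = "hannahbansbananasman$$$"
# Stand-in for steps 1 and 2: obtain both sorted lists by naive sorting.
naive = sorted(range(len(T)), key=lambda i: T[i:])
sorted0 = [i for i in naive if i % 3 == 0]
sorted12 = [i for i in naive if i % 3 != 0]
merged = merge_suffix_lists(T, sorted0, sorted12)
```

Each comparison looks at no more than two characters and one pair of ranks, so the merge loop does constant work per output element.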
Clicker Question

Recap: Check all correct statements about suffix array 𝐿[0..𝑛], inverse suffix array 𝑅[0..𝑛], and suffix tree T of text 𝑇.

A 𝐿 lists the leaves of T in left-to-right order. ✓
B 𝑅 lists starting indices of suffixes in lexicographic order. ✗
C 𝐿 lists starting indices of suffixes in lexicographic order. ✓
D 𝐿[𝑟] = 𝑖 iff 𝑅[𝑖] = 𝑟 ✓
E 𝐿 stands for leaf ✓
F 𝐿 stands for left ✗
G 𝑅 stands for rank ✓
H 𝑅 stands for right ✗

pingo.upb.de/622222
DC3 / Skew algorithm – Fix recursive call
• both step 2. and 3. doable in 𝑂(𝑛) time!
• But: we cheated in 1. step! “compute rank array 𝑅1,2 recursively”
  → Taking a subset of suffixes is not an instance of the same problem!
• Need a single string 𝑇′ to recurse on, from which we can deduce 𝑅1,2.
  How can we make 𝑇′ “skip” some suffixes?
• redefine alphabet to be triples of characters abc; write 𝑇^□ for 𝑇 read as a string of triples
  Example: 𝑇 = bananaban$$$ → 𝑇^□ = ban ana ban $$$
• suffixes of 𝑇^□ ↔ 𝑇0, 𝑇3, 𝑇6, 𝑇9, . . .
      ban ana ban $$$   ↔ 𝑇0
      ana ban $$$       ↔ 𝑇3
      ban $$$           ↔ 𝑇6
      $$$               ↔ 𝑇9
• 𝑇′ = 𝑇[1..𝑛)^□ $$$ 𝑇[2..𝑛)^□ $$$ ↔ 𝑇𝑖 with 𝑖 ≢ 0 (mod 3).
• → Can call suffix sorting recursively on 𝑇′ and map result to 𝑅1,2
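A small check of the triple idea on the example string: sorting the suffixes of the triple string gives the same order as sorting the suffixes 𝑇0, 𝑇3, 𝑇6, 𝑇9 directly. The helper name `to_triples` is hypothetical, and naive sorting is used only for illustration.

```python
def to_triples(s):
    """Split s into blocks of 3 characters, padding the last block with '$'."""
    s += "$" * (-len(s) % 3)
    return [s[i:i + 3] for i in range(0, len(s), 3)]


T = "bananaban$$$"
blocks = to_triples(T)

# Suffix k of the triple string corresponds to suffix T_{3k} of the text.
order_blocks = sorted(range(len(blocks)), key=lambda k: blocks[k:])
order_text = sorted(range(0, len(T), 3), key=lambda i: T[i:])
```

Comparing lists of triples element by element is the same as comparing the underlying text three characters at a time, which is why the two orders agree.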
DC3 / Skew algorithm – Fix alphabet explosion
• Still does not quite work!
  - Each recursive step cubes 𝜎 by using triples!
  - (Eventually) cannot use linear-time sorting anymore!
• But: Have at most (2/3)𝑛 different triples abc in 𝑇′!
• → Before recursion:
  1. Sort all occurring triples. (using counting sort in 𝑂(𝑛))
  2. Replace them by their rank (in Σ).
• → Maintains 𝜎 ≤ 𝑛 without affecting order of suffixes.
DC3 / Skew algorithm – Step 3. Example
𝑇′ = 𝑇[1..𝑛)^□ $$$ 𝑇[2..𝑛)^□ $$$   (^□: string read as triples of characters, padded with $)

• 𝑇 = hannahbansbananasman$   (𝑇2 = nnahbansbananasman$)
  𝑇′ = ann ahb ans ban ana sma n$$ $$$ nna hba nsb ana nas man $$$

• Occurring triples:
  ann ahb ans ban ana sma n$$ $$$ nna hba nsb nas man

• Sorted triples with ranks:

  Rank     00   01   02   03   04   05   06   07   08   09   10   11   12
  Triple   $$$  ahb  ana  ann  ans  ban  hba  man  n$$  nas  nna  nsb  sma

• Replacing each triple by its rank:
  𝑇′ = ann ahb ans ban ana sma n$$ $$$ nna hba nsb ana nas man $$$
  𝑇″ =  03  01  04  05  02  12  08  00  10  06  11  02  09  07  00
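The construction of 𝑇′ and its rank-renamed version 𝑇″ can be reproduced with a few lines of Python. The function names are hypothetical, the input is assumed to end in exactly one `$`, and `sorted` over the set of triples is a stand-in for the counting sort mentioned in the slides.

```python
def to_triples(s):
    """Split s into blocks of 3 characters, padding the last block with '$'."""
    s += "$" * (-len(s) % 3)
    return [s[i:i + 3] for i in range(0, len(s), 3)]


def recursion_string(T):
    """Return T' as a list of triples and T'' as the list of triple ranks."""
    body = T[:-1]  # T[0..n), i.e. the text without the final '$'
    t_prime = to_triples(body[1:]) + ["$$$"] + to_triples(body[2:]) + ["$$$"]
    # Rank the distinct triples; '$' sorts before all lowercase letters.
    rank = {t: r for r, t in enumerate(sorted(set(t_prime)))}
    return t_prime, [rank[t] for t in t_prime]


t_prime, t_double_prime = recursion_string("hannahbansbananasman$")
```

Because the ranks preserve the order of the triples, suffixes of `t_double_prime` compare exactly like the corresponding suffixes of `t_prime`, while the alphabet size stays bounded by the length of the string.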
Suffix array – Discussion
+ sleek data structure compared to suffix tree
+ simple and fast 𝑂(𝑛 log 𝑛) construction
+ more involved but fast 𝑂(𝑛) construction
+ supports efficient string matching
− string matching takes 𝑂(𝑚 log 𝑛), not optimal 𝑂(𝑚)
− cannot use more advanced suffix tree features, e.g., for longest repeated substrings
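The 𝑂(𝑚 log 𝑛) bound for string matching comes from binary search over the suffix array, where each comparison inspects at most 𝑚 characters. A minimal sketch follows; the function name `sa_search` is hypothetical and the suffix array is built by naive sorting only to keep the example self-contained.

```python
def sa_search(T, L, P):
    """Return all start positions of pattern P in T, using suffix array L."""
    m = len(P)

    # First rank whose length-m prefix is >= P.
    lo, hi = 0, len(L)
    while lo < hi:
        mid = (lo + hi) // 2
        if T[L[mid]:L[mid] + m] < P:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    # First rank whose length-m prefix is > P.
    hi = len(L)
    while lo < hi:
        mid = (lo + hi) // 2
        if T[L[mid]:L[mid] + m] <= P:
            lo = mid + 1
        else:
            hi = mid

    # All ranks in [first, lo) are suffixes that start with P.
    return sorted(L[first:lo])


T = "bananaban$"
L = sorted(range(len(T)), key=lambda i: T[i:])  # naive construction
hits_an = sa_search(T, L, "an")
hits_ban = sa_search(T, L, "ban")
```

Both binary searches take 𝑂(log 𝑛) iterations with an 𝑂(𝑚) prefix comparison each; reporting the occurrences adds time proportional to their number.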
6.7 The LCP Array
Clicker Question

Which feature of suffix trees did we use to find the length of a longest repeated substring?

A order of leaves
B path label of internal nodes
C string depth of internal nodes ✓
D constant-time traversal to child nodes
E constant-time traversal to parent nodes
F constant-time traversal to leftmost leaf in subtree

pingo.upb.de/622222
String depths of internal nodes
• Recall algorithm for longest repeated substring in suffix tree
  1. Compute string depth of nodes
  2. Find path label to node with maximal string depth

[Figure: suffix tree of 𝑇 = bananaban$, leaves labeled with suffix start indices]

• Can we do this using suffix arrays?
• Yes, by enhancing the suffix array with the LCP array! LCP[1..𝑛]
  LCP[𝑟] = LCP(𝑇𝐿[𝑟], 𝑇𝐿[𝑟−1])
  length of the longest common prefix of the suffixes of rank 𝑟 and 𝑟 − 1
• → longest repeated substring = find maximum in LCP[1..𝑛]
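As a sketch of this definition, the LCP array can be computed naively by comparing neighbouring suffixes character by character, and a longest repeated substring read off at a maximal entry. All function names here are hypothetical, and this naive version can take quadratic time.

```python
def lcp_len(u, v):
    """Length of the longest common prefix of strings u and v."""
    k = 0
    while k < len(u) and k < len(v) and u[k] == v[k]:
        k += 1
    return k


def lcp_array_naive(T, L):
    """LCP[r] for r = 1..n; index 0 is unused and set to None."""
    return [None] + [lcp_len(T[L[r]:], T[L[r - 1]:]) for r in range(1, len(L))]


def longest_repeated_substring(T, L, LCP):
    """Path label belonging to a maximal LCP entry."""
    r = max(range(1, len(L)), key=lambda r: LCP[r])
    return T[L[r]:L[r] + LCP[r]]


T = "bananaban$"
L = sorted(range(len(T)), key=lambda i: T[i:])  # naive suffix array
LCP = lcp_array_naive(T, L)
lrs = longest_repeated_substring(T, L, LCP)
```

For this text the maximum LCP value is 3 and is attained twice, so there are two longest repeated substrings of length 3 and the sketch returns one of them.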
LCP array and internal nodes
𝑇 = bananaban$

  𝑟    𝐿[𝑟]   suffix 𝑇𝐿[𝑟]   LCP[𝑟]   common prefix with rank 𝑟 − 1
  0    9      $
  1    5      aban$          0        𝜺
  2    7      an$            1        a
  3    3      anaban$        2        an
  4    1      ananaban$      3        ana
  5    6      ban$           0        𝜺
  6    0      bananaban$     3        ban
  7    8      n$             0        𝜺
  8    4      naban$         1        n
  9    2      nanaban$       2        na

[Figure: suffix tree of 𝑇 drawn next to the table; each internal node corresponds to an LCP-interval, a maximal range of ranks whose suffixes share the node's path label (𝜺, a, an, ana, ban, n, na)]

• → Leaf array 𝐿[0..𝑛] plus LCP array LCP[1..𝑛] encode full tree!
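To illustrate how 𝐿 and LCP together determine the internal nodes, the following sketch enumerates the LCP-intervals with a stack in one left-to-right pass. It is an assumption-laden illustration, not the lecture's method: the function name `internal_nodes` is hypothetical, and the arrays are copied from the example.

```python
def internal_nodes(T, L, LCP):
    """Return (string depth, left rank, right rank, path label) per internal node."""
    n = len(L)
    nodes = []
    stack = [(0, 0)]  # open intervals as (string depth, leftmost rank); root first
    for r in range(1, n + 1):
        cur = LCP[r] if r < n else -1  # sentinel closes all open intervals
        left = r - 1
        while stack and stack[-1][0] > cur:
            # The interval on top of the stack ends at rank r - 1.
            depth, left = stack.pop()
            nodes.append((depth, left, r - 1, T[L[left]:L[left] + depth]))
        if cur >= 0 and stack[-1][0] < cur:
            # A deeper interval opens; it starts where the last closed one did.
            stack.append((cur, left))
    return nodes


T = "bananaban$"
L = [9, 5, 7, 3, 1, 6, 0, 8, 4, 2]
LCP = [None, 0, 1, 2, 3, 0, 3, 0, 1, 2]  # LCP[0] is unused
nodes = internal_nodes(T, L, LCP)
labels = sorted(label for _, _, _, label in nodes)
```

Each rank is pushed and popped at most once, so the pass takes linear time; the last interval reported is the root, which has string depth 0 and spans all ranks.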
LCP array construction
• computing LCP[1..𝑛] naively too expensive
  - each value could take Θ(𝑛) time
  - Θ(𝑛²) in total
• but: seeing one large (= costly) LCP value → can find another large one!
• Example: 𝑇 = Buffalo␣buffalo␣buffalo␣buffalo$
  first few suffixes in sorted order:
      𝑇𝐿[0] = $
      𝑇𝐿[1] = alo␣buffalo$
      𝑇𝐿[2] = alo␣buffalo␣buffalo$
      𝑇𝐿[3] = alo␣buffalo␣buffalo␣buffalo$
  common prefix of 𝑇𝐿[2] and 𝑇𝐿[3]: alo␣buffalo␣buffalo → LCP[3] = 19
• Removing first character from 𝑇𝐿[2] and 𝑇𝐿[3] gives two new suffixes:
      𝑇𝐿[?] = lo␣buffalo␣buffalo$
      𝑇𝐿[?] = lo␣buffalo␣buffalo␣buffalo$
  common prefix: lo␣buffalo␣buffalo → LCP[?] = 18 (unclear where. . .)
• Shortened suffixes might not be adjacent in sorted order! → no LCP entry for them!
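The observation can be checked numerically. The sketch below verifies, on the Buffalo example, the standard bound LCP[𝑅[𝑖+1]] ≥ LCP[𝑅[𝑖]] − 1: even when the shortened suffixes are not adjacent, the LCP entry of the shortened suffix is at least one less than the original entry. Names such as `shortening_bound_holds` are hypothetical, an ordinary space replaces ␣, and Python's character order is used, so the concrete ranks may differ from those on the slide.

```python
def lcp_len(u, v):
    """Length of the longest common prefix of strings u and v."""
    k = 0
    while k < len(u) and k < len(v) and u[k] == v[k]:
        k += 1
    return k


T = "Buffalo buffalo buffalo buffalo$"
L = sorted(range(len(T)), key=lambda i: T[i:])  # naive suffix array
R = [0] * len(T)                                # inverse suffix array
for r, i in enumerate(L):
    R[i] = r
LCP = [None] + [lcp_len(T[L[r]:], T[L[r - 1]:]) for r in range(1, len(L))]


def shortening_bound_holds(R, LCP):
    """Check LCP[R[i+1]] >= LCP[R[i]] - 1 wherever both entries exist."""
    for i in range(len(R) - 1):
        if R[i] >= 1 and R[i + 1] >= 1:
            if LCP[R[i + 1]] < LCP[R[i]] - 1:
                return False
    return True


# The two suffixes from the example start at positions 12 and 4.
original = lcp_len(T[12:], T[4:])
shortened = lcp_len(T[13:], T[5:])
```

Here `original` and `shortened` are the common prefix lengths before and after dropping the first character of both suffixes.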
Kasai’s algorithm – Example
• Kasai et al. used the above observation systematically