Notes 06 Text Indexing PDF


6 Text Indexing – Searching whole genomes

Sebastian Wild
9 March 2020

version 2020-03-17 14:43


Outline

6 Text Indexing
6.1 Motivation
6.2 Suffix Trees
6.3 Applications
6.4 Longest Common Extensions
6.5 Suffix Arrays
6.6 Linear-Time Suffix Sorting
6.7 The LCP Array
6.1 Motivation
Inverted indices (same as “indexes”)

• original indices in books: list of (key) words ↦ page numbers where they occur
• assumption: searches are only for whole (key) words
  ⇒ often reasonable for natural language text

Inverted index:
• collect all words in 𝑇
  • can be as simple as splitting 𝑇 at whitespace
  • actual implementations typically support stemming of words (goes → go, cats → cat)
• store mapping from words to a list of occurrences ⇒ how?
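The mapping described above can be sketched with a plain dictionary. This is only an illustration: whitespace tokenization is taken from the slide, while `toy_stem` is a deliberately crude, hypothetical stand-in for a real stemmer, and both function names are assumptions.

```python
from collections import defaultdict

def toy_stem(word):
    """Crude stand-in for real stemming (assumption, for illustration only)."""
    word = word.lower().strip(".,;:!?")
    if word.endswith("es") and len(word) > 3:
        return word[:-2]          # goes -> go
    if word.endswith("s") and len(word) > 2:
        return word[:-1]          # cats -> cat
    return word

def build_inverted_index(text):
    """Map each stemmed word to the list of word positions where it occurs."""
    index = defaultdict(list)
    for position, word in enumerate(text.split()):
        index[toy_stem(word)].append(position)
    return index

index = build_inverted_index("the cat goes where the cats go")
print(index["cat"])   # word positions of cat / cats
print(index["go"])    # word positions of goes / go
```

A dictionary answers whole-word queries only, which is exactly the limitation the following slides address.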
Clicker Question

Do you know what a trie is?

A  A what? No!
B  I have heard the term, but don’t quite remember.
C  I remember hearing about it in a module.
D  Sure.

pingo.upb.de/622222
Tries

• efficient dictionary data structure for strings
• name from retrieval, but pronounced “try”
• tree based on symbol comparisons
• Assumption: stored strings are prefix-free (no string is a prefix of another), which holds if
  • all strings have the same length, or
  • strings have an “end-of-string” marker $ (some character ∉ Σ)
• Example: {aa$, aaab$, abaab$, abb$, abbab$, bba$, bbab$, bbb$}

[Figure: trie storing the example set; edges are labelled with single characters and each leaf corresponds to one stored string.]
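A minimal sketch of such a trie, assuming children are kept in a dictionary per node and that `$` does not occur in the stored strings (the class and function names are illustrative, not from the notes):

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> TrieNode
        self.is_leaf = False  # True exactly at the node reached by '$'

def trie_insert(root, word):
    """Insert word + '$'; the end marker makes the stored set prefix-free."""
    node = root
    for ch in word + "$":
        node = node.children.setdefault(ch, TrieNode())
    node.is_leaf = True

def trie_contains(root, word):
    """Follow one edge per character; fail as soon as an edge is missing."""
    node = root
    for ch in word + "$":
        node = node.children.get(ch)
        if node is None:
            return False
    return node.is_leaf

root = TrieNode()
for w in ["aa", "aaab", "abaab", "abb", "abbab", "bba", "bbab", "bbb"]:
    trie_insert(root, w)

print(trie_contains(root, "abb"))   # stored string
print(trie_contains(root, "ab"))    # only a prefix of stored strings
```

A lookup visits at most one node per character of the query, independent of how many strings are stored.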
Trie construction (correct version)
Note: NOT our standard tries, but the version with compacted paths to leaves (see next page).
Clicker Question

Suppose we have a trie that stores 𝑛 strings over Σ = {A, …, Z}.
Each stored string consists of 𝑚 characters.
We now search for a query string 𝑄 with |𝑄| = 𝑞.
How many nodes in the trie are visited during this query?

A  Θ(log 𝑛)         F  Θ(log 𝑚)
B  Θ(log(𝑛𝑚))       G  Θ(𝑞) ✓
C  Θ(𝑚 · log 𝑛)     H  Θ(log 𝑞)
D  Θ(𝑚 + log 𝑛)     I  Θ(𝑞 · log 𝑛)
E  Θ(𝑚)             J  Θ(𝑞 + log 𝑛)
Clicker Question

Suppose we have a trie that stores 𝑛 strings over Σ = {A, …, Z}.
Each stored string consists of 𝑚 characters.
How many nodes does the trie have in total in the worst case?

A  Θ(𝑛)          D  Θ(𝑛 log 𝑚)
B  Θ(𝑛 + 𝑚)      E  Θ(𝑚)
C  Θ(𝑛 · 𝑚) ✓    F  Θ(𝑚 log 𝑛)
Compact tries

• compress paths of unary nodes (= 1 child) into single edge
• nodes store index of next character

[Figure: standard trie (left) and compact trie (right) for the example set {aa$, aaab$, abaab$, abb$, abbab$, bba$, bbab$, bbb$}; internal nodes of the compact trie are labelled with the index of the next character to inspect (0 at the root, then 1, 2, 3).]

⇒ searching slightly trickier, but same time complexity as in trie
• all internal nodes have ≥ 2 children ⇒ #internal nodes < #leaves = #strings ⇒ linear space
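The path compression can be sketched as follows. Note the assumption: for readability this sketch stores explicit edge labels, whereas the slide stores only the index of the next character in each node; the nested-dictionary representation and the function names are illustrative.

```python
def build_trie(words):
    """Standard trie as nested dicts: character -> child."""
    root = {}
    for w in words:
        node = root
        for ch in w + "$":
            node = node.setdefault(ch, {})
    return root

def compact(node):
    """Compact trie as nested dicts: edge label (string) -> child."""
    result = {}
    for ch, child in node.items():
        label = ch
        # Merge a chain of unary nodes into one edge label.
        while len(child) == 1:
            (next_ch, next_child), = child.items()
            label += next_ch
            child = next_child
        result[label] = compact(child)
    return result

def count_nodes(node):
    return 1 + sum(count_nodes(child) for child in node.values())

words = ["aa", "aaab", "abaab", "abb", "abbab", "bba", "bbab", "bbb"]
trie = build_trie(words)
compact_trie = compact(trie)
print(count_nodes(trie), count_nodes(compact_trie))
```

For the example set, the compact trie keeps the 8 leaves plus 7 branching nodes, which illustrates the linear-space bound.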
Tries as inverted index

+ simple
+ fast lookup
− cannot handle more general queries:
  • search part of a word
  • search phrase (sequence of words)
− what if the ‘text’ does not even have words to begin with?!
  • biological sequences
ACAAGATGCCATTGTCCCCCGGCCTCCTGCTGCTGCTGCTCTCCGGGGCCACGGCCACCGCTGCCCTGCCCCTGGAGGGTGGCCCCACCGGC
CGAGACAGCGAGCATATGCAGGAAGCGGCAGGAATAAGGAAAAGCAGCCTCCTGACTTTCCTCGCTTGGTGGTTTGAGTGGACCTCCCAGGC
CAGTGCCGGGCCCCTCATAGGAGAGGAAGCTCGGGAGGTGGCCAGGCGGCAGGAAGGCGCACCCCCCCAGCAATCCGCGCGCCGGGACAGAA
TGCCCTGCAGGAACTTCTTCTGGAAGACCTTCTCCTCCTGCAAATAAAACCTCACCCATGAATGCTCACGCAAGTTTAATTACAGACCTGAA
  • binary streams
00000010101001111010111000001111100011111011111001101101000011100010011011110000010001101010
01101100001101011010000000100000000111010110000010000111101011101100100011001011011101111111
110001010001011001010000001110101010011000000001101100001100111110000101 0101011101111000011
10101110010010101010100000111110100110000001111001101010000000100100100000101100011000110111

⇒ need new ideas
6.2 Suffix Trees
Suffix trees – A ‘magic’ data structure

Appetizer: Longest common substring problem
• Given: strings 𝑆1, …, 𝑆𝑘        Example: 𝑆1 = superiorcalifornialives, 𝑆2 = sealiver
• Goal: find the longest substring that occurs in all 𝑘 strings   ⇒ alive

Can we do this in time 𝑂(|𝑆1| + · · · + |𝑆𝑘|)? How??

Enter: suffix trees
• versatile data structure for index with full-text search
• linear time (for construction) and linear space
• allows efficient solutions for many advanced string problems

“Although the longest common substring problem looks trivial now, given our knowledge of suffix trees,
it is very interesting to note that in 1970 Don Knuth conjectured that
a linear-time algorithm for this problem would be impossible.”
[Gusfield: Algorithms on Strings, Trees, and Sequences (1997)]
Suffix trees – Definition

• suffix tree 𝒯 for text 𝑇 = 𝑇[0..𝑛 − 1] = compact trie of all suffixes of 𝑇$ (set 𝑇[𝑛] ≔ $)
• except: in leaves, store start index (instead of actual string)
• also: edge labels like in compact trie
  (the slides use the more readable form with explicit labels to explain algorithms)

Example: 𝑇 = bananaban$

  𝑖      0 1 2 3 4 5 6 7 8 9
  𝑇[𝑖]   b a n a n a b a n $

suffixes: {bananaban$, ananaban$, nanaban$, anaban$, naban$, aban$, ban$, an$, n$, $}

[Figure: suffix tree of 𝑇 = bananaban$; the leaves, from left to right, store the start indices 9, 5, 7, 3, 1, 6, 0, 8, 4, 2.]
Suffix trees – Construction

• 𝑇[0..𝑛 − 1] has 𝑛 + 1 suffixes (starting at character 𝑖 ∈ [0..𝑛])
• We can build the suffix tree by inserting each suffix of 𝑇 into a compressed trie.
  But that takes time Θ(𝑛²). ⇒ not interesting!

Amazing result: Can construct the suffix tree of 𝑇 in Θ(𝑛) time! (same order of growth as reading the text!)
• algorithms are a bit tricky to understand
• but were a theoretical breakthrough
• and they are efficient in practice (and heavily used)!

⇒ for now, take linear-time construction for granted. What can we do with them?
6.3 Applications
Applications of suffix trees

• In this section, always assume suffix tree 𝒯 for 𝑇 given.
• Recall: 𝒯 is stored in the compact form (internal nodes store the index of the next character, leaves store start indices), but think about it with full edge labels.

[Figure: both representations of the suffix tree of 𝑇 = bananaban$.]

• Moreover: assume internal nodes store pointer to leftmost leaf in subtree.
• Notation: 𝑇𝑖 = 𝑇[𝑖..𝑛] (including $)
Application 1: Text Indexing / String Matching

• 𝑃 occurs in 𝑇 ⟺ 𝑃 is a prefix of a suffix of 𝑇
• we have all suffixes in 𝒯!
⇒ (try to) follow path with label 𝑃, until
  1. we get stuck
     at internal node (no child with next character of 𝑃)
     or inside edge (mismatch of next characters)
     ⇒ 𝑃 does not occur in 𝑇
  2. we run out of pattern
     reach end of 𝑃 at internal node 𝑣 or inside edge towards 𝑣
     ⇒ 𝑃 occurs at all leaves in subtree of 𝑣
  3. we run out of tree
     reach a leaf ℓ with part of 𝑃 left ⇒ compare 𝑃 to ℓ.
     This cannot happen when testing edge labels since $ ∉ Σ,
     but needs check(s) in compact trie implementation!
• Finding first match (or NO_MATCH) takes 𝑂(|𝑃|) time!

Examples (for 𝑇 = bananaban$): 𝑃 = ann, 𝑃 = ana, 𝑃 = briar

[Figure: suffix tree of 𝑇 = bananaban$.]
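The search procedure can be tried out on a naive stand-in for the suffix tree. The sketch below is an assumption-laden simplification: it builds an uncompacted suffix trie in Θ(𝑛²) time and space and stores, at every node, the start indices of all suffixes passing through it, so it shows the logic of cases 1 and 2 but not the linear-space structure from the slides. Function names are illustrative.

```python
def build_suffix_trie(text):
    """Naive quadratic construction: insert every suffix of text + '$'."""
    text += "$"
    root = {"children": {}, "starts": []}
    for i in range(len(text)):
        node = root
        for ch in text[i:]:
            node = node["children"].setdefault(ch, {"children": {}, "starts": []})
            node["starts"].append(i)   # suffix i passes through this node
    return root

def occurrences(root, pattern):
    """Start positions of a non-empty pattern, in increasing order."""
    node = root
    for ch in pattern:
        node = node["children"].get(ch)
        if node is None:               # case 1: stuck, pattern does not occur
            return []
    return sorted(node["starts"])      # case 2: all leaves below the reached node

trie = build_suffix_trie("bananaban")
for p in ["ann", "ana", "briar"]:
    print(p, occurrences(trie, p))
```

For the three example patterns, `ana` is found at positions 1 and 3, while `ann` and `briar` get stuck and are reported as absent.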
Application 2: Longest repeated substring

• Goal: Find longest substring 𝑇[𝑖..𝑖 + ℓ) that occurs also at 𝑗 ≠ 𝑖: 𝑇[𝑗..𝑗 + ℓ) = 𝑇[𝑖..𝑖 + ℓ).
  (e.g. for compression → Unit 7)

How can we efficiently check all possible substrings?

Repeated substrings = shared paths in suffix tree
• 𝑇5 = aban$ and 𝑇7 = an$ have longest common prefix ‘a’
  ⇒ ∃ internal node with path label ‘a’ (here single edge, can be longer path)
⇒ longest repeated substring = longest common prefix (LCP) of two suffixes (actually: adjacent leaves)

• Algorithm:
  1. Compute string depth (= length of path label) of nodes
  2. Find internal nodes with maximal string depth
• Both can be done in depth-first traversal ⇒ Θ(𝑛) time

[Figure: suffix tree of 𝑇 = bananaban$.]
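The remark about adjacent leaves suggests a simple check without any tree: sort the suffixes and take the longest common prefix of neighbours. The sketch below does exactly that; it is a quadratic-time illustration under that assumption, not the Θ(𝑛) traversal from the slide, and the function names are made up.

```python
def lcp_length(a, b):
    """Length of the longest common prefix of strings a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k

def longest_repeated_substring(text):
    text += "$"
    # Sorted suffixes correspond to the leaves of the suffix tree, left to right.
    suffixes = sorted(text[i:] for i in range(len(text)))
    best = ""
    for prev, cur in zip(suffixes, suffixes[1:]):
        k = lcp_length(prev, cur)
        if k > len(best):              # keep the first maximum found
            best = cur[:k]
    return best

print(longest_repeated_substring("bananaban"))
```

For bananaban there are two repeated substrings of maximal length 3 (ana and ban); this sketch reports the lexicographically first one.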
Generalized suffix trees

• longest repeated substring (of one string) feels very similar to
  longest common substring of several strings 𝑇^(1), …, 𝑇^(𝑘) with 𝑇^(𝑗) ∈ Σ^(𝑛_𝑗)
• can we solve that in the same way?
• could build the suffix tree for each 𝑇^(𝑗) … but doesn’t seem to help
⇒ need a single/joint suffix tree for several texts

Enter: generalized suffix tree
• Define 𝑇 := 𝑇^(1) $1 𝑇^(2) $2 · · · 𝑇^(𝑘) $𝑘 for 𝑘 new end-of-word symbols
• Construct suffix tree 𝒯 for 𝑇
⇒ $𝑗-edges always lead to leaves ⇒ ∃ leaf (𝑗, 𝑖) for each suffix 𝑇_𝑖^(𝑗) = 𝑇^(𝑗)[𝑖..𝑛_𝑗]
Application 3: Longest common substring

• With that new idea, we can find longest common substrings:
  1. Compute generalized suffix tree 𝒯.
  2. Store with each node the subset of strings that contain its path label:
     2.1. Traverse 𝒯 bottom-up.
     2.2. For a leaf (𝑗, 𝑖), the subset is {𝑗}.
     2.3. For an internal node, the subset is the union of the subsets of its children.
  3. In top-down traversal, compute string depths of nodes. (as above)
  4. Report deepest node (by string depth) whose subset is {1, …, 𝑘}.
• Each step takes time Θ(𝑛) for 𝑛 = 𝑛1 + · · · + 𝑛𝑘 the total length of all texts.

“Although the longest common substring problem looks trivial now, given our knowledge of suffix trees,
it is very interesting to note that in 1970 Don Knuth conjectured that
a linear-time algorithm for this problem would be impossible.”
[Gusfield: Algorithms on Strings, Trees, and Sequences (1997)]
Longest common substring – Example

𝑇^(1) = bcabcac, 𝑇^(2) = aabca, 𝑇^(3) = bcaa

[Figure: generalized suffix tree of 𝑇^(1) $1 𝑇^(2) $2 𝑇^(3) $3; every node is annotated with the subset of {1, 2, 3} whose strings contain its path label, and internal nodes additionally with their string depth.]
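The four steps can be imitated on a naive generalized suffix trie. This is a sketch under simplifying assumptions: the trie is uncompacted (quadratic time and space), and each node's subset is kept as a bitmask that is filled in while inserting suffixes, which yields the same subsets as the bottom-up union but needs no end markers $𝑗. All names are illustrative.

```python
def longest_common_substring(strings):
    full = (1 << len(strings)) - 1           # bitmask for {1, ..., k}
    root = {"children": {}, "mask": 0}
    # Steps 1 and 2: insert all suffixes of all strings; a node's mask records
    # which strings contain its path label.
    for j, s in enumerate(strings):
        for i in range(len(s)):
            node = root
            for ch in s[i:]:
                node = node["children"].setdefault(ch, {"children": {}, "mask": 0})
                node["mask"] |= 1 << j
    # Steps 3 and 4: traverse only nodes shared by all strings and keep the
    # deepest path label (string depth = length of the label).
    best = ""
    stack = [(root, "")]
    while stack:
        node, label = stack.pop()
        if len(label) > len(best):
            best = label
        for ch, child in node["children"].items():
            if child["mask"] == full:
                stack.append((child, label + ch))
    return best

print(longest_common_substring(["bcabcac", "aabca", "bcaa"]))
print(longest_common_substring(["superiorcalifornialives", "sealiver"]))
```

On the example above the deepest node shared by all three strings has path label bca; on the appetizer strings the result is alive.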
6.4 Longest Common Extensions
Application 4: Longest Common Extensions

• We implicitly used a special case of a more general, versatile idea:
  The longest common extension (LCE) data structure:
  • Given: String 𝑇[0..𝑛 − 1]
  • Goal: Answer LCE queries, i.e., given positions 𝑖, 𝑗 in 𝑇, how far can we read the same text from there?
    formally: LCE(𝑖, 𝑗) = max{ℓ : 𝑇[𝑖..𝑖 + ℓ) = 𝑇[𝑗..𝑗 + ℓ)}
⇒ use suffix tree of 𝑇! (longest common prefix of 𝑖th and 𝑗th suffix)
• In 𝒯: LCE(𝑖, 𝑗) = LCP(𝑇𝑖, 𝑇𝑗) (same thing, different name!)
  = string depth of lowest common ancestor (LCA) of leaves 𝑖 and 𝑗
• in short: LCE(𝑖, 𝑗) = LCP(𝑇𝑖, 𝑇𝑗) = stringDepth(LCA(𝑖, 𝑗))

[Figure: suffix tree of 𝑇 = bananaban$.]
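As a reference point for the definition, LCE can be computed by direct scanning. The sketch below is this naive baseline (time proportional to the answer per query), not the constant-time suffix-tree solution; the function name is an assumption.

```python
def lce_naive(text, i, j):
    """Longest common extension of positions i and j by direct comparison."""
    length = 0
    while (i + length < len(text) and j + length < len(text)
           and text[i + length] == text[j + length]):
        length += 1
    return length

T = "bananaban$"
print(lce_naive(T, 1, 3))   # ananaban$ vs anaban$ share "ana"
print(lce_naive(T, 5, 7))   # aban$ vs an$ share "a"
```

Any faster LCE structure must return the same values, so a scan like this is handy for testing one.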
Efficient LCA

How to find lowest common ancestors?
• Could walk up the tree to find LCA ⇒ Θ(𝑛) worst case
• Could store all LCAs in big table ⇒ Θ(𝑛²) space and preprocessing

Amazing result: Can compute data structure in Θ(𝑛) time and space
that finds any LCA in constant(!) time.
• a bit tricky to understand
• but a theoretical breakthrough
• and useful in practice
and suffix tree construction inside . . .

⇒ for now, use 𝑂(1) LCA as black box.
⇒ After linear preprocessing (time & space), we can find LCEs in 𝑂(1) time.
Application 5: Approximate matching

𝑘-mismatch matching:
• Input: text 𝑇[0..𝑛 − 1], pattern 𝑃[0..𝑚 − 1], 𝑘 ∈ [0..𝑚)
• Output: smallest 𝑖 so that 𝑇[𝑖..𝑖 + 𝑚) and 𝑃 differ in at most 𝑘 characters (“Hamming distance ≤ 𝑘”),
  or NO_MATCH if there is no such 𝑖
⇒ searching with typos

• Assume longest common extensions in 𝑇 $1 𝑃 $2 can be found in 𝑂(1):
  • generalized suffix tree 𝒯 has been built
  • string depths of all internal nodes have been computed
  • constant-time LCA data structure for 𝒯 has been built
Clicker Question

What is the Hamming distance between heart and beard?

Kangaroo Algorithm for approximate matching
procedure kMismatch(𝑇[0..𝑛 − 1], 𝑃[0..𝑚 − 1], 𝑘)
    // build LCE data structure for 𝑇 $1 𝑃 $2; position 𝑝 of 𝑃 is position 𝑛 + 1 + 𝑝 in that string
    for 𝑖 := 0, . . . , 𝑛 − 𝑚 do
        mismatches := 0; 𝑡 := 𝑖; 𝑝 := 0
        while 𝑝 < 𝑚 do
            ℓ := LCE(𝑡, 𝑛 + 1 + 𝑝)   // jump over matching part
            𝑡 := 𝑡 + ℓ; 𝑝 := 𝑝 + ℓ
            if 𝑝 < 𝑚 then            // stopped at a mismatch: count it and skip it
                mismatches := mismatches + 1
                if mismatches > 𝑘 then break
                𝑡 := 𝑡 + 1; 𝑝 := 𝑝 + 1
        if 𝑝 == 𝑚 ∧ mismatches ≤ 𝑘 then
            return 𝑖
    return NO_MATCH

• Analysis: Θ(𝑛 + 𝑚) preprocessing + 𝑂(𝑛 · 𝑘) matching
⇒ very efficient for small 𝑘
• State of the art:
  • 𝑂(𝑛 · (𝑘²/𝑚) · log 𝑘) possible with complicated algorithms
  • extensions for edit distance ≤ 𝑘 possible
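A runnable sketch of the kangaroo idea follows. It makes two assumptions that are not in the slides: the LCE query is answered by a naive scan (so the running time is not 𝑂(𝑛 · 𝑘)), and the two end markers are the characters `\x00` and `\x01`, assumed not to occur in the inputs. `None` plays the role of NO_MATCH.

```python
def k_mismatch(text, pattern, k):
    """Smallest i such that text[i:i+m] and pattern differ in at most k positions."""
    n, m = len(text), len(pattern)
    joined = text + "\x00" + pattern + "\x01"   # T $1 P $2

    def lce(i, j):
        # Naive stand-in for the O(1) LCE data structure.
        length = 0
        while (max(i, j) + length < len(joined)
               and joined[i + length] == joined[j + length]):
            length += 1
        return length

    for i in range(n - m + 1):
        mismatches, t, p = 0, i, 0
        while p < m:
            jump = lce(t, n + 1 + p)            # jump over the matching part
            t += jump
            p += jump
            if p < m:                           # stopped at a mismatch
                mismatches += 1
                if mismatches > k:
                    break
                t += 1                          # skip the mismatching character
                p += 1
        if p == m and mismatches <= k:
            return i
    return None

print(k_mismatch("bananaban", "nab", 0))
print(k_mismatch("bananaban", "bab", 1))
```

Each alignment performs at most 𝑘 + 1 jumps, which is where the 𝑂(𝑛 · 𝑘) bound comes from once `lce` runs in constant time.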
Application 6: Matching with wildcards

• Allow a wildcard character * in the pattern; it stands for an arbitrary (single) character
  Example: 𝑃 = unit*, 𝑇 = in␣unit5␣we␣will
• similar algorithm as for 𝑘-mismatch ⇒ 𝑂(𝑛 · 𝑘 + 𝑚) when 𝑃 has 𝑘 wildcards

∗ ∗ ∗

Many more applications, in particular for problems on biological sequences:
20+ described in Gusfield, Algorithms on Strings, Trees, and Sequences (1997)
Suffix trees – Discussion

• Suffix trees were a threshold invention
  + linear time and space
  + suddenly many questions efficiently solvable in theory
• construction of suffix trees:
  − linear time, but significant overhead
  − construction methods fairly complicated
  − many pointers in tree incur large space overhead
6.5 Suffix Arrays
Clicker Question

Recap: Check all correct statements about suffix tree 𝒯 of 𝑇[0..𝑛).

A  We require 𝑇 to end with $. ✓
B  The size of 𝒯 can be Ω(𝑛²) in the worst case.
C  𝒯 is a standard trie of all suffixes of 𝑇$.
D  𝒯 is a compact trie of all suffixes of 𝑇$. ✓
E  The leaves of 𝒯 store (a copy of) a suffix of 𝑇$.
F  Naive construction of 𝒯 takes Ω(𝑛²) (worst case). ✓
G  𝒯 can be computed in 𝑂(𝑛) time (worst case). ✓
H  𝒯 has 𝑛 leaves.
Putting suffix trees on a diet

• Observation: order of leaves in suffix tree = suffixes lexicographically sorted
• Idea: only store list of leaves 𝐿[0..𝑛]
• Enough to do efficient string matching!
  1. Use binary search for pattern 𝑃
  2. check if 𝑃 is prefix of suffix after found position
• Example: 𝑃 = ana
• 𝐿[0..𝑛] is called suffix array:
  𝐿[𝑟] = (start index of) 𝑟th suffix in sorted order
• using 𝐿, can do string matching with ≤ (lg 𝑛 + 2) · 𝑚 character comparisons

Leaves of the suffix tree of 𝑇 = bananaban$ from left to right:

  𝑟   𝐿[𝑟]   𝑇𝐿[𝑟]
  0   9      $
  1   5      aban$
  2   7      an$
  3   3      anaban$
  4   1      ananaban$
  5   6      ban$
  6   0      bananaban$
  7   8      n$
  8   4      naban$
  9   2      nanaban$
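The two steps (binary search, then prefix check) can be sketched as follows. The suffix array is built here by plain sorting for brevity, and the function names are assumptions; each comparison looks at no more than 𝑚 characters of a suffix.

```python
def suffix_array(text):
    """Start indices of all suffixes of text in sorted order (naive construction)."""
    return sorted(range(len(text)), key=lambda i: text[i:])

def find_occurrences(text, sa, pattern):
    m = len(pattern)
    # Step 1: binary search for the first suffix whose m-character prefix
    # is not smaller than the pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    # Step 2: the pattern occurs exactly at the following suffixes that have
    # it as a prefix; they are consecutive in the suffix array.
    result = []
    while lo < len(sa) and text[sa[lo]:sa[lo] + m] == pattern:
        result.append(sa[lo])
        lo += 1
    return result

T = "bananaban$"
L = suffix_array(T)
print(L)
print(find_occurrences(T, L, "ana"))
```

For 𝑃 = ana the search lands on rank 3, and the two matching suffixes at ranks 3 and 4 give the occurrences 3 and 1.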
Clicker Question

Recap: Check all correct statements about suffix array 𝐿[0..𝑛] and suffix tree 𝒯 of text 𝑇[0..𝑛).

A  𝐿[0..𝑛] lists the start indices of leaves of 𝒯 in left-to-right order. ✓
B  𝑇[𝐿[𝑟]..𝑛] is the path label in 𝒯 to the leaf storing 𝑟.
C  𝑇[𝐿[𝑟]..𝑛] is the path label to the 𝑟th leaf in 𝒯. ✓
D  𝑇𝐿[𝑟] is the 𝑟th smallest suffix of 𝑇 (lexicographic order). ✓
E  In terms of Θ-classes, 𝒯 needs more space than 𝐿.
F  𝐿 (and 𝑇) suffice to solve the text indexing problem. ✓
Suffix arrays – Construction

How to compute 𝐿[0..𝑛]?
• from suffix tree
  • possible with traversal …
  − but we are trying to avoid constructing suffix trees!
• sorting the suffixes of 𝑇 using general purpose sort
  + trivial to code!
  − but: comparing two suffixes can take Θ(𝑛) character comparisons
    ⇒ Θ(𝑛² log 𝑛) time in worst case
• we can do better!
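To illustrate how little code the general-purpose approach needs, here is a minimal sketch (the function name is an assumption). Note that the sort key materializes every suffix as a separate string, so besides the worst-case time from the slide this version also uses Θ(𝑛²) extra space.

```python
def suffix_array_naive(text):
    """Sort the start indices of all suffixes of text + '$' by comparing suffixes."""
    text += "$"
    return sorted(range(len(text)), key=lambda i: text[i:])

print(suffix_array_naive("bananaban"))
```

This relies on `$` being smaller than every letter in the character ordering used by Python string comparison, which holds for ASCII letters.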
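Fat-pivot radix quicksort – Example

Input (left column) and successive partitioning steps (left to right); empty cells are segments not processed further in the shown steps:

  input      | step 1    | step 2    | step 3    | step 4
  she        | by        | by        |           |
  sells      | are       | are       |           |
  seashells  | she       | sells     | sells     | seashells
  by         | sells     | seashells | seashells | sea        …
  the        | seashells | sea       | sea       | seashells
  sea        | sea       | sells     | sells     | sells      …
  shore      | shore     | seashells | seashells | sells
  the        | shells    | she       | she       | she$
  shells     | she       | shore     | shells    | shells     …
  she        | sells     | shells    | she       | she$
  sells      | surely    | she       | shore     |
  are        | seashells | surely    |           |
  surely     | the       | the       | the       | the
  seashells  | the       | the       | the       | the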
Fat-pivot radix quicksort
(details in §5.1 of Sedgewick, Wayne: Algorithms, 4th ed. (2011), Pearson)

• partition based on 𝑑th character only (initially 𝑑 = 0)
  ⇒ 3 segments: smaller, equal, or larger than 𝑑th symbol of pivot
• recurse on smaller and larger with same 𝑑, on equal with 𝑑 + 1
  ⇒ never compare equal prefixes twice
• can show: ∼ 2 ln(2) · 𝑛 lg 𝑛 ≈ 1.39 𝑛 lg 𝑛 character comparisons in expectation (random pivots)
+ simple to code
+ efficient for sorting many lists of strings
• fat-pivot radix quicksort finds suffix array in 𝑂(𝑛 log 𝑛) expected time
  but we can do 𝑂(𝑛) time worst case!
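The partitioning scheme can be sketched as below. This is an illustrative version that sorts a list of Python strings with random pivots; treating the end of a string as a character code of −1 is an assumption of this sketch (it plays the role of the $ marker), and the names are made up.

```python
import random

def radix_quicksort(strings):
    a = list(strings)

    def char_at(s, d):
        # -1 marks "string has ended" and sorts before every real character.
        return ord(s[d]) if d < len(s) else -1

    def sort(lo, hi, d):
        # Sorts a[lo..hi] (inclusive); all these strings agree on their first d characters.
        if hi <= lo:
            return
        pivot = char_at(a[random.randint(lo, hi)], d)
        lt, gt, i = lo, hi, lo
        while i <= gt:                      # three-way (fat-pivot) partition on character d
            c = char_at(a[i], d)
            if c < pivot:
                a[lt], a[i] = a[i], a[lt]
                lt += 1
                i += 1
            elif c > pivot:
                a[i], a[gt] = a[gt], a[i]
                gt -= 1
            else:
                i += 1
        sort(lo, lt - 1, d)                 # smaller: same d
        if pivot >= 0:
            sort(lt, gt, d + 1)             # equal: move on to character d + 1
        sort(gt + 1, hi, d)                 # larger: same d

    sort(0, len(a) - 1, 0)
    return a

words = ["she", "sells", "seashells", "by", "the", "sea", "shore",
         "the", "shells", "she", "sells", "are", "surely", "seashells"]
print(radix_quicksort(words))
```

Applying the same routine to the suffixes of a text (for instance by sorting start indices instead of strings) gives the suffix array construction mentioned on the slide.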
6.6 Linear-Time Suffix Sorting
Inverse suffix array: going left & right

• to understand the fastest algorithm, it is helpful to define the inverse suffix array:
• 𝑅[𝑖] = 𝑟 ⟺ 𝐿[𝑟] = 𝑖   (𝐿 = leaf array)
         ⟺ there are 𝑟 suffixes that come before 𝑇𝑖 in sorted order
         ⟺ 𝑇𝑖 has (0-based) rank 𝑟   ⇒ call 𝑅[0..𝑛] the rank array

  𝑖   𝑅[𝑖]   𝑇𝑖              𝑟   𝐿[𝑟]   𝑇𝐿[𝑟]
  0   6      bananaban$      0   9      $
  1   4      ananaban$       1   5      aban$
  2   9      nanaban$        2   7      an$
  3   3      anaban$         3   3      anaban$
  4   8      naban$          4   1      ananaban$
  5   1      aban$           5   6      ban$
  6   5      ban$            6   0      bananaban$
  7   2      an$             7   8      n$
  8   7      n$              8   4      naban$
  9   0      $               9   2      nanaban$

Sorting the suffixes takes the left table to the right one; 𝑅 points from left to right (e.g. 𝑅[0] = 6) and 𝐿 points from right to left (e.g. 𝐿[8] = 4).
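Since 𝑅 and 𝐿 are inverse permutations of each other, one pass converts either into the other. A minimal sketch (the function name is an assumption):

```python
def inverse_permutation(perm):
    """Return inv with inv[perm[r]] = r for all r."""
    inv = [0] * len(perm)
    for r, i in enumerate(perm):
        inv[i] = r
    return inv

L = [9, 5, 7, 3, 1, 6, 0, 8, 4, 2]   # suffix array of bananaban$
R = inverse_permutation(L)            # rank array
print(R)
```

Applying the function to 𝑅 returns 𝐿 again, so an algorithm may work with whichever array is more convenient.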
Linear-time suffix sorting

DC3 / Skew algorithm
1. Compute rank array 𝑅1,2 for suffixes 𝑇𝑖 starting at 𝑖 ≢ 0 (mod 3) (i.e., 𝑖 not a multiple of 3) recursively.
2. Induce rank array 𝑅0 for suffixes 𝑇0, 𝑇3, 𝑇6, 𝑇9, … from 𝑅1,2.
3. Merge 𝑅1,2 and 𝑅0 using 𝑅1,2.
⇒ rank array 𝑅 for entire input

• We will show that steps 2. and 3. take Θ(𝑛) time
⇒ Total complexity is 𝑛 + (2/3)𝑛 + (2/3)²𝑛 + (2/3)³𝑛 + · · · ≤ 𝑛 · Σ_{𝑖≥0} (2/3)^𝑖 = 3𝑛 = Θ(𝑛)

• Note: 𝐿 can easily be computed from 𝑅 in one pass, and vice versa.
⇒ Can use whichever is more convenient.
DC3 / Skew algorithm – Step 2: Inducing ranks

• Assume: rank array 𝑅1,2 known:
  𝑅1,2[𝑖] = rank of 𝑇𝑖 among 𝑇1, 𝑇2, 𝑇4, 𝑇5, 𝑇7, 𝑇8, …   for 𝑖 = 1, 2, 4, 5, 7, 8, …
  𝑅1,2[𝑖] = undefined                                     for 𝑖 = 0, 3, 6, 9, …
• Task: sort the suffixes 𝑇0, 𝑇3, 𝑇6, 𝑇9, … in linear time (!)
• Suppose we want to compare 𝑇0 and 𝑇3.
  • Characterwise comparisons too expensive
  • but: after removing first character, we obtain 𝑇1 and 𝑇4
  • these two can be compared in constant time by comparing 𝑅1,2[1] and 𝑅1,2[4]!
⇒ 𝑇0 comes before 𝑇3 in lexicographic order
   iff pair (𝑇[0], 𝑅1,2[1]) comes before pair (𝑇[3], 𝑅1,2[4]) in lexicographic order
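The pair comparison can be checked with a short sketch. Two simplifications are assumed here: 𝑅1,2 is obtained by naive sorting instead of the recursive call, and the pairs are sorted with Python's built-in sort rather than a linear-time radix sort. The function names are illustrative.

```python
def naive_rank12(text):
    """Ranks of the suffixes starting at positions i with i % 3 != 0 (naive sorting)."""
    positions = sorted((i for i in range(len(text)) if i % 3 != 0),
                       key=lambda i: text[i:])
    return {i: r for r, i in enumerate(positions)}

def induce_order_mod0(text, rank12):
    """Sorted order of the suffixes starting at multiples of 3, via pairs (T[i], R12[i+1])."""
    positions = range(0, len(text) - 1, 3)
    return sorted(positions, key=lambda i: (text[i], rank12[i + 1]))

T = "hannahbansbananasman" + "$$$"      # three end markers appended
R12 = naive_rank12(T)
order0 = induce_order_mod0(T, R12)
print(order0)                            # position order0[r] has rank r in R0
```

Each key consists of one character and one integer rank, so a two-pass radix sort over these pairs runs in linear time, which is what step 2 requires.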
DC3 / Skew algorithm – Inducing ranks example
𝑇 = hannahbansbananasman$$$ (append 3 $ markers)

𝑇0 h annahbansbananasman$$$
𝑇3 n ahbansbananasman$$$ 𝑇1 annahbansbananasman$$$ 𝑅 1,2 [22] = 0 𝑇22 $
𝑇6 b ansbananasman$$$ 𝑇2 nnahbansbananasman$$$ 𝑅 1,2 [20] = 1 𝑇20 $$$
𝑇9 s bananasman$$$ 𝑇4 ahbansbananasman$$$ 𝑅 1,2 [ 4] = 2 𝑇4 ahbansbananasman$$$
𝑇12 n anasman$$$ 𝑇5 hbansbananasman$$$ 𝑅 1,2 [11] = 3 𝑇11 ananasman$$$
𝑇15 a sman$$$ 𝑇7 ansbananasman$$$ 𝑅 1,2 [13] = 4 𝑇13 anasman$$$
𝑇18 a n$$$ sman$$$ = 𝑇16 𝑇8 nsbananasman$$$ 𝑅 1,2 [ 1] = 5 𝑇1 annahbansbananasman$$$
𝑇21 $$ 𝑇10 bananasman$$$ 𝑅 1,2 [ 7] = 6 𝑇7 ansbananasman$$$
𝑇11 ananasman$$$ 𝑅 1,2 [10] = 7 𝑇10 bananasman$$$
𝑇13 anasman$$$ 𝑅 1,2 [ 5] = 8 𝑇5 hbansbananasman$$$
𝑇14 nasman$$$ 𝑅 1,2 [17] = 9 𝑇17 man$$$
𝑇0 h 05 𝑇16 sman$$$ 𝑅 1,2 [19] = 10 𝑇19 n$$$
𝑇3 n 02 𝑅1,2 [16] = 14 𝑇17 man$$$ 𝑅 1,2 [14] = 11 𝑇14 nasman$$$
𝑇6 b 06 𝑇19 n$$$ 𝑅 1,2 [ 2] = 12 𝑇2 nnahbansbananasman$$$
𝑇9 s 07 𝑇20 $$$ 𝑅 1,2 [ 8] = 13 𝑇8 nsbananasman$$$
𝑇12 n 04 𝑇22 $ 𝑅 1,2 [16] = 14 𝑇16 sman$$$
𝑇15 a 14 𝑅1,2 (known)
𝑇18 a 10
𝑇21 $ 00

𝑇21 $00 � 𝑅0 [21] = 0


𝑇18 a10 � 𝑅0 [18] = 1
radix so �
rt 𝑇15 a14 𝑅0 [15] = 2
𝑇6 b06 � 𝑅0 [ 6] = 3
𝑇0 h05 � 𝑅0 [ 0] = 4
𝑇3 n02 � 𝑅0 [ 3] = 5
𝑇12 n04 � 𝑅0 [12] = 6
𝑇9 s07 � 𝑅0 [ 9] = 7
DC3 / Skew algorithm – Inducing ranks example
𝑇 = hannahbansbananasman$$$ (append 3 $ markers)

𝑇0 h annahbansbananasman$$$
𝑇3 n ahbansbananasman$$$ 𝑇1 annahbansbananasman$$$ 𝑅 1,2 [22] = 0 𝑇22 $
𝑇6 b ansbananasman$$$ 𝑇2 nnahbansbananasman$$$ 𝑅 1,2 [20] = 1 𝑇20 $$$
𝑇9 s bananasman$$$ 𝑇4 ahbansbananasman$$$ 𝑅 1,2 [ 4] = 2 𝑇4 ahbansbananasman$$$
𝑇12 n anasman$$$ 𝑇5 hbansbananasman$$$ 𝑅 1,2 [11] = 3 𝑇11 ananasman$$$
𝑇15 a sman$$$ 𝑇7 ansbananasman$$$ 𝑅 1,2 [13] = 4 𝑇13 anasman$$$
𝑇18 a n$$$ sman$$$ = 𝑇16 𝑇8 nsbananasman$$$ 𝑅 1,2 [ 1] = 5 𝑇1 annahbansbananasman$$$
𝑇21 $$ 𝑇10 bananasman$$$ 𝑅 1,2 [ 7] = 6 𝑇7 ansbananasman$$$
𝑇11 ananasman$$$ 𝑅 1,2 [10] = 7 𝑇10 bananasman$$$
𝑇13 anasman$$$ 𝑅 1,2 [ 5] = 8 𝑇5 hbansbananasman$$$
𝑇14 nasman$$$ 𝑅 1,2 [17] = 9 𝑇17 man$$$
𝑇0 h 05 𝑇16 sman$$$ 𝑅 1,2 [19] = 10 𝑇19 n$$$
𝑇3 n 02 𝑅1,2 [16] = 14 𝑇17 man$$$ 𝑅 1,2 [14] = 11 𝑇14 nasman$$$
𝑇6 b 06 𝑇19 n$$$ 𝑅 1,2 [ 2] = 12 𝑇2 nnahbansbananasman$$$
𝑇9 s 07 𝑇20 $$$ 𝑅 1,2 [ 8] = 13 𝑇8 nsbananasman$$$
𝑇12 n 04 𝑇22 $ 𝑅 1,2 [16] = 14 𝑇16 sman$$$
𝑇15 a 14 𝑅1,2 (known)
𝑇18 a 10
𝑇21 $ 00

𝑇21 $00 � 𝑅0 [21] = 0


𝑇18 a10 � 𝑅0 [18] = 1
radix so �
rt 𝑇15 a14 𝑅0 [15] = 2
𝑇6 b06 � 𝑅0 [ 6] = 3
𝑇0 h05 � 𝑅0 [ 0] = 4
𝑇3 n02 � 𝑅0 [ 3] = 5
𝑇12 n04 � 𝑅0 [12] = 6
𝑇9 s07 � 𝑅0 [ 9] = 7
𝑅0
34
DC3 / Skew algorithm – Inducing ranks example
𝑇 = hannahbansbananasman$$$ (append 3 $ markers)

𝑅1,2 (known): ranks of the suffixes 𝑇𝑖 with 𝑖 ≢ 0 (mod 3), listed in sorted order

𝑅1,2[22] = 0    𝑇22 = $
𝑅1,2[20] = 1    𝑇20 = $$$
𝑅1,2[ 4] = 2    𝑇4 = ahbansbananasman$$$
𝑅1,2[11] = 3    𝑇11 = ananasman$$$
𝑅1,2[13] = 4    𝑇13 = anasman$$$
𝑅1,2[ 1] = 5    𝑇1 = annahbansbananasman$$$
𝑅1,2[ 7] = 6    𝑇7 = ansbananasman$$$
𝑅1,2[10] = 7    𝑇10 = bananasman$$$
𝑅1,2[ 5] = 8    𝑇5 = hbansbananasman$$$
𝑅1,2[17] = 9    𝑇17 = man$$$
𝑅1,2[19] = 10   𝑇19 = n$$$
𝑅1,2[14] = 11   𝑇14 = nasman$$$
𝑅1,2[ 2] = 12   𝑇2 = nnahbansbananasman$$$
𝑅1,2[ 8] = 13   𝑇8 = nsbananasman$$$
𝑅1,2[16] = 14   𝑇16 = sman$$$

Suffixes 𝑇𝑖 with 𝑖 ≡ 0 (mod 3), split into first character and rest, and the resulting pair (𝑇[𝑖], 𝑅1,2[𝑖+1]):

𝑇0 = h annahbansbananasman$$$    h 05
𝑇3 = n ahbansbananasman$$$       n 02
𝑇6 = b ansbananasman$$$          b 06
𝑇9 = s bananasman$$$             s 07
𝑇12 = n anasman$$$               n 04
𝑇15 = a sman$$$                  a 14    (sman$$$ = 𝑇16, 𝑅1,2[16] = 14)
𝑇18 = a n$$$                     a 10
𝑇21 = $ $                        $ 00

Radix sort of the pairs gives 𝑅0:

𝑇21 $00 → 𝑅0[21] = 0
𝑇18 a10 → 𝑅0[18] = 1
𝑇15 a14 → 𝑅0[15] = 2
𝑇6 b06 → 𝑅0[ 6] = 3
𝑇0 h05 → 𝑅0[ 0] = 4
𝑇3 n02 → 𝑅0[ 3] = 5
𝑇12 n04 → 𝑅0[12] = 6
𝑇9 s07 → 𝑅0[ 9] = 7

• sorting of pairs doable in 𝑂(𝑛) time by 2 iterations of counting sort
→ Obtain 𝑅0 in 𝑂(𝑛) time
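The inducing step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `induce_r0` is hypothetical, `R1,2` is passed as a dictionary from text position to rank, and the built-in `sorted` stands in for the two counting-sort passes that give the 𝑂(𝑛) bound.

```python
def induce_r0(text, r12):
    """Rank the suffixes T_i with i % 3 == 0 via the pairs (T[i], R12[i+1])."""
    starts = [i for i in range(0, len(text), 3) if i + 1 in r12]
    # sorted() is a stand-in here; a radix sort with two counting-sort
    # passes would sort the same pairs in linear time.
    order = sorted(starts, key=lambda i: (text[i], r12[i + 1]))
    return {i: rank for rank, i in enumerate(order)}


text = "hannahbansbananasman$$$"
# R12 as given in the example: position -> rank among suffixes with i % 3 != 0
r12 = {22: 0, 20: 1, 4: 2, 11: 3, 13: 4, 1: 5, 7: 6, 10: 7,
       5: 8, 17: 9, 19: 10, 14: 11, 2: 12, 8: 13, 16: 14}
r0 = induce_r0(text, r12)
print(r0)
```

The sketch relies on `$` comparing smaller than all letters, which holds for Python's default string order.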
DC3 / Skew algorithm – Step 3: Merging
• Have:
  • sorted 1,2-list (suffixes 𝑇1, 𝑇2, 𝑇4, 𝑇5, 𝑇7, 𝑇8, 𝑇10, 𝑇11, . . .):
    𝑇22 $
    𝑇20 $$$
    𝑇4 ahbansbananasman$$$
    𝑇11 ananasman$$$
    𝑇13 anasman$$$
    𝑇1 annahbansbananasman$$$
    𝑇7 ansbananasman$$$
    𝑇10 bananasman$$$
    𝑇5 hbansbananasman$$$
    𝑇17 man$$$
    𝑇19 n$$$
    𝑇14 nasman$$$
    𝑇2 nnahbansbananasman$$$
    𝑇8 nsbananasman$$$
    𝑇16 sman$$$
  • sorted 0-list (suffixes 𝑇0, 𝑇3, 𝑇6, 𝑇9, . . .):
    𝑇21 $$
    𝑇18 an$$$
    𝑇15 asman$$$
    𝑇6 bansbananasman$$$
    𝑇0 hannahbansbananasman$$$
    𝑇3 nahbansbananasman$$$
    𝑇12 nanasman$$$
    𝑇9 sbananasman$$$

• Task: Merge them!
  • use standard merging method from Mergesort
  • but speed up comparisons using 𝑅1,2

Merged so far: 𝑇22 $, 𝑇21 $$, 𝑇20 $$$, 𝑇4 ahbansbananasman$$$, 𝑇18 an$$$

Next: Compare 𝑇15 to 𝑇11
Idea: try same trick as before
  𝑇15 = asman$$$ = a𝑇16
  𝑇11 = ananasman$$$ = a𝑇12
First characters are equal, and we can’t compare 𝑇16 and 𝑇12 either, since 𝑇12 has no rank in 𝑅1,2!
→ Compare 𝑇16 to 𝑇12
  𝑇16 = sman$$$ = s𝑇17
  𝑇12 = nanasman$$$ = n𝑇13
always at most 2 steps, then can use 𝑅1,2!
→ 𝑂(𝑛) time for merge
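The merge with rank-based comparisons can be sketched as follows. The names `less_0_vs_12` and `merge` are hypothetical, `R1,2` is again a dictionary from position to rank, and positions past the end of the text are treated as rank −1 (an assumption that mimics the $ padding).

```python
def less_0_vs_12(text, r12, i, j):
    """True iff T_i (i % 3 == 0) sorts before T_j (j % 3 != 0)."""
    rank = lambda k: r12.get(k, -1)               # past the end: smallest
    char = lambda k: text[k] if k < len(text) else ""
    if j % 3 == 1:
        # one step: T_{i+1} and T_{j+1} both have a rank in R12
        return (text[i], rank(i + 1)) < (text[j], rank(j + 1))
    # j % 3 == 2: two steps, then T_{i+2} and T_{j+2} both have a rank
    return ((text[i], char(i + 1), rank(i + 2))
            < (text[j], char(j + 1), rank(j + 2)))


def merge(text, r12, list0, list12):
    """Standard two-way merge; every comparison takes constant time."""
    out, a, b = [], 0, 0
    while a < len(list0) and b < len(list12):
        if less_0_vs_12(text, r12, list0[a], list12[b]):
            out.append(list0[a])
            a += 1
        else:
            out.append(list12[b])
            b += 1
    return out + list0[a:] + list12[b:]


text = "hannahbansbananasman$$$"
list12 = [22, 20, 4, 11, 13, 1, 7, 10, 5, 17, 19, 14, 2, 8, 16]
list0 = [21, 18, 15, 6, 0, 3, 12, 9]
r12 = {pos: rank for rank, pos in enumerate(list12)}
suffix_order = merge(text, r12, list0, list12)
print(suffix_order[:5])
```

The case split mirrors the "at most 2 steps" remark: which shift lands both suffixes in the 1,2-list depends only on 𝑗 mod 3.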
Clicker Question

Recap: Check all correct statements about suffix array 𝐿[0..𝑛], inverse suffix array 𝑅[0..𝑛], and suffix tree T of text 𝑇.

A 𝐿 lists the leaves of T in left-to-right order.
B 𝑅 lists starting indices of suffixes in lexicographic order.
C 𝐿 lists starting indices of suffixes in lexicographic order.
D 𝐿[𝑟] = 𝑖 iff 𝑅[𝑖] = 𝑟
E 𝐿 stands for leaf
F 𝐿 stands for left
G 𝑅 stands for rank
H 𝑅 stands for right

pingo.upb.de/622222
Answers: A ✓, B ✗, C ✓, D ✓, E ✓, F ✗, G ✓, H ✗
DC3 / Skew algorithm – Fix recursive call
• both step 2. and 3. doable in 𝑂(𝑛) time!

• But: we cheated in 1. step! “compute rank array 𝑅1,2 recursively”
  • Taking a subset of suffixes is not an instance of the same problem!

→ Need a single string 𝑇′ to recurse on, from which we can deduce 𝑅1,2.

How can we make 𝑇′ “skip” some suffixes?

• redefine alphabet to be triples of characters abc
  𝑇 = bananaban$$$ → as triples: ban ana ban $$$
• suffixes of the triple string correspond to 𝑇0, 𝑇3, 𝑇6, 𝑇9, . . .
  ban ana ban $$$
  ana ban $$$
  ban $$$
  $$$
• 𝑇′ = (𝑇[1..𝑛) as triples) $$$ (𝑇[2..𝑛) as triples) $$$
  suffixes of 𝑇′ correspond to 𝑇𝑖 with 𝑖 ≢ 0 (mod 3).

→ Can call suffix sorting recursively on 𝑇′ and map result to 𝑅1,2
DC3 / Skew algorithm – Fix alphabet explosion
• Still does not quite work!
  • Each recursive step cubes 𝜎 by using triples!
  → (Eventually) cannot use linear-time sorting anymore!

• But: Have at most (2/3)·𝑛 different triples abc in 𝑇′!

→ Before recursion:
  1. Sort all occurring triples. (using counting sort in 𝑂(𝑛))
  2. Replace them by their rank (in Σ).

→ Maintains 𝜎 ≤ 𝑛 without affecting order of suffixes.
DC3 / Skew algorithm – Step 3. Example
𝑇′ = (𝑇[1..𝑛) as triples) $$$ (𝑇[2..𝑛) as triples) $$$

• 𝑇 = hannahbansbananasman$    𝑇2 = nnahbansbananasman$
  𝑇′ = ann ahb ans ban ana sma n$$ $$$ nna hba nsb ana nas man $$$

• Occurring triples:
  ann ahb ans ban ana sma n$$ $$$ nna hba nsb nas man

• Sorted triples with ranks:

  Rank    00  01  02  03  04  05  06  07  08  09  10  11  12
  Triple  $$$ ahb ana ann ans ban hba man n$$ nas nna nsb sma

• 𝑇′ = ann ahb ans ban ana sma n$$ $$$ nna hba nsb ana nas man $$$
  𝑇″ = 03  01  04  05  02  12  08  00  10  06  11  02  09  07  00
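The construction of the recursion input can be sketched as below. The function name `reduce_to_ranks` is hypothetical, the text is assumed to end in exactly one `$`, and the built-in `sorted` stands in for the linear-time counting sort of the triples.

```python
def reduce_to_ranks(text):
    """Build T' as a list of triples and T'' as the list of their ranks."""
    def triples(start):
        s = text[start:-1]                  # T[start..n), without the final $
        s += "$" * (-len(s) % 3)            # pad with $ to a multiple of 3
        return [s[k:k + 3] for k in range(0, len(s), 3)]

    t_prime = triples(1) + ["$$$"] + triples(2) + ["$$$"]
    # sorted() is a stand-in; counting sort on three characters would
    # rank the at most (2/3) n + O(1) distinct triples in linear time.
    rank = {t: r for r, t in enumerate(sorted(set(t_prime)))}
    return t_prime, [rank[t] for t in t_prime]


t_prime, t_ranks = reduce_to_ranks("hannahbansbananasman$")
print(" ".join(t_prime))
print(t_ranks)
```

Replacing triples by ranks preserves their relative order, so the suffixes of the rank sequence sort exactly like the suffixes of the triple string.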
Suffix array – Discussion
+ sleek data structure compared to suffix tree
+ simple and fast 𝑂(𝑛 log 𝑛) construction
+ more involved but fast 𝑂(𝑛) construction
+ supports efficient string matching

− string matching takes 𝑂(𝑚 log 𝑛), not optimal 𝑂(𝑚)
− cannot use more advanced suffix tree features, e.g., for longest repeated substrings
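The 𝑂(𝑚 log 𝑛) string matching can be sketched as two binary searches over the suffix array, each comparing the pattern against the first 𝑚 characters of a suffix. The function name `find_occurrences` is hypothetical, and the suffix array is built naively with `sorted` only to keep the example self-contained.

```python
def find_occurrences(text, L, pattern):
    """Return all starting positions of pattern, using suffix array L."""
    m = len(pattern)
    prefix = lambda r: text[L[r]:L[r] + m]   # first m characters of the suffix

    lo, hi = 0, len(L)
    while lo < hi:                           # first rank with prefix >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo

    hi = len(L)
    while lo < hi:                           # first rank with prefix > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(L[first:lo])


text = "bananaban$"
L = sorted(range(len(text)), key=lambda i: text[i:])   # naive construction
print(find_occurrences(text, L, "an"))
```

Each of the 𝑂(log 𝑛) comparisons inspects at most 𝑚 characters, which is where the bound comes from.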
6.7 The LCP Array
Clicker Question

Which feature of suffix trees did we use to find the length of a longest repeated substring?

A order of leaves
B path label of internal nodes
C string depth of internal nodes
D constant-time traversal to child nodes
E constant-time traversal to parent nodes
F constant-time traversal to leftmost leaf in subtree

pingo.upb.de/622222
Answer: C (string depth of internal nodes)
String depths of internal nodes
• Recall algorithm for longest repeated substring in suffix tree
  1. Compute string depth of nodes
  2. Find path label to node with maximal string depth

[Figure: suffix tree of 𝑇 = bananaban$, leaves labeled with the starting indices 0..9 of their suffixes]

• Can we do this using suffix arrays?

→ Yes, by enhancing the suffix array with the LCP array!
  LCP[1..𝑛]
  LCP[𝑟] = LCP(𝑇𝐿[𝑟], 𝑇𝐿[𝑟−1])
  length of the longest common prefix of the suffixes of rank 𝑟 and 𝑟 − 1

→ longest repeated substring = find maximum in LCP[1..𝑛]
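A small sketch of this idea: build 𝐿 and LCP naively, then read off a longest repeated substring at the position of the maximum LCP value. All function names are hypothetical, and both constructions are deliberately simple (not linear time).

```python
def suffix_array(text):
    """Naive suffix array: sort starting indices by their suffix."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, L):
    """LCP[r] = length of the common prefix of suffixes of rank r and r-1."""
    def lcp(i, j):
        k = 0
        while i + k < len(text) and j + k < len(text) and text[i + k] == text[j + k]:
            k += 1
        return k
    return [None] + [lcp(L[r - 1], L[r]) for r in range(1, len(L))]


def longest_repeated_substring(text):
    L = suffix_array(text)
    LCP = lcp_array(text, L)
    r = max(range(1, len(L)), key=lambda r: LCP[r])   # first maximum
    return text[L[r]:L[r] + LCP[r]]


text = "bananaban$"
L = suffix_array(text)
LCP = lcp_array(text, L)
print(L, LCP, longest_repeated_substring(text))
```

For this text two repeated substrings of length 3 exist (`ana` and `ban`); the sketch reports the one at the smaller rank.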
LCP array and internal nodes
𝑇 = bananaban$

  𝑟   𝐿[𝑟]   𝑇𝐿[𝑟]         LCP[𝑟]   common prefix with rank 𝑟 − 1 (LCP-interval label)
  0   9      $             –
  1   5      aban$         0        𝜺
  2   7      an$           1        a
  3   3      anaban$       2        an
  4   1      ananaban$     3        ana
  5   6      ban$          0        𝜺
  6   0      bananaban$    3        ban
  7   8      n$            0        𝜺
  8   4      naban$        1        n
  9   2      nanaban$      2        na

[Figure: the suffix tree drawn next to the table; each internal node corresponds to an LCP-interval]

→ Leaf array 𝐿[0..𝑛] plus LCP array LCP[1..𝑛] encode full tree!
LCP array construction
• computing LCP[1..𝑛] naively too expensive
  • each value could take Θ(𝑛) time
  → Θ(𝑛²) in total

• but: seeing one large (= costly) LCP value → can find another large one!

• Example: 𝑇 = Buffalo␣buffalo␣buffalo␣buffalo$
  • three consecutive suffixes in sorted order (at some ranks 𝑟 − 2, 𝑟 − 1, 𝑟):
    𝑇𝐿[𝑟−2] = alo␣buffalo$
    𝑇𝐿[𝑟−1] = alo␣buffalo␣buffalo$
    𝑇𝐿[𝑟] = alo␣buffalo␣buffalo␣buffalo$
    common prefix of the last two: alo␣buffalo␣buffalo → LCP[𝑟] = 19

  • Removing first character from 𝑇𝐿[𝑟−1] and 𝑇𝐿[𝑟] gives two new suffixes:
    𝑇𝐿[?] = lo␣buffalo␣buffalo$
    𝑇𝐿[?] = lo␣buffalo␣buffalo␣buffalo$
    common prefix: lo␣buffalo␣buffalo, of length 18, but unclear where these suffixes are in 𝐿

  • Shortened suffixes might not be adjacent in sorted order!
    → no LCP entry for them!
Kasai’s algorithm – Example
• Kasai et al. used above observation systematically

• Key idea: compute LCP values in text order

• Dropping first character of adjacent suffixes might not lead to adjacent shorter suffixes,
  but the LCP entry of the shortened suffix can only be longer than the part already known to match.

  𝑖   𝑅[𝑖]   𝑇𝑖            |   𝑟   𝐿[𝑟]   𝑇𝐿[𝑟]         LCP[𝑟]
  0   6th    bananaban$    |   0   9      $             –
  1   4th    ananaban$     |   1   5      aban$
  2   9th    nanaban$      |   2   7      an$
  3   3rd    anaban$       |   3   3      anaban$
  4   8th    naban$        |   4   1      ananaban$
  5   1st    aban$         |   5   6      ban$
  6   5th    ban$          |   6   0      bananaban$
  7   2nd    an$           |   7   8      n$
  8   7th    n$            |   8   4      naban$
  9   0th    $             |   9   2      nanaban$
Kasai’s algorithm – Example (steps in text order)
• 𝑖 = 0 (rank 6): compare 𝑇0 = bananaban$ with its predecessor 𝑇6 = ban$:
  b, ba, ban match → LCP[6] = 3
• 𝑖 = 1 (rank 4): dropping the first character leaves “an” known to match;
  compare 𝑇1 = ananaban$ with its predecessor 𝑇3 = anaban$: extend to “ana” → LCP[4] = 3
• 𝑖 = 2 (rank 9): “na” known to match;
  compare 𝑇2 = nanaban$ with its predecessor 𝑇4 = naban$: no further match → LCP[9] = 2
• 𝑖 = 3 (rank 3): “a” known to match;
  compare 𝑇3 = anaban$ with its predecessor 𝑇7 = an$: extend to “an” → LCP[3] = 2
• 𝑖 = 4 (rank 8): continue with “n” known to match

LCP values found so far: LCP[3] = 2, LCP[4] = 3, LCP[6] = 3, LCP[9] = 2
Kasai’s algorithm – Example
� Kasai et al. used above observation systematically

� Key idea: compute LCP values in text order

� Dropping first character of adjacent suffixes might not lead to adjacent shorter suffixes,
but LCP entry can only be longer.

𝑖 𝑅[𝑖] 𝑇𝑖 𝑟 𝐿[𝑟] 𝑇𝐿[𝑟] LCP[𝑟]


0 6th bananaban$ 0 9 $ –
1 4th ananaban$ 1 5 aban$
2 9th nanaban$ 2 7 n
an$
3 3th anaban$ 3 3 n
anaban$ 2
4 8th naban$ 4 1 ananaban$ 3
5 1th aban$ 5 6 ban$
6 5th ban$ 6 0 bananaban$ 3
7 2th an$ 7 8 n
n$
8 7th n$ 8 4 n
naban$
9 0th $ 9 2 nanaban$ 2

45
Kasai’s algorithm – Example
� Kasai et al. used above observation systematically

� Key idea: compute LCP values in text order

� Dropping first character of adjacent suffixes might not lead to adjacent shorter suffixes,
but LCP entry can only be longer.

𝑖 𝑅[𝑖] 𝑇𝑖 𝑟 𝐿[𝑟] 𝑇𝐿[𝑟] LCP[𝑟]


0 6th bananaban$ 0 9 $ –
1 4th ananaban$ 1 5 aban$
2 9th nanaban$ 2 7 n
an$
3 3th anaban$ 3 3 n
anaban$ 2
4 8th naban$ 4 1 ananaban$ 3
5 1th aban$ 5 6 ban$
6 5th ban$ 6 0 bananaban$ 3
7 2th an$ 7 8 n
n$
8 7th n$ 8 4 n
naban$ 1
9 0th $ 9 2 nanaban$ 2

45
Kasai’s algorithm – Example
� Kasai et al. used above observation systematically

� Key idea: compute LCP values in text order

� Dropping first character of adjacent suffixes might not lead to adjacent shorter suffixes,
but LCP entry can only be longer.

𝑖 𝑅[𝑖] 𝑇𝑖 𝑟 𝐿[𝑟] 𝑇𝐿[𝑟] LCP[𝑟]


0 6th bananaban$ 0 9 $ –
1 4th ananaban$ 1 5 aban$
2 9th nanaban$ 2 7 an$
3 3th anaban$ 3 3 anaban$ 2
4 8th naban$ 4 1 ananaban$ 3
5 1th aban$ 5 6 ban$
6 5th ban$ 6 0 bananaban$ 3
7 2th an$ 7 8 n$
8 7th n$ 8 4 naban$ 1
9 0th $ 9 2 nanaban$ 2

45
Kasai’s algorithm – Example
� Kasai et al. used above observation systematically

� Key idea: compute LCP values in text order

� Dropping first character of adjacent suffixes might not lead to adjacent shorter suffixes,
but LCP entry can only be longer.

𝑖 𝑅[𝑖] 𝑇𝑖 𝑟 𝐿[𝑟] 𝑇𝐿[𝑟] LCP[𝑟]


0 6th bananaban$ 0 9 $ –
1 4th ananaban$ 1 5 aban$ 0
2 9th nanaban$ 2 7 an$
3 3th anaban$ 3 3 anaban$ 2
4 8th naban$ 4 1 ananaban$ 3
5 1th aban$ 5 6 ban$
6 5th ban$ 6 0 bananaban$ 3
7 2th an$ 7 8 n$
8 7th n$ 8 4 naban$ 1
9 0th $ 9 2 nanaban$ 2

45
Kasai’s algorithm – Example
� Kasai et al. used above observation systematically

� Key idea: compute LCP values in text order

� Dropping first character of adjacent suffixes might not lead to adjacent shorter suffixes,
but LCP entry can only be longer.

𝑖 𝑅[𝑖] 𝑇𝑖 𝑟 𝐿[𝑟] 𝑇𝐿[𝑟] LCP[𝑟]


0 6th bananaban$ 0 9 $ –
1 4th ananaban$ 1 5 aban$ 0
2 9th nanaban$ 2 7 an$
3 3th anaban$ 3 3 anaban$ 2
4 8th naban$ 4 1 ananaban$ 3
5 1th aban$ 5 6 ban$
6 5th ban$ 6 0 bananaban$ 3
7 2th an$ 7 8 n$
8 7th n$ 8 4 naban$ 1
9 0th $ 9 2 nanaban$ 2

45
Kasai’s algorithm – Example
� Kasai et al. used above observation systematically

� Key idea: compute LCP values in text order

� Dropping first character of adjacent suffixes might not lead to adjacent shorter suffixes,
but LCP entry can only be longer.

𝑖 𝑅[𝑖] 𝑇𝑖 𝑟 𝐿[𝑟] 𝑇𝐿[𝑟] LCP[𝑟]


0 6th bananaban$ 0 9 $ –
1 4th ananaban$ 1 5 aban$ 0
2 9th nanaban$ 2 7 an$
3 3th anaban$ 3 3 anaban$ 2
4 8th naban$ 4 1 ananaban$ 3
5 1th aban$ 5 6 ban$ 0
6 5th ban$ 6 0 bananaban$ 3
7 2th an$ 7 8 n$
8 7th n$ 8 4 naban$ 1
9 0th $ 9 2 nanaban$ 2

45
Kasai’s algorithm – Example
� Kasai et al. used above observation systematically

� Key idea: compute LCP values in text order

� Dropping first character of adjacent suffixes might not lead to adjacent shorter suffixes,
but LCP entry can only be longer.

𝑖 𝑅[𝑖] 𝑇𝑖 𝑟 𝐿[𝑟] 𝑇𝐿[𝑟] LCP[𝑟]


0 6th bananaban$ 0 9 $ –
1 4th ananaban$ 1 5 a
aban$ 0
2 9th nanaban$ 2 7 a
an$
3 3th anaban$ 3 3 anaban$ 2
4 8th naban$ 4 1 ananaban$ 3
5 1th aban$ 5 6 ban$ 0
6 5th ban$ 6 0 bananaban$ 3
7 2th an$ 7 8 n$
8 7th n$ 8 4 naban$ 1
9 0th $ 9 2 nanaban$ 2

45
Kasai’s algorithm – Example
� Kasai et al. used above observation systematically

� Key idea: compute LCP values in text order

� Dropping first character of adjacent suffixes might not lead to adjacent shorter suffixes,
but LCP entry can only be longer.

𝑖 𝑅[𝑖] 𝑇𝑖 𝑟 𝐿[𝑟] 𝑇𝐿[𝑟] LCP[𝑟]


0 6th bananaban$ 0 9 $ –
1 4th ananaban$ 1 5 a
aban$ 0
2 9th nanaban$ 2 7 a
an$ 1
3 3th anaban$ 3 3 anaban$ 2
4 8th naban$ 4 1 ananaban$ 3
5 1th aban$ 5 6 ban$ 0
6 5th ban$ 6 0 bananaban$ 3
7 2th an$ 7 8 n$
8 7th n$ 8 4 naban$ 1
9 0th $ 9 2 nanaban$ 2

45
Kasai’s algorithm – Example
� Kasai et al. used above observation systematically

� Key idea: compute LCP values in text order

� Dropping first character of adjacent suffixes might not lead to adjacent shorter suffixes,
but LCP entry can only be longer.

𝑖 𝑅[𝑖] 𝑇𝑖 𝑟 𝐿[𝑟] 𝑇𝐿[𝑟] LCP[𝑟]


0 6th bananaban$ 0 9 $ –
1 4th ananaban$ 1 5 aban$ 0
2 9th nanaban$ 2 7 an$ 1
3 3th anaban$ 3 3 anaban$ 2
4 8th naban$ 4 1 ananaban$ 3
5 1th aban$ 5 6 ban$ 0
6 5th ban$ 6 0 bananaban$ 3
7 2th an$ 7 8 n$
8 7th n$ 8 4 naban$ 1
9 0th $ 9 2 nanaban$ 2

45
Kasai’s algorithm – Example
� Kasai et al. used above observation systematically

� Key idea: compute LCP values in text order

� Dropping first character of adjacent suffixes might not lead to adjacent shorter suffixes,
but LCP entry can only be longer.

Example for 𝑇 = bananaban$ (left: suffixes in text order with their rank 𝑅[𝑖]; right: suffixes in sorted order with their LCP values)

𝑖   𝑅[𝑖]   𝑇_𝑖           𝑟   𝐿[𝑟]   𝑇_𝐿[𝑟]        LCP[𝑟]
0   6      bananaban$    0   9      $             –
1   4      ananaban$     1   5      aban$         0
2   9      nanaban$      2   7      an$           1
3   3      anaban$       3   3      anaban$       2
4   8      naban$        4   1      ananaban$     3
5   1      aban$         5   6      ban$          0
6   5      ban$          6   0      bananaban$    3
7   2      an$           7   8      n$            0
8   7      n$            8   4      naban$        1
9   0      $             9   2      nanaban$      2
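The example table and the key observation can be checked by brute force. The following is a minimal sketch, assuming naive suffix sorting and quadratic LCP computation; the helper names `suffix_array`, `lcp_len` and `naive_lcp` are hypothetical and not part of the slides.

```python
def suffix_array(text):
    """Naive suffix array: sort start indices by the suffixes they start."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_len(a, b):
    """Length of the longest common prefix of the strings a and b."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return k


def naive_lcp(text, L):
    """LCP[r] compares the suffixes of rank r-1 and r; LCP[0] is undefined."""
    return [None] + [lcp_len(text[L[r - 1]:], text[L[r]:])
                     for r in range(1, len(L))]


T = "bananaban$"
L = suffix_array(T)
R = [0] * len(T)          # inverse suffix array: R[i] is the rank of suffix T_i
for r, i in enumerate(L):
    R[i] = r
LCP = naive_lcp(T, L)

# In text order, the LCP value drops by at most 1 from T_i to T_{i+1}.
# The last step (i + 1 = n) is skipped because the sentinel suffix has rank 0.
drops_by_at_most_one = all(LCP[R[i + 1]] >= LCP[R[i]] - 1
                           for i in range(len(T) - 2))
print(L, R, LCP, drops_by_at_most_one)
```

Kasai's algorithm exploits exactly this bound to avoid re-comparing characters that are already known to match.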
Kasai’s algorithm – Code

procedure computeLCP(𝑇[0..𝑛], 𝐿[0..𝑛], 𝑅[0..𝑛])
    // Assume 𝑇[𝑛] = $; 𝐿 is the suffix array and 𝑅 its inverse
    ℓ := 0
    for 𝑖 := 0, . . . , 𝑛 − 1
        𝑟 := 𝑅[𝑖]
        // compute LCP[𝑟]; note that 𝑟 > 0 since 𝑅[𝑛] = 0
        𝑗 := 𝐿[𝑟 − 1]    // start of the suffix that precedes 𝑇_𝑖 in sorted order
        while 𝑇[𝑖 + ℓ] == 𝑇[𝑗 + ℓ] do
            ℓ := ℓ + 1
        LCP[𝑟] := ℓ
        ℓ := max{ℓ − 1, 0}
    return LCP[1..𝑛]

• remember length ℓ of induced common prefix

• use 𝐿 to get start index of suffixes
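The pseudocode translates almost line by line into Python. This is a sketch under two assumptions that go beyond the slide: the text ends with a unique sentinel `$` that is smaller than all other characters, and the inverse array 𝑅 is computed inside the function instead of being passed in. The function name `kasai_lcp` is chosen here for illustration.

```python
def kasai_lcp(text, L):
    """Compute the LCP array from text and suffix array L in linear time.

    Assumes text[n] is a unique sentinel, so every comparison loop stops
    before running past the end of the text.
    """
    n = len(text) - 1
    R = [0] * (n + 1)                 # inverse suffix array
    for r, i in enumerate(L):
        R[i] = r
    LCP = [None] * (n + 1)            # LCP[0] stays undefined
    ell = 0                           # common prefix length carried over
    for i in range(n):                # text order; suffix n (the sentinel) has rank 0
        r = R[i]
        j = L[r - 1]                  # predecessor of T_i in sorted order
        while text[i + ell] == text[j + ell]:
            ell += 1
        LCP[r] = ell
        ell = max(ell - 1, 0)         # dropping one character loses at most one match
    return LCP


T = "bananaban$"
L = sorted(range(len(T)), key=lambda i: T[i:])   # naive suffix sorting for the demo
print(kasai_lcp(T, L))
```

The naive sort is used only to keep the example self-contained; in a linear-time pipeline 𝐿 would come from a linear-time suffix sorting algorithm.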
Analysis:

• dominant operation: character comparisons

• separately count those with outcome “=” resp. “≠”

• each “≠” ends an iteration of the for-loop
  ⇒ ≤ 𝑛 comparisons

• each “=” implies an increment of ℓ, but ℓ ≤ 𝑛 always and ℓ is decremented ≤ 𝑛 times
  ⇒ ≤ 2𝑛 comparisons

• Θ(𝑛) overall time
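The two bounds can be observed on the running example by counting both outcomes separately. The sketch below instruments the same loop; the function name `count_comparisons` is hypothetical, and the unique-sentinel assumption from the pseudocode applies.

```python
def count_comparisons(text, L):
    """Run Kasai's loop and count '=' and '≠' outcomes of character comparisons."""
    n = len(text) - 1
    R = [0] * (n + 1)
    for r, i in enumerate(L):
        R[i] = r
    equal = unequal = 0
    ell = 0
    for i in range(n):
        j = L[R[i] - 1]
        while text[i + ell] == text[j + ell]:
            equal += 1                # each '=' increments ell
            ell += 1
        unequal += 1                  # the single '≠' that ends this iteration
        ell = max(ell - 1, 0)
    return equal, unequal


T = "bananaban$"
L = sorted(range(len(T)), key=lambda i: T[i:])
equal, unequal = count_comparisons(T, L)
print(equal, unequal)                 # 6 9
```

Here 𝑛 = 9, so the 9 unequal comparisons meet the bound 𝑛 exactly, while the 6 equal comparisons stay well below 2𝑛 = 18.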
Back to suffix trees
We can finally look into the black box of linear-time suffix-tree construction!
1. Compute the suffix array for 𝑇.

2. Compute the LCP array for 𝑇.

3. Construct the suffix tree 𝒯 from the suffix array and the LCP array.


[Figure: suffix tree of 𝑇 = bananaban$ drawn next to the sorted suffixes, with columns "LCP-intervals", "LCP[1..𝑛]" and "𝐿[0..𝑛]". The LCP values between adjacent suffixes correspond to the inner nodes 𝜺 (0), a (1), an (2), ana (3), ban (3), n (1) and na (2).]
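Step 3 can be sketched with a stack that holds the rightmost path of the tree built so far. This is one common way to do it and not necessarily the construction intended on the slide; the class `Node` and the functions `suffix_tree_from_arrays` and `walk` are hypothetical names. Edge labels are kept implicit: a node with string depth `depth` and text position `start` has the path label `text[start:start + depth]`.

```python
class Node:
    def __init__(self, depth, start):
        self.depth = depth        # string depth: length of the path label
        self.start = start        # a text position where the path label occurs
        self.children = []        # children in sorted order; empty for leaves


def suffix_tree_from_arrays(text, L, LCP):
    """Build the suffix tree from suffix array L and LCP array (LCP[0] unused)."""
    root = Node(0, 0)
    stack = [root]                                # rightmost path, root first
    for r in range(len(L)):
        h = LCP[r] if r > 0 else 0
        last = None
        while stack[-1].depth > h:                # leave subtrees deeper than h
            last = stack.pop()
        if stack[-1].depth < h:
            # No node of depth h yet: split the edge leading to `last`.
            inner = Node(h, last.start)
            stack[-1].children[-1] = inner        # `last` was the rightmost child
            inner.children.append(last)
            stack.append(inner)
        leaf = Node(len(text) - L[r], L[r])       # leaf for suffix starting at L[r]
        stack[-1].children.append(leaf)
        stack.append(leaf)
    return root


def walk(node):
    """Yield all nodes in depth-first, left-to-right order."""
    yield node
    for child in node.children:
        yield from walk(child)


T = "bananaban$"
L = [9, 5, 7, 3, 1, 6, 0, 8, 4, 2]
LCP = [None, 0, 1, 2, 3, 0, 3, 0, 1, 2]
root = suffix_tree_from_arrays(T, L, LCP)
inner_labels = sorted(T[v.start:v.start + v.depth] for v in walk(root) if v.children)
leaf_order = [v.start for v in walk(root) if not v.children]
print(inner_labels)
print(leaf_order)
```

Every node is pushed and popped at most once, so this step takes linear time given 𝐿 and the LCP array. The inner nodes found for the example are the LCP-intervals, and the leaves appear in suffix-array order.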
Conclusion
• (Enhanced) suffix arrays are the modern version of suffix trees
  - can be harder to reason about
  - can support the same algorithms as suffix trees
  - but use much less space
  - simpler linear-time construction