gsaca
Author:
Uwe Baier
[email protected]
Reviewers:
Prof. Dr. Enno Ohlebusch
Prof. Dr. Uwe Schöning
Supervisor:
Prof. Dr. Enno Ohlebusch
2015
“Linear-time Suffix Sorting – A new approach for suffix array construction”
Version from November 11, 2015
Acknowledgements:
To my supervisor, Enno Ohlebusch, as well as to my correctors, Matthias Gerber, Annika
Maier and Carolin Baier.
Contents
1 Introduction
  1.1 Overview
  1.2 Related Work
2 Preliminaries
  2.1 String Definitions
  2.2 The Suffix Array
  2.3 Basic Suffix Array Construction Techniques
3 Algorithmic Idea
  3.1 Basic Sorting Principle
  3.2 An Introducing Example
4 The Algorithm
  4.1 Correctness
  4.2 Runtime
5 Implementation
6 Performance Analyses
7 Conclusion
Bibliography
Chapter 1
Introduction
The suffix array plays an important role in string processing and data compression.
It lists the suffixes of a given text in increasing lexicographic order. First
described by Manber and Myers in 1990 [20] as an alternative to suffix trees [10], the
suffix array nowadays is used in a wide range of applications. To name a few of the
most popular ones:
• The Burrows–Wheeler Transformation [3] of a text can easily be obtained using
the suffix array. Main applications of the BWT include lossless data compression
in tools like bzip2 [30], as well as full text indexing, a powerful method to prepare
a text for fast pattern localization and many other operations [6].
• Another popular lossless data compression method, Lempel–Ziv 77 [32], makes
use of the suffix array for fast construction [18]. It is used in data compression
tools such as gzip [9], and further research showed how to build a compressed text
index based on Lempel–Ziv 77 [17].
• Along with the suffix array, Abouelhoda et al. introduced the Enhanced Suffix
Array [1], another powerful text index, which removes the need for the
space-intensive use of suffix trees in many common string processing operations.
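The first bullet point can be made concrete with a small sketch (mine, not from the thesis): once the suffix array of a sentinel-terminated text is known, the BWT is just the character preceding each suffix. The naive `suffix_array` helper below sorts all suffixes explicitly and is meant for illustration only.

```python
def suffix_array(s):
    # naive construction: sort the 1-based start positions by their suffixes
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])

def bwt(s, sa):
    # BWT[i] is the character preceding suffix SA[i]; SA[i] = 1 wraps around
    # to the last character (the sentinel), via Python's negative indexing
    return "".join(s[i - 2] for i in sa)

sa = suffix_array("mississippi$")
print(bwt("mississippi$", sa))  # ipssm$pissii
```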
As one can see, the suffix array is useful in a lot of applications. Unfortunately,
constructing a suffix array from a given text turns out to be a computationally demanding task.
Although linear-time algorithms exist, some super-linear algorithms for suffix array
construction work faster in practice.
According to a survey paper by Puglisi et al. [28], suffix array construction algorithms
(SACAs) should fulfill all of the following requirements:
• Minimal asymptotic runtime: linear-time complexity
• Fast runtime in practice, tested on real-world data
• Minimal space requirements: space usage for the suffix array and the text itself in
an optimal way
Although the paper was published in 2007, no SACA has met all of those requirements
so far, so there is still a need for an ’optimal’ SACA.
As research went on, interest increased in parallel suffix array construction, as well
as suffix array construction using external memory. Surprisingly, both of
these areas can be combined, and perform considerably better than approaches dealing only
with a single case [19]. One key to this result was the use of a fast sequential SACA, so
conventional SACAs could improve this technique further.
1.1 Overview
As we’ve seen already, an ’optimal’ SACA would help in many areas of string processing
and data compression. My contribution to this theme will be a new linear-time
SACA which, as the first linear-time SACA, computes the suffix array without the use
of recursion.¹
This thesis will be organized as follows: Chapter 2 takes care of basic fundamentals of
string processing and suffix array construction. In Chapter 3, the basic idea of the new
algorithm will be explained, before Chapter 4 shows the algorithm along with proofs
for correctness and linear runtime. Chapter 5 discusses implementation details and
further optimisations, followed by performance analyses in Chapter 6. Chapter 7 finally
summarizes all results and gives a lookout on future research topics.
¹ SACAs assuming a constant-size alphabet for achieving linear time are ignored here, too.
Chapter 2
Preliminaries
In this chapter, we will discuss basic definitions in string processing and some funda-
mentals of suffix array construction. First of all, let’s begin with the term ’string’ and
its components.
Let’s say we have an alphabet Σ = {a, b, c}. Normally, one would order the elements
of this alphabet as a < b < c, but another order of those elements would
be possible too. However, to keep it simple, most of the time we will use the lowercase
basic modern Latin alphabet and, if not mentioned otherwise, an ’intuitive’ order
over this alphabet (a < b < · · · < z). Now let’s have a look at the definition of the term
string.
Definition 2.1.3. Let S be a string of length n, and let i and j be integers with
1 ≤ i, j ≤ n. S[i] denotes the i-th character of the string S. S[i..j] denotes the substring
of S starting at the i-th and ending at the j-th position. We write S[i..j + 1) analogously
to S[i..j], and state S[i..j] = ε for i > j. Furthermore, Si denotes the i-th suffix S[i..n].
To give some examples for these definitions, consider the example string S = mississippi.
• S[1] = m
• S[2..1] = ε
• S[1..11] = S1 = mississippi
• S[5..11] = S5 = issippi
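Since the 1-based notation of Definition 2.1.3 differs from the 0-based slicing of most programming languages, the following helpers (my names, not the thesis’s) make the translation explicit:

```python
def char(S, i):
    """S[i] in the thesis notation: the i-th character, 1-based."""
    return S[i - 1]

def substr(S, i, j):
    """S[i..j]: from the i-th to the j-th position; empty (ε) if i > j."""
    return S[i - 1:j]

def suffix(S, i):
    """S_i = S[i..n]: the i-th suffix."""
    return S[i - 1:]

S = "mississippi"
# char(S, 1) == 'm', substr(S, 2, 1) == '', suffix(S, 5) == 'issippi'
```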
To conclude our basic definitions, we introduce the lexicographic order, a way to compare
and order strings.
Definition 2.1.4. Let Σ be an alphabet, S be a string of length n over alphabet Σ
and T be a string of length m over alphabet Σ. We write S <lex T and say that S is
lexicographically smaller than T , if one of the following conditions holds:
(i) There exists an i (1 ≤ i ≤ min{n, m}) with S[i] < T [i] and S[1..i) = T [1..i).
(ii) S is a proper prefix of T , i.e. n < m and S[1..n] = T [1..n].
Also, we define S =lex T if S = T, and write S >lex T analogously to T <lex S, S ≤lex T
if S <lex T or S =lex T, and S ≥lex T analogously to T ≤lex S.
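Definition 2.1.4 translates directly into code; the sketch below mirrors conditions (i) and (ii) and agrees with the built-in lexicographic string comparison of most languages:

```python
def lex_smaller(S, T):
    # condition (i): the first mismatching position decides
    for a, b in zip(S, T):
        if a != b:
            return a < b
    # condition (ii): all compared positions agree, so S <lex T
    # holds exactly if S is a proper prefix of T
    return len(S) < len(T)

assert lex_smaller("ississippi", "mississippi")   # i < m at the first position
assert lex_smaller("miss", "mississippi")         # proper prefix
assert not lex_smaller("mississippi", "mississippi")
```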
By now, the tools for introducing the suffix array are complete, but let’s have a
look at some examples before moving on. We again consider our example string S =
mississippi (the ’deciding character’ from condition (i) of Definition 2.1.4 is the first
position where the compared suffixes differ):
• S2 = ississippi <lex mississippi = S1
2.3 Basic Suffix Array Construction Techniques
Figure 2.1: Suffix array and inverse suffix array of the string S = mississippi$.
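The content of Figure 2.1 can be reproduced with a naive construction that simply sorts all suffixes (easy, but far from linear time); the sketch is mine, for illustration only:

```python
def suffix_array(s):
    # 1-based start positions, sorted by the lexicographic order of their suffixes
    return sorted(range(1, len(s) + 1), key=lambda i: s[i - 1:])

def inverse_suffix_array(sa):
    # ISA[i] is the rank of suffix S_i, i.e. ISA[SA[i]] = i
    isa = [0] * len(sa)
    for rank, start in enumerate(sa, start=1):
        isa[start - 1] = rank
    return isa

sa = suffix_array("mississippi$")
# sa  == [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
# isa == [6, 5, 12, 10, 4, 11, 9, 3, 8, 7, 2, 1]
```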
Today, all state-of-the-art linear-time SACAs make use of recursion, achieving quite
good results in practice. Nonetheless, it would be interesting, at least from a theoretical
point of view, whether a non-recursive algorithm with linear-time capability can be
designed. This issue will be addressed in the next chapter, along with a new technique for
suffix sorting.
Chapter 3
Algorithmic Idea
After introducing the suffix array and discussing some basic construction techniques,
the next goal is to present a new sorting principle along with an algorithm to construct
a suffix array in linear time. To introduce the sorting principle, a definition of the next
lexicographically smaller suffix is needed first.
An example of suffixes and their next lexicographically smaller suffixes can be found
in Figure 3.1.
i    j = SA[i]   ĵ    Sj               S[j..ĵ)
1       14       15   $                $
2        3       14   aindraining$     aindraining
3        8       14   aining$          aining
4        6        8   draining$        dr
5       13       14   g$               g
6        1        3   graindraining$   gr
7        4        6   indraining$      in
8       11       13   ing$             in
9        9       11   ining$           in
10       5        6   ndraining$       n
11      12       13   ng$              n
12      10       11   ning$            n
13       2        3   raindraining$    r
14       7        8   raining$         r
Figure 3.1: The suffixes of S = graindraining$ in suffix array order, together with the
positions ĵ of their next lexicographically smaller suffixes and the prefixes S[j..ĵ).
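The positions ĵ listed above can be recomputed with a naive sketch (quadratic, mine, for illustration only), using n + 1 for the sentinel suffix, which has no lexicographically smaller successor:

```python
def nss(s):
    # nss(s)[i-1] is the 1-based position of the next lexicographically
    # smaller suffix of S_i, or n + 1 if no such suffix exists
    n = len(s)
    suf = [s[i:] for i in range(n)]
    return [next((j + 1 for j in range(i + 1, n) if suf[j] < suf[i]), n + 1)
            for i in range(n)]

h = nss("graindraining$")
# e.g. S_1 = graindraining$ has S_3 = aindraining$ as its next
# lexicographically smaller suffix:
# h == [3, 3, 14, 6, 6, 8, 8, 14, 11, 11, 13, 13, 14, 15]
```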
group suffixes   { 14 }  { 3 }         { 8 }   { 6 }  { 13 }  { 1 }  { 11 , 9 , 4 }  { 12 , 10 , 5 }  { 7 , 2 }
group prefix     $       aindraining   aining  dr     g       gr     in              n                r
Figure 3.2: Suffix groups of S = graindraining$ together with their group prefixes;
groups are ordered by the lexicographic order of their prefixes.
Then, in a second phase, this information is used to finally sort the suffixes. Think
about a suffix Si with S[î] = $. Because of the sorting in the first phase it is clear that all
suffixes in lower ordered groups are lexicographically smaller. On the other hand, since
$ is the lexicographically smallest suffix, it is clear that Si must be the lexicographically
smallest suffix in its current group. Now, define sr to be the number of suffixes placed in
lower groups than group(i), sr := |{ j ∈ [1 . . . n] | group(j) < group(i) }|; then Si is the
lexicographically (sr + 1)-th smallest suffix, and we can set SA[sr + 1] ← i. Additionally,
if we remove Si from its group and put it into a new group placed immediately before i’s
old group, the group order of Definition 3.0.2 stays consistent, and the same procedure
can be repeated for the next minimal element of Si’s old group.
Using this idea, the second phase proceeds as follows: first, set SA[1] ← n
because of the definition of the sentinel character. Then, iterate over the suffix array from 1
to n in increasing order. Within the i-th iteration, compute all suffixes Sj with ĵ = SA[i],
and execute the procedure described above for all of them. Some exemplary iterations
can be found in Figure 3.3.
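The whole principle can be sketched in code. The sketch below is mine, not the thesis’s Algorithm 1: it takes the result of the first phase for granted (suffixes grouped by the prefixes S[i..î), groups ordered lexicographically by those prefixes) and then performs the second phase exactly as described above; everything is naive and quadratic, for illustration only.

```python
def nss(s):
    # 1-based positions of the next lexicographically smaller suffixes
    n = len(s)
    suf = [s[i:] for i in range(n)]
    return [next((j + 1 for j in range(i + 1, n) if suf[j] < suf[i]), n + 1)
            for i in range(n)]

def suffix_array(s):
    n = len(s)
    h = nss(s)
    # Phase-1 result, taken for granted here: suffixes grouped by the prefix
    # S[i..î), groups ordered by the lexicographic order of those prefixes.
    pref = lambda i: s[i - 1:h[i - 1] - 1]
    groups = [[i for i in range(1, n + 1) if pref(i) == p]
              for p in sorted(set(pref(i) for i in range(1, n + 1)))]

    def rank(j):
        return next(g for g, grp in enumerate(groups) if j in grp)

    sa = [None] * (n + 1)  # 1-based; sa[0] stays unused
    sa[1] = n              # the sentinel suffix is the smallest one
    for i in range(1, n + 1):
        # all suffixes whose next lexicographically smaller suffix is S_SA[i]
        for j in sorted((j for j in range(1, n + 1) if h[j - 1] == sa[i]), key=rank):
            g = rank(j)
            sr = sum(len(grp) for grp in groups[:g])  # suffixes in lower groups
            sa[sr + 1] = j
            groups[g].remove(j)    # move S_j into a new singleton group placed
            groups.insert(g, [j])  # immediately before its old group
    return sa[1:]

print(suffix_array("graindraining$"))  # [14, 3, 8, 6, 13, 1, 4, 11, 9, 5, 12, 10, 2, 7]
```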
3.1 Basic Sorting Principle
i      1   2   3   4   5   6   7   8   9  10  11  12  13  14
SA[i] 14   −   −   −   −   −   −   −   −   −   −   −   −   −
SA[i] 14   3   8   −  13   −   −   −   −   −   −   −   −   −
SA[i] 14   3   8   −  13   1   −   −   −   −   −   −   2   −
SA[i] 14   3   8   6  13   1   −   −   −   −   −   −   2   7
SA[i] 14   3   8   6  13   1   4   −   −   5   −   −   2   7
...
SA[i] 14   3   8   6  13   1   4  11   9   5  12  10   2   7
Figure 3.3: Exemplary iterations of Phase 2 for S = graindraining$, showing how the
suffix array fills up (− marks a still empty entry). In each iteration, the placed suffixes
are additionally moved into new singleton groups immediately before their old groups;
the group states are omitted here.
Now we can finally give a first principle of how suffixes can be sorted:
Theorem 3.1.1. Let S be a nullterminated string of length n, and let i and j be two
integers in range [1 . . . n]. If S[i..î) is a proper prefix of S[j..ĵ), then it follows that
Si <lex Sj .
Proof. Let k := î − i. Because S[i..î) is a proper prefix of S[j..ĵ), it is clear that
k < ĵ − j and S[i..i + k) = S[j..j + k). By Definition 3.0.1 of next lexicographically
smaller suffixes, it is clear that Si >lex Si+k and Sj <lex Sj+k . Now, let u be a number
such that S[i + u] > S[i + k + u] and S[i..i + u) = S[i + k..i + k + u), and let v be a
number such that S[j + v] < S[j + k + v] and S[j..j + v) = S[j + k..j + k + v). We
assume that u < v; the cases u = v and u > v can be handled analogously. Since
S[i..i + u) = S[i + k..i + k + u), the suffix Si starts with repetitions of S[i..i + k) until
the (u + k)-th character, i.e. S[i..i + k + u) = S[i..i + k) S[i..i + u).
Also, since S[j..j + v) = S[j + k..j + k + v), the suffix Sj starts with repetitions of
S[j..j + k) until the (v + k)-th character. Because u < v, the suffix Sj in particular
satisfies S[j + u] = S[j + k + u].
Because Si and Sj share the same prefix among the first k characters, S[i..i + k + u) =
S[j..j + k + u) must hold. Now finally, we get S[j + k + u] = S[j + u] = S[i + u] >
S[î + u] = S[i + k + u], so Si <lex Sj must hold.
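Theorem 3.1.1 can also be checked exhaustively on small sentinel-terminated strings; the brute-force test below (mine, not part of the thesis) verifies the implication for every pair i, j:

```python
from itertools import product

def nss(s):
    n = len(s)
    suf = [s[i:] for i in range(n)]
    return [next((j + 1 for j in range(i + 1, n) if suf[j] < suf[i]), n + 1)
            for i in range(n)]

def check(s):
    # if S[i..î) is a proper prefix of S[j..ĵ), then S_i <lex S_j must hold
    n, h = len(s), nss(s)
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            pi, pj = s[i - 1:h[i - 1] - 1], s[j - 1:h[j - 1] - 1]
            if pi != pj and pj.startswith(pi):
                assert s[i - 1:] < s[j - 1:]

check("graindraining$")
for t in product("ab", repeat=6):   # all short strings over {a, b}
    check("".join(t) + "$")         # '$' is smaller than any letter
```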
Another question is whether Algorithm 1 fills the suffix array entirely. In more detail, it
has to be shown that before the i-th iteration of Phase 2, the lexicographically i-th smallest
suffix is already placed in the suffix array. It is clear that the first suffix (SA[1]) is correctly
placed into the suffix array in line 3 because of the definition of the sentinel character.
Now, let Sj be the lexicographically i-th smallest suffix. Since Sĵ <lex Sj holds, by
induction, the suffix Sĵ must already have been handled in one of the i − 1 previous
iterations. Within this iteration, Sj is put in place into the suffix array in lines 5 to 9,
and therefore Sj is available before the i-th iteration.
So far, we’ve seen the basic sorting principle along with some thoughts on correctness,
but there are still a lot of open issues: How can Phase 1 be implemented? What
asymptotic time and space is required? How will groups be organized? But instead of
answering those questions and presenting a more precise algorithm directly, the next
section first shows a larger running example of the final algorithm, to give better insight
into the performed steps and to bridge the gap between the basic sorting principle and
the final technical result.
3.2 An Introducing Example
groups { 14 }{ 3 , 8 }{ 6 }{ 1 , 13 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 , 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
Figure 3.4: Initial groups of the string S = graindraining$. All suffixes sharing the
same first character are placed in the same group, groups are ordered by
their group prefix character from left to right. Also, links from the group
with prefix i to its suffixes are displayed.
Step 1
We start by processing suffixes of the highest group.
groups { 14 }{ 3 , 8 }{ 6 }{ 1 , 13 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 , 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
Now, for each suffix of this group, we search for the first previous suffix that is placed
in a lower group. In more detail, if Si is a suffix of the current group, we search for
prev(i) := max{ j ∈ [1..i − 1] | group(j) < group(i) }. Additionally, pointers from
suffixes to their first previous suffixes (prev pointers) get stored.
groups { 14 }{ 3 , 8 }{ 6 }{ 1 , 13 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 , 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
Next, each of those previous suffixes gets rearranged into a new group, placed immediately
after its old group. One can think of this as follows: currently, we are working
on suffixes of the highest group. Since those previous suffixes are followed by suffixes of
the processed group, they are lexicographically larger than suffixes of the same group that
are not followed by processed suffixes. On the other hand, since the next higher group
of such a previous suffix contains suffixes that are lexicographically larger still, we have to
place the previous suffixes in new groups immediately after their old groups.
groups { 14 }{ 3 , 8 }{ 6 }{ 1 , 13 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 , 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
Applied to our example, we get the following rearrangements: the suffix S1 is placed
in a new group, between the groups {13} and {4, 9, 11}. The suffix S6 is also placed in
a new group, but since the old group of S6 becomes empty after removing S6 , the new
group has the same position as the old one.
groups { 14 }{ 3 , 8 }{ 6 }{ 13 }{ 1 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 , 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
Step 2
In the next step, we process the next lower group.
groups { 14 }{ 3 , 8 }{ 6 }{ 13 }{ 1 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 , 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
As in Step 1, for each suffix of the processed group we search for the first previous suffix
in a lower group.
groups { 14 }{ 3 , 8 }{ 6 }{ 13 }{ 1 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 , 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
Again, those previous suffixes get rearranged into new groups, placed immediately after
their old groups. In our example, all previous suffixes were placed in the same group
before the rearrangement, so they’ll be placed in the same group after the rearrangement.
Additionally, since their old group becomes empty after the rearrangement, the new
group is identical to the old one.
groups { 14 }{ 3 , 8 }{ 6 }{ 13 }{ 1 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 , 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
Step 3
Again, we work on suffixes of the next lower group, and search for their first previous
suffixes in lower groups. In our example, this is the first time that a previous suffix is
not the immediate neighbor of a processed suffix, see S11 and its previous suffix S8 .2
groups { 14 }{ 3 , 8 }{ 6 }{ 13 }{ 1 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 , 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
A new situation appears: the suffixes S3 and S8 are placed in the same group, but S3
is followed by one suffix of the processed group, while S8 is followed by two suffixes of
the processed group. To handle this case, we can proceed as follows: if only the nearest
suffixes of the processed group are used to rearrange both S3 and S8 , then both will be
placed into the same new group. After performing this step, we use the second suffix to
rearrange S8 , so it gets placed into a new group, and therefore lands in a group higher
than that of S3 . Consequently, S8 must be rearranged into its own group, placed
higher than that of S3 .
² Note that all suffixes on the way from S11 to S8 already carry prev pointers, so they can be used to
speed up the search. This technique is also known as pointer jumping.
groups { 14 }{ 3 , 8 }{ 6 }{ 13 }{ 1 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 , 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
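The pointer jumping mentioned in the footnote is essentially the classic ‘previous smaller value’ chaining trick. Below is a generic sketch over a fixed array of group ranks (my formulation, not the exact per-group schedule of the algorithm):

```python
def prev_pointers(rank):
    # rank[i-1] is the group rank of suffix S_i (smaller = lower group);
    # returns 1-based prev pointers, 0 meaning "no previous suffix in a lower group"
    n = len(rank)
    prev = [0] * (n + 1)
    for i in range(2, n + 1):
        j = i - 1
        while j > 0 and rank[j - 1] >= rank[i - 1]:
            j = prev[j]  # pointer jumping: skip whole runs of equal-or-higher suffixes
        prev[i] = j
    return prev[1:]

# initial one-character groups of S = graindraining$ ($ < a < d < g < i < n < r)
ranks = [3, 6, 1, 4, 5, 2, 6, 1, 4, 5, 4, 5, 3, 0]
print(prev_pointers(ranks))  # [0, 1, 0, 3, 4, 3, 6, 0, 8, 9, 8, 11, 8, 0]
```

For this string the computed pointers coincide with the ones appearing in the walkthrough, e.g. prev(11) = 8 and prev(13) = 8.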
Step 4
In the next step, the group of S1 is handled. Since S1 has no previous suffix in a lower
group, no action is performed.
groups { 14 }{ 3 }{ 8 }{ 6 }{ 13 }{ 1 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 , 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
Step 5
The next group to handle is that of suffix S13 . Its previous suffix is S8 ,³ but since S8 is
the only suffix in its group, the rearrangement has no effect.
groups { 14 }{ 3 }{ 8 }{ 6 }{ 13 }{ 1 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 , 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
Step 6
The next lower group that gets handled now is that of suffix S6 . As in Step 5, the
previous suffix of S6 has no further suffixes in its group, so the rearrangement has no
effect.
³ The pointer jumping technique mentioned in Step 3 now saves time: if we jump from S12 to S11 ,
and further to S8 , only two operations are needed instead of four.
groups { 14 }{ 3 }{ 8 }{ 6 }{ 13 }{ 1 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 , 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
Step 7
The next step has no effect, since the previous suffix of S8 is already placed in an exclusive
group, as in Step 6.
groups { 14 }{ 3 }{ 8 }{ 6 }{ 13 }{ 1 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 , 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
Although two more steps will be performed (groups {3} and {14}), they are not shown
here, because no further prev pointers get computed and no group rearrangements
take place.
Intermediate Result
After completing Phase 1, let’s have a look at the intermediate result in Figure 3.5. As
one can see, the groups are identical to those from the beginning of this chapter, see
Figure 3.2. The question to be answered now is why those groups are identical.
groups { 14 }{ 3 }{ 8 }{ 6 }{ 13 }{ 1 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 , 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
Figure 3.5: Groups and prev pointers from the example string S = graindraining$
after Phase 1.
context   $    a    d    g    i    n    r
groups { 14 }{ 3 , 8 }{ 6 }{ 1 , 13 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 , 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
Figure 3.6: Initial groups of the string S = graindraining$ together with their group
contexts; initially, the context of each group is just its one-character group prefix.
Now, every time a rearrangement is performed, the group context of the processed
group implicitly gets appended to that of the rearranged suffixes, thus forming new
groups. Additionally, since rearrangements place suffixes between their old and their
next higher groups, the order of the groups stays consistent with the lexicographic order
of their contexts.
context   $    a    dr    g    gr    i    n    r
groups { 14 }{ 3 , 8 }{ 6 }{ 13 }{ 1 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 , 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
Figure 3.7: Step 1 of Phase 1, after the rearrangements took place. The context r of
the processed group was implicitly appended to the rearranged suffixes.
This also explains the somewhat strange behaviour in Step 3: if a suffix Si of a group
is followed by one group context of the currently processed group, and another suffix
Sj of the same group is followed by two repeated group contexts, the overall context of
Si is a proper prefix of the context of Sj . Thus, the context of Si is lexicographically
smaller than that of Sj , and consequently group(i) occurs before group(j); see Figures
3.8 and 3.9.
context   $    a    dr    g    gr    in    n    r
groups { 14 }{ 3 , 8 }{ 6 }{ 13 }{ 1 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 , 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
Figure 3.8: Step 2 of Phase 1, after the rearrangements. The context n of the processed
group was appended to the contexts of the rearranged suffixes.
context   $    ain    ainin    dr    g    gr    in    n    r
groups { 14 }{ 3 }{ 8 }{ 6 }{ 13 }{ 1 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 , 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
Figure 3.9: Step 3 of Phase 1, after the rearrangements. Since the new context of S8 is
lexicographically larger than that of S3 , group(3) < group(8).
This aspect of dynamic programming or, in string context, prefix doubling, together
with the greedy behaviour of the algorithm, ensures that groups are ordered as required
by the sorting principle.
Now, after having handled the first phase, let’s move on to the second phase and see
how the suffix array can be constructed from the current information.
The main problem in this phase is to find, for a given suffix Si , all suffixes whose next
lexicographically smaller suffix equals Si . In more detail, for a given suffix Si , the set
{ j ∈ [1 . . . n] | ĵ = i } has to be computed, see line 5 of Algorithm Excerpt 1. As we
shall see, the prev pointers computed in Phase 1 will handle this task for us, but let’s
see this step by step.
Step 1
We start with the algorithm after line 3, because the first suffix array entry trivially is
correct. So, our current suffix is S14 .
SA[i] 14 − − − − − − − − − − − − −
groups { 14 }{ 3 }{ 8 }{ 6 }{ 13 }{ 1 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 , 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
The first suffix we are going to visit is the preceding suffix of S14 , S13 . Trivially, the
next lexicographically smaller suffix of S13 is S14 . As described in the algorithm on lines
6 to 8, we remove S13 from its group and put it in a new group placed as immediate
predecessor of its old group, since it is followed by the lexicographically smallest suffix
and therefore is the minimal element of its group. Also, S13 is placed in the suffix array.
SA[i] 14 − − − 13 − − − − − − − − −
groups { 14 }{ 3 }{ 8 }{ 6 }{ 13 }{ 1 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 , 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
Next, we are going to follow the prev pointer from S13 to S8 , repeating the same steps
as above. Recall that the prev pointer by definition points to the first previous suffix
placed in a lower group. Consequently, each suffix Sj between S8 and S13 is placed in a
group equal to or higher than that of S13 . This fact and Theorem 3.1.1 imply
that Sj >lex S13 holds for each such suffix (a proof will follow later). Thus ĵ ≤ 13, so
those suffixes must be handled in a later step.
On the other hand, since group(8) < group(13), the next lexicographically smaller
suffix of S8 must lie behind that of S13 , so clearly it is S14 . Thus, placing
S8 into the suffix array is correct.
SA[i] 14 − 8 − 13 − − − − − − − − −
groups { 14 }{ 3 }{ 8 }{ 6 }{ 13 }{ 1 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 , 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
For the same reason as mentioned above, we again follow the prev pointer from S8
to S3 , executing lines 6 to 8 of the algorithm. Since S3 has no further prev pointer, the
step now is complete.
SA[i] 14 3 8 − 13 − − − − − − − − −
groups { 14 }{ 3 }{ 8 }{ 6 }{ 13 }{ 1 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 , 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
Step 2
In the next step, we process the lexicographically second smallest suffix, S3 . Its preceding
suffix is S2 , so we repeat the behaviour from above for S2 . To clarify the correctness of
this behaviour: assume that the next lexicographically smaller suffix of S2 is not S3 , i.e.
its position is greater than 3. By the definition of next lexicographically smaller suffixes,
S3 >lex S2 must then hold. As Phase 2 processes suffixes in increasing order, S2 should
have been placed in the suffix array already, but as one can verify, S2 is not placed in
the suffix array after Step 1; hence the next lexicographically smaller suffix of S2 is S3 .
SA[i] 14 3 8 − 13 − − − − − − − 2 −
groups { 14 }{ 3 }{ 8 }{ 6 }{ 13 }{ 1 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 , 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
Next, we follow the prev pointer from S2 to S1 , and place it into the suffix array. As
S1 has no further prev pointer to follow, the step is complete.
SA[i] 14 3 8 − 13 1 − − − − − − 2 −
groups { 14 }{ 3 }{ 8 }{ 6 }{ 13 }{ 1 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 }{ 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
Step 3
SA[i] 14 3 8 − 13 1 − − − − − − 2 7
groups { 14 }{ 3 }{ 8 }{ 6 }{ 13 }{ 1 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 }{ 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
SA[i] 14 3 8 6 13 1 − − − − − − 2 7
groups { 14 }{ 3 }{ 8 }{ 6 }{ 13 }{ 1 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 }{ 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
Now, when following the next prev pointer, we reach suffix S3 , which is already
contained in the suffix array. Thus, no action needs to be performed for S3 . Additionally,
since S3 is already placed in the suffix array, any possible prev pointer from S3 to a
previous suffix has been handled already, so we do not need to proceed.
SA[i] 14 3 8 6 13 1 − − − − − − 2 7
groups { 14 }{ 3 }{ 8 }{ 6 }{ 13 }{ 1 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 }{ 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
Another reason why no further suffix must be processed can be seen as follows:
suppose there exists another suffix which needs to be processed, i.e. there exists an
i < 3 with î = 8. Then, by the definition of next lexicographically smaller suffixes,
S3 >lex Si >lex Sî must hold. Since î = 8, this gives S8 <lex S3 , so the position p of the
next lexicographically smaller suffix of S3 satisfies p ≤ 8. If p < 8, applying the
definition of the next lexicographically smaller suffix to S[i..î) yields Sp >lex S8 ; if
p = 8, Sp =lex S8 . Combining both cases, Sp ≥lex S8 must hold. This means that the
suffix S3 cannot have been placed into the suffix array before the current step. This
leads to a contradiction, and clearly shows that no such i can exist.
Step 4
As next step, the suffix S6 gets processed. As before, we start by handling its preceding
suffix S5 .
SA[i] 14 3 8 6 13 1 − − − 5 − − 2 7
groups { 14 }{ 3 }{ 8 }{ 6 }{ 13 }{ 1 }{ 4 , 9 , 11 }{ 5 , 10 , 12 }{ 2 }{ 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
SA[i] 14 3 8 6 13 1 4 − − 5 − − 2 7
groups { 14 }{ 3 }{ 8 }{ 6 }{ 13 }{ 1 }{ 4 , 9 , 11 }{ 5 }{ 10 , 12 }{ 2 }{ 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
Now, the next prev pointer links from S4 to S3 , but S3 is already placed in the suffix
array, so we can stop here.
SA[i] 14 3 8 6 13 1 4 − − 5 − − 2 7
groups { 14 }{ 3 }{ 8 }{ 6 }{ 13 }{ 1 }{ 4 }{ 9 , 11 }{ 5 }{ 10 , 12 }{ 2 }{ 7 }
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
This concludes the running example; the steps 5 to 14 continue quite similarly: start
at the processed suffix, and jump to its preceding suffix. Then, if the suffix is not
contained in the suffix array, execute lines 6 to 8 of Algorithm Excerpt 1 with the suffix,
follow the prev pointer if one exists, and repeat the procedure. The reader may perform
the remaining steps as a small exercise.
What we’ve seen so far is a sorting principle along with a running example illustrating
an efficient implementation of the principle. However, an illustration cannot replace a
concrete algorithm. The next chapter will finally present such an algorithm, along with a
correctness proof and an analysis of the asymptotic runtime. Note that the main points
have already been explained within this section, so the following algorithm will not
introduce many new details.
Chapter 4
The Algorithm
Without any further preliminaries, Algorithm 2 shows how the suffix array can be
constructed using the new sorting principle (due to space limitations, the algorithm is
split across two pages).
As described in Chapter 3, the algorithm consists of two phases. Note that the groups
used in the algorithm are not the same as in the basic sorting principle (Algorithm 1).
Within this algorithm, the groups are built incrementally, in the same manner
as described in the introducing example from the previous chapter. Nonetheless, the
group definition from Chapter 3 can be applied, so we’re still able to compare groups by their
representative group prefixes. The only difference is that group prefixes consist of an
implicit context, as mentioned in the intermediate result after Phase 1 of the introducing
example, Section 3.2. A formal definition of implicit contexts will be presented later;
first, let’s have a look at the algorithm.
4.1 Correctness
We first want to make sure that Algorithm 2 works correctly. A first step towards this
goal is to show that the suffix groups computed by Phase 1 partition the suffixes in
the same manner as the first phase of the basic sorting algorithm.
The strategy is as follows: first, we define prev pointer chains. This
definition can then be used to define the implicit context of any suffix, as discussed at
the beginning of this chapter. Afterwards, a lemma about context extensions is
established, showing that the context extensions of the algorithm form an ’avalanche effect’
that sorts suffixes as desired. Finally, this lemma can be used to prove the correct group
structure after Phase 1.
Furthermore, Π(i) denotes the prev pointer chain set of i, Π(i) := { π^k(i) | k ∈ N0 }.
Put more simply, the implicit context of a suffix Si is the prefix of Si that extends to the
rightmost position that contains i in its prev pointer chain. Note that this definition
exactly matches the more intuitive notion of contexts from the introducing example
in Section 3.2, except that no formal definition was presented at that point.
These definitions are enough to express an invariant describing the avalanche effect
when processing groups in Phase 1 of Algorithm 2.
Proof. The proof is done by induction on the processed groups (lines 3 to 15).
Induction Base: Within lines 1 and 2 of the algorithm, suffixes are sorted and grouped
by their first character. Since no prev pointers exist before the first iteration, the
contexts of all suffixes consist only of their first characters, so propositions (i) and (ii)
are satisfied. Because the first group G to be processed is the highest group, and every
context has length 1, group(i_c) ≤ G holds for all i ∈ [1, n), so (iii) and (iv) are also
fulfilled.
Induction Step: Let G be the currently processed group, and let G̃ be the group to
be processed next. We will show that propositions (i) to (iv) are satisfied when the
processing of G has been finished.
Consider the point in time when the group G is processed. First, the algorithm
computes prev pointers and the set P within lines 4 to 7. The first observation to be
pointed out is the following:
Now, consider the point in time after the prev pointer computation. Since the
algorithm has computed new prev pointers, the contexts of all suffixes in P are extended,
see Definition 4.1.2. For any p ∈ P, let i be the rightmost index such that prev(i) = p
and i ∈ G. Because i is the rightmost index, using proposition (iii), i_c cannot be placed
in group G; otherwise a prev pointer from i_c to p would exist, so i would not have been
the rightmost index. As a consequence, group(i_c) < G holds. Also, by the definition of
prev pointers, group(p) < G must hold. By the manner in which rearrangements are
performed (lines 9 to 14), this statement holds even after the rearrangements. Summing
everything up, after the processing of group G, for every p ∈ P, group(p) < G,
group(p_c) < G, and, because of proposition (iii) applied to all appended contexts of p,
group(j) ≥ G for all p < j < p_c. Recall that Phase 1 processes groups in descending
group order. Since G̃ is the immediate predecessor of G, G̃ < G, and thus proposition
(iii) holds for all p ∈ P after the group G has been processed.
So far, we have seen that all unprocessed suffixes S_i with i_c ∈ G from (4.1) fulfill
proposition (iii). Next, we will have a look at unprocessed suffixes S_i with group(i) < G
and i_c ∉ G. The contexts of those suffixes are not extended during the iteration. Thus,
after the processing of group G, i_c is not changed. Since i_c ∉ G, group(i_c) < G must
hold. Because G̃ is the immediate predecessor of G, group(i_c) ≤ G̃ and group(i) ≤ G̃
hold. Also, by using proposition (iii) at the time before G was processed, group(j) > G̃
holds for all i < j < i_c. Thus, proposition (iii) is satisfied for all of those suffixes, too.
Next, let us consider the already processed suffixes. Because the next processed group G̃
is a predecessor of G, and the algorithm does not change contexts of already processed
suffixes, proposition (iv) is still correct after processing group G. The remaining part
is to show that the current group G fulfills proposition (iv) after it has been processed.
Therefore, let i ∈ G be any currently processed suffix. Since group(i) = G > G̃,
proposition (iv) has to be shown. When i is processed, by proposition (iii), group(i_c) ≤ G =
group(i) holds. Also, for all j with i < j < i_c, group(j) > G = group(i) is satisfied, so
proposition (iv) is fulfilled for i.
Summing up the current results, propositions (iii) and (iv) are correct after the
processing of group G. The next task is to show that suffixes with the same context belong
to the same group, as well as to show that groups are ordered by their group prefixes¹, as
propositions (i) and (ii) state.
In more detail, it has to be shown that the suffixes in a group created in line 12 share
the same context, as well as to ensure the correct placement of the new groups. By our
previous arguments we already know that the algorithm performs context extensions
only for suffixes contained in the set P. As a consequence, the contexts of all other
suffixes remain unchanged, so it is sufficient to show that the new groups are correctly
set up.
Therefore, let u be the group prefix of group G, p ∈ P be an index and p_c be the end
of the context of p before the extension. After the prev pointer computation in lines 4
to 7, the algorithm splits P into subsets P_1, …, P_k, line 8. After the split, the following
holds:

    p ∈ P_l ⇔ p has new context S[p..p_c)u^l    (4.2)

The reason for this is the following: since l equals the number of prev pointers pointing
¹ See Definition 3.0.2 on page 8. In this context, group prefixes are assumed to be the implicit contexts
of the suffixes in the group.
from G to p, the context of p is extended by exactly l contexts of G; thus, the new context
of p is S[p..p_c)u^l.
Next, the algorithm processes the sets P_1, …, P_k in descending order, splits the suffixes
of each subset into smaller subsets such that suffixes belonging to the same group are
gathered together in the same subset, and creates new groups, lines 9 to 14. As a
consequence, the new groups consist of suffixes with the same extended context, so
proposition (i) holds after the iteration.
The last remaining part is to show that the order of the new groups is correct,
proposition (ii). Let p ∈ P_l be an index, p_c be the end of the context of p before
the context extensions, and u be the group prefix of G. After the context extensions,
S[p..p_c) <_lex S[p..p_c)u^l, so the new group for p must be placed higher than its old
group, as line 12 of the algorithm performs.
Now, let i be the index of a suffix of the immediate successor of p's old group. If
i was placed in the same group as p before the iteration, i's context must have been
extended within this iteration. Then, since the subsets P_1, …, P_k are processed in
decreasing order, we know that S[i..i_c) = S[p..p_c)u^l̃ for some l̃ > l, so the context
of p is lexicographically smaller than that of i, and the group placement is correct.
If i was not placed in the same group as p before the iteration, we know that the
group of i is ordered higher than that of p. Using proposition (ii) before the current
iteration, S[p..p_c) <_lex S[i..i_c) holds. Then, if S[p..p_c) is no proper prefix of S[i..i_c),
S[p..p_c)u^l <_lex S[i..i_c) holds, thus the new group of p must be placed lower than the
group of i.
If S[p..p_c) is a proper prefix of S[i..i_c), consider the case that S[p..p_c)u^l is no proper
prefix of S[i..i_c); otherwise S[p..p_c)u^l <_lex S[i..i_c) holds already. Now, let l̃ be a number
such that S[p..p_c)u^l̃ is a proper prefix of S[i..i_c), but S[p..p_c)u^(l̃+1) is no proper prefix of
S[i..i_c). Also, define j := i + p_c − p + l̃|u|. Using propositions (iii) and (iv), we know that
G < group(j) holds. Thus, u <_lex S[j..j_c) must hold because of proposition (ii). Also,
since u ≠ S[j..j + |u|) holds by precondition, u cannot be a proper prefix of S[j..j_c). As
we have seen in statement (4.2), contexts are extended by full contexts of other groups,
so j_c < i_c must hold. Thus, there must exist a k with k < i_c − i and k < p_c − p + l|u|
such that S[p..p + k) = S[i..i + k) and S[p + k] < S[i + k]. This clearly shows that
S[p..p_c)u^l <_lex S[i..i_c), so the new group of p must be placed before the group of i.
Summing everything up, the group placements of new groups in line 12 of the
algorithm ensure a correct new group order, so proposition (ii) holds after the iteration.
Now, having proved the correctness of the avalanche effect, we can make sure
that Phase 1 of Algorithm 2 delivers the same group division as the basic sorting
principle.
Theorem 4.1.2. Let S be a null-terminated string of length n. When applying Algorithm
2 to S, the following propositions hold after Phase 1 is completed:
• group(i) = group(j) ⇔ S[i..î) =_lex S[j..ĵ)  ∀ i, j ∈ [1..n]
• group(i) < group(j) ⇔ S[i..î) <_lex S[j..ĵ)  ∀ i, j ∈ [1..n]
Proof. Consider the point in time when the last group ($) of S is processed. Since this
group is the lowest one, it will not change any further groups when being processed,
and since $ occurs only once at the end of S, the group order is correct after the start
of the algorithm.
We show that i_c = î for all i ∈ [1..n] before the last group was processed, from which,
in combination with Lemma 4.1.1 (propositions (i) and (ii)), the theorem automatically
follows. For i = n the theorem is already correct, as shown above.
First, assume î < i_c for some i ∈ [1..n). By using proposition (iv) of Lemma 4.1.1,
it follows that group(i_c) ≤ group(i) < group(î). Then, using proposition (ii) of Lemma
4.1.1, S[i..i_c) <_lex S[î..î_c) must hold. Because group(î) > group(i_c), î_c ≤ i_c must hold;
otherwise, proposition (iv) of Lemma 4.1.1 would be violated for î. This means that
S[i..i_c) cannot be a proper prefix of S[î..î_c). Since S[i..i_c) <_lex S[î..î_c), a k < î_c − î
with S[i..i + k) = S[î..î + k) and S[i + k] < S[î + k] must exist. This further implies
S_i <_lex S_î, which leads to a contradiction against Definition 3.0.1 of next
lexicographically smaller suffixes.
So far, we know that i_c ≤ î holds for all i ∈ [1..n). The next claim to be shown is
that i_c = î for all i ∈ [1..n). The proof is done by induction on the distance î − i of a
suffix S_i.
Induction Base: Before the first iteration, line 16 sets SA[1] = n. Because of the
definition of the sentinel character $ this is correct, so (i) is fulfilled. Since S_n is the
lexicographically smallest suffix, there exists no suffix S_j with S_ĵ <_lex S_n. Initially, the
suffix array is filled with nils except for the first position, thus (ii) is correct, too. Also,
because of the suffix grouping after Phase 1 (Theorem 4.1.2) and Theorem 3.1.1 of page
10, (iii) and (iv) hold.
Induction Step: Consider Algorithm 2 in the i-th iteration. We will first show (ii),
(iii) and (iv), and then use this result to show (i).
First, we need to show that after the i-th iteration all suffixes S_j with S_ĵ ≤_lex S_SA[i]
are correctly placed in the suffix array. By the induction hypothesis we know that all
suffixes S_j with S_ĵ <_lex S_SA[i] are placed correctly already, so it is sufficient to show
that all suffixes S_j with ĵ = SA[i] are placed correctly during the i-th iteration.
Within the i-th iteration, the algorithm iterates over the indices SA[i−1], prev(SA[i−1]),
prev(prev(SA[i−1])), …, until index 0 or an index j with SA[|{ s ∈ [1..n] | group(s) <
group(j) }| + 1] ≠ nil is reached, lines 17 to 28. First, let us discuss the second loop
termination criterion. Let sr := |{ s ∈ [1..n] | group(s) < group(j) }|. Then the
observation is the following:
If j has been placed in the suffix array already, then using hypotheses (iii) and (iv), sr is
the number of lexicographically smaller suffixes of S_j. Because of the correct placement
of j, SA[sr + 1] = j must hold, so in particular SA[sr + 1] ≠ nil holds. Now, consider
SA[sr + 1] ≠ nil. By hypothesis (iv), |group(SA[sr + 1])| = 1 holds. Consequently, |{ s ∈
[1..n] | group(s) < group(SA[sr + 1]) }| = |{ s ∈ [1..n] | group(s) < group(j) }|, so
group(SA[sr + 1]) = group(j). Since SA[sr + 1] belongs to its own group, SA[sr + 1] = j,
so j is already placed in the suffix array.
Using observation (4.3) and Definition 4.1.1 of prev pointer chains from page 24, the
behaviour of the algorithm in the i-th iteration can be described as follows: Let k be
the smallest number such that π^k(SA[i] − 1) is not contained in the suffix array. Then,
the algorithm iterates over the set J := { π^l(SA[i] − 1) | l ∈ [0..k) }, places each index
j ∈ J into the suffix array, and creates a new group for each index. Thus, to show that
all suffixes S_j with ĵ = SA[i] are placed correctly into the suffix array, we need to show
that M := { s ∈ [1..n] | ŝ = SA[i] } = J, as well as that each placement is correct.
We start with the first part, namely we show that j ∈ J ⇔ j ∈ M. For the forward
direction, the proof is done by induction over the prev pointer chain of SA[i] − 1: As the
base case, consider j := SA[i] − 1 not placed in the suffix array. Then, using hypothesis
(i) in the i-th iteration, we know that S_j >_lex S_SA[i]. Consequently, using the definition
of next lexicographically smaller suffixes, ĵ = SA[i].
For the induction step, for some l > 0, let j := π^l(SA[i] − 1) and k := π^(l−1)(SA[i] − 1)
be indices such that j is not contained in the suffix array, and k̂ = SA[i] holds. Since
prev(k) = j, j = max{ s ∈ [1..k) | group(s) < group(k) }, see line 5 of the algorithm,
as well as the argumentation in Lemma 4.1.1. Now, using the prev pointer definition
and the consistent group order (hypothesis (iii)), group(j) < group(q) implies S_j <_lex S_q
for all j < q ≤ k. Also, using the definition of next lexicographically smaller suffixes
for S_k, S_j <_lex S_k <_lex S_q for all k < q < k̂. Consequently, ĵ ≥ k̂ = SA[i] must hold.
Since j is not contained in the suffix array, using hypothesis (i) in the i-th iteration,
S_j >_lex S_SA[i], so clearly, ĵ = SA[i].
For the backward direction, let j be an index such that j ∉ J. If j ≥ SA[i], clearly
ĵ ≠ SA[i] holds by the definition of next lexicographically smaller suffixes. Next, consider
the case that j is placed between an index k and its prev pointer, i.e. prev(k) < j < k,
and k̂ = SA[i] holds. Using the definition of prev pointers, group(j) ≥ group(k) holds.
Now, using proposition (iv) of Lemma 4.1.1 after Phase 1, j_c ≤ k must hold; if j_c > k,
group(j) < group(k) would have to hold, a contradiction. Since the end of the implicit
context equals the next lexicographically smaller suffix (see Theorem 4.1.2), ĵ = j_c ≤ k < SA[i]
holds, so ĵ ≠ SA[i].
For the last missing case, let S_j be a suffix already contained in the suffix array, such that
prev(k) = j for some k ∈ J. We need to show that q̂ ≠ SA[i] for any q ∈ [1..j]. If q = j,
then j is contained in the suffix array already. Using hypothesis (ii), S_ĵ <_lex S_SA[i] holds, so
ĵ ≠ SA[i]. Now, assume for some q with 1 ≤ q < j that q̂ = SA[i]. Since q < j < SA[i],
using the definition of next lexicographically smaller suffixes for q, S_q <_lex S_j must
hold. Using the same definition, S_SA[i] = S_q̂ <_lex S_q <_lex S_j holds. Since j < q̂ holds
by precondition, this means that ĵ ≤ SA[i] and S_ĵ ≥_lex S_SA[i] hold. Thus, by hypothesis
(ii), the suffix S_j cannot be contained in the suffix array, a contradiction.
Both directions show that j ∈ J ⇔ j ∈ M. The missing part is to ensure the correct
placement of every j ∈ J into the suffix array. Therefore, consider j to be an index
with ĵ = SA[i]. Because of the consistent group order (hypothesis (iv)), all suffixes
placed in lower groups are lexicographically smaller than S_j. Now, assume that another
suffix S_k <_lex S_j with group(k) = group(j) exists. Since j and k belong to the same
group, using Theorem 4.1.2, S[j..ĵ) = S[k..k̂) must hold. Thus, S_k̂ <_lex S_ĵ must hold.
In this case, using hypothesis (ii), k must have been placed in the suffix array already.
Since k is placed in the suffix array already, it must belong to its own group, as
hypothesis (iv) states. Consequently, group(k) ≠ group(j), so no such suffix can exist, and
S_j must be the lexicographically minimal element of its group. Thus, line 20 computes
the number of lexicographically smaller suffixes of S_j, and line 24 places j at the correct
position in the suffix array. Furthermore, since line 25 places j into its own group,
hypothesis (iv) is correct after the i-th iteration. Finally, since S_j is the lexicographically
minimal suffix of its old group, the placement of the new group as immediate predecessor
of j's old group ensures a consistent group order, so hypothesis (iii) is correct after the
i-th iteration.
So far, we have proved hypotheses (ii), (iii) and (iv). Now, it remains to show that after
the i-th iteration the (i + 1)-th lexicographically smallest suffix S_j is placed in SA, i.e.
SA[i + 1] = j. Since S_j is the (i + 1)-th lexicographically smallest suffix of S, S_ĵ ≤_lex S_SA[i]
must hold. Now, using hypothesis (ii), which is proved already, we know that all suffixes
S_k with S_k̂ = S_ĵ are correctly placed in the suffix array. As S_j belongs to those suffixes,
SA[i + 1] = j must hold, thus hypothesis (i) is shown.
Since all suffixes are placed correctly in the suffix array, and entries are not changed
afterwards (line 21), Algorithm 2 computes the entire and correct suffix array SA of S.
Admittedly, the correctness proof of Algorithm 2 is quite hard. On the other hand,
the algorithm itself is relatively easy to understand, in my opinion at least. It seems a
bit as if the complexity of efficient suffix array construction must be split up between
an algorithm and its proof, forming a balance of complexity among all suffix array
construction algorithms of the same asymptotic runtime, but that is just a marginal
note I was thinking of while proving correctness.
However, we have seen the algorithm and its correctness, but one part is still missing:
the asymptotic runtime. It will be discussed within the next section.
4.2 Runtime
After proving the correctness of the algorithm, the next issue is to prove its
asymptotically linear runtime, together with linear space consumption. Since the
algorithm has been described only roughly so far, we need to get a bit more technical,
while keeping everything as simple as possible. A more appropriate implementation for
real-world usage can be found in the next chapter.
First, recall Phase 1 of Algorithm 2. Its main tasks were to build the initial groups,
iterate over them in descending group order, compute previous smaller suffixes and
rearrange them. In order to explain all necessary steps, Algorithm Excerpt 2 shows the
instructions performed during Phase 1.
The first thing to be done is to describe the working set of needed data structures.
Five arrays of size n will be used:
• SA contains suffix starting positions, ordered according to the current group order.
• ISA is the inverse permutation of SA, to be able to detect the position of a certain
suffix in SA.
• GSIZE contains the sizes of all groups. Group sizes are ordered according to the
group order, so GSIZE has the same order as SA. GSIZE contains the size of each group
only once, at the beginning of the group, followed by zeros up to the beginning of
the next group.
• GLINK stores pointers from suffixes to their groups. All entries point at the be-
ginning of a group, at the same position where GSIZE contains the size of the
group.
• PREV is used to store prev pointers computed during Phase 1, see line 5 of Al-
gorithm Excerpt 2. All entries initially are set to nil, to detect if a prev pointer
already exists.
An example of the data structure setup can be found in Figure 4.1. The initial setup
of those arrays can be performed in linear time using bucket sort together with a
character count table. Thus, lines 1 and 2 require O(n) time.
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
S[i] g r a i n d r a i n i n g $
GSIZE[i] 1 2 0 1 2 0 3 0 0 3 0 0 2 0
SA[i] 14 3 8 6 1 13 4 9 11 5 10 12 2 7
GLINK[i] 5 13 2 7 10 4 13 2 7 10 7 10 5 1
ISA[i] 5 13 2 7 10 4 14 3 8 11 9 12 6 1
Figure 4.1: Initial data structure setup after line 2 of Phase 1, applied to the string
S = graindraining$. Prev pointers are not listed since all entries initially
are set to nil, some (but not all) GLINK pointers are displayed for better
illustration.
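As an illustration, the following Python sketch performs this initial setup with a counting-based bucket sort. Arrays are kept 1-indexed (position 0 unused) to match the notation of the thesis, and the sketch relies on '$' being smaller than all letters in ASCII. Applied to S = graindraining$, it reproduces the arrays of Figure 4.1:

```python
from collections import Counter

def initial_groups(S):
    """Bucket sort the suffixes by first character and build SA, ISA,
    GSIZE and GLINK as described above (1-indexed, index 0 unused)."""
    n = len(S)
    cnt = Counter(S)
    start, pos = {}, 1
    for c in sorted(cnt):            # bucket (= group) start per character
        start[c] = pos
        pos += cnt[c]
    SA, ISA = [0] * (n + 1), [0] * (n + 1)
    GSIZE, GLINK = [0] * (n + 1), [0] * (n + 1)
    nxt = dict(start)
    for i in range(1, n + 1):        # stable bucket sort pass
        c = S[i - 1]
        p = nxt[c]; nxt[c] = p + 1
        SA[p], ISA[i], GLINK[i] = i, p, start[c]
    for c in cnt:
        GSIZE[start[c]] = cnt[c]     # group size once, at the group start
    return SA, ISA, GSIZE, GLINK

SA, ISA, GSIZE, GLINK = initial_groups("graindraining$")
print(SA[1:])    # [14, 3, 8, 6, 1, 13, 4, 9, 11, 5, 10, 12, 2, 7]
```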
The first problem to be solved is processing the groups in descending group order,
line 3. Assume we have two variables gs and ge, pointing to the start and end of a
group. To get to the next lower group, we can set ge ← gs − 1 and gs ← GLINK[SA[gs −
1]]. Iterating groups in this way, we trivially need O(n) steps to process all groups
in descending group order. Also, the suffixes of the processed group can be accessed by
iterating over SA[gs..ge]. Initially, gs can be set to n + 1 to start with the highest group.
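As a small illustration, this group iteration can be sketched as a generator; applied to the arrays of Figure 4.1 (n = 14), it visits the groups of r, n, i, g, d, a and $ in exactly this order:

```python
def groups_descending(SA, GLINK, n):
    """Yield (gs, ge) boundaries of each group, highest group first."""
    gs = n + 1
    while gs > 1:
        ge = gs - 1
        gs = GLINK[SA[ge]]       # start of the next lower group
        yield gs, ge             # suffixes of the group: SA[gs..ge]

# arrays from Figure 4.1 (1-indexed, position 0 unused):
SA    = [0, 14, 3, 8, 6, 1, 13, 4, 9, 11, 5, 10, 12, 2, 7]
GLINK = [0, 5, 13, 2, 7, 10, 4, 13, 2, 7, 10, 7, 10, 5, 1]
print(list(groups_descending(SA, GLINK, 14)))
# [(13, 14), (10, 12), (7, 9), (5, 6), (4, 4), (2, 3), (1, 1)]
```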
The next thing to be handled is the computation of prev pointers, line 5 of Algorithm
Excerpt 2. As already mentioned, we use a technique called pointer jumping for this
purpose: From an index s, we start with its previous index s − 1. If s − 1 belongs to
a lower group, we are done already. If s − 1 belongs to a higher group, its prev pointer
must have been computed already. Since s − 1 is placed in a higher group, all suffixes
between s − 1 and PREV[s − 1] belong to higher groups than that of s, so we can use
PREV[s − 1] to jump over all of those suffixes. After jumping, the procedure is
repeated until an index with a lower or equal group is reached, see Algorithm 3.
Let us ignore for now the special case in which Algorithm 3 returns a suffix placed in
the same group as s; we will handle it later on. If every call of the function prev-equal
stopped within the first iteration of its inner loop, the overall prev pointer computation
would require O(n) work, since it is executed exactly once per suffix. So clearly, the
question is how many additional iterations are performed by pointer jumping.
Algorithm 3 Computation of a prev pointer for the index s, where ge indicates the end
of the group of s. Note that the returned index can belong to the same group as s.
1: function prev-equal(s, ge)
2:     p ← s − 1
3:     while p > 0 do
4:         if ISA[p] ≤ ge then    ▷ group(p) ≤ group(s)
5:             return p
6:         end if
7:         p ← PREV[p]    ▷ PREV[p] must exist already
8:     end while
9:     return 0
10: end function
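A direct Python rendering of Algorithm 3 might look as follows; the ISA and PREV values in the example are hypothetical, chosen only so that one pointer jump occurs:

```python
def prev_equal(s, ge, ISA, PREV):
    """Pointer jumping: find the nearest index p < s whose group is
    lower than or equal to the group of s (ISA[p] <= ge), or 0."""
    p = s - 1
    while p > 0:
        if ISA[p] <= ge:         # group(p) <= group(s)
            return p
        p = PREV[p]              # computed earlier, for a higher group
    return 0

# hypothetical 1-indexed data: indices 2 and 4 lie in higher groups,
# their prev pointers let us jump over them down to index 1.
ISA  = [0, 1, 4, 3, 5, 2]
PREV = [0, 0, 1, 0, 2, 0]
print(prev_equal(5, 2, ISA, PREV))   # 1
```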
To answer the question, we will first show that prev pointers cannot cross: Let s, s̃ be
two integers in the range [1..n] with PREV[s] < PREV[s̃] < s < s̃. Applying the definition
of prev pointers (line 5 of Algorithm Excerpt 2) to s̃, group(PREV[s̃]) < group(s̃) ≤
group(s) must hold. Also, by applying the definition to s, group(PREV[s̃]) ≥ group(s)
must hold, a contradiction; hence prev pointers cannot cross.
After the computation of a prev pointer PREV[s] in Algorithm 3, all pointers used by
the pointer jumping technique are overlaid by the new pointer, see Figure 4.2 for
an example. Assume that a pointer used for pointer jumping were used again in
a later step. Clearly, all prev pointers for indices between s and PREV[s] are already
computed, because of the descending group order processing. Since a prev pointer is
computed only once per suffix, and Phase 1 processes groups in descending group order,
an index s̃ with PREV[s] < PREV[s̃] < s < s̃ would have to exist; otherwise, the indices
between PREV[s] and s cannot be accessed. But prev pointers cannot cross, so no pointer
used by the pointer jumping technique is used more than once. Since at most n prev
pointers are computed, at most n pointers can be used by pointer jumping, so the overall
prev pointer computation requires O(n) work.
Now, let us come back to the special case in which Algorithm 3 returns an index that
belongs to the same group as its input index, i.e. group(s) = group(prev-equal(s, ge))
for some s ∈ [1..n]. To solve this case, we need some additional instructions:
Let s be an index that belongs to the currently processed group. If the prev pointer
of s is computed already, we proceed with the next index of the currently processed
group. Otherwise, we compute p ← prev-equal(s, ge). If p belongs to a lower ordered
group, we set PREV[s] ← p; if p belongs to the same group, s is added to a list L, and
the procedure is repeated for s ← p. At some point, the prev pointer for s is correctly
computed. All remaining indices l ∈ L, however, still require a prev pointer. Since
the prev pointer of s is already correct, by construction of the function prev-equal,
group(j) ≥ group(l) holds for all s ≤ j < l and all l ∈ L. The indices of L belong to the
same group as s, so we set PREV[l] ← PREV[s] for all l ∈ L, resulting in correct prev
pointers for all indices of L. Algorithm 4 shows how the prev pointers of the currently
processed group can be computed.
This extra computation requires at most O(|G|) extra time (where G is the processed
group), and at most O(n) space for the list. Therefore, the prev pointer computation
is still possible in O(n) time, since each group is processed exactly once.
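Algorithm 4 itself is not reproduced in this excerpt, so the following Python sketch merely implements the procedure just described (with prev-equal as in Algorithm 3); the arrays in the example are hypothetical:

```python
def prev_equal(s, ge, ISA, PREV):
    """Algorithm 3: nearest p < s with group(p) <= group(s), or 0."""
    p = s - 1
    while p > 0:
        if ISA[p] <= ge:
            return p
        p = PREV[p]
    return 0

def group_prev_pointers(group, gs, ge, ISA, PREV):
    """Compute PREV for all indices of the current group SA[gs..ge].
    Same-group results are parked on a list L until an index of a
    strictly lower group is found; then all of L inherit that pointer."""
    for s in group:
        if PREV[s] != 0:
            continue                      # computed earlier in this group
        L = []
        p = prev_equal(s, ge, ISA, PREV)
        while p != 0 and ISA[p] >= gs and PREV[p] == 0:
            L.append(s)                   # p in the same group, still unknown
            s, p = p, prev_equal(p, ge, ISA, PREV)
        if p != 0 and ISA[p] >= gs:       # same group, but PREV[p] is known
            PREV[s] = PREV[p]
        else:
            PREV[s] = p                   # lower group (or 0)
        for l in L:
            PREV[l] = PREV[s]             # inherit the pointer

# hypothetical 1-indexed arrays: group {3, 5} occupies SA positions 3..4,
# index 4 lies in a higher group with PREV[4] = 3 already computed.
ISA  = [0, 1, 2, 4, 5, 3, 6]
PREV = [0, 0, 0, 0, 3, 0, 5]
group_prev_pointers([5, 3], 3, 4, ISA, PREV)
print(PREV[3], PREV[5])   # 2 2
```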
After handling the prev pointer computation, the next task is to compute the set P of
previous suffixes, and split it into subsets P_1, …, P_k such that each subset P_l consists
of suffixes S_i with exactly l indices of G pointing onto i, i.e. i ∈ P_l ⇔ |{ s ∈ G |
prev(s) = i }| = l, see lines 7 to 8 of Algorithm Excerpt 2. An extra array PC of size n
is used for this
purpose. Initially, all entries of PC are set to zero. Then, after computing the prev
pointers in Algorithm 4, PC[PREV[s]] is incremented for all s ∈ G. During this loop,
the set P can be computed by checking whether PC[PREV[s]] = 0 holds: if true, PREV[s]
is visited for the first time and can be appended to a list; otherwise, PREV[s] is contained
in the list already. Also, after the loop, for each p ∈ P, PC[p] contains the count of prev
pointers of G pointing onto p, so for the second split, p belongs to the set P_PC[p].
The second split of P into the subsets P_1, …, P_k can be performed as follows:
While P is not empty, for each p ∈ P, decrement PC[p]. If PC[p] = 0 within the l-th
outer iteration, remove p from P, and add it to the set P_l. This way, the sets P_1, …, P_k
are computed in increasing order, and all entries of PC are set back to zero, so PC can be
reused for the next split operation. The subsets P_l are stored in the SA-interval of
the processed group: define pls(l) := ge + 1 − Σ_{i=0}^{l} |P_i|. Then the set P_1 is stored to
SA[pls(1).. pls(0) − 1], the set P_2 is stored to SA[pls(2).. pls(1) − 1], and so on. This way,
empty subsets are ignored, and the sets are stored in decreasing order. Additionally, to
know the size of each set, we set GSIZE[pls(l)] ← |P_l| for each non-empty set. Because
the indices of G are not used for further instructions, and |P| ≤ |G| holds, no side effects
occur during the storage. Algorithm 5 shows the code for splitting previous suffixes.
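Both splits can be sketched as follows. For clarity, the sketch returns the subsets as Python lists instead of storing them inside the SA-interval of the group, so it simplifies Algorithm 5; the group and PREV values in the example are hypothetical:

```python
def split_prev_suffixes(group, PREV, PC):
    """First split: collect P = { PREV[s] | s in group } and count, per
    p in P, the prev pointers of the group pointing onto p. Second split:
    peel off P_1, P_2, ... by repeated decrements; PC ends up all zero."""
    P = []
    for s in group:
        p = PREV[s]
        if PC[p] == 0:
            P.append(p)              # p seen for the first time
        PC[p] += 1
    subsets = []                     # subsets[l-1] corresponds to P_l
    while P:
        rest, Pl = [], []
        for p in P:
            PC[p] -= 1
            if PC[p] == 0:
                Pl.append(p)         # exactly l pointers pointed onto p
            else:
                rest.append(p)
        subsets.append(Pl)
        P = rest
    return subsets

# hypothetical group whose prev pointers hit index 2 twice and index 5 once:
PREV = [0, 0, 0, 0, 0, 0, 0, 2, 2, 5]
PC = [0] * 10
print(split_prev_suffixes([7, 8, 9], PREV, PC))   # [[5], [2]]
```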
The first loop of Algorithm 5 requires O(|G|) time, since all indices of G are iterated.
Also, the second loop requires O(|G|) time: its inner loop overall requires as many
iterations as there are prev pointers from G, so consequently, at most O(|G|) iterations
are performed.
So let us move on to the final step in Phase 1, the rearrangements from lines 9 to 14 in
Algorithm Excerpt 2. By the previous splits, we are able to iterate the subsets P_l in
descending order. It actually turns out that the additional split of a subset P_l from line 10
is not required to rearrange suffixes: First, for all p ∈ P_l, decrement GSIZE[GLINK[p]], and
exchange SA[ISA[p]] with the rightmost suffix of its group, SA[GLINK[p]+GSIZE[GLINK[p]]].
Also, update the new ISA values, so that ISA stays the correct inverse permutation of SA.
This way, the suffixes in P_l are removed from their groups, because they are placed at the
back of their groups, and the sizes of the groups no longer cover them. The remaining
task is to set up GLINK and GSIZE, so that the new groups are correctly captured. First, for
all p ∈ P_l, set GLINK[p] ← GLINK[p] + GSIZE[GLINK[p]]. Because of the decrements of
GSIZE in the previous loop, GLINK now correctly points to the beginnings of the
new groups. The last step is to increment GSIZE[GLINK[p]] for all p ∈ P_l, so that the sizes
of the new groups are set up correctly. Algorithm 6 shows the full code for the
rearrangements; an example can be found in Figure 4.3.
i 1 2 3 4 5 6 7 8
−1 −1
GSIZE[i] 1 2 0 1 2 0 3 0 ···
SA[i] 14 3 8 6 1 13 4 9 ···
GLINK[i] 5 13 2 7 10 4 13 2 ···
(1) for each p ∈ P_l, move p to the back of its group and
decrement the group size.
i 1 2 3 4 5 6 7 8
GSIZE[i] 1 2 0 0 1 0 3 0 ···
+ +
GLINK[i] 5 13 2 7 10 4 13 2 ···
(2) for each p ∈ P_l, add the old group's GSIZE to GLINK[p].
i 1 2 3 4 5 6 7 8
+1 +1
GSIZE[i] 1 2 0 0 1 0 3 0 ···
GLINK[i] 6 13 2 7 10 4 13 2 ···
(3) for each p ∈ P_l, increment the group size of the new group.
i 1 2 3 4 5 6 7 8
GSIZE[i] 1 2 0 1 1 1 3 0 ···
SA[i] 14 3 8 6 13 1 4 9 ···
GLINK[i] 6 13 2 7 10 4 13 2 ···
(4) Result after the rearrangements. All suffixes of the set P_l
were rearranged into new groups, placed as immediate
successors of their old groups.
Figure 4.3: Suffix rearrangements for the set P_l = {1, 6}. Items of P_l are marked bold.
Since each subset P_l is iterated three times, Algorithm 6 requires O(|P|) = O(|G|)
asymptotic time.
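The three passes can be written down directly. Running the following sketch on the arrays of Figure 4.1 with P_l = {1, 6} reproduces the state of panel (4) in Figure 4.3:

```python
def rearrange(Pl, SA, ISA, GSIZE, GLINK):
    """Move the suffixes of Pl into new groups placed as immediate
    successors of their old groups (three O(|Pl|) passes)."""
    for p in Pl:                         # (1) move p to the back, shrink group
        GSIZE[GLINK[p]] -= 1
        back = GLINK[p] + GSIZE[GLINK[p]]
        q = SA[back]                     # suffix currently at the back
        SA[ISA[p]], SA[back] = q, p
        ISA[q], ISA[p] = ISA[p], back
    for p in Pl:                         # (2) redirect GLINK to the new group
        GLINK[p] += GSIZE[GLINK[p]]
    for p in Pl:                         # (3) grow the new group
        GSIZE[GLINK[p]] += 1

# arrays of Figure 4.1 (1-indexed, position 0 unused):
SA    = [0, 14, 3, 8, 6, 1, 13, 4, 9, 11, 5, 10, 12, 2, 7]
ISA   = [0, 5, 13, 2, 7, 10, 4, 14, 3, 8, 11, 9, 12, 6, 1]
GSIZE = [0, 1, 2, 0, 1, 2, 0, 3, 0, 0, 3, 0, 0, 2, 0]
GLINK = [0, 5, 13, 2, 7, 10, 4, 13, 2, 7, 10, 7, 10, 5, 1]
rearrange([1, 6], SA, ISA, GSIZE, GLINK)
print(SA[1:7], GSIZE[1:9])
# [14, 3, 8, 6, 13, 1] [1, 2, 0, 1, 1, 1, 3, 0]
```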
Now we are almost done with Phase 1. The last bit I want to mention here is a
preparation step for Phase 2. Recall that gs and ge were the indices of the current
group's start and end in SA. After the rearrangements have taken place, SA[ge] is set to
gs. This last index of the group will later act as a group counter. Also, to ensure that
each suffix has access to the group counter, we set ISA[s] ← ge for all indices s of the
processed group. This step has to be performed before the computation of the set P in
Algorithm 5, since Algorithm 5 overwrites the set indices with prev pointers. Note that
these preparations do not cause side effects: Each group is processed only once, so SA can
be overwritten without problems. Also, after the preparation, gs ≤ ISA[s] ≤ ge holds
for all indices s of the processed group, so we can still determine whether a suffix belongs
to an already processed group.
Before discussing the implementation of Phase 2, let us first recall its steps, see
Algorithm Excerpt 3. The only instructions not directly implementable are those from
lines 20 to 25.
Let us start with the first point, checking whether a suffix is already contained in SA,
line 21. Since SA was modified during Phase 1, we cannot check whether an SA entry is
nil. To solve this problem, we use ISA: whenever an index s is placed into SA, we set
ISA[s] ← 0. Then, to check whether a suffix is already placed in the suffix array, we only
need to compare its ISA-value with zero.
Next, we will have a look at the suffix rank computation and suffix placement in lines 20
and 25 of Algorithm Excerpt 3. Recall that each group was prepared in Phase 1, so for
each suffix S_j, ISA[j] points to the end of its group in SA. Also, for every group, the end
of the group contains a counter initially set to the start of the group in SA. To compute
sr in line 20, we follow this counter, i.e. sr ← SA[ISA[j]]. Then, to move a suffix to the
front of its group, we set SA[sr] ← j. To 'remove' the suffix S_j from its old group, it is
sufficient to increment the group counter SA[ISA[j]]. Note that the group counter must
be incremented before the placement of j in SA, since both positions can be equal. A
full example for the placement of a suffix can be found in Figure 4.4; Algorithm 7 shows
the full implementation of Phase 2.
Using this implementation of Phase 2, every described operation is supported in
constant time. The overall time for Phase 2 is O(n) plus the number of iterations of the
inner loop. By the correctness of the algorithm (Theorem 4.1.3 from page 28) we know
that within the i-th iteration of the outer loop, the inner loop processes all suffixes j
with ĵ = SA[i]. Since each suffix has exactly one next lexicographically smaller suffix,
the inner loop iterates over n − 1 suffixes, thus Phase 2 requires O(n) time.
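A single placement can be sketched directly; applying it to the state of Figure 4.4 for j = 2 (empty SA positions encoded as 0) reproduces panel (2):

```python
def place(j, SA, ISA):
    """Phase 2 placement of suffix j: read the group counter stored at
    the end of j's group, increment it first (both positions may
    coincide), put j at the front of its group and mark j as placed."""
    sr = SA[ISA[j]]          # counter = next free position of the group
    SA[ISA[j]] += 1          # 'remove' j from its old group
    SA[sr] = j
    ISA[j] = 0               # j is now contained in the suffix array

# state of Figure 4.4 (1-indexed, 0 marks empty SA positions):
SA  = [0, 14, 3, 8, 4, 13, 6, 0, 0, 7, 0, 0, 10, 0, 13]
ISA = [0, 6, 14, 0, 9, 12, 4, 14, 0, 9, 12, 9, 12, 0, 0]
place(2, SA, ISA)
print(SA[13], SA[14], ISA[2])   # 2 14 0
```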
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
+1
SA[i] 14 3 8 4 13 6 − − 7 − − 10 − 13
ISA[i] 6 14 0 9 12 4 14 0 9 12 9 12 0 0
(1) let sr ← SA[ISA[j]], then increment SA[ISA[j]].
i 1 2 3 4 5 6 7 8 9 10 11 12 13 14
SA[i] 14 3 8 4 13 6 − − 7 − − 10 2 14
ISA[i] 6 0 0 9 12 4 14 0 9 12 9 12 0 0
(2) set SA[sr] to j, and ISA[j] to zero.
Figure 4.4: Suffix placement for j = 2. Suffix items are marked bold, empty positions
in SA are placed over arrows indicating the start of their groups.
Now, finally, we are able to show that the algorithm works in asymptotically optimal
runtime and linear space.
Proof. The initial group setup can be performed in O(n) time. Also, the computation of
prev pointers requires O(n) time. All remaining operations in the outer loop of Phase 1
require O(|G|) time, where G is the currently processed group. Since each group is
processed only once, the overall time complexity of Phase 1 is O(n). Phase 2 can also be
performed in O(n) time; thus the overall time complexity of Algorithm 2 is O(n). Each
additionally used array or list has length at most n, so the overall space complexity is
O(n) words.
We have seen that the algorithm can be implemented with optimal asymptotic runtime
and linear space. This is quite a nice result, but for real-world applications, the algorithm
needs to be tuned to consume as little space as possible and run as fast as possible. The
next chapter will present such optimisations, resulting in a competitive linear-time
suffix array construction algorithm.
Chapter 5
Implementation
After having presented the algorithm, this chapter describes an imple-
mentation suitable for real-world usage. In Section 4.2, a possible implementation
was already described, but it has several downsides: a lot of additional data structures
are required, and some algorithmic solutions seem a bit awkward. To fix these issues,
the implementation of Section 4.2 will be modified.
First, the required data structure framework is described. It consists of four arrays
of size n, exactly as described in Section 4.2, page 32:
• SA contains suffix starting positions, ordered according to the current group order.
• ISA is the inverse permutation of SA, to be able to detect the position of a certain
suffix in SA.
• GSIZE contains the sizes of all groups. Group sizes are ordered according to the
group order, so GSIZE has the same order as SA. GSIZE contains the size of each group
only once, at the beginning of the group, followed by zeros until the beginning of
the next group.
• GLINK stores pointers from suffixes to their groups. All entries point at the be-
ginning of a group, at the same position where GSIZE contains the size of the
group.
In contrast to the implementation of Section 4.2, this will be all data structures we
need.
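Under these assumptions, the framework amounts to four plain arrays of n words. A hypothetical allocation sketch in C (the struct and function names are illustrative, not taken from the reference implementation):

```c
#include <stdint.h>
#include <stdlib.h>

/* The four n-word arrays described above, bundled for convenience. */
typedef struct {
    uint32_t *SA;    /* suffix starting positions in current group order */
    uint32_t *ISA;   /* inverse of SA: position of each suffix in SA     */
    uint32_t *GSIZE; /* group size at each group start, zeros elsewhere  */
    uint32_t *GLINK; /* pointer from each suffix to its group start      */
} gsaca_ctx;

/* Returns 0 on success, -1 on allocation failure. */
static int gsaca_ctx_init(gsaca_ctx *ctx, size_t n)
{
    ctx->SA    = malloc(n * sizeof *ctx->SA);
    ctx->ISA   = malloc(n * sizeof *ctx->ISA);
    ctx->GSIZE = calloc(n, sizeof *ctx->GSIZE); /* zeroed: no group starts yet */
    ctx->GLINK = malloc(n * sizeof *ctx->GLINK);
    return (ctx->SA && ctx->ISA && ctx->GSIZE && ctx->GLINK) ? 0 : -1;
}
```

With 4-byte integers, these four arrays account for the 16n bytes of working memory that, together with the n bytes of the text, yield the 17n bytes quoted later in this chapter.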
Before we proceed, let us briefly recapitulate the steps performed within the first
phase of the algorithm. Pseudocode can be found in Algorithm Excerpt 2 on page 31
and will not be repeated here. Initially, suffixes are placed in groups according to their
first character, and the groups themselves are sorted according to the rank of their
representative character. This initial step can be done using bucket sort. Then, a loop
processes groups in descending order. For each processed group G, prev pointers are
computed, the set P of previous suffixes is computed, and the suffixes of P are split into
subsets P1 , . . . , Pk according to the number of prev pointers of G pointing to them.1
The last step is to process the subsets P1 , . . . , Pk in descending order, rearranging the
suffixes of each subset Pl by placing all indices p ∈ Pl that belong to the same group
into a new group, as immediate successor of the old group.
Processing groups in descending order works in the same way as in Section 4.2: Let
gs be the start of a group. To get to the next group, we set ge to gs − 1 and then set
1A subset Pl consists of indices p ∈ P, such that exactly l prev pointers from G are pointing to p.
gs ← GLINK[ge]. Now, the suffixes of the new group are contained in SA[gs..ge], and
can be processed.
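This traversal can be sketched as a short C loop. The helper below is hypothetical; it follows the text's update rule gs ← GLINK[ge] (i.e. GLINK is read at a position, yielding the start of the group ending there) and assumes the first group starts at position 0:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sketch: visit all groups from the last one down to the
 * first, returning how many groups were seen. Reading GLINK at ge = gs - 1
 * yields the start of the preceding group, as described in the text. */
static size_t count_groups(const uint32_t *GLINK, size_t n)
{
    size_t groups = 0;
    size_t gs = GLINK[n - 1];   /* start of the last group */
    for (;;) {
        groups++;               /* the group occupies SA[gs..ge] */
        if (gs == 0)
            break;              /* first group reached */
        size_t ge = gs - 1;     /* end of the preceding group */
        gs = GLINK[ge];         /* jump to its start */
    }
    return groups;
}
```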
Next, let us move on to prev pointer computation. The first thing to mention is
the storage of prev pointers: no separate array is declared for this purpose. Instead,
prev pointers are stored in the GLINK array: since GLINK is
used only for suffixes of groups that have not been processed yet, and rearrangements
of suffixes take place only in unprocessed groups, no side effects occur. The only thing
we need to be aware of is to use ISA to check whether a group has already been processed:
if ISA[s] > ge, suffix s belongs to a group that has already been processed.
Before proceeding, another optimisation shall be discussed. Consider the case where
two prev pointers of suffixes from the same group point to the same position, i.e. two indices
i, j ∈ G with i < j exist such that prev(i) = prev(j), see Figure 5.1. In this case, the
left prev pointer prev(i) can be removed: since prev pointers do not cross, prev(i) is
overlaid by the right prev pointer prev(j), so it will not be used in the pointer-jumping
technique for prev pointer computation. Additionally, in the second phase, the
right prev pointer prev(j) is visited first2 , so the suffix Sprev(j) will already be
contained in the suffix array when prev(i) would be used. Since the algorithm stops in such
cases, it does not matter whether the left prev pointer exists or not.
Figure 5.1: (1) Two prev pointers from the same group point onto the same suffix.
(2) The left prev pointer is overlaid by the right one, and can be removed.
Using this optimisation, prev pointer computation and the split of suffixes can be
handled more easily: first, for each suffix in the current group, compute prev-equal (see
page 33). If the previous suffix belongs to the same group, mark it. Then, filter out
all marked suffixes, and repeat the following procedure for all remaining suffix starting
positions s: let p be the prev pointer of s, p = PREV[s]. If p belongs to the same group
as s, update the prev pointer of s with the prev pointer of its previous suffix Sp
(PREV[s] ← PREV[p]), and set PREV[p] to zero. Otherwise, add p to a list, and remove s
from the remaining suffix starting positions. After iterating over all remaining suffixes,
append the computed list to a list of lists. By repeating this procedure as long as suffixes
remain, all prev pointers are computed correctly. Moreover, the generated lists correspond
to the subsets P1 , . . . , Pk . In fact, generating lists is not necessary, since the suffixes
can be placed into the index storage of the currently processed group, namely SA[gs..ge]:
each subset Pl is stored in an interval SA[ps..pe] with gs ≤ ps ≤ pe ≤ ge, in the same
manner as described in the previous chapter (Algorithm 5 on page 35). Algorithm 8
shows this prev pointer computation combined with the split of previous suffixes.
2 The contexts of both suffixes Si and Sj are the same. Since both prev pointers point to the same position,
the context of the left suffix Si ends at the position of the right suffix Sj , i.e. ic = j. Since group G
is being processed at the moment, ic will not be changed during the algorithm, so bi = ic holds by Theorem
4.1.2. This clearly shows that bi = j, so the second phase will process Sj before Si .
Algorithm 8 Prev pointer computation and split in the first phase of the algorithm,
variables gs and ge contain start and end bounds of the current group. Every occurrence
of PREV must be replaced with GLINK, but replacement is omitted for better readability.
1: GSIZE[gs] ← 0 . make place for markings
2: for i ← ge down to gs do . compute prev pointers
3: p ← prev-equal(SA[i], ge) . Algorithm 3 on page 33
4: if p > 0 and ISA[p] ≥ gs then . p belongs to current group
5: GSIZE[ISA[p]] ← 1 . mark p
6: end if
7: PREV[SA[i]] ← p . store pointer
8: end for
9: pe ← gs
10: for i ← gs up to ge do . move unmarked suffixes to front
11: if GSIZE[i] ≠ 1 then
12: SA[pe] ← SA[i]
13: pe ← pe + 1
14: end if
15: ISA[SA[i]] ← ge . preparation for the second phase
16: end for
17: ps ← gs
18: pe ← pe + 1
19: l ← 0
20: repeat . compute final pointers and split suffixes
21: i ← pe − 1
22: tmp ← pe
23: while i ≥ ps do
24: p ← PREV[SA[i]]
25: if p > 0 then
26: if ISA[p] < gs then . p is in another group
27: pe ← pe − 1
28: SA[i] ← SA[pe]
29: SA[pe] ← p . push pointer to back
30: else . p is in current group
31: PREV[SA[i]] ← PREV[p] . copy pointer
32: PREV[p] ← 0 . clear pointer of p, won’t be used any more
33: end if
34: i←i−1
35: else . p points to nothing, remove it
36: SA[i] ← SA[ps]
37: ps ← ps + 1
38: end if
39: end while
40: if pe < tmp then . at least one prev pointer was pushed to back
41: GSIZE[pe] ← tmp − pe . store number of pointers
42: l ← l + 1 . and update number of subsets
43: end if
44: until ps = pe
The rearrangements can be performed in exactly the same way as described in Section
4.2. Algorithm 9 shows a modified version which works with the modified split lists of
Algorithm 8.
The last step of the first phase is to prepare the current group for processing in the
second phase, by setting ISA[s] ← ge for all suffixes s of the group, and SA[ge] ← gs.
Both steps are performed within Algorithms 8 and 9. The second phase can be
implemented in exactly the same way as described in Section 4.2, Algorithm 6. The
only thing to consider is that prev pointers are contained in the GLINK array, not
in a separate array.
Using these tricks, the algorithm can be implemented using 17n bytes of space: n
bytes for the text itself, 4n bytes for SA, and an additional 12n bytes for ISA, GLINK and
GSIZE, assuming that an integer requires 4 bytes. An implementation in C requires
about 200 lines of code for the full algorithm, and can be found on GitHub [2].
An additional space optimisation can be applied to the GSIZE array: GSIZE stores
group sizes (and markings) whose distance is greater than or equal to their value. More
precisely, let GSIZE[i] = k and GSIZE[j] = l with i < j be any two numbers in GSIZE.
Then the algorithm always ensures that i + k ≤ j holds. This permits the use of variable-
length number storage in GSIZE, e.g. by using Elias Gamma Coding [5], and reduces
the space requirements of GSIZE from 4n bytes to 2n bits.
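To see why Elias Gamma Coding fits within this bound: a value k ≥ 1 is encoded with ⌊log₂ k⌋ leading zero bits followed by the ⌊log₂ k⌋ + 1 bits of k itself, i.e. 2⌊log₂ k⌋ + 1 ≤ 2k − 1 bits. Since the group sizes stored in GSIZE sum to at most n, their codes occupy fewer than 2n bits in total. A small hypothetical helper (not part of the reference implementation) computing the code length:

```c
#include <stdint.h>

/* Bit length of the Elias gamma code of k >= 1:
 * floor(log2 k) leading zeros plus floor(log2 k) + 1 value bits. */
static unsigned gamma_bits(uint32_t k)
{
    unsigned log2k = 0;
    while (k >> (log2k + 1))    /* find floor(log2 k) */
        log2k++;
    return 2 * log2k + 1;
}
```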
Summing everything up, the algorithm can be implemented using 13.25n bytes of
space. The implementation available on GitHub [2] does not use variable-length
number storage (mainly because the additional operations for fetching variable-length
numbers would slow down the algorithm), and thus requires 17n bytes of space.
Besides space consumption, performance is a major point of interest for any algorithm.
We already know that the proposed algorithm runs in asymptotically linear time, but
theoretical runtime and practical performance can vary considerably. To close this gap,
the next chapter measures practical performance, so we will see
whether the algorithm is suitable for real-world usage.
Chapter 6
Performance Analyses
After the discussion of the algorithm's functional principle, it is time to check whether it
is suitable for real-world usage by comparing its performance with that of other suffix
array construction algorithms.
A list of competing SACAs can be found in Table 6.1. It includes the most common
linear-time SACAs, namely SA-IS by Nong et al., the algorithm by Ko and Aluru,
and the DC3 (a.k.a. Skew) algorithm by Kärkkäinen and Sanders. Also, a very fast
SACA named divsufsort is included, to compare results with the current state of the
art.
Since the SACA described in this thesis makes heavy use of groups and works
in a greedy manner, I called it GSACA: quite an imposing name for a suffix array
construction algorithm, but the results will tell whether it lives up to
expectations.
Table 6.1: Used algorithms in benchmarks. Extra Working Space contains memory
requirements for function calls (O(1) for non–recursive, O(log n) for recursive
algorithms), as well as space required in addition to the 5n for the text and
the suffix array. All space requirements are measured in bytes, where an
integer is assumed to require 4 bytes of space.
Now, a word about the test data. It is a selection from three text corpora, namely the
Silesia Corpus [4] with small files (< 40 MB), the Pizza & Chili Corpus [7] with
medium-sized files, and the Repetitive Corpus [8] with highly repetitive files. The
selection contains texts of quite different types: natural-language texts, source code,
database data, HTML/XML documents, as well as DNA and protein sequences. This variety
gives a good overview of SACA performance in different contexts, and should lead to
representative results for suffix array construction.
1 8.25n bytes of extra working space would be possible using the variable-length storage described in
Chapter 5, but the additional operations would slow down the algorithm.
All experiments were conducted on a 64-bit Ubuntu 14.04.3 LTS (kernel 3.13) system
equipped with two ten-core Intel Xeon E5-2680v2 processors with 2.8 GHz and 128 GB
of RAM. The benchmark itself was compiled using g++ (version 4.8.4) with the -O3
option. Construction speeds2 were measured using C++ built-in timers, while cache
miss rates3 were measured using perf_events4 . The results (averages over 10 runs) can
be found in Tables 6.2, 6.3 and 6.4.
Excerpt of the results for the largest test files (cache miss rates3 and construction
speeds2; the five values per file correspond to the five algorithms of Table 6.1, in the
column order of the original tables):

dblp.xml.200MB (n = 200.0 MB, σ = 231): cache misses3 55.4 % / 69.7 % / 52.6 % / 83.2 % / 74.0 %; speed2 10.9 / 10.3 / 4.1 / 1.3 / 3.3 MB/s
dna.200MB (n = 200.0 MB, σ = 97): cache misses3 47.4 % / 76.0 % / 59.2 % / 90.0 % / 79.5 %; speed2 8.2 / 6.8 / 3.5 / 1.1 / 2.9 MB/s
Escherichia Coli (n = 195.8 MB, σ = 237): cache misses3 55.1 % / 73.0 % / 51.7 % / 82.3 % / 72.6 %; speed2 9.4 / 11.1 / 4.1 / 1.4 / 3.0 MB/s
The results clearly show that divsufsort and SAIS play in their own league: their
construction speed is about 2 to 3 times faster than that of KA, DC3 and GSACA.
Additionally, as Table 6.1 shows, the working space required by both of these algorithms
is significantly smaller than that of KA, DC3 and GSACA.
Among the latter three algorithms, KA performs best, followed by GSACA. The DC3
algorithm performs worst of all, but this is due to a very simple implementation
requiring only about 50 lines of code. Better implementations may exist,
but I was not able to find one running in linear time. Therefore, the bad results of DC3
should not be overrated.
Let us have a closer look at the algorithm presented in this thesis, GSACA. To be
honest, it performs quite poorly compared to the fastest algorithms presented here: it
requires a lot of extra working space, and is about 2 to 5 times slower than SAIS or
divsufsort. On the other hand, such results are not really surprising: the algorithm uses
a lot of dependent, non-parallelizable memory accesses affecting quite different memory
locations. As a direct consequence, GSACA has the highest average cache miss rates of
all algorithms except for DC3. Also note that even for very small
files (dickens from the Silesia Corpus), the cache miss rates produced by GSACA are
orders of magnitude higher than those of the other algorithms.

Cache miss rates are, of course, not the only reason for bad performance (SAIS, for
instance, performs well despite high cache miss rates on bigger files), but high rates
in combination with a large constant factor lead to long construction times. Thus,
compared to the state of the art, GSACA will not be used in real-world applications at
its current stage: the competitors clearly outperform it.
Chapter 7
Conclusion
This thesis has presented a new approach for linear-time suffix array construction. A
new suffix sorting principle was introduced, leading to the first non-recursive linear-time
suffix array construction algorithm, GSACA.
Unfortunately, GSACA cannot keep up with current state-of-the-art suffix array
construction algorithms: its construction speed is about 2 to 5 times slower than that of
the faster algorithms, and its extra working space consumption is quite large.
As a result, the algorithm presented in this thesis is interesting mainly from a theoretical
point of view, since it is the first non-recursive linear-time SACA. Nonetheless, the new
sorting principle may be used to design better linear-time algorithms. To give some
ideas: GSACA deals a lot with previous smaller and next smaller values, which
hints at a stack-based computation of next lexicographically smaller suffixes. The
group membership of a suffix might be computed 'on the fly', during the computation of
next lexicographically smaller suffixes. Finally, another representation of suffix groups
may lead to a cache-friendly implementation of the second phase, resulting in a much
faster algorithm. However, these ideas are suggestions, not a specific description of a
better algorithm.
Summarizing, the results of this thesis are quite promising: compared to the develop-
mental history of SA-IS, the currently fastest linear-time suffix array construction
algorithm, the development of GSACA is in its infancy. Thus, there is a lot of room for
improvement, and it is worth using GSACA as a starting point for future research.
Bibliography
[1] M. I. Abouelhoda, S. Kurtz, and E. Ohlebusch. Replacing Suffix Trees with En-
hanced Suffix Arrays. Journal of Discrete Algorithms, 2(1):53–86, 2004.
[5] P. Elias. Universal codeword sets and representations of the integers. IEEE Trans-
actions on Information Theory, 21(2):194–203, 1975.
[10] R. Grossi and G. F. Italiano. Suffix trees and their applications in string algorithms.
In Proceedings of the 1st South American Workshop on String Processing, pages 57–
76, 1993.
[11] W.-K. Hon, K. Sadakane, and W.-K. Sung. Breaking a Time-and-Space Barrier in
Constructing Full-Text Indices. In Proceedings of the 44th Annual IEEE Symposium
on Foundations of Computer Science, FOCS ’03, pages 251–260, 2003.
[12] J. Kärkkäinen and P. Sanders. Simple Linear Work Suffix Array Construction.
In Proceedings of the 30th International Conference on Automata, Languages and
Programming, ICALP ’03, pages 943–955, 2003.
[13] J. Kärkkäinen, P. Sanders, and S. Burkhardt. Linear Work Suffix Array Construc-
tion. Journal of the ACM, 53(6):918–936, 2006.
[16] P. Ko and S. Aluru. Space Efficient Linear Time Construction of Suffix Arrays.
In Proceedings of the 14th Annual Conference on Combinatorial Pattern Matching,
CPM ’03, pages 200–210, 2003.
[17] S. Kreft and G. Navarro. On Compressing and Indexing Repetitive Sequences.
Theoretical Computer Science, 483:115–133, 2013.
[27] G. Nong, S. Zhang, and W. H. Chan. Two Efficient Algorithms for Linear Time
Suffix Array Construction. IEEE Transactions on Computers, 60(10):1471–1484,
2011.
[28] S. J. Puglisi, W. F. Smyth, and A. H. Turpin. A Taxonomy of Suffix Array Con-
struction Algorithms. ACM Computational Survey, 39(2), 2007.
[31] P. Weiner. Linear Pattern Matching Algorithms. In Proceedings of the 14th Annual
Symposium on Switching and Automata Theory, SWAT ’73, pages 1–11, 1973.
[32] J. Ziv and A. Lempel. A universal algorithm for sequential data compression. IEEE
Transactions on Information Theory, 23:337–343, 1977.
Name: Uwe Baier Matriculation Number: 721798
Declaration
I declare that I have developed and written the enclosed Master Thesis completely
by myself, and have not used sources or means without declaration in the text. Any
thoughts from others or literal quotations are clearly marked. The Master Thesis was
not used in the same or in a similar version to achieve an academic grading, nor is it
being published elsewhere.
Ulm, the . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Uwe Baier