
Generic Non-recursive Suffix Array Construction

JANNIK OLBRICH, University of Ulm, Ulm, Germany


ENNO OHLEBUSCH, University of Ulm, Ulm, Germany
THOMAS BÜCHLER, University of Ulm, Ulm, Germany

The suffix array is arguably one of the most important data structures in sequence analysis and consequently
there is a multitude of suffix sorting algorithms. However, to this date the GSACA algorithm introduced in
2015 is the only known non-recursive linear-time suffix array construction algorithm (SACA). Despite its
interesting theoretical properties, there has been little effort in improving GSACA’s non-competitive real-world
performance. There is a super-linear algorithm DSH, which relies on the same sorting principle and is faster
than DivSufSort, the fastest SACA for over a decade. The purpose of this article is twofold: We analyse the
sorting principle used in GSACA and DSH and exploit its properties to give an optimised linear-time algorithm,
and we show that it can be very elegantly used to compute both the original extended Burrows-Wheeler
transform (eBWT) and a bijective version of the Burrows-Wheeler transform (BBWT) in linear time. We call
the algorithm “generic,” since it can be used to compute the regular suffix array and the variants used for the
BBWT and eBWT. Our suffix array construction algorithm is not only significantly faster than GSACA but also
outperforms DivSufSort and DSH. Our BBWT-algorithm is faster than or competitive with all other tested
BBWT construction implementations on large or repetitive data, and our eBWT-algorithm is faster than all
other programs on data that is not extremely repetitive.
CCS Concepts: • Theory of computation → Data structures design and analysis; • General and reference → Performance; • Mathematics of computing → Combinatoric problems; Combinatorics on words;
Additional Key Words and Phrases: Suffix array, suffix sorting, string algorithms, bijective Burrows-Wheeler
transform
ACM Reference Format:
Jannik Olbrich, Enno Ohlebusch, and Thomas Büchler. 2024. Generic Non-recursive Suffix Array Construc-
tion. ACM Trans. Algor. 20, 2, Article 18 (April 2024), 42 pages. https://doi.org/10.1145/3641854

1 INTRODUCTION
The suffix array contains the indices of all suffixes of a string arranged in lexicographical order. It is
arguably one of the most important data structures in stringology, the topic of algorithms on strings
and sequences. It was introduced in 1990 by Manber and Myers [1990] for on-line string searches
[Manber and Myers 1990] and has since been adopted in a wide area of applications including
text indexing and compression [Ohlebusch 2013]. Although the suffix array is conceptually very
simple, constructing it efficiently is not a trivial task.

This work was supported by the Deutsche Forschungsgemeinschaft (DFG–German Research Foundation) (Grant No. OH
53/7-1).
Authors’ address: J. Olbrich, E. Ohlebusch, and T. Büchler, Institute of Theoretical Computer Science, University of Ulm,
89081 Ulm, Germany; e-mails: [email protected], [email protected], [email protected].

This work is licensed under a Creative Commons Attribution International 4.0 License.
© 2024 Copyright held by the owner/author(s).
ACM 1549-6325/2024/04-ART18
https://doi.org/10.1145/3641854


When n is the length of the input text, the suffix array can be constructed in O(n) time and O(1)
additional words of working space when the alphabet is linearly-sortable (i.e., the symbols in the
string can be sorted in O(n) time) [Goto 2019; Li et al. 2022; Nong 2013].¹ However, algorithms with
these bounds have historically not always been the fastest in practice. For instance, DivSufSort
has been the fastest suffix array construction algorithm (SACA) for over a decade although
having a super-linear worst-case time complexity [Bertram et al. 2021; Fischer and Kurpicz 2017].
To the best of our knowledge, the currently fastest suffix sorter in practice is libsais, which
appeared as source code in February 2021 on GitHub² and has not been subject to peer review
in any academic context. The author claims that libsais is an improved implementation of the
SA-IS algorithm and hence has linear time complexity [Nong et al. 2009].
The only non-recursive linear-time suffix sorting algorithm GSACA was introduced in 2015 by
Baier [2015] and is not competitive, neither in terms of speed nor in the amount of memory con-
sumed [Baier 2015, 2016]. Generally, GSACA employs a kind of grouping principle, i.e., suffixes are
assigned to groups that are refined until the suffix array emerges. Despite the new algorithm’s
entirely novel approach and interesting theoretical properties [Franek et al. 2017], there has been
little effort in optimising it. In 2021, Bertram et al. [2021] provided a much faster SACA DSH using
the same sorting principle as GSACA. Their algorithm beats DivSufSort in terms of speed, but has
a super-linear time complexity.
A data structure closely linked to the suffix array is the Burrows-Wheeler transform (BWT)
[Burrows and Wheeler 1994] introduced by Burrows and Wheeler in 1994. The BWT of a text S is
the string obtained by assigning the ith symbol of the BWT to the last character of the ith lexico-
graphically smallest conjugate of the input text. Notably, S can be restored from its BWT [Burrows
and Wheeler 1994] but the BWT is usually easier to compress and to this date some of the best
compression algorithms make use of the BWT or one of its variants [Baier 2021]. From a theo-
retical point of view, the BWT is slightly unsatisfactory, since it is not a bijective transformation,
that is, for the BWT of S there might be several other strings that have the same BWT. Conse-
quently, one has to have additional information (e.g., which position in the BWT corresponds to
the first/last position in S) or make assumptions about S (e.g., that S is nullterminated) to reverse
the transformation. In 2007, Scott discovered a bijective variant of the BWT (Bijective Burrows-
Wheeler Transform (BBWT), sometimes also called “BWT Scottified” or BWTS for short) [Gil
and Scott 2012]. The BBWT is the string obtained by assigning the ith symbol of the BBWT to
the last character of the ith string in the list of all conjugates of all Lyndon factors of S sorted in
infinite periodic order (see Definition 2.3) [Gil and Scott 2012; Kufleitner 2009]. Mantaci et al. [2005]
introduced the extended BWT (eBWT), an extension of the BWT in the sense that it is a BWT for
a set M of primitive (i.e., non-periodic) strings [Mantaci et al. 2005], for which Hon et al. [2012]
gave an O(n log n) construction algorithm [Hon et al. 2012]. Similar to the BBWT, the eBWT con-
sists of the last characters of the conjugates of the strings in M arranged in infinite periodic order.
The approach of Hon et al. [2012] is to compute what we call the generalised circular suffix
array (GSA◦ ) such that the ith entry of GSA◦ represents the ith smallest such conjugate. Bonomo
et al. [2014] showed that it is possible to reduce the problem of computing the extended BWT to
computing the BBWT (in linear time) and gave an O(n log n/log log n) algorithm for constructing
the BBWT. Similar to Hon et al. [2012]’s eBWT-algorithm, they compute the cirular suffix array
(SA◦ ) where the ith entry represents the ith smallest conjugate [Bonomo et al. 2014].
More recently, Bannai et al. [2021] showed that the SA-IS algorithm can be modified to compute
SA◦ in linear time [Bannai et al. 2021].
1 To the best of our knowledge, there are no publicly available implementations of the algorithms of Goto [2019] and Li
et al. [2022].
2 https://github.com/IlyaGrebnov/libsais, last accessed: July 17, 2023.


Boucher et al. [2021] then extended the notion of the eBWT to multisets of general strings
(i.e., without the restriction to primitive strings made in Mantaci et al. [2005]) and simplified the
algorithm of Bannai et al. [2021]. They provide two implementations using their algorithm, cais³
and PFP-eBWT.⁴ The former computes GSA◦ of the string collection and derives the eBWT from it,
while the latter first applies a variation of the preprocessing technique prefix-free parsing (PFP)
[Boucher et al. 2019], applies cais to the parse and then derives the eBWT from the result and
the lexicographically sorted dictionary [Boucher et al. 2021]. Note that their implementations only
work correctly on multisets of primitive strings.⁵
Also note that the term “extended Burrows-Wheeler Transform” has not exclusively been used to
refer to the “original” eBWT as defined by Mantaci et al. [2005], but generally to BWT-variants for
collections of strings. Those other variants all (implicitly or explicitly) append terminator-symbols
to the input strings and thus their output differs from the original eBWT. Moreover, for most
variants, the order in which the input strings are given influences the BWT [Cenzato and Lipták
2022]. We only consider the original eBWT as defined by Mantaci et al. [2005].

Our Contributions. This article extends our previous work on optimising the GSACA suffix ar-
ray construction algorithm, published in Olbrich et al. [2022]. Specifically, we show that small
changes to our algorithm are sufficient to compute the bijective Burrows-Wheeler Transform or
the extended Burrows-Wheeler Transform instead of the suffix array.
An important intermediate state of the GSACA algorithm is the Lyndon grouping where the suf-
fixes are sorted and grouped according to their respective longest Lyndon prefixes. Our contribu-
tions are threefold.
First, we show that the BBWT can be derived from the Lyndon grouping in the same way as the
suffix array, and thereby we obtain linear-time BBWT construction algorithms.
Second, we show that a slight change in the initialisation of our BBWT algorithm is sufficient to
compute the eBWT instead (even for non-primitive input strings). Notably, this is possible without
explicitly sorting the input strings (as would be required for transforming an arbitrary BBWT-
algorithm into an algorithm for the eBWT [Bonomo et al. 2014]).
Third, we provide several techniques that significantly improve the performance of Baier’s al-
gorithms for computing the Lyndon grouping and deriving the suffix array, BBWT or eBWT from
it. Our resulting linear-time SACA is faster than GSACA and DSH, which employ the same sorting
principle but do not exploit certain properties of Lyndon words. Specifically, on real-world text,
our SACA implementation is more than 25% faster than DSH and more than 65% faster than Baier’s
GSACA implementation. Although it is not on par with libsais on real-world data, it significantly
improves Baier’s sorting principle and positively answers whether the precomputed Lyndon array
can be used to accelerate GSACA (posed in Bille et al. [2020]). Our BBWT construction program is
significantly faster than the other linear-time BBWT construction programs we are aware of, and
faster on some of our test data than the previously fastest program (which has quadratic worst-
case time complexity). Our eBWT-algorithm is also significantly faster than “the only tool up to
date that computes the eBWT according to the original definition” [Cenzato and Lipták 2022] on
data that is not extremely repetitive.

3 https://github.com/davidecenzato/cais, last accessed: July 12, 2023.
4 https://github.com/davidecenzato/PFP-eBWT, last accessed: July 12, 2023.
5 As far as we can tell, the actual eBWT itself is always computed correctly, but the accompanying index set required for reconstructing M can be wrong when the input strings are not primitive. For instance, the outputs of cais on the multisets M1 = {ababab, ab} and M2 = {abab, abab} are identical. Specifically, cais outputs bbbbaaaa as the eBWT (which is correct) and {1, 3} as the index set (0-based). Note that the index sets can depend on the order of the input strings in the concatenation; for M1 ababab came before ab in the input.


We only consider sequential suffix sorting, for parallel suffix sorting see, e.g., Du et al. [2023],
Kärkkäinen et al. [2015], and Labeit et al. [2017] and the references therein.
The rest of this article is structured as follows: Section 2 introduces the definitions and notations
used throughout this article. In Section 3, the grouping principle is investigated and a description
of our algorithms is provided. In Section 4 our algorithms are evaluated experimentally and com-
pared to other relevant suffix array, BBWT and eBWT construction algorithms. Finally, Section 5
concludes this article and provides an outlook on possible future research.

2 PRELIMINARIES
For i, j ∈ N0 , we denote the set {k ∈ N0 : i ≤ k ≤ j} by the interval notations [i .. j] = [i .. j + 1) =
(i − 1 .. j] = (i − 1 .. j + 1). For an array A, we analogously denote the subarray from i to j by
A[i .. j] = A[i .. j + 1) = A(i − 1 .. j] = A(i − 1 .. j + 1) = A[i]A[i + 1] . . . A[j]. We use 0-based
indexing, i.e., the first entry of the array A is A[0].
A string S of length n over an alphabet Σ is a sequence of n characters from Σ. We denote the
length n of S by |S | and the ith symbol of S by S[i − 1], i.e., strings are zero-indexed. In this article,
we assume any string S of length n to be over a totally ordered and linearly sortable alphabet (i.e.,
we can sort the characters in S in O(n)). Analogous to arrays, we denote the substring from i to j by
S[i .. j] = S[i .. j + 1) = S(i − 1 .. j] = S(i − 1 .. j + 1) = S[i]S[i + 1] . . . S[j]. For j < i, we let S[i .. j]
be the empty string ε. For two strings u and v and an integer k ≥ 0, we let uv be the concatenation
of u and v and denote the k-times concatenation of u by u^k.
A string S is primitive if it is non-periodic, i.e., S = w^k implies w = S and k = 1. For any string S
there is a unique primitive string w and a unique integer k such that S = w^k. We call w and k the
root and period of S and denote them by root(S) and period(S), respectively.
The suffix i of a string S of length n is the substring S[i .. n) and is denoted by Si . Similarly, the
substring S[0 .. i] is a prefix of S. A suffix (prefix) is proper if i > 0 (i + 1 < n).
In the rest of this article, we use S = decedacebceece$ as our running example. We have, for
instance, S1 = ecedacebceece$ = S[1]S2.
We assume totally ordered alphabets. This induces a total order on strings. Specifically, we say
a string S of length n is lexicographically smaller than another string S′ of length m if and only if
there is some ℓ ≤ min{n, m} such that S[0 .. ℓ) = S′[0 .. ℓ) and either n = ℓ < m or S[ℓ] < S′[ℓ]. If
S is lexicographically smaller than S′, then we write S <lex S′.
The suffix array SA of S is an array of length n that contains the indices of the suffixes of S in in-
creasing lexicographical order. That is, SA forms a permutation of [0 .. n) and S SA[0] <lex S SA[1] <lex
. . . <lex S SA[n−1] .
Definition 2.1 (pss-tree [Bille et al. 2020]). Let pss be the array such that pss[i] is the index
of the previous smaller suffix for each i ∈ [0 .. n) (or −1 if none exists). Formally, pss[i] :=
max({j ∈ [0 .. i) : Sj <lex Si} ∪ {−1}). Note that pss forms a tree with −1 as the root, in which
each i ∈ [−1 .. n) is represented by a node and pss[i] is the parent of node i. We call this tree
the pss-tree. Further, we impose an order on the nodes that corresponds to the order of the in-
dices represented by the nodes. In particular, if c1 < c2 < · · · < ck are the children of i (i.e.,
pss[c1] = · · · = pss[ck] = i), then we say ck is the last child of i.
Analogous to pss[i], we define nss[i] := min{j ∈ (i .. n] : Sj <lex Si} as the next smaller suffix
of i. Note that Sn = ε is smaller than any non-empty suffix of S, hence nss is well-defined. Figure 1
shows the suffix array, nss and the pss-tree of our running example.
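
Observe that pss and nss are precisely the previous- and next-smaller-value arrays over the suffix ranks. The following C++ sketch (our illustration, not part of GSACA — which obtains pss without knowing the suffix array, cf. Section 3.5) makes this concrete under the assumption that the inverse suffix array isa is already available:

#include <cstdint>
#include <utility>
#include <vector>

// Illustration only: S_j <lex S_i holds iff isa[j] < isa[i], so pss and
// nss follow from one classic stack-based smaller-value scan over isa.
std::pair<std::vector<int64_t>, std::vector<int64_t>>
pss_nss_from_isa(const std::vector<int64_t>& isa) {
    const int64_t n = (int64_t)isa.size();
    std::vector<int64_t> pss(n, -1), nss(n, n);
    std::vector<int64_t> st; // indices whose ranks increase bottom-up
    for (int64_t i = 0; i < n; ++i) {
        while (!st.empty() && isa[st.back()] > isa[i]) {
            nss[st.back()] = i; // i is the next smaller suffix of st.back()
            st.pop_back();
        }
        if (!st.empty()) pss[i] = st.back(); // previous smaller suffix
        st.push_back(i);
    }
    return {pss, nss}; // indices left on st keep nss = n (suffix S_n = eps)
}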
Definition 2.2. Let Pi be the set of suffixes with i as next smaller suffix, that is
Pi = {j ∈ [0 .. i) : nss[j] = i}.


Fig. 1. Lyndon prefixes of all suffixes of S = decedacebceece$ and the corresponding suffix array, nss-array,
pss-array, and pss-tree. Each box indicates a Lyndon prefix. For instance, the Lyndon prefix of S 9 = ceece$
is L9 = cee. Note that Li is exactly S[i] concatenated with the Lyndon prefixes of i’s children in the pss-tree
(see Lemma 3.20). For example, L8 = S[8]L9 L12 = bceece.

For instance, in our running example, we have P5 = {2, 4}, because nss[2] = nss[4] = 5.
Definition 2.3 (Infinite Periodic Order). For the infinite periodic order, we compare the infinite
concatenation of strings lexicographically. That is, for strings S and S′, we write S <ω S′ if and
only if the infinite concatenation S^∞ = SSS . . . is lexicographically smaller than the infinite con-
catenation S′^∞ = S′S′S′ . . .
For instance, ab <lex aba <lex abb and abb >ω ab >ω aba (since abbabb . . . >lex ababab . . . >lex
abaaba . . . ). Note that S^∞ = S′^∞ holds if and only if root(S) = root(S′). Thus, <ω is not an anti-
symmetric relation on strings in general (e.g., a ≤ω aa and aa ≤ω a but a ≠ aa).⁶ However, we will
only use the infinite periodic order for comparing primitive strings (where ≤ω is antisymmetric).
Also note that the infinite periodic order is equivalent to the lexicographical order if neither of the
strings in question is a prefix of the other.
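
The infinite periodic order of two strings can be decided without materialising any infinite concatenation, by the classical identity u^∞ <lex v^∞ if and only if uv <lex vu. A minimal C++ sketch (our illustration):

#include <string>

// Compare u and v in infinite periodic order via  u^inf <lex v^inf
// iff  uv <lex vu. Ties (equal roots) compare as equal.
bool less_omega(const std::string& u, const std::string& v) {
    return u + v < v + u; // lexicographic comparison of the concatenations
}
// E.g., less_omega("abb", "ab") and less_omega("ab", "aba") are both false,
// while less_omega("aba", "ab") is true -- matching abb >omega ab >omega aba.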
For a string S of length n, the ith conjugate is defined as S[i .. n)S[0 .. i) for i ∈ [0 .. n) and denoted
by conji (S).
A non-empty string S is in its canonical form if and only if it is the lexicographically minimal
among its conjugates. If S is additionally strictly smaller than all of its other conjugates, then S is
a Lyndon word. Equivalently, S is a Lyndon word if and only if S is lexicographically smaller than
all its proper suffixes [Duval 1983].
The Lyndon prefix of S is the longest prefix of S that is a Lyndon word. We let Li denote the
Lyndon prefix of Si . Note that a string of length one is always a Lyndon word, hence the Lyndon
prefix of a non-empty string is also non-empty.
Lemma 2.4 (Lemma 15, Franek et al. [2016]). For each non-empty string S, we have Li =
S[i .. nss[i]) for each i ∈ [0 .. |S |).
Theorem 2.5 (Chen-Fox-Lyndon Theorem, Chen et al. [1958]). Any non-empty string S has
a unique Lyndon factorisation, that is, there is a unique sequence of Lyndon words (Lyndon factors)
v1 ≥lex . . . ≥lex vk with S = v1 . . . vk [Chen et al. 1958]. Note that v1 = L0, v2 = L|v1|, . . . ,
vk = L|v1|+···+|vk−1|.

6 In Mantaci et al. [2005] the infinite periodic order is defined such that the exponent breaks ties in cases where roots are equal. We omit this because we use different tie-breaking strategies for convenience, cf. Definitions 3.7 and 3.23.


Fig. 2. Constructing the BBWT of decedacebceece$. Lyndon factors are coloured, cf. Figure 1. Reading
(from top to bottom) characters in the last column gives the BBWT.

The Lyndon factorisation of our running example is v1 = de, v2 = ced, v3 = acebceece, v4 = $
(cf. Figure 1, the outermost boxes exactly correspond to the Lyndon factors).
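
The factorisation of Theorem 2.5 can be computed in linear time with Duval's algorithm [Duval 1983]; a compact C++ sketch (our illustration):

#include <string>
#include <vector>

// Duval's algorithm: compute the Lyndon factorisation v1 >=lex ... >=lex vk
// of Theorem 2.5 in O(n) time.
std::vector<std::string> lyndon_factorisation(const std::string& s) {
    std::vector<std::string> factors;
    size_t i = 0, n = s.size();
    while (i < n) {
        size_t j = i + 1, k = i; // invariant: s[i..j) = (Lyndon word)^m + proper prefix
        while (j < n && s[k] <= s[j]) {
            k = (s[k] < s[j]) ? i : k + 1;
            ++j;
        }
        while (i <= k) { // emit copies of the Lyndon word of length j - k
            factors.push_back(s.substr(i, j - k));
            i += j - k;
        }
    }
    return factors;
}
// For S = "decedacebceece$" this yields de, ced, acebceece, $.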
Definition 2.6 (Bijective Burrows-Wheeler Transform (BBWT)). The bijective Burrows-Wheeler
Transform (BBWT) of a string S is the string obtained by taking the last characters of the con-
jugates of the Lyndon factors of S arranged in infinite periodic order.
Figure 2 shows how the BBWT of our running example can be obtained.
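
Definition 2.6 translates directly into a naive reference implementation (far from linear time, hence only suitable for testing): enumerate the conjugates of all Lyndon factors, sort them with the uv-versus-vu comparator from above, and read off the last characters. The sketch below (our illustration) assumes the lyndon_factorisation routine shown above:

#include <algorithm>
#include <string>
#include <vector>

// Naive reference sketch of Definition 2.6; for testing only.
std::string bbwt_naive(const std::string& s) {
    std::vector<std::string> conjugates;
    for (const std::string& f : lyndon_factorisation(s))
        for (size_t i = 0; i < f.size(); ++i) // i-th conjugate of factor f
            conjugates.push_back(f.substr(i) + f.substr(0, i));
    // sort in infinite periodic order; conjugates with equal roots compare
    // as equal, and their relative order does not affect the BBWT
    std::stable_sort(conjugates.begin(), conjugates.end(),
                     [](const std::string& u, const std::string& v) {
                         return u + v < v + u; // u <omega v
                     });
    std::string bbwt;
    for (const std::string& c : conjugates) bbwt += c.back(); // last characters
    return bbwt;
}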
Definition 2.7 (Extended Burrows-Wheeler Transform (eBWT)). The extended Burrows-Wheeler
Transform eBWT of a multiset M of strings is the string obtained by taking the last characters of
the conjugates of the strings in M arranged in infinite periodic order.
Analogous to the original BWT, for reconstructing M from the eBWT one needs the set of
indices of the strings in M in the sorted list of conjugates. Figure 3 shows how the eBWT of
M = {b, bcbc, b, abcbc, bc} can be obtained.
We assume the RAM model of computation, that is, basic arithmetic operations can be performed
in O(1) time on words of length O(log n) bits, where n is the size of the input. Reading and writing
an entry A[i] of an array A can also be performed in constant time when i and A[i] have length in
O(log n).

3 GSACA
In the following, we fix a string S of length n over a linearly sortable alphabet.
We start by giving a high level description of the sorting principle based on grouping by Baier
[2015, 2016]. Very basically, the suffixes are first assigned to ordered groups, which are then refined
until the suffix array emerges. The algorithm consists of the following steps.
— Initialisation: Group the suffixes according to their first character.
— Phase I: Refine the groups until the elements in each group have the same Lyndon prefix.
— Phase II: Sort elements within groups lexicographically.


Fig. 3. Constructing the eBWT of M = {b, bcbc, b, abcbc, bc}. The indices refer to the starting index in the
concatenation T = bbcbcbabcbcbc of the strings in M. Strings in M are coloured. Reading (from top to
bottom) characters in the last column gives the eBWT. The corresponding index set is {0, 1, 2, 5, 6}. For the
relative order of conjugates with equal roots see Definition 3.23.

We will later show that Phase II can also be used to derive BBWT instead of SA, and that a minor
change in the preconditions to this BBWT-algorithm suffices to turn it into an algorithm for the
eBWT.
Definition 3.1 (Suffix Grouping, Adapted from Bertram et al. [2021]). Let S be a string of length n
and SA the corresponding suffix array. A group G with group context α is a tuple ⟨gs, ge, |α|⟩ with
group start gs ∈ [0 .. n) and group end ge ∈ [gs .. n) such that the following properties hold:
(1) All suffixes in SA[gs .. ge] share the prefix α, i.e., for all i ∈ SA[gs .. ge] it holds Si = αSi+|α|.
(2) α is a Lyndon word.
We say i is in G or i is an element of G and write i ∈ G if and only if i ∈ SA[gs .. ge]. A suffix
grouping for S is a set of groups G1, . . . , Gm, where the groups are pairwise disjoint and cover
the entire suffix array. Formally, if Gi = ⟨gs,i, ge,i, |αi|⟩ for all i, then gs,1 = 0, ge,m = n − 1 and
gs,j = 1 + ge,j−1 for all j ∈ [2 .. m]. For i, j ∈ [1 .. m], Gi is a lower (higher) group than Gj if and
only if i < j (i > j). If all elements in a group G have α as their Lyndon prefix, then G is a Lyndon
group. If G is not a Lyndon group, then it is called preliminary. Furthermore, a suffix grouping is
Lyndon if all its groups are Lyndon groups, and preliminary otherwise.
Note that, by definition, any suffix grouping constitutes a partial order consistent with the lexi-
cographical order. That is, for some Gi, Gj from a suffix grouping with i < j, we have Si′ <lex Sj′
for all i′ ∈ Gi, j′ ∈ Gj.
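
In an implementation, a group can thus be represented by two boundaries in the (emerging) suffix array plus the length of its context; a minimal sketch of this bookkeeping together with a checker for property (1) (our illustration; the concrete layout of our implementation is discussed in Section 3.3):

#include <cstdint>
#include <string>
#include <vector>

// A group per Definition 3.1: the context alpha itself need not be stored.
struct Group {
    uint64_t gs;      // group start: first index of the group in SA
    uint64_t ge;      // group end: last index of the group in SA (inclusive)
    uint64_t ctx_len; // |alpha|
};

// Check property (1): all suffixes in SA[gs .. ge] share the prefix alpha.
bool shares_context(const std::string& s, const std::vector<uint64_t>& sa,
                    const Group& g, const std::string& alpha) {
    for (uint64_t i = g.gs; i <= g.ge; ++i)
        if (s.compare(sa[i], alpha.size(), alpha) != 0) return false;
    return true;
}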
To see why the notion of Lyndon groups is useful, consider the following two lemmata:
Lemma 3.2. For strings wu and wv over Σ with u <lex wu and v >lex wv, we have wu <lex wv.
Proof. Note that there is no j ≥ 1 such that wv = w^j, since otherwise v would be a prefix
of wv and thus v <lex wv. Hence, there are k ∈ N, ℓ ∈ [0 .. |w|), b ∈ Σ and m ∈ Σ∗ such that
wv = w^k w[0 .. ℓ)bm and b > w[ℓ]. There are two cases:
— There is some j ≥ 1 such that wu = w^j.
  — If j|w| ≤ k|w| + ℓ, then wu is a prefix of wv.
  — Otherwise, the first different symbol in wu and wv is at index p = k|w| + ℓ, and we have
    (wu)[p] = w^j[p] = w[ℓ] < b = (wv)[p].
— There are i ∈ N, j ∈ [0 .. |w|), a ∈ Σ and q ∈ Σ∗ such that wu = w^i w[0 .. j)aq and a < w[j].
  — If |w^i w[0 .. j)| ≤ |w^k w[0 .. ℓ)|, then the first different symbol is at index p = |w^i w[0 .. j)|,
    with (wu)[p] = a < w[j] ≤ (wv)[p].
  — Otherwise, the first different symbol is at index p = |w^k w[0 .. ℓ)| with (wv)[p] = b >
    w[ℓ] = (wu)[p].
In all cases, the claim follows. □

Fig. 4. Lyndon grouping G1, . . . , G8 of decedacebceece$ with group contexts. The Lyndon prefixes and the
suffix array are shown for improved clarity. Note that this grouping is the Lyndon grouping with the smallest
number of groups.
Lemma 3.3. For any i, j, Li <lex Lj implies Si <lex Sj.
Proof. Assume Li is a prefix of Lj, otherwise there is a mismatching position and the claim
follows immediately. By Lemma 2.4, we have nss[i] = i + |Li| and nss[j] = j + |Lj| > j + |Li| and
therefore Si+|Li| <lex Si and Sj+|Li| >lex Sj. Lemma 3.2 then implies the claim. □
That is, sorting the suffixes according to their Lyndon prefixes results in a valid partial order
and thus suffix grouping. Intuitively, we can derive the suffix array from a Lyndon grouping using
a kind of induced copying, because we have Si+|Li| <lex Si for each i (Lemma 2.4).
With these notions, a suffix grouping is created in the initialisation, which is then refined in
Phase I until it is a Lyndon grouping, and further refined in Phase II until the suffix array emerges.
Figure 4 shows a Lyndon grouping with contexts of our running example.
We first deal with Phase II, since it is much less technical than Phase I. In Section 3.1, we reca-
pitulate how Baier [2015] derives the suffix array from a Lyndon grouping, and in Section 3.2, we
show how this algorithm can be modified to produce the BBWT instead. Section 3.3 then shows
how these two almost identical algorithms can be optimised. In Section 3.4, we explain how a Lyn-
don grouping can be computed and describe our improvements over Baier’s Phase I. Section 3.5
describes how the data structures needed for Phase I are set up. Finally, Section 3.6 shows that
only a slight change in these initial data structures for our BBWT-algorithm is sufficient for it to
compute the eBWT instead.

3.1 Phase II
In Phase II, we need to refine the Lyndon grouping obtained in Phase I into the suffix array. Let G
be a Lyndon group with context α and let i, j ∈ G. Since Si = αSi+|α| and Sj = αSj+|α|, we have
Si <lex Sj if and only if Si+|α| <lex Sj+|α|. Hence, to find the lexicographically smallest suffix in
G, it suffices to find the lexicographically smallest suffix p in {i + |α| : i ∈ G}. Note that removing
p − |α| from G and inserting it into a new group immediately preceding G yields a valid Lyndon
grouping. We can repeat this process until each element in G is in its own singleton group. As G is
Lyndon, we have Sk+|α| <lex Sk for each k ∈ G by Lemma 2.4. Therefore, if all groups lower than G
are singletons, then p can be determined by a simple scan over G (by determining which member
of {i + |α| : i ∈ G} is in the lowest group). Consider, for instance, G4 = ⟨3, 4, |ce|⟩ containing 6
and 12 from Figure 4. We consider 6 + |ce| = 8 and 12 + |ce| = 14. The group containing 14
is lower than the group containing 8, hence S12 is lexicographically smaller than S6. Thus, we
know that SA[3] = 12, remove 12 from G4 and repeat the same process with the emerging group
G4′ = ⟨4, 4, |ce|⟩. As 6 is the only element of G4′ we know that SA[4] = 6.
If we refine the groups in lexicographically increasing order (lower to higher) as just described,
then each time a group G is processed, all groups lower than G are singletons. However, sorting
groups in such a way leads to a superlinear time complexity. Bertram et al. [2021] provide a fast-
in-practice O(n log n) algorithm for this, broadly following the described approach.
To get a linear time complexity, Baier turns this approach on its head [Baier 2015, 2016]: Instead
of repeatedly finding the next smaller suffix in a group, we consider the suffixes in lexicographically
increasing order and for each encountered suffix i, we move all suffixes that have i as the next
smaller suffix (i.e., those in Pi ) to new singleton groups immediately preceding their respective
old groups as described above.
However, for this to work, we first need to find the smallest suffix. This is simply solved by
assuming w.l.o.g. that S is nullterminated (i.e., the last character of S is smaller than all other
characters in S) and thus that the lexicographically smallest group is known to be the singleton
group containing SA[0] = n − 1.
For the correctness of this algorithm, we need two more properties: First, before iteration i (0-
based), SA[i] must be known, and second, the procedure of inserting the elements in Pi must be
well-defined.
For the former, assume that we want to process SA[j]. Because nss[SA[j]] occurs by definition
before SA[j] in SA, we must have inserted Pnss[SA[j]] already (with SA[j] ∈ Pnss[SA[j]] by
definition). The second property is implied by the following Corollary 3.5. (Intuitively, all suffixes
in Pi have different Lyndon prefixes, because those Lyndon prefixes start at different indices but
end at the same index i, hence they must be in different Lyndon groups.)
Lemma 3.4. For any j, j′ ∈ Pi, we have Lj ≠ Lj′ if and only if j ≠ j′.
Proof. Let j, j′ ∈ Pi and j ≠ j′. By definition of Pi, we have nss[j] = nss[j′] = i. Since Lj =
S[j .. nss[j]) and Lj′ = S[j′ .. nss[j′]), Lj and Lj′ have different lengths, implying the claim. □
Corollary 3.5. In a Lyndon grouping, the elements of Pi are in different groups.
Accordingly, Algorithm 1 correctly computes the suffix array from a Lyndon grouping. A formal
proof of correctness is given in Baier [2015, 2016]. Figure 5 shows Algorithm 1 applied to our
running example. Note that Corollary 3.5 also implies that the order in which we consider the j ∈
PA[i] in Algorithm 1 has no influence on the correctness. A concrete implementation of Algorithm 1
is provided in Section 3.3.

Fig. 5. Refining a Lyndon grouping for S = decedacebceece$ (see Figure 4) into the suffix array, as done
in Algorithm 1. Already processed elements are coloured light gray while inserted but not yet processed
elements are coloured green. Note that uncoloured entries are not actually present in the array but only
serve to indicate the current Lyndon grouping. The Lyndon prefixes are shown at the top for clarity.

ALGORITHM 1: Phase II of GSACA [Baier 2015, 2016]. After execution, the array A is the suffix array.
A[0] ← n − 1;
for i = 0 → n − 1 do
    for j ∈ PA[i] do
        Let k be the start of the group containing j;
        remove j from its current group and put it into a new group ⟨k, k, |Lj|⟩ immediately preceding
        j’s old group;
        A[k] ← j;
    end
end

3.2 Deriving the Bijective Burrows-Wheeler Transform

In this section, we show how the algorithm from the previous section can be altered to derive the
BBWT instead of SA from the final Lyndon grouping.
Unlike in the previous section, we do not assume that S is necessarily nullterminated. Further-
more, here we assume that the Lyndon grouping we are starting from has the minimum number
of groups. That is, we require that suffixes are in the same group if and only if they have the same
Lyndon prefix. (For deriving the suffix array as described in the previous section it suffices that the
grouping is Lyndon, but there may be several Lyndon groups with the same context.) Fortunately,
we obtain exactly such a minimum Lyndon grouping from Phase I as described in Section 3.4 (or
Baier’s implementation of Phase I [Baier 2015, 2016]).
Instead of following the procedure implied by Definition 2.6 literally, we follow the approach
from Bannai et al. [2021] in that we compute the circular suffix array from which the BBWT can
be derived similarly to how the ordinary BWT can be derived from the suffix array.


Let v1, . . . , vk be the Lyndon factorisation of S and let L be the set of positions where a Lyndon
factor starts in S, i.e.,

    L = { |v1| + · · · + |vj| : j ∈ [0 .. k) } = { j ∈ [0 .. n) : pss[j] = −1 }.

The following definition provides a natural bijection between the (multi)set of infinite concate-
nations of conjugates of Lyndon factors and the indices in [0 .. n).
Definition 3.6. For each i ∈ [0 .. n) there is p ∈ L so that p ≤ i < nss[p]. We let Ci denote
S[i .. nss[p])Lp^∞, where Lp is the Lyndon prefix of Sp and Lp^∞ is its infinite concatenation.
Note that for i with p ∈ L and p ≤ i < nss[p], we have Ci = conji−p(Lp)^∞. For instance, in our
running example (cf. Figures 1 and 2), we have L12 = ce and 5 ∈ L with 5 ≤ 12 < nss[5] = 14. Thus,
C12 = ceacebceece . . . and conj12−5(acebceece)^∞ = (ceacebcee)^∞ = ceacebceece . . . = C12.
The following definitions introduce the order that induces the aforementioned permutation.
Definition 3.7. We write i <inf j if and only if Ci <lex Cj or Ci =lex Cj and i > j.⁷
Definition 3.8 (Circular Suffix Array). Let SA◦ be the (unique) permutation of [0 .. n) such that
for all i ∈ [0 .. n − 1), we have SA◦ [i] <inf SA◦ [i + 1]. We call SA◦ the circular suffix array.
Reading the numbers in the second column in Figure 2 gives SA◦ for our running example.
For i ∈ [0 .. n), we have

    BBWT[i] = S[nss[SA◦[i]] − 1], if SA◦[i] ∈ L,
    BBWT[i] = S[SA◦[i] − 1], otherwise.    (1)

That is, character i of the BBWT is the character preceding the character at index SA◦[i] when we
consider the Lyndon factors to be independently circular strings, i.e., the character preceding the
first character of a Lyndon factor is the last character of that Lyndon factor (cf. Figure 2).
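
Equation (1) amounts to a single scan over SA◦; the following sketch (our illustration; the array names and the indicator in_L are our own) assumes nss and membership in L to be precomputed:

#include <cstdint>
#include <string>
#include <vector>

// Derive the BBWT from the circular suffix array per Equation (1).
// in_L[i] is true iff i starts a Lyndon factor (pss[i] = -1).
std::string bbwt_from_csa(const std::string& s,
                          const std::vector<uint64_t>& csa, // SA°
                          const std::vector<uint64_t>& nss,
                          const std::vector<bool>& in_L) {
    std::string bbwt(s.size(), '\0');
    for (size_t i = 0; i < s.size(); ++i) {
        const uint64_t j = csa[i];
        // the character cyclically preceding position j within its factor:
        // for a factor start this is the factor's last character S[nss[j]-1]
        bbwt[i] = in_L[j] ? s[nss[j] - 1] : s[j - 1];
    }
    return bbwt;
}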
To establish the connection between GSACA and the BBWT, we now define the corresponding
analogue to the nss-pointer.⁸
Definition 3.9 (nsc). For i ∈ [0 .. n) with p ∈ L and p ≤ i < nss[p] define

    nsc(i) = p, if nss[i] = nss[p],
    nsc(i) = nss[i], otherwise (i < nss[i] < nss[p]).

That is, for i ∉ L, nsc(i) points to the next smaller conjugate of the Lyndon factor i belongs to (cf.
Lemma 3.12).
In Phase II of GSACA the indices are sorted according to the corresponding suffix, which consists
of Lyndon factors arranged in (lexicographically) decreasing order. That is, for i ∈ [0 .. n), we sort
according to Si = Li Snss[i] = Li Lnss[i] Lnss²[i] . . . Ln−1.
To construct SA◦ instead of SA, we sort according to Ci = Li Lnsc(i) Lnsc²(i) Lnsc³(i) . . . For em-
ploying Phase II of GSACA, we need to establish that
— the final Lyndon grouping provides a valid partial order, and
— nsc(i) ≤inf i holds for all i.
Note that the latter point and Definition 3.9 imply that each nsc-chain eventually reaches some p
with nsc(p) = p. From Definition 3.9, we can furthermore deduce that this p must be in L, i.e., the
index of a Lyndon factor. Therefore, to adapt Phase II of GSACA, we need to be able to determine
the positions of the Lyndon factors in SA◦ beforehand, analogously to how (in Phase II of GSACA)
we need to know the position of n − 1 in SA, since any nss-chain eventually reaches n − 1 (recall
that we assumed the text to be nullterminated).

7 Note that the case Ci = Cj for i ≠ j can only arise if there are multiple equal Lyndon factors. Those could be filtered out beforehand and accounted for afterwards (in linear time), as done in Bannai et al. [2021]. Moreover, note that in such cases the relative order of i and j is irrelevant for the BBWT. We argue that the chosen relative order is natural, because this way the indices of Lyndon factors occur (from left to right) in decreasing order according to <inf (which corresponds to the relative order of the Lyndon factors according to <lex).
8 Note that we do not compute nsc (or nss in case of SA), it only serves to illustrate the working principle of our algorithm and prove its correctness.
Lemma 3.10. For i ∈ [0 .. n), we have Li ≥lex Lnsc(i) .
Proof. Follows immediately from Definition 3.9 and Lemma 3.2. 
The following lemma implies that a Lyndon group actually constitutes a valid partial order
according to <ω .
Lemma 3.11. For i, j ∈ [0 .. n), Li <lex Lj implies i <inf j.
Proof. Let pi, pj ∈ L be the positions with pi ≤ i < nss[pi] and pj ≤ j < nss[pj]. Assume Li is
a proper prefix of Lj, otherwise the claim is trivially true. We need to show Ci <lex Cj.
We proceed by induction on the minimum number of applications of nsc to i that gives pi, i.e.,
min{k ∈ N0 : nsc^k(i) = pi}.
Induction base. Let i = pi (i.e., nsc^0(i) = pi).
By assumption, we have Lj = Li v for some non-empty v. Since Lj is Lyndon, we have Lj <lex v.
Consider the smallest index k where there is a mismatch between Lj and v, i.e., Lj[0 .. k) =
v[0 .. k) and Lj[k] < v[k]. (Because |Lj| > |v|, Lj can not be a prefix of v and therefore k must
exist.) Since Lj = Li v, the longest common prefix of Lj and v (which has length k) can be
factored into m ≥ 0 repetitions of Li followed by a proper (possibly empty) prefix of Li, i.e.,
Lj[0 .. k) = v[0 .. k) = Li^m Li[0 .. ℓ) where 0 ≤ ℓ < |Li| and m|Li| + ℓ = k. In combination with
Lj = Li v, this gives Lj[0 .. k + |Li|) = Li v[0 .. k) = Li^{m+1} Li[0 .. ℓ), and in particular Li[ℓ] =
Li^∞[k] = Lj[k]. Using the definition of k, we then have Li[ℓ] = Lj[k] < v[k].
Since pi = i holds, we have Ci = Li^∞ by Definition 3.6 and therefore Li^∞[0 .. k + |Li|) =
Lj[0 .. k + |Li|) and Li^∞[k + |Li|] = Li[ℓ] < Lj[k + |Li|] by the previous statement, which
implies the claim Ci <lex Cj.
Induction step. Now assume pi < i. We have Ci = Li Cnsc(i) by definition and Lnsc(i) ≤lex Li
by Lemma 3.10. Moreover, Lj <lex Lj+|Li| follows from basic properties of Lyndon words and
|Lj| > |Li|, which in turn implies Li <lex Lj+|Li|. In combination, this gives Lnsc(i) <lex Lj+|Li|
and thus – by the induction hypothesis – Cnsc(i) <lex Cj+|Li|. In combination with Ci[0 .. |Li|) =
Li = Lj[0 .. |Li|) = Cj[0 .. |Li|) this implies the claim. □
With the help of the previous lemma, we are now in a position to show that nsc(i) actually is
the next smaller conjugate of i.
Lemma 3.12. For i ∈ [0 .. n) \ L, we have nsc(i) <inf i.
Proof. Let p ∈ L be such that p < i < nss[p]. Note that we have i ≠ nsc(i) and Snsc(i) <lex
Si. Let ℓ be the minimum number of times nsc has to be applied to nsc(i) to obtain p, i.e., ℓ =
min{ℓ′ ∈ N0 : nsc^{ℓ′+1}(i) = p}.
Let k ∈ [0 .. ℓ] be such that Lnsc^j(i) = Lnsc^{j+1}(i) for all j ∈ [0 .. k) and Lnsc^k(i) ≠ Lnsc^{k+1}(i). (Note
that k exists by definition of ℓ.) We have Lnsc^{k+1}(i) <lex Lnsc^k(i), because Lnsc^{k+1}(i) ≠ Lnsc^k(i)
(by definition of k) and Lnsc^{k+1}(i) ≤lex Lnsc^k(i) (by Lemma 3.10). By Lemma 3.11, this implies
nsc^{k+1}(i) <inf nsc^k(i) and thus the claim. □
Now all except one of the aforementioned requirements for applying the sorting principle from
Phase II of GSACA to SA◦ are shown to be satisfied. The missing one regards the positions of the
Lyndon factors in SA◦ and is given by the following lemma.
Lemma 3.13. For i ∈ L and j ∉ L with Li = Lj, we have j <inf i.
Proof. Note that j <inf i holds if and only if nsc(j) <inf nsc(i). Furthermore, Li = Lnsc(i), since
i ∈ L.
If Lnsc(j) ≠ Lj, then Lemma 3.10 implies Lnsc(j) <lex Lj = Li = Lnsc(i). Hence, the claim follows
by Lemma 3.11.
Otherwise, the claim follows by induction on the number of times nsc has to be applied to j to
reach an element in L (whose Lyndon prefix is then guaranteed to be not equal to Lj). □
Lemma 3.13 immediately implies that the Lyndon factors are last in their respective Lyndon
groups. If there are multiple equal Lyndon factors, then Definition 3.7 also gives the relative order
of Lyndon factors within a Lyndon group. Note that this definition regarding the relative order of
equal conjugates of Lyndon factors is arbitrary as it has no effect on the BBWT (cf. Definition 2.6).
To find the positions of all other elements in linear time, we proceed almost exactly as in
Section 3.1. That is, we iterate from left to right over SA◦ and upon encountering some i, we insert
all j that have i as next smaller conjugate (i.e., nsc(j) = i) at the current begin of their respective
groups. Intuitively, this is correct, because nsc(i) comes before i in SA◦ for any i whose index in
SA◦ is not yet known (i.e., those not in L, cf. Lemma 3.12), analogously to how nss[i] comes before
i in SA◦ for any i  n − 1.
Analogously to how Pi contains all elements that have i as next smaller suffix, we now define P◦i
as the set of elements that have i as their next smaller conjugate (excluding L, since their positions
in SA◦ are already known).
Definition 3.14 (P◦i). For i ∈ [0 .. n) define

    P◦i = {j ∈ [0 .. n) : nsc(j) = i} \ L.

Note that

    P◦i = Pnss[i] \ L, if i ∈ L (i.e., nsc(i) = i),
    P◦i = Pi \ L, otherwise,

by Definitions 2.2, 3.9, and 3.14. Furthermore, for all i ∈ L, we have Pnss[i] ∩ L = {i} and therefore
P◦i = Pnss[i] \ {i}.
Definition 3.14 finally enables us to formulate a concrete algorithm; Algorithm 2 is Algorithm 1
adapted to compute SA◦ instead of SA and Figure 6 shows Algorithm 2 applied to our running
example.
Theorem 3.15. Algorithm 2 correctly computes SA◦ .
Proof. First note that by Lemma 3.11 it suffices to correctly sort elements within Lyndon
groups.
In the first for-loop, the positions of Lyndon factors of S are written to A. Note that Lyndon fac-
tors are larger (according to <inf ) than other elements within the same Lyndon group (Lemma 3.13).
For i, j ∈ L with i > j, we have Li ≤lex Lj by definition, and thus i <inf j by Definition 3.7 (if
Li = Lj) and Lemma 3.11 (if Li ≠ Lj). We iterate over L in increasing order (by index) and insert
the positions of Lyndon factors at the current end of their group. Hence, after the first for-loop,
the Lyndon factors are correctly placed in A.

Fig. 6. Deriving SA◦ of our running example decedacebceece$, given its Lyndon grouping. Already pro-
cessed elements are coloured light gray while inserted but not yet processed elements are coloured green.
The indices of Lyndon factors have a coloured border, cf. Figure 1. Note that uncoloured entries are not
actually present in the array A but only serve to indicate the current Lyndon grouping. The Lyndon prefixes
are shown at the top for clarity.
We now proceed by induction on the number of iterations of the second for-loop. Let ISA◦ be
the inverse permutation of SA◦.
After the kth iteration the following invariants hold true:
(1) A[k] = SA◦[k],
(2) A[0 .. k] = SA◦[0 .. k],
(3) for all j < k and p ∈ P◦SA◦[j] we have A[ISA◦[p]] = SA◦[ISA◦[p]] = p.
Note that the second invariant immediately follows from the first and the fact that entries already
written to A do not change.

ALGORITHM 2: Phase II of GSACA, modified to produce SA◦ instead of SA. After execution the array A
is SA◦.
for i = 0 → n − 1 do
    if i ∈ L then
        Let k be the end of the group containing i;
        remove i from its current group and put it into a new group ⟨k, k, |Li|⟩ immediately
        succeeding i’s old group;
        A[k] ← i;
    end
end
for i = 0 → n − 1 do
    for j ∈ P◦A[i] do
        Let k be the start of the group containing j;
        remove j from its current group and put it into a new group ⟨k, k, |Lj|⟩ immediately preceding
        j’s old group;
        A[k] ← j;
    end
end

Induction base. The lowest Lyndon group only contains Lyndon factors: Assume there is an
element j of the lowest Lyndon group that is not the position of a Lyndon factor. Then there is
p ∈ L with p < j < nss[p]. By definition, we then have Lp <lex Lj, which is a contradiction.
Therefore, after the first for-loop, we have A[0] = SA◦[0].
Induction hypothesis. Assume for some k ∈ [0 .. n − 2] that A[0 .. k] = SA◦[0 .. k] and all elements
in ⋃ j∈SA◦[0 .. k) P◦j are correctly placed in A.
Induction step. We now insert P◦SA◦[k]. First note that the elements in P◦SA◦[k] are in different
Lyndon groups (Corollary 3.5 on page 9; by definition P◦SA◦[k] ⊆ Pnss[SA◦[k]] if SA◦[k] ∈ L and
P◦SA◦[k] ⊆ PSA◦[k] otherwise), hence the order in which we process them is irrelevant. Now consider
some j ∈ P◦SA◦[k]. To prove that j is inserted at index ISA◦[j] it suffices to show that from j’s Lyndon
group exactly the smaller elements were inserted during earlier iterations. Consider some j′ with
Lj = Lj′ and j′ <inf j. This implies nsc(j′) <inf nsc(j) = SA◦[k] and hence that ISA◦[nsc(j′)] <
ISA◦[nsc(j)] = k. Conversely, assume j′ with Lj = Lj′ and j′ >inf j. This implies nsc(j′) >inf
nsc(j) = SA◦[k] and hence that ISA◦[nsc(j′)] > ISA◦[nsc(j)] = k. In combination, this proves
Invariant 3.
For Invariant 1, we have two cases: If SA◦[k + 1] ∈ L, then it is trivially true. Otherwise, for
j = nsc(SA◦[k + 1]), we have j <inf SA◦[k + 1] by Lemma 3.12 and thus A[k + 1] = SA◦[k + 1] by
Invariant 3. □

We will now modify Algorithm 2 such that the second for-loop is exactly the same as in Phase II
of GSACA as shown in Algorithm 1.
Note that the only difference lies in the definition of the sets PA[i] and P◦A[i]. Recall that we have
P◦A[i] = PA[i] \ L if A[i] ∉ L and P◦A[i] = Pnss[A[i]] \ L otherwise. As the elements in L are inserted
before the second for-loop, we can drop the “set-minus L,” because already inserted elements are
never changed. The now only remaining difference is the processing of elements in L, where we
insert Pnss[s] instead of Ps. By inserting nss[s] instead of s (at the end of s’s group) for each s in L


in the initialisation, this last difference vanishes. The array computed by this modified Phase II is
called the shifted circular suffix array.
Definition 3.16 (Shifted Circular Suffix Array). The shifted circular suffix array SA◦′ is derived
from SA◦ by changing each element in L to its next smaller suffix. Formally:

    SA◦′[i] = SA◦[i], if SA◦[i] ∉ L,
    SA◦′[i] = nss[SA◦[i]], if SA◦[i] ∈ L.

Note that SA◦′ is a permutation of [1 .. n] instead of [0 .. n), since 0 ∈ L and nss[max L] = n.
Furthermore, note that, in Equation (1), we have the same substitution of SA◦[i] with nss[SA◦[i]]
in the case SA◦[i] ∈ L. Hence, SA◦′ also has an advantage compared to SA◦ when deriving BBWT:
There is no need to check whether an entry is in L, we simply have BBWT[i] = S[SA◦′[i] − 1].
Note that for each i ∈ [0 .. n), we have i ∈ L if and only if pss[i] = −1, thus checking whether i
is in L can be done in constant time. Let l1 < · · · < l|L| be the elements of L arranged in increasing
text order. By Lemma 2.4, we have l1 = 0, nss[li] = li+1 for each i ∈ [1 .. |L|), and nss[l|L|] = n.
Therefore, we can find nss[i] for each i ∈ L in a single left-to-right scan of pss and finding their
positions in SA◦ takes O(n) time (specifically how the current group ends are maintained and
queried in total O(n) time is explained in Section 3.3).
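
A sketch of this scan (our illustration; it assumes an unmarked pss-array, i.e., without the MSB markings introduced in Section 3.3):

#include <cstdint>
#include <utility>
#include <vector>

// Collect the Lyndon factor starts (pss[i] = -1) together with their nss
// values in one pass: nss[l_i] = l_{i+1}, and nss of the last start is n.
std::vector<std::pair<int64_t, int64_t>>
factor_starts_with_nss(const std::vector<int64_t>& pss) {
    const int64_t n = (int64_t)pss.size();
    std::vector<std::pair<int64_t, int64_t>> starts; // (factor start, its nss)
    for (int64_t i = 0; i < n; ++i)
        if (pss[i] == -1) {                  // i is in L
            if (!starts.empty()) starts.back().second = i;
            starts.push_back({i, n});        // last factor start gets nss = n
        }
    return starts;
}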

3.3 Optimising Phase II


In this section, we describe our optimisation of Phase II of Baier’s sorting principle. We use Algo-
rithm 1 as a starting point, refine it into a more concrete algorithm, and then alter the order in
which elements are inserted (to improve cache performance). Because we essentially optimise the
for-loop from Algorithm 1, all optimisations apply equally to the construction of SA and BBWT
as discussed at the end of the previous section.
Note that each element i ∈ [0 .. n − 1) has exactly one next smaller suffix, hence there is exactly
one j with i ∈ Pj and thus i is inserted exactly once into a new singleton group in Algorithm 1.
Therefore, for each group from the Lyndon grouping obtained in Phase I, it suffices to maintain a
single pointer to the current start of this group. In Baier [2015], these pointers are stored at the
end of each group in A.⁹ This leads to them being scattered in memory, potentially harming cache
performance. Instead, we store them contiguously in a separate array C, which improves cache
locality especially when there are few groups.
Besides this minor point, there are two major differences between our Phase II and Baier’s, both
are concerned with the iteration over a Pi -set.
The first difference is the way in which we determine the elements of Pi for some i. The follow-
ing observations immediately enable us to iterate over Pi .
Lemma 3.17. Pi is empty if and only if i = 0 or Si−1 <lex Si . Otherwise, i − 1 ∈ Pi .
Proof. P0 = ∅ by definition. Let i ∈ [1 .. n). If Si−1 >lex Si , then we have nss[i − 1] = i and
thus i − 1 ∈ Pi . Otherwise, (Si−1 <lex Si ), assume there is some j < i − 1 such that nss[j] = i. By
definition, S j >lex Si and S j <lex Sk for each k ∈ (j .. i). But by transitivity, we also have S j >lex Si−1 ,
which is a contradiction, hence Pi must be empty. 

Lemma 3.18. For some j ∈ [0 .. i), we have j ∈ Pi if and only if j’s last child is in Pi , or j = i − 1
and S j >lex Si .

9 Because the groups are filled from left to right, the pointer is overwritten exactly when the last element of the group is
inserted and no further access to the pointer is needed [Baier 2015].


Proof. By Lemma 3.17, we may assume Pi ≠ ∅ and j + 1 < i, otherwise the claim is trivially
true. If j is a leaf, then we have nss[j] = j + 1 < i, and thus j ∉ Pi by definition. Hence, assume j
is not a leaf and has j′ > j as last child, i.e., pss[j′] = j and there is no k > j′ with pss[k] = j. It
suffices to show that j′ ∈ Pi if and only if j ∈ Pi. Note that pss[j′] = j implies nss[j] > j′.
=⇒ : From nss[j′] = i and thus Sk >lex Sj′ >lex Sj (for all k ∈ (j′ .. i)), we have nss[j] ≥ i.
Assume nss[j] > i. Then Si >lex Sj and thus pss[i] = j, which is a contradiction.
⇐= : From Si <lex Sj <lex Sj′, we have nss[j′] ≤ i. Assume nss[j′] < i for a contradiction.
For all k ∈ (j .. j′), pss[j′] = j implies Sk >lex Sj′. Furthermore, for all k ∈ [j′ .. nss[j′]), we have
Sk >lex Snss[j′] by definition. In combination this implies Sk >lex Snss[j′] for all k ∈ (j .. nss[j′]). As
nss[j] = i > nss[j′], we hence have pss[nss[j′]] = j, which is a contradiction. □

Specifically, (if Pi is not empty) we can iterate over Pi by walking up the pss-tree starting from
i − 1 and halting when we encounter a node that is not the last child of its parent.¹⁰ Baier [2015]
tests whether i − 1 (pss[j]) is in Pi by explicitly checking whether i − 1 (pss[j]) has already been
written to A. This is done by having an explicit marker for each suffix [Baier 2015, 2016]. Reading
and writing those markers leads to bad cache performance, because the accessed memory locations
are hard to predict (for the CPU/compiler). Lemmata 3.17 and 3.18 enable us to avoid reading and
writing those markers. In fact, in our implementation of Phase II, the array A is the only memory
written to that is not always in the cache. Lemma 3.17 tells us whether we need to follow the pss-
chain starting at i − 1 or not. Namely, this is the case if and only if Si−1 >lex Si , i.e., i − 1 is a leaf in
the pss-tree. This information is required when we encounter i in A during the outer for-loop in
Algorithm 1, thus we mark such an entry i in A if and only if Pi ≠ ∅. Implementation-wise, we use
the most significant bit (MSB) of an entry to indicate whether it is marked or not. By definition,
we have Si−1 >lex Si if and only if pss[i] + 1 < i. Since pss[i] must be accessed anyway when i is
inserted into A (for traversing the pss-chain), we can insert i marked or unmarked into A. Further,
Lemma 3.18 implies that we must stop traversing a pss-chain when the current element is not the
last child of its parent. We mark the entries in pss accordingly, also using the MSB of each entry.
In the rest of this article, we assume the pss-array to be marked in this way.
Consider, for instance, i = 12 in our running example. As 12 − 1 = 11 is a leaf (cf. Figure 1),
we have 11 ∈ P12. We can deduce the fact that 11 is indeed a leaf from pss[12] = 8 < 11 alone.
If we had pss[12] = 11 instead, then we would have nss[11] ≠ 12 and thus 11 ∉ P12. Further, 11
is the last child of pss[11] = 9, so 9 ∈ P12. Since 9 is not the last child of pss[9] = 8, we have
P12 = {9, 11}.
These optimisations unfortunately come at the cost of 2n additional bits of working memory for
the markings. However, as they are integrated into pss and A there are no additional cache misses.
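
The marking scheme amounts to a handful of bit operations on 64-bit entries; a minimal sketch (our illustration):

#include <cstdint>

// MSB markings as described above: an entry of A is marked iff the set P_i
// of its value i is non-empty (i.e., i - 1 is a leaf), and the entry pss[p]
// stored at index p is marked iff p is the last child of its parent. The
// payload lives in the lower 63 bits, so testing a mark needs no extra
// memory access.
constexpr uint64_t MSB = uint64_t(1) << 63;

inline uint64_t set_mark(uint64_t v) { return v | MSB; }
inline bool     has_mark(uint64_t v) { return (v & MSB) != 0; }
inline uint64_t payload(uint64_t v)  { return v & ~MSB; }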
Let G[i] be the index of the group start pointer of i’s group in C. Phase II with our first major
improvement compared to Baier’s algorithm is shown in Algorithm 3.
The second major change concerns the cache-unfriendliness of traversing and inducing the Pi -
sets (i.e., the do-while loop in Algorithm 3). This bad cache performance results from the fact that
we have chains of memory accesses where the location of one entry depends on the entry pre-
ceding it in the chain (this is also known as pointer-chasing). For instance, we have to first know
G[p] before we can start to fetch C[G[p]], which in turn is required to start fetching A[C[G[p]]].
As each such location is essentially random, each access is likely to be a cache-miss. One obvi-
ous mitigation is to interleave the arrays pss and G such that G[p] and pss[p] are next to each

10 Note that n − 1 is the last child of the artificial root −1. This ensures that we always halt before we actually reach the root

of the pss-tree. Moreover, since the elements in Pi belong to different Lyndon groups (Corollary 3.5), the order in which
we process them is not important.


ALGORITHM 3: Concrete implementation of Phase II of GSACA. After execution, the array A is SA.
The array G maps each suffix to its Lyndon group and C maps the Lyndon groups that resulted from
Phase I to their current start. The correctness immediately follows from the correctness of Algorithm 1
and Lemmata 3.17 and 3.18.
A[0] ← n − 1;
for i = 0 → n − 1 do
if A[i] − 1 is a leaf in the pss-tree then // i.e., A[i] is marked
p ← A[i] − 1;
do
A[C[G[p]]] ← p;
C[G[p]] ← C[G[p]] + 1;
p ← pss[p];
while p is the last child of pss[p]; // i.e., pss[p] is marked
end
end

Note that we
have a prime example of pointer-chasing here, namely, the traversal of the pss-tree: The next
pss-value (and the corresponding group start pointer) cannot be fetched until the current one is
in memory.
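A minimal C++ sketch of the interleaving mentioned above (the struct and its members are illustrative assumptions, not the layout of the actual implementation):

#include <cstdint>
#include <vector>

// Sketch: storing pss[p] and G[p] adjacently so that one cache line
// delivers both values needed per step of the pointer-chasing loop.
struct Node {
    uint64_t pss; // parent in the pss-tree (its MSB may carry the mark)
    uint64_t G;   // index of p's group start pointer in C
};

// With std::vector<Node> nodes(n), the do-while loop of Algorithm 3 becomes
//     do { A[C[nodes[p].G]++] = p; p = nodes[p].pss; } while (...);
// and touches one cache line per node instead of two.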
We can almost entirely eliminate the cache-misses caused by pointer-chasing here: Instead of
traversing the Pi -sets one after another, we opt to traverse multiple such sets simultaneously in a sort of breadth-first-search manner. Specifically, we maintain a small (≤ 2^10 elements) queue Q of
elements (nodes in the pss-tree) that can currently be processed. Then, we iterate over Q and
process the entries one after another. Parents of last children are inserted into Q in the same order
as the respective children. After each iteration, we continue to scan over the suffix array and for
each encountered marked entry i insert i − 1 into Q until we either encounter an empty entry in
A or Q reaches its maximum capacity. This is repeated until the suffix array emerges. The queue
size could be unlimited, but limiting it ensures that it fits into the CPU’s cache. Figure 7 shows
our Phase II on the running example and Algorithm 4 describes it formally in pseudocode. The
benefit of this breadth-first search comes from the fact that we can start to fetch required data
a few iterations of the repeat-loop in Algorithm 4 before it is required, and thereby ensure that
it is in the CPU’s cache when needed. Note that this optimisation is only useful when the queue
contains many elements (i.e., s is large), otherwise there are not enough iterations of the repeat-
loop between inserting an element into the queue and inserting it into A, and we effectively have
Algorithm 1 with some additional overhead. Fortunately, in real world data this is usually the case
and the small overhead for maintaining the queue is more than compensated by the better cache
performance (cf. Section 4).
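The following C++ fragment sketches how such prefetching might look inside the repeat-loop (the queue layout, the helper names, and the prefetch distance are our illustrative assumptions, not the actual code):

#include <cstdint>
#include <vector>

// Sketch: process s queued elements while prefetching the pss/G entries
// of elements a fixed distance ahead, so they are cached when needed.
constexpr size_t FETCH_DISTANCE = 8; // illustrative value

void process_round(const std::vector<uint64_t>& Q, size_t s,
                   std::vector<uint64_t>& A, std::vector<uint64_t>& C,
                   const std::vector<uint64_t>& G,
                   const std::vector<uint64_t>& pss) {
    for (size_t k = 0; k < s; ++k) {
        if (k + FETCH_DISTANCE < s) {
            // Q[k + FETCH_DISTANCE] is already known, so its data can be
            // requested now and will have arrived when it is processed.
            __builtin_prefetch(&pss[Q[k + FETCH_DISTANCE]]);
            __builtin_prefetch(&G[Q[k + FETCH_DISTANCE]]);
        }
        uint64_t v = Q[k];
        A[C[G[v]]++] = v; // insert v and advance its old group's start
        // (marking and refilling of the queue omitted; cf. Algorithm 4)
    }
}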

Theorem 3.19. Algorithm 4 correctly computes the suffix array from a Lyndon grouping.

Proof. By Lemmata 3.17 and 3.18, Algorithms 1 and 4 are equivalent for a maximum queue
size of 1. Therefore, it suffices to show that the result of Algorithm 4 is independent of the queue
size. Assume for a contradiction that the algorithm inserts two elements i and j with Si <lex S j
belonging to the same Lyndon group with context α, but in a different order than Algorithm 1 would.
This can only happen if j is inserted earlier than i. Note that, since i and j have the same Lyndon
prefix α, the pss-subtrees Ti and T j rooted at i and j, respectively, are isomorphic (see Bille et al.
[2020]). In particular, the path from the rightmost leaf in Ti to i has the same length as the path

ALGORITHM 4: Breadth-first approach to Phase II. The constant w is the maximum queue size. When
s is large, it is possible to improve the cache-performance of the repeat-loop by prefetching the used data
a few loops ahead. This is not possible in Algorithm 3, because there the address of data accessed in one
iteration of the do-while-loop depends on the data accessed in the previous iteration (“pointer-chasing”).
A ← (n − 1)⊥n−1 ; // set A[0] = n − 1, fill the rest with ‘‘undefined’’
Q ← queue containing only n − 1;
i ← 1; // current index in A
while Q is not empty do
s ← Q.size();
repeat s times // insert elements that are currently in the queue
v ← Q.pop();
if pss[v] is marked then // v is last child of pss[v]
Q.push(pss[v]);
end
A[C[G[v]]] ← v; // insert v
if pss[v] + 1 < v then mark A[C[G[v]]]; // v − 1 is leaf
C[G[v]] ← C[G[v]] + 1; // increment start of v’s old group
end
while Q.size() < w ∧ i < n ∧ A[i] ≠ ⊥ do // refill the queue
if A[i] is marked then // A[i] − 1 is leaf
Q.push(A[i] − 1);
end
i ← i + 1;
end
end

from the rightmost leaf in T j to j. Thus, i and j are inserted in the same order as Si+ |α | and S j+ |α |
occur in the suffix array. Now the claim follows inductively. 

It is clear that Algorithm 4 has linear time complexity: Each index in [0 .. n) is inserted exactly
once into Q, so the repeat-loop runs exactly n times in total. The inner while-loop iterates from
i = 1 to n and thus runs exactly n − 1 times. In each iteration of the outer while-loop, at least one
element is removed from Q (because s > 0) and thus there can be no more than n iterations.
The amount of working memory (i.e., without input and output) is constant besides the memory
for C, G and pss (because the size of the queue Q is constrained by a constant). Both pss and G
occupy n words of memory, while C is dependent on the number of groups in the Lyndon grouping.
We apply an optimisation where for each element i of a singleton-group, G[i] is the group start
in A (instead of C[G[i]]) and thus no entry for i’s group in C is required. Therefore, C can have
at most n/2 entries, and in total we have at most 2.5n words of working memory. Note that in
most cases, the number of groups in a Lyndon grouping is small compared to the text size (cf.
Section 4).

3.4 Phase I
In Phase I, a Lyndon grouping is derived from a suffix grouping in which the group contexts
have length (at least) one. That is, the suffixes are sorted and grouped by their Lyndon prefixes.
Lemma 3.20 describes the relationship between the Lyndon prefixes and the pss-tree that is essen-
tial to Phase I of the grouping principle.

Fig. 7. Refining a Lyndon grouping for S = decedacebceece$ (see Figure 4) into the suffix array using Algorithm 4. Already processed elements are coloured light gray. Not yet processed and marked entries are coloured blue, while inserted but unmarked and unprocessed elements are coloured green. Note that the uncoloured entries are not actually present in the array A but only serve to indicate the current Lyndon grouping. The Lyndon prefixes are shown at the top for clarity.

Lemma 3.20. Let c_1 < · · · < c_k be the children of i ∈ [0 .. n) in the pss-tree as in Definition 2.1. Then L_i is S[i] concatenated with the Lyndon prefixes of c_1, . . . , c_k. More formally:

L_i = S[i .. nss[i])
    = S[i] S[c_1 .. c_2) · · · S[c_k .. nss[i])
    = S[i] L_{c_1} · · · L_{c_k}.

Proof. By definition, we have L_i = S[i .. nss[i]). Assume i has k ≥ 1 children c_1 < · · · < c_k in the pss-tree (otherwise nss[i] = i + 1 and the claim is trivial). For the last child c_k, we have nss[c_k] = nss[i] from Lemma 3.18. Let j ∈ [1 .. k) and assume nss[c_j] ≠ c_{j+1}. Then, we have nss[c_j] < c_{j+1}, otherwise c_{j+1} would be a child of c_j. As we have S_{nss[c_j]} <lex S_{c_j} and S_{c_j} <lex S_{c_{j′}} for each j′ ∈ [1 .. j) (by induction), we also have S_{nss[c_j]} <lex S_{i′} for each i′ ∈ (i .. nss[c_j]). Since nss[i] > nss[c_j], nss[c_j] must be a child of i in the pss-tree, which is a contradiction. □
We start from the initial suffix grouping in which the suffixes are grouped according to their first
characters. From the relationship between the Lyndon prefixes and the pss-tree in Lemma 3.20 one
can get the general idea of extending the context of a node’s group with the Lyndon prefixes of
its children (in correct order) while maintaining the sorting [Baier 2015]. Note that any node is
by definition in a higher group than its parent. Also, by Lemma 3.20 the leaves of the pss-tree
are already in Lyndon groups in the initial suffix grouping. Therefore, if we consider the groups
in lexicographically decreasing order (i.e., higher to lower) and append the context of the current
group to each parent (and insert them into new groups accordingly), then each encountered group
is guaranteed to be Lyndon [Baier 2015]. Consequently, we obtain a Lyndon grouping. Figure 8
shows this principle applied to our running example.
Formally, the suffix grouping satisfies the following invariant during Phase I before and after
processing a group:
Invariant 1. For any i ∈ [0 .. n) with children c_1 < · · · < c_k there is j ∈ [0 .. k] such that
— c_1, . . . , c_j are in groups that have already been processed,
— c_{j+1}, . . . , c_k are in groups that have not yet been processed, and
— the context of the group containing i is S[i] L_{c_1} · · · L_{c_j}.
Furthermore, each processed group is Lyndon.
Additionally and unlike in Baier’s original approach, all groups created during our Phase I are
either Lyndon or only contain elements whose Lyndon prefix is different from the group’s context.
This has several advantages, which are discussed below.
Definition 3.21 (Strongly Preliminary Group). We call a preliminary group G = ⟨g_s, g_e, |α|⟩ strongly preliminary if and only if G contains only elements whose Lyndon prefix is not α. A preliminary group that is not strongly preliminary is called weakly preliminary.
The following lemma shows that a weakly preliminary group can always be split into a (lower)
Lyndon group and a (higher) strongly preliminary group.
Lemma 3.22. For any weakly preliminary group G = ⟨g_s, g_e, |α|⟩ there is some g′ ∈ [g_s .. g_e) such that G′ = ⟨g_s, g′, |α|⟩ is a Lyndon group and G′′ = ⟨g′ + 1, g_e, |α|⟩ is a strongly preliminary group. Splitting G into G′ and G′′ results in a valid suffix grouping.

Proof. Let G = ⟨g_s, g_e, |α|⟩ be a weakly preliminary group. Let F ⊂ G be the set of elements from G whose Lyndon prefix is α. By Lemma 3.2, we have Si <lex Sj for any i ∈ F, j ∈ G \ F. Hence, splitting G into two groups G′ = ⟨g_s, g_s + |F| − 1, |α|⟩ and G′′ = ⟨g_s + |F|, g_e, |α|⟩ results in a valid suffix grouping. Note that, by construction, the former is a Lyndon group and the latter is strongly preliminary. □
For instance, in Figure 8 there is a group containing 2, 6, and 12 with context ce. However, 6 and
12 have this context as Lyndon prefix while 2 has ced. Consequently, 2 will later be moved to a
new group. Hence, when Baier [2015] (and Bertram et al. [2021]) create a weakly preliminary group
(in Figure 8 this happens while processing the Lyndon group with context e), we instead create
two groups, the lower (a Lyndon group) containing 6 and 12 and the higher (strongly preliminary)
containing 2.
During Phase I, we maintain the suffix grouping using the following data structures:
— An array A of length n containing the unprocessed Lyndon groups and the sizes of the
strongly preliminary groups.

Fig. 8. Refining the initial suffix grouping for S = decedacebceece$ (see Figure 4) into the Lyndon grouping. Elements in Lyndon groups are marked light gray or green, depending on whether they have been processed already. Note that the applied procedure does not entirely correspond to our algorithm for Phase I; it only serves to illustrate the general sorting principle. The Lyndon prefixes are shown at the top for clarity.

— An array I of length n mapping each element s to the start of the group containing it. We
call I [s] the group pointer of s.
— A list C storing the starts of the already processed Lyndon groups.
Note that C is an input to Phase II. Transforming the array I into G as required for Phase II is trivial
with the help of C after Phase I. In combination, C, G, and pss make up the entire input to Phase II,
and the contents of A can be discarded after Phase I.
These data structures are organised as follows. Let G = ⟨g_s, g_e, |α|⟩ be a group. For each s ∈ G, we have I[s] = g_s. If G is Lyndon and has not yet been processed, then we also have s ∈ A[g_s .. g_e] for all s ∈ G and A[g_s] < A[g_s + 1] < · · · < A[g_e]. If G is Lyndon and has been processed already, then there is some j such that C[j] = g_s. Otherwise, if G is (strongly) preliminary, then we have A[g_s] = g_e + 1 − g_s and A[k] = 0 for all k ∈ (g_s .. g_e].
In contrast to Baier, we keep the Lyndon groups in A sorted and additionally store the sizes of the strongly preliminary groups in A [Baier 2015, 2016]. The former makes finding the number of children a parent has in the currently processed group easier and faster. The latter makes the separate array of length n used by Baier for the group sizes obsolete [Baier 2015, 2016] and is made possible by the fact that we only write Lyndon groups to A.
As alluded to above, we follow Baier's approach and consider the Lyndon groups in lexicographi-
cally decreasing order while updating the groups containing the parents of elements in the current
group.

ALGORITHM 5: Phase I: Traversing the groups [Baier 2015, 2016]
g_e ← n − 1;
while g_e ≥ 0 do
g_s ← I[A[g_e]];
process group ⟨g_s, g_e, ⊥⟩;
g_e ← g_s − 1;
end

Note that in Algorithm 5, g_e is always the end of a Lyndon group. This is due to the fact that a child is by definition lexicographically greater than its parent. Hence, when a group ends at g_e and all suffixes in SA(g_e .. n) have been processed, the children of all elements in that group have been processed and it consequently must be Lyndon. Thus, Algorithm 5 actually results in a Lyndon grouping. For a formal proof see Baier [2015].
Of course, we have to explain how to actually process a Lyndon group. This is done in the rest
of this section.
Let G = ⟨g_s, g_e, |α|⟩ be the currently processed group and w.l.o.g. assume that no element in G has the root −1 as parent (we do not have the root in the suffix grouping, thus nodes with the root as parent can be ignored here). Furthermore, let 𝒜 be the set of parents of elements in G (i.e., 𝒜 = {pss[i] : i ∈ G, pss[i] ≥ 0}) and let G_1 < · · · < G_k be those (necessarily preliminary) groups containing elements from 𝒜. For each g ∈ [1 .. k] let α_g be the context of G_g.
As noted in Figure 8, we have to consider the number of children an element in 𝒜 has in G. Namely, if a node has multiple children with the same Lyndon prefix, then of course all of them contribute to its Lyndon prefix. This means that we need to move two parents in 𝒜, which are currently in the same group, to different new groups if they have differing numbers of children in G. For example, while processing the group with context e in Figure 8, node 9 has two children in this group, while node 6 only has one. Both are currently in the same group with context c. As dictated by Invariant 1, after processing the group with context e, 9 must be in a group with context cee and 6 must be in a (lower) group with context ce (because ce <lex cee).
Let 𝒜_ℓ contain those elements from 𝒜 with exactly ℓ children in G. Maintaining Invariant 1 requires that, after processing G, for some g ∈ [1 .. k] the elements in 𝒜_ℓ ∩ G_g are in groups with context α_g α^ℓ. Note that, for any ℓ < ℓ′, we have α_g α^ℓ <lex α_g α^{ℓ′}. Consequently, the elements in 𝒜_ℓ ∩ G_g must form a lower group than those in 𝒜_{ℓ′} ∩ G_g after G has been processed [Baier 2015, 2016]. To achieve this, first the parents in 𝒜_{|G|} are moved to new groups, then those in 𝒜_{|G|−1} and so on [Baier 2015, 2016].
We proceed as follows. First, determine 𝒜 and count how many children each parent has in G. Then, sort the parents according to these counts using a bucket sort. Because the elements of yet unprocessed Lyndon groups must be sorted in A, this sort must be stable. Further, partition each

Fig. 9. Shown is the memory layout during the bucket sort that is applied during the processing of a Lyndon group. The data in grey areas is irrelevant. p_1 < · · · < p_m are the elements in 𝒜 \ 𝒜_1 and k_i = key(p_i).

bucket into two sub-buckets, one containing the elements that should be inserted into Lyndon
groups and the other containing those that should be inserted into strongly preliminary groups.
Then, for the sub-buckets (in the order of decreasing count; for equal counts: first strongly pre-
liminary then Lyndon sub-buckets) move the parents into new groups.11 These steps will now be
described in detail.
For brevity, we refer to those elements in 𝒜 that have their last child in G as finalists. Partition 𝒜_ℓ into F_ℓ and N_ℓ, such that the former contains the finalists and the latter the non-finalists. To determine the aforementioned sub-buckets, we associate a key with each element in 𝒜 such that (stably) sorting according to these keys yields the desired partitioning. Specifically, for a fixed ℓ, let key(s) = 2ℓ for each s ∈ F_ℓ and key(s) = 2ℓ + 1 for each s ∈ N_ℓ.
As we need to sort stably, the bucket sort requires an additional array B of length |G|, and another array for the bucket counters.
Finding parents is done using the same pss array as in Phase II. Since A[g_s .. g_e] is sorted by increasing index, children of the same parent are in a contiguous part of A[g_s .. g_e]. Hence, we determine 𝒜 and the keys within one scan over A[g_s .. g_e]. Since in practice most elements have no sibling in the same Lyndon group, we treat those explicitly. Specifically, we move F_1 to A[g_s .. g_s + |F_1|) and N_1 to B(|G| − |N_1| .. |G|]. Parents with keys larger than two are written with their keys interspersed to B[1 .. 2(|𝒜| − |𝒜_1|)]. Interspersing the keys is done to improve the cache-locality and thus performance.
Then, we copy N_1 to A[g_s + |F_1| .. g_s + |𝒜_1|). Note that 𝒜_1 is now correctly sorted in A[g_s .. g_s + |𝒜_1|). Then, we sort 𝒜 \ 𝒜_1 by the keys. That is, we count the frequency of each key, determine the end of each key's bucket and insert the elements into the correct bucket in A[g_s + |𝒜_1| .. g_s + |𝒜|). Figure 9 shows how the data is organised during the sorting.
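A minimal C++ sketch of such a stable counting sort by key (for simplicity, this version fills buckets front-to-back, whereas the procedure above fills them from the end; it assumes the parents and their keys have already been collected into flat arrays, and all names are illustrative):

#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch: stable counting sort of the parents by their keys. Elements are
// scattered in input order, so equal keys keep their relative order; this
// preserves the required sortedness of unprocessed Lyndon groups.
// Assumes parents.size() == keys.size().
void stable_bucket_sort(const std::vector<uint64_t>& parents,
                        const std::vector<uint32_t>& keys,
                        std::vector<uint64_t>& out) {
    uint32_t max_key = 0;
    for (uint32_t k : keys) max_key = std::max(max_key, k);
    std::vector<size_t> bucket(max_key + 2, 0);
    for (uint32_t k : keys) ++bucket[k + 1];      // key frequencies
    for (size_t k = 1; k < bucket.size(); ++k)    // bucket start offsets
        bucket[k] += bucket[k - 1];
    out.resize(parents.size());
    for (size_t i = 0; i < parents.size(); ++i)   // stable scatter
        out[bucket[keys[i]]++] = parents[i];
}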

Reordering parents into Lyndon groups. Let 𝒜′ be a sub-bucket of length k containing only parents that will now be moved to a Lyndon group and whose context is extended by α^q for some fixed q. Note that the bucket sort ensures that 𝒜′ is sorted. Within each current preliminary group G′ = ⟨g′_s, g′_e, |β|⟩, the elements in 𝒜′ must be moved to a new Lyndon group following G′ \ 𝒜′. For each element s in 𝒜′, we decrement A[I[s]] (i.e., the size of the group currently containing s) and write s to A[I[s] + A[I[s]]]. Afterwards, the new group start must be set (iff G′ is now not empty) to I[s] + A[I[s]] (the start of the old group plus the remaining size of the old group). To determine whether G′ is now not empty, we mark inserted elements in A using the MSB. If A[I[s]] has the MSB set, then we do not need to change the group pointer I[s].

11 Note that Baier [2015] broadly follows the same steps (determine parents, sort them, move them to new groups accord-
ingly) [Baier 2015, 2016]. However, each individual step is different because of our distinction between strongly preliminary,
weakly preliminary and Lyndon groups.

Reordering parents into strongly preliminary groups. Let 𝒜′ be a sub-bucket of length k containing only parents that are now moved to a strongly preliminary group. The general procedure is similar to the reordering into Lyndon groups, but simpler. First, we decrement the sizes of the old groups. In a second scan over 𝒜′, we set the new group pointer as above, and in a third scan, we increment the sizes of the new groups.
Note that in the reordering step, we iterate two and three times, respectively, over the elements
in a sub-bucket and that in each scan the group pointers are required. Furthermore, the group
pointers are updated only in the last scan. As the group pointers are generally scattered in memory,
it would be inefficient to fetch them individually in each scan for two reasons. First, a single group
pointer could occupy an entire cache-line (i.e., we mostly keep and transfer irrelevant data in the
cache). Second, the memory accesses are unnecessarily unpredictable. To mitigate these problems,
we pull the group pointers into the temporary array B, that is, we set B[i] ← I[A[i + g_s]] for each i ∈ [0 .. |𝒜|). Of course, in this fetching of group pointers, we have the same problems as before, but during the actual reordering the group pointers can be accessed much more efficiently.
Note that in contrast to Baier [2015], we do not compute the parents on the fly during Phase I
but instead construct the pss-array in advance using the linear-time pss-construction algorithm of
Bille et al. [2020]. There are two reasons for this, namely, first, determining the parents on the fly
as done in Baier [2015] requires a kind of pointer jumping that is very cache-unfriendly and hence
slow; and second, it is not clear how to efficiently determine on the fly whether a node is the last
child of its parent.
Another difference that speeds up the algorithm is that we only write Lyndon groups to A.
This way, we do not have to rearrange elements in weakly preliminary groups when some of their
elements are moved to new groups. Furthermore, it is possible to have the elements in Lyndon
groups sorted in A, which makes determining the parents and their corresponding keys easier and
faster.
The time complexity of Phase I is clearly linear, because each index in [0 .. n) is in exactly one
group and each group is processed in time linear to its size (it is easily verified that each step
involved in processing a group – determining the parents, sorting them, and moving them to new
groups – takes linear time).
In terms of working memory, we require pss, I and C for maintaining the suffix grouping, where
pss and I each require n words of memory. As explained at the end of Section 3.3, C requires at
most m/2 words of memory when the already processed groups contain m elements in total.12
For processing a group with ℓ distinct parents, the only additional memory needed is the array B of length ℓ. It is clear that ℓ can be at most (n − m)/2, since those parents cannot be in already processed groups. Consequently, we require at most 2n + m/2 + (n − m)/2 = 2.5n words of working memory in total.
memory in total. Note that, because both the number of groups and the maximum size of a group
are usually relatively small in practice, we can often use the memory in A previously occupied by
already processed groups to store B and thus require less than the 2.5n words of working memory.

3.5 Initialisation
In the initialisation, the pss-array with markings must be computed. We use a variant of the linear-
time pss-construction algorithm of Bille et al. [2020], which we adapted to mark each last child i
using the most significant bit of pss[i]. (This modification is straightforward and will thus not be
explained here.)

12 Note that, although C needs to be a growable array with O(1) amortized append-complexity, we do not need to have reserved space for future expansion, because the last m entries of the array A are unoccupied.

Further, the initial suffix grouping needs to be constructed. That is, we must create two groups (buckets) for each character, one Lyndon and one strongly preliminary, where the Lyndon groups contain exactly the leaves. Note that i is a leaf if and only if i = n − 1 or i + 1 < n ∧ Si >lex Si+1
(otherwise pss[i +1] = i). During a right-to-left scan over S, it is possible in constant time to decide
for each i < n − 1 whether Si >lex Si+1 holds [Ko and Aluru 2003; Nong et al. 2009].13 Thus, we can
determine the size and start position for each group in O(n + σ ) time, where σ is the size of the
alphabet. In a second right-to-left scan over S, we can then set the references to the group starts
in the array I and write the leaves to SA.
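This right-to-left classification is the same mechanism as the L/S-type classification of SA-IS: whether Si >lex Si+1 holds depends only on S[i], S[i + 1], and the type of i + 1. A minimal C++ sketch (the function name is ours):

#include <string>
#include <vector>

// Sketch: leaf[i] == true iff i == n - 1 or S_i >lex S_{i+1}, i.e., iff
// i is a leaf of the pss-tree (an L-type suffix, cf. footnote 13).
std::vector<bool> classify_leaves(const std::string& S) {
    const size_t n = S.size();
    std::vector<bool> leaf(n);
    leaf[n - 1] = true; // n - 1 is a leaf by definition
    for (size_t i = n - 1; i-- > 0; ) {
        if (S[i] != S[i + 1])
            leaf[i] = S[i] > S[i + 1]; // a mismatch decides immediately
        else
            leaf[i] = leaf[i + 1];     // equal characters inherit the type
    }
    return leaf;
}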

3.6 Computing the Extended Burrows-Wheeler Transform


The original eBWT of a multiset M of strings is derived by taking the last characters of the con-
jugates of the strings in M arranged in infinite periodic order [Mantaci et al. 2005]. The original
definition of the eBWT assumed that M only contains primitive strings [Mantaci et al. 2005], but
as in Boucher et al. [2021] we do not have this restriction.
For convenience, we assume that M is given in concatenated form as a single string T and a list of lengths l_0, . . . , l_{|M|−1}, of which s_i = Σ_{r=0}^{i−1} l_r is the ith prefix sum, such that M = {T[s_0 .. s_1), T[s_1 .. s_2), . . . , T[s_{|M|−1} .. s_{|M|})}. (At the end of this section, we explain that we do not actually need M to be given in concatenated form.) Let n = |T| = s_{|M|}.
For each i ∈ [0 .. n) there is some j ∈ [0 .. |M|) such that s_j ≤ i < s_{j+1}. We call T[s_j .. s_{j+1}) the source string of i. With a slight abuse of notation, in this section, we denote by T_i the (i − s_j)th conjugate of the source string T[s_j .. s_{j+1}), i.e., T_i = T[i .. s_j + l_j) T[s_j .. i). As with the BBWT, we first define a total order on the indices [0 .. n) and then use that to define the eBWT.
Definition 3.23. We write i <ᵍinf j if and only if T_i <ω T_j, or root(T_i) = root(T_j) and i > j.
Note the similarity to Definition 3.7. Like for the BBWT, the relative order in the case of equal roots is irrelevant for the eBWT, but was chosen to conform with the lexicographical order of the Lyndon factors so that it is easier to adapt our algorithm to compute the eBWT.
We can then define the permutation sorting the indices according to <ᵍinf, analogous to the suffix array (lexicographic order of suffixes) or circular suffix array (infinite periodic order of the conjugates of the Lyndon factors):
Definition 3.24. Let GSA◦ be the (unique) permutation of [0 .. n) such that for all i ∈ [0 .. n − 1), we have GSA◦[i] <ᵍinf GSA◦[i + 1]. We call GSA◦ the generalised circular suffix array.14
For i ∈ [0 .. n) with s_j ≤ GSA◦[i] < s_{j+1} for some j, we then have

eBWT[i] =  T[s_{j+1} − 1],   if GSA◦[i] = s_j,
           T[GSA◦[i] − 1],   otherwise.        (2)
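Equation (2) translates directly into code; the following C++ sketch assumes that GSA◦ and the prefix sums s_0, . . . , s_|M| are available as flat arrays (names are ours):

#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Sketch: deriving the eBWT from GSA° via Equation (2). `s` holds the
// prefix sums s_0 < ... < s_|M| (including s_|M| = n), so the source
// string of a position p starts at the largest s_j <= p.
std::string ebwt_from_gsa(const std::string& T,
                          const std::vector<uint64_t>& GSA,
                          const std::vector<uint64_t>& s) {
    std::string out(T.size(), '\0');
    for (size_t i = 0; i < GSA.size(); ++i) {
        const uint64_t p = GSA[i];
        // j with s_j <= p < s_{j+1}, found by binary search.
        const size_t j = std::upper_bound(s.begin(), s.end(), p) - s.begin() - 1;
        out[i] = (p == s[j]) ? T[s[j + 1] - 1] // wrap around in the source string
                             : T[p - 1];
    }
    return out;
}

The array GSA◦′ introduced at the end of this section removes the case distinction (and the lookup of s_j) entirely, since then eBWT[i] = T[GSA◦′[i] − 1] always holds.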
Lemma 3.25. When each string in M is in canonical form and T[s_0 .. s_1) ≥lex T[s_1 .. s_2) ≥lex · · · ≥lex T[s_{|M|−1} .. s_{|M|}), the BBWT of T is equal to the eBWT of M.
Proof. This immediately follows from the fact that the Lyndon factors of T directly correspond to the roots of the strings in M (Lemma 3.28 below), Definitions 3.24 and 3.8, and Lemma 3.11. □
As a result of Lemma 3.25, Bannai et al. [2021] obtain their (linear-time) eBWT construction algorithm from their BBWT algorithm. However, computing the eBWT via
13 Note that the leaves (inner nodes) in the pss-tree are exactly the L-type (S-type) suffixes used in the SA-IS algorithm
[Nong et al. 2009].
14 This corresponds to the generalised conjugate array from Boucher et al. [2021].

Fig. 10. Shown are the generalised circular suffix array, the lpss-tree, and the pss_π(T)-tree for M = {b, bcbc, b, abcbc, bc} (cf. Figure 3), where the indices refer to the concatenation T = bbcbcbabcbcbc. Note that the structures of the lpss-tree and the pss_π(T)-tree are very similar, with the only difference being the order of the children of the artificial root (and of course the indices).

the BBWT requires two preprocessing steps: First, the minimum conjugates of the strings in M
must be found, and then these minimum conjugates must be sorted lexicographically. Boucher
et al. [2021] showed that the algorithm of Bannai et al. [2021] can be adapted such that these preprocessing steps are not necessary. We now demonstrate that our algorithm
can be adapted such that it computes the eBWT of M without the preprocessing step of sorting
the minimum conjugates. Although we still require the canonicalisation of the input strings, our
implementation is much faster than Boucher et al. [2021]’s (cf. Section 4).
Adapting our algorithm to compute GSA◦ . In the following, we assume that the strings in M are
in canonical form.
After canonicalising the input strings, the only algorithmic change compared to our SA◦ -
algorithm concerns the initialisation. Namely, it suffices to compute the pss-tree for each canoni-
calised input string separately.
Specifically, we construct an array lpss where lpss[i] points to the previous smaller suffix within the source string of i (i.e., each lpss-value is restricted to within the respective input string). Formally, for each i ∈ [0 .. n) with s_j ≤ i < s_{j+1}, we define

lpss[i] = max ({p ∈ [s_j .. i) | T[p .. s_{j+1}) <lex T[i .. s_{j+1})} ∪ {−1}).
Figure 10 shows the lpss-array for M = {b, bcbc, b, abcbc, bc}.
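For illustration, lpss can be obtained by running any pss-construction routine separately on each source string. A simple stack-based C++ sketch with plain suffix comparisons follows (quadratic in the worst case; the actual implementation uses the linear-time algorithm of Bille et al. [2020], and all names here are illustrative):

#include <cstdint>
#include <string>
#include <vector>

// Sketch: previous-smaller-suffix values restricted to the source string
// T[sj .. sj1), written to lpss[sj .. sj1). lpss must have size |T|.
void lpss_one_string(const std::string& T, size_t sj, size_t sj1,
                     std::vector<int64_t>& lpss) {
    auto suffix_less = [&](size_t a, size_t b) { // T[a..sj1) <lex T[b..sj1)?
        return T.compare(a, sj1 - a, T, b, sj1 - b) < 0;
    };
    std::vector<size_t> stack; // positions with lex. increasing suffixes
    for (size_t i = sj; i < sj1; ++i) {
        while (!stack.empty() && !suffix_less(stack.back(), i))
            stack.pop_back(); // pop suffixes not smaller than T[i..sj1)
        lpss[i] = stack.empty() ? -1 : (int64_t)stack.back();
        stack.push_back(i);
    }
}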
Intuitively, the correctness of our GSA◦ -algorithm (which is the same as our SA◦ -algorithm
except that it operates on the lpss-tree instead of the pss-tree) follows from the fact that it be-
haves exactly as our SA◦ -algorithm would on the concatenation of the strings in M arranged in

lexicographically decreasing order. This is because pss for this concatenation has the same struc-
ture as lpss (Lemma 3.32 below) and both Phase I and II operate only on the given pss-tree.
To actually prove this, we show that each created grouping is order-isomorphic to a grouping
created when running our SA◦ -algorithm on the concatenation of the strings in M sorted in de-
creasing lexicographic order. The following definition provides that isomorphism (the proof that
it actually is that order isomorphism is provided in Theorem 3.35).
Definition 3.26. Let π be the (unique) permutation of [0 .. n) such that, for each i, j ∈ [0 .. |M|), we have
— T[s_i .. s_i + l_i) <lex T[s_j .. s_j + l_j) implies π(s_i) > π(s_j),
— T[s_i .. s_i + l_i) = T[s_j .. s_j + l_j) ∧ i > j implies π(s_i) > π(s_j), and
— π(s_i + k) = π(s_i) + k for each k ∈ [0 .. l_i).
That is, permuting T according to π (denoted by π(T) = T[π⁻¹(0)] T[π⁻¹(1)] · · · T[π⁻¹(n − 1)]) results in the concatenation of the strings in M sorted in decreasing lexicographic order: The first two properties sort the start positions of the strings in M (with ties broken using the text positions), and the third property dictates that (the indices within) the strings in M should remain contiguous.
Figure 10 shows the permutation π , the permuted string π (T ) and the tree for pssπ (T ) for an
example.
Note that we do not need to compute π ; it only serves to prove that our approach is correct.
We claim that the permutation π is an order isomorphism in the sense that i <ᵍinf j if and only if π(i) <inf π(j), where <inf is taken with respect to π(T) (i.e., it compares the conjugates of Lyndon factors in π(T)). This claim is proven in Lemma 3.29. We say two groupings G_1, . . . , G_k and G′_1, . . . , G′_k are isomorphic if and only if i ∈ G_j implies π(i) ∈ G′_j and vice versa for each j ∈ [1 .. k]. We will show later that our SA◦-algorithm invoked with π(T) and pss_π(T) treats each π(i) exactly as the same algorithm invoked with T and lpss treats i. That is, applying our SA◦-algorithm to T and lpss results in GSA◦ by Lemma 3.25.
Lemma 3.27. For two different Lyndon words u and w, we have u <lex w if and only if uⁿ <lex wᵐ for all n, m ∈ ℕ₊.
Proof. Let u and w be Lyndon words with u <lex w and consider any n, m ∈ ℕ₊. We show that u <lex w implies uⁿ <lex wᵐ; the other direction then follows from symmetry (if uⁿ ≤lex wᵐ, then u >lex w cannot be true, because it would imply uⁿ >lex wᵐ).
If u is not a prefix of w, then there is a mismatching character and uⁿ <lex wᵐ follows trivially. Thus, assume w = uv for a non-empty v. By definition of the lexicographic order, we have uⁿ⁻¹ <lex uⁿ, and since w is a Lyndon word, we also have v >lex w = uv. Lemma 3.2 then implies uⁿ <lex wᵐ. □
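For intuition, take u = ab and w = abb: we have u <lex w, and indeed, e.g., u³ = ababab <lex abbabb = w², since any power of u continues with a after the common prefix ab, while any power of w continues with b.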
Lemma 3.28. The Lyndon factors of π(T) are exactly the roots of the strings in M. Formally, for each j ∈ [0 .. |M|) and k ∈ [0 .. p), π(T)[π(s_j) + k·ℓ .. π(s_j) + (k + 1)·ℓ) is a Lyndon factor of π(T), where ℓ is the length of the root of T[s_j .. s_j + l_j) and p = l_j/ℓ is its exponent.
Proof. Since the strings in M are in canonical form, their roots are Lyndon words. Now consider any two strings T[s_j .. s_j + l_j) and T[s_{j′} .. s_{j′} + l_{j′}) in M with T[s_j .. s_j + l_j) <lex T[s_{j′} .. s_{j′} + l_{j′}). By Lemma 3.27 this implies root(T[s_j .. s_j + l_j)) <lex root(T[s_{j′} .. s_{j′} + l_{j′})). By Definition 3.26, we also have π(s_j) > π(s_{j′}). Therefore, π(T) consists of the roots of the strings in M arranged in lexicographically non-increasing order. By the Chen-Fox-Lyndon theorem (Theorem 2.5), those roots are the Lyndon factors of π(T). □


Note that we now have proven Lemma 3.25. The following lemma is slightly stronger than Lemma 3.25 and shows that π is actually an order isomorphism.
Lemma 3.29. For any i, j with i <ᵍinf j, we have π(i) <inf π(j), where <inf refers to π(T).
Proof. Consider any i and j with i <ᵍinf j. By Definition 3.23, we have either T_i <ω T_j, or root(T_i) = root(T_j) and i > j. By Lemma 3.28, the conjugates of Lyndon factors of π(T) starting at π(i) and π(j) are root(T_i) and root(T_j), respectively. According to Definition 3.7 this implies π(i) <inf π(j) in both aforementioned cases and thus the claim. □
Corollary 3.30. When SA◦ is the circular suffix array of π(T), we have SA◦[i] = π(GSA◦[i]) for each i ∈ [0 .. n).
The following simple lemma shows that the relative lexicographic order of two suffixes originating from the same Lyndon factor can be decided using only that Lyndon factor. We will use it to prove that the structural similarity between pss_π(T) and lpss observed in Figure 10 is not coincidental.
Lemma 3.31. Let T[s .. e) be a Lyndon factor of T. For any i, j ∈ [s .. e), we have T[i .. e) <lex T[j .. e) if and only if T[i .. n) <lex T[j .. n).
Proof. Consider some i, j with T [i .. e) <lex T [j .. e). We show that this implies T [i .. n) <lex
T [j .. n), the other direction follows from symmetry.
If i < j, then T [i .. e) cannot be a prefix of T [j .. e), which implies T [i .. n) <lex T [j .. n). Now
assume i > j and T [i .. e) is a prefix of T [j .. e) (otherwise the claim holds trivially as in the previous
case). Note that T [i .. n) <lex T [j .. n) holds if and only if T [e .. n) <lex T [j + (e − i) .. n). Since
T [s .. e) is a Lyndon word, we have T [j + (e − i) .. e) >lex T [s .. e), and since T [s .. e) is also a Lyndon
factor, we have T [s .. n) >lex T [e .. n). In combination, this gives T [j + (e − i) .. n) >lex T [s .. n) >lex
T [e .. n) and thus the claim. 

Lemma 3.32. Let pssπ (T ) be the pss-array of π (T ). Then


 for each i ∈ [0 .. n), we have pssπ (T ) [i] =
−1
−1 if and only if lpss[π (i)] = −1 and pss π (T ) [i] = π lpss[π −1 (i)] otherwise.

Proof. By Lemma 3.28, pss_π(T) is solely determined by the roots of the strings in M and their frequencies. Therefore, we may assume for ease of notation that all strings in M are primitive (and are thus Lyndon words). Now, consider some i ∈ [0 .. n) with π(s_j) ≤ i < π(s_j) + l_j. Note that π(T)[π(s_j) .. π(s_j) + l_j) is a Lyndon factor of π(T) by Lemma 3.28. There are two cases:
— If i = π(s_j), then we have lpss[π⁻¹(i)] = −1 by definition of lpss. By simple properties of the Lyndon factorisation, the suffix of π(T) starting at i has no previous smaller suffix and the claim holds.
— Otherwise (i.e., π(s_j) < i < π(s_j) + l_j), we have π(s_j) ≤ π(p) < i for p := lpss[π⁻¹(i)], since T[s_j .. s_j + l_j) = π(T)[π(s_j) .. π(s_j) + l_j) is a Lyndon word. Lemma 3.31 then implies pss_π(T)[i] = π(p) and thus the claim. □
As with the BBWT, we now formally define the indices of the next smaller conjugates, and later
(Lemma 3.34) show that they correspond to nsc of π (T ) as defined for the BBWT in Section 3.2.
Definition 3.33 (nsc). Let p_1 < · · · < p_k be exactly the indices where lpss is −1 and set p_{k+1} = n for convenience. For i ∈ [0 .. n) with p_j ≤ i < p_{j+1}, we define nsc(i) as the next smaller conjugate of T_i, i.e.,

nsc(i) =  p_j,                                     if nss_{T[p_j .. p_{j+1})}[i − p_j] = p_{j+1} − p_j,
          p_j + nss_{T[p_j .. p_{j+1})}[i − p_j],  otherwise.

In the first case, the root of the input word i belongs to (the Lyndon word T[p_j .. p_{j+1})) has no non-empty smaller suffix after i, and thus the next smaller conjugate is the root itself. In the second case, this next smaller suffix coincides with the next smaller conjugate (as implied by Lemma 3.31). Note that we define the next smaller conjugate of a Lyndon word to be the Lyndon word itself, i.e., nsc(p_j) = p_j for each j.
Similar to lpss and pss_π(T), π relates nsc and nsc_π(T):

Lemma 3.34. We have nsc(i) = π⁻¹(nsc_π(T)(π(i))).

Proof. By Lemma 3.31, we have nss_{T[p_j .. p_{j+1})}[i − p_j] = p_{j+1} − p_j if and only if nss_π(T)[π(i)] = π(p_j) + p_{j+1} − p_j. In this case, we have nsc_π(T)(π(i)) = π(p_j) by Definition 3.9.
Otherwise, there is i′ ∈ (i .. p_{j+1}) such that nss_{T[p_j .. p_{j+1})}[i − p_j] = i′ − p_j. Lemma 3.31 then implies nss_π(T)[π(i)] = π(i′) and therefore nsc_π(T)(π(i)) = π(i′) according to Definition 3.9.
In either case, the claim immediately follows from Definition 3.33. □

Now, we have shown that π relates lpss to pss_π(T), nsc to nsc_π(T), and GSA◦ to the SA◦ of π(T), and we are in a position to show that our proposed GSA◦-algorithm is correct.
Theorem 3.35. Using our algorithm for SA◦ on T with lpss instead of pss results in GSA◦.

Proof. We will show that π provides an isomorphism between the mth grouping G_{m,1}, . . . , G_{m,k_m} created during our proposed algorithm for GSA◦ and the mth grouping G′_{m,1}, . . . , G′_{m,k_m} created during our algorithm for SA◦, where G′_{m,i} is defined as the group containing {π(j) | j ∈ G_{m,i}}. Corollary 3.30 implies that the groups defined as such are valid according to Definition 3.1 (with SA of course swapped with SA◦ or GSA◦) and that this is sufficient for the correctness of our algorithm for GSA◦.
For Phase I, we show this inductively. In the initial suffix grouping, the group in which an index i is placed purely depends on the ith character and whether i is a leaf in the lpss-tree. Therefore, Lemma 3.32 and Corollary 3.30 imply that the claim holds for the initial suffix groupings (m = 0).
Now consider the ith (1-based) iteration of Algorithm 5, that is, we have a grouping G_{i,1}, . . . , G_{i,k_i} and process group G_{i,k_i+1−i} = ⟨g_s, g_e, |α|⟩. By assumption, we have {π(GSA◦[j]) | j ∈ [g_s .. g_e]} = {SA◦[j] | j ∈ [g_s .. g_e]} beforehand. By Lemma 3.32, π provides a bijection between the multisets of parents, i.e., between {lpss[GSA◦[j]] | j ∈ [g_s .. g_e]} and {pss_π(T)[SA◦[j]] | j ∈ [g_s .. g_e]}. Since the changes to the grouping depend on those parents and the number of children in G_{i,k_i+1−i} (or G′_{i,k_i+1−i}) only, the groupings are still isomorphic after the ith iteration.15

Now consider Phase II, specifically the first for-loop in Algorithm 2, and let i ∈ G_j be such that lpss[i] = −1, where G_j = ⟨g_s, g_e, |α|⟩ is a group in the grouping resulting from Phase I. By Corollary 3.30, it suffices to show that i is inserted at index SA◦⁻¹[π(i)]. By Lemma 3.32 and Corollary 3.30, we have pss_π(T)[π(i)] = −1 and g_s ≤ SA◦⁻¹[π(i)] ≤ g_e, respectively. Thus, it suffices to show that for

15 Note that A[g_s .. g_e] is sorted at the beginning of the ith iteration. When computing SA or SA◦, this fact is used to determine the number of children a parent has in the current group. By the definition of π, children of the same parent are still consecutive in A[g_s .. g_e] and thus this approach can be used here too.


each i′ ∈ G_j with lpss[i′] = −1, we have i < i′ if and only if π(i) < π(i′) (the for-loop iterates from left to right). Since G_j is a Lyndon group and lpss[i] = lpss[i′] = −1, we have root(T_i) = root(T_{i′}), which, in combination with Definition 3.26, implies that i < i′ holds if and only if π(i) < π(i′) holds.
Now consider the second for-loop in Algorithm 2. By Definitions 3.14 and 3.33 and Lemmata 3.34, 3.17 and 3.18, we have for each i ∈ [0 .. n)

P_{π(i)} = {π(lpss[i]), π(lpss²[i]), . . . , π(lpss^q[i])},

where q is maximal such that lpss^k[i] is the last child of lpss^{k+1}[i] for each k ∈ [0 .. q). It inductively follows that applying an implementation of the second for-loop in Algorithm 2 relying on Lemmata 3.17 and 3.18 to compute the P_i-sets (such as Algorithms 3 and 4) with lpss instead of pss results in GSA◦ according to Lemma 3.29. □

Similar to our BBWT algorithm, we compute an array GSA◦′ that can be derived from GSA◦ by replacing s_j with s_{j+1}. Formally:

GSA◦′[i] =  s_{j+1},    if there is some j with s_j = GSA◦[i],
            GSA◦[i],    otherwise.

This is done for the same reasons noted at the end of Section 3.2, namely, first, that deriving the eBWT from GSA◦′ is simpler than deriving it from GSA◦ (cf. Equation (2)), and second, that it simplifies Phase II.
Implementation notes. Finding the minimum conjugate of a string can be done in linear time,
specifically using at most n + d/2 character comparisons, where d is the length of the string’s
root [Shiloach 1981]. The algorithm of Shiloach [1981] is a two-stage algorithm. The first stage
rules out indices that cannot (uniquely) correspond to minimum conjugates at a rate of one index
per comparison, but it can exclude at most half the indices. The second stage then finds the answer
among the remaining indices and takes up to two comparisons per remaining index [Shiloach 1981].
We only use the second stage of Shiloach [1981]’s algorithm (which requires at most 2n character
comparisons), since it is already quite fast (cf. Section 4) and much simpler to implement.
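For illustration, here is a C++ sketch of a linear-time least-rotation routine in the same spirit (this is the classic two-pointer technique, not Shiloach's algorithm that the implementation actually uses; the function name is ours):

#include <algorithm>
#include <string>

// Sketch: index of the lexicographically least conjugate of s in O(n).
// Two candidate start positions i and j race; after a mismatch at offset
// k, the losing candidate cannot start a least rotation anywhere within
// the next k + 1 positions and jumps past them.
size_t least_rotation(const std::string& s) {
    const size_t n = s.size();
    size_t i = 0, j = 1, k = 0;
    while (i < n && j < n && k < n) {
        const char a = s[(i + k) % n], b = s[(j + k) % n];
        if (a == b) { ++k; continue; }          // extend the common prefix
        if (a > b) i += k + 1; else j += k + 1; // loser jumps ahead
        if (i == j) ++j;                        // keep the candidates distinct
        k = 0;
    }
    return std::min(i, j); // the surviving candidate
}

Canonicalising a string in M then amounts to rotating it so that it begins at the returned index.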
The index set I, i.e., the set of indices where GSA◦ contains the start index of a string in M (before canonicalisation), can be trivially computed from GSA◦ and the indices of the smallest rotations of the input strings in O(n) using, e.g., the memory already allocated for G (cf. Section 3.3).
It is not necessary to use additional working space for the string T composed of the canonicalised
strings in M. Note that T is only needed in three places: computing lpss, setting up the initial
group structure, and deriving the eBWT from GSA◦ . Setting up the initial group structure simply
involves two scans over each input string (cf. Section 3.5) and thus can be trivially executed on the
unmodified string. Since before computing lpss and after computing GSA◦ the array containing
the references to the current groups (called I in Phase I/Section 3.4 and G in Phase II/Section 3.3)16
is not required, we can use it to temporarily store T (the concatenated canonicalised strings from
M). Note that this is also the reason why we do not need M to be given in concatenated form;
we do not immediately operate on M as given anyway. While needed, the indices of the smallest
rotations of the input strings can be stored in the memory designated for returning the index set

16 The names are different because there is no immediate relationship between them. Specifically, the way the groups are
referenced is different: in Phase I such a reference is the start of the group in SA/SA◦ /GSA◦ , in Phase II it is an index to an
array that in turn contains the current group start.


I. Therefore, our eBWT and BBWT construction algorithms require exactly the same amount of
working memory.
Since canonicalising the input strings and computing lpss can be done in linear time, the linear
time complexity of our eBWT algorithm immediately follows from the linear time complexity of
our BBWT algorithm.

4 EXPERIMENTS
Our implementation FGSACA of the optimised GSACA, as well as our BBWT and eBWT construction algorithms, is publicly available.17
We compare our SACA with the GSACA implementation by Baier [2015, 2016] and the double
sort algorithms DS1 and DSH by Bertram et al. [2021]. The latter two also use the grouping prin-
ciple but employ integer sorting and have super-linear time complexity. DSH differs from DS1 only
in the initialisation: in DS1 the suffixes are sorted by their first character while in DSH up to eight
characters are considered [Bertram et al. 2021]. We further include DivSufSort 2.0.2, since it is used as a reference by Bertram et al. [2021], as well as libsais 2.7.3 and
sais-lite 2.4.1.18 Both libsais and sais-lite are implementations of the SA-IS algorithm
[Nong et al. 2009], but the former uses cache-prefetching techniques to outperform sais-lite
(and all other SACAs known to us) on real-world data.
We compare our eBWT construction algorithm with PFP-eBWT by Boucher et al. [2021], since
PFP-eBWT is “the only tool up to date that computes the eBWT according to the original definition”
[Cenzato and Lipták 2022], and with cais by the same authors, because we believe that it is a fairer
comparison: Like our algorithm, cais computes GSA◦ and derives the eBWT from that, while
PFP-eBWT uses Prefix-Free Parsing and applies cais to the parse and derives the eBWT from the
result and the lexicographically sorted dictionary [Boucher et al. 2021]. For extremely repetitive
input (such as genomes from individuals of the same species), the total length of the parse and the
dictionary are expected to be significantly smaller than the text itself, thus PFP-eBWT is expected
to be faster on such data. Note that this means that it is possible to use our algorithm in PFP-eBWT
instead of cais.
Besides cais, we are only aware of two non-trivial BBWT construction algorithms, namely,
Yuta Mori’s algorithm in OpenBWT 2.0.019 and mk-bwts by Neal Burns.20 The former is claimed by
the author to have linear time complexity and seems to be an adaptation of the SA-IS suffix sorting
algorithm, while the latter modifies an already computed suffix array until BBWT can be derived in
the same way as we derive it from SA◦ [Bannai et al. 2021]. Originally, mk-bwts uses DivSufSort
for suffix sorting, but since libsais is much faster, we instead use it in our comparison.
All experiments were conducted on a Linux-5.4.0 machine with an AMD EPYC 7742 processor
and 256GB of RAM. All algorithms were compiled with GCC 10.3.0 with flags -O3 -funroll-loops
-march=native -DNDEBUG or—if applicable—the flags that they were distributed with. Since
PFP-eBWT is a multi-stage program that reads data from and writes data to disk (an SSD in our
case) in between stages, we report both the wall clock time (as with the other algorithms) and
the system time (the CPU time spent in the kernel, e.g., for file access).21 We report the maximum
resident memory as working memory (measured via the GNU Time utility). The data written to

17 https://gitlab.com/qwerzuiop/lfgsaca
18 https://github.com/kurpicz/sais-lite-lcp,
last accessed: July 17, 2023 (the original source website is no longer accessible).
19 https://encode.su/attachment.php?attachmentid=1405, last accessed: September 23, 2022.
20 https://github.com/NealB/Bijective-BWT, last accessed: September 23, 2022, we used the implementation in

mk_bwts_new_algo.c.
21 The amount of CPU time was always at least 99% of the wall clock time, i.e., the program was not held back significantly

by I/O.


Table 1. For Each File, the Number of Characters n, the Size of the Alphabet σ , Compressibility Ratio r /n,
the Number of Lyndon Factors, and the Number of Groups in the Lyndon Grouping Are Given
Name n σ r /n #Lyndon Factors #Groups in Lyndon grouping
PC-Real
dblp.xml 296 135 874 97 0.138577 15 11 905 017
dna 403 927 746 16 0.602813 18 37 938 838
english.1024MB 1 073 741 824 237 0.337027 18 56 358 829
pitches 55 832 855 133 0.419645 39 8 126 704
proteins 1 184 051 855 27 0.373175 30 86 021 741
sources 210 866 607 230 0.227143 31 18 525 128
PC-Rep-Real
cere 461 286 644 5 0.0250921 21 3 780 348
coreutils 205 281 778 236 0.0228196 17 2 520 129
einstein.de.txt 92 758 441 117 0.00109282 21 57 665
einstein.en.txt 467 626 544 139 0.000620658 59 139 082
Escherichia_Coli 112 689 515 15 0.133504 13 3 358 636
influenza 154 808 555 15 0.0195262 10 1 584 290
kernel 257 961 616 160 0.0108209 32 1 228 151
para 429 265 758 5 0.0364267 1238 4 681 664
world_leaders 46 968 181 89 0.0122101 13 263 989
PC-Rep-Art
fib41 267 914 296 2 7.465 · 10−9 21 22
rs.13 216 747 218 2 3.506 · 10−7 27 52
tm29 268 435 456 2 2.980 · 10−7 41 67
Manzini’s
chr22.dna 34 553 758 5 0.585923 20 3 708 060
etext99 105 277 340 146 0.393343 15 7 478 014
gcc-3.0.tar 86 630 400 150 0.187271 3184 11 969 022
howto 39 422 105 197 0.321702 27 4 603 415
jdk13c 69 728 899 113 0.0638505 20 1 962 643
linux-2.4.5.tar 116 254 720 256 0.221157 4888 14 753 797
rctail96 114 711 151 93 0.146041 30 3 925 759
rfc 116 421 901 120 0.215815 12 13 201 175
sprot34.dat 109 617 186 66 0.252751 17 10 474 676
w3c2 104 201 579 256 0.0766359 51 3 146 092
The compressibility is measured as the number of runs r in the BWT (or eBWT in the case of string collections) divided
by n (i.e., a larger number indicates worse compressibility).

(or read from) disk is not counted as working-memory.22 For all other algorithms, we first load the
text into RAM and allocate the memory for the output (either SA, eBWT or BBWT) and then start
measuring the wall clock time. For these, the working memory is the maximum amount of allo-
cated memory excluding the memory for input and output. The memory for the input is read-only.
Each algorithm was executed five times on each test case, and we use the mean as the final result.
We evaluated the SA- and BBWT-algorithms on data from the Pizza & Chili corpus23 and
Manzini’s Corpus.24 We also include strings from the artificial skyline string family, for which
the SA-IS algorithm reaches maximum recursion depth [Bingmann et al. 2016]. Moreover, we
22 We invoked PFP-eBWT with default parameters, and in the case of reads and random2_6 additionally with the --reads flag (because the program would not work otherwise).


23 http://pizzachili.dcc.uchile.cl/, last accessed: September 14, 2023.
24 https://people.unipmn.it/manzini/lightweight/corpus/, last accessed: September 14, 2023.


Table 2. See Table 1 for the Legend


Name n σ r /n #Lyndon Factors #Groups in Lyndon Grouping
Random
random8 100,000,000 26 0.961554 23 18,486,215
random9 1,000,000,000 26 0.961535 25 164,950,425
Skyline
skyline26.txt 67,108,864 27 0 27 27
skyline28.txt 268,435,456 29 0 29 29
aab, bba
aab.8 100,000,000 2 0 1 100,000,000
bba.8 100,000,000 2 0 100,000,000 2
Large
enwik10 10,000,000,000 211 0.26235 55 542,391,487
GRCh38.twice 6,679,478,218 63 0.270486 29 272,233,996
String collections
reads 1,010,000,000 5 0.267418 10,000,300 55,978,164
covid4 297,920,341 11 0.000712123 10,000 190,960
articles3 36,167,198 67 0.346011 1,000 3,008,821
articles4 368,700,066 67 0.322013 10,000 25,131,940
articles5 2,265,842,952 67 0.268275 100,000 129,583,796
random2_6 100,000,000 26 0.961553 1,000,000 18,486,208
random4_4 100,000,000 26 0.961553 10,000 18,486,169
random6_2 100,000,000 26 0.961554 100 18,486,211
Note that for the eBWT, the number of Lyndon factors refers to the canonicalised strings from the input and thus
corresponds to the number of sequences, except for reads, which contains a few non-primitive strings. For our
eBWT-algorithm, this number has the same relevance as the number of Lyndon factors for our SACA (cf. Section 3.6).

include the strings a^{n−1}b and b^{n−1}a with n = 10^8 (aab.8, bba.8), which maximise the number of unique Lyndon prefixes (and thus the number of groups in the Lyndon grouping) and the number of Lyndon factors, respectively. To test the algorithms on large inputs for which 32-bit integers are not sufficient, we also use a dataset containing larger texts, namely, the first 10^10 bytes of the English Wikipedia dump from June 1, 2022,25 and the human DNA concatenated with itself.26
The eBWT-algorithms were evaluated on 10^4 SARS-CoV-2 genomes (covid4),27 a set of 10^7 real reads with 101 bp each (reads),28 the first 10^3, 10^4 and 10^5 pages from the above Wikipedia dump (articles3, articles4 and articles5),29 and three sets of random strings random2_6, random4_4 and random6_2, where randomi_j consists of 10^j strings of length 10^i each.30 No string collection contained duplicate strings.
An overview of the test data is provided in Tables 1 and 2.
For each of the datasets and algorithms, we determined the average time and memory used,
relative to the input size.

25 https://dumps.wikimedia.org/enwiki/20220601/, last accessed: June 3, 2022.


26 https://www.ncbi.nlm.nih.gov/assembly/GCF_000001405.40, last accessed: June 3, 2022.
27 https://www.covid19dataportal.org, last accessed: July 17, 2023.
28 The first 10^7 unique reads from ftp://ftp.sra.ebi.ac.uk/vol1/fastq/SRR769/SRR769545/SRR769545_2.fastq.gz, last accessed:

September 14, 2023.


29 We used the sections enclosed in the <page>-tags, converted all lowercase letters to uppercase and removed all non-

graphical characters, because PFP-eBWT also turns lowercase letters to uppercase and disallows certain non-graphical
characters.
30 Each character was chosen uniformly at random from {“a”, . . . , “z”}.

Table 3. Time for Constructing the Suffix Array (and Phases, If Applicable)

[Table 3 lists, for each file of the PC-Real, PC-Rep-Real, and PC-Rep-Art datasets, the running times of FGSACA (split into pss, init, Phase I, Phase II, and total), of GSACA, DS1 and DSH (each split into init, Phase I, Phase II, and total), and the total times of DivSufSort, sais-lite, and libsais.]

All times are given in seconds per 100 · 2^20 characters (100 MiB). For libsais, sais-lite and DivSufSort the time given is the total time.

For the suffix array construction algorithms, the times are shown in Tables 3 and 4. In general,
all linear-time algorithms were faster on the more repetitive datasets, on which the differences
between those algorithms were also smaller.
On all datasets, FGSACA requires between 5.6% (bba.8) and 70% (einstein.en.txt) less time
than GSACA, with the mean and median speed-up being 59% and 64%, respectively. Compared to
DSH, FGSACA is between 11% (einstein.en.txt) and 66% (tm29) faster, with the mean and median
being 30% and 26%, respectively.
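(Read as relative time savings, these figures have the worked form speed-up = 1 − t_FGSACA/t_GSACA; e.g., the 70% for einstein.en.txt means that FGSACA needs only 30% of GSACA's time on that file.)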
DS1 is in most cases slower than DSH; the biggest outliers are the random texts, where the initialisation of DSH is several times slower (per input symbol) than it is on other texts.
Especially notable is the difference in the time required for Phase II: Our Phase II is up to 77%
faster (tm29) than Phase II of DSH with a mean of 56% and a median of 60%. (The only case where
our Phase II is slower than DSH’s is aab.8.) Our Phase I is also significantly faster than Phase I of
DS1 (median: 44%). Conversely, Phase I of DSH is much faster than our Phase I (median: 25%). However, this is only due to DSH's more elaborate construction of the initial suffix grouping, as demonstrated by the much slower Phase I of DS1 (which is identical to Phase I of DSH).
GSACA, DS1 and DSH, we need to compute the pss-array before Phase I, which takes between 6%
(skyline28.txt) and 33% (world_leaders) of the total time (median: 16%). (GSACA also computes
pss, but on the fly during Phase I and without the markings for the last children.)
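
To illustrate what pss contains, the following minimal sketch (our illustrative naming, not the linear-time routine we actually use) computes pss[i] as the nearest position j < i whose suffix is lexicographically smaller than suffix i (n if none), using the classic previous-smaller-value stack; it omits the last-child markings mentioned above:

    // Hedged sketch: pss[i] = largest j < i such that the suffix starting
    // at j is lexicographically smaller than the suffix starting at i
    // (n if no such j exists). Worst-case superlinear because of the
    // naive suffix comparisons; the actual implementation is linear-time.
    #include <cstdint>
    #include <string>
    #include <vector>

    std::vector<std::uint32_t> compute_pss(const std::string& s) {
        const auto n = static_cast<std::uint32_t>(s.size());
        auto suffix_less = [&](std::uint32_t a, std::uint32_t b) {
            return s.compare(a, n - a, s, b, n - b) < 0;
        };
        std::vector<std::uint32_t> pss(n);
        std::vector<std::uint32_t> stack;
        for (std::uint32_t i = 0; i < n; ++i) {
            // pop suffixes that are not smaller than suffix i
            while (!stack.empty() && !suffix_less(stack.back(), i))
                stack.pop_back();
            pss[i] = stack.empty() ? n : stack.back();
            stack.push_back(i);
        }
        return pss;
    }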


Table 4. Table 3 Continued

Columns as in Table 3. Rows: the datasets of Manzini's corpus (chr22.dna, etext99, gcc-3.0.tar, howto, jdk13c, linux-2.4.5.tar, rctail96, rfc, sprot34.dat, w3c2), Random (random8, random9), Skyline (skyline26.txt, skyline28.txt), aab/bba (aab.8, bba.8) and Large (enwik10, GRCh38.twice).

Note that Baier's GSACA-implementation, DS1 and DSH require terminator symbols at the end and/or the beginning of the input text. As linux-2.4.5.tar and w3c2 both have an alphabet size of 2^8, it was not possible to run these algorithms on those files.

On the aab, bba and Skyline datasets, the algorithms behave as expected: On the former, the
algorithms based on GSACA are much slower than the SA-IS variants (for which these cases are very
easy to solve). However, on Skyline, the SA-IS variants (and DivSufSort) are extremely slow.
Interestingly, on those files the aggressive prefetching of libsais seems to be counterproduc-
tive (compared to sais-lite, which does not employ explicit prefetching and is generally much
slower).
Compared to FGSACA, libsais is 20% faster at the median. The only non-Skyline cases
where FGSACA is faster are from PC-Rep-Art.
Tables 8 and 9 show the amount of working memory consumed. For 32-bit words, FGSACA
mostly uses between 8 and 9 bytes per input character, the theoretical maximum amount of work-
ing space (10 bytes per character) is reached for rs.13, tm29 and the degenerate cases aab.8
and bba.8. GSACA always uses 12 bytes/character, and DS1 and DSH use 10.1 bytes/character
and 9.33 bytes/character on average, respectively. However, they can both consume up to 41.8
bytes/character (aab.8). DivSufSort requires only a small constant amount of working memory,
and libsais and sais-lite never exceed 21 KiB of working memory on our test data.
The times for the BBWT-construction algorithms are shown in Tables 5 and 6. Excluding
the aab/bba and Skyline cases, our algorithm GBBWT for the BBWT is always between 48%

ACM Trans. Algor., Vol. 20, No. 2, Article 18. Publication date: April 2024.
Generic Non-recursive Suffix Array Construction 18:37

Table 5. Time for Constructing the Bijective Burrows-Wheeler Transform (and Phases If Applicable)

                 GBBWT                                        cais                       mk-bwts                      OpenBWT
Name             pss   init  Phase I  Phase II  BBWT  total   init  SA◦   BBWT  total    SA   SA^−1  fix  BBWT  total  total
PC-Real
dblp.xml 0.70 0.67 3.10 1.36 0.75 6.59 1.08 21.37 1.42 23.87 5.33 0.89 0.32 0.54 7.09 19.19
dna 1.39 0.81 3.53 1.97 0.86 8.56 1.07 23.63 1.57 26.27 5.19 1.05 0.71 0.92 7.87 25.26
english.1024MB 1.14 0.78 5.24 2.48 0.94 10.59 1.10 29.73 2.04 32.87 6.27 2.32 0.25 0.98 9.82 31.14
pitches 1.11 0.72 3.25 1.77 0.39 7.24 1.01 15.40 0.76 17.18 4.34 0.48 0.28 0.43 5.53 16.27
proteins 1.08 0.74 5.58 2.72 0.98 11.10 1.10 32.27 2.09 35.46 7.12 2.30 0.38 0.90 10.70 32.99
sources 1.00 0.74 3.40 1.49 0.67 7.31 1.05 20.73 1.30 23.09 5.30 0.77 0.29 0.51 6.87 20.04
PC-Rep-Real
cere 1.32 0.81 3.28 1.44 0.86 7.70 1.29 20.27 1.60 23.17 4.63 1.06 0.84 0.91 7.43 20.89
coreutils 1.00 0.74 3.17 1.20 0.68 6.80 1.04 19.23 1.28 21.55 4.93 0.79 0.26 0.59 6.57 17.92
einstein.de.txt 1.02 0.79 2.62 1.10 0.67 6.20 1.04 20.39 1.31 22.74 4.57 0.82 0.31 0.58 6.29 17.90
einstein.en.txt 1.04 0.78 3.27 1.24 0.89 7.21 1.07 22.71 1.52 25.30 5.04 1.14 0.30 0.83 7.31 21.39
Escherichia_Coli 1.38 0.82 3.19 1.35 0.71 7.45 1.19 20.47 1.39 23.05 4.66 0.84 0.34 0.75 6.59 20.64
influenza 1.17 0.73 2.43 1.18 0.70 6.21 1.21 18.86 1.39 21.46 4.53 0.82 0.66 0.66 6.66 17.82
kernel 1.08 0.78 3.37 1.22 0.70 7.14 1.06 20.14 1.35 22.54 4.97 0.87 0.30 0.71 6.84 18.98
para 1.35 0.81 3.35 1.42 0.87 7.81 1.29 20.84 1.61 23.74 4.78 1.04 22.92 0.92 29.65 21.90
world_leaders 2.19 0.74 1.59 1.20 0.59 6.31 1.02 13.34 1.19 15.54 3.30 0.68 0.28 0.64 4.90 12.06
PC-Rep-Art
fib41 0.92 0.51 1.21 1.28 0.96 4.88 1.03 15.55 1.63 18.21 4.22 0.99 23.08 1.02 29.32 15.08
rs.13 0.82 0.52 1.43 1.36 0.90 5.03 1.04 14.98 1.59 17.62 4.02 0.98 8.46 0.96 14.42 14.58
tm29 0.92 0.52 1.56 4.12 0.92 8.04 1.10 18.01 1.99 21.11 11.29 16.67 2.01 5.50 35.47 22.79
All times are given in seconds per 100 · 2^20 characters (100 MiB). For cais, init refers to the construction of the bit
vector with rank-select-support that cais uses to mark the start indices of strings. Note that deriving the BBWT is
comparatively slow in cais because of these rank-select queries. For mk-bwts, the SA-stage refers to the computation
of the suffix array via libsais, SA^−1 is the computation of the inverse suffix array, and fix is the "Fix sort order"-stage
where the suffix array is adjusted so that the BBWT can be derived.

(world_leaders) and 69% (random8) faster than OpenBWT (median: 64%). Compared to mk-bwts,
GBBWT is faster on the cases where the “Fix sort order” stage becomes slow due to its quadratic time
complexity and/or libsais is already slower than FGSACA. Interestingly, on Large, computing the
inverse permutation of the suffix array takes disproportionately long: approximately a quarter of the
total runtime of mk-bwts. This is the reason why GBBWT is faster than mk-bwts on that dataset, despite the
data being structurally similar to PC-Real, where libsais is usually faster than FGSACA. GBBWT is
also between 57% (pitches) and 73% (fib41) faster than cais, with an average difference in speed
of 68%.
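
For reference, the inverse permutation is just the following scatter loop (a sketch, not mk-bwts' actual code); its writes land at essentially random positions, which is what makes it expensive on Large:

    // Minimal sketch: SA^{-1}[SA[i]] = i. One pass, but the writes scatter
    // across the whole array, so on inputs much larger than the caches the
    // loop is memory-bound.
    #include <cstdint>
    #include <vector>

    std::vector<std::uint64_t> inverse(const std::vector<std::uint64_t>& sa) {
        std::vector<std::uint64_t> isa(sa.size());
        for (std::uint64_t i = 0; i < sa.size(); ++i)
            isa[sa[i]] = i;
        return isa;
    }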
The amount of working memory is shown in Tables 8 and 9. OpenBWT only requires slightly
more than one word per input character, mk-bwts requires two words per character (SA and SA’s
inverse), and GBBWT uses up to one word per character more than FGSACA.31 Seemingly, cais always
allocates the theoretical maximum amount of memory needed for the used SA-IS variant (per input
character one word for SA◦ and another one for working memory), unlike OpenBWT, which only
allocates as much working memory as needed. Additionally, cais uses a bit vector with rank-
select-support (2 bits/input character).
Table 7 shows the times measured for the eBWT construction algorithms. Our algorithm GEBWT
is always significantly faster than both cais and PFP-eBWT, except on the covid4-dataset, where

31 This is due to the array SA◦ counting as working memory in GBBWT while SA does not in FGSACA. However, sometimes
the memory for the BBWT can be used during Phases I and II for the bucket sort and the group start array C, in which
case we require slightly less than one word more than FGSACA.


Table 6. Table 5 Continued

                 GBBWT                                        cais                       mk-bwts                      OpenBWT
Name             pss   init  Phase I  Phase II  BBWT  total   init  SA◦   BBWT  total    SA   SA^−1  fix  BBWT  total  total
Manzini’s
chr22.dna 1.38 0.86 2.76 1.64 0.57 7.20 1.19 17.84 1.16 20.20 4.16 0.66 0.55 0.56 5.93 17.77
etext99 1.14 0.81 3.35 1.47 0.68 7.44 1.03 24.35 1.34 26.72 5.26 0.78 0.32 0.67 7.03 21.69
gcc-3.0.tar 1.22 0.74 3.11 1.37 0.57 7.03 1.02 17.48 1.12 19.62 4.60 0.70 0.30 0.42 6.02 16.40
howto 1.18 0.82 3.17 1.45 0.56 7.18 1.01 18.80 1.12 20.93 4.52 0.66 0.30 0.44 5.92 16.74
jdk13c 0.81 0.72 2.70 1.19 0.62 6.04 1.03 18.01 1.19 20.23 4.74 0.70 0.34 0.45 6.22 15.48
linux-2.4.5.tar 1.22 0.75 3.26 1.45 0.61 7.30 0.25 18.52 1.19 19.97 4.81 0.70 0.30 0.47 6.28 18.09
rctail96 0.92 0.71 2.98 1.27 0.68 6.56 1.02 21.13 1.32 23.48 5.21 0.76 0.35 0.58 6.89 18.98
rfc 1.14 0.76 3.27 1.40 0.67 7.24 1.02 19.92 1.29 22.23 4.88 0.73 0.32 0.57 6.50 18.61
sprot34.dat 0.94 0.74 3.23 1.36 0.66 6.92 1.04 21.73 1.32 24.09 4.91 0.75 0.33 0.59 6.59 20.25
w3c2 0.82 0.74 2.84 1.22 0.62 6.24 1.02 18.46 1.18 20.66 4.97 0.70 0.32 0.40 6.39 16.29
Random
random8 1.16 0.79 4.16 1.85 0.68 8.64 1.04 29.74 1.38 32.15 5.68 0.83 0.29 0.75 7.54 28.27
random9 1.15 0.78 6.45 3.38 0.98 12.74 1.06 35.10 1.93 38.09 6.90 2.40 0.33 1.03 10.66 36.35
Skyline
skyline26.txt 0.45 0.52 1.39 3.21 0.74 6.31 1.02 21.98 1.32 24.31 12.21 5.30 22.15 1.20 40.86 22.75
skyline28.txt 0.47 0.51 1.37 5.18 0.88 8.41 1.04 34.38 1.74 37.15 32.96 12.88 58.62 2.91 107.37 43.95
aab, bba
aab.8 0.65 0.53 1.70 0.43 0.06 3.37 1.00 3.44 0.14 4.58 1.27 0.06 0.24 0.13 1.70 4.32
bba.8 0.47 0.70 0.40 0.44 0.07 2.08 0.87 51.57 10.42 62.86 1.27 0.08 1.07 0.16 2.58 4.35
Large
enwik10 1.24 0.98 7.56 5.53 3.14 18.45 13.90 4.98 0.54 3.15 22.58 47.44
GRCh38.fna.twice 1.46 0.95 7.53 6.32 3.47 19.73 12.36 6.14 0.47 3.78 22.75 44.36
Note that cais only supports 32-bit indices and thus could not be tested on Large.

Table 7. Time for Constructing the Extended Burrows-Wheeler Transform


          GEBWT                                                              cais                        PFP-eBWT
Name      canonicalisation  pss  init  Phase I  Phase II  eBWT  total       init  GSA◦  eBWT  total     parsing  eBWT of parse  eBWT  total  system time
reads 0.36 1.32 3.56 4.45 2.53 1.00 13.22 0.11 35.37 2.56 38.04 13.30 1.90 40.91 56.10 0.55
covid4 0.33 1.33 2.01 2.63 1.07 0.93 8.30 0.32 19.77 1.56 21.65 2.26 0.73 1.54 4.47 0.16
articles3 0.17 1.07 1.72 3.20 1.32 0.59 8.07 0.92 21.01 1.17 23.10 6.62 0.87 56.38 56.38 0.70
articles4 0.17 1.09 1.86 3.94 1.53 0.79 9.37 0.33 27.34 1.52 29.19 8.62 1.44 84.50 94.56 0.66
articles5 0.16 1.23 4.50 6.63 4.67 1.37 18.56 10.49 1.87 97.79 110.15 0.74
random2_6 0.15 1.06 1.98 4.17 1.73 0.69 9.78 0.13 39.15 1.70 40.98 10.79 1.15 51.42 60.97 0.70
random4_4 0.11 1.13 1.82 4.17 1.72 0.69 9.64 0.33 29.61 1.35 31.29 7.22 0.73 81.17 89.12 0.74
random6_2 0.11 1.13 1.75 4.19 1.73 0.69 9.58 0.96 29.68 1.37 32.01 7.23 0.72 81.20 89.15 0.75
Name #Words in parse #Characters in dictionary
reads 10,320,059 665,290,932
covid4 2,764,337 3,469,208
articles3 368,138 39,600,010
articles4 3,778,409 400,094,069
articles5 23,159,228 2,332,965,216
random2_6 999,718 74,414,799
random4_4 1,000,735 111,008,086
random6_2 1,000,758 111,008,339
All times are given in seconds per 100 · 2^20 characters (100 MiB). cais only supports 32-bit indices, so it could not be
run on articles5. The lower table gives the number of words in the parse and the size of the dictionary (measured as
the sum of the lengths of the dictionary entries) as reported by PFP-eBWT.


Table 8. Working Memory (i.e., without the Input and Output) in Bytes Per Input
Character for the SA- and BBWT-algorithms

Name                  FGSACA  GSACA  DS1  DSH  GBBWT  cais  mk-bwts  OpenBWT
PC-Real
dblp.xml 8.21 12.00 5.79 5.52 12.21 8.16 8.00 4.54
dna 9.15 12.00 10.69 10.59 13.15 8.16 8.00 4.53
english.1024MB 8.26 12.00 7.41 5.84 12.26 8.16 8.00 4.56
pitches 8.28 12.00 6.87 7.38 12.00 8.16 8.00 4.53
proteins 8.25 12.00 6.99 6.55 12.25 8.16 8.00 4.55
sources 8.12 12.00 8.29 6.57 12.06 8.16 8.00 4.54
PC-Rep-Real
cere 9.15 12.00 8.71 8.68 13.15 8.16 8.00 4.52
coreutils 8.08 12.00 7.09 4.90 12.00 8.16 8.00 4.54
einstein.de.txt 8.02 12.00 5.90 4.86 12.02 8.16 8.00 4.54
einstein.en.txt 8.02 12.00 6.08 4.98 12.02 8.16 8.00 4.55
Escherichia_Coli 8.99 12.00 8.52 8.39 12.99 8.16 8.00 4.53
influenza 8.93 12.00 9.48 7.90 12.93 8.16 8.00 4.53
kernel 8.04 12.00 6.49 4.77 12.01 8.16 8.00 4.54
para 9.18 12.00 8.96 8.94 13.18 8.16 8.00 4.52
world_leaders 9.99 12.00 12.06 11.16 13.98 8.16 8.00 4.47
PC-Rep-Art
fib41 9.53 12.00 13.89 10.12 13.53 8.16 8.00 4.61
rs.13 10.00 12.00 12.00 12.00 14.00 8.16 8.00 4.59
tm29 10.00 12.00 12.00 12.00 14.00 8.16 8.00 4.57
Manzini’s
chr22.dna 9.01 12.00 9.66 9.47 13.01 8.16 8.00 4.52
etext99 8.09 12.00 7.63 6.05 12.06 8.16 8.00 4.55
gcc-3.0.tar 8.11 12.00 8.75 7.50 12.01 8.16 8.00 4.53
howto 8.15 12.00 9.96 7.33 12.01 8.16 8.00 4.54
jdk13c 8.18 12.00 5.49 5.09 12.18 8.16 8.00 4.54
linux-2.4.5.tar 8.14 12.00 7.84 6.57 12.06 8.16 8.00 4.52
rctail96 8.23 12.00 6.30 5.42 12.23 8.16 8.00 4.55
rfc 8.22 12.00 10.21 6.94 12.19 8.16 8.00 4.52
sprot34.dat 8.14 12.00 9.10 6.55 12.06 8.16 8.00 4.54
w3c2 8.09 12.00 5.76 5.28 12.00 8.16 8.00 4.54
The SA-IS variants libsais and sais-lite as well as DivSufSort use only a constant amount of
memory on our data set and are thus omitted. Note that for the BBWT-algorithms, the output is a
character array and thus SA◦ is counted as working memory.

the strings in the set are extremely similar and thus the parse and dictionary are very small com-
pared to the input texts (Table 7). In this case, the amount of working memory PFP-eBWT needs is
also very small (Table 10). In the other cases, PFP-eBWT is significantly slower than cais, because
there the dictionary is approximately as large as the input. In most
other cases, PFP-eBWT's consumed memory sits between the working memory of GEBWT and cais.
GEBWT and cais use the expected amount of memory, namely, a bit more than 12 and 8 bytes per


Table 9. Table 8 Continued

Name                  FGSACA  GSACA  DS1  DSH  GBBWT  cais  mk-bwts  OpenBWT
Random
random8 8.15 12.00 8.75 11.75 12.15 8.16 8.00 4.54
random9 8.15 12.00 7.94 9.94 12.15 8.16 8.00 4.54
Skyline
skyline26.txt 10.00 12.00 12.02 12.02 14.00 8.16 8.00 4.75
skyline28.txt 10.00 12.00 12.00 12.00 14.00 8.16 8.00 4.75
aab, bba
aab.8 8.00 12.00 41.79 41.79 12.00 8.16 8.00 4.38
bba.8 10.00 12.00 24.01 24.01 14.00 21.62 8.00 4.38
Large
enwik10 10.11 9.02 7.76 18.11 16.00 8.55
GRCh38.twice 10.88 9.56 9.52 18.88 16.00 8.52

Table 10. Working Memory (i.e., without the Input and Output) in Bytes Per Input Character for the eBWT-algorithms

Name              GEBWT  cais  PFP-eBWT


reads 12.99 8.20 6.30
covid4 13.28 8.16 0.16
articles3 12.10 8.16 10.47
articles4 12.07 8.16 10.26
articles5 18.11 17.95
random2_6 12.15 8.20 7.14
random4_4 12.15 8.16 10.47
random6_2 12.15 8.16 10.47

character, respectively. Note that on articles5, GEBWT computes a 64-bit/entry GSA◦ (with 40-
bit/entry auxiliary arrays) and thus needs much more memory.
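
Such 40-bit auxiliary arrays can be realised, for example, as a packed array with 5 bytes per entry. The following is a hedged sketch of that idea (PackedArray40 is our illustrative name, not a class from our implementation, and it assumes a little-endian machine):

    // Sketch of a 40-bit (5 bytes/entry) packed integer array, useful when
    // 32 bits per entry do not suffice but 64 bits would waste space.
    #include <cstdint>
    #include <cstring>
    #include <vector>

    class PackedArray40 {
        std::vector<std::uint8_t> bytes_;
    public:
        explicit PackedArray40(std::size_t n) : bytes_(5 * n) {}
        void set(std::size_t i, std::uint64_t v) {
            std::memcpy(&bytes_[5 * i], &v, 5);  // low 5 bytes, little-endian
        }
        std::uint64_t get(std::size_t i) const {
            std::uint64_t v = 0;
            std::memcpy(&v, &bytes_[5 * i], 5);
            return v;
        }
    };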

5 CONCLUDING REMARKS
In this article, we significantly improved the performance of the GSACA algorithm. Our resulting
suffix array construction algorithm is the second-fastest algorithm known to us. We also showed
that minimal changes to our algorithm suffice to turn it into an algorithm for the BBWT and
eBWT. Our BBWT-algorithm is the fastest algorithm for large or repetitive files, and our eBWT-
algorithm is the fastest algorithm for computing the original eBWT for string collections that are
not extremely repetitive.
We only considered single-threaded algorithms. As modern computers gain their processing
power more and more through parallelism, it may be worthwhile to spend effort on trying to par-
allelise our algorithms. For instance, while processing a final group, all steps besides the reordering
of the parents are entirely independent of other groups.
Also, the results for DSH and DS1 indicate that it may be useful to use an already refined suffix
grouping as a starting point for our Phase I, as this enables us to skip many refinement steps.

ACM Trans. Algor., Vol. 20, No. 2, Article 18. Publication date: April 2024.
Generic Non-recursive Suffix Array Construction 18:41

Deriving the BBWT or eBWT from SA◦ or GSA◦ is a rather time-consuming step in our al-
gorithms (especially on very large inputs), since the memory accesses to the input are virtually
randomly distributed and hence cache-unfriendly. It may be worthwhile to investigate whether
it is possible to effectively use the additional information present during Phases I and II. For in-
stance, in Phase II, we fetch pss[s − 1] (lpss[s − 1]) for each entry s of SA◦ (GSA◦), and hence this
cache-inefficiency could be mitigated by interleaving the input string with pss (lpss).
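
To make this access pattern concrete, the derivation step is essentially the following loop; this is a simplified sketch that assumes the Lyndon-factor boundaries are stored in explicit arrays (our algorithm works with pss/lpss instead), with hypothetical names throughout:

    // Sketch of deriving the BBWT from the circular suffix array SA◦ of a
    // text partitioned into Lyndon factors. fact_start[p]/fact_end[p] give
    // the boundaries of the factor containing position p (assumed
    // precomputed). The read text[pred] hits a near-random position in
    // every iteration, which is the cache-unfriendly step discussed above.
    #include <cstdint>
    #include <string>
    #include <vector>

    std::string derive_bbwt(const std::string& text,
                            const std::vector<std::uint32_t>& sa,
                            const std::vector<std::uint32_t>& fact_start,
                            const std::vector<std::uint32_t>& fact_end) {
        std::string bbwt(text.size(), '\0');
        for (std::size_t i = 0; i < sa.size(); ++i) {
            const std::uint32_t p = sa[i];
            // predecessor of p, wrapping around within its Lyndon factor
            const std::uint32_t pred =
                (p == fact_start[p]) ? fact_end[p] - 1 : p - 1;
            bbwt[i] = text[pred];
        }
        return bbwt;
    }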

ACKNOWLEDGMENTS
We thank Michael L. Crogan for suggesting the possible connection between the BBWT and our
suffix array construction algorithm, and we thank the anonymous reviewers for their helpful
comments.

REFERENCES
Uwe Baier. 2015. Linear-time Suffix Sorting. Master's Thesis. Ulm University.
Uwe Baier. 2016. Linear-time suffix sorting - A new approach for suffix array construction. In Proceedings of the 27th
Annual Symposium on Combinatorial Pattern Matching (Leibniz International Proceedings in Informatics), Roberto Grossi
and Moshe Lewenstein (Eds.), Vol. 54. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Dagstuhl, Germany, Article
23. https://doi.org/10.4230/LIPIcs.CPM.2016.23
Uwe Baier. 2021. BWT tunneling. Ph.D. Dissertation. Universität Ulm. https://doi.org/10.18725/OPARU-35218
Hideo Bannai, Juha Kärkkäinen, Dominik Köppl, and Marcin Piątkowski. 2021. Constructing the bijective and the extended
Burrows-Wheeler transform in linear time. In Proceedings of the 32nd Annual Symposium on Combinatorial Pattern Match-
ing (Leibniz International Proceedings in Informatics), Paweł Gawrychowski and Tatiana Starikovskaya (Eds.), Vol. 191.
Schloss Dagstuhl–Leibniz-Zentrum für Informatik, 7:1–7:16. https://doi.org/10.4230/LIPIcs.CPM.2021.7
Nico Bertram, Jonas Ellert, and Johannes Fischer. 2021. Lyndon words accelerate suffix sorting. In Proceedings of the 29th
Annual European Symposium on Algorithms (Leibniz International Proceedings in Informatics), Petra Mutzel, Rasmus Pagh,
and Grzegorz Herman (Eds.), Vol. 204. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Article 15. https://doi.org/10.4230/LIPIcs.ESA.2021.15
Philip Bille, Jonas Ellert, Johannes Fischer, Inge Li Gørtz, Florian Kurpicz, J. Ian Munro, and Eva Rotenberg. 2020. Space effi-
cient construction of Lyndon arrays in linear time. In Proceedings of the 47th International Colloquium on Automata, Lan-
guages, and Programming (Leibniz International Proceedings in Informatics), Artur Czumaj, Anuj Dawar, and Emanuela
Merelli (Eds.), Vol. 168. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Article 14. https://doi.org/10.4230/LIPIcs.ICALP.2020.14
Timo Bingmann, Johannes Fischer, and Vitaly Osipov. 2016. Inducing suffix and LCP arrays in external memory. ACM J.
Exper. Algor. 21, Article 2.3 (2016). https://doi.org/10.1145/2975593
Silvia Bonomo, Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino. 2014. Sorting conjugates
and suffixes of words in a multiset. Int. J. Found. Comput. Sci. 25, 08 (2014), 1161–1175. https://doi.org/10.1142/S0129054114400309
Christina Boucher, Davide Cenzato, Zsuzsanna Lipták, Massimiliano Rossi, and Marinella Sciortino. 2021. Computing the
original eBWT faster, simpler, and with less memory. In String Processing and Information Retrieval (LNCS), Vol. 12944.
129–142. https://doi.org/10.1007/978-3-030-86692-1_11
Christina Boucher, Travis Gagie, Alan Kuhnle, Ben Langmead, Giovanni Manzini, and Taher Mun. 2019. Prefix-free parsing
for building big BWTs. Algor. Mol. Biol. 14 (May 2019). https://doi.org/10.1186/s13015-019-0148-5
Michael Burrows and David Wheeler. 1994. A block-sorting lossless data compression algorithm. Digital SRC Res. Rep.
124 (1994).
Davide Cenzato and Zsuzsanna Lipták. 2022. A theoretical and experimental analysis of BWT variants for string collections.
In Proceedings of the 33rd Annual Symposium on Combinatorial Pattern Matching (Leibniz International Proceedings in
Informatics), Hideo Bannai and Jan Holub (Eds.), Vol. 223. Schloss Dagstuhl–Leibniz-Zentrum für Informatik, Dagstuhl,
Germany, 25:1–25:18. https://doi.org/10.4230/LIPIcs.CPM.2022.25
Kuo Tsai Chen, Ralph H. Fox, and Roger C. Lyndon. 1958. Free differential calculus, IV. The quotient groups of the lower
central series. Ann. Math. (1958), 81–95. https://doi.org/10.2307/1970044
Zhihui Du, Sen Zhang, and David A. Bader. 2023. Parallel suffix sorting for large string analytics. In Parallel Processing
and Applied Mathematics (LNCS), Roman Wyrzykowski, Jack Dongarra, Ewa Deelman, and Konrad Karczewski (Eds.),
Vol. 13826. Springer, 71–82. https://doi.org/10.1007/978-3-031-30442-2_6
Jean-Pierre Duval. 1983. Factorizing words over an ordered alphabet. J. Algor. 4, 4 (1983), 363–381. https://doi.org/10.1016/0196-6774(83)90017-2
Johannes Fischer and Florian Kurpicz. 2017. Dismantling DivSufSort. In Proceedings of the Prague Stringology Conference,
Jan Holub and Jan Žďárek (Eds.). Czech Technical University in Prague, Czech Republic, 62–76.
Frantisek Franek, A. S. M. Sohidull Islam, M. Sohel Rahman, and William F. Smyth. 2016. Algorithms to compute the Lyndon
array. In Proceedings of the Prague Stringology Conference, Jan Holub and Jan Žďárek (Eds.). Czech Technical University
in Prague, Czech Republic, 172–184.
Frantisek Franek, Asma Paracha, and William F. Smyth. 2017. The linear equivalence of the suffix array and the partially
sorted Lyndon array. In Proceedings of the Prague Stringology Conference, Jan Holub and Jan Žďárek (Eds.). Czech Tech-
nical University in Prague, Czech Republic, 77–84.
Joseph Yossi Gil and David Allen Scott. 2012. A Bijective String Sorting Transform. Retrieved from https://arxiv.org/abs/1201.3077. https://doi.org/10.48550/ARXIV.1201.3077
Keisuke Goto. 2019. Optimal time and space construction of suffix arrays and LCP arrays for integer alphabets. In Proceed-
ings of the Prague Stringology Conference, Jan Holub and Jan Žďárek (Eds.). Czech Technical University in Prague, Czech
Republic, 111–125.
Wing-Kai Hon, Tsung-Han Ku, Chen-Hua Lu, Rahul Shah, and Sharma V. Thankachan. 2012. Efficient algorithm for circular
Burrows-Wheeler transform. In Proceedings of the 23rd Annual Symposium on Combinatorial Pattern Matching, Juha
Kärkkäinen and Jens Stoye (Eds.). Springer, 257–268. https://doi.org/10.1007/978-3-642-31265-6_21
Juha Kärkkäinen, Dominik Kempa, and Simon J. Puglisi. 2015. Parallel external memory suffix sorting. In Proceedings of the
26th Annual Symposium on Combinatorial Pattern Matching (LNCS), Ferdinando Cicalese, Ely Porat, and Ugo Vaccaro
(Eds.), Vol. 9133. Springer, 329–342. https://doi.org/10.1007/978-3-319-19929-0_28
Pang Ko and Srinivas Aluru. 2003. Space efficient linear time construction of suffix arrays. In Proceedings of the 14th Annual
Symposium on Combinatorial Pattern Matching (LNCS), Ricardo Baeza-Yates, Edgar Chávez, and Maxime Crochemore
(Eds.), Vol. 2676. 200–210. https://doi.org/10.1007/3-540-44888-8_15
Manfred Kufleitner. 2009. On bijective variants of the Burrows-Wheeler transform. In Proceedings of the Prague Stringology
Conference, Jan Holub and Jan Žďárek (Eds.). Czech Technical University in Prague, Czech Republic, 65–79.
Julian Labeit, Julian Shun, and Guy E. Blelloch. 2017. Parallel lightweight wavelet tree, suffix array and FM-index construc-
tion. J. Discrete Algor. 43 (2017), 2–17. https://doi.org/10.1016/j.jda.2017.04.001
Zhize Li, Jian Li, and Hongwei Huo. 2022. Optimal in-place suffix sorting. Info. Comput. 285, Article 104818 (2022). https://doi.org/10.1016/j.ic.2021.104818
Udi Manber and Gene Myers. 1990. Suffix arrays: A new method for on-line string searches. In Proceedings of the First
Annual ACM-SIAM Symposium on Discrete Algorithms. Society for Industrial and Applied Mathematics, 319–327. https://doi.org/10.5555/320176.320218
Sabrina Mantaci, Antonio Restivo, Giovanna Rosone, and Marinella Sciortino. 2005. An extension of the Burrows-Wheeler trans-
form and applications to sequence comparison and data compression. In Proceedings of the 16th Annual Symposium on
Combinatorial Pattern Matching, Alberto Apostolico, Maxime Crochemore, and Kunsoo Park (Eds.). Springer, 178–189.
https://doi.org/10.1007/11496656_16
Ge Nong. 2013. Practical linear-time O(1)-workspace suffix sorting for constant alphabets. ACM Trans. Info. Syst. 31, 3 (2013),
1–15. https://doi.org/10.1145/2493175.2493180
Ge Nong, Sen Zhang, and Wai Hong Chan. 2009. Linear suffix array construction by almost pure induced-sorting. In Pro-
ceedings of the Data Compression Conference. 193–202. https://doi.org/10.1109/DCC.2009.42
Enno Ohlebusch. 2013. Bioinformatics Algorithms: Sequence Analysis, Genome Rearrangements, and Phylogenetic Reconstruc-
tion. Oldenbusch Verlag.
Jannik Olbrich, Enno Ohlebusch, and Thomas Büchler. 2022. On the optimisation of the GSACA suffix array construction
algorithm. In String Processing and Information Retrieval (LNCS), Diego Arroyuelo and Barbara Poblete (Eds.), Vol. 13617.
Springer, Cham, 99–113. https://doi.org/10.1007/978-3-031-20643-6_8
Yossi Shiloach. 1981. Fast canonization of circular strings. J. Algor. 2, 2 (1981), 107–121. https://doi.org/10.1016/0196-6774(81)90013-4

Received 9 December 2022; revised 4 October 2023; accepted 19 December 2023

