An Efficient, Versatile Approach to Suffix Sorting
•
5
3.2 Prefix Doubling
The prefix-doubling technique was first applied to suffix-sorting by Manber
and Myers [1993], inspired by the earlier work of Karp et al. [1972] in string
matching. The most efficient implementation is that of Larsson and Sadakane
[1999].
Generally, the approach works in rounds—at the beginning of the round h,
the suffixes are sorted on their 2
h−1
prefix in SA
h
with corresponding ranks
in ISA
h
. It is then observed that a sort using the integer pairs (ISA
h
[i], ISA
h
[i +h]) as keys, i +h ≤ n, computes a 2h-order of the suffixes i (suffixes i > n− h
are necessarily already fully ordered).
The two main implementations of the prefix-doubling approach differ pri-
marily in their application of the above observation. Manber and Myers do
an implicit 2h-sort by performing a left-to-right scan of SA
h
that induces the
2h-rank of SA
h
[ j ]h, j ∈ 0..n. On the other hand, Larsson and Sadakane ex-
plicitly sort each h-group using the ternary-split quicksort (TSQS) of Bentley
and McIlroy [1993]. Both approaches require 8n bytes of working space. Prefix-
doubling sorters have the advantage of being alphabet independent and taking
O(n log n) time, in the worst case.
3.3 Copy and Variants
Seward [2000] describes an important heuristic algorithm for suffix-sorting
called copy. The main idea bears a resemblance to two stage. Algorithm copy
initially sorts the suffixes into 1- and 2-groups, based on their first two charac-
ters (using a counting sort). 1-Groups refer to contiguous portions of the suffix
array, where suffixes share the same first character and 2-groups (“contained”
in 1-groups) refer to contiguous portions sharing the same first two characters.
Seward sorts the 1-groups in order of smallest to largest (i.e., those containing
least suffixes to those containing the most). Let G
λ
denote the 1-group whose
member suffixes all start with the letter λ ∈ . When G
λ
is completely sorted,
by passing back over the portion of SA containing G
λ
(now in order) for each
suffix i encountered, the order of the suffixes in 2-group prefixed x[i − 1]λ can
be induced. As sorting of 1-groups proceeds, ever more 2-groups will be already
ordered, allowing the sort routine to skip those portions of the 1-group. Seward
shows how the sorting of 1-groups can be made still more efficient by avoiding
the sorting of suffixes in G
λ
prefixed λλ. If such suffixes are left until after the
other members of G
λ
are sorted, their order can also be induced. This ability of
copy to deal with long runs of identical characters efficiently gives it a distinct
advantage over two stage, which has no such mechanism.
It is worth noting that copy was intended for use in a character-based BWT
setting, where it is assumed ||≤2
8
. This assumption keeps the space re-
quired for the ||
2
buckets reasonable. If, however, ||=2
16
, the memory
requirements for the algorithm would increase dramatically, making the al-
gorithm impractical, in some applications. This weakness is inherited by al-
gorithms which extend copy. Several very fast suffix sorters are based on
copy, namely, cache [Seward 2000], deep-shallow (ds) [Manzini and Ferragina
2004], and bucket pointer refinement (bpr) [Sch
¨
urmann and Stoye 2005]. These
ACM Journal of Experimental Algorithmics, Vol. 12, Article No. 1.2, Publication June: 2008.