4 A. Farruggia et al.
The inverse function of LF is Ψ, with Ψ(i) =
select
c
(BWT, i − C[c]), where c is the largest character
value with C[c] < i. With functions Ψ and LF, we
can move forward and backward in the text, while
maintaining the lexicographic rank of the current suffix.
If the sequence S is not evident from the context, we
write LF
S
and Ψ
S
.
Compressed suffix arrays (CSA) [54, 55, 56] are
text indexes supporting a functionality similar to the
suffix array. This includes the following queries: i)
find(P ) = [sp, ep] determines the lexicographic range of
suffixes starting with pattern P [1, `]; ii) locate(sp, ep) =
SA[sp, ep] returns the starting positions of these suffixes;
and iii) extract(i, j) = T [i, j] extracts substrings of the
text. In practice, the find performance of CSAs can be
competitive with suffix arrays, while locate queries are
orders of magnitude slower [57]. Typical index sizes are
less than the size of the uncompressed text.
The FM-index (FMI) [55] is a common type of
compressed suffix array. A typical implementation [58]
stores the BWT in a wavelet tree [52]. The index
implements find queries via backward searching. Let
[sp, ep] be the lexicographic range of the suffixes of
the text starting with suffix P[i + 1, `] of the pattern.
We can find the range matching suffix P [i, `] with a
generalization of function LF as
LF([sp, ep], P [i]) = [C[P [i]] + rank
P [i]
(BWT, sp−1)+1,
C[P [i]] + rank
P [i]
(BWT, ep)].
We support locate queries by sampling some suffix
array pointers. If we want to determine a value
SA[i] that has not been sampled, we can compute
it as SA[i] = SA[j] + k, where SA[j] is a sampled
pointer found by iterating LF k times, starting from
position i. Given sample interval d, the samples can
be chosen in suffix order, sampling SA[i] at positions
divisible by d, or in text order, sampling T [i] at
positions divisible by d and marking the sampled SA
positions in a bitvector. Suffix-order sampling requires
less space, often resulting in better time/space trade-
offs in practice, while text-order sampling guarantees
better worst-case performance. We also sample the ISA
pointers for extract queries. To extract T [i, j], we find
the nearest sampled pointer after T [j], and traverse
backwards to T [i] with function LF.
Compressed suffix trees (CST) [5] are compressed
text indexes supporting the full functionality of a
suffix tree (see Table 1). They combine a compressed
suffix array, a compressed representation of the LCP
array, and a compressed representation of suffix tree
topology. For the LCP array, there are several common
representations:
• LCP-byte [51] stores the LCP array as a byte array.
If LCP[i] < 255, the LCP value is stored in the
byte array. Larger values are marked with a 255
in the byte array and stored separately. As many
texts produce small LCP values, LCP-byte usually
requires n to 1.5n bytes of space.
• We can store the LCP array by using variable-
length codes. LCP-dac uses directly addressable
codes [59] for the purpose, resulting in a structure
that is typically somewhat smaller and somewhat
slower than LCP-byte.
• The permuted LCP (PLCP) array [5] PLCP[1, n] is
the LCP array stored in text order and used as
LCP[i] = PLCP[SA[i]]. Because PLCP[i + 1] ≥
PLCP[i]−1, the array can be stored as a bitvector of
length 2n in 2n+o(n) bits. If the text is repetitive,
run-length encoding can be used to compress the
bitvector to take even less space [6]. Because
accessing PLCP uses locate, it is much slower than
the above two encodings.
Suffix tree topology representations are the main
difference between the various CST proposals. While
the compressed suffix arrays and the LCP arrays are
interchangeable, the tree representation determines how
various suffix tree operations are implemented. There
are three main families of compressed suffix trees:
• Sadakane’s compressed suffix tree (CST-Sada) [5]
uses a balanced parentheses representation for
the tree. Each node is encoded as an opening
parenthesis, followed by the encodings of its
children and a closing parenthesis. This can be
encoded as a bitvector of length 2n
0
, where n
0
is
the number of nodes, requiring up to 4n + o(n)
bits. CST-Sada tends to be larger and faster than
the other compressed suffix trees [11, 13].
• The fully compressed suffix tree (FCST) of Russo
et al. [10, 14] aims to use as little space as possible.
It does not require an LCP array at all, and
stores a balanced parentheses representation for a
sampled subset of suffix tree nodes in o(n) bits.
Unsampled nodes are retrieved by following suffix
links. FCST is smaller and much slower than the
other compressed suffix trees [10, 13].
• Fischer et al. [6] proposed an intermediate
representation, CST-NPR, based on lcp-intervals.
Tree navigation is handled by searching for the
values defining the lcp-intervals. Range minimum
queries rmq(sp, ep) find the leftmost minimal value
in LCP[sp, ep], while next/previous smaller value
queries nsv(i)/psv(i) find the next/previous LCP
value smaller than LCP[i]. After the improvements
by various authors [7, 9, 8, 11, 13], the CST-NPR is
perhaps the most practical compressed suffix tree.
For typical texts and component choices, the size of
compressed suffix trees ranges from the 1.5n to 3n bytes
of CST-Sada to the 0.5n to n bytes of FCST [11, 13].
There are also some CST variants for repetitive texts,
such as versioned document collections and collections
of individual genomes. Abeliuk et al. [13] developed
a variant of CST-NPR that can sometimes be smaller