* Logarithms in this paper are in base 2 unless the base is stated explicitly.
A Practical Implementation of Compressed Suffix
Arrays with Applications to Self-Indexing
Hongwei Huo
*+
, Longgang Chen
*
, Jeffrey Scott Vitter
+
and Yakov Nekrich
+
*
Xidian University
No.2 Taibai South Road
Xi’an, Shaanxi 710071, China
{hwhuo,lgchen}@mail.xidian.edu.cn
+
The University of Kansas
1450 Jayhawk Blvd.
Lawrence, KS 66045, USA
{jsv,yakov}@ittc.ku.edu
Abstract: In this paper we develop a simple and practical text indexing scheme
for compressed suffix arrays (CSA). For a text of n characters, our CSA can be
constructed in linear time and needs 2nH
k
+ n + o(n) bits of space for any k ≤
clog
σ
n − 1 and any constant c < 1, where H
k
denotes the kth order entropy. We
compare the performance of our method with two established compressed
indexing methods, the FM-index and the Sad-CSA. Experiments on the
Canterbury Corpus and the Pizza&Chili Corpus show significant advantages of
our algorithm over two other indexes in terms of compression and query time.
Our storage scheme achieves better performance on all types of data present in
these two corpora, except for evenly distributed data, such as DNA. The source
code for our CSA is available online.
1. Introduction
Suffix trees[15,22] and suffix arrays[14] are versatile data structures that play a key
role in numerous string processing applications in such areas as string matching,
information retrieval, genome analysis, and text compression. Both the suffix tree and the
suffix array support pattern matching queries in optimal or almost-optimal time and use
linear space of O(n log n) bits. However in practice these data structures occupy 5 to 20
times more space than the raw string data; the latter needs only n log bits of space,
where denotes the alphabet size.
The compressed suffix array (CSA) [9,11,19–21] and the FM-index [2–4] overcome
the space limitation by exploiting the text compressibility and index regularities, while
supporting the functionalities of suffix arrays and suffix trees. The compressed suffix
array and FM-index are self-indexes in that they do not require a copy of the original data;
that is, they each serve as an index as well as a compressed version of the original text.
Grossi and Vitter [9,11] introduced the compressed suffix array (GV-CSA), which
uses O(n log
σ
) bits* of space and answers string matching queries in o(|P| / log
σ
n +
occ log
σ
n) time, where |P| is the length of the query pattern P and occ denotes the
number of times P occurs in the source string. Sadakane [20,21] showed how to convert
the GV-CSA into a self-index, and the resulting index, called Sad-CSA, needs (1/߳)nH
0
+
O(n log log
σ
) +
σ
log
σ
bits of space and answers queries in O(|P| log n + occ
ఢ
n)
time, where 0 < ߳ ≤ 1 is an arbitrary constant. Henceforth H
k
for k 0 denotes the kth
order empirical entropy of the source string T. Ferragina and Manzini designed the FM-
index that relies on the Burrows-Wheeler transform (BWT) [1] and the backward
searching approach [4,9]. Their original index uses at most 5nH
k
(T) + o(n log
σ
) bits of
2014 Data Compression Conference
1068-0314/14 $31.00 © 2014 IEEE
DOI 10.1109/DCC.2014.49
292