of block addressing is that it can achieve both sublinear space overhead and sublinear query time, whereas
inverted indices achieve only the second goal [4]. Unfortunately, up to now all the known block addressing
indices [18, 4] achieve this goal only under some restrictive conditions on the block size. We show how to
use our opportunistic data structure to devise a novel block addressing scheme, called CGlimpse (standing
for Compressed Glimpse), which always achieves time and space sublinearity.
2 Background
Let T [1, u] be a text drawn from a constant-size alphabet Σ. A central concept in our discussion is
the suffix array data structure [17]. The suffix array A built on T [1, u] is an array containing the
lexicographically ordered sequence of the suffixes of T , represented via pointers to their starting positions
(i.e., integers). For instance, if T = ababc then A = [1, 3, 2, 4, 5]. In practice A occupies 4u bytes,
actually a lot when indexing large text collections. It is a long standing belief that suffix arrays are
uncompressible because of the “apparently random” permutation of the suffix pointers. Recent results
in the data compression field have opened the door to revolutionary ways to compress suffix arrays and
are the basic tools of our solution. In [7], Burrows and Wheeler proposed a transformation (BWT from
now on) consisting of a reversible permutation of the text characters which gives a new string that is
“easier to compress”. The BWT tends to group together characters which occur adjacent to similar text
substrings. This nice property is exploited by locally-adaptive compression algorithms, such as move-to-
front coding [6], in combination with statistical (i.e. Huffman or Arithmetic coders) or structured coding
models. The BWT-based compressors are among the best compressors currently available since they
achieve a very good compression ratio using relatively small resources (time and space).
The reversible BW-transform. We distinguish between a forward transformation, which produces
the string to be compressed, and a backward transformation which gives back the original text from the
transformed one. The forward BWT consists of three basic steps: (1) Append to the end of T a special
character # smaller than any other text character; (2) form a conceptual matrix M whose rows are the
cyclic shifts of the string T# sorted in lexicographic order; (3) construct the transformed text L by taking
the last column of M. Notice that every column of M is a permutation of the last column L, and in
particular the first column of M, call it F , is obtained by lexicographically sorting the characters in L.
There is a strong apparent relation between the matrix M and the suffix array A of the string T #.
When sorting the rows of the matrix M we are essentially sorting the suffixes of T #. Consequently, entry
A[i] points to the suffix of T # occupying (a prefix of) the ith row of M. The cost of performing the
forward BWT is given by the cost of constructing the suffix array A, and this requires O(u) time [20].
The cyclic shift of the rows of M is crucial to define the backward BWT, which is based on two easy
to prove observations [7]:
a. Given the ith row of M, its last character L[i] precedes its first character F [i] in the original text
T , namely T = · · · L[i]F [i] · · ·.
b. Let L[i] = c and let r
i
be the rank of the row M[i] among all the rows ending with the character
c. Take the row M[j] as the r
i
-th row of M starting with c. Then the character corresponding to
L[i] in the first column F is located at F [j] (we call this LF-mapping, where LF [i] = j).
We are therefore ready to describe the backward BWT:
1. Compute the array C[1 . . . |Σ|] storing in C[c] the number of occurrences of characters {#, 1, . . . , c−
1} in the text T . Notice that C[c] + 1 is the position of the first occurrence of c in F (if any).
2. Define the LF-mapping LF [1 . . . u + 1] as follows LF [i] = C[L[i]] + r
i
, where r
i
equals the number
of occurrences of character L[i] in the prefix L[1, i] (see observation (b) above).
3. Reconstruct T backward as follows: set s = 1 and T [u] = L[1] (because M[1] = #T ); then, for
each i = u − 1, . . . , 1 do s = LF [s] and T [i] = L[s].
3