previous or following position of its neighbor, which
not only makes it possible to restore the text, but
also leads to compression to the higher-order entropy
of the text.
2.2 The Burrows-Wheeler transform. Let T be
a string of n characters from an alphabet Σ of size σ,
and let P be a query pattern of length p. The string T
has n suffixes, one starting at each of the n positions in
the text. The ith suffix, which starts at position i, is
denoted by T[i..n]. The suffix array SA[1..n] of T is an
array of n integers that gives the sorted order of the
suffixes of T. That is, SA[i] = j if T[j..n] is the ith
smallest suffix of T in lexicographical order. Similarly,
the inverse suffix array is defined by SA^{-1}[j] = i.
All the suffixes prefixed by P occupy a contiguous
range in the sorted array SA.
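The definitions above can be sketched in a few lines of Python on the example string of Figure 1 (0-indexed, as in the figure; the naive sort below is for illustration only, not an efficient construction):

```python
T = "abaabab#"   # '#' is the lexicographically smallest character
n = len(T)

# Naive suffix array: sort the suffix start positions lexicographically.
SA = sorted(range(n), key=lambda j: T[j:])
# SA == [7, 2, 5, 0, 3, 6, 1, 4], matching the SA column of Figure 1

# Inverse suffix array: ISA[j] = i if and only if SA[i] = j.
ISA = [0] * n
for i, j in enumerate(SA):
    ISA[j] = i

# All suffixes prefixed by P occupy a contiguous range of SA.
P = "ab"
rows = [i for i in range(n) if T[SA[i]:].startswith(P)]
# rows == [2, 3, 4]: a contiguous run of rows, as claimed
```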
In the CSA [9, 10], the suffix array values are
encoded indirectly by instead storing the Φ function,
where Φ(i) = SA^{-1}[SA[i] + 1]. The Φ function can
be compressed into optimal space (in the entropy
sense), and each SA value can then be computed by
consulting a small portion of the Φ function in
O(polylog n) time. More recently, Huo et al. [13] gave
a practical implementation of the CSA that encodes
the differences Φ(i) − Φ(i − 1) using Elias's gamma
coding, exploiting a remarkable property of Φ: it
forms an increasing sequence of positions within each
Σ list.
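A minimal sketch of Φ on Figure 1's data, computing it from the suffix array (0-indexed, wrapping at the end of T) and checking the increasing-runs property that the gamma coding exploits:

```python
T = "abaabab#"
n = len(T)
SA = sorted(range(n), key=lambda j: T[j:])  # naive, for illustration
ISA = [0] * n
for i, j in enumerate(SA):
    ISA[j] = i

# Phi(i) = ISA[(SA[i] + 1) mod n]
Phi = [ISA[(SA[i] + 1) % n] for i in range(n)]
# Phi == [3, 4, 5, 6, 7, 0, 1, 2], the Phi column of Figure 1

# Within the block of rows whose suffixes start with the same
# character c (the "Sigma list" of c), Phi is strictly increasing,
# so the differences Phi(i) - Phi(i-1) are cheap to gamma-code.
for c in set(T):
    block = [Phi[i] for i in range(n) if T[SA[i]] == c]
    assert block == sorted(block)
```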
    i  SA   Φ  LF    F        L
    0   7   3   5    # abaaba b
    1   2   4   6    a abab#a b
    2   5   5   7    a b#abaa b
    3   0   6   0    a baabab #
    4   3   7   1    a bab#ab a
    5   6   0   2    b #abaab a
    6   1   1   3    b aabab# a
    7   4   2   4    b ab#aba a

Figure 1: Example of the BWT of T = abaabab#.
The Burrows-Wheeler transform (BWT) of T is
an invertible permutation of T, denoted by L, such
that L[i] is the character in the text just preceding
the ith lexicographically smallest suffix of T. That is,
L[i] = T[(SA[i] − 1) mod n]. Intuitively, the sequence
L is easier to compress because adjacent characters
often share higher-order contexts, and thus space can
be reduced even further, to about nH_k bits. The LF
function [3] stands for last-to-first column mapping:
the character L[i] in the last column of Figure 1
is located in column F at position LF(i), i.e.,
L[i] = F[LF(i)]. In the example of Figure 1, L[6] and
F[LF(6)] = F[3] both correspond to the third a in the
string abaabab#. Thus we can walk backwards through
the text T using the function LF. That is, if T[k] = L[i],
then T[k − 1] = L[LF(i)].
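The following sketch computes L from the suffix array, derives LF by matching occurrences of each character between the last column L and the first column F, and then inverts the BWT by the backward walk just described (0-indexed, on Figure 1's example):

```python
T = "abaabab#"
n = len(T)
SA = sorted(range(n), key=lambda j: T[j:])  # naive, for illustration

# BWT: L[i] is the character just before the i-th smallest suffix.
L = ''.join(T[(SA[i] - 1) % n] for i in range(n))
# L == "bbb#aaaa", the last column of Figure 1

# LF: the k-th occurrence of c in L corresponds to the k-th
# occurrence of c in the first column F = sorted(L).
F = ''.join(sorted(L))
start = {c: F.index(c) for c in set(F)}  # start of c's block in F
seen = {}
LF = []
for ch in L:
    k = seen.get(ch, 0)
    LF.append(start[ch] + k)
    seen[ch] = k + 1
# LF == [5, 6, 7, 0, 1, 2, 3, 4], the LF column of Figure 1

# Walking backwards with LF recovers T, reversed, terminator first.
i, out = 0, []
for _ in range(n):
    out.append(L[i])
    i = LF[i]
s = ''.join(reversed(out))   # '#' followed by T without its terminator
assert s[1:] + s[0] == T
```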
The FM-index and the CSA are closely related:
the LF function and the CSA neighbor function Φ are
inverses of one another. That is, SA[LF(i)] =
(SA[i] − 1) mod n; equivalently, LF(i) =
SA^{-1}[(SA[i] − 1) mod n] = Φ^{-1}(i).
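This inverse relation can be checked directly against the columns of Figure 1 (with the subtraction taken mod n to account for the wraparound row):

```python
# SA, Phi, and LF columns as listed in Figure 1 (0-indexed).
SA  = [7, 2, 5, 0, 3, 6, 1, 4]
Phi = [3, 4, 5, 6, 7, 0, 1, 2]
LF  = [5, 6, 7, 0, 1, 2, 3, 4]
n = 8
for i in range(n):
    assert SA[LF[i]] == (SA[i] - 1) % n   # LF steps one text position back
    assert LF[Phi[i]] == i and Phi[LF[i]] == i   # LF and Phi are inverses
```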
Since the BWT does not change the distribution of
characters, the 0th-order empirical entropy of T remains
the same. However, the BWT tends to move characters
with similar contexts close together, so the resulting
string L exhibits good locality (equal characters tend to
appear in consecutive runs along the lexicographic order,
as shown in Figure 1), which makes the move-to-front or
wavelet tree transformation more effective.
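A small sketch of why this locality helps: move-to-front coding of the BWT string from Figure 1 turns its runs of equal characters into runs of zeros, which a 0th-order coder compresses well.

```python
L = "bbb#aaaa"                  # BWT of T = "abaabab#"
symbols = sorted(set(L))        # MTF list, initially ['#', 'a', 'b']
codes = []
for ch in L:
    j = symbols.index(ch)       # current rank of ch in the list
    codes.append(j)
    symbols.insert(0, symbols.pop(j))   # move ch to the front
# codes == [2, 0, 0, 1, 2, 0, 0, 0]: runs in L become runs of zeros
```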
2.3 The FM-index. Ferragina and Manzini intro-
duced the elegant FM-index [3, 4], based upon the
Burrows-Wheeler transform. The FM-index was the
first self-index shown to have both fast performance
and space usage within a constant factor of the desired
entropy bound for constant-sized alphabets. The core
problem of the FM-index is to provide a compressed
representation of L together with some auxiliary
structures that make it possible to compute the LF
mapping efficiently: LF(i) = C[L[i]] + Occ(i, L[i]) − 1,
where Occ(i, c) is the number of occurrences of
character c in the prefix L[0, i], and C[] is an array of
length σ + 1 such that C[c] is the total number of text
characters that are alphabetically smaller than c. For
the example in Figure 1, C = {0, 1, 5, 8}.
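The LF formula above can be sketched directly on Figure 1's data (0-indexed; the linear-scan Occ below is for illustration, whereas a real FM-index answers Occ with compressed rank structures):

```python
L = "bbb#aaaa"                     # BWT of T = "abaabab#"

def Occ(i, c):
    # Number of occurrences of c in the prefix L[0..i].
    return L[:i + 1].count(c)

# C[c] = number of text characters alphabetically smaller than c.
C = {}
total = 0
for c in sorted(set(L)):           # ['#', 'a', 'b']
    C[c] = total
    total += L.count(c)
# C == {'#': 0, 'a': 1, 'b': 5}, i.e. the array {0, 1, 5, 8} of the text

LF = [C[L[i]] + Occ(i, L[i]) - 1 for i in range(len(L))]
# LF == [5, 6, 7, 0, 1, 2, 3, 4], the LF column of Figure 1
```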
The FM-index is based upon a procedure called
backward search [4, 10], which finds the range of rows
in the BWT matrix M (the last three columns in Fig-
ure 1) that begin with a given pattern P. This range
represents the occurrences of P in T, which answers
the count query (returning the number of occurrences
of P in T). Thus backward search turns the count
query on T into a sequence of rank queries on L. With
a slight extension, we can implement the locate and
extract queries, where locate reports the positions at
which P occurs in T and extract displays the text
substring T[start, start + len − 1], given start and len.
The count algorithm describes the backward-search-
based counting operation, in which the while loop
performs p iterations, from p − 1 down to 0, where p is
the length of the pattern P. The algorithm maintains
the following invariant: after k iterations, the variable l
points to the first row of M prefixed by P[p − k, p − 1],
and the variable r points to the last row of M prefixed
by P[p − k, p − 1]. After p iterations, occ = r − l + 1,
the number of occurrences of P in T.
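The counting loop can be sketched as follows (0-indexed, on Figure 1's data; the linear-scan Occ again stands in for the index's rank structures):

```python
L = "bbb#aaaa"                   # BWT of T = "abaabab#"
C = {'#': 0, 'a': 1, 'b': 5}     # characters smaller than c, from Figure 1

def Occ(i, c):
    # Number of occurrences of c in L[0..i]; 0 when i < 0.
    return L[:i + 1].count(c) if i >= 0 else 0

def count(P):
    # Invariant: [l, r] spans the rows of M prefixed by the
    # last k characters of P after k iterations.
    l, r = 0, len(L) - 1
    for c in reversed(P):        # iterate from P[p-1] down to P[0]
        l = C[c] + Occ(l - 1, c)
        r = C[c] + Occ(r, c) - 1
        if l > r:
            return 0             # P does not occur in T
    return r - l + 1

print(count("ab"))   # 3 occurrences of "ab" in "abaabab#"
print(count("ba"))   # 2
```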
Copyright © 2015 by the Society for Industrial and Applied Mathematics.