Universal variable-length data compression
of binary sources using fountain codes
Giuseppe Caire Shlomo Shamai Amin Shokrollahi Sergio Verdú
Institut Eurecom Technion EPFL Princeton University
giuseppe.caire@eurecom.fr, sshlomo@ee.technion.ac.il, amin.shokrollahi@epfl.ch, verdu@princeton.edu
Abstract — This paper proposes a universal
variable-length lossless compression algorithm based
on fountain codes. The compressor concatenates
the Burrows-Wheeler block sorting transform (BWT)
with a fountain encoder, together with the closed-
loop iterative doping algorithm. The decompressor
uses a Belief Propagation algorithm in conjunction
with the iterative doping algorithm and the inverse
BWT. Linear-time compression/decompression com-
plexity and competitive performance with respect to
state-of-the-art compression algorithms are achieved.
I. Introduction
It is known that, for asymptotically large blocklength, linear
fixed-length encoding can achieve the minimum compression
rate for memoryless sources [1] and for arbitrary (not
necessarily stationary/ergodic) sources [2].
After initial attempts [3, 4, 5] to construct linear lossless
codes proved nonuniversal, limited to memoryless sources, and
unable to reach performance competitive with standard data
compression algorithms, interest in linear data compression
waned. Recently, [6, 7, 8] proposed a universal lossless
data compression algorithm based on irregular low-density
parity-check codes which has linear encoding and decoding
complexity, can exploit source memory, and in the experiments
for binary sources presented in [6, 7, 8] showed competitive
performance with respect to standard compressors such as
gzip, PPM and bzip.
The scheme of [6, 7, 8] was based on the important class
of sparse-graph error correcting codes called low-density par-
ity check (LDPC) codes. The block-sorting transform (or
Burrows-Wheeler transform (BWT)) [9] is a one-to-one trans-
formation, which performs the following operation: it gener-
ates all cyclic shifts of the given data string and sorts them
lexicographically. The last column of the resulting matrix is
the BWT output from which the original data string can be
recovered, knowing the BWT index which is the location of
the original sequence in the matrix. The BWT shifts redun-
dancy in the memory to redundancy in the marginal distribu-
tions. The redundancy in the marginal distributions is then
much easier to exploit at the decoder as the decoding com-
plexity is independent of the complexity of the source model
(in particular, the number of states for Markov sources). The
output of the BWT (as the blocklength grows) is asymptoti-
cally piecewise i.i.d. for stationary ergodic tree sources. The
length, location, and distribution of the i.i.d. segments depend
on the statistics of the source. The existing universal BWT-
based methods for data compression generally hinge on the
idea of compression for a memoryless source with an adaptive
procedure that implicitly learns the local distribution of the
piecewise i.i.d. segments, while forgetting the effect of distant
symbols.
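The cyclic-shift construction just described can be sketched directly. The following is a naive O(n² log n) illustration of the forward and inverse transforms, not the linear-time suffix-array construction used in practical BWT implementations:

```python
def bwt(s: str):
    """Burrows-Wheeler transform: sort all cyclic shifts of s
    lexicographically; return the last column of the sorted matrix
    and the row index of the original string (needed for inversion)."""
    n = len(s)
    rotations = sorted(s[i:] + s[:i] for i in range(n))
    index = rotations.index(s)
    last_column = "".join(rot[-1] for rot in rotations)
    return last_column, index

def inverse_bwt(last_column: str, index: int) -> str:
    """Invert the BWT by repeatedly prepending the last column
    and re-sorting, naively rebuilding the rotation matrix."""
    n = len(last_column)
    table = [""] * n
    for _ in range(n):
        table = sorted(last_column[i] + table[i] for i in range(n))
    return table[index]
```

For example, `bwt("banana")` returns `("nnbaaa", 3)`, and `inverse_bwt("nnbaaa", 3)` recovers `"banana"`: equal symbols are grouped together in the output, which is precisely the conversion of memory into skewed local marginals exploited by the decoder.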
In the data compression algorithm of [6, 7], the compression
is carried out by multiplication of the Burrows-Wheeler Trans-
form of the source string with the parity-check matrix of an
error correcting code. Of particular interest are LDPC codes
since the belief propagation (BP) decoder is able to incorpo-
rate the time-varying marginals at the output of the BWT
in a very natural way. The nonidentical marginals produced
at the output of the BWT have a synergistic effect with the
BP algorithm which is able to iteratively exploit imbalances
in the reliability of variable nodes. The universal implementa-
tion of the algorithm where the encoder identifies the source
segmentation and describes it to the decompressor is discussed
in [8].
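In matrix terms, the encoder of [6, 7] maps the BWT output x to its syndrome z = Hx (mod 2) for a sparse parity-check matrix H. The sketch below shows only this encoding map; the random fixed-row-weight matrix is a hypothetical stand-in for the optimized irregular LDPC ensembles actually used in [6, 7]:

```python
import random

def sparse_parity_check(m: int, n: int, row_weight: int = 3, seed: int = 0):
    """Build an m x n sparse binary parity-check matrix, stored as the
    column indices of the ones in each row (illustrative random
    construction, not an optimized irregular LDPC ensemble)."""
    rng = random.Random(seed)
    return [sorted(rng.sample(range(n), row_weight)) for _ in range(m)]

def syndrome(H_rows, x):
    """Compress the binary vector x to z = H x (mod 2): each compressed
    bit is the XOR of the source bits selected by one row of H."""
    return [sum(x[j] for j in cols) % 2 for cols in H_rows]

n, m = 16, 8                       # toy example: n source bits -> m bits
H = sparse_parity_check(m, n)
x = [1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0]   # BWT output
z = syndrome(H, x)                 # m-bit compressed description of x
```

At the decompressor, BP decoding recovers x from z together with the (time-varying) BWT marginals, and the CLID algorithm supplies the symbol values at which BP stalls.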
An important ingredient in the compression scheme of
[6, 7, 8] is the ability to run the decompressor at the
compressor. This makes it possible to tune the choice of the
codebook to the source realization and, more importantly, it
enables the use of the Closed-Loop Iterative Doping (CLID)
algorithm of [6, 2]. This is an efficient algorithm that enables
zero-error variable-length data compression with performance
quite competitive with that of standard data compression
algorithms.
In this paper, instead of adopting irregular low-density
parity-check codes of a given rate approximately matched to the
source, we adopt a different approach based on rateless fountain
codes. This class of codes turns out to be more natural for
variable-length data compression applications than standard
block codes and in general achieves performance comparable
to the LDPC-based scheme of [6, 7, 8].
The rest of the paper is organized as follows. Section II
reviews the main features of fountain codes for channel coding.
Section III gives a brief summary of the principle of belief
propagation decoding, which is common to both channel and
source decoding. Our scheme for data compression
with fountain codes is explained in detail in Section IV in the
setting of nonuniversal compression of binary sources. For fur-
ther background on linear codes for data compression and the
closed-loop iterative doping algorithm, the reader is referred to
[2]. The modelling module necessary for universal application
is discussed in Section V. Section VI shows several experi-
ments and comparisons of redundancy with off-the-shelf data
compression algorithms run with synthetic sources. Through-
out the discussion we limit ourselves to binary sources. The
generalization to nonbinary alphabets is treated in [10].
II. Fountain Codes for Channel Coding
Fountain codes [11] form a new class of sparse-graph codes
designed for protection of data against noise in environments
where the noise level is not known a priori. To achieve this, a
fountain code produces a potentially limitless stream of out-
put symbols for a given vector of input symbols. In practical
applications, each output symbol is a linear function of the
input symbols, and the output symbols are generated
independently and randomly, according to a distribution which is