28 1. Approaches to Compression
Exercise 1.3: Compare the three different approaches (two-passes, training, and adap-
tive compression algorithms) and list some of the pros and cons for each.
Several variable-length codes are listed and described later in this section, and the
discussion shows how the average code length can be used to determine the statistical
distribution to which the code is best suited.
The second consideration in the design of a variable-length code is unique decod-
ability (UD). We start with a simple example: the code a
1
=0,a
2
= 10, a
3
= 101,
and a
4
= 111. Encoding the string a
1
a
3
a
4
... with these codewords results in the bit-
string 0101111.... However, decoding is ambiguous. The same bitstring 0101111...can
be decoded either as a
1
a
3
a
4
... or a
1
a
2
a
4
.... This code is not uniquely decodable. In
contrast, the similar code a
1
=0,a
2
= 10, a
3
= 110, and a
4
= 111 (where only the
codeword of a
3
is different) is UD. The string a
1
a
3
a
4
...is easily encoded to 0110111...,
and this bitstring can be decoded unambiguously. The first 0 implies a
1
, because only
the codeword of a
1
starts with 0. The next (second) bit, 1, can be the start of a
2
, a
3
,
or a
4
. The next (third) bit is also 1, which reduces the choice to a
3
or a
4
. The fourth
bit is 0, so the decoder emits a
3
.
A little thinking clarifies the difference between the two codes. The first code is
ambiguous because 10, the code of a
2
, is also the prefix of the code of a
3
. When the
decoder reads 10..., it often cannot tell whether this is the codeword of a
2
or the start
of the codeword of a
3
. The second code is UD because the codeword of a
2
is not the
prefix of any other codeword. In fact, none of the codewords of this code is the prefix
of any other codeword.
This observation suggests the following rule. To construct a UD code, the codewords
should satisfy the following prefix property. Once a codeword c is assigned to a symbol,
no other codeword should start with the bit pattern c. Prefix codes are also referred to
as prefix-free codes, prefix condition codes, or instantaneous codes. Observe, however,
that a UD code does not have to be a prefix code. It is possible, for example, to designate
the string 111 as a separator (a comma) to separate individual codewords of different
lengths, provided that no codeword contains the string 111. There are other ways to
construct a set of non-prefix, variable-length codes.
A UD code is said to be instantaneous if it is possible to decode each codeword in
a compressed file without knowing the succeeding codewords. Prefix codes are instan-
taneous.
Constructing a UD code for given finite set of data symbols should start with the
probabilities of the symbols. If the probabilities are known (at least approximately),
then the best variable-length code for the symbols is obtained by the Huffman algo-
rithm (Chapter 2). There are, however, applications where the set of data symbols is
unbounded; its size is either extremely large or is not known in advance. Here are a few
practical examples of both cases:
Text. There are 128 ASCII codes, so the size of this set of symbols is reasonably
small. In contrast, the number of Unicodes is in the tens of thousands, which makes it
impractical to use variable-length codes to compress text in Unicode; a different approach
is required.
A grayscale image. For 8-bit pixels, the number of shades of gray is 256, so a set of
256 codewords is required, large, but not too large.