Implementing Sorting in Database Systems 7
Fig. 4. Order-preserving dictionary compression.
symbol has been encountered for the first time. Alternatively, encoding or decoding
may start with a tree constructed for static order-preserving Huffman compression
based on a fixed sample of text. Hybrids of the two approaches are also possible, that
is, starting with a nonempty tree and developing it further if necessary. Similarly, a
binary tree with leaves containing strings rather than single characters can be used for
order-preserving dynamic dictionary encoding. A separate parser must cut the input
into encoding units, which are then encoded and decoded using a binary tree.
When run-length encoding and dictionary compression are modified to be order-
preserving, the symbols following the substituted string must be considered. When a
new string is inserted into the dictionary, the longest preexisting prefix of a new string
must be assigned two encodings, rather than only one [Antoshenkov et al. 1996]. For
example, assume that a dictionary already contains the string “the” with an appropri-
ate code, and the string “there” is to be inserted into the dictionary with its own code. In
this case, the string “the” must be assigned not one, but two codes: one for “the” strings
followed by a string less than “re,” and one for “the” strings followed by a string greater
than “re.” The encoding for “there” might be the value 124, and the two encodings for
“the” are either 123 or 125, depending on its continuation. Using these three codes, the
strings “then,” “therefore,” and “they” can be compressed based on the encodings. The
prefix “the” within “then” requires code 123, whereas “the” within “they” requires code
125 such that “then,” “therefore,” and “they” can be sorted correctly.
Figure 4 illustrates the idea and combines it with earlier concepts about adaptive
order-preserving Huffman compression. At some point, the string “the” has an encoding
or bit pattern assigned to it in the example ending in “1.” When the string “there”
is introduced, the leaf node representation of “the” is expanded into a small subtree
with 3 leaf nodes. Now, the compression of “the” in “then” ends in “10” and of “the”
in “they” ends in “111.” The compression of “there” in “therefore” ends in “110,” which
sorts correctly between the encodings of “then” and “they.” The newly created subtree
in Figure 4 is right-deep based on the assumption that future text will contain more
occurrences of “the” sorting lower than “there” than occurrences sorting higher than
“there.” Subsequent tree rotations may optimize the compression scheme further.
Dictionary compression is particularly effective for long strings of padding charac-
ters (e.g., white space in fixed-size string fields) and for default values. Of course, it
is also related to the normalization of NULL values, as described earlier. A useful ex-
tension uses multiple bits to compress NULL, default, and otherwise frequent values.
For example, 2 bits (instead of only 1 bit for NULL values) permit one value for NULL
values (“00”) that sort lower than all valid data values, one for the default value (“10”),
and two for actual values that are smaller or larger than the default value (“01” and
“11”). For example, the value 0 is a frequent value in many numeric columns, so the
2-bit combination “10” may indicate the column value 0, which does not need to be
stored explicitly, “01” indicates negative values, and “11” indicates positive values. If
multiple frequent values are known a priori, say 7 values in addition to NULL, then
ACM Computing Surveys, Vol. 38, No. 3, Article 10, Publication date: September 2006.