
Microsoft LZX Data Compression Format
Page 4 of 18 March 20, 1997
Trees
LZX uses canonical Huffman tree structures to represent elements. Huffman trees are well known in data
compression and are not described here. Since an LZX decoder uses only the path lengths of the Huffman
tree to reconstruct the identical tree, the following constraints are made on the tree structure:
1. For any two elements with the same path length, the lower-numbered element must be further left
on the tree than the higher numbered element. An alternative way of stating this constraint is that
lower-numbered elements must have lower path traversal values; for example, 0010 (left-left-right-
left) is lower than 0011 (left-left-right-right).
2. For each level, starting at the deepest level of the tree and then moving upwards, leaf nodes must
start as far left as possible. An alternative way of stating this constraint is that if any tree node has
children then all tree nodes to the left of it with the same path length must also have children.
3. Zero length Huffman codes are not permitted, therefore a tree must contain at least 2 elements. In
the case where all tree elements are zero frequency, or all but one tree element is zero frequency,
the resulting tree must consist of the two Huffman codes «0» and «1». In the latter case, constraint
#1 still applies.
LZX uses several Huffman tree structures. The most important tree is the main tree, which comprises 256
elements corresponding to all possible ASCII characters, plus 8 * NUM_POSITION_SLOTS (see above)
elements corresponding to matches. The second most important tree is the length tree, which comprises 249
elements.
Other trees, such as the aligned offset tree (comprising 8 elements), and the pre-trees (comprising 20
elements each), have a smaller role.
Repeated offsets
LZX extends the conventional LZ77 format in several ways, one of which is in the use of repeated offset
codes. Three match offset codes, named the repeated offset codes, are reserved to indicate that the current
match offset is the same as that of one of the three previous matches which is not itself a repeated offset.
The three special offset codes are encoded as offset values 0, 1, and 2 (i.e. encoding an offset of 0 means
«use the most recent non-repeated match offset», an offset of 1 means «use the second most recent non-
repeated match offset», etc.). All remaining offset values are displaced by +3, as is shown in the table
below, which prevents matches at offsets WINDOW_SIZE, WINDOW_SIZE-1, and WINDOW_SIZE-2.
Correlation between encoded offset and real offset
Encoded offset Real offset
0 Most recent non-repeated match offset
1 Second most recent non-repeated match offset
2 Third most recent non-repeated match offset
3 1 (closest allowable)
42
53
64
75
86
500 498
x+2 x
WINDOW_SIZE-1
(maximum possible)
WIDOW_SIZE-3
评论1