SIGMOD’20, June 14–19, 2020, Portland, OR, USA Huanchen Zhang et al.
abc 0110
abcd
abcf
abcgh
abcpq
abc
abd
string axis
dictionary entry:
Figure 2: Dictionary Entry Example
– All sub-intervals of
[abc,
abd) are valid mappings for dictionary entry abc −→ 0110.
cannot encode arbitrary strings unless they grow the dictio-
nary, but growing to accommodate new entries may require
the DBMS to re-encode the entire corpus [
19
]. In the string
axis model, a dictionary is complete if and only if the union
of all the intervals (i.e.,
Ð
I
i
) covers the entire string axis.
A dictionary encoding
Enc
:
Σ
∗
→ X
∗
is
uniquely decod-
able
if
Enc
is an injection (i.e., there is a one-to-one mapping
from every element of
Σ
∗
to an element in
X
∗
). To guarantee
unique decodability, we must ensure that (1) there is only one
way to encode a source string and (2) every encoded result
is unique. Under our string axis model, these requirements
are equivalent to (1) all intervals
I
i
’s are disjoint and (2) the
set of codes
C
used in the dictionary are uniquely decodable
(we only consider prex codes in this paper).
With these requirements, we can use the string axis model
to construct a dictionary that is both complete and uniquely
decodable. As shown in Figure 1, for a given dictionary size
of
n
entries, we rst divide the string axis into
n
consecutive
intervals
I
0
, I
1
, . . . , I
n−1
, where the max-length common pre-
x
s
i
of all strings in
I
i
is not empty (i.e.,
len(s
i
) >
0) for each
interval. We use
b
0
, b
1
, . . . , b
n−1
, b
n
to denote interval bound-
aries. That is,
I
i
= [b
i
, b
i+1
)
for
i =
0
,
1
, . . . , n −
1. We then
assign a set of uniquely decodable codes
c
0
, c
1
, . . . , c
n−1
to the
intervals. Our dictionary is thus
s
i
→ c
i
, i =
0
,
1
, . . . , n−
1. A
dictionary lookup maps the source string to a single interval
I
i
, where b
i
< src < b
i+1
.
We can achieve the
order-preserving
property on top
of unique decodability by assigning monotonically increas-
ing codes
c
0
< c
1
< . . . < c
n−1
to the intervals. This is
easy to prove. Suppose there are two source strings (
src
1
,
src
2
), where
src
1
< src
2
. If
src
1
and
src
2
belong to the
same interval
I
i
in the dictionary, they must share com-
mon prex
s
i
. Replacing
s
i
with
c
i
in each string does not
aect their relative ordering. If
src
1
and
src
2
map to dier-
ent intervals
I
i
and
I
j
, then
Enc(src
1
) = c
i
· Enc(src
1
su f f i x
)
,
Enc(src
2
)= c
j
· Enc(src
2
su f f i x
)
. Since
src
1
< src
2
,
I
i
must pre-
ceed
I
j
on the string axis. That means
c
i
< c
j
. Because
c
i
’s
are prex codes, c
i
· Enc(src
1
su f f i x
) < c
j
· Enc(src
2
su f f i x
).
For encoding search tree keys, we prefer schemes that are
complete and order-preserving; unique decodability is im-
plied by the latter property. Completeness allows the scheme
to encode arbitrary keys, while order-preserving guarantees
Fixed-Len Interval
Fixed-Len Code
a b
01100001
c d e
01100010
01100011
01100100
a b c d e
010
0111001110
11
11100001
01100001
01100010
01100011
01100100
010
0110011110
11
11100001
a abc acabd acaz acs
a
aae acabc acabe acn
a b c d
a b c d
a a acd ac
a a
acab ac
Code
Symbol (interval common prefix)
FIFC
Fixed-Len Interval
Variable-Len Code
FIVC
Variable-Len Interval
Fixed-Len Code
VIFC
Variable-Len Interval
Variable-Len Code
VIVC
Figure 3: Compression Models –
Four categories of complete
and order-preserving dictionary encoding schemes.
that the search tree supports meaningful range queries on
the encoded keys.
3.2 Exploiting Entropy
For a dictionary encoding scheme to reduce the size of
the corpus, its emitted codes must be shorter than the
source strings. Given a complete, order-preserving dictio-
nary
D
:
s
i
→ c
i
, i =
0
,
1
, . . . , n −
1, let
p
i
denote the
probability that a dictionary entry is accessed at each step
during the encoding of an arbitrary source string. Because
the dictionary is complete and uniquely decodable (implied
by order-preserving),
Í
n−1
i=0
p
i
=
1. The encoding scheme
achieves the best compression when the compression rate
CPR =
Í
n−1
i=0
len(s
i
)p
i
Í
n−1
i=0
len(c
i
)p
i
is maximized.
According to the string axis model, we can characterize
a dictionary encoding scheme in two aspects: (1) how to
divide intervals and (2) what code to assign to each interval.
Interval division determines the symbol lengths (
len(s
i
)
) and
the access probability distribution (
p
i
) in a dictionary. Code
assignment exploits the entropy in
p
i
’s by using shorter
codes (c
i
) for more frequently-accessed intervals.
We consider two interval-division strategies: xed-length
intervals and variable-length intervals. For code assignment,
we consider two types of prex codes: xed-length codes
and optimal variable-length codes. We, therefore, divide all
complete and order-preserving dictionary encoding schemes
into four categories, as shown in Figure 3.
Fixed-length Interval, Fixed-length Code (FIFC):
This
is the baseline scheme because ASCII encodes characters in
this way. We do not consider this category for compression.
Fixed-length Interval, Variable-length Code (FIVC):
This category is the classic Hu-Tucker encoding [
27
]. If order-
preserving is not required, both Human encoding [
28
] and