SuRF: Practical Range ery Filtering with Fast Succinct Tries SIGMOD’18, June 10–15, 2018, Houston, TX, USA
The second bitmap (D-HasChild) indicates whether a branch
points to a sub-trie or terminates (i.e., points to the value or the
branch does not exist). Taking the root node in Figure 2 as an
example, the
f
and the
t
branches continue with sub-tries while
the
s
branch terminates with a value. In this case, the D-HasChild
bitmap only sets the 102nd (f) and 116th (t) bits for the node.
The third bitmap (D-IsPrexKey) includes only one bit per node.
The bit indicates whether the prex that leads to the node is also a
valid key. For example, in Figure 2, the rst node at level 1 has
f
as
its prex. Meanwhile,
‘f’
is also a key stored in the trie. To denote
this situation, the D-IsPrexKey bit for this child node must be set.
The nal byte-sequence (D-Values) stores the xed-length values
(e.g., pointers) mapped by the keys. The values are concatenated in
level order: same as the three bitmaps.
Tree navigation uses array lookups and rank & select operations.
We denote
rank
1
/
select
1
over bit sequence bs on position pos to be
rank
1
/
select
1
(bs, pos). Let pos be the current bit position in D-Labels.
To traverse down the trie, given pos where D-HasChild[pos] = 1,
D-ChildNodePos
(pos) = 256
×rank
1
(D-HasChild, pos) computes
the bit position of the rst child node. To move up the trie,
D-
ParentNodePos
(pos) = 256
×se lect
1
(D-HasChild,
⌊
pos/256
⌋
) com-
putes the bit position of the parent node. To access values, given
pos where D-HasChild[pos] = 0,
D-ValuePos
(pos) =
rank
1
(D-Labels,
pos) -
rank
1
(D-HasChild, pos) +
rank
1
(D-IsPrexKey,
⌊
pos/256
⌋
)-1
gives the lookup position.
2.3 LOUDS-Sparse
As shown in the lower half of Figure 2, LOUDS-Sparse encodes a
trie node using four byte or bit-sequences. The encoded nodes are
then concatenated in level-order.
The rst byte-sequence, S-Labels, records all the branching labels
for each trie node. As an example, the rst non-value node at level 2
in Figure 2 has three branches. S-Labels includes their labels
r
,
s
, and
t
in order. We denote the case where the prex leading to a node is
also a valid key using the special byte
0xFF
at the beginning of the
node (this case is handled by D-IsPrexKey in LOUDS-Dense). For
example, in Figure 2, the rst non-value node at level 3 has
‘fas’
as
its incoming prex. Since
‘fas’
itself is also a stored key, the node
adds
0xFF
to S-Labels as the rst byte. Because the special byte
always appears at the beginning of a node, it can be distinguished
from the real 0xFF label.
The second bit-sequence (S-HasChild) includes one bit for each
byte in S-Labels to indicate whether a child branch continues (i.e.,
points to a sub-trie) or terminates (i.e., points to a value). Taking
the rightmost node at level 2 in Figure 2 as an example, because
the branch labeled
i
points to a sub-trie, the corresponding bit in
S-HasChild is set. The branch labeled
y
, on the other hand, points
to a value. Its S-HasChild bit is cleared.
The third bit-sequence (S-LOUDS) also includes one bit for each
byte in S-Labels. S-LOUDS denotes node boundaries: if a label is the
rst in a node, its S-LOUDS bit is set. Otherwise, the bit is cleared.
For example, in Figure 2, the rst non-value node at level 2 has
three branches and is encoded as 100 in the S-LOUDS sequence.
The nal byte-sequence (S-Values) is organized the same way as
D-Values in LOUDS-Dense.
Tree navigation on LOUDS-Sparse is as follows: to move
down the trie, S-ChildNodePos(pos) = select
1
(S-LOUDS, rank
1
(S-
HasChild, pos) + 1); to move up,
S-ParentNodePos
(pos) =
select
1
(S-
HasChild,
rank
1
(S-LOUDS, pos) - 1); to access a value,
S-ValuePos
(pos) = pos - rank
1
(S-HasChild, pos) - 1.
2.4 LOUDS-DS and Operations
LOUDS-DS is a hybrid trie in which the upper levels are encoded
with LOUDS-Dense and the lower levels with LOUDS-Sparse. The
dividing point between the upper and lower levels is tunable to
trade performance and space. FST keeps the number of upper levels
small in favor of the space eciency provided by LOUDS-Sparse.
We maintain a size ratio
R
between LOUDS-Sparse and LOUDS-
Dense to determine the dividing point among levels. Suppose the
trie has
H
levels. Let
LOUDS-Dense-Size(l)
, 0
≤ l ≤ H
denote the
size of LOUDS-Dense-encoded levels up to
l
(non-inclusive). Let
LOUDS-Sparse-Size(l)
, represent the size of LOUDS-Sparse encoded
levels from
l
(inclusive) to
H
. The cuto level is dened as the largest
l
such that
LOUDS-Dense-Size(l) × R ≤ LOUDS-Sparse-Size(l )
. Re-
ducing
R
leads to more levels, favoring performance over space. We
use R=64 as the default.
LOUDS-DS supports three basic operations eciently:
• ExactKeySearch
(key): Return the value of key if key exists (or
NULL otherwise).
• LowerBound
(key): Return an iterator pointing to the key-value
pair
(k, v)
where
k
is the smallest in lexicographical order satis-
fying k ≥ key.
• MoveToNext(iter): Move the iterator to the next key-value.
A point query on LOUDS-DS works by rst searching the
LOUDS-Dense levels. If the search does not terminate, it continues
into the LOUDS-Sparse levels. The high-level searching steps at
each level are similar regardless of the encoding mechanism: First,
search the current node’s range in the label sequence for the tar-
get key byte. If the key byte does not exist, terminate and return
NULL. Otherwise, check the corresponding bit in the HasChild bit-
sequence. If the bit is
1
(i.e., the branch points to a child node),
compute the child node’s starting position in the label sequence
and continue to the next level. Otherwise, return the corresponding
value in the value sequence. We precompute two aggregate values
based on the LOUDS-Dense levels: the node count and the number
of HasChild bits set. Using these two values, LOUDS-Sparse can
operate as if the entire trie is encoded with LOUDS-Sparse.
Range queries use a high-level algorithm similar to the point
query implementation. When performing LowerBound, instead of
doing an exact search in the label sequence, the algorithm searches
for the smallest label ≥ the target label. When moving to the next
key, the cursor starts at the current leaf label position and moves
forward. If another valid label
l
is found within the node, the algo-
rithm nds the left-most leaf key in the subtree rooted at
l
. If the
cursor hits node boundary instead, the algorithm moves the cursor
up to the corresponding position in the parent node.
We include per-level cursors in the iterator to minimize the
relatively expensive “move-to-parent” and “move-to-child” calls,
which require rank & select operations. These cursors record a trace
from root to leaf (i.e., the per-level positions in the label sequence)
for the current key. Because of the level-order layout of LOUDS-DS,
Research 4: Query Processing
SIGMOD’18, June 10-15, 2018, Houston, TX, USA