SuRF: An Efficient and Compact Data Structure for Range Query Filtering


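SuRF: Practical Range Query Filtering with Fast Succinct Tries

SuRF is a fast and compact data structure for approximate membership tests. Unlike a traditional Bloom filter, SuRF supports both single-key lookups and common range queries: open-range queries (bounded on one side), closed-range queries (bounded on both sides), and range counts (the number of elements that fall inside a given range). This makes SuRF useful wherever a filter must answer range queries as well as point queries.

SuRF is built on a new data structure called the Fast Succinct Trie (FST). FST matches the point and range query performance of state-of-the-art order-preserving indexes while consuming only 10 bits per trie node. This compactness greatly reduces storage requirements, which is especially helpful in memory-constrained environments.

The false positive rates of SuRF are tunable: users can adjust the false positive rates of point queries and range queries to fit the needs of a specific application, trading accuracy against space. For example, in a database system such as RocksDB, SuRF can replace the existing filter structure to improve query performance and resource usage.

The authors are Huanchen Zhang, Hyeontaek Lim, Viktor Leis, David G. Andersen, Michael Kaminsky, Kimberly Keeton, and Andrew Pavlo, of Carnegie Mellon University and collaborating institutions. Their work shows how a theoretically efficient succinct structure can be turned into a practical tool.

In summary, SuRF combines the strengths of the Fast Succinct Trie into an efficient, configurable range query filter, suited to real-time data processing workloads that demand both storage efficiency and query performance. It extends what Bloom filters offer by adding range query support.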

SuRF: Practical Range Query Filtering with Fast Succinct Tries. SIGMOD’18, June 10–15, 2018, Houston, TX, USA

The second bitmap (D-HasChild) indicates whether a branch points to a sub-trie or terminates (i.e., points to the value or the branch does not exist). Taking the root node in Figure 2 as an example, the ‘f’ and the ‘t’ branches continue with sub-tries while the ‘s’ branch terminates with a value. In this case, the D-HasChild bitmap only sets the 102nd (f) and 116th (t) bits for the node.

The third bitmap (D-IsPrefixKey) includes only one bit per node. The bit indicates whether the prefix that leads to the node is also a valid key. For example, in Figure 2, the first node at level 1 has ‘f’ as its prefix. Meanwhile, ‘f’ is also a key stored in the trie. To denote this situation, the D-IsPrefixKey bit for this child node must be set.

The final byte-sequence (D-Values) stores the fixed-length values (e.g., pointers) mapped by the keys. The values are concatenated in level order, the same as the three bitmaps.

Tree navigation uses array lookups and rank & select operations. We denote rank_1/select_1 over bit sequence bs on position pos to be rank_1/select_1(bs, pos). Let pos be the current bit position in D-Labels. To traverse down the trie, given pos where D-HasChild[pos] = 1, D-ChildNodePos(pos) = 256 × rank_1(D-HasChild, pos) computes the bit position of the first child node. To move up the trie, D-ParentNodePos(pos) = 256 × select_1(D-HasChild, ⌊pos/256⌋) computes the bit position of the parent node. To access values, given pos where D-HasChild[pos] = 0, D-ValuePos(pos) = rank_1(D-Labels, pos) - rank_1(D-HasChild, pos) + rank_1(D-IsPrefixKey, ⌊pos/256⌋) - 1 gives the lookup position.
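The downward and value-access formulas can be tried out on a toy LOUDS-Dense encoding. The sketch below is illustrative only: the four-key trie ("f", "fa", "s", "to"), the value names, and the helper names are all assumptions, rank_1 is taken to count set bits up to and including pos with 0-based positions, and the linear-time rank is chosen for clarity rather than speed.

```python
FANOUT = 256  # one bit per possible byte label in each node

def rank1(bits, pos):
    """Count the set bits in bits[0..pos], inclusive (positions are 0-based)."""
    return sum(bits[: pos + 1])

def make_bitmap(size, set_positions):
    bits = [0] * size
    for p in set_positions:
        bits[p] = 1
    return bits

# Hypothetical toy trie for the keys "f", "fa", "s", "to".
# Level order: node 0 = root, node 1 = prefix "f", node 2 = prefix "t".
D_LABELS = make_bitmap(3 * FANOUT, [ord("f"), ord("s"), ord("t"),
                                    1 * FANOUT + ord("a"),
                                    2 * FANOUT + ord("o")])
D_HAS_CHILD = make_bitmap(3 * FANOUT, [ord("f"), ord("t")])
D_IS_PREFIX_KEY = [0, 1, 0]                 # only "f" is both a prefix and a key
D_VALUES = ["v_s", "v_f", "v_fa", "v_to"]   # values in level order

def d_child_node_pos(pos):
    """Bit position where the child node reached through branch pos starts."""
    return FANOUT * rank1(D_HAS_CHILD, pos)

def d_value_pos(pos):
    """Index into D_VALUES for a terminating branch at bit position pos."""
    return (rank1(D_LABELS, pos) - rank1(D_HAS_CHILD, pos)
            + rank1(D_IS_PREFIX_KEY, pos // FANOUT) - 1)

# Look up "fa": follow 'f' from the root, then read the value behind 'a'.
node = d_child_node_pos(ord("f"))           # bit position where node 1 starts
pos_a = node + ord("a")
print(node, D_VALUES[d_value_pos(pos_a)])   # 256 v_fa
```

The root's 'f' bit is the first set bit in D-HasChild, so the child starts at 256 × 1. The value index for 'a' counts four labels, subtracts two child branches, adds one prefix key, and subtracts one.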

2.3 LOUDS-Sparse

As shown in the lower half of Figure 2, LOUDS-Sparse encodes a trie node using four byte- or bit-sequences. The encoded nodes are then concatenated in level order.

The first byte-sequence, S-Labels, records all the branching labels for each trie node. As an example, the first non-value node at level 2 in Figure 2 has three branches. S-Labels includes their labels r, s, and t in order. We denote the case where the prefix leading to a node is also a valid key using the special byte 0xFF at the beginning of the node (this case is handled by D-IsPrefixKey in LOUDS-Dense). For example, in Figure 2, the first non-value node at level 3 has ‘fas’ as its incoming prefix. Since ‘fas’ itself is also a stored key, the node adds 0xFF to S-Labels as the first byte. Because the special byte always appears at the beginning of a node, it can be distinguished from the real 0xFF label.

The second bit-sequence (S-HasChild) includes one bit for each byte in S-Labels to indicate whether a child branch continues (i.e., points to a sub-trie) or terminates (i.e., points to a value). Taking the rightmost node at level 2 in Figure 2 as an example, because the branch labeled i points to a sub-trie, the corresponding bit in S-HasChild is set. The branch labeled y, on the other hand, points to a value. Its S-HasChild bit is cleared.

The third bit-sequence (S-LOUDS) also includes one bit for each byte in S-Labels. S-LOUDS denotes node boundaries: if a label is the first in a node, its S-LOUDS bit is set. Otherwise, the bit is cleared. For example, in Figure 2, the first non-value node at level 2 has three branches and is encoded as 100 in the S-LOUDS sequence.

The final byte-sequence (S-Values) is organized the same way as D-Values in LOUDS-Dense.
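To make the four sequences concrete, the following sketch builds them from a sorted key set with a breadth-first pass. It is a minimal illustration, not the paper's construction algorithm: the function name, the use of the key's index as its value, and the key set (chosen to be consistent with the examples in the text, since Figure 2 is not reproduced here) are assumptions. It also assumes a non-empty set of unique keys that contain no 0xFF byte.

```python
from collections import deque

TERMINATOR = 0xFF  # special byte: "the prefix leading to this node is a key"

def build_louds_sparse(pairs):
    """Encode (key, value) pairs as S-Labels, S-HasChild, S-LOUDS, S-Values."""
    labels, has_child, louds, values = [], [], [], []
    # Each queue entry is one trie node: (depth, items sharing `depth` bytes).
    queue = deque([(0, sorted(pairs))])
    while queue:                      # FIFO order yields level-order layout
        depth, group = queue.popleft()
        first = True                  # the next label emitted opens the node
        i = 0
        if len(group[0][0]) == depth:
            # The incoming prefix is itself a key: emit the special byte.
            labels.append(TERMINATOR)
            has_child.append(0)
            louds.append(1)
            values.append(group[0][1])
            first = False
            i = 1
        while i < len(group):
            byte = group[i][0][depth]
            j = i
            while j < len(group) and group[j][0][depth] == byte:
                j += 1
            branch = group[i:j]       # all keys that continue with `byte`
            is_leaf = len(branch) == 1 and len(branch[0][0]) == depth + 1
            labels.append(byte)
            has_child.append(0 if is_leaf else 1)
            louds.append(1 if first else 0)
            first = False
            if is_leaf:
                values.append(branch[0][1])
            else:
                queue.append((depth + 1, branch))
            i = j
    return bytes(labels), has_child, louds, values

keys = [b"f", b"far", b"fas", b"fast", b"fat", b"s",
        b"top", b"toy", b"trie", b"trip", b"try"]
s_labels, s_has_child, s_louds, s_values = build_louds_sparse(
    [(k, i) for i, k in enumerate(keys)])
print(s_labels)     # b'fst\xffaorrstpyiy\xfftep'
print(s_louds[7:10])  # [1, 0, 0]
```

For this key set, the node with prefix "fa" occupies positions 7 to 9 with labels r, s, t and S-LOUDS bits 100, and the node with prefix "fas" begins with the 0xFF marker, in line with the examples described above.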

Tree navigation on LOUDS-Sparse is as follows: to move down the trie, S-ChildNodePos(pos) = select_1(S-LOUDS, rank_1(S-HasChild, pos) + 1); to move up, S-ParentNodePos(pos) = select_1(S-HasChild, rank_1(S-LOUDS, pos) - 1); to access a value, S-ValuePos(pos) = pos - rank_1(S-HasChild, pos) - 1.
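The three navigation formulas can be exercised on a small hard-coded encoding. This is a sketch under stated assumptions: the arrays encode a hypothetical 11-key trie (f, far, fas, fast, fat, s, top, toy, trie, trip, try), each value is simply the key's own name, positions are 0-based, rank_1 is inclusive, and select_1 counts set bits from 1. Under that 0-based convention the value index works out to pos - rank_1(S-HasChild, pos), so the sketch omits the trailing - 1; the two agree if pos in the text is read as 1-based, which is an assumption of this sketch.

```python
# Hypothetical LOUDS-Sparse encoding; nodes are laid out in level order:
# fst | \xff a | or | rst | py | iy | \xff t | ep
S_LABELS    = b"fst\xffaorrstpyiy\xfftep"
S_HAS_CHILD = [1, 0, 1,  0, 1,  1, 1,  0, 1, 0,  0, 0,  1, 0,  0, 0,  0, 0]
S_LOUDS     = [1, 0, 0,  1, 0,  1, 0,  1, 0, 0,  1, 0,  1, 0,  1, 0,  1, 0]
S_VALUES    = ["s", "f", "far", "fat", "top", "toy", "try",
               "fas", "fast", "trie", "trip"]

def rank1(bits, pos):
    """Count the set bits in bits[0..pos], inclusive (0-based positions)."""
    return sum(bits[: pos + 1])

def select1(bits, i):
    """Position of the i-th set bit, with i counted from 1."""
    seen = 0
    for p, b in enumerate(bits):
        seen += b
        if b and seen == i:
            return p
    raise ValueError("bit sequence has fewer than i set bits")

def s_child_node_pos(pos):
    """First label position of the child node behind the branch at pos."""
    return select1(S_LOUDS, rank1(S_HAS_CHILD, pos) + 1)

def s_parent_node_pos(pos):
    """Position of the parent branch that leads to the node containing pos."""
    return select1(S_HAS_CHILD, rank1(S_LOUDS, pos) - 1)

def s_value_pos(pos):
    """0-based index into S_VALUES for a terminating branch at pos."""
    return pos - rank1(S_HAS_CHILD, pos)

# Walk f -> a -> s, then read the value stored for the prefix key "fas".
pos = s_child_node_pos(0)              # node with prefix "f" starts at 3
pos = s_child_node_pos(pos + 1)        # branch 'a' at 4, node "fa" starts at 7
pos = s_child_node_pos(pos + 1)        # branch 's' at 8, node "fas" starts at 14
print(pos, S_VALUES[s_value_pos(pos)], s_parent_node_pos(pos))  # 14 fas 8
```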

2.4 LOUDS-DS and Operations

LOUDS-DS is a hybrid trie in which the upper levels are encoded with LOUDS-Dense and the lower levels with LOUDS-Sparse. The dividing point between the upper and lower levels is tunable to trade performance and space. FST keeps the number of upper levels small in favor of the space efficiency provided by LOUDS-Sparse. We maintain a size ratio R between LOUDS-Sparse and LOUDS-Dense to determine the dividing point among levels. Suppose the trie has H levels. Let LOUDS-Dense-Size(l), 0 ≤ l ≤ H, denote the size of LOUDS-Dense-encoded levels up to l (non-inclusive). Let LOUDS-Sparse-Size(l) represent the size of LOUDS-Sparse-encoded levels from l (inclusive) to H. The cutoff level is defined as the largest l such that LOUDS-Dense-Size(l) × R ≤ LOUDS-Sparse-Size(l). Reducing R leads to more LOUDS-Dense levels, favoring performance over space. We use R = 64 as the default.
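The cutoff rule is a small search over candidate levels. The sketch below assumes the per-level sizes of both encodings are already known; the function name and the example sizes are hypothetical and only serve to show how R shifts the dividing point.

```python
def cutoff_level(dense_level_sizes, sparse_level_sizes, ratio=64):
    """Largest l in [0, H] with LOUDS-Dense-Size(l) * R <= LOUDS-Sparse-Size(l).

    dense_level_sizes[i] / sparse_level_sizes[i] give the size of level i
    under each encoding (same unit for both lists).
    """
    height = len(dense_level_sizes)
    if height != len(sparse_level_sizes):
        raise ValueError("both lists must describe the same levels")
    best = 0  # l = 0 always qualifies: no dense levels means dense size 0
    for level in range(height + 1):
        dense_size = sum(dense_level_sizes[:level])     # levels 0 .. l-1
        sparse_size = sum(sparse_level_sizes[level:])   # levels l .. H-1
        if dense_size * ratio <= sparse_size:
            best = level
    return best

# Made-up sizes for a 4-level trie (arbitrary units).
dense = [1, 10, 100, 1000]
sparse = [1, 4, 40, 4000]
print(cutoff_level(dense, sparse))            # 2
print(cutoff_level(dense, sparse, ratio=16))  # 3
```

With these numbers, lowering R from 64 to 16 moves one more level into LOUDS-Dense, which illustrates the performance-over-space trade-off described above.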

LOUDS-DS supports three basic operations efficiently:

• ExactKeySearch(key): Return the value of key if key exists (or NULL otherwise).

• LowerBound(key): Return an iterator pointing to the key-value pair (k, v) where k is the smallest in lexicographical order satisfying k ≥ key.

• MoveToNext(iter): Move the iterator to the next key-value pair.
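The contract of these three operations can be pinned down with a reference model over a plain sorted list. This is deliberately not an FST: it only shows the expected results, which is handy as an oracle when testing a trie-based implementation. The class and method names are assumptions, and an iterator is modeled as a list index.

```python
import bisect

class SortedMapModel:
    """Reference semantics for the three operations on a sorted list."""

    def __init__(self, pairs):
        ordered = sorted(pairs)                 # byte-wise lexicographic order
        self.keys = [k for k, _ in ordered]
        self.values = [v for _, v in ordered]

    def exact_key_search(self, key):
        """Value of key if present, else None (standing in for NULL)."""
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i]
        return None

    def lower_bound(self, key):
        """Iterator (an index) at the smallest stored k with k >= key."""
        return bisect.bisect_left(self.keys, key)

    def move_to_next(self, it):
        """Advance the iterator by one key-value pair."""
        return it + 1

    def get(self, it):
        """Pair under the iterator, or None once it is past the last key."""
        if it < len(self.keys):
            return self.keys[it], self.values[it]
        return None

model = SortedMapModel([(b"far", 1), (b"fat", 2), (b"toy", 3)])
it = model.lower_bound(b"fas")
print(model.get(it))                       # (b'fat', 2)
print(model.get(model.move_to_next(it)))   # (b'toy', 3)
print(model.exact_key_search(b"fa"))       # None
```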

A point query on LOUDS-DS works by first searching the LOUDS-Dense levels. If the search does not terminate, it continues into the LOUDS-Sparse levels. The high-level searching steps at each level are similar regardless of the encoding mechanism: First, search the current node’s range in the label sequence for the target key byte. If the key byte does not exist, terminate and return NULL. Otherwise, check the corresponding bit in the HasChild bit-sequence. If the bit is 1 (i.e., the branch points to a child node), compute the child node’s starting position in the label sequence and continue to the next level. Otherwise, return the corresponding value in the value sequence. We precompute two aggregate values based on the LOUDS-Dense levels: the node count and the number of HasChild bits set. Using these two values, LOUDS-Sparse can operate as if the entire trie is encoded with LOUDS-Sparse.
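The per-level search steps can be sketched on a trie that is encoded entirely with LOUDS-Sparse (that is, with zero LOUDS-Dense levels, a simplification of this sketch). The arrays encode a hypothetical 11-key trie whose values are the key names; positions are 0-based with an inclusive rank, so the value index is pos - rank_1(S-HasChild, pos). Keys are assumed to contain no 0xFF byte, which lets the marker be recognised by a simple check.

```python
TERMINATOR = 0xFF
# Hypothetical encoding of: f far fas fast fat s top toy trie trip try
S_LABELS    = b"fst\xffaorrstpyiy\xfftep"
S_HAS_CHILD = [1, 0, 1,  0, 1,  1, 1,  0, 1, 0,  0, 0,  1, 0,  0, 0,  0, 0]
S_LOUDS     = [1, 0, 0,  1, 0,  1, 0,  1, 0, 0,  1, 0,  1, 0,  1, 0,  1, 0]
S_VALUES    = ["s", "f", "far", "fat", "top", "toy", "try",
               "fas", "fast", "trie", "trip"]

def rank1(bits, pos):
    return sum(bits[: pos + 1])            # inclusive, 0-based

def select1(bits, i):
    seen = 0
    for p, b in enumerate(bits):
        seen += b
        if b and seen == i:
            return p
    raise ValueError("bit sequence has fewer than i set bits")

def node_end(pos):
    """Position just past the node whose first label sits at pos."""
    end = pos + 1
    while end < len(S_LOUDS) and S_LOUDS[end] == 0:
        end += 1
    return end

def is_prefix_marker(pos):
    return S_LABELS[pos] == TERMINATOR and not S_HAS_CHILD[pos]

def exact_key_search(key):
    pos = 0                                 # first label of the root node
    for depth, byte in enumerate(key):
        start = pos + 1 if is_prefix_marker(pos) else pos
        hit = S_LABELS.find(bytes([byte]), start, node_end(pos))
        if hit < 0:
            return None                     # key byte not in this node
        if S_HAS_CHILD[hit]:                # branch continues: go one level down
            pos = select1(S_LOUDS, rank1(S_HAS_CHILD, hit) + 1)
            continue
        # Branch terminates: it matches only if no key bytes remain.
        if depth == len(key) - 1:
            return S_VALUES[hit - rank1(S_HAS_CHILD, hit)]
        return None
    # Key bytes used up at an inner node: stored only if the marker is present.
    if is_prefix_marker(pos):
        return S_VALUES[pos - rank1(S_HAS_CHILD, pos)]
    return None

print(exact_key_search(b"fast"), exact_key_search(b"f"), exact_key_search(b"tr"))
# fast f None
```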

Range queries use a high-level algorithm similar to the point query implementation. When performing LowerBound, instead of doing an exact search in the label sequence, the algorithm searches for the smallest label ≥ the target label. When moving to the next key, the cursor starts at the current leaf label position and moves forward. If another valid label l′ is found within the node, the algorithm finds the left-most leaf key in the subtree rooted at l′. If the cursor hits a node boundary instead, the algorithm moves the cursor up to the corresponding position in the parent node.
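The move-to-next step can be sketched as an in-order walk over a LOUDS-Sparse encoding. This is an illustration under assumptions rather than the paper's iterator: the arrays encode a hypothetical 11-key trie, keys contain no 0xFF byte, and the iterator is a plain list holding one label position per level, so moving up reuses a stored position instead of recomputing the parent with rank and select.

```python
TERMINATOR = 0xFF
# Hypothetical encoding of: f far fas fast fat s top toy trie trip try
S_LABELS    = b"fst\xffaorrstpyiy\xfftep"
S_HAS_CHILD = [1, 0, 1,  0, 1,  1, 1,  0, 1, 0,  0, 0,  1, 0,  0, 0,  0, 0]
S_LOUDS     = [1, 0, 0,  1, 0,  1, 0,  1, 0, 0,  1, 0,  1, 0,  1, 0,  1, 0]

def rank1(bits, pos):
    return sum(bits[: pos + 1])            # inclusive, 0-based

def select1(bits, i):
    seen = 0
    for p, b in enumerate(bits):
        seen += b
        if b and seen == i:
            return p
    raise ValueError("bit sequence has fewer than i set bits")

def child_pos(pos):
    return select1(S_LOUDS, rank1(S_HAS_CHILD, pos) + 1)

def descend_leftmost(cursors, pos):
    """Follow first branches from pos to a leaf, recording one cursor per level."""
    while S_HAS_CHILD[pos]:
        cursors.append(pos)
        pos = child_pos(pos)
    cursors.append(pos)

def move_to_next(cursors):
    """Advance to the next key in order; return False when none is left."""
    while cursors:
        pos = cursors.pop() + 1            # step forward within the node
        at_boundary = pos == len(S_LOUDS) or S_LOUDS[pos] == 1
        if not at_boundary:
            descend_leftmost(cursors, pos) # left-most leaf under the new label
            return True
        # Node boundary: the next loop iteration resumes in the parent node.
    return False

def current_key(cursors):
    """Rebuild the key from the per-level cursors, skipping the 0xFF marker."""
    return bytes(S_LABELS[p] for p in cursors
                 if not (S_LABELS[p] == TERMINATOR and not S_HAS_CHILD[p]))

cursors = []
descend_leftmost(cursors, 0)               # smallest key in the trie
keys_in_order = [current_key(cursors)]
while move_to_next(cursors):
    keys_in_order.append(current_key(cursors))
print(keys_in_order[:5])   # [b'f', b'far', b'fas', b'fast', b'fat']
```

Starting from the left-most leaf and calling move_to_next repeatedly visits every stored key in lexicographical order, which is what a range scan relies on after LowerBound has positioned the iterator.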

We include per-level cursors in the iterator to minimize the relatively expensive “move-to-parent” and “move-to-child” calls, which require rank & select operations. These cursors record a trace from root to leaf (i.e., the per-level positions in the label sequence) for the current key. Because of the level-order layout of LOUDS-DS,

Research 4: Query Processing

