HOPE: High-Speed Order-Preserving Key Compression for In-Memory Search Trees


"Order-Preserving Key Compression for In-Memory Search Trees - 2003 (2003.02391) - 计算机科学" 这篇论文提出了一个名为High-speed Order-Preserving Encoder (HOPE)的压缩算法，专为内存中的搜索树设计。HOPE的主要目标是在保持键（key）原有顺序的同时，对任意键进行高效压缩。在数据库系统中，搜索树是一种关键的数据结构，用于快速查找和操作数据。然而，随着数据量的增加，内存占用成为一个关键问题，因此对键进行压缩可以有效地节省存储空间。 HOPE的工作原理是通过识别键中的细粒度共性模式，并利用这些模式的熵（即信息含量）来实现高压缩率。这种技术的关键在于找到一种方法，在保持键的排序顺序的同时，利用数据的内在规律进行压缩。论文作者首先建立了一个理论模型，用于分析和设计保持顺序的字典压缩方法。这个模型帮助他们评估不同压缩策略的效率和可行性。论文中，作者选取了六种代表性的压缩方案，并将它们集成到HOPE中。这些方案在压缩率和编码速度之间有不同的权衡。为了验证HOPE的性能，研究者在五种常用的数据库数据结构上进行了实验：SuRF、ART、HOT等。这些数据结构广泛应用于实际的数据库系统中，如索引、查询优化等。实验结果表明，HOPE能够在不影响搜索效率的前提下，显著减少内存中的键占用空间，同时保持搜索树的顺序特性。这对于内存受限的数据库系统来说，是一个重要的优化手段，能够提升系统的整体性能和可扩展性。此外，HOPE的可定制性和灵活性使其可以根据不同的应用场景和数据特性进行调整，进一步优化压缩效果。这篇2003年的计算机科学研究论文提出了一个创新的压缩技术，它对内存中的搜索树数据结构进行了优化，以提高存储效率并保持查询性能。HOPE的贡献在于提供了一种新的思路，平衡了压缩效率和查询速度，对数据库领域的研究和实践具有深远的影响。


SIGMOD’20, June 14–19, 2020, Portland, OR, USA Huanchen Zhang et al.

Figure 2: Dictionary Entry Example. All sub-intervals of [abc, abd) are valid mappings for dictionary entry abc → 0110. (String-axis labels in the figure: abc, abcd, abcf, abcgh, abcpq, abd.)

cannot encode arbitrary strings unless they grow the dictionary, but growing to accommodate new entries may require the DBMS to re-encode the entire corpus [ ]. In the string axis model, a dictionary is complete if and only if the union of all the intervals (i.e., $\bigcup I_i$) covers the entire string axis.

A dictionary encoding $\mathrm{Enc}: \Sigma^* \to X^*$ is uniquely decodable if $\mathrm{Enc}$ is an injection (i.e., there is a one-to-one mapping from every element of $\Sigma^*$ to an element in $X^*$). To guarantee unique decodability, we must ensure that (1) there is only one way to encode a source string and (2) every encoded result is unique. Under our string axis model, these requirements are equivalent to (1) all intervals $I_i$ are disjoint and (2) the set of codes $c_i$ used in the dictionary is uniquely decodable (we only consider prefix codes in this paper).
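Requirement (2) can be checked mechanically for a candidate code set. The following is a minimal sketch rather than code from the paper; the function name `is_prefix_free` and the sample code sets are illustrative assumptions.

```python
def is_prefix_free(codes):
    """Return True if no code equals, or is a prefix of, another code."""
    ordered = sorted(codes)
    # In lexicographic order, a code that is a prefix of some other code is
    # always immediately followed by a code that starts with it, so checking
    # adjacent pairs is sufficient.
    return all(not nxt.startswith(cur) for cur, nxt in zip(ordered, ordered[1:]))


print(is_prefix_free(["00", "010", "011", "10", "11"]))  # True
print(is_prefix_free(["0", "01", "1"]))                  # False: "0" is a prefix of "01"
```

A prefix-free code set is uniquely decodable because a decoder reading left to right can never confuse the end of one code with the middle of another.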

With these requirements, we can use the string axis model to construct a dictionary that is both complete and uniquely decodable. As shown in Figure 1, for a given dictionary size of $n$ entries, we first divide the string axis into $n$ consecutive intervals $I_0, I_1, \ldots, I_{n-1}$, where the max-length common prefix $s_i$ of all strings in $I_i$ is not empty (i.e., $\mathrm{len}(s_i) > 0$) for each interval. We use $b_0, b_1, \ldots, b_{n-1}, b_n$ to denote interval boundaries. That is, $I_i = [b_i, b_{i+1})$ for $i = 0, 1, \ldots, n-1$. We then assign a set of uniquely decodable codes $c_0, c_1, \ldots, c_{n-1}$ to the intervals. Our dictionary is thus $s_i \to c_i$, $i = 0, 1, \ldots, n-1$. A dictionary lookup maps the source string $src$ to a single interval $I_i$, where $b_i \le src < b_{i+1}$.
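The lookup step described above amounts to a binary search over the sorted interval boundaries. The sketch below illustrates this under stated assumptions: the toy dictionary over the alphabet {a, b, c}, its boundaries, and its 3-bit codes are made up for illustration, and the binary search via `bisect` is a convenient stand-in, not necessarily the lookup structure the paper uses.

```python
from bisect import bisect_right

# Hypothetical toy dictionary over the alphabet {a, b, c}; not from the paper.
# BOUNDARIES[i] is b_i, the left end of I_i = [b_i, b_{i+1}); the last
# interval extends to the end of the string axis.
BOUNDARIES = ["a", "ab", "ac", "b", "c"]
SYMBOLS = ["a", "ab", "ac", "b", "c"]        # s_i: common prefix of I_i
CODES = ["000", "001", "010", "011", "100"]  # c_i: fixed-length codes here


def lookup(src):
    """Return (i, s_i, c_i) for the interval with b_i <= src < b_{i+1}."""
    if not src or any(ch not in "abc" for ch in src):
        raise ValueError("toy dictionary covers non-empty strings over {a, b, c}")
    # bisect_right counts the boundaries that are <= src, so subtracting one
    # gives the index of the interval whose left boundary is closest below src.
    i = bisect_right(BOUNDARIES, src) - 1
    return i, SYMBOLS[i], CODES[i]


print(lookup("abca"))  # (1, 'ab', '001')
print(lookup("aac"))   # (0, 'a', '000')
print(lookup("cb"))    # (4, 'c', '100')
```

The symbols are stored separately from the boundaries because in general the common prefix of an interval is shorter than its left boundary; they coincide in this toy dictionary only by construction.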

We can achieve the order-preserving property on top of unique decodability by assigning monotonically increasing codes $c_0 < c_1 < \ldots < c_{n-1}$ to the intervals. This is easy to prove. Suppose there are two source strings $(src_1, src_2)$, where $src_1 < src_2$. If $src_1$ and $src_2$ belong to the same interval $I_i$ in the dictionary, they must share the common prefix $s_i$. Replacing $s_i$ with $c_i$ in each string does not affect their relative ordering. If $src_1$ and $src_2$ map to different intervals $I_i$ and $I_j$, then $\mathrm{Enc}(src_1) = c_i \cdot \mathrm{Enc}(src_{1,\mathit{suffix}})$ and $\mathrm{Enc}(src_2) = c_j \cdot \mathrm{Enc}(src_{2,\mathit{suffix}})$. Since $src_1 < src_2$, $I_i$ must precede $I_j$ on the string axis. That means $c_i < c_j$. Because the $c_i$ are prefix codes, $c_i \cdot \mathrm{Enc}(src_{1,\mathit{suffix}}) < c_j \cdot \mathrm{Enc}(src_{2,\mathit{suffix}})$.
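The argument above can be sanity-checked by brute force on a small example. This is a sketch under assumptions, not the paper's implementation: the toy dictionary over {a, b, c} and its variable-length codes are invented for illustration, and the check only covers keys up to a chosen length.

```python
from bisect import bisect_right
from itertools import product

# Hypothetical toy dictionary over {a, b, c} (an assumption for illustration).
# In this dictionary each symbol s_i happens to equal the left boundary b_i.
BOUNDARIES = ["a", "ab", "ac", "b", "c"]
CODES = ["00", "010", "011", "10", "11"]  # monotonically increasing prefix codes


def encode(key):
    """Repeatedly look up the interval of the remaining key and emit its code."""
    out = []
    while key:
        i = bisect_right(BOUNDARIES, key) - 1
        out.append(CODES[i])
        key = key[len(BOUNDARIES[i]):]  # strip the matched symbol s_i
    return "".join(out)


def order_preserved(max_len=4):
    """Check that sorted keys map to strictly increasing encodings."""
    keys = sorted(
        "".join(chars)
        for n in range(1, max_len + 1)
        for chars in product("abc", repeat=n)
    )
    encoded = [encode(k) for k in keys]
    return all(x < y for x, y in zip(encoded, encoded[1:]))


print(encode("abca"))     # 0101100
print(order_preserved())  # True
```

Strictly increasing encodings also imply that no two keys share an encoding, which matches the claim that order preservation implies unique decodability.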

For encoding search tree keys, we prefer schemes that are complete and order-preserving; unique decodability is implied by the latter property. Completeness allows the scheme to encode arbitrary keys, while order-preserving guarantees that the search tree supports meaningful range queries on the encoded keys.

Figure 3: Compression Models. Four categories of complete and order-preserving dictionary encoding schemes. (Panels: FIFC, Fixed-Len Interval and Fixed-Len Code; FIVC, Fixed-Len Interval and Variable-Len Code; VIFC, Variable-Len Interval and Fixed-Len Code; VIVC, Variable-Len Interval and Variable-Len Code. Each panel maps symbols, i.e., interval common prefixes, to codes.)

3.2 Exploiting Entropy

For a dictionary encoding scheme to reduce the size of the corpus, its emitted codes must be shorter than the source strings. Given a complete, order-preserving dictionary $s_i \to c_i$, $i = 0, 1, \ldots, n-1$, let $p_i$ denote the probability that dictionary entry $i$ is accessed at each step during the encoding of an arbitrary source string. Because the dictionary is complete and uniquely decodable (implied by order-preserving), $\sum_{i=0}^{n-1} p_i = 1$. The encoding scheme achieves the best compression when the compression rate

$$\mathrm{CPR} = \frac{\sum_{i=0}^{n-1} \mathrm{len}(s_i)\, p_i}{\sum_{i=0}^{n-1} \mathrm{len}(c_i)\, p_i}$$

is maximized.
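To make the CPR formula concrete, here is a small sketch that evaluates it for a made-up dictionary. The function name `compression_rate`, the symbols, codes, and probabilities are all hypothetical, and the code assumes 8 bits per source character so that both sums are measured in bits.

```python
def compression_rate(symbols, codes, probs, bits_per_char=8):
    """CPR = sum(len(s_i) * p_i) / sum(len(c_i) * p_i), with both sums in bits."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("access probabilities must sum to 1")
    # Expected number of source bits consumed per dictionary lookup.
    source_bits = sum(bits_per_char * len(s) * p for s, p in zip(symbols, probs))
    # Expected number of code bits emitted per dictionary lookup.
    code_bits = sum(len(c) * p for c, p in zip(codes, probs))
    return source_bits / code_bits


symbols = ["a", "ab", "ac", "b", "c"]     # s_i (hypothetical)
codes = ["00", "010", "011", "10", "11"]  # c_i (hypothetical)
probs = [0.2, 0.4, 0.1, 0.2, 0.1]         # p_i (made-up access probabilities)

print(round(compression_rate(symbols, codes, probs), 2))  # 4.8
```

In this example each lookup consumes 12 source bits and emits 2.5 code bits on average, which gives the ratio of 4.8.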

According to the string axis model, we can characterize a dictionary encoding scheme in two aspects: (1) how to divide intervals and (2) what code to assign to each interval. Interval division determines the symbol lengths ($\mathrm{len}(s_i)$) and the access probability distribution ($p_i$) in a dictionary. Code assignment exploits the entropy in the $p_i$ by using shorter codes ($c_i$) for more frequently-accessed intervals.
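As a reminder of why the skew in $p_i$ matters, the following standard information-theory bound (a well-known fact, not a statement taken from this section) shows what code assignment can achieve at best once the interval division is fixed. It assumes binary prefix codes, $\mathrm{len}(c_i)$ and $\mathrm{len}(s_i)$ both measured in bits, and $H(p) > 0$.

```latex
\sum_{i=0}^{n-1} p_i \,\mathrm{len}(c_i) \;\ge\; H(p) = -\sum_{i=0}^{n-1} p_i \log_2 p_i
\quad\Longrightarrow\quad
\mathrm{CPR} \;\le\; \frac{\sum_{i=0}^{n-1} \mathrm{len}(s_i)\, p_i}{H(p)}
```

The more skewed the access distribution, the smaller $H(p)$ becomes, and the more room there is for variable-length codes to improve on fixed-length ones.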

We consider two interval-division strategies: fixed-length intervals and variable-length intervals. For code assignment, we consider two types of prefix codes: fixed-length codes and optimal variable-length codes. We, therefore, divide all complete and order-preserving dictionary encoding schemes into four categories, as shown in Figure 3.

Fixed-length Interval, Fixed-length Code (FIFC): This is the baseline scheme because ASCII encodes characters in this way. We do not consider this category for compression.

Fixed-length Interval, Variable-length Code (FIVC): This category is the classic Hu-Tucker encoding [ ]. If order-preserving is not required, both Huffman encoding [ ] and
