The First Optimal Versioned External-Memory Dictionary: An Ideal Balance Between Queries and Updates


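This document concerns the paper "Versioned external-memory dictionaries with optimal query-update tradeoffs", dated April 12, 2011, by Andrew Byde and Andy Twigg of Acunu Ltd. The paper was posted on arXiv in the computer science category cs.DS.

External-memory dictionaries are a fundamental data structure in file systems and databases. A versioned (or fully persistent) dictionary has an associated version tree: queries can be performed at any version, updates are restricted to leaf versions, and any version can be cloned by adding a child. For unversioned dictionaries, a variety of query/update tradeoffs are known, many of them with matching upper and lower bounds. Before this paper, however, no fully versioned external-memory dictionary was known with optimal space/query/update tradeoffs. In particular, no versioned construction was known that offers updates in o(1) I/Os using O(N) space. The authors present the first cache-oblivious and cache-aware constructions for this problem, achieving a wide range of optimal points on the tradeoff between space, query cost and update cost.

The core content covers:

1. **Background and motivation**: the tradeoffs between space, query cost and update performance for versioned data structures in external memory, and how to design efficient algorithms in the presence of a cache hierarchy.
2. **Contributions**: the first constructions that are optimal on these tradeoffs in external memory, both in the cache-oblivious setting (block and memory sizes unknown to the algorithm) and in the cache-aware setting (those parameters known).
3. **Terminology and concepts**: cache-oblivious algorithms, external-memory algorithms and the design principles of versioned data structures.
4. **Methods and analysis**: the paper may describe the new data structure in detail and analyze its space, query and update complexity.
5. **Applications and impact**: the results may give file systems and databases a more efficient way to operate on versioned data, which matters for large-scale data processing and storage.

The paper deepens the understanding of versioned external-memory dictionaries and offers a new approach to improving the performance of such structures in practice.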

[…]ing versions descending by their DFS number satisfies this, with the advantage that ancestorship can be tested in $O(1)$ time: let the interval $I(v) = [\mathrm{DFS}(v), \max_{w \succeq v} \mathrm{DFS}(w)]$; then $v \preceq w \iff \mathrm{DFS}(w) \in I(v)$. As the version tree changes, we can use an efficient renumbering scheme to retain integer DFS values, such as in the order maintenance problem [4].
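The interval test above can be illustrated with a minimal Python sketch. It assumes a static version tree given as a hypothetical `children` map and assigns preorder DFS numbers once; the function names are illustrative, and the renumbering scheme for a changing tree is not modelled.

```python
from typing import Dict, List, Tuple


def dfs_intervals(children: Dict[str, List[str]], root: str) -> Dict[str, Tuple[int, int]]:
    """Return I(v) = (DFS(v), largest DFS number in v's subtree) for every version."""
    start: Dict[str, int] = {}
    intervals: Dict[str, Tuple[int, int]] = {}
    counter = 0
    # Iterative DFS; the flag marks whether the subtree of v has been fully visited.
    stack = [(root, False)]
    while stack:
        v, finished = stack.pop()
        if finished:
            # Every descendant of v received a number in [start[v], counter - 1].
            intervals[v] = (start[v], counter - 1)
            continue
        start[v] = counter
        counter += 1
        stack.append((v, True))
        for child in reversed(children.get(v, [])):
            stack.append((child, False))
    return intervals


def is_ancestor(intervals: Dict[str, Tuple[int, int]], v: str, w: str) -> bool:
    """True if v is an ancestor of w or equal to it: DFS(w) lies in I(v)."""
    lo, hi = intervals[v]
    return lo <= intervals[w][0] <= hi
```

Once the intervals are computed, each ancestorship query is two integer comparisons, which is the $O(1)$ test described in the text.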

2.2 Definitions

Consider a set of elements $A$ and versions $V$. An element $(k, v)$ is a lead element (at $v$) if $v \in V$. Define $\mathrm{lead}(A, v)$ as the total number of lead elements at $v$ in $A$, and $\mathrm{lead}(A, V) = \sum_{v \in V} \mathrm{lead}(A, v)$. The lead-below count is the total lead at versions descended from $v$, i.e. $\mathrm{lead\_below}(v) = \sum_{x \succeq v} \mathrm{lead}(x)$.

An element $(k, x)$ is said to be live (or accessible) at version $v$ in $A$ if $x \preceq v$ and $k$ has not been rewritten between $x$ and $v$, i.e. there is no other element $(k, y) \in A$ with $x \prec y \preceq v$. Let $\mathrm{live}(A, v)$ be the total number of elements of $A$ that are live at $v$. Note that if $v \preceq w$ then $\mathrm{live}(v) \le \mathrm{live}(w)$. Also

$$\mathrm{live}(v) \le \mathrm{live}(\mathrm{parent}(v)) + \mathrm{lead}(v), \tag{1}$$

with the difference between the right and left-hand sides being equal to the number of keys $k$ which appear in both versions $v$ and $\mathrm{parent}(v)$ (that is, keys live at $\mathrm{parent}(v)$ that are rewritten at $v$). We use $N$ to denote the total number of keys written; for a version $v$, we use $N_v$ to denote the number of keys that are live at $v$, i.e. the number of distinct keys written in ancestor versions of $v$ (each key is live at least once, so $\sum_v N_v \ge N$).
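To make the lead and live counts concrete, here is a small Python sketch. It assumes elements are `(key, version)` pairs, that the version tree is given as a hypothetical `parent` map with the root mapped to `None`, and that each key is written at most once per version. All function names are illustrative.

```python
from typing import Dict, Iterator, List, Optional, Tuple

Element = Tuple[str, str]  # (key, version at which the key was written)


def root_path(parent: Dict[str, Optional[str]], v: Optional[str]) -> Iterator[str]:
    """Yield v, parent(v), ... up to the root (the ancestor-or-self chain)."""
    while v is not None:
        yield v
        v = parent[v]


def lead(elements: List[Element], v: str) -> int:
    """Number of elements written at version v."""
    return sum(1 for _, x in elements if x == v)


def lead_below(elements: List[Element], parent: Dict[str, Optional[str]], v: str) -> int:
    """Total lead over v and every version descended from v."""
    return sum(1 for _, x in elements if v in root_path(parent, x))


def live(elements: List[Element], parent: Dict[str, Optional[str]], v: str) -> int:
    """Number of elements live at v: one per distinct key written on the root path."""
    on_path = set(root_path(parent, v))
    return len({k for k, x in elements if x in on_path})
```

For a root `v0` holding keys `a` and `b` and a child `v1` that rewrites `a` and adds `c`, this gives live(v1) = 3 and live(v0) + lead(v1) = 4; the gap of one is the rewritten key `a`, as inequality (1) describes.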

We assume that keys and values (which could be pointers to data or real data) are all of fixed size.

3. A CACHE-OBLIVIOUS VERSIONED B-TREE

In this section we present a cache-oblivious versioned B-tree, which we refer to as a stratified doubling array (SDA). It contains a collection of arrays of key-version-value tuples, arranged into levels, with 'forward pointers' to facilitate searching. Arrays in level $l$ are roughly twice as large as arrays in level $l - 1$, hence 'doubling', and have disjoint sets of versions associated to them, hence 'stratified' in version space.

The basic idea is to store arrays of kv-ordered elements, as in the COLA of Bender et al. [5], except that we apply a version split process, similar to the one employed in the versioned B-tree, albeit more complex, in order to avoid arrays containing too few elements from some version (we call this a 'density' property). The result is that each level may have several arrays, tagged with disjoint sets of versions that indicate which should be used.

3.1 Arrays

An array $(A, V)$ contains a set $A$ of entries $(k, v, x)$ where $k$ is a key, $v$ is a version, and $x$ is either a data value or a forward pointer containing an array index (the array into which it indexes will become clear from the context later), ordered by $(k, v)$. The set $V$ is a set of 'valid versions' that will be used for lookups and merges between various arrays. Each array also contains a pointer to a unique 'next array', identifying the array, if any, into which its forward pointers point. Arrays implement the following operations:

• search(k,v,[lb],[ub]): search for a (k, v) pair, within optional lower and upper bounds. It returns the index of a least upper bound y for (k, v) in the k-v order, and the destinations of the two closest forward pointers either side of y.

• iterate(loc): provides an iterator over elements starting from index loc.

• append(k,v,x): appends the entry to the end of the array, returning its location.
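The three operations can be sketched in Python as follows. This is an in-memory illustration under several assumptions, not the paper's implementation: versions are integers compared in plain tuple order on `(k, v)`, callers append entries already in that order, "either side of y" is read as strictly before index y and at or after it, and the names `KVArray` and `ForwardPointer` are hypothetical.

```python
import bisect
from dataclasses import dataclass
from typing import Any, Iterator, List, Optional, Tuple


@dataclass(frozen=True)
class ForwardPointer:
    index: int  # index into the 'next array'


class KVArray:
    def __init__(self) -> None:
        self._keys: List[Tuple[Any, int]] = []  # (k, v) pairs, kept sorted
        self._vals: List[Any] = []              # data value or ForwardPointer

    def append(self, k: Any, v: int, x: Any) -> int:
        """Append an entry to the end of the array and return its location."""
        self._keys.append((k, v))
        self._vals.append(x)
        return len(self._keys) - 1

    def iterate(self, loc: int) -> Iterator[Tuple[Any, int, Any]]:
        """Iterate over entries starting from index loc."""
        for i in range(loc, len(self._keys)):
            k, v = self._keys[i]
            yield k, v, self._vals[i]

    def search(self, k: Any, v: int, lb: int = 0,
               ub: Optional[int] = None) -> Tuple[int, Optional[int], Optional[int]]:
        """Return (y, left, right): y is the index of the least entry >= (k, v) within
        [lb, ub); left and right are the destinations of the nearest forward pointers
        before y and at or after y, or None when there is none."""
        hi = len(self._keys) if ub is None else ub
        y = bisect.bisect_left(self._keys, (k, v), lb, hi)
        left = next((self._vals[i].index for i in range(y - 1, -1, -1)
                     if isinstance(self._vals[i], ForwardPointer)), None)
        right = next((self._vals[i].index for i in range(y, len(self._vals))
                      if isinstance(self._vals[i], ForwardPointer)), None)
        return y, left, right
```

The two forward-pointer destinations bracket the region of the next array in which a follow-up search would need to look, which is why `search` accepts optional bounds. The linear scans for pointers are a simplification of this sketch.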

3.2 Definitions

For a version $v$, the density of version $v$ in $A$ is $\delta(A, v) = \mathrm{live}(A, v)/|A|$. We say that a version $v$ is dense in $A$ if $\delta(A, v) \ge 1/3$, and that an array $(A, V)$ is dense if every $v \in V$ is dense in $A$. Note that if $v$ is dense in $(A, V)$ then every descendant version is also dense there.
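The density condition can be checked with a short Python sketch. It assumes a non-empty array of `(key, version)` elements and a hypothetical `parent` map with the root mapped to `None`; exact fractions are used so that the comparison with 1/3 is not affected by rounding.

```python
from fractions import Fraction
from typing import Dict, Iterable, List, Optional, Tuple

Element = Tuple[str, str]  # (key, version at which the key was written)


def live(elements: List[Element], parent: Dict[str, Optional[str]], v: Optional[str]) -> int:
    """Number of elements live at v: one per distinct key written on v's root path."""
    on_path = set()
    while v is not None:
        on_path.add(v)
        v = parent[v]
    return len({k for k, x in elements if x in on_path})


def density(elements: List[Element], parent: Dict[str, Optional[str]], v: str) -> Fraction:
    """delta(A, v) = live(A, v) / |A| for a non-empty array A."""
    return Fraction(live(elements, parent, v), len(elements))


def is_dense_array(elements: List[Element], parent: Dict[str, Optional[str]],
                   valid_versions: Iterable[str]) -> bool:
    """An array (A, V) is dense if every valid version has density at least 1/3."""
    return all(density(elements, parent, v) >= Fraction(1, 3) for v in valid_versions)
```

With one key written at the root `v0` and six more at its child `v1`, the root has density 1/7 and the child 7/7, so the array is dense for $V = \{v1\}$ but not for $V = \{v0, v1\}$.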

Given a non-empty set of versions $V$, we say a version $v$ is an orphan of $V$ if it has no strict ancestor in $V$. We say the array $(A, V)$ is a stratum if the orphans of $V$ are all siblings, that is, they have the same parent, not in $V$, which we write without ambiguity as $\mathrm{parent}(V)$.
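The orphan and stratum conditions can be tested directly from a parent map, as in the following Python sketch. The `parent` map (root mapped to `None`) and the function names are assumptions made for illustration. A set containing the root is treated here as having common parent `None`, which is a choice of this sketch and not something the text specifies.

```python
from typing import Dict, Optional, Set


def orphans(parent: Dict[str, Optional[str]], versions: Set[str]) -> Set[str]:
    """Versions in the set that have no strict ancestor in the set."""
    result = set()
    for v in versions:
        p = parent[v]
        while p is not None and p not in versions:
            p = parent[p]
        if p is None:  # walked past the root without meeting a member of the set
            result.add(v)
    return result


def is_stratum(parent: Dict[str, Optional[str]], versions: Set[str]) -> bool:
    """True if all orphans of the set are siblings (share one parent)."""
    return len({parent[v] for v in orphans(parent, versions)}) == 1


def stratum_parent(parent: Dict[str, Optional[str]], versions: Set[str]) -> Optional[str]:
    """Return parent(V), the common parent of the orphans, for a stratum V."""
    parents = {parent[v] for v in orphans(parent, versions)}
    if len(parents) != 1:
        raise ValueError("the orphans do not share a parent, so this is not a stratum")
    return parents.pop()
```

The common parent is never a member of the set, because an orphan by definition has no strict ancestor there.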

For a version $v$ and set of versions $V$, let $T_V[v] = \{w \in V : v \preceq w\}$ be the subtree of $V$ rooted at $v$. For $W \subset V$ a set of versions and $A$ an array, define the split of $A$ with respect to $W$ to be the set of all entries live in any version in $W$: $\lambda(A, W) = \{(k, x) \in A : (k, x) \text{ is live at some } v \in W\}$. For $W$ a stratum with orphans $w_i$ having common parent $p$, define

$$\mathrm{arr\_size}(A, W) := \mathrm{live}(A, p) + \mathrm{lead}(A, W) = \mathrm{live}(A, p) + \sum_i \mathrm{lead\_below}(A, w_i) \tag{2}$$

As in (1), $|\lambda(A, W)| \le \mathrm{arr\_size}(A, W)$, with the difference being those keys live in the parent version but overwritten in all orphans of $W$.
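A small Python sketch can make the split and the arr_size bound concrete. It assumes elements are `(key, version)` pairs with at most one write per key and version, a hypothetical `parent` map with the root mapped to `None`, and that the caller supplies the common parent `p` of the stratum's orphans. Function names are illustrative.

```python
from typing import Dict, List, Optional, Set, Tuple

Element = Tuple[str, str]  # (key, version at which the key was written)


def live_set(elements: List[Element], parent: Dict[str, Optional[str]],
             v: Optional[str]) -> Set[Element]:
    """Elements live at v: for each key, the write closest to v on v's root path."""
    chosen: Dict[str, Element] = {}
    while v is not None:
        for k, x in elements:
            if x == v and k not in chosen:
                chosen[k] = (k, x)
        v = parent[v]
    return set(chosen.values())


def split(elements: List[Element], parent: Dict[str, Optional[str]],
          versions: Set[str]) -> Set[Element]:
    """lambda(A, W): every element live at some version in W."""
    result: Set[Element] = set()
    for v in versions:
        result |= live_set(elements, parent, v)
    return result


def arr_size(elements: List[Element], parent: Dict[str, Optional[str]],
             versions: Set[str], p: str) -> int:
    """live(A, p) + lead(A, W) for a stratum W whose orphans have common parent p."""
    lead_w = sum(1 for _, x in elements if x in versions)
    return len(live_set(elements, parent, p)) + lead_w
```

For a root `v0` with keys `a` and `b`, and children `v1` (rewriting `a`) and `v2` (rewriting `a` and adding `c`), the split for $W = \{v1, v2\}$ has 4 elements while arr_size is 5. The difference of one is the root's copy of `a`, which is overwritten in both orphans.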

As a special case, when $W = T_V[v]$ for some version $v \in V$, define $\lambda(A, V, v) = \lambda(A, T_V[v])$, and as usual where $A$ and $V$ are clear, we write $T[v]$ and $\lambda(v)$ for the set of versions and corresponding split respectively. Note that $\mathrm{lead}(A, T_V[v]) = \mathrm{lead\_below}(A, v)$.

A version split of an array $(A, V)$ gives a set of strata $\{(A_i, V_i)\}$ such that $A = \cup_i A_i$, $V = \cup_i V_i$, and the $V_i$ are mutually disjoint.

3.3 Levels

As previously mentioned, an SDA keeps $(k, v, x)$ tuples in arrays arranged into levels. Each level $l \ge 0$ contains a set of arrays $(A_i, V_i)$ with disjoint sets of valid versions. We keep in memory a map from each version to the array in which it is valid, if such an array exists. We also keep track of the subset of those versions (which we call 'real') for which there is at least one lead key in the array where the version is valid.
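The per-level bookkeeping described here might look like the following Python sketch. It is an assumption-laden illustration: the class name `Level`, its methods, and the representation of an array as a list of `(k, v, x)` tuples are all hypothetical, and the set of real versions is recomputed on demand where a real implementation would presumably maintain it incrementally.

```python
from typing import Any, Dict, Iterable, List, Set, Tuple

Entry = Tuple[Any, str, Any]  # (key, version, value or forward pointer)


class Level:
    def __init__(self) -> None:
        self.arrays: List[Tuple[List[Entry], Set[str]]] = []  # (entries, valid versions)
        self.array_of: Dict[str, int] = {}  # version -> index of the array where it is valid

    def add_array(self, entries: List[Entry], valid_versions: Iterable[str]) -> int:
        """Register an array, enforcing that valid-version sets in a level stay disjoint."""
        valid = set(valid_versions)
        clash = valid.intersection(self.array_of)
        if clash:
            raise ValueError(f"versions already valid in this level: {sorted(clash)}")
        self.arrays.append((entries, valid))
        index = len(self.arrays) - 1
        for v in valid:
            self.array_of[v] = index
        return index

    def real_versions(self) -> Set[str]:
        """Versions with at least one lead key in the array where they are valid."""
        return {v for v, i in self.array_of.items()
                if any(version == v for _, version, _ in self.arrays[i][0])}
```

A lookup at version `v` consults `array_of` to pick the single array of the level that is relevant to `v`, or skips the level when `v` has no entry in the map.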
