Figure 3: Non-unique Key Support – The two sets (S_present, S_deleted) track the visibility of Δinsert and Δdelete records in the Delta Chain.
a new traversal from the root, using the current low key or high
key to reach the previous or the next sibling node.
3.3 Mapping Table Expansion
Since every thread accesses the Bw-Tree's Mapping Table multiple
times during traversal, it is important that the table does not become a bottleneck.
Storing the Mapping Table as an array of physical pointers indexed
by node ID is the fastest data structure. But a fixed-size
array makes it difficult to resize the Mapping Table dynamically as
the number of items in the tree grows and shrinks. This last point
is the problem that we address here.
The OpenBw-Tree pre-allocates a large virtual address space for
the Mapping Table without requesting backing physical pages. This
allows it to leverage the OS to lazily allocate physical memory
without using locks; this technique was previously used in the
KISS-Tree [18]. As the index grows, a thread may attempt to access
a Mapping Table page that has not yet been mapped to
physical memory, incurring a page fault. The OS then allocates
a new empty physical page for that virtual page. In practice, the
amount of virtual address space we reserve is estimated from the
total amount of physical memory and the lower bound on virtual
node size.
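To make the mechanism concrete, the following is a minimal sketch (not the OpenBw-Tree's actual code) of reserving a large virtual address range with POSIX mmap and MAP_NORESERVE so that the OS backs pages lazily on first touch; the struct and function names are illustrative.

```cpp
// Sketch only: reserve virtual address space for the Mapping Table and let the
// OS back it with physical pages lazily, on the first page fault per page.
#include <sys/mman.h>
#include <atomic>
#include <cstddef>
#include <cstdint>

struct MappingTable {
  void** slots = nullptr;            // node ID -> physical node pointer
  std::atomic<uint64_t> next_id{0};  // monotonically growing node ID counter

  explicit MappingTable(size_t capacity) {
    // MAP_NORESERVE reserves address space without committing physical memory.
    void* mem = mmap(nullptr, capacity * sizeof(void*),
                     PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (mem != MAP_FAILED) {
      slots = static_cast<void**>(mem);
    }
  }

  // No locks: claiming an entry is an atomic counter increment, and touching a
  // previously unused page simply faults so the OS installs a zeroed page.
  uint64_t AllocateNodeId(void* node) {
    uint64_t id = next_id.fetch_add(1, std::memory_order_relaxed);
    slots[id] = node;
    return id;
  }
};
```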
Although this approach makes it easy to increase the number of
entries in the Mapping Table as the index grows, it does not solve
the problem of shrinking the size of the Mapping Table. To the best
of our knowledge, there is no lock-free way of doing this. The only
way to shrink the Mapping Table is to block all worker threads and
rebuild the index.
4 COMPONENT OPTIMIZATION
A good-faith implementation of the data structure described in
the original Bw-Tree paper can be improved further. We present
our optimizations for the OpenBw-Tree's key components to im-
prove its performance and scalability. As we show in Section 5,
these optimizations increase the index's throughput by 1.1–2.5×
in multi-threaded environments.
4.1 Delta Record Pre-allocation
As described in Section 2.1, the Delta Chain in the Bw-Tree is a linked
list of delta records that are allocated individually on the heap. Traversing this
linked list is slow because a thread can incur a cache miss for
each pointer dereference. Additionally, excessive allocations of
small objects create contention in the allocator, which becomes a
scalability bottleneck as the number of cores increases.
Figure 4: Pre-allocated Chunk – This diagram depicts the logical view and physical view of an OpenBw-Tree node. Slots are acquired by threads using a CaS on the marker, which is part of the allocation metadata at the low-address end of the chunk.
To avoid these problems, the OpenBw-Tree pre-allocates the
delta records inside each base node. As shown in Fig. 4, it stores
the base node at the high-address end of the pre-allocated chunk
and stores the delta records from high to low addresses (right-to-left
in the figure). Each chain also maintains an allocation marker that
points to the last delta record or the base node. When a worker
thread claims a slot, it decrements this marker by the number of
bytes for the new delta record using an atomic subtraction. If the
pre-allocated area is full, then this triggers a node consolidation.
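As a rough illustration (not the actual implementation), the slot-claiming step can be expressed as an atomic subtraction on the marker; the chunk layout, metadata size, and names below are assumptions.

```cpp
// Sketch only: claim space for a new delta record inside a node's
// pre-allocated chunk by atomically moving the marker toward low addresses.
#include <atomic>
#include <cstddef>
#include <cstdint>

struct NodeChunk {
  static constexpr size_t kMetadataBytes = 64;  // assumed size of the allocation metadata
  char* chunk_begin;                 // low address: allocation metadata lives here
  std::atomic<uintptr_t> marker;     // address of the last-claimed slot (or the base node)

  // Returns the address of the claimed slot, or nullptr if the pre-allocated
  // area is exhausted, in which case the caller consolidates the node.
  void* ClaimSlot(size_t delta_size) {
    uintptr_t old_marker = marker.fetch_sub(delta_size, std::memory_order_relaxed);
    uintptr_t new_marker = old_marker - delta_size;
    if (new_marker < reinterpret_cast<uintptr_t>(chunk_begin) + kMetadataBytes) {
      return nullptr;  // chunk full: trigger node consolidation
    }
    return reinterpret_cast<void*>(new_marker);  // bytes [new_marker, old_marker) belong to the caller
  }
};
```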
This reverse-growth design is optimized for efficient Delta Chain
traversals. Reading delta records in new-to-old order is likely
(but not guaranteed) to access memory linearly from low to high ad-
dresses, which is ideal for modern CPUs with hardware memory
prefetching. But threads must traverse a node's Delta Chain by
following each delta record's pointer to find the next entry, rather
than just scanning from low to high addresses. This is because
the logical order of delta records may not match their physical
locations in memory. Slot allocation and Delta Chain appending
are not performed as a single atomic step, so multiple threads may
interleave them. For example, Fig. 4 shows that delta record Δ3
was logically added to the node before delta record Δ2, but Δ3
appears after Δ2 physically in memory.
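For illustration only (the record layout below is assumed, not the OpenBw-Tree's), traversal therefore follows each record's link in logical order rather than scanning the chunk by address:

```cpp
// Sketch only: walk the Delta Chain in logical new-to-old order by following
// each record's pointer, since physical placement in the chunk may differ.
#include <cstdint>

struct DeltaRecord {
  const DeltaRecord* next;  // next-older delta record; nullptr at the base node
  uint64_t key;             // payload simplified for the example
};

// Returns the newest delta record for `key`, or nullptr if none is found
// before reaching the base node.
inline const DeltaRecord* FindNewestForKey(const DeltaRecord* head, uint64_t key) {
  for (const DeltaRecord* rec = head; rec != nullptr; rec = rec->next) {
    if (rec->key == key) {
      return rec;
    }
  }
  return nullptr;
}
```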
4.2 Garbage Collection
The OpenBw-Tree adopts a garbage collection (GC) scheme that
is similar to the one used in Silo [34] and Deuteronomy [24]. The
epoch-based GC scheme of the original Bw-Tree [25] provides
safe memory reclamation: it prevents the index from reusing
memory while some thread may still be accessing it. With
this approach, the index maintains a list of global epoch objects
and appends new epoch objects to the end of this list at fixed
intervals (e.g., every 40 ms). Every thread must enter the epoch
intervals (e.g., every 40 ms). Every thread must enter the epoch
by enrolling itself in the current epoch object before it accesses
the index’s internal data structures (e.g., performing a key lookup).
When the thread completes its operation, it removes itself from the
epoch it has entered. Any objects that are marked for deletion by a
thread are added into the garbage list of the current epoch. Once all
threads exit an epoch, the index’s GC component can then reclaim
the objects in that epoch that are marked for deletion.
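A minimal sketch of this protocol, assuming illustrative names rather than the OpenBw-Tree's actual interfaces, looks roughly as follows:

```cpp
// Sketch only: epoch-based reclamation as described above. An operation
// enrolls in the current epoch, defers freeing unlinked nodes to that epoch's
// garbage list, and leaves the epoch when it finishes.
#include <atomic>
#include <vector>

struct Epoch {
  std::atomic<int> active_threads{0};  // threads currently enrolled in this epoch
  std::vector<void*> garbage;          // nodes marked for deletion during this epoch
                                       // (synchronization for concurrent appends omitted)
  Epoch* next{nullptr};                // next (newer) epoch in the global list
};

// Enroll in the current epoch before touching the index's internal structures.
inline Epoch* EnterEpoch(std::atomic<Epoch*>& current) {
  Epoch* e = current.load(std::memory_order_acquire);
  e->active_threads.fetch_add(1, std::memory_order_acq_rel);
  return e;
}

// Leave the epoch once the operation (e.g., a key lookup) completes.
inline void ExitEpoch(Epoch* e) {
  e->active_threads.fetch_sub(1, std::memory_order_acq_rel);
}
```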
Fig. 5a illustrates the centralized GC scheme with three active
epochs, three worker threads (t1, t2, t3), and a background GC
thread (tgc). In this diagram, t2 adds a new node to the garbage list
of epoch 103. At the same time, the GC thread tgc installs a new
epoch object to the epoch list. Since the counter inside epoch 101
has reached zero, tgc will reclaim all entries in its garbage list.