B. The Mapping Table
Our cache layer maintains a mapping table that maps logical pages to physical pages, with logical pages identified by a logical “page identifier” or PID. The mapping table translates a PID into either (1) a flash offset, the address of a page on stable storage, or (2) a memory pointer, the address of the page in memory. The mapping table is thus the central location for managing our “paginated” tree. While this indirection technique is not unique to our approach, we exploit it as the base for several innovations. We use PIDs in the Bw-tree to link the nodes of the tree. For instance, all downward “search” pointers between Bw-tree nodes are PIDs, not physical pointers.
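To make the indirection concrete, the sketch below shows one possible shape for a mapping table entry: a single atomic word per PID that encodes either a memory pointer or a flash offset. The class name, tag-bit encoding, and accessors are our own illustration, not the paper’s actual layout.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

using PID = uint32_t;

// Hypothetical mapping table: one atomic word per PID. Low bit set means the
// word holds a flash offset (page on stable storage); low bit clear means it
// holds a memory pointer (page in the main-memory cache).
struct MappingTable {
    std::vector<std::atomic<uintptr_t>> slots;

    explicit MappingTable(std::size_t capacity) : slots(capacity) {}

    bool IsOnFlash(PID pid) const {
        return (slots[pid].load(std::memory_order_acquire) & 1u) != 0;
    }

    void* MemoryPointer(PID pid) const {
        uintptr_t v = slots[pid].load(std::memory_order_acquire);
        return (v & 1u) ? nullptr : reinterpret_cast<void*>(v);
    }

    uint64_t FlashOffset(PID pid) const {
        uintptr_t v = slots[pid].load(std::memory_order_acquire);
        return (v & 1u) ? (v >> 1) : 0;
    }
};
```

Under this sketch, the downward pointers stored in index nodes are simply PIDs, and every traversal step resolves a PID to a current physical location through the table.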
The mapping table severs the connection between physical location and inter-node links. This enables the physical location of a Bw-tree node to change on every update and every time a page is written to stable storage, without requiring that the location change be propagated to the root of the tree (i.e., updating inter-node links). This “relocation” tolerance directly enables both delta updating of the node in main memory and log structuring of our stable storage, as described below.
Bw-tree nodes are thus logical and do not occupy fixed physical locations, either on stable storage or in main memory. Hence we are free to mold them to our needs. A “page” for a node thus suggests a policy, not a requirement, either in terms of how we represent nodes or how large they might become. We permit page size to be elastic, meaning that we can split when convenient, as size constraints do not impose a splitting requirement.
C. Delta Updating
Page state changes are done by creating a delta record (describing the change) and prepending it to an existing page state. We install the (new) memory address of the delta record into the page’s physical address slot in the mapping table using the atomic compare and swap (CAS) instruction.¹ If successful, the delta record address becomes the new physical address for the page. This strategy is used both for data changes (e.g., inserting a record) and management changes (e.g., a page being split or flushing a page to stable storage).
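The CAS-based install described above might look roughly like the following sketch; DeltaRecord, its fields, and InstallDelta are hypothetical names, and the retry loop a caller would run after a failed CAS is omitted.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical delta record: describes one change and points to the page's
// previous state (the rest of the delta chain or the consolidated base page).
struct DeltaRecord {
    const void*  change;  // e.g., the record being inserted
    DeltaRecord* next;    // prior page state
};

// Prepend a delta to the page's current state and publish it with a single
// CAS on the page's mapping-table slot. If the CAS fails, another thread
// updated the page first and the caller retries against the new state.
bool InstallDelta(std::atomic<uintptr_t>& slot, DeltaRecord* delta) {
    uintptr_t expected = slot.load(std::memory_order_acquire);
    delta->next = reinterpret_cast<DeltaRecord*>(expected);
    return slot.compare_exchange_strong(expected,
                                        reinterpret_cast<uintptr_t>(delta));
}
```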
Occasionally, we consolidate pages (create a new page that applies all delta changes) to both reduce memory footprint and improve search performance. A consolidated form of the page is also installed with a CAS, and the prior page structure is garbage collected (i.e., its memory reclaimed). A reference to the entire data structure for the page, including deltas, is placed on a pending list, all of which will be reclaimed when safe. We use a form of epoch mechanism to accomplish safe garbage collection [10].
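The pending list and its epoch-based reclamation might be sketched as below. The bookkeeping is deliberately simplified (a single global epoch and a mutex-protected list) relative to a real implementation, and all names are our own.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <mutex>
#include <vector>

// A retired page state (base page plus deltas), stamped with the global
// epoch at which it became unreachable from the mapping table.
struct Retired {
    void*    old_state;
    uint64_t epoch;
};

struct EpochManager {
    std::atomic<uint64_t> global_epoch{0};
    std::mutex            mu;
    std::vector<Retired>  pending;  // references awaiting safe reclamation

    // Called after a successful CAS has replaced the old page state with the
    // consolidated page; the old state may still be read by other threads.
    void Retire(void* old_state) {
        std::lock_guard<std::mutex> g(mu);
        pending.push_back({old_state, global_epoch.load()});
    }

    // Called once no thread can still be inside an epoch <= safe_epoch;
    // everything retired at or before that epoch may now be freed.
    void Reclaim(uint64_t safe_epoch, void (*free_state)(void*)) {
        std::lock_guard<std::mutex> g(mu);
        auto keep_end = std::partition(
            pending.begin(), pending.end(),
            [=](const Retired& r) { return r.epoch > safe_epoch; });
        for (auto it = keep_end; it != pending.end(); ++it)
            free_state(it->old_state);
        pending.erase(keep_end, pending.end());
    }
};
```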
Our delta updating simultaneously enables latch-free access in the Bw-tree and preserves processor data caches by avoiding update-in-place. The Bw-tree mapping table is the key enabler of these features via its ability to isolate the effects of node updates to that node alone.
¹The CAS is an atomic instruction that compares a given old value to the current value at memory location L; if the values are equal, the instruction writes a new value to L, replacing the current value.
D. Bw-tree Structure Modifications
Latches do not protect parts of our index tree during structure modifications (SMOs) such as page splits. This introduces a problem. For example, a page split introduces changes to more than one page: the original overly large page O, the new page N that will receive half of O’s contents, and the parent index page P that points down to O and that must subsequently point to both O and N. Thus, we cannot install a page split with a single CAS. A similar but harder problem arises when we merge nodes that have become too small.
To deal with this problem, we break an SMO into a sequence of atomic actions, each installable via a CAS. We use a B-link design [11] to make this easier. With a side link in each page, we can decompose a node split into two “half split” atomic actions. To ensure that no thread has to wait for an SMO to complete, a thread that encounters a partial SMO will complete it before proceeding with its own operation.
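The two half splits might be sketched as follows: a split delta installed on the node being split, then an index-entry delta installed on its parent, each with one CAS on the corresponding mapping-table slot. The delta layouts and helper names are ours, and the retry and cleanup a failed CAS would require are omitted.

```cpp
#include <atomic>
#include <cstdint>
#include <string>

using PID = uint32_t;

// Prepended to the node being split (O): records the separator key and a
// side link to the new sibling N, so a search for a key above the separator
// that lands on O follows the side link to N.
struct SplitDelta {
    std::string separator_key;
    PID         new_sibling;
    uintptr_t   next;  // previous state of O
};

// Prepended to the parent (P): adds a downward search pointer for N.
struct IndexEntryDelta {
    std::string separator_key;
    PID         child;
    uintptr_t   next;  // previous state of P
};

// Publish a delta on a mapping-table slot with one CAS (no retry shown).
template <typename Delta>
bool Install(std::atomic<uintptr_t>& slot, Delta* d) {
    uintptr_t expected = slot.load(std::memory_order_acquire);
    d->next = expected;
    return slot.compare_exchange_strong(expected,
                                        reinterpret_cast<uintptr_t>(d));
}

// Step 1 makes the split visible; step 2 updates the parent. A thread that
// observes the split delta on O without a matching index entry on P performs
// step 2 itself before continuing, so no thread blocks on an in-progress SMO.
void HalfSplit(std::atomic<uintptr_t>& slot_O, std::atomic<uintptr_t>& slot_P,
               const std::string& separator, PID new_sibling) {
    Install(slot_O, new SplitDelta{separator, new_sibling, 0});       // half 1
    Install(slot_P, new IndexEntryDelta{separator, new_sibling, 0});  // half 2
}
```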
E. Log Structured Store
Our LSS has the usual advantages of log structuring [12]. Pages are written sequentially in a large batch, greatly reducing the number of separate write I/Os required. However, because of garbage collection, log structuring normally incurs extra writes to relocate pages that persist in reclaimed storage areas of the log. Our LSS design greatly reduces this problem.
When flushing a page, the LSS need only flush the deltas that represent the changes made to the page since its previous flush. This dramatically reduces how much data is written during a flush, increasing the number of pages that fit in the flush buffer, and hence reducing the number of I/Os per page. There is a penalty on reads, however, as the discontinuous parts of a page must all be read to return the page to the main memory cache. This is where the very high random read performance of flash really contributes to our ARS performance.
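An incremental flush could be sketched roughly as below: only the deltas added since the page’s previous flush are copied into the sequential flush buffer, and the resulting offset becomes the page’s new flash address. FlushBuffer, Delta, and the flushed marker are our own simplifications.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A large in-memory buffer that is written to flash as one sequential I/O.
struct FlushBuffer {
    std::vector<uint8_t> bytes;
    uint64_t Append(const void* data, std::size_t n) {
        uint64_t off = bytes.size();
        const uint8_t* p = static_cast<const uint8_t*>(data);
        bytes.insert(bytes.end(), p, p + n);
        return off;
    }
};

// One delta on a page's in-memory chain.
struct Delta {
    const Delta* next;
    const void*  payload;
    std::size_t  size;
    bool         flushed;  // already written by an earlier flush of this page
};

// Copy only the not-yet-flushed prefix of the chain into the buffer and
// return the offset of this increment. A real increment would also carry a
// back-reference (not shown) to the page's earlier state on flash, so that a
// later read can stitch the discontinuous pieces back together.
uint64_t FlushPageIncrement(const Delta* chain, FlushBuffer& buf) {
    uint64_t offset = 0;
    for (const Delta* d = chain; d != nullptr && !d->flushed; d = d->next)
        offset = buf.Append(d->payload, d->size);
    return offset;
}
```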
The LSS cleans prior parts of flash representing the old parts of its log storage. Delta flushing reduces pressure on the LSS cleaner by reducing the amount of storage used per page. This reduces the “write amplification” that is a characteristic of log structuring. During cleaning, LSS makes pages and their deltas contiguous for improved access performance.
F. Managing Transactional Logs
As in a conventional database system, our ARS needs to ensure that updates persist across system crashes. We tag each update operation with a unique identifier that is typically the log sequence number (LSN) of the update on the transactional log (maintained elsewhere, e.g., in a transactional component). LSNs are managed so as to support recovery idempotence, i.e., ensuring that operations are executed at most once.
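One common way to obtain idempotence, sketched here under our own naming and assumptions, is for a page to remember the highest LSN it has already applied and to skip any replayed operation whose LSN is not larger.

```cpp
#include <cstdint>

// Hypothetical per-page bookkeeping for idempotent redo.
struct PageHeader {
    uint64_t max_applied_lsn = 0;  // highest LSN already reflected in the page
};

// Returns true if the operation should be applied now, false if it has
// already taken effect and must be skipped to keep recovery idempotent.
bool ShouldApply(PageHeader& page, uint64_t op_lsn) {
    if (op_lsn <= page.max_applied_lsn) return false;  // already applied
    page.max_applied_lsn = op_lsn;
    return true;
}
```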
As in conventional systems, pages are flushed lazily while honoring the write-ahead log (WAL) protocol. Unconventionally, however, we do not block page flushes to enforce WAL. Instead, because the recent updates are separate deltas from the rest of the page, we can remove “recent” updates (those not yet on the stable transactional log) from pages when flushing.
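That trimming step might look like the sketch below: the flush walks the delta chain from the newest delta and skips everything whose LSN lies beyond the transactional log’s durable point; the skipped deltas remain in memory and are flushed later, once the log has caught up. The names and fields are illustrative.

```cpp
#include <cstdint>

// One delta on a page's chain, tagged with the LSN of the update it records.
struct Delta {
    const Delta* next;
    uint64_t     lsn;
};

// Skip deltas that are "too recent" (lsn > durable_lsn) and return the first
// delta that is safe to write to the LSS without violating write-ahead
// logging; everything newer stays in memory for a later flush.
const Delta* FlushableSuffix(const Delta* chain, uint64_t durable_lsn) {
    while (chain != nullptr && chain->lsn > durable_lsn)
        chain = chain->next;
    return chain;
}
```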