during the PMwCAS. For example, each modification that installs a descriptor address (or target value) sets a dirty bit to signify that the value is volatile, and that a reader must flush the value and unset the bit before proceeding. This protocol guarantees that any value a reader observes, and hence any write that depends on it, will survive a power failure.
2.3.2 Execution
Internally, PMwCAS makes use of a descriptor that stores all the information needed to complete the operation. Figure 1 depicts an example descriptor for three target words. A descriptor contains, for each target word, (1) the target word's address, (2) the expected value to compare against, (3) the new value, (4) the dirty bit, and (5) a memory recycling policy. The policy field indicates whether the new and old values are pointers to memory objects and, if so, which objects are to be freed on the successful completion (or failure) of the operation. The descriptor also contains a status word tracking the operation's progress. The PMwCAS operation itself is lock-free; the descriptor contains enough information for any thread to help complete (or roll back) the operation. The operation consists of two phases.
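The layout just described can be sketched as the following structures. Field names, types, and the fixed four-word capacity are our assumptions for illustration; they are not the authors' exact definitions.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical per-word recycling policy (see the memory management
// discussion later in this section).
enum class RecyclePolicy : uint8_t { None, FreeOld, FreeNew };

enum class Status : uint32_t { Undecided, Succeeded, Failed };

// One entry per target word, mirroring items (1)-(5) above.
struct WordDescriptor {
  uint64_t*     address;    // (1) target word's address
  uint64_t      expected;   // (2) value to compare against
  uint64_t      new_value;  // (3) value to install
  bool          dirty;      // (4) dirty bit for the installed value
  RecyclePolicy policy;     // (5) memory recycling policy
};

struct PMwCASDescriptor {
  std::atomic<Status> status{Status::Undecided};  // operation progress
  uint32_t            count{0};                   // number of target words
  WordDescriptor      words[4];                   // assumed capacity
};
```

Because the descriptor holds the address, expected value, and new value for every target, any thread that reads it has everything needed to help complete or roll back the operation.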
Phase 1. This phase attempts to install a pointer to the descriptor in each target address using a double-compare single-swap (RDCSS) operation [11]. RDCSS applies a change to a target word only if the values of two words (including the one being changed) match their specified expected values. That is, RDCSS requires an additional "expected" value to compare against (but not modify) compared to a regular CAS. RDCSS is necessary to guard against subtle race conditions and maintain a linearizable sequence of operations on the same word. Specifically, we must guard against the installation of a descriptor for a completed PMwCAS (p1) that might inadvertently overwrite the result of another PMwCAS (p2), where p2 should occur after p1 (details in [37]).
A descriptor pointer in a word indicates that a PMwCAS is underway. Any thread that encounters a descriptor pointer helps complete the operation before proceeding with its own work, making PMwCAS cooperative (typical for lock-free operations). We use one high-order bit (in addition to the dirty bit) in the target word to signify whether it holds a descriptor pointer or a regular value. Descriptor pointer installation proceeds in target-address order to avoid deadlocks between two competing PMwCAS operations whose target sets overlap. Upon completing Phase 1, a thread persists the target words whose dirty bit is set. To ensure correct recovery, this must be done before updating the descriptor's status field and advancing to Phase 2.
We update status using CAS to either Succeeded or Failed (with the dirty bit set) depending on whether Phase 1 succeeded. We then persist the status field and clear its dirty bit. Persisting the status field "commits" the operation, ensuring its effects survive even across power failures.
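Putting the pieces together, Phase 1 might be sketched as below. This is a simplification under stated assumptions: a plain CAS stands in for RDCSS, `persist` is a no-op stub for a cache-line flush, and the bit positions and names are ours.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed bit layout: one high-order bit marks a descriptor pointer,
// another marks the word dirty.
constexpr uint64_t kDescBit  = 1ull << 63;
constexpr uint64_t kDirtyBit = 1ull << 62;

struct Target { std::atomic<uint64_t>* addr; uint64_t expected; };

inline void persist(const void*) {}  // CLWB stand-in

// Phase 1 sketch: install the (marked, dirty) descriptor pointer into
// every target in ascending address order, then flush the dirty words
// before the caller CASes the descriptor's status.
bool phase1_install(std::vector<Target>& targets, uint64_t desc_ptr) {
  std::sort(targets.begin(), targets.end(),
            [](const Target& a, const Target& b) { return a.addr < b.addr; });
  for (auto& t : targets) {
    uint64_t e = t.expected;
    if (!t.addr->compare_exchange_strong(e, desc_ptr | kDescBit | kDirtyBit))
      return false;  // conflict: the whole PMwCAS will fail
  }
  for (auto& t : targets) persist(t.addr);  // flush before updating status
  return true;
}
```

Sorting by address before installing is what prevents two overlapping operations from each holding one word while waiting on the other.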
Phase 2. If Phase 1 succeeds, the PMwCAS is guaranteed to succeed, even if a failure occurs: recovery will roll forward with the new values recorded in the descriptor. Phase 2 installs the final values (with the dirty bit set) in the target words, replacing the pointers to the descriptor. Since the final values are installed one by one, a crash in the middle of Phase 2 can leave some target fields with new values while others still point to the descriptor. Another thread might have observed some of the newly installed values and taken dependent actions (e.g., performing a PMwCAS of its own) based on the read. Rolling back in this case might cause data inconsistencies. Therefore, it is crucial to persist status before entering Phase 2. The recovery routine (covered next) can then rely on the status field of the descriptor to decide whether it should roll forward or backward. If the PMwCAS fails in Phase 1, Phase 2 becomes a rollback procedure that installs the old values (with the dirty bit set) in all target words containing a descriptor pointer.
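Phase 2 can be sketched as follows (same illustrative bit layout as before; field names are our assumptions). CAS, rather than a blind store, ensures that words another helping thread has already finished are left untouched.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint64_t kDescBit  = 1ull << 63;
constexpr uint64_t kDirtyBit = 1ull << 62;

struct Word { std::atomic<uint64_t>* addr; uint64_t expected, new_value; };

// Phase 2 sketch (status already persisted): replace each descriptor
// pointer with the new value on success, or roll back to the old value
// on failure. The installed values carry the dirty bit; readers flush
// and clear it lazily.
void phase2(std::vector<Word>& words, uint64_t desc_ptr, bool succeeded) {
  for (auto& w : words) {
    uint64_t installed = desc_ptr | kDescBit | kDirtyBit;
    uint64_t final_val = (succeeded ? w.new_value : w.expected) | kDirtyBit;
    w.addr->compare_exchange_strong(installed, final_val);  // skip if helped
  }
}
```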
Recovery. Due to the two-phase execution of PMwCAS, a target address may contain either a descriptor pointer or a normal value after a crash. Correct recovery requires that the descriptor be persisted before entering Phase 1. The dirty bit in the status field is cleared at this point because the caller has not yet started to install descriptor pointers in the target fields; any failure that occurs before this point does not affect data consistency upon recovery.
The PMwCAS descriptors are pooled in a memory location known to recovery. Crash recovery then proceeds by scanning the descriptor pool. If a descriptor's status field signifies success, the operation is rolled forward by applying the target values in the descriptor; if the status signifies failure, it is rolled back by applying the old values. Uninitialized descriptors are simply ignored. Recovery time is therefore determined by the number of PMwCAS operations in progress at the time of the crash; this is usually on the order of the number of threads, meaning very fast recovery. In fact, in an end-to-end recovery experiment for the BzTree, we measured an average recovery time of 145 µs when running a write-intensive workload with 48 threads.
Memory management. Since the PMwCAS is lock-free, descriptor memory lifetime is managed by the epoch-based recycling scheme described in Section 2.2. This ensures that no thread can possibly dereference a descriptor pointer after its memory is reclaimed and reused by another PMwCAS. If any of the 8-byte expected or target values are pointers to larger memory objects, these objects can also be managed by the same memory reclamation scheme. Each word in the descriptor is marked with a memory recycling policy that denotes whether and what memory to free on completion of the operation. For instance, if a PMwCAS succeeds, the user may want the memory behind the expected (old) value to be freed once the descriptor is deemed safe to recycle. Section 6 discusses the details of the interplay between PMwCAS and memory reclamation.
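A hypothetical helper illustrates how such a policy might be interpreted (the policy names and helper are ours, not the paper's): given the per-word policy and the operation's outcome, it selects which pointer, if any, is handed to the epoch-based reclaimer once the descriptor is safe to recycle.

```cpp
#include <cassert>

// Illustrative policies: free the old value if the PMwCAS succeeded,
// or free the newly allocated value if it failed.
enum class RecyclePolicy { None, FreeOldOnSuccess, FreeNewOnFailure };

// Returns the pointer to reclaim, or nullptr if nothing should be freed.
void* value_to_free(RecyclePolicy p, bool succeeded,
                    void* old_v, void* new_v) {
  switch (p) {
    case RecyclePolicy::FreeOldOnSuccess: return succeeded ? old_v : nullptr;
    case RecyclePolicy::FreeNewOnFailure: return succeeded ? nullptr : new_v;
    default:                              return nullptr;
  }
}
```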
3 BzTree Architecture and Design
3.1 Architecture
The BzTree is a high-performance main-memory B+Tree. Internal
nodes store search keys and pointers to child nodes. Leaf nodes
store keys and either record pointers or actual payload values. Keys
can be variable or fixed length. Our experiments assume leaf nodes
store 8-byte record pointers as payloads (common in main-memory
databases [6]), though we also discuss how to handle full variable-
length payloads. The BzTree is a range access method that supports
standard atomic key-value operations (insert, read, update, delete,
range scan). Typical of most access methods, it can be deployed as
a stand-alone key-value store, or embedded in a database engine to
support ACID transactions, where concurrency control takes place
outside of the access method as is common in most systems (e.g.,
within a lock manager) [12, 23].
Persistence Modes. A salient feature of the BzTree is that its design works for both volatile and persistent environments. In volatile mode, BzTree nodes are stored in volatile DRAM, and content is lost after a system failure. This mode is appropriate for use in existing main-memory system designs (e.g., Microsoft Hekaton [6]) that already contain recovery infrastructure to recover indexes. In durable mode, both internal and leaf nodes are stored in NVM. The BzTree guarantees that all updates are persistent and that the index can recover quickly to a correct state after a failure. For disaster recovery (media failure), the BzTree must rely on common solutions like database replication.
Metadata. Besides nodes, there are only two other 64-bit values used by the BzTree: