during the PMwCAS. For example, each modification that installs a descriptor address (or target value) sets a dirty bit to signify that the value is volatile, and that a reader must flush the value and unset the bit before proceeding. This protocol guarantees that any value a reader observes, and hence any write that depends on it, will survive a power failure.
2.3.2 Execution
Internally, PMwCAS makes use of a descriptor that stores all the information needed to complete the operation. Figure 1 depicts an example descriptor for three target words. A descriptor contains, for each target word, (1) the target word's address, (2) the expected value to compare against, (3) the new value, (4) the dirty bit, and (5) a memory recycling policy. The policy field indicates whether the new and old values are pointers to memory objects and, if so, which objects are to be freed on the successful completion (or failure) of the operation. The descriptor also contains a status word tracking the operation's progress. The PMwCAS operation itself is lock-free; the descriptor contains enough information for any thread to help complete (or roll back) the operation. The operation consists of two phases.
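The layout just described can be sketched as the following structures. Field names, types, and the fixed four-word capacity are our assumptions for illustration; they are not the authors' exact definitions.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical per-word recycling policy (see the memory management
// discussion later in this section).
enum class RecyclePolicy : uint8_t { None, FreeOld, FreeNew };

enum class Status : uint32_t { Undecided, Succeeded, Failed };

// One entry per target word, mirroring items (1)-(5) above.
struct WordDescriptor {
  uint64_t*     address;    // (1) target word's address
  uint64_t      expected;   // (2) value to compare against
  uint64_t      new_value;  // (3) value to install
  bool          dirty;      // (4) dirty bit for the installed value
  RecyclePolicy policy;     // (5) memory recycling policy
};

struct PMwCASDescriptor {
  std::atomic<Status> status{Status::Undecided};  // operation progress
  uint32_t            count{0};                   // number of target words
  WordDescriptor      words[4];                   // assumed capacity
};
```

Because the descriptor holds the address, expected value, and new value for every target, any thread that reads it has everything needed to help complete or roll back the operation.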
Phase 1. This phase attempts to install a pointer to the descriptor in each target address using a double-compare single-swap (RDCSS) operation [11]. RDCSS applies a change to a target word only if the values of two words (including the one being changed) match their specified expected values. That is, RDCSS requires an additional "expected" value to compare against (but not modify) compared to a regular CAS. RDCSS is necessary to guard against subtle race conditions and maintain a linearizable sequence of operations on the same word. Specifically, we must guard against the installation of a descriptor for a completed PMwCAS (p1) that might inadvertently overwrite the result of another PMwCAS (p2), where p2 should occur after p1 (details in [37]).
A descriptor pointer in a word indicates that a PMwCAS is underway. Any thread that encounters a descriptor pointer helps complete the operation before proceeding with its own work, making PMwCAS cooperative (typical for lock-free operations). We use one high-order bit (in addition to the dirty bit) in the target word to signify whether it holds a descriptor pointer or a regular value. Descriptor pointer installation proceeds in target-address order to avoid deadlocks between two competing PMwCAS operations whose target sets overlap. Upon completing Phase 1, a thread persists the target words whose dirty bit is set. To ensure correct recovery, this must be done before updating the descriptor's status field and advancing to Phase 2.
We update status using CAS to either Succeeded or Failed (with the dirty bit set) depending on whether Phase 1 succeeded. We then persist the status field and clear its dirty bit. Persisting the status field "commits" the operation, ensuring its effects survive even across power failures.
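Putting the pieces together, Phase 1 might be sketched as below. This is a simplification under stated assumptions: a plain CAS stands in for RDCSS, `persist` is a no-op stub for a cache-line flush, and the bit positions and names are ours.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

// Assumed bit layout: one high-order bit marks a descriptor pointer,
// another marks the word dirty.
constexpr uint64_t kDescBit  = 1ull << 63;
constexpr uint64_t kDirtyBit = 1ull << 62;

struct Target { std::atomic<uint64_t>* addr; uint64_t expected; };

inline void persist(const void*) {}  // CLWB stand-in

// Phase 1 sketch: install the (marked, dirty) descriptor pointer into
// every target in ascending address order, then flush the dirty words
// before the caller CASes the descriptor's status.
bool phase1_install(std::vector<Target>& targets, uint64_t desc_ptr) {
  std::sort(targets.begin(), targets.end(),
            [](const Target& a, const Target& b) { return a.addr < b.addr; });
  for (auto& t : targets) {
    uint64_t e = t.expected;
    if (!t.addr->compare_exchange_strong(e, desc_ptr | kDescBit | kDirtyBit))
      return false;  // conflict: the whole PMwCAS will fail
  }
  for (auto& t : targets) persist(t.addr);  // flush before updating status
  return true;
}
```

Sorting by address before installing is what prevents two overlapping operations from each holding one word while waiting on the other.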
Phase 2. If Phase 1 succeeds, the PMwCAS is guaranteed to succeed, even if a failure occurs: recovery will roll forward with the new values recorded in the descriptor. Phase 2 installs the final values (with the dirty bit set) in the target words, replacing the pointers to the descriptor. Since the final values are installed one by one, a crash in the middle of Phase 2 can leave some target fields with new values while others still point to the descriptor. Another thread might have observed some of the newly installed values and taken dependent actions (e.g., performing a PMwCAS of its own) based on the read. Rolling back in this case might cause data inconsistencies. Therefore, it is crucial to persist status before entering Phase 2. The recovery routine (covered next) can then rely on the status field of the descriptor to decide whether it should roll forward or backward. If the PMwCAS fails in Phase 1, Phase 2 becomes a rollback procedure that installs the old values (with the dirty bit set) in all target words containing a descriptor pointer.
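Phase 2 can be sketched as follows (same illustrative bit layout as before; field names are our assumptions). CAS, rather than a blind store, ensures that words another helping thread has already finished are left untouched.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <vector>

constexpr uint64_t kDescBit  = 1ull << 63;
constexpr uint64_t kDirtyBit = 1ull << 62;

struct Word { std::atomic<uint64_t>* addr; uint64_t expected, new_value; };

// Phase 2 sketch (status already persisted): replace each descriptor
// pointer with the new value on success, or roll back to the old value
// on failure. The installed values carry the dirty bit; readers flush
// and clear it lazily.
void phase2(std::vector<Word>& words, uint64_t desc_ptr, bool succeeded) {
  for (auto& w : words) {
    uint64_t installed = desc_ptr | kDescBit | kDirtyBit;
    uint64_t final_val = (succeeded ? w.new_value : w.expected) | kDirtyBit;
    w.addr->compare_exchange_strong(installed, final_val);  // skip if helped
  }
}
```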
Recovery. Due to the two-phase execution of PMwCAS, a target address may contain either a descriptor pointer or a normal value after a crash. Correct recovery requires that the descriptor be persisted before entering Phase 1. The dirty bit in the status field is cleared at this point because the caller has not yet started to install descriptor pointers in the target fields; any failure that occurs before this point does not affect data consistency upon recovery.
The PMwCAS descriptors are pooled in a memory location known to recovery. Crash recovery then proceeds by scanning the descriptor pool. If a descriptor's status field signifies success, the operation is rolled forward by applying the target values in the descriptor; if the status signifies failure, it is rolled back by applying the old values. Uninitialized descriptors are simply ignored. Recovery time is therefore determined by the number of PMwCAS operations in progress at the time of the crash; this is usually on the order of the number of threads, meaning very fast recovery. In fact, in an end-to-end recovery experiment for the BzTree, we measured an average recovery time of 145 µs when running a write-intensive workload with 48 threads.
Memory management. Since the PMwCAS is lock-free, descriptor memory lifetime is managed by the epoch-based recycling scheme described in Section 2.2. This ensures that no thread can possibly dereference a descriptor pointer after its memory is reclaimed and reused by another PMwCAS. If any of the 8-byte expected or target values are pointers to larger memory objects, these objects can also be managed by the same memory reclamation scheme. Each word in the descriptor is marked with a memory recycling policy that denotes whether and what memory to free on completion of the operation. For instance, if a PMwCAS succeeds, the user may want the memory behind the expected (old) value to be freed once the descriptor is deemed safe to recycle. Section 6 discusses the details of the interplay between PMwCAS and memory reclamation.
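A hypothetical helper illustrates how such a policy might be interpreted (the policy names and helper are ours, not the paper's): given the per-word policy and the operation's outcome, it selects which pointer, if any, is handed to the epoch-based reclaimer once the descriptor is safe to recycle.

```cpp
#include <cassert>

// Illustrative policies: free the old value if the PMwCAS succeeded,
// or free the newly allocated value if it failed.
enum class RecyclePolicy { None, FreeOldOnSuccess, FreeNewOnFailure };

// Returns the pointer to reclaim, or nullptr if nothing should be freed.
void* value_to_free(RecyclePolicy p, bool succeeded,
                    void* old_v, void* new_v) {
  switch (p) {
    case RecyclePolicy::FreeOldOnSuccess: return succeeded ? old_v : nullptr;
    case RecyclePolicy::FreeNewOnFailure: return succeeded ? nullptr : new_v;
    default:                              return nullptr;
  }
}
```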
3 BzTree Architecture and Design
3.1 Architecture
The BzTree is a high-performance main-memory B+Tree. Internal
nodes store search keys and pointers to child nodes. Leaf nodes
store keys and either record pointers or actual payload values. Keys
can be variable or fixed length. Our experiments assume leaf nodes
store 8-byte record pointers as payloads (common in main-memory
databases [6]), though we also discuss how to handle full variable-
length payloads. The BzTree is a range access method that supports
standard atomic key-value operations (insert, read, update, delete,
range scan). Typical of most access methods, it can be deployed as
a stand-alone key-value store, or embedded in a database engine to
support ACID transactions, where concurrency control takes place
outside of the access method as is common in most systems (e.g.,
within a lock manager) [12, 23].
Persistence Modes. A salient feature of the BzTree is that its design works for both volatile and persistent environments. In volatile mode, BzTree nodes are stored in volatile DRAM, and content is lost after a system failure. This mode is appropriate for use in existing main-memory system designs (e.g., Microsoft Hekaton [6]) that already contain recovery infrastructure to recover indexes. In durable mode, both internal and leaf nodes are stored in NVM. The BzTree guarantees that all updates are persistent and that the index can recover quickly to a correct state after a failure. For disaster recovery (media failure), the BzTree must rely on common solutions like database replication.
Metadata. Besides nodes, there are only two other 64-bit values used by the BzTree: