B. The Mapping Table
Our cache layer maintains a mapping table that maps logical pages to physical pages, with logical pages identified by a logical “page identifier” or PID. The mapping table translates a PID into either (1) a flash offset, the address of a page on stable storage, or (2) a memory pointer, the address of the page in memory. The mapping table is thus the central location for managing our “paginated” tree. While this indirection technique is not unique to our approach, we exploit it as the base for several innovations. We use PIDs in the Bw-tree to link the nodes of the tree. For instance, all downward “search” pointers between Bw-tree nodes are PIDs, not physical pointers.
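To make the indirection concrete, the sketch below shows one possible shape for a mapping table entry: a single atomic word per PID that encodes either a memory pointer or a flash offset. The class name, tag-bit encoding, and accessors are our own illustration, not the paper’s actual layout.

```cpp
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

using PID = uint32_t;

// Hypothetical mapping table: one atomic word per PID. Low bit set means the
// word holds a flash offset (page on stable storage); low bit clear means it
// holds a memory pointer (page in the main-memory cache).
struct MappingTable {
    std::vector<std::atomic<uintptr_t>> slots;

    explicit MappingTable(std::size_t capacity) : slots(capacity) {}

    bool IsOnFlash(PID pid) const {
        return (slots[pid].load(std::memory_order_acquire) & 1u) != 0;
    }

    void* MemoryPointer(PID pid) const {
        uintptr_t v = slots[pid].load(std::memory_order_acquire);
        return (v & 1u) ? nullptr : reinterpret_cast<void*>(v);
    }

    uint64_t FlashOffset(PID pid) const {
        uintptr_t v = slots[pid].load(std::memory_order_acquire);
        return (v & 1u) ? (v >> 1) : 0;
    }
};
```

Under this sketch, the downward pointers stored in index nodes are simply PIDs, and every traversal step resolves a PID to a current physical location through the table.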
The mapping table severs the connection between physical location and inter-node links. This enables the physical location of a Bw-tree node to change on every update and every time a page is written to stable storage, without requiring that the location change be propagated to the root of the tree (i.e., updating inter-node links). This “relocation” tolerance directly enables both delta updating of the node in main memory and log structuring of our stable storage, as described below.
Bw-tree nodes are thus logical and do not occupy fixed physical locations, either on stable storage or in main memory. Hence we are free to mold them to our needs. A “page” for a node thus suggests a policy, not a requirement, either in terms of how we represent nodes or how large they might become. We permit page size to be elastic, meaning that we can split when convenient, as size constraints do not impose a splitting requirement.
C. Delta Updating
Page state changes are done by creating a delta record (describing the change) and prepending it to an existing page state. We install the (new) memory address of the delta record into the page’s physical address slot in the mapping table using the atomic compare and swap (CAS) instruction.¹ If successful, the delta record address becomes the new physical address for the page. This strategy is used both for data changes (e.g., inserting a record) and management changes (e.g., a page being split or flushing a page to stable storage).
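The CAS-based install described above might look roughly like the following sketch; DeltaRecord, its fields, and InstallDelta are hypothetical names, and the retry loop a caller would run after a failed CAS is omitted.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical delta record: describes one change and points to the page's
// previous state (the rest of the delta chain or the consolidated base page).
struct DeltaRecord {
    const void*  change;  // e.g., the record being inserted
    DeltaRecord* next;    // prior page state
};

// Prepend a delta to the page's current state and publish it with a single
// CAS on the page's mapping-table slot. If the CAS fails, another thread
// updated the page first and the caller retries against the new state.
bool InstallDelta(std::atomic<uintptr_t>& slot, DeltaRecord* delta) {
    uintptr_t expected = slot.load(std::memory_order_acquire);
    delta->next = reinterpret_cast<DeltaRecord*>(expected);
    return slot.compare_exchange_strong(expected,
                                        reinterpret_cast<uintptr_t>(delta));
}
```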
Occasionally, we consolidate pages (create a new page that applies all delta changes) to both reduce memory footprint and improve search performance. A consolidated form of the page is also installed with a CAS, and the prior page structure is garbage collected (i.e., its memory reclaimed). A reference to the entire data structure for the page, including deltas, is placed on a pending list, all of which will be reclaimed when safe. We use a form of epoch mechanism to accomplish safe garbage collection [10].
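The pending list and its epoch-based reclamation might be sketched as below. The bookkeeping is deliberately simplified (a single global epoch and a mutex-protected list) relative to a real implementation, and all names are our own.

```cpp
#include <algorithm>
#include <atomic>
#include <cstdint>
#include <mutex>
#include <vector>

// A retired page state (base page plus deltas), stamped with the global
// epoch at which it became unreachable from the mapping table.
struct Retired {
    void*    old_state;
    uint64_t epoch;
};

struct EpochManager {
    std::atomic<uint64_t> global_epoch{0};
    std::mutex            mu;
    std::vector<Retired>  pending;  // references awaiting safe reclamation

    // Called after a successful CAS has replaced the old page state with the
    // consolidated page; the old state may still be read by other threads.
    void Retire(void* old_state) {
        std::lock_guard<std::mutex> g(mu);
        pending.push_back({old_state, global_epoch.load()});
    }

    // Called once no thread can still be inside an epoch <= safe_epoch;
    // everything retired at or before that epoch may now be freed.
    void Reclaim(uint64_t safe_epoch, void (*free_state)(void*)) {
        std::lock_guard<std::mutex> g(mu);
        auto keep_end = std::partition(
            pending.begin(), pending.end(),
            [=](const Retired& r) { return r.epoch > safe_epoch; });
        for (auto it = keep_end; it != pending.end(); ++it)
            free_state(it->old_state);
        pending.erase(keep_end, pending.end());
    }
};
```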
Our delta updating simultaneously enables latch-free access in the Bw-tree and preserves processor data caches by avoiding update-in-place. The Bw-tree mapping table is the key enabler of these features via its ability to isolate the effects of node updates to that node alone.
¹The CAS is an atomic instruction that compares a given old value to the current value at memory location L; if the values are equal, the instruction writes a new value to L, replacing the current value.
D. Bw-tree Structure Modifications
Latches do not protect parts of our index tree during structure modifications (SMOs) such as page splits. This introduces a problem. For example, a page split introduces changes to more than one page: the original overly large page O, the new page N that will receive half of O’s contents, and the parent index page P that points down to O and that must subsequently point to both O and N. Thus, we cannot install a page split with a single CAS. A similar but harder problem arises when we merge nodes that have become too small.
To deal with this problem, we break an SMO into a sequence of atomic actions, each installable via a CAS. We use a B-link design [11] to make this easier. With a side link in each page, we can decompose a node split into two “half split” atomic actions. To ensure that no thread has to wait for an SMO to complete, a thread that encounters a partial SMO will complete it before proceeding with its own operation.
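The two half splits might be sketched as follows: a split delta installed on the node being split, then an index-entry delta installed on its parent, each with one CAS on the corresponding mapping-table slot. The delta layouts and helper names are ours, and the retry and cleanup a failed CAS would require are omitted.

```cpp
#include <atomic>
#include <cstdint>
#include <string>

using PID = uint32_t;

// Prepended to the node being split (O): records the separator key and a
// side link to the new sibling N, so a search for a key above the separator
// that lands on O follows the side link to N.
struct SplitDelta {
    std::string separator_key;
    PID         new_sibling;
    uintptr_t   next;  // previous state of O
};

// Prepended to the parent (P): adds a downward search pointer for N.
struct IndexEntryDelta {
    std::string separator_key;
    PID         child;
    uintptr_t   next;  // previous state of P
};

// Publish a delta on a mapping-table slot with one CAS (no retry shown).
template <typename Delta>
bool Install(std::atomic<uintptr_t>& slot, Delta* d) {
    uintptr_t expected = slot.load(std::memory_order_acquire);
    d->next = expected;
    return slot.compare_exchange_strong(expected,
                                        reinterpret_cast<uintptr_t>(d));
}

// Step 1 makes the split visible; step 2 updates the parent. A thread that
// observes the split delta on O without a matching index entry on P performs
// step 2 itself before continuing, so no thread blocks on an in-progress SMO.
void HalfSplit(std::atomic<uintptr_t>& slot_O, std::atomic<uintptr_t>& slot_P,
               const std::string& separator, PID new_sibling) {
    Install(slot_O, new SplitDelta{separator, new_sibling, 0});       // half 1
    Install(slot_P, new IndexEntryDelta{separator, new_sibling, 0});  // half 2
}
```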
E. Log Structured Store
Our LSS has the usual advantages of log structuring [12]. Pages are written sequentially in a large batch, greatly reducing the number of separate write I/Os required. However, because of garbage collection, log structuring normally incurs extra writes to relocate pages that persist in reclaimed storage areas of the log. Our LSS design greatly reduces this problem.
When flushing a page, the LSS need only flush the deltas that represent the changes made to the page since its previous flush. This dramatically reduces how much data is written during a flush, increasing the number of pages that fit in the flush buffer, and hence reducing the number of I/Os per page. There is a penalty on reads, however, as the discontinuous parts of a page must all be read to return the page to the main memory cache. This is where the very high random read performance of flash really contributes to our ARS performance.
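An incremental flush could be sketched roughly as below: only the deltas added since the page’s previous flush are copied into the sequential flush buffer, and the resulting offset becomes the page’s new flash address. FlushBuffer, Delta, and the flushed marker are our own simplifications.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// A large in-memory buffer that is written to flash as one sequential I/O.
struct FlushBuffer {
    std::vector<uint8_t> bytes;
    uint64_t Append(const void* data, std::size_t n) {
        uint64_t off = bytes.size();
        const uint8_t* p = static_cast<const uint8_t*>(data);
        bytes.insert(bytes.end(), p, p + n);
        return off;
    }
};

// One delta on a page's in-memory chain.
struct Delta {
    const Delta* next;
    const void*  payload;
    std::size_t  size;
    bool         flushed;  // already written by an earlier flush of this page
};

// Copy only the not-yet-flushed prefix of the chain into the buffer and
// return the offset of this increment. A real increment would also carry a
// back-reference (not shown) to the page's earlier state on flash, so that a
// later read can stitch the discontinuous pieces back together.
uint64_t FlushPageIncrement(const Delta* chain, FlushBuffer& buf) {
    uint64_t offset = 0;
    for (const Delta* d = chain; d != nullptr && !d->flushed; d = d->next)
        offset = buf.Append(d->payload, d->size);
    return offset;
}
```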
The LSS cleans prior parts of flash representing the old parts of its log storage. Delta flushing reduces pressure on the LSS cleaner by reducing the amount of storage used per page. This reduces the “write amplification” that is a characteristic of log structuring. During cleaning, LSS makes pages and their deltas contiguous for improved access performance.
F. Managing Transactional Logs
As in a conventional database system, our ARS needs to ensure that updates persist across system crashes. We tag each update operation with a unique identifier that is typically the log sequence number (LSN) of the update on the transactional log (maintained elsewhere, e.g., in a transactional component). LSNs are managed so as to support recovery idempotence, i.e., ensuring that operations are executed at most once.
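One common way to obtain idempotence, sketched here under our own naming and assumptions, is for a page to remember the highest LSN it has already applied and to skip any replayed operation whose LSN is not larger.

```cpp
#include <cstdint>

// Hypothetical per-page bookkeeping for idempotent redo.
struct PageHeader {
    uint64_t max_applied_lsn = 0;  // highest LSN already reflected in the page
};

// Returns true if the operation should be applied now, false if it has
// already taken effect and must be skipped to keep recovery idempotent.
bool ShouldApply(PageHeader& page, uint64_t op_lsn) {
    if (op_lsn <= page.max_applied_lsn) return false;  // already applied
    page.max_applied_lsn = op_lsn;
    return true;
}
```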
As in conventional systems, pages are flushed lazily while honoring the write-ahead log (WAL) protocol. Unconventionally, however, we do not block page flushes to enforce WAL. Instead, because the recent updates are separate deltas from the rest of the page, we can remove “recent” updates (those not yet on the stable transactional log) from pages when flushing.
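That trimming step might look like the sketch below: the flush walks the delta chain from the newest delta and skips everything whose LSN lies beyond the transactional log’s durable point; the skipped deltas remain in memory and are flushed later, once the log has caught up. The names and fields are illustrative.

```cpp
#include <cstdint>

// One delta on a page's chain, tagged with the LSN of the update it records.
struct Delta {
    const Delta* next;
    uint64_t     lsn;
};

// Skip deltas that are "too recent" (lsn > durable_lsn) and return the first
// delta that is safe to write to the LSS without violating write-ahead
// logging; everything newer stays in memory for a later flush.
const Delta* FlushableSuffix(const Delta* chain, uint64_t durable_lsn) {
    while (chain != nullptr && chain->lsn > durable_lsn)
        chain = chain->next;
    return chain;
}
```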