-7-
block and the filling block on each level contain an integral number of page-sized nodes of the C
1
tree, which simply happen to be buffer resident. (During the merge step that restructures
individual nodes, other types of concurrent access to the entries on those nodes are blocked.)
Whenever a complete flush of all buffered nodes to disk is required, all buffered information at
each level must be written to new positions on disk (with positions reflected in superior di-
rectory information, and a sequential log entry for recovery purposes). At a later point, when
the filling block in buffer on some level of the C
1
tree fills and must be flushed again, it goes to
a new position. Old information that might still be needed during recovery is never overwritten
on disk, only invalidated as new writes succeed with more up-to-date information. A somewhat
more detailed explanation of the rolling merge process is presented in Section 4, where con-
currency and recovery designs are considered.
It is an important efficiency consideration of the LSM-tree that when the rolling merge process
on a particular level of the C
1
tree passes through nodes at a relatively high rate, all reads and
writes are in multi-page blocks. By eliminating seek time and rotational latency, we expect to
gain a large advantage over random page I/O involved in normal B-tree entry insertion. (This
advantage is analyzed below, in Section 3.2.) The idea of always writing multi-page blocks to
new locations was inspired by the Log-Structured File System devised by Rosenblum and
Ousterhout [23], from which the Log-Structured Merge-tree takes its name. Note that the
continuous use of new disk space for fresh multi-page block writes implies that the area of disk
being written will wrap, and old discarded blocks must be reused. This bookkeeping can be done
in a memory table; old multi-page blocks are invalidated and reused as single units, and re-
covery is guaranteed by the checkpoint. In the Log-Structured File System, the reuse of old
blocks involves significant I/O because blocks are typically only partially freed up, so reuse
requires a block read and block write. In the LSM-Tree, blocks are totally freed up on the
trailing edge of the rolling merge, so no extra I/O is involved.
2.2 Finds in the LSM-tree Index
When an exact-match find or range find requiring immediate response is performed through the
LSM-tree index, first the C
0
tree and then the C
1
tree is searched for the value or values de-
sired. This may imply a slight CPU overhead compared to the B-tree case, since two directories
may need to be searched. In LSM-trees with more than two components, there may also be an
I/O overhead. To anticipate Chapter 3 somewhat, we define a multi component LSM-tree as
having components C
0
, C
1
, C
2
, . . ., C
K-1
and C
K
, indexed tree structures of increasing size,
where C
0
is memory resident and all other components are disk resident. There are asyn-
chronous rolling merge processes in train between all component pairs (C
i-1
, C
i
) that move
entries out from the smaller to the larger component each time the smaller component, C
i-1
,
exceeds its threshold size. As a rule, in order to guarantee that all entries in the LSM-tree have
been examined, it is necessary for an exact-match find or range find to access each component C
i
through its index structure. However, there are a number of possible optimizations where this
search can be limited to an initial subset of the components.
First, where unique index values are guaranteed by the logic of generation, as when time-
stamps are guaranteed to be distinct, a matching indexed find is complete if it locates the desired
value in an early C
i
component. As another example, we could limit our search when the find
criterion uses recent timestamp values so that the entries sought could not yet have migrated
out to the largest components. As the merge cursor circulates through the (C
i
, C
i+1
) pairs, we
will often have reason to retain entries in C
i
that have been inserted in the recent past (in the
last τ
i
seconds), allowing only the older entries to go out to C
i+1
. In cases where the most
frequent find references are to recently inserted values, many finds can be completed in the C
0
tree, and so the C
0
tree fulfills a valuable memory buffering function. This point was made also