The RAMCloud Storage System 1:11
Entry type         Description
Object             Describes a single object, including table identifier, key, value, version number, and coarse-grain timestamp for last modification (for cleaning). §4.2
Tombstone          Indicates that an object has been deleted or overwritten. Contains the table identifier, key, and version number of the deleted object, as well as the identifier of the segment containing the object. §4.4
Segment header     This is the first entry in each segment; it contains an identifier for the log’s master and the identifier of this segment within the master’s log. §4.5
Log digest         Contains the identifiers of all the segments that were part of the log when this entry was written. §4.3, §4.5, §7.4
Safe version       Contains a version number larger than the version of any object ever managed by this master; ensures monotonicity of version numbers across deletes when a master’s tablets are transferred to other masters during crash recovery.
Tablet statistics  Compressed representation of the number of log entries and total log bytes consumed by each tablet stored on this master. §7.4
Fig. 5. The different types of entries stored in the RAMCloud log. Each entry also contains a checksum used
to detect corruption. Log digests, safe versions, and tablet statistics are present only in segments containing
newly written data, and they follow immediately after the segment header; they are not present in other
segments, such as those generated by the cleaner or during recovery. The section numbers indicate where
each entry type is discussed.
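To make the entry layout concrete, the following is a minimal sketch of two of the entry types from Fig. 5 together with a per-entry checksum. The field widths, the type tags, the byte layout, and the use of CRC32 from Python's zlib module are all assumptions made for illustration; the text does not specify RAMCloud's actual on-log format or checksum algorithm.

```python
import struct
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class ObjectEntry:
    table_id: int
    key: bytes
    value: bytes
    version: int
    timestamp: int  # coarse-grain last-modification time (used for cleaning)


@dataclass(frozen=True)
class Tombstone:
    table_id: int
    key: bytes
    version: int     # version of the deleted or overwritten object
    segment_id: int  # segment that contains the deleted object


# Hypothetical header: type tag, three 64-bit fields, key length.
_HEADER = "<BQQQI"


def serialize(entry) -> bytes:
    """Encode an entry and append a checksum (illustrative layout only)."""
    if isinstance(entry, ObjectEntry):
        body = struct.pack(_HEADER, 1, entry.table_id, entry.version,
                           entry.timestamp, len(entry.key))
        body += entry.key + entry.value
    elif isinstance(entry, Tombstone):
        body = struct.pack(_HEADER, 2, entry.table_id, entry.version,
                           entry.segment_id, len(entry.key))
        body += entry.key
    else:
        raise TypeError(f"unsupported entry type: {type(entry).__name__}")
    return body + struct.pack("<I", zlib.crc32(body))


def verify(blob: bytes) -> bool:
    """Return True if the trailing checksum matches the entry body."""
    body = blob[:-4]
    (stored,) = struct.unpack("<I", blob[-4:])
    return zlib.crc32(body) == stored
```

A reader of the log would call verify on each entry before trusting its contents, and treat a mismatch as corruption.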
The segment size was chosen to make disk I/O efficient: with an 8 MB segment
size, disk latency accounts for only about 10% of the time to read or write a full segment.
Flash memory could support smaller segments efficiently, but RAMCloud requires
each object to be stored in a single segment, so the segment size must be at
least as large as the largest possible object (1 MB).
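The 10% figure can be checked with a rough calculation. The sketch below assumes a disk with about 8 ms of positioning latency and about 100 MB/s of sequential bandwidth; both numbers are illustrative assumptions, not measurements from the text.

```python
MB = 1024 * 1024

# Assumed disk characteristics (hypothetical, for illustration only).
ASSUMED_LATENCY_S = 0.008           # seek plus rotational delay
ASSUMED_BANDWIDTH_BPS = 100 * MB    # sequential transfer rate


def latency_fraction(segment_bytes, latency_s, bandwidth_bps):
    """Fraction of total I/O time spent on latency rather than transfer."""
    transfer_s = segment_bytes / bandwidth_bps
    return latency_s / (latency_s + transfer_s)


for size_mb in (1, 8, 64):
    frac = latency_fraction(size_mb * MB, ASSUMED_LATENCY_S,
                            ASSUMED_BANDWIDTH_BPS)
    print(f"{size_mb:>2} MB segment: latency is {frac:.1%} of I/O time")
```

Under these assumptions an 8 MB segment spends roughly a tenth of its I/O time on latency, while a 1 MB segment (the smallest size that can hold the largest object) would spend a far larger share.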
4.2. Durable writes
When a master receives a write request from a client, it appends a new entry for the
object to its head log segment, creates a hash table entry for the object (or updates an
existing entry), and then replicates the log entry synchronously, in parallel, to the backups
storing the head segment. During replication, each backup appends the entry to a
replica of the head segment buffered in its memory and responds to the master without
waiting for I/O to secondary storage. When the master has received replies from
all the backups, it responds to the client. The backups write the buffered segments
to secondary storage asynchronously. The buffer space is freed once the segment has
been closed (meaning a new head segment has been chosen and this segment is now
immutable) and the buffer contents have been written to secondary storage.
This approach has two attractive properties: writes complete without waiting for
I/O to secondary storage, and backups use secondary storage bandwidth efficiently by
performing I/O in large blocks, even if objects are small.
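The write path described above can be sketched as a small in-memory simulation. The class and method names (Master, Backup, write, roll_head, close_and_flush) are hypothetical, and two simplifications are assumed: secondary storage is modeled as a dictionary, and the flush happens synchronously when the segment is closed, whereas the text says backups write buffered segments asynchronously.

```python
from concurrent.futures import ThreadPoolExecutor


class Backup:
    def __init__(self):
        self.buffers = {}  # segment_id -> entries buffered in memory
        self.storage = {}  # segment_id -> entries on "secondary storage"

    def append(self, segment_id, entry):
        # Buffer the entry in memory and acknowledge without any storage I/O.
        self.buffers.setdefault(segment_id, []).append(entry)
        return True

    def close_and_flush(self, segment_id):
        # Write the whole segment in one large block, then free the buffer.
        self.storage[segment_id] = self.buffers.pop(segment_id, [])


class Master:
    def __init__(self, backups):
        self.backups = backups
        self.head_id = 0
        self.log = {0: []}    # segment_id -> entries in the master's memory
        self.hash_table = {}  # (table_id, key) -> (segment_id, index)

    def write(self, table_id, key, value):
        entry = (table_id, key, value)
        head = self.log[self.head_id]
        head.append(entry)                                  # 1. append to head
        self.hash_table[(table_id, key)] = (self.head_id,   # 2. create/update
                                            len(head) - 1)  #    hash entry
        with ThreadPoolExecutor(max_workers=len(self.backups)) as pool:
            acks = list(pool.map(                           # 3. replicate in
                lambda b: b.append(self.head_id, entry),    #    parallel
                self.backups))
        if not all(acks):
            raise RuntimeError("replication failed")
        return "OK"                                         # 4. reply to client

    def roll_head(self):
        # Choose a new head segment; the old one is now immutable.
        closed = self.head_id
        self.head_id += 1
        self.log[self.head_id] = []
        for backup in self.backups:
            backup.close_and_flush(closed)
```

In this sketch a call to write returns as soon as every backup has buffered the entry, and nothing reaches the simulated secondary storage until roll_head closes the segment.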
However, the buffers create potential durability problems. RAMCloud promises
clients that objects are durable at the time a write returns. In order to honor this
promise, the data buffered in backups’ main memories must survive power failures;
otherwise a datacenter power failure could destroy all copies of a newly written object.
RAMCloud currently assumes that servers can continue operating for a short period
after an impending power failure is detected, so that buffered data can be flushed to
secondary storage. The amount of data buffered on each backup is small (not more
than a few tens of megabytes), so only a few hundred milliseconds are needed to write
it safely to secondary storage. An alternative approach is for backups to store buffered
ACM Transactions on Computer Systems, Vol. ??, No. ??, Article 1, Publication date: March ??.