were co-mingled in the same physical log file. One ap-
proach would be for each new tablet server to read this
full commit log file and apply just the entries needed for
the tablets it needs to recover. However, under such a
scheme, if 100 machines were each assigned a single
tablet from a failed tablet server, then the log file would
be read 100 times (once by each server).
We avoid duplicating log reads by first sorting the commit log entries in order of the keys ⟨table, row name, log sequence number⟩. In the
sorted output, all mutations for a particular tablet are
contiguous and can therefore be read efficiently with one
disk seek followed by a sequential read. To parallelize
the sorting, we partition the log file into 64 MB seg-
ments, and sort each segment in parallel on different
tablet servers. This sorting process is coordinated by the
master and is initiated when a tablet server indicates that
it needs to recover mutations from some commit log file.
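The per-segment sort can be illustrated with the following sketch (in Go, with hypothetical types; this is not the actual Bigtable code): ordering entries by ⟨table, row name, log sequence number⟩ makes each tablet's mutations a contiguous run in the sorted output.

    package sketch

    import "sort"

    // LogEntry is a hypothetical stand-in for one commit log record.
    type LogEntry struct {
        Table string
        Row   string
        Seq   uint64
        Data  []byte
    }

    // sortSegment sorts one 64 MB log segment in place by
    // (table, row name, log sequence number).
    func sortSegment(entries []LogEntry) {
        sort.Slice(entries, func(i, j int) bool {
            a, b := entries[i], entries[j]
            if a.Table != b.Table {
                return a.Table < b.Table
            }
            if a.Row != b.Row {
                return a.Row < b.Row
            }
            return a.Seq < b.Seq
        })
    }

Once each segment is sorted, a recovering tablet server only has to read the contiguous runs belonging to its newly assigned tablets.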
Writing commit logs to GFS sometimes causes perfor-
mance hiccups for a variety of reasons (e.g., a GFS server
machine involved in the write crashes, or the network
paths traversed to reach the particular set of three GFS
servers are suffering network congestion, or are heavily
loaded). To protect mutations from GFS latency spikes,
each tablet server actually has two log writing threads,
each writing to its own log file; only one of these two
threads is actively in use at a time. If writes to the ac-
tive log file are performing poorly, the log file writing is
switched to the other thread, and mutations that are in
the commit log queue are written by the newly active log
writing thread. Log entries contain sequence numbers
to allow the recovery process to elide duplicated entries
resulting from this log switching process.
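As a rough illustration of this elision step (hypothetical types, not the real recovery code), recovery can apply each sequence number at most once, which discards the copies written to both log files around a switch:

    package sketch

    // entry is a hypothetical commit log record; only the sequence
    // number matters for duplicate elision.
    type entry struct {
        seq uint64
        // mutation payload omitted
    }

    // applyOnce applies each distinct sequence number exactly once,
    // skipping duplicates introduced by log switching.
    func applyOnce(entries []entry, apply func(entry)) {
        applied := make(map[uint64]bool)
        for _, e := range entries {
            if applied[e.seq] {
                continue // duplicate from the other log file
            }
            applied[e.seq] = true
            apply(e)
        }
    }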
Speeding up tablet recovery
If the master moves a tablet from one tablet server to
another, the source tablet server first does a minor com-
paction on that tablet. This compaction reduces recov-
ery time by reducing the amount of uncompacted state in
the tablet server’s commit log. After finishing this com-
paction, the tablet server stops serving the tablet. Before
it actually unloads the tablet, the tablet server does an-
other (usually very fast) minor compaction to eliminate
any remaining uncompacted state in the tablet server’s
log that arrived while the first minor compaction was
being performed. After this second minor compaction
is complete, the tablet can be loaded on another tablet
server without requiring any recovery of log entries.
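The unload sequence can be summarized as the sketch below, where the Tablet methods are hypothetical stand-ins for internal interfaces that the paper does not describe:

    package sketch

    // Tablet is a hypothetical interface over a tablet being served.
    type Tablet interface {
        MinorCompact() // flush uncompacted state from the commit log
        StopServing()  // stop accepting new writes
        Unload()       // release the tablet so another server can load it
    }

    // migrateSource is the source server's side of a tablet move.
    func migrateSource(t Tablet) {
        t.MinorCompact() // first pass: drains most log state while still serving
        t.StopServing()  // no new mutations arrive after this point
        t.MinorCompact() // second, usually fast pass: covers writes made during the first
        t.Unload()       // destination can now load without any log recovery
    }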
Exploiting immutability
Besides the SSTable caches, various other parts of the
Bigtable system have been simplified by the fact that all
of the SSTables that we generate are immutable. For ex-
ample, we do not need any synchronization of accesses
to the file system when reading from SSTables. As a re-
sult, concurrency control over rows can be implemented
very efficiently. The only mutable data structure that is
accessed by both reads and writes is the memtable. To re-
duce contention during reads of the memtable, we make
each memtable row copy-on-write and allow reads and
writes to proceed in parallel.
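A minimal copy-on-write sketch for a single memtable row, assuming hypothetical types (the paper does not describe the memtable at this level of detail): readers follow an atomically published pointer without locking, while a writer copies the current version, applies its mutation, and swaps the pointer.

    package sketch

    import (
        "sync"
        "sync/atomic"
    )

    type rowVersion struct {
        cells map[string]string // column -> value
    }

    type Row struct {
        current atomic.Pointer[rowVersion] // readers load this lock-free
        mu      sync.Mutex                 // serializes writers only
    }

    func NewRow() *Row {
        r := &Row{}
        r.current.Store(&rowVersion{cells: map[string]string{}})
        return r
    }

    func (r *Row) Get(col string) (string, bool) {
        v, ok := r.current.Load().cells[col]
        return v, ok
    }

    func (r *Row) Set(col, val string) {
        r.mu.Lock()
        defer r.mu.Unlock()
        old := r.current.Load()
        next := &rowVersion{cells: make(map[string]string, len(old.cells)+1)}
        for k, v := range old.cells {
            next.cells[k] = v
        }
        next.cells[col] = val
        r.current.Store(next) // a concurrent reader sees either the old or the new version
    }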
Since SSTables are immutable, the problem of perma-
nently removing deleted data is transformed to garbage
collecting obsolete SSTables. Each tablet’s SSTables are
registered in the METADATA table. The master removes
obsolete SSTables as a mark-and-sweep garbage collec-
tion [25] over the set of SSTables, where the METADATA
table contains the set of roots.
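Conceptually (hypothetical names, not the actual garbage collector), the sweep keeps only the SSTable files that are still registered in the METADATA table and treats everything else as obsolete:

    package sketch

    // obsoleteSSTables returns the on-disk SSTable files that no tablet's
    // METADATA entry references; these are safe to delete.
    func obsoleteSSTables(onDisk []string, registered map[string]bool) []string {
        var garbage []string
        for _, f := range onDisk {
            if !registered[f] { // not reachable from the METADATA roots
                garbage = append(garbage, f)
            }
        }
        return garbage
    }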
Finally, the immutability of SSTables enables us to
split tablets quickly. Instead of generating a new set of
SSTables for each child tablet, we let the child tablets
share the SSTables of the parent tablet.
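A split can therefore be recorded purely as metadata; the sketch below (hypothetical types) shows two children referencing the parent's SSTable files with narrower key ranges, without rewriting any SSTable data:

    package sketch

    // TabletMeta is a hypothetical view of a tablet's METADATA entry.
    type TabletMeta struct {
        StartKey, EndKey string
        SSTables         []string // SSTable file names, shared with siblings after a split
    }

    // split produces two child tablets that share the parent's SSTables.
    func split(parent TabletMeta, splitKey string) (TabletMeta, TabletMeta) {
        left := TabletMeta{StartKey: parent.StartKey, EndKey: splitKey, SSTables: parent.SSTables}
        right := TabletMeta{StartKey: splitKey, EndKey: parent.EndKey, SSTables: parent.SSTables}
        return left, right
    }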
7 Performance Evaluation
We set up a Bigtable cluster with N tablet servers to
measure the performance and scalability of Bigtable as
N is varied. The tablet servers were configured to use 1
GB of memory and to write to a GFS cell consisting of
1786 machines with two 400 GB IDE hard drives each.
N client machines generated the Bigtable load used for
these tests. (We used the same number of clients as tablet
servers to ensure that clients were never a bottleneck.)
Each machine had two dual-core Opteron 2 GHz chips,
enough physical memory to hold the working set of all
running processes, and a single gigabit Ethernet link.
The machines were arranged in a two-level tree-shaped
switched network with approximately 100-200 Gbps of
aggregate bandwidth available at the root. All of the ma-
chines were in the same hosting facility and therefore the
round-trip time between any pair of machines was less
than a millisecond.
The tablet servers and master, test clients, and GFS
servers all ran on the same set of machines. Every ma-
chine ran a GFS server. Some of the machines also ran
a tablet server, a client process, or processes
from other jobs that were using the pool at the same time
as these experiments.
R is the number of distinct Bigtable row keys involved
in the test. R was chosen so that each benchmark read or
wrote approximately 1 GB of data per tablet server.
The sequential write benchmark used row keys with
names 0 to R − 1. This space of row keys was parti-
tioned into 10N equal-sized ranges. These ranges were
assigned to the N clients by a central scheduler that as-