
SSTable format. SSTables are stored in GFS; Bigtable
relies on GFS to preserve data in the event of disk loss.
Bigtable allows users to control the performance charac-
teristics of the table by grouping a set of columns into
a locality group. The columns in each locality group are
stored in their own set of SSTables, which makes scan-
ning them less expensive since the data in other columns
need not be scanned.
The decision to build on Bigtable defined the over-
all shape of Percolator. Percolator maintains the gist of
Bigtable’s interface: data is organized into Bigtable rows
and columns, with Percolator metadata stored along-
side in special columns (see Figure 5). Percolator’s
API closely resembles Bigtable’s API: the Percolator li-
brary largely consists of Bigtable operations wrapped in
Percolator-specific computation. The challenge, then, in
implementing Percolator is providing the features that
Bigtable does not: multirow transactions and the ob-
server framework.
2.2 Transactions
Percolator provides cross-row, cross-table transac-
tions with ACID snapshot-isolation semantics. Percola-
tor users write their transaction code in an imperative
language (currently C++) and mix calls to the Percola-
tor API with their code. Figure 2 shows a simplified ver-
sion of clustering documents by a hash of their contents.
In this example, if Commit() returns false, the transac-
tion has conflicted (in this case, because two URLs with
the same content hash were processed simultaneously)
and should be retried after a backoff. Calls to Get() and
Commit() are blocking; parallelism is achieved by run-
ning many transactions simultaneously in a thread pool.
While it is possible to incrementally process data with-
out the benefit of strong transactions, transactions make
it more tractable for the user to reason about the state of
the system and to avoid the introduction of errors into
a long-lived repository. For example, in a transactional
web-indexing system the programmer can make assump-
tions like: the hash of the contents of a document is al-
ways consistent with the table that indexes duplicates.
Without transactions, an ill-timed crash could result in a
permanent error: an entry in the document table that cor-
responds to no URL in the duplicates table. Transactions
also make it easy to build index tables that are always
up to date and consistent. Note that both of these exam-
ples require transactions that span rows, rather than the
single-row transactions that Bigtable already provides.
Percolator stores multiple versions of each data item
using Bigtable’s timestamp dimension. Multiple versions
are required to provide snapshot isolation [5], which
presents each transaction with the appearance of reading
from a stable snapshot at some timestamp. Writes appear
in a different, later, timestamp. Snapshot isolation pro-
bool UpdateDocument(Document doc) {
Transaction t(&cluster);
t.Set(doc.url(), "contents", "document", doc.contents());
int hash = Hash(doc.contents());
// dups table maps hash → canonical URL
string canonical;
if (!t.Get(hash, "canonical-url", "dups", &canonical)) {
// No canonical yet; write myself in
t.Set(hash, "canonical-url", "dups", doc.url());
} // else this document already exists, ignore new copy
return t.Commit();
}
Figure 2: Example usage of the Percolator API to perform ba-
sic checksum clustering and eliminate documents with the same
content.
[t]
Figure 3: Transactions under snapshot isolation perform reads
at a start timestamp (represented here by an open square) and
writes at a commit timestamp (closed circle). In this example,
transaction 2 would not see writes from transaction 1 since trans-
action 2’s start timestamp is before transaction 1’s commit times-
tamp. Transaction 3, however, will see writes from both 1 and 2.
Transaction 1 and 2 are running concurrently: if they both write
the same cell, at least one will abort.
tects against write-write conflicts: if transactions A and
B, running concurrently, write to the same cell, at most
one will commit. Snapshot isolation does not provide
serializability; in particular, transactions running under
snapshot isolation are subject to write skew [5]. The main
advantage of snapshot isolation over a serializable proto-
col is more efficient reads. Because any timestamp rep-
resents a consistent snapshot, reading a cell requires only
performing a Bigtable lookup at the given timestamp; ac-
quiring locks is not necessary. Figure 3 illustrates the re-
lationship between transactions under snapshot isolation.
Because it is built as a client library accessing
Bigtable, rather than controlling access to storage itself,
Percolator faces a different set of challenges implement-
ing distributed transactions than traditional PDBMSs.
Other parallel databases integrate locking into the sys-
tem component that manages access to the disk: since
each node already mediates access to data on the disk it
can grant locks on requests and deny accesses that violate
locking requirements.
By contrast, any node in Percolator can (and does) is-
sue requests to directly modify state in Bigtable: there is
no convenient place to intercept traffic and assign locks.
As a result, Percolator must explicitly maintain locks.
Locks must persist in the face of machine failure; if a
lock could disappear between the two phases of com-
3