Figure 1: Scalable Replication
Figure 2: Operations Across Entity Groups
replicated via Paxos). Operations across entity groups could
rely on expensive two-phase commits, but typically leverage
Megastore’s efficient asynchronous messaging. A transac-
tion in a sending entity group places one or more messages
in a queue; transactions in receiving entity groups atomically
consume those messages and apply ensuing mutations.
Note that we use asynchronous messaging between logi-
cally distant entity groups, not physically distant replicas.
All network traffic between datacenters is from replicated
op erations, which are synchronous and consistent.
Indexes local to an entity group obey ACID semantics;
those across entity groups have looser consistency. See Fig-
ure 2 for the various operations on and between entity groups.
2.2.2 Selecting Entity Group Boundaries
The entity group defines the a priori grouping of data
for fast op erations. Boundaries that are too fine-grained
force excessive cross-group operations, but placing too much
unrelated data in a single group serializes unrelated writes,
which degrades throughput.
The following examples show ways applications can work
within these constraints:
Email Each email account forms a natural entity group.
Operations within an account are transactional and
consistent: a user who sends or labels a message is
guaranteed to observe the change despite possible fail-
over to another replica. External mail routers handle
communication between accounts.
Blogs A blogging application would be modeled with mul-
tiple classes of entity groups. Each user has a profile,
which is naturally its own entity group. However, blogs
are collaborative and have no single permanent owner.
We create a second class of entity groups to hold the
p osts and metadata for each blog. A third class keys
off the unique name claimed by each blog. The appli-
cation relies on asynchronous messaging when a sin-
gle user operation affects both blogs and profiles. For
a lower-traffic operation like creating a new blog and
claiming its unique name, two-phase commit is more
convenient and performs adequately.
Maps Geographic data has no natural granularity of any
consistent or convenient size. A mapping application
can create entity groups by dividing the glob e into non-
overlapping patches. For mutations that span patches,
the application uses two-phase commit to make them
atomic. Patches must be large enough that two-phase
transactions are uncommon, but small enough that
each patch requires only a small write throughput.
Unlike the previous examples, the number of entity
groups does not grow with increased usage, so enough
patches must be created initially for sufficient aggre-
gate throughput at later scale.
Nearly all applications built on Megastore have found nat-
ural ways to draw entity group boundaries.
2.2.3 Physical Layout
We use Google’s Bigtable [15] for scalable fault-tolerant
storage within a single datacenter, allowing us to support
arbitrary read and write throughput by spreading operations
across multiple rows.
We minimize latency and maximize throughput by let-
ting applications control the placement of data: through the
selection of Bigtable instances and specification of locality
within an instance.
To minimize latency, applications try to keep data near
users and replicas near each other. They assign each entity
group to the region or continent from which it is accessed
most. Within that region they assign a triplet or quintuplet
of replicas to datacenters with isolated failure domains.
For low latency, cache efficiency, and throughput, the data
for an entity group are held in contiguous ranges of Bigtable
rows. Our schema language lets applications control the
placement of hierarchical data, storing data that is accessed
together in nearby rows or denormalized into the same row.
3. A TOUR OF MEGASTORE
Megastore maps this architecture onto a feature set care-
fully chosen to encourage rapid development of scalable ap-
plications. This section motivates the tradeoffs and de-
scrib es the developer-facing features that result.
3.1 API Design Philosophy
ACID transactions simplify reasoning about correctness,
but it is equally imp ortant to be able to reason about perfor-
mance. Megastore emphasizes cost-transparent APIs with
runtime costs that match application developers’ intuitions.
Normalized relational schemas rely on joins at query time
to service user operations. This is not the right model for
Megastore applications for several reasons:
• High-volume interactive workloads benefit more from
predictable performance than from an expressive query
language.