Megastore:ProvidingScalable,HighlyAvailableStorageforInteractiveServices

Megastore

Google

5星 · 超过95%的资源需积分: 15 140 浏览量更新于2023-07-05 评论收藏 931KB PDF 举报

身份认证购VIP最低享 7 折!

领优惠券(最高得80元）

资源详情

资源评论

资源推荐

Megastore: Providing Scalable, Highly Available

Storage for Interactive Ser vices

Jason Baker, Chris Bond, James C. Corbett, JJ Furman, Andrey Khorlin, James Larson,

Jean-Michel L

eon, Yawei Li, Alexander Lloyd, Vadim Yushprakh

Google, Inc.

{jasonbaker,chrisbond,jcorbett,jfurman,akhorlin,jimlarson,jm,yaweili,alloyd,vadimy}@google.com

ABSTRACT

Megastore is a storage system developed to meet the re-

quirements of today’s interactive online services. Megas-

tore blends the scalability of a NoSQL datastore with the

convenience of a traditional RDBMS in a novel way, and

provides both strong consistency guarantees and high avail-

ability. We provide fully serializable ACID semantics within

ﬁne-grained partitions of data. This partitioning allows us

to synchronously replicate each write across a wide area net-

work with reasonable latency and support seamless failover

b etween datacenters. This paper describ es Megastore’s se-

mantics and replication algorithm. It also describes our ex-

p erience supporting a wide range of Google production ser-

vices built with Megastore.

Categories and Subject Descriptors

C.2.4 [Distributed Systems]: Distributed databases; H.2.4

[Database Management]: Systems—concurrency, distrib-

uted databases

General Terms

Algorithms, Design, Performance, Reliability

Keywords

Large databases, Distributed transactions, Bigtable, Paxos

1. INTRODUCTION

Interactive online services are forcing the storage commu-

nity to meet new demands as desktop applications migrate

to the cloud. Services like email, collaborative documents,

and social networking have been growing exponentially and

are testing the limits of existing infrastructure. Meeting

these services’ storage demands is challenging due to a num-

b er of conﬂicting requirements.

First, the Internet brings a huge audience of potential

users, so the applications must be highly scalable. A service

This article is published under a Creative Commons Attribution License

(http://creativecommons.org/licenses/by/3.0/), which permits distribution

and reproduction in any medium as well allowing derivative works, pro-

vided that you attribute the original work to the author(s) and CIDR 2011.

Biennial Conference on Innovative Data Systems Research (CIDR ’11)

January 9-12, 2011, Asilomar, California, USA.

can be built rapidly using MySQL [10] as its datastore, but

scaling the service to millions of users requires a complete

redesign of its storage infrastructure. Second, services must

comp ete for users. This requires rapid development of fea-

tures and fast time-to-market. Third, the service must be

resp onsive; hence, the storage system must have low latency.

Fourth, the service should provide the user with a consistent

view of the data—the result of an update should be visible

immediately and durably. Seeing edits to a cloud-hosted

spreadsheet vanish, however brieﬂy, is a poor user experi-

ence. Finally, users have come to expect Internet services to

b e up 24/7, so the service must be highly available. The ser-

vice must be resilient to many kinds of faults ranging from

the failure of individual disks, machines, or routers all the

way up to large-scale outages aﬀecting entire datacenters.

These requirements are in conﬂict. Relational databases

provide a rich set of features for easily building applications,

but they are diﬃcult to scale to hundreds of millions of

users. NoSQL datastores such as Google’s Bigtable [15],

Apache Hadoop’s HBase [1], or Facebook’s Cassandra [6]

are highly scalable, but their limited API and loose consis-

tency models complicate application development. Repli-

cating data across distant datacenters while providing low

latency is challenging, as is guaranteeing a consistent view

of replicated data, esp ecially during faults.

Megastore is a storage system developed to meet the stor-

age requirements of today’s interactive online services. It

is novel in that it blends the scalability of a NoSQL data-

store with the convenience of a traditional RDBMS. It uses

synchronous replication to achieve high availability and a

consistent view of the data. In brief, it provides fully serial-

izable ACID semantics over distant replicas with low enough

latencies to support interactive applications.

We accomplish this by taking a middle ground in the

RDBMS vs. NoSQL design space: we partition the data-

store and replicate each partition separately, providing full

ACID semantics within partitions, but only limited con-

sistency guarantees across them. We provide traditional

database features, such as secondary indexes, but only those

features that can scale within user-tolerable latency limits,

and only with the semantics that our partitioning scheme

can supp ort. We contend that the data for most Internet

services can be suitably partitioned (e.g., by user) to make

this approach viable, and that a small, but not spartan, set

of features can substantially ease the burden of developing

cloud applications.

Contrary to conventional wisdom [24, 28], we were able to

use Paxos [27] to build a highly available system that pro-

223

vides reasonable latencies for interactive applications while

synchronously replicating writes across geographically dis-

tributed datacenters. While many systems use Paxos solely

for locking, master election, or replication of metadata and

conﬁgurations, we believe that Megastore is the largest sys-

tem deployed that uses Paxos to replicate primary user data

across datacenters on every write.

Megastore has been widely deployed within Google for

several years [20]. It handles more than three billion write

and 20 billion read transactions daily and stores nearly a

p etabyte of primary data across many global datacenters.

The key contributions of this paper are:

1. the design of a data model and storage system that

allows rapid development of interactive applications

where high availability and scalability are built-in from

the start;

2. an implementation of the Paxos replication and con-

sensus algorithm optimized for low-latency operation

across geographically distributed datacenters to pro-

vide high availability for the system;

3. a report on our experience with a large-scale deploy-

ment of Megastore at Google.

The paper is organized as follows. Section 2 describes how

Megastore provides availability and scalability using parti-

tioning and also justiﬁes the suﬃciency of our design for

many interactive Internet applications. Section 3 provides

an overview of Megastore’s data model and features. Sec-

tion 4 explains the replication algorithms in detail and gives

some measurements on how they perform in practice. Sec-

tion 5 summarizes our experience developing the system.

We review related work in Section 6. Section 7 concludes.

2. TOWARD AVAILABILITY AND SCALE

In contrast to our need for a storage platform that is

global, reliable, and arbitrarily large in scale, our hardware

building blocks are geographically conﬁned, failure-prone,

and suﬀer limited capacity. We must bind these compo-

nents into a uniﬁed ensemble oﬀering greater throughput

and reliability.

To do so, we have taken a two-pronged approach:

• for availability, we implemented a synchronous, fault-

tolerant log replicator optimized for long distance-links;

• for scale, we partitioned data into a vast space of small

databases, each with its own replicated log stored in a

per-replica NoSQL datastore.

2.1 Replication

Replicating data across hosts within a single datacenter

improves availability by overcoming host-speciﬁc failures,

but with diminishing returns. We still must confront the

networks that connect them to the outside world and the

infrastructure that powers, cools, and houses them. Eco-

nomically constructed sites risk some level of facility-wide

outages [25] and are vulnerable to regional disasters. For

cloud storage to meet availability demands, service providers

must replicate data over a wide geographic area.

2.1.1 Strategies

We evaluated common strategies for wide-area replication:

Asynchronous Master/Slave A master node replicates

write-ahead log entries to at least one slave. Log ap-

pends are acknowledged at the master in parallel with

transmission to slaves. The master can support fast

ACID transactions but risks downtime or data loss

during failover to a slave. A consensus protocol is re-

quired to mediate mastership.

Synchronous Master/Slave A master waits for changes

to be mirrored to slaves before acknowledging them,

allowing failover without data loss. Master and slave

failures need timely detection by an external system.

Optimistic Replication Any member of a homogeneous

replica group can accept mutations [23], which are

asynchronously propagated through the group. Avail-

ability and latency are excellent. However, the global

mutation ordering is not known at commit time, so

transactions are impossible.

We avoided strategies which could lose data on failures,

which are common in large-scale systems. We also discarded

strategies that do not permit ACID transactions. Despite

the operational advantages of eventually consistent systems,

it is currently too diﬃcult to give up the read-modify-write

idiom in rapid application development.

We also discarded options with a heavyweight master.

Failover requires a series of high-latency stages often caus-

ing a user-visible outage, and there is still a huge amount

of complexity. Why build a fault-tolerant system to arbi-

trate mastership and failover workﬂows if we could avoid

distinguished masters altogether?

2.1.2 Enter Paxos

We decided to use Paxos, a proven, optimal, fault-tolerant

consensus algorithm with no requirement for a distinguished

master [14, 27]. We replicate a write-ahead log over a group

of symmetric peers. Any node can initiate reads and writes.

Each log append blocks on acknowledgments from a ma-

jority of replicas, and replicas in the minority catch up as

they are able—the algorithm’s inherent fault tolerance elim-

inates the need for a distinguished “failed” state. A novel

extension to Paxos, detailed in Section 4.4.1, allows local

reads at any up-to-date replica. Another extension permits

single-roundtrip writes.

Even with fault tolerance from Paxos, there are limita-

tions to using a single log. With replicas spread over a

wide area, communication latencies limit overall through-

put. Moreover, progress is impeded when no replica is cur-

rent or a majority fail to acknowledge writes. In a traditional

SQL database hosting thousands or millions of users, us-

ing a synchronously replicated log would risk interruptions

of widespread impact [11]. So to improve availability and

throughput we use multiple replicated logs, each governing

its own partition of the data set.

2.2 Partitioning and Locality

To scale our replication scheme and maximize performance

of the underlying datastore, we give applications ﬁne-grained

control over their data’s partitioning and locality.

2.2.1 Entity Groups

To scale throughput and localize outages, we partition our

data into a collection of entity groups [24], each indepen-

dently and synchronously replicated over a wide area. The

underlying data is stored in a scalable NoSQL datastore in

each datacenter (see Figure 1).

Entities within an entity group are mutated with single-

phase ACID transactions (for which the commit record is

224

Figure 1: Scalable Replication

Figure 2: Operations Across Entity Groups

replicated via Paxos). Operations across entity groups could

rely on expensive two-phase commits, but typically leverage

Megastore’s eﬃcient asynchronous messaging. A transac-

tion in a sending entity group places one or more messages

in a queue; transactions in receiving entity groups atomically

consume those messages and apply ensuing mutations.

Note that we use asynchronous messaging between logi-

cally distant entity groups, not physically distant replicas.

All network traﬃc between datacenters is from replicated

op erations, which are synchronous and consistent.

Indexes local to an entity group obey ACID semantics;

those across entity groups have looser consistency. See Fig-

ure 2 for the various operations on and between entity groups.

2.2.2 Selecting Entity Group Boundaries

The entity group deﬁnes the a priori grouping of data

for fast op erations. Boundaries that are too ﬁne-grained

force excessive cross-group operations, but placing too much

unrelated data in a single group serializes unrelated writes,

which degrades throughput.

The following examples show ways applications can work

within these constraints:

Email Each email account forms a natural entity group.

Operations within an account are transactional and

consistent: a user who sends or labels a message is

guaranteed to observe the change despite possible fail-

over to another replica. External mail routers handle

communication between accounts.

Blogs A blogging application would be modeled with mul-

tiple classes of entity groups. Each user has a proﬁle,

which is naturally its own entity group. However, blogs

are collaborative and have no single permanent owner.

We create a second class of entity groups to hold the

p osts and metadata for each blog. A third class keys

oﬀ the unique name claimed by each blog. The appli-

cation relies on asynchronous messaging when a sin-

gle user operation aﬀects both blogs and proﬁles. For

a lower-traﬃc operation like creating a new blog and

claiming its unique name, two-phase commit is more

convenient and performs adequately.

Maps Geographic data has no natural granularity of any

consistent or convenient size. A mapping application

can create entity groups by dividing the glob e into non-

overlapping patches. For mutations that span patches,

the application uses two-phase commit to make them

atomic. Patches must be large enough that two-phase

transactions are uncommon, but small enough that

each patch requires only a small write throughput.

Unlike the previous examples, the number of entity

groups does not grow with increased usage, so enough

patches must be created initially for suﬃcient aggre-

gate throughput at later scale.

Nearly all applications built on Megastore have found nat-

ural ways to draw entity group boundaries.

2.2.3 Physical Layout

We use Google’s Bigtable [15] for scalable fault-tolerant

storage within a single datacenter, allowing us to support

arbitrary read and write throughput by spreading operations

across multiple rows.

We minimize latency and maximize throughput by let-

ting applications control the placement of data: through the

selection of Bigtable instances and speciﬁcation of locality

within an instance.

To minimize latency, applications try to keep data near

users and replicas near each other. They assign each entity

group to the region or continent from which it is accessed

most. Within that region they assign a triplet or quintuplet

of replicas to datacenters with isolated failure domains.

For low latency, cache eﬃciency, and throughput, the data

for an entity group are held in contiguous ranges of Bigtable

rows. Our schema language lets applications control the

placement of hierarchical data, storing data that is accessed

together in nearby rows or denormalized into the same row.

3. A TOUR OF MEGASTORE

Megastore maps this architecture onto a feature set care-

fully chosen to encourage rapid development of scalable ap-

plications. This section motivates the tradeoﬀs and de-

scrib es the developer-facing features that result.

3.1 API Design Philosophy

ACID transactions simplify reasoning about correctness,

but it is equally imp ortant to be able to reason about perfor-

mance. Megastore emphasizes cost-transparent APIs with

runtime costs that match application developers’ intuitions.

Normalized relational schemas rely on joins at query time

to service user operations. This is not the right model for

Megastore applications for several reasons:

• High-volume interactive workloads beneﬁt more from

predictable performance than from an expressive query

language.

225

剩余11页未读，继续阅读

mipmap

2012-04-17

能找到的东西太有限了，只剩论文了

caizhanfei

粉丝: 324
资源: 9

会员权益专享

Megastore: Providing Scalable, Highly Available Storage for Inte...

评论3

会员权益专享

最新资源

Megastore: Providing Scalable, Highly Available Storage for Inte...

评论3

MegaStore:具有数据库交互的 Android 应用程序

google megastore 报告与揭秘

Altera IP MegaStore

MEGASTORE V1.0 商城门户多品类外贸独立站商城模板.zip

Google Megastore分布式存储技术全揭秘.doc

html+css+bootstrap实现生活用品购物电商MegaStore.zip

Design and Architecture of CockroachDb

Google论文集合

GoogleMegastore分布式存储技术全揭秘

电子、通信、计算机大类学生课程实验的心得体会

【营销】任务一金融产品与金融产品营销认识.docx

单片机课程实验-秒表实现

基于LSTM的SDN流量预测与负载均衡python源码+详细注释+数据.zip

实验室管理微信小程序设计

机械设计CNC自动打孔机（sw18可编辑+工程图）非常好的设计图纸100%好用.zip

51单片机使用蜂鸣器来播放音乐蜂鸣器播放音乐-小星星亮晶晶

机械设计德国6层电万能蒸烤箱（sw21可编辑+cad）非常好的设计图纸100%好用.zip

node-v12.22.5-darwin-x64.tar.xz

超声波传感器系列 UFA-150, UFA-200 安装指南

VOS4890 户外电源系统 用户手册

会员权益专享

最新资源

VOS4890 户外电源系统用户手册