The next design choice is who performs the process of conflict
resolution. This can be done by the data store or the application. If
conflict resolution is done by the data store, its choices are rather
limited. In such cases, the data store can only use simple policies,
such as “last write wins” [22], to resolve conflicting updates. On
the other hand, since the application is aware of the data schema it
can decide on the conflict resolution method that is best suited for
its client’s experience. For instance, the application that maintains
customer shopping carts can choose to “merge” the conflicting
versions and return a single unified shopping cart. Despite this
flexibility, some application developers may not want to write
their own conflict resolution mechanisms and choose to push it
down to the data store, which in turn chooses a simple policy such
as “last write wins”.
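For illustration, the following minimal sketch (not Dynamo's actual interface; version objects are shown as plain dictionaries with hypothetical timestamp and items fields) contrasts the two approaches: a "last write wins" policy applied by the data store, and an application-level merge of conflicting shopping carts.

# Illustrative sketch (not Dynamo's API): two ways to reconcile the
# conflicting versions of a key that a read may return.

def last_write_wins(versions):
    # Data-store policy: keep only the version with the latest timestamp;
    # every other concurrent write is silently discarded.
    return max(versions, key=lambda v: v["timestamp"])

def merge_carts(versions):
    # Application policy: union the items of all conflicting carts so that
    # no "add to cart" is lost (though a deleted item may resurface).
    merged = {}
    for v in versions:
        for item, qty in v["items"].items():
            merged[item] = max(merged.get(item, 0), qty)
    return {"items": merged}

conflicting = [
    {"timestamp": 10, "items": {"book": 1}},
    {"timestamp": 12, "items": {"pen": 2}},
]
print(last_write_wins(conflicting))  # {'timestamp': 12, 'items': {'pen': 2}}
print(merge_carts(conflicting))      # {'items': {'book': 1, 'pen': 2}}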
Other key principles embraced in the design are:
Incremental scalability: Dynamo should be able to scale out one
storage host (henceforth, referred to as “node”) at a time, with
minimal impact on both operators of the system and the system
itself.
Symmetry: Every node in Dynamo should have the same set of
responsibilities as its peers; there should be no distinguished node
or nodes that take on special roles or an extra set of responsibilities. In
our experience, symmetry simplifies the process of system
provisioning and maintenance.
Decentralization: An extension of symmetry, the design should
favor decentralized peer-to-peer techniques over centralized
control. In the past, centralized control has resulted in outages and
the goal is to avoid it as much as possible. This leads to a simpler,
more scalable, and more available system.
Heterogeneity: The system needs to be able to exploit
heterogeneity in the infrastructure it runs on. For example, the work
distribution must be proportional to the capabilities of the
individual servers. This is essential in adding new nodes with
higher capacity without having to upgrade all hosts at once.
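As an illustration of this principle (a minimal sketch, not Dynamo's partitioning implementation; WeightedRing and its parameters are hypothetical), a consistent-hashing ring can assign each host a number of virtual nodes proportional to its capacity, so that more capable servers own a proportionally larger share of the key space.

# Minimal sketch, assuming a consistent-hashing ring in which each host is
# assigned virtual nodes ("tokens") in proportion to its capacity, so that
# more capable servers own proportionally more of the key space.
import bisect
import hashlib

def _position(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class WeightedRing:
    def __init__(self, capacities, tokens_per_unit=32):
        # capacities: host name -> relative capacity (e.g. 1, 2, 4).
        self._ring = sorted(
            (_position(f"{host}-{i}"), host)
            for host, cap in capacities.items()
            for i in range(cap * tokens_per_unit)
        )
        self._positions = [pos for pos, _ in self._ring]

    def owner(self, key):
        # Walk clockwise from the key's hash position to the first token.
        i = bisect.bisect(self._positions, _position(key)) % len(self._ring)
        return self._ring[i][1]

ring = WeightedRing({"small-host": 1, "big-host": 4})
counts = {"small-host": 0, "big-host": 0}
for k in range(10000):
    counts[ring.owner(f"key-{k}")] += 1
print(counts)  # "big-host" should own roughly four times as many keys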
3. RELATED WORK
3.1 Peer to Peer Systems
There are several peer-to-peer (P2P) systems that have looked at
the problem of data storage and distribution. The first generation
of P2P systems, such as Freenet and Gnutella¹, were
predominantly used as file sharing systems. These were examples
of unstructured P2P networks where the overlay links between
peers were established arbitrarily. In these networks, a search
query is usually flooded through the network to find as many
peers as possible that share the data. P2P systems evolved into a
next generation widely known as structured P2P
networks. These networks employ a globally consistent protocol
to ensure that any node can efficiently route a search query to
some peer that has the desired data. Systems like Pastry [16] and
Chord [20] use routing mechanisms to ensure that queries can be
answered within a bounded number of hops. To reduce the
additional latency introduced by multi-hop routing, some P2P
systems (e.g., [14]) employ O(1) routing where each peer
maintains enough routing information locally so that it can route
requests (to access a data item) to the appropriate peer within a
constant number of hops.
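As a toy illustration of O(1) routing (hypothetical names; hash-mod placement is a deliberate simplification rather than the scheme used in [14]), each peer can keep a full, replicated peer list so that a lookup is resolved locally and the request reaches its owner in a single hop.

# Toy sketch of O(1) routing: every peer holds the same replicated peer
# list (kept current by some membership protocol, assumed here), so a
# lookup is resolved locally and reaches the owning peer in one hop.
# Hash-mod placement is a simplification; it is not how [14] assigns keys.
import hashlib

PEERS = ["peer-a:7000", "peer-b:7000", "peer-c:7000"]  # identical on all peers

def owner(key, peers=PEERS):
    # One local hash computation, then a single network hop to the owner.
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return peers[h % len(peers)]

print(owner("user:42"))  # address of the peer responsible for this key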
Various storage systems, such as Oceanstore [9] and PAST [17],
were built on top of these routing overlays. Oceanstore provides a
global, transactional, persistent storage service that supports
serialized updates on widely replicated data. To allow for
concurrent updates while avoiding many of the problems inherent
with wide-area locking, it uses an update model based on conflict
resolution. Conflict resolution was introduced in [21] to reduce
the number of transaction aborts. Oceanstore resolves conflicts by
processing a series of updates, choosing a total order among them,
and then applying them atomically in that order. It is built for an
environment where the data is replicated on an untrusted
infrastructure. By comparison, PAST provides a simple
abstraction layer on top of Pastry for persistent and immutable
objects. It assumes that the application can build the necessary
storage semantics (such as mutable files) on top of it.
3.2 Distributed File Systems and Databases
Distributing data for performance, availability and durability has
been widely studied in the file system and database systems
community. Compared to P2P storage systems that only support
flat namespaces, distributed file systems typically support
hierarchical namespaces. Systems like Ficus [15] and Coda [19]
replicate files for high availability at the expense of consistency.
Update conflicts are typically managed using specialized conflict
resolution procedures. The Farsite system [1] is a distributed file
system that, unlike NFS, does not use a centralized server. Farsite
achieves high availability and scalability using replication. The
Google File System [6] is another distributed file system built for
hosting the state of Google’s internal applications. GFS uses a
simple design with a single master server that hosts the entire
metadata; the data is split into chunks and stored on
chunkservers. Bayou is a distributed relational database system
that allows disconnected operations and provides eventual data
consistency [21].
Among these systems, Bayou, Coda and Ficus allow disconnected
operations and are resilient to issues such as network partitions
and outages. These systems differ in their conflict resolution
procedures. For instance, Coda and Ficus perform system-level
conflict resolution while Bayou allows application-level resolution.
All of them, however, guarantee eventual consistency. Similar to
these systems, Dynamo allows read and write operations to
continue even during network partitions and resolves update
conflicts using different conflict resolution mechanisms.
Distributed block storage systems like FAB [18] split large
objects into smaller blocks and store each block in a highly
available manner. In comparison to these systems, a key-value
store is more suitable in this case because: (a) it is intended to
store relatively small objects (size < 1M) and (b) key-value stores
are easier to configure on a per-application basis. Antiquity is a
wide-area distributed storage system designed to handle multiple
server failures [23]. It uses a secure log to preserve data integrity,
replicates each log on multiple servers for durability, and uses
Byzantine fault tolerance protocols to ensure data consistency. In
contrast to Antiquity, Dynamo does not focus on the problem of
data integrity and security and is built for a trusted environment.
Bigtable is a distributed storage system for managing structured
data. It maintains a sparse, multi-dimensional sorted map and
allows applications to access their data using multiple attributes
[2]. Compared to Bigtable, Dynamo targets applications that
require only key/value access with primary focus on high
availability where updates are not rejected even in the wake of
network partitions or server failures.
¹ http://freenetproject.org/, http://www.gnutella.org