The next design choice is who performs the process of conflict
resolution. This can be done by the data store or the application. If
conflict resolution is done by the data store, its choices are rather
limited. In such cases, the data store can only use simple policies,
such as “last write wins” [22], to resolve conflicting updates. On
the other hand, since the application is aware of the data schema it
can decide on the conflict resolution method that is best suited for
its client’s experience. For instance, the application that maintains
customer shopping carts can choose to “merge” the conflicting
versions and return a single unified shopping cart. Despite this
flexibility, some application developers may not want to write
their own conflict resolution mechanisms and choose to push it
down to the data store, which in turn chooses a simple policy such
as “last write wins”.
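For illustration, the following minimal sketch (not Dynamo's actual interface; version objects are shown as plain dictionaries with hypothetical timestamp and items fields) contrasts the two approaches: a "last write wins" policy applied by the data store, and an application-level merge of conflicting shopping carts.

# Illustrative sketch (not Dynamo's API): two ways to reconcile the
# conflicting versions of a key that a read may return.

def last_write_wins(versions):
    # Data-store policy: keep only the version with the latest timestamp;
    # every other concurrent write is silently discarded.
    return max(versions, key=lambda v: v["timestamp"])

def merge_carts(versions):
    # Application policy: union the items of all conflicting carts so that
    # no "add to cart" is lost (though a deleted item may resurface).
    merged = {}
    for v in versions:
        for item, qty in v["items"].items():
            merged[item] = max(merged.get(item, 0), qty)
    return {"items": merged}

conflicting = [
    {"timestamp": 10, "items": {"book": 1}},
    {"timestamp": 12, "items": {"pen": 2}},
]
print(last_write_wins(conflicting))  # {'timestamp': 12, 'items': {'pen': 2}}
print(merge_carts(conflicting))      # {'items': {'book': 1, 'pen': 2}}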
Other key principles embraced in the design are:
Incremental scalability: Dynamo should be able to scale out one
storage host (henceforth, referred to as “node”) at a time, with
minimal impact on both operators of the system and the system
itself.
Symmetry: Every node in Dynamo should have the same set of
responsibilities as its peers; there should be no distinguished node
or nodes that take on special roles or an extra set of responsibilities. In
our experience, symmetry simplifies the process of system
provisioning and maintenance.
Decentralization: An extension of symmetry, the design should
favor decentralized peer-to-peer techniques over centralized
control. In the past, centralized control has resulted in outages and
the goal is to avoid it as much as possible. This leads to a simpler,
more scalable, and more available system.
Heterogeneity: The system needs to be able to exploit
heterogeneity in the infrastructure it runs on. For example, the work
distribution must be proportional to the capabilities of the
individual servers. This is essential in adding new nodes with
higher capacity without having to upgrade all hosts at once.
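As an illustration of this principle (a minimal sketch, not Dynamo's partitioning implementation; WeightedRing and its parameters are hypothetical), a consistent-hashing ring can assign each host a number of virtual nodes proportional to its capacity, so that more capable servers own a proportionally larger share of the key space.

# Minimal sketch, assuming a consistent-hashing ring in which each host is
# assigned virtual nodes ("tokens") in proportion to its capacity, so that
# more capable servers own proportionally more of the key space.
import bisect
import hashlib

def _position(value):
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class WeightedRing:
    def __init__(self, capacities, tokens_per_unit=32):
        # capacities: host name -> relative capacity (e.g. 1, 2, 4).
        self._ring = sorted(
            (_position(f"{host}-{i}"), host)
            for host, cap in capacities.items()
            for i in range(cap * tokens_per_unit)
        )
        self._positions = [pos for pos, _ in self._ring]

    def owner(self, key):
        # Walk clockwise from the key's hash position to the first token.
        i = bisect.bisect(self._positions, _position(key)) % len(self._ring)
        return self._ring[i][1]

ring = WeightedRing({"small-host": 1, "big-host": 4})
counts = {"small-host": 0, "big-host": 0}
for k in range(10000):
    counts[ring.owner(f"key-{k}")] += 1
print(counts)  # "big-host" should own roughly four times as many keys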
3. RELATED WORK
3.1 Peer to Peer Systems
There are several peer-to-peer (P2P) systems that have looked at
the problem of data storage and distribution. The first generation
of P2P systems, such as Freenet and Gnutella¹, were
predominantly used as file sharing systems. These were examples
of unstructured P2P networks where the overlay links between
peers were established arbitrarily. In these networks, a search
query is usually flooded through the network to find as many
peers as possible that share the data. P2P systems evolved into a
next generation widely known as structured P2P
networks. These networks employ a globally consistent protocol
to ensure that any node can efficiently route a search query to
some peer that has the desired data. Systems like Pastry [16] and
Chord [20] use routing mechanisms to ensure that queries can be
answered within a bounded number of hops. To reduce the
additional latency introduced by multi-hop routing, some P2P
systems (e.g., [14]) employ O(1) routing where each peer
maintains enough routing information locally so that it can route
requests (to access a data item) to the appropriate peer within a
constant number of hops.
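As a toy illustration of O(1) routing (hypothetical names; hash-mod placement is a deliberate simplification rather than the scheme used in [14]), each peer can keep a full, replicated peer list so that a lookup is resolved locally and the request reaches its owner in a single hop.

# Toy sketch of O(1) routing: every peer holds the same replicated peer
# list (kept current by some membership protocol, assumed here), so a
# lookup is resolved locally and reaches the owning peer in one hop.
# Hash-mod placement is a simplification; it is not how [14] assigns keys.
import hashlib

PEERS = ["peer-a:7000", "peer-b:7000", "peer-c:7000"]  # identical on all peers

def owner(key, peers=PEERS):
    # One local hash computation, then a single network hop to the owner.
    h = int(hashlib.sha1(key.encode()).hexdigest(), 16)
    return peers[h % len(peers)]

print(owner("user:42"))  # address of the peer responsible for this key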
Various storage systems, such as Oceanstore [9] and PAST [17],
were built on top of these routing overlays. Oceanstore provides a
global, transactional, persistent storage service that supports
serialized updates on widely replicated data. To allow for
concurrent updates while avoiding many of the problems inherent
with wide-area locking, it uses an update model based on conflict
resolution. Conflict resolution was introduced in [21] to reduce
the number of transaction aborts. Oceanstore resolves conflicts by
processing a series of updates, choosing a total order among them,
and then applying them atomically in that order. It is built for an
environment where the data is replicated on an untrusted
infrastructure. By comparison, PAST provides a simple
abstraction layer on top of Pastry for persistent and immutable
objects. It assumes that the application can build the necessary
storage semantics (such as mutable files) on top of it.
3.2 Distributed File Systems and Databases
Distributing data for performance, availability and durability has
been widely studied in the file system and database systems
community. Compared to P2P storage systems that only support
flat namespaces, distributed file systems typically support
hierarchical namespaces. Systems like Ficus [15] and Coda [19]
replicate files for high availability at the expense of consistency.
Update conflicts are typically managed using specialized conflict
resolution procedures. The Farsite system [1] is a distributed file
system that, unlike NFS, does not use a centralized server. Farsite
achieves high availability and scalability using replication. The
Google File System [6] is another distributed file system built for
hosting the state of Google’s internal applications. GFS uses a
simple design with a single master server that hosts the entire
metadata; the data is split into chunks and stored on
chunkservers. Bayou is a distributed relational database system
that allows disconnected operations and provides eventual data
consistency [21].
Among these systems, Bayou, Coda and Ficus allow disconnected
operations and are resilient to issues such as network partitions
and outages. These systems differ in their conflict resolution
procedures. For instance, Coda and Ficus perform system-level
conflict resolution while Bayou allows application-level resolution.
All of them, however, guarantee eventual consistency. Similar to
these systems, Dynamo allows read and write operations to
continue even during network partitions and resolves update
conflicts using different conflict resolution mechanisms.
Distributed block storage systems like FAB [18] split large
objects into smaller blocks and store each block in a highly
available manner. In comparison to these systems, a key-value
store is more suitable in this case because: (a) it is intended to
store relatively small objects (size < 1M) and (b) key-value stores
are easier to configure on a per-application basis. Antiquity is a
wide-area distributed storage system designed to handle multiple
server failures [23]. It uses a secure log to preserve data integrity,
replicates each log on multiple servers for durability, and uses
Byzantine fault tolerance protocols to ensure data consistency. In
contrast to Antiquity, Dynamo does not focus on the problem of
data integrity and security and is built for a trusted environment.
Bigtable is a distributed storage system for managing structured
data. It maintains a sparse, multi-dimensional sorted map and
allows applications to access their data using multiple attributes
[2]. Compared to Bigtable, Dynamo targets applications that
require only key/value access with primary focus on high
availability where updates are not rejected even in the wake of
network partitions or server failures.
¹ http://freenetproject.org/, http://www.gnutella.org