5. Management and administration:
Separate systems require significantly
more work to manage and deploy than a single one. Even for users, they
require learning multiple APIs and execution models.
Because of these limitations, a unified abstraction for cluster computing would
have significant benefits not only in usability but also in performance, especially for
complex applications and multi-user settings.
1.2 Resilient Distributed Datasets (RDDs)
To address this problem, we introduce a new abstraction, resilient distributed
datasets (RDDs), that forms a simple extension to the MapReduce model. The insight
behind RDDs is that although the workloads that MapReduce was unsuited for
(e.g., iterative, interactive and streaming queries) seem at first very different, they all
require a common feature that MapReduce lacks: efficient data sharing across parallel
computation stages. With an efficient data sharing abstraction and MapReduce-
like operators, all of these workloads can be expressed efficiently, capturing the
key optimizations in current specialized systems. RDDs offer such an abstraction
for a broad set of parallel computations, in a manner that is both efficient and
fault-tolerant.
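To make this data sharing concrete, the following is a minimal sketch in Spark's Scala API (runnable in the Spark shell, where a SparkContext named sc is predefined); the input path and the queries over it are hypothetical. The dataset is loaded and filtered once, cached in memory, and then shared by two computations without passing through a replicated file system:

```scala
// Illustrative sketch (not from the thesis text): load a dataset once,
// cache it in memory, and share it across two parallel computations.
// Assumes the Spark shell, where a SparkContext named sc is predefined;
// the HDFS path and the ERROR filter are hypothetical.
val errors = sc.textFile("hdfs://example/logs")
               .filter(_.contains("ERROR"))
               .cache()                      // keep partitions in memory

// Both queries reuse the in-memory partitions of errors; neither one
// writes intermediate data to a replicated file system.
val total    = errors.count()
val byModule = errors.map(line => (line.split("\t")(0), 1))
                     .reduceByKey(_ + _)
                     .collect()
```

In plain MapReduce, each of these computations would instead begin by rereading the input from stable storage.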
In particular, previous fault-tolerant processing models for clusters, such as
MapReduce and Dryad, structured computations as a directed acyclic graph (DAG)
of tasks. This allowed them to efficiently replay just part of the DAG for fault recov-
ery. Between separate computations, however (e.g., between steps of an iterative
algorithm), these models provided no storage abstraction other than replicated file
systems, which add significant costs due to data replication across the network.
RDDs are a fault-tolerant distributed memory abstraction that avoids replication.
Instead, each RDD remembers the graph of operations used to build it, similarly
to batch computing models, and can efficiently recompute data lost on failure. As
long as the operations that create RDDs are relatively coarse-grained, i.e., a single
operation applies to many data elements, this technique is much more efficient
than replicating the data over the network. RDDs work well for a wide range of
today’s data-parallel algorithms and programming models, all of which apply each
operation to many items.
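As a minimal illustration of this recovery strategy (again assuming the Spark shell and a hypothetical input path), each transformation below extends the graph of operations that the resulting RDD remembers, and this recorded lineage can be inspected directly:

```scala
// Illustrative sketch: each RDD remembers the coarse-grained operations
// that built it rather than replicating its data over the network.
val counts = sc.textFile("hdfs://example/input")   // hypothetical path
               .flatMap(_.split(" "))
               .map(word => (word, 1))
               .reduceByKey(_ + _)

// Print the recorded lineage graph; if a partition of counts is lost,
// only the operations feeding that partition are rerun.
println(counts.toDebugString)
```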
While it may seem surprising that just adding data sharing greatly increases the
generality of MapReduce, we explore from several perspectives why this is so. First,
from an expressiveness perspective, we show that RDDs can emulate any distributed
system, and will do so efficiently as long as the system tolerates some network
latency. This is because, once augmented with fast data sharing, MapReduce can
emulate the Bulk Synchronous Parallel (BSP) [108] model of parallel computing,
with the main drawback being the latency of each MapReduce step. Empirically,
in our Spark system, this can be as low as 50–100 ms.
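As a sketch of what such an emulation looks like, consider the standard PageRank computation on RDDs (assuming the Spark shell; the toy graph, iteration count, and damping constants are illustrative). Each loop iteration is one BSP superstep: vertices emit messages in parallel, and the reduceByKey shuffle both delivers the messages and serves as the synchronization barrier, paying the per-step latency noted above:

```scala
// Illustrative sketch: BSP supersteps emulated with MapReduce-like
// RDD steps. The graph and constants are hypothetical examples.
val links = sc.parallelize(Seq(            // page -> outgoing links
  "a" -> Seq("b", "c"), "b" -> Seq("c"), "c" -> Seq("a")
)).cache()
var ranks = links.mapValues(_ => 1.0)      // initial vertex state

for (_ <- 1 to 10) {                       // ten supersteps
  val contribs = links.join(ranks).values.flatMap {
    case (out, rank) => out.map(dst => (dst, rank / out.size))
  }
  // The shuffle delivers all messages and synchronizes the superstep.
  ranks = contribs.reduceByKey(_ + _).mapValues(0.15 + 0.85 * _)
}
ranks.collect().foreach(println)
```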
Second, from a systems perspective, RDDs, unlike plain MapReduce, give applications enough control