This is because, to achieve resilience, jobs write their
outputs to replicated, on-disk storage systems, leading to
costly disk I/O and data replication across the network.
Our key insight is that it is possible to achieve sub-
second latencies in a batch system by leveraging Re-
silient Distributed Datasets (RDDs) [36], a recently pro-
posed in-memory storage abstraction that provides fault
tolerance without resorting to replication or disk I/O. In-
stead, each RDD tracks the lineage graph of operations
used to build it, and can replay them to recompute lost
data. RDDs are an ideal fit for discretized streams, allow-
ing the execution of meaningful computations in tasks as
short as 50–200 ms. We show how to implement sev-
eral standard streaming operators using RDDs, including
stateful computation and incremental sliding windows,
and show that they can be run at sub-second latencies.
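To make the lineage idea above concrete, the following is a minimal sketch (in plain Python, with hypothetical class and method names, not Spark's actual API) of a dataset that records the operation used to build it and replays that operation to recompute a lost partition instead of keeping a replica:

```python
# Minimal sketch of lineage-based recovery. All names here are
# hypothetical and illustrative; this is not Spark's RDD API.

class SketchRDD:
    """An in-memory dataset that remembers how it was built."""

    def __init__(self, partitions, lineage=None):
        # partitions: list of lists of records
        # lineage: (parent_dataset, transformation) or None for a base dataset
        self.partitions = partitions
        self.lineage = lineage

    def map(self, fn):
        # Derive a new dataset and record the operation, not a replica.
        new_parts = [[fn(x) for x in p] for p in self.partitions]
        return SketchRDD(new_parts, lineage=(self, fn))

    def recompute(self, i):
        # Rebuild a lost partition by replaying the lineage graph.
        parent, fn = self.lineage
        return [fn(x) for x in parent.partitions[i]]

base = SketchRDD([[1, 2], [3, 4]])
derived = base.map(lambda x: x * 10)
derived.partitions[1] = None                  # simulate losing one partition
derived.partitions[1] = derived.recompute(1)  # replay lineage to restore it
# derived.partitions is [[10, 20], [30, 40]] again
```

The key property is that fault tolerance costs nothing until a failure occurs: only the lineage metadata is kept, not a second copy of the data.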
The D-Stream model also provides significant advan-
tages in terms of fault recovery. While previous systems
relied on costly replication or upstream backup [16], the
batch model of D-Streams naturally enables a more ef-
ficient recovery mechanism: parallel recovery of a lost
node’s state. When a node fails, each node in the cluster
works to recompute part of the lost RDDs, resulting in
far faster recovery than upstream backup without the cost
of replication. Parallel recovery was hard to perform in
record-at-a-time systems due to the complex state main-
tenance protocols needed even for basic replication (e.g.,
Flux [29]),¹ but is simple in deterministic batch jobs [9].
In a similar way, D-Streams can recover from stragglers
(slow nodes), an even more common issue in large clusters,
using speculative execution [9], whereas traditional
streaming systems do not handle them.
We have implemented D-Streams in Spark Streaming,
an extension to the Spark cluster computing engine [36].
The system can process over 60 million records/second
on 100 nodes at sub-second latency, and can recover from
faults and stragglers in less than a second. It outperforms
widely used open source streaming systems by up to 5×
in throughput while offering recovery and consistency
guarantees that they lack. Apart from its performance,
we illustrate Spark Streaming’s expressiveness through
ports of two applications: a video distribution monitor-
ing system and an online machine learning algorithm.
More importantly, because D-Streams use the same
processing model and data structures (RDDs) as batch
jobs, Spark Streaming interoperates seamlessly with
Spark’s batch and interactive processing features. This
is a powerful feature in practice, letting users run ad-hoc
queries on arriving streams, or combine streams with his-
torical data, from the same high-level API. We sketch
how we are using this feature in applications to blur the
line between streaming and offline processing.
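Because each streaming interval is just an ordinary dataset, the same query code can run over a live micro-batch or over stored historical data. The sketch below illustrates this in plain Python (the function and variable names are hypothetical, and real Spark code would express the query over RDDs rather than lists):

```python
# Illustrative sketch: one query function shared between streaming
# and offline processing. Names and data are hypothetical.

def top_page(views):
    # Count page views and return the most-viewed page.
    counts = {}
    for page in views:
        counts[page] = counts.get(page, 0) + 1
    return max(counts, key=counts.get)

live_batch = ["a", "b", "a"]        # records from the latest interval
historical = ["b", "b", "c", "b"]   # records loaded from storage

top_live = top_page(live_batch)     # ad-hoc query on arriving data
top_hist = top_page(historical)     # same query on historical data
```

Sharing one code path and one data abstraction is what lets ad-hoc queries, historical joins, and streaming jobs coexist in a single API.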
¹The one parallel recovery algorithm we are aware of, by Hwang et
al. [17], only tolerates one node failure and cannot mitigate stragglers.
2 Goals and Background
Many important applications process large streams of
data arriving in real time. Our work targets applications
that need to run on tens to hundreds of machines, and tol-
erate a latency of several seconds. Some examples are:
• Site activity statistics: Facebook built a distributed
aggregation system called Puma that gives advertis-
ers statistics about users clicking their pages within
10–30 seconds and processes 10⁶ events/second [30].
• Spam detection: A social network such as Twitter
may wish to identify new spam campaigns in real
time by running statistical learning algorithms [34].
• Cluster monitoring: Datacenter operators often col-
lect and mine program logs to detect problems, using
systems like Flume [1] on hundreds of nodes [12].
• Network intrusion detection: A NIDS for a large
enterprise may need to correlate millions of events
per second to detect unusual activity.
For these applications, we believe that the 0.5–2 sec-
ond latency of D-Streams is adequate, as it is well be-
low the timescale of the trends monitored, and that the
efficiency benefits of D-Streams (fast recovery without
replication) far outweigh their latency cost. We purposely
do not target applications with latency needs below a few
hundred milliseconds, such as high-frequency trading.
Apart from offering second-scale latency, our goal is
to design a system that is both fault-tolerant (recovers
quickly from faults and stragglers) and efficient (does not
consume significant hardware resources beyond those
needed for basic processing). Fault tolerance is critical
at the scales we target, where failures and stragglers are
endemic [9]. In addition, recovery needs to be fast: due to
the time-sensitivity of streaming applications, we wish to
recover from faults within seconds. Efficiency is also cru-
cial because of the scale. For example, a design requiring
replication of each processing node would be expensive
for an application running on hundreds of nodes.
2.1 Previous Streaming Systems
Although there has been a wide array of work on dis-
tributed stream processing, most previous systems em-
ploy the same record-at-a-time processing model. In this
model, streaming computations are divided into a set of
long-lived stateful operators, and each operator processes
records as they arrive by updating internal state (e.g., a table
tracking page view counts over a window) and sending new
records in response [7]. Figure 1(a) illustrates this model.
While record-at-a-time processing minimizes latency,
the stateful nature of operators, combined with nondeter-
minism that arises from record interleaving on the net-
work, makes it hard to provide fault tolerance efficiently.
We sketch this problem before presenting our approach.
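The record-at-a-time model described above can be sketched as follows (a hedged illustration with hypothetical names, not any real system's API): a long-lived operator mutates internal state on every arriving record and emits an update downstream.

```python
# Sketch of a record-at-a-time stateful operator, as in the page-view
# example above. Class and field names are hypothetical.

class PageViewCounter:
    def __init__(self):
        self.counts = {}  # mutable internal state: page -> view count

    def process(self, record):
        # Update state for each arriving record and emit a new record.
        page = record["page"]
        self.counts[page] = self.counts.get(page, 0) + 1
        return {"page": page, "count": self.counts[page]}

op = PageViewCounter()
out = [op.process({"page": p}) for p in ["a", "b", "a"]]
# The operator's state depends on the exact order in which records
# arrive over the network, which is the nondeterminism that makes
# replication-free fault tolerance hard in this model.
```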