Discretized Streams: An Efficient and Fault-Tolerant Model for
Stream Processing on Large Clusters
Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica
University of California, Berkeley
Abstract
Many important “big data” applications need to process data arriving in real time. However, current programming models for distributed stream processing are relatively low-level, often leaving the user to worry about consistency of state across the system and fault recovery. Furthermore, the models that provide fault recovery do so in an expensive manner, requiring either hot replication or long recovery times. We propose a new programming model, discretized streams (D-Streams), that offers a high-level functional programming API, strong consistency, and efficient fault recovery. D-Streams support a new recovery mechanism that improves efficiency over the traditional replication and upstream backup solutions in streaming databases: parallel recovery of lost state across the cluster. We have prototyped D-Streams in an extension to the Spark cluster computing framework called Spark Streaming, which lets users seamlessly intermix streaming, batch and interactive queries.
1 Introduction
Much of “big data” is received in real time, and is most valuable at its time of arrival. For example, a social network may want to identify trending conversation topics within minutes, an ad provider may want to train a model of which users click a new ad, and a service operator may want to mine log files to detect failures within seconds.
To handle the volumes of data and computation they involve, these applications need to be distributed over clusters. However, despite substantial work on cluster programming models for batch computation [6, 22], there are few similarly high-level tools for stream processing. Most current distributed stream processing systems, including Yahoo!’s S4 [19], Twitter’s Storm [21], and streaming databases [2, 3, 4], are based on a record-at-a-time processing model, where nodes receive each record, update internal state, and send out new records in response. This model raises several challenges in a large-scale cloud environment:
• Fault tolerance: Record-at-a-time systems provide recovery through either replication, where there are two copies of each processing node, or upstream backup, where nodes buffer sent messages and replay them to a second copy of a failed downstream node. Neither approach is attractive in large clusters: replication needs 2× the hardware and may not work if two nodes fail, while upstream backup takes a long time to recover, as the entire system must wait for the standby node to recover the failed node’s state.
• Consistency: Depending on the system, it can be hard to reason about the global state, because different nodes may be processing data that arrived at different times. For example, suppose that a system counts page views from male users on one node and from female users on another. If one of these nodes is backlogged, the ratio of their counters will be wrong.
• Unification with batch processing: Because the interface of streaming systems is event-driven, it is quite different from the APIs of batch systems, so users have to write two versions of each analytics task. In addition, it is difficult to combine streaming data with historical data, e.g., join a stream of events against historical data to make a decision.
In this work, we present a new programming model, discretized streams (D-Streams), that overcomes these challenges. The key idea behind D-Streams is to treat a streaming computation as a series of deterministic batch computations on small time intervals. For example, we might place the data received each second into a new interval, and run a MapReduce operation on each interval to compute a count. Similarly, we can perform a running count over several intervals by adding the new counts from each interval to the old result. Two immediate advantages of the D-Stream model are that consistency is well-defined (each record is processed atomically with the interval in which it arrives), and that the processing model is easy to unify with batch systems. In addition, as we shall show, we can use recovery mechanisms similar to those of batch systems, albeit at a much smaller timescale, to mitigate failures more efficiently than existing streaming systems, i.e., recover data faster at a lower cost.
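To make this concrete, the following is a minimal sketch of the per-interval count and the running count, written against the Scala API that Spark Streaming exposes; the socket source, port, checkpoint path, and one-second batch size are illustrative assumptions, not details fixed by this paper.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  object IntervalCounts {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("IntervalCounts").setMaster("local[2]")
      // A 1-second batch interval: each second of input becomes one small,
      // deterministic batch job, as in the D-Stream model.
      val ssc = new StreamingContext(conf, Seconds(1))
      ssc.checkpoint("/tmp/dstream-checkpoints")  // required by stateful operators

      // Illustrative input source: words arriving as text lines on a socket.
      val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
      val ones  = words.map(w => (w, 1))

      // Per-interval count: an independent MapReduce-style job on each interval.
      val perIntervalCounts = ones.reduceByKey(_ + _)

      // Running count: fold each new interval's pairs into the previous totals.
      val runningCounts = ones.updateStateByKey[Int] {
        (newOnes: Seq[Int], oldCount: Option[Int]) =>
          Some(oldCount.getOrElse(0) + newOnes.sum)
      }

      perIntervalCounts.print()
      runningCounts.print()
      ssc.start()
      ssc.awaitTermination()
    }
  }

Because each interval’s computation is deterministic and its inputs are known, lost state can be recomputed from those inputs, which is what enables the parallel recovery mechanism described above.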
There are two key challenges in realizing the D-Stream model. The first is making the latency (interval granularity) low. Traditional batch systems like Hadoop and Dryad fall short here because they keep state on disk between jobs and take tens of seconds to run each job.