S4: Distributed Stream Computing Platform
Leonardo Neumeyer
Yahoo! Labs
Santa Clara, CA
neumeyer@yahoo-inc.com
Bruce Robbins
Yahoo! Labs
Santa Clara, CA
robbins@yahoo-inc.com
Anish Nair
Yahoo! Labs
Santa Clara, CA
anishn@yahoo-inc.com
Anand Kesari
Yahoo! Labs
Santa Clara, CA
anands@yahoo-inc.com
Abstract—S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. Keyed data events are routed with affinity to Processing Elements (PEs), which consume the events and do one or both of the following: (1) emit one or more events which may be consumed by other PEs, (2) publish results. The architecture resembles the Actors model [1], providing semantics of encapsulation and location transparency, thus allowing applications to be massively concurrent while exposing a simple programming interface to application developers. In this paper, we outline the S4 architecture in detail and describe various applications, including real-life deployments. Our design is primarily driven by large-scale applications for data mining and machine learning in a production environment. We show that the S4 design is surprisingly flexible and lends itself to deployment on large clusters built with commodity hardware.
Keywords-actors programming model; complex event processing; concurrent programming; data processing; distributed programming; map-reduce; middleware; parallel programming; real-time search; software design; stream computing
I. INTRODUCTION
S4 (Simple Scalable Streaming System) is a distributed
stream processing engine inspired by the MapReduce model.
We designed this engine to solve real-world problems in
the context of search applications that use data mining and
machine learning algorithms. Current commercial search
engines, such as Google, Bing, and Yahoo!, typically provide organic web results in response to user queries and then supplement them with textual advertisements that generate revenue based on a “cost-per-click” billing model [2]. To render
the most relevant ads in an optimal position on the page,
scientists develop algorithms that dynamically estimate the
probability of a click on the ad given the context. The context
may include user preferences, geographic location, prior
queries, prior clicks, etc. A major search engine may process thousands of queries per second, and each results page may include several ads. To process this stream of user feedback, we developed S4, a low-latency, scalable stream processing engine.
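To make the keyed processing model concrete before describing the architecture, the following Java sketch shows a processing element that maintains a per-key click-through-rate estimate. This is not the S4 API; the class CtrEstimatorPE, its methods, and the smoothing constants are illustrative assumptions only.

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a keyed processing element (PE) that keeps
// a running click-through-rate (CTR) estimate for one key, e.g. a
// (query, ad) pair. Names and structure are illustrative, not S4's API.
public class CtrEstimatorPE {
    private long impressions;
    private long clicks;

    // All events for one key reach this one instance, so the
    // counters above are implicitly scoped to that key.
    public void onEvent(boolean clicked) {
        impressions++;
        if (clicked) clicks++;
    }

    public double estimate() {
        // Simple smoothed estimate; a production model would
        // condition on the full context (user, location, etc.).
        return (clicks + 1.0) / (impressions + 2.0);
    }

    public static void main(String[] args) {
        // Keyed routing sketch: events with the same key are
        // dispatched to the same PE instance.
        Map<String, CtrEstimatorPE> pesByKey = new HashMap<>();
        String key = "query=shoes|ad=1234";
        CtrEstimatorPE pe = pesByKey.computeIfAbsent(key, k -> new CtrEstimatorPE());
        pe.onEvent(true);
        pe.onEvent(false);
        System.out.printf("CTR estimate for %s: %.3f%n", key, pe.estimate());
    }
}

In S4's model, events with the same key are routed with affinity to the same PE instance, so per-key state such as the counters above can be kept locally without coordination.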
To facilitate experimentation with online algorithms, we
envisioned an architecture that could be suitable for both
research and production environments. The main require-
ment for research is to have a high degree of flexibility
to deploy algorithms to the field very quickly. This makes
it possible to test online algorithms using live traffic with
minimal overhead and support. The main requirements for a
production environment are scalability (ability to add more
servers to increase throughput with minimal effort) and high
availability (ability to achieve continuous operation with no
human intervention in the presence of system failures). We
considered extending the open source Hadoop platform to support computation over unbounded streams, but we quickly realized that Hadoop is highly optimized for batch processing. MapReduce systems typically operate on static data by scheduling batch jobs. In stream computing, by contrast, events flow into the system at a data rate over which we have no control. The processing system must keep up with the event rate or degrade gracefully by eliminating events; this is typically called load shedding. The streaming paradigm
dictates a very different architecture than the one used in
batch processing. Attempting to build a general-purpose
platform for both batch and stream computing would result
in a highly complex system that may end up not being
optimal for either task. An example of a MapReduce online
architecture built as an extension of Hadoop can be found
in [3].
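As a minimal illustration of load shedding, and not a description of S4's actual mechanism, the Java sketch below drops events when a bounded buffer is full, so the consumer proceeds at its own pace while the uncontrolled input rate is absorbed by dropping; all names are hypothetical.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical receiver that sheds load by dropping events when its
// bounded buffer is full, so the consumer never falls behind an
// input rate it does not control.
public class SheddingReceiver<E> {
    private final BlockingQueue<E> buffer;
    private final AtomicLong dropped = new AtomicLong();

    public SheddingReceiver(int capacity) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
    }

    /** Called at the uncontrolled input rate; never blocks. */
    public void onEvent(E event) {
        if (!buffer.offer(event)) {      // queue full: shed the event
            dropped.incrementAndGet();   // track how much load was shed
        }
    }

    /** Called by the processing thread at its own pace. */
    public E take() throws InterruptedException {
        return buffer.take();
    }

    public long droppedCount() {
        return dropped.get();
    }
}

The design choice here is to shed at the entry point and count what was dropped, which keeps latency bounded for the events that are admitted.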
The MapReduce programming model makes it possible to
easily parallelize a number of common batch data processing
tasks and operate in large clusters without worrying about
system issues like failover management [4]. With the surge
of open source projects such as Hadoop [5], adoption of
the MapReduce programming model has accelerated and is
moving from the research labs into real-world applications
as diverse as web search, fraud detection, and online dating.
Despite these advances, there is no similar trend for general-
purpose distributed stream computing software. There are
various projects and commercial engines ([6], [7], [8], [9],
[10]), but their use is still restricted to highly specialized
applications. Amini et al. [7] provide a review of the various systems.
The emergence of new applications such as real-time search, high-frequency trading, and social networks is pushing the limits of what can be accomplished with traditional
data processing systems [11]. There is a clear need for
highly scalable stream computing solutions that can operate
at high data rates and process massive amounts of data.
For example, to personalize search advertising, we need to