S4: Distributed Stream Computing Platform
Leonardo Neumeyer
Yahoo! Labs
Santa Clara, CA
neumeyer@yahoo-inc.com
Bruce Robbins
Yahoo! Labs
Santa Clara, CA
robbins@yahoo-inc.com
Anish Nair
Yahoo! Labs
Santa Clara, CA
anishn@yahoo-inc.com
Anand Kesari
Yahoo! Labs
Santa Clara, CA
anands@yahoo-inc.com
Abstract—S4 is a general-purpose, distributed, scalable, partially fault-tolerant, pluggable platform that allows programmers to easily develop applications for processing continuous unbounded streams of data. Keyed data events are routed with affinity to Processing Elements (PEs), which consume the events and do one or both of the following: (1) emit one or more events which may be consumed by other PEs, (2) publish results. The architecture resembles the Actors model [1], providing semantics of encapsulation and location transparency, thus allowing applications to be massively concurrent while exposing a simple programming interface to application developers. In this paper, we outline the S4 architecture in detail and describe various applications, including real-life deployments. Our design is primarily driven by large-scale applications for data mining and machine learning in a production environment. We show that the S4 design is surprisingly flexible and lends itself to deployment on large clusters built with commodity hardware.
Keywords-actors programming model; complex event processing; concurrent programming; data processing; distributed programming; map-reduce; middleware; parallel programming; real-time search; software design; stream computing
I. INTRODUCTION
S4 (Simple Scalable Streaming System) is a distributed
stream processing engine inspired by the MapReduce model.
We designed this engine to solve real-world problems in
the context of search applications that use data mining and
machine learning algorithms. Current commercial search
engines, such as Google, Bing, and Yahoo!, typically provide organic web results in response to user queries and then supplement them with textual advertisements that generate revenue based on a “cost-per-click” billing model [2]. To render
the most relevant ads in an optimal position on the page,
scientists develop algorithms that dynamically estimate the
probability of a click on the ad given the context. The context
may include user preferences, geographic location, prior
queries, prior clicks, etc. A major search engine may process thousands of queries per second, and each results page may include several ads. To process this stream of user feedback, we developed S4, a low-latency, scalable stream processing engine.
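To make the keyed processing model concrete before describing the architecture, the following Java sketch shows a processing element that maintains a per-key click-through-rate estimate. This is not the S4 API; the class CtrEstimatorPE, its methods, and the smoothing constants are illustrative assumptions only.

import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a keyed processing element (PE) that keeps
// a running click-through-rate (CTR) estimate for one key, e.g. a
// (query, ad) pair. Names and structure are illustrative, not S4's API.
public class CtrEstimatorPE {
    private long impressions;
    private long clicks;

    // All events for one key reach this one instance, so the
    // counters above are implicitly scoped to that key.
    public void onEvent(boolean clicked) {
        impressions++;
        if (clicked) clicks++;
    }

    public double estimate() {
        // Simple smoothed estimate; a production model would
        // condition on the full context (user, location, etc.).
        return (clicks + 1.0) / (impressions + 2.0);
    }

    public static void main(String[] args) {
        // Keyed routing sketch: events with the same key are
        // dispatched to the same PE instance.
        Map<String, CtrEstimatorPE> pesByKey = new HashMap<>();
        String key = "query=shoes|ad=1234";
        CtrEstimatorPE pe = pesByKey.computeIfAbsent(key, k -> new CtrEstimatorPE());
        pe.onEvent(true);
        pe.onEvent(false);
        System.out.printf("CTR estimate for %s: %.3f%n", key, pe.estimate());
    }
}

In S4's model, events with the same key are routed with affinity to the same PE instance, so per-key state such as the counters above can be kept locally without coordination.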
To facilitate experimentation with online algorithms, we
envisioned an architecture that could be suitable for both
research and production environments. The main require-
ment for research is to have a high degree of flexibility
to deploy algorithms to the field very quickly. This makes
it possible to test online algorithms using live traffic with
minimal overhead and support. The main requirements for a
production environment are scalability (ability to add more
servers to increase throughput with minimal effort) and high
availability (ability to achieve continuous operation with no
human intervention in the presence of system failures). We
considered extending the open source Hadoop platform to support computation over unbounded streams, but we quickly realized that Hadoop is highly optimized for batch processing. MapReduce systems typically operate on static data by scheduling batch jobs. In stream computing, by contrast, events flow into the system at a data rate over which we have no control. The processing system must keep up with the event rate or degrade gracefully by eliminating events; this is typically called load shedding. The streaming paradigm
dictates a very different architecture than the one used in
batch processing. Attempting to build a general-purpose
platform for both batch and stream computing would result
in a highly complex system that may end up not being
optimal for either task. An example of a MapReduce online
architecture built as an extension of Hadoop can be found
in [3].
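As a minimal illustration of load shedding, and not a description of S4's actual mechanism, the Java sketch below drops events when a bounded buffer is full, so the consumer proceeds at its own pace while the uncontrolled input rate is absorbed by dropping; all names are hypothetical.

import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical receiver that sheds load by dropping events when its
// bounded buffer is full, so the consumer never falls behind an
// input rate it does not control.
public class SheddingReceiver<E> {
    private final BlockingQueue<E> buffer;
    private final AtomicLong dropped = new AtomicLong();

    public SheddingReceiver(int capacity) {
        this.buffer = new ArrayBlockingQueue<>(capacity);
    }

    /** Called at the uncontrolled input rate; never blocks. */
    public void onEvent(E event) {
        if (!buffer.offer(event)) {      // queue full: shed the event
            dropped.incrementAndGet();   // track how much load was shed
        }
    }

    /** Called by the processing thread at its own pace. */
    public E take() throws InterruptedException {
        return buffer.take();
    }

    public long droppedCount() {
        return dropped.get();
    }
}

The design choice here is to shed at the entry point and count what was dropped, which keeps latency bounded for the events that are admitted.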
The MapReduce programming model makes it possible to
easily parallelize a number of common batch data processing
tasks and operate in large clusters without worrying about
system issues like failover management [4]. With the surge
of open source projects such as Hadoop [5], adoption of
the MapReduce programming model has accelerated and is
moving from the research labs into real-world applications
as diverse as web search, fraud detection, and online dating.
Despite these advances, there is no similar trend for general-
purpose distributed stream computing software. There are
various projects and commercial engines ([6], [7], [8], [9],
[10]), but their use is still restricted to highly specialized
applications. Amini et al. [7] provide a review of the various systems.
The emergence of new applications such as real-time search, high-frequency trading, and social networks is pushing the limits of what can be accomplished with traditional
data processing systems [11]. There is a clear need for
highly scalable stream computing solutions that can operate
at high data rates and process massive amounts of data.
For example, to personalize search advertising, we need to