This is because, to achieve resilience, jobs write their
outputs to replicated, on-disk storage systems, leading to
costly disk I/O and data replication across the network.
Our key insight is that it is possible to achieve sub-
second latencies in a batch system by leveraging Re-
silient Distributed Datasets (RDDs) [36], a recently pro-
posed in-memory storage abstraction that provides fault
tolerance without resorting to replication or disk I/O. In-
stead, each RDD tracks the lineage graph of operations
used to build it, and can replay them to recompute lost
data. RDDs are an ideal fit for discretized streams, allow-
ing the execution of meaningful computations in tasks as
short as 50–200 ms. We show how to implement sev-
eral standard streaming operators using RDDs, including
stateful computation and incremental sliding windows,
and show that they can be run at sub-second latencies.
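To make the lineage idea above concrete, the following is a minimal sketch (in plain Python, with hypothetical class and method names, not Spark's actual API) of a dataset that records the operation used to build it and replays that operation to recompute a lost partition instead of keeping a replica:

```python
# Minimal sketch of lineage-based recovery. All names here are
# hypothetical and illustrative; this is not Spark's RDD API.

class SketchRDD:
    """An in-memory dataset that remembers how it was built."""

    def __init__(self, partitions, lineage=None):
        # partitions: list of lists of records
        # lineage: (parent_dataset, transformation) or None for a base dataset
        self.partitions = partitions
        self.lineage = lineage

    def map(self, fn):
        # Derive a new dataset and record the operation, not a replica.
        new_parts = [[fn(x) for x in p] for p in self.partitions]
        return SketchRDD(new_parts, lineage=(self, fn))

    def recompute(self, i):
        # Rebuild a lost partition by replaying the lineage graph.
        parent, fn = self.lineage
        return [fn(x) for x in parent.partitions[i]]

base = SketchRDD([[1, 2], [3, 4]])
derived = base.map(lambda x: x * 10)
derived.partitions[1] = None                  # simulate losing one partition
derived.partitions[1] = derived.recompute(1)  # replay lineage to restore it
# derived.partitions is [[10, 20], [30, 40]] again
```

The key property is that fault tolerance costs nothing until a failure occurs: only the lineage metadata is kept, not a second copy of the data.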
The D-Stream model also provides significant advan-
tages in terms of fault recovery. While previous systems
relied on costly replication or upstream backup [16], the
batch model of D-Streams naturally enables a more ef-
ficient recovery mechanism: parallel recovery of a lost
node’s state. When a node fails, each node in the cluster
works to recompute part of the lost RDDs, resulting in
far faster recovery than upstream backup without the cost
of replication. Parallel recovery was hard to perform in
record-at-a-time systems due to the complex state main-
tenance protocols needed even for basic replication (e.g.,
Flux [29]),¹ but is simple in deterministic batch jobs [9].
In a similar way, D-Streams can recover from stragglers
(slow nodes), an even more common issue in large clusters,
using speculative execution [9], whereas traditional
streaming systems do not handle them.
We have implemented D-Streams in Spark Streaming,
an extension to the Spark cluster computing engine [36].
The system can process over 60 million records/second
on 100 nodes at sub-second latency, and can recover from
faults and stragglers in less than a second. It outperforms
widely used open source streaming systems by up to 5×
in throughput while offering recovery and consistency
guarantees that they lack. Apart from its performance,
we illustrate Spark Streaming’s expressiveness through
ports of two applications: a video distribution monitor-
ing system and an online machine learning algorithm.
More importantly, because D-Streams use the same
processing model and data structures (RDDs) as batch
jobs, Spark Streaming interoperates seamlessly with
Spark’s batch and interactive processing features. This
is a powerful feature in practice, letting users run ad-hoc
queries on arriving streams, or combine streams with his-
torical data, from the same high-level API. We sketch
how we are using this feature in applications to blur the
line between streaming and offline processing.
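Because each streaming interval is just an ordinary dataset, the same query code can run over a live micro-batch or over stored historical data. The sketch below illustrates this in plain Python (the function and variable names are hypothetical, and real Spark code would express the query over RDDs rather than lists):

```python
# Illustrative sketch: one query function shared between streaming
# and offline processing. Names and data are hypothetical.

def top_page(views):
    # Count page views and return the most-viewed page.
    counts = {}
    for page in views:
        counts[page] = counts.get(page, 0) + 1
    return max(counts, key=counts.get)

live_batch = ["a", "b", "a"]        # records from the latest interval
historical = ["b", "b", "c", "b"]   # records loaded from storage

top_live = top_page(live_batch)     # ad-hoc query on arriving data
top_hist = top_page(historical)     # same query on historical data
```

Sharing one code path and one data abstraction is what lets ad-hoc queries, historical joins, and streaming jobs coexist in a single API.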
¹The one parallel recovery algorithm we are aware of, by Hwang et
al. [17], only tolerates one node failure and cannot mitigate stragglers.
2 Goals and Background
Many important applications process large streams of
data arriving in real time. Our work targets applications
that need to run on tens to hundreds of machines, and tol-
erate a latency of several seconds. Some examples are:
• Site activity statistics: Facebook built a distributed
aggregation system called Puma that gives advertis-
ers statistics about users clicking their pages within
10–30 seconds and processes 10⁶ events/second [30].
• Spam detection: A social network such as Twitter
may wish to identify new spam campaigns in real
time by running statistical learning algorithms [34].
• Cluster monitoring: Datacenter operators often col-
lect and mine program logs to detect problems, using
systems like Flume [1] on hundreds of nodes [12].
• Network intrusion detection: A NIDS for a large
enterprise may need to correlate millions of events
per second to detect unusual activity.
For these applications, we believe that the 0.5–2 sec-
ond latency of D-Streams is adequate, as it is well be-
low the timescale of the trends monitored, and that the
efficiency benefits of D-Streams (fast recovery without
replication) far outweigh their latency cost. We purposely
do not target applications with latency needs below a few
hundred milliseconds, such as high-frequency trading.
Apart from offering second-scale latency, our goal is
to design a system that is both fault-tolerant (recovers
quickly from faults and stragglers) and efficient (does not
consume significant hardware resources beyond those
needed for basic processing). Fault tolerance is critical
at the scales we target, where failures and stragglers are
endemic [9]. In addition, recovery needs to be fast: due to
the time-sensitivity of streaming applications, we wish to
recover from faults within seconds. Efficiency is also cru-
cial because of the scale. For example, a design requiring
replication of each processing node would be expensive
for an application running on hundreds of nodes.
2.1 Previous Streaming Systems
Although there has been a wide array of work on dis-
tributed stream processing, most previous systems em-
ploy the same record-at-a-time processing model. In this
model, streaming computations are divided into a set of
long-lived stateful operators, and each operator processes
records as they arrive by updating internal state (e.g., a table
tracking page view counts over a window) and sending new
records in response [7]. Figure 1(a) illustrates this model.
While record-at-a-time processing minimizes latency,
the stateful nature of operators, combined with nondeter-
minism that arises from record interleaving on the net-
work, makes it hard to provide fault tolerance efficiently.
We sketch this problem before presenting our approach.
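The record-at-a-time model described above can be sketched as follows (a hedged illustration with hypothetical names, not any real system's API): a long-lived operator mutates internal state on every arriving record and emits an update downstream.

```python
# Sketch of a record-at-a-time stateful operator, as in the page-view
# example above. Class and field names are hypothetical.

class PageViewCounter:
    def __init__(self):
        self.counts = {}  # mutable internal state: page -> view count

    def process(self, record):
        # Update state for each arriving record and emit a new record.
        page = record["page"]
        self.counts[page] = self.counts.get(page, 0) + 1
        return {"page": page, "count": self.counts[page]}

op = PageViewCounter()
out = [op.process({"page": p}) for p in ["a", "b", "a"]]
# The operator's state depends on the exact order in which records
# arrive over the network, which is the nondeterminism that makes
# replication-free fault tolerance hard in this model.
```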