Discretized Streams: An Efficient and Fault-Tolerant Model for
Stream Processing on Large Clusters
Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, Ion Stoica
University of California, Berkeley
Abstract
Many important “big data” applications need to process data arriving in real time. However, current programming models for distributed stream processing are relatively low-level, often leaving the user to worry about consistency of state across the system and fault recovery. Furthermore, the models that provide fault recovery do so in an expensive manner, requiring either hot replication or long recovery times. We propose a new programming model, discretized streams (D-Streams), that offers a high-level functional programming API, strong consistency, and efficient fault recovery. D-Streams support a new recovery mechanism that improves efficiency over the traditional replication and upstream backup solutions in streaming databases: parallel recovery of lost state across the cluster. We have prototyped D-Streams in an extension to the Spark cluster computing framework called Spark Streaming, which lets users seamlessly intermix streaming, batch and interactive queries.
1 Introduction
Much of “big data” is received in real time, and is most valuable at its time of arrival. For example, a social network may want to identify trending conversation topics within minutes, an ad provider may want to train a model of which users click a new ad, and a service operator may want to mine log files to detect failures within seconds.
To handle the volumes of data and computation they involve, these applications need to be distributed over clusters. However, despite substantial work on cluster programming models for batch computation [6, 22], there are few similarly high-level tools for stream processing. Most current distributed stream processing systems, including Yahoo!’s S4 [19], Twitter’s Storm [21], and streaming databases [2, 3, 4], are based on a record-at-a-time processing model, where nodes receive each record, update internal state, and send out new records in response. This model raises several challenges in a large-scale cloud environment:
• Fault tolerance: Record-at-a-time systems provide recovery through either replication, where there are two copies of each processing node, or upstream backup, where nodes buffer sent messages and replay them to a second copy of a failed downstream node. Neither approach is attractive in large clusters: replication needs 2× the hardware and may not work if two nodes fail, while upstream backup takes a long time to recover, as the entire system must wait for the standby node to recover the failed node’s state.
• Consistency: Depending on the system, it can be hard to reason about the global state, because different nodes may be processing data that arrived at different times. For example, suppose that a system counts page views from male users on one node and from female users on another. If one of these nodes is backlogged, the ratio of their counters will be wrong.
• Unification with batch processing: Because the interface of streaming systems is event-driven, it is quite different from the APIs of batch systems, so users have to write two versions of each analytics task. In addition, it is difficult to combine streaming data with historical data, e.g., join a stream of events against historical data to make a decision.
In this work, we present a new programming model, discretized streams (D-Streams), that overcomes these challenges. The key idea behind D-Streams is to treat a streaming computation as a series of deterministic batch computations on small time intervals. For example, we might place the data received each second into a new interval, and run a MapReduce operation on each interval to compute a count. Similarly, we can perform a running count over several intervals by adding the new counts from each interval to the old result. Two immediate advantages of the D-Stream model are that consistency is well-defined (each record is processed atomically with the interval in which it arrives), and that the processing model is easy to unify with batch systems. In addition, as we shall show, we can use recovery mechanisms similar to those of batch systems, albeit at a much smaller timescale, to mitigate failures more efficiently than existing streaming systems, i.e., recover data faster at a lower cost.
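To make this concrete, the following is a minimal sketch of the per-interval count and the running count, written against the Scala API that Spark Streaming exposes; the socket source, port, checkpoint path, and one-second batch size are illustrative assumptions, not details fixed by this paper.

  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}

  object IntervalCounts {
    def main(args: Array[String]): Unit = {
      val conf = new SparkConf().setAppName("IntervalCounts").setMaster("local[2]")
      // A 1-second batch interval: each second of input becomes one small,
      // deterministic batch job, as in the D-Stream model.
      val ssc = new StreamingContext(conf, Seconds(1))
      ssc.checkpoint("/tmp/dstream-checkpoints")  // required by stateful operators

      // Illustrative input source: words arriving as text lines on a socket.
      val words = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
      val ones  = words.map(w => (w, 1))

      // Per-interval count: an independent MapReduce-style job on each interval.
      val perIntervalCounts = ones.reduceByKey(_ + _)

      // Running count: fold each new interval's pairs into the previous totals.
      val runningCounts = ones.updateStateByKey[Int] {
        (newOnes: Seq[Int], oldCount: Option[Int]) =>
          Some(oldCount.getOrElse(0) + newOnes.sum)
      }

      perIntervalCounts.print()
      runningCounts.print()
      ssc.start()
      ssc.awaitTermination()
    }
  }

Because each interval’s computation is deterministic and its inputs are known, lost state can be recomputed from those inputs, which is what enables the parallel recovery mechanism described above.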
There are two key challenges in realizing the D-Stream model. The first is making the latency (interval granularity) low. Traditional batch systems like Hadoop and Dryad fall short here because they keep state on disk between jobs and take tens of seconds to run each job.