Apache Flink: Real-Time Stream Processing and Beyond

Updated 2024-07-18 · PDF, 3.62 MB
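
Apache Flink is an open source stream processing framework built on a streaming model and well suited to real-time processing of unbounded datasets. It provides a converged platform for building breakthrough real-time applications quickly and easily, so that data is available to stream processing as soon as it arrives. At global Internet of Things (IoT) scale, the platform is described as replicating millions of messages per second. Readers can learn more through free training such as MapR's Learn Streaming course, and through the book Introduction to Apache Flink by Ellen Friedman and Kostas Tzoumas, which covers Flink's stream processing technology and its potential beyond real-time applications.

Apache Flink is a distributed stream processing engine designed for continuous data streams. It also supports batch processing, which gives developers a single data processing model. Its core features include:

1. **Stream processing model**: Flink's DataStream API works with two kinds of streams, unbounded streams (which never end) and bounded streams (which are finite). This lets Flink process both real-time and historical data.
2. **Event-time processing**: Flink can process events according to the time at which they occurred, which keeps results accurate when data arrives late or out of order.
3. **State management and fault tolerance**: checkpoints and savepoints provide efficient state management and failure recovery, giving consistent data and exactly-once processing semantics for state.
4. **Low latency and high throughput**: low latency is one of Flink's design goals, and it sustains high-throughput streams at the same time, which makes it competitive for real-time analytics.
5. **Rich operators and connectors**: Flink ships operators such as windowing and stateful operations, along with connectors for many data sources and sinks, which simplifies integration with other systems.
6. **Unified batch and stream processing**: Flink treats batch processing as a special case of stream processing behind a unified API, which simplifies development and maintenance.
7. **Memory optimization and parallel execution**: Flink uses efficient memory management and a distributed, parallel execution model to make good use of multicore CPUs and large clusters.
8. **Hadoop integration**: Flink integrates with Hadoop ecosystem components such as HDFS and YARN, so existing Hadoop deployments can be extended or migrated.
9. **Real-time queries**: Flink SQL and the Table API provide a SQL interface for querying streams, so business analysts can analyze streaming data directly.
10. **Global deployment**: the description states that Flink supports data replication and processing across regions for large, distributed IoT applications. In practice, cross-region replication is usually handled by the messaging or storage layer that Flink reads from rather than by Flink itself.

In short, Apache Flink combines real-time stream processing with batch processing and is an important tool for building real-time analytics and decision systems. Developers who learn Flink can build more efficient and reliable real-time applications for modern large-scale data processing.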

exactly-once guarantees for maintaining accurate state, and even the guarantees that Storm could provide came at a high overhead.

Overview of Lambda Architecture: Advantages and Limitations

The need for affordable scale drove people to distributed file systems such as HDFS and batch-based computing (MapReduce jobs). But that approach made it difficult to deliver low-latency insights. Development of real-time stream processing technology with Apache Storm helped address the latency issue, but not as a complete solution. For one thing, Storm did not guarantee state consistency with exactly-once processing and did not handle event-time processing. People who had these needs were forced to implement these features in their application code.
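
To make event-time processing concrete, here is a minimal sketch of what implementing it by hand in application code might look like. It is plain Python, not Storm or Flink code; the function name, the `(event_time, key)` tuple layout, and the sample data are assumptions made for illustration. Events are assigned to windows by the timestamp they carry, not by the order in which they arrive. A production system would also need a rule (such as watermarks) for deciding when a window is complete, which this sketch omits.

```python
from collections import defaultdict


def tumbling_event_time_counts(events, window_size):
    """Count events per (key, window start) using each event's own timestamp.

    `events` is an iterable of (event_time_seconds, key) pairs in arrival
    order, which may differ from event-time order.
    """
    counts = defaultdict(int)
    for event_time, key in events:
        # The window is chosen from the event's timestamp, so late or
        # out-of-order arrivals still land in the window they belong to.
        window_start = (event_time // window_size) * window_size
        counts[(key, window_start)] += 1
    return dict(counts)


# Hypothetical arrivals: the event stamped 3 shows up after the one stamped 12.
arrivals = [(1, "sensor-a"), (12, "sensor-a"), (3, "sensor-a"), (14, "sensor-b")]
result = tumbling_event_time_counts(arrivals, window_size=10)
```

Even though the event stamped 3 arrives late, it is counted in the window starting at 0 together with the event stamped 1.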

A hybrid view of data analytics that mixed these approaches offered one way to deal with these challenges. This hybrid, called Lambda architecture, provided delayed but accurate results via batch MapReduce jobs and an in-the-moment preliminary view of new results via Storm’s processing.

The Lambda architecture is a helpful framework for building big data applications, but it is not sufficient. For example, with a Lambda system based on MapReduce and HDFS, there is a time window, in hours, when inaccuracies due to failures are visible. Lambda architectures need the same business logic to be coded twice, in two different programming APIs: once for the batch system and once for the streaming system. This leads to two codebases that represent the same business problem, but have different kinds of bugs. In practice, this is very difficult to maintain.
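
The duplication described above can be sketched in a few lines. This is a toy illustration in plain Python, not MapReduce or Storm code, and every name in it (`batch_view`, `SpeedLayer`, `query`, the sample data) is hypothetical. The same "sum per key" logic appears twice, once as a full recomputation and once as an incremental update, and a query has to merge the two views.

```python
def batch_view(events):
    """Batch layer: recompute totals over the full, immutable history."""
    totals = {}
    for key, amount in events:
        totals[key] = totals.get(key, 0) + amount
    return totals


class SpeedLayer:
    """Streaming layer: the same sum, re-implemented incrementally."""

    def __init__(self):
        self.totals = {}

    def on_event(self, key, amount):
        self.totals[key] = self.totals.get(key, 0) + amount


def query(key, batch_totals, speed_totals):
    """Serving step: merge the accurate-but-stale view with the recent one."""
    return batch_totals.get(key, 0) + speed_totals.get(key, 0)


history = [("alice", 5), ("bob", 2), ("alice", 1)]   # already batch-processed
recent = [("alice", 4), ("carol", 7)]                # not yet in a batch run

batch_totals = batch_view(history)
speed = SpeedLayer()
for key, amount in recent:
    speed.on_event(key, amount)

answer = query("alice", batch_totals, speed.totals)
```

Any change to the business rule, for example ignoring negative amounts, has to be made in both `batch_view` and `SpeedLayer.on_event`, which is where the two codebases can drift apart.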

To compute values that depend on multiple streaming events, it is necessary to retain data from one event to another. This retained data is known as the state of the computation. Accurate handling of state is essential for consistency in computation. The ability to accurately update state after a failure or interruption is a key to fault tolerance.
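
As a rough illustration of the note above, the sketch below keeps a per-key running average, which requires retaining a count and a total between events. It is plain Python with hypothetical names (`RunningAverage`, `snapshot`, `restore`) and does not show how any particular stream processor implements state; it only assumes that a snapshot of the state is stored together with the input position it corresponds to, so that processing can resume from there after a failure.

```python
import copy


class RunningAverage:
    """Stateful operator: keeps [count, total] per key between events."""

    def __init__(self):
        self.state = {}

    def process(self, key, value):
        entry = self.state.setdefault(key, [0, 0.0])
        entry[0] += 1
        entry[1] += value
        return entry[1] / entry[0]

    def snapshot(self):
        # Deep copy so later updates do not alter the saved state.
        return copy.deepcopy(self.state)

    def restore(self, snapshot):
        self.state = copy.deepcopy(snapshot)


events = [("a", 10.0), ("a", 20.0), ("b", 4.0), ("a", 30.0)]

op = RunningAverage()
for key, value in events[:2]:
    op.process(key, value)
saved_state, saved_position = op.snapshot(), 2   # state plus input offset

op.process(*events[2])   # progress made after the snapshot, then a crash

# Recovery: restore the snapshot and replay input from the saved position.
recovered = RunningAverage()
recovered.restore(saved_state)
outputs = [recovered.process(k, v) for k, v in events[saved_position:]]
```

Because the state and the input position are saved together, each event affects the recovered state exactly once, even though the event at index 2 was processed twice overall.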

8 | Chapter 1: Why Apache Flink?

https://www.iteblog.com

It’s hard to maintain fault-tolerant stream processing that has high throughput with very low latency, but the need for guarantees of accurate state motivated a clever compromise: what if the stream of data from continuous events were broken into a series of small, atomic batch jobs? If the batches were cut small enough (so-called “micro-batches”), your computation could approximate true streaming. The latency could not quite reach real time, but latencies of several seconds or even subseconds for very simple applications would be possible. This is the approach taken by Apache Spark Streaming, which runs on the Spark batch engine.

More important, with micro-batching, you can achieve exactly-once guarantees of state consistency. If a micro-batch job fails, it can be rerun. This is much easier than would be true for a continuous stream-processing approach. An extension of Storm, called Storm Trident, applies micro-batch computation on the underlying stream processor to provide exactly-once guarantees, but at a substantial cost to latency.
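
The rerun-on-failure idea can be sketched as follows. This is a simplified model in plain Python, not Spark Streaming or Storm Trident code, and the helper names (`micro_batches`, `run_batch`) are hypothetical. It also cuts batches by event count for brevity, whereas real micro-batch systems typically cut them by time interval. Each batch is applied to a copy of the state, and the copy replaces the committed state only if the whole batch succeeds.

```python
def micro_batches(events, batch_size):
    """Cut a stream of events into small lists of at most `batch_size`."""
    batch = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch


def run_batch(committed_counts, batch, fail=False):
    """Apply one batch to a copy of the state; the caller commits on success."""
    working = dict(committed_counts)
    for i, key in enumerate(batch):
        if fail and i == 1:
            # Simulated crash partway through the batch: `working` is
            # discarded, so the partial update never becomes visible.
            raise RuntimeError("simulated worker failure")
        working[key] = working.get(key, 0) + 1
    return working


events = ["a", "b", "a", "c", "a", "b"]
counts = {}
for index, batch in enumerate(micro_batches(events, batch_size=3)):
    try:
        counts = run_batch(counts, batch, fail=(index == 0))
    except RuntimeError:
        # Rerun the whole batch against the last committed state.
        counts = run_batch(counts, batch)
```

The first batch fails after counting one event, but because the committed state was never touched, rerunning the batch counts every event exactly once. The price is that no result is available until a whole batch has finished.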

However, simulating streaming with periodic batch jobs leads to very fragile pipelines that mix DevOps with application development concerns. The time that a periodic batch job takes to finish is tightly coupled with the timing of data arrival, and any delays can cause inconsistent (a.k.a. wrong) results. The underlying problem with this approach is that time is only managed implicitly by the part of the system that creates the small jobs. Frameworks like Spark Streaming mitigate some of the fragility, but not entirely, and the sensitivity to timing relative to batches still leads to poor latency and a user experience where one needs to think a lot about performance in the application code.

These tradeoffs between desired capabilities have motivated continued attempts to improve existing processors (for example, the development of Storm Trident to try to overcome some of the limitations of Storm). When existing processors fall short, the burden is placed on the application developer to deal with any issues that result. An example is the case of micro-batching, which does not provide an excellent fit between the natural occurrence of sessions in event data and the processor’s need to window data only as multiples of the batch time (recovery interval). With less flexibility and expressivity, development is slower and operations take more effort to maintain properly.
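
The mismatch between sessions and batch-aligned windows is easier to see with numbers. The sketch below is plain Python with hypothetical function names and made-up click timestamps (in seconds). `session_windows` groups events separated by at most `gap` seconds, while `fixed_windows` models windows whose boundaries must fall on multiples of a fixed interval, as they would if windows had to align with the batch time.

```python
def session_windows(timestamps, gap):
    """Group timestamps into sessions; a new session starts after a quiet
    period longer than `gap`."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


def fixed_windows(timestamps, size):
    """Group timestamps into windows aligned to multiples of `size`."""
    windows = {}
    for ts in sorted(timestamps):
        windows.setdefault((ts // size) * size, []).append(ts)
    return windows


clicks = [1, 3, 8, 11, 13, 40, 42]   # hypothetical user activity

sessions = session_windows(clicks, gap=5)
fixed = fixed_windows(clicks, size=10)
```

Here the user's first burst of activity (1 through 13) is a single session, but the fixed windows split it at the 10-second boundary into two groups. Stitching those pieces back together is the kind of work that falls to the application developer when the processor can only window on batch multiples.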


to a low-level programming API, it does not support event time, and it does not have support for batch computations). And none of these projects have been able to attract an open source community comparable to the Flink community.

Now, let’s take a look at what Flink is and how the project came about.

First Look at Apache Flink

The Apache Flink project home page starts with the tagline, “Apache Flink is an open source platform for distributed stream and batch data processing.” For many people, it’s a surprise to realize that Flink not only provides real-time streaming with high throughput and exactly-once guarantees, but it’s also an engine for batch data processing. You used to have to choose between these approaches, but Flink lets you do both with one technology.
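
One way to picture "both with one technology" is to treat a batch as a stream that happens to end. The sketch below is a conceptual illustration in plain Python, not Flink's API; the function and variable names are hypothetical. The same `running_max` logic is applied first to a bounded list and then to an unbounded generator, with no second implementation needed.

```python
import itertools


def running_max(readings):
    """Emit the largest value seen so far after each input element.

    Works on any iterable, whether it ends (bounded) or not (unbounded).
    """
    best = None
    for value in readings:
        best = value if best is None else max(best, value)
        yield best


# Bounded input: a finite dataset, processed to completion like a batch job.
bounded = [3, 1, 4, 1, 5]
batch_result = list(running_max(bounded))[-1]


def sensor():
    """Unbounded input: a made-up sensor that never stops producing values."""
    for n in itertools.count():
        yield (n * 7) % 10


# Unbounded input: results are consumed incrementally; here only the first five.
first_five = list(itertools.islice(running_max(sensor()), 5))
```

The only difference between the two runs is how the output is consumed: the bounded case waits for a final answer, while the unbounded case reads results as they are produced.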

How did this top-level Apache project get started? Flink has its origins in the Stratosphere project, a research project conducted by three Berlin-based universities as well as other European universities between 2010 and 2014. The project had already attracted a broader community base, in part through presentations at several public developer conferences including Berlin Buzzwords, NoSQL Matters in Cologne, and others. This strong community base is one reason the project was appropriate for incubation under the Apache Software Foundation.

A fork of the Stratosphere code was donated in April 2014 to the Apache Software Foundation as an incubating project, with an initial set of committers consisting of the core developers of the system. Shortly thereafter, many of the founding committers left university to start a company to commercialize Flink: data Artisans. During incubation, the project name had to be changed from Stratosphere because of potential confusion with an unrelated project. The name Flink was selected to honor the style of this stream and batch processor: in German, the word “flink” means fast or agile. A logo showing a colorful squirrel was chosen because squirrels are fast, agile and (in the case of squirrels in Berlin) an amazing shade of reddish-brown, as you can see in Figure 1-3.

