Apache Flume in Depth: Concepts, Flow Model, and a Getting-Started Example

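Updated 2024-07-18

"Flume Study Summary 1"

Before digging into Apache Flume, it helps to understand its basic concepts and core features. Flume is a system designed for collecting, aggregating, and moving large volumes of data, and it is especially well suited to log data. As an Apache top-level project, it offers a reliable, distributed way to keep data flowing steadily from many sources into centralized storage.

### Flume concepts

1. Flume Event: the basic unit of data transfer, consisting of a byte payload and an optional set of attributes.
2. Flume Agent: the core of the system. It runs as a process that receives events from external sources and forwards them to a destination. Each agent has three main parts: Source, Channel, and Sink.
3. Source: receives events from an external system, usually in a format that the particular Flume Source can parse.
4. Channel: temporary storage that holds events until a Sink consumes them. The two implementations discussed here are the memory channel, which is fast but cannot recover data after a failure, and the file channel, which is slower but fault tolerant.
5. Sink: delivers Flume Events to an external destination such as HDFS, HBase, or another storage system.

### Flume flow model

Flume's data flow model is built from combinations of Source, Channel, and Sink. Events enter through a Source, are buffered in a Channel, and are finally forwarded by a Sink. The model supports complex flow paths, including multi-hop flows, fan-out flows (one to many), fan-in flows (many to one), and failover and retry mechanisms.

### Flume features

1. Complex flows: users can build multi-hop data flows and combine several routing patterns, such as branching and merging.
2. Reliability: transactions protect the integrity of data as it moves through the flow, so events are not lost when a hop fails, provided the channel itself is durable.
3. Recoverability: with the file channel, Flume can resume unfinished event transfers after a system failure.

### Getting-started example

Setting up Flume usually means configuring an Agent, for example one named "a1", with a configuration file like this:

```properties
# example.conf: single-node Flume configuration

# Name the components
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Configure the Source
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444

# Configure the Sink (replace <sink_type> with a real type before running)
a1.sinks.k1.type = <sink_type>

# Configure the Channel (in-memory buffer)
a1.channels.c1.type = memory

# Bind the Source and Sink to the Channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

In this example, Source "r1" uses the `netcat` type and listens on port 44444 of 0.0.0.0. The type of Sink "k1" is set according to actual needs, for example HDFS or Avro. Channel "c1" is a memory channel, and both r1 and k1 are bound to it.

Flume's strength lies in its extensibility and flexibility: multiple Agents can be configured and chained together to build complex data-flow networks. Flume also supports dynamic reconfiguration, so a data flow can be adjusted at runtime, which matters in big-data environments that keep changing. This makes Flume a widely used tool for collecting and managing log data in big-data systems.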

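| Property | Default | Description |
| --- | --- | --- |
| selector.type | replicating | Channel selector type: replicating or multiplexing |
| selector.* | | Depends on the selector.type value |
| interceptors | – | Space-separated list of interceptors |
| interceptors.* | | |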
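
The selector and interceptor properties above are easier to follow with a concrete fragment. The following is a minimal sketch, not a configuration taken from the source: it assumes an agent a1 with source r1 and two channels c1 and c2, and the header name `logtype` and its value `error` are hypothetical.

```properties
# Sketch only: route events from source r1 to different channels by header value.
a1.channels = c1 c2
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = multiplexing
# "logtype" is a hypothetical header name; "error" is a hypothetical value.
a1.sources.r1.selector.header = logtype
a1.sources.r1.selector.mapping.error = c1
# Events whose header matches no mapping go to c2.
a1.sources.r1.selector.default = c2

# Attach two built-in interceptors, listed in the order they run.
a1.sources.r1.interceptors = i1 i2
a1.sources.r1.interceptors.i1.type = timestamp
a1.sources.r1.interceptors.i2.type = host
```

With `selector.type = replicating` (the default), every event would instead be copied to both c1 and c2.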

6.2.3. Example

Write the configuration file:

    # Name the components of Agent a1
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1

    # Describe/configure the Source
    a1.sources.r1.type = avro
    a1.sources.r1.bind = 0.0.0.0
    a1.sources.r1.port = 33333

    # Describe the Sink
    a1.sinks.k1.type = logger

    # Describe the memory Channel
    a1.channels.c1.type = memory
    a1.channels.c1.capacity = 1000
    a1.channels.c1.transactionCapacity = 100

    # Bind the Source and Sink to the Channel
    a1.sources.r1.channels = c1
    a1.sinks.k1.channel = c1
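
The memory channel in this configuration keeps events only in RAM, so they are lost if the agent stops. Where events must survive a restart, a file channel can be used instead. A minimal sketch, assuming the same agent a1 and channel c1; both directory paths are hypothetical placeholders:

```properties
# Sketch only: make channel c1 durable by switching it to the file channel.
a1.channels.c1.type = file
# Hypothetical paths; point them at directories the Flume process can write to.
a1.channels.c1.checkpointDir = /var/flume/checkpoint
a1.channels.c1.dataDirs = /var/flume/data
```

The trade-off is throughput: every event is written to disk before the transaction commits.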

Start Flume:

    ./flume-ng agent --conf ../conf --conf-file ../conf/template2.conf --name a1 -Dflume.root.logger=INFO,console
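
An Avro source like the one configured here is typically fed by an Avro sink on an upstream agent, which is how multi-hop flows are built. A minimal sketch of that upstream side, assuming a second agent named a2 (hypothetical) whose own source and channel c1 are defined elsewhere; the host name is a placeholder:

```properties
# Sketch only: upstream agent "a2" (hypothetical) forwarding to a1's Avro source.
a2.sinks = k1
a2.sinks.k1.type = avro
# Hypothetical host name of the machine running agent a1.
a2.sinks.k1.hostname = collector-host
# Must match a1.sources.r1.port on the receiving agent.
a2.sinks.k1.port = 33333
# "c1" here is a channel defined on a2, not a1's channel.
a2.sinks.k1.channel = c1
```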

The tail command can be used to collect lines that are subsequently appended to a log file.
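
In Flume, tail-based collection is usually done with the exec source, which runs a command and turns each line of its output into an event. A minimal sketch, assuming agent a1 with source r1 and channel c1; the log path is a hypothetical placeholder, and the source text does not show this configuration:

```properties
# Sketch only: follow a log file with the exec source.
a1.sources.r1.type = exec
# Hypothetical log path; -F keeps following the file across rotation.
a1.sources.r1.command = tail -F /var/log/app.log
a1.sources.r1.channels = c1
```

The exec source cannot tell the command to resend data, so lines produced while the channel is full or the agent is down may be lost.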

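6.3. Spooling Directory Source

6.3.1. Spooling Directory Source overview

This Source lets you ingest data by placing files into a "spooling" directory. The Source watches that directory and parses new files as they appear. The event parsing logic is pluggable. Once a file has been fully read into the channel, it is renamed or, optionally, deleted.

Note that files placed in the spooling directory must not be modified afterwards; if one is, Flume reports an error.

File names must not be reused either; if a file with a previously used name is placed in the directory, Flume reports an error.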

6.3.2. Spooling Directory Source properties

| Property | Default | Description |
| --- | --- | --- |
| channels (required) | – | |
| type (required) | – | Component type; must be `spooldir` |
| spoolDir (required) | – | Directory to read files from (the spooling directory) |
| fileSuffix | .COMPLETED | Suffix appended to files that have been fully processed |
| deletePolicy | never | Whether to delete files after processing; must be `never` or `immediate` |
| fileHeader | false | Whether to add a header storing the absolute path of the file |
| fileHeaderKey | file | Header key to use when adding the absolute path to the event header |
| basenameHeader | false | Whether to add a header storing the basename of the file |
| basenameHeaderKey | basename | Header key to use when adding the basename of the file to the event header |
| ignorePattern | ^$ | Regular expression specifying which files to ignore |
| trackerDir | .flumespool | Directory to store metadata related to the processing of files. If this path is not absolute, it is interpreted as relative to spoolDir |
| consumeOrder | oldest | Order in which files are consumed: `oldest`, `youngest`, or `random` |
| maxBackoff | 4000 | Maximum time (in milliseconds) to wait between consecutive attempts to write to the channel(s) if the channel is full. The source starts at a low backoff and increases it exponentially each time the channel throws a ChannelException, up to this value |
| batchSize | 100 | Granularity at which events are batched for transfer to the channel |
| inputCharset | UTF-8 | Character set used when reading files |
| decodeErrorPolicy | FAIL | What to do when an undecodable character is found in the input file. `FAIL`: throw an exception and fail to parse the file. `REPLACE`: replace the unparseable character with the "replacement character", typically Unicode U+FFFD. `IGNORE`: drop the unparseable character sequence |
| deserializer | LINE | Deserializer used to parse the file into events; by default each line becomes one event. The class must implement `EventDeserializer.Builder` |
| deserializer.* | | Varies per event deserializer |
| bufferMaxLines | – | (Obsolete) This option is now ignored |
| bufferMaxLineLength | 5000 | (Deprecated) Maximum length of a line in the commit buffer. Use deserializer.maxLineLength instead |
| selector.type | replicating | replicating or multiplexing |
| selector.* | | Depends on the selector.type value |
| interceptors | – | Space-separated list of interceptors |
| interceptors.* | | |
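
To show how some of the optional properties above combine, here is a minimal sketch for a spooldir source r1 on agent a1. The directory path and the `.tmp` naming convention are hypothetical, and the values are illustrative choices rather than recommendations from the source.

```properties
# Sketch only: optional settings for spooldir source r1 on agent a1.
a1.sources.r1.type = spooldir
# Hypothetical directory; use the directory that receives finished files.
a1.sources.r1.spoolDir = /var/log/flume-spool
# Record each file's absolute path in the "file" header of its events.
a1.sources.r1.fileHeader = true
# Skip files whose names end in .tmp (a hypothetical naming convention).
# [.] is used instead of \. because backslash is an escape character in properties files.
a1.sources.r1.ignorePattern = ^.*[.]tmp$
# Delete files once fully ingested instead of renaming them with fileSuffix.
a1.sources.r1.deletePolicy = immediate
# Replace undecodable characters with U+FFFD rather than failing the file.
a1.sources.r1.decodeErrorPolicy = REPLACE
```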

6.3.3. Example

Write the configuration file:

    # Name the components of Agent a1
    a1.sources = r1
    a1.sinks = k1
    a1.channels = c1

    # Describe/configure the Source
    a1.sources.r1.type = spooldir
    a1.sources.r1.spoolDir = /home/park/work/apache-flume-1.6.0-bin/mydata

    # Describe the Sink
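
The listing stops at the sink section. One way the remaining sections might look is sketched below, on the assumption that this example reuses the logger sink and memory channel from section 6.2.3; the source does not show these lines, so treat every value here as an assumption.

```properties
# Sketch only: sink, channel, and bindings assumed to mirror section 6.2.3.
a1.sinks.k1.type = logger

# Assumed memory channel with the same sizes as in 6.2.3.
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the Source and Sink to the Channel.
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
```

With a configuration along these lines, each line of any file dropped into the spooling directory would be printed by the logger sink as one event.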


