Streaming Logs to Hadoop in Real Time with Apache Flume

Updated 2023-06-02 · PDF, 3.72 MB

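"Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop's HDFS. It has a simple and flexible architecture based on streaming data flows, with robust failure-recovery mechanisms. The book Apache Flume: Distributed Log Collection for Hadoop explains in depth how Flume addresses the problem with HDFS and streaming data/logs, and covers its general architecture, including interaction with databases and NoSQL stores and performance tuning. It includes real-world Flume implementation scenarios, guides the reader through installation and configuration, and offers tips for writing custom implementations, helping readers build a chain of Flume agents that deliver data to Hadoop in real time."

Apache Flume is a key log-collection tool, especially in big-data environments. It is designed to handle large data streams from varied origins, such as web logs, system logs, or social media data. Its core components are sources, channels, and sinks, which together form the data-flow pipeline.

Sources are the starting point of a data flow and are responsible for acquiring events from the outside world: applications, server log files, or other data generators. Flume supports several source types, such as the NetCat source (reads lines of text arriving on a TCP port) and the Exec source (runs a command and collects its output).

Channels buffer events between sources and sinks. A channel can be in-memory (the memory channel, which is fast but loses its contents if the agent process dies) or durable (the file channel, which persists events to disk and so survives a restart). The capacity and type of each channel are configurable.

Sinks are the end point of a data flow and write events to their destination. This is usually HDFS, but it can be another system such as Solr or Kafka (or, through third-party sinks, stores such as Cassandra). Flume ships several sink implementations, including the HDFS sink for writing to the Hadoop filesystem and the Avro sink for forwarding events to another Flume agent.

Channel selectors and sink processors let you customize the path that data takes. A channel selector decides which channel (or channels) an event from a source goes into, while a sink processor controls how events are distributed across a group of sinks.

The book shows how to configure and operate Flume, how to build more complex Flume topologies, and how to write a custom source, sink, or channel for specific needs. It also gives guidance on failure recovery and performance tuning for stable operation at scale. By the end, the reader should be able to build an efficient, reliable log-collection system that moves data from many origins into Hadoop in real time, which matters for real-time analytics and monitoring. The material combines theory with practical cases.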
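To make the source, channel, and sink roles described above concrete, the following Python sketch renders a minimal Flume 1.x agent configuration in the properties format Flume reads from disk. The component names (`r1`, `c1`, `k1`), the agent name, the port, the capacity value, and the HDFS URL are all placeholders chosen for illustration, not values from the book; the property keys themselves are standard keys for the NetCat source, memory channel, and HDFS sink.

```python
def render_agent_config(agent: str, port: int, hdfs_path: str) -> str:
    """Render a minimal Flume 1.x properties file: one source, one channel, one sink."""
    lines = [
        # Declare which components belong to this agent.
        f"{agent}.sources = r1",
        f"{agent}.channels = c1",
        f"{agent}.sinks = k1",
        # Source: NetCat source that reads lines of text from a TCP port.
        f"{agent}.sources.r1.type = netcat",
        f"{agent}.sources.r1.bind = 0.0.0.0",
        f"{agent}.sources.r1.port = {port}",
        # Channel: in-memory buffer (contents are lost if the agent dies).
        f"{agent}.channels.c1.type = memory",
        f"{agent}.channels.c1.capacity = 10000",
        # Sink: write events to HDFS under the given path.
        f"{agent}.sinks.k1.type = hdfs",
        f"{agent}.sinks.k1.hdfs.path = {hdfs_path}",
        # Wiring: a source may feed several channels (plural key),
        # while a sink drains exactly one channel (singular key).
        f"{agent}.sources.r1.channels = c1",
        f"{agent}.sinks.k1.channel = c1",
    ]
    return "\n".join(lines) + "\n"


# Placeholder values for illustration only.
config = render_agent_config(
    "agent1", 44444, "hdfs://namenode.example.com/flume/events"
)
print(config)
```

Generating the file from a function is only a convenience for the example; in practice the same twelve lines can be written by hand or templated by a configuration-management tool.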

Overview and Architecture

Flume 0.9

Flume was rst introduced in Cloudera's CDH3 Distribution in 2011. It consisted

of a federation of worker daemons (agents) congured from a centralized master

(or masters) via Zookeeper (a federated conguration and coordination system).

From the master you could check agent status in a Web UI, as well as push out

conguration centrally from the UI or via a command line shell (both really

communicating via Zookeeper to the worker agents).

Data could be sent in one of three modes: best effort (BE), disk failover (DFO), and end-to-end (E2E). The masters were used for the E2E mode acknowledgements, and multi-master configuration never really matured, so you usually had only one master, making it a central point of failure for E2E data flows.

Best effort is just what it sounds like: the agent would try to send the data, but if it couldn't, the data would be discarded. This mode is good for things like metrics, where gaps can easily be tolerated because new data is just a second away. Disk failover mode stores undeliverable data on the local disk (or sometimes in a local database) and keeps retrying until the data can be delivered to the next recipient in your data flow. This is handy for planned (or unplanned) outages, as long as you have sufficient local disk space to buffer the load.

In June 2011, Cloudera moved control of the Flume project to the Apache Software Foundation. It came out of incubator status a year later, in 2012. During that incubation year, work had already begun to refactor Flume under the Star Trek-themed tag Flume-NG (Flume the Next Generation).

Flume 1.X (Flume-NG)

There were many reasons why Flume was refactored. If you are interested in the details, you can read about them at https://issues.apache.org/jira/browse/FLUME-728. What started as a refactoring branch eventually became the main line of development as Flume 1.X.

The most obvious change in Flume 1.X is that the centralized configuration master/masters and ZooKeeper are gone. The configuration in Flume 0.9 was overly verbose, and mistakes were easy to make. Furthermore, centralized configuration was really outside the scope of Flume's goals. Centralized configuration was replaced with a simple on-disk configuration file (although the configuration provider is pluggable so that it can be replaced). These configuration files are easily distributed using tools such as CFEngine, Chef, and Puppet. If you are using a Cloudera distribution, take a look at Cloudera Manager to manage your configurations; its licensing was recently changed to lift the node limit, so it may be an attractive option for you. Be sure you don't manage these configurations manually, or you'll be editing those files manually forever.

Another major difference in Flume 1.X is that the reading of input data and the writing of output data are now handled by different worker threads (called Runners). In Flume 0.9, the input thread also did the writing to the output (except for failover retries). If the output writer was slow (rather than just failing outright), it would block Flume's ability to ingest data. This new asynchronous design leaves the input thread blissfully unaware of any downstream problem.
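The decoupling of input and output threads can be illustrated with a small Python sketch. This is a toy model rather than Flume's implementation: the function names, the `downstream_ready` event used to simulate a stalled destination, and the unbounded `queue.Queue` standing in for the buffer between the two threads are all assumptions. A real Flume channel has a configured capacity, so a stall that lasts long enough would eventually fill it.

```python
import queue
import threading

channel = queue.Queue()               # buffer between the two threads
downstream_ready = threading.Event()  # unset means the destination is stalled
delivered = []


def source_runner(events):
    """Input thread: only ever talks to the channel."""
    for event in events:
        channel.put(event)  # unbounded queue, so this never blocks


def sink_runner(count):
    """Output thread: waits for the destination, then drains the channel."""
    downstream_ready.wait()
    for _ in range(count):
        delivered.append(channel.get())


events = [f"log line {i}" for i in range(100)]
sink = threading.Thread(target=sink_runner, args=(len(events),))
source = threading.Thread(target=source_runner, args=(events,))

sink.start()
source.start()
source.join()  # ingestion completes even though the sink is stalled

buffered_during_stall = channel.qsize()
downstream_ready.set()  # the destination recovers
sink.join()

print(buffered_during_stall, len(delivered))  # 100 100
```

The input thread finishes all 100 puts while nothing has been written downstream, which is the behavior the single-threaded Flume 0.9 design could not offer when the writer was slow.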

The version of Flume covered in this book is 1.3.1 (current at the time of this book's writing).

The problem with HDFS and streaming data/logs

HDFS isn't a real lesystem, at least not in the traditional sense, and many of the

things we take for granted with normal lesystems don't apply here, for example

being able to mount it. This makes getting your streaming data into Hadoop a little

more complicated.

In a regular Portable Operating System Interface (POSIX) style filesystem, if you open a file and write data, it still exists on disk before the file is closed. That is, if another program opens the same file and starts reading, it will get the data already flushed by the writer to disk. Furthermore, if that writing process is interrupted, any portion that made it to disk is usable (it may be incomplete, but it exists).

In HDFS the le exists only as a directory entry, it shows as having zero length until

the le is closed. This means if data is written to a le for an extended period without

closing it, a network disconnect with the client will leave you with nothing but an

empty le for all your efforts. This may lead you to the conclusion that it would be

wise to write small les so you can close them as soon as possible.

The problem is that Hadoop doesn't like lots of tiny files. Since the HDFS metadata is kept in memory on the NameNode, the more files you create, the more RAM you'll need to use. From a MapReduce perspective, tiny files lead to poor efficiency. Usually, each mapper is assigned a single block of a file as input (unless you have used certain compression codecs). If you have lots of tiny files, the cost of starting the worker processes can be disproportionately high compared to the data they are processing. This kind of block fragmentation also results in more mapper tasks, increasing the overall job run times.
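A rough back-of-the-envelope calculation shows how quickly tiny files add up. The sketch below rests on assumptions that are not from the text: a commonly quoted rule of thumb of about 150 bytes of NameNode heap per file or block object, a 128 MiB block size, and one map task per block. Treat the outputs as orders of magnitude, not measurements.

```python
BYTES_PER_NAMENODE_OBJECT = 150  # assumed rule of thumb, not a measured value
BLOCK_SIZE = 128 * 1024**2       # assumed HDFS block size (128 MiB)


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division."""
    return -(-a // b)


def footprint(total_bytes: int, file_size: int) -> dict:
    """Estimate file count, map tasks, and NameNode heap for a dataset
    stored as equally sized files (a simplifying assumption)."""
    files = ceil_div(total_bytes, file_size)
    blocks = files * ceil_div(file_size, BLOCK_SIZE)
    objects = files + blocks  # one metadata object per file and per block
    return {
        "files": files,
        "map_tasks": blocks,  # assumes one mapper per block
        "namenode_bytes": objects * BYTES_PER_NAMENODE_OBJECT,
    }


ONE_TIB = 1024**4
small = footprint(ONE_TIB, 1024**2)        # 1 TiB stored as 1 MiB files
large = footprint(ONE_TIB, 128 * 1024**2)  # 1 TiB stored as 128 MiB files

print(small)  # {'files': 1048576, 'map_tasks': 1048576, 'namenode_bytes': 314572800}
print(large)  # {'files': 8192, 'map_tasks': 8192, 'namenode_bytes': 2457600}
```

Under these assumptions, the same terabyte costs 128 times more NameNode memory and 128 times more map tasks when written as 1 MiB files, which is the tension with the urge to close files early described above.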
