Apache Flume in Practice: Log Collection and Customization for Hadoop


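Apache Flume: Distributed Log Collection for Hadoop, Second Edition, by Steve Hoffman, is a specialist book that explains in depth the key role Apache Flume plays in the Hadoop ecosystem. It is written for readers who want to understand and use Flume for real-time data streaming and log collection, especially those who want to build and configure Flume agents that deliver data efficiently to Hadoop.

The book opens with a comprehensive overview of the Flume architecture and its basic components: the source, the channel, and the sink. A source captures data from origins such as network interfaces, filesystems, or databases. A channel is a temporary holding area for data and can be an in-memory buffer or persistent storage. A sink writes the data to a specific destination such as a local file, HDFS, or a message queue. The author places particular emphasis on the HDFS sink, which lets Flume write data durably to the Hadoop Distributed File System, something that is essential for large-scale log storage and analysis.

The book also describes in detail how to design and implement a range of customized Flume agents so that the data flow can be adjusted to actual requirements. For each component it gives thorough implementation and configuration options, so readers can tune how Flume operates to suit different business scenarios. Both applications with strict real-time requirements and environments that must store and process massive volumes of logs over the long term will find a suitable approach here.

On copyright, all content is protected by Packt Publishing, and no part may be reproduced, stored, or transmitted in any form without the publisher's written permission. Although the author and the publisher have made every effort to ensure accuracy, the information in the book is sold without warranty, and neither accepts liability for any direct or indirect loss. Packt Publishing has also tried to mark correctly the trademark information for all companies and products mentioned in the book.

This is a practical reference, suitable not only for system administrators and data engineers but also for developers who want to extend Hadoop with Flume, and it helps them improve their log management and big data processing workflows.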

Preface


Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do happen. If you find a mistake in one of our books (maybe a mistake in the text or the code), we would be grateful if you would report this to us. By doing so, you can save other readers from frustration and help us improve subsequent versions of this book. If you find any errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting your book, clicking on the errata submission form link, and entering the details of your errata. Once your errata are verified, your submission will be accepted and the errata will be uploaded on our website, or added to any list of existing errata, under the Errata section of that title. Any existing errata can be viewed by selecting your title from http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt, we take the protection of our copyright and licenses very seriously. If you come across any illegal copies of our works, in any form, on the Internet, please provide us with the location address or website name immediately so that we can pursue a remedy.

Please contact us at copyright@packtpub.com with a link to the suspected pirated material.

We appreciate your help in protecting our authors, and our ability to bring you valuable content.

Questions

You can contact us at questions@packtpub.com if you are having a problem with any aspect of the book, and we will do our best to address it.

Overview and Architecture


Flume 0.9

Flume was first introduced in Cloudera's CDH3 distribution in 2011. It consisted of a federation of worker daemons (agents) configured from a centralized master (or masters) via Zookeeper (a federated configuration and coordination system). From the master, you could check the agent status in a web UI as well as push out configuration centrally from the UI or via a command-line shell (both really communicating via Zookeeper to the worker agents).

Data could be sent in one of three modes: Best Effort (BE), Disk Failover (DFO), and End-to-End (E2E). The masters were used for the E2E mode acknowledgements, and multimaster configuration never really matured, so you usually had only one master, making it a central point of failure for E2E data flows. The BE mode is just what it sounds like: the agent would try to send the data, but if it couldn't, the data would be discarded. This mode is good for things such as metrics, where gaps can easily be tolerated, as new data is just a second away. The DFO mode stored undeliverable data on the local disk (or sometimes, in a local database) and kept retrying until the data could be delivered to the next recipient in your data flow. This is handy for those planned (or unplanned) outages, as long as you have sufficient local disk space to buffer the load.
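The difference between the BE and DFO modes can be sketched in a few lines of Python. This is only an illustration of the delivery semantics described above, not Flume 0.9 code: the names FlakyDownstream, send_best_effort, and DiskFailoverSender are hypothetical, and an in-memory deque stands in for the local disk buffer.

```python
from collections import deque


class FlakyDownstream:
    """Hypothetical next hop that can be switched on and off."""

    def __init__(self):
        self.up = True
        self.received = []

    def send(self, event):
        if not self.up:
            raise ConnectionError("downstream unavailable")
        self.received.append(event)


def send_best_effort(downstream, event):
    """BE: try once; if the send fails, the event is discarded."""
    try:
        downstream.send(event)
        return True
    except ConnectionError:
        return False


class DiskFailoverSender:
    """DFO: buffer undeliverable events and retry them later.

    Assumption: a deque stands in for the on-disk buffer to keep this short.
    """

    def __init__(self, downstream):
        self.downstream = downstream
        self.backlog = deque()

    def send(self, event):
        self.backlog.append(event)
        self.flush()

    def flush(self):
        # Deliver in arrival order; stop at the first failure and keep the rest.
        while self.backlog:
            try:
                self.downstream.send(self.backlog[0])
            except ConnectionError:
                return
            self.backlog.popleft()


be_target, dfo_target = FlakyDownstream(), FlakyDownstream()
dfo = DiskFailoverSender(dfo_target)

be_target.up = dfo_target.up = False      # simulate an outage
send_best_effort(be_target, "metric-1")   # lost for good
dfo.send("log-1")                         # buffered

be_target.up = dfo_target.up = True       # outage ends
send_best_effort(be_target, "metric-2")
dfo.send("log-2")                         # also drains the backlog

print(be_target.received)   # ['metric-2']
print(dfo_target.received)  # ['log-1', 'log-2']
```

The best-effort path loses the event sent during the outage, while the failover path delivers it once the next hop returns, at the cost of buffer space that grows for as long as the outage lasts.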

In June 2011, Cloudera moved control of the Flume project to the Apache Software Foundation. It came out of incubator status a year later, in 2012. During the incubation year, work had already begun to refactor Flume under the Star-Trek-themed tag, Flume-NG (Flume the Next Generation).

Flume 1.X (Flume-NG)

There were many reasons why Flume was refactored. If you are interested in the details, you can read about them at https://issues.apache.org/jira/browse/FLUME-728. What started as a refactoring branch eventually became the main line of development as Flume 1.X.

The most obvious change in Flume 1.X is that the centralized configuration master(s) and Zookeeper are gone. The configuration in Flume 0.9 was overly verbose, and mistakes were easy to make. Furthermore, centralized configuration was really outside the scope of Flume's goals. Centralized configuration was replaced with a simple on-disk configuration file (although the configuration provider is pluggable so that it can be replaced). These configuration files are easily distributed using tools such as CFEngine, Chef, and Puppet. If you are using a Cloudera distribution, take a look at Cloudera Manager to manage your configurations. About two years ago, they created a free version with no node limit, so it may be an attractive option for you. Just be sure you don't manage these configurations manually, or you'll be editing these files manually forever.


Another major difference in Flume 1.X is that the reading of input data and the writing of output data are now handled by different worker threads (called Runners). In Flume 0.9, the input thread also did the writing to the output (except for failover retries). If the output writer was slow (rather than just failing outright), it would block Flume's ability to ingest data. This new asynchronous design leaves the input thread blissfully unaware of any downstream problem.
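A minimal sketch of this decoupling, using Python's standard queue and threading modules rather than Flume itself, may help. The names source_runner and sink_runner are hypothetical, and the unbounded queue is a simplifying assumption; a buffer with a fixed capacity would eventually push back on the input side once it filled up.

```python
import queue
import threading
import time

channel = queue.Queue()   # assumption: unbounded buffer between input and output
delivered = []
STOP = object()           # sentinel that tells the output thread to exit


def source_runner(n_events):
    """Input side: hand events to the buffer and return immediately."""
    for i in range(n_events):
        channel.put(f"event-{i}")
    channel.put(STOP)


def sink_runner(delay_s):
    """Output side: drain the buffer, pausing to mimic a slow writer."""
    while True:
        event = channel.get()
        if event is STOP:
            break
        time.sleep(delay_s)
        delivered.append(event)


sink = threading.Thread(target=sink_runner, args=(0.02,))
sink.start()

start = time.perf_counter()
source_runner(10)
source_elapsed = time.perf_counter() - start

sink.join()
total_elapsed = time.perf_counter() - start

print(f"input side finished after {source_elapsed:.4f}s")
print(f"output side finished after {total_elapsed:.4f}s with {len(delivered)} events")
```

The input side returns almost immediately, while the slow writer needs roughly 0.2 seconds to work through the same ten events. With a single thread doing both jobs, as described for Flume 0.9, ingestion would have been held to the writer's pace.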

The first edition of this book covered all versions of Flume up to version 1.3.1. This second edition covers versions up to 1.5.2 (the current version at the time of writing).

The problem with HDFS and streaming data/logs

HDFS isn't a real filesystem, at least not in the traditional sense, and many of the things we take for granted with normal filesystems don't apply here, such as being able to mount it. This makes getting your streaming data into Hadoop a little more complicated.

In a regular POSIX-style filesystem, if you open a file and write data, it still exists on the disk before the file is closed. That is, if another program opens the same file and starts reading, it will get the data already flushed by the writer to the disk. Furthermore, if this writing process is interrupted, any portion that made it to disk is usable (it may be incomplete, but it exists).

In HDFS, the file exists only as a directory entry; it shows zero length until the file is closed. This means that if data is written to a file for an extended period without closing it, a network disconnect with the client will leave you with nothing but an empty file for all your efforts. This may lead you to the conclusion that it would be wise to write small files so that you can close them as soon as possible.

The problem is that Hadoop doesn't like lots of tiny files. As the HDFS filesystem metadata is kept in memory on the NameNode, the more files you create, the more RAM you'll need to use. From a MapReduce perspective, tiny files lead to poor efficiency. Usually, each Mapper is assigned a single block of a file as the input (unless you have used certain compression codecs). If you have lots of tiny files, the cost of starting the worker processes can be disproportionately high compared to the data they are processing. This kind of block fragmentation also results in more Mapper tasks, increasing the overall job run times.
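To make the Mapper arithmetic concrete, here is a rough Python estimate that follows the one-Mapper-per-block rule above. The function name estimate_map_tasks, the 128 MiB block size, and the file sizes are assumptions chosen for illustration; the sketch ignores compression codecs and any input format that combines files.

```python
def estimate_map_tasks(file_sizes_bytes, block_size_bytes):
    """One map task per block; every non-empty file occupies at least one block."""
    return sum(
        -(-size // block_size_bytes)   # integer ceiling division
        for size in file_sizes_bytes
        if size > 0
    )


MiB = 1024 * 1024
BLOCK = 128 * MiB                    # assumed HDFS block size

few_large = [1024 * MiB] * 10        # ten files of 1 GiB each
many_small = [1 * MiB] * 10240       # 10,240 files of 1 MiB each

assert sum(few_large) == sum(many_small)   # same 10 GiB of data either way

print(estimate_map_tasks(few_large, BLOCK))    # 80
print(estimate_map_tasks(many_small, BLOCK))   # 10240
```

Under these assumptions the same 10 GiB of data needs 80 Mapper tasks as ten large files but 10,240 tasks as 1 MiB files, and each of those tasks pays its startup cost to process a single mebibyte.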
