Apache Flume in Depth: Distributed Log Collection and Hadoop Integration

"Apache Flume: Distributed Log Collection for Hadoop" (Packt, 2nd edition, 2015). Apache Flume is a distributed, reliable, and available service designed for efficiently collecting, aggregating, and moving large volumes of log data. It is commonly used to stream logs from application servers into HDFS for ad hoc analysis. This book opens with an overview of Flume's architecture and its logical components, aiming to give readers a solid understanding of how Flume works and how to build and configure Flume agents that move streaming data and logs from your systems into Hadoop.

The book covers the following topics in detail:

1. Understanding the Flume architecture: its core components, including agents, sources, channels, and sinks, and how they work together to ensure stable data delivery.
2. Downloading and installing the open-source Flume distribution from the Apache website, an essential step for hands-on work.
3. Near-real-time (NRT) log transport: a detailed worked example that streams web logs to Kibana/Elasticsearch while archiving them in HDFS.
4. Tips and techniques for transporting logs and data safely and effectively in production environments.
5. Understanding and configuring the Hadoop File System (HDFS) sink in depth, the key step for landing data in Hadoop.
6. Solr integration: feeding data into Solr with the morphline-backed sink, extending Flume's data-processing capabilities.
7. Redundant data flows: creating fault-tolerant, reliable paths by configuring sink groups.
8. Configuring a variety of sources to ingest different types of data for different input scenarios.
9. Content-based routing: inspecting data records and routing them to multiple destinations based on payload content, enabling flexible data distribution.
10. Transforming data en route to Hadoop while monitoring the state of the data flow, ensuring data quality and observability.

The book takes a step-by-step approach, starting with simple features and progressively introducing more advanced ones, culminating in a complete, real-world end-to-end example. Whether you are a beginner or an experienced IT professional, it provides the knowledge and practical guidance you need to make the most of Apache Flume for log management and big-data analytics.
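Several of the topics above come down to wiring a source, a channel, and a sink together in an agent's properties file. As a rough illustration only (the agent name `a1`, the log path, and the HDFS path are hypothetical, not taken from the book), a minimal agent that tails a web-server log into HDFS might look like this:

```properties
# Hypothetical minimal Flume agent: one source, one channel, one HDFS sink.
# Agent name, component names, and all paths are illustrative assumptions.
a1.sources  = r1
a1.channels = c1
a1.sinks    = k1

# exec source: tail a web-server access log (fine for demos; not restart-safe)
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /var/log/nginx/access.log
a1.sources.r1.channels = c1

# memory channel: fast, but events are lost if the agent process dies
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 100

# HDFS sink: write plain text, rolling files by time and size
a1.sinks.k1.type = hdfs
a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://namenode:8020/flume/weblogs/%Y/%m/%d
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 300
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.useLocalTimeStamp = true
```

Such an agent would be started with something like `flume-ng agent -n a1 -c conf -f weblog-hdfs.conf`. The `hdfs.*` property names shown are standard HDFS sink options, but defaults and available options vary by Flume version, so check the user guide for the release you run.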
Apache Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. Its main goal is to deliver data from applications to Apache Hadoop's HDFS. It has a simple, flexible architecture based on streaming data flows, and it is robust and fault-tolerant with many failover and recovery mechanisms. Apache Flume: Distributed Log Collection for Hadoop covers the problems that arise with HDFS and streaming data/logs, and how Flume can resolve them. The book explains the generalized architecture of Flume, including moving data to and from databases and NoSQL-style data stores, as well as optimizing performance, and it includes real-world scenarios of Flume implementations.

The book starts with an architectural overview of Flume and then discusses each component in detail. It guides you through the complete installation and compilation of Flume and introduces channels and channel selectors. For each architectural component (sources, channels, sinks, channel processors, sink groups, and so on), the various implementations are covered in detail along with their configuration options, so you can customize Flume to your specific needs. Pointers are also given on writing custom implementations. By the end, you should be able to construct a series of Flume agents that transport your streaming data and logs from your systems into Hadoop in near real time.
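Two of the capabilities the blurb mentions, channel selectors and sink groups, are both expressed in the same properties format. A hedged sketch of each (the component names and the `datatype` event header are made up for illustration; in practice such a header would be set by an upstream interceptor or client):

```properties
# Hypothetical fragment: a multiplexing channel selector routes events by
# header value, and a failover sink group provides a redundant delivery path.
a1.sources  = r1
a1.channels = c1 c2
a1.sinks    = k1 k2

# Route on the (assumed) "datatype" event header
a1.sources.r1.channels = c1 c2
a1.sources.r1.selector.type = multiplexing
a1.sources.r1.selector.header = datatype
a1.sources.r1.selector.mapping.metrics = c1
a1.sources.r1.selector.mapping.logs = c2
a1.sources.r1.selector.default = c2

# Failover sink group: k1 is preferred; k2 takes over if k1 fails
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 10
a1.sinkgroups.g1.processor.priority.k2 = 5
a1.sinkgroups.g1.processor.maxpenalty = 10000
```

The `selector.*` and `sinkgroups.*` keys are standard Flume configuration, but this fragment omits the source, channel, and sink type definitions it would need to run; it is meant only to show how routing and redundancy are declared.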