Flume 1.7.0 用户指南：高效收集与数据传输

需积分: 10 133 浏览量更新于2024-07-19 收藏 1.68MB PDF 举报

Apache Flume用户手册提供了对这个强大的日志数据收集、聚合和传输工具的全面指南。作为Apache软件基金会的顶级项目，Flume专为高效地处理大量来自各种源（包括网络流量数据、社交媒体生成的数据、电子邮件以及任何可能的数据源）的事件数据而设计，其应用场景远超简单的日志数据聚合。 Flume 1.7.0版本用户手册强调了使用1.x系列的好处，因为这版提供了性能提升和配置灵活性，鼓励新老用户采用它。系统要求主要包括： 1. **Java运行环境**：Flume需要Java 1.7或更高版本作为其底层支持，确保运行环境的兼容性和稳定性。 2. **内存需求**：根据所使用的数据源、通道和sink配置，Flume需要足够的内存来处理实时数据流，这涉及到内存管理的有效性，尤其是在处理大规模数据时。 3. **磁盘空间**：尽管Flume主要关注实时数据传输，但存储环节也需要考虑磁盘空间，因为部分数据可能需要暂存，尤其是当数据量大或源不稳定时，可能会产生临时数据备份。 **概述**： Flume的核心理念是设计一个可靠的分布式框架，能够无缝地将来自多源的数据汇集到中央存储库。它通过定制化的数据源模块（如Kafka、syslog或HTTP监控器）接收数据，然后通过中间的通道（如Memory Channel、File Channel或JDBC Channel）进行缓冲和路由，最后将数据发送到目标sink（如HDFS、HBase或Solr），实现数据的持久化和分析。 **架构和组件**： Flume由以下关键组件组成： - **Source**：负责从原始数据源获取数据，可以是实时的网络流量，也可以是定期轮询的定时任务。 - **Channel**：是数据传输的临时存储区域，用于数据在不同组件之间的缓存和调度。 - **Interceptor**：可以在数据流中执行额外的操作，如过滤、转换或加密。 - **Sink**：接收并处理经过处理的数据，将其写入目标存储系统或执行进一步的处理。 **配置和灵活性**： Flume的配置灵活，允许用户根据实际场景调整每个组件的行为，例如设置数据传输的优先级、设置数据分片策略以及故障恢复机制。这使得Flume能够在不同的业务场景下提供定制化的解决方案。 **最佳实践**：为了充分利用Flume，用户应了解如何正确配置数据源、选择合适的通道类型、设置适当的拦截器以及配置sink以确保数据安全、可靠和高效的传输。同时，定期维护和监控系统的运行状态也是保障Flume性能的关键。 Apache Flume用户手册是一个详细的指南，帮助用户掌握如何设置、管理和优化这个强大的日志数据管道，使其成为现代数据架构中不可或缺的一部分。无论你是初次接触Flume还是寻求性能提升，这份手册都能为你提供所需的知识和工具。

Property

Name Default Description

ssl false Set this to true to enable SSL encryption. You must also specify a “keystore” and a “keystore-password”.

keystore – This is the path to a Java keystore ﬁle. Required for SSL.

keystore-

password

– The password for the Java keystore. Required for SSL.

keystore-type JKS The type of the Java keystore. This can be “JKS” or “PKCS12”.

exclude-

protocols

SSLv3 Space-separated list of SSL/TLS protocols to exclude. SSLv3 will always be excluded in addition to the protocols speciﬁed.

kerberos false Set to true to enable kerberos authentication. In kerberos mode, agent-principal and agent-keytab are required for successful

authentication. The Thrift source in secure mode, will accept connections only from Thrift clients that have kerberos enabled and are

successfully authenticated to the kerberos KDC.

agent-

principal

– The kerberos principal used by the Thrift Source to authenticate to the kerberos KDC.

agent-keytab —- The keytab location used by the Thrift Source in combination with the agent-principal to authenticate to the kerberos KDC.

Example for agent named a1:

a1.sources = r1

a1.channels = c1

a1.sources.r1.type = thrift

a1.sources.r1.channels = c1

a1.sources.r1.bind = 0.0.0.0

a1.sources.r1.port = 4141

Exec Source

Exec source runs a given Unix command on start-up and expects that process to continuously produce data on standard out (stderr is simply discarded, unless

property logStdErr is set to true). If the process exits for any reason, the source also exits and will produce no further data. This means conﬁgurations such as

cat [named pipe] or tail -F [file] are going to produce the desired results where as date will probably not - the former two commands produce streams of

data where as the latter produces a single event and exits.

Required properties are in bold.

Property

Name Default Description

Property

Name Default Description

channels – 

type – The component type name, needs to be exec

command – The command to execute

shell – A shell invocation used to run the command. e.g. /bin/sh -c. Required only for commands relying on shell features like wildcards,

back ticks, pipes etc.

restartThrottle 10000 Amount of time (in millis) to wait before attempting a restart

restart false Whether the executed cmd should be restarted if it dies

logStdErr false Whether the command’s stderr should be logged

batchSize 20 The max number of lines to read and send to the channel at a time

batchTimeout 3000 Amount of time (in milliseconds) to wait, if the buffer size was not reached, before data is pushed downstream

selector.type replicating replicating or multiplexing

selector.*  Depends on the selector.type value

interceptors – Space-separated list of interceptors

interceptors.*  

Warning: The problem with ExecSource and other asynchronous sources is that the source can not guarantee that if there is a failure to put the event into

the Channel the client knows about it. In such cases, the data will be lost. As a for instance, one of the most commonly requested features is the tail -F

[file]-like use case where an application writes to a log ﬁle on disk and Flume tails the ﬁle, sending each line as an event. While this is possible, there’s an

obvious problem; what happens if the channel ﬁlls up and Flume can’t send an event? Flume has no way of indicating to the application writing the log ﬁle

that it needs to retain the log or that the event hasn’t been sent, for some reason. If this doesn’t make sense, you need only know this: Your application can

never guarantee data has been received when using a unidirectional asynchronous interface such as ExecSource! As an extension of this warning - and to be

completely clear - there is absolutely zero guarantee of event delivery when using this source. For stronger reliability guarantees, consider the Spooling

Directory Source or direct integration with Flume via the SDK.

Note: You can use ExecSource to emulate TailSource from Flume 0.9x (ﬂume og). Just use unix command tail -F /full/path/to/your/file. Parameter -

F is better in this case than -f as it will also follow ﬁle rotation.

Example for agent named a1:

a1.sources = r1

a1.channels = c1

a1.sources.r1.type = exec

a1.sources.r1.command = tail -F /var/log/secure

a1.sources.r1.channels = c1

The ‘shell’ conﬁg is used to invoke the ‘command’ through a command shell (such as Bash or Powershell). The ‘command’ is passed as an argument to ‘shell’

for execution. This allows the ‘command’ to use features from the shell such as wildcards, back ticks, pipes, loops, conditionals etc. In the absence of the ‘shell’

conﬁg, the ‘command’ will be invoked directly. Common values for ‘shell’ : ‘/bin/sh -c’, ‘/bin/ksh -c’, ‘cmd /c’, ‘powershell -Command’, etc.

a1.sources.tailsource-1.type = exec

a1.sources.tailsource-1.shell = /bin/bash -c

a1.sources.tailsource-1.command = for i in /path/*.txt; do cat $i; done

JMS Source

JMS Source reads messages from a JMS destination such as a queue or topic. Being a JMS application it should work with any JMS provider but has only been

tested with ActiveMQ. The JMS source provides conﬁgurable batch size, message selector, user/pass, and message to ﬂume event converter. Note that the

vendor provided JMS jars should be included in the Flume classpath using plugins.d directory (preferred), –classpath on command line, or via

FLUME_CLASSPATH variable in ﬂume-env.sh.

Required properties are in bold.

Property Name Default Description

channels – 

type – The component type name, needs to be jms

initialContextFactory – Inital Context Factory, e.g: org.apache.activemq.jndi.ActiveMQInitialContextFactory

connectionFactory – The JNDI name the connection factory should appear as

providerURL – The JMS provider URL

destinationName – Destination name

destinationType – Destination type (queue or topic)

messageSelector – Message selector to use when creating the consumer

userName – Username for the destination/provider

passwordFile – File containing the password for the destination/provider

batchSize 100 Number of messages to consume in one batch

converter.type DEFAULT Class to use to convert messages to ﬂume events. See below.

converter.* – Converter properties.

converter.charset UTF-8 Default converter only. Charset to use when converting JMS TextMessages to byte arrays.

Converter

The JMS source allows pluggable converters, though it’s likely the default converter will work for most purposes. The default converter is able to convert Bytes,

Text, and Object messages to FlumeEvents. In all cases, the properties in the message are added as headers to the FlumeEvent.

BytesMessage:

Bytes of message are copied to body of the FlumeEvent. Cannot convert more than 2GB of data per message.

TextMessage:

Text of message is converted to a byte array and copied to the body of the FlumeEvent. The default converter uses UTF-8 by default but this is

conﬁgurable.

ObjectMessage:

Object is written out to a ByteArrayOutputStream wrapped in an ObjectOutputStream and the resulting array is copied to the body of the FlumeEvent.

Example for agent named a1:

a1.sources = r1

a1.channels = c1

a1.sources.r1.type = jms

a1.sources.r1.channels = c1

a1.sources.r1.initialContextFactory = org.apache.activemq.jndi.ActiveMQInitialContextFactory

a1.sources.r1.connectionFactory = GenericConnectionFactory

a1.sources.r1.providerURL = tcp://mqserver:61616

a1.sources.r1.destinationName = BUSINESS_DATA

a1.sources.r1.destinationType = QUEUE

Spooling Directory Source

This source lets you ingest data by placing ﬁles to be ingested into a “spooling” directory on disk. This source will watch the speciﬁed directory for new ﬁles,

and will parse events out of new ﬁles as they appear. The event parsing logic is pluggable. After a given ﬁle has been fully read into the channel, it is renamed to

indicate completion (or optionally deleted).

Unlike the Exec source, this source is reliable and will not miss data, even if Flume is restarted or killed. In exchange for this reliability, only immutable,

uniquely-named ﬁles must be dropped into the spooling directory. Flume tries to detect these problem conditions and will fail loudly if they are violated:

1. If a ﬁle is written to after being placed into the spooling directory, Flume will print an error to its log ﬁle and stop processing.

2. If a ﬁle name is reused at a later time, Flume will print an error to its log ﬁle and stop processing.

To avoid the above issues, it may be useful to add a unique identiﬁer (such as a timestamp) to log ﬁle names when they are moved into the spooling directory.

Despite the reliability guarantees of this source, there are still cases in which events may be duplicated if certain downstream failures occur. This is consistent

with the guarantees offered by other Flume components.

Property Name Default Description

channels – 

type – The component type name, needs to be spooldir.

spoolDir – The directory from which to read ﬁles from.

ﬁleSufﬁx .COMPLETED Sufﬁx to append to completely ingested ﬁles

deletePolicy never When to delete completed ﬁles: never or immediate

ﬁleHeader false Whether to add a header storing the absolute path ﬁlename.

ﬁleHeaderKey ﬁle Header key to use when appending absolute path ﬁlename to event header.

basenameHeader false Whether to add a header storing the basename of the ﬁle.

basenameHeaderKey basename Header Key to use when appending basename of ﬁle to event header.

includePattern ^.*$ Regular expression specifying which ﬁles to include. It can used together with ignorePattern. If a ﬁle matches

both ignorePattern and includePattern regex, the ﬁle is ignored.

剩余46页未读，继续阅读

朋沙

粉丝: 4

Flume 1.7.0 用户指南：高效收集与数据传输

最新资源