Apache Flume in Practice: Flexible, Scalable Data Streaming


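"Using Flume: Flexible, Scalable, and Reliable Data Streaming", written by Hari Shreedharan and published by O'Reilly Media in 2014, is a professional guide to real-time data streaming with Apache Flume. The book aims to help operations engineers configure, deploy, and monitor Flume clusters, and teaches developers how to write custom plug-ins for specific needs. It includes an in-depth look at Flume's design and implementation, and at the features that make it highly scalable, flexible, and reliable.

Apache Flume is a tool for collecting, aggregating, and writing large volumes of streaming data to systems such as the Hadoop Distributed File System (HDFS), Apache HBase, SolrCloud, and Elasticsearch. It provides a steady flow rate by acting as a buffer between data producers and data consumers. The book covers:

1. **Introduction to Apache Hadoop and Apache HBase**: the two key big data storage and processing frameworks, their role in the big data ecosystem, and how Flume integrates with them.
2. **Streaming data with Apache Flume**: how Flume works and how it delivers data in near real time.
3. **Sources**: the different types of Flume sources, which receive data from origins such as log files and network sockets.
4. **Channels**: how Flume uses channels to store data so that it stays durable and reliable in transit.
5. **Sinks**: how to configure and use Flume sinks to write data to target storage systems such as HDFS and HBase.
6. **Interceptors, channel selectors, sink groups, and sink processors**: components that allow custom processing of data, such as filtering, transformation, and formatting.
7. **Sending data to Flume**: how to send data from custom applications to Flume agents through the API.
8. **Planning, deploying, and monitoring Flume**: how to plan a Flume cluster architecture around your requirements and how to monitor a running cluster to keep it stable.

The book also provides code examples and exercises that deepen the reader's understanding of Flume in practice. Both operations engineers who want to improve their Flume skills and developers who want to build custom components can benefit from it. After working through it, you should be able to build and manage an efficient, flexible, and reliable Flume data streaming system.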

HowtoContactUs

Pleaseaddresscommentsandquestionsconcerningthisbooktothepublisher:

O’ReillyMedia,Inc.

1005GravensteinHighwayNorth

Sebastopol,CA95472

800-998-9938(intheUnitedStatesorCanada)

707-829-0515(internationalorlocal)

707-829-0104(fax)

Wehaveawebpageforthisbook,wherewelisterrata,examples,andanyadditional

information.Youcanaccessthispageathttp://bit.ly/using-flume.

Tocommentorasktechnicalquestionsaboutthisbook,sendemailto

bookquestions@oreilly.com.

Formoreinformationaboutourbooks,courses,conferences,andnews,seeourwebsiteat

http://www.oreilly.com.

FindusonFacebook:http://facebook.com/oreilly

FollowusonTwitter:http://twitter.com/oreillymedia

WatchusonYouTube:http://www.youtube.com/oreillymedia

HDFS

AtthecoreofHadoopisadistributedfilesystemreferredtoasHDFS.HDFSisahighly

distributed,fault-tolerantfilesystemthatisspecificallybuilttorunoncommodity

hardwareandtoscaleasmoredataisaddedbysimplyaddingmorehardware.HDFScan

beconfiguredtoreplicatedataseveraltimesondifferentmachinestoensurethatthereis

nodataloss,evenifamachineholdingthedatafails.Replicatingdataalsoallowsthe

systemtobehighlyavailableevenifmachinesholdingacopyofthedataaredisconnected

fromthenetworkorgodown.ThissectionwillbrieflycoverthedesignofHDFSandthe

variousprocessingsystemsthatrunontopofHDFS.

HDFSwasoriginallydesignedonthebasisoftheGoogleFileSystem[gfs].HDFSisa

distributedsystemthatcanstoredataonthousandsofoff-the-shelfservers,withnospecial

requirementsforhardwareconfiguration.ThismeansHDFSdoesnotrequiretheuseof

storageareanetworks(SANs),expensivenetworkconfiguration,oranyspecialdisks.

HDFScanberunonanyrun-of-the-milldatacentersetup.HDFSreplicatesalldata

writtentoit,basedonthereplicationfactorconfiguredbytheuser.Thedefaultreplication

factoris3,whichensuresthatanydatawrittentoHDFSisreplicatedonthreedifferent

serverswithinthecluster.Thisgreatlyreducesthepossibilitythatanydatawrittento

HDFSwillbelost.

HDFS,likeanyotherfilesystem,writesdatatoindividualblocks.EachHDFSfile

consistsofatleastoneblock.Eachfileconsistsofmultipleblocks,basedonthesizeofthe

file.HDFSisdesignedtoholdverylargefiles.Therefore,HDFSblocksizesarealso

usuallyprettylargecomparedtootherfilesystems.HDFSblocksizesareconfigurable,

andinmostcasesrangebetween128MBto512MB.HDFStriestoensurethateach

blockisreplicatedbasedonthereplicationfactor,thusensuringthefileitselfisreplicated

asmuchasthereplicationfactor.HDFSisrack-aware,andthedefaultblockplacement

policytriestoensurethateachreplicaofablockisonadifferentrack.

HDFSconsistsoftwotypesofservers:namenodesanddatanodes.MostHadoopclusters

generallyhavetwonamenodesandseveraldatanodes.Datanodesarethenodesonwhich

thedataisstored.Atanypointintime,thereisoneactivenamenodeandanoptional

standbynamenode.Theactivenamenodeisthecurrentlyactivenamenodethatserves

clientandotherdatanodes.Thestandbynamenodeisanactivebackuptotheprimary,

andtakesoveriftheactivenamenodegoesdownorisnolongeraccessibleforsome

reason.Namenodesareresponsibleforstoringmetadataaboutfilesandblocksonthefile

system.Thenamenodemapseveryfiletothelistofblocksthatthefileconsistsof.The

namenodealsoholdsinformationabouteachblock’slocation—whichdatanodesthe

blockisstoredonandwhereonthedatanodeitis.

Eachclientwriteisinitiallywrittentoalocalfileontheclientmachine,untiltheclient

flushesthefileorclosesitorthesizeofthetemporaryfileexceedsablockboundary.At

thispoint,thefileiscreated(oranewblockisaddedifnewdataisbeingwrittenoncea

blockboundaryiscrossedoranexistingfileisreopenedforappend)andthenamenode

assignsblockstoit.Thenthedataiswrittentoeachblock,whichisreplicatedtomultiple

datanodes,oneafteranother.Theoperationissuccessfulonlyifallthedatanodes

succesfullyreplicatetheblocks.
