Apache Crunch Beginner's Guide: Building Distributed Computing Pipelines


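Apache Crunch is a Java API developed by the Apache Software Foundation to simplify building data pipelines on top of Apache Hadoop. It borrows its design from Google's FlumeJava library and gives developers a toolkit for building distributed computing jobs. Compared with other MapReduce-based abstractions such as Apache Pig, Apache Hive, and Cascading, Crunch focuses on the developer experience and offers a more direct programming model.

The core concepts of Apache Crunch include Pipeline, PCollection, and DoFn. A Pipeline organizes the whole data-processing flow. A PCollection represents a distributed dataset, either read from a source or produced by a processing step. A DoFn defines the per-record processing logic applied to a PCollection, such as transforming or filtering records.

In the "Getting Started" tutorial, users learn how to create a simple Crunch pipeline that counts the words in a text document. The example shows how to use Crunch's basic components to implement a distributed job: how to define the input source, how to define the data transformations, and how to handle the output.

Compared with Pig and Hive, Crunch exposes a more direct programming interface: developers work with the data flow itself, with fewer intermediate abstraction layers, which helps performance and flexibility. The Crunch API also supports HBase 0.96.x, so pipelines can integrate with the NoSQL store for more complex processing.

Apache Crunch also has the following characteristics:
1. Rich library: Crunch provides predefined operations such as parallelDo, filter, and groupByKey, which make it easier to assemble complex processing logic quickly.
2. Error handling: Crunch jobs run on Hadoop and benefit from its failure detection and task retry, which improves the reliability of data processing.
3. Compile-time checking: because pipelines are written in statically typed Java, many mistakes in a pipeline are caught at compile time rather than at run time.
4. Resource optimization: Crunch optimizes the pipeline automatically, reducing unnecessary data copying and improving execution efficiency.

To get started with Apache Crunch, developers download the libraries and consult the user guide for details. They can also deepen their understanding of the project and contribute to it by working with the source code, subscribing to the mailing lists, following the issue tracker, and reading the wiki.

Apache Crunch is a powerful tool designed for developers. It simplifies data processing on Hadoop while keeping performance and scalability high. For developers who want to build efficient data-processing applications in Java, Crunch is worth considering.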


2015/10/18 ApacheCrunchGettingStarted
http://crunch.apache.org/gettingstarted.html 1/5
APACHECRUNCH
Overview
GettingStarted
UserGuide
Download
API(supporting
HBase0.96.x)
DEVELOPMENT
SourceCode
MailingLists
IssueTracking
Wiki
PROJECT
About
Bylaws
License
GettingStarted
GettingStartedwillguideyouthroughtheprocessofcreatingasimpleCrunchpipelinetocountthewordsinatext
document,whichistheHelloWorldofdistributedcomputing.Alongtheway,we'llexplainthecoreCrunchconceptsand
howtousethemtocreateeffectiveandefficientdatapipelines.
Overview
The Apache Crunch project develops and supports Java APIs that simplify the process of creating data pipelines on top of
Apache Hadoop. The Crunch APIs are modeled after FlumeJava (PDF), which is the library that Google uses for building
data pipelines on top of their own implementation of MapReduce.
One of the most common questions we hear is how Crunch compares to other projects that provide abstractions on top of
MapReduce, such as Apache Pig, Apache Hive, and Cascading.
1. Developerfocused.ApacheHiveandApachePigwerebuilttomakeMapReduceaccessibletodataanalystswith
limitedexperienceinJavaprogramming.CrunchwasdesignedfordeveloperswhounderstandJavaandwanttouse
MapReduceeffectivelyinordertowritefast,reliableapplicationsthatneedtomeettightSLAs.Crunchisoftenusedin
conjunctionwithHiveandPig;aCrunchpipelinewrittenbythedevelopmentteamsessionizesasetofuserlogs
generatesarethenprocessedbyadiversecollectionofPigscriptsandHivequerieswrittenbyanalysts.
2. Minimalabstractions.CrunchpipelinesprovideathinveneerontopofMapReduce.Developershaveaccesstolow
levelMapReduceAPIswhenevertheyneedthem.ThismimimalismalsomeansthatCrunchisextremelyfast,only
slightlyslowerthanahandtunedpipelinedevelopedwiththeMapReduceAPIs,andthecommunityisworkingon
makingitfasterallthetime.Thatsaid,oneofthegoalsoftheprojectisportability,andtheabstractionsthatCrunch
providesaredesignedtoeasethetransitionfromHadoop1.0toHadoop2.0andtoprovidetransparentsupportfor
futuredataprocessingframeworksthatrunonHadoop,includingApacheSparkandApacheTez.
3. FlexibleDataModel.Hive,Pig,andCascadingalluseatuplecentricdatamodelthatworksbestwhenyourinputdata
canberepresentedusinganamedcollectionofscalarvalues,muchliketherowsofadatabasetable.Crunchallows
developersconsiderableflexibilityinhowtheyrepresenttheirdata,whichmakesCrunchthebestpipelineplatformfor
developersworkingwithcomplexstructureslikeApacheAvrorecordsorprotocolbuffers,geospatialandtimeseries
data,anddatastoredinApacheHBasetables.
WhichVersionofCrunchDoINeed?
ThecorelibrariesareprimarilydevelopedagainstHadoop1.1.2,andarealsotestedagainstHadoop2.2.0.Theyshould
workwithanyversionofHadoop1.xafter1.0.3andanyversionofHadoop2.xafter2.0.0alpha,althoughyoushouldnote
thatsomeofHadoop2.x'sdependencieschangedbetween2.0.4alphaand2.2.0(forexample,theprotocolbufferlibrary
switchedfrom2.4.1to2.5.0.)CrunchisalsoknowntoworkwithdistributionsfromvendorslikeCloudera,Hortonworks,and
IBM.TheCrunchlibrariesarenotcompatiblewithversionofHadooppriorto1.x,suchas0.20.2.
Ifyou'reusingthecrunchhbaselibrary,pleasenotethatCrunch0.9.0switchedtousingHBase0.96.0,whileallprior
versionsofcrunchhbaseweredevelopedagainstHBase0.94.3.
HereareallofthecurrentlyrecommendedCrunchversionsinoneconvenienttable:
HadoopVersions HBaseVersions RecommendedCrunchVersion
1.x 0.96.x 0.12.0
2.x 0.96.x 0.12.0hadoop2
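The lookup in the table can be expressed as a small helper, for instance in a build script or a startup check. This is only a sketch: CrunchVersionChooser and recommendedCrunchVersion are hypothetical names, the 2.x value is written with a hyphenated -hadoop2 suffix on the assumption that the hyphen was lost in extraction, and the helper does not check the minimum patch levels mentioned in the text (1.0.3 and 2.0.0-alpha).

```java
import java.util.Optional;

// Hypothetical helper that encodes the recommended-version table above.
final class CrunchVersionChooser {
    private CrunchVersionChooser() {}

    /**
     * Maps a Hadoop version string to the Crunch version recommended in the table.
     * Returns an empty Optional for Hadoop lines the table does not cover,
     * such as 0.20.2, which the text lists as incompatible.
     */
    static Optional<String> recommendedCrunchVersion(String hadoopVersion) {
        if (hadoopVersion.startsWith("1.")) {
            return Optional.of("0.12.0");
        }
        if (hadoopVersion.startsWith("2.")) {
            return Optional.of("0.12.0-hadoop2"); // assumed hyphenated form
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(recommendedCrunchVersion("2.2.0").orElse("unsupported"));
    }
}
```

Returning an Optional rather than null makes the "no recommended version" case explicit for callers.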
MavenDependencies
TheCrunchprojectprovidesMavenartifactsonMavenCentraloftheform:
<dependency>
  <groupId>org.apache.crunch</groupId>
  <artifactId>crunch-core</artifactId>
  <version>${crunch.version}</version>
</dependency>
The crunch‐core artifactcontainsthecorelibrariesforplanningandexecutingMapReducepipelines.Dependingonyour
usecase,youmayalsofindthefollowingartifactsuseful:
crunch‐test :HelperclassesforintegrationtestingofCrunchpipelines