Getting Started
Getting Started will guide you through the process of creating a simple Crunch pipeline to count the words in a text
document, which is the Hello World of distributed computing. Along the way, we'll explain the core Crunch concepts and
how to use them to create effective and efficient data pipelines.
Overview
The Apache Crunch project develops and supports Java APIs that simplify the process of creating data pipelines on top of
Apache Hadoop. The Crunch APIs are modeled after FlumeJava (PDF), which is the library that Google uses for building
data pipelines on top of their own implementation of MapReduce.
One of the most common questions we hear is how Crunch compares to other projects that provide abstractions on top of
MapReduce, such as Apache Pig, Apache Hive, and Cascading.
1. Developer focused. Apache Hive and Apache Pig were built to make MapReduce accessible to data analysts with
limited experience in Java programming. Crunch was designed for developers who understand Java and want to use
MapReduce effectively in order to write fast, reliable applications that need to meet tight SLAs. Crunch is often used in
conjunction with Hive and Pig; for example, a Crunch pipeline written by the development team sessionizes a set of user
logs, and the output it generates is then processed by a diverse collection of Pig scripts and Hive queries written by
analysts.
2. Minimal abstractions. Crunch pipelines provide a thin veneer on top of MapReduce. Developers have access to
low-level MapReduce APIs whenever they need them. This minimalism also means that Crunch is extremely fast, only
slightly slower than a hand-tuned pipeline developed with the MapReduce APIs, and the community is working on
making it faster all the time. That said, one of the goals of the project is portability, and the abstractions that Crunch
provides are designed to ease the transition from Hadoop 1.0 to Hadoop 2.0 and to provide transparent support for
future data processing frameworks that run on Hadoop, including Apache Spark and Apache Tez.
3. Flexible Data Model. Hive, Pig, and Cascading all use a tuple-centric data model that works best when your input data
can be represented using a named collection of scalar values, much like the rows of a database table. Crunch allows
developers considerable flexibility in how they represent their data, which makes Crunch the best pipeline platform for
developers working with complex structures like Apache Avro records or protocol buffers, geospatial and time series
data, and data stored in Apache HBase tables.
Which Version of Crunch Do I Need?
The core libraries are primarily developed against Hadoop 1.1.2, and are also tested against Hadoop 2.2.0. They should
work with any version of Hadoop 1.x after 1.0.3 and any version of Hadoop 2.x after 2.0.0-alpha, although you should note
that some of Hadoop 2.x's dependencies changed between 2.0.4-alpha and 2.2.0 (for example, the protocol buffer library
switched from 2.4.1 to 2.5.0). Crunch is also known to work with distributions from vendors like Cloudera, Hortonworks, and
IBM. The Crunch libraries are not compatible with versions of Hadoop prior to 1.x, such as 0.20.2.
If you're using the crunch-hbase library, please note that Crunch 0.9.0 switched to using HBase 0.96.0, while all prior
versions of crunch-hbase were developed against HBase 0.94.3.
Here are all of the currently recommended Crunch versions in one convenient table:
Hadoop Versions   HBase Versions   Recommended Crunch Version
1.x               0.96.x           0.12.0
2.x               0.96.x           0.12.0-hadoop2
Maven Dependencies
The Crunch project provides Maven artifacts on Maven Central of the form:
<dependency>
  <groupId>org.apache.crunch</groupId>
  <artifactId>crunch-core</artifactId>
  <version>${crunch.version}</version>
</dependency>
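The ${crunch.version} placeholder in the dependency above is an ordinary Maven property, so it has to be defined somewhere in your build. As a minimal sketch, assuming a project that targets Hadoop 2.x and follows the recommended version from the table above, the property could be set in the same pom.xml:

```xml
<properties>
  <!-- Assumption: the cluster runs Hadoop 2.x. For Hadoop 1.x, the
       version table above recommends 0.12.0 instead. -->
  <crunch.version>0.12.0-hadoop2</crunch.version>
</properties>
```

Keeping the version in a single property means that any additional Crunch artifacts you declare can reference the same value and stay in step with crunch-core.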
The crunch-core artifact contains the core libraries for planning and executing MapReduce pipelines. Depending on your
use case, you may also find the following artifacts useful:
crunch-test: Helper classes for integration testing of Crunch pipelines