掌握Apache Spark：大数据分析与机器学习关键组件详解

spark

需积分: 9 13 浏览量更新于2024-07-19 收藏 16.32MB PDF 举报

身份认证购VIP最低享 7 折!

30元优惠券

"《精通Apache Spark》是一份详尽的大数据必备教程，专注于介绍Apache Spark的各个方面。Spark是分布式计算框架，被广泛应用于大数据处理和机器学习任务。本指南将带你深入理解Spark的核心概念，包括： 1. **概述**：首先，我们会对Spark进行总体介绍，强调其在大数据处理中的关键角色，以及它如何通过内存计算加速数据分析。 2. **Spark MLlib**：Spark MLlib是Spark机器学习库，它是构建和部署大规模机器学习模型的基础。主要内容涵盖： - **MLPipelines (spark.ml)**：提供了一种模块化的方式来构建机器学习流程，包括Pipeline和PipelineStage。 - **Transformers**：数据预处理组件，如Tokenizer用于文本特征的分割，是Estimators（如StringIndexer）的后续步骤。 - **Estimators**：负责训练模型的组件，如StringIndexer用于转换类别变量为整数，KMeans则执行聚类任务。 - **TrainValidationSplit**：用于模型评估和选择的工具，如随机森林回归器（RandomForestRegressor）和线性回归（LinearRegression）。 - **Evaluator**：用于模型评估的标准组件，如BinaryClassificationEvaluator评估二分类模型，ClusteringEvaluator评估聚类模型。 2.2.6节更深入地讨论了特定的模型和评估器，如决策树分类器（DecisionTreeClassifier）、随机森林分类器（RandomForestClassifier），以及各种评估指标。 3. **模型与性能评估**：这部分详细介绍了RegressionEvaluator等用于评估回归模型性能的工具。此外，指南还涵盖了其他主题，如数据处理、Spark的分布式架构、存储选项（如Hadoop Distributed File System, HDFS）以及Spark的生态系统的其他组成部分。通过这些章节，读者不仅能掌握如何在实际项目中使用Spark，还能深入了解其内部工作原理和优化策略。无论你是初次接触Spark还是希望提升现有技能，这份资源都提供了丰富的学习材料和实践案例，助你在大数据分析领域取得成功。"

资源详情

资源推荐

UsingSparkApplicationFrameworks,Sparksimplifiesaccesstomachinelearningand

predictiveanalyticsatscale.

SparkismainlywritteninScala,butprovidesdeveloperAPIforlanguageslikeJava,Python,

andR.

Note

Microsoft’sMobiusprojectprovidesC#APIforSpark"enablingthe

implementationofSparkdriverprogramanddataprocessingoperationsinthe

languagessupportedinthe.NETframeworklikeC#orF#."

Ifyouhavelargeamountsofdatathatrequireslowlatencyprocessingthatatypical

MapReduceprogramcannotprovide,Sparkisaviablealternative.

Accessanydatatypeacrossanydatasource.

Hugedemandforstorageanddataprocessing.

TheApacheSparkprojectisanumbrellaforSQL(withDatasets),streaming,machine

learning(pipelines)andgraphprocessingenginesbuiltatopSparkCore.Youcanrunthem

allinasingleapplicationusingaconsistentAPI.

Sparkrunslocallyaswellasinclusters,on-premisesorincloud.ItrunsontopofHadoop

YARN,ApacheMesos,standaloneorinthecloud(AmazonEC2orIBMBluemix).

Sparkcanaccessdatafrommanydatasources.

ApacheSpark’sStreamingandSQLprogrammingmodelswithMLlibandGraphXmakeit

easierfordevelopersanddatascientiststobuildapplicationsthatexploitmachinelearning

andgraphanalytics.

Atahighlevel,anySparkapplicationcreatesRDDsoutofsomeinput,run(lazy)

transformationsoftheseRDDstosomeotherform(shape),andfinallyperformactionsto

collectorstoredata.Notmuch,huh?

YoucanlookatSparkfromprogrammer’s,dataengineer’sandadministrator’spointofview.

Andtobehonest,allthreetypesofpeoplewillspendquitealotoftheirtimewithSparkto

finallyreachthepointwheretheyexploitalltheavailablefeatures.Programmersuse

language-specificAPIs(andworkatthelevelofRDDsusingtransformationsandactions),

dataengineersusehigher-levelabstractionslikeDataFramesorPipelinesAPIsorexternal

tools(thatconnecttoSpark),andfinallyitallcanonlybepossibletorunbecause

administratorssetupSparkclusterstodeploySparkapplicationsto.

ItisSpark’sgoaltobeageneral-purposecomputingplatformwithvariousspecialized

applicationsframeworksontopofasingleunifiedengine.

OverviewofApacheSpark

Note

Whenyouhear"ApacheSpark"itcanbetwothings — theSparkengineaka

SparkCoreortheApacheSparkopensourceprojectwhichisan"umbrella"

termforSparkCoreandtheaccompanyingSparkApplicationFrameworks,i.e.

SparkSQL,SparkStreaming,SparkMLlibandSparkGraphXthatsitontopof

SparkCoreandthemaindataabstractioninSparkcalledRDD-Resilient

DistributedDataset.

WhySpark

Let’slistafewofthemanyreasonsforSpark.Wearedoingitfirst,andthencomesthe

overviewthatlendsamoretechnicalhelpinghand.

EasytoGetStarted

Sparkoffersspark-shellthatmakesforaveryeasyheadstarttowritingandrunningSpark

applicationsonthecommandlineonyourlaptop.

YoucouldthenuseSparkStandalonebuilt-inclustermanagertodeployyourSpark

applicationstoaproduction-gradeclustertorunonafulldataset.

UnifiedEngineforDiverseWorkloads

AssaidbyMateiZaharia-theauthorofApacheSpark-inIntroductiontoAmpLabSpark

Internalsvideo(quotingwithfewchanges):

OneoftheSparkprojectgoalswastodeliveraplatformthatsupportsaverywidearray

ofdiverseworkflows-notonlyMapReducebatchjobs(therewereavailablein

Hadoopalreadyatthattime),butalsoiterativecomputationslikegraphalgorithmsor

MachineLearning.

Andalsodifferentscalesofworkloadsfromsub-secondinteractivejobstojobsthatrun

formanyhours.

Sparkcombinesbatch,interactive,andstreamingworkloadsunderonerichconciseAPI.

Sparksupportsnearreal-timestreamingworkloadsviaSparkStreamingapplication

framework.

ETLworkloadsandAnalyticsworkloadsaredifferent,howeverSparkattemptstooffera

unifiedplatformforawidevarietyofworkloads.

GraphandMachineLearningalgorithmsareiterativebynatureandlesssavestodiskor

transfersovernetworkmeansbetterperformance.

ThereisalsosupportforinteractiveworkloadsusingSparkshell.

OverviewofApacheSpark

YoushouldwatchthevideoWhatisApacheSpark?byMikeOlson,ChiefStrategyOfficer

andCo-FounderatCloudera,whoprovidesaveryexceptionaloverviewofApacheSpark,its

riseinpopularityintheopensourcecommunity,andhowSparkisprimedtoreplace

MapReduceasthegeneralprocessingengineinHadoop.

LeveragestheBestindistributedbatchdataprocessing

Whenyouthinkaboutdistributedbatchdataprocessing,Hadoopnaturallycomestomind

asaviablesolution.

SparkdrawsmanyideasoutofHadoopMapReduce.Theyworktogetherwell-Sparkon

YARNandHDFS-whileimprovingontheperformanceandsimplicityofthedistributed

computingengine.

Formany,SparkisHadoop++,i.e.MapReducedoneinabetterway.

Anditshouldnotcomeasasurprise,withoutHadoopMapReduce(itsadvancesand

deficiencies),Sparkwouldnothavebeenbornatall.

RDD-DistributedParallelScalaCollections

AsaScaladeveloper,youmayfindSpark’sRDDAPIverysimilar(ifnotidentical)toScala’s

CollectionsAPI.

ItisalsoexposedinJava,PythonandR(aswellasSQL,i.e.SparkSQL,inasense).

So,whenyouhaveaneedfordistributedCollectionsAPIinScala,SparkwithRDDAPI

shouldbeaseriouscontender.

RichStandardLibrary

Notonlycanyouusemapand reduce(asinHadoopMapReducejobs)inSpark,butalso

avastarrayofotherhigher-leveloperatorstoeaseyourSparkqueriesandapplication

development.

Itexpandedontheavailablecomputationstylesbeyondtheonlymap-and-reduceavailable

inHadoopMapReduce.

Unifieddevelopmentanddeploymentenvironmentforall

RegardlessoftheSparktoolsyouuse-theSparkAPIforthemanyprogramminglanguages

supported-Scala,Java,Python,R,ortheSparkshell,orthemanySparkApplication

FrameworksleveragingtheconceptofRDD,i.e.SparkSQL,SparkStreaming,SparkMLlib

OverviewofApacheSpark

andSparkGraphX,youstillusethesamedevelopmentanddeploymentenvironmenttofor

largedatasetstoyieldaresult,beitaprediction(SparkMLlib),astructureddataqueries

(SparkSQL)orjustalargedistributedbatch(SparkCore)orstreaming(SparkStreaming)

computation.

It’salsoveryproductiveofSparkthatteamscanexploitthedifferentskillstheteam

membershaveacquiredsofar.Dataanalysts,datascientists,Pythonprogrammers,orJava,

orScala,orR,canallusethesameSparkplatformusingtailor-madeAPI.Itmakesfor

bringingskilledpeoplewiththeirexpertiseindifferentprogramminglanguagestogethertoa

Sparkproject.

InteractiveExploration/ExploratoryAnalytics

Itisalsocalledadhocqueries.

UsingtheSparkshellyoucanexecutecomputationstoprocesslargeamountofdata(The

BigData).It’sallinteractiveandveryusefultoexplorethedatabeforefinalproduction

release.

Also,usingtheSparkshellyoucanaccessanySparkclusterasifitwasyourlocalmachine.

JustpointtheSparkshelltoa20-nodeof10TBRAMmemoryintotal(using --master)and

useallthecomponents(andtheirabstractions)likeSparkSQL,SparkMLlib,Spark

Streaming,andSparkGraphX.

Dependingonyourneedsandskills,youmayseeabetterfitforSQLvsprogrammingAPIs

orapplymachinelearningalgorithms(SparkMLlib)fromdataingraphdatastructures

(SparkGraphX).

SingleEnvironment

Regardlessofwhichprogramminglanguageyouaregoodat,beitScala,Java,Python,Ror

SQL,youcanusethesamesingleclusteredruntimeenvironmentforprototyping,adhoc

queries,anddeployingyourapplicationsleveragingthemanyingestiondatapointsoffered

bytheSparkplatform.

Youcanbeaslow-levelasusingRDDAPIdirectlyorleveragehigher-levelAPIsofSpark

SQL(Datasets),SparkMLlib(MLPipelines),SparkGraphX(Graphs)orSparkStreaming

(DStreams).

Orusethemallinasingleapplication.

Thesingleprogrammingmodelandexecutionenginefordifferentkindsofworkloads

simplifydevelopmentanddeploymentarchitectures.

OverviewofApacheSpark

DataIntegrationToolkitwithRichSetofSupportedData

Sources

Sparkcanreadfrommanytypesofdatasources — relational,NoSQL,filesystems,etc. —

usingmanytypesofdataformats-Parquet,Avro,CSV,JSON.

Both,inputandoutputdatasources,allowprogrammersanddataengineersuseSparkas

theplatformwiththelargeamountofdatathatisreadfromorsavedtoforprocessing,

interactively(usingSparkshell)orinapplications.

Toolsunavailablethen,atyourfingertipsnow

Asmuchandoftenasit’srecommendedtopicktherighttoolforthejob,it’snotalways

feasible.Time,personalpreference,operatingsystemyouworkonareallfactorstodecide

whatisrightatatime(andusingahammercanbeareasonablechoice).

Sparkembracesmanyconceptsinasingleunifieddevelopmentandruntimeenvironment.

Machinelearningthatissotool-andfeature-richinPython,e.g.SciKitlibrary,cannow

beusedbyScaladevelopers(asPipelineAPIinSparkMLliborcallingpipe()).

DataFramesfromRareavailableinScala,Java,Python,RAPIs.

Singlenodecomputationsinmachinelearningalgorithmsaremigratedtotheir

distributedversionsinSparkMLlib.

ThissingleplatformgivesplentyofopportunitiesforPython,Scala,Java,andR

programmersaswellasdataengineers(SparkR)andscientists(usingproprietaryenterprise

datawarehouseswithThriftJDBC/ODBCServerinSparkSQL).

Mindtheproverbifallyouhaveisahammer,everythinglookslikeanail,too.

Low-levelOptimizations

ApacheSparkusesadirectedacyclicgraph(DAG)ofcomputationstages(akaexecution

DAG).Itpostponesanyprocessinguntilreallyrequiredforactions.Spark’slazyevaluation

givesplentyofopportunitiestoinducelow-leveloptimizations(sousershavetoknowlessto

domore).

Mindtheproverblessismore.

Excelsatlow-latencyiterativeworkloads

OverviewofApacheSpark

剩余1191页未读，继续阅读

凤雏0618

粉丝: 0
资源: 2

掌握Apache Spark：大数据分析与机器学习关键组件详解

Mastering Apache Spark 2.X(2nd) epub

Mastering Apache Spark 无水印pdf 0分

mastering-apache-spark最好的spark教程

mastering apache pulsar pdf

mastering go 2 pdf

maven参考文献近三年

mastering docker 4 pdf

发那科mastering恢复

mastering docker – second edition

Flink推荐学习书籍

mastering sfml game development pdf

mastering your phd 的pdf版

mastering_go_zh_cn-master

mastering go epub

mastering cmake中文版

mastering delphi

mastering freeswitch pdf

mastering regular expressions 3rd 中文版

mastering opencv with practical projects源码

mastering the freertos real time kernel中文

最新资源