Understanding Apache Spark 2.3.0 in Depth: Core Concepts and Machine Learning

Updated 2024-07-18, 16.44 MB PDF

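"Mastering Apache Spark 2.3.0: an in-depth look at Spark's main architecture, with detailed coverage of the MLlib machine learning library."

Mastering Apache Spark 2.3.0 is a comprehensive guide to Apache Spark that focuses on version 2.3.0. The book explains Spark's core architecture in detail, helping readers understand how it works and what its main features are. Apache Spark is an open-source cluster computing system for large-scale data processing, widely used in the big data field because it is efficient, easy to use, and supports several data processing modes. In the part on Spark's main architecture, the book likely covers the distributed computing model, RDDs (Resilient Distributed Datasets), Spark SQL, DataFrame, Dataset, and Spark Streaming. Together these modules let Spark process large volumes of data quickly for batch processing, interactive queries, and real-time stream processing.

Spark MLlib is Spark's built-in machine learning library, and the book specifically covers MLlib together with the newer spark.ml framework. spark.ml provides a unified API for building machine learning pipelines (Pipeline), which makes data preprocessing, modeling, and evaluation more modular and reusable. In this part, readers learn how to use Transformers and Estimators for feature engineering and model training: for example, Tokenizer for splitting text into tokens, StringIndexer for converting categorical variables to numeric indices, classifiers (such as RandomForestClassifier and DecisionTreeClassifier), regression models (such as LinearRegression), and clustering models (such as KMeans).

The ML Pipeline chapter also looks closely at how to build and tune a Pipeline, including the concept of a PipelineStage and how to use an Evaluator to assess model performance. BinaryClassificationEvaluator, MulticlassClassificationEvaluator, and RegressionEvaluator evaluate binary classification, multiclass classification, and regression models respectively, while ClusteringEvaluator applies to clustering models.

The book may also cover other important Spark features, such as Spark SQL for structured data processing, DataFrame and Dataset as higher-level data abstractions, and Spark Streaming for processing real-time data streams. Finally, readers are introduced to Spark's fault-tolerance mechanisms, its scheduling strategies, and ways to optimize the performance of Spark applications. By studying Mastering Apache Spark 2.3.0, readers can master the fundamentals of Spark and learn how to use it for complex data analysis and machine learning projects.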

UsingSparkApplicationFrameworks,Sparksimplifiesaccesstomachinelearningand

predictiveanalyticsatscale.

SparkismainlywritteninScala,butprovidesdeveloperAPIforlanguageslikeJava,Python,

andR.

Note

Microsoft’s Mobius project provides a C# API for Spark "enabling the implementation of Spark driver program and data processing operations in the languages supported in the .NET framework like C# or F#."

Ifyouhavelargeamountsofdatathatrequireslowlatencyprocessingthatatypical

MapReduceprogramcannotprovide,Sparkisaviablealternative.

Accessanydatatypeacrossanydatasource.

Hugedemandforstorageanddataprocessing.

TheApacheSparkprojectisanumbrellaforSQL(withDatasets),streaming,machine

learning(pipelines)andgraphprocessingenginesbuiltatopSparkCore.Youcanrunthem

allinasingleapplicationusingaconsistentAPI.

Sparkrunslocallyaswellasinclusters,on-premisesorincloud.ItrunsontopofHadoop

YARN,ApacheMesos,standaloneorinthecloud(AmazonEC2orIBMBluemix).

Sparkcanaccessdatafrommanydatasources.

ApacheSpark’sStreamingandSQLprogrammingmodelswithMLlibandGraphXmakeit

easierfordevelopersanddatascientiststobuildapplicationsthatexploitmachinelearning

andgraphanalytics.

Atahighlevel,anySparkapplicationcreatesRDDsoutofsomeinput,run(lazy)

transformationsoftheseRDDstosomeotherform(shape),andfinallyperformactionsto

collectorstoredata.Notmuch,huh?

YoucanlookatSparkfromprogrammer’s,dataengineer’sandadministrator’spointofview.

Andtobehonest,allthreetypesofpeoplewillspendquitealotoftheirtimewithSparkto

finallyreachthepointwheretheyexploitalltheavailablefeatures.Programmersuse

language-specificAPIs(andworkatthelevelofRDDsusingtransformationsandactions),

dataengineersusehigher-levelabstractionslikeDataFramesorPipelinesAPIsorexternal

tools(thatconnecttoSpark),andfinallyitallcanonlybepossibletorunbecause

administratorssetupSparkclusterstodeploySparkapplicationsto.

ItisSpark’sgoaltobeageneral-purposecomputingplatformwithvariousspecialized

applicationsframeworksontopofasingleunifiedengine.

OverviewofApacheSpark

Note

Whenyouhear"ApacheSpark"itcanbetwothings — theSparkengineaka

SparkCoreortheApacheSparkopensourceprojectwhichisan"umbrella"

termforSparkCoreandtheaccompanyingSparkApplicationFrameworks,i.e.

SparkSQL,SparkStreaming,SparkMLlibandSparkGraphXthatsitontopof

SparkCoreandthemaindataabstractioninSparkcalledRDD-Resilient

DistributedDataset.

WhySpark

Let’slistafewofthemanyreasonsforSpark.Wearedoingitfirst,andthencomesthe

overviewthatlendsamoretechnicalhelpinghand.

EasytoGetStarted

Sparkoffersspark-shellthatmakesforaveryeasyheadstarttowritingandrunningSpark

applicationsonthecommandlineonyourlaptop.

YoucouldthenuseSparkStandalonebuilt-inclustermanagertodeployyourSpark

applicationstoaproduction-gradeclustertorunonafulldataset.

UnifiedEngineforDiverseWorkloads

AssaidbyMateiZaharia-theauthorofApacheSpark-inIntroductiontoAmpLabSpark

Internalsvideo(quotingwithfewchanges):

OneoftheSparkprojectgoalswastodeliveraplatformthatsupportsaverywidearray

ofdiverseworkflows-notonlyMapReducebatchjobs(therewereavailablein

Hadoopalreadyatthattime),butalsoiterativecomputationslikegraphalgorithmsor

MachineLearning.

Andalsodifferentscalesofworkloadsfromsub-secondinteractivejobstojobsthatrun

formanyhours.

Sparkcombinesbatch,interactive,andstreamingworkloadsunderonerichconciseAPI.

Sparksupportsnearreal-timestreamingworkloadsviaSparkStreamingapplication

framework.

ETLworkloadsandAnalyticsworkloadsaredifferent,howeverSparkattemptstooffera

unifiedplatformforawidevarietyofworkloads.

GraphandMachineLearningalgorithmsareiterativebynatureandlesssavestodiskor

transfersovernetworkmeansbetterperformance.

ThereisalsosupportforinteractiveworkloadsusingSparkshell.

OverviewofApacheSpark

YoushouldwatchthevideoWhatisApacheSpark?byMikeOlson,ChiefStrategyOfficer

andCo-FounderatCloudera,whoprovidesaveryexceptionaloverviewofApacheSpark,its

riseinpopularityintheopensourcecommunity,andhowSparkisprimedtoreplace

MapReduceasthegeneralprocessingengineinHadoop.

LeveragestheBestindistributedbatchdataprocessing

Whenyouthinkaboutdistributedbatchdataprocessing,Hadoopnaturallycomestomind

asaviablesolution.

SparkdrawsmanyideasoutofHadoopMapReduce.Theyworktogetherwell-Sparkon

YARNandHDFS-whileimprovingontheperformanceandsimplicityofthedistributed

computingengine.

Formany,SparkisHadoop++,i.e.MapReducedoneinabetterway.

Anditshouldnotcomeasasurprise,withoutHadoopMapReduce(itsadvancesand

deficiencies),Sparkwouldnothavebeenbornatall.

RDD-DistributedParallelScalaCollections

AsaScaladeveloper,youmayfindSpark’sRDDAPIverysimilar(ifnotidentical)toScala’s

CollectionsAPI.

ItisalsoexposedinJava,PythonandR(aswellasSQL,i.e.SparkSQL,inasense).

So,whenyouhaveaneedfordistributedCollectionsAPIinScala,SparkwithRDDAPI

shouldbeaseriouscontender.

RichStandardLibrary

Notonlycanyouusemapand reduce(asinHadoopMapReducejobs)inSpark,butalso

avastarrayofotherhigher-leveloperatorstoeaseyourSparkqueriesandapplication

development.

Itexpandedontheavailablecomputationstylesbeyondtheonlymap-and-reduceavailable

inHadoopMapReduce.

Unifieddevelopmentanddeploymentenvironmentforall

RegardlessoftheSparktoolsyouuse-theSparkAPIforthemanyprogramminglanguages

supported-Scala,Java,Python,R,ortheSparkshell,orthemanySparkApplication

FrameworksleveragingtheconceptofRDD,i.e.SparkSQL,SparkStreaming,SparkMLlib

OverviewofApacheSpark

andSparkGraphX,youstillusethesamedevelopmentanddeploymentenvironmenttofor

largedatasetstoyieldaresult,beitaprediction(SparkMLlib),astructureddataqueries

(SparkSQL)orjustalargedistributedbatch(SparkCore)orstreaming(SparkStreaming)

computation.

It’salsoveryproductiveofSparkthatteamscanexploitthedifferentskillstheteam

membershaveacquiredsofar.Dataanalysts,datascientists,Pythonprogrammers,orJava,

orScala,orR,canallusethesameSparkplatformusingtailor-madeAPI.Itmakesfor

bringingskilledpeoplewiththeirexpertiseindifferentprogramminglanguagestogethertoa

Sparkproject.

InteractiveExploration/ExploratoryAnalytics

Itisalsocalledadhocqueries.

UsingtheSparkshellyoucanexecutecomputationstoprocesslargeamountofdata(The

BigData).It’sallinteractiveandveryusefultoexplorethedatabeforefinalproduction

release.

Also,usingtheSparkshellyoucanaccessanySparkclusterasifitwasyourlocalmachine.

JustpointtheSparkshelltoa20-nodeof10TBRAMmemoryintotal(using --master)and

useallthecomponents(andtheirabstractions)likeSparkSQL,SparkMLlib,Spark

Streaming,andSparkGraphX.

Dependingonyourneedsandskills,youmayseeabetterfitforSQLvsprogrammingAPIs

orapplymachinelearningalgorithms(SparkMLlib)fromdataingraphdatastructures

(SparkGraphX).

SingleEnvironment

Regardlessofwhichprogramminglanguageyouaregoodat,beitScala,Java,Python,Ror

SQL,youcanusethesamesingleclusteredruntimeenvironmentforprototyping,adhoc

queries,anddeployingyourapplicationsleveragingthemanyingestiondatapointsoffered

bytheSparkplatform.

Youcanbeaslow-levelasusingRDDAPIdirectlyorleveragehigher-levelAPIsofSpark

SQL(Datasets),SparkMLlib(MLPipelines),SparkGraphX(Graphs)orSparkStreaming

(DStreams).

Orusethemallinasingleapplication.

Thesingleprogrammingmodelandexecutionenginefordifferentkindsofworkloads

simplifydevelopmentanddeploymentarchitectures.

OverviewofApacheSpark

DataIntegrationToolkitwithRichSetofSupportedData

Sources

Sparkcanreadfrommanytypesofdatasources — relational,NoSQL,filesystems,etc. —

usingmanytypesofdataformats-Parquet,Avro,CSV,JSON.

Both,inputandoutputdatasources,allowprogrammersanddataengineersuseSparkas

theplatformwiththelargeamountofdatathatisreadfromorsavedtoforprocessing,

interactively(usingSparkshell)orinapplications.

Toolsunavailablethen,atyourfingertipsnow

Asmuchandoftenasit’srecommendedtopicktherighttoolforthejob,it’snotalways

feasible.Time,personalpreference,operatingsystemyouworkonareallfactorstodecide

whatisrightatatime(andusingahammercanbeareasonablechoice).

Sparkembracesmanyconceptsinasingleunifieddevelopmentandruntimeenvironment.

Machinelearningthatissotool-andfeature-richinPython,e.g.SciKitlibrary,cannow

beusedbyScaladevelopers(asPipelineAPIinSparkMLliborcallingpipe()).

DataFramesfromRareavailableinScala,Java,Python,RAPIs.

Singlenodecomputationsinmachinelearningalgorithmsaremigratedtotheir

distributedversionsinSparkMLlib.

ThissingleplatformgivesplentyofopportunitiesforPython,Scala,Java,andR

programmersaswellasdataengineers(SparkR)andscientists(usingproprietaryenterprise

datawarehouseswithThriftJDBC/ODBCServerinSparkSQL).

Mindtheproverbifallyouhaveisahammer,everythinglookslikeanail,too.

Low-levelOptimizations

ApacheSparkusesadirectedacyclicgraph(DAG)ofcomputationstages(akaexecution

DAG).Itpostponesanyprocessinguntilreallyrequiredforactions.Spark’slazyevaluation

givesplentyofopportunitiestoinducelow-leveloptimizations(sousershavetoknowlessto

domore).

Mindtheproverblessismore.

Excelsatlow-latencyiterativeworkloads

OverviewofApacheSpark

剩余1198页未读，继续阅读

RacingHeart

粉丝: 62
资源: 6

深入理解Apache Spark 2.3.0：核心概念与机器学习

mastering-apache-spark最好的spark教程

mastering-apache-spark2.4.2.pdf

Mastering-Advanced-Analytics-With-Apache-Spark

mastering-spark-sql

mastering-spark-sql.pdf

Mastering-Spark

Mastering Apache Spark

mastering apache spark

Mastering-Machine-Learning-on-AWS:Packt发行的AWS上的精通机器学习

Mastering Apache Spark(掌握Apache Spark)英文版.pdf

最新资源