Mastering Apache Spark: Machine Learning and Pipelines in Practice

Updated 2024-07-18 · PDF, 16.31 MB

"Mastering Apache Spark 是一本专注于 Spark 技术的权威指南，被誉为 Spark 最佳文档，涵盖了广泛的 Spark 应用和概念。" 在《Mastering Apache Spark》这本书中，作者深入浅出地介绍了Apache Spark的核心特性、用法以及机器学习库Spark MLlib。Spark作为一个快速、通用且可扩展的大数据处理框架，它提供了分布式内存计算，使得大规模数据处理变得更加高效。 1. Spark概述 Spark 提供了一个统一的平台，用于处理批处理、交互式查询、实时流处理和机器学习任务。它的核心组件包括Spark Core、Spark SQL、Spark Streaming、MLlib以及GraphX，这些组件协同工作，构建了一个强大的大数据处理生态系统。 2. Spark MLlib Spark MLlib是Spark的机器学习库，它包含了各种机器学习算法如分类、回归、聚类、协同过滤，以及模型选择和评估工具。书中详细讨论了如何使用这些算法。 - MLPipelines (spark.ml) Spark MLlib通过spark.ml模块提供了一种管道（Pipeline）机制，使得机器学习流程可以被建模为一系列可组合的转换步骤，简化了模型开发和部署。 - Pipeline组件 - Pipeline：将多个`PipelineStage`（转换器或估计器）组织成一个可执行的流水线。 - PipelineStage：表示Pipeline中的一个步骤，可以是Transformer或Estimator。 - Transformers：对数据进行转换的操作，如特征提取、归一化等。 - Estimators：训练模型的算法，如分类器、回归器等。 - 特定算法 - Tokenizer：文本数据预处理，将句子拆分为单词。 - StringIndexer：将类别变量转换为数值索引。 - KMeans：聚类算法。 - TrainValidationSplit：训练和验证数据集的划分工具。 - RandomForest：随机森林模型，包括分类和回归。 - LinearRegression：线性回归模型。 - Classifier：分类模型，如决策树和随机森林分类器。 - Regressor：回归模型，如线性回归。 3. Evaluator与模型评估书中还涵盖了如何评估模型的性能，包括二分类、多分类和回归问题的评估器： - BinaryClassificationEvaluator：二分类模型的评估。 - ClusteringEvaluator：聚类模型的评估。 - MulticlassClassificationEvaluator：多分类模型的评估。 - RegressionEvaluator：回归模型的评估。 4. 其他主题书中可能还包括Spark的其他重要主题，如数据源管理、Spark SQL的使用、Spark Streaming的实时处理、图处理（GraphX）、Spark的弹性分布式数据集（RDD）以及性能调优等。通过深入阅读《Mastering Apache Spark》，读者不仅可以掌握Spark的基本用法，还能了解到如何构建复杂的数据处理系统，以及如何利用Spark MLlib实现高效的机器学习任务。这本书对于想要深入理解Spark并应用其解决实际问题的开发者来说，是一份宝贵的参考资料。

UsingSparkApplicationFrameworks,Sparksimplifiesaccesstomachinelearningand

predictiveanalyticsatscale.

SparkismainlywritteninScala,butprovidesdeveloperAPIforlanguageslikeJava,Python,

andR.

Note

Microsoft’sMobiusprojectprovidesC#APIforSpark"enablingthe

implementationofSparkdriverprogramanddataprocessingoperationsinthe

languagessupportedinthe.NETframeworklikeC#orF#."

Ifyouhavelargeamountsofdatathatrequireslowlatencyprocessingthatatypical

MapReduceprogramcannotprovide,Sparkisaviablealternative.

Accessanydatatypeacrossanydatasource.

Hugedemandforstorageanddataprocessing.

TheApacheSparkprojectisanumbrellaforSQL(withDatasets),streaming,machine

learning(pipelines)andgraphprocessingenginesbuiltatopSparkCore.Youcanrunthem

allinasingleapplicationusingaconsistentAPI.

Sparkrunslocallyaswellasinclusters,on-premisesorincloud.ItrunsontopofHadoop

YARN,ApacheMesos,standaloneorinthecloud(AmazonEC2orIBMBluemix).

Sparkcanaccessdatafrommanydatasources.

ApacheSpark’sStreamingandSQLprogrammingmodelswithMLlibandGraphXmakeit

easierfordevelopersanddatascientiststobuildapplicationsthatexploitmachinelearning

andgraphanalytics.

Atahighlevel,anySparkapplicationcreatesRDDsoutofsomeinput,run(lazy)

transformationsoftheseRDDstosomeotherform(shape),andfinallyperformactionsto

collectorstoredata.Notmuch,huh?

YoucanlookatSparkfromprogrammer’s,dataengineer’sandadministrator’spointofview.

Andtobehonest,allthreetypesofpeoplewillspendquitealotoftheirtimewithSparkto

finallyreachthepointwheretheyexploitalltheavailablefeatures.Programmersuse

language-specificAPIs(andworkatthelevelofRDDsusingtransformationsandactions),

dataengineersusehigher-levelabstractionslikeDataFramesorPipelinesAPIsorexternal

tools(thatconnecttoSpark),andfinallyitallcanonlybepossibletorunbecause

administratorssetupSparkclusterstodeploySparkapplicationsto.

ItisSpark’sgoaltobeageneral-purposecomputingplatformwithvariousspecialized

applicationsframeworksontopofasingleunifiedengine.

OverviewofApacheSpark

Note

When you hear "Apache Spark" it can be two things: the Spark engine aka Spark Core, or the Apache Spark open source project, which is an "umbrella" term for Spark Core and the accompanying Spark Application Frameworks, i.e. Spark SQL, Spark Streaming, Spark MLlib and Spark GraphX, that sit on top of Spark Core and the main data abstraction in Spark called RDD (Resilient Distributed Dataset).

WhySpark

Let’slistafewofthemanyreasonsforSpark.Wearedoingitfirst,andthencomesthe

overviewthatlendsamoretechnicalhelpinghand.

EasytoGetStarted

Sparkoffersspark-shellthatmakesforaveryeasyheadstarttowritingandrunningSpark

applicationsonthecommandlineonyourlaptop.

YoucouldthenuseSparkStandalonebuilt-inclustermanagertodeployyourSpark

applicationstoaproduction-gradeclustertorunonafulldataset.

UnifiedEngineforDiverseWorkloads

AssaidbyMateiZaharia-theauthorofApacheSpark-inIntroductiontoAmpLabSpark

Internalsvideo(quotingwithfewchanges):

OneoftheSparkprojectgoalswastodeliveraplatformthatsupportsaverywidearray

ofdiverseworkflows-notonlyMapReducebatchjobs(therewereavailablein

Hadoopalreadyatthattime),butalsoiterativecomputationslikegraphalgorithmsor

MachineLearning.

Andalsodifferentscalesofworkloadsfromsub-secondinteractivejobstojobsthatrun

formanyhours.

Sparkcombinesbatch,interactive,andstreamingworkloadsunderonerichconciseAPI.

Sparksupportsnearreal-timestreamingworkloadsviaSparkStreamingapplication

framework.

ETLworkloadsandAnalyticsworkloadsaredifferent,howeverSparkattemptstooffera

unifiedplatformforawidevarietyofworkloads.

GraphandMachineLearningalgorithmsareiterativebynatureandlesssavestodiskor

transfersovernetworkmeansbetterperformance.

ThereisalsosupportforinteractiveworkloadsusingSparkshell.

OverviewofApacheSpark

YoushouldwatchthevideoWhatisApacheSpark?byMikeOlson,ChiefStrategyOfficer

andCo-FounderatCloudera,whoprovidesaveryexceptionaloverviewofApacheSpark,its

riseinpopularityintheopensourcecommunity,andhowSparkisprimedtoreplace

MapReduceasthegeneralprocessingengineinHadoop.

LeveragestheBestindistributedbatchdataprocessing

Whenyouthinkaboutdistributedbatchdataprocessing,Hadoopnaturallycomestomind

asaviablesolution.

SparkdrawsmanyideasoutofHadoopMapReduce.Theyworktogetherwell-Sparkon

YARNandHDFS-whileimprovingontheperformanceandsimplicityofthedistributed

computingengine.

Formany,SparkisHadoop++,i.e.MapReducedoneinabetterway.

Anditshouldnotcomeasasurprise,withoutHadoopMapReduce(itsadvancesand

deficiencies),Sparkwouldnothavebeenbornatall.

RDD-DistributedParallelScalaCollections

AsaScaladeveloper,youmayfindSpark’sRDDAPIverysimilar(ifnotidentical)toScala’s

CollectionsAPI.

ItisalsoexposedinJava,PythonandR(aswellasSQL,i.e.SparkSQL,inasense).

So,whenyouhaveaneedfordistributedCollectionsAPIinScala,SparkwithRDDAPI

shouldbeaseriouscontender.

RichStandardLibrary

Notonlycanyouusemapand reduce(asinHadoopMapReducejobs)inSpark,butalso

avastarrayofotherhigher-leveloperatorstoeaseyourSparkqueriesandapplication

development.

Itexpandedontheavailablecomputationstylesbeyondtheonlymap-and-reduceavailable

inHadoopMapReduce.

Unifieddevelopmentanddeploymentenvironmentforall

RegardlessoftheSparktoolsyouuse-theSparkAPIforthemanyprogramminglanguages

supported-Scala,Java,Python,R,ortheSparkshell,orthemanySparkApplication

FrameworksleveragingtheconceptofRDD,i.e.SparkSQL,SparkStreaming,SparkMLlib

OverviewofApacheSpark

andSparkGraphX,youstillusethesamedevelopmentanddeploymentenvironmenttofor

largedatasetstoyieldaresult,beitaprediction(SparkMLlib),astructureddataqueries

(SparkSQL)orjustalargedistributedbatch(SparkCore)orstreaming(SparkStreaming)

computation.

It’salsoveryproductiveofSparkthatteamscanexploitthedifferentskillstheteam

membershaveacquiredsofar.Dataanalysts,datascientists,Pythonprogrammers,orJava,

orScala,orR,canallusethesameSparkplatformusingtailor-madeAPI.Itmakesfor

bringingskilledpeoplewiththeirexpertiseindifferentprogramminglanguagestogethertoa

Sparkproject.

InteractiveExploration/ExploratoryAnalytics

Itisalsocalledadhocqueries.

UsingtheSparkshellyoucanexecutecomputationstoprocesslargeamountofdata(The

BigData).It’sallinteractiveandveryusefultoexplorethedatabeforefinalproduction

release.

Also,usingtheSparkshellyoucanaccessanySparkclusterasifitwasyourlocalmachine.

JustpointtheSparkshelltoa20-nodeof10TBRAMmemoryintotal(using --master)and

useallthecomponents(andtheirabstractions)likeSparkSQL,SparkMLlib,Spark

Streaming,andSparkGraphX.

Dependingonyourneedsandskills,youmayseeabetterfitforSQLvsprogrammingAPIs

orapplymachinelearningalgorithms(SparkMLlib)fromdataingraphdatastructures

(SparkGraphX).

SingleEnvironment

Regardlessofwhichprogramminglanguageyouaregoodat,beitScala,Java,Python,Ror

SQL,youcanusethesamesingleclusteredruntimeenvironmentforprototyping,adhoc

queries,anddeployingyourapplicationsleveragingthemanyingestiondatapointsoffered

bytheSparkplatform.

Youcanbeaslow-levelasusingRDDAPIdirectlyorleveragehigher-levelAPIsofSpark

SQL(Datasets),SparkMLlib(MLPipelines),SparkGraphX(Graphs)orSparkStreaming

(DStreams).

Orusethemallinasingleapplication.

Thesingleprogrammingmodelandexecutionenginefordifferentkindsofworkloads

simplifydevelopmentanddeploymentarchitectures.

OverviewofApacheSpark

DataIntegrationToolkitwithRichSetofSupportedData

Sources

Sparkcanreadfrommanytypesofdatasources — relational,NoSQL,filesystems,etc. —

usingmanytypesofdataformats-Parquet,Avro,CSV,JSON.

Both,inputandoutputdatasources,allowprogrammersanddataengineersuseSparkas

theplatformwiththelargeamountofdatathatisreadfromorsavedtoforprocessing,

interactively(usingSparkshell)orinapplications.

Toolsunavailablethen,atyourfingertipsnow

Asmuchandoftenasit’srecommendedtopicktherighttoolforthejob,it’snotalways

feasible.Time,personalpreference,operatingsystemyouworkonareallfactorstodecide

whatisrightatatime(andusingahammercanbeareasonablechoice).

Sparkembracesmanyconceptsinasingleunifieddevelopmentandruntimeenvironment.

Machinelearningthatissotool-andfeature-richinPython,e.g.SciKitlibrary,cannow

beusedbyScaladevelopers(asPipelineAPIinSparkMLliborcallingpipe()).

DataFramesfromRareavailableinScala,Java,Python,RAPIs.

Singlenodecomputationsinmachinelearningalgorithmsaremigratedtotheir

distributedversionsinSparkMLlib.

ThissingleplatformgivesplentyofopportunitiesforPython,Scala,Java,andR

programmersaswellasdataengineers(SparkR)andscientists(usingproprietaryenterprise

datawarehouseswithThriftJDBC/ODBCServerinSparkSQL).

Mindtheproverbifallyouhaveisahammer,everythinglookslikeanail,too.

Low-levelOptimizations

ApacheSparkusesadirectedacyclicgraph(DAG)ofcomputationstages(akaexecution

DAG).Itpostponesanyprocessinguntilreallyrequiredforactions.Spark’slazyevaluation

givesplentyofopportunitiestoinducelow-leveloptimizations(sousershavetoknowlessto

domore).

Mindtheproverblessismore.

Excelsatlow-latencyiterativeworkloads

OverviewofApacheSpark
