Understanding Apache Spark in Depth: Core Technology and Practice

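Updated 2024-07-18 · PDF, 17.45 MB

"Mastering Apache Spark is a book that explores Apache Spark in depth, covering Spark's core components, data transfer, network services, the web UI, and storage mechanisms. It explains Spark's architecture and inner workings in detail, with the aim of helping readers fully understand Apache Spark and use it efficiently for big data processing."

Apache Spark is an open-source parallel computing framework designed to provide fast, general-purpose, and scalable data processing. The book opens with an overview of Spark, explaining how Spark processes large-scale data in a distributed environment and emphasizing its in-memory computation, which makes data processing significantly faster than traditional Hadoop MapReduce.

In the Spark Core part, the book discusses in detail how data blocks are transferred within a Spark cluster. ShuffleClient and BlockTransferService are the core components, responsible for fetching and uploading blocks. ShuffleClient defines the interface for fetching shuffle blocks, while BlockTransferService is the pluggable transfer layer, with implementations such as NettyBlockTransferService, which moves data over the Netty networking library. NettyBlockRpcServer handles RPC requests on the server side, and helper classes such as BlockFetchingListener, RetryingBlockFetcher, and BlockFetchStarter work together to keep transfers reliable and efficient.

Spark Core's web UI is an important tool for monitoring and diagnosing Spark applications. It provides Jobs, Stages, Storage, Environment, and Executors views. The Jobs tab shows the status and progress of all jobs, the Stages tab lists the stages of each job in detail, the Storage tab shows what data is stored, the Environment tab shows the configuration of the runtime environment, and the Executors tab shows the detailed status of the executors. Together these views help users monitor application performance in real time, locate problems, and tune configuration.

On storage, the book introduces the RDD (Resilient Distributed Dataset), the most basic data abstraction in Spark. RDDs are fault-tolerant and can be recovered automatically when data is lost. StoragePage and RDDPage provide the interface for inspecting RDDs stored in memory or on disk, while EnvironmentPage shows storage-related configuration such as caching strategy and persistence level.

"Mastering Apache Spark" dissects Spark's components and workflows in depth and is a valuable reference for big data professionals who want to master Spark. By studying the book, readers can better understand and use Spark's capabilities and process big data more effectively.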

MasteringApacheSpark(2.3.1)

WelcometoMasteringApacheSparkgitbook!I’mveryexcitedtohaveyouhereandhope

youwillenjoyexploringtheinternalsofApacheSpark(Core)asmuchasIhave.

IwritetodiscoverwhatIknow.

—FlanneryO'Connor

I’mJacekLaskowski,anindependentconsultant,softwaredeveloperandtechnicalinstructor

specializinginApacheSpark,ApacheKafkaandKafkaStreams(withScala,sbt,

Kubernetes,DC/OS,ApacheMesos,andHadoopYARN).

Ioffersoftwaredevelopmentandconsultancyserviceswithveryhands-onin-depth

workshopsandmentoring.Reachouttomeatjacek@japila.plor@jaceklaskowskito

discussopportunities.

ConsiderjoiningmeatWarsawScalaEnthusiastsandWarsawSparkmeetupsinWarsaw,

Poland.

Tip

I’malsowritingMasteringSparkSQL,MasteringKafkaStreams,ApacheKafka

NotebookandSparkStructuredStreamingNotebookgitbooks.

Expecttextandcodesnippetsfromavarietyofpublicsources.Attributionfollows.

Now,letmeintroduceyoutoApacheSpark.

Introduction

UsingSparkApplicationFrameworks,Sparksimplifiesaccesstomachinelearningand

predictiveanalyticsatscale.

SparkismainlywritteninScala,butprovidesdeveloperAPIforlanguageslikeJava,Python,

andR.

Note

Microsoft's Mobius project provides a C# API for Spark "enabling the implementation of Spark driver program and data processing operations in the languages supported in the .NET framework like C# or F#."

Ifyouhavelargeamountsofdatathatrequireslowlatencyprocessingthatatypical

MapReduceprogramcannotprovide,Sparkisaviablealternative.

Accessanydatatypeacrossanydatasource.

Hugedemandforstorageanddataprocessing.

TheApacheSparkprojectisanumbrellaforSQL(withDatasets),streaming,machine

learning(pipelines)andgraphprocessingenginesbuiltatopSparkCore.Youcanrunthem

allinasingleapplicationusingaconsistentAPI.

Sparkrunslocallyaswellasinclusters,on-premisesorincloud.ItrunsontopofHadoop

YARN,ApacheMesos,standaloneorinthecloud(AmazonEC2orIBMBluemix).

Sparkcanaccessdatafrommanydatasources.

ApacheSpark’sStreamingandSQLprogrammingmodelswithMLlibandGraphXmakeit

easierfordevelopersanddatascientiststobuildapplicationsthatexploitmachinelearning

andgraphanalytics.

Atahighlevel,anySparkapplicationcreatesRDDsoutofsomeinput,run(lazy)

transformationsoftheseRDDstosomeotherform(shape),andfinallyperformactionsto

collectorstoredata.Notmuch,huh?

YoucanlookatSparkfromprogrammer’s,dataengineer’sandadministrator’spointofview.

Andtobehonest,allthreetypesofpeoplewillspendquitealotoftheirtimewithSparkto

finallyreachthepointwheretheyexploitalltheavailablefeatures.Programmersuse

language-specificAPIs(andworkatthelevelofRDDsusingtransformationsandactions),

dataengineersusehigher-levelabstractionslikeDataFramesorPipelinesAPIsorexternal

tools(thatconnecttoSpark),andfinallyitallcanonlybepossibletorunbecause

administratorssetupSparkclusterstodeploySparkapplicationsto.

ItisSpark’sgoaltobeageneral-purposecomputingplatformwithvariousspecialized

applicationsframeworksontopofasingleunifiedengine.

OverviewofApacheSpark

Note

Whenyouhear"ApacheSpark"itcanbetwothings — theSparkengineaka

SparkCoreortheApacheSparkopensourceprojectwhichisan"umbrella"

termforSparkCoreandtheaccompanyingSparkApplicationFrameworks,i.e.

SparkSQL,SparkStreaming,SparkMLlibandSparkGraphXthatsitontopof

SparkCoreandthemaindataabstractioninSparkcalledRDD-Resilient

DistributedDataset.

WhySpark

Let’slistafewofthemanyreasonsforSpark.Wearedoingitfirst,andthencomesthe

overviewthatlendsamoretechnicalhelpinghand.

EasytoGetStarted

Sparkoffersspark-shellthatmakesforaveryeasyheadstarttowritingandrunningSpark

applicationsonthecommandlineonyourlaptop.

YoucouldthenuseSparkStandalonebuilt-inclustermanagertodeployyourSpark

applicationstoaproduction-gradeclustertorunonafulldataset.

UnifiedEngineforDiverseWorkloads

AssaidbyMateiZaharia-theauthorofApacheSpark-inIntroductiontoAmpLabSpark

Internalsvideo(quotingwithfewchanges):

OneoftheSparkprojectgoalswastodeliveraplatformthatsupportsaverywidearray

ofdiverseworkflows-notonlyMapReducebatchjobs(therewereavailablein

Hadoopalreadyatthattime),butalsoiterativecomputationslikegraphalgorithmsor

MachineLearning.

Andalsodifferentscalesofworkloadsfromsub-secondinteractivejobstojobsthatrun

formanyhours.

Sparkcombinesbatch,interactive,andstreamingworkloadsunderonerichconciseAPI.

Sparksupportsnearreal-timestreamingworkloadsviaSparkStreamingapplication

framework.

ETLworkloadsandAnalyticsworkloadsaredifferent,howeverSparkattemptstooffera

unifiedplatformforawidevarietyofworkloads.

GraphandMachineLearningalgorithmsareiterativebynatureandlesssavestodiskor

transfersovernetworkmeansbetterperformance.

ThereisalsosupportforinteractiveworkloadsusingSparkshell.

OverviewofApacheSpark

YoushouldwatchthevideoWhatisApacheSpark?byMikeOlson,ChiefStrategyOfficer

andCo-FounderatCloudera,whoprovidesaveryexceptionaloverviewofApacheSpark,its

riseinpopularityintheopensourcecommunity,andhowSparkisprimedtoreplace

MapReduceasthegeneralprocessingengineinHadoop.

LeveragestheBestindistributedbatchdataprocessing

Whenyouthinkaboutdistributedbatchdataprocessing,Hadoopnaturallycomestomind

asaviablesolution.

SparkdrawsmanyideasoutofHadoopMapReduce.Theyworktogetherwell-Sparkon

YARNandHDFS-whileimprovingontheperformanceandsimplicityofthedistributed

computingengine.

Formany,SparkisHadoop++,i.e.MapReducedoneinabetterway.

Anditshouldnotcomeasasurprise,withoutHadoopMapReduce(itsadvancesand

deficiencies),Sparkwouldnothavebeenbornatall.

RDD-DistributedParallelScalaCollections

AsaScaladeveloper,youmayfindSpark’sRDDAPIverysimilar(ifnotidentical)toScala’s

CollectionsAPI.

ItisalsoexposedinJava,PythonandR(aswellasSQL,i.e.SparkSQL,inasense).

So,whenyouhaveaneedfordistributedCollectionsAPIinScala,SparkwithRDDAPI

shouldbeaseriouscontender.

RichStandardLibrary

Notonlycanyouusemapand reduce(asinHadoopMapReducejobs)inSpark,butalso

avastarrayofotherhigher-leveloperatorstoeaseyourSparkqueriesandapplication

development.

Itexpandedontheavailablecomputationstylesbeyondtheonlymap-and-reduceavailable

inHadoopMapReduce.

Unifieddevelopmentanddeploymentenvironmentforall

RegardlessoftheSparktoolsyouuse-theSparkAPIforthemanyprogramminglanguages

supported-Scala,Java,Python,R,ortheSparkshell,orthemanySparkApplication

FrameworksleveragingtheconceptofRDD,i.e.SparkSQL,SparkStreaming,SparkMLlib

OverviewofApacheSpark

剩余1351页未读，继续阅读

隐分隔符对象

粉丝: 10
资源: 8

深入理解Apache Spark：核心技术与实践

Mastering Apache Spark 无水印pdf 0分

mastering-apache-spark最好的spark教程

mastering-apache-spark

Packt.Mastering.Apache.Spark

Mastering.Apache.Spark.178397146

Mastering Apache Spark

mastering apache spark

mastering apache spark 2.x second edition

Mastering Apache Spark 2.x - Second Edition

Mastering Apache Spark 2.X(2nd) epub

最新资源