Apache Spark: Introduction and Core Features in Detail

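Apache Spark is a powerful open-source distributed computing framework that plays a central role in big data processing. This guide walks through the main topics of Mastering Apache Spark: an overview, Spark SQL, creating a SparkSession, data processing and manipulation, and the use of advanced features.

1. **Overview** Apache Spark provides an in-memory computing model that supports both stream processing and batch processing for fast, complex data analysis. Its design distributes computation across the nodes of a cluster and executes it in parallel to improve performance.
2. **Spark SQL** Spark SQL is a key Spark component for running structured queries over large datasets. It offers the DataFrame API, supports standard SQL syntax, and can interact with relational databases.
3. **SparkSession and Builder** SparkSession is the entry point to programming with Spark SQL and wraps the main Spark functionality. The Builder pattern offers a concise API for constructing a SparkSession configured for a given application.
4. **Datasets and DataFrames** A Dataset is Spark SQL's strongly typed data abstraction, which gives better type safety and room for optimization. A DataFrame is a Dataset of Row objects organized into named columns, and RowEncoder converts Row objects to Spark's internal representation.
5. **Schema and Data Types** A schema defines the structure of a DataFrame or Dataset, including field names, types, and constraints. StructType and StructField describe complex structures, while DataTypes provides the built-in data types.
6. **DataFrame Operators and Column Operators** DataFrames provide a rich set of operators: column operators, selection and filtering, aggregation, and joins. The standard functions form a built-in library for transforming and processing data.
7. **Window Operations and User-Defined Functions (UDFs)** Window aggregate operators support window functions for statistics over groups of rows, while UDFs let developers write custom operations that extend Spark.
8. **Caching** Caching keeps intermediate results so they are not recomputed, which improves performance. This matters most in large-scale data analysis.
9. **DataSource API** The DataSource API is Spark's standard interface for loading and saving data. DataFrameReader reads from external data sources and DataFrameWriter writes data out.
10. **Advanced Features**
    - A wide range of data sources is supported, including file systems, databases, and streams.
    - Spark's distributed computing and fault-tolerance mechanisms ensure that jobs run reliably.
    - Sections 2.8.3.1 to 2.8.5.3 cover more complex topics such as data partitioning, dynamic partitioning, and stream processing.
11. **Other Content**
    - Section 2.9 may cover distributed computing and resource management, and section 2.10 may cover performance tuning, performance monitoring, and best practices.

The Mastering Apache Spark guide spans everything from basic concepts to advanced features, and aims to help developers become proficient with Spark so they can process and analyze large-scale data efficiently.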

ApacheSpark

ApacheSparkisanopen-sourcedistributedgeneral-purposeclustercomputing

frameworkwithin-memorydataprocessingenginethatcandoETL,analytics,machine

learningandgraphprocessingonlargevolumesofdataatrest(batchprocessing)orin

motion(streamingprocessing)withrichconcisehigh-levelAPIsfortheprogramming

languages:Scala,Python,Java,R,andSQL.

Figure1.TheSparkPlatform

YoucouldalsodescribeSparkasadistributed,dataprocessingengineforbatchand

streamingmodesfeaturingSQLqueries,graphprocessing,andMachineLearning.

IncontrasttoHadoop’stwo-stagedisk-basedMapReduceprocessingengine,Spark’smulti-

stagein-memorycomputingengineallowsforrunningmostcomputationsinmemory,and

henceveryoftenprovidesbetterperformance(therearereportsaboutbeingupto100times

faster-readSparkofficiallysetsanewrecordinlarge-scalesorting!)forcertainapplications,

e.g.iterativealgorithmsorinteractivedatamining.

Sparkaimsatspeed,easeofuse,andinteractiveanalytics.

Sparkisoftencalledclustercomputingengineorsimplyexecutionengine.

Sparkisadistributedplatformforexecutingcomplexmulti-stageapplications,like

machinelearningalgorithms,andinteractiveadhocqueries.Sparkprovidesanefficient

abstractionforin-memoryclustercomputingcalledResilientDistributedDataset.

OverviewofApacheSpark

UsingSparkApplicationFrameworks,Sparksimplifiesaccesstomachinelearningand

predictiveanalyticsatscale.

SparkismainlywritteninScala,butsupportsotherlanguages,i.e.Java,Python,andR.

Ifyouhavelargeamountsofdatathatrequireslowlatencyprocessingthatatypical

MapReduceprogramcannotprovide,Sparkisanalternative.

Accessanydatatypeacrossanydatasource.

Hugedemandforstorageanddataprocessing.

TheApacheSparkprojectisanumbrellaforSQL(withDataFrames),streaming,machine

learning(pipelines)andgraphprocessingenginesbuiltatopSparkCore.Youcanrunthem

allinasingleapplicationusingaconsistentAPI.

Sparkrunslocallyaswellasinclusters,on-premisesorincloud.ItrunsontopofHadoop

YARN,ApacheMesos,standaloneorinthecloud(AmazonEC2orIBMBluemix).

Sparkcanaccessdatafrommanydatasources.

ApacheSpark’sStreamingandSQLprogrammingmodelswithMLlibandGraphXmakeit

easierfordevelopersanddatascientiststobuildapplicationsthatexploitmachinelearning

andgraphanalytics.

Atahighlevel,anySparkapplicationcreatesRDDsoutofsomeinput,run(lazy)

transformationsoftheseRDDstosomeotherform(shape),andfinallyperformactionsto

collectorstoredata.Notmuch,huh?

YoucanlookatSparkfromprogrammer’s,dataengineer’sandadministrator’spointofview.

Andtobehonest,allthreetypesofpeoplewillspendquitealotoftheirtimewithSparkto

finallyreachthepointwheretheyexploitalltheavailablefeatures.Programmersuse

language-specificAPIs(andworkatthelevelofRDDsusingtransformationsandactions),

dataengineersusehigher-levelabstractionslikeDataFramesorPipelinesAPIsorexternal

tools(thatconnecttoSpark),andfinallyitallcanonlybepossibletorunbecause

administratorssetupSparkclusterstodeploySparkapplicationsto.

ItisSpark’sgoaltobeageneral-purposecomputingplatformwithvariousspecialized

applicationsframeworksontopofasingleunifiedengine.

Note

Whenyouhear"ApacheSpark"itcanbetwothings — theSparkengineaka

SparkCoreortheApacheSparkopensourceprojectwhichisan"umbrella"

termforSparkCoreandtheaccompanyingSparkApplicationFrameworks,i.e.

SparkSQL,SparkStreaming,SparkMLlibandSparkGraphXthatsitontopof

SparkCoreandthemaindataabstractioninSparkcalledRDD-Resilient

DistributedDataset.

OverviewofApacheSpark

WhySpark

Let’slistafewofthemanyreasonsforSpark.Wearedoingitfirst,andthencomesthe

overviewthatlendsamoretechnicalhelpinghand.

EasytoGetStarted

Sparkoffersspark-shellthatmakesforaveryeasyheadstarttowritingandrunningSpark

applicationsonthecommandlineonyourlaptop.

YoucouldthenuseSparkStandalonebuilt-inclustermanagertodeployyourSpark

applicationstoaproduction-gradeclustertorunonafulldataset.

UnifiedEngineforDiverseWorkloads

AssaidbyMateiZaharia-theauthorofApacheSpark-inIntroductiontoAmpLabSpark

Internalsvideo(quotingwithfewchanges):

OneoftheSparkprojectgoalswastodeliveraplatformthatsupportsaverywidearray

ofdiverseworkflows-notonlyMapReducebatchjobs(therewereavailablein

Hadoopalreadyatthattime),butalsoiterativecomputationslikegraphalgorithmsor

MachineLearning.

Andalsodifferentscalesofworkloadsfromsub-secondinteractivejobstojobsthatrun

formanyhours.

Sparkcombinesbatch,interactive,andstreamingworkloadsunderonerichconciseAPI.

Sparksupportsnearreal-timestreamingworkloadsviaSparkStreamingapplication

framework.

ETLworkloadsandAnalyticsworkloadsaredifferent,howeverSparkattemptstooffera

unifiedplatformforawidevarietyofworkloads.

GraphandMachineLearningalgorithmsareiterativebynatureandlesssavestodiskor

transfersovernetworkmeansbetterperformance.

ThereisalsosupportforinteractiveworkloadsusingSparkshell.

YoushouldwatchthevideoWhatisApacheSpark?byMikeOlson,ChiefStrategyOfficer

andCo-FounderatCloudera,whoprovidesaveryexceptionaloverviewofApacheSpark,its

riseinpopularityintheopensourcecommunity,andhowSparkisprimedtoreplace

MapReduceasthegeneralprocessingengineinHadoop.

LeveragestheBestindistributedbatchdataprocessing

OverviewofApacheSpark

Whenyouthinkaboutdistributedbatchdataprocessing,Hadoopnaturallycomestomind

asaviablesolution.

SparkdrawsmanyideasoutofHadoopMapReduce.Theyworktogetherwell-Sparkon

YARNandHDFS-whileimprovingontheperformanceandsimplicityofthedistributed

computingengine.

Formany,SparkisHadoop++,i.e.MapReducedoneinabetterway.

Anditshouldnotcomeasasurprise,withoutHadoopMapReduce(itsadvancesand

deficiencies),Sparkwouldnothavebeenbornatall.

RDD-DistributedParallelScalaCollections

AsaScaladeveloper,youmayfindSpark’sRDDAPIverysimilar(ifnotidentical)toScala’s

CollectionsAPI.

ItisalsoexposedinJava,PythonandR(aswellasSQL,i.e.SparkSQL,inasense).

So,whenyouhaveaneedfordistributedCollectionsAPIinScala,SparkwithRDDAPI

shouldbeaseriouscontender.

RichStandardLibrary

Notonlycanyouusemapandreduce(asinHadoopMapReducejobs)inSpark,butalso

avastarrayofotherhigher-leveloperatorstoeaseyourSparkqueriesandapplication

development.

Itexpandedontheavailablecomputationstylesbeyondtheonlymap-and-reduceavailable

inHadoopMapReduce.

Unifieddevelopmentanddeploymentenvironmentforall

RegardlessoftheSparktoolsyouuse-theSparkAPIforthemanyprogramminglanguages

supported-Scala,Java,Python,R,ortheSparkshell,orthemanySparkApplication

FrameworksleveragingtheconceptofRDD,i.e.SparkSQL,SparkStreaming,SparkMLlib

andSparkGraphX,youstillusethesamedevelopmentanddeploymentenvironmenttofor

largedatasetstoyieldaresult,beitaprediction(SparkMLlib),astructureddataqueries

(SparkSQL)orjustalargedistributedbatch(SparkCore)orstreaming(SparkStreaming)

computation.

It’salsoveryproductiveofSparkthatteamscanexploitthedifferentskillstheteam

membershaveacquiredsofar.Dataanalysts,datascientists,Pythonprogrammers,orJava,

orScala,orR,canallusethesameSparkplatformusingtailor-madeAPI.Itmakesfor

OverviewofApacheSpark

bringingskilledpeoplewiththeirexpertiseindifferentprogramminglanguagestogethertoa

Sparkproject.

InteractiveExploration/ExploratoryAnalytics

Itisalsocalledadhocqueries.

UsingtheSparkshellyoucanexecutecomputationstoprocesslargeamountofdata(The

BigData).It’sallinteractiveandveryusefultoexplorethedatabeforefinalproduction

release.

Also,usingtheSparkshellyoucanaccessanySparkclusterasifitwasyourlocalmachine.

JustpointtheSparkshelltoa20-nodeof10TBRAMmemoryintotal(using --master)and

useallthecomponents(andtheirabstractions)likeSparkSQL,SparkMLlib,Spark

Streaming,andSparkGraphX.

Dependingonyourneedsandskills,youmayseeabetterfitforSQLvsprogrammingAPIs

orapplymachinelearningalgorithms(SparkMLlib)fromdataingraphdatastructures

(SparkGraphX).

SingleEnvironment

Regardlessofwhichprogramminglanguageyouaregoodat,beitScala,Java,Python,Ror

SQL,youcanusethesamesingleclusteredruntimeenvironmentforprototyping,adhoc

queries,anddeployingyourapplicationsleveragingthemanyingestiondatapointsoffered

bytheSparkplatform.

Youcanbeaslow-levelasusingRDDAPIdirectlyorleveragehigher-levelAPIsofSpark

SQL(Datasets),SparkMLlib(MLPipelines),SparkGraphX(Graphs)orSparkStreaming

(DStreams).

Orusethemallinasingleapplication.

Thesingleprogrammingmodelandexecutionenginefordifferentkindsofworkloads

simplifydevelopmentanddeploymentarchitectures.

DataIntegrationToolkitwithRichSetofSupportedData

Sources

Sparkcanreadfrommanytypesofdatasources — relational,NoSQL,filesystems,etc. —

usingmanytypesofdataformats-Parquet,Avro,CSV,JSON.

OverviewofApacheSpark

剩余1285页未读，继续阅读

PyQter

粉丝: 14
资源: 39

Apache Spark入门与核心功能详解

深入理解Apache Spark 2.3.0：核心概念与机器学习

精通Apache Spark：权威编程指南

深入理解Apache Spark：核心技术与实践

mastering apache spark

Mastering Apache Spark(掌握Apache Spark)英文版.pdf

mastering apache spark2.4.2

Mastering Apache Spark 无水印pdf 0分

mastering apache spark 2.x second edition

Mastering Apache Spark(PACKT,2015)

Mastering Apache Spark 2.x - Second Edition

最新资源