Apache Spark API Explained: A Practical Guide

Updated 2024-07-23 · PDF, 263 KB

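"Spark API Master is an Apache Spark API command reference for beginners, written by Matthias Langer and Zhen He. It covers Spark's RDD (Resilient Distributed Dataset) API in detail and explains how to configure the Spark shell, adjust memory and worker threads, and use the various RDD operations."

The Spark API is a core part of Apache Spark. It offers a rich set of programming interfaces for processing large-scale data. The reference covers the following key topics:

1. **Shell configuration**: Before analyzing data with Spark, the shell environment has to be set up. Adjusting the memory size and the number of worker threads is a key tuning step: it lets Spark use the available compute resources effectively and helps avoid out-of-memory errors and overload.
2. **RDD API**: The RDD (Resilient Distributed Dataset) is Spark's basic data structure, an immutable, partitioned collection of data. RDDs provide many operations:
- **aggregate**: aggregates the data using user-defined aggregation functions.
- **cartesian**: builds the Cartesian product of two RDDs, that is, every possible pair of elements.
- **checkpoint**: saves the RDD to persistent storage so that later computations do not have to recompute it.
- **coalesce/repartition**: change the number of partitions of an RDD. coalesce is used to reduce the number of partitions; repartition can increase or decrease it and redistributes the data.
- **cogroup/groupWith** [Pair]: group two key-value RDDs (up to three, according to the reference) by their keys and return a key-value RDD.
- **collect/toArray**: convert the RDD into a local collection such as an array.
- **collectAsMap** [Pair]: converts a key-value RDD into a local Map.
- **combineByKey** [Pair]: creates a new key-value RDD in which the values belonging to each key have been combined.
- **compute**: executes the computation of an RDD.
- **count**: returns the total number of elements in the RDD.
- **countApprox**: returns an approximate element count.
- **countByKey** [Pair]: returns the number of elements for each key.
- **countByKeyApprox** [Pair]: like countByKey, but returns an approximate result.
- **countByValue**: returns how often each distinct element occurs in the RDD.
- **countByValueApprox**: approximates how often each distinct element occurs in the RDD.
- **countApproxDistinct**: estimates the number of distinct elements in the RDD.
- **countApproxDistinctByKey** [Pair]: estimates the number of distinct values for each key.
- **dependencies**: shows the dependencies of an RDD, which helps in understanding the computation flow.
- **distinct**: returns the unique elements of the RDD.
- **first**: returns the first element of the RDD.
- **filter**: keeps the elements that satisfy a predicate.
- **filterWith**: an extended filter whose predicate also receives a value derived from the partition index.
- **flatMap**: expands each element into multiple elements.
- **flatMapValues** [Pair]: expands the values of a key-value RDD.
- **flatMapWith**: an extended flatMap whose function also receives a value derived from the partition index.

The reference is particularly useful for developers who are new to Spark: it helps them get started quickly and handle big-data tasks effectively. By studying and practicing these APIs, developers can build efficient, fault-tolerant distributed data processing applications.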

detector. Later, we intend to draw an image of a map that highlights these locations using the aggregate function. In this case the zeroValue could be an area map with no highlights. The possibly huge set of input data is stored as GPS coordinates across many partitions. seqOp could convert the GPS coordinates to map coordinates and put a marker on the map at the respective position. combOp will receive these highlights as partial maps and combine them into a single final output map.
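The roles of zeroValue, seqOp and combOp in the map scenario can be sketched without a Spark cluster. The following is a minimal model in plain Scala, not Spark code: `aggregateLocal`, `toCell`, `highlight` and the 1-degree grid are hypothetical names and choices made for illustration, and partitions are represented as nested sequences.

```scala
object AggregateMapSketch {
  type Gps = (Double, Double)      // (latitude, longitude)
  type AreaMap = Set[(Int, Int)]   // highlighted grid cells; the empty set is "no highlights"

  // Local model of aggregate: fold each partition from zeroValue with seqOp,
  // then fold the partial results from zeroValue with combOp.
  def aggregateLocal[T, U](partitions: Seq[Seq[T]])(zeroValue: U)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U =
    partitions.map(_.foldLeft(zeroValue)(seqOp)).foldLeft(zeroValue)(combOp)

  // Assumption: map coordinates are simply 1-degree grid cells.
  def toCell(p: Gps): (Int, Int) =
    (math.floor(p._1).toInt, math.floor(p._2).toInt)

  // seqOp puts one marker on a partial map; combOp merges two partial maps.
  def highlight(partitions: Seq[Seq[Gps]]): AreaMap =
    aggregateLocal(partitions)(Set.empty[(Int, Int)])(
      (partialMap, p) => partialMap + toCell(p),
      _ ++ _)

  def main(args: Array[String]): Unit = {
    val readings = Seq(
      Seq((48.2, 16.3), (48.9, 16.1)), // partition 0
      Seq((-37.8, 144.9)))             // partition 1
    println(highlight(readings))
  }
}
```

Because the partial maps are sets and combOp is a union, the result does not depend on the order in which partitions are merged, which is the property one would want from a combOp in this scenario.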

Listing 3.1: Variants

def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U

Listing 3.2: Examples

val z = sc.parallelize(List(1,2,3,4,5,6), 2)
z.aggregate(0)(math.max(_, _), _ + _)
res40: Int = 9

val z = sc.parallelize(List("a","b","c","d","e","f"),2)
z.aggregate("")(_ + _, _+_)
res115: String = abcdef

z.aggregate("x")(_ + _, _+_)
res116: String = xxdefxabc

val z = sc.parallelize(List("12","23","345","4567"),2)
z.aggregate("")((x,y) => math.max(x.length, y.length).toString, (x,y) => x + y)
res141: String = 42

z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res142: String = 11

val z = sc.parallelize(List("12","23","345",""),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res143: String = 10

The main issue with the code above is that the result of the inner min is a string of length 1. The zero in the output is due to the empty string being the last string in the list. We see this result because we are not recursively reducing any further within the partition for the final string.

Listing 3.3: Examples 2

val z = sc.parallelize(List("12","23","","345"),2)
z.aggregate("")((x,y) => math.min(x.length, y.length).toString, (x,y) => x + y)
res144: String = 11
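The difference between the results 10 and 11 becomes easier to follow when each intermediate accumulator is written out. The sketch below replays the min-based seqOp in plain Scala; it is a local model, not Spark code. It assumes that parallelize with 2 partitions puts the first two strings into partition 0 and the last two into partition 1, and `trace` and `aggregateLocal` are hypothetical helper names.

```scala
object AggregateTrace {
  // The seqOp from the listings: the accumulator becomes the string form of the
  // smaller of the two lengths, so after one step it always has length 1.
  val seqOp: (String, String) => String =
    (x, y) => math.min(x.length, y.length).toString

  // scanLeft keeps every intermediate accumulator of one partition.
  def trace(partition: Seq[String]): Seq[String] =
    partition.scanLeft("")(seqOp)

  // Fold each partition, then concatenate the partial results in partition order.
  def aggregateLocal(partitions: Seq[Seq[String]]): String =
    partitions.map(_.foldLeft("")(seqOp)).foldLeft("")(_ + _)

  def main(args: Array[String]): Unit = {
    val emptyLast  = Seq(Seq("12", "23"), Seq("345", "")) // empty string last
    val emptyThird = Seq(Seq("12", "23"), Seq("", "345")) // empty string third
    emptyLast.foreach(p => println(trace(p).mkString("[", " -> ", "]")))
    emptyThird.foreach(p => println(trace(p).mkString("[", " -> ", "]")))
    println(aggregateLocal(emptyLast))  // 10
    println(aggregateLocal(emptyThird)) // 11
  }
}
```

With the empty string last, the second partition ends on min(1, 0) = 0. With the empty string third, the accumulator "0" (length 1) is compared with "345" afterwards, which gives 1. This model always concatenates in partition order, whereas Spark appears to merge partial results in the order in which tasks finish, which would explain why the max variant in Listing 3.2 prints 42 rather than 24.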

14/02/25 18:13:53 INFO RDDCheckpointData: Done checkpointing RDD 11 to file:/home/cloudera/Documents/spark-0.9.0-incubating-bin-cdh4/bin/my_directory_name/65407913-fdc6-4ec1-82c9-48a1656b95d6/rdd-11, new parent is RDD 12
res23: Long = 4

3.4 coalesce, repartition

Coalesces the associated data into a given number of partitions. repartition(numPartitions) is simply an abbreviation for coalesce(numPartitions, shuffle = true).

Listing 3.8: Variants

def coalesce(numPartitions: Int, shuffle: Boolean = false): RDD[T]
def repartition(numPartitions: Int): RDD[T]

Listing 3.9: Examples

val y = sc.parallelize(1 to 10, 10)
val z = y.coalesce(2, false)
z.partitions.length
res9: Int = 2
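What the shuffle flag changes can be illustrated with a small local model. This is a simplified sketch in plain Scala, not Spark's implementation: `coalesceNoShuffle` and `repartitionModel` are hypothetical names, the grouping into contiguous ranges and the round-robin distribution are assumptions made for illustration, and real Spark also takes data locality into account.

```scala
object CoalesceSketch {
  type Partitions[T] = Vector[Vector[T]]

  // Model of coalesce(n, shuffle = false): whole parent partitions are merged
  // into contiguous groups, so no single element is moved on its own.
  def coalesceNoShuffle[T](parts: Partitions[T], n: Int): Partitions[T] =
    if (n >= parts.length) parts // without a shuffle the partition count cannot grow
    else
      parts.zipWithIndex
        .groupBy { case (_, i) => i * n / parts.length }
        .toVector
        .sortBy(_._1)
        .map { case (_, group) => group.flatMap(_._1) }

  // Model of repartition(n), i.e. coalesce(n, shuffle = true): every element
  // may move; here the elements are dealt out round-robin.
  def repartitionModel[T](parts: Partitions[T], n: Int): Partitions[T] = {
    val indexed = parts.flatten.zipWithIndex
    Vector.tabulate(n)(k => indexed.collect { case (x, i) if i % n == k => x })
  }

  def main(args: Array[String]): Unit = {
    // Same shape as the example above: 1 to 10 in 10 partitions of one element each.
    val y: Partitions[Int] = (1 to 10).toVector.map(Vector(_))
    println(coalesceNoShuffle(y, 2))         // two partitions built from five parents each
    println(coalesceNoShuffle(y, 20).length) // stays at 10
    println(repartitionModel(y, 20).length)  // 20, some of them empty
  }
}
```

The design point the model tries to capture is that merging without a shuffle is cheap but can only reduce the partition count, while growing the count requires redistributing individual elements.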

3.5 cogroup [Pair], groupWith [Pair]

A very powerful set of functions that allow grouping up to 3 key-value RDDs together using their keys.

Listing 3.10: Variants

def cogroup[W](other: RDD[(K, W)]): RDD[(K, (Seq[V], Seq[W]))]
def cogroup[W](other: RDD[(K, W)], numPartitions: Int): RDD[(K, (Seq[V], Seq[W]))]
def cogroup[W](other: RDD[(K, W)], partitioner: Partitioner): RDD[(K, (Seq[V], Seq[W]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], numPartitions: Int): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]
def cogroup[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)], partitioner: Partitioner): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]
def groupWith[W](other: RDD[(K, W)]): RDD[(K, (Seq[V], Seq[W]))]
def groupWith[W1, W2](other1: RDD[(K, W1)], other2: RDD[(K, W2)]): RDD[(K, (Seq[V], Seq[W1], Seq[W2]))]

Listing 3.11: Examples

val a = sc.parallelize(List(1, 2, 1, 3), 1)
val b = a.map((_, "b"))
val c = a.map((_, "c"))
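The grouping that cogroup performs on two key-value collections can be modeled with plain Scala collections. The following is a minimal sketch, not the Spark implementation and not a continuation of the listing above: `cogroupLocal` is a hypothetical helper, it works on local sequences instead of RDDs, and it only reuses the input values 1, 2, 1, 3 for illustration.

```scala
object CogroupSketch {
  // Local model of cogroup for two key-value collections: every key that occurs
  // on either side is paired with all of its left values and all of its right values.
  def cogroupLocal[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Map[K, (Seq[V], Seq[W])] = {
    val keys = (left.map(_._1) ++ right.map(_._1)).distinct
    keys.map { k =>
      val leftValues  = left.filter(_._1 == k).map(_._2)
      val rightValues = right.filter(_._1 == k).map(_._2)
      (k, (leftValues, rightValues))
    }.toMap
  }

  def main(args: Array[String]): Unit = {
    val a = List(1, 2, 1, 3)
    val b = a.map((_, "b")) // key-value pairs (1,b), (2,b), (1,b), (3,b)
    val c = a.map((_, "c")) // key-value pairs (1,c), (2,c), (1,c), (3,c)
    cogroupLocal(b, c).toSeq.sortBy(_._1).foreach(println)
  }
}
```

In this model, key 1 is paired with two "b" values and two "c" values because it occurs twice on each side. A key that is missing on one side gets an empty sequence for that side.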
