
Python For Data Science Cheat Sheet
PySpark - RDD Basics
Learn Python for data science Interactively at www.DataCamp.com
Initializing Spark
PySpark is the Spark Python API that exposes the Spark programming model to Python.
SparkContext
>>> from pyspark import SparkContext
>>> sc = SparkContext(master = 'local[2]')
Loading Data
Parallelized Collections
>>> rdd = sc.parallelize([('a',7),('a',2),('b',2)])
>>> rdd2 = sc.parallelize([('a',2),('d',1),('b',1)])
>>> rdd3 = sc.parallelize(range(100))
>>> rdd4 = sc.parallelize([("a",["x","y","z"]),
                           ("b",["p","r"])])
External Data
Read either one text file from HDFS, a local file system or any Hadoop-supported file system URI with textFile(), or read in a directory of text files with wholeTextFiles().
>>> textFile = sc.textFile("/my/directory/*.txt")
>>> textFile2 = sc.wholeTextFiles("/my/directory/")
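A rough sketch, not from the original sheet, of how external text data usually becomes a pair RDD: textFile() yields one string per line, which map() can parse into (key, value) tuples. The path /data/scores.csv and its "a,7" line format are hypothetical.
>>> lines = sc.textFile("/data/scores.csv")    Hypothetical file of lines such as "a,7"
>>> def parse(line):
...     k, v = line.split(",")
...     return (k, int(v))
>>> pairs = lines.map(parse)                   One (key, int value) tuple per input line
>>> pairs.take(2)                              e.g. [('a', 7), ('a', 2)]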
Applying Functions
>>> rdd.map(lambda x: x+(x[1],x[0]))            Apply a function to each RDD element
    .collect()
[('a',7,7,'a'),('a',2,2,'a'),('b',2,2,'b')]
>>> rdd5 = rdd.flatMap(lambda x: x+(x[1],x[0])) Apply a function to each RDD element and flatten the result
>>> rdd5.collect()
['a',7,7,'a','a',2,2,'a','b',2,2,'b']
>>> rdd4.flatMapValues(lambda x: x)             Apply a flatMap function to each (key,value) pair of rdd4 without changing the keys
    .collect()
[('a','x'),('a','y'),('a','z'),('b','p'),('b','r')]
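For contrast, a small sketch (using the rdd4 defined above) of how map() keeps one output element per input element while flatMap() flattens the iterables it returns; mapValues() applies a function to just the value side of each pair.
>>> rdd4.mapValues(len).collect()               Length of each value list, keys unchanged
[('a', 3), ('b', 2)]
>>> rdd4.map(lambda x: x[1]).collect()          One list per element
[['x','y','z'], ['p','r']]
>>> rdd4.flatMap(lambda x: x[1]).collect()      The same lists, flattened into one RDD
['x','y','z','p','r']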
Inspect SparkContext
>>> sc.version                 Retrieve SparkContext version
>>> sc.pythonVer               Retrieve Python version
>>> sc.master                  Master URL to connect to
>>> str(sc.sparkHome)          Path where Spark is installed on worker nodes
>>> str(sc.sparkUser())        Retrieve name of the Spark User running SparkContext
>>> sc.appName                 Return application name
>>> sc.applicationId           Retrieve application ID
>>> sc.defaultParallelism      Return default level of parallelism
>>> sc.defaultMinPartitions    Default minimum number of partitions for RDDs
Configuration
>>> from pyspark import SparkConf, SparkContext
>>> conf = (SparkConf()
            .setMaster("local")
            .setAppName("My app")
            .set("spark.executor.memory", "1g"))
>>> sc = SparkContext(conf = conf)
Using The Shell
In the PySpark shell, a special interpreter-aware SparkContext is already created in the variable called sc.
$ ./bin/spark-shell --master local[2]
$ ./bin/pyspark --master local[4] --py-files code.py
Set which master the context connects to with the --master argument, and add Python .zip, .egg or .py files to the runtime path by passing a comma-separated list to --py-files.
Stopping SparkContext
>>> sc.stop()
Execution
$ ./bin/spark-submit examples/src/main/python/pi.py
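As a minimal, hypothetical sketch of a script passed to spark-submit (the file name count_pairs.py and its contents are illustrative, not part of the original sheet): unlike the interactive shell, the script builds its own SparkConf/SparkContext and stops it when done.
# count_pairs.py -- hypothetical job, run with: $ ./bin/spark-submit count_pairs.py
from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("Count pairs")     # master comes from spark-submit / Spark defaults
sc = SparkContext(conf=conf)
counts = (sc.parallelize([('a',7),('a',2),('b',2)])
            .countByKey())                       # defaultdict {'a': 2, 'b': 1}
print(dict(counts))
sc.stop()                                        # release the context before the script exits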
Saving
>>> rdd.saveAsTextFile("rdd.txt")
>>> rdd.saveAsHadoopFile("hdfs://namenodehost/parent/child",
                         'org.apache.hadoop.mapred.TextOutputFormat')
Retrieving RDD Information
Basic Information
>>> rdd.getNumPartitions()          List the number of partitions
>>> rdd.count()                     Count RDD instances
3
>>> rdd.countByKey()                Count RDD instances by key
defaultdict(<type 'int'>,{'a':2,'b':1})
>>> rdd.countByValue()              Count RDD instances by value
defaultdict(<type 'int'>,{('b',2):1,('a',2):1,('a',7):1})
>>> rdd.collectAsMap()              Return (key,value) pairs as a dictionary
{'a': 2,'b': 2}
>>> rdd3.sum()                      Sum of RDD elements
4950
>>> sc.parallelize([]).isEmpty()    Check whether RDD is empty
True
Summary
>>> rdd3.max()                      Maximum value of RDD elements
99
>>> rdd3.min()                      Minimum value of RDD elements
0
>>> rdd3.mean()                     Mean value of RDD elements
49.5
>>> rdd3.stdev()                    Standard deviation of RDD elements
28.866070047722118
>>> rdd3.variance()                 Compute variance of RDD elements
833.25
>>> rdd3.histogram(3)               Compute histogram by bins
([0,33,66,99],[33,33,34])
>>> rdd3.stats()                    Summary statistics (count, mean, stdev, max & min)
Repartitioning
>>> rdd.repartition(4)              New RDD with 4 partitions
>>> rdd.coalesce(1)                 Decrease the number of partitions in the RDD to 1
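A quick sketch of how the partition count changes, assuming a fresh RDD created with an explicit partition count (this example is not from the original sheet):
>>> r = sc.parallelize(range(100), 8)            Explicitly ask for 8 partitions
>>> r.getNumPartitions()
8
>>> r.coalesce(2).getNumPartitions()             coalesce() without a shuffle can only lower the count
2
>>> r.repartition(16).getNumPartitions()         repartition() can raise it, at the cost of a shuffle
16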
Sort
>>> rdd2.sortBy(lambda x: x[1]) Sort RDD by given function
.collect()
[('d',1),('b',1),('a',2)]
>>> rdd2.sortByKey() Sort (key, value) RDD by key
.collect()
[('a',2),('b',1),('d',1)]
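Both sortBy() and sortByKey() also accept an ascending flag; a small sketch with the rdd2 defined above:
>>> rdd2.sortByKey(ascending=False).collect()                  Sort (key, value) RDD by key, descending
[('d',1),('b',1),('a',2)]
>>> rdd2.sortBy(lambda x: x[0], ascending=False).collect()     Same ordering expressed with sortBy
[('d',1),('b',1),('a',2)]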
Mathematical Operations
>>> rdd.subtract(rdd2)              Return each rdd value not contained in rdd2
    .collect()
[('b',2),('a',7)]
>>> rdd2.subtractByKey(rdd)         Return each (key,value) pair of rdd2 with no matching key in rdd
    .collect()
[('d', 1)]
>>> rdd.cartesian(rdd2).collect()   Return the Cartesian product of rdd and rdd2
Reshaping Data
Reducing
>>> rdd.reduceByKey(lambda x,y : x+y)   Merge the rdd values for each key
    .collect()
[('a',9),('b',2)]
>>> rdd.reduce(lambda a, b: a + b)      Merge the rdd values
('a',7,'a',2,'b',2)
Grouping by
>>> rdd3.groupBy(lambda x: x % 2)       Return RDD of grouped values
    .mapValues(list)
    .collect()
>>> rdd.groupByKey()                    Group rdd by key
    .mapValues(list)
    .collect()
[('a',[7,2]),('b',[2])]
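The two operations above are related: summing the grouped lists gives the same totals as reduceByKey(), which is usually preferred because it combines values on each partition before shuffling. A small sketch with the rdd defined above:
>>> rdd.groupByKey().mapValues(sum).collect()       Group first, then sum each list of values
[('a',9),('b',2)]
>>> rdd.reduceByKey(lambda x,y: x+y).collect()      Same totals, combined before the shuffle
[('a',9),('b',2)]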
Aggregating
>>> seqOp = (lambda x,y: (x[0]+y,x[1]+1))
>>> combOp = (lambda x,y: (x[0]+y[0],x[1]+y[1]))
>>> rdd3.aggregate((0,0),seqOp,combOp)      Aggregate RDD elements of each partition and then the results
(4950,100)
>>> rdd.aggregateByKey((0,0),seqOp,combOp)  Aggregate values of each RDD key
    .collect()
[('a',(9,2)), ('b',(2,1))]
>>> rdd3.fold(0,add)                        Aggregate the elements of each partition, and then the results
4950
>>> rdd.foldByKey(0, add)                   Merge the values for each key
    .collect()
[('a',9),('b',2)]
>>> rdd3.keyBy(lambda x: x+x)               Create tuples of RDD elements by applying a function
    .collect()
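A brief sketch tying these together: the add used by fold() and foldByKey() above is the plain operator.add, and the (sum, count) pair produced by aggregate() is what one would use to derive a mean by hand.
>>> from operator import add                    The add used by fold()/foldByKey() above
>>> rdd3.fold(0, add)
4950
>>> total, count = rdd3.aggregate((0,0), seqOp, combOp)
>>> total / float(count)                        Mean recovered from the (sum, count) pair
49.5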
Iterating
>>> def g(x): print(x)
>>> rdd.foreach(g)                 Apply a function to all RDD elements
('a', 7)
('b', 2)
('a', 2)
Selecting Data
Getting
>>> rdd.collect() Return a list with all RDD elements
[('a', 7), ('a', 2), ('b', 2)]
>>> rdd.take(2) Take first 2 RDD elements
[('a', 7), ('a', 2)]
>>> rdd.first() Take first RDD element
('a', 7)
>>> rdd.top(2) Take top 2 RDD elements
[('b', 2), ('a', 7)]
Sampling
>>> rdd3.sample(False, 0.15, 81).collect() Return sampled subset of rdd3
[3,4,27,31,40,41,42,43,60,76,79,80,86,97]
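sample() is a transformation that returns an RDD; when a plain Python list is enough, takeSample() does the sampling and the collect in one action. A small sketch (the exact elements depend on the seed):
>>> rdd3.sample(False, 0.15, 81).count()      sample() returns an RDD, so actions still apply
>>> rdd3.takeSample(False, 5, 81)             takeSample() returns a plain list of 5 elements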
Filtering
>>> rdd.filter(lambda x: "a" in x)   Filter the RDD
    .collect()
[('a',7),('a',2)]
>>> rdd5.distinct().collect()        Return distinct RDD values
['a',2,'b',7]
>>> rdd.keys().collect()             Return (key,value) RDD's keys
['a', 'a', 'b']
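A couple of closely related selections, sketched with the rdd defined above: filtering on the value part of each pair, and pulling out just the values (the counterpart of keys()).
>>> rdd.filter(lambda x: x[1] > 2).collect()   Keep pairs whose value exceeds 2
[('a', 7)]
>>> rdd.values().collect()                     Return (key,value) RDD's values
[7, 2, 2]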











