Getting Started with PySpark: Understanding Spark and Resilient Distributed Datasets

Updated 2024-07-18 · PDF · 11.91 MB

"Learning-PySpark" 是一本高清版的 PySpark 学习资源，包括带目录的 PDF 电子书，适合对 PySpark 感兴趣的学习者。在深入 PySpark 的学习过程中，这本书覆盖了从基础到进阶的多个主题。首先，它介绍了 Apache Spark 的基本概念，解释了 Spark 是如何作为一个快速、通用且可扩展的数据处理框架存在的。Apache Spark 提供了多种 API 来执行数据处理任务，包括 Spark Jobs 和各种操作接口。书中详细讲述了 Spark 的执行过程，核心概念是 Resilient Distributed Dataset（RDD），这是一种容错的分布式数据集。RDD 具有强大的并行计算能力，能够高效地在集群中进行数据处理。随着 Spark 的发展，DataFrame 和 Datasets 成为了更加高级的数据抽象，它们提供了更丰富的类型安全和优化的查询性能。DataFrame 是基于 Catalyst Optimizer 的，它能进行高效的查询计划优化。而 Datasets 结合了 RDD 的优点和 Scala/Java 的强类型系统，进一步提高了开发效率。在 Spark 2.0 及以后的版本中，DataFrame 和 Datasets 被统一到一个接口下，即 SparkSession，这简化了开发者的编程体验。Tungsten phase 2 是 Spark 在内存管理和计算优化上的一个重要改进，它提升了数据处理的性能。此外，本书还涵盖了 Structured Streaming，这是 Spark 提供的一种用于处理持续数据流的机制，可以用于构建实时分析应用。Structured Streaming 支持连续应用程序，使得数据处理更接近实时。在技术细节方面，书中有专门的章节讲解 RDD 的内部工作原理，包括如何创建 RDD、定义其模式以及从文件中读取数据。书中还通过使用 lambda 表达式展示了全局和局部作用域的概念，并详细解释了一系列常见的 RDD 转换操作，如 map、filter、flatMap、distinct、sample 和 join 等。总体而言，"Learning-PySpark" 是一本全面的指南，不仅适合初学者理解 PySpark 的基础知识，也适用于有一定经验的开发者深入研究 Spark 的高级特性。书中提供的代码示例和详细的解释将帮助读者在实际项目中更好地应用 PySpark。

About the Authors

Tomasz Drabas is a Data Scientist working for Microsoft and currently residing in the Seattle area. He has over 13 years of experience in data analytics and data science in numerous fields: advanced technology, airlines, telecommunications, finance, and consulting, which he gained while working on three continents: Europe, Australia, and North America.

While in Australia, Tomasz worked on his PhD in Operations Research with a focus on choice modeling and revenue management applications in the airline industry.

At Microsoft, Tomasz works with big data on a daily basis, solving machine learning problems such as anomaly detection, churn prediction, and pattern recognition using Spark.

Tomasz has also authored the Practical Data Analysis Cookbook published by Packt Publishing in 2016.

I would like to thank my family: Rachel, Skye, and Albert. You are the love of my life and I cherish every day I spend with you! Thank you for always standing by me and for encouraging me to push my career goals further and further. Also, to my family and my in-laws for putting up with me (in general).

There are so many more people who have influenced me over the years that I would have to write another book to thank them all. You know who you are and I want to thank you from the bottom of my heart!

However, I would not have gotten through my PhD if it were not for Czesia Wieruszewska. Czesia, thank you for your help, without which I would not have started my journey through the Antipodes. Along with Krzys Krzysztoszek, you guys have always believed in me! Thank you!

Denny Lee is a Principal Program Manager at Microsoft for the Azure DocumentDB team, Microsoft's blazing fast, planet-scale managed document store service. He is a hands-on distributed systems and data

