Getting Started with PySpark in Python and a Deep Dive into RDDs

"《Python Spark学习指南》深入探索Apache Spark与Python集成的世界" 在这个全面的教程中,我们将首先理解Spark的基本概念,它是一个开源的大数据处理框架,以在内存中高效执行计算而闻名。Apache Spark由LinkedIn开发,后来成为Apache软件基金会的一部分,支持实时流处理和批处理任务。本书的标题"pyspark_study"着重于Python编程语言在Spark中的应用。 章节1,"Understanding Spark",会介绍Spark的核心组件,如Spark Jobs和APIs,以及其执行流程。Resilient Distributed Datasets (RDD)是Spark的基础,它们是分布式、容错的数据结构,可以在集群上进行并行操作。这里还会讲解DataFrame和Dataset的概念,后者是Spark 2.0引入的新抽象,旨在简化数据处理。Catalyst Optimizer用于优化DataFrame的执行计划,而Project Tungsten则提升了性能,尤其是在内存管理方面。此外,Spark 2.0架构的统一了Datasets和DataFrames,引入了SparkSession作为易用的接口。 Structured Streaming部分介绍了Spark的实时流处理能力,如何构建连续应用,并讨论了这些在处理持续数据流时的优势。例如,Lambda expressions(lambda表达式)在这里发挥了重要作用,允许用户定义简洁的操作逻辑。作者还会讲解各种转换方法,如.map()、.filter()、.flatMap()等,以及如何利用.distinct()、.sample()进行数据清洗和采样,以及.leftOuterJoin()用于进行关联查询。 本书不仅涵盖了技术细节,还包含了对Spark工作原理的深入剖析,适合希望通过Python进行大数据分析和处理的初学者和专业人士。阅读过程中,读者可以参考网站<https://www.iteblog.com>获取更多学习资源和支持。为了确保最佳学习体验,书中提供了实践代码下载链接,以及彩色图像供读者参考。同时,作者和评审者的贡献以及读者反馈都在相应章节有所提及,旨在共同提升内容质量。务必注意,如果发现任何错误或侵权行为,请通过指定渠道报告,以便及时修正。如果你在学习过程中有任何疑问,作者鼓励读者提问,共同探讨Spark与Python结合的无限可能。
About This Book

- Learn why and how you can efficiently use Python to process data and build machine learning models in Apache Spark 2.0
- Develop and deploy efficient, scalable real-time Spark solutions
- Take your understanding of using Spark with Python to the next level with this jump start guide

Who This Book Is For

If you are a Python developer who wants to learn about the Apache Spark 2.0 ecosystem, this book is for you. A firm understanding of Python is expected to get the best out of the book. Familiarity with Spark would be useful, but is not mandatory.

What You Will Learn

- Learn about Apache Spark and the Spark 2.0 architecture
- Build and interact with Spark DataFrames using Spark SQL
- Learn how to solve graph and deep learning problems using GraphFrames and TensorFrames respectively
- Read, transform, and understand data and use it to train machine learning models
- Build machine learning models with MLlib and ML (a short sketch follows this description)
- Learn how to submit your applications programmatically using spark-submit
- Deploy locally built applications to a cluster

In Detail

Apache Spark is an open source framework for efficient cluster computing with a strong interface for data parallelism and fault tolerance. This book will show you how to leverage the power of Python and put it to use in the Spark ecosystem. You will start by getting a firm understanding of the Spark 2.0 architecture and how to set up a Python environment for Spark.

You will get familiar with the modules available in PySpark. You will learn how to abstract data with RDDs and DataFrames and understand the streaming capabilities of PySpark. Also, you will get a thorough overview of machine learning capabilities of PySpark using ML and MLlib, graph processing using GraphFrames, and polyglot persistence using Blaze. Finally, you will learn how to deploy your applications to the cloud using the spark-submit command.

By the end of this book, you will have established a firm understanding of the Spark Python API and how it can be used to build data-intensive applications.

Style and approach

This book takes a very comprehensive, step-by-step approach so you understand how the Spark ecosystem can be used with Python to develop efficient, scalable solutions. Every chapter is standalone and written in a very easy-to-understand manner, with a focus on both the hows and the whys of each concept.
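To ground the MLlib/ML and spark-submit items in the list above, here is a minimal sketch of a DataFrame-based ML pipeline, assuming PySpark 2.x; the feature columns, training rows, and the ml_demo.py file name are all hypothetical:

```python
# Minimal sketch: training a model with the DataFrame-based pyspark.ml API.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression

spark = SparkSession.builder.appName("ml-demo").getOrCreate()

train = spark.createDataFrame(
    [(0.0, 1.0, 0.0), (1.0, 0.0, 1.0), (0.5, 0.5, 1.0)],  # made-up rows
    ["f1", "f2", "label"],
)

# Assemble raw columns into a feature vector, then fit a classifier.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(train)
model.transform(train).select("label", "prediction").show()

# Saved as e.g. ml_demo.py, the script could be submitted to a cluster with:
#   spark-submit --master <cluster-url> ml_demo.py
```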