Hands-On Tutorial: Processing Big Data with Apache Spark and Python

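Frank Kane's Taming Big Data with Apache Spark and Python is a book by Frank Kane that uses real-world examples to help readers analyze large datasets in practice. It was published by Packt Publishing in 2017 and is protected by copyright: no part of it may be reproduced, stored, or transmitted in any form without the written permission of the copyright holder.

The book centers on two data processing tools, Apache Spark and Python. Apache Spark is an open-source distributed computing framework well suited to large-scale data; its in-memory computing model supports fast, near-real-time processing and analysis. Python is an easy-to-learn, feature-rich programming language widely used in data analysis, with libraries such as Pandas, NumPy, and SciPy that make data manipulation and analysis efficient.

Through a series of examples, the book explains how to use Spark's DataFrame API and Spark SQL to process data, covering data cleaning, data transformation, aggregation, and machine learning. Readers learn how to take advantage of Spark's parallel computing capabilities and how to write concise, efficient Python code for complex computations. The book also touches on integrating other tools, such as Databricks notebooks, to streamline the workflow.

Although the author and Packt Publishing have tried to ensure the accuracy of the information in the book, the content is not guaranteed to be error-free, and readers may run into problems caused by differences in environment or software version. The trademark information cited in the book is likewise not guaranteed to be accurate or current, but this does not affect learning the core uses of Spark and Python in big data processing.

Frank Kane's Taming Big Data with Apache Spark and Python is a practical guide for data analysts, data engineers, and other professionals who want to improve their big data skills. Both beginners and experienced developers can find useful material in it for working with very large datasets more efficiently and effectively.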

Preface

What you need for this book

For this book you'll need a Python development environment (Python 3.5 or newer), a Canopy installer, the Java Development Kit, and of course Spark itself (Spark 2.0 and beyond). We'll show you how to install this software in the first chapter of the book.

This book is based on the Windows operating system, so the installation instructions are written for Windows. If you have Mac or Linux, you can follow this URL, http://media.sundog-soft.com/spark-python-install.pdf, which contains written instructions on getting everything set up on Mac OS and on Linux.
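
As a quick sanity check before starting, the prerequisites listed above can be probed from Python. This is only a minimal sketch, not part of the book's setup instructions: the function name `check_environment` is hypothetical, finding a `java` executable does not prove a full JDK is installed, and the `pyspark` check assumes PySpark is importable from the current interpreter, which may not hold if Spark is only used through `spark-submit`.

```python
import importlib.util
import shutil
import sys

# Minimum Python version taken from the text above.
MIN_PYTHON = (3, 5)


def check_environment():
    """Return a dict mapping each prerequisite name to True or False."""
    return {
        # The interpreter running this script is Python 3.5 or newer.
        "python": sys.version_info[:2] >= MIN_PYTHON,
        # A `java` executable is on PATH (a JRE would also pass this check,
        # so it does not prove that a full JDK is installed).
        "java": shutil.which("java") is not None,
        # Assumption: PySpark is importable from this interpreter. A Spark
        # install that is only used via spark-submit may report False here.
        "pyspark": importlib.util.find_spec("pyspark") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print("{:8s} {}".format(name, "found" if ok else "missing"))
```

A "missing" entry only means this particular probe failed; the installation chapter remains the reference for what a working setup looks like.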

Who this book is for

I wrote this book for people who have at least some programming or scripting experience in their background. We're going to be using the Python programming language throughout this book, which is very easy to pick up, and I'm going to give you over 15 real hands-on examples of Spark Python scripts that you can run yourself, mess around with, and learn from. So, by the end of this book, you should have the skills needed to actually turn business problems into Spark problems, code up that Spark code on your own, and actually run it in the cluster on your own.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, path names, dummy URLs, user input, and Twitter handles are shown as follows: "Now, you'll need to remember the path that we installed the JDK into, which in our case was C:\jdk."

A block of code is set as follows:

from pyspark import SparkConf, SparkContext
import collections

conf = SparkConf().setMaster("local").setAppName("RatingsHistogram")
sc = SparkContext(conf = conf)

lines = sc.textFile("file:///SparkCourse/ml-100k/u.data")
ratings = lines.map(lambda x: x.split()[2])
result = ratings.countByValue()

sortedResults = collections.OrderedDict(sorted(result.items()))
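
To see what the listing above computes without a Spark installation, the same steps can be mirrored in plain Python. This is a sketch under stated assumptions rather than the book's code: the five sample rows are made up, laid out like u.data (user ID, movie ID, rating, timestamp, separated by tabs), and `collections.Counter` stands in for `countByValue()`.

```python
import collections

# Hypothetical sample rows in the u.data layout (user ID, movie ID, rating,
# timestamp, separated by tabs). These values are made up for illustration.
sample_lines = [
    "1\t10\t3\t881250949",
    "2\t11\t3\t891717742",
    "3\t12\t1\t878887116",
    "4\t10\t5\t880606923",
    "5\t13\t3\t886397596",
]

# Same extraction as lines.map(lambda x: x.split()[2]): keep the third field.
ratings = [line.split()[2] for line in sample_lines]

# Counter plays the role of countByValue(): rating -> number of occurrences.
result = collections.Counter(ratings)

# Sort by rating so the histogram can be printed in ascending order.
sortedResults = collections.OrderedDict(sorted(result.items()))

for rating, count in sortedResults.items():
    print("{} {}".format(rating, count))
```

The difference in the Spark version is where the work happens: `map` runs on the distributed dataset, and `countByValue()` brings the per-rating counts back to the driver, after which the sorting is ordinary Python as here.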

The code bundle for the book is also hosted on GitHub at
https://github.com/PacktPublishing/Frank-Kanes-Taming-Big-Data-with-Apache-Spark-and-Python.
We also have other code bundles from our rich catalog of books and videos available at
https://github.com/PacktPublishing/. Check them out!
Downloading the color images of this book
We also provide you with a PDF file that has color images of the screenshots/diagrams used
in this book. The color images will help you better understand the changes in the output.
You can download this file from
https://www.packtpub.com/sites/default/files/downloads/FrankKanesTamingBigDatawithApacheSparkandPython_ColorImages.pdf.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do
happen. If you find a mistake in one of our books (maybe a mistake in the text or the code),
we would be grateful if you could report this to us. By doing so, you can save other readers
from frustration and help us improve subsequent versions of this book. If you find any
errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting
your book, clicking on the Errata Submission Form link, and entering the details of your
errata. Once your errata are verified, your submission will be accepted and the errata will
be uploaded to our website or added to any list of existing errata under the Errata section of
that title. To view the previously submitted errata, go to https://www.packtpub.com/books/content/support
and enter the name of the book in the search field. The required
information will appear under the Errata section.
Piracy
Piracy of copyrighted material on the Internet is an ongoing problem across all media. At
Packt, we take the protection of our copyright and licenses very seriously. If you come
across any illegal copies of our works in any form on the Internet, please provide us with
the location address or website name immediately so that we can pursue a remedy. Please
contact us at copyright@packtpub.com with a link to the suspected pirated material. We
appreciate your help in protecting our authors and our ability to bring you valuable
content.
Questions
If you have a problem with any aspect of this book, you can contact us at
questions@packtpub.com, and we will do our best to address the problem.