A Guide for Python Developers: Spark Big Data Cluster Computing in Practice

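"Spark for Python Developers" is a book published by Packt Publishing in 2015. It is aimed at developers who are familiar with Python programming and want to learn and understand Apache Spark, and it explores in depth how to use Spark for big data cluster computing in production environments. The authors include Ilya Ganelin, Ema Orhian, Kai Sasaki, and Brennon York.

Apache Spark is a powerful tool in the field of big data processing. It provides a distributed, in-memory computing framework that can greatly improve the speed and efficiency of data processing. For Python developers, Spark provides the PySpark interface, which makes it straightforward to write distributed applications in Python. The book may cover the following key topics:

1. Spark fundamentals: the basic architecture of Spark, including Master and Worker nodes, and how to set up and manage a Spark cluster.
2. Getting started with PySpark: how to install and configure a PySpark environment, and how to create and work with a SparkContext, the foundation of a PySpark program.
3. RDD (Resilient Distributed Datasets): the RDD is Spark's core data structure; the book explains the concept, how RDDs are created, their transformations and actions, and how their resilience is used to recover from data errors.
4. DataFrame and Spark SQL: as Spark has evolved, DataFrame and Spark SQL have become the main way to process structured data. This part introduces data manipulation with the DataFrame API and how to run SQL queries.
5. Spark Streaming: Spark supports real-time stream processing; the book may introduce how to use DStream (Discretized Stream) to process continuous data streams and perform real-time analysis.
6. Spark MLlib: Spark's machine learning library, MLlib, provides a range of algorithms, including classification, regression, clustering, and collaborative filtering. This part covers how to use them to build predictive models.
7. Spark GraphX: for graph data, GraphX provides an API to create and manipulate graphs, suited to scenarios such as social network analysis and recommender systems.
8. Spark performance tuning: how to improve the performance of Spark applications by adjusting configuration parameters, data partitioning strategies, and caching.
9. Integrating Spark with Hadoop: because Spark can run on top of Hadoop, the book may include material on interacting with Hadoop ecosystem components such as HDFS and HBase.
10. Case studies: real projects or cases that show how to deploy and manage Spark applications in production and how to resolve the problems that may arise.

For developers who want to process large-scale data with Python and Spark, the book is a valuable reference. It covers both theory and practical experience, helping readers get started quickly and master Spark's core functionality.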

• A set of guidelines and trade-offs on the various configuration parameters that can be used to tune Spark for high availability and fault tolerance
• A complete picture of a production workflow and the various components necessary to migrate an application into a production workflow

What You Need to Use This Book

You should understand the basics of development and usage atop Apache Spark. This book will not be covering introductory material. There are numerous books, forums, and resources available that cover this topic and, as such, we assume all readers have basic Spark knowledge or, if lost, will read up on the relevant topics to better understand the material presented in this book.

The source code for the samples is available for download from the Wiley website at: www.wiley.com/go/sparkbigdataclustercomputing.

Conventions

To help you get the most from the text and keep track of what’s happening, we’ve used a number of conventions throughout the book.

NOTE Notes indicate notes, tips, hints, tricks, or asides to the current discussion.

As for styles in the text:

• We highlight new terms and important words when we introduce them.
• We show code within the text like so: persistence.properties.

Source Code

As you work through the examples in this book, you may choose either to type in all the code manually, or to use the source code files that accompany the book. All the source code used in this book is available for download at www.wiley.com. Specifically for this book, the code download is on the Download Code tab at www.wiley.com/go/sparkbigdataclustercomputing.

You can also search for the book at www.wiley.com by ISBN.

CHAPTER 1

Finishing Your Spark Job

When you scale out a Spark application for the first time, one of the more common occurrences you will encounter is the application’s inability to simply succeed and finish its job. The Apache Spark framework’s ability to scale is tremendous, but it does not come out of the box with those properties. Spark was created, first and foremost, to be a framework that would be easy to get started with and use. Once you have developed an initial application, however, you will need to take on the additional exercise of gaining deeper knowledge of Spark’s internals and configurations to take the job to the next stage.

In this chapter we lay the groundwork for getting a Spark application to succeed. We will focus primarily on the hardware and system-level design choices you need to set up and consider before you can work through the various Spark-specific issues to move an application into production.

We will begin by discussing the various ways you can install a production-grade cluster for Apache Spark. We will include the scaling efficiencies you will need depending on a given workload, the various installation methods, and the common setups. Next, we will take a look at the historical origins of Spark in order to better understand its design and to allow you to best judge when it is the right tool for your jobs. Following that, we will take a look at resource management: how memory, CPU, and disk usage come into play when creating and executing Spark applications. Next, we will cover storage capabilities within Spark and their external subsystems. Finally, we will conclude with a discussion of how to instrument and monitor a Spark application.

Installation of the Necessary Components

Before you can begin to migrate an application written in Apache Spark, you will need an actual cluster to begin testing it on. You can download, compile, and install Spark in a number of different ways within its system (some will be easier than others), and we’ll cover the primary methods in this chapter.

Let’s begin by explaining how to configure a native installation, meaning one where only Apache Spark is installed. Then we’ll move on to the various Hadoop distributions (Cloudera and Hortonworks), and conclude by providing a brief explanation of how to deploy Spark on Amazon Web Services (AWS).

Before diving too far into the various ways you can install Spark, the obvious question that arises is, “What type of hardware should I leverage for a Spark cluster?” We can offer various possible answers to this question, but we’d like to focus on a few resounding truths of the Spark framework rather than prescribing a particular layout.

It’s important to know that Apache Spark is an in-memory compute grid. Therefore, for maximum efficiency, it is highly recommended that the system, as a whole, maintain enough memory within the framework for the largest workload (or dataset) that will conceivably be consumed. We are not saying that you cannot scale a cluster later, but it is always better to plan ahead, especially if you work inside a larger organization where purchase orders might take weeks or months.

On the subject of memory, it is necessary to understand that the amount of memory you need does not map one-to-one to the size of your data. That is to say, for a given 1TB dataset, you will need more than 1TB of memory. This is because when you create objects within Java from a dataset, the object is typically much larger than the original data element. Multiply that expansion by the number of objects created for a given dataset and you will have a much more accurate representation of the amount of memory a system will require to perform a given task.
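
The sizing arithmetic above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the expansion factor is something you would measure on a sample of your own data, and the sample sizes, the 48 GiB of usable memory per node, and all function names below are hypothetical values chosen for the example, not figures from this book or from Spark.

```python
import math


def expansion_factor(raw_sample_bytes: int, in_memory_sample_bytes: int) -> float:
    """Ratio of in-memory size to raw size, measured on a small sample."""
    if raw_sample_bytes <= 0:
        raise ValueError("raw_sample_bytes must be positive")
    return in_memory_sample_bytes / raw_sample_bytes


def estimate_memory_bytes(dataset_bytes: int, factor: float) -> int:
    """Scale the raw dataset size by the measured expansion factor."""
    return math.ceil(dataset_bytes * factor)


def nodes_needed(required_bytes: int, usable_bytes_per_node: int) -> int:
    """Number of nodes needed to hold the expanded dataset in memory."""
    return math.ceil(required_bytes / usable_bytes_per_node)


GIB = 1024 ** 3
TIB = 1024 ** 4

# Hypothetical measurement: a 100 MiB sample occupies 300 MiB once loaded.
factor = expansion_factor(100 * 1024 ** 2, 300 * 1024 ** 2)

# A 1 TiB dataset with that factor, on nodes with 48 GiB usable each (assumed).
required = estimate_memory_bytes(1 * TIB, factor)
nodes = nodes_needed(required, 48 * GIB)

print(f"expansion factor: {factor}")
print(f"memory required:  {required / TIB} TiB")
print(f"nodes needed:     {nodes}")
```

The point of the sketch is the shape of the calculation rather than the numbers: measure the expansion on a sample, scale the full dataset by it, and only then divide by per-node memory.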

To better attack this problem, Spark is, at the time of this writing, working on what Apache has called Project Tungsten, which will greatly reduce the memory overhead of objects by leveraging off-heap memory. You don’t need to know more about Tungsten as you continue reading this book, but this information may apply to future Spark releases, because Tungsten is poised to become the de facto memory management system.

The second major component we want to highlight in this chapter is the number of CPU cores you will need per physical machine when you are determining hardware for Apache Spark. The answer here is much less clear-cut because, once the data is loaded into memory, the application is typically network or CPU bound. That said, the easiest solution is to test your Spark application on a smaller dataset, measure its bounding case, be it network or CPU, and then plan accordingly from there.

Native Installation Using a Spark Standalone Cluster

The simplest way to install Spark is to deploy a Spark Standalone cluster. In this mode, you deploy a Spark binary to each node in a cluster, update a small set of configuration files, and then start the appropriate processes on the master and slave nodes. In Chapter 2, we discuss this process in detail and present a simple scenario covering installation, deployment, and execution of a basic Spark job.

Because Spark is not tied to the Hadoop ecosystem, this mode does not have any dependencies aside from the Java JDK. Spark currently recommends the Java 1.7 JDK. If you wish to run alongside an existing Hadoop deployment, you can launch the Spark processes on the same machines as the Hadoop installation and configure the Spark environment variables to include the Hadoop configuration.
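
Once a Standalone cluster is running, an application reaches it through a master URL of the form spark://host:port, where 7077 is the Standalone master's default port. The following is a minimal sketch of how a PySpark driver might connect, assuming PySpark is installed on the driver machine. The host name and application name are hypothetical, and the helper functions are illustrative rather than part of Spark.

```python
def standalone_master_url(host: str, port: int = 7077) -> str:
    """Build a Standalone master URL; 7077 is the default master port."""
    if not host:
        raise ValueError("host must be non-empty")
    return f"spark://{host}:{port}"


def create_context(master_url: str, app_name: str):
    """Create a SparkContext that connects to the given master."""
    # Imported lazily so the URL helper above works without PySpark installed.
    from pyspark import SparkConf, SparkContext

    conf = SparkConf().setMaster(master_url).setAppName(app_name)
    return SparkContext(conf=conf)


# Hypothetical host name, used only for illustration.
url = standalone_master_url("master.example.internal")
print(url)

# On a machine that can reach the cluster, you would then call:
# sc = create_context(url, "first-standalone-job")
# print(sc.parallelize(range(100)).sum())
# sc.stop()
```

The calls that touch the cluster are left commented out because they only succeed when a master is actually listening at that address.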

NOTE For more on a Cloudera installation of Spark, try http://www.cloudera.com/content/www/en-us/documentation/enterprise/latest/topics/cdh_ig_spark_installation.html. For more on the Hortonworks installation, try http://hortonworks.com/hadoop/spark/#section_6. And for more on an Amazon Web Services installation of Spark, try http://aws.amazon.com/articles/4926593393724923.

The History of Distributed Computing That Led to Spark

We have introduced Spark as a distributed compute framework; however, we haven’t really discussed what this means. Until recently, most computer systems available to both individuals and enterprises were based around single machines. These single machines came in many shapes and sizes and differed dramatically in terms of their performance, as they do today.

We’re all familiar with the modern ecosystem of personal machines. At the low end, we have tablets and mobile phones. We can think of these as relatively weak, un-networked computers. At the next level we have laptops and desktop computers. These are more powerful machines, with more storage and computational ability, and potentially with one or more graphics cards (GPUs) that support certain types of massively parallel computations. Next are the machines that some people have networked together in their homes, although generally these machines were networked not to share their computational ability but to provide shared storage, for example to share movies or music across a home network.
