Mastering Spark: An In-Depth Learning Guide


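"Learning Spark" is a book co-authored by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia that teaches Apache Spark. It is published by O'Reilly Media, Inc., with copyright held by Databricks. The book covers Spark's core concepts, applications, and best practices, and aims to help readers understand and master the technology.

Apache Spark is an open source big data processing framework. It provides high-level parallel computing capabilities that let users run fast computations over large datasets. Its core abstraction is the Resilient Distributed Dataset (RDD), a fault-tolerant in-memory computing model that significantly speeds up data processing. RDDs support two kinds of operations, transformations and actions, which make data processing more efficient and convenient.

The book may cover the following topics:

1. **Introduction to Spark**: Spark's history, its design philosophy, its place in the big data ecosystem, and how it compares with other frameworks such as Hadoop.
2. **Spark architecture**: Spark's components, such as the driver program, executors, and the cluster manager (for example YARN or Mesos), and how to configure and manage a Spark cluster.
3. **Programming model**: how to use Spark's APIs, including the Scala, Java, Python, and R interfaces, and the DataFrame and Dataset APIs, which simplify data processing tasks.
4. **Core components**: Spark SQL for structured data processing, Spark Streaming for real-time stream processing, the MLlib machine learning library, and the GraphX graph processing framework.
5. **Performance tuning**: how to tune Spark jobs, including memory management, data serialization, and parallelism settings, and how to use Dynamic Resource Allocation and shuffle optimizations.
6. **Case studies**: Spark in real projects, possibly including recommender systems, log analysis, image processing, and social network analysis.
7. **Deployment and operations**: how to deploy Spark locally, on a cluster, or in the cloud, and how to monitor and debug Spark applications.
8. **Spark ecosystem**: tools and libraries related to Spark, such as the integration of Spark SQL with Hive, and how Spark interacts with other big data components such as Kafka and HDFS.
9. **Development and testing**: the development workflow, including unit testing, code quality assurance, and continuous integration.
10. **Community and future**: Spark's community resources and the project's future direction and updates.

The book's revision history shows that the author team has continued to update and improve the content so that readers get current information about Spark. By reading "Learning Spark", readers can understand Spark's basic principles and also gain practical experience for using Spark to solve complex data processing problems at work. The book offers useful insight and guidance for beginners and experienced developers alike.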
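The distinction between transformations and actions mentioned above can be illustrated with a toy, single-machine model. This is only a sketch in plain Python, not Spark's API: the class name `ToyRDD` and its methods are hypothetical stand-ins chosen for illustration. It assumes the usual description of the model, in which transformations only record what should be done and an action triggers the actual work.

```python
class ToyRDD:
    """A toy, single-machine stand-in for an RDD (hypothetical, not Spark's API)."""

    def __init__(self, source, steps=()):
        self._source = source  # any re-iterable collection, e.g. a list
        self._steps = steps    # recorded transformations, not yet executed

    # Transformations: record the step and return a new ToyRDD. No work happens.
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._source, self._steps + (("filter", pred),))

    def _iterate(self):
        # Chain the recorded steps lazily over the source data.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: run the recorded steps and return a result to the caller.
    def collect(self):
        return list(self._iterate())

    def count(self):
        return sum(1 for _ in self._iterate())


calls = []


def square(x):
    calls.append(x)  # track when the function actually runs
    return x * x


numbers = ToyRDD([1, 2, 3, 4, 5])
even_squares = numbers.map(square).filter(lambda v: v % 2 == 0)
calls_before_action = len(calls)  # nothing has executed yet
result = even_squares.collect()   # the action triggers the computation
```

In this sketch `square` is not called until `collect()` runs, which is the behavior the terms "transformation" and "action" are meant to capture. A real RDD additionally partitions the data across a cluster and tracks lineage for fault recovery, none of which is modeled here.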

allows your applications to also run on them. Chapter 7 explores the different options and how to choose the correct cluster manager.

Who Uses Spark, and For What?

Because Spark is a general-purpose framework for cluster computing, it is used for a diverse range of applications. In the Preface we outlined two personas that this book targets as readers: Data Scientists and Engineers. Let’s take a closer look at each of these personas and how they use Spark. Unsurprisingly, the typical use cases differ across the two personas, but we can roughly classify them into two categories: data science and data applications.

Of course, these are imprecise personas and usage patterns, and many folks have skills from both, sometimes playing the role of the investigating Data Scientist, and then “changing hats” and writing a hardened data processing system. Nonetheless, it can be illuminating to consider the two personas and their respective use cases separately.

Data Science Tasks

Data Science is the name of a discipline that has been emerging over the past few years centered around analyzing data. While there is no standard definition, for our purposes a Data Scientist is somebody whose main task is to analyze and model data. Data Scientists may have experience using SQL, statistics, predictive modeling (machine learning), and some programming, usually in Python, MATLAB, or R. Data Scientists also have experience with techniques necessary to transform data into formats that can be analyzed for insights (sometimes referred to as data wrangling).

Data Scientists use their skills to analyze data with the goal of answering a question or discovering insights. Oftentimes, their workflow involves ad-hoc analysis, and so they use interactive shells (vs. building complex applications) that let them see results of queries and snippets of code in the least amount of time. Spark’s speed and simple APIs shine for this purpose, and its built-in libraries mean that many algorithms are available out of the box.

Spark supports the different tasks of data science with a number of components. The PySpark shell makes it easy to do interactive data analysis using Python. Spark SQL also has a separate SQL shell that can be used for data exploration using SQL, or Spark SQL can be used as part of a regular Spark program or in the PySpark shell. Machine learning and data analysis are supported through the MLlib libraries. In addition, support exists for calling out to existing programs in MATLAB or R. Spark enables Data Scientists to tackle problems with larger data sizes than they could before with tools like R or pandas.

Sometimes, after the initial exploration phase, the work of a Data Scientist will be “productionized”, or extended, hardened (i.e., made fault-tolerant), and tuned to become a production data processing application, which itself is a component of a business application. For example, the initial investigation of a Data Scientist might lead to the creation of a production recommender system that is integrated into a web application and used to generate customized product suggestions for users. Often it is a different person or team that leads the process of productionizing the work of the Data Scientists, and that person is often an Engineer.

Data Processing Applications

The other main use case of Spark can be described in the context of the Engineer persona. For our purposes here, we think of Engineers as a large class of software developers who use Spark to build production data processing applications. These developers usually have an understanding of the principles of software engineering, such as encapsulation, interface design, and object-oriented programming. They frequently have a degree in Computer Science. They use their engineering skills to design and build software systems that implement a business use case.

For Engineers, Spark provides a simple way to parallelize these applications across clusters, and hides the complexity of distributed systems programming, network communication, and fault tolerance. The system gives enough control to monitor, inspect, and tune applications while allowing common tasks to be implemented quickly. The modular nature of the API (based on passing distributed collections of objects) makes it easy to factor work into reusable libraries and test it locally.

Spark’s users choose to use it for their data processing applications because it provides a wide variety of functionality, is easy to learn and use, and is mature and reliable.

A Brief History of Spark

Spark is an open source project that has been built and is maintained by a thriving and diverse community of developers from many different organizations. If you or your organization are trying Spark for the first time, you might be interested in the history of the project. Spark started in 2009 as a research project in the UC Berkeley RAD Lab, later to become the AMPLab. The researchers in the lab had previously been working on Hadoop MapReduce, and observed that MapReduce was inefficient for iterative and interactive computing jobs. Thus, from the beginning, Spark was designed to be fast for interactive queries and iterative algorithms, bringing in ideas like support for in-memory storage and efficient fault recovery.
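To see why in-memory storage matters for iterative algorithms, consider the following toy sketch. It is plain Python, not Spark or MapReduce code, and every name in it (`load_points`, `iterate_without_cache`, `iterate_with_cache`) is hypothetical. It assumes a simplified picture in which a job that does not keep its input in memory must reread that input on every pass.

```python
def load_points(read_log):
    """Stand-in for an expensive read from storage; records each read."""
    read_log.append("read")
    return [1.0, 2.0, 3.0, 4.0]


def iterate_without_cache(iterations, read_log):
    estimate = 0.0
    for _ in range(iterations):
        points = load_points(read_log)       # input is reloaded on every pass
        mean = sum(points) / len(points)
        estimate += 0.5 * (mean - estimate)  # step halfway toward the mean
    return estimate


def iterate_with_cache(iterations, read_log):
    points = load_points(read_log)           # loaded once, then kept in memory
    estimate = 0.0
    for _ in range(iterations):
        mean = sum(points) / len(points)
        estimate += 0.5 * (mean - estimate)
    return estimate


reads_uncached, reads_cached = [], []
uncached_result = iterate_without_cache(10, reads_uncached)
cached_result = iterate_with_cache(10, reads_cached)
```

Both functions compute the same estimate, but the first performs one read per iteration while the second performs a single read in total. The cost gap grows with the number of iterations, which is the kind of workload the paragraph above describes.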

Research papers were published about Spark at academic conferences, and soon after its creation in 2009 it was already 10 to 20 times faster than MapReduce for certain jobs.

Some of Spark’s first users were other groups inside UC Berkeley, including machine learning researchers such as the Mobile Millennium project, which used Spark to monitor and predict traffic congestion in the San Francisco Bay Area. In a very short time, however, many external organizations began using Spark, and today over 50 organizations list themselves on the Spark PoweredBy page [1], and dozens speak about their use cases at Spark community events such as Spark Meetups [2] and the Spark Summit [3]. Apart from UC Berkeley, major contributors to the project currently include Yahoo!, Intel, and Databricks.

In 2011, the AMPLab started to develop higher-level components on Spark, such as Shark (Hive on Spark) and Spark Streaming. These and other components are often referred to as the Berkeley Data Analytics Stack (BDAS) [4]. BDAS includes both components of Spark and other software projects that complement it, such as the Tachyon memory manager.

Spark was first open sourced in March 2010, and was transferred to the Apache Software Foundation in June 2013, where it is now a top-level project.

1. https://cwiki.apache.org/confluence/display/SPARK/Powered+By+Spark

2. http://www.meetup.com/spark-users/

3. http://spark-summit.org

4. https://amplab.cs.berkeley.edu/software

Spark Versions and Releases

Since its creation, Spark has been a very active project and community, with the number of contributors growing with each release. Spark 1.0 had over 100 individual contributors. Though the level of activity has rapidly grown, the community continues to release updated versions of Spark on a regular schedule. Spark 1.0 was released in May 2014. This book focuses primarily on Spark 1.1.0 and beyond, though most of the concepts and examples also work in earlier versions.

Persistence Layers for Spark

Spark can create distributed datasets from any file stored in the Hadoop distributed file system (HDFS) or other storage systems supported by the Hadoop APIs (including your local file system, Amazon S3, Cassandra, Hive, HBase, etc.). It’s important to remember that Spark does not require Hadoop; it simply has support for storage systems implementing the Hadoop APIs. Spark supports text files, SequenceFiles, Avro, Parquet, and any other Hadoop InputFormat. We will look at interacting with these data sources in the loading and saving chapter.
