Apache Spark: The Definitive Guide (Big Data Processing Made Simple)

"Apache Spark是面向大规模数据处理的高性能计算引擎，强调速度、易用性和通用性。相较于Hadoop，Spark引入了内存分布式数据集，支持交互式查询和优化迭代工作负载，使其在处理特定任务时更为高效。《Spark: The Definitive Guide》由Bill Chambers和Matei Zaharia撰写，详细介绍了Spark的使用方法和核心概念，是学习Spark的重要参考书。" Apache Spark作为一个强大的大数据处理框架，其主要特点和优势包括： 1. **速度**：Spark通过内存计算（In-Memory Computing）显著提升了数据处理速度。它将数据存储在内存中，避免了Hadoop每次计算都要写入磁盘的IO开销，从而在迭代计算和实时分析等场景下展现出更高的性能。 2. **易用性**：Spark提供了丰富的API，包括Scala、Java、Python和R，使得开发人员可以方便地进行数据处理。此外，Spark Shell提供了交互式的环境，便于快速测试和调试代码。 3. **弹性**：Spark支持在不同的集群管理器上运行，如Hadoop YARN、Mesos或独立模式，具有良好的可扩展性和容错性。 4. **多模态处理**：Spark不仅仅用于批处理，还提供了Spark Streaming用于流处理，MLlib支持机器学习，GraphX用于图计算，Spark SQL用于结构化数据处理，形成了一个全面的数据处理生态系统。 5. **数据交互性**：Spark SQL允许用户通过SQL或者DataFrame API对数据进行操作，适合业务分析师和数据科学家进行数据分析。 6. **编程模型**：Spark的核心概念是RDD（Resilient Distributed Datasets），这是一种不可变、分区的记录集合，具有容错性和并行计算的能力。随着版本的发展，DataFrame和Dataset成为了更高级的抽象，提供了更高效的执行计划优化和更好的类型安全。 7. **Spark作业调度**：Spark的Job、Stage和Task模型确保了任务的并行执行和资源的有效利用。Stage是任务的边界，对应于一次shuffle操作，而Task是在Stage内部并行执行的工作单元。 8. **容错机制**：通过检查点和宽依赖关系的重新计算，Spark能够在节点故障时恢复计算，保证了系统的稳定性。《Spark: The Definitive Guide》这本书深入探讨了Spark的各个方面，包括核心组件的使用、高级特性、性能调优以及实际案例分析，对于理解Spark的原理和实践应用非常有帮助。通过阅读此书，读者可以系统地学习如何利用Spark处理大数据问题，提高数据分析的效率。

passes over the data, and in MapReduce, each pass had to be written as a separate MapReduce job, which had to be launched separately on the cluster and load the data from scratch.

To address this problem, the Spark team first designed an API based on functional programming that could succinctly express multistep applications. The team then implemented this API over a new engine that could perform efficient, in-memory data sharing across computation steps. The team also began testing this system with both Berkeley and external users.

The first version of Spark supported only batch applications, but soon enough another compelling use case became clear: interactive data science and ad hoc queries. By simply plugging the Scala interpreter into Spark, the project could provide a highly usable interactive system for running queries on hundreds of machines. The AMPlab also quickly built on this idea to develop Shark, an engine that could run SQL queries over Spark and enable interactive use by analysts as well as data scientists. Shark was first released in 2011.

After these initial releases, it quickly became clear that the most powerful additions to Spark would be new libraries, and so the project began to follow the “standard library” approach it has today. In particular, different AMPlab groups started MLlib, Spark Streaming, and GraphX. They also ensured that these APIs would be highly interoperable, enabling writing end-to-end big data applications in the same engine for the first time.

In 2013, the project had grown to widespread use, with more than 100 contributors from more than 30 organizations outside UC Berkeley. The AMPlab contributed Spark to the Apache Software Foundation as a long-term, vendor-independent home for the project. The early AMPlab team also launched a company, Databricks, to harden the project, joining the community of other companies and organizations contributing to Spark. Since that time, the Apache Spark community released Spark 1.0 in 2014 and Spark 2.0 in 2016, and continues to make regular releases, bringing new features into the project.

Finally, Spark’s core idea of composable APIs has also been refined over time. Early versions of Spark (before 1.0) largely defined this API in terms of functional operations: parallel operations such as maps and reduces over collections of Java objects. Beginning with 1.0, the project added Spark SQL, a new API for working with structured data, that is, tables with a fixed data format that is not tied to Java’s in-memory representation. Spark SQL enabled powerful new optimizations across libraries and APIs by understanding both the data format and the user code that runs on it in more detail. Over time, the project added a plethora of new APIs that build on this more powerful structured foundation, including DataFrames, machine learning pipelines, and Structured Streaming, a high-level, automatically optimized streaming API. In this book, we will spend a significant amount of time explaining these next-generation APIs, most of which are marked as production-ready.

The Present and Future of Spark

Spark has been around for a number of years but continues to gain in popularity and use cases. Many new projects within the Spark ecosystem continue to push the boundaries of what’s possible with the system. For example, a new high-level streaming engine, Structured Streaming, was introduced in 2016. This technology is a huge part of companies solving massive-scale data challenges, from technology companies like Uber and Netflix using Spark’s streaming and machine learning tools, to institutions like NASA, CERN, and the Broad Institute of MIT and Harvard applying Spark to scientific data analysis.

Spark will continue to be a cornerstone of companies doing big data analysis for the foreseeable future, especially given that the project is still developing quickly. Any data scientist or engineer who needs to solve big data problems probably needs a copy of Spark on their machine and, hopefully, a copy of this book on their bookshelf!

Running Spark

This book contains an abundance of Spark-related code, and it’s essential that you’re prepared to run it as you learn. For the most part, you’ll want to run the code interactively so that you can experiment with it. Let’s go over some of your options before we begin working with the coding parts of the book.

You can use Spark from Python, Java, Scala, R, or SQL. Spark itself is written in Scala and runs on the Java Virtual Machine (JVM), so to run Spark on either your laptop or a cluster, all you need is an installation of Java. If you want to use the Python API, you will also need a Python interpreter (version 2.7 or later). If you want to use R, you will need a version of R on your machine.
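
A quick way to see which of these prerequisites are present is to look for the corresponding commands on the PATH. The sketch below is a minimal illustration using only the Python standard library; the helper name `check_prerequisites` is hypothetical, and the command names `java`, `python3`, and `R` are assumptions that may differ on your system (for example, `python` on Windows).

```python
import shutil


def check_prerequisites(commands=("java", "python3", "R")):
    """Map each command name to its resolved path, or None if not on PATH."""
    return {name: shutil.which(name) for name in commands}


status = check_prerequisites()
for name, path in status.items():
    print(f"{name}: {path or 'not found on PATH'}")
```

Only Java is needed to run Spark itself; a missing Python or R entry matters only if you plan to use that language's API.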

There are two options we recommend for getting started with Spark: downloading and installing Apache Spark on your laptop, or running a web-based version in Databricks Community Edition, a free cloud environment for learning Spark that includes the code in this book. We explain both of those options next.

Downloading Spark Locally

If you want to download and run Spark locally, the first step is to make sure that you have Java installed on your machine (available as java), as well as a Python version if you would like to use Python. Next, visit the project’s official download page, select the package type of “Pre-built for Hadoop 2.7 and later,” and click “Direct Download.” This downloads a compressed TAR file, or tarball, that you will then need to extract. The majority of this book was written using Spark 2.2, so downloading version 2.2 or later should be a good starting point.
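
Extracting the tarball can be done with any archive tool. As one possible approach, the sketch below uses Python's standard `tarfile` module; the helper name `extract_tarball` is hypothetical, and the archive file name in the trailing comment is only an illustrative example, so substitute the name of the file you actually downloaded.

```python
import tarfile
from pathlib import Path


def extract_tarball(archive_path, dest_dir):
    """Extract a .tgz/.tar.gz archive into dest_dir; return top-level entry names."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:*") as archive:
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: refuse unsafe members such as absolute paths.
            archive.extractall(dest, filter="data")
        else:
            archive.extractall(dest)
    return sorted(p.name for p in dest.iterdir())


# Example call (file name is an assumption, adjust to your download):
# extract_tarball("spark-2.2.0-bin-hadoop2.7.tgz", "spark")
```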

Downloading Spark for a Hadoop cluster

Spark can run locally without any distributed storage system, such as Apache Hadoop. However, if you would like to connect the Spark version on your laptop to a Hadoop cluster, make sure you download the right Spark version for that Hadoop version, which can be chosen at http://spark.apache.org/downloads.html by selecting a different package type. We discuss how Spark runs on clusters and the Hadoop file system in later chapters, but at this point we

Chapter 2. A Gentle Introduction to Spark

Now that our history lesson on Apache Spark is completed, it’s time to begin using and applying it! This chapter presents a gentle introduction to Spark, in which we will walk through the core architecture of a cluster, Spark Application, and Spark’s structured APIs using DataFrames and SQL. Along the way we will touch on Spark’s core terminology and concepts so that you can begin using Spark right away. Let’s get started with some basic background information.

Spark’s Basic Architecture

Typically, when you think of a “computer,” you think about one machine sitting on your desk at home or at work. This machine works perfectly well for watching movies or working with spreadsheet software. However, as many users likely experience at some point, there are some things that your computer is not powerful enough to perform. One particularly challenging area is data processing. Single machines do not have enough power and resources to perform computations on huge amounts of information (or the user probably does not have the time to wait for the computation to finish). A cluster, or group, of computers pools the resources of many machines together, giving us the ability to use all the cumulative resources as if they were a single computer. Now, a group of machines alone is not powerful; you need a framework to coordinate work across them. Spark does just that, managing and coordinating the execution of tasks on data across a cluster of computers.

The cluster of machines that Spark will use to execute tasks is managed by a cluster manager like Spark’s standalone cluster manager, YARN, or Mesos. We then submit Spark Applications to these cluster managers, which will grant resources to our application so that we can complete our work.

Spark Applications

Spark Applications consist of a driver process and a set of executor processes. The driver process runs your main() function, sits on a node in the cluster, and is responsible for three things: maintaining information about the Spark Application; responding to a user’s program or input; and analyzing, distributing, and scheduling work across the executors (discussed momentarily). The driver process is absolutely essential: it’s the heart of a Spark Application and maintains all relevant information during the lifetime of the application.

The executors are responsible for actually carrying out the work that the driver assigns them. This means that each executor is responsible for only two things: executing code assigned to it by the driver, and reporting the state of the computation on that executor back to the driver node.
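
The division of labor between driver and executors can be mimicked in a few lines of plain Python. This is a toy sketch under stated assumptions, not Spark's implementation: threads from `concurrent.futures` stand in for executor processes, and the names `driver` and `run_task` as well as the shape of the report dictionary are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def run_task(task_id, partition):
    """Executor side: run the assigned code on one partition and report back."""
    return {"task_id": task_id, "status": "SUCCESS", "result": sum(partition)}


def driver(data, num_partitions=3, num_executors=2):
    """Driver side: split the work, schedule it on executors, track the reports."""
    # Round-robin split of the input into partitions (one task per partition).
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    with ThreadPoolExecutor(max_workers=num_executors) as executors:
        futures = [executors.submit(run_task, i, p) for i, p in enumerate(partitions)]
        reports = [f.result() for f in futures]
    # The driver holds the state of every task and combines the partial results.
    total = sum(r["result"] for r in reports)
    return total, reports


total, reports = driver(list(range(1, 11)))
print(total)  # 55
```

The driver never touches the data inside a partition itself; it only decides who works on what and gathers the reported state, which mirrors the two executor responsibilities listed above.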
