Unfortunately, this trend in hardware stopped around 2005: due to hard limits in heat dissipation,
hardware developers stopped making individual processors faster, and switched toward adding
more parallel CPU cores all running at the same speed. This change meant that suddenly
applications needed to be modified to add parallelism in order to run faster, which set the stage
for new programming models such as Apache Spark.
On top of that, the technologies for storing and collecting data did not slow down appreciably in
2005, when processor speeds did. The cost to store 1 TB of data continues to roughly halve every
14 months, meaning that it is very inexpensive for organizations of all sizes to
store large amounts of data. Moreover, many of the technologies for collecting data (sensors,
cameras, public datasets, etc.) continue to drop in cost and improve in quality. Camera
technology, for example, improves in resolution and drops in cost per pixel every year, to
the point where a 12-megapixel webcam costs only $3 to $4; this has made it inexpensive to
collect a wide range of visual data, whether from people filming video or automated sensors in
an industrial setting. Cameras themselves are also the key sensors in other data collection
devices, such as telescopes and even gene-sequencing machines, driving the cost of these
technologies down as well.
The end result is a world in which collecting data is extremely inexpensive—many organizations
today even consider it negligent not to log data of possible relevance to the business—but
processing it requires large, parallel computations, often on clusters of machines. Moreover, in
this new world, the software developed in the past 50 years cannot automatically scale up, and
neither can the traditional programming models for data processing applications, creating the
need for new ones. It is this world that Apache Spark was built for.
History of Spark
Apache Spark began at UC Berkeley in 2009 as the Spark research project, which was first
published the following year in a paper entitled “Spark: Cluster Computing with Working Sets”
by Matei Zaharia, Mosharaf Chowdhury, Michael Franklin, Scott Shenker, and Ion Stoica of the
UC Berkeley AMPlab. At the time, Hadoop MapReduce was the dominant parallel programming
engine for clusters, being the first open source system to tackle data-parallel processing on
clusters of thousands of nodes. The AMPlab had worked with multiple early MapReduce users to
understand the benefits and drawbacks of this new programming model, and was therefore able
to synthesize a list of problems across several use cases and begin designing more general
computing platforms. In addition, Zaharia had also worked with Hadoop users at UC Berkeley to
understand their needs for the platform—specifically, teams that were doing large-scale machine
learning using iterative algorithms that need to make multiple passes over the data.
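To make "multiple passes over the data" concrete, here is a minimal sketch, in spark-shell-style Scala, of the kind of iterative job those teams were running; the input file, the parsing, and the toy gradient-descent update are illustrative assumptions rather than code from the original project.

```scala
// Minimal sketch of an iterative workload; assumes the spark-shell, where
// `sc` (a SparkContext) is already defined. The file name and update rule
// are hypothetical.

// Read the dataset once and cache it in memory so that the passes below
// reuse the cached copy instead of rereading storage each time.
val points = sc.textFile("points.txt")
  .map { line =>
    val fields = line.split(",").map(_.toDouble)
    (fields(0), fields(1))          // (feature, label)
  }
  .cache()

// A toy gradient-descent loop: every iteration is another full pass
// over the same cached dataset.
var w = 0.0
for (_ <- 1 to 20) {
  val gradient = points.map { case (x, y) => x * (x * w - y) }.mean()
  w -= 0.1 * gradient
}

println(s"fitted weight: $w")
```

Without the cache() call, every one of the 20 passes would reload and reparse the input from storage, which is essentially the overhead such iterative workloads paid when expressed as a chain of MapReduce jobs.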
Across these conversations, two things were clear. First, cluster computing held tremendous
potential: at every organization that used MapReduce, brand new applications could be built
using the existing data, and many new groups began using the system beyond its initial use cases.
Second, however, the MapReduce engine made it both challenging and inefficient to build large
applications. For example, the typical machine learning algorithm might need to make 10 or 20