Accelerating Big Data Processing: Spark 2 in Practice

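Updated 2024-07-20 · PDF, 44.05 MB

"Fast Data Processing with Spark 2", third edition, by Krishna Sankar, is a hands-on guide to building fast, smooth-running big data projects with Spark 2. It aims to help readers turn theory into practice and arrive at faster, more accurate data analysis.

Much of Spark 2's performance comes from two components that earlier releases introduced and Spark 2 extended: the Tungsten execution engine and the Catalyst query optimizer. Tungsten manages memory explicitly and stores data in a compact binary form, which reduces serialization and deserialization overhead; in Spark 2 it also generates code for whole query stages. Catalyst applies optimization rules to SQL queries: it rewrites query plans, eliminates redundant operations, and chooses better join strategies.

Spark 2 also unifies DataFrames with the Dataset API, which was first introduced as an experimental feature in Spark 1.6. Datasets combine the strong typing of RDDs (resilient distributed datasets) with the optimized execution of Spark SQL. They give Scala and Java developers a type-safe programming model with functional-style operations while still benefiting from Spark's internal optimizations, so code can be more concise and easier to maintain without giving up performance.

Spark SQL is the Spark 2 component for structured and semi-structured data. It supports data sources including Hive, Parquet, JSON, and JDBC, and lets users query data with SQL or with the DataFrame/Dataset API. It also integrates with Spark Streaming and MLlib, so stream processing and machine learning tasks share a consistent programming model.

Spark 2 is also described as stronger in fault tolerance and scalability, for example through dynamic resource allocation that adapts to changing workloads, which improves cluster utilization and reduces job wait times.

Through this book, readers can learn how to design and implement high-performance data processing pipelines and how to use the features and optimizations in Spark 2 to improve the speed and accuracy of their analysis. It is intended for both beginners and experienced developers.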

Preface

Code words in text, database table names, folder names, filenames, file extensions,
pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The
hallmark of a MapReduce system is this: map and reduce, the two primitives."

A block of code is set as follows:

<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
</dependency>

Any command-line input or output is written as follows:

./ec2/spark-ec2 -i ~/spark-keypair.pem launch myfirstsparkcluster --resume

New terms and important words are shown in bold. Words that you see on the screen, for
example, in menus or dialog boxes, appear in the text like this: "From Spark 2.0.0 onwards,
they have changed the packaging, so we have to
include spark-2.0.0/assembly/target/scala-2.11/jars in Add External Jars…."

Warnings or important notes appear in a box like this.

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this
book: what you liked or disliked. Reader feedback is important for us as it helps us develop
titles that you will really get the most out of. To send us general feedback, simply
e-mail feedback@packtpub.com, and mention the book's title in the subject of your
message. If there is a topic that you have expertise in and you are interested in either
writing or contributing to a book, see our author guide at www.packtpub.com/authors.

https://www.iteblog.com

Installing Spark and Setting Up Your Cluster
As you explore the latest version of Spark, an essential task is to read the
release notes, especially what has been changed and deprecated. For
2.0.0, the list is fairly long and is available at
https://spark.apache.org/releases/spark-release-2-0-0.html#removals-behavior-changes-and-deprecations.
For example, the notes cover where the EC2 scripts have moved to and the
removal of support for Hadoop 2.1 and earlier.
To compile the Spark source, you will need the appropriate version of Scala and the
matching JDK. The Spark source tarball includes the required Scala components. The
following discussion is for information only; there is no need to install Scala.
The Spark developers have done a good job of managing the dependencies. Refer to the
https://spark.apache.org/docs/latest/building-spark.html web page for the latest
information on this. The website states that:
“Building Spark using Maven requires Maven 3.3.9 or newer and Java 7+.”
Scala gets pulled down as a dependency by Maven (currently Scala 2.11.8). Scala does not
need to be installed separately; it is just a bundled dependency.
Just as a note, Spark 2.0.0 by default runs with Scala 2.11.8, but can be compiled to run with
Scala 2.10. I have just seen e-mails in the Spark users' group on this.
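The Maven requirement quoted above can be checked before starting a build. The following is a minimal sketch, not part of the book: the version_ge helper is a hypothetical name, it assumes a sort that supports the -V (version sort) option, and it assumes the first line of mvn -v has the usual "Apache Maven <version>" form.

```shell
#!/bin/sh
# Minimum Maven version quoted in the text.
required="3.3.9"

# version_ge A B: succeeds when dotted version A is at least B.
# Hypothetical helper; relies on `sort -V` (version sort) being available.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

if command -v mvn >/dev/null 2>&1; then
    # Assumes the first line of `mvn -v` reads "Apache Maven <version> ...".
    found="$(mvn -v 2>/dev/null | awk '/^Apache Maven/ {print $3; exit}')"
    if version_ge "$found" "$required"; then
        echo "Maven $found is new enough (need $required or newer)"
    else
        echo "Maven '$found' is too old or unrecognized (need $required or newer)"
    fi
else
    echo "mvn not found on PATH"
fi
```

The comparison uses version sort rather than plain string comparison so that a release such as 3.10.1 is correctly treated as newer than 3.3.9.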
This brings up another interesting point about the Spark community. The
two essential mailing lists are user@spark.apache.org and
dev@spark.apache.org. More details about the Spark community are
available at https://spark.apache.org/community.html.
Directory organization and convention
One handy convention is to download and install software in the /opt
directory and to keep a generic soft link to Spark that points to the current version. For
example, /opt/spark points to /opt/spark-2.0.0 with the following command, run from /opt:
sudo ln -f -s spark-2.0.0 spark
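The convention can be tried out safely before touching /opt. The sketch below is an illustration under stated assumptions: it works in a scratch directory created with mktemp instead of /opt (so no sudo is needed), and the two spark-2.0.x directories are empty stand-ins for unpacked releases. It also adds the -n option, which is not in the command above.

```shell
#!/bin/sh
set -eu

# Scratch directory standing in for /opt (an assumption, so no sudo is needed).
base="$(mktemp -d)"
cd "$base"

# Empty directories standing in for two unpacked Spark releases.
mkdir spark-2.0.0 spark-2.0.1

# Point the generic name at the current version.
ln -s -f -n spark-2.0.0 spark
echo "spark -> $(readlink spark)"

# Upgrade: repoint the same link at the newer release. Without -n, ln would
# follow the existing link and create the new link inside spark-2.0.0 instead
# of replacing spark itself.
ln -s -f -n spark-2.0.1 spark
echo "spark -> $(readlink spark)"
```

Scripts and environment variables can then refer to the generic path, and an upgrade only needs the link to be repointed.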
Downloading the example code
You can download the example code files for all of the Packt books you
have purchased from your account at http://www.packtpub.com. If you
purchased this book elsewhere, you can visit http://www.packtpub.com/support
and register to have the files e-mailed directly to you.