Fast Data Processing with Spark 2, Third Edition: A Practical Guide


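Fast Data Processing with Spark 2, Third Edition (2016), written by Krishna Sankar and published by Packt Publishing, is a reference for the big data field. It focuses on using Apache Spark to process and analyze data efficiently in large-scale, high-performance projects. Spark is an open-source big data processing framework known for its performance in real-time and distributed computing. The book introduces the key principles and practical techniques of Spark 2 in an accessible way, so that readers can use Spark's parallel processing to speed up big data projects and to clean, transform, and analyze data in real time. Because Spark supports SQL queries, machine learning, stream processing, and other data processing tasks, the book is useful to big data engineers, data analysts, and anyone interested in Spark.

The book covers Spark's core components in detail: the RDD (Resilient Distributed Dataset) model, DataFrames and Datasets, Spark SQL, Spark Streaming, and Spark MLlib. It also discusses how to optimize the performance of Spark applications, including cluster configuration, caching strategies, and fault recovery mechanisms.

For copyright reasons, no part of the book may be reproduced, stored, or transmitted by any means without the prior written permission of Packt Publishing. Although the author and publisher have made every effort to ensure the accuracy of the information, everything in the book is provided without warranty of any kind, express or implied. Readers should use their own judgment; the author and publisher accept no liability for any loss or damage caused by using the book's content.

The book suits beginners who want to learn Spark systematically as well as experienced developers who want a reference manual for improving their skills and efficiency in big data processing. For professionals who want to keep up with big data trends and speed up their data analysis, it is a practical guide to Spark 2.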

Preface


Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "The hallmark of a MapReduce system is this: map and reduce, the two primitives."

A block of code is set as follows:

<dependency>
  <groupId>junit</groupId>
  <artifactId>junit</artifactId>
</dependency>

Any command-line input or output is written as follows:

./ec2/spark-ec2 -i ~/spark-keypair.pem launch myfirstsparkcluster --resume

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "From Spark 2.0.0 onwards, they have changed the packaging, so we have to include spark-2.0.0/assembly/target/scala-2.11/jars in Add External Jars…."

Warnings or important notes appear in a box like this.

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this book: what you liked or disliked. Reader feedback is important for us as it helps us develop titles that you will really get the most out of. To send us general feedback, simply e-mail feedback@packtpub.com, and mention the book's title in the subject of your message. If there is a topic that you have expertise in and you are interested in either writing or contributing to a book, see our author guide at www.packtpub.com/authors.

https://www.iteblog.com

Installing Spark and Setting Up Your Cluster
As you explore the latest version of Spark, an essential task is to read the
release notes, especially what has been changed and deprecated. For
2.0.0, the list is fairly long and is available at
https://spark.apache.org/releases/spark-release-2-0-0.html#removals-behavior-changes-and-deprecations.
For example, the notes cover where the EC2 scripts have moved to and the
removal of support for Hadoop 2.1 and earlier.
To compile the Spark source, you will need the appropriate version of Scala and the
matching JDK. The Spark source tar archive includes the required Scala components. The
following discussion is for information only; there is no need to install Scala.
The Spark developers have done a good job of managing the dependencies. Refer to the
https://spark.apache.org/docs/latest/building-spark.html web page for the latest
information on this. The website states that:
“Building Spark using Maven requires Maven 3.3.9 or newer and Java 7+.”
Scala gets pulled down as a dependency by Maven (currently Scala 2.11.8). Scala does not
need to be installed separately; it is just a bundled dependency.
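The Maven requirement quoted above can be checked before starting a build. The following is a minimal sketch, not part of the book: the helper name `version_ge` and the variable `required_maven` are made up for illustration, and the comparison assumes a `sort` that supports `-V` (as GNU coreutils does).

```shell
#!/bin/sh
# version_ge A B: succeeds when dotted version A is >= B.
# Hypothetical helper; assumes "sort -V" (version sort) is available.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

required_maven="3.3.9"   # minimum taken from the quoted build requirement

if command -v mvn >/dev/null 2>&1; then
  # "mvn -v" prints a first line such as "Apache Maven 3.3.9 (...)";
  # the third field of that line is the version number.
  found="$(mvn -v 2>/dev/null | awk '/^Apache Maven/ {print $3; exit}')"
  if version_ge "$found" "$required_maven"; then
    echo "Maven $found satisfies the $required_maven minimum"
  else
    echo "Maven '$found' is older than $required_maven"
  fi
else
  echo "mvn is not on PATH"
fi
```

If no suitable Maven is installed, the `build/mvn` wrapper script shipped in the Spark source tree is an alternative worth checking in the building-spark page.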
Just as a note, Spark 2.0.0 by default runs with Scala 2.11.8, but can be compiled to run with
Scala 2.10. I have just seen e-mails in the Spark users' group on this.
This brings up another interesting point about the Spark community. The
two essential mailing lists are user@spark.apache.org and
dev@spark.apache.org. More details about the Spark community are
available at https://spark.apache.org/community.html.
Directory organization and convention
One convention that would be handy is to download and install software in the /opt
directory. Also, have a generic soft link to Spark that points to the current version. For
example, /opt/spark points to /opt/spark-2.0.0 with the following command:
sudo ln -f -s spark-2.0.0 spark
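The soft-link convention can be tried safely in a scratch directory first. This is a minimal sketch under stated assumptions: `demo_root` is a hypothetical temporary directory standing in for /opt (so no sudo is needed), and `spark-2.0.1` is just a placeholder for whatever later version gets installed.

```shell
#!/bin/sh
set -e

# Hypothetical sandbox standing in for /opt.
demo_root="$(mktemp -d)"
mkdir "$demo_root/spark-2.0.0" "$demo_root/spark-2.0.1"
cd "$demo_root"

# First install: create the generic link with a relative target.
ln -f -s spark-2.0.0 spark
echo "spark -> $(readlink spark)"

# Upgrade: repoint the generic link at the newer directory.
# -n stops ln from following the existing link into spark-2.0.0.
ln -f -s -n spark-2.0.1 spark
echo "spark -> $(readlink spark)"
```

The `-n` flag matters only when the link already exists: without it, `ln` treats the existing link as a directory and creates the new link inside the old version's directory instead of replacing the generic one.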
Downloading the example code
You can download the example code files for all of the Packt books you
have purchased from your account at http://www.packtpub.com. If you
purchased this book elsewhere, you can visit http://www.packtpub.com/support
and register to have the files e-mailed directly to you.