Optimizing Apache Spark Performance: Pushing Big Data Computation to the Limit

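High Performance Spark, by Holden Karau and Rachel Warren, is a book about improving Apache Spark performance. It is aimed at software engineers, data engineers, developers, and system administrators who have already used Spark on medium-scale problems and want to optimize large-scale data processing. It covers how to make Spark jobs run faster, how to use Spark for exploratory data analysis in production, how to handle larger datasets, and how to shorten pipeline run times to reach insights sooner. The main topics include how Spark works, DataFrames, Datasets, Spark SQL, and joins.

The book examines the following key topics:

1. **How Spark works**: Understanding Spark's core architecture, including RDDs (resilient distributed datasets), the DAG (directed acyclic graph) execution model, and memory management, is the foundation of performance tuning. With these concepts, readers can identify bottlenecks and choose appropriate optimizations.
2. **DataFrames, Datasets, and Spark SQL**: DataFrames (introduced in Spark 1.3) and Datasets (introduced in Spark 1.6) are higher-level abstractions whose APIs were unified in Spark 2.0. Datasets add compile-time type safety, and both can run faster than equivalent RDD code because Spark SQL can optimize them. They bring SQL queries and the programmatic API together, so learning to use them well can noticeably speed up data processing.
3. **Joins**: Joins are common in big data processing, but used carelessly they can become a performance killer. The book explains the different join types (such as inner and outer joins) and execution strategies (such as broadcast joins), and how to choose a strategy based on the characteristics of the data and the needs of the job.
4. **Exploratory data analysis in production**: Exploratory analysis on large data raises the question of how to turn experimental code into a scalable production workflow. The book covers maintaining code quality, monitoring performance, handling errors and exceptions, and using Spark's interactive tools for data exploration.
5. **Handling large datasets**: As data grows, so do storage and compute requirements. The book discusses how to tune Spark jobs to handle larger datasets while keeping performance and stability; cluster setup and operations are explicitly out of scope, as the excerpt below notes.
6. **Optimizing Spark jobs**: Tuning configuration parameters, optimizing data serialization, reducing network transfer and disk I/O, and using Spark's caching can significantly speed up jobs. The book's worked examples and tips help readers apply these methods.
7. **Reducing pipeline run time**: Parallelization, pipeline design, and task scheduling can shorten the end-to-end data processing flow. The book shares how to design efficient flows so that insights arrive sooner.
8. **Best practices and case studies**: Beyond theory, the book's examples and best practices help readers apply the ideas to real problems they meet when processing data at scale.

High Performance Spark is a broad guide to Spark's performance characteristics, with practical advice for getting the most out of Spark in big data processing. Whether you want to speed up an existing project or prepare for larger data challenges, it is a useful reference.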

2 Although, as we explore in this book, the performance implications and evaluation semantics are quite different.

generally. The chapters in this book are written with enough context to allow the book to be used as a reference; however, the structure of this book is intentional, and reading the sections in order should give you not only a few scattered tips but a comprehensive understanding of Apache Spark and how to make it sing.

It’s equally important to point out what you will likely not get from this book. This book is not intended to be an introduction to Spark or Scala; several other books and video series are available to get you started. The authors may be a little biased in this regard, but we think “Learning Spark” by Karau, Konwinski, Wendell, and Zaharia as well as Paco Nathan’s Introduction to Apache Spark video series are excellent options for Spark beginners. While this book is focused on performance, it is not an operations book, so topics like setting up a cluster and multi-tenancy are not covered. We are assuming that you already have a way to use Spark in your system and won’t provide much assistance in making higher-level architecture decisions. There are future books in the works, by other authors, on the topic of Spark operations that may be done by the time you are reading this one. If operations are your show, or if there isn’t anyone responsible for operations in your organization, we hope those books can help you.

Why Scala?

In this book, we will focus on Spark’s Scala API and assume a working knowledge of Scala. Part of this decision is simply in the interest of time and space; we trust that readers wanting to use Spark in another language will be able to translate the concepts used in this book even though the examples are not presented in Java and Python. More importantly, it is the belief of the authors that “serious” performant Spark development is most easily achieved in Scala. To be clear, these reasons are very specific to using Spark with Scala; there are many more general arguments for (and against) Scala’s applications in other contexts.

To Be a Spark Expert You Have to Learn a Little Scala Anyway

Although Python and Java are more commonly used languages, learning Scala is a worthwhile investment for anyone interested in delving deep into Spark development. Spark’s documentation can be uneven. However, the readability of the codebase is world-class. Perhaps more than with other frameworks, a sophisticated understanding of the Spark codebase is integral to being an advanced Spark user. Because Spark is written in Scala, it will be difficult to interact with the Spark source code without the ability, at least, to read Scala code. Furthermore, the methods in the RDD class closely mimic those in the Scala collections API. RDD functions, such as map, filter, flatMap, reduce, and fold, have nearly identical specifications to their Scala equivalents. Fundamentally, Spark is a functional framework, relying heavily on concepts like immutability and lambda definition, so using the Spark API may be more intuitive with some knowledge of functional programming.

3 Of course, in performance, every rule has its exception. mapPartitions in Spark 1.6 and earlier in Java suffers some severe performance restrictions we discuss in ???.
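As a rough illustration of the point about the Scala collections API, the sketch below uses only plain Scala collections (no Spark) to exercise map, filter, flatMap, reduce, and fold. The object name CollectionsSketch and the sample strings are made up for this example. RDD methods of the same names accept similar function arguments, but they run on distributed data and their evaluation semantics differ.

```scala
// Plain Scala collections only; no Spark dependency is needed to run this.
// CollectionsSketch and its sample data are hypothetical, for illustration.
object CollectionsSketch {
  val lines: List[String] = List("spark is fast", "scala is concise", "")

  // flatMap: each input element may yield zero or more output elements.
  // filter: keep only the elements that satisfy the predicate.
  val words: List[String] =
    lines.flatMap(line => line.split(" ").toList).filter(_.nonEmpty)

  // map: transform each element independently.
  val lengths: List[Int] = words.map(_.length)

  // reduce: combine all elements with a binary function.
  val totalChars: Int = lengths.reduce(_ + _)

  // fold: like reduce, but starts from an explicit zero value.
  val longest: Int = lengths.fold(0)((a, b) => math.max(a, b))
}
```

Each step returns a new immutable collection rather than modifying the input, which is the same style of programming the RDD API encourages.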

The Spark Scala API is Easier to Use Than the Java API

Once you have learned Scala, you will quickly find that writing Spark in Scala is less painful than writing Spark in Java. First, writing Spark in Scala is significantly more concise than writing Spark in Java, since Spark relies heavily on inline function definitions and lambda expressions, which are much more naturally supported in Scala (especially before Java 8). Second, the Spark shell can be a powerful tool for debugging and development, and Spark does not provide an equivalent shell for Java.
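To make the conciseness argument a little more concrete, here is a small sketch written entirely in Scala. The verbose form spells out an anonymous class, which is roughly the shape a function argument had to take in Java before Java 8, while the concise form uses an inline function literal. The object name LambdaSketch and its members are invented for this illustration; this is not Spark code.

```scala
// Hypothetical names throughout; plain Scala, no Spark required.
object LambdaSketch {
  // Verbose form: an explicit anonymous class implementing Function1,
  // similar in spirit to pre-Java 8 anonymous inner classes.
  val addOneVerbose: Int => Int = new Function1[Int, Int] {
    def apply(x: Int): Int = x + 1
  }

  // Concise form: the same function as an inline literal.
  val addOneConcise: Int => Int = _ + 1

  // Both can be passed to a higher-order method such as map.
  val viaVerbose: List[Int] = List(1, 2, 3).map(addOneVerbose)
  val viaConcise: List[Int] = List(1, 2, 3).map(_ + 1)
}
```

The two forms behave identically; the difference is only how much ceremony surrounds the one line of logic.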

Scala is More Performant Than Python

It can be attractive to write Spark in Python, since it is easy to learn, quick to write, interpreted, and includes a very rich set of data science toolkits. However, Spark code written in Python is often slower than equivalent code written in the JVM, since Scala is statically typed, and the cost of JVM communication (from Python to Scala) can be very high. Last, Spark features are generally written in Scala first and then translated into Python, so to use cutting-edge Spark functionality, you will need to be in the JVM; Python support for MLlib and Spark Streaming is particularly behind.

Why Not Scala?

There are several good reasons to develop with Spark in other languages. One of the more important and constant reasons is developer/team preference. Existing code, both internal and in libraries, can also be a strong reason to use a different language. Python is one of the most supported languages today. While writing Java code can be clunky and sometimes lags slightly in terms of API, there is very little performance cost to writing in another JVM language (at most some object conversions).

While all of the examples in this book are presented in Scala for the final release, we will port many of the examples from Scala to Java and Python where the differences in implementation could be important. These will be available (over time) at our GitHub. If you find yourself wanting a specific example ported, please either e-mail us or create an issue on the GitHub repo.


Spark SQL does much to minimize the performance difference when using a non-JVM language. ??? looks at options to work effectively in Spark with languages outside of the JVM, including Spark’s supported languages of Python and R. This section also offers guidance on how to use Fortran, C, and GPU-specific code to reap additional performance improvements. Even if we are developing most of our Spark application in Scala, we shouldn’t feel tied to doing everything in Scala, because specialized libraries in other languages can be well worth the overhead of going outside the JVM.

Learning Scala

If, after all of this, we’ve convinced you to use Scala, there are several excellent options for learning it. The current version of Spark is written against Scala 2.10 and cross-compiled for 2.11 (with plans to switch to being written for 2.11 and cross-compiled against 2.10). Depending on how much we’ve convinced you to learn Scala, and what your resources are, there are a number of different options ranging from books to MOOCs to professional training.

For books, Programming Scala, 2nd Edition by Dean Wampler and Alex Payne can be great, although many of the actor system references are not relevant while working in Spark. The Scala language website also maintains a list of Scala books.

In addition to books, there are online courses for learning Scala. Functional Programming Principles in Scala, taught by Martin Odersky, Scala’s creator, is available on Coursera, and Introduction to Functional Programming is available on edX. A number of different companies also offer video-based Scala courses, none of which the authors have personally experienced or recommend.

For those who prefer a more interactive approach, professional training is offered by a number of different companies, including Typesafe. While we have not directly experienced Typesafe training, it receives positive reviews and is known especially to help bring a team or group of individuals up to speed with Scala for the purposes of working with Spark.

Conclusion

Although you will likely be able to get the most out of Spark performance if you have an understanding of Scala, working in Spark does not require a knowledge of Scala. For those whose problems are better suited to other languages or tools, techniques for working with other languages will be covered in ???. This book is aimed at individuals who already have a grasp of the basics of Spark, and we thank you for choosing High Performance Spark to deepen your knowledge of Spark. The next chapter will introduce some of Spark’s general design and its evaluation paradigm, which are important to understanding how to use Spark efficiently.


1 MapReduce is a programmatic paradigm that defines programs in terms of map procedures that filter and sort data onto the nodes of a distributed system, and reduce procedures that aggregate the data on the mapper nodes. Implementations of MapReduce have been written in many languages, but the term usually refers to a popular implementation called Hadoop MapReduce (http://hadoop.apache.org/), packaged with the distributed file system Apache Hadoop.

2 DryadLINQ is a Microsoft research project that puts the .NET Language Integrated Query (LINQ) on top of the Dryad distributed execution engine. Like Spark, the DryadLINQ API defines an object representing a distributed dataset and exposes functions to transform data as methods defined on the dataset object. DryadLINQ is lazily evaluated and its scheduler is similar to Spark’s; however, it doesn’t use in-memory storage. For more information, see the DryadLINQ documentation.

See the original Spark Paper.

CHAPTER 2

How Spark Works

This chapter introduces Spark’s place in the big data ecosystem and its overall design. Spark is often considered an alternative to Apache MapReduce, since Spark can also be used for distributed data processing with Hadoop. As we will discuss in this chapter, Spark’s design principles are quite different from MapReduce’s, and Spark does not need to be run in tandem with Apache Hadoop. Furthermore, while Spark has inherited parts of its API, design, and supported formats from existing systems, particularly DryadLINQ, Spark’s internals, especially how it handles failures, differ from many traditional systems.

Spark’s ability to leverage lazy evaluation within memory computations makes it unique. Spark’s creators believe it to be the first high-level programming language for fast, distributed data processing. Understanding the general design principles behind Spark will be useful for understanding the performance of Spark jobs.
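The idea of lazy evaluation can be sketched without Spark by using a plain Scala Iterator. This is only an analogy under stated assumptions, not a description of how Spark is implemented: building the pipeline plays the role of declaring transformations, and forcing a result plays the role of an action. The object LazySketch, its counter, and its method names are hypothetical.

```scala
// An analogy only: a plain Scala Iterator, not Spark's RDD machinery.
object LazySketch {
  // Counts how many elements the map step has actually processed.
  var evaluated: Int = 0

  // Building the pipeline does no work yet: Iterator.map and
  // Iterator.filter only describe what should happen later.
  def pipeline(): Iterator[Int] =
    Iterator
      .range(1, 1000001)
      .map { x =>
        evaluated += 1
        x * 2
      }
      .filter(_ % 3 == 0)

  // Forcing a result triggers evaluation, and only as far as is
  // needed to produce the three requested elements.
  def firstThree(): List[Int] = pipeline().take(3).toList
}
```

Calling LazySketch.pipeline() leaves the counter at zero, while LazySketch.firstThree() processes only a handful of the one million candidate elements. Deferring work in this way is what gives a system room to skip or combine steps before running them.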

To get the most out of Spark, it is important to understand some of the principles used to design Spark and, at a cursory level, how Spark programs are executed. In this chapter, we will provide a broad overview of Spark’s model of parallel computing and
