Apache Spark Optimization in Practice: A High-Performance Guide


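"High Performance Spark: a best-practices guide to optimizing and scaling Apache Spark, co-authored by Holden Karau and Rachel Warren. 375 pages; this is the complete edition rather than the early release."

Apache Spark is an open source computing framework for big data processing, known for its performance, ease of use, and support for many kinds of data processing workloads. High Performance Spark explains in detail how to use Spark effectively to scale out and tune performance. Key topics include:

1. **Memory management**: In-memory processing is one of Spark's core features. The book discusses how to manage memory effectively, including sizing memory regions appropriately, avoiding spills, and using Project Tungsten optimizations such as code generation and compact binary formats to reduce garbage collection (GC) overhead.
2. **RDD (Resilient Distributed Dataset) optimization**: The RDD is Spark's foundational data structure. The book explains how to persist RDDs to avoid recomputation and how narrow and wide dependencies affect task scheduling.
3. **DataFrame and Dataset**: DataFrames (introduced in Spark 1.3) and Datasets (introduced in Spark 1.6) were unified in Spark 2.0 and provide a higher-level abstraction, with Datasets adding compile-time type safety. The book shows how to use them for more efficient processing, including how the Catalyst query optimizer improves query performance.
4. **Parallelism and concurrency**: How to design parallel jobs that make full use of cluster resources, including scheduling topics such as dynamic resource allocation and stage boundaries, to avoid task starvation and wasted resources.
5. **Spark SQL and data sources**: Spark SQL lets users process data with SQL statements. The book covers how to optimize SQL queries and how to connect to data sources such as Hadoop HDFS, Cassandra, and Hive.
6. **Shuffle optimization**: The shuffle is the operation that redistributes data across the cluster, and it can become a performance bottleneck. The book offers ways to reduce shuffle reads and writes, improve partitioning strategies, and tune shuffle write and read behavior.
7. **Network transfer optimization**: Compressing data in transit, using Alluxio (formerly Tachyon) as a caching layer, and adjusting network parameters such as buffer sizes to reduce latency and bandwidth use.
8. **Fault recovery and fault tolerance**: How to configure fault-tolerance mechanisms such as checkpointing and RDD persistence strategies to make the system more robust.
9. **Cluster management and monitoring**: Setting up and managing Spark clusters on YARN, Mesos, or in Standalone mode, and monitoring performance and health with tools such as Ganglia, the Spark UI, and Prometheus.
10. **Performance tuning tools and strategies**: The book may cover performance analysis with tools such as Spark Profiler and GCViewer, and how to build a tuning strategy from experimental data.

With High Performance Spark, readers gain a deeper understanding of how Spark works and learn practical best practices for meeting the challenges of big data processing and running efficient, reliable systems.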

Finally, thank you to our respective employers for being understanding as we’ve worked on this book. Especially Lawrence Spracklen who insisted we mention him here :p.

xiv | Preface

… succeeds on the same system with terabytes of data. In the authors’ experience writing production Spark code, we have seen the same tasks, run on the same clusters, run 100× faster using some of the optimizations discussed in this book. In terms of data processing, time is money, and we hope this book pays for itself through a reduction in data infrastructure costs and developer hours.

Not all of these techniques are applicable to every use case. Especially because Spark is highly configurable and is exposed at a higher level than other computational frameworks of comparable power, we can reap tremendous benefits just by becoming more attuned to the shape and structure of our data. Some techniques can work well on certain data sizes or even certain key distributions, but not all. The simplest example of this can be how for many problems, using groupByKey in Spark can very easily cause the dreaded out-of-memory exceptions, but for data with few duplicates this operation can be just as quick as the alternatives that we will present. Learning to understand your particular use case and system and how Spark will interact with it is a must to solve the most complex data science problems with Spark.
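The groupByKey point can be made concrete with a small sketch. It assumes `spark-core` is on the classpath and runs in local mode. The toy data and the object name are hypothetical, and `reduceByKey` is used only as one commonly cited alternative; the excerpt does not say which alternatives the book goes on to present.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; not taken from the book.
object GroupByKeyVsReduceByKey {

  /** Sums values per key two ways and returns both results for comparison. */
  def compareSums(): (Map[String, Int], Map[String, Int]) = {
    // Local mode with two threads (an assumption for this sketch);
    // a real job would normally get its master from spark-submit.
    val conf = new SparkConf().setAppName("groupByKey-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // Toy data with many duplicates of a few keys.
      val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)))

      // groupByKey shuffles every value and gathers all values for a key
      // into one Iterable, which is where memory pressure can come from.
      val viaGroup = pairs.groupByKey().mapValues(_.sum).collect().toMap

      // reduceByKey combines values per key within each partition first,
      // so only partial sums cross the shuffle.
      val viaReduce = pairs.reduceByKey(_ + _).collect().toMap

      (viaGroup, viaReduce)
    } finally {
      sc.stop()
    }
  }

  def main(args: Array[String]): Unit = {
    val (viaGroup, viaReduce) = compareSums()
    println(s"groupByKey: $viaGroup, reduceByKey: $viaReduce")
  }
}
```

Both paths return the same sums. The difference is in how much data each one moves and holds in memory, which on data with few duplicate keys may be negligible, as the paragraph above notes.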

What You Can Expect to Get from This Book

Our hope is that this book will help you take your Spark queries and make them faster, able to handle larger data sizes, and use fewer resources. This book covers a broad range of tools and scenarios. You will likely pick up some techniques that might not apply to the problems you are working with, but that might apply to a problem in the future and may help shape your understanding of Spark more generally. The chapters in this book are written with enough context to allow the book to be used as a reference; however, the structure of this book is intentional and reading the sections in order should give you not only a few scattered tips, but a comprehensive understanding of Apache Spark and how to make it sing.

It’s equally important to point out what you will likely not get from this book. This book is not intended to be an introduction to Spark or Scala; several other books and video series are available to get you started. The authors may be a little biased in this regard, but we think Learning Spark by Karau, Konwinski, Wendell, and Zaharia as well as Paco Nathan’s introduction video series are excellent options for Spark beginners. While this book is focused on performance, it is not an operations book, so topics like setting up a cluster and multitenancy are not covered. We are assuming that you already have a way to use Spark in your system, so we won’t provide much assistance in making higher-level architecture decisions. There are future books in the works, by other authors, on the topic of Spark operations that may be done by the time you are reading this one. If operations are your show, or if there isn’t anyone responsible for operations in your organization, we hope those books can help you.

2 | Chapter 1: Introduction to High Performance Spark

2 MiMa is the Migration Manager for Scala and tries to catch binary incompatibilities between releases.

Spark Versions

Spark follows semantic versioning with the standard [MAJOR].[MINOR].[MAINTENANCE] with API stability for public nonexperimental nondeveloper APIs within minor and maintenance releases. Many of these experimental components are some of the more exciting from a performance standpoint, including Datasets, Spark SQL’s new structured, strongly-typed, data abstraction. Spark also tries for binary API compatibility between releases, using MiMa; so if you are using the stable API you generally should not need to recompile to run a job against a new version of Spark unless the major version has changed.
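As a rough illustration of the versioning rule just described, the sketch below models a [MAJOR].[MINOR].[MAINTENANCE] string in plain Scala. `SparkVersion`, `parse`, and `likelyNeedsRecompile` are hypothetical helpers invented for this example and are not part of Spark. The check encodes only the rule of thumb stated above, which applies to the stable API.

```scala
// Hypothetical helper, not part of Spark: models [MAJOR].[MINOR].[MAINTENANCE].
final case class SparkVersion(major: Int, minor: Int, maintenance: Int)

object SparkVersion {
  // Three dot-separated numbers, optionally followed by a suffix such as "-SNAPSHOT".
  private val Pattern = """(\d+)\.(\d+)\.(\d+).*""".r

  /** Parses strings such as "2.0.1"; returns None if the shape does not match. */
  def parse(s: String): Option[SparkVersion] = s match {
    case Pattern(mj, mn, mt) => Some(SparkVersion(mj.toInt, mn.toInt, mt.toInt))
    case _                   => None
  }

  /** Rule of thumb from the text: for stable APIs, recompilation is
    * generally only expected when the major version differs. */
  def likelyNeedsRecompile(builtAgainst: SparkVersion, runningOn: SparkVersion): Boolean =
    builtAgainst.major != runningOn.major
}
```

For example, a job built against 2.0.1 and run on 2.1.0 would not be flagged, while one built against 1.6.3 would be. Code that uses experimental or developer APIs falls outside this guarantee.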

This book was created using the Spark 2.0.1 APIs, but much of the code will work in earlier versions of Spark as well. In places where this is not the case we have attempted to call that out.

Why Scala?

In this book, we will focus on Spark’s Scala API and assume a working knowledge of Scala. Part of this decision is simply in the interest of time and space; we trust readers wanting to use Spark in another language will be able to translate the concepts used in this book without presenting the examples in Java and Python. More importantly, it is the belief of the authors that “serious” performant Spark development is most easily achieved in Scala.

To be clear, these reasons are very specific to using Spark with Scala; there are many more general arguments for (and against) Scala’s applications in other contexts.

To Be a Spark Expert You Have to Learn a Little Scala Anyway

Although Python and Java are more commonly used languages, learning Scala is a worthwhile investment for anyone interested in delving deep into Spark development. Spark’s documentation can be uneven. However, the readability of the codebase is world-class. Perhaps more than with other frameworks, the advantages of cultivating a sophisticated understanding of the Spark codebase is integral to the advanced Spark user. Because Spark is written in Scala, it will be difficult to interact with the Spark source code without the ability, at least, to read Scala code. Furthermore, the methods in the Resilient Distributed Datasets (RDD) class closely mimic those in the Scala collections API. RDD functions, such as map, filter, flatMap, reduce, and fold, have nearly identical specifications to their Scala equivalents. Fundamentally Spark is a functional framework, relying heavily on concepts like immutability and lambda definition, so using the Spark API may be more intuitive with some knowledge of functional programming.

3 Although, as we explore in this book, the performance implications and evaluation semantics are quite different.

4 Of course, in performance, every rule has its exception. mapPartitions in Spark 1.6 and earlier in Java suffers some severe performance restrictions that we discuss in “Iterator-to-Iterator Transformations with mapPartitions” on page 98.
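To show how closely the RDD functions named above track the Scala collections API, here is a minimal sketch that applies the same lambdas to a `List` and to an RDD. It assumes `spark-core` is on the classpath and runs in local mode. The object name and sample strings are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical example object; not taken from the book.
object CollectionsVsRdd {

  /** Returns (total from a Scala List, total from an RDD) for the same pipeline. */
  def run(): (Int, Int) = {
    val lines = List("spark is fast", "scala is concise")

    // Plain Scala collection: each step is evaluated eagerly in this JVM.
    val localTotal = lines
      .flatMap(_.split(" "))
      .filter(_.length > 2)
      .map(_.length)
      .fold(0)(_ + _)

    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      // RDD: the same lambdas, but flatMap/filter/map are lazy transformations;
      // no work is done until the fold action is called.
      val rddTotal = sc.parallelize(lines)
        .flatMap(_.split(" "))
        .filter(_.length > 2)
        .map(_.length)
        .fold(0)(_ + _)

      (localTotal, rddTotal)
    } finally {
      sc.stop()
    }
  }
}
```

The two pipelines read almost identically and produce the same total. The differences lie in when and where the work happens: the RDD version is evaluated lazily and may be distributed across partitions.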

The Spark Scala API Is Easier to Use Than the Java API

Once you have learned Scala, you will quickly find that writing Spark in Scala is less painful than writing Spark in Java. First, writing Spark in Scala is significantly more concise than writing Spark in Java since Spark relies heavily on inline function definitions and lambda expressions, which are much more naturally supported in Scala (especially before Java 8). Second, the Spark shell can be a powerful tool for debugging and development, and is only available in languages with existing REPLs (Scala, Python, and R).

Scala Is More Performant Than Python

It can be attractive to write Spark in Python, since it is easy to learn, quick to write, interpreted, and includes a very rich set of data science toolkits. However, Spark code written in Python is often slower than equivalent code written in the JVM, since Scala is statically typed, and the cost of JVM communication (from Python to Scala) can be very high. Last, Spark features are generally written in Scala first and then translated into Python, so to use cutting-edge Spark functionality, you will need to be in the JVM; Python support for MLlib and Spark Streaming are particularly behind.

Why Not Scala?

There are several good reasons to develop with Spark in other languages. One of the more important constant reasons is developer/team preference. Existing code, both internal and in libraries, can also be a strong reason to use a different language. Python is one of the most supported languages today. While writing Java code can be clunky and sometimes lag slightly in terms of API, there is very little performance cost to writing in another JVM language (at most some object conversions).
