Hands-On Spark: A Comprehensive Guide to Batch and Stream Data Processing and Machine Learning


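Spark in Action, by Petar Zecevic and Marko Bonaci, is a hands-on guide to practical Apache Spark techniques. It aims to give readers the core theory and skills for processing batch and real-time data so that they can use Spark effectively in real projects. The book first introduces Spark's command-line interface (the spark-shell) through a series of introductory examples. Readers then learn to program against the Spark core API and to process structured data, mainly with Spark SQL. Spark SQL lets users run SQL queries over DataFrames; these queries are ultimately executed as operations on RDDs (resilient distributed datasets) to clean, transform, and analyze data.

Spark Streaming is one of the book's main topics. It supports many streaming sources, such as Kafka, Flume, Twitter, the Hadoop Distributed File System (HDFS), and ZeroMQ, and processes their data in near real time. Spark Streaming uses the DStream (discretized stream) model, which periodically produces RDDs, to analyze data as it arrives. It can also be combined with Spark's machine learning functionality, applying pretrained models to streaming data for real-time prediction and decision making.

Spark MLlib is Spark's machine learning library, used to build and deploy machine learning models. Spark ML is the DataFrame-based API within it. DataFrames represent the data there because they provide a structure that is easier to understand than raw RDDs and benefit from better performance and memory management. Spark ML models still run on the parallel computing capabilities of Spark Core.

Spark GraphX is the Spark component dedicated to graph data. It builds on Spark Core but provides its own API for graph processing. Its core abstraction is the property graph (`Graph`), backed by RDDs of vertices and edges (`VertexRDD` and `EdgeRDD`).

The file systems covered include common storage systems such as HDFS, GCS (Google Cloud Storage), and Amazon S3, all of which are tightly integrated with Spark's read and write operations. When a Spark application runs, the driver process plays the key role: it schedules and coordinates tasks. Core operations such as `parallelize`, `map`, and `reduceByKey` create and process RDDs, and understanding RDD lineage (the chain of dependencies between RDDs) and RDD lifecycle management is key to understanding Spark performance. Functions such as `map` and `flatMap` transform data, `reduce` and `fold` aggregate it, and a `ShuffledRDD` is the RDD that results when data is redistributed across partitions during a shuffle.

In short, Spark in Action is a comprehensive practical guide that runs from basic operations to an in-depth look at advanced features. Both beginners and experienced developers can use it to build their big data processing skills.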
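To make the operations named above more concrete, here is a minimal plain-Python sketch of what `parallelize`, `flatMap`, `map`, and `reduceByKey` compute in a word count. It is not Spark code and does not use the Spark API: the helper names (`parallelize`, `flat_map`, `map_elements`, `reduce_by_key`, `collect`), the round-robin partitioning, and the sample lines are all assumptions made for illustration, and real Spark distributes the partitions across executors instead of holding them in local lists.

```python
from functools import reduce


def parallelize(items, num_partitions):
    """Split a local list into partitions (round-robin is an assumption here)."""
    partitions = [[] for _ in range(num_partitions)]
    for i, item in enumerate(items):
        partitions[i % num_partitions].append(item)
    return partitions


def flat_map(partitions, fn):
    """Apply fn to every element and flatten the results, partition by partition."""
    return [[out for item in part for out in fn(item)] for part in partitions]


def map_elements(partitions, fn):
    """Apply fn to every element, keeping the partition layout unchanged."""
    return [[fn(item) for item in part] for part in partitions]


def reduce_by_key(partitions, fn, num_partitions):
    """Merge values per key: combine locally, then redistribute by key hash."""
    # Step 1: combine values inside each input partition (no data movement).
    combined = []
    for part in partitions:
        local = {}
        for key, value in part:
            local[key] = fn(local[key], value) if key in local else value
        combined.append(local)
    # Step 2: the "shuffle". Each key is sent to the output partition chosen
    # by its hash, so all values for one key end up in the same place.
    shuffled = [{} for _ in range(num_partitions)]
    for local in combined:
        for key, value in local.items():
            target = shuffled[hash(key) % num_partitions]
            target[key] = fn(target[key], value) if key in target else value
    return [list(d.items()) for d in shuffled]


def collect(partitions):
    """Bring all partitions back into one local list."""
    return [item for part in partitions for item in part]


# Hypothetical input lines, used only for this illustration.
lines = ["spark makes batch jobs simple", "spark streams data", "batch and stream"]
rdd = parallelize(lines, 2)
words = flat_map(rdd, str.split)
pairs = map_elements(words, lambda w: (w, 1))
counts = dict(collect(reduce_by_key(pairs, lambda a, b: a + b, 2)))
total = reduce(lambda a, b: a + b, counts.values())
print(counts["spark"], counts["batch"], total)
```

The two steps inside `reduce_by_key` mirror the reason a shuffle is needed at all: values for the same key can start out in different partitions, so they have to be brought together before the final merge.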

PREFACE

interacting with distributed file systems, various relational and no-SQL databases, real-time systems, and so on. Then there are the runtime aspects (installing, configuring, and running Spark), which are equally relevant.

We tried to do justice to these important topics and make this book a thorough but gentle guide to using Spark. We hope you’ll enjoy it.

ABOUT THIS BOOK

Spark can interact with many systems, some of which are covered in the book. To fully appreciate the content, knowledge of the following topics is preferable (but not required):

■ SQL and JDBC (chapter 5)

■ Amazon EC2 (chapter 11)

■ Hadoop (HDFS and YARN, chapters 5 and 12)

■ Basics of linear algebra, and the ability to understand mathematical formulas (chapters 7 and 8)

■ Kafka (chapter 6)

■ Mesos (chapter 12)

We’ve prepared a virtual machine to make it easy for you to run the examples in the book. In order to use it, your computer should meet the software and hardware prerequisites listed in chapter 1.

How this book is organized

This book has 14 chapters, organized in 4 parts. Part 1 introduces Apache Spark and its rich API. An understanding of this information is important for writing high-quality Spark programs and is an excellent foundation for the rest of the book:

■ Chapter 1 roughly describes Spark’s main features and compares them with Hadoop’s MapReduce and other tools from the Hadoop ecosystem. It also includes a description of the spark-in-action virtual machine, which you can use to run the examples in the book.

■ Chapter 2 further explores the virtual machine, teaches you how to use Spark’s command-line interface (the spark-shell), and uses several examples to explain resilient distributed datasets (RDDs): the central abstraction in Spark.

■ In chapter 3, you’ll learn how to set up Eclipse to write standalone Spark applications. Then you’ll write an application for analyzing GitHub logs and execute the application by submitting it to a Spark cluster.

■ Chapter 4 explores the Spark core API in more detail. Specifically, it shows how to work with key-value pairs and explains how data partitioning and shuffling work in Spark. It also teaches you how to group, sort, and join data, and how to use accumulators and broadcast variables.

In part 2, you’ll get to know other components that make up Spark, including Spark SQL, Spark Streaming, Spark MLlib, and Spark GraphX:

■ Chapter 5 introduces Spark SQL. You’ll learn how to create and use DataFrames, how to use SQL to query DataFrame data, and how to load data to and save it from external data sources. You’ll also learn about optimizations done by Spark’s SQL Catalyst optimization engine and about performance improvements introduced with the Tungsten project.

■ Spark Streaming, one of the more popular Spark family members, is introduced in chapter 6. You’ll learn about discretized streams, which periodically produce RDDs as a streaming application is running. You’ll also learn how to save computation state over time and how to use window operations. We’ll examine ways of connecting to Kafka and how to obtain good performance from your streaming jobs. We’ll also talk about structured streaming, a new concept included in Spark 2.0.

■ Chapters 7 and 8 are about machine learning, specifically about the Spark MLlib and Spark ML sections of the Spark API. You’ll learn about machine learning in general and about linear regression, logistic regression, decision trees, random forests, and k-means clustering. Along the way, you’ll scale and normalize features, use regularization, and train and evaluate machine learning models. We’ll explain API standardizations brought by Spark ML.

■ Chapter 9 explores how to build graphs with Spark’s GraphX API. You’ll transform and join graphs, use graph algorithms, and implement the A* search algorithm using the GraphX API.
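The chapter 6 summary above mentions discretized streams that periodically produce batches, state saved over time, and window operations. The following plain-Python sketch illustrates those three ideas only; it is not the Spark Streaming API. The function names (`discretize`, `windowed_counts`), the sample events, and the choice to measure the window in number of batches are assumptions made for this illustration.

```python
from collections import Counter


def discretize(events, batch_interval):
    """Group (timestamp, value) events into consecutive fixed-length batches."""
    if not events:
        return []
    last_batch = max(ts for ts, _ in events) // batch_interval
    batches = [[] for _ in range(last_batch + 1)]
    for ts, value in events:
        batches[ts // batch_interval].append(value)
    return batches


def windowed_counts(batches, window_length, slide_interval):
    """Count values over a sliding window; both sizes are in number of batches."""
    results = []
    for end in range(slide_interval, len(batches) + 1, slide_interval):
        start = max(0, end - window_length)
        window = [value for batch in batches[start:end] for value in batch]
        results.append(Counter(window))
    return results


# Hypothetical log events: (seconds since start, HTTP status code).
events = [(0, "200"), (1, "404"), (3, "200"), (4, "200"),
          (7, "500"), (8, "200"), (11, "404")]

# One batch per 5 seconds, analogous to one RDD per batch interval.
batches = discretize(events, batch_interval=5)

# Counts over the last two batches, recomputed after every batch.
windows = windowed_counts(batches, window_length=2, slide_interval=1)

# State kept across batches: a running total per status code.
running = Counter()
for batch in batches:
    running.update(batch)

for i, window in enumerate(windows):
    print(i, dict(window))
print(dict(running))
```

The window results forget old batches as the window slides, while the running total never does; that difference is the reason a streaming job treats windowed and stateful computations as separate operations.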

Using Spark isn’t just about writing and running Spark applications. It’s also about configuring Spark clusters and system resources to be used efficiently by applications. Part 3 explains the necessary concepts and configuration options for running Spark applications on Spark standalone, Hadoop YARN, and Mesos clusters:

■ Chapter 10 explores Spark runtime components, Spark cluster types, job and resource scheduling, configuring Spark, and the Spark web UI. These are concepts common to all cluster managers Spark can run on: the Spark standalone cluster, YARN, and Mesos. The two local modes are also explained in chapter 10.

■ You’ll learn about the Spark standalone cluster in chapter 11: its components, how to start it and run applications on it, and how to use its web UI. The Spark History server, which keeps details about previously run jobs, is also discussed. Finally, you’ll learn how to use Spark’s scripts to start up a Spark standalone cluster on Amazon EC2.

■ Chapter 12 goes through the specifics of setting up, configuring, and using YARN and Mesos clusters to run Spark applications.

Part 4 covers higher-level aspects of using Spark:

■ Chapter 13 brings it all together and explores a Spark streaming application for analyzing log files and displaying the results on a real-time dashboard. The application implemented in chapter 13 can be used as a basis for your own future applications.

■ Chapter 14 introduces H2O, a scalable and fast machine-learning framework with implementations of many machine-learning algorithms, most notably deep learning, which Spark lacks; and Sparkling Water, H2O’s package that enables you to start and use an H2O cluster from Spark. Through Sparkling Water, you can use Spark’s Core, SQL, Streaming, and GraphX components to ingest, prepare, and analyze data, and transfer it to H2O to be used in H2O’s deep-learning algorithms. You can then transfer the results back to Spark and use them in subsequent computations.
