Spark 2.1 Beginner's Tutorial: Distributed Data Processing

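"Spark 2.1 for Beginners.pdf"

Apache Spark 2.1 is a popular open source big data processing framework designed for efficiency, speed, and fault tolerance. This introductory tutorial aims to help beginners master the basic concepts and usage of Spark 2.1, with particular emphasis on its use from Scala and Python. At the core of Spark is the Resilient Distributed Dataset (RDD), a data structure that can be stored and operated on across a cluster.

The Lambda architecture is a pattern for building big data processing systems. It consists of three main layers: a batch layer, a speed (real-time) layer, and a serving layer that answers queries over their results. With Spark, the Lambda architecture makes it convenient to combine batch processing with real-time stream processing, which suits recommender systems well. Recommender systems typically process large volumes of user behavior data and generate personalized recommendations from it in real time; Spark 2.1's efficient processing and its support for data streams make it well suited to building such systems.

The tutorial guides you through developing large-scale distributed data processing applications and covers the following key topics:

1. **Spark core concepts**: Understand Spark's RDD model, the foundation of all Spark operations. RDDs are immutable, support parallel operations, and can efficiently execute transformations and actions.
2. **Spark programming model**: Learn how to create and operate on RDDs with the Scala and Python APIs. The Scala API is closer to Spark's underlying implementation, while the Python API offers a more concise, easy-to-use syntax.
3. **Spark SQL and DataFrames**: Spark 2.1 provides DataFrames, which support both SQL queries and the DataFrame API, making structured data processing more convenient for data analysis and ETL tasks.
4. **Spark Streaming**: Learn how to use Spark Streaming for real-time data processing. It can handle continuous data streams from a variety of sources, such as network sockets or Kafka.
5. **Storage and scheduling**: Gain a deeper understanding of Spark's memory management, including how to configure caching and persistence and how to optimize job scheduling.
6. **Deployment and cluster management**: Learn how to deploy and manage Spark applications in local mode, on cluster managers such as YARN or Mesos, and in standalone mode.
7. **Implementing the Lambda architecture**: Learn by example how to build a Lambda architecture with Spark, using the batch layer to analyze historical data and the real-time layer to process new data, and combining the two into a complete recommender system.
8. **Performance tuning**: Learn how to improve the performance of Spark applications by adjusting parameters, partitioning strategies, and data encoding.
9. **Error handling and fault tolerance**: Understand how Spark handles node failures and data loss, and how to design fault-tolerant applications.

With this tutorial, readers can start from scratch, gradually master the basics of Spark 2.1, and become able to build and optimize distributed data processing applications, particularly recommender systems. It is a good starting point for data scientists, data engineers, and beginners who want to learn big data processing.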

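The combination of a batch layer and a real-time layer described above can be sketched in a few lines. This is a minimal, plain-Python illustration and not Spark code: it assumes a toy recommender that ranks items by view count, and every name in it (`build_view`, `merge_views`, `top_items`, and the sample events) is hypothetical. In a Spark application, the batch view would typically come from a batch job and the real-time view from a streaming job.

```python
from collections import Counter

# Hypothetical user-behavior events: (user_id, item_id) view records.
historical_events = [("u1", "book"), ("u2", "book"), ("u3", "lamp"), ("u1", "pen")]
recent_events = [("u4", "lamp"), ("u5", "lamp"), ("u2", "pen")]


def build_view(events):
    """Count views per item; stands in for a batch or streaming aggregation."""
    return Counter(item for _user, item in events)


def merge_views(batch_view, speed_view):
    """Query-time merge: add the counts from both views."""
    return batch_view + speed_view


def top_items(view, n):
    """Return the n most viewed items, breaking ties alphabetically."""
    ranked = sorted(view.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _count in ranked[:n]]


batch_view = build_view(historical_events)  # recomputed periodically over all history
speed_view = build_view(recent_events)      # covers events since the last batch run
merged = merge_views(batch_view, speed_view)
print(top_items(merged, 2))
```

Because the two views are merged at query time, recent events influence the ranking without waiting for the next batch recomputation.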
Preface


Who this book is for

If you are an application developer, data scientist, or big data solutions architect who is
interested in combining the data processing power of Spark with R, and consolidating data
processing, stream processing, machine learning, and graph processing into one unified and
highly interoperable framework with a uniform API using Scala or Python, this book is for
you.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds
of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions,
pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "It is a
good idea to customize this property spark.driver.memory to have a higher value."

A block of code is set as follows:

Python 3.5.0 (v3.5.0:374f501f4567, Sep 12 2015, 11:00:19)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin

Any command-line input or output is written as follows:

$ python
Python 3.5.0 (v3.5.0:374f501f4567, Sep 12 2015, 11:00:19)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

New terms and important words are shown in bold. Words that you see on the screen, for
example, in menus or dialog boxes, appear in the text like this: "The shortcuts in this book
are based on the Mac OS X 10.5+ scheme."

Once the file is downloaded, please make sure that you unzip or extract the folder using the
latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at
https://github.com/PacktPublishing/Apache-Spark-2-for-Beginners. We also have other code
bundles from our rich catalog of books and videos available at
https://github.com/PacktPublishing/. Check them out!
Downloading the color images of this book
We also provide you with a PDF file that has color images of the screenshots/diagrams used
in this book. The color images will help you better understand the changes in the output.
You can download this file from
http://www.packtpub.com/sites/default/files/downloads/ApacheSpark2forBeginners_ColorImages.pdf.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do
happen. If you find a mistake in one of our books (maybe a mistake in the text or the code),
we would be grateful if you could report this to us. By doing so, you can save other readers
from frustration and help us improve subsequent versions of this book. If you find any
errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting
your book, clicking on the Errata Submission Form link, and entering the details of your
errata. Once your errata are verified, your submission will be accepted and the errata will
be uploaded to our website or added to any list of existing errata under the Errata section of
that title.
To view the previously submitted errata, go to
https://www.packtpub.com/books/content/support and enter the name of the book in the
search field. The required information will appear under the Errata section.

Spark Fundamentals

Data is one of the most important assets of any organization. The scale at which data is
being collected and used in organizations is growing beyond imagination. The speed at
which data is being ingested, the variety of the data types in use, and the amount of data
that is being processed and stored are breaking all-time records every moment. It is very
common these days, even in small-scale organizations, for data to grow from gigabytes
to terabytes to petabytes. For the same reason, processing needs are also growing, and they
call for the capability to process data at rest as well as data on the move.

Take any organization: its success depends on the decisions made by its leaders, and
making sound decisions requires the backing of good data and the information generated
by processing that data. This poses a big challenge: how to process the data in a timely
and cost-effective manner so that the right decisions can be made. Data processing
techniques have evolved since the early days of computers. Countless data processing
products and frameworks have come onto the market and disappeared over the years. Most of
these products and frameworks were not general purpose in nature. Most organizations
relied on their own bespoke applications for their data processing needs, either in silos
or in conjunction with specific products.

Large-scale Internet applications, popularly known as Internet of Things (IoT)
applications, heralded the common need for open frameworks that can process huge
amounts of data of various types ingested at great speed. Large-scale websites, media
streaming applications, and the huge batch processing needs of organizations made the
need even more pressing. The open source community has also grown considerably along
with the Internet, delivering production-quality software supported by reputed software
companies. A huge number of companies started using open source software and deploying
it in their production environments.
