Getting Started with Spark 2: Building Big Data Processing Applications with Scala and Python

Updated 2024-07-20. PDF, 19.99 MB.

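Apache Spark 2.0 is an open-source distributed computing framework designed for large-scale data processing, and it is well suited to building parallel applications in Scala and Python. Its core idea is in-memory computation: by keeping working data in memory across operations, Spark speeds up data processing and iterative workloads considerably compared with approaches that write intermediate results to disk. For beginners, a learning path for Spark 2.0 usually covers the following areas:

1. **Distributed computing fundamentals**: Spark's distributed architecture is its main strength. It spreads the workload across multiple nodes in a cluster so that data is processed in parallel. Understanding how Spark executes work, including task scheduling, data partitioning, and the concepts of stages and tasks, is essential.
2. **APIs and programming model**: Spark provides rich APIs such as DataFrames and Datasets, which simplify data processing and let developers work with data in a SQL-like way. The Scala and Python APIs offer different programming experiences: Scala favors a functional style, while Python favors readability. The typed Dataset API is available in Scala and Java but not in Python.
3. **Core components**: Spark Core is the foundation of Spark. It includes SparkContext (the connection to the cluster; in Spark 2.0, SparkSession is the unified entry point that wraps it), Resilient Distributed Datasets (RDDs, the basic data abstraction), and broadcast variables (read-only data shared efficiently with all nodes). Understanding how these components work helps in building efficient applications.
4. **Spark SQL and DataFrame/Dataset**: Spark SQL lets users run SQL queries that Spark executes in a distributed fashion. DataFrames and Datasets are higher-level abstractions than RDDs. Both predate Spark 2.0, which unified them into a single API (in Scala, a DataFrame is a Dataset of Row objects), and queries written against them benefit from Spark's query optimizer.
5. **Machine learning library**: Spark MLlib is an important part of Spark and provides a broad set of machine learning algorithms, such as classification, regression, clustering, and collaborative filtering. Learning to use MLlib helps developers build data analysis and prediction applications.
6. **Real-time stream processing**: Spark Streaming extends Spark to process live data streams. Learning how to process continuous streams and keep results up to date in near real time is an important area of Spark 2.0 applications.
7. **Performance tuning and fault recovery**: Knowing how to adjust Spark's configuration parameters, optimize job scheduling and data caching, and rely on Spark's fault-tolerance mechanisms when failures occur is key to improving application efficiency.

Learning Apache Spark 2.0 is a good starting point for developers who want to work in big data processing. By mastering its core concepts and techniques, beginners can build efficient, scalable data processing systems and contribute to real projects. Keep practicing and debugging on real workloads to verify both the correctness and the performance of your code.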
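The partition, stage, and task ideas from the list above can be illustrated without a cluster. The following is a minimal plain-Python sketch, not Spark code and not taken from the book: the function names (`split_into_partitions`, `count_words_in_partition`, `word_count`) are hypothetical, and a thread pool merely stands in for Spark executors. It splits the input into partitions, runs one independent task per partition, and then merges the partial results.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def split_into_partitions(records, num_partitions):
    """Deal records round-robin into num_partitions lists (hypothetical helper)."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def count_words_in_partition(lines):
    """The 'task': counts words in one partition, independently of the others."""
    counts = Counter()
    for line in lines:
        counts.update(line.lower().split())
    return counts


def word_count(lines, num_partitions=3):
    """Count words by running one task per partition, then merging the results."""
    partitions = split_into_partitions(lines, num_partitions)
    # First step: one task per partition, run in parallel.
    # Threads are an assumption here; they stand in for cluster executors.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = list(pool.map(count_words_in_partition, partitions))
    # Second step: merge the per-partition counts into a single result.
    return reduce(lambda a, b: a + b, partial_counts, Counter())


if __name__ == "__main__":
    sample_lines = [
        "spark processes data in parallel",
        "data is split into partitions",
        "each partition is handled by one task",
        "spark merges the partial results",
    ]
    print(word_count(sample_lines).most_common(3))
```

In Spark itself, the framework takes care of partitioning, scheduling, and merging, and it also handles concerns this sketch ignores, such as moving data between nodes and recovering from failed tasks.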

Preface


Who this book is for

If you are an application developer, data scientist, or big data solutions architect who is interested in combining the data processing power of Spark with R, and consolidating data processing, stream processing, machine learning, and graph processing into one unified and highly interoperable framework with a uniform API using Scala or Python, this book is for you.

Conventions

In this book, you will find a number of text styles that distinguish between different kinds of information. Here are some examples of these styles and an explanation of their meaning.

Code words in text, database table names, folder names, filenames, file extensions, pathnames, dummy URLs, user input, and Twitter handles are shown as follows: "It is a good idea to customize this property spark.driver.memory to have a higher value."

A block of code is set as follows:

Python 3.5.0 (v3.5.0:374f501f4567, Sep 12 2015, 11:00:19)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin

Any command-line input or output is written as follows:

$ python
Python 3.5.0 (v3.5.0:374f501f4567, Sep 12 2015, 11:00:19)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>>

New terms and important words are shown in bold. Words that you see on the screen, for example, in menus or dialog boxes, appear in the text like this: "The shortcuts in this book are based on the Mac OS X 10.5+ scheme."

Once the file is downloaded, please make sure that you unzip or extract the folder using the
latest version of:
WinRAR / 7-Zip for Windows
Zipeg / iZip / UnRarX for Mac
7-Zip / PeaZip for Linux
The code bundle for the book is also hosted on GitHub at https://github.com/PacktPublishing/Apache-Spark-2-for-Beginners. We also have other code bundles from our rich
catalog of books and videos available at https://github.com/PacktPublishing/. Check
them out!
Downloading the color images of this book
We also provide you with a PDF file that has color images of the screenshots/diagrams used
in this book. The color images will help you better understand the changes in the output.
You can download this file from http://www.packtpub.com/sites/default/files/downloads/ApacheSpark2forBeginners_ColorImages.pdf.
Errata
Although we have taken every care to ensure the accuracy of our content, mistakes do
happen. If you find a mistake in one of our books (maybe a mistake in the text or the code),
we would be grateful if you could report this to us. By doing so, you can save other readers
from frustration and help us improve subsequent versions of this book. If you find any
errata, please report them by visiting http://www.packtpub.com/submit-errata, selecting
your book, clicking on the Errata Submission Form link, and entering the details of your
errata. Once your errata are verified, your submission will be accepted and the errata will
be uploaded to our website or added to any list of existing errata under the Errata section of
that title.
To view the previously submitted errata, go to https://www.packtpub.com/books/content/support
and enter the name of the book in the search field. The required information will
appear under the Errata section.

Spark Fundamentals

Data is one of the most important assets of any organization. The scale at which data is being collected and used in organizations is growing beyond imagination. The speed at which data is being ingested, the variety of the data types in use, and the amount of data that is being processed and stored are breaking all-time records every moment. It is very common these days, even in small-scale organizations, for data to grow from gigabytes to terabytes to petabytes. As a result, processing needs are growing too, and they call for the capability to process data at rest as well as data in motion.

Take any organization: its success depends on the decisions made by its leaders, and making sound decisions requires the backing of good data and the information generated by processing that data. This poses a big challenge: how to process the data in a timely and cost-effective manner so that the right decisions can be made. Data processing techniques have evolved since the early days of computers. Countless data processing products and frameworks came into the market and disappeared over the years. Most of these products and frameworks were not general purpose in nature. Most organizations relied on their own bespoke applications for their data processing needs, either in silos or in conjunction with specific products.

Large-scale Internet applications, including Internet of Things (IoT) applications, heralded the common need for open frameworks that can process huge amounts of data of various types ingested at great speed. Large-scale websites, media streaming applications, and the huge batch processing needs of organizations made the need even more relevant. The open source community has also grown considerably along with the growth of the Internet, delivering production-quality software supported by reputed software companies. A huge number of companies started using open source software and deploying it in their production environments.
