Introduction to Apache Spark 2: Big Data Processing and Machine Learning


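"Beginning Apache Spark 2" explains in detail how to develop big data applications with Apache Spark 2, including how to use it alongside Hadoop and cloud technologies. The book covers the fundamentals and practical use of Resilient Distributed Datasets (RDDs), Spark SQL, Structured Streaming, and the Spark machine learning library. The author, Hien Luu, guides readers toward a solid understanding of Spark's principles and capabilities.

Apache Spark 2 is a key tool for big data processing, known for its efficiency, ease of use, and flexibility. The book is organized around the following core topics:

1. **Resilient Distributed Datasets (RDD)**: The RDD is Spark's basic data abstraction: an immutable, partitioned collection of records that can be operated on in parallel across a cluster. RDDs are fault tolerant, so data can be recovered even when a node fails, which keeps computations reliable. With RDDs, developers can write efficient distributed data processing programs.
2. **Spark SQL**: Spark SQL is a Spark module that lets users query data with SQL or the DataFrame API. The DataFrame API offers a uniform way to work with structured and semi-structured data and integrates with many data sources (such as Hive, Parquet, and JSON). Spark SQL makes Spark easier to integrate with existing SQL systems and improves developer productivity.
3. **Structured Streaming**: Structured Streaming is the stream processing framework introduced in Spark 2. It extends the concepts of Spark SQL so that developers can process continuous data streams in much the same way as batch data. It provides strong fault tolerance and exactly-once consistency guarantees, which makes real-time analytics simpler and more reliable.
4. **Spark Machine Learning Library (MLlib)**: MLlib is Spark's machine learning library. It provides a range of algorithms and utilities, including classification, regression, clustering, and collaborative filtering. MLlib supports pipelines and model selection, which simplify machine learning workflows, and it exposes APIs in several languages.
5. **Spark and Hadoop**: Spark integrates closely with the Hadoop ecosystem, using HDFS for data storage and YARN for resource management and job scheduling. Developers can therefore apply Spark's fast processing to large datasets that already live in a Hadoop cluster.
6. **Cloud integration**: Spark also runs in cloud environments such as Amazon Web Services (AWS) and Microsoft Azure, so computing resources can be scaled to keep up with growing data volume and complexity.

The book covers both concepts and hands-on examples that help readers understand and apply these technologies. Beginners and experienced developers alike can use it to build high-performance, scalable big data solutions with Spark 2.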

When launching a Spark application, you can request how many Spark executors the application needs and how much memory and how many CPU cores each executor should have. Choosing an appropriate number of executors, amount of memory, and number of CPU cores requires some understanding of the amount of data to be processed, the complexity of the data processing logic, and the time within which the application should finish processing.
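As a rough illustration of the request described above, the sketch below assembles a `spark-submit` command line from three sizing inputs. The flags `--num-executors`, `--executor-memory`, and `--executor-cores` are standard `spark-submit` options (`--num-executors` is honored when running on YARN). The helper names, the application file `log_analyzer.py`, and the sample numbers are hypothetical and only serve the example.

```python
from typing import List


def build_submit_args(app: str, num_executors: int,
                      executor_memory_gb: int, executor_cores: int) -> List[str]:
    """Assemble a spark-submit argument list for a fixed executor request."""
    if min(num_executors, executor_memory_gb, executor_cores) < 1:
        raise ValueError("all sizing inputs must be positive")
    return [
        "spark-submit",
        "--num-executors", str(num_executors),
        "--executor-memory", f"{executor_memory_gb}g",
        "--executor-cores", str(executor_cores),
        app,  # hypothetical application file
    ]


def total_task_slots(num_executors: int, executor_cores: int) -> int:
    """Upper bound on tasks running at once, assuming one core per task."""
    return num_executors * executor_cores


# Hypothetical request: three executors, 4 GB and 2 cores each.
args = build_submit_args("log_analyzer.py", num_executors=3,
                         executor_memory_gb=4, executor_cores=2)
print(" ".join(args))
print("concurrent task slots:", total_task_slots(3, 2))
```

The slot count is a useful first check: if the data is split into far fewer partitions than there are slots, some of the requested cores will sit idle.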

Spark Unified Stack

Unlike its predecessors, Spark provides a unified data processing engine known as the Spark stack. Like other well-designed systems, this stack is built on a strong foundation called Spark Core, which provides the functionality needed to manage and run distributed applications, such as scheduling, coordination, and fault tolerance. In addition, it provides a powerful and generic programming abstraction for data processing called resilient distributed datasets (RDDs). On top of this foundation is a collection of components, each designed for a specific data processing workload, as shown in Figure 1-3. Spark SQL is for batch as well as interactive data processing. Spark Streaming is for real-time stream data processing. Spark GraphX is for graph processing. Spark MLlib is for machine learning. SparkR is for running machine learning tasks using the R shell.

Figure 1-2. A small Spark cluster with three executors

Chapter 1 Introduction to Apache Spark

The distributed computing infrastructure is responsible for the distribution, coordination, and scheduling of computing tasks across many machines in the cluster. This makes it possible to process a large volume of data in parallel, efficiently and quickly, on a large cluster of machines. Two other important responsibilities of the distributed computing infrastructure are handling computing task failures and efficiently moving data across machines, which is known as data shuffling. Advanced users of Spark need intimate knowledge of this infrastructure to design highly performant Spark applications.

The key programming abstraction in Spark is called RDD, and it is something every Spark developer should have some knowledge of, especially its APIs and main concepts. Technically, an RDD is an immutable and fault-tolerant collection of objects partitioned across a cluster that can be manipulated in parallel. Essentially, it provides a set of APIs that let Spark application developers perform large-scale data processing easily and efficiently, without worrying about where data resides on the cluster or dealing with machine failures. For example, say you have a 1.5 TB log file that resides on HDFS and you need to find out the number of lines containing the word Exception. You can create an RDD to represent all the log statements in that file, and Spark can partition them across the nodes in the cluster so that the filtering and counting logic executes in parallel, which speeds up the job.

The RDD APIs are exposed in multiple programming languages (Scala, Java, and Python), and they allow users to pass local functions to run on the cluster, which is something that is powerful and unique. RDDs will be covered in detail in Chapter 3.
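The log-file example can be sketched in plain Python to show the shape of the computation: split the records into partitions, run the same filter-and-count function on each partition independently, then combine the small partial results. This is a local stand-in that uses threads, not Spark itself. In PySpark the same job is usually written along the lines of `sc.textFile(path).filter(lambda line: "Exception" in line).count()`. The helper names and the sample log lines below are made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Sequence


def partition(records: Sequence[str], num_partitions: int) -> List[List[str]]:
    """Split records into roughly equal chunks, one per simulated node."""
    return [list(records[i::num_partitions]) for i in range(num_partitions)]


def count_matching(chunk: List[str], predicate: Callable[[str], bool]) -> int:
    """Work done independently on one partition: filter, then count."""
    return sum(1 for line in chunk if predicate(line))


def parallel_count(records: Sequence[str], predicate: Callable[[str], bool],
                   num_partitions: int = 3) -> int:
    """Apply a user-supplied function to every partition and merge the counts."""
    chunks = partition(records, num_partitions)
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partial_counts = pool.map(lambda c: count_matching(c, predicate), chunks)
        # Only the small per-partition counts are combined at the end.
        return sum(partial_counts)


def contains_exception(line: str) -> bool:
    return "Exception" in line


# Hypothetical log lines standing in for the 1.5 TB file.
log_lines = [
    "INFO  service started",
    "ERROR java.lang.NullPointerException at Foo.bar",
    "WARN  slow response",
    "ERROR IllegalStateException: bad state",
    "INFO  request served",
]
print(parallel_count(log_lines, contains_exception))
```

Note that `contains_exception` is an ordinary local function handed to the parallel machinery, which mirrors the "pass local functions to run on the cluster" idea in miniature.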

The rest of the components in the Spark stack are designed to run on top of Spark Core. Therefore, any improvements or optimizations done in Spark Core between versions of Spark will be automatically available to the other components.

Spark SQL

Spark SQL is a component built on top of Spark Core, and it is designed for structured data processing at scale. Its popularity has skyrocketed since its inception because it brings a new level of flexibility, ease of use, and performance.


Structured Query Language (SQL) has been the lingua franca for data processing because it is fairly easy for users to express their intent, and the execution engine then performs the necessary intelligent optimizations. Spark SQL brings that to the world of data processing at the petabyte level. Spark users can now issue SQL queries to perform data processing or use the high-level abstraction exposed through the DataFrame APIs. A DataFrame is effectively a distributed collection of data organized into named columns. This is not a novel idea; in fact, it was inspired by data frames in R and Python. An easier way to think about a DataFrame is that it is conceptually equivalent to a table in a relational database.
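The two ideas in this paragraph, data organized into named columns and a declarative query handed to an engine, can be sketched with Python's built-in `sqlite3` module as a local stand-in. This is not Spark SQL; in PySpark the corresponding calls would typically be `spark.createDataFrame(...)`, `DataFrame.createOrReplaceTempView(...)`, and `spark.sql(...)`. The table name `logs`, its columns, and its rows are made up for illustration.

```python
import sqlite3

# Hypothetical rows: (host, level, number of messages).
rows = [
    ("web-1", "ERROR", 3),
    ("web-1", "INFO", 40),
    ("web-2", "ERROR", 5),
    ("web-2", "INFO", 25),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE logs (host TEXT, level TEXT, n INTEGER)")
conn.executemany("INSERT INTO logs VALUES (?, ?, ?)", rows)

# State what is wanted; the engine decides how to scan, filter, and aggregate.
query = """
    SELECT host, SUM(n) AS errors
    FROM logs
    WHERE level = 'ERROR'
    GROUP BY host
    ORDER BY host
"""
errors_by_host = conn.execute(query).fetchall()
conn.close()

print(errors_by_host)
```

The query names columns rather than positions and says nothing about execution order, which is what leaves room for an optimizer to choose the plan.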

Behind the scenes, Spark SQL leverages the Catalyst optimizer to perform the kinds of optimizations that are commonly done in many analytical database engines.

Another feature Spark SQL provides is the ability to read data from and write data to various structured formats and storage systems, such as JavaScript Object Notation (JSON), comma-separated values (CSV) files, Parquet or ORC files, relational databases, Hive, and others. This feature makes Spark more versatile, because Spark SQL can be used as a data converter tool to easily convert data from one format to another.
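The "data converter" use can be sketched locally with Python's standard `json` and `csv` modules: read records in one format, treat them as rows with named columns, and write them out in another. In PySpark this is typically a read followed by a write, for example `spark.read.json(...)` and then `df.write.csv(...)` or `df.write.parquet(...)`. The function name and the sample records below are hypothetical.

```python
import csv
import io
import json

# Hypothetical input in JSON Lines form (one JSON object per line).
json_lines = """\
{"host": "web-1", "level": "ERROR", "n": 3}
{"host": "web-2", "level": "INFO", "n": 25}
"""


def json_lines_to_csv(text: str) -> str:
    """Convert JSON Lines text to CSV text with a header row."""
    records = [json.loads(line) for line in text.splitlines() if line.strip()]
    # Use the union of all keys as columns so sparse records still fit.
    fieldnames = sorted({key for record in records for key in record})
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=fieldnames, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return out.getvalue()


csv_text = json_lines_to_csv(json_lines)
print(csv_text)
```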

According to a 2016 Spark survey, Spark SQL was the fastest-growing component. This makes sense because Spark SQL enables a wider audience beyond big data engineers, such as data analysts or anyone who is familiar with SQL, to leverage the power of distributed data processing.

The motto for Spark SQL is to write less code, read less data, and let the optimizer do the hard work.

Spark Structured Streaming andStreaming

It has been said that “Data in motion has equal or greater value than historical data.” The ability to process data as it arrives is becoming a competitive advantage for many companies in highly competitive industries. The Spark Streaming module makes it possible to process real-time streaming data from various data sources in a high-throughput and fault-tolerant manner. Data can be ingested from sources such as Kafka, Flume, Kinesis, Twitter, HDFS, or TCP sockets.


The main abstraction in the first generation of the Spark Streaming processing engine is called discretized stream (DStream). It implements an incremental stream processing model by splitting the input data into small batches (based on a time interval), each of which is combined with the current processing state to produce new results. In other words, once the incoming data is split into small batches, each batch is treated as an RDD and replicated out onto the cluster so it can be processed accordingly.
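The micro-batch model can be sketched in plain Python: group incoming events by arrival-time interval, then fold each batch into the state carried over from earlier batches. This is a single-process illustration of the idea, not the DStream API. The function names, the event tuples, and the one-second interval are assumptions made for the example, and intervals with no events are simply skipped here.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[float, str]  # (arrival time in seconds, word)


def to_micro_batches(events: Iterable[Event], interval: float) -> List[List[str]]:
    """Group events into consecutive batches of `interval` seconds by arrival time."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    for arrival, word in events:
        buckets[int(arrival // interval)].append(word)
    return [buckets[index] for index in sorted(buckets)]


def running_counts(batches: List[List[str]]) -> List[Dict[str, int]]:
    """Fold each batch into the running state and emit a result per batch."""
    state: Counter = Counter()
    snapshots = []
    for batch in batches:
        state.update(batch)            # combine current state with the new batch
        snapshots.append(dict(state))  # result produced after this batch
    return snapshots


# Hypothetical stream of words with arrival times.
events = [(0.2, "a"), (0.9, "b"), (1.1, "a"), (2.5, "a"), (2.7, "c")]
batches = to_micro_batches(events, interval=1.0)
print(batches)
print(running_counts(batches))
```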

Stream processing sometimes involves joining with data at rest, and Spark makes it easy to do so. In other words, combining batch and interactive queries with streaming processing can be easily done in Spark because of the unified Spark stack.

A new scalable and fault-tolerant streaming processing engine called Structured Streaming was introduced in Spark 2.0, and it was built on top of the Spark SQL engine. This engine further simplifies the life of streaming application developers by letting them express a streaming computation the same way they would express a batch computation on static data. The engine automatically executes the streaming processing logic incrementally and continuously as new streaming data arrives. A new and important feature of Structured Streaming is the ability to process incoming streaming data based on event time, which is necessary for many of the new streaming processing use cases. Another unique feature of the Structured Streaming engine is the end-to-end, exactly-once guarantee, which makes a big data engineer's life much easier than before when saving data to a storage system such as a relational database or a NoSQL database.
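Event-time processing means results are grouped by when an event happened, not by when it arrived. The plain-Python sketch below counts events per tumbling event-time window even though the records arrive out of order. It illustrates the concept only and leaves out late-data handling and fault tolerance. In PySpark the comparable grouping is usually expressed with the `window` function from `pyspark.sql.functions`. The record layout, the function name, and the 10-second window are assumptions for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

Record = Tuple[int, str]      # (event time in seconds, key)
Window = Tuple[int, int]      # [start, end) in seconds


def count_by_event_window(records: Iterable[Record], window: int) -> Dict[Window, int]:
    """Count records per tumbling window, keyed by event time."""
    counts: Dict[Window, int] = defaultdict(int)
    for event_time, _key in records:
        start = (event_time // window) * window
        counts[(start, start + window)] += 1
    return dict(sorted(counts.items()))


# Hypothetical records listed in arrival order; event times are out of order.
records = [
    (12, "click"),
    (3, "click"),    # arrived second, but happened first
    (17, "click"),
    (8, "click"),    # arrives last, yet belongs to the [0, 10) window
]
windows = count_by_event_window(records, window=10)
print(windows)
```

Grouping by arrival order instead would have put the record with event time 8 in the same batch as 17, which is the kind of mismatch event-time processing avoids.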

As this new engine matures, undoubtedly it will enable a new class of streaming processing applications that are easy to develop and maintain.

According to Reynold Xin, Databricks’ chief architect, the simplest way to perform streaming analytics is not having to reason about streaming.

Spark MLlib

In addition to providing more than 50 common machine learning algorithms, the Spark MLlib library provides abstractions for managing and simplifying many model-building tasks, such as featurization, pipelines for constructing, evaluating, and tuning models, and persistence of models to help move them from development to production.
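The pipeline and persistence abstractions can be sketched in plain Python: a featurization stage transforms raw input, an estimator stage learns a model from the features, and the fitted result can be saved and restored. MLlib organizes its pipelines around transformers and estimators in a similar spirit, but every class below is a hypothetical toy written for this example, not the MLlib API, and the training data is made up.

```python
import json
from typing import List, Sequence


class WordCountFeaturizer:
    """Transformer stage: maps each text to one numeric feature (its word count)."""

    def transform(self, texts: Sequence[str]) -> List[float]:
        return [float(len(text.split())) for text in texts]


class ThresholdModel:
    """Fitted model: predicts 1 when the feature exceeds the learned threshold."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold

    def predict(self, features: Sequence[float]) -> List[int]:
        return [1 if value > self.threshold else 0 for value in features]


class ThresholdEstimator:
    """Estimator stage: learns a threshold halfway between the two class means."""

    def fit(self, features: Sequence[float], labels: Sequence[int]) -> ThresholdModel:
        positives = [x for x, y in zip(features, labels) if y == 1]
        negatives = [x for x, y in zip(features, labels) if y == 0]
        midpoint = (sum(positives) / len(positives)
                    + sum(negatives) / len(negatives)) / 2
        return ThresholdModel(midpoint)


class FittedPipeline:
    """Featurizer plus fitted model, with JSON persistence of the learned state."""

    def __init__(self, featurizer: WordCountFeaturizer, model: ThresholdModel) -> None:
        self.featurizer = featurizer
        self.model = model

    def predict(self, texts: Sequence[str]) -> List[int]:
        return self.model.predict(self.featurizer.transform(texts))

    def to_json(self) -> str:
        return json.dumps({"threshold": self.model.threshold})

    @classmethod
    def from_json(cls, payload: str) -> "FittedPipeline":
        return cls(WordCountFeaturizer(), ThresholdModel(json.loads(payload)["threshold"]))


def fit_pipeline(texts: Sequence[str], labels: Sequence[int]) -> FittedPipeline:
    """Run the stages in order: featurize, then fit the estimator."""
    featurizer = WordCountFeaturizer()
    model = ThresholdEstimator().fit(featurizer.transform(texts), labels)
    return FittedPipeline(featurizer, model)


# Hypothetical training data: label 1 marks long messages.
train_texts = ["ok", "fine thanks",
               "this message is rather long indeed",
               "another quite long message here now"]
train_labels = [0, 0, 1, 1]

fitted = fit_pipeline(train_texts, train_labels)
new_texts = ["short one", "a much longer piece of text here"]
predictions = fitted.predict(new_texts)

# Persist the learned state and restore it, as when moving a model to production.
restored = FittedPipeline.from_json(fitted.to_json())
print(predictions, restored.predict(new_texts))
```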

