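Building Scalable Machine Learning Applications with Spark

"Machine Learning with Spark" takes an in-depth look at building scalable machine learning applications with Apache Spark to power modern data-driven businesses. The author, Nick Pentreath, introduces the Spark programming model and its core components, including SparkContext and resilient distributed datasets (RDDs).

Spark is a fast, general-purpose, and scalable data processing framework. Its main strength is that it executes big data processing and machine learning algorithms efficiently. SparkContext is the main entry point of a Spark application: it connects to the Spark cluster and manages computation tasks. The RDD is Spark's basic data structure. It provides fault tolerance and parallel operations, so that data can be processed in a distributed fashion across the cluster.

On programming languages, Pentreath covers writing Spark programs in Scala, Java, and Python. Scala is Spark's preferred language because it is tightly integrated with the Spark API and offers strong functional programming features. Java programmers can build applications with Spark's Java API, although the syntax tends to be more verbose. For data scientists and Python developers, PySpark offers an intuitive interface to Spark's functionality, which greatly widens Spark's audience.

The book may cover the following key topics:

1. **Spark architecture**: Spark's master-worker architecture, including the Driver and Executor nodes and their roles in distributed computation.
2. **RDD operations**: the concepts of transformations and actions, and the use of operations such as map, filter, reduce, and count.
3. **Data loading and persistence**: how to load data from various sources (such as HDFS, Cassandra, or HBase) and persist RDDs to improve performance.
4. **Spark SQL**: Spark's SQL support for processing structured data, and the use of the DataFrame and Dataset APIs.
5. **The MLlib machine learning library**: an introduction to MLlib, including supervised and unsupervised learning algorithms such as linear regression, logistic regression, decision trees, random forests, and collaborative filtering.
6. **Graph computation**: graph analysis with GraphX for complex network data.
7. **Spark Streaming**: real-time stream processing, using DStreams for continuous data processing.
8. **MLlib pipelines and model evaluation**: building and tuning machine learning pipelines, and strategies for model validation and selection.
9. **Spark performance optimization**: memory management, the Tungsten execution engine, shuffle optimization, and other ways to improve Spark performance.
10. **Deploying Spark applications**: deploying and managing Spark applications in local mode and in cluster mode (such as YARN, Mesos, or Kubernetes).

With this book, readers will learn not only Spark's fundamentals but also the techniques and practices needed to build large-scale machine learning systems, and so put Spark to full use in a data-driven business environment.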

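The distinction between transformations and actions named in the topic list can be illustrated without a Spark installation. The following is a minimal sketch in plain Python, not Spark's API: `ToyRDD` and all of its methods are hypothetical stand-ins, written under the assumption that transformations only record work and actions trigger it.

```python
from functools import reduce as fold


class ToyRDD:
    """Hypothetical stand-in for an RDD: records transformations lazily."""

    def __init__(self, source, steps=()):
        self._source = source  # any re-iterable collection of records
        self._steps = steps    # pending transformations, applied in order

    # Transformations: return a new ToyRDD and do no work yet.
    def map(self, fn):
        return ToyRDD(self._source, self._steps + (("map", fn),))

    def filter(self, pred):
        return ToyRDD(self._source, self._steps + (("filter", pred),))

    def _compute(self):
        """Run the recorded pipeline over the source, one record at a time."""
        for record in self._source:
            keep = True
            for kind, fn in self._steps:
                if kind == "map":
                    record = fn(record)
                elif not fn(record):  # a "filter" step rejected the record
                    keep = False
                    break
            if keep:
                yield record

    # Actions: trigger the computation and return a plain value.
    def count(self):
        return sum(1 for _ in self._compute())

    def reduce(self, fn):
        return fold(fn, self._compute())

    def collect(self):
        return list(self._compute())


calls = []


def square(x):
    calls.append(x)  # side effect lets us observe when work happens
    return x * x


numbers = ToyRDD(range(1, 6))
evens_squared = numbers.map(square).filter(lambda x: x % 2 == 0)
work_before_action = len(calls)  # nothing has run yet
total = evens_squared.reduce(lambda a, b: a + b)
```

In this sketch `work_before_action` is 0, because `map` and `filter` only extend the recorded pipeline, while `reduce` runs it and gives `total == 20`. The toy recomputes the pipeline on every action, which is the cost that persisting a dataset is meant to avoid.
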
Preface

In recent years, the volume of data being collected, stored, and analyzed has exploded, in particular in relation to the activity on the Web and mobile devices, as well as data from the physical world collected via sensor networks. While previously large-scale data storage, processing, analysis, and modeling was the domain of the largest institutions such as Google, Yahoo!, Facebook, and Twitter, increasingly, many organizations are being faced with the challenge of how to handle a massive amount of data.

When faced with this quantity of data and the common requirement to utilize it in real time, human-powered systems quickly become infeasible. This has led to a rise in the so-called big data and machine learning systems that learn from this data to make automated decisions.

In answer to the challenge of dealing with ever larger-scale data without any prohibitive cost, new open source technologies emerged at companies such as Google, Yahoo!, Amazon, and Facebook, which aimed at making it easier to handle massive data volumes by distributing data storage and computation across a cluster of computers.

The most widespread of these is Apache Hadoop, which made it significantly easier and cheaper to both store large amounts of data (via the Hadoop Distributed File System, or HDFS) and run computations on this data (via Hadoop MapReduce, a framework to perform computation tasks in parallel across many nodes in a computer cluster).
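
The MapReduce model mentioned above can be sketched on a single machine. This is an illustration of the map, shuffle, and reduce phases in plain Python, not Hadoop's API; the function names and the input documents are made up for the example.

```python
from collections import defaultdict


def map_phase(document):
    """Emit one (word, 1) pair per word, as a mapper would."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Group values by key; a real framework does this across the network."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Combine each key's values into one result, as a reducer would."""
    return {key: sum(values) for key, values in groups.items()}


def word_count(documents):
    # Each document could be mapped on a different node; here we loop.
    mapped = [pair for doc in documents for pair in map_phase(doc)]
    return reduce_phase(shuffle_phase(mapped))


docs = ["big data big clusters", "data in clusters"]  # made-up input
counts = word_count(docs)
```

Because each call to `map_phase` depends only on its own document and each key is reduced independently, both phases can in principle be spread across many nodes, which is the property the framework exploits.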

www.it-ebooks.info

However, MapReduce has some important shortcomings, including high overheads to launch each job and reliance on storing intermediate data and results of the computation to disk, both of which make Hadoop relatively ill-suited for use cases of an iterative or low-latency nature. Apache Spark is a new framework for distributed computing that is designed from the ground up to be optimized for low-latency tasks and to store intermediate data and results in memory, thus addressing some of the major drawbacks of the Hadoop framework. Spark provides a clean, functional, and easy-to-understand API to write applications and is fully compatible with the Hadoop ecosystem.

Furthermore, Spark provides native APIs in Scala, Java, and Python. The Scala and Python APIs allow all the benefits of the Scala or Python language, respectively, to be used directly in Spark applications, including using the relevant interpreter for real-time, interactive exploration. Spark itself now provides a toolkit (called MLlib) of distributed machine learning and data mining models that is under heavy development and already contains high-quality, scalable, and efficient algorithms for many common machine learning tasks, some of which we will delve into in this book.

Applying machine learning techniques to massive datasets is challenging, primarily because most well-known machine learning algorithms are not designed for parallel architectures. In many cases, designing such algorithms is not an easy task. The nature of machine learning models is generally iterative, hence the strong appeal of Spark for this use case. While there are many competing frameworks for parallel computing, Spark is one of the few that combines speed, scalability, in-memory processing, and fault tolerance with ease of programming and a flexible, expressive, and powerful API design.
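
To make the point about iteration concrete, here is a small sketch of gradient descent in plain Python. It is an assumed illustration rather than code from the book or from MLlib: the function name, the learning rate, and the data points are all invented. What it shows is that every iteration makes a full pass over the same dataset.

```python
def gradient_descent(points, learning_rate=0.05, iterations=100):
    """Fit y = w * x by least squares; returns (w, number_of_data_passes)."""
    w = 0.0
    passes = 0
    n = len(points)
    for _ in range(iterations):
        # One full pass over the dataset: per-point gradients (a "map"),
        # then a sum (a "reduce"). An engine that keeps data on disk would
        # reread it on every iteration; an in-memory dataset avoids that.
        gradient = sum(2 * (w * x - y) * x for x, y in points) / n
        passes += 1
        w -= learning_rate * gradient
    return w, passes


points = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]  # made-up data on y = 2x
w, passes = gradient_descent(points)
```

With these inputs `w` converges to approximately 2.0 after 100 passes over the data. The number of passes, not the arithmetic per point, is what makes per-job launch overhead and disk round trips costly for this kind of workload.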

Throughout this book, we will focus on real-world applications of machine learning technology. While we may briefly delve into some theoretical aspects of machine learning algorithms, the book will generally take a practical, applied approach with a focus on using examples and code to illustrate how to effectively use the features of Spark and MLlib, as well as other well-known and freely available packages for machine learning and data analysis, to create a useful machine learning system.

What this book covers

Chapter 1, Getting Up and Running with Spark, shows how to install and set up a local development environment for the Spark framework as well as how to create a Spark cluster in the cloud using Amazon EC2. The Spark programming model and API will be introduced, and a simple Spark application will be created using each of Scala, Java, and Python.

Chapter 2, Designing a Machine Learning System, presents an example of a real-world use case for a machine learning system. We will design a high-level architecture for an intelligent system in Spark based on this illustrative use case.

Chapter 3, Obtaining, Processing, and Preparing Data with Spark, details how to go about obtaining data for use in a machine learning system, in particular from various freely and publicly available sources. We will learn how to process, clean, and transform the raw data into features that may be used in machine learning models, using available tools, libraries, and Spark's functionality.

Chapter 4, Building a Recommendation Engine with Spark, deals with creating a recommendation model based on the collaborative filtering approach. This model will be used to recommend items to a given user as well as create lists of items that are similar to a given item. Standard metrics to evaluate the performance of a recommendation model will be covered here.

Chapter 5, Building a Classication Model with Spark, details how to create a model

for binary classication as well as how to utilize standard performance-evaluation

metrics for classication tasks.

Chapter 6, Building a Regression Model with Spark, shows how to create a model for regression, extending the classification model created in Chapter 5, Building a Classification Model with Spark. Evaluation metrics for the performance of regression models will be detailed here.

Chapter 7, Building a Clustering Model with Spark, explores how to create a clustering model as well as how to use related evaluation methodologies. You will learn how to analyze and visualize the clusters generated.

Chapter 8, Dimensionality Reduction with Spark, takes us through methods to extract the underlying structure from and reduce the dimensionality of our data. You will learn some common dimensionality-reduction techniques and how to apply and analyze them, as well as how to use the resulting data representation as input to another machine learning model.

Chapter 9, Advanced Text Processing with Spark, introduces approaches to deal with large-scale text data, including techniques for feature extraction from text and dealing with the very high-dimensional features typical in text data.

Chapter 10, Real-time Machine Learning with Spark Streaming, provides an overview of Spark Streaming and how it fits in with the online and incremental learning approaches to apply machine learning on data streams.
