Building Scalable Machine Learning Applications with Spark

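Machine Learning with Spark is an English-language book by Nick Pentreath, published by Packt Publishing (Birmingham - Mumbai). It aims to help readers build scalable machine learning applications with Apache Spark in support of data-driven businesses. Spark is an open source big data processing framework known for fast, in-memory computation. The author explains in accessible terms how to use Spark's distributed computing to speed up machine learning algorithms, covering recommendation, classification, regression, clustering, dimensionality reduction, and text processing. Alongside the underlying concepts, the book provides many practical examples and code samples so that readers can apply the techniques in real projects. The book is copyright © 2015 Packt Publishing: no part of it may be reproduced, stored in a retrieval system, or transmitted in any form without the prior written permission of the publisher, except for brief quotations embedded in critical articles or reviews. Although the author and publisher have made every effort to ensure accuracy, the information is sold without warranty, either express or implied, and neither accepts liability for damages caused or alleged to be caused, directly or indirectly, by the book. First published in February 2015, it reflects the state of the technology at that time. It is a practical reference for Spark users and machine learning practitioners who want to combine the two to improve business analytics: readers learn the basics of Spark as well as how to optimize and deploy machine learning models in real settings.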

Preface

In recent years, the volume of data being collected, stored, and analyzed has exploded, in particular in relation to the activity on the Web and mobile devices, as well as data from the physical world collected via sensor networks. While previously large-scale data storage, processing, analysis, and modeling was the domain of the largest institutions such as Google, Yahoo!, Facebook, and Twitter, increasingly, many organizations are being faced with the challenge of how to handle a massive amount of data.

When faced with this quantity of data and the common requirement to utilize it in real time, human-powered systems quickly become infeasible. This has led to a rise in the so-called big data and machine learning systems that learn from this data to make automated decisions.

In answer to the challenge of dealing with ever larger-scale data without any prohibitive cost, new open source technologies emerged at companies such as Google, Yahoo!, Amazon, and Facebook, which aimed at making it easier to handle massive data volumes by distributing data storage and computation across a cluster of computers.

The most widespread of these is Apache Hadoop, which made it significantly easier and cheaper to both store large amounts of data (via the Hadoop Distributed File System, or HDFS) and run computations on this data (via Hadoop MapReduce, a framework to perform computation tasks in parallel across many nodes in a computer cluster).
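
To make the MapReduce idea concrete, here is a minimal single-process sketch in plain Python. It is not Hadoop code and uses no Hadoop API. The function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample input are assumptions made purely for illustration.

```python
from collections import defaultdict

def map_phase(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle_phase(pairs):
    """Shuffle step: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce step: combine the values for each key into a single count."""
    return {key: sum(values) for key, values in groups.items()}

def word_count(lines):
    # On a real cluster, map and reduce calls would run in parallel on many
    # nodes; here they run sequentially in one process for illustration only.
    pairs = [pair for line in lines for pair in map_phase(line)]
    return reduce_phase(shuffle_phase(pairs))

counts = word_count(["spark and hadoop", "hadoop stores data", "spark computes"])
print(counts)
```

Because each line is mapped independently and each key is reduced independently, both steps can in principle be spread across machines, which is the property the framework exploits.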

However, MapReduce has some important shortcomings, including high overheads to launch each job and reliance on storing intermediate data and results of the computation to disk, both of which make Hadoop relatively ill-suited for use cases of an iterative or low-latency nature. Apache Spark is a new framework for distributed computing that is designed from the ground up to be optimized for low-latency tasks and to store intermediate data and results in memory, thus addressing some of the major drawbacks of the Hadoop framework. Spark provides a clean, functional, and easy-to-understand API to write applications and is fully compatible with the Hadoop ecosystem.

Furthermore, Spark provides native APIs in Scala, Java, and Python. The Scala and Python APIs allow all the benefits of the Scala or Python language, respectively, to be used directly in Spark applications, including using the relevant interpreter for real-time, interactive exploration. Spark itself now provides a toolkit (called MLlib) of distributed machine learning and data mining models that is under heavy development and already contains high-quality, scalable, and efficient algorithms for many common machine learning tasks, some of which we will delve into in this book.

Applying machine learning techniques to massive datasets is challenging, primarily because most well-known machine learning algorithms are not designed for parallel architectures. In many cases, designing such algorithms is not an easy task. The nature of machine learning models is generally iterative, hence the strong appeal of Spark for this use case. While there are many competing frameworks for parallel computing, Spark is one of the few that combines speed, scalability, in-memory processing, and fault tolerance with ease of programming and a flexible, expressive, and powerful API design.
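
The iterative nature of model training can be illustrated with a small plain-Python sketch of batch gradient descent. It does not use Spark or MLlib. The function `fit_line`, its parameters, and the toy dataset are assumptions chosen for illustration. The point to notice is that every iteration makes a full pass over the same dataset, which is why keeping that data in memory between passes matters.

```python
def fit_line(points, learning_rate=0.05, iterations=1000):
    """Fit y = w * x + b by batch gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(points)
    for _ in range(iterations):
        # Each iteration reads the entire dataset again to compute gradients.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in points) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in points) / n
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b

# Toy data generated from y = 2x + 1 (an illustrative assumption).
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w, b = fit_line(data)
print(round(w, 3), round(b, 3))
```

With 1,000 iterations, a system that re-reads its input from disk on every pass would pay that cost 1,000 times, whereas an in-memory dataset is loaded once and reused.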

Throughout this book, we will focus on real-world applications of machine learning technology. While we may briefly delve into some theoretical aspects of machine learning algorithms, the book will generally take a practical, applied approach with a focus on using examples and code to illustrate how to effectively use the features of Spark and MLlib, as well as other well-known and freely available packages for machine learning and data analysis, to create a useful machine learning system.

What this book covers

Chapter 1, Getting Up and Running with Spark, shows how to install and set up a local development environment for the Spark framework as well as how to create a Spark cluster in the cloud using Amazon EC2. The Spark programming model and API will be introduced, and a simple Spark application will be created using each of Scala, Java, and Python.

Chapter 2, Designing a Machine Learning System, presents an example of a real-world use case for a machine learning system. We will design a high-level architecture for an intelligent system in Spark based on this illustrative use case.

Chapter 3, Obtaining, Processing, and Preparing Data with Spark, details how to go about obtaining data for use in a machine learning system, in particular from various freely and publicly available sources. We will learn how to process, clean, and transform the raw data into features that may be used in machine learning models, using available tools, libraries, and Spark's functionality.

Chapter 4, Building a Recommendation Engine with Spark, deals with creating a recommendation model based on the collaborative filtering approach. This model will be used to recommend items to a given user as well as create lists of items that are similar to a given item. Standard metrics to evaluate the performance of a recommendation model will be covered here.

Chapter 5, Building a Classication Model with Spark, details how to create a model

for binary classication as well as how to utilize standard performance-evaluation

metrics for classication tasks.

Chapter 6, Building a Regression Model with Spark, shows how to create a model for regression, extending the classification model created in Chapter 5, Building a Classification Model with Spark. Evaluation metrics for the performance of regression models will be detailed here.

Chapter 7, Building a Clustering Model with Spark, explores how to create a clustering model as well as how to use related evaluation methodologies. You will learn how to analyze and visualize the clusters generated.

Chapter 8, Dimensionality Reduction with Spark, takes us through methods to extract the underlying structure from and reduce the dimensionality of our data. You will learn some common dimensionality-reduction techniques and how to apply and analyze them, as well as how to use the resulting data representation as input to another machine learning model.

Chapter 9, Advanced Text Processing with Spark, introduces approaches to deal with large-scale text data, including techniques for feature extraction from text and dealing with the very high-dimensional features typical in text data.

Chapter 10, Real-time Machine Learning with Spark Streaming, provides an overview of Spark Streaming and how it fits in with the online and incremental learning approaches to apply machine learning on data streams.
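
As a rough illustration of the online and incremental learning mentioned for Chapter 10, the following plain-Python sketch updates a linear model one mini-batch at a time. It does not use Spark Streaming or MLlib, and it is not taken from the book. The names `sgd_update` and `simulated_stream`, the learning rate, and the synthetic target `y = 3*x1 - 2*x2` are all assumptions made for the example.

```python
import random

def sgd_update(weights, batch, learning_rate=0.1):
    """Apply one stochastic gradient step per (features, target) example."""
    for features, target in batch:
        prediction = sum(w * x for w, x in zip(weights, features))
        error = prediction - target
        weights = [w - learning_rate * error * x
                   for w, x in zip(weights, features)]
    return weights

def simulated_stream(num_batches, batch_size=5, seed=42):
    """Hypothetical stand-in for a data stream: batches from y = 3*x1 - 2*x2."""
    rng = random.Random(seed)
    for _ in range(num_batches):
        batch = []
        for _ in range(batch_size):
            x1, x2 = rng.uniform(-1, 1), rng.uniform(-1, 1)
            batch.append(((x1, x2), 3 * x1 - 2 * x2))
        yield batch

weights = [0.0, 0.0]
for batch in simulated_stream(num_batches=200):
    # The model is updated as each batch arrives; old batches are never revisited.
    weights = sgd_update(weights, batch)

print([round(w, 3) for w in weights])
```

Unlike batch training, this loop never holds the full dataset: the model state is the only thing carried from one batch to the next, which is what makes the approach suitable for unbounded streams.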
