to be particularly diverse for distributed architectures, such as clusters of commodity
PCs.
The wide range of platforms and frameworks for parallel and distributed comput-
ing presents both opportunities and challenges for machine learning scientists and
engineers. Fully exploiting the available hardware resources requires adapting some
algorithms and redesigning others to enable their concurrent execution. For any prediction model and learning algorithm, its structure, dataflow, and underlying task decomposition must be taken into account when determining whether a particular infrastructure is suitable.
The chapters that make up this volume form a representative set of state-of-the-art solutions
that span the space of modern parallel computing platforms and frameworks for a
variety of machine learning algorithms, tasks, and applications. Although it is infeasible
to cover every existing approach for every platform, we believe that the presented
set of techniques covers most commonly used methods, including the popular “top
performers” (e.g., boosted decision trees and support vector machines) and common
“baselines” (e.g., k-means clustering).
Because most chapters focus on a single choice of platform and/or framework, the
rest of this introduction provides the reader with unifying context: a brief overview
of machine learning basics and fundamental concepts in parallel and distributed com-
puting, a summary of typical task and application scenarios that require scaling up
learning, and thoughts on evaluating algorithm performance and platform trade-offs.
Following these are an overview of the chapters and bibliographic notes.
1.1 Machine Learning Basics
Machine learning focuses on constructing algorithms for making predictions from
data. A machine learning task aims to identify (to learn) a function f : X → Y that maps the input domain X (of data) onto the output domain Y (of possible predictions). The function f is selected from a certain function class, which is different for each family of learning algorithms. Elements of X and Y are application-specific representations of data objects and predictions, respectively.
Two canonical machine learning settings are supervised learning and unsupervised
learning. Supervised learning algorithms utilize training data to construct a prediction
function f , which is subsequently applied to test instances. Typically, training data is
provided in the form of labeled examples (x, y) ∈ X × Y, where x is a data instance and y is the corresponding ground truth prediction for x.
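As a concrete sketch of this setting, the snippet below constructs a prediction function f from labeled examples (x, y) and applies it to unseen instances. It uses a 1-nearest-neighbor rule purely for illustration; the data points, labels, and function names are invented for this example and are not drawn from the text.

```python
# Supervised learning sketch: build f: X -> Y from labeled examples (x, y),
# then apply f to test instances. Here f returns the label of the closest
# training instance under squared Euclidean distance (1-nearest-neighbor).

def train_1nn(examples):
    """Return a prediction function f mapping an input x to the label
    of its nearest training instance."""
    def f(x):
        def dist(pair):
            xi, _ = pair
            return sum((a - b) ** 2 for a, b in zip(xi, x))
        _, y = min(examples, key=dist)
        return y
    return f

# Labeled training examples (x, y) in X x Y; here X = R^2, Y = {"red", "blue"}.
training_data = [((0.0, 0.0), "red"), ((0.1, 0.2), "red"),
                 ((1.0, 1.0), "blue"), ((0.9, 1.1), "blue")]

f = train_1nn(training_data)
print(f((0.05, 0.1)))   # near the "red" cluster -> "red"
print(f((0.95, 1.0)))   # near the "blue" cluster -> "blue"
```

The same train-then-predict structure underlies the more elaborate learners discussed later in the volume; only the function class and the search over it change.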
The ultimate goal of supervised learning is to identify a function f that produces
accurate predictions on test data. More formally, the goal is to minimize the prediction
error (loss) function l : Y × Y → R, which quantifies the difference between any predicted output f(x) and the corresponding ground truth label y. However, the loss cannot
be minimized directly on test instances and their labels because they are typically
unavailable at training time. Instead, supervised learning algorithms aim to construct
predictive functions that generalize well to previously unseen data, as opposed to
performing optimally just on the given training set, that is, overfitting the training data.
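The distinction between training loss and generalization can be made numeric with a toy calculation. The sketch below defines a zero-one loss l : Y × Y → R, averages it over a dataset (the empirical risk), and shows how a predictor that simply memorizes the training set achieves zero training loss yet errs on held-out data; the datasets and the fallback label are fabricated for illustration.

```python
# Loss and empirical risk sketch: l(f(x), y) scores a single prediction;
# averaging over a dataset gives the empirical risk of f on that dataset.

def zero_one_loss(y_pred, y_true):
    """l: Y x Y -> R, here 0 for a correct prediction and 1 otherwise."""
    return 0.0 if y_pred == y_true else 1.0

def empirical_risk(f, data, loss):
    """Average loss of f over labeled examples (x, y)."""
    return sum(loss(f(x), y) for x, y in data) / len(data)

train = [(1, "a"), (2, "a"), (3, "b")]
test  = [(4, "b"), (5, "a")]

# A memorizing predictor overfits: it is perfect on the training set
# but must fall back to a fixed guess ("a") on unseen inputs.
lookup = dict(train)
memorizer = lambda x: lookup.get(x, "a")

print(empirical_risk(memorizer, train, zero_one_loss))  # 0.0 on training data
print(empirical_risk(memorizer, test, zero_one_loss))   # 0.5 on test data
```

The gap between the two risks is exactly the failure to generalize that supervised learning algorithms are designed to avoid.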
The most common supervised learning setting is induction, where it is assumed that
each training and test example (x, y) is sampled from some unknown joint probability