to be particularly diverse for distributed architectures, such as clusters of commodity
PCs.
The wide range of platforms and frameworks for parallel and distributed comput-
ing presents both opportunities and challenges for machine learning scientists and
engineers. Fully exploiting the available hardware resources requires adapting some
algorithms and redesigning others to enable their concurrent execution. For any prediction model and learning algorithm, its structure, dataflow, and underlying task decomposition must be taken into account when determining whether a particular infrastructure is suitable.
The chapters that make up this volume form a representative set of state-of-the-art solutions
that span the space of modern parallel computing platforms and frameworks for a
variety of machine learning algorithms, tasks, and applications. Although it is infeasible
to cover every existing approach for every platform, we believe that the presented
set of techniques covers most commonly used methods, including the popular “top
performers” (e.g., boosted decision trees and support vector machines) and common
“baselines” (e.g., k-means clustering).
Because most chapters focus on a single choice of platform and/or framework, the
rest of this introduction provides the reader with unifying context: a brief overview
of machine learning basics and fundamental concepts in parallel and distributed com-
puting, a summary of typical task and application scenarios that require scaling up
learning, and thoughts on evaluating algorithm performance and platform trade-offs.
Following these are an overview of the chapters and bibliographic notes.
1.1 Machine Learning Basics
Machine learning focuses on constructing algorithms for making predictions from
data. A machine learning task aims to identify (to learn) a function f : X → Y that maps the input domain X (of data) onto the output domain Y (of possible predictions). The function f is selected from a certain function class, which is different for each family of learning algorithms. Elements of X and Y are application-specific representations of data objects and predictions, respectively.
Two canonical machine learning settings are supervised learning and unsupervised
learning. Supervised learning algorithms utilize training data to construct a prediction
function f , which is subsequently applied to test instances. Typically, training data is
provided in the form of labeled examples (x, y) ∈ X × Y, where x is a data instance and y is the corresponding ground truth prediction for x.
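As a concrete sketch of this setting, the snippet below constructs a prediction function f from labeled examples (x, y) and applies it to unseen instances. It uses a 1-nearest-neighbor rule purely for illustration; the data points, labels, and function names are invented for this example and are not drawn from the text.

```python
# Supervised learning sketch: build f: X -> Y from labeled examples (x, y),
# then apply f to test instances. Here f returns the label of the closest
# training instance under squared Euclidean distance (1-nearest-neighbor).

def train_1nn(examples):
    """Return a prediction function f mapping an input x to the label
    of its nearest training instance."""
    def f(x):
        def dist(pair):
            xi, _ = pair
            return sum((a - b) ** 2 for a, b in zip(xi, x))
        _, y = min(examples, key=dist)
        return y
    return f

# Labeled training examples (x, y) in X x Y; here X = R^2, Y = {"red", "blue"}.
training_data = [((0.0, 0.0), "red"), ((0.1, 0.2), "red"),
                 ((1.0, 1.0), "blue"), ((0.9, 1.1), "blue")]

f = train_1nn(training_data)
print(f((0.05, 0.1)))   # near the "red" cluster -> "red"
print(f((0.95, 1.0)))   # near the "blue" cluster -> "blue"
```

The same train-then-predict structure underlies the more elaborate learners discussed later in the volume; only the function class and the search over it change.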
The ultimate goal of supervised learning is to identify a function f that produces
accurate predictions on test data. More formally, the goal is to minimize the prediction
error (loss) function l : Y × Y → R, which quantifies the difference between any predicted output f(x) and the corresponding ground truth label y. However, the loss cannot
be minimized directly on test instances and their labels because they are typically
unavailable at training time. Instead, supervised learning algorithms aim to construct
predictive functions that generalize well to previously unseen data, as opposed to
performing optimally just on the given training set, that is, overfitting the training data.
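The distinction between training loss and generalization can be made numeric with a toy calculation. The sketch below defines a zero-one loss l : Y × Y → R, averages it over a dataset (the empirical risk), and shows how a predictor that simply memorizes the training set achieves zero training loss yet errs on held-out data; the datasets and the fallback label are fabricated for illustration.

```python
# Loss and empirical risk sketch: l(f(x), y) scores a single prediction;
# averaging over a dataset gives the empirical risk of f on that dataset.

def zero_one_loss(y_pred, y_true):
    """l: Y x Y -> R, here 0 for a correct prediction and 1 otherwise."""
    return 0.0 if y_pred == y_true else 1.0

def empirical_risk(f, data, loss):
    """Average loss of f over labeled examples (x, y)."""
    return sum(loss(f(x), y) for x, y in data) / len(data)

train = [(1, "a"), (2, "a"), (3, "b")]
test  = [(4, "b"), (5, "a")]

# A memorizing predictor overfits: it is perfect on the training set
# but must fall back to a fixed guess ("a") on unseen inputs.
lookup = dict(train)
memorizer = lambda x: lookup.get(x, "a")

print(empirical_risk(memorizer, train, zero_one_loss))  # 0.0 on training data
print(empirical_risk(memorizer, test, zero_one_loss))   # 0.5 on test data
```

The gap between the two risks is exactly the failure to generalize that supervised learning algorithms are designed to avoid.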
The most common supervised learning setting is induction, where it is assumed that
each training and test example (x, y) is sampled from some unknown joint probability