Foundations of Machine Learning: In-Depth Analysis and Practical Applications


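"Foundations of Machine Learning" is an introductory machine learning textbook co-authored by Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. It is part of the Adaptive Computation and Machine Learning series, which is edited by Thomas Dietterich with Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns as associate editors. A list of the other books in the series appears at the back of the book. The book aims to provide comprehensive coverage of machine learning theory and practice. It covers the core concepts of the field, which may include, but are not limited to, the following:

1. **Supervised Learning**: The most common type of machine learning, including models such as linear regression, logistic regression, support vector machines (SVM), decision trees, random forests, and neural networks. These models are trained on known input-output pairs in order to predict the outputs for new data.
2. **Unsupervised Learning**: The data has no labels, and the model must discover the intrinsic structure and patterns of the data on its own. Common techniques include clustering (such as K-means), principal component analysis (PCA), and autoencoders.
3. **Semi-supervised Learning**: A setting between supervised and unsupervised learning, where most of the data set is unlabeled and only a small part is labeled. It is typically used when large amounts of unlabeled data are available.
4. **Reinforcement Learning**: An agent learns, through interaction with an environment, how to make decisions that maximize reward. Key algorithms include Q-learning, policy gradient methods, and deep reinforcement learning (such as Deep Q-Network, DQN).
5. **Probabilistic Models and Bayesian Learning**: This part may cover probabilistic graphical models (such as Bayesian networks and Markov random fields), the naive Bayes classifier, and Bayesian optimization.
6. **Feature Selection and Dimensionality Reduction**: For example, principal component analysis (PCA), linear discriminant analysis (LDA), and regularization techniques (such as L1 and L2 regularization).
7. **Ensemble Learning**: For example, random forests, gradient boosting, and AdaBoost, which combine multiple weak learners into a strong learner to improve the stability and performance of the model.
8. **Deep Learning**: Including convolutional neural networks (CNN), recurrent neural networks (RNN), long short-term memory networks (LSTM), and generative adversarial networks (GAN), which have achieved notable results in image recognition, natural language processing, and speech recognition.
9. **Model Evaluation and Selection**: Cross-validation, grid search, learning curves, and ROC curves with their AUC are important tools for evaluating and selecting models.
10. **Algorithm Convergence and Optimization**: For example, gradient descent, stochastic gradient descent (SGD), and Newton's method, which are used to solve the optimization problems that determine model parameters.

Beyond discussing these concepts in depth, the book may also contain practical case studies, experimental guidance, and mathematical derivations intended to help readers understand and apply these methods. Both beginners and experienced practitioners can benefit from it, building a solid theoretical foundation in machine learning and acquiring skills for solving practical problems.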



Figure 1.1 The zig-zag line on the left panel is consistent over the blue and red training sample, but it is a complex separation surface that is not likely to generalize well to unseen data. In contrast, the decision surface on the right panel is simpler and might generalize better in spite of its misclassification of a few points of the training sample.

Which concept families can actually be learned, and under what conditions? How well can these concepts be learned computationally?

1.2 Definitions and terminology

We will use the canonical problem of spam detection as a running example to illustrate some basic definitions and to describe the use and evaluation of machine learning algorithms in practice. Spam detection is the problem of learning to automatically classify email messages as either spam or non-spam.

Examples: Items or instances of data used for learning or evaluation. In our spam problem, these examples correspond to the collection of email messages we will use for learning and testing.

Features: The set of attributes, often represented as a vector, associated to an example. In the case of email messages, some relevant features may include the length of the message, the name of the sender, various characteristics of the header, the presence of certain keywords in the body of the message, and so on.

Labels: Values or categories assigned to examples. In classification problems, examples are assigned specific categories, for instance, the spam and non-spam categories in our binary classification problem. In regression, items are assigned real-valued labels.

Training sample: Examples used to train a learning algorithm. In our spam problem, the training sample consists of a set of email examples along with their associated labels. The training sample varies for different learning scenarios, as described in section 1.4.

Validation sample: Examples used to tune the parameters of a learning algorithm when working with labeled data. Learning algorithms typically have one or more free parameters, and the validation sample is used to select appropriate values for these model parameters.

Test sample: Examples used to evaluate the performance of a learning algorithm. The test sample is separate from the training and validation data and is not made available in the learning stage. In the spam problem, the test sample consists of a collection of email examples for which the learning algorithm must predict labels based on features. These predictions are then compared with the labels of the test sample to measure the performance of the algorithm.

Loss function: A function that measures the difference, or loss, between a predicted label and a true label. Denoting the set of all labels as $\mathcal{Y}$ and the set of possible predictions as $\mathcal{Y}'$, a loss function $L$ is a mapping $L\colon \mathcal{Y} \times \mathcal{Y}' \to \mathbb{R}_+$. In most cases, $\mathcal{Y}' = \mathcal{Y}$ and the loss function is bounded, but these conditions do not always hold. Common examples of loss functions include the zero-one (or misclassification) loss defined over $\{-1, +1\} \times \{-1, +1\}$ by $L(y, y') = 1_{y' \neq y}$ and the squared loss defined over $I \times I$ by $L(y, y') = (y' - y)^2$, where $I \subseteq \mathbb{R}$ is typically a bounded interval.

Hypothesis set: A set of functions mapping features (feature vectors) to the set of labels $\mathcal{Y}$. In our example, these may be a set of functions mapping email features to $\mathcal{Y} = \{\text{spam}, \text{non-spam}\}$. More generally, hypotheses may be functions mapping features to a different set $\mathcal{Y}'$. They could be linear functions mapping email feature vectors to real numbers interpreted as scores ($\mathcal{Y}' = \mathbb{R}$), with higher score values more indicative of spam than lower ones.
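The definitions above can be made concrete with a small sketch. The following Python code is only an illustration under stated assumptions: the keyword list, the weights, the bias, and all function names are hypothetical and do not come from the text. It builds a feature vector from an email body, applies a linear scoring hypothesis, and evaluates the zero-one and squared losses. Both losses are symmetric in their arguments, so the argument order does not affect the result.

```python
from typing import Dict, List, Sequence

# Hypothetical keyword list, chosen only for illustration.
SPAM_KEYWORDS = ("free", "winner", "prize")


def extract_features(email: Dict[str, str]) -> List[float]:
    """Map an email to a feature vector: message length, then keyword indicators."""
    body = email["body"]
    lowered = body.lower()
    return [float(len(body))] + [float(kw in lowered) for kw in SPAM_KEYWORDS]


def linear_score(weights: Sequence[float], features: Sequence[float],
                 bias: float = 0.0) -> float:
    """A linear hypothesis mapping a feature vector to a real-valued score."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict_label(score: float) -> int:
    """Turn a score into a label in {-1, +1}; +1 stands for spam."""
    return 1 if score > 0 else -1


def zero_one_loss(y_pred: int, y_true: int) -> int:
    """Zero-one loss: 1 if the predicted label differs from the true one, else 0."""
    return int(y_pred != y_true)


def squared_loss(y_pred: float, y_true: float) -> float:
    """Squared loss between a real-valued prediction and a real-valued label."""
    return (y_pred - y_true) ** 2


email = {"body": "You are a WINNER of a free prize"}
features = extract_features(email)
weights = [0.0, 1.0, 1.0, 1.0]  # hypothetical: ignore length, count keywords
score = linear_score(weights, features, bias=-1.5)
prediction = predict_label(score)
```

Here all three keywords are present, so the score is 3 − 1.5 = 1.5 and the message is labeled as spam; the zero-one loss is then 0 if the true label is +1 and 1 otherwise.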

We now define the learning stages of our spam problem. We start with a given collection of labeled examples. We first randomly partition the data into a training sample, a validation sample, and a test sample. The size of each of these samples depends on a number of different considerations. For example, the amount of data reserved for validation depends on the number of free parameters of the algorithm. Also, when the labeled sample is relatively small, the amount of training data is often chosen to be larger than that of test data since the learning performance directly depends on the training sample.

Next, we associate relevant features to the examples. This is a critical step in the design of machine learning solutions. Useful features can effectively guide the learning algorithm, while poor or uninformative ones can be misleading. Although it is critical, the choice of the features is to a large extent left to the user. This choice reflects the user's prior knowledge about the learning task, which in practice can have a dramatic effect on the performance results.

Now, we use the features selected to train our learning algorithm by fixing different values of its free parameters. For each value of these parameters, the algorithm selects a different hypothesis out of the hypothesis set. We choose among them the hypothesis resulting in the best performance on the validation sample. Finally, using that hypothesis, we predict the labels of the examples in the test sample. The performance of the algorithm is evaluated by using the loss function associated to the task, e.g., the zero-one loss in our spam detection task, to compare the predicted and true labels.
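The learning stages just described can be sketched in Python as follows. This is a minimal illustration, not the authors' procedure: the learner is a one-dimensional k-nearest-neighbor classifier swapped in for concreteness, its free parameter is k, and the toy data, the split fractions, and all names are assumptions made for this example.

```python
import random


def zero_one_loss(y_pred, y_true):
    return int(y_pred != y_true)


def make_toy_data(m=100, seed=0):
    """Hypothetical data: x in [0, 1), label +1 if x > 0.5, with 10% label noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(m):
        x = rng.random()
        y = 1 if x > 0.5 else -1
        if rng.random() < 0.1:
            y = -y
        data.append((x, y))
    return data


def split(data, frac_train=0.6, frac_val=0.2, seed=0):
    """Randomly partition the data into training, validation, and test samples."""
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_train = round(frac_train * len(shuffled))
    n_val = round(frac_val * len(shuffled))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


def knn_predict(train, x, k):
    """Majority label among the k training points closest to x (k odd, labels +/-1)."""
    neighbors = sorted(train, key=lambda ex: abs(ex[0] - x))[:k]
    return 1 if sum(label for _, label in neighbors) > 0 else -1


def error(train, sample, k):
    """Average zero-one loss on `sample` of the hypothesis obtained with parameter k."""
    return sum(zero_one_loss(knn_predict(train, x, k), y)
               for x, y in sample) / len(sample)


def select_and_evaluate(data, candidate_ks=(1, 3, 5), seed=0):
    train, val, test = split(data, seed=seed)
    # Each value of k yields a different hypothesis; keep the best on validation.
    best_k = min(candidate_ks, key=lambda k: error(train, val, k))
    # The test sample is used only once, to report the final performance.
    return best_k, error(train, test, best_k)


data = make_toy_data()
best_k, test_error = select_and_evaluate(data)
```

The test sample plays no role in choosing k, which is what keeps the reported test error an honest estimate of performance on unseen data.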

Thus, the performance of an algorithm is of course evaluated based on its test error and not its error on the training sample. A learning algorithm may be consistent, that is, it may commit no error on the examples of the training data, and yet have poor performance on the test data. This occurs for consistent learners defined by very complex decision surfaces, as illustrated in figure 1.1, which tend to memorize a relatively small training sample instead of seeking to generalize well. This highlights the key distinction between memorization and generalization, the latter being the fundamental property sought for an accurate learning algorithm. Theoretical guarantees for consistent learners will be discussed in great detail in chapter 2.

1.3 Cross-validation

In practice, the amount of labeled data available is often too small to set aside a validation sample since that would leave an insufficient amount of training data. Instead, a widely adopted method known as n-fold cross-validation is used to exploit the labeled data both for model selection (selection of the free parameters of the algorithm) and for training.

Let $\theta$ denote the vector of free parameters of the algorithm. For a fixed value of $\theta$, the method consists of first randomly partitioning a given sample $S$ of $m$ labeled examples into $n$ subsamples, or folds. The $i$th fold is thus a labeled sample $((x_{i1}, y_{i1}), \ldots, (x_{im_i}, y_{im_i}))$ of size $m_i$. Then, for any $i \in [1, n]$, the learning algorithm is trained on all but the $i$th fold to generate a hypothesis $h_i$, and the performance of $h_i$ is tested on the $i$th fold, as illustrated in figure 1.2a. The parameter value $\theta$ is evaluated based on the average error of the hypotheses $h_i$, which is called the cross-validation error. This quantity is denoted by $\widehat{R}_{CV}(\theta)$ and defined by

$$\widehat{R}_{CV}(\theta) = \frac{1}{n} \sum_{i=1}^{n} \underbrace{\frac{1}{m_i} \sum_{j=1}^{m_i} L(h_i(x_{ij}), y_{ij})}_{\text{error of } h_i \text{ on the } i\text{th fold}}.$$

The folds are generally chosen to have equal size, that is $m_i = m/n$ for all $i \in [1, n]$.
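The cross-validation error defined above, the average over the n folds of the mean loss of each hypothesis on its held-out fold, can be computed with a short generic routine. The sketch below is an assumption-laden illustration: `train_fn` stands for any learning algorithm that takes a training sample and a parameter value and returns a hypothesis, and the toy learner at the end (a constant predictor equal to the mean training label scaled by the parameter) is hypothetical.

```python
import random


def make_folds(sample, n, seed=0):
    """Randomly partition the sample into n folds of (nearly) equal size."""
    if not 1 < n <= len(sample):
        raise ValueError("n must satisfy 1 < n <= len(sample)")
    shuffled = list(sample)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::n] for i in range(n)]


def cross_validation_error(sample, n, train_fn, loss, theta, seed=0):
    """Average over folds of the mean loss of h_i on the i-th (held-out) fold."""
    folds = make_folds(sample, n, seed)
    fold_errors = []
    for i, held_out in enumerate(folds):
        # Train on all folds except the i-th one.
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        h_i = train_fn(training, theta)
        fold_errors.append(
            sum(loss(h_i(x), y) for x, y in held_out) / len(held_out))
    return sum(fold_errors) / n


def squared_loss(y_pred, y_true):
    return (y_pred - y_true) ** 2


def train_scaled_mean(training, theta):
    """Hypothetical toy learner: predict theta times the mean training label."""
    constant = theta * sum(y for _, y in training) / len(training)
    return lambda x: constant


sample = [(float(i), 2.0) for i in range(10)]
cv_at_one = cross_validation_error(sample, 5, train_scaled_mean, squared_loss, 1.0)
cv_at_half = cross_validation_error(sample, 5, train_scaled_mean, squared_loss, 0.5)
```

With every label equal to 2.0, the unscaled mean predictor (parameter 1.0) makes no error on any fold, while scaling by 0.5 predicts 1.0 everywhere and incurs a squared loss of 1.0 on every point.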

Figure 1.2 n-fold cross validation. (a) Illustration of the partitioning of the training data into 5 folds. (b) Typical plot of a classifier's prediction error as a function of the size of the training sample: the error decreases as a function of the number of training points.

How should $n$ be chosen? The appropriate choice is subject to a trade-off and the topic of much learning theory research that we cannot address in this introductory chapter. For a large $n$, each training sample used in n-fold cross-validation has size $m - m/n = m(1 - 1/n)$ (illustrated by the right vertical red line in figure 1.2b), which is close to $m$, the size of the full sample, but the training samples are quite similar. Thus, the method tends to have a small bias but a large variance. In contrast, smaller values of $n$ lead to more diverse training samples but their size (shown by the left vertical red line in figure 1.2b) is significantly less than $m$; thus the method tends to have a smaller variance but a larger bias.
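As a quick worked illustration of the training-sample size $m(1 - 1/n)$ discussed above, take a hypothetical sample of $m = 1000$ labeled examples (the value of $m$ is an assumption chosen only for the arithmetic):

```latex
\begin{align*}
n = 2:    &\quad m\left(1 - \tfrac{1}{n}\right) = 1000 \cdot \tfrac{1}{2}  = 500, \\
n = 5:    &\quad m\left(1 - \tfrac{1}{n}\right) = 1000 \cdot \tfrac{4}{5}  = 800, \\
n = 10:   &\quad m\left(1 - \tfrac{1}{n}\right) = 1000 \cdot \tfrac{9}{10} = 900, \\
n = 1000: &\quad m\left(1 - \tfrac{1}{n}\right) = 1000 \cdot \tfrac{999}{1000} = 999.
\end{align*}
```

As $n$ grows, each training sample approaches the full size $m$, while any two training samples share an increasing fraction of their points.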

In machine learning applications, $n$ is typically chosen to be 5 or 10. n-fold cross validation is used as follows in model selection. The full labeled data is first split into a training and a test sample. The training sample of size $m$ is then used to compute the n-fold cross-validation error $\widehat{R}_{CV}(\theta)$ for a small number of possible values of $\theta$. $\theta$ is next set to the value $\theta_0$ for which $\widehat{R}_{CV}(\theta)$ is smallest and the algorithm is trained with the parameter setting $\theta_0$ over the full training sample of size $m$. Its performance is evaluated on the test sample as already described in the previous section.
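The model selection procedure described above can be sketched end to end in Python. This is an illustration under assumptions, not a reference implementation: the toy learner (a constant predictor equal to the mean training label shrunk by a factor 1/(1 + θ)), the 20% test fraction, and all names are hypothetical.

```python
import random
from statistics import mean


def squared_loss(y_pred, y_true):
    return (y_pred - y_true) ** 2


def train_shrunken_mean(sample, theta):
    """Hypothetical toy learner: mean training label shrunk by 1 / (1 + theta)."""
    constant = mean(y for _, y in sample) / (1.0 + theta)
    return lambda x: constant


def cv_error(sample, n, theta):
    """n-fold cross-validation error of the toy learner for parameter theta."""
    folds = [sample[i::n] for i in range(n)]
    errors = []
    for i in range(n):
        training = [ex for j in range(n) if j != i for ex in folds[j]]
        h = train_shrunken_mean(training, theta)
        errors.append(mean(squared_loss(h(x), y) for x, y in folds[i]))
    return mean(errors)


def select_and_test(data, candidate_thetas, n=5, test_fraction=0.2, seed=0):
    # 1. Split the full labeled data into a training and a test sample.
    shuffled = list(data)
    random.Random(seed).shuffle(shuffled)
    n_test = round(test_fraction * len(shuffled))
    test, training = shuffled[:n_test], shuffled[n_test:]
    # 2. Pick the parameter value with the smallest cross-validation error.
    theta_0 = min(candidate_thetas, key=lambda t: cv_error(training, n, t))
    # 3. Retrain with theta_0 on the full training sample, then test once.
    h = train_shrunken_mean(training, theta_0)
    test_error = mean(squared_loss(h(x), y) for x, y in test)
    return theta_0, test_error


data = [(float(i), 2.0) for i in range(50)]
theta_0, test_error = select_and_test(data, [0.0, 0.5, 1.0])
```

On this toy data every label equals 2.0, so any shrinkage only hurts: the procedure selects θ = 0.0 and the retrained hypothesis has zero test error.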

The special case of n-fold cross validation where $n = m$ is called leave-one-out cross-validation, since at each iteration exactly one instance is left out of the training sample. As shown in chapter 4, the average leave-one-out error is an approximately unbiased estimate of the average error of an algorithm and can be used to derive simple guarantees for some algorithms. In general, the leave-one-out error is very costly to compute, since it requires training $n$ times (here $n = m$) on samples of size $m - 1$, but for some algorithms it admits a very efficient computation (see exercise 10.9).

In addition to model selection, n-fold cross validation is also commonly used for performance evaluation. In that case, for a fixed parameter setting $\theta$, the full labeled sample is divided into $n$ random folds with no distinction between training and test samples. The performance reported is the n-fold cross-validation error on the full sample as well as the standard deviation of the errors measured on each fold.
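Reporting cross-validation performance in this way amounts to summarizing the per-fold errors by their mean and standard deviation. A minimal sketch follows; the function name and the example numbers are hypothetical, and the use of the sample standard deviation (`statistics.stdev`, dividing by n − 1) rather than the population version is an assumption, since the text does not specify which is meant.

```python
from statistics import mean, stdev


def summarize_fold_errors(fold_errors):
    """Return (cross-validation error, standard deviation) of per-fold errors."""
    if len(fold_errors) < 2:
        raise ValueError("at least two folds are needed for a standard deviation")
    # stdev is the sample standard deviation (n - 1 in the denominator).
    return mean(fold_errors), stdev(fold_errors)


# Hypothetical per-fold errors from a 5-fold run.
cv_error, cv_std = summarize_fold_errors([0.10, 0.12, 0.08, 0.11, 0.09])
```

The pair would typically be reported together, for example as an error of 0.10 with the accompanying standard deviation across folds.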


1.4 Learning scenarios

We next briefly describe common machine learning scenarios. These scenarios differ in the types of training data available to the learner, the order and method by which training data is received, and the test data used to evaluate the learning algorithm.

Supervised learning: The learner receives a set of labeled examples as training data and makes predictions for all unseen points. This is the most common scenario associated with classification, regression, and ranking problems. The spam detection problem discussed in the previous section is an instance of supervised learning.

Unsupervised learning: The learner exclusively receives unlabeled training data, and makes predictions for all unseen points. Since in general no labeled example is available in that setting, it can be difficult to quantitatively evaluate the performance of a learner. Clustering and dimensionality reduction are examples of unsupervised learning problems.

Semi-supervised learning: The learner receives a training sample consisting of both labeled and unlabeled data, and makes predictions for all unseen points. Semi-supervised learning is common in settings where unlabeled data is easily accessible but labels are expensive to obtain. Various types of problems arising in applications, including classification, regression, or ranking tasks, can be framed as instances of semi-supervised learning. The hope is that the distribution of unlabeled data accessible to the learner can help it achieve a better performance than in the supervised setting. The analysis of the conditions under which this can indeed be realized is the topic of much modern theoretical and applied machine learning research.

Transductive inference: As in the semi-supervised scenario, the learner receives a labeled training sample along with a set of unlabeled test points. However, the objective of transductive inference is to predict labels only for these particular test points. Transductive inference appears to be an easier task and matches the scenario encountered in a variety of modern applications. However, as in the semi-supervised setting, the assumptions under which a better performance can be achieved in this setting are research questions that have not been fully resolved.

On-line learning: In contrast with the previous scenarios, the on-line scenario involves multiple rounds, and training and testing phases are intermixed. At each round, the learner receives an unlabeled training point, makes a prediction, receives the true label, and incurs a loss. The objective in the on-line setting is to minimize the cumulative loss over all rounds. Unlike the previous settings just discussed, no distributional assumption is made in on-line learning. In fact, instances and their labels may be chosen adversarially within this scenario.
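The round-by-round protocol of the on-line scenario can be sketched as a simple loop. The text does not prescribe a particular algorithm; the perceptron update used below is swapped in purely as a familiar example of an on-line learner, and the function name and the toy stream are hypothetical.

```python
def online_perceptron(stream, dim):
    """Run the on-line protocol with a perceptron; return weights and cumulative loss.

    `stream` yields (x, y) pairs with x a sequence of `dim` floats and y in {-1, +1}.
    """
    w = [0.0] * dim
    cumulative_loss = 0
    for x, y in stream:
        # 1. Receive the unlabeled point x and make a prediction.
        score = sum(wi * xi for wi, xi in zip(w, x))
        y_pred = 1 if score >= 0 else -1
        # 2. Receive the true label y and incur the zero-one loss.
        if y_pred != y:
            cumulative_loss += 1
            # 3. Perceptron update, applied only after a mistake.
            w = [wi + y * xi for wi, xi in zip(w, x)]
    return w, cumulative_loss


# Hypothetical stream of four rounds.
stream = [((1.0, 0.0), 1), ((0.0, 1.0), -1), ((1.0, 0.0), 1), ((0.0, 1.0), -1)]
weights, total_loss = online_perceptron(stream, dim=2)
```

On this stream the learner errs only in the second round, so the cumulative loss is 1. Nothing in the loop assumes the points are drawn from a fixed distribution, which matches the absence of distributional assumptions in this scenario.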
