function f that makes predictions for all possible input values. To do this we
must make assumptions about the characteristics of the underlying function,
as otherwise any function which is consistent with the training data would be
equally valid. A wide variety of methods have been proposed to deal with the
supervised learning problem; here we describe two common approaches. The
first is to restrict the class of functions that we consider, for example by only
considering linear functions of the input. The second approach is (speaking
rather loosely) to give a prior probability to every possible function, where
higher probabilities are given to functions that we consider to be more likely, for
example because they are smoother than other functions.¹ The first approach
has an obvious problem in that we have to decide upon the richness of the class
of functions considered; if we are using a model based on a certain class of
functions (e.g. linear functions) and the target function is not well modelled by
this class, then the predictions will be poor. One may be tempted to increase the
flexibility of the class of functions, but this runs into the danger of overfitting,
where we can obtain a good fit to the training data, but perform badly when
making test predictions.
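Both failure modes are easy to reproduce numerically. The following is a minimal sketch, not taken from the text: it uses polynomial classes as a stand-in, with a degree-1 polynomial playing the role of the linear class above and a high-degree polynomial playing the role of an overly flexible class.

    import numpy as np

    rng = np.random.default_rng(1)

    def target(x):
        return np.sin(2 * np.pi * x)   # a smooth but non-linear target function

    # Ten noisy training observations and a dense grid of test inputs.
    x_train = rng.uniform(0, 1, 10)
    y_train = target(x_train) + 0.1 * rng.standard_normal(10)
    x_test = np.linspace(0, 1, 200)

    for degree in (1, 9):
        # Restrict the function class to polynomials of the given degree.
        coeffs = np.polyfit(x_train, y_train, degree)
        test_err = np.mean((np.polyval(coeffs, x_test) - target(x_test)) ** 2)
        print(degree, test_err)

    # The degree-1 class is too restrictive to capture the target (poor fit),
    # while the degree-9 class typically fits the ten training points almost
    # exactly yet predicts badly between them (overfitting).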
The second approach appears to have a serious problem, in that surely
there is an uncountably infinite set of possible functions, and how are we
going to compute with this set in finite time? This is where the Gaussian
process comes to our rescue. A Gaussian process is a generalization of the
Gaussian probability distribution. Whereas a probability distribution describes
random variables which are scalars or vectors (for multivariate distributions),
a stochastic process governs the properties of functions. Leaving mathematical
sophistication aside, one can loosely think of a function as a very long vector,
each entry in the vector specifying the function value f(x) at a particular input
x. It turns out that, although this idea is a little naïve, it is surprisingly close
to what we need. Indeed, the question of how we deal computationally with these
infinite dimensional objects has the most pleasant resolution imaginable: if you
ask only for the properties of the function at a finite number of points, then
inference in the Gaussian process will give you the same answer whether you
ignore the infinitely many other points or take them all into account!
And these answers are consistent with answers to any other finite queries you
may have. One of the main attractions of the Gaussian process framework is
precisely that it unites a sophisticated and consistent view with computational
tractability.
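To make this marginalization property concrete, here is a minimal numerical sketch, not taken from the text; it assumes a zero mean function and a squared-exponential covariance function as one common choice. Evaluating the process at a finite set of inputs yields an ordinary multivariate Gaussian, and adding further inputs merely enlarges the covariance matrix without changing the entries belonging to the original points.

    import numpy as np

    def sq_exp_cov(xa, xb, lengthscale=1.0):
        # Squared-exponential covariance k(x, x') = exp(-(x - x')^2 / (2 l^2)).
        d = xa[:, None] - xb[None, :]
        return np.exp(-0.5 * (d / lengthscale) ** 2)

    # At a finite set of query points the GP is just a multivariate Gaussian
    # with this covariance matrix.
    x = np.array([0.0, 0.5, 1.0])
    K = sq_exp_cov(x, x)

    # Add further points and build the larger joint covariance ...
    x_extra = np.array([2.0, 3.0, 4.0])
    x_all = np.concatenate([x, x_extra])
    K_all = sq_exp_cov(x_all, x_all)

    # ... marginalizing out the extra points amounts to dropping their rows
    # and columns, so the answers at the original points are unchanged.
    assert np.allclose(K_all[:3, :3], K)

    # Drawing a sample of f at the original points needs only the 3x3 matrix.
    rng = np.random.default_rng(0)
    f = rng.multivariate_normal(mean=np.zeros(len(x)), cov=K)
    print(f)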
It should come as no surprise that these ideas have been around for some
time, although they are perhaps not as well known as they might be. Indeed,
many models that are commonly employed in both machine learning and statistics
are in fact special cases of, or restricted kinds of, Gaussian processes. In this
volume, we aim to give a systematic and unified treatment of the area, showing
connections to related models.
¹ These two approaches may be regarded as imposing a restriction bias and a preference
bias respectively; see e.g. Mitchell [1997].