Gaussian Processes: Theory and Applications in Machine Learning


Gaussian Processes for Machine Learning, by Carl Edward Rasmussen and Christopher K. I. Williams, is a book devoted to the use of Gaussian processes (GPs) in machine learning. GPs provide a principled, practical, probabilistic approach to learning in kernel machines. They have received growing attention in the machine learning community over the past decade, and the book supplies the systematic, unified treatment that the field had long lacked.

The book is concerned with supervised learning, covering both regression and classification. It explains the underlying theory, presents detailed algorithms for many methods, relates GPs to existing models in the literature, and covers approximate inference for faster practical algorithms. Each chapter includes a small set of exercises, and software implementations are available from the book's website.

Overall, the book is an authoritative reference on Gaussian processes in machine learning: a valuable resource for researchers and engineers, and a systematic, self-contained introduction for students.

Preface

in the time series analysis literature; some pointers to this literature are given in Appendix B.

The book is primarily intended for graduate students and researchers in machine learning at departments of Computer Science, Statistics and Applied Mathematics. As prerequisites we require a good basic grounding in calculus, linear algebra and probability theory as would be obtained by graduates in numerate disciplines such as electrical engineering, physics and computer science. For preparation in calculus and linear algebra any good university-level textbook on mathematics for physics or engineering such as Arfken [1985] would be fine. For probability theory some familiarity with multivariate distributions (especially the Gaussian) and conditional probability is required. Some background mathematical material is also provided in Appendix A.

The main focus of the book is to present clearly and concisely an overview of the main ideas of Gaussian processes in a machine learning context. We have also covered a wide range of connections to existing models in the literature, and cover approximate inference for faster practical algorithms. We have presented detailed algorithms for many methods to aid the practitioner. Software implementations are available from the website for the book, see Appendix C. We have also included a small set of exercises in each chapter; we hope these will help in gaining a deeper understanding of the material.

In order to limit the size of the volume, we have had to omit some topics, such as, for example, Markov chain Monte Carlo methods for inference. One of the most difficult things to decide when writing a book is what sections not to write. Within sections, we have often chosen to describe one algorithm in particular in depth, and mention related work only in passing. Although this causes the omission of some material, we feel it is the best approach for a monograph, and hope that the reader will gain a general understanding so as to be able to push further into the growing literature of GP models.

The book has a natural split into two parts, with the chapters up to and including chapter 5 covering core material, and the remaining sections covering the connections to other methods, fast approximations, and more specialized properties. Some sections are marked by an asterisk (∗). These sections may be omitted on a first reading, and are not pre-requisites for later (un-starred) material.

We wish to express our considerable gratitude to the many people with whom we have interacted during the writing of this book. In particular Moray Allan, David Barber, Peter Bartlett, Miguel Carreira-Perpiñán, Marcus Gallagher, Manfred Opper, Anton Schwaighofer, Matthias Seeger, Hanna Wallach, Joe Whittaker, and Andrew Zisserman all read parts of the book and provided valuable feedback. Dilan Görür, Malte Kuss, Iain Murray, Joaquin Quiñonero-Candela, Leif Rasmussen and Sam Roweis were especially heroic and provided comments on the whole manuscript. We thank Chris Bishop, Miguel Carreira-Perpiñán, Nando de Freitas, Zoubin Ghahramani, Peter Grünwald, Mike Jordan, John Kent, Radford Neal, Joaquin Quiñonero-Candela, Ryan Rifkin, Stefan Schaal, Anton Schwaighofer, Matthias Seeger, Peter Sollich, Ingo Steinwart, Amos Storkey, Volker Tresp, Sethu Vijayakumar, Grace Wahba, Joe Whittaker and Tong Zhang for valuable discussions on specific issues. We also thank Bob Prior and the staff at MIT Press for their support during the writing of the book. We thank the Gatsby Computational Neuroscience Unit (UCL) and Neil Lawrence at the Department of Computer Science, University of Sheffield for hosting our visits and kindly providing space for us to work, and the Department of Computer Science at the University of Toronto for computer support. Thanks to John and Fiona for their hospitality on numerous occasions. Some of the diagrams in this book have been inspired by similar diagrams appearing in published work, as follows: Figure 3.5, Schölkopf and Smola [2002]; Figure 5.2, MacKay [1992b]. CER gratefully acknowledges financial support from the German Research Foundation (DFG). CKIW thanks the School of Informatics, University of Edinburgh for granting him sabbatical leave for the period October 2003-March 2004.

Finally, we reserve our deepest appreciation for our wives Agnes and Barbara, and children Ezra, Kate, Miro and Ruth for their patience and understanding while the book was being written.

Despite our best efforts it is inevitable that some errors will make it through to the printed version of the book. Errata will be made available via the book’s website at

http://www.GaussianProcess.org/gpml

We have found the joint writing of this book an excellent experience. Although hard at times, we are confident that the end result is much better than either one of us could have written alone.

Now, ten years after their first introduction into the machine learning community, Gaussian processes are receiving growing attention. Although GPs have been known for a long time in the statistics and geostatistics fields, and their use can perhaps be traced back as far as the end of the 19th century, their application to real problems is still in its early phases. This contrasts somewhat with the application of the non-probabilistic analogue of the GP, the support vector machine, which was taken up more quickly by practitioners. Perhaps this has to do with the probabilistic mind-set needed to understand GPs, which is not so generally appreciated. Perhaps it is due to the need for computational short-cuts to implement inference for large datasets. Or it could be due to the lack of a self-contained introduction to this exciting field; with this volume, we hope to contribute to the momentum gained by Gaussian processes in machine learning.

Carl Edward Rasmussen and Chris Williams

Tübingen and Edinburgh, summer 2005

Symbols and Notation

Matrices are capitalized and vectors are in bold type. We do not generally distinguish between probabilities and probability densities. A subscript asterisk, such as in $X_*$, indicates reference to a test set quantity. A superscript asterisk denotes complex conjugate.

Symbol    Meaning

$\backslash$    left matrix divide: $A \backslash \mathbf{b}$ is the vector $\mathbf{x}$ which solves $A\mathbf{x} = \mathbf{b}$
$\triangleq$    an equality which acts as a definition
$\overset{c}{=}$    equality up to an additive constant
$|K|$    determinant of $K$ matrix
$|\mathbf{y}|$    Euclidean length of vector $\mathbf{y}$, i.e. $\left(\sum_i y_i^2\right)^{1/2}$
$\langle f, g \rangle_{\mathcal{H}}$    RKHS inner product
$\|f\|_{\mathcal{H}}$    RKHS norm
$\mathbf{y}^\top$    the transpose of vector $\mathbf{y}$
$\propto$    proportional to; e.g. $p(x|y) \propto f(x, y)$ means that $p(x|y)$ is equal to $f(x, y)$ times a factor which is independent of $x$
$\sim$    distributed according to; example: $x \sim \mathcal{N}(\mu, \sigma^2)$
$\nabla$ or $\nabla_{\mathbf{f}}$    partial derivatives (w.r.t. $\mathbf{f}$)
$\nabla\nabla$    the (Hessian) matrix of second derivatives
$\mathbf{0}$ or $\mathbf{0}_n$    vector of all 0’s (of length $n$)
$\mathbf{1}$ or $\mathbf{1}_n$    vector of all 1’s (of length $n$)
$C$    number of classes in a classification problem
cholesky($A$)    Cholesky decomposition: $L$ is a lower triangular matrix such that $LL^\top = A$
cov($\mathbf{f}_*$)    Gaussian process posterior covariance
$D$    dimension of input space $\mathcal{X}$
$\mathcal{D}$    data set: $\mathcal{D} = \{(\mathbf{x}_i, y_i) \mid i = 1, \ldots, n\}$
diag($\mathbf{w}$)    (vector argument) a diagonal matrix containing the elements of vector $\mathbf{w}$
diag($W$)    (matrix argument) a vector containing the diagonal elements of matrix $W$
$\delta_{pq}$    Kronecker delta, $\delta_{pq} = 1$ iff $p = q$ and 0 otherwise
$\mathbb{E}$ or $\mathbb{E}_{q(x)}[z(x)]$    expectation; expectation of $z(x)$ when $x \sim q(x)$
$f(\mathbf{x})$ or $\mathbf{f}$    Gaussian process (or vector of) latent function values, $\mathbf{f} = (f(\mathbf{x}_1), \ldots, f(\mathbf{x}_n))^\top$
$f_*$    Gaussian process (posterior) prediction (random variable)
$\bar{f}_*$    Gaussian process posterior mean
$\mathcal{GP}$    Gaussian process: $f \sim \mathcal{GP}(m(\mathbf{x}), k(\mathbf{x}, \mathbf{x}'))$, the function $f$ is distributed as a Gaussian process with mean function $m(\mathbf{x})$ and covariance function $k(\mathbf{x}, \mathbf{x}')$
$h(\mathbf{x})$ or $\mathbf{h}(\mathbf{x})$    either fixed basis function (or set of basis functions) or weight function
$H$ or $H(X)$    set of basis functions evaluated at all training points
$I$ or $I_n$    the identity matrix (of size $n$)
$J_\nu(z)$    Bessel function of the first kind
$k(\mathbf{x}, \mathbf{x}')$    covariance (or kernel) function evaluated at $\mathbf{x}$ and $\mathbf{x}'$
$K$ or $K(X, X)$    $n \times n$ covariance (or Gram) matrix
$K_*$    $n \times n_*$ matrix $K(X, X_*)$, the covariance between training and test cases
$\mathbf{k}(\mathbf{x}_*)$ or $\mathbf{k}_*$    vector, short for $K(X, \mathbf{x}_*)$, when there is only a single test case
$K_f$ or $K$    covariance matrix for the (noise free) $\mathbf{f}$ values
$K_y$    covariance matrix for the (noisy) $\mathbf{y}$ values; for independent homoscedastic noise, $K_y = K_f + \sigma_n^2 I$
$K_\nu(z)$    modified Bessel function
$\mathcal{L}(a, b)$    loss function, the loss of predicting $b$, when $a$ is true; note argument order
$\log(z)$    natural logarithm (base $e$)
$\log_2(z)$    logarithm to the base 2
$\ell$ or $\ell_d$    characteristic length-scale (for input dimension $d$)
$\lambda(z)$    logistic function, $\lambda(z) = 1/(1 + \exp(-z))$
$m(\mathbf{x})$    the mean function of a Gaussian process
$\mu$    a measure (see section A.7)
$\mathcal{N}(\boldsymbol{\mu}, \Sigma)$ or $\mathcal{N}(\mathbf{x}|\boldsymbol{\mu}, \Sigma)$    (the variable $\mathbf{x}$ has a) Gaussian (Normal) distribution with mean vector $\boldsymbol{\mu}$ and covariance matrix $\Sigma$
$\mathcal{N}(\mathbf{x})$    short for unit Gaussian $\mathbf{x} \sim \mathcal{N}(\mathbf{0}, I)$
$n$ and $n_*$    number of training (and test) cases
$N$    dimension of feature space
$N_H$    number of hidden units in a neural network
$\mathbb{N}$    the natural numbers, the positive integers
$\mathcal{O}(\cdot)$    big Oh; for functions $f$ and $g$ on $\mathbb{N}$, we write $f(n) = \mathcal{O}(g(n))$ if the ratio $f(n)/g(n)$ remains bounded as $n \to \infty$
$O$    either matrix of all zeros or differential operator
$y|\mathbf{x}$ and $p(y|\mathbf{x})$    conditional random variable $y$ given $\mathbf{x}$ and its probability (density)
$P_N$    the regular n-polygon
$\boldsymbol{\phi}(\mathbf{x}_i)$ or $\Phi(X)$    feature map of input $\mathbf{x}_i$ (or input set $X$)
$\Phi(z)$    cumulative unit Gaussian: $\Phi(z) = (2\pi)^{-1/2} \int_{-\infty}^{z} \exp(-t^2/2)\,dt$
$\pi(\mathbf{x})$    the sigmoid of the latent value: $\pi(\mathbf{x}) = \sigma(f(\mathbf{x}))$ (stochastic if $f(\mathbf{x})$ is stochastic)
$\hat{\pi}(\mathbf{x}_*)$    MAP prediction: $\pi$ evaluated at $\bar{f}(\mathbf{x}_*)$
$\bar{\pi}(\mathbf{x}_*)$    mean prediction: expected value of $\pi(\mathbf{x}_*)$. Note, in general that $\hat{\pi}(\mathbf{x}_*) \neq \bar{\pi}(\mathbf{x}_*)$
$\mathbb{R}$    the real numbers
$R_L(f)$ or $\tilde{R}_L(l|\mathbf{x}_*)$    expected loss for predicting $l$, averaged w.r.t. the model’s pred. distr. at $\mathbf{x}_*$
$\mathcal{R}_c$    decision region for class $c$
$S(\mathbf{s})$    power spectrum
$\sigma(z)$    any sigmoid function, e.g. logistic $\lambda(z)$, cumulative Gaussian $\Phi(z)$, etc.
$\sigma_f^2$    variance of the (noise free) signal
$\sigma_n^2$    noise variance
$\boldsymbol{\theta}$    vector of hyperparameters (parameters of the covariance function)
tr($A$)    trace of (square) matrix $A$
$\mathbb{T}_l$    the circle with circumference $l$
$\mathbb{V}$ or $\mathbb{V}_{q(x)}[z(x)]$    variance; variance of $z(x)$ when $x \sim q(x)$
$\mathcal{X}$    input space and also the index set for the stochastic process
$X$    $D \times n$ matrix of the training inputs $\{\mathbf{x}_i\}_{i=1}^{n}$: the design matrix
$X_*$    matrix of test inputs
$\mathbf{x}_i$    the $i$th training input
$x_{di}$    the $d$th coordinate of the $i$th training input $\mathbf{x}_i$
$\mathbb{Z}$    the integers $\ldots, -2, -1, 0, 1, 2, \ldots$
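To make a few of these symbols concrete, the following is a minimal NumPy sketch of the left matrix divide, the Cholesky decomposition, the noisy covariance matrix for independent homoscedastic noise, and the two sigmoid functions. It is an illustration under stated assumptions rather than code from the book: the squared-exponential form of `sq_exp_kernel`, the random data, and all variable names are hypothetical choices made only so that the example runs.

```python
import math

import numpy as np


def sq_exp_kernel(A, B, length_scale=1.0, signal_var=1.0):
    """Hypothetical covariance function k(x, x'); inputs are stored as columns (D x n)."""
    sq_dist = (
        np.sum(A**2, axis=0)[:, None]
        + np.sum(B**2, axis=0)[None, :]
        - 2.0 * A.T @ B
    )
    return signal_var * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)


def logistic(z):
    """Logistic function: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))


def cumulative_gaussian(z):
    """Cumulative unit Gaussian, written via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


rng = np.random.default_rng(0)
D, n, n_star = 2, 5, 3                 # input dimension, training cases, test cases
X = rng.normal(size=(D, n))            # D x n design matrix (assumed synthetic data)
X_star = rng.normal(size=(D, n_star))  # matrix of test inputs
y = rng.normal(size=n)                 # noisy targets (assumed synthetic data)
noise_var = 0.1                        # noise variance, chosen arbitrarily

K_f = sq_exp_kernel(X, X)              # n x n covariance of noise-free f values
K_star = sq_exp_kernel(X, X_star)      # n x n_* train/test covariance
K_y = K_f + noise_var * np.eye(n)      # covariance of the noisy y values

L = np.linalg.cholesky(K_y)            # lower triangular factor with L L^T = K_y
# Left matrix divide K_y \ y via two solves against the Cholesky factor.
alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
```

Solving through the Cholesky factor rather than forming an explicit inverse is a common choice for symmetric positive definite matrices such as the noisy covariance matrix; a dedicated triangular solver could replace the generic `np.linalg.solve` calls.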

Chapter 1

Introduction

In this book we will be concerned with supervised learning, which is the problem of learning input-output mappings from empirical data (the training dataset). Depending on the characteristics of the output, this problem is known as either regression, for continuous outputs, or classification, when outputs are discrete.

A well known example is the classification of images of handwritten digits. The training set consists of small digitized images, together with a classification from 0, ..., 9, normally provided by a human. The goal is to learn a mapping from image to classification label, which can then be used on new, unseen images. Supervised learning is an attractive way to attempt to tackle this problem, since it is not easy to specify accurately the characteristics of, say, the handwritten digit 4.

An example of a regression problem can be found in robotics, where we wish to learn the inverse dynamics of a robot arm. Here the task is to map from the state of the arm (given by the positions, velocities and accelerations of the joints) to the corresponding torques on the joints. Such a model can then be used to compute the torques needed to move the arm along a given trajectory. Another example would be in a chemical plant, where we might wish to predict the yield as a function of process parameters such as temperature, pressure, amount of catalyst etc.

In general we denote the input as $\mathbf{x}$, and the output (or target) as $y$. The input is usually represented as a vector $\mathbf{x}$ as there are in general many input variables; in the handwritten digit recognition example one may have a 256-dimensional input obtained from a raster scan of a 16 × 16 image, and in the robot arm example there are three input measurements for each joint in the arm. The target $y$ may either be continuous (as in the regression case) or discrete (as in the classification case). We have a dataset $\mathcal{D}$ of $n$ observations, $\mathcal{D} = \{(\mathbf{x}_i, y_i) \mid i = 1, \ldots, n\}$.
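As a rough illustration of this notation, the sketch below builds a small dataset of (input, target) pairs from synthetic 16 × 16 images and assembles a robot arm input from three measurements per joint. The random images and labels, the helper `raster_scan`, and the choice of `n_joints = 7` are assumptions made for illustration only; they are not taken from the book.

```python
import numpy as np

rng = np.random.default_rng(0)


def raster_scan(image):
    """Flatten a 2-D image row by row into a single input vector."""
    return np.asarray(image).reshape(-1)


# Classification: n synthetic 16 x 16 "images" with labels from 0, ..., 9.
n = 10
images = rng.random((n, 16, 16))       # stand-ins for digitized images
labels = rng.integers(0, 10, size=n)   # stand-ins for human-provided labels

# The dataset as a list of (x_i, y_i) pairs, i = 1, ..., n.
dataset = [(raster_scan(img), int(lab)) for img, lab in zip(images, labels)]

# Design matrix with one training input per column (D x n, here 256 x 10).
X = np.column_stack([x for x, _ in dataset])

# Regression: arm state with three measurements per joint.
n_joints = 7                               # hypothetical number of joints
state = rng.normal(size=(3, n_joints))     # rows: positions, velocities, accelerations
x_arm = state.reshape(-1)                  # input vector of length 3 * n_joints
```

Storing one input per column follows the D × n convention for the design matrix given in the notation; many numerical libraries instead expect one input per row, in which case the transpose would be used.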

Given this training data we wish to make predictions for new inputs $\mathbf{x}_*$ that we have not seen in the training set. Thus it is clear that the problem at hand is inductive; we need to move from the finite training data $\mathcal{D}$ to a
