function f that makes predictions for all possible input values. To do this we
must make assumptions about the characteristics of the underlying function,
as otherwise any function which is consistent with the training data would be
equally valid. A wide variety of methods have been proposed to deal with the
supervised learning problem; here we describe two common approaches. The
first is to restrict the class of functions that we consider, for example by only
considering linear functions of the input. The second approach is (speaking
rather loosely) to give a prior probability to every possible function, where
higher probabilities are given to functions that we consider to be more likely, for
example because they are smoother than other functions.¹ The first approach
has an obvious problem in that we have to decide upon the richness of the class
of functions considered; if we are using a model based on a certain class of
functions (e.g. linear functions) and the target function is not well modelled by
this class, then the predictions will be poor. One may be tempted to increase the
flexibility of the class of functions, but this runs into the danger of overfitting,
where we can obtain a good fit to the training data, but perform badly when
making test predictions.
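Both failure modes are easy to reproduce numerically. The following is a minimal sketch, not taken from the text: it uses polynomial classes as a stand-in, with a degree-1 polynomial playing the role of the linear class above and a high-degree polynomial playing the role of an overly flexible class.

    import numpy as np

    rng = np.random.default_rng(1)

    def target(x):
        return np.sin(2 * np.pi * x)   # a smooth but non-linear target function

    # Ten noisy training observations and a dense grid of test inputs.
    x_train = rng.uniform(0, 1, 10)
    y_train = target(x_train) + 0.1 * rng.standard_normal(10)
    x_test = np.linspace(0, 1, 200)

    for degree in (1, 9):
        # Restrict the function class to polynomials of the given degree.
        coeffs = np.polyfit(x_train, y_train, degree)
        test_err = np.mean((np.polyval(coeffs, x_test) - target(x_test)) ** 2)
        print(degree, test_err)

    # The degree-1 class is too restrictive to capture the target (poor fit),
    # while the degree-9 class typically fits the ten training points almost
    # exactly yet predicts badly between them (overfitting).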
The second approach appears to have a serious problem, in that surely
there is an uncountably infinite set of possible functions, and how are we
going to compute with this set in finite time? This is where the Gaussian
process comes to our rescue. A Gaussian process is a generalization of the
Gaussian probability distribution. Whereas a probability distribution describes
random variables which are scalars or vectors (for multivariate distributions),
a stochastic process governs the properties of functions. Leaving mathematical
sophistication aside, one can loosely think of a function as a very long vector,
each entry in the vector specifying the function value f(x) at a particular input
x. It turns out that, although this idea is a little naïve, it is surprisingly close
to what we need. Indeed, the question of how we deal computationally with these
infinite dimensional objects has the most pleasant resolution imaginable: if you
ask only for the properties of the function at a finite number of points, then
inference in the Gaussian process will give you the same answer whether you
ignore the infinitely many other points or take them all into account!
And these answers are consistent with answers to any other finite queries you
may have. One of the main attractions of the Gaussian process framework is
precisely that it unites a sophisticated and consistent view with computational
tractability.
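To make this marginalization property concrete, here is a minimal numerical sketch, not taken from the text; it assumes a zero mean function and a squared-exponential covariance function as one common choice. Evaluating the process at a finite set of inputs yields an ordinary multivariate Gaussian, and adding further inputs merely enlarges the covariance matrix without changing the entries belonging to the original points.

    import numpy as np

    def sq_exp_cov(xa, xb, lengthscale=1.0):
        # Squared-exponential covariance k(x, x') = exp(-(x - x')^2 / (2 l^2)).
        d = xa[:, None] - xb[None, :]
        return np.exp(-0.5 * (d / lengthscale) ** 2)

    # At a finite set of query points the GP is just a multivariate Gaussian
    # with this covariance matrix.
    x = np.array([0.0, 0.5, 1.0])
    K = sq_exp_cov(x, x)

    # Add further points and build the larger joint covariance ...
    x_extra = np.array([2.0, 3.0, 4.0])
    x_all = np.concatenate([x, x_extra])
    K_all = sq_exp_cov(x_all, x_all)

    # ... marginalizing out the extra points amounts to dropping their rows
    # and columns, so the answers at the original points are unchanged.
    assert np.allclose(K_all[:3, :3], K)

    # Drawing a sample of f at the original points needs only the 3x3 matrix.
    rng = np.random.default_rng(0)
    f = rng.multivariate_normal(mean=np.zeros(len(x)), cov=K)
    print(f)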
It should come as no surprise that these ideas have been around for some
time, although they are perhaps not as well known as they might be. Indeed,
many models that are commonly employed in both machine learning and statistics
are in fact special cases of, or restricted kinds of, Gaussian processes. In this
volume, we aim to give a systematic and unified treatment of the area, showing
connections to related models.
¹ These two approaches may be regarded as imposing a restriction bias and a preference
bias respectively; see e.g. Mitchell [1997].