
Variational Relevance Vector Machines
Christopher M. Bishop Michael E. Tipping
Microsoft Research
7 J. J. Thomson Avenue, Cambridge CB3 0FB, U.K.
{cmbishop,mtipping}@microsoft.com
http://research.microsoft.com/{~cmbishop,~mtipping}
In Uncertainty in Artificial Intelligence 2000, C. Boutilier and M. Goldszmidt (Eds), 46–53, Morgan Kaufmann.
Abstract
The Support Vector Machine (SVM) of Vapnik [9] has become widely established as one of the leading approaches to pattern recognition and machine learning. It expresses predictions in terms of a linear combination of kernel functions centred on a subset of the training data, known as support vectors.

Despite its widespread success, the SVM suffers from some important limitations, one of the most significant being that it makes point predictions rather than generating predictive distributions. Recently Tipping [8] has formulated the Relevance Vector Machine (RVM), a probabilistic model whose functional form is equivalent to the SVM. It achieves comparable recognition accuracy to the SVM, yet provides a full predictive distribution, and also requires substantially fewer kernel functions.

The original treatment of the RVM relied on the use of type II maximum likelihood (the ‘evidence framework’) to provide point estimates of the hyperparameters which govern model sparsity. In this paper we show how the RVM can be formulated and solved within a completely Bayesian paradigm through the use of variational inference, thereby giving a posterior distribution over both parameters and hyperparameters. We demonstrate the practicality and performance of the variational RVM using both synthetic and real world examples.
1 RELEVANCE VECTORS
Many problems in machine learning fall under the heading of supervised learning, in which we are given a set of input vectors X = {x_n}_{n=1}^N together with corresponding target values T = {t_n}_{n=1}^N. The goal is to use this training data, together with any pertinent prior knowledge, to make predictions of t for new values of x. We can distinguish two distinct cases: regression, in which t is a continuous variable, and classification, in which t belongs to a discrete set.
Here we consider models in which the prediction y(x, w) is expressed as a linear combination of basis functions φ_m(x) of the form

    y(x, w) = Σ_{m=0}^{M} w_m φ_m(x) = w^T φ        (1)

where the {w_m} are the parameters of the model, and are generally called weights.
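As a concrete illustration of (1), the short Python sketch below evaluates y(x, w) for a given weight vector and a list of basis functions. The particular basis functions and function names are chosen purely for illustration and are not part of the original formulation.

```python
import numpy as np

def predict(x, w, basis_fns):
    """Evaluate y(x, w) = sum_m w_m * phi_m(x) = w^T phi for a single input x."""
    phi = np.array([phi_m(x) for phi_m in basis_fns])  # basis vector phi(x)
    return w @ phi

# Illustrative model with M = 2 (plus a constant phi_0 acting as a bias term).
basis_fns = [lambda x: 1.0,             # phi_0: bias
             lambda x: x,               # phi_1: linear term
             lambda x: np.exp(-x**2)]   # phi_2: a localized basis function
w = np.array([0.5, -1.0, 2.0])
print(predict(0.3, w, basis_fns))
```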
One of the most popular approaches to machine learn-
ing to emerge in recent years is the Support Vector Ma-
chine (SVM) of Vapnik [9]. The SVM uses a particular
specialization of (1) in which the basis functions take the form of kernel functions, one for each data point x_m in the training set, so that φ_m(x) = K(x, x_m), where K(·, ·) is the kernel function. The framework which we develop in this paper is much more general and applies to any model of the form (1). However, in order to facilitate direct comparisons with the SVM, we focus primarily on the use of kernels as the basis functions.
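To make the kernel specialization concrete, the following sketch builds the design matrix whose columns are kernel functions centred on the training points, together with a constant bias column for w_0. The Gaussian kernel and the inclusion of a bias column are assumptions made here for illustration; the framework itself places no restriction on the choice of kernel.

```python
import numpy as np

def gaussian_kernel(x, x_m, length_scale=1.0):
    """K(x, x_m) = exp(-||x - x_m||^2 / (2 * length_scale^2)) -- an illustrative choice."""
    return np.exp(-np.sum((x - x_m) ** 2) / (2.0 * length_scale ** 2))

def design_matrix(X, kernel=gaussian_kernel):
    """Phi[n, 0] = 1 (bias) and Phi[n, m + 1] = K(x_n, x_m) for training inputs X."""
    N = X.shape[0]
    Phi = np.ones((N, N + 1))
    for n in range(N):
        for m in range(N):
            Phi[n, m + 1] = kernel(X[n], X[m])
    return Phi

X = np.random.randn(5, 2)   # five 2-dimensional training inputs
Phi = design_matrix(X)      # shape (5, 6): one bias column plus one kernel per data point
```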
Point estimates for the weights are determined in the
SVM by optimization of a criterion which simultane-
ously attempts to fit the training data while at the
same time minimizing the ‘complexity’ of the function
y(x, w). The result is that some proportion of the
weights are set to zero, leading to a sparse model in
which predictions, governed by (1), depend only on a
subset of the kernel functions.
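The practical consequence of this sparsity is that prediction only touches the kernels attached to the retained data points (support vectors for the SVM, relevance vectors for the RVM). Below is a minimal sketch of such a sparse evaluation of (1); the tolerance, names, and kernel used are illustrative assumptions rather than part of either model's specification.

```python
import numpy as np

def sparse_predict(x_new, X_train, w, kernel, tol=1e-8):
    """Evaluate (1) using only the kernels whose weights are effectively non-zero.

    w[0] is treated as the bias weight w_0; w[m + 1] multiplies K(x_new, x_m).
    """
    active = np.flatnonzero(np.abs(w[1:]) > tol)   # indices of retained training points
    y = w[0]
    for m in active:
        y += w[m + 1] * kernel(x_new, X_train[m])
    return y

# Illustrative usage: most weights are zero, so only two kernels are evaluated.
rbf = lambda a, b: np.exp(-np.sum((a - b) ** 2))
X_train = np.random.randn(100, 2)
w = np.zeros(101)
w[0], w[4], w[57] = 0.1, 1.3, -0.7
print(sparse_predict(np.zeros(2), X_train, w, rbf))
```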