from. A directed model generates data by first choosing the
states of the latent variables from a prior distribution and then
choosing the states of the observable variables from their condi-
tional distributions given the latent states. Examples of directed
models with one layer of latent variables are factor analysis, in
which the latent variables are drawn from an isotropic
Gaussian, and GMMs, in which they are drawn from a discrete
distribution. An undirected model has a very different way of
generating data. Instead of using one set of parameters to define
a prior distribution over the latent variables and a separate set
of parameters to define the conditional distributions of the observable variables given the values of the latent variables, an undirected model uses a single set of parameters, W, to define the joint probability of a vector of values of the observable variables, v, and a vector of values of the latent variables, h, via an energy function, E(v, h; W),
$$p(\mathbf{v}, \mathbf{h}; \mathbf{W}) = \frac{1}{Z}\, e^{-E(\mathbf{v}, \mathbf{h}; \mathbf{W})}, \qquad Z = \sum_{\mathbf{v}', \mathbf{h}'} e^{-E(\mathbf{v}', \mathbf{h}'; \mathbf{W})}, \qquad (5)$$
where Z is called the partition function.
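To make (5) concrete, here is a small sketch that enumerates every joint configuration of a handful of binary variables under a hypothetical bilinear energy function and normalizes by the partition function; the energy function, the variable counts, and the random parameters are assumptions made purely for illustration.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 3, 2                              # tiny sizes so Z can be enumerated exactly
W = 0.1 * rng.standard_normal((n_v, n_h))    # the single parameter set shared by the joint

def energy(v, h):
    # hypothetical bilinear energy; any E(v, h; W) could play this role in (5)
    return -(v @ W @ h)

configs = [(np.array(v), np.array(h))
           for v in itertools.product([0, 1], repeat=n_v)
           for h in itertools.product([0, 1], repeat=n_h)]
Z = sum(np.exp(-energy(v, h)) for v, h in configs)   # partition function

def p_joint(v, h):
    # equation (5): p(v, h; W) = exp(-E(v, h; W)) / Z
    return np.exp(-energy(v, h)) / Z

print(sum(p_joint(v, h) for v, h in configs))   # the joint probabilities sum to 1
```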
If many different latent variables interact nonlinearly to
generate each data vector, it is difficult to infer the states of
the latent variables from the observed data in a directed
model because of a phenomenon known as “explaining away”
[19]. In undirected models, however, inference is easy pro-
vided the latent variables do not have edges linking them.
Such a restricted class of undirected models is ideal for lay-
erwise pretraining because each layer will have an easy infer-
ence procedure.
We start by describing an approximate learning algorithm
for a restricted Boltzmann machine (RBM) which consists of a
layer of stochastic binary “visible” units that represent binary
input data connected to a layer of stochastic binary hidden units
that learn to model significant nonindependencies between the
visible units [20]. There are undirected connections between
visible and hidden units but no visible-visible or hidden-hidden
connections. An RBM is a type of Markov random field (MRF)
but differs from most MRFs in several ways: it has a bipartite
connectivity graph, it does not usually share weights between
different units, and a subset of the variables are unobserved,
even during training.
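As a sketch of this bipartite parameterization (the layer sizes and the small random initialization are illustrative assumptions, not values from the text), an RBM can be represented by a single visible-to-hidden weight matrix plus one bias vector per layer, with no visible-visible or hidden-hidden weights:

```python
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden = 6, 4                   # illustrative layer sizes

# bipartite connectivity: every visible unit connects to every hidden unit,
# but there are no visible-visible or hidden-hidden connections
W = 0.01 * rng.standard_normal((n_visible, n_hidden))   # weights w_ij
a = np.zeros(n_visible)                                 # visible biases
b = np.zeros(n_hidden)                                  # hidden biases
```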
AN EFFICIENT LEARNING PROCEDURE FOR RBMs
A joint configuration, (v, h), of the visible and hidden units of an
RBM has an energy given by
$$E(\mathbf{v}, \mathbf{h}) = -\sum_{i \in \text{visible}} a_i v_i \;-\; \sum_{j \in \text{hidden}} b_j h_j \;-\; \sum_{i,j} v_i h_j w_{ij}, \qquad (6)$$
where $v_i$ and $h_j$ are the binary states of visible unit $i$ and hidden unit $j$, $a_i$ and $b_j$ are their biases, and $w_{ij}$ is the weight between
them. The network assigns a probability to every possible pair of
a visible and a hidden vector via this energy function as in (5)
and the probability that the network assigns to a visible vector,
v, is given by summing over all possible hidden vectors
$$p(\mathbf{v}) = \frac{1}{Z} \sum_{\mathbf{h}} e^{-E(\mathbf{v}, \mathbf{h})}. \qquad (7)$$
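A minimal sketch of (6) and (7), assuming an RBM small enough that the partition function can be computed by exhaustive enumeration (the sizes and parameter values below are illustrative only):

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 4, 3                              # small enough to enumerate exactly
W = 0.1 * rng.standard_normal((n_v, n_h))    # weights w_ij
a = np.zeros(n_v)                            # visible biases a_i
b = np.zeros(n_h)                            # hidden biases b_j

def energy(v, h):
    # equation (6): E(v, h) = -sum_i a_i v_i - sum_j b_j h_j - sum_{i,j} v_i h_j w_ij
    return -(a @ v) - (b @ h) - (v @ W @ h)

all_v = [np.array(c) for c in itertools.product([0, 1], repeat=n_v)]
all_h = [np.array(c) for c in itertools.product([0, 1], repeat=n_h)]
Z = sum(np.exp(-energy(v, h)) for v in all_v for h in all_h)   # partition function

def p_visible(v):
    # equation (7): marginalize the joint distribution over all hidden vectors
    return sum(np.exp(-energy(v, h)) for h in all_h) / Z

print(sum(p_visible(v) for v in all_v))   # the probabilities of all visible vectors sum to 1
```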
The derivative of the log probability of a training set with
respect to a weight is surprisingly simple
$$\frac{1}{N} \sum_{n=1}^{N} \frac{\partial \log p(\mathbf{v}^{n})}{\partial w_{ij}} = \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}}, \qquad (8)$$
where N is the size of the training set and the angle brackets are used to denote expectations under the distribution specified by the subscript that follows. The simple derivative in (8) leads to a very simple learning rule for performing stochastic steepest ascent in the log probability of the training data
$$\Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{model}} \right), \qquad (9)$$
where
$\epsilon$ is a learning rate.
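As a minimal illustration of (9), assuming the two expectations in (8) have already been estimated (the stand-in matrices, shapes, and learning rate below are placeholders, not values from the text):

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 6, 4

# stand-ins for the two expectations in (8); each is an n_v x n_h matrix of <v_i h_j> values
vh_data = rng.random((n_v, n_h))    # would be estimated from the training data
vh_model = rng.random((n_v, n_h))   # would be estimated from the model's own distribution

epsilon = 0.1                              # learning rate (illustrative)
delta_W = epsilon * (vh_data - vh_model)   # equation (9): increment for every weight w_ij
```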
The absence of direct connections between hidden units in an RBM makes it very easy to get an unbiased sample of $\langle v_i h_j \rangle_{\text{data}}$. Given a randomly selected training case, v, the binary state, $h_j$, of each hidden unit, $j$, is set to one with probability
$$p(h_j = 1 \mid \mathbf{v}) = \operatorname{logistic}\!\left(b_j + \sum_i v_i w_{ij}\right) \qquad (10)$$
and $\langle v_i h_j \rangle$
is then an unbiased sample. The absence of direct con-
nections between visible units in an RBM makes it very easy to
get an unbiased sample of the state of a visible unit, given a hid-
den vector
$$p(v_i = 1 \mid \mathbf{h}) = \operatorname{logistic}\!\left(a_i + \sum_j h_j w_{ij}\right). \qquad (11)$$
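Equations (10) and (11) translate directly into a sampling sketch; the layer sizes, random initialization, and stand-in training case below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 6, 4                              # illustrative layer sizes
W = 0.01 * rng.standard_normal((n_v, n_h))   # weights w_ij
a = np.zeros(n_v)                            # visible biases a_i
b = np.zeros(n_h)                            # hidden biases b_j

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_hidden(v):
    # equation (10): p(h_j = 1 | v) = logistic(b_j + sum_i v_i w_ij)
    p = logistic(b + v @ W)
    return (rng.random(n_h) < p).astype(float)

def sample_visible(h):
    # equation (11): p(v_i = 1 | h) = logistic(a_i + sum_j h_j w_ij)
    p = logistic(a + W @ h)
    return (rng.random(n_v) < p).astype(float)

v = rng.integers(0, 2, size=n_v).astype(float)   # a stand-in training case
h = sample_hidden(v)
vh_data_sample = np.outer(v, h)                  # an unbiased sample of <v_i h_j>_data
```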
Getting an unbiased sample of
$\langle v_i h_j \rangle_{\text{model}}$, however, is
much more difficult. It can be done by starting at any random
state of the visible units and performing alternating Gibbs sam-
pling for a very long time. Alternating Gibbs sampling consists
of updating all of the hidden units in parallel using (10) fol-
lowed by updating all of the visible units in parallel using (11).
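A sketch of such an alternating Gibbs chain, reusing conditionals of the form (10) and (11); the chain length, sizes, and initialization are illustrative assumptions, and in practice the chain would have to run long enough to approach the model's equilibrium distribution:

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 6, 4
W = 0.01 * rng.standard_normal((n_v, n_h))
a, b = np.zeros(n_v), np.zeros(n_h)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

v = rng.integers(0, 2, size=n_v).astype(float)   # start from a random visible state
for _ in range(10000):                           # "a very long time" (illustrative length)
    h = (rng.random(n_h) < logistic(b + v @ W)).astype(float)   # update all hidden units, as in (10)
    v = (rng.random(n_v) < logistic(a + W @ h)).astype(float)   # update all visible units, as in (11)

vh_model_sample = np.outer(v, h)   # one (approximate) sample contributing to <v_i h_j>_model
```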
A much faster learning procedure called contrastive diver-
gence (CD) was proposed in [20]. This starts by setting the states
of the visible units to a training vector. Then the binary states of
the hidden units are all computed in parallel using (10). Once
binary states have been chosen for the hidden units, a “recon-
struction” is produced by setting each
$v_i$ to one with a probability given by (11). Finally, the states of the hidden units are
updated again. The change in a weight is then given by
$$\Delta w_{ij} = \epsilon \left( \langle v_i h_j \rangle_{\text{data}} - \langle v_i h_j \rangle_{\text{recon}} \right). \qquad (12)$$
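Putting the pieces together, here is a sketch of one CD update of the form (12) on a small batch; the batch contents, layer sizes, and learning rate are illustrative assumptions, not values from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n_v, n_h = 6, 4
W = 0.01 * rng.standard_normal((n_v, n_h))   # weights w_ij
a, b = np.zeros(n_v), np.zeros(n_h)          # visible and hidden biases
epsilon = 0.1                                # learning rate (illustrative)

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p):
    return (rng.random(p.shape) < p).astype(float)

V0 = rng.integers(0, 2, size=(20, n_v)).astype(float)  # visible units set to (stand-in) training vectors

H0 = sample(logistic(V0 @ W + b))        # binary hidden states from (10)
V1 = sample(logistic(H0 @ W.T + a))      # "reconstruction" of the visible units from (11)
H1 = sample(logistic(V1 @ W + b))        # hidden states updated again from (10)

# equation (12): data statistics minus reconstruction statistics, averaged over the batch
vh_data = V0.T @ H0 / len(V0)
vh_recon = V1.T @ H1 / len(V1)
W += epsilon * (vh_data - vh_recon)
```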