The layer-to-layer conditionals associated with the RBM factorize like in (1) and give rise to
$P(v_k = 1 | h) = \mathrm{sigm}(b_k + \sum_j W_{jk} h_j)$ and $Q(h_j = 1 | v) = \mathrm{sigm}(c_j + \sum_k W_{jk} v_k)$.
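These two factorized conditionals are all that is needed for block Gibbs sampling. As an illustration only (the names are ours, not the paper's), a minimal NumPy sketch of the two expressions above, assuming a weight matrix W of shape (hidden, visible) with entry W[j, k] = $W_{jk}$, a visible bias vector b and a hidden bias vector c:

    import numpy as np

    def sigm(x):
        # Logistic sigmoid: sigm(x) = 1 / (1 + exp(-x))
        return 1.0 / (1.0 + np.exp(-x))

    def p_v_given_h(h, W, b):
        # P(v_k = 1 | h) = sigm(b_k + sum_j W_jk h_j), computed for all k at once
        return sigm(b + W.T @ h)

    def q_h_given_v(v, W, c):
        # Q(h_j = 1 | v) = sigm(c_j + sum_k W_jk v_k), computed for all j at once
        return sigm(c + W @ v)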
2.2 Gibbs Markov chain and log-likelihood gradient in an RBM
To obtain an estimator of the gradient of the log-likelihood of an RBM, we consider a Gibbs Markov
chain on the (visible units, hidden units) pair of variables. Gibbs sampling from an RBM proceeds by
sampling $h$ given $v$, then $v$ given $h$, etc. Denote by $v_t$ the $t$-th $v$ sample from that chain, starting at
$t = 0$ with $v_0$, the “input observation” for the RBM. Therefore, $(v_k, h_k)$ for $k \to \infty$ is a sample from
the joint $P(v, h)$. The log-likelihood of a value $v_0$ under the model of the RBM is
\[
\log P(v_0) \;=\; \log \sum_h P(v_0, h)
\;=\; \log \sum_h e^{-\mathrm{energy}(v_0, h)} \;-\; \log \sum_{v, h} e^{-\mathrm{energy}(v, h)}
\]
and its gradient with respect to $\theta = (W, b, c)$ is
\[
\frac{\partial \log P(v_0)}{\partial \theta}
\;=\; -\sum_{h_0} Q(h_0 | v_0)\, \frac{\partial\, \mathrm{energy}(v_0, h_0)}{\partial \theta}
\;+\; \sum_{v_k, h_k} P(v_k, h_k)\, \frac{\partial\, \mathrm{energy}(v_k, h_k)}{\partial \theta}
\]
for $k \to \infty$. An unbiased sample is
\[
-\frac{\partial\, \mathrm{energy}(v_0, h_0)}{\partial \theta}
\;+\; E_{h_k}\!\left[ \frac{\partial\, \mathrm{energy}(v_k, h_k)}{\partial \theta} \,\Big|\, v_k \right],
\]
where $h_0$ is a sample from $Q(h_0 | v_0)$ and $(v_k, h_k)$ is a sample of the Markov chain, and the
expectation can be easily computed thanks to $P(h_k | v_k)$ factorizing.
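For instance, for a weight $W_{ji}$ (with $j$ indexing hidden and $i$ visible units) and assuming the binomial-unit energy $\mathrm{energy}(v, h) = -\sum_{j,i} W_{ji} h_j v_i - \sum_i b_i v_i - \sum_j c_j h_j$, which is consistent with the conditionals given above (this form is our restatement, not a quotation of the paper), we have $\partial\, \mathrm{energy}(v_k, h_k)/\partial W_{ji} = -h_{k,j}\, v_{k,i}$, writing $v_{k,i}$ and $h_{k,j}$ for the components of the samples $v_k$ and $h_k$. The conditional expectation then reduces to
\[
E_{h_k}\!\left[ \frac{\partial\, \mathrm{energy}(v_k, h_k)}{\partial W_{ji}} \,\Big|\, v_k \right]
= -\, Q(h_j = 1 | v_k)\; v_{k,i},
\]
so no sampling of $h_k$ is required for that term.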
The idea of the Contrastive Divergence
algorithm (Hinton, 2002) is to take $k$ small (typically $k = 1$). A pseudo-code for Contrastive Di-
vergence training (with $k = 1$) of an RBM with binomial input and hidden units is presented in the
Appendix (Algorithm RBMupdate($x, \epsilon, W, b, c$)). This procedure is called repeatedly with $v_0 = x$
sampled from the training distribution for the RBM. To decide when to stop one may use a proxy for
the training criterion, such as the reconstruction error $-\log P(v_1 = x | v_0 = x)$.
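Since the Appendix pseudo-code is not reproduced in this section, here is a minimal NumPy sketch of one CD-1 update for binomial units, for illustration only: the function and variable names (rbm_update_cd1, eps, q0, v1, ...) are ours, not the paper's, and the Appendix algorithm remains the reference.

    import numpy as np
    rng = np.random.default_rng(0)

    def sigm(x):
        return 1.0 / (1.0 + np.exp(-x))

    def rbm_update_cd1(x, eps, W, b, c):
        # Positive phase: Q(h | v0 = x), and a binary sample h0 from it.
        q0 = sigm(c + W @ x)                    # Q(h_j = 1 | v0)
        h0 = (rng.random(q0.shape) < q0) * 1.0
        # Negative phase: one Gibbs step, v1 ~ P(v | h0), then Q(h | v1).
        p1 = sigm(b + W.T @ h0)                 # P(v_k = 1 | h0)
        v1 = (rng.random(p1.shape) < p1) * 1.0
        q1 = sigm(c + W @ v1)                   # Q(h_j = 1 | v1)
        # Stochastic gradient step on log P(v0): positive minus negative statistics.
        W += eps * (np.outer(q0, x) - np.outer(q1, v1))
        b += eps * (x - v1)
        c += eps * (q0 - q1)
        return W, b, c

Whether binary samples or the corresponding probabilities are used in the negative-phase statistics is an implementation choice; the sketch uses probabilities for the hidden statistics and a sampled $v_1$.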
2.3 Greedy layer-wise training of a DBN
A greedy layer-wise training algorithm was proposed (Hinton et al., 2006) to train a DBN one layer at
a time. One first trains an RBM that takes the empirical data as input and models it. Denote $Q(g^1 | g^0)$
the posterior over $g^1$ associated with that trained RBM (we recall that $g^0 = x$ with $x$ the observed
input). This gives rise to an “empirical” distribution $\widehat{p}^1$ over the first layer $g^1$, when $g^0$ is sampled
from the data empirical distribution $\widehat{p}$: we have
\[
\widehat{p}^1(g^1) = \sum_{g^0} \widehat{p}(g^0)\, Q(g^1 | g^0).
\]
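Concretely, $\widehat{p}^1$ is never tabulated; one only needs samples of $g^1$, obtained by sampling the hidden units of the trained first RBM given each training example. A minimal sketch of that step, assuming binary training data X stored one example per row and the first RBM's parameters (W1, c1) in the shapes used in the earlier snippets (names are ours):

    import numpy as np
    rng = np.random.default_rng(0)

    def sigm(x):
        return 1.0 / (1.0 + np.exp(-x))

    def sample_g1(X, W1, c1):
        # For each training example x (row of X), draw g^1 ~ Q(g^1 | g^0 = x)
        # in the first trained RBM; these samples follow the distribution p-hat^1.
        Q1 = sigm(c1 + X @ W1.T)                # Q(g^1_j = 1 | g^0), row-wise
        return (rng.random(Q1.shape) < Q1) * 1.0

The second RBM is then trained on these samples of $g^1$ exactly as the first one was trained on the data.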
Note that a 1-level DBN is an RBM. The basic idea of the greedy layer-wise strategy is that after
training the top-level RBM of an $\ell$-level DBN, one changes the interpretation of the RBM parameters
to insert them in an $(\ell + 1)$-level DBN: the distribution $P(g^{\ell-1} | g^\ell)$ from the RBM associated with
layers $\ell - 1$ and $\ell$ is kept as part of the DBN generative model. In the RBM between layers $\ell - 1$
and $\ell$, $P(g^\ell)$ is defined in terms of the parameters of that RBM, whereas in the DBN $P(g^\ell)$ is defined
in terms of the parameters of the upper layers. Consequently, $Q(g^\ell | g^{\ell-1})$ of the RBM does not
correspond to $P(g^\ell | g^{\ell-1})$ in the DBN, except when that RBM is the top layer of the DBN. However,
we use $Q(g^\ell | g^{\ell-1})$ of the RBM as an approximation of the posterior $P(g^\ell | g^{\ell-1})$ for the DBN.
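The reason the two differ below the top level can be seen from Bayes' rule: in the DBN,
\[
P(g^\ell | g^{\ell-1}) \;\propto\; P(g^{\ell-1} | g^\ell)\, P(g^\ell),
\]
and while $P(g^{\ell-1} | g^\ell)$ is shared with the RBM, the prior $P(g^\ell)$ is given by the upper layers of the DBN rather than by the RBM's own marginal, so the posterior changes.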
The samples of $g^{\ell-1}$, with empirical distribution $\widehat{p}^{\ell-1}$, are converted stochastically into samples of $g^\ell$
with distribution $\widehat{p}^\ell$ through $\widehat{p}^\ell(g^\ell) = \sum_{g^{\ell-1}} \widehat{p}^{\ell-1}(g^{\ell-1})\, Q(g^\ell | g^{\ell-1})$. Although $\widehat{p}^\ell$ cannot be rep-
resented explicitly it is easy to sample unbiasedly from it: pick a training example and propagate it
stochastically through the $Q(g^i | g^{i-1})$ at each level. As a nice side benefit, one obtains an approxi-
mation of the posterior for all the hidden variables in the DBN, at all levels, given an input $g^0 = x$.
Mean-field propagation (see below) gives a fast deterministic approximation of posteriors $P(g^\ell | x)$.
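Both options (stochastic propagation, to sample from $\widehat{p}^\ell$, and mean-field propagation, to approximate the posteriors deterministically) can be sketched with the same upward pass, under the same assumptions and naming conventions as the earlier snippets, with one (W, c) pair per trained RBM listed bottom-up:

    import numpy as np
    rng = np.random.default_rng(0)

    def sigm(x):
        return 1.0 / (1.0 + np.exp(-x))

    def propagate_up(x, layers, stochastic=True):
        # layers: list of (W, c) pairs, one per RBM, from bottom to top.
        # Returns [g^1, ..., g^top]: binary samples of each Q(g^i | g^{i-1}) if
        # stochastic, otherwise the mean-field probabilities Q(g^i_j = 1 | g^{i-1}).
        g = x
        outputs = []
        for W, c in layers:
            q = sigm(c + W @ g)                 # Q(g^i_j = 1 | g^{i-1})
            g = (rng.random(q.shape) < q) * 1.0 if stochastic else q
            outputs.append(g)
        return outputs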
Note that if we consider all the layers of a DBN from level $i$ to the top, we have a smaller DBN,
which generates the marginal distribution $P(g^i)$ for the complete DBN. The motivation for the greedy
procedure is that a partial DBN with $\ell - i$ levels starting above level $i$ may provide a better model for
$P(g^i)$ than does the RBM initially associated with level $i$ itself.
The above greedy procedure is justified using a variational bound (Hinton et al., 2006). As a con-
sequence of that bound, when inserting an additional layer, if it is initialized appropriately and has
enough units, one can guarantee that initial improvements on the training criterion for the next layer