S under the MRF distribution, i.e., training corresponds to finding the parameters θ that maximize
the likelihood given the training data. The likelihood $\mathcal{L} : \Theta \to \mathbb{R}$ of an MRF given the data set S maps
parameters θ from a parameter space Θ to $\mathcal{L}(\theta \,|\, S) = \prod_{i=1}^{\ell} p(\boldsymbol{x}_i \,|\, \theta)$. Maximizing the likelihood is the
same as maximizing the log-likelihood given by
\begin{equation}
\ln \mathcal{L}(\theta \,|\, S) = \ln \prod_{i=1}^{\ell} p(\boldsymbol{x}_i \,|\, \theta) = \sum_{i=1}^{\ell} \ln p(\boldsymbol{x}_i \,|\, \theta) \; . \tag{4}
\end{equation}
For the Gibbs distribution of an MRF, it is in general not possible to find the maximum likelihood
parameters analytically. Thus, numerical approximations have to be used, for example gradient ascent,
which is described below.
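For concreteness, the following NumPy sketch (a toy example of ours, not part of the original text) evaluates the log-likelihood (4) for a small Gibbs distribution $p(\boldsymbol{x} \,|\, \theta) \propto \exp(\theta^\top \boldsymbol{x})$ over binary vectors; with only a handful of states the partition function can be computed exactly, which is precisely what becomes intractable for realistic MRFs:

```python
import itertools
import numpy as np

def log_p(x, theta):
    """Log-probability of a binary vector x under the toy Gibbs
    distribution p(x | theta) = exp(theta . x) / Z(theta)."""
    states = np.array(list(itertools.product([0, 1], repeat=len(theta))))
    log_z = np.log(np.sum(np.exp(states @ theta)))  # exact partition function
    return float(x @ theta - log_z)

def log_likelihood(S, theta):
    """Log-likelihood (4): the sum of log-probabilities of the samples."""
    return sum(log_p(x, theta) for x in S)

theta = np.array([0.5, -0.2, 0.1])               # toy parameters
S = np.array([[1, 0, 1], [1, 1, 0], [0, 0, 1]])  # toy "training set"
print(log_likelihood(S, theta))
```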
Maximizing the likelihood corresponds to minimizing the distance between the unknown distribution
q underlying S and the distribution p of the MRF in terms of the Kullback–Leibler divergence
(KL divergence), which for a finite state space Ω is given by
\begin{equation}
\mathrm{KL}(q \,\|\, p) = \sum_{x \in \Omega} q(x) \ln \frac{q(x)}{p(x)} = \sum_{x \in \Omega} q(x) \ln q(x) - \sum_{x \in \Omega} q(x) \ln p(x) \; . \tag{5}
\end{equation}
The KL divergence is a (non-symmetric) measure of the difference between two distributions. It is
always non-negative, and it is zero if and only if the distributions are the same. As becomes clear
from equation (5), the KL divergence can be expressed as the difference between the entropy of q and a
second term. Only the latter depends on the parameters subject to optimization. Approximating the
expectation over q in this term by the training samples from q results in the log-likelihood (up to the
constant factor 1/ℓ). Therefore, maximizing the log-likelihood corresponds to minimizing the KL divergence.
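The decomposition in equation (5) is easy to verify numerically. The sketch below (with toy probability vectors of our choosing) computes both sides and illustrates how an average over samples drawn from q approximates the parameter-dependent second term:

```python
import numpy as np

# Toy probability vectors over a finite state space Omega (three states).
q = np.array([0.5, 0.25, 0.25])   # unknown distribution underlying the data
p = np.array([0.4, 0.4, 0.2])     # model distribution

kl = np.sum(q * np.log(q / p))          # definition in (5)
neg_entropy = np.sum(q * np.log(q))     # first term, independent of theta
cross_term = np.sum(q * np.log(p))      # second term, depends on theta via p
assert np.isclose(kl, neg_entropy - cross_term)

# Replacing the expectation over q by an average over samples drawn from q
# approximates the second term by the mean log-likelihood of the sample.
rng = np.random.default_rng(0)
samples = rng.choice(len(q), size=100_000, p=q)
print(np.mean(np.log(p[samples])), cross_term)  # close to each other
```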
Optimization by gradient ascent. If it is not possible to find parameters maximizing the likelihood
analytically, the usual way to find them is by gradient ascent on the log-likelihood. This corresponds
to iteratively updating the parameters $\theta^{(t)}$ to $\theta^{(t+1)}$ based on the gradient of the log-likelihood. Let
us consider the following update rule:
\begin{equation}
\theta^{(t+1)} = \theta^{(t)} + \underbrace{\eta \, \frac{\partial}{\partial \theta^{(t)}} \ln \mathcal{L}\bigl(\theta^{(t)} \,\big|\, S\bigr) - \lambda \theta^{(t)} + \nu \Delta\theta^{(t-1)}}_{=\, \Delta\theta^{(t)}} \tag{6}
\end{equation}
If the constants $\lambda \in \mathbb{R}_0^+$ and $\nu \in \mathbb{R}_0^+$ are set to zero, we have vanilla gradient ascent. The constant
$\eta \in \mathbb{R}^+$ is the learning rate. As we will see later, it can be desirable to strive for models with weights
having small absolute values. To achieve this, we can optimize an objective function consisting of the
log-likelihood minus half of the squared norm of the parameters, $\|\theta\|^2/2$, weighted by λ. This method is called
weight decay, and it penalizes weights with large magnitude. It leads to the $-\lambda\theta^{(t)}$ term in our update
rule (6). In a Bayesian framework, weight decay can be interpreted as assuming a zero-mean Gaussian
prior on the parameters. The update rule can be further extended by a momentum term, $\Delta\theta^{(t-1)}$,
weighted by the parameter ν. Using a momentum term helps against oscillations in the iterative update
procedure and can speed up the learning process, as is seen in feed-forward neural network training
[43].
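Translated into code, one iteration of (6) is a one-liner. The sketch below is our own illustration: the gradient of an MRF log-likelihood (the subject of the following sections) is replaced by a trivially computable Gaussian-mean example, so that the roles of η, λ, and ν can be seen in isolation:

```python
import numpy as np

def update(theta, delta_prev, grad_loglik, eta, lam=0.0, nu=0.0):
    """One step of update rule (6): gradient ascent with optional
    weight decay (lam > 0) and momentum (nu > 0)."""
    delta = eta * grad_loglik(theta) - lam * theta + nu * delta_prev
    return theta + delta, delta

# Toy usage: fit the mean of a unit-variance Gaussian, where the gradient
# of the log-likelihood is sum_i (x_i - theta).
rng = np.random.default_rng(0)
S = rng.normal(loc=[1.0, -2.0], scale=1.0, size=(100, 2))
grad = lambda theta: np.sum(S - theta, axis=0)

theta, delta = np.zeros(2), np.zeros(2)
for _ in range(200):
    theta, delta = update(theta, delta, grad, eta=0.005, lam=0.01, nu=0.5)
print(theta)  # close to the sample mean, shrunk slightly by weight decay
```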
Introducing latent variables. Suppose we want to model an m-dimensional unknown probability
distribution q (e.g., each component of a sample corresponds to one of m pixels of an image). Typically,
not all the variables $X = (X_v)_{v \in V}$ in an MRF need to correspond to some observed component,
and the number of nodes is larger than m. We split $X$ into visible (or observed) variables $V = (V_1, \ldots, V_m)$ corresponding to the components of the observations and latent (or hidden) variables $H =$