For the purpose of solving the above problems, let us introduce a recognition model $q_\phi(\mathbf{z}|\mathbf{x})$: an approximation to the intractable true posterior $p_\theta(\mathbf{z}|\mathbf{x})$. Note that in contrast with the approximate posterior in mean-field variational inference, it is not necessarily factorial and its parameters $\phi$ are not computed from some closed-form expectation. Instead, we'll introduce a method for learning the recognition model parameters $\phi$ jointly with the generative model parameters $\theta$.
From a coding theory perspective, the unobserved variables $\mathbf{z}$ have an interpretation as a latent representation or code. In this paper we will therefore also refer to the recognition model $q_\phi(\mathbf{z}|\mathbf{x})$ as a probabilistic encoder, since given a datapoint $\mathbf{x}$ it produces a distribution (e.g. a Gaussian) over the possible values of the code $\mathbf{z}$ from which the datapoint $\mathbf{x}$ could have been generated. In a similar vein we will refer to $p_\theta(\mathbf{x}|\mathbf{z})$ as a probabilistic decoder, since given a code $\mathbf{z}$ it produces a distribution over the possible corresponding values of $\mathbf{x}$.
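To make the encoder/decoder reading concrete, the following minimal Python sketch (an illustration added for this text, not code from the paper) implements a recognition model that maps $\mathbf{x}$ to the mean and log-variance of a diagonal Gaussian over $\mathbf{z}$, and a decoder that maps $\mathbf{z}$ to Bernoulli parameters over $\mathbf{x}$; the single linear layers, the dimensions and the Gaussian/Bernoulli choices are assumptions made only for the example.

import numpy as np

rng = np.random.default_rng(0)
x_dim, z_dim = 784, 20   # illustrative sizes, not settings taken from the paper

# Randomly initialized parameters: phi for the encoder, theta for the decoder.
W_mu  = rng.normal(scale=0.01, size=(x_dim, z_dim)); b_mu  = np.zeros(z_dim)
W_lv  = rng.normal(scale=0.01, size=(x_dim, z_dim)); b_lv  = np.zeros(z_dim)
W_dec = rng.normal(scale=0.01, size=(z_dim, x_dim)); b_dec = np.zeros(x_dim)

def encode(x):
    # Probabilistic encoder q_phi(z|x): returns the mean and log-variance of a
    # diagonal Gaussian over the code z for a given datapoint x.
    return x @ W_mu + b_mu, x @ W_lv + b_lv

def decode(z):
    # Probabilistic decoder p_theta(x|z): returns Bernoulli means over x for a code z.
    return 1.0 / (1.0 + np.exp(-(z @ W_dec + b_dec)))

x = rng.random(x_dim)                                        # a toy datapoint
mu, logvar = encode(x)                                       # q_phi(z|x) parameters
z = mu + np.exp(0.5 * logvar) * rng.standard_normal(z_dim)   # one code sampled from q
x_probs = decode(z)                                          # p_theta(x|z) parameters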
2.2 The variational bound
The marginal likelihood is composed of a sum over the marginal likelihoods of individual datapoints
$\log p_\theta(\mathbf{x}^{(1)}, \cdots, \mathbf{x}^{(N)}) = \sum_{i=1}^{N} \log p_\theta(\mathbf{x}^{(i)})$, which can each be rewritten as:
$$\log p_\theta(\mathbf{x}^{(i)}) = D_{KL}\big(q_\phi(\mathbf{z}|\mathbf{x}^{(i)}) \,\|\, p_\theta(\mathbf{z}|\mathbf{x}^{(i)})\big) + \mathcal{L}(\theta, \phi; \mathbf{x}^{(i)}) \qquad (1)$$
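For completeness, this decomposition can be verified in one step (a standard derivation, added here only as a reading aid): writing out the KL divergence and substituting $\log p_\theta(\mathbf{z}|\mathbf{x}^{(i)}) = \log p_\theta(\mathbf{x}^{(i)}, \mathbf{z}) - \log p_\theta(\mathbf{x}^{(i)})$ gives
$$D_{KL}\big(q_\phi(\mathbf{z}|\mathbf{x}^{(i)}) \,\|\, p_\theta(\mathbf{z}|\mathbf{x}^{(i)})\big) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x}^{(i)})}\big[\log q_\phi(\mathbf{z}|\mathbf{x}^{(i)}) - \log p_\theta(\mathbf{x}^{(i)}, \mathbf{z})\big] + \log p_\theta(\mathbf{x}^{(i)}),$$
and the expectation on the right is exactly $-\mathcal{L}(\theta, \phi; \mathbf{x}^{(i)})$ as defined in eq. (2) below; rearranging yields eq. (1).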
The first RHS term is the KL divergence of the approximate from the true posterior. Since this
KL divergence is non-negative, the second RHS term $\mathcal{L}(\theta, \phi; \mathbf{x}^{(i)})$ is called the (variational) lower bound on the marginal likelihood of datapoint $i$, and can be written as:
$$\log p_\theta(\mathbf{x}^{(i)}) \geq \mathcal{L}(\theta, \phi; \mathbf{x}^{(i)}) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\big[-\log q_\phi(\mathbf{z}|\mathbf{x}) + \log p_\theta(\mathbf{x}, \mathbf{z})\big] \qquad (2)$$
which can also be written as:
$$\mathcal{L}(\theta, \phi; \mathbf{x}^{(i)}) = -D_{KL}\big(q_\phi(\mathbf{z}|\mathbf{x}^{(i)}) \,\|\, p_\theta(\mathbf{z})\big) + \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x}^{(i)})}\big[\log p_\theta(\mathbf{x}^{(i)}|\mathbf{z})\big] \qquad (3)$$
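As a concrete (and hedged) illustration of eq. (3): if one assumes a diagonal-Gaussian $q_\phi(\mathbf{z}|\mathbf{x}^{(i)})$, a standard-normal prior $p_\theta(\mathbf{z})$ and a Bernoulli decoder (common modelling choices, not requirements of the bound), the KL term is available in closed form and the expected reconstruction term can be estimated by sampling. A minimal Python sketch under those assumptions:

import numpy as np

rng = np.random.default_rng(0)

def elbo_estimate(x, mu, logvar, decode, n_samples=1):
    # -D_KL( N(mu, diag(exp(logvar))) || N(0, I) ): closed form for the assumed
    # diagonal-Gaussian posterior and standard-normal prior.
    neg_kl = 0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar))
    # Monte Carlo estimate of E_q[ log p_theta(x|z) ] for a Bernoulli decoder.
    rec = 0.0
    for _ in range(n_samples):
        z = mu + np.exp(0.5 * logvar) * rng.standard_normal(mu.shape)
        p = decode(z)
        rec += np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p))
    return neg_kl + rec / n_samples

# Toy usage with a hypothetical linear-sigmoid decoder and a random binary datapoint.
z_dim, x_dim = 2, 5
W, b = rng.normal(size=(z_dim, x_dim)), np.zeros(x_dim)
decode = lambda z: 1.0 / (1.0 + np.exp(-(z @ W + b)))
x = rng.integers(0, 2, size=x_dim).astype(float)
print(elbo_estimate(x, mu=np.zeros(z_dim), logvar=np.zeros(z_dim), decode=decode))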
We want to differentiate and optimize the lower bound $\mathcal{L}(\theta, \phi; \mathbf{x}^{(i)})$ w.r.t. both the variational parameters $\phi$ and the generative parameters $\theta$. However, the gradient of the lower bound w.r.t. $\phi$ is problematic. The usual (naïve) Monte Carlo gradient estimator for this type of problem is:
$$\nabla_\phi \mathbb{E}_{q_\phi(\mathbf{z})}[f(\mathbf{z})] = \mathbb{E}_{q_\phi(\mathbf{z})}\big[f(\mathbf{z})\, \nabla_\phi \log q_\phi(\mathbf{z})\big] \simeq \frac{1}{L} \sum_{l=1}^{L} f(\mathbf{z}^{(l)})\, \nabla_\phi \log q_\phi(\mathbf{z}^{(l)})$$
where $\mathbf{z}^{(l)} \sim q_\phi(\mathbf{z}|\mathbf{x}^{(i)})$. This gradient estimator exhibits very high variance (see e.g. [BJP12]) and is impractical for our purposes.
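To see the variance issue on a toy problem (a small numerical sketch under simplifying assumptions, not an experiment from the paper): take $q_\phi(z) = \mathcal{N}(z; \phi, 1)$ and $f(z) = z^2$, for which the exact gradient $\nabla_\phi \mathbb{E}_{q_\phi(z)}[f(z)] = 2\phi$ is known, and compare the naive estimator above against the reparameterized estimator introduced in the next section:

import numpy as np

rng = np.random.default_rng(0)
phi, n = 1.0, 100_000                 # toy variational parameter and sample count
z = phi + rng.standard_normal(n)      # samples z ~ q_phi(z) = N(phi, 1)

# Naive score-function estimator: f(z) * d/dphi log q_phi(z) = z^2 * (z - phi).
score_grads = z**2 * (z - phi)
# Reparameterized estimator (section 2.3): z = phi + eps, so d/dphi f(z) = 2 z.
reparam_grads = 2.0 * z

print("exact gradient           :", 2.0 * phi)
print("score-function estimator : mean %.3f  variance %.3f"
      % (score_grads.mean(), score_grads.var()))
print("reparameterized estimator: mean %.3f  variance %.3f"
      % (reparam_grads.mean(), reparam_grads.var()))

Both estimators are unbiased on this toy problem, but the score-function estimator's variance is several times larger, and the gap grows with the dimensionality and curvature of $f$.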
2.3 The SGVB estimator and AEVB algorithm
In this section we introduce a practical estimator of the lower bound and its derivatives w.r.t. the
parameters. We assume an approximate posterior in the form $q_\phi(\mathbf{z}|\mathbf{x})$, but note that the technique can also be applied to the case $q_\phi(\mathbf{z})$, i.e. where we do not condition on $\mathbf{x}$. The fully variational Bayesian method for inferring a posterior over the parameters is given in the appendix.
Under certain mild conditions outlined in section 2.4, for a chosen approximate posterior $q_\phi(\mathbf{z}|\mathbf{x})$ we can reparameterize the random variable $\widetilde{\mathbf{z}} \sim q_\phi(\mathbf{z}|\mathbf{x})$ using a differentiable transformation $g_\phi(\boldsymbol{\epsilon}, \mathbf{x})$ of an (auxiliary) noise variable $\boldsymbol{\epsilon}$:
$$\widetilde{\mathbf{z}} = g_\phi(\boldsymbol{\epsilon}, \mathbf{x}) \quad \text{with} \quad \boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon}) \qquad (4)$$
See section 2.4 for general strategies for choosing such an appropriate distribution $p(\boldsymbol{\epsilon})$ and function $g_\phi(\boldsymbol{\epsilon}, \mathbf{x})$. We can now form Monte Carlo estimates of expectations of some function $f(\mathbf{z})$ w.r.t. $q_\phi(\mathbf{z}|\mathbf{x})$ as follows:
$$\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x}^{(i)})}\big[f(\mathbf{z})\big] = \mathbb{E}_{p(\boldsymbol{\epsilon})}\big[f(g_\phi(\boldsymbol{\epsilon}, \mathbf{x}^{(i)}))\big] \simeq \frac{1}{L} \sum_{l=1}^{L} f(g_\phi(\boldsymbol{\epsilon}^{(l)}, \mathbf{x}^{(i)})) \quad \text{where} \quad \boldsymbol{\epsilon}^{(l)} \sim p(\boldsymbol{\epsilon}) \qquad (5)$$
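A quick numerical sanity check of eq. (5) in Python, under the assumption that $q_\phi(\mathbf{z}|\mathbf{x})$ is a diagonal Gaussian so that $g_\phi(\boldsymbol{\epsilon}, \mathbf{x}) = \boldsymbol{\mu} + \boldsymbol{\sigma} \odot \boldsymbol{\epsilon}$ with $\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$ (the Gaussian case discussed in section 2.4); the test function $f$ and the parameter values below are arbitrary illustrative choices:

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = np.array([1.0, -2.0]), np.array([0.5, 1.5])   # toy parameters of q
f = lambda z: np.sum(z**2, axis=-1)                       # arbitrary test function

# Reparameterization: z^(l) = g_phi(eps^(l), x) = mu + sigma * eps^(l), eps ~ N(0, I).
L = 100_000
eps = rng.standard_normal((L, mu.size))
z = mu + sigma * eps

# Monte Carlo estimate of E_{q_phi(z|x)}[f(z)] as in eq. (5).
print("MC estimate:", f(z).mean())
# For this f the expectation is known exactly: sum_i (mu_i^2 + sigma_i^2).
print("exact value:", np.sum(mu**2 + sigma**2))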
We apply this technique to the variational lower bound (eq. (2)), yielding our generic Stochastic Gradient Variational Bayes (SGVB) estimator $\widetilde{\mathcal{L}}^{A}(\theta, \phi; \mathbf{x}^{(i)}) \simeq \mathcal{L}(\theta, \phi; \mathbf{x}^{(i)})$:
$$\widetilde{\mathcal{L}}^{A}(\theta, \phi; \mathbf{x}^{(i)}) = \frac{1}{L} \sum_{l=1}^{L} \big[\log p_\theta(\mathbf{x}^{(i)}, \mathbf{z}^{(i,l)}) - \log q_\phi(\mathbf{z}^{(i,l)}|\mathbf{x}^{(i)})\big]$$
$$\text{where} \quad \mathbf{z}^{(i,l)} = g_\phi(\boldsymbol{\epsilon}^{(i,l)}, \mathbf{x}^{(i)}) \quad \text{and} \quad \boldsymbol{\epsilon}^{(i,l)} \sim p(\boldsymbol{\epsilon}) \qquad (6)$$
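Once the factors of $p_\theta(\mathbf{x}, \mathbf{z})$ and $q_\phi(\mathbf{z}|\mathbf{x})$ are chosen, the estimator in eq. (6) is straightforward to compute. The Python sketch below assumes a diagonal-Gaussian encoder, a standard-normal prior and a Bernoulli decoder (illustrative choices, as before) and averages $\log p_\theta(\mathbf{x}^{(i)}, \mathbf{z}^{(i,l)}) - \log q_\phi(\mathbf{z}^{(i,l)}|\mathbf{x}^{(i)})$ over $L$ reparameterized samples:

import numpy as np

rng = np.random.default_rng(0)

def log_normal(z, mu, logvar):
    # Log density of a diagonal Gaussian N(z; mu, diag(exp(logvar))).
    return -0.5 * np.sum(np.log(2.0 * np.pi) + logvar + (z - mu)**2 / np.exp(logvar))

def sgvb_a(x, mu, logvar, decode, L=5):
    # Generic SGVB estimator of eq. (6):
    #   (1/L) * sum_l [ log p_theta(x, z_l) - log q_phi(z_l | x) ],
    # with z_l = mu + sigma * eps_l and eps_l ~ N(0, I) (reparameterization).
    total = 0.0
    for _ in range(L):
        eps = rng.standard_normal(mu.shape)
        z = mu + np.exp(0.5 * logvar) * eps
        log_prior = log_normal(z, np.zeros_like(z), np.zeros_like(z))   # log p_theta(z)
        p = decode(z)
        log_lik = np.sum(x * np.log(p) + (1.0 - x) * np.log(1.0 - p))   # log p_theta(x|z)
        log_q = log_normal(z, mu, logvar)                               # log q_phi(z|x)
        total += log_prior + log_lik - log_q
    return total / L

# Toy usage with a hypothetical linear-sigmoid decoder and a random binary datapoint.
z_dim, x_dim = 2, 5
W, b = rng.normal(size=(z_dim, x_dim)), np.zeros(x_dim)
decode = lambda z: 1.0 / (1.0 + np.exp(-(z @ W + b)))
x = rng.integers(0, 2, size=x_dim).astype(float)
print(sgvb_a(x, mu=np.zeros(z_dim), logvar=np.zeros(z_dim), decode=decode, L=10))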