Figure 1: LSTM VAE model of (Bowman et al., 2016)
forward part is composed of a fully convolutional encoder and a decoder that combines deconvolutional layers and a conventional RNN. Finally, we discuss optimization recipes that help the VAE respect its latent variables, which is critical for training a model with a meaningful latent space and for sampling realistic sentences.
3.1 Variational Autoencoder
The VAE is a recently introduced latent vari-
able generative model, which combines varia-
tional inference with deep learning. It modifies the
conventional autoencoder framework in two key
ways. Firstly, the deterministic internal representation z (provided by the encoder) of an input x is replaced with a posterior distribution q(z|x). Inputs are then reconstructed by sampling z from this posterior and passing it through a decoder. To make sampling easy, the posterior is usually parametrized as a Gaussian whose mean and variance are predicted by the encoder. Secondly,
to ensure that we can sample from any point of
the latent space and still generate valid and diverse
outputs, the posterior q(z|x) is regularized with
its KL divergence from a prior distribution p(z).
The prior is typically also chosen to be a Gaussian with zero mean and unit variance, such that the KL term between the posterior and the prior can be computed in closed form (Kingma and Welling, 2013). The total VAE cost is composed of the reconstruction term, i.e., the negative log-likelihood of the data, and the KL regularizer:
J_vae = KL(q(z|x) || p(z)) − E_{q(z|x)}[log p(x|z)]        (1)
Kingma and Welling (2013) show that the loss function in Eq (1) can be derived from a probabilistic modeling perspective and that it is an upper bound on the true negative log-likelihood of the data.
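For completeness, with a diagonal Gaussian posterior q(z|x) = N(μ(x), diag(σ²(x))) and a standard normal prior, the KL term in Eq (1) has the standard closed form

KL(q(z|x) || p(z)) = 1/2 Σ_i (μ_i² + σ_i² − log σ_i² − 1),

which vanishes only when the predicted posterior matches the prior exactly for the given input.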
One can view a VAE as a traditional Autoencoder with some restrictions imposed on the internal representation space. Specifically, using a sample from q(z|x) to reconstruct the input instead of a deterministic z forces the model to map an input to a region of the space rather than to a single point. The most straightforward way to achieve a low reconstruction error in this case is to predict a very sharp probability distribution, effectively corresponding to a single point in the latent space (Raiko et al., 2014). The additional KL
term in Eq (1) prevents this behavior and forces the
model to find a solution with, on one hand, low re-
construction error and, on the other, predicted pos-
terior distributions close to the prior. Thus, the de-
coder part of the VAE is capable of reconstructing
a sensible data sample from every point in the la-
tent space that has non-zero probability under the
prior. This allows for straightforward generation
of novel samples and linear operations on the la-
tent codes. Bowman et al. (2016) demonstrate that this does not work in the fully deterministic Autoencoder framework. In addition to regularizing the latent space, the KL term indicates how much information the VAE stores in the latent vector.
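As an illustration only (not the implementation used in this work), the sketch below computes the objective in Eq (1) for a diagonal Gaussian posterior in PyTorch; the function and argument names are hypothetical:

```python
import torch
import torch.nn.functional as F

def reparameterize(mu, logvar):
    # Sample z ~ q(z|x) differentiably (reparameterization trick).
    return mu + torch.randn_like(mu) * (0.5 * logvar).exp()

def vae_loss(mu, logvar, logits, targets, kl_weight=1.0):
    """Eq (1): reconstruction NLL plus KL(q(z|x) || p(z)) with p(z) = N(0, I).

    mu, logvar -- posterior parameters from the encoder, shape (batch, d)
    logits     -- decoder outputs over the vocabulary, shape (batch, seq_len, vocab)
    targets    -- reference token ids, shape (batch, seq_len)
    """
    # Closed-form KL between N(mu, sigma^2) and N(0, I), summed over latent dims.
    kl = 0.5 * torch.sum(mu.pow(2) + logvar.exp() - logvar - 1.0, dim=1)

    # Reconstruction term: negative log-likelihood of the target tokens.
    nll = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                          targets.reshape(-1),
                          reduction='none').view(targets.size(0), -1).sum(dim=1)

    # kl_weight < 1 corresponds to the KL annealing discussed in Section 3.4.
    return (nll + kl_weight * kl).mean(), kl.mean()
```

Reporting the KL term separately, as in this sketch, is the usual way to monitor how much information the model stores in z.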
Bowman et al. (2016) propose a VAE model for
text generation where both encoder and decoder
are LSTM networks (Figure 1). We will refer to
this model as LSTM VAE in the remainder of the
paper. The authors show that adapting VAEs to text generation is more challenging, since the decoder tends to ignore the latent vector (the KL term is close to zero) and falls back to behaving as a plain language model. Two training tricks are required to mitigate this issue: (i) KL-term annealing, where the weight of the KL term in Eq (1) is gradually increased from 0 to 1 during training; and (ii) applying dropout to the inputs of the decoder to limit its expressiveness and thereby force the model to rely more on the latent variables.
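A minimal sketch of these two tricks, assuming a linear annealing schedule and word dropout that replaces decoder input tokens with an unknown-word token; the schedule length and dropout rate below are illustrative, not the settings of Bowman et al. (2016):

```python
import torch

def kl_weight(step, anneal_steps=10000):
    # (i) KL-term annealing: the KL weight grows linearly from 0 to 1.
    return min(1.0, step / anneal_steps)

def word_dropout(token_ids, unk_id, rate=0.3):
    # (ii) Decoder input dropout: randomly replace input tokens with <unk>,
    # limiting the decoder's expressiveness so it relies more on z.
    mask = torch.rand(token_ids.shape, device=token_ids.device) < rate
    return token_ids.masked_fill(mask, unk_id)
```

The returned weight would multiply the KL term in Eq (1) (the kl_weight argument in the earlier sketch).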
We will discuss these tricks in more detail in Sec-
tion 3.4. Next we describe a deconvolutional layer,
which is the core element of the decoder in our
VAE model.
3.2 Deconvolutional Networks
A deconvolutional layer (also referred to as a transposed convolution (Gulrajani et al., 2016) or a fractionally strided convolution (Radford et al., 2015)) performs spatial up-sampling of its inputs
and is an integral part of latent variable genera-
tive models of images (Radford et al., 2015; Gulra-
jani et al., 2016) and semantic segmentation algo-
rithms (Noh et al., 2015). Its goal is to perform an “inverse” convolution operation and increase the spatial size of the input while decreasing the number of feature maps. This operation can be viewed as