1520-9210 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TMM.2018.2885235, IEEE
Transactions on Multimedia
Image Generation. The early works in image generation
were mostly limited to simple texture synthesis based on hand-
crafted features [50]. Recent approaches for generating re-
alistic images mainly employ two kinds of models, i.e., the
variational autoencoder (VAE) [6] and GAN [5]. VAE is a
well-known approach for capturing complicated distributions
and has been widely applied in various generative tasks.
Gregor et al. [51] establish a sequential generation model for
generating realistic images that takes advantage of both VAE
and recurrent neural networks with an attention mechanism.
Yan et al. [20] propose a layered generative model built on
the conditional VAE. GAN is another fairly popular gener-
ative model that has been exploited by many recent works.
Some works have focused on the architecture of the original
GAN approach and have provided better solutions for image
generation [52], [53], [54], [55]. Instead of utilizing only
noise vectors as inputs, the CGAN [26] approach adds extra
information along with the noise to constrain the generation
process and produces impressive results. Based on CGAN,
many works [24], [25] have proposed solving the problems of
image-to-image translation and style translation. Our genera-
tive model also takes inspiration from the CGAN and image-
to-image models in that we take landmarks as the structure
conditions and a reference image as the appearance condition,
and a generative model is trained to produce new object poses
corresponding to the given conditions.
Video Generation. Another topic closely related to our
problem is video generation and prediction. The early works
in this space have been mostly based on video texture meth-
ods [56], [57], [58]. These methods utilize an input reference
video and aim to generate periodic motion sequences. Guided
by ideas from neuroscience, Lotter et al. [11] propose a
predictive neural network for video generation. Finn et al. [8]
show another approach to video prediction conditioned on
action cues that focuses on pixel-level movement. Mathieu et
al. [12] exploit multi-scale architecture to solve the problem of
deformation in future predictions. Instead of predicting future
information at the pixel level, Van et al. [59] attempt to obtain
the transformations between frames given the observed videos.
GANs and VAEs have also been utilized in video generation, such as
in [9], [60], [61]. A recent work [62] disentangles the process
of learning pose features from content features in videos and
can synthesize high-quality videos. However, these methods
usually suffer from two main issues: 1) the foreground objects
are easily deformed during the generation process, and 2) the
consistency between adjacent frames
is hard to maintain. These problems suggest that strong motion
constraints need to be considered for generating more realistic
videos. Therefore, landmark information is employed to guide
the motion generation in our proposed model. Recently, some
other contemporary works [19], [63] also propose employing
landmark information for motion prediction.
III. METHOD
We formulate our problem as follows. Given an appearance
reference image I that specifies the appearance information
for both foreground objects and the background, we aim to
generate a sequence of frames $\hat{Y} = \{\hat{y}_1, \dots, \hat{y}_T\}$ containing
the same object moving against the background according to
a specific motion pattern c. In other words, our task is to
generate motion sequences from a single static image. Given
very limited information (I and c), the main challenge is how
to find a good solution in the extremely high-dimensional video
space. To address this issue, we propose a two-step framework
to improve the generation quality. First, we introduce an
action-conditioned landmark generation network, which aims
to generate a sequence of object landmarks $\hat{S} = \{\hat{s}_1, \dots, \hat{s}_T\}$
corresponding to the specified motion pattern c. This network
is described in detail in Section III-B. The landmark sequence
$\hat{S}$ is a good low-dimensional representation of foreground ob-
ject structures and thus provides strong guidance information
for video generation. Therefore, we propose an appearance-
preserving motion generation network in the second step. This
network takes reference image I and object landmarks
ˆ
S
as appearance and motion constraints, respectively, and will
thus generate high-quality motion sequences. The details of
this network are described in Section III-C. Specifically, we
employ a VAE model and a GAN model in these two steps,
respectively.
In the following, we first briefly revisit generative adversarial
networks and then introduce our two-step model.
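Before turning to the details, the overall two-step pipeline can be sketched as follows. This is only an illustrative outline, not the paper's implementation; `landmark_net` and `motion_net` are hypothetical stand-ins for the networks of Sections III-B and III-C:

```python
def generate_video(reference_image, motion_pattern, T, landmark_net, motion_net):
    """Two-step generation: landmarks first (Sec. III-B), frames second (Sec. III-C)."""
    # Step 1: action-conditioned landmark generation, driven by the
    # motion pattern c and the time step.
    landmarks = [landmark_net(motion_pattern, t) for t in range(T)]
    # Step 2: appearance-preserving motion generation, conditioned on
    # the reference image I and on each generated landmark.
    frames = [motion_net(reference_image, s) for s in landmarks]
    return landmarks, frames
```

The key design choice is that frames are generated from the low-dimensional landmark sequence rather than predicted directly in pixel space.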
A. Generative Adversarial Networks
The GAN [5] approach was explicitly designed to address
data generation tasks. Specifically, GAN models consist of two
parts, i.e., a generator network G and a discriminator network
D. The discriminator aims to distinguish real samples from
synthesized ones, while the generator aims to generate data
that are close to the real distribution to fool the discriminator.
In this way, G and D form a min-max game, and the objective
function is defined as follows:
\min_G \max_D \mathcal{L}_{GAN} = \mathbb{E}_{x \sim p_r(x)}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))],   (1)
where $p_r$ and $p(z)$ are the distribution of the real data and the
zero-mean Gaussian distribution $\mathcal{N}(0, 1)$, respectively.
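As a concrete check of Eq. (1), the objective can be estimated by Monte-Carlo sampling. The 1-D generator and discriminator below are toy stand-ins chosen only to make the two expectation terms explicit; they are not the models used in this paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D setup: real data from N(2, 0.5^2), noise from p(z) = N(0, 1).
real = rng.normal(2.0, 0.5, size=10_000)  # samples from p_r(x)
z = rng.normal(0.0, 1.0, size=10_000)     # samples from p(z)

def G(z):
    # Toy generator: an affine map of the noise.
    return 1.5 * z + 1.0

def D(x):
    # Toy discriminator: a sigmoid score in (0, 1).
    return 1.0 / (1.0 + np.exp(-(x - 1.0)))

# Monte-Carlo estimate of the objective in Eq. (1):
# E_{x~p_r}[log D(x)] + E_{z~p(z)}[log(1 - D(G(z)))]
loss = np.mean(np.log(D(real))) + np.mean(np.log(1.0 - D(G(z))))
```

In training, D ascends this objective while G descends it, which is the min-max game described above.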
However, the original GAN is unstable and difficult to
train because of the use of the Jensen-Shannon (JS) divergence
in its loss function. Under an (approximately) optimal D,
minimizing the loss of G is equivalent to minimizing the JS
divergence between the real distribution $p_r$ and the generating
distribution $p_g$. Because a non-negligible overlap between
$p_r$ and $p_g$ is almost impossible, the JS divergence is almost
always the constant $\log 2$, which eventually makes the gradient
of the generator approach 0. WGAN [64] was proposed to
improve the stability of the model by replacing the JS divergence
with the Wasserstein distance, which remains continuous under
a Lipschitz-constrained critic and thus solves the gradient
vanishing problem. Based on the Wasserstein distance, the loss
function of G is defined as follows:
\mathcal{L}_{WGAN}(G) = -\mathbb{E}_{x \sim P_g}[D(x)]   (2)
and the loss function of D is defined as follows:
\mathcal{L}_{WGAN}(D) = \mathbb{E}_{x \sim P_g}[D(x)] - \mathbb{E}_{x \sim P_r}[D(x)].   (3)
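To make Eqs. (2) and (3) concrete, both losses can be estimated from samples. The fixed linear critic below is a toy 1-Lipschitz function (slope 0.5) standing in for the constrained discriminator; the distributions are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

fake = rng.normal(0.0, 1.0, size=10_000)  # samples from P_g
real = rng.normal(2.0, 0.5, size=10_000)  # samples from P_r

def critic(x):
    # Toy critic with Lipschitz constant 0.5 (unbounded output, no sigmoid).
    return 0.5 * x

# Eq. (2): generator loss, L_WGAN(G) = -E_{x~P_g}[D(x)].
loss_G = -np.mean(critic(fake))

# Eq. (3): critic loss, L_WGAN(D) = E_{x~P_g}[D(x)] - E_{x~P_r}[D(x)];
# minimizing it pushes the critic's scores apart on real and fake samples.
loss_D = np.mean(critic(fake)) - np.mean(critic(real))
```

Because the critic's output is not squashed through a sigmoid, its gradients do not saturate, which is the practical source of WGAN's improved stability.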