and quantitatively, especially under large pose variation. Additionally, a human perceptual study
further indicates the superiority of our model, which achieves remarkably higher scores than other
methods and produces more realistic results.
2 Related Work
Image Synthesis. Driven by the remarkable results of GANs [10], many researchers have leveraged
GANs to generate images [12, 6, 18]. DCGAN [24] introduced an unsupervised learning method that
combines convolutional neural networks (CNNs) with GANs to effectively generate realistic images.
Pix2pix [13] exploited conditional adversarial networks (CGANs) [22] to tackle image-to-image
translation tasks, learning a mapping from condition images to target images. CycleGAN [35],
DiscoGAN [15], and DualGAN [33] each proposed an unsupervised method to translate images
between two domains using unlabeled images. Furthermore, StarGAN [5] proposed a unified model
for image-to-image translation across multiple domains, which is effective for transformations such
as young-to-old, angry-to-happy, and female-to-male. Pix2pixHD [30] used residual networks at
two different scales to generate high-resolution images in two steps. These approaches are capable
of learning to generate realistic images, but have limited scalability in handling pose-based person
synthesis, due to unseen target poses and complex conditional appearances. Unlike those methods,
our proposed Soft-Gated Warping-GAN attends to pose alignment in deep feature space and handles
texture rendering at the region level for synthesizing person images.
Person Image Synthesis. Recently, many studies have leveraged adversarial learning for person
image synthesis. PG2 [20] proposed a two-stage GAN architecture to synthesize person images
conditioned on pose keypoints. BodyROI7 [21] applied disentangling and restructuring methods to
generate person images from separately sampled features. DSCF [28] introduced a special U-Net [26]
structure with deformable skip connections as a generator, synthesizing person images from
decomposed and deformable images. AUNET [8] presented a variational U-Net for generating
images conditioned on a stickman (more artificial pose information), manipulating appearance and
shape with a variational autoencoder. Skeleton-Aided [32] proposed a skeleton-aided method for
video generation with a standard pix2pix [13] architecture, generating human images based on
poses. [1] proposed a modular GAN that separates the image into different parts and reconstructs
them according to the target pose. [23] essentially used CycleGAN [35] to generate person images,
applying conditioned bidirectional generators to reconstruct the original image from the pose.
VITON [11] used a coarse-to-fine strategy to transfer a clothing image onto a fixed-pose person
image. CP-VTON [29] learns a thin-plate spline transformation via a Geometric Matching Module
(GMM) to fit in-shop clothes to the body shape of the target person. However, all the methods above
share a common problem: they ignore the misalignment of deep feature maps between the condition
and target images. In this paper, we exploit a Soft-Gated Warping-GAN, including a pose-guided
parser that generates the target parsing, which guides texture rendering on specific part segmentation
regions, and a novel warping-block that aligns the image features, producing more realistic-looking
textures for synthesizing high-quality person images conditioned on different poses.
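As a rough illustration of the soft-gating idea (this particular convex-combination form is our
simplifying assumption for exposition, not the paper's exact warping-block design), a soft gate can
blend pose-aligned (warped) condition features into the target branch per position:

```python
import numpy as np

def soft_gated_warp_fusion(target_feat, warped_feat, gate):
    """Blend warped condition features into the target branch with a soft gate.

    gate is a per-position value in [0, 1]: 1 fully trusts the warped
    condition features, 0 keeps the target-branch features unchanged.
    """
    gate = np.clip(gate, 0.0, 1.0)
    return gate * warped_feat + (1.0 - gate) * target_feat

# Toy 1-D "feature maps": the gate passes warped features only where it is high.
f_target = np.array([1.0, 1.0, 1.0, 1.0])
f_warped = np.array([5.0, 5.0, 5.0, 5.0])
mask = np.array([0.0, 0.25, 0.75, 1.0])
fused = soft_gated_warp_fusion(f_target, f_warped, mask)
# fused -> [1.0, 2.0, 4.0, 5.0]
```

A soft (rather than hard, binary) gate keeps the fusion differentiable, so the gating signal can be
learned end-to-end with the rest of the network.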
3 Soft-Gated Warping-GAN
Our goal is to change the pose of a given person image to another while keeping the texture details,
leveraging the transformation mapping between the condition and target segmentation maps. We
decompose this task into two stages: pose-guided parsing and Warping-GAN rendering. We first
give an overview of our Soft-Gated Warping-GAN architecture. Then, we discuss pose-guided
parsing and Warping-GAN rendering in detail. Next, we present the warping-block design and the
pipeline for estimating transformation parameters and warping images, which helps generate
realistic-looking person images. Finally, we give a detailed description of the synthesis loss
functions applied in our network.
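To make the transformation-estimation step concrete, the following is a minimal sketch (not the
paper's implementation) of a thin-plate spline mapping fitted from paired control points with SciPy's
`RBFInterpolator`; the control points here are hypothetical stand-ins for matched keypoints:

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

# Hypothetical paired control points: the TPS maps each target-image
# coordinate back to a source-image coordinate (a backward warp).
dst = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.6, 0.4]])
src = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0], [0.5, 0.5]])

# With the default smoothing of 0, the TPS interpolates the control points
# exactly while bending the rest of the plane as little as possible.
tps = RBFInterpolator(dst, src, kernel='thin_plate_spline')

# Sample the mapping on a grid of target coordinates; an image warp would then
# read source pixels at these coordinates (e.g., with bilinear sampling).
xs, ys = np.meshgrid(np.linspace(0, 1, 4), np.linspace(0, 1, 4))
grid = np.stack([xs.ravel(), ys.ravel()], axis=1)
src_coords = tps(grid)  # shape (16, 2): one source coordinate per grid point
```

An affine transformation handles only global rotation, scaling, and translation; the TPS adds smooth
local deformation, which is why the two are often estimated together for non-rigid body warping.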
3.1 Network Architectures
Our pipeline is a two-stage architecture for pose-guided parsing and Warping-GAN rendering,
respectively, which includes a human parser, a pose estimator, and an affine [7] / TPS (Thin-Plate
Spline) [2, 25] transformation estimator. Notably, we make the first attempt to estimate the