Unsupervised Person Image Generation with Semantic Parsing Transformation
Sijie Song¹, Wei Zhang², Jiaying Liu¹∗, Tao Mei²
¹Institute of Computer Science and Technology, Peking University, Beijing, China
²JD AI Research, Beijing, China
Abstract
In this paper, we address unsupervised pose-guided person image generation, which is known to be challenging due to non-rigid deformation. Unlike previous methods that learn a hard direct mapping between human bodies, we propose a new pathway that decomposes the hard mapping into two more accessible subtasks, namely, semantic parsing transformation and appearance generation. Firstly, a semantic generative network is proposed to transform between semantic parsing maps, in order to simplify the learning of non-rigid deformation. Secondly, an appearance generative network learns to synthesize semantic-aware textures. Thirdly, we demonstrate that training our framework in an end-to-end manner further refines the semantic maps and the final results accordingly. Our method generalizes to other semantic-aware person image generation tasks, e.g., clothing texture transfer and controlled image manipulation. Experimental results demonstrate the superiority of our method on the DeepFashion and Market-1501 datasets, especially in preserving clothing attributes and generating better body shapes.
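To make the two-stage decomposition concrete, the snippet below gives a minimal PyTorch-style sketch of the pipeline. It assumes parsing maps are per-pixel body-part label maps and poses are keypoint heatmaps; all module architectures, names, and dimensions are illustrative assumptions, not the paper's actual networks (see the project repository for the released implementation).

```python
# Minimal sketch of the two-stage decomposition described above.
# All shapes and layers are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

class SemanticGenerator(nn.Module):
    """Stage 1: predict the target-pose semantic parsing map."""
    def __init__(self, n_parts=10, pose_dim=18, hidden=64):
        super().__init__()
        # Input: source parsing map (n_parts channels) concatenated with
        # source and target pose heatmaps (pose_dim channels each).
        self.net = nn.Sequential(
            nn.Conv2d(n_parts + 2 * pose_dim, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, n_parts, 3, padding=1),
        )

    def forward(self, src_parsing, src_pose, tgt_pose):
        x = torch.cat([src_parsing, src_pose, tgt_pose], dim=1)
        # Per-pixel scores over body parts; softmax yields a soft parsing map.
        return self.net(x).softmax(dim=1)

class AppearanceGenerator(nn.Module):
    """Stage 2: render semantic-aware textures onto the predicted map."""
    def __init__(self, n_parts=10, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3 + n_parts, hidden, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, 3, 3, padding=1),
            nn.Tanh(),
        )

    def forward(self, src_image, tgt_parsing):
        x = torch.cat([src_image, tgt_parsing], dim=1)
        return self.net(x)

# End-to-end composition: gradients from the appearance stage can flow back
# and refine the predicted semantic map, as noted in the abstract.
sem_gen, app_gen = SemanticGenerator(), AppearanceGenerator()
src_img = torch.randn(1, 3, 128, 64)                # source person image
src_seg = torch.randn(1, 10, 128, 64).softmax(1)    # source parsing map
src_kp = torch.randn(1, 18, 128, 64)                # source pose heatmaps
tgt_kp = torch.randn(1, 18, 128, 64)                # target pose heatmaps
tgt_seg = sem_gen(src_seg, src_kp, tgt_kp)          # stage 1 output
out_img = app_gen(src_img, tgt_seg)                 # person in target pose
```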
1. Introduction
Pose-guided image generation, which aims to change the pose of a person in an image to a target pose while keeping the appearance details, has attracted great attention recently. This topic is of great importance in the fashion and art domains, with a wide range of applications from image / video editing and person re-identification to movie production.
∗Corresponding author. This work was done at JD AI Research. Our project is available at https://github.com/SijieSong/person_generation_spt.git.

Figure 1: Visual results of different methods on DeepFashion [18]. Compared with PG² [19], Def-GAN [27], and UPIS [21], our method successfully keeps the clothing attributes (e.g., textures) and generates better body shapes (e.g., arms).

With the development of deep learning and generative models [8], much research has been devoted to pose-guided image generation [19, 21, 5, 27, 26, 1, 20]. Initially, this problem was explored under the fully supervised setting [19, 27, 26, 1]. Though promising results have been presented, the training data has to be composed of paired images (i.e., the same person in the same clothing but in different poses). To tackle this data limitation and enable more flexible generation, more recent efforts have been devoted to learning the mapping with unpaired data [21, 5, 20]. However, without "paired" supervision, the results in [21] are far from satisfactory. Disentangling an image into multiple factors (e.g., background / foreground, shape / appearance) is explored in [20, 5], but ignoring the non-rigid deformation of human bodies and the clothing shapes leads to compromised generation quality.
Formally, the key challenges of this unsupervised task are threefold. First, due to the non-rigid nature of the human body, transforming spatially misaligned body parts is difficult for current convolution-based networks. Second, clothing attributes, e.g., sleeve lengths and textures, are generally difficult to preserve during generation, yet these attributes are crucial for human visual perception. Third, the lack of paired training data offers little guidance for establishing effective training objectives.
To address the aforementioned challenges, we propose a new pathway for unsupervised person image generation. Specifically, instead of directly transforming the person image, we propose to transform the semantic parsing between poses. On one hand, translating between person