Comp-GAN: Compositional Generative Adversarial Network in Synthesizing and Recognizing Facial Expression

Wenxuan Wang^1, Qiang Sun^1, Yanwei Fu^1#, Tao Chen^1, Chenjie Cao^2, Ziqi Zheng^2, Guoqiang Xu^2, Han Qiu^2, Yu-Gang Jiang^1#, Xiangyang Xue^1
^1 Fudan University, ^2 Ping An OneConnect
ABSTRACT
Facial expressions are important in understanding our social interactions, so the ability to recognize them enables novel multimedia applications. With the advance of recent deep architectures, research on facial expression recognition has achieved great progress. However, these models still suffer from a lack of sufficient and diverse high-quality training faces, vulnerability to facial variations, and the restriction to recognizing only a limited number of basic emotion types. To tackle these problems, this paper proposes a novel end-to-end Compositional Generative Adversarial Network (Comp-GAN) that is able to synthesize new face images with specified poses and desired facial expressions; such synthesized images can be further utilized to help train a robust and generalized expression recognition model. Essentially, Comp-GAN can dynamically change the expression and pose of faces according to the input images while keeping the identity information. Specifically, the generator has two major components: one for generating images with the desired expression and the other for changing the pose of faces. Furthermore, a face reconstruction learning process is applied to re-generate the input image and constrain the generator to preserve key information such as facial identity. For the first time, various one/zero-shot facial expression recognition tasks have been created. We conduct extensive experiments to show that the images generated by Comp-GAN are helpful in improving the performance of one/zero-shot facial expression recognition.
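As a rough Python/PyTorch illustration of this compositional design, the generator can be sketched as two chained conditional editing sub-networks plus a reconstruction constraint. This is a minimal sketch, not the authors' implementation: the module names, layer sizes, and conditioning scheme below are all assumptions.

    import torch
    import torch.nn as nn

    class EditBlock(nn.Module):
        """Minimal conditional encoder-decoder; the condition vector is
        broadcast and concatenated as extra input channels (assumed)."""
        def __init__(self, img_channels, cond_dim):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv2d(img_channels + cond_dim, 64, 4, 2, 1), nn.ReLU(),
                nn.ConvTranspose2d(64, img_channels, 4, 2, 1), nn.Tanh(),
            )

        def forward(self, x, cond):
            b, _, h, w = x.shape
            cond_map = cond.view(b, -1, 1, 1).expand(b, cond.size(1), h, w)
            return self.net(torch.cat([x, cond_map], dim=1))

    class CompGenerator(nn.Module):
        """Compositional generator: an expression-editing sub-network
        followed by a pose-editing sub-network, as the abstract describes."""
        def __init__(self, img_channels=3, expr_dim=7, pose_dim=5):
            super().__init__()
            self.expr_editor = EditBlock(img_channels, expr_dim)
            self.pose_editor = EditBlock(img_channels, pose_dim)

        def forward(self, x, expr_code, pose_code):
            x = self.expr_editor(x, expr_code)      # edit expression first
            return self.pose_editor(x, pose_code)   # then edit pose

    # Reconstruction constraint (sketch): map the edited image back to the
    # original expression/pose codes and penalize the L1 distance to the
    # input, which encourages the generator to preserve facial identity.
    def reconstruction_loss(G, x, expr0, pose0, expr1, pose1):
        edited = G(x, expr1, pose1)
        recon = G(edited, expr0, pose0)
        return (recon - x).abs().mean()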
# indicates corresponding authors.
Wenxuan Wang, Yu-Gang Jiang, and Xiangyang Xue are with the Shanghai Key Lab of Intelligent Information Processing, School of Computer Science, Fudan University; Qiang Sun is with the Academy for Engineering & Technology, Fudan University; Tao Chen is with the School of Information Science and Technology, Fudan University; and Yanwei Fu is with the School of Data Science, and Fudan-Xinzailing Joint Research Centre for Big Data, Fudan University. {wxwang17, 18110860051, yanweifu, eetchen, xyxue, ygj}@fudan.edu.cn.
Chenjie Cao, Ziqi Zheng, Guoqiang Xu, and Han Qiu are with Ping An OneConnect. {caochenjie948, zhengziqi356, xuguoqiang371, hannaqiu}@pingan.com.
This work was supported in part by NSFC (No. 61572138 & No. U1611461), and STCSM Project (19ZR1471800).
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.
MM ’19, October 21–25, 2019, Nice, France
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6889-6/19/10... $15.00
https://doi.org/10.1145/3343031.3351032
CCS CONCEPTS
• Computing methodologies → Activity recognition and un-
derstanding; Image representations; Visual inspection.
KEYWORDS
Facial Expression, Generative Adversarial Network
ACM Reference Format:
Wenxuan Wang, Qiang Sun, Yanwei Fu, Tao Chen, Chenjie Cao, Ziqi Zheng, Guoqiang Xu, Han Qiu, Yu-Gang Jiang, and Xiangyang Xue. 2019. Comp-GAN: Compositional Generative Adversarial Network in Synthesizing and Recognizing Facial Expression. In Proceedings of the 27th ACM International Conference on Multimedia (MM '19), October 21–25, 2019, Nice, France. ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/3343031.3351032
1 INTRODUCTION
Facial expression, as one important facial attribute, plays a key role in communication [20, 37] and can reflect the emotional state and intention of humans. Building a system capable of automatically recognizing facial expressions from media data has been an important research field over the past few years. Such a system can enable various multimedia applications in real-world scenarios, such as medical testing, education, security, driver fatigue surveillance, and many other human-computer interactions.
With the advance of recent deep architectures, several pilot studies [12, 24, 26] have investigated the possibility of learning representative deep facial emotion features from data. Consequently, and as expected, their results show that deep-feature-based expression recognition (ER) systems indeed outperform the traditional hand-crafted-feature-based ER models [15, 23, 27]. Despite the encouraging advances in these ER works, there are still several key challenges in extending facial ER systems to real-world applications.
(1) Lack of sufficient and diverse high-quality training data. The annotation task for facial expression generally requires devoted contributions from experts, and the labeling procedure is much more difficult and time-consuming than labeling image classes [2]. This poses a severe problem for training deep ER models in general. To bypass the problem of insufficient training data, the typical solution is to first pre-train a model on a large-scale auxiliary dataset for a different recognition task (e.g., ImageNet [13], CASIA WebFace [41]), and then fine-tune the model on the ER task of the target dataset. The performance of this approach is very sensitive to the relation between the recognition task of the auxiliary dataset and the ER task of the target dataset.
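A minimal sketch of this pre-train-then-fine-tune recipe is shown below. The torchvision ResNet-18 backbone with ImageNet weights, the 7-class expression label set, and the freeze-the-backbone policy are all illustrative assumptions, not choices made by the paper.

    import torch.nn as nn
    from torchvision import models

    # Start from a backbone pre-trained on a large auxiliary dataset
    # (here ImageNet, via the weights shipped with torchvision).
    backbone = models.resnet18(weights=models.ResNet18_Weights.IMAGENET1K_V1)

    # Freeze the pre-trained layers so fine-tuning only adapts the new head;
    # unfreezing upper blocks is a common variant when more ER data exists.
    for p in backbone.parameters():
        p.requires_grad = False

    # Replace the 1000-way ImageNet classifier with an ER head
    # (e.g., 7 basic expressions) and train it on the target ER dataset.
    num_expressions = 7
    backbone.fc = nn.Linear(backbone.fc.in_features, num_expressions)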
On the other hand, data augmentation is widely utilized to enlarge the training dataset. In ER tasks, several GAN-based methods have been applied to synthesize faces with different expressions [40], poses [16], and identities [1], respectively. However, they do not properly preserve the identity and expression