top of the L-Softmax loss with weight normalization on a hypersphere manifold. Wang et al. [22] reformulated the softmax loss as a cosine loss by L2-normalizing both the features and the weight vectors to remove radial variations, while Deng et al. proposed the ArcFace loss [23], which uses the arc-cosine function to compute the angle between the current feature and the target weight. Most of these methods aim to improve discrimination based on holistic features. In contrast, we further exploit local facial information to make face features more informative.
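For reference, these margin-based softmax variants share a common structure; the following is a minimal PyTorch sketch of an ArcFace-style classification head, where the scale s and margin m values are illustrative defaults rather than those prescribed in [23]:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcMarginHead(nn.Module):
    """Sketch of an ArcFace-style angular-margin head (illustrative)."""
    def __init__(self, feat_dim, num_classes, s=64.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(num_classes, feat_dim))
        self.s, self.m = s, m  # scale and additive angular margin (assumed values)

    def forward(self, features, labels):
        # L2-normalize features and class weights so logits become cosines.
        cos = F.linear(F.normalize(features), F.normalize(self.weight))
        theta = torch.acos(cos.clamp(-1 + 1e-7, 1 - 1e-7))
        # Add the angular margin m only to the target-class angle.
        one_hot = F.one_hot(labels, cos.size(1)).float()
        logits = self.s * torch.cos(theta + self.m * one_hot)
        return F.cross_entropy(logits, labels)
```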
B. Facial Parsing
Facial parsing estimates the semantic class of each pixel, implicitly providing a semantic segmentation of the face. Warrell and Prince [32] introduced LabelFaces, which
used priors to loosely model the topological structure of
face images. Le et al. [33] proposed an active shape model
that allowed for greater independence among facial compo-
nents and improved the appearance fitting step by a Viterbi
optimization process. Zhou et al. [34] presented an interlinked
convolutional neural network (iCNN), where a special inter-
linking layer was designed to integrate local information and
contextual information efficiently. Luo et al. [35] presented
a hierarchical structure via deep learning, where they used
component-specific segmentors on each component to estimate pixel-wise labels. Because such segmentors generalize poorly under complicated label interactions, an exemplar-based face parsing method [36] was proposed that relies on hand-labeled segmentation maps and a set of sparse keypoint descriptors.
Zhou et al. [37] presented a Fully-Convolutional continuous CRF Neural Network (FC-CNN) architecture to achieve high segmentation accuracy. On the other hand, parsing
information could benefit other facial tasks. Chen et al. [38]
made full use of the geometry prior (e.g., parsing maps) to
super-resolve low-resolution images. Lu et al. [39] advanced the expression synthesis domain by introducing a Couple-Agent Face Parsing based Generative Adversarial Network (CAFP-GAN) that unites the knowledge of facial semantic regions and controllable expression signals. Inspired by these works, we employ facial parsing, which assigns each pixel a probability over semantic classes, to obtain a facial semantic segmentation that reveals facial geometry and complements the appearance information.
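As a concrete illustration (not the specific parser of [34]), per-pixel class probabilities can be read off any segmentation network's output with a softmax over the channel dimension; the segmentor interface below is an assumption for the sketch:

```python
import torch.nn.functional as F

# Hypothetical usage: 'segmentor' is any network mapping a (B, 3, H, W)
# face image to (B, C, H, W) logits, one channel per semantic class
# (skin, brows, eyes, nose, mouth, ...).
def parse_face(segmentor, image):
    logits = segmentor(image)         # (B, C, H, W) raw class scores
    probs = F.softmax(logits, dim=1)  # per-pixel class probabilities
    labels = probs.argmax(dim=1)      # (B, H, W) hard segmentation map
    return probs, labels
```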
III. PROPOSED APPROACH
In this section, we first introduce the motivation and give an overview of our proposed approach, then describe each part of our framework in detail. Lastly, we present the implementation details.
A. Motivations
Deep learning based face recognition methods have proven effective, relying on the discriminative power of advanced networks. Nevertheless, the resulting features are built almost entirely on holistic facial appearance characteristics, so the representation remains insufficient because detailed local information is ignored. For example, the properties of facial components (e.g., eyes and nose) also provide evidence for discerning identities.
Fig. 2. Cosine similarity among faces. I_a and I_b1 are from different identities while I_b1 and I_b2 are the same person with various poses. 'AF', 'SF' and 'Fusion' indicate appearance features, semantic local features and their combination, respectively. With the help of semantic features, the face representation distinguishes the different identities I_a and I_b1 more clearly. On the other hand, the images I_b1 and I_b2 have a large appearance-feature distance due to pose variations, whereas the semantic features capture the characteristics of each facial component, which remain similar across faces of the same person.
To make full use of the local component features, we observe that facial parsing [34] can segment the face into semantic parts, covering rich localized information. The generated local features are potentially complementary to the holistic features.
Fig. 2 shows the cosine similarity scores of two persons with appearance features, semantic features and the fused features, respectively. I_a and I_b1 are different identities while I_b1 and I_b2 denote the same person with various poses. We compute the similarity between the different identities (I_a and I_b1) and the same person (I_b1 and I_b2) with these three types of features. The appearance similarity between I_a and I_b1 is higher than that between I_b1 and I_b2, so with the holistic information alone, the appearance features may lead to a wrong verification. For the semantic parsing features, the similarity between different persons is clearly lower than the appearance similarity. The semantic information generated by facial parsing provides details of facial components, such as big or small eyes, which complements the appearance information; fusing it in reduces the similarity among different persons. On the other hand, the appearance features of I_b1 and I_b2 differ greatly due to the pose variations, while the semantic features reveal the personalized attributes of each component and thus show the similarity among images of the same person. Therefore, our framework targets incorporating holistic and local information to enhance the discriminative ability of face descriptors.
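To make the comparison in Fig. 2 concrete, the three similarity scores can be computed as in the sketch below; the concatenation-based fusion is an assumption for illustration, not necessarily the fusion used in FSENet:

```python
import torch
import torch.nn.functional as F

def cosine_sim(x, y):
    # Cosine similarity between two feature vectors.
    return F.cosine_similarity(x.unsqueeze(0), y.unsqueeze(0)).item()

def fused_sim(af_x, af_y, sf_x, sf_y):
    # Illustrative 'Fusion': concatenate normalized appearance ('AF')
    # and semantic ('SF') features before measuring similarity.
    fx = torch.cat([F.normalize(af_x, dim=0), F.normalize(sf_x, dim=0)])
    fy = torch.cat([F.normalize(af_y, dim=0), F.normalize(sf_y, dim=0)])
    return cosine_sim(fx, fy)
```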
B. Face Segmentor-Enhanced Network
Our proposed FSENet simultaneously exploits global and local information and mainly consists of four parts: a backbone module, a semantic parsing network, a part mask, and a correlation matrix module. The holistic and local features are