attention importance weighting, ranking regularization, and
noise relabeling. Given a batch of images, a backbone CNN is first used to extract facial features. The self-attention importance weighting module then learns a weight for each image that captures its importance for loss weighting; uncertain facial images are expected to receive low importance weights.
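Although the exact layer design is not specified here, a minimal PyTorch sketch of such a weighting module might look as follows (the linear-plus-sigmoid head, the feature dimension, and all names are our assumptions, not the confirmed SCN implementation):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImportanceWeighting(nn.Module):
    """Hypothetical weighting head: maps each backbone feature
    vector to a scalar importance weight in (0, 1)."""
    def __init__(self, feat_dim=512):
        super().__init__()
        self.fc = nn.Linear(feat_dim, 1)

    def forward(self, features):  # features: (B, feat_dim)
        return torch.sigmoid(self.fc(features)).squeeze(1)  # (B,)

def weighted_ce_loss(logits, labels, weights):
    """Per-sample cross-entropy scaled by the learned weights,
    so low-importance (uncertain) samples contribute less."""
    per_sample = F.cross_entropy(logits, labels, reduction='none')
    return (weights * per_sample).mean()
```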
Further, the ranking regularization module ranks these weights in descending order, splits them into two groups (i.e., high and low importance weights), and regularizes the two groups by enforcing a margin between their average weights. This regularization is implemented as a loss function, termed the Rank Regularization loss (RR-Loss).
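For concreteness, such a margin constraint can be written as

$$\mathcal{L}_{RR} = \max\bigl(0,\ \delta_1 - (\alpha_H - \alpha_L)\bigr),$$

where $\alpha_H$ and $\alpha_L$ denote the mean importance weights of the high- and low-importance groups, respectively, and $\delta_1$ is the margin; the notation here is ours and is meant only as an illustration of the described constraint.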
The ranking regularization module ensures that the first module learns meaningful weights that highlight confident samples (e.g., those with reliable annotations) and suppress uncertain ones (e.g., those with ambiguous annotations). The last module is a careful relabeling module that attempts to relabel samples in the bottom group by comparing the maximum predicted probability with the probability of the given label.
A sample is assigned a pseudo label if its maximum prediction probability exceeds the probability of its given label by a margin threshold.
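In our own notation (an illustrative sketch, not the confirmed formulation), this relabeling rule can be written as

$$y' = \begin{cases} \arg\max_j p_j, & \text{if } \max_j p_j - p_y > \delta_2, \\ y, & \text{otherwise}, \end{cases}$$

where $p_j$ is the predicted probability of class $j$, $y$ is the given label, and $\delta_2$ is the margin threshold.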
In addition, since the main evidence of uncertainty is the incorrect/noisy annotation problem, we collect an extremely noisy FER dataset from the Internet, termed WebEmotion, to investigate the behavior of SCN under extreme uncertainty.
Overall, our contributions can be summarized as follows:
• We pose the uncertainty problem in facial expression recognition and propose a Self-Cure Network to reduce the impact of uncertainties.
• We design a rank regularization that supervises SCN to learn meaningful importance weights, which also serve as a reference for the relabeling module.
• We extensively validate SCN on synthetic FER data and on a new real-world uncertain emotion dataset (WebEmotion) collected from the Internet. SCN achieves 88.14% on RAF-DB, 60.23% on AffectNet, and 89.35% on FERPlus, setting new records on these benchmarks.
2. Related Work
2.1. Facial Expression Recognition
Generally, a FER system consists of three stages, namely face detection, feature extraction, and expression recognition. In the face detection stage, face detectors such as MTCNN [44] and Dlib [2] are used to locate faces in complex scenes, and the detected faces may optionally be aligned. For feature extraction, various methods are
designed to capture the facial geometry and appearance changes caused by facial expressions. According to the feature type, they can be grouped into engineered features and learning-
based features. Engineered features can be further divided into texture-based local features, geometry-based global features, and hybrid features. Texture-based local features mainly include SIFT [34], HOG [6], histograms of LBP [35], Gabor wavelet coefficients [26], etc.
Geometry-based global features mainly rely on the landmark points around the nose, eyes, and mouth. Hybrid feature extraction combines two or more engineered features, which can further enrich the representation. For learned features, Fasel [12] finds that a
shallow CNN is robust to face poses and scales. Tang [37]
and Kahou et al. [21] utilize deep CNNs for feature extraction and win the FER2013 and Emotiw2013 challenges, respectively. Liu et al. [27] propose a Facial Action Units-based CNN architecture for expression recognition. Recently, both Li et al. [25] and Wang et al. [42] have designed region-based attention networks for pose- and occlusion-aware FER, where the regions are cropped either around landmark points or at fixed positions.
2.2. Learning with Uncertainties
Uncertainties in the FER task mainly come from ambiguous facial expressions, low-quality facial images, inconsistent annotations, and incorrect annotations (i.e., noisy labels). In particular, learning with noisy labels has been extensively studied in the computer vision community, while the other aspects are rarely explored. To handle
noisy labels, one intuitive idea is to leverage a small set of
clean data that can be used to assess the quality of the labels
during the training process [40, 23, 8], or to estimate the
noise distribution [36], or to train the feature extractors [3].
Li et al. [23] propose a unified distillation framework using
‘side’ information from a small clean dataset and label relations in a knowledge graph to ‘hedge the risk’ of learning from noisy labels. Veit et al. [41] use a multi-task network that jointly learns to clean noisy annotations and to classify images. Azadi et al. [3] select reliable images via an auxiliary image regularization for deep CNNs with noisy
labels. Other methods do not need a small clean dataset but instead assume extra constraints or distributions on the noisy samples [31], such as a specific loss for randomly flipped labels [33], regularizing deep networks on corrupted labels with a MentorNet [20], or modeling the noise with a softmax layer that connects the latent correct labels to the noisy ones [13, 43]. For the
FER task, Zeng et al. [43] first consider the inconsistent
annotation problem among different FER datasets, and pro-
pose to leverage these uncertainties to improve FER. In con-
trast, our work focuses on suppressing these uncertainties
to learn better facial expression features.