Deep Facial Expression Recognition: A Survey
Shan Li and Weihong Deng*, Member, IEEE

*The authors are with the Pattern Recognition and Intelligent System Laboratory, School of Information and Communication Engineering, Beijing University of Posts and Telecommunications, Beijing, 100876, China. E-mail: {ls1995, whdeng}@bupt.edu.cn.
Abstract—With the transition of facial expression recognition (FER) from laboratory-controlled to challenging in-the-wild conditions
and the recent success of deep learning techniques in various fields, deep neural networks have increasingly been leveraged to learn
discriminative representations for automatic FER. Recent deep FER systems generally focus on two important issues: overfitting
caused by a lack of sufficient training data and expression-unrelated variations, such as illumination, head pose and identity bias. In this
paper, we provide a comprehensive survey on deep FER, including datasets and algorithms that provide insights into these intrinsic
problems. First, we introduce the available datasets that are widely used in the literature and provide accepted data selection and
evaluation principles for these datasets. We then describe the standard pipeline of a deep FER system with the related background
knowledge and suggestions of applicable implementations for each stage. For the state of the art in deep FER, we review existing
novel deep neural networks and related training strategies that are designed for FER based on both static images and dynamic image
sequences, and discuss their advantages and limitations. Competitive performances on widely used benchmarks are also summarized
in this section. We then extend our survey to additional related issues and application scenarios. Finally, we review the remaining
challenges and corresponding opportunities in this field as well as future directions for the design of robust deep FER systems.
Index Terms—Facial Expression Recognition, Facial Expression Datasets, Affect, Deep Learning, Survey.
1 INTRODUCTION
Facial expression is one of the most powerful, natural and
universal signals for human beings to convey their emotional
states and intentions [1], [2]. Numerous studies have been con-
ducted on automatic facial expression analysis because of its
practical importance in sociable robotics, medical treatment, driver
fatigue surveillance, and many other human-computer interaction
systems. In the field of computer vision and machine learning,
various facial expression recognition (FER) systems have been
explored to encode expression information from facial represen-
tations. As early as the twentieth century, Ekman and Friesen [3]
defined six basic emotions based on cross-culture study [4], which
indicated that humans perceive certain basic emotions in the same
way regardless of culture. These prototypical facial expressions
are anger, disgust, fear, happiness, sadness, and surprise. Contempt
was subsequently added as one of the basic emotions [5]. Recently, advanced research in neuroscience and psychology has argued that the model of six basic emotions is culture-specific rather than universal [6].
Although the affect model based on basic emotions is limited in its ability to represent the complexity and subtlety of our daily affective displays [7], [8], [9], and other emotion description models, such as the Facial Action Coding System (FACS) [10] and the continuous model using affect dimensions [11], are considered to represent a wider range of emotions, the categorical model that describes emotions in terms of discrete basic emotions remains the most popular perspective for FER, owing to its pioneering investigations and the direct, intuitive definition of facial expressions. In this survey, we therefore limit our discussion to FER based on the categorical model.
FER systems can be divided into two main categories accord-
ing to the feature representations: static image FER and dynamic
sequence FER. In static-based methods [12], [13], [14], the feature
representation is encoded with only spatial information from the
current single image, whereas dynamic-based methods [15], [16],
[17] consider the temporal relation among contiguous frames in
the input facial expression sequence. On top of these two vision-based methods, other modalities, such as audio and physiological channels, have also been used in multimodal systems [18] to assist expression recognition.
The majority of the traditional methods have used handcrafted
features or shallow learning (e.g., local binary patterns (LBP) [12],
LBP on three orthogonal planes (LBP-TOP) [15], non-negative
matrix factorization (NMF) [19] and sparse learning [20]) for FER.
However, since 2013, emotion recognition competitions such as
FER2013 [21] and Emotion Recognition in the Wild (EmotiW)
[22], [23], [24] have collected relatively sufficient training data
from challenging real-world scenarios, which has implicitly promoted the transition of FER from lab-controlled to in-the-wild settings. Meanwhile, owing to dramatically increased chip processing power (e.g., GPUs) and well-designed network architectures, studies in various fields have shifted to deep learning methods, which have achieved state-of-the-art recognition accuracy and exceeded previous results by a large margin (e.g., [25], [26], [27], [28]). Likewise, given more effective training data for facial expression, deep learning techniques have increasingly been applied to handle the challenging factors of emotion recognition in the wild. Figure 1 illustrates this evolution of FER in terms of algorithms and datasets.
Exhaustive surveys on automatic expression analysis have
been published in recent years [7], [8], [29], [30]. These surveys
have established a set of standard algorithmic pipelines for FER.
However, they focus on traditional methods, and deep learning has rarely been reviewed. Very recently, FER based on deep learning was surveyed in [31], a brief review that lacks an introduction to FER datasets and the technical details of deep FER. Therefore, in this paper, we conduct a systematic review of deep learning for FER tasks based on both static images and videos (image sequences). We aim to give newcomers to this field an overview of the systematic framework and the principal techniques for deep FER.
Fig. 1. The evolution of facial expression recognition in terms of datasets and methods. (Timeline figure, 2007-2017: handcrafted methods such as LBP [12], LBP-TOP [15], NMF [19] and sparse learning [20] give way to deep methods, e.g., Tang's CNN [130] winning FER2013, Kahou et al.'s CNN/DBN/DAE [57] winning EmotiW 2013, Fan et al.'s CNN-LSTM/C3D [108] winning EmotiW 2016, and specialized losses and architectures such as the LP loss, tuplet cluster loss, island loss, HoloNet, PPDN, IACNN and FaceNet2ExpNet; datasets evolve from CK+ and MMI to FER2013, EmotiW, EmotioNet, RAF-DB and AffectNet.)
Despite the powerful feature learning ability of deep learning, problems remain when it is applied to FER. First, deep neural networks require a large amount of training data to avoid overfitting. However, the existing facial expression databases are not sufficient to train the well-known deep-architecture networks that achieved the most promising results in object recognition tasks.
Additionally, high inter-subject variations exist due to different
personal attributes, such as age, gender, ethnic backgrounds and
level of expressiveness [32]. In addition to subject identity bias,
variations in pose, illumination and occlusions are common in
unconstrained facial expression scenarios. These factors are non-
linearly coupled with facial expressions and therefore strengthen
the requirement of deep networks to address the large intra-class
variability and to learn effective expression-specific representa-
tions.
In this paper, we introduce recent advances in research on
solving the above problems for deep FER. We examine the state-
of-the-art results that have not been reviewed in previous survey
papers. The rest of this paper is organized as follows. Frequently
used expression databases are introduced in Section 2. Section 3
identifies three main steps required in a deep FER system and
describes the related background. Section 4 provides a detailed
review of novel neural network architectures and special network
training tricks designed for FER based on static images and
dynamic image sequences. We then cover additional related issues
and other practical scenarios in Section 5. Section 6 discusses
some of the challenges and opportunities in this field and identifies
potential future directions.
2 FACIAL EXPRESSION DATABASES
Having sufficient labeled training data that include as many
variations of the populations and environments as possible is
important for the design of a deep expression recognition system.
In this section, we discuss the publicly available databases that contain basic expressions and are widely used in the reviewed papers for evaluating deep learning algorithms. We also introduce
newly released databases that contain a large amount of affective
images collected from the real world to benefit the training of
deep neural networks. Table 1 provides an overview of these
datasets, including the main reference, number of subjects, number
of image or video samples, collection environment, expression
distribution and additional information.
CK+ [33]: The Extended Cohn-Kanade (CK+) database is the
most extensively used laboratory-controlled database for evaluat-
ing FER systems. CK+ contains 593 video sequences from 123
subjects. The sequences vary in duration from 10 to 60 frames
and show a shift from a neutral facial expression to the peak
expression. Among these videos, 327 sequences from 118 subjects
are labeled with seven basic expression labels (anger, contempt,
disgust, fear, happiness, sadness, and surprise) based on the Facial
Action Coding System (FACS). Because CK+ does not provide
specified training, validation and test sets, the algorithms evaluated
on this database are not uniform. For static-based methods, the most common data selection practice is to extract the last one to three frames (at peak expression) and the first frame (neutral face) of each sequence. The subjects are then divided into n groups for person-independent n-fold cross-validation experiments, where commonly selected values of n are 5, 8 and 10.
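Because CK+ ships no official split, this person-independent partitioning must be implemented by the experimenter. Below is a minimal sketch using scikit-learn's GroupKFold; the arrays are hypothetical placeholders standing in for the output of the frame-selection step described above, not data shipped with CK+.

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Hypothetical arrays produced by the frame-selection step: one row per
# selected frame (peak frames + neutral frame), its expression label, and
# the subject each frame came from. GroupKFold guarantees that no subject
# appears in both the train and test partitions of any fold.
rng = np.random.default_rng(0)
features = rng.normal(size=(1308, 2048))        # e.g., 327 sequences x 4 frames
labels = rng.integers(0, 8, size=1308)          # 7 expressions + neutral
subject_ids = rng.integers(0, 118, size=1308)   # 118 labeled subjects

gkf = GroupKFold(n_splits=10)
for fold, (train_idx, test_idx) in enumerate(
        gkf.split(features, labels, groups=subject_ids)):
    # Verify the person-independence constraint for this fold.
    assert not set(subject_ids[train_idx]) & set(subject_ids[test_idx])
    print(f"fold {fold}: {len(train_idx)} train / {len(test_idx)} test frames")
```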
MMI [34], [35]: The MMI database is laboratory-controlled
and includes 326 sequences from 32 subjects. A total of 213
sequences are labeled with six basic expressions (without “con-
tempt”), and 205 sequences are captured in frontal view. In
contrast to CK+, sequences in MMI are onset-apex-offset labeled,
i.e., the sequence begins with a neutral expression and reaches
peak near the middle before returning to the neutral expression.
Furthermore, MMI has more challenging conditions, i.e., there are
large inter-personal variations because subjects perform the same
expression non-uniformly and many of them wear accessories
(e.g., glasses, mustache). For experiments, the most common
method is to choose the first frame (neutral face) and the three peak
frames in each frontal sequence to conduct person-independent
10-fold cross-validation.
JAFFE [36]: The Japanese Female Facial Expression (JAFFE)
database is a laboratory-controlled image database that contains
213 samples of posed expressions from 10 Japanese females. Each
person has 3-4 images for each of the six basic facial expressions
(anger, disgust, fear, happiness, sadness, and surprise) and one
image with a neutral expression. The database is challenging be-
cause it contains few examples per subject/expression. Typically,
all the images are used for the leave-one-subject-out experiment.
TFD [37]: The Toronto Face Database (TFD) is an amalgamation of several facial expression datasets. TFD contains 112,234 images, 4,178 of which are annotated with one of seven expression labels: anger, disgust, fear, happiness, sadness, surprise and neutral. The faces have already been detected and normalized to a size of 48×48 such that all the subjects' eyes are the same distance apart and have the same vertical coordinates. Five official folds are provided in TFD; each fold contains a training, validation and test set consisting of 70%, 10% and 20% of the images, respectively.
FER2013 [21]: The FER2013 database was introduced during
the ICML 2013 Challenges in Representation Learning. FER2013
is a large-scale and unconstrained database collected automati-
cally by the Google image search API. All images have been
registered and resized to 48×48 pixels after rejecting wrongfully
labeled frames and adjusting the cropped region. FER2013 con-
tains 28,709 training images, 3,589 validation images and 3,589
test images with seven expression labels (anger, disgust, fear,
happiness, sadness, surprise and neutral).
AFEW [48]: The Acted Facial Expressions in the Wild (AFEW) database was first established and introduced in [49] and has served as an evaluation platform for the annual Emotion Recognition In The Wild Challenge (EmotiW) since 2013.
TABLE 1
An overview of the facial expression datasets. P = posed; S = spontaneous; Condit. = collection condition; Elicit. = elicitation method.

Database | Samples | Subjects | Condit. | Elicit. | Expression distribution | Access
CK+ [33] | 593 image sequences | 123 | Lab | P & S | 6 basic expressions plus contempt and neutral | http://www.consortium.ri.cmu.edu/ckagree/
MMI [34], [35] | 740 images and 2,900 videos | 25 | Lab | P | 6 basic expressions plus neutral | https://mmifacedb.eu/
JAFFE [36] | 213 images | 10 | Lab | P | 6 basic expressions plus neutral | http://www.kasrl.org/jaffe.html
TFD [37] | 112,234 images | N/A | Lab | P | 6 basic expressions plus neutral | josh@mplab.ucsd.edu
FER-2013 [21] | 35,887 images | N/A | Web | P & S | 6 basic expressions plus neutral | https://www.kaggle.com/c/challenges-in-representation-learning-facial-expression-recognition-challenge
AFEW 7.0 [24] | 1,809 videos | N/A | Movie | P & S | 6 basic expressions plus neutral | https://sites.google.com/site/emotiwchallenge/
SFEW 2.0 [22] | 1,766 images | N/A | Movie | P & S | 6 basic expressions plus neutral | https://cs.anu.edu.au/few/emotiw2015.html
Multi-PIE [38] | 755,370 images | 337 | Lab | P | Smile, surprised, squint, disgust, scream and neutral | http://www.flintbox.com/public/project/4742/
BU-3DFE [39] | 2,500 images | 100 | Lab | P | 6 basic expressions plus neutral | http://www.cs.binghamton.edu/~lijun/Research/3DFE/3DFE_Analysis.html
Oulu-CASIA [40] | 2,880 image sequences | 80 | Lab | P | 6 basic expressions | http://www.cse.oulu.fi/CMV/Downloads/Oulu-CASIA
RaFD [41] | 1,608 images | 67 | Lab | P | 6 basic expressions plus contempt and neutral | http://www.socsci.ru.nl:8180/RaFD2/RaFD
KDEF [42] | 4,900 images | 70 | Lab | P | 6 basic expressions plus neutral | http://www.emotionlab.se/kdef/
EmotioNet [43] | 1,000,000 images | N/A | Web | P & S | 23 basic or compound expressions | http://cbcsl.ece.ohio-state.edu/dbform_emotionet.html
RAF-DB [44], [45] | 29,672 images | N/A | Web | P & S | 6 basic expressions plus neutral and 12 compound expressions | http://www.whdeng.cn/RAF/model1.html
AffectNet [46] | 450,000 images (labeled) | N/A | Web | P & S | 6 basic expressions plus neutral | http://mohammadmahoor.com/databases-codes/
ExpW [47] | 91,793 images | N/A | Web | P & S | 6 basic expressions plus neutral | http://mmlab.ie.cuhk.edu.hk/projects/socialrelation/index.html
AFEW contains video clips collected from different movies, with spontaneous expressions, various head poses, occlusions and illuminations. AFEW is a temporal and multimodal database that provides vastly different environmental conditions in both audio and video. Samples are labeled with seven expressions: anger, disgust, fear, happiness, sadness, surprise and neutral. The expression annotations have been continuously updated, and reality TV show data have been progressively added. AFEW 7.0 in EmotiW 2017 [24] is divided into three data partitions that are independent in terms of subject and movie/TV source: Train (773 samples), Val (383 samples) and Test (653 samples), which ensures that data in the three sets belong to mutually exclusive movies and actors.
SFEW [50]: The Static Facial Expressions in the Wild
(SFEW) was created by selecting static frames from the AFEW
database by computing key frames based on facial point clustering.
The most commonly used version, SFEW 2.0, was the bench-
marking data for the SReco sub-challenge in EmotiW 2015 [22].
SFEW 2.0 has been divided into three sets: Train (958 samples),
Val (436 samples) and Test (372 samples). Each of the images is
assigned to one of seven expression categories, i.e., anger, disgust,
fear, neutral, happiness, sadness, and surprise. The expression
labels of the training and validation sets are publicly available,
whereas those of the testing set are held back by the challenge
organizer.
Multi-PIE [38]: The CMU Multi-PIE database contains
755,370 images from 337 subjects under 15 viewpoints and 19
illumination conditions in up to four recording sessions. Each facial
image is labeled with one of six expressions: disgust, neutral,
scream, smile, squint and surprise. This dataset is typically used
for multiview facial expression analysis.
BU-3DFE [39]: The Binghamton University 3D Facial Ex-
pression (BU-3DFE) database contains 606 facial expression se-
quences captured from 100 people. For each subject, the six universal facial expressions (anger, disgust, fear, happiness, sadness and surprise) are elicited in various manners and at multiple intensities.
Similar to Multi-PIE, this dataset is typically used for multiview
3D facial expression analysis.
Oulu-CASIA [40]: The Oulu-CASIA database includes 2,880
image sequences collected from 80 subjects labeled with six
basic emotion labels: anger, disgust, fear, happiness, sadness, and
surprise. Each of the videos is captured with one of two imaging
systems, i.e., near-infrared (NIR) or visible light (VIS), under
three different illumination conditions. Similar to CK+, the first
frame is neutral and the last frame has the peak expression.
Typically, only the last three peak frames and the first frame
(neutral face) from the 480 videos collected by the VIS System
under normal indoor illumination are employed for 10-fold cross-
validation experiments.
RaFD [41]: The Radboud Faces Database (RaFD) is
laboratory-controlled and has a total of 1,608 images from 67
subjects with three different gaze directions, i.e., front, left and
right. Each sample is labeled with one of eight expressions: anger,
contempt, disgust, fear, happiness, sadness, surprise and neutral.
KDEF [42]: The laboratory-controlled Karolinska Directed
Emotional Faces (KDEF) database was originally developed for
use in psychological and medical research. KDEF consists of
images of 70 actors captured from five different angles, labeled with the six basic facial expressions plus neutral.
In addition to these commonly used datasets for basic emo-
tion recognition, several well-established and large-scale publicly
available facial expression databases collected from the Internet
that are suitable for training deep neural networks have emerged
in the last two years.
EmotioNet [43]: EmotioNet is a large-scale database with one
million facial expression images collected from the Internet. A
total of 950,000 images were annotated by the automatic action
unit (AU) detection model in [43], and the remaining 25,000
images were manually annotated with 11 AUs. The second track of
the EmotioNet Challenge [51] provides six basic expressions and
ten compound expressions [52], and 2,478 images with expression
labels are available.
RAF-DB [44], [45]: The Real-world Affective Face Database
(RAF-DB) is a real-world database that contains 29,672 highly
diverse facial images downloaded from the Internet. With man-
ually crowd-sourced annotation and reliable estimation, seven
basic and eleven compound emotion labels are provided for the
samples. Specifically, 15,339 images from the basic emotion set
are divided into two groups (12,271 training samples and 3,068
testing samples) for evaluation.
AffectNet [46]: AffectNet contains more than one million
images from the Internet that were obtained by querying different
search engines using emotion-related tags. It is by far the largest database that provides facial expressions in two different emotion models (the categorical and dimensional models), and 450,000 of its images have been manually annotated with labels for eight basic expressions.
ExpW [47]: The Expression in-the-Wild Database (ExpW)
contains 91,793 faces downloaded using Google image search.
Each of the face images was manually annotated as one of the
seven basic expression categories. Non-face images were removed
in the annotation process.
3 DEEP FACIAL EXPRESSION RECOGNITION
In this section, we describe the three main steps that are common
in automatic deep FER, i.e., pre-processing, deep feature learning
and deep feature classification. We briefly summarize the widely used algorithms for each step and, drawing on the referenced papers, recommend existing state-of-the-art implementations as best practices.
3.1 Pre-processing
Variations that are irrelevant to facial expressions, such as different
backgrounds, illuminations and head poses, are fairly common in
unconstrained scenarios. Therefore, before training the deep neural
network to learn meaningful features, pre-processing is required
to align and normalize the visual semantic information conveyed
by the face.
3.1.1 Face alignment
Face alignment is a traditional pre-processing step in many face-related recognition tasks. Below, we list some well-known approaches and publicly available implementations that are widely used in deep FER.
TABLE 2
Summary of different types of face alignment detectors that are widely used in deep FER models. (Speed and performance are given per group, as in the original merged table cells.)

Type | Detector | # points | Real-time | Speed | Performance | Used in
Holistic | AAM [53] | 68 | ✗ | fair | poor generalization | [54], [55]
Part-based | MoT [56] | 39/68 | ✗ | slow/fast | good | [57], [58]
Part-based | DRMF [59] | 66 | ✗ | slow/fast | good | [60], [61]
Cascaded regression | SDM [62] | 49 | ✓ | fast/very fast | good/very good | [16], [63]
Cascaded regression | 3000 fps [64] | 68 | ✓ | fast/very fast | good/very good | [55]
Cascaded regression | Incremental [65] | 49 | ✓ | fast/very fast | good/very good | [66]
Deep learning | Cascaded CNN [67] | 5 | ✓ | fast | good/very good | [68]
Deep learning | MTCNN [69] | 5 | ✓ | fast | good/very good | [70], [71]
Given a series of training data, the first step is to detect the
face and then to remove background and non-face areas. The
Viola-Jones (V&J) face detector [72] is a classic and widely
employed implementation for face detection, which is robust and
computationally simple for detecting near-frontal faces.
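For reference, the V&J detector is exposed in OpenCV through its Haar-cascade interface. The following is a minimal sketch; the image path is a placeholder, and the detection parameters are common defaults rather than values prescribed by the surveyed papers.

```python
import cv2

# Load the pre-trained frontal-face Haar cascade shipped with OpenCV.
cascade_path = cv2.data.haarcascades + "haarcascade_frontalface_default.xml"
detector = cv2.CascadeClassifier(cascade_path)

img = cv2.imread("face.jpg")                  # placeholder input image
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)  # V&J operates on grayscale

# detectMultiScale scans the image at several scales; these parameter
# values are common defaults, not settings from the referenced papers.
faces = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5,
                                  minSize=(30, 30))
for (x, y, w, h) in faces:
    face_crop = img[y:y + h, x:x + w]         # discard background/non-face area
```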
Although face detection is the only indispensable procedure to
enable feature learning, further face alignment using the coordi-
nates of localized landmarks can substantially enhance the FER
performance [14]. This step is crucial because it can reduce the
variation in face scale and in-plane rotation. Table 2 lists facial landmark detection algorithms that are widely used in deep FER and
compares them in terms of efficiency and performance. The Active
Appearance Model (AAM) [53] is a classic generative model that
optimizes the required parameters from holistic facial appearance
and global shape patterns. In discriminative models, the mixtures
of trees (MoT) structured models [56] and the discriminative
response map fitting (DRMF) [59] use part-based approaches that
represent the face via the local appearance information around
each landmark. Furthermore, a number of discriminative models
directly use a cascade of regression functions to map the image
appearance to landmark locations and have shown better results,
e.g., the supervised descent method (SDM) [62] implemented
in IntraFace [73], the face alignment 3000 fps [64], and the
incremental face alignment [65]. Recently, deep networks have been widely exploited for face alignment. The cascaded CNN [67] is an early work that predicts landmarks in a cascaded fashion. Building on it, the Tasks-Constrained Deep Convolutional Network (TCDCN) [74] and the Multi-task CNN (MTCNN) [69] further leverage multi-task learning to improve performance. In general, cascaded regression has become the most popular and state-of-the-art approach to face alignment owing to its high speed and accuracy.
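Whatever detector produces the landmarks, the alignment step itself typically reduces to a similarity transform that removes in-plane rotation and scale differences. Below is a minimal sketch that aligns a face from its two eye centers; the eye coordinates are assumed to come from any detector in Table 2, and the output size and template constants are illustrative assumptions, not values from the surveyed papers.

```python
import cv2
import numpy as np

def align_by_eyes(img, left_eye, right_eye, out_size=224,
                  eye_y=0.35, eye_dist=0.5):
    """Warp `img` so the eye centers land at canonical template positions.

    `left_eye`/`right_eye` are (x, y) landmark coordinates from any facial
    landmark detector; the template parameters here are illustrative.
    """
    lx, ly = float(left_eye[0]), float(left_eye[1])
    rx, ry = float(right_eye[0]), float(right_eye[1])
    # In-plane rotation angle of the eye line, and the scale that maps the
    # current inter-ocular distance to the canonical one.
    angle = np.degrees(np.arctan2(ry - ly, rx - lx))
    scale = (out_size * eye_dist) / np.hypot(rx - lx, ry - ly)
    center = ((lx + rx) / 2.0, (ly + ry) / 2.0)   # rotate about eye midpoint
    M = cv2.getRotationMatrix2D(center, angle, scale)
    # Translate the eye midpoint to its canonical location in the output crop.
    M[0, 2] += out_size * 0.5 - center[0]
    M[1, 2] += out_size * eye_y - center[1]
    return cv2.warpAffine(img, M, (out_size, out_size))
```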
In contrast to using only one detector for face alignment, some methods combine multiple detectors for better landmark estimation when processing faces in challenging unconstrained environments. Yu et al. [75] concatenated three different facial landmark detectors to complement each other. Kim et al. [76] considered different inputs (the original image and a histogram-equalized image) and different face detection models (V&J [72] and MoT [56]), and selected the landmark set with the highest confidence provided by IntraFace [73].
3.1.2 Data augmentation
Deep neural networks require sufficient training data to ensure
generalizability to a given recognition task. However, most pub-
licly available databases for FER do not have a sufficient quantity
of images for training.

Fig. 2. The general pipeline of deep facial expression recognition systems. (Figure: input images and sequences are pre-processed by face alignment, data augmentation and face normalization for illumination and pose; deep features are then learned with networks such as CNN, DBN, DAE, RNN and GAN and classified into one of the emotion labels, e.g., anger, contempt, disgust, fear, happiness, neutral, sadness and surprise.)

Therefore, data augmentation is a vital
step for deep FER. Data augmentation techniques can be divided
into two groups: on-the-fly data augmentation and offline data
augmentation.
Usually, on-the-fly data augmentation is embedded in deep learning toolkits to alleviate overfitting. During training, the input samples are randomly cropped from the four corners and center of the image and then flipped horizontally, which can result in a dataset ten times larger than the original training data. Two prediction modes are commonly adopted during testing: either only the center patch of the face is used for prediction (e.g., [61], [77]) or the prediction value is averaged over all ten crops (e.g., [76], [78]).
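In PyTorch-style toolkits, this scheme maps onto a few standard transforms. The sketch below assumes 48×48 inputs cropped to 44×44 (illustrative sizes, e.g., for FER2013-style data) and shows both the training-time augmentation and the ten-crop averaged prediction mode.

```python
import torch
from torchvision import transforms

# Training: random crop + horizontal flip, the elementary on-the-fly scheme.
train_tf = transforms.Compose([
    transforms.RandomCrop(44),          # e.g., 44x44 crops from 48x48 inputs
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

# Testing, averaged mode: four corners + center, each with its mirror image.
test_tf = transforms.Compose([
    transforms.TenCrop(44),
    transforms.Lambda(lambda crops: torch.stack(
        [transforms.functional.to_tensor(c) for c in crops])),
])

# At test time each image yields a (10, C, 44, 44) batch:
#   logits = model(test_tf(img))        # (10, num_classes)
#   pred = logits.mean(dim=0).argmax()  # average the ten-crop predictions
```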
Besides elementary on-the-fly data augmentation, various offline data augmentation operations have been designed to further expand the data in both size and diversity. The most frequently used operations include random perturbations and transforms, e.g., rotation, shifting, skew, scaling, noise, contrast and color jittering. For example, common noise models, such as salt & pepper and speckle noise [79] and Gaussian noise [80], [81], are employed to enlarge the data size. For contrast transformation, the saturation and value (the S and V components of the HSV color space) of each pixel are changed [70] for data augmentation. Combinations of multiple operations can generate more unseen training samples and make the network more robust to deviated and rotated faces. In [82], the authors applied five image appearance filters (disk, average, Gaussian, unsharp and motion filters) and six affine transform matrices that were formalized by adding slight geometric transformations to the identity matrix. In [75], a more comprehensive affine transform matrix was proposed to randomly generate images that vary in terms of rotation, skew and scale; a sketch of this style of augmentation is given at the end of this subsection. Furthermore, deep learning based technology can be applied for data augmentation. For example, a synthetic data generation system with a 3D convolutional neural network (CNN) was created in [83] to generate faces with different levels of saturation in expression. The generative adversarial network (GAN) [84] can also be applied to augment data by generating diverse appearances varying in poses and expressions (see Section 4.1.7).
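To make the affine-matrix flavor of offline augmentation concrete, the sketch below jitters the identity transform with small rotation, skew and scale factors, in the spirit of [75], [82]; the perturbation ranges are illustrative assumptions, not values from those papers.

```python
import cv2
import numpy as np

def random_affine(img, rng, max_rot=10.0, max_skew=0.1, max_scale=0.1):
    """Apply a slight random affine transform: rotation, skew and scale.

    Perturbation ranges are illustrative, not taken from the surveyed papers.
    """
    h, w = img.shape[:2]
    theta = np.radians(rng.uniform(-max_rot, max_rot))
    skew = rng.uniform(-max_skew, max_skew)
    s = 1.0 + rng.uniform(-max_scale, max_scale)
    # Slightly perturbed identity: uniform scale times rotation, plus a shear.
    A = s * np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]], dtype=np.float32)
    A[0, 1] += skew                      # horizontal shear term
    # Choose the translation so the image center stays fixed: t = c - A @ c.
    c = np.array([w / 2.0, h / 2.0], dtype=np.float32)
    t = c - A @ c
    M = np.hstack([A, t[:, None]])       # 2x3 matrix for warpAffine
    return cv2.warpAffine(img, M, (w, h), borderMode=cv2.BORDER_REFLECT)

rng = np.random.default_rng(0)
img = cv2.imread("face.jpg")             # placeholder aligned face image
augmented = [random_affine(img, rng) for _ in range(10)]  # 10 extra samples
```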
3.1.3 Face normalization
Variations in illumination and head poses can introduce large
changes in images and hence impair the FER performance.
Therefore, we introduce two typical face normalization methods
to ameliorate these variations: illumination normalization and
pose normalization (frontalization).
Illumination normalization: Illumination and contrast can vary across images, even of the same person with the same expression, especially in unconstrained environments, which can result in large intra-class variances. In [60], several frequently used illumination normalization algorithms, namely, isotropic diffusion (IS)-based normalization, discrete cosine transform (DCT)-based normalization [85] and difference of Gaussians (DoG), were evaluated for illumination normalization. And [86] employed homomorphic filtering based normalization, which has been reported to yield the most consistent results among these techniques, to remove illumination effects. Furthermore, related studies have shown that histogram equalization combined with illumination normalization results in better face recognition performance than illumination normalization on its own. Many studies in the deep FER literature (e.g., [75], [79], [87], [88]) have employed histogram equalization to increase the global contrast of images for pre-processing. This method is effective when the brightnesses of the background and foreground are similar. However, directly applying histogram equalization may overemphasize local contrast. To solve this problem, [89] proposed a weighted summation approach that combines histogram equalization and linear mapping. And in [79], the authors compared three different methods: global contrast normalization (GCN), local normalization and histogram equalization; GCN and histogram equalization were reported to achieve the best accuracy for the training and testing steps, respectively.
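As a concrete illustration of this step, the sketch below applies OpenCV's histogram equalization and blends it with a simple linear (min-max) mapping, in the spirit of the weighted summation of [89]; the weight `alpha` and the min-max mapping are illustrative assumptions rather than that paper's exact formulation.

```python
import cv2
import numpy as np

def normalize_illumination(gray, alpha=0.75):
    """Blend histogram equalization with a plain linear (min-max) mapping.

    Pure equalization can overemphasize local contrast; weighting it against
    a linear mapping tempers that effect. `alpha` is an illustrative weight.
    """
    equalized = cv2.equalizeHist(gray).astype(np.float32)
    # Linear mapping: stretch intensities to the full [0, 255] range.
    lo, hi = float(gray.min()), float(gray.max())
    linear = (gray.astype(np.float32) - lo) * (255.0 / max(hi - lo, 1.0))
    out = alpha * equalized + (1.0 - alpha) * linear
    return np.clip(out, 0, 255).astype(np.uint8)

gray = cv2.imread("face.jpg", cv2.IMREAD_GRAYSCALE)   # placeholder input
normalized = normalize_illumination(gray)
```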
Pose normalization: Considerable pose variation is another common and intractable problem in unconstrained settings. Some studies have employed pose normalization techniques to yield frontal facial views for FER.