Spatio-temporal Analysis for Infrared Facial Expression
Recognition from Videos
Zhilei Liu
School of Computer Science and Technology
Tianjin University
Tianjin, China 300072
zhileiliu@tju.edu.cn
Cuicui Zhang
School of Marine Science and Technology
Tianjin University
Tianjin, China 300072
cuicui.zhang@tju.edu.cn
ABSTRACT
Facial expression recognition (FER) for emotion inference has
become one of the most important research fields in human-
computer interaction. Existing studies on FER mainly focus on
visible images, whose performance may be degraded by varying
lighting conditions. Recent studies have demonstrated the
advantages of infrared thermal images, which reflect temperature
distributions and are robust to lighting changes. In this paper, a
novel FER method based on infrared image sequences is proposed
using spatio-temporal feature analysis and deep Boltzmann
machines (DBM). Firstly, a dense motion field across the infrared
image sequence is generated using an optical flow algorithm. Then,
PCA is applied for dimension reduction and a three-layer DBM
structure is designed for final expression classification. Finally,
the effectiveness of the proposed method is demonstrated through
several experiments conducted on the NVIE database.
CCS Concepts
• Computing methodologies → Computer vision
representations; Image representations;
Keywords
Facial expression recognition; infrared image sequences; optical
flow; deep Boltzmann machine
1. INTRODUCTION
Facial expression recognition (FER) has become an important
area of personalized human-computer interaction [1, 3, 4].
Existing works on FER mainly focus on visible images, whose
performance may be affected by varying lighting conditions.
Recent studies have shown that infrared thermal images (IRTI),
which reflect the temperature distribution of subjects, are less
sensitive to lighting conditions [1, 2]. Infrared expression
recognition has therefore been recognized as a crucial complement
to visible-image-based FER [1, 5, 6]. Existing feature extraction
methods for FER can be roughly divided into two types: static
image based methods and dynamic image sequence (e.g., video)
based methods. Examples of the first type include the Active
Contour Model (ACM) [7], the Active Shape Model (ASM) [8],
the Active Appearance Model (AAM) [9], the Gabor filter [10],
the Elastic Graph Matching (EGM) [11], and the Fisher
Discriminant Analysis method [12]. The second type includes
dense motion field based methods (e.g., [13]) and key feature point
based methods (e.g., [14]). Existing recognition methods include
shallow structure methods such as the Hidden Markov Model
(HMM) [15], the Support Vector Machine (SVM) [16], and
Adaboost [17], as well as deep structure methods such as the Deep
Belief Network (DBN) [26], the Deep Boltzmann Machine (DBM)
[27], the Convolutional Neural Network (CNN) [1, 25], and the
Auto-Encoder (AE) [18]. These methods can be applied to either
visible images or infrared images.
There are also several methods designed specifically for infrared
images. For example, Benjamin Hernandez and Gustavo Olague
[19] used the Gray Level Co-occurrence Matrix (GLCM) to extract
features from infrared images for the recognition of surprise,
happiness, and anger. Guotai Jiang et al. [20] extracted global and
local features from the region of interest (ROI) in infrared images
for facial expression recognition. Yasunari Yoshitomi et al. [21]
used the two-dimensional Discrete Cosine Transform (2D-DCT)
to extract features in the frequency domain for facial expression
recognition. A. Merla and G.L. Romani [22] recognized happiness,
fear, and disgust by analyzing the facial temperature distribution
of 10 subjects. Although these methods, which exploit spatial
facial features, have achieved some success, they operate only on
static images rather than dynamic image sequences. Since facial
expressions are dynamic processes, temporal information is also
very important for recognition. Therefore, a spatio-temporal
method that combines both spatial and temporal features is needed.
To address the problems mentioned above, in this paper we
propose a novel spatio-temporal feature analysis method based on
optical flow and a deep learning model, the Deep Boltzmann
Machine (DBM) [27]. Unlike existing works that perform spatial
feature extraction on static images, spatio-temporal features are
extracted from infrared image sequences in this paper. Firstly, we
use the optical flow estimation method [23] to generate a dense
motion field between each pair of adjacent infrared images. Then,
we use principal component analysis (PCA) [24] for dimension
reduction. Finally, the DBM model is utilized to realize the FER
task.
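To make the first step concrete, the following is a minimal sketch of dense motion field extraction between adjacent frames of an infrared sequence. It assumes OpenCV's Farneback estimator as a stand-in for the optical flow method of [23]; the frame size, sequence length, and parameter values are illustrative only.

```python
# A minimal sketch of the dense motion field step, assuming OpenCV's
# Farneback estimator as a stand-in for the optical flow method of [23].
import cv2
import numpy as np

def dense_motion_field(frames):
    """frames: list of grayscale infrared images (H x W uint8 arrays).
    Returns an (N-1, H, W, 2) array of per-pixel (dx, dy) displacements
    between each pair of adjacent frames."""
    flows = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        # Positional arguments: pyr_scale, levels, winsize, iterations,
        # poly_n, poly_sigma, flags.
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        flows.append(flow)
    return np.stack(flows)

# Random frames stand in for a real 10-frame infrared sequence.
frames = [np.random.randint(0, 256, (120, 160), dtype=np.uint8)
          for _ in range(10)]
motion = dense_motion_field(frames)   # shape: (9, 120, 160, 2)
```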
2. INFRARED FACIAL EXPRESSION
RECOGNITION BASED ON OPTICAL
FLOW AND DBM
The framework of our method is illustrated in Fig. 1. It consists of
three steps: the optical flow algorithm for spatio-temporal dense
motion field extraction, PCA for dimension reduction, and the
DBM for facial expression recognition.
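As a rough illustration of the latter two steps, the sketch below reduces flattened motion fields with PCA and trains a classifier on the result. Standard libraries offer no ready-made DBM, so a logistic regression stands in for the three-layer DBM purely to show the data flow; the data are random placeholders, not NVIE features.

```python
# PCA dimension reduction followed by a stand-in classifier; the paper's
# three-layer DBM would replace the logistic regression in practice.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5000))      # 200 flattened motion fields (placeholder)
y = rng.integers(0, 6, size=200)      # six basic expression labels (placeholder)

model = make_pipeline(PCA(n_components=50),
                      LogisticRegression(max_iter=1000))
model.fit(X, y)
print("training accuracy:", model.score(X, y))
```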