
Recognizing Emotions From
an Ensemble of Features
Usman Tariq, Student Member, IEEE, Kai-Hsiang Lin, Zhen Li, Xi Zhou, Zhaowen Wang,
Vuong Le, Student Member, IEEE, Thomas S. Huang, Life Fellow, IEEE, Xutao Lv, and Tony X. Han
Abstract—This paper details the authors’ efforts to push the
baseline of emotion recognition performance on the Geneva
Multimodal Emotion Portrayals (GEMEP) Facial Expression
Recognition and Analysis database. Both subject-dependent and
subject-independent emotion recognition scenarios are addressed
in this paper. The approach toward solving this problem involves
face detection, followed by key-point identification, then feature
generation, and, finally, classification. An ensemble of features consisting of hierarchical Gaussianization, scale-invariant feature transform, and coarse motion features has been used. In the classification stage, we used support vector machines.
The classification task has been divided into person-specific and person-independent emotion recognition using face recognition with either manual labels or automatic algorithms. With manual identification of subjects, we achieve classification rates of 100% for person-specific recognition, 66% for person-independent recognition, and 80% overall.
Index Terms—Biometrics, computer vision, emotion recogni-
tion, machine vision.
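As a rough illustration of the pipeline summarized in the abstract (face detection, feature extraction, and support-vector-machine classification), a minimal sketch is given below. The face detector, the placeholder flattened-pixel feature, and the parameter values are assumptions for illustration only and do not correspond to the hierarchical Gaussianization / SIFT / motion-feature ensemble used in this paper.

# Minimal sketch of a face detection -> feature extraction -> SVM pipeline.
# Illustrative only: the crude feature below is a stand-in for the feature
# ensemble described in this paper.
import cv2
import numpy as np
from sklearn.svm import SVC

face_cascade = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def extract_face_feature(image_bgr, size=(64, 64)):
    """Detect the largest face and return a crude flattened-pixel feature."""
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    faces = face_cascade.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None
    x, y, w, h = max(faces, key=lambda f: f[2] * f[3])   # keep largest detection
    crop = cv2.resize(gray[y:y + h, x:x + w], size)
    return crop.astype(np.float32).ravel() / 255.0

def train_emotion_svm(features, labels):
    """Train an RBF-kernel SVM on stacked per-image feature vectors."""
    clf = SVC(kernel="rbf", C=10.0, gamma="scale")
    clf.fit(np.vstack(features), np.asarray(labels))
    return clf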
I. INTRODUCTION
AUTOMATED emotion recognition will soon have a sizeable impact in areas ranging from psychology to
human–computer interaction (HCI) to human–robot interaction
(HRI). For instance, in HRI and HCI, there is an ever-increasing demand to make computers and robots behave in a more human-like manner. Some example works that employ emotion recognition
in HCI and HRI are [1] and [2]. Another application is in
computer-aided automated learning [3]. Here, the computer
should ideally be able to identify the cognitive state of the
student and then act accordingly. For example, if the student
is gloomy, it might tell a joke.
The increasing applications of emotion recognition have attracted a great deal of research in this area in the past decade. Psychologists and linguists hold differing opinions about the relative importance of different cues in human affect judgment [3].
However, there are some studies (e.g., [4]) that indicate that
facial expression in the visual channel is the most effective and
important cue that correlates well with the body and the voice.
In this paper, we also use features extracted from the facial
region.
This work was carried out as part of the Facial Expression Recognition and Analysis Challenge (FERA 2011), held at the 9th IEEE International Conference on Automatic Face and Gesture Recognition (FG 2011). Our results stood out in the final comparison: we ranked first for person-specific results and second in terms of overall performance [5]. It is worthwhile to note that the work [6] that
outperformed us in the overall results may face some limitations
in other testing scenarios, as outlined in Section X.
II. BACKGROUND WORK
Emotion recognition using visual cues has been receiving a
great deal of attention in the past decade. Most existing approaches recognize the six universal basic emotions because of their stability across cultures and ages and because of the availability of corresponding facial expression databases. The choices of features
employed for emotion recognition are classified in [3] into
two main categories, i.e., geometric features and appearance
features. In this section, we closely follow that taxonomy to
review some of the notable works on the topic.
The geometric features are extracted from the shape or salient
point locations of important facial components such as mouth
and eyes. In [7], 58 landmark points are used to construct an
active shape model (ASM); these points are then tracked to perform facial expression recognition. Pantic and Bartlett [8] introduced a set
of more refined features. They utilize facial characteristic points
around the mouth, the eyes, the eyebrows, the nose, and the
chin as geometric features for emotion recognition. In a more
holistic approach, the active appearance model is utilized to
analyze the characteristics of the facial expressions in [9].
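As a loose illustration of geometric features of this kind, the following sketch maps a set of landmark coordinates to a vector of pairwise inter-landmark distances after simple normalization; the normalization and distance choices are assumptions for illustration and are not the exact features of [7]-[9].

# Sketch: turn facial landmark coordinates into a simple geometric feature.
# Assumption: `landmarks` is an (N, 2) array of (x, y) key points around the
# mouth, eyes, eyebrows, nose, and chin (illustrative, not the cited methods).
import numpy as np

def geometric_feature(landmarks):
    pts = np.asarray(landmarks, dtype=np.float64)
    pts = pts - pts.mean(axis=0)              # remove translation
    pts = pts / (np.linalg.norm(pts) + 1e-8)  # rough scale normalization
    # Use all pairwise inter-landmark distances as the feature vector.
    diffs = pts[:, None, :] - pts[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    iu = np.triu_indices(len(pts), k=1)
    return dists[iu]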
When sequences of images are available, the temporal
dynamics of facial actions can be modeled for expression
recognition. In [10], Valstar et al. propose to characterize speed,
intensity, duration, and co-occurrence of facial-muscle activations in video sequences within a parameterized framework, which they then use to decide whether a behavior is deliberate or spontaneous.
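As a rough sketch of such temporal descriptors (not the parameterized framework of [10] itself), the following snippet estimates the peak intensity, mean speed, and active duration of a tracked landmark's displacement over a sequence; the threshold and frame-rate parameters are assumptions for illustration.

# Sketch: crude temporal statistics (speed, intensity, duration) for a single
# tracked landmark over a video sequence.
# Assumptions: `track` is a (T, 2) array of (x, y) positions, `fps` is the
# frame rate, and `active_thresh` is an illustrative activation threshold.
import numpy as np

def temporal_stats(track, fps=25.0, active_thresh=0.5):
    track = np.asarray(track, dtype=np.float64)
    displacement = np.linalg.norm(track - track[0], axis=1)  # vs. first frame
    velocity = np.diff(displacement) * fps                    # change per second
    active = displacement > active_thresh
    return {
        "peak_intensity": float(displacement.max()),
        "mean_speed": float(np.abs(velocity).mean()) if velocity.size else 0.0,
        "duration_s": float(active.sum() / fps),
    }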