dynamic nature of facial actions, individually recognizing each AU intensity is neither accurate nor reliable for spontaneous facial expressions. Understanding spontaneous facial expression requires not only improving facial motion observations but also, more importantly, exploiting the spatiotemporal interactions among facial motions, since the coherent, coordinated, and synchronized interactions among AUs produce a meaningful facial expression.
Some previous studies focused on exploiting the semantic and dynamic relationships among facial actions [35,36,25]. Lien et al. [35] and Valstar et al. [36] employed a set of Hidden Markov Models (HMMs) to represent the evolution of facial actions in time. Tong et al. [25,37] constructed a Dynamic Bayesian Network (DBN) based model to further exploit the semantic and temporal dependencies among facial actions. However, these works are all limited to detecting the presence or absence of AUs, mainly in posed expressions. Recently, some works have exploited the interactions among AU intensity values. Sandbach et al. [58] employed a Markov Random Field to model the static correlations among AU intensities, which improves the recognition accuracy compared to regressors, i.e., SVRs and Directed Acyclic Graph Support Vector Machines. Conditional Ordinal Random Fields (CORF) were extended in [59] for application to AU intensity estimation. However, this method [59] can only estimate the intensity of an AU when the presence of the AU is already known. Baltrušaitis et al. [57] combined ANNs and Continuous Conditional Random Fields (CCRF) into Continuous Conditional Neural Fields (CCNF) for structured regression in AU intensity estimation, where each hidden component consists of a neural layer.
In this work, we construct a DBN to systematically model the
spatiotemporal interactions among multi-level AU intensities.
Advanced machine learning techniques are employed to train
the framework from both subjective prior knowledge and training
data. The proposed method differs from previous works [25,37] in
both theory and applications. Theoretically, this paper focuses on
exploiting the AU intensity correlations and modeling the spatiotemporal dependencies among multi-level AU intensities. In terms
of applications, the focus of this paper is to design and develop an
automatic system to measure the intensity of spontaneous facial
action units, which is much more challenging than detecting
posed facial action units.
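To make the modeling concrete, the sketch below shows how a two-slice DBN over a few AU intensity variables could be assembled with the pgmpy library. The particular AUs and edges are illustrative placeholders chosen for exposition; they are not the structure learned by our method in Section 4.

# A minimal two-slice DBN over AU intensity variables using pgmpy.
# The AUs and edges below are hypothetical placeholders, not the
# structure learned from data in Section 4.
from pgmpy.models import DynamicBayesianNetwork as DBN

dbn = DBN()
# Intra-slice ("spatial") edges: dependencies among AU intensities
# within one frame, e.g., cheek raiser (AU6) with lip corner puller
# (AU12) in a smile.
dbn.add_edges_from([
    (("AU6", 0), ("AU12", 0)),
    (("AU1", 0), ("AU2", 0)),
])
# Inter-slice ("temporal") edges: each AU's intensity in slice 1
# depends on its own intensity in slice 0.
dbn.add_edges_from([
    (("AU1", 0), ("AU1", 1)),
    (("AU2", 0), ("AU2", 1)),
    (("AU6", 0), ("AU6", 1)),
    (("AU12", 0), ("AU12", 1)),
])
print(dbn.edges())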
Fig. 2 gives the flowchart of the proposed online AU intensity measuring system, which consists of two independent but collaborative phases: image observation extraction and DBN inference.
First, we detect the face and 66 facial landmark points in videos
automatically. Then we register the images according to the
detected facial landmark points. HOG and Gabor features are
employed to describe local appearance changes of the face. After
that, SVM classification produces an observation score for each AU
intensity individually. Given the image observations for all AUs, the
AU intensity recognition is accomplished through probabilistic
inference by systematically integrating the image observation with
the proposed DBN model.
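To summarize the data flow, a minimal Python skeleton of this two-phase pipeline is given below. The stage callables (detect_landmarks, register_face, extract_features, dbn_infer) and the per-AU SVMs are passed in as placeholders standing for the components described in Sections 3 and 4; this is an illustrative sketch, not our released implementation.

import numpy as np

def measure_au_intensities(frames, detect_landmarks, register_face,
                           extract_features, svms, dbn_infer):
    """Estimate AU intensities for each frame of a video (list of images).
    All stage callables are placeholders for the components of Fig. 2."""
    estimates = []
    for frame in frames:
        landmarks = detect_landmarks(frame)      # 66 facial landmark points
        face = register_face(frame, landmarks)   # warp into reference frame
        features = extract_features(face)        # HOG + Gabor descriptor
        # Phase 1: an independent observation score for each AU intensity.
        observations = {au: svm.decision_function(features[np.newaxis])
                        for au, svm in svms.items()}
        # Phase 2: probabilistic inference fuses these observations with
        # the DBN's spatiotemporal AU dependencies.
        estimates.append(dbn_infer(observations))
    return estimates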
The remainder of the paper is organized as follows. Section 3 describes the AU observation extraction method. In Section 4, we build the DBN model for AU intensity recognition, including BN model structure learning (Section 4.1), DBN model parameter learning (Section 4.3), and DBN inference (Section 4.4). Section 5 presents our experimental results and discussion, and Section 6 concludes the paper.
3. AU intensity observation extraction
In this section we describe our AU intensity image observation extraction method, which consists of face registration, facial image representation, dimensionality reduction, and SVM classification.
3.1. Face registration
Image registration is a commonly used technique for aligning similar data (i.e., a reference and a sensed image). To register two images, a set of landmark points is often used to represent and align them. In our study, we used the 66 landmark points of the DISFA database to represent the locations of the important facial components [9]. To obtain the reference landmark points, we averaged the 66 landmark points over the entire training set. A 2D similarity transformation and bilinear interpolation were then used to transform each new image into the reference coordinate system. The registered images are masked to extract the facial regions and resized to 128 × 108 pixels.
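As an illustration of this step, the sketch below uses scikit-image to estimate the 2D similarity transform from a frame's detected landmarks to the mean reference landmarks, and to warp the frame with bilinear interpolation. The function and variable names, and the choice of scikit-image, are our assumptions for exposition rather than a description of the original implementation.

import numpy as np
from skimage.transform import SimilarityTransform, warp

def register_face(image, landmarks, reference_landmarks,
                  out_shape=(128, 108)):
    """Warp `image` so its 66 landmarks align with the reference shape."""
    tform = SimilarityTransform()
    tform.estimate(landmarks, reference_landmarks)  # detected -> reference
    # warp() maps each output pixel back through the inverse transform;
    # order=1 selects bilinear interpolation.
    return warp(image, tform.inverse, output_shape=out_shape, order=1)

# The reference shape is the mean of the 66 landmark points over the
# training set; with train_landmarks of shape (n_images, 66, 2):
# reference_landmarks = train_landmarks.mean(axis=0)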
3.2. Facial image representation
After registering the facial images, we utilized two well-known feature extraction techniques that are capable of representing appearance information: Histograms of Oriented Gradients (HOG) and localized Gabor features, which are described below.
Fig. 2. The flowchart of the proposed online AU intensity recognition system.
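As a concrete (if simplified) sketch of these two descriptors, the code below computes a HOG vector and pooled Gabor filter responses with scikit-image and SciPy. The parameter values (orientations, cell sizes, Gabor frequencies) and the mean/variance pooling are illustrative assumptions, not the exact settings used in our experiments.

import numpy as np
from scipy.ndimage import convolve
from skimage.feature import hog
from skimage.filters import gabor_kernel

def extract_features(face):
    """face: registered grayscale image (e.g., 128 x 108 floats in [0, 1])."""
    hog_feat = hog(face, orientations=8, pixels_per_cell=(8, 8),
                   cells_per_block=(2, 2))
    gabor_feats = []
    for theta in np.arange(0, np.pi, np.pi / 4):   # 4 orientations
        for freq in (0.1, 0.25):                   # 2 spatial frequencies
            kernel = np.real(gabor_kernel(freq, theta=theta))
            response = convolve(face, kernel, mode='wrap')
            # Simple global pooling of each filter response; the localized
            # Gabor features described below pool over facial regions instead.
            gabor_feats.extend([response.mean(), response.var()])
    return np.concatenate([hog_feat, np.asarray(gabor_feats)])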