Coupled hidden conditional random fields for RGB-D human action recognition
An-An Liu a,*, Wei-Zhi Nie a, Yu-Ting Su a,*, Li Ma a, Tong Hao b, Zhao-Xuan Yang a

a School of Electronic Information Engineering, Tianjin University, Tianjin 300072, China
b College of Life Sciences, Tianjin Normal University, Tianjin 300387, China
Article history:
Received 24 February 2014
Received in revised form 21 August 2014
Accepted 25 August 2014
Available online 3 September 2014

Keywords: Coupled hidden conditional random fields; Multimodal; Temporal context; Human action recognition
Abstract

This paper proposes a human action recognition method based on a coupled hidden conditional random fields model that fuses both RGB and depth sequential information. The coupled hidden conditional random fields model extends the standard hidden-state conditional random fields model, which has only one chain-structured sequential observation, to multiple chain-structured sequential observations, i.e., synchronized sequence data captured in multiple modalities. For model formulation, we propose a specific graph structure for the interaction among multiple modalities and design the corresponding potential functions. We then propose model learning and inference methods to discover the latent correlation between RGB and depth data and to model the temporal context within each individual modality. Extensive experiments show that the proposed model can boost the performance of human action recognition by taking advantage of the complementary characteristics of the RGB and depth modalities.
© 2014 Elsevier B.V. All rights reserved.
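To make the coupled-chain idea concrete, the following minimal sketch scores one joint hidden-state assignment for two synchronized chains. It is not the paper's actual parameterization: the weight matrices, the toy data, and the function name are illustrative assumptions. It combines per-modality node potentials, within-chain transition potentials (temporal context), and a cross-chain coupling potential between states at the same time step.

```python
import numpy as np

rng = np.random.default_rng(0)

T, H = 4, 3                        # sequence length, hidden states per chain
x_rgb = rng.normal(size=(T, 5))    # toy RGB observations (5-d features)
x_dep = rng.normal(size=(T, 5))    # toy depth observations

# Hypothetical parameters: per-state observation weights, within-chain
# transition scores, and a cross-chain coupling matrix.
W_rgb = rng.normal(size=(H, 5))
W_dep = rng.normal(size=(H, 5))
A_rgb = rng.normal(size=(H, H))    # temporal context, RGB chain
A_dep = rng.normal(size=(H, H))    # temporal context, depth chain
C     = rng.normal(size=(H, H))    # coupling between synchronized states

def joint_score(h_rgb, h_dep):
    """Unnormalized log-potential of one joint hidden-state assignment."""
    s = 0.0
    for t in range(T):
        s += W_rgb[h_rgb[t]] @ x_rgb[t]       # RGB node potential
        s += W_dep[h_dep[t]] @ x_dep[t]       # depth node potential
        s += C[h_rgb[t], h_dep[t]]            # cross-modal coupling
        if t > 0:
            s += A_rgb[h_rgb[t - 1], h_rgb[t]]  # RGB transition
            s += A_dep[h_dep[t - 1], h_dep[t]]  # depth transition
    return s

score = joint_score([0, 1, 1, 2], [2, 0, 1, 1])
```

In the full model, learning and inference would sum (or maximize) such potentials over all joint assignments; this sketch only illustrates which terms the coupled graph structure contributes.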
1. Introduction
Human action recognition is currently a hot research topic in computer vision and machine learning, since it plays an essential role in applications such as intelligent visual surveillance and natural user interfaces. In particular, with the emergence of multiple sensors, such as depth and laser cameras, we can capture the signals of human action in multiple modalities, and multimodal human action recognition has consequently become extremely popular in recent years [1–5].
The task of human action recognition is challenging because of the high variability of appearances and shapes and the potential for occlusion. The related methods can be classified into two categories. One representative approach is the space-time feature-based method. The extraction of space-time features usually involves local feature detectors and descriptors [6–9]. The detectors typically design a specific objective function for the selection of X–Y–T locations. Representative local feature detectors include Harris3D [10], Cuboid [11], 3D Hessian [12] and DSTIP [13] on RGB or depth imagery. The feature descriptors [12,14–20] can be computed to represent the characteristics of shape and motion around the detected local space-time points. With the recent advent of Kinect, depth cameras have received increasing attention, and many researchers are engaged in the formulation of depth-based local saliency descriptors [21,22,13]. Finally, the bag-of-words (BoW) method [23,24] is usually leveraged for video representation and model learning. Probabilistic models can be utilized to overcome the constraint imposed by camera views [25,26]. The other representative approach focuses on learning the sequential dynamics within an action image sequence captured by a traditional RGB camera [27–29].
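In the BoW pipeline above, a video is represented by quantizing its local space-time descriptors against a codebook of visual words and pooling the assignments into a histogram. A minimal sketch follows; the toy descriptors and the random stand-in codebook are assumptions for illustration (in practice the codebook is learned, e.g. by k-means over descriptors pooled from training videos).

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy "local descriptors" extracted from one video, e.g. around detected
# space-time interest points: 40 descriptors of dimension 8.
descriptors = rng.normal(size=(40, 8))

# Stand-in codebook of K visual words (normally learned from training data).
K = 6
codebook = rng.normal(size=(K, 8))

# Quantize: assign each descriptor to its nearest visual word (Euclidean).
dists = np.linalg.norm(descriptors[:, None, :] - codebook[None, :, :], axis=2)
words = dists.argmin(axis=1)

# Video-level representation: normalized histogram of visual-word counts.
hist = np.bincount(words, minlength=K).astype(float)
hist /= hist.sum()
```

The resulting fixed-length histogram can then be fed to any standard classifier, which is what makes the BoW representation convenient for model learning.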
The graph-based methods [30–33] for sequential modeling
http://dx.doi.org/10.1016/j.sigpro.2014.08.038
* Corresponding authors. E-mail address: anan0422@gmail.com (A.-A. Liu).
Signal Processing 112 (2015) 74–82