Action recognition using linear dynamic systems
Haoran Wang a,b, Chunfeng Yuan b, Guan Luo b, Weiming Hu b,*, Changyin Sun a

a School of Automation, Southeast University, Nanjing, China
b National Laboratory of Pattern Recognition, Institute of Automation, CAS, Beijing, China
Article info

Article history:
Received 20 April 2012
Received in revised form 26 November 2012
Accepted 1 December 2012
Available online 12 December 2012
Keywords:
Linear dynamic system
Kernel principal angle
Multiclass spectral clustering
Supervised codebook pruning
Action recognition
Abstract
In this paper, we propose a novel approach based on Linear Dynamic Systems (LDSs) for action
recognition. Our main contributions are two-fold. First, we introduce LDSs to action recognition. LDSs
describe the dynamic texture which exhibits certain stationarity properties in time. They are adopted to
model the spatiotemporal patches which are extracted from the video sequence, because the
spatiotemporal patch is more analogous to a linear time invariant system than the video sequence.
Notably, LDSs do not live in a Euclidean space, so we adopt the kernel principal angle to measure the
similarity between LDSs, and then the multiclass spectral clustering is used to generate the codebook
for the bag of features representation. Second, we propose a supervised codebook pruning method to
preserve the discriminative visual words and suppress the noise in each action class. The visual words
which maximize the inter-class distance and minimize the intra-class distance are selected for
classification. Our approach achieves state-of-the-art performance on three benchmark datasets.
In particular, the experiments on the challenging UCF Sports and Feature Films datasets demonstrate the
effectiveness of the proposed approach in realistic complex scenarios.
© 2012 Elsevier Ltd. All rights reserved.
1. Introduction
Automatic recognition of human actions in videos is useful for
surveillance, content-based summarization, and human–computer
interaction applications. Yet, it is still a challenging problem. In
recent years, a large number of researchers have addressed this
problem as evidenced by several survey papers [1–4].
Action representation is important for action recognition.
There are appearance-based representations [5,40], shape-based
representations [6,41], optical-flow-based representations [7,42],
volume-based representations [8,43] and interest-point-based
representations [9,44]. Among them, methods using local interest
point features together with the bag of visual words model are
highly popular, due to their simple implementation and good
performance. Bag of visual words approaches are robust to
noise, occlusion and geometric variation, and do not require
reliable tracking on a particular subject. Despite recent develop-
ments, the representation of local regions in videos is still an open
field of research.
Dynamic textures are sequences of images of moving scenes that
exhibit certain stationarity properties in time, such as sea-waves,
smoke, foliage, whirlwind etc. They capture the dynamic informa-
tion in the motion of objects. Doretto et al. [10] show that dynamic
textures can be modeled using a LDS. Tools from system identifica-
tion are borrowed to capture the essence of dynamic textures. Once
learned, the LDS model has predictive power and can be used for
extrapolating dynamic textures with negligible computational cost.
Traditionally, LDSs are used to describe the dynamic textures of whole
video sequences [11,12]. But a video sequence is usually not a linear time
invariant system, due in part to its long time span and complex
changes. Compared with a full video sequence, a spatiotemporal patch
is far more analogous to a linear time invariant system. Moreover, an LDS
captures more dynamic information, which is important for the
representation of moving scenes, than traditional local features.
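To make the modeling step concrete, the following is a minimal NumPy sketch of the suboptimal SVD-based LDS identification popularized by Doretto et al. [10], applied here to a spatiotemporal patch rather than a whole sequence. The function name `fit_lds` and the data layout (one vectorized frame per column) are illustrative choices, not the authors' exact implementation.

```python
import numpy as np

def fit_lds(Y, n):
    """Suboptimal LDS identification (sketch, after Doretto et al. [10]).

    Y : (p, tau) matrix whose tau columns are vectorized frames of a
        spatiotemporal patch; n : state-space dimension.
    Returns the observation matrix C (p, n) and transition matrix A (n, n)
    of the model  x_{t+1} = A x_t + v_t,  y_t = C x_t + w_t.
    """
    # Rank-n SVD of the frame matrix: Y ≈ C X with C orthonormal.
    U, s, Vt = np.linalg.svd(Y, full_matrices=False)
    C = U[:, :n]                        # observation matrix
    X = np.diag(s[:n]) @ Vt[:n, :]      # state sequence, one column per frame
    # Least-squares fit of the state transition: X[:, 1:] ≈ A X[:, :-1].
    A = X[:, 1:] @ np.linalg.pinv(X[:, :-1])
    return C, A
```

The pair (A, C) then serves as the descriptor of the patch; the orthonormality of C is what later makes subspace-angle comparisons between models well defined.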
Several categorization algorithms have been proposed based
on the LDS parameters, which live in a non-Euclidean space.
Among these methods, Vishwanathan et al. [13] use Binet–
Cauchy kernels to compare the parameters of two LDSs. Chan
and Vasconcelos [14] use both the KL divergence and the Martin
distance [12,15] as a metric between dynamic systems. Woolfe
and Fitzgibbon [16] use the family of Chernoff distances, and the
distances between cepstrum coefficients are adopted as the
metrics between LDSs. These methods usually define a distance
measurement between the model parameters of two dynamic
systems. Once such a metric has been defined, classifiers such as
nearest neighbors or support vector machines can be used to
categorize a query video sequence based on the training data.
However, all the above approaches perform supervised classification
and are therefore not suitable for codebook generation in the bag
of words representation.
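As a concrete illustration of one such metric, the sketch below computes a Martin-style distance [12,15] between two LDSs from the principal angles between their extended observability subspaces. The truncation depth `m` and the function names are assumptions for this sketch; the true Martin distance is defined over infinite observability subspaces.

```python
import numpy as np

def martin_distance(A1, C1, A2, C2, m=10):
    """Approximate Martin distance between two LDSs (sketch).

    Stacks finite extended observability matrices O = [C; CA; ...; C A^(m-1)],
    takes the principal angles theta_i between their column spaces, and
    returns  -2 * sum_i log cos(theta_i)  =  -log prod_i cos^2(theta_i).
    """
    def observability(A, C, m):
        blocks, M = [], C
        for _ in range(m):
            blocks.append(M)
            M = M @ A
        return np.vstack(blocks)

    # Orthonormal bases for the two observability subspaces.
    Q1, _ = np.linalg.qr(observability(A1, C1, m))
    Q2, _ = np.linalg.qr(observability(A2, C2, m))
    # Cosines of the principal angles are the singular values of Q1^T Q2.
    cosines = np.linalg.svd(Q1.T @ Q2, compute_uv=False)
    cosines = np.clip(cosines, 1e-12, 1.0)
    return -2.0 * np.sum(np.log(cosines))
```

A distance of this form between every pair of patch models is exactly the kind of pairwise similarity a spectral clustering step can consume to build a codebook, which is the gap the supervised methods above leave open.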
http://dx.doi.org/10.1016/j.patcog.2012.12.001
* Corresponding author. Tel.: +86 13910900826.
E-mail address: wmhu@nlpr.ia.ac.cn (W. Hu).
Pattern Recognition 46 (2013) 1710–1718