A Hierarchical Model Based on Latent Dirichlet
Allocation for Action Recognition
Shuang Yang, Chunfeng Yuan, Weiming Hu
National Laboratory of Pattern Recognition,
Institute of Automation, CAS, China
Email: {syang,cfyuan,wmhu}@nlpr.ia.ac.cn
Xinmiao Ding
Shandong Institute of
Business and Technology
Email: dingxinmiao@126.com
Abstract—Inspired by the recent success of hierarchical rep-
resentation, we propose a new hierarchical variant of latent
Dirichlet allocation (h-LDA) for action recognition. The model
consists of an appearance group and a motion group, and we
introduce a new hierarchical structure including two-layer topics
in each group to learn the spatial temporal patterns (STPs)
of human actions. The basic idea is that the two-layer topics
are used to model the global STPs and the local STPs of the
actions respectively. Two groups of discrete words are generated
from two complementary kinds of features for each group.
Each topic learned in these two groups is used to describe
a particular aspect of the actions. Specifically, the mid-level
topics are learned to describe the local STPs by including the
geometric structure information in the lower-level words. The
top-level topics are learned from the mid-level topics and are the
mixture distribution of the local STPs, which makes the top-level
topics appropriate to represent the global STPs. In addition, we
derive the learning and inference procedure using Gibbs sampling
under reasonable assumptions. Finally, each sample is discriminatively
represented as the probabilistic distribution over the global STPs
learned by the proposed h-LDA. Experimental results on two
datasets demonstrate the effectiveness of our approach for action
recognition.
I. INTRODUCTION
In recent years, a significant amount of effort has been
devoted to automatic recognition of human actions in videos.
However, the appropriate representation of different actions
remains difficult, which makes action recognition a challenging
problem. In this paper, we propose a new hierarchical model
based on latent Dirichlet allocation (LDA) to learn the spatial
temporal patterns (STPs) of human actions for action
representation. Combined with a random forest classifier,
our approach is shown experimentally to be effective for
action recognition.
A. Related Work
Recently, representation by learning from a hierarchical
structure for action recognition has gained a lot of interest.
Song et al. [1] propose a hierarchical sequence summariza-
tion approach by learning multiple layers of discriminative
feature representations at different temporal granularities for
action recognition. Wang et al. [2] construct a hierarchical
representation of local feature descriptors by combining the
local features and their contexts for action recognition. Niebles
and Fei-Fei [3] propose a hierarchical model to combine the
spatial and spatial-temporal features to represent each frame
as a mixture of constellations for action recognition.
All the methods above show that the representation using
a hierarchical structure is powerful for action recognition,
yet there are still certain weaknesses in these methods, such
as using only one kind of feature [4] or requiring manual
annotation. Furthermore, they are all based on discriminative
models, which are devised for a specific task and do not
provide a generic characterization.
Among the various generative models, topic models have
been applied widely for many computer vision tasks, such
as scene categorization [5], object recognition [6] and action
recognition [7]. Topic models such as probabilistic latent
semantic indexing (pLSI) [8] and latent Dirichlet allocation
(LDA) [9] were first proposed in the text domain to learn the
latent semantic topics of text documents. In recent years, they
have frequently been introduced into the field of computer
vision. In [5], Fei-Fei et al. build a variant of
LDA which considers an image as a document and an image
patch as a word to discover the intermediate themes for natural
scene categorization. In [7], Wang et al. take the class label
to be the latent topic and a frame in a sequence to be a
word to build the supervised LDA model (s-LDA) for action
recognition. In [10], Wang et al. present spatial LDA by adding
the Gaussian distribution over the words assigned in the same
document to learn the semantic representation of images for
object recognition.
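As an illustration of the single-layer topic models discussed above, the following minimal sketch shows collapsed Gibbs sampling for plain LDA, where each document (for example, an image or a video) is a list of discrete visual-word indices. It is not the h-LDA model proposed in this paper. The function name lda_gibbs, the symmetric hyperparameters alpha and beta, and the toy documents are assumptions introduced only for this example.

```python
import random


def lda_gibbs(docs, n_topics, vocab_size, alpha=0.1, beta=0.01,
              n_iters=200, seed=0):
    """Collapsed Gibbs sampling for plain single-layer LDA (illustrative only).

    docs: list of documents, each a list of word ids in [0, vocab_size).
    Returns (theta, phi): per-document topic mixtures and
    per-topic word distributions, both as lists of lists.
    """
    rng = random.Random(seed)
    doc_topic = [[0] * n_topics for _ in docs]                # counts n_{d,k}
    topic_word = [[0] * vocab_size for _ in range(n_topics)]  # counts n_{k,w}
    topic_total = [0] * n_topics                              # counts n_k
    assign = []                                               # topic of each token

    # Random initialisation of the topic assignment of every token.
    for d, doc in enumerate(docs):
        z_d = []
        for w in doc:
            k = rng.randrange(n_topics)
            z_d.append(k)
            doc_topic[d][k] += 1
            topic_word[k][w] += 1
            topic_total[k] += 1
        assign.append(z_d)

    for _ in range(n_iters):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current token from all counts.
                k = assign[d][i]
                doc_topic[d][k] -= 1
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                # Unnormalised conditional p(z = t | all other assignments).
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][w] + beta)
                    / (topic_total[t] + vocab_size * beta)
                    for t in range(n_topics)
                ]
                # Resample the topic and restore the counts.
                k = rng.choices(range(n_topics), weights=weights)[0]
                assign[d][i] = k
                doc_topic[d][k] += 1
                topic_word[k][w] += 1
                topic_total[k] += 1

    # Posterior mean estimates from the final counts.
    theta = [
        [(doc_topic[d][t] + alpha) / (len(doc) + n_topics * alpha)
         for t in range(n_topics)]
        for d, doc in enumerate(docs)
    ]
    phi = [
        [(topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
         for w in range(vocab_size)]
        for t in range(n_topics)
    ]
    return theta, phi


# Toy usage: four "documents" over a hypothetical vocabulary of 4 visual words.
toy_docs = [[0, 1, 0, 1, 1, 0], [2, 3, 3, 2, 2, 3], [0, 0, 1, 1], [3, 2, 3, 2]]
theta, phi = lda_gibbs(toy_docs, n_topics=2, vocab_size=4, n_iters=50, seed=1)
```

Each row of theta is a document's distribution over topics, which is the kind of representation a single-layer model would pass to a classifier. The two-layer, two-group structure of h-LDA is not reproduced here.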
Most of these methods build the model with only one layer of
topics. Despite their simplicity, these methods work well for
their specific tasks. However, a structure with only one layer
of topics limits generalizability. Moreover, most previous
topic-model-based methods build their model from only one type
of observation, which is efficient but may not be enough for
complex actions.
B. Our Approach
To address the above limitations, we propose a novel hierarchical
variant of LDA, named h-LDA, which combines two groups
of two-layer topics to learn the spatial temporal patterns
(STPs) of human actions. Specifically, the two-layer topics
in each group are introduced to learn the global STPs and the
local STPs of the actions, respectively. The low-level
words in the two groups are generated individually from
2014 22nd International Conference on Pattern Recognition
1051-4651/14 $31.00 © 2014 IEEE
DOI 10.1109/ICPR.2014.451