Human Action Recognition Using
Labeled Latent Dirichlet Allocation Model
Jiahui YANG, Changhong CHEN*, Zongliang GAN, Xiuchang ZHU
Jiangsu Province's Key Lab of Image Processing and Image Communications
Nanjing University of Posts and Telecommunications,
Nanjing 210003, China
*Corresponding author: chenchh@njupt.edu.cn
Abstract—Recognition of human actions has long been an active area in computer vision, and action recognition techniques have been applied in many fields such as smart surveillance, motion analysis, and virtual reality. In this paper, we propose a new action recognition method that represents human actions as a bag of spatio-temporal words extracted from input video sequences and uses the L-LDA (labeled Latent Dirichlet Allocation) model as the classifier. L-LDA is a supervised model extended from the unsupervised LDA. It adds a label layer on top of LDA to mark the category of each training video sequence, so L-LDA can automatically assign the latent topic variables in the model to specific action categories. Moreover, this property allows the model parameters to be estimated more reasonably, accurately, and quickly. We test our method on the KTH and Weizmann human action datasets, and the experimental results show that L-LDA outperforms both its unsupervised counterpart LDA and SVMs (support vector machines).
Keywords—action recognition; interest points detection; topic
model; labeled Latent Dirichlet Allocation model
I. INTRODUCTION
Action recognition aims to represent and track human actions using computer techniques, and then to infer and categorize those actions in combination with other information such as the background and surrounding environment [1]. The key techniques in this field include extracting representative visual features from video sequences, choosing an appropriate feature descriptor, and designing a classification model with good performance [2]. Accordingly, action recognition can be divided into two tasks: (1) feature extraction and representation at the bottom level; (2) model learning and action categorization at the top level. The flowchart of a general action recognition approach is shown in Fig. 1.
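As a concrete illustration of the bottom level of this pipeline, local descriptors can be quantized against a visual vocabulary and each video represented as a word histogram that the top-level classifier then consumes. The sketch below is our own illustrative code with random stand-in descriptors and a simple k-means; it is not the detector or the L-LDA model used in this paper.

```python
import numpy as np

def build_vocabulary(descriptors, k, iters=20, seed=0):
    """Cluster local descriptors with plain k-means to form a visual vocabulary."""
    rng = np.random.default_rng(seed)
    centers = descriptors[rng.choice(len(descriptors), k, replace=False)]
    for _ in range(iters):
        # Assign each descriptor to its nearest center, then recompute centers.
        d = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = descriptors[labels == j].mean(0)
    return centers

def to_histogram(descriptors, centers):
    """Quantize descriptors into a normalized bag-of-words histogram."""
    d = ((descriptors[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    h = np.bincount(d.argmin(1), minlength=len(centers)).astype(float)
    return h / h.sum()

# Stand-in local features for one video; a real system would extract these
# from detected spatio-temporal interest points.
video_descriptors = np.random.default_rng(1).normal(size=(200, 16))
vocab = build_vocabulary(video_descriptors, k=8)
hist = to_histogram(video_descriptors, vocab)
print(hist.shape)  # (8,)
```

The top-level task then treats each such histogram as a "document" of visual words, which is exactly the representation topic models and SVMs operate on.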
At present, the low-level features used in the first task mainly include contours, optical flow, motion trajectories, spatio-temporal interest points, and so on. Methods using contour features are simple and easy to implement, but many of them depend on the boundary information of the contour, which is easily affected by changes in the background [3]. Optical flow can detect and track the actor without any prior knowledge, but it is sensitive to video noise and changes in illumination intensity, and its computation is complex and costly [4]. Motion trajectories can be used to analyze the details of human motion, but estimating the positions of key human joints and tracking them through subsequent frames is still hard to do reliably [5]. Recently, action recognition methods based on spatio-temporal interest points have been widely used because of their advantages: the actors can be accurately located, and the points capture the main information of the action without requiring the actor to be tracked. Many interest point detectors have been proposed. The Harris corner detector, originally used in image processing, was extended to the space-time domain, but it detects only a limited number of interest points because its response function is not sensitive to changes along the temporal dimension [6]. To address this, a three-dimensional linear filter detector was proposed that combines 2D Gaussian filters along the spatial dimensions with a pair of 1D Gabor filters along the temporal dimension; it can detect a sufficient number of interest points [7]. The idea of the Hessian matrix was also used to detect spatio-temporal interest points based on scale-invariant detection, which yields dense interest points [8].
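The separable response function of the detector in [7] can be sketched as follows: spatial Gaussian smoothing followed by a quadrature pair of temporal Gabor filters, with interest points taken at local maxima of the response. This is a simplified illustration of that scheme using SciPy; the parameter values and the frequency coupling `omega = 4 / tau` are taken as in common descriptions of [7], and should be treated as placeholders rather than the exact settings of this paper.

```python
import numpy as np
from scipy.ndimage import convolve1d, gaussian_filter

def cuboid_response(video, sigma=2.0, tau=1.5, omega=None):
    """Response of a separable 3-D linear filter detector in the style of [7]:
    2-D Gaussian smoothing in space, a quadrature pair of 1-D Gabor filters
    in time. Spatio-temporal interest points are local maxima of R."""
    if omega is None:
        omega = 4.0 / tau  # temporal frequency tied to the temporal scale
    # Spatial smoothing frame by frame (axis 0 is t; axes 1, 2 are y, x).
    smoothed = gaussian_filter(video, sigma=(0, sigma, sigma))
    # Quadrature pair of 1-D Gabor kernels along the temporal axis.
    t = np.arange(-int(3 * tau), int(3 * tau) + 1)
    envelope = np.exp(-t**2 / (2 * tau**2))
    h_even = -np.cos(2 * np.pi * t * omega) * envelope
    h_odd = -np.sin(2 * np.pi * t * omega) * envelope
    even = convolve1d(smoothed, h_even, axis=0, mode="nearest")
    odd = convolve1d(smoothed, h_odd, axis=0, mode="nearest")
    return even**2 + odd**2

# Random stand-in clip of 30 frames at 32x32; a real clip would come
# from the input video sequence.
video = np.random.default_rng(0).normal(size=(30, 32, 32))
R = cuboid_response(video)
print(R.shape)  # (30, 32, 32)
```

Because the response sums two squared quadrature outputs, it is non-negative everywhere and peaks at regions with strong periodic temporal variation, which is what makes this detector more responsive to motion than the space-time Harris detector.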
For modeling and categorization at the top level, there has been growing attention to using latent topic models such as PLSA (probabilistic latent semantic analysis) [9] and LDA (Latent Dirichlet Allocation) [10] as classification models. Topic models were first introduced and applied in domains such as information retrieval and text analysis. When these models are used to represent video sequences, more emphasis is placed on the coherence of content rather than on mere spatial neighborhood relations. Generally, the latent topic models applied to action recognition are unsupervised; in other words, they do not label the categories of the training samples, but only take the number of categories as input and automatically learn the probability distributions of the visual words and the latent topics. Savarese et al. [11] extract local spatio-temporal interest points as low-level features and apply PLSA to learn and generate a semantic description of each action. In [12], LDA was used to model human activities in real scenes. Although unsupervised topic models applied to action recognition have made much progress, they still have some weaknesses. For example, because the models are unsupervised, each discovered cluster can only be named, by resorting to the ground-truth labels, after the most common action class within that cluster. That is to say, we can't automatically