Human Activity Recognition based on 3D Mesh MoSIFT Feature Descriptor
Yue Ming
School of Electronic Engineering
Beijing University of Posts and Telecommunications
Beijing 100876, P.R. China
Email: myname35875235@126.com
Abstract—The era of Big Data imposes increasingly high demands on information processing. The rapid development of 3D digital capturing devices is pushing traditional behavior analysis toward fine motion recognition, such as hand and gesture motions. In this paper, a complete framework for 3D human activity recognition is presented for the behavior analysis of hands and gestures. First, an improved graph cuts method is introduced for hand segmentation and tracking. Then, combining 3D geometric characteristics with prior information about human behavior, the 3D Mesh MoSIFT feature descriptor is proposed to represent the discriminative properties of human activity. Simultaneous orthogonal matching pursuit (SOMP) is used to encode the visual codewords. Experiments based on an RGB-D video dataset and the ChaLearn gesture dataset show the improved accuracy of human activity recognition.
Keywords—Big Data; 3D digital capturing devices; 3D human activity recognition; hand segmentation and tracking; 3D Mesh MoSIFT feature descriptor
I. INTRODUCTION
Big data technologies describe new architectures for intelligent information processing. In recent years, growing interest in human activity analysis has prompted scholars to pay more attention to algorithm design. Pavan Turaga [1] provided a survey on real-time video analysis. Joshua Candamo [2] focused on understanding transit scenes and reviewed the related algorithms for human behavior analysis in such scenes. Technical progress and rapidly declining prices have led more and more researchers to extend their work to new capturing devices in order to obtain richer motion information. Omar Oreifej [3] introduced depth sequences for activity recognition. A. Jalal [4] applied their proposed feature descriptors to life logging in smart homes. Ross B. Girshick [5] proposed a general pose estimation framework based on depth data. The superior performance achieved on 3D data points to a promising direction for human activity recognition. However, with the rapid development of big data technology, the description of fine motions, such as those of hands and gestures, in massive network data presents huge challenges for deep data mining and research.
In this paper, we focus on the description of fine human motions. Through the extraction of consistently invariant features, a framework for 3D hand activity recognition is established. First, a novel method for hand segmentation and tracking is introduced into our framework: an effective dynamic model based on graph cuts is used for hand state prediction. Then, inspired by fusion technology for RGB and depth information, we combine RGB and depth videos for fine motion analysis, e.g., hand activity and gesture recognition. A novel feature representation, named 3D Mesh MoSIFT, is developed from the original 3D MoSIFT feature descriptor [10] for key point detection and activity description. To learn a discriminative model, all feature descriptors are clustered with k-means to generate a visual codebook. A sparse coding method called simultaneous orthogonal matching pursuit (SOMP) is used to represent each feature as a linear combination of codewords. Finally, a new input sample is recognized by a k-nearest neighbor (KNN) classifier. Experimental results on the ChaLearn gesture dataset and our RGB-D hand activity dataset show that the proposed framework for hand activity recognition provides better accuracy than other classical algorithms.
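For concreteness, the following Python sketch illustrates the coding and classification stage just described: a k-means visual codebook, a small simultaneous-OMP encoder, and a KNN classifier. This is a minimal illustration under stated assumptions, not the authors' implementation; the codebook size (512), sparsity level (5), max-pooling step, and variable names such as train_videos and train_labels are all assumptions for the sketch.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier

def somp(D, Y, n_nonzero):
    """Simultaneous OMP: approximate all columns of Y (d x M) using the
    same few atoms (columns) of the dictionary D (d x K)."""
    residual = Y.copy()
    support = []
    X = np.zeros((D.shape[1], Y.shape[1]))
    for _ in range(n_nonzero):
        # pick the atom most correlated with the residuals of all signals jointly
        corr = np.abs(D.T @ residual).sum(axis=1)
        corr[support] = 0.0                       # never reselect an atom
        support.append(int(np.argmax(corr)))
        coef, *_ = np.linalg.lstsq(D[:, support], Y, rcond=None)
        residual = Y - D[:, support] @ coef
    X[support, :] = coef
    return X

# Visual codebook from all training descriptors (rows = descriptors);
# the codebook size 512 is an assumed parameter, not taken from the paper.
kmeans = KMeans(n_clusters=512, n_init=10).fit(train_descriptors)
codebook = kmeans.cluster_centers_                # shape (512, descriptor_dim)

def encode(descriptors, codebook, n_nonzero=5):
    """Jointly encode one video's descriptors over the codebook and
    max-pool the absolute sparse codes into one fixed-length vector."""
    X = somp(codebook.T, descriptors.T, n_nonzero)   # (512, M) codes
    return np.abs(X).max(axis=1)                     # assumed pooling choice

# train_videos / test_videos: hypothetical lists of per-video descriptor arrays
train_feats = np.stack([encode(v, codebook) for v in train_videos])
test_feats = np.stack([encode(v, codebook) for v in test_videos])
clf = KNeighborsClassifier(n_neighbors=1).fit(train_feats, train_labels)
predictions = clf.predict(test_feats)
```

Joint (simultaneous) atom selection forces the descriptors of one video to share a common support, which yields more consistent pooled features than encoding each descriptor independently.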
The paper is organized as follows. First, we discuss hand segmentation and tracking in Section 2. Then, we introduce the 3D Mesh MoSIFT feature descriptor in Section 3. The hand activity recognition framework based on SOMP is presented in Section 4. Experimental analysis is described in Section 5. Section 6 concludes the paper.
II. HAND SEGMENTATION AND TRACKING
A Kinect camera is used to simultaneously collect RGB and depth videos of different kinds of human hand activities. The first step is hand segmentation based on the RGB videos. A simple segmentation of objects from the background can be obtained by minimizing the following energy with respect to the labeling function λ:
\varepsilon(\lambda) = \varepsilon_D(\lambda) + \varepsilon_S(\lambda) \qquad (1)
where the data term \varepsilon_D evaluates the likelihood p_n(i) of a pixel i belonging to an object n:
\varepsilon_D(\lambda) = -\sum_{i \in I} \sum_{n=0}^{N} \ln\big(p_n(i)\big)\,\delta(\lambda, n) \qquad (2)
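As an illustration of Eq. (2), the following sketch evaluates the data term for a given labeling from per-pixel likelihood maps. It is a minimal sketch under assumed array shapes; in practice the full energy of Eq. (1), including the smoothness term \varepsilon_S, is minimized with a max-flow/min-cut solver (libraries such as PyMaxflow provide one).

```python
import numpy as np

def data_term(likelihoods, labeling, eps=1e-12):
    """Data term of Eq. (2): sum of negative log-likelihoods that each
    pixel carries its assigned label.

    likelihoods : (N+1, H, W) array; likelihoods[n] holds p_n(i) for
                  label n (0 = background, 1..N = hands/objects).
    labeling    : (H, W) integer array, the label lambda of each pixel.
    """
    rows, cols = np.indices(labeling.shape)
    p = likelihoods[labeling, rows, cols]   # p_{lambda(i)}(i) per pixel
    return -np.log(p + eps).sum()           # eps guards against log(0)
```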