KEY JOINTS SELECTION AND SPATIOTEMPORAL MINING FOR SKELETON-BASED
ACTION RECOGNITION
Zhikai Wang¹, Chongyang Zhang¹,²∗, Wu Luo¹, and Weiyao Lin¹
¹School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
²Shanghai Key Lab of Digital Media Processing and Transmission, Shanghai 200240, China
∗Corresponding email: sunny_zhang@sjtu.edu.cn
ABSTRACT
Trajectories and spatiotemporal attention models have been used successfully in skeleton-based action recognition. Most existing methods focus on temporal structure mining. However, only a few local joints and their position features (e.g., critical position changes of the hand, head, or leg) are responsible for the action label. In this work, we introduce a novel action recognition framework using Key Joints Selection and Spatiotemporal Mining, which identifies key joints and uses their position & velocity histograms together with trajectory features for action classification. First, histograms of human joint position and velocity are developed to enhance the spatiotemporal structure representation of existing trajectory-based methods. Second, key joints are selected according to their information gain, and their position & velocity histograms are weighted and combined with trajectory features to form one richer representation for final action classification. Experiments on two widely-tested benchmark datasets show that, by combining the strengths of richer features and key joint selection, our method achieves state-of-the-art or competitive performance compared with existing results that use sophisticated models such as deep learning, with advantages in recognition accuracy and robustness.
Index Terms— Action recognition, key joints, position &
velocity histograms, spatiotemporal mining, skeleton
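As a concrete illustration of the proposed features, the following is a minimal sketch of per-joint position & velocity histograms, assuming skeleton sequences stored as T x J x 3 arrays; the bin count and per-coordinate histogramming are hypothetical choices here, not the paper's exact normalization or bin settings.

import numpy as np

def position_velocity_histograms(joints, n_bins=8):
    """Per-joint position & velocity histograms for one skeleton sequence.

    joints: array of shape (T, J, 3) -- T frames, J joints, (x, y, z).
    Returns one concatenated histogram feature per joint, shape (J, 6*n_bins).
    """
    T, J, _ = joints.shape
    velocity = np.diff(joints, axis=0)             # frame-to-frame displacement, (T-1, J, 3)
    feats = []
    for j in range(J):
        hists = []
        for signal in (joints[:, j, :], velocity[:, j, :]):
            for d in range(3):                     # histogram each coordinate separately
                h, _ = np.histogram(signal[:, d], bins=n_bins)
                hists.append(h / max(h.sum(), 1))  # normalize to a distribution
        feats.append(np.concatenate(hists))
    return np.stack(feats)                         # (J, 6*n_bins)

The per-joint histograms sketched here would then be weighted and concatenated with trajectory features to form the richer representation described above.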
1. INTRODUCTION
Action recognition has attracted much attention due to its importance in many applications. Thanks to the development of commodity RGB-D cameras, skeleton-based action recognition has recently drawn considerable attention in the computer vision community [1, 2]. Although recent advances in deep convolutional networks (ConvNets) have brought improvements in action recognition [3], it remains challenging because such networks require a large number of labeled videos for training [4], while most available datasets, especially skeleton-based 3D action datasets, are relatively small. Thus, traditional handcrafted-feature-based methods are still useful for 3D action recognition.
In recent years, many learning-based methods have been proposed for skeleton-based action recognition. Three categories of approaches are often used: spatial modeling, temporal modeling, and spatiotemporal modeling. Modeling in the spatial domain is mainly driven by the fact that an action is usually characterized only by the interactions or combinations of a subset of skeleton joints [5]. In HBRNN [6], skeletons are decomposed into five parts and a hierarchical recurrent neural network is built to model the relationships among these parts. Similarly, in [7] a part-aware model is proposed to construct the relationships between body parts. In SMIJ [8], the most informative joints are selected simply based on measures such as the mean or variance of joint angle trajectories. In the temporal domain, temporal pyramid matching [9], dynamic time warping [10], and segmentation [11] are common methods for temporal modeling. In [12], short-term and long-term temporal models are combined to form a multi-model framework. Many spatiotemporal modeling efforts have also been proposed: in [13], the LSTM model is extended to the spatiotemporal domain to analyze skeletons; spatiotemporal vectors of locally max-pooled features are developed in [14]; and spatiotemporal attention parts are selected in [2].
Good features are crucial to reliable action recognition. Although the features developed in existing works have brought large improvements in many domains, most of the above methods pay more attention to temporal trajectory features while largely ignoring the spatial patterns of key local parts: the position and velocity distributions of the key body parts. Without these patterns, they have limited ability to precisely differentiate ambiguous fine-grained action classes, whose inter-class trajectory differences are subtle. For example, the actions of Horizontal-arm-wave and High-arm-wave have similar hand trajectories. Conversely, the hand height histograms of these two actions differ notably (Fig. 1), which can be used to distinguish them more easily.
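Because only a subset of joints carries such discriminative statistics, joints can be ranked by how much their histogram features reduce class uncertainty, which is the idea behind the information-gain selection mentioned in the abstract. The following is a minimal sketch, assuming each joint's histogram has first been discretized into codeword indices (a hypothetical preprocessing step, e.g. by clustering); it is an illustration of entropy-based information gain, not the paper's exact procedure.

import numpy as np

def entropy(labels):
    """Shannon entropy of a discrete label array."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(codes, labels):
    """Information gain of class labels given one joint's feature codes.

    codes:  (N,) discretized feature code of each sample for this joint.
    labels: (N,) action class of each sample.
    """
    gain = entropy(labels)
    for c in np.unique(codes):
        mask = codes == c
        gain -= mask.mean() * entropy(labels[mask])  # subtract weighted conditional entropy
    return gain

def select_key_joints(joint_codes, labels, k=5):
    """Rank joints by information gain and keep the top k.

    joint_codes: (N, J) discretized feature code per sample and joint.
    """
    gains = np.array([information_gain(joint_codes[:, j], labels)
                      for j in range(joint_codes.shape[1])])
    return np.argsort(gains)[::-1][:k]

The selected joints' weighted histograms would then be composed with trajectory features for the final classifier.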
In another case, actions with inverse part activities, such as pull and push, are easily confused due to their similar trajectory and