Hierarchical Recurrent Neural Network for Skeleton Based Action Recognition
Yong Du, Wei Wang, Liang Wang
Center for Research on Intelligent Perception and Computing, CRIPAC
Nat’l Lab of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences
{yong.du, wangwei, wangliang}@nlpr.ia.ac.cn
Abstract
Human actions can be represented by the trajectories of skeleton
joints. Traditional methods generally model the spatial structure
and temporal dynamics of the human skeleton with hand-crafted
features and recognize human actions with well-designed classifiers.
In this paper, considering that recurrent neural networks (RNNs) can
model the long-term contextual information of temporal sequences
well, we propose an end-to-end hierarchical RNN for skeleton based
action recognition. Instead of taking the whole skeleton as the
input, we divide the human skeleton into five parts according to
human physical structure, and then feed them separately to five
subnets. As the number of layers increases, the representations
extracted by the subnets are hierarchically fused to form the inputs
of higher layers. The final representations of the skeleton sequences
are fed into a single-layer perceptron, and the temporally
accumulated output of the perceptron is the final decision. We
compare with five other deep RNN architectures derived from our model
to verify the effectiveness of the proposed network, and also compare
with several other methods on three publicly available datasets.
Experimental results demonstrate that our model achieves
state-of-the-art performance with high computational efficiency.
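As a rough illustration of the input split and the final decision described above, the following sketch divides one skeleton frame into five parts and accumulates per-frame perceptron outputs over time before choosing a class. The joint indices in PARTS, the function names, and the choice to apply a softmax after accumulation are assumptions made for illustration; they are not specified in this section.

```python
import math

# Hypothetical joint indices for a 20-joint skeleton. The real grouping
# depends on the dataset's joint layout, which this section does not give.
PARTS = {
    "left_arm": [0, 1, 2, 3],
    "right_arm": [4, 5, 6, 7],
    "trunk": [8, 9, 10, 11],
    "left_leg": [12, 13, 14, 15],
    "right_leg": [16, 17, 18, 19],
}


def split_into_parts(frame):
    """Split one frame (a list of (x, y, z) joints) into five flat part vectors."""
    return {
        name: [coord for j in joints for coord in frame[j]]
        for name, joints in PARTS.items()
    }


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def accumulate_decision(per_frame_outputs):
    """Sum per-frame perceptron outputs over time, then pick the top class.

    Assumption: the accumulated outputs are normalized with a softmax.
    """
    n_classes = len(per_frame_outputs[0])
    acc = [0.0] * n_classes
    for output in per_frame_outputs:
        for k, value in enumerate(output):
            acc[k] += value
    probs = softmax(acc)
    label = max(range(n_classes), key=probs.__getitem__)
    return label, probs
```

Because the decision is taken on the accumulated output, a class that scores strongly in a few frames can outweigh one that scores weakly in many frames.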
1. Introduction
As an important branch of computer vision, action recognition has a
wide range of applications, e.g., intelligent video surveillance,
robot vision, human-computer interaction, and game control [15, 36].
Traditional studies of action recognition mainly focus on recognizing
actions from videos recorded by 2D cameras. However, human actions
are generally represented and recognized in 3D space. The human body
can be regarded as an articulated system of rigid bones and hinged
joints, which are further combined into four limbs and a trunk [31].
Human actions are composed of the motions of these limbs
and trunk which are represented by the movements of human skeleton
joints in 3D space [37]. Currently, reliable joint coordinates can be
obtained from cost-effective depth sensors using real-time skeleton
estimation algorithms [27, 28]. Effective approaches should be
investigated for skeleton based action recognition.

[Figure 1 diagram: BRNN blocks arranged in Layer1 through Layer7,
followed by a Fully Connected Layer and a Softmax Layer (Layer8,
Layer9).]
Figure 1: An illustrative sketch of the proposed hierarchical
recurrent neural network. The whole skeleton is divided into five
parts, which are fed into five bidirectional recurrent neural
networks (BRNNs). As the number of layers increases, the
representations extracted by the subnets are hierarchically fused to
be the inputs of higher layers. A fully connected layer and a softmax
layer are applied to the final representation to classify the
actions.

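The data flow sketched in Figure 1 can be illustrated with a small, untrained example in plain Python. It is a minimal sketch under stated assumptions, not the authors' implementation: the weights are random, the subnets are simple tanh RNNs, fusion is done by frame-wise concatenation, and the fusion order (each limb with the trunk, then upper and lower body, then the whole body) is assumed, since this section only states that fusion is hierarchical. All names (make_rnn, run_brnn, fuse, hierarchical_forward) are hypothetical.

```python
import math
import random

LIMBS = ("left_arm", "right_arm", "left_leg", "right_leg")
PART_NAMES = LIMBS + ("trunk",)


def make_rnn(n_in, n_hid, rng):
    """Random weights for a simple tanh RNN (untrained, illustration only)."""
    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]
    return {"W_x": mat(n_hid, n_in), "W_h": mat(n_hid, n_hid), "b": [0.0] * n_hid}


def run_rnn(params, seq):
    """Run the RNN over seq (a list of input vectors); return all hidden states."""
    h = [0.0] * len(params["b"])
    states = []
    for x in seq:
        h = [
            math.tanh(
                sum(w * xi for w, xi in zip(row_x, x))
                + sum(w * hi for w, hi in zip(row_h, h))
                + b
            )
            for row_x, row_h, b in zip(params["W_x"], params["W_h"], params["b"])
        ]
        states.append(h)
    return states


def run_brnn(fwd, bwd, seq):
    """Bidirectional pass: concatenate forward and backward states per frame."""
    forward = run_rnn(fwd, seq)
    backward = run_rnn(bwd, seq[::-1])[::-1]
    return [hf + hb for hf, hb in zip(forward, backward)]


def fuse(*seqs):
    """Fuse representations by concatenating them frame by frame."""
    return [sum(frames, []) for frames in zip(*seqs)]


def brnn_layer(inputs, n_hid, rng):
    """Apply an independent BRNN subnet to each named input sequence."""
    out = {}
    for name, seq in inputs.items():
        n_in = len(seq[0])
        out[name] = run_brnn(
            make_rnn(n_in, n_hid, rng), make_rnn(n_in, n_hid, rng), seq
        )
    return out


def hierarchical_forward(parts, n_hid=4, seed=0):
    """Map five part sequences to one whole-body representation per frame."""
    rng = random.Random(seed)
    h1 = brnn_layer(parts, n_hid, rng)  # five subnets, one per part
    # Assumed fusion order: each limb is fused with the trunk.
    f1 = {limb: fuse(h1[limb], h1["trunk"]) for limb in LIMBS}
    h2 = brnn_layer(f1, n_hid, rng)  # four subnets
    f2 = {
        "upper": fuse(h2["left_arm"], h2["right_arm"]),
        "lower": fuse(h2["left_leg"], h2["right_leg"]),
    }
    h3 = brnn_layer(f2, n_hid, rng)  # two subnets
    f3 = {"body": fuse(h3["upper"], h3["lower"])}
    return brnn_layer(f3, n_hid, rng)["body"]  # one subnet
```

Each frame of the returned sequence has 2 * n_hid values (forward and backward states), which a fully connected layer and a softmax layer would then map to class scores.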
Human skeleton based action recognition is generally considered as a
time series problem [5, 17], in which the characteristics of body
postures and their dynamics over time are extracted to represent a
human action. Most of the existing skeleton based action recognition
methods explicitly model the temporal dynamics of skeleton joints by
using Temporal Pyramids (TPs) [19, 31, 33] and Hidden Markov Models
(HMMs) [20, 34, 35]. TP methods are generally restricted by the width
of the time windows and can only utilize limited contextual
information. As for HMMs, it is very difficult to obtain temporally
aligned sequences and the corresponding emission distributions.
Recently, recurrent neural networks (RNNs) with Long Short-Term
Memory (LSTM) [8, 10] neurons have been used for action recognition
[1, 11, 16]. All of this work uses only a single-layer RNN as a
sequence classifier without part-based