Learning Spatiotemporal Features using 3DCNN and Convolutional LSTM for
Gesture Recognition
Liang Zhang, Guangming Zhu, Peiyi Shen, Juan Song
School of Software, Xidian University
{liangzhang, gmzhu, pyshen, songjuan}@xidian.edu.cn
Syed Afaq Shah, Mohammed Bennamoun
University of Western Australia
{afaq.shah, mohammed.bennamoun}@uwa.edu.au
Abstract
Gesture recognition aims at understanding the ongoing
human gestures. In this paper, we present a deep archi-
tecture to learn spatiotemporal features for gesture recog-
nition. The deep architecture first learns 2D spatiotempo-
ral feature maps using 3D convolutional neural networks
(3DCNN) and bidirectional convolutional long short-term
memory networks (ConvLSTM). The learnt 2D feature maps
can encode the global temporal information and local spa-
tial information simultaneously. Then, 2DCNN is utilized
further to learn the higher-level spatiotemporal features
from the 2D feature maps for the final gesture recogni-
tion. The spatiotemporal correlation information is kept
through the whole process of feature learning. This makes
the deep architecture an effective spatiotemporal feature
learner. Experiments on the ChaLearn LAP large-scale iso-
lated gesture dataset (IsoGD) and the Sheffield Kinect Ges-
ture (SKIG) dataset demonstrate the superiority of the pro-
posed deep architecture.
1. Introduction
Gestures, as a nonverbal body language, play a very
important role in human daily life. Gesture recognition
aims at understanding the ongoing human gestures and is
of great significance for human-robot/computer interaction,
sign language recognition and virtual reality [23].
Effective and universal gesture recognition from videos
is extremely difficult, partly due to large gesture vocabularies
with cultural differences, varying illumination conditions,
out-of-vocabulary motions, and inconsistent or nonstandard
behaviors among different performers [12].
Moreover, gestures have various time durations and involve
different body parts. A small handful of gestures can be
represented by a single posture of hands and arms, but most
of the gestures are composed of a sequence of hand and arm
postures. Therefore, learning effective spatiotemporal fea-
tures is crucially important for robust gesture recognition.
According to [32], there are four typical properties for ef-
fective spatiotemporal features of gestures: (i) generic, (ii)
compact, (iii) efficient to compute, and (iv) simple to imple-
ment.
Figure 1. Overview of the proposed deep architecture. 3DCNN
and bidirectional ConvLSTM are utilized to learn the short-
term and long-term spatiotemporal features successively, and then
2DCNN is used to learn higher-level spatiotemporal features based
on the learnt 2D long-term spatiotemporal feature maps for the fi-
nal gesture recognition.
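The three-stage pipeline summarized in the abstract and in the Figure 1 caption can be illustrated by tracing tensor shapes through it. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the function names, filter counts, pooling factors, clip size and class count are all hypothetical, and the bidirectional ConvLSTM stage is assumed to keep one final 2D map per scan direction and concatenate the two along the channel axis.

```python
from typing import List, Tuple

Shape = Tuple[int, ...]


def conv3d_stage(shape: Shape, filters: int,
                 temporal_pool: int, spatial_pool: int) -> Shape:
    """Short-term stage: 3D convolution ('same' padding assumed) plus pooling.

    Input and output are (time, height, width, channels); the time axis
    is shortened but preserved.
    """
    t, h, w, _ = shape
    return (t // temporal_pool, h // spatial_pool, w // spatial_pool, filters)


def bidirectional_convlstm_stage(shape: Shape, filters: int) -> Shape:
    """Long-term stage: forward and backward scans over the time axis.

    Assumption: each direction keeps only its final 2D state, and the two
    states are concatenated along channels, so the time axis disappears.
    """
    _, h, w, _ = shape
    return (h, w, 2 * filters)


def conv2d_stage(shape: Shape, filters: int, spatial_pool: int) -> Shape:
    """Higher-level stage: 2D convolution plus pooling on the 2D maps."""
    h, w, _ = shape
    return (h // spatial_pool, w // spatial_pool, filters)


def trace_pipeline(clip_shape: Shape, num_classes: int) -> List[Tuple[str, Shape]]:
    """Return (stage name, output shape) pairs for one input clip."""
    stages = [("input clip", clip_shape)]
    # All hyperparameters below are illustrative, not taken from the paper.
    x = conv3d_stage(clip_shape, filters=64, temporal_pool=2, spatial_pool=4)
    stages.append(("3DCNN (short-term)", x))
    x = bidirectional_convlstm_stage(x, filters=128)
    stages.append(("bidirectional ConvLSTM (long-term)", x))
    x = conv2d_stage(x, filters=256, spatial_pool=2)
    stages.append(("2DCNN (higher-level)", x))
    stages.append(("classifier scores", (num_classes,)))
    return stages


if __name__ == "__main__":
    # Hypothetical clip: 32 frames of 112x112 RGB, 10 gesture classes.
    for name, shape in trace_pipeline((32, 112, 112, 3), num_classes=10):
        print(f"{name:36s} {shape}")
```

The point of the sketch is that the spatial axes survive every stage, while the time axis is only collapsed inside the ConvLSTM stage, which is how the 2D feature maps can carry both temporal and spatial information into the 2DCNN.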
Inspired by the deep learning breakthroughs in image
recognition [17, 29, 31], many neural-network-based
frameworks have been proposed to learn spatiotemporal features