Large-scale Isolated Gesture Recognition using
Pyramidal 3D Convolutional Networks
Guangming Zhu¹, Liang Zhang¹, Lin Mei², Jie Shao², Juan Song¹, Peiyi Shen¹
¹School of Software, Xidian University, Xi'an, 710071, China
²The Third Research Institute of Ministry of Public Security, Shanghai, 201210, China
gmzhu@xidian.edu.cn
Abstract—Human gesture recognition is one of the central research fields of computer vision, and effective gesture recognition remains challenging. In this paper, we present a pyramidal 3D convolutional network framework for large-scale isolated human gesture recognition. 3D convolutional networks are utilized to learn spatiotemporal features from gesture videos. A pyramid input is proposed to preserve the multi-scale contextual information of gestures, and each pyramid segment is uniformly sampled with temporal jitter. Pyramid fusion layers are inserted into the 3D convolutional networks to fuse the features of the pyramid input. This strategy enables the networks to recognize human gestures from entire videos rather than from segmented clips independently. We present experimental results on the 2016 ChaLearn LAP Large-scale Isolated Gesture Recognition Challenge, in which our method placed third.
Keywords-gesture recognition; 3D convolutional networks;
pyramid; temporal jitter
I. INTRODUCTION
Gestures, as a nonverbal body language, play a very important role in daily human communication. With the rapid development of human-computer and human-robot interaction, visual gesture recognition [1] has become one of the central research fields of computer vision. Effective gesture recognition remains very challenging [3], due to several factors: cultural differences, varying observation conditions, out-of-vocabulary motions, the relatively small size of fingers in images, noise in camera channels, tiny differences among similar gestures, etc. To advance research on gesture recognition, ChaLearn has organized a series of gesture recognition challenges since 2011 [4].
Human gestures may involve motions of the whole body, but arms and hands play the crucial role, especially for sign language recognition [5]. Only a small handful of human gestures can be recognized from a single still posture, and complex scene backgrounds may adversely affect gesture recognition, since gestures generally focus on the motion of arms and hands.
With the rapid development of deep learning theory, deep neural networks (DNNs) have made a tremendous impact on computer vision. Convolutional neural networks (CNNs) [6] have demonstrated outstanding performance in many fields of computer vision, such as image classification [8], object detection [9], image segmentation [10], scene recognition [11], face recognition [12], and human action/activity recognition [13]. Compared to still images, the temporal component of videos provides an additional cue for video-based tasks. Simonyan et al. proposed two-stream convolutional networks for action recognition in video data [14]. Tran et al. learned spatiotemporal features with 3D ConvNets for action recognition [15]. Recurrent neural networks (RNNs) are well known to be "deep in time"; Donahue et al. proposed Long-term Recurrent Convolutional Networks (LRCNs), which stack a CNN and Long Short-Term Memory (LSTM) recurrent neural networks for action recognition [16].
However, unlike the human actions recognized by the aforementioned methods, human gestures depend more on the spatiotemporal features of arms and hands. The effective spatial convolutional features of hands may be overwhelmed by complex scene backgrounds due to the relatively small size of fingers in images; the temporal information thus becomes more discriminative for gesture recognition than for general video classification tasks [17]. Therefore, learning the spatial and temporal features separately is not effective enough for gesture recognition. Spatiotemporal feature learning may be a better option, since spatiotemporal features can suppress the effect of complex scene backgrounds and diverse illumination to some degree.
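To make the notion of spatiotemporal feature learning concrete, the following minimal PyTorch sketch shows a single 3D convolution whose kernel spans time as well as space; the tensor sizes and kernel size are illustrative assumptions, not the exact configuration used in this paper.

import torch
import torch.nn as nn

# A 3D convolution slides a 3x3x3 kernel over time as well as space,
# so its responses encode motion and appearance jointly.
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)

clip = torch.randn(1, 3, 16, 112, 112)  # (batch, channels, frames, height, width)
features = conv3d(clip)                 # -> (1, 64, 16, 112, 112)
print(features.shape)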
In this paper, we present pyramidal 3D convolutional networks based on the 3D ConvNets [15] for isolated gesture recognition. The proposed framework, illustrated in Fig. 1, placed third in the ChaLearn LAP Large-scale Isolated Gesture Recognition Challenge organized in 2016 [7]. The main contributions of the proposed networks, compared to the 3D ConvNets [15], are summarized as follows:
(a) Pyramid input: Each gesture video is segmented pyramidally, and each segment is uniformly sampled with temporal jitter to construct the pyramid input, which preserves the multi-scale contextual information of gestures (see the first sketch after this list).
(b) Pyramid fusion: Pyramid fusion layers are used to fuse the features of the pyramid input, as displayed in Fig. 1, which makes the networks recognize gestures from entire gesture videos rather than from segmented clips independently (see the second sketch after this list).
(c) Multi-modalities: Networks trained on the RGB and depth modalities are fused to improve the recognition accuracy (see the third sketch after this list).
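As a first sketch, the pyramid input construction can be illustrated as follows: the whole video plus its halves form the pyramid segments, and each segment yields a fixed-length clip sampled uniformly with temporal jitter. The two-level pyramid and the 16-frame clip length are assumptions for illustration, not the paper's exact settings.

import random

def sample_segment(start, end, clip_len=16):
    """Uniformly sample clip_len frame indices from [start, end) with jitter."""
    stride = (end - start) / clip_len
    indices = []
    for i in range(clip_len):
        lo = start + i * stride
        # Temporal jitter: pick a random frame inside the i-th uniform interval.
        indices.append(min(end - 1, int(lo + random.uniform(0, stride))))
    return indices

def pyramid_input(num_frames, levels=2, clip_len=16):
    """Return one sampled clip per pyramid segment (1 + 2 + ... segments)."""
    clips = []
    for level in range(levels):
        n_seg = 2 ** level
        seg_len = num_frames / n_seg
        for s in range(n_seg):
            clips.append(sample_segment(int(s * seg_len),
                                        int((s + 1) * seg_len), clip_len))
    return clips

# Example: a 100-frame gesture video -> 3 clips (whole video, first half, second half).
for clip in pyramid_input(100):
    print(clip)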
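The second sketch illustrates one plausible form of a pyramid fusion layer: feature maps computed from the pyramid clips are concatenated along the channel axis and mixed by a 1x1x1 convolution, so the layers that follow see the whole video rather than independent clips. The layer sizes and the concatenation-plus-1x1x1-convolution scheme are illustrative assumptions; the paper's fusion layers are defined in Fig. 1.

import torch
import torch.nn as nn

class PyramidFusion(nn.Module):
    def __init__(self, channels, num_clips):
        super().__init__()
        # A 1x1x1 convolution mixes the stacked per-clip features back down
        # to `channels`, producing one fused spatiotemporal feature map.
        self.fuse = nn.Conv3d(channels * num_clips, channels, kernel_size=1)

    def forward(self, clip_features):
        # clip_features: list of (B, C, T, H, W) tensors, one per pyramid clip.
        stacked = torch.cat(clip_features, dim=1)  # (B, C * num_clips, T, H, W)
        return torch.relu(self.fuse(stacked))

# Example: fuse the features of 3 pyramid clips.
feats = [torch.randn(1, 64, 8, 28, 28) for _ in range(3)]
fused = PyramidFusion(64, 3)(feats)
print(fused.shape)  # torch.Size([1, 64, 8, 28, 28])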
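The third sketch shows late fusion across modalities: class scores from the RGB network and the depth network are combined by a weighted average. The paper only states that the two modality networks are fused; the equal weights here are an assumption.

import torch

def fuse_modalities(rgb_logits, depth_logits, w_rgb=0.5, w_depth=0.5):
    # Convert each network's logits to class probabilities, then average.
    rgb_prob = torch.softmax(rgb_logits, dim=-1)
    depth_prob = torch.softmax(depth_logits, dim=-1)
    return w_rgb * rgb_prob + w_depth * depth_prob

rgb = torch.randn(1, 249)    # 249 gesture classes in the ChaLearn IsoGD challenge
depth = torch.randn(1, 249)
pred = fuse_modalities(rgb, depth).argmax(dim=-1)
print(pred)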
This work is partially supported by the China Postdoctoral Science Foundation (Grant No. 2016M592763), the Fundamental Research Funds for the Central Universities (Grant No. JB161006), and the National Natural Science Foundation of China (Grant Nos. 61401324, 61305109, 61072105).