Deep Learning for Hand Gesture Recognition on Skeletal Data
Guillaume Devineau (1), Wang Xi (2), Fabien Moutarde (1) and Jie Yang (2)
(1) MINES ParisTech, PSL Research University, Center for Robotics, 60 Bd St Michel 75006 Paris, France
(2) Shanghai Jiao Tong University, School of Electronic Information and Electrical Engineering, Shanghai, China
Abstract— In this paper, we introduce a new 3D hand gesture
recognition approach based on a deep learning model.
We introduce a new Convolutional Neural Network (CNN)
where sequences of hand-skeletal joints’ positions are processed
by parallel convolutions; we then investigate the performance
of this model on hand gesture sequence classification tasks. Our
model only uses hand-skeletal data and no depth image.
Experimental results show that our approach achieves a
state-of-the-art performance on a challenging dataset (DHG
dataset from the SHREC 2017 3D Shape Retrieval Contest),
when compared to other published approaches. Our model
achieves a 91.28% classification accuracy for the 14 gesture
classes case and an 84.35% classification accuracy for the 28
gesture classes case.
I. INTRODUCTION
Touch and gesture are two natural ways for a user to
interact with one’s environment. While touch necessarily
involves a physical contact (e.g. to write a message on
a phone, to grab a physical object, or to swipe touch-sensitive
textiles), gestures allow remote interactions (e.g.
to interact with a smart screen, or with virtual-reality and
augmented-reality objects). As such, gesture-based
human-computer interfaces can ease the use of digital computing
[27] in situations where it would previously have been difficult
or even impossible because of practical constraints like
interacting with everyday life physical objects (e.g. lights,
mirrors, doorknobs, notebooks, ...) or like using computers
in settings where the person has to focus entirely on a task
(e.g. while driving a car, cooking or doing surgery).
Gesture can convey semantic meaning, as well as con-
textual information such as personality, emotion or attitude.
For instance, research shows that speech and gesture share
the same communication system [2] and that one’s gestures
are directly linked to one’s memory [18]. Among gestures,
hand gestures distinguish themselves from two other types of
gestures [25]: body gestures and head gestures. We chose to
work on hand gestures since they can carry more information
more easily than the two other types of gestures. One
preferred way to infer the intent of a gesture is to use a
taxonomy of gestures and to classify the unknown gesture
into one of the existing categories based on the gesture data,
in a similar way to what is done in computer vision for
instance. The classification can either be obtained in realtime
at each time step or at the end of the gesture, depending on
the processing power and the application needs.
In this paper we propose a convolutional neural network
architecture relying on intra- and inter- parallel processing
[Fig. 1 here: hand skeleton with joint types labeled Tip, Articulation (a), Articulation (b), Base, Palm and Wrist.]
Fig. 1. Hand skeleton returned by the Intel RealSense camera. Each dot
represents one of the n = 22 joints of the skeleton.
of sequences of positions (of hand-skeletal joints) to classify
complete hand gestures. Where most existing deep learn-
ing approaches to gesture recognition use RGB-D image
sequences to classify gestures [49], our neural network only
uses hand (3D) skeletal data sequences which are quicker to
process than image sequences.
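The parallel-processing idea can be sketched as follows: each univariate channel (one coordinate of one joint) is convolved independently with its own temporal kernel, so all channels are processed in parallel. The NumPy sketch below is purely illustrative; the function name `temporal_conv`, the channel count (3 coordinates for each of the 22 joints) and the kernel size 7 are our assumptions here, not the actual architecture, which is detailed in Section IV.

```python
import numpy as np

def temporal_conv(sequence, kernels):
    """Convolve each univariate channel of `sequence` with its own
    1D temporal kernel, independently (in parallel) across channels.

    sequence: array of shape (channels, T) -- e.g. 22 joints x 3 coords = 66
    kernels:  array of shape (channels, K) -- one kernel per channel
    returns:  array of shape (channels, T), 'same' padding
    """
    return np.stack([np.convolve(ch, k, mode="same")
                     for ch, k in zip(sequence, kernels)])

# Toy example: 66 channels (22 joints x 3 coordinates), 100 time steps.
seq = np.random.randn(66, 100)
kernels = np.random.randn(66, 7)   # kernel size 7 is an arbitrary choice
out = temporal_conv(seq, kernels)
print(out.shape)  # (66, 100): same temporal length, one output per channel
```

A real CNN would learn the kernels by backpropagation and stack several such layers with non-linearities; this sketch only shows the channel-parallel structure of the temporal convolutions.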
The rest of this paper is structured as follows. We first
review common recognition methods in Section II. We then
present the DHG dataset we used to evaluate our network in
Section III. We detail our approach in Section IV in terms
of motivations, architecture and results. Finally, we conclude
in Section VI and discuss how our model can be improved
and integrated into a realtime interactive system.
II. DEFINITION & RELATED WORK
We define a 3D skeletal data sequence $s$ as a vector
$$s = (p_1 \cdots p_n)^T$$
whose components $p_i$ are multivariate time sequences. Each
component $p_i = (p_i(t))_{t \in \mathbb{R}}$ represents a multivariate
sequence with three components (univariate sequences)
$$p_i = (x^{(i)}, y^{(i)}, z^{(i)})$$
that altogether represent a time sequence of the positions
$p_i(t)$ of the $i$-th skeletal joint $j_i$. Every skeletal joint $j_i$
represents a distinct and precise articulation or part of one's
hand in the physical world. An illustration of a 3D hand
skeleton is proposed in Figure 1.
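In practice, a sampled sequence $s$ can be stored as a dense array indexed by joint, coordinate and time step. The sketch below assumes $n = 22$ joints (as in Fig. 1) and $T$ sampled frames; this particular array layout is our choice for illustration, not one prescribed by the paper.

```python
import numpy as np

n_joints, T = 22, 100        # 22 skeletal joints, T sampled frames

# s[i] holds the multivariate sequence p_i = (x_i(t), y_i(t), z_i(t));
# s[i, 0] is the univariate sequence x_i, s[i, 1] is y_i, s[i, 2] is z_i.
s = np.zeros((n_joints, 3, T))

# Example: read the 3D position p_5(t) of joint j_5 at frame t = 10.
p_5_t10 = s[5, :, 10]        # a 3-vector (x, y, z)
print(s.shape, p_5_t10.shape)  # (22, 3, 100) (3,)
```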
In the following subsections, we present a short review of
some approaches to gesture recognition. Typical approaches
to hand gesture recognition begin with the extraction of
spatial and temporal features from raw data. The features
are later classified by a Machine Learning algorithm. The

978-1-5386-2335-0/18/$31.00 © 2018 IEEE
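The generic two-stage pipeline described above (handcrafted spatio-temporal features followed by a classifier) can be sketched as follows. The displacement-statistics features and the nearest-centroid classifier are illustrative choices of ours, not any specific method reviewed here.

```python
import numpy as np

def extract_features(seq):
    """Handcrafted spatio-temporal features from a skeletal sequence.
    seq: array (n_joints, 3, T). Returns a fixed-size vector: the mean
    position and the mean absolute frame-to-frame displacement per joint."""
    mean_pos = seq.mean(axis=2)                         # (n_joints, 3)
    motion = np.abs(np.diff(seq, axis=2)).mean(axis=2)  # (n_joints, 3)
    return np.concatenate([mean_pos.ravel(), motion.ravel()])

def nearest_centroid_fit(features, labels):
    """Compute one centroid per gesture class in feature space."""
    classes = np.unique(labels)
    return classes, np.stack([features[labels == c].mean(axis=0)
                              for c in classes])

def nearest_centroid_predict(x, classes, centroids):
    """Assign x to the class whose centroid is closest."""
    return classes[np.argmin(np.linalg.norm(centroids - x, axis=1))]

# Toy data: two synthetic gesture classes differing in amount of motion.
rng = np.random.default_rng(0)
slow = [rng.normal(0, 0.01, (22, 3, 50)) for _ in range(5)]  # class 0
fast = [rng.normal(0, 1.0, (22, 3, 50)) for _ in range(5)]   # class 1
X = np.stack([extract_features(s) for s in slow + fast])
y = np.array([0] * 5 + [1] * 5)
classes, centroids = nearest_centroid_fit(X, y)
pred = nearest_centroid_predict(extract_features(fast[0]), classes, centroids)
print(pred)  # -> 1
```

Deep learning approaches such as the one proposed in this paper replace the handcrafted feature-extraction stage with layers learned end-to-end from the raw skeletal sequences.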