Data Fusion-based Real-Time Hand Gesture
Recognition with Kinect V2
Yuhai Lan
School of Information Engineering
Nanchang University
Nanchang, 330031, China
Jing Li
School of Information Engineering
Nanchang University
Nanchang, 330031, China
jingli@ncu.edu.cn
Zhaojie Ju
Intelligent Systems and Biomedical
Robotics Group, School of
Computing
University of Portsmouth
Portsmouth, PO1 3HE, U.K.
Abstract—Hand gesture recognition is an important topic in human-computer interaction. However, most current methods are complicated and time-consuming, which limits the use of hand gesture recognition in real-time scenarios. In this paper, we propose a data fusion-based hand gesture recognition model that fuses depth information and skeleton data. Thanks to the accurate segmentation and tracking provided by Kinect V2, the model achieves real-time performance, running 18.7% faster than some state-of-the-art methods. The experimental results show that the proposed model is accurate and robust to rotation, flipping, scale changes, lighting changes, cluttered backgrounds, and distortions, which ensures its applicability to different real-world human-computer interaction tasks.
Keywords—hand gesture recognition; skeleton; Kinect V2; depth image; real-time; data fusion
I. INTRODUCTION
As an important topic in human-computer/robot interaction, hand gesture recognition not only provides reliable information for exploring the meanings of human hand gestures for a friendly and comfortable interaction experience, as in virtual reality [1] and augmented reality [2], but also underpins a wide range of computer vision applications, including sign language recognition [3] and advanced driver assistance systems [4].
Recently, researchers have proposed numerous hand gesture recognition algorithms in the literature [9], most of which can be divided into three levels: 1) static hand gesture recognition; 2) dynamic hand gesture recognition; and 3) 3D hand gesture recognition. Traditionally, the approaches at the first two levels are mostly two-dimensional: they use 2D color images as input while ignoring depth information. By contrast, the technologies at the third level are 3D-based, where the depth information of images is exploited to achieve more effective hand gesture recognition performance. However, 3D depth information cannot be obtained from a single camera and requires a special device. Driven by the rapid development of human-computer interaction and sensing technologies, several well-known depth cameras have been developed, including Microsoft Kinect, Intel RealSense, Leap Motion, and Asus Xtion, among which Kinect is the most widely used device in the computer vision research community. Kinect captures depth information by time of flight (ToF) and provides different kinds of high-quality data, e.g., color, depth, infrared, and skeletons, together with a solid SDK.
Fig. 1. Kinect V2 for Xbox One.
This paper proposes a model to recognize hand gestures of different digits accurately and quickly by fusing skeleton information with depth images obtained by Kinect for Xbox One (‘Kinect V2’ for short). This is the first paper to use Kinect V2 to recognize different human hand gestures. Kinect for Windows (‘Kinect’ for short) is a human-computer interaction device released in 2010. It integrates many advanced vision technologies and has been widely used in various computer vision tasks, such as face recognition [5][6], scene understanding [7], and human gesture recognition [8]. However, since its improved version, Kinect V2, was released only in 2014, there is little related work in the literature. Compared with Kinect, the hardware of Kinect V2 has been largely improved. For example, Kinect V2 can track up to six skeletons, compared with two for Kinect. Skeleton tracking has also been substantially upgraded: the tracked positions are anatomically more accurate and robust, and the tracked area is wider. Moreover, the Kinect for Windows SDK has been continuously updated from SDK 1.8 to SDK 2.0.
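The core of the fusion idea can be sketched in a few lines: the skeleton stream supplies the tracked hand-joint position, which seeds a depth-threshold segmentation of the hand in the depth image. The snippet below is a minimal illustration on synthetic data, not the paper's implementation; the function name `segment_hand` and the 80 mm depth margin are assumptions for the sketch.

```python
import numpy as np

def segment_hand(depth_mm, hand_px, margin_mm=80):
    """Fuse a skeleton hand-joint pixel position (hand_px) with a depth
    frame (depth_mm, values in millimetres): keep the pixels whose depth
    lies within margin_mm of the depth at the tracked joint."""
    x, y = hand_px
    joint_depth = int(depth_mm[y, x])            # depth at the tracked hand joint
    mask = np.abs(depth_mm.astype(np.int32) - joint_depth) <= margin_mm
    mask &= depth_mm > 0                         # discard invalid (zero-depth) pixels
    return mask

# Synthetic example: a 6x6 depth frame with a "hand" at ~600 mm
# in front of a background at ~2000 mm.
depth = np.full((6, 6), 2000, dtype=np.uint16)
depth[1:4, 1:4] = 600                            # 3x3 hand blob
mask = segment_hand(depth, hand_px=(2, 2))
print(mask.sum())                                # → 9 pixels belong to the hand
```

Seeding the threshold from the skeleton joint is what makes the segmentation robust to cluttered backgrounds: only surfaces at the hand's own depth survive, regardless of colour or lighting.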
This simple model can deal with various challenges in hand gesture recognition, such as rotation, scale changes, lighting changes, cluttered backgrounds, and distortions. It consists of two parts: 1) hardware: Kinect V2 is used to obtain depth images with its RGB-D cameras; and 2) software: the Microsoft Kinect SDK combined with Open Source Computer Vision (OpenCV) is adopted for real-time