Large-scale Isolated Gesture Recognition using
Pyramidal 3D Convolutional Networks
Guangming Zhu¹, Liang Zhang¹, Lin Mei², Jie Shao², Juan Song¹, Peiyi Shen¹
¹School of Software, Xidian University, Xi'an, 710071, China
²The Third Research Institute of Ministry of Public Security, Shanghai, 201210, China
gmzhu@xidian.edu.cn
Abstract—Human gesture recognition is one of the central research fields of computer vision, and effective gesture recognition remains challenging. In this paper, we present a pyramidal 3D convolutional network framework for large-scale isolated human gesture recognition. 3D convolutional networks are utilized to learn spatiotemporal features from gesture videos. A pyramid input is proposed to preserve the multi-scale contextual information of gestures, and each pyramid segment is uniformly sampled with temporal jitter. Pyramid fusion layers are inserted into the 3D convolutional networks to fuse the features of the pyramid input. This strategy enables the networks to recognize human gestures from entire videos rather than from segmented clips independently. We present experimental results on the 2016 ChaLearn LAP Large-scale Isolated Gesture Recognition Challenge, in which our method placed third.
Keywords-gesture recognition; 3D convolutional networks;
pyramid; temporal jitter
I. INTRODUCTION
Gestures, as a nonverbal body language, play a very important role in daily human communication. With the rapid development of human-computer and human-robot interaction, visual gesture recognition [1] has become one of the central research fields of computer vision. Effective gesture recognition remains very challenging [3], due to several factors: cultural differences, varying observation conditions, out-of-vocabulary motions, the relatively small size of fingers in images, noise in camera channels, tiny differences among similar gestures, etc. To advance research on gesture recognition, ChaLearn has organized a series of gesture recognition challenges since 2011 [4].
Human gestures may involve motions of the whole body, but arms and hands play the crucial role, especially for sign language recognition [5]. Only a small handful of human gestures can be recognized from a single still posture, and complex scene backgrounds may adversely affect gesture recognition, since gestures generally focus on the motion of arms and hands.
With the rapid development of deep learning theory, deep neural networks (DNNs) have made a tremendous impact on computer vision. Convolutional neural networks (CNNs) [6] have demonstrated outstanding performance in many fields of computer vision, such as image classification [8], object detection [9], image segmentation [10], scene recognition [11], face recognition [12], and human action/activity recognition [13]. Compared to still images, the temporal component of videos provides an additional cue for video-based tasks. Simonyan et al. proposed two-stream convolutional networks for action recognition in video data [14]. Tran et al. learned spatiotemporal features with 3D ConvNets for action recognition [15]. Recurrent neural networks (RNNs) are well known to be "deep in time"; Donahue et al. proposed Long-term Recurrent Convolutional Networks (LRCNs), which stack a CNN and Long Short-Term Memory (LSTM) recurrent neural networks for action recognition [16].
However, unlike the human actions recognized by the aforementioned methods, human gestures depend more on the spatiotemporal features of arms and hands. The effective spatial convolutional features of hands may be overwhelmed by complex scene backgrounds due to the relatively small size of fingers in images; the temporal information thus becomes more discriminative for gesture recognition than for general video classification tasks [17]. Therefore, learning the spatial and temporal features separately is not effective enough for gesture recognition. Spatiotemporal feature learning may be a better option, since spatiotemporal features can suppress the effect of complex scene backgrounds and diverse illumination to some degree.
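To make the notion of spatiotemporal feature learning concrete, the following minimal PyTorch sketch shows a single 3D convolution whose kernel spans time as well as space; the tensor sizes and kernel size are illustrative assumptions, not the exact configuration used in this paper.

import torch
import torch.nn as nn

# A 3D convolution slides a 3x3x3 kernel over time as well as space,
# so its responses encode motion and appearance jointly.
conv3d = nn.Conv3d(in_channels=3, out_channels=64, kernel_size=3, padding=1)

clip = torch.randn(1, 3, 16, 112, 112)  # (batch, channels, frames, height, width)
features = conv3d(clip)                 # -> (1, 64, 16, 112, 112)
print(features.shape)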
In this paper, we present pyramidal 3D convolutional networks based on the 3D ConvNets [15] for isolated gesture recognition. The proposed framework, illustrated in Fig. 1, placed third in the ChaLearn LAP Large-scale Isolated Gesture Recognition Challenge organized in 2016 [7]. The main contributions of the proposed networks, compared to the 3D ConvNets [15], are summarized as follows:
(a) Pyramid input: Each gesture video is segmented pyramidally, and each segment is uniformly sampled with temporal jitter to construct the pyramid input, which preserves the multi-scale contextual information of gestures (see the first sketch after this list).
(b) Pyramid fusion: Pyramid fusion layers are used to fuse the features of the pyramid input, as displayed in Fig. 1, which makes the networks recognize gestures from entire gesture videos rather than from segmented clips independently (see the second sketch after this list).
(c) Multi-modalities: Networks trained on the RGB and depth modalities are fused to improve the recognition accuracy (see the third sketch after this list).
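As a first sketch, the pyramid input construction can be illustrated as follows: the whole video plus its halves form the pyramid segments, and each segment yields a fixed-length clip sampled uniformly with temporal jitter. The two-level pyramid and the 16-frame clip length are assumptions for illustration, not the paper's exact settings.

import random

def sample_segment(start, end, clip_len=16):
    """Uniformly sample clip_len frame indices from [start, end) with jitter."""
    stride = (end - start) / clip_len
    indices = []
    for i in range(clip_len):
        lo = start + i * stride
        # Temporal jitter: pick a random frame inside the i-th uniform interval.
        indices.append(min(end - 1, int(lo + random.uniform(0, stride))))
    return indices

def pyramid_input(num_frames, levels=2, clip_len=16):
    """Return one sampled clip per pyramid segment (1 + 2 + ... segments)."""
    clips = []
    for level in range(levels):
        n_seg = 2 ** level
        seg_len = num_frames / n_seg
        for s in range(n_seg):
            clips.append(sample_segment(int(s * seg_len),
                                        int((s + 1) * seg_len), clip_len))
    return clips

# Example: a 100-frame gesture video -> 3 clips (whole video, first half, second half).
for clip in pyramid_input(100):
    print(clip)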
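The second sketch illustrates one plausible form of a pyramid fusion layer: feature maps computed from the pyramid clips are concatenated along the channel axis and mixed by a 1x1x1 convolution, so the layers that follow see the whole video rather than independent clips. The layer sizes and the concatenation-plus-1x1x1-convolution scheme are illustrative assumptions; the paper's fusion layers are defined in Fig. 1.

import torch
import torch.nn as nn

class PyramidFusion(nn.Module):
    def __init__(self, channels, num_clips):
        super().__init__()
        # A 1x1x1 convolution mixes the stacked per-clip features back down
        # to `channels`, producing one fused spatiotemporal feature map.
        self.fuse = nn.Conv3d(channels * num_clips, channels, kernel_size=1)

    def forward(self, clip_features):
        # clip_features: list of (B, C, T, H, W) tensors, one per pyramid clip.
        stacked = torch.cat(clip_features, dim=1)  # (B, C * num_clips, T, H, W)
        return torch.relu(self.fuse(stacked))

# Example: fuse the features of 3 pyramid clips.
feats = [torch.randn(1, 64, 8, 28, 28) for _ in range(3)]
fused = PyramidFusion(64, 3)(feats)
print(fused.shape)  # torch.Size([1, 64, 8, 28, 28])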
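The third sketch shows late fusion across modalities: class scores from the RGB network and the depth network are combined by a weighted average. The paper only states that the two modality networks are fused; the equal weights here are an assumption.

import torch

def fuse_modalities(rgb_logits, depth_logits, w_rgb=0.5, w_depth=0.5):
    # Convert each network's logits to class probabilities, then average.
    rgb_prob = torch.softmax(rgb_logits, dim=-1)
    depth_prob = torch.softmax(depth_logits, dim=-1)
    return w_rgb * rgb_prob + w_depth * depth_prob

rgb = torch.randn(1, 249)    # 249 gesture classes in the ChaLearn IsoGD challenge
depth = torch.randn(1, 249)
pred = fuse_modalities(rgb, depth).argmax(dim=-1)
print(pred)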
This work is partially supported by the China Postdoctoral Science Foundation (Grant No. 2016M592763), the Fundamental Research Funds for the Central Universities (Grant No. JB161006), and the National Natural Science Foundation of China (Grant Nos. 61401324, 61305109, 61072105).