One-Shot-Learning Gesture Recognition with HOG-HOF Features: Results from the ChaLearn Competition


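This page summarizes a paper published in 2014 in the Journal of Machine Learning Research (Vol. 15, pp. 2513-2532), titled "One-Shot-Learning Gesture Recognition Using HOG-HOF Features". The authors are Jakub Konečný and Michal Hagara; Isabelle Guyon, Vassilis Athitsos and Sergio Escalera served as editors. The work describes one-shot-learning gesture recognition systems designed for and evaluated on the ChaLearn Gesture Dataset.

The authors use RGB and depth images and combine an appearance descriptor (Histograms of Oriented Gradients, HOG) with a motion descriptor (Histogram of Optical Flow, HOF) to perform temporal segmentation and recognition of video sequences in parallel. HOG describes local appearance through the distribution of gradient orientations, while HOF describes the direction and magnitude of motion between frames; combining the two helps improve recognition accuracy. The paper also proposes a new video trimming algorithm that removes irrelevant frames from a video, reducing the effect of noise and redundant information on recognition performance.

The authors present two methods, both of which use the combined HOG-HOF descriptors together with different variants of Dynamic Time Warping (DTW). DTW is a widely used sequence alignment technique that matches two time series by finding the warping path with the minimum cumulative distance between them, which makes it well suited to gesture sequences of differing durations. Both methods outperformed the other methods published at the time. This indicates that, by combining appearance and motion features and improving the sequence processing, a one-shot-learning gesture recognition system can perform well on a difficult data set. The results give later researchers a solid basis for exploring one-shot learning in computer vision, in particular gesture recognition without large numbers of training samples.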
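To make the alignment idea concrete, here is a minimal sketch of textbook DTW on two numeric sequences. It is not one of the DTW variants used in the paper, which this summary does not specify; the function name `dtw_distance` and the absolute-difference cost are assumptions chosen for illustration.

```python
def dtw_distance(a, b, cost=lambda x, y: abs(x - y)):
    """Textbook dynamic time warping cost between two sequences.

    Illustrative only: the per-element cost (absolute difference) and the
    unconstrained step pattern are assumptions, not the paper's variants.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # acc[i][j] holds the minimal cumulative cost of aligning a[:i] with b[:j].
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(a[i - 1], b[j - 1])
            # Extend the cheapest of the three admissible predecessor alignments.
            acc[i][j] = step + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m]


# The same shape performed at a different speed aligns at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
```

Because one element may be matched to several elements of the other sequence, a gesture performed slowly can still align with a faster reference, which is the property the summary refers to.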

Konečný and Hagara

The challenging aspects of the data are that within a single batch there is only one labelled example of each gesture. Between different batches there are variations in recording conditions, clothing, skin color and lighting. Some users are less skilled than others, so there are some errors or omissions in performing the gestures. In some batches, parts of the body may also be occluded.

For the evaluation of results the Levenshtein distance was used, as it was the metric provided for the competition. It is the minimum number of edit operations (insertion, deletion or substitution) needed to transform one vector into another. For each unlabelled video, the distance D(T, L) was computed, where T is the truth vector of labels and L is our predicted vector of labels. This distance is also known as the “edit distance”. For example, D([1, 2], [1]) = 1, D([1, 2, 3], [2, 4]) = 2, D([1, 2, 3], [3, 2]) = 2.

The overall score for a batch was computed as the sum of Levenshtein distances divided by the total number of gestures performed in the batch. This is similar to an error rate (but can exceed 1). We multiply the result by a factor of 100 so that it resembles a failure percentage. For simplicity, in the rest of this work, we call it the error rate.
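The metric described above can be sketched as follows. This is an illustrative implementation under stated assumptions, not the competition's scoring code: the function names `levenshtein` and `batch_error_rate` are hypothetical, and each video is represented as a plain list of integer gesture labels.

```python
def levenshtein(truth, predicted):
    """Minimum number of insertions, deletions and substitutions
    needed to turn the label sequence `truth` into `predicted`."""
    # prev[j] is the distance between the processed prefix of `truth`
    # and the first j labels of `predicted`.
    prev = list(range(len(predicted) + 1))
    for i, t in enumerate(truth, start=1):
        curr = [i]
        for j, p in enumerate(predicted, start=1):
            curr.append(min(
                prev[j] + 1,             # delete t
                curr[j - 1] + 1,         # insert p
                prev[j - 1] + (t != p),  # substitute, or match at no cost
            ))
        prev = curr
    return prev[-1]


def batch_error_rate(truths, predictions):
    """Sum of edit distances over the batch, divided by the number of
    true gestures in the batch, scaled by 100."""
    total_distance = sum(levenshtein(t, p) for t, p in zip(truths, predictions))
    total_gestures = sum(len(t) for t in truths)
    return 100.0 * total_distance / total_gestures


# The three worked examples from the text, treated as one small batch.
truths = [[1, 2], [1, 2, 3], [1, 2, 3]]
predictions = [[1], [2, 4], [3, 2]]
print([levenshtein(t, p) for t, p in zip(truths, predictions)])  # [1, 2, 2]
print(batch_error_rate(truths, predictions))  # 62.5
```

The score exceeds 100 when a prediction contains many spurious labels: a single true gesture `[1]` predicted as `[2, 3, 4]` has distance 3, giving a rate of 300.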

4. Preprocessing

In this section we describe some of the challenges posed by the given data set and the solutions we propose. In Section 4.1 we focus on depth noise removal. Later we describe the need for trimming the videos (removing a set of frames) and the method employed.

4.1 Depth Noise Removal

One of the problems with the given data set is the noise (or missing values) in the depth data. Whenever the Kinect sensor does not receive a response from a particular point, the sensor outputs a 0, resulting in the black areas shown in Figure 1. This noise usually occurs along the edges of objects or, particularly in this data set, humans. The noise is also visible if the object is out of the range of the sensor (0.8 to 3.5 meters).

Figure 1: Examples of depth images with various levels of noise

The level of noise is usually the same within a single batch. However, there is a big difference in the noise level across different batches. If the level is not too high, it looks like ‘salt and pepper’ noise.
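As a rough illustration of this kind of noise, the sketch below measures the share of missing (zero) depth values in a frame and fills isolated gaps with the median of their valid neighbours. Median filtering is a common remedy for salt-and-pepper noise in general; this excerpt does not state which filter the authors apply, so the 3x3 neighbourhood, the function names `missing_fraction` and `fill_missing`, and the nested-list frame representation are all assumptions.

```python
from statistics import median


def missing_fraction(depth):
    """Share of pixels in a depth frame whose value is 0 (no sensor response)."""
    total = sum(len(row) for row in depth)
    zeros = sum(value == 0 for row in depth for value in row)
    return zeros / total


def fill_missing(depth):
    """Replace each zero pixel with the median of the non-zero values in its
    3x3 neighbourhood; pixels with no valid neighbour stay 0.

    Illustrative assumption, not the procedure described in the paper.
    """
    height, width = len(depth), len(depth[0])
    filled = [row[:] for row in depth]  # copy so the input frame is untouched
    for y in range(height):
        for x in range(width):
            if depth[y][x] != 0:
                continue
            neighbours = [
                depth[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if depth[ny][nx] != 0
            ]
            if neighbours:
                filled[y][x] = median(neighbours)
    return filled


frame = [[5, 5, 5],
         [5, 0, 5],
         [5, 5, 7]]
print(fill_missing(frame)[1][1])  # 5.0
```

Comparing `missing_fraction` across frames gives a simple way to see the batch-to-batch difference in noise level that the text describes. Filling only the zero pixels, rather than filtering the whole frame, leaves valid depth readings unchanged.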
