mentioned problems, we propose a new spatio-temporal feature and give examples to explain how
to extract the new feature step by step.
3.1.1 FEATURE POINTS DETECTION FROM RGB-D DATA
Although the 3D MoSIFT feature has achieved good results in human activity recognition, it still
cannot eliminate the influence of slight motions, as shown in Figure 2(a). Therefore, we
fuse depth information to detect robust interest points. Recall that the SIFT algorithm (Lowe, 2004)
uses the Gaussian function as the scale-space kernel to produce a scale space of an input image. The
whole scale space is divided into a sequence of octaves, and each octave consists of a sequence of
intervals, where each interval is a scaled image.
Building Gaussian Pyramid. Given a gesture sample including two videos (one RGB video and
one depth video),$^{1}$ a Gaussian pyramid for every grayscale frame (converted from the RGB
frame) and a depth Gaussian pyramid for every depth frame can be built via Equation (1):
$$
\begin{aligned}
L^{I}_{i,j}(x,y) &= G(x,y,k^{j}\sigma) \ast L^{I}_{i,0}(x,y), \qquad 0 \le i < n,\; 0 \le j < s+3,\\
L^{D}_{i,j}(x,y) &= G(x,y,k^{j}\sigma) \ast L^{D}_{i,0}(x,y), \qquad 0 \le i < n,\; 0 \le j < s+3,
\end{aligned} \tag{1}
$$
where $(x,y)$ is the coordinate in an image; $n$ is the number of octaves and $s$ is the number of
intervals; $L^{I}_{i,j}$ and $L^{D}_{i,j}$ denote the blurred $(j+1)$th image in the $(i+1)$th octave;
$L^{I}_{i,0}$ (or $L^{D}_{i,0}$) denotes the first grayscale (or depth) image in the $(i+1)$th octave.
For $i = 0$, $L^{I}_{0,0}$ (or $L^{D}_{0,0}$) is calculated from the original grayscale (depth) frame
via bilinear interpolation, and the size of $L^{I}_{0,0}$ is twice the size of the original frame. For
$i \ge 1$, $L^{I}_{i,0}$ (or $L^{D}_{i,0}$) is down-sampled from $L^{I}_{i-1,s}$ (or $L^{D}_{i-1,s}$)
by taking every second pixel in each row and column. In Figure 3(a), the blue arrow shows that the
first image $L^{I}_{1,0}$ in the second octave is down-sampled from the third image $L^{I}_{0,2}$ in
the first octave.
$\ast$ is the convolution operation; $G(x,y,k^{j}\sigma) = \frac{1}{2\pi (k^{j}\sigma)^{2}}\, e^{-(x^{2}+y^{2})/(2(k^{j}\sigma)^{2})}$
is a Gaussian function with variable scale; $\sigma$ is the initial smoothing parameter of the Gaussian
function and $k = 2^{1/s}$ (Lowe, 2004).
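To make the construction concrete, here is a minimal Python sketch of Equation (1) using OpenCV.
The function name build_gaussian_pyramid, the file names, and the default value sigma = 1.6 are our
own illustrative choices, not details specified in the paper:

    import cv2
    import numpy as np

    def build_gaussian_pyramid(frame, n_octaves=4, s_intervals=2, sigma=1.6):
        """Build a Gaussian pyramid per Equation (1): n octaves of s + 3 images each."""
        k = 2.0 ** (1.0 / s_intervals)
        # L_{0,0}: the original frame enlarged to twice its size (bilinear interpolation).
        base = cv2.resize(frame, None, fx=2.0, fy=2.0,
                          interpolation=cv2.INTER_LINEAR)
        pyramid = []
        for i in range(n_octaves):
            # L_{i,j} = G(x, y, k^j * sigma) * L_{i,0}; j = 0 is the base image itself.
            octave = [base] + [cv2.GaussianBlur(base, (0, 0), sigmaX=(k ** j) * sigma)
                               for j in range(1, s_intervals + 3)]
            pyramid.append(octave)
            # L_{i+1,0}: down-sample L_{i,s} by taking every second pixel per row/column.
            base = octave[s_intervals][::2, ::2]
        return pyramid

    # One pyramid for the grayscale frame and one for the depth frame
    # (depth values assumed already normalized to [0, 255]; file names are hypothetical).
    gray = cv2.cvtColor(cv2.imread("rgb_frame.png"), cv2.COLOR_BGR2GRAY).astype(np.float32)
    depth = cv2.imread("depth_frame.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)
    L_I = build_gaussian_pyramid(gray)
    L_D = build_gaussian_pyramid(depth)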
Then, the difference of Gaussian (DoG) images, $Df$, are calculated from the difference of
two nearby scales via Equation (2):
$$
Df_{i,j} = L^{I}_{i,j+1} - L^{I}_{i,j}, \qquad 0 \le i < n,\; 0 \le j < s+2. \tag{2}
$$
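Continuing the sketch above, Equation (2) amounts to differencing adjacent scales within each
octave; build_dog_pyramid is again a hypothetical helper name:

    def build_dog_pyramid(gaussian_pyramid):
        """Df_{i,j} = L_{i,j+1} - L_{i,j}; each octave of s + 3 images yields s + 2 DoG images."""
        return [[octave[j + 1] - octave[j] for j in range(len(octave) - 1)]
                for octave in gaussian_pyramid]

    Df = build_dog_pyramid(L_I)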
We give an example to intuitively understand the Gaussian pyramid and the DoG pyramid. Figure
3 shows two Gaussian pyramids ($L^{I}_{t}$, $L^{I}_{t+1}$) built from two consecutive grayscale
frames and two depth Gaussian pyramids ($L^{D}_{t}$, $L^{D}_{t+1}$) built from the corresponding
depth frames. In this example, the number of octaves is $n = 4$ and the number of intervals is
$s = 2$; therefore, for each frame, we build $s + 3 = 5$ images per octave. We can also see that a
larger $k^{j}\sigma$ results in a more blurred image (see the enlarged portion of the red rectangle
in Figure 3). Then, we use the Gaussian pyramid shown in Figure 3(a) to build the DoG pyramid via
Equation (2), which is shown in Figure 4.
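Under the settings of this example, the hypothetical sketches above would reproduce these counts:

    # n = 4 octaves, s = 2 intervals: s + 3 = 5 Gaussian images and
    # s + 2 = 4 DoG images per octave.
    L_I = build_gaussian_pyramid(gray, n_octaves=4, s_intervals=2)
    Df = build_dog_pyramid(L_I)
    assert all(len(octave) == 5 for octave in L_I)
    assert all(len(octave) == 4 for octave in Df)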
Building Optical Flow Pyramid. First, we briefly review the Lucas-Kanade method (Lucas
and Kanade, 1981), which is widely used in computer vision. The method assumes that the
displacement between two consecutive frames is small and approximately constant within a
neighborhood of a point $\rho$. The two consecutive frames are denoted by $F_1$ and $F_2$ at time
$t$ and $t + 1$, respectively. Then
1. The depth values are normalized to [0, 255] in depth videos.