mentioned problems, we propose a new spatio-temporal feature and give examples to explain how
to extract the new feature step by step.
3.1.1 FEATURE POINTS DETECTION FROM RGB-D DATA
Although the 3D MoSIFT feature has achieved good results in human activity recognition, it still
cannot eliminate the influence of slight motions, as shown in Figure 2(a). Therefore, we
fuse depth information to detect robust interest points. Recall that the SIFT algorithm (Lowe, 2004)
uses the Gaussian function as the scale-space kernel to produce a scale space of an input image. The
whole scale space is divided into a sequence of octaves, and each octave consists of a sequence of
intervals, where each interval is a scaled image.
Building Gaussian Pyramid. Given a gesture sample including two videos (one RGB video and
one depth video),$^{1}$ a Gaussian pyramid for every grayscale frame (converted from the RGB
frame) and a depth Gaussian pyramid for every depth frame can be built via Equation (1):
$$
\begin{aligned}
L^{I}_{i,j}(x,y) &= G(x,y,k^{j}\sigma) \ast L^{I}_{i,0}(x,y), \qquad 0 \le i < n,\; 0 \le j < s+3,\\
L^{D}_{i,j}(x,y) &= G(x,y,k^{j}\sigma) \ast L^{D}_{i,0}(x,y), \qquad 0 \le i < n,\; 0 \le j < s+3,
\end{aligned} \tag{1}
$$
where $(x,y)$ is the coordinate in an image; $n$ is the number of octaves and $s$ is the number of
intervals; $L^{I}_{i,j}$ and $L^{D}_{i,j}$ denote the blurred $(j+1)$th image in the $(i+1)$th octave;
$L^{I}_{i,0}$ (or $L^{D}_{i,0}$) denotes the first grayscale (or depth) image in the $(i+1)$th octave.
For $i = 0$, $L^{I}_{0,0}$ (or $L^{D}_{0,0}$) is calculated from the original grayscale (depth) frame
via bilinear interpolation, and the size of $L^{I}_{0,0}$ is twice the size of the original frame. For
$i \ge 1$, $L^{I}_{i,0}$ (or $L^{D}_{i,0}$) is down-sampled from $L^{I}_{i-1,s}$ (or $L^{D}_{i-1,s}$)
by taking every second pixel in each row and column. In Figure 3(a), the blue arrow shows that the
first image $L^{I}_{1,0}$ in the second octave is down-sampled from the third image $L^{I}_{0,2}$ in
the first octave.
$\ast$ is the convolution operation; $G(x,y,k^{j}\sigma) = \frac{1}{2\pi (k^{j}\sigma)^{2}}\, e^{-(x^{2}+y^{2})/(2(k^{j}\sigma)^{2})}$
is a Gaussian function with variable scale; $\sigma$ is the initial smoothing parameter of the Gaussian
function and $k = 2^{1/s}$ (Lowe, 2004).
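To make the construction concrete, here is a minimal Python sketch of Equation (1) using OpenCV.
The function name build_gaussian_pyramid, the file names, and the default value sigma = 1.6 are our
own illustrative choices, not details specified in the paper:

    import cv2
    import numpy as np

    def build_gaussian_pyramid(frame, n_octaves=4, s_intervals=2, sigma=1.6):
        """Build a Gaussian pyramid per Equation (1): n octaves of s + 3 images each."""
        k = 2.0 ** (1.0 / s_intervals)
        # L_{0,0}: the original frame enlarged to twice its size (bilinear interpolation).
        base = cv2.resize(frame, None, fx=2.0, fy=2.0,
                          interpolation=cv2.INTER_LINEAR)
        pyramid = []
        for i in range(n_octaves):
            # L_{i,j} = G(x, y, k^j * sigma) * L_{i,0}; j = 0 is the base image itself.
            octave = [base] + [cv2.GaussianBlur(base, (0, 0), sigmaX=(k ** j) * sigma)
                               for j in range(1, s_intervals + 3)]
            pyramid.append(octave)
            # L_{i+1,0}: down-sample L_{i,s} by taking every second pixel per row/column.
            base = octave[s_intervals][::2, ::2]
        return pyramid

    # One pyramid for the grayscale frame and one for the depth frame
    # (depth values assumed already normalized to [0, 255]; file names are hypothetical).
    gray = cv2.cvtColor(cv2.imread("rgb_frame.png"), cv2.COLOR_BGR2GRAY).astype(np.float32)
    depth = cv2.imread("depth_frame.png", cv2.IMREAD_GRAYSCALE).astype(np.float32)
    L_I = build_gaussian_pyramid(gray)
    L_D = build_gaussian_pyramid(depth)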
Then, the difference of Gaussian (DoG) images, $Df$, are calculated from the difference of
two nearby scales via Equation (2):
$$
Df_{i,j} = L^{I}_{i,j+1} - L^{I}_{i,j}, \qquad 0 \le i < n,\; 0 \le j < s+2. \tag{2}
$$
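Continuing the sketch above, Equation (2) amounts to differencing adjacent scales within each
octave; build_dog_pyramid is again a hypothetical helper name:

    def build_dog_pyramid(gaussian_pyramid):
        """Df_{i,j} = L_{i,j+1} - L_{i,j}; each octave of s + 3 images yields s + 2 DoG images."""
        return [[octave[j + 1] - octave[j] for j in range(len(octave) - 1)]
                for octave in gaussian_pyramid]

    Df = build_dog_pyramid(L_I)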
We give an example to intuitively understand the Gaussian pyramid and the DoG pyramid. Figure
3 shows two Gaussian pyramids ($L^{I}_{t}$, $L^{I}_{t+1}$) built from two consecutive grayscale
frames and two depth Gaussian pyramids ($L^{D}_{t}$, $L^{D}_{t+1}$) built from the corresponding
depth frames. In this example, the number of octaves is $n = 4$ and the number of intervals is
$s = 2$; therefore, for each frame, we build $s + 3 = 5$ images per octave. We can also see that a
larger $k^{j}\sigma$ results in a more blurred image (see the enlarged portion of the red rectangle
in Figure 3). Then, we use the Gaussian pyramid shown in Figure 3(a) to build the DoG pyramid via
Equation (2), which is shown in Figure 4.
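Under the settings of this example, the hypothetical sketches above would reproduce these counts:

    # n = 4 octaves, s = 2 intervals: s + 3 = 5 Gaussian images and
    # s + 2 = 4 DoG images per octave.
    L_I = build_gaussian_pyramid(gray, n_octaves=4, s_intervals=2)
    Df = build_dog_pyramid(L_I)
    assert all(len(octave) == 5 for octave in L_I)
    assert all(len(octave) == 4 for octave in Df)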
Building Optical Flow Pyramid. First, we briefly review the Lucas-Kanade method (Lucas
and Kanade, 1981), which is widely used in computer vision. The method assumes that the
displacement between two consecutive frames is small and approximately constant within a
neighborhood of a point $\rho$. The two consecutive frames are denoted by $F_1$ and $F_2$ at time
$t$ and $t + 1$, respectively. Then
1. The depth values are normalized to [0, 255] in depth videos.