random forest which is composed of a series of GA search-based binary decision trees, as shown in the flowchart in the red box. In addition to the MC-based recognition, another scheme based on S-T correlation matching (within the blue box) is adopted. Specifically, we first describe each STIP by a descriptor that contains three parts: the PCA of the original image patch, a HOG, and the distribution of nearby STIPs. We then cluster the STIPs with the k-means algorithm, so that each video can be described by a series of STIP occurrence sequences that serve as a template of that video. Finally, the spatial correlation score between two videos is calculated within the MC framework in a way similar to the "histogram intersection kernel", whereas the temporal correlation score is calculated by a biological sequence matching algorithm, the Needleman–Wunsch algorithm [59]. Experiments on the UT-Interaction dataset demonstrate that the MC- and S-T correlation-based methods each work well separately, and that their combination outperforms other common machine learning methods and most state-of-the-art works. The details of "fusion" in Fig. 1 are discussed in Sect. 4.4.
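To make the two correlation scores concrete, the sketch below (a minimal Python illustration, not the paper's implementation) pairs a histogram-intersection-style spatial score with a Needleman–Wunsch alignment over sequences of STIP cluster labels; the match, mismatch, and gap scores are hypothetical placeholders.

```python
# Minimal sketch (not the paper's implementation) of the two
# correlation scores used by the S-T matching scheme.

def histogram_intersection(h1, h2):
    """Spatial score in the style of the histogram intersection kernel."""
    return sum(min(a, b) for a, b in zip(h1, h2))

def needleman_wunsch(seq_a, seq_b, match=1.0, mismatch=-1.0, gap=-1.0):
    """Temporal score: global alignment of two STIP-cluster label
    sequences (match/mismatch/gap values are hypothetical placeholders)."""
    n, m = len(seq_a), len(seq_b)
    # score[i][j]: best alignment score of seq_a[:i] against seq_b[:j]
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = score[i - 1][j - 1] + (
                match if seq_a[i - 1] == seq_b[j - 1] else mismatch)
            score[i][j] = max(diag, score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    return score[n][m]

# Usage: sequences of k-means cluster labels observed over time
print(needleman_wunsch([3, 1, 4, 1, 5], [3, 1, 1, 5]))  # -> 3.0
```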
4 Approach
4.1 STIP-based mid-level feature extraction
4.1.1 STIP extraction based on voxel variance
Numerous studies [30, 45, 60] have confirmed the superi-
ority of Dollar’s STIPs over Laptev’s counterparts. How-
ever, Dollar’s method constructs motion saliency maps by
2-D spatial Gaussian filtering and 1-D temporal Gabor
filtering, which still has considerable computational load,
especially when the video volume is large. Here, we use an
even more straightforward method presented in [31] to
extract STIPs. A sliding window is used to calculate the
motion saliency maps from groups of frames within the
window. As shown in Fig. 2, each pixel value of the mo-
tion saliency map (corresponding to the center frame of the
window) is just the variance of the voxel values in the same
location of a group of frames within the window. As
pointed out by [31], the sliding window size plays an im-
portant role: too many frames in a group will blur the
saliency map and make it difficult to distinguish even between "walk" and "run". An empirical choice for the window size is 5–10 frames, and we choose 7 in our experiments.
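The following is a minimal sketch of this computation, assuming `frames` is a (T, H, W) grayscale array; the window size of 7 follows the choice above, and border frames without a full window are simply left blank.

```python
import numpy as np

def motion_saliency_maps(frames, window=7):
    """Per-pixel temporal variance over a sliding window of frames.
    `frames` is assumed to be a (T, H, W) grayscale array; border
    frames without a full window are left as zeros."""
    half = window // 2
    maps = np.zeros(frames.shape, dtype=np.float64)
    for t in range(half, len(frames) - half):
        group = frames[t - half : t + half + 1]  # frames inside the window
        maps[t] = group.var(axis=0)              # voxel variance per pixel
    return maps
```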
The STIPs are extracted by finding the local maxima of
the saliency maps. We use the threshold in Eq. (1) to detect local maxima (non-maximum suppression is applied to avoid STIPs that are too close together):

\[
\text{threshold} = \text{mean} + (\text{max} - \text{mean}) \times 0.005 \tag{1}
\]
where mean and max correspond to the mean and max-
imum of the pixel values of all the saliency maps in a
video. We also compare such STIPs with Dollar's counterparts (with thresholds generated in the same way) and find that they have similar densities, whereas the former are much faster to compute (examples are given in Fig. 3).
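A sketch of the corresponding detection step is given below, reusing the saliency maps from the previous sketch; the greedy suppression radius is a hypothetical parameter, since the text above does not specify how close is "too close".

```python
import numpy as np

def detect_stips(maps, radius=10):
    """Detect STIPs on the saliency maps using the global threshold of
    Eq. (1); a greedy loop keeps the strongest responses and drops any
    within `radius` pixels of an already kept point (the radius is a
    hypothetical parameter)."""
    thr = maps.mean() + (maps.max() - maps.mean()) * 0.005  # Eq. (1)
    stips = []
    for t, smap in enumerate(maps):
        ys, xs = np.nonzero(smap > thr)
        order = np.argsort(smap[ys, xs])[::-1]  # strongest responses first
        kept = []
        for i in order:
            y, x = int(ys[i]), int(xs[i])
            if all((y - ky) ** 2 + (x - kx) ** 2 > radius ** 2
                   for ky, kx in kept):
                kept.append((y, x))
        stips.extend((t, y, x) for y, x in kept)
    return stips
```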
4.1.2 Motion context (MC)
The motion context (MC) feature, which captures global information about motion and shape, is used to train a random forest. The idea of MC comes from the "shape context (SC)" [51], which uses a log-polar diagram (centered at a reference edge point) to measure the distribution of the edge points of an object. Similarly, MC also uses a log-polar diagram, but measures the distribution of STIPs rather than edge points. An MC descriptor, which is also a histogram, can be constructed from each frame. In practice, however, we discard frames with fewer than 30 STIPs, thus avoiding overly sparse histograms that correspond to frames without obvious motion. As depicted in Fig. 4, we use a log-polar diagram containing 24 sub-regions to generate a 24-D histogram called the MC descriptor.
The diagram's center $(c_x, c_y)$ and diameter $D$ are determined by

\[
\begin{cases}
(c_x, c_y) = \left( \dfrac{x_{\min} + x_{\max}}{2},\ \dfrac{y_{\min} + y_{\max}}{2} \right) \\[6pt]
D = g \cdot \max(x_{\max} - x_{\min},\ y_{\max} - y_{\min})
\end{cases} \tag{2}
\]
where $x_{\min}$, $x_{\max}$, $y_{\min}$ and $y_{\max}$ denote the extrema of all the STIPs' coordinates in the current frame, and the coefficient $g$ ($g = 1.2$) is used to enlarge $D$ so that it covers most STIPs.
Specifically, the ratio of the three radial intervals of the log-polar diagram is $1 : \ln 3 : \ln^2 3$. Similar to [31], we define the MC's main orientation as the fan sector containing the most STIPs. To ensure the invariance of the MC feature under mirrored motions, we align the MCs so that their main orientations always lie on the right side (Fig. 5).
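The sketch below assembles these pieces into a 24-D MC descriptor, assuming the 24 sub-regions split into 8 angular sectors by 3 radial rings (consistent with Fig. 4, but an assumption on our part), with radial interval widths in the stated $1 : \ln 3 : \ln^2 3$ ratio.

```python
import numpy as np

def motion_context(stips, g=1.2, n_ang=8, n_rad=3):
    """24-D MC descriptor for one frame. `stips` holds the (x, y)
    coordinates of the frame's STIPs; `g` is the coefficient of Eq. (2).
    The 8 x 3 sector/ring split is an assumption consistent with Fig. 4."""
    pts = np.asarray(stips, dtype=np.float64)
    cx = (pts[:, 0].min() + pts[:, 0].max()) / 2    # Eq. (2): center
    cy = (pts[:, 1].min() + pts[:, 1].max()) / 2
    D = g * max(pts[:, 0].max() - pts[:, 0].min(),  # Eq. (2): diameter
                pts[:, 1].max() - pts[:, 1].min())
    dx, dy = pts[:, 0] - cx, pts[:, 1] - cy
    r = np.hypot(dx, dy)
    theta = np.arctan2(dy, dx) % (2 * np.pi)
    # Ring edges with interval widths in the ratio 1 : ln 3 : ln^2 3
    iv = np.array([1.0, np.log(3), np.log(3) ** 2])
    edges = (D / 2) * np.cumsum(iv) / iv.sum()
    rad_bin = np.minimum(np.searchsorted(edges, r), n_rad - 1)
    ang_bin = (theta // (2 * np.pi / n_ang)).astype(int) % n_ang
    hist = np.zeros((n_ang, n_rad))
    for a, b in zip(ang_bin, rad_bin):
        hist[a, b] += 1
    # Align the main orientation (sector with most STIPs) to the right
    # side (sector 0), as the paper does for mirror invariance
    hist = np.roll(hist, -int(hist.sum(axis=1).argmax()), axis=0)
    return hist.ravel()                             # 24-D MC descriptor
```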
Fig. 2 [31] Illustration of motion saliency map calculation based on voxel variance