mode-seeking algorithm similar to the mean shift [30] to generate
a naïve proposal when there is no detection, which further im-
proves the performance of the BPF even when the Adaboost
detections are occasionally sparse. Lastly, we use SPPCA to update the shape
template of the tracker, while [5] did not update their observa-
tional model and used only the initialized model as their template.
The work most similar to ours is that of Giebel et al. [31]. They pre-
sented a system that can track and detect pedestrians using a cam-
era mounted on a moving car. Their tracker combines texture,
shape, and depth information in their observation likelihood. The
texture is encoded by the color histogram, the shape is represented
by a Point Distribution Model (PDM) [32], and the depth informa-
tion is provided by the stereo system. In order to capture more
variations of the shape, they constructed multiple Eigen-subspaces
from the training data, and the transition probabilities between sub-
spaces were also learned. At run time, they used a Particle Fil-
ter to estimate the posterior distribution of the hidden variables of the
tracking model. To reduce the number of particles, they also used a smart
proposal distribution based on the detection results. Our tracker
shares the same merits. However, for template updating,
we infer the posterior distribution of the hidden variables using the Rao-
Blackwellized Particle Filter to increase the speed. In multi-target
tracking, we use the boosted particle filter that incorporates the
cascaded Adaboost detector to obtain fast and reliable detections.
2.2. Visual action recognition
The goal of visual action recognition is to classify the actions of
persons based on a video sequence. In this section, we briefly re-
view the literature related to our visual action recognition system.
For a more complete survey, please refer to the reviews of Gavrila
[3] and Hu et al. [4].
Freeman et al. [33] utilized global orientation histograms to en-
code the shapes of the hands, and used a nearest-neighbor classi-
fier to determine the gesture of the hands. In [34], they further
divided the images into cells, and computed the orientation histo-
grams of all cells. However, their approach determines the gesture
of the target only by the current posture of the person. No previous
posture information is used. Recently, Wang et al. [58] have ex-
tended this approach using a hierarchical model.
Efros et al. [7] employed a motion descriptor, the Decomposed
Optical Flow (DOF). The DOF descriptor can be constructed by
decomposing the optical flow of two consecutive frames into four
channels $(F_X^+, F_X^-, F_Y^+, F_Y^-)$, where $F_X^+$, $F_X^-$, $F_Y^+$, and $F_Y^-$ represent the
optical flow along the $X^+$, $X^-$, $Y^+$, and $Y^-$ directions, respectively.
They also presented a novel motion-to-motion similarity measure
that can handle actions of different speeds. A nearest-neighbor
classifier was used to determine the person’s actions.
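To make the decomposition concrete, the following minimal sketch (an illustration in Python/NumPy, not code from [7]; the function name and sign conventions are assumed here) half-wave rectifies a dense flow field given as per-pixel horizontal and vertical components u and v:

```python
import numpy as np

def decompose_flow(u, v):
    """Split a dense optical flow field (u, v) into the four half-wave
    rectified channels (F_X+, F_X-, F_Y+, F_Y-) of the DOF descriptor.
    u, v: 2D arrays of horizontal and vertical flow between two frames."""
    f_x_pos = np.maximum(u, 0.0)   # motion in the X+ direction
    f_x_neg = np.maximum(-u, 0.0)  # motion in the X- direction
    f_y_pos = np.maximum(v, 0.0)   # motion in the Y+ direction (image-coordinate convention)
    f_y_neg = np.maximum(-v, 0.0)  # motion in the Y- direction
    return f_x_pos, f_x_neg, f_y_pos, f_y_neg
```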
Wu [35] extended Efros et al. [7] by introducing another motion
descriptor, the Decomposed Image Gradients (DIG). The DIG
descriptor can be constructed by first computing the image gradi-
ents, and then decomposing them into
four channels $(G_X^+, G_X^-, G_Y^+, G_Y^-)$, where $G_X^+$, $G_X^-$, $G_Y^+$, and $G_Y^-$ represent
the image gradients along the $X^+$, $X^-$, $Y^+$, and $Y^-$ directions, respectively.
He also used a motion-to-motion similarity measure similar to that of Efros et al. A nearest-neighbor
classifier was again used to determine the person’s actions.
The problem of action recognition can also be formulated with a
generative probabilistic model. For example, [36] used Hidden
Markov Models (HMMs) to recognize the target’s action. In their
system, they trained separate HMMs for each action. The hidden
state of the HMM represents the appearance variations of the tar-
get, and the observations are either raw images or the target’s con-
tour. During recognition, they fed the entire video sequence to all
HMMs and the action of the target is determined by the HMM hav-
ing the maximum likelihood. In our previous work, we also em-
ployed HMMs to recognize the target’s actions [37,38]. Instead of
using the entire video sequence, we used a fixed-length sliding
window to determine the target’s actions.
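As a sketch of this classification rule (an illustration assuming discrete observation symbols, not the implementation of [36] or [37,38]; all names and parameter shapes below are assumptions), the action over a window is chosen by evaluating each per-action HMM’s log-likelihood with the forward algorithm and taking the maximizer:

```python
import numpy as np

def hmm_log_likelihood(obs, pi, A, B):
    """Log-likelihood log p(obs | HMM) via the scaled forward algorithm.
    obs: sequence of discrete observation symbols (length T)
    pi:  initial state distribution, shape (K,)
    A:   state transition matrix, shape (K, K)
    B:   emission matrix, shape (K, num_symbols)"""
    alpha = pi * B[:, obs[0]]
    log_lik = np.log(alpha.sum())
    alpha /= alpha.sum()
    for o in obs[1:]:
        alpha = (alpha @ A) * B[:, o]       # forward recursion
        log_lik += np.log(alpha.sum())      # accumulate scaling factors
        alpha /= alpha.sum()
    return log_lik

def classify_window(obs_window, hmms):
    """Pick the action whose HMM assigns maximum likelihood to the window.
    hmms: dict mapping action name -> (pi, A, B)."""
    return max(hmms, key=lambda a: hmm_log_likelihood(obs_window, *hmms[a]))
```

In a sliding-window setting, classify_window is simply applied to each fixed-length window of the observation stream rather than to the entire sequence.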
3. Observation models
Observation models encode the visual information of the tar-
get’s appearance. Since a single cue does not work in all cases,
many researchers have combined multiple cues for robust tracking
[31,39–41]. In this article, we utilize the Hue–Saturation–Value
(HSV) color histogram to capture the color information of the tar-
get, and the Histogram of Oriented Gradients (HOG) descriptors [6]
to encode the shape information.
3.1. Color
We encode the color information of the targets by a two-part
color histogram based on the Hue–Saturation–Value (HSV) color
histogram used in [25,5]. We use the HSV color histogram because
it decouples the intensity (i.e., Value) from the color (i.e., Hue and Sat-
uration), and it is therefore less sensitive to illumination effects
than the RGB color histogram. Exploiting the spatial
layout of the colors is also crucial because the jersey and
pants of hockey players usually have different colors [25,5].
Our color observation model is composed of a 2D histogram
based on Hue and Saturation and a 1D histogram based on value.
Both histograms are normalized such that all bins sum to
one. We assign the same number of bins for each color component,
i.e., $N_h = N_s = N_v = 10$, which results in an
$N_h \times N_s + N_v = 10 \times 10 + 10 = 110$-dimensional HSV histogram. Fig. 3 shows two in-
stances of the HSV color histograms.
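A minimal sketch of one way to compute such a descriptor follows (an illustration, not the exact code used in this system), assuming the patch has already been converted to HSV with all three channels scaled to [0, 1]:

```python
import numpy as np

def hsv_color_histogram(hsv_patch, n_bins=10):
    """Two-part HSV histogram: a 2D Hue-Saturation histogram plus a 1D
    Value histogram, each normalized to sum to one (10*10 + 10 = 110 dims).
    hsv_patch: (H, W, 3) array with H, S, V channels in [0, 1] (assumed)."""
    h = hsv_patch[..., 0].ravel()
    s = hsv_patch[..., 1].ravel()
    v = hsv_patch[..., 2].ravel()
    hs_hist, _, _ = np.histogram2d(h, s, bins=n_bins, range=[[0, 1], [0, 1]])
    v_hist, _ = np.histogram(v, bins=n_bins, range=(0, 1))
    hs_hist /= max(hs_hist.sum(), 1e-12)        # normalize 2D H-S part
    v_hist = v_hist / max(v_hist.sum(), 1e-12)  # normalize 1D V part
    return np.concatenate([hs_hist.ravel(), v_hist])
```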
3.2. Shape
We apply the Histogram of Oriented Gradients (HOG) descriptor
[6] to encode the shape information of the targets. The HOG
descriptor is computed by sampling a set of SIFT descriptors
[42] with a fixed spacing over the image patches. Combined with
a Support Vector Machine (SVM) classifier, the HOG descriptor
has been shown to be very successful in the state-of-the-art pedes-
trian detection system [6]. In this article, we employ the HOG
descriptor because it is robust under viewpoint and lighting
changes, possesses good discriminative power, and can be effi-
ciently computed.
The SIFT descriptor was originally introduced by Lowe [42] to
capture the appearance information centered on the detected SIFT
features. To compute the SIFT descriptor, we first resize the image
patch to a $p_w \times p_h$
patch and then smooth the image patch by a
Gaussian low-pass filter and compute the image gradients using a
$[-1, 0, 1]$ kernel. The original SIFT descriptor implementation [42]
rotated the directions of the gradients to align with the dominant ori-
entation of the SIFT features in order to obtain a rotation-invariant lo-
cal descriptor. In our case, however, we do not rotate the directions
of the gradients, because the dominant orientation provides cru-
cial information for the tracking and action recognition systems.
After computing the image gradients, we divide the image patch
into small spatial regions (“cells”), for each cell accumulating a local
1D histogram of gradient directions over the pixels of the cell. In this
article, we use the unsigned image gradient, and the orientation bins
are evenly spaced over 0–180° to make the descriptor more invari-
ant to the color of the players’ uniforms. For better invariance to
lighting changes, we normalize the local response by the total histo-
gram energy accumulated over all cells across the image patch.
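The per-cell accumulation and the global normalization can be sketched as follows (a simplified illustration, not the implementation of [6,42]; the cell size, the number of orientation bins, and the L2 reading of “histogram energy” are assumptions made for the example):

```python
import numpy as np

def cell_orientation_histograms(patch, cell_size=8, n_bins=8):
    """Unsigned-gradient orientation histograms per cell, normalized by the
    total histogram energy accumulated over the whole patch.
    patch: 2D grayscale array, assumed already resized and Gaussian-smoothed."""
    patch = np.asarray(patch, dtype=float)
    # Image gradients with the centered [-1, 0, 1] mask.
    gx = np.zeros_like(patch)
    gy = np.zeros_like(patch)
    gx[:, 1:-1] = patch[:, 2:] - patch[:, :-2]
    gy[1:-1, :] = patch[2:, :] - patch[:-2, :]
    mag = np.hypot(gx, gy)
    # Unsigned orientation: angles folded into [0, 180) degrees.
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0
    n_cy, n_cx = patch.shape[0] // cell_size, patch.shape[1] // cell_size
    hist = np.zeros((n_cy, n_cx, n_bins))
    for cy in range(n_cy):
        for cx in range(n_cx):
            sl = (slice(cy * cell_size, (cy + 1) * cell_size),
                  slice(cx * cell_size, (cx + 1) * cell_size))
            # Accumulate gradient magnitude into evenly spaced orientation bins.
            b = np.minimum((ang[sl] / 180.0 * n_bins).astype(int), n_bins - 1)
            np.add.at(hist[cy, cx], b.ravel(), mag[sl].ravel())
    # Normalize by the total histogram energy over all cells (L2 interpretation).
    return hist / max(np.sqrt((hist ** 2).sum()), 1e-12)
```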
The HOG descriptor is constructed by uniformly sampling the
SIFT descriptor of the same size over the image patch with a fixed