ZHANG et al.: ROBUST HEAD TRACKING BASED ON MULTIPLE CUES FUSION IN KERNEL-BAYESIAN FRAMEWORK 1199
and illumination changes, and to prevent the model from
drifting away.
The rest of this paper is organized as follows. A brief review
of kernel-based and Bayesian-based tracking frameworks is
given in Section II. The kernel-Bayesian framework is described
in detail in Section III. The multiple cues fusion-based
similarity measure and its application in the kernel-Bayesian
framework are discussed in Section IV. Experimental results
are presented in Section V, and Section VI concludes the paper.
II. Review of Kernel-Based and Bayesian-Based Frameworks
In this section, we briefly review two typical tracking
frameworks: the kernel-based framework and the Bayesian-based
framework.
A. Kernel-Based Framework
The best-known kernel-based framework, namely the mean
shift algorithm, first appeared in [23] as a method for estimating
the gradient of a density function. It was applied to visual
tracking by Comaniciu et al. [3] in 2000.
Mean shift is a nonparametric mode-seeking technique that
shifts each data point to the average of the data points in
its neighborhood [23]. Let R be a finite set embedded in
n-dimensional space. The mean shift vector ms of x is defined
as follows:
$$\mathrm{ms} = \frac{\sum_{a} K(a - x)\, w(a)\, a}{\sum_{a} K(a - x)\, w(a)} - x, \quad a \in R \tag{1}$$
where K is a kernel function and w is a weight function. The
mean shift algorithm works by iteratively shifting the data in
the direction of the mean shift vector until convergence.
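As an illustration of (1), the following is a minimal sketch of the mean shift vector and the iterative shifting procedure. The Gaussian kernel, the `bandwidth` parameter, the stopping tolerance, and all function names are assumptions made for this example; the text above does not fix a particular kernel or stopping rule.

```python
import math


def gaussian_kernel(diff, bandwidth=1.0):
    """Assumed kernel K: isotropic Gaussian evaluated on the difference a - x."""
    sq_norm = sum(c * c for c in diff)
    return math.exp(-0.5 * sq_norm / bandwidth ** 2)


def mean_shift_vector(x, points, weights, bandwidth=1.0):
    """Eq. (1): kernel-weighted mean of the points in R minus x."""
    numerator = [0.0] * len(x)
    denominator = 0.0
    for a, w in zip(points, weights):
        k = gaussian_kernel([ai - xi for ai, xi in zip(a, x)], bandwidth) * w
        denominator += k
        for i, ai in enumerate(a):
            numerator[i] += k * ai
    if denominator == 0.0:
        # No point has support under the kernel: do not move.
        return [0.0] * len(x)
    return [n / denominator - xi for n, xi in zip(numerator, x)]


def mean_shift(x, points, weights, bandwidth=1.0, tol=1e-6, max_iter=100):
    """Shift x along the mean shift vector until the shift is smaller than tol."""
    x = list(x)
    for _ in range(max_iter):
        ms = mean_shift_vector(x, points, weights, bandwidth)
        x = [xi + mi for xi, mi in zip(x, ms)]
        if math.sqrt(sum(m * m for m in ms)) < tol:
            break
    return x
```

With a symmetric cluster of points around the origin and uniform weights, calling `mean_shift` from a nearby starting point moves the estimate to the mode at the origin.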
B. Bayesian-Based Framework
Another popular approach is to view tracking as an online
Bayesian inference process for estimating the unknown state
$s_t$ from sequential observations $o_{1:t}$ perturbed by noise. A
dynamic state-space form employed in the Bayesian inference
framework is shown as follows [27]:
$$\text{state transition model}: \quad s_t = f_t(s_{t-1}, \epsilon_t) \tag{2}$$
$$\text{observation model}: \quad o_t = h_t(s_t, \nu_t) \tag{3}$$
where $s_t$ and $o_t$ are the system state and observation, $\epsilon_t$ and $\nu_t$ are the
system noise and observation noise, respectively, $f_t(\cdot, \cdot)$ characterizes
the kinematics of the object, and $h_t(\cdot, \cdot)$ models the
observations. The key idea of Bayesian inference is to approximate
the posterior probability distribution by a weighted sample
set $\{(s^{(n)}, w^{(n)}) \mid n = 1, \ldots, N\}$. Each sample consists of an
element $s^{(n)}$ that represents the hypothetical state of an object
and a corresponding discrete sampling probability $w^{(n)}$, where
$\sum_{n=1}^{N} w^{(n)} = 1$. First, the sample set is resampled to avoid
the degeneracy problem, and the new samples are propagated
according to the state transition model. Then, each element of
the set is weighted with probability $w^{(n)} = p(o_t \mid s_t^{(n)})$, which
is calculated from the observation model. Finally, the state
estimate $\hat{s}_t$ can be either the minimum mean square error
estimate or the maximum a posteriori (MAP) estimate.
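The resample, propagate, weight, and estimate steps described above can be sketched for a scalar state as follows. This is only an illustrative sketch under stated assumptions: the random-walk transition, the additive Gaussian observation model, multinomial resampling, the parameters `process_std` and `obs_std`, and the use of the highest-weight sample as an approximate MAP estimate are all choices made for this example, not details given in the text.

```python
import math
import random


def resample(samples, weights, rng):
    """Multinomial resampling: draw N samples with probability given by the weights."""
    return rng.choices(samples, weights=weights, k=len(samples))


def particle_filter_step(samples, weights, observation, rng,
                         process_std=1.0, obs_std=1.0):
    """One iteration: resample, propagate, weight, then estimate the state."""
    # 1) Resample to counter weight degeneracy.
    samples = resample(samples, weights, rng)
    # 2) Propagate with an assumed random-walk transition: s_t = s_{t-1} + noise.
    samples = [s + rng.gauss(0.0, process_std) for s in samples]
    # 3) Weight each sample by an assumed Gaussian likelihood p(o_t | s_t).
    lik = [math.exp(-0.5 * ((observation - s) / obs_std) ** 2) for s in samples]
    total = sum(lik)
    if total == 0.0:
        # All likelihoods underflowed: fall back to uniform weights.
        weights = [1.0 / len(samples)] * len(samples)
    else:
        weights = [l / total for l in lik]
    # 4) MMSE estimate: weighted mean of the samples.
    mmse = sum(w * s for w, s in zip(weights, samples))
    # Approximate MAP estimate: the sample carrying the largest weight.
    map_est = max(zip(weights, samples))[1]
    return samples, weights, mmse, map_est
```

Calling `particle_filter_step` repeatedly with the same observation concentrates the samples around that observation, and both estimates move toward it.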
III. Kernel-Bayesian-Based Framework
The kernel-based framework has a low computational complexity,
but it is often trapped in local optima. The Bayesian-based
framework can improve the robustness of the tracking
process, but it suffers from a large computational load because it generates
a huge number of hypotheses to cover the global optimum.
Thus, in this section, we propose a kernel-Bayesian tracking
framework that combines the merits of both frameworks.
A. Kernel-Bayesian Framework
The state transition model is an important component of
the Bayesian tracking framework. Most of the existing models
use a naive random walk around previous system states [28]
or are learned from prelabeled video sequences [2]. The random
walk approaches do not use information about the object
motion, and thus involve a large computational load since
a large number of hypotheses need to be randomly generated to
cover the object. The learning-based approaches often suffer
from overfitting, so they are only effective for the training
sequences.
The kernel-based mean shift algorithm provides an estimate
of the object motion, which motivates us to embed the kernel
method into a Bayesian framework to provide a heuristic prior.
In detail, the mean shift algorithm is first applied to the current
frame to obtain the direction of the object motion and the offset
of the object state, which are then incorporated into the state
transition model as prior information. In this way, the kernel-
based method and the Bayesian-based method are combined
into a unified framework.
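To make the idea concrete, the sketch below contrasts a plain random-walk transition with one that adds a motion offset, such as the offset produced by a mean shift step, before the noise is applied. The additive form, the isotropic Gaussian noise, and the function names are assumptions for illustration; the exact way the offset enters the state transition model is not specified in this section.

```python
import random


def propagate_random_walk(samples, noise_std, rng):
    """Random-walk transition: perturb each previous state with Gaussian noise."""
    return [[c + rng.gauss(0.0, noise_std) for c in s] for s in samples]


def propagate_with_kernel_prior(samples, offset, noise_std, rng):
    """Assumed kernel-guided transition: shift by the offset, then add noise.

    `offset` stands in for the state offset obtained from the kernel method.
    """
    return [
        [c + d + rng.gauss(0.0, noise_std) for c, d in zip(s, offset)]
        for s in samples
    ]
```

Because the hypotheses are centered on the shifted state rather than on the previous state, the same number of samples can use a smaller noise spread and still cover the object.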
B. Optimization View
A reinterpretation of the kernel-Bayesian framework from
an optimization point of view is presented to show why this
framework can combine the merits of both the kernel method
and the Bayesian method.
The left column of Fig. 1 shows an input image with three
templates superimposed, corresponding to the initialization, the
local maximum, and the global maximum, and the right column
shows its likelihood function based on the spatial-constraint
MOG-based appearance model. As shown in Fig. 1, starting from
the initial position, the kernel method converges to a local
maximum that is near the global maximum. A few hypotheses
generated around the local maximum are therefore enough to guide
the algorithm to the global maximum. If the hypotheses were
instead generated around the initial position, many more would
be needed in order to reach the object.
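The argument can be illustrated with a toy one-dimensional likelihood. Everything in this sketch is hypothetical: the peak positions and widths, the evenly spaced hypothesis grid (used instead of random sampling so that the outcome is deterministic), and the function names are chosen only to show the effect and are unrelated to the likelihood in Fig. 1.

```python
import math


def likelihood(x):
    """Toy likelihood: global peak at x = 10, lower local peak at x = 8 (assumed)."""
    return (1.0 * math.exp(-0.5 * ((x - 10.0) / 0.5) ** 2)
            + 0.6 * math.exp(-0.5 * ((x - 8.0) / 0.4) ** 2))


def best_of_grid(center, radius, n):
    """Evaluate n evenly spaced hypotheses in [center - radius, center + radius]
    and return the one with the highest likelihood."""
    step = 2.0 * radius / (n - 1)
    hypotheses = [center - radius + i * step for i in range(n)]
    return max(hypotheses, key=likelihood)


# Seven hypotheses around the local maximum (x = 8) already include the global peak.
near_local = best_of_grid(8.0, 3.0, 7)
# Seven hypotheses around the initial position (x = 0) stay far from both peaks.
near_init_small = best_of_grid(0.0, 3.0, 7)
# At the same spacing, 21 hypotheses are needed from x = 0 to cover the global peak.
near_init_large = best_of_grid(0.0, 10.0, 21)
```

In this toy setting the search started at the local maximum needs one third of the hypotheses that the search started at the initial position needs for the same grid spacing.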
IV. Proposed Tracking Algorithm
In our work, the motion state of a tracked object between
two consecutive frames is approximated by a set of affine