336 Front. Comput. Sci., 2020, 14(2): 334–348
a part-based CF tracking approach that models the target with multiple parts using several CFs. Lukezic et al. [26] model the part-based CF responses and their constellation constraints jointly as an equivalent spring system, and derive a highly efficient optimization approach to infer the most probable target deformation.
DCF-based trackers usually suffer from boundary effects, which limit the discriminative power of the learned CFs. To address this issue, Danelljan et al. [27] reformulate the CF objective by introducing a spatially Gaussian weight function that penalizes non-zero filter values outside the object bounding box. Mueller et al. [14] present a framework that explicitly incorporates surrounding context information into CF learning. Unlike the above-mentioned methods, which train a CF on many virtual circulant samples, Galoogahi et al. [15] exploit the whole frame to obtain a set of real negative samples, which facilitates learning a better classifier.
2.2 Manifold regularized tracking
Manifold regularization is commonly applied in semi-supervised learning with both labeled and unlabeled samples [28–30]. It constructs a graph Laplacian over the samples to exploit the hidden geometric structure of the feature space. For example, in feature analysis, Chang and Yang [29] exploit both labeled and unlabeled training data to obtain a more reliable feature selection algorithm. In visual tracking, Yu et al. [30] leverage the manifold structure of the appearance space with spatio-temporal constraints to perform robust person localization and tracking in real-world surveillance scenarios. Bai and Tang [31] employ an online Laplacian-regularized ranking support vector machine to estimate the object location. To make better use of unlabeled data and the manifold structure of the sample space, Hu et al. [32] propose a manifold-regularized DCF-based tracker with augmented circularly shifted samples, together with a block optimization strategy that can be computed efficiently via FFTs. Zhuang et al. [33] construct a discriminative sparse similarity map for visual tracking based on a Laplacian-regularized multitask reverse sparse representation.
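A minimal sketch of the graph-Laplacian smoothness penalty that manifold-regularized methods of this kind build on. The heat-kernel affinity, the bandwidth `sigma`, and the function names are illustrative assumptions, not the construction used in [28–33]; the code only checks the standard identity $f^T L f = \frac{1}{2}\sum_{i,j} W_{ij}(f_i - f_j)^2$, which explains why the penalty favors assigning similar scores to nearby samples.

```python
import numpy as np


def gaussian_affinity(samples: np.ndarray, sigma: float) -> np.ndarray:
    """Dense heat-kernel affinity W_ij = exp(-||a_i - a_j||^2 / (2 sigma^2)) with zero diagonal."""
    diffs = samples[:, None, :] - samples[None, :, :]
    W = np.exp(-np.sum(diffs ** 2, axis=-1) / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W


def graph_laplacian(W: np.ndarray) -> np.ndarray:
    """Unnormalized graph Laplacian L = D - W, where D holds the node degrees."""
    return np.diag(W.sum(axis=1)) - W


def manifold_penalty(f: np.ndarray, L: np.ndarray) -> float:
    """Smoothness penalty f^T L f on the scores f assigned to the graph nodes."""
    return float(f @ L @ f)


rng = np.random.default_rng(0)
samples = rng.normal(size=(6, 3))  # 6 samples (labeled or unlabeled) with 3-D features
W = gaussian_affinity(samples, sigma=1.0)
L = graph_laplacian(W)
f = rng.normal(size=6)             # scores that some classifier assigns to the samples

# f^T L f equals 0.5 * sum_ij W_ij (f_i - f_j)^2: it grows when strongly
# connected (similar) samples receive different scores.
pairwise = 0.5 * np.sum(W * (f[:, None] - f[None, :]) ** 2)
print(np.isclose(manifold_penalty(f, L), pairwise))  # True
```

Unlabeled samples enter only through $W$, which is how they can shape the classifier without carrying labels.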
3 Manifold regularized context-aware correlation tracking
We first review the context-aware CF tracking approach [14], which is most closely related to our MRCT, and then describe the principle of MRCT in detail.
3.1 Context-aware correlation tracking
In [14], a set of $k$ contextual patches $x_i \in \mathbb{R}^{s}$ ($i = 1, \dots, k$) is extracted around the tracked object $x_0 \in \mathbb{R}^{s}$; the corresponding circulant matrices are $X_i \in \mathbb{R}^{s \times s}$ and $X_0 \in \mathbb{R}^{s \times s}$. These contextual patches serve as negative samples with zero labels. The aim is to learn a filter $w \in \mathbb{R}^{s}$ that gives the target a high score and the surrounding area a low one. This leads to the objective function
$$
\min_{w}\; \|X_0 w - y\|_2^2 + \lambda \|w\|_2^2 + \lambda_1 \sum_{i=1}^{k} \|X_i w\|_2^2, \tag{1}
$$
where $y$ is the vectorized 2D Gaussian regression target and $\lambda, \lambda_1 > 0$ are regularization parameters.
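A minimal NumPy sketch that evaluates Eq. (1), assuming 1-D signals as stand-ins for the vectorized patches and taking each circulant matrix to have $x$ as its first column ($X[i, j] = x[(i - j) \bmod s]$). Some CF papers build the matrix from row shifts instead, which amounts to a transpose. The helpers `circulant`, `gaussian_target`, and `context_aware_objective` are hypothetical names.

```python
import numpy as np


def circulant(x: np.ndarray) -> np.ndarray:
    """Circulant matrix with first column x: C[i, j] = x[(i - j) mod n]."""
    n = x.shape[0]
    idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
    return x[idx]


def gaussian_target(n: int, sigma: float) -> np.ndarray:
    """1-D Gaussian regression target peaked at index 0, using circular distance."""
    i = np.arange(n)
    d = np.minimum(i, n - i)
    return np.exp(-(d ** 2) / (2.0 * sigma ** 2))


def context_aware_objective(w, x0, contexts, y, lam, lam1):
    """Eq. (1): ||X0 w - y||^2 + lam ||w||^2 + lam1 * sum_i ||Xi w||^2."""
    data_term = np.sum((circulant(x0) @ w - y) ** 2)
    ridge_term = lam * np.sum(w ** 2)
    # Contextual patches act as negatives: their filter responses are pushed toward zero.
    context_term = lam1 * sum(np.sum((circulant(xi) @ w) ** 2) for xi in contexts)
    return float(data_term + ridge_term + context_term)


rng = np.random.default_rng(0)
s, k = 16, 4
x0 = rng.normal(size=s)                            # 1-D stand-in for the target patch
contexts = [rng.normal(size=s) for _ in range(k)]  # k contextual patches
y = gaussian_target(s, sigma=2.0)
w = rng.normal(size=s)
print(context_aware_objective(w, x0, contexts, y, lam=1e-2, lam1=0.5))
```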
To solve the convex problem in Eq. (1) efficiently, the contextual patches and the regression target can be stacked as follows:
$$
B = \begin{bmatrix} X_0 \\ \sqrt{\lambda_1}\, X_1 \\ \vdots \\ \sqrt{\lambda_1}\, X_k \end{bmatrix}, \qquad
\bar{y} = \begin{bmatrix} y \\ 0 \\ \vdots \\ 0 \end{bmatrix}. \tag{2}
$$
Then Eq. (1) can be rewritten as
$$
\min_{w}\; \|B w - \bar{y}\|_2^2 + \lambda \|w\|_2^2. \tag{3}
$$
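The equivalence between Eq. (1) and the stacked form of Eqs. (2) and (3) can be checked numerically. The sketch below again assumes 1-D signals, first-column circulant matrices, and hypothetical helper names (`circulant`, `stack_problem`, `stacked_objective`); it builds $B$ and $\bar{y}$ and compares both objective values for a random filter.

```python
import numpy as np


def circulant(x: np.ndarray) -> np.ndarray:
    """Circulant matrix with first column x: C[i, j] = x[(i - j) mod n]."""
    n = x.shape[0]
    idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
    return x[idx]


def stack_problem(x0, contexts, y, lam1):
    """Eq. (2): B stacks X0 over sqrt(lam1) * X_i; y_bar pads y with k blocks of zeros."""
    blocks = [circulant(x0)] + [np.sqrt(lam1) * circulant(xi) for xi in contexts]
    B = np.vstack(blocks)  # shape ((k + 1) * s, s)
    y_bar = np.concatenate([y, np.zeros(len(contexts) * y.shape[0])])
    return B, y_bar


def stacked_objective(w, B, y_bar, lam):
    """Eq. (3): ||B w - y_bar||^2 + lam ||w||^2."""
    return np.sum((B @ w - y_bar) ** 2) + lam * np.sum(w ** 2)


rng = np.random.default_rng(0)
s, k, lam, lam1 = 16, 3, 1e-2, 0.5
x0 = rng.normal(size=s)
contexts = [rng.normal(size=s) for _ in range(k)]
y = rng.normal(size=s)
w = rng.normal(size=s)

B, y_bar = stack_problem(x0, contexts, y, lam1)
# Eq. (1) evaluated term by term, for comparison.
direct = (np.sum((circulant(x0) @ w - y) ** 2) + lam * np.sum(w ** 2)
          + lam1 * sum(np.sum((circulant(xi) @ w) ** 2) for xi in contexts))
print(B.shape, np.isclose(direct, stacked_objective(w, B, y_bar, lam)))  # (64, 16) True
```

The zero blocks of $\bar{y}$ are what turn the context penalties into ordinary least-squares residuals, so Eq. (3) is a standard ridge regression.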
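As in traditional CF learning [7], Eq. (3) has a closed-form solution, $w = (B^{T} B + \lambda I)^{-1} B^{T} \bar{y}$, obtained by setting its gradient to zero, where $B^{T} B = X_0^{T} X_0 + \lambda_1 \sum_{i=1}^{k} X_i^{T} X_i$ and $B^{T} \bar{y} = X_0^{T} y$. This solution can be computed efficiently in the Fourier domain, and the key to doing so is the following property of circulant matrices: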
$$
X = F\,\mathrm{diag}(\hat{x})\,F^{H}, \qquad X^{T} = F\,\mathrm{diag}(\hat{x}^{*})\,F^{H}, \tag{4}
$$
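where $\hat{x}$ denotes the Fourier transform $F^{H} x$ of $x$, and $x^{*}$ denotes the complex conjugate of $x$. Because all circulant matrices are diagonalized by the same matrix $F$, the solution can then be computed element-wise in the Fourier domain:

$$
\hat{w} = \frac{\hat{x}_0^{*} \odot \hat{y}}{\hat{x}_0^{*} \odot \hat{x}_0 + \lambda + \lambda_1 \sum_{i=1}^{k} \hat{x}_i^{*} \odot \hat{x}_i}, \tag{5}
$$

where $\odot$ and the fraction bar denote element-wise multiplication and division, respectively.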
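The following sketch checks Eqs. (4) and (5) numerically under stated assumptions: NumPy's unnormalized FFT, so that $\hat{x}$ = `np.fft.fft(x)` equals $\sqrt{s}\,F^{H}x$ for the unitary matrix $F$ (rather than $F^{H}x$ exactly); 1-D signals; and first-column circulant matrices. With these conventions Eq. (5) holds exactly, with no rescaling of $\lambda$. `train_filter_fourier` and `train_filter_spatial` are hypothetical names, and the direct solve serves only as a reference.

```python
import numpy as np


def circulant(x: np.ndarray) -> np.ndarray:
    """Circulant matrix with first column x: C[i, j] = x[(i - j) mod n]."""
    n = x.shape[0]
    idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
    return x[idx]


def train_filter_fourier(x0, contexts, y, lam, lam1):
    """Eq. (5), element-wise: conj(x0_hat) y_hat / (|x0_hat|^2 + lam + lam1 sum_i |xi_hat|^2)."""
    x0_hat = np.fft.fft(x0)
    denom = np.conj(x0_hat) * x0_hat + lam
    for xi in contexts:
        xi_hat = np.fft.fft(xi)
        denom = denom + lam1 * np.conj(xi_hat) * xi_hat
    return np.conj(x0_hat) * np.fft.fft(y) / denom  # O(k s log s)


def train_filter_spatial(x0, contexts, y, lam, lam1):
    """Reference: solve (X0^T X0 + lam I + lam1 sum_i Xi^T Xi) w = X0^T y directly, O(s^3)."""
    X0 = circulant(x0)
    A = X0.T @ X0 + lam * np.eye(x0.shape[0])
    for xi in contexts:
        Xi = circulant(xi)
        A = A + lam1 * (Xi.T @ Xi)
    return np.linalg.solve(A, X0.T @ y)


rng = np.random.default_rng(1)
s, k, lam, lam1 = 32, 4, 1e-2, 0.5
x0 = rng.normal(size=s)
contexts = [rng.normal(size=s) for _ in range(k)]
y = rng.normal(size=s)

# Eq. (4): X = F diag(x_hat) F^H with F the unitary inverse-DFT matrix and x_hat = fft(x).
F = np.conj(np.fft.fft(np.eye(s))) / np.sqrt(s)
X0, x0_hat = circulant(x0), np.fft.fft(x0)
print(np.allclose(X0, F @ np.diag(x0_hat) @ F.conj().T),
      np.allclose(X0.T, F @ np.diag(np.conj(x0_hat)) @ F.conj().T))  # True True

# Eq. (5) in the Fourier domain agrees with the direct solve of the normal equations.
w_fourier = np.fft.ifft(train_filter_fourier(x0, contexts, y, lam, lam1)).real
w_spatial = train_filter_spatial(x0, contexts, y, lam, lam1)
print(np.allclose(w_fourier, w_spatial))  # True
```

The Fourier route replaces an $s \times s$ linear solve with $k + 2$ FFTs and element-wise arithmetic, which is the efficiency argument behind Eq. (5).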
Detection follows traditional CF-based tracking: when a new image patch $z$ arrives, the tracking result is the location of the maximum of the response $r$, which is computed in the Fourier domain as the element-wise product

$$
\hat{r}(w, z) = \hat{z} \odot \hat{w}. \tag{6}
$$
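A sketch of the detection step in Eq. (6), reusing a compact version of the Fourier-domain training routine (hypothetical names, 1-D signals, first-column circulant convention, unnormalized NumPy FFT). Because circulant filtering commutes with cyclic shifts, a target that moves by $d$ samples moves the response peak by $d$ as well; with the Gaussian target peaked at index 0, the toy example below shifts the training patch by 5 samples.

```python
import numpy as np


def gaussian_target(n, sigma):
    """1-D Gaussian regression target peaked at index 0 (circular distance)."""
    i = np.arange(n)
    d = np.minimum(i, n - i)
    return np.exp(-(d ** 2) / (2.0 * sigma ** 2))


def train_filter_fourier(x0, contexts, y, lam, lam1):
    """Eq. (5): conj(x0_hat) y_hat / (|x0_hat|^2 + lam + lam1 sum_i |xi_hat|^2)."""
    x0_hat = np.fft.fft(x0)
    denom = np.abs(x0_hat) ** 2 + lam + lam1 * sum(np.abs(np.fft.fft(xi)) ** 2 for xi in contexts)
    return np.conj(x0_hat) * np.fft.fft(y) / denom


def detect(w_hat, z):
    """Eq. (6): r_hat = z_hat * w_hat; the estimate is the argmax of the spatial response."""
    r = np.fft.ifft(np.fft.fft(z) * w_hat).real
    return int(np.argmax(r)), r


rng = np.random.default_rng(2)
n = 64
x0 = rng.normal(size=n)
contexts = [rng.normal(size=n) for _ in range(4)]
w_hat = train_filter_fourier(x0, contexts, gaussian_target(n, sigma=2.0), lam=1e-2, lam1=0.5)

z = np.roll(x0, 5)  # the target moved by 5 samples (circularly)
shift, response = detect(w_hat, z)
print(shift)  # 5
```

In a real tracker, $z$ would be a new image patch rather than a cyclic shift of $x_0$, and 2-D patches would use `np.fft.fft2` instead of `np.fft.fft`.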
3.2 Manifold regularization for augmented samples
As shown in Fig. 1, given the bounding box of the tracked target, we first expand it to include some surrounding regions, and then partition the expanded region into several