104 Y. Wang, J. Wang, C. Deng, H. Zhu, and S. Wang
Recently, sparse linear representations have been introduced for target representation.
In L1 algorithm [3], a target candidate is sparsely represented by using both target tem-
plates and trivial templates. The target templates are used to represent target appear-
ance, and trivial templates are used to describe outliers or occlusions. The L1 algorithm
is robust to partial occlusions. However, solving the $\ell_1$ minimization problem is
time-consuming, which limits real-time tracking performance. Jia et al. [23] propose a
structural local sparse appearance model, where a target candidate is sparsely represented
by using the partial information and spatial information via an alignment-pooling method.
Taking advantage of generative and discriminative models, Zhong et al. [4] propose a
sparsity-based collaborative appearance model based on both holistic templates and local
representations. Recently, Zhang et al. [24] propose a structural sparse tracking algorithm
by exploiting the spatial layout structure among the local patches inside each target can-
didate. In [25], a target candidate is represented by sparse combinations of particles by
exploiting underlying low-rank constraints.
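The idea of coding a candidate over target templates plus trivial templates can be sketched as follows. This is a simplified illustration, not the solver of [3]: it assumes a dictionary B = [T, I] with signed coefficients, an objective of the form 0.5*||y - Bc||_2^2 + lam*||c||_1, and plain ISTA as the optimizer. The function names, the value of `lam`, and the synthetic data are hypothetical.

```python
import numpy as np

def soft_threshold(v, tau):
    """Element-wise shrinkage: the proximal operator of tau * ||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def objective(y, B, c, lam):
    """0.5 * ||y - B c||_2^2 + lam * ||c||_1 (assumed form, for illustration)."""
    r = y - B @ c
    return 0.5 * (r @ r) + lam * np.abs(c).sum()

def sparse_code(y, T, lam=0.01, n_iter=500):
    """Code y over target templates T and identity (trivial) templates with ISTA."""
    d, n = T.shape
    B = np.hstack([T, np.eye(d)])            # target templates + trivial templates
    step = 1.0 / np.linalg.norm(B, 2) ** 2   # 1 / Lipschitz constant of the gradient
    c = np.zeros(n + d)
    for _ in range(n_iter):
        grad = B.T @ (B @ c - y)             # gradient of the quadratic term
        c = soft_threshold(c - step * grad, step * lam)
    return c[:n], c[n:]                      # target coefficients, trivial coefficients

# Synthetic example: the candidate equals template 0 with three corrupted pixels.
rng = np.random.default_rng(0)
d, n = 50, 5
T = rng.standard_normal((d, n))
T /= np.linalg.norm(T, axis=0)               # unit-norm templates
y = T[:, 0].copy()
y[:3] += 1.0                                 # simulated occlusion
a, e = sparse_code(y, T)
```

Here `a` holds the coefficients on the target templates and `e` the coefficients on the trivial templates, which are meant to absorb the corrupted pixels. The iterative solve per candidate is what makes this kind of representation costly when many candidates are evaluated in every frame.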
Discriminative tracking algorithms consider visual tracking as a binary classification
problem, in which a classifier is learnt to distinguish a target from the surrounding background.
Avidan [10] proposes an ensemble tracking algorithm by combining a set of weak classifiers
into a strong classifier and computes the confidence value for each pixel. The target is
located by a vote confidence map. Bai et al. [11] consider the contribution of confidences as
a weight vector and combine a set of weak classifiers into a strong classifier. Babenko et al.
[15] introduce the multiple instance learning framework into visual tracking where positive
and negative bags are considered as training samples. Kalal et al. [14] formulate visual
tracking in a tracking-learning-detection framework. In [14], a bootstrapping classifier is
learnt and used to select potential samples for updating unlabeled data with positive and
negative constraints. Hare et al. [12] propose a tracking-by-detection algorithm based on
an online structured output support vector machine (SVM). Ning et al. [26] learn a linear
structured SVM and an explicit feature map to track objects. In [27, 28, 29], features
based on deep convolutional neural networks are learnt.
3. The proposed visual tracking algorithm. In this section, we describe the $\ell_1$-$\ell_2$ norms
based target representation and a likelihood evaluation based on the reconstruction residual
and the coding coefficient. Based on the target representation and the likelihood
evaluation, we outline the proposed tracking algorithm in a particle filter framework [30].
3.1. $\ell_1$-$\ell_2$ norms based target representation. During tracking, m particles (i.e.,
target candidates) are sampled at the t-th frame; the state of the i-th particle is denoted as
$x_t^i$, $i = 1, 2, \cdots, m$. The corresponding observation of $x_t^i$ is denoted as $y_t^i$ at frame t. The
state of the located target at frame t is denoted as $\hat{x}_t$, and the corresponding observation
is denoted as $\hat{y}_t$.
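To make the notation concrete, the following sketch draws m particle states around a previous estimate and extracts one observation vector per particle. It rests on assumptions not stated in this section: the state is taken to be a 2-D patch centre (cx, cy), particles follow a Gaussian random walk, and an observation is a fixed-size patch that is vectorized and normalized. All names and parameter values are hypothetical.

```python
import numpy as np

def sample_particles(x_prev, sigma, m, rng):
    """Draw m states from a Gaussian random walk centred at the previous state."""
    x_prev = np.asarray(x_prev, dtype=float)
    noise = rng.standard_normal((m, x_prev.size)) * np.asarray(sigma, dtype=float)
    return x_prev + noise

def observe(frame, state, size=(8, 8)):
    """Crop a patch centred at state = (cx, cy), vectorize it, and l2-normalize it."""
    h, w = size
    cx, cy = float(state[0]), float(state[1])
    # Clamp the top-left corner so the patch stays inside the frame.
    top = int(np.clip(round(cy - h / 2), 0, frame.shape[0] - h))
    left = int(np.clip(round(cx - w / 2), 0, frame.shape[1] - w))
    patch = frame[top:top + h, left:left + w].astype(float).ravel()
    return patch / (np.linalg.norm(patch) + 1e-12)

rng = np.random.default_rng(0)
frame = rng.random((60, 80))                  # stand-in for a grayscale frame
x_hat_prev = [40.0, 30.0]                     # previous target estimate (cx, cy)
particles = sample_particles(x_hat_prev, sigma=[4.0, 4.0], m=100, rng=rng)
Y = np.stack([observe(frame, x) for x in particles])   # row i plays the role of y_t^i
```

Each row of `particles` corresponds to a state $x_t^i$ and the matching row of `Y` to its observation $y_t^i$. A richer state, such as affine parameters, would change only `sample_particles` and the cropping step.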
In visual tracking, the observation $y_t^i$ of a target candidate is often represented by a
linear combination of target templates
$$y_t^i \approx d_1 \alpha_1 + d_2 \alpha_2 + \cdots + d_n \alpha_n, \tag{1}$$
where $D = [d_1, d_2, \cdots, d_n]$ is a set of target templates, $\alpha = [\alpha_1, \alpha_2, \cdots, \alpha_n]^T \in \mathbb{R}^n$ is the
corresponding template coefficient vector.
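As a small numerical illustration of Eq. (1) alone, the sketch below stacks n vectorized templates as the columns of D and recovers a coefficient vector for a synthetic observation. Ordinary least squares is used here only as a stand-in to show the linear-combination model; it is not the objective proposed in this paper, and the dimensions and data are made up.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 64, 10                        # pixels per template, number of templates (assumed)

# Columns of D are vectorized, unit-norm target templates d_1, ..., d_n.
D = rng.random((d, n))
D /= np.linalg.norm(D, axis=0)

# Synthetic observation that is an exact linear combination of the templates.
alpha_true = rng.random(n)
y = D @ alpha_true

# Least-squares estimate of the coefficient vector alpha in y ≈ D alpha.
alpha, *_ = np.linalg.lstsq(D, y, rcond=None)
residual = y - D @ alpha             # reconstruction residual of the candidate
```

With noise-free data and linearly independent templates, `alpha` matches `alpha_true` up to numerical precision and `residual` is close to zero. For a real candidate the residual is nonzero, and its size indicates how well the templates explain the observation.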
Different from sparse linear representations in [3, 4, 23], in the proposed tracking algorithm,
the observation $y_t^i$ of a target candidate is approximated in the form of non-sparse
combinations of a set of target templates by solving
$$\hat{\alpha} = \arg\min_{\alpha} \|y_t^i - D\alpha\|_1 + \lambda \|\alpha\|_2^2, \tag{2}$$