SMEULDERS ET AL.: VISUAL TRACKING: AN EXPERIMENTAL SURVEY 1447
[IVT] Incremental Visual Tracking: The tracker in [54]
recognizes that in tracking it is important to keep an
extended model of appearances capturing the full range of
appearances of the target in the past. Eigen images of
the target are computed by incremental PCA over the tar-
get’s intensity-value template. They are stored in a leaking
memory to slowly forget old observations. Candidate windows are sampled by Particle Filtering [55] from the motion
model, which is a Gaussian distribution around the previ-
ous position. The confidence of each sample is the distance from the candidate window's intensity features to the target's Eigen-image subspace. The candidate window with the minimum distance is selected.
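The scoring step of IVT can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the subspace, dimensions, and the `crop_features` stand-in for template extraction are all assumed toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def subspace_distance(x, mean, basis):
    """Reconstruction error of feature vector x w.r.t. the Eigen-image
    subspace spanned by the orthonormal columns of `basis` around `mean`."""
    centered = x - mean
    proj = basis @ (basis.T @ centered)   # project onto the subspace
    return float(np.linalg.norm(centered - proj))

# Toy PCA model standing in for the incrementally updated Eigen images.
d, k = 64, 2
mean = rng.normal(size=d)
basis, _ = np.linalg.qr(rng.normal(size=(d, k)))

def crop_features(pos):
    """Hypothetical stand-in for extracting the intensity template at `pos`:
    a subspace point plus noise that grows with distance from the target."""
    target = np.array([10.0, 20.0])
    noise = 0.05 * np.linalg.norm(pos - target)
    return mean + basis @ rng.normal(size=k) + noise * rng.normal(size=d)

# Particle filter step: Gaussian motion model around the previous position.
prev_pos = np.array([10.0, 20.0])
particles = prev_pos + rng.normal(scale=3.0, size=(50, 2))
scores = np.array([subspace_distance(crop_features(p), mean, basis)
                   for p in particles])
best = particles[int(np.argmin(scores))]  # minimum-distance candidate wins
```

In a full tracker the subspace itself would then be updated by incremental PCA with a forgetting factor, which the sketch omits.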
[TAG] Tracking on the Affine Group: The paper [56]
also uses an extended model of appearances. It extends the
traditional {translation, scale, rotation} motion types to
a more general 2-dimensional affine matrix group. The
tracker departs from the extended model of IVT adopt-
ing its appearance model including the incremental PCA
of the target intensity values. The tracker samples all pos-
sible transformations of the target from the affine group
using a Gaussian model.
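The candidate-generation step can be sketched as below. Note the simplification: TAG samples on the affine group itself (via Gaussian perturbations in its Lie algebra), whereas this sketch perturbs the six affine parameters directly; all names and values are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_affine_candidates(prev, sigma, n):
    """Sample n candidate 2x3 affine warps around `prev` with Gaussian
    noise on the six parameters (a simplification of group sampling)."""
    return prev[None, :, :] + rng.normal(scale=sigma, size=(n, 2, 3))

def warp_point(A, p):
    """Apply a 2x3 affine warp [linear | translation] to a 2-D point."""
    return A[:, :2] @ p + A[:, 2]

identity = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
candidates = sample_affine_candidates(identity, sigma=0.05, n=100)
```

Each candidate warp would then be scored against the incremental-PCA appearance model exactly as in IVT.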
[TST] Tracking by Sampling Trackers: The paper [45]
observes that the real world varies significantly over time,
requiring the tracker to adapt to the current situation.
Therefore, the method relies on tracking by sampling many
trackers. In this way it maintains an extended model of
trackers. It can be conceived as the extended equivalent
of IVT. Each tracker is made from four components: an
appearance model, a motion model, a state representa-
tion and an observation model. Each component is further
divided into sub-components. The state of the target stores
the center, scale and spatial information, the latter further
subdivided by vertical projection of edges, similar to the
FRT-tracker. Multiple locations and scales are considered.
Sparse incremental PCA with leaking of HSI- and edge-
features captures the state’s appearance past over the last
five frames, similar to IVT. Only the states with the highest
Eigen values are computed. The motion model is composed
of multiple Gaussian distributions. The observation model
consists of Gaussian filter responses of the intensity fea-
tures. Basic trackers are formed from combinations of the
four components. In a new frame, the basic tracker with
the best target state is selected from the space of trackers.
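The combinatorial construction of the tracker space can be sketched as follows; the component names and the fitness function are illustrative assumptions, not the paper's actual sub-components.

```python
import itertools

# Component choices (names are illustrative, not from the paper).
appearance_models = ["inc_pca_hsi", "inc_pca_edge"]
motion_models = ["gaussian_narrow", "gaussian_wide"]
state_representations = ["center_scale", "center_scale_edge_projection"]
observation_models = ["gaussian_filter_intensity"]

# The space of basic trackers: every combination of the four components.
tracker_space = list(itertools.product(
    appearance_models, motion_models, state_representations,
    observation_models))

def evaluate(tracker, frame):
    """Hypothetical per-frame fitness of a basic tracker."""
    return (sum(len(name) for name in tracker) * (frame + 1)) % 7

def select_tracker(frame):
    """Pick the basic tracker with the best target state for this frame."""
    return max(tracker_space, key=lambda t: evaluate(t, frame))
```

In the actual method the space is sampled rather than exhaustively evaluated, but the selection-per-frame structure is the same.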
3.3 Tracking Using Matching with Constraints
Following major successes of sparse representations in the object detection and classification literature, a recent development in tracking reduces the target representation to a sparse representation and performs sparse optimization.
[TMC] Tracking by Monte Carlo sampling: The
method [43] aims to track targets for which the object shape
changes drastically over time by sparse optimization over
patch pairs. Given the target location in the first frame,
the target is modeled by sampling a fixed number of tar-
get patches that are described by edge features and color
histograms. Each patch is then associated with a corre-
sponding background patch sampled outside the object
boundaries. Patches are inserted as nodes in a star-shaped
graph where the edges represent the relative distance to the
center of the target. The best locations of the patches in the
new frame are found by warping each target patch to an
old target patch. Apart from the appearance probability, the
geometric likelihood is based on the difference in location
with the old one. The new target location is found by maxi-
mum a posteriori estimation. TMC has an elaborate update
scheme by adding patches, removing them, shifting them
to other locations, or slowly substituting their appearance
with the current appearance.
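The star-shaped geometric model behind the MAP step can be sketched as below. The appearance term is a hypothetical stand-in (TMC uses edge features and color histograms); the Gaussian geometric term and all constants are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Star-shaped graph: each target patch stores its offset to the target
# center; the graph edges encode these relative distances.
patch_offsets = rng.normal(scale=5.0, size=(8, 2))

def geometric_loglik(i, location, center, sigma=2.0):
    """Geometric term: penalize deviation from patch i's stored offset."""
    expected = center + patch_offsets[i]
    return -np.sum((location - expected) ** 2) / (2.0 * sigma ** 2)

def appearance_loglik(i, location):
    """Hypothetical appearance term (edge + color histogram match in TMC)."""
    return -0.01 * float(np.sum(location ** 2))

def center_logposterior(center, locations):
    """Log-posterior of a candidate center given per-patch locations; the
    MAP estimate maximizes this over candidate centers."""
    return sum(appearance_loglik(i, loc) + geometric_loglik(i, loc, center)
               for i, loc in enumerate(locations))
```

Monte Carlo sampling then proposes candidate centers and patch locations and keeps the posterior maximizer.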
[ACT] Adaptive Coupled-layer Tracking: The recent
tracker [57] aims for rapid and significant appearance
changes by sparse optimization in two layers. The tracker constrains changes in the local layer by maintaining a global layer. In each local layer, at the start, patches will receive
uniform weight and be grouped in a regular grid within
the target bounding box. Each patch is represented by a gray-level histogram and its location. For a new frame, the locations of the
patches are predicted by a constant-velocity Kalman-filter
and tuned to its position in the new frame by an affine
transformation. Patches which drift away from the target
are removed. The global layer contains a representation of
appearance, shape and motion. Color HSV-histograms of
target and background assess the appearance likelihood per
pixel. Motion is defined by computing the optical flow of
a set of salient points by KLT. The difference between the
velocity of the points and the velocity of the tracker assesses
the likelihood of the motion per pixel. Finally, the degree
of being inside or outside the convex hull spanned around
the patches gives the likelihood of a pixel. The local layer
uses these three likelihoods to modify the weight of each
patch and to decide whether to remove the patch or not.
Finally, the three likelihoods are combined into an overall
probability for each pixel to belong to the target. The local
layer in ACT is updated by adding and removing patches.
The global layer is slowly updated by the properties of the
stable patches of the local layer.
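The likelihood combination and patch reweighting can be sketched as follows; the product combination and the drop threshold are assumed simplifications of the paper's update rule.

```python
import numpy as np

rng = np.random.default_rng(3)

def combine_likelihoods(appearance, motion, hull):
    """Overall per-pixel target probability as the product of the three
    ACT cues (appearance, motion, convex-hull membership), each in [0, 1].
    The exact combination rule in the paper may differ."""
    return appearance * motion * hull

def update_patch_weights(weights, patch_scores, drop_below=0.05):
    """Reweight local-layer patches by their likelihood, renormalize, and
    drop patches that fall below a threshold (an assumed update rule)."""
    new_w = weights * patch_scores
    total = new_w.sum()
    if total > 0:
        new_w = new_w / total
    keep = new_w >= drop_below
    return new_w[keep], keep

h, w = 4, 4
appearance = rng.uniform(size=(h, w))
motion = rng.uniform(size=(h, w))
hull = rng.uniform(size=(h, w))
pixel_prob = combine_likelihoods(appearance, motion, hull)
```

The surviving (stable) patches would then slowly feed back into the global layer's appearance, shape, and motion model.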
[L1T] L1-minimization Tracker: The tracker [58]
employs sparse optimization by L1 from the past appearance.
It starts using the intensity values in target windows
sampled near the target as the bases for a sparse represen-
tation. Individual, non-target intensity values are used as
alternative bases. Candidate windows in the new frame
are sampled from a Gaussian distribution centered at the
previous target position by Particle Filtering. They are
expressed as a linear combination of these sparse bases
by L1-minimization such that many of the coefficients are
zero. The tracker expands the number of candidates by
also considering affine warps of the current candidates.
The search is applied over all candidate windows, selecting
the new target by the minimum L1-error. The method
concludes with an elaborate target window update scheme.
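The sparse-coding step can be sketched as below. The tracker in [58] uses its own L1 solver; ISTA is substituted here as a standard stand-in, and the dictionary sizes and noise level are toy assumptions.

```python
import numpy as np

rng = np.random.default_rng(4)

def soft_threshold(x, t):
    """Elementwise shrinkage operator used by ISTA."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

def l1_sparse_code(D, y, lam=0.1, iters=1000):
    """Solve min_c 0.5*||D c - y||^2 + lam*||c||_1 by ISTA
    (a stand-in for the tracker's L1-minimization)."""
    L = np.linalg.norm(D, 2) ** 2     # Lipschitz constant of the gradient
    c = np.zeros(D.shape[1])
    for _ in range(iters):
        grad = D.T @ (D @ c - y)
        c = soft_threshold(c - grad / L, lam / L)
    return c

# Bases: target templates sampled near the target, plus one trivial
# (single-pixel) basis per pixel for non-target intensity values.
d, k = 20, 5
templates = rng.normal(size=(d, k))
D = np.hstack([templates, np.eye(d)])
y = templates[:, 0] + 0.01 * rng.normal(size=d)  # candidate window features
c = l1_sparse_code(D, y)
l1_error = np.linalg.norm(D @ c - y)  # candidates ranked by this error
```

Each candidate window would be coded this way, and the window with the minimum reconstruction error becomes the new target.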
[L1O] L1 Tracker with Occlusion detection: Advancing
the sparse optimization by L1, the paper [59] uses L2 least
squares optimization to improve the speed. It also considers
occlusion explicitly. The candidate windows are sorted on
the basis of the reconstruction error in the least squares. The
ones above a threshold are selected for L1-minimization. To
detect occluded pixels, the tracker marks as occluded those pixels whose coefficients on the alternative bases exceed a threshold. When more than 30% of the pixels are
occluded, L1O declares occlusion, which disables the model
updating.
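The occlusion test can be sketched as follows. Only the 30% pixel fraction comes from the text; the coefficient threshold is an assumed value.

```python
import numpy as np

def detect_occlusion(trivial_coeffs, coeff_thresh=0.1, pixel_frac=0.3):
    """Flag pixels whose alternative-basis (trivial-template) coefficient
    exceeds `coeff_thresh`; declare occlusion when more than `pixel_frac`
    (30% in L1O) of the pixels are flagged. `coeff_thresh` is an assumed
    value, not taken from the paper."""
    occluded_pixels = np.abs(trivial_coeffs) > coeff_thresh
    return bool(occluded_pixels.mean() > pixel_frac), occluded_pixels

coeffs = np.zeros(100)
coeffs[:40] = 0.5           # 40% of pixels explained by trivial bases
occluded, mask = detect_occlusion(coeffs)
update_model = not occluded  # occlusion disables model updating
```

Gating the model update this way prevents the occluder's appearance from being absorbed into the target model.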