Struck: Structured Output Tracking with Kernels
Sam Hare¹, Amir Saffari¹,², Philip H. S. Torr¹
¹ Oxford Brookes University, Oxford, UK
² Sony Computer Entertainment Europe, London, UK
{sam.hare,philiptorr}@brookes.ac.uk amir@ymer.org
Abstract
Adaptive tracking-by-detection methods are widely used in
computer vision for tracking arbitrary objects. Current
approaches treat the tracking problem as a classification
task and use online learning techniques to update the object
model. However, for these updates to happen one needs to
convert the estimated object position into a set of labelled
training examples, and it is not clear how best to perform
this intermediate step. Furthermore, the objective for the
classifier (label prediction) is not explicitly coupled to
the objective for the tracker (accurate estimation of object
position). In this paper, we present a framework for adaptive
visual object tracking based on structured output prediction.
By explicitly allowing the output space to express the needs
of the tracker, we are able to avoid the need for an
intermediate classification step. Our method uses a
kernelized structured output support vector machine (SVM),
which is learned online to provide adaptive tracking. To
allow for real-time application, we introduce a budgeting
mechanism which prevents the unbounded growth in the number
of support vectors which would otherwise occur during
tracking. Experimentally, we show that our algorithm is able
to outperform state-of-the-art trackers on various benchmark
videos. Additionally, we show that we can easily incorporate
additional features and kernels into our framework, which
results in increased performance.
1. Introduction
Visual object tracking is one of the core problems of
computer vision, with wide-ranging applications including
human-computer interaction, surveillance and augmented
reality, to name just a few. For other areas of computer
vision which aim to perform higher-level tasks such as scene
understanding and action recognition, object tracking
provides an essential component.
For some applications, the object to be tracked is known
in advance, and it is possible to incorporate prior knowledge
when designing the tracker. There are other cases, however,
where it is desirable to be able to track arbitrary objects,
which may only be specified at runtime. In these scenarios,
the tracker must be able to model the appearance of the
object on-the-fly, and adapt this model during tracking to
take into account changes caused by object motion, lighting
conditions, and occlusion. Even when prior information about
the object is known, having a framework with the flexibility
to adapt to appearance changes and incorporate new
information during tracking is attractive, and in real-world
scenarios is often essential to allow successful tracking.
Figure 1. Different adaptive tracking-by-detection paradigms:
given the current estimated object location, traditional
approaches (shown on the right-hand side) generate a set of
samples and, depending on the type of learner, produce
training labels. Our approach (left-hand side) avoids these
steps, and operates directly on the tracking output.
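To make concrete the intermediate step that traditional approaches require (converting an estimated object position into labelled training examples), the following is a minimal illustrative sketch. It is not taken from any particular tracker: the function names `overlap` and `label_samples`, the sampling grid, and the overlap threshold of 0.7 are all assumptions chosen purely for illustration.

```python
from typing import List, Tuple

# A box is (x, y, width, height). This representation is an assumption.
Box = Tuple[float, float, float, float]


def overlap(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    iw = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    ih = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = iw * ih
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


def label_samples(estimate: Box, radius: int = 8, step: int = 4,
                  pos_thresh: float = 0.7) -> List[Tuple[Box, int]]:
    """Sample translated boxes around the estimate and give each a binary label.

    Hypothetical labelling rule: a sample is positive (+1) if it overlaps the
    estimated box by at least pos_thresh, and negative (-1) otherwise.
    """
    x, y, w, h = estimate
    samples = []
    for dx in range(-radius, radius + 1, step):
        for dy in range(-radius, radius + 1, step):
            box = (x + dx, y + dy, w, h)
            label = 1 if overlap(estimate, box) >= pos_thresh else -1
            samples.append((box, label))
    return samples


samples = label_samples((100, 100, 40, 40))
```

The hard threshold is exactly the kind of heuristic choice at issue: a sample just below `pos_thresh` is treated the same as one far from the object, even though the two differ greatly in how well they localise it.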
An approach to tracking which has become particularly
popular recently is tracking-by-detection [2], which
treats the tracking problem as a detection task applied over
time. This popularity is due in part to the great deal of
progress made recently in object detection, with many of
the ideas being directly transferable to tracking [2].
Another key factor is the development of methods which allow
the classifiers used by these approaches to be trained
online, providing a natural mechanism for adaptive tracking,