FLIPFLOP CORRELATION TRACKING WITH CONVOLUTION KERNELS NETWORKS
Hui He¹, Bo Ma*¹ and Luoyu Qin²
¹ Beijing Laboratory of Intelligent Information Technology, Beijing Institute of Technology, China
² China Academy of Space Technology, China
ABSTRACT
Correlation filter-based tracking methods have achieved
competitive accuracy and robustness, but there is still
considerable room for improvement in the choice of features.
Recently, Convolutional Kernel Networks (CKN), which
provide a fast and simple procedure to approximate kernel
descriptors, have been proposed and have achieved
state-of-the-art performance in many vision tasks. In this
paper, we present an adaptive tracker that integrates
kernelized correlation filters with multiple effective CKN
descriptors. By adopting a FlipFlop scheme, the weights of
the different features are adjusted during tracking to
improve performance. Extensive experimental results on the
OTB-2013 tracking benchmark show that our approach performs
favorably against representative state-of-the-art tracking
algorithms.
Index Terms— correlation tracking, convolutional kernel
networks, adaptive multiple features
1. INTRODUCTION
Visual tracking, whose goal is to estimate the state of the
target in subsequent frames [1], plays a critical role in
numerous computer vision applications such as surveillance,
robotics and behavior analysis. Although this field has been
studied for decades, it remains a challenging and
interesting task because of several complicating factors, such
as background clutter, illumination variation, partial
occlusion and deformation.
In terms of how foreground and background information is
used, mainstream tracking methods can be categorized as
generative or discriminative. Generative trackers focus on
building robust appearance models of the target from
templates or subspaces, and track by searching for the
best-matching window [2, 3]. Discriminative trackers, in
contrast, typically construct online classifiers that
distinguish the target from its background [4, 5, 6, 7]. It
has been shown that the background information exploited by
discriminative methods is beneficial for effective tracking [8].
*Corresponding author: bma000@bit.edu.cn (Bo Ma). This work was
supported in part by the National Natural Science Foundation of China (No.
61472036).
In particular, correlation filter-based discriminative
trackers have recently made significant progress and
attracted much attention [9, 10, 11, 12]. As proposed in
[13], by extending the single-channel filter to multiple
channels and replacing raw pixel values with Histogram of
Oriented Gradients (HOG) features, the Kernelized Correlation
Filter (KCF) tracker is more competitive than other
state-of-the-art trackers while running at hundreds of frames
per second. Because HOG is a hand-crafted feature, however,
more effective features are needed to improve performance
further.
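To make the correlation filter idea concrete, the following is a minimal sketch of a single-channel, linear-kernel correlation filter trained and evaluated in the Fourier domain. It is an illustration under stated assumptions, not the KCF implementation of [13]: it uses raw pixel values rather than HOG or CKN features, a linear kernel, and no cosine window or model update. The function names (`gaussian_response`, `train`, `detect`), the regularization value `lam` and the Gaussian width `sigma` are hypothetical choices made for this example.

```python
import numpy as np


def gaussian_response(shape, sigma=2.0):
    """Desired filter output: a Gaussian peak at (0, 0), wrapped circularly."""
    h, w = shape
    dy = np.minimum(np.arange(h), h - np.arange(h))[:, None]
    dx = np.minimum(np.arange(w), w - np.arange(w))[None, :]
    return np.exp(-(dy ** 2 + dx ** 2) / (2.0 * sigma ** 2))


def train(x, y, lam=1e-4):
    """Ridge regression over all cyclic shifts of x, solved per frequency."""
    xf = np.fft.fft2(x)
    kf = np.abs(xf) ** 2 / x.size          # linear-kernel autocorrelation of x
    alphaf = np.fft.fft2(y) / (kf + lam)   # dual coefficients (Fourier domain)
    return xf, alphaf


def detect(xf, alphaf, z):
    """Return the (row, col) of the response peak for a new patch z."""
    zf = np.fft.fft2(z)
    kzf = zf * np.conj(xf) / z.size        # cross-correlation of z with x
    response = np.real(np.fft.ifft2(alphaf * kzf))
    row, col = np.unravel_index(np.argmax(response), response.shape)
    return int(row), int(col)


# Usage: train on a random patch, then locate a circularly shifted copy.
rng = np.random.default_rng(0)
patch = rng.standard_normal((32, 32))
model_xf, model_alphaf = train(patch, gaussian_response(patch.shape))
shifted = np.roll(patch, (5, 3), axis=(0, 1))
print(detect(model_xf, model_alphaf, shifted))
```

Because every operation is an element-wise product or division on FFT coefficients, the cost per frame is dominated by a few FFTs, which is what allows trackers of this family to run at high frame rates. In this sketch the response peak is expected to appear at the shift applied to the patch.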
Recently, Convolutional Kernel Networks (CKN) [14], a
simple type of convolutional neural network that approximates
patch-based kernel descriptors [15], have been developed to
provide an end-to-end image representation and have
demonstrated state-of-the-art performance in many vision
tasks, such as classification [14] and image retrieval [16].
Although these semantic representations have been shown to be
very effective in categorizing objects and capturing their
original spatial details [17], they are not the optimal
representation for visual tracking. To achieve better
performance, it is important to combine multiple features to
obtain a better representation and to separate foreground
targets from background clutter.
Decision-theoretic online learning (DTOL) [18] is a
framework for dynamically allocating resources among a set
of experts in learning problems that proceed in rounds,
which makes it suitable for combining multiple responses
into a final decision in visual tracking. The Hedge
algorithm, which uses a set of experts to explain the
observations regardless of how the observations are
generated, was the first algorithm proposed to solve the
DTOL problem. The weight assigned to each expert depends on
that expert's cumulative loss and on a learning rate
parameter. However, the Hedge algorithm cannot guarantee the
best prediction in all applications, because the best
learning rate cannot always be obtained [19]. AdaHedge [20]
and FlipFlop [21] were proposed by van Erven, de Rooij, and
colleagues to overcome this drawback of the original Hedge
algorithm. By dividing the original learning problem into
sub-problems, the learning rate parameter can be obtained
directly from part of the loss, which brings the decision
closer to the optimal result. Given the advantages of
FlipFlop, in this paper we use this
978-1-5090-4117-6/17/$31.00 ©2017 IEEE    ICASSP 2017