Online Object Tracking Based on CNN with
Metropolis-Hasting Re-Sampling
Xiangzeng Zhou and Lei Xie∗
School of Computer Science
Northwestern Polytechnical University
Xi’an, P. R. China
xenuts@gmail.com, lxie@nwpu.edu.cn
Peng Zhang∗ and Yanning Zhang
School of Computer Science
Northwestern Polytechnical University
Xi’an, P. R. China
{zh0036ng, ynzhang}@nwpu.edu.cn
ABSTRACT
Tracking-by-learning strategies have been effective in solving
many challenging problems in visual tracking, in which
learning sample generation and labeling play important
roles in the final performance. Although deep learning based
approaches have shown impressive performance in different
vision tasks, how to properly apply a learning model such as
a CNN to an online tracking framework is still challenging.
In this paper, to overcome the overfitting problem caused by
straightforward incorporation, we propose an online tracking
framework that constructs a CNN based adaptive appearance
model to generate more reliable training data over time.
With a modified Metropolis-Hastings re-sampling scheme that
reshapes particles for a better representation of the state
posterior during online learning, the proposed tracker
outperforms most state-of-the-art trackers on challenging
benchmark video sequences.
Categories and Subject Descriptors
I.4.8 [Image Processing and Computer Vision]: Scene
Analysis—Tracking
General Terms
Algorithm, Theory
Keywords
Object tracking, CNN, Metropolis-Hastings, Re-sampling
1. INTRODUCTION
The quality of learning samples is essential to robust online
tracking, but it is hard to guarantee because sample generation
and labeling cannot easily be supervised manually while
tracking is running on-the-fly. Although different tracking
strategies have tried various types of traditional models for
sample generation [15], the descriptive capability of those
online samples is still far from sufficient for representing
object characteristics.
∗ Corresponding author.
MM’15, October 26–30, 2015, Brisbane, Australia.
© 2015 ACM. ISBN 978-1-4503-3459-4/15/10 ...$15.00.
DOI: http://dx.doi.org/10.1145/2733373.2806307.
Deep learning models, e.g. the convolutional neural network
(CNN) [16], can exploit more descriptive training samples, and
have been successfully applied to a variety of audio and
visual tasks, such as speech recognition and image
classification, with remarkable progress. However, because
they require a large amount of training data and incur a high
computational cost, most of those studies rely on an off-line
learning process, as in some recently proposed works
[14, 11, 7]. Wang et al. [14] proposed an online tracking
strategy based on a compact image representation learned by a
deep neural network pre-trained off-line, which requires large
amounts of auxiliary images. Similarly, Hong et al. [7]
learned discriminative saliency using a CNN, but still
required a pre-trained model. Different from [14] and [7], Li
et al. [11] proposed a variation of CNN with a truncated
structural loss to construct an online tracker and showed
promising performance. However, that work mainly focused on
reforming the CNN model for online learning and did not
address the sample generation problem, which may lead to
tracking failure in complicated scenarios. Thus, how to
exploit the advantages of deep learning to generate more
representative samples is a challenging problem in online
tracking, and it is one motivation of this study.
Sample labeling is another challenge in properly using a
CNN model as the learning strategy for online tracking,
because a CNN is prone to overfitting to recent samples
and is sensitive to mislabeled samples. Typically, a particle
filter is used to conduct online object tracking efficiently
by approximating the posterior of the object state with a
finite set of weighted particles. However, without prior
knowledge it is difficult for the particles to repair
themselves when a specific error pattern arises. Such errors
may be caused by interference from an incorrect object due to
dramatic appearance change or an overfitting problem (e.g. in
a CNN based appearance model). Therefore, an effective
re-sampling process over the particle filter may benefit label
assignment, providing more reliably labeled samples for
learning the CNN model. This is another motivation for this
study.
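To make the two ingredients named above concrete, the following is a minimal illustrative sketch, not the re-sampling scheme proposed in this work. It assumes a toy one-dimensional state, a hypothetical Gaussian likelihood (`toy_likelihood`) standing in for an appearance model, a flat prior, and a plain random-walk Metropolis-Hastings move; all function names and parameters are assumptions introduced for illustration only. It shows (i) a posterior mean approximated by a finite set of weighted particles, and (ii) particles being moved toward high-likelihood regions by a Metropolis-Hastings chain.

```python
import math
import random


def weighted_mean(particles, weights):
    """Posterior mean approximated by a finite set of weighted particles."""
    total = sum(weights)
    return sum(p * w for p, w in zip(particles, weights)) / total


def mh_resample(particles, likelihood, steps=20, step_size=1.0, rng=None):
    """Move every particle with a random-walk Metropolis-Hastings chain.

    Assumption: the likelihood is used as the unnormalized target density
    (flat prior). The Gaussian proposal is symmetric, so the acceptance
    probability reduces to min(1, likelihood(candidate) / likelihood(current)).
    """
    rng = rng or random.Random()
    moved = []
    for x in particles:
        lx = likelihood(x)
        for _ in range(steps):
            cand = x + rng.gauss(0.0, step_size)
            lc = likelihood(cand)
            # Always accept if the current likelihood underflowed to zero,
            # which also avoids a division by zero.
            if lx == 0.0 or rng.random() < min(1.0, lc / lx):
                x, lx = cand, lc
        moved.append(x)
    return moved


def toy_likelihood(x, target=5.0, sigma=1.0):
    """Hypothetical stand-in for an appearance model score (unnormalized)."""
    return math.exp(-0.5 * ((x - target) / sigma) ** 2)


rng = random.Random(0)

# Particles drawn from a broad uniform prior over the toy state.
particles = [rng.uniform(-10.0, 10.0) for _ in range(200)]

# (i) Weighted-particle approximation of the posterior mean.
weights = [toy_likelihood(p) for p in particles]
estimate_weighted = weighted_mean(particles, weights)

# (ii) Metropolis-Hastings moves: particles now carry equal weight.
moved = mh_resample(particles, toy_likelihood, steps=100, step_size=1.0, rng=rng)
estimate_moved = sum(moved) / len(moved)

print("weighted-particle estimate:", estimate_weighted)
print("estimate after MH moves:   ", estimate_moved)
```

In an actual tracker the state would be multi-dimensional (for example, the position and scale of a bounding box) and the likelihood would come from the appearance model; the toy setup here only shows the mechanics of weighting and of the accept/reject step.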
In this work, we propose a robust online tracker by exploiting
the strong learning capability of a CNN model within a
particle filtering framework. An overview of our tracking
framework is shown in Fig. 1. The contributions of the
proposed work are threefold. Firstly, we make an attempt
by introducing a single convolutional neural network