
Learning Multi-Domain Convolutional Neural Networks for Visual Tracking
Hyeonseob Nam Bohyung Han
Dept. of Computer Science and Engineering, POSTECH, Korea
{namhs09, bhhan}@postech.ac.kr
Abstract
We propose a novel visual tracking algorithm based on
the representations from a discriminatively trained Convo-
lutional Neural Network (CNN). Our algorithm pretrains
a CNN using a large set of videos with tracking ground-
truths to obtain a generic target representation. Our net-
work is composed of shared layers and multiple branches
of domain-specific layers, where domains correspond to in-
dividual training sequences and each branch is responsible
for binary classification to identify the target in each do-
main. We train the network with respect to each domain
iteratively to obtain generic target representations in the
shared layers. When tracking a target in a new sequence,
we construct a new network by combining the shared lay-
ers in the pretrained CNN with a new binary classification
layer, which is updated online. Online tracking is performed
by evaluating the candidate windows randomly sampled
around the previous target state. The proposed algorithm
demonstrates outstanding performance compared with state-of-the-art methods in existing tracking benchmarks.
1. Introduction
Convolutional Neural Networks (CNNs) have recently
been applied to various computer vision tasks such as im-
age classification [27, 5, 34], semantic segmentation [30],
object detection [13], and many others [37, 36]. Such great
success of CNNs is mostly attributed to their outstanding
performance in representing visual data. Visual tracking,
however, has been less affected by these popular trends
since it is difficult to collect a large amount of training data
for video processing applications and training algorithms
specialized for visual tracking are not available yet, while
approaches based on low-level hand-crafted features still
work well in practice [18, 6, 21, 42]. Several recent track-
ing algorithms [20, 39] have addressed the data deficiency
issue by transferring CNNs pretrained on a large-scale classification dataset such as ImageNet [33]. Although these methods may be sufficient to obtain generic feature representations, their effectiveness in terms of tracking is limited
due to the fundamental inconsistency between classification
and tracking problems, i.e., predicting object class labels
versus locating targets of arbitrary classes.
To fully exploit the representation power of CNNs in vi-
sual tracking, it is desirable to train them on large-scale data
specialized for visual tracking, which cover a wide range
of variations in the combination of target and background.
However, it is truly challenging to learn a unified represen-
tation based on the video sequences that have completely
different characteristics. Note that individual sequences in-
volve different types of targets whose class labels, moving
patterns, and appearances are different, and tracking algo-
rithms suffer from sequence-specific challenges including
occlusion, deformation, lighting condition change, motion
blur, etc. Training CNNs is even more difficult since the same kind of object can be considered a target in one sequence and a background object in another. Due to such
variations and inconsistencies across sequences, we believe
that the ordinary learning methods based on the standard
classification task are not appropriate, and another approach
to capture sequence-independent information should be in-
corporated for better representations for tracking.
Motivated by this fact, we propose a novel CNN archi-
tecture, referred to as Multi-Domain Network (MDNet), to
learn the shared representation of targets from multiple an-
notated video sequences for tracking, where each video is
regarded as a separate domain. The proposed network has
separate branches of domain-specific layers for binary clas-
sification at the end of the network, and shares the common
information captured from all sequences in the preceding
layers for generic representation learning. Each domain in
MDNet is trained separately and iteratively while the shared
layers are updated in every iteration. By employing this
strategy, we separate domain-independent information from domain-specific information and learn generic feature representations for visual tracking. Another interesting aspect of our
architecture is that we design the CNN with a small number
of layers compared to the networks for classification tasks
such as AlexNet [27] and VGG nets [5, 34].
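The multi-domain training strategy described above can be sketched in code. The following is a minimal NumPy toy, not the authors' implementation: the single linear shared layer, the layer sizes, the learning rate, and the synthetic per-domain data are all illustrative assumptions standing in for the paper's convolutional and fully connected layers. It only shows the key idea that every iteration handles one domain, updating that domain's binary-classification head together with the shared layer:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class MultiDomainNet:
    """Toy stand-in for MDNet: one shared layer + one binary head per domain."""

    def __init__(self, n_in, n_hidden, n_domains, lr=0.1):
        self.W_shared = rng.normal(0.0, 0.1, (n_in, n_hidden))   # shared layers
        self.heads = [rng.normal(0.0, 0.1, n_hidden)             # domain-specific
                      for _ in range(n_domains)]                 # branches
        self.lr = lr

    def forward(self, X, d):
        H = np.maximum(X @ self.W_shared, 0.0)     # shared ReLU features
        return H, sigmoid(H @ self.heads[d])       # target score for domain d

    def train_step(self, X, y, d):
        """One SGD step on domain d: updates head d AND the shared layer."""
        H, p = self.forward(X, d)
        err = p - y                                # dL/dz for logistic loss
        grad_head = H.T @ err / len(y)
        grad_shared = X.T @ (np.outer(err, self.heads[d]) * (H > 0)) / len(y)
        self.heads[d] -= self.lr * grad_head
        self.W_shared -= self.lr * grad_shared     # shared info from all domains

# Synthetic data: each "domain" is a sequence with its own target/background
# labeling, so the heads disagree while the shared layer must serve them all.
net = MultiDomainNet(n_in=8, n_hidden=4, n_domains=3)
domains = []
for d in range(3):
    X = rng.normal(0.0, 1.0, (40, 8))
    w_true = rng.normal(0.0, 1.0, 8)
    domains.append((X, (X @ w_true > 0).astype(float)))

# Iterate over domains one at a time, as in the paper's strategy of training
# the network with respect to a single domain in each iteration.
for it in range(200):
    d = it % 3
    X, y = domains[d]
    net.train_step(X, y, d)
```

At test time the sketch would mirror the paper's procedure: discard all the heads, attach a fresh randomly initialized head for the new sequence, and fine-tune it online on top of the (now generic) shared features.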
We also propose an effective online tracking framework
based on the representations learned by MDNet. When a
arXiv:1510.07945v2 [cs.CV] 6 Jan 2016