FAST TRACKING VIA CONTEXT DEPTH
MODEL LEARNING
Zhaoyun Chen, Lei Luo, Mei Wen, Chunyuan Zhang
College of Computer, National University of Defense Technology, Changsha, China
Email: chenzhaoyun09@163.com
Abstract—Visual tracking is a challenging task in computer vision. In this paper, we propose a fast and robust visual tracking algorithm that directly extends STC [1]. By exploring RGB-D data, we construct a context depth model that records the spatial correlation between low-level features of the target and its surrounding regions. Leveraging the continuity and stability of the target in the depth image, we adopt a region growing method and a model updating scheme for scale estimation and occlusion detection. Both qualitative and quantitative evaluations on challenging benchmark image sequences demonstrate that the proposed tracker performs favorably against several state-of-the-art algorithms.
I. INTRODUCTION
Visual tracking is an important research direction in computer vision. A robust, real-time tracker for continuous image sequences has a wide range of applications, such as video surveillance, intelligent transportation, human-computer interaction, robot navigation, and video compression and retrieval.
In traditional tracking, generative models are usually proposed to represent target appearance changes [2, 3]. Some approaches mine auxiliary objects or local visual information surrounding the target to assist tracking [4, 5], and numerous learning methods have been adapted to the tracking problem [6–8]. The algorithms mentioned above, however, cannot handle heavy occlusion due to the lack of 3D visual understanding. Moreover, some of them cannot work in real-time scenarios because of their high computational complexity.
The fast tracking algorithm via spatio-temporal context learning (STC) [1] presents a new framework that exploits context information to facilitate visual tracking. Although STC works well in common scenes, it performs poorly under challenging factors such as occlusion, scale variation, deformation, and background clutter.
Meanwhile, off-the-shelf depth sensors such as the Microsoft Kinect make depth information easy to acquire. Depth information has been introduced into object detection, object segmentation, scene understanding [9, 10], etc. However, no existing RGB-D tracking algorithm works effectively in all situations [11].
We propose to extend STC by exploring RGB-D data. Depth information is introduced to refine the spatio-temporal context model into a context depth model, improving scale estimation and the handling of occlusion and deformation. The main contributions of this paper are: (1) we construct a 3D context model based on depth information; (2) a region growing method is adopted for scale estimation, so the target is not limited to a fixed aspect ratio; and (3) a scheme that reduces the learning rate is proposed to improve performance under long-term occlusion.

Corresponding author: Lei Luo, e-mail: l.luo@nudt.edu.cn

Fig. 1. Overview of the proposed algorithm. It consists of four parts: Object Center Location, Occlusion Detection, Region Growing Scaling, and Bounding Box Output. [The figure also plots the fluctuation of the target center depth over the sequence, with example frames #81, #84, #90, and #93.]
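To make the Fig. 1 pipeline concrete, the sketch below outlines one tracking iteration as we read it. It is a minimal sketch under stated assumptions, not the authors' implementation: the model API (confidence_map, update), the depth-jump threshold, the learning rates, and the 4-connected depth-tolerance growing criterion are all hypothetical placeholders.

```python
import numpy as np
from collections import deque

def region_grow(depth, seed, tol=50.0):
    # Flood-fill from `seed` over 4-connected pixels whose depth stays
    # within `tol` of the seed depth (a generic criterion; the paper's
    # exact growing rule may differ). Returns a boolean mask.
    h, w = depth.shape
    mask = np.zeros((h, w), dtype=bool)
    seed_depth = float(depth[seed])
    queue = deque([seed])
    mask[seed] = True
    while queue:
        y, x = queue.popleft()
        for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
            if (0 <= ny < h and 0 <= nx < w and not mask[ny, nx]
                    and abs(float(depth[ny, nx]) - seed_depth) <= tol):
                mask[ny, nx] = True
                queue.append((ny, nx))
    return mask

def track_frame(rgb, depth, model, prev_center_depth,
                depth_jump=300.0, lr_normal=0.075, lr_occluded=0.01):
    # 1. Object Center Location: take the peak of the confidence map
    #    (Eq. (1)); `model.confidence_map` is a hypothetical API.
    conf = model.confidence_map(rgb, depth)
    center = np.unravel_index(np.argmax(conf), conf.shape)

    # 2. Occlusion Detection: an abrupt jump of the depth at the target
    #    center suggests an occluder has moved in front of the target.
    center_depth = float(depth[center])
    occluded = abs(center_depth - prev_center_depth) > depth_jump

    # 3. Region Growing Scaling: grow a depth-consistent region from the
    #    center; its extent gives a box with a free aspect ratio.
    mask = region_grow(depth, center)
    ys, xs = np.nonzero(mask)
    box = (xs.min(), ys.min(), xs.max(), ys.max())  # 4. Bounding Box

    # Contribution (3): shrink the learning rate while occluded so the
    # context depth model is not corrupted by the occluder.
    model.update(rgb, depth, center, lr_occluded if occluded else lr_normal)
    return center, box, center_depth, occluded
```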
The paper is structured as follows: Section 2 presents our proposed method. The experiments and evaluation are described in Section 3. Section 4 concludes the paper.
II. METHODOLOGY
The tracking problem in STC is formulated by computing a confidence map which estimates the object location likelihood:

$$c(x) = P(x \mid o), \tag{1}$$

where $x \in \mathbb{R}^2$ is the object location and $o$ denotes the object present in the scene. In the current frame, the object location $x^*$ is given. The local context feature set from the image is defined as $X^c = \{c(z) = (I(z), z) \mid z \in \Omega_c(x^*)\}$, where $I(z)$ stands for the image intensity at location $z$ and $\Omega_c(x^*)$ stands for the neighborhood of the location $x^*$.
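As a concrete reading of these definitions, the minimal sketch below (our own illustration; the helper name and the square window taken for $\Omega_c(x^*)$ are assumptions, since the neighborhood's shape is not fixed here) collects the context feature set as intensity-location pairs:

```python
import numpy as np

def context_feature_set(image, x_star, radius):
    # Collect X^c = {(I(z), z) : z in Omega_c(x*)}, taking Omega_c(x*)
    # to be a square window of the given radius around x* (an assumed
    # choice of neighborhood, for illustration only).
    cy, cx = x_star
    h, w = image.shape
    features = []
    for y in range(max(0, cy - radius), min(h, cy + radius + 1)):
        for x in range(max(0, cx - radius), min(w, cx + radius + 1)):
            features.append((float(image[y, x]), (y, x)))
    return features
```

In STC itself, the summation over this set in Eq. (2) below is evaluated efficiently in the frequency domain via the FFT, which is what makes the tracker fast.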
By marginalizing the joint probability, the confidence map can be decomposed as

$$c(x) = P(x \mid o) = \sum_{c(z) \in X^c} P(x, c(z) \mid o) = \sum_{c(z) \in X^c} P(x \mid c(z), o)\, P(c(z) \mid o), \tag{2}$$
where $P(x \mid c(z), o)$ is the spatial context probability and $P(c(z) \mid o)$ is the prior context probability. The center of