Robust Fragments-based Tracking using the Integral Histogram
Amit Adam and Ehud Rivlin
Dept. of Computer Science
Technion - Israel Institute of Technology
Haifa 32000, Israel
{amita,ehudr}@cs.technion.ac.il
Ilan Shimshoni
Dept. of Management Information Systems
Haifa University
Haifa 31905, Israel
{ishimshoni}@mis.haifa.ac.il
Abstract
We present a novel algorithm (which we call “Frag-
Track”) for tracking an object in a video sequence. The
template object is represented by multiple image fragments
or patches. The patches are arbitrary and are not based on
an object model (in contrast with traditional use of model-
based parts e.g. limbs and torso in human tracking). Every
patch votes on the possible positions and scales of the ob-
ject in the current frame, by comparing its histogram with
the corresponding image patch histogram. We then mini-
mize a robust statistic in order to combine the vote maps of
the multiple patches.
A key tool enabling the application of our algorithm to
tracking is the integral histogram data structure [18]. Its
use allows to extract histograms of multiple rectangular re-
gions in the image in a very efficient manner.
Our algorithm overcomes several difficulties which can-
not be handled by traditional histogram-based algorithms
[8, 6]. First, by robustly combining multiple patch votes, we
are able to handle partial occlusions or pose change. Sec-
ond, the geometric relations between the template patches
allow us to take into account the spatial distribution of the
pixel intensities - information which is lost in traditional
histogram-based algorithms. Third, as noted by [18], track-
ing large targets has the same computational cost as track-
ing small targets.
We present extensive experimental results on challenging
sequences, which demonstrate the robust tracking achieved
by our algorithm (even with the use of only gray-scale (non-
color) information).
1. Introduction
Tracking is an important subject in computer vision with
a wide range of applications - some of which are surveil-
lance, activity analysis, classification and recognition from
motion and human-computer interfaces. The three main
categories into which most algorithms fall are feature-based
tracking (e.g. [3]), contour-based tracking (e.g. [15]) and
region-based tracking (e.g [13]). In the region-based cate-
gory, modeling of the region’s content by a histogram or by
other non-parametric descriptions (e.g. kernel-density esti-
mate) have become very popular in recent years. In particu-
lar, one of the most influential approaches is the mean-shift
approach [8, 6].
With the experience gained by using histograms and the
mean shift approach, some difficulties have been studied in
recent years. One issue is the local basin of convergence
that the mean shift algorithm has. Recently in [22] the au-
thors describe a method for converging to the optimum from
far-away starting points.
A second issue, inherent in the use of histograms, is the
loss of spatial information. This issue has been addressed
by several works. In [26] the authors introduce a new sim-
ilarity measure between the template and image regions,
which replaces the original Bhattacharyya metric. This
measure takes into account both the intensities and their
position in the window. The measure is further computed
efficiently by using the Fast Gauss Transform. In [12], the
spatial information is taken into account by using “oriented
kernels” - this approach is additionally shown to be useful
for wide baseline matching. Recently, [4] has addressed this
issue by adding the spatial mean and covariance of the pixel
positions who contribute to a given bin in the histogram -
naming this approach as “spatiograms”.
A third issue which is not specifically addressed by these
previous approaches is occlusions. The template model is
global in nature and hence cannot handle well partial occlu-
sions.
In this work we address the latter two issues (spatial in-
formation and occlusion) by using parts or fragments to rep-
resent the template. The first issue is addressed by efficient
exhaustive search which will be discussed later on. Given
a template to be tracked, we represent it by multiple his-
tograms of multiple rectangular sub regions (patches) of the
template. By measuring histogram similarity with patches