Copyright (c) 2010 IEEE. Personal use is permitted. For any other purposes, Permission must be obtained from the IEEE by emailing pubs-permissions@ieee.org.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication.
further improvement for caches with more than 60 samples.
Consequently, the training period for their algorithm must
comprise at least 20 frames. Finally, to cope with lighting
changes and objects appearing or fading in the background,
two additional mechanisms (one at the pixel level, a second at
the blob level) are added to the consensus algorithm to handle
entire objects.
The method proposed in this paper operates differently in
handling new or fading objects in the background, without
the need to take account of them explicitly. In addition to
being faster, our method exhibits an interesting asymmetry
in that a ghost (a region of the background discovered once
a static object starts moving) is added to the background
model more quickly than an object that stops moving. Another
major contribution of this paper resides in the proposed update
policy. The underlying idea is to gather samples from the past
and to update the sample values without regard to when they were
added to the models. This policy ensures a smooth, exponentially
decaying lifespan for the sample values of the pixel models and
allows our technique to deal with concomitant events evolving
at various speeds with a unique model of a reasonable size for
each pixel.
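
The exponentially decaying lifespan mentioned above can be illustrated with a short sketch. It assumes, as a simplification that this passage does not spell out, that each update overwrites one of the N samples of a pixel model, chosen uniformly at random and regardless of its age. The function names and the value N = 20 are illustrative only.

```python
import random

def survival_probability(n_samples: int, n_updates: int) -> float:
    """Probability that a given sample is still in the model after n_updates,
    assuming each update overwrites one uniformly chosen sample (assumption)."""
    return ((n_samples - 1) / n_samples) ** n_updates

def simulate_survival(n_samples: int, n_updates: int, trials: int = 20000,
                      seed: int = 0) -> float:
    """Monte Carlo estimate of the same probability for the sample at index 0."""
    rng = random.Random(seed)
    survived = 0
    for _ in range(trials):
        # The tracked sample survives if index 0 is never picked for replacement.
        if all(rng.randrange(n_samples) != 0 for _ in range(n_updates)):
            survived += 1
    return survived / trials

if __name__ == "__main__":
    # Compare the closed form with the simulation for a few update counts.
    for t in (0, 10, 20, 40, 80):
        print(t, round(survival_probability(20, t), 3),
              round(simulate_survival(20, t), 3))
```

Under this assumption the replacement choice never depends on the age of a sample, so the expected remaining lifespan of a sample is the same whatever its age, which is what makes the decay memoryless.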
III. DESCRIPTION OF A UNIVERSAL BACKGROUND
SUBTRACTION TECHNIQUE: VIBE
Background subtraction techniques have to deal with at
least three considerations in order to be successful in real
applications: (1) what is the model and how does it behave,
(2) how is the model initialized, and (3) how is the model
updated over time? Answers to these questions are given in
the three subsections of this section. Most papers describe the
intrinsic model and the updating mechanism. Only a minority
of papers discuss initialization, which is critical when a fast
response is expected, as is the case inside a digital camera. In
addition, there is often a lack of coherence between the model
and the update mechanism. For example, some techniques
compare the current value of a pixel p to that of a model
b with a given tolerance T. They consider that there is a good
match if the absolute difference between p and b is lower
than T. To be adaptive over time, T is adjusted with respect
to the statistical variance of p. But the statistical variance is
estimated by a temporal average. Therefore, the adjustment
speed is dependent on the acquisition framerate and on the
number of background pixels. This is inappropriate in some
cases, for example with remote IP cameras whose framerate is
determined by the available bandwidth.
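
The framerate dependence described above can be made concrete with a small sketch. It assumes that the temporal average is an exponentially weighted running average with a per-frame learning rate `alpha`, a common choice that the text does not specify. The function names and numeric values are illustrative.

```python
import math

def frames_to_adapt(alpha: float, fraction: float = 0.95) -> int:
    """Number of frames before a running average with per-frame learning
    rate alpha has absorbed `fraction` of a step change in its input."""
    # After k frames the residual weight of the old value is (1 - alpha) ** k.
    return math.ceil(math.log(1.0 - fraction) / math.log(1.0 - alpha))

def seconds_to_adapt(alpha: float, fps: float, fraction: float = 0.95) -> float:
    """Wall-clock adaptation time: same frame count, divided by the framerate."""
    return frames_to_adapt(alpha, fraction) / fps

if __name__ == "__main__":
    # Same learning rate, different acquisition framerates.
    for fps in (30.0, 5.0, 1.0):
        print(f"{fps:>4} fps: {seconds_to_adapt(0.05, fps):.1f} s")
```

With a fixed `alpha`, the estimate adapts in the same number of frames whatever the framerate, so a camera throttled from 30 to 5 frames per second takes six times longer to adapt in wall-clock terms.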
We detail below a background subtraction technique, called
“ViBe” (for “VIsual Background Extractor”). For convenience,
we present a complete version of our algorithm in C-like
code in Appendix A.
A. Pixel model and classification process
To some extent, there is no way around the determination,
for a given color space, of a probability density function (pdf)
for every background pixel or at least the determination of
statistical parameters, such as the mean or the variance. Note
that with a Gaussian model, there is no distinction to be made
as the knowledge of the mean and variance is sufficient to
determine the pdf. While the classical approaches to background
subtraction and most mainstream techniques rely on
pdfs or statistical parameters, the question of their statistical
significance is rarely discussed, if not simply ignored. In
fact, there is no imperative to compute the pdf as long as
the goal of reaching a relevant background segmentation is
achieved. An alternative is to consider that one should enhance
statistical significance over time, and one way to proceed is to
build a model with real observed pixel values. The underlying
assumption is that this makes more sense from a stochastic
point of view, as already observed values should have a higher
probability of being observed again than would values not yet
encountered.
Like the authors of [65], we do not opt for a particular
form for the pdf, as deviations from the assumed pdf model
are ubiquitous. Furthermore, the evaluation of the pdf is a
global process and the shape of a pdf is sensitive to outliers.
In addition, the estimation of the pdf raises the non-obvious
question regarding the number of samples to be considered;
the problem of selecting a representative number of samples
is intrinsic to all the estimation processes.
If we see the problem of background subtraction as a
classification problem, we want to classify a new pixel value
with respect to its immediate neighborhood in the chosen
color space, so as to avoid the effect of any outliers. This
motivates us to model each background pixel with a set of
samples instead of with an explicit pixel model. Consequently,
no estimation of the pdf of the background pixel is performed,
and so the current value of the pixel is compared to its
closest samples within the collection of samples. This is an
important difference in comparison with existing algorithms,
in particular with those of consensus-based techniques. A new
value is compared to background samples and should be close
to some of the sample values instead of the majority of all
values. The underlying idea is that it is more reliable to
estimate the statistical distribution of a background pixel with
a small number of close values than with a large number of
samples. This is somewhat similar to ignoring the extremities
of the pdf, or to considering only the central part of the
underlying pdf by thresholding it. On the other hand, if one
trusts the values of the model, it is crucial to select background
pixel samples carefully. The classification of pixels in the
background therefore needs to be conservative, in the sense
that only background pixels should populate the background
models.
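
The difference between requiring closeness to a few samples and requiring agreement with a majority can be sketched as follows. This is a minimal illustration rather than the algorithm of Appendix A: the function name, the Euclidean distance test, the radius, and the `min_matches` threshold are assumptions chosen for the example.

```python
import math
from typing import Sequence

Color = Sequence[float]

def is_background(value: Color, samples: Sequence[Color],
                  radius: float, min_matches: int) -> bool:
    """Return True if at least min_matches stored samples lie within
    `radius` of `value` in the color space (Euclidean distance)."""
    matches = 0
    for sample in samples:
        if math.dist(value, sample) < radius:
            matches += 1
            if matches >= min_matches:
                return True  # enough close samples; stop early
    return False

if __name__ == "__main__":
    # Hypothetical pixel model: two samples near the query, two far away.
    model = [(10, 10, 10), (12, 11, 9), (200, 200, 200), (90, 90, 90)]
    pixel = (11, 10, 10)
    print(is_background(pixel, model, radius=20.0, min_matches=2))
    # A majority rule (3 of 4 samples) is stricter on the same data.
    print(is_background(pixel, model, radius=20.0, min_matches=3))
```

Only the samples close to the tested value influence the decision, so distant samples (outliers or a second background mode) neither help nor hurt the match.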
Formally, let us denote by $v(x)$ the value in a given
Euclidean color space taken by the pixel located at $x$ in the
image, and by $v_i$ a background sample value with an index
$i$. Each background pixel $x$ is modeled by a collection of $N$
background sample values

$$M(x) = \{v_1, v_2, \ldots, v_N\} \qquad (1)$$

taken in previous frames. For now, we ignore the notion of
time; this is discussed later.
To classify a pixel value $v(x)$ according to its corresponding
model $M(x)$, we compare it to the closest values within the
set of samples by defining a sphere $S_R(v(x))$ of radius $R$