Object Tracking: A Survey 9
Fig. 3. Mixture of Gaussian modeling for background subtraction. (a) Image from a sequence
in which a person is walking across the scene. (b) The means of the highest-weighted Gaussians
at each pixel's position. These means represent the most temporally persistent per-pixel color
and hence should represent the stationary background. (c) The means of the Gaussians with
the second-highest weight; these means represent colors that are observed less frequently. (d)
Background subtraction result. The foreground consists of the pixels in the current frame that
matched a low-weighted Gaussian.
4.2. Background Subtraction
Object detection can be achieved by building a representation of the scene called the
background model and then finding deviations from the model for each incoming frame.
Any significant change in an image region from the background model signifies a moving
object. The pixels constituting the regions undergoing change are marked for further
processing. Usually, a connected component algorithm is applied to obtain connected
regions corresponding to the objects. This process is referred to as background
subtraction.
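The steps above (build a background model, threshold per-pixel deviations, then group the changed pixels into connected regions) can be sketched as follows. This is a minimal illustration, not any particular paper's method: the per-pixel median background model, the threshold value, and the function names are all assumptions made for the example.

```python
import numpy as np

def background_subtract(frames, current, thresh=20.0):
    """Build a background model as the per-pixel median of past frames
    and mark pixels of `current` that deviate by more than `thresh`."""
    background = np.median(np.stack(frames).astype(float), axis=0)
    return np.abs(current.astype(float) - background) > thresh

def connected_components(mask):
    """Label 4-connected foreground regions of a boolean mask via
    iterative flood fill; returns (label image, number of regions)."""
    labels = np.zeros(mask.shape, dtype=int)
    n = 0
    for seed in zip(*np.nonzero(mask)):
        if labels[seed]:
            continue  # pixel already assigned to a region
        n += 1
        stack = [seed]
        while stack:
            y, x = stack.pop()
            if (0 <= y < mask.shape[0] and 0 <= x < mask.shape[1]
                    and mask[y, x] and labels[y, x] == 0):
                labels[y, x] = n
                stack += [(y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)]
    return labels, n
```

Each nonzero label in the output then corresponds to one detected moving region, ready for the further processing described above.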
Frame differencing of temporally adjacent frames has been well studied since the
late 1970s [Jain and Nagel 1979]. However, background subtraction became popular fol-
lowing the work of Wren et al. [1997]. In order to learn gradual changes in time, Wren
et al. propose modeling the color of each pixel, I(x, y), of a stationary background
with a single 3D (Y, U, and V color space) Gaussian, I(x, y) ∼ N(μ(x, y), Σ(x, y)). The
model parameters, the mean μ(x, y) and the covariance Σ(x, y), are learned from the
color observations in several consecutive frames. Once the background model is de-
rived, for every pixel (x, y) in the input frame, the likelihood of its color coming from
N(μ(x, y), Σ(x, y)) is computed, and the pixels that deviate from the background model
are labeled as the foreground pixels. However, a single Gaussian is not a good model for
outdoor scenes [Gao et al. 2000] since multiple colors can be observed at a certain loca-
tion due to repetitive object motion, shadows, or reflectance. A substantial improvement
in background modeling is achieved by using multimodal statistical models to describe
per-pixel background color. For instance, Stauffer and Grimson [2000] use a mixture
of Gaussians to model the pixel color. In this method, a pixel in the current frame is
checked against the background model by comparing it with every Gaussian in the
model until a matching Gaussian is found. If a match is found, the mean and vari-
ance of the matched Gaussian are updated; otherwise, a new Gaussian with mean
equal to the current pixel color and some initial variance is introduced into the mix-
ture. Each pixel is classified based on whether the matched distribution represents the
background process. Moving regions detected using this approach, along with the
background models, are shown in Figure 3.
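This per-pixel matching-and-update loop can be sketched as follows, simplified to scalar intensities. The learning rate alpha, the 2.5-standard-deviation match test, and the background-weight threshold T are common choices, but using a constant update rate (in place of the posterior-weighted rate of Stauffer and Grimson) is a simplifying assumption made here for brevity.

```python
class PixelMixture:
    """Adaptive mixture of Gaussians for one pixel, in the spirit of
    Stauffer and Grimson [2000], simplified to scalar intensities."""

    def __init__(self, max_gaussians=3, alpha=0.05, init_var=100.0, T=0.7):
        self.max_g = max_gaussians
        self.alpha = alpha        # learning rate for weights/means/variances
        self.init_var = init_var  # variance assigned to new Gaussians
        self.T = T                # cumulative weight defining the background
        self.means, self.variances, self.weights = [], [], []

    def observe(self, value):
        """Match `value` against the mixture, update the model, and
        return True if the pixel is classified as background."""
        for i, (m, v) in enumerate(zip(self.means, self.variances)):
            if abs(value - m) <= 2.5 * v ** 0.5:  # matched this Gaussian
                # Decay all weights, then reinforce the matched one.
                self.weights = [w * (1 - self.alpha) for w in self.weights]
                self.weights[i] += self.alpha
                self.means[i] += self.alpha * (value - self.means[i])
                self.variances[i] += self.alpha * (
                    (value - self.means[i]) ** 2 - self.variances[i])
                return self._is_background(i)
        # No match: introduce a new Gaussian centered on the pixel color,
        # replacing the lowest-weighted one if the mixture is full.
        if len(self.means) >= self.max_g:
            j = min(range(len(self.weights)), key=self.weights.__getitem__)
            del self.means[j], self.variances[j], self.weights[j]
        self.means.append(value)
        self.variances.append(self.init_var)
        self.weights.append(self.alpha)
        self._normalize()
        return False  # unmatched observations are foreground

    def _normalize(self):
        total = sum(self.weights)
        self.weights = [w / total for w in self.weights]

    def _is_background(self, i):
        # The highest-weighted Gaussians whose cumulative weight first
        # exceeds T model the background; check whether i is among them.
        self._normalize()
        order = sorted(range(len(self.weights)),
                       key=self.weights.__getitem__, reverse=True)
        cumulative = 0.0
        for k in order:
            cumulative += self.weights[k]
            if k == i:
                return True
            if cumulative > self.T:
                return False
        return False
```

A full background model maintains one such mixture per pixel; a persistently observed color accumulates weight and is eventually classified as background, while a novel color spawns a fresh low-weight Gaussian and is reported as foreground.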
Another approach is to incorporate region-based (spatial) scene information instead
of only using color-based information. Elgammal and Davis [2000] use nonparamet-
ric kernel density estimation to model the per-pixel background. During the sub-
traction process, the current pixel is matched not only to the corresponding pixel in
the background model, but also to the nearby pixel locations. Thus, this method can
handle camera jitter or small movements in the background. Li and Leung [2002]
fuse the texture and color features to perform background subtraction over blocks of
ACM Computing Surveys, Vol. 38, No. 4, Article 13, Publication date: December 2006.