Learning color and locality cues for moving object detection and segmentation
Feng Liu and Michael Gleicher
Department of Computer Sciences, University of Wisconsin-Madison
1210 West Dayton Street, Madison, WI, 53706
{fliu|gleicher}@cs.wisc.edu
Abstract
This paper presents an algorithm for automatically de-
tecting and segmenting a moving object from a monocular
video. Detecting and segmenting a moving object from a
video with limited object motion is challenging. Since exist-
ing automatic algorithms rely on motion to detect the mov-
ing object, they cannot work well when the object motion is
sparse and insufficient. In this paper, we present an unsu-
pervised algorithm to learn object color and locality cues
from the sparse motion information. We first detect key
frames with reliable motion cues and then estimate mov-
ing sub-objects based on these motion cues using a Markov
Random Field (MRF) framework. From these sub-objects,
we learn an appearance model as a color Gaussian Mixture
Model. To avoid the false classification of background pix-
els with similar color to the moving objects, the locations
of these sub-objects are propagated to neighboring frames
as locality cues. Finally, robust moving object segmenta-
tion is achieved by combining these learned color and lo-
cality cues with motion cues in an MRF framework. Experi-
ments on videos with a variety of object and camera motion
demonstrate the effectiveness of this algorithm.
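As a concrete illustration of the appearance model named above (this sketch is ours, not the authors' implementation), a color Gaussian Mixture Model can be fit to pixels of the detected moving sub-objects with a few EM iterations; the component count, the diagonal covariances, and all function and variable names here are our own assumptions:

```python
import numpy as np

def fit_gmm(X, k=3, iters=20, seed=0):
    """Fit a diagonal-covariance color GMM to samples X (N x 3) via EM."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    mu = X[rng.choice(n, k, replace=False)].astype(float)   # init means from data
    var = np.ones((k, d)) * (X.var(axis=0) + 1e-3)          # init variances
    pi = np.full(k, 1.0 / k)                                # mixing weights
    for _ in range(iters):
        # E-step: per-component log densities -> responsibilities (N x k)
        log_p = (-0.5 * (((X[:, None, :] - mu) ** 2) / var
                         + np.log(2 * np.pi * var)).sum(-1) + np.log(pi))
        log_p -= log_p.max(axis=1, keepdims=True)
        r = np.exp(log_p)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: re-estimate weights, means, and variances
        nk = r.sum(axis=0) + 1e-9
        pi = nk / n
        mu = (r.T @ X) / nk[:, None]
        var = (r.T @ (X ** 2)) / nk[:, None] - mu ** 2 + 1e-6
    return pi, mu, var

def gmm_loglik(X, pi, mu, var):
    """Per-sample log-likelihood under the fitted mixture (log-sum-exp)."""
    log_p = (-0.5 * (((X[:, None, :] - mu) ** 2) / var
                     + np.log(2 * np.pi * var)).sum(-1) + np.log(pi))
    m = log_p.max(axis=1, keepdims=True)
    return (m + np.log(np.exp(log_p - m).sum(axis=1, keepdims=True))).ravel()
```

A model fit to foreground pixels then scores new pixels: colors resembling the learned object receive higher log-likelihood than background colors, which is the cue combined with motion and locality in the segmentation step.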
1. Introduction
Automatically detecting and segmenting a moving ob-
ject from a monocular video is useful in many applications
like video editing, video summarization, video coding, vi-
sual surveillance, human computer interaction, etc. Many
methods have been presented (cf. [21, 9, 3, 24, 23]). Many
of them aim at a robust algorithm for extracting a moving
object from a video with rich object and camera motion.
However, extracting a moving object from a video with little
object and camera motion is also challenging. Most previ-
ous automatic methods rely on object and/or camera mo-
tion to detect the moving object. Small motion of the object
and/or camera does not provide sufficient information for
these methods.
For example, most existing methods use motion to detect
moving objects. They assume that if a compact region moves
differently from the global background motion, it most
likely belongs to a moving object. Motion-based methods
[8, 12, 21, 9, 3] usually take the detected moving pixels as
seeds, and cluster pixels into layers with consistent motions
(and consistent color and depth). When motion information
is sparse and incomplete, they cannot work robustly. For
example, Figure 1 shows an example where a boy sits on
the floor and moves only in a few frames. Even in these
frames, he moves only a part of his body. Methods using
object motion information can detect only part of
the object. For example, if we segment the object
in a popular Markov Random Field (MRF) framework, as
described in § 2.3, only the moving part of the boy’s body
is detected in frames where the part moves, and no mean-
ingful region is found in other frames, as shown in Figure 1
(b) and (c). This example shows that using object motion
alone to infer moving objects is insufficient. Similarly, in
this example, since the camera barely moves, it is also dif-
ficult for a structure from motion (SFM) algorithm as used
in methods like [24] to obtain useful depth information to
infer the moving object.
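To make the failure mode above concrete, the MRF labeling referenced here can be sketched in a toy form (our illustration, not the paper's solver): per-pixel unary costs from motion evidence, a Potts smoothness term between neighbors, and Iterated Conditional Modes (ICM) as a simple stand-in for the exact graph-cut optimization such frameworks typically use. All names and parameters are our assumptions:

```python
import numpy as np

def icm_segment(unary_fg, unary_bg, smooth=0.5, iters=5):
    """Binary MRF labeling (1 = foreground) by Iterated Conditional Modes.

    unary_fg / unary_bg: H x W costs for labeling each pixel fg / bg;
    smooth: Potts penalty paid for each 4-neighbor with a different label.
    ICM only finds a local optimum, but suffices to illustrate the model.
    """
    labels = (unary_fg < unary_bg).astype(int)  # greedy init from unaries
    H, W = labels.shape
    for _ in range(iters):
        for y in range(H):
            for x in range(W):
                # current labels of the 4-connected neighbors
                nb = [labels[yy, xx] for yy, xx in
                      ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1))
                      if 0 <= yy < H and 0 <= xx < W]
                n_fg = sum(nb)
                n_bg = len(nb) - n_fg
                cost_fg = unary_fg[y, x] + smooth * n_bg
                cost_bg = unary_bg[y, x] + smooth * n_fg
                labels[y, x] = int(cost_fg < cost_bg)
    return labels
```

With motion magnitude as the only unary cue, a coherently moving region survives the smoothness term while an isolated noisy pixel is smoothed away; but where the object does not move at all, the unaries give no foreground evidence, which is exactly why motion alone misses the static parts of the boy in Figure 1.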
Impressive results have been reported recently for bi-
layer video segmentation in the scenario of video chat-
ting [4, 23]. These algorithms can robustly segment a major
foreground object from a video with a dynamic background;
however, they are not suitable for videos with complex cam-
era motions.
Instead of building a moving object model, some other
methods build a background model to detect and segment a
moving object (cf. [5, 10, 17, 15, 18, 22]). These methods
work well for videos with static cameras. When videos have
complex camera motions, the background model is hard to
build.
This paper presents a solution that learns a moving object
model by collecting the sparse and insufficient motion in-
formation throughout the video. Specifically, we present
an unsupervised algorithm to learn the color and locality
cues of the moving object. We first detect key frames that
contain motion cues that can reliably indicate at least some