Key-Segments for Video Object Segmentation
Yong Jae Lee, Jaechul Kim, and Kristen Grauman
University of Texas at Austin
yjlee0222@utexas.edu, jaechul@cs.utexas.edu, grauman@cs.utexas.edu
Abstract
We present an approach to discover and segment foreground object(s) in video. Given an unannotated video sequence, the method first identifies object-like regions in any frame according to both static and dynamic cues. We then compute a series of binary partitions among those candidate “key-segments” to discover hypothesis groups with persistent appearance and motion. Finally, using each ranked hypothesis in turn, we estimate a pixel-level object labeling across all frames, where (a) the foreground likelihood depends on both the hypothesis’s appearance and a novel localization prior based on partial shape matching, and (b) the background likelihood depends on cues pulled from the key-segments’ (possibly diverse) surroundings observed across the sequence. Compared to existing methods, our approach automatically focuses on the persistent foreground regions of interest while resisting over-segmentation. We apply our method to challenging benchmark videos, and show competitive or better results than the state-of-the-art.
1. Introduction
Video object segmentation is the problem of automatically segmenting the objects in an unannotated video. While the unsupervised form of the problem has received relatively little attention, it is important for many potential applications, including video summarization, activity recognition, and video retrieval.

Existing unsupervised methods explore tracking regions or keypoints over time [4, 30, 5], or formulate clustering objectives to group pixels from all frames using appearance and motion cues [11, 10]. Aside from the well-known challenges associated with tracking (drift, occlusion, and initialization) and clustering (model selection and computational complexity), these methods lack an explicit notion of what a foreground object should look like in video data. Consequently, the low-level grouping of pixels usually results in over-segmentation.
Instead, we propose an approach that automatically discovers a set of key-segments to explicitly model likely foreground regions for video object segmentation. Our main idea is to leverage both static and dynamic cues to detect persistent object-like regions, and then estimate a complete segmentation of the video using those regions and a novel localization prior that uses their partial shape matches across the sequence. See Figure 1.

[Figure 1 image: Input: unannotated video; Output: segmentation of the high-ranking foreground object.]
Figure 1. Our idea is to discover a set of key-segments to automatically generate a foreground object segmentation of the video.
To implement this idea, we first introduce a measure that reflects a region’s likelihood of belonging to a foreground object. To capture object-like motion and persistence, we use dynamic inter-frame properties such as motion difference from surroundings and recurrence. Intuitively, a region that moves differently from its surroundings and appears frequently throughout the video is likely to be among the main objects of interest. Conversely, one that seldom occurs is more likely to be an uninteresting background object. To capture object-like appearance and shape, we use static properties such as a well-defined closed boundary in space and clear separation from surroundings, as recently explored in static images [8, 6, 1]. We use both aspects to group the key-segments, estimating multiple inlier/outlier partitions of the candidate regions. Each ranked partition automatically defines a foreground and background model, with which we solve for a pixel-wise segmentation using graph cuts on a space-time MRF. The rank reflects the corresponding object’s centrality to the scene.
How does key-segment discovery help video object segmentation? The key-segments are a reliable source for