CNN in MRF: Video Object Segmentation via Inference in
A CNN-Based Higher-Order Spatio-Temporal MRF
Linchao Bao Baoyuan Wu Wei Liu
Tencent AI Lab
linchaobao@gmail.com wubaoyuan1987@gmail.com wliu@ee.columbia.edu
Abstract
This paper addresses the problem of video object segmentation, where the initial object mask is given in the first frame of an input video. We propose a novel spatio-temporal Markov Random Field (MRF) model defined over pixels to handle this problem. Unlike conventional MRF models, the spatial dependencies among pixels in our model are encoded by a Convolutional Neural Network (CNN). Specifically, for a given object, the probability of a joint labeling of a set of spatially neighboring pixels can be predicted by a CNN trained for this specific object. As a result, higher-order, richer dependencies among pixels in the set can be implicitly modeled by the CNN. With temporal dependencies established by optical flow, the resulting MRF model combines both spatial and temporal cues for tackling video object segmentation. However, performing inference in this MRF model is very difficult due to the very high-order dependencies. To this end, we propose a novel CNN-embedded algorithm to perform approximate inference in the MRF. This algorithm proceeds by alternating between a temporal fusion step and a feed-forward CNN step. When initialized with an appearance-based one-shot segmentation CNN, our model outperforms the winning entries of the DAVIS 2017 Challenge, without resorting to model ensembling or any dedicated detectors.
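To make the above concrete, one way to write the energy of such a higher-order spatio-temporal MRF over pixel labels is sketched below; this is an illustrative form consistent with the description above, and the exact potentials and notation may differ from those defined later in the paper:

E(\mathbf{x}) \;=\; \sum_{(i,j)\in\mathcal{N}_T} \psi_t(x_i, x_j) \;+\; \sum_{c\in\mathcal{S}} \psi_s(\mathbf{x}_c),
\qquad
\psi_s(\mathbf{x}_c) \;=\; -\log p_{\mathrm{CNN}}(\mathbf{x}_c \mid I),

where \mathcal{N}_T collects temporally neighboring pixel pairs linked by optical flow, \mathcal{S} is a set of spatial cliques (e.g., local pixel neighborhoods), and p_{\mathrm{CNN}}(\mathbf{x}_c \mid I) is the probability that the object-specific CNN assigns to the joint labeling \mathbf{x}_c of clique c given the image I. The CNN-defined spatial potentials are of very high order, which is precisely what makes exact inference intractable and motivates the alternating approximate-inference scheme.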
1. Introduction
Video object segmentation refers to the task of extracting pixel-level masks for class-agnostic objects in videos. The task can be further divided into two settings [36], namely unsupervised and semi-supervised. While the unsupervised setting does not provide any manual annotation, the semi-supervised setting provides information about the objects of interest in the first frame of a video. In this paper, we focus on the latter setting, where the initial masks of the objects of interest are provided in the first frame. The task is important for many applications such as video editing, video summarization, and action recognition. Note that the semantic class/type of the objects of interest cannot be assumed known, and the task is thus class-agnostic. It is usually treated as a temporal label propagation problem and solved with spatio-temporal graph structures [18, 37, 45, 4] such as a Markov Random Field (MRF) model [46]. Recent advances on this task show significant improvements over traditional approaches when deep Convolutional Neural Networks (CNNs) are incorporated [13, 35, 48, 42, 15, 23, 22]. Despite the remarkable progress achieved with CNNs, video object segmentation remains challenging in real-world environments. For example, even the top performers [13, 35] on the DAVIS 2016 benchmark [36] show significantly worse performance on the more challenging DAVIS 2017 benchmark [38], where interactions between objects, occlusions, motions, and object deformations are more complex and frequent.
Reviewing the top-performing CNN-based methods and the traditional spatio-temporal graph-based methods, there is a clear gap between the two lines of work. The CNN-based methods usually treat each video frame individually or only use simple heuristics to propagate information along the temporal axis, while the well-established graph-based models cannot utilize the powerful representation capabilities of neural networks. In order to fully exploit the appearance/shape information about the given objects, as well as the temporal information flow along the time axis, a better solution should combine the best of both. For example, building on the top-performing CNN-based methods [13, 35], one could temporally average the CNN outputs of an individual frame and its neighboring frames so that the segmentation results become temporally consistent (see the sketch below). Such temporal averaging, however, is heuristic and likely to degrade the segmentation performance in the presence of outliers. A more principled method should be developed, and in this paper we propose a novel approach along this direction.
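The following is a minimal, self-contained sketch of such flow-based temporal averaging of per-frame CNN outputs. It illustrates only the heuristic baseline discussed above, not our method; `seg_cnn` (a per-frame segmentation network returning a foreground-probability map) and the flow fields `flow_to_prev`/`flow_to_next` are assumed, hypothetical inputs.

import numpy as np
from scipy.ndimage import map_coordinates

def warp(prob, flow):
    # Backward-warp a probability map (H x W) along a dense flow field (H x W x 2).
    h, w = prob.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    coords = np.stack([ys + flow[..., 1], xs + flow[..., 0]])
    return map_coordinates(prob, coords, order=1, mode='nearest')

def temporally_averaged_masks(frames, flow_to_prev, flow_to_next, seg_cnn):
    # Independent per-frame CNN predictions.
    probs = [seg_cnn(f) for f in frames]
    averaged = []
    for t, p in enumerate(probs):
        neighbors = [p]
        if t > 0:
            # Bring the previous frame's prediction into frame t's coordinates.
            neighbors.append(warp(probs[t - 1], flow_to_prev[t]))
        if t + 1 < len(probs):
            # Bring the next frame's prediction into frame t's coordinates.
            neighbors.append(warp(probs[t + 1], flow_to_next[t]))
        # Naive averaging: a single badly warped or wrong neighbor can
        # corrupt the result, which is exactly the weakness noted above.
        averaged.append(np.mean(neighbors, axis=0))
    return [(p > 0.5).astype(np.uint8) for p in averaged]

Our approach, introduced next, replaces this ad-hoc averaging with principled inference in a spatio-temporal MRF.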
Specifically, we build a spatio-temporal MRF model over a video sequence, where each random variable represents the label of a pixel. While the pairwise temporal dependencies between random variables are established us-