CNN in MRF: Video Object Segmentation via Inference in
A CNN-Based Higher-Order Spatio-Temporal MRF
Linchao Bao Baoyuan Wu Wei Liu
Tencent AI Lab
linchaobao@gmail.com wubaoyuan1987@gmail.com wliu@ee.columbia.edu
Abstract
This paper addresses the problem of video object segmentation, where the initial object mask is given in the first frame of an input video. We propose a novel spatio-temporal Markov Random Field (MRF) model defined over pixels to handle this problem. Unlike conventional MRF models, the spatial dependencies among pixels in our model are encoded by a Convolutional Neural Network (CNN). Specifically, for a given object, the probability of a joint labeling of a set of spatially neighboring pixels can be predicted by a CNN trained for this specific object. As a result, higher-order, richer dependencies among pixels in the set can be implicitly modeled by the CNN. With temporal dependencies established by optical flow, the resulting MRF model combines both spatial and temporal cues for tackling video object segmentation. However, performing inference in this MRF model is very difficult due to the very high-order dependencies. To this end, we propose a novel CNN-embedded algorithm to perform approximate inference in the MRF. This algorithm proceeds by alternating between a temporal fusion step and a feed-forward CNN step. When initialized with an appearance-based one-shot segmentation CNN, our model outperforms the winning entries of the DAVIS 2017 Challenge, without resorting to model ensembling or any dedicated detectors.
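To make the above concrete, one way to write the energy of such a higher-order spatio-temporal MRF over pixel labels is sketched below; this is an illustrative form consistent with the description above, and the exact potentials and notation may differ from those defined later in the paper:

E(\mathbf{x}) \;=\; \sum_{(i,j)\in\mathcal{N}_T} \psi_t(x_i, x_j) \;+\; \sum_{c\in\mathcal{S}} \psi_s(\mathbf{x}_c),
\qquad
\psi_s(\mathbf{x}_c) \;=\; -\log p_{\mathrm{CNN}}(\mathbf{x}_c \mid I),

where \mathcal{N}_T collects temporally neighboring pixel pairs linked by optical flow, \mathcal{S} is a set of spatial cliques (e.g., local pixel neighborhoods), and p_{\mathrm{CNN}}(\mathbf{x}_c \mid I) is the probability that the object-specific CNN assigns to the joint labeling \mathbf{x}_c of clique c given the image I. The CNN-defined spatial potentials are of very high order, which is precisely what makes exact inference intractable and motivates the alternating approximate-inference scheme.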
1. Introduction
Video object segmentation refers to the task of extracting pixel-level masks for class-agnostic objects in videos. The task can be further divided into two settings [36], namely unsupervised and semi-supervised. While the unsupervised setting does not provide any manual annotation, the semi-supervised setting provides information about the objects of interest in the first frame of a video. In this paper, we focus on the latter setting, where the initial masks of the objects of interest are provided in the first frame. The task is important for many applications such as video editing, video summarization, and action recognition. Note that the semantic class/type of the objects of interest cannot be assumed known, and the task is thus class-agnostic. It is usually treated as a temporal label propagation problem and solved with spatio-temporal graph structures [18, 37, 45, 4] such as a Markov Random Field (MRF) model [46]. Recent advances on this task show significant improvements over traditional approaches when deep Convolutional Neural Networks (CNNs) are incorporated [13, 35, 48, 42, 15, 23, 22]. Despite the remarkable progress achieved with CNNs, video object segmentation remains challenging in real-world environments. For example, even the top performers [13, 35] on the DAVIS 2016 benchmark [36] show significantly worse performance on the more challenging DAVIS 2017 benchmark [38], where interactions between objects, occlusions, motions, and object deformations are more complex and frequent.
Reviewing the top-performing CNN-based methods and the traditional spatio-temporal graph-based methods, there is a clear gap between the two lines of work. The CNN-based methods usually treat each video frame individually or only use simple heuristics to propagate information along the temporal axis, while the well-established graph-based models cannot utilize the powerful representation capabilities of neural networks. In order to fully exploit the appearance/shape information about the given objects, as well as the temporal information flow along the time axis, a better solution should combine the best of both. For example, building on the top-performing CNN-based methods [13, 35], one could temporally average the CNN outputs of an individual frame and its neighboring frames so that the segmentation results become temporally consistent (see the sketch below). Such temporal averaging, however, is heuristic and likely to degrade the segmentation performance in the presence of outliers. A more principled method should be developed, and in this paper we propose a novel approach along this direction.
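The following is a minimal, self-contained sketch of such flow-based temporal averaging of per-frame CNN outputs. It illustrates only the heuristic baseline discussed above, not our method; `seg_cnn` (a per-frame segmentation network returning a foreground-probability map) and the flow fields `flow_to_prev`/`flow_to_next` are assumed, hypothetical inputs.

import numpy as np
from scipy.ndimage import map_coordinates

def warp(prob, flow):
    # Backward-warp a probability map (H x W) along a dense flow field (H x W x 2).
    h, w = prob.shape
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    coords = np.stack([ys + flow[..., 1], xs + flow[..., 0]])
    return map_coordinates(prob, coords, order=1, mode='nearest')

def temporally_averaged_masks(frames, flow_to_prev, flow_to_next, seg_cnn):
    # Independent per-frame CNN predictions.
    probs = [seg_cnn(f) for f in frames]
    averaged = []
    for t, p in enumerate(probs):
        neighbors = [p]
        if t > 0:
            # Bring the previous frame's prediction into frame t's coordinates.
            neighbors.append(warp(probs[t - 1], flow_to_prev[t]))
        if t + 1 < len(probs):
            # Bring the next frame's prediction into frame t's coordinates.
            neighbors.append(warp(probs[t + 1], flow_to_next[t]))
        # Naive averaging: a single badly warped or wrong neighbor can
        # corrupt the result, which is exactly the weakness noted above.
        averaged.append(np.mean(neighbors, axis=0))
    return [(p > 0.5).astype(np.uint8) for p in averaged]

Our approach, introduced next, replaces this ad-hoc averaging with principled inference in a spatio-temporal MRF.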
Specifically, we build a spatio-temporal MRF model over a video sequence, where each random variable represents the label of a pixel. While the pairwise temporal dependencies between random variables are established us-