SUBMODULAR VIDEO OBJECT PROPOSAL SELECTION FOR SEMANTIC OBJECT
SEGMENTATION
Tinghuai Wang
Nokia Labs,
Nokia Technologies, Finland
ABSTRACT
Learning a data-driven spatio-temporal semantic representation of objects is the key to coherent and consistent labelling in video. This paper proposes to achieve semantic video object segmentation by learning a data-driven representation which captures the synergy of multiple instances from continuous frames. To prune noisy detections, we exploit the rich information among multiple instances and select a discriminative and representative subset. This selection process is formulated as a facility location problem solved by maximising a submodular function. Our method retrieves longer-term contextual dependencies which underpin a robust semantic video object segmentation algorithm. We present extensive experiments on a challenging dataset that demonstrate the superior performance of our approach compared with state-of-the-art methods.
Index Terms— Submodular function, semantic video object segmentation, deep learning
1. INTRODUCTION
The proliferation of user-uploaded videos which are fre-
quently associated with semantic tags provides a vast resource
for computer vision research. These semantic tags, albeit not
spatially or temporally located in the video, suggest visual
concepts appearing in the video. This social trend has led
to an increasing interest in exploring the idea of segmenting
video objects with weak supervision or labels.
Hartmann et al. [1] first formulated the problem as learning weakly supervised classifiers for a set of independent spatio-temporal segments. Tang et al. [2] learned a discriminative model by leveraging labelled positive videos and a large collection of negative examples based on a distance matrix. Liu et al. [3] extended the traditional binary classification problem to the multi-class setting and proposed a nearest neighbor-based label transfer algorithm which encourages smoothness between regions that are spatio-temporally adjacent and similar in appearance. Zhang et al. [4] utilized a pre-trained object detector to generate a set of detections and then pruned noisy detections and regions by enforcing spatio-temporal constraints.
Fig. 1. An illustration of the proposed object discovery strategy.
In contrast to previous works, we propose to learn a class-specific representation which captures the synergy of multiple instances from continuous frames. To prune noisy detections, we exploit the rich information among multiple instances and select a discriminative and representative subset. In this framework, our algorithm is able to bridge the gap between image classification and video object segmentation, leveraging the abundance of pre-trained image recognition models rather than strongly supervised object detectors.
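To make the subset selection step concrete, the following is a minimal sketch of the standard greedy algorithm for maximising a monotone submodular facility-location objective of the form F(A) = Σ_i max_{j∈A} s(i, j), the kind of formulation referred to in the abstract. The similarity matrix, the budget parameter and the function name are illustrative assumptions, not the paper's actual objective or code.

import numpy as np

def greedy_facility_location(similarity, budget):
    """Greedy maximisation of F(A) = sum_i max_{j in A} similarity[i, j].

    Because F is monotone and submodular, the greedy solution is within a
    (1 - 1/e) factor of the optimum.

    similarity : (n, n) array of non-negative pairwise proposal similarities.
    budget     : maximum number of proposals to select.
    """
    n = similarity.shape[0]
    selected = []
    coverage = np.zeros(n)  # coverage[i]: how well proposal i is represented so far
    for _ in range(min(budget, n)):
        # Marginal gain of adding each candidate j to the selected set
        gains = np.maximum(similarity, coverage[:, None]).sum(axis=0) - coverage.sum()
        gains[selected] = -np.inf  # never re-select a proposal
        best = int(np.argmax(gains))
        if gains[best] <= 0:       # no remaining candidate improves F
            break
        selected.append(best)
        coverage = np.maximum(coverage, similarity[:, best])
    return selected

For instance, with similarity computed from deep features of the scored proposals, greedy_facility_location(similarity, budget=20) would return the indices of a representative subset.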
2. OBJECT DISCOVERY
Semantic object segmentation requires not only localising objects of interest within a video, but also assigning class labels to the pixels belonging to those objects. One potential challenge of using a pre-trained image recognition model to detect objects is that any region containing the object, or even part of the object, might be “correctly” recognised, which results in a large search space for accurately localising the object. To narrow down the search for targeted objects, we adopt category-independent bottom-up object proposals [5]. The proposed object discovery strategy is illustrated in Fig. 1, and its key steps are detailed in the following sections.
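As a rough sketch of this discovery loop, the code below generates category-independent proposals per frame and scores each one with a pre-trained recognition model against the video's weak tag. The callables propose and classify are hypothetical stand-ins for the proposal generator [5] and a pre-trained classifier; they are assumptions for illustration, not part of the paper.

def discover_candidates(frames, propose, classify, video_tag, top_k=100):
    """Collect per-frame object candidates for a weakly tagged video.

    frames    : iterable of H x W x 3 frame arrays.
    propose   : hypothetical callable returning (mask, objectness) pairs of
                category-independent bottom-up proposals for one frame [5].
    classify  : hypothetical callable mapping an image region to a dict of
                class probabilities from a pre-trained recognition model.
    video_tag : the semantic tag (weak label) attached to the video.
    """
    candidates = []
    for t, frame in enumerate(frames):
        for mask, objectness in propose(frame)[:top_k]:
            region = frame * mask[..., None]   # keep only the proposal pixels
            tag_prob = classify(region).get(video_tag, 0.0)
            # Keep the recognition confidence alongside the objectness prior;
            # Sec. 2.1 describes how such cues are combined with motion information.
            candidates.append((t, mask, objectness, tag_prob))
    return candidates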
2.1. Proposal Scoring and Classification
We combine the objectness score associated with each proposal from Endres and Hoiem [5] with motion information, which serves as a context cue to characterise video objects. We follow the approach of Papazoglou and Ferrari [6], which roughly produces a binary map