Generation and Comprehension of Unambiguous Object Descriptions
Junhua Mao²∗  Jonathan Huang¹  Alexander Toshev¹  Oana Camburu³  Alan Yuille²,⁴  Kevin Murphy¹
¹Google Inc.  ²University of California, Los Angeles  ³University of Oxford  ⁴Johns Hopkins University
{mjhustc@,yuille@stat.}ucla.edu, oana-maria.camburu@cs.ox.ac.uk
{jonathanhuang,toshev,kpmurphy}@google.com
Abstract
We propose a method that can generate an unambiguous description (known as a referring expression) of a specific object or region in an image, and which can also comprehend or interpret such an expression to infer which object is being described. We show that our method outperforms previous methods that generate descriptions of objects without taking into account other potentially ambiguous objects in the scene. Our model is inspired by recent successes of deep learning methods for image captioning, but while image captioning is difficult to evaluate, our task allows for easy objective evaluation. We also present a new large-scale dataset for referring expressions, based on MS-COCO. We have released the dataset and a toolbox for visualization and evaluation; see https://github.com/mjhucla/Google_Refexp_toolbox.
1. Introduction
There has been a lot of recent interest in generating text descriptions of images (see e.g., [13, 53, 9, 5, 12, 26, 28, 40, 55, 8]). However, this problem of image captioning is fundamentally subjective and ill-posed. With so many valid ways to describe any given image, automatic captioning methods are notoriously difficult to evaluate. In particular, how can we decide that one sentence is a better description of an image than another?
In this paper, we focus on a special case of text generation given images, where the goal is to generate an unambiguous text description that applies to exactly one object or region in the image. Such a description is known as a "referring expression" [50, 52, 41, 42, 14, 19, 27]. This approach has a major advantage over generic image captioning, since there is a well-defined performance metric: a referring expression is considered good if it uniquely describes the relevant object or region within its context, such that a listener can comprehend the description and then recover the location of the original object. In addition, because of the discriminative nature of the task, referring expressions tend to be more detailed (and therefore more useful) than image captions. Finally, it is easier to collect training data to "cover" the space of reasonable referring expressions for a given object than it is for a whole image.

∗The major part of this work was done while J. Mao and O. Camburu were interns at Google Inc.

Figure 1. Illustration of our generation and comprehension system. On the left, the system takes a whole frame image and an object bounding box as input and generates the referring expression "the man who is touching his head" as output; this description is unambiguous (unlike other possible expressions, such as "the man wearing blue", which would be unclear). On the right, the system takes the whole frame image, an expression, and a set of candidate region proposals as input, and it outputs the region that corresponds to the expression (chosen region shown in red).
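To make the "well-defined performance metric" concrete, here is a minimal sketch of one conventional evaluation protocol for this setting: a generated expression counts as unambiguous if the listener's chosen box overlaps the intended box with sufficiently high intersection-over-union. The 0.5 threshold is a standard detection convention assumed for illustration, not a value quoted from this section.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) bounding boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def expression_is_good(chosen_box, true_box, thresh=0.5):
    # The description succeeds if the listener recovers (approximately)
    # the region the speaker was describing.
    return iou(chosen_box, true_box) >= thresh
```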
We consider two problems: (1) description generation, in which we must generate a text expression that uniquely pinpoints a highlighted object/region in the image, and (2) description comprehension, in which we must automatically select an object given a text expression that refers to this object (see Figure 1). Most prior work in the literature has focused exclusively on description generation (e.g., [31, 27]). Golland et al. [19] consider both generation and comprehension, but they do not process real-world images.
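One natural way to see how the two tasks relate, consistent with the description above, is that both can be driven by a single scoring function p(S | R, I) over expressions S, regions R, and images I: generation searches over expressions for a fixed region, while comprehension searches over candidate regions for a fixed expression. The sketch below illustrates this symmetry; the `score` callable is a hypothetical stand-in for a learned model, not the paper's implementation.

```python
from typing import Callable, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h) bounding box
Scorer = Callable[[str, Box, object], float]  # stand-in for p(S | R, I)

def comprehend(expression: str, image, candidates: Sequence[Box],
               score: Scorer) -> Box:
    """Description comprehension: pick the candidate region that the
    model scores highest for the given expression."""
    return max(candidates, key=lambda box: score(expression, box, image))

def generate(candidate_expressions: Sequence[str], image, region: Box,
             score: Scorer) -> str:
    """Description generation: among candidate expressions (e.g., from
    beam search over a decoder), pick the highest-scoring one for the
    target region."""
    return max(candidate_expressions,
               key=lambda s: score(s, region, image))
```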
In this paper, we jointly model both tasks of description generation and comprehension, using state-of-the-art deep learning approaches to handle real images and text. Specifically, our model is based upon recently developed methods that combine convolutional neural networks (CNNs) with recurrent neural networks (RNNs). We demonstrate that our model outperforms a baseline which generates referring expressions without regard to the listener who must comprehend the expression. We also show that our model can be trained in a semi-supervised fashion, by automatically generating descriptions for image regions.
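For readers unfamiliar with the CNN+RNN pattern, the following is a simplified sketch of how such a model can assign a log-probability to an expression given a region feature: a CNN feature vector for the region is projected into the decoder's input space and fed as the first step of an LSTM, which then predicts the expression token by token. This assumes PyTorch and is an illustrative stand-in, not the authors' architecture (which, for instance, may also condition on whole-image and location features).

```python
import torch
import torch.nn as nn

class RegionExpressionScorer(nn.Module):
    """Toy CNN+LSTM scorer: log p(expression | region feature)."""

    def __init__(self, vocab_size: int, feat_dim: int = 2048, hid: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hid)
        self.proj = nn.Linear(feat_dim, hid)  # map CNN region feature into LSTM input space
        self.lstm = nn.LSTM(hid, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab_size)

    def log_prob(self, region_feat: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        """region_feat: (B, feat_dim); tokens: (B, T) token ids.
        Returns (B,) summed per-token log-probabilities."""
        v = self.proj(region_feat).unsqueeze(1)     # (B, 1, hid): image step
        w = self.embed(tokens[:, :-1])              # (B, T-1, hid): shifted inputs
        h, _ = self.lstm(torch.cat([v, w], dim=1))  # (B, T, hid)
        logp = self.out(h).log_softmax(-1)          # (B, T, vocab)
        # At step t the model predicts token t, so gather all T targets.
        return logp.gather(2, tokens.unsqueeze(2)).squeeze(2).sum(1)
```

Such a scorer could serve as the `score` callable in the earlier sketch, with generation realized by beam search over the decoder and comprehension by ranking candidate regions.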
Being able to generate and comprehend object descrip-
tions is critical in a number of applications that use nat-