Scene graph captioner: Image captioning based on structural visual representation

Ning Xu, An-An Liu*, Jing Liu*, Weizhi Nie, Yuting Su
School of Electrical and Information Engineering, Tianjin University, Tianjin, China
Article history:
Received 16 May 2018
Revised 26 November 2018
Accepted 11 December 2018
Available online 14 December 2018

Keywords: Image captioning; Scene graph; Structural representation; Attention

Abstract
While deep neural networks have recently achieved promising results on the image captioning task, they do not explicitly use the structural visual and textual knowledge within an image. In this work, we propose the Scene Graph Captioner (SGC) framework for the image captioning task, which captures the comprehensive structural semantics of a visual scene by explicitly modeling objects, the attributes of objects, and the relationships between objects. First, we develop an approach to generate the scene graph by learning individual modules on large object, attribute, and relationship datasets. SGC then incorporates high-level graph information and visual attention information into a deep captioning framework. Specifically, we propose a novel framework to embed a scene graph into a structural representation that captures both the semantic concepts and the graph topology. Further, we develop a scene-graph-driven method to generate the attention graph by exploiting the high internal homogeneity and external inhomogeneity among the nodes in the scene graph. Finally, an LSTM-based framework translates this information into text. We evaluate the proposed framework on a held-out MSCOCO dataset.
© 2018 Elsevier Inc. All rights reserved.
1. Introduction

In recent years, deep recurrent neural network methods have demonstrated promising performance on the task of generating descriptions for images and videos [1–7]. From a structural viewpoint, describing the comprehensive contents of an image first requires a formalized way to represent the scene. This representation must be powerful enough to describe the rich variety of scenes that can exist, without being too cumbersome. Unfortunately, current systems [8,5,9–14] fail to use the structural nature of the image, as shown in Fig. 1.
To solve this problem, we assume that a computer vision system should explicitly represent the objects, attributes, and relationships within an image. Zitnick et al. made important steps toward this goal by studying abstract scenes composed of clip-art [15–17], though these were limited to cartoon images. Johnson et al. used the scene graph as a query to retrieve semantically related real-world images [18], though with human-generated scene graphs. Nevertheless, these works demonstrated that accurate recognition of detailed semantics can benefit scene understanding.
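To make this representation concrete, the following is a minimal Python sketch of the structure a scene graph encodes: a set of objects, per-object attributes, and directed relationship triples. The class and field names are illustrative, not the authors' implementation.

```python
# A minimal sketch (illustrative names, not the paper's implementation) of
# what a scene graph encodes: objects, per-object attributes, and directed
# relationship triples.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class SceneGraph:
    objects: List[str]                      # node labels, e.g. ["man", "horse"]
    attributes: Dict[int, List[str]] = field(default_factory=dict)
    relations: List[Tuple[int, str, int]] = field(default_factory=list)
    # each relation is a (subject index, predicate, object index) triple

# Encoding "a young man riding a brown horse":
g = SceneGraph(objects=["man", "horse"])
g.attributes = {0: ["young"], 1: ["brown"]}
g.relations.append((0, "riding", 1))
```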
Meanwhile, describing the content of an image in properly formed sentences is a very challenging task. RNN-based methods that directly translate image features into text appear in many captioning works [2,1,4,19], but they do not develop high-level semantic concepts. Recently, more sophisticated approaches [9,20–23] have been developed to leverage well-studied object or action recognition. Even so, they do not integrate the relationships between objects, their attributes, and their locations to drive natural language generation.
Using the scene graph to tackle image captioning would be a major leap forward, but it involves two main challenges: (1) constructing a scene graph that captures dense annotations of objects, attributes, and relationships within one image; and (2) integrating the scene graph into image captioning, which is difficult because the interactions between objects, as well as their localizations, can be highly complex, going beyond simple pairwise relations.
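As a hedged illustration of challenge (1), the sketch below assembles such a graph from per-element predictions. The function name and the toy inputs are hypothetical stand-ins for the outputs of the learned object, attribute, and relationship modules developed in this paper; it reuses the SceneGraph class from the previous sketch.

```python
# Hypothetical assembly of a scene graph from per-element predictions.
# The inputs are toy values standing in for learned detector outputs;
# SceneGraph is the class from the previous sketch.
def build_scene_graph(labels, attribute_preds, relation_preds):
    graph = SceneGraph(objects=list(labels))
    graph.attributes = dict(enumerate(attribute_preds))
    graph.relations = list(relation_preds)    # (subject, predicate, object)
    return graph

graph = build_scene_graph(
    labels=["man", "horse"],                  # object module output
    attribute_preds=[["young"], ["brown"]],   # attribute module output
    relation_preds=[(0, "riding", 1)],        # relationship module output
)
```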
To address these challenges, we first propose an approach that infers the scene graph by learning a set of visual detection modules. Specifically, we formalize the element predictions (i.e., objects, attributes, and relationships) as individual tasks and build up the topological structure of the scene graph. Second, we propose the Scene Graph Captioner (SGC), which integrates these complex elements, together with the interactions between objects and their localizations, to generate descriptions that convey the linguistic logic present in the image. Unlike previous captioners
https://doi.org/10.1016/j.jvcir.2018.12.027
This article is part of the Special Issue on Multimodal Cooperation.
* Corresponding authors. E-mail addresses: anan0422@gmail.com (A.-A. Liu), jliu_tju@tju.edu.cn (J. Liu).