Scene graph captioner: Image captioning based on structural visual representation

Ning Xu, An-An Liu*, Jing Liu*, Weizhi Nie, Yuting Su
School of Electrical and Information Engineering, Tianjin University, Tianjin, China
Article history:
Received 16 May 2018
Revised 26 November 2018
Accepted 11 December 2018
Available online 14 December 2018

Keywords: Image captioning; Scene graph; Structural representation; Attention

Abstract
While deep neural networks have recently achieved promising results on the image captioning task, they do not explicitly use the structural visual and textual knowledge within an image. In this work, we propose the Scene Graph Captioner (SGC) framework for the image captioning task, which captures the comprehensive structural semantics of a visual scene by explicitly modeling objects, the attributes of objects, and the relationships between objects. First, we develop an approach to generate the scene graph by learning individual modules on large object, attribute, and relationship datasets. SGC then incorporates high-level graph information and visual attention information into a deep captioning framework. Specifically, we propose a novel framework to embed a scene graph into a structural representation that captures both the semantic concepts and the graph topology. Further, we develop a scene-graph-driven method to generate the attention graph by exploiting the high internal homogeneity and external inhomogeneity among the nodes in the scene graph. Finally, an LSTM-based framework translates this information into text. We evaluate the proposed framework on a held-out MSCOCO dataset.
© 2018 Elsevier Inc. All rights reserved.
1. Introduction

In recent years, deep recurrent neural network methods have demonstrated promising performance on the task of generating descriptions for images and videos [1–7]. From a structural viewpoint, describing the comprehensive contents of an image first requires a formalized way to represent the scene. This representation must be powerful enough to describe the rich variety of scenes that can exist, without being too cumbersome. Unfortunately, current systems [8,5,9–14] fail to use the structural nature of the image, as shown in Fig. 1.
To solve this problem, we assume that a computer vision system should explicitly represent the objects, attributes, and relationships within an image. Zitnick et al. made important steps toward this goal by studying abstract scenes composed of clip-art [15–17], though these were limited to cartoon images. Johnson et al. used the scene graph as a query to retrieve semantically related real-world images [18], though with human-generated scene graphs. Nevertheless, these works demonstrated that accurate recognition of detailed semantics can benefit scene understanding.
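To make this representation concrete, the following is a minimal Python sketch of the structure a scene graph encodes: a set of objects, per-object attributes, and directed relationship triples. The class and field names are illustrative, not the authors' implementation.

```python
# A minimal sketch (illustrative names, not the paper's implementation) of
# what a scene graph encodes: objects, per-object attributes, and directed
# relationship triples.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class SceneGraph:
    objects: List[str]                      # node labels, e.g. ["man", "horse"]
    attributes: Dict[int, List[str]] = field(default_factory=dict)
    relations: List[Tuple[int, str, int]] = field(default_factory=list)
    # each relation is a (subject index, predicate, object index) triple

# Encoding "a young man riding a brown horse":
g = SceneGraph(objects=["man", "horse"])
g.attributes = {0: ["young"], 1: ["brown"]}
g.relations.append((0, "riding", 1))
```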
Meanwhile, describing the content of an image in properly formed sentences is a very challenging task. RNN-based methods that directly translate image features into text appear in many captioning works [2,1,4,19], but they do not develop high-level semantic concepts. Recently, more sophisticated approaches [9,20–23] have been developed to leverage well-studied object or action recognition. Even so, they do not integrate the relationships between objects, their attributes, and their locations to drive natural language generation.
Using the scene graph to tackle image captioning would be a major leap forward, but it involves two main challenges: (1) constructing a scene graph that captures dense annotations of objects, attributes, and relationships within one image; and (2) integrating the scene graph into image captioning, which is difficult because the interactions between objects, as well as their localizations, can be highly complex, going beyond simple pairwise relations.
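As a hedged illustration of challenge (1), the sketch below assembles such a graph from per-element predictions. The function name and the toy inputs are hypothetical stand-ins for the outputs of the learned object, attribute, and relationship modules developed in this paper; it reuses the SceneGraph class from the previous sketch.

```python
# Hypothetical assembly of a scene graph from per-element predictions.
# The inputs are toy values standing in for learned detector outputs;
# SceneGraph is the class from the previous sketch.
def build_scene_graph(labels, attribute_preds, relation_preds):
    graph = SceneGraph(objects=list(labels))
    graph.attributes = dict(enumerate(attribute_preds))
    graph.relations = list(relation_preds)    # (subject, predicate, object)
    return graph

graph = build_scene_graph(
    labels=["man", "horse"],                  # object module output
    attribute_preds=[["young"], ["brown"]],   # attribute module output
    relation_preds=[(0, "riding", 1)],        # relationship module output
)
```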
To address these challenges, we first propose an approach that infers the scene graph by learning a set of visual detection modules. Specifically, we formalize the element predictions (i.e., objects, attributes, and relationships) as individual tasks and build up the topological structure of the scene graph. Second, we propose the Scene Graph Captioner (SGC), which integrates these complex elements, together with the interactions between objects and their localizations, to generate descriptions that convey the linguistic logic present in the image. Unlike previous captioners
https://doi.org/10.1016/j.jvcir.2018.12.027
This article is part of the Special Issue on Multimodal Cooperation.
* Corresponding authors. E-mail addresses: anan0422@gmail.com (A.-A. Liu), jliu_tju@tju.edu.cn (J. Liu).