VrR-VG: Refocusing Visually-Relevant Relationships

Yuanzhi Liang^{1,2,∗}, Yalong Bai^{2}, Wei Zhang^{2}, Xueming Qian^{1}, Li Zhu^{1}, and Tao Mei^{2}

^{1} Xi'an Jiaotong University    ^{2} JD AI Research, Beijing, China

liangyzh13@stu.xjtu.edu.cn, ylbai@outlook.com, wzhang.cu@gmail.com, {qianxm, zhuli}@mail.xjtu.edu.cn, tmei@live.com

∗ This work was performed at JD AI Research.
Abstract

Relationships encode the interactions among individual instances and play a critical role in deep visual scene understanding. Because many relationships are highly predictable from non-visual information, existing methods tend to fit the statistical bias rather than "learning" to "infer" the relationships from images. To encourage further development in visual relationships, we propose a novel method to automatically mine more valuable relationships by pruning visually-irrelevant ones. We construct a new scene-graph dataset named Visually-Relevant Relationships Dataset (VrR-VG) based on Visual Genome. Compared with existing datasets, the performance gap between learnable and statistical methods is more significant on VrR-VG, and frequency-based analysis no longer works. Moreover, we propose to learn a relationship-aware representation by jointly considering instances, attributes, and relationships. By applying the relationship-aware features learned on VrR-VG, the performance of image captioning and visual question answering is systematically improved by a large margin, which demonstrates the gain of our dataset and the feature embedding schema. VrR-VG is available via http://vrr-vg.com/.
1. Introduction

Although visual perception tasks (e.g., classification, detection) have witnessed great advancement in the past decade, visual cognition tasks (e.g., image captioning, question answering) are still limited by the difficulty of reasoning [16]. Existing vision tasks are mostly based on the analysis of individual objects. However, a natural image usually consists of multiple instances in a scene, and most of them are related in some way.
Figure 1. Example scene graphs in VG150 (left) and VrR-VG (right, ours). More visually-relevant relationships are included in VrR-VG.
To fully comprehend a visual image, a holistic view is required to understand the relationships and interactions among object instances.

Visual relationships [19, 6, 33, 38, 40], which encode the interplay between individual instances, have become an indispensable factor for visual cognitive tasks such as image captioning [36] and visual question answering (VQA) [21]. In the existing literature, visual relationships are mostly represented as a scene graph (Fig. 1): a node represents a specific instance (either as a subject or an object), and an edge encodes the relation label (r) between a subject (s) and an object (o). Equivalently, a scene graph can also be represented as a set of triplets ⟨s, r, o⟩. Recently, extensive research efforts [33, 38, 20, 35] have been devoted to scene graph generation, which aims to extract the scene graph from an image (Fig. 1). Essentially, scene graph generation bridges the gap between visual perception and high-level cognition.
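For illustration, a minimal Python sketch of a scene graph stored as a set of ⟨s, r, o⟩ triplets could look as follows; the instance and relation labels (e.g., "person", "walk toward") are hypothetical and only loosely echo the Fig. 1 example:

# Minimal sketch: a scene graph as a set of <subject, relation, object> triplets.
# Labels below are illustrative only, loosely following the Fig. 1 example.
from collections import namedtuple

Triplet = namedtuple("Triplet", ["subject", "relation", "object"])

scene_graph = {
    Triplet("person", "walk toward", "train"),
    Triplet("sign", "hang on", "platform"),
    Triplet("people", "leave", "platform"),
}

def relations_of(graph, subject):
    # Return (relation, object) pairs in which the given instance is the subject.
    return [(t.relation, t.object) for t in graph if t.subject == subject]

print(relations_of(scene_graph, "person"))  # [('walk toward', 'train')]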
Among the datasets [26, 16, 19, 34, 24] adopted in visual relationship research, Visual Genome (VG) [16] provides the largest set of relationship annotations, which are both large-scale (2.3 million relationships) and dense (21 relationships per image). However, the relationships