Image Retrieval by Dense Caption Reasoning
Xinru Wei, Yonggang Qi, Jun Liu, Fang Liu
School of Communication and Information Engineering, BUPT, Beijing, China
Abstract—Humans tend to understand an image scene by recognizing its visual elements and then conjecturing and inferring from them, which enables them to search for relevant images. In this paper, we study the problem of complex image retrieval by reasoning over dense image captions, which resembles the way humans perceive images when searching for them. Specifically, we transform the problem of complex image retrieval into a dense captioning and scene graph matching problem, using structured language descriptions for retrieval. Experimental results on a newly proposed large-scale content-based image retrieval dataset demonstrate the effectiveness of our method.
Index Terms—Image Retrieval, Dense Caption Reasoning,
Captioning, Scene Graph Matching, Deep Learning
I. INTRODUCTION
Retrieving images from a visual query is one of the most attractive vision problems; it aims to search for images by reasoning about the visual elements of the query image. This is very challenging, since an ideal retriever should not only understand the whole scene but also capture its contents in detail. A large body of prior work addresses this task.
Traditional methods for content-based image retrieval often utilize low-level visual feature representations of color, shape and appearance, such as SIFT [1], HOG [2] and Fisher vectors [3]. Many others rely on richer representations, e.g., bags of features [4] and spatial pyramids [5]. However, these efforts share an obvious drawback: a semantic gap exists between the extracted hand-crafted features and the wealth of high-level human perceptions of the stimulus images. There are two main reasons for this: (i) the visual variation in real images is very large and can hardly be handled by low-level features, and (ii) people often search for images after inferring, that is, they tend to conjecture that different visual concepts are relevant; for example, “food, forks, knives and plates” might be evidence for inferring “kitchen”, “restaurant” or even “family gathering”.
Recently, there has been much interest in handling CBIR (Content-Based Image Retrieval) by matching the visual elements of images in the form of natural language, where image captioning plays the key role. Image captioning [6][7][8] achieves convincing performance thanks to the power of deep learning techniques. It significantly expands the complexity of the label space from a small fixed set of categories to sequences of words, which are able to express far
Fig. 1. Upper left: query image. Upper right: part of the scene graph of the query image. Below: example retrieved relevant images, which contain visual concepts very similar to those of the query image, such as “fruits”, “food”, “dish washer”, “microwave” and “light wooden storage”.
richer visual concepts contained in images. Inspired by this,
we treat CBIR as a caption generation and matching problem
in this paper.
Caption matching is critical for ranking images given the produced query and candidate captions. It is a text matching problem in the field of NLP (Natural Language Processing), and traditional approaches to text matching include string-based [9], corpus-based [10] and knowledge-based [10] methods. However, these methods are not designed for image caption matching, which concentrates on matching the structured visual elements of images, i.e., objects, the interactions between objects, and the attributes of objects. Therefore, a scene graph construction and matching strategy is presented to handle this problem; a minimal sketch of the structured representation we have in mind is given below.
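As an illustration only (the class and method names here are hypothetical and not taken from the paper), a scene graph parsed from dense captions can be stored as a set of attributed objects plus (subject, relation, object) triples:

from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class SceneObject:
    # A visual element mentioned in a region caption, e.g. "microwave".
    name: str
    attributes: Set[str] = field(default_factory=set)   # e.g. {"large", "wooden"}


@dataclass
class SceneGraph:
    # Objects plus (subject, relation, object) triples parsed from dense captions.
    objects: List[SceneObject] = field(default_factory=list)
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_triple(self, subj: str, rel: str, obj: str) -> None:
        # Register both entities (if unseen) and the relation between them.
        for name in (subj, obj):
            if all(o.name != name for o in self.objects):
                self.objects.append(SceneObject(name))
        self.relations.append((subj, rel, obj))


# Triples that a caption parser might extract from the region description
# "a large microwave on a wooden storage cabinet".
graph = SceneGraph()
graph.add_triple("microwave", "on", "cabinet")
next(o for o in graph.objects if o.name == "microwave").attributes.add("large")
next(o for o in graph.objects if o.name == "cabinet").attributes.update({"wooden", "storage"})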
In this paper we address the problem of image retrieval by generating and matching image captions (see Fig. 1). Specifically, for a given image: (i) a dense set of descriptions is generated across image regions, (ii) a scene graph is constructed by structuring the produced natural language descriptions in terms of objects, relationships and attributes, and (iii) images are ranked according to their scene graph similarity, computed with visual concept embeddings that can measure the semantic distance between any pair of concepts; a simplified sketch of this ranking step follows.
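The concrete similarity measure is defined later in the paper; the following sketch only conveys the general idea under assumptions of our own (the embed() stand-in, the greedy triple matching and all names below are illustrative, not the paper's formulation):

import numpy as np


def embed(concept: str, dim: int = 50) -> np.ndarray:
    # Stand-in for a real visual concept embedding (e.g. pretrained word
    # vectors): a pseudo-random vector keyed on the string, consistent
    # within a single run.
    rng = np.random.default_rng(abs(hash(concept)) % (2 ** 32))
    return rng.standard_normal(dim)


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))


def triple_similarity(t1, t2) -> float:
    # Compare (subject, relation, object) triples slot by slot in embedding space.
    return float(np.mean([cosine(embed(a), embed(b)) for a, b in zip(t1, t2)]))


def scene_graph_similarity(query_triples, cand_triples) -> float:
    # Greedily match each query triple to its most similar candidate triple
    # and average the best scores; candidate images are then ranked by this value.
    if not query_triples or not cand_triples:
        return 0.0
    best = [max(triple_similarity(q, c) for c in cand_triples) for q in query_triples]
    return float(np.mean(best))


query = [("microwave", "on", "cabinet"), ("glass", "on", "table")]
candidate = [("oven", "on", "counter"), ("fruit", "in", "bowl")]
print(scene_graph_similarity(query, candidate))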
In addition, we propose a novel large-scale CBIR dataset. Existing CBIR datasets either come from classification benchmarks, e.g., the VOC challenge dataset [11], which only covers simple scenes, or, as in [12], contain complex scene images but lack explicit annotations of their similarities. Therefore, to facilitate CBIR in complex image scenes, we select 10,000 real images from the Visual Genome dataset [13], and for each of these images we manually labeled 100 of its most similar images