Generation and Comprehension of Unambiguous Object Descriptions
Junhua Mao²∗  Jonathan Huang¹  Alexander Toshev¹  Oana Camburu³  Alan Yuille²,⁴  Kevin Murphy¹
¹Google Inc.  ²University of California, Los Angeles  ³University of Oxford  ⁴Johns Hopkins University
{mjhustc@,yuille@stat.}ucla.edu, oana-maria.camburu@cs.ox.ac.uk
{jonathanhuang,toshev,kpmurphy}@google.com
Abstract
We propose a method that can generate an unambiguous description (known as a referring expression) of a specific object or region in an image, and which can also comprehend or interpret such an expression to infer which object is being described. We show that our method outperforms previous methods that generate descriptions of objects without taking into account other potentially ambiguous objects in the scene. Our model is inspired by recent successes of deep learning methods for image captioning, but while image captioning is difficult to evaluate, our task allows for easy objective evaluation. We also present a new large-scale dataset for referring expressions, based on MS-COCO. We have released the dataset and a toolbox for visualization and evaluation; see https://github.com/mjhucla/Google_Refexp_toolbox.
1. Introduction
There has been a lot of recent interest in generating text descriptions of images (see e.g., [13, 53, 9, 5, 12, 26, 28, 40, 55, 8]). However, this problem of image captioning is fundamentally subjective and ill-posed. With so many valid ways to describe any given image, automatic captioning methods are notoriously difficult to evaluate. In particular, how can we decide that one sentence is a better description of an image than another?
In this paper, we focus on a special case of text generation given images, where the goal is to generate an unambiguous text description that applies to exactly one object or region in the image. Such a description is known as a "referring expression" [50, 52, 41, 42, 14, 19, 27]. This approach has a major advantage over generic image captioning, since there is a well-defined performance metric: a referring expression is considered good if it uniquely describes the relevant object or region within its context, such that a listener can comprehend the description and then recover the location of the original object. In addition, because of the discriminative nature of the task, referring expressions tend to be more detailed (and therefore more useful) than image captions. Finally, it is easier to collect training data to "cover" the space of reasonable referring expressions for a given object than it is for a whole image.

∗The major part of this work was done while J. Mao and O. Camburu were interns at Google Inc.

Figure 1. Illustration of our generation and comprehension system. On the left, the system takes a whole frame image and an object bounding box as input and generates the referring expression "the man who is touching his head" as output; this description is unambiguous (unlike other possible expressions, such as "the man wearing blue", which would be unclear). On the right, the system takes the whole frame image, an expression, and a set of candidate region proposals as input, and it outputs the region that corresponds to the expression (chosen region shown in red).
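To make the "well-defined performance metric" concrete, here is a minimal sketch of one conventional evaluation protocol for this setting: a generated expression counts as unambiguous if the listener's chosen box overlaps the intended box with sufficiently high intersection-over-union. The 0.5 threshold is a standard detection convention assumed for illustration, not a value quoted from this section.

```python
def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) bounding boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    ix1, iy1 = max(ax, bx), max(ay, by)
    ix2, iy2 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def expression_is_good(chosen_box, true_box, thresh=0.5):
    # The description succeeds if the listener recovers (approximately)
    # the region the speaker was describing.
    return iou(chosen_box, true_box) >= thresh
```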
We consider two problems: (1) description generation, in which we must generate a text expression that uniquely pinpoints a highlighted object/region in the image, and (2) description comprehension, in which we must automatically select an object given a text expression that refers to this object (see Figure 1). Most prior work in the literature has focused exclusively on description generation (e.g., [31, 27]). Golland et al. [19] consider both generation and comprehension, but they do not process real-world images.
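One natural way to see how the two tasks relate, consistent with the description above, is that both can be driven by a single scoring function p(S | R, I) over expressions S, regions R, and images I: generation searches over expressions for a fixed region, while comprehension searches over candidate regions for a fixed expression. The sketch below illustrates this symmetry; the `score` callable is a hypothetical stand-in for a learned model, not the paper's implementation.

```python
from typing import Callable, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x, y, w, h) bounding box
Scorer = Callable[[str, Box, object], float]  # stand-in for p(S | R, I)

def comprehend(expression: str, image, candidates: Sequence[Box],
               score: Scorer) -> Box:
    """Description comprehension: pick the candidate region that the
    model scores highest for the given expression."""
    return max(candidates, key=lambda box: score(expression, box, image))

def generate(candidate_expressions: Sequence[str], image, region: Box,
             score: Scorer) -> str:
    """Description generation: among candidate expressions (e.g., from
    beam search over a decoder), pick the highest-scoring one for the
    target region."""
    return max(candidate_expressions,
               key=lambda s: score(s, region, image))
```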
In this paper, we jointly model both tasks of description generation and comprehension, using state-of-the-art deep learning approaches to handle real images and text. Specifically, our model is based upon recently developed methods that combine convolutional neural networks (CNNs) with recurrent neural networks (RNNs). We demonstrate that our model outperforms a baseline which generates referring expressions without regard to the listener who must comprehend the expression. We also show that our model can be trained in a semi-supervised fashion, by automatically generating descriptions for image regions.
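For readers unfamiliar with the CNN+RNN pattern, the following is a simplified sketch of how such a model can assign a log-probability to an expression given a region feature: a CNN feature vector for the region is projected into the decoder's input space and fed as the first step of an LSTM, which then predicts the expression token by token. This assumes PyTorch and is an illustrative stand-in, not the authors' architecture (which, for instance, may also condition on whole-image and location features).

```python
import torch
import torch.nn as nn

class RegionExpressionScorer(nn.Module):
    """Toy CNN+LSTM scorer: log p(expression | region feature)."""

    def __init__(self, vocab_size: int, feat_dim: int = 2048, hid: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hid)
        self.proj = nn.Linear(feat_dim, hid)  # map CNN region feature into LSTM input space
        self.lstm = nn.LSTM(hid, hid, batch_first=True)
        self.out = nn.Linear(hid, vocab_size)

    def log_prob(self, region_feat: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
        """region_feat: (B, feat_dim); tokens: (B, T) token ids.
        Returns (B,) summed per-token log-probabilities."""
        v = self.proj(region_feat).unsqueeze(1)     # (B, 1, hid): image step
        w = self.embed(tokens[:, :-1])              # (B, T-1, hid): shifted inputs
        h, _ = self.lstm(torch.cat([v, w], dim=1))  # (B, T, hid)
        logp = self.out(h).log_softmax(-1)          # (B, T, vocab)
        # At step t the model predicts token t, so gather all T targets.
        return logp.gather(2, tokens.unsqueeze(2)).squeeze(2).sum(1)
```

Such a scorer could serve as the `score` callable in the earlier sketch, with generation realized by beam search over the decoder and comprehension by ranking candidate regions.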
Being able to generate and comprehend object descrip-
tions is critical in a number of applications that use nat-