Visual Question Answering: A Survey of Methods and Datasets
Qi Wu, Damien Teney, Peng Wang, Chunhua Shen*, Anthony Dick, Anton van den Hengel
School of Computer Science, The University of Adelaide, SA 5005, Australia
e-mail: firstname.lastname@adelaide.edu.au
*Corresponding author
Abstract
Visual Question Answering (VQA) is a challenging task that has received increasing attention from both the computer
vision and the natural language processing communities. Given an image and a question in natural language, it requires
reasoning over visual elements of the image and general knowledge to infer the correct answer. In the first part of
this survey, we examine the state of the art by comparing modern approaches to the problem. We classify methods
by their mechanism to connect the visual and textual modalities. In particular, we examine the common approach
of combining convolutional and recurrent neural networks to map images and questions to a common feature space.
We also discuss memory-augmented and modular architectures that interface with structured knowledge bases. In the
second part of this survey, we review the datasets available for training and evaluating VQA systems. The various
datasets contain questions at different levels of complexity, which require different capabilities and types of reasoning.
We examine in depth the question/answer pairs from the Visual Genome project, and evaluate the relevance of the
structured annotations of images with scene graphs for VQA. Finally, we discuss promising future directions for the
field, in particular the connection to structured knowledge bases and the use of natural language processing models.
Keywords: Visual Question Answering, Natural Language Processing, Knowledge Bases, Recurrent Neural
Networks
Contents
1 Introduction 2
2 Methods for VQA 3
2.1 Joint embedding approaches . . . . . . 3
2.2 Attention mechanisms . . . . . . . . . 6
2.3 Compositional Models . . . . . . . . . 7
2.3.1 Neural Module Networks . . . . 8
2.3.2 Dynamic Memory Networks . . 9
2.4 Models using external knowledge bases 9
3 Datasets and evaluation 10
3.1 Datasets of natural images . . . . . . . 11
3.2 Datasets of clipart images . . . . . . . . 16
3.3 Knowledge base-enhanced datasets . . . 17
3.4 Other datasets . . . . . . . . . . . . . . 18
4 Structured scene annotations for VQA 18
5 Discussion and future directions 21
6 Conclusion 22
Preprint submitted to Elsevier July 21, 2016
arXiv:1607.05910v1 [cs.CV] 20 Jul 2016
1. Introduction
Visual question answering is a task that was pro-
posed to connect computer vision and natural language
processing (NLP), to stimulate research, and push the
boundaries of both fields. On the one hand, computer
vision studies methods for acquiring, processing, and
understanding images. In short, its aim is to teach
machines how to see. On the other hand, NLP
is the field concerned with enabling interactions be-
tween computers and humans in natural language, i.e.
teaching machines how to read, among other tasks.
Both computer vision and NLP belong to the domain
of artificial intelligence and they share similar methods
rooted in machine learning. However, they have histor-
ically developed separately. Both fields have seen sig-
nificant advances towards their respective goals in the
past few decades, and the combined explosive growth
of visual and textual data is pushing towards a mar-
riage of efforts from both fields. For example, re-
search in image captioning, i.e. automatic image de-
scription [15, 35, 54, 77, 93, 85] has produced power-
ful methods for jointly learning from image and text in-
puts to form higher-level representations. A successful
approach is to combine convolutional neural networks
(CNNs), trained on object recognition, with word em-
beddings, trained on large text corpora.
In the most common form of Visual Question An-
swering (VQA), the computer is presented with an im-
age and a textual question about this image (see exam-
ples in Figures 3–5). It must then determine the correct
answer, typically a few words or a short phrase. Vari-
ants include binary (yes/no) [3, 98] and multiple-choice
settings [3, 100], in which candidate answers are pro-
posed. A closely related task is to “fill in the blank”
[95], where a statement describing the image must be completed with one or several missing words. Such statements essentially amount to questions phrased in declarative form. A major distinction between VQA and
other tasks in computer vision is that the question to be
answered is not determined until run time. In traditional
problems such as segmentation or object detection, the
single question to be answered by an algorithm is pre-
determined and only the input image changes. In VQA,
in contrast, the form that the question will take is un-
known, as is the set of operations required to answer
it. In this sense, it more closely reflects the challenge
of general image understanding. VQA is related to the
task of textual question answering, in which the answer
is to be found in a specific textual narrative (i.e. read-
ing comprehension) or in large knowledge bases (i.e.
information retrieval). Textual QA has been studied for
a long time in the NLP community, and VQA is its
extension to additional visual supporting information.
The added challenge is significant, as images are much
higher-dimensional and typically noisier than pure
text. Moreover, images lack the structure and grammat-
ical rules of language, and there is no direct equivalent
to the NLP tools such as syntactic parsers and regular
expression matching. Finally, images capture more of
the richness of the real world, whereas natural language
already represents a higher level of abstraction. For ex-
ample, compare the phrase ‘a red hat’ with the multitude of depictions it may refer to, many of whose styles could not be captured in a short sentence.
Visual question answering is a significantly more
complex problem than image captioning, as it fre-
quently requires information not present in the image.
This extra information may range from common sense to encyclopedic knowledge about a specific element of the image. In this respect,
VQA constitutes a truly AI-complete task [3], as it
requires multimodal knowledge beyond a single sub-
domain. This reinforces the growing interest in VQA,
as it provides a proxy to evaluate our progress towards
AI systems capable of advanced reasoning combined
with deep language and image understanding. Note
that image understanding could in principle be evalu-
ated equally well through image captioning. In practice, however, VQA has the advantage of a more straightforward evaluation: answers typically contain only a few words, whereas long ground-truth captions are more difficult to compare with predicted ones. Although advanced
evaluation metrics have been studied, this is still an open
research problem [43, 26, 76].
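To make concrete why short answers simplify evaluation, the following Python sketch implements two simple measures: exact string match, and the consensus-style accuracy min(#matching human answers / 3, 1) popularized by the VQA benchmark of Antol et al. [3]. The normalization step is a deliberate simplification of the answer preprocessing used by actual benchmarks.

```python
def normalize(answer: str) -> str:
    """Lower-case and lightly clean a short answer before comparison (simplified)."""
    return answer.strip().lower().rstrip(".")


def exact_match(predicted: str, ground_truth: str) -> float:
    """1.0 if the predicted short answer matches the single reference, else 0.0."""
    return float(normalize(predicted) == normalize(ground_truth))


def consensus_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Consensus-style accuracy: full credit when at least 3 annotators agree
    with the prediction, partial credit otherwise (in the spirit of [3])."""
    matches = sum(normalize(predicted) == normalize(a) for a in human_answers)
    return min(matches / 3.0, 1.0)


# A prediction matching two of the annotators receives 2/3 credit.
print(consensus_accuracy("sheep", ["sheep", "sheep", "lamb", "goat"]))  # -> 0.666...
```

No comparably simple rule exists for scoring a full predicted caption against a set of reference sentences, which is precisely the difficulty alluded to above.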
One of the first integrations of vision and language
is the “SHRDLU” system from 1972 [84], which allowed users to use language to instruct a computer to move various objects around in a “blocks world”.
More recent attempts at creating conversational robotic
agents [39, 9, 55, 64] are also grounded in the visual
world. However, these works were often limited to
specific domains and/or to restricted forms of language.
In comparison, VQA specifically addresses free-form
open-ended questions. The increasing interest in VQA
is driven by the existence of mature techniques in both
computer vision and NLP and the availability of relevant
large-scale datasets. Therefore, a large body of litera-
ture on VQA has appeared over the last few years. The
aim of this survey is to give a comprehensive overview
of the field, covering models and datasets, and to suggest
promising future directions. To the best of our knowl-
edge, this article is the first survey in the field of VQA.
In the first part of this survey (Section 2), we present
a comprehensive review of VQA methods through four
categories based on the nature of their main contribu-
tion. Since contributions are often incremental, most methods belong to several of these categories (see Table 1).
First, the joint embedding approaches (Section 2.1) are
motivated by the advances of deep neural networks in
both computer vision and NLP. They use convolutional
and recurrent neural networks (CNNs and RNNs) to
learn embeddings of images and sentences in a com-
mon feature space. This allows one to subsequently
feed them together to a classifier that predicts an an-
swer [22, 52, 49]. Second, attention mechanisms (Sec-
tion 2.2) improve on the above method by focusing on
specific parts of the input (image and/or question). At-
tention in VQA [100, 90, 11, 32, 2, 92] was inspired by
the success of similar techniques in the context of im-
age captioning [91]. The main idea is to replace holistic
(image-wide) features with spatial feature maps, and to
allow interactions between the question and specific re-
gions of these maps. Third, compositional models (Sec-
tion 2.3) make it possible to tailor the computations performed to each problem instance. For example, Andreas et al. [2] use a parser to decompose a given question, then build a neural network out of modules whose composition reflects the structure of the question. Fourth, knowl-
edge base-enhanced approaches (Section 2.4) address
the use of external data by querying structured knowl-
edge bases. This allows retrieving information that is
not present in the common visual datasets such as Ima-
geNet [14] or COCO [45], which are only labeled with
classes, bounding boxes, and/or captions. Information
available from knowledge bases ranges from common
sense to encyclopedic knowledge, and can be retrieved on demand rather than having to be captured in the training data [87, 78].
In the second part of this survey (Section 3), we
examine datasets available for training and evaluating
VQA systems. These datasets vary widely along three
dimensions: (i) their size, i.e. the number of images,
questions, and distinct concepts represented; (ii) the
amount of required reasoning, e.g. whether the detec-
tion of a single object is sufficient or whether inference
is required over multiple facts or concepts, and (iii) how
much information beyond that present in the actual
images is necessary, be it common sense or subject-
specific information. Our review points out that ex-
isting datasets lean towards visual-level questions, and
require little external knowledge, with few exceptions
[78, 79]. These characteristics reflect the fact that the current state of the art still struggles with even simple visual questions, but they must be kept in mind when VQA is presented as a proxy for evaluating AI-complete capabilities. We conclude that more varied and sophisticated
datasets will eventually be required.
Another significant contribution of this survey is an
in-depth analysis of the question/answer pairs provided
in the Visual Genome dataset (Section 4). These constitute the largest VQA dataset available at the time of writing, and, importantly, the dataset includes rich structured image annotations in the form of scene graphs [41]. We
evaluate the relevance of these annotations for VQA,
by comparing the occurrence of concepts involved in
the provided questions, answers, and image annotations.
We find that only about 40% of the answers directly
match elements in the scene graphs. We further show
that this matching rate can be significantly increased by
relating scene graphs to external knowledge bases. We
conclude this paper in Section 5 by discussing the po-
tential of better connection to such knowledge bases, to-
gether with better use of existing work from the field of
NLP.
2. Methods for VQA
One of the first attempts at “open-world” visual ques-
tion answering was proposed by Malinowski et al. [51].
They described a method combining semantic text pars-
ing with image segmentation in a Bayesian formulation
that samples from nearest neighbors in the training set.
The method requires human-defined predicates, which
are inevitably dataset-specific and difficult to scale. It is
also very dependent on the accuracy of the image seg-
mentation algorithm and of the estimated image depth
information. Another early attempt at VQA by Tu et
al. [74] was based on a joint parse graph from text and
videos. In [23], Geman et al. proposed an automatic
“query generator” that is trained on annotated images
and then produces a sequence of binary questions from
any given test image. A common characteristic of these
early approaches is to restrict questions to predefined
forms. The remainder of this article focuses on modern
approaches aimed at answering free-form open-ended
questions. We will present methods through four cat-
egories: joint embedding approaches, attention mech-
anisms, compositional models, and knowledge base-
enhanced approaches. As summarized in Table 1, most
methods combine multiple strategies and thus belong to
several categories.
| Method | Joint embedding | Attention mechanism | Compositional model | Knowledge base | Answer class. / gen. | Image features |
|---|---|---|---|---|---|---|
| Neural-Image-QA [52] | X | | | | generation | GoogLeNet [71] |
| VIS+LSTM [63] | X | | | | classification | VGG-Net [68] |
| Multimodal QA [22] | X | | | | generation | GoogLeNet [71] |
| DPPnet [58] | X | | | | classification | VGG-Net [68] |
| MCB [21] | X | | | | classification | ResNet [25] |
| MCB-Att [21] | X | X | | | classification | ResNet [25] |
| MRN [38] | X | X | | | classification | ResNet [25] |
| Multimodal-CNN [49] | X | | | | classification | VGG-Net [68] |
| iBOWIMG [99] | X | | | | classification | GoogLeNet [71] |
| VQA team [3] | X | | | | classification | VGG-Net [68] |
| Bayesian [34] | X | | | | classification | ResNet [25] |
| DualNet [65] | X | | | | classification | VGG-Net [68] & ResNet [25] |
| MLP-AQI [31] | X | | | | classification | ResNet [25] |
| LSTM-Att [100] | X | X | | | classification | VGG-Net [68] |
| Com-Mem [32] | X | X | | | generation | VGG-Net [68] |
| QAM [11] | X | X | | | classification | VGG-Net [68] |
| SAN [92] | X | X | | | classification | GoogLeNet [71] |
| SMem [90] | X | X | | | classification | GoogLeNet [71] |
| Region-Sel [66] | X | X | | | classification | VGG-Net [68] |
| FDA [29] | X | X | | | classification | ResNet [25] |
| HieCoAtt [48] | X | X | | | classification | ResNet [25] |
| NMN [2] | | X | X | | classification | VGG-Net [68] |
| DMN+ [89] | | X | X | | classification | VGG-Net [68] |
| Joint-Loss [57] | | X | X | | classification | ResNet [25] |
| Attributes-LSTM [85] | X | | | X | generation | VGG-Net [68] |
| ACK [87] | X | | | X | generation | VGG-Net [68] |
| Ahab [78] | | | | X | generation | VGG-Net [68] |
| Facts-VQA [79] | | | | X | generation | VGG-Net [68] |
| Multimodal KB [101] | | | | X | generation | ZeilerNet [96] |

Table 1: Overview of existing approaches to VQA, characterized by the use of a joint embedding of image and language features (Section 2.1), the use of an attention mechanism (Section 2.2), an explicitly compositional neural network architecture (Section 2.3), and the use of information from an external structured knowledge base (Section 2.4). We also note whether the output answer is obtained by classification over a predefined set of common words and short phrases, or generated, typically with a recurrent neural network. The last column indicates the type of convolutional network used to obtain image features.

2.1 Joint embedding approaches

Motivation The concept of jointly embedding images and text was first explored for the task of image captioning [15, 35, 54, 77, 93, 85]. It was motivated by
the success of deep learning methods in both computer
vision and NLP, which allow one to learn representa-
tions in a common feature space. Compared with image captioning, this motivation is further reinforced in VQA by the need to perform additional reasoning over the two modalities together. A representation in
a common space allows learning interactions and per-
forming inference over the question and the image con-
tents. Practically, image representations are obtained
with convolutional neural networks (CNNs) pre-trained
on object recognition. Text representations are obtained
with word embeddings pre-trained on large text corpora.
In practice, word embeddings map words to a space in
which distances reflect semantic similarities [56, 61].
The embeddings of the individual words of a question
are then typically fed to a recurrent neural network to
capture syntactic patterns and handle variable-length se-
quences.
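This basic recipe can be made concrete with a short sketch. The PyTorch-style model below is illustrative only: it assumes precomputed CNN image features and classification over a fixed answer vocabulary, and the layer sizes and the concatenation-based fusion are arbitrary choices rather than those of any particular method surveyed here.

```python
import torch
import torch.nn as nn


class JointEmbeddingVQA(nn.Module):
    """Minimal joint-embedding baseline: an LSTM question encoder and CNN image
    features, fused in a common space and classified over a fixed answer set."""

    def __init__(self, vocab_size, num_answers, img_dim=4096,
                 word_dim=300, hidden_dim=512):
        super().__init__()
        # Ideally initialized from word embeddings pre-trained on large corpora.
        self.word_embed = nn.Embedding(vocab_size, word_dim)
        self.question_rnn = nn.LSTM(word_dim, hidden_dim, batch_first=True)
        # Project CNN features (from a network pre-trained on object recognition)
        # into the same space as the question encoding.
        self.img_proj = nn.Linear(img_dim, hidden_dim)
        self.classifier = nn.Sequential(
            nn.Linear(2 * hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, num_answers))

    def forward(self, image_feats, question_tokens):
        # image_feats: (B, img_dim); question_tokens: (B, T) word indices.
        words = self.word_embed(question_tokens)
        _, (h, _) = self.question_rnn(words)        # final state summarizes the question
        q = h[-1]                                   # (B, hidden_dim)
        v = torch.relu(self.img_proj(image_feats))  # (B, hidden_dim)
        joint = torch.cat([q, v], dim=1)            # simple fusion by concatenation
        return self.classifier(joint)               # scores over candidate answers
```

Most of the methods below can be read as variations on this template, differing mainly in how the two feature vectors are fused and in how the answer is produced.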
Methods Malinowski et al. [52] propose an approach
named “Neural-Image-QA” with a Recurrent Neural
Network (RNN) implemented with Long Short-Term
Memory cells (LSTMs) (Figure 1). The motivation be-
hind RNNs is to handle inputs (questions) and outputs
(answers) of variable size. Image features are produced
by a CNN pre-trained for object recognition. Question
and image features are both fed together to a first “en-
coder” LSTM. It produces a feature vector of fixed size
that is then passed to a second “decoder” LSTM.

Figure 1: (Top) A common approach to VQA is to map both the input image and the question to a common embedding space (Section 2.1). These features are produced by deep convolutional and recurrent neural networks, and they are combined in an output stage, which can take the form of a classifier (e.g. a multilayer perceptron) that predicts short answers from a predefined set, or of a recurrent network (e.g. an LSTM) that produces variable-length phrases. (Bottom) Attention mechanisms build on this basic approach with a spatial selection of image features. Attention weights are derived from both the image and the question, and allow the output stage to focus on relevant parts of the image.

The
decoder produces variable-length answers, one word per
recurrent iteration. At each iteration, the last predicted
word is fed through the recurrent loop into the LSTM
until a special <END> symbol is predicted. Several vari-
ants of this approach were proposed. For example, the “VIS+LSTM” of Ren et al. [63] directly feeds the feature vector produced by the encoder LSTM into a classifier to produce single-word answers from a predefined vocabulary. In other words, they formulate answering as a classification problem, whereas Malinowski et al. [52] treated it as a sequence generation procedure.
Ren et al. [63] propose other technical improvements
with the “2-VIS+BLSTM” model. It uses two sources
of image features as input, fed to the LSTM at the start
and at the end of the question sentence. It also uses
LSTMs that scan questions in both forward and back-
ward directions. Those bidirectional LSTMs better cap-
ture relations between distant words in the question.
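To make the generation-based formulation concrete, the following is a minimal sketch of a greedy decoding loop in the spirit of [52]: the last predicted word is fed back into the decoder until a special <END> token is produced. The module names, token indices, and initialization are hypothetical placeholders rather than the original implementation.

```python
import torch


def greedy_decode(decoder_cell, word_embed, output_layer, context,
                  start_idx, end_idx, max_len=10):
    """Generate an answer one word per step, feeding each prediction back in.
    `decoder_cell` is an nn.LSTMCell, `context` the encoder's fixed-size vector
    (shape (1, hidden_dim)), `output_layer` maps hidden states to word scores."""
    hidden = (context, torch.zeros_like(context))  # seed the decoder with the encoder state
    token = torch.tensor([start_idx])
    answer = []
    for _ in range(max_len):
        h, c = decoder_cell(word_embed(token), hidden)
        hidden = (h, c)
        token = output_layer(h).argmax(dim=1)       # most likely next word
        if token.item() == end_idx:                 # stop at the special <END> symbol
            break
        answer.append(token.item())
    return answer
```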
Gao et al. [22] propose a slightly different method
named “Multimodal QA” (mQA). It employs LSTMs
to encode the question and produce the answer, with
two differences from [52]. First, whereas [52] used
shared weights between the encoder and de-
coder LSTMs, mQA learns distinct parameters and only
shares the word embedding. This is motivated by poten-
tially different properties (e.g. in terms of grammar) of
questions and answers. Second, the CNN features used
as image representations are not fed into the encoder
prior to the question, but at every time step.
Noh et al. [58] tackle VQA by learning a CNN with a
dynamic parameter layer (DPPnet), whose weights are determined adaptively based on the question. For the adaptive parameter prediction, they employ a separate parameter prediction network, consisting of gated recurrent units (GRUs, a simpler gated alternative to LSTMs) tak-
ing a question as input and producing candidate weights
through a fully-connected layer at its output. This ar-
rangement was shown to significantly improve answer-
ing accuracy compared to [52, 63]. One can note a
similarity in spirit with the modular approaches of Sec-
tion 2.3, in the sense that the question is used to tailor
the main computations to each particular instance.
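The idea of question-conditioned parameters can be sketched as a small hypernetwork that regresses the weights of one fully-connected layer from the question encoding, as below. DPPnet additionally uses a hashing trick to keep the number of predicted parameters manageable, which is omitted here for clarity; the dimensions are placeholders.

```python
import torch
import torch.nn as nn


class DynamicFCLayer(nn.Module):
    """Fully-connected layer whose weights are predicted from the question,
    in the spirit of DPPnet [58] (hashing of the predicted parameters omitted)."""

    def __init__(self, q_dim, in_dim, out_dim):
        super().__init__()
        self.in_dim, self.out_dim = in_dim, out_dim
        # Maps the question encoding (e.g. the final GRU state) to a weight matrix.
        self.param_predictor = nn.Linear(q_dim, in_dim * out_dim)

    def forward(self, x, q_encoding):
        # x: (B, in_dim) visual features; q_encoding: (B, q_dim) question features.
        W = self.param_predictor(q_encoding).view(-1, self.out_dim, self.in_dim)
        # Apply a different weight matrix to each example in the batch.
        return torch.bmm(W, x.unsqueeze(2)).squeeze(2)
```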
Fukui et al. [21] propose a pooling method to perform the joint embedding of visual and textual features. Their “Multimodal Compact Bilinear pooling” (MCB) randomly projects the image and text features to a higher-dimensional space and then convolves the two vectors, which is carried out efficiently by element-wise multiplication in the Fourier domain. Kim et al. [38] use a multimodal residual learning framework (MRN) to learn the joint representation of images and language. Saito et al. [65] propose a “DualNet”, which integrates two kinds of operations, namely element-wise summation and element-wise multiplication, to embed the visual and textual features. Similarly to [63, 58], they formulate answering as a classification problem.
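The MCB fusion step itself can be sketched in a few lines: each feature vector is compressed with a count sketch (fixed random signs and bucket indices), and the two sketches are convolved by element-wise multiplication in the Fourier domain. The output dimension is a placeholder, and the signed square-root and L2 normalization used in practice are omitted.

```python
import torch


def count_sketch(x, h, s, d):
    """Project x of shape (B, n) to d dimensions using fixed random bucket
    indices h (LongTensor of length n) and random signs s (+1/-1, length n)."""
    sketch = torch.zeros(x.size(0), d)
    return sketch.index_add_(1, h, x * s)   # scatter-add signed features into d buckets


def mcb_pool(v, q, d=16000, seed=0):
    """Compact bilinear pooling of visual (B, n_v) and textual (B, n_q) features."""
    g = torch.Generator().manual_seed(seed)  # projections must stay fixed for a given model
    h_v = torch.randint(0, d, (v.size(1),), generator=g)
    s_v = torch.randint(0, 2, (v.size(1),), generator=g).float() * 2 - 1
    h_q = torch.randint(0, d, (q.size(1),), generator=g)
    s_q = torch.randint(0, 2, (q.size(1),), generator=g).float() * 2 - 1
    sv = torch.fft.fft(count_sketch(v, h_v, s_v, d))
    sq = torch.fft.fft(count_sketch(q, h_q, s_q, d))
    return torch.fft.ifft(sv * sq).real     # circular convolution of the two sketches
```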