[Figure 2: Example output from our Faster R-CNN bottom-up attention model. Each bounding box is labeled with an attribute class followed by an object class. Note, however, that in captioning and VQA we utilize only the feature vectors, not the predicted labels.]
suppression with an intersection-over-union (IoU) threshold, the top box proposals are selected as input to the second stage. In the second stage, region of interest (RoI) pooling is used to extract a small feature map (e.g. 14 × 14) for each box proposal. These feature maps are then batched together as input to the final layers of the CNN. The final output of the model consists of a softmax distribution over class labels and class-specific bounding box refinements for each box proposal.
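As an illustrative sketch only (not our released implementation), the second-stage RoI feature extraction can be written in PyTorch using torchvision's roi_align as a stand-in for RoI pooling; the feature-map shape and proposal coordinates below are placeholder assumptions:

```python
import torch
from torchvision.ops import roi_align

# Placeholder backbone feature map and box proposals (batch_idx, x1, y1, x2, y2).
backbone_features = torch.randn(1, 1024, 38, 57)
proposals = torch.tensor([[0., 10., 20., 200., 300.],
                          [0., 50., 60., 400., 500.]])

# Extract a small (14 x 14) feature map for each box proposal; these are then
# batched together and passed through the final layers of the CNN.
roi_feats = roi_align(backbone_features, proposals,
                      output_size=(14, 14), spatial_scale=1.0 / 16)
print(roi_feats.shape)  # torch.Size([2, 1024, 14, 14])
```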
In this work, we use Faster R-CNN in conjunction with the ResNet-101 [13] CNN. To generate an output set of image features $V$ for use in image captioning or VQA, we take the final output of the model and perform non-maximum suppression for each object class using an IoU threshold. We then select all regions where any class detection probability exceeds a confidence threshold. For each selected region $i$, $v_i$ is defined as the mean-pooled convolutional feature from this region, such that the dimension $D$ of the image feature vectors is 2048. Used in this fashion, Faster R-CNN effectively functions as a ‘hard’ attention mechanism, as only a relatively small number of image bounding box features are selected from a large number of possible configurations.
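A minimal sketch of this feature-extraction procedure (per-class NMS, a confidence threshold, then mean pooling to $D = 2048$) is given below; tensor names and threshold values are illustrative assumptions rather than the exact settings used here:

```python
import torch
from torchvision.ops import nms

def extract_image_features(roi_feats, class_probs, boxes,
                           iou_thresh=0.3, conf_thresh=0.2):
    """roi_feats: (N, 2048, 14, 14) final-block region features,
    class_probs: (N, C) softmax scores, boxes: (N, 4) box coordinates."""
    keep = torch.zeros(class_probs.size(0), dtype=torch.bool)
    for c in range(1, class_probs.size(1)):       # skip background class 0
        scores = class_probs[:, c]
        kept = nms(boxes, scores, iou_thresh)     # per-class NMS
        keep[kept[scores[kept] > conf_thresh]] = True
    # v_i: mean-pool each selected region's feature map -> 2048-d vector
    V = roi_feats[keep].mean(dim=(2, 3))
    return V                                      # (num_selected_regions, 2048)
```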
To pretrain the bottom-up attention model, we first initialize Faster R-CNN with ResNet-101 pretrained for classification on ImageNet [35]. We then train on Visual Genome [21] data. To aid the learning of good feature representations, we add an additional training output for predicting attribute classes (in addition to object classes). To predict attributes for region $i$, we concatenate the mean-pooled convolutional feature $v_i$ with a learned embedding of the ground-truth object class, and feed this into an additional output layer defining a softmax distribution over each attribute class plus a ‘no attributes’ class.
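The attribute output head can be sketched as follows; the embedding size and the object/attribute class counts shown are illustrative defaults, not exact training settings:

```python
import torch
import torch.nn as nn

class AttributeHead(nn.Module):
    def __init__(self, feat_dim=2048, num_object_classes=1600,
                 num_attributes=400, class_emb_dim=256):
        super().__init__()
        # Learned embedding of the ground-truth object class.
        self.class_emb = nn.Embedding(num_object_classes, class_emb_dim)
        # +1 output for the 'no attributes' class.
        self.fc = nn.Linear(feat_dim + class_emb_dim, num_attributes + 1)

    def forward(self, v_i, gt_class):
        # Concatenate the mean-pooled region feature with the class embedding.
        x = torch.cat([v_i, self.class_emb(gt_class)], dim=-1)
        return self.fc(x)  # logits; softmax applied in the loss
```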
The original Faster R-CNN multi-task loss function contains four components, defined over the classification and bounding box regression outputs for both the RPN and the final object class proposals respectively. We retain these components and add an additional multi-class loss component to train the attribute predictor. In Figure 2 we provide some examples of model output.
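Schematically, the extended objective combines the four original Faster R-CNN loss terms (RPN classification/regression and final classification/regression) with the added attribute cross-entropy; the loss weighting below is an assumption for illustration:

```python
import torch.nn.functional as F

def total_loss(rpn_cls_loss, rpn_box_loss, det_cls_loss, det_box_loss,
               attr_logits, attr_targets, attr_weight=0.5):
    # Multi-class attribute loss over attribute classes plus 'no attributes'.
    attr_loss = F.cross_entropy(attr_logits, attr_targets)
    return (rpn_cls_loss + rpn_box_loss +
            det_cls_loss + det_box_loss + attr_weight * attr_loss)
```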
3.2. Captioning Model
Given a set of image features $V$, our proposed captioning model uses a ‘soft’ top-down attention mechanism to weight each feature during caption generation, using the existing partial output sequence as context. This approach is broadly similar to several previous works [34, 27, 46]. However, the particular design choices outlined below make for a relatively simple yet high-performing baseline model. Even without bottom-up attention, our captioning model achieves performance comparable to the state of the art on most evaluation metrics (refer to Table 1).
At a high level, the captioning model is composed of two LSTM [15] layers using a standard implementation [9]. In the sections that follow we will refer to the operation of the LSTM over a single time step using the following notation:

$h_t = \mathrm{LSTM}(x_t, h_{t-1})$ (1)

where $x_t$ is the LSTM input vector and $h_t$ is the LSTM output vector. Here we have neglected the propagation of memory cells for notational convenience. We now describe the formulation of the LSTM input vector $x_t$ and the output vector $h_t$ for each layer of the model. The overall captioning model is illustrated in Figure 3.
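For reference, Eq. (1) corresponds to a single nn.LSTMCell step in PyTorch; the memory cell is carried explicitly here even though it is suppressed in the notation, and the dimensions are placeholders rather than the sizes used in our model:

```python
import torch
import torch.nn as nn

input_size, hidden_size = 512, 1024        # placeholder dimensions
lstm = nn.LSTMCell(input_size, hidden_size)

x_t = torch.randn(1, input_size)           # LSTM input vector x_t
h_prev = torch.zeros(1, hidden_size)       # h_{t-1}
c_prev = torch.zeros(1, hidden_size)       # memory cell (omitted in Eq. 1)
h_t, c_t = lstm(x_t, (h_prev, c_prev))     # h_t = LSTM(x_t, h_{t-1})
```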
3.2.1 Top-Down Attention LSTM
Within the captioning model, we characterize the first
LSTM layer as a top-down visual attention model, and the