728 IEEE TRANSACTIONS ON MULTIMEDIA, VOL. 20, NO. 3, MARCH 2018
methods that are based on deep neural networks and the limitations of these methods. Second, we briefly introduce attention-based image caption approaches. Finally, we introduce some works on object detection that are related to our proposed method.
Deep neural network-based image caption: With the successful application of deep neural networks to image recognition and machine translation, the task of automatically generating image descriptions has also made significant progress. There exist several effective methods [13], [14], [30]–[33] based on deep neural networks.
As mentioned above, these approaches treat generating an image description as a translation process. They directly translate an image into a sentence by utilizing the encoder-decoder framework [28], which was originally introduced for the machine translation task. In general, this paradigm first uses a deep CNN as the encoder, which encodes an image into a static representation, and then uses an RNN as the decoder, which decodes this static representation into a meaningful sentence. The generated sentence should be grammatically correct and describe the content of the image as well as possible.
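The encoder-decoder paradigm described above can be sketched as follows. This is a minimal illustration with toy dimensions and randomly initialized weights, not any of the cited models: a stand-in "CNN" maps the image to one static feature vector, and a stand-in "RNN" greedily emits a word index at each step conditioned on that fixed vector.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions chosen for illustration only.
FEAT_DIM, HID_DIM, VOCAB = 8, 6, 5

def cnn_encode(image):
    """Stand-in for a deep CNN encoder: image -> one static feature vector."""
    W = rng.standard_normal((FEAT_DIM, image.size))
    return np.tanh(W @ image.ravel())

def rnn_decode(feature, max_len=4):
    """Stand-in for an RNN decoder: at each step, update the hidden state
    from the (fixed) image feature and greedily pick a word index."""
    Wh = rng.standard_normal((HID_DIM, HID_DIM))
    Wf = rng.standard_normal((HID_DIM, FEAT_DIM))
    Wo = rng.standard_normal((VOCAB, HID_DIM))
    h = np.zeros(HID_DIM)
    words = []
    for _ in range(max_len):
        h = np.tanh(Wh @ h + Wf @ feature)
        words.append(int(np.argmax(Wo @ h)))  # greedy decoding
    return words

image = rng.standard_normal((4, 4))
caption = rnn_decode(cnn_encode(image))
```

Note that the decoder sees only the single static vector at every step; this is precisely the property that causes the object-missing and misprediction problems discussed below.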
To address the task of image description, Mao et al. [32] propose a multimodal RNN (m-RNN) model which can also be used for image and sentence retrieval. The proposed m-RNN additionally utilizes a multimodal layer to connect the language model and the CNN. Similarly, Karpathy et al. [33] propose an alignment model based on a multimodal embedding layer. This alignment model can align segments of a sentence with the regions of the corresponding image that they describe. Replacing the basic RNN with LSTM, a more powerful RNN variant, Vinyals et al. [31] propose an end-to-end model named NIC that combines a deep CNN with an LSTM for this problem. Furthermore, to address the problem of "drifting away" from or "losing track" of the image content, Jia et al. [14] propose the gLSTM model, an extension of LSTM. This model utilizes semantic information extracted from the image, along with the whole image, as input to generate image descriptions. Donahue et al. [13] propose the Long-term Recurrent Convolutional Network (LRCN), which combines convolutional layers and long-range temporal recursion for visual recognition and description.
However, as shown in Fig. 1, the above-mentioned approaches may suffer from the problems of object missing and misprediction, because those methods encode the whole image into a static global feature vector. To overcome these problems, in this paper we propose to integrate object-level features with image-level features for generating image captions via the widely used attention mechanism. In the next section, we briefly review related works based on the attention mechanism.
Attention mechanism in image caption and machine translation: Recently, the attention mechanism has been widely used and proven to be important and effective in natural language processing [27] and computer vision [30], [37], [44], [45]. In essence, the attention mechanism assigns positive weights to different parts of the input to indicate the importance of these parts.
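The essence stated above can be made concrete with a small sketch (generic soft attention, not the exact formulation of any cited method): each part is scored against a query, the scores are normalized with a softmax so that the weights are positive and sum to one, and the weighted sum yields a context vector.

```python
import numpy as np

def softmax(x):
    # Subtracting the max is a standard numerical-stability trick.
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(parts, query):
    """Score each part against the query, normalize to positive weights,
    and return the weights together with the weighted sum (context)."""
    scores = parts @ query      # relevance of each part to the query
    weights = softmax(scores)   # positive, sums to 1
    context = weights @ parts   # context vector: weighted sum of parts
    return weights, context

# Three "parts" (e.g., region features) and one query vector.
parts = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
query = np.array([2.0, 0.0])
weights, context = attend(parts, query)
```

Parts that align with the query receive larger weights, so the context vector is dominated by the most relevant parts; this is the mechanism that the works below apply to words, image regions, video frames, and concept proposals.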
The attention mechanism was originally introduced for the machine translation task [27], in which Bahdanau et al. exploit a BRNN with an attention mechanism. This approach is able to automatically search for the parts of the source sentence that are most relevant to a target word. Subsequently, the attention mechanism was introduced into image/video understanding tasks. Xu et al. [30] explore two kinds of attention mechanisms for image caption generation, i.e., soft attention and hard attention, and analyze via visualization how the attention mechanism works in the process of generating image captions. In [37], Yao et al. address the video caption task by capturing the global temporal structure among video frames with a temporal attention mechanism based on a soft-alignment method. This temporal attention mechanism makes the model dynamically focus on the key frames that are most relevant to the predicted word. ATT [44] utilizes semantic concepts to improve performance. This method first obtains semantic concept proposals via different approaches, such as k-NN and multi-label ranking, and then integrates these concept proposals into one vector via the attention mechanism. The integrated vector is finally used to guide the language model to generate the description.
Different from the soft/hard attention method [30] and the ATT method [44], our proposed GLA method integrates local representations at the object level with global representations at the image level through the attention mechanism, aiming to address the aforementioned problems of object missing and misprediction; these problems cannot be avoided by methods that use only global image-level features. Moreover, instead of considering the semantic concepts or attributes used in ATT [44], we directly apply image visual features with the attention mechanism to image caption generation. RA [45] proposes a complicated pipeline to obtain important regions from selective search region proposals [46] and combines them with scene-specific contexts to generate image captions. Compared with the ATT and RA methods, our GLA method is simpler, and its performance is much better than that of the RA method.
Object Detection: With the great success achieved by deep learning technology, object detection has also made significant progress. R-CNN [47] stands out as one of the notable landmarks in the progress of object detection. It takes advantage of high-quality region proposals (generated by the selective search method [46]) and CNN features. This pipeline mainly contains four procedures: (1) extracting region proposals that are likely to contain objects via region proposal methods; (2) extracting CNN features of these region proposals via CNNs; (3) classifying these proposals with a classifier trained on CNN features; (4) localizing the objects via bounding box regression.
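The four-stage pipeline above can be summarized schematically. In the sketch below, every function is a hypothetical placeholder standing in for one R-CNN stage (none of these are the real implementations); only the data flow between the stages reflects the described pipeline.

```python
import numpy as np

rng = np.random.default_rng(1)

def propose_regions(image, n=3):
    """(1) Region proposals (e.g., from selective search) as (x, y, w, h)."""
    return [tuple(int(v) for v in rng.integers(1, 8, size=4)) for _ in range(n)]

def extract_cnn_features(image, boxes, dim=4):
    """(2) One CNN feature vector per cropped proposal (placeholder)."""
    return [rng.standard_normal(dim) for _ in boxes]

def classify(features, n_classes=2):
    """(3) Classifier trained on CNN features, applied per proposal."""
    W = rng.standard_normal((n_classes, len(features[0])))
    return [int(np.argmax(W @ f)) for f in features]

def regress_boxes(boxes, features):
    """(4) Bounding-box regression: refine each proposal (placeholder shift)."""
    return [(x + 1, y + 1, w, h) for (x, y, w, h) in boxes]

image = rng.standard_normal((8, 8))
boxes = propose_regions(image)
feats = extract_cnn_features(image, boxes)
labels = classify(feats)
refined = regress_boxes(boxes, feats)
```

Because each stage runs separately per proposal, the overall pipeline is slow; the end-to-end methods discussed next fold stages (2)-(4) into a single network.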
However, this kind of framework is time-consuming due to its four distinct steps. To reduce the computation time and improve detection accuracy, methods such as SPP-Net [48], Fast R-CNN [49] and DeepID-Net [8] have been developed. These methods integrate the last three steps into one end-to-end framework which can simultaneously perform classification and bounding box regression.
Although these improved methods boost the performance of object detection via the end-to-end part, the region proposal generation process remains separate from it, and this process is the most time-consuming. Thus, some real-time methods, such as YOLO [9], Faster R-CNN [6] and SSD [50], are