image-sentence score as a function of the individual region-
word scores. Intuitively, a sentence-image pair should have
a high matching score if its words have a confident support
in the image. The model of Karpathy et al. [24] interprets the dot product $v_i^T s_t$ between the $i$-th region and the $t$-th word as a measure of similarity and uses it to define the score between image $k$ and sentence $l$ as:
$$S_{kl} = \sum_{t \in g_l} \sum_{i \in g_k} \max(0, v_i^T s_t). \tag{7}$$
Here, $g_k$ is the set of image fragments in image $k$ and $g_l$ is the set of sentence fragments in sentence $l$. The indices $k, l$ range over the images and sentences in the training set.
Together with their additional Multiple Instance Learning
objective, this score carries the interpretation that a sentence
fragment aligns to a subset of the image regions whenever
the dot product is positive. We found that the following
reformulation simplifies the model and alleviates the need
for additional objectives and their hyperparameters:
$$S_{kl} = \sum_{t \in g_l} \max_{i \in g_k} v_i^T s_t. \tag{8}$$
Here, every word $s_t$ aligns to the single best image region.
As we show in the experiments, this simplified model also
leads to improvements in the final ranking performance.
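To make the two scoring rules concrete, the following is a minimal numpy sketch of Equations 7 and 8. The names `score_eq7`, `score_eq8`, `V` (region embeddings), and `S` (word embeddings) are illustrative, assuming both embedding matrices have already been mapped into the shared multimodal space:

```python
import numpy as np

def score_eq7(V, S):
    """Eq. 7: sum the thresholded region-word similarities over all pairs.
    V: (M, d) region embeddings; S: (N, d) word embeddings."""
    sims = V @ S.T                        # sims[i, t] = v_i^T s_t
    return np.maximum(0.0, sims).sum()

def score_eq8(V, S):
    """Eq. 8: each word aligns to its single best region; the per-word
    maxima are summed over the sentence."""
    sims = V @ S.T
    return sims.max(axis=0).sum()         # max over regions, sum over words
```

Note how Equation 8 simply replaces the inner thresholded sum over regions with a single max, which is what removes the need for the additional Multiple Instance Learning objective.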
Assuming that $k = l$ denotes a corresponding image and sentence pair, the final max-margin, structured loss remains:
$$C(\theta) = \sum_k \Bigg[ \underbrace{\sum_l \max(0, S_{kl} - S_{kk} + 1)}_{\text{rank images}} + \underbrace{\sum_l \max(0, S_{lk} - S_{kk} + 1)}_{\text{rank sentences}} \Bigg]. \tag{9}$$
This objective encourages aligned image-sentence pairs to
have a higher score than misaligned pairs, by a margin.
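As a sketch of how Equation 9 can be evaluated in practice, assume the pairwise scores for a batch of aligned image-sentence pairs are collected into a matrix with the correct pairs on the diagonal; the function name `ranking_loss` and this batching convention are our own, not prescribed by the paper:

```python
import numpy as np

def ranking_loss(scores):
    """Max-margin structured loss of Eq. 9.
    scores: (B, B) matrix with scores[k, l] = S_kl; the diagonal
    holds the scores S_kk of the aligned pairs."""
    diag = np.diag(scores)                                       # S_kk
    cost_im = np.maximum(0.0, scores - diag[:, None] + 1.0)      # S_kl - S_kk + 1
    cost_sent = np.maximum(0.0, scores.T - diag[:, None] + 1.0)  # S_lk - S_kk + 1
    # the k = l terms are a constant margin of 1 each and carry no
    # gradient signal, so we zero them out
    np.fill_diagonal(cost_im, 0.0)
    np.fill_diagonal(cost_sent, 0.0)
    return cost_im.sum() + cost_sent.sum()
```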
3.1.4 Decoding text segment alignments to images
Consider an image from the training set and its correspond-
ing sentence. We can interpret the quantity $v_i^T s_t$ as the unnormalized log probability of the $t$-th word describing any
of the bounding boxes in the image. However, since we are
ultimately interested in generating snippets of text instead
of single words, we would like to align extended, contigu-
ous sequences of words to a single bounding box. Note that
the naïve solution that assigns each word independently to
the highest-scoring region is insufficient because it leads to
words getting scattered inconsistently to different regions.
To address this issue, we treat the true alignments as latent
variables in a Markov Random Field (MRF) where the bi-
nary interactions between neighboring words encourage an
[Figure 3. Diagram for evaluating the image-sentence score $S_{kl}$. Object regions are embedded with a CNN (left). Words (enriched by their context) are embedded in the same multimodal space with a BRNN (right). Pairwise similarities are computed with inner products (magnitudes shown in grayscale) and finally reduced to the image-sentence score with Equation 8.]
alignment to the same region. Concretely, given a sentence
with N words and an image with M bounding boxes, we
introduce the latent alignment variables $a_j \in \{1 \ldots M\}$ for $j = 1 \ldots N$ and formulate an MRF in a chain structure
along the sentence as follows:
$$E(\mathbf{a}) = \sum_{j=1 \ldots N} \psi_j^U(a_j) + \sum_{j=1 \ldots N-1} \psi_j^B(a_j, a_{j+1}) \tag{10}$$
$$\psi_j^U(a_j = i) = v_i^T s_j \tag{11}$$
$$\psi_j^B(a_j, a_{j+1}) = \beta \, \mathbb{1}[a_j = a_{j+1}]. \tag{12}$$
Here, β is a hyperparameter that controls the affinity to-
wards longer word phrases. This parameter allows us to
interpolate between single-word alignments (β = 0) and
aligning the entire sentence to a single, maximally scoring
region when β is large. We minimize the energy to find the
best alignments a using dynamic programming. The output
of this process is a set of image regions annotated with seg-
ments of text. We now describe an approach for generating
novel phrases based on these correspondences.
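As a closing sketch of the decoding step, the chain structure of Equations 10-12 admits an exact Viterbi-style dynamic program. The paper describes minimizing the energy; since the unary terms are similarities and large $\beta$ should favor longer phrases, we assume the intended convention is equivalent to maximizing the total word-region similarity plus a $\beta$ bonus whenever neighboring words share a region (so $\beta = 0$ reduces to an independent per-word argmax). The function name and array layout below are illustrative:

```python
import numpy as np

def decode_alignments(sims, beta):
    """Decode word-region alignments for the chain MRF of Eqs. 10-12.
    sims: (M, N) matrix with sims[i, j] = v_i^T s_j for M regions and
    N words; beta is the phrase-affinity hyperparameter.
    Returns an array a with a[j] = index of the region aligned to word j."""
    M, N = sims.shape
    best = sims[:, 0].copy()             # best[i]: best prefix score with a_1 = i
    back = np.zeros((N, M), dtype=int)   # back[j, i]: best predecessor for a_j = i
    for j in range(1, N):
        stay = best + beta               # previous word stayed on the same region
        move = best.max()                # previous word came from the best region overall
        back[j] = np.where(stay >= move, np.arange(M), best.argmax())
        best = sims[:, j] + np.maximum(stay, move)
    a = np.zeros(N, dtype=int)
    a[-1] = best.argmax()                # backtrack from the best final region
    for j in range(N - 1, 0, -1):
        a[j - 1] = back[j, a[j]]
    return a
```

The backpointer step keeps the recurrence O(M) per word rather than O(M^2), since the only transition that can beat the globally best predecessor is staying on the same region.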
3.2. Multimodal Recurrent Neural Network for
generating descriptions
In this section we assume an input set of images and their
textual descriptions. These could be full images and their
sentence descriptions, or regions and text snippets, as in-
ferred in the previous section. The key challenge is in the
design of a model that can predict a variable-sized sequence
of outputs given an image. In previously developed lan-
guage models based on Recurrent Neural Networks (RNNs)
[40, 50, 10], this is achieved by defining a probability distribution over the next word in a sequence given the current word
and context from previous time steps. We explore a simple