Image Retrieval by Dense Caption Reasoning
Xinru Wei, Yonggang Qi, Jun Liu, Fang Liu
School of Communication and Information Engineering, BUPT, Beijing, China
Abstract—Humans tend to understand an image scene by recognizing its visual elements and then conjecturing and inferring from them, which enables them to search for relevant images. In this paper, we study the problem of complex image retrieval by reasoning over dense image captions, which resembles the way humans perceive images when searching for them. Specifically, we transform the problem of complex image retrieval into a dense captioning and scene graph matching problem, using structured language descriptions for retrieval. Experimental results on a newly proposed large-scale content-based image retrieval dataset demonstrate the effectiveness of our method.
Index Terms—Image Retrieval, Dense Caption Reasoning,
Captioning, Scene Graph Matching, Deep Learning
I. INTRODUCTION
Retrieving images from a visual query is one of the most attractive vision problems; it aims to search for images by reasoning about the visual elements of the query image. This is very challenging, since an ideal retriever should not only understand the whole scene but also capture its contents in detail. A large body of prior work addresses this task.
Traditional methods for content-based image retrieval often utilize low-level visual feature representations of color, shape and appearance, such as SIFT [1], HOG [2] and Fisher vectors [3]. Many others rely on richer representations, e.g., bags of features [4] and spatial pyramids [5]. However, these efforts share an obvious drawback: a semantic gap exists between the extracted hand-crafted features and the wealth of high-level human perceptions of the stimulus images. There are two main reasons for this: (i) the visual variation in real images is very large and can hardly be handled by low-level features, and (ii) people often search for images after inferring, that is, they tend to conjecture that different visual concepts are relevant; for example, “food, forks, knives and plates” might be evidence for inferring “kitchen”, “restaurant” or even “family gathering”.
Recently, there has been much interest in handling CBIR (Content-Based Image Retrieval) by matching the visual elements of images in the form of natural language, where image captioning plays the key role. Image captioning [6][7][8] achieves convincing performance thanks to the power of deep learning techniques. It significantly expands the complexity of the label space from a small fixed set of categories to sequences of words, which are able to express far
Fig. 1. Upper left: query image. Upper right: part of the scene graph of the query image. Below: example retrieved relevant images, which contain visual concepts very similar to those of the query image, such as “fruits”, “food”, “dish washer”, “microwave” and “light wooden storage”.
richer visual concepts contained in images. Inspired by this,
we treat CBIR as a caption generation and matching problem
in this paper.
Caption matching is critical for ranking images given the produced query and candidate captions. It is a text matching problem in the field of NLP (Natural Language Processing), and traditional approaches to text matching include string-based [9], corpus-based [10] and knowledge-based [10] methods. However, these methods are not designed for image caption matching, which concentrates on matching the structured visual elements of images, i.e., objects, the interactions between objects, and the attributes of objects. Therefore, a scene graph construction and matching strategy is presented to handle this problem; a minimal sketch of the structured representation we have in mind is given below.
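As an illustration only (the class and method names here are hypothetical and not taken from the paper), a scene graph parsed from dense captions can be stored as a set of attributed objects plus (subject, relation, object) triples:

from dataclasses import dataclass, field
from typing import List, Set, Tuple


@dataclass
class SceneObject:
    # A visual element mentioned in a region caption, e.g. "microwave".
    name: str
    attributes: Set[str] = field(default_factory=set)   # e.g. {"large", "wooden"}


@dataclass
class SceneGraph:
    # Objects plus (subject, relation, object) triples parsed from dense captions.
    objects: List[SceneObject] = field(default_factory=list)
    relations: List[Tuple[str, str, str]] = field(default_factory=list)

    def add_triple(self, subj: str, rel: str, obj: str) -> None:
        # Register both entities (if unseen) and the relation between them.
        for name in (subj, obj):
            if all(o.name != name for o in self.objects):
                self.objects.append(SceneObject(name))
        self.relations.append((subj, rel, obj))


# Triples that a caption parser might extract from the region description
# "a large microwave on a wooden storage cabinet".
graph = SceneGraph()
graph.add_triple("microwave", "on", "cabinet")
next(o for o in graph.objects if o.name == "microwave").attributes.add("large")
next(o for o in graph.objects if o.name == "cabinet").attributes.update({"wooden", "storage"})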
In this paper we address the problem of image retrieval by generating and matching image captions (see Fig. 1). Specifically, for a given image: (i) a dense set of descriptions is generated across image regions, (ii) a scene graph is constructed by structuring the produced natural language descriptions in terms of objects, relationships and attributes, and (iii) images are ranked according to their scene graph similarity, computed with visual concept embeddings that can measure the semantic distance between any pair of concepts; a simplified sketch of this ranking step follows.
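The concrete similarity measure is defined later in the paper; the following sketch only conveys the general idea under assumptions of our own (the embed() stand-in, the greedy triple matching and all names below are illustrative, not the paper's formulation):

import numpy as np


def embed(concept: str, dim: int = 50) -> np.ndarray:
    # Stand-in for a real visual concept embedding (e.g. pretrained word
    # vectors): a pseudo-random vector keyed on the string, consistent
    # within a single run.
    rng = np.random.default_rng(abs(hash(concept)) % (2 ** 32))
    return rng.standard_normal(dim)


def cosine(u: np.ndarray, v: np.ndarray) -> float:
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))


def triple_similarity(t1, t2) -> float:
    # Compare (subject, relation, object) triples slot by slot in embedding space.
    return float(np.mean([cosine(embed(a), embed(b)) for a, b in zip(t1, t2)]))


def scene_graph_similarity(query_triples, cand_triples) -> float:
    # Greedily match each query triple to its most similar candidate triple
    # and average the best scores; candidate images are then ranked by this value.
    if not query_triples or not cand_triples:
        return 0.0
    best = [max(triple_similarity(q, c) for c in cand_triples) for q in query_triples]
    return float(np.mean(best))


query = [("microwave", "on", "cabinet"), ("glass", "on", "table")]
candidate = [("oven", "on", "counter"), ("fruit", "in", "bowl")]
print(scene_graph_similarity(query, candidate))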
In addition, we propose a novel large-scale CBIR dataset. Existing CBIR datasets either come from classification benchmarks, e.g., the VOC challenge dataset [11], which only covers simple scenes, or, as in [12], contain complex scene images but lack explicit annotations of their similarities. Therefore, to facilitate CBIR in complex image scenes, we select 10,000 real images from the Visual Genome dataset [13], and for each of these images we manually labeled 100 of its most similar images