Cross-modal Image-Text Retrieval with Multitask Learning
Junyu Luo†
SIAT, Chinese Academy of Sciences
asbljy@outlook.com
Ying Shen
Peking University
shenying@pkusz.edu.cn
Xiang Ao
ICT, Chinese Academy of Sciences
aoxiang@ict.ac.cn
Zhou Zhao
Zhejiang University
zhaozhou@zju.edu.cn
Min Yang*
SIAT, Chinese Academy of Sciences
min.yang@siat.ac.cn
ABSTRACT
In this paper, we propose a multi-task learning approach for cross-
modal image-text retrieval. First, a correlation network is proposed
for the relation recognition task, which helps learn the complicated
relations and common information of different modalities. Then, we
propose a correspondence cross-modal autoencoder for the cross-modal
input reconstruction task, which helps correlate the hidden repre-
sentations of two uni-modal autoencoders. In addition, to further
improve the performance of cross-modal retrieval, two regulariza-
tion terms (variance and consistency constraints) are introduced
to the cross-modal embeddings such that the learned common in-
formation has large variance and is modality invariant. Finally, to
enable large-scale cross-modal similarity search, a flexible binary
transform network is designed to convert the text and image embed-
dings into binary codes. Extensive experiments on two benchmark
datasets demonstrate that our model consistently outperforms
strong baseline methods. Source code is available at
https://github.com/daerv/DAEVR.
KEYWORDS
Cross-modal retrieval, Correspondence autoencoder, Correlation
network, Variance constraint
ACM Reference Format:
Junyu Luo, Ying Shen, Xiang Ao, Zhou Zhao, and Min Yang*. 2019. Cross-
modal Image-Text Retrieval with Multitask Learning. In The 28th ACM
International Conference on Information and Knowledge Management (CIKM
’19), November 3–7, 2019, Beijing, China. ACM, New York, NY, USA, 4 pages.
https://doi.org/10.1145/3357384.3358104
1 INTRODUCTION
With the rapid growth of multimedia data that usually co-occur to
describe the same object or event, automated cross-modal retrieval,
which takes one type of data as the query to
∗Min Yang is the corresponding author (min.yang@siat.ac.cn).
†Also with Sichuan University.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
CIKM ’19, November 3–7, 2019, Beijing, China
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6976-3/19/11...$15.00
https://doi.org/10.1145/3357384.3358104
retrieve relevant data of different media types, has become imperative. During the past few
years, a large number of approaches [4, 10, 12] have been proposed
to deal with the cross-modal retrieval problem. The core of these
cross-modal retrieval approaches is to learn a common subspace
where the items of different modalities can be directly compared
with each other.
The performance of conventional cross-modal retrieval approaches,
such as Canonical Correlation Analysis (CCA) [12] and Topic-regression
Multi-modal LDA (Tr-mm LDA) [10], is highly dependent on the
visual and textual feature representations. Traditional handcrafted
feature extraction techniques, such as Scale Invariant Feature Trans-
form (SIFT), offer limited performance for cross-modal retrieval. Re-
cent advances in deep learning based approaches have taken the
state-of-the-art cross-modal retrieval results to a new level [3, 14].
The general idea behind these methods is to apply deep networks
to learn joint representations for multi-modal data.
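As a concrete illustration of this common-subspace idea, the following is a minimal sketch in PyTorch; the layer sizes, feature dimensions, and names are illustrative assumptions rather than any particular published model. Two branches project pre-extracted image and text features into a shared space where cross-modal items are compared directly by cosine similarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchEmbedding(nn.Module):
    """Projects image and text features into a common subspace (illustrative sketch)."""
    def __init__(self, img_dim=4096, txt_dim=300, common_dim=512):
        super().__init__()
        # hypothetical inputs: e.g. CNN image features and averaged word vectors
        self.img_proj = nn.Sequential(nn.Linear(img_dim, common_dim), nn.ReLU(),
                                      nn.Linear(common_dim, common_dim))
        self.txt_proj = nn.Sequential(nn.Linear(txt_dim, common_dim), nn.ReLU(),
                                      nn.Linear(common_dim, common_dim))

    def forward(self, img_feat, txt_feat):
        # L2-normalize so that retrieval reduces to ranking by cosine similarity
        v = F.normalize(self.img_proj(img_feat), dim=-1)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return v, t

model = TwoBranchEmbedding()
v, t = model(torch.randn(8, 4096), torch.randn(8, 300))
scores = v @ t.t()  # (8, 8) image-to-text similarity matrix used for ranking
```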
Although remarkable progress has been made by previous ap-
proaches, it is worth noting that there are still some common dis-
advantages hindering the current cross-modal retrieval methods.
First, most existing deep learning based models focus merely on
preserving the pairwise similarity of the coupled cross-modal data.
However, a common representation learned in this way fails to
fully preserve the underlying cross-modal semantic structure in
multimedia data. Second, previous methods narrow the modality
gap by constraining the corresponding hash codes with certain pre-
defined loss functions. The code length is usually less than 128 bits,
thus most of the useful information is neutralized, making the hash
codes incapable of capturing the inherent modality consistency.
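For context, hashing-based methods of this kind typically attach a small head that relaxes the binary constraint during training and binarizes at retrieval time. The sketch below is a generic illustration with assumed names and a tanh relaxation; it is not the binary transform network proposed in this paper.

```python
import torch
import torch.nn as nn

class HashingHead(nn.Module):
    """Maps continuous embeddings to short binary codes (generic illustration)."""
    def __init__(self, in_dim=512, code_len=64):  # code lengths are typically <= 128 bits
        super().__init__()
        self.fc = nn.Linear(in_dim, code_len)

    def forward(self, x):
        h = torch.tanh(self.fc(x))  # relaxed codes in (-1, 1), used in the training loss
        b = torch.sign(h)           # discrete {-1, +1} codes, used at retrieval time
        return h, b

head = HashingHead()
h_img, _ = head(torch.randn(4, 512))
h_txt, _ = head(torch.randn(4, 512))
# a typical predefined pairwise loss pulls the codes of matched image-text pairs together
pair_loss = (1 - torch.cosine_similarity(h_img, h_txt)).mean()
```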
To alleviate the aforementioned limitations, we propose a Multi-
task learning approach for Cross-modal Image-Text Retrieval (de-
noted as MCITR), which takes into full consideration the common
features across modalities and the modality consistency. First, we
propose a correlation network to distinguish mismatched image-
text pairs from matched ones. The correlation network helps to
learn the common information of different modalities and to capture
meaningful nearest neighbors across modalities for cross-modal
retrieval.
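A minimal sketch of such a correlation network is given below, assuming the image and text inputs are already embedded with a common dimensionality; the fusion scheme, layer sizes, and names are illustrative assumptions rather than the architecture used in MCITR. The hidden features it returns stand in for the hidden states that are later fed to the autoencoders.

```python
import torch
import torch.nn as nn

class CorrelationNetwork(nn.Module):
    """Scores whether an image-text pair is matched (illustrative sketch)."""
    def __init__(self, common_dim=512, hidden_dim=256):
        super().__init__()
        self.hidden_layer = nn.Sequential(nn.Linear(2 * common_dim, hidden_dim), nn.ReLU())
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, img_emb, txt_emb):
        # fuse the two modalities and predict a match probability
        fused = torch.cat([img_emb, txt_emb], dim=-1)
        hidden = self.hidden_layer(fused)  # hidden features, later reused for reconstruction
        prob = torch.sigmoid(self.classifier(hidden))
        return prob, hidden

# trained with binary cross-entropy on matched vs. randomly mismatched pairs
net = CorrelationNetwork()
prob, hidden = net(torch.randn(8, 512), torch.randn(8, 512))
target = torch.ones(8, 1)  # 1 = matched pair, 0 = mismatched pair
loss = nn.functional.binary_cross_entropy(prob, target)
```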
Second, a relation-enhanced correspondence cross-modal
autoencoder (RCCA) is employed to correlate the hidden
representations of two uni-modal autoencoders that are responsible
for the text and image modalities, respectively. Different from a
standard autoencoder, which reconstructs its own input, our RCCA
reconstructs inputs from different modalities, where the inputs to
each subnetwork are features from the hidden states of the correla-
tion network. This enables the model to focus on the common informa-
tion of the cross-modal data and improve the stability of learning