Cross-modal Image-Text Retrieval with Multitask Learning
Junyu Luo†
SIAT, Chinese Academy of Sciences
asbljy@outlook.com
Ying Shen
Peking University
shenying@pkusz.edu.cn
Xiang Ao
ICT, Chinese Academy of Sciences
aoxiang@ict.ac.cn
Zhou Zhao
Zhejiang University
zhaozhou@zju.edu.cn
Min Yang*
SIAT, Chinese Academy of Sciences
min.yang@siat.ac.cn
ABSTRACT
In this paper, we propose a multi-task learning approach for cross-
modal image-text retrieval. First, a correlation network is proposed
for the relation recognition task, which helps learn the complicated
relations and common information of different modalities. Then, we
propose a correspondence cross-modal autoencoder for the cross-modal
input reconstruction task, which helps correlate the hidden repre-
sentations of two uni-modal autoencoders. In addition, to further
improve the performance of cross-modal retrieval, two regulariza-
tion terms (variance and consistency constraints) are introduced
to the cross-modal embeddings such that the learned common in-
formation has large variance and is modality invariant. Finally, to
enable large-scale cross-modal similarity search, a flexible binary
transform network is designed to convert the text and image embed-
dings into binary codes. Extensive experiments on two benchmark
datasets demonstrate that our model consistently outperforms
strong baseline methods. Source code is available at
https://github.com/daerv/DAEVR.
KEYWORDS
Cross-modal retrieval, Correspondence autoencoder, Correlation
network, Variance constraint
ACM Reference Format:
Junyu Luo, Ying Shen, Xiang Ao, Zhou Zhao, and Min Yang*. 2019. Cross-
modal Image-Text Retrieval with Multitask Learning. In The 28th ACM
International Conference on Information and Knowledge Management (CIKM
’19), November 3–7, 2019, Beijing, China. ACM, New York, NY, USA, 4 pages.
https://doi.org/10.1145/3357384.3358104
1 INTRODUCTION
With the rapid growth of multimedia data that usually co-occur to
describe the same object or event, automated cross-modal retrieval,
which takes one type of data as the query to
∗Min Yang is the corresponding author (min.yang@siat.ac.cn).
†Also with Sichuan University.
Permission to make digital or hard copies of all or part of this work for personal or
classroom use is granted without fee provided that copies are not made or distributed
for profit or commercial advantage and that copies bear this notice and the full citation
on the first page. Copyrights for components of this work owned by others than ACM
must be honored. Abstracting with credit is permitted. To copy otherwise, or republish,
to post on servers or to redistribute to lists, requires prior specific permission and/or a
fee. Request permissions from permissions@acm.org.
CIKM ’19, November 3–7, 2019, Beijing, China
© 2019 Association for Computing Machinery.
ACM ISBN 978-1-4503-6976-3/19/11...$15.00
https://doi.org/10.1145/3357384.3358104
retrieve relevant data of different media types, has become imperative. During the past few
years, a large number of approaches [4, 10, 12] have been proposed
to deal with the cross-modal retrieval problem. The core of these
cross-modal retrieval approaches is to learn a common subspace
where the items of different modalities can be directly compared
with each other.
The performance of conventional cross-modal retrieval approaches,
such as Canonical Correlation Analysis (CCA) [12] and Topic-regression
Multi-modal LDA (Tr-mm LDA) [10], is highly dependent on the
visual and textual feature representations. Traditional handcrafted
feature extraction techniques, such as Scale Invariant Feature Trans-
form (SIFT), offer limited performance for cross-modal retrieval. Re-
cent advances in deep learning based approaches have taken the
state-of-the-art cross-modal retrieval results to a new level [3, 14].
The general idea behind these methods is to apply deep networks
to learn joint representations for multi-modal data.
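As a concrete illustration of this common-subspace idea, the following is a minimal sketch in PyTorch; the layer sizes, feature dimensions, and names are illustrative assumptions rather than any particular published model. Two branches project pre-extracted image and text features into a shared space where cross-modal items are compared directly by cosine similarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TwoBranchEmbedding(nn.Module):
    """Projects image and text features into a common subspace (illustrative sketch)."""
    def __init__(self, img_dim=4096, txt_dim=300, common_dim=512):
        super().__init__()
        # hypothetical inputs: e.g. CNN image features and averaged word vectors
        self.img_proj = nn.Sequential(nn.Linear(img_dim, common_dim), nn.ReLU(),
                                      nn.Linear(common_dim, common_dim))
        self.txt_proj = nn.Sequential(nn.Linear(txt_dim, common_dim), nn.ReLU(),
                                      nn.Linear(common_dim, common_dim))

    def forward(self, img_feat, txt_feat):
        # L2-normalize so that retrieval reduces to ranking by cosine similarity
        v = F.normalize(self.img_proj(img_feat), dim=-1)
        t = F.normalize(self.txt_proj(txt_feat), dim=-1)
        return v, t

model = TwoBranchEmbedding()
v, t = model(torch.randn(8, 4096), torch.randn(8, 300))
scores = v @ t.t()  # (8, 8) image-to-text similarity matrix used for ranking
```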
Although remarkable progress has been made by previous ap-
proaches, it is worth noting that there are still some common dis-
advantages hindering the current cross-modal retrieval methods.
First, most existing deep learning based models focus merely on
preserving the pairwise similarity of the coupled cross-modal data.
However, a common representation learned in this way fails to
fully preserve the underlying cross-modal semantic structure in
multimedia data. Second, previous methods narrow the modality
gap by constraining the corresponding hash codes with certain pre-
defined loss functions. The code length is usually less than 128 bits,
thus most of the useful information is neutralized, making the hash
codes incapable of capturing the inherent modality consistency.
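For context, hashing-based methods of this kind typically attach a small head that relaxes the binary constraint during training and binarizes at retrieval time. The sketch below is a generic illustration with assumed names and a tanh relaxation; it is not the binary transform network proposed in this paper.

```python
import torch
import torch.nn as nn

class HashingHead(nn.Module):
    """Maps continuous embeddings to short binary codes (generic illustration)."""
    def __init__(self, in_dim=512, code_len=64):  # code lengths are typically <= 128 bits
        super().__init__()
        self.fc = nn.Linear(in_dim, code_len)

    def forward(self, x):
        h = torch.tanh(self.fc(x))  # relaxed codes in (-1, 1), used in the training loss
        b = torch.sign(h)           # discrete {-1, +1} codes, used at retrieval time
        return h, b

head = HashingHead()
h_img, _ = head(torch.randn(4, 512))
h_txt, _ = head(torch.randn(4, 512))
# a typical predefined pairwise loss pulls the codes of matched image-text pairs together
pair_loss = (1 - torch.cosine_similarity(h_img, h_txt)).mean()
```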
To alleviate the aforementioned limitations, we propose a Multi-
task learning approach for Cross-modal Image-Text Retrieval (de-
noted as MCITR), which takes into full consideration the common
features across modalities and the modality consistency. First, we
propose a correlation network to distinguish mismatched image-
text pairs from matched ones. The correlation network helps to
learn the common information of different modalities and to capture
meaningful nearest neighbors across modalities for cross-modal
retrieval.
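A minimal sketch of such a correlation network is given below, assuming the image and text inputs are already embedded with a common dimensionality; the fusion scheme, layer sizes, and names are illustrative assumptions rather than the architecture used in MCITR. The hidden features it returns stand in for the hidden states that are later fed to the autoencoders.

```python
import torch
import torch.nn as nn

class CorrelationNetwork(nn.Module):
    """Scores whether an image-text pair is matched (illustrative sketch)."""
    def __init__(self, common_dim=512, hidden_dim=256):
        super().__init__()
        self.hidden_layer = nn.Sequential(nn.Linear(2 * common_dim, hidden_dim), nn.ReLU())
        self.classifier = nn.Linear(hidden_dim, 1)

    def forward(self, img_emb, txt_emb):
        # fuse the two modalities and predict a match probability
        fused = torch.cat([img_emb, txt_emb], dim=-1)
        hidden = self.hidden_layer(fused)  # hidden features, later reused for reconstruction
        prob = torch.sigmoid(self.classifier(hidden))
        return prob, hidden

# trained with binary cross-entropy on matched vs. randomly mismatched pairs
net = CorrelationNetwork()
prob, hidden = net(torch.randn(8, 512), torch.randn(8, 512))
target = torch.ones(8, 1)  # 1 = matched pair, 0 = mismatched pair
loss = nn.functional.binary_cross_entropy(prob, target)
```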
Second, a relation-enhanced correspondence cross-modal
autoencoder (RCCA) is employed to correlate the hidden
representations of two uni-modal autoencoders that are responsible
for the text and image modalities, respectively. Different from a
standard autoencoder, which reconstructs its own input, our RCCA
reconstructs inputs from different modalities, where the inputs to
each subnetwork are features from the hidden states of the correla-
tion network. This enables the model to focus on the common informa-
tion of the cross-modal data and improve the stability of learning