Paper:

Modality-Reconstructed Cross-Media Retrieval via Sparse Neural Networks Pre-Trained by Restricted Boltzmann Machines

Bin Zhang, Huaxiang Zhang†, Jiande Sun, Zhenhua Wang, Hongchen Wu, and Xiao Dong

Department of Computer Science, Shandong Normal University
No. 1, University Road, Changqing District, Jinan 250300, China
E-mail: huaxzhang@163.com
†Corresponding author
[Received August 24, 2017; accepted May 16, 2018]
Cross-media retrieval has attracted considerable research interest, and a significant number of works focus on mapping heterogeneous data into a common subspace, using a pair of projection matrices corresponding to each modality, before performing similarity comparison. In contrast, we reconstruct one modality (e.g., images) into the other (e.g., texts) using a model named Modality-Reconstructed Cross-media Retrieval via sparse neural networks pre-trained by Restricted Boltzmann Machines (MRCR-RSNN), so that we can project one modality into the space of the other directly. The input of the model is the low-level features of one modality, and the output is those of the other; cross-media retrieval is then implemented based on the similarities of their representations. Our model requires no manual annotation, which makes it widely applicable; it is simple but effective. We evaluate the performance of our method on several benchmark datasets, and the experimental results demonstrate its effectiveness in terms of Mean Average Precision (MAP) and Precision-Recall (PR).
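As a rough, hypothetical sketch of this reconstruction idea (the layer sizes, feature dimensions, and helper names below are illustrative assumptions, and the RBM pre-training step is omitted), one modality's features can be mapped directly into the other's feature space and retrieval performed by similarity ranking:

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 1000-dim image features in, 100-dim
# text features out. In the paper each layer would be pre-trained
# as an RBM; here the weights are simply randomly initialized.
reconstructor = nn.Sequential(
    nn.Linear(1000, 512), nn.Sigmoid(),
    nn.Linear(512, 256), nn.Sigmoid(),
    nn.Linear(256, 100),
)

def retrieve_texts(img_feat, text_feats):
    """Rank database texts by cosine similarity to the
    reconstructed (image -> text space) query."""
    with torch.no_grad():
        rep = reconstructor(img_feat)      # image projected into text space
    rep = rep / rep.norm()
    db = text_feats / text_feats.norm(dim=1, keepdim=True)
    return torch.argsort(db @ rep, descending=True)  # most similar first

# Usage with random placeholder features.
query = torch.randn(1000)
texts = torch.randn(50, 100)
ranking = retrieve_texts(query, texts)
```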
Keywords: cross-media retrieval, restricted Boltzmann machines, sparse neural networks, modality-reconstructed
1. Introduction
Cross-media retrieval is becoming a major trend in information retrieval. With the coming of the big data era, multi-modal data is growing rapidly, and the retrieval of single-modal data cannot satisfy users' needs in many domains. For example, when we retrieve information about the Great Wall on the Internet, we may take a photo and submit it as a query; what we want is not only similar images but also relevant textual materials. Consequently, cross-media retrieval came into being. In this paper, we concentrate mainly on cross-media retrieval between images and texts, which involves two tasks: given a query image, retrieve the matching texts; and given a query text, retrieve the matching images.
Fig. 1. The method of subspace learning for cross-modal retrieval.

Traditional information retrieval is text-based (e.g.,
search engines such as Google, Baidu, and Bing) or content-based [1–3] (e.g., retrieval systems such as SpeechBot [4], VideoQ [5], and SIMPLicity [6]). Text-based retrieval relies on keywords annotated by humans; content-based retrieval was then proposed in the 1990s. However, both are single-modality-based and do not satisfy the needs of information retrieval. Consequently, cross-media retrieval is becoming more and more popular.
Essentially, the fundamental challenge of cross-media retrieval is the heterogeneity gap between different media data. For example, it is difficult to measure the content similarity between an image with 1000-dimensional visual features and a text with 100-dimensional textual features; although they may share the same semantics, it is not easy to find the relationship between them. A straightforward approach, subspace learning, maps the visual features of images and the textual features of texts into an isomorphic subspace to learn a common representation, using a pair of projection matrices, so that the two modalities can be measured directly (as shown in Fig. 1). Canonical Correlation Analysis (CCA) is a classic method that learns a subspace of the same dimensionality by maximizing the correlations between different modal data [7]; it is unsupervised.
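As a concrete illustration of this subspace-learning baseline, the sketch below uses scikit-learn's CCA to project hypothetical 1000-dimensional visual features and 100-dimensional textual features (random placeholders, not real data) into a shared subspace where cross-modal cosine similarity is directly comparable:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

# Random placeholders standing in for paired data: 500 image-text
# pairs, 1000-dim visual features and 100-dim textual features.
rng = np.random.default_rng(0)
X_img = rng.standard_normal((500, 1000))
X_txt = rng.standard_normal((500, 100))

# Learn two projection matrices that maximize the correlation
# between the modalities in a shared 10-dim subspace.
cca = CCA(n_components=10)
cca.fit(X_img, X_txt)
Z_img, Z_txt = cca.transform(X_img, X_txt)

# In the common subspace, cross-modal similarity is directly
# meaningful, e.g., between image i and every text.
i = 0
sims = (Z_txt @ Z_img[i]) / (
    np.linalg.norm(Z_txt, axis=1) * np.linalg.norm(Z_img[i]) + 1e-12
)
ranking = np.argsort(-sims)  # most similar texts first
```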
The other unsu-