PL-ranking: A Novel Ranking Method for Cross-Modal
Retrieval
Liang Zhang¹,³, Bingpeng Ma¹,²,³∗, Guorong Li¹,³, Qingming Huang¹,²,³, Qi Tian⁴
¹University of Chinese Academy of Sciences, China
²Key Lab of Intell. Info. Process., Inst. of Comput. Tech., CAS, China
³Key Laboratory of Big Data Mining and Knowledge Management, CAS, China
⁴Department of Computer Science, University of Texas at San Antonio, TX, 78249, USA
zhangliang14@mails.ucas.ac.cn, {bpma, liguorong, qmhuang}@ucas.ac.cn,
qitian@cs.utsa.edu
ABSTRACT
This paper proposes a novel method for cross-modal retrieval
named Pairwise-Listwise ranking (PL-ranking) based on
the low-rank optimization framework. Motivated by the fact
that optimizing the top of the ranking is more applicable in
practice, we focus on improving the precision at the top of
the ranked list for a given sample and learning a low-dimensional
common subspace for multi-modal data. Concretely, there
are three constraints in PL-ranking. First, we use a pair-
wise ranking loss constraint to optimize the top of ranking.
Then, considering that the pairwise ranking loss constraint
ignores class information, we further adopt a listwise con-
straint to minimize the intra-neighbors variance and max-
imize the inter-neighbors separability. In this way, class
information is preserved while the number of iterations is
reduced. Finally, low-rank based regularization is applied to
exploit the correlations between features and labels so that
the relevance between the different modalities can be en-
hanced after mapping them into the common subspace. We
design an efficient low-rank stochastic subgradient descent
method to solve the proposed optimization problem. The
experimental results show that the average MAP scores of
PL-ranking exceed those of the state-of-the-art methods by
5.1%, 9.2%, 4.7% and 4.8% on the Wiki, Flickr,
Pascal and NUS-WIDE datasets, respectively.
Keywords
Multi-modal analysis; Cross-modal retrieval; Subspace learn-
ing; Learning to rank
1. INTRODUCTION
With the rapid growth of multi-modal data, including im-
age, text, video and audio, cross-modal retrieval has been
∗Corresponding author.
MM ’16, October 15-19, 2016, Amsterdam, Netherlands
© 2016 ACM. ISBN 978-1-4503-3603-1/16/10 ... $15.00
DOI: http://dx.doi.org/10.1145/2964284.2964336
widely studied in recent years [7, 9, 13, 15, 17, 18, 20, 21, 22,
26, 27]. The key problem in cross-modal matching is how to
push relevant samples from another modality to the top of
the ranked list given a query sample from one modality.
This has made learning-to-rank techniques, which can exploit
the correlations shared by different modalities, increasingly
popular. These methods optimize the top of the ranking by maximizing
a criterion (e.g., MAP or NDCG) related to the ultimate
retrieval performance.
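For concreteness, the MAP criterion mentioned above can be computed
from binary relevance flags ordered by retrieval score; the following
is a minimal illustrative sketch in Python, not the evaluation code
used in this paper:

    def average_precision(ranked_relevance):
        # ranked_relevance: 0/1 relevance flags, sorted by descending score
        hits, precision_sum = 0, 0.0
        for i, rel in enumerate(ranked_relevance, start=1):
            if rel:
                hits += 1
                precision_sum += hits / i  # precision at this cut-off
        return precision_sum / hits if hits else 0.0

    def mean_average_precision(per_query_relevance):
        # mean of the per-query average precision scores
        aps = [average_precision(r) for r in per_query_relevance]
        return sum(aps) / len(aps)

Because each relevant hit contributes the precision at its own rank,
mistakes near the top of the list are penalized most, which is why such
criteria reward top-of-ranking accuracy.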
The most successful ranking method in cross-modal retrieval
may be the bi-directional cross-media semantic representation
model (Bi-CMSRM), which optimizes ranking performance
directly [27]. Bi-CMSRM is based on the structural SVM, is
optimized using the 1-Slack cutting plane algorithm, and has
shown good performance in cross-modal retrieval. However,
despite using an efficient convex method to solve the dual
problem, Bi-CMSRM scales poorly to large, high-dimensional
datasets [16]. Besides, Bi-CMSRM focuses only on learning
optimal mappings but ignores the structure of the mappings,
so it cannot further exploit the label relevance between the
different modalities.
In this paper, we propose an efficient ranking method for
cross-modal retrieval named PL-ranking. PL-ranking in-
tegrates the weighted approximate rank pairwise (WARP)
loss¹, listwise loss² and a low-rank constraint into a generic
minimization formulation, which is then optimized by extending
the recently proposed FAST-SSGD [8]. In this way,
PL-ranking not only optimizes the top of the ranking, but also
effectively captures the label correlations and scales to large,
high-dimensional datasets. Thus, we can effectively retrieve
relevant samples by searching in a small neighborhood of
the query sample. Specifically, there are
three important components contained in PL-ranking.
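Schematically, such a formulation can be written as follows; the
projection matrices U and V (one per modality), the trade-off
weights λ1 and λ2, and the use of the nuclear norm as the low-rank
regularizer are illustrative assumptions, not the paper's exact
notation:

    \min_{U,V} \; \mathcal{L}_{\mathrm{WARP}}(U,V)
        + \lambda_1\, \mathcal{L}_{\mathrm{list}}(U,V)
        + \lambda_2 \left( \|U\|_{*} + \|V\|_{*} \right)

Here L_WARP is the bi-directional pairwise ranking term, L_list
encodes the intra-/inter-neighbor listwise constraint, and the
nuclear norm is the standard convex surrogate for matrix rank.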
We first extend WARP to bi-directional WARP (bWARP)
such that the learned model can be applied to image-query-
texts and text-query-images simultaneously. Since both
directions of retrieval are optimized during training,
bWARP ensures that the different modalities are projected
¹The pairwise ranking method takes sample pairs as training
instances and formulates ranking as the task of learning a
classification or regression model from the collection of
pairwise instances of samples.
²The listwise information reflects the class relation of multiple
samples, e.g., intra-class and inter-class relations.
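To make footnote 1 concrete, the following sketches the standard
WARP sampling step for one query; the scoring function score(), the
margin, and the helper names are hypothetical placeholders rather
than this paper's implementation:

    import random

    def warp_weight(est_rank):
        # L(k) = sum_{i=1..k} 1/i: weights violations near the top more
        return sum(1.0 / i for i in range(1, est_rank + 1))

    def sample_warp_violation(query, positive, negatives, score, margin=1.0):
        # draw negatives until one violates the margin; the number of
        # draws needed gives an estimate of the positive sample's rank
        pos_score = score(query, positive)
        for trials in range(1, len(negatives) + 1):
            neg = random.choice(negatives)
            if score(query, neg) + margin > pos_score:
                est_rank = max(1, len(negatives) // trials)
                return neg, warp_weight(est_rank)
        return None, 0.0  # no violating negative found

The returned weight scales the update for the sampled pair, so pairs
whose positive sits far down the list receive larger corrections.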