boosting algorithm that optimizes an exponential loss,
which upper bounds the metrics of MAP and NDCG.
(2) The second stream defines several listwise loss
functions, which take the list of retrieved documents
for the same query as a sample. ListNet [4] defines
a loss function based on the KL-divergence between
two permutation probability distributions. ListMLE
[18] defines another listwise likelihood loss function
based on the Luce Model [17].
Another line of work related to this paper is sparse learning, which has been widely applied in computer vision, signal processing, and bioinformatics. Many learning algorithms have been proposed for sparse classification/regression, such as decomposition algorithms [28], [27] and algorithms for $\ell_1$-constrained optimization problems [31], [30], [29]; interested readers may refer to [27] for further discussion of sparse classification/regression algorithms.
Since the pairwise approach reduces the ranking
problem to a classification problem on document
pairs, in principle, many algorithms for sparse clas-
sification can be applied to obtain sparse ranking
models. However, few efforts have been made to
tackle the problem of learning a sparse solution for
ranking. Recently, Sun et al. [10] proposed a framework that reduces ranking to importance-weighted pairwise classification and then used an $\ell_1$-regularized algorithm to learn a sparse ranking predictor. Despite its success, that work does not isolate the individual contributions of its two components, the reduction framework and the sparse learning algorithm. Sparse learning for ranking is a relatively new topic that calls for further exploration.
3 NOTATIONS
We introduce the notation used throughout this paper. In the learning-to-rank problem, there is a labeled training set $S=\{(q_k, X_k, Y_k)\}_{k=1}^{n}$ and a test set $T=\{(q_k, X_k)\}_{k=n+1}^{n+u}$. Here $q_k$ denotes a query, $X_k=\{X_{k,i}\}_{i=1}^{n(q_k)}$ denotes the list of retrieved objects (i.e., documents) for $q_k$, and $Y_k=\{y_{k,i}\}_{i=1}^{n(q_k)}$ is the list of corresponding relevance labels provided by humans, where $y_{k,i}\in\{0,1,2,3,4\}$, $n(q_k)$ is the number of objects in the retrieved list for query $q_k$, and $X_{k,i}$ is the $i$-th object in the retrieved list for query $q_k$. Each $X_{k,i}\in\mathbb{R}^{m}$ is an $m$-dimensional feature vector, and each attribute of $X_{k,i}$ is scaled to the range $[0,1]$.
We define a set $P$ of comparable object pairs as follows: $(k,i,j)\in P$ if and only if $X_{k,i}$ and $X_{k,j}$ belong to the same query $q_k$ and $y_{k,i}\neq y_{k,j}$. We use $p$ to denote the number of pairs in $P$. In addition, we define an object pairwise comparison error matrix $K\in\mathbb{R}^{p\times m}$ as follows: each pair in $P$ corresponds to a row of $K$. Denoting the $l$-th pair in $P$ by $\{k_l, i_l, j_l\}$ and the $l$-th row of $K$ by $K_l$, we define $K_l = y_{k_l,i_l,j_l}\,(X_{k_l,i_l} - X_{k_l,j_l})$,
TABLE 1
List of notations

Notation                               Meaning
$S=\{(q_i, X_i, Y_i)\}_{i=1}^{n}$      training set
$m$                                    dimension of the data
$p$                                    number of pairs in the set $P$
$r$                                    radius of the $\ell_1$-ball: $\|w\|_1 \le r$
$K$                                    matrix in $\mathbb{R}^{p\times m}$ that contains the pairwise information
$I_C(w)$                               $I_C(w)=0$ if condition $C$ is satisfied, otherwise $I_C(w)=\infty$
where $y_{k_l,i_l,j_l}=1$ if $y_{k_l,i_l} > y_{k_l,j_l}$, and $y_{k_l,i_l,j_l}=-1$ otherwise. Since $X_{k,i}\in[0,1]^{m}$ for all $k, i$, we have $K_l\in[-1,1]^{m}$ for all $l$.
We use $\langle x, y\rangle$ to denote the inner product of two vectors $x$ and $y$. Let $r$ denote the radius of an $\ell_1$-ball: $\|w\|_1 \le r$. We introduce an indicator function $I_C(w)$: for a given vector $w$, $I_C(w)=0$ if and only if condition $C$ is satisfied; otherwise $I_C(w)=+\infty$. The above notation is summarized in Table 1.
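As a small illustration (the specific use of $I_C(w)$ sketched here is our reading of the definition, stated only for clarity), the indicator function lets an $\ell_1$-constrained problem be written as an unconstrained one:

$$\min_{\|w\|_1 \le r} f(w) \;=\; \min_{w}\; f(w) + I_{\|w\|_1 \le r}(w),$$

since the added term is $0$ on the feasible set and $+\infty$ outside it.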
4 PROBLEM STATEMENT
The learning-to-rank problem has a wide range of applications in information retrieval systems. We are given a labeled training set $S=\{(q_k, X_k, Y_k)\}_{k=1}^{n}$ and a test set $T=\{(q_k, X_k)\}_{k=n+1}^{n+u}$. The task of learning to rank is to construct a ranking predictor from the training data and then sort the examples in the test set using that predictor.
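For concreteness, here is a minimal sketch of this prediction step with a linear scoring function of the form introduced in the next paragraph; the function and variable names (rank_documents, X_test) are illustrative assumptions.

```python
import numpy as np

def rank_documents(w, X):
    """Score documents X (n_docs, m) with a linear predictor f(x) = <w, x>
    and return their indices sorted from most to least relevant."""
    scores = X @ w
    return np.argsort(-scores)

# Example: rank 3 test documents for one query with a learned weight vector w.
w = np.array([0.7, 0.0, 0.3, 0.0])           # a sparse weight vector
X_test = np.array([[0.1, 0.9, 0.2, 0.3],
                   [0.8, 0.2, 0.6, 0.1],
                   [0.4, 0.4, 0.4, 0.4]])
order = rank_documents(w, X_test)            # array([1, 2, 0])
```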
Following the common practice in learning to rank, in this paper we focus only on learning a linear ranking predictor $f(x)=\langle w, x\rangle$. Many existing learning-to-rank algorithms use this setting. SVM methods, such as the recently proposed RankSVM-Struct [19] and RankSVM-Primal [16], are notable algorithms for learning linear ranking predictors and achieve state-of-the-art performance on several benchmark datasets. These methods learn ranking models by minimizing regularized pairwise loss functions of the following form:
$$\min_{w}\;\; \frac{1}{2}\|w\|_2^2 \;+\; C \sum_{(k,i,j)\in P} \ell\!\left(y_{k,i,j}\, w^{T}(X_{k,i}-X_{k,j})\right) \qquad (1)$$
where $\ell(x)$ can be the hinge loss $\ell(x)=\max(0, 1-x)$ or the squared hinge loss $\ell(x)=\max(0, 1-x)^2$, and $C$ is a parameter that controls the trade-off between the training error and the model complexity. Existing work [34] in learning to rank revealed that the classification-based pairwise loss function (i.e., the hinge loss) is an upper bound of both $(1-\mathrm{NDCG})$ and $(1-\mathrm{MAP})$. When we take $\ell(x)=\max(0, 1-x)$, the objective given by (1) is that of Ranking SVM; several algorithms, such as quadratic programming [26] or the cutting-plane algorithm [19], minimize this objective function. If $\ell(x)=\max(0, 1-x)^2$, the function given by (1) becomes the objective of RankSVM-Primal [16].
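The following sketch evaluates objective (1) directly for a candidate $w$ using the pair matrix $K$ from Section 3. The function name ranksvm_objective and the direct evaluation are illustrative assumptions; in practice the objective is minimized with solvers such as the quadratic programming or cutting-plane methods cited above.

```python
import numpy as np

def ranksvm_objective(w, K, C, squared=False):
    """Evaluate objective (1): 0.5 * ||w||_2^2 + C * sum_l loss(<K_l, w>).

    K is the p x m pairwise comparison matrix, so each margin
    <K_l, w> equals y_{k,i,j} * w^T (X_{k,i} - X_{k,j}).
    """
    margins = K @ w
    hinge = np.maximum(0.0, 1.0 - margins)
    # squared=True gives the squared hinge loss used by RankSVM-Primal
    losses = hinge ** 2 if squared else hinge
    return 0.5 * np.dot(w, w) + C * losses.sum()

# Example: evaluate the objective on a toy pair matrix.
K = np.array([[0.3, -0.2, 0.7, -0.4],
              [0.1, 0.4, -0.3, 0.2]])
w = np.array([0.5, 0.0, 0.2, 0.0])
value = ranksvm_objective(w, K, C=1.0)
```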