Adapting Ranking SVM to Document Retrieval
Yunbo CAO¹, Jun XU², Tie-Yan LIU¹, Hang LI¹, Yalou HUANG², Hsiao-Wuen HON¹

¹ Microsoft Research Asia, No.49 Zhichun Road, Haidian District, Beijing, China, 100080
{yunbo.cao, tyliu, hangli, hon}@microsoft.com

² College of Software, Nankai University, No.94 Weijin Road, Nankai District, Tianjin, China, 300071
nkxj@hotmail.com, yellow@nankai.edu.cn
ABSTRACT
The paper is concerned with applying learning to rank to
document retrieval. Ranking SVM is a typical method of learning
to rank. We point out that there are two factors one must consider
when applying Ranking SVM, in general a “learning to rank”
method, to document retrieval. First, correctly ranking documents
on the top of the result list is crucial for an Information Retrieval
system. One must conduct training in a way that such ranked
results are accurate. Second, the number of relevant documents
can vary from query to query. One must avoid training a model
biased toward queries with a large number of relevant documents.
Previously, when existing methods, including Ranking SVM, were
applied to document retrieval, neither of these factors was taken
into consideration. We show that it is possible to modify
conventional Ranking SVM so that it can be better
used for document retrieval. Specifically, we modify the “Hinge
Loss” function in Ranking SVM to deal with the problems
described above. We employ two methods to conduct
optimization on the loss function: gradient descent and quadratic
programming. Experimental results show that our method,
referred to as Ranking SVM for IR, can outperform the
conventional Ranking SVM and other existing methods for
document retrieval on two datasets.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Information Search
and Retrieval – Retrieval Models
General Terms
Algorithms, Experimentation, Theory
Keywords
Information retrieval, loss function, Ranking SVM
1. INTRODUCTION
Ranking functions in document retrieval traditionally use a small
number of features (e.g., term frequency, inverse document
frequency, and document length), which makes it possible to
empirically tune ranking parameters [20]. Recently, however, a
growing number of features, such as structural features (title text
and anchor text) and query-independent features (e.g., PageRank
and URL length), have proved useful in document retrieval, and
empirical tuning of ranking functions has become increasingly
difficult.
Fortunately, in recent years more and more human-judged
document retrieval results have become available. This makes it
possible to employ supervised learning methodologies in the
tuning of ranking functions, and many such efforts have been made.
In one such effort, document retrieval is formalized as
classification of documents into two categories: relevant and
irrelevant. Nallapati [12], for example, formalizes document
retrieval as binary classification and solves the classification
problem using Support Vector Machines (SVM) and Maximum
Entropy (ME).
In another approach, document retrieval is formalized as a
“learning to rank” problem in which documents are mapped into
several ordered categories (ranks). OHSUMED [9], for example, is a
data collection whose documents are judged with three such ranks:
definitely relevant, partially relevant, and irrelevant. Herbrich et
al. [8], for instance, propose a method of learning to rank on the
basis of SVM and apply their method to document retrieval. We
refer to their method as Ranking SVM (or conventional Ranking
SVM) in this paper. Specifically, Ranking SVM formalizes
learning to rank as a problem of classifying instance pairs into
two categories (correctly ranked and incorrectly ranked). Other
methods within this approach have also been proposed [1, 19, 24].
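As background, the pairwise formulation just described can be written (in our own notation, which may differ superficially from that of [8]) as the following quadratic program, where the i-th training pair consists of instances x_i^(1) and x_i^(2) with x_i^(1) ranked higher:

```latex
\[
\begin{aligned}
\min_{\mathbf{w},\,\boldsymbol{\xi}} \quad
  & \frac{1}{2}\|\mathbf{w}\|^{2} + C\sum_{i=1}^{\ell}\xi_{i} \\
\text{s.t.} \quad
  & \mathbf{w}^{\top}\bigl(\mathbf{x}_{i}^{(1)}-\mathbf{x}_{i}^{(2)}\bigr)
    \ge 1-\xi_{i}, \qquad \xi_{i}\ge 0, \quad i=1,\dots,\ell .
\end{aligned}
\]
```

Eliminating the slack variables, this is equivalent to minimizing the sum of hinge losses max(0, 1 − w⊤(x_i^(1) − x_i^(2))) plus an L2 regularization term, which is the "Hinge Loss" view of Ranking SVM referred to in the abstract.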
We explore the problem of applying learning to rank to
document retrieval and propose a new learning method on the
basis of Ranking SVM. We refer to the method as Ranking SVM
for IR.
We note two important factors that must be taken into
consideration when applying a learning-to-rank method such as
Ranking SVM to document retrieval. Unfortunately, existing
methods, including Ranking SVM, ignore them.
(1) High accuracy on top-ranked documents is crucial
for an IR system. Analysis on click-through data from search
engines shows that users usually click on top-ranked documents
among returned search results [16, 17, 18]. The Normalized
Discounted Cumulated Gain (NDCG) measure [10] used in
evaluation of document retrieval also reflects this preference.
Therefore, it is necessary to perform training so that the top-
ranked results (equivalently, the results for the highest ranks) are
accurate. However, in existing learning methods such as Ranking
SVM, the loss (penalty) for incorrectly ranking a higher-rank
document against a lower-rank document is defined to be the same
as that for incorrectly ranking two lower-rank documents.
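To make this factor concrete, the sketch below (hypothetical illustrative code, not the paper's implementation) computes a pairwise hinge loss in which each pair of ranks carries its own penalty weight, here called tau. Setting every weight to 1.0 recovers the conventional Ranking SVM loss, in which all rank pairs are penalized equally; the modification motivated here would assign larger weights to pairs involving the highest ranks.

```python
def weighted_pairwise_hinge_loss(scores, ranks, tau):
    """Pairwise hinge loss with per-rank-pair weights.

    scores: model scores for the documents of one query.
    ranks:  integer relevance ranks (larger = more relevant).
    tau:    dict mapping (higher_rank, lower_rank) -> penalty weight;
            all weights equal to 1.0 recovers the conventional
            Ranking SVM loss.
    """
    loss = 0.0
    for s_i, r_i in zip(scores, ranks):
        for s_j, r_j in zip(scores, ranks):
            if r_i > r_j:  # document i should be scored above document j
                # hinge penalty whenever the margin s_i - s_j falls below 1
                loss += tau[(r_i, r_j)] * max(0.0, 1.0 - (s_i - s_j))
    return loss
```

Choosing, say, tau[(2, 1)] larger than tau[(1, 0)] penalizes errors between the top rank and the middle rank more heavily than errors among the lower ranks, which is the intuition behind modifying the Hinge Loss as this paper proposes.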
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
SIGIR’06, August 6-11, 2006, Seattle, Washington, USA.
Copyright 2006 ACM 1-59593-369-7/06/0008...$5.00.