Foundations and Trends® in Information Retrieval
Vol. xx, No xx (2008) 1–112
© 2008 xxxxxxxxx
DOI: xxxxxx
Learning to Rank for Information Retrieval
Tie-Yan Liu¹
¹ Microsoft Research Asia, Sigma Center, No. 49, Zhichun Road, Haidian District, Beijing, 100190, P. R. China, Tie-Yan.Liu@microsoft.com
Abstract
Learning to rank for information retrieval (IR) is the task of automatically constructing a ranking model from training data, such that the model can sort new objects according to their degrees of relevance, preference, or importance. Many IR problems are by nature ranking problems, and many IR technologies can potentially be enhanced by learning-to-rank techniques. The objective of this tutorial is to give an introduction to this research direction. Specifically, the existing learning-to-rank algorithms are reviewed and categorized into three approaches: the pointwise, pairwise, and listwise approaches. The advantages and problems of each approach are analyzed, and the relationships between the loss functions used in these approaches and IR evaluation measures are discussed. Then empirical evaluations of typical learning-to-rank methods are presented, with the LETOR collection as a benchmark dataset; the results suggest that the listwise approach is the most effective one among the three. After that, a statistical ranking theory is introduced, which can describe different learning-to-rank algorithms and be used to analyze their query-level generalization abilities. At the end of the tutorial, we give a summary and discuss potential future work on learning to rank.

Contents
1 Introduction
1.1 Ranking in Information Retrieval
1.2 Learning to Rank
1.3 About this Tutorial
2 The Pointwise Approach
2.1 Regression based Algorithms
2.2 Classification based Algorithms
2.3 Ordinal Regression based Algorithms
2.4 Discussions
3 The Pairwise Approach
3.1 Example Algorithms
3.2 Discussions
4 The Listwise Approach
4.1 Direct Optimization of IR Evaluation Measures
4.2 Minimization of Listwise Ranking Losses
4.3 Discussions
5 Analysis of the Approaches
5.1 The Pointwise Approach
5.2 The Pairwise Approach
5.3 The Listwise Approach
5.4 Discussions
6 Benchmarking Learning-to-Rank Algorithms
6.1 The LETOR Collection
6.2 Experimental Results on LETOR
7 Statistical Ranking Theory
7.1 Conventional Generalization Analyses on Ranking
7.2 A Query-level Ranking Framework
7.3 Query-level Generalization Analysis
7.4 Discussions
8 Summary and Outlook
References
Acknowledgements

1 Introduction
With the fast development of the Web, every one of us is experiencing a flood of information. It was estimated that there were about 25 billion pages on the Web as of October 2008 (see http://www.worldwidewebsize.com/), which makes it generally impossible for common users to locate their desired information by browsing the Web. As a consequence, efficient and effective information retrieval (IR) has become more important than ever, and search engines (or IR systems) have become an essential tool for many people.
Ranking is a central problem in IR. Many IR problems are by nature ranking problems, such as document retrieval, collaborative filtering [58], key term extraction [30], definition finding [130], important email routing [23], sentiment analysis [94], product rating [36], and anti Web spam [56]. In this tutorial, we will mainly take document retrieval as an example. Note that document retrieval is not a narrow task: web pages, emails, academic papers, books, and news stories are just a few of the many kinds of documents. There are also many different ranking scenarios of interest in document retrieval.
Scenario 1: Rank the documents purely according to their relevance with regard to the query.
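To make this scenario concrete, here is a minimal sketch in Python of the score-and-sort pattern that underlies document retrieval: every document receives a real-valued score for the query, and documents are presented in descending order of score. The toy term-overlap scorer is hypothetical and merely stands in for a real ranking model.

    from typing import Callable, List, Tuple

    def rank_documents(query: str,
                       documents: List[str],
                       score: Callable[[str, str], float]) -> List[Tuple[str, float]]:
        """Score each document against the query and sort by descending score.

        `score` is a stand-in for any ranking model, from a heuristic
        such as BM25 to a learned ranking function.
        """
        scored = [(doc, score(query, doc)) for doc in documents]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)

    def overlap_score(query: str, doc: str) -> float:
        """Toy relevance score: fraction of query terms appearing in the document."""
        q_terms, d_terms = set(query.lower().split()), set(doc.lower().split())
        return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0

    docs = ["learning to rank tutorial", "web spam detection", "rank aggregation methods"]
    print(rank_documents("learning to rank", docs, overlap_score))
    # [('learning to rank tutorial', 1.0), ('rank aggregation methods', 0.33...), ...]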
Scenario 2: Consider the relationships of similarity [118], website
structure [35], and diversity [139] between documents in the ranking
process. This is also referred to as relational ranking [102].
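One classic heuristic in this spirit (not part of this tutorial; named here only for illustration) is maximal marginal relevance (MMR), which greedily trades a document's relevance off against its similarity to documents ranked so far. Below is a minimal sketch with made-up relevance and similarity functions.

    from typing import Callable, List

    def mmr_rerank(docs: List[str],
                   relevance: Callable[[str], float],
                   similarity: Callable[[str, str], float],
                   lam: float = 0.5) -> List[str]:
        """Greedy MMR re-ranking: at each step, pick the document maximizing
        lam * relevance minus (1 - lam) * max similarity to already-selected docs."""
        selected: List[str] = []
        remaining = list(docs)
        while remaining:
            best = max(remaining, key=lambda d: lam * relevance(d)
                       - (1 - lam) * max((similarity(d, s) for s in selected), default=0.0))
            selected.append(best)
            remaining.remove(best)
        return selected

    # Toy data: d1 and d2 are both relevant but near-duplicates of each other.
    rel = {"d1": 0.9, "d2": 0.85, "d3": 0.3}.get
    sim = lambda a, b: 0.95 if {a, b} == {"d1", "d2"} else 0.1
    print(mmr_rerank(["d1", "d2", "d3"], rel, sim))  # ['d1', 'd3', 'd2']

Note how the redundant d2 is demoted below the less relevant but novel d3, which is exactly the diversity effect relational ranking aims to model in a principled, learned way.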
Scenario 3: Aggregate several candidate ranked lists to get a better ranked list. This scenario is also referred to as meta search [7]. The candidate ranked lists may come from different index servers or different vertical search engines, and the target ranked list is the final result presented to users.
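As one concrete illustration of this scenario, the sketch below implements the Borda count, a simple rank-aggregation heuristic (chosen here for illustration; it is not necessarily the method used by any particular meta-search engine). Each candidate list awards a document points according to its position, and the aggregated list orders documents by total points.

    from collections import defaultdict
    from typing import Dict, List

    def borda_aggregate(ranked_lists: List[List[str]]) -> List[str]:
        """Aggregate candidate ranked lists with the Borda count heuristic.

        A document at position i in a list of length n receives n - i points;
        documents absent from a list receive no points from it.
        """
        points: Dict[str, int] = defaultdict(int)
        for ranking in ranked_lists:
            n = len(ranking)
            for i, doc in enumerate(ranking):
                points[doc] += n - i
        return sorted(points, key=lambda doc: points[doc], reverse=True)

    lists = [["d1", "d2", "d3"], ["d2", "d1", "d4"], ["d1", "d4", "d2"]]
    print(borda_aggregate(lists))  # ['d1', 'd2', 'd4', 'd3']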
Scenario 4: Find whether, and to what degree, a property of a webpage influences the ranking result. This is referred to as “reverse engineering” in search engine optimization (SEO) (see http://www.search-marketing.info/newsletter/reverse-engineering.htm).
To tackle the problem of document retrieval, many heuristic ranking models have been proposed and used in the IR literature. Recently, given the amount of potential training data available, it has become possible to leverage machine learning (ML) technologies to build effective ranking models. Specifically, we call those methods that learn how to combine predefined features for ranking by means of discriminative learning “learning-to-rank” methods.
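To make this definition concrete, the following sketch (a minimal, hypothetical illustration, not any specific published algorithm) learns a linear combination of predefined features from preference pairs by gradient descent on a logistic loss over score differences, in the spirit of the pairwise approach reviewed later. The two toy features stand in for typical ranking features such as a BM25 score and a PageRank value.

    import math
    from typing import List, Tuple

    def score(w: List[float], x: List[float]) -> float:
        """Linear scoring function f(x) = w . x over predefined features."""
        return sum(wi * xi for wi, xi in zip(w, x))

    def train_pairwise(pairs: List[Tuple[List[float], List[float]]],
                       dim: int, lr: float = 0.1, epochs: int = 100) -> List[float]:
        """Learn w from preference pairs (x_pos should outrank x_neg) by
        gradient descent on log(1 + exp(-(f(x_pos) - f(x_neg))))."""
        w = [0.0] * dim
        for _ in range(epochs):
            for x_pos, x_neg in pairs:
                margin = score(w, x_pos) - score(w, x_neg)
                g = -1.0 / (1.0 + math.exp(margin))  # d(loss)/d(margin)
                for i in range(dim):
                    w[i] -= lr * g * (x_pos[i] - x_neg[i])
        return w

    # Toy features per query-document pair: [BM25-like, PageRank-like] (made up).
    pairs = [([0.9, 0.2], [0.3, 0.1]), ([0.7, 0.8], [0.4, 0.3])]
    w = train_pairwise(pairs, dim=2)
    print(score(w, [0.9, 0.2]) > score(w, [0.3, 0.1]))  # True: preferred doc scores higher

In practice one would use query-normalized features and richer models, but the pattern of combining predefined features through discriminative learning is the same.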
In recent years, learning to rank has become a very hot research
direction in IR, and a large number of learning-to-rank algorithms have
been proposed, such as [49] [73] [33] [90] [78] [34] [59] [114] [26] [9] [29]
[14] [122] [47] [62] [97] [16] [117] [136] [134] [13] [104] [99] [17] [129]. We
can foresee that learning to rank will have an even bigger impact on
IR in the future.
When a research area comes to this stage, the following questions naturally arise.
• In what respects are these learning-to-rank algorithms similar, and in which aspects do they differ? What are the strengths and weaknesses of each algorithm?
• Empirically, which of these many learning-to-rank algorithms perform the best? What kind of datasets can be used to make fair comparisons among different algorithms?