Promoting Ranking Diversity for Biomedical
Information Retrieval based on LDA
Yan Chen∗†, Xiaoshi Yin∗†, Zhoujun Li∗†, Xiaohua Hu§ and Jimmy Xiangji Huang¶
∗State Key Laboratory of Software Development Environment, Beihang University, Beijing 100191, China
†School of Computer Science and Engineering, Beihang University, Beijing, China
§College of Information Science and Technology, Drexel University, Philadelphia, PA, USA
¶School of Information Technology, York University, Canada
chenyan@cse.buaa.edu.cn, xiaoshiyin@cse.buaa.edu.cn, lizj@buaa.edu.cn, xiaohua.hu@ischool.drexel.edu, jhuang@yorku.ca
Abstract—In this paper, we propose an approach based on
a topic generative model called Latent Dirichlet Allocation
(LDA) to promote ranking diversity for biomedical information
retrieval. Unlike other approaches or models, which consider
aspects at the word level, our approach assumes that aspects
should be identified by the topics of retrieved documents. We
use an LDA model to discover the topic distribution of retrieved
passages and the word distribution of each topic, and then
re-rank the retrieval results according to the topic distribution
similarity between passages within a sliding window of size 𝑁.
Experiments on the TREC 2007 Genomics collection and two
distinct IR baseline runs demonstrate the effectiveness of our
method in promoting ranking diversity for biomedical information
retrieval. Evaluation results show that our approach achieves an
8% improvement over the highest Aspect MAP reported in the
TREC 2007 Genomics track.
Index Terms—ranking diversity, biomedical IR, LDA
I. INTRODUCTION
Biomedical information retrieval faces an immense volume of
data and a rapid increase in publications related to genomics
and biomedicine. This wealth of information has led to
growing interest in, and need for, applying information
retrieval techniques to access the scientific literature
in genomics and related biomedical disciplines. In many
cases, the desired information of a question (query) asked by
biologists is a list of a certain type of entities covering different
aspects that are related to the question [1], such as cells, genes,
diseases, proteins, mutations, etc. Hence, it is important for a
biomedical IR system to be able to provide relevant and diverse
answers to fulfill biologists’ information needs. In recent years,
“aspect retrieval” was introduced in the TREC Genomics tracks.
The aspects of a retrieved passage could be a list of named
entities or MeSH terms [2], representing answers that cover
different portions of a full answer to the query. Aspect Mean
Average Precision (MAP) [2] was defined in the Genomics
tracks. Its purpose is to study how a biomedical retrieval
system can support a user to gather information about different
aspects of a query. A biomedical retrieval system should return
relevant information at the passage level. Relevant passages
that do not contribute any new aspects are not used
to accumulate Aspect MAP. Therefore, Aspect MAP
measures both the relevance and the diversity of an IR ranked
list.
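To make the behaviour of the measure concrete, the following is a minimal sketch of an aspect-level average precision for a single topic. It is a simplified reading of the description above, not the official TREC Genomics scoring procedure: the function name `aspect_average_precision`, the input representation (one set of aspect labels per ranked passage, with an empty set for a non-relevant passage), and the choice to credit each newly found aspect with the precision at that rank are all assumptions made for illustration. Aspect MAP would then be the mean of this value over all topics.

```python
from typing import List, Set


def aspect_average_precision(ranked_aspects: List[Set[str]],
                             total_aspects: int) -> float:
    """Simplified aspect-level average precision for one topic (assumed form).

    ranked_aspects[i] holds the gold-standard aspects covered by the passage
    at rank i; an empty set marks a non-relevant passage.
    """
    if total_aspects <= 0:
        return 0.0
    seen: Set[str] = set()   # aspects already found higher in the ranking
    kept = 0                 # passages that remain in the ranking
    relevant_kept = 0        # kept passages that added at least one new aspect
    precision_sum = 0.0
    for aspects in ranked_aspects:
        new_aspects = aspects - seen
        if aspects and not new_aspects:
            # Relevant but redundant: dropped, neither rewarded nor penalized.
            continue
        kept += 1
        if new_aspects:
            relevant_kept += 1
            # Credit every newly found aspect with the precision at this rank.
            precision_sum += len(new_aspects) * relevant_kept / kept
            seen |= new_aspects
    return precision_sum / total_aspects
```

Under these assumptions, a ranked list that repeats an already covered aspect gains nothing from the repeat, while a non-relevant passage lowers the precision credited to every aspect found after it.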
Several studies have focused on promoting ranking
diversity in recent years. Perhaps the most representative
method is maximum marginal relevance (MMR) [3]; others
include mixture models [4] and subtopic diversity [5]. The
basic idea of these three methods is to penalize redundancy
by lowering an item’s rank if it is similar to the items already
ranked. However, these methods often treat relevance ranking
and diversity ranking separately, and sometimes with heuristic
procedures. Kaptein et al. [6] employed a top-down
sliding window to diversify the ranked list of retrieved documents.
Recent studies on Genomics aspect retrieval
were conducted by Huang et al. [7] and Yin et al. [8]. A
side effect of these three re-ranking strategies is that they
favor long documents, as the long documents tend to contain
more distinct terms. Zhu et al. [9] proposed a clustering-based
ranking algorithm called GRASSHOPPER to promote ranking
diversity in the biomedical retrieval domain. Unfortunately, this
re-ranking method reduced their system’s performance
and decreased the Aspect MAP of the original results for
genomics aspect retrieval [10].
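The redundancy-penalizing idea behind MMR, mentioned earlier in this paragraph, can be illustrated with a short greedy re-ranking sketch. This is a generic illustration rather than the exact formulation of [3] or of any system cited here: the function names `cosine` and `mmr_rerank`, the use of cosine similarity over dense vectors, and the trade-off parameter `lam` are assumptions chosen for the example.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def mmr_rerank(relevance: Sequence[float],
               vectors: Sequence[Sequence[float]],
               lam: float = 0.5) -> List[int]:
    """Greedy MMR-style re-ranking (illustrative sketch).

    At each step, pick the item maximizing
        lam * relevance - (1 - lam) * (max similarity to items already picked).
    Returns item indices in their new order.
    """
    remaining = list(range(len(relevance)))
    selected: List[int] = []
    while remaining:
        def score(i: int) -> float:
            redundancy = max(
                (cosine(vectors[i], vectors[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance[i] - (1.0 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

For example, with relevance scores `[0.9, 0.85, 0.5]` and vectors `[[1, 0], [1, 0], [0, 1]]`, calling `mmr_rerank` with `lam=0.5` returns `[0, 2, 1]`: the second item is a near-duplicate of the first and is demoted below the less relevant but novel third item. Setting `lam=1.0` disables the penalty and recovers the pure relevance order.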
However, previous work considers the aspects of the user
query and the retrieved documents mainly at the word level. In
other words, one word, or several co-occurring words, are used
to identify a specific aspect. This assumption can cause
two problems. First, a specific word can express more than one
latent topic, depending on its context in a passage, so the
words used to identify an aspect may be ambiguous.
Second, the words in a passage are treated as independent
of each other, although relationships between
words might exist. Therefore, identifying aspects
at the word level is insufficient.
In this paper, we aim to address both of these problems.
We propose an approach that employs Latent Dirichlet
Allocation (LDA) [11], a topic generative model, to promote
diversity in the ranked list for biomedical information retrieval.
Experiments conducted on the TREC 2007 Genomics track
collection and two very different IR baseline runs demonstrate the
effectiveness of our approach. The evaluation results show that
our approach achieves an 8% improvement over the highest
Aspect MAP reported in the TREC 2007 Genomics track.
The rest of this paper is organized as follows. In Section 2,
2011 IEEE International Conference on Bioinformatics and Biomedicine
978-0-7695-4574-5/11 $26.00 © 2011 IEEE
DOI 10.1109/BIBM.2011.28