PROCEEDINGS Open Access
A LDA-based approach to promoting ranking
diversity for genomics information retrieval
Yan Chen1,2, Xiaoshi Yin1,2, Zhoujun Li1,2*, Xiaohua Hu3, Jimmy Xiangji Huang4
From IEEE International Conference on Bioinformatics and Biomedicine 2011
Atlanta, GA, USA. 12-15 November 2011
Abstract
Background: In the biomedical domain, the volume of data is immense and the number of genomics and
biomedical publications is growing rapidly. This wealth of information has led to increasing interest in, and
need for, applying information retrieval techniques to access the scientific literature in genomics and related
biomedical disciplines. In many cases, the information a biologist's query seeks is a list of entities of a certain
type, such as cells, genes, diseases, proteins, or mutations, covering different aspects related to the question.
Hence, it is important for a biomedical IR system to provide relevant and diverse answers
that fulfill biologists’ information needs. However, traditional IR models consider only the relevance between
retrieved documents and the user query, and do not take redundancy between retrieved documents into account.
This leads to high redundancy and low diversity in the ranked retrieval lists.
Results: In this paper, we propose an approach that employs a generative topic model, Latent Dirichlet
Allocation (LDA), to promote ranking diversity for biomedical information retrieval. Unlike other
approaches or models that consider aspects at the word level, our approach assumes that aspects should be
identified by the topics of the retrieved documents. We use the LDA model to discover the topic distribution of
retrieved passages and the word distribution of each topic dimension, and then re-rank the retrieval results using
the topic distribution similarity between passages, based on a sliding window of size N. We evaluate our approach
on the TREC 2007 Genomics collection with two distinct IR baseline runs, and it achieves an 8% improvement over
the highest Aspect MAP reported in the TREC 2007 Genomics track.
Conclusions: The proposed method is the first study to apply a topic model to genomics information retrieval,
and it demonstrates effectiveness in promoting ranking diversity as well as in improving the relevance of ranked
lists in genomics search. Moreover, we propose a distance measure, a modified Euclidean distance, that quantifies
how much a passage can increase topical diversity by considering both the topical importance and the topical
coefficient given by LDA.
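The re-ranking idea summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `weighted_distance` and `rerank`, the `trade_off` parameter, the use of mean topic mass as a stand-in for topical importance, and the linear combination of relevance and novelty are all assumptions made for illustration. The abstract states only that a modified Euclidean distance over LDA topic distributions is applied within a sliding window of size N. Topic distributions are taken as given (for example, from any LDA implementation).

```python
import numpy as np


def weighted_distance(p, q, weights):
    """Weighted Euclidean distance between two topic distributions.

    The per-topic weights are an assumed stand-in for topical importance.
    """
    diff = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum(weights * diff ** 2)))


def rerank(relevance, topics, window=3, trade_off=0.5):
    """Greedily re-rank passages for relevance and topical novelty.

    relevance: baseline relevance score per passage.
    topics:    one topic distribution per passage (rows sum to 1).
    window:    number of most recently selected passages compared against
               (the sliding window of size N).
    trade_off: hypothetical weight on novelty versus relevance.
    """
    relevance = np.asarray(relevance, dtype=float)
    topics = np.asarray(topics, dtype=float)
    # Assumption: a topic's importance is its mean mass over all passages.
    weights = topics.mean(axis=0)

    remaining = list(np.argsort(-relevance))  # baseline ranking order
    ranked = [remaining.pop(0)]               # keep the top passage first
    while remaining:
        recent = ranked[-window:]

        def score(i):
            # Novelty: distance to the closest passage in the window.
            novelty = min(
                weighted_distance(topics[i], topics[j], weights)
                for j in recent
            )
            return (1 - trade_off) * relevance[i] + trade_off * novelty

        best = max(remaining, key=score)
        remaining.remove(best)
        ranked.append(best)
    return [int(i) for i in ranked]


if __name__ == "__main__":
    scores = [0.90, 0.85, 0.80]
    dists = [[0.90, 0.10], [0.88, 0.12], [0.10, 0.90]]
    # Passage 1 nearly duplicates passage 0, so passage 2 is promoted.
    print(rerank(scores, dists, window=2))
```

With `trade_off=0` the sketch reduces to the baseline relevance ordering. Larger values promote passages whose topic distributions differ from those already placed within the window.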
Background
A traditional information retrieval (IR) system responds
with a ranked list of retrieved documents or
passages, ordered by their probability of
relevance to the query. Such a model considers only
the relevance between the retrieved documents and the user
query, and does not take redundancy between retrieved
documents into account. Retrieved documents with
similar content thus tend to appear over and over
again. Ideally, in order to provide a comprehensive picture
of all interpretations of the query, an information
retrieval system should return a
ranked list of retrieved documents or passages that takes
both relevance and diversity into account.
For genomics information retrieval, the problem is particularly
prominent, on account of the immense volume of data and the
rapid growth of genomics and biomedical
publications. The wealth of information has led to an
* Correspondence: lizj@buaa.edu.cn
1 State Key Laboratory of Software Development Environment, Beihang
University, Beijing 100191, China
Full list of author information is available at the end of the article
Chen et al. BMC Genomics 2012, 13(Suppl 3):S2
http://www.biomedcentral.com/1471-2164/13/S3/S2
© 2012 Chen et al. licensee BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons
Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in
any medium, provided the original work is properly cited.