Using LDA to Improve Ranking Diversity in Biomedical Information Retrieval

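"This paper proposes an approach based on the topic generative model Latent Dirichlet Allocation (LDA) to improve ranking diversity in biomedical information retrieval. Unlike other approaches or models that work only at the word level, it holds that aspects should be identified through the topics of the retrieved documents. The LDA model is used to discover the topic distribution of the retrieved passages and the word distribution of each topic dimension, and the retrieval results are then re-ranked to present more diverse results."

In biomedical information retrieval, improving ranking diversity is crucial because it directly affects how efficiently and accurately researchers and physicians find relevant information. Traditional retrieval systems tend to focus too heavily on exact matching, so important information that offers a different perspective or interpretation may be overlooked. LDA is a probabilistic graphical model which assumes that each document is generated from a mixture of latent topics, and that each topic is in turn a probability distribution over words.

In this paper, the authors use the LDA model to mine the topic distribution of documents, which makes it possible to understand the latent topics behind the retrieval results rather than only the occurrence of individual words. In this way, a retrieval system can recognize the diversity among documents that contain the same keywords but represent different research directions or medical concepts. For example, a study on cancer may involve several topics such as "treatment", "prevention", and "genes", and together these topics give a more complete view.

To achieve ranking diversity, the authors first apply the LDA model to the documents in the retrieval results and determine the probability distribution over topics for each document. The original retrieval results are then re-ranked according to the topical diversity and relevance of the documents. This helps ensure that the results contain not only the most relevant documents but also documents that provide diverse information, giving the user a more complete understanding.

The paper may also discuss an experimental evaluation that compares traditional methods with the LDA-based diversity-promoting method to demonstrate its effectiveness. This may include standard information retrieval metrics such as average precision and NDCG (normalized Discounted Cumulative Gain), as well as user satisfaction surveys, to verify how the proposed LDA method improves ranking diversity while preserving retrieval effectiveness.

In summary, the paper proposes a novel strategy that uses the LDA model to enhance ranking diversity in biomedical information retrieval. It helps ensure that users are exposed to a broad and varied body of research, which matters for research efficiency and the quality of medical decisions.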

Promoting Ranking Diversity for Biomedical Information Retrieval based on LDA

Yan Chen∗†, Xiaoshi Yin∗†, Zhoujun Li∗†, Xiaohua Hu and Jimmy Xiangji Huang

∗ State Key Laboratory of Software Development Environment, Beihang University, Beijing 100191, China

† School of Computer Science and Engineering, Beihang University, Beijing, China

College of Information Science and Technology, Drexel University, Philadelphia, PA, USA

School of Information Technology, York University, Canada

chenyan@cse.buaa.edu.cn, xiaoshiyin@cse.buaa.edu.cn, lizj@buaa.edu.cn, xiaohua.hu@ischool.drexel.edu, jhuang@yorku.ca

Abstract—In this paper, we propose an approach based on a topic generative model called Latent Dirichlet Allocation (LDA) to promote ranking diversity for biomedical information retrieval. Unlike other approaches or models, which consider aspects at the word level, our approach assumes that aspects should be identified by the topics of retrieved documents. We use the LDA model to discover the topic distribution of retrieved passages and the word distribution of each topic dimension, and then re-rank retrieval results according to the topic distribution similarity between passages within an N-size sliding window. Experiments on the TREC 2007 Genomics collection and two distinctive IR baseline runs demonstrate the effectiveness of our method in promoting ranking diversity for biomedical information retrieval. Evaluation results show that our approach can achieve an 8% improvement over the highest Aspect MAP reported in the TREC 2007 Genomics track.

Index Terms—ranking diversity, biomedical IR, LDA

I. INTRODUCTION

In biomedical information retrieval, the volume of data is immense and the number of genomics and biomedical publications is growing rapidly. This wealth of information has led to increasing interest in, and need for, information retrieval techniques for accessing the scientific literature in genomics and related biomedical disciplines. In many cases, the information desired for a question (query) asked by biologists is a list of entities of a certain type covering different aspects related to the question [1], such as cells, genes, diseases, proteins, mutations, etc. Hence, it is important for a biomedical IR system to provide relevant and diverse answers that fulfill biologists' information needs. In recent years, "aspect retrieval" was proposed in the TREC Genomics tracks. The aspects of a retrieved passage could be a list of named entities or MeSH terms [2], representing answers that cover different portions of a full answer to the query. Aspect Mean Average Precision (MAP) [2] was defined in the Genomics tracks. Its purpose is to study how a biomedical retrieval system can help a user gather information about different aspects of a query. A biomedical retrieval system should return relevant information at the passage level. Relevant passages that do not contribute any new aspects are not used to accumulate Aspect MAP. Therefore, Aspect MAP measures both the relevance and the diversity of an IR ranked list.
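
To make the accumulation rule concrete, the following is a simplified sketch of an aspect-level average precision for a single query. It is not the official TREC Genomics evaluation script, which works on passage spans and gold-standard annotations. The function name, the set-based input format, and the `total_aspects` parameter are assumptions made for illustration.

```python
def aspect_average_precision(ranked, total_aspects):
    """Simplified aspect-level average precision for one query.

    ranked: passages in rank order, each given as the set of aspect
            labels it covers; an empty set marks a non-relevant passage.
    total_aspects: number of distinct aspects known for the query.
    """
    seen = set()          # aspects already covered higher in the list
    kept = 0              # passages that remain in the ranking so far
    relevant = 0          # kept passages that contributed a new aspect
    precision_sum = 0.0
    for aspects in ranked:
        if not aspects:
            kept += 1     # non-relevant passage stays and lowers precision
            continue
        new = aspects - seen
        if not new:
            continue      # relevant but redundant: not used to accumulate
        kept += 1
        relevant += 1
        # one precision value is credited for each newly covered aspect
        precision_sum += (relevant / kept) * len(new)
        seen |= new
    return precision_sum / total_aspects if total_aspects else 0.0


# A ranking that reaches the second aspect early scores higher.
diverse = [{"a"}, {"b"}, {"a"}, set()]
redundant = [{"a"}, {"a"}, set(), {"b"}]
print(aspect_average_precision(diverse, 2))
print(aspect_average_precision(redundant, 2))
```

Averaging this value over all queries would give a MAP-style figure. Under this simplification a redundant relevant passage is skipped rather than penalized, so the score rewards rankings that reach new aspects early.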

Several studies in recent years have focused on promoting ranking diversity. Perhaps the most representative method is maximum marginal relevance (MMR) [3], along with mixture models [4], subtopic diversity [5], and others. The basic idea of these three methods is to penalize redundancy by lowering an item's rank if it is similar to the items already ranked. However, these methods often treat relevance ranking and diversity ranking separately, and sometimes rely on heuristic procedures. Kaptein et al. [6] employed a top-down sliding window to diversify the ranked list of retrieved documents. Recent studies on Genomics aspect retrieval were conducted by Huang et al. [7] and Yin et al. [8]. A side effect of these three re-ranking strategies is that they favor long documents, as long documents tend to contain more distinct terms. Zhu et al. [9] proposed a clustering-based ranking algorithm called GRASSHOPPER to promote ranking diversity in the biomedical retrieval domain. Unfortunately, this re-ranking method was reported to reduce system performance and to decrease the Aspect MAP of the original results for genomics aspect retrieval [10].
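
The redundancy penalty shared by these methods can be illustrated with a minimal greedy MMR sketch. It follows the usual formulation, in which each step picks the item maximizing lam * relevance minus (1 - lam) * (maximum similarity to the items already ranked). The relevance scores, the similarity matrix, and the value of `lam` below are made-up inputs for illustration and do not come from the paper.

```python
def mmr_rerank(relevance, similarity, lam=0.7, k=None):
    """Greedy maximum marginal relevance re-ranking.

    relevance:  relevance[i] is the query relevance score of item i.
    similarity: similarity[i][j] is the similarity between items i and j.
    lam:        trade-off weight; 1.0 ranks by relevance alone.
    k:          number of items to select (defaults to all of them).
    Returns the selected item indices in their new rank order.
    """
    n = len(relevance)
    k = n if k is None else min(k, n)
    selected = []
    candidates = list(range(n))
    while candidates and len(selected) < k:

        def score(i):
            # redundancy is the similarity to the closest item already ranked
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return lam * relevance[i] - (1.0 - lam) * redundancy

        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


relevance = [0.9, 0.85, 0.5]
similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.2],
    [0.1, 0.2, 1.0],
]
# Item 1 is nearly a duplicate of item 0, so it drops below item 2.
print(mmr_rerank(relevance, similarity, lam=0.5))
```

With `lam=1.0` the original relevance order is returned unchanged, which makes the effect of the redundancy term easy to compare.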

However, previous work considers the aspects of the user query and the retrieved documents mainly at the word level. In other words, one word or several co-occurring words are used to identify a specific aspect. This assumption can cause two problems. First, a specific word can express more than one latent topic depending on its context in a passage, so it does not reliably identify a single aspect. Second, the words in a passage are treated as independent of each other, although potential relationships between words might exist. Therefore, identifying aspects at the word level is insufficient.

In this paper, we aim to address both of these problems. We propose an approach that employs Latent Dirichlet Allocation (LDA) [11], a topic generative model, to promote diversity in the ranked list for biomedical information retrieval. Experiments conducted on the TREC 2007 Genomics track collection and two very different IR baseline runs demonstrate the effectiveness of our approach. The evaluation results show that our approach can achieve an 8% improvement over the highest Aspect MAP reported in the TREC 2007 Genomics track.
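
The abstract describes re-ranking by topic distribution similarity between passages within an N-size sliding window, but this excerpt does not give the exact procedure. The sketch below is therefore only one plausible reading, not the authors' algorithm. It assumes the per-passage topic distributions have already been inferred by some LDA implementation. The use of cosine similarity, the greedy selection rule, and the names `cosine` and `rerank_sliding_window` are all assumptions.

```python
import math


def cosine(p, q):
    """Cosine similarity between two topic distributions."""
    dot = sum(a * b for a, b in zip(p, q))
    norm = math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q))
    return dot / norm if norm else 0.0


def rerank_sliding_window(topic_dists, window=3):
    """Diversify a ranked list using a sliding window of candidates.

    topic_dists: per-passage topic distributions in the original rank order.
    window:      number of top remaining passages considered at each step.
    Returns the new order as indices into the original list.
    """
    remaining = list(range(len(topic_dists)))
    order = []
    while remaining:
        candidates = remaining[:window]
        if not order:
            best = candidates[0]  # the top-ranked passage keeps its place
        else:
            # choose the candidate least similar to everything already
            # ranked; ties are resolved in favor of the original order
            best = min(
                candidates,
                key=lambda i: max(
                    cosine(topic_dists[i], topic_dists[j]) for j in order
                ),
            )
        order.append(best)
        remaining.remove(best)
    return order


# Hypothetical topic distributions over three topics for four passages.
dists = [
    [0.90, 0.10, 0.00],
    [0.88, 0.12, 0.00],  # almost the same topic mix as the first passage
    [0.10, 0.90, 0.00],
    [0.00, 0.10, 0.90],
]
print(rerank_sliding_window(dists, window=3))
```

Because only the top `window` remaining passages compete for each position, a passage can move up by at most `window - 1` places per step. This keeps the new ranking close to the relevance order of the baseline run, and `window=1` leaves the order unchanged.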

The rest of this paper is organized as follows. In Section 2,

2011 IEEE International Conference on Bioinformatics and Biomedicine

DOI 10.1109/BIBM.2011.28
