Using Wikipedia Topics to Improve the Diversity and Relevance of Search Suggestions


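This paper examines how topics from Wikipedia can be used to diversify query suggestions. The goal of query suggestion has become to recommend queries that are both relevant and diverse, so that users can explore a broader range of information. Most existing work diversifies suggestions through query log analysis, but this is not always feasible for structured data, because not all query logs are available. The paper therefore studies how to use the rich entity and concept information in Wikipedia to generate diverse query suggestions. Wikipedia, a widely used online encyclopedia, offers broad coverage and can serve as an important source of diverse topics.

The authors first extract from Wikipedia all topics related to the query, and then map each query term to these topics. This mapping is challenging because it must take into account both the textual information in Wikipedia and the semantic associations in the structured data in order to map terms accurately. To achieve this, the paper proposes an approach that combines text mining and semantic analysis, using contextual information from Wikipedia entries and from the query to determine how terms relate to topics. This step is meant to ensure that each suggestion draws on a broader body of knowledge rather than on historical query behavior alone, and so offers a richer choice of queries.

Finally, the paper presents an efficient algorithm for quickly generating diverse query suggestions on large datasets. The authors evaluate the approach extensively on a real dataset, and the results show that it improves both the diversity and the relevance of query suggestions. The study offers a new perspective for query suggestion systems, namely using an open resource such as Wikipedia to enrich suggestions, which is a valuable complement for platforms that cannot obtain complete query histories. Keywords: query suggestion, diversity, Wikipedia topics, semantic mapping.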

Diversifying Query Suggestions by using Topics from Wikipedia

Hao Hu, Mingxi Zhang, Zhenying He, Peng Wang, Wei Wang

School of Computer Science, Fudan University

Shanghai, China

{huhao, 10110240025, zhenying, pengwang5, weiwang1}@fudan.edu.cn

Abstract: Diversifying query suggestions has emerged recently as a way to recommend queries that are both relevant and diverse. Most existing works diversify suggestions by query log analysis; however, for structured data, not all query logs are available. To this end, this paper studies the problem of suggesting diverse query terms by using topics from Wikipedia. Wikipedia is a successful online encyclopedia and has high coverage of entities and concepts. We first obtain all relevant topics from Wikipedia, and then map each term to these topics. As the mapping is a nontrivial task, we leverage information from both Wikipedia and structured data to semantically map each term to topics. Finally, we propose a fast algorithm to efficiently generate the suggestions. Extensive evaluations are conducted on a real dataset, and our approach yields promising results.

Keywords: query suggestion diversification, Wikipedia, topics

I. INTRODUCTION

Query suggestion [1–7] is popular for web search engines. It refers to the process of suggesting related queries. While many works focus on improving the relevance of related queries, most of them neglect the issue of providing diverse queries. In fact, the suggested queries should be both similar to the original query and semantically different from each other so that they can cover broader latent topics. To handle informational queries, in this paper we study how to better diversify suggested queries on structured data.

For structured data, current techniques suggest the most related query terms as suggestions. The related query terms are defined by different similarity measures, e.g., finding frequent co-occurring terms [8], contextual random walk [6], TF-IDF scoring [9], and NetClus [10]. However, suggesting only relevant queries may cause dissatisfaction. Consider a simple example on DBLP, where the system suggests the most relevant query terms. The relevant query terms are obtained by NetClus [10].

Example 1. Assume a user wants to find papers about “data mining”; the top-5 suggested query terms are listed in Table I. We assign each term a latent topic according to its semantics (e.g., the term “mine” is assigned to the topic data mining, and “frequent” is assigned to association rule mining because “frequent pattern/itemset” is a popular concept in association rule mining). These related terms are clearly concentrated on two latent topics (2 terms from data mining and 3 from association rule mining), hence they do not deliver diverse information. A better suggestion list should contain terms from broader subtopics of “data mining”. Some related subtopics are listed in Table II. As we can see, users would be better served if we suggested “classification”, “association”, and “tree”, as these terms come from different popular subtopics.

TABLE I
TOP-5 RELATED TERMS FOR “DATA MINING” AND LATENT TOPIC

Related Term    Latent Topic
mine            data mining
data            data mining
association     association rule mining
frequent        association rule mining
rule            association rule mining

TABLE II
PART OF RELATED TOPICS FOR “DATA MINING”

association rule mining
classification
decision tree
text mining
concept mining
...

This work was supported in part by National Science Foundation of China grants 61170007, 60673133 and 61033010.

http://dblp.uni-trier.de/
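The concentration described in Example 1 can be checked in a few lines. The following is a minimal sketch, not the algorithm proposed in this paper: it counts how many of the Table I terms fall under each latent topic, and then applies a simple greedy rule (keep the first term of each topic not yet covered) as a stand-in for diversification. The function names, the greedy rule, and the ranking positions of “classification” and “tree” are assumptions made for illustration; only the term-to-topic pairs come from Table I and the example text.

```python
from collections import Counter

# Table I: top-5 related terms for "data mining" with their latent topics.
RELEVANT = [
    ("mine", "data mining"),
    ("data", "data mining"),
    ("association", "association rule mining"),
    ("frequent", "association rule mining"),
    ("rule", "association rule mining"),
]

# Assumption: two further candidates named in Example 1, appended after the
# Table I terms. Their relevance rank is hypothetical.
CANDIDATES = RELEVANT + [
    ("classification", "classification"),
    ("tree", "decision tree"),
]


def topic_coverage(suggestions):
    """Count how many suggested terms fall under each latent topic."""
    return Counter(topic for _, topic in suggestions)


def diversify(ranked, k):
    """Greedy stand-in: walk the ranked list and keep the first term
    of every topic that has not been covered yet, up to k terms."""
    seen_topics = set()
    picked = []
    for term, topic in ranked:
        if len(picked) >= k:
            break
        if topic not in seen_topics:
            seen_topics.add(topic)
            picked.append(term)
    return picked


if __name__ == "__main__":
    print(topic_coverage(RELEVANT))
    print(diversify(CANDIDATES, 3))
```

Under these assumptions, the coverage count reproduces the 2-versus-3 split noted in the example, and the greedy pass returns one term per topic instead of five terms from two topics.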

The major step in diversifying query suggestions by topic is to map query terms to topics, on the premise that we can obtain these topics. However, obtaining topics from structured data raises several difficulties. First, not all topics can be extracted from the schema. For example, in DBLP, each paper is recorded with its conference and authors. Users can easily obtain papers in the topic WI 2012 (i.e., all papers published in WI 2012), whereas it is hard to obtain all papers in the topic association rule mining. Second, most structured data contains little topic information. Though topics can be extracted by machine learning methods such as LDA [11], it would be impractical to let users set parameters (e.g., the number of topics) at suggestion time. Third, query logs are sometimes not available, hence topic extraction from query logs [12, 13] may not be applicable.

On the other hand, Wikipedia is a successful online encyclopedia consisting of numerous entities and concepts. It allows users to learn about entities, facts and concepts through category information and the links between them. The prior knowledge in Wikipedia has high quality and coverage. Therefore, it would be beneficial to perform the suggestion task by identifying the semantic relationship between terms and topics from Wikipedia.

In fact, the topics in Table II come from category information in Wikipedia. These related topics cover almost every subarea of “data mining”. With the prior knowledge from Wikipedia, we can map query terms to latent topics and then diversify them.
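One possible way to collect such category information is the public MediaWiki query API. The sketch below is an assumption about tooling, since the text here does not say how the category data was obtained. It builds a `list=categorymembers` request URL for the English Wikipedia endpoint and extracts topic names from a response of the usual shape. The helper names are hypothetical, no network call is made, and the sample payload is hand-written for illustration rather than real API output.

```python
from urllib.parse import urlencode

# Assumption: the English Wikipedia endpoint of the MediaWiki API.
API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def category_members_url(category, limit=50):
    """Build a request URL listing pages and subcategories of a category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "subcat|page",
        "cmlimit": limit,
        "format": "json",
    }
    return f"{API_ENDPOINT}?{urlencode(params)}"


def topics_from_response(payload):
    """Extract lower-cased topic names from a categorymembers response."""
    members = payload.get("query", {}).get("categorymembers", [])
    topics = []
    for member in members:
        title = member["title"]
        # Subcategory titles carry a "Category:" prefix; drop it.
        if title.startswith("Category:"):
            title = title[len("Category:"):]
        topics.append(title.lower())
    return topics


if __name__ == "__main__":
    print(category_members_url("Data mining"))
    # Hand-written sample payload (illustrative, not fetched from the API).
    sample = {
        "query": {
            "categorymembers": [
                {"ns": 0, "title": "Text mining"},
                {"ns": 14, "title": "Category:Classification algorithms"},
            ]
        }
    }
    print(topics_from_response(sample))
```

Fetching the URL and following subcategories recursively is left out on purpose. How deep to descend and how to filter unrelated pages are design choices that this section does not specify.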

http://www.wikipedia.org/

2013 IEEE/WIC/ACM International Conferences on Web Intelligence (WI) and Intelligent Agent Technology (IAT)

DOI 10.1109/WI-IAT.2013.21
