Diversifying Query Suggestions by using Topics from Wikipedia
Hao Hu, Mingxi Zhang, Zhenying He, Peng Wang, Wei Wang
School of Computer Science, Fudan University
Shanghai, China
{huhao, 10110240025, zhenying, pengwang5, weiwang1}@fudan.edu.cn
Abstract—Diversifying query suggestions has emerged recently,
by which the recommended queries can be both relevant and
diverse. Most existing works diversify suggestions by query log
analysis; however, for structured data, query logs are not always
available. To this end, this paper studies the problem of suggesting
diverse query terms by using topics from Wikipedia. Wikipedia is
a successful online encyclopedia, and has high coverage of entities
and concepts. We first obtain all relevant topics from Wikipedia,
and then map each term to these topics. As the mapping is a
nontrivial task, we leverage information from both Wikipedia and
structured data to semantically map each term to topics. Finally,
we propose a fast algorithm to efficiently generate the suggestions.
Extensive evaluations are conducted on a real dataset, and our
approach yields promising results.
Keywords-query suggestion diversification, Wikipedia, topics
I. INTRODUCTION
Query suggestion [1–7] is popular for web search engines.
It refers to the process of suggesting related queries. While
there are many works focusing on improving the relevance
of related queries, most of them neglect to address the issue
of providing diverse queries. Actually, the suggested queries
should be both similar to the original query and semantically
different from each other so that they can cover broader latent
topics. To handle informational queries, in this paper, we study
how to better diversify suggested queries on structured data.
For structured data, current techniques suggest the most
related query terms. Relatedness is
defined by different similarity measures, e.g., frequently
co-occurring terms [8], contextual random walk [6], TF-IDF
scoring [9], and NetClus [10]. However, suggesting only relevant
queries may cause dissatisfaction. Consider a simple example
on DBLP¹, where the system suggests the most relevant query
terms. The relevant query terms are obtained by NetClus [10].
Example 1. Assume a user wants to find papers about
“data mining”. The top-5 suggested query terms are listed in
Table I. We assign each term a latent topic according to its
semantics (e.g., the term “mine” is assigned to the topic data
mining, and “frequent” is assigned to association rule mining
because “frequent pattern/itemset” is a popular concept in
association rule mining). These related
terms are concentrated on only two latent topics (2 terms from data
mining and 3 from association rule mining), hence they do
not deliver diverse information. A better suggestion list should
contain terms from broader subtopics of “data mining”. Some
related subtopics are listed in Table II. As we can see, it
would leave users with a better impression if we could suggest
“classification”, “association”, and “tree”, as these terms are
from different popular subtopics.

TABLE I
TOP-5 RELATED TERMS FOR “DATA MINING” AND LATENT TOPIC

      Related Term    Latent Topic
t1    mine            data mining
t2    data            data mining
t3    association     association rule mining
t4    frequent        association rule mining
t5    rule            association rule mining

TABLE II
PART OF RELATED TOPICS FOR “DATA MINING”

association rule mining
classification
decision tree
text mining
concept mining
...

This work was supported in part by National Science Foundation of China
grants 61170007, 60673133 and 61033010.
1 http://dblp.uni-trier.de/
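The concentration described in Example 1 can be checked with a short sketch. It is only an illustration under stated assumptions, not the algorithm proposed in this paper: the term-topic pairs are copied from Table I, whereas the two extra candidates, their rank order, and the helper names topic_counts and pick_diverse are hypothetical.

```python
from collections import Counter

# Table I: top-5 related terms for "data mining" with their latent topics.
TABLE_I = [
    ("mine", "data mining"),
    ("data", "data mining"),
    ("association", "association rule mining"),
    ("frequent", "association rule mining"),
    ("rule", "association rule mining"),
]

# ASSUMPTION: hypothetical lower-ranked candidates; their ranks are invented
# for illustration and do not come from the paper.
EXTRA_CANDIDATES = [
    ("classification", "classification"),
    ("tree", "decision tree"),
]


def topic_counts(ranked):
    """Count how many suggested terms fall under each latent topic."""
    return Counter(topic for _, topic in ranked)


def pick_diverse(ranked, k):
    """Greedily keep the best-ranked term of each topic not yet covered."""
    chosen, covered = [], set()
    for term, topic in ranked:
        if topic not in covered:
            chosen.append(term)
            covered.add(topic)
        if len(chosen) == k:
            break
    return chosen


counts = topic_counts(TABLE_I)
diverse = pick_diverse(TABLE_I + EXTRA_CANDIDATES, k=4)
print(dict(counts))
print(diverse)
```

The relevance-only list covers two topics with five terms, while the greedy pass covers four topics with four terms. A simple one-term-per-topic rule like this ignores how relevant each topic is to the query, so it should be read as a baseline for intuition only.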
The major step in diversifying query suggestions by
topics is to map query terms to topics, on the premise that
these topics can be obtained. However, obtaining topics from
structured data raises several difficulties. First,
not all topics can be extracted from the schema. For example, in
DBLP, each paper is recorded with its conference and authors.
Users can easily obtain papers in the topic WI 2012 (i.e.,
all papers in WI 2012), whereas it is hard to obtain all
papers in the topic association rule mining. Second, most
structured data contains little topic information. Though topics can
be extracted by machine learning methods such as LDA [11],
it would be impractical to let users set parameters (e.g., the
number of topics) at suggestion time. Third, query logs are
sometimes not available, hence topic extraction from query
logs [12, 13] may not be applicable.
On the other hand, Wikipedia² is a successful online
encyclopedia consisting of numerous entities and concepts. It
allows users to learn about entities, facts and concepts through
category information and links between them. The prior
knowledge in Wikipedia has high quality and coverage. Therefore,
it would be beneficial to perform the suggestion task by
identifying the semantic relationship between terms and topics
from Wikipedia.
Actually, the topics in Table II are from category information
in Wikipedia. These related topics cover almost every
subarea of “data mining”. With the prior knowledge from
Wikipedia, we can map query terms to latent topics, and then
diversify them.
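To make the term-to-topic mapping step concrete, the following minimal sketch groups candidate terms under the Table II topics using plain word matching. This is a deliberately naive stand-in and an assumption of this example, not the semantic mapping developed in this paper; the function names and the candidate list are hypothetical.

```python
# Topics from Table II (Wikipedia category information for "data mining").
TOPICS = [
    "association rule mining",
    "classification",
    "decision tree",
    "text mining",
    "concept mining",
]


def map_term_to_topics(term, topics):
    """Return the topics whose name contains `term` as a whole word.

    ASSUMPTION: naive lexical matching, used only for illustration.
    """
    term = term.lower()
    return [t for t in topics if term in t.lower().split()]


def group_by_topic(terms, topics):
    """Group terms under every topic they lexically match."""
    mapping = {}
    for term in terms:
        for topic in map_term_to_topics(term, topics):
            mapping.setdefault(topic, []).append(term)
    return mapping


# Hypothetical candidate terms, taken from Example 1.
candidates = ["association", "rule", "classification", "tree", "frequent"]
by_topic = group_by_topic(candidates, TOPICS)
print(by_topic)
print(map_term_to_topics("frequent", TOPICS))
```

Word matching places “association” and “rule” under association rule mining and “tree” under decision tree, but it finds no topic for “frequent”, which Example 1 assigns to association rule mining on semantic grounds. Cases like this are why the mapping cannot rely on surface matching alone.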
2 http://www.wikipedia.org/
2013 IEEE/WIC/ACM International Conferences on Web Intelligence (WI) and Intelligent Agent Technology (IAT)
978-1-4799-2902-3/13 $31.00 © 2013 IEEE
DOI 10.1109/WI-IAT.2013.21