RUCIR at NTCIR-12 IMINE-2 Task
Ming Yue¹, Zhicheng Dou¹, Sha Hu²∗, Jinxiu Li², Xiaojie Wang¹, and Ji-Rong Wen²
Beijing Key Laboratory of Big Data Management and Analysis Methods, China
School of Information, Renmin University of China
¹ {yomin,dou,wangxiaojie}@ruc.edu.cn, ² {sallyshahu,jinxiu2216,jirong.wen}@gmail.com
ABSTRACT
In this paper, we present our participation in the Query Understanding subtask and the Vertical Incorporating subtask of the NTCIR-12 IMine-2 task, for both English and Chinese topics. In the Query Understanding subtask, we combine candidates extracted from search engine suggestions and Wikipedia, cluster and rank them, and then classify their verticals. In the Vertical Incorporating subtask, we provide a general method for adapting traditional diversification algorithms to handle predefined subtopics with classified verticals.
Team Name
RUC IR
Subtasks
Query Understanding (Chinese, English)
Vertical Incorporating (Chinese, English)
1. INTRODUCTION
In modern information systems, users type in some keywords and search engines return matched results. However, given an ambiguous or broad query, a retrieval system or search engine that simply compares the query with the corpus and returns a ranked result list may misunderstand the user's interests. The goal of the NTCIR-12 IMine-2 task is to find the potential intents behind a query and classify each intent into one of six verticals. These verticals help detect different user interests more precisely. The classified intents with their verticals can also be used to improve document ranking. The IMine-2 task consists of two subtasks: Query Understanding and Vertical Incorporating.
In the Query Understanding subtask, the system is required not only to return a ranked list of subtopic candidates for a given query, but also to identify the relevant vertical
∗ Corresponding author
intent for each subtopic. A subtopic of a given query specializes or disambiguates the original query. These subtopics, together with their verticals, represent the information users are interested in.
We first extract candidates from disambiguation pages in Wikipedia [3, 4]. We apply no further processing to the official query suggestions because they are already good candidates. Because candidates are usually short and carry little information, we further retrieve the top 300 results from the search engine and group them into clusters, using two different clustering algorithms, to find important candidates. After that, we rank the candidates by relevance and diversity. Finally, we classify each subtopic to obtain its vertical.
In the Vertical Incorporating subtask, our goal is to diversify search results in the top ranks, just like the Document Ranking subtask in IMine-1¹. What distinguishes the VI subtask is that subtopics are classified into verticals as part of the diversification problem. The algorithms therefore have to consider the additional virtual documents introduced by the verticals of a query's subtopics.
We provide a general method to adapt traditional diversification algorithms to the VI subtask. The main differences from traditional models are that we (1) consider verticals and virtual documents in diversity, and (2) understand subtopics through fine-grained information. We have tried this method on several state-of-the-art models, and report the results with PM2 [6] as the base method in this subtask.
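To make the base method concrete, the following is a minimal sketch of the standard PM-2 selection loop, not of our vertical-aware adaptation. The function name `pm2`, the input layout (`doc_scores` mapping each document to its per-subtopic relevance estimates, `weights` holding the subtopic importance values), and the default `lam=0.5` are assumptions made for illustration only. Under the same assumption, a vertical's virtual document could be supplied as just another entry in `doc_scores`.

```python
def pm2(doc_scores, weights, k, lam=0.5):
    """Greedy PM-2 ranking (illustrative sketch).

    doc_scores: {doc_id: [P(d | subtopic_i) for each subtopic i]}  (assumed layout)
    weights:    [importance of subtopic i]
    k:          number of documents to select
    lam:        trade-off between the chosen subtopic and the others
    """
    n = len(weights)
    seats = [0.0] * n              # portion of the ranking already given to each subtopic
    remaining = dict(doc_scores)   # candidates not yet selected
    ranking = []

    while remaining and len(ranking) < k:
        # Sainte-Lague quotient: which subtopic most deserves the next position?
        quotients = [weights[i] / (2 * seats[i] + 1) for i in range(n)]
        best = max(range(n), key=lambda i: quotients[i])

        def score(doc_id):
            rel = remaining[doc_id]
            own = quotients[best] * rel[best]
            others = sum(quotients[i] * rel[i] for i in range(n) if i != best)
            return lam * own + (1 - lam) * others

        chosen = max(remaining, key=score)
        rel = remaining.pop(chosen)

        # Share the filled position among subtopics in proportion to relevance.
        total = sum(rel)
        if total > 0:
            for i in range(n):
                seats[i] += rel[i] / total
        ranking.append(chosen)

    return ranking
```

With two equally weighted subtopics, the loop first picks the best document for one subtopic and then, because that subtopic's quotient drops, prefers a document covering the other.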
2. QUERY UNDERSTANDING
We divide this subtask into two smaller tasks. One is subtopic mining, similar to IMine, the earlier NTCIR task. The other is a classification task, which can be treated as a classic machine learning problem. In NTCIR-12, we use query suggestions and knowledge bases to mine subtopics, and then classify the vertical intent of each subtopic.
2.1 Methodology
Step 1. Extracting Subtopic Candidates From Various Resources. Query suggestions from search engines are one of the official data sets. In addition, we use Wikipedia as a knowledge base. In Wikipedia, a disambiguation page describes the different aspects of a specific term. We check each query in the task; if a query has a disambiguation page, the terms on that page are taken as candidates.
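The combination of the two sources in Step 1 can be sketched as follows. This is only an illustration: the function name `merge_candidates`, the case- and whitespace-insensitive duplicate check, and the rule that drops a candidate identical to the query are assumptions, not details stated above. Fetching the suggestions and the disambiguation terms is left out.

```python
def normalize(text):
    """Lower-case and collapse whitespace so near-identical strings compare equal."""
    return " ".join(text.lower().split())


def merge_candidates(query, suggestions, disambiguation_terms):
    """Combine subtopic candidates from both sources, keeping first occurrences.

    suggestions:          query suggestions from search engines
    disambiguation_terms: terms taken from the query's Wikipedia
                          disambiguation page (empty if it has none)
    """
    query_key = normalize(query)
    seen = set()
    merged = []
    for candidate in list(suggestions) + list(disambiguation_terms):
        key = normalize(candidate)
        # Skip empty strings, the query itself, and duplicates (assumed policy).
        if not key or key == query_key or key in seen:
            continue
        seen.add(key)
        merged.append(" ".join(candidate.split()))
    return merged
```

For example, `merge_candidates("apple", ["apple pie", "apple"], ["Apple Inc."])` keeps the suggestion and the disambiguation term and drops the entry that merely repeats the query.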
¹ http://www.thuir.org/IMine/
Proceedings of the 12th NTCIR Conference on Evaluation of Information Access Technologies, June 7-10, 2016 Tokyo Japan