Keyword query with structure: towards semantic scoring of XML
search results
Xiping Liu¹ • Changxuan Wan¹ • Dexi Liu¹
© Springer Science+Business Media New York 2015
Abstract Keyword search is an effective paradigm for
information discovery and has been introduced recently to
query XML documents. Scoring of XML search results is
an important issue in XML keyword search. The traditional
“bag-of-words” model cannot differentiate the roles of
keywords or the relationships between them, and is
therefore not suitable for XML keyword queries. In this paper,
we present a new scoring method based on a novel query
model, called keyword query with structure (QWS), which
is specially designed for XML keyword queries. The method
rests on the view, taken by the QWS model, that a keyword
query is a composition of several query units, each
representing a query condition. We believe that this method
captures the semantic relevance of the search results. The
paper first introduces an algorithm that reformulates a
keyword query into a QWS. Then, a scoring method is
presented which measures the relevance of search results
according to how many and how well the query conditions
are matched. The scoring method
is also extended to clusters of search results. Experimental
results verify the effectiveness of our methods.
Keywords XML keyword search • Keyword query with
structure • Query unit • Cluster
1 Introduction
Keyword search is an effective paradigm for information
discovery that has been extensively studied for flat documents
(text, HTML, etc.). As XML has been accepted as a
standard for document mark-up and exchange, it is natural
to extend keyword search techniques to support XML data
[1, 2].
Scoring is at the core of keyword search. Scoring methods
have been extensively studied in the traditional information
retrieval (IR) field, and a number of scoring functions have
been proposed [3]. Several scoring methods have also been
proposed for XML keyword search [1, 2]. Existing
XML scoring methods are based on the traditional
“bag-of-words” model. In this model, a text (such as a sentence or a
document) is represented as the bag (multiset) of its words,
disregarding grammar and even word order but keeping
multiplicity. Though simple, the model is not well
suited to XML keyword search.
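As a small illustration of the bag-of-words representation described above, the following Python sketch builds the multiset of words of a text. The function name `bag_of_words` and the whitespace tokenisation are assumptions made for this example only; they are not taken from any particular IR system.

```python
from collections import Counter


def bag_of_words(text: str) -> Counter:
    """Return the multiset of lowercase, whitespace-separated tokens.

    Hypothetical helper for illustration: real systems would also apply
    stemming, stop-word removal, and so on.
    """
    return Counter(text.lower().split())


# Word order is discarded: both texts map to the same bag.
bag_a = bag_of_words("keyword search for XML data")
bag_b = bag_of_words("XML data for keyword search")
same_bag = bag_a == bag_b

# Multiplicity is kept: a repeated word is counted each time it occurs.
repeated = bag_of_words("search the search results")["search"]
```

Because two texts with the same words in a different order yield identical bags, any scoring function defined on the bag alone cannot tell them apart.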
Consider a query Q1: “journal database article transaction”.
The query intends to search for articles about
“transaction” in a journal named “database”. Given an
XML document, an ideal result of the query is a subtree
rooted at an element labelled “article” containing
“transaction” in its text content, which is nested in an element
labelled “journal” with “database” in its content. Clearly,
it is not appropriate to view the query as a bag of words.
First, the keywords in the query play different roles.
The keywords “article” and “journal” should be treated as
tags of elements, while “database” and “transaction” are
keywords appearing in text contents. Second, the
relationships between keywords in the query are not all equally
close. The keyword “transaction” is more closely related to “article”
than to “database”, and “database” has a closer relationship
with “journal” than with “article”.
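The two observations above can be made concrete with a toy sketch that separates tag keywords from content keywords and groups each content keyword with the nearest preceding tag. This is only an illustration under simplifying assumptions, not the reformulation algorithm of the paper: the names `QueryUnit` and `group_keywords` are hypothetical, the set of element tags is assumed to be known in advance, and the "attach to the preceding tag" rule is a heuristic chosen for the example.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class QueryUnit:
    """One query condition: an optional element tag plus content terms."""
    tag: Optional[str]
    terms: List[str] = field(default_factory=list)


def group_keywords(query: str, known_tags: Set[str]) -> List[QueryUnit]:
    """Group keywords into units (illustrative heuristic only).

    A keyword that matches a known tag opens a new unit; any other
    keyword is attached to the most recently opened unit.
    """
    units: List[QueryUnit] = []
    for word in query.lower().split():
        if word in known_tags:
            units.append(QueryUnit(tag=word))
        elif units:
            units[-1].terms.append(word)
        else:
            # Content keyword seen before any tag: unit without a tag.
            units.append(QueryUnit(tag=None, terms=[word]))
    return units


# Q1 from the text, with "journal" and "article" assumed to be element tags.
units = group_keywords(
    "journal database article transaction", {"journal", "article"}
)
```

For Q1 this yields two units, one pairing “journal” with “database” and one pairing “article” with “transaction”, which is a distinction a flat bag of four words cannot express.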
Corresponding author: Xiping Liu, lewislxp@gmail.com
¹ School of Information Technology, Jiangxi University of
Finance and Economics, Nanchang 330013,
People’s Republic of China
Inf Technol Manag
DOI 10.1007/s10799-015-0247-z