Xiping Liu i dr. Prikaz pretrage XML ključne riječi primjenom skrivenog Markovljevog modela
Tehnički vjesnik 23, 6(2016), 1649-1658 1649
ISSN 1330-3651 (Print), ISSN 1848-6339 (Online)
DOI: 10.17559/TV-20150314113111
INTERPRETING XML KEYWORD QUERY USING HIDDEN MARKOV MODEL
Xiping Liu, Changxuan Wan, Dexi Liu
Original scientific paper
Keyword search on XML database has attracted a lot of research interests. As XML documents are very different from flat documents, effective search of
XML documents needs special considerations. Traditional bag-of-words model does not take the roles of keywords and the relationship between keywords
into consideration, and thus is not suited for XML keyword search. In this paper, we present a novel model, called semi-structured keyword query (SSQ),
which understands a keyword query in a different way: a keyword query is composed of several query units, where each unit represents query condition.
To interpret a keyword query under this model, we take two steps. First, we propose a probabilistic approach based on a Hidden Markov Model for
computing the best mapping of the query keywords into the database terms, i.e., elements, attributes and values. Second, we generate SSQs based on the
mapping. Experimental results verify the effectiveness of our methods.
Keywords: hidden Markov model (HMM); semi-structured keyword query (SSQ); XML keyword query
Prikaz pretrage XML ključne riječi primjenom skrivenog Markovljevog modela
Izvorni znanstveni članak
Pretraživanje ključne riječi na XML bazi podataka privuklo je prilično zanimanja. Kako se XML dokumenti vrlo razlikuju od plošnih (flat) dokumenata,
učinkovita pretraga XML dokumenata zahtijeva posebno razmatranje. Tradicionalni model vreće riječi (bag-of-words) ne uzima u obzir uloge ključnih
riječi i odnos između ključnih riječi pa prema tome nije pogodan za XML pretragu ključne riječi. U ovom radu predstavljamo novi model, nazvan polu-
strukturno pretraživanje ključne riječi (SSQ), koji podrazumijeva pretraživanje ključne riječi na različit način; to se pretraživanje sastoji od nekoliko
cjelina pretrage i svaka cjelina predstavlja stanje pretrage (query condition). Za interpretaciju pretrage po tom modelu, potrebna su dva koraka. Prvo,
predlažemo probabilistički pristup zasnovan na skrivenom Markovljevom modelu za izračunavanje najboljeg uklapanja traženih ključnih riječi u termine
baze podataka, tj. elemenata, atributa i vrijednosti. Drugo, generiramo konstrukcije ključnih riječi (SSQs) na osnovu uklapanja. Eksperimentalni rezultati
potvrđuju učinkovitost naših metoda.
Ključne riječi: polu-strukturno pretraživanje ključne riječi; skriveni Markovljev model (HMM); XML pretraživanje ključne riječi
1 Introduction
Keyword search, due to its simplicity and friendness,
has been widely used and extended to search a variety of
sources of information, such as relational database and
XML documents [1, 2]. An XML document is composed
of nested elements. The nested structure of XML
documents poses great challenges to keyword search
techniques, as the users are able to search XML
documents through structure and text contents.
The unique characteristics of XML documents calls
for a fresh look at and deep understanding of the keyword
query. Existing XML keyword search methods are based
on the "bag-of-words" model. In this model, a text unit
(such as a paragraph or a document) is taken as the bag
(multiset) of words, which means that the grammar and
order of words are not taken into consideration. However,
this model is too simple for XML keyword search.
Consider a query Q
1
: "journal info system article
expert".The query intention is to search for articles about
"expert" in a journal named "info system". In an XML
database, the answer may be an element labelled "article"
nested in an element labelled "journal", where the
"article" element contains "expert" in its text content, and
the "journal" element has "info system" in its content.
Obviously, it is not natural to view the query as a bag of
words. First, the keywords in the query have different
roles. The keywords "article" and "journal" are labels of
elements, while "expert" and "info system" are just
keywords in texts. Second, there exist different
relationships between keywords. The keyword "expert" is
more closely related to "article" than to "info" and
"system", and "info system" has closer relationship with
"journal" than with "article".
From this example we can see that the traditional
bag-of-words model is not proper for XML keyword
search, because it does not provide information about the
structure of the query hidden in the XML keyword query.
In this paper, we present a new model, called semi-
structured keyword query (SSQ), to model a keyword
query against an XML document. An SSQ is different
from a keyword query in that it has structural information,
and it is less strict compared with a structured query. The
SSQ model is special in that it makes explicit the structure
of the query. However, it is not straightforward to
transform a keyword query into a query in SSQ form. In
this work, we propose two steps to make the
transformation. In the first step, we map the words in the
keyword query into database terms, where each term is
either from the schema vocabulary or from the texts of the
database. As each word can be mapped to many terms, we
develop a Hidden Markov Model-based probabilistic
approach for interpreting the query keywords in terms of
database terms. In the second step, we design an
algorithm which takes a sequence of database terms as
input, and outputs a set of SSQs. Once the SSQs are
generated, it is possible to improve XML search results
based on the SSQs, but that is beyond the scope of this
paper.
To summarize, the following contributions are made in
this paper:
1) We propose a novel way to analyse and interpret an
XML keyword query. The approach makes explicit the
structural information hidden in the keyword query, and
transforms the query into a semi-structured keyword query
(SSQ). The SSQ helps to get the semantics and intention of
a keyword query.