main contributions are summarized as follows:
1) This is the first work that addresses the keyword ambi-
guity problem. We also identify three crucial issues that
an effective XML keyword search engine should meet.
2) We define our own XML TF (term frequency) and XML
DF (document frequency), which are cornerstones of all
formulae proposed later.
3) We propose three important guidelines in identifying the
user desired search for node type, and design a formula
to compute the confidence level of a certain node type
to be a desired search for node based on the guidelines.
4) We design formulae to compute the confidence of each
candidate node type as the desired search via node to
model natural human intuitions, in which we take into
account the pattern of keywords co-occurrence in query.
5) We propose a novel relevance oriented ranking scheme
called XML TF*IDF similarity which can capture the
hierarchical structure of XML, and resolve Ambiguity 1
and Ambiguity 2 in a heuristic way; and also distinguish
the similarity computation for leaf nodes and internal
nodes in XML data. Moreover, our approach is able to
handle both semi-structured and unstructured data.
6) We implement the proposed techniques in a keyword
search engine prototype called XReal. Extensive exper-
iments show its effectiveness, efficiency and scalability.
The rest of the paper is organized as follows. We present
the related work in Section II, and preliminary on IR and data
model in Section III. Section IV infers user search intention,
and Section V discusses relevance oriented ranking. Section
VI presents the search algorithms. Experimental evaluation is
given in Section VII and we conclude in Section VIII.
II. R
ELATED WORK
Extensive research efforts have been conducted in XML
keyword search to find the smallest sub-structures in XML
data that each contains all query keywords in either the tree
data model or the directed graph (i.e. digraph) data model.
In tree data model, LCA (lowest common ancestor) seman-
tics is first proposed and studied in [8], [2] to find XML nodes,
each of which contains all query keywords within its subtree.
Subsequently, SLCA (smallest LCA [9], [3]) is proposed to
find the smallest LCAs that do not contain other LCAs in their
subtrees. GDMCT (minimum connecting trees) [5] excludes
the subtrees rooted at the LCAs that do not contain query
keywords. Sun et al. [10] generalize SLCA to support key-
word search involving combinations of AND and OR boolean
operators. XSeek [4] generates the return nodes which can be
explicitly inferred by keyword match pattern and the concept
of entities in XML data. However, it addresses neither the
ranking problem nor the keyword ambiguity problem. Besides,
it relies on the concept of entity (i.e. object class) and considers
a node type t in DTD as an entity if t is “*”-annotated in DTD.
As a result, customer, phone, interest, book in Figure 1,
are identified as object classes by XSeek. However, it causes
the multi-valued attribute to be mistakenly identified as an
entity, causing the inferred return node not as intuitive as
possible. E.g. phone and interest are not intuitive as entities.
In fact, the identification of entity is highly dependent on the
semantics of the underlying database rather than its DTD, so
it usually requires the verification and decision from database
administrator. Therefore, the adoption of entities for keyword
search should be optional although this concept is very useful.
In digraph data model, previous approaches are heuristics-
based, as the reduced tree problem on graph is as hard as
NP-complete. Li et al. [11] show the reduction from minimal
reduced tree problem to the NP-complete Group Steiner Tree
problem on graphs. BANKS [12] uses bidirectional expansion
heuristic algorithms to search as small portion of graph as
possible. BLINKS [13] proposes a bi-level index to prune and
accelerate searching for top-k results in digraphs. Cohen et
al. [14] study the computation complexity of interconnection
semantics. XKeyword [15] provides keyword proximity search
that conforms to an XML schema; however, it needs to com-
pute candidate networks and thus is constrained by schemas.
On the issue of result ranking, XRANK [2] extends
Google’s PageRank to XML element level, to rank among
the LCA results; but no empirical study is done to show the
effectiveness of its ranking function. XSEarch [1] adopts a
variant of LCA, and combines a simple tf*idf IR ranking with
size of the tree and the node relationship to rank results; but it
requires users to know the XML schema information, causing
limited query flexibility. EASE [16] combines IR ranking and
structural compactness based DB ranking to fulfill keyword
search on heterogenous data. Regarding to ranking methods,
TF*IDF similarity [7] which is originally designed for flat
document retrieval is insufficient for XML keyword search due
to XML’s hierarchical structure and the presence of Ambiguity
1 and Ambiguity 2. Several proposals for XML information
retrieval suggest to extend the existing XML query languages
[17], [18], [19] or use XML fragments [20] to explicitly
specify the search intention for result retrieval and ranking.
III. P
RELIMINARIES
A. TF*IDF cosine similarity
TF*IDF (Term Frequency * Inverse Document Frequency)
similarity is one of the most widely used approaches to
measure the relevance of keywords and document in keyword
search over flat documents. We first review its basic idea, then
address its limitations for keyword search in XML. The main
idea of TF*IDF is summarized in the following three rules.
• Rule 1: A keyword appearing in many documents should
not be regarded as being more important than a keyword
appearing in a few.
• Rule 2: A document with more occurrences of a query
keyword should not be regarded as being less important
for that keyword than a document that has less.
• Rule 3: A normalization factor is needed to balance be-
tween long and short documents, as Rule 2 discriminates
against short documents which may have less chance to
contain more occurrences of keywords.
To combine the intuitions in the above three rules, the
TF*IDF similarity is designed:
519519
Authorized licensed use limited to: Dalian University of Technology. Downloaded on March 10,2010 at 04:23:52 EST from IEEE Xplore. Restrictions apply.