Gao N, et al. Sci China Inf Sci May 2014 Vol. 57 052107:4
extends the PageRank hyperlink metric to XML ranking. Apparently, PageRank does not deal with the
ambiguity problem. In XRANK, a node in XML tree is designated as a result node only if it contains
at least one occurrence of each keyword in its subtre e, after excluding the nodes in its desce ndants that
already contain all the keywords. The formal definition is a s follow:
Definition 1. Given a keyword query Q = {k
1
, . . . , k
q
}, an XML tree Xtree, we assume that V
i
(1 <=
i <= q) is the set of nodes that directly contains keyword k
i
in Xtre e. LCA(Q, Xtree) is defined as
LCA(Q, Xtree) = {n|∃(v
1
∈ V
1
, . . . , v
q
∈ V
q
), n is the lowest common ancestors of {v
1
, . . . , v
q
}.
Definition 2. Given a keywor d query Q = {k
1
, . . . , k
q
}, an XML tree Xtree, XRANK(Q, Xtree) is
defined as follows: XRANK(Q, Xtree) = n|n
∗
∈ LCA(Q, Xtre e ), where n∗ is node n after excluding all
its descendant nodes which belong to LCA node set.
XKSearch was put forward by Xu et al. [15]. XKSearch defines SLCA as query semantic model. For
a query, a node in XML tree is considered as a result node in SLCA only if it contains at least one
occurrence of each keyword in its subtre e , and none of its descendants does. Howe ver, XKSearch does
not address the ra nking problem. The formal definition of SLCA is as follow:
Definition 3. Given a keyword query Q = {k
1
, . . . , k
q
}, an XML tree Xtree, SLCA(Q, Xtree) is
defined as SLCA(Q, Xtree) = {n|n ∈ LCA(Q, Xtree) ∩ ¬(∃n
′
∈ LC A(Q, Xtre e), n ≻ n
′
)}, where n ≻ n
′
means n is an ancestor of n
′
.
XSeek was introduced by Liu et al. [16]. In XSeek, nodes in XML tree a re grouped into thr ee categories:
entity node , attribute node and connection node.
Definition 4. Entity node: If a node has siblings under the sa me name, then this indicates a many-
to-one relationship with its parent node, and is consider ed to represent an entity. E.g. <workshop>,
<paper> in Figure 4 are c onsidered as entity nodes.
Definition 5. Attribute node: If a node does not have siblings of the same name, and it has one
child, which is a value, then it is c onsidered to represent an attribute. E.g. <date>, <title>, <editors>,
<author> are defined as attribute nodes .
Definition 6. Connection node: A node is a connectio n node if it repres ents neither an entity nor an
attribute. E.g. < proceedings> in Figure 4 is a connection node.
Given a keyword query, XSeek scans the candidate result list, and then replaces each non-entity node
with the nearest entity node on its path to the root. This process guarantees that each node returned to
the user is an entity node. Same as XKSearch, the ranking strategy is not discussed in XSeek.
XReal was put forward by Bao et al. [17]. XReal utilizes the statistics of underly ing XML data to
attack the ranking problem. Firstly it identifies the s earch for node s and search via nodes of a query, a nd
then the search engine ranks the individual matches of all candidate results by using an XML TF*IDF
strategy. Nevertheless, XReal is unable to solve the ambiguity problem.
3 Preliminaries
3.1 System framework
Figure 3 describes the framework of our search engine. The inverted index of the data collection is
initially processed in the background. Afterwards, when user submits a query to the interface, the search
engine firstly retrieves the relevant elements ac cording to the de finition of the semantic model maximal
lowest common ances tor (MAXLCA). The extracted elements are disordered results of the query. Thus,
we use a ranking model BM25 to rank these disordered element results, and the output of the processing
is a ranked list. To further improve the effect of the ranking module, we re-rank the top several r e sults
in the ranked list by distribution measurements. In distribution measurements, there are four criterions
based on the distribution of keywords taken into consideration, explicitly introduced in Section 4, and
the weights of these four measurements in the final ranking function are tr ained by a learning to optimize