holistic XML tree pattern matching algorithms. The
experimental results show that our algorithm can
correctly process extended XML tree patterns,
achieving performance speedup for tested queries
and data sets, even in their restricted focus. The
improvement mainly owes to the reduction of the
size of intermediate results.
1.2 Outline
The rest of the paper is organized as follows: Section 2 gives
the preliminaries about research problem and the proces-
sing model. Section 3 shows a set of theories about
matching cross and Section 4 presents an extended XML
tree pattern matching algorithm called TreeMatch. Section 5
presents thorough experimental studies between the novel
algorithms and the prior methods. Finally, Section 6
presents previous work related to the XML tree pattern
matching and Section 7 concludes this paper.
2PRELIMINARIES
2.1 Modeling of XML Data and Extended Tree
Pattern Query
An XML database D is usually modeled as a rooted, node-
labeled tree (in this paper, we use D to represent the
database and the related tree model exchangeably without
specific declaration), element tags and attributes are
mapped to nodes in the trees and the edges are used to
represent the direct nesting relationships. Our primary
focus is on element nodes; and it is not difficult to extend
our methods to process the other types of nodes, including
attribute and character data. For convenience, we distin-
guish between query nodes and database nodes by using
the term “node” to refer to a query node and the term
“element” to refer to a data element in D.
An extended tree query Q describes a complex traversal of
the XML document and retrieves relevant tree-structured
portions of it. The nodes in Q include element tags, attributes,
and character data. We use “*” to denote the wildcard, which
can match any single tree element. There are four kinds of
query edges, which are the four combinations between
(positive and negative) and (parent-child and ancestor-descen-
dant). For example, in Fig. 2b, (A; B)isapositiveparent-child
edge and (A; C)isanegative parent-child edge. We use a
symbol “:” to denote a negative edge. There are two kinds of
query node: ordered and unordered node. We use “< ” in a box
to denote the ordered node, otherwise it is an unordered node.
For example, the node A, in Figs. 2c and 2d are ordered nodes.
In each extended tree query pattern, there is one or multiple
nodes which are assigned as the selected return nodes,
denoted with an underline. For example, in Fig. 2a, C is the
selected return node.
Given an extended tree query Q with n selected return
nodes and an XML database D,amatch of Q in D is
identified by a mapping from nodes in Q to the elements in
D, such that:
1. query node types (i.e., tag name) are satisfied by the
corresponding database elements and wildcards “*”
can match any single database element;
2. the positive edge relationships (including positive
parent-child and positives ancestor-descendant
edges) between query nodes are satisfied by the
corresponding database elements;
3. the negative edge relationships (including negative
parent-child and negative ancestor-descendant
edges) are satisfied, that is, no corresponding
database element pairs exist; and
4. the order relationship of children of each ordered
node is satisfied by the corresponding database
elements.
The answers of a query can be represented as a set of
database elements, where each element identifies a distinct
match of the selected return nodes on D. For example, Fig. 5
shows an example mapping relationship between an
extended XML tree pattern and a document tree.
2.2 Labeling Schemes
Most XML query processing algorithms on XML documents
rely on certain labeling schemes, such as region encoding
scheme [27], prefix scheme [13], ORDPATH [19], and
extended Dewey scheme [16]. In this paper, we use the
extended Dewey labeling scheme, proposed in paper [16], to
assign each node in XML documents a sequence of integers
to capture the structure information of documents.
Extended Dewey labeling scheme is a variant scheme of
the prefix labeling scheme. In the prefix labeling scheme, the
root is labeled by an empty string and for a nonroot element
u, labelðuÞ¼labelðvÞ:n, where u is the nth child of v.In
Extended Dewey labeling scheme, each label provides
complete information about ancestors’ names and labels.
For example, given an element e with label “1.2.3,” prefix
labeling schemes can tell us parentðeÞ¼‘‘1:2’’ and
grandparentðeÞ¼‘‘1’’, but extended Dewey labeling scheme
can also tell us the tag name of elements, say, tagðeÞ¼‘‘A’’,
tagðparentðeÞÞ¼‘‘B’’, and tagðgrandparentðeÞÞ¼‘‘C’’. In or-
der to achieve this goal, paper [16] uses module function to
encode the element tag information to prefix labels, and use
finite state transducer (FST) to decode the the types
information for a single extended Dewey label. The details
LU ET AL.: EXTENDED XML TREE PATTERN MATCHING: THEORIES AND ALGORITHMS 3
Fig. 4. Illustration to the relationship between BMC and UMC. The
shaded portions demonstrate the optimal query classes.
Fig. 5. Mapping relationship between an extended tree pattern and a
document tree.