Bei et al. / J Zhejiang Univ Sci A 2008 9(6):744-757
746
from bottom to top based on a compact global tree
guide. All infrequent nodes are pruned to accelerate
tree enumeration. The support of a candidate tree is
computed without scanning the database as it is cal-
culated directly from the global tree guide.
Caching results of XML queries has been con-
sidered a useful strategy to improve performance of
XML query processing. XCache (Chen et al., 2002) is
a holistic XQuery-based semantic caching system.
Mining approaches for finding frequent queries are
also incorporated into caching (Yang et al., 2003a;
Chen et al., 2005). Yang et al.(2003a) employ
FastXMiner to discover frequent XML query patterns
and demonstrate how the frequent patterns can be
used to improve caching performance. Chen et al.
(2005) take into account temporal features of queries
for frequent queries discovery and design an appro-
priate cache replacement strategy by finding both
positive and negative association rules. Hong and
Kang (2005) integrate heterogeneous data sources on
the Web and cache results of queries through XML
views of data sources to accelerate query processing.
PRELIMINARY CONCEPTS
Frequent rooted query pattern tree
Definition 1 (Query pattern tree, QPT) An XML
query can be modeled as a query pattern tree
QPT=<R, N, E>, where R is the root node, N is the
node set, and E is the edge set. Each node n has a label
whose value is in {“*”, “//”}
∪labelSet where the
labelSet is the label set of all elements and attributes.
For each edge e=(n
1
, n
2
), node n
1
is the parent of n
2
.
Definition 2 (Query pattern subtree, QPS) Given
two query pattern trees T and S, S is considered to be a
query pattern subtree of T iff there exists a one-to-one
mapping φ: V
S
→V
T
satisfying the following condi-
tions: (1) φ preserves the labels, i.e., L(v)=L(φ(v))
∀v∈V
S
; (2) φ preserves the parent relation, i.e.,
(u,v)∈E
S
iff (φ(u), φ(v))∈E
T
.
Definition 3 (Rooted query pattern subtree, RQPS)
Given two query pattern trees T and S, we say that S is
a rooted query pattern subtree of T iff S is a query
pattern subtree of T and the trees S and T have the
same root label.
Definition 4 (Query database tree, QDT) An XML
query database, which is a collection of XML queries,
can be represented as QDT=<T, R, Q, Φ>, where T is
a tree whose root is R; Q is the set of query pattern
trees {q
1
, q
2
, …, q
n
}; R is the virtual root node of the
tree with a special label not belonging to labelSet; Φ:
V→Q is a query mapping function from all children
of the root R to the trees Q, where V represents the set
of all children of the root R. For a complete tree with
the root node being the ith node of v
i
∈V, we have
Φ(v
i
)=q
i
.
Definition 5 (Frequent rooted query pattern tree,
FRQPT) Let D denote all the query pattern trees of
the issued queries and d
T
be an indicator variable with
d
T
(S)=1 if the query pattern tree S is a rooted query
pattern subtree of T and d
T
(S)=0 if tree S is not. The
support of query pattern tree S in D can be defined as
σ(S)=∑
T
∈
D
d
T
(S)/∑
T
∈
D
, i.e., the percentage of the
number of trees in D that contain tree S. A rooted
query pattern tree is frequent if its support is more
than, or equal to, a user-specified minimum support,
defined as minsupp.
With the help of the QDT, we can transform the
problem of discovering FRQPTs from the original
query database into the problem of discovering
FRQPTs over the QDT. Let n
T
(S) denote the number
of occurrences of the rooted subtree S in a tree T.
Then the support of a rooted query pattern tree S can
be defined as σ(S)=n
QDT
(S)/|Q|. In this way, we can
deal with query pattern trees with different root nodes,
and discover frequent query pattern trees while only
considering the rooted query pattern subtrees. After
finding all the FRQPTs over the QDT, the frequent
query patterns are obtained by simply removing the
virtual root of each FRQPT. Fig.1 shows a query
database tree composed of five XML queries. Given
the minimum support 0.6, we can obtain six FRQPTs.
Compressed global tree guide
For each user issued QPT, we assign a unique ID,
denoted as QPT.ID, which will be used for the con-
struction of a global tree guide in the mining process.
Definition 6 (Global tree guide, GTG) We merge all
issued queries over the query database tree to create a
global tree guide, where the ID list of each node
represents the queries containing the path from the
root to the current node. Fig.2 shows a GTG con-
structed using 15 query pattern trees. The QPT list for
node “Java” indicates that there are six queries that
contain the path “R/order/items/book/title/Java”.
万方数据