
.
Notice that the query result
is
and that
=
where
removes ancestor nodes from its input.
The function
computes the right match of
in a set
,
that is the node of
that has the smallest id that is greater than or
equal to
;
computes the left match of
in a set
,
that is the node of
that has the biggest id that is less than or equal
to
1
.
(
) returns null when there is no right
(left) match node. The cost of
(
) is
since it takes
steps (Dewey number comparisons) to find
the right (left) match node and the cost of comparing two Dewey
numbers is
. The function
returns the
other argument when one argument is null and returns the descen-
dant node when
and
have ancestor-descendant relationship.
The cost of the function
is
.
3. ALGORITHMSFORFINDINGTHESLCA
OF KEYWORD LISTS
This section presents the core Indexed Lookup Eager algorithm,
its Scan Eager variation and the prior work Stack algorithm [13].
A brute-force solution to the SLCA problem computes the LCAs
of all node combinations and then removes ancestor nodes. Its
complexity is
. Besides being inefficient the
brute-force approach is blocking. After it computes an LCA
for some
,
,
, it cannot report
as an answer since there might be another set of
nodes
"
"
such that
"
"
.
The complexity analysis given in this section is for main memory
cases. We will give disk access complexity in Section 4 after we
discuss the implementation details of how we compress and store
keyword lists on disk. In the sequel we choose
to be the smallest
keyword list since
, where
is any permutation of
, and there is a benefit
in using the smallest list as
as we will see in the complexity
analysis of the algorithms.
3.1 TheIndexedLookupEagerAlgorithm(IL)
The Indexed Lookup Eager algorithm is based on four properties
of SLCAs, which we explain starting from the simplest case where
and
is a singleton
.
According to the above Property (1), we compute the LCA of
and its left match in
, the LCA of
and its right match in
, and
the singleton formed from the deeper node from the two LCAs is
. Property (1) is based on the following two observa-
tions. For any two nodes
to the right (according to preorder)
of a node
, if
, then
; similarly, for any two nodes
to the left of a node
, if
, then
2
.
We generalize to arbitrary
when the first set is a singleton.
Notice the recursiveness in Property (2).
1
The right or left match of a node
in
is itself if
. This
may happen when a node’s label contains multiple keywords.
2
The two observations apply to inorder and postorder as well.
Next we generalize to arbitrary
.
Property (3) straightforwardly leads to an algorithm to compute
: first computes
for each
(
),
is the answer. Each
is computed by using Properties (2) and (1).
The benefit of the above algorithm over the brute force approach
is that for each node
in
, the algorithm does not compute
for all
, but computes a
single
where each
(
) is computed
by the match functions (
and
). The complexity of the al-
gorithm is
or
where
(
) is the minimum (maximum) size of key-
word lists
through
because for each node
in
the algo-
rithm needs to find a left and a right match in each one of the other
keyword lists. Finding a match in list
costs
.
Hence the total cost of match operations is
.
The total cost of the
and
operations is
and hence is dominated by the cost of the match operations. The
factor is attributed to the cost of removing ancestors opera-
tion.
The subroutine
, based on the following two lemmas,
computes
efficiently by removing ancestor nodes on
the fly.
LEMMA 1. Given any two nodes
and a set
, if
and
,
then
.
LEMMA 2. For any two nodes
and a set
such that
and
,
if
is not an ancestor of
, then for any
such that
,
#
.
Consider
sorted by id, where
. Let
!
where
,
,
.
According to Lemma 1, if
where node
ap-
pears after
in
!
(that is,
"
), then
is an ancestor node. Thus
when computing the list
!
, we can discard the out-of-order nodes
such as
. The resulting list
!
is in order and contains the nodes
of
. However
!
is not necessarily ancestor node free.
Consider any two adjacent nodes
,
in
!
where
is after
. If
is not an ancestor of
, then
cannot be an ancestor of
any node
that is after
in
!
(according to Lemma 2), which
means
is a
! #
of
,
. Lemma 1 and 2 together lead to the
subroutine
that computes
efficiently. Line
#5 in
applies Lemma 1 to remove out-of-order nodes, and
lines #6-8 apply Lemma 2 to identify a SLCA as early as possible.
As can be seen from
, at any time only three nodes (
"
,
,
) are needed in memory.
Consider
=[
,
,
,
,
],
=[
,
,
,
,
] (the key-
word lists for “John” and “Ben” respectively). In the first itera-
tion of the loop at line #3,
"
,
,
=0 (line #4). At
the end of the first iteration
"
(line #8). In the second iter-
ation,
,
,
"
. In the third iteration,
,
,
"
. In the fourth iteration,
,
(line #4). Notice that the condition at