resulting identifier locality: For every (complete) subtree of T ,
the ordered identifiers of the included nodes form a contiguous
sequence of integers. The vertex set of any such subtree can
thus be compactly expressed as an integer interval. Let T
v
=
(V
T
v
, E
T
v
) denote the subtree of T rooted at node v. We have
π(w)
w ∈ V
T
v
=
min
w∈V
T
v
π(w), max
w∈V
T
v
π(w)
(6)
=
min
w∈V
T
v
π(w), π(v)
.
Above interval is called tree interval of v and will be denoted
by I
T
(v) in the remainder of the text.
The complete reachability information of the spanning tree
T is encoded in the collection of tree intervals. For a pair of
nodes u, v ∈ V , there exists a path from u to v in T iff the
post-order number of the target is contained in the tree interval
of the source, that is,
u ∼
T
v ⇐⇒ π(v) ∈ I
T
(u). (7)
This reachability index for trees allows for O(1) query pro-
cessing at a space consumption of O(n).
Extension to DAGs. While above technique can be used
to easily answer reachability queries on trees, the case of
general DAGs is much more challenging. The reason is that,
in general, the reachable set R(v) of a vertex v in the DAG
is only partly represented by the interval I
T
(v), as the tree
interval only accounts for reachability relationships that are
preserved in T . Vertices that can only be reached from a node
v by traversing one or more non-tree edges have to be handled
seperately: instead of merely storing the tree intervals I
T
(v),
every node v is now assigned a set of intervals, denoted by
I(v). The purpose of this so-called reachable interval set is
to capture the complete reachability information of a node.
The sets I(v), v ∈ V are initialized to contain only the
tree interval I
T
(v). Then, the vertices are visited in reverse
topological order. For the current vertex v and every outgoing
edge (v, w) ∈ E, the reachable interval set I(w) is merged
into the set I(v). The merge operation on the intervals resolves
all cases of interval subsumption and extension exhaustively,
eventually ensuring interval disjointness. Due to the fact that
the vertices are visited in reverse topological order, it is
ensured that for every non-tree edge (s, t) ∈ E \ E
T
, the
reachability intervals in I(t) will be propagated and merged
into the reachable interval sets of s and all its predecessors.
As a result, all reachability relationships are covered by the
resulting intervals.
Query Processing. Using the reachable interval sets I(v),
queries on DAGs can be answered by checking whether the
post-order number of the target is contained in one of the
intervals associated with the source:
u ∼ v ⇐⇒ ∃
[α, β] ∈ I(u)
: α ≤ π(v) ≤ β. (8)
By ordering the intervals contained in a set, reachability
queries can now be answered efficiently in O(log n) time on
DAGs. The resulting index (collection of reachable interval
sets) can be regarded as a materialization of the transitive clo-
sure of the graph, rendering this approach potentially infeasible
for large graphs, both in terms of space consumption as well
as computational complexity.
III. APPROXIMATE INTERVALS
For massive problem instances, indexing approaches that
materialize the transitive closure (or compute a compressed
variant without an a priori size restriction), suffer from limited
applicability. For this reason, recent work on reachability query
processing over massive graphs includes a shift towards guided
online search procedures. In this setting, every node is assigned
a concise label which – in contrast to the interval sets described
in Section II-A – is restricted by a predefined size constraint.
These labels in general do not allow answering the query after
inspection of just the source node, yet can be used to prune
portions of the graph in an online search.
As a basic example, consider a reachability index that labels
every node v ∈ V with its topological order number τ(v).
While this simple variant of node labeling is obviously not
sufficient to answer a reachability query by means of a single-
lookup, a graph search procedure can greatly benefit from the
node labels: For a given query (s, t), the online search rooted
at s can terminate the expansion of a branch of the graph
whenever for the currently considered node v it holds
τ(v) ≥ τ(t). (9)
This follows from the properties of a topological ordering.
The recently proposed GRAIL reachability index [22], [23]
further extends this idea by labeling the vertices with approx-
imate intervals:
Suppose that for every node v we replace the set I(v) by
a single interval
I
0
(v) :=
min
w∈R(v)
π(w), max
w∈R(v)
π(w)
, (10)
spanning from the lowest to the highest reachable id. This
interval is approximate in the sense that all reachable ids are
covered whereas false positive entries are possible:
Definition 1 (False Positive). Let v ∈ V denote a node with
the approximate interval I
0
(v) = [α, β]. A vertex w ∈ V is
called false positive with respect to I
0
(v) if
α ≤ π(w) ≤ β and v 6∼ w. (11)
Obviously, the single interval I
0
(v) is not sufficient to
establish a definite answer to a reachability query of the form
(G, v, w). However, all queries involving a target id π(w) that
lies outside the interval, i. e.
π(w) < α or π(w) > β, (12)
can be answered instantly with a negative answer, similar to
the basic approach based on Equation (9). In the opposite case,
that is, α ≤ π(w) ≤ β, no definite answer to the reachability
query can be given and the online search procedure continues
with an expansion of the child vertices, terminating as soon