4 D.J. Abadi et al.: Aurora: a new model and architecture for data stream management
The middle path in Fig. 2 represents a view. In this case,
a path is defined with no connected application. It is allowed
to have a QoS specification as an indication of the importance
of the view. Applications can connect to the end of this path
whenever there is a need. Before this happens, the system
can propagate some, all, or none of the values stored at the
connection point in order to reduce latency for applications
that connect later. Moreover, it can store these partial results at
any point along a view path. This is analogous to a materialized
or partially materialized view. View materialization is under
the control of the scheduler.
The bottom path represents an ad hoc query. An ad hoc
query can be attached to a connection point at any time. The
semantics of an ad hoc query is that the system will process
data items and deliver answers from the earliest time T (per-
sistence specification) stored in the connection point until the
query branch is explicitly disconnected. Thus, the semantics
for an Aurora ad hoc query is the same as a continuous query
that starts executing at t_now − T and continues until explicit
termination.
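The replay semantics of an ad hoc query can be sketched as follows. This is a minimal illustration, not Aurora's implementation; the class and method names are assumptions made for the sketch:

```python
import time

class ConnectionPoint:
    """Hypothetical sketch of a connection point that retains the last T
    seconds of tuples (the persistence specification) so that an ad hoc
    query attached later can replay history from t_now - T."""

    def __init__(self, persistence_seconds):
        self.persistence = persistence_seconds
        self.buffer = []  # list of (arrival_timestamp, tuple)

    def push(self, tup):
        now = time.time()
        self.buffer.append((now, tup))
        # discard tuples older than the persistence specification
        cutoff = now - self.persistence
        self.buffer = [(ts, t) for ts, t in self.buffer if ts >= cutoff]

    def attach_ad_hoc(self):
        """Return stored tuples from t_now - T onward; a real system would
        then keep feeding the query live tuples until it disconnects."""
        cutoff = time.time() - self.persistence
        return [t for ts, t in self.buffer if ts >= cutoff]
```

Under this sketch, attaching an ad hoc query is indistinguishable from having started a continuous query T seconds in the past, which is exactly the stated semantics.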
2.2 Graphical user interface
The Aurora user interface cannot be covered in detail because
of space limitations. Here, we mention only a few salient fea-
tures. To facilitate designing large networks, Aurora will sup-
port a hierarchical collection of groups of boxes. A designer
can begin near the top of the hierarchy where only a few super-
boxes are visible on the screen. A zoom capability is provided
to allow him to move into specific portions of the network,
by replacing a group with its constituent boxes and groups.
In this way, a browsing capability is provided for the Aurora
diagram.
Boxes and groups have a tag, an argument list, a description
of the functionality, and, ultimately, a manual page. Users can
teleport to specific places in an Aurora network by querying
these attributes. Additionally, a user can place bookmarks in a
network to allow him to return to places of interest.
These capabilities give an Aurora user a mechanism to
query the Aurora diagram. The user interface also allows monitors
for arcs in the network to facilitate debugging as well as
facilities for “single stepping” through a sequence of Aurora
boxes. We plan a graphical performance monitor as well as
more sophisticated query capabilities.
3 Aurora optimization
In traditional relational query optimization, one of the primary
objectives is to minimize the number of iterations over large
data sets. Stream-oriented operators that constitute the Aurora
network, on the other hand, are designed to operate in a data
flow mode in which data elements are processed as they appear
on the input. Although the amount of computation required by
an operator to process a new element is usually quite small,
we expect to have a large number of boxes. Furthermore, high
data rates add another dimension to the problem. Lastly, we
expect many changes to be made to an Aurora network over
time, and it seems unreasonable to take the network offline
to perform a compile-time optimization. We now present our
strategies to optimize an Aurora network.
3.1 Dynamic continuous query optimization
We begin execution of an unoptimized Aurora network, i.e., the
one that the user constructed. During execution we gather run-
time statistics such as the average cost of box execution and
box selectivity. Our goal is to perform run-time optimization
of a network, without having to quiesce it. Hence combining
all the boxes into a massive query and then applying conven-
tional query optimization is not a workable approach. Besides
being NP-complete [25], it would require quiescing the whole
network. Instead, the optimizer will select a portion of the net-
work for optimization. Then it will find all connection points
that surround the subnetwork to be optimized. It will hold all
input messages at upstream connection points and drain the
subnetwork of messages through all downstream connection
points. The optimizer will then apply the following local tac-
tics to the identified subnetwork.
• Inserting projections. It is unlikely that the application ad-
ministrator will have inserted map operators (see Sect. 5)
to project out all unneeded attributes. Examination of an
Aurora network allows us to insert or move such map oper-
ations to the earliest possible points in the network, thereby
shrinking the size of the tuples that must be subsequently
processed. Note that this kind of optimization requires that
the system be provided with operator signatures that de-
scribe the attributes that are used and produced by the
operators.
• Combining boxes. As a next step, Aurora diagrams will be
processed to combine boxes where possible. A pairwise
examination of the operators suggests that, in general, map
and filter can be combined with almost all of the operators,
whereas windowed or binary operators cannot.
It is desirable to combine two boxes into a single box when
this leads to some cost reduction. As an example, a map
operator that only projects out attributes can be combined
easily with any adjacent operator, thereby saving the box-
execution overhead for a very cheap operator. In addition,
two filtering operations can be combined into a single,
more complex filter that can be more efficiently executed
than the two boxes it replaces. Not only is the overhead of a
second box activation avoided, but also standard relational
optimization on one-table predicates can be applied in the
larger box. In general, combining boxes at least saves the
box-execution overhead and reduces the total number of
boxes, leading to a simpler diagram.
• Reordering boxes. Reordering the operations in a conven-
tional relational DBMS to an equivalent but more efficient
form is a common technique in query optimization. For
example, filter operations can sometimes be pushed down
the query tree through joins. In Aurora, we can apply the
same technique when two operations commute.
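The hold-and-drain protocol that precedes these tactics can be sketched roughly as follows. All names here are illustrative assumptions for the sketch, not Aurora's actual interfaces:

```python
class InputPoint:
    """Hypothetical upstream connection point that can hold messages
    while the subnetwork behind it is being optimized."""

    def __init__(self):
        self.holding = False
        self.held = []

    def deliver(self, msg, subnet):
        if self.holding:
            self.held.append(msg)  # queue while optimization runs
        else:
            subnet.process(msg)

def optimize_subnetwork(subnet, upstream_points, apply_tactics):
    # 1. hold new input at every upstream connection point
    for cp in upstream_points:
        cp.holding = True
    # 2. drain in-flight messages out through the downstream points
    subnet.drain()
    # 3. apply the local tactics (projection insertion, box combining,
    #    reordering) to the now-quiescent subnetwork
    apply_tactics(subnet)
    # 4. release held messages into the optimized subnetwork
    for cp in upstream_points:
        cp.holding = False
        pending, cp.held = cp.held, []
        for msg in pending:
            subnet.process(msg)
```

The point of the drain step is that the tactics below can then rewrite the subnetwork without losing or reordering in-flight tuples.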
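As a toy illustration of the projection tactic (the operator encodings below are assumptions made for the sketch, not Aurora's operator set), pushing an attribute-dropping map ahead of a filter is safe whenever the filter's predicate reads only the kept attributes:

```python
def filter_box(pred, stream):
    # filter: pass the tuples satisfying the predicate
    return [t for t in stream if pred(t)]

def map_box(keep, stream):
    # map used purely as a projection: keep only the listed attributes
    return [{k: t[k] for k in keep} for t in stream]

tuples = [{"id": 1, "temp": 40.0, "payload": "x" * 1000},
          {"id": 2, "temp": 10.0, "payload": "y" * 1000}]

keep = ("id", "temp")
pred = lambda t: t["temp"] > 30.0  # reads only kept attributes

late_projection = map_box(keep, filter_box(pred, tuples))
early_projection = filter_box(pred, map_box(keep, tuples))

# same answer, but the early projection shrinks every tuple before
# the filter (and every box downstream of it) touches it
assert late_projection == early_projection
```

This is why the operator signatures mentioned above matter: the optimizer can only move the projection earlier if it knows which attributes each downstream operator uses.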
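The filter-combining case can likewise be sketched as a hypothetical mini-implementation (again not Aurora's code):

```python
def make_filter(pred):
    # each call to the returned box models one box activation
    def box(stream):
        return [t for t in stream if pred(t)]
    return box

def combine_filters(p1, p2):
    # a single box evaluating the conjunction; a real optimizer could
    # also reorder p1 and p2 by cost and selectivity inside the box
    return make_filter(lambda t: p1(t) and p2(t))

readings = [1, 5, 12, 30]
p1 = lambda t: t > 3
p2 = lambda t: t < 20

two_boxes = make_filter(p2)(make_filter(p1)(readings))
one_box = combine_filters(p1, p2)(readings)
assert two_boxes == one_box
```

The combined box produces the same output while paying the per-tuple activation overhead once instead of twice.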
To decide when to interchange two commutative operators,
we make use of the following performance model. Each Au-
rora box, b, has a cost, c(b), defined as the expected execution
time for b to process one input tuple. Additionally, each box
has a selectivity, s(b), which is the expected number of output
tuples per input tuple. Consider two boxes, b_i and b_j, with b_j
following b_i. In this case, for each input tuple for b_i we can
compute the amount of processing as c(b_i) + c(b_j) × s(b_i).
Reversing the operators gives a like calculation. Hence we