1.4 Related Work
In a recent paper pL93], Yan and Larson identified a
transformation that enables pushing the group-by past
joins. Their approach is based on deriving two queries,
one with and the other without a group-by clause, from
the given SQL query. The result of the given query is
obtained by joining the two queries so formed. Thus,
in their approach, given a query, there is a unique al-
ternate placement for the group-by operator. Observe
that the transformation reduces the space of choices
for join ordering since the ordering is considered only
within each query. Our transformations vastly general-
ize their proposal and also avoids the problem of the re-
duced search space for join ordering. For example, the
alternative execution suggested in Example 1.1 cannot
be obtained by transformations in pL93].
Prior work on group-by has addressed the problem
of pipelining group-by and aggregation with join [D87,
K182b] as well use of group-by to flatten nested SQL
queries [K82, D87, G87, M92]. But, these problems are
orthogonal to the problem of optimizing queries con-
taining group-by that we are addressing in this paper.
1.5 Outline
Section 2 discusses the preliminary concepts and as-
sumptions. In Section 3, we define the proposed trans-
formations. Section 4 is devoted to the optimization
algorithm. In Section 5, we discuss the experimental
results using our implementation of the optimizer. The
results in this section demonstrate that incorporating
the transformations in the traditional cost-based opti-
mizer is practical and results in significant improvement
in the quality of the plan produced.
2 Preliminaries and Notation
2.1 Query
We will follow the operational semantics associated
with SQL queries [DD93, ISO92]. We assume that the
query is a single block SQL query, as below.
Select All <columnlist> AGGl(bl)..AGGn(bn)
From
<tablelist>
Where condl And cond2 . . . And condn
Group By coll,..colj
The WHERE clause of the query is a conjunction of simple
predicates. SQL semantics require that <columnlist>
must be among ~011,.
. co1 j . In the above notation,
AGGl. .AGGn represent built-in SQL aggregate functions.
In this paper, we will not be discussing the cases where
there is a HAVIBG and/or an OBBBB BY clause in the
query. We will also assume that there are no nulls in
the database. These extensions are addressed in [CS94].
We refer to columns in {bl, ..bn} as the aggregating
columns of the query. The columns in (~011, ..colj}
are called grouping columns of the query. The func-
tions {AGGI, ..AGGn} are called the aggregating func-
tions of the query. For the purposes of this paper, we
will assume that every aggregate function has one of
the following forms: Sum(colname), Hax(colname) or
Min(colname). Thus, we have excluded Avg and Count
as well as cases where the aggregate functions apply on
columns with the qualifier Distinct. In Section 3.4,
we will discuss extensions of our techniques.
2.2 Extended Annotated Join Trees
An execution plan for a query specifies choice of access
methods for each relation and an ordering of joins in
the query. Traditionally, such an execution plan is rep-
resented syntactically as an annotated join tree [S*79]
where the root is a group-by operation and each leaf
node is a scan operation. An internal node represents
a join operation. The annotations of a join node in-
clude the choice of the join method, as well as the se-
lection conditions and the list of projection attributes.
We assume that the selection conditions are evaluated
and projections are applied as early as possible. The
optimization problem is to cho
from its execution space.
T
e a plan of least cost
For optimization efficiency,
the execution space is often restricted to be the class
of left-deep join trees. These are annotated join trees
where the right child of every internal node is a leaf.
The transformations that we propose introduce
group-by operators as internal nodes. Therefore, we
define extended annotated join trees which are anno-
tated join trees except that a group-by may also occur
as an internal node. Likewise, we can define extended
left-deep join trees. These are trees subject to the same
restrictions as the traditional left-deep join trees. For
example, in Figure 1, tree (a) denotes a left-deep join
tree, whereas trees (b) and (c) are extended left-deep
join trees since the group-by occurs as an internal node
in these trees. Finally, note that we mark the scan
nodes by the name of the relation.
2.3 Group-By as an Operator
We assume that a group-by operator in an extended
join tree is specified using the following annotations:
(a) Grouping Columns (b) Aggregating Columns. The
meaning of these annotations are analogous to the cor-
responding properties for the query (See Section 2.1).
In reality, we need somewhat more elaborate annota-
tions, including aggregating functions, but such details
are not germane to our discussion here. We consider the
question of determining the annotations of a group-by
node if we place it immediately above a join or a scan
node n.
Definition 2.1: Join columns of a node n are columns
of n that participate in join predicates that are evalu-
356