[Fig. 1. Volcano's main modules. From top to bottom, the figure
shows: Execute (driver); Exchange and Choose-Plan; the operator
modules Hash One-to-One Match, Sort One-to-One Match, Sort, Hash
One-to-Many Match, Sort One-to-Many Match, Index Maintenance,
Scans, Functional Join, and Filter; Files & Records, Devices, and
Indices (B+-trees); and Physical I/O, Buffer Manager, and Memory
Manager.]
placement, and reading and writing disk pages, while the
higher level software determines the policies depending
on data semantics, importance, and access patterns. It is
surprising that database buffer managers derive replace-
ment decisions from observed reference behavior in spite
of the fact that this behavior is generated by higher level
database software and thus known and foreseeable in
advance within the same system, albeit in different
subcomponents.
Files are composed of records, clusters, and extents.
Since file operations are invoked very frequently in any
database system, all design decisions in the file module
have been made to provide basic functionality with the
highest attainable performance. A cluster, consisting of
one or more pages, is the unit of I/O and of buffering, as
discussed above. The cluster size is set for each file in-
dividually. Thus, different files on the same device can
have different cluster sizes. Disk space for files is allo-
cated in physically contiguous extents, because extents
allow very fast scanning without seeks and large-chunk
read-ahead and write-behind.
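To make these notions concrete, the per-file clustering and
extent bookkeeping might be declared in C roughly as follows;
all names and fields in this sketch are hypothetical, not
Volcano's actual declarations.

    /* Hypothetical sketch: the cluster size is a per-file property,
       and disk space is allocated in physically contiguous extents. */
    typedef struct Extent {
        long first_page;        /* first physical page on the device */
        int  page_count;        /* contiguous pages in this extent   */
    } Extent;

    typedef struct File {
        int     cluster_size;   /* pages per cluster; the unit of
                                   I/O and buffering for this file   */
        int     extent_count;   /* extents allocated so far          */
        Extent *extents;        /* physically contiguous extents     */
    } File;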
Records are identified by a record identifier (RID) and
can be accessed directly using the RID. For fast access to
a large set of records, Volcano supports not only individ-
ual file and record operations but also scans that support
read-next and append operations. There are two interfaces
to file scans: one is part of the file system and is described
momentarily; the other is part of the query processing
level and is described later. The first one has the standard
procedures for file scans, namely open, next, close, and
rewind. The next procedure returns the main memory ad-
dress of the next record. This address is guaranteed
(pinned) until the next operation is invoked on the scan.
Thus, getting the next record within the same cluster does
not require calling the buffer manager and can be per-
formed very efficiently.
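As an illustration, this first scan interface could be declared
along the following lines in C; the names are hypothetical (they
reuse the File type from the sketch above) and error handling is
omitted.

    /* Hypothetical sketch of the file-system scan interface. */
    typedef struct FileScan FileScan;       /* opaque scan state */

    FileScan *scan_open(File *file);
    void     *scan_next(FileScan *scan);    /* address of next record,
                                               pinned until the next
                                               operation on this scan */
    void      scan_rewind(FileScan *scan);
    void      scan_close(FileScan *scan);

    extern void process(void *record);      /* hypothetical consumer */

    void process_all(File *file)
    {
        FileScan *scan = scan_open(file);
        void *rec;
        while ((rec = scan_next(scan)) != NULL)
            process(rec);      /* rec stays valid until the next call */
        scan_close(scan);
    }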
For fast creation of files, scans support an append op-
eration. It allocates a new record slot and returns the new
slot’s main memory address. It is the caller’s responsibil-
ity to fill the provided record space with useful informa-
tion, i.e., the append routine is entirely oblivious to the
data and their representation.
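A corresponding append call, again with hypothetical names,
would hand the caller a pinned slot to fill:

    /* Hypothetical sketch: append allocates a record slot and
       returns its address; the caller fills in the bytes, so the
       routine never interprets the data. */
    #include <string.h>

    void *scan_append(FileScan *scan, int length);

    void load_record(FileScan *scan, const void *src, int length)
    {
        void *slot = scan_append(scan, length);
        memcpy(slot, src, length);  /* caller supplies the contents */
    }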
Scans also support optional predicates. The predicate
function is called by the next procedure with the predicate argument
and a record address. Selective scans are the first example
of support functions mentioned briefly in the introduction.
Instead of determining a qualification itself, the scan
mechanism relies on a predicate function imported from
a higher level.
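In C terms, such an imported qualification is naturally a
function pointer paired with a typeless argument; the following
hypothetical declarations, building on the earlier scan sketch,
illustrate the convention.

    /* Hypothetical sketch: the scan imports its qualification as a
       support function plus a typeless argument. */
    typedef int (*Predicate)(void *arg, void *record);

    FileScan *scan_open_pred(File *file, Predicate pred, void *arg);
    /* scan_next then returns only records for which
       pred(arg, record) is nonzero. */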
Support functions are passed to an operation as a func-
tion entry point and a typeless pointer that serves as a
predicate argument. Arguments to support functions can
be used in two ways, namely in compiled and interpreted
query execution. In compiled scans, i.e., when the pred-
icate evaluation function is available in machine code,
the argument can be used to pass a constant or a pointer
to several constants to the predicate function. For exam-
ple, if the predicate consists of comparing a record field
with a string, the comparison function is passed as pred-
icate function while the search string is passed as predi-
cate argument. In interpreted scans, i.e., when a general
interpreter is used to evaluate all predicates in a query,
the argument can be used to pass appropriate code to the
interpreter.
The interpreter’s entry point is given as predicate func-
tion. Thus, both interpreted and compiled scans are sup-
ported with a single simple and efficient mechanism. Vol-
cano’s use of support functions and their arguments is
another example of a mechanism that leaves a policy de-
cision, in this case whether to use compiled or interpreted
scans, open to be decided by higher level software.
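For example, a compiled predicate in the spirit of the string
comparison above might look as follows; the record layout, the
field name, and the search string are all hypothetical.

    /* Hypothetical compiled predicate: the comparison function is
       passed as the predicate, the search string as its argument. */
    #include <string.h>

    struct Emp { int id; char name[20]; };

    static int name_equals(void *arg, void *record)
    {
        struct Emp *emp = (struct Emp *) record;
        return strcmp(emp->name, (const char *) arg) == 0;
    }

    static FileScan *open_smith_scan(File *file)
    {
        return scan_open_pred(file, name_equals, "Smith");
    }

An interpreted scan would instead pass the interpreter's entry
point as the predicate function and the predicate's code as the
argument.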
Indices are currently implemented only in the form of
B+-trees with an interface similar to files. A leaf entry
consists of a key and information. The information part
typically is a RID, but it could include more or different
information. The key and the information can be of any
type; a comparison function must be provided to compare
keys. The comparison function uses an argument equiv-
alent to the one described for scan predicates. Permitting
any information in the leaves gives more choices in phys-
ical database design. It is another example of Volcano
providing a mechanism to allow a multitude of designs
and usage policies. B+-trees support scans similar to files,
including predicates and append operations for fast loading.
In addition, B+-tree scans allow seeking to a particular key
and setting lower and upper bounds.
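A key comparison support function and a seek call could be
sketched in C as follows; all names are hypothetical, and the
comparison argument plays the same role as the scan predicate
argument above.

    /* Hypothetical sketch: keys are typeless, so the B+-tree imports
       a comparison function (with its own argument) from above. */
    typedef int (*KeyCompare)(void *arg, const void *key1,
                              const void *key2);

    static int int_compare(void *arg, const void *a, const void *b)
    {
        int x = *(const int *) a, y = *(const int *) b;
        (void) arg;                     /* unused for integer keys */
        return (x > y) - (x < y);       /* <0, 0, >0 */
    }

    typedef struct Btree     Btree;     /* opaque index handle */
    typedef struct BtreeScan BtreeScan;

    /* Seek to a key; the scan can then be bounded and read forward. */
    BtreeScan *btree_seek(Btree *tree, const void *key,
                          KeyCompare cmp, void *cmp_arg);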
For intermediate results in query processing (later called
streams), Volcano uses special devices called virtual de-
vices. The difference between virtual and disk devices is
that data pages of virtual devices only exist in the buffer.
As soon as such data pages are unpinned, they disappear
and their contents are lost. Thus, Volcano uses the same
mechanisms and function calls for permanent and inter-
mediate data sets, easing implementation of new opera-
tors significantly.
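Because the calls are identical, an operator producing an
intermediate result differs from one writing a permanent file
only in the device it uses; in a hypothetical sketch, reusing
the earlier declarations:

    /* Hypothetical sketch: the same file and scan calls work on a
       virtual device, whose pages exist only in the buffer. */
    typedef struct Device Device;       /* opaque device handle */
    Device *device_open_virtual(void);
    File   *file_create(Device *dev, int cluster_size);

    void write_intermediate(void)
    {
        Device   *dev = device_open_virtual();
        File     *tmp = file_create(dev, 1);    /* 1-page clusters */
        FileScan *out = scan_open(tmp);
        /* ... fill via scan_append as for any permanent file ... */
        scan_close(out);
    }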
In summary, much of Volcano’s file system is conven-
tional in its goals but implemented in a flexible, efficient,
and compact way. The file system supports basic abstrac-
tions and operations, namely devices, files, records,
B+-trees, and scans. It provides mechanisms to access
these objects, leaving many policy decisions to higher
level software. High performance was a very important
goal in the design and implementation of these mecha-
nisms since performance studies and parallelization only
make sense if the underlying mechanisms are efficient.
Furthermore, research into implementation and perfor-
mance trade-offs for extensible database systems and new
data models is only relevant if an efficient evaluation plat-
form is used.