lists are small, highly-compressible data structures that can be operated on directly with very little overhead. For example, 32 (or 64, depending on processor word size) positions can be intersected at once when ANDing together two position lists represented as bit-strings. Note, however, that one problem with this late materialization approach is that it requires re-scanning the base columns to form tuples, which can be slow (though they are likely to still be in memory upon re-access if the query is properly pipelined).
The main contribution of this paper is not to introduce new materialization strategies (as described in the related work, many of these strategies have been used in other column-stores). Rather, it is to systematically explore the trade-offs between different strategies and to provide a foundation for choosing a strategy for a particular query. We focus on standard warehouse-style queries: read-only workloads with selections, aggregations, and joins. We extend the C-Store column-oriented DBMS [14] with a variety of materialization strategies, and experimentally evaluate the effects of varying selectivities, compression techniques, and query plans on these strategies. Further, we provide a model that can be used (for example) in a query optimizer to select a materialization strategy. Our results show that, on some workloads, late materialization can be an order of magnitude faster than early materialization, while on other workloads, early materialization outperforms late materialization.
The remainder of this paper is organized as follows. Section 2 gives a brief overview of the C-Store query executor. We illustrate the trade-offs between materialization strategies in Section 3 and then present both pseudocode and an analytical model for example query plans using each strategy in Section 4. We validate our models experimentally (using a version of C-Store we extended) in Section 5. Finally, we describe related work in Section 6 and conclude in Section 7.
2 The C-Store Query Executor
We chose to use C-Store as the column-oriented DBMS to extend and experiment with since we were already familiar with the source code. Since this is the system in which the various materialization strategies were implemented for this study, we now provide a brief overview of the relevant details of the C-Store query executor, which is more fully described in previous work [4, 14] and available in an open source release [2]. The components of the query executor most relevant to this paper are the on-disk layout of data, the access methods provided for reading data from disk, the data structures provided for representing data in the DBMS, and the operators for manipulating data.
Each column is stored in a separate file on disk as a series of 64KB blocks and can optionally be encoded using a variety of compression techniques. In this paper we experiment with column-specific compression techniques (run-length encoding and bit-vector encoding) and with uncompressed columns. In a run-length encoded file, each block contains a series of RLE triples (V, S, L), where V is the value, S is the start position of the run, and L is the length of the run.
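As an illustration of this layout (the struct and function names are ours, not C-Store's), the following sketch expands a sequence of RLE triples back into a flat column, assuming 0-based positions:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// An RLE triple (V, S, L) as described in the text: value V, start
// position S, and run length L.
struct RLETriple { int value; std::size_t start; std::size_t length; };

// Expand triples into a flat column of n values. Assumes the runs
// together cover positions [0, n) without overlap.
std::vector<int> decodeRLE(const std::vector<RLETriple>& triples, std::size_t n) {
    std::vector<int> column(n);
    for (const auto& t : triples)
        for (std::size_t i = 0; i < t.length; ++i)
            column[t.start + i] = t.value;
    return column;
}
```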
A bit-vector encoded file representing a column of size n with k distinct values consists of k bit-strings of length n, one per distinct value, stored sequentially. The bit-string for value v has a 1 in the i-th position if the column it represents has the value v in the i-th position.
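The following sketch (our own, with illustrative names) builds this representation for a small column, producing one n-bit string per distinct value:

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <vector>

// Bit-vector encode a column of n values with k distinct values:
// one bit-string of length n per distinct value, with a 1 in position i
// iff the column holds that value in position i.
std::map<int, std::vector<bool>> bitVectorEncode(const std::vector<int>& column) {
    std::map<int, std::vector<bool>> bitStrings;
    for (std::size_t i = 0; i < column.size(); ++i) {
        auto& bits = bitStrings[column[i]];
        bits.resize(column.size(), false);  // allocate the n-bit string lazily
        bits[i] = true;
    }
    return bitStrings;
}
```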
C-Store provides an access method (or data source) for each encoding type. All C-Store data sources support two basic operations: reading positions from a column and reading (position, value) pairs from a column. Additionally, all C-Store data sources accept predicates to restrict the set of results returned. In order to minimize CPU overhead, C-Store data sources and operators are block-oriented. Data sources return data from the underlying files as blocks of encoded data, wrapped inside a C++ object that provides iterator-style (hasNext() and getNext() methods) and vector-style [8] (asArray()) access to the data.
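A minimal sketch of that block interface might look as follows. The method names hasNext(), getNext(), and asArray() come from the text; the implementation is our own, and for simplicity the block holds decoded int values rather than encoded data:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// A block of column values offering both iterator-style and
// vector-style access, as described in the text.
class Block {
    std::vector<int> values_;
    std::size_t pos_ = 0;
public:
    explicit Block(std::vector<int> v) : values_(std::move(v)) {}
    bool hasNext() const { return pos_ < values_.size(); }
    int getNext() { return values_[pos_++]; }                    // iterator-style
    const std::vector<int>& asArray() const { return values_; } // vector-style
};
```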
In Section 4.1 we give pseudocode for the C-Store operators relevant to this paper: DataSource (Select), AND, and Merge. These operators are used to construct the query plans we experiment with. We also describe the Join operator in Section 5.3. The DataSource operator reads in a column of data and produces the column values that pass a predicate. AND accepts input position lists and produces an output position list representing their intersection. Finally, the n-ary Merge operator combines n lists of (position, value) pairs into a single output list of n-attribute tuples.
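As a rough sketch of what Merge does (a two-input version with our own names, assuming each input list is sorted by position; not C-Store's actual operator), matching positions are zipped into value tuples:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

using PosVal = std::pair<std::size_t, int>;

// Two-input sketch of Merge: combine two position-sorted
// (position, value) lists into (value1, value2) pairs for the
// positions present in both inputs.
std::vector<std::pair<int, int>> merge2(const std::vector<PosVal>& a,
                                        const std::vector<PosVal>& b) {
    std::vector<std::pair<int, int>> out;
    std::size_t i = 0, j = 0;
    while (i < a.size() && j < b.size()) {
        if (a[i].first == b[j].first) {              // matching positions
            out.emplace_back(a[i].second, b[j].second);
            ++i; ++j;
        } else if (a[i].first < b[j].first) {
            ++i;                                     // position absent from b
        } else {
            ++j;                                     // position absent from a
        }
    }
    return out;
}
```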
3 Materialization Strategy Trade-offs
In this section we present some of the trade-offs that are made between materialization strategies. A materialization strategy needs to be in place whenever more than one attribute from any given relation is accessed (which is the case for most queries). Since a column-oriented DBMS stores each attribute independently, it must have some mechanism for stitching together multiple attributes from the same logical tuple into a physical tuple. Every proposed column-oriented architecture accomplishes this by attaching either physical or virtual tuple identifiers, or positions, to column values. To reconstruct a tuple from multiple columns of a relation, the DBMS simply needs to find matching positions. Modern column-oriented systems [7, 8, 14] store columns in position order; i.e., to reconstruct the i-th tuple, one uses the i-th value from each column. This accelerates the tuple reconstruction process.
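In a position-ordered store, reconstruction thus reduces to positional indexing. A minimal sketch (the column types and function name are illustrative, not C-Store's):

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <tuple>
#include <vector>

// With all columns stored in the same position order, the i-th tuple
// is simply the i-th entry of each column: O(1) per attribute, with
// no matching or searching required.
std::tuple<int, std::string> reconstructTuple(const std::vector<int>& colA,
                                              const std::vector<std::string>& colB,
                                              std::size_t i) {
    return {colA[i], colB[i]};
}
```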
As described in the introduction, tuple reconstruction can occur at different points in a query plan. Early materialization (EM) constructs tuples as soon as (or sometimes before) tuple values are needed in the query plan. Late