Leveraging Compression in In-Memory Databases
Jens Krueger, Johannes Wust, Martin Linkhorst, Hasso Plattner
Hasso Plattner Institute for Software Engineering
University of Potsdam
Potsdam, Germany
Email: {jens.krueger@hpi.uni-potsdam.de, johannes.wust@hpi.uni-potsdam.de,
martin.linkhorst@hpi.uni-potsdam.de, hasso.plattner@hpi.uni-potsdam.de}
Abstract—Recently, there has been a trend towards column-oriented databases, which in most cases apply lightweight compression techniques to improve read access. At the same time, in-memory databases are becoming a reality due to the availability of large amounts of main memory. In-memory databases achieve optimal performance by building on cache-aware algorithms based on cost models for memory hierarchies. In this paper, we use a generic cost model for main memory access and show how lightweight compression schemes improve cache behavior, which directly correlates with the performance of in-memory databases.
Keywords-in-memory databases; database compression; dictionary compression.
I. INTRODUCTION
Nowadays, most database management systems are hard-disk based and, since I/O operations are expensive, limited by both the throughput and the latency of those hard disks. Increasing capacities of main memory, reaching up to several terabytes today, offer the opportunity to store an entire database completely in main memory. Besides the much higher throughput of main memory compared to disk access, significant performance improvements are also achieved by the much faster random access capability of main memory and, at the same time, its much lower latency. A database management system that stores all of its data completely in main memory – using hard disks only for persistency and recovery – is called an in-memory database (IMDB).
In earlier work, we have shown that in-memory databases perform especially well in enterprise application scenarios [12], [14]. As shown in [12], enterprise workloads consist mostly of reads rather than data modification operations; this has led to the conclusion to leverage read-optimized databases with a differential buffer for these workloads [11]. Furthermore, enterprise data is typically sparse, with a well-known value domain and a relatively low number of distinct values. Therefore, enterprise data qualifies particularly well for data compression, as these techniques exploit redundancy within the data and knowledge about the data domain for optimal results.
We apply compression for two reasons:
• Reducing the overall size of the database to fit the entire
database into main memory, and
• Increasing database performance by reducing the amount
of data transferred from and to main memory.
In this paper, we focus on the second aspect. We analyze
different lightweight compression schemes regarding cache
behavior, based on a cost model that estimates expected cache
misses.
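To make the idea of lightweight compression concrete, the following C++ sketch shows a minimal dictionary-encoded column: each distinct value is stored once in a dictionary, and the column itself keeps only small integer codes. The structure and names (DictionaryColumn, attribute_vector) are illustrative assumptions and not the implementation evaluated in this paper.

    #include <cstddef>
    #include <cstdint>
    #include <string>
    #include <unordered_map>
    #include <vector>

    // Minimal sketch of a dictionary-compressed column (illustrative only).
    // Distinct values live in the dictionary; the attribute vector stores
    // one small integer code per row, so a scan touches far fewer bytes
    // than a scan over the uncompressed values.
    struct DictionaryColumn {
        std::vector<std::string> dictionary;              // distinct values
        std::unordered_map<std::string, uint32_t> codes;  // value -> code
        std::vector<uint32_t> attribute_vector;           // one code per row

        void append(const std::string& value) {
            auto it = codes.find(value);
            if (it == codes.end()) {
                it = codes.emplace(value,
                         static_cast<uint32_t>(dictionary.size())).first;
                dictionary.push_back(value);
            }
            attribute_vector.push_back(it->second);
        }

        // Materialize the value of a single row.
        const std::string& value_at(std::size_t row) const {
            return dictionary[attribute_vector[row]];
        }
    };

For enterprise data with few distinct values, the attribute vector can additionally be bit-packed to ceil(log2(|dictionary|)) bits per code, which further reduces the number of cache lines a scan has to load.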
A. The Memory Bottleneck
During the last two decades, processor speed has increased faster than memory speed [6]. As a result, processors nowadays have to wait more cycles for a response from memory than they did 20 years ago. Since processors need to access data from memory for any computation, performance improvements are limited by memory latency. Seen from a processor's perspective, main memory access becomes more and more expensive compared to earlier days – the Memory Gap widens. In principle, it would be possible to manufacture memory that is as fast as a processor, but there is a direct trade-off between memory capacity, latency, and cost: the more capacity memory has, the longer its latency; likewise, the faster memory is, the more expensive it gets. Since manufacturers concentrated on increasing the capacity of main memory, there was little focus on improving latency.
The solution found in modern processors is a cache hierarchy that hides the latency of main memory. Between the processor's registers and main memory, a faster but smaller memory layer is placed that holds copies of a subset of the data found in main memory. When a processor finds the needed data in the cache, it copies it from there, waiting fewer processor cycles. The whole cache is usually much smaller and much faster than main memory. Since the Memory Gap widens with every new processor generation, one layer of cache is not enough to fulfill both capacity and latency demands. Therefore, modern CPUs have up to three layers of cache, each with more capacity but higher latency than the one closer to the processor [8].
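The cost model used in this paper builds on exactly these hierarchy levels. As a rough illustration, and not the exact model of the paper, the cost of a memory-bound operation can be approximated by the number of misses on each cache level weighted by the penalty of the level that services the miss; the structure and latency parameters in the following C++ sketch are placeholders.

    #include <cstdint>

    // Rough sketch of a memory-hierarchy cost model (assumed structure,
    // not the exact model used in this paper): the cost of an operation
    // is approximated by the cache misses it causes on each level,
    // weighted by the penalty of the next-lower level that has to
    // service the miss. Penalty values are placeholders.
    struct MemoryHierarchy {
        double l1_miss_penalty;  // cycles, miss serviced by L2
        double l2_miss_penalty;  // cycles, miss serviced by L3
        double l3_miss_penalty;  // cycles, miss serviced by main memory
    };

    double estimated_cycles(std::uint64_t l1_misses,
                            std::uint64_t l2_misses,
                            std::uint64_t l3_misses,
                            const MemoryHierarchy& h) {
        return l1_misses * h.l1_miss_penalty +
               l2_misses * h.l2_miss_penalty +
               l3_misses * h.l3_miss_penalty;
    }

Under such a model, a compression scheme pays off whenever the reduction in cache misses, caused by scanning fewer cache lines, outweighs the additional decompression work on the CPU.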
Since programs usually do not access the whole address space of main memory randomly, caches exploit the concept of locality. When a processor fetches a word from memory, it is very likely that it will soon need another word close by – so-called data locality. Leveraging this fact, processors do not only copy the requested data to their registers but also copy the subsequent bytes to the cache. The amount of bytes that is copied to the cache at once is called a cache line or cache block and usually is about four to 16 processor