Vectorized Bloom Filters for Advanced SIMD Processors
Orestis Polychroniou
Columbia University
orestis@cs.columbia.edu
Kenneth A. Ross∗
Columbia University
kar@cs.columbia.edu
ABSTRACT
Analytics are at the core of many business intelligence tasks.
Efficient query execution is facilitated by advanced hard-
ware features, such as multi-core parallelism, shared-nothing
low-latency caches, and SIMD vector instructions. Only recently have the SIMD capabilities of mainstream hardware been augmented with wider vectors and non-contiguous loads
termed gathers. While analytical DBMSs minimize the use
of indexes in favor of scans based on sequential memory
accesses, some data structures remain crucial. The Bloom
filter, one such example, is the most efficient structure for
filtering tuples based on their existence in a set, and its performance is critical when joining tables with vastly different
cardinalities. We introduce a vectorized implementation for
probing Bloom filters based on gathers that eliminates con-
ditional control flow and is independent of the SIMD length.
Our techniques are generic and can be reused for accelerat-
ing other database operations. Our evaluation indicates a
significant performance improvement over scalar code that
can exceed 3X when the Bloom filter is cache-resident.
1. INTRODUCTION
Advances in computer hardware have had a tremendous impact on the way software is written. The most profound change is the inherent parallelism of multi-core CPUs, which forces efficient applications to be rewritten as parallel programs. In fact, thread parallelism is only the tip of the
iceberg. Other hardware features that potentially influence
performance are multi-level caches, both private and shared
across the cores, cache consistency protocols augmented with
hardware transactional memory support, and wide CPU reg-
isters supported by comprehensive SIMD instruction sets.
Due to the large main memory capacity of recent hard-
ware, many workloads can be kept in RAM. Thus, DBMSs
focus on in-memory execution of queries [18] in order to per-
form real-time analytics. Indexes have become less impor-
tant as most queries access a large percentage of the data.
∗ Supported by NSF grant 0915956 and an Oracle Corp gift.
DaMoN’14, June 22 - 27 2014, Snowbird, UT, USA.
Copyright 2014 ACM 978-1-4503-2971-2/14/06 ...$15.00.
http://dx.doi.org/10.1145/2619228.2619234.
Common approaches on mainstream architectures include
the use of SIMD instructions for scans [12, 19] and the use
of partitioning for creating cache-resident sub-problems to
avoid random memory accesses [13]. Data compression is
also important [18, 19], as it allows us to process more tu-
ples with the same number of instructions using the same
registers. Besides scans, other database operations, such as
sorting [5, 16], are also made faster using SIMD. However,
these operations have the common property of sequential in-
put access. The question of whether SIMD is helpful remains
open for operations that require random access patterns.
To bridge the gap, mainstream hardware now offers non-
contiguous SIMD load instructions, termed gathers, that al-
low random memory accesses through entirely SIMD code.
A problem that stands in the middle ground between ran-
dom accesses and sequential scans is Bloom filter probing.
A Bloom filter is a probabilistic data structure for testing
whether an item belongs to a set. Bloom filters are crucial
to analytical databases for performing joins between tables
that have vastly different cardinalities. The keys of the small
table are used to build the Bloom filter and the keys of the
large table are probed through the filter to discard (most
of) those that do not match. In distributed query execution,
Bloom filters are used to filter tuples before sending them
over the network. The process of filtering across tables, ap-
proximately or not, is termed semi-join. The items used
to build the filter are relatively few compared to the items
probed through the filter to test set membership. Thus,
Bloom filter performance is typically dominated by probing.
Bloom filters are built using a pre-determined number k
of hash functions. To test whether an item belongs to the
set, we need to test k bits in different locations of the filter.
The locations are determined by the k hash functions and
do not need to be distinct. If any bit (out of k) is not set,
the key is certainly not part of the qualifying set. A Bloom
filter should be small enough to be cache-resident if possible. Increasing the number of hash functions is not always helpful. If
the Bloom filter size is m bits and is built using n distinct
items, we can estimate the false positive error probability
using the formula p = (1 − e^(−kn/m))^k, which has a single (global) minimum at a small integer k. On the other hand,
using more hash functions can be slower than using more
bits. Overall, the DBMS should pick the fastest configu-
ration. Each configuration pairs a filter size with a number of hash functions, and all cases can be profiled off-line for the underlying hardware. Then, the optimizer can use the
set cardinality and the desired error rate to decide the most
suitable one to use for the target point of the query plan.