Algorithm 2 Selection Scan (Scalar - Branchless)
j ← 0                                            ▷ output index
for i ← 0 to |T_keys_in| − 1 do
    k ← T_keys_in[i]                             ▷ copy all columns
    T_payloads_out[j] ← T_payloads_in[i]
    T_keys_out[j] ← k
    m ← (k ≥ k_lower ? 1 : 0) & (k ≤ k_upper ? 1 : 0)
    j ← j + m      ▷ if-then-else expressions use conditional flags to update the index without branching
end for
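For concreteness, the branchless scan translates to the following C sketch, assuming 32-bit keys and payloads; the function and variable names are illustrative.

#include <stdint.h>
#include <stddef.h>

/* Branchless selection scan: a minimal C sketch of Algorithm 2.
 * Every tuple is copied to output slot j; the output index advances
 * only when the predicate holds, so there are no data-dependent
 * branches to mispredict. */
size_t select_scan_branchless(const int32_t *keys_in,
                              const int32_t *payloads_in,
                              int32_t *keys_out,
                              int32_t *payloads_out,
                              size_t n,
                              int32_t k_lower, int32_t k_upper)
{
    size_t j = 0;                        /* output index */
    for (size_t i = 0; i < n; i++) {
        int32_t k = keys_in[i];
        keys_out[j] = k;                 /* copy all columns */
        payloads_out[j] = payloads_in[i];
        /* the comparisons compile to flag-setting instructions */
        int m = (k >= k_lower) & (k <= k_upper);
        j += (size_t)m;                  /* advance only on a match */
    }
    return j;                            /* number of qualifying tuples */
}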
Vectorized selection scans use selective stores to store the
lanes that satisfy the selection predicates. We use SIMD in-
structions to evaluate the predicates, producing a bitmask of
the qualifying lanes. Partially vectorized selection extracts
one bit at a time from the bitmask and accesses the corre-
sponding tuple. Instead, we use the bitmask to selectively
store all qualifying tuples to the output vector at once.
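On hardware with AVX-512, the selective store maps directly to a compress-store instruction. A minimal sketch, assuming 32-bit keys, an input length that is a multiple of the 16-lane vector width, and illustrative names; a scalar tail loop would handle any remainder.

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Fully vectorized selection scan: a sketch using AVX-512, where
 * VPCOMPRESSD implements the selective store directly. */
size_t select_scan_avx512(const int32_t *keys_in, int32_t *keys_out,
                          size_t n, int32_t k_lower, int32_t k_upper)
{
    size_t j = 0;
    __m512i lo = _mm512_set1_epi32(k_lower);
    __m512i hi = _mm512_set1_epi32(k_upper);
    for (size_t i = 0; i < n; i += 16) {
        __m512i k = _mm512_loadu_si512(keys_in + i);
        /* evaluate both predicates; the result is a 16-bit mask */
        __mmask16 m = _mm512_cmpge_epi32_mask(k, lo)
                    & _mm512_cmple_epi32_mask(k, hi);
        /* selectively store only the qualifying lanes, contiguously */
        _mm512_mask_compressstoreu_epi32(keys_out + j, m, k);
        j += (size_t)_mm_popcnt_u32(m);  /* advance by match count */
    }
    return j;
}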
When the selection has a very low selectivity, it is de-
sirable to avoid accessing the payload columns, because
loading payloads that are then discarded wastes memory
bandwidth. Furthermore, when the branch is speculatively
executed, we issue needless loads of payloads. To avoid
reducing the bandwidth, we use a small cache-resident buffer
that stores the indexes of qualifying tuples rather than the
actual values. When the buffer is full, we reload the indexes
from the buffer, gather the actual values from the columns,
and flush them to the output. This variant is shown in
Algorithm 3. Appendix A describes the notation used in the
algorithmic descriptions.
When we materialize data in RAM with no intent to reuse
them soon, we use streaming stores. Mainstream CPUs pro-
vide non-temporal stores that bypass the higher cache levels
and increase the RAM bandwidth for storing data. Xeon
Phi does not support scalar streaming stores, but provides
an instruction to overwrite a cache line with data from a
vector without first loading it. This technique requires the
vector length to be equal to the cache line and eliminates the
need for the write-combining buffers used in mainstream
CPUs. All operators that write their output to memory
sequentially use buffering, which we omit in the algorithmic
descriptions.
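A minimal sketch of flushing a buffer with streaming stores on a mainstream CPU, assuming AVX-512, a 64-byte-aligned destination, and a length that is a multiple of 16; names are illustrative.

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Streaming (non-temporal) stores: the destination is written a full
 * cache line at a time, bypassing the higher cache levels. */
void flush_buffer_streaming(const int32_t *buf, int32_t *out, size_t n)
{
    /* assumes out is 64-byte aligned and n is a multiple of 16 */
    for (size_t i = 0; i < n; i += 16) {
        __m512i v = _mm512_loadu_si512(buf + i);
        _mm512_stream_si512((void *)(out + i), v);  /* bypass caches */
    }
    _mm_sfence();  /* order non-temporal stores before later accesses */
}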
Algorithm 3 Selection Scan (Vector)
i, j, l ← 0                                  ▷ input, output, and buffer indexes
r ← {0, 1, 2, 3, ..., W − 1}                 ▷ input indexes in vector
for i ← 0 to |T_keys_in| − 1 step W do       ▷ W: # of vector lanes
    k ← T_keys_in[i]                         ▷ load vectors of key columns
    m ← (k ≥ k_lower) & (k ≤ k_upper)        ▷ predicates to mask
    if m ≠ false then                        ▷ optional branch
        B[l] ←_m r                           ▷ selectively store indexes
        l ← l + |m|                          ▷ update buffer index
        if l > |B| − W then                  ▷ flush buffer
            for b ← 0 to |B| − W step W do
                p ← B[b]                     ▷ load input indexes
                k ← T_keys_in[p]             ▷ dereference values
                v ← T_payloads_in[p]
                T_keys_out[b + j] ← k        ▷ flush to output with ...
                T_payloads_out[b + j] ← v    ▷ ... streaming stores
            end for
            p ← B[|B| − W]                   ▷ move overflow ...
            B[0] ← p                         ▷ ... indexes to start
            j ← j + |B| − W                  ▷ update output index
            l ← l − |B| + W                  ▷ update buffer index
        end if
    end if
    r ← r + W                                ▷ update index vector
end for                                      ▷ flush last items after the loop
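The dereferencing step in the flush phase maps naturally to a gather instruction. A sketch, assuming AVX-512 and 32-bit values; names are illustrative.

#include <immintrin.h>
#include <stdint.h>

/* Dereferencing buffered indexes with a gather, as in the flush phase
 * of Algorithm 3: a sketch under the above assumptions. */
static inline __m512i gather_by_index(const int32_t *column,
                                      const int32_t *buffer, int b)
{
    __m512i p = _mm512_loadu_si512(buffer + b);  /* load input indexes */
    /* fetch column[p[lane]] for each of the 16 lanes; scale = 4 bytes */
    return _mm512_i32gather_epi32(p, column, 4);
}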
5. HASH TABLES
Hash tables are used in database systems to execute joins
and aggregations, since they allow constant-time key lookups.
In hash join, one relation is used to build the hash table and
the other relation probes the hash table to find matches. In
group-by aggregation, they are used either to map tuples to
unique group ids or to insert and update partial aggregates.
Using SIMD instructions in hash tables has been proposed
as a way to build bucketized hash tables. Rather than com-
paring against a single key, we place multiple keys per bucket
and compare them to the probing key using SIMD vector
comparisons. We term this approach of comparing a single
input (probing) key with multiple hash table keys horizontal
vectorization. Some hash table variants, such as bucketized
cuckoo hashing [30], can support much higher load factors.
Loading a single 32-bit word is as fast as loading an entire
vector; thus, the cost of bucketized probing diminishes to
extracting the correct payload, which requires log W steps.
Horizontal vectorization is wasteful if we expect to search
fewer than W buckets on average per probing key. For
example, a 50% full hash table with one match per key needs
to access ≈ 1.5 buckets on average to find the match using
linear probing. In such a case, comparing one input key
against multiple table keys yields little improvement and
takes no advantage of the increasing SIMD register size.
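A sketch of a horizontally vectorized bucket probe, assuming AVX-512 and buckets of sixteen 32-bit keys; names are illustrative.

#include <immintrin.h>
#include <stdint.h>

/* Horizontal vectorization: one broadcast compare checks a whole
 * bucket; extracting the matching slot takes a few more steps. */
static inline int bucket_find(const int32_t *bucket_keys, int32_t probe_key)
{
    __m512i keys = _mm512_loadu_si512(bucket_keys);
    __m512i k    = _mm512_set1_epi32(probe_key);  /* broadcast probe key */
    __mmask16 m  = _mm512_cmpeq_epi32_mask(keys, k);
    /* return the slot of the first match, or -1 if the key is absent */
    return m ? (int)_tzcnt_u32(m) : -1;
}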
In this paper, we propose a generic form of hash table vec-
torization termed vertical vectorization that can be applied
to any hash table variant without altering the hash table
layout. The fundamental principle is to process a different
input key per vector lane. All vector lanes process different
keys from the input and access different hash table locations.
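One probe step of this scheme can be sketched as follows, assuming AVX-512, 32-bit keys, and illustrative names; each lane gathers from its own bucket and compares its own key.

#include <immintrin.h>
#include <stdint.h>

/* Vertical vectorization: each lane holds its own input key and its
 * own bucket index; a gather reads 16 different hash table locations
 * at once, and the comparison is done per lane. */
static inline __mmask16 probe_step(__m512i probe_keys, __m512i bucket_idx,
                                   const int32_t *table_keys)
{
    /* fetch table_keys[bucket_idx[lane]] for every lane */
    __m512i tk = _mm512_i32gather_epi32(bucket_idx, table_keys, 4);
    return _mm512_cmpeq_epi32_mask(tk, probe_keys);  /* matches per lane */
}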
The hash table variants we discuss are linear probing (Sec-
tion 5.1), double hashing (Section 5.2), and cuckoo hashing
(Section 5.3). For the hash function, we use multiplicative
hashing, which requires two multiplications, or, for 2^n buck-
ets, one multiplication and a shift. Multiplication costs very
few cycles in mainstream CPUs and is supported in SIMD.
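For 2^n buckets, the one-multiplication form can be sketched in C as follows; the multiplicative factor is an arbitrary odd constant chosen here for illustration.

#include <stdint.h>

/* Multiplicative hashing for a table of 2^n buckets: one
 * multiplication and a shift. Assumes 1 <= log_buckets <= 32. */
static inline uint32_t hash_mult(uint32_t key, int log_buckets)
{
    const uint32_t f = 0x9E3779B1u;          /* illustrative odd factor */
    return (key * f) >> (32 - log_buckets);  /* keep the top n bits */
}

The vectorized version applies the same multiply and shift per lane.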
5.1 Linear Probing
Linear probing is an open addressing scheme that traverses
the table linearly until an empty bucket is found, either to
insert an entry or to terminate the search. The hash table
stores keys and payloads but no pointers. The scalar code
for probing the hash table is shown in Algorithm 4.
Algorithm 4 Linear Probing - Probe (Scalar)
j ← 0                                        ▷ output index
for i ← 0 to |S_keys| − 1 do                 ▷ outer (probing) relation
    k ← S_keys[i]
    v ← S_payloads[i]
    h ← (k · f) ×↑ |T|                       ▷ “×↑”: multiply & keep upper half
    while T_keys[h] ≠ k_empty do             ▷ until empty bucket
        if k = T_keys[h] then
            RS_R_payloads[j] ← T_payloads[h] ▷ inner payloads
            RS_S_payloads[j] ← v             ▷ outer payloads
            RS_keys[j] ← k                   ▷ join keys
            j ← j + 1
        end if
        h ← h + 1                            ▷ next bucket
        if h = |T| then                      ▷ reset if last bucket
            h ← 0
        end if
    end while
end for
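A C sketch of this probe loop, assuming 32-bit keys and payloads, a table that fits in a 32-bit address space, and an illustrative sentinel value for empty buckets; the hash factor is likewise illustrative.

#include <stdint.h>
#include <stddef.h>

#define K_EMPTY INT32_MIN  /* illustrative sentinel for empty buckets */

/* Scalar linear probing probe (Algorithm 4). The hash uses the
 * two-multiplication form: (k * f) wraps mod 2^32, then multiplying
 * by t_size and keeping the upper half maps it into [0, t_size). */
size_t lp_probe(const int32_t *s_keys, const int32_t *s_pay, size_t s_size,
                const int32_t *t_keys, const int32_t *t_pay, size_t t_size,
                int32_t *rs_keys, int32_t *rs_r_pay, int32_t *rs_s_pay)
{
    const uint32_t f = 0x9E3779B1u;          /* illustrative hash factor */
    size_t j = 0;                            /* output index */
    for (size_t i = 0; i < s_size; i++) {
        int32_t k = s_keys[i];
        int32_t v = s_pay[i];
        /* h = (k * f) *^ |T|: multiply & keep upper half */
        size_t h = (size_t)(((uint64_t)((uint32_t)k * f) * t_size) >> 32);
        while (t_keys[h] != K_EMPTY) {       /* until empty bucket */
            if (t_keys[h] == k) {
                rs_r_pay[j] = t_pay[h];      /* inner payloads */
                rs_s_pay[j] = v;             /* outer payloads */
                rs_keys[j]  = k;             /* join keys */
                j++;
            }
            if (++h == t_size) h = 0;        /* wrap at last bucket */
        }
    }
    return j;
}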