Algorithm 2 Selection Scan (Scalar - Branchless)
j ← 0                                            ▷ output index
for i ← 0 to |T_keys_in| − 1 do
    k ← T_keys_in[i]                             ▷ copy all columns
    T_payloads_out[j] ← T_payloads_in[i]
    T_keys_out[j] ← k
    m ← (k ≥ k_lower ? 1 : 0) & (k ≤ k_upper ? 1 : 0)
    j ← j + m      ▷ if-then-else expressions use conditional flags to update the index without branching
end for
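For concreteness, the branchless scan translates to the following C sketch, assuming 32-bit keys and payloads; the function and variable names are illustrative.

#include <stdint.h>
#include <stddef.h>

/* Branchless selection scan: a minimal C sketch of Algorithm 2.
 * Every tuple is copied to output slot j; the output index advances
 * only when the predicate holds, so there are no data-dependent
 * branches to mispredict. */
size_t select_scan_branchless(const int32_t *keys_in,
                              const int32_t *payloads_in,
                              int32_t *keys_out,
                              int32_t *payloads_out,
                              size_t n,
                              int32_t k_lower, int32_t k_upper)
{
    size_t j = 0;                        /* output index */
    for (size_t i = 0; i < n; i++) {
        int32_t k = keys_in[i];
        keys_out[j] = k;                 /* copy all columns */
        payloads_out[j] = payloads_in[i];
        /* the comparisons compile to flag-setting instructions */
        int m = (k >= k_lower) & (k <= k_upper);
        j += (size_t)m;                  /* advance only on a match */
    }
    return j;                            /* number of qualifying tuples */
}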
Vectorized selection scans use selective stores to store the
lanes that satisfy the selection predicates. We use SIMD in-
structions to evaluate the predicates, producing a bitmask of
the qualifying lanes. Partially vectorized selection extracts
one bit at a time from the bitmask and accesses the corre-
sponding tuple. Instead, we use the bitmask to selectively
store all qualifying tuples to the output vector at once.
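On hardware with AVX-512, the selective store maps directly to a compress-store instruction. A minimal sketch, assuming 32-bit keys, an input length that is a multiple of the 16-lane vector width, and illustrative names; a scalar tail loop would handle any remainder.

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Fully vectorized selection scan: a sketch using AVX-512, where
 * VPCOMPRESSD implements the selective store directly. */
size_t select_scan_avx512(const int32_t *keys_in, int32_t *keys_out,
                          size_t n, int32_t k_lower, int32_t k_upper)
{
    size_t j = 0;
    __m512i lo = _mm512_set1_epi32(k_lower);
    __m512i hi = _mm512_set1_epi32(k_upper);
    for (size_t i = 0; i < n; i += 16) {
        __m512i k = _mm512_loadu_si512(keys_in + i);
        /* evaluate both predicates; the result is a 16-bit mask */
        __mmask16 m = _mm512_cmpge_epi32_mask(k, lo)
                    & _mm512_cmple_epi32_mask(k, hi);
        /* selectively store only the qualifying lanes, contiguously */
        _mm512_mask_compressstoreu_epi32(keys_out + j, m, k);
        j += (size_t)_mm_popcnt_u32(m);  /* advance by match count */
    }
    return j;
}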
When the selection has a very low selectivity, it is de-
sirable to avoid accessing the payload columns, because
loading payloads that are then discarded wastes memory
bandwidth. Furthermore, when the branch is speculatively
executed, we issue needless loads of payloads. To avoid
reducing the bandwidth, we use a small cache-resident buffer
that stores the indexes of qualifying tuples rather than the
actual values. When the buffer is full, we reload the indexes
from the buffer, gather the actual values from the columns,
and flush them to the output. This variant is shown in
Algorithm 3. Appendix A describes the notation used in the
algorithmic descriptions.
When we materialize data in RAM with no intent to reuse
them soon, we use streaming stores. Mainstream CPUs pro-
vide non-temporal stores that bypass the higher cache levels
and increase the RAM bandwidth for storing data. Xeon
Phi does not support scalar streaming stores, but provides
an instruction to overwrite a cache line with data from a
vector without first loading it. This technique requires the
vector length to be equal to the cache line and eliminates the
need for the write-combining buffers used in mainstream
CPUs. All operators that write their output to memory
sequentially use buffering, which we omit in the algorithmic
descriptions.
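A minimal sketch of flushing a buffer with streaming stores on a mainstream CPU, assuming AVX-512, a 64-byte-aligned destination, and a length that is a multiple of 16; names are illustrative.

#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

/* Streaming (non-temporal) stores: the destination is written a full
 * cache line at a time, bypassing the higher cache levels. */
void flush_buffer_streaming(const int32_t *buf, int32_t *out, size_t n)
{
    /* assumes out is 64-byte aligned and n is a multiple of 16 */
    for (size_t i = 0; i < n; i += 16) {
        __m512i v = _mm512_loadu_si512(buf + i);
        _mm512_stream_si512((void *)(out + i), v);  /* bypass caches */
    }
    _mm_sfence();  /* order non-temporal stores before later accesses */
}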
Algorithm 3 Selection Scan (Vector)
i, j, l ← 0                                  ▷ input, output, and buffer indexes
r ← {0, 1, 2, 3, ..., W − 1}                 ▷ input indexes in vector
for i ← 0 to |T_keys_in| − 1 step W do       ▷ W: # of vector lanes
    k ← T_keys_in[i]                         ▷ load vectors of key columns
    m ← (k ≥ k_lower) & (k ≤ k_upper)        ▷ predicates to mask
    if m ≠ false then                        ▷ optional branch
        B[l] ←_m r                           ▷ selectively store indexes
        l ← l + |m|                          ▷ update buffer index
        if l > |B| − W then                  ▷ flush buffer
            for b ← 0 to |B| − W step W do
                p ← B[b]                     ▷ load input indexes
                k ← T_keys_in[p]             ▷ dereference values
                v ← T_payloads_in[p]
                T_keys_out[b + j] ← k        ▷ flush to output with ...
                T_payloads_out[b + j] ← v    ▷ ... streaming stores
            end for
            p ← B[|B| − W]                   ▷ move overflow ...
            B[0] ← p                         ▷ ... indexes to start
            j ← j + |B| − W                  ▷ update output index
            l ← l − |B| + W                  ▷ update buffer index
        end if
    end if
    r ← r + W                                ▷ update index vector
end for                                      ▷ flush last items after the loop
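The dereferencing step in the flush phase maps naturally to a gather instruction. A sketch, assuming AVX-512 and 32-bit values; names are illustrative.

#include <immintrin.h>
#include <stdint.h>

/* Dereferencing buffered indexes with a gather, as in the flush phase
 * of Algorithm 3: a sketch under the above assumptions. */
static inline __m512i gather_by_index(const int32_t *column,
                                      const int32_t *buffer, int b)
{
    __m512i p = _mm512_loadu_si512(buffer + b);  /* load input indexes */
    /* fetch column[p[lane]] for each of the 16 lanes; scale = 4 bytes */
    return _mm512_i32gather_epi32(p, column, 4);
}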
5. HASH TABLES
Hash tables are used in database systems to execute joins
and aggregations, since they allow constant-time key lookups.
In hash join, one relation is used to build the hash table and
the other relation probes the hash table to find matches. In
group-by aggregation, they are used either to map tuples to
unique group ids or to insert and update partial aggregates.
Using SIMD instructions in hash tables has been proposed
as a way to build bucketized hash tables. Rather than com-
paring against a single key, we place multiple keys per bucket
and compare them to the probing key using SIMD vector
comparisons. We term this approach of comparing a single
input (probing) key with multiple hash table keys horizontal
vectorization. Some hash table variants, such as bucketized
cuckoo hashing [30], can support much higher load factors.
Loading a single 32-bit word is as fast as loading an entire
vector; thus, the cost of bucketized probing diminishes to
extracting the correct payload, which requires log W steps.
Horizontal vectorization is wasteful if we expect to search
fewer than W buckets on average per probing key. For
example, a 50% full hash table with one match per key needs
to access ≈ 1.5 buckets on average to find the match using
linear probing. In such a case, comparing one input key
against multiple table keys yields little improvement and
takes no advantage of the increasing SIMD register size.
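A sketch of a horizontally vectorized bucket probe, assuming AVX-512 and buckets of sixteen 32-bit keys; names are illustrative.

#include <immintrin.h>
#include <stdint.h>

/* Horizontal vectorization: one broadcast compare checks a whole
 * bucket; extracting the matching slot takes a few more steps. */
static inline int bucket_find(const int32_t *bucket_keys, int32_t probe_key)
{
    __m512i keys = _mm512_loadu_si512(bucket_keys);
    __m512i k    = _mm512_set1_epi32(probe_key);  /* broadcast probe key */
    __mmask16 m  = _mm512_cmpeq_epi32_mask(keys, k);
    /* return the slot of the first match, or -1 if the key is absent */
    return m ? (int)_tzcnt_u32(m) : -1;
}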
In this paper, we propose a generic form of hash table vec-
torization termed vertical vectorization that can be applied
to any hash table variant without altering the hash table
layout. The fundamental principle is to process a different
input key per vector lane. All vector lanes process different
keys from the input and access different hash table locations.
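One probe step of this scheme can be sketched as follows, assuming AVX-512, 32-bit keys, and illustrative names; each lane gathers from its own bucket and compares its own key.

#include <immintrin.h>
#include <stdint.h>

/* Vertical vectorization: each lane holds its own input key and its
 * own bucket index; a gather reads 16 different hash table locations
 * at once, and the comparison is done per lane. */
static inline __mmask16 probe_step(__m512i probe_keys, __m512i bucket_idx,
                                   const int32_t *table_keys)
{
    /* fetch table_keys[bucket_idx[lane]] for every lane */
    __m512i tk = _mm512_i32gather_epi32(bucket_idx, table_keys, 4);
    return _mm512_cmpeq_epi32_mask(tk, probe_keys);  /* matches per lane */
}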
The hash table variants we discuss are linear probing (Sec-
tion 5.1), double hashing (Section 5.2), and cuckoo hashing
(Section 5.3). For the hash function, we use multiplicative
hashing, which requires two multiplications, or, for 2^n buck-
ets, one multiplication and a shift. Multiplication costs very
few cycles in mainstream CPUs and is supported in SIMD.
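For 2^n buckets, the one-multiplication form can be sketched in C as follows; the multiplicative factor is an arbitrary odd constant chosen here for illustration.

#include <stdint.h>

/* Multiplicative hashing for a table of 2^n buckets: one
 * multiplication and a shift. Assumes 1 <= log_buckets <= 32. */
static inline uint32_t hash_mult(uint32_t key, int log_buckets)
{
    const uint32_t f = 0x9E3779B1u;          /* illustrative odd factor */
    return (key * f) >> (32 - log_buckets);  /* keep the top n bits */
}

The vectorized version applies the same multiply and shift per lane.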
5.1 Linear Probing
Linear probing is an open addressing scheme that traverses
the table linearly until an empty bucket is found, either to
insert an entry or to terminate the search. The hash table
stores keys and payloads but no pointers. The scalar code
for probing the hash table is shown in Algorithm 4.
Algorithm 4 Linear Probing - Probe (Scalar)
j ← 0                                        ▷ output index
for i ← 0 to |S_keys| − 1 do                 ▷ outer (probing) relation
    k ← S_keys[i]
    v ← S_payloads[i]
    h ← (k · f) ×↑ |T|                       ▷ “×↑”: multiply & keep upper half
    while T_keys[h] ≠ k_empty do             ▷ until empty bucket
        if k = T_keys[h] then
            RS_R_payloads[j] ← T_payloads[h] ▷ inner payloads
            RS_S_payloads[j] ← v             ▷ outer payloads
            RS_keys[j] ← k                   ▷ join keys
            j ← j + 1
        end if
        h ← h + 1                            ▷ next bucket
        if h = |T| then                      ▷ reset if last bucket
            h ← 0
        end if
    end while
end for
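A C sketch of this probe loop, assuming 32-bit keys and payloads, a table that fits in a 32-bit address space, and an illustrative sentinel value for empty buckets; the hash factor is likewise illustrative.

#include <stdint.h>
#include <stddef.h>

#define K_EMPTY INT32_MIN  /* illustrative sentinel for empty buckets */

/* Scalar linear probing probe (Algorithm 4). The hash uses the
 * two-multiplication form: (k * f) wraps mod 2^32, then multiplying
 * by t_size and keeping the upper half maps it into [0, t_size). */
size_t lp_probe(const int32_t *s_keys, const int32_t *s_pay, size_t s_size,
                const int32_t *t_keys, const int32_t *t_pay, size_t t_size,
                int32_t *rs_keys, int32_t *rs_r_pay, int32_t *rs_s_pay)
{
    const uint32_t f = 0x9E3779B1u;          /* illustrative hash factor */
    size_t j = 0;                            /* output index */
    for (size_t i = 0; i < s_size; i++) {
        int32_t k = s_keys[i];
        int32_t v = s_pay[i];
        /* h = (k * f) *^ |T|: multiply & keep upper half */
        size_t h = (size_t)(((uint64_t)((uint32_t)k * f) * t_size) >> 32);
        while (t_keys[h] != K_EMPTY) {       /* until empty bucket */
            if (t_keys[h] == k) {
                rs_r_pay[j] = t_pay[h];      /* inner payloads */
                rs_s_pay[j] = v;             /* outer payloads */
                rs_keys[j]  = k;             /* join keys */
                j++;
            }
            if (++h == t_size) h = 0;        /* wrap at last bucket */
        }
    }
    return j;
}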