Vectorized Bloom Filters for Advanced SIMD Processors
Orestis Polychroniou
Columbia University
orestis@cs.columbia.edu
Kenneth A. Ross∗
Columbia University
kar@cs.columbia.edu
ABSTRACT
Analytics are at the core of many business intelligence tasks.
Efficient query execution is facilitated by advanced hard-
ware features, such as multi-core parallelism, shared-nothing
low-latency caches, and SIMD vector instructions. Only recently have the SIMD capabilities of mainstream hardware been augmented with wider vectors and non-contiguous loads
termed gathers. While analytical DBMSs minimize the use
of indexes in favor of scans based on sequential memory
accesses, some data structures remain crucial. The Bloom
filter, one such example, is the most efficient structure for
filtering tuples based on their existence in a set, and its performance is critical when joining tables with vastly different
cardinalities. We introduce a vectorized implementation for
probing Bloom filters based on gathers that eliminates con-
ditional control flow and is independent of the SIMD length.
Our techniques are generic and can be reused for accelerat-
ing other database operations. Our evaluation indicates a
significant performance improvement over scalar code that
can exceed 3X when the Bloom filter is cache-resident.
1. INTRODUCTION
Advances in computer hardware have had a tremendous impact on the way software is written. The most profound change is the inherent parallelism of multi-core CPUs, which forces efficient applications to be rewritten as parallel programs. In fact, thread parallelism is only the tip of the
iceberg. Other hardware features that potentially influence
performance are multi-level caches, both private and shared
across the cores, cache consistency protocols augmented with
hardware transactional memory support, and wide CPU reg-
isters supported by comprehensive SIMD instruction sets.
Due to the large main memory capacity of recent hard-
ware, many workloads can be kept in RAM. Thus, DBMSs
focus on in-memory execution of queries [18] in order to per-
form real-time analytics. Indexes have become less impor-
tant as most queries access a large percentage of the data.
∗ Supported by NSF grant 0915956 and an Oracle Corp gift.
DaMoN’14, June 22 - 27 2014, Snowbird, UT, USA.
Copyright 2014 ACM 978-1-4503-2971-2/14/06 ...$15.00.
http://dx.doi.org/10.1145/2619228.2619234.
Common approaches on mainstream architectures include
the use of SIMD instructions for scans [12, 19] and the use
of partitioning for creating cache-resident sub-problems to
avoid random memory accesses [13]. Data compression is
also important [18, 19], as it allows us to process more tu-
ples with the same number of instructions using the same
registers. Besides scans, other database operations, such as
sorting [5, 16], are also made faster using SIMD. However,
these operations have the common property of sequential in-
put access. The question of whether SIMD is helpful remains
open for operations that require random access patterns.
To bridge the gap, mainstream hardware now offers non-
contiguous SIMD load instructions, termed gathers, that al-
low random memory accesses through entirely SIMD code.
A problem that stands in the middle ground between ran-
dom accesses and sequential scans is Bloom filter probing.
A Bloom filter is a probabilistic data structure for testing
whether an item belongs to a set. Bloom filters are crucial
to analytical databases for performing joins between tables
that have vastly different cardinalities. The keys of the small
table are used to build the Bloom filter and the keys of the
large table are probed through the filter to discard (most
of) those that do not match. In distributed query execution,
Bloom filters are used to filter tuples before sending them
over the network. The process of filtering across tables, ap-
proximately or not, is termed semi-join. The items used
to build the filter are relatively few compared to the items
probed through the filter to test set membership. Thus,
Bloom filter performance is typically dominated by probing.
Bloom filters are built using a pre-determined number k
of hash functions. To test whether an item belongs to the
set, we need to test k bits in different locations of the filter.
The locations are determined by the k hash functions and
do not need to be distinct. If any bit (out of k) is not set,
the key is certainly not part of the qualifying set. A Bloom
filter should be small enough to be cache-resident if possible. Increasing the number of hash functions is not always helpful. If
the Bloom filter size is m bits and is built using n distinct
items, we can estimate the false positive error probability
using the formula p = (1 − e^(−kn/m))^k, which has a single (global) minimum at a small integer k. On the other hand,
using more hash functions can be slower than using more
bits. Overall, the DBMS should pick the fastest configu-
ration. Each configuration pairs a filter size with a number of hash functions, and all cases can be profiled off-line for the underlying hardware. Then, the optimizer can use the
set cardinality and the desired error rate to decide the most
suitable one to use for the target point of the query plan.