High Throughput Heavy Hitter Aggregation
for Modern SIMD Processors
Orestis Polychroniou
Columbia University
orestis@cs.columbia.edu
Kenneth A. Ross∗
Columbia University
kar@cs.columbia.edu
∗This work was supported by NSF grants IIS-0915956 and IIS-1049898.
ABSTRACT
Heavy hitters are data items that occur at high frequency in
a data set. They are among the most important items for an
organization to summarize and understand during analytical
processing. In data sets with sufficient skew, the number of
heavy hitters can be relatively small. We take advantage of
this small footprint to compute aggregate functions for the
heavy hitters in fast cache memory in a single pass.
We design cache-resident, shared-nothing structures that
hold only the most frequent elements. Our algorithm works
in three phases. It first samples and picks heavy hitter can-
didates. It then builds a hash table and computes the exact
aggregates of these elements. Finally, a validation step iden-
tifies the true heavy hitters from among the candidates.
We identify trade-offs between the hash table configura-
tion and performance. Configurations consist of the probing
algorithm and the table capacity, which determines how many
candidates can be aggregated. The probing algorithm can be
perfect hashing, cuckoo hashing, or bucketized hashing, each
offering a different trade-off between size and speed.
We optimize performance through SIMD instructions, used in
novel ways beyond single vectorized operations, to minimize
cache accesses and the instruction footprint.
1. INTRODUCTION
Databases allow users to process vast amounts of data.
Nevertheless, due to the limitations of human perception,
the conclusions we draw from this volume of information
are often summarized in a few words or charts. One way to
narrow down the volume of information presented is to focus
on the most important items among those being analyzed.
One measure of importance is the total contribution an
item makes to the whole. Items that contribute the most are
called heavy hitters. Heavy hitters can be defined in absolute
terms (e.g., items occurring more than 1% of the time) or in
relative terms (e.g., the top 100 items). In the scope of this
paper we use the top-K definition, but our approach can
easily be modified to account for other definitions. In many
real-world data sets, skew means that aggregates over a
small number of heavy hitters convey a great deal of
information. Our goal is to identify the heavy hitters and
calculate exact aggregates (count, sum, etc.) for those items.
Now that systems with very large main memories are
available, the performance bottleneck has shifted from I/O
to CPU and memory [11]. Modern commodity processors
are multi-core systems. Parallelism and the ability to scale
to many execution units have become primary performance
considerations. Many database algorithms have been re-
designed in the context of in-memory multicore platforms.
With such issues in mind, we focus on parallel computation
of heavy hitters from a memory-resident dataset.
Recent work on in-memory aggregation has shown that
sharing a common aggregation data structure among many
cores is a bad idea when there are heavy hitters [4]. Con-
tention for popular data items causes significant delays, se-
rializing execution and preventing the full utilization of the
parallel hardware. A solution to this problem is to keep
a private running aggregate for each heavy hitter on each
core, to avoid coordination overheads. The final totals can
be combined at the end of the pass.
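As an illustration, the following C sketch shows this
thread-private aggregation pattern: each thread accumulates
running counts for the heavy hitter candidates in its own
array, and the totals are merged only after the scan. The
names and sizes are hypothetical, and the input is assumed
to be pre-mapped to candidate indices; this is not the
paper's implementation.

#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define NUM_THREADS    4
#define NUM_CANDIDATES 256  /* candidate heavy hitters, cache-resident */

typedef struct {
    const uint32_t *groups;     /* candidate index per tuple, or UINT32_MAX */
    size_t begin, end;          /* this thread's slice of the input */
    uint64_t counts[NUM_CANDIDATES];  /* private running aggregates */
} worker_t;

static void *worker(void *arg)
{
    worker_t *w = (worker_t *)arg;
    memset(w->counts, 0, sizeof(w->counts));
    for (size_t i = w->begin; i < w->end; i++) {
        uint32_t g = w->groups[i];
        if (g < NUM_CANDIDATES)  /* ignore non-candidates */
            w->counts[g]++;      /* no locks, no shared cache lines */
    }
    return NULL;
}

void aggregate(const uint32_t *groups, size_t n,
               uint64_t totals[NUM_CANDIDATES])
{
    pthread_t tid[NUM_THREADS];
    worker_t w[NUM_THREADS];
    for (int t = 0; t < NUM_THREADS; t++) {
        w[t].groups = groups;
        w[t].begin  = n * t / NUM_THREADS;
        w[t].end    = n * (t + 1) / NUM_THREADS;
        pthread_create(&tid[t], NULL, worker, &w[t]);
    }
    memset(totals, 0, NUM_CANDIDATES * sizeof(uint64_t));
    for (int t = 0; t < NUM_THREADS; t++) {
        pthread_join(tid[t], NULL);
        for (int g = 0; g < NUM_CANDIDATES; g++)
            totals[g] += w[t].counts[g];  /* combine at end of pass */
    }
}

Because no aggregate cell is ever shared between threads,
popular keys cause no contention, at the cost of one private
(but cache-resident) table per core.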
When the number of grouping keys for an aggregate com-
putation is limited, aggregation can be very fast. Under such
conditions, Ye et al. were able to aggregate over one billion
records per second on a commodity machine [19]. However,
when the grouping cardinality increased beyond the CPU L1
cache capacity, performance dropped by an order of magni-
tude, even for distributions with heavy hitters that are likely
to remain cache-resident. The latency of memory accesses for
the non-heavy hitters dominated performance.
In this work, instead of computing the aggregates for the
whole table, we will only compute the aggregates of a few
heavy hitter elements. By ignoring the non-heavy hitters,
the entire aggregation is done in-cache, and the through-
put is an order of magnitude higher. Further, by using
branch-free SIMD implementations of various aggregation
data structures, we achieve additional speedups, going
significantly beyond the performance of Ye et al. [19] even
for cache-resident aggregates. We use the same SIMD registers
to hold multiple aggregates (e.g., counts and sums) and
minimize the instruction footprint.
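To make the branch-free idea concrete, the following minimal
SSE2 sketch probes a small cache-resident table organized as
4-slot buckets: one input key is compared against all four
slots of its bucket in a single instruction, and the matching
slot's count and sum are updated without any conditional
branch. The bucketized layout and names here are illustrative
assumptions; the paper's actual table layouts and hash
functions differ.

#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

#define BUCKETS 64  /* 64 buckets x 4 slots = 256 candidates */

typedef struct {
    int32_t keys[BUCKETS][4];    /* candidate keys, 4 slots per bucket */
    int32_t counts[BUCKETS][4];  /* running count per slot */
    int32_t sums[BUCKETS][4];    /* running sum per slot */
} table_t;

static inline void probe(table_t *t, int32_t key, int32_t value,
                         uint32_t hash)
{
    uint32_t b = hash & (BUCKETS - 1);
    __m128i k     = _mm_set1_epi32(key);       /* broadcast the probe key */
    __m128i slots = _mm_loadu_si128((__m128i *)t->keys[b]);
    __m128i match = _mm_cmpeq_epi32(k, slots); /* -1 in the matching lane */
    __m128i cnts  = _mm_loadu_si128((__m128i *)t->counts[b]);
    __m128i sums  = _mm_loadu_si128((__m128i *)t->sums[b]);
    /* Subtracting the -1 mask increments the matching count; ANDing
       the mask with the broadcast value adds it to the matching sum.
       Keys absent from the table (non-candidates) match no lane and
       change nothing. */
    cnts = _mm_sub_epi32(cnts, match);
    sums = _mm_add_epi32(sums, _mm_and_si128(match, _mm_set1_epi32(value)));
    _mm_storeu_si128((__m128i *)t->counts[b], cnts);
    _mm_storeu_si128((__m128i *)t->sums[b], sums);
}

Because the update is a fixed instruction sequence, there are
no data-dependent branches to mispredict, regardless of the
input distribution.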
To identify the heavy hitters, we use a sampling step prior
to aggregating the full data. In a billion-element data set,
the cost of sampling even a million elements in advance is
small relative to the cost of scanning the base data. The