High Throughput Heavy Hitter Aggregation
for Modern SIMD Processors
Orestis Polychroniou
Columbia University
orestis@cs.columbia.edu
Kenneth A. Ross∗
Columbia University
kar@cs.columbia.edu
∗This work was supported by NSF grants IIS-0915956 and IIS-1049898.
ABSTRACT
Heavy hitters are data items that occur at high frequency in
a data set. They are among the most important items for an
organization to summarize and understand during analytical
processing. In data sets with sufficient skew, the number of
heavy hitters can be relatively small. We take advantage of
this small footprint to compute aggregate functions for the
heavy hitters in fast cache memory in a single pass.
We design cache-resident, shared-nothing structures that
hold only the most frequent elements. Our algorithm works
in three phases. It first samples and picks heavy hitter can-
didates. It then builds a hash table and computes the exact
aggregates of these elements. Finally, a validation step iden-
tifies the true heavy hitters from among the candidates.
We identify trade-offs between the hash table configura-
tion and performance. Configurations consist of the probing
algorithm and the table capacity, which determines how many
candidates can be aggregated. The probing algorithm can be
perfect hashing, cuckoo hashing, or bucketized hashing, each
offering a different trade-off between size and speed.
We optimize performance through SIMD instructions, used in
novel ways beyond single vectorized operations, to minimize
cache accesses and the instruction footprint.
1. INTRODUCTION
Databases allow users to process vast amounts of data.
Nevertheless, due to the limitations of human perception,
the conclusions we draw from this volume of information
are often summarized in a few words or charts. One way to
narrow down the volume of information presented is to focus
on the most important items among those being analyzed.
One measure of importance is the total contribution an
item makes to the whole. Items that contribute the most are
called heavy hitters. Heavy hitters can be defined in absolute
terms (e.g., items occurring more than 1% of the time) or in
relative terms (e.g., the top 100 items). In the scope of this
paper we use the top-K definition, but our approach can
easily be modified to account for other definitions. In many
real-world data sets, skew means that aggregates over a
small number of heavy hitters convey a great deal of
information. Our goal is to identify the heavy hitters and
calculate exact aggregates (count, sum, etc.) for those items.
Now that systems with very large main memories are
available, the performance bottleneck has shifted from I/O
to CPU and memory [11]. Modern commodity processors
are multi-core systems. Parallelism and the ability to scale
to many execution units have become primary performance
considerations. Many database algorithms have been re-
designed in the context of in-memory multicore platforms.
With such issues in mind, we focus on parallel computation
of heavy hitters from a memory-resident dataset.
Recent work on in-memory aggregation has shown that
sharing a common aggregation data structure among many
cores is a bad idea when there are heavy hitters [4]. Con-
tention for popular data items causes significant delays, se-
rializing execution and preventing the full utilization of the
parallel hardware. A solution to this problem is to keep
a private running aggregate for each heavy hitter on each
core, to avoid coordination overheads. The final totals can
be combined at the end of the pass.
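As an illustration, the following C sketch shows this
thread-private aggregation pattern: each thread accumulates
running counts for the heavy hitter candidates in its own
array, and the totals are merged only after the scan. The
names and sizes are hypothetical, and the input is assumed
to be pre-mapped to candidate indices; this is not the
paper's implementation.

#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define NUM_THREADS    4
#define NUM_CANDIDATES 256  /* candidate heavy hitters, cache-resident */

typedef struct {
    const uint32_t *groups;     /* candidate index per tuple, or UINT32_MAX */
    size_t begin, end;          /* this thread's slice of the input */
    uint64_t counts[NUM_CANDIDATES];  /* private running aggregates */
} worker_t;

static void *worker(void *arg)
{
    worker_t *w = (worker_t *)arg;
    memset(w->counts, 0, sizeof(w->counts));
    for (size_t i = w->begin; i < w->end; i++) {
        uint32_t g = w->groups[i];
        if (g < NUM_CANDIDATES)  /* ignore non-candidates */
            w->counts[g]++;      /* no locks, no shared cache lines */
    }
    return NULL;
}

void aggregate(const uint32_t *groups, size_t n,
               uint64_t totals[NUM_CANDIDATES])
{
    pthread_t tid[NUM_THREADS];
    worker_t w[NUM_THREADS];
    for (int t = 0; t < NUM_THREADS; t++) {
        w[t].groups = groups;
        w[t].begin  = n * t / NUM_THREADS;
        w[t].end    = n * (t + 1) / NUM_THREADS;
        pthread_create(&tid[t], NULL, worker, &w[t]);
    }
    memset(totals, 0, NUM_CANDIDATES * sizeof(uint64_t));
    for (int t = 0; t < NUM_THREADS; t++) {
        pthread_join(tid[t], NULL);
        for (int g = 0; g < NUM_CANDIDATES; g++)
            totals[g] += w[t].counts[g];  /* combine at end of pass */
    }
}

Because no aggregate cell is ever shared between threads,
popular keys cause no contention, at the cost of one private
(but cache-resident) table per core.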
When the number of grouping keys for an aggregate com-
putation is limited, aggregation can be very fast. Under such
conditions, Ye et al. were able to aggregate over one billion
records per second on a commodity machine [19]. However,
when the grouping cardinality increased beyond the CPU L1
cache capacity, performance dropped by an order of magni-
tude, even for distributions with heavy hitters that are likely
to remain cache-resident. The latency of memory accesses for
the non-heavy hitters dominated performance.
In this work, instead of computing the aggregates for the
whole table, we will only compute the aggregates of a few
heavy hitter elements. By ignoring the non-heavy hitters,
the entire aggregation is done in-cache, and the through-
put is an order of magnitude higher. Further, by using
branch-free SIMD implementations of various aggregation
data structures, we achieve additional speedups, going
significantly beyond the performance of Ye et al. [19] even
for cache-resident aggregates. We use the same SIMD registers
to hold multiple aggregates (e.g., counts and sums) and
minimize the instruction footprint.
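To make the branch-free idea concrete, the following minimal
SSE2 sketch probes a small cache-resident table organized as
4-slot buckets: one input key is compared against all four
slots of its bucket in a single instruction, and the matching
slot's count and sum are updated without any conditional
branch. The bucketized layout and names here are illustrative
assumptions; the paper's actual table layouts and hash
functions differ.

#include <stdint.h>
#include <emmintrin.h>  /* SSE2 */

#define BUCKETS 64  /* 64 buckets x 4 slots = 256 candidates */

typedef struct {
    int32_t keys[BUCKETS][4];    /* candidate keys, 4 slots per bucket */
    int32_t counts[BUCKETS][4];  /* running count per slot */
    int32_t sums[BUCKETS][4];    /* running sum per slot */
} table_t;

static inline void probe(table_t *t, int32_t key, int32_t value,
                         uint32_t hash)
{
    uint32_t b = hash & (BUCKETS - 1);
    __m128i k     = _mm_set1_epi32(key);       /* broadcast the probe key */
    __m128i slots = _mm_loadu_si128((__m128i *)t->keys[b]);
    __m128i match = _mm_cmpeq_epi32(k, slots); /* -1 in the matching lane */
    __m128i cnts  = _mm_loadu_si128((__m128i *)t->counts[b]);
    __m128i sums  = _mm_loadu_si128((__m128i *)t->sums[b]);
    /* Subtracting the -1 mask increments the matching count; ANDing
       the mask with the broadcast value adds it to the matching sum.
       Keys absent from the table (non-candidates) match no lane and
       change nothing. */
    cnts = _mm_sub_epi32(cnts, match);
    sums = _mm_add_epi32(sums, _mm_and_si128(match, _mm_set1_epi32(value)));
    _mm_storeu_si128((__m128i *)t->counts[b], cnts);
    _mm_storeu_si128((__m128i *)t->sums[b], sums);
}

Because the update is a fixed instruction sequence, there are
no data-dependent branches to mispredict, regardless of the
input distribution.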
To identify the heavy hitters, we use a sampling step prior
to aggregating the full data. In a billion-element data set,
the cost of sampling even a million elements in advance is
small relative to the cost of scanning the base data. The