Fast Sorted-Set Intersection using SIMD Instructions
Benjamin Schlegel
TU Dresden
Dresden, Germany
benjamin.schlegel@tu-
dresden.de
Thomas Willhalm
Intel GmbH
Munich, Germany
thomas.willhalm@intel.com
Wolfgang Lehner
TU Dresden
Dresden, Germany
wolfgang.lehner@tu-
dresden.de
ABSTRACT
In this paper, we focus on sorted-set intersection which is
an important part in many algorithms, e.g., RID-list inter-
section, inverted indexes, and others. In contrast to tradi-
tional scalar sorted-set intersection algorithms that try to
reduce the number of comparisons, we propose a parallel
algorithm that relies on speculative execution of compar-
isons. In general, our algorithm requires more comparisons
but less instructions than scalar algorithms that translates
into a better overall speed. We achieve this by utilizing ef-
ficient single-instruction-multiple-data (SIMD) instructions
that are available in many processors. We provide different
sorted-set intersection algorithms for different integer data
types. We propose versions that use uncompressed integer
values as input and output as well as a version that uses a
tailor-made data layout for even faster intersections. In our
experiments, we achieve speedups up to 5.3x compared to
popular fast scalar algorithms.
1. INTRODUCTION
Sorted-set intersection is a fundamental operation in
query processing in the area of databases and information
retrieval. It is part of many algorithms and often accounts
for a large fraction of the overall runtime. Examples are in-
verted indexes in information retrieval [25], lists intersection
in frequent-itemset mining [1, 28], and merging of RID-lists
in database query processing [19]. In many of these applica-
tion areas low latencies are a key concern so that reducing
the execution time of set intersection is very important.
There has been done a lot of research with the goal of im-
proving sorted-set intersection. Many approaches focus on
speed up sequential intersection [4, 5, 9, 11, 13, 20] by using
efficient data structures or improved processing techniques.
Other approaches focus on utilizing modern hardware like
graphic processors (GPUs) [1, 2, 12, 26, 27] and multi-core
CPUs [23, 24] to utilize the parallelism offered by these pro-
cessors. However, the main focus of these approaches is only
on thread-level parallelism. Data-level parallelism is so far
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee. This article was presented at:
The Second International Workshop on Accelerating Data Management
Systems using Modern Processor and Storage Architectures (ADMS’11).
Copyright 2011.
not considered although it is available via SIMD instruction
sets in almost all modern CPUs. For this reason, our goal
in this paper is to speed up the intersection of sorted sets
using SIMD capabilities.
Utilization of SIMD capabilities in algorithms is ideally
achieved through automatic vectorizing by compilers or by
inserting SIMD instructions manually. However, many com-
pilers detect vectorization opportunities only for simple loop
constructs with few or without any data dependencies. In
all other cases, hand-tuned assembly or intrinsics must be
used. Unfortunately, all sorted-set intersection algorithms
have complex data dependencies so that automatic vector-
ization cannot be applied.
In this paper, we introduce parallel sorted-set intersection
algorithms that rely on speculative execution. Our main in-
tention is to speculatively execute more than the necessary
comparisons as done by the scalar algorithms. To do this
efficiently, we use the string and text processing new instruc-
tions (STTNI) that are part of the Intel
R
Streaming SIMD
Extensions 4.2 (Intel
R
SSE 4.2).
1
These instructions allow
a fast full comparison of either eight 16-bit values (= 64
comparisons) or sixteen 8-bit values (= 256 comparisons)
with only one instruction. Many of these comparisons are
useless and would not been exectuted by scalar algorithms.
However, since these instructions itself require only 8 cycles
to complete [14] we achieve significant speedups.
In summary, our contributions are as follows:
• We propose fast parallel sorted-set intersection algo-
rithms for 8-bit, 16-bit and 32-bit integer values based
on STTNI of Intel SSE 4.2. The algorithms use un-
compressed integer values as input and output. We
explain in detail the necessary SIMD instructions and
steps of the parallel intersection.
• We present a hierarchical data layout that is tailor-
made for a fast parallel intersection of integer values
with a precision higher than 16 bits.
• We compare our parallel algorithms with two highly
efficient scalar versions on synthetic datasets.
The scope of our paper is as follows. We focus solely
on sorted-set intersection of integer values without dupli-
cates. Usually, this is not a limitation since both of our ex-
ample scenarios—information retrieval and frequent-itemset
1
Intel and Core are trademarks of Intel Corporation in the
U.S. and/or other countries. Other names and brands may
be claimed as the property of others.