SALSA: Scalable and Low Synchronization NUMA-aware
Algorithm for Producer-Consumer Pools
Elad Gidron
CS Department
Technion, Haifa, Israel
eladgi@cs.technion.ac.il
Idit Keidar
EE Department
Technion, Haifa, Israel
idish@ee.technion.ac.il
Dmitri Perelman
∗
EE Department
Technion, Haifa, Israel
dima39@tx.technion.ac.il
Yonathan Perez
EE Department
Technion, Haifa, Israel
yonathan0210@gmail.com
ABSTRACT
We present a highly-scalable non-blocking producer-consumer
task pool, designed with a special emphasis on lightweight
synchronization and data locality. The core building block
of our pool is SALSA, Scalable And Low Synchronization Al-
gorithm for a single-consumer container with task stealing
support. Each consumer operates on its own SALSA con-
tainer, stealing tasks from other containers if necessary. We
implement an elegant self-tuning policy for task insertion,
which does not push tasks to overloaded SALSA containers,
thus decreasing the likelihood of stealing.
SALSA manages large chunks of tasks, which improves lo-
cality and facilitates stealing. SALSA uses a novel approach
for coordination among consumers, without strong atomic
operations or memory barriers in the fast path. It invokes
only two CAS operations during a chunk steal.
Our evaluation demonstrates that a pool built using SALSA
containers scales linearly with the number of threads and
significantly outperforms other FIFO and non-FIFO alter-
natives.
Categories and Subject Descriptors
D.1.3 [Software]: Concurrent Programming
General Terms
Algorithms, Performance
Keywords
Multi-core, concurrent data structures
∗
This work was partially supported by Hasso Plattner In-
stitute.
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that copies
bear this notice and the full citation on the first page. To copy otherwise, to
republish, to post on servers or to redistribute to lists, requires prior specific
permission and/or a fee.
SPAA’12, June 25–27, 2012, Pittsburgh, Pennsylvania, USA.
Copyright 2012 ACM 978-1-4503-1213-4/12/06 ...$10.00.
1. INTRODUCTION
Emerging computer architectures pose many new chal-
lenges for software development. First, as the number of
computing elements constantly increases, the importance of
scalability of parallel programs becomes paramount. Sec-
ond, accessing memory has become the principal bottleneck,
while multi-CPU systems are based on NUMA architectures,
where memory access from different chips is asymmetric.
Therefore, it is instrumental to design software with local
data access, cache-friendliness, and reduced contention on
shared memory locations, especially across chips. Further-
more, as systems get larger, their behavior becomes less pre-
dictable, underscoring the importance of robust programs
that can overcome unexpected thread stalls.
Our overarching goal is to devise a methodology for devel-
oping parallel algorithms addressing these challenges. In this
paper, we focus on one of the fundamental building blocks of
highly parallel software, namely a producer-consumer task
pool. Specifically, we present a scalable and highly-efficient
non-blocking pool, with lightweight synchronization-free op-
erations in the common case. Its data allocation scheme is
cache-friendly and highly suitable for NUMA environments.
Moreover, our pool is robust in the face of imbalanced loads
and unexpected thread stalls.
Our system is composed of two independent logical en-
tities: 1) SALSA, Scalable and Low Synchronization Algo-
rithm, a single-consumer pool that exports a stealing op-
eration, and 2) a work stealing framework implementing a
management policy that operates multiple SALSA pools.
In order to improve locality and facilitate stealing, SALSA
keeps tasks in chunks, organized in per-producer chunk lists.
Only the producer mapped to a given list can insert tasks
to chunks in this list, which eliminates the need for synchro-
nization among producers.
Though each consumer has its own task pool, inter-consumer
synchronization is required in order to allow stealing. The
challenge is to do so without resorting to costly atomic op-
erations (such as CAS or memory fences) upon each task
retrieval. We address this challenge via a novel chunk-based
stealing algorithm that allows consume operations to be
synchronization-free in the common case, when no stealing
occurs, which we call the fast path. Moreover, SALSA re-
duces the stealing rate by moving entire chunks of tasks in
one steal operation, which requires only two CAS operations.