More Than You Ever Wanted to Know about Synchronization
Synchrobench, Measuring the Impact of the Synchronization on Concurrent Algorithms
Vincent Gramoli
NICTA and University of Sydney, Australia
vincent.gramoli@sydney.edu.au
Abstract
In this paper, we present the most extensive comparison of synchronization techniques. We evaluate 5 different synchronization techniques through a series of 31 data structure algorithms from the recent literature on 3 multicore platforms from Intel, Sun Microsystems and AMD. To this end, we developed a new micro-benchmark suite in C/C++ and Java, called Synchrobench, to help the community evaluate new data structures and synchronization techniques. The main conclusion of this evaluation is threefold: (i) although compare-and-swap helps achieve the best performance on multicores, doing so correctly is hard; (ii) optimistic locking offers varying performance results while transactional memory offers more consistent results; and (iii) copy-on-write and read-copy-update suffer more from contention than any other technique but could be combined with others to derive efficient algorithms.
Categories and Subject Descriptors
D.1.3 [Programming Techniques]: Concurrent Programming - Parallel programming
Keywords Benchmark; data structure; reusability; lock-freedom
1. Introduction
The increasing core count raises new challenges in the development
of efficient algorithms that allow concurrent threads to access
shared resources. Not only do developers have to choose from a large set of thread synchronization techniques, including locks, read-modify-write, copy-on-write, transactions and read-copy-update, but they must also select dedicated data structure algorithms that leverage each synchronization technique under a certain workload. These possibilities have led to an increase in the number of proposed concurrent data structures, each shown to be efficient in “some” settings. Unfortunately, it is almost impossible to predict their performance given hardware and OS artifacts. A single framework is thus necessary to evaluate their performance on common ground before recommending that developers choose a specific synchronization technique.
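As a minimal illustration of two of these techniques, consider the following Java sketch (ours, not code from this paper or from Synchrobench) that synchronizes a trivial shared counter first with mutual exclusion and then with a read-modify-write retry loop around compare-and-swap:

    import java.util.concurrent.atomic.AtomicInteger;

    // Illustrative sketch: the same increment synchronized two ways.
    class Counters {
        private int lockedValue = 0;                      // protected by the intrinsic lock
        private final AtomicInteger casValue = new AtomicInteger();

        // Mutual exclusion: simple to reason about, but threads serialize.
        synchronized void lockedIncrement() {
            lockedValue++;
        }

        // Read-modify-write: a lock-free compare-and-swap retry loop;
        // scaling this idea from one word to a whole data structure is
        // where the difficulty discussed in this paper arises.
        void casIncrement() {
            int old;
            do {
                old = casValue.get();
            } while (!casValue.compareAndSet(old, old + 1));
        }
    }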
On the one hand, synchronization techniques are usually tested with standard macro-benchmarks [8] whose workloads alternate realistically between various complex patterns. These macro-benchmarks are, however, of little help when it comes to nailing down the bottleneck responsible for performance drops. On the other hand, profiling tools that measure cache traffic [18] and monitor memory reclamation can be extremely useful for tuning the implementation of an algorithm to a dedicated hardware platform; however, they are of little help in optimizing the algorithm itself.
This is why micro-benchmarks have been so popular for evaluating new algorithms. They are invaluable tools that complement macro-benchmark evaluations and profiling toolboxes in evaluating novel concurrent algorithms. In particular, they are instrumental in showing how an algorithm can improve the performance of a data structure even when the same algorithm only negligibly boosts a particular application on a specific hardware or OS. Unfortunately, these micro-benchmarks are often developed specifically to illustrate the performance of one algorithm and are usually tuned for this purpose. More importantly, they are poorly documented: it is often unclear whether updates comprise operations that return unsuccessfully without modifying the structure, or whether the reported performance of a concurrent data structure is higher than the performance of its non-synchronized counterpart running sequentially.
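To make the first ambiguity concrete, the following hypothetical Java sketch (the names are ours, not Synchrobench's API) shows why a benchmark must document whether its reported update ratio counts attempted or only effective updates:

    import java.util.Set;
    import java.util.concurrent.ConcurrentSkipListSet;
    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.atomic.LongAdder;

    // Hypothetical sketch: attempted versus effective updates.
    class UpdateSemantics {
        static final Set<Integer> set = new ConcurrentSkipListSet<>();
        static final LongAdder attempted = new LongAdder();
        static final LongAdder effective = new LongAdder();

        static void update() {
            int key = ThreadLocalRandom.current().nextInt(1024);
            attempted.increment();
            // add() returns false, without modifying the set, when the key
            // is already present: an unsuccessful update. A benchmark that
            // reports only `attempted` overstates the write contention the
            // structure actually endured.
            if (set.add(key)) {
                effective.increment();
            }
        }
    }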
Our contribution is the most extensive comparison of synchro-
nization techniques. We focus on the performance of copy-on-write,
mutual exclusion (e.g., spinlocks), read-copy-update, read-modify-
write (e.g., compare-and-swap) and transactional memory to syn-
chronize concurrent data structures written in Java and C/C++, and
evaluated on AMD Opteron, Intel Xeon and UltraSPARC T2 mul-
ticore platforms. We also propose Synchrobench, an open source
micro-benchmark suite written in Java and C/C++ for multi-core
machines to help researchers evaluate new algorithms and synchro-
nization techniques. Synchrobench is not intended to measure overall system performance or to mimic a given application; rather, it aims to help programmers understand the causes of performance problems in their data structures. Its Java version executes on top of the JVM, making it possible to test algorithms written in any language producing JVM-compatible bytecode, such as Scala. Its C/C++ version allows for more control over memory management.
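For concreteness, a micro-benchmark of this kind essentially reduces to the loop below. This is a simplified Java sketch under our own assumptions (the thread count, key range, update ratio and duration are illustrative parameters, not Synchrobench's actual interface): each thread picks operations according to an update ratio, and throughput is the total number of operations divided by the measurement interval.

    import java.util.Set;
    import java.util.concurrent.ConcurrentSkipListSet;
    import java.util.concurrent.ThreadLocalRandom;
    import java.util.concurrent.atomic.AtomicBoolean;
    import java.util.concurrent.atomic.LongAdder;

    // Hypothetical harness sketch: threads apply a mix of reads and
    // updates to a shared set for a fixed duration.
    class MicroBench {
        public static void main(String[] args) throws InterruptedException {
            final int threads = 8, range = 1 << 14, updatePercent = 10;
            final Set<Integer> set = new ConcurrentSkipListSet<>();
            final LongAdder ops = new LongAdder();
            final AtomicBoolean stop = new AtomicBoolean();

            Runnable worker = () -> {
                ThreadLocalRandom rnd = ThreadLocalRandom.current();
                while (!stop.get()) {
                    int key = rnd.nextInt(range);
                    if (rnd.nextInt(100) < updatePercent) {
                        // Alternate insert/remove to keep the set size stable.
                        if (!set.add(key)) set.remove(key);
                    } else {
                        set.contains(key);          // read-only operation
                    }
                    ops.increment();
                }
            };
            Thread[] pool = new Thread[threads];
            for (int i = 0; i < threads; i++) (pool[i] = new Thread(worker)).start();
            Thread.sleep(5000);                     // measurement interval: 5 s
            stop.set(true);
            for (Thread t : pool) t.join();
            System.out.println("throughput: " + ops.sum() / 5 + " ops/s");
        }
    }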
Our evaluation includes 31 algorithms taken from the literature
and summarized in Table 1. It provides a range of data structures
from simple ones (e.g., linked lists) and fast ones (e.g., queues and
hash tables) to sorted ones (e.g., trees, skip lists). These structures
implement classic abstractions (e.g., collection, dictionary and set)
but Synchrobench also features special operations to measure the
reusability of the data structure in a concurrent library.
This systematic evaluation of synchronization techniques leads
to interesting conclusions, including three main ones:
1. Compare-and-swap is a double-edged sword. Data structures are typically faster when synchronized exclusively with compare-and-swap than with any other technique, regardless of the multicore machine we tested. However, the lock-free use of compare-and-swap makes the design of these data structures, and especially the ones with non-trivial mutations, extremely difficult. In particular, we observed that only a few full-fledged binary search trees using single-word compare-and-swap exist, and we identified a bug in one of them; the sketch below gives an intuition of the difficulty.
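For intuition, contrast a tree with the simplest lock-free structure: a Treiber-style stack, where a single-word compare-and-swap on the top pointer suffices. The Java sketch below is this standard textbook construction, not one of the algorithms of Table 1; a binary search tree deletion or rotation must instead atomically update several words, which a single-word compare-and-swap cannot do directly.

    import java.util.concurrent.atomic.AtomicReference;

    // Treiber stack: the "easy case" for single-word compare-and-swap.
    // (In Java, garbage collection sidesteps the ABA problem that the
    // same code would face with manual memory reclamation in C/C++.)
    class LockFreeStack<T> {
        private static final class Node<T> {
            final T value; Node<T> next;
            Node(T value) { this.value = value; }
        }
        private final AtomicReference<Node<T>> top = new AtomicReference<>();

        void push(T value) {
            Node<T> node = new Node<>(value);
            Node<T> old;
            do {
                old = top.get();
                node.next = old;                      // link to current top
            } while (!top.compareAndSet(old, node));  // retry on interference
        }

        T pop() {
            Node<T> old, next;
            do {
                old = top.get();
                if (old == null) return null;         // empty stack
                next = old.next;
            } while (!top.compareAndSet(old, next));
            return old.value;
        }
    }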
2. Transactions offer more consistent performance than locks. We observed that optimistic locking techniques, which traverse the structure and acquire locks before revalidating, help reduce the number of locks used but also exhibit great variations in performance depending on the considered structure and the amount of contention. Transactional memory provides