Performance Evaluation of Concurrent Lock-free Data Structures on GPUs
Prabhakar Misra and Mainak Chaudhuri
Department of Computer Science and Engineering
Indian Institute of Technology
Kanpur, INDIA
{prabhu,mainakc}@cse.iitk.ac.in
Abstract—Graphics processing units (GPUs) have emerged
as a strong candidate for high-performance computing. While
regular data-parallel computations with little or no synchro-
nization are easy to map on the GPU architectures, it is a
challenge to scale up computations on dynamically chang-
ing pointer-linked data structures. The traditional lock-based
implementations are known to offer poor scalability due to
high lock contention in the presence of thousands of active
threads, which is common in GPU architectures. In this paper,
we present a performance evaluation of concurrent lock-free
implementations of four popular data structures on GPUs. We
implement a set using lock-free linked list, hash table, skip
list, and priority queue. On the first three data structures,
we evaluate the performance of different mixes of addition,
deletion, and search operations. The priority queue is designed
to support addition of elements to the set and retrieval and
deletion of the minimum element. We evaluate the performance of
these lock-free data structures on a Tesla C2070 Fermi GPU
and compare it with the performance of multi-threaded lock-
free implementations for CPU running on a 24-core Intel Xeon
server. The linked list, hash table, skip list, and priority queue
implementations achieve speedups of up to 7.4, 11.3, 30.7, and
30.8, respectively, on the GPU compared to the Xeon server.
Keywords-linked list; hash table; skip list; priority queue;
concurrent; lock-free; GPU; CUDA;
I. INTRODUCTION
Graphics processing units (GPUs) have become one of the pre-
ferred vehicles for high-performance general purpose computing.
This computing paradigm is commonly known as general purpose
computing on GPU (GPGPU) or GPU computing. Regular data-
parallel computations with little or no synchronization have been
efficiently mapped on the GPUs. However, many general-purpose
programs make irregular accesses to
pointer-linked data structures that change dynamically through
addition and deletion of items. Achieving scalable performance on
such data structures requires highly concurrent implementations.
In small to medium-scale parallel machines with tens of active
thread contexts, it may be acceptable to have some amount of lock-
based synchronization. However, this would introduce prohibitive
performance overhead in GPUs where the number of active threads
can easily run into the thousands. The possibility of high lock
contention at this scale rules out lock-based implementations.
In this paper, we present an evaluation of lock-free concurrent
implementations of a few important data structures on GPUs. To
the best of our knowledge, this is the first detailed evaluation of a
number of lock-free data structures on GPUs. We present four im-
plementations of a set with the help of linked list, hash table, skip
list, and priority queue. The first three data structures support con-
current lock-free addition, deletion, and search operations on the
set, while the concurrent priority queue offers lock-free retrieval
and deletion of the minimum element and addition operations. Our
choice of data structures is governed by their importance in general
purpose computing. Linked lists form the building block for many
important data structures, such as graphs. Hash tables are often
used to reduce average case search time. We present a lock-free
design of a closed-address hash table, which builds upon our lock-
free linked list design. Skip lists offer expected logarithmic search
time, and our lock-free priority queue builds upon a lock-free
implementation of the skip list. All our implementations use the
CUDA (Compute Unified Device Architecture) C++ programming
model and rely on the CUDA atomic primitives such as atomic
compare-and-swap (CAS), atomic increment, etc.
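As a concrete illustration of the CAS-based pattern these primitives enable, the following host-side C++ sketch inserts a key into a sorted lock-free linked list using std::atomic; on the GPU, the compare-exchange step corresponds to CUDA's atomicCAS. The node layout and function names here are our own illustrative choices, not the paper's, and the sketch omits the deletion marking and memory reclamation (ABA) machinery that a complete implementation must provide.

```cpp
#include <atomic>

// Hypothetical node type for a sorted lock-free list; names are
// illustrative, not taken from the paper's implementation.
struct Node {
    int key;
    std::atomic<Node*> next;
    Node(int k, Node* n) : key(k), next(n) {}
};

// Lock-free set insertion via a CAS retry loop. On a GPU the
// compare_exchange below maps to CUDA's atomicCAS primitive.
bool insert(std::atomic<Node*>& head, int key) {
    while (true) {
        // Locate the window (pred, curr) with pred->key < key <= curr->key.
        std::atomic<Node*>* pred = &head;
        Node* curr = pred->load();
        while (curr && curr->key < key) {
            pred = &curr->next;
            curr = pred->load();
        }
        if (curr && curr->key == key) return false;  // already in the set
        Node* node = new Node(key, curr);
        // Publish the new node atomically; if a concurrent update changed
        // the link since we read it, discard the node and retry.
        if (pred->compare_exchange_strong(curr, node)) return true;
        delete node;
    }
}
```

The retry loop is the essence of lock-free progress: a failed CAS means some other thread completed an operation, so the failing thread simply re-traverses and tries again rather than blocking.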
We measure the performance of these data structures by execut-
ing a mix of the concurrent operations supported by each of the
data structures. Our evaluation is carried out on a Tesla C2070
Fermi GPU as well as a 24-core Intel Xeon server. The GPU
implementations of the lock-free linked list, hash table, skip list,
and priority queue achieve speedups of up to 7.4, 11.3, 30.7, and
30.8, respectively, compared to the lock-free multi-threaded CPU
execution.
The concurrent implementations of our four chosen data structures
have been studied in great detail in the context of CPUs, and we
review some of these contributions in Section I-A.
Section II summarizes the CUDA programming environment.
Section III presents the lock-free implementations of the four data
structures on GPU. We discuss the evaluation methodology and
the performance results in Sections IV and V.
A. Related Work
In this paper, we have implemented four lock-free data struc-
tures on CUDA-enabled GPUs. While a significant amount of
research has been done on lock-free data structures in the context
of traditional CPUs, little is known about the performance
of these data structures on GPUs. Herlihy and Shavit
discuss concurrent implementations of several data structures on
shared memory multiprocessors using Java [10]. We summarize
relevant portions of this literature on CPU-based implementations
and discuss a few studies relevant to GPU implementations.
A lock-free linked list implementation using atomic CAS operations
was proposed by Valois [29]. This implementation supports
linearizable operations [12], i.e., each operation appears to take
place atomically at some point (the linearization point) during its
execution. Valois also proposes a reference count-based solution
to the ABA problem related to memory management of data
structures operated on by atomic CAS. Subsequently, Harris [9]