Performance Evaluation of Concurrent Lock-free Data Structures on GPUs
Prabhakar Misra and Mainak Chaudhuri
Department of Computer Science and Engineering
Indian Institute of Technology
Kanpur, INDIA
{prabhu,mainakc}@cse.iitk.ac.in
Abstract—Graphics processing units (GPUs) have emerged
as a strong candidate for high-performance computing. While
regular data-parallel computations with little or no synchro-
nization are easy to map on the GPU architectures, it is a
challenge to scale up computations on dynamically chang-
ing pointer-linked data structures. The traditional lock-based
implementations are known to offer poor scalability due to
high lock contention in the presence of thousands of active
threads, which is common in GPU architectures. In this paper,
we present a performance evaluation of concurrent lock-free
implementations of four popular data structures on GPUs. We
implement a set using lock-free linked list, hash table, skip
list, and priority queue. On the first three data structures,
we evaluate the performance of different mixes of addition,
deletion, and search operations. The priority queue is designed
to support addition of elements to the set and retrieval and
deletion of the minimum element. We evaluate the performance of
these lock-free data structures on a Tesla C2070 Fermi GPU
and compare it with the performance of multi-threaded lock-
free implementations for CPU running on a 24-core Intel Xeon
server. The linked list, hash table, skip list, and priority queue
implementations achieve speedups of up to 7.4, 11.3, 30.7, and
30.8, respectively, on the GPU compared to the Xeon server.
Keywords-linked list; hash table; skip list; priority queue;
concurrent; lock-free; GPU; CUDA;
I. INTRODUCTION
Graphics processing units (GPUs) have become one of the pre-
ferred vehicles for high-performance general purpose computing.
This computing paradigm is commonly known as general purpose
computing on GPU (GPGPU) or GPU computing. Regular data-
parallel computations with little or no synchronization have been
efficiently mapped on the GPUs. However, many general-purpose
programs make irregular accesses to
pointer-linked data structures that change dynamically through
addition and deletion of items. Achieving scalable performance on
such data structures requires highly concurrent implementations.
In small to medium-scale parallel machines with tens of active
thread contexts, it may be acceptable to have some amount of lock-
based synchronization. However, this would introduce prohibitive
performance overhead in GPUs where the number of active threads
can easily run into the thousands. The possibility of high lock
contention at this scale rules out lock-based implementations.
In this paper, we present an evaluation of lock-free concurrent
implementations of a few important data structures on GPUs. To
the best of our knowledge, this is the first detailed evaluation of a
number of lock-free data structures on GPUs. We present four im-
plementations of a set with the help of linked list, hash table, skip
list, and priority queue. The first three data structures support con-
current lock-free addition, deletion, and search operations on the
set, while the concurrent priority queue offers lock-free retrieval
and deletion of the minimum element and addition operations. Our
choice of data structures is governed by their importance in general
purpose computing. Linked lists form the building block for many
important data structures, such as graphs. Hash tables are often
used to reduce average case search time. We present a lock-free
design of a closed-address hash table, which builds upon our lock-
free linked list design. Skip lists offer expected logarithmic search
time, and our lock-free priority queue builds upon a lock-free
implementation of the skip list. All our implementations use the
CUDA (Compute Unified Device Architecture) C++ programming
model and rely on the CUDA atomic primitives such as atomic
compare-and-swap (CAS), atomic increment, etc.
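As a concrete illustration of the CAS-based pattern these primitives enable, the following host-side C++ sketch inserts a key into a sorted lock-free linked list using std::atomic; on the GPU, the compare-exchange step corresponds to CUDA's atomicCAS. The node layout and function names here are our own illustrative choices, not the paper's, and the sketch omits the deletion marking and memory reclamation (ABA) machinery that a complete implementation must provide.

```cpp
#include <atomic>

// Hypothetical node type for a sorted lock-free list; names are
// illustrative, not taken from the paper's implementation.
struct Node {
    int key;
    std::atomic<Node*> next;
    Node(int k, Node* n) : key(k), next(n) {}
};

// Lock-free set insertion via a CAS retry loop. On a GPU the
// compare_exchange below maps to CUDA's atomicCAS primitive.
bool insert(std::atomic<Node*>& head, int key) {
    while (true) {
        // Locate the window (pred, curr) with pred->key < key <= curr->key.
        std::atomic<Node*>* pred = &head;
        Node* curr = pred->load();
        while (curr && curr->key < key) {
            pred = &curr->next;
            curr = pred->load();
        }
        if (curr && curr->key == key) return false;  // already in the set
        Node* node = new Node(key, curr);
        // Publish the new node atomically; if a concurrent update changed
        // the link since we read it, discard the node and retry.
        if (pred->compare_exchange_strong(curr, node)) return true;
        delete node;
    }
}
```

The retry loop is the essence of lock-free progress: a failed CAS means some other thread completed an operation, so the failing thread simply re-traverses and tries again rather than blocking.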
We measure the performance of these data structures by execut-
ing a mix of the concurrent operations supported by each of the
data structures. Our evaluation is carried out on a Tesla C2070
Fermi GPU as well as a 24-core Intel Xeon server. The GPU
implementations of the lock-free linked list, hash table, skip list,
and priority queue achieve speedups of up to 7.4, 11.3, 30.7, and
30.8, respectively, compared to the lock-free multi-threaded CPU
execution.
The concurrent implementations of our four chosen data structures
have been studied in great detail in the context of CPUs, and we
review some of these contributions in Section I-A.
Section II summarizes the CUDA programming environment.
Section III presents the lock-free implementations of the four data
structures on GPU. We discuss the evaluation methodology and
the performance results in Sections IV and V.
A. Related Work
In this paper, we have implemented four lock-free data struc-
tures on CUDA-enabled GPUs. While a significant amount of
research has been done on lock-free data structures in the context
of traditional CPUs, little is known about the performance
of these data structures on GPUs. Herlihy and Shavit
discuss concurrent implementations of several data structures on
shared memory multiprocessors using Java [10]. We summarize
relevant portions of this literature on CPU-based implementations
and discuss a few studies relevant to GPU implementations.
A lock-free linked list implementation using atomic CAS operations
was proposed by Valois [29]. This implementation supports
linearizable operations [12], i.e., each operation appears to take
place atomically at some point (the linearization point) during its
execution. Valois also proposes a reference count-based solution
to the ABA problem related to memory management of data
structures operated on by atomic CAS. Subsequently, Harris [9]