1 kernel work_stealing(global Task *queues) {
2   int queue_id = get_group_id();
3   while (more_work(queues)) {
4     Task *t = pop_or_steal(queues, queue_id);
5     if (t)
6       process_task(t, queues, queue_id);
7   }
8 }
Figure 2: An excerpt of a work stealing algorithm in OpenCL
2 BACKGROUND AND EXAMPLES
We outline the OpenCL programming model on which we base
cooperative kernels (Sec. 2.1), and illustrate OpenCL and the sched-
uling requirements of irregular algorithms using two examples: a
work stealing queue and frontier-based graph traversal (Sec. 2.2).
2.1 OpenCL Background
An OpenCL program is divided into host and device components.
A host application runs on the CPU and launches one or more
kernels that run on accelerator devices—GPUs in the context of this
paper. A kernel is written in OpenCL C, based on C99. All threads
executing a kernel start at the same entry function with identical
arguments. A thread can call
get_global_id
to obtain a unique id,
to access distinct data or follow dierent control ow paths.
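For example, a simple data-parallel kernel might use the global id to assign each thread a distinct array element (a minimal sketch; the vec_add kernel and its arguments are illustrative, not from the paper):

kernel void vec_add(global const float *a, global const float *b,
                    global float *c) {
  int gid = get_global_id(0);  // unique id across all threads (dimension 0)
  c[gid] = a[gid] + b[gid];    // each thread processes a distinct element
}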
The threads of a kernel are divided into workgroups. Functions get_local_id and get_group_id return a thread's local id within its workgroup and the workgroup id. The number of threads per workgroup and number of workgroups are obtained via get_local_size and get_num_groups. Execution of the threads in a workgroup can be synchronised via a workgroup barrier. A global barrier (synchronising all threads of a kernel) is not provided as a primitive.
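For instance, a workgroup barrier is typically used to make data staged in local memory by one thread visible to the other threads of the same workgroup (an illustrative sketch, not code from the paper):

kernel void reverse_tiles(global int *data, local int *tile) {
  int lid = get_local_id(0);
  tile[lid] = data[get_global_id(0)];  // stage one element into local memory
  barrier(CLK_LOCAL_MEM_FENCE);        // all local writes now visible in the workgroup
  // Safe to read an element written by another thread of this workgroup:
  data[get_global_id(0)] = tile[get_local_size(0) - 1 - lid];
}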
Memory Spaces and Memory Model.
A kernel has access to four
memory spaces. Shared virtual memory (SVM) is accessible to all
device threads and the host concurrently. Global memory is shared
among all device threads. Each workgroup has a portion of local
memory for fast intra-workgroup communication. Every thread has
a portion of very fast private memory for function-local variables.
Fine-grained communication within a workgroup, as well as
inter-workgroup communication and communication with the host
while the kernel is running, is enabled by a set of atomic data types
and operations. In particular, fine-grained host/device communication is via atomic operations on SVM.
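For example, a kernel might signal the host through an atomic flag allocated in SVM (a hedged sketch; this assumes an OpenCL 2.0 device with fine-grained SVM and SVM atomics support, and the svm_flag argument is illustrative):

kernel void signal_host(volatile global atomic_int *svm_flag) {
  // ... perform some work ...
  // Release store to SVM: the host, spinning on the same location with a
  // C11 atomic load, observes the signal while the kernel is still running.
  atomic_store_explicit(svm_flag, 1, memory_order_release,
                        memory_scope_all_svm_devices);
}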
Execution Model.
OpenCL [12, p. 31] and CUDA [18] specifically make no guarantees about fair scheduling between workgroups executing the same kernel. HSA provides limited, one-way guarantees, stating [10, p. 46]: “Work-group A can wait for values written by work-group B without deadlock provided ... (if) A comes after B in work-group flattened ID order”. This is not sufficient to support blocking algorithms that use mutexes and inter-workgroup barriers, both of which require symmetric communication between threads.
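To see why, consider a simplified one-shot inter-workgroup barrier (an illustrative sketch, not the paper's implementation; the counter is assumed to start at zero and is never reset). Workgroups that arrive early wait on those that arrive late and vice versa, so the waiting is symmetric and deadlocks unless every workgroup is eventually scheduled:

void one_shot_global_barrier(volatile global atomic_int *count,
                             int num_groups) {
  barrier(CLK_GLOBAL_MEM_FENCE);                // synchronise within the workgroup
  if (get_local_id(0) == 0) {
    atomic_fetch_add(count, 1);                 // announce this workgroup's arrival
    while (atomic_load(count) < num_groups)
      ;                                         // spins forever if some group never runs
  }
  barrier(CLK_GLOBAL_MEM_FENCE);
}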
2.2 Motivating Examples
Work Stealing.
Work stealing enables dynamic balancing of tasks
across processing units. It is useful when the number of tasks to be
 1 kernel graph_app(global graph *g,
 2                  global nodes *n0, global nodes *n1) {
 3   int level = 0;
 4   global nodes *in_nodes = n0;
 5   global nodes *out_nodes = n1;
 6   int tid = get_global_id();
 7   int stride = get_global_size();
 8   while (in_nodes.size > 0) {
 9     for (int i = tid; i < in_nodes.size; i += stride)
10       process_node(g, in_nodes[i], out_nodes, level);
11     swap(&in_nodes, &out_nodes);
12     global_barrier();
13     reset(out_nodes);
14     level++;
15     global_barrier();
16   }
17 }
Figure 3: An OpenCL graph traversal algorithm
processed is dynamic, due to one task creating an arbitrary number
of new tasks. Work stealing has been explored in the context of
GPUs [2, 30]. Each workgroup has a queue from which it obtains
tasks to process, and to which it stores new tasks. If its queue is
empty, a workgroup tries to steal a task from another queue.
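A possible shape for this logic is sketched below (illustrative only; try_pop and try_steal are hypothetical helpers, and the paper's queue representation may differ):

Task *pop_or_steal(global Task *queues, int queue_id) {
  Task *t = try_pop(queues, queue_id);   // the workgroup's own queue first
  int n = get_num_groups(0);
  for (int i = 0; !t && i < n; i++)
    if (i != queue_id)
      t = try_steal(queues, i);          // otherwise try to steal from a victim
  return t;
}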
Figure 2 illustrates a work stealing kernel. Each thread receives
a pointer to the task queues, in global memory, initialised by the
host to contain initial tasks. A thread uses its workgroup id (line 2)
as a queue id to access the relevant task queue. The pop_or_steal function (line 4) pops a task from the workgroup's queue or tries to steal a task from other queues. Although not depicted here, concurrent accesses to queues inside more_work and pop_or_steal are guarded by a mutex per queue, implemented using atomic compare-and-swap operations on global memory.
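Such a per-queue mutex can be built from OpenCL's atomic compare-and-swap (a minimal sketch, assuming a lock word of 0 for free and 1 for held; not the paper's exact code):

void queue_lock(volatile global int *mutex) {
  // Spin until the compare-and-swap changes the lock word from 0 to 1.
  while (atomic_cmpxchg(mutex, 0, 1) != 0)
    ;  // another thread holds the lock; keep spinning
}

void queue_unlock(volatile global int *mutex) {
  atomic_xchg(mutex, 0);  // release the lock atomically
}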
If a task is obtained, then the workgroup processes it (line 6),
which may lead to new tasks being created and pushed to the
workgroup’s queue. The kernel presents two opportunities for spin-
waiting: spinning to obtain a mutex, and spinning in the main kernel
loop to obtain a task. Without fair scheduling, threads waiting for
the mutex might spin indefinitely, causing the application to hang.
Graph Traversal.
Figure 3 illustrates a frontier-based graph traversal algorithm; such algorithms have been shown to execute efficiently on GPUs [1, 19]. The kernel is given three arguments in
global memory: a graph structure, and two arrays of graph nodes.
Initially, n0 contains the starting nodes to process. Private variable level records the current frontier level, and in_nodes and out_nodes point to distinct arrays recording the nodes to be processed during the current and next frontier, respectively.
The application iterates as long as the current frontier contains
nodes to process (line 8). At each frontier, the nodes to be processed are evenly distributed between threads through stride-based processing. In this case, the stride is the total number of threads, obtained via get_global_size. A thread calls process_node to process a node given the current level, with nodes to be processed during the next frontier being pushed to out_nodes. After processing the
frontier, the threads swap their node array pointers (line 11).
At this point, the GPU threads must wait for all other threads
to finish processing the frontier. To achieve this, we use a global