1 kernel work_stealing(global Task *queues) {
2   int queue_id = get_group_id();
3   while (more_work(queues)) {
4     Task *t = pop_or_steal(queues, queue_id);
5     if (t)
6       process_task(t, queues, queue_id);
7   }
8 }
Figure 2: An excerpt of a work stealing algorithm in OpenCL
2 BACKGROUND AND EXAMPLES
We outline the OpenCL programming model on which we base
cooperative kernels (Sec. 2.1), and illustrate OpenCL and the sched-
uling requirements of irregular algorithms using two examples: a
work stealing queue and frontier-based graph traversal (Sec. 2.2).
2.1 OpenCL Background
An OpenCL program is divided into host and device components.
A host application runs on the CPU and launches one or more
kernels that run on accelerator devices—GPUs in the context of this
paper. A kernel is written in OpenCL C, based on C99. All threads
executing a kernel start at the same entry function with identical
arguments. A thread can call
get_global_id
to obtain a unique id,
to access distinct data or follow dierent control ow paths.
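For example, a simple data-parallel kernel might use the global id to assign each thread a distinct array element (a minimal sketch; the vec_add kernel and its arguments are illustrative, not from the paper):

kernel void vec_add(global const float *a, global const float *b,
                    global float *c) {
  int gid = get_global_id(0);  // unique id across all threads (dimension 0)
  c[gid] = a[gid] + b[gid];    // each thread processes a distinct element
}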
The threads of a kernel are divided into workgroups. Functions get_local_id and get_group_id return a thread's local id within its workgroup and the workgroup id. The number of threads per workgroup and number of workgroups are obtained via get_local_size and get_num_groups. Execution of the threads in a workgroup can be synchronised via a workgroup barrier. A global barrier (synchronising all threads of a kernel) is not provided as a primitive.
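For instance, a workgroup barrier is typically used to make data staged in local memory by one thread visible to the other threads of the same workgroup (an illustrative sketch, not code from the paper):

kernel void reverse_tiles(global int *data, local int *tile) {
  int lid = get_local_id(0);
  tile[lid] = data[get_global_id(0)];  // stage one element into local memory
  barrier(CLK_LOCAL_MEM_FENCE);        // all local writes now visible in the workgroup
  // Safe to read an element written by another thread of this workgroup:
  data[get_global_id(0)] = tile[get_local_size(0) - 1 - lid];
}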
Memory Spaces and Memory Model.
A kernel has access to four
memory spaces. Shared virtual memory (SVM) is accessible to all
device threads and the host concurrently. Global memory is shared
among all device threads. Each workgroup has a portion of local
memory for fast intra-workgroup communication. Every thread has
a portion of very fast private memory for function-local variables.
Fine-grained communication within a workgroup, as well as
inter-workgroup communication and communication with the host
while the kernel is running, is enabled by a set of atomic data types
and operations. In particular, fine-grained host/device communication is via atomic operations on SVM.
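For example, a kernel might signal the host through an atomic flag allocated in SVM (a hedged sketch; this assumes an OpenCL 2.0 device with fine-grained SVM and SVM atomics support, and the svm_flag argument is illustrative):

kernel void signal_host(volatile global atomic_int *svm_flag) {
  // ... perform some work ...
  // Release store to SVM: the host, spinning on the same location with a
  // C11 atomic load, observes the signal while the kernel is still running.
  atomic_store_explicit(svm_flag, 1, memory_order_release,
                        memory_scope_all_svm_devices);
}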
Execution Model.
OpenCL [12, p. 31] and CUDA [18] specifically make no guarantees about fair scheduling between workgroups executing the same kernel. HSA provides limited, one-way guarantees, stating [10, p. 46]: “Work-group A can wait for values written by work-group B without deadlock provided ... (if) A comes after B in work-group flattened ID order”. This is not sufficient to support blocking algorithms that use mutexes and inter-workgroup barriers, both of which require symmetric communication between threads.
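To see why, consider a simplified one-shot inter-workgroup barrier (an illustrative sketch, not the paper's implementation; the counter is assumed to start at zero and is never reset). Workgroups that arrive early wait on those that arrive late and vice versa, so the waiting is symmetric and deadlocks unless every workgroup is eventually scheduled:

void one_shot_global_barrier(volatile global atomic_int *count,
                             int num_groups) {
  barrier(CLK_GLOBAL_MEM_FENCE);                // synchronise within the workgroup
  if (get_local_id(0) == 0) {
    atomic_fetch_add(count, 1);                 // announce this workgroup's arrival
    while (atomic_load(count) < num_groups)
      ;                                         // spins forever if some group never runs
  }
  barrier(CLK_GLOBAL_MEM_FENCE);
}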
2.2 Motivating Examples
Work Stealing.
Work stealing enables dynamic balancing of tasks
across processing units. It is useful when the number of tasks to be
 1 kernel graph_app(global graph *g,
 2                  global nodes *n0, global nodes *n1) {
 3   int level = 0;
 4   global nodes *in_nodes = n0;
 5   global nodes *out_nodes = n1;
 6   int tid = get_global_id();
 7   int stride = get_global_size();
 8   while (in_nodes.size > 0) {
 9     for (int i = tid; i < in_nodes.size; i += stride)
10       process_node(g, in_nodes[i], out_nodes, level);
11     swap(&in_nodes, &out_nodes);
12     global_barrier();
13     reset(out_nodes);
14     level++;
15     global_barrier();
16   }
17 }
Figure 3: An OpenCL graph traversal algorithm
processed is dynamic, due to one task creating an arbitrary number
of new tasks. Work stealing has been explored in the context of
GPUs [2, 30]. Each workgroup has a queue from which it obtains
tasks to process, and to which it stores new tasks. If its queue is
empty, a workgroup tries to steal a task from another queue.
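A possible shape for this logic is sketched below (illustrative only; try_pop and try_steal are hypothetical helpers, and the paper's queue representation may differ):

Task *pop_or_steal(global Task *queues, int queue_id) {
  Task *t = try_pop(queues, queue_id);   // the workgroup's own queue first
  int n = get_num_groups(0);
  for (int i = 0; !t && i < n; i++)
    if (i != queue_id)
      t = try_steal(queues, i);          // otherwise try to steal from a victim
  return t;
}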
Figure 2 illustrates a work stealing kernel. Each thread receives
a pointer to the task queues, in global memory, initialised by the
host to contain initial tasks. A thread uses its workgroup id (line 2)
as a queue id to access the relevant task queue. The pop_or_steal function (line 4) pops a task from the workgroup's queue or tries to steal a task from other queues. Although not depicted here, concurrent accesses to queues inside more_work and pop_or_steal are guarded by a mutex per queue, implemented using atomic compare-and-swap operations on global memory.
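Such a per-queue mutex can be built from OpenCL's atomic compare-and-swap (a minimal sketch, assuming a lock word of 0 for free and 1 for held; not the paper's exact code):

void queue_lock(volatile global int *mutex) {
  // Spin until the compare-and-swap changes the lock word from 0 to 1.
  while (atomic_cmpxchg(mutex, 0, 1) != 0)
    ;  // another thread holds the lock; keep spinning
}

void queue_unlock(volatile global int *mutex) {
  atomic_xchg(mutex, 0);  // release the lock atomically
}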
If a task is obtained, then the workgroup processes it (line 6),
which may lead to new tasks being created and pushed to the
workgroup’s queue. The kernel presents two opportunities for spin-
waiting: spinning to obtain a mutex, and spinning in the main kernel
loop to obtain a task. Without fair scheduling, threads waiting for
the mutex might spin indefinitely, causing the application to hang.
Graph Traversal.
Figure 3 illustrates a frontier-based graph traversal algorithm; such algorithms have been shown to execute efficiently on GPUs [1, 19]. The kernel is given three arguments in
global memory: a graph structure, and two arrays of graph nodes.
Initially, n0 contains the starting nodes to process. Private variable level records the current frontier level, and in_nodes and out_nodes point to distinct arrays recording the nodes to be processed during the current and next frontier, respectively.
The application iterates as long as the current frontier contains
nodes to process (line 8). At each frontier, the nodes to be processed are evenly distributed between threads through stride-based processing. In this case, the stride is the total number of threads, obtained via get_global_size. A thread calls process_node to process a node given the current level, with nodes to be processed during the next frontier being pushed to out_nodes. After processing the
frontier, the threads swap their node array pointers (line 11).
At this point, the GPU threads must wait for all other threads
to finish processing the frontier. To achieve this, we use a global