in which an additional lower-power, lower-performance
quad-core ARM A53 is provided on chip, but is not directly
accessible to software and is only activated in low-power
modes. The ARM CPUs and the GPU share 4 GB of
1600-MHz DRAM memory partitioned into 32 banks.
The TX1 features an integrated GPU. Such a GPU tightly
shares DRAM memory with CPU cores, typically draws
between 5 and 15 watts, and requires minimal cooling and
little additional space. The alternative to an integrated GPU is a
discrete GPU. Discrete GPUs are packaged on adapter cards
that plug into a host computer bus, have their own local
DRAM memory that is completely independent from that
used by CPU cores, typically draw between 150 and 250
watts, need active cooling, and occupy substantial space.
B. CUDA Programming Fundamentals
The following is a high-level description of CUDA, the
API for GPU programming provided by NVIDIA.
A GPU is fundamentally a co-processor that performs
operations requested by CPU programs. CUDA programs
use a set of C or C++ library routines to request GPU
operations that are implemented by a combination of
hardware and device-driver software. The typical structure
of a CUDA program is as follows:
(i) allocate GPU-local (device) memory for data; (ii) use the GPU to copy data from host memory to GPU device memory; (iii) launch a program, called a kernel, to run on the GPU cores to compute some function on the data; (iv) use the GPU to copy output data from device memory back to host memory; (v) free the device memory. When invoking a CUDA kernel,
the programmer specifies the number of GPU threads to
use during the kernel’s execution and how the threads
are organized into groups called thread blocks. Having
multiple threads executing the kernel enables the significant
parallelism afforded by GPUs to be exploited. Kernel
launches are always asynchronous, requiring the invoking
CPU process to explicitly wait for them to complete.
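To make this structure concrete, the following is a minimal sketch of the five-step pattern, including an explicit wait after the asynchronous kernel launch; the kernel, data sizes, and launch geometry are our own illustrative choices rather than anything prescribed by CUDA.

#include <cuda_runtime.h>
#include <stdio.h>
#include <stdlib.h>

// Illustrative kernel: each GPU thread scales one array element.
__global__ void scale(float *data, float factor, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] *= factor;
}

int main(void) {
  const int n = 1 << 20;
  size_t bytes = n * sizeof(float);
  float *host = (float *)malloc(bytes);
  for (int i = 0; i < n; i++) host[i] = 1.0f;

  float *dev;
  cudaMalloc(&dev, bytes);                              // (i) allocate device memory
  cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice); // (ii) copy host to device
  scale<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);        // (iii) launch kernel in 256-thread blocks
  cudaDeviceSynchronize();                              // launches are asynchronous; wait here
  cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost); // (iv) copy device back to host
  cudaFree(dev);                                        // (v) free device memory
  printf("host[0] = %f\n", host[0]);
  free(host);
  return 0;
}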
On integrated GPUs, CUDA provides a zero-copy option
whereby programs can simply pass a pointer to the shared
memory where a kernel's data is located; that is, explicit
copying from CPU-local memory to GPU-local memory is
avoided. CUDA also supports a different memory-access
mechanism, called unified memory, on both discrete and
integrated GPUs. Unified memory is similar to zero-copy
memory, as a single memory pointer can be used in both
CPU and GPU code. The difference between unified and
zero-copy memory appears during kernel execution, where,
in the case of unified memory, the GPU driver transparently
transfers data on demand between CPU-local memory and
GPU-local memory.
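As a rough illustration, the following sketch allocates one buffer each way and passes both to the same kernel without any explicit copies; the kernel and sizes are illustrative, though the allocation calls (cudaHostAlloc, cudaHostGetDevicePointer, cudaMallocManaged) are the standard CUDA runtime routines for these mechanisms.

#include <cuda_runtime.h>

// Illustrative kernel: increments each element in place.
__global__ void increment(float *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] += 1.0f;
}

int main(void) {
  cudaSetDeviceFlags(cudaDeviceMapHost); // enable mapped (zero-copy) host memory
  const int n = 1 << 16;
  size_t bytes = n * sizeof(float);

  // Zero-copy: mapped, pinned host memory that the GPU accesses in place.
  float *zc_host, *zc_dev;
  cudaHostAlloc(&zc_host, bytes, cudaHostAllocMapped);
  cudaHostGetDevicePointer(&zc_dev, zc_host, 0);

  // Unified memory: a single pointer valid in both CPU and GPU code; the
  // driver transfers data on demand during kernel execution.
  float *um;
  cudaMallocManaged(&um, bytes);

  for (int i = 0; i < n; i++) { zc_host[i] = 0.0f; um[i] = 0.0f; }

  increment<<<(n + 255) / 256, 256>>>(zc_dev, n); // no explicit copy needed
  increment<<<(n + 255) / 256, 256>>>(um, n);     // driver migrates data on demand
  cudaDeviceSynchronize();

  cudaFreeHost(zc_host);
  cudaFree(um);
  return 0;
}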
CUDA operations pertaining to a given GPU are ordered
by associating them with a stream. By default, there is
a single stream for all programs that share a GPU, but
multiple streams can be optionally created. Operations in
a given stream are executed in FIFO order, but the order
of execution across different streams is determined by the
GPU scheduling in the device driver. Tasks from different
streams may even execute concurrently or out of request
order.
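The following sketch, with an illustrative kernel and sizes, issues one kernel in each of two newly created streams; FIFO ordering holds within each stream, while the relative order of the two launches is left to the driver.

#include <cuda_runtime.h>

// Illustrative kernel standing in for arbitrary per-stream work.
__global__ void work(float *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main(void) {
  const int n = 1 << 20;
  size_t bytes = n * sizeof(float);
  float *d_a, *d_b;
  cudaMalloc(&d_a, bytes);
  cudaMalloc(&d_b, bytes);
  cudaMemset(d_a, 0, bytes);
  cudaMemset(d_b, 0, bytes);

  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);

  // Within each stream, operations execute in FIFO order; across s1 and
  // s2, the driver's scheduler decides, so these kernels may run
  // concurrently or complete out of request order.
  work<<<(n + 255) / 256, 256, 0, s1>>>(d_a, n);
  work<<<(n + 255) / 256, 256, 0, s2>>>(d_b, n);

  cudaStreamSynchronize(s1);
  cudaStreamSynchronize(s2);
  cudaStreamDestroy(s1);
  cudaStreamDestroy(s2);
  cudaFree(d_a);
  cudaFree(d_b);
  return 0;
}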
Programmers can think of a GPU as being abstractly
composed of one or more copy engines (CEs) that implement
transfers of data between host memory and device memory,
and an execution engine (EE) (consisting of many parallel
processors) that executes GPU kernels. The TX1 has a
single CE. EEs and CEs operate concurrently. When
there are multiple streams, kernels and copy operations from
different streams can also operate concurrently, depending
on the GPU hardware. To the best of our knowledge,
complete details of the kernel attributes and policies used by
NVIDIA to schedule kernels and copy operations are not
available.
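As a sketch of how a CE and the EE might be exercised concurrently, the following (with illustrative names and sizes) issues an asynchronous copy in one stream and a kernel in another; whether the two actually overlap depends on the GPU hardware, as noted above.

#include <cuda_runtime.h>

// Illustrative kernel that keeps the EE busy for a while.
__global__ void busy(float *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    for (int k = 0; k < 1000; k++)
      data[i] = data[i] * 1.0001f + 0.0001f;
}

int main(void) {
  const int n = 1 << 22;
  size_t bytes = n * sizeof(float);
  float *h_in, *d_in, *d_work;
  cudaHostAlloc(&h_in, bytes, cudaHostAllocDefault); // pinned host buffer for async copies
  cudaMalloc(&d_in, bytes);
  cudaMalloc(&d_work, bytes);
  cudaMemset(d_work, 0, bytes);
  for (int i = 0; i < n; i++) h_in[i] = 1.0f;

  cudaStream_t copy_s, exec_s;
  cudaStreamCreate(&copy_s);
  cudaStreamCreate(&exec_s);

  // The CE can service this copy while the EE runs the kernel below,
  // because the two operations are issued in different streams.
  cudaMemcpyAsync(d_in, h_in, bytes, cudaMemcpyHostToDevice, copy_s);
  busy<<<(n + 255) / 256, 256, 0, exec_s>>>(d_work, n);

  cudaDeviceSynchronize();
  cudaStreamDestroy(copy_s);
  cudaStreamDestroy(exec_s);
  cudaFreeHost(h_in);
  cudaFree(d_in);
  cudaFree(d_work);
  return 0;
}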
C. Related Work
The black-box nature of GPU programming has limited
both the scheduling and analysis techniques available for
real-time GPU usage. As a result, much prior work treats
a single GPU as an atomic entity—a real-time task locks
an entire GPU, or individual EEs or CEs, for the duration
of any GPU computation. Such an approach is taken in
TimeGraph [14], RGEM [13], GPUSync [9], and several
other frameworks [28, 29, 30, 33]. The viewpoint taken in
all of this work is that GPU co-scheduling must be avoided
because concurrently executing kernels might adversely
interfere with each other. However, we are aware of no work
directed at real-time systems in which such interference is
actually demonstrated or its effects quantified.
In a precursor to this paper, our group conducted an
investigation of the high-level effects of uncontrolled co-
scheduling on the execution times of a variety of image-
processing benchmarks [26]. We conducted this work
using both the NVIDIA TX1 and TK1 (a similar, but
weaker, single-board computer). This work found that
unmanaged co-scheduling can lead to improved average-
case performance. However, we did not examine in depth
how this benefit is achieved or what its limitations are.
Work has also been directed at splitting GPU tasks into
smaller sub-tasks to approximate preemptive execution or
improve utilization [3, 13, 19, 35]. A framework called
Kernelet [34] falls into this category, but is of particular
interest to us because it considers GPU co-scheduling
as a means to improve utilization. Kernelet,
however, requires heavy instrumentation and does not
consider co-scheduling unmodified workloads. Additionally,
the developers of Kernelet do not provide an in-depth
investigation into the GPU’s actual behavior or interference
effects during co-scheduling, which, in fairness, was not
one of their main objectives. Others have published further