RAM
The host system and the device each have their own distinct attached physical memories¹. Because
the host and device memories are separate, data must occasionally be transferred between host
memory and device memory, as described in What Runs on a CUDA-Enabled Device?
These are the primary hardware differences between CPU hosts and GPU devices with respect
to parallel programming. Other differences are discussed as they arise elsewhere in this
document. Applications composed with these differences in mind can treat the host and device
together as a cohesive heterogeneous system wherein each processing unit is leveraged to do
the kind of work it does best: sequential work on the host and parallel work on the device.
2.2. What Runs on a CUDA-Enabled Device?
The following issues should be considered when determining what parts of an application to
run on the device:
‣ The device is ideally suited for computations that can be run on numerous data elements
simultaneously in parallel. This typically involves arithmetic on large data sets (such as
matrices) where the same operation can be performed across thousands, if not millions,
of elements at the same time. This is a requirement for good performance on CUDA:
the software must use a large number (generally thousands or tens of thousands) of
concurrent threads. The support for running numerous threads in parallel derives from
CUDA's use of a lightweight threading model described above.
‣ To use CUDA, data values must be transferred from the host to the device. These transfers
are costly in terms of performance and should be minimized. (See Data Transfer Between
Host and Device.) This cost has several ramifications:
‣ The complexity of operations should justify the cost of moving data to and from the
device. Code that transfers data for brief use by a small number of threads will see
little or no performance benefit. The ideal scenario is one in which many threads
perform a substantial amount of work.
For example, transferring two matrices to the device to perform a matrix addition
and then transferring the results back to the host will not realize much performance
benefit. The issue here is the number of operations performed per data element
transferred. For the preceding procedure, assuming matrices of size N×N, there are N² operations
(additions) and 3N² elements transferred, so the ratio of operations to elements transferred is
1:3 or O(1). Performance benefits can be more readily achieved when this ratio is higher. For
example, a matrix multiplication of the same matrices requires N³ operations (multiply-add), so
the ratio of operations to elements transferred is O(N), in which case the larger the matrix the
greater the performance
benefit. The types of operations are an additional factor, as additions have different
complexity profiles than, for example, trigonometric functions. It is important to include the
overhead of transferring data to and from the device in determining whether operations should
be performed on the host or on the device (see the sketch following this list).
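To make this ratio concrete, the following sketch performs the matrix addition discussed above;
the kernel name matrixAdd, the size N = 1024, and the launch configuration are illustrative
choices, not prescribed by this guide. Three N×N matrices (3N² elements) cross between host and
device, the two inputs in and the result out, while the device performs only N² additions, even
though that work is spread across roughly a million concurrent threads. Replacing the addition
with a matrix multiplication of the same operands would raise the arithmetic to O(N³) while
leaving the transfer volume unchanged.

#include <cuda_runtime.h>
#include <vector>

__global__ void matrixAdd(const float *A, const float *B, float *C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n)
        C[row * n + col] = A[row * n + col] + B[row * n + col];  // one add per element
}

int main(void)
{
    const int N = 1024;                                  // ~one million elements (and threads)
    const size_t bytes = (size_t)N * N * sizeof(float);
    std::vector<float> hA(N * N, 1.0f), hB(N * N, 2.0f), hC(N * N);

    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);

    // 2*N*N elements transferred to the device ...
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    // ... for only N*N additions performed on the device ...
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    matrixAdd<<<grid, block>>>(dA, dB, dC, N);

    // ... and another N*N elements transferred back: an operation-to-transfer ratio of 1:3.
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return 0;
}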
¹ On Systems on a Chip with integrated GPUs, such as NVIDIA® Tegra®, host and device memory are
physically the same, but there is still a logical distinction between host and device memory.
See the Application Note on CUDA for Tegra for details.