Execution pipelines on CPUs are designed to minimize latency for one or two threads at a time each, whereas GPUs
are designed to handle a large number of concurrent, lightweight threads in order to
maximize throughput.
RAM. The host system and the device each have their own distinct attached physical
memories. As the host and device memories are separated by the PCI Express (PCIe)
bus, items in the host memory must occasionally be communicated across the bus
to the device memory, or vice versa, as described in What Runs on a CUDA-Enabled
Device?
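As a minimal sketch of such a transfer (the buffer size N and the omission of error
checking are illustrative assumptions, not part of the guide's examples), data might be
staged across the bus as follows:

    #include <stdlib.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        const size_t N = 1 << 20;                /* illustrative element count */
        const size_t bytes = N * sizeof(float);

        float *h_data = (float *)malloc(bytes);  /* host (CPU) memory */
        float *d_data = NULL;                    /* device (GPU) memory */
        cudaMalloc((void **)&d_data, bytes);

        /* Host-to-device copy: crosses the PCIe bus */
        cudaMemcpy(d_data, h_data, bytes, cudaMemcpyHostToDevice);

        /* ... kernels would operate on d_data here ... */

        /* Device-to-host copy: crosses the bus again */
        cudaMemcpy(h_data, d_data, bytes, cudaMemcpyDeviceToHost);

        cudaFree(d_data);
        free(h_data);
        return 0;
    }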
These are the primary hardware differences between CPU hosts and GPU devices with
respect to parallel programming. Other differences are discussed as they arise elsewhere
in this document. Applications composed with these differences in mind can treat the
host and device together as a cohesive heterogeneous system wherein each processing
unit is leveraged to do the kind of work it does best: sequential work on the host and
parallel work on the device.
1.2 What Runs on a CUDA-Enabled Device?
The following issues should be considered when determining what parts of an
application to run on the device:
‣ The device is ideally suited for computations that can be run on numerous data
elements simultaneously in parallel. This typically involves arithmetic on large
data sets (such as matrices) where the same operation can be performed across
thousands, if not millions, of elements at the same time. This is a requirement for
good performance on CUDA: the software must use a large number (generally
thousands or tens of thousands) of concurrent threads. The support for running
numerous threads in parallel derives from CUDA's use of a lightweight threading
model described above; a kernel sketch illustrating this follows the list.
‣ For best performance, there should be some coherence in memory access by adjacent
threads running on the device. Certain memory access patterns enable the hardware
to coalesce groups of reads or writes of multiple data items into one operation; the
sketch after this list shows one such pattern. Data that cannot be laid out so as to
enable coalescing, or that doesn't have enough locality to use the L1 or texture caches
effectively, will tend to see lesser speedups when used in computations on CUDA.
‣ To use CUDA, data values must be transferred from the host to the device along
the PCI Express (PCIe) bus. These transfers are costly in terms of performance and
should be minimized. (See Data Transfer Between Host and Device.) This cost has
several ramifications:
‣ The complexity of operations should justify the cost of moving data to and from
the device. Code that transfers data for brief use by a small number of threads
will see little or no performance benefit. The ideal scenario is one in which many
threads perform a substantial amount of work.
For example, transferring two matrices to the device to perform a matrix
addition and then transferring the results back to the host will not realize much
performance benefit. The issue here is the number of operations performed per
data element transferred, as quantified just after this list.
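To make this ratio concrete (a back-of-the-envelope sketch assuming hypothetical N×N
single-precision matrices), the matrix addition above moves 2N² elements to the device
and N² results back, so 3N² elements cross the PCIe bus while only N² additions are
performed. The ratio of operations to elements transferred is thus 1:3, or O(1) work per
element moved, which is why the copies dominate.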
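As a minimal sketch of the first two points above (the kernel name vecAdd, the block
size of 256, and the element count are illustrative assumptions), each of many
lightweight threads handles one data element, and consecutive threads access
consecutive addresses so that the hardware can coalesce the loads and stores:

    __global__ void vecAdd(const float *a, const float *b, float *c, int n)
    {
        /* One lightweight thread per data element */
        int i = blockIdx.x * blockDim.x + threadIdx.x;

        /* Thread i touches element i: adjacent threads read and write
           adjacent addresses, a pattern the hardware can coalesce */
        if (i < n)
            c[i] = a[i] + b[i];
    }

    /* Launched with thousands or millions of concurrent threads, e.g.:
       vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n); */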