NVIDIA CUDA C Programming Best Practices Guide: Accelerating Parallel Computing

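"NVIDIA CUDA C Best Practices Guide" is NVIDIA's official, detailed guide to CUDA programming. This is version 3.1, dated May 19, 2010. The document aims to help developers make full use of the CUDA platform for parallel computing and to improve performance and efficiency. The key points drawn from the chapter overview are:

1. **Parallel computing with CUDA**
   - **Heterogeneous computing**: describes the differences between the host (CPU) and the device (GPU) in a CUDA environment, and when the GPU's strengths, such as large-scale parallel processing and high arithmetic throughput, are worth exploiting.
   - **Hardware support**: discusses the features of CUDA devices, such as the compute capability, and why the driver and CUDA runtime versions matter. Developers are advised to choose an API that suits the target hardware.

2. **Understanding the programming environment**
   - **CUDA Compute Capability**: explains how GPU generations differ in instruction set, floating-point capability, and related features, which matters when writing code that must run on several generations.
   - **Hardware data**: points out additional hardware information, such as device memory size and the limits on thread block and grid dimensions, all of which affect program performance.
   - **Choosing a CUDA API**: distinguishes the C runtime for CUDA (functions such as cudaMalloc and cudaMemcpy) from the driver API (functions such as cuMemAlloc and cuMemcpyHtoD), and gives guidance on when to use each. Libraries such as cuBLAS and cuFFT are built on top of these interfaces and are not themselves part of either API.

3. **Performance metrics**
   - **Timing**: describes how to use CPU and GPU timers to measure execution speed, which is the basis for any optimization work.
   - **Bandwidth**: theoretical bandwidth calculation is part of performance analysis. It helps developers understand how data transfer affects performance and how to use GPU memory efficiently.

By reading this guide, developers can learn how to design CUDA programs effectively, remove performance bottlenecks, keep code compatible across different hardware platforms, and make full use of the parallel processing power of NVIDIA GPUs. It is a valuable reference for programmers working on GPU-accelerated computing or deep learning.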

Chapter 1.

Parallel Computing with CUDA

3 CUDA C Best Practices Guide Version 3.1

elements transferred, so the ratio of operations to elements transferred is 1:3 or O(1). Performance benefits can be more readily achieved when this ratio is higher. For example, a matrix multiplication of the same matrices requires N³ operations (multiply-add), so the ratio of operations to elements transferred is O(N), in which case the larger the matrix the greater the performance benefit. The types of operations are an additional factor, as additions have different complexity profiles than, for example, trigonometric functions. It is important to include the overhead of transferring data to and from the device in determining whether operations should be performed on the host or on the device.

- Data should be kept on the device as long as possible. Because transfers should be minimized, programs that run multiple kernels on the same data should favor leaving the data on the device between kernel calls, rather than transferring intermediate results to the host and then sending them back to the device for subsequent calculations. So, in the previous example, had the two matrices to be added already been on the device as a result of some previous calculation, or if the results of the addition would be used in some subsequent calculation, the matrix addition should be performed locally on the device. This approach should be used even if one of the steps in a sequence of calculations could be performed faster on the host. Even a relatively slow kernel may be advantageous if it avoids one or more PCIe transfers. Section 3.1 provides further details, including the measurements of bandwidth between the host and the device versus within the device proper.

1.1.3 Maximum Performance Benefit

High Priority: To get the maximum benefit from CUDA, focus first on finding ways to parallelize sequential code.

The amount of performance benefit an application will realize by running on CUDA depends entirely on the extent to which it can be parallelized. As mentioned previously, code that cannot be sufficiently parallelized should run on the host, unless doing so would result in excessive transfers between the host and the device.

Amdahl's law specifies the maximum speed-up that can be expected by parallelizing portions of a serial program. Essentially, it states that the maximum speed-up (S) of a program is

$$S = \frac{1}{(1 - P) + \frac{P}{N}}$$

where P is the fraction of the total serial execution time taken by the portion of code that can be parallelized and N is the number of processors over which the parallel portion of the code runs.

The larger N is (that is, the greater the number of processors), the smaller the P/N fraction. It can be simpler to view N as a very large number, which essentially transforms the equation into $S = 1 / (1 - P)$. Now, if ¾ of a program is parallelized, the maximum speed-up over serial code is 1 / (1 - ¾) = 4.


The major and minor revision numbers of the compute capability are shown on the third and fourth lines of Figure 1.1. Device 0 of this system has compute capability 1.1.

More details about the compute capabilities of various GPUs are in Appendix A of the CUDA Programming Guide. In particular, developers should note the number of multiprocessors on the device, the number of registers and the amount of memory available, and any special capabilities of the device.

1.2.2 Additional Hardware Data

Certain hardware features are not described by the compute capability. For example, the ability to overlap kernel execution with asynchronous data transfers between the host and the device is available on most but not all GPUs with compute capability 1.1. In such cases, call cudaGetDeviceProperties() to determine whether the device is capable of a certain feature. For example, the deviceOverlap field of the device property structure indicates whether overlapping kernel execution and data transfers is possible (displayed in the "Concurrent copy and execution" line of Figure 1.1); likewise, the canMapHostMemory field indicates whether zero-copy data transfers can be performed.
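A query along these lines might look as follows. This is a hedged sketch, not the guide's own listing: it is host-side C that calls the CUDA runtime API, so it needs the CUDA toolkit headers, linking against the CUDA runtime library, and a machine with a CUDA driver. The fields printed are the ones named in the text. Note that newer toolkits mark deviceOverlap as deprecated, so the field names should be read as tied to the guide's era. No output is shown because it depends on the installed hardware.

```c
#include <stdio.h>
#include <cuda_runtime_api.h>  /* C-compatible CUDA runtime declarations */

int main(void)
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        struct cudaDeviceProp prop;

        err = cudaGetDeviceProperties(&prop, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "device %d: %s\n", dev, cudaGetErrorString(err));
            continue;
        }

        printf("Device %d: %s\n", dev, prop.name);
        printf("  compute capability:              %d.%d\n", prop.major, prop.minor);
        printf("  multiprocessors:                 %d\n", prop.multiProcessorCount);
        /* Features not implied by the compute capability alone. */
        printf("  concurrent copy and execution:   %s\n",
               prop.deviceOverlap ? "yes" : "no");
        printf("  can map host memory (zero-copy): %s\n",
               prop.canMapHostMemory ? "yes" : "no");
    }
    return 0;
}
```

Checking the property at run time, rather than assuming it from the compute capability, lets an application fall back to synchronous transfers on devices that lack the feature.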

1.2.3 C Runtime for CUDA and Driver API Version

The CUDA driver API and the C runtime for CUDA are two of the programming interfaces to CUDA. Their version number enables developers to check the features associated with these APIs and decide whether an application requires a newer (later) version than the one currently installed. This is important because the CUDA driver API is backward compatible but not forward compatible, meaning that applications, plug-ins, and libraries (including the C runtime for CUDA) compiled against a particular version of the driver API will continue to work on subsequent (later) driver releases. However, applications, plug-ins, and libraries (including the C runtime for CUDA) compiled against a particular version of the driver API may not work on earlier versions of the driver, as illustrated in Figure 1.2.
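One way to apply this at run time is to compare the two version numbers before doing any real work. The sketch below is an illustration under stated assumptions, not code from the guide: it is host-side C that uses cudaDriverGetVersion() and cudaRuntimeGetVersion() from the CUDA runtime API, and it needs the CUDA toolkit to build. Both calls report the version as 1000 * major + 10 * minor, and the driver call reports 0 when no driver is installed. The warning policy is a hypothetical choice.

```c
#include <stdio.h>
#include <cuda_runtime_api.h>  /* C-compatible CUDA runtime declarations */

/* Print a version encoded as 1000 * major + 10 * minor. */
static void print_version(const char *label, int v)
{
    printf("%s %d.%d\n", label, v / 1000, (v % 1000) / 10);
}

int main(void)
{
    int driver = 0;
    int runtime = 0;

    if (cudaDriverGetVersion(&driver) != cudaSuccess ||
        cudaRuntimeGetVersion(&runtime) != cudaSuccess) {
        fprintf(stderr, "could not query CUDA versions\n");
        return 1;
    }

    if (driver == 0) {
        fprintf(stderr, "no CUDA driver is installed\n");
        return 1;
    }

    print_version("driver supports CUDA", driver);
    print_version("runtime is CUDA     ", runtime);

    /* The driver is backward compatible only: a runtime newer than the
     * installed driver may not work, so report it instead of continuing. */
    if (driver < runtime) {
        fprintf(stderr, "installed driver is older than the runtime; "
                        "a driver update is likely required\n");
        return 2;
    }
    return 0;
}
```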
