CUDA Programming Best Practices Guide


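"The CUDA Best Practices Guide, version 3.2, published by NVIDIA on August 20, 2010, introduces CUDA parallel computing and its performance optimization strategies. The document is aimed mainly at CUDA developers and offers a series of recommendations and best practices covering the CUDA environment, API usage, and performance metrics and measurement methods."

CUDA is a parallel computing platform and programming model developed by NVIDIA that lets programmers use the graphics processing unit (GPU) for general-purpose computing. Under the "cuda start" topic, we can look more closely at CUDA's core concepts and best practices.

**1. Heterogeneous Computing and CUDA**

1.1 Heterogeneous computing means using the CPU and the GPU together to carry out a task. CUDA lets developers exploit the computing power of the GPU, which is usually more efficient than the CPU on problems with a large amount of data parallelism.

1.1.1 CPUs and GPUs differ significantly: the CPU is good at complex control flow and at operations on small amounts of data, while the GPU is designed to run a very large number of simple operations in parallel, such as pixel rendering or arithmetic.

1.1.2 A CUDA-enabled device mainly carries out two kinds of work: computation (through CUDA kernels) and graphics processing (through the traditional GPU path).

1.1.3 The key to the maximum performance benefit is to use the GPU's parallelism effectively, so that enough work is spread across its many stream processors.

**2. CUDA Programming Environment**

1.2 Understanding the CUDA programming environment includes understanding the compute capability, which identifies the level of CUDA features that a GPU supports.

1.2.1 The compute capability describes the features of the hardware, such as the supported instruction set, the maximum number of threads per block, and the number of registers per multiprocessor.

1.2.2 Additional hardware data covers features that the compute capability does not describe, such as whether kernel execution can overlap with host-device data transfers and whether zero-copy transfers are supported.

1.2.3 The versions of the CUDA runtime and the driver API must be compatible with the installed driver; weigh compatibility against the features you need when choosing which version to target.

**3. CUDA API**

1.3 The CUDA APIs are the tools for interacting with the GPU; they consist of the runtime and the driver API.

1.3.1 The CUDA runtime suits most applications and provides a higher-level, more convenient programming model.

1.3.2 The CUDA driver API gives lower-level control but takes more programming effort; it suits cases that need fine-grained control.

1.3.3 When choosing an API, balance ease of use against control according to the needs of the project.

1.3.4 Comparing code written against the two APIs helps developers understand how each works and how they differ.

**4. Performance Metrics**

2.1 Measuring performance is the key to optimizing a CUDA program. CPU and GPU timers can be used to measure the execution time of a section of code.

2.1.1 CPU timers track operations on the CPU, while CUDA GPU timers measure activity that runs on the GPU.

2.2 Bandwidth is an important measure of GPU performance and comes in two forms: theoretical bandwidth and effective bandwidth.

2.2.1 Theoretical bandwidth is calculated from the GPU's memory specifications and reflects the maximum data transfer rate under ideal conditions.

2.2.2 Effective bandwidth reflects the transfer rate actually achieved by an application, which may be lower because of memory access patterns, data alignment, and similar factors.

The above covers only part of the CUDA Best Practices Guide. The full document goes on to discuss memory management, error handling, thread organization, and optimization strategies in detail, to help developers get the most out of CUDA and write efficient parallel programs.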

Chapter 1.

Parallel Computing with CUDA

CUDA C Best Practices Guide Version 3.2 4

For most purposes, the key point is that the greater P is, the greater the speed-up. An additional caveat is implicit in this equation, which is that if P is a small number (so not substantially parallel), increasing N does little to improve performance. To get the largest lift, best practices suggest spending most effort on increasing P; that is, by maximizing the amount of code that can be parallelized.
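
The equation referred to here is Amdahl's law, S = 1 / ((1 - P) + P / N), where P is the fraction of the serial execution time that can be parallelized and N is the number of processors the parallel portion runs on. As a rough numerical illustration, the sketch below evaluates it in plain C++. The function name amdahl_speedup and the sample values in the note that follows are assumptions made for this example, not part of the guide.

```cpp
#include <cassert>

// Amdahl's law: speed-up S = 1 / ((1 - P) + P / N).
// p: fraction of the serial run time that can be parallelized (0 to 1).
// n: number of processors the parallel portion runs on (at least 1).
// The name amdahl_speedup is illustrative only.
double amdahl_speedup(double p, double n) {
    assert(p >= 0.0 && p <= 1.0);
    assert(n >= 1.0);
    return 1.0 / ((1.0 - p) + p / n);
}
```

With P = 0.1, moving from N = 10 to N = 1000 leaves the speed-up below 1.12 in both cases, whereas P = 0.9 at N = 1000 gives a speed-up above 9, which is why effort is better spent on increasing P.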

1.2 Understanding the Programming Environment

With each generation of NVIDIA processors, new features are added to the GPU that CUDA can leverage. Consequently, it’s important to understand the characteristics of the architecture.

Programmers should be aware of two version numbers. The first is the compute capability, and the second is the version number of the runtime and driver APIs.

1.2.1 CUDA Compute Capability

The compute capability describes the features of the hardware and reflects the set of instructions supported by the device as well as other specifications, such as the maximum number of threads per block and the number of registers per multiprocessor. Higher compute capability versions are supersets of lower (that is, earlier) versions, and so they are backward compatible.

The compute capability of the GPU in the device can be queried programmatically as illustrated in the CUDA SDK in the deviceQuery sample. The output for that program is shown in Figure 1.1. This information is obtained by calling cudaGetDeviceProperties() and accessing the information in the structure it returns.

Figure 1.1 Sample CUDA configuration data reported by deviceQuery


The major and minor revision numbers of the compute capability are shown on the third and fourth lines of Figure 1.1. Device 0 of this system has compute capability 1.1.

More details about the compute capabilities of various GPUs are in Appendix A of the CUDA C Programming Guide. In particular, developers should note the number of multiprocessors on the device, the number of registers and the amount of memory available, and any special capabilities of the device.

1.2.2 Additional Hardware Data

Certain hardware features are not described by the compute capability. For example, the ability to overlap kernel execution with asynchronous data transfers between the host and the device is available on most but not all GPUs with compute capability 1.1. In such cases, call cudaGetDeviceProperties() to determine whether the device is capable of a certain feature. For example, the deviceOverlap field of the device property structure indicates whether overlapping kernel execution and data transfers is possible (displayed in the "Concurrent copy and execution" line of Figure 1.1); likewise, the canMapHostMemory field indicates whether zero-copy data transfers can be performed.

1.2.3 C Runtime for CUDA and Driver API Version

The CUDA driver API and the C runtime for CUDA are two of the programming interfaces to CUDA. Their version number enables developers to check the features associated with these APIs and decide whether an application requires a newer (later) version than the one currently installed. This is important because the CUDA driver API is backward compatible but not forward compatible, meaning that applications, plug-ins, and libraries (including the C runtime for CUDA) compiled against a particular version of the driver API will continue to work on subsequent (later) driver releases. However, applications, plug-ins, and libraries (including the C runtime for CUDA) compiled against a particular version of the driver API may not work on earlier versions of the driver, as illustrated in Figure 1.2.

[Diagram: drivers 1.0, 1.1, 2.0, ... shown alongside the apps, libs, and plug-ins built against each version, with links marked Compatible or Incompatible.]

Figure 1.2 Compatibility of CUDA versions
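
The compatibility rule described in this section can be modeled as a simple version comparison. The sketch below is only an illustration in plain C++: the DriverVersion type and the functions same_or_later and can_run_on are hypothetical names, not CUDA API calls, and the "may not work" case on an earlier driver is treated conservatively as incompatible.

```cpp
#include <cassert>

// Hypothetical (major, minor) pair identifying a driver API release.
struct DriverVersion {
    int major_number;
    int minor_number;
};

// True when release `a` is the same as release `b` or a later one.
bool same_or_later(DriverVersion a, DriverVersion b) {
    if (a.major_number != b.major_number) {
        return a.major_number > b.major_number;
    }
    return a.minor_number >= b.minor_number;
}

// Backward compatible but not forward compatible: a binary compiled against
// `built_against` is expected to keep working only when the installed driver
// is the same release or a later one.
bool can_run_on(DriverVersion built_against, DriverVersion installed) {
    assert(built_against.major_number >= 1 && installed.major_number >= 1);
    return same_or_later(installed, built_against);
}
```

For instance, can_run_on(DriverVersion{1, 1}, DriverVersion{2, 0}) is true, while can_run_on(DriverVersion{2, 0}, DriverVersion{1, 1}) is false.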


1.2.4 Which Version to Target

When in doubt about the compute capability of the hardware that will be present at runtime, it is best to assume a compute capability of 1.0 as defined in the CUDA C Programming Guide, Section G.1.

To target specific versions of NVIDIA hardware and CUDA software, use the -arch, -code, and -gencode options of nvcc. Code that contains double-precision arithmetic, for example, must be compiled with "-arch=sm_13" (or higher compute capability); otherwise, double-precision arithmetic will get demoted to single-precision arithmetic (see Section 7.2.1). This and other compiler switches are discussed further in Appendix B.
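
As a small illustration of the double-precision rule, the sketch below compares a target compute capability against 1.3. It is plain C++ with hypothetical names (ComputeCapability, at_least, keeps_double_precision); in a real program the two numbers would presumably come from the major and minor fields of the structure filled in by cudaGetDeviceProperties(), which this example does not call.

```cpp
#include <cassert>

// Hypothetical (major, minor) pair for a compute capability such as 1.3.
struct ComputeCapability {
    int major_rev;
    int minor_rev;
};

// True when `device` meets or exceeds `required`. Higher compute
// capabilities are supersets of lower ones, so "at least" is sufficient.
bool at_least(ComputeCapability device, ComputeCapability required) {
    assert(device.major_rev >= 1 && required.major_rev >= 1);
    if (device.major_rev != required.major_rev) {
        return device.major_rev > required.major_rev;
    }
    return device.minor_rev >= required.minor_rev;
}

// Double-precision arithmetic is kept only for targets of 1.3 or higher;
// below that it is demoted to single precision.
bool keeps_double_precision(ComputeCapability target) {
    return at_least(target, ComputeCapability{1, 3});
}
```

Under this model, the conservative default target of 1.0 and the compute capability 1.1 device from Figure 1.1 both return false, while 1.3 and 2.0 return true.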

1.3 CUDA APIs

The host runtime component of the CUDA software environment can be used only by host functions. It provides functions to handle the following:

• Device management
• Context management
• Memory management
• Code module management
• Execution control
• Texture reference management
• Interoperability with OpenGL and Direct3D

It comprises two APIs:

• A low-level API called the CUDA driver API
• A higher-level API called the C runtime for CUDA that is implemented on top of the CUDA driver API

These APIs are mutually exclusive: An application should use one or the other.

The C runtime for CUDA, which is the more commonly used API, eases device code management by providing implicit initialization, context management, and module management. The C host code generated by nvcc is based on the C runtime for CUDA, so applications that link to this code must use the C runtime for CUDA.

In contrast, the CUDA driver API requires more code and is somewhat harder to program and debug, but it offers a better level of control. In particular, it is more difficult to configure and launch kernels using the CUDA driver API, since the execution configuration and kernel parameters must be specified with explicit function calls instead of the execution configuration syntax (<<<…>>>). Note that the APIs relate only to host code; the kernels that are executed on the device are the same, regardless of which API is used.

The two APIs can be easily distinguished, because the CUDA driver API is delivered through the nvcuda dynamic library and all its entry points are prefixed with cu; while the C runtime for CUDA is delivered through the cudart dynamic library and all its entry points are prefixed with cuda.

1.3.1 C Runtime for CUDA

The C runtime for CUDA handles kernel loading and setting up kernel parameters and launch configuration before the kernel is launched. The implicit code initialization, CUDA context management, CUDA module management (cubin to function mapping), kernel configuration, and parameter passing are all performed by the C runtime for CUDA.

It comprises two principal parts:

• The low-level functions (cuda_runtime_api.h) have a C-style interface.
• The high-level functions (cuda_runtime.h) have a C++-style interface built on top of the low-level functions.

The functions that make up this API are explained in the CUDA Reference Manual.

1.3.2 CUDA Driver API

The driver API is a lower-level API than the runtime API. When compared with the runtime API, the driver API has these advantages:

• No dependency on the runtime library
• More control over devices (for example, only the driver API enables one CPU thread to control multiple GPUs; see Chapter 8)
• No C extensions in the host code, so the host code can be compiled with compilers other than nvcc and the host compiler it calls by default

Its primary disadvantages, as mentioned in Section 1.3, are as follows:

• Verbose code
• Greater difficulty in debugging

A key point is that for every runtime API function, there is an equivalent driver API function. The driver API, however, includes other functions missing in the runtime API, such as those for migrating a context from one host thread to another.

For more information on the driver API, refer to Section 3.3 of the CUDA C Programming Guide.

1.3.3 When to Use Which API

The previous section lists some of the salient differences between the two APIs. Additional considerations include the following:

Driver API–only features:

• Context management
• Support for 16-bit floating-point textures
