RAM
The host system and the device each have their own distinct attached physical memories¹. Because
the host and device memories are separate, data must occasionally be transferred between host
memory and device memory, as described in What Runs on a CUDA-Enabled Device?
These are the primary hardware differences between CPU hosts and GPU devices with respect
to parallel programming. Other differences are discussed as they arise elsewhere in this
document. Applications composed with these differences in mind can treat the host and device
together as a cohesive heterogeneous system wherein each processing unit is leveraged to do
the kind of work it does best: sequential work on the host and parallel work on the device.
2.2. What Runs on a CUDA-Enabled Device?
The following issues should be considered when determining what parts of an application to
run on the device:
‣ The device is ideally suited for computations that can be run on numerous data elements
simultaneously in parallel. This typically involves arithmetic on large data sets (such as
matrices) where the same operation can be performed across thousands, if not millions,
of elements at the same time. This is a requirement for good performance on CUDA:
the software must use a large number (generally thousands or tens of thousands) of
concurrent threads. The support for running numerous threads in parallel derives from
CUDA's use of a lightweight threading model described above.
‣ To use CUDA, data values must be transferred from the host to the device. These transfers
are costly in terms of performance and should be minimized. (See Data Transfer Between
Host and Device.) This cost has several ramifications:
‣ The complexity of operations should justify the cost of moving data to and from the
device. Code that transfers data for brief use by a small number of threads will see
little or no performance benefit. The ideal scenario is one in which many threads
perform a substantial amount of work.
For example, transferring two matrices to the device to perform a matrix addition
and then transferring the results back to the host will not realize much performance
benefit. The issue here is the number of operations performed per data element
transferred. For the preceding procedure, assuming matrices of size N×N, there are N² operations
(additions) and 3N² elements transferred, so the ratio of operations to elements transferred is
1:3 or O(1). Performance benefits can be more readily achieved when this ratio is higher. For
example, a matrix multiplication of the same matrices requires N³ operations (multiply-add), so
the ratio of operations to elements transferred is O(N), in which case the larger the matrix the
greater the performance
benefit. The types of operations are an additional factor, as additions have different
complexity profiles than, for example, trigonometric functions. It is important to include the
overhead of transferring data to and from the device in determining whether operations should
be performed on the host or on the device (see the sketch following this list).
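To make this ratio concrete, the following sketch performs the matrix addition discussed above;
the kernel name matrixAdd, the size N = 1024, and the launch configuration are illustrative
choices, not prescribed by this guide. Three N×N matrices (3N² elements) cross between host and
device, the two inputs in and the result out, while the device performs only N² additions, even
though that work is spread across roughly a million concurrent threads. Replacing the addition
with a matrix multiplication of the same operands would raise the arithmetic to O(N³) while
leaving the transfer volume unchanged.

#include <cuda_runtime.h>
#include <vector>

__global__ void matrixAdd(const float *A, const float *B, float *C, int n)
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < n && col < n)
        C[row * n + col] = A[row * n + col] + B[row * n + col];  // one add per element
}

int main(void)
{
    const int N = 1024;                                  // ~one million elements (and threads)
    const size_t bytes = (size_t)N * N * sizeof(float);
    std::vector<float> hA(N * N, 1.0f), hB(N * N, 2.0f), hC(N * N);

    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, bytes);
    cudaMalloc((void **)&dB, bytes);
    cudaMalloc((void **)&dC, bytes);

    // 2*N*N elements transferred to the device ...
    cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice);

    // ... for only N*N additions performed on the device ...
    dim3 block(16, 16);
    dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
    matrixAdd<<<grid, block>>>(dA, dB, dC, N);

    // ... and another N*N elements transferred back: an operation-to-transfer ratio of 1:3.
    cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost);

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return 0;
}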
¹ On Systems on a Chip with integrated GPUs, such as NVIDIA® Tegra®, host and device memory are
physically the same, but there is still a logical distinction between host and device memory.
See the Application Note on CUDA for Tegra for details.