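CUDA Programming Guide: NVIDIA's Official GPGPU Tutorial

"NVIDIA CUDA C Programming Guide is the official guide to parallel programming on NVIDIA GPUs. It covers the CUDA platform for GPGPU (general-purpose computing on GPUs) and is a key text for learning CUDA C. This edition is version 3.1.1, dated July 21, 2010."

CUDA is a programming model from NVIDIA that lets developers use the GPU's compute capacity for general-purpose computation rather than graphics alone. CUDA C is the language used for CUDA programming; it is based on standard C/C++ and adds extensions for parallel computing.

Version 3.1.1 makes the following notable changes:

1. The paragraphs on loading 32-bit device code from 64-bit host code were removed, because the next toolkit release will no longer support this capability. In practice, device code should be built for the same bitness as the host code that loads it.
2. Section 3.2.6.3 now states that all devices of compute capability greater than 1.0 support mapped page-locked host memory. Mapped memory lets a kernel access host memory directly, which can avoid explicit copies between host and device.
3. Section 3.2.7.1 now states that copies between host and device of a memory block of 64 KB or less are asynchronous, meaning that control can return to the host thread before the copy has completed.
4. Section G.1 corrects the maximum size of a 3D texture reference on devices of compute capability 2.0 (2048, not 4096). This matters for graphics and compute applications that use large textures.
5. Section C.2.1 clarifies the behavior of `__fdividef(x, y)` across compute capabilities and compile flags; the result of this floating-point division function can depend on both.

The guide opens as follows:

1. Chapter 1 traces the shift from graphics processing to general-purpose parallel computing, presents CUDA as a general-purpose parallel computing architecture, and describes its scalable programming model. It gives beginners the basic concepts and background.
   1.1 Explains how the GPU evolved from a graphics-only processor into a platform for a wide range of computing tasks.
   1.2 Introduces the CUDA architecture, including the GPU's multithreaded structure and memory hierarchy.
   1.3 Discusses the scalability of the CUDA programming model, including how parallel execution units, threads, and thread blocks are organized.

Working through these chapters, readers learn the basics of CUDA programming: how to define and manage threads, memory spaces, and synchronization, and how to exploit the GPU's parallelism effectively. With that understanding, developers can write parallel applications that use GPU resources efficiently for compute-intensive problems such as physical simulation, scientific computing, image processing, and machine learning.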

Chapter 1. Introduction

6 CUDA C Programming Guide Version 3.1.1

1.4 Document’s Structure

This document is organized into the following chapters:

• Chapter 1 is a general introduction to CUDA.
• Chapter 2 outlines the CUDA programming model.
• Chapter 3 describes the programming interface.
• Chapter 4 describes the hardware implementation.
• Chapter 5 gives some guidance on how to achieve maximum performance.
• Appendix A lists all CUDA-enabled devices.
• Appendix B is a detailed description of all extensions to the C language.
• Appendix C lists the mathematical functions supported in CUDA.
• Appendix D lists the C++ constructs supported in device code.
• Appendix E lists the specific keywords and directives supported by nvcc.
• Appendix F gives more details on texture fetching.
• Appendix G gives the technical specifications of various devices, as well as more architectural details.

Chapter 2. Programming Model


2.2 Thread Hierarchy

For convenience, threadIdx is a 3-component vector, so that threads can be identified using a one-dimensional, two-dimensional, or three-dimensional thread index, forming a one-dimensional, two-dimensional, or three-dimensional thread block. This provides a natural way to invoke computation across the elements in a domain such as a vector, matrix, or volume.

The index of a thread and its thread ID relate to each other in a straightforward way: For a one-dimensional block, they are the same; for a two-dimensional block of size (D_x, D_y), the thread ID of a thread of index (x, y) is (x + y D_x); for a three-dimensional block of size (D_x, D_y, D_z), the thread ID of a thread of index (x, y, z) is (x + y D_x + z D_x D_y).
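The index-to-ID mapping can be checked on the host without a GPU. The following is a minimal sketch in plain C++, not CUDA API code; `linearThreadId` is a hypothetical helper name introduced only for illustration. It treats 1D and 2D blocks as the special cases where the unused dimensions have size 1.

```cpp
#include <cassert>

// Hypothetical helper (not part of CUDA): linear thread ID of the thread at
// index (x, y, z) in a block of size (dx, dy, dz), i.e. x + y*dx + z*dx*dy.
// A 1D or 2D block is the special case dy == 1 and/or dz == 1.
inline int linearThreadId(int x, int y, int z, int dx, int dy, int dz)
{
    // The index must lie inside the block.
    assert(x >= 0 && x < dx);
    assert(y >= 0 && y < dy);
    assert(z >= 0 && z < dz);
    return x + y * dx + z * dx * dy;
}
```

For instance, index (1, 2, 3) in a block of size (4, 5, 6) maps to 1 + 2*4 + 3*4*5 = 69, and the last index (3, 4, 5) maps to 119, one less than the 120 threads in the block.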

As an example, the following code adds two matrices A and B of size NxN and stores the result into matrix C:

// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N],
                       float C[N][N])
{
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

int main()
{
    ...
    // Kernel invocation with one block of N * N * 1 threads
    int numBlocks = 1;
    dim3 threadsPerBlock(N, N);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}

There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same processor core and must share the limited memory resources of that core. On current GPUs, a thread block may contain up to 1024 threads.

However, a kernel can be executed by multiple equally-shaped thread blocks, so that the total number of threads is equal to the number of threads per block times the number of blocks.
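The two constraints above (a per-block limit, and a total that is the product of block size and block count) amount to simple arithmetic. The sketch below is host-side C++ under stated assumptions: `Dim3` is a hypothetical stand-in for CUDA's `dim3`, and the 1024 limit is the figure quoted in the text, hard-coded here rather than queried from a device.

```cpp
// Hypothetical stand-in for CUDA's dim3; unspecified components default to 1.
struct Dim3 {
    unsigned x, y, z;
    Dim3(unsigned x_ = 1, unsigned y_ = 1, unsigned z_ = 1) : x(x_), y(y_), z(z_) {}
};

// Per-block limit quoted in the text; assumed constant for this sketch.
const unsigned long long kMaxThreadsPerBlock = 1024;

// Number of elements described by a 3-component extent.
inline unsigned long long count(const Dim3& d)
{
    return 1ULL * d.x * d.y * d.z;
}

// True if a block of this shape stays within the per-block thread limit.
inline bool fitsInOneBlock(const Dim3& threadsPerBlock)
{
    return count(threadsPerBlock) <= kMaxThreadsPerBlock;
}

// Total threads in a launch: threads per block times number of blocks.
inline unsigned long long totalThreads(const Dim3& numBlocks, const Dim3& threadsPerBlock)
{
    return count(numBlocks) * count(threadsPerBlock);
}
```

Under these assumptions, the single-block MatAdd launch shown earlier only works for N up to 32, since a 32x32 block already uses all 1024 threads; larger matrices need more than one block.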

Blocks are organized into a one-dimensional or two-dimensional grid of thread blocks as illustrated by Figure 2-1. The number of thread blocks in a grid is usually dictated by the size of the data being processed or the number of processors in the system, which it can greatly exceed.


    // [The start of this listing, including the kernel definition, is missing from this excerpt.]
}

int main()
{
    ...
    // Kernel invocation
    dim3 threadsPerBlock(16, 16);
    dim3 numBlocks(N / threadsPerBlock.x, N / threadsPerBlock.y);
    MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
}

A thread block size of 16x16 (256 threads), although arbitrary in this case, is a common choice. The grid is created with enough blocks to have one thread per matrix element as before. For simplicity, this example assumes that the number of threads per grid in each dimension is evenly divisible by the number of threads per block in that dimension, although that need not be the case.
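When the matrix size is not a multiple of the block size, a common approach is to round the block count up and have each thread check that its index is in range. The kernel definition for the launch above is not part of this excerpt, so the sketch below is an assumption rather than the guide's listing: it emulates the launch with host-side C++ loops, computing each global index as block index times block size plus thread index (in CUDA, `blockIdx.x * blockDim.x + threadIdx.x`). The names `blocksFor` and `emulateMatAdd` are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Blocks needed to cover n elements with tpb threads per block (rounds up).
inline int blocksFor(int n, int tpb)
{
    assert(n > 0 && tpb > 0);
    return (n + tpb - 1) / tpb;
}

// Host-side emulation of a 2D launch over an n x n row-major matrix.
// The four loops stand in for blockIdx.y, blockIdx.x, threadIdx.y, threadIdx.x.
// Returns the number of emulated threads, including those that fall outside
// the matrix and therefore do nothing.
inline int emulateMatAdd(int n, int tpb,
                         const std::vector<float>& a,
                         const std::vector<float>& b,
                         std::vector<float>& c)
{
    const int blocks = blocksFor(n, tpb);
    int launched = 0;
    for (int by = 0; by < blocks; ++by)
        for (int bx = 0; bx < blocks; ++bx)
            for (int ty = 0; ty < tpb; ++ty)
                for (int tx = 0; tx < tpb; ++tx) {
                    ++launched;
                    const int i = bx * tpb + tx;  // global index along x
                    const int j = by * tpb + ty;  // global index along y
                    if (i < n && j < n)           // guard for partial blocks
                        c[i * n + j] = a[i * n + j] + b[i * n + j];
                }
    return launched;
}
```

With n = 20 and 16 threads per block in each dimension, the grid is 2x2 blocks, so 1024 threads are emulated but only 400 of them pass the guard and write an element.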

Thread blocks are required to execute independently: It must be possible to execute them in any order, in parallel or in series. This independence requirement allows thread blocks to be scheduled in any order across any number of cores as illustrated by Figure 1-4, enabling programmers to write code that scales with the number of cores.

Threads within a block can cooperate by sharing data through some shared memory and by synchronizing their execution to coordinate memory accesses. More precisely, one can specify synchronization points in the kernel by calling the __syncthreads() intrinsic function; __syncthreads() acts as a barrier at which all threads in the block must wait before any is allowed to proceed. Section 3.2.2 gives an example of using shared memory.

For efficient cooperation, the shared memory is expected to be a low-latency memory near each processor core (much like an L1 cache) and __syncthreads() is expected to be lightweight.
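The role of the barrier can be illustrated without a GPU by splitting a block's work into phases. The sketch below is a sequential host-side emulation in C++, not real device code and not the example from Section 3.2.2: each loop iteration plays the part of one thread, the `shared` vector stands in for the block's shared memory, and the boundary between the two loops is where `__syncthreads()` would sit in a kernel. The function name `reverseInBlock` is hypothetical.

```cpp
#include <vector>

// Emulates one thread block reversing an array through shared memory.
inline std::vector<int> reverseInBlock(const std::vector<int>& in)
{
    const int n = static_cast<int>(in.size());
    std::vector<int> shared(n);  // stands in for the block's shared memory
    std::vector<int> out(n);

    // Phase 1: thread t writes its own element into shared memory.
    for (int t = 0; t < n; ++t)
        shared[t] = in[t];

    // A kernel would call __syncthreads() here: no thread may start
    // phase 2 until every thread in the block has finished phase 1.

    // Phase 2: thread t reads the element written by thread n - 1 - t.
    for (int t = 0; t < n; ++t)
        out[t] = shared[n - 1 - t];

    return out;
}
```

The barrier matters because phase 2 reads a slot that a different thread wrote. On a real device, without the barrier a thread could read `shared[n - 1 - t]` before its owner had written it.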

2.3 Memory Hierarchy

CUDA threads may access data from multiple memory spaces during their execution as illustrated by Figure 2-2. Each thread has private local memory. Each thread block has shared memory visible to all threads of the block and with the same lifetime as the block. All threads have access to the same global memory.

There are also two additional read-only memory spaces accessible by all threads: the constant and texture memory spaces. The global, constant, and texture memory spaces are optimized for different memory usages (see Sections 5.3.2.1, 5.3.2.4, and 5.3.2.5). Texture memory also offers different addressing modes, as well as data filtering, for some specific data formats (see Section 3.2.4).

The global, constant, and texture memory spaces are persistent across kernel launches by the same application.
