CUDA-C Programming Guide: Simplifications and Improvements

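"An introduction to CUDA-C programming, suitable as a reference for beginners learning CUDA. The document is updated to Version 3.2 and covers key CUDA programming concepts and technical updates."

CUDA-C is a programming model for parallel computing on NVIDIA GPUs and a core technology of GPGPU (general-purpose computing on GPUs). The CUDA-C Programming Guide is written for C programmers who want to use the CUDA platform to improve application performance. It aims to help readers understand the basic principles and practical techniques of CUDA programming. Version 3.2 adds or extends the following:

1. **Simpler use of cuParamSetv()**: Kernel parameters of type CUdeviceptr now have the same size and alignment as void*, so a parameter no longer has to be set through an intermediate void* variable. The code samples are shorter as a result.
2. **16-bit floating-point texture support**: Section 3.2.4.1.4 on 16-bit floating-point textures was added. Half-precision texture data takes less memory than single-precision data.
3. **Read/write coherency for texture and surface memory**: Section 3.2.4.4 was added to explain how to keep data consistent across texture and surface memory operations, an important consideration when optimizing GPU computation.
4. **More details on surface memory access**: Section 3.2.4.2 now gives more information about surface memory access, which helps developers manage GPU memory and perform memory operations directly from the GPU.
5. **Stream synchronization**: Section 3.2.6.5.2 mentions the new cudaStreamSynchronize() function, which lets developers control synchronization between different computing tasks and so manage concurrent work more precisely.
6. **Devices in NVIDIA SLI AFR mode**: Sections 3.2.7.2, 3.3.10.2, and 4.3 mention the new API calls for handling devices that use NVIDIA SLI (Scalable Link Interface) in alternate frame rendering mode, which matters most to developers of multi-GPU systems.
7. **Call stack sections**: Sections 3.2.9 and 3.3.12 about the call stack were added, which helps in understanding function call behavior on the GPU and in debugging errors.
8. **Type change for memory allocation**: In the two code samples of Section 3.3.4, the type of the "pitch" variable was changed from unsigned int to size_t, following the change in the cuMemAllocPitch() function signature.

These updates reflect the continued improvement of CUDA-C and help developers make better use of GPU computing power to write high-performance parallel applications. For programmers who want to get started with CUDA-C, this document is a comprehensive starting point.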

Chapter 1. Introduction

4 CUDA C Programming Guide Version 3.2

solve many complex computational problems in a more efficient way than on a CPU.

CUDA comes with a software environment that allows developers to use C as a high-level programming language. As illustrated by Figure 1-3, other languages or application programming interfaces are supported, such as CUDA FORTRAN, OpenCL, and DirectCompute.

Figure 1-3. CUDA is Designed to Support Various Languages or Application Programming Interfaces

1.3 A Scalable Programming Model

The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. Furthermore, their parallelism continues to scale with Moore's law. The challenge is to develop application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.

The CUDA parallel programming model is designed to overcome this challenge while maintaining a low learning curve for programmers familiar with standard programming languages such as C.

At its core are three key abstractions – a hierarchy of thread groups, shared memories, and barrier synchronization – that are simply exposed to the programmer as a minimal set of language extensions.

These abstractions provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. They guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel by blocks of threads, and each sub-problem into finer pieces that can be solved cooperatively in parallel by all threads within the block.
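
The two-level decomposition described above can be sketched in plain C. This is a host-side emulation under stated assumptions, not device code and not taken from the guide: `BLOCK_SIZE`, `block_sum`, and `blocked_sum` are hypothetical names, the "threads" of a block run serially in a loop, and a local array stands in for shared memory. The outer function splits a sum into independent per-block sub-problems; the inner function reduces one block cooperatively, step by step, with a comment marking where a real kernel would need a barrier.

```c
#include <stddef.h>

/* Hypothetical block size for this sketch; must be a power of two here. */
#define BLOCK_SIZE 4

/* Fine-grained step: the emulated threads of one block reduce their slice
   with a tree reduction. count must not exceed BLOCK_SIZE. */
static float block_sum(const float *data, size_t count)
{
    float scratch[BLOCK_SIZE] = {0};          /* stands in for shared memory */
    for (size_t t = 0; t < count; ++t)
        scratch[t] = data[t];

    for (size_t stride = BLOCK_SIZE / 2; stride > 0; stride /= 2) {
        for (size_t t = 0; t < stride; ++t)   /* one iteration per "thread" */
            scratch[t] += scratch[t + stride];
        /* A real kernel would synchronize the block here, for example
           with __syncthreads(), before starting the next step. */
    }
    return scratch[0];
}

/* Coarse-grained step: each block handles an independent slice of the
   input, so the per-block sums could be computed in any order. */
float blocked_sum(const float *data, size_t n)
{
    float total = 0.0f;
    for (size_t start = 0; start < n; start += BLOCK_SIZE) {
        size_t remaining = n - start;
        size_t count = remaining < BLOCK_SIZE ? remaining : BLOCK_SIZE;
        total += block_sum(data + start, count);
    }
    return total;
}
```

Because no block reads another block's scratch data, the blocks could be scheduled on any number of cores, which is the property the text relies on for scalability.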

This decomposition preserves language expressivity by allowing threads to

Chapter 2. Programming Model

2.2 Thread Hierarchy

For convenience, threadIdx is a 3-component vector, so that threads can be identified using a one-dimensional, two-dimensional, or three-dimensional thread index, forming a one-dimensional, two-dimensional, or three-dimensional thread block. This provides a natural way to invoke computation across the elements in a domain such as a vector, matrix, or volume.

The index of a thread and its thread ID relate to each other in a straightforward way: For a one-dimensional block, they are the same; for a two-dimensional block of size $(D_x, D_y)$, the thread ID of a thread of index $(x, y)$ is $(x + y D_x)$; for a three-dimensional block of size $(D_x, D_y, D_z)$, the thread ID of a thread of index $(x, y, z)$ is $(x + y D_x + z D_x D_y)$.
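
The index-to-ID mapping can be written as a one-line function. The following is a minimal sketch in plain C rather than device code; `Dim3` and `thread_id` are illustrative names introduced here, not CUDA's built-in `dim3` or `uint3` types. Setting the unused dimensions of the block to 1 and the unused index components to 0 makes the same expression cover the one-, two-, and three-dimensional cases.

```c
/* A block size (Dx, Dy, Dz) or a thread index (x, y, z).
   Illustrative type for this sketch, not a CUDA built-in. */
typedef struct {
    unsigned x, y, z;
} Dim3;

/* Linear thread ID of the thread at index idx in a block of size dim:
   x + y * Dx + z * Dx * Dy. */
unsigned thread_id(Dim3 idx, Dim3 dim)
{
    return idx.x + idx.y * dim.x + idx.z * dim.x * dim.y;
}
```

For instance, in a block of size (4, 5, 6) the thread at index (1, 2, 3) gets ID 1 + 2·4 + 3·4·5 = 69.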

As an example, the following code adds two matrices A and B of size NxN and stores the result into matrix C:

    // Kernel definition
    __global__ void MatAdd(float A[N][N], float B[N][N],
                           float C[N][N])
    {
        int i = threadIdx.x;
        int j = threadIdx.y;
        C[i][j] = A[i][j] + B[i][j];
    }

    int main()
    {
        ...
        // Kernel invocation with one block of N * N * 1 threads
        int numBlocks = 1;
        dim3 threadsPerBlock(N, N);
        MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
    }

There is a limit to the number of threads per block, since all threads of a block are expected to reside on the same processor core and must share the limited memory resources of that core. On current GPUs, a thread block may contain up to 1024 threads.
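
The limit has a direct consequence for a kernel launched with a single block of N * N threads: N * N must not exceed the limit, so with 1024 threads per block N can be at most 32. A small host-side check, sketched in plain C, makes this concrete. `MAX_THREADS_PER_BLOCK` and `block_fits` are hypothetical names, and the constant simply hard-codes the figure quoted above; other devices may have a different limit.

```c
/* Assumed limit, copied from the text: at most 1024 threads per block.
   Real code should not hard-code this value. */
#define MAX_THREADS_PER_BLOCK 1024u

/* Returns 1 if a block of dx * dy * dz threads is non-empty and within
   the assumed limit, 0 otherwise. */
int block_fits(unsigned dx, unsigned dy, unsigned dz)
{
    /* Reject empty or oversized dimensions first, so the product
       below cannot overflow. */
    if (dx == 0 || dy == 0 || dz == 0)
        return 0;
    if (dx > MAX_THREADS_PER_BLOCK || dy > MAX_THREADS_PER_BLOCK ||
        dz > MAX_THREADS_PER_BLOCK)
        return 0;

    unsigned long long threads = (unsigned long long)dx * dy * dz;
    return threads <= MAX_THREADS_PER_BLOCK;
}
```

Under this assumption a 32 x 32 block is accepted and a 33 x 33 block is rejected.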

However, a kernel can be executed by multiple equally-shaped thread blocks, so that the total number of threads is equal to the number of threads per block times the number of blocks.

Blocks are organized into a one-dimensional or two-dimensional grid of thread blocks as illustrated by Figure 2-1. The number of thread blocks in a grid is usually dictated by the size of the data being processed or the number of processors in the system, which it can greatly exceed.
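
When the grid size is dictated by the data, the arithmetic is a rounded-up division plus a per-thread global index. The sketch below shows it in plain C for the one-dimensional case. The function names are hypothetical, and the index formula (block index times block size plus thread index) is the usual CUDA convention rather than something stated in this excerpt.

```c
/* Number of blocks needed to cover n elements, rounded up so that a
   partially filled last block is included. */
unsigned blocks_needed(unsigned n, unsigned threads_per_block)
{
    return n / threads_per_block + (n % threads_per_block != 0);
}

/* Total threads launched: threads per block times number of blocks. */
unsigned total_threads(unsigned num_blocks, unsigned threads_per_block)
{
    return num_blocks * threads_per_block;
}

/* Global element index of one thread in a one-dimensional grid. */
unsigned global_index(unsigned block_index, unsigned block_size,
                      unsigned thread_index)
{
    return block_index * block_size + thread_index;
}
```

Covering 1000 elements with 256 threads per block takes 4 blocks and launches 1024 threads, so the last 24 threads have no element to process; a kernel written this way needs a bounds check on the global index before it touches memory.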
