CUDA C++ Programming Guide: Updates and Features in Version 11.2

Updated 2024-07-09 · 2.86 MB PDF

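The CUDA C++ Programming Guide (`CUDA_C_Programming_Guide.pdf`) is a comprehensive guide to efficient parallel computing on NVIDIA GPUs. It covers the CUDA platform and programming model from basic concepts to advanced features. Version 11.2 lists several key updates:

1. Updated asynchronous data copies using `cuda::memcpy_async` and `cooperative_groups::memcpy_async`. These functions copy data on the GPU, typically from global memory to shared memory, without blocking the issuing threads, so the copy can overlap with computation.
2. Updated asynchronous barriers using `cuda::barrier`. A barrier synchronizes a group of threads so that all of them reach a given point before any of them relies on the others' results.
3. Added compiler optimization hint functions. Developers can use them to pass information to the compiler that it may use when optimizing the generated code.

The core concepts of the CUDA programming model are:

1. **Kernels**: functions that run on the GPU and are executed in parallel by a large number of threads. A kernel defines the computation the GPU should perform.
2. **Thread Hierarchy**: threads are organized into thread blocks, and blocks into a grid. Synchronization and communication are available at different granularities of this hierarchy.
3. **Memory Hierarchy**: a CUDA device has several kinds of memory, including registers, shared memory, global memory, and texture memory. They differ in scope and access speed, so the right choice depends on the data access pattern.
4. **Heterogeneous Programming**: the CPU (host) and the GPU (device) work together. The GPU handles the compute-intensive parallel parts, and the CPU handles the rest.
5. **Compute Capability**: each NVIDIA GPU has a compute capability that identifies the features and limits it supports, such as floating-point support and maximum thread block size. Code has to be written and compiled with the compute capability of the target device in mind.

The programming interface part of the guide describes how to build CUDA programs with the NVCC compiler:

1. **Compilation with NVCC**: `nvcc` is the CUDA compiler driver. It turns CUDA source code into executable binaries.
   - **Compilation Workflow**: covers offline compilation (compiling ahead of time to binary code for a specific GPU architecture) and just-in-time compilation (compiling for the target GPU at run time).
   - **Binary Compatibility**: discusses which GPU architectures can run binary code compiled for a given architecture.
   - **PTX Compatibility**: PTX is CUDA's intermediate representation. Shipping PTX allows code to be compiled at run time for GPU architectures other than the one it was originally built for.
   - **Application Compatibility**: covers what an application needs to consider to keep running across GPU generations and library versions, including backward compatibility.
   - **C++ Compatibility**: CUDA supports C++ features, with CUDA-specific restrictions and caveats.
   - **64-Bit Compatibility**: discusses compiling and running CUDA code in 64-bit environments.

The CUDA C++ Programming Guide is a key reference for CUDA programmers. It combines background theory with practical guidance for using GPU parallelism to build high-performance applications.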
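As a rough illustration of how the concepts summarized above fit together, the following minimal sketch queries the device's compute capability, copies two arrays from host to device (heterogeneous programming), launches a kernel over a grid of thread blocks (kernels and thread hierarchy), and copies the result back. It is not taken from the guide. The names `vecAdd`, `check`, and `countMismatches` are hypothetical, device 0 is assumed to be a CUDA-capable GPU, and only long-standing runtime API calls are used rather than the newer 11.2 features.

```cuda
#include <cstdio>
#include <cstdlib>
#include <vector>
#include <cuda_runtime.h>

// Kernel: each thread adds one pair of elements.
__global__ void vecAdd(const float* a, const float* b, float* c, int n)
{
    // Global index derived from the thread hierarchy:
    // block index * block size + thread index within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];   // guard: the last block may be partial
}

// Hypothetical helper: abort with a message if a runtime call failed.
static void check(cudaError_t err, const char* what)
{
    if (err != cudaSuccess) {
        std::fprintf(stderr, "%s failed: %s\n", what, cudaGetErrorString(err));
        std::exit(EXIT_FAILURE);
    }
}

// Hypothetical helper: run vecAdd on n elements and count wrong results.
int countMismatches(int n)
{
    std::vector<float> hA(n), hB(n), hC(n);
    for (int i = 0; i < n; ++i) { hA[i] = float(i); hB[i] = 2.0f * float(i); }

    size_t bytes = size_t(n) * sizeof(float);
    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    check(cudaMalloc(&dA, bytes), "cudaMalloc dA");
    check(cudaMalloc(&dB, bytes), "cudaMalloc dB");
    check(cudaMalloc(&dC, bytes), "cudaMalloc dC");

    // Host -> device copies into global memory.
    check(cudaMemcpy(dA, hA.data(), bytes, cudaMemcpyHostToDevice), "copy A");
    check(cudaMemcpy(dB, hB.data(), bytes, cudaMemcpyHostToDevice), "copy B");

    int threads = 256;                         // threads per block
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
    vecAdd<<<blocks, threads>>>(dA, dB, dC, n);
    check(cudaGetLastError(), "kernel launch");

    // Device -> host copy; this also waits for the kernel to finish.
    check(cudaMemcpy(hC.data(), dC, bytes, cudaMemcpyDeviceToHost), "copy C");

    int bad = 0;
    for (int i = 0; i < n; ++i)
        if (hC[i] != hA[i] + hB[i]) ++bad;

    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return bad;
}

int main()
{
    cudaDeviceProp prop;
    check(cudaGetDeviceProperties(&prop, 0), "cudaGetDeviceProperties");
    std::printf("Device 0: %s, compute capability %d.%d\n",
                prop.name, prop.major, prop.minor);

    int bad = countMismatches(1 << 20);
    std::printf("mismatches: %d\n", bad);
    return bad == 0 ? 0 : 1;
}
```

Assuming the file is saved as `vec_add.cu` (a hypothetical name), it can be built with `nvcc -o vec_add vec_add.cu`. The block size of 256 is an arbitrary but common choice, not a recommendation from the guide.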

CUDA C++ Programming Guide PG-02829-001_v11.2|xvi

L.1.4.GPU Memory Oversubscription..................................................................................... 340

L.1.5.Multi-GPU....................................................................................................................... 340

L.1.6.System Allocator............................................................................................................ 341

L.1.7.Hardware Coherency......................................................................................................341

L.1.8.Access Counters.............................................................................................................342

L.2.Programming Model.............................................................................................................343

L.2.1.Managed Memory Opt In............................................................................................... 343

L.2.1.1.Explicit Allocation Using cudaMallocManaged().................................................... 343

L.2.1.2.Global-Scope Managed Variables Using __managed__........................................ 344

L.2.2.Coherency and Concurrency......................................................................................... 344

L.2.2.1.GPU Exclusive Access To Managed Memory......................................................... 345

L.2.2.2.Explicit Synchronization and Logical GPU Activity.................................................346

L.2.2.3.Managing Data Visibility and Concurrent CPU + GPU Access with Streams........ 347

L.2.2.4.Stream Association Examples................................................................................ 348

L.2.2.5.Stream Attach With Multithreaded Host Programs...............................................348

L.2.2.6.Advanced Topic: Modular Programs and Data Access Constraints......................349

L.2.2.7.Memcpy()/Memset() Behavior With Managed Memory.......................................... 350

L.2.3.Language Integration..................................................................................................... 351

L.2.3.1.Host Program Errors with __managed__ Variables..............................................351

L.2.4.Querying Unified Memory Support................................................................................352

L.2.4.1.Device Properties.....................................................................................................352

L.2.4.2.Pointer Attributes.................................................................................................... 352

L.2.5.Advanced Topics............................................................................................................. 352

L.2.5.1.Managed Memory with Multi-GPU Programs on pre-6.x Architectures...............352

L.2.5.2.Using fork() with Managed Memory....................................................................... 353

L.3.Performance Tuning............................................................................................................. 353

L.3.1.Data Prefetching.............................................................................................................354

L.3.2.Data Usage Hints........................................................................................................... 355

L.3.3.Querying Usage Attributes.............................................................................................356

CUDA C++ Programming Guide PG-02829-001_v11.2|xvii

List of Figures

Figure1. The GPU Devotes More Transistors to Data Processing ..................................................2

Figure2. GPU Computing Applications .............................................................................................3

Figure3. Automatic Scalability .......................................................................................................... 5

Figure4. Grid of Thread Blocks ........................................................................................................ 9

Figure5. Memory Hierarchy ............................................................................................................ 11

Figure6. Heterogeneous Programming ..........................................................................................13

Figure7. Matrix Multiplication without Shared Memory ................................................................29

Figure8. Matrix Multiplication with Shared Memory ..................................................................... 32

Figure9. Child Graph Example ........................................................................................................42

Figure10. Creating a Graph Using Graph APIs Example .............................................................. 43

Figure11. The Driver API Is Backward but Not Forward Compatible .........................................101

Figure12. Parent-Child Launch Nesting ...................................................................................... 228

Figure13. Nearest-Point Sampling Filtering Mode ..................................................................... 304

Figure14. Linear Filtering Mode ................................................................................................... 305

Figure15. One-Dimensional Table Lookup Using Linear Filtering ............................................. 306

Figure16. Examples of Global Memory Accesses ........................................................................315

Figure17. Strided Shared Memory Accesses ...............................................................................318

Figure18. Irregular Shared Memory Accesses ............................................................................ 319

Figure19. Library Context Management .......................................................................................330

CUDA C++ Programming Guide PG-02829-001_v11.2|xviii

List of Tables

Table1. Linear Memory Address Space ......................................................................................... 20

Table2. Cubemap Fetch ...................................................................................................................62

Table3. Throughput of Native Arithmetic Instructions ................................................................ 117

Table4. Alignment Requirements ................................................................................................. 131

Table5. New Device-only Launch Implementation Functions .....................................................236

Table6. Supported API Functions ................................................................................................. 237

Table7. Single-Precision Mathematical Standard Library Functions with Maximum ULP

Error.............................................................................................................................................. 252

Table8. Double-Precision Mathematical Standard Library Functions with Maximum ULP

Error.............................................................................................................................................. 255

Table9. Functions Affected by -use_fast_math ........................................................................... 259

Table10. Single-Precision Floating-Point Intrinsic Functions .................................................... 259

Table11. Double-Precision Floating-Point Intrinsic Functions ...................................................261

Table12. C++11 Language Features ............................................................................................. 262

Table13. C++14 Language Features ............................................................................................. 265

Table14. Feature Support per Compute Capability ..................................................................... 307

Table15. Technical Specifications per Compute Capability .........................................................308

Table16. Objects Available in the CUDA Driver API .................................................................. 327

Table17. CUDA Environment Variables ......................................................................................334
