CUDA C++ Programming Guide (v11.1): Compute Capability 8.x and 8.6 Support Explained

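The CUDA C++ Programming Guide is one of NVIDIA's developer design guides. This edition is document PG-02829-001_v11.1, released in October 2020. It covers CUDA C++ programming in detail and is intended to help developers use NVIDIA GPUs for parallel computing. Notable updates include support for Compute Capability 8.x, which documents the instructions and features specific to GPUs of that architecture generation. Version 11.1 expands the coverage of Compute Capability 8.6, with updated arithmetic instruction information and an updated overview of features and technical specifications.

Chapter 1, "Introduction", describes the benefits of using a GPU, such as processing large volumes of data and higher parallel performance, and presents CUDA as a general-purpose parallel computing platform and programming model. It emphasizes the scalability of the CUDA programming model, which lets developers decompose a computation across the GPU's many cores.

Chapter 2, "Programming Model", explains the core concepts of CUDA programming: kernels (functions that are executed in parallel by many GPU threads), the thread hierarchy, the memory hierarchy, and heterogeneous programming, in which the CPU and GPU work together. Its discussion of Compute Capability shows how the features available to a program depend on the GPU hardware.

Chapter 3, "Programming Interface", describes how to compile CUDA C++ code with the nvcc compiler. It covers the difference between offline compilation and just-in-time (JIT) compilation, as well as binary compatibility, PTX (Parallel Thread Execution) compatibility, C++ compatibility, and 64-bit compatibility, which together determine whether compiled code can run on a given architecture.

The guide is an essential reference for writing efficient, compatible, and scalable GPU-accelerated code with CUDA C++.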
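The kernel and thread-hierarchy concepts summarized above can be illustrated with a small vector addition. This is a minimal sketch rather than code taken from the guide: the names `vecAdd`, `check`, and `addOnGpu` are hypothetical, the block size of 256 threads is an arbitrary choice, and the error handling is deliberately simple (device memory is not released if a call fails).

```cuda
#include <cuda_runtime.h>

#include <stdexcept>
#include <string>
#include <vector>

// Kernel: a function executed in parallel by many GPU threads.
// Each thread computes one output element.
__global__ void vecAdd(const float* a, const float* b, float* c, int n) {
    // Global index = block offset within the grid + thread offset within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // the last block may contain threads past the end of the data
        c[i] = a[i] + b[i];
    }
}

// Hypothetical helper: turn a CUDA runtime error code into a C++ exception.
static void check(cudaError_t err, const char* what) {
    if (err != cudaSuccess) {
        throw std::runtime_error(std::string(what) + ": " + cudaGetErrorString(err));
    }
}

// Hypothetical host-side wrapper showing the heterogeneous flow:
// the CPU allocates and copies, the GPU computes, the CPU reads the result back.
std::vector<float> addOnGpu(const std::vector<float>& a, const std::vector<float>& b) {
    if (a.size() != b.size()) {
        throw std::invalid_argument("input sizes differ");
    }
    const int n = static_cast<int>(a.size());
    std::vector<float> c(a.size());
    if (n == 0) {
        return c;
    }
    const size_t bytes = a.size() * sizeof(float);

    float* dA = nullptr;
    float* dB = nullptr;
    float* dC = nullptr;
    check(cudaMalloc(&dA, bytes), "cudaMalloc(dA)");
    check(cudaMalloc(&dB, bytes), "cudaMalloc(dB)");
    check(cudaMalloc(&dC, bytes), "cudaMalloc(dC)");

    check(cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice), "copy a to device");
    check(cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice), "copy b to device");

    // Round the grid size up so that every element is covered by a thread.
    const int threadsPerBlock = 256;  // assumed block size, not a tuned value
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    vecAdd<<<blocks, threadsPerBlock>>>(dA, dB, dC, n);
    check(cudaGetLastError(), "kernel launch");

    // This blocking copy also waits for the kernel to finish.
    check(cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost), "copy c to host");

    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return c;
}
```

Placed in a file such as `vec_add.cu` (an assumed name) together with a `main` that calls `addOnGpu`, the sketch could be compiled offline with `nvcc vec_add.cu -o vec_add`. Running it requires a CUDA-capable GPU and driver.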

CUDA C++ Programming Guide PG-02829-001_v11.1|xvi

I.7.1.Architecture..................................................................................................................... 326

I.7.2.Global Memory.................................................................................................................327

I.7.3.Shared Memory............................................................................................................... 327

AppendixJ.Driver API......................................................................................................328

J.1.Context................................................................................................................................... 330

J.2.Module....................................................................................................................................331

J.3.Kernel Execution................................................................................................................... 332

J.4.Interoperability between Runtime and Driver APIs.............................................................334

AppendixK.CUDA Environment Variables...................................................................... 335

AppendixL.Unified Memory Programming.................................................................... 338

L.1.Unified Memory Introduction................................................................................................338

L.1.1.System Requirements....................................................................................................339

L.1.2.Simplifying GPU Programming..................................................................................... 339

L.1.3.Data Migration and Coherency......................................................................................340

L.1.4.GPU Memory Oversubscription..................................................................................... 341

L.1.5.Multi-GPU....................................................................................................................... 341

L.1.6.System Allocator............................................................................................................ 342

L.1.7.Hardware Coherency......................................................................................................342

L.1.8.Access Counters.............................................................................................................343

L.2.Programming Model.............................................................................................................344

L.2.1.Managed Memory Opt In............................................................................................... 344

L.2.1.1.Explicit Allocation Using cudaMallocManaged().................................................... 344

L.2.1.2.Global-Scope Managed Variables Using __managed__........................................ 345

L.2.2.Coherency and Concurrency......................................................................................... 345

L.2.2.1.GPU Exclusive Access To Managed Memory......................................................... 346

L.2.2.2.Explicit Synchronization and Logical GPU Activity.................................................347

L.2.2.3.Managing Data Visibility and Concurrent CPU + GPU Access with Streams........ 348

L.2.2.4.Stream Association Examples................................................................................ 349

L.2.2.5.Stream Attach With Multithreaded Host Programs...............................................349

L.2.2.6.Advanced Topic: Modular Programs and Data Access Constraints......................350

L.2.2.7.Memcpy()/Memset() Behavior With Managed Memory.......................................... 351

L.2.3.Language Integration..................................................................................................... 352

L.2.3.1.Host Program Errors with __managed__ Variables..............................................352

L.2.4.Querying Unified Memory Support................................................................................353

L.2.4.1.Device Properties.....................................................................................................353

L.2.4.2.Pointer Attributes.................................................................................................... 353

L.2.5.Advanced Topics............................................................................................................. 353

L.2.5.1.Managed Memory with Multi-GPU Programs on pre-6.x Architectures...............353

CUDA C++ Programming Guide PG-02829-001_v11.1|xviii

List of Figures

Figure1. The GPU Devotes More Transistors to Data Processing ..................................................2

Figure2. GPU Computing Applications .............................................................................................3

Figure3. Automatic Scalability .......................................................................................................... 5

Figure4. Grid of Thread Blocks ........................................................................................................ 9

Figure5. Memory Hierarchy ............................................................................................................ 11

Figure6. Heterogeneous Programming ..........................................................................................13

Figure7. Matrix Multiplication without Shared Memory ................................................................29

Figure8. Matrix Multiplication with Shared Memory ..................................................................... 32

Figure9. Child Graph Example ........................................................................................................42

Figure10. Creating a Graph Using Graph APIs Example .............................................................. 43

Figure11. The Driver API Is Backward but Not Forward Compatible .........................................101

Figure12. Parent-Child Launch Nesting ...................................................................................... 229

Figure13. Nearest-Point Sampling Filtering Mode ..................................................................... 305

Figure14. Linear Filtering Mode ................................................................................................... 306

Figure15. One-Dimensional Table Lookup Using Linear Filtering ............................................. 307

Figure16. Examples of Global Memory Accesses ........................................................................316

Figure17. Strided Shared Memory Accesses ...............................................................................319

Figure18. Irregular Shared Memory Accesses ............................................................................ 320

Figure19. Library Context Management .......................................................................................331

CUDA C++ Programming Guide PG-02829-001_v11.1|xix

List of Tables

Table1. Linear Memory Address Space ......................................................................................... 20

Table2. Cubemap Fetch ...................................................................................................................62

Table3. Throughput of Native Arithmetic Instructions ................................................................ 117

Table4. Alignment Requirements ................................................................................................. 130

Table5. New Device-only Launch Implementation Functions .....................................................237

Table6. Supported API Functions ................................................................................................. 238

Table7. Single-Precision Mathematical Standard Library Functions with Maximum ULP

Error.............................................................................................................................................. 253

Table8. Double-Precision Mathematical Standard Library Functions with Maximum ULP

Error.............................................................................................................................................. 256

Table9. Functions Affected by -use_fast_math ........................................................................... 260

Table10. Single-Precision Floating-Point Intrinsic Functions .................................................... 260

Table11. Double-Precision Floating-Point Intrinsic Functions ...................................................262

Table12. C++11 Language Features ............................................................................................. 263

Table13. C++14 Language Features ............................................................................................. 266

Table14. Feature Support per Compute Capability ..................................................................... 308

Table15. Technical Specifications per Compute Capability .........................................................309

Table16. Objects Available in the CUDA Driver API .................................................................. 328

Table17. CUDA Environment Variables ......................................................................................335
