NVIDIA CUDA C++ Programming Guide: Asynchronous SIMT Model and Graph Memory Nodes

Updated 2024-07-07 · 3.28 MB · PDF

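"CUDA_C_Programming_Guide.pdf: NVIDIA's CUDA programming guide, covering the material from the CUDA programming model through the programming interface for CUDA C++. Updates in this edition include newly added graph memory nodes and a formalized asynchronous SIMT programming model."

CUDA is a parallel computing platform and programming model from NVIDIA that lets developers use the GPU (graphics processing unit) for general-purpose computing. The CUDA C++ Programming Guide is organized into the following main parts:

1. **Introduction**: explains the benefits of using GPUs, what characterizes CUDA as a general-purpose parallel computing platform and programming model, and how its programming model scales. It also outlines the structure of the document.
2. **Programming Model**: introduces the core CUDA concepts:
   - **Kernels**: C++ functions that run on the GPU; each launch executes the function in parallel across many CUDA threads.
   - **Thread hierarchy**: threads are grouped into thread blocks, blocks into a grid, and both use multi-dimensional indices to organize and coordinate kernel execution.
   - **Memory hierarchy**: global, shared, constant, and texture memory, among others; each has different characteristics and suits a different access pattern.
   - **Heterogeneous programming**: CUDA programs mix CPU (host) and GPU (device) code so that each processor handles the work it is best at.
   - **Asynchronous SIMT programming model**: formalized in this edition (v11.5, per the source description). It lets developers issue non-blocking operations, which raises concurrency and efficiency: asynchronous operations improve utilization of system resources and cut time spent waiting.
3. **Programming Interface**: covers the CUDA compilation and runtime environment in detail:
   - **Compilation with NVCC**: the CUDA compilation workflow, including offline compilation and just-in-time compilation, plus a discussion of binary compatibility, PTX compatibility, and application compatibility.
   - **CUDA runtime**: basic operations such as initialization, device selection, memory management, kernel launch, synchronization, and error handling.

CUDA programming depends on understanding the GPU's hardware characteristics well enough to exploit them for efficient parallel computation. With the concepts above, developers can write efficient CUDA C++ programs that apply the GPU's compute power to problems in scientific computing, image processing, machine learning, and other fields.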
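The thread hierarchy summarized above is easiest to see through the index arithmetic a kernel typically performs. The following is a minimal host-only C++ sketch, not CUDA code and not taken from the guide: `Dim1`, `globalIndex`, `blocksFor`, and `vecAddEmulated` are hypothetical names chosen for illustration, and the nested loops merely emulate, sequentially on the CPU, what a one-dimensional grid of thread blocks would do in parallel on a GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for CUDA's built-in index variables (x dimension only).
struct Dim1 { unsigned x; };

// Global index of a thread in a 1D grid: the same arithmetic a kernel
// usually writes as blockIdx.x * blockDim.x + threadIdx.x.
inline std::size_t globalIndex(Dim1 blockIdx, Dim1 blockDim, Dim1 threadIdx) {
    return static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
}

// Number of blocks needed to cover n elements (ceiling division).
inline unsigned blocksFor(std::size_t n, unsigned threadsPerBlock) {
    assert(threadsPerBlock > 0);
    return static_cast<unsigned>((n + threadsPerBlock - 1) / threadsPerBlock);
}

// CPU emulation of launching a vector-add kernel: the outer loop stands in
// for the grid of blocks, the inner loop for the threads of one block.
// On a GPU these iterations would run in parallel; here they run in order.
inline std::vector<float> vecAddEmulated(const std::vector<float>& a,
                                         const std::vector<float>& b,
                                         unsigned threadsPerBlock) {
    assert(a.size() == b.size());  // assumption: equal-length inputs
    const std::size_t n = a.size();
    std::vector<float> c(n, 0.0f);
    const Dim1 blockDim{threadsPerBlock};
    const unsigned gridDim = blocksFor(n, threadsPerBlock);
    for (unsigned bx = 0; bx < gridDim; ++bx) {
        for (unsigned tx = 0; tx < threadsPerBlock; ++tx) {
            const std::size_t i = globalIndex(Dim1{bx}, blockDim, Dim1{tx});
            if (i < n) {  // guard: the last block may extend past n
                c[i] = a[i] + b[i];
            }
        }
    }
    return c;
}
```

Calling `vecAddEmulated` with five-element inputs and `threadsPerBlock = 4` uses two emulated blocks; the bounds guard keeps the three surplus threads of the second block from touching memory past the end. A real kernel launched with a rounded-up grid size would need the same guard.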

CUDA C++ Programming Guide PG-02829-001_v11.5|xvi

I.4.20.8.__global__ functions and function templates....................................................... 334

I.4.20.9.__managed__ and __shared__ variables.............................................................. 335

I.4.20.10.Defaulted functions...............................................................................................335

I.4.21.C++14 Features..............................................................................................................336

I.4.21.1.Functions with deduced return type......................................................................336

I.4.21.2.Variable templates..................................................................................................337

I.4.22.C++17 Features..............................................................................................................337

I.4.22.1.Inline Variable......................................................................................................... 338

I.4.22.2.Structured Binding..................................................................................................338

I.5.Polymorphic Function Wrappers...........................................................................................338

I.6.Extended Lambdas.................................................................................................................340

I.6.1.Extended Lambda Type Traits........................................................................................ 342

I.6.2.Extended Lambda Restrictions.......................................................................................342

I.6.3.Notes on __host__ __device__ lambdas........................................................................351

I.6.4.*this Capture By Value....................................................................................................352

I.6.5.Additional Notes.............................................................................................................. 354

I.7.Code Samples........................................................................................................................ 354

I.7.1.Data Aggregation Class.................................................................................................. 354

I.7.2.Derived Class...................................................................................................................355

I.7.3.Class Template................................................................................................................355

I.7.4.Function Template...........................................................................................................356

I.7.5.Functor Class.................................................................................................................. 356

AppendixJ.Texture Fetching...........................................................................................358

J.1.Nearest-Point Sampling....................................................................................................... 358

J.2.Linear Filtering......................................................................................................................359

J.3.Table Lookup......................................................................................................................... 360

AppendixK.Compute Capabilities................................................................................... 362

K.1.Features and Technical Specifications................................................................................362

K.2.Floating-Point Standard....................................................................................................... 366

K.3.Compute Capability 3.x.........................................................................................................367

K.3.1.Architecture....................................................................................................................367

K.3.2.Global Memory............................................................................................................... 368

K.3.3.Shared Memory..............................................................................................................370

K.4.Compute Capability 5.x.........................................................................................................371

K.4.1.Architecture....................................................................................................................371

K.4.2.Global Memory............................................................................................................... 372

K.4.3.Shared Memory..............................................................................................................372

K.5.Compute Capability 6.x.........................................................................................................376

CUDA C++ Programming Guide PG-02829-001_v11.5|xvii

K.5.1.Architecture....................................................................................................................376

K.5.2.Global Memory............................................................................................................... 376

K.5.3.Shared Memory..............................................................................................................376

K.6.Compute Capability 7.x.........................................................................................................377

K.6.1.Architecture....................................................................................................................377

K.6.2.Independent Thread Scheduling................................................................................... 377

K.6.3.Global Memory............................................................................................................... 380

K.6.4.Shared Memory..............................................................................................................380

K.7.Compute Capability 8.x.........................................................................................................381

K.7.1.Architecture....................................................................................................................381

K.7.2.Global Memory............................................................................................................... 382

K.7.3.Shared Memory..............................................................................................................382

AppendixL.Driver API..................................................................................................... 384

L.1.Context...................................................................................................................................386

L.2.Module................................................................................................................................... 387

L.3.Kernel Execution...................................................................................................................388

L.4.Interoperability between Runtime and Driver APIs.............................................................390

L.5.Driver Entry Point Access.................................................................................................... 390

L.5.1.Introduction.....................................................................................................................390

L.5.2.Driver Function Typedefs...............................................................................................391

L.5.3.Driver Function Retrieval...............................................................................................392

L.5.3.1.Using the driver API................................................................................................ 392

L.5.3.2.Using the runtime API.............................................................................................393

L.5.3.3.Retrieve per-thread default stream versions.........................................................393

L.5.3.4.Access new CUDA features.................................................................................... 394

AppendixM.CUDA Environment Variables......................................................................395

AppendixN.Unified Memory Programming....................................................................398

N.1.Unified Memory Introduction...............................................................................................398

N.1.1.System Requirements................................................................................................... 399

N.1.2.Simplifying GPU Programming.....................................................................................399

N.1.3.Data Migration and Coherency..................................................................................... 400

N.1.4.GPU Memory Oversubscription.....................................................................................401

N.1.5.Multi-GPU.......................................................................................................................401

N.1.6.System Allocator............................................................................................................402

N.1.7.Hardware Coherency.....................................................................................................402

N.1.8.Access Counters............................................................................................................ 403

N.2.Programming Model............................................................................................................ 404

N.2.1.Managed Memory Opt In...............................................................................................404

CUDA C++ Programming Guide PG-02829-001_v11.5|xviii

N.2.1.1.Explicit Allocation Using cudaMallocManaged()....................................................404

N.2.1.2.Global-Scope Managed Variables Using __managed__........................................405

N.2.2.Coherency and Concurrency.........................................................................................406

N.2.2.1.GPU Exclusive Access To Managed Memory.........................................................406

N.2.2.2.Explicit Synchronization and Logical GPU Activity................................................ 407

N.2.2.3.Managing Data Visibility and Concurrent CPU + GPU Access with Streams........408

N.2.2.4.Stream Association Examples................................................................................409

N.2.2.5.Stream Attach With Multithreaded Host Programs..............................................410

N.2.2.6.Advanced Topic: Modular Programs and Data Access Constraints..................... 411

N.2.2.7.Memcpy()/Memset() Behavior With Managed Memory......................................... 411

N.2.3.Language Integration.................................................................................................... 412

N.2.3.1.Host Program Errors with __managed__ Variables............................................. 413

N.2.4.Querying Unified Memory Support............................................................................... 413

N.2.4.1.Device Properties....................................................................................................413

N.2.4.2.Pointer Attributes....................................................................................................414

N.2.5.Advanced Topics............................................................................................................ 414

N.2.5.1.Managed Memory with Multi-GPU Programs on pre-6.x Architectures.............. 414

N.2.5.2.Using fork() with Managed Memory.......................................................................414

N.3.Performance Tuning............................................................................................................ 415

N.3.1.Data Prefetching............................................................................................................415

N.3.2.Data Usage Hints...........................................................................................................416

N.3.3.Querying Usage Attributes............................................................................................ 418

CUDA C++ Programming Guide PG-02829-001_v11.5|xix

List of Figures

Figure1. The GPU Devotes More Transistors to Data Processing ..................................................2

Figure2. GPU Computing Applications .............................................................................................3

Figure3. Automatic Scalability .......................................................................................................... 5

Figure4. Grid of Thread Blocks ........................................................................................................ 9

Figure5. Memory Hierarchy ............................................................................................................ 11

Figure6. Heterogeneous Programming ..........................................................................................13

Figure7. Matrix Multiplication without Shared Memory ................................................................31

Figure8. Matrix Multiplication with Shared Memory ..................................................................... 34

Figure9. Child Graph Example ........................................................................................................44

Figure10. Creating a Graph Using Graph APIs Example .............................................................. 45

Figure11. The Driver API Is Backward but Not Forward Compatible .........................................108

Figure12. Parent-Child Launch Nesting ...................................................................................... 248

Figure13. Kernel nodes ................................................................................................................. 287

Figure14. Adding new alloc node 2 .............................................................................................. 293

Figure15. Adding new alloc node 4 .............................................................................................. 293

Figure16. Sequentially launched graphs ......................................................................................295

Figure17. Nearest-Point Sampling Filtering Mode ..................................................................... 359

Figure18. Linear Filtering Mode ................................................................................................... 360

Figure19. One-Dimensional Table Lookup Using Linear Filtering ............................................. 361

Figure20. Examples of Global Memory Accesses ........................................................................370

Figure21. Strided Shared Memory Accesses ...............................................................................374

Figure22. Irregular Shared Memory Accesses ............................................................................ 375

Figure23. Library Context Management .......................................................................................387

CUDA C++ Programming Guide PG-02829-001_v11.5|xx

List of Tables

Table1. Linear Memory Address Space ......................................................................................... 21

Table2. Cubemap Fetch ...................................................................................................................65

Table3. Throughput of Native Arithmetic Instructions ................................................................ 125

Table4. Alignment Requirements ................................................................................................. 138

Table5. New Device-only Launch Implementation Functions .....................................................257

Table6. Supported API Functions ................................................................................................. 257

Table7. Single-Precision Mathematical Standard Library Functions with Maximum ULP

Error.............................................................................................................................................. 300

Table8. Double-Precision Mathematical Standard Library Functions with Maximum ULP

Error.............................................................................................................................................. 303

Table9. Functions Affected by -use_fast_math ........................................................................... 307

Table10. Single-Precision Floating-Point Intrinsic Functions .................................................... 307

Table11. Double-Precision Floating-Point Intrinsic Functions ...................................................309

Table12. C++11 Language Features ............................................................................................. 310

Table13. C++14 Language Features ............................................................................................. 313

Table14. Feature Support per Compute Capability ..................................................................... 362

Table15. Technical Specifications per Compute Capability .........................................................363

Table16. Objects Available in the CUDA Driver API .................................................................. 384

Table17. Typedefs header files for CUDA driver APIs ................................................................. 391

Table18. CUDA Environment Variables ......................................................................................395
