NVIDIA_CUDA_ProgrammingGuide3.0

Version 3.0

2/20/2010

NVIDIA CUDA™

Programming Guide

CUDA Programming Guide Version 3.0 iii

Table of Contents

Chapter 1. Introduction

1.1 From Graphics Processing to General-Purpose Parallel Computing

1.2 CUDA™: a General-Purpose Parallel Computing Architecture

1.3 A Scalable Programming Model

1.4 Document’s Structure

Chapter 2. Programming Model

2.1 Kernels

2.2 Thread Hierarchy

2.3 Memory Hierarchy

2.4 Heterogeneous Programming

2.5 Compute Capability

Chapter 3. Programming Interface

3.1 Compilation with NVCC

3.1.1 Compilation Workflow

3.1.2 Binary Compatibility

3.1.3 PTX Compatibility

3.1.4 Application Compatibility

3.1.5 C/C++ Compatibility

3.2 CUDA C

3.2.1 Device Memory

3.2.2 Shared Memory

3.2.3 Multiple Devices

3.2.4 Texture Memory

3.2.4.1 Texture Reference Declaration

3.2.4.2 Runtime Texture Reference Attributes

3.2.4.3 Texture Binding

3.2.5 Page-Locked Host Memory

3.2.5.1 Portable Memory

3.2.5.2 Write-Combining Memory

3.2.5.3 Mapped Memory

3.2.6 Asynchronous Concurrent Execution

3.2.6.1 Concurrent Execution between Host and Device

3.2.6.2 Overlap of Data Transfer and Kernel Execution

3.2.6.3 Concurrent Kernel Execution

3.2.6.4 Concurrent Data Transfers

3.2.6.5 Stream

3.2.6.6 Event

3.2.6.7 Synchronous Calls

3.2.7 Graphics Interoperability

3.2.7.1 OpenGL Interoperability

3.2.7.2 Direct3D Interoperability

3.2.8 Error Handling

3.2.9 Debugging using the Device Emulation Mode

3.3 Driver API

3.3.1 Context

3.3.2 Module

3.3.3 Kernel Execution

3.3.4 Device Memory

3.3.5 Shared Memory

3.3.6 Multiple Devices

3.3.7 Texture Memory

3.3.8 Page-Locked Host Memory

3.3.9 Asynchronous Concurrent Execution

3.3.9.1 Stream

3.3.9.2 Event Management

3.3.9.3 Synchronous Calls

3.3.10 Graphics Interoperability

3.3.10.1 OpenGL Interoperability

3.3.10.2 Direct3D Interoperability

3.3.11 Error Handling

3.4 Interoperability between Runtime and Driver APIs

3.5 Versioning and Compatibility

3.6 Compute Modes

3.7 Mode Switches

Chapter 4. Hardware Implementation

4.1 SIMT Architecture

4.2 Hardware Multithreading

4.3 Multiple Devices

Chapter 5. Performance Guidelines

5.1 Overall Performance Optimization Strategies

5.2 Maximize Utilization

5.2.1 Application Level

5.2.2 Device Level

5.2.3 Multiprocessor Level

5.3 Maximize Memory Throughput

5.3.1 Data Transfer between Host and Device

5.3.2 Device Memory Accesses

5.3.2.1 Global Memory

5.3.2.2 Local Memory

5.3.2.3 Shared Memory

5.3.2.4 Constant Memory

5.3.2.5 Texture Memory

5.4 Maximize Instruction Throughput

5.4.1 Arithmetic Instructions

5.4.2 Control Flow Instructions

5.4.3 Synchronization Instruction

Appendix A. CUDA-Enabled GPUs

Appendix B. C Language Extensions

B.1 Function Type Qualifiers

B.1.1 __device__

B.1.2 __global__

B.1.3 __host__

B.1.4 Restrictions

B.2 Variable Type Qualifiers
