CUDA Programming Guide: An Introduction to GPU Parallel Computing

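"CUDA Programming: A Developer's Guide to Parallel Computing with GPUs by Shane Cook"

CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model from NVIDIA, aimed at high-performance computing on graphics processing units (GPUs). The book CUDA Programming, written by Shane Cook, guides developers in making full use of the GPU's parallel computing power to build efficient applications. CUDA programming requires understanding the following key topics:

1. **GPU architecture**: CUDA exploits the GPU's many-core architecture; these cores are known as CUDA cores. Understanding the GPU's hardware components, including stream processors, shared memory, global memory, texture memory, and constant memory, is the foundation for writing efficient CUDA programs.
2. **CUDA C/C++**: CUDA provides an extended C/C++ language, called CUDA C/C++, for writing code that runs on the GPU. You need to learn how to declare and use device functions, host functions, and device variables, and how to transfer data between the GPU and the CPU.
3. **Thread and block organization**: Computation in CUDA is organized into threads and thread blocks. A thread block is a group of threads that execute the same code but may operate on different data, and multiple thread blocks form a grid. Understanding how to organize threads and thread blocks to maximize parallelism is essential.
4. **Memory hierarchy**: CUDA offers several kinds of memory, such as global memory, shared memory, registers, and constant memory. Using the memory hierarchy well can improve performance significantly, because each kind of memory has its own access speed and capacity limits.
5. **Synchronization and communication**: On the GPU, synchronization and data exchange between threads rely on specific functions and instructions, such as `__syncthreads()` and `__threadfence()`. Knowing when and how to use these mechanisms is essential for avoiding data races and ensuring correctness.
6. **Streams and events**: CUDA streams allow computation and data transfers to run asynchronously, while events can be used to measure and optimize program performance. Using streams and events to improve parallel execution and reduce latency is key to efficiency.
7. **Error handling**: CUDA programs can hit a variety of errors, such as insufficient resources or invalid operations. Checking for and handling these errors correctly avoids crashes and gives users a better experience.
8. **Application examples**: CUDA is widely used in scientific computing, image processing, machine learning, and other fields. Working through practical cases, such as matrix multiplication, physics simulation, or deep learning algorithms, deepens your understanding of CUDA programming.
9. **Tools and debugging**: CUDA provides tools such as Nsight and the CUDA Profiler to help developers analyze performance and locate problems. Being fluent with these tools helps you optimize code and resolve issues.
10. **Performance tuning**: Finally, understanding how to use the CUDA tools for performance analysis and tuning, including choosing a suitable block size, using streams to optimize memory access and computation, and identifying and removing bottlenecks, is a skill every efficient CUDA developer needs.

CUDA Programming is written for developers who want to use GPUs for high-performance computing. It covers everything from basic concepts to advanced techniques, and aims to help readers master CUDA programming and make full use of the GPU's computing power.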

A Short History of Supercomputing

INTRODUCTION

So why, in a book about CUDA, are we looking at supercomputers? Supercomputers are typically at the leading edge of the technology curve. What we see here is what will be commonplace on the desktop in 5 to 10 years. In 2010, the annual International Supercomputing Conference in Hamburg, Germany, announced that an NVIDIA GPU-based machine had been listed as the second most powerful computer in the world, according to the top 500 list (http://www.top500.org). Theoretically, it had more peak performance than the mighty IBM Roadrunner, or the then-leader, the Cray Jaguar, peaking at nearly 3 petaflops. In 2011, NVIDIA CUDA-powered GPUs went on to claim the title of the fastest supercomputer in the world. It was suddenly clear to everyone that GPUs had arrived in a very big way on the high-performance computing landscape, as well as on the humble desktop PC.

Supercomputing is the driver of many of the technologies we see in modern-day processors. Thanks to the need for ever-faster processors to process ever-larger datasets, the industry produces ever-faster computers. It is through some of these evolutions that GPU CUDA technology has come about today.

Both supercomputers and desktop computing are moving toward a heterogeneous computing route; that is, they are trying to achieve performance with a mix of CPU (Central Processing Unit) and GPU (Graphics Processing Unit) technology. Two of the largest worldwide projects using GPUs are BOINC and Folding@Home, both of which are distributed computing projects. They allow ordinary people to make a real contribution to specific scientific projects. Contributions from CPU/GPU hosts on projects supporting GPU accelerators hugely outweigh contributions from CPU-only hosts. As of November 2011, there were some 5.5 million hosts contributing a total of around 5.3 petaflops, around half that of the world's fastest supercomputer in 2011, the Fujitsu "K computer" in Japan.

Titan, the code name for the replacement for Jaguar, currently the fastest U.S. supercomputer, is planned for 2013. It will use almost 300,000 CPU cores and up to 18,000 GPU boards to achieve between 10 and 20 petaflops of performance. With support like this from around the world, GPU programming is set to jump into the mainstream, both in the HPC industry and on the desktop.

You can now put together or purchase a desktop supercomputer with several teraflops of performance. At the beginning of 2000, some 12 years ago, this would have given you first place in the top 500 list, beating Intel's ASCI Red with its 9632 Pentium processors. This just shows how much a little over a decade of computing progress has achieved, and it opens up the question of where we will be a decade from now. You can be fairly certain GPUs will be at the forefront of this trend for some time to come. Thus, learning how to program GPUs effectively is a key skill any good developer needs to acquire.

CUDA Programming. http://dx.doi.org/10.1016/B978-0-12-415933-4.00001-6

VON NEUMANN ARCHITECTURE

Almost all processors work on the basis of the process developed by Von Neumann, considered one of the fathers of computing. In this approach, the processor fetches an instruction from memory, decodes it, and then executes it.

A modern processor typically runs at anything up to 4 GHz. Modern DDR-3 memory, when paired with, say, a standard Intel I7 device, can run at anything up to 2 GHz. However, the I7 has at least four processors or cores in one device, or double that if you count its hyperthreading ability as a real processor.

A DDR-3 triple-channel memory setup on an I7 Nehalem system would produce the theoretical bandwidth figures shown in Table 1.1. Depending on the motherboard and the exact memory pattern, the actual bandwidth could be considerably less.

You run into the first problem with memory bandwidth when you consider the processor clock speed. If you take a processor running at 4 GHz, you potentially need to fetch, every cycle, an instruction (an operator) plus some data (an operand).

Each instruction is typically 32 bits (4 bytes), so if you execute nothing but a set of linear instructions, with no data, on every core, you get 4.8 GB/s ÷ 4 bytes = 1.2 billion instructions per second per core. This assumes the processor can dispatch one instruction per clock on average*. However, you typically also need to fetch and write back data, which, if we say it is on a 1:1 ratio with instructions, means we effectively halve our throughput.

The ratio of clock speed to memory speed is an important limiter for both CPU and GPU throughput and something we'll look at later. When you look into it, you find that most applications, with a few exceptions on both CPU and GPU, are memory bound and not processor cycle or processor clock/load bound.

CPU vendors try to solve this problem by using cache memory and burst memory access. This exploits the principle of locality. If you look at a typical C program, you might see the following type of operation in a function:

void some_function(void)
{
    int array[100];
    int i = 0;
    /* ... the rest of the listing, a loop over array, is missing from this excerpt ... */
}

Table 1.1 Bandwidth on I7 Nehalem Processor

| QPI Clock | Theoretical Bandwidth | Per Core |
|---|---|---|
| 4.8 GT/s (standard part) | 19.2 GB/s | 4.8 GB/s |
| 6.4 GT/s (extreme edition) | 25.6 GB/s | 6.4 GB/s |

Note: QPI = Quick Path Interconnect.

* The actual achieved dispatch rate can be higher or lower than one; we use one here for simplicity.


If the data is not in the first level (L1) cache, then a fetch from the second or third level (L2 or L3) cache is required, or from the main memory if no cache line has this data already. The first level cache typically runs at or near the processor clock speed, so for the execution of our loop, we potentially do get near the full processor speed, assuming writes are cached as well as reads. However, there is a cost for this: the L1 cache is typically only 16 KB or 32 KB in size. The L2 cache is somewhat slower, but much larger, typically around 256 KB. The L3 cache is much larger, usually several megabytes in size, but again much slower than the L2 cache.

In real-life programs, loops work through far larger datasets, perhaps many megabytes in size. Even if the program can remain in cache memory, the dataset usually cannot, so the processor, despite all this cache trickery, is quite often limited by the memory throughput or bandwidth.

When the processor fetches an instruction or data item from the cache instead of the main memory, it's called a cache hit. The incremental benefit of using progressively larger caches drops off quite rapidly. This in turn means the ever-larger caches we see on modern processors are a less and less useful means to improve performance, unless they manage to encompass the entire dataset of the problem.

The Intel I7-920 processor has some 8 MB of internal L3 cache. This cache memory is not free, and

if we look at the die for the Intel I7 processor, we see around 30% of the size of the chip is dedicated to

the L3 cache memory (Figure 1.2).

As cache sizes grow, so does the physical size of the silicon used to make the processors. The larger the chip, the more expensive it is to manufacture and the higher the likelihood that it will contain an error and be discarded during the manufacturing process. Sometimes these faulty devices are sold cheaply as either triple- or dual-core devices, with the faulty cores disabled. However, the effect of larger, progressively more inefficient caches ultimately results in higher costs to the end user.

FIGURE 1.2 Layout of I7 Nehalem processor on processor die: Core 1, Core 2, Core 3, and Core 4, with a shared L3 cache.
