Scalable Parallel Programming with CUDA
JOHN NICKOLLS, IAN BUCK, AND MICHAEL GARLAND, NVIDIA; KEVIN SKADRON, UNIVERSITY OF VIRGINIA
March/April 2008, ACM Queue
Is CUDA the parallel programming model that
application developers have been waiting for?
The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. Furthermore, their parallelism continues to scale with Moore's law. The challenge is to develop mainstream application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.
According to conventional wisdom, parallel programming is difficult. Early experience with the CUDA scalable parallel programming model and C language,1,2 however, shows that many sophisticated programs can be readily expressed with a few easily understood abstractions. Since NVIDIA released CUDA in 2007, developers have rapidly developed scalable parallel programs for a wide range of applications, including computational chemistry, sparse matrix solvers, sorting, searching, and physics models. These applications scale transparently to hundreds of processor cores and thousands of concurrent
threads. NVIDIA GPUs with the new Tesla unified graphics and computing architecture (described in the GPU sidebar) run CUDA C programs and are widely available in laptops, PCs, workstations, and servers. The CUDA model is also applicable to other shared-memory parallel processing architectures, including multicore CPUs.3
CUDA provides three key abstractions—a hierarchy of thread groups, shared memories, and barrier synchronization—that provide a clear parallel structure to conventional C code for one thread of the hierarchy. Multiple levels of threads, memory, and synchronization provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. The abstractions guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel, and then into finer pieces that can be solved cooperatively in parallel.
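As a concrete illustration (a hypothetical kernel, not taken from the article), the three abstractions appear directly in CUDA C source: a grid of thread blocks expresses the coarse, independent sub-problems, while threads within a block cooperate through __shared__ memory and the __syncthreads() barrier:

```cuda
// Hypothetical sketch of the three abstractions. Each thread block
// handles one independent 256-element tile (coarse data parallelism);
// threads within the block cooperate through on-chip shared memory
// (fine-grained parallelism), separated by a barrier.
__global__ void scaleAndShift(const float *in, float *out,
                              float a, float b, int n)
{
    __shared__ float tile[256];                 // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];              // each thread stages one element
    __syncthreads();                            // barrier: tile fully loaded
    if (i < n)
        out[i] = a * tile[threadIdx.x] + b;     // then compute independently
}
```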
UNIFIED GRAPHICS AND COMPUTING GPUS
Driven by the insatiable market demand for realtime, high-definition 3D graphics, the programmable GPU (graphics processing unit) has evolved into a highly parallel, multithreaded, manycore processor. It is designed to efficiently support the graphics shader programming model, in which a program for one thread draws one vertex or shades one pixel fragment. The GPU excels at fine-grained, data-parallel workloads consisting of thousands of independent threads executing vertex, geometry, and pixel-shader program threads concurrently.
The tremendous raw performance of modern GPUs has led researchers to explore mapping more general non-graphics computations onto them. These GPGPU (general-purpose computation on GPUs) systems have produced some impressive results, but the limitations and difficulties of doing this via graphics APIs are legion. This desire to use the GPU as a more general parallel computing device motivated NVIDIA to develop a new unified graphics and computing GPU architecture and the CUDA programming model.
GPU COMPUTING ARCHITECTURE
Introduced by NVIDIA in November 2006, the Tesla unified graphics and computing architecture1,2 significantly extends the GPU beyond graphics—its massively multithreaded processor array becomes a highly efficient unified platform for both graphics and general-purpose parallel computing applications. By scaling the number of processors and memory partitions, the Tesla architecture spans a wide market range—from the high-performance enthusiast GeForce 8800 GPU and professional Quadro and Tesla computing products to a variety of inexpensive, mainstream GeForce GPUs. Its computing features enable straightforward programming of the GPU cores in C with CUDA. Wide availability in laptops, desktops, workstations, and servers, coupled with C programmability and CUDA software, makes the Tesla architecture the first ubiquitous supercomputing platform.
The Tesla architecture is built around a scalable array of multithreaded SMs (streaming multiprocessors). Current GPU implementations range from 768 to 12,288 concurrently executing threads. Transparent scaling across this wide range of available parallelism is a key design goal of both the GPU architecture and the CUDA programming model. Figure A shows a GPU with 14 SMs—a total of 112 SP (streaming processor) cores—interconnected with four external DRAM partitions. When a CUDA program on the host CPU invokes a kernel grid, the CWD (compute work distribution) unit enumerates the blocks of the grid and begins distributing them to SMs with available execution capacity. The threads of a thread block execute concurrently on one SM. As thread blocks terminate, the CWD unit launches new blocks on the vacated multiprocessors.
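A consequence of this block-by-block distribution (hypothetical host-side sketch, not from the article) is that a program specifies only the grid of blocks; the CWD unit maps those blocks onto however many SMs the particular GPU provides, so the same launch runs unchanged on large and small configurations:

```cuda
// Hypothetical launch: the grid size depends only on the problem size.
// The hardware's CWD unit assigns the blocks to available SMs, so this
// code scales transparently across GPU configurations.
__global__ void addOne(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

void launchAddOne(float *d_x, int n)
{
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    addOne<<<blocks, threadsPerBlock>>>(d_x, n);
}
```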
An SM consists of eight scalar SP cores, two SFUs (special function units) for transcendentals, an MT IU (multithreaded instruction unit), and on-chip shared memory. The SM creates, manages, and executes up to 768 concurrent threads in hardware with zero scheduling overhead. It can execute as many as eight CUDA thread blocks concurrently, limited by thread and memory resources. The SM implements the CUDA __syncthreads() barrier synchronization intrinsic with a single instruction. Fast barrier synchronization together with lightweight thread creation and zero-overhead thread scheduling efficiently support very fine-grained parallelism, allowing a new thread to be created to compute each vertex, pixel, and data point.
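The single-instruction barrier makes block-level cooperation cheap enough for idioms such as a shared-memory tree reduction (a hypothetical example, not from the article), where every level of the tree is separated by __syncthreads():

```cuda
// Hypothetical per-block sum reduction. Threads cooperate through shared
// memory; __syncthreads() separates the levels of the reduction tree.
// Assumes blockDim.x is a power of two, at most 256.
__global__ void blockSum(const float *in, float *partial, int n)
{
    __shared__ float s[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                            // all loads complete
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();                        // level finished before next
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = s[0];             // one partial sum per block
}
```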
To manage hundreds of threads running several different programs, the Tesla SM employs a new architecture we call SIMT (single-instruction, multiple-thread).