Scalable Parallel Programming with CUDA
JOHN NICKOLLS, IAN BUCK, AND MICHAEL GARLAND, NVIDIA; KEVIN SKADRON, UNIVERSITY OF VIRGINIA
March/April 2008, ACM Queue
Is CUDA the parallel programming model that
application developers have been waiting for?
The advent of multicore CPUs and manycore GPUs means that mainstream processor chips are now parallel systems. Furthermore, their parallelism continues to scale with Moore's law. The challenge is to develop mainstream application software that transparently scales its parallelism to leverage the increasing number of processor cores, much as 3D graphics applications transparently scale their parallelism to manycore GPUs with widely varying numbers of cores.
According to conventional wisdom, parallel programming is difficult. Early experience with the CUDA scalable parallel programming model and C language,1,2 however, shows that many sophisticated programs can be readily expressed with a few easily understood abstractions. Since NVIDIA released CUDA in 2007, developers have rapidly developed scalable parallel programs for a wide range of applications, including computational chemistry, sparse matrix solvers, sorting, searching, and physics models. These applications scale transparently to hundreds of processor cores and thousands of concurrent
threads. NVIDIA GPUs with the new Tesla unified graphics and computing architecture (described in the GPU sidebar) run CUDA C programs and are widely available in laptops, PCs, workstations, and servers. The CUDA model is also applicable to other shared-memory parallel processing architectures, including multicore CPUs.3
CUDA provides three key abstractions—a hierarchy of thread groups, shared memories, and barrier synchronization—that provide a clear parallel structure to conventional C code for one thread of the hierarchy. Multiple levels of threads, memory, and synchronization provide fine-grained data parallelism and thread parallelism, nested within coarse-grained data parallelism and task parallelism. The abstractions guide the programmer to partition the problem into coarse sub-problems that can be solved independently in parallel, and then into finer pieces that can be solved cooperatively in parallel.
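As a concrete illustration (a hypothetical kernel, not taken from the article), the three abstractions appear directly in CUDA C source: a grid of thread blocks expresses the coarse, independent sub-problems, while threads within a block cooperate through __shared__ memory and the __syncthreads() barrier:

```cuda
// Hypothetical sketch of the three abstractions. Each thread block
// handles one independent 256-element tile (coarse data parallelism);
// threads within the block cooperate through on-chip shared memory
// (fine-grained parallelism), separated by a barrier.
__global__ void scaleAndShift(const float *in, float *out,
                              float a, float b, int n)
{
    __shared__ float tile[256];                 // per-block shared memory
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];              // each thread stages one element
    __syncthreads();                            // barrier: tile fully loaded
    if (i < n)
        out[i] = a * tile[threadIdx.x] + b;     // then compute independently
}
```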
UNIFIED GRAPHICS AND COMPUTING GPUS
Driven by the insatiable market demand for realtime, high-definition 3D graphics, the programmable GPU (graphics processing unit) has evolved into a highly parallel, multithreaded, manycore processor. It is designed to efficiently support the graphics shader programming model, in which a program for one thread draws one vertex or shades one pixel fragment. The GPU excels at fine-grained, data-parallel workloads consisting of thousands of independent threads executing vertex, geometry, and pixel-shader program threads concurrently.
The tremendous raw performance of modern GPUs has led researchers to explore mapping more general non-graphics computations onto them. These GPGPU (general-purpose computation on GPUs) systems have produced some impressive results, but the limitations and difficulties of doing this via graphics APIs are legion. This desire to use the GPU as a more general parallel computing device motivated NVIDIA to develop a new unified graphics and computing GPU architecture and the CUDA programming model.
GPU COMPUTING ARCHITECTURE
Introduced by NVIDIA in November 2006, the Tesla unified graphics and computing architecture1,2 significantly extends the GPU beyond graphics—its massively multithreaded processor array becomes a highly efficient unified platform for both graphics and general-purpose parallel computing applications. By scaling the number of processors and memory partitions, the Tesla architecture spans a wide market range—from the high-performance enthusiast GeForce 8800 GPU and professional Quadro and Tesla computing products to a variety of inexpensive, mainstream GeForce GPUs. Its computing features enable straightforward programming of the GPU cores in C with CUDA. Wide availability in laptops, desktops, workstations, and servers, coupled with C programmability and CUDA software, makes the Tesla architecture the first ubiquitous supercomputing platform.
The Tesla architecture is built around a scalable array of multithreaded SMs (streaming multiprocessors). Current GPU implementations range from 768 to 12,288 concurrently executing threads. Transparent scaling across this wide range of available parallelism is a key design goal of both the GPU architecture and the CUDA programming model. Figure A shows a GPU with 14 SMs—a total of 112 SP (streaming processor) cores—interconnected with four external DRAM partitions. When a CUDA program on the host CPU invokes a kernel grid, the CWD (compute work distribution) unit enumerates the blocks of the grid and begins distributing them to SMs with available execution capacity. The threads of a thread block execute concurrently on one SM. As thread blocks terminate, the CWD unit launches new blocks on the vacated multiprocessors.
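A consequence of this block-by-block distribution (hypothetical host-side sketch, not from the article) is that a program specifies only the grid of blocks; the CWD unit maps those blocks onto however many SMs the particular GPU provides, so the same launch runs unchanged on large and small configurations:

```cuda
// Hypothetical launch: the grid size depends only on the problem size.
// The hardware's CWD unit assigns the blocks to available SMs, so this
// code scales transparently across GPU configurations.
__global__ void addOne(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

void launchAddOne(float *d_x, int n)
{
    int threadsPerBlock = 256;
    int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
    addOne<<<blocks, threadsPerBlock>>>(d_x, n);
}
```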
An SM consists of eight scalar SP cores, two SFUs (special function units) for transcendentals, an MT IU (multithreaded instruction unit), and on-chip shared memory. The SM creates, manages, and executes up to 768 concurrent threads in hardware with zero scheduling overhead. It can execute as many as eight CUDA thread blocks concurrently, limited by thread and memory resources. The SM implements the CUDA __syncthreads() barrier synchronization intrinsic with a single instruction. Fast barrier synchronization together with lightweight thread creation and zero-overhead thread scheduling efficiently support very fine-grained parallelism, allowing a new thread to be created to compute each vertex, pixel, and data point.
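The single-instruction barrier makes block-level cooperation cheap enough for idioms such as a shared-memory tree reduction (a hypothetical example, not from the article), where every level of the tree is separated by __syncthreads():

```cuda
// Hypothetical per-block sum reduction. Threads cooperate through shared
// memory; __syncthreads() separates the levels of the reduction tree.
// Assumes blockDim.x is a power of two, at most 256.
__global__ void blockSum(const float *in, float *partial, int n)
{
    __shared__ float s[256];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                            // all loads complete
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();                        // level finished before next
    }
    if (threadIdx.x == 0)
        partial[blockIdx.x] = s[0];             // one partial sum per block
}
```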
To manage hundreds of threads running several different programs, the Tesla SM employs a new architecture we call SIMT (single-instruction, multiple-thread).