programmable 32-bit floating-point pixel-
fragment processors and vertex processors,
programmed with Cg programs, DX9, and
OpenGL. These processors were highly multi-
threaded, creating a thread and executing a
thread program for each vertex and pixel
fragment. The GeForce 6800 scalable pro-
cessor core architecture facilitated multiple
GPU implementations with different num-
bers of processor cores.
Developing the Cg language⁶ for programming GPUs provided a scalable parallel
programming model for the programmable
floating-point vertex and pixel-fragment pro-
cessors of GeForce FX, GeForce 6800, and
subsequent GPUs. A Cg program resembles
a C program for a single thread that draws
a single vertex or single pixel. The multi-
threaded GPU created independent threads
that executed a shader program to draw
every vertex and pixel fragment.
In addition to rendering real-time graph-
ics, programmers also used Cg to compute
physical simulations and other general-
purpose GPU (GPGPU) computations.
Early GPGPU computing programs achieved
high performance, but were difficult to write
because programmers had to express non-
graphics computations with a graphics API
such as OpenGL.
Unified computing and graphics GPUs
The GeForce 8800 introduced in 2006
featured the first unified graphics and com-
puting GPU architecture,⁷,⁸ programmable
in C with the CUDA parallel computing
model, in addition to using DX10 and
OpenGL. Its unified streaming processor
cores executed vertex, geometry, and pixel
shader threads for DX10 graphics programs,
and also executed computing threads for
CUDA C programs. Hardware multithread-
ing enabled the GeForce 8800 to efficiently
execute up to 12,288 threads concurrently
in 128 processor cores. NVIDIA deployed
the scalable architecture in a family of
GeForce GPUs with different numbers of
processor cores for each market segment.
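A minimal CUDA C sketch (not from the article; names and sizes are illustrative) of the model just described: each thread handles one array element, much as a shader thread draws one pixel or vertex, and a single launch creates thousands of threads for the hardware to schedule across however many cores the GPU provides.

#include <cuda_runtime.h>

// Each thread scales one element; the grid supplies one thread per element.
__global__ void scale(float *data, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)
        data[i] *= factor;
}

int main(void)
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));
    // Enough 256-thread blocks to cover n elements; hardware multithreading
    // schedules the resulting thousands of threads across the processor cores.
    scale<<<(n + 255) / 256, 256>>>(d_data, 2.0f, n);
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}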
The GeForce 8800 was the first GPU to
use scalar thread processors rather than vector
processors, matching standard scalar languages
like C, and eliminating the need to manage
vector registers and program vector
operations. It added instructions to support
C and other general-purpose languages,
including integer arithmetic, IEEE 754
floating-point arithmetic, and load/store
memory access instructions with byte address-
ing. It provided hardware and instructions to
support parallel computation, communica-
tion, and synchronization—including thread
arrays, shared memory, and fast barrier
synchronization.
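The sketch below (again illustrative, assuming a launch with 256 threads per block) shows those cooperation primitives together: a thread block stages its inputs in on-chip shared memory and uses the fast barrier __syncthreads() to coordinate a per-block sum reduction.

// Assumes 256 threads per block; produces one partial sum per thread block.
__global__ void block_sum(const float *in, float *block_sums, int n)
{
    __shared__ float buf[256];         // per-block on-chip shared memory
    int i   = blockIdx.x * blockDim.x + threadIdx.x;
    int tid = threadIdx.x;

    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();                   // barrier: buf is fully written

    // Tree reduction within the thread block.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride)
            buf[tid] += buf[tid + stride];
        __syncthreads();               // barrier after each reduction step
    }
    if (tid == 0)
        block_sums[blockIdx.x] = buf[0];
}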
GPU computing systems
At first, users built personal supercom-
puters by adding multiple GPU cards to
PCs and workstations, and assembled clusters
of GPU computing nodes. In 2007, respond-
ing to demand for GPU computing systems,
NVIDIA introduced the Tesla C870, D870,
and S870 GPU card, deskside, and rack-
mount GPU computing systems containing
one, two, and four T8 GPUs. The T8
GPU was based on the GeForce 8800
GPU, configured for parallel computing.
The second-generation Tesla C1060 and
S1070 GPU computing systems introduced
in 2008 used the T10 GPU, based on the
GPU in GeForce GTX 280. The T10 fea-
tured 240 processor cores, 1-teraflop-per-
second peak single-precision floating-point
rate, IEEE 754-2008 double-precision 64-
bit floating-point arithmetic, and 4-Gbyte
DRAM memory. Today there are Tesla
S1070 systems with thousands of GPUs
widely deployed in high-performance com-
puting systems in production and research.
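As a rough check of that quoted peak rate, using figures the article does not state (a shader clock of roughly 1.44 GHz and dual issue of a multiply-add plus a multiply, that is, 3 flops per core per clock): 240 cores × 3 flops/clock × 1.44 GHz ≈ 1.04 teraflops per second, consistent with the 1-teraflop figure.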
NVIDIA introduced the third-generation
Fermi GPU computing architecture in
2009.⁹
Based on user experience with prior
generations, it addressed several key areas to
make GPU computing more broadly appli-
cable. Fermi implemented IEEE 754-2008
and significantly increased double-precision
performance. It added error-correcting code
(ECC) memory protection for large-scale
GPU computing, 64-bit unified addressing,
cached memory hierarchy, and instructions
for C, C++, Fortran, OpenCL, and
DirectCompute.
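An illustrative CUDA C sketch (not from the article) of that double-precision emphasis: each thread accumulates part of a dot product with the IEEE 754-2008 fused multiply-add, which rounds a*b + c only once.

// Writes one partial sum per thread; the partial array needs one slot per
// launched thread, and the host (or a second kernel) sums the partials.
__global__ void dot_partial(const double *a, const double *b,
                            double *partial, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    double acc = 0.0;
    // Grid-stride loop so any launch configuration covers all n elements.
    for (int i = tid; i < n; i += gridDim.x * blockDim.x)
        acc = fma(a[i], b[i], acc);    // fused multiply-add in double precision
    partial[tid] = acc;
}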
GPU computing ecosystem
The GPU computing ecosystem is expand-
ing rapidly, enabled by the deployment of
more than 180 million CUDA-capable