Chapter 1
Introduction
The classical version of Moore’s Law predicts that the capacity of Integrated Circuits (ICs)
doubles roughly every 18 months. Microprocessor manufacturers followed this law by reducing
operating voltages and using smaller, faster transistors. Frequency scaling eventually reached the
point where circuits emitted more heat than could reasonably be dissipated – the so-called power
wall. This led Intel, the leading microprocessor manufacturer, to publicly announce in 2004 that it
would dedicate all its future design efforts to multi-core environments. Nowadays, Intel offers an
8-core version of its high-end Xeon processor (V8), while AMD’s Opteron is available in a 12-core
version, both built with a 45 nm manufacturing process.
Simply doubling the number of cores in a die does not guarantee a twofold speedup over the
initial microprocessor for a given application. Indeed, Amdahl’s law [34] states that the maximum
overall improvement attainable on a system with N processors is strongly limited by the fraction of
the program that executes sequentially, as well as by the degree of parallelism of the parallel sec-
tions. Most existing software, developed during the single-core era, is essentially sequential and
therefore does not benefit from a multicore system. One idea, dating back to the 1960s, is to write
compilers that automatically parallelize these sequential programs. The success of such approaches
seems to be inversely proportional to the number of targeted cores. One reason for this lack of
success is that the sequential solution these tools start from has already lost some of the “parallel
semantics” of the problem to be solved. Consequently, making efficient use of multiple cores
requires recovering some of this lost parallelism, which means recoding parts of the application
using the thread programming model or one of the well-known APIs supporting inter-process
communication: MPI, PVM, or OpenMP. Another reason for the poor performance of these
parallelized programs is that inter-process communication in a multicore system, usually resolved
by shared-memory techniques, is very costly. In any case, the success of this approach depends on
the data-level parallelism of the initial application.
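Amdahl’s bound can be illustrated with a short numerical sketch. Assuming a fixed serial fraction s of the work, the speedup on N processors is at most 1/(s + (1 − s)/N); the function below (an illustrative helper, not part of any cited work) evaluates this bound:

```python
def amdahl_speedup(serial_fraction: float, n_cores: int) -> float:
    """Upper bound on speedup with n_cores processors when a fixed
    fraction of the work (serial_fraction) cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

# Even a small serial fraction caps the achievable speedup: with 10%
# serial work, 8 cores yield at most ~4.71x, and the limit as n_cores
# grows is 1 / 0.10 = 10x, no matter how many cores are added.
for n in (2, 4, 8, 1000):
    print(n, round(amdahl_speedup(0.10, n), 2))
```

This makes the point in the text concrete: doubling the core count from 4 to 8 improves the bound only from about 3.08x to 4.71x once a tenth of the program is sequential.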
One success story is computer graphics. Graphics processing is an application domain with
massively parallel computational kernels: entire animation scenes, and also parts of each frame,
can be processed in parallel. Traditionally, Graphics Processing Units (GPUs) consisted of numer-
ous but rather simple Processing Elements (PEs) capable of processing the many graphics-
related tasks in a flow-like manner. In 2001, with the introduction of the first programmable GPU
(the NV20 series), programmers could execute custom visual-effects programs using the Shader
Language 1.1. In 2007, nVIDIA formalized the GPU’s computing capabilities under the name
Compute Unified Device Architecture (CUDA): the parallel computing architecture present in
nVIDIA GPUs. General-purpose computations can be expressed using C for CUDA, a C sub-
set with nVIDIA extensions. As the PEs of modern GPUs support some of the basic floating-
point operators, it is tempting to use them to perform massively parallel scientific computations.