Foreword
In the last few years computing has entered the heterogeneous computing era, which
aims to bring together in a single device the best of both central processing units
(CPUs) and graphics processing units (GPUs). Designers are creating an increasingly
wide range of heterogeneous machines, and hardware vendors are making them
broadly available. This change in hardware offers great platforms for exciting new
applications. But, because the designs are different, classical programming models
do not work very well, and it is important to learn about new models such as those in
OpenCL.
When the design of OpenCL started, the designers noticed that for a class of
algorithms that were latency focused (e.g. spreadsheets), developers wrote code in C
or C++ and ran it on a CPU, but for a second class of algorithms that were throughput
focused (e.g. matrix multiply), developers often wrote in CUDA and used a GPU: two
related approaches, but each worked on only one kind of processor—C++ did not run
on a GPU, CUDA did not run on a CPU. Developers had to specialize in one and
ignore the other. But the real power of a heterogeneous device is that it can efficiently
run applications that mix both classes of algorithms. The question was: how do you
program such machines?
One solution is to add new features to the existing platforms; both C++ and CUDA
are actively evolving to meet the challenge of new hardware. Another solution was to
create a new set of programming abstractions specifically targeted at heterogeneous
computing. Apple came up with an initial proposal for such a new paradigm. This
proposal was refined by technical teams from many companies, and became OpenCL.
When the design started, I was privileged to be part of one of those teams. We had
a lot of goals for the kernel language: (1) let developers write kernels in a single
source language; (2) allow those kernels to be functionally portable over CPUs,
GPUs, field-programmable gate arrays, and other sorts of devices; (3) be low level
so that developers could tease out all the performance of each device; (4) keep the
model abstract enough so that the same code would work correctly on machines
being built by lots of companies. And, of course, as with any computer project, we
wanted to do this fast. To speed up implementations, we chose to base the language
on C99. In less than 6 months we produced the specification for OpenCL 1.0, and
within 1 year the first implementations appeared. And then, time passed and OpenCL
met real developers ...
So what happened? First, C developers pointed out all the great C++ features
(a real memory model, atomics, etc.) that made them more productive, and CUDA
developers pointed out all the new features that NVIDIA added to CUDA (e.g.
nested parallelism) that make programs both simpler and faster. Second, as hardware
architects explored heterogeneous computing, they figured out how to remove the
early restrictions requiring CPUs and GPUs to have separate memories. One great
hardware change was the development of integrated devices, which provide both a