weaker support for these operations, although they can be
mimicked at lower performance via memory.
Tightly defined execution order and memory
model: Modern CPUs have relatively strict rules on the
order in which instructions are completed and on when memory stores become visible to memory loads.
GPUs have more relaxed rules, which provides greater free-
dom for hardware scheduling but makes it more difficult to
provide ordering guarantees at the language level.
3. PARALLELISM MODEL: SPMD ON SIMD
Any language for parallel programming requires a concep-
tual model for expressing parallelism in the language and for
mapping this language-level parallelism to the underlying
hardware. For the following discussion of ispc’s approach,
we rely on Flynn’s taxonomy of programming models into
SIMD, MIMD, etc. [8], with Darema’s enhancement to in-
clude SPMD (Single Program Multiple Data) [7].
3.1 Why SPMD?
Recall that our goal is to design a language and compiler
for today’s SIMD CPU hardware. One option would be to
use a purely sequential language, such as unmodified C, and
rely on the compiler to find parallelism and map it to the
SIMD hardware. This approach is commonly referred to
as auto-vectorization [37]. Although auto-vectorization can
work well for regular code that lacks conditional operations,
a number of issues limit the applicability of the technique in
practice. All optimizations performed by an auto-vectorizer
must honor the original sequential semantics of the program;
the auto-vectorizer thus must have visibility into the entire
loop body, which precludes vectorizing loops that call out
to externally-defined functions, for example. Complex con-
trol flow and deeply nested function calls also often inhibit
auto-vectorization in practice, in part due to heuristics that
auto-vectorizers must apply to decide when to try to vec-
torize. As a result, auto-vectorization fails to provide good
performance transparency—it is difficult to know whether
a particular fragment of code will be successfully vectorized
by a given compiler and how it will perform.
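As a concrete (and hypothetical) illustration of these limitations, consider a loop of the following form; the function names are invented for this example, and whether a given compiler vectorizes such a loop depends on its own analyses and heuristics:

    // Defined in another translation unit; the auto-vectorizer cannot
    // see its body, so it must assume arbitrary side effects.
    float external_shade(float v);

    void process(const float *in, float *out, int count) {
        for (int i = 0; i < count; ++i) {
            float v = in[i];
            if (v > 0.0f)              // data-dependent control flow
                v = external_shade(v); // opaque call inhibits vectorization
            out[i] = v;
        }
    }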
To achieve ispc's goals of efficiency and performance transparency, it is clear that the language must have parallel semantics. This leads to the question: how should parallelism be expressed? The most obvious option is to express SIMD operations directly as explicit vector computations. This approach works acceptably in many cases
when the SIMD width is four or less, since explicit operations
on 3-vectors and 4-vectors are common in many algorithms.
For SIMD widths greater than four, this option is still ef-
fective for algorithms without data-dependent control flow,
and can be implemented in C++ using operator overload-
ing layered over intrinsics. However, this option becomes
less viable once complex control flow is required.
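A minimal sketch of this style follows, assuming a hand-written wrapper type; the class and operator set shown are illustrative, not taken from any particular library:

    #include <xmmintrin.h>

    // A 4-wide float vector type layered over SSE intrinsics via operator
    // overloading; straight-line arithmetic maps naturally to SIMD.
    struct vec4 {
        __m128 v;
        vec4(__m128 x) : v(x) {}
        explicit vec4(float x) : v(_mm_set1_ps(x)) {}
    };

    inline vec4 operator+(vec4 a, vec4 b) { return vec4(_mm_add_ps(a.v, b.v)); }
    inline vec4 operator*(vec4 a, vec4 b) { return vec4(_mm_mul_ps(a.v, b.v)); }

    // Works well for code like a * x + y across all four lanes at once...
    inline vec4 axpy(vec4 a, vec4 x, vec4 y) { return a * x + y; }
    // ...but a per-lane "if (x > 0)" has no direct expression in this style
    // and must instead be rewritten with explicit comparison masks.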
Given complex control flow, what the programmer ideally
wants is a programming model that is as close as possible to
MIMD, but that can be efficiently compiled to the available
SIMD hardware. SPMD provides just such a model: with
SPMD, there are multiple instances of a single program exe-
cuting concurrently and operating on different data. SPMD
programs largely look like scalar programs (unlike explicit
SIMD), which leads to a productivity advantage for pro-
grammers working with SPMD programs. Furthermore, the
SPMD approach aids with performance transparency: vec-
torization of a SPMD program is guaranteed by the under-
lying model, so a programmer can write SPMD code with
a clear mental model of how it will be compiled. Over the
past ten years the SPMD model has become widely used
on GPUs, first for programmable shading [28] and then for
more general-purpose computation via CUDA and OpenCL.
ispc implements SPMD execution on the SIMD vector
units of CPUs; we refer to this model as “SPMD-on-SIMD”.
Each instance of the program corresponds to a different
SIMD lane; conditionals and control flow that are different
between the program instances are allowed. As long as each
program instance operates only on its own data, it produces
the same results that would be obtained if it were running
on a dedicated MIMD processor. Figure 1 illustrates how
SPMD execution is implemented on CPU SIMD hardware.
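As a conceptual sketch (written here as plain scalar C++ for illustration, not ispc code; the function and its index parameter are assumptions for this example), the code that one program instance executes looks like ordinary scalar code:

    // What one program instance conceptually executes: scalar-looking code
    // operating on its own element i. Under SPMD-on-SIMD, each such
    // instance occupies one SIMD lane, and instances in the same gang are
    // free to take different paths through the "if" below.
    float shade_one(const float *x, int i) {
        float v = x[i];
        if (v > 0.0f)        // control flow may differ between instances
            return v * v;
        return -v;
    }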
3.2 Basic Execution Model
Upon entry to an ispc function called from C/C++ code,
the execution model switches from the application’s serial
model to ispc’s SPMD model. Conceptually, a number of
program instances start running concurrently. The group
of running program instances is called a gang (harkening to “gang scheduling”, since ispc provides certain guarantees about when program instances running in a gang run concurrently with other program instances in the gang, detailed below). [3]
The gang of program instances starts executing in
the same hardware thread and context as the application
code that called the ispc function; no thread creation or
implicit context switching is done by ispc.
The number of program instances in a gang is relatively
small; in practice, it is no more than twice the SIMD width of the hardware that it is executing on. [4] Thus, there are four
or eight program instances in a gang on a CPU using the
4-wide SSE instruction set, and eight or sixteen on a CPU
using 8-wide AVX. The gang size is set at compile time.
SPMD parallelization across the SIMD lanes of a single
core is complementary to multi-core parallelism. For ex-
ample, if an application has already been parallelized across
cores, then threads in the application can independently call
functions written in ispc to use the SIMD unit on the core
where they are running. Alternatively, ispc has capabilities
for launching asynchronous tasks for multi-core parallelism;
they will be introduced in Section 5.4.
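As an illustration of the first approach, each application thread might call into ispc on its own slice of the data; the exported function name and signature below are hypothetical, standing in for whatever C-callable entry point ispc generates for an exported kernel:

    #include <thread>

    // Hypothetical C-callable entry point for an ispc-exported kernel;
    // each call runs a gang of program instances in the calling thread,
    // using the SIMD unit of the core that thread is running on.
    extern "C" void kernel_ispc(const float *in, float *out, int count);

    void run_on_two_threads(const float *in, float *out, int count) {
        int half = count / 2;
        std::thread t0(kernel_ispc, in, out, half);
        std::thread t1(kernel_ispc, in + half, out + half, count - half);
        t0.join();
        t1.join();
    }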
3.3 Mapping SPMD To Hardware: Control
One of the challenges in SPMD execution is handling di-
vergent control flow. Consider a while loop with a termi-
nation test n > 0; when different program instances have
different values for n, they will need to execute the loop
body different numbers of times.
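To make the problem concrete, the following sketch (C++ with SSE intrinsics; the blends require SSE4.1) shows one way such a loop can be executed across four lanes: the hardware loop runs as long as any lane's n is still positive, and a per-lane mask keeps inactive lanes from updating their state. This illustrates the general masking approach, not ispc's actual generated code, and the loop body (accumulating 1.0 per iteration) is invented for the example.

    #include <immintrin.h>

    // Execute "while (n > 0) { result += 1.0f; n--; }" for four program
    // instances at once. Lanes whose n has reached zero stay in the loop
    // but are masked out, so their result and n are left unchanged.
    void spmd_while(int n_in[4], float result[4]) {
        __m128i n   = _mm_loadu_si128((const __m128i *)n_in);
        __m128  acc = _mm_loadu_ps(result);

        for (;;) {
            // Per-lane test n > 0; lanes passing the test are "active".
            __m128i active = _mm_cmpgt_epi32(n, _mm_setzero_si128());
            if (_mm_movemask_epi8(active) == 0)
                break;                      // no lane active: all done
            // Loop body, applied only to active lanes via a blend.
            __m128 added = _mm_add_ps(acc, _mm_set1_ps(1.0f));
            acc = _mm_blendv_ps(acc, added, _mm_castsi128_ps(active));
            // Decrement n only in active lanes.
            __m128i dec = _mm_sub_epi32(n, _mm_set1_epi32(1));
            n = _mm_blendv_epi8(n, dec, active);
        }
        _mm_storeu_ps(result, acc);
    }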
ispc’s SPMD-on-SIMD model provides the illusion of sep-
arate control flow for each SIMD lane, but the burden of
[3] Program instances thus correspond to threads in CUDA and work items in OpenCL. A gang roughly corresponds to a CUDA warp.
[4] Running gangs wider than the SIMD width can give performance benefits from amortizing shared computation (such as scalar control flow overhead) over more program instances, from better cache reuse across the program instances, and from more instruction-level parallelism being available. The costs are greater register pressure and potentially more control flow divergence across the program instances.