and easy. It can be written separately from the algorithm, by an
architecture expert if necessary. We can most flexibly schedule
operations that are data parallel, with statically analyzable access
patterns, but also support the reductions and bounded irregular ac-
cess patterns that occur in image processing.
In addition to this model of scheduling (Sec. 3), we present:
• A prototype embedded language, called Halide, for functional
algorithm and schedule specification (Sec. 4); a brief sketch
follows this list.
• A compiler that translates functional algorithms and op-
timized schedules into efficient machine code for x86 and
ARM, including SSE and NEON SIMD instructions, and
CUDA GPUs, including synchronization and placement of
data throughout the specialized memory hierarchy (Sec. 5).
• A range of applications implemented in our language, com-
posed of common image processing operations such as con-
volutions, histograms, image pyramids, and complex sten-
cils. Using different schedules, we compile them into opti-
mized programs for x86 and ARM CPUs, and a CUDA GPU
(Sec. 6). For these applications, the Halide code is compact,
and performance is state of the art (Fig. 2).
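To give a flavor of the language before turning to prior work
(Sec. 4 is the definitive reference), the following sketch expresses
a 3x3 box blur as two Halide functions and then schedules it. The
scheduling calls shown (tile, vectorize, parallel, compute_at) are
illustrative of the kind of directives a schedule contains; the
exact names and syntax are defined in Sec. 4.

#include "Halide.h"
using namespace Halide;

// Algorithm: a separable 3x3 box blur, written functionally.
Func blur_3x3(Func input) {
    Func blur_x("blur_x"), blur_y("blur_y");
    Var x("x"), y("y"), xi("xi"), yi("yi");

    blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3;
    blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;

    // Schedule: tile and vectorize the output, parallelize across
    // rows of tiles, and compute blur_x on demand within each tile.
    blur_y.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);
    blur_x.compute_at(blur_y, x).vectorize(x, 8);
    return blur_y;
}

Retargeting the same algorithm to a different machine changes only
the two scheduling lines; the functional definition above them is
untouched.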
2 Prior Work
Loop transformation Most compiler optimizations for numeri-
cal programs are based on loop analysis and transformation, includ-
ing auto-vectorization, loop interchange, fusion, and tiling. The
polyhedral model is a powerful tool for transforming imperative
programs [Feautrier 1991], but traditional loop optimizations do not
consider recomputation of values: each point in each loop is com-
puted only once. In image processing, recomputing some values—
rather than storing, synchronizing around, and reloading them—can
be a large performance win (Sec. 6.2), and is central to the choices
we consider during optimization.
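To make the tradeoff concrete, here is a two-stage, 3-point
separable blur written both ways in plain C++ (our sketch, not code
from any cited system; interior pixels only, boundary handling
elided):

// AT indexes a W-wide, row-major array of floats.
#define AT(img, x, y) img[(y) * W + (x)]

// (a) Store: compute all of blur_x into tmp, then reload it.
// Minimal arithmetic, but the whole intermediate round-trips
// through memory and must be synchronized around in parallel code.
void blur_store(int W, int H, const float *in, float *tmp, float *out) {
    for (int y = 0; y < H; y++)
        for (int x = 1; x < W - 1; x++)
            AT(tmp, x, y) = (AT(in, x - 1, y) + AT(in, x, y) + AT(in, x + 1, y)) / 3;
    for (int y = 1; y < H - 1; y++)
        for (int x = 1; x < W - 1; x++)
            AT(out, x, y) = (AT(tmp, x, y - 1) + AT(tmp, x, y) + AT(tmp, x, y + 1)) / 3;
}

// (b) Recompute: inline blur_x into blur_y. Each intermediate value
// is computed three times, but never touches memory, and every
// output pixel is independent.
void blur_recompute(int W, int H, const float *in, float *out) {
    for (int y = 1; y < H - 1; y++)
        for (int x = 1; x < W - 1; x++) {
            float a = (AT(in, x - 1, y - 1) + AT(in, x, y - 1) + AT(in, x + 1, y - 1)) / 3;
            float b = (AT(in, x - 1, y)     + AT(in, x, y)     + AT(in, x + 1, y))     / 3;
            float c = (AT(in, x - 1, y + 1) + AT(in, x, y + 1) + AT(in, x + 1, y + 1)) / 3;
            AT(out, x, y) = (a + b + c) / 3;
        }
}

Classical loop transformations explore orderings of (a); the choice
between (a) and (b), and the continuum of partial recomputation
between them, is the space our schedules expose.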
Data-parallel languages Many data-parallel languages have
been proposed. Particularly relevant in graphics, CUDA and
OpenCL expose an imperative, single-program, multiple-data pro-
gramming model which can target both GPUs and multicore CPUs
with SIMD units [Buck 2007; OpenCL 2011]. ispc provides a simi-
lar abstraction for SIMD processing on x86 CPUs [Pharr and Mark
2012]. Like C, they allow the specification of very high perfor-
mance implementations for many algorithms. But because parallel
work distribution, synchronization, kernel fusion, and memory are
all explicitly managed by the programmer, complex algorithms are
often not composable in these languages, and the optimizations re-
quired are often specific to an architecture, so code must be rewrit-
ten for different platforms.
Array Building Blocks provides an embedded language for data-
parallel array processing in C++ [Newburn et al. 2011]. As in our
representation, whole pipelines of operations are built up and opti-
mized globally by a compiler. It delivers impressive performance
on Intel CPUs, but requires a sufficiently smart compiler to do so.
Streaming languages encode data and task parallelism in graphs
of kernels. Compilers automatically schedule these graphs using
tiling, fusion, and fission [Kapasi et al. 2002]. Sliding window
optimizations can automatically optimize pipelines with overlap-
ping data access in 1D streams [Gordon et al. 2002]. Our model of
scheduling addresses the problem of overlapping 2D stencils, where
recomputation vs. storage becomes a critical but complex choice.
We assume a less heroic compiler, and focus on enabling human
programmers to quickly and easily specify complex schedules.
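For intuition, here is the sliding-window strategy transposed to
that 2D setting, as a sketch in plain C++ (our code, not any
streaming compiler's output; interior pixels only, boundaries
elided). Only three rows of the intermediate are ever live, so
there is no recomputation and little storage, but the producer and
consumer are locked into a serial, row-by-row interleaving:

#include <vector>

void blur_sliding(int W, int H, const float *in, float *out) {
    std::vector<float> ring(3 * W);       // rows y-2, y-1, y of blur_x
    for (int y = 0; y < H; y++) {
        float *row = &ring[(y % 3) * W];  // produce blur_x row y
        for (int x = 1; x < W - 1; x++)
            row[x] = (in[y * W + x - 1] + in[y * W + x] + in[y * W + x + 1]) / 3;
        if (y >= 2) {                     // rows y-2..y are live:
            const float *r0 = &ring[((y - 2) % 3) * W];
            const float *r1 = &ring[((y - 1) % 3) * W];
            for (int x = 1; x < W - 1; x++)
                out[(y - 1) * W + x] = (r0[x] + r1[x] + row[x]) / 3;
        }
    }
}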
Programmer-controlled scheduling A separate line of com-
piler research attempts to put control back in the hands of the pro-
grammer. The SPIRAL system [Püschel et al. 2005] uses a domain-
specific language to specify linear signal processing operations in-
dependent of their schedule. Separate mapping functions describe
how these operations should be turned into efficient code for a par-
ticular architecture. It enables high performance across a range of
architectures by making deep use of mathematical identities on lin-
ear filters. Computational photography algorithms often do not fit
within a strict linear filtering model. Our work can be seen as an
attempt to generalize this approach to a broader class of programs.
Sequoia defines a model where a user-defined “mapping” describes
how to execute tasks on a tree-like memory hierarchy [Fatahalian
et al. 2006]. This parallels our model of scheduling, but focuses
on hierarchical problems like blocked matrix multiply, rather than
pipelines of images. Sequoia’s mappings, which are highly explicit,
are also more verbose than our schedules, which are designed to let
the compiler infer details the programmer leaves unspecified.
Image processing languages Shantzis described a framework
and runtime model for image processing systems based on graphs
of operations which process tiles of data [Shantzis 1994]. This is
the inspiration for many scalable and extensible image processing
systems, including our own.
Apple’s CoreImage and Adobe’s PixelBender include kernel lan-
guages for specifying individual point-wise operations on images
[CoreImage; PixelBender]. Neon embeds a similar kernel language
in C# [Guenter and Nehab 2010]. All three compile kernels into
optimized code for multiple architectures, including CPU SIMD
instructions and GPUs, but none optimize across kernels connected
by complex communication like stencils, and none support reduc-
tions or nested parallelism within kernels.
Elsewhere in graphics, the real-time graphics pipeline has been a
hugely successful abstraction precisely because the schedule is sep-
arated from the specification of the shaders. This allows GPUs and
drivers to efficiently execute a wide range of programs with lit-
tle programmer control over parallelism and memory management.
This separation of concerns is extremely effective, but it is spe-
cific to the design of a single pipeline. That pipeline also exhibits
different characteristics than image processing pipelines, where re-
ductions and stencil communication are common, and kernel fusion
is essential for efficiency. Embedded DSLs have also been used to
specify the shaders themselves, directly inside the host C++ pro-
gram that configures the pipeline [McCool et al. 2002].
MATLAB is extremely successful as a language for image process-
ing. Its high level syntax enables terse expression of many algo-
rithms, and its widely used library of built-in functionality shows
that the ability to compose modular library functions is invaluable
for programmer productivity. However, simply bundling fast imple-
mentations of individual kernels is not sufficient for fast execution
on modern machines, where optimization across stages in a pipeline
is essential for efficient use of parallelism and memory (Fig. 2).
Spreadsheets for Images extended the spreadsheet metaphor as
a functional programming model for imaging operations [Levoy
1994]. Pan introduced a functional model for image processing
much like our own, in which images are functions from coordinates
to values [Elliott 2001]. Modest differences exist (Pan’s images are
functions over a continuous coordinate domain, while in ours the
domain is discrete), but Pan is a close sibling of our intrinsic al-
gorithm representation. However, it has no analog to our model
of scheduling and ultimate compilation. It exists as an interpreted
embedding within Haskell, and as a source-to-source compiler to C
containing basic scalar and loop optimizations [Elliott et al. 2003].
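For concreteness, Pan's core representation can be transliterated
into a few lines of C++ (our transliteration, not Elliott's API):
an image is a function from coordinates to values, and operators
are higher-order functions on images.

#include <cmath>
#include <functional>

// An image as a function from (continuous) coordinates to values,
// in the style of Pan [Elliott 2001].
using Image = std::function<float(float, float)>;

// Point-wise and coordinate-transforming operators compose freely.
Image brighten(Image im, float k) {
    return [=](float x, float y) { return k * im(x, y); };
}
Image rotate(Image im, float theta) {
    float c = std::cos(theta), s = std::sin(theta);
    return [=](float x, float y) { return im(c * x + s * y, -s * x + c * y); };
}

Our intrinsic algorithms share this structure (over a discrete
domain); the schedule and the compiler that exploits it are what we
add on top.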