and easy. It can be written separately from the algorithm, by an
architecture expert if necessary. We can most flexibly schedule
operations that are data parallel, with statically analyzable access
patterns, but also support the reductions and bounded irregular ac-
cess patterns that occur in image processing.
In addition to this model of scheduling (Sec. 3), we present:
• A prototype embedded language, called Halide, for functional
algorithm and schedule specification (Sec. 4); a brief sketch
follows this list.
• A compiler that translates functional algorithms and op-
timized schedules into efficient machine code for x86 and
ARM, including SSE and NEON SIMD instructions, and
CUDA GPUs, including synchronization and placement of
data throughout the specialized memory hierarchy (Sec. 5).
• A range of applications implemented in our language, com-
posed of common image processing operations such as con-
volutions, histograms, image pyramids, and complex sten-
cils. Using different schedules, we compile them into opti-
mized programs for x86 and ARM CPUs, and a CUDA GPU
(Sec. 6). For these applications, the Halide code is compact,
and performance is state of the art (Fig. 2).
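To give a flavor of the language before turning to prior work
(Sec. 4 is the definitive reference), the following sketch expresses
a 3x3 box blur as two Halide functions and then schedules it. The
scheduling calls shown (tile, vectorize, parallel, compute_at) are
illustrative of the kind of directives a schedule contains; the
exact names and syntax are defined in Sec. 4.

#include "Halide.h"
using namespace Halide;

// Algorithm: a separable 3x3 box blur, written functionally.
Func blur_3x3(Func input) {
    Func blur_x("blur_x"), blur_y("blur_y");
    Var x("x"), y("y"), xi("xi"), yi("yi");

    blur_x(x, y) = (input(x - 1, y) + input(x, y) + input(x + 1, y)) / 3;
    blur_y(x, y) = (blur_x(x, y - 1) + blur_x(x, y) + blur_x(x, y + 1)) / 3;

    // Schedule: tile and vectorize the output, parallelize across
    // rows of tiles, and compute blur_x on demand within each tile.
    blur_y.tile(x, y, xi, yi, 256, 32).vectorize(xi, 8).parallel(y);
    blur_x.compute_at(blur_y, x).vectorize(x, 8);
    return blur_y;
}

Retargeting the same algorithm to a different machine changes only
the two scheduling lines; the functional definition above them is
untouched.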
2 Prior Work
Loop transformation Most compiler optimizations for numeri-
cal programs are based on loop analysis and transformation, includ-
ing auto-vectorization, loop interchange, fusion, and tiling. The
polyhedral model is a powerful tool for transforming imperative
programs [Feautrier 1991], but traditional loop optimizations do not
consider recomputation of values: each point in each loop is com-
puted only once. In image processing, recomputing some values—
rather than storing, synchronizing around, and reloading them—can
be a large performance win (Sec. 6.2), and is central to the choices
we consider during optimization.
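To make the tradeoff concrete, here is a two-stage, 3-point
separable blur written both ways in plain C++ (our sketch, not code
from any cited system; interior pixels only, boundary handling
elided):

// AT indexes a W-wide, row-major array of floats.
#define AT(img, x, y) img[(y) * W + (x)]

// (a) Store: compute all of blur_x into tmp, then reload it.
// Minimal arithmetic, but the whole intermediate round-trips
// through memory and must be synchronized around in parallel code.
void blur_store(int W, int H, const float *in, float *tmp, float *out) {
    for (int y = 0; y < H; y++)
        for (int x = 1; x < W - 1; x++)
            AT(tmp, x, y) = (AT(in, x - 1, y) + AT(in, x, y) + AT(in, x + 1, y)) / 3;
    for (int y = 1; y < H - 1; y++)
        for (int x = 1; x < W - 1; x++)
            AT(out, x, y) = (AT(tmp, x, y - 1) + AT(tmp, x, y) + AT(tmp, x, y + 1)) / 3;
}

// (b) Recompute: inline blur_x into blur_y. Each intermediate value
// is computed three times, but never touches memory, and every
// output pixel is independent.
void blur_recompute(int W, int H, const float *in, float *out) {
    for (int y = 1; y < H - 1; y++)
        for (int x = 1; x < W - 1; x++) {
            float a = (AT(in, x - 1, y - 1) + AT(in, x, y - 1) + AT(in, x + 1, y - 1)) / 3;
            float b = (AT(in, x - 1, y)     + AT(in, x, y)     + AT(in, x + 1, y))     / 3;
            float c = (AT(in, x - 1, y + 1) + AT(in, x, y + 1) + AT(in, x + 1, y + 1)) / 3;
            AT(out, x, y) = (a + b + c) / 3;
        }
}

Classical loop transformations explore orderings of (a); the choice
between (a) and (b), and the continuum of partial recomputation
between them, is the space our schedules expose.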
Data-parallel languages Many data-parallel languages have
been proposed. Particularly relevant in graphics, CUDA and
OpenCL expose an imperative, single-program, multiple-data pro-
gramming model which can target both GPUs and multicore CPUs
with SIMD units [Buck 2007; OpenCL 2011]. ispc provides a simi-
lar abstraction for SIMD processing on x86 CPUs [Pharr and Mark
2012]. Like C, they allow the specification of very high perfor-
mance implementations for many algorithms. But because parallel
work distribution, synchronization, kernel fusion, and memory are
all explicitly managed by the programmer, complex algorithms are
often not composable in these languages, and the optimizations re-
quired are often specific to an architecture, so code must be rewrit-
ten for different platforms.
Array Building Blocks provides an embedded language for data-
parallel array processing in C++ [Newburn et al. 2011]. As in our
representation, whole pipelines of operations are built up and opti-
mized globally by a compiler. It delivers impressive performance
on Intel CPUs, but requires a sufficiently smart compiler to do so.
Streaming languages encode data and task parallelism in graphs
of kernels. Compilers automatically schedule these graphs using
tiling, fusion, and fission [Kapasi et al. 2002]. Sliding window
optimizations can automatically optimize pipelines with overlap-
ping data access in 1D streams [Gordon et al. 2002]. Our model of
scheduling addresses the problem of overlapping 2D stencils, where
recomputation vs. storage becomes a critical but complex choice.
We assume a less heroic compiler, and focus on enabling human
programmers to quickly and easily specify complex schedules.
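For intuition, here is the sliding-window strategy transposed to
that 2D setting, as a sketch in plain C++ (our code, not any
streaming compiler's output; interior pixels only, boundaries
elided). Only three rows of the intermediate are ever live, so
there is no recomputation and little storage, but the producer and
consumer are locked into a serial, row-by-row interleaving:

#include <vector>

void blur_sliding(int W, int H, const float *in, float *out) {
    std::vector<float> ring(3 * W);       // rows y-2, y-1, y of blur_x
    for (int y = 0; y < H; y++) {
        float *row = &ring[(y % 3) * W];  // produce blur_x row y
        for (int x = 1; x < W - 1; x++)
            row[x] = (in[y * W + x - 1] + in[y * W + x] + in[y * W + x + 1]) / 3;
        if (y >= 2) {                     // rows y-2..y are live:
            const float *r0 = &ring[((y - 2) % 3) * W];
            const float *r1 = &ring[((y - 1) % 3) * W];
            for (int x = 1; x < W - 1; x++)
                out[(y - 1) * W + x] = (r0[x] + r1[x] + row[x]) / 3;
        }
    }
}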
Programmer-controlled scheduling A separate line of com-
piler research attempts to put control back in the hands of the pro-
grammer. The SPIRAL system [Püschel et al. 2005] uses a domain-
specific language to specify linear signal processing operations in-
dependent of their schedule. Separate mapping functions describe
how these operations should be turned into efficient code for a par-
ticular architecture. It enables high performance across a range of
architectures by making deep use of mathematical identities on lin-
ear filters. Computational photography algorithms often do not fit
within a strict linear filtering model. Our work can be seen as an
attempt to generalize this approach to a broader class of programs.
Sequoia defines a model where a user-defined “mapping” describes
how to execute tasks on a tree-like memory hierarchy [Fatahalian
et al. 2006]. This parallels our model of scheduling, but focuses
on hierarchical problems like blocked matrix multiply, rather than
pipelines of images. Sequoia’s mappings, which are highly explicit,
are also more verbose than our schedules, which are designed to let
the compiler infer details the programmer leaves unspecified.
Image processing languages Shantzis described a framework
and runtime model for image processing systems based on graphs
of operations which process tiles of data [Shantzis 1994]. This is
the inspiration for many scalable and extensible image processing
systems, including our own.
Apple’s CoreImage and Adobe’s PixelBender include kernel lan-
guages for specifying individual point-wise operations on images
[CoreImage; PixelBender]. Neon embeds a similar kernel language
in C# [Guenter and Nehab 2010]. All three compile kernels into
optimized code for multiple architectures, including CPU SIMD
instructions and GPUs, but none optimize across kernels connected
by complex communication like stencils, and none support reduc-
tions or nested parallelism within kernels.
Elsewhere in graphics, the real-time graphics pipeline has been a
hugely successful abstraction precisely because the schedule is sep-
arated from the specification of the shaders. This allows GPUs and
drivers to efficiently execute a wide range of programs with lit-
tle programmer control over parallelism and memory management.
This separation of concerns is extremely effective, but it is spe-
cific to the design of a single pipeline. That pipeline also exhibits
different characteristics than image processing pipelines, where re-
ductions and stencil communication are common, and kernel fusion
is essential for efficiency. Embedded DSLs have also been used to
specify the shaders themselves, directly inside the host C++ pro-
gram that configures the pipeline [McCool et al. 2002].
MATLAB is extremely successful as a language for image process-
ing. Its high level syntax enables terse expression of many algo-
rithms, and its widely used library of built-in functionality shows
that the ability to compose modular library functions is invaluable
for programmer productivity. However, simply bundling fast imple-
mentations of individual kernels is not sufficient for fast execution
on modern machines, where optimization across stages in a pipeline
is essential for efficient use of parallelism and memory (Fig. 2).
Spreadsheets for Images extended the spreadsheet metaphor as
a functional programming model for imaging operations [Levoy
1994]. Pan introduced a functional model for image processing
much like our own, in which images are functions from coordinates
to values [Elliott 2001]. Modest differences exist (Pan’s images are
functions over a continuous coordinate domain, while in ours the
domain is discrete), but Pan is a close sibling of our intrinsic al-
gorithm representation. However, it has no analog to our model
of scheduling and ultimate compilation. It exists as an interpreted
embedding within Haskell, and as a source-to-source compiler to C
containing basic scalar and loop optimizations [Elliott et al. 2003].
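For concreteness, Pan's core representation can be transliterated
into a few lines of C++ (our transliteration, not Elliott's API):
an image is a function from coordinates to values, and operators
are higher-order functions on images.

#include <cmath>
#include <functional>

// An image as a function from (continuous) coordinates to values,
// in the style of Pan [Elliott 2001].
using Image = std::function<float(float, float)>;

// Point-wise and coordinate-transforming operators compose freely.
Image brighten(Image im, float k) {
    return [=](float x, float y) { return k * im(x, y); };
}
Image rotate(Image im, float theta) {
    float c = std::cos(theta), s = std::sin(theta);
    return [=](float x, float y) { return im(c * x + s * y, -s * x + c * y); };
}

Our intrinsic algorithms share this structure (over a discrete
domain); the schedule and the compiler that exploits it are what we
add on top.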