weaker support for these operations, although they can be
mimicked at lower performance via memory.
Tightly defined execution order and memory
model: Modern CPUs have relatively strict rules on the
order in which instructions are completed and on when memory stores become visible to memory loads.
GPUs have more relaxed rules, which provides greater free-
dom for hardware scheduling but makes it more difficult to
provide ordering guarantees at the language level.
3. PARALLELISM MODEL: SPMD ON SIMD
Any language for parallel programming requires a concep-
tual model for expressing parallelism in the language and for
mapping this language-level parallelism to the underlying
hardware. For the following discussion of ispc’s approach,
we rely on Flynn’s taxonomy of programming models into
SIMD, MIMD, etc. [8], with Darema’s enhancement to in-
clude SPMD (Single Program Multiple Data) [7].
3.1 Why SPMD?
Recall that our goal is to design a language and compiler
for today’s SIMD CPU hardware. One option would be to
use a purely sequential language, such as unmodified C, and
rely on the compiler to find parallelism and map it to the
SIMD hardware. This approach is commonly referred to
as auto-vectorization [37]. Although auto-vectorization can
work well for regular code that lacks conditional operations,
a number of issues limit the applicability of the technique in
practice. All optimizations performed by an auto-vectorizer
must honor the original sequential semantics of the program;
the auto-vectorizer thus must have visibility into the entire
loop body, which precludes vectorizing loops that call out
to externally-defined functions, for example. Complex con-
trol flow and deeply nested function calls also often inhibit
auto-vectorization in practice, in part due to heuristics that
auto-vectorizers must apply to decide when to try to vec-
torize. As a result, auto-vectorization fails to provide good
performance transparency—it is difficult to know whether
a particular fragment of code will be successfully vectorized
by a given compiler and how it will perform.
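As a concrete (and hypothetical) illustration of these limitations, consider a loop of the following form; the function names are invented for this example, and whether a given compiler vectorizes such a loop depends on its own analyses and heuristics:

    // Defined in another translation unit; the auto-vectorizer cannot
    // see its body, so it must assume arbitrary side effects.
    float external_shade(float v);

    void process(const float *in, float *out, int count) {
        for (int i = 0; i < count; ++i) {
            float v = in[i];
            if (v > 0.0f)              // data-dependent control flow
                v = external_shade(v); // opaque call inhibits vectorization
            out[i] = v;
        }
    }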
To achieve ispc's goals of efficiency and performance transparency, it is clear that the language must have parallel semantics. This leads to the question: how should parallelism be expressed? The most obvious option is to express SIMD operations directly as explicit vector computations. This approach works acceptably in many cases
when the SIMD width is four or less, since explicit operations
on 3-vectors and 4-vectors are common in many algorithms.
For SIMD widths greater than four, this option is still ef-
fective for algorithms without data-dependent control flow,
and can be implemented in C++ using operator overload-
ing layered over intrinsics. However, this option becomes
less viable once complex control flow is required.
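A minimal sketch of this style follows, assuming a hand-written wrapper type; the class and operator set shown are illustrative, not taken from any particular library:

    #include <xmmintrin.h>

    // A 4-wide float vector type layered over SSE intrinsics via operator
    // overloading; straight-line arithmetic maps naturally to SIMD.
    struct vec4 {
        __m128 v;
        vec4(__m128 x) : v(x) {}
        explicit vec4(float x) : v(_mm_set1_ps(x)) {}
    };

    inline vec4 operator+(vec4 a, vec4 b) { return vec4(_mm_add_ps(a.v, b.v)); }
    inline vec4 operator*(vec4 a, vec4 b) { return vec4(_mm_mul_ps(a.v, b.v)); }

    // Works well for code like a * x + y across all four lanes at once...
    inline vec4 axpy(vec4 a, vec4 x, vec4 y) { return a * x + y; }
    // ...but a per-lane "if (x > 0)" has no direct expression in this style
    // and must instead be rewritten with explicit comparison masks.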
Given complex control flow, what the programmer ideally
wants is a programming model that is as close as possible to
MIMD, but that can be efficiently compiled to the available
SIMD hardware. SPMD provides just such a model: with
SPMD, there are multiple instances of a single program exe-
cuting concurrently and operating on different data. SPMD
programs largely look like scalar programs (unlike explicit
SIMD), which leads to a productivity advantage for pro-
grammers working with SPMD programs. Furthermore, the
SPMD approach aids with performance transparency: vec-
torization of a SPMD program is guaranteed by the under-
lying model, so a programmer can write SPMD code with
a clear mental model of how it will be compiled. Over the
past ten years the SPMD model has become widely used
on GPUs, first for programmable shading [28] and then for
more general-purpose computation via CUDA and OpenCL.
ispc implements SPMD execution on the SIMD vector
units of CPUs; we refer to this model as “SPMD-on-SIMD”.
Each instance of the program corresponds to a different
SIMD lane; conditionals and control flow that are different
between the program instances are allowed. As long as each
program instance operates only on its own data, it produces
the same results that would be obtained if it were running
on a dedicated MIMD processor. Figure 1 illustrates how
SPMD execution is implemented on CPU SIMD hardware.
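As a conceptual sketch (written here as plain scalar C++ for illustration, not ispc code; the function and its index parameter are assumptions for this example), the code that one program instance executes looks like ordinary scalar code:

    // What one program instance conceptually executes: scalar-looking code
    // operating on its own element i. Under SPMD-on-SIMD, each such
    // instance occupies one SIMD lane, and instances in the same gang are
    // free to take different paths through the "if" below.
    float shade_one(const float *x, int i) {
        float v = x[i];
        if (v > 0.0f)        // control flow may differ between instances
            return v * v;
        return -v;
    }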
3.2 Basic Execution Model
Upon entry to an ispc function called from C/C++ code,
the execution model switches from the application’s serial
model to ispc’s SPMD model. Conceptually, a number of
program instances start running concurrently. The group
of running program instances is called a gang (harkening to “gang scheduling”, since ispc provides certain guarantees about when program instances running in a gang run concurrently with other program instances in the gang, detailed below). [3]
The gang of program instances starts executing in
the same hardware thread and context as the application
code that called the ispc function; no thread creation or
implicit context switching is done by ispc.
The number of program instances in a gang is relatively
small; in practice, it is no more than twice the SIMD width of the hardware that it is executing on. [4] Thus, there are four
or eight program instances in a gang on a CPU using the
4-wide SSE instruction set, and eight or sixteen on a CPU
using 8-wide AVX. The gang size is set at compile time.
SPMD parallelization across the SIMD lanes of a single
core is complementary to multi-core parallelism. For ex-
ample, if an application has already been parallelized across
cores, then threads in the application can independently call
functions written in ispc to use the SIMD unit on the core
where they are running. Alternatively, ispc has capabilities
for launching asynchronous tasks for multi-core parallelism;
they will be introduced in Section 5.4.
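As an illustration of the first approach, each application thread might call into ispc on its own slice of the data; the exported function name and signature below are hypothetical, standing in for whatever C-callable entry point ispc generates for an exported kernel:

    #include <thread>

    // Hypothetical C-callable entry point for an ispc-exported kernel;
    // each call runs a gang of program instances in the calling thread,
    // using the SIMD unit of the core that thread is running on.
    extern "C" void kernel_ispc(const float *in, float *out, int count);

    void run_on_two_threads(const float *in, float *out, int count) {
        int half = count / 2;
        std::thread t0(kernel_ispc, in, out, half);
        std::thread t1(kernel_ispc, in + half, out + half, count - half);
        t0.join();
        t1.join();
    }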
3.3 Mapping SPMD To Hardware: Control
One of the challenges in SPMD execution is handling di-
vergent control flow. Consider a while loop with a termi-
nation test n > 0; when different program instances have
different values for n, they will need to execute the loop
body different numbers of times.
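To make the problem concrete, the following sketch (C++ with SSE intrinsics; the blends require SSE4.1) shows one way such a loop can be executed across four lanes: the hardware loop runs as long as any lane's n is still positive, and a per-lane mask keeps inactive lanes from updating their state. This illustrates the general masking approach, not ispc's actual generated code, and the loop body (accumulating 1.0 per iteration) is invented for the example.

    #include <immintrin.h>

    // Execute "while (n > 0) { result += 1.0f; n--; }" for four program
    // instances at once. Lanes whose n has reached zero stay in the loop
    // but are masked out, so their result and n are left unchanged.
    void spmd_while(int n_in[4], float result[4]) {
        __m128i n   = _mm_loadu_si128((const __m128i *)n_in);
        __m128  acc = _mm_loadu_ps(result);

        for (;;) {
            // Per-lane test n > 0; lanes passing the test are "active".
            __m128i active = _mm_cmpgt_epi32(n, _mm_setzero_si128());
            if (_mm_movemask_epi8(active) == 0)
                break;                      // no lane active: all done
            // Loop body, applied only to active lanes via a blend.
            __m128 added = _mm_add_ps(acc, _mm_set1_ps(1.0f));
            acc = _mm_blendv_ps(acc, added, _mm_castsi128_ps(active));
            // Decrement n only in active lanes.
            __m128i dec = _mm_sub_epi32(n, _mm_set1_epi32(1));
            n = _mm_blendv_epi8(n, dec, active);
        }
        _mm_storeu_ps(result, acc);
    }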
ispc’s SPMD-on-SIMD model provides the illusion of sep-
arate control flow for each SIMD lane, but the burden of
[3] Program instances thus correspond to threads in CUDA and work items in OpenCL. A gang roughly corresponds to a CUDA warp.
[4] Running gangs wider than the SIMD width can give performance benefits from amortizing shared computation (such as scalar control flow overhead) over more program instances, from better cache reuse across the program instances, and from more instruction-level parallelism being available. The costs are greater register pressure and potentially more control flow divergence across the program instances.