Fig. 3. The performance of Mandelbrot can be increased by forcing
uniformity for more branches. However, if software overhead is added to
ensure branch uniformity, increasing the number of affected branches
increases overhead and can even result in degraded performance. (Axes:
% forced uniform branches vs. performance increase; series: no software
overhead, software overhead.)
Fig. 4. While eliminating control divergence can increase performance,
blindly forcing branch uniformity can result in degraded output quality.
(Axes: % forced uniform branches vs. % mismatched output bytes; series:
Mandelbrot and Julia output quality degradation.)
significant. Figure 3 shows the potential performance increase
(runtime reduction) if control divergence can be eliminated for
a fraction of the static branches in Mandelbrot (from 0% to
100% of branches). When the fraction is less than 100%, the
branches are chosen uniformly at random. Control divergence
is preempted by changing the source code to vote within a
warp on the condition of a branch and forcing all threads in
the warp to take the same (majority) direction at the branch
(details in Section III). Experiments were run natively on an
NVIDIA GeForce GTX 480 GPU (details in Section VI).
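As a rough sketch (not the paper's Section III implementation;
the function and kernel names below are illustrative), such voting
can be expressed with CUDA warp vote intrinsics. The __ballot_sync
and __activemask forms shown are from CUDA 9+; on the Fermi-era
GTX 480 used here, the older __ballot(cond) intrinsic would serve
the same role:

__device__ bool herd_branch(bool cond) {
    // Each lane votes on its condition; bit i of 'votes' is lane i's vote.
    unsigned active = __activemask();               // lanes that reached this point
    unsigned votes  = __ballot_sync(active, cond);  // warp-wide ballot
    // Herd the warp: every lane follows the majority direction (ties: taken).
    return 2 * __popc(votes) >= __popc(active);
}

__global__ void herded_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    bool cond = data[i] > 0.0f;     // per-lane condition (illustrative)
    if (herd_branch(cond)) {        // condition is now warp-uniform, so the
        data[i] *= 2.0f;            // branch cannot diverge; minority lanes
    } else {                        // simply take the "wrong" direction
        data[i] *= 0.5f;
    }
}

The vote itself is the software overhead measured in Figure 3: a few
extra instructions at every herded branch, whether or not that branch
would actually have diverged.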
While only 10% of dynamic instructions in Mandelbrot are
branches, and less than 1% of branches diverge, performance
can potentially be increased by 31% by eliminating control
divergence. As the no-software-overhead performance series in
Figure 3 demonstrates, performance increases for Mandelbrot
as control divergence is eliminated for more branches. Figure 4
shows that the quality of the Mandelbrot output set degrades
by less than 2%, even when divergence has been eliminated
for all static branches. This shows that for certain error-tolerant
applications, it may be possible to obtain significant
performance benefits from eliminating control divergence for
minimal output quality degradation. A quick look at the
Julia output set, however, also suggests that an indiscriminate
selection of branches for herding may result in significant
output quality degradation for several applications. Therefore,
any implementation of branch herding needs to carefully select
the branches to target. Figure 5 shows visual representations
of the Mandelbrot and Julia output sets as the percentage
of forced uniform branches increases from 20% to 100% in
increments of 40%.
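For concreteness, the Fig. 4 metric (% mismatched output bytes)
amounts to a byte-wise comparison of the herded output against a
reference, non-herded output. A minimal host-side sketch (the helper
name is hypothetical) is:

#include <cstddef>

// Hypothetical helper for the Fig. 4 quality metric: the percentage of
// output bytes that differ between the reference run and the herded run.
double percent_mismatched_bytes(const unsigned char* ref,
                                const unsigned char* herded,
                                size_t nbytes) {
    size_t mismatched = 0;
    for (size_t i = 0; i < nbytes; ++i)
        if (ref[i] != herded[i]) ++mismatched;
    return 100.0 * static_cast<double>(mismatched)
                 / static_cast<double>(nbytes);
}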
The software-overhead performance series of Figure 3
demonstrates another important consideration for any technique
that eliminates control divergence. Since the fraction of
divergent branches in a program may be small (in this case,
less than 1%), an indiscriminate application of a technique
to all branches may result in significant overhead that
diminishes or even eliminates the performance gains that result
from reduced divergence. This result reinforces the conclusion
that care should be exercised in selecting the branches to
target for elimination of control divergence. Also, a
low-overhead mechanism for eliminating control divergence may
enable significantly more benefits. The result also confirms
that naïve implementations of techniques to eliminate control
divergence may actually decrease performance in some scenarios.

Fig. 5. Progression of Mandelbrot (top) and Julia (bottom) images from 20%
to 100% forced branch uniformity in 40% intervals.
B. Memory Divergence
Like control divergence, memory divergence occurs when
threads in the same warp exhibit different behavior. In the
GPU, a load operation for a warp is implemented as a
collection of scalar loads, where each thread potentially loads
from a different address. When a load is issued, the SM
sets up destination registers and corresponding scoreboard
entries for each thread in the warp. The load then exits the
pipeline, potentially before any of the individual thread loads
have finished. When all the memory requests corresponding
to the warp load have finished, the destination vector register
is marked as ready. Instructions that depend on the load must
stall if any lanes of the destination vector register are not ready.
Memory divergence occurs when the memory requests for
some threads finish before those of other threads in the same
warp [9]. Individual threads that delay in finishing their loads
prevent the SM from issuing any dependent instructions from
that warp, even though other threads are ready to execute.
Memory divergence may occur for two reasons. (1) The time
to complete each memory request depends on several factors,
including which DRAM bank the target memory resides in,
contention in the interconnect network, and availability of
resources (such as MSHRs) in the memory controller. (2)
Since the target data for a collection of memory requests
made by a warp may reside in different levels of the memory
hierarchy, the individual memory operations may take different
lengths of time to complete.
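As an assumed illustration (not taken from the paper), an indirect
gather exhibits both effects: each lane's load targets a different
address, so some lanes may hit in cache while others go all the way
to DRAM, and the dependent add cannot issue until the slowest lane's
load returns:

__global__ void gather_add(const float* __restrict__ src,
                           const int*   __restrict__ idx,
                           float* dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // Each lane loads from a potentially different address; the
        // per-lane requests may be serviced by different levels of the
        // memory hierarchy and finish at different times.
        float v = src[idx[i]];
        // Dependent instruction: the warp cannot issue this until the
        // destination register is ready for all lanes, so the slowest
        // lane determines when the warp becomes ready again.
        dst[i] = v + 1.0f;
    }
}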
Most GPU architectures do not implement out-of-order
execution due to its relative complexity and hardware cost.
Rather, GPUs hide long-latency stalls by multithreading
instructions from a pool of ready warps. Providing each SM
with enough ready warps ensures that long-latency stalls are
not exposed. Memory divergence delays the time when
a warp may execute the next dependent instruction, cutting
into the pool of ready warps and potentially exposing stalls