Warp Divergence Optimization Strategies in GPU Parallel Programming


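This document is the thesis "Inter-warp Divergence Aware Execution on GPUs" by Chulian Zhang, submitted in April 2016 as part of a master's degree in Electrical and Computer Engineering at Northeastern University. It examines the divergence problems that arise when general-purpose applications run on modern GPUs, and shows how optimizations at the programming level and at the architecture level can improve performance.

The problem definition (Section 1.1) observes that GPUs execute threads in a SIMD-like manner. When threads that execute together take different code paths (branch divergence), both paths are executed and part of the work is wasted. When their memory accesses cannot be coalesced (memory divergence), the memory subsystem receives many more requests and cache conflicts increase.

At the programming level (Section 1.2.1), the thesis presents a CUDA implementation of Mixture of Gaussians (MoG) background subtraction that is optimized step by step: general GPU optimizations such as memory coalescing and overlapping computation with communication, algorithm-specific optimizations such as control flow reduction and register usage optimization, and a windowed optimization that uses shared memory.

At the architecture level (Section 1.2.2), the thesis studies inter-warp divergence, the gap in execution progress that develops among warps running the same code. It defines a metric, Warp Progression Similarity (WPS), analyzes the factors that influence it, and proposes a WPS-aware scheduler (WPSaS) that slows down leading warps and speeds up lagging ones.

The thesis also reviews related work (Sections 2.1 and 2.2) on background subtraction on GPUs and on GPU thread scheduling, and it describes GPGPU-Sim (Section 3.3), a tool for simulating GPU behavior. The chapter on high-performance background subtraction (Chapter 4) discusses how the optimizations above improve the parallel performance of MoG while reducing the impact of branching. Through analysis and experiments, the thesis offers practical guidance for developers who face divergence on GPUs.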


CHAPTER 1. INTRODUCTION

[Figure 1.2: CPU & GPU Architecture Comparison. (a) CPU Architecture (labels: Core, D$, OoO, Prefetch, Branch Predictor, I$). (b) GPU Architecture (labels: Core, Sched).]

space occupied by the big cache and branch predictor can be used to place more in-order cores on chip. Hence, more threads can run in parallel. Even though the latency of each thread becomes much larger, the overall throughput is much better than that of a CPU. Moreover, without the big cache, branch predictor and out-of-order execution (all of which are very power-hungry), this design also consumes much less power, resulting in much better power efficiency.

However, to benefit from this design, applications need to have a huge amount of inherent parallelism. One great example of throughput-oriented design is the GPU, which is traditionally customized for running graphics applications. Graphics applications inherently have a lot of parallelism, as there is a large number of pixels in each image/frame. Also, operations on each pixel are usually similar and independent of each other. Besides graphics applications, many other applications also have inherent parallelism, such as particle simulation and weather simulation. These applications are ideal candidates for GPU acceleration. However, programming GPUs used to be difficult, since all objects needed to be described as triangles and actions needed to be converted to graphics operations on those triangles. This process is very counter-intuitive and time-consuming. Later, with the introduction of general-purpose GPU programming models such as CUDA [ ] and OpenCL [ ], we saw a significant boost in GPU adoption. Effectively, CUDA / OpenCL is an extension to a high-level language such as C/C++, enabling programmers to use GPUs by writing their algorithms in C/C++. Even though it's not pure C/C++ and requires extra effort, it's much easier than writing in "graphics language". Since then, many general applications have seen significant speedups from GPU acceleration, as shown in [ ]. Nevertheless, as more diverse applications are ported to GPUs, there are some challenges in getting the most from GPUs. Following is a more detailed analysis of those challenges.

1.1 Problem Definition

The success of application acceleration is continuously attracting more and more general-purpose applications to GPUs, particularly in computer vision and machine learning. These applications have a significant amount of parallelism and repeat the same operations over many data. For example, in computer vision applications, the same operations are executed on each consecutive frame in the video. These applications are very computation intensive, as they need to extract information, build the model, and update the model on the fly from the raw data. Meanwhile, accessing input data and updating the model also incur a lot of memory accesses. Furthermore, input frames are getting higher and higher in resolution, which further increases both the computation demand and the memory demand.

Due to their algorithmic complexity, these applications need to perform many different tasks. Therefore, they tend to have larger code sizes and hence larger binaries. As the algorithms involve more intelligent decision making based on the condition of the input data, they exhibit more control flow diversity. These control flows also make memory accesses more irregular, as code in different branches accesses different regions of memory. Branch divergence hurts performance because GPUs execute programs in a SIMD-like (Single Instruction Multiple Data) manner, which means that every thread does the same thing but on different data. When divergence happens, to preserve correctness, both paths are executed, but threads only commit results on their own paths. That means many threads are doing useless work when divergence happens.
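The cost of executing both paths can be made concrete with a small model. The following Python sketch is an illustration only, not code from the thesis: it assumes a warp of 32 threads that issue in lockstep, and the names `simulate_branch` and `simd_efficiency` as well as the instruction counts are hypothetical. It counts how many issued lane slots do useful work for a single if/else.

```python
WARP_SIZE = 32  # assumption: 32 threads issue in lockstep, as on NVIDIA GPUs


def simulate_branch(conditions, then_cost, else_cost):
    """Return (issued_slots, useful_slots) for one if/else executed in lockstep.

    conditions: one boolean per thread (True takes the 'then' path).
    then_cost / else_cost: number of instructions on each path.
    """
    taken = sum(conditions)
    not_taken = len(conditions) - taken
    issued_instructions = 0
    # A path is issued only if at least one thread needs it.
    if taken:
        issued_instructions += then_cost
    if not_taken:
        issued_instructions += else_cost
    # Every issued instruction occupies all lanes, masked off or not.
    issued_slots = issued_instructions * len(conditions)
    # Only threads on their own path commit results.
    useful_slots = taken * then_cost + not_taken * else_cost
    return issued_slots, useful_slots


def simd_efficiency(conditions, then_cost, else_cost):
    issued, useful = simulate_branch(conditions, then_cost, else_cost)
    return useful / issued


uniform = [True] * WARP_SIZE                       # no divergence
diverged = [tid % 2 == 0 for tid in range(WARP_SIZE)]  # half on each path

print(simd_efficiency(uniform, 10, 10))   # 1.0
print(simd_efficiency(diverged, 10, 10))  # 0.5
```

In this model, an even split over two paths of equal length halves the fraction of useful work, which is the "useless work" described above.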

Besides branch divergence, memory divergence also hinders performance. As GPUs work in SIMD mode, many requests are generated when a memory access instruction is executed. If these requests access locations that are next to each other, they coalesce into fewer requests (ideally only one) to the first-level cache. Otherwise, a huge number of requests is sent to the memory subsystem (this is called memory divergence). With more requests, more conflicts occur in the memory subsystem, especially in the data cache. With so many threads running concurrently on GPUs and rapidly switching, cache thrashing is more likely to happen. Cache thrashing occurs a lot in computer vision and machine learning applications since their memory access patterns are often irregular.
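Coalescing can be sketched in the same spirit. The Python model below is a simplification under stated assumptions, not the behavior of any particular GPU: it assumes that one request serves one aligned 128-byte segment and that a warp has 32 threads. The function name `count_transactions` is hypothetical.

```python
SEGMENT_BYTES = 128  # assumption: one request serves one aligned 128-byte segment
WARP_SIZE = 32       # assumption: 32 threads per warp


def count_transactions(byte_addresses, segment_bytes=SEGMENT_BYTES):
    """Number of distinct aligned segments touched by one warp-wide access."""
    return len({addr // segment_bytes for addr in byte_addresses})


# Each thread reads one 4-byte float.
coalesced = [4 * tid for tid in range(WARP_SIZE)]     # neighbouring words
strided = [4 * 32 * tid for tid in range(WARP_SIZE)]  # stride of 32 floats

print(count_transactions(coalesced))  # 1
print(count_transactions(strided))    # 32
```

Under these assumptions the strided pattern generates 32 times as many requests for the same amount of useful data, which is the pressure on the memory subsystem that the paragraph above describes.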

Overall, we see that these applications run very inefficiently on GPUs since there are many branch divergences, memory divergences and conflicts in shared resources during GPU execution. Therefore, to fully utilize GPUs, programmers need to be aware of the underlying architecture while designing and implementing their algorithms. At the same time, GPU designers also need to optimize the architecture to better support the execution of more general-purpose applications.

1.2 Contribution

In order to streamline application development and improve GPU performance, this thesis makes contributions in two aspects: programming level and architecture level. At the programming level, the thesis demonstrates some general and algorithm-specific optimizations for a computer vision application that improve performance significantly; these optimizations are generally applicable to other applications. At the architecture level, this thesis first identifies the importance of thread-level similarity and defines a new metric, Warp Progression Similarity (WPS), to capture the similarity. Then different contributors to WPS are studied. Finally, a WPS-aware scheduler is proposed that can achieve better resource usage and higher performance.

In more detail, this thesis makes the following contributions:

1.2.1 Programming Level

Background subtraction is an essential first stage in many vision applications, differentiating foreground pixels from the background scene, with Mixture of Gaussians (MoG) being a widely used implementation choice. Due to its algorithmic complexity, MoG is highly computation intensive, which renders a real-time single-threaded realization on a CPU infeasible. With its pixel-level parallelism, deploying MoG on top of parallel architectures such as a Graphics Processing Unit (GPU) is promising. However, MoG poses many challenges, such as a significant amount of control flow (potentially reducing GPU execution efficiency) as well as a large memory bandwidth demand.

In this thesis, a GPU implementation of Mixture of Gaussians (MoG) is proposed that surpasses real-time processing for full HD (1080x1920, 60 fps). This thesis describes step-wise optimizations starting from general GPU optimizations (such as memory coalescing and overlapping computation with communication), via algorithm-specific optimizations including control flow reduction and register usage optimization, to a windowed optimization utilizing shared memory. For each optimization, this thesis evaluates the performance potential and identifies architectural bottlenecks. Our CUDA-based implementation improves performance over the sequential implementation by 57x, 97x and 101x through general, algorithm-specific, and windowed optimizations respectively, without impact on the output quality.
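To put these figures in perspective, the short Python calculation below derives the pixel rate implied by full HD at 60 fps and the gain of each optimization stage over the previous one. It assumes the three reported speedups are cumulative (each stage includes the earlier ones), which the step-wise description suggests but does not state outright. The helper name `incremental_gains` is hypothetical.

```python
WIDTH, HEIGHT, FPS = 1920, 1080, 60

pixels_per_frame = WIDTH * HEIGHT
pixels_per_second = pixels_per_frame * FPS
frame_budget_ms = 1000 / FPS  # time available per frame for real-time processing

# Speedups over the sequential implementation, as reported in the text.
# Assumption: the figures are cumulative across the optimization stages.
speedups = [("general", 57), ("algorithm-specific", 97), ("windowed", 101)]


def incremental_gains(stages):
    """Gain of each stage relative to the previous stage (first vs. sequential)."""
    gains = []
    previous = 1
    for name, speedup in stages:
        gains.append((name, speedup / previous))
        previous = speedup
    return gains


print(pixels_per_second)          # 124416000 pixels per second
print(round(frame_budget_ms, 2))  # 16.67 ms per frame
for name, gain in incremental_gains(speedups):
    print(name, round(gain, 2))
```

Read this way, most of the improvement comes from the general optimizations, the algorithm-specific stage adds roughly another 1.7x, and the windowed stage adds a few percent on top.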

1.2.2 Architecture Level

When an application runs on a GPU, many warps are launched. All the warps execute the same code and start from the beginning. After some time, there will be a divergence in execution progress (which we call inter-warp divergence) among the warps. In other words, some warps are far ahead while others are lagging behind. Inter-warp divergence happens mainly because of uneven memory access latency (whether data is in the cache or not) and branch divergence. This divergence has a significant performance impact on GPUs, as it can bring more conflicts to shared resources such as the D$ and I$. Previous work [ ] also observes conflicts in the memory subsystem, in particular, the data cache. [ ] try to change the thread scheduling policy so as to preserve data locality. However, they overlook execution locality between warps and its performance impact on shared resources, especially the instruction cache, which is a major victim as it is directly related to inter-warp divergence.

To fully understand the performance impact, this thesis quantitatively studies the benefits of inter-warp divergence aware execution on GPUs. To that end, the thesis first proposes a novel approach to quantify inter-warp divergence by measuring the temporal similarity in execution progress of concurrent warps, which we call Warp Progression Similarity (WPS). Using this metric, we analyze all the factors that can influence WPS, including algorithm intrinsics, bounded cache, scheduling policy and number of schedulers. The results show that among all the factors, scheduling policy is the biggest one. Then, based on the WPS metric, this thesis proposes a WPS-aware Scheduler (WPSaS) to optimize GPU throughput. WPSaS slows down leading warps and speeds up lagging warps by taking into account the age of each instruction cache block. The goal is to manage inter-warp divergence to better hide memory access latency and minimize resource conflicts and temporal under-utilization in compute units, allowing GPUs to achieve their peak throughput. Our results demonstrate that WPSaS improves throughput by 10% with a pronounced reduction in resource conflicts and temporal under-utilization.
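The idea of keeping warps close together in execution progress can be illustrated with a toy model. The Python sketch below is not the WPS metric or the WPSaS scheduler: this section does not give the WPS formula, and WPSaS uses the age of instruction cache blocks, which the sketch does not model. The function `progress_similarity` is a hypothetical proxy based only on instruction counts, and `greedy_first` and `lagging_first` are hypothetical policies that issue one instruction per cycle from a single warp.

```python
def progress_similarity(progress):
    """Hypothetical proxy in [0, 1]; 1.0 means all warps made equal progress.

    This is NOT the thesis's WPS definition, only an illustrative stand-in.
    """
    top = max(progress)
    if top == 0:
        return 1.0
    return 1.0 - (top - min(progress)) / top


def run(policy, num_warps=4, program_length=100, cycles=200):
    """Issue one instruction per cycle from the warp chosen by `policy`."""
    progress = [0] * num_warps
    for _ in range(cycles):
        ready = [w for w in range(num_warps) if progress[w] < program_length]
        if not ready:
            break
        progress[policy(ready, progress)] += 1
    return progress


def greedy_first(ready, progress):
    # Always favour the lowest-numbered ready warp, letting it run ahead.
    return ready[0]


def lagging_first(ready, progress):
    # Favour the warp that has made the least progress so far.
    return min(ready, key=lambda w: progress[w])


greedy = run(greedy_first)
balanced = run(lagging_first)
print(greedy, progress_similarity(greedy))      # [100, 100, 0, 0] 0.0
print(balanced, progress_similarity(balanced))  # [50, 50, 50, 50] 1.0
```

Even in this simplified setting, the choice of scheduling policy alone determines whether warps stay together or drift apart, which is consistent with the observation above that scheduling policy is the largest factor.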
