01 kernel<<<gD,bD>>>(args)
(a) Original Kernel Call

02 __global__ void kernel(params) {
03   kernel body
04 }
(b) Original Kernel

05 allocate arrays for args, gD, and bD
06 store args in arg arrays
07 store gD in gD array, and bD in bD array
08 new gD = sum of gD array across warp/block
09 new bD = max of bD array across warp/block
10 if(threadIdx == launcher thread in warp/block) {
11   kernel_agg<<<new gD,new bD>>>
12     (arg arrays, gD array, bD array)
13 }
(c) Transformed Kernel Call (called in a kernel)

14 __global__ void kernel_agg(param arrays, gD array, bD array) {
15   calculate index of parent thread
16   load params from param arrays
17   load actual gridDim/blockDim from gD/bD arrays
18   calculate actual blockIdx
19   if(threadIdx < actual blockDim) {
20     kernel body (with kernel launches transformed and
21       using actual gridDim/blockDim/blockIdx)
22   }
23 }
(d) Transformed Kernel (called from a kernel)
Fig. 2. Code Generation for Aggregation at Warp and Block Granularity
(a) Original Block
thread 0: param=x, gD=1, bD=4    thread 2: param=y, gD=2, bD=3
(threads 1 and 3 launch no kernel)

(b) Block-Granularity Aggregation Logic Example
param_arr[] = {x, -, y, -}
gD_arr[]    = {1, 0, 2, 0}
bD_arr[]    = {4, 0, 3, 0}
gD = sum(gD_arr) = 3
bD = max(bD_arr) = 4
gD_scan     = {1, 1, 3, 3}

Each child block finds its parent p such that gD_scan[p-1] ≤ bI < gD_scan[p], then loads:
param = param_arr[p], gD' = gD_arr[p], bD' = bD_arr[p], bI' = bI - gD_scan[p-1]

Resulting child blocks:
bI=0: p=0, param=x, gD'=1, bD'=4, bI'=0
bI=1: p=2, param=y, gD'=2, bD'=3, bI'=0
bI=2: p=2, param=y, gD'=2, bD'=3, bI'=1

(gD: gridDim, bD: blockDim, bI: blockIdx, p: parent threadIdx)
Fig. 3. Aggregation Example
uniform across parent threads. Finally, one of the threads in
the warp (or block) launches a single aggregated kernel on
behalf of the others (line 10). For block granularity, a barrier
synchronization is needed before the launch to ensure that
all the threads in the block have completed their preparation
of the arguments and configurations. In the aggregated kernel
launch, the new configurations are used (line 11), arguments
are replaced with argument arrays, and arrays containing the
configurations for each original child are added (line 12).
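For concreteness, the following CUDA sketch shows one possible shape of the transformed call at warp granularity. It is a minimal illustration, not the generated code: the global staging arrays, the single-warp pool size, and the name transformed_launch are assumptions, and a real implementation would manage per-warp buffers.

    // Minimal sketch of Figure 2(c) at warp granularity. Assumes dynamic
    // parallelism (-rdc=true, sm_35+) and a single warp, so the hypothetical
    // staging pool below holds exactly 32 entries.
    __device__ int      g_arg_arr[32];
    __device__ unsigned g_gD_arr[32];
    __device__ unsigned g_bD_arr[32];

    __global__ void kernel_agg(int *arg_arr, unsigned *gD_arr, unsigned *bD_arr);

    __device__ void transformed_launch(int arg, unsigned gD, unsigned bD) {
        int lane = threadIdx.x % warpSize;

        // Lines 05-07: store each thread's argument and configuration.
        g_arg_arr[lane] = arg;
        g_gD_arr[lane]  = gD;
        g_bD_arr[lane]  = bD;

        // Lines 08-09: sum of gD and max of bD across the warp.
        unsigned newGD = gD, newBD = bD;
        for (int off = warpSize / 2; off > 0; off /= 2) {
            newGD += __shfl_down_sync(0xffffffffu, newGD, off);
            newBD  = max(newBD, __shfl_down_sync(0xffffffffu, newBD, off));
        }

        __syncwarp();  // make all lanes' stores visible before the launch

        // Lines 10-12: one lane launches the aggregated kernel for the warp.
        if (lane == 0)
            kernel_agg<<<newGD, newBD>>>(g_arg_arr, g_gD_arr, g_bD_arr);
    }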
In addition to transforming kernel launches in all original
kernels, an aggregated version of each original kernel must
also be created. Figure 2(d) shows how the kernel in Fig-
ure 2(b) is transformed into an aggregated version. First, all
parameters are converted into parameter (param) arrays and
configuration arrays are appended to the parameter list (line
14). Next, before the kernel body, logic is added for the
block to identify which thread in the parent warp (or block)
was its original parent (line 15). After identifying its original
parent, the block is then able to load its actual configurations
and parameters (lines 16-18). Threads that were not in the
original child kernel are then masked out (line 19). Finally,
in the kernel body, all kernel launches are transformed into
aggregated kernel launches, and all uses of blockDim and
blockIdx are replaced with the actual values (lines 20-21).
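A corresponding sketch of the aggregated kernel, again an illustration rather than the generated code, is shown below; the gD array is assumed to arrive in scanned form (as discussed next), and a simple linear search stands in for the p-ary search.

    // Minimal sketch of Figure 2(d) for an original kernel with one int
    // parameter. gD_scan is the inclusive scan of the gD array (see below).
    __global__ void kernel_agg(int *param_arr, unsigned *gD_scan,
                               unsigned *bD_arr) {
        // Line 15: find parent p with gD_scan[p-1] <= blockIdx.x < gD_scan[p]
        // (linear search here; the actual implementation uses p-ary search).
        int p = 0;
        while (blockIdx.x >= gD_scan[p]) p++;

        // Lines 16-18: load the actual parameter and configuration.
        int      param = param_arr[p];
        unsigned bD    = bD_arr[p];                      // actual blockDim
        unsigned base  = (p > 0) ? gD_scan[p - 1] : 0u;
        unsigned bI    = blockIdx.x - base;              // actual blockIdx

        // Line 19: mask out threads beyond the original blockDim.
        if (threadIdx.x < bD) {
            // Lines 20-21: original kernel body, with uses of blockDim and
            // blockIdx replaced by bD and bI, and nested launches transformed.
            (void)param;
        }
    }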
For a block to identify its original parent, it needs to
execute a scan (prefix sum) on the gD (gridDim) array and then
search for its own position (given by the aggregated blockIdx
value) among the scanned values (using p-ary search [11]). In
practice, since all child blocks need to scan the same gD array,
the scan is instead performed once by the parent before the
array is passed to the aggregated child kernel. Conveniently,
the scan can be performed along with the preparation of the
configuration and parameter arrays in the parent, making it
incur little additional overhead. Since the child kernel needs
both the scanned values and the original gD values, only the
scanned array is passed; the child recovers each original gD
value by subtracting adjacent elements of the scanned array.
The scan is performed using CUB [12].
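A sketch of the parent-side scan with CUB's WarpScan is shown below; the warps-per-block constant and the function name are illustrative.

    #include <cub/cub.cuh>

    #define WARPS_PER_BLOCK 8  // hypothetical parent block size / warpSize

    // Minimal sketch: each parent warp scans its gD values while preparing
    // the configuration arrays, and only the scanned array is passed on.
    __device__ void prepare_gD_scan(unsigned gD, unsigned *gD_scan_arr) {
        typedef cub::WarpScan<unsigned> WarpScan;
        __shared__ typename WarpScan::TempStorage temp[WARPS_PER_BLOCK];
        int warp = threadIdx.x / warpSize;
        int lane = threadIdx.x % warpSize;

        unsigned inclusive;
        WarpScan(temp[warp]).InclusiveSum(gD, inclusive);  // inclusive prefix sum
        gD_scan_arr[lane] = inclusive;

        // In the child, the original value is recovered without a second array:
        //   gD_arr[p] == gD_scan[p] - (p > 0 ? gD_scan[p-1] : 0)
    }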
The transformed code requires that all threads are active to
perform the scan and max operations. To handle control di-
vergence, a preprocessing pass performs control-flow-to-data-
flow conversion to convert divergent launches to non-divergent
predicated launches so that all threads reach the launch point.
Predication is achieved by multiplying the predicate with the
grid dimension such that launches by inactive threads become
launches of zero blocks.
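The conversion amounts to a one-line rewrite at each divergent launch site, sketched below (transformed_launch is the hypothetical aggregated call from the earlier sketch):

    // Original, divergent form:
    //     if (cond) kernel<<<gD, bD>>>(arg);
    // Predicated form: every thread reaches the launch point, and inactive
    // threads contribute a launch of zero blocks to the aggregation.
    __device__ void predicated_launch(bool cond, int arg,
                                      unsigned gD, unsigned bD) {
        unsigned pred = cond ? 1u : 0u;
        transformed_launch(arg, pred * gD, bD);  // zero-block launch if !cond
    }

The zero entries in the gD array of Figure 3 correspond to exactly such inactive threads.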
B. Kernel Granularity
Figure 1(d) illustrates the transformation that takes place
when kernel launch aggregation is applied at kernel granu-
larity. At this granularity, all the original child kernels are
aggregated into a single kernel. Because there is no global
synchronization on the GPU, a single thread cannot be chosen
to launch the kernel on behalf of the others once the others are
ready. Instead, the child kernels are postponed and launched
from the host after the parent kernel terminates. In order
to postpone the kernel launches, this transformation requires
that parent kernels do not explicitly synchronize with their
child kernels, so kernels with explicit synchronization are not
supported at this granularity.
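One way to realize this postponement is sketched below, under the assumption of a bounded global record buffer; g_pending, MAX_PENDING, and the host-side draining shown in comments are illustrative, not necessarily the paper's exact mechanism.

    #define MAX_PENDING 1024  // hypothetical buffer capacity

    // Each would-be child launch is recorded in global memory instead of
    // being executed on the device.
    struct PendingLaunch { int arg; unsigned gD, bD; };
    __device__ PendingLaunch g_pending[MAX_PENDING];
    __device__ unsigned      g_num_pending = 0;

    __device__ void postponed_launch(int arg, unsigned gD, unsigned bD) {
        unsigned slot = atomicAdd(&g_num_pending, 1u);  // claim a record slot
        g_pending[slot] = PendingLaunch{ arg, gD, bD };
    }

    // Host side, after the parent kernel terminates:
    //   cudaDeviceSynchronize();
    //   unsigned n;
    //   cudaMemcpyFromSymbol(&n, g_num_pending, sizeof(n));
    //   // Build the aggregate configuration from the n records, then
    //   // launch kernel_agg once on behalf of all of them.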