Mapping Loops of Multimedia Algorithms for Coarse-Grained Reconfigurable Architectures
Ziyu Yang, Peng Zhao, Guanwu Wang, Sikun Li
School of Computer
National University of Defense Technology
Changsha, China
zyyang@nudt.edu.cn
Abstract—Coarse-Grained Reconfigurable Architectures (CGRAs) are widely used as coprocessors to accelerate data-intensive applications. However, the parallelization of sequential programs and the optimization of critical loops remain challenging, since the access delay introduced by the massive number of memory accesses in those loops has become the bottleneck of CGRA performance. In this paper we focus on the parallel optimization of applications by considering the mapping of critical loops under the CGRA's resource constraints. We first propose a novel approach to parallelize loops by multi-level tiling. A genetic algorithm is then introduced to schedule the tiled loops with memory-aware objective functions. Data locality and communication cost are also optimized during the parallel processing. Experimental results show that our approach generates more effective parallel tasks, improving data locality and load-balanced execution, and obtains a 9.6% better speedup compared with memory-unaware parallel processing.
Keywords- loop mapping; multi-level tiling; coarse-grained
reconfigurable architecture
I. INTRODUCTION
Multimedia applications, such as image and video processing, are not only computationally expensive but also data intensive; that is, a large number of memory accesses are required during the execution of their critical loops [1]. Coarse-grained reconfigurable architectures (CGRAs) provide an efficient way to accelerate the critical loops of these applications by executing tasks in parallel. A CGRA is essentially an array of processing elements (PEs), such as ALUs and multipliers, interconnected by a mesh-like network. Because the hardware is kept simple, experienced designers have to optimize applications manually to maximize the performance of CGRAs. Mapping applications (mostly loop tasks) onto CGRAs automatically and efficiently has therefore become one of the most active research topics in this area [2]. This paper presents a novel approach for mapping sequential data-intensive algorithms onto CGRAs, which focuses on the parallel processing of critical loops by multi-level tiling.
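To illustrate the idea of multi-level tiling, the fragment below shows a generic two-level tiling of a simple one-dimensional loop. It is only a sketch of the general technique: the tile sizes T1 and T2, the function and array names, and the association of tile levels with memory levels are hypothetical and are not the exact transformation produced by our framework.

    #include <stddef.h>

    #define T1 256   /* level-1 tile: assumed to fit the on-chip shared memory  */
    #define T2 32    /* level-2 tile: assumed to fit a single PE's local buffer */

    static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

    /* Two-level tiled form of "for (i = 0; i < n; i++) a[i] = b[i] * c;".
     * Each level-1 tile is an independent task that can be assigned to a
     * group of PEs; each level-2 tile bounds the working set touched at a
     * time, so data loaded into a PE's local buffer is reused locally.    */
    void scale_tiled(float *a, const float *b, float c, size_t n)
    {
        for (size_t ii = 0; ii < n; ii += T1)                        /* level-1 tiles */
            for (size_t jj = ii; jj < min_sz(ii + T1, n); jj += T2)  /* level-2 tiles */
                for (size_t i = jj; i < min_sz(jj + T2, n); i++)
                    a[i] = b[i] * c;
    }

In such a tiling, the outer tiles form the parallel tasks that the scheduler distributes over the PE array, while the inner tiles keep each task's working set small, which is the kind of data-locality improvement that multi-level tiling targets.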
Section 2 gives an overview of related work. Section 3 briefly discusses the typical "application kernel–CGRA" mapping framework with resource constraints. Sections 4 and 5 then present the details of the parallel processing: a multi-level loop tiling transformation based on application memory analysis, and a memory-aware genetic scheduling algorithm, respectively. Section 6 presents the experimental results and analysis demonstrating the efficiency of our approach. Finally, Section 7 concludes and gives insights into future work.
II. RELATED WORK
Considering the coupling and granularity of the reconfigurable processing units (RPUs), existing reconfigurable computing platforms can be classified into two groups: (1) fine-grained architectures, in which the typical configuration data-widths are within 4 bits, like FPGAs; and (2) coarse-grained architectures, in which the RPUs are usually ALUs and the typical data-widths are 16 or 32 bits. A fine-grained reconfigurable array is much more general and can be adapted to many applications, but its performance and energy savings are lower than those of coarse-grained approaches. Many CGRAs have been proposed together with compilers or synthesis tools, such as LEAP [3]. However, there is still a lack of efficient parallelization and mapping tools for fully exploiting the performance and flexibility offered by these architectures. Most previous CGRA mapping work [4, 5] assumes a very simple model of the CGRA and considers operation mapping on PEs, but not the memory hierarchy or data dependences. Yoon [4] treats mapping as a graph-embedding problem, taking a data-flow graph as input and mapping it onto a PE interconnection graph using a known graph algorithm. Wang [5] presents a loop self-pipelining mapping approach called LKPM for mapping data-intensive applications. The approach proposed in this paper differs from the others in that it focuses on mapping critical loops under the CGRA's resource constraints. By taking the memory hierarchy into account, the multi-level loop tiling makes data locality more efficient, and the genetic scheduling algorithm makes the load-balanced execution more adaptive.
III. PROBLEM FORMULATION
A. RCP_CGRA Model
We propose a CGRA model named RCP_CGRA (Resource, Constraint and Performance model of CGRA), which captures the features of the target architecture, LEAP [3], and passes them as constraints to the subsequent application mapping. Fig. 1(a) shows a typical CGRA called
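As a rough, purely hypothetical sketch (the concrete parameters of RCP_CGRA are determined by the LEAP architecture and are not reproduced here), the resource, constraint, and performance information carried by such a model can be pictured as a record of the following form; every field name and grouping is illustrative only.

    /* Illustrative sketch of an RCP_CGRA-style description; all fields are hypothetical. */
    struct rcp_cgra {
        /* Resources offered by the array */
        int rows, cols;              /* dimensions of the PE mesh                     */
        int local_mem_bytes;         /* local data buffer available to each PE        */
        int shared_mem_bytes;        /* on-chip memory shared among the PEs           */
        /* Constraints passed to the mapper */
        int max_ops_per_pe;          /* operations one PE can hold per configuration  */
        int mem_ports;               /* memory accesses the array can issue per cycle */
        /* Performance figures used by the scheduler's objective function */
        int local_access_cycles;     /* latency of a local-buffer access              */
        int shared_access_cycles;    /* latency of a shared-memory access             */
    };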