Mapping Loops of Multimedia Algorithms for Coarse-Grained Reconfigurable Architectures
Ziyu Yang, Peng Zhao, Guanwu Wang, Sikun Li
School of Computer
National University of Defense Technology
Changsha, China
zyyang@nudt.edu.cn
Abstract—Coarse-Grained Reconfigurable Architectures (CGRAs) are widely used as coprocessors to accelerate data-intensive applications. However, the parallelization of sequential programs and the optimization of critical loops remain challenging, since the access delay introduced by the massive number of memory accesses in those loops has become the bottleneck of CGRA performance. In this paper we focus on the parallel optimization of applications by considering the mapping of critical loops under the CGRA's resource constraints. We first propose a novel approach to parallelize loops by multi-level tiling. A genetic algorithm is then introduced to schedule the tiled loops with memory-aware objective functions. Data locality and communication cost are also optimized during the parallel processing. Experimental results show that our approach generates more effective parallel tasks, improving data locality and load-balanced execution, and obtains a 9.6% better speedup compared with memory-unaware parallel processing.
Keywords- loop mapping; multi-level tiling; coarse-grained
reconfigurable architecture
I. INTRODUCTION
Multimedia applications, such as image and video processing, are not only computationally expensive but also data intensive; that is, a large number of memory accesses are required during the execution of their critical loops [1]. Coarse-grained reconfigurable architectures (CGRAs) provide an efficient way to accelerate the critical loops of these applications by executing tasks in parallel. A CGRA is essentially an array of processing elements (PEs), such as ALUs and multipliers, interconnected by a mesh-like network. Because the hardware is kept simple, experienced designers have to optimize applications manually to maximize the performance of CGRAs. Mapping applications (mostly loop tasks) onto CGRAs automatically and efficiently has therefore become one of the most active research topics in this area [2]. This paper presents a novel approach for mapping sequential data-intensive algorithms onto CGRAs, which focuses on the parallel processing of critical loops by multi-level tiling.
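To illustrate the idea of multi-level tiling, the fragment below shows a generic two-level tiling of a simple one-dimensional loop. It is only a sketch of the general technique: the tile sizes T1 and T2, the function and array names, and the association of tile levels with memory levels are hypothetical and are not the exact transformation produced by our framework.

    #include <stddef.h>

    #define T1 256   /* level-1 tile: assumed to fit the on-chip shared memory  */
    #define T2 32    /* level-2 tile: assumed to fit a single PE's local buffer */

    static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

    /* Two-level tiled form of "for (i = 0; i < n; i++) a[i] = b[i] * c;".
     * Each level-1 tile is an independent task that can be assigned to a
     * group of PEs; each level-2 tile bounds the working set touched at a
     * time, so data loaded into a PE's local buffer is reused locally.    */
    void scale_tiled(float *a, const float *b, float c, size_t n)
    {
        for (size_t ii = 0; ii < n; ii += T1)                        /* level-1 tiles */
            for (size_t jj = ii; jj < min_sz(ii + T1, n); jj += T2)  /* level-2 tiles */
                for (size_t i = jj; i < min_sz(jj + T2, n); i++)
                    a[i] = b[i] * c;
    }

In such a tiling, the outer tiles form the parallel tasks that the scheduler distributes over the PE array, while the inner tiles keep each task's working set small, which is the kind of data-locality improvement that multi-level tiling targets.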
Section 2 gives an overview of related work. Section 3 briefly discusses the typical "application kernel–CGRA" mapping framework with resource constraints. Sections 4 and 5 then present the details of the parallel processing: a multi-level loop tiling transformation based on application memory analysis, and a memory-aware genetic scheduling algorithm, respectively. Section 6 presents the experimental results and analysis demonstrating the efficiency of our approach. Finally, Section 7 concludes and gives insights into future work.
II. RELATED WORK
Considering the coupling and granularity of the reconfigurable processing units (RPUs), existing reconfigurable computing platforms can be classified into two groups: (1) fine-grained architectures, in which the typical configuration data-widths are within 4 bits, like FPGAs; and (2) coarse-grained architectures, in which the RPUs are usually ALUs and the typical data-widths are 16 or 32 bits. A fine-grained reconfigurable array is much more general and can be adapted to many applications, but its performance and energy savings are lower than those of coarse-grained approaches. Many CGRAs have been proposed together with compilers or synthesis tools, such as LEAP [3]. However, there is still a lack of efficient parallelization and mapping tools for fully exploiting the performance and flexibility offered by these architectures. Most previous CGRA mapping work [4, 5] assumes a very simple model of the CGRA and considers operation mapping on PEs, but not the memory hierarchy or data dependences. Yoon [4] treats mapping as a graph-embedding problem, taking a data-flow graph as input and mapping it onto a PE interconnection graph using a known graph algorithm. Wang [5] presents a loop self-pipelining mapping approach called LKPM for mapping data-intensive applications. The approach proposed in this paper differs from the others in that it focuses on mapping critical loops under the CGRA's resource constraints. By taking the memory hierarchy into account, the multi-level loop tiling makes data locality more efficient, and the genetic scheduling algorithm makes the load-balanced execution more adaptive.
III. PROBLEM FORMULATION
A. RCP_CGRA Model
We propose a CGRA model named RCP_CGRA (Resource, Constraint and Performance model of CGRA), which captures the features of the target architecture, LEAP [3], and passes them as constraints to the subsequent application mapping. Fig. 1(a) shows a typical CGRA called
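As a rough, purely hypothetical sketch (the concrete parameters of RCP_CGRA are determined by the LEAP architecture and are not reproduced here), the resource, constraint, and performance information carried by such a model can be pictured as a record of the following form; every field name and grouping is illustrative only.

    /* Illustrative sketch of an RCP_CGRA-style description; all fields are hypothetical. */
    struct rcp_cgra {
        /* Resources offered by the array */
        int rows, cols;              /* dimensions of the PE mesh                     */
        int local_mem_bytes;         /* local data buffer available to each PE        */
        int shared_mem_bytes;        /* on-chip memory shared among the PEs           */
        /* Constraints passed to the mapper */
        int max_ops_per_pe;          /* operations one PE can hold per configuration  */
        int mem_ports;               /* memory accesses the array can issue per cycle */
        /* Performance figures used by the scheduler's objective function */
        int local_access_cycles;     /* latency of a local-buffer access              */
        int shared_access_cycles;    /* latency of a shared-memory access             */
    };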