The main contributions of this work are: (i) transforming the
managing of the distribution of inter-cluster data transferring
into consideration of resource pressure and register pressure
across the whole scheduling cycle range; (ii) introducing two
kinds of register pressure to estimate the influence of register
allocation phase on cycle scheduling and cluster assignment;
(iii) enhancing performance and energy consumption effect-
iveness by taken into consideration both the influence of
resource pressures (availability for FUs in each cluster) and
register pressures (availability of global registers).
This paper is organized as follow: Related works are dis-
cussed in Section
2. Section 3 will discuss more about the
RFCC VLIW architecture. The register pressure aware
instruction scheduling algorithm is introduced in Section
4.
And the experimental framework and results are presented in
Section
5.Andfinally, we give conclusions in Section 6.
2. RELATED WORK
As clustering has become a common trend, there already exist
a lot of works concerning the instruction scheduling of clus-
tered architectures.
Zalamea et al.[
11] have presented an instruction scheduling
algorithm for clustered VLIW architecture, which performs
instruction scheduling, register allocation and cluster assign-
ment in a single step. The algorithm is based on an iterative
approach with limited backtracking, which allows one to undo
previous scheduling, spilling or communication decisions
without the compilation time penalty of a wide search of the
solution space. Codina et al.[
12] have introduced a modulo
scheduling framework for clustered instruction level parallel-
ism processors that integrates the cluster assignment, instruc-
tion scheduling and register allocation steps in a single phase.
The proposed framework includes a mechanism to insert spill
code on-the-fly and heuristics to evaluate the quality of partial
schedules considering simultaneously inter-cluster transfer-
ring, memory pressure and register pressure. Later, they have
exploited a concept of virtual cluster to assist the instruction
scheduling for clustered architecture [
13].
Aleta et al.[
14] have presented a graph-partitioning-based
instruction scheduling for clustered architecture. Later, they
[
15] have presented another graph-based approach, called
AGAMOS, to modulo schedule loops on clustered architec-
tures. Xu et al.[
16] have presented their study on the design
of inter-cluster connection network in clustered digital signal
processor (DSP) processors. The approach starts with deter-
mining the minimum number of buses required in polynomial
time for any given schedules, and then further determines an
underlying inter-cluster connection scheme with the number
of buses determined in the previous step.
Arafath et al.[
17] have implemented an integrated instruc-
tion partitioning and scheduling technique for clustered
VLIW architectures, using the amount of clock cycles
followed by each instruction and the number of successors of
an instruction to prioritize the instructions. Zhang et al.[
18]
presented a phase coupled priority-based heuristic scheduling
algorithm, which converts the instruction scheduling problem
into the problem of scheduling a set of instructions with a
common deadline.
Huang et al.[
9] have introduced a Worst-Case-Execution-
Time-aware Re-scheduling Register Allocation (WRRA)
approach, to achieve Worst-Case-Execution-Time (WCET)
minimization for real-time embedded systems with clustered
VLIW architecture.
Jiang et al.[
19] have proposed a multithreading technique
based on a scheduling scheme of stream programs on clus-
tered VLIW stream architecture, which aims at optimal arith-
metic unit utilization without increasing energy consumption.
Its principle is to exploit more kernel-level parallelism for fur-
ther optimal compilation by constructing homogeneous mul-
tiple threads on stream programs.
However, these research effort is focused on BCC VLIW
architecture. The efforts toward the optimization for RFCC
VLIW architecture are scarce.
Zhou et al.[
20] have presented a two-dimension force-
directed (TDFD) scheduling algorithm for RFCC VLIW
architecture, which simply considered the balancing of influ-
ences of data dependence relations and available resources on
instruction scheduling. TDFD has not actually taken into
account the influence of limitation on access ports to the glo-
bal register file on the instruction scheduling.
Tang et al.[
21] have presented a force-balanced-two-phase
(FBTP) instruction scheduling algorithm for RFCC VLIW
architecture, which based on careful arrangement of inter-
cluster data transferring to balance the distribution of the
access to global register file among the whole execution time.
However, FBTP focused on arrangement of the di stribution
of inter-cluster data transferring by balancing the influence of
data dependence relations and resource availability, and does
not directly take into account the influence of register pres-
sure on instruction scheduling and cluster assignment.
3. PROBLEM ANALYSIS
3.1. Introduction of register file connected clustered
VLIW architecture
In this work, we focused on RFCC VLIW architecture.
Figure
2 presents an example of a generic RFCC VLIW archi-
tecture. There are four clusters, and each cluster is composed
of three FUs and a tightly connected local register file. Each
FU has its own sets of access ports to the local register file, so
FUs can directly address data stored in the local register file.
Each cluster has a sets of access ports (either read or write)
to the global register file, which means the FUs in one cluster
must share those access ports. When an inter-cluster data
3ON IMPROVING PERFORMANCE AND ENERGY EFFICIENCY
SECTION A: COMPUTER SCIENCE THEORY,METHODS AND TOOLS
THE COMPUTER JOURNAL, 2017