Strategies for Improving the Performance and Energy Efficiency of RFCC VLIW Architectures in Embedded Systems

Research paper


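This research paper, co-authored by HUHE, XUYANG and YANJUN ZHANG and published in The Computer Journal in 2017, explores how to improve the performance and energy efficiency of register-file-connected clustered VLIW (Very Long Instruction Word) architectures in embedded systems. Compilation for clustered VLIW architectures usually involves separate register allocation, scheduling and cluster assignment phases, but handling each phase in isolation can negatively affect the others. Traditionally, register allocation for such architectures is performed on its own, before instruction scheduling and cluster assignment. The authors argue that this approach may be suboptimal, especially for the register-file-connected clustered VLIW (RFCC VLIW) architecture, where each cluster's access to the global register file is a key performance factor. The core idea is that integrating or coordinating these phases, that is, considering register allocation, instruction scheduling and the formation of intra-cluster instruction groups together, can reduce unnecessary data-path contention and so improve execution efficiency. The paper focuses on the following points:

1. **Integrated strategy**: The authors explore a new approach that combines register allocation with instruction scheduling and cluster assignment, so that register conflicts and reuse are considered early on and later performance losses are avoided.
2. **Performance optimization**: The paper analyzes in detail how improved algorithms or hardware optimizations can raise instruction-level parallelism and pipeline efficiency and reduce instruction wait time, increasing overall execution speed.
3. **Energy efficiency**: By optimizing data-flow management and reducing unnecessary memory accesses, the paper aims to lower energy consumption, which is critical for energy-constrained embedded systems.
4. **Experimental validation**: The paper presents experimental results based on a real hardware platform, showing that the proposed method brings significant gains in both performance and energy efficiency and offering practical guidance for RFCC VLIW architectures in real applications.
5. **Application scenarios**: The results are particularly relevant to embedded systems with demanding low-power and high-speed requirements, such as Internet of Things devices, mobile devices and real-time control applications.

Through an in-depth study of the register-file-connected clustered VLIW architecture, the paper proposes an integrated strategy that addresses the problems of the traditional phase-separated approach, improving both the performance and the energy efficiency of embedded systems and providing a useful reference for further optimization of VLIW technology in practice.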


The main contributions of this work are: (i) transforming the management of the distribution of inter-cluster data transfers into a consideration of resource pressure and register pressure across the whole scheduling cycle range; (ii) introducing two kinds of register pressure to estimate the influence of the register allocation phase on cycle scheduling and cluster assignment; (iii) enhancing performance and energy consumption effectiveness by taking into consideration both the influence of resource pressures (availability of FUs in each cluster) and

This paper is organized as follows. Related work is discussed in Section 2. Section 3 discusses the RFCC VLIW architecture in more detail. The register-pressure-aware instruction scheduling algorithm is introduced in Section 4. The experimental framework and results are presented in Section 5. Finally, we give conclusions in Section 6.

2. RELATED WORK

As clustering has become a common trend, there is already a large body of work on instruction scheduling for clustered architectures.

Zalamea et al. [11] presented an instruction scheduling algorithm for clustered VLIW architectures that performs instruction scheduling, register allocation and cluster assignment in a single step. The algorithm is based on an iterative approach with limited backtracking, which allows it to undo previous scheduling, spilling or communication decisions without the compilation-time penalty of a wide search of the solution space. Codina et al. [12] introduced a modulo scheduling framework for clustered instruction-level parallelism processors that integrates the cluster assignment, instruction scheduling and register allocation steps in a single phase. The proposed framework includes a mechanism to insert spill code on the fly and heuristics that evaluate the quality of partial schedules by simultaneously considering inter-cluster transfers, memory pressure and register pressure. Later, they exploited the concept of a virtual cluster to assist instruction scheduling for clustered architectures [13].

Aleta et al. [14] presented a graph-partitioning-based instruction scheduling approach for clustered architectures. Later, they [15] presented another graph-based approach, called AGAMOS, to modulo schedule loops on clustered architectures. Xu et al. [16] presented a study on the design of inter-cluster connection networks in clustered digital signal processors (DSPs). The approach starts by determining, in polynomial time, the minimum number of buses required for any given schedule, and then determines an underlying inter-cluster connection scheme with the number of buses found in the previous step.
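To make the first step of such a bus-sizing approach concrete, here is a minimal sketch under a strong simplifying assumption: every inter-cluster transfer occupies one bus for exactly one cycle, and any bus can connect any pair of clusters. Under that assumption the minimum bus count for a fixed schedule is simply the peak number of transfers in any cycle. The function name `min_buses` and the tuple layout are hypothetical and are not taken from Xu et al. [16], whose actual method may differ.

```python
from collections import Counter


def min_buses(transfers):
    """Return the peak number of inter-cluster transfers in any one cycle.

    `transfers` is an iterable of (cycle, src_cluster, dst_cluster) tuples.
    Assumption (not from the paper): each transfer holds one bus for exactly
    one cycle, and any bus can connect any pair of clusters.
    """
    # Transfers whose source and destination coincide need no bus.
    per_cycle = Counter(cycle for cycle, src, dst in transfers if src != dst)
    return max(per_cycle.values(), default=0)


# Hypothetical schedule: three transfers collide in cycle 2.
schedule = [
    (0, 0, 1),
    (0, 2, 3),
    (1, 1, 2),
    (2, 0, 3),
    (2, 1, 0),
    (2, 3, 2),
]
print(min_buses(schedule))
```

A single linear pass over the transfers is enough here, which is consistent with the polynomial-time bound mentioned in the text, although a realistic model would also have to account for multi-cycle transfers and restricted bus topologies.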

Arafath et al. [17] implemented an integrated instruction partitioning and scheduling technique for clustered VLIW architectures, using the number of clock cycles that follow each instruction and the number of successors of an instruction to prioritize the instructions. Zhang et al. [18] presented a phase-coupled, priority-based heuristic scheduling algorithm, which converts the instruction scheduling problem into the problem of scheduling a set of instructions with a common deadline.
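The kind of priority described for Arafath et al. [17] can be illustrated with a generic list scheduler. The sketch below is an assumption-laden illustration, not their algorithm: it handles a single cluster only (no partitioning), assumes fully pipelined FUs and latencies of at least one cycle, and ranks ready instructions by the pair (cycles remaining on the longest path after the instruction, number of successors). All names (`priorities`, `list_schedule`, `succs`, `latency`) are hypothetical.

```python
def priorities(succs, latency):
    """Map each instruction to (cycles that follow it, number of successors)."""
    memo = {}

    def tail(n):
        # Longest chain of latencies among the instructions that depend on n.
        if n not in memo:
            memo[n] = max((latency[s] + tail(s) for s in succs[n]), default=0)
        return memo[n]

    return {n: (tail(n), len(succs[n])) for n in succs}


def list_schedule(succs, latency, num_fus):
    """Greedy list scheduling on one cluster; returns {instruction: start cycle}."""
    prio = priorities(succs, latency)
    preds = {n: set() for n in succs}
    for n, children in succs.items():
        for child in children:
            preds[child].add(n)

    start, finish = {}, {}
    cycle = 0
    while len(start) < len(succs):
        # An instruction is ready once all of its operands have been produced.
        ready = [n for n in succs
                 if n not in start
                 and all(p in finish and finish[p] <= cycle for p in preds[n])]
        ready.sort(key=lambda n: prio[n], reverse=True)
        for n in ready[:num_fus]:  # at most one issue per FU per cycle
            start[n] = cycle
            finish[n] = cycle + latency[n]
        cycle += 1
    return start


# Hypothetical dependence graph: a -> c, b -> c, b -> d, c -> e.
succs = {"a": ["c"], "b": ["c", "d"], "c": ["e"], "d": [], "e": []}
latency = {"a": 1, "b": 2, "c": 1, "d": 1, "e": 1}
print(list_schedule(succs, latency, num_fus=2))
```

Using the successor count as a tie-breaker favors instructions that unblock more work, which is why `b` is issued before `a` when only one FU is available.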

Huang et al. [9] introduced a Worst-Case-Execution-Time-aware Re-scheduling Register Allocation (WRRA) approach to minimize the Worst-Case Execution Time (WCET) of real-time embedded systems with clustered VLIW architectures.

Jiang et al. [19] proposed a multithreading technique based on a scheduling scheme for stream programs on clustered VLIW stream architectures, which aims at optimal arithmetic unit utilization without increasing energy consumption. Its principle is to exploit more kernel-level parallelism for further optimized compilation by constructing multiple homogeneous threads on stream programs.

However, these research efforts focus on the BCC VLIW architecture. Efforts toward optimization for the RFCC VLIW architecture are scarce.

Zhou et al. [20] presented a two-dimension force-directed (TDFD) scheduling algorithm for the RFCC VLIW architecture, which only balances the influences of data dependence relations and available resources on instruction scheduling. TDFD does not take into account how the limited number of access ports to the global register file affects instruction scheduling.

Tang et al. [21] presented a force-balanced-two-phase (FBTP) instruction scheduling algorithm for the RFCC VLIW architecture, which is based on careful arrangement of inter-cluster data transfers to balance the distribution of accesses to the global register file over the whole execution time. However, FBTP focuses on arranging the distribution of inter-cluster data transfers by balancing the influence of data dependence relations and resource availability, and does not directly take into account the influence of register pressure on instruction scheduling and cluster assignment.
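Since register pressure is the quantity that the earlier RFCC schedulers leave out, a small sketch of how it can be measured for a given schedule may help. This uses the generic notion of pressure as the number of simultaneously live values per cycle; it is not one of the two specific kinds of register pressure that the paper introduces, which are not defined in this excerpt. The function name and the liveness convention (a value is live from the cycle its result becomes available until the cycle its last consumer issues) are assumptions.

```python
from collections import Counter


def register_pressure(start, latency, succs):
    """Return {cycle: number of live values} for a fixed schedule.

    Assumed convention: the value produced by instruction n is live from
    start[n] + latency[n] through the issue cycle of its last consumer.
    A value without consumers is counted for the single cycle it is written.
    """
    pressure = Counter()
    for n, users in succs.items():
        born = start[n] + latency[n]
        dies = max((start[u] for u in users), default=born)
        for cycle in range(born, dies + 1):
            pressure[cycle] += 1
    return dict(pressure)


# Hypothetical schedule for the dependence graph a -> c, b -> c, b -> d, c -> e.
succs = {"a": ["c"], "b": ["c", "d"], "c": ["e"], "d": [], "e": []}
latency = {"a": 1, "b": 2, "c": 1, "d": 1, "e": 1}
start = {"a": 0, "b": 0, "c": 2, "d": 2, "e": 3}

pressure = register_pressure(start, latency, succs)
print(pressure, "peak:", max(pressure.values()))
```

A scheduler that is aware of register pressure could compare the peak of this profile against the size of the local register file when choosing between candidate cycles, which is the kind of feedback that a purely resource-driven scheduler does not have.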

3. PROBLEM ANALYSIS

3.1. Introduction of register file connected clustered VLIW architecture

In this work, we focus on the RFCC VLIW architecture. Figure 2 presents an example of a generic RFCC VLIW architecture. There are four clusters, and each cluster is composed of three FUs and a tightly connected local register file. Each FU has its own set of access ports to the local register file, so FUs can directly address data stored in the local register file. Each cluster has a set of access ports (either read or write) to the global register file, which means the FUs in one cluster must share those access ports. When an inter-cluster data

ON IMPROVING PERFORMANCE AND ENERGY EFFICIENCY

SECTION A: COMPUTER SCIENCE THEORY, METHODS AND TOOLS

THE COMPUTER JOURNAL, 2017
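The cluster organization described in Section 3.1 (four clusters, three FUs per cluster, and a shared set of global register file ports per cluster) can be modeled with a few small data types. The sketch below is illustrative only: the excerpt does not state how many global ports each cluster has, so `global_ports=1` is an assumption, and the names `Cluster`, `Op` and `port_conflicts` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cluster:
    num_fus: int = 3        # from the text: three FUs per cluster
    global_ports: int = 1   # assumption: the excerpt gives no port count


@dataclass(frozen=True)
class Op:
    cluster: int            # index of the cluster that executes the operation
    cycle: int              # issue cycle
    uses_global_rf: bool    # True if the operation accesses the global register file


def port_conflicts(ops, clusters):
    """Return sorted (cycle, cluster) pairs whose global-RF accesses exceed the ports."""
    demand = {}
    for op in ops:
        if op.uses_global_rf:
            key = (op.cycle, op.cluster)
            demand[key] = demand.get(key, 0) + 1
    return sorted(key for key, count in demand.items()
                  if count > clusters[key[1]].global_ports)


# Four identical clusters, as in the generic example from the text.
clusters = [Cluster() for _ in range(4)]
ops = [
    Op(cluster=0, cycle=0, uses_global_rf=True),
    Op(cluster=0, cycle=0, uses_global_rf=True),   # second access from cluster 0 in cycle 0
    Op(cluster=1, cycle=0, uses_global_rf=True),
    Op(cluster=0, cycle=1, uses_global_rf=False),  # local-only operation
]
print(port_conflicts(ops, clusters))
```

Because the FUs in one cluster share the global ports, two global accesses from the same cluster in the same cycle are flagged, while accesses from different clusters are not.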
