Dynamic Code Generation for Energy Optimization




HAL Id: tel-01285964
https://theses.hal.science/tel-01285964
Fernando Akira Endo
Submitted on 10 March 2016

HAL is a multidisciplinary open-access archive for the deposit and dissemination of research-level scientific documents published by research institutions in France or abroad, and by public or private laboratories.

Dynamic Code Generation for Energy Optimization

To cite this version:
Fernando Akira Endo. Dynamic Code Generation for Energy Optimization. Hardware Architecture [cs.AR]. Université Grenoble Alpes, 2015. French. NNT: 2015GREAM044. tel-01285964.

DOCTORAL THESIS
Submitted for the degree of Doctor of the Université de Grenoble
Specialty: Computer Science
Ministerial decree: 7 August 2006
Presented by Fernando Akira ENDO
Thesis supervised by Henri-Pierre CHARLES
Prepared at the CEA Grenoble laboratory and the Doctoral School of Mathematics, Sciences and Information Technologies

Dynamic Code Generation for Energy Optimization

Thesis publicly defended on 18 September 2015, before a jury composed of:
Mr. Frédéric PÉTROT, Professor, Grenoble Institute of Technology, President
Mr. Florent DE DINECHIN, Professor, INSA de Lyon, Reviewer
Mr. Paul KELLY, Professor, Imperial College London, Reviewer
Mrs. Karine HEYDEMANN, Associate Professor, Université Pierre et Marie Curie, Examiner
Mr. Henri-Pierre CHARLES, Research Director, CEA LIST, Supervisor
Mr. Damien COUROUSSÉ, Research Engineer, CEA LIST, Grenoble, Co-supervisor

I dedicate this thesis to my family, who have always supported me.

Acknowledgments

I would like to thank everyone in the laboratory, who helped me a great deal during the three years of my PhD. I thank CEA for funding this thesis, and the members of the jury, in particular the reviewers, Dr. Florent de Dinechin and Dr. Paul Kelly, who provided many suggestions and corrections. I am especially grateful to my friends Fayçal Benaziz and Thibault Cattelani, who helped me proofread and correct the French version of the thesis abstract, and to Alexandre Aminot, Ivan Llopard, Laurentiu Trifan, Thierno Barry, Tiana Rakotovao and Victor Lomüller, who proofread and corrected my papers, which were also integrated into this thesis. I also thank UNICAMP, the BRAFITEC program and INSA de Lyon: without them I would not have had the opportunity to study in France on an exchange and obtain a French degree, which made it easier for me to apply for a PhD. Finally, I thank Phi Innovations, because much of what I learned at that company, together with the BeagleBoard-xM I received, was very helpful for the development of my technical and scientific work.

Contents

Part I Thesis
1 Introduction
1.1 Thesis contribution
1.1.1 Run-time code generation and auto-tuning for embedded systems
1.1.2 Micro-architectural simulation of ARM cores
1.2 Thesis organization
2 State of the art
2.1 Sources of energy consumption in ICs
2.1.1 Static or leakage power
2.1.2 Dynamic power
2.2 Energy reduction techniques integrated into compilers
2.2.1 Energy reduction in software
2.2.2 Compiler techniques
2.3 The ARM architecture
2.4 Embedded processor simulation
2.4.1 Abstraction levels
2.4.2 Micro-architectural performance simulation
2.4.3 Micro-architectural energy simulation
2.5 Run-time code optimizations
2.5.1 Run-time code specialization
2.5.2 Dynamic binary optimizations
2.5.3 Run-time recompilation
2.5.4 Online auto-tuning
2.6 Conclusion
3 Micro-architectural simulation of ARM processors
3.1 gem5
3.1.1 The arm_detailed configuration
3.1.2 Modeling improvements
3.1.3 In-order model based on the O3 CPU model
3.2 McPAT
3.2.1 Overview
3.2.2 Better modeling core heterogeneity
3.3 Parameters and statistics conversion from gem5 to McPAT
3.4 Performance validation
3.4.1 Reference models
3.4.2 Simulation models
3.4.3 Benchmarks
3.4.4 Accuracy evaluation of the Cortex-A models
3.4.5 In-order model behavior and improvement for a Cortex-A8
3.5 Area and relative energy/performance validation
3.5.1 Reference models
3.5.2 Simulation models
3.5.3 Benchmarks
3.5.4 Area validation
3.5.5 Relative energy/performance validation
3.6 Example of architectural/micro-architectural exploration
3.7 Scope and limitations
3.7.1 Scope
3.7.2 Limitations
3.8 Conclusion
4 Run-time code generation
4.1 deGoal: a tool to embed dynamic code generators into applications
4.1.1 Utilization workflow
4.1.2 Example of kernel implementation: C with and without SIMD intrinsics and deGoal versions
4.1.3 The Begin and End commands
4.1.4 Register allocation
4.1.5 Code generation decisions: deGoal mixed to C code
4.1.6 Branches and loops
4.2 Thesis contribution: New features and porting to ARM processors
4.2.1 Overview of contributions
4.2.2 SISD and SIMD code generation
4.2.3 Configurable instruction scheduler
4.2.4 Static and dynamic configuration
4.2.5 Further improvements and discussion
4.3 Performance analysis
4.3.1 Evaluation boards
4.3.2 Benchmarks and deGoal kernels
4.3.3 Raw performance evaluation
4.3.4 Transparent vectorization: SISD vs SIMD code generation
4.3.5 Dynamic code specialization
4.3.6 Run-time auto-tuning possibilities with deGoal
4.4 Scope and limitations
4.4.1 Scope
4.4.2 Limitations
4.5 Conclusion
5 Online auto-tuning for embedded systems
5.1 Motivational example
5.2 Methodology
5.2.1 Auto-tuning with deGoal
5.2.2 Regeneration decision and space exploration
5.2.3 Kernel evaluation and replacement
5.3 Experimental setup
5.3.1 Hardware platforms
5.3.2 Simulation platform
5.3.3 Benchmarks
5.3.4 Evaluation methodology
5.4 Experimental results
5.4.1 Real platforms
5.4.2 Simulated cores
5.4.3 Analysis with varying workload
5.4.4 Analysis of correlation between auto-tuning parameters and pipeline designs
5.5 Scope, limitations and future work
5.5.1 Scope
5.5.2 Limitations
5.5.3 Future work
5.6 Conclusion
6 Conclusion and prospects
6.1 Achievements
6.1.1 Embedded core simulation with gem5 and McPAT
6.1.2 Run-time code generation and auto-tuning for embedded systems
6.1.3 Summary of achievements
6.1.4 Amount of work
6.2 Prospects
Part II Appendix
A gem5 to McPAT conversion tables
List of figures
List of tables
Bibliography
Personal bibliography
Glossary
Résumé étendu

Part I
Thesis

Chapter 1
Introduction

Over the past decade, energy consumption in high-performance processors has been limiting the performance growth expected from transistor scaling. For three decades, reducing the size of transistors also reduced their energy consumption, resulting in constant power densities and exponential growth of performance per watt. Transistor scaling under these conditions is known as Dennard scaling [53]. Today, even though transistors are still becoming smaller in new generations of integrated circuits (ICs), their energy consumption has almost stopped scaling down. With an almost constant energy consumption per transistor, the increasing number of transistors results in exponential growth of the total power dissipation of a chip. Figure 1.1 illustrates this phenomenon. Under such conditions, processor designers must limit the power budget of chips to avoid excessive dissipation. Because of its consequences for the performance of computing systems, this power restriction is known as the power wall. As a consequence, a decade ago, the "free" performance growth obtained by increasing the CPU clock frequency reached a limit, and processor designers had to switch from single-core to (homogeneous) multi-core designs.

Figure 1.1: Trends of transistor technologies and impact on the total power dissipation of chips. Data from Dreslinski et al. [56] and HiPEAC Vision 2015 [59]. (Horizontal axis: technology node in nm and year, from 250 nm (1997) through 180, 130, 90, 65, 45 and 32 nm to 22 nm (2012). Left axis: normalized Vdd, transistor energy and chip power. Right axis: number of transistors. Series: normalized Vdd, normalized energy of transistors, number of transistors, normalized chip power (unconstrained).)

Today, homogeneous multi-cores in server-class processors are facing another issue: dark silicon [102]. Because of high power densities and thermal problems, only a fraction of the transistors in an IC can be powered simultaneously, and this fraction becomes smaller with each new generation. Figure 1.2 illustrates this problem. Dark silicon is expected to dominate in CPU-like ICs between 2016 and 2021 [65].

Figure 1.2: Dark silicon: the fraction of transistors that can be powered simultaneously reduces as transistor technologies advance. From The HiPEAC Vision for Advanced Computing in Horizon 2020 [58].

Given that relying on transistor scaling alone will not sustain the performance improvement of computing systems, new architectural and micro-architectural designs are needed to avoid performance stagnation in the coming decades, before new transistor technologies become mainstream [82]. Studies suggest that heterogeneous multi- or manycores coupled to accelerators are one of the solutions to keep the performance growth expected from transistor scaling [32]. Traditionally, processor designers invested the maximum number of transistors in 90 % of the workload and the remainder in special cases. This approach is called 90/10 optimization. From now on, with dark silicon, hardware specialization is needed.
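Instead of the 90/10 approach, it is more attractive to invest 10 % of the transistors to accelerate 10 % of the cases, another 10 % of the transistors for another 10 % of the cases, and so on, because the improved energy efficiency of specialized hardware increases the total throughput of the system. This approach is called 10×10 [46]. By 2020, processors will likely have hundreds to thousands of heterogeneous cores, possibly specialized for different tasks [73].

High-performance embedded cores already show a considerable degree of heterogeneity. For example, the ARM architecture defines small in-order cores such as the Cortex-A5 and A7, medium-sized cores such as the A8, A9, A12 and A17, and large out-of-order cores such as the A15, not to mention 64-bit counterparts such as the A53, A57 and A72. These base designs can be synthesized with different transistor technologies and with resources of different sizes and types (pipeline buffers, cache levels and sizes, etc.), and fully custom core implementations also exist. Because high-performance embedded processors are also subject to the power wall, and dark silicon may soon appear and profoundly change embedded architectures, we can expect the design of heterogeneous multi/manycores to become increasingly complex.

Figure 1.3: The growing gap over time between the performance obtained by software and the maximum achievable performance in computing systems. (Timeline: pipelined single-core, single-core, homogeneous multi-core, heterogeneous multi-core, heterogeneous manycore + accelerators. Milestones: Dennard scaling, power wall, dark silicon. Curves: maximum achievable performance and performance obtained by software, with an increasing gap between them.)

Recently, ARM released a heterogeneous multi-core system (the big.LITTLE design) in which an application can switch between two ISA-compatible cores with different power and performance trade-offs [71]. While it improves energy efficiency when performance demand is low, core heterogeneity inside an SoC also brings new performance optimization challenges. For example, if an application is compiled and optimized for a target core A, then when it is scheduled to run on core B, its performance may be worse than if it had been optimized for core B, because of differences in the pipeline implementations.

In the era of heterogeneous multi/manycores, the gap between the performance that software extracts from the hardware and the maximum achievable performance will increase, and in the long term run-time approaches may be the only way to improve energy efficiency [32]. Figure 1.3 illustrates this growing performance gap. In such complex systems, programming and performance portability become even more challenging, because the different cores of a processor have different ISAs and accelerators.

Static auto-tuning has been used to improve performance portability and to extract near-optimal hardware performance, comparable to that of hand-tuned code. By exploring the space of algorithm implementations, this approach has been applied successfully in linear algebra and signal processing to cope with the architectural complexity of modern processors [73]. However, whenever the target core or the running conditions change, the code should ideally be auto-tuned again for the new environment, because statically auto-tuned code usually has poor performance portability when migrated between different micro-architectures [3]. When the execution environment is not fixed at compile time, online auto-tuning is a …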
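The auto-tuning idea described above, exploring a space of implementations and keeping the one that runs fastest on the current machine, can be sketched in a few lines. This is only an illustrative sketch and not the method or tooling of the thesis (which relies on deGoal and machine-code generation): the names `make_kernel` and `autotune`, the dot-product kernel, and the block size as the tuned parameter are all assumptions introduced here.

```python
import time


def make_kernel(block):
    """Build one variant of a dot product that processes `block` elements per step.

    `block` is a hypothetical tuning parameter chosen for illustration only.
    """
    def kernel(a, b):
        total = 0
        for start in range(0, len(a), block):
            chunk_a = a[start:start + block]
            chunk_b = b[start:start + block]
            total += sum(x * y for x, y in zip(chunk_a, chunk_b))
        return total
    return kernel


def time_once(kernel, args):
    """Return the wall-clock time of a single kernel call, in seconds."""
    start = time.perf_counter()
    kernel(*args)
    return time.perf_counter() - start


def autotune(variants, args, repeats=3):
    """Measure every variant on `args` and return (best_parameter, best_time).

    The best time of `repeats` runs is used to reduce measurement noise.
    """
    best_param, best_time = None, float("inf")
    for param, kernel in variants.items():
        elapsed = min(time_once(kernel, args) for _ in range(repeats))
        if elapsed < best_time:
            best_param, best_time = param, elapsed
    return best_param, best_time


# Integer inputs keep the result exact, so all variants can be compared for equality.
a = list(range(10_000))
b = list(range(10_000))

# The explored space: one kernel per candidate block size.
variants = {block: make_kernel(block) for block in (1, 8, 64, 512)}

best_block, best_time = autotune(variants, (a, b))
print(f"fastest block size on this machine: {best_block} ({best_time:.6f} s)")
```

The selected block size depends on the machine and on the load at measurement time, which is the point of the passage: a choice tuned in one environment is not guaranteed to remain the best one after the code moves to a different core.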
