高性能浮点计算可重构电路的研究-2011年里昂高师博士论文

67 浏览量更新于2023-12-03 收藏 14.49MB PDF 举报

身份认证购VIP最低享 7 折!

30元优惠券

0HAL Id: tel-006541210https://theses.hal.science/tel-00654121v20提交日期：2012年4月26日0HAL是一个多学科开放获取档案，用于存储和传播科学研究文档，无论其是否发表。这些文档可以来自法国或国外的教育和研究机构，也可以来自公共或私人研究中心。0HAL多学科开放获取档案，旨在存储和传播法国或国外的教育和研究机构、公共或私人研究中心的研究级科学文献，无论是否发表。0可重构电路上的高性能浮点计算0Bogdan Mihai Pasca0引用此版本：0Bogdan Mihai Pasca. 可重构电路上的高性能浮点计算. 其他 [cs.OH]. 法国里昂高等师范学校 - ENS LYON,2011. 英语. �NNT : 2011ENSL0656�. �tel-00654121v2�0序号：6560图书馆分配的编号：2011ENSL06560里昂高等师范学校并行计算实验室0博士论文0于2011年9月21日公开并进行答辩，作者：Bogdan P ASCA0为了获得学位0里昂高等师范学校博士0专业：计算机科学0作为里昂数学和基础计算机学院的博士学位0可重构电路上的高性能浮点计算0导师：Florent DE D INECHIN0经过意见：Paolo I ENNE Olivier S ENTIEYS0审查委员会成员：0Octavian C RET成员 Florent DE DINECHIN成员 Paul F EAUTRIER成员 Paolo IENNE成员/评审员 Martin L ANGHAMMER成员Olivier S ENTIEYS成员/评审员0致谢0首先，我要感谢我的家人对我在整个论文期间的无价和无条件的支持。他们让我成为今天的我，我将永远感激。接下来，我要亲切地感谢我的女朋友Mioara对我的爱和支持，这帮助我顺利地克服了所有遇到的挑战。感谢你在我们无尽的截止日期下在办公室度过的深夜时光中的支持和鼓励。我还要感谢我们ENS的罗马尼亚社区，我们一起度过的所有美好时光。关于各种主题的无休止辩论既有培养作用，也帮助我显著提高了我的论证能力。长时间而具有挑战性的骑行、健身、攀岩、跑步...让我总是能够更进一步，我也试图将这种精神应用到研究中。我特别要感谢我的导师Florent对我的潜力的信任，并在整个论文期间高效地利用我的技能。他的指导非常出色，总是提出有趣的课题供我研究，同时也给予我足够的自由来解决我自己的研究课题。我还要感谢我的论文评审人，他们的建设性评论帮助我进一步改进了这篇论文，以及答辩期间评委们的中肯而具有挑战性的问题。非常感谢CompSys的成员Alexandru和Christophe，他们围绕着漫长的咖啡/茶歇进行的有趣讨论。这些讨论不仅扩宽了我的研究兴趣，还在论文的最后一年产生了一篇研究文章。最后但同样重要的是，我要感谢Arenaire团队成员，感谢他们的善良，并且总是为我敞开大门。我还要感谢我们的团队助理Severine和Damien，在简化有时令人不知所措的文书工作方面给予我的所有帮助。非常感谢我的朋友和其他没有在这里提到的每个人对我帮助和支持。Contents1Introduction12Field Programmable Gate Arrays52.1Architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .62.1.1Logic elements. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .62.1.2DSP blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .92.1.3Block memory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .122.2FPGA design ﬂow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .132.3Application markets. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .153Floating-point arithmetic173.1Generalities. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .173.1.1Representation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .173.1.2Rounding. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .203.1.3Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .213.2Floating-point arithmetic on FPGAs. . . . . . . . . . . . . . . . . . . . . . . . . . . .224Custom arithmetic data-path design254.1Arithmetic operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .264.1.1FPGA-speciﬁc arithmetic operator design . . . . . . . . . . . . . . . . . . . . .264.1.2From libraries to generators . . . . . . . . . . . . . . . . . . . . . . . . . . . . .274.2Design choices for FloPoCo. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .284.3A motivating example. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .284.4.1Operators . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .314.4.3Synchronization mechanisms . . . . . . . . . . . . . . . . . . . . . . . . . . . .334.4.5Sub-cycle accurate data-path design . . . . . . . . . . . . . . . . . . . . . . . .354.4.7The Target class hierarchy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .364.4.9Test-bench generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .384.5Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .4104.4 FloPoCo框架 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3104.4.2 自动流水线管理 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3204.4.4 管理子组件 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3404.4.6 基于频率的自动流水线 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3604.4.8 最终结果 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 3704.4.10 框架扩展 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405Binary addition in FloPoCo435.2Design-space exploration by resource estimation . . . . . . . . . . . . . . . . . . . . .455.3.1Classical RCA pipelining. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .475.3.2Resource estimation techniques . . . . . . . . . . . . . . . . . . . . . . . . . . .475.3.3Alternative RCA pipelining . . . . . . . . . . . . . . . . . . . . . . . . . . . . .485.3.4Area-complexity of the pipelined designs . . . . . . . . . . . . . . . . . . . . .495.4Short-latency addition architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . .495.4.1Classic carry-select adder . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .505.4.2Acceleration of inter-block carries. . . . . . . . . . . . . . . . . . . . . . . . .505.4.3The Add-Add-Multiplex (AAM) carry-select architecture . . . . . . . . . . . .525.4.4The Compare-Add-Increment (CAI) carry-increment architecture . . . . . . .535.4.5The Compare-Compare-Add (CCA) carry-select architecture . . . . . . . . . .545.4.6Block-splitting strategies . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .545.4.7Area complexity of the designs . . . . . . . . . . . . . . . . . . . . . . . . . . .575.5Global inference of shift-registers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .585.6Reality check . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .595.6.1Estimation formulas . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .595.6.2Synthesis results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .595.7Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .606Large multipliers with fewer DSP blocks636.1Large multipliers using DSP blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . .636.2Visual representation of multipliers . . . . . . . . . . . . . . . . . . . . . . . . . . . . .646.3Karatsuba-Ofman algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .656.3.2Implementation issues on Virtex-4 . . . . . . . . . . . . . . . . . . . . . . . . .666.3.44-part splitting. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .686.3.6Issues with the most recent devices. . . . . . . . . . . . . . . . . . . . . . . .716.4.1Design decisions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .736.4.3Reality check. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .746.5.1Squarers on Virtex-4 and Stratix-II . . . . . . . . . . . . . . . . . . . . . . . . .766.5.3Non-standard tilings on Virtex-5/6 . . . . . . . . . . . . . . . . . . . . . . . . .776.6.1Faithfully accurate multipliers. . . . . . . . . . . . . . . . . . . . . . . . . . .786.6.2FPGA ﬁtting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .796.6.3Architecture generation algorithm . . . . . . . . . . . . . . . . . . . . . . . . .796.7Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .81vi0vi 目录05.1 相关工作 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4305.3 在FPGA上的流水线加法 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 4506.3.1 二部分分裂 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6506.3.3 三部分分裂 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6606.3.5 N部分分裂 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 6906.4 非标准平铺 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7206.4.2 算法 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7406.5 平方器 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7506.5.2 Stratix-III和Stratix-IV上的平方器 . . . . . . . . . . . . . . . . . . . . . . . . 7606.6 截断乘法器 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77Contentsvii7Polynomial-based architectures for function evaluation837.1Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .847.2Function evaluation by polynomial approximation . . . . . . . . . . . . . . . . . . . .857.2.1Range reduction. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .867.2.2Polynomial approximation. . . . . . . . . . . . . . . . . . . . . . . . . . . . .867.2.3Polynomial evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .887.2.4Accuracy and error analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . .897.2.5Parameter space exploration for the FPGA target. . . . . . . . . . . . . . . .907.3Reality check . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .927.3.1Optimization effect . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .927.3.2Examples and comparisons . . . . . . . . . . . . . . . . . . . . . . . . . . . . .927.4Conclusion, open issues and future work. . . . . . . . . . . . . . . . . . . . . . . . .948Multiplicative square root algorithms978.1.1Notations and terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . .988.2Square root by polynomial approximation . . . . . . . . . . . . . . . . . . . . . . . . . 1008.4Conclusion and future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1049.1Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1079.2.1Algorithm overview . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1099.2.3Computation of eY . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1119.3.2Overall error analysis. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1139.3.4Polynomial approximation for large precisions . . . . . . . . . . . . . . . . . . 1149.4Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1169.4.1Synthesis results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1169.4.2Comparison with other works. . . . . . . . . . . . . . . . . . . . . . . . . . . 1169.4.3Comparison with microprocessors . . . . . . . . . . . . . . . . . . . . . . . . . 1179.5Conclusion and future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11710 Floating-point accumulation and sum-of-products11910.1 A fast and accurate accumulator. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12010.1.1 Overall architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12010.1.2 Parameterisation of the accumulator . . . . . . . . . . . . . . . . . . . . . . . . 12110.1.3 Fast accumulator design using partial carry-save . . . . . . . . . . . . . . . . . 12210.1.4 Post-normalisation unit, or not . . . . . . . . . . . . . . . . . . . . . . . . . . . 12310.1.5 Synthesis results. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12310.2 Application-speciﬁc accumulator design . . . . . . . . . . . . . . . . . . . . . . . . . . 12410.2.1 A performance vs. accuracy tradeoff . . . . . . . . . . . . . . . . . . . . . . . . 12410.2.2 A case study . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12610.2.3 Accuracy measurements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 126vii08.1 浮点平方根算法 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9708.1.2 正确舍入的代价 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9908.3 结果、比较和一些手工调整 . . . . . . . . . . . . . . . . . . . . . . . 10309 浮点指数 10709.2 算法和架构 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10809.2.2 范围缩减 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10909.3 实现问题 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112 9.3.1 常数乘法 . . . . . . . . . . . . . . . . . . .. . . . . . . . . . . . 11209.3.3 单精度的案例研究 . . . . . . . . . . . . . . . . . . . . . . . . . 11409.3.5 参数选择 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115viiiContents10.3 Accurate Sum-of-Products . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12710.4 Comparison with related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12810.5 Conclusion and future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12911 High-level synthesis of perfect loop nests13111.1 Computational data-path generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13211.2 Efﬁcient hardware generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13211.2.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13311.2.2 Working examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13511.2.3 Parallelization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13911.2.4 One dimensional Jacobi stencil computation . . . . . . . . . . . . . . . . . . . 14111.2.5 Lessons . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14211.2.6 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14311.3 Computing kernel accuracy and performance . . . . . . . . . . . . . . . . . . . . . . . 14611.4 Reality check . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14911.5 Conclusion and future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15112.1 The Table Maker’s Dilemma . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15312.2 Proposed algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15412.2.2 Error analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15612.2.3 An example: the exponential function . . . . . . . . . . . . . . . . . . . . . . . 15712.3 Our design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15712.3.1 Functional model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15812.3.2 Bandwidth requirement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16212.3.3 Performance estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16212.3.4 Reality Check . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16312.3.5 FloPoCo impact . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16412.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16513 Conclusions and Perspectives167viii011.3.1 矩阵乘法 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 146 11.3.2 一维Jacobi模板计算 . . . . . . . . . . .. . . . . . . . . . . . . . . . 148 11.3.3 教训 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149012 使用FloPoCo解决表格制造者的困境 153012.2.1 列出的差异方法 . . . . . . . . . . . . . . . . . . . . . . . . . 154List of Figures2.1Ve

下载后可阅读完整内容，剩余1页未读，立即下载