Chapter 1
Introduction
The classical version of Moore’s Law predicts that the capacity of Integrated Circuits (ICs)
doubles roughly every 18 months. Microprocessor manufacturers followed this law by reducing
operating voltages and using smaller, faster transistors. Frequency scaling eventually reached the
point where circuits emitted more heat than could reasonably be dissipated – the so-called power
wall. This led Intel, the leading microprocessor manufacturer, to publicly announce in 2004 that it
would dedicate all its future design efforts to multi-core environments. Nowadays, Intel offers an
8-core version of its high-end Xeon processor (V8), while AMD’s Opteron is available in a 12-core
version, both built with a 45 nm manufacturing process.
Simply doubling the number of cores in a die does not guarantee a twofold speedup over the
initial microprocessor for a given application. Indeed, Amdahl’s law [34] states that the maximum
overall improvement attainable on a system with N processors is strongly limited by the fraction of
the program that executes sequentially, as well as by the degree of parallelism of the parallel sec-
tions. Most existing software, developed during the single-core era, is essentially sequential and
therefore does not benefit from a multicore system. One idea, dating back to the 1960s, is to write
compilers that automatically parallelize these sequential programs. The success of such approaches
seems to be inversely proportional to the number of targeted cores. One reason for this lack of
success is that the sequential solution these tools start from has already lost some of the “parallel
semantics” of the problem to be solved. Consequently, making efficient use of multiple cores
requires recovering some of this lost parallelism, which means recoding parts of the application
using the thread programming model or one of the well-known APIs supporting inter-process
communication: MPI, PVM, or OpenMP. Another reason for the poor performance of these
parallelized programs is that inter-process communication in a multicore system, usually resolved
by shared-memory techniques, is very costly. In any case, the success of this approach depends on
the data-level parallelism of the initial application.
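Amdahl’s bound can be illustrated with a short numerical sketch. Assuming a fixed serial fraction s of the work, the speedup on N processors is at most 1/(s + (1 − s)/N); the function below (an illustrative helper, not part of any cited work) evaluates this bound:

```python
def amdahl_speedup(serial_fraction: float, n_cores: int) -> float:
    """Upper bound on speedup with n_cores processors when a fixed
    fraction of the work (serial_fraction) cannot be parallelized."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / n_cores)

# Even a small serial fraction caps the achievable speedup: with 10%
# serial work, 8 cores yield at most ~4.71x, and the limit as n_cores
# grows is 1 / 0.10 = 10x, no matter how many cores are added.
for n in (2, 4, 8, 1000):
    print(n, round(amdahl_speedup(0.10, n), 2))
```

This makes the point in the text concrete: doubling the core count from 4 to 8 improves the bound only from about 3.08x to 4.71x once a tenth of the program is sequential.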
One success story is computer graphics. Graphics processing is an application domain with
massively parallel computational kernels: entire animation scenes, and also parts of each frame,
can be processed in parallel. Traditionally, Graphics Processing Units (GPUs) consisted of numer-
ous but rather simple Processing Elements (PEs) capable of processing the many graphics-
related tasks in a flow-like manner. In 2001, with the introduction of the first programmable GPU
(the NV20 series), programmers could execute custom visual-effects programs using the Shader
Language 1.1. In 2007, nVIDIA formalized the GPU’s computing capabilities under the name
Compute Unified Device Architecture (CUDA): the parallel computing architecture present in
nVIDIA GPUs. General-purpose computations can be expressed using C for CUDA, a C sub-
set with nVIDIA extensions. As the PEs of modern GPUs support some of the basic floating-
point operators, it is tempting to use them to perform massively parallel scientific computations.