High-Performance Double-Precision Matrix Multiplication on FPGAs

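"FPGAs in high-performance double-precision matrix multiplication." This paper discusses the design of a high-performance double-precision floating-point matrix multiplier based on FPGAs (Field-Programmable Gate Arrays), optimized for high-end devices. Matrix multiplication is the core of many important blocked BLAS (Basic Linear Algebra Subprograms) algorithms and is well suited to acceleration. The paper proposes two designs (I and II). Both are based on a rank-1 update scheme, handle matrices of arbitrary size, and sustain peak performance except during an initial latency period. The key point of the designs is that they expose the trade-off between local memory and bandwidth in an FPGA implementation. An analysis of the design parameters gives a method for choosing the best design strategy. Both designs scale smoothly on a Virtex-5 SX240T FPGA: from 1 to 40 processing elements (PEs), the design frequency degrades by less than 1%.

Design I may focus on the parallel computing capability of the FPGA, processing several matrix elements in parallel to raise throughput, and may reduce latency by optimizing the data flow and memory access pattern. Design II may further improve resource utilization, for example through a more efficient data reuse strategy or an improved pipeline structure.

The paper discusses the challenges of implementing double-precision matrix multiplication on an FPGA, including guaranteeing precision, data alignment, memory bandwidth limits, and parallelization of the compute units. The authors may also analyze how matrix size affects performance and how the number and configuration of PEs can be adjusted for different workloads. To reach high performance, the design may use distributed storage structures such as BRAM (Block RAM) and distributed RAM to reduce data transfer latency. The performance evaluation may compare Designs I and II in terms of power, latency, and throughput, and may give actual runs that validate the designs.

The paper offers useful insight into high-performance computing on FPGAs, in particular double-precision matrix multiplication in scientific and engineering applications. It shows the potential of FPGAs for accelerating key computational tasks and provides a reference framework for developing more complex and more efficient algorithms on FPGAs.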

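The rank-1 update scheme mentioned in the summary can be illustrated in software. The following is a minimal sketch, not the paper's hardware design: it forms C = A·B as a sum of outer products, one column of A times one row of B per step. The function name `matmul_rank1` and the list-of-lists matrix layout are assumptions made for illustration only.

```python
def matmul_rank1(A, B):
    """Compute C = A * B as a sum of rank-1 (outer-product) updates.

    A is m x k and B is k x n, both given as lists of rows.
    Illustrative software sketch; it does not model the paper's PEs.
    """
    m, k = len(A), len(A[0])
    if len(B) != k:
        raise ValueError("inner dimensions must agree")
    n = len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for p in range(k):
        # Step p uses column p of A and row p of B.
        col = [A[i][p] for i in range(m)]
        row = B[p]
        # Rank-1 update: C += col (outer product) row.
        for i in range(m):
            a = col[i]
            for j in range(n):
                C[i][j] += a * row[j]
    return C


if __name__ == "__main__":
    print(matmul_rank1([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each step reads only one column of A and one row of B while updating every element of C, which is why the amount of C held locally and the input bandwidth can be traded against each other.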
Int J Parallel Prog (2010) 38:322–338 325

2.2 FPGAs

For accelerator designs to be more than an academic exercise, the following considerations are important. The design time should be low and allow for extensive testing. The design should be modular and scale with available resources. For integration within an existing system, form-factor limitations should be considered, as should power and memory. Most HPC systems are based on InfiniBand-like interconnects between nodes, with nodes using PCI-e for communication with peripherals. The PCI-e connects via the southbridge to the host memory, sharing bandwidth with the host processor. DMA is used to transfer data efficiently, so accelerators should be compatible with DMA and burst transfer. The overall system should also be oblivious to the presence of the accelerator, requiring only minimal modifications to accommodate it. These aspects make FPGA-based designs very attractive. Their form factors allow multiple FPGAs to fit on existing boards and communicate via PCI-e. The cost of being reconfigurable is that FPGAs cannot run at clocks as high as modern general-purpose processors. However, designs can exploit their very low power consumption and high parallelisability.

FPGAs have a reconfigurable fabric consisting of flip-flops and look-up tables (LUTs) grouped into Configurable Logic Blocks (CLBs). FPGAs differ in the arrangement within the CLB and in the interconnects between CLBs. Fixed-function logic blocks, such as multipliers, and embedded block RAM are 'systematically' interspersed among these. Designs should not be complex from the point of view of routing between elements. FPGAs have limited resources for long routes, and using them poorly can adversely affect the maximum achievable clock frequency. In this work, care has been taken to minimise communication between PEs, keeping the routing complexity low.

The targeted device is from the Xilinx Virtex-5 family [11], which is based on a 65 nm process. It provides four 6-input LUTs, four flip-flops, multiplexers and carry chains within a slice, with two slices making a CLB. For context, we briefly introduce the key FPGA primitives used in this design: the Block RAM (BRAM), the FIFO, and the DSP48 slice based multipliers and adders. These hard primitives embedded in the Virtex-5 fabric can each be clocked at more than 500 MHz while operating at relatively low power.

Block RAM

The BRAMs are 36-bit-wide, 1 K-deep true dual-port SRAMs; true dual-port means that both ports can read and write independently. They can be used in a variety of width-depth configurations and cascaded if required. Two adjacent BRAMs can be treated as a 64-bit-wide memory with no additional user logic. They can also be configured as FIFOs, with the relevant flags available for use.
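The BRAM figures above allow a rough storage estimate for double-precision data. The sketch below assumes that a pair of adjacent BRAMs used as one 64-bit-wide memory keeps the 1 K depth, so that each pair holds 1024 double-precision words; that reading is an assumption, as is the helper name `brams_for_doubles`.

```python
WORDS_PER_PAIR = 1024  # assumed: a 64-bit-wide BRAM pair stays 1 K deep
BRAMS_PER_PAIR = 2     # from the text: two adjacent BRAMs form one 64-bit memory


def brams_for_doubles(n_values):
    """Estimate how many BRAMs are needed to hold n_values 64-bit words."""
    if n_values < 0:
        raise ValueError("n_values must be non-negative")
    pairs = -(-n_values // WORDS_PER_PAIR)  # ceiling division
    return pairs * BRAMS_PER_PAIR


if __name__ == "__main__":
    # Hypothetical example: a 64 x 64 tile of doubles (4096 words).
    print(brams_for_doubles(64 * 64))
```

Under these assumptions a hypothetical 64 × 64 tile of doubles would occupy 8 BRAMs. The tile size is chosen only for illustration and is not taken from the paper.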

DSP48 Slices

DSP48E blocks consist of cascadable 25 × 18 bit multipliers and a 48-bit adder/subtractor/accumulator. They also allow functions such as shifting and comparison to be implemented. Their ability to be cascaded allows for floating-point implementations.
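To see why cascading matters for floating point, consider the 53-bit significand product of a double-precision multiply, which is wider than a single 25 × 18 bit multiplier can handle. The sketch below is a software illustration of naive tiling, not the paper's or any vendor's floating-point core. It assumes the multiplier takes signed operands, leaving 24 and 17 unsigned bits per operand; that assumption and the function names are not from the text.

```python
def split_limbs(value, limb_bits, total_bits):
    """Split a non-negative integer into little-endian limbs of limb_bits each."""
    count = -(-total_bits // limb_bits)  # ceiling division
    mask = (1 << limb_bits) - 1
    return [(value >> (i * limb_bits)) & mask for i in range(count)]


def tiled_multiply(x, y, total_bits=53, a_bits=24, b_bits=17):
    """Multiply two total_bits-wide unsigned integers from small partial products.

    Returns (product, tiles), where tiles counts the a_bits x b_bits
    multiplications used by this naive scheme.
    """
    if not (0 <= x < (1 << total_bits) and 0 <= y < (1 << total_bits)):
        raise ValueError("operands must fit in total_bits")
    product = 0
    tiles = 0
    for i, xi in enumerate(split_limbs(x, a_bits, total_bits)):
        for j, yj in enumerate(split_limbs(y, b_bits, total_bits)):
            # Each partial product fits one small multiplier; the shift and
            # add correspond to what a cascaded adder chain would accumulate.
            product += (xi * yj) << (i * a_bits + j * b_bits)
            tiles += 1
    return product, tiles


if __name__ == "__main__":
    x = (1 << 53) - 1
    y = (1 << 52) + 12345
    p, tiles = tiled_multiply(x, y)
    print(p == x * y, tiles)
```

With these widths the naive tiling uses 3 × 4 = 12 partial products. Practical cores may use fewer by exploiting the operand structure, so the count is an upper bound for this scheme only.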
