High-Performance FP Divider: Based on the Goldschmidt Algorithm with Shared Multipliers


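This paper presents a high-performance floating-point (FP) divider design based on the Goldschmidt algorithm, aimed at the problem that division is complex and slow to compute. The design uses bipartite reciprocal look-up tables to obtain the initial value of the iteration at low area cost, and parallel multipliers to reduce latency. A state machine controls pipelined execution to raise throughput. The scheme is implemented in a digital signal processor (DSP) chip, where it shares the existing multipliers to make better use of resources.

Demand for floating-point division (FP division) keeps growing in high-performance computing and modern applications, but its complex computation and long latency make it a performance bottleneck. To address this, the paper proposes an FP divider design based on the Goldschmidt algorithm, a fast iterative division method that turns division into a sequence of multiplications.

A key element of the design is the use of bipartite reciprocal look-up tables to obtain the initial value of the iteration. Storing precomputed partial results in tables saves hardware, since it reduces the logic needed during computation and therefore the area cost.

To reduce latency further, the design introduces parallel multipliers. In each iteration these multipliers perform several multiplications at the same time, which shortens the total execution time. This parallelism matters most in applications that need high throughput.

To support pipelined execution, the design also includes a state-machine controller. The state machine coordinates the operations of each stage and keeps data flowing without stalls, which raises throughput. As a result, the FP divider can process successive division requests back to back without affecting other computing tasks.

Finally, the design is implemented on a DSP chip and shares the multipliers already present on the chip, which avoids duplicated hardware and saves cost. This meets the needs of high-performance computing while using resources efficiently, making the design well suited to embedded systems and high-performance computing platforms. In summary, the paper's high-performance FP divider with shared multipliers, based on the Goldschmidt algorithm, addresses the complexity and latency of floating-point division through algorithmic optimization, parallel computation, and an efficient control mechanism, and offers an efficient hardware solution for fields that perform floating-point division frequently.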

Chinese Journal of Electronics

Vol.26, No.2, Mar. 2017

High-Performance FP Divider with Sharing Multipliers Based on Goldschmidt Algorithm∗

HE Tingting, CHEN Jiyang, LEI Yuanwu, PENG Yuanxi and ZHU Baozhou

(College of Computer, National University of Defense and Technology, Changsha 410000, China)

Abstract: Focused on the issue that division is complex and needs a long latency to compute, a method to design a high-performance Floating-point (FP) divider unit based on the Goldschmidt algorithm was proposed. Bipartite reciprocal tables were adopted to obtain the initial value of the iteration with area savings, and parallel multipliers were employed in the iteration unit to reduce latency. An FP divider that supports pipelined execution under the control of a state machine is presented to increase the throughput. The design was implemented in a Digital signal processor (DSP) chip by sharing the existing multipliers.

Key words: FP divider, Goldschmidt algorithm, Bipartite reciprocal look-up tables, DSP.

I. Introduction

The computational complexity of modern applications is continually increasing, and FP division is used more frequently. A hardware circuit for FP division is implemented in most general-purpose processors, such as Intel Core i7, IBM Power6 and AMD Phenom II. However, compared with FP addition, subtraction and multiplication, division is more complex and needs longer latency. It is therefore important to design and implement a high-performance FP divider.

Division functional units can be implemented using a variety of arithmetic algorithms[1−4]. The algorithms are divided into categories according to their basic arithmetic operation, either subtraction or multiplication. The benefit of subtractive algorithms is that they are typically low in complexity, and their implementation requires small area. On the other hand, subtractive algorithms have relatively longer latency since they converge linearly. The primary advantage of algorithms that use multiplication as the basic iterative operation is quadratic convergence to the precise quotient, which means that the number of accurate quotient digits in the estimate doubles at each iteration.
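
To make the doubling concrete, here is a short derivation for the reciprocal refinement step used by Newton-Raphson division. The symbols d (divisor), x_i (current estimate of 1/d), ε_i (relative error of that estimate), and k (accurate bits in the initial estimate) are notation introduced here for illustration and are not taken from the paper.

```latex
x_i = \frac{1-\varepsilon_i}{d}, \qquad
x_{i+1} = x_i\,(2 - d\,x_i)
        = \frac{(1-\varepsilon_i)(1+\varepsilon_i)}{d}
        = \frac{1-\varepsilon_i^{2}}{d}
\;\Longrightarrow\; \varepsilon_{i+1} = \varepsilon_i^{2},
\qquad
|\varepsilon_0| \le 2^{-k} \;\Rightarrow\; |\varepsilon_i| \le 2^{-k\,2^{i}} .
```

The error is squared at every step, so an estimate accurate to about k bits becomes accurate to about 2k bits after one iteration, if rounding in the multiplications is ignored.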

The two main methods used for multiplicative division are the Newton-Raphson algorithm[1] and the Goldschmidt algorithm[4]. Each iteration of the Newton-Raphson algorithm involves two interrelated multiplications. The most important advantage of the Goldschmidt algorithm is that the multiplications in each iteration are independent, so they can be pipelined and computed in parallel.
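
The difference in data dependence can be seen in a minimal software sketch of the two textbook iterations. This is an illustration in ordinary double-precision Python, not the paper's hardware design; the function names and the hand-picked initial estimate `x0` are assumptions made for the example.

```python
def newton_raphson_divide(n, d, x0, iterations):
    """Approximate n / d by refining a reciprocal estimate x ~ 1/d."""
    x = x0
    for _ in range(iterations):
        t = d * x          # first multiplication
        x = x * (2.0 - t)  # second multiplication must wait for t
    return n * x           # final multiplication forms the quotient


def goldschmidt_divide(n, d, x0, iterations):
    """Approximate n / d by scaling numerator and denominator until den ~ 1."""
    num = n * x0  # the two initial products share only x0
    den = d * x0
    for _ in range(iterations):
        f = 2.0 - den                # correction factor from the current denominator
        num, den = num * f, den * f  # neither product needs the other's result
    return num


if __name__ == "__main__":
    n, d, x0 = 1.5, 1.25, 0.8125  # x0 is a rough guess for 1/1.25 = 0.8
    for i in range(5):
        print(i, newton_raphson_divide(n, d, x0, i), goldschmidt_divide(n, d, x0, i))
```

In the Newton-Raphson loop the second product consumes the first, so the two multiplications are serialized. In the Goldschmidt loop both products depend only on `f`, which is what allows two multipliers, or two slots of one pipelined multiplier, to work on them at the same time.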

However, the Goldschmidt algorithm poses many challenges in the tradeoff between performance and hardware cost. The more precise the initial reciprocal approximation, the fewer iterations are needed, which lowers latency. But as the initial precision increases, the area cost of the reciprocal look-up table grows and the table look-up itself takes longer. A pipelined structure can achieve high throughput, but it consumes much extra hardware. To balance performance and cost, we present an appropriate structure and implement it in a DSP chip.
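
A rough, idealized model of this tradeoff is sketched below. It assumes that the number of accurate bits exactly doubles per iteration, that the initial value comes from a single direct ROM with one entry per index, and that the target is the 53-bit significand of IEEE 754 double precision. None of these figures come from the paper, and the function names are hypothetical.

```python
def iterations_needed(initial_bits, target_bits):
    """Count doublings of accuracy needed to go from initial_bits to target_bits."""
    if initial_bits <= 0 or target_bits <= 0:
        raise ValueError("bit counts must be positive")
    bits, count = initial_bits, 0
    while bits < target_bits:
        bits *= 2   # idealized quadratic convergence, rounding error ignored
        count += 1
    return count


def direct_table_bits(index_bits, output_bits):
    """Storage of a single ROM with one entry per index value."""
    return (2 ** index_bits) * output_bits


if __name__ == "__main__":
    target = 53  # significand width of IEEE 754 double precision (assumed target)
    for k in (4, 8, 12, 16):
        print(k, iterations_needed(k, target), direct_table_bits(k, k))
```

Under these assumptions, going from an 8-bit to a 16-bit initial estimate saves one iteration, while the direct table grows from 2,048 bits to 1,048,576 bits. That exponential growth is the reason a more compact table organization is attractive.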

The advantages of our design are as follows:

1) Area-saving bipartite Read only memory (ROM) reciprocal look-up tables are used to provide the initial approximate value for the following iterations.

2) A parallel operation technique is used in the iterative unit of the divider in order to reduce the latency of iteration.

3) A timing-efficient iterative controller makes the multipliers execute in a pipelined way to achieve larger throughput.

4) The divider is implemented on a DSP chip by sharing the existing multipliers through a data bridge, so hardware cost is reduced significantly.
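
The idea behind a bipartite table in item 1 can be sketched in software. The excerpt does not give the paper's table parameters or construction, so the code below uses one common textbook construction (a base table plus a slope-correction table, both built around interval midpoints) and stores entries as Python floats rather than quantized fixed-point words. The 3/3/3 split of the index bits and all names are assumptions for illustration.

```python
def build_bipartite_reciprocal(n0, n1, n2):
    """Build two tables whose sum approximates 1/x for x in [1, 2).

    The fraction bits of x are split into fields of n0, n1 and n2 bits.
    """
    w0 = 2.0 ** -n0              # weight of one step of the top field
    w1 = 2.0 ** -(n0 + n1)       # weight of one step of the middle field
    w2 = 2.0 ** -(n0 + n1 + n2)  # weight of one step of the low field
    mid1 = (2 ** n1 - 1) * w1 / 2  # midpoint of the middle-field values
    mid2 = (2 ** n2 - 1) * w2 / 2  # midpoint of the low-field values

    base = {}    # indexed by (top, middle): function value at the low-field midpoint
    offset = {}  # indexed by (top, low): first-order correction with a shared slope
    for i0 in range(2 ** n0):
        x0 = 1.0 + i0 * w0
        for i1 in range(2 ** n1):
            base[i0, i1] = 1.0 / (x0 + i1 * w1 + mid2)
        slope = -1.0 / (x0 + mid1 + mid2) ** 2  # derivative of 1/x at the segment center
        for i2 in range(2 ** n2):
            offset[i0, i2] = slope * (i2 * w2 - mid2)
    return base, offset


def bipartite_lookup(base, offset, n1, n2, index):
    """Split the fraction bits into three fields and add the two table entries."""
    i2 = index & ((1 << n2) - 1)
    i1 = (index >> n2) & ((1 << n1) - 1)
    i0 = index >> (n1 + n2)
    return base[i0, i1] + offset[i0, i2]


def max_abs_error(n0, n1, n2):
    """Largest deviation from 1/x over every representable input."""
    base, offset = build_bipartite_reciprocal(n0, n1, n2)
    total_bits = n0 + n1 + n2
    worst = 0.0
    for index in range(1 << total_bits):
        x = 1.0 + index / (1 << total_bits)
        approx = bipartite_lookup(base, offset, n1, n2, index)
        worst = max(worst, abs(approx - 1.0 / x))
    return worst


if __name__ == "__main__":
    base, offset = build_bipartite_reciprocal(3, 3, 3)
    print("entries:", len(base) + len(offset), "vs direct:", 2 ** 9)
    print("max error:", max_abs_error(3, 3, 3))
```

With this split, the two tables hold 64 entries each, compared with 512 entries for a direct table over the same 9 index bits, at the cost of one small addition per look-up.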

The remainder of this paper is organized as follows. Section II reviews related works. Section III surveys the Goldschmidt algorithm. The design of our FP divider and its implementation on a DSP chip are described in Section IV. Experimental results are shown in Section V.

∗ Manuscript Received May 28, 2015; Accepted Jan. 5, 2016. This work is supported by the Aerospace Science Foundation of China (No.2013ZC88003), and the Natural Science Foundation of China (No.61402499).

© 2017 Chinese Institute of Electronics. DOI:10.1049/cje.2016.10.004

