Chinese Journal of Electronics
Vol.26, No.2, Mar. 2017
High-Performance FP Divider with Sharing
Multipliers Based on Goldschmidt Algorithm∗
HE Tingting, CHEN Jiyang, LEI Yuanwu, PENG Yuanxi and ZHU Baozhou
(College of Computer, National University of Defense and Technology, Changsha 410000, China)
Abstract — Division is complex and has a long latency.
To address this, a method for designing a high-performance
floating-point (FP) divider based on the Goldschmidt
algorithm is proposed. Bipartite reciprocal tables are
adopted to obtain the initial value of the iteration while
saving area, and parallel multipliers are employed in the
iteration unit to reduce latency. The FP divider supports
pipelined execution under the control of a state machine
to increase throughput. The design was implemented in a
digital signal processor (DSP) chip by sharing its existing
multipliers.
Key words — FP divider, Goldschmidt algorithm,
Bipartite reciprocal look-up tables, DSP.
I. Introduction
The computational complexity of modern applications
is continually increasing, and FP division is used more
frequently. FP division is implemented in hardware in
most general-purpose processors, such as the Intel Core i7,
IBM Power6 and AMD Phenom II. However, compared with
FP addition, subtraction and multiplication, division is
more complex and has a longer latency. It is therefore
important to design and implement a high-performance
FP divider.
Division functional units can be implemented using
a variety of arithmetic algorithms[1−4]. The algorithms
fall into categories according to their basic arithmetic
operation, including subtraction and multiplication.
Subtractive algorithms are typically low in complexity
and require a small area. On the other hand, they have
relatively long latency because they converge linearly.
Algorithms that use multiplication as the basic iterative
operation have the primary advantage of quadratic
convergence to the precise quotient: the number of
accurate quotient digits in the estimate doubles at each
iteration.
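The contrast between linear and quadratic convergence can be illustrated with a minimal Python sketch. It is not taken from this design: the function names, the radix-2 restoring scheme and the operand values are assumptions chosen for illustration, and exact rational arithmetic stands in for hardware datapaths.

```python
from fractions import Fraction
import math

def restoring_divide_errors(n, d, steps):
    """Radix-2 restoring division of n/d (0 < n < d): one quotient bit per step.

    Returns the absolute error of the partial quotient after each step.
    """
    exact = Fraction(n, d)
    remainder, quotient, errors = n, Fraction(0), []
    for k in range(1, steps + 1):
        remainder *= 2                      # shift the partial remainder left
        if remainder >= d:                  # trial subtraction succeeds
            remainder -= d
            quotient += Fraction(1, 2**k)   # set quotient bit k
        errors.append(exact - quotient)
    return errors

def reciprocal_iteration_errors(d, x0, steps):
    """Multiplicative refinement x <- x * (2 - d * x) of an estimate x of 1/d.

    Returns the relative error |1 - d*x| before the first step and after each step.
    """
    d, x = Fraction(d), Fraction(x0)
    errors = [abs(1 - d * x)]
    for _ in range(steps):
        x = x * (2 - d * x)
        errors.append(abs(1 - d * x))
    return errors

# Illustrative operands (assumptions): 2/3 for the subtractive case,
# divisor 1.5 with seed 0.664 for the multiplicative case.
linear = restoring_divide_errors(2, 3, 8)
quadratic = reciprocal_iteration_errors("1.5", "0.664", 3)
print([round(-math.log2(e), 1) for e in linear])     # accurate bits per step
print([round(-math.log2(e), 1) for e in quadratic])  # accurate bits per step
```

In the subtractive case the error bound only halves per step, whereas in the multiplicative case the error is squared, so the count of accurate bits doubles.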
The two main methods used for multiplicative division
are the Newton-Raphson algorithm[1] and the Goldschmidt
algorithm[4]. Each iteration of the Newton-Raphson
algorithm involves two dependent multiplications. In
contrast, the main advantage of the Goldschmidt
algorithm is that the multiplications in each iteration are
independent, so they can be pipelined and computed in
parallel.
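The data dependence described above can be made concrete with a short Python sketch of the textbook forms of both iterations. The function names and operand values are assumptions for illustration, and exact rational arithmetic is used instead of truncated hardware multipliers, so this is not the datapath of the proposed divider.

```python
from fractions import Fraction

def newton_raphson_divide(a, b, x0, iterations):
    """a/b via reciprocal refinement; the two multiplies per step are chained."""
    a, b, x = Fraction(a), Fraction(b), Fraction(x0)
    for _ in range(iterations):
        t = b * x          # multiply 1
        x = x * (2 - t)    # multiply 2 needs t, so it must wait for multiply 1
    return a * x           # final multiply by the dividend

def goldschmidt_divide(a, b, x0, iterations):
    """a/b by scaling numerator and denominator with a shared factor f."""
    a, b, x0 = Fraction(a), Fraction(b), Fraction(x0)
    n, d = a * x0, b * x0  # initial scaling; d is now close to 1
    for _ in range(iterations):
        f = 2 - d          # correction factor, obtained without a multiply
        n = n * f          # these two multiplies read only the old n, d and f,
        d = d * f          # so they do not depend on each other
    return n               # n approaches a/b as d approaches 1

# Illustrative operands (assumptions): 1.25 / 1.5 with seed 0.664 for 1/1.5.
q_nr = newton_raphson_divide("1.25", "1.5", "0.664", 3)
q_gs = goldschmidt_divide("1.25", "1.5", "0.664", 3)
print(float(q_nr), float(q_gs))
```

In exact arithmetic both forms produce the same estimate after the same number of iterations; the difference is only in which multiplications can overlap.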
However, the Goldschmidt algorithm poses many
challenges in the tradeoff between performance and cost.
The more precise the initial reciprocal approximation,
the fewer iterations are needed, which lowers latency.
But as the initial precision increases, the area of the
reciprocal look-up table grows and the table look-up
itself takes longer. A pipelined structure can achieve
high throughput, but it consumes much extra hardware.
To balance performance and cost, we present an
appropriate structure and implement it in a DSP chip.
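The tradeoff between seed precision and iteration count can be estimated with a rough Python sketch. It rests on stated assumptions that are not from this paper: accuracy doubles exactly at each iteration, the target is the 53-bit significand of IEEE 754 double precision, and the table is a hypothetical direct ROM with a p-bit index and p-bit entries.

```python
import math

def iterations_needed(seed_bits, target_bits):
    """Iterations to reach target_bits if accuracy doubles each iteration (idealized)."""
    return max(0, math.ceil(math.log2(target_bits / seed_bits)))

def direct_table_bits(seed_bits):
    """ROM size in bits of a hypothetical direct table: 2^p entries of p bits each."""
    return (2 ** seed_bits) * seed_bits

# Seed precisions chosen for illustration only.
for p in (4, 8, 12, 16):
    print(p, iterations_needed(p, 53), direct_table_bits(p))
```

Under these assumptions each saved iteration requires roughly doubling the seed precision, while the direct table grows exponentially with it, which is the motivation for more compact table schemes.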
The advantages of our design are as follows:
1) Area-saving bipartite read-only memory (ROM)
reciprocal look-up tables provide the initial
approximation for the subsequent iterations.
2) Parallel operation is used in the iterative unit
of the divider to reduce the latency of iteration.
3) A timing-efficient iterative controller makes the
multipliers execute in a pipelined manner to achieve
higher throughput.
4) The divider is implemented on a DSP chip and shares
the existing multipliers through a data bridge, so
hardware cost is reduced significantly.
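The area saving of a bipartite table, mentioned in item 1, can be sketched in Python as follows. This is the generic bipartite scheme from the table-lookup literature, not the tables used in this design: the field width k, the function names and the use of unrounded floating-point entries are all assumptions for illustration.

```python
def build_bipartite_reciprocal(k):
    """Two tables approximating 1/x on [1, 2) for a 3k-bit fraction x0|x1|x2."""
    step = 2.0 ** (-3 * k)                  # weight of the fraction LSB
    mid2 = (2**k - 1) / 2 * step            # midpoint of the x2 field
    mid1 = (2**k - 1) / 2 * step * 2**k     # midpoint of the x1 field
    coarse, slope = {}, {}
    for x0 in range(2**k):
        base = 1.0 + x0 * step * 2**(2 * k)
        for x1 in range(2**k):
            # function value at the centre of the x2 interval
            coarse[(x0, x1)] = 1.0 / (base + x1 * step * 2**k + mid2)
        centre = base + mid1 + mid2
        for x2 in range(2**k):
            # linear correction, using a single slope per x0 segment
            slope[(x0, x2)] = -(x2 * step - mid2) / centre**2
    return coarse, slope

def bipartite_reciprocal(frac, k, tables):
    """Look up both tables and add; frac is the 3k-bit fraction of x in [1, 2)."""
    coarse, slope = tables
    mask = 2**k - 1
    x0, x1, x2 = frac >> (2 * k), (frac >> k) & mask, frac & mask
    return coarse[(x0, x1)] + slope[(x0, x2)]

k = 3                                        # illustrative field width (assumption)
tables = build_bipartite_reciprocal(k)
worst = max(abs(bipartite_reciprocal(f, k, tables) - 1.0 / (1.0 + f / 2**(3 * k)))
            for f in range(2**(3 * k)))
print(len(tables[0]) + len(tables[1]), 2**(3 * k), worst)
```

With a 3k-bit index, the two tables together hold 2 × 2^(2k) entries instead of the 2^(3k) entries of a direct table, at the cost of one addition per look-up.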
The remainder of this paper is organized as follows.
Section II reviews related work. Section III surveys the
Goldschmidt algorithm. Section IV describes the design of
our FP divider and its implementation on a DSP chip.
Experimental results are shown in Section V.
∗Manuscript Received May 28, 2015; Accepted Jan. 5, 2016. This work is supported by the Aerospace Science Foundation of China
(No.2013ZC88003), and the Natural Science Foundation of China (No.61402499).
© 2017 Chinese Institute of Electronics. DOI:10.1049/cje.2016.10.004