In this configuration the extreme values are min := 1×10^0 and max := 999×10^10. The smallest normalized number equals 100×10^0. Non-normalized representations like 3×10^4 are not valid: either the significand must have three digits or the exponent must be zero.
Let v := 1234 be a real number that should be stored inside the floating-point number type. Since it contains four digits the number will not fit exactly into the representation and it must be rounded. When rounded to the nearest representation then ṽ := [v]_3 := 123×10^1 is the only possible representation. The rounding error is equal to 4 = 0.4 ulp.
Contrary to v, the real number w := 1245 lies exactly between two possible representations. Indeed, 124×10^1 and 125×10^1 are both at distance 5. The chosen representation depends on the rounding mechanism. If rounded up then the significand 125 is chosen. If rounded to even then 124 is chosen. For w' := 1235 both rounding mechanisms would have chosen 124 as significand.
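The two rounding mechanisms are easy to express in code. The following C sketch (function names and types are ours, not from the paper) rounds a four-digit value of the toy format to a three-digit significand, with the exponent growing by one:

```c
#include <stdint.h>

/* Round a four-digit value (1000..9999) to a three-digit significand.
 * Round half up: ties go to the larger significand. */
static uint32_t round_half_up(uint32_t v)
{
    return (v + 5) / 10;
}

/* Round half to even: ties go to the significand with an even last digit. */
static uint32_t round_half_even(uint32_t v)
{
    uint32_t q = v / 10, r = v % 10;
    if (r > 5 || (r == 5 && (q & 1)))  /* round up above the midpoint, */
        q++;                           /* or on a tie when q is odd    */
    return q;
}
```

For v = 1234 both mechanisms return 123; for the tie w = 1245 rounding up yields 125 while rounding to even yields 124, matching the discussion above.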
The neighbors of w are w− := 123×10^1 and w+ := 125×10^1. Its respective boundaries are therefore m− := 123.5×10^1 and m+ := 124.5×10^1. In this case the neighbors were both at the same distance. This is not true for r := 100×10^3, with neighbors r− := 999×10^2 and r+ := 101×10^3. Clearly r− is closer to r than is r+.
For the sake of completeness we now show the boundaries for the extreme values and the smallest normalized number. The number min has its lower (resp. upper) boundary at 0.5×10^0 (resp. 1.5×10^0). For max, the boundaries are 998.5×10^10 and 999.5×10^10.
The boundaries for the smallest normalized number are special: even though its significand is equal to 100 the distance to its lower neighbor (99×10^0) is equal to 1 ulp and not just 0.5 ulp. Therefore its boundaries are 99.5×10^0 and 100.5×10^0.
2.4 IEEE 754 Double-Precision
An IEEE 754 double-precision floating-point number, or simply "double", is defined as a base 2 data type consisting of 64 bits. The first bit is the sign bit, followed by 11 bits reserved for the exponent e_IEEE, and 52 bits for the significand f_IEEE. For the purpose of this paper the sign bit is irrelevant and we will assume that we work with positive numbers.
With the exception of some special cases (which will be discussed shortly) all numbers are normalized, which in base 2 implies a leading 1 bit. For space efficiency this initial bit is not included in the encoded significand. IEEE 754 numbers hence effectively have a 53-bit significand where the first 1 bit is hidden (with value hidden = 2^52). The encoded exponent e_IEEE is an unsigned positive integer which is biased by bias = 1075. Decoding an e_IEEE consists of subtracting 1075. Combining this information, the value v of any normalized double can be computed as f_v := hidden + f_IEEE, e_v := e_IEEE − bias and hence v = f_v × 2^(e_v).
Note. This choice of decoding is not unique. Often the significand is decoded as a fraction with a decimal separator after the hidden bit.
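The decoding just described can be sketched in a few lines of C. This is a minimal illustration (the names decode_double, f_v and e_v are ours), assuming the usual binary64 bit layout:

```c
#include <stdint.h>
#include <string.h>

#define DP_SIGNIFICAND_MASK 0x000FFFFFFFFFFFFFULL
#define DP_EXPONENT_MASK    0x7FF0000000000000ULL
#define DP_HIDDEN_BIT       0x0010000000000000ULL  /* hidden = 2^52 */
#define DP_BIAS             1075

/* Decode a strictly positive, normalized double into (f_v, e_v)
 * such that d == f_v * 2^e_v. */
static void decode_double(double d, uint64_t *f_v, int *e_v)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);           /* reinterpret the 64 bits */
    uint64_t f_ieee = bits & DP_SIGNIFICAND_MASK;
    int e_ieee = (int)((bits & DP_EXPONENT_MASK) >> 52);
    *f_v = DP_HIDDEN_BIT + f_ieee;            /* add the hidden 1 bit */
    *e_v = e_ieee - DP_BIAS;                  /* remove the bias      */
}
```

For d = 1.0 this yields f_v = 2^52 and e_v = −52, and indeed 2^52 × 2^−52 = 1.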
IEEE 754 reserves some configurations for special values: when e_IEEE = 0x7FF (its maximum) and f_IEEE = 0 then the double is infinity (or minus infinity, if the sign bit is set). When e_IEEE = 0x7FF and f_IEEE ≠ 0 then the double represents "NaN" (Not a Number).
The exponent e_IEEE = 0 is reserved for denormals and zero. Denormals do not have a hidden bit. Their value can be computed as follows: f_IEEE × 2^(1−bias).
Throughout this paper we will assume that positive and negative infinity, positive and negative zero, as well as NaN have already been handled. Developers should be careful when testing for negative zero, though. Following the IEEE 754 specification −0.0 = +0.0 and −0.0 ≮ +0.0. One should thus use the sign bit to efficiently determine a number's sign. In the remainder of this paper a "floating-point number" will designate only a non-special number or a strictly positive denormal. It does not include zero, NaN or infinities.
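The negative-zero pitfall and the sign-bit test look as follows in C (the helper name is ours; C99's signbit macro from <math.h> provides the same service):

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

/* Comparisons cannot distinguish -0.0 from +0.0, since IEEE 754
 * defines -0.0 == +0.0. Reading the sign bit can. */
static bool double_is_negative(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return (bits >> 63) != 0;        /* the first bit is the sign bit */
}
```

Here (-0.0 == 0.0) evaluates to true, whereas double_is_negative(-0.0) returns true and double_is_negative(0.0) returns false.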
Note. Any value representable by doubles (except for NaNs) has a
unique representation.
Note. For any non-special strictly positive IEEE double v with f_IEEE ≠ 0 the upper and lower boundaries m+ and m− are at distance 2^(e_v−1). When f_IEEE = 0 then m+ is still at distance 2^(e_v−1) but the lower boundary only satisfies v − m− ≤ 2^(e_v−2).²
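This note translates into simple integer arithmetic on the decoded pair (f_v, e_v): the upper boundary is always (2 f_v + 1) × 2^(e_v−1), and the lower boundary is (2 f_v − 1) × 2^(e_v−1) except when f_IEEE = 0 and the exponent can still decrease, where it is (4 f_v − 1) × 2^(e_v−2). A sketch under these assumptions (the boundary_t type and function name are ours):

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint64_t f; int e; } boundary_t;  /* value is f * 2^e */

/* Compute m- and m+ of a strictly positive, normalized double. */
static void double_boundaries(double d, boundary_t *m_minus, boundary_t *m_plus)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    uint64_t f_ieee = bits & 0x000FFFFFFFFFFFFFULL;
    int e_ieee = (int)((bits >> 52) & 0x7FF);
    uint64_t f_v = (1ULL << 52) + f_ieee;          /* add hidden bit  */
    int e_v = e_ieee - 1075;                       /* remove the bias */
    m_plus->f = 2 * f_v + 1;                       /* always 2^(e_v-1) away */
    m_plus->e = e_v - 1;
    if (f_ieee == 0 && e_ieee > 1) {               /* lower neighbor closer */
        m_minus->f = 4 * f_v - 1;
        m_minus->e = e_v - 2;
    } else {
        m_minus->f = 2 * f_v - 1;
        m_minus->e = e_v - 1;
    }
}
```

For d = 1.0 (where f_IEEE = 0) this gives m+ = (2^53 + 1) × 2^−53 = 1 + 2^−53 and m− = (2^54 − 1) × 2^−54 = 1 − 2^−54, the midpoints to the successor 1 + 2^−52 and the predecessor 1 − 2^−53.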
3. Handmade Floating-Point

typedef struct diy_fp {
  uint64_t f;
  int e;
} diy_fp;

Figure 1: The diy_fp type.
Grisu and its variants only require fixed-size integers, but these integers are used to emulate floating-point numbers. In general reimplementing a floating-point number type is a non-trivial task, but in our context only a few operations with severe limitations are needed. In this section we present our implementation, diy_fp, of such a floating-point number type. As can be seen in Figure 1 it consists of a limited-precision integer (of higher precision than the input floating-point number) and one integer exponent. For the sake of simplicity we will use the 64-bit-wide uint64_t in the accompanying code samples. The text itself is, however, size-agnostic and uses q for the significand's precision.
Definition 3.1 (diy_fp). A diy_fp x is composed of an unsigned q-bit integer f_x (the significand) and a signed integer e_x (the exponent) of unlimited range. The value of x can be computed as x = f_x × 2^(e_x).
The "unlimited" range of diy_fp's exponent simplifies proofs. In practice the exponent type must only have a slightly greater range than the input exponent. Input numbers are systematically normalized, and a denormal will therefore require more bits than the original data type. We furthermore need some extra space to avoid overflows. For IEEE doubles, which reserve 11 bits for the exponent, a 32-bit signed integer is by far big enough.
3.1 Operations
Grisu extracts the significand of its diy_fps at an early stage and diy_fps are only used for two operations: subtraction and multiplication. The implementation of the diy_fp type is furthermore simplified by restricting the input and by relaxing the output. For instance, neither operation is required to return normalized results (even if the operands were normalized). Figure 2 shows the C implementation of the two operations.
The operands of the subtraction must have the same exponent, and the result of subtracting both significands must fit into the significand type. Under these conditions the operation clearly does not introduce any imprecision. The result might not be normalized.
The multiplication returns a diy_fp r̃ containing the rounded result of multiplying the two given diy_fps x and y. The result might not be normalized. In order to distinguish this imprecise multiplication from the precise one we will use the "rounded" symbol for this operation: r̃ := x ⊗ y.
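The two operations can be sketched as follows. This is our own illustration, not a reproduction of the paper's Figure 2: the subtraction checks its preconditions and subtracts significands; the multiplication computes the 128-bit product of the two 64-bit significands in 32-bit halves, keeps the top 64 bits, and rounds to nearest by adding 2^31 before discarding the low half:

```c
#include <stdint.h>
#include <assert.h>

typedef struct diy_fp { uint64_t f; int e; } diy_fp;

/* Exact subtraction: operands must share the same exponent and
 * x.f >= y.f. The result may be unnormalized. */
static diy_fp minus(diy_fp x, diy_fp y)
{
    assert(x.e == y.e && x.f >= y.f);
    return (diy_fp){ x.f - y.f, x.e };
}

/* Rounded multiplication x (*) y: the low 64 bits of the 128-bit
 * significand product are dropped, so the exponent grows by 64. */
static diy_fp multiply(diy_fp x, diy_fp y)
{
    const uint64_t M32 = 0xFFFFFFFFULL;
    uint64_t a = x.f >> 32, b = x.f & M32;   /* split into halves */
    uint64_t c = y.f >> 32, d = y.f & M32;
    uint64_t ac = a * c, ad = a * d, bc = b * c, bd = b * d;
    uint64_t tmp = (bd >> 32) + (ad & M32) + (bc & M32);
    tmp += 1ULL << 31;                       /* round to nearest */
    return (diy_fp){ ac + (ad >> 32) + (bc >> 32) + (tmp >> 32),
                     x.e + y.e + 64 };
}
```

As a sanity check, (2^63 × 2^0) ⊗ (2^63 × 2^0) yields the significand 2^62 with exponent 64, i.e. exactly 2^126.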
² The inequality is only needed for e_IEEE = 1, where the predecessor is a denormal.