In this configuration the extreme values are min := 1×10^0 and max := 999×10^10. The smallest normalized number equals 100×10^0. Non-normalized representations like 3×10^4 are not valid: either the significand must have three digits or the exponent must be zero.
Let v := 1234 be a real number that should be stored inside the floating-point number type. Since it contains four digits the number will not fit exactly into the representation and it must be rounded. When rounded to the nearest representation then ṽ := [v]_3 := 123×10^1 is the only possible representation. The rounding error is equal to 4 = 0.4 ulp.
Contrary to v, the real number w := 1245 lies exactly between two possible representations. Indeed, 124×10^1 and 125×10^1 are both at distance 5. The chosen representation depends on the rounding mechanism. If rounded up then the significand 125 is chosen. If rounded to even then 124 is chosen. For w' := 1235 both rounding mechanisms would have chosen 124 as significand.
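The two rounding mechanisms are easy to express in code. The following C sketch (function names and types are ours, not from the paper) rounds a four-digit value of the toy format to a three-digit significand, with the exponent growing by one:

```c
#include <stdint.h>

/* Round a four-digit value (1000..9999) to a three-digit significand.
 * Round half up: ties go to the larger significand. */
static uint32_t round_half_up(uint32_t v)
{
    return (v + 5) / 10;
}

/* Round half to even: ties go to the significand with an even last digit. */
static uint32_t round_half_even(uint32_t v)
{
    uint32_t q = v / 10, r = v % 10;
    if (r > 5 || (r == 5 && (q & 1)))  /* round up above the midpoint, */
        q++;                           /* or on a tie when q is odd    */
    return q;
}
```

For v = 1234 both mechanisms return 123; for the tie w = 1245 rounding up yields 125 while rounding to even yields 124, matching the discussion above.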
The neighbors of w are w− := 123×10^1 and w+ := 125×10^1. Its respective boundaries are therefore m− := 123.5×10^1 and m+ := 124.5×10^1. In this case the neighbors were both at the same distance. This is not true for r := 100×10^3, with neighbors r− := 999×10^2 and r+ := 101×10^3. Clearly r− is closer to r than is r+.
For the sake of completeness we now show the boundaries for the extreme values and the smallest normalized number. The number min has its lower (resp. upper) boundary at 0.5×10^0 (resp. 1.5×10^0). For max, the boundaries are 998.5×10^10 and 999.5×10^10.
The boundaries for the smallest normalized number are special: even though its significand is equal to 100 the distance to its lower neighbor (99×10^0) is equal to 1 ulp and not just 0.5 ulp. Therefore its boundaries are 99.5×10^0 and 100.5×10^0.
2.4 IEEE 754 Double-Precision
An IEEE 754 double-precision floating-point number, or simply "double", is defined as a base 2 data type consisting of 64 bits. The first bit is the sign bit, followed by 11 bits reserved for the exponent e_IEEE, and 52 bits for the significand f_IEEE. For the purpose of this paper the sign bit is irrelevant and we will assume that we work with positive numbers.
With the exception of some special cases (which will be discussed shortly) all numbers are normalized, which in base 2 implies a leading 1 bit. For space efficiency this initial bit is not included in the encoded significand. IEEE 754 numbers hence effectively have a 53-bit significand where the first 1 bit is hidden (with value hidden = 2^52). The encoded exponent e_IEEE is an unsigned positive integer which is biased by bias = 1075. Decoding an e_IEEE consists of subtracting 1075. Combining this information, the value v of any normalized double can be computed as f_v := hidden + f_IEEE, e_v := e_IEEE − bias and hence v = f_v × 2^(e_v).
Note. This choice of decoding is not unique. Often the significand is decoded as a fraction with a decimal separator after the hidden bit.
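The decoding just described can be sketched in a few lines of C. This is a minimal illustration (the names decode_double, f_v and e_v are ours), assuming the usual binary64 bit layout:

```c
#include <stdint.h>
#include <string.h>

#define DP_SIGNIFICAND_MASK 0x000FFFFFFFFFFFFFULL
#define DP_EXPONENT_MASK    0x7FF0000000000000ULL
#define DP_HIDDEN_BIT       0x0010000000000000ULL  /* hidden = 2^52 */
#define DP_BIAS             1075

/* Decode a strictly positive, normalized double into (f_v, e_v)
 * such that d == f_v * 2^e_v. */
static void decode_double(double d, uint64_t *f_v, int *e_v)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);           /* reinterpret the 64 bits */
    uint64_t f_ieee = bits & DP_SIGNIFICAND_MASK;
    int e_ieee = (int)((bits & DP_EXPONENT_MASK) >> 52);
    *f_v = DP_HIDDEN_BIT + f_ieee;            /* add the hidden 1 bit */
    *e_v = e_ieee - DP_BIAS;                  /* remove the bias      */
}
```

For d = 1.0 this yields f_v = 2^52 and e_v = −52, and indeed 2^52 × 2^−52 = 1.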
IEEE 754 reserves some configurations for special values: when e_IEEE = 0x7FF (its maximum) and f_IEEE = 0 then the double is infinity (or minus infinity, if the sign bit is set). When e_IEEE = 0x7FF and f_IEEE ≠ 0 then the double represents "NaN" (Not a Number).
The exponent e_IEEE = 0 is reserved for denormals and zero. Denormals do not have a hidden bit. Their value can be computed as follows: f_IEEE × 2^(1−bias).
Throughout this paper we will assume that positive and negative infinity, positive and negative zero, as well as NaN have already been handled. Developers should be careful when testing for negative zero, though. Following the IEEE 754 specification −0.0 = +0.0 and −0.0 ≮ +0.0. One should thus use the sign bit to efficiently determine a number's sign. In the remainder of this paper a "floating-point number" will designate only a non-special number or a strictly positive denormal. It does not include zero, NaN or infinities.
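The negative-zero pitfall and the sign-bit test look as follows in C (the helper name is ours; C99's signbit macro from <math.h> provides the same service):

```c
#include <stdint.h>
#include <string.h>
#include <stdbool.h>

/* Comparisons cannot distinguish -0.0 from +0.0, since IEEE 754
 * defines -0.0 == +0.0. Reading the sign bit can. */
static bool double_is_negative(double d)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    return (bits >> 63) != 0;        /* the first bit is the sign bit */
}
```

Here (-0.0 == 0.0) evaluates to true, whereas double_is_negative(-0.0) returns true and double_is_negative(0.0) returns false.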
Note. Any value representable by doubles (except for NaNs) has a
unique representation.
Note. For any non-special strictly positive IEEE double v with f_IEEE ≠ 0 the upper and lower boundaries m+ and m− are at distance 2^(e_v−1). When f_IEEE = 0 then m+ is still at distance 2^(e_v−1) but the lower boundary only satisfies v − m− ≤ 2^(e_v−2).²
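This note translates into simple integer arithmetic on the decoded pair (f_v, e_v): the upper boundary is always (2 f_v + 1) × 2^(e_v−1), and the lower boundary is (2 f_v − 1) × 2^(e_v−1) except when f_IEEE = 0 and the exponent can still decrease, where it is (4 f_v − 1) × 2^(e_v−2). A sketch under these assumptions (the boundary_t type and function name are ours):

```c
#include <stdint.h>
#include <string.h>

typedef struct { uint64_t f; int e; } boundary_t;  /* value is f * 2^e */

/* Compute m- and m+ of a strictly positive, normalized double. */
static void double_boundaries(double d, boundary_t *m_minus, boundary_t *m_plus)
{
    uint64_t bits;
    memcpy(&bits, &d, sizeof bits);
    uint64_t f_ieee = bits & 0x000FFFFFFFFFFFFFULL;
    int e_ieee = (int)((bits >> 52) & 0x7FF);
    uint64_t f_v = (1ULL << 52) + f_ieee;          /* add hidden bit  */
    int e_v = e_ieee - 1075;                       /* remove the bias */
    m_plus->f = 2 * f_v + 1;                       /* always 2^(e_v-1) away */
    m_plus->e = e_v - 1;
    if (f_ieee == 0 && e_ieee > 1) {               /* lower neighbor closer */
        m_minus->f = 4 * f_v - 1;
        m_minus->e = e_v - 2;
    } else {
        m_minus->f = 2 * f_v - 1;
        m_minus->e = e_v - 1;
    }
}
```

For d = 1.0 (where f_IEEE = 0) this gives m+ = (2^53 + 1) × 2^−53 = 1 + 2^−53 and m− = (2^54 − 1) × 2^−54 = 1 − 2^−54, the midpoints to the successor 1 + 2^−52 and the predecessor 1 − 2^−53.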
3. Handmade Floating-Point

typedef struct diy_fp {
  uint64_t f;
  int e;
} diy_fp;

Figure 1: The diy_fp type.
Grisu and its variants only require fixed-size integers, but these integers are used to emulate floating-point numbers. In general reimplementing a floating-point number type is a non-trivial task, but in our context only a few operations with severe limitations are needed. In this section we present our implementation, diy_fp, of such a floating-point number type. As can be seen in Figure 1 it consists of a limited-precision integer (of higher precision than the input floating-point number) and one integer exponent. For the sake of simplicity we will use the 64-bit-wide uint64_t in the accompanying code samples. The text itself is, however, size-agnostic and uses q for the significand's precision.
Definition 3.1 (diy_fp). A diy_fp x is composed of an unsigned q-bit integer f_x (the significand) and a signed integer e_x (the exponent) of unlimited range. The value of x can be computed as x = f_x × 2^(e_x).
The "unlimited" range of diy_fp's exponent simplifies proofs. In practice the exponent type must only have a slightly greater range than the input exponent. Input numbers are systematically normalized, and a denormal will therefore require more bits than the original data type. We furthermore need some extra space to avoid overflows. For IEEE doubles, which reserve 11 bits for the exponent, a 32-bit signed integer is by far big enough.
3.1 Operations
Grisu extracts the significand of its diy_fps at an early stage and diy_fps are only used for two operations: subtraction and multiplication. The implementation of the diy_fp type is furthermore simplified by restricting the input and by relaxing the output. For instance, neither operation is required to return normalized results (even if the operands were normalized). Figure 2 shows the C implementation of the two operations.
The operands of the subtraction must have the same exponent, and the result of subtracting both significands must fit into the significand type. Under these conditions the operation clearly does not introduce any imprecision. The result might not be normalized.
The multiplication returns a diy_fp r̃ containing the rounded result of multiplying the two given diy_fps x and y. The result might not be normalized. In order to distinguish this imprecise multiplication from the precise one we will use the "rounded" symbol for this operation: r̃ := x ⊗ y.
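The two operations can be sketched as follows. This is our own illustration, not a reproduction of the paper's Figure 2: the subtraction checks its preconditions and subtracts significands; the multiplication computes the 128-bit product of the two 64-bit significands in 32-bit halves, keeps the top 64 bits, and rounds to nearest by adding 2^31 before discarding the low half:

```c
#include <stdint.h>
#include <assert.h>

typedef struct diy_fp { uint64_t f; int e; } diy_fp;

/* Exact subtraction: operands must share the same exponent and
 * x.f >= y.f. The result may be unnormalized. */
static diy_fp minus(diy_fp x, diy_fp y)
{
    assert(x.e == y.e && x.f >= y.f);
    return (diy_fp){ x.f - y.f, x.e };
}

/* Rounded multiplication x (*) y: the low 64 bits of the 128-bit
 * significand product are dropped, so the exponent grows by 64. */
static diy_fp multiply(diy_fp x, diy_fp y)
{
    const uint64_t M32 = 0xFFFFFFFFULL;
    uint64_t a = x.f >> 32, b = x.f & M32;   /* split into halves */
    uint64_t c = y.f >> 32, d = y.f & M32;
    uint64_t ac = a * c, ad = a * d, bc = b * c, bd = b * d;
    uint64_t tmp = (bd >> 32) + (ad & M32) + (bc & M32);
    tmp += 1ULL << 31;                       /* round to nearest */
    return (diy_fp){ ac + (ad >> 32) + (bc >> 32) + (tmp >> 32),
                     x.e + y.e + 64 };
}
```

As a sanity check, (2^63 × 2^0) ⊗ (2^63 × 2^0) yields the significand 2^62 with exponent 64, i.e. exactly 2^126.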
² The inequality is only needed for e_IEEE = 1, where the predecessor is a denormal.