What Every Computer Scientist Should Know: David Goldberg on Floating-Point Arithmetic Design


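Floating-point arithmetic is an indispensable and frequently misunderstood topic in computer science. David Goldberg's article "What Every Computer Scientist Should Know About Floating-Point Arithmetic" aims to be a comprehensive guide to this complex area. It addresses floating-point handling as it appears throughout computer systems: data-type support in programming languages, hardware ranging from personal computers to supercomputers, and compiler optimization of floating-point code.

The article first covers how floating-point numbers are represented and the mathematics behind the representation. It is essential to understand how a finite number of bits can approximate the infinite set of real numbers. This involves the base, the exponent, and the mantissa (significand), and how they combine to trade precision against range.

The author then describes in detail the IEEE (Institute of Electrical and Electronics Engineers) floating-point standard, IEEE 754, which is widely adopted across the industry. The standard defines data formats, including single precision (32 bits) and double precision (64 bits), together with their rounding rules, so that different systems can interoperate. Understanding these rules helps developers avoid problems caused by inconsistent floating-point results, such as implicit rounding error.

The article also discusses common pitfalls in floating-point arithmetic: overflow, underflow, NaN (Not a Number), and the handling of infinity. Designers must consider all of these because they can significantly affect program correctness and performance. The author gives examples that guide system builders in tuning hardware and software for efficient and accurate floating-point arithmetic, for instance by using hardware accelerators to relieve the CPU or by writing code that detects and handles exceptions.

Finally, the article stresses the practical role of floating-point arithmetic in system design. Embedded systems, servers, and cloud environments all have to account for floating-point performance bottlenecks and error-handling strategies. For anyone working in computer system design or software development, understanding the properties and standards of floating-point arithmetic is essential background. The article covers the basic concepts and also examines the role and challenges of floating-point arithmetic in modern computer systems, which makes it a valuable reference for computer scientists and engineers who want to design more efficient and more stable systems.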


David Goldberg

The condition that \(\epsilon \le .005\) is met in virtually every actual floating-point system. For example, when \(\beta = 2\), \(p \ge 8\) ensures that \(\epsilon < .005\), and when \(\beta = 10\), \(p \ge 3\) is enough.

In statements like Theorem 3 that discuss the relative error of an expression, it is understood that the expression is computed using floating-point arithmetic. In particular, the relative error is actually of the expression

\[
\mathrm{SQRT}\bigl((a \oplus (b \oplus c)) \otimes (c \ominus (a \ominus b)) \otimes (c \oplus (a \ominus b)) \otimes (a \oplus (b \ominus c))\bigr) \oslash 4. \tag{8}
\]

Because of the cumbersome nature of (8), in the statement of theorems we will usually say the computed value of E rather than writing out E with circle notation.

Error bounds are usually too pessimistic. In the numerical example given above, the computed value of (7) is 2.35, compared with a true value of 2.34216 for a relative error of \(0.7\epsilon\), which is much less than \(11\epsilon\). The main reason for computing error bounds is not to get precise bounds but rather to verify that the formula does not contain numerical problems.

A final example of an expression that can be rewritten to use benign cancellation is \((1 + x)^n\), where \(x \ll 1\). This expression arises in financial calculations. Consider depositing $100 every day into a bank account that earns an annual interest rate of 6%, compounded daily. If \(n = 365\) and \(i = .06\), the amount of money accumulated at the end of one year is \(100[(1 + i/n)^n - 1]/(i/n)\) dollars. If this is computed using \(\beta = 2\) and \(p = 24\), the result is $37615.45 compared to the exact answer of $37614.05, a discrepancy of $1.40. The reason for the problem is easy to see. The expression \(1 + i/n\) involves adding 1 to .0001643836, so the low order bits of \(i/n\) are lost. This rounding error is amplified when \(1 + i/n\) is raised to the nth power.

The troublesome expression \((1 + i/n)^n\) can be rewritten as \(\exp[n \ln(1 + i/n)]\), where now the problem is to compute \(\ln(1 + x)\) for small \(x\). One approach is to use the approximation \(\ln(1 + x) \approx x\), in which case the payment becomes $37617.26, which is off by $3.21 and even less accurate than the obvious formula. But there is a way to compute \(\ln(1 + x)\) accurately, as Theorem 4 shows [Hewlett-Packard 1982]. This formula yields $37614.07, accurate to within 2 cents!

Theorem 4 assumes that \(\mathrm{LN}(x)\) approximates \(\ln(x)\) to within 1/2 ulp. The problem it solves is that when \(x\) is small, \(\mathrm{LN}(1 \oplus x)\) is not close to \(\ln(1 + x)\) because \(1 \oplus x\) has lost the information in the low order bits of \(x\). That is, the computed value of \(\ln(1 + x)\) is not close to its actual value when \(x \ll 1\).

Theorem 4

If \(\ln(1 + x)\) is computed using the formula

\[
\ln(1 + x) =
\begin{cases}
x & \text{for } 1 \oplus x = 1 \\[4pt]
\dfrac{x \ln(1 + x)}{(1 + x) - 1} & \text{for } 1 \oplus x \ne 1
\end{cases}
\]

the relative error is at most \(5\epsilon\) when \(0 \le x < 3/4\), provided subtraction is performed with a guard digit, \(\epsilon < 0.1\), and ln is computed to within 1/2 ulp.

This formula will work for any value of \(x\) but is only interesting for \(x \ll 1\), which is where catastrophic cancellation occurs in the naive formula \(\ln(1 + x)\). Although the formula may seem mysterious, there is a simple explanation for why it works. Write \(\ln(1 + x)\) as \(x[\ln(1 + x)/x] = x\mu(x)\). The left-hand factor can be computed exactly, but the right-hand factor \(\mu(x) = \ln(1 + x)/x\) will suffer a large rounding error when adding 1 to \(x\). However, \(\mu\) is almost constant, since \(\ln(1 + x) \approx x\). So changing \(x\) slightly will not introduce much error. In other words, if \(\tilde{x} \approx x\), computing \(x\mu(\tilde{x})\) will be a good approximation to \(x\mu(x) = \ln(1 + x)\). Is there a value for \(\tilde{x}\) for which \(\tilde{x}\) and \(\tilde{x} + 1\) can be computed accurately? There is; namely, \(\tilde{x} = (1 \oplus x) \ominus 1\), because then \(1 + \tilde{x}\) is exactly equal to \(1 \oplus x\).
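The rearrangement in Theorem 4 can be tried out directly. The following is a minimal Python sketch, not code from the paper: the function names `ln1p_theorem4` and `annuity` are hypothetical, and it runs in IEEE double precision rather than the \(\beta = 2\), \(p = 24\) format of the bank example, so it will not reproduce the $37615.45 figure. It only illustrates that the rewritten formula stays accurate for small \(x\) where the naive `math.log(1 + x)` does not.

```python
import math


def ln1p_theorem4(x: float) -> float:
    """ln(1 + x) using the case split stated in Theorem 4."""
    w = 1.0 + x  # the rounded sum 1 (+) x; low-order bits of x may be lost here
    if w == 1.0:
        return x  # first case of the theorem: 1 (+) x == 1
    # w - 1.0 plays the role of x-tilde, and log(w) / (w - 1.0) approximates
    # mu(x-tilde); multiplying by the original x restores the lost bits.
    return x * math.log(w) / (w - 1.0)


def annuity(deposit: float, i: float, n: int, ln1p) -> float:
    """deposit * ((1 + i/n)^n - 1) / (i/n), with the power taken as exp(n * ln(1 + i/n))."""
    r = i / n
    growth = math.exp(n * ln1p(r))
    return deposit * (growth - 1.0) / r


x = 1e-10
reference = math.log1p(x)  # library routine, used only as a reference value
naive_rel_err = abs(math.log(1.0 + x) - reference) / reference
thm4_rel_err = abs(ln1p_theorem4(x) - reference) / reference
balance = annuity(100.0, 0.06, 365, ln1p_theorem4)
print(naive_rel_err, thm4_rel_err, round(balance, 2))
```

In double precision the naive relative error for this \(x\) is many orders of magnitude larger than the Theorem 4 error, and the balance agrees with the exact answer of $37614.05 quoted in the text to within a cent.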

The results of this section can be summarized by saying that a guard digit guarantees accuracy when nearby precisely known quantities are subtracted (benign cancellation). Sometimes a formula that gives inaccurate results can be rewritten to have much higher numerical accuracy by using benign cancellation; however, the procedure only works if subtraction is performed using a guard digit. The price of a guard digit is not high because it merely requires making the adder 1 bit wider. For a 54 bit double precision adder, the additional cost is less than 2%. For this price, you gain the ability to run many algorithms such as formula (6) for computing the area of a triangle and the expression in Theorem 4 for computing \(\ln(1 + x)\). Although most modern computers have a guard digit, there are a few (such as Crays) that do not.
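The triangle-area computation mentioned above, written with circle notation as (8) earlier in this section, can be simulated in \(\beta = 10\), \(p = 3\) arithmetic with Python's `decimal` module. This is only an illustrative sketch: the helper name `triangle_area` is hypothetical, and the side lengths \(a = 9.00\), \(b = c = 4.53\) are assumed to be those of the paper's earlier numerical example, which lies outside this excerpt. Each call on `ctx` stands for one rounded operation.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN


def triangle_area(a: Decimal, b: Decimal, c: Decimal, ctx: Context) -> Decimal:
    """Evaluate expression (8); requires a >= b >= c. Every operation is rounded by ctx."""
    f1 = ctx.add(a, ctx.add(b, c))            # a (+) (b (+) c)
    f2 = ctx.subtract(c, ctx.subtract(a, b))  # c (-) (a (-) b)
    f3 = ctx.add(c, ctx.subtract(a, b))       # c (+) (a (-) b)
    f4 = ctx.add(a, ctx.subtract(b, c))       # a (+) (b (-) c)
    product = ctx.multiply(ctx.multiply(ctx.multiply(f1, f2), f3), f4)
    return ctx.divide(ctx.sqrt(product), Decimal(4))


ctx = Context(prec=3, rounding=ROUND_HALF_EVEN)  # beta = 10, p = 3
area = triangle_area(Decimal("9.00"), Decimal("4.53"), Decimal("4.53"), ctx)

eps = Decimal("0.005")         # (beta / 2) * beta**(-p) for beta = 10, p = 3
true_area = Decimal("2.34216")  # true value quoted in the text
rel_err_in_eps = float((area - true_area) / true_area / eps)
print(area, rel_err_in_eps)
```

Under these assumptions the simulated result is 2.35, matching the computed value quoted in the text, and the relative error comes out at roughly \(0.7\epsilon\), well under the \(11\epsilon\) bound.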

1.5 Exactly Rounded Operations

When floating-point operations are done with a guard digit, they are not as accurate as if they were computed exactly then rounded to the nearest floating-point number. Operations performed in this manner will be called exactly rounded. The example immediately preceding Theorem 2 shows that a single guard digit will not always give exactly rounded results. Section 1.4 gave several examples of algorithms that require a guard digit in order to work properly. This section gives examples of algorithms that require exact rounding.

So far, the definition of rounding has not been given. Rounding is straightforward, with the exception of how to round halfway cases; for example, should 12.5 round to 12 or 13? One school of thought divides the 10 digits in half, letting {0, 1, 2, 3, 4} round down and {5, 6, 7, 8, 9} round up; thus 12.5 would round to 13. This is how rounding works on Digital Equipment Corporation's VAX⁶ computers. Another school of thought says that since numbers ending in 5 are halfway between two possible roundings, they should round down half the time and round up the other half. One way of obtaining this 50% behavior is to require that the rounded result have its least significant digit be even. Thus 12.5 rounds to 12 rather than 13 because 2 is even. Which of these methods is best, round up or round to even? Reiser and Knuth [1975] offer the following reason for preferring round to even.

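Theorem 5

Let \(x\) and \(y\) be floating-point numbers, and define \(x_0 = x\), \(x_1 = (x_0 \ominus y) \oplus y\), ..., \(x_n = (x_{n-1} \ominus y) \oplus y\). If \(\oplus\) and \(\ominus\) are exactly rounded using round to even, then either \(x_n = x\) for all \(n\) or \(x_n = x_1\) for all \(n \ge 1\). ∎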

To clarify this result, consider \(\beta = 10\), \(p = 3\) and let \(x = 1.00\), \(y = -.555\). When rounding up, the sequence becomes \(x_0 \ominus y = 1.56\), \(x_1 = 1.56 \ominus .555 = 1.01\), \(x_1 \ominus y = 1.01 \oplus .555 = 1.57\), and each successive value of \(x_n\) increases by .01. Under round to even, \(x_n\) is always 1.00. This example suggests that when using the round up rule, computations can gradually drift upward, whereas when using round to even the theorem says this cannot happen. Throughout the rest of this paper, round to even will be used.
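The drift in this example can be reproduced with a short Python sketch using the `decimal` module at three significant digits. The helper name `iterate` is hypothetical, and `ROUND_HALF_UP` is used as a stand-in for the "round up" rule; it rounds ties away from zero, which coincides with rounding up for the positive values that occur here.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN, ROUND_HALF_UP


def iterate(x: Decimal, y: Decimal, rounding: str, steps: int) -> list:
    """Return [x_0, x_1, ..., x_steps] where x_n = (x_{n-1} (-) y) (+) y."""
    ctx = Context(prec=3, rounding=rounding)  # beta = 10, p = 3
    values = [x]
    for _ in range(steps):
        x = ctx.add(ctx.subtract(x, y), y)  # both operations are rounded to 3 digits
        values.append(x)
    return values


x0, y = Decimal("1.00"), Decimal("-0.555")
up = iterate(x0, y, ROUND_HALF_UP, 3)
even = iterate(x0, y, ROUND_HALF_EVEN, 3)
print([str(v) for v in up])    # ['1.00', '1.01', '1.02', '1.03']
print([str(v) for v in even])  # ['1.00', '1.00', '1.00', '1.00']
```

With ties rounded up, each pass adds .01; with round to even, the halfway case 1.005 rounds back to 1.00 and the sequence stays fixed, as Theorem 5 predicts.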

One application of exact rounding occurs in multiple precision arithmetic. There are two basic approaches to higher precision. One approach represents floating-point numbers using a very large significand, which is stored in an array of words, and codes the routines for manipulating these numbers in assembly language. The second approach represents higher precision floating-point numbers as an array of ordinary floating-point

⁶ VAX is a trademark of Digital Equipment Corporation.

ACM Computing Surveys, Vol. 23, No. 1, March 1991
