1.8 Binary floating point numbers
Computers often use base 2 for their representation of floating point numbers. A number $x$ is expressed in
binary (base 2) floating point form if it is written as a signed number with magnitude between 1 and 2
multiplied by an integral power of 2. More precisely, we write $x = \pm (1.b_2 \cdots b_j \cdots)_2 \times 2^q$ where each $b_j$ is either
zero or one (a number that is either 0 or 1 is often called a bit). The quantity $(1.b_2 \cdots b_j \cdots)_2$ represents
$$1 + \frac{b_2}{2} + \cdots + \frac{b_j}{2^{j-1}} + \cdots.$$
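As an illustration of this definition, the following sketch recovers the sign, the leading significand bits and the exponent q of a nonzero number. The function name binary_float_form and its bits parameter are made up for this example, and the input is assumed to be an ordinary Python float.

```python
import math

def binary_float_form(x, bits=12):
    """Return (sign, significand_bits, q) with |x| = (1.b2 b3 ...)_2 * 2**q.

    Only the first `bits` significand bits are listed; x must be nonzero.
    """
    sign = -1 if x < 0 else 1
    m, e = math.frexp(abs(x))      # abs(x) = m * 2**e with 0.5 <= m < 1
    m, q = 2 * m, e - 1            # rescale so that 1 <= m < 2
    digits = []
    for _ in range(bits):
        d = int(m)                 # next bit (the first one is always 1)
        digits.append(str(d))
        m = 2 * (m - d)            # shift the remaining fraction left one place
    return sign, digits[0] + "." + "".join(digits[1:]), q

print(binary_float_form(44.0, 4))    # (1, '1.011', 5)
print(binary_float_form(-3.25, 4))   # (-1, '1.101', 1)
```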
Example 1.
$1.011_2 \times 2^5$ represents $44_{10}$,
$1.101_2 \times 2^1 = 3.25_{10}$,
$-1.101_2 \times 2^1 = -3.25_{10}$,
$(1.1001\,1001\,\ldots)_2 \times 2^{-4} = 0.1_{10}$ (the pattern 1001 repeats forever).
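The values in Example 1 can be checked by summing the series directly. The sketch below uses exact rational arithmetic; the helper name value is hypothetical, and the last case truncates the repeating pattern after six repetitions, so it only approximates 0.1 from below.

```python
from fractions import Fraction

def value(bits, q, sign=1):
    """Evaluate sign * (1.b2 b3 ...)_2 * 2**q exactly from a string of bits."""
    total = Fraction(0)
    for j, b in enumerate(bits.replace(".", "").replace(" ", ""), start=1):
        total += Fraction(int(b), 2 ** (j - 1))   # bit b_j has weight 2**-(j-1)
    return sign * total * Fraction(2) ** q

print(value("1.011", 5))        # 44
print(value("1.101", 1))        # 13/4, which is 3.25
print(value("1.101", 1, -1))    # -13/4

# Truncating the repeating pattern gives a value slightly below 1/10.
approx = value("1." + "1001" * 6, -4)
print(float(Fraction(1, 10) - approx))
```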
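Suppose $x = \pm (1.b_2 \cdots b_j \cdots)_2 \times 2^q$. If $x > 0$ then
$$\operatorname{chop}(x, p, 2) = (1.b_2 \cdots b_p)_2 \times 2^q = 2^{q-p+1} \lfloor 2^{p-q-1} x \rfloor$$
denotes $x$ chopped off to $p$ bits. If $x < 0$ then $\operatorname{chop}(x, p, 2) = -\operatorname{chop}(-x, p, 2)$. If $x > 0$ then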
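$$\operatorname{round}(x, p, 2) = \operatorname{chop}(x + 2^{q-p}, p, 2) =
\begin{cases}
\operatorname{chop}(x, p, 2) & \text{if } b_{p+1} = 0 \\
(1.b_2 \cdots b_{p-1} 1)_2 \times 2^q & \text{if } b_{p+1} = 1 \text{ and } b_p = 0 \\
(1.b_2 \cdots b_{s-1} 1 0 \cdots 0)_2 \times 2^q & \text{if } b_{p+1} = 1,\ b_s = 0 \text{ and } b_{s+1} = \cdots = b_p = 1 \\
2^{q+1} & \text{if } b_{p+1} = 1 \text{ and } b_2 = \cdots = b_p = 1
\end{cases}$$
denotes $x$ rounded to $p$ bits. If $x < 0$ then $\operatorname{round}(x, p, 2) = -\operatorname{round}(-x, p, 2)$. To say that a computation is done
with $p$ bits of precision means that a number $x$ is represented by $\operatorname{round}(x, p, 2)$ and the result of each arithmetic
operation is rounded to $p$ bits.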
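The two definitions translate almost line by line into code. The following is a minimal sketch using exact rational arithmetic so that no hidden rounding interferes; the names exponent, chop and round_bits are choices made for this example, and the treatment of zero (returned unchanged) is an assumption, since the text only defines the operations for nonzero x.

```python
from fractions import Fraction
import math

def exponent(x):
    """Return q with 2**q <= x < 2**(q+1) for a positive Fraction x."""
    q = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** q > x:
        q -= 1
    return q

def chop(x, p):
    """chop(x, p, 2): keep the first p significand bits and discard the rest."""
    x = Fraction(x)
    if x == 0:
        return x                           # assumption: zero is returned unchanged
    if x < 0:
        return -chop(-x, p)
    scale = Fraction(2) ** (exponent(x) - p + 1)
    return scale * math.floor(x / scale)   # 2**(q-p+1) * floor(2**(p-q-1) * x)

def round_bits(x, p):
    """round(x, p, 2): add 2**(q-p), half a unit in the last kept bit, then chop."""
    x = Fraction(x)
    if x == 0:
        return x
    if x < 0:
        return -round_bits(-x, p)
    return chop(x + Fraction(2) ** (exponent(x) - p), p)

tenth = Fraction(1, 10)
print(chop(tenth, 5))                    # 25/256, which is (1.1001)_2 * 2**-4
print(round_bits(tenth, 5))              # 13/128, which is (1.1010)_2 * 2**-4
print(round_bits(Fraction(31, 16), 3))   # 2, the carry case b_2 = ... = b_p = 1
```

The second and third printed values correspond to the third and fourth cases of the rounding formula, respectively.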
Note that the number $0.1_{10}$, which has only a finite number of digits in its base 10 representation, has an infinite
number of bits in its binary representation. This illustrates a general point. In many computer computations
where numbers are represented by $\operatorname{round}(x, p, 2)$ for some fixed $p$, the only numbers that are represented exactly
are rational numbers whose denominator is a power of 2. There is an error in the representation of all other
numbers, including most numbers that are represented exactly in decimal. The absolute error in
$x_a = \operatorname{round}(x, p, 2)$ considered as an approximation to $x$ may be as large as $2^{q-p}$. We have the following analogue
to Proposition 1 in section 1.4; the proof is similar.
Proposition 1. If $x_a = \pm (1.b_2 \cdots b_p)_2 \times 2^q$ is the approximation to $x$ obtained by rounding $x$ to $p$ significant bits of
precision, then the relative error is no more than $2^{-p}$.
One consequence is that the machine $\varepsilon$ on a computer where the calculations are done with binary floating point
numbers with $p$ bits of precision is $\varepsilon = 2^{-p}$.
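The bound in Proposition 1 can be observed on an actual machine. The sketch below assumes that Python floats are binary floating point numbers rounded to the nearest representable value, with the precision p reported by sys.float_info.mant_dig (53 on common platforms); the sample values are arbitrary.

```python
import sys
from fractions import Fraction

p = sys.float_info.mant_dig            # bits of precision of a Python float
bound = Fraction(1, 2 ** p)            # the bound 2**-p from Proposition 1

for x in [Fraction(1, 10), Fraction(1, 3), Fraction(22, 7), Fraction(3, 8)]:
    xa = Fraction(float(x))            # exact rational value of the stored float
    rel_err = abs(xa - x) / abs(x)
    print(x, xa.denominator, float(rel_err), rel_err <= bound)

# Python reports the spacing between 1.0 and the next float, which is 2**(1-p),
# twice the machine epsilon 2**-p used in the text.
print(sys.float_info.epsilon == 2.0 ** (1 - p))
```

The stored value always has a power of 2 as its denominator, and only 3/8, whose denominator is already a power of 2, is stored with zero error.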
Two common choices for the number of bits of precision in computers are
25 bits (or some value close to 25 bits), which is often called single precision, and
53 bits (or some value close to 53 bits), which is called double precision.
If single precision denotes 25 bits of precision then the relative error in the stored number is at most
$2^{-25} = 1/33554432 = 2.98\ldots \times 10^{-8}$. Thus single precision representation of numbers approximates them better
than eight decimal digits of precision on the average. If double precision denotes 53 bits of precision then the
relative error in the stored number is at most $2^{-53} = 1.11\ldots \times 10^{-16}$. Thus double precision is better than sixteen decimal
digits of precision on the average. Terminology differs somewhat from author to author. For example,
Epperson [1, Problem 4, p. 27] uses the convention that single precision is seven decimal digits of precision
and double precision is 14 digits.
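The figures quoted above are easy to reproduce. This short sketch tabulates $2^{-p}$ for the two precisions used in the text, together with $-\log_{10} 2^{-p}$ as a rough count of decimal digits; the variable names are chosen for this example only.

```python
import math

precisions = {"single": 25, "double": 53}   # bit counts as used in the text

for name, p in precisions.items():
    eps = 2.0 ** -p                          # relative error bound 2**-p
    digits = -math.log10(eps)                # rough number of decimal digits
    print(f"{name}: p = {p}, 2**p = {2 ** p}, 2**-p = {eps:.3e}, digits = {digits:.2f}")
```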