
1.8 Binary floating point numbers

Computers often use base 2 for their representation of floating point numbers. A number $x$ is expressed in binary (base 2) floating point form if it is written as a signed number with magnitude at least 1 and less than 2, multiplied by an integral power of 2. More precisely, we write

$$x = \pm (1.b_2 \cdots b_j \cdots)_2 \times 2^q$$

where each $b_j$ is either zero or one (a number that is either 0 or 1 is often called a bit). The quantity $(1.b_2 \cdots b_j \cdots)_2$ represents

$$1 + \frac{b_2}{2} + \cdots + \frac{b_j}{2^{j-1}} + \cdots .$$

Example 1.

$1.011_2 \times 2^5$ represents $44_{10}$,
$1.101_2 \times 2 = 3.25_{10}$,
$-1.101_2 \times 2 = -3.25_{10}$,
$(1.1001\,1001\ldots)_2 \times 2^{-4} = 0.1_{10}$ (the pattern 1001 repeats forever).

Suppose $x = \pm (1.b_2 \cdots b_j \cdots)_2 \times 2^q$. If $x > 0$ then

$$\mathrm{chop}(x, p, 2) = (1.b_2 \cdots b_p)_2 \times 2^q = 2^{q-p+1} \left\lfloor 2^{p-q-1} x \right\rfloor$$

denotes $x$ chopped off to $p$ bits. If $x < 0$ then $\mathrm{chop}(x, p, 2) = -\mathrm{chop}(-x, p, 2)$.

If $x > 0$ then

$$\mathrm{round}(x, p, 2) = \mathrm{chop}(x + 2^{q-p}, p, 2) = \begin{cases} \mathrm{chop}(x, p, 2) & \text{if } b_{p+1} = 0 \\ (1.b_2 \cdots b_{p-1} 1)_2 \times 2^q & \text{if } b_{p+1} = 1 \text{ and } b_p = 0 \\ (1.b_2 \cdots b_{s-1} 1 0 \cdots 0)_2 \times 2^q & \text{if } b_{p+1} = 1,\ b_s = 0 \text{ and } b_{s+1} = \cdots = b_p = 1 \\ 2^{q+1} & \text{if } b_{p+1} = 1 \text{ and } b_2 = \cdots = b_p = 1 \end{cases}$$

denotes $x$ rounded to $p$ bits. If $x < 0$ then $\mathrm{round}(x, p, 2) = -\mathrm{round}(-x, p, 2)$.

To say that a computation is done with $p$ bits of precision means that a number $x$ is represented by $\mathrm{round}(x, p, 2)$ and the result of each arithmetic operation is rounded to $p$ bits.

Note that the number $0.1_{10}$, which has only a finite number of digits in its base 10 representation, has an infinite number of bits in its binary representation. This illustrates a general point. In many computer computations where numbers are represented by $\mathrm{round}(x, p, 2)$ for some fixed $p$, the only numbers that are represented exactly are rational numbers whose denominator is a power of 2 (and whose binary expansion needs no more than $p$ significant bits). There is an error in the representation of all other numbers, including most numbers that are represented exactly in decimal. The absolute error in $x_a = \mathrm{round}(x, p, 2)$, considered as an approximation to $x$, may be as large as $2^{q-p}$.
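The definitions of chop and round above can be tried out with exact rational arithmetic. The following is a minimal sketch, not part of the original notes: the function names `exponent`, `chop` and `round_bits` are hypothetical, the base is fixed at 2, and Python's `fractions.Fraction` is assumed as the exact number type so that no hidden rounding interferes with the result.

```python
from fractions import Fraction
import math


def exponent(x: Fraction) -> int:
    """Return the integer q with 2**q <= |x| < 2**(q+1), for x != 0."""
    x = abs(x)
    # The difference of bit lengths is either q or q + 1; correct it if needed.
    q = x.numerator.bit_length() - x.denominator.bit_length()
    if Fraction(2) ** q > x:
        q -= 1
    return q


def chop(x, p: int) -> Fraction:
    """x chopped off to p bits: 2**(q-p+1) * floor(2**(p-q-1) * x) for x > 0."""
    x = Fraction(x)
    if x == 0:
        return x
    if x < 0:
        return -chop(-x, p)
    scale = Fraction(2) ** (exponent(x) - p + 1)   # weight of the last kept bit
    return scale * math.floor(x / scale)


def round_bits(x, p: int) -> Fraction:
    """x rounded to p bits: chop(x + 2**(q-p), p) for x > 0."""
    x = Fraction(x)
    if x == 0:
        return x
    if x < 0:
        return -round_bits(-x, p)
    half_unit = Fraction(2) ** (exponent(x) - p)   # half the weight of the last kept bit
    return chop(x + half_unit, p)


tenth = Fraction(1, 10)             # (1.1001 1001 ...)_2 x 2**-4
print(chop(tenth, 5))               # 25/256, i.e. (1.1001)_2 x 2**-4
print(round_bits(tenth, 5))         # 13/128, i.e. (1.1010)_2 x 2**-4
print(round_bits(Fraction(15, 8), 3))  # 2, the case b_2 = ... = b_p = 1
```

With $x = 0.1$ and $p = 5$ we have $q = -4$, and the rounding error $13/128 - 1/10 = 1/640$ stays below $2^{q-p} = 1/512$, in line with the bound stated above.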
We have the following analogue to Proposition 1 in section 1.4; the proof is similar.

Proposition 1. If $x_a = \pm (1.b_2 \cdots b_p)_2 \times 2^q$ is the approximation to $x$ obtained by rounding $x$ to $p$ significant bits of precision, then the relative error is no more than $2^{-p}$.

One consequence is that the machine $\varepsilon$ on a computer where the calculations are done with binary floating point numbers with $p$ bits of precision is $\varepsilon = 2^{-p}$.

Two common choices for the number of bits of precision in computers are 25 bits (or some value close to 25 bits), which is often called single precision, and 53 bits (or some value close to 53 bits), which is called double precision.

If single precision denotes 25 bits of precision then the relative error in the stored number is at most $2^{-25} = 1/33554432 = 2.98\ldots \times 10^{-8}$. Thus single precision representation of numbers approximates them better than eight decimal digits of precision on the average. If double precision denotes 53 bits of precision then the relative error in the stored number is at most $2^{-53} = 1.11\ldots \times 10^{-16}$. Thus double precision is better than sixteen decimal digits of precision on the average.

Terminology differs somewhat from author to author. For example, Epperson [1, Problem 4, p. 27] uses the convention that single precision is seven decimal digits of precision and double precision is 14 digits.
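The bound of Proposition 1 can be checked numerically on a real machine. The sketch below is an illustration rather than part of the original notes. It assumes that Python's `float` is an IEEE 754 double with 53 bits of precision, which holds on common platforms but is not guaranteed by the language. The hardware breaks exact ties toward an even last bit rather than upward, which differs from the round function defined earlier only in the tie case and respects the same bound. The helper name `relative_error` and the sample values are hypothetical.

```python
import sys
from fractions import Fraction

p = sys.float_info.mant_dig      # bits of precision of a Python float (assumed 53)
bound = Fraction(1, 2 ** p)      # the proposition's bound 2**-p


def relative_error(x: Fraction) -> Fraction:
    """Exact relative error committed when x is stored as a Python float."""
    xa = Fraction(float(x))      # the stored value, converted back without error
    return abs(xa - x) / abs(x)


samples = [Fraction(1, 10), Fraction(1, 3), Fraction(2, 7), Fraction(13, 4)]
for x in samples:
    err = relative_error(x)
    assert err <= bound          # relative error is no more than 2**-p
    print(x, float(err))

# The two error levels quoted in the text.
print(float(Fraction(1, 2 ** 25)))   # about 2.98e-08
print(float(bound))                  # about 1.11e-16 when p = 53
```

The value 13/4 is stored with zero error because its denominator is a power of 2, while 1/10, 1/3 and 2/7 all incur a small nonzero error. Note that `sys.float_info.epsilon` reports $2^{-52}$, the gap between 1 and the next larger float, which is twice the $\varepsilon = 2^{-p}$ used in this section.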
