Mathematics and Statistics
Dr. Corcoran – STAT 6550
Computer Arithmetic
Because of the limitations of finite binary storage, computers do
not store exact representations of most numbers, nor do they
perform exact arithmetic.
• 32 bits (each bit a single 1/0 unit) are used for an integer.
• 64 bits are used for a double-precision number.
Integer Storage
For many computers, the 32 bits of a stored integer u can be thought
of as the binary coefficients xi in the representation

$$u = \sum_{i=1}^{32} x_i 2^{i-1} - 2^{31},$$
where each xi is 1 or 0. Note that in this representation, if x32 is 1
and every other xi = 0, then u = 0.
• What’s the largest possible positive integer that can be stored
using this representation?
• What’s the largest (in magnitude) negative integer?
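As a quick check (added here, not part of the original slides), R's
built-in .Machine object reports this limit, which equals 2^31 - 1:
> .Machine$integer.max
[1] 2147483647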
Integer Arithmetic
If the result of integer arithmetic puts us outside of the range of
storable values, the results can be unpredictable. For example, with a
32-bit representation, adding 1 to the largest number will result in
overflow. In R (notice how R needs to be coerced into integer
arithmetic):
> u=as.integer(0)
> b=as.integer(1)
> two=as.integer(2)
> for (i in 1:31) {u=u+b; b=b*two}
Warning message:
NAs produced by integer overflow in: b * two
> u
[1] 2147483647
> u+as.integer(1)
[1] NA
Warning message:
NAs produced by integer overflow in: u + as.integer(1)
Floating Point Storage
A floating point number is typically represented in the form

$$(-1)^{x_0}\left(\sum_{i=1}^{t} x_i 2^{-i}\right)2^{k}.$$
In this formulation:
• k is an integer called the exponent.
• xi = 0 or 1 for i = 1,…,t.
• x0 is the sign bit, with x0 = 0 for positive numbers and x0 = 1
for negative numbers.
• The fractional part (the summation) is called the mantissa.
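As a small worked instance (added here, not in the original slides):
with x0 = 0, x1 = x2 = 1 (all other xi = 0), and k = 2, the stored
value is $(-1)^0(1/2 + 1/4)\,2^2 = 3$.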
Floating Point Storage – Additional Conventions
• Note that by shifting the digits in the mantissa and making a corresponding
change in the exponent, the representation is not unique. By convention, the
exponent is chosen so that the first digit of the mantissa is 1, except if this
would put the exponent out of range.
• A 32-bit single precision floating point number is usually represented as a
sign bit, a 23-bit mantissa, and an 8-bit exponent.
• The exponent is typically shifted so that it takes the values –126 to 128. The
remaining possible value can be reserved for special values (such as
underflow, overflow [in R stored as Inf], or something like NaN [not a
number]).
• Standard double precision uses a sign bit, an 11-bit exponent, and 52 bits for
the mantissa.
Floating Point Storage – Example
Suppose that you have 8 bits of storage to represent a floating point
number. We use a sign bit, 5 bits for the mantissa, and 2 bits for the
exponent, using the same conventions as described earlier for the
32-bit representation (note that the lead bit in the mantissa need not
be stored).
What would our representation be for 1/3?
What is the difference between 1/3 and its representation?
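A sketch of one possible answer (added here, not from the original
slides), assuming the mantissa is normalized as $(0.1x_2\ldots x_6)_2$
with the leading 1 implicit, so the 5 stored bits give $t = 6$
effective mantissa bits, and assuming rounding to nearest:

$$\tfrac{1}{3} = (0.01010101\ldots)_2 = (0.1010101\ldots)_2 \times 2^{-1}
\approx (0.101011)_2 \times 2^{-1} = \tfrac{43}{128} = 0.3359375,$$

so the stored value differs from 1/3 by about $2.6 \times 10^{-3}$.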
Machine Constants
Machine constants are the numbers that characterize the limits of a
machine's arithmetic. They may include, for example:
• The largest possible positive and negative numbers (in
magnitude) that can be stored without producing overflow.
• A smallest possible positive number.
• The smallest number that when added to 1 will produce a result
different than 1.
Finding Machine Constants
Such constants are not typically readily available, so we often need
to use algorithms to obtain them. For example, see www.nr.com.
We can use the fact that R is a compiled C program to find some of
these constants. For example, the smallest possible number that can
be added to 1 and produce a result different than 1 is typically
referred to as machine epsilon, denoted by εm. Keeping in mind the
representation discussed earlier, we have
$$1 = \left(\frac{1}{2} + \sum_{i=2}^{t}\frac{0}{2^{i}}\right)2^{1}.$$
What should the next largest representable number be?
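As a hint (added here, not in the original slides): flipping the last
mantissa bit from 0 to 1 gives $\left(\tfrac{1}{2} +
\tfrac{1}{2^{t}}\right)2^{1} = 1 + 2^{1-t}$, which suggests the next
representable number above 1.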
Machine Epsilon
To find εm, recall that for double precision we have t = 52, although
we don’t need to store the leading bit (by convention) so that
effectively t = 53. Thus, $1 + 1/2^{52}$ should be different than 1. In R:
> options(digits=20)
> 1+1/2^53
[1] 1
> 1+1/2^52
[1] 1.0000000000000002
> 1/2^52
[1] 2.2204460492503131e-16
Machine Epsilon (continued)
Note that while $1 + 1/2^{52}$ may be the next largest representable
number, $1/2^{52}$ may not be εm. That is, if addition is performed at
higher accuracy, and the result is rounded to the nearest
representable number, then the next representable number larger
than $1/2^{53}$, when added to 1, should also round to this value. The
next number larger than $1/2^{53}$ should be $(1 + 1/2^{52})/2^{53} =
1/2^{53} + 1/2^{105}$. However, in R (although not in S) this doesn’t
seem to be εm (as we’ve defined it).
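One can check this directly; a quick sketch (added here, not from the
original slides), whose result may depend on the platform's rounding
behavior:
> x=(1+1/2^52)/2^53   # next representable number above 1/2^53
> (1+x)>1             # TRUE if adding x to 1 yields the next number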
Relative Error
If x is the true value of a number, and f(x) is its floating point
representation, then εm is an upper bound on the relative error of any
stored number (except in cases of overflow or underflow). Recall

$$f(x) = (-1)^{x_0}\left(\sum_{i=1}^{t} x_i 2^{-i}\right)2^{k},$$

for xi = 0, 1. Then the relative error is given by
$$\frac{|x - f(x)|}{|x|} \le \frac{2^{k}}{2^{t+1}\,|x|},$$

since otherwise x would round to another floating point number.
(Since the normalized mantissa is at least 1/2, we have $|x| \ge
2^{k-1}$, so this bound is at most $2^{-t}$, which is on the order of εm.)
Relative Error and εm
The value εm thus plays an important role in error analysis. It is
sometimes called machine precision, in addition to machine epsilon.
In double precision on a Sun, εm ≈ $1.11 \times 10^{-16}$, so that double
precision numbers are stored accurate to about 16 decimal digits.
R has an object .Machine containing many machine constants:
> .Machine$double.eps
[1] 2.2204460492503131e-16
> 1/2^52
[1] 2.2204460492503131e-16
What does this have to do with practical computing?
Definitions:
Condition – a measure that broadly represents the ease
with which a problem can be solved.
Stability – a measure of the numerical accuracy of a solution
relative to the input.
Condition
Consider the simple definition:
output = f(input)
The condition of a problem measures the relative change in the
output due to a relative change in the input:
$$\frac{|f(\mathrm{input} + \delta) - \mathrm{output}|}{|\mathrm{output}|}
\approx \mathrm{condition} \cdot \frac{|\delta|}{|\mathrm{input}|};$$

Or, in terms of derivatives, the condition number C of a problem is
approximated by

$$C \approx |x f'(x)/f(x)|.$$
Example
Consider solving the polynomial

$$z^2 + x_1 z + x_2 = 0,$$

where x1, x2 > 0.
What is the condition number of the problem?
Compare the stability of the quadratic formula to the approach that
uses the reciprocal solutions in negative powers of z.
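To get a feel for the stability comparison, here is a minimal R sketch
(added here, not from the slides), assuming the sign convention
$z^2 + x_1 z + x_2 = 0$ above and the illustrative values x1 = 1e8,
x2 = 1. The root of smaller magnitude suffers catastrophic
cancellation under the standard formula, while dividing x2 by the
accurately computed large root (the product of the roots is x2) is
stable:
x1 <- 1e8; x2 <- 1
d <- sqrt(x1^2 - 4*x2)
z.naive <- (-x1 + d)/2   # -x1 and d nearly cancel: few correct digits
z.large <- (-x1 - d)/2   # no cancellation here
z.stable <- x2/z.large   # product of the roots equals x2
c(z.naive, z.stable)     # true small root is about -1e-8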
Computing Sums
We have data x1,…,xn, and wish to compute
$$S = \sum_{i=1}^{n} x_i.$$
Given a floating point representation f(xi), we first add f(x1) to f(x2),
and that result is stored as a floating point number. Then f(x3) is
added to the sum and the result again is converted to a floating
point number, and so on.
Note that the relative error can potentially compound, and so we
must be careful in approaching summations where the
compounding error is catastrophic.
Compounding Error
• It turns out that the error bound for straightforward summation
(adding one element at a time) increases as the square of the
number of terms in the sum.
• When adding negative numbers (i.e., handling subtraction), if
two numbers with opposite signs are similar in magnitude, then
the leading digits of the mantissa will cancel, leading to
potentially large relative error.
• If all numbers have the same sign, the error bound for straight
summation can be greatly reduced by summing from smallest to
largest.
• Pairwise summation can reduce the relative error to magnitude
n log2(n). Can you explain why? How might that improve
accuracy over straight summation if n = 1000, for example?
(See the sketch below.)
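A minimal sketch of recursive pairwise summation (added here, not from
the slides): each element participates in only about log2(n)
additions, versus up to n - 1 in straight left-to-right summation.
pairwise.sum <- function(x) {
  n <- length(x)
  if (n == 1) return(x[1])
  m <- n %/% 2
  # Sum each half recursively, then combine: the recursion depth,
  # and hence the number of roundings any element passes through,
  # is about log2(n).
  pairwise.sum(x[1:m]) + pairwise.sum(x[(m + 1):n])
}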
Example – Taylor Series Summation
Consider the Taylor series approximation given by
$$\exp(x) = \sum_{i=0}^{\infty} x^{i}/i!,$$
which works well if |x| is not too large. Straight summation works
well for positive x, but not so well if x < 0:
> fexp=function(x){
+ i=0
+ expx=1
+ u=1
+ while (abs(u)>1.e-8*abs(expx)){
+   i=i+1
+   u=u*x/i
+   expx=expx+u
+ }
+ return(expx)
+ }
> options(digits=10)
> c(exp(1),fexp(1))
[1] 2.718281828 2.718281826
> c(exp(100),fexp(100))
[1] 2.688117142e+43 2.688117108e+43
> c(exp(-1),fexp(-1))
[1] 0.3678794412 0.3678794413
> c(exp(-10),fexp(-10))
[1] 4.539992976e-05 4.539992956e-05
> c(exp(-20),fexp(-20))
[1] 2.061153622e-09 5.621884467e-09
> c(exp(-30),fexp(-30))
[1] 9.357622969e-14 -3.066812359e-05
> # FOR THE SAKE OF ILLUSTRATION:
> (-20)^10/prod(1:10)
[1] 2821869.489
> (-20)^9/prod(1:9)
[1] -1410934.744
> (-20)^20/prod(1:20)
[1] 43099804.12
Example – Taylor Series Summation (continued)
Note that using straight summation for values of -20 or -30 results in
large terms that alternate in sign – some of which are much larger
than the final solution. A better approach is to note that
exp(-x) = 1/exp(x):
> fexp.better
function(x){
xa=abs(x)
i=0
expx=1
u=1
while (abs(u)>1.e-8*abs(expx)){
i=i+1
u=u*xa/i
expx=expx+u
}
if (x>=0) return(expx)
else return(1/expx)
}
> c(exp(-1),fexp.better(-1))
[1] 0.3678794412 0.3678794415
> c(exp(-10),fexp.better(-10))
[1] 4.539992976e-05 4.539992986e-05
> c(exp(-20),fexp.better(-20))
[1] 2.061153622e-09 2.061153632e-09
> c(exp(-30),fexp.better(-30))
[1] 9.357622969e-14 9.357623008e-14
>
Example – Computing the Sample Variance
The familiar formula:
$$s^2 = \frac{1}{n-1}\left(\sum_i x_i^2 - \frac{1}{n}\Bigl(\sum_i x_i\Bigr)^2\right)$$
> x=c(0.999999998,0.999999999,1.0,1.000000001,1.000000002)
> n=length(x)
> # SO-CALLED "ONE-PASS" APPROACH:
> (sum(x^2)-sum(x)^2/n)/(n-as.integer(1))
[1] 0
> # "TWO-PASS" APPROACH:
> sum((x-sum(x)/n)^2)/(n-as.integer(1))
[1] 2.5000000251237998e-18
>
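When the data can only be visited once, a numerically stable
alternative to the textbook one-pass formula is Welford's updating
algorithm. A minimal sketch (added here, not from the original
slides):
var.welford <- function(x) {
  m <- 0    # running mean
  ss <- 0   # running sum of squared deviations from the current mean
  for (i in seq_along(x)) {
    d <- x[i] - m
    m <- m + d/i
    ss <- ss + d*(x[i] - m)  # uses the updated mean; avoids cancellation
  }
  ss/(length(x) - 1)
}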