EE103 (Shinnerl)

Floating-Point Representation and Approximation Errors
Cancellation
Instability
Simple one-variable examples
Swamping

IEEE floating point numbers
• floating point numbers with base 10
• floating point numbers with base 2
• IEEE floating point standard
• machine precision
• rounding error
Floating point numbers with base 10

Notation
x = ±(0.d1 d2 . . . dn)_10 · 10^e,   d_i ∈ {0, 1, . . . , 9},   d1 ≠ 0 if x ≠ 0

• 0.d1 d2 . . . dn is the mantissa
• n is the mantissa length (or precision)
• e is the exponent (emin ≤ e ≤ emax)
• the restriction that d1 ≠ 0 when x ≠ 0 ensures that each floating-point number has a unique representation

This format is used in pocket calculators.

Interpretation
x = ±(d1 · 10^-1 + d2 · 10^-2 + · · · + dn · 10^-n) · 10^e

Example (with n = 7)
12.625 = +(.1262500)_10 · 10^2
       = +(1 · 10^-1 + 2 · 10^-2 + 6 · 10^-3 + 2 · 10^-4 + 5 · 10^-5 + 0 · 10^-6 + 0 · 10^-7) · 10^2
Properties

— a finite set of numbers
— unequally spaced: the distance between consecutive floating point numbers varies;
  the smallest number greater than 1 is 1 + 10^(-n+1), the smallest number greater than 10 is 10 + 10^(-n+2), . . .
— largest positive number:
  +(.999 · · · 9)_10 · 10^emax = (1 - 10^-n) · 10^emax
— smallest (normalized) positive number:
  xmin = +(.100 · · · 0)_10 · 10^emin = 10^(emin-1)

Format of Normalized Floating-Point Numbers in Base 2

x = ±(1 + 0.b1 b2 . . . bn)_2 · 2^e

• f ≡ 0.b1 b2 . . . bn is the mantissa (b_i ∈ {0, 1}); restricting the mantissa to [0, 1) ensures uniqueness of the representation
• n is the mantissa length (or precision)
• e is the exponent (emin ≤ e ≤ emax)
• there are special representations for 0, NaN, and Inf
• in practice, the system also includes tiny 'subnormal numbers' of the form xsub = ±(0.b1 b2 . . . bn)_2 · 2^emin
• this format is used in almost all computers
A finite set of unequally spaced numbers

[Figure: the floating point numbers for n = 3, emin = -1, emax = 2, marked on the number line between 1/4 and 3 1/2; the spacing between consecutive numbers doubles at each power of 2.]

— Largest positive number:
  xmax = +(1 + 0.111 · · · 1)_2 · 2^emax = (2 - 2^-n) · 2^emax
— Smallest positive normalized number:
  xmin = +(1 + 0.000 · · · 0)_2 · 2^emin = 2^emin
— Zero is stored as 0 = ±(0.00 . . . 0)_2 · 2^emin.

Interpretation
x = ±(1 + b1 · 2^-1 + b2 · 2^-2 + · · · + bn · 2^-n) · 2^e

Example (with n = 8):
12.625 = +(1 + 0.10010100)_2 · 2^3
       = +(1 + 1 · 2^-1 + 0 · 2^-2 + 0 · 2^-3 + 1 · 2^-4 + 0 · 2^-5 + 1 · 2^-6 + 0 · 2^-7 + 0 · 2^-8) · 2^3
       = 8 + 4 + 1/2 + 1/8
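The mantissa bits in such an example can be recovered mechanically by repeated doubling. A minimal Matlab sketch (the variable names m, e, bits are illustrative, not from the notes):

x = 12.625;
e = floor(log2(x));        % exponent: the largest e with 2^e <= x
m = x/2^e - 1;             % fractional mantissa 0.b1b2... in [0, 1)
bits = zeros(1, 8);        % recover the first n = 8 mantissa bits
for i = 1:8
    m = 2*m;
    bits(i) = floor(m);    % next binary digit b_i
    m = m - bits(i);
end
disp(bits)                 % 1 0 0 1 0 1 0 0: 12.625 = (1 + 0.10010100)_2 * 2^3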
IEEE standard for binary arithmetic

The IEEE standard specifies two binary floating point number formats.

IEEE standard single precision: n = 23, emin = -126, emax = 127
Requires 32 bits: 1 sign bit, 23 bits for the mantissa, 8 bits for the exponent.

IEEE standard double precision: n = 52, emin = -1022, emax = 1023
Requires 64 bits: 1 sign bit, 52 bits for the mantissa, 11 bits for the exponent. Used in almost all modern computers.

Machine precision

Definition: the machine precision of a binary floating point number system with mantissa length n is defined as
ε_M = 2^-(n+1)

Example: IEEE standard double precision (n = 52):
ε_M = 2^-53 ≈ 1.1102 · 10^-16
(Machine epsilon, eps = 2ε_M = 2.22 · 10^-16, is pre-defined in Matlab.)

Interpretation: 1 + 2ε_M ≡ 1 + eps is the smallest floating point number greater than 1:
(.10 · · · 01)_2 · 2^1 = 1 + 2^-n = 1 + 2ε_M
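These facts are easy to verify interactively. A quick sketch (the exact display format varies with the Matlab version):

>> eps                  % 2.2204e-16, equal to 2*ε_M: the gap from 1 to the next float
>> (1 + eps) > 1        % true: 1 + eps is exactly representable
>> (1 + eps/2) > 1      % false: 1 + ε_M is rounded back to 1 (round to nearest even)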
Rounding

A floating-point number system is a finite set of numbers; all other numbers must be rounded.

Notation: fl(x) denotes the floating-point representation of x.

Rounding rules used in practice: numbers are rounded to the nearest floating-point number; in case of a tie, round to the number whose least significant bit is 0 ('round to nearest even').

Example
Numbers x ∈ (1, 1 + 2ε_M) are rounded to 1 or to 1 + 2ε_M:
fl(x) = 1          if 1 < x ≤ 1 + ε_M
fl(x) = 1 + 2ε_M   if 1 + ε_M < x < 1 + 2ε_M

This gives another interpretation of ε_M: numbers between 1 and 1 + ε_M are indistinguishable from 1.
Rounding error and machine precision

Fact:
|fl(x) - x| / |x| ≤ ε_M

— Machine precision gives a bound on the relative error due to rounding.
— The number of correct (decimal) digits in fl(x) is roughly -log10 ε_M, i.e., about 15 or 16 in IEEE double precision.
— This is a fundamental limit on the accuracy of numerical computations.

Example
Significant loss of accuracy can result from representing simple decimal expressions in binary.

Exercise:
(1/10)_10 = (0.0001100110011 . . .)_2

Historical significance: Gulf War, 1991 (the Patriot missile timing failure, caused by accumulated rounding error in the binary representation of 0.1).
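The effect is visible directly in Matlab: fl(0.1) is not exactly one tenth, and the representation errors propagate through arithmetic. A small illustrative check:

>> fprintf('%.20f\n', 0.1)     % prints 0.10000000000000000555
>> 0.1 + 0.1 + 0.1 == 0.3      % false: the three rounding errors do not cancel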
Exercises

Explain the following Matlab results (Matlab uses IEEE double precision):

>> (1 + 1e-16) - 1
ans = 0
>> (1 + 2e-16) - 1
ans = 2.2204e-16
>> (1 - 1e-16) - 1
ans = -1.1102e-16
>> 1 + (1e-16 - 1)
ans = 1.1102e-16

Run the following code in Matlab and explain the result:

x = 2;
for i=1:54
    x = sqrt(x);
end
for i=1:54
    x = x^2;
end
Measuring error

Given: an approximation x̂ of a real number x.

absolute error: |x̂ - x|
relative error: |x̂ - x| / |x|   (if x ≠ 0)

The number of correct significant digits is equal to r if
0.5 · 10^-r < |x̂ - x| / |x| ≤ 5 · 10^-r

Explain the following results (log(1 + x)/x ≈ 1 for small x):

>> log(1+3e-16)/3e-16
ans = 0.7401
>> log(1+3e-16)/((1+3e-16)-1)
ans = 1.0000
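The bracket above determines r uniquely, because the interval (0.5 · 10^-r, 5 · 10^-r] spans exactly one factor of 10. A hypothetical Matlab helper (the name correct_digits and its interface are illustrative, not part of the notes):

function r = correct_digits(xhat, x)
    % number of correct significant digits, per the definition above;
    % assumes x ~= 0 and xhat ~= x
    relerr = abs(xhat - x) / abs(x);
    r = floor(log10(0.5/relerr)) + 1;   % the unique integer in the bracket
end

For instance, correct_digits(3.1430, pi) returns 4, in agreement with the example below.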
Example

x = π = 3.141592 . . .

x̂ = 3.1419, x̂ = 3.1421, and x̂ = 3.1430 are all correct to 4 significant digits, according to these definitions.

In case x is close to zero, a combination of relative and absolute error can be used:
|x̂ - x| / (|x| + c),
where c > 0 is a small, fixed cutoff specifying the radius of the neighborhood of zero in which absolute error should be used.

Please see the supplemental handout online.

Cancellation

â = a(1 + Δa),   b̂ = b(1 + Δb)

• a, b: exact data; â, b̂: approximations; Δa, Δb: unknown relative errors
• the relative error in x̂ = â - b̂ = (a - b) + (aΔa - bΔb), as an approximation of x = a - b, is
  |x̂ - x| / |x| = |aΔa - bΔb| / |a - b|
• if a ≈ b, small Δa and Δb can lead to very large relative errors in x̂.
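A numerical illustration of this formula (the data values here are made up for the demonstration):

>> a = 1.23456789012;  b = 1.23456789011;        % a - b = 1e-11
>> ahat = a*(1 + 1e-14);  bhat = b*(1 - 1e-14);  % relative data errors of size 1e-14
>> abs((ahat - bhat) - (a - b)) / abs(a - b)     % roughly 2.5e-3 = |aΔa - bΔb|/|a - b|

Data errors of order 10^-14 become a relative error of order 10^-3 in the difference: a loss of about eleven digits.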
Cancellation occurs when:
• we subtract two numbers that are almost equal
• one or both are subject to error

Instability is often (but not always) caused by cancellation. Reformulating calculations to avoid cancellation can dramatically improve their accuracy.

Roots of a quadratic equation

ax^2 + bx + c = 0   (a ≠ 0)

Algorithm 1: use the formulas
x1 = (-b + sqrt(b^2 - 4ac)) / (2a),   x2 = (-b - sqrt(b^2 - 4ac)) / (2a)

These are unstable if b^2 ≫ |4ac|:
• If b ≤ 0, cancellation occurs in x2 (-b ≈ sqrt(b^2 - 4ac)).
• If b ≥ 0, cancellation occurs in x1 (b ≈ sqrt(b^2 - 4ac)).
• In both cases, b may be exact, but the square root introduces a small error.

The roots of the quadratic can be calculated another way.
Algorithm 2

Notice that

x2 = (-b - sqrt(b^2 - 4ac)) / (2a)
   = [(-b - sqrt(b^2 - 4ac)) / (2a)] · [(-b + sqrt(b^2 - 4ac)) / (-b + sqrt(b^2 - 4ac))]
   = (b^2 - (b^2 - 4ac)) / (2a · (-b + sqrt(b^2 - 4ac)))
   = 2c / (-b + sqrt(b^2 - 4ac))
   = c / (a · x1).

Therefore:

• if b ≤ 0, calculate
  x1 = (-b + sqrt(b^2 - 4ac)) / (2a),   x2 = c / (a · x1)
• if b > 0, calculate
  x2 = (-b - sqrt(b^2 - 4ac)) / (2a),   and similarly x1 = c / (a · x2)

. . . no cancellation!
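Algorithm 2 translates directly into Matlab. A minimal sketch (the function name quadroots is illustrative; it assumes a ≠ 0, real roots, and c ≠ 0 so that neither root is zero):

function [x1, x2] = quadroots(a, b, c)
    d = sqrt(b^2 - 4*a*c);        % assumes b^2 - 4*a*c >= 0
    if b <= 0
        x1 = (-b + d)/(2*a);      % -b >= 0 is added to d: no cancellation
        x2 = c/(a*x1);            % from the root product x1*x2 = c/a
    else
        x2 = (-b - d)/(2*a);      % two negative terms: no cancellation
        x1 = c/(a*x2);
    end
end

For example, with a = 1, b = 1e8, c = 1 (so b^2 ≫ |4ac|), Algorithm 1 returns x1 ≈ -1.5e-8 in double precision, while quadroots recovers the true root x1 ≈ -1.0e-8 essentially to full precision.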
Exercise

The function chop(x,n) rounds x to n decimal digits (for example, chop(pi,4) returns 3.14200000000000).

Evaluate sum_{k=1}^{3000} k^-2 = 1.6446, rounding all intermediate results to 4 digits:

>> sum = 0;
>> for k=1:3000
     sum = chop(sum+1/k^2, 4);
   end
>> sum
sum = 1.6240

This result has only two correct digits, but there is no cancellation (there are no subtractions). Explain, and propose a better method.

Exercise

Cancellation occurs in (1 - cos x)/sin x for x ≈ 0:

>> x = 1e-2;
>> (1-chop(cos(x),4))/chop(sin(x),4)
ans = 0

The exact value is about 0.005. Give a stable alternative method.
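One stable reformulation (offered as a hint, not the only possible answer) uses the half-angle identity (1 - cos x)/sin x = tan(x/2), which removes the subtraction entirely:

>> x = 1e-2;
>> tan(x/2)               % 0.0050000417..., computed without cancellation
>> chop(tan(x/2), 4)      % 0.005000: correct even in 4-digit arithmetic

The algebraically equivalent form sin(x)/(1 + cos(x)) works equally well, since it too subtracts nothing.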
Swamping

Let ε_M = 2^-(n+1) denote machine precision (on a machine with n binary digits in the mantissa). Suppose we are given positive floating point values a and b of very different sizes.

Definition: If 0 < fl(b) < ε_M · fl(a), then
fl(a + b) = fl(a),
and we say that b is swamped by a in the sum.

Example:
>> 1.5 + 1.0e-17
ans = 1.50000000000000
Explain.

Exercise

The number e = 2.7182818 . . . can be defined as
e = lim_{n→∞} (1 + 1/n)^n.
This suggests an algorithm for calculating e: choose n large and evaluate
ê = (1 + 1/n)^n.
Results:

  n       ê              # correct digits
  10^4    2.718145926    4
  10^8    2.718281798    7
  10^12   2.718523496    4
  10^16   1.000000000    0
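The table can be reproduced with a short loop (a sketch; the trailing digits may differ slightly across Matlab versions, but the rise-and-collapse pattern is what needs explaining):

>> for p = [4 8 12 16]
     n = 10^p;
     fprintf('n = 1e%d: ehat = %.9f\n', p, (1 + 1/n)^n);
   end

For n = 10^16, 1/n = 10^-16 < ε_M, so fl(1 + 1/n) = 1 and ê = 1, exactly as in the earlier exercise (1 + 1e-16) - 1 = 0.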
Swamping Example

Recall the earlier exercise: evaluate sum_{k=1}^{3000} k^-2 = 1.6446, rounding all intermediate results to 4 digits. The forward-order loop

>> sum = 0;
>> for k=1:3000
     sum = chop(sum+1/k^2, 4);
   end
>> sum
sum = 1.6240

has only two correct digits, yet involves no cancellation (there are no subtractions). The culprit is swamping: once the running sum exceeds 1, every term 1/k^2 < 0.0005 (i.e., every term with k ≥ 45) is rounded away entirely, so the whole tail of the series (roughly 0.02) is lost.

Solution: In a long sum over gradually decreasing terms, swamping can be avoided by adding the smaller terms together first. Simply reversing the order of the sum restores the accuracy:

>> sum = 0;
>> for k=3000:-1:1
     sum = chop(sum+1/k^2, 4);
   end
>> sum
sum = 1.6450
Exercise

Show that in finite precision, the harmonic series
sum_{k=1}^{∞} 1/k = 1 + 1/2 + 1/3 + 1/4 + · · ·
appears to converge if the terms are added in the given (descending) order. Determine the least value N0 for which
fl( sum_{k=1}^{N} 1/k ) = fl( sum_{k=1}^{N0} 1/k )   for all N ≥ N0.
Write a Matlab function that accurately approximates the partial sum sum_{k=1}^{N} 1/k for all values of N, including N > N0.
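A hedged sketch of one possible answer to the last part (the function name harmonic is illustrative): summing from the smallest term upward avoids swamping, so the partial sum remains accurate for any N, including N > N0.

function s = harmonic(N)
    % partial sum 1 + 1/2 + ... + 1/N, accumulated smallest-first
    s = 0;
    for k = N:-1:1
        s = s + 1/k;
    end
end

For very large N the loop becomes slow; the asymptotic formula sum_{k=1}^{N} 1/k ≈ log(N) + 0.5772156649 + 1/(2N) (Euler's constant) is then a practical substitute.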
Conclusions

• Floating point arithmetic is not associative: for ◦ ∈ {+, -, ∗, /},
  fl((a ◦ b) ◦ c) may not equal fl(a ◦ (b ◦ c)),
  so mathematically equivalent expressions can give different computed results. (Individual IEEE operations are correctly rounded, and + and ∗ remain commutative: fl(a ◦ b) = fl(b ◦ a).)
• Whenever possible, reformulate calculations to avoid unnecessary cancellation error (e.g., in the quadratic formula).
• In a long sum of positive elements, use a loop ordering that avoids swamping by adding the smaller terms together first.
Summary

The conditioning of a mathematical problem
• sensitivity of the solution with respect to perturbations in the data
• a property of the problem itself, independent of the solution method
• ill-conditioned problems are 'almost unsolvable' in practice (i.e., in the presence of data uncertainty): even if we solve the problem exactly, the solution may be meaningless
• ill-conditioned problems are close to ill-posed problems: there exist small perturbations of the data which make the problem unsolvable in exact arithmetic

Precision of a computer
• a machine property (usually IEEE double precision, i.e., about 15 significant decimal digits)
• a bound on the rounding error introduced when representing numbers in finite precision

Stability of an algorithm
• a property of a numerical algorithm
• a stable algorithm computes the exact solution of a slightly different problem

Accuracy of a numerical result
• determined by: machine precision, accuracy of the data, conditioning of the problem, and the stability of the algorithm
• usually far fewer than 16 significant digits