EE103 (Shinnerl)

Floating-Point Representation and Approximation Errors
Cancellation
Instability
Simple one-variable examples
Swamping

IEEE floating point numbers
• floating point numbers with base 10
• floating point numbers with base 2
• IEEE floating point standard
• machine precision
• rounding error
Floating point numbers with base 10

Notation
x = ±(0.d1 d2 . . . dn)_10 · 10^e,   d_i ∈ {0, 1, . . . , 9},   d1 ≠ 0 if x ≠ 0

• 0.d1 d2 . . . dn is the mantissa
• n is the mantissa length (or precision)
• e is the exponent (emin ≤ e ≤ emax)
• the restriction that d1 ≠ 0 when x ≠ 0 ensures that each floating-point number has a unique representation

This format is used in pocket calculators.

Interpretation
x = ±(d1 · 10^-1 + d2 · 10^-2 + · · · + dn · 10^-n) · 10^e

Example (with n = 7)
12.625 = +(.1262500)_10 · 10^2
       = +(1 · 10^-1 + 2 · 10^-2 + 6 · 10^-3 + 2 · 10^-4 + 5 · 10^-5 + 0 · 10^-6 + 0 · 10^-7) · 10^2
Properties

— a finite set of numbers
— unequally spaced: the distance between consecutive floating point numbers varies;
  the smallest number greater than 1 is 1 + 10^(-n+1), the smallest number greater than 10 is 10 + 10^(-n+2), . . .
— largest positive number:
  +(.999 · · · 9)_10 · 10^emax = (1 - 10^-n) · 10^emax
— smallest (normalized) positive number:
  xmin = +(.100 · · · 0)_10 · 10^emin = 10^(emin-1)

Format of Normalized Floating-Point Numbers in Base 2

x = ±(1 + 0.b1 b2 . . . bn)_2 · 2^e

• f ≡ 0.b1 b2 . . . bn is the mantissa (b_i ∈ {0, 1}); restricting the mantissa to [0, 1) ensures uniqueness of the representation
• n is the mantissa length (or precision)
• e is the exponent (emin ≤ e ≤ emax)
• there are special representations for 0, NaN, and Inf
• in practice, the system also includes tiny 'subnormal numbers' of the form xsub = ±(0.b1 b2 . . . bn)_2 · 2^emin
• this format is used in almost all computers
A finite set of unequally spaced numbers

[Figure: the floating point numbers for n = 3, emin = -1, emax = 2, marked on the number line between 1/4 and 3 1/2; the spacing between consecutive numbers doubles at each power of 2.]

— Largest positive number:
  xmax = +(1 + 0.111 · · · 1)_2 · 2^emax = (2 - 2^-n) · 2^emax
— Smallest positive normalized number:
  xmin = +(1 + 0.000 · · · 0)_2 · 2^emin = 2^emin
— Zero is stored as 0 = ±(0.00 . . . 0)_2 · 2^emin.

Interpretation
x = ±(1 + b1 · 2^-1 + b2 · 2^-2 + · · · + bn · 2^-n) · 2^e

Example (with n = 8):
12.625 = +(1 + 0.10010100)_2 · 2^3
       = +(1 + 1 · 2^-1 + 0 · 2^-2 + 0 · 2^-3 + 1 · 2^-4 + 0 · 2^-5 + 1 · 2^-6 + 0 · 2^-7 + 0 · 2^-8) · 2^3
       = 8 + 4 + 1/2 + 1/8
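The mantissa bits in such an example can be recovered mechanically by repeated doubling. A minimal Matlab sketch (the variable names m, e, bits are illustrative, not from the notes):

x = 12.625;
e = floor(log2(x));        % exponent: the largest e with 2^e <= x
m = x/2^e - 1;             % fractional mantissa 0.b1b2... in [0, 1)
bits = zeros(1, 8);        % recover the first n = 8 mantissa bits
for i = 1:8
    m = 2*m;
    bits(i) = floor(m);    % next binary digit b_i
    m = m - bits(i);
end
disp(bits)                 % 1 0 0 1 0 1 0 0: 12.625 = (1 + 0.10010100)_2 * 2^3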
IEEE standard for binary arithmetic

The IEEE standard specifies two binary floating point number formats.

IEEE standard single precision: n = 23, emin = -126, emax = 127
Requires 32 bits: 1 sign bit, 23 bits for the mantissa, 8 bits for the exponent.

IEEE standard double precision: n = 52, emin = -1022, emax = 1023
Requires 64 bits: 1 sign bit, 52 bits for the mantissa, 11 bits for the exponent. Used in almost all modern computers.

Machine precision

Definition: the machine precision of a binary floating point number system with mantissa length n is defined as
ε_M = 2^-(n+1)

Example: IEEE standard double precision (n = 52):
ε_M = 2^-53 ≈ 1.1102 · 10^-16
(Machine epsilon, eps = 2ε_M = 2.22 · 10^-16, is pre-defined in Matlab.)

Interpretation: 1 + 2ε_M ≡ 1 + eps is the smallest floating point number greater than 1:
(.10 · · · 01)_2 · 2^1 = 1 + 2^-n = 1 + 2ε_M
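These facts are easy to verify interactively. A quick sketch (the exact display format varies with the Matlab version):

>> eps                  % 2.2204e-16, equal to 2*ε_M: the gap from 1 to the next float
>> (1 + eps) > 1        % true: 1 + eps is exactly representable
>> (1 + eps/2) > 1      % false: 1 + ε_M is rounded back to 1 (round to nearest even)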
Rounding

A floating-point number system is a finite set of numbers; all other numbers must be rounded.

Notation: fl(x) denotes the floating-point representation of x.

Rounding rules used in practice: numbers are rounded to the nearest floating-point number; in case of a tie, round to the number whose least significant bit is 0 ('round to nearest even').

Example
Numbers x ∈ (1, 1 + 2ε_M) are rounded to 1 or to 1 + 2ε_M:
fl(x) = 1          if 1 < x ≤ 1 + ε_M
fl(x) = 1 + 2ε_M   if 1 + ε_M < x < 1 + 2ε_M

This gives another interpretation of ε_M: numbers between 1 and 1 + ε_M are indistinguishable from 1.
Rounding error and machine precision

Fact:
|fl(x) - x| / |x| ≤ ε_M

— Machine precision gives a bound on the relative error due to rounding.
— The number of correct (decimal) digits in fl(x) is roughly -log10 ε_M, i.e., about 15 or 16 in IEEE double precision.
— This is a fundamental limit on the accuracy of numerical computations.

Example
Significant loss of accuracy can result from representing simple decimal expressions in binary.

Exercise:
(1/10)_10 = (0.0001100110011 . . .)_2

Historical significance: Gulf War, 1991 (the Patriot missile timing failure, caused by accumulated rounding error in the binary representation of 0.1).
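The effect is visible directly in Matlab: fl(0.1) is not exactly one tenth, and the representation errors propagate through arithmetic. A small illustrative check:

>> fprintf('%.20f\n', 0.1)     % prints 0.10000000000000000555
>> 0.1 + 0.1 + 0.1 == 0.3      % false: the three rounding errors do not cancel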
Exercises

Explain the following Matlab results (Matlab uses IEEE double precision):

>> (1 + 1e-16) - 1
ans = 0
>> (1 + 2e-16) - 1
ans = 2.2204e-16
>> (1 - 1e-16) - 1
ans = -1.1102e-16
>> 1 + (1e-16 - 1)
ans = 1.1102e-16

Run the following code in Matlab and explain the result:

x = 2;
for i=1:54
    x = sqrt(x);
end
for i=1:54
    x = x^2;
end
Measuring error

Given: an approximation x̂ of a real number x.

absolute error: |x̂ - x|
relative error: |x̂ - x| / |x|   (if x ≠ 0)

The number of correct significant digits is equal to r if
0.5 · 10^-r < |x̂ - x| / |x| ≤ 5 · 10^-r

Explain the following results (log(1 + x)/x ≈ 1 for small x):

>> log(1+3e-16)/3e-16
ans = 0.7401
>> log(1+3e-16)/((1+3e-16)-1)
ans = 1.0000
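The bracket above determines r uniquely, because the interval (0.5 · 10^-r, 5 · 10^-r] spans exactly one factor of 10. A hypothetical Matlab helper (the name correct_digits and its interface are illustrative, not part of the notes):

function r = correct_digits(xhat, x)
    % number of correct significant digits, per the definition above;
    % assumes x ~= 0 and xhat ~= x
    relerr = abs(xhat - x) / abs(x);
    r = floor(log10(0.5/relerr)) + 1;   % the unique integer in the bracket
end

For instance, correct_digits(3.1430, pi) returns 4, in agreement with the example below.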
Example

x = π = 3.141592 . . .

x̂ = 3.1419, x̂ = 3.1421, and x̂ = 3.1430 are all correct to 4 significant digits, according to these definitions.

In case x is close to zero, a combination of relative and absolute error can be used:
|x̂ - x| / (|x| + c),
where c > 0 is a small, fixed cutoff specifying the radius of the neighborhood of zero in which absolute error should be used.

Please see the supplemental handout online.

Cancellation

â = a(1 + Δa),   b̂ = b(1 + Δb)

• a, b: exact data; â, b̂: approximations; Δa, Δb: unknown relative errors
• the relative error in x̂ = â - b̂ = (a - b) + (aΔa - bΔb), as an approximation of x = a - b, is
  |x̂ - x| / |x| = |aΔa - bΔb| / |a - b|
• if a ≈ b, small Δa and Δb can lead to very large relative errors in x̂.
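A numerical illustration of this formula (the data values here are made up for the demonstration):

>> a = 1.23456789012;  b = 1.23456789011;        % a - b = 1e-11
>> ahat = a*(1 + 1e-14);  bhat = b*(1 - 1e-14);  % relative data errors of size 1e-14
>> abs((ahat - bhat) - (a - b)) / abs(a - b)     % roughly 2.5e-3 = |aΔa - bΔb|/|a - b|

Data errors of order 10^-14 become a relative error of order 10^-3 in the difference: a loss of about eleven digits.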
Cancellation occurs when:
• we subtract two numbers that are almost equal
• one or both are subject to error

Instability is often (but not always) caused by cancellation. Reformulating calculations to avoid cancellation can dramatically improve their accuracy.

Roots of a quadratic equation

ax^2 + bx + c = 0   (a ≠ 0)

Algorithm 1: use the formulas
x1 = (-b + sqrt(b^2 - 4ac)) / (2a),   x2 = (-b - sqrt(b^2 - 4ac)) / (2a)

These are unstable if b^2 ≫ |4ac|:
• If b ≤ 0, cancellation occurs in x2 (-b ≈ sqrt(b^2 - 4ac)).
• If b ≥ 0, cancellation occurs in x1 (b ≈ sqrt(b^2 - 4ac)).
• In both cases, b may be exact, but the square root introduces a small error.

The roots of the quadratic can be calculated another way.
Algorithm 2

Notice that

x2 = (-b - sqrt(b^2 - 4ac)) / (2a)
   = [(-b - sqrt(b^2 - 4ac)) / (2a)] · [(-b + sqrt(b^2 - 4ac)) / (-b + sqrt(b^2 - 4ac))]
   = (b^2 - (b^2 - 4ac)) / (2a · (-b + sqrt(b^2 - 4ac)))
   = 2c / (-b + sqrt(b^2 - 4ac))
   = c / (a · x1).

Therefore:

• if b ≤ 0, calculate
  x1 = (-b + sqrt(b^2 - 4ac)) / (2a),   x2 = c / (a · x1)
• if b > 0, calculate
  x2 = (-b - sqrt(b^2 - 4ac)) / (2a),   and similarly x1 = c / (a · x2)

. . . no cancellation!
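Algorithm 2 translates directly into Matlab. A minimal sketch (the function name quadroots is illustrative; it assumes a ≠ 0, real roots, and c ≠ 0 so that neither root is zero):

function [x1, x2] = quadroots(a, b, c)
    d = sqrt(b^2 - 4*a*c);        % assumes b^2 - 4*a*c >= 0
    if b <= 0
        x1 = (-b + d)/(2*a);      % -b >= 0 is added to d: no cancellation
        x2 = c/(a*x1);            % from the root product x1*x2 = c/a
    else
        x2 = (-b - d)/(2*a);      % two negative terms: no cancellation
        x1 = c/(a*x2);
    end
end

For example, with a = 1, b = 1e8, c = 1 (so b^2 ≫ |4ac|), Algorithm 1 returns x1 ≈ -1.5e-8 in double precision, while quadroots recovers the true root x1 ≈ -1.0e-8 essentially to full precision.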
Exercise

The function chop(x,n) rounds x to n decimal digits (for example, chop(pi,4) returns 3.14200000000000).

Evaluate sum_{k=1}^{3000} k^-2 = 1.6446, rounding all intermediate results to 4 digits:

>> sum = 0;
>> for k=1:3000
     sum = chop(sum+1/k^2, 4);
   end
>> sum
sum = 1.6240

This result has only two correct digits, but there is no cancellation (there are no subtractions). Explain, and propose a better method.

Exercise

Cancellation occurs in (1 - cos x)/sin x for x ≈ 0:

>> x = 1e-2;
>> (1-chop(cos(x),4))/chop(sin(x),4)
ans = 0

The exact value is about 0.005. Give a stable alternative method.
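One stable reformulation (offered as a hint, not the only possible answer) uses the half-angle identity (1 - cos x)/sin x = tan(x/2), which removes the subtraction entirely:

>> x = 1e-2;
>> tan(x/2)               % 0.0050000417..., computed without cancellation
>> chop(tan(x/2), 4)      % 0.005000: correct even in 4-digit arithmetic

The algebraically equivalent form sin(x)/(1 + cos(x)) works equally well, since it too subtracts nothing.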
Swamping

Let ε_M = 2^-(n+1) denote machine precision (on a machine with n binary digits in the mantissa). Suppose we are given positive floating point values a and b of very different sizes.

Definition: If 0 < fl(b) < ε_M · fl(a), then
fl(a + b) = fl(a),
and we say that b is swamped by a in the sum.

Example:
>> 1.5 + 1.0e-17
ans = 1.50000000000000
Explain.

Exercise

The number e = 2.7182818 . . . can be defined as
e = lim_{n→∞} (1 + 1/n)^n.
This suggests an algorithm for calculating e: choose n large and evaluate
ê = (1 + 1/n)^n.
Results:

  n       ê              # correct digits
  10^4    2.718145926    4
  10^8    2.718281798    7
  10^12   2.718523496    4
  10^16   1.000000000    0
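The table can be reproduced with a short loop (a sketch; the trailing digits may differ slightly across Matlab versions, but the rise-and-collapse pattern is what needs explaining):

>> for p = [4 8 12 16]
     n = 10^p;
     fprintf('n = 1e%d: ehat = %.9f\n', p, (1 + 1/n)^n);
   end

For n = 10^16, 1/n = 10^-16 < ε_M, so fl(1 + 1/n) = 1 and ê = 1, exactly as in the earlier exercise (1 + 1e-16) - 1 = 0.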
Swamping Example

Recall the earlier exercise: evaluate sum_{k=1}^{3000} k^-2 = 1.6446, rounding all intermediate results to 4 digits. The forward-order loop

>> sum = 0;
>> for k=1:3000
     sum = chop(sum+1/k^2, 4);
   end
>> sum
sum = 1.6240

has only two correct digits, yet involves no cancellation (there are no subtractions). The culprit is swamping: once the running sum exceeds 1, every term 1/k^2 < 0.0005 (i.e., every term with k ≥ 45) is rounded away entirely, so the whole tail of the series (roughly 0.02) is lost.

Solution: In a long sum over gradually decreasing terms, swamping can be avoided by adding the smaller terms together first. Simply reversing the order of the sum restores the accuracy:

>> sum = 0;
>> for k=3000:-1:1
     sum = chop(sum+1/k^2, 4);
   end
>> sum
sum = 1.6450
Exercise

Show that in finite precision, the harmonic series
sum_{k=1}^{∞} 1/k = 1 + 1/2 + 1/3 + 1/4 + · · ·
appears to converge if the terms are added in the given (descending) order. Determine the least value N0 for which
fl( sum_{k=1}^{N} 1/k ) = fl( sum_{k=1}^{N0} 1/k )   for all N ≥ N0.
Write a Matlab function that accurately approximates the partial sum sum_{k=1}^{N} 1/k for all values of N, including N > N0.
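A hedged sketch of one possible answer to the last part (the function name harmonic is illustrative): summing from the smallest term upward avoids swamping, so the partial sum remains accurate for any N, including N > N0.

function s = harmonic(N)
    % partial sum 1 + 1/2 + ... + 1/N, accumulated smallest-first
    s = 0;
    for k = N:-1:1
        s = s + 1/k;
    end
end

For very large N the loop becomes slow; the asymptotic formula sum_{k=1}^{N} 1/k ≈ log(N) + 0.5772156649 + 1/(2N) (Euler's constant) is then a practical substitute.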
Conclusions

• Floating point arithmetic is not associative: for ◦ ∈ {+, -, ∗, /},
  fl((a ◦ b) ◦ c) may not equal fl(a ◦ (b ◦ c)),
  so mathematically equivalent expressions can give different computed results. (Individual IEEE operations are correctly rounded, and + and ∗ remain commutative: fl(a ◦ b) = fl(b ◦ a).)
• Whenever possible, reformulate calculations to avoid unnecessary cancellation error (e.g., in the quadratic formula).
• In a long sum of positive elements, use a loop ordering that avoids swamping by adding the smaller terms together first.
Summary

The conditioning of a mathematical problem
• sensitivity of the solution with respect to perturbations in the data
• a property of the problem itself, independent of the solution method
• ill-conditioned problems are 'almost unsolvable' in practice (i.e., in the presence of data uncertainty): even if we solve the problem exactly, the solution may be meaningless
• ill-conditioned problems are close to ill-posed problems: there exist small perturbations of the data which make the problem unsolvable in exact arithmetic

Precision of a computer
• a machine property (usually IEEE double precision, i.e., about 15 significant decimal digits)
• a bound on the rounding error introduced when representing numbers in finite precision

Stability of an algorithm
• a property of a numerical algorithm
• a stable algorithm computes the exact solution of a slightly different problem

Accuracy of a numerical result
• determined by: machine precision, accuracy of the data, conditioning of the problem, and the stability of the algorithm
• usually far fewer than 16 significant digits