Floating Point Computation
Jyun-Ming Chen
Fall 2015
Contents
• Computer Representation of (floating-point)
Numbers
• Sources of Computational Error
• Efficiency Issues
Computer Representation of
Floating Point Numbers
Decimal-binary conversion
Floating point VS. fixed point
Standard: IEEE 754 (1985)
Decimal-Binary Conversion
• Ex: 29 (base 10)
  29 = a5·2^5 + a4·2^4 + a3·2^3 + a2·2^2 + a1·2^1 + a0·2^0
       → a0 = 29 mod 2 = 1
  14 = a5·2^4 + a4·2^3 + a3·2^2 + a2·2^1 + a1·2^0
       → a1 = 14 mod 2 = 0
   7 = a5·2^3 + a4·2^2 + a3·2^1 + a2·2^0
       → a2 = 7 mod 2 = 1
   3 = a5·2^2 + a4·2^1 + a3·2^0
       → a3 = 3 mod 2 = 1
   1 = a5·2^1 + a4·2^0
       → a4 = 1 mod 2 = 1
  a5 = a6 = … = 0

  Short division ladder:
    2 ) 29
    2 ) 14   rem 1  (a0)
    2 )  7   rem 0  (a1)
    2 )  3   rem 1  (a2)
    2 )  1   rem 1  (a3)
         0   rem 1  (a4)

  29 (base 10) = 11101 (base 2)
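The repeated-division procedure above can be sketched in C; the helper name `to_binary` is my own, not from the slides:

```c
#include <stdio.h>

/* Repeated division by 2: each remainder is one binary digit,
   least significant first (a0, a1, ...). */
void to_binary(unsigned n, char *buf)
{
    char rev[33];
    int i = 0;
    if (n == 0)
        rev[i++] = '0';
    while (n > 0) {
        rev[i++] = (char) ('0' + n % 2);  /* a_k = n mod 2 */
        n /= 2;
    }
    /* remainders were produced low bit first; reverse them */
    for (int j = 0; j < i; j++)
        buf[j] = rev[i - 1 - j];
    buf[i] = '\0';
}
```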
Fraction Binary Conversion
• Ex: 0.625 (base 10)
  0.625 = a1·2^-1 + a2·2^-2 + a3·2^-3 + a4·2^-4 + a5·2^-5 + …
  ×2:  1.250 = a1·2^0 + a2·2^-1 + a3·2^-2 + a4·2^-3 + …   → a1 = 1
  ×2:  0.500 = a2·2^0 + a3·2^-1 + a4·2^-2 + …             → a2 = 0
  ×2:  1.000 = a3·2^0 + a4·2^-1 + …                       → a3 = 1
  a4 = a5 = … = 0
• Computing:
    0.625
  × 2 → 1.250   (a1 = 1)
  × 2 → 0.500   (a2 = 0)
  × 2 → 1.000   (a3 = 1)
  0.625 (base 10) = 0.101 (base 2)
• How about 0.1 (base 10)?
  0.1 (base 10) = 0.00011 0011 0011… (base 2), a repeating fraction
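The repeated-multiplication procedure can likewise be sketched in C (the name `frac_to_binary` is mine):

```c
#include <stdio.h>

/* Repeated multiplication by 2: each integer part that pops out
   is the next fractional bit (a1, a2, ...). Stops after maxbits
   bits, since fractions like 0.1 never terminate in binary. */
void frac_to_binary(double f, char *buf, int maxbits)
{
    int i = 0;
    while (f > 0.0 && i < maxbits) {
        f *= 2.0;
        if (f >= 1.0) {
            buf[i++] = '1';
            f -= 1.0;
        } else {
            buf[i++] = '0';
        }
    }
    buf[i] = '\0';
}
```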
Exercise
• Convert 13.62510 to binary representation.
Floating Point Representation
number = f × b^e
• Fraction, f
  – Usually normalized so that 1.0 ≤ f < b
• Base, b
– 2 for personal computers
– 16 for mainframe
–…
• Exponent, e
IEEE 754-1985
• Purpose: make floating system portable
• Defines: the number representation, how
calculations are performed, exceptions, …
• Single-precision (32-bit)
• Double-precision (64-bit)
Number Representation
• S: sign of mantissa
• Range (roughly)
  – Single: 10^-38 to 10^38
  – Double: 10^-307 to 10^307
    (largest double ≈ 2^1024; log10 2^1024 = 1024·log10 2 ≈ 308.25, i.e. ≈ 10^308)
• Precision (roughly)
  – Single: 7-8 significant decimal digits
  – Double: 15 significant decimal digits
Significant Digits
(reference)
Significant Digits
• In binary sense, 24 bits are significant (with
implicit one – next page); the gap between 1 and
the next representable number is 2^-23
• When you write your program, make sure the
results you print carry the meaningful significant
digits.
• In decimal sense, roughly 7-8 decimal
significant digits
Implicit One
• Normalized mantissa is always of the form 1.xxx…
  – Only the fractional part is stored, gaining one
    extra bit of precision
• Ex: 3.5
  3.5 = 2 + 1 + 0.5 = 11.1_2 = 1.11_2 × 2^1
Exponent Bias
• Ex: in single precision, exponent has 8 bits
– 0000 0000 (0) to 1111 1111 (255)
• Add an offset to represent positive/negative exponents
– Effective exponent = biased exponent – bias
– Bias value: 32-bit (127); 64-bit (1023)
– Ex: 32-bit
• 1000 0000 (128): effective exp.=128-127=1
Ex: Convert –3.5 to a 32-bit FP Number
  sign: s = 1
  3.5 = 2 + 1 + 0.5 = 11.1_2
      = 1.11 × 2^1 = 1.11 × 2^(128−127)  →  e = 128 = 10000000_2
  m = 1100…000_2
  Result: 11000000 01100000 00000000 00000000
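The bit pattern above can be checked by reinterpreting the float's storage; `float_bits` is a helper name of my own:

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as a 32-bit unsigned integer
   (sign | biased exponent | stored mantissa). */
uint32_t float_bits(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);   /* bit-for-bit copy, no conversion */
    return u;
}
```

For –3.5 this yields 0xC0600000, i.e. 11000000 01100000 00000000 00000000.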
HW: Convert the same number to 64-bit FP number
Design Philosophy of IEEE 754
• [s|e|m]
• S first: whether the number is +/- can be tested
easily
• E before M: simplify sorting
• Represent negative by bias (not 2’s complement)
for ease of sorting
– [biased rep] –1, 0, 1 = 126, 127, 128
– [2’s compl.] –1, 0, 1 = 0xFF, 0x00, 0x01
• More complicated math for sorting, increment/decrement
Exceptions
• Overflow:
– ±INF: when number exceeds the range of representation
• Underflow
  – When numbers are too close to zero, they are
    treated as zeros
• Dwarf
– The smallest representable number in the FP system
• Machine Epsilon (ME)
– A number with computation significance (more later)
Extremities
• E = (1…1)
  – M = (0…0): infinity
  – M not all zeros: NaN (Not a Number) (more later)
• E = (0…0)
  – M = (0…0): clean zero
  – M not all zeros: dirty zero (see next page)
Not-a-Number
• Numerical exceptions
– Sqrt of a negative number
– Invalid domain of trigonometric functions
–…
• Often causes the program to stop running
Extremities (32-bit)
• Max: 0 11111110 11111111111111111111111
  (1.111…1)_2 × 2^(254−127) = (10 − 0.000…1)_2 × 2^127 ≈ 2^128
• Min (w/o stepping into dirty-zero): 0 00000001 00000000000000000000000
  (1.000…0)_2 × 2^(1−127) = 2^-126
Dirty-Zero (a.k.a. denormals)
(a.k.a.: also known as)
• No “Implicit One”
• IEEE 754 did not specify compatibility for
denormals
• If you are not sure how to handle them, stay
away from them. Scale your problem
properly
– “Many problems can be solved by pretending
as if they do not exist”
Dirty-Zero (cont)
• Denormals fill the gap between 0 and the dwarf
  (2^-126) on the real line R:
  00000000 10000000 00000000 00000000 = 2^-127
  00000000 01000000 00000000 00000000 = 2^-128
  00000000 00100000 00000000 00000000 = 2^-129
  00000000 00010000 00000000 00000000 = 2^-130
(Dwarf: the smallest representable)
Dwarf (32-bit)
• Bit pattern: 0 00000000 00000000000000000000001
• Value: 2^-149
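A sketch of how to construct and probe the dwarf from its raw bit pattern; the helper name `float_from_bits` is mine:

```c
#include <stdint.h>
#include <string.h>

/* Build a float from a raw 32-bit pattern. Pattern 0x00000001
   (all-zero exponent, mantissa = 1) is the dwarf: 2^-149. */
float float_from_bits(uint32_t u)
{
    float x;
    memcpy(&x, &u, sizeof x);
    return x;
}
```

Halving the dwarf underflows: 2^-150 rounds to zero, so the dwarf really is the edge of the representable range.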
Machine Epsilon (ME)
• Definition
– the smallest non-zero number that makes a
difference when added to 1.0 on your working
platform
• This is not the same as the dwarf
Computing ME (32-bit)
• Start from 1 + eps and keep halving eps, getting closer to 1.0
• ME: (00111111 10000000 00000000 00000001) − 1.0
      = 2^-23 ≈ 1.2 × 10^-7
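The halving loop can be sketched in C (function name `machine_eps` is my own); on an IEEE single-precision platform it lands on 2^-23:

```c
#include <float.h>

/* Halve eps until adding it to 1.0f no longer changes the sum;
   the last eps that did change it is the machine epsilon. */
float machine_eps(void)
{
    float eps = 1.0f;
    while ((float) (1.0f + eps / 2.0f) != 1.0f)
        eps /= 2.0f;
    return eps;
}
```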
Effect of ME
Significance of ME
• Never terminate an iteration by testing whether two
FP numbers are equal
• Instead, test whether |x − y| < ME
Machine Epsilon (Wikipedia)
Machine epsilon gives an upper bound on the relative error due
to rounding in floating point arithmetic.
Numerical Scaling
• Number density: there are as many IEEE 754
  numbers between [1.0, 2.0] as there are in
  [256, 512]
• Revisit: the discrete number line R
• Implication:
  – "roundoff" error
  – ME: a measure of real-number density near 1.0
  – Scale your problem so that intermediate results
    lie between 1.0 and 2.0 (where numbers are
    dense, and where roundoff error is smallest)
Scaling (cont)
• Performing computation on denser portions
of real line minimizes the roundoff error
– but don’t over do it; switch to double precision
will easily increase the precision
– The densest part is near subnormal, if density is
defined as numbers per unit length
Basic Arithmetic
Addition (subtraction)
• Shift both numbers to the same exponent (the larger one)
• Add (subtract) the mantissas; handle the carry
Multiplication
• Multiply the mantissas; add the exponents; handle the carry
Example
5 (base 10) = 101_2 = 1.01 × 2^2
1.25 (base 10) = 1.01_2 = 1.01 × 2^0

5 + 1.25 = 1.01 × 2^2 + 1.01 × 2^0
         = 1.01 × 2^2 + 0.0101 × 2^2
         = (1.01 + 0.0101) × 2^2
         = 1.1001 × 2^2
         = 6.25 (base 10)
Subtraction of Nearly Equal
Numbers
• Base 10: 1.24446 − 1.24445 = 1.00000 × 10^-5
  – Significant loss of accuracy: the leading digits
    cancel, and most of the remaining bits are unreliable
• Binary (mantissa subtraction):
      1110111
    − 0100011
      1010100…
[Theorem of Loss Precision]
• Let x, y be normalized floating-point machine
numbers with x > y > 0
• If 2^-p ≤ 1 − y/x ≤ 2^-q,
then at most p, and at least q, significant binary
bits are lost in the subtraction x − y.
• Interpretation:
– “When two numbers are very close, their
subtraction introduces a lot of numerical error.”
Implications
• When you program:
    f(x) = √(x² + 1) − 1
    g(x) = ln(x) − 1
• You should write these instead:
    f(x) = (√(x² + 1) − 1) · (√(x² + 1) + 1)/(√(x² + 1) + 1)
         = x² / (√(x² + 1) + 1)
    g(x) = ln(x) − ln(e) = ln(x/e)
Every FP operation introduces error, but the
subtraction of nearly equal numbers is the worst
and should be avoided whenever possible
Source of Numerical Error
Sources of Computational Error
• Converting a mathematical problem to a
numerical problem, one introduces errors due to
limited computational resources:
  – round-off error (limited precision of
    representation)
  – truncation error (limited time for computation)
• Misc.
  – Error in original data
  – Blunder: to make a mistake through stupidity,
    ignorance, or carelessness;
    programming/data input error
  – Propagated error
Common Measures of Error
• Definitions
– total error = round off + truncation
– Absolute error = | numerical – exact |
– Relative error = Abs. error / | exact |
• If exact is zero, rel. error is not defined
Ex: Round-off Error
Representation consists of a finite number of digits.
The approximation of real numbers on the number
line is discrete!
Watch out for printf !!
• By default, "%f" prints out 6 digits after the
decimal point.
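A minimal sketch of the pitfall (the helper name `show` is mine): with the default precision, 0.1f looks exact; printing more digits exposes the stored value.

```c
#include <stdio.h>

/* "%f" always shows 6 fractional digits, which can make an
   inexact value such as 0.1f look exact. */
void show(float x, char *six, char *nine, size_t n)
{
    snprintf(six, n, "%f", x);    /* default: 6 digits */
    snprintf(nine, n, "%.9f", x); /* enough digits to expose the error */
}
```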
Ex: Numerical Differentiation
• Evaluating the first derivative of f(x):
  f(x + h) = f(x) + f'(x)·h + f''(ξ)·h²/2
  ⇒ f'(x) = (f(x + h) − f(x))/h − f''(ξ)·h/2
  ⇒ f'(x) ≈ (f(x + h) − f(x))/h, for small h
  The dropped term f''(ξ)·h/2 is the truncation error.
Numerical Differentiation (cont)
• Select a problem with known answer
  – So that we can evaluate the error!
  f(x) = x³  →  f'(x) = 3x²,  f'(10) = 300
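The forward-difference experiment can be sketched as follows (names `fwd_diff` and `cube` are mine); the truncation error at x = 10 behaves like f''(10)·h/2 = 30h:

```c
#include <math.h>

double cube(double x) { return x * x * x; }

/* Forward-difference approximation of f'(x). */
double fwd_diff(double (*f)(double), double x, double h)
{
    return (f(x + h) - f(x)) / h;
}
```

Shrinking h reduces the truncation error — until roundoff in f(x+h) − f(x) takes over, as the error plot on the next slide shows.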
Numerical Differentiation (cont)
• Error analysis
– error vs. h: the truncation error shrinks as h decreases
• What happened at h = 0.00001?!
  (roundoff in f(x + h) − f(x) starts to dominate)
Ex: Polynomial Deflation
• F(x) is a polynomial with 20 real roots
f(x) = (x − 1)(x − 2)⋯(x − 20)
• Use any method to numerically solve a root,
then deflate the polynomial to 19th degree
• Solve another root, and deflate again, and
again, …
• The accuracy of the roots obtained is getting
worse each time due to error propagation
Efficiency Issues
• Horner Scheme
• program examples
Horner Scheme
• For polynomial evaluation
• Compare efficiency
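A sketch of the scheme (the function name `horner` is mine): evaluating a degree-n polynomial takes n multiplications and n additions, instead of computing each power of x separately.

```c
/* Evaluate a0 + a1*x + ... + an*x^n with n multiplications
   and n additions (Horner scheme). */
double horner(const double *a, int n, double x)
{
    double p = a[n];
    for (int i = n - 1; i >= 0; i--)
        p = p * x + a[i];   /* fold in the next lower coefficient */
    return p;
}
```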
Accuracy vs. Efficiency
Good Coding Practice
Storing Multidimensional Array in
Linear Memory
C and others: row-major order
Fortran, MATLAB: column-major order
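The two layouts map an (i, j) index to linear memory differently; a minimal sketch (the offset helper names are mine):

```c
/* C (row-major): a[i][j] lives at linear offset i*ncols + j,
   so scanning j in the inner loop walks memory sequentially. */
int row_major_offset(int i, int j, int ncols)
{
    return i * ncols + j;
}

/* Fortran/MATLAB (column-major): element (i, j) lives at
   offset j*nrows + i; there the inner loop should scan i. */
int col_major_offset(int i, int j, int nrows)
{
    return j * nrows + i;
}
```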
On Accessing Arrays …
Which one
is more
efficient?
Issues of PI
• 3.14 is often not accurate enough
– 4.0*atan(1.0) is a good substitute
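A sketch of the substitution (the wrapper name `pi_value` is mine): since atan(1) = π/4, the library call yields π to full machine precision.

```c
#include <math.h>

/* atan(1) = pi/4, so this yields pi to full double precision. */
double pi_value(void)
{
    return 4.0 * atan(1.0);
}
```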
Compare:
Exercise
• Explain why, when implemented numerically,
    Σ (i = 1 to 100,000) 0.1 ≠ 10,000
• Explain why the series
    1 + 1/2 + 1/3 + 1/4 + … + 1/n + …
  converges when implemented numerically
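The first exercise can be reproduced with a short loop (the function name `sum_tenths` is mine); this only demonstrates the phenomenon, the explanation is left to the reader:

```c
/* Sum 0.1f n times in single precision. 0.1 has no finite binary
   expansion, so each addition rounds and the error accumulates. */
float sum_tenths(int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += 0.1f;
    return s;
}
```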
Exercise
• Why does Me( ) not work as advertised?
• Construct the 64-bit version of everything
– Bit-Examiner
– Dme( );
• 32-bit: int and float. Can every int be
represented by float (if converted)?
Supplemental
Examine Bits of FP Numbers
• Explain how this
program works
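The program listing itself was a slide figure; a sketch of the usual technique (the function name `float_bits_string` is mine) reads the float's storage through `memcpy` and prints one character per bit:

```c
#include <stdint.h>
#include <string.h>

/* Write the 32 bits of a float into out[0..31] (plus a NUL),
   most significant bit first: sign, 8 exponent bits, 23 mantissa bits. */
void float_bits_string(float x, char out[33])
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);           /* view the float as raw bits */
    for (int b = 0; b < 32; b++)
        out[b] = ((u >> (31 - b)) & 1u) ? '1' : '0';
    out[32] = '\0';
}
```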
The “Examiner”
• Use the previous program to
– Observe how ME works
– Test subnormal behaviors on your
computer/compiler
– Convince yourself why the subtraction of two
nearly equal numbers produces lots of error
– NaN: Not-a-Number !?
Understanding Your Platform
[figure: sizeof results for the basic types: 1, 2, 4, 4, 8, 4, 8, 4 bytes]
Memory word:
4 bytes on 32-bit machines
Padding
How about
Data Alignment (data structure
padding)
• Padding is only inserted when a structure member
is followed by a member with a larger alignment
requirement or at the end of the structure.
• Alignment requirement:
Ex: Padding
sizeof (struct MixedData) = 12 bytes
// for Data2 to align on a 2-byte boundary
// no padding required; already on 4-byte boundary
// final padding to align a 4-byte boundary
Data Alignment (cont)
• By changing the ordering of
members in a structure, it is
possible to change the amount
of padding required to maintain
alignment.
• Direct the compiler to ignore
data alignment (align it on a 1-byte boundary)
  – #pragma pack(push) pushes the current
    alignment onto a stack
#include <stdio.h>
struct pad1 {
char data1;
short data2;
int data3;
char data4;
};
struct pad2 {
int data3;
short data2;
char data1;
char data4;
};
#pragma pack(push)
#pragma pack(1)
struct pad3 {
char data1;
short data2;
int data3;
char data4;
};
#pragma pack(pop)
int main(void)
{
    printf("pad1 size: %d\n", (int) sizeof (struct pad1));  /* 12 */
    printf("pad2 size: %d\n", (int) sizeof (struct pad2));  /* 8  */
    printf("pad3 size: %d\n", (int) sizeof (struct pad3));  /* 8  */
    return 0;
}
Floating VS. Fixed Point
• Decimal, 6 digits (positive number)
– fixed point: with 5 digits after decimal point
• 0.00001, … , 9.99999
– Floating point: 2 digits as exponent (10-base); 4 digits
for mantissa (accuracy)
• 0.001×10^00, … , 9.999×10^99
• Comparison:
– Fixed point: fixed accuracy; simple math for
computation (used in systems w/o FPU)
– Floating point: trade accuracy for larger range of
representation
Supplement: Error Classification
(Hildebrand)
• Gross error: caused by
human or mechanical
mistakes
• Roundoff error: the
consequence of using a
number specified by n
correct digits to
approximate a number
which requires more than
n digits (generally
infinitely many digits) for
its exact specification.
• Truncation error: any
error which is neither a
gross error nor a roundoff
error.
• Frequently, a truncation
error corresponds to the
fact that, whereas an exact
result would be afforded
(in the limit) by an infinite
sequence of steps, the
process is truncated after a
certain finite number of
steps.