Floating Point Computation
Jyun-Ming Chen
Fall 2015
Contents
• Computer Representation of (floating-point)
Numbers
• Sources of Computational Error
• Efficiency Issues
Computer Representation of
Floating Point Numbers
Decimal-binary conversion
Floating point VS. fixed point
Standard: IEEE 754 (1985)
Decimal-Binary Conversion
• Ex: 29 (base 10)
  29 = a5·2^5 + a4·2^4 + a3·2^3 + a2·2^2 + a1·2^1 + a0·2^0
       → a0 = 29 mod 2 = 1
  14 = a5·2^4 + a4·2^3 + a3·2^2 + a2·2^1 + a1·2^0
       → a1 = 14 mod 2 = 0
   7 = a5·2^3 + a4·2^2 + a3·2^1 + a2·2^0
       → a2 = 7 mod 2 = 1
   3 = a5·2^2 + a4·2^1 + a3·2^0
       → a3 = 3 mod 2 = 1
   1 = a5·2^1 + a4·2^0
       → a4 = 1 mod 2 = 1
  a5 = a6 = … = 0

  Short division ladder:
    2 ) 29
    2 ) 14   rem 1  (a0)
    2 )  7   rem 0  (a1)
    2 )  3   rem 1  (a2)
    2 )  1   rem 1  (a3)
         0   rem 1  (a4)

  29 (base 10) = 11101 (base 2)
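The repeated-division procedure above can be sketched in C; the helper name `to_binary` is my own, not from the slides:

```c
#include <stdio.h>

/* Repeated division by 2: each remainder is one binary digit,
   least significant first (a0, a1, ...). */
void to_binary(unsigned n, char *buf)
{
    char rev[33];
    int i = 0;
    if (n == 0)
        rev[i++] = '0';
    while (n > 0) {
        rev[i++] = (char) ('0' + n % 2);  /* a_k = n mod 2 */
        n /= 2;
    }
    /* remainders were produced low bit first; reverse them */
    for (int j = 0; j < i; j++)
        buf[j] = rev[i - 1 - j];
    buf[i] = '\0';
}
```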
Fraction Binary Conversion
• Ex: 0.625 (base 10)
  0.625 = a1·2^-1 + a2·2^-2 + a3·2^-3 + a4·2^-4 + a5·2^-5 + …
  ×2:  1.250 = a1·2^0 + a2·2^-1 + a3·2^-2 + a4·2^-3 + …   → a1 = 1
  ×2:  0.500 = a2·2^0 + a3·2^-1 + a4·2^-2 + …             → a2 = 0
  ×2:  1.000 = a3·2^0 + a4·2^-1 + …                       → a3 = 1
  a4 = a5 = … = 0
• Computing:
    0.625
  × 2 → 1.250   (a1 = 1)
  × 2 → 0.500   (a2 = 0)
  × 2 → 1.000   (a3 = 1)
  0.625 (base 10) = 0.101 (base 2)
• How about 0.1 (base 10)?
  0.1 (base 10) = 0.00011 0011 0011… (base 2), a repeating fraction
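The repeated-multiplication procedure can likewise be sketched in C (the name `frac_to_binary` is mine):

```c
#include <stdio.h>

/* Repeated multiplication by 2: each integer part that pops out
   is the next fractional bit (a1, a2, ...). Stops after maxbits
   bits, since fractions like 0.1 never terminate in binary. */
void frac_to_binary(double f, char *buf, int maxbits)
{
    int i = 0;
    while (f > 0.0 && i < maxbits) {
        f *= 2.0;
        if (f >= 1.0) {
            buf[i++] = '1';
            f -= 1.0;
        } else {
            buf[i++] = '0';
        }
    }
    buf[i] = '\0';
}
```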
Exercise
• Convert 13.62510 to binary representation.
Floating Point Representation
number = f × b^e
• Fraction, f
  – Usually normalized so that 1.0 ≤ f < b
• Base, b
– 2 for personal computers
– 16 for mainframe
–…
• Exponent, e
IEEE 754-1985
• Purpose: make floating system portable
• Defines: the number representation, how
calculations are performed, exceptions, …
• Single-precision (32-bit)
• Double-precision (64-bit)
Number Representation
• S: sign of mantissa
• Range (roughly)
  – Single: 10^-38 to 10^38
  – Double: 10^-307 to 10^307
    (largest double ≈ 2^1024; log10 2^1024 = 1024·log10 2 ≈ 308.25, i.e. ≈ 10^308)
• Precision (roughly)
  – Single: 7-8 significant decimal digits
  – Double: 15 significant decimal digits
Significant Digits
(reference)
Significant Digits
• In binary sense, 24 bits are significant (with
implicit one – next page); the gap between 1 and
the next representable number is 2^-23
• When you write your program, make sure the
results you print carry the meaningful significant
digits.
• In decimal sense, roughly 7-8 decimal
significant digits
Implicit One
• Normalized mantissa is always of the form 1.xxx…
  – Only the fractional part is stored, gaining one
    extra bit of precision
• Ex: 3.5
  3.5 = 2 + 1 + 0.5 = 11.1_2 = 1.11_2 × 2^1
Exponent Bias
• Ex: in single precision, exponent has 8 bits
– 0000 0000 (0) to 1111 1111 (255)
• Add an offset to represent positive/negative exponents
– Effective exponent = biased exponent – bias
– Bias value: 32-bit (127); 64-bit (1023)
– Ex: 32-bit
• 1000 0000 (128): effective exp.=128-127=1
Ex: Convert –3.5 to a 32-bit FP Number
  sign: s = 1
  3.5 = 2 + 1 + 0.5 = 11.1_2
      = 1.11 × 2^1 = 1.11 × 2^(128−127)  →  e = 128 = 10000000_2
  m = 1100…000_2
  Result: 11000000 01100000 00000000 00000000
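The bit pattern above can be checked by reinterpreting the float's storage; `float_bits` is a helper name of my own:

```c
#include <stdint.h>
#include <string.h>

/* Reinterpret the bits of a float as a 32-bit unsigned integer
   (sign | biased exponent | stored mantissa). */
uint32_t float_bits(float x)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);   /* bit-for-bit copy, no conversion */
    return u;
}
```

For –3.5 this yields 0xC0600000, i.e. 11000000 01100000 00000000 00000000.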
HW: Convert the same number to 64-bit FP number
Design Philosophy of IEEE 754
• [s|e|m]
• S first: whether the number is +/- can be tested
easily
• E before M: simplify sorting
• Represent negative by bias (not 2’s complement)
for ease of sorting
– [biased rep] –1, 0, 1 = 126, 127, 128
– [2’s compl.] –1, 0, 1 = 0xFF, 0x00, 0x01
• More complicated math for sorting, increment/decrement
Exceptions
• Overflow:
– ±INF: when number exceeds the range of representation
• Underflow
  – When numbers are too close to zero, they are
    treated as zeros
• Dwarf
– The smallest representable number in the FP system
• Machine Epsilon (ME)
– A number with computation significance (more later)
Extremities
• E = (1…1)
  – M = (0…0): infinity
  – M not all zeros: NaN (Not a Number) (more later)
• E = (0…0)
  – M = (0…0): clean zero
  – M not all zeros: dirty zero (see next page)
Not-a-Number
• Numerical exceptions
– Sqrt of a negative number
– Invalid domain of trigonometric functions
–…
• Often causes the program to stop running
Extremities (32-bit)
• Max: 0 11111110 11111111111111111111111
  (1.111…1)_2 × 2^(254−127) = (10 − 0.000…1)_2 × 2^127 ≈ 2^128
• Min (w/o stepping into dirty-zero): 0 00000001 00000000000000000000000
  (1.000…0)_2 × 2^(1−127) = 2^-126
Dirty-Zero (a.k.a. denormals)
(a.k.a.: also known as)
• No “Implicit One”
• IEEE 754 did not specify compatibility for
denormals
• If you are not sure how to handle them, stay
away from them. Scale your problem
properly
– “Many problems can be solved by pretending
as if they do not exist”
Dirty-Zero (cont)
• Denormals fill the gap between 0 and the dwarf
  (2^-126) on the real line R:
  00000000 10000000 00000000 00000000 = 2^-127
  00000000 01000000 00000000 00000000 = 2^-128
  00000000 00100000 00000000 00000000 = 2^-129
  00000000 00010000 00000000 00000000 = 2^-130
(Dwarf: the smallest representable)
Dwarf (32-bit)
• Bit pattern: 0 00000000 00000000000000000000001
• Value: 2^-149
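A sketch of how to construct and probe the dwarf from its raw bit pattern; the helper name `float_from_bits` is mine:

```c
#include <stdint.h>
#include <string.h>

/* Build a float from a raw 32-bit pattern. Pattern 0x00000001
   (all-zero exponent, mantissa = 1) is the dwarf: 2^-149. */
float float_from_bits(uint32_t u)
{
    float x;
    memcpy(&x, &u, sizeof x);
    return x;
}
```

Halving the dwarf underflows: 2^-150 rounds to zero, so the dwarf really is the edge of the representable range.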
Machine Epsilon (ME)
• Definition
– the smallest non-zero number that makes a
difference when added to 1.0 on your working
platform
• This is not the same as the dwarf
Computing ME (32-bit)
• Start from 1 + eps and keep halving eps, getting closer to 1.0
• ME: (00111111 10000000 00000000 00000001) − 1.0
      = 2^-23 ≈ 1.2 × 10^-7
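The halving loop can be sketched in C (function name `machine_eps` is my own); on an IEEE single-precision platform it lands on 2^-23:

```c
#include <float.h>

/* Halve eps until adding it to 1.0f no longer changes the sum;
   the last eps that did change it is the machine epsilon. */
float machine_eps(void)
{
    float eps = 1.0f;
    while ((float) (1.0f + eps / 2.0f) != 1.0f)
        eps /= 2.0f;
    return eps;
}
```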
Effect of ME
Significance of ME
• Never terminate an iteration by testing whether two
FP numbers are equal
• Instead, test whether |x − y| < ME
Machine Epsilon (Wikipedia)
Machine epsilon gives an upper bound on the relative error due
to rounding in floating point arithmetic.
Numerical Scaling
• Number density: there are as many IEEE 754
  numbers between [1.0, 2.0] as there are in
  [256, 512]
• Revisit: the discrete number line R
• Implication:
  – "roundoff" error
  – ME: a measure of real-number density near 1.0
  – Scale your problem so that intermediate results
    lie between 1.0 and 2.0 (where numbers are
    dense, and where roundoff error is smallest)
Scaling (cont)
• Performing computation on denser portions
of real line minimizes the roundoff error
– but don’t over do it; switch to double precision
will easily increase the precision
– The densest part is near subnormal, if density is
defined as numbers per unit length
Basic Arithmetic
Addition (subtraction)
• Shift both numbers to the same exponent (the larger one)
• Add (subtract) the mantissas; handle the carry
Multiplication
• Multiply the mantissas; add the exponents; handle the carry
Example
5 (base 10) = 101_2 = 1.01 × 2^2
1.25 (base 10) = 1.01_2 = 1.01 × 2^0

5 + 1.25 = 1.01 × 2^2 + 1.01 × 2^0
         = 1.01 × 2^2 + 0.0101 × 2^2
         = (1.01 + 0.0101) × 2^2
         = 1.1001 × 2^2
         = 6.25 (base 10)
Subtraction of Nearly Equal
Numbers
• Base 10: 1.24446 − 1.24445 = 1.00000 × 10^-5
  – Significant loss of accuracy: the leading digits
    cancel, and most of the remaining bits are unreliable
• Binary (mantissa subtraction):
      1110111
    − 0100011
      1010100…
[Theorem of Loss Precision]
• Let x, y be normalized floating-point machine
numbers with x > y > 0
• If 2^-p ≤ 1 − y/x ≤ 2^-q,
then at most p, and at least q, significant binary
bits are lost in the subtraction x − y.
• Interpretation:
– “When two numbers are very close, their
subtraction introduces a lot of numerical error.”
Implications
• When you program:
    f(x) = √(x² + 1) − 1
    g(x) = ln(x) − 1
• You should write these instead:
    f(x) = (√(x² + 1) − 1) · (√(x² + 1) + 1)/(√(x² + 1) + 1)
         = x² / (√(x² + 1) + 1)
    g(x) = ln(x) − ln(e) = ln(x/e)
Every FP operation introduces error, but the
subtraction of nearly equal numbers is the worst
and should be avoided whenever possible
Source of Numerical Error
Sources of Computational Error
• Converting a mathematical problem to a
numerical problem, one introduces errors due to
limited computational resources:
  – round-off error (limited precision of
    representation)
  – truncation error (limited time for computation)
• Misc.
  – Error in original data
  – Blunder: to make a mistake through stupidity,
    ignorance, or carelessness;
    programming/data input error
  – Propagated error
Common Measures of Error
• Definitions
– total error = round off + truncation
– Absolute error = | numerical – exact |
– Relative error = Abs. error / | exact |
• If exact is zero, rel. error is not defined
Ex: Round-off Error
Representation consists of a finite number of digits.
The approximation of real numbers on the number
line is discrete!
Watch out for printf !!
• By default, "%f" prints out 6 digits after the
decimal point.
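A minimal sketch of the pitfall (the helper name `show` is mine): with the default precision, 0.1f looks exact; printing more digits exposes the stored value.

```c
#include <stdio.h>

/* "%f" always shows 6 fractional digits, which can make an
   inexact value such as 0.1f look exact. */
void show(float x, char *six, char *nine, size_t n)
{
    snprintf(six, n, "%f", x);    /* default: 6 digits */
    snprintf(nine, n, "%.9f", x); /* enough digits to expose the error */
}
```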
Ex: Numerical Differentiation
• Evaluating the first derivative of f(x):
  f(x + h) = f(x) + f'(x)·h + f''(ξ)·h²/2
  ⇒ f'(x) = (f(x + h) − f(x))/h − f''(ξ)·h/2
  ⇒ f'(x) ≈ (f(x + h) − f(x))/h, for small h
  The dropped term f''(ξ)·h/2 is the truncation error.
Numerical Differentiation (cont)
• Select a problem with known answer
  – So that we can evaluate the error!
  f(x) = x³  →  f'(x) = 3x²,  f'(10) = 300
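The forward-difference experiment can be sketched as follows (names `fwd_diff` and `cube` are mine); the truncation error at x = 10 behaves like f''(10)·h/2 = 30h:

```c
#include <math.h>

double cube(double x) { return x * x * x; }

/* Forward-difference approximation of f'(x). */
double fwd_diff(double (*f)(double), double x, double h)
{
    return (f(x + h) - f(x)) / h;
}
```

Shrinking h reduces the truncation error — until roundoff in f(x+h) − f(x) takes over, as the error plot on the next slide shows.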
Numerical Differentiation (cont)
• Error analysis
– error vs. h: the truncation error shrinks as h decreases
• What happened at h = 0.00001?!
  (roundoff in f(x + h) − f(x) starts to dominate)
Ex: Polynomial Deflation
• F(x) is a polynomial with 20 real roots
f(x) = (x − 1)(x − 2)⋯(x − 20)
• Use any method to numerically solve a root,
then deflate the polynomial to 19th degree
• Solve another root, and deflate again, and
again, …
• The accuracy of the roots obtained is getting
worse each time due to error propagation
Efficiency Issues
• Horner Scheme
• program examples
Horner Scheme
• For polynomial evaluation
• Compare efficiency
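A sketch of the scheme (the function name `horner` is mine): evaluating a degree-n polynomial takes n multiplications and n additions, instead of computing each power of x separately.

```c
/* Evaluate a0 + a1*x + ... + an*x^n with n multiplications
   and n additions (Horner scheme). */
double horner(const double *a, int n, double x)
{
    double p = a[n];
    for (int i = n - 1; i >= 0; i--)
        p = p * x + a[i];   /* fold in the next lower coefficient */
    return p;
}
```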
Accuracy vs. Efficiency
Good Coding Practice
Storing Multidimensional Array in
Linear Memory
C and others: row-major order
Fortran, MATLAB: column-major order
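The two layouts map an (i, j) index to linear memory differently; a minimal sketch (the offset helper names are mine):

```c
/* C (row-major): a[i][j] lives at linear offset i*ncols + j,
   so scanning j in the inner loop walks memory sequentially. */
int row_major_offset(int i, int j, int ncols)
{
    return i * ncols + j;
}

/* Fortran/MATLAB (column-major): element (i, j) lives at
   offset j*nrows + i; there the inner loop should scan i. */
int col_major_offset(int i, int j, int nrows)
{
    return j * nrows + i;
}
```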
On Accessing Arrays …
Which one
is more
efficient?
Issues of PI
• 3.14 is often not accurate enough
– 4.0*atan(1.0) is a good substitute
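A sketch of the substitution (the wrapper name `pi_value` is mine): since atan(1) = π/4, the library call yields π to full machine precision.

```c
#include <math.h>

/* atan(1) = pi/4, so this yields pi to full double precision. */
double pi_value(void)
{
    return 4.0 * atan(1.0);
}
```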
Compare:
Exercise
• Explain why, when implemented numerically,
    Σ (i = 1 to 100,000) 0.1 ≠ 10,000
• Explain why the series
    1 + 1/2 + 1/3 + 1/4 + … + 1/n + …
  converges when implemented numerically
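The first exercise can be reproduced with a short loop (the function name `sum_tenths` is mine); this only demonstrates the phenomenon, the explanation is left to the reader:

```c
/* Sum 0.1f n times in single precision. 0.1 has no finite binary
   expansion, so each addition rounds and the error accumulates. */
float sum_tenths(int n)
{
    float s = 0.0f;
    for (int i = 0; i < n; i++)
        s += 0.1f;
    return s;
}
```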
Exercise
• Why does Me( ) not work as advertised?
• Construct the 64-bit version of everything
– Bit-Examiner
– Dme( );
• 32-bit: int and float. Can every int be
represented by float (if converted)?
Supplemental
Examine Bits of FP Numbers
• Explain how this
program works
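The program listing itself was a slide figure; a sketch of the usual technique (the function name `float_bits_string` is mine) reads the float's storage through `memcpy` and prints one character per bit:

```c
#include <stdint.h>
#include <string.h>

/* Write the 32 bits of a float into out[0..31] (plus a NUL),
   most significant bit first: sign, 8 exponent bits, 23 mantissa bits. */
void float_bits_string(float x, char out[33])
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);           /* view the float as raw bits */
    for (int b = 0; b < 32; b++)
        out[b] = ((u >> (31 - b)) & 1u) ? '1' : '0';
    out[32] = '\0';
}
```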
The “Examiner”
• Use the previous program to
– Observe how ME works
– Test subnormal behaviors on your
computer/compiler
– Convince yourself why the subtraction of two
nearly equal numbers produces lots of error
– NaN: Not-a-Number !?
Understanding Your Platform
[figure: sizeof results for the basic types: 1, 2, 4, 4, 8, 4, 8, 4 bytes]
Memory word:
4 bytes on 32-bit machines
Padding
How about
Data Alignment (data structure
padding)
• Padding is only inserted when a structure member
is followed by a member with a larger alignment
requirement or at the end of the structure.
• Alignment requirement:
Ex: Padding
sizeof (struct MixedData) = 12 bytes
// for Data2 to align on a 2-byte boundary
// no padding required; already on 4-byte boundary
// final padding to align a 4-byte boundary
Data Alignment (cont)
• By changing the ordering of
members in a structure, it is
possible to change the amount
of padding required to maintain
alignment.
• Direct the compiler to ignore
data alignment (align it on a 1-byte boundary)
  – #pragma pack(push) pushes the current
    alignment onto a stack
#include <stdio.h>
struct pad1 {
char data1;
short data2;
int data3;
char data4;
};
struct pad2 {
int data3;
short data2;
char data1;
char data4;
};
#pragma pack(push)
#pragma pack(1)
struct pad3 {
char data1;
short data2;
int data3;
char data4;
};
#pragma pack(pop)
int main(void)
{
    printf("pad1 size: %d\n", (int) sizeof (struct pad1));  /* 12 */
    printf("pad2 size: %d\n", (int) sizeof (struct pad2));  /* 8  */
    printf("pad3 size: %d\n", (int) sizeof (struct pad3));  /* 8  */
    return 0;
}
Floating VS. Fixed Point
• Decimal, 6 digits (positive number)
– fixed point: with 5 digits after decimal point
• 0.00001, … , 9.99999
– Floating point: 2 digits as exponent (10-base); 4 digits
for mantissa (accuracy)
• 0.001×10^00, … , 9.999×10^99
• Comparison:
– Fixed point: fixed accuracy; simple math for
computation (used in systems w/o FPU)
– Floating point: trade accuracy for larger range of
representation
Supplement: Error Classification
(Hildebrand)
• Gross error: caused by
human or mechanical
mistakes
• Roundoff error: the
consequence of using a
number specified by n
correct digits to
approximate a number
which requires more than
n digits (generally
infinitely many digits) for
its exact specification.
• Truncation error: any
error which is neither a
gross error nor a roundoff
error.
• Frequently, a truncation
error corresponds to the
fact that, whereas an exact
result would be afforded
(in the limit) by an infinite
sequence of steps, the
process is truncated after a
certain finite number of
steps.