MATH/CMPSC 455
Introduction to Numerical Analysis I
Floating Point Representation of Real Numbers
FLOATING POINT REPRESENTATION OF REAL NUMBERS
• This is about how computers represent and operate on real numbers.
• It helps us to understand rounding errors.
• We consider the IEEE 754 Floating Point Standard.
Representing binary numbers in a computer:
1. format
2. machine representation
FLOATING POINT FORMAT

Formats for decimal system
Standard Notation
Scientific Notation
Normalized Scientific Notation
FLOATING POINT FORMAT

Format for floating point number (binary representation)
Normalized IEEE floating point standard: $\pm 1.b_1 b_2 \ldots b_N \times 2^{p}$
o sign (+ or -)
o mantissa, which contains the significant bits (N bits, b1 b2 … bN)
o exponent (p, stored as an M-bit binary number)
s | e1 e2 … eM | b1 b2 … bN
Precision      Sign   Exponent (M)   Mantissa (N)
single         1      8              23
double         1      11             52
long double    1      15             64
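The normalized form above can be checked numerically. The following is a minimal sketch, not part of the original slides: the helper name `normalized` is hypothetical, and it uses Python's `math.frexp` to split a nonzero finite double into a sign, a significand in [1, 2), and the exponent p.

```python
import math

def normalized(x):
    """Return (sign, significand, p) such that
    x = (-1)**sign * significand * 2**p with 1 <= significand < 2.
    Hypothetical helper for illustration; zero, infinity and NaN are rejected."""
    if x == 0.0 or math.isinf(x) or math.isnan(x):
        raise ValueError("no normalized form for 0, infinity or NaN")
    sign = 0 if x > 0 else 1
    # frexp returns m, e with abs(x) = m * 2**e and 0.5 <= m < 1
    m, e = math.frexp(abs(x))
    # shift one binary place so the significand starts with "1."
    return sign, 2.0 * m, e - 1

# -9 = -(1.001 in binary) * 2**3, and 1.001 in binary is 1.125
print(normalized(-9.0))  # (1, 1.125, 3)
```

The significand 1.125 corresponds to the mantissa bits b1 b2 b3 = 0 0 1, followed by zeros.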
Definition (machine epsilon, $\epsilon_{\text{mach}}$): It is the distance between 1 and the smallest floating point number greater than 1. It gives a bound on the relative error due to rounding.
For the IEEE double precision floating point standard: $\epsilon_{\text{mach}} = 2^{-52}$
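The definition of machine epsilon can be tested directly. This is a small sketch, assuming Python floats are IEEE double precision (true for standard CPython builds); the function name `machine_epsilon` is made up for illustration.

```python
import sys

def machine_epsilon():
    """Find the gap between 1 and the next larger double by repeated halving."""
    eps = 1.0
    # Keep halving while 1 + eps/2 is still distinguishable from 1.
    while 1.0 + eps / 2.0 > 1.0:
        eps /= 2.0
    return eps

eps = machine_epsilon()
print(eps == 2.0 ** -52)               # True
print(eps == sys.float_info.epsilon)   # True
```

The loop stops at 2^-52 because 1 + 2^-53 lies exactly halfway between 1 and 1 + 2^-52 and is rounded back to 1.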
ROUNDING
How do we fit a given binary number in a finite
number of bits?
IEEE Rounding to Nearest Rule:
For double precision, if the 53rd bit to the right of the binary point
is 0, then round down (truncate after the 52nd bit). If the 53rd bit
is 1, then round up (add 1 to bit 52), unless all known bits to the
right of the 1 are 0’s, in which case 1 is added to bit 52 if and only
if bit 52 is 1.
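The cases of the rule can be tried on exact rational inputs. The sketch below is illustrative only: the name `fl` is borrowed from the notation of these slides, and it assumes that converting a `fractions.Fraction` to `float` in CPython rounds to the nearest double with ties resolved as in the rule above.

```python
from fractions import Fraction

def fl(x):
    """Round an exact rational x to the nearest double (assumes CPython's
    Fraction-to-float conversion is correctly rounded)."""
    return float(x)

one = Fraction(1)
bit52 = Fraction(1, 2 ** 52)   # weight of bit 52
bit53 = Fraction(1, 2 ** 53)   # weight of bit 53
bit54 = Fraction(1, 2 ** 54)   # weight of bit 54

# Bit 53 is 0: round down (truncate).
print(fl(one + bit54) == 1.0)                              # True
# Bit 53 is 1 and a later bit is nonzero: round up.
print(fl(one + bit53 + bit54) == 1.0 + 2.0 ** -52)         # True
# Bit 53 is 1, nothing after it, bit 52 is 0: do not add 1.
print(fl(one + bit53) == 1.0)                              # True
# Bit 53 is 1, nothing after it, bit 52 is 1: add 1 to bit 52.
print(fl(one + bit52 + bit53) == 1.0 + 2.0 ** -51)         # True
```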
ROUNDING
Notation: Denote the IEEE double precision floating
point number associated to x, using the Rounding to
the Nearest Rule, by fl(x).
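Definition (absolute error & relative error): Let $x_c$ be a computed version of the exact quantity $x$. Then
absolute error = $|x_c - x|$,
relative error = $\frac{|x_c - x|}{|x|}$, if $x \neq 0$.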
ROUNDING
Example:
Example:
Relative rounding error: $\frac{|fl(x) - x|}{|x|} \le \frac{1}{2}\epsilon_{\text{mach}}$
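As an illustration of the relative rounding error, the sketch below measures it exactly for x = 9.4. The choice of 9.4 and the function name `relative_rounding_error` are assumptions made for this example, not taken from the slides. Exact rational arithmetic (`fractions.Fraction`) is used so that the measurement itself introduces no rounding.

```python
from fractions import Fraction

def relative_rounding_error(x):
    """Exact value of |fl(x) - x| / |x| for a nonzero rational x."""
    fl_x = Fraction(float(x))   # exact rational value of the nearest double
    return abs(fl_x - x) / abs(x)

eps_mach = Fraction(1, 2 ** 52)
x = Fraction(94, 10)            # the exact quantity 9.4
rel = relative_rounding_error(x)

print(rel == Fraction(1, 47 * 2 ** 49))   # True
print(rel <= eps_mach / 2)                # True
```

Here fl(9.4) exceeds 9.4 by 0.2 × 2^-49, so the relative error is 2^-49 / 47, which is about 0.17 × 2^-52 and below half of machine epsilon.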
MACHINE REPRESENTATION
s | e1 e2 … eM | b1 b2 … bN
• Sign: 1 bit, 0 for positive, 1 for negative;
• Mantissa: 52 bits, b1 b2 … bN;
• Exponent: 11 bits, so $0 \le e \le 2^{11} - 1 = 2047$, and p = e - 1023;
• e = 1 ~ 2046 → p = -1022 ~ 1023 (normalized numbers);
• 2 values of e (2047 and 0) are reserved for infinity / NaN and for 0;
• e = 2047 → infinity if the mantissa is all zeros, NaN otherwise;
• e = 0 → small (subnormal) numbers, including 0.
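The field layout above can be inspected from Python. This is a minimal sketch using the standard `struct` module; the helper names `decode_double` and `classify` are hypothetical and exist only for this illustration.

```python
import struct

def decode_double(x):
    """Split a double into (sign, e, mantissa) as unsigned integers."""
    # Reinterpret the 8 bytes of the double as one 64-bit unsigned integer.
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                     # 1 bit
    e = (bits >> 52) & 0x7FF              # 11 bits
    mantissa = bits & ((1 << 52) - 1)     # 52 bits
    return sign, e, mantissa

def classify(x):
    """Describe which case of the exponent field x falls into."""
    sign, e, mantissa = decode_double(x)
    if e == 2047:
        return "infinity" if mantissa == 0 else "NaN"
    if e == 0:
        return "zero" if mantissa == 0 else "subnormal"
    return "normalized, p = %d" % (e - 1023)

print(decode_double(1.0))       # (0, 1023, 0)
print(decode_double(-2.0))      # (1, 1024, 0)
print(classify(float("inf")))   # infinity
print(classify(5e-324))         # subnormal
```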
ADDITION AND ROUNDING OF FLOATING POINT NUMBERS
Step 1: line up the two numbers (operands in double precision)
Step 2: add them (in higher precision)
Step 3: store the result as a floating point number (rounded back to double precision)
Example :
Example :
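The three steps can be imitated in software. In this sketch, which is an illustration under stated assumptions and not the hardware algorithm, the "higher precision" of step 2 is replaced by exact rational arithmetic, and the name `fl_add` is hypothetical. The operands 1 and 3 × 2^-53 are chosen here for illustration.

```python
from fractions import Fraction

def fl_add(a, b):
    """Add two doubles exactly, then round the sum once to a double."""
    exact = Fraction(a) + Fraction(b)   # steps 1 and 2: exact sum, no rounding
    return float(exact)                 # step 3: round to nearest double

a = 1.0
b = 3 * 2.0 ** -53                      # equals 2**-52 + 2**-53, exactly representable

# The exact sum 1 + 2**-52 + 2**-53 is a tie with bit 52 equal to 1: round up.
print(fl_add(a, b) == 1.0 + 2.0 ** -51)   # True
# The exact sum 1 + 2**-53 is a tie with bit 52 equal to 0: round down.
print(fl_add(1.0, 2.0 ** -53) == 1.0)     # True
# The built-in addition gives the same result.
print(fl_add(a, b) == a + b)              # True
```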