Download Document

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Floating Point Numbers
Floating Point Numbers
•Registers for real numbers usually contain 32 or 64 bits,
allowing 232 or 264 numbers to be represented.
•Which reals to represent? There are an infinite number
between 2 adjacent integers. (or two reals!!)
•Which bit patterns for reals selected?
•Answer: use scientific notation
•Consider: A x 10B, where A is one digit
A
0
1 .. 9
1 .. 9
1 .. 9
B
any
0
1
2
A x 10B
0
1 .. 9
10 .. 90
100 .. 900
1 .. 9
1 .. 9
-1
-2
0.1 .. 0.9
0.01 .. 0.09
•Scientific notation in binary
•Standard: IEEE 754 Floating-Point
Floating Point
Representation:
S
E
F
•S is one bit representing the sign of the number
•E is an 8 bit biased integer representing the exponent
•F is an unsigned integer
The true value represented is:
(-1)S x f x 2e
• e = E – bias
• f = F/2n + 1
• for single precision numbers n=23, bias=127
Floating Point
•S, E, F all represent fields within a representation. Each
is just a bunch of bits.
•S is the sign bit
•(-1)S  (-1)0 = +1 and (-1)1 = -1
•Just a sign bit for signed magnitude
•E is the exponent field
•The E field is a biased-127 representation.
•True exponent is (E – bias)
•The base (radix) is always 2 (implied).
•Some early machines used radix 4 or 16 (IBM)
•F (or M) is the fractional or mantissa field.
•It is in a strange form.
•There are 23 bits for F.
•A normalized FP number always has a leading 1.
•No need to store the one, just assume it.
•This MSB is called the HIDDEN BIT.
Floating Point: 64.2 into IEEE SP
1. Get a binary representation for 64.2
• Binary of left of radix point 64 is:
• Binary of right of radix
.2
.4
.8
.6
x
x
x
x
2
2
2
2
=
=
=
=
0.4
0.8
1.6
1.2
0
0
1
1
• Binary for .2:
• 64.2 is:
2. Normalize binary form
• Produces:
3. Turn true exponent into bias-127
•
4. Put it together:
• 23-bit F is:
• S E F is:
• In hex:
Floating Point
•Since floating point numbers are always stored in
normal form, how do we represent 0?
•0x0000 0000 and 0x8000 0000 represent 0.
•What numbers cannot be represented because of this?
Floating Point
•Other special values:
•+ 5 / 0 = +
•+
0 11111111 00000… (0x7f80 0000)
•-7/0 = •1 11111111 00000… (0xff80 0000)
•0/0 or +
+=NaN (Not a number)
•NaN ? 11111111 ?????…
(S is either 0 or 1, E=0xff, and F is anything but
all zeroes)
•Also de-normalized numbers (beyond scope)
Floating Point
What is 0x4228 0000 if in a SP FP notation?
Floating Point
What is 47.625 as a SP FP?
Floating Point
•What do floating-point numbers represent?
•Rational numbers with non-repeating expansions
in the given base within the specified exponent range.
•They do not represent repeating rational or irrational
numbers, or any number too small (0 < |x|<Bemin) –
underflow, or too large (|x| > Bemax) – overflow
•IEEE Double Precision is similar to SP
•52-bit M
•53 bits of precision with hidden bit
•11-bit E, excess 1023, representing –1022:1023
•One sign bit
•Always use DP unless memory/file size is important
•SP ~ 10-38 … 1038
•DP ~ 10-308 … 10308
•Be very careful of these ranges in numeric computation
Floating Point: Chapter 6
•Floating Point operations include
•?
•?
•?
•?
•?
•They are complicated because…
Floating Point Addition
9.997 x 102
+ 4.631 x 10-1
1. Align decimal points
2. Add
9.997
+ 0.004631
10.001631
x 102
x 102
x 102
3. Normalize the result
• Often already normalized
• Otherwise move one digit
1.0001631 x 103
4. Round result
1.000
x103
Floating Point Addition
.25 = 0 01111101 00000000000000000000000
100 = 0 10000101 10010000000000000000000
Or with hidden bit
.25 = 0 01111101 1 00000000000000000000000
100 = 0 10000101 1 10010000000000000000000
1. Align radix points
• Shifting F left by 1 bit, decreasing e by 1
• Shifting F right by 1 bit, increasing e by 1
• Shift mantissa right so least significant bits fall off
• Which should we shift?
Floating Point Addition
•Shift the .25 to increase its exponent
01111101 – 1111111 (127) =
10000101 – 1111111 (127) =
•Shift smaller by:
•Bias cancels with subtraction, so
10000101
- 01111101
00001000
•Carefully shifting,
0
0
0
0
0
0
0
0
0
01111101 1
01111110 0
01111111 0
10000000 0
10000001 0
10000010 0
10000011 0
10000100 0
10000101 0
00000000000000000000000 (original value)
10000000000000000000000 (shifted by 1)
01000000000000000000000 (shifted by 2)
00100000000000000000000 (shifted by 3)
00010000000000000000000 (shifted by 4)
00001000000000000000000 (shifted by 5)
00000100000000000000000 (shifted by 6)
00000010000000000000000 (shifted by 7)
00000001000000000000000 (shifted by 8)
Floating Point Addition
2. Add with hidden bit
+
0 10000101 1 10010000000000000000000 (100)
0 10000101 0 00000001000000000000000 (.25)
0 10000101 1 10010001000000000000000
3. Normalize the result
• Get a ‘1’ back in hidden bit
• Already normalized
4. Round result
• Renormalize if needed
•
0 10000101 10010001000000000000000
Post-normalization example:
+
s
0
0
0
0
exp
011
011
011
100
h
1
1
11
1
frtn
1100
1011
0111
1011
1 discarded
Floating Point Subtraction
•Mantissa’s are sign-magnitude
•Watch out when the numbers are close
1.23456
x 102
- 1.23455
x 102
•A many-digit normalization is possible
•This is why FP addition is in many ways more
difficult than FP multiplication
Floating Point Subtraction
1. Align radix points
2. Perform sign-magnitude operand swap
• Compare magnitudes (with hidden bit)
• Change sign bit if order of operands is changed.
3. Subtract
4. Normalize
5. Round
• Renormalize if needed
s
0
- 0
switch
0
- 0
1
1
exp
h
frtn
011
1
1011 smaller
011
1
1101 bigger
and make difference negative
011
1
1101 bigger
011
1
1011 smaller
011
0
0010
000
1
0000
switch sign
Floating Point Multiplication
Multiplication Example
3.0 x 101
x 5.0 x 102
•Multiply mantissas
3.0
x 5.0
15.00
•Add exponents
1+2=3
15.00 x 103
•Normalize if needed
1.50 x 104
Floating Point Multiplication
Multiplication in binary (4-bit F)
x
0 10000100 0100
1 00111100 1100
•Multiply mantissas
1.0100
x 1.1100
00000
00000
10100
10100
10100
1000110000
10.00110000
Floating Point Multiplication
•Add exponents, subtract extra bias.
Biased:
10000100
+ 00111100
•Check by converting to true exponents and adding
•Renormalize, correcting exponent
1 01000001
10.00110000
Becomes
1 01000010
1.000110000
•Drop the hidden bit
1 01000010
000110000
Floating Point Division
Division
•True division
•Unsigned, full-precision division on mantissas
•This is much more costly (e.g. 4x) than mult.
•Subtract exponents
•Faster division
•Newton’s method to find reciprocal
•Multiply dividend by reciprocal of divisor
•May not yield exact result without some work
•Similar speed as multiplication
Related documents