Download Floating Point - GMU Computer Science

CS 367
Floating Point
Topics (Ch 2.4)
  IEEE Floating Point Standard
  Rounding
  Floating Point Operations
  Mathematical properties

IEEE Floating Point
IEEE Standard 754
  Established in 1985 as a uniform standard for floating point arithmetic
    Before that, many idiosyncratic formats
  Supported by all major CPUs
Driven by Numerical Concerns
  Nice standards for rounding, overflow, underflow
  Hard to make fast in hardware
    Numerical analysts predominated over hardware designers in defining the standard

IEEE Standard 754: Form and Encoding
Numerical Form
  $(-1)^s \times M \times 2^E$
    Sign bit s determines whether the number is negative or positive
    Significand M is normally a fractional value in the range [1.0, 2.0)
    Exponent E weights the value by a power of two
Encoding
  Fields (MSB to LSB): s | exp | frac
  MSB is the sign bit
  exp field encodes E
  frac field encodes M

IEEE Standard 754: Precision options
Format                            Total     s       exp       frac
Single precision                  32 bits   1 bit   8 bits    23 bits
Double precision                  64 bits   1 bit   11 bits   52 bits
Extended precision (Intel only)   80 bits   1 bit   15 bits   63 or 64 bits
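The field layout above can be checked on a real machine. The following is a minimal sketch, assuming that `float` is the 32-bit IEEE single-precision format (true on mainstream platforms, but not guaranteed by C). The names `float_fields` and `split_float` are made up for this illustration.

```c
#include <stdint.h>
#include <string.h>

/* The three fields of a single-precision value (illustrative type name). */
struct float_fields {
    uint32_t sign;  /* 1 bit   */
    uint32_t exp;   /* 8 bits  */
    uint32_t frac;  /* 23 bits */
};

/* Copy the bits of f into an integer and slice out s, exp and frac.
   Assumes float is 32-bit IEEE single precision. */
struct float_fields split_float(float f)
{
    uint32_t bits;
    struct float_fields out;

    memcpy(&bits, &f, sizeof bits);   /* well-defined way to read the bit pattern */
    out.sign = bits >> 31;            /* MSB */
    out.exp  = (bits >> 23) & 0xFFu;  /* next 8 bits */
    out.frac = bits & 0x7FFFFFu;      /* low 23 bits */
    return out;
}
```

For example, `split_float(1.0f)` yields sign 0, exp 127 and frac 0, and `split_float(-2.0f)` yields sign 1 and exp 128.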

Floating Point Types
  Fields (MSB to LSB): s | exp | frac
Types
  1. Normalized values: when exp ≠ 000…0 and exp ≠ 111…1
  2. Denormalized values: when exp = 000…0
  3. Special values: when exp = 111…1
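The three cases depend only on the exp field, so they can be told apart with a couple of comparisons. This is a rough sketch for single precision, again assuming a 32-bit IEEE `float`; the enum and function names are invented here and are not part of any library.

```c
#include <stdint.h>
#include <string.h>

/* Illustrative names for the three types listed above. */
enum kind { KIND_NORMALIZED, KIND_DENORMALIZED, KIND_SPECIAL };

/* Classify a float by looking only at its 8-bit exp field. */
enum kind classify_float(float f)
{
    uint32_t bits;
    uint32_t exp;

    memcpy(&bits, &f, sizeof bits);
    exp = (bits >> 23) & 0xFFu;

    if (exp == 0x00u)
        return KIND_DENORMALIZED;   /* exp = 000...0 (includes +0 and -0) */
    if (exp == 0xFFu)
        return KIND_SPECIAL;        /* exp = 111...1 (infinity or NaN) */
    return KIND_NORMALIZED;
}
```

With this sketch, `classify_float(1.0f)` is `KIND_NORMALIZED`, while `classify_float(0.0f)` and `classify_float(1e-40f)` are both `KIND_DENORMALIZED`.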

Background: Fractional Binary Numbers
Representation
  Bit:     b_i   b_{i-1}   …   b_2   b_1   b_0   .   b_{-1}   b_{-2}   b_{-3}   …   b_{-j}
  Weight:  2^i   2^{i-1}   …   4     2     1         1/2      1/4      1/8      …   2^{-j}
  Bits to the right of the “binary point” represent fractional powers of 2
  Represents the rational number: $\sum_{k=-j}^{i} b_k \times 2^k$

Frac. Binary Number Examples
Value     Representation
5 3/4     101.11₂
2 7/8     10.111₂
63/64     0.111111₂
Observations
  Multiply/Divide by 2 by shifting left/right
    5 3/4 * 2 = 11 1/2 = 1011.1₂
  Numbers of the form 0.111111…₂ are just below 1.0
    1/2 + 1/4 + 1/8 + … + 1/2^i + … → 1.0
    Use notation 1.0 – ε

Representable Numbers
Limitation
  Can only exactly represent numbers of the form $x/2^k$
  Other numbers have repeating bit representations
Value   Representation
1/3     0.0101010101[01]…₂
1/5     0.001100110011[0011]…₂
1/10    0.0001100110011[0011]…₂
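The repeating patterns in the table can be reproduced with long division in base 2. The function below is a small sketch (the name `frac_bits` and its interface are invented for this example); it assumes 0 <= num < den and a denominator small enough that doubling the remainder does not overflow an `unsigned`.

```c
#include <stddef.h>

/* Write the first n bits after the binary point of num/den into out as
   '0'/'1' characters, NUL-terminated. out must hold n + 1 chars.
   Returns 1 if the expansion ended exactly within n bits (the value has
   the form x/2^k), otherwise 0. Assumes 0 <= num < den. */
int frac_bits(unsigned num, unsigned den, size_t n, char *out)
{
    size_t i;

    for (i = 0; i < n; i++) {
        num *= 2;                 /* shift left by one binary place */
        if (num >= den) {
            out[i] = '1';
            num -= den;
        } else {
            out[i] = '0';
        }
    }
    out[n] = '\0';
    return num == 0;              /* remainder 0: exactly representable */
}
```

Called with 1/3 and n = 10 it fills the buffer with `0101010101` and returns 0; with 3/4 and n = 4 it produces `1100` and returns 1.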

Normalized floating point
  $(-1)^s \times M \times 2^E$
  Fields (MSB to LSB): s | exp | frac
Condition
  exp ≠ 000…0 and exp ≠ 111…1
Exponent coded as biased value
  E = Exp – Bias
    Exp: unsigned value denoted by the exp bits
    Bias: bias value
      » Single precision: 127 (Exp: 1…254, E: -126…127)
      » Double precision: 1023 (Exp: 1…2046, E: -1022…1023)
      » In general: Bias = $2^{e-1} - 1$, where e is the number of exponent bits
Significand coded with implied leading 1
  M = 1.xxx…x₂
    xxx…x: bits of frac
    Minimum when frac = 000…0 (M = 1.0)
    Maximum when frac = 111…1 (M = 2.0 – ε)

Normalized Encoding Example
Value
  float F = 15213.0;
    15213₁₀ = 11101101101101₂ = 1.1101101101101₂ × $2^{13}$
Significand
  M    = 1.1101101101101₂
  frac = 11011011011010000000000₂
Exponent
  E    = 13
  Bias = 127
  Exp  = 140 = 10001100₂
Floating Point Representation:
  Hex:     4    6    6    D    B    4    0    0
  Binary:  0100 0110 0110 1101 1011 0100 0000 0000
  140:      100 0110 0
  15213:             1110 1101 1011 01   (the leading 1 is implied, not stored)
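The worked example can be confirmed by assembling the fields by hand and comparing with what the machine stores. This is a sketch under the assumption that `float` is 32-bit IEEE single precision; `float_bits` and `encode_15213` are illustrative names.

```c
#include <stdint.h>
#include <string.h>

/* Return the raw 32-bit encoding of a float (assumes IEEE single precision). */
uint32_t float_bits(float f)
{
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return bits;
}

/* Rebuild the encoding of 15213.0 from the fields worked out above. */
uint32_t encode_15213(void)
{
    uint32_t s    = 0;               /* positive */
    uint32_t exp  = 13 + 127;        /* E + Bias = 140 */
    uint32_t frac = 0x1B6Du << 10;   /* 1101101101101 followed by ten 0 bits */

    return (s << 31) | (exp << 23) | frac;
}
```

Both `float_bits(15213.0f)` and `encode_15213()` evaluate to `0x466DB400`, matching the hex digits on the slide.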

Denormalized Values
  $(-1)^s \times M \times 2^E$
  Fields (MSB to LSB): s | exp | frac
Condition
  exp = 000…0
Value
  Exponent value E = –Bias + 1
  Significand value M = 0.xxx…x₂
    xxx…x: bits of frac
Cases
  exp = 000…0, frac = 000…0
    Represents value 0
    Note that there are distinct values +0 and –0
  exp = 000…0, frac ≠ 000…0
    Numbers very close to 0.0
    Lose precision as they get smaller
    “Gradual underflow”

Special Values
Condition
  exp = 111…1
Cases
  exp = 111…1, frac = 000…0
    Represents value ∞ (infinity)
    Result of an operation that overflows
    Both positive and negative
    E.g., 1.0/0.0 = –1.0/–0.0 = +∞, 1.0/–0.0 = –∞
  exp = 111…1, frac ≠ 000…0
    Not-a-Number (NaN)
    Represents case when no numeric value can be determined
    E.g., sqrt(–1), ∞ – ∞
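These special values are easy to produce in C. The sketch below assumes IEEE arithmetic with default settings (no fast-math style compiler options); `divide` and `describe` are names chosen for this example. The operands go through `volatile` variables so that the division happens at run time.

```c
#include <math.h>

/* Divide through volatile operands so the compiler performs the division
   at run time instead of folding a constant such as 1.0/0.0. */
double divide(double num, double den)
{
    volatile double n = num, d = den;
    return n / d;
}

/* Name the category a value falls into. */
const char *describe(double x)
{
    if (isnan(x))
        return "NaN";
    if (isinf(x))
        return x > 0 ? "+inf" : "-inf";
    return "finite";
}
```

Here `describe(divide(1.0, 0.0))` gives `"+inf"`, `describe(divide(1.0, -0.0))` gives `"-inf"`, and subtracting one infinity from another gives a value for which `describe` returns `"NaN"`.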

Summary of Floating Point Real Number Encodings
  Number line, left to right:
  NaN | –∞ | –Normalized | –Denorm | –0 | +0 | +Denorm | +Normalized | +∞ | NaN

Distribution of Values
6-bit IEEE-like format
  e = 3 exponent bits
  f = 2 fraction bits
  Bias is 3
Notice how the distribution gets denser toward zero.
[Figure: number line from -15 to 15 marking the denormalized, normalized and infinity values]

Distribution of Values (close-up view)
6-bit IEEE-like format
  e = 3 exponent bits
  f = 2 fraction bits
  Bias is 3
[Figure: number line from -1 to 1 marking the denormalized, normalized and infinity values]

Interesting Numbers
Description                exp     frac    Numeric Value
Zero                       00…00   00…00   0.0
Smallest Pos. Denorm.      00…00   00…01   $2^{-\{23,52\}} \times 2^{-\{126,1022\}}$
  Single ≈ $1.4 \times 10^{-45}$
  Double ≈ $4.9 \times 10^{-324}$
Largest Denormalized       00…00   11…11   $(1.0 - \varepsilon) \times 2^{-\{126,1022\}}$
  Single ≈ $1.18 \times 10^{-38}$
  Double ≈ $2.2 \times 10^{-308}$
Smallest Pos. Normalized   00…01   00…00   $1.0 \times 2^{-\{126,1022\}}$
  Just larger than largest denormalized
One                        01…11   00…00   1.0
Largest Normalized         11…10   11…11   $(2.0 - \varepsilon) \times 2^{\{127,1023\}}$
  Single ≈ $3.4 \times 10^{38}$
  Double ≈ $1.8 \times 10^{308}$

Special Properties of Encoding
FP Zero Same as Integer Zero
  All bits = 0
Can (Almost) Use Unsigned Integer Comparison
  Must first compare sign bits
  Must consider –0 = 0
  NaNs problematic
    Their bit patterns will be greater than any other values
    What should comparison yield?
  Otherwise OK
    Denorm vs. normalized
    Normalized vs. infinity

Tiny Floating Point Example
8-bit Floating Point Representation
  The sign bit is in the most significant bit.
  The next four bits are the exponent, with a bias of 7.
  The last three bits are the frac.
Same General Form as IEEE Format
  Normalized, denormalized
  Representation of 0, NaN, infinity
Bit layout
  bit 7: s | bits 6–3: exp | bits 2–0: frac

Values Related to the Exponent
Bias = 7
Exp   exp    E     $2^E$
0     0000   -6    1/64    (denorms: E = -Bias + 1)
1     0001   -6    1/64    (normalized: E = Exp - Bias)
2     0010   -5    1/32
3     0011   -4    1/16
4     0100   -3    1/8
5     0101   -2    1/4
6     0110   -1    1/2
7     0111    0    1
8     1000   +1    2
9     1001   +2    4
10    1010   +3    8
11    1011   +4    16
12    1100   +5    32
13    1101   +6    64
14    1110   +7    128
15    1111   n/a           (inf, NaN)

Representation to Value
  $(-1)^s \times M \times 2^E$
               s exp  frac   E     Value
Denormalized   0 0000 000    -6    0
numbers        0 0000 001    -6    1/8*1/64 = 1/512     closest to zero
               0 0000 010    -6    2/8*1/64 = 2/512
               …
               0 0000 110    -6    6/8*1/64 = 6/512
               0 0000 111    -6    7/8*1/64 = 7/512     largest denorm
Normalized     0 0001 000    -6    8/8*1/64 = 8/512     smallest norm
numbers        0 0001 001    -6    9/8*1/64 = 9/512
               …
               0 0110 110    -1    14/8*1/2 = 14/16
               0 0110 111    -1    15/8*1/2 = 15/16     closest to 1 below
               0 0111 000     0    8/8*1    = 1
               0 0111 001     0    9/8*1    = 9/8       closest to 1 above
               0 0111 010     0    10/8*1   = 10/8
               …
               0 1110 110     7    14/8*128 = 224
               0 1110 111     7    15/8*128 = 240       largest norm
               0 1111 000    n/a   inf

Ranges:
Normalized
  Largest:                  0 1110 111 = 15/8 * 128 = 240
  Closest to 0 (positive):  0 0001 000 = 1 * 1/64 = 0.015625
  Range:                    -240 to 240
  Total number of values:   112
Denormalized
  Largest:                  0 0000 111 = 7/8 * 1/64 = 0.01367188
  Closest to 0 (positive):  0 0000 001 = 1/8 * 1/64 = 0.001953125
  Range:                    -0.01367188 to 0.01367188
  Total number of values:   8

Examples: binary to float
Bit layout
  bit 7: s | bits 6–3: exp | bits 2–0: frac
1. 0 0101 110
   exp = 5 so E = 5 – 7 = -2
   frac = 6/8 so M = 14/8
   = 14/8 * 1/4 = 7/16
2. 1 1010 011
   exp = 10 so E = 10 – 7 = 3
   frac = 3/8 so M = 11/8
   = -11/8 * 8 = -11.0
3. 0 0000 101
   exp = 0 so denormalized.
   E = -6 (-bias + 1)
   frac = 5/8 so M = 5/8
   = 5/8 * 1/64 = 5/512
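The three decoding examples follow the same recipe, which can be written as a short function. This is only a sketch for the 8-bit teaching format (1 sign bit, 4 exp bits with bias 7, 3 frac bits); `tiny_decode` and `pow2` are names invented for this illustration.

```c
#include <math.h>
#include <stdint.h>

/* 2 raised to the integer power e, computed by repeated doubling/halving. */
static double pow2(int e)
{
    double p = 1.0;
    for (; e > 0; e--) p *= 2.0;
    for (; e < 0; e++) p /= 2.0;
    return p;
}

/* Decode the 8-bit format: 1 sign bit, 4 exp bits (bias 7), 3 frac bits. */
double tiny_decode(uint8_t bits)
{
    int s    = (bits >> 7) & 0x1;
    int exp  = (bits >> 3) & 0xF;
    int frac = bits & 0x7;
    double sign = s ? -1.0 : 1.0;

    if (exp == 0xF)                        /* special values */
        return frac == 0 ? sign * INFINITY : NAN;
    if (exp == 0)                          /* denormalized: M = 0.xxx, E = -6 */
        return sign * (frac / 8.0) * pow2(-6);
    /* normalized: M = 1.xxx, E = exp - Bias */
    return sign * (1.0 + frac / 8.0) * pow2(exp - 7);
}
```

The slide's bit patterns map to the bytes 0x2E, 0xD3 and 0x05, and the function returns 7/16, -11.0 and 5/512 for them respectively.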

Float to binary
Step 1: Get the normal form
  • Binary:
    1. Represent as binary: X.Y
    2. Normalize: move the binary point until you have a leading 1
    3. Normal form: 1.Z * $2^k$, where k is based on step 2
  • Decimal:
    1. Multiply or divide the number by 2 until you get a result X between 1 (inclusive) and 2 (exclusive).
    2. Normal form: X * $2^k$, where k is based on step 1
Step 2: Determine exponent and fractional parts based on the normal form.

Binary Normalize
Requirement
  Set binary point so that numbers are of the form 1.xxxxx
  Adjust all to have a leading one
    Decrement exponent as shift left
Value     Binary      Fraction    Normal Form
13        001101.0    1.1010000   1.101₂ * $2^{3}$
17        010001.0    1.0001000   1.0001₂ * $2^{4}$
63        111111.0    1.1111100   1.11111₂ * $2^{5}$
0.25      0.010000    1.0000000   1.0₂ * $2^{-2}$
0.3125    0.010100    1.0100000   1.01₂ * $2^{-2}$
0.09375   0.000110    1.1000000   1.1₂ * $2^{-4}$

Decimal Normalize
Requirement
  Multiply/Divide by 2 until in the correct range [1, 2)
    Increment exponent for each divide, decrement it for each multiply
Value     Multiply/Divide                    Normal Form
13        13, 13/2, 13/4, 13/8               1.101₂ * $2^{3}$
17        17, 17/2, 17/4, 17/8, 17/16        1.0001₂ * $2^{4}$
0.25      1/4, 1/2, 1                        1.0₂ * $2^{-2}$
0.3125    5/16, 10/16, 20/16                 1.01₂ * $2^{-2}$
0.09375   3/32, 6/32, 12/32, 24/32, 48/32    1.1₂ * $2^{-4}$

Normal Form to Encoding
Exponent
  Remember that E = exp – bias
  If the computed exp value does not fit in the valid range, the result is either overflow (number too large), denormalized (very small but representable number) or underflow (number too small)
Fraction
  If the non-zero portion of M fits into the fractional part, no rounding is needed and you are done
  If not, need to round

Examples: float to binary (no rounding)
Bit layout
  bit 7: s | bits 6–3: exp | bits 2–0: frac
1. 52.0 = 1.625 * $2^{5}$
   E = 5 so exp must be 12
   1.625 = 13/8 so frac = 5/8
   = 0 1100 101
   Could also do it this way: 52 is 110100.0 in 6 bits.
   Moving the binary point left 5 times gives us 1.10100, so the fraction bits are 101.
2. 7/64 = 14/8 * $1/2^{4}$
   E = -4 so exp must be 3
   frac = 6/8
   = 0 0011 110
   7/64 is 0.000111 in binary.
   Moving the binary point to the right 4 times gives us 1.11
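The two steps (normalize, then pick exp and frac) can be sketched in code for the same 8-bit format. The function below is an illustration only: the name `tiny_encode_exact` is invented, it handles positive values only, and it gives up instead of rounding when the value does not fit exactly.

```c
#include <stdint.h>

/* Encode a positive value that the 8-bit format (bias 7, 3 frac bits) can
   hold exactly. Returns 0 and stores the bit pattern in *out on success;
   returns -1 if the value is not positive, is too large, or would need
   rounding. */
int tiny_encode_exact(double x, uint8_t *out)
{
    int E = 0;       /* exponent of the normal form */
    int exp;         /* biased exponent field */
    int frac;        /* 3-bit frac field */
    double m8;       /* fraction part scaled by 8 */

    if (!(x > 0.0) || x > 240.0)
        return -1;                        /* not positive, NaN, or too large */

    /* Step 1: normal form x = M * 2^E with 1 <= M < 2 */
    while (x >= 2.0) { x /= 2.0; E++; }
    while (x < 1.0)  { x *= 2.0; E--; }

    /* Step 2: pick the exp and frac fields */
    if (E < -6) {
        /* too small for a normalized value: fix E at -6 so that M = 0.xxx */
        while (E < -6) { x /= 2.0; E++; }
        exp = 0;
        m8 = x * 8.0;
    } else {
        exp = E + 7;                      /* E = exp - bias */
        m8 = (x - 1.0) * 8.0;             /* drop the implied leading 1 */
    }

    frac = (int)m8;
    if ((double)frac != m8)
        return -1;                        /* would need rounding */

    *out = (uint8_t)((exp << 3) | frac);
    return 0;
}
```

For 52.0 this stores 0x65 (0 1100 101) and for 7/64 it stores 0x1E (0 0011 110), matching the two examples; a value such as 0.1 is rejected because it would need rounding.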

Floating Point Operations
Conceptual View
  First compute exact result
  Make it fit into desired precision
    Possibly overflow if exponent too large
    Possibly round to fit into frac
Rounding Modes (illustrate with $ rounding)
                           $1.40   $1.60   $1.50   $2.50   –$1.50
  Zero                     $1      $1      $1      $2      –$1
  Round down (–∞)          $1      $1      $1      $2      –$2
  Round up (+∞)            $2      $2      $2      $3      –$1
  Nearest Even (default)   $1      $2      $2      $2      –$2
Note:
  1. Round down: rounded result is close to but no greater than true result.
  2. Round up: rounded result is close to but no less than true result.
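The four modes in the table can be imitated for whole-number rounding with plain C. This is a hand-written sketch, not how hardware rounding modes are selected; the function names are invented, and the helpers are only meaningful while x fits comfortably in a `long`.

```c
/* Each helper rounds x to a whole number; valid only while x fits in a long. */

/* Toward zero: a C cast from double to an integer type truncates. */
long round_zero(double x)
{
    return (long)x;
}

/* Round down: largest whole number not greater than x. */
long round_down(double x)
{
    long t = (long)x;
    return (t > x) ? t - 1 : t;
}

/* Round up: smallest whole number not less than x. */
long round_up(double x)
{
    long t = (long)x;
    return (t < x) ? t + 1 : t;
}

/* Nearest, with ties going to the even neighbor. */
long round_nearest_even(double x)
{
    long lo = round_down(x);
    double diff = x - (double)lo;        /* in [0, 1) */

    if (diff < 0.5) return lo;
    if (diff > 0.5) return lo + 1;
    return (lo % 2 == 0) ? lo : lo + 1;  /* tie: pick the even neighbor */
}
```

Applied to the dollar amounts above, `round_nearest_even(1.5)` and `round_nearest_even(2.5)` both give 2, while `round_down(-1.5)` gives -2 and `round_up(-1.5)` gives -1.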

Closer Look at Round-To-Even
Default Rounding Mode
  Hard to get any other kind without dropping into assembly
  All others are statistically biased
    Sum of a set of positive numbers will consistently be over- or under-estimated
Applying to Other Decimal Places / Bit Positions
  When exactly halfway between two possible values
    Round so that least significant digit is even
  E.g., round to nearest hundredth
    1.2349999   1.23   (Less than half way)
    1.2350001   1.24   (Greater than half way)
    1.2350000   1.24   (Half way: round up)
    1.2450000   1.24   (Half way: round down)

Rounding Binary Numbers
Binary Fractional Numbers
  “Even” when least significant bit is 0
  Half way when bits to right of rounding position = 100…₂
Examples
  Round to nearest 1/4 (2 bits right of binary point)
  Value    Binary      Rounded   Action         Rounded Value
  2 3/32   10.00011₂   10.00₂    (<1/2: down)   2
  2 3/16   10.00110₂   10.01₂    (>1/2: up)     2 1/4
  2 7/8    10.11100₂   11.00₂    (1/2: up)      3
  2 5/8    10.10100₂   10.10₂    (1/2: down)    2 1/2

Rounding
  1.BBGRXXX
    Guard bit: LSB of result
    Round bit: 1st bit removed
    Sticky bit: OR of remaining bits
Round up conditions
  Round = 1, Sticky = 1 ➙ > 0.5
  Guard = 1, Round = 1, Sticky = 0 ➙ Round to even
Value   Fraction    GRS   Incr?   Rounded
128     1.0000000   000   N       1.000
13      1.1010000   100   N       1.101
17      1.0001000   010   N       1.000
19      1.0011000   110   Y       1.010
138     1.0001010   011   Y       1.001
63      1.1111100   111   Y       10.000
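The guard/round/sticky rule translates directly into bit operations. In the sketch below the significand is held as an integer bit pattern (for example 1.0011000 is stored as 10011000 in binary); `round_grs` is an illustrative name, and `drop` is assumed to be at least 1 and smaller than the width of `unsigned`.

```c
/* Round an integer significand to nearest-even after dropping `drop` low
   bits. Example layout for drop = 4: 1BBG RXXX.
   Assumes 1 <= drop < number of bits in unsigned. */
unsigned round_grs(unsigned sig, unsigned drop)
{
    unsigned kept   = sig >> drop;                        /* 1.BBG */
    unsigned guard  = kept & 1u;                          /* LSB of result */
    unsigned round  = (sig >> (drop - 1)) & 1u;           /* first bit removed */
    unsigned sticky = (sig & ((1u << (drop - 1)) - 1u)) != 0;  /* OR of the rest */

    if (round && (sticky || guard))
        kept += 1;    /* above half way, or exactly half way with odd result */
    return kept;      /* may carry into a new top bit, e.g. 1.111 -> 10.000 */
}
```

With `drop = 4`, the pattern for 19 (10011000) rounds to 1010, the pattern for 138 (10001010) rounds to 1001, and the pattern for 63 (11111100) carries out to 10000, in line with the table.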

FP Multiplication
Operands
  $(-1)^{s_1} \times M_1 \times 2^{E_1}$ * $(-1)^{s_2} \times M_2 \times 2^{E_2}$
Exact Result
  $(-1)^s \times M \times 2^E$
    Sign s: s1 ^ s2
    Significand M: M1 * M2
    Exponent E: E1 + E2
Fixing
  If M ≥ 2, shift M right, increment E
  If E out of range, overflow
  Round M to fit frac precision
Implementation
  Biggest chore is multiplying significands

FP Multiplication Example
  $(1.8 \times 2^{3}) \times (1.6 \times 2^{5})$
  Exact Result: $(1.8 \times 1.6) \times 2^{3+5} = 2.88 \times 2^{8}$
  Need to divide M by 2 to put into normal form: $= 1.44 \times 2^{9}$

Math. Properties of FP Mult
Compare to Commutative Ring
  Closed under multiplication? YES
    But may generate infinity or NaN
  Multiplication commutative? YES
  Multiplication associative? NO
    Possibility of overflow, inexactness of rounding
  1 is multiplicative identity? YES
  Multiplication distributes over addition? NO
    Possibility of overflow, inexactness of rounding
Monotonicity
  a ≥ b & c ≥ 0 ⇒ a * c ≥ b * c? ALMOST
    Except for infinities & NaNs
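The failure of associativity is easy to trigger through overflow. The following sketch assumes IEEE double arithmetic; the function name is made up, and the intermediate products are stored in `volatile` variables so that each one is rounded to double before it is used.

```c
/* Returns 1 if (a*b)*c and a*(b*c) give different doubles, else 0. */
int mult_not_associative(double a, double b, double c)
{
    volatile double ab = a * b;      /* volatile forces rounding to double */
    volatile double bc = b * c;
    volatile double left  = ab * c;  /* (a*b)*c */
    volatile double right = a * bc;  /* a*(b*c) */

    return left != right;
}
```

With a = 1e200, b = 1e200 and c = 1e-200 the left grouping overflows to infinity while the right grouping stays finite, so the function returns 1. With small exact values such as 2.0, 4.0 and 8.0 it returns 0.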

FP Addition
Operands
  $(-1)^{s_1} \times M_1 \times 2^{E_1}$ + $(-1)^{s_2} \times M_2 \times 2^{E_2}$
  Assume E1 > E2
  Align: shift $(-1)^{s_2} M_2$ right by E1 – E2 positions, then add it to $(-1)^{s_1} M_1$ to get $(-1)^s M$
Exact Result
  $(-1)^s \times M \times 2^E$
    Sign s, significand M: result of signed align & add
    Exponent E: E1
Fixing
  If M ≥ 2, shift M right, increment E
  If M < 1, shift M left k positions, decrement E by k
  Overflow if E out of range
  Round M to fit frac precision

FP Addition Example
  $(1.8 \times 2^{3}) + (1.6 \times 2^{5})$
  Align numbers: $(0.45 \times 2^{5}) + (1.6 \times 2^{5})$
  Now add: $= 2.05 \times 2^{5}$
  Need to divide M by 2 to put into normal form: $= 1.025 \times 2^{6}$

Mathematical Properties of FP Add
Compare to those of Abelian Group
  Closed under addition? YES
    But may generate infinity or NaN
  Commutative? YES
  Associative? NO
    Overflow and inexactness of rounding
  0 is additive identity? YES
  Every element has additive inverse? ALMOST
    Except for infinities & NaNs
Monotonicity
  a ≥ b ⇒ a + c ≥ b + c? ALMOST
    Except for infinities & NaNs
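Rounding alone is enough to break associativity of addition, even without overflow. A small sketch, assuming IEEE double arithmetic (the function names are invented for this example):

```c
/* (a + b) + c, with the intermediate sum rounded to double. */
double add_left(double a, double b, double c)
{
    volatile double t = a + b;
    return t + c;
}

/* a + (b + c), with the intermediate sum rounded to double. */
double add_right(double a, double b, double c)
{
    volatile double t = b + c;
    return a + t;
}
```

For a = 3.14, b = 1e20 and c = -1e20, `add_left` returns 0.0 because 3.14 is lost when it is added to 1e20, whereas `add_right` returns 3.14 because the two large values cancel first.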

Floating Point in C
C Guarantees Two Levels
  float    single precision
  double   double precision
Conversions
  Casting between int, float, and double changes numeric values
  double or float to int
    Truncates fractional part
    Like rounding toward zero
    Not defined when out of range
      » Generally saturates to TMin or TMax
  int to double
    Exact conversion, as long as int has ≤ 53 bit word size
  int to float
    Will round according to rounding mode

Floating Point Puzzles
For each of the following C expressions, either:
  Argue that it is true for all argument values
  Explain why not true
Declarations
  int x = …;
  float f = …;
  double d = …;
  Assume neither d nor f is NaN
Expressions
  • x == (int)(float) x
  • x == (int)(double) x
  • f == (float)(double) f
  • d == (float) d
  • f == -(-f);
  • 2/3 == 2/3.0
  • d < 0.0 ⇒ ((d*2) < 0.0)
  • d > f ⇒ -f > -d
  • d * d >= 0.0
  • (d+f)-d == f

Floating Point Puzzles
For each of the following C expressions, either:
  Argue that it is true for all argument values
  Explain why not true
Declarations
  int x = …;
  float f = …;
  double d = …;
  Assume neither d nor f is NaN
Expressions and answers
  • x == (int)(float) x            F
  • x == (int)(double) x           T
  • f == (float)(double) f         T
  • d == (float) d                 F
  • f == -(-f);                    T
  • 2/3 == 2/3.0                   F
  • d < 0.0 ⇒ ((d*2) < 0.0)        T (on overflow d*2 is –∞, which is still < 0.0)
  • d > f ⇒ -f > -d                T
  • d * d >= 0.0                   T (on overflow d*d is +∞, which is still >= 0.0)
  • (d+f)-d == f                   F
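Counterexamples for the puzzles marked F can be checked with a few helper functions. This is a sketch assuming a 32-bit `int`, IEEE `float` and `double`, and the default round-to-nearest mode; the function names are made up for this illustration, and the `volatile` temporaries force each intermediate value to be rounded to its declared type.

```c
/* x == (int)(float) x : fails when x needs more than 24 significant bits. */
int int_survives_float(int x)
{
    volatile float f = (float)x;
    return x == (int)f;
}

/* x == (int)(double) x : a 32-bit int always fits in a double's 53 bits. */
int int_survives_double(int x)
{
    volatile double d = (double)x;
    return x == (int)d;
}

/* d == (float) d : fails when d is not exactly representable as a float. */
int double_survives_float(double d)
{
    volatile float f = (float)d;
    return d == f;
}

/* 2/3 == 2/3.0 : integer division gives 0, floating division gives 0.666... */
int two_thirds_equal(void)
{
    return 2 / 3 == 2 / 3.0;
}

/* (d+f)-d == f : fails when f is lost in the rounding of d+f. */
int add_then_subtract(double d, float f)
{
    volatile double sum = d + f;
    volatile double back = sum - d;
    return back == f;
}
```

For instance, `int_survives_float(16777217)` is 0 because 2^24 + 1 rounds to 2^24 as a float, `double_survives_float(0.1)` is 0, `two_thirds_equal()` is 0, and `add_then_subtract(1e20, 1.0f)` is 0, while `int_survives_double(16777217)` is 1.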

Summary
IEEE Floating Point Has Clear Mathematical Properties
  Represents numbers of the form $M \times 2^E$
  Can reason about operations independent of implementation
    As if computed with perfect precision and then rounded
  Not the same as real arithmetic
    Violates associativity/distributivity
    Makes life difficult for compilers & serious numerical applications programmers