Download Floating Point - GMU Computer Science

CS 367: Floating Point

Topics (Ch 2.4)
• IEEE Floating Point Standard
• Rounding
• Floating Point Operations
• Mathematical properties

IEEE Floating Point

IEEE Standard 754
• Established in 1985 as a uniform standard for floating point arithmetic
  • Before that, many idiosyncratic formats
• Supported by all major CPUs

Driven by Numerical Concerns
• Nice standards for rounding, overflow, underflow
• Hard to make go fast
  • Numerical analysts predominated over hardware types in defining the standard

IEEE Standard 754: Form and Encoding

Numerical Form
• (-1)^s × M × 2^E
  • Sign bit s determines whether the number is negative or positive
  • Significand M is normally a fractional value in the range [1.0, 2.0)
  • Exponent E weights the value by a power of two

Encoding
  | s | exp | frac |
• MSB is the sign bit
• exp field encodes E
• frac field encodes M

IEEE Standard 754: Precision options

  Format               Total bits   s   exp   frac
  Single precision     32           1   8     23
  Double precision     64           1   11    52
  Extended precision   80           1   15    63 or 64   (Intel only)

Floating Point Types
  | s | exp | frac |
1. Normalized values: when exp ≠ 000…0 and exp ≠ 111…1
2. Denormalized values: when exp = 000…0
3. Special values: when exp = 111…1

Background: Fractional Binary Numbers

Representation
  bits:     b_i   b_(i-1)   …   b_2   b_1   b_0   .   b_(-1)   b_(-2)   b_(-3)   …   b_(-j)
  weights:  2^i   2^(i-1)   …   4     2     1     .   1/2      1/4      1/8      …   2^(-j)
• Bits to the right of the "binary point" represent fractional powers of 2
• Represents the rational number: sum over k = -j … i of b_k × 2^k
Fractional Binary Number Examples

  Value    Representation
  5 3/4    101.11_2
  2 7/8    10.111_2
  63/64    0.111111_2

Observations
• Multiply/divide by 2 by shifting left/right
  • 5 3/4 × 2 = 11 1/2 = 1011.1_2
• Numbers of the form 0.111111…_2 are just below 1.0
  • 1/2 + 1/4 + 1/8 + … + 1/2^i + … → 1.0
  • Use the notation 1.0 - ε

Representable Numbers

Limitation
• Can only exactly represent numbers of the form x/2^k
• Other numbers have repeating bit representations

  Value   Representation
  1/3     0.0101010101[01]…_2
  1/5     0.001100110011[0011]…_2
  1/10    0.0001100110011[0011]…_2

Normalized floating point: (-1)^s × M × 2^E
  | s | exp | frac |

Condition
• exp ≠ 000…0 and exp ≠ 111…1

Exponent coded as a biased value
• E = Exp - Bias
  • Exp: unsigned value denoted by the exp bits
  • Bias: bias value
    » Single precision: 127 (Exp: 1…254, E: -126…127)
    » Double precision: 1023 (Exp: 1…2046, E: -1022…1023)
    » In general: Bias = 2^(e-1) - 1, where e is the number of exponent bits

Significand coded with implied leading 1
• M = 1.xxx…x_2
  • xxx…x: bits of frac
  • Minimum when 000…0 (M = 1.0)
  • Maximum when 111…1 (M = 2.0 - ε)

Normalized Encoding Example

Value
• float F = 15213.0;
• 15213_10 = 11101101101101_2 = 1.1101101101101_2 × 2^13

Significand
• M    = 1.1101101101101_2
• frac = 11011011011010000000000_2

Exponent
• E    = 13
• Bias = 127
• Exp  = 140 = 10001100_2

Floating Point Representation (Class 02):
  Hex:     4    6    6    D    B    4    0    0
  Binary:  0100 0110 0110 1101 1011 0100 0000 0000
  Fields:  s = 0, exp = 10001100 (140), frac = 11011011011010000000000 (from 15213; the leading 1 is implied and not stored)

Denormalized Values: (-1)^s × M × 2^E
  | s | exp | frac |

Condition
• exp = 000…0

Value
• Exponent value E = -Bias + 1
• Significand value M = 0.xxx…x_2
  • xxx…x: bits of frac

Cases
• exp = 000…0, frac = 000…0
  • Represents the value 0
  • Note that there are distinct values +0 and -0
• exp = 000…0, frac ≠ 000…0
  • Numbers very close to 0.0
  • Lose precision as they get smaller
  • "Gradual underflow"

Special Values

Condition
• exp = 111…1

Cases
• exp = 111…1, frac = 000…0
  • Represents the value ∞ (infinity)
  • Operation that overflows
  • Both positive and negative
  • E.g., 1.0/0.0 = -1.0/-0.0 = +∞, 1.0/-0.0 = -∞
• exp = 111…1, frac ≠ 000…0
  • Not-a-Number (NaN)
  • Represents the case when no numeric value can be determined
  • E.g., sqrt(-1), ∞ - ∞

Summary of Floating Point Real Number Encodings
[Figure: number line, left to right: NaN, -∞, -Normalized, -Denorm, -0, +0, +Denorm, +Normalized, +∞, NaN]

Distribution of Values

6-bit IEEE-like format
• e = 3 exponent bits
• f = 2 fraction bits
• Bias is 3

Notice how the distribution gets denser toward zero.
[Figure: number line from -15 to 15 marking the denormalized, normalized, and infinity values]

Distribution of Values (close-up view)

6-bit IEEE-like format
• e = 3 exponent bits
• f = 2 fraction bits
• Bias is 3

[Figure: number line from -1 to 1 marking the denormalized, normalized, and infinity values]

Interesting Numbers

  Description                exp     frac    Numeric value
  Zero                       00…00   00…00   0.0
  Smallest pos. denorm.      00…00   00…01   2^-{23,52} × 2^-{126,1022}
                                             (single ≈ 1.4 × 10^-45, double ≈ 4.9 × 10^-324)
  Largest denormalized       00…00   11…11   (1.0 - ε) × 2^-{126,1022}
                                             (single ≈ 1.18 × 10^-38, double ≈ 2.2 × 10^-308)
  Smallest pos. normalized   00…01   00…00   1.0 × 2^-{126,1022}
                                             (just larger than largest denormalized)
  One                        01…11   00…00   1.0
  Largest normalized         11…10   11…11   (2.0 - ε) × 2^{127,1023}
                                             (single ≈ 3.4 × 10^38, double ≈ 1.8 × 10^308)

Special Properties of Encoding

FP Zero Same as Integer Zero
• All bits = 0

Can (Almost) Use Unsigned Integer Comparison
• Must first compare sign bits
• Must consider -0 = 0
• NaNs problematic
  • Will be greater than any other values
  • What should comparison yield?
• Otherwise OK
  • Denorm vs. normalized
  • Normalized vs. infinity

Tiny Floating Point Example

8-bit Floating Point Representation
• The sign bit is in the most significant bit.
• The next four bits are the exponent, with a bias of 7.
• The last three bits are the frac.

Same General Form as IEEE Format
• Normalized, denormalized
• Representation of 0, NaN, infinity

  bit:     7 | 6 … 3 | 2 … 0
  field:   s |  exp  | frac

Values Related to the Exponent (Bias = 7)

  Exp   exp    E     2^E
  0     0000   -6    1/64    (denorms: E = -Bias + 1)
  1     0001   -6    1/64
  2     0010   -5    1/32
  3     0011   -4    1/16
  4     0100   -3    1/8
  5     0101   -2    1/4
  6     0110   -1    1/2
  7     0111    0    1
  8     1000   +1    2
  9     1001   +2    4
  10    1010   +3    8
  11    1011   +4    16
  12    1100   +5    32
  13    1101   +6    64
  14    1110   +7    128
  15    1111   n/a           (inf, NaN)

  For Exp = 1 through 14, E = Exp - Bias.

Representation to Value: (-1)^s × M × 2^E

                 s  exp   frac   E     Value
  Denormalized   0  0000  000    -6    0
  numbers        0  0000  001    -6    1/8 × 1/64 = 1/512     closest to zero
                 0  0000  010    -6    2/8 × 1/64 = 2/512
                 …
                 0  0000  110    -6    6/8 × 1/64 = 6/512
                 0  0000  111    -6    7/8 × 1/64 = 7/512     largest denorm
  Normalized     0  0001  000    -6    8/8 × 1/64 = 8/512     smallest norm
  numbers        0  0001  001    -6    9/8 × 1/64 = 9/512
                 …
                 0  0110  110    -1    14/8 × 1/2 = 14/16
                 0  0110  111    -1    15/8 × 1/2 = 15/16     closest to 1 below
                 0  0111  000     0    8/8 × 1    = 1
                 0  0111  001     0    9/8 × 1    = 9/8       closest to 1 above
                 0  0111  010     0    10/8 × 1   = 10/8
                 …
                 0  1110  110     7    14/8 × 128 = 224
                 0  1110  111     7    15/8 × 128 = 240       largest norm
                 0  1111  000    n/a   inf

Ranges

Normalized
• Largest: 0 1110 111 = 15/8 × 128 = 240
• Closest to 0 (positive): 0 0001 000 = 1 × 1/64 = 0.015625
• Range: -240 to 240
• Total number of values: 112 (14 exp values × 8 frac values, for each sign)

Denormalized
• Largest: 0 0000 111 = 7/8 × 1/64 = 0.01367188
• Closest to 0 (positive): 0 0000 001 = 1/8 × 1/64 = 0.001953125
• Range: -0.01367188 to 0.01367188
• Total number of values: 8 (the 8 frac values with exp = 0000, including zero, for each sign)

Examples: binary to float

  bit:     7 | 6 … 3 | 2 … 0
  field:   s |  exp  | frac

1. 0 0101 110
   exp = 5, so E = 5 - 7 = -2
   frac = 6/8, so M = 14/8
   Value = 14/8 × 1/4 = 7/16
2. 1 1010 011
   exp = 10, so E = 10 - 7 = 3
   frac = 3/8, so M = 11/8
   Value = -11/8 × 8 = -11.0
3. 0 0000 101
   exp = 0, so denormalized: E = -6 (-Bias + 1)
   frac = 5/8, so M = 5/8
   Value = 5/8 × 1/64 = 5/512

Float to binary

Step 1: Get the normal form
• Binary:
  1. Represent as binary: X.Y
  2. Normalize: move the binary point until you have a leading 1
  3. Normal form: 1.Z × 2^k, where k is based on step 2
• Decimal:
  1. Multiply or divide the number by 2 until you get a result X between 1 (inclusive) and 2 (exclusive).
  2. Normal form: X × 2^k, where k is based on step 1

Step 2: Determine the exponent and fractional parts based on the normal form.

Binary Normalize

Requirement
• Set the binary point so that numbers have the form 1.xxxxx
• Adjust all to have a leading one
  • Decrement the exponent as you shift left

  Value     Binary     Fraction    Normal Form
  13        001101.0   1.1010000   1.101_2 × 2^3
  17        010001.0   1.0001000   1.0001_2 × 2^4
  63        111111.0   1.1111100   1.11111_2 × 2^5
  0.25      0.010000   1.0000000   1.0_2 × 2^-2
  0.3125    0.010100   1.0100000   1.01_2 × 2^-2
  0.09375   0.000110   1.1000000   1.1_2 × 2^-4

Decimal Normalize

Requirement
• Multiply/divide by 2 until the value is in the range [1, 2)
  • Each multiplication by 2 decrements the exponent; each division by 2 increments it

  Value     Multiply/Divide                      Normal Form
  13        13, 13/2, 13/4, 13/8                 1.101_2 × 2^3
  17        17, 17/2, 17/4, 17/8, 17/16          1.0001_2 × 2^4
  0.25      1/4, 1/2, 1                          1.0_2 × 2^-2
  0.3125    5/16, 10/16, 20/16                   1.01_2 × 2^-2
  0.09375   3/32, 6/32, 12/32, 24/32, 48/32      1.1_2 × 2^-4

Normal Form to Encoding

Exponent
• Remember that E = exp - Bias
• If the computed exp value does not fit in the valid range, the result is either an overflow (number too large), a denormalized value (very small but representable number), or an underflow (number too small)

Fraction
• If the non-zero portion of M fits into the fractional part, no rounding is needed and you are done
• If not, you need to round

Examples: float to binary (no rounding)

  bit:     7 | 6 … 3 | 2 … 0
  field:   s |  exp  | frac

Example 1: 52.0 = 1.625 × 2^5
   E = 5, so exp must be 12
   1.625 = 13/8, so frac = 5/8
   Encoding = 0 1100 101
   Could also do it this way: 52 is 110100.0 in 6 bits. Moving the binary point left 5 times gives 1.10100, whose first three bits after the point (101) are the fraction bits.
Example 2: 7/64 = 14/8 × 1/2^4
   E = -4, so exp must be 3
   frac = 6/8
   Encoding = 0 0011 110
   Could also do it this way: 7/64 is 0.000111 in binary. Moving the binary point right 4 times gives 1.11.

Floating Point Operations

Conceptual View
• First compute the exact result
• Make it fit into the desired precision
  • Possibly overflow if the exponent is too large
  • Possibly round to fit into frac

Rounding Modes (illustrated with $ rounding)

                           $1.40   $1.60   $1.50   $2.50   -$1.50
  Zero                     $1      $1      $1      $2      -$1
  Round down (-∞)          $1      $1      $1      $2      -$2
  Round up (+∞)            $2      $2      $2      $3      -$1
  Nearest Even (default)   $1      $2      $2      $2      -$2

Note:
1. Round down: the rounded result is close to but no greater than the true result.
2. Round up: the rounded result is close to but no less than the true result.

Closer Look at Round-To-Even

Default Rounding Mode
• Hard to get any other kind without dropping into assembly
• All others are statistically biased
  • The sum of a set of positive numbers will consistently be over- or under-estimated

Applying to Other Decimal Places / Bit Positions
• When exactly halfway between two possible values
  • Round so that the least significant digit is even
• E.g., round to nearest hundredth

  1.2349999   1.23   (less than halfway)
  1.2350001   1.24   (greater than halfway)
  1.2350000   1.24   (halfway: round up)
  1.2450000   1.24   (halfway: round down)

Rounding Binary Numbers

Binary Fractional Numbers
• "Even" when the least significant bit is 0
• Halfway when the bits to the right of the rounding position = 100…_2

Examples
• Round to nearest 1/4 (2 bits right of the binary point)

  Value    Binary       Rounded    Action         Rounded Value
  2 3/32   10.00011_2   10.00_2    (<1/2: down)   2
  2 3/16   10.00110_2   10.01_2    (>1/2: up)     2 1/4
  2 7/8    10.11100_2   11.00_2    (1/2: up)      3
  2 5/8    10.10100_2   10.10_2    (1/2: down)    2 1/2

Rounding

  1.BBGRXXX
• Guard bit (G): LSB of result
• Round bit (R): 1st bit removed
• Sticky bit (S): OR of remaining bits

Round up conditions
• Round = 1, Sticky = 1 → more than 0.5
• Guard = 1, Round = 1, Sticky = 0 → round to even

The rounding examples for this slide list, for each value, its normalized fraction, the GRS bits, whether to increment, and the rounded result.
  Value   Fraction    GRS   Incr?   Rounded
  128     1.0000000   000   N       1.000
  13      1.1010000   100   N       1.101
  17      1.0001000   010   N       1.000
  19      1.0011000   110   Y       1.010
  138     1.0001010   011   Y       1.001
  63      1.1111100   111   Y       10.000

FP Multiplication

Operands
• (-1)^s1 × M1 × 2^E1 × (-1)^s2 × M2 × 2^E2

Exact Result
• (-1)^s × M × 2^E
  • Sign s: s1 ^ s2 (XOR)
  • Significand M: M1 × M2
  • Exponent E: E1 + E2

Fixing
• If M ≥ 2, shift M right, increment E
• If E out of range, overflow
• Round M to fit frac precision

Implementation
• Biggest chore is multiplying significands

FP Multiplication Example

  (1.8 × 2^3) × (1.6 × 2^5)
  Exact result: (1.8 × 1.6) × 2^(3+5) = 2.88 × 2^8
  Need to divide to put into normal form: = 1.44 × 2^9

Math. Properties of FP Mult

Compare to Commutative Ring
• Closed under multiplication?                   YES
  • But may generate infinity or NaN
• Multiplication commutative?                    YES
• Multiplication is associative?                 NO
  • Possibility of overflow, inexactness of rounding
• 1 is multiplicative identity?                  YES
• Multiplication distributes over addition?      NO
  • Possibility of overflow, inexactness of rounding

Monotonicity
• a ≥ b & c ≥ 0 ⇒ a × c ≥ b × c?                 ALMOST
  • Except for infinities & NaNs

FP Addition

Operands
• (-1)^s1 × M1 × 2^E1 + (-1)^s2 × M2 × 2^E2
• Assume E1 > E2

[Figure: (-1)^s2 M2 is shifted right by E1 - E2 bit positions to align with (-1)^s1 M1; the two are added to give (-1)^s M]

Exact Result
• (-1)^s × M × 2^E
  • Sign s, significand M: result of signed align & add
  • Exponent E: E1

Fixing
• If M ≥ 2, shift M right, increment E
• If M < 1, shift M left k positions, decrement E by k
• Overflow if E out of range
• Round M to fit frac precision

FP Addition Example

  (1.8 × 2^3) + (1.6 × 2^5)
  Align numbers: (0.45 × 2^5) + (1.6 × 2^5)
  Now add: = 2.05 × 2^5
  Need to divide to put into normal form: = 1.025 × 2^6

Mathematical Properties of FP Add

Compare to those of Abelian Group
• Closed under addition?                         YES
  • But may generate infinity or NaN
• Commutative?                                   YES
• Associative?                                   NO
  • Overflow and inexactness of rounding
• 0 is additive identity?                        YES
• Every element has additive inverse?            ALMOST
  • Except for infinities & NaNs

Monotonicity
• a ≥ b ⇒ a + c ≥ b + c?                         ALMOST
  • Except for infinities & NaNs

Floating Point in C

C Guarantees Two Levels
• float: single precision
• double: double precision

Conversions
• Casting between int, float, and double changes numeric values
• double or float to int
  • Truncates fractional part
  • Like rounding toward zero
  • Not defined when out of range
    » Generally saturates to TMin or TMax
• int to double
  • Exact conversion, as long as int has ≤ 53 bit word size
• int to float
  • Will round according to rounding mode

Floating Point Puzzles

For each of the following C expressions, either:
• Argue that it is true for all argument values
• Explain why it is not true

Declarations:
  int x = …;
  float f = …;
  double d = …;
Assume neither d nor f is NaN.

• x == (int)(float) x
• x == (int)(double) x
• f == (float)(double) f
• d == (float) d
• f == -(-f);
• 2/3 == 2/3.0
• d < 0.0 ⇒ ((d*2) < 0.0)
• d > f ⇒ -f > -d
• d * d >= 0.0
• (d+f)-d == f

Floating Point Puzzles (answers)

  x == (int)(float) x           F
  x == (int)(double) x          T
  f == (float)(double) f        T
  d == (float) d                F
  f == -(-f);                   T
  2/3 == 2/3.0                  F
  d < 0.0 ⇒ ((d*2) < 0.0)       T   (the slide adds "unless OF", but on overflow d*2 is -∞, which is still < 0.0)
  d > f ⇒ -f > -d               T
  d * d >= 0.0                  T   (the slide adds "unless OF", but on overflow d*d is +∞, which is still >= 0.0)
  (d+f)-d == f                  F

Summary

IEEE Floating Point Has Clear Mathematical Properties
• Represents numbers of the form M × 2^E
• Can reason about operations independent of implementation
  • As if computed with perfect precision and then rounded
• Not the same as real arithmetic
  • Violates associativity/distributivity
  • Makes life difficult for compilers & serious numerical applications programmers
