
Floating Point
CPSC 252 Computer Organization
Ellen Walker, Hiram College

Representing Non-Integers
• Often represented in decimal format
• Some require infinite digits to represent exactly
• With a fixed number of digits (or bits), many numbers are approximated
• Precision is a measure of the degree of approximation

Scientific Notation (Decimal)
• Format: m.mmmm x 10^eeeee
  – Normalized = exactly 1 nonzero digit before the decimal point
• Mantissa (m) represents the significant digits
  – Precision limited by number of digits in mantissa
• Exponent (e) represents the magnitude
  – Magnitude limited by number of digits in exponent
  – Exponent < 0 for numbers between 0 and 1

Scientific Notation (Binary)
• Format: 1.mmmm x 2^eeeee
  – Normalized = 1 before the binary point
• Mantissa (m) represents the significant bits
  – Precision limited by number of bits in mantissa
• Exponent (e) represents the magnitude
  – Magnitude limited by number of bits in exponent
  – Exponent < 0 for numbers between 0 and 1

Binary Examples
• 1/16 = 1.0 x 2^-4 (mantissa 1.0, exponent -4)
• 32.5 = 1.000001 x 2^5 (mantissa 1.000001, exponent 5)

Quick Decimal-to-Binary Conversion (Exact)
1. Multiply the number by a power of 2 big enough to get an integer
2. Convert this integer to binary
3. Place the binary point the appropriate number of bits (based on the power of 2 from step 1) from the right of the number

Conversion Example
• Convert 32.5 to binary
  1. Multiply 32.5 by 2 (result is 65)
  2. Convert 65 to binary (result is 1000001)
  3. Place the binary point (in this case 1 bit from the right) (result is 100000.1)
• Convert to binary scientific notation (result is 1.000001 x 2^5)

Floating Point Representation
• Mantissa - m bits (unsigned)
• Exponent - e bits (signed)
• Sign (separate) - 1 bit
• Total = 1+m+e bits
  – Tradeoff between precision and magnitude
  – Total bits fit into 1 or 2 full words

Implicit First Bit
• Remember the mantissa must always begin with “1.”
• Therefore, we can save a bit by not actually representing the 1 explicitly.
• Example:
  – Mantissa bits 0001
  – Mantissa: 1.0001

Offset Exponent
• Exponent can be positive or negative, but it’s cleaner (for sorting) to use an unsigned representation
• Therefore, store exponents as unsigned, and add a bias of -((2^(bits-1))-1) to the stored value to recover the actual exponent
• Examples: 8 bit exponent (bias is -127)
  – 00000001 = 1 + (-127) = -126
  – 10000000 = 128 + (-127) = 1

IEEE 754 Floating Point Representation (Single)
• Sign (1 bit), Exponent (8 bits), Fraction (23 bits)
  – What is the largest value that can be represented?
  – What is the smallest positive value that can be represented?
  – How many “significant bits” can be represented?
• Values can be sorted using integer comparison
  – Sign first
  – Exponent next (sorted as unsigned)
  – Fraction last (also unsigned)

Double Precision
• Floating point number takes 2 words (64 bits)
• Sign is 1 bit
• Exponent is 11 bits (vs. 8)
• Fraction is 52 bits (vs. 23)
  – Last 32 bits of the fraction are in the second word

Floating Point Errors
• Overflow
  – A positive exponent becomes too large for the exponent field
• Underflow
  – A negative exponent becomes too large in magnitude for the exponent field
• Rounding (not actually an error)
  – The result of an operation has too many significant bits for the fraction field

Special Values
• Infinity
  – Result of dividing a non-zero value by 0
  – Can be positive or negative
  – Infinity +/- any finite value = Infinity
• Not A Number (NaN)
  – Result of an invalid mathematical operation, e.g. 0/0 or Infinity-Infinity

Representing Special Values in IEEE 754
(exponent field values in hex, single precision)
• Exponent ≠ 00, Exponent ≠ FF
  – Ordinary floating point number
• Exponent = 00, Fraction = 0
  – Number is 0
• Exponent = 00, Fraction ≠ 0
  – Number is denormalized (leading 0. instead of 1.)
• Exponent = FF, Fraction = 0
  – Infinity (+ or -, depending on sign)
• Exponent = FF, Fraction ≠ 0
  – Not a Number (NaN)

Double Precision in MIPS
• Each even-numbered register, together with the odd-numbered register that follows it, can be treated as a register pair for double precision
  – High order bits in even register
  – Low order bits in odd register

Floating Point Arithmetic in MIPS
• add.s, add.d, sub.s, sub.d [rd] [rs] [rt]
  – Single and double precision addition / subtraction
  – rd = rs +/- rt
• 32 floating point registers $f0 - $f31
  – Use in pairs for double precision
  – Registers for add.d (etc) must be even numbers

Why Separate Floating Point Registers?
• Twice as many registers using the same number of instruction bits
• Integer & floating point operations usually on distinct data
• Increased parallelism possible
• Customized hardware possible

Load/Store Floating Point Number
• lwc1: 32 bit word to FP register
• swc1: FP register to 32 bit word
• ldc1: 2 words to FP register pair
• sdc1: register pair to 2 words
• (Note last character is the number 1)

Floating Point Addition
• Align the binary points (make exponents equal)
• Add the revised mantissas
• Normalize the sum

Changing Exponents for Alignment and Normalization
• To keep the number the same:
  – Left shift mantissa by 1 bit and decrement exponent
  – Right shift mantissa by 1 bit and increment exponent
• Align by right-shifting the smaller number
• Normalize by
  – Rounding the result to the correct number of significant bits
  – Shifting the result to put 1 before the binary point

Addition Example
Add 1.101 x 2^4 + 1.101 x 2^5 (26+52)
• Align binary points
  1.101 x 2^4 = 0.1101 x 2^5
• Add mantissas
     0.1101 x 2^5
  +  1.1010 x 2^5
  = 10.0111 x 2^5

Addition Example (cont.)
• Normalize: 10.0111 x 2^5 = 1.00111 x 2^6 (78)
• Round to 3-bit mantissa: 1.00111 x 2^6 ~= 1.010 x 2^6 (80)

Rounding
• At least 1 bit beyond the last bit is needed
• Rounding up could require renormalization
  – Example: 1.1111 -> 10.000
• For multiplication, 2 extra bits are needed in case the product’s first bit is 0 and it must be left shifted (guard, round)
• For complete generality, add a “sticky bit” that is set whenever additional bits to the right would be >0

Round to Nearest Even
• Most common rounding mode
• If the actual value is halfway between two values, round to an even result
• Examples (rounding to 3 fraction bits):
  – 1.0011 -> 1.010
  – 1.0101 -> 1.010
• If the sticky bit is set, round up because the value isn’t really halfway between!

Floating point addition
[Figure: flowchart and datapath for floating point addition. Inputs and output each have Sign, Exponent and Fraction fields. Hardware blocks: Small ALU (exponent difference), Shift right, Big ALU, Control, Increment or decrement, Shift left or right, Rounding hardware.]
Start
1. Compare the exponents of the two numbers. Shift the smaller number to the right until its exponent would match the larger exponent
2. Add the significands
3. Normalize the sum, either shifting right and incrementing the exponent or shifting left and decrementing the exponent
   – Overflow or underflow? Yes: Exception. No: continue
4. Round the significand to the appropriate number of bits
   – Still normalized? No: return to step 3. Yes: Done

Floating Point Multiplication
1. Calculate new exponent by adding exponents together
2. Multiply the significands (using shift & add)
3. Normalize the product
4. Round
5. Set the sign

Adding Exponents
• Remember that exponents are biased
  – Adding exponents adds 2 copies of bias!
    (exp1 + 127) + (exp2 + 127) = (exp1 + exp2 + 254)
• Therefore, subtract the bias from the sum and the result is a correctly biased value

Multiplication Example
• Convert 2.25 and 1.5 to binary floating point (3 bits exponent, 3 bits mantissa), then multiply
• 2.25 = 10.01 * 2^0 = 1.001 * 2^1
  – Exp = 100 (because bias is 3), Mantissa: 001
  – 2.25 = 0 100 001
• 1.5 = 1.100 * 2^0
  – Exp = 011, Mantissa: 100
  – 1.5 = 0 011 100

1. Add Exponents
0 100 001 x 0 011 100
• Add exponents (and subtract bias)
  100 + 011 - 011 = 100

2. Multiply Significands
0 100 001 x 0 011 100
• Remember to restore the leading 1
• Remember that the number of binary places doubles
      1.001
    x 1.100
    ---------
      .100100
     1.001000
    ---------
     1.101100 x 2^1

Finish Up
• Product is 1.1011 * 2^1
• Already normalized
• But, too many bits, so we need to round
• Nearest even number (up) is 1.110
• Result: 0 100 110
• Value is 1.75 * 2 = 3.5

Types of Errors
• Overflow
  – Exponent too large for the number of bits allotted
• Underflow
  – Negative exponent is too small to fit in the # bits
• Rounding error
  – Mantissa has too many bits

Overflow and Underflow
• Addition
  – Overflow is possible when adding two positive or two negative numbers
• Multiplication
  – Overflow is possible when multiplying two large absolute value numbers
  – Underflow is possible when multiplying two numbers very close to 0

Limitations of Finite Floating Point Representations
• Gap between 0 and the smallest nonzero number
• Gaps between values when the last bit of the mantissa changes
• Fixed number of values between 0 and 1
• Significant effects of rounding in mathematical operations

Implications for Programmers
• Mathematical rules are not always followed
  – (a / b) * b does not always equal a
  – (a + b) + c does not always equal a + (b + c)
• Use inequality comparisons instead of directly comparing floating point numbers (with ==)
  – if ((x > -epsilon) && (x < epsilon)) instead of if (x == 0)
  – Epsilon can be set based on problem or knowledge of representation (e.g. single vs. double precision)

The Pentium Floating Point Bug
• To speed up division, a table was used
• It was assumed that 5 elements of the table would never be accessed (and the hardware was optimized to make them 0)
• These table elements occasionally caused errors in bits 12 to 52 of floating point significands
• (see Section 3.8 for more)

A Marketing Error
• July 1994 - Intel discovers the bug, decides not to halt production or recall chips
• September 1994 - A professor discovers the bug, posts to Internet (after attempting to inform Intel)
• November 1994 - Press articles, Intel says will affect “maybe several dozen people”
• December 1994 - IBM disputes claim and halts shipment of Pentium based PCs.
• Late December 1994 - Intel apologizes

The “Big Picture”
• Bits in memory have no inherent meaning. A given sequence can contain
  – An instruction
  – An integer
  – A string of characters
  – A floating point number
• All number representations are finite
• Finite arithmetic requires compromises
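
The point that bits have no inherent meaning can be tried out directly. The sketch below is not from the original slides: it is a minimal Python illustration, and the helper names (float_to_bits, bits_to_float, decode_single, classify) are hypothetical, chosen only for this example. It reads the same four bytes as characters, as an integer, and as an IEEE 754 single, then splits the pattern for 32.5 into the sign, exponent and fraction fields described earlier.

```python
import struct

def float_to_bits(x):
    """Return the 32-bit IEEE 754 single pattern of x as an unsigned int."""
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_to_float(bits):
    """Reinterpret a 32-bit unsigned int as an IEEE 754 single."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def decode_single(bits):
    """Split a 32-bit pattern into (sign, biased exponent, fraction)."""
    sign = bits >> 31                 # 1 bit
    exponent = (bits >> 23) & 0xFF    # 8 bits, stored with a bias of 127
    fraction = bits & 0x7FFFFF        # 23 bits, leading "1." is implicit
    return sign, exponent, fraction

def classify(bits):
    """Apply the special-value rules (exponent 00 or FF) to a pattern."""
    _, exponent, fraction = decode_single(bits)
    if exponent == 0x00:
        return "zero" if fraction == 0 else "denormalized"
    if exponent == 0xFF:
        return "infinity" if fraction == 0 else "NaN"
    return "normal"

if __name__ == "__main__":
    # The same four bytes read three different ways.
    raw = b"ABCD"
    as_int = int.from_bytes(raw, "big")
    print(raw.decode("ascii"), as_int, bits_to_float(as_int))

    # 32.5 = 1.000001 x 2^5, so the stored exponent is 5 + 127 = 132.
    bits = float_to_bits(32.5)
    sign, exponent, fraction = decode_single(bits)
    print(f"{bits:032b}", sign, exponent - 127, f"{fraction:023b}", classify(bits))
```

Big-endian packing (">f", ">I") is used only so the printed bit string reads sign first, then exponent, then fraction; it makes no claim about how any particular machine lays the bytes out in memory.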
