
IEEE 754 Single-Precision Numbers
Rev. 1.1 (060307)

Introduction

To represent both very large and very small values, C++ has adopted the floating-point representation specified in the IEEE 754 standard. The float data type represents a single-precision number, whereas the double data type represents a double-precision number. The two types do not differ in concept, only in the number of bits used to represent them in memory. This note describes the single-precision representation (called a float in C++ terminology) in detail and briefly describes the double-precision representation.

The scientific notation

The number is stored in scientific notation using 2 as the base. This means that all numbers are stored in the form $k = \pm 1.M \cdot 2^{E-127}$, where E is the stored exponent field; the offset of 127 is explained under the exponent field below. Since 2 is fixed as the base, only three characteristic values need to be stored for any number: the sign S, the exponent E, and the mantissa M.

The single-precision representation consists of 32 bits (4 bytes) divided into these three fields:

The sign bit S (bit 31)
The sign bit S is 0 if the number is positive and 1 if the number is negative.

The exponent E (bits 30 – 23)
The exponent is the power to which 2 is raised in the scientific notation of the number. It is stored with a bias of 127, which means that the value stored in E is 127 for exponent = 0, 128 for exponent = 1, and so on. This makes it possible to represent negative exponents (very small values) without storing the exponent itself as a negative number. The exponent range is -126 to 127 (E = 0x01 to 0xFE).

The mantissa M (bits 22 – 0)
The mantissa M is the part of the value (in scientific notation) that comes after the binary point. Since scientific notation with base 2 is used, a value is always written in the form 1.xxxx, so the leading "1." is not stored (it is called "the hidden bit"). The mantissa is constructed by writing the after-the-point value $a_p$ as a sum of negative powers of 2, i.e.
$a_p = \sum_{i=0}^{22} M_i \cdot 2^{-(23-i)}$, where $M_i$ is the value of the i-th bit of the mantissa (i = 0..22).

The use of S, E, and M becomes clearer with the following methods and a couple of examples.

Converting a real value k to its single-precision representation b

To convert a value k into its single-precision representation b, follow these steps:
(1) Determine the sign bit S from the sign of k.
(2) If |k| ≥ 2, repeatedly divide |k| by 2 until the value lies in the range 1 ≤ |k| < 2. The exponent is the number of times |k| has been divided. Add 127 to the exponent to find E = b[30..23].
(3) If |k| < 1, repeatedly multiply |k| by 2 until the value lies in the range 1 ≤ |k| < 2. The exponent is minus the number of times |k| has been multiplied. Add 127 to the exponent to find E. (If neither (2) nor (3) applies, the exponent is 0 and E = 127.)
(4) Write the part of the divided/multiplied |k| after the binary point as a sum of negative powers of 2 to find M (e.g. 0.625 = 0.5 + 0.125 → M = 101000…).

Example: Convert k = -4.625 to its single-precision representation b

Following the method above:
(1) S = 1, since k is negative. Hence b[31] = 1.
(2) |k| is divided by 2 two times, yielding |k| = 4.625 = $1.15625 \cdot 2^{2}$. Hence E = 2 + 127 = 129, and therefore b[30..23] = 10000001.
(3) Not applicable in this example.
(4) Since the divided value of |k| is 1.15625, the part after the binary point, 0.15625, must be written as a sum of negative powers of 2: 0.15625 = 0.125 + 0.03125 = $2^{-3} + 2^{-5}$. Hence the mantissa b[22..0] is 00101000…0.

The complete representation of k = -4.625 is therefore b = 11000000100101000000000000000000.

Converting a represented value b to a real number k

To convert a single-precision value b to the real number k it represents, follow these steps:
(1) Determine the sign of k from the sign bit S = b[31].
(2) Read E from b[30..23].
(3) Determine the mantissa of k from b[22..0].
    For each 1-bit in M, add $2^{pos-23}$ to the mantissa value (pos is the position of the bit within M).
(4) Use S, E, and M to construct k using the formula $k = \pm 1.M \cdot 2^{E-127}$, where the sign is given by S and E is the stored (biased) value.

Example: Find the number represented by 1 10001010 10110100010000000000000

Following the method above:
(1) S = b[31] = 1, so k is negative.
(2) E = b[30..23] = 10001010 = 138.
(3) M = b[22..0] = 10110100010000000000000 = $2^{-1} + 2^{-3} + 2^{-4} + 2^{-6} + 2^{-10}$ = 0.7041015625.
(4) $k = \pm 1.M \cdot 2^{E-127} = -1.7041015625 \cdot 2^{138-127} = -1.7041015625 \cdot 2^{11} = -3490$.

Some special values

Some special combinations of S, E, and M are used to represent special numbers such as 0 and ±∞:

S        E          M              Meaning
0 or 1   00000000   0000000…0      Zero
1        11111111   0000000…0      −∞
0        11111111   0000000…0      +∞
0 or 1   11111111   ≠ 0000000…0    Not-a-Number (NaN)

Double-precision numbers

Double-precision numbers are very similar to single-precision numbers. The only differences are the size of the exponent field (11 bits), the bias of the exponent (1023), the range of the exponent (-1022 to 1023), and the number of bits used to store the mantissa (52 bits).
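
The two conversion methods described in this note can be sketched in C++ roughly as follows. This is a minimal illustration and not part of the original note: the names Fields, bits_of, split, encode, and value_of are hypothetical, the code assumes that float is a 32-bit IEEE 754 single-precision type on the target platform, it handles only zero and finite values within the normal exponent range, and it truncates the mantissa instead of rounding it.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper type: the three fields of a single-precision number.
struct Fields {
    std::uint32_t S;  // sign bit, b[31]
    std::uint32_t E;  // stored (biased) exponent, b[30..23]
    std::uint32_t M;  // 23 mantissa bits, b[22..0]
};

// Copy the raw 32 bits of a float into an integer.
// Assumption: float is a 32-bit IEEE 754 single-precision type.
std::uint32_t bits_of(float k) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "float must be 32 bits");
    std::uint32_t b;
    std::memcpy(&b, &k, sizeof b);
    return b;
}

// Split a 32-bit pattern b into S, E, and M.
Fields split(std::uint32_t b) {
    return Fields{ b >> 31, (b >> 23) & 0xFFu, b & 0x7FFFFFu };
}

// Real value -> bit pattern, following the divide/multiply steps (1)-(4).
// Assumption: k is finite and, if nonzero, within the normal exponent range.
std::uint32_t encode(double k) {
    assert(std::isfinite(k));
    std::uint32_t S = (k < 0.0) ? 1u : 0u;              // step (1)
    double a = std::fabs(k);
    if (a == 0.0) return S << 31;                       // zero: E = 0, M = 0
    int exponent = 0;
    while (a >= 2.0) { a /= 2.0; ++exponent; }          // step (2)
    while (a < 1.0)  { a *= 2.0; --exponent; }          // step (3)
    std::uint32_t E = static_cast<std::uint32_t>(exponent + 127);
    std::uint32_t M = 0;
    double ap = a - 1.0;                                // part after the binary point
    for (int pos = 22; pos >= 0; --pos) {               // step (4)
        double weight = std::ldexp(1.0, pos - 23);      // 2^(pos-23)
        if (ap >= weight) { M |= 1u << pos; ap -= weight; }
    }
    return (S << 31) | (E << 23) | M;                   // leftover ap is truncated
}

// Fields -> real value, for normal numbers only (E between 0x01 and 0xFE).
double value_of(Fields f) {
    double ap = 0.0;
    for (int pos = 0; pos < 23; ++pos)
        if ((f.M >> pos) & 1u) ap += std::ldexp(1.0, pos - 23);
    double magnitude = std::ldexp(1.0 + ap, static_cast<int>(f.E) - 127);
    return f.S ? -magnitude : magnitude;
}
```

With these helpers, encode(-4.625) gives 0xC0940000, which is the bit pattern 11000000100101000000000000000000 from the first example, and value_of(split(0xC55A2000)) gives -3490, matching the second example. Comparing encode(k) against bits_of(static_cast<float>(k)) is one way to check the manual method against the compiler's own conversion for values that need no rounding.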
