IEEE 754 Single-Precision Numbers
Rev. 1.1 (060307)
Introduction
To represent both very large and very small values, C++ implementations commonly use the floating-point
representation specified in the IEEE 754 standard. The float data type represents a single-precision number, whereas the
double data type represents a double-precision number. The two types do not differ in concept, only in the
number of bits used to represent them in memory.
This note will describe the single-precision number representation (called a float in C++ terminology) in
detail, and briefly describe the double-precision number representation.
The scientific notation
The number is stored in scientific notation using 2 as the base number. This means that all numbers are
stored in the form $k = \pm 1.M \cdot 2^{E}$. Since 2 is chosen as the base number, only three characteristic values
need to be stored for any number: the sign S, the exponent E, and the mantissa M.
The single-precision representation consists of 32 bits (4 bytes) divided into these three fields:
The sign bit S (bit 31)
The sign bit S is set to 0 if the number is positive, 1 if the number is negative.
The exponent E (bits 30 – 23)
The exponent E is the power to which 2 must be raised in the scientific notation of the number. The exponent is
stored with a bias of 127, which means that the value stored in E is 127 for exponent = 0, 128 for exponent
= 1, etc. This makes it possible to represent negative exponents without storing the exponent itself as a
signed number. The exponent value range is -126 to 127 (E = 0x01 to 0xFE).
The mantissa M (bits 22 – 0)
The mantissa M is the part of the value (in scientific notation) that comes after the binary point. Since
scientific notation with base number 2 is used, a value will always be of the form 1.xxxx, so the “1.” -part is
left out (called “the hidden bit”). The mantissa is constructed by writing the after-the-point value ap as a sum
of negative powers of 2, i.e. $\mathit{ap} = \sum_{i=0}^{22} M_i \cdot 2^{-(23-i)}$, where $M_i$ is the value of the i’th bit in the mantissa (0..22).
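The field layout described above can be inspected directly in C++. The following is a minimal sketch, assuming that float is a 32-bit IEEE 754 single-precision value on the target platform; the names FloatFields and split_float are hypothetical helpers introduced only for illustration.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper type: the three fields of a 32-bit float.
struct FloatFields {
    std::uint32_t sign;      // S, bit 31
    std::uint32_t exponent;  // E, bits 30..23 (stored with a bias of 127)
    std::uint32_t mantissa;  // M, bits 22..0 (the hidden bit is not included)
};

// Splits a float into S, E and M.
// Assumption: float uses the 32-bit IEEE 754 single-precision layout.
FloatFields split_float(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t),
                  "this sketch assumes a 32-bit float");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);  // copy the raw bit pattern

    FloatFields f;
    f.sign = bits >> 31;                // top bit
    f.exponent = (bits >> 23) & 0xFFu;  // next 8 bits
    f.mantissa = bits & 0x7FFFFFu;      // low 23 bits
    return f;
}
```

For example, split_float(1.0f) should give S = 0, E = 127 and M = 0, since 1.0 is 1.0 times 2 raised to the power 0.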
The use of S, E, and M will become clearer through two conversion methods and a couple of examples:
Converting a real value k to its single-precision representation b:
To convert a value k into its single-precision representation b, follow these steps:
(1) Deduce the sign bit from k: S = 0 if k is positive, S = 1 if k is negative.
(2) If |k| ≥ 2, repeatedly divide |k| by two until the value lies between 1 and 2 (1 ≤ |k| < 2). The exponent
is the number of times |k| has been divided. Add 127 to the exponent to find E = b[30..23].
(3) If |k| < 1, repeatedly multiply |k| by two until the value lies between 1 and 2. The
exponent is minus the number of times |k| has been multiplied. Add 127 to the exponent to find E.
If |k| already lies between 1 and 2, the exponent is 0 and E = 127.
(4) Write the part of the divided/multiplied |k| after the binary point as a sum of negative powers of 2 to
find M (e.g. 0.625 = 0.5 + 0.125 → M = 101000…).
Example: Convert k = -4.625 to its single-precision representation b:
Following the method above,
(1) S = 1, since k is negative. Hence b[31] = 1
(2) |k| is divided by 2 two times, yielding |k| = 4.625 = 1.15625 · $2^{2}$. Hence the exponent is 2 and
E = 2 + 127 = 129, and therefore b[30..23] = 10000001
(3) Not applicable in this example
(4) Since the divided value of |k| is 1.15625, the part after the binary point, 0.15625, must be written as
a sum of negative powers of 2: 0.15625 = 0.125 + 0.03125 = $2^{-3} + 2^{-5}$. Hence, the mantissa (b[22..0])
will be 00101000…0
The complete representation of k = -4.625 is therefore b=11000000100101000000000000000000
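The conversion method and the example above can be sketched in C++ as follows. This is an illustration only: the function name encode_single is hypothetical, and the sketch assumes that k is nonzero, finite, within the normalized range, and exactly representable, so no rounding or special values are handled.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: builds the 32-bit pattern b for a value k by
// following steps (1) to (4).
// Assumptions: k is nonzero, finite, in the normalized range, and needs
// no rounding (extra mantissa bits would simply be truncated).
std::uint32_t encode_single(double k) {
    // Step (1): sign bit.
    std::uint32_t sign = (k < 0.0) ? 1u : 0u;
    double a = std::fabs(k);

    // Steps (2) and (3): scale |k| into [1, 2) and count the shifts.
    int exponent = 0;
    while (a >= 2.0) { a /= 2.0; ++exponent; }
    while (a < 1.0)  { a *= 2.0; --exponent; }
    std::uint32_t E = static_cast<std::uint32_t>(exponent + 127);

    // Step (4): write the part after the binary point as a sum of
    // negative powers of 2, from bit 22 (2^-1) down to bit 0 (2^-23).
    double frac = a - 1.0;
    std::uint32_t M = 0;
    for (int pos = 22; pos >= 0; --pos) {
        double weight = std::ldexp(1.0, pos - 23);  // 2^(pos-23)
        if (frac >= weight) {
            frac -= weight;
            M |= (1u << pos);
        }
    }
    return (sign << 31) | (E << 23) | M;
}
```

Calling encode_single(-4.625) should return 0xC0940000, which is the bit pattern 11000000100101000000000000000000 derived in the example.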
Converting a represented value b to a real number k.
To convert a single-precision value b to the real number k it represents, follow these steps:
(1) Deduce the sign of k from the sign bit S = b[31].
(2) Read E from b[30..23].
(3) Compute the mantissa of k from b[22..0]. For each 1-bit in M, add $2^{pos-23}$ to the mantissa (pos is the
position of the bit in M).
(4) Use S, E, and M to construct k using the formula
$k = (-1)^{S} \cdot 1.M \cdot 2^{E-127}$, where E is the stored (biased) exponent value.
Example: Find the number represented by 1 10001010 10110100010000000000000
Following the method above,
(1) S = b[31] = 1
(2) E = b[30..23] = 10001010 = 138
(3) M = b[22..0] = 10110100010000000000000 = $2^{-1} + 2^{-3} + 2^{-4} + 2^{-6} + 2^{-10}$ = 0.7041015625
(4) $k = \pm 1.M \cdot 2^{E-127} = -1.7041015625 \cdot 2^{138-127} = -3490$
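The reverse conversion can be sketched in the same way. The function name decode_single is hypothetical, and the sketch assumes that b encodes a normalized number (stored E between 1 and 254), so the special values are not handled.

```cpp
#include <cmath>
#include <cstdint>

// Hypothetical helper: recovers k from the 32-bit pattern b by following
// steps (1) to (4).
// Assumption: b encodes a normalized number (stored E in 1..254).
double decode_single(std::uint32_t b) {
    // Step (1): sign bit.
    std::uint32_t S = b >> 31;

    // Step (2): stored (biased) exponent.
    int E = static_cast<int>((b >> 23) & 0xFFu);

    // Step (3): for each 1-bit at position pos in M, add 2^(pos-23).
    double m = 0.0;
    for (int pos = 0; pos <= 22; ++pos) {
        if ((b >> pos) & 1u) {
            m += std::ldexp(1.0, pos - 23);
        }
    }

    // Step (4): k = (+/-) 1.M * 2^(E-127); the hidden bit contributes the 1.
    double k = std::ldexp(1.0 + m, E - 127);
    return S ? -k : k;
}
```

The bit pattern 1 10001010 10110100010000000000000 from the example is 0xC55A2000 in hexadecimal, so decode_single(0xC55A2000u) should return -3490.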
Some special values
Some special combinations of S, E, and M are used to represent special numbers such as 0 and ± ∞ :
S        E          M               Meaning
0 or 1   00000000   0000000…0       Zero
1        11111111   0000000…0       −∞
0        11111111   0000000…0       +∞
0 or 1   11111111   ≠ 0000000…0     Not-a-Number (NaN)
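The special combinations listed above can be recognized with a few comparisons on the three fields. The following is a minimal sketch; the function name classify_single and the returned labels are hypothetical, and every pattern not listed above is simply reported as "other".

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: labels a 32-bit pattern according to the special
// combinations of S, E and M. Patterns that match none of them are
// reported as "other".
std::string classify_single(std::uint32_t b) {
    std::uint32_t S = b >> 31;
    std::uint32_t E = (b >> 23) & 0xFFu;
    std::uint32_t M = b & 0x7FFFFFu;

    if (E == 0x00u && M == 0u) return "zero";             // S may be 0 or 1
    if (E == 0xFFu && M == 0u) return S ? "-inf" : "+inf";
    if (E == 0xFFu)            return "NaN";              // M is nonzero here
    return "other";
}
```

For example, classify_single(0x7F800000u) should return "+inf" and classify_single(0x7FC00000u) should return "NaN".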
Double-precision numbers
Double-precision numbers are very similar to single-precision numbers. The only differences are the size of the
exponent (11 bits), the offset of the exponent (1023), the range of values of the exponent (-1022 to 1023),
and the number of bits used to store the mantissa (52 bits).
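The same field extraction can be sketched for the double type. The sketch assumes that double is a 64-bit IEEE 754 double-precision value with 1 sign bit, 11 exponent bits (bias 1023) and 52 mantissa bits; the names DoubleFields and split_double are hypothetical.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helper type: the three fields of a 64-bit double.
struct DoubleFields {
    std::uint64_t sign;      // bit 63
    std::uint64_t exponent;  // bits 62..52 (stored with a bias of 1023)
    std::uint64_t mantissa;  // bits 51..0 (the hidden bit is not included)
};

// Splits a double into its sign, exponent and mantissa fields.
// Assumption: double uses the 64-bit IEEE 754 double-precision layout.
DoubleFields split_double(double value) {
    static_assert(sizeof(double) == sizeof(std::uint64_t),
                  "this sketch assumes a 64-bit double");
    std::uint64_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);  // copy the raw bit pattern

    DoubleFields f;
    f.sign = bits >> 63;
    f.exponent = (bits >> 52) & 0x7FFu;         // 11 bits
    f.mantissa = bits & 0xFFFFFFFFFFFFFull;     // low 52 bits
    return f;
}
```

For example, split_double(1.0) should give a stored exponent of 1023 and a mantissa of 0.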