Computer representation of numbers in different bases.

Decimal: 37145 = 30000 + 7000 + 100 + 40 + 5 = 5 × 10^0 + 4 × 10^1 + 1 × 10^2 + 7 × 10^3 + 3 × 10^4.

General formula: a number in the decimal system is a string of digits

  a_n a_{n-1} a_{n-2} … a_0 = a_0 × 10^0 + a_1 × 10^1 + … + a_n × 10^n

For a fractional part in decimal,

  0.b_1 b_2 b_3 … = b_1 × 10^-1 + b_2 × 10^-2 + b_3 × 10^-3 + …

Extending this to any base β, a typical number may be written as

  (a_n a_{n-1} … a_0 . b_1 b_2 b_3 …)_β = Σ_{k=0}^{n} a_k β^k + Σ_{k=1}^{∞} b_k β^{-k}

Example:

  (723.3041)_10 = 7 × 10^2 + 2 × 10^1 + 3 × 10^0 + 3 × 10^-1 + 0 × 10^-2 + 4 × 10^-3 + 1 × 10^-4

To convert a number from one base to another, we may do the following. Example from decimal to octal: divide the integer part (before the decimal point) repeatedly by 8, setting each remainder aside, and continue with the quotient in the same manner until the quotient becomes less than 8. E.g. for (723)_10:

  723 = 8 × 90 + 3
   90 = 8 × 11 + 2
   11 = 8 ×  1 + 3

Collecting the final quotient and the remainders from last to first, (723)_10 = (1323)_8.

Check: (1323)_8 = 1 × 8^3 + 3 × 8^2 + 2 × 8^1 + 3 = 512 + 192 + 16 + 3 = 723.

(1323)_8 can be expressed in binary by expanding each octal digit, from the right, as three binary digits. Therefore (1323)_8 = (001 011 010 011)_2 = (1011010011)_2.

Similarly, to get the fractional part in octal, keep multiplying by 8, collecting the digit that appears to the left of the point as the next octal digit as we continue:

  0.3041 × 8 = 2 + 0.4328
  0.4328 × 8 = 3 + 0.4624
  0.4624 × 8 = 3 + 0.6992
  0.6992 × 8 = 5 + 0.5936
  ……

Thus (0.3041)_10 = (0.2335…)_8 = (0.010 011 011 101…)_2, and so

  (723.3041)_10 = (1323.2335…)_8 = (1011010011.010011011101…)_2

Floating point representations.
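The two procedures above (repeated division for the integer part, repeated multiplication for the fractional part) can be sketched in Python. This is a minimal illustration; the function names are my own, not from the text.

```python
def int_to_base(n, base):
    """Convert a non-negative integer to the given base by repeated division."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, r = divmod(n, base)      # continue with the quotient; remainder is the next digit
        digits.append(str(r))
    return "".join(reversed(digits))  # remainders come out least-significant first

def frac_to_base(f, base, places):
    """Convert a fraction 0 <= f < 1 by repeated multiplication, to `places` digits."""
    digits = []
    for _ in range(places):
        f *= base
        d = int(f)                   # the digit that appears to the left of the point
        digits.append(str(d))
        f -= d
    return "".join(digits)

print(int_to_base(723, 8))           # -> 1323
print(frac_to_base(0.3041, 8, 4))    # -> 2335
print(int_to_base(723, 2))           # -> 1011010011
```

Note that `frac_to_base` inherits the usual floating point approximation of its input, which is exactly the finite-precision issue discussed below.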
Computer arithmetic treats numbers other than integers as floating point numbers. They all comprise a sign, an exponent, and a fractional part (mantissa), as in these normalized scientific notation forms:

  54.23157 = 0.5423157 × 10^2
  0.000327138 = 0.327138 × 10^-3
  -20137.21765 = -0.2013721765 × 10^5

In general, in normalized form a floating point number x is represented as

  x = ±q × 10^m

where q = normalized mantissa and m = exponent. Note that 0.1 ≤ q < 1.

The standard IEEE single-precision floating point format is a 32-bit number, in which the first bit is the sign bit, the exponent takes 8 bits, and the mantissa 23 bits. For double-precision arithmetic we use a 64-bit representation, in which the first bit is the sign bit, the exponent takes 11 bits, and the mantissa 52 bits.

IEEE floating point format:

  x = (-1)^s × (1 + fraction) × 2^(exponent - bias)

  s = 0 for positive, 1 for negative

• Exponent field = actual exponent + bias (excess representation)
• This ensures the stored exponent is unsigned
• For single precision: bias = 127
• For double precision: bias = 1023

The final layout of a 32-bit floating point number:

  sign bit | 8 exponent bits | 23 mantissa bits

Single-precision range:
• Exponent fields 00000000 and 11111111 are reserved.
• Smallest exponent field: 00000001, which implies actual exponent 1 - 127 = -126. The smallest normalized value is ±1 × 2^-126 ≈ ±1.2 × 10^-38.
• Largest exponent field: 11111110, which implies actual exponent 254 - 127 = 127. With fraction part 0.1111…11, the significand ≈ 2.0, so the largest value is ±2.0 × 2^127 ≈ ±3.4 × 10^38.

Example (from http://sandbox.mc.edu/~bennet/cs110/flt/dtof.html). Convert -1313.3125 to IEEE 32-bit floating point format.

The integer part: (1313)_10 = (2441)_8 = (10100100001)_2. The fractional part:

  0.3125 × 2 = 0.625   generate 0 and continue
  0.625  × 2 = 1.25    generate 1 and continue with the rest
  0.25   × 2 = 0.5     generate 0 and continue
  0.5    × 2 = 1.0     generate 1; nothing remains

So (1313.3125)_10 = (10100100001.0101)_2. Normalize: (10100100001.0101)_2 = (1.01001000010101)_2 × 2^10.
The mantissa is 01001000010101000000000, the exponent is 10 + 127 = 137 = (10001001)_2, and the sign bit is 1. So -1313.3125 is 1 10001001 01001000010101000000000 = 11000100101001000010101000000000 = (C4A42A00)_16.

Machine precision
• For a 32-bit number, the smallest relative difference we can resolve is 2^-23 ≈ 1.2 × 10^-7. Therefore, in usual single-precision computation, approximately 6-7 significant decimal digits of accuracy are available.
• For double precision, each number is stored using two consecutive words in memory, and 52 bits are available for the mantissa. Since 2^-52 ≈ 2.2 × 10^-16, we have approximately 15-16 significant decimal digits of accuracy in double precision.
• A 32-bit integer lies between -2^31 and 2^31 - 1 = 2147483647. In integer arithmetic, accuracy is therefore about 9 digits only.

Computational round-off error
• Let x = actual number, and fl(x) = its floating point machine number. Then fl(x) = x(1 + δ), where |δ| ≤ 2^-24.
• The relative error in representing x by its floating point machine number fl(x) is

  e = |x - fl(x)| / |x| ≤ 2^-24

• Each arithmetic operation incurs a similar relative error: fl(x ∘ y) = (x ∘ y)(1 + δ) for ∘ ∈ {+, -, ×, ÷}.
• In computers, all non-integer numbers (floating point numbers) are expressed with finite precision. π and 1/3 cannot be expressed by a fixed number of significant digits, and here is our source of loss of significant digits. Compare 0.3 + 0.3 + 0.3 with 0.9; assign their difference to error. What is the value of error?

  > 0.3+0.3+0.3==0.9
  [1] FALSE
  > error=(0.3+0.3+0.3-0.9)
  > error
  [1] -1.110223e-16

Note also that floating point computation results depend on the order of computation. It is not always associative, i.e. (a+b)+c may not always equal a+(b+c). Example:

  > (0.00000000000000001+1)-1
  [1] 0
  > 0.00000000000000001+(1-1)
  [1] 1e-17

Some floating point numbers, such as 0.1, have to be approximated, and this is a major source of error:

  0.1 = 1.1001 1001 1001 1001 1001 1001… × 2^-4

The fraction part 1001 1001 1001 1001 1001 1001 (the mantissa) is also called the significand, and -4 is the exponent.
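The worked example above can be checked directly in Python: the standard `struct` module packs a value in IEEE 754 single precision (big-endian with the ">f" format), so the four bytes should match the hex result derived by hand.

```python
import struct

# Pack -1313.3125 as a big-endian IEEE 754 single-precision float.
bits = struct.pack(">f", -1313.3125)
print(bits.hex().upper())            # -> C4A42A00

# Unpack the same four bytes to recover the value exactly
# (-1313.3125 is exactly representable in binary32).
(value,) = struct.unpack(">f", bytes.fromhex("C4A42A00"))
print(value)                         # -> -1313.3125
```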
Storage for floating point numbers is determined by the machine architecture. On one particular machine, the sign takes 1 bit, the significand is stored in 24 bits, and the exponent in 7 bits, so that the entire single-precision floating point number is stored in 4 bytes, as in the example shown earlier.

All these examples point to loss of significant digits due to machine limitations (rounding/chopping errors) in floating point numbers. In particular, be careful when subtracting two numbers that are almost equal, as careless subtraction may lead to loss of significant digits.

In general, error in numerical computation (our current concern) arises from:
• Modeling errors
• Mistakes (blunders)
• Physical measurement errors
• Machine representation of floating point numbers
• Mathematical approximation errors

Consequences of inadequate handling of errors:
• Loss of significant digits
• Noise in function evaluation
• Over- and under-estimation of errors
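The warning about subtracting nearly equal numbers can be illustrated with a standard example (my own, not from the text): for large x, computing sqrt(x+1) - sqrt(x) directly cancels most significant digits, while the algebraically equivalent form 1/(sqrt(x+1) + sqrt(x)) avoids the subtraction entirely.

```python
import math

x = 1.0e12

# Direct form: two nearly equal numbers are subtracted, so their
# leading digits cancel and few significant digits survive.
naive = math.sqrt(x + 1) - math.sqrt(x)

# Rationalized form: the same quantity, rewritten so that no
# subtraction of nearly equal numbers occurs.
stable = 1.0 / (math.sqrt(x + 1) + math.sqrt(x))

print(naive)   # only the first few digits are correct
print(stable)  # close to 5.0e-7 to nearly full double precision
```

The true value is approximately 5.0 × 10^-7; the naive form retains only a handful of correct digits, a direct consequence of the cancellation described above.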