Computer representation of numbers in different bases.

Decimal: 37145 = 30000 + 7000 + 100 + 40 + 5 = 5 × 10^0 + 4 × 10^1 + 1 × 10^2 + 7 × 10^3 + 3 × 10^4.

General formula: a number in the decimal system is a string of digits

  a_n a_{n-1} a_{n-2} … a_0 = a_0 × 10^0 + a_1 × 10^1 + … + a_n × 10^n

For a fractional part in decimal,

  0.b_1 b_2 b_3 … = b_1 × 10^-1 + b_2 × 10^-2 + b_3 × 10^-3 + …

Extending this to any base β, a typical number may be written as

  (a_n a_{n-1} … a_0 . b_1 b_2 b_3 …)_β = Σ_{k=0}^{n} a_k β^k + Σ_{k=1}^{∞} b_k β^{-k}

Example:

  (723.3041)_10 = 7 × 10^2 + 2 × 10^1 + 3 × 10^0 + 3 × 10^-1 + 0 × 10^-2 + 4 × 10^-3 + 1 × 10^-4

To convert a number from one base to another, we may do the following. Example from decimal to octal: divide the integer part (before the decimal point) repeatedly by 8, setting each remainder aside, and continue with the quotient in the same manner until the quotient becomes less than 8. E.g. for (723)_10:

  723 = 8 × 90 + 3
   90 = 8 × 11 + 2
   11 = 8 ×  1 + 3

Collecting the final quotient and the remainders from last to first, (723)_10 = (1323)_8.

Check: (1323)_8 = 1 × 8^3 + 3 × 8^2 + 2 × 8^1 + 3 = 512 + 192 + 16 + 3 = 723.

(1323)_8 can be expressed in binary by expanding each octal digit, from the right, as three binary digits. Therefore (1323)_8 = (001 011 010 011)_2 = (1011010011)_2.

Similarly, to get the fractional part in octal, keep multiplying by 8, collecting the digit that appears to the left of the point as the next octal digit as we continue:

  0.3041 × 8 = 2 + 0.4328
  0.4328 × 8 = 3 + 0.4624
  0.4624 × 8 = 3 + 0.6992
  0.6992 × 8 = 5 + 0.5936
  ……

Thus (0.3041)_10 = (0.2335…)_8 = (0.010 011 011 101…)_2, and so

  (723.3041)_10 = (1323.2335…)_8 = (1011010011.010011011101…)_2

Floating point representations.
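The two procedures above (repeated division for the integer part, repeated multiplication for the fractional part) can be sketched in Python. This is a minimal illustration; the function names are my own, not from the text.

```python
def int_to_base(n, base):
    """Convert a non-negative integer to the given base by repeated division."""
    if n == 0:
        return "0"
    digits = []
    while n > 0:
        n, r = divmod(n, base)      # continue with the quotient; remainder is the next digit
        digits.append(str(r))
    return "".join(reversed(digits))  # remainders come out least-significant first

def frac_to_base(f, base, places):
    """Convert a fraction 0 <= f < 1 by repeated multiplication, to `places` digits."""
    digits = []
    for _ in range(places):
        f *= base
        d = int(f)                   # the digit that appears to the left of the point
        digits.append(str(d))
        f -= d
    return "".join(digits)

print(int_to_base(723, 8))           # -> 1323
print(frac_to_base(0.3041, 8, 4))    # -> 2335
print(int_to_base(723, 2))           # -> 1011010011
```

Note that `frac_to_base` inherits the usual floating point approximation of its input, which is exactly the finite-precision issue discussed below.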
Computer arithmetic treats numbers other than integers as floating point numbers. They all comprise a sign, an exponent, and a fractional part (mantissa), as in these normalized scientific notation forms:

  54.23157 = 0.5423157 × 10^2
  0.000327138 = 0.327138 × 10^-3
  -20137.21765 = -0.2013721765 × 10^5

In general, in normalized form a floating point number x is represented as

  x = ±q × 10^m

where q = normalized mantissa and m = exponent. Note that 0.1 ≤ q < 1.

The standard IEEE single-precision floating point format is a 32-bit number, in which the first bit is the sign bit, the exponent takes 8 bits, and the mantissa 23 bits. For double-precision arithmetic we use a 64-bit representation, in which the first bit is the sign bit, the exponent takes 11 bits, and the mantissa 52 bits.

IEEE floating point format:

  x = (-1)^s × (1 + fraction) × 2^(exponent - bias)

  s = 0 for positive, 1 for negative

• Exponent field = actual exponent + bias (excess representation)
• This ensures the stored exponent is unsigned
• For single precision: bias = 127
• For double precision: bias = 1023

The final layout of a 32-bit floating point number:

  sign bit | 8 exponent bits | 23 mantissa bits

Single-precision range:
• Exponent fields 00000000 and 11111111 are reserved.
• Smallest exponent field: 00000001, which implies actual exponent 1 - 127 = -126. The smallest normalized value is ±1 × 2^-126 ≈ ±1.2 × 10^-38.
• Largest exponent field: 11111110, which implies actual exponent 254 - 127 = 127. With fraction part 0.1111…11, the significand ≈ 2.0, so the largest value is ±2.0 × 2^127 ≈ ±3.4 × 10^38.

Example (from http://sandbox.mc.edu/~bennet/cs110/flt/dtof.html). Convert -1313.3125 to IEEE 32-bit floating point format.

The integer part: (1313)_10 = (2441)_8 = (10100100001)_2. The fractional part:

  0.3125 × 2 = 0.625   generate 0 and continue
  0.625  × 2 = 1.25    generate 1 and continue with the rest
  0.25   × 2 = 0.5     generate 0 and continue
  0.5    × 2 = 1.0     generate 1; nothing remains

So (1313.3125)_10 = (10100100001.0101)_2. Normalize: (10100100001.0101)_2 = (1.01001000010101)_2 × 2^10.
The mantissa is 01001000010101000000000, the exponent is 10 + 127 = 137 = (10001001)_2, and the sign bit is 1. So -1313.3125 is 1 10001001 01001000010101000000000 = 11000100101001000010101000000000 = (C4A42A00)_16.

Machine precision
• For a 32-bit number, the smallest relative difference we can resolve is 2^-23 ≈ 1.2 × 10^-7. Therefore, in usual single-precision computation, approximately 6-7 significant decimal digits of accuracy are available.
• For double precision, each number is stored using two consecutive words in memory, and 52 bits are available for the mantissa. Since 2^-52 ≈ 2.2 × 10^-16, we have approximately 15-16 significant decimal digits of accuracy in double precision.
• A 32-bit integer lies between -2^31 and 2^31 - 1 = 2147483647. In integer arithmetic, accuracy is therefore about 9 digits only.

Computational round-off error
• Let x = actual number, and fl(x) = its floating point machine number. Then fl(x) = x(1 + δ), where |δ| ≤ 2^-24.
• The relative error in representing x by its floating point machine number fl(x) is

  e = |x - fl(x)| / |x| ≤ 2^-24

• Each arithmetic operation incurs a similar relative error: fl(x ∘ y) = (x ∘ y)(1 + δ) for ∘ ∈ {+, -, ×, ÷}.
• In computers, all non-integer numbers (floating point numbers) are expressed with finite precision. π and 1/3 cannot be expressed by a fixed number of significant digits, and here is our source of loss of significant digits. Compare 0.3 + 0.3 + 0.3 with 0.9; assign their difference to error. What is the value of error?

  > 0.3+0.3+0.3==0.9
  [1] FALSE
  > error=(0.3+0.3+0.3-0.9)
  > error
  [1] -1.110223e-16

Note also that floating point computation results depend on the order of computation. It is not always associative, i.e. (a+b)+c may not always equal a+(b+c). Example:

  > (0.00000000000000001+1)-1
  [1] 0
  > 0.00000000000000001+(1-1)
  [1] 1e-17

Some floating point numbers, such as 0.1, have to be approximated, and this is a major source of error:

  0.1 = 1.1001 1001 1001 1001 1001 1001… × 2^-4

The fraction part 1001 1001 1001 1001 1001 1001 (the mantissa) is also called the significand, and -4 is the exponent.
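The worked example above can be checked directly in Python: the standard `struct` module packs a value in IEEE 754 single precision (big-endian with the ">f" format), so the four bytes should match the hex result derived by hand.

```python
import struct

# Pack -1313.3125 as a big-endian IEEE 754 single-precision float.
bits = struct.pack(">f", -1313.3125)
print(bits.hex().upper())            # -> C4A42A00

# Unpack the same four bytes to recover the value exactly
# (-1313.3125 is exactly representable in binary32).
(value,) = struct.unpack(">f", bytes.fromhex("C4A42A00"))
print(value)                         # -> -1313.3125
```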
Storage for floating point numbers is determined by the machine architecture. On one particular machine, the sign takes 1 bit, the significand is stored in 24 bits, and the exponent in 7 bits, so that the entire single-precision floating point number is stored in 4 bytes, as in the example shown earlier.

All these examples point to loss of significant digits due to machine limitations (rounding/chopping errors) in floating point numbers. In particular, be careful when subtracting two numbers that are almost equal, as careless subtraction may lead to loss of significant digits.

In general, error in numerical computation (our current concern) arises from:
• Modeling errors
• Mistakes (blunders)
• Physical measurement errors
• Machine representation of floating point numbers
• Mathematical approximation errors

Consequences of inadequate handling of errors:
• Loss of significant digits
• Noise in function evaluation
• Over- and under-estimation of errors
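The warning about subtracting nearly equal numbers can be illustrated with a standard example (my own, not from the text): for large x, computing sqrt(x+1) - sqrt(x) directly cancels most significant digits, while the algebraically equivalent form 1/(sqrt(x+1) + sqrt(x)) avoids the subtraction entirely.

```python
import math

x = 1.0e12

# Direct form: two nearly equal numbers are subtracted, so their
# leading digits cancel and few significant digits survive.
naive = math.sqrt(x + 1) - math.sqrt(x)

# Rationalized form: the same quantity, rewritten so that no
# subtraction of nearly equal numbers occurs.
stable = 1.0 / (math.sqrt(x + 1) + math.sqrt(x))

print(naive)   # only the first few digits are correct
print(stable)  # close to 5.0e-7 to nearly full double precision
```

The true value is approximately 5.0 × 10^-7; the naive form retains only a handful of correct digits, a direct consequence of the cancellation described above.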