Mathematics and Statistics
Dr. Corcoran – STAT 6550
Computer Arithmetic
Because of the limitations of finite binary storage, computers do
not store exact representations of most numbers, nor do they
perform exact arithmetic.
• 32 bits (each bit a single 1/0 unit) are used for an integer.
• 64 bits are used for a double-precision number.
Integer Storage
For many computers, the 32 bits of a stored integer u can be thought
of as the binary coefficients xi in the representation

$$u = \sum_{i=1}^{32} x_i 2^{i-1} - 2^{31},$$
where each xi is 1 or 0. Note that in this representation, if x32 is 1
and every other xi = 0, then u = 0.
• What’s the largest possible positive integer that can be stored
using this representation?
• What’s the largest (in magnitude) negative integer?
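As a quick check (added here, not part of the original slides), R's
built-in .Machine object reports this limit, which equals 2^31 - 1:
> .Machine$integer.max
[1] 2147483647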
Integer Arithmetic
If the result of integer arithmetic puts us outside of the range of
storable values, the results can be unpredictable. For example, with a
32-bit representation, adding 1 to the largest number will result in
overflow. In R (notice how R needs to be coerced into integer
arithmetic):
> u=as.integer(0)
> b=as.integer(1)
> two=as.integer(2)
> for (i in 1:31) {u=u+b; b=b*two}
Warning message:
NAs produced by integer overflow in: b * two
> u
[1] 2147483647
> u+as.integer(1)
[1] NA
Warning message:
NAs produced by integer overflow in: u + as.integer(1)
Floating Point Storage
A floating point number is typically represented in the form

$$(-1)^{x_0}\left(\sum_{i=1}^{t} x_i 2^{-i}\right)2^{k}.$$
In this formulation:
• k is an integer called the exponent.
• xi = 0 or 1 for i = 1,…,t.
• x0 is the sign bit, with x0 = 0 for positive numbers and x0 = 1
for negative numbers.
• The fractional part (the summation) is called the mantissa.
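As a small worked instance (added here, not in the original slides):
with x0 = 0, x1 = x2 = 1 (all other xi = 0), and k = 2, the stored
value is $(-1)^0(1/2 + 1/4)\,2^2 = 3$.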
Floating Point Storage – Additional Conventions
• Note that by shifting the digits in the mantissa and making a corresponding
change in the exponent, the representation is not unique. By convention, the
exponent is chosen so that the first digit of the mantissa is 1, except if this
would put the exponent out of range.
• A 32-bit single precision floating point number is usually represented as a
sign bit, a 23-bit mantissa, and an 8-bit exponent.
• The exponent is typically shifted so that it takes the values –126 to 128. The
remaining possible value can be reserved for special values (such as
underflow, overflow [in R stored as Inf], or something like NaN [not a
number]).
• Standard double precision uses a sign bit, an 11-bit exponent, and 52 bits for
the mantissa.
Floating Point Storage – Example
Suppose that you have 8 bits of storage to represent a floating point
number. We use a sign bit, 5 bits for the mantissa, and 2 bits for the
exponent, using the same conventions as described earlier for the
32-bit representation (note that the lead bit in the mantissa need not
be stored).
What would our representation be for 1/3?
What is the difference between 1/3 and its representation?
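A sketch of one possible answer (added here, not from the original
slides), assuming the mantissa is normalized as $(0.1x_2\ldots x_6)_2$
with the leading 1 implicit, so the 5 stored bits give $t = 6$
effective mantissa bits, and assuming rounding to nearest:

$$\tfrac{1}{3} = (0.01010101\ldots)_2 = (0.1010101\ldots)_2 \times 2^{-1}
\approx (0.101011)_2 \times 2^{-1} = \tfrac{43}{128} = 0.3359375,$$

so the stored value differs from 1/3 by about $2.6 \times 10^{-3}$.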
Machine Constants
Machine constants are the numbers that characterize the limits of a
machine's arithmetic. They may include, for example:
• The largest possible positive and negative numbers (in
magnitude) that can be stored without producing overflow.
• A smallest possible positive number.
• The smallest number that when added to 1 will produce a result
different than 1.
Finding Machine Constants
Such constants are not typically readily available, so we often need
to use algorithms to obtain them. For example, see www.nr.com.
We can use the fact that R is a compiled C program to find some of
these constants. For example, the smallest possible number that can
be added to 1 and produce a result different than 1 is typically
referred to as machine epsilon, denoted by εm. Keeping in mind the
representation discussed earlier, we have
$$1 = \left(\frac{1}{2} + \sum_{i=2}^{t}\frac{0}{2^{i}}\right)2^{1}.$$
What should the next largest representable number be?
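As a hint (added here, not in the original slides): flipping the last
mantissa bit from 0 to 1 gives $\left(\tfrac{1}{2} +
\tfrac{1}{2^{t}}\right)2^{1} = 1 + 2^{1-t}$, which suggests the next
representable number above 1.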
Machine Epsilon
To find εm, recall that for double precision we have t = 52, although
we don’t need to store the leading bit (by convention) so that
effectively t = 53. Thus, $1 + 1/2^{52}$ should be different than 1. In R:
> options(digits=20)
> 1+1/2^53
[1] 1
> 1+1/2^52
[1] 1.0000000000000002
> 1/2^52
[1] 2.2204460492503131e-16
Machine Epsilon (continued)
Note that while $1 + 1/2^{52}$ may be the next largest representable
number, $1/2^{52}$ may not be εm. That is, if addition is performed at
higher accuracy, and the result is rounded to the nearest
representable number, then the next representable number larger
than $1/2^{53}$, when added to 1, should also round to this value. The
next number larger than $1/2^{53}$ should be $(1 + 1/2^{52})/2^{53} =
1/2^{53} + 1/2^{105}$. However, in R (although not in S) this doesn’t
seem to be εm (as we’ve defined it).
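One can check this directly; a quick sketch (added here, not from the
original slides), whose result may depend on the platform's rounding
behavior:
> x=(1+1/2^52)/2^53   # next representable number above 1/2^53
> (1+x)>1             # TRUE if adding x to 1 yields the next number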
Relative Error
If x is the true value of a number, and f(x) is its floating point
representation, then εm is an upper bound on the relative error of any
stored number (except in cases of overflow or underflow). Recall

$$f(x) = (-1)^{x_0}\left(\sum_{i=1}^{t} x_i 2^{-i}\right)2^{k},$$

for xi = 0, 1. Then the relative error is given by
$$\frac{|x - f(x)|}{|x|} \le \frac{2^{k}}{2^{t+1}\,|x|},$$

since otherwise x would round to another floating point number.
(Since the normalized mantissa is at least 1/2, we have $|x| \ge
2^{k-1}$, so this bound is at most $2^{-t}$, which is on the order of εm.)
Relative Error and εm
The value εm thus plays an important role in error analysis. It is
sometimes called machine precision, in addition to machine epsilon.
In double precision on a Sun, εm ≈ $1.11 \times 10^{-16}$, so that double
precision numbers are stored accurate to about 16 decimal digits.
R has an object .Machine containing many machine constants:
> .Machine$double.eps
[1] 2.2204460492503131e-16
> 1/2^52
[1] 2.2204460492503131e-16
What does this have to do with practical computing?
Definitions:
Condition – a measure that broadly represents the ease
with which a problem can be solved.
Stability – a measure of the numerical accuracy of a solution
relative to the input.
Condition
Consider the simple definition:
output = f(input)
The condition of a problem measures the relative change in the
output due to a relative change in the input:
$$\frac{|f(\mathrm{input} + \delta) - \mathrm{output}|}{|\mathrm{output}|}
\approx \mathrm{condition} \cdot \frac{|\delta|}{|\mathrm{input}|};$$

Or, in terms of derivatives, the condition number C of a problem is
approximated by

$$C \approx |x f'(x)/f(x)|.$$
Example
Consider solving the polynomial

$$z^2 + x_1 z + x_2 = 0,$$

where x1, x2 > 0.
What is the condition number of the problem?
Compare the stability of the quadratic formula to the approach that
uses the reciprocal solutions in negative powers of z.
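To get a feel for the stability comparison, here is a minimal R sketch
(added here, not from the slides), assuming the sign convention
$z^2 + x_1 z + x_2 = 0$ above and the illustrative values x1 = 1e8,
x2 = 1. The root of smaller magnitude suffers catastrophic
cancellation under the standard formula, while dividing x2 by the
accurately computed large root (the product of the roots is x2) is
stable:
x1 <- 1e8; x2 <- 1
d <- sqrt(x1^2 - 4*x2)
z.naive <- (-x1 + d)/2   # -x1 and d nearly cancel: few correct digits
z.large <- (-x1 - d)/2   # no cancellation here
z.stable <- x2/z.large   # product of the roots equals x2
c(z.naive, z.stable)     # true small root is about -1e-8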
Computing Sums
We have data x1,…,xn, and wish to compute
$$S = \sum_{i=1}^{n} x_i.$$
Given a floating point representation f(xi), we first add f(x1) to f(x2),
and that result is stored as a floating point number. Then f(x3) is
added to the sum and the result again is converted to a floating
point number, and so on.
Note that the relative error can potentially compound, and so we
must be careful in approaching summations where the
compounding error is catastrophic.
Compounding Error
• It turns out that the error bound for straightforward summation
(adding one element at a time) increases as the square of the
number of terms in the sum.
• When adding negative numbers (i.e., handling subtraction), if
two numbers with opposite signs are similar in magnitude, then
the leading digits of the mantissa will cancel, leading to
potentially large relative error.
• If all numbers have the same sign, the error bound for straight
summation can be greatly reduced by summing from smallest to
largest.
• Pairwise summation can reduce the relative error to magnitude
n log2(n). Can you explain why? How might that improve
accuracy over straight summation if n = 1000, for example?
(See the sketch below.)
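A minimal sketch of recursive pairwise summation (added here, not from
the slides): each element participates in only about log2(n)
additions, versus up to n - 1 in straight left-to-right summation.
pairwise.sum <- function(x) {
  n <- length(x)
  if (n == 1) return(x[1])
  m <- n %/% 2
  # Sum each half recursively, then combine: the recursion depth,
  # and hence the number of roundings any element passes through,
  # is about log2(n).
  pairwise.sum(x[1:m]) + pairwise.sum(x[(m + 1):n])
}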
Example – Taylor Series Summation
Consider the Taylor series approximation given by
$$\exp(x) = \sum_{i=0}^{\infty} x^{i}/i!,$$
which works well if |x| is not too large. Straight summation works
well for positive x, but not so well if x < 0:
> fexp=function(x){
+ i=0
+ expx=1
+ u=1
+ while (abs(u)>1.e-8*abs(expx)){
+   i=i+1
+   u=u*x/i
+   expx=expx+u
+ }
+ return(expx)
+ }
> options(digits=10)
> c(exp(1),fexp(1))
[1] 2.718281828 2.718281826
> c(exp(100),fexp(100))
[1] 2.688117142e+43 2.688117108e+43
> c(exp(-1),fexp(-1))
[1] 0.3678794412 0.3678794413
> c(exp(-10),fexp(-10))
[1] 4.539992976e-05 4.539992956e-05
> c(exp(-20),fexp(-20))
[1] 2.061153622e-09 5.621884467e-09
> c(exp(-30),fexp(-30))
[1] 9.357622969e-14 -3.066812359e-05
> # FOR THE SAKE OF ILLUSTRATION:
> (-20)^10/prod(1:10)
[1] 2821869.489
> (-20)^9/prod(1:9)
[1] -1410934.744
> (-20)^20/prod(1:20)
[1] 43099804.12
Example – Taylor Series Summation (continued)
Note that using straight summation for values of -20 or -30 results in
large terms that alternate in sign – some of which are much larger
than the final solution. A better approach is to note that
exp(-x) = 1/exp(x):
> fexp.better
function(x){
xa=abs(x)
i=0
expx=1
u=1
while (abs(u)>1.e-8*abs(expx)){
i=i+1
u=u*xa/i
expx=expx+u
}
if (x>=0) return(expx)
else return(1/expx)
}
> c(exp(-1),fexp.better(-1))
[1] 0.3678794412 0.3678794415
> c(exp(-10),fexp.better(-10))
[1] 4.539992976e-05 4.539992986e-05
> c(exp(-20),fexp.better(-20))
[1] 2.061153622e-09 2.061153632e-09
> c(exp(-30),fexp.better(-30))
[1] 9.357622969e-14 9.357623008e-14
>
Example – Computing the Sample Variance
The familiar formula:
$$s^2 = \frac{1}{n-1}\left(\sum_i x_i^2 - \frac{1}{n}\Bigl(\sum_i x_i\Bigr)^2\right)$$
> x=c(0.999999998,0.999999999,1.0,1.000000001,1.000000002)
> n=length(x)
> # SO-CALLED "ONE-PASS" APPROACH:
> (sum(x^2)-sum(x)^2/n)/(n-as.integer(1))
[1] 0
> # "TWO-PASS" APPROACH:
> sum((x-sum(x)/n)^2)/(n-as.integer(1))
[1] 2.5000000251237998e-18
>
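When the data can only be visited once, a numerically stable
alternative to the textbook one-pass formula is Welford's updating
algorithm. A minimal sketch (added here, not from the original
slides):
var.welford <- function(x) {
  m <- 0    # running mean
  ss <- 0   # running sum of squared deviations from the current mean
  for (i in seq_along(x)) {
    d <- x[i] - m
    m <- m + d/i
    ss <- ss + d*(x[i] - m)  # uses the updated mean; avoids cancellation
  }
  ss/(length(x) - 1)
}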