Download Module II - The University of Texas at Dallas

Module II
Lecture 4
Special Probability Distributions
Certain probability distributions occur with such regularity in real-life situations
that they have been given their own names and it is worth studying their properties.
In this section, we look at three probability distributions that arise in almost every
aspect of business.
The Binomial Distribution
Consider the following situations:
a) You audit a transaction, it is either in compliance with procedures or it is not;
b) You hire a person, that person is either a female or a male;
c) You visit a customer, it either leads to a sale or it doesn’t;
d) You lower the price of a product, sales either increase or they don’t;
e) You have a model for the stock market, you predict that it will go up at least
30 points, it either goes up 30 or more points or it doesn’t;
f) You have an intermittent problem on your company’s network, on any given
day the problem either appears or doesn’t appear.
All of these situations are ones where the Binomial Distribution may be
applicable.
There is a canonical definition for the binomial distribution. This is a set of
assumptions, which, if they hold, indicate that the binomial distribution may be applied to
a particular situation.
Let us suppose that in a given situation only one of two possible things can occur.
For example, if we flip a fair coin then we can only get the outcomes heads or tails.
Flipping the coin is called an experiment in statistical jargon, and heads or tails are called
the possible outcomes of the experiment. Each repetition of the experiment is called a
trial.
The binomial distribution applies in the following situation:
a) The outcome of any trial can only take on two possible values, say success and
failure;
b) There is a constant probability p of success on each trial;
c) The experiment is repeated n times (i.e. n trials are conducted);
d) The trials are statistically independent (i.e. the outcome of past trials does not
affect subsequent trials);
then if x equals the number of successes in the n trials, we have:
$$P(x) = \frac{n!}{x!\,(n-x)!}\; p^{x} (1-p)^{n-x}$$
for x = 0, 1, 2, ..., n.
For example, if we flipped a fair coin ten times, and let x equal the number of
heads, the above formula would give the following probabilities:
x       P(x)
0       0.00098
1       0.00977
2       0.04395
3       0.11719
4       0.20508
5       0.24609
6       0.20508
7       0.11719
8       0.04395
9       0.00977
10      0.00098
Total   1
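As a check on the table above, the formula can be evaluated directly. The following is a minimal Python sketch rather than part of the lecture; the function name `binomial_pmf` is our own choice, and `math.comb` supplies the n!/(x!(n-x)!) term.

```python
from math import comb

def binomial_pmf(x: int, n: int, p: float) -> float:
    """P(x successes in n independent trials with success probability p)."""
    return comb(n, x) * p**x * (1 - p) ** (n - x)

# Tabulate the fair-coin case from the text: n = 10, p = .5
for x in range(11):
    print(f"{x:2d}  {binomial_pmf(x, 10, 0.5):.5f}")
```

Changing the last two arguments reproduces the biased-coin cases (p = .3 and p = .05) discussed next.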
Graphically, the probability distribution looks like:
[Figure: Binomial Distribution, n = 10, p = .5; Probability plotted against x = 0 to 10]
If we used a biased coin so that the probability of getting a head is only .3, then
the probability distribution would look like:
[Figure: Binomial Distribution, n = 10, p = .3; Probability plotted against x = 0 to 10]
If the coin were extremely biased so that the probability of heads was only .05,
then the distribution would look like:
[Figure: Binomial Distribution, n = 10, p = .05; Probability plotted against x = 0 to 10]
EXCEL allows one to compute the binomial probability distribution directly. The
form of the function is:
=binomdist(x, n, p, condition),
where x is the value of interest, n is the number of trials, p is the probability of success
and condition is either “false” or “true”.
If you specify the following command,
=binomdist(3, 10, .50, false),
then EXCEL will compute the probability that x = 3.
If you use the command,
=binomdist(3, 10, .50, true),
then Excel will compute the probability that x<=3. In other words EXCEL will
accumulate the probabilities for x = 0, x = 1, x = 2, and x = 3 and report the total.
The following table shows the use of both conditions in the case where n = 10,
and p = .5:
Condition:   false      true
x            P(x)       P(<=x)
0            0.00098    0.00098
1            0.00977    0.01074
2            0.04395    0.05469
3            0.11719    0.17188
4            0.20508    0.37695
5            0.24609    0.62305
6            0.20508    0.82813
7            0.11719    0.94531
8            0.04395    0.98926
9            0.00977    0.99902
10           0.00098    1.00000
Total        1
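For readers working outside a worksheet, the two modes of the function can be imitated in a few lines. This is only a sketch: the Python function below borrows the name `binomdist` for familiarity, and its boolean `cumulative` argument stands in for the "condition" argument described above.

```python
from math import comb

def binomdist(x: int, n: int, p: float, cumulative: bool) -> float:
    """Mimic the two modes described in the text: P(X = x) or P(X <= x)."""
    def pmf(k: int) -> float:
        return comb(n, k) * p**k * (1 - p) ** (n - k)
    if cumulative:
        # Accumulate the probabilities for 0, 1, ..., x
        return sum(pmf(k) for k in range(x + 1))
    return pmf(x)

print(round(binomdist(3, 10, 0.5, False), 5))  # point probability, x = 3
print(round(binomdist(3, 10, 0.5, True), 5))   # cumulative probability, x <= 3
```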
One can show that,
$$E(x) = np,$$
and,
$$SD(x) = \sqrt{np(1-p)}$$
If instead of x, the number of successes, we are interested in
$$\hat{p} = x / n$$
that is the proportion of successes in n trials, then one can show that
$$E(\hat{p}) = p$$
and,
$$SD(\hat{p}) = \sqrt{\frac{p(1-p)}{n}}$$
In our case,
E(x) = 10 * .5 = 5,
and,
$$SD(x) = \sqrt{(.5)(.5)(10)} = 1.581139$$
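These results can be checked numerically. The sketch below (our own illustration, with variable names chosen for this example) computes the mean and standard deviation by brute force from the distribution with n = 10 and p = .5 and compares them with the closed-form expressions.

```python
from math import comb, sqrt

n, p = 10, 0.5
pmf = [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]

# Moments computed directly from the probability distribution ...
mean = sum(x * q for x, q in enumerate(pmf))
sd = sqrt(sum((x - mean) ** 2 * q for x, q in enumerate(pmf)))

# ... and from the closed-form results quoted in the text.
mean_formula = n * p
sd_formula = sqrt(n * p * (1 - p))
sd_proportion = sqrt(p * (1 - p) / n)    # SD of the proportion of successes x / n

print(mean, mean_formula)
print(sd, sd_formula, sd_proportion)
```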
Let us apply the binomial distribution to a more practical problem than flipping
coins.
Suppose that you are going to hire 10 persons from a pool of qualified candidates
which is 30 % women. You find that only 1 woman, and 9 men were hired. Is this
evidence that the firm is discriminating against women in hiring?
The first question is to determine if the binomial distribution is applicable.
Clearly each hire can only be a man or woman so that condition is fulfilled.
The hires are probably independent or close to independent so that condition is
probably fulfilled.
The major problem is whether or not the probability of success, p, is constant
from trial to trial. If there were only a total of 20 applicants, 6 women and 14 men, then
if you hired one of the women on the first hire, that would leave 5 women and 14 men
which would mean the probability of hiring a woman for the second hire could only be
5 / 19 = .2632,
which is a very large change from the initial probability of .30. On the other hand if the
hiring pool consisted of 100 applicants, 30 women and 70 men, then if you hired one of
the women on the first hire, that would leave 29 women and 70 men, which would mean
the probability of hiring a woman for the second hire would be
29 / 99 = .2929
which is a very small change.
The actual probability distribution to use is called the hypergeometric distribution.
However it is well known that if the probability p in the binomial distribution does not
change much from trial to trial, then the results from the hypergeometric distribution and
the binomial distribution are almost identical.
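A quick numerical comparison illustrates the point. The sketch below is our own illustration: it computes the probability of hiring at most 1 woman in 10 hires under the hypergeometric distribution for pools of 20 and 100 applicants (30% women, as in the text) plus a hypothetical pool of 1,000 added for contrast, and compares each with the binomial value. The function names are assumptions made for this example.

```python
from math import comb

def hypergeom_cdf(x: int, pool: int, women: int, hires: int) -> float:
    """P(at most x women) when `hires` people are drawn without replacement."""
    total = comb(pool, hires)
    ways = sum(comb(women, k) * comb(pool - women, hires - k)
               for k in range(x + 1))
    return ways / total

def binom_cdf(x: int, n: int, p: float) -> float:
    """P(X <= x) for a binomial with n trials and success probability p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1))

print("binomial, p = .3           :", round(binom_cdf(1, 10, 0.3), 4))
for pool, women in [(20, 6), (100, 30), (1000, 300)]:
    print(f"hypergeometric, pool {pool:5d}:",
          round(hypergeom_cdf(1, pool, women, 10), 4))
```

The larger the pool relative to the number of hires, the closer the hypergeometric result should sit to the binomial one.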
Assuming that the probability of hiring a woman does not change much over the
10 hires, then we can reasonably assume that the probability is approximately constant
over the 10 hires and the assumptions of the binomial distribution are approximately
fulfilled.
The next problem we face is that any value between 0 and 10 can possibly occur.
Indeed the probability distribution for this situation, i.e. binomial with n = 10, and p = .3,
is given in the table below:
Female Hires   Prob       Cumulative Prob
0              0.028248   0.028248
1              0.121061   0.149308
2              0.233474   0.382783
3              0.266828   0.649611
4              0.200121   0.849732
5              0.102919   0.952651
6              0.036757   0.989408
7              0.009002   0.998410
8              0.001447   0.999856
9              0.000138   0.999994
10             0.000006   1.000000
Notice that although any value is possible the values are not all equally probable.
For example it would not at all seem odd if we hired 3 women or 2 women or 4 women
since these values all have reasonably high probabilities. On the other hand, it would
seem odd if we hired 10 women since the probability of this outcome is approximately
6 chances in 1,000,000. Almost as rare as winning the Texas Lottery!!
Statistical logic works like this: a) define what you think is a rare event (most
users of statistics define rare as 1 chance in 20 [.05] or 1 chance in 100 [.01]); b) if the
probability of the observed result or anything more extreme is less than what you define
as rare, then the assumptions of the binomial distribution are suspect.
In our case we observed 1 female hire. More extreme is to hire 0 women.
Therefore we want the probability of observing 1 or fewer women. This can be obtained
directly from the above table or by using the =binomdist(1, 10, .3, true) command.
The result is a probability of .149308. This is roughly a chance of 1 in 7 which
most people would not think is rare. Accordingly this data would not be suggestive of
disproportionate hiring of women. (Of course if it happened more than once, it might be
indicative. Suppose a month later the same thing happens. The probability of hiring 1 or
fewer women in 10 hires from a pool that is 30% women on two independent occasions would be:
(.149308) * (.149308) = .0223
which for many people would cast doubt on the fairness of the hiring environment.)
Graphically, the probability distribution we would expect for x = the number of
women hired when you are hiring for 10 positions (n) from a pool of qualified applicants
which is 30% female ( p = .30) is given below with the observed and more extreme value
highlighted:
[Figure: Number of Women Hires; Probability plotted against x = 0 to 10]
Now let us leave the percentage of women in the qualified pool the same as
previously (i.e. p =.3) but now hire for 50 positions (n=50). And assume again we only
hire 10% women (x=5). Then the probability distribution would look like:
[Figure: Number of Women Hires; Probability plotted against x = 0 to 48, with x = 5 marked]
As can be seen, hiring only 5 women in 50 hires from a pool of 30% women is a
relatively rare event. Using the binomdist function I can compute:
P(x <= 5) = binomdist(5, 50, .3, true) = .00072.
This amounts to a chance of approximately 7 in 10,000, which is highly
improbable. Accordingly, in this situation we would doubt that women are being
hired in proportion to their representation in the applicant pool.
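The statistical logic used in the hiring examples can be sketched in code: compute the probability of the observed result or anything more extreme, then compare it with the chosen definition of "rare". This is a minimal sketch assuming the .05 convention mentioned earlier; the helper name `binom_cdf` and the constant `RARE` are our own.

```python
from math import comb

def binom_cdf(x: int, n: int, p: float) -> float:
    """P(X <= x) for a binomial with n trials and success probability p."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(x + 1))

RARE = 0.05                        # the "1 chance in 20" convention

once = binom_cdf(1, 10, 0.3)       # 1 or fewer women in 10 hires
twice = once ** 2                  # same outcome in two independent rounds
large = binom_cdf(5, 50, 0.3)      # 5 or fewer women in 50 hires

for label, prob in [("10 hires, once", once),
                    ("10 hires, twice", twice),
                    ("50 hires", large)]:
    print(f"{label:16s} {prob:.6f}  rare: {prob < RARE}")
```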
It is very easy to simulate the binomial distribution. Suppose we wish to simulate
values of x from a binomial distribution with n = 10 and p = .4.
For each trial, we could use the statement:
=if(rand()<=.4, 1, 0)
This would generate a value of 1 approximately 40% of the time. If we repeated the
above statement 10 times and added up the results, this would be equivalent to taking a
sample of n = 10 and observing x where x followed the binomial distribution with p = .4.
The above procedure, however, is cumbersome. Notice that what we really did
was to divide the range of values between 0 and 1 into two regions. The first went from 0
to .4. If the random number fell in this range, we said that the outcome should be 1. If
the random number fell in the range .4 to 1 we said the outcome should be 0.
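The trial-by-trial approach translates directly into code. The sketch below is an illustration only: `simulate_binomial` is a name made up for this example, and the fixed seed is just there to make the run repeatable.

```python
import random

def simulate_binomial(n: int, p: float, rng: random.Random) -> int:
    """One draw of x: add up n indicator values, each 1 with probability p."""
    return sum(1 if rng.random() <= p else 0 for _ in range(n))

rng = random.Random(2024)          # fixed seed so runs are repeatable
draws = [simulate_binomial(10, 0.4, rng) for _ in range(10_000)]

# The average of many simulated values of x should be near n * p = 4.
print("average of simulated x:", sum(draws) / len(draws))
```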
If we had a random variable with three possible outcomes, then we could divide
the interval between 0 and 1 into three regions with the size of each region proportional
to its probability.
In our case, with n = 10, we have eleven possible values which can occur, namely
the values 0, 1, 2, . . . , 10. We wish to divide these proportionally to their probability
given as:
Binomial Distribution
n = 10, p = .4
x       P(x)
0       0.006047
1       0.040311
2       0.120932
3       0.214991
4       0.250823
5       0.200658
6       0.111477
7       0.042467
8       0.010617
9       0.001573
10      0.000105
Total   1
This can easily be accomplished by computing the cumulative binomial
probability distribution which is shown below:
Cumulative Binomial Distribution
n = 10, p = .4
x       P(<=x)
0       0.006047
1       0.046357
2       0.16729
3       0.382281
4       0.633103
5       0.833761
6       0.945238
7       0.987705
8       0.998322
9       0.999895
10      1
We now set up the following rule:
If Random number is between       Then x =
0          and   0.006047         0
0.006047   and   0.046357         1
0.046357   and   0.16729          2
0.16729    and   0.382281         3
0.382281   and   0.633103         4
0.633103   and   0.833761         5
0.833761   and   0.945238         6
0.945238   and   0.987705         7
0.987705   and   0.998322         8
0.998322   and   0.999895         9
0.999895   and   1                10
For example if we generated the random number .188754, this value falls between
.16729 and .382281 so we would say that x = 3 for those 10 trials. If we generated a
second random number, say .99461, then this value falls between .987705 and .998322 so
that we would say that x = 8 for those 10 trials.
The entire process can be automated further in EXCEL using the function
“LOOKUP”. This function has three arguments. The first is the value we wish to look
up, in this case this is the random number. The second argument is the table in which
you want to look up the probability, in our case this is the cumulative probability
distribution column (but we must add the value “0” in the row preceding the first
probability). Finally, the third argument is the table containing the results, in our case the
values of x.
The entire process is shown in the following screen shot:
The first argument of the function “LOOKUP” is the random number in column F
row 175 (shown in blue). The second argument is the table of the cumulative binomial
distribution (with zero added) shown in green. The third argument is the actual value of
the lookup process, the value of x shown in lavender.
Notice that the ranges of the second and third arguments have the symbol “$”
prefixing both the column and row entry. This is necessary so that if the entries are
copied (as in using a table of random numbers), the relevant look up table entries remain
constant, since EXCEL uses relative addressing by default when a formula is copied.
The following steps a), b), and c) show the entire process simulating 25 times the
number of successes when 10 trials are run.
Simulate Binomial with n = 10 and p = .4
a) Generate Distribution
x    Lookup column (0 added)   binomdist(x,10,.4,true)
0    0                         0.006047
1    0.006047                  0.046357
2    0.046357                  0.16729
3    0.16729                   0.382281
4    0.382281                  0.633103
5    0.633103                  0.833761
6    0.833761                  0.945238
7    0.945238                  0.987705
8    0.987705                  0.998322
9    0.998322                  0.999895
10   0.999895                  1
b) Generate Random Numbers (25)
c) Use LOOKUP(x,ProbCOL,ResultCOL) after anchoring results
b) Random number   c) LOOKUP result
0.188754           3
0.105325           2
0.555951           4
0.480493           4
0.131033           2
0.99461            8
0.608283           4
0.207534           3
0.129214           2
0.085797           2
0.280296           3
0.161509           2
0.404118           4
0.220414           3
0.821619           5
0.997045           8
0.394184           4
0.052352           2
0.892615           6
0.335211           3
0.562853           4
0.490515           4
0.363331           3
0.583843           4
0.316501           3
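The lookup rule can also be sketched outside a worksheet. In the Python illustration below, `bisect_right` from the standard library plays the role of the LOOKUP step (this is a substitute technique, not the worksheet function itself), and the name `lookup_x` is our own. The first six random numbers from step b) are used as inputs.

```python
from bisect import bisect_right
from itertools import accumulate
from math import comb

n, p = 10, 0.4
pmf = [comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(n + 1)]
cumulative = list(accumulate(pmf))   # P(<= 0), P(<= 1), ..., P(<= 10)

def lookup_x(u: float) -> int:
    """Return the x whose slice of the cumulative distribution contains u."""
    # bisect_right counts how many cumulative values are <= u; min() guards
    # against the last cumulative value falling just short of 1 from rounding.
    return min(bisect_right(cumulative, u), n)

randoms = [0.188754, 0.105325, 0.555951, 0.480493, 0.131033, 0.99461]
print([lookup_x(u) for u in randoms])   # prints [3, 2, 4, 4, 2, 8]
```

One random number per simulated value of x is needed here, instead of the ten required by the trial-by-trial approach.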
The Poisson Distribution
The Poisson Distribution is another distribution which arises in a great number of
business situations. It usually is applicable in situations where random phenomena
occur at a certain rate over a period of time. For example, it describes the number of
people in line at a checkout counter as well as the number of telephone calls received at a
switching point. It, like the Binomial Distribution, has a canonical definition.
Assume you have an “exposure” variable such as time (it does not have to be
time, but it has to be continuous). Assume that this time period can be divided into small
enough increments, say of width dt, so that in any one of these intervals something
happens or doesn’t happen. For example, consider phone calls during an hour period.
Obviously we can divide the hour into small intervals, maybe of width 1 second, so that
we can either receive one call or no call.
Assume the probability of an event occurring in an interval of width dt is λ dt.
Assume further that the occurrence or non-occurrence of an event in one interval is
independent of the occurrence or non-occurrence of the same event in another interval.
Then if one defines,
μ = λt
where t is the length of the interval, then the probability of x occurrences in the interval
of length t is given by,
$$P(x) = \frac{e^{-\mu} \mu^{x}}{x!}$$
for x = 0, 1, 2, ...
One can show that for the Poisson Distribution with parameter μ,
$$E(x) = \mu$$
and,
$$SD(x) = \sqrt{\mu}$$
One can usually easily recognize situations where the Poisson Distribution is
applicable since they usually involve a rate and an exposure.
For example, suppose an office receives, on average, 15 calls per hour. In a two
hour period the office received 45 calls. Is this a rare event? Here the rate is 15 calls per
hour and the exposure is a two hour period. This implies that μ = 15 * 2 = 30.
As another example, suppose a manufacturing plant has, on average, 24 accidents
per year. In a one month period it has 5 accidents. Is this a rare event? Here the rate is
24 accidents per year and the exposure is one month (1/12 of a year). This implies that
μ = 24 * (1 / 12) = 2.
The exposure does not have to be time. For example consider the situation
where, on average, a driver has 1 accident per 50,000 miles driven. Suppose the person
drives a vehicle for 100,000 miles and has no accidents. Is this a rare event? Here the rate
is 1 accident per 50,000 miles driven and the exposure is 100,000 miles so that
μ = 100,000 * (1 / 50,000) = 2.
The probability distribution for this last case is given below:
x     P(x)       P(<=x)
0     0.135335   0.135335
1     0.270671   0.406006
2     0.270671   0.676676
3     0.180447   0.857123
4     0.090224   0.947347
5     0.036089   0.983436
6     0.012030   0.995466
7     0.003437   0.998903
8     0.000859   0.999763
9     0.000191   0.999954
10    0.000038   0.999992
11    0.000007   0.999999
12    0.000001   1.000000
13    0.000000   1.000000
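The table can be regenerated from the Poisson formula. The following is a minimal Python sketch; the function names `poisson_pmf` and `poisson_cdf` are our own, and the parameter is built from the rate and exposure of the driving example.

```python
from math import exp, factorial

def poisson_pmf(x: int, mu: float) -> float:
    """P(x occurrences) when the expected number of occurrences is mu."""
    return exp(-mu) * mu**x / factorial(x)

def poisson_cdf(x: int, mu: float) -> float:
    """P(at most x occurrences)."""
    return sum(poisson_pmf(k, mu) for k in range(x + 1))

mu = 100_000 * (1 / 50_000)        # exposure times rate = 2
for x in range(14):
    print(f"{x:2d}  {poisson_pmf(x, mu):.6f}  {poisson_cdf(x, mu):.6f}")
```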
Note that the probabilities continue on past 12; it is just that they are so small that
they appear in the table as zero.
Graphically, the distribution is as shown below:
[Figure: Poisson Distribution mu = 2; Prob plotted against x = 0 to 12]
Now consider the situation where we are inspecting parts for defects. Assume we
have a defective rate of 1/1000. If we inspect 1000 parts and observe 3 defects, should
we worry? In this case μ = 1000 * (1 / 1000) = 1.
As in the binomial case, EXCEL can be used to compute the probability of this or
any more extreme event. The function to use has the form,
=Poisson(x, mu, condition),
where, just as in the case of the binomial distribution, a condition of ‘false’ gives us the
probability of x , and a condition of ‘true’ gives us the probability of being less than or
equal to x.
In this case we want the probability of 3 or anything more extreme, that is the
probability of three or more. We can find,
P(x ≤ 2) = POISSON(2, 1, true) = .919699
so that,
P(x ≥ 3) = 1 − P(x ≤ 2) = 1 − .919699 = .080301
which is not rare by the usual standards of .05 or .01.
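The complement step used here, turning "3 or more" into one minus "2 or fewer", can be sketched as follows. The function names are assumptions made for this illustration.

```python
from math import exp, factorial

def poisson_cdf(x: int, mu: float) -> float:
    """P(X <= x) for a Poisson distribution with parameter mu."""
    return sum(exp(-mu) * mu**k / factorial(k) for k in range(x + 1))

def poisson_upper_tail(x: int, mu: float) -> float:
    """P(X >= x), obtained as the complement of P(X <= x - 1)."""
    return 1.0 - poisson_cdf(x - 1, mu)

tail = poisson_upper_tail(3, 1.0)   # 3 or more defects when mu = 1
print(f"P(x >= 3) = {tail:.6f}, rare at .05: {tail < 0.05}")
```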
A picture of the probabilities to be added is shown below:
[Figure: Poisson probabilities plotted against x = 0 to 15]
Now assume we scale up the situation and inspect 5,000 parts. Then the
parameter would change to μ = 5,000 * (1 / 1000) = 5. If we scale up the number of defectives to
mirror that in the first problem, this would correspond to observing 5 * 3 = 15 defectives.
Graphically the Poisson Distribution with μ = 5 would look like:
[Figure: Poisson distribution; probability plotted against x = 0 to 24, with x = 15 marked]
By using the Poisson function in EXCEL, one obtains:
P(x ≥ 15) = 1 − P(x ≤ 14) = 1 − POISSON(14, 5, true) = 1 − .999774 = .000226
which is clearly a relatively rare event which might encourage us to improve quality
control.
Actually, I have been misleading you a bit. For in fact if we inspect 5,000 parts
each of which has a probability of .001 (1 / 1000) of being defective, I am really
describing a Binomial situation.
However if n is large and p is small, so that n * p is moderate, then the Poisson
distribution can be used to approximate the binomial distribution by taking
μ = n * p.
The graph below shows how good this approximation is in this case with n =
5,000 and p = .001 and μ = 5.
[Figure: Poisson Approximation to Binomial; Prob plotted against x = 0 to 24 for the poisson and binomial series]
The Poisson distribution gave the probability of 15 or more defectives as .000226,
while the exact value from the Binomial distribution is .000224, a small error in the sixth
decimal place!
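The comparison can be reproduced with a short calculation. This sketch computes the upper-tail probability under both models using only the standard library; the variable names are our own.

```python
from math import comb, exp, factorial

n, p = 5000, 0.001
mu = n * p                          # Poisson parameter, 5

# P(X >= 15) = 1 - P(X <= 14) under each model
poisson_tail = 1 - sum(exp(-mu) * mu**k / factorial(k) for k in range(15))
binomial_tail = 1 - sum(comb(n, k) * p**k * (1 - p) ** (n - k)
                        for k in range(15))

print(f"Poisson : {poisson_tail:.6f}")
print(f"Binomial: {binomial_tail:.6f}")
```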
Simulating a Poisson distribution is essentially the same as the procedure for the
Binomial Distribution. Below is a screen shot of the procedure:
A step by step illustration of simulating 25 realizations of the Poisson Distribution
with parameter 5 is shown below:
Simulate a Poisson with mu = 5
a) Generate Distribution
x    Lookup column (0 added)   poisson(x,5,true)
0    0                         0.006738
1    0.006738                  0.040428
2    0.040428                  0.124652
3    0.124652                  0.265026
4    0.265026                  0.440493
5    0.440493                  0.615961
6    0.615961                  0.762183
7    0.762183                  0.866628
8    0.866628                  0.931906
9    0.931906                  0.968172
10   0.968172                  0.986305
11   0.986305                  0.994547
12   0.994547                  0.997981
13   0.997981                  0.999302
b) Generate Random Numbers (25)
c) Use LOOKUP(x,ProbCOL,ResultCOL) after anchoring results
b) Random number   c) LOOKUP result
0.188754           3
0.105325           2
0.555951           5
0.480493           5
0.131033           3
0.99461            12
0.608283           5
0.207534           3
0.129214           3
0.085797           2
0.280296           4
0.161509           3
0.404118           4
0.220414           3
0.821619           7
0.997045           12
0.394184           4
0.052352           2
0.892615           8
0.335211           4
0.562853           5
0.490515           5
0.363331           4
0.583843           5
0.316501           4
The Normal Distribution
The normal distribution (the so-called "bell curve") is perhaps the best known
probability distribution since it arises in so many situations.
In business applications it is commonly found to describe the distribution of the rate
of return on investments. However much business data is right skewed. If x is a typical
business statistic, for example the assets of banks or the gross sales of companies, one
usually finds that many small firms have modest values with a few very large values.
This gives rise to a right skewed distribution. However, if one looks at the log (assets of
banks) or the log(gross sales), one finds that the logarithmic value is approximated
closely by the normal distribution. Such variables are said to have the Log-Normal
distribution.
Unlike the Binomial and Poisson distributions, the normal distribution is defined,
theoretically, for continuous variables, that is variables with no gaps between potential
values. Of course, in the real world, one never measures things to very many decimal
places so that we can think of the normal distribution applying to many everyday
variables. For example, if a man says he is six feet tall, he probably does not mean that
he is exactly six feet tall. What is usually meant is that to the nearest inch, the man is six
feet tall. That is his actual height is probably between five foot eleven and one half
inches, and six feet and one half inch. Formally this is called "discretizing" the normal
distribution.
Since the normal distribution describes a continuous variable, it does not assign a
probability to each individual value. Instead it has what is called a probability density
function. A probability density function is a nonnegative function f(x), which has the
property that:
$$P(a \le x \le b) = \int_a^b f(x)\,dx$$
and
$$\int_{-\infty}^{\infty} f(x)\,dx = 1$$
The form of the function f(x) for a normal distribution is:
$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}$$
The normal distribution depends on two parameters μ and σ. We have used these
symbols before to describe the mean and standard deviation of a population. For the
normal distribution:
$$E(x) = \mu$$
and,
$$SD(x) = \sigma$$
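To make the density function concrete, the sketch below codes the normal density by hand, compares it with Python's `statistics.NormalDist`, and uses a simple trapezoid rule to confirm that the total area is numerically 1. The standard normal (mean 0, standard deviation 1) and the integration range of ten standard deviations either side are arbitrary choices for this illustration.

```python
from math import exp, pi, sqrt
from statistics import NormalDist

def normal_pdf(x: float, mu: float, sigma: float) -> float:
    """Density f(x) of a normal distribution with mean mu and SD sigma."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

# Trapezoid-rule estimate of the area under f(x)
mu, sigma, steps = 0.0, 1.0, 20_000
lo, hi = mu - 10 * sigma, mu + 10 * sigma
h = (hi - lo) / steps
interior = sum(normal_pdf(lo + i * h, mu, sigma) for i in range(1, steps))
area = h * (interior + 0.5 * (normal_pdf(lo, mu, sigma) + normal_pdf(hi, mu, sigma)))

print("area under the curve:", area)
print("f(1.3) by hand / library:", normal_pdf(1.3, mu, sigma),
      NormalDist(mu, sigma).pdf(1.3))
```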
Below is shown a picture of two normal distributions which have the same
standard deviations but different means:
[Figure: Normal Distributions; f(x) plotted against x for two curves with the same standard deviation and different means]
Increasing the mean moves the normal curve to the right, while decreasing the
mean moves the curve to the left.
Below are shown normal distributions with the same mean but with different
standard deviations:
[Figure: Normal Distributions; f(x) plotted against x for curves with the same mean and different standard deviations]
Increasing the standard deviation makes the curve more spread out and lower (this
has to occur since the total area under the curve is always 1). Decreasing the standard
deviation makes the curve less spread out and higher.
It is very easy to work with the normal distribution using EXCEL. Like the case
of the Binomial and Poisson distributions, EXCEL provides a function for computing
values for any normal distribution. The form of the function is :
=normdist(x, mean, sd, condition)
As in the case of the Binomial and Poisson distributions, setting condition =
"true" gives the probability of being less than or equal to a particular value. If the
condition is set to "false" then one gets the value of f(x) which unlike the Binomial and
Poisson distributions is not a probability.
EXCEL also has the function:
=normsdist(z)
which gives the probability of being less than or equal to the value z for the special
normal distribution with a mean of 0 and a standard deviation of 1 (called the Standard
Normal Distribution). The Standard Normal Distribution, in the past, was quite
important since it was used to compute probabilities for any normal distribution by just
using a table of the Standard Normal Distribution and the transformations:
z = (x − μ) / σ,
and the inverse relationship,
x = μ + zσ
With the advent of worksheet programs, one no longer needs these tables since
the computer can generate the probabilities of interest directly.
Let us consider an application of the normal distribution. The height of adult
males in the United States is approximately normally distributed with an average height
of 67 inches and a standard deviation of 2.1 inches. What is the probability that a
randomly chosen adult male will be six feet tall or taller?
Graphically, we are interested in finding the probability in the white area in the
graph below:
[Figure: Height Distribution; f(x) plotted against Height, 60.5 to 73.5 inches]
EXCEL can be used to find the area of the shaded area in the above graph by
using the command:
P(x ≤ 72) = normdist(72, 67, 2.1, true) = .991366
The probability of being six feet tall or greater (the area of the white area) is then given
by:
P(x ≥ 72) = 1 − .991366 = .008634
or slightly less than 1% of the population.
Now suppose we wished to find the proportion of adult U.S. males who were
between 5 foot three (63 inches) and 5 foot ten (70 inches) tall. This is the shaded area in
the graph below:
[Figure: Normal Distribution; f(x) plotted against Height, 60.5 to 73.1 inches]
To find this probability we need to realize that:
P(63 ≤ x ≤ 70) = P(x ≤ 70) − P(x ≤ 63)
or in EXCEL terminology:
P(63 ≤ x ≤ 70) = normdist(70, 67, 2.1, true) − normdist(63, 67, 2.1, true)
This gives the answer as:
P(63 ≤ x ≤ 70) = .923436 − .028405 = .895031
Approximately 89.5% of adult U.S. males are between 5 foot three and 5 foot ten in
height.
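Both height calculations can be reproduced with Python's standard library, where `statistics.NormalDist(...).cdf` gives the probability of being less than or equal to a value. This is a sketch for illustration; the variable names are our own, and the last lines show the older route through the standard normal and the z transformation.

```python
from statistics import NormalDist

heights = NormalDist(mu=67, sigma=2.1)   # adult male heights in inches, per the text

p_six_feet_or_more = 1 - heights.cdf(72)         # P(x >= 72)
p_between = heights.cdf(70) - heights.cdf(63)    # P(63 <= x <= 70)

# The same tail via the standard normal and z = (x - mu) / sigma
z = (72 - 67) / 2.1
p_via_z = 1 - NormalDist().cdf(z)

print(round(p_six_feet_or_more, 6), round(p_between, 6), round(p_via_z, 6))
```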
Suppose I was interested in the inverse problem, that is what are the heights of
95% of adult U.S. males?
Actually there are many ways to do this. I could for example find the minimum
height that 95% of the population is greater than. Or I could include the lower 95 %. Or
I could try to get the "middle" 95%.
The middle 95 % would have 2.5% of the observations larger and 2.5% smaller.
This is shown in the picture below:
[Figure: Normal Distribution; f(x) plotted against Height, with the middle 95% and a 2.5% tail on each side marked]
EXCEL has a function which solves the following equation for x0 given any value
of p for the normal distribution:
$$P(x \le x_0) = p$$
The function is:
=NORMINV(p, μ, σ)
Therefore in our case, we find that:
62.884 = norminv(.025, 67, 2.1)
and,
71.116 = norminv(.975, 67, 2.1).
Therefore 95% of the heights fall between the values of 62.884 inches and 71.116
inches.
Recall that the mound rule indicated that approximately 95% of the values fell
within the interval +/- two standard deviations. The values 62.884 and 71.116
correspond to +/- 1.96 standard deviations which is the more precise figure.
EXCEL also has the function:
=normsinv(p)
which is the inverse function for the standard normal distribution (mean of zero and
standard deviation of 1). Directly we could have obtained:
-1.96 = normsinv(.025)
and
1.96 = normsinv(.975).
In a non-computerized statistics course, the above values would have been
obtained from a table in the back of the book and then transformed to get:
67 – 1.96 * 2.1 = 62.884
and
67 + 1.96 * 2.1 = 71.116
In EXCEL we can do this directly using the norminv function.
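Outside a worksheet, the same inverse calculation is available through `inv_cdf` in Python's `statistics.NormalDist`. The sketch below (variable names are our own) shows both the direct route and the table-style route through the standard normal value of about 1.96.

```python
from statistics import NormalDist

heights = NormalDist(mu=67, sigma=2.1)

lower = heights.inv_cdf(0.025)     # 2.5% of heights fall below this value
upper = heights.inv_cdf(0.975)     # 2.5% of heights fall above this value

z = NormalDist().inv_cdf(0.975)    # standard normal quantile, about 1.96
print(round(lower, 3), round(upper, 3), round(z, 2))

# Table-style route: transform the standard normal quantile back to inches
print(round(67 - z * 2.1, 3), round(67 + z * 2.1, 3))
```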
For large values of n, we can use the normal distribution to approximate the
binomial distribution by taking:
$$\mu = np$$
$$\sigma = \sqrt{np(1-p)}$$
This approximation will be valid if np > 5 and n(1 – p) > 5.
For example in the case discussed previously with n = 50 and p = .3, we would
have
$$\mu = 50 \times .3 = 15$$
$$\sigma = \sqrt{50(.3)(.7)} = 3.24037$$
and in this case np = 50(.3) = 15 and n(1-p) = 50 (.7) = 35 so that the approximation
should be good.
The two curves are plotted below:
[Figure: Normal Approximation to Binomial; Prob plotted against x = 0 to 48 for the binomial and normal series]
Notice however that the normal curve is continuous while the binomial
distribution is discrete with nothing between the values of say 17 and 18.
This distinction between discrete and continuous is important. For the binomial
distribution
$$P_{bin}(x \le 10) = P_{bin}(x < 10) + P_{bin}(x = 10)$$
so that a distinction must be made between "less than or equal" and "less than". For the
normal distribution however,
$$P_{nor}(x \le 10) = P_{nor}(x < 10)$$
since there is no probability that x will exactly equal 10 (by exactly we mean to an
infinite number of decimal places).
We can get around this problem of continuous versus discrete by the use of what
is called the "continuity correction". Simply it says that when using the normal
distribution to approximate a discrete distribution (such as the binomial or poisson),
assume that any discrete value, say 10, actually goes half way between the previous
discrete value and the subsequent discrete value. In other words when using the normal
distribution we assume that 10 actually goes from 9.5 to 10.5; 29 would go from 28.5 to
29.5, etc. If we let k be one of the discrete values, then the following five relationships
illustrate the use of the continuity correction in all possible cases:
$$P_{bin}(x = k) \approx P_{nor}(k - .5 \le x \le k + .5)$$
$$P_{bin}(x \le k) \approx P_{nor}(x \le k + .5)$$
$$P_{bin}(x < k) \approx P_{nor}(x \le k - .5)$$
$$P_{bin}(x \ge k) \approx P_{nor}(x \ge k - .5) = 1 - P_{nor}(x \le k - .5)$$
$$P_{bin}(x > k) \approx P_{nor}(x \ge k + .5) = 1 - P_{nor}(x \le k + .5)$$
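The effect of the continuity correction can be seen numerically. The sketch below uses the earlier case of n = 50 and p = .3; the cutoff k = 10 is an arbitrary value chosen for this illustration. It compares the exact binomial probability of k or fewer with the normal approximation evaluated at k + .5 and at k.

```python
from math import comb, sqrt
from statistics import NormalDist

n, p, k = 50, 0.3, 10
exact = sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k + 1))

approx = NormalDist(mu=n * p, sigma=sqrt(n * p * (1 - p)))
with_correction = approx.cdf(k + 0.5)   # P_bin(x <= k) ~ P_nor(x <= k + .5)
without_correction = approx.cdf(k)      # ignores the discreteness

print(f"exact binomial         {exact:.4f}")
print(f"normal, corrected      {with_correction:.4f}")
print(f"normal, not corrected  {without_correction:.4f}")
```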
Fortunately, in EXCEL we can use the binomdist function for most cases and
need to use the normal approximation to the binomial only infrequently.
The Poisson distribution can also be approximated by the normal distribution if
the Poisson parameter μ > 5. By taking the mean μ as given and taking the standard deviation as
the square root of μ, we can approximate the Poisson distribution with the normal
distribution.
The following graph shows the Poisson distribution with mean of 5 and the
normal distribution with a mean of 5 and standard deviation equal to 2.2361 (the square
root of 5):
[Figure: Normal Approximation to Poisson; Prob plotted against x = 0 to 24 for the Poisson and Normal series]
Again if using the normal distribution to compute probabilities for the Poisson
distribution one must correct for continuity in the same way as was done for the binomial
distribution.
It is very easy to simulate data which follows the normal distribution using the
norminv function in EXCEL.
The first step is to generate a set of random numbers using the rand() function as
we have done before. Some sample data is shown below:
a) Use RAND() to simulate 50 random numbers
0.407998  0.327113  0.785353  0.167275  0.016527  0.524512  0.445768  0.098211  0.739599  0.815619
0.678836  0.059317  0.764079  0.228678  0.055253  0.850148  0.517072  0.37086   0.422763  0.165053
0.1953    0.250051  0.962008  0.09676   0.495551  0.021171  0.132306  0.589669  0.349044  0.176948
0.013617  0.945737  0.29648   0.938787  0.034023  0.353435  0.298913  0.840628  0.014406  0.689208
0.927638  0.541888  0.09178   0.58483   0.825083  0.348549  0.309452  0.341382  0.926097  0.406182
Now suppose we were looking at an investment with a mean return of 8% with a
risk (sd) of 2%. Then we could generate 50 normally distributed returns by applying the
function
=norminv(random number, .08, .02)
to the 50 random numbers previously generated to get:
b) Use NORMINV(x,.08,.02) to generate the normal random variables
0.075346  0.071042  0.095808  0.0607    0.037372  0.08123   0.077273  0.054164  0.092842  0.097976
0.089289  0.048789  0.09439   0.065136  0.048082  0.100741  0.080856  0.073408  0.076103  0.060522
0.062829  0.066513  0.11549   0.053995  0.079777  0.039397  0.057689  0.084534  0.072242  0.061459
0.035837  0.112097  0.069309  0.110893  0.043506  0.072479  0.069449  0.099941  0.036279  0.089872
0.109168  0.082104  0.053403  0.084285  0.098698  0.072215  0.070052  0.071826  0.108946  0.075252
Finally, one should check that the simulation is approximately on target.
For this data, the sample mean of the simulated values is .0748 compared to the
theoretical value of .08. The standard deviation of the simulated values is .0210
compared to the theoretical value of .02. I have also plotted a histogram of the
simulated values, which is shown below:
[Figure: Frequency Distribution. Histogram of Simulated Returns (bins .03 through .12, plus More) against Frequency.]
As can be seen it is approximately normally distributed.
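The check on the mean and standard deviation can be reproduced with a few lines of Python. The sketch below uses the 50 simulated returns listed in table b) and the sample standard deviation (dividing by n - 1), which appears to be what the .0210 figure corresponds to. The variable names are illustrative.

```python
from statistics import mean, stdev

# The 50 simulated returns from table b) above.
simulated = [
    0.075346, 0.089289, 0.062829, 0.035837, 0.109168,
    0.071042, 0.095808, 0.0607, 0.037372, 0.08123, 0.077273,
    0.048789, 0.09439, 0.065136, 0.048082, 0.100741, 0.080856,
    0.066513, 0.11549, 0.053995, 0.079777, 0.039397, 0.057689,
    0.112097, 0.069309, 0.110893, 0.043506, 0.072479, 0.069449,
    0.082104, 0.053403, 0.084285, 0.098698, 0.072215, 0.070052,
    0.054164, 0.073408, 0.084534, 0.099941, 0.071826,
    0.092842, 0.076103, 0.072242, 0.036279, 0.108946,
    0.097976, 0.060522, 0.061459, 0.089872, 0.075252,
]

sample_mean = mean(simulated)   # compare with the theoretical .08
sample_sd = stdev(simulated)    # sample sd (n - 1 divisor); compare with .02

print(f"mean = {sample_mean:.4f}, sd = {sample_sd:.4f}")
```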
What if we wanted to simulate correlated investments as we studied earlier in this
module? Specifically, suppose we wanted to simulate 25 years of returns on two
investments, the first having a mean return of .08 and a standard deviation of .02 and the
second having a mean return of .12 with a standard deviation of .05. Suppose also that
the investments are correlated with a correlation coefficient of -.4.
The difficult part is generating the values so they are correlated. Fortunately there
is a theoretical result that says if z1 and z2 are two independent random variables each
with mean 0 and standard deviation 1, and we define two new variables x1 and x2 with the
equations:
$$x_1 = \sqrt{\frac{1+r}{2}}\, z_1 + \sqrt{\frac{1-r}{2}}\, z_2$$

$$x_2 = \sqrt{\frac{1+r}{2}}\, z_1 - \sqrt{\frac{1-r}{2}}\, z_2$$
then x1 and x2 will both still have mean 0 and standard deviation 1, but now the x's will
be correlated with correlation coefficient r.
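A short sketch of why this result holds, using only the stated assumptions (z1 and z2 independent, each with mean 0 and standard deviation 1). The shorthand a and b is introduced here for convenience and is not part of the lecture's notation.

```latex
\begin{aligned}
a &= \sqrt{\tfrac{1+r}{2}}, \qquad b = \sqrt{\tfrac{1-r}{2}}, \qquad
x_1 = a z_1 + b z_2, \qquad x_2 = a z_1 - b z_2 \\
E(x_1) &= a\,E(z_1) + b\,E(z_2) = 0, \qquad E(x_2) = a\,E(z_1) - b\,E(z_2) = 0 \\
\operatorname{Var}(x_1) &= a^2 \operatorname{Var}(z_1) + b^2 \operatorname{Var}(z_2)
  = \tfrac{1+r}{2} + \tfrac{1-r}{2} = 1, \qquad \operatorname{Var}(x_2) = a^2 + b^2 = 1 \\
\operatorname{Cov}(x_1, x_2) &= a^2 \operatorname{Var}(z_1) - b^2 \operatorname{Var}(z_2)
  = \tfrac{1+r}{2} - \tfrac{1-r}{2} = r
\end{aligned}
```

The cross terms in the covariance vanish because z1 and z2 are independent. Since both x's have standard deviation 1, the covariance r is also their correlation coefficient.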
Let us illustrate this procedure in steps. First generate two columns of 25 random
numbers using the =rand() EXCEL function. The data would look like:
a) Start with random numbers in two sets of 25
0.758384   0.311735
0.301347   0.87599
0.741409   0.306248
0.591236   0.70228
0.742038   0.518124
0.764761   0.70556
0.928779   0.086294
0.754592   0.914354
0.05917    0.699389
0.324527   0.91737
0.389161   0.781662
0.159801   0.369674
0.621819   0.359246
0.091531   0.78965
0.140335   0.801505
0.444617   0.556215
0.799008   0.512305
0.266267   0.686824
0.215947   0.298083
0.031036   0.07719
0.014389   0.337121
0.526516   0.597915
0.129849   0.06653
0.459673   0.644828
0.667225   0.024672
Next, use the EXCEL function =normsinv(random number) to generate two
columns of uncorrelated normal random variables with mean zero and standard deviation
1. The data would look something like this:
b) Generate two independent sets of Normal with Mean of 0 and sd of 1
by using Normsinv (x)
z1         z2
0.701114   -0.49094
-0.52053   1.155174
0.647697   -0.50651
0.230726   0.53097
0.649641   0.045444
0.721701   0.54046
1.46676    -1.36394
0.68901    1.368066
-1.56178   0.522646
-0.45508   1.387593
-0.28151   0.77782
-0.99527   -0.33272
0.31026    -0.36047
-1.33138   0.805205
-1.07882   0.847008
-0.13927   0.141381
0.838081   0.030848
-0.62414   0.486868
-0.78595   -0.52992
-1.86578   -1.42423
-2.18652   -0.42033
0.066516   0.247956
-1.1271    -1.50215
-0.10126   0.371394
0.432262   -1.96561
Check correlation using CORREL: -0.02995
Using =correl(z1, z2), we get a correlation of -.02995 compared to the
theoretical value of 0.
Next we implement the formula given above to induce the appropriate correlation,
in this case r = -.4. The formula would look like:
c) Generate two columns x1 and x2 by using the formula
x1=sqrt((1+R)/2)*z1 + SQRT((1-R)/2)*z2
x2=sqrt((1+R)/2)*z1 - SQRT((1-R)/2)*z2
The actual formula contained in the cell for x1 is shown in the following screen
shot:
The actual formula for the cell for x2 is shown in the screen shot below:
After transforming all the pairs, one would get the following result:
d) Example make R = -.4
x1         x2
-0.02673   0.794764
0.681382   -1.25159
-0.06902   0.778538
0.570615   -0.31787
0.393844   0.317802
0.847474   -0.05689
-0.33777   1.944528
1.521993   -0.76722
-0.41814   -1.2927
0.911688   -1.4102
0.496584   -0.80496
-0.8235    -0.26676
-0.13166   0.471531
-0.05555   -1.40291
0.117766   -1.29955
0.042005   -0.19457
0.484845   0.433227
0.065486   -0.7492
-0.87385   0.01288
-2.21353   0.169663
-1.54928   -0.84593
0.243887   -0.17102
-1.87413   0.639447
0.255269   -0.36619
-1.40779   1.881306
Check (correlation of x1 and x2): -0.36222
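The transformation step can also be sketched in Python. The function below applies the two formulas from step c) to paired standard normal values. The names induce_correlation and pearson, the seed, and the sample size of 20,000 are assumptions made for this illustration.

```python
import math
import random
from statistics import NormalDist

def induce_correlation(z1, z2, r):
    """Turn independent standard normals z1, z2 into x1, x2 with correlation r."""
    a = math.sqrt((1 + r) / 2)
    b = math.sqrt((1 - r) / 2)
    x1 = [a * u + b * v for u, v in zip(z1, z2)]
    x2 = [a * u - b * v for u, v in zip(z1, z2)]
    return x1, x2

def pearson(xs, ys):
    """Sample correlation coefficient, the counterpart of =correl(xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# First pair of z values from table b): should give the first row of table d).
first_x1, first_x2 = induce_correlation([0.701114], [-0.49094], -0.4)
print(round(first_x1[0], 5), round(first_x2[0], 5))

# A much larger sample than 25 pairs, to see the correlation settle near r.
rng = random.Random(7)      # hypothetical seed for repeatability
std_normal = NormalDist()   # mean 0, sd 1, like =normsinv
n = 20_000
z1 = [std_normal.inv_cdf(rng.random()) for _ in range(n)]
z2 = [std_normal.inv_cdf(rng.random()) for _ in range(n)]
x1, x2 = induce_correlation(z1, z2, -0.4)
sample_r = pearson(x1, x2)
print(f"sample correlation with n = {n}: {sample_r:.3f}")
```

With only 25 pairs the sample correlation can sit some distance from the target, as the check value of -0.36222 shows. A larger sample should land much closer to -.4.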
Finally we need to adjust the generated values to have the appropriate means and
standard deviations. The first investment is supposed to have a mean of .08 and a
standard deviation of .02, therefore we create the new variable
.08 + .02 * x1
The second investment is supposed to have a mean of .12 and a standard deviation of .05,
so we create the new variable
.12 + .05 * x2
The final results are shown below:
e) Now multiply x1 by .02 and add .08 and multiply x2 by .05 and add .12
Investment 1   Investment 2
0.079465       0.159738
0.093628       0.05742
0.07862        0.158927
0.091412       0.104107
0.087877       0.13589
0.096949       0.117156
0.073245       0.217226
0.11044        0.081639
0.071637       0.055365
0.098234       0.04949
0.089932       0.079752
0.06353        0.106662
0.077367       0.143577
0.078889       0.049854
0.082355       0.055023
0.08084        0.110271
0.089697       0.141661
0.08131        0.08254
0.062523       0.120644
0.035729       0.128483
0.049014       0.077704
0.084878       0.111449
0.042517       0.151972
0.085105       0.10169
0.051844       0.214065

Check results
Average   0.077   0.112
Sd        0.018   0.046
corr=     -0.362
As can be seen the simulation results agree reasonably with the theoretical values.
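The rescaling step is a simple shift and stretch, sketched below on the first five rows of table d). Because each series is multiplied by a positive constant and shifted, the correlation between the two series is unchanged, which is why the check value stays at about -0.362. The helper names to_returns and pearson are illustrative.

```python
import math

def to_returns(xs, mean, sd):
    """Rescale standardized values to a given mean and standard deviation."""
    return [mean + sd * x for x in xs]

def pearson(xs, ys):
    """Sample correlation coefficient, the counterpart of =correl(xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# First five rows of table d) above.
x1 = [-0.02673, 0.681382, -0.06902, 0.570615, 0.393844]
x2 = [0.794764, -1.25159, 0.778538, -0.31787, 0.317802]

returns1 = to_returns(x1, 0.08, 0.02)   # .08 + .02 * x1
returns2 = to_returns(x2, 0.12, 0.05)   # .12 + .05 * x2

print([round(r, 6) for r in returns1])
print([round(r, 6) for r in returns2])

# Correlation before and after rescaling (these five rows only).
print(pearson(x1, x2), pearson(returns1, returns2))
```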
I could now simulate what would happen for a 25 year period into the future if I
invested $10,000 in each investment. I just need to add one to the simulated returns and
cumulate the investment history as shown below:
                      Investment 1             Investment 2
Initial Investment    $10,000                  $10,000
25 Year Growth        Balance    1 + return    Balance    1 + return
                      $10,795    1.079465      $11,597    1.159738
                      $11,805    1.093628      $12,263    1.05742
                      $12,733    1.07862       $14,212    1.158927
                      $13,897    1.091412      $15,692    1.104107
                      $15,119    1.087877      $17,824    1.13589
                      $16,584    1.096949      $19,912    1.117156
                      $17,799    1.073245      $24,238    1.217226
                      $19,765    1.11044       $26,217    1.081639
                      $21,181    1.071637      $27,668    1.055365
                      $23,262    1.098234      $29,038    1.04949
                      $25,353    1.089932      $31,353    1.079752
                      $26,964    1.06353       $34,698    1.106662
                      $29,050    1.077367      $39,679    1.143577
                      $31,342    1.078889      $41,657    1.049854
                      $33,923    1.082355      $43,950    1.055023
                      $36,666    1.08084       $48,796    1.110271
                      $39,954    1.089697      $55,708    1.141661
                      $43,203    1.08131       $60,307    1.08254
                      $45,904    1.062523      $67,582    1.120644
                      $47,544    1.035729      $76,265    1.128483
                      $49,875    1.049014      $82,191    1.077704
                      $54,108    1.084878      $91,352    1.111449
                      $56,409    1.042517      $105,235   1.151972
                      $61,209    1.085105      $115,936   1.10169
                      $64,383    1.051844      $140,754   1.214065
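Cumulating the investment history amounts to multiplying the running balance by one plus each year's return. A minimal sketch follows, using the first three simulated returns of each investment from the table in step e). The function name grow is an assumption for this example.

```python
def grow(initial, returns):
    """Return the year-end balances from compounding the given yearly returns."""
    balances = []
    balance = initial
    for r in returns:
        balance *= 1 + r          # add one to the return, then compound
        balances.append(balance)
    return balances

# First three simulated returns for each investment.
inv1 = grow(10_000, [0.079465, 0.093628, 0.07862])
inv2 = grow(10_000, [0.159738, 0.05742, 0.158927])

print([f"${b:,.0f}" for b in inv1])
print([f"${b:,.0f}" for b in inv2])
```

Passing all 25 returns instead of the first three would give the full 25-year history.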
By simply pressing the F9 key, I would recompute all of the values in the above
simulation to get results such as:
e) Now multiply x1 by .02 and add .08 and multiply x2 by .05 and add .12
Investment 1   Investment 2
0.090549       0.102065
0.101857       0.088143
0.07085        0.098587
0.060797       0.125362
0.039147       0.1414
0.093078       0.122713
0.067094       0.107743
0.101997       0.073149
0.071355       0.166681
0.097502       0.138635
0.06629        0.166275
0.099349       0.070526
0.084416       0.159715
0.094967       0.110902
0.049          0.237066
0.059694       0.190704
0.108832       0.09977
0.109332       0.119577
0.108323       0.110818
0.066578       0.131873
0.075929       0.117236
0.056609       0.041032
0.091946       0.121354
0.056871       0.169959
0.055641       0.237808

Check results
Average   0.079   0.130
Sd        0.021   0.047
corr=     -0.513
This gives another realization of the 25 year investment:
                      Investment 1             Investment 2
Initial Investment    $10,000                  $10,000
25 Year Growth        Balance    1 + return    Balance    1 + return
                      $10,905    1.090549      $11,021    1.102065
                      $12,016    1.101857      $11,992    1.088143
                      $12,868    1.07085       $13,174    1.098587
                      $13,650    1.060797      $14,826    1.125362
                      $14,184    1.039147      $16,922    1.1414
                      $15,505    1.093078      $18,999    1.122713
                      $16,545    1.067094      $21,046    1.107743
                      $18,232    1.101997      $22,585    1.073149
                      $19,533    1.071355      $26,350    1.166681
                      $21,438    1.097502      $30,003    1.138635
                      $22,859    1.06629       $34,992    1.166275
                      $25,130    1.099349      $37,459    1.070526
                      $27,251    1.084416      $43,442    1.159715
                      $29,839    1.094967      $48,260    1.110902
                      $31,301    1.049         $59,701    1.237066
                      $33,170    1.059694      $71,086    1.190704
                      $36,780    1.108832      $78,178    1.09977
                      $40,801    1.109332      $87,527    1.119577
                      $45,221    1.108323      $97,226    1.110818
                      $48,231    1.066578      $110,048   1.131873
                      $51,894    1.075929      $122,949   1.117236
                      $54,831    1.056609      $127,994   1.041032
                      $59,873    1.091946      $143,527   1.121354
                      $63,278    1.056871      $167,920   1.169959
                      $66,799    1.055641      $207,853   1.237808
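Pressing F9 simply redraws every random number and recomputes the whole chain. A rough Python counterpart is to wrap the steps in one function and call it with different random number generators, as sketched below. The function name simulate_once and the seeds are assumptions for illustration; the means, standard deviations, correlation and initial investment are those used in this section.

```python
import math
import random
from statistics import NormalDist

def simulate_once(rng, years=25, r=-0.4, initial=10_000.0):
    """One realization of the two correlated investments, year by year."""
    std_normal = NormalDist()              # mean 0, sd 1, like =normsinv
    a = math.sqrt((1 + r) / 2)
    b = math.sqrt((1 - r) / 2)
    balance1 = balance2 = initial
    path1, path2 = [], []
    for _ in range(years):
        z1 = std_normal.inv_cdf(rng.random())
        z2 = std_normal.inv_cdf(rng.random())
        x1 = a * z1 + b * z2               # correlated standard normals
        x2 = a * z1 - b * z2
        balance1 *= 1 + (0.08 + 0.02 * x1)  # investment 1: mean .08, sd .02
        balance2 *= 1 + (0.12 + 0.05 * x2)  # investment 2: mean .12, sd .05
        path1.append(balance1)
        path2.append(balance2)
    return path1, path2

# Each new generator state acts like one press of F9.
run_a = simulate_once(random.Random(1))
run_b = simulate_once(random.Random(2))
run_a_again = simulate_once(random.Random(1))   # same seed, same realization

for name, (p1, p2) in (("run A", run_a), ("run B", run_b)):
    print(f"{name}: investment 1 ends at ${p1[-1]:,.0f}, investment 2 at ${p2[-1]:,.0f}")
```

Unlike F9, fixing the seed makes a given realization repeatable, which can be convenient when checking results.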