Normal Distributions
What makes a normal distribution normal? This is just another way of saying we
should see something that looks like a bell-shaped continuous curve.
Each of the curves to the right has that characteristic.
If we are talking about heights of men in a certain group, say
the Bambutu pygmy tribe in Africa, we are not surprised that
the average height of an adult male is about 51 inches. Most
adults have a height near that size while a few are
somewhat taller or shorter. By contrast, in the Tutsi tribe,
also called the Watusi, heights average near a spectacular
7 feet tall!
Neither of these populations is normal in height to our way of
thinking, but each tribe’s distribution of heights is normal.
• Mathematically, a curve is normal when it demonstrates symmetry, with scores more concentrated in
the middle than in the tails.
• Normal curves are defined by two parameters: the mean (µ) and the standard deviation (σ).
• The closer a distribution is to normal, the closer its median¹ is to its mean. Both also lie very
close to the point where the normal curve reaches its absolute maximum.
• Many kinds of behavioral data are approximated well by a normal distribution. Many statistical
tests assume a normal distribution. Most of those tests still work well when the distribution is only
approximately normal, that is, when it does not deviate greatly from normality.
We should be able to find a mean (µ) and a standard deviation (σ) so that a normally distributed
population can be modeled by or fitted to
$$f(x) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$
That function has all the characteristics mentioned before and has become the standard equation for
a normal curve. After that, everything we have ever said about a continuous probability
distribution holds true.
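The formula can be typed almost directly into a short program. What follows is only a sketch: the function name `normal_pdf` is made up for illustration, and the values µ = 51 and σ = 4 are assumed purely to have numbers to plug in. It checks two of the characteristics mentioned above, symmetry about the mean and a peak at the mean.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x for mean mu and standard deviation sigma."""
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coeff * math.exp(exponent)

# Assumed illustrative parameters (not a claim about any real population).
mu, sigma = 51.0, 4.0

peak = normal_pdf(mu, mu, sigma)         # height at the mean
left = normal_pdf(mu - 3.0, mu, sigma)   # 3 units below the mean
right = normal_pdf(mu + 3.0, mu, sigma)  # 3 units above the mean

print("height at the mean:", peak)
print("heights 3 units either side:", left, right)
```

The two heights on either side of the mean should agree, and both should be smaller than the height at the mean.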

• The area under a normal curve is one. Hence,
$$\int_{-\infty}^{\infty} f(x)\,dx = \int_{-\infty}^{\infty} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx = 1.$$
• No probability of an event can exceed one, and no probability can be negative. Hence,
$$0 \le P(a \le x \le b) = \int_a^b f(x)\,dx = \int_a^b \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx \le 1,$$
where a ≤ b are both reasonable outcomes in the experiment.
• Since f(x) is continuous, the probability of any specific result, say a height of exactly 6 feet,
doesn’t really come from the function directly. The value x = 6 describes one of infinitely many
points. So P(x = 6) = 0 technically and is infinitesimally small realistically. We should more correctly
ask for the probability of a height between, say, 5.999 feet and 6.001 feet. Then we can use
integration to find that probability.

¹ This is the value that splits the data into two halves, with 50% above and 50% below.
Copyright 2010 - ASU School of Mathematical and Statistical Sciences (Terry Turner)
There’s the problem. Notice that the function above is a scaled translation of $y = e^{-x^2}$. You
should recall from your previous calculus experience that $y = e^{-x^2}$ has no neat antiderivative.
So the integration process must be done “numerically” or with some kind of table of values.
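To make “numerically” concrete, here is a minimal sketch using Simpson’s rule, which is just one of several standard numerical integration methods. The helper names are made up, and the mean of 5.75 feet and standard deviation of 0.25 feet are hypothetical values chosen only so that the 5.999 to 6.001 foot question has numbers to work with.

```python
import math

def normal_pdf(x, mu, sigma):
    """Height of the normal curve at x."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

def simpson(f, a, b, n=1000):
    """Approximate the integral of f from a to b with composite Simpson's rule."""
    if n % 2:          # Simpson's rule needs an even number of subintervals
        n += 1
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

# Hypothetical parameters, in feet (assumed for illustration only).
mu, sigma = 5.75, 0.25
curve = lambda x: normal_pdf(x, mu, sigma)

narrow = simpson(curve, 5.999, 6.001)                      # a sliver around 6 feet
whole = simpson(curve, mu - 10 * sigma, mu + 10 * sigma)   # nearly the entire curve

print("P(5.999 <= x <= 6.001) is about", narrow)
print("area under (almost) the whole curve is about", whole)
```

The sliver around 6 feet has a small but nonzero area, while the area over ten standard deviations either side of the mean should come out very close to one.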
For the latter, numerous tables have been created to do this kind of stuff. However, most of them
assume that the data has been “standardized” in some way.
Standardizing Data (z-scores)
Standardization takes raw data with any mean (µ) and standard deviation (σ) and fits it to a curve with
mean of 0 and standard deviation of 1. Then those neat tables work well.
This is really a simple process. We just calculate the z-score, $z = \frac{x-\mu}{\sigma}$.
Suppose the Bambutu tribe does have an average adult male height of 51
inches and the standard deviation for the male population is 4 inches.
1. Someone with a height of 51 inches (an average joe) would have a z-score of zero.
2. Someone with a height of 55 inches is one standard deviation above average (µ + 1σ).
3. Someone with a height of 47 inches is one standard deviation below average (µ − 1σ).
4. Note that $z = \frac{55-51}{4} = 1$ and $z = \frac{47-51}{4} = -1$.
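The same arithmetic as a tiny sketch in code. The function name `z_score` is made up for illustration; the mean of 51 inches and standard deviation of 4 inches are the ones supposed above.

```python
def z_score(x, mu, sigma):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    if sigma <= 0:
        raise ValueError("the standard deviation must be positive")
    return (x - mu) / sigma

# The three heights from the list above.
for height in (51, 55, 47):
    print(height, "inches -> z =", z_score(height, mu=51, sigma=4))
```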
Then if we wanted to know what portion of the population is within one
standard deviation of the mean, we should do the calculation stream below.
Using the given mean and standard deviation,
$$P(\mu - \sigma \le x \le \mu + \sigma) = P(|x - \mu| \le 1\sigma) = P\left(\left|\frac{x-\mu}{\sigma}\right| \le 1\right) = P(|z| \le 1)$$ ²
Notice that it doesn’t matter what the mean and standard deviation are. All we need to know is the
z-score for the range of values of interest to us. Then we can apply the table below³ and what we
know about the symmetry of the normal curve to extract the probability we need.
Normal Distribution z-Table
Z    0.00    0.01
0.0  0.5000  0.5040
0.1  0.5398  0.5438
0.2  0.5793  0.5832
0.3  0.6179  0.6217
0.4  0.6554  0.6591
0.5  0.6915  0.6950
0.6  0.7257  0.7291
0.7  0.7580  0.7611
0.8  0.7881  0.7910
0.9  0.8159  0.8186
1.0  0.8413  0.8438
1.1  0.8643  0.8665
1.2  0.8849  0.8869
1.3  0.9032  0.9049
1.4  0.9192  0.9207
1.5  0.9332  0.9345
1.6  0.9452  0.9463
1.7  0.9554  0.9564
1.8  0.9641  0.9649
1.9  0.9713  0.9719
2.0  0.9772  0.9778

² Since σ > 0, we can do the division and slip it into the absolute value without a problem.
³ A more complete table is provided at the end of this lesson. You can find tables tailored to many
needs just by googling for normal distribution tables.
Note that from the first red entry (z = 0, area 0.5000) to the second red entry (z = 1, area 0.8413)
we have z-scores in steps of one-hundredth. Since z = 0 is the line of symmetry for our normal
bell-shaped curve, half of the area under the curve is already accounted for. To get the area for
0 ≤ z ≤ 1 we need to subtract the two red values: 0.8413 − 0.5000 = 0.3413.
Since we need the range −1 ≤ z ≤ 1, symmetry allows us to double the result to get about 0.68,
or 68%.⁴


So what is P(|z| ≤ 2)?
1. This is the probability that a data value is within 2 standard deviations of the mean.
2. To get this we need the blue value.
3. Then we calculate the area under the curve from 0 ≤ z ≤ 2 as 0.9772 − 0.5000 = 0.4772.
4. Once again we double the result to get another rule of thumb: P(|z| ≤ 2) = 95.44%.
You should use the complete table to estimate the area within three standard deviations of the mean
yourself.
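If you want to check the two rules of thumb without a table, here is a small sketch. It relies on the standard identity P(|z| ≤ k) = erf(k/√2), where erf is the error function in Python’s math module; the function name `prob_within` is made up for illustration.

```python
import math

def prob_within(k):
    """P(|z| <= k): chance of landing within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))

for k in (1, 2):
    print("within", k, "standard deviation(s):", round(prob_within(k), 4))
```

The result for k = 2 may differ from the table answer in the last digit, because each table entry has already been rounded to four places before doubling. Calling `prob_within(3)` is one way to check your own three-standard-deviation estimate.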
Now what if we wanted P(−.5 ≤ z ≤ 1.5)?
1. The table does it with the green entries.
2. We need the area reading for z = 0.5 and z = 1.5.
3. Hence,
P(−.5 ≤ z ≤ 0) = 0.6915 − .5000 = 0.1915 by symmetry
P(0 ≤ z ≤ 1.5) = 0.9332 − .5000 = 0.4332
P(−.5 ≤ z ≤ 1.5) = 0.1915 + 0.4332 = 0.6247
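The table steps above can be sketched as a lookup. The dictionary below holds only the three table entries this problem needs, and the helper names are made up for illustration. Negative z-scores are handled with the same symmetry argument: the area to the left of −z is one minus the area to the left of z.

```python
# Areas to the left of z, copied from the z-table for the entries used here.
Z_TABLE = {0.0: 0.5000, 0.5: 0.6915, 1.5: 0.9332}

def table_area_left(z):
    """Area to the left of z, using symmetry when z is negative."""
    if z >= 0:
        return Z_TABLE[z]
    return 1.0 - Z_TABLE[-z]

def table_area_between(z_low, z_high):
    """Area under the standard normal curve between two tabulated z-scores."""
    return table_area_left(z_high) - table_area_left(z_low)

area = table_area_between(-0.5, 1.5)
print(round(area, 4))
```

Subtracting the two left-hand areas gives the same 0.6247 as adding the two pieces on either side of z = 0.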
Interpretations
Now let’s see what all that means.
The problem began with a statement that “the Bambutu tribe does have an average adult male height
of 51 inches and the standard deviation for the male population is 4 inches.”
So what is the height range for some male who falls into the −.5 ≤ z ≤ 1.5 band?
It is just a matter of unwrapping the z-score:
$-0.5 \le z \le 1.5$
$-0.5 \le \frac{x-\mu}{\sigma} \le 1.5$
$-0.5 \le \frac{x-51}{4} \le 1.5$
$-0.5(4) \le x - 51 \le 1.5(4)$
$-0.5(4) + 51 \le x \le 1.5(4) + 51$
$49 \le x \le 57$
Anyone falling into that range of z-scores is in the height range from 49 inches to 57 inches. There
is about a 62.5% chance that some male falls into that height range in the normal population.

⁴ This is a rule of thumb: For a normal curve about 68% of all data points fall into the band
−1 ≤ z ≤ 1.
Let’s get back to the Tutsi tribe. Let’s suppose the average height is 70 inches and the standard
deviation is 3 inches.
What is the z-score for someone with a height of 57 inches? $z = \frac{57-70}{3} \approx -4.3$!
He is more than 4 standard deviations below the average height. He would rate as a midget in his
hometown and a giant among the Bambutu! What is the probability that someone from the Tutsi
would be 57 inches or shorter?
In my table, the value at z = −4.3 is not given. We need a better table or a calculator or computer.
Your TI fits the bill nicely.
Look for the “VARS” key and then look at the calculator body. You should see “DISTR.” Press 2ND and
then VARS to get to it. In there you’ll find a number of distributions. Find option 2, “normalcdf.”⁵
• The format for the option is normalcdf(lower, upper, μ, σ).
• It is the calculation
$$P(\text{lower} \le x \le \text{upper}) = \int_{\text{lower}}^{\text{upper}} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\,dx.$$
Select, then complete, the command to see normalcdf(-E99,57,70,3).⁶
Try this command: normalcdf(0,57,70,3). Notice that since this population must have some
height, we really didn’t need to go out to the infinitely long tail to the left.
Go back now and calculate the probability that a Bambutu tribesman is in the height range 49 to 57
inches with a mean of 51 and standard deviation of 4 inches.
The command normalcdf(49,57,51,4) produces a probability of 0.6246552391...
While you are at it, try this command: normalcdf(-.5,1.5). You should get 0.6246552391...
Without the mean and standard deviation, normalcdf assumes that they are 0 and 1, respectively.
If you needed a number of values, try entering the command normalcdf(0,x,51,4) as a function
in the “Y=” area. Then you can use trace or TABLE to get your values. Notice that the curve is
asymptotic to y = 1. Don’t forget that the normal distribution is continuous with a domain of all real
numbers even though its application may have a truncated domain.
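If you would rather work on a computer than on the TI, here is a rough Python stand-in. It is a sketch, not the calculator’s own algorithm: the function is named `normalcdf` only to mirror the calculator’s argument order, it is built on `NormalDist` from Python’s standard `statistics` module, and `FAR_LEFT` is a made-up name playing the role of -E99.

```python
from statistics import NormalDist

def normalcdf(lower, upper, mu=0.0, sigma=1.0):
    """P(lower <= x <= upper) for a normal curve with mean mu and standard deviation sigma."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(upper) - dist.cdf(lower)

FAR_LEFT = -1e99  # stands in for negative infinity, like -E99 on the calculator

print(normalcdf(FAR_LEFT, 57, 70, 3))  # Tutsi example: 57 inches or shorter
print(normalcdf(49, 57, 51, 4))        # Bambutu example: 49 to 57 inches
print(normalcdf(-0.5, 1.5))            # the same band written as z-scores
```

The last two results should match each other, and they should agree with the calculator values quoted above to about six decimal places; the later digits can differ a little because the two tools compute the area differently.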
A Few More Examples
Example:
Find the probability that anyone is within 5 standard deviations of the average height for
the entire world.
Note that neither μ nor σ is given. However, we don’t need them because we have been given a
z-score problem. We use normalcdf(-5,5) to get 0.9999994258 . . .

⁵ Note: The only good use for normalpdf is to draw the normal curve.
⁶ Here’s the rub about the TI: It doesn’t have a symbolic infinity. However, we can fool it by using
-E99 or E99 (by using the EE key) since it is so far below (or above) zero that it does the job.
Example:
Find the probability that someone is more than 5 standard deviations above the average
height for the entire world.
Note that neither μ nor σ is given. We still don’t need them because it is a z-score problem.
We use normalcdf(5,E99) to get 2.871049995 × 10⁻⁷.
Example:
Find the probability that someone is at most 5 standard deviations above the average
height for the entire world.
We use normalcdf(-E99,5) or 1 - normalcdf(5,E99) to get 0.9999997129 . . .
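For tail probabilities this small, a computer version is usually written with the complementary error function so that the tiny tail is computed directly rather than by subtracting from one. The sketch below assumes the standard identity P(Z > z) = ½·erfc(z/√2); the name `upper_tail` is made up for illustration.

```python
import math

def upper_tail(z):
    """P(Z > z) for a standard normal variable."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

above_5 = upper_tail(5.0)               # more than 5 standard deviations above
at_most_5 = 1.0 - above_5               # at most 5 standard deviations above
within_5 = 1.0 - 2.0 * upper_tail(5.0)  # within 5 standard deviations (both tails removed)

print("more than 5 above:", above_5)
print("at most 5 above:  ", at_most_5)
print("within 5:         ", within_5)
```

Expect agreement with the calculator results in the leading digits only; out in the far tail, different tools tend to disagree slightly in the later digits.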
Example:
One hundred students took a standardized test. The average was 77.5 and the
standard deviation was 6.5. How many students were within one standard deviation of
the mean?
Try all of these:
normalcdf(77.5-6.5,77.5+6.5,77.5,6.5)
normalcdf(-1,1)
normalcdf(0,1)*2
Then recall that this is a standardized test, intended to create a normal distribution. We did that
calculation at the beginning of the lesson.
Hence, we expected to see at least 68 students and might see a 69th. We don’t expect to see 68.2
since that constitutes a gory mess in the testing center!
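Here is a sketch of the same three calculations on a computer, again using `NormalDist` from Python’s standard `statistics` module as a stand-in for normalcdf. The variable names are made up for illustration.

```python
from statistics import NormalDist

n_students = 100
mean, sd = 77.5, 6.5

# 1. Work directly with the raw scores.
scores = NormalDist(mean, sd)
p_direct = scores.cdf(mean + sd) - scores.cdf(mean - sd)

# 2. Work with z-scores from -1 to 1.
std = NormalDist()  # mean 0, standard deviation 1
p_z = std.cdf(1) - std.cdf(-1)

# 3. Use symmetry: double the area from 0 to 1.
p_sym = 2 * (std.cdf(1) - std.cdf(0))

expected = n_students * p_direct
print(p_direct, p_z, p_sym)
print("expected number of students:", expected)
```

All three probabilities should agree, and the expected count lands between 68 and 69 students.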
Normal versus Binomial
Suppose you wanted to find the probability that between 0 and 10 flights might arrive late out of a
flight schedule of 5,000 flights. That takes us back to the binomial probability model. We need a
probability for any particular flight to be late. Let’s assume p = 0.01 (1%) for a flight to be late under
usual conditions.
Using my TI, I calculated binomcdf(5000,0.01,0,10) = 5.48478 × 10⁻¹². That’s pretty
small.
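For readers without a TI, here is a minimal sketch of the same exact binomial sum. The helper name `binom_cdf_range` is made up; it builds each probability P(X = k + 1) from P(X = k) using the ratio between neighboring binomial terms, which avoids computing huge factorials. It assumes 0 < p < 1 and that n is small enough for (1 − p)ⁿ not to underflow to zero.

```python
def binom_cdf_range(n, p, lo, hi):
    """P(lo <= X <= hi) for a binomial count X with n trials and success probability p."""
    q = 1.0 - p
    prob = q ** n   # P(X = 0)
    total = 0.0
    for k in range(hi + 1):
        if k >= lo:
            total += prob
        # Step from P(X = k) to P(X = k + 1).
        prob *= (n - k) / (k + 1) * (p / q)
    return total

late_0_to_10 = binom_cdf_range(5000, 0.01, 0, 10)
print(late_0_to_10)
```

The printed value should be close to the calculator result quoted above.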
Back in the old days calculating these probabilities was extremely tedious. So for large numbers of
trials we relied on the visible fact that the binomial probability model begins to look like a
stair-stepped version of the normal model. Look at the graph (obtained through Excel®) of this
situation. If that doesn’t look “normal,” it is difficult to imagine what would. The red curve is a
normal distribution with appropriate mean and standard deviation.
[Figure: histogram of the binomial probabilities for this situation with a normal curve overlaid in red. Vertical axis ticks run from 0 to 0.06; horizontal axis labels run from 1 to 79.]
Approximate binomcdf(5000,0.01,0,10) using the facts that for the binomial probability model
$\mu = np$ and $\sigma = \sqrt{npq} = \sqrt{np(1-p)}$, together with the normalcdf(lower, upper, μ, σ)
feature of your TI.⁷
For practice, convert the values to z-scores for your range (0 to 10):
$$z = \frac{x - np}{\sqrt{np(1-p)}} = \frac{0 - 50}{\sqrt{49.5}} = -7.10669 \quad\text{and}\quad z = \frac{x - np}{\sqrt{np(1-p)}} = \frac{10 - 50}{\sqrt{49.5}} = -5.68535$$
Then, the correct command is normalcdf(-7.10669,-5.68535).
Using it, I got a result of 6.54518 × 10⁻⁹. While both results are incredibly small, the discrepancy is
incredibly large (in relative terms one is 1,000 times greater than the other).
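The z-score conversion and the tail approximation can be sketched as follows. The helper `phi` (area to the left of z) is a made-up name; it uses the complementary error function because the areas involved are so small. The exact value is simply copied from the binomcdf result quoted earlier rather than recomputed.

```python
import math

def phi(z):
    """Area under the standard normal curve to the left of z."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

n, p = 5000, 0.01
mu = n * p                          # mean of the binomial model
sigma = math.sqrt(n * p * (1 - p))  # standard deviation of the binomial model

z_low = (0 - mu) / sigma
z_high = (10 - mu) / sigma
approx = phi(z_high) - phi(z_low)

exact_from_text = 5.48478e-12       # binomcdf result quoted in the text
ratio = approx / exact_from_text

print("z-scores:", z_low, z_high)
print("normal approximation:", approx)
print("approximation / exact:", ratio)
```

The approximation should agree with the calculator’s figure to within a percent or so (far-tail values vary a little between tools), and the ratio comes out above 1,000.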
So the question is: when do these seemingly similar (in shape) distributions begin to approximate
each other? Our rules of thumb for knowing when the normal approximation to the binomial is valid
are as follows:
1. n must be at least 30, AND
2. np must be at least 10, AND
3. nq = n(1 − p) must be at least 10.
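The three rules of thumb translate into a one-line check. This is only a sketch with a made-up function name, and passing the check does not guarantee a good approximation everywhere on the curve.

```python
def normal_approx_ok(n, p):
    """True when all three rules of thumb hold: n >= 30, np >= 10 and nq >= 10."""
    q = 1.0 - p
    return n >= 30 and n * p >= 10 and n * q >= 10

print(normal_approx_ok(5000, 0.01))  # the flight schedule example
print(normal_approx_ok(100, 0.05))   # np is only about 5
print(normal_approx_ok(20, 0.5))     # too few trials
```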
We met these criteria, but we did choose a bad place for approximating. The results would be more
valid for a range further away from the tails of the graph because the values we chose to include in
the binomial distribution have a minuscule impact on the total. Let’s try a little more middle-of-the-road
calculation.
Example:
Suppose you wanted to approximate the probability that between 50 and 150 flights
might arrive late out of a flight schedule of 5,000 flights. Let’s assume p = 0.01 (1%) for
a flight to be late under usual conditions.
Try these commands:
normalcdf(50,150,50,√49.5) = 0.49999999995
binomcdf(5000,0.01,0,150) − binomcdf(5000,0.01,0,49) = 0.519092.⁸
The two results are not far apart. So the normal curve does approximate the binomial curve. There is
one other feature we can use. We can apply a “continuity correction” where we adjust the range
interval to handle how the binomial histogram works. Generally, we would adjust one-half unit at each
end of the range. There is a little “Kentucky windage” here.
Try this calculation, which is stretched out at both ends:
normalcdf(49.5,150.5,50,√49.5) = 0.528328.
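Here is a sketch that puts the exact binomial answer next to the normal approximation with and without the half-unit stretch. The helper `binom_cdf_range` is a made-up name that sums binomial probabilities term by term, and `NormalDist` from Python’s standard `statistics` module stands in for normalcdf.

```python
import math
from statistics import NormalDist

def binom_cdf_range(n, p, lo, hi):
    """P(lo <= X <= hi) for a binomial count X with n trials and success probability p."""
    q = 1.0 - p
    prob = q ** n   # P(X = 0)
    total = 0.0
    for k in range(hi + 1):
        if k >= lo:
            total += prob
        prob *= (n - k) / (k + 1) * (p / q)  # step to P(X = k + 1)
    return total

n, p = 5000, 0.01
model = NormalDist(n * p, math.sqrt(n * p * (1 - p)))

exact = binom_cdf_range(n, p, 50, 150)
plain = model.cdf(150) - model.cdf(50)         # no continuity correction
corrected = model.cdf(150.5) - model.cdf(49.5) # half a unit added at each end

print("exact binomial:       ", exact)
print("normal, no correction:", plain)
print("normal, corrected:    ", corrected)
```

With these numbers the corrected value lands closer to the exact binomial answer than the uncorrected one does.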
Looking at the graph, notice that the histogram is inside the normal curve to the right. We try this as a
“better” approximation: normalcdf(49.5,149.5,50,√49.5) = 0.528328 also.
The result is a little better. Since we were approximating in any case, we should be happy enough
with either of them. These days we aren’t likely to do the approximation anyway. With calculators and
computers we can usually get the correct result directly.

⁷ These facts were developed in your text book.
⁸ Notice that since this is discrete, we discarded the {0,1,2,...,49} set of probabilities.
In this course you have studied only two regularly used distributions, the normal and the binomial.
Almost every other one was some arbitrarily created example just to get you to work with the
probability modeling process. However, the process of working with any distribution is the same.
Being solidly founded in these two will take you a long way in a statistics course.
If you have access to a spread-sheeting program, you should find that every useful command the TI
can create is available there, but you can get a neat print-out to work with. You should learn about
this if you plan to make use of statistical or probabilistic processes in your business life.
Normal Distribution Table (0 ≤ z ≤ 3.09)
Z    0.00    0.01    0.02    0.03    0.04    0.05    0.06    0.07    0.08    0.09
0.0  0.5000  0.5040  0.5080  0.5120  0.5160  0.5199  0.5239  0.5279  0.5319  0.5359
0.1  0.5398  0.5438  0.5478  0.5517  0.5557  0.5596  0.5636  0.5675  0.5714  0.5753
0.2  0.5793  0.5832  0.5871  0.5910  0.5948  0.5987  0.6026  0.6064  0.6103  0.6141
0.3  0.6179  0.6217  0.6255  0.6293  0.6331  0.6368  0.6406  0.6443  0.6480  0.6517
0.4  0.6554  0.6591  0.6628  0.6664  0.6700  0.6736  0.6772  0.6808  0.6844  0.6879
0.5  0.6915  0.6950  0.6985  0.7019  0.7054  0.7088  0.7123  0.7157  0.7190  0.7224
0.6  0.7257  0.7291  0.7324  0.7357  0.7389  0.7422  0.7454  0.7486  0.7517  0.7549
0.7  0.7580  0.7611  0.7642  0.7673  0.7704  0.7734  0.7764  0.7794  0.7823  0.7852
0.8  0.7881  0.7910  0.7939  0.7967  0.7995  0.8023  0.8051  0.8078  0.8106  0.8133
0.9  0.8159  0.8186  0.8212  0.8238  0.8264  0.8289  0.8315  0.8340  0.8365  0.8389
1.0  0.8413  0.8438  0.8461  0.8485  0.8508  0.8531  0.8554  0.8577  0.8599  0.8621
1.1  0.8643  0.8665  0.8686  0.8708  0.8729  0.8749  0.8770  0.8790  0.8810  0.8830
1.2  0.8849  0.8869  0.8888  0.8907  0.8925  0.8944  0.8962  0.8980  0.8997  0.9015
1.3  0.9032  0.9049  0.9066  0.9082  0.9099  0.9115  0.9131  0.9147  0.9162  0.9177
1.4  0.9192  0.9207  0.9222  0.9236  0.9251  0.9265  0.9279  0.9292  0.9306  0.9319
1.5  0.9332  0.9345  0.9357  0.9370  0.9382  0.9394  0.9406  0.9418  0.9429  0.9441
1.6  0.9452  0.9463  0.9474  0.9484  0.9495  0.9505  0.9515  0.9525  0.9535  0.9545
1.7  0.9554  0.9564  0.9573  0.9582  0.9591  0.9599  0.9608  0.9616  0.9625  0.9633
1.8  0.9641  0.9649  0.9656  0.9664  0.9671  0.9678  0.9686  0.9693  0.9699  0.9706
1.9  0.9713  0.9719  0.9726  0.9732  0.9738  0.9744  0.9750  0.9756  0.9761  0.9767
2.0  0.9772  0.9778  0.9783  0.9788  0.9793  0.9798  0.9803  0.9808  0.9812  0.9817
2.1  0.9821  0.9826  0.9830  0.9834  0.9838  0.9842  0.9846  0.9850  0.9854  0.9857
2.2  0.9861  0.9864  0.9868  0.9871  0.9875  0.9878  0.9881  0.9884  0.9887  0.9890
2.3  0.9893  0.9896  0.9898  0.9901  0.9904  0.9906  0.9909  0.9911  0.9913  0.9916
2.4  0.9918  0.9920  0.9922  0.9925  0.9927  0.9929  0.9931  0.9932  0.9934  0.9936
2.5  0.9938  0.9940  0.9941  0.9943  0.9945  0.9946  0.9948  0.9949  0.9951  0.9952
2.6  0.9953  0.9955  0.9956  0.9957  0.9959  0.9960  0.9961  0.9962  0.9963  0.9964
2.7  0.9965  0.9966  0.9967  0.9968  0.9969  0.9970  0.9971  0.9972  0.9973  0.9974
2.8  0.9974  0.9975  0.9976  0.9977  0.9977  0.9978  0.9979  0.9979  0.9980  0.9981
2.9  0.9981  0.9982  0.9982  0.9983  0.9984  0.9984  0.9985  0.9985  0.9986  0.9986
3.0  0.9987  0.9987  0.9987  0.9988  0.9988  0.9989  0.9989  0.9989  0.9990  0.9990