S03B Normal Distributions - Dixie State University :: Business

William Christensen, Ph.D.
The Standard Normal Distribution
[Figure: density curve of the Standard Normal Distribution, centered at 0]
The Density Curve (or probability density function) shown above gives us a good
picture of a Standard Normal Distribution.
Although all normal distributions are bell-shaped, a “standard” normal
distribution has a mean of 0 and a standard deviation of 1
Also, the total area under the curve is exactly 1
Remember: we previously learned that the probabilities in any
probability distribution must add up to exactly 1
Hopefully you get the picture and realize there is an important relationship
between the density curve shown above (with a total area under the curve of 1)
and probability distributions, where the sum of all individual probabilities is also 1
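For readers who want to see the "total area is 1" idea in numbers, here is a minimal Python sketch. It assumes Python 3.8 or later, where the standard library provides statistics.NormalDist; the helper name area_under_curve and the choice of integrating from -8 to +8 are illustrative assumptions, not part of the slides.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def area_under_curve(lo: float, hi: float, steps: int = 16_000) -> float:
    """Approximate the area under the density curve with the trapezoid rule."""
    width = (hi - lo) / steps
    heights = [std_normal.pdf(lo + i * width) for i in range(steps + 1)]
    # Trapezoid rule: full weight for interior points, half weight for the two ends.
    return width * (sum(heights) - 0.5 * (heights[0] + heights[-1]))


# Almost all of the area lies within 8 standard deviations of the mean.
total_area = area_under_curve(-8.0, 8.0)
print(f"total area = {total_area:.6f}")
```

The printed value should be 1 to within rounding, which is the same "adds up to 1" property we saw for probability distributions.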
The Standard Normal Distribution





In the Standard Normal Distribution we integrate the
idea of “area-under-the-curve” with what we learned
about probability
This concept is at the heart of statistics and it is
critical that you understand it, so let’s review a little
and take it step-by-step
First, know and remember that a normal distribution
has a classic bell-shape and is perfectly
symmetrical, with the mean right in the middle
Remember when we talked about the concept of
“unusual” and defined it as any value or observation
more than 2 standard deviations from the mean?
The following slide shows us the general
relationship between “area under the curve” and
probability. Let’s break it down
The Empirical Rule
For Normal Distributions
68% of data are within 1 standard deviation of the mean
95% within 2 standard deviations
99.7% within 3 standard deviations
[Figure: bell curve with the horizontal axis marked x - 3s, x - 2s, x - s, x, x + s, x + 2s, x + 3s. Areas by section, left to right: 0.1%, 2.4%, 13.5%, 34%, 34%, 13.5%, 2.4%, 0.1%]
Area-under-the-curve and Probability

OK, the previous slide shows us some good stuff





Notice that all the sections add up to 100%, or 1
Let’s consider some other examples from the previous slide: It shows
us that the area-under-the-curve or PROBABILITY of an x value being
between the mean and 1 standard deviation above the mean is 0.34
or 34%
It also shows us that the area-under-the-curve or PROBABILITY of an
x value being within plus or minus 1 standard deviation of the
mean is 0.68 or 68% (34% + 34% = 68%)
It shows us that the area-under-the-curve or PROBABILITY of an x
value being more than 2 standard deviations above or below the mean
is how much? How about 2.4% + 0.1% on one side plus 2.4% +
0.1% on the other side, for a total of 0.05 or 5%. Etc., etc., etc.
You should become very familiar with this way of thinking and of
finding P(x) by looking at the area-under-the-curve
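The percentages on the Empirical Rule slide are rounded. As a rough check, the sketch below computes the areas directly. It assumes Python 3.8 or later (for statistics.NormalDist); the function name area_within is made up for this illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)


def area_within(k: float) -> float:
    """Area under the curve between k std. deviations below and above the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)


for k in (1, 2, 3):
    print(f"within {k} std. deviation(s): {area_within(k):.4f}")

# Area more than 2 std. deviations from the mean, both tails combined.
area_beyond_2 = 1.0 - area_within(2)
print(f"more than 2 std. deviations away: {area_beyond_2:.4f}")
```

The results come out close to 0.68, 0.95, and 0.997, and the two-tail area beyond 2 standard deviations is a little under 0.05, which is why the slide rounds it to 5%.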
Standard Normal Distributions
and area-under-the-curve




A Standard Normal Distribution allows us to use z-scores to do the same thing we did in the previous slide
Remember what a z-score is? It represents the number of std. deviations an x value is from the mean
Well, if µ = 0 and σ = 1 (a Standard Normal Distribution) then x values and z-scores are the same number
and we get a density curve that looks like this
Again, we can read this graph just like the previous slide
We see the area-under-the-curve or PROBABILITY of an x value between the mean and z = 1 is 0.34 or 34%
We also see the area-under-the-curve or PROBABILITY of an x value with a z-score greater than 2 is 0.025 or
2.5% (2.4% + 0.1%). Etc., etc., etc.
This is great, but what if we want to know the PROBABILITY of an x value falling between the mean and
z = 0.5 (0.5 std. deviations above the mean)? We need some way to find all these in-between areas or
probabilities
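[Figure: standard normal density curve with the horizontal axis marked in z-scores -3, -2, -1, 0, 1, 2, 3. Areas by section, left to right: 0.1%, 2.4%, 13.5%, 34%, 34%, 13.5%, 2.4%, 0.1%]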
In a Standard Normal Distribution, x values are the same as z-scores because an x of 1 is 1 standard
deviation from the mean (0). A negative z-score means you are below the mean and a
positive z-score means you are above the mean.
Using z-scores to find Probability
or area-under-the-curve

The Excel function NORMSDIST
allows us to find the area-under-the-curve from the extreme left
side of a standard normal distribution out to
any z-score
[Figure: standard normal curve shaded from the extreme left out to z]
=NORMSDIST(z), where z is the z-score
for any x value, provides the area-under-the-curve or PROBABILITY from the
extreme left out to the z-score
Note: Since the total area-under-the-curve = 1, and the distribution is symmetrical (mirror image
on each side), then it should make sense that the total area-under-the-curve below the mean =
0.5, and the total area-under-the-curve (or probability) above the mean also = 0.5
Using z-scores to find Probability
or area-under-the-curve

Example: try using Excel
to find the area-under-the-curve from the
extreme left to a z-score
of 1
[Figure: standard normal curve shaded from the extreme left out to z=1]
Excel gave us this area or probability: 0.841345.
This means the probability of an x
value falling anywhere from below the
mean up to z=1 (one standard
deviation above the mean) is about
0.841 or 84.1%. Notice that since the
entire left half of the distribution has
an area = 0.5, we can calculate
the area from the mean (0) to
z=+1 as 0.841345 – 0.5000 =
0.341345.
Using z-scores to find Probability
or area-under-the-curve

Example: With the same
information as the
previous slide, can you
calculate the area-under-the-curve
from a z-score
of 1 to the extreme right?
[Figure: standard normal curve shaded from z=1 to the extreme right]
Since we know the total area-under-the-curve
equals 1, and the area from
the extreme left over to z=1 is
0.841345, then the remaining area
must be the difference between 1 and
0.841345, or in other words 1 –
0.841345 = 0.158655. This means
there is a probability of about 0.159 or
15.9% of finding an x value with a z >
+1 (more than 1 std. deviation above
the mean). Cool, eh!
Using z-scores to find Probability
or area-under-the-curve

Example: This time, let’s use
Excel to find the area-under-the-curve
from the extreme
left to a z-score of -0.5
(remember a negative z-score
means we are below the mean)
[Figure: standard normal curve shaded from the extreme left out to z = -0.5]
IMPORTANT: Although z-scores
can be negative, area or
probability is NEVER negative.
Here Excel gave us an area or
probability of +0.308538. This
means the probability of finding an
x value with a z-score less than
-0.5 (below the value which is 0.5
std. deviations below the mean) is
about 0.3085 or 30.85%.
Using z-scores to find Probability
or area-under-the-curve

Example: With the same
information, let’s find the
area-under-the-curve from
the extreme right down to the
z-score of -0.5
[Figure: standard normal curve shaded from z = -0.5 to the extreme right]
Since the total area-under-the-curve
must be equal to 1, the area to the
right of our z=-0.50 must be the
difference between 1 and 0.308538,
or 1 – 0.308538 = 0.691462. This
means the probability of finding an x
value with a z-score greater than -0.5
(greater than a value which is ½ std.
deviation below the mean) is about
0.6915 or 69.15%.
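The four examples above can also be reproduced outside Excel. This is only a sketch: it assumes Python 3.8 or later and uses statistics.NormalDist().cdf as a stand-in for =NORMSDIST(z); the helper names area_left and area_right are illustrative, not Excel or slide terminology.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1


def area_left(z: float) -> float:
    """Cumulative area from the extreme left out to z (like =NORMSDIST(z))."""
    return std_normal.cdf(z)


def area_right(z: float) -> float:
    """Remaining area from z to the extreme right: total area (1) minus the left area."""
    return 1.0 - std_normal.cdf(z)


for z in (1.0, -0.5):
    print(f"z = {z:+.1f}: left = {area_left(z):.6f}, right = {area_right(z):.6f}")

# Area between the mean (z = 0) and z = +1: subtract the left half (0.5).
area_mean_to_1 = area_left(1.0) - 0.5
print(f"mean to z=+1: {area_mean_to_1:.6f}")
```

The left and right areas for each z-score always add up to 1, and none of them is ever negative, even when z is.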
Normal Distributions
There are many distributions that are normally distributed, but
that are not “standard”. That is, they are bell-shaped but do not
have a mean of 0 or a standard deviation of 1.
Here we see the normally distributed heights of men and women.
Women: µ = 63.6 inches, σ = 2.5 inches
Men: µ = 69.0 inches, σ = 2.8 inches
[Figure: two bell curves over a Height (inches) axis, centered at 63.6 (women) and 69.0 (men)]
Normal Distributions
• We can convert any normal distribution to a standard normal distribution (and thus
be able to find probabilities, or area-under-the-curve)
• We do this by changing x values to z-scores
• For example: we see the mean height of men is 69.0 inches and the std.
deviation of men’s heights is 2.8 inches. What is the z-score for the mean (x =
69.0)? How many std. deviations is 69.0 from the mean? Intuition alone should
tell you that 69.0 is 0 std. deviations from itself (69.0)
• We can also use the formula we learned: z = (x - µ) / σ = (69 – 69) / 2.8 = 0
• We can also use the Excel function
=STANDARDIZE(x,mean,std.dev.) to calculate
the z-score for any x value where you know the
mean and std. deviation of the distribution. In this
case =STANDARDIZE(69.0,69.0,2.8) = 0
[Figure: height distributions over a Height (inches) axis. Women: µ = 63.6 inches, σ = 2.5 inches. Men: µ = 69.0 inches, σ = 2.8 inches]
Normal Distributions
• Example: Try this. Can you calculate the z-score for a woman’s height of 6 feet
(that’s 12 x 6 = 72 inches)?
• Using the formula: z = (x - µ) / σ = (72 – 63.6) / 2.5 = 3.36
• Using Excel: =STANDARDIZE(72,63.6,2.5) = 3.36
• This means that 72 inches is 3.36 standard deviations above the mean height
• So now the big question: what is the probability of a woman being 6 feet (72
inches) tall or taller? Or, another way of asking this is: what percent of all women
are 6 feet tall or taller?
• We can solve this using Excel in two different ways:
• First, we could use the =NORMSDIST(z) function we just
learned and simply plug in the z=3.36 we just calculated:
=NORMSDIST(3.36) = 0.99961. But remember, this is the
probability or area from the extreme left up to z=3.36. We
actually want to know the area or probability above that
(72 inches or greater). We know the total area must equal
1, so the difference between 1 and 0.99961 must be that
remaining area. 1 – 0.99961 = 0.00039
This means the probability of a woman being 6 feet or
taller is 0.00039 or 0.039% (about 4 in 10,000). Another
way of saying this is that about 0.039% of all women are 6
feet or taller
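The two-step approach (first the z-score, then the area above it) can be sketched in Python as well. This is an illustration only: it assumes Python 3.8 or later, and the function name standardize is simply chosen to mirror Excel's =STANDARDIZE; it is not a built-in Python function.

```python
from statistics import NormalDist


def standardize(x: float, mean: float, std_dev: float) -> float:
    """z-score: how many standard deviations x lies from the mean."""
    return (x - mean) / std_dev


# Women's heights from the slide: mean 63.6 inches, std. deviation 2.5 inches.
z = standardize(72, 63.6, 2.5)

# Area from the extreme left up to z, then the remaining area above it.
area_below = NormalDist().cdf(z)
p_72_or_taller = 1.0 - area_below

print(f"z = {z:.2f}")
print(f"area below z = {area_below:.5f}")
print(f"P(72 inches or taller) = {p_72_or_taller:.5f}")
```

The result should agree with the Excel calculation: a z-score of 3.36 and a probability of roughly 0.00039.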
[Figure: women’s height distribution (µ = 63.6 inches, σ = 2.5 inches) with 63.6 and 72 marked on the Height (inches) axis]
Normal Distributions
• Example: Can you calculate the z-score for a woman’s height of 6 feet (that’s 12 x
6 = 72 inches)?
• What is the probability of a woman being 6 feet (72 inches) tall or taller? Or, another
way of asking this is: what percent of all women are 6 feet tall or taller?
• We can solve this using Excel in two different ways:
• Here is the second (and even easier) way I promised to show you.
Excel has another function that combines the calculation of a z-score
with finding the area or probability from the extreme left to
that z-score: =NORMDIST
[Figure: women’s height distribution (µ = 63.6 inches, σ = 2.5 inches) with 63.6 and 72 marked on the Height (inches) axis]
Normal Distributions
• Example: What is the probability of a woman being 6 feet (72 inches) tall or
taller? Or, another way of asking this is: what percent of all women are 6 feet tall
or taller?
• Using Excel’s NORMDIST function, =NORMDIST(x,mean,standard_dev,cumulative), we can answer this very easily
The “cumulative” option, if set to “true” or “1”, provides
the cumulative probability or the total area from the
extreme left to the x value we input. This is normally
how we would use this function. However, just so you
know, if this option is set to “false” or “0” it will return
the height of the density curve at x, which is not a probability.
For a continuous measurement such as height, the probability of a
woman being exactly 72 inches tall is essentially 0.
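To make the difference between the two settings of the “cumulative” option concrete, here is a minimal Python sketch. It assumes Python 3.8 or later; treating cdf as the counterpart of cumulative = TRUE and pdf as the counterpart of cumulative = FALSE is my reading of the two functions, and the variable names are illustrative.

```python
from statistics import NormalDist

# Women's heights from the slide: mean 63.6 inches, std. deviation 2.5 inches.
womens_heights = NormalDist(mu=63.6, sigma=2.5)

# Counterpart of =NORMDIST(72,63.6,2.5,TRUE): area from the extreme left up to 72.
cumulative = womens_heights.cdf(72)

# Counterpart of =NORMDIST(72,63.6,2.5,FALSE): height of the density curve at 72.
density = womens_heights.pdf(72)

# The probability we actually want is the area above 72 inches.
p_72_or_taller = 1.0 - cumulative

print(f"cumulative area up to 72 inches: {cumulative:.5f}")
print(f"density curve height at 72 inches: {density:.6f}")
print(f"P(72 inches or taller): {p_72_or_taller:.5f}")
```

Note that no separate z-score step is needed here, because the mean and standard deviation are supplied directly.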
Women: µ = 63.6 inches, σ = 2.5 inches
0.99961 is the area or probability that includes everything below the mean plus
everything above the mean up to z=3.36 (3.36 std. deviations above the mean)
[Figure: women’s height distribution with 63.6 and 72 marked on the Height (inches) axis]
Finding an x value when given an
area or probability
• Often we are interested in finding the value for x
(in this case Men’s height) given a certain
probability.
• For example, using the information shown below,
we might want to know what height separates the
tallest 10% of men from the rest.
Men: µ = 69.0 inches, σ = 2.8 inches
[Figure: men’s height distribution over a Height (inches) axis centered at 69.0; an unknown height marked “?” separates the lower 90% of men from the tallest 10%]
Finding a z-score or x value when
given an area or probability
• Remember that probability is measured by the area-under-the-curve
• To find the z-score associated with a given probability or
area-under-the-curve we can use the Excel function
NORMSINV
• To find an x value (within a distribution in which we know
the mean and standard deviation) we can use the Excel
function NORMINV
Men: µ = 69.0 inches, σ = 2.8 inches
[Figure: men’s height distribution over a Height (inches) axis centered at 69.0; an unknown height marked “?” separates the lower 90% of men from the tallest 10%]
NORMSINV
• =NORMSINV(probability) gives us
the z-score (number of standard
deviations from the mean) for
whatever cumulative probability we
enter into the function
• The probability we input always
represents the cumulative area
starting from the extreme left and
going out until the area-under-the-curve
equals the probability we
input
[Figure: standard normal curve shaded from the extreme left out to z; the probability represents the cumulative area from the extreme left]
NORMSINV
• For example:
• To find the z-score
(number of standard deviations
from the mean) for a cumulative
probability of 0.85 or 85%, use
=NORMSINV(0.85)
• In other words, the z-score
associated with a cumulative
probability of 0.85 or 85% is
1.036433
[Figure: standard normal curve with 85% of the area shaded from the extreme left out to a z-score of about 1]
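Going from an area back to a z-score can be checked the same way. The sketch below assumes Python 3.8 or later and uses NormalDist().inv_cdf as a stand-in for =NORMSINV; the variable names are illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

# z-score whose cumulative area (from the extreme left) is 0.85.
z_for_85 = std_normal.inv_cdf(0.85)
print(f"z for cumulative probability 0.85: {z_for_85:.6f}")

# Round trip: feeding that z-score back into the cumulative function returns 0.85.
area_check = std_normal.cdf(z_for_85)
print(f"area up to that z-score: {area_check:.2f}")
```

The round trip at the end is a handy sanity check: NORMSINV and NORMSDIST undo each other.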
NORMINV
• =NORMINV(probability,mean,standard_dev)
gives us the x value
associated with a probability we
input (given we also have the
mean and standard deviation of
the distribution which x is part of)
• Again, the probability we input
always represents the cumulative
area starting from the extreme left
and going out until the area-under-the-curve
equals the probability we
input
Men: µ = 69.0 inches, σ = 2.8 inches
[Figure: men’s height distribution over a Height (inches) axis centered at 69.0; an unknown height marked “?” separates the lower 90% of men from the tallest 10%]
NORMINV
• For example: Remember we wanted to find
the men’s height that sets apart the tallest
10% of men
• For Men’s Heights we were given a mean of
69.0 inches and standard deviation of 2.8
inches
• We now have all the ingredients and can
input them into the Excel function
=NORMINV(0.90,69.0,2.8). We enter 0.90 rather than 0.10
because the probability is always the cumulative area from the extreme left
• In other words, the height that separates the
tallest 10% of men from the rest is 72.588 inches. We
could also say that only 10% of all
men are over 72.588 inches tall
Men: µ = 69.0 inches, σ = 2.8 inches
[Figure: men’s height distribution over a Height (inches) axis centered at 69.0; 72.588” separates the lower 90% of men from the tallest 10%]
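The same cutoff can be found in Python in two equivalent ways: directly from the distribution of men's heights, or by finding the z-score first and converting it back to inches. This is a sketch assuming Python 3.8 or later, with inv_cdf standing in for =NORMINV and =NORMSINV; the variable names are illustrative.

```python
from statistics import NormalDist

MEAN_HEIGHT = 69.0  # men's mean height in inches (from the slide)
STD_DEV = 2.8       # men's standard deviation in inches (from the slide)

# Direct route, the counterpart of =NORMINV(0.90,69.0,2.8).
mens_heights = NormalDist(mu=MEAN_HEIGHT, sigma=STD_DEV)
cutoff = mens_heights.inv_cdf(0.90)

# Two-step route: z-score for a cumulative area of 0.90, then x = mean + z * std. dev.
z_cutoff = NormalDist().inv_cdf(0.90)
cutoff_from_z = MEAN_HEIGHT + z_cutoff * STD_DEV

print(f"height separating the tallest 10%: {cutoff:.3f} inches")
print(f"same height via the z-score:       {cutoff_from_z:.3f} inches")
```

The second route is just the z-score formula z = (x - µ) / σ rearranged to solve for x.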
Exercises

Given that the population of women has a mean weight of 143 lbs., with a
standard deviation of 29 lbs., use what you just learned to answer the
following questions
1. What percent of women would you expect to weigh less than 130 lbs?
2. If you randomly select one woman, what is the probability that she will weigh
more than 150 lbs?
3. What percent of women weigh more than 110 lbs?
4. What percent of women weigh between 130 and 160 lbs?
For answers, either email me a single Excel file containing all your work, or come to
one of the scheduled open lab sessions and bring your work
[Figure: distribution of women’s weight in lbs., centered at 143; mean = 143, std. deviation = 29]
Caution!!!
1. Don’t confuse z scores and probabilities
• z scores represent the number of standard deviations an x value is from
the mean. z scores are negative if their corresponding x value falls
below the mean, and z scores are positive if their corresponding x value
falls above the mean.
  • For example, for a distribution with a mean of 100, the z score for any x value
  less than 100 would be negative, and the z score for any x value greater than
  100 would be positive
  • Or, for a distribution with a mean of -50 (negative fifty), the z-score for -60
  (below the mean) would be negative, and the z-score for -40 (above the mean) would be
  positive.
• Probabilities are represented by the area-under-the-curve and must
always always always be between 0 and 1
  • If you ever think you have a probability that is less than 0 or greater than 1
  then you have made a serious error
  • The probabilities that Excel gives you are cumulative, meaning that they
  represent the area from the extreme left (beginning) of the distribution out to
  some z score or x value. The area or probability you are interested in might
  not be the exact probability that Excel gives you, but you can always find the
  area you are interested in by remembering that the total area-under-the-curve
  is 1 and adding or subtracting areas appropriately. Some of the exercises I
  gave you will give you some practice doing this.
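Here is one way to sketch the "adding or subtracting areas" idea in Python without giving away the exercises. It uses the men's heights from earlier; the cutoffs of 66 and 72 inches and the helper names are chosen purely for illustration, and Python 3.8 or later is assumed.

```python
from statistics import NormalDist

# Men's heights from the earlier slides: mean 69.0 inches, std. deviation 2.8 inches.
mens_heights = NormalDist(mu=69.0, sigma=2.8)


def p_less_than(x: float) -> float:
    """Cumulative area from the extreme left out to x (what Excel returns)."""
    return mens_heights.cdf(x)


def p_more_than(x: float) -> float:
    """Area to the right of x: total area (1) minus the cumulative area."""
    return 1.0 - mens_heights.cdf(x)


def p_between(lo: float, hi: float) -> float:
    """Area between two x values: larger cumulative area minus the smaller one."""
    return mens_heights.cdf(hi) - mens_heights.cdf(lo)


below_66 = p_less_than(66)
between_66_and_72 = p_between(66, 72)
above_72 = p_more_than(72)

print(f"shorter than 66 inches:   {below_66:.4f}")
print(f"between 66 and 72 inches: {between_66_and_72:.4f}")
print(f"taller than 72 inches:    {above_72:.4f}")
print(f"total:                    {below_66 + between_66_and_72 + above_72:.4f}")
```

The three pieces cover the whole curve, so they add up to 1, and each one stays between 0 and 1.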
The Central Limit Theorem
Central Limit Theorem
Everything we just learned about probability and normal distributions only applies, of
course, when our population is normally distributed. And, in fact, many things in
nature and science and behavior are normally distributed. However, there are also
a number of things of interest that may not be normally distributed.
The “Central Limit Theorem” provides us with a neat little trick: even when the
individual x’s are not normally distributed, the means of groups of x’s are
approximately normally distributed, which allows us to use what we
learned about area and probabilities.
Here is what we need to do, according to The Central Limit Theorem, to get from a
non-normal distribution to a normal distribution
1. Rather than looking at each individual x, we need to take the mean/average of
randomly selected groups of x’s. The larger each group, the closer the distribution of
group means is to normal: small groups (3, 4, or 5 individual x’s) can be enough when the
original distribution is roughly symmetrical, while groups of 30 or more are a common
rule of thumb otherwise. We also need a reasonable number of these groups (the more
the better) to see the shape of the new distribution.
2. It is the means of these groups that form our new “normally distributed” population
• The mean of our new “normal distribution” (composed of group means) is simply the mean
of the group means, often called the mean of means
• The standard deviation of our new distribution is calculated as the standard deviation of
the individual x’s divided by the square root of n (where n is the number of individual x’s in
each group)
Central Limit Theorem
the mean of the sample means
$\mu_{\bar{x}} = \mu$
the standard deviation of sample means
(often called standard error of the mean)
$\sigma_{\bar{x}} = \dfrac{\sigma}{\sqrt{n}}$
Where n is the number in each group
Central Limit Theorem
Here is an example of how we can use the Central Limit Theorem
• A study was done in which 50 social security numbers were randomly
selected.
• The last 4 digits of these numbers were put into the following
histogram (4 x 50 = 200 numbers total).
• Does this look like a normal distribution (bell-shaped)?
• I hope you can see that it DOES NOT. In fact, it looks a lot like what
is called a uniform distribution (all the numbers have about the same
number of occurrences)
So here we go, we have a non-normal distribution that we would like to
turn into a normal distribution by using the Central Limit Theorem
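[Figure: histogram of the 200 individual digits; horizontal axis shows the digits 0 through 9, vertical axis shows Frequency from 0 to 20]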
Central Limit Theorem
As already mentioned, to use the
Central Limit Theorem we first put the
data into groups.
• In this case, the data is naturally in
groups (groups of 4 digits)
• We next take the mean or average of
each group of 4 digits as shown in the
table (the mean of these means
becomes the mean of our new
transformed distribution)
• The standard deviation of our new
transformed distribution is the std.
deviation of the original data divided by
the square root of the number in each
group (for this example n = 4)
last 4 digits of SS#    mean
1  8  6  4              4.75
5  3  3  6              4.25
9  8  8  8              8.25
5  1  2  5              3.25
9  3  3  5              5.00
4  2  6  2              3.50
7  7  1  6              5.25
9  1  5  4              4.75
5  3  3  9              5.00
6  7  7  1              5.25
2  3  3  9              4.25
2  4  7  5              4.50
5  4  3  7              4.75
0  4  3  8              3.75
2  5  8  6              5.25
7  1  3  4              3.75
8  3  7  0              4.50
5  6  6  7              6.00
Central Limit Theorem
The histogram of our new transformed distribution (the
distribution of means of the last 4 digits of 50 SS#’s) looks very
much like a normal distribution should (i.e., bell-shaped)
• Here we had 50 groups of 4. With more groups the histogram
gets smoother, and with larger groups the distribution of means
gets closer to a pure normal distribution
• There you go, that’s how the Central Limit Theorem
takes us from a non-normal distribution of individual values to
an approximately normal distribution of sample means
[Figure: histogram of the 50 group means; horizontal axis marked 0 through 9, vertical axis shows Frequency from 0 to 15]
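The social security number example can be imitated with a small simulation. The sketch below does not use the study's actual data: it assumes the individual digits are uniformly distributed on 0 through 9, draws them with Python's random module, and uses 10,000 groups (rather than 50) so the comparison is stable. The constant names are illustrative.

```python
import random
from statistics import mean, pstdev

random.seed(0)  # fixed seed so the sketch gives the same result every run

GROUP_SIZE = 4       # n: number of digits in each group, as in the SS# example
NUM_GROUPS = 10_000  # assumption: far more groups than the 50 in the study

# Each group is 4 random digits; the individual digits are uniform, not bell-shaped.
groups = [[random.randint(0, 9) for _ in range(GROUP_SIZE)] for _ in range(NUM_GROUPS)]
group_means = [mean(group) for group in groups]

# Standard deviation of the individual digits, and the predicted sigma / sqrt(n).
all_digits = [digit for group in groups for digit in group]
sigma = pstdev(all_digits)
predicted_std_error = sigma / GROUP_SIZE ** 0.5

# Standard deviation actually observed among the group means.
observed_std_error = pstdev(group_means)

print(f"mean of individual digits: {mean(all_digits):.3f}")
print(f"mean of group means:       {mean(group_means):.3f}")
print(f"predicted sigma / sqrt(n): {predicted_std_error:.3f}")
print(f"observed std. deviation of group means: {observed_std_error:.3f}")
```

The mean of the group means should land close to the mean of the individual digits, and the observed spread of the group means should land close to σ divided by the square root of n, as the formulas on the earlier slide predict.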
Determining Normality
(How to know if a distribution is normally
distributed)
Although the Central Limit Theorem is usually used to transform
non-normally distributed data into a normal distribution (by taking the
distribution of the means of randomly selected groups), it can also be
used on data that is already normally distributed in order to examine
probabilities associated with groups of x’s rather than individual x’s.
For example, if we wanted to know the probability that the average
weight of a group of 5 randomly selected women is greater than 160
lbs., we could use the Central Limit Theorem to do that. See if you can
come up with the correct answer to that question (hint: the key is to
remember that the standard deviation of our new distribution is σ / sqrt(n) where n is the
group size, n=5 in this case).
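To show the mechanics without answering the question above, here is a sketch of the same kind of calculation on a different, made-up question: the probability that the average height of 4 randomly selected men is greater than 72 inches. The men's mean and standard deviation come from the earlier slides; the group size of 4 and the 72 inch cutoff are assumptions chosen for illustration, and Python 3.8 or later is assumed.

```python
from math import sqrt
from statistics import NormalDist

MEAN_HEIGHT = 69.0  # men's mean height in inches (from the slides)
STD_DEV = 2.8       # men's standard deviation in inches (from the slides)
GROUP_SIZE = 4      # illustrative group size, not from the slides

# Standard deviation of the group means: sigma / sqrt(n).
standard_error = STD_DEV / sqrt(GROUP_SIZE)

# Distribution of group means: same mean, smaller standard deviation.
group_mean_heights = NormalDist(mu=MEAN_HEIGHT, sigma=standard_error)
p_group_mean_over_72 = 1.0 - group_mean_heights.cdf(72.0)

# For comparison: the probability that a single man is taller than 72 inches.
p_individual_over_72 = 1.0 - NormalDist(mu=MEAN_HEIGHT, sigma=STD_DEV).cdf(72.0)

print(f"standard error of the mean: {standard_error:.2f} inches")
print(f"P(group mean > 72 inches):  {p_group_mean_over_72:.4f}")
print(f"P(individual > 72 inches):  {p_individual_over_72:.4f}")
```

Because group means vary less than individuals do, an extreme group average is much less likely than an equally extreme individual value.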
The next slide discusses how to test whether or not data is normally
distributed
Procedure for Determining Whether Data
Have a Normal Distribution
There are various sophisticated methods for determining
normality and if you progress in statistics you will learn
these methods. However, for this elementary course, I only
expect you to be able to test for “normality” (whether or not
a sample comes from a normally distributed population) by
looking at the following:
1. Histogram: Construct a histogram. Reject normality if the
histogram departs dramatically from a bell shape.
2. Outliers: Identify outliers. Reject normality if there is more
than one outlier present. An outlier is an extremely small or
extremely large value that appears inconsistent with the rest
of the data
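The two checks above are meant to be done by eye, but a rough Python sketch can help organize them. It uses the 18 group means listed in the table of last-4-digit means earlier in this transcript. The slides do not give a numeric definition of an outlier, so the 1.5 x IQR fence used here is an assumption on my part (a common convention), and the text histogram is only a crude stand-in for a real chart. Python 3.8 or later is assumed.

```python
from statistics import quantiles

# Group means copied from the table of last-4-digit means shown earlier.
group_means = [4.75, 4.25, 8.25, 3.25, 5.00, 3.50, 5.25, 4.75, 5.00,
               5.25, 4.25, 4.50, 4.75, 3.75, 5.25, 3.75, 4.50, 6.00]

# 1. Histogram: count how many means fall in each whole-number bin.
bin_counts = {}
for value in group_means:
    bin_start = int(value)
    bin_counts[bin_start] = bin_counts.get(bin_start, 0) + 1
for bin_start in sorted(bin_counts):
    print(f"{bin_start} to {bin_start + 1}: {'*' * bin_counts[bin_start]}")

# 2. Outliers: ASSUMED rule, values more than 1.5 x IQR beyond the quartiles.
q1, _median, q3 = quantiles(group_means, n=4)
iqr = q3 - q1
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [value for value in group_means if value < low_fence or value > high_fence]

# The slide's rule: reject normality if there is more than one outlier.
reject_normality = len(outliers) > 1
print(f"outliers: {outliers}")
print(f"reject normality on the outlier check: {reject_normality}")
```

With only 18 values this is a very small sample, so treat the result as a demonstration of the procedure rather than a real conclusion about the data.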