Download The Normal Distribution - Department of Statistics

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Central limit theorem wikipedia , lookup

Transcript
Looking Ahead
The Normal Distribution
I
Chapter 4 introduces the normal distribution as a probability
distribution.
I
Chapter 5 culminates in the central limit theorem, the primary
theoretical justification for most of the methods of statistical
inference in the remainder of the textbook.
Bret Larget
Departments of Botany and of Statistics
University of Wisconsin—Madison
Statistics 371
23rd September 2005
The Normal Distribution
The Normal Density
Normal curves have the following bell-shaped, symmetric density.
The Normal Distribution is the most important distribution of
continuous random variables.
I
The normal density curve is the famous symmetric,
bell-shaped curve.
I
The central limit theorem is the reason that the normal curve
is so important. Essentially, many statistics that we calculate
from large random samples will have approximate normal
distributions (or distributions derived from normal
distributions), even if the distributions of the underlying
variables are not normally distributed.
I
The Central Limit Theorem is the basis of most of the
methods of statistical inference we will study in the last half
of the course.
1 y −µ 2
1
f (y ) = √ e− 2 ( σ )
σ 2π
Parameters: The parameters of a normal curve are the mean µ
and the standard deviation σ.
Here is an example of a normal curve with µ = 100 and σ = 20.
Normal Distribution
mu = 100 , sigma = 20
Probability Density
I
50
−3
100
−2
−1
0
150
1
2
3
Standardization
The Standard Normal Table
I
All normal curves have the same essential shape.
I
Every normal curve can be drawn in exactly the same manner,
just by changing labels on the axis.
I
The standard normal curve is the normal curve with mean
µ = 0 and standard deviation σ = 1.
I
The standardization formula is
Z=
Y −µ
σ
I
Every problem that asks for an area under a normal curve is
solved by first finding an equivalent problem for the standard
normal curve.
I
The justification for this comes from calculus. An area under
a normal curve is a definite integral. The integral is simplified
by using the standardization formula.
The 68–95–99.7 Rule
I
The standard normal table lists the area to the left of z under
the standard normal curve for each value from −3.49 to 3.49
by 0.01 increments.
I
The normal table is on the inside front cover of your textbook.
I
Numbers in the margins represent z.
I
Numbers in the middle of the table are areas to the left of z.
R can do this for general values of z, and R can do the
standardization for you.
Example Calculations
For every normal curve, 68% of the area is within one SD of the
mean, 95% of the area is within two SDs of the mean, and 99.7%
of the area is within three SDs of the mean.
Normal Distribution
mu = 0 , sigma = 1
Normal Distribution
mu = 0 , sigma = 1
P( X < −1 ) = 0.1587
P( X > 1 ) = 0.1587
−4
−2
−3
−2
0
−1
0
2
1
2
4
3
P( X < −2 ) = 0.0228
P( X > 2 ) = 0.0228
−4
−2
−3
−2
0
−1
Suppose that egg shell thickness is normally distributed with a
mean of 0.381 mm and a standard deviation of 0.031 mm. Here
are a large number of example calculations.
P( −3 < X < 3 ) = 0.9973
0
2
1
2
4
3
Probability Density
P( −2 < X < 2 ) = 0.9545
Probability Density
Probability Density
P( −1 < X < 1 ) = 0.6827
Normal Distribution
mu = 0 , sigma = 1
P( X < −3 ) = 0.0013
P( X > 3 ) = 0.0013
−4
−2
−3
−2
0
−1
0
2
1
2
4
3
Example Area Calculation
Example Area Calculation
Area to the left. Find the proportion of eggs with shell thickness
less than 0.34 mm.
Area between two values. Find the proportion of eggs with shell
thickness between 0.34 and 0.36 mm.
Normal Distribution
mu = 0.381 , sigma = 0.031
P( X < 0.34 ) = 0.093
P( X > 0.34 ) = 0.907
0.25
0.30
−3
0.35
−2
−1
0.40
0
0.45
1
2
P( 0.34 < X < 0.36 ) = 0.1561
Probability Density
Probability Density
Normal Distribution
mu = 0.381 , sigma = 0.031
0.50
P( X < 0.34 ) = 0.093
P( X > 0.36 ) = 0.7509
0.25
3
0.30
−3
Example Area Calculation
0.35
−2
0.35
−1
3
0.40
0
0.45
1
2
0.50
3
P( 0.32 < X < 0.4 ) = 0.7055
Probability Density
Probability Density
P( X > 0.36 ) = 0.7509
−2
2
Normal Distribution
mu = 0.381 , sigma = 0.031
P( X < 0.36 ) = 0.2491
−3
1
0.50
Area outside two values. Find the proportion of eggs with shell
thickness smaller than 0.32 mm or greater than 0.40 mm.
Normal Distribution
mu = 0.381 , sigma = 0.031
0.30
0
0.45
Example Area Calculation
Area to the right. Find the proportion of eggs with shell
thickness more than 0.36 mm.
0.25
−1
0.40
P( X < 0.32 ) = 0.0245
P( X > 0.4 ) = 0.27
0.25
0.30
−3
0.35
−2
−1
0.40
0
0.45
1
2
0.50
3
Example Area Calculation
Quantiles
Central area. Find the proportion of eggs with shell thickness
within 0.05 mm of the mean.
Normal Distribution
mu = 0.381 , sigma = 0.031
Quantile calculations ask you to use the normal table backwards.
You know the area but need to find the point or points on the
horizontal axis.
Probability Density
P( 0.331 < X < 0.431 ) = 0.8932
P( X < 0.331 ) = 0.0534
P( X > 0.431 ) = 0.0534
0.25
0.30
−3
0.35
−2
−1
0.40
0
0.45
1
2
0.50
3
Example Area Calculation
Example Quantile Calculations
Two-tail area. Find the proportion of eggs with shell thickness
more than 0.07 mm from the mean.
Percentile. What is the 90th percentile of the egg shell thickness
distribution?
Normal Distribution
mu = 0.381 , sigma = 0.031
Probability Density
P( 0.311 < X < 0.451 ) = 0.9761
P( X < 0.311 ) = 0.012
P( X > 0.451 ) = 0.012
Probability Density
Normal Distribution
mu = 0.381 , sigma = 0.031
P( X < 0.4207 ) = 0.9
z = 1.28
0.25
0.30
−3
0.35
−2
−1
0.40
0
0.45
1
2
0.50
3
0.25
0.30
−3
0.35
−2
−1
0.40
0
0.45
1
2
0.50
3
Example Quantile Calculations
Using R
Upper cut-off point. What value cuts off the top 15% of egg
shell thicknesses?
Normal Distribution
mu = 0.381 , sigma = 0.031
You can use R functions pnorm and qnorm for the previous
calculations.
Area to the left. Find the proportion of eggs with shell thickness
less than 0.34 mm.
> pnorm(0.34, 0.381, 0.031)
Probability Density
[1] 0.09298744
Area to the right. Find the proportion of eggs with shell
thickness more than 0.36 mm.
P( X < 0.4131 ) = 0.85
> 1 - pnorm(0.36, 0.381, 0.031)
[1] 0.75093
z = 1.04
0.25
0.30
−3
0.35
−2
−1
0.40
0
0.45
1
2
0.50
Area between two values. Find the proportion of eggs with shell
thickness between 0.34 and 0.36 mm.
> pnorm(0.36, 0.381, 0.031) - pnorm(0.34, 0.381, 0.031)
3
[1] 0.1560825
Example Quantile Calculations
More R Calculations
Central cut-off points. The middle 75% egg shells have
thicknesses between which two values?
Normal Distribution
mu = 0.381 , sigma = 0.031
Area outside two values. Find the proportion of eggs with shell
thickness smaller than 0.32 mm or greater than 0.40 mm.
> pnorm(0.32, 0.381, 0.031) + 1 - pnorm(0.4, 0.381, 0.031)
[1] 0.2945190
Probability Density
Central area. Find the proportion of eggs with shell thickness
within 0.05 mm of the mean.
> 1 - 2 * pnorm(0.381 - 0.05, 0.381, 0.031)
P( X < 0.4167 ) = 0.875
[1] 0.8932345
Two-tail area. Find the proportion of eggs with shell thickness
more than 0.07 mm from the mean.
z = 1.15
0.25
0.30
0.35
0.40
0.45
0.50
> 2 * pnorm(0.381 - 0.07, 0.381, 0.031)
[1] 0.02394164
−3
−2
−1
0
1
2
3
More R Calculations
What is a Normal Probability Plot?
Percentile. What is the 90th percentile of the egg shell thickness
distribution?
I
A normal probability plot is a plot of the sorted sample data
versus somethng close to the expected z-score for the
corresponding rank of a random normal sample of the same
size.
I
For example, the expected z-score of the minumum from a
random sample of size 10 from a normal population is about
−1.54.
> qnorm(0.9, 0.381, 0.031)
[1] 0.4207281
Upper cut-off point. What value cuts off the top 15% of egg
shell thicknesses?
> qnorm(0.85, 0.381, 0.031)
[1] 0.4131294
Central cut-off points. The middle 75% egg shells have
thicknesses between which two values?
I
If the plotted points are close to a straight line, there is
evidence that the distribution is close to normal.
> qnorm(c(0.25/2, 1 - 0.25/2), 0.381, 0.031)
I
If the plotted points are far from a straight line, there is
evidence of non-normality.
[1] 0.3453392 0.4166608
−2
−1
0
1
2
−2
−1
0
1
Theoretical Quantiles
Theoretical Quantiles
Histogram of x
Histogram of x
Histogram of x
−1
0
x
1
2
2
Frequency
6
−2
−1
0
x
1
2
0 2 4 6 8
8
12
10
−2
1
2
2
0
−2
−2
1
−1
Sample Quantiles
1
0
−1
Sample Quantiles
2
1
0
−1
0
4
There is no easy answer to the question, but a normal
probability plot is much more informative than the result of a
test.
−1
Theoretical Quantiles
Frequency
I
−2
2
It is generally better to make a plot that sheds light on the
question, “Is the data so far from normality as to bias a
method that assumes normality?”
Normal Q−Q Plot
0
I
Sample Quantiles
If given data, there are tests for normality, but there are
reasons not to do these.
−2
I
Normal Q−Q Plot
15
Sometimes we can rely on the central limit theorem.
Normal Q−Q Plot
10
I
All three plots are normal with n = 50.
5
A standard question begins, “Assuming that variable Y has a
normal distribution,. . . ”. But how do we know the distribution
is approximately normal?
Frequency
I
Normal Example
0
Normal Probability Plots
−2
−1
0
x
1
2
Normal Example
The Continuity Correction
All three plots are normal with n = 500.
0
1
2
3
−3
−2
−1
0
1
2
3
−3
−2
−1
0
1
2
Theoretical Quantiles
Theoretical Quantiles
Histogram of x
Histogram of x
Histogram of x
0
1
2
These calculations are more accurate when the boundaries of
normal calculations match midpoints between possible values
of the discrete distribution.
I
Approximations to the binomial
p distribution use a normal
curve with µ = np and σ = np(1 − p).
3
100
I
50
Frequency
0
−1
I
0
Frequency
20 40 60 80
60
40
20
−2
The normal distribution can also be used to compute
approximate probabilities of discrete distributions.
150
Theoretical Quantiles
0
−3
I
−3
−3
−1
−1
Sample Quantiles
3
2
1
−1
Sample Quantiles
2
1
−1 0
Sample Quantiles
−3
−2
80
−3
Frequency
Normal Q−Q Plot
1 2 3 4
Normal Q−Q Plot
3
Normal Q−Q Plot
3
−3
−2
−1
x
0
1
2
3
−4
−2
x
0
2
4
x
Non-normal Example
When approximating a binomial
p probability, the approximation
is usually pretty good when np(1 − p) > 3.
Example
The first plot is not normal, the second two are with n = 50.
Normal Q−Q Plot
Normal Q−Q Plot
1
−2
−1
0
1
0.5
2
−2
−1
0
1
Theoretical Quantiles
Histogram of x
Histogram of x
Histogram of x
6
x
8
6
4
Frequency
0
2
0 2 4 6 8
−2
−1
0
x
1
2
−2
−1
0
x
Algebraically,
Pr {3 ≤ Y ≤ 7} = Pr {2 < Y < 8}
8 10
Theoretical Quantiles
4
Here is an example with the binomial distribution with n = 15
and p = 0.4 where we want to compute Pr {3 ≤ Y ≤ 7}.
2
Theoretical Quantiles
12
2
I
−1.5
2
Frequency
0
−0.5
Sample Quantiles
1.5
0.5
−1.5
0
0 2 4 6 8
Frequency
−0.5
Sample Quantiles
8
6
4
2
Sample Quantiles
0
−1
12
−2
I
1.5
Normal Q−Q Plot
1
2
I
Which end points should we use?
I
We get a better answer if we use 2.5 and 7.5.
Example
Example (cont.)
I
The exact calculation is:
Pr {3 ≤ Y ≤ 7} =
7
X
y =3
15!
.
(0.4)y (0.6)15−y = 0.7598
y !(15 − y )!
If Y has a binomial probability with n = 500 and p = 0.2, what is
the probability of 110 or more successes?
Exact:
> 1 - pbinom(109, 500, 0.2)
[1] 0.1443539
I
I
The approximate calculation uses µ = 6 and σ = 1.9.
¾
½
7−6
3−6
≤Z ≤
Pr {3 ≤ Y ≤ 7} ≈ Pr
1.9
1.9
.
= 0.7524
This approximation is more accurate than using 3 and 7 or 2
and 8 as end points.
Continuity Correction
Here is a picture that shows why the continuity correction helps.
0.15
0.10
0.05
0.00
0.00
0.05
0.10
0.15
0.20
Approximate = 0.7529
0.20
Exact = 0.7598
0
2
4
6
8
10
12
14
0
2
4
6
8
10
12
14
Normal approximation:
> mu = 500 * 0.2
> sigma = sqrt(500 * 0.2 * 0.8)
> 1 - pnorm(109.5, mu, sigma)
[1] 0.1440878
With R, do the exact calculation. With a calculator and normal
table, use the normal approximation.