Central Limit Theorem
A) Let X1, X2, ..., Xn be a random sample of size n
from a population with mean µ and variance σ².
Then the sample mean X̄ is a RV with E(X̄) = µ, and V(X̄) = σ²/n.
Random Sample (SRS): X1, X2, ..., Xn are
• Independent
• Identically distributed
Alternate terminology: IID.
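Statement A can be checked by simulation. The sketch below is an illustration only: the exponential population (mean 2, so its SD is also 2) and the values n = 9 and m = 100000 are assumptions made for this example and do not come from the notes.

```r
# Simulation check of Statement A (population and sizes are assumed for illustration)
set.seed(2006)
m <- 100000; n <- 9              # m samples, each of size n
mu <- 2; sigma <- 2              # exponential with rate 1/2 has mean 2 and SD 2
x <- rexp(m*n, rate = 1/mu)
x.bar <- rowMeans(matrix(x, nrow = m))   # one sample mean per row
mean(x.bar)                      # should be near mu = 2
var(x.bar)                       # should be near sigma^2/n = 4/9
```

Each row of the matrix is one IID sample, so the m row means behave like m independent draws of the sample mean.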
B) Let X1, X2, ..., Xn be a random sample of size n
from a normal population with mean µ and variance σ².
Then X̄ is a normal RV with E(X̄) = µ, and V(X̄) = σ²/n.
That is, X̄ ~ NORM(µ, σ/√n).
C) The Central Limit Theorem:
Let X1, X2, ..., Xn be a random sample of size n
from any population with mean µ and variance σ².
Then X̄ is a RV with E(X̄) = µ, and V(X̄) = σ²/n.
Furthermore, X̄ has approximately the distribution NORM(µ, σ/√n).
The approximation gets better as n gets larger.
Roughly speaking, the "convergence" is faster if the population
distribution is nearly symmetrical.
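One way to watch the approximation improve with n, without simulation, is to use a skewed population for which the exact distribution of the sample mean is known. The sketch below assumes an exponential population with mean 1 and SD 1 (an assumption for this illustration, not from the notes). The mean of n such observations has a gamma distribution with shape n and rate n, so its CDF can be compared directly with the normal CDF. The function name clt.gap is hypothetical.

```r
# Largest gap between the exact CDF of the sample mean and its CLT approximation.
# Assumed population: exponential with mean 1 and SD 1 (illustration only).
clt.gap <- function(n) {
  g <- seq(0, 3, by = .001)                   # grid of values of the sample mean
  exact <- pgamma(g, shape = n, rate = n)     # mean of n EXP(1) values is GAMMA(n, n)
  clt   <- pnorm(g, 1, 1/sqrt(n))             # NORM(mu, sigma/sqrt(n))
  max(abs(exact - clt))
}
gaps <- sapply(c(2, 9, 30, 100), clt.gap)
round(gaps, 4)                                # the gap shrinks as n grows
```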
Bruce E. Trumbo. Spring 2006. CSU East Bay.
Example:
(a) Suppose a certain population of male collegiate swimmers has distribution NORM(70, 7),
weights in kg.
We select 1 swimmer at random from this population.
What is the probability that he weighs between 65 and 75 kg?
Answer. Use Standard Normal Distribution:
P{65 ≤ X ≤ 75} = P{(65 – 70)/7 ≤ (X – µ)/σ ≤ (75 – 70)/7}
= P{–5/7 ≤ Z ≤ 5/7} = P{–0.714 ≤ Z ≤ 0.714}
= 1 – 2P{Z ≤ –0.714} = 1 – 2(0.2376) = 0.5248,
where Z = (X – µ)/σ ~ NORM(0, 1) is standard normal.
[From R; answer from tables will be slightly less accurate.]
So a little more than half of the swimmers have weights in
the "Middle Part" between 65 and 75 kg.
In R: pnorm(75, 70, 7) - pnorm(65, 70, 7) returns 0.5249.
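The standardization step can be mirrored in R. This is only a sketch of the same calculation done two ways; the variable names are illustrative.

```r
# Part (a): P{65 <= X <= 75} for X ~ NORM(70, 7), computed two ways
mu <- 70; sigma <- 7
z <- (75 - mu)/sigma                                   # 5/7
p.std <- pnorm(z) - pnorm(-z)                          # via the standard normal CDF
p.dir <- pnorm(75, mu, sigma) - pnorm(65, mu, sigma)   # direct, as in the line above
c(p.std, p.dir)                                        # the two values agree
```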
Problem 1: How close do you get using normal tables in the text? Why not exactly the same?
# Density of NORM(70, 7), with the area between 65 and 75 shaded
x <- seq(45, 95, by=.1)
mr <- seq(65, 75, by=.01)
plot(x, dnorm(x, 70, 7), type="l", lwd=2)
lines(mr, dnorm(mr, 70, 7), type="h", col="blue")
abline(v=c(63, 77), col="green")   # vertical lines at µ ± σ
Note: Green lines at µ ± σ. Area between them under this normal curve is 0.6826.
Problem 2: Verify probability in note using tables in text.
(b) Suppose a certain population of collegiate swimmers has distribution NORM(70, 7).
We select 9 swimmers at random from this population.
What is the probability that the sample mean weight is between 65 and 75 kg?
Answer. Use Statement B:
P{65 ≤ X̄ ≤ 75} = P{(65 – 70)/(7/3) ≤ (X̄ – µ)/(σ/√n) ≤ (75 – 70)/(7/3)}
= P{–15/7 ≤ Z ≤ 15/7} = P{–2.143 ≤ Z ≤ 2.143} = 0.9679
[From R; answer from tables will be slightly less accurate.]
So almost all random samples of size n = 9 from this population will have
sample mean X̄ between 65 and 75 kg.
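A possible R version of this calculation, with a simulation as a cross-check. The seed and the number of simulated samples m are arbitrary choices for this sketch.

```r
# Part (b): the mean of n = 9 draws from NORM(70, 7) is NORM(70, 7/3)
mu <- 70; sigma <- 7; n <- 9
se <- sigma/sqrt(n)                                # SD of the sample mean, 7/3
p.exact <- pnorm(75, mu, se) - pnorm(65, mu, se)   # Statement B
# Simulation check (seed and m are arbitrary)
set.seed(1212)
m <- 100000
x.bar <- rowMeans(matrix(rnorm(m*n, mu, sigma), nrow = m))
p.sim <- mean(x.bar >= 65 & x.bar <= 75)
c(p.exact, p.sim)                                  # both should be close to 0.9679
```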
Problem 3:
(i) Verify this answer (as close as you can get) using normal tables in the text.
(ii) What is the probability that the mean of a sample of 5 swimmers lies between 65 and 75 kg?
(iii) What is the probability that the mean of a sample of 9 swimmers exceeds 72 kg?
x <- seq(45, 95, by=.1)
mr <- seq(65, 75, by=.01)
plot(x, dnorm(x, 70, 7), ylim=c(0,.18), type="l", lwd=1, lty="dotted")
lines(mr, dnorm(mr, 70, 7/3), type="h", col="blue")
lines(x, dnorm(x, 70, 7/3), lwd=2, col="darkred")
(c) Suppose a certain population of collegiate swimmers has distribution NORM(70, 7).
We select 9 swimmers at random from this population.
What is the probability that every one of them weighs between 65 and 75 kg?
Answer. Use Binomial Distribution:
Recall from (a): P{65 ≤ X ≤ 75} = 0.5248. Call the event A = {65 ≤ X ≤ 75} a "Success."
We seek the probability of 9 Successes in n = 9 binomial trials.
Answer is (0.5248)⁹ = 0.0030.
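In R, the same number can be obtained either as a power or as a binomial probability. A minimal sketch:

```r
# Part (c): probability that all 9 swimmers weigh between 65 and 75 kg
p <- pnorm(75, 70, 7) - pnorm(65, 70, 7)   # P(Success) from part (a)
p^9                                        # direct power
dbinom(9, size = 9, prob = p)              # same value as a binomial probability
```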
Comment on how sample means behave:
There are two ways for the sample mean of 9 to lie in the Middle Part (between 65 and 75):
• All 9 observations in the Middle Part: We have just seen that this is very unlikely.
• "Averaging Effect": Some heavy swimmers and some light swimmers in the group. But
heavier ones "balance" lighter ones, so that the sample mean is in the Middle Part.
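The two ways can be compared by simulation. In this sketch the seed and the number of simulated samples m are arbitrary; each row of DTA is one sample of 9 swimmers from NORM(70, 7).

```r
# Compare "all nine in the Middle Part" with "mean of nine in the Middle Part"
set.seed(1212)                                   # arbitrary seed
m <- 100000; n <- 9
DTA <- matrix(rnorm(m*n, 70, 7), nrow = m)       # each row is one sample of 9
all.in  <- rowSums(DTA >= 65 & DTA <= 75) == n   # every observation in [65, 75]
x.bar   <- rowMeans(DTA)
mean.in <- x.bar >= 65 & x.bar <= 75             # sample mean in [65, 75]
mean(all.in)                                     # near 0.003, as in part (c)
mean(mean.in)                                    # near 0.97, as in part (b)
```

Nearly all of the samples whose mean lands in the Middle Part get there by the Averaging Effect, not by having all nine observations there.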
(d) Suppose a certain population of collegiate swimmers has mean µ = 70 kg and σ = 7 kg.
However, we cannot be sure the population is normal in shape.
We select 9 swimmers at random from this population.
What is the probability that the sample mean weight is between 65 and 75 kg?
Answer. Use Statement C: The Central Limit Theorem.
The numerical solution is the same as in (b):
P{65 ≤ X̄ ≤ 75} = 0.9679,
except this result is now an approximation.
The accuracy of the approximation depends on whether n is large enough for the
CLT-effect to be useful.
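To get a feel for the accuracy, one can try a specific non-normal population for which the exact answer is available. The sketch below assumes a gamma population with mean 70 and SD 7. This population is hypothetical, chosen only because the mean of n gamma observations is again gamma; the notes do not specify a shape.

```r
# Part (d) for one assumed non-normal population: GAMMA with mean 70 and SD 7
mu <- 70; sigma <- 7; n <- 9
shape <- (mu/sigma)^2            # 100
rate  <- mu/sigma^2              # 10/7
# The mean of n IID GAMMA(shape, rate) values is GAMMA(n*shape, n*rate)
p.exact <- pgamma(75, n*shape, n*rate) - pgamma(65, n*shape, n*rate)
p.clt   <- pnorm(75, mu, sigma/sqrt(n)) - pnorm(65, mu, sigma/sqrt(n))
c(p.exact, p.clt)                # p.clt is the 0.9679 from part (b)
```

For a more strongly skewed population the two values would differ more.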
Problem 4: Same as (d), but with 12 swimmers, and the mean between 63 and 77.
In many cases of practical interest, the CLT gives good approximations—
even for rather small sample sizes such as n = 9.
We investigate this claim now for several population distributions.
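In each case, µ = 4 and σ = 2 and n = 9. (Shapes of distributions matter, not values of µ and σ.)
The one exception is the Cauchy distribution, which has no mean or variance; it is centered at 4 and is included to show a case where the CLT does not apply.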
First, we show the population distributions.
Then, the simulated distributions of sample means for n = 9.
This DRAFT code is provided for reference. It was used to produce the two pages of graphs just above. Undoubtedly, the clarity and elegance of the code can be improved.
set.seed(1212)
m = 10000; n = 9
xx = seq(-1,9,by=.01)
par(mfrow=c(2,3))
dd = dnorm(xx, 4, 2)
plot(xx,dd,type="l", lwd=2, col="blue",
ylim=c(0,.36), ylab="Density", xlab="x", main="Normal")
uu = c(-1, 4-2*sqrt(3), 4-2*sqrt(3), 4+2*sqrt(3), 4+2*sqrt(3), 9)
hh = 1/(4*sqrt(3))
dd = c(0, 0, hh, hh, 0, 0)
plot(uu,dd,type="l", lwd=2, col="blue",
ylim=c(0,.36), ylab="Density", xlab="x", main="Uniform")
dd = .5*dnorm(xx, 4-sqrt(3), 1)+.5*dnorm(xx,4+sqrt(3),1)
plot(xx,dd,type="l", lwd=2, col="blue",
ylim=c(0,.36), ylab="Density", xlab="x", main="Bimodal")
dd = dgamma(xx, 4, 1)
plot(xx,dd,type="l", lwd=2, col="blue",
ylim=c(0,.36), ylab="Density", xlab="x", main="Gamma")
dd = .5*dexp(abs(xx-4), 1/sqrt(2))
plot(xx,dd,type="l", lwd=2, col="blue",
ylim=c(0,.36), ylab="Density", xlab="x", main="Laplace")
dd = dt(xx-4, 1)
plot(xx,dd,type="l", lwd=2, col="blue",
ylim=c(0,.36), ylab="Density", xlab="x", main="Cauchy")
par(mfrow=c(1,1))
edg = 1.96*2/sqrt(n)
zz <- seq(1,7,by=.01)
dd <- dnorm(zz, 4, 2/sqrt(n))
cutp = seq(.5, 8, by=.5)
par(mfrow=c(2,3))
x = rnorm(m*n, 4, 2)
DTA = matrix(x, nrow=m)
x.bar = rowMeans(DTA)
summary(x.bar); sd(x.bar)
mean(x.bar < 4-edg); mean(x.bar > 4+edg)
hist(x.bar, breaks=cutp, prob=T, xlab="Sample Mean",
ylim=c(0,.7), main=paste("Means of",n,"from Normal"))
lines(zz,dd,col="blue")
x = runif(m*n, 4-2*sqrt(3), 4+2*sqrt(3))
DTA = matrix(x, nrow=m)
x.bar = rowMeans(DTA)
summary(x.bar); sd(x.bar)
mean(x.bar < 4-edg); mean(x.bar > 4+edg)
hist(x.bar, breaks=cutp, prob=T, xlab="Sample Mean",
ylim=c(0,.7), main=paste("Means of",n,"from Uniform"))
lines(zz,dd,col="blue")
x = 4+rnorm(m*n, sqrt(3), 1)*sample(c(-1,1),m*n,repl=T)
DTA = matrix(x, nrow=m)
x.bar = rowMeans(DTA)
summary(x.bar); sd(x.bar)
mean(x.bar < 4-edg); mean(x.bar > 4+edg)
hist(x.bar, breaks=cutp, prob=T, xlab="Sample Mean",
ylim=c(0,.7), main= paste("Means of",n,"from Bimodal"))
lines(zz,dd,col="blue")
x = rgamma(m*n, 4, 1)
DTA = matrix(x, nrow=m)
x.bar = rowMeans(DTA)
summary(x.bar); sd(x.bar)
mean(x.bar < 4-edg); mean(x.bar > 4+edg)
hist(x.bar, breaks=cutp, prob=T, xlab="Sample Mean",
ylim=c(0,.7), main= paste("Means of",n,"from Gamma"))
lines(zz,dd,col="blue")
x = 4+rexp(m*n,1/sqrt(2))*sample(c(-1,1),m*n,repl=T)
DTA = matrix(x, nrow=m)
x.bar = rowMeans(DTA)
summary(x.bar); sd(x.bar)
mean(x.bar < 4-edg); mean(x.bar > 4+edg)
hist(x.bar, breaks=cutp, prob=T, xlab="Sample Mean",
ylim=c(0,.7), main= paste("Means of",n,"from Laplace"))
lines(zz,dd,col="blue")
x = 4+rt(m*n, 1)
DTA = matrix(x, nrow=m)
x.bar = rowMeans(DTA)
summary(x.bar); sd(x.bar)
mean(x.bar < 4-edg); mean(x.bar > 4+edg)
hist(x.bar, prob=T, xlab="Sample Mean",
main=paste("Means of",n,"from Cauchy"))
par(mfrow=c(1,1))
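As one possible tidy-up of the DRAFT code, the repeated simulation steps can be wrapped in a function. This is a sketch, not the author's code: the helper name sim.means and its arguments are hypothetical, only three of the six populations are shown, and the random numbers are drawn in a different order, so the output will not match the graphs exactly.

```r
# Hypothetical helper: simulate m sample means of size n from a population sampler rpop
sim.means <- function(rpop, m = 10000, n = 9) {
  rowMeans(matrix(rpop(m*n), nrow = m))
}
set.seed(1212)
n <- 9; edg <- 1.96*2/sqrt(n)
pops <- list(
  Normal  = function(k) rnorm(k, 4, 2),
  Uniform = function(k) runif(k, 4-2*sqrt(3), 4+2*sqrt(3)),
  Gamma   = function(k) rgamma(k, 4, 1))
# For each population: SD of the sample means and the fraction falling outside 4 ± edg
res <- sapply(pops, function(rpop) {
  x.bar <- sim.means(rpop, n = n)
  c(sd = sd(x.bar), outside = mean(abs(x.bar - 4) > edg))
})
round(res, 3)     # sd should be near 2/3 and outside near 0.05 in each column
```

Passing the sampler as a function keeps the per-population code to one line each, so adding the bimodal or Laplace case means adding one more list entry.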
Statistical Application:
For data X1, X2, ..., Xn, a "95% Confidence Interval" for estimating µ is as follows:
X̄ ± 1.96 σ/√n.
This is based on the CLT and the idea that Z = (X̄ – µ)/(σ/√n) is approximately NORM(0, 1),
so that P{–1.96 < Z < 1.96} = 95%.
Problem 5: Verify this using normal tables in the text.
A few lines of algebra with inequalities give P{X̄ – 1.96 σ/√n < µ < X̄ + 1.96 σ/√n} = 95%,
so the random interval X̄ ± 1.96 σ/√n (centered at X̄ and extending 1.96 σ/√n in either direction)
covers the true, but unknown, value of µ "95% of the time."
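The "95% of the time" statement can be illustrated by simulation. The setting below (NORM(70, 7) population, n = 9, known σ) is assumed for this sketch, borrowed from the swimmer example; the seed and m are arbitrary.

```r
# Coverage of the interval (sample mean) ± 1.96 sigma/sqrt(n) when sigma is known
# Assumed setting for illustration: NORM(70, 7) population, n = 9
set.seed(1212)
m <- 100000; n <- 9; mu <- 70; sigma <- 7
x.bar <- rowMeans(matrix(rnorm(m*n, mu, sigma), nrow = m))
half <- 1.96*sigma/sqrt(n)                        # half-width of each interval
covers <- (x.bar - half < mu) & (mu < x.bar + half)
mean(covers)                                      # should be close to 0.95
```

Each simulated sample gives a different interval; the fraction of intervals that contain µ is what the 95% refers to.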
In practice, the value σ is also unknown, and it is estimated by the sample standard deviation S.
Then the confidence interval becomes
X̄ ± 2 S/√n.
Now there are two assumptions beyond the fundamental one that the data are randomly chosen:
First, that n is large enough for the CLT to give a good approximation.
Second, that n is large enough for S to be a good approximation of σ.
In practice, n = 30 is often large enough, provided the data do not have "outliers," straggling
values far away from the rest of the data.
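A sketch of this interval in R, applied to simulated data. The helper name ci2 is hypothetical, and the GAMMA(4, 1) population (true mean 4) and n = 30 are assumptions chosen for illustration.

```r
# Interval (sample mean) ± 2 S/sqrt(n), with sigma estimated by S
# The data are simulated from an assumed GAMMA(4, 1) population, so the true mean is 4
ci2 <- function(x) {                  # hypothetical helper name
  n <- length(x)
  mean(x) + c(-2, 2)*sd(x)/sqrt(n)    # lower and upper endpoints
}
set.seed(1212)
ci2(rgamma(30, 4, 1))                 # one interval from a sample of size 30
# Long-run coverage over many samples of size 30
hits <- replicate(10000, { ci <- ci2(rgamma(30, 4, 1)); ci[1] < 4 && 4 < ci[2] })
mean(hits)                            # should be roughly 0.95
```

Because both the CLT and the substitution of S for σ are approximations here, the observed coverage need not be exactly 95%.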
Problem 6: Suppose an appropriate sample of size 100 has sample mean 11.34 and standard
deviation 2.17. Find the 95% confidence interval for the population mean.