Lab 10—The Central Limit Theorem
Introduction:
The Central Limit Theorem (CLT) is one of the foundations of basic inferential statistics.
Up until now we have basically investigated descriptive statistics, where we produced
graphs and statistics and parameters and other information about data we either find or
generate. Inferential statistics, as the name implies, infers facts about a large population
based only upon facts generated from a (usually) much smaller sample. For example, if we
generate an “appropriate” sample of 100 voters in the U.S., we can do a survey,
determining the proportion of them who will vote for a specific national candidate, then,
from the sample proportion, determine, plus or minus a few percent from the sample
result, the true proportion of voters in the population of 200 million potential U.S. voters
preferring that candidate. From the microcosm (the sample) we can fairly accurately infer the
results for the complete cosmos (the population)! At first impression this seems
counterintuitive: inferring an accurate result for the big group by testing only a much smaller
group. However, the CLT gives us the ability to do this.
The CLT contains the following results, which we can check with R.
• If I start with a normal population and take multiple random samples of size n from
it, finding the sample mean (y-bar), the resulting sampling distribution of y-bars
will also be normally distributed, and centered at the same mean as the original
population the samples were taken from.
• As n increases, the standard deviation of the sampling distribution will decrease.
• The value of the standard deviation of the sampling distribution (sd(y-bar)) will
follow the formula sd(y-bar) = s/√(n), where s is the standard deviation of the
population distribution.
Showing sampling distributions of normal population
distributions stay centered at the population mean:
I want to look at the distribution of random samples of size 2, taken from the N(10, 3)
distribution, and see if the predicted results of the CLT show up. See below for code and
results from my simulation.
# taking samples of size n=2
samp1 <- c()
n <- 2
for (i in 1:200) {
  samp1[i] <- mean(rnorm(n, mean=10, sd=3))
}
hist(samp1, prob=TRUE)
lines(density(samp1))
print("values of 1, 2, 3 sd from middle and median value are shown below")
quantile(samp1, c(.0015, .025, .16, .5, .84, .975, .9985))
cat("theoretical sd=", 3/sqrt(n), "\n")
I find the mean of 200 samples of size n=2 from the N(10, 3) population, and store these
y-bars in vector samp1. I then make a histogram of samp1, with relative frequency on the
y-axis (so I can compare it with its density curve) for shape. Finally, I print out the values of
sampling distribution samp1 which lie at ± 3sd, ± 2sd, and ± 1sd from the middle value,
using the quantile() command. My last line computes the theoretical standard
deviation, which, according to the CLT, is sd(y-bar) = 3/√(2). Output is shown below.
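As a side check not in the original lab, the empirical quantiles printed by quantile() can be put next to the theoretical quantiles of the N(10, 3/√(2)) distribution, computed with qnorm() at the same probabilities. A minimal sketch (regenerating samp1 with replicate()):

```r
# sketch: empirical quantiles of the sampling distribution vs.
# theoretical N(10, 3/sqrt(2)) quantiles at the same probabilities
n <- 2
samp1 <- replicate(200, mean(rnorm(n, mean=10, sd=3)))
probs <- c(.0015, .025, .16, .5, .84, .975, .9985)
round(rbind(empirical   = quantile(samp1, probs),
            theoretical = qnorm(probs, mean=10, sd=3/sqrt(n))), 2)
```

If the CLT holds, the two rows should be close, within sampling variation.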
Below is code to generate the same information from 200 samples of size n=5, n=10, and
n=25, respectively. Resulting sampling distributions are stored in samp2, samp3, and
samp4, respectively.
# taking samples of size n=5
samp2 <- c()
n <- 5
for (i in 1:200) {
  samp2[i] <- mean(rnorm(n, mean=10, sd=3))
}
hist(samp2, prob=TRUE)
lines(density(samp2))
print("values of 1, 2, 3 sd from middle and median value are shown below")
quantile(samp2, c(.0015, .025, .16, .5, .84, .975, .9985))
cat("theoretical sd=", 3/sqrt(n), "\n")
Output is below for n=5.
Below are code and results for n=10 case, with results in samp3.
# taking samples of size n=10
samp3 <- c()
n <- 10
for (i in 1:200) {
  samp3[i] <- mean(rnorm(n, mean=10, sd=3))
}
hist(samp3, prob=TRUE)
lines(density(samp3))
print("values of 1, 2, 3 sd from middle and median value are shown below")
quantile(samp3, c(.0015, .025, .16, .5, .84, .975, .9985))
cat("theoretical sd=", 3/sqrt(n), "\n")
Output is below for samp3.
Code and results for n=25 are shown below.
# taking samples of size n=25
samp4 <- c()
n <- 25
for (i in 1:200) {
  samp4[i] <- mean(rnorm(n, mean=10, sd=3))
}
hist(samp4, prob=TRUE)
lines(density(samp4))
print("values of 1, 2, 3 sd from middle and median value are shown below")
quantile(samp4, c(.0015, .025, .16, .5, .84, .975, .9985))
cat("theoretical sd=", 3/sqrt(n), "\n")
Output is below.
After reviewing all cases, we notice the following results (which follow the CLT predictions).
• The sampling distributions (samp1, samp2, samp3, samp4) all seem to have
about the same center (10), within usual sampling variation.
• Each sampling distribution seems to have a mound or bell shape (again within
sampling variation).
• As n increases, the standard deviation of the sampling distribution decreases.
• Each sampling distribution seems to follow the formula for the theoretical value of
the sampling standard deviation fairly closely (within sampling variation), for each value
of n.
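The bullet-point observations above can also be checked with plain numbers instead of graphs. A short sketch (it regenerates the samples, so it does not depend on the vectors above):

```r
# sketch: numeric check of the CLT predictions for n = 2, 5, 10, 25
# (samples from N(10, 3), 200 sample means per n, as in this lab)
for (n in c(2, 5, 10, 25)) {
  samp <- replicate(200, mean(rnorm(n, mean=10, sd=3)))
  cat("n=", n, " mean=", round(mean(samp), 2),
      " sd=", round(sd(samp), 3),
      " theoretical sd=", round(3/sqrt(n), 3), "\n")
}
```

The printed means should hover around 10, and the sd column should track 3/√(n), within sampling variation.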
If our population distribution follows N(10, 3), then the following shows the theoretical
density curves of the original population and of the sampling distributions for n=2, n=10, and n=25.
# CLT - normal population
# density curves
n <- 1
curve(dnorm(x, mean=10, sd=3/sqrt(n)), xlim=c(0, 20), ylim=c(0, .65), ylab="density")
n <- 2
curve(dnorm(x, mean=10, sd=3/sqrt(n)), col="blue", add=TRUE)
n <- 10
curve(dnorm(x, mean=10, sd=3/sqrt(n)), col="red", add=TRUE)
n <- 25
curve(dnorm(x, mean=10, sd=3/sqrt(n)), col="green", add=TRUE)
The green curve shows n=25, red shows n=10, blue shows n=2, and black shows the original
distribution N(10, 3). The same CLT results appear in these theoretical density curves.
So, the CLT seems to predict correctly for sampling from fairly symmetric population
distributions.
Homework [1]: Check out the CLT predictions on the population which is N(6, 2), for
sampling distributions for n=2, n=6, n=15, and n=40. In your lab report state your
impressions.
CLT predictions from non-symmetric distributions:
Let us take samples from the U(3, 12) distribution. See below for samples of n=2, n=10,
n=25, and n=100 (in samp1, samp2, samp3, and samp4, respectively). Notice that the
theoretical sd for a uniform distribution U(a, b) is (b-a)/√(12), and its mean is (a+b)/2, which here is 7.5.
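As a quick side check (not part of the lab's code), the (b-a)/√(12) formula can be verified by brute force on a large batch of U(3, 12) draws:

```r
# sketch: empirical sd of many U(3,12) draws vs. (b-a)/sqrt(12)
x <- runif(1e6, min=3, max=12)
c(empirical = sd(x), theoretical = (12-3)/sqrt(12))  # both come out near 2.598
```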
# CLT from non-normal distributions
# taking samples of size n=2 from U(3,12)
samp1 <- c() ; sd.unif1 <- (12-3)/sqrt(12)
n <- 2
for (i in 1:200) {
  samp1[i] <- mean(runif(n, min=3, max=12))
}
hist(samp1, prob=TRUE)
lines(density(samp1))
print("values of 1, 2, 3 sd from middle and median value are shown below")
quantile(samp1, c(.0015, .025, .16, .5, .84, .975, .9985))
cat("theoretical sd=", sd.unif1/sqrt(n), "\n")
Output is below.
Next is the n=10 case.
# taking samples of size n=10 from U(3,12)
samp2 <- c() ; sd.unif1 <- (12-3)/sqrt(12)
n <- 10
for (i in 1:200) {
  samp2[i] <- mean(runif(n, min=3, max=12))
}
hist(samp2, prob=TRUE)
lines(density(samp2))
print("values of 1, 2, 3 sd from middle and median value are shown below")
quantile(samp2, c(.0015, .025, .16, .5, .84, .975, .9985))
cat("theoretical sd=", sd.unif1/sqrt(n), "\n")
Output is below.
Next is n=25.
# taking samples of size n=25 from U(3,12)
samp3 <- c() ; sd.unif1 <- (12-3)/sqrt(12)
n <- 25
for (i in 1:200) {
  samp3[i] <- mean(runif(n, min=3, max=12))
}
hist(samp3, prob=TRUE)
lines(density(samp3))
print("values of 1, 2, 3 sd from middle and median value are shown below")
quantile(samp3, c(.0015, .025, .16, .5, .84, .975, .9985))
cat("theoretical sd=", sd.unif1/sqrt(n), "\n")
Output is below.
Next are samples of n=100.
# taking samples of size n=100 from U(3,12)
samp4 <- c() ; sd.unif1 <- (12-3)/sqrt(12)
n <- 100
for (i in 1:200) {
  samp4[i] <- mean(runif(n, min=3, max=12))
}
hist(samp4, prob=TRUE)
lines(density(samp4))
print("values of 1, 2, 3 sd from middle and median value are shown below")
quantile(samp4, c(.0015, .025, .16, .5, .84, .975, .9985))
cat("theoretical sd=", sd.unif1/sqrt(n), "\n")
Output is below.
As we increase n, we see that the mean stays stable at about 7.5, and both the shape and
the sample standard deviation line up more and more with the bell shape and the
theoretical value of (9/√(12))/√(n) predicted by the CLT, even as we get farther from
the middle. For the largest n, the sampling distribution seems relatively normal out to
2 to 3 sd's from the center, in both shape and standard deviation values.
In summary, for sampling distributions from non-symmetric, non-normal population
distributions, the larger n is, the more the shape of the sampling distribution mimics a
bell shape, and the standard deviation approaches the theoretical ((b-a)/√(12))/√(n). This
again agrees with the predictions of the CLT.
Homework [2]: Take the exp(.35) distribution and see if you can match the predictions of the CLT,
as far as shape, center, and standard deviation. Use sampling distributions where
n=2, n=20, n=50, and n=100. Note that for the exp(λ) distribution, the theoretical mean = sd = 1/λ.
CLT predictions on binomial distributions:
If we have a population of 5000, where each member either has the “success”
characteristic or does not, and each member has a .2 probability of having it, we have a
binary population of 5000 individuals from the B(1, .2): an individual who is picked has a
20% probability of being a “1”, meaning they have the “success” characteristic. I made
such a population, shown below.
# binary population of 5000 made with p=.2
pop1 <- rbinom(5000, 1, .2)
pop1
hist(pop1)
Output is below.
So, approximately 1000 members of the population will have the “success” characteristic.
If we take random samples from this population, the CLT says:
• If the sample size is big enough (but not so big as to be more than 10% of the number
in the population), concurrent with the parameter p being “centralized” enough (not
too close to either end, like p = .001 or .995), then the sampling distribution of the
sample proportion will be bell-shaped (enough).
• The bell-shaped sampling distribution will be centered at the population proportion,
which for our B(1, .2) population is p = .2.
• The standard deviation of our bell-shaped sampling distribution is √(p(1-p)/n).
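Before running the simulations, the theoretical center and spread in the bullets above can be tabulated for the sample sizes used below (a small sketch, with p = .2 as in this lab):

```r
# sketch: theoretical sd of the sample proportion, sqrt(p(1-p)/n),
# for p = .2 and the sample sizes used in this section
p <- .2
n <- c(5, 10, 20, 30, 40, 50, 80, 100)
round(data.frame(n = n, center = p, sd.theory = sqrt(p * (1 - p) / n)), 4)
```

These are the values the simulated sampling distributions should approach, within sampling variation.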
Let us experiment with this and see if the results agree with the CLT. I will take samples of
size n=5, n=10, n=20, n=30, n=40, n=50, n=80, and n=100. Code and results are below.
#samples of size 5
vec2 <- numeric(200) ; n <- 5
for (i in 1:200) {
  vec1 <- sample(pop1, size=n, replace=TRUE)
  vec2[i] <- sum(vec1)/n
}
hist(vec2, main="n=5")
print("values of 1, 2, 3 sd from middle and median value are shown below")
quantile(vec2, c(.0015, .025, .16, .5, .84, .975, .9985))
sdtheory <- sqrt(.2*.8/n)
cat("theoretical sd is", sdtheory, "\n")
Results are below (histograms and quantile output for n=5, n=10, n=20, n=30, n=40, n=50, n=80, and n=100 appear here).
As before, it seems that after a certain n, the sampling distribution starts looking more
normally distributed, centered at the population value of 0.2, with its standard
deviation coming closer to the theoretical √(p(1-p)/n) value.
Homework [3]: Generate a population of 5000 from a B(1, .78), where the probability
of “success” is .78, then obtain sampling distributions from this population of various
sizes, settling on a size of n which is just “good enough” to satisfy a decent bell-shape
and conformity with the theoretical standard deviation. Explain your results in the
lab report.