Lab 10 — The Central Limit Theorem

Introduction: The Central Limit Theorem (CLT) is one of the foundations of basic inferential statistics. Up until now we have mostly investigated descriptive statistics, producing graphs, statistics, parameters, and other information about data we either find or generate. Inferential statistics, as the name implies, infers facts about a large population based only upon facts generated from a (usually) much smaller sample. For example, if we generate an "appropriate" sample of 100 voters in the U.S., we can do a survey determining the proportion of them who will vote for a specific national candidate; then, from the sample proportion, we can determine, plus or minus a few percent from the sample result, the true proportion of voters preferring that candidate in the population of 200 million potential U.S. voters. From the microcosm (sample) we can fairly accurately infer the results of the complete cosmos (population)! At first impression it seems counterintuitive to infer an accurate result for the big group by only doing a test on a much smaller group, but the CLT gives us the ability to do this. The CLT contains the following results, which we can check with R.

• If I start with a normal population and take multiple random samples of size n from it, finding the sample mean (y-bar), the resulting sampling distribution of y-bars will also be normally distributed, and centered at the same mean as the original population the samples were taken from.
• As n increases, the standard deviation of the sampling distribution will decrease.
• The standard deviation of the sampling distribution (sd(y-bar)) will follow the formula sd(y-bar) = s/√(n), where s is the standard deviation of the population distribution.
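As a quick numeric check of that last formula, before the fuller simulations, we can compare the empirical standard deviation of many sample means against s/√(n). This is only a minimal sketch (the seed and counts are arbitrary choices, not part of the lab):

```r
# quick check of sd(y-bar) = s/sqrt(n) for a N(10, 3) population
set.seed(1)                 # arbitrary seed, for reproducibility
n <- 25
# 10000 sample means of samples of size n
ybars <- replicate(10000, mean(rnorm(n, mean=10, sd=3)))
cat("empirical sd of y-bars:", sd(ybars), "\n")
cat("theoretical s/sqrt(n): ", 3/sqrt(n), "\n")
```

Both numbers should come out close to 0.6, and the empirical mean of the y-bars should sit near the population mean of 10.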
Showing sampling distributions of normal population distributions stay centered at the population mean:

I want to look at the distribution of random samples of size 2, taken from the N(10, 3) distribution, and see if the predicted results of the CLT show up. See below for code and results from my simulation.

# taking samples of size n=2
samp1 <- c()
n <- 2
for (i in 1:200) {
  samp1[i] <- mean(rnorm(n, mean=10, sd=3))
}
hist(samp1, prob=TRUE)
lines(density(samp1))
print("values of 1, 2, 3 sd from middle and median value are shown below")
quantile(samp1, c(.0015, .025, .16, .5, .84, .975, .9985))
cat("theoretical sd=", 3/sqrt(n), "\n")

I find the mean of each of 200 samples of size n=2 from the N(10, 3) population, and store these y-bars in vector samp1. I then make a histogram of samp1, with relative frequency on the y axis, so I can compare its shape with its density curve. Finally, I print out the values of the sampling distribution samp1 which lie at ±3sd, ±2sd, and ±1sd from the middle value, using the quantile() command. My last line computes the theoretical standard deviation, which according to the CLT is sd(y-bar) = 3/√(2). Output is shown below.

Below is code to generate the same information from 200 samples of size n=5, n=10, and n=25, respectively. The resulting sampling distributions are stored in samp2, samp3, and samp4, respectively.

# taking samples of size n=5
samp2 <- c()
n <- 5
for (i in 1:200) {
  samp2[i] <- mean(rnorm(n, mean=10, sd=3))
}
hist(samp2, prob=TRUE)
lines(density(samp2))
print("values of 1, 2, 3 sd from middle and median value are shown below")
quantile(samp2, c(.0015, .025, .16, .5, .84, .975, .9985))
cat("theoretical sd=", 3/sqrt(n), "\n")

Output is below for n=5.

Below are code and results for the n=10 case, with results in samp3.
# taking samples of size n=10
samp3 <- c()
n <- 10
for (i in 1:200) {
  samp3[i] <- mean(rnorm(n, mean=10, sd=3))
}
hist(samp3, prob=TRUE)
lines(density(samp3))
print("values of 1, 2, 3 sd from middle and median value are shown below")
quantile(samp3, c(.0015, .025, .16, .5, .84, .975, .9985))
cat("theoretical sd=", 3/sqrt(n), "\n")

Output is below for samp3.

Code and results for n=25 are shown below.

# taking samples of size n=25
samp4 <- c()
n <- 25
for (i in 1:200) {
  samp4[i] <- mean(rnorm(n, mean=10, sd=3))
}
hist(samp4, prob=TRUE)
lines(density(samp4))
print("values of 1, 2, 3 sd from middle and median value are shown below")
quantile(samp4, c(.0015, .025, .16, .5, .84, .975, .9985))
cat("theoretical sd=", 3/sqrt(n), "\n")

Output is below.

After reviewing all cases, we notice the following results, which follow the CLT predictions.

• The sampling distributions (samp1, samp2, samp3, samp4) all seem to have about the same center (10), within usual sampling variation.
• Each sampling distribution seems to have a mound or bell shape (again within sampling variation).
• As n increases, the sampling distribution standard deviation decreases.
• Each sampling distribution seems to follow the formula for the theoretical sampling standard deviation fairly closely (within sampling variation), for each value of n.

If our population distribution follows N(10, 3), then the following shows the theoretical density plots of the original population and of the sampling distributions for n=2, n=10, and n=25.
# CLT - normal population
# density curves
n <- 1
curve(dnorm(x, mean=10, sd=3/sqrt(n)), xlim=c(0, 20), ylim=c(0, .65), ylab="density")
n <- 2
curve(dnorm(x, mean=10, sd=3/sqrt(n)), col="blue", add=TRUE)
n <- 10
curve(dnorm(x, mean=10, sd=3/sqrt(n)), col="red", add=TRUE)
n <- 25
curve(dnorm(x, mean=10, sd=3/sqrt(n)), col="green", add=TRUE)

The green curve shows n=25, red shows n=10, blue shows n=2, and black shows the original distribution N(10, 3). The same CLT results show up on these theoretical density plots. So the CLT seems to predict correctly for sampling from fairly symmetric population distributions.

Homework [1]: Check out the CLT predictions on the population which is N(6, 2), using sampling distributions for n=2, n=6, n=15, and n=40. In your lab report state your impressions.

CLT predictions from non-symmetric distributions:

Let us take samples from the U(3, 12) distribution. See below for samples of size n=2, n=10, n=25, and n=100 (in samp1, samp2, samp3, and samp4, respectively). Notice that the theoretical sd for a uniform distribution on (a, b) is (b-a)/√(12).

# CLT from non-normal distributions
# taking samples of size n=2 from U(3,12)
samp1 <- c() ; sd.unif1 <- (12-3)/sqrt(12)
n <- 2
for (i in 1:200) {
  samp1[i] <- mean(runif(n, min=3, max=12))
}
hist(samp1, prob=TRUE)
lines(density(samp1))
print("values of 1, 2, 3 sd from middle and median value are shown below")
quantile(samp1, c(.0015, .025, .16, .5, .84, .975, .9985))
cat("theoretical sd=", sd.unif1/sqrt(n), "\n")

Output is below.

Next is n=10.

# taking samples of size n=10 from U(3,12)
samp2 <- c() ; sd.unif1 <- (12-3)/sqrt(12)
n <- 10
for (i in 1:200) {
  samp2[i] <- mean(runif(n, min=3, max=12))
}
hist(samp2, prob=TRUE)
lines(density(samp2))
print("values of 1, 2, 3 sd from middle and median value are shown below")
quantile(samp2, c(.0015, .025, .16, .5, .84, .975, .9985))
cat("theoretical sd=", sd.unif1/sqrt(n), "\n")

Output is below.

Next is n=25.
# taking samples of size n=25 from U(3,12)
samp3 <- c() ; sd.unif1 <- (12-3)/sqrt(12)
n <- 25
for (i in 1:200) {
  samp3[i] <- mean(runif(n, min=3, max=12))
}
hist(samp3, prob=TRUE)
lines(density(samp3))
print("values of 1, 2, 3 sd from middle and median value are shown below")
quantile(samp3, c(.0015, .025, .16, .5, .84, .975, .9985))
cat("theoretical sd=", sd.unif1/sqrt(n), "\n")

Output is below.

Next is samples of size n=100.

# taking samples of size n=100 from U(3,12)
samp4 <- c() ; sd.unif1 <- (12-3)/sqrt(12)
n <- 100
for (i in 1:200) {
  samp4[i] <- mean(runif(n, min=3, max=12))
}
hist(samp4, prob=TRUE)
lines(density(samp4))
print("values of 1, 2, 3 sd from middle and median value are shown below")
quantile(samp4, c(.0015, .025, .16, .5, .84, .975, .9985))
cat("theoretical sd=", sd.unif1/sqrt(n), "\n")

Output is below.

As we increase n, we see that the mean stays stable at about 7.5, while the shape and the sample standard deviation numbers line up more and more with the bell shape and the theoretical value of (9/√(12))/√(n) predicted by the CLT, even as we get farther from the middle. For the largest n, the sampling distribution looks relatively normal out to 2 to 3 sd's, in both shape and standard deviation values. In summary, for sampling distributions from non-symmetric, non-normal population distributions, the larger n is, the more the shape of the sampling distribution mimics a bell shape, and the closer the standard deviation comes to the theoretical ((b-a)/√(12))/√(n). This again agrees with the predictions of the CLT.

Homework [2]: Take the exp(.35) distribution and see if you can match the predictions of the CLT, as far as shape, center, and standard deviation. Use sampling distributions with n=2, n=20, n=50, and n=100. Note that for the exp(λ) distribution, the theoretical mean = sd = 1/λ.
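As a quick sanity check of that last note (a minimal sketch, not a solution to the homework), the mean and sd of one large exp(.35) sample should both land near 1/.35 ≈ 2.857:

```r
# checking that mean = sd = 1/lambda for an exponential distribution
set.seed(1)                 # arbitrary seed, for reproducibility
lambda <- .35
x <- rexp(100000, rate=lambda)
cat("1/lambda:    ", 1/lambda, "\n")
cat("sample mean: ", mean(x), "\n")
cat("sample sd:   ", sd(x), "\n")
```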
CLT predictions on binomial distributions:

Suppose we have a population of 5000, where each member either has the "success" characteristic or does not, and each member has a .2 probability of having it. Then we have a binary population of 5000 individuals from the B(1, .2) distribution, where a picked individual has a 20% probability of being a "1", meaning they have the "success" characteristic. I made such a population as shown below.

# population made from pop with p=.2
pop1 <- rbinom(5000, 1, .2)
pop1
hist(pop1)

Output is below.

So, approximately 1000 members of the population will have the "success" characteristic. If we take random samples from this population, the CLT says:

• If the sample size is big enough (but not so big as to be more than 10% of the number in the population), concurrent with the parameter p being "centralized" enough (not too close to either end, like p = .001 or .995), then the sampling distribution of the sample proportion will be bell-shaped (enough).
• The bell-shaped sampling distribution will be centered at the population proportion, which for our B(1, .2) population is p = .2.
• The standard deviation of our bell-shaped sampling distribution is √(p(1-p)/n).

Let us experiment with this and see if we can agree with the CLT. I will take samples of size n=5, n=10, n=20, n=30, n=40, and n=50. Code and results are below.

# samples of size 5
vec2 <- numeric(200) ; n <- 5
for (i in 1:200) {
  vec1 <- sample(pop1, size=n, replace=TRUE)
  vec2[i] <- sum(vec1)/n
}
hist(vec2, main="n=5")
print("values of 1, 2, 3 sd from middle and median value are shown below")
quantile(vec2, c(.0015, .025, .16, .5, .84, .975, .9985))
sdtheory <- sqrt(.2*.8/n)
cat("theoretical sd is", sdtheory, "\n")

Results are below.
[Histograms and quantile output for n=5, n=10, n=20, n=30, n=40, n=50, n=80, and n=100 are shown here.]

As before, it seems that after a certain n, the sampling distribution starts looking more normally distributed, centered at the population proportion of 0.2, and with standard deviation coming closer to the theoretical √(p(1-p)/n) value.

Homework [3]: Generate a population of 5000 from a B(1, .78), where the probability of "success" is .78, then obtain sampling distributions from this population of various sizes, settling on a size n which is just "good enough" to give a decent bell shape and conformity with the theoretical standard deviation. Explain your results in the lab report.
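When judging "good enough" in exercises like this, it helps to have the theoretical proportion standard deviations in front of you. The following minimal sketch (the choice of n values just mirrors the ones used above) tabulates √(p(1-p)/n) for p = .2:

```r
# theoretical sd of the sample proportion, sqrt(p(1-p)/n), for p = .2
p <- .2
n <- c(5, 10, 20, 30, 40, 50, 80, 100)
sdtheory <- sqrt(p * (1 - p) / n)
data.frame(n = n, sd.theory = round(sdtheory, 4))
```

Changing p to .78 gives the corresponding reference values for Homework [3]; note that sdtheory shrinks like 1/√(n), so doubling n only shrinks the sd by a factor of about 1.4.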