Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
History of statistics wikipedia , lookup
Confidence interval wikipedia , lookup
Bootstrapping (statistics) wikipedia , lookup
Taylor's law wikipedia , lookup
Opinion poll wikipedia , lookup
Sampling (statistics) wikipedia , lookup
Resampling (statistics) wikipedia , lookup
German tank problem wikipedia , lookup
Stats 120A Review of CIs, hypothesis tests and more Sample/Population • Last time we collected height/armspan data. Is this a sample or a population? Gallup Poll, 1/9/07 "As you may know, the Bush administration is considering a temporary but significant increase in the number of U.S. troops in Iraq to help stabilize the situation there. Would you favor or oppose this?" Results • Results based on 1004 randomly selected adults (> 18 years) interviewed Jan 5-7, 2007. • 61% are opposed. • "For results based on this sample, one can say with 95% confidence that the maximum error attributable to sampling and other random effects is ±3 percentage points. " Pop Quiz • Is the value 61% a statistic or a parameter? • The margin of error is given as 3%. What does the margin of error measure? a) the variability in the sample b) the variability in the population c) the variability in repeated sampling Sampling paradigm • In the U.S., the proportion of adults who are opposed to a surge is p, (or p*100%). • We take a random sample of n = 1004. • The proportion of our sample ("p hat") is an estimate of the proportion in the population. A simulation: • Choose a value to serve as p (say p = .6) • Our "data" consist of 1004 numbers: 0's represent those in favor, 1's are those opposed. • x = 589 out of 1004 say "opposed", so p-hat = 589/1004 = .5866 • mean(x) = .5866 • sd(x) = .4926 xbar=.5866, s = .493 How do we know sample proportion is a good estimate of population proportion? • Law of Large Numbers: sample averages (and proportions) converge on population values •implying that for finite values, the sample proportion might be close if the sample size is large Coin flips: sample proportion "settles down" to 0.5 So if we stop earlier, say n = 10 p-hat = .60 Which raises the question: • If we stop early, how far away will our sample proportion be from the true value? • Or, in a survey setting, if we take a finite sample of n=1004, how far off from the population proportion are we likely to be? A simulation might help: • • • • Assume p = .60 (population proportion) Take sample of n = 1004 and find p-hat. Save this value Repeat above 3 steps 10000 times. The R code (for the record) • phat <- c() for (i in 1:10000){ x <- sample(c(0,1),1004,replace=T,prob=c(.4, .6)) temp <- sum(x)/1004 phat <- c(phat,temp)} • hist(phat) each dot represents one survey of 1004 people 10,000 sample proportions, n = 1004 Observe that... • sample proportions are centered on the true population value: p = .60 • variability is not great: smallest is .54, biggest is .66 • distribution is bellshaped We've just witnessed the Central Limit Theorem If samples are independent and random and sufficiently large • means (and proportions) follow a nearly Normal distribution • the mean of the Normal is the mean of the population • the SD of the Normal (aka the standard error) is the population SD divided by sqrt(n) CLT applied to sample proportions • • • • phat is distributed with an approx Normal mean is p SE is sqrt(p*(1-p)/n) For our simulation, p = .60 so our p-hats will be centered on .6 with a SD of sqrt(.6*.4/1004) = 0.0155 We saw • Normal • mean(phat) = 0.600 (expected .6) • sd(phat) = 0.01554 (expected 0.0155) In practice, we don't know p but we can get a good approximation to the standard error using sqrt(phat * (1-phat)/n) rather than sqrt(p*(1-p)/n) So if we take a random sample of n = 1004 and we see p-hat = .61, we know that: • The true value of p can't be far away. SE = sqrt(.61*.39/1004) = 0.0154 •So 68% of the time we do this, p will be within 0.0154 of phat •And 95% of the time it will be with 2*.0154 = 0.03 Which leads us to conclude that the true proportion of the population that opposes a surge is somewhere in the interval .61 - .03 = 0.58 to .61+.03 = 0.64 Confidence intervals • This is an example of a 95% confidence interval. • Because 95% of all samples will produce a p-hat that is within 2 standard errors of the true value, we are 95% confident that ours is a "good" interval. Formula A 95% CI for a proportion is estimate +/- 2 * (Standard Error) p-hat +/- 2*sqrt(phat*(1-phat)/n) 0.61 +/- 2*sqrt(.61*.39/1004) (.58, .64) note: our replacing phat for p in SE means we get an approximate value What does 95% mean? • If we repeat this infinitely many times: – take a sample of n = 1004 from population – calculate sample proportion – find an interval using +/- 2 * SE • then 95% of these CIs will contain the truth and 5% will not. • We see only one: (.58, .64). It is either good or bad, but we are confident it is good. Where did the 95% come from? • It came from the normal curve. • The CLT told us that p-hat followed a (approx) normal distribution. • For Normal's, 68% of probability is within 1 standard deviation of mean, 95% within 2, 99.7% within 3. • A normal table gives other probabilities Change confidence level by changing the width of margin of error -0.015 +.015 1 SE 68% 2 SEs 95% 3 SEs 99.7% 90% 1.6 SE phat =0.61 The CLT applies to • any linear combination of the observations • assuming observations are randomly sampled, and independent • it does NOT matter what the distribution of the population looks like • if n is small, the distribution will be only approximately normal, and this might be a very poor approximation the CLT does NOT apply to • non-linear combinations, such as the sample median or the standard deviation • non-random samples • samples that are dependent simulation • http://onlinestatbook.com/stat_sim/sampling _dist/index.html Summary • Confidence Level is a statement about the sampling process, not the sample • Margin of error is determined to achieve the desired confidence level • We can calculate the confidence level only if we know the sampling distribution: the probability distribution of the sample Pop Quiz • Is the value 61% a statistic or a parameter? • The margin of error is given as 3%. What does the margin of error measure? a) the variability in the sample b) the variability in the population c) the variability in repeated sampling Pop Quiz • Is the value 61% a statistic or a parameter? • The margin of error is given as 3%. What does the margin of error measure? a) the variability in the sample b) the variability in the population c) the variability in repeated sampling For next time: • In WWII, German army produced tanks with sequential serial numbers. The allies captured a few tanks, and wanted to infer the total number of tanks produced. • Suppose you had captured 10 tanks. Come up with three estimators for the total number of tanks. • Data: 911 5146 6083 944 11944 9365 6087 6647 7076 12275