Download Document

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

History of statistics wikipedia , lookup

Confidence interval wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

Taylor's law wikipedia , lookup

Opinion poll wikipedia , lookup

Sampling (statistics) wikipedia , lookup

Resampling (statistics) wikipedia , lookup

German tank problem wikipedia , lookup

Misuse of statistics wikipedia , lookup

Student's t-test wikipedia , lookup

Transcript
Stats 120A
Review of CIs, hypothesis tests and
more
Sample/Population
• Last time we collected height/armspan data.
Is this a sample or a population?
Gallup Poll, 1/9/07
"As you may know, the Bush
administration is considering a
temporary but significant increase
in the number of U.S. troops in
Iraq to help stabilize the situation
there. Would you favor or oppose
this?"
Results
• Results based on 1004 randomly selected
adults (> 18 years) interviewed Jan 5-7,
2007.
• 61% are opposed.
• "For results based on this sample, one
can say with 95% confidence that the
maximum error attributable to sampling
and other random effects is ±3
percentage points. "
Pop Quiz
• Is the value 61% a statistic or a parameter?
• The margin of error is given as 3%. What
does the margin of error measure?
a) the variability in the sample
b) the variability in the population
c) the variability in repeated sampling
Sampling paradigm
• In the U.S., the proportion of adults who are
opposed to a surge is p, (or p*100%).
• We take a random sample of n = 1004.
• The proportion of our sample ("p hat") is an
estimate of the proportion in the population.
A simulation:
• Choose a value to serve as p (say p = .6)
• Our "data" consist of 1004 numbers: 0's
represent those in favor, 1's are those
opposed.
• x = 589 out of 1004 say "opposed", so p-hat
= 589/1004 = .5866
• mean(x) = .5866
• sd(x) = .4926
xbar=.5866, s = .493
How do we know sample
proportion is a good estimate of
population proportion?
• Law of Large Numbers:
sample averages (and proportions) converge
on population values
•implying that for finite values, the sample
proportion might be close if the sample
size is large
Coin flips: sample proportion
"settles down" to 0.5
So if we stop earlier, say n = 10
p-hat = .60
Which raises the question:
• If we stop early, how far away will our
sample proportion be from the true value?
• Or, in a survey setting, if we take a finite
sample of n=1004, how far off from the
population proportion are we likely to be?
A simulation might help:
•
•
•
•
Assume p = .60 (population proportion)
Take sample of n = 1004 and find p-hat.
Save this value
Repeat above 3 steps 10000 times.
The R code (for the record)
• phat <- c()
for (i in 1:10000){
x <- sample(c(0,1),1004,replace=T,prob=c(.4, .6))
temp <- sum(x)/1004
phat <- c(phat,temp)}
• hist(phat)
each dot represents one survey of
1004 people
10,000 sample proportions, n =
1004
Observe that...
• sample proportions are
centered on the true
population value: p =
.60
• variability is not great:
smallest is .54, biggest
is .66
• distribution is bellshaped
We've just witnessed the Central
Limit Theorem
If samples are independent and random and
sufficiently large
• means (and proportions) follow a nearly
Normal distribution
• the mean of the Normal is the mean of the
population
• the SD of the Normal (aka the standard
error) is the population SD divided by
sqrt(n)
CLT applied to sample
proportions
•
•
•
•
phat is distributed with an approx Normal
mean is p
SE is sqrt(p*(1-p)/n)
For our simulation, p = .60 so our p-hats
will be centered on .6 with a SD of
sqrt(.6*.4/1004) = 0.0155
We saw
• Normal
• mean(phat) = 0.600
(expected .6)
• sd(phat) = 0.01554
(expected 0.0155)
In practice, we don't know p
but we can get a good approximation to the
standard error using
sqrt(phat * (1-phat)/n)
rather than
sqrt(p*(1-p)/n)
So if we take a random sample of
n = 1004
and we see p-hat = .61, we know that:
• The true value of p can't be far away.
SE = sqrt(.61*.39/1004) = 0.0154
•So 68% of the time we do this, p will be
within 0.0154 of phat
•And 95% of the time it will be with 2*.0154
= 0.03
Which leads us to conclude
that the true proportion of the population that
opposes a surge is somewhere in the
interval
.61 - .03 = 0.58
to .61+.03 = 0.64
Confidence intervals
• This is an example of a 95% confidence
interval.
• Because 95% of all samples will produce a
p-hat that is within 2 standard errors of the
true value, we are 95% confident that ours
is a "good" interval.
Formula
A 95% CI for a proportion is
estimate +/- 2 * (Standard Error)
p-hat +/- 2*sqrt(phat*(1-phat)/n)
0.61 +/- 2*sqrt(.61*.39/1004)
(.58, .64)
note: our replacing phat for p in SE means we
get an approximate value
What does 95% mean?
• If we repeat this infinitely many times:
– take a sample of n = 1004 from population
– calculate sample proportion
– find an interval using +/- 2 * SE
• then 95% of these CIs will contain the truth
and 5% will not.
• We see only one: (.58, .64). It is either good
or bad, but we are confident it is good.
Where did the 95% come from?
• It came from the normal curve.
• The CLT told us that p-hat followed a
(approx) normal distribution.
• For Normal's, 68% of probability is within 1
standard deviation of mean, 95% within 2,
99.7% within 3.
• A normal table gives other probabilities
Change confidence level by
changing the width of margin of
error
-0.015
+.015
1 SE
68%
2 SEs
95%
3 SEs
99.7%
90%
1.6 SE
phat
=0.61
The CLT applies to
• any linear combination of the observations
• assuming observations are randomly
sampled, and independent
• it does NOT matter what the distribution of
the population looks like
• if n is small, the distribution will be only
approximately normal, and this might be a
very poor approximation
the CLT does NOT apply to
• non-linear combinations, such as the sample
median or the standard deviation
• non-random samples
• samples that are dependent
simulation
• http://onlinestatbook.com/stat_sim/sampling
_dist/index.html
Summary
• Confidence Level is a statement about the
sampling process, not the sample
• Margin of error is determined to achieve the
desired confidence level
• We can calculate the confidence level only
if we know the sampling distribution: the
probability distribution of the sample
Pop Quiz
• Is the value 61% a statistic or a parameter?
• The margin of error is given as 3%. What
does the margin of error measure?
a) the variability in the sample
b) the variability in the population
c) the variability in repeated sampling
Pop Quiz
• Is the value 61% a statistic or a parameter?
• The margin of error is given as 3%. What
does the margin of error measure?
a) the variability in the sample
b) the variability in the population
c) the variability in repeated sampling
For next time:
• In WWII, German army produced tanks
with sequential serial numbers. The allies
captured a few tanks, and wanted to infer
the total number of tanks produced.
• Suppose you had captured 10 tanks. Come
up with three estimators for the total
number of tanks.
• Data: 911 5146 6083 944 11944 9365
6087 6647 7076 12275