Download Stat 475/920 Notes 3

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

History of statistics wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

Resampling (statistics) wikipedia , lookup

Taylor's law wikipedia , lookup

Misuse of statistics wikipedia , lookup

Student's t-test wikipedia , lookup

Transcript
Stat 475/920 Notes 3
Reading: Lohr, Chapter 2.5-2.8
I. Sample Size Estimation
An important part of planning a survey is choosing the sample
size. Observations cost money. If the sample is too large, time
and talent are wasted. Conversely, if the number of observations
included in the sample is too small, we have bought inadequate
information for the time and effort expended and again have
been wasteful.
The following are the general steps needed to estimate the
sample size:
1. Specify the tolerable error: Ask “What is expected of the
sample, and how much precision do I need?” What are the
consequences of the sample results and how much error is
tolerable?
2. Find an equation relating the sample size n and the tolerable
error.
3. Estimate any unknown quantities and solve for n.
Step 1: Specify the tolerable error.
1
For the sample mean, the desired precision is often expressed in
absolute terms as
P(| y  yU | e)  0.95 ,
i.e., we want to have a 95% chance that the difference between
the sample mean and the population mean is at most e. For
surveys in which a proportion is measured, e is often chosen to
be 0.03. e is sometimes called the margin of error. The margin
of error is half the width of a 95% confidence interval.
Sometimes, we would like to achieve a desired relative
precision. In this case the precision may be expressed as
 y  yU

P 
 e   0.95
y
U


Step 2: Find an equation relating the sample size n and the
tolerable error.
Suppose we want P(| y  yU | e)  0.95 . A large sample
approximate 95% confidence interval for yU is
y  1.96
S
n
1
N ,
n

S
n
S
n 
i.e., P  y  1.96 n 1  N  yU  y  1.96 n 1  N   0.95


or equivalently,

S
n 
P  | y  yU | 1.96
1    0.95
N
n

2
Thus, our equation for the sample size is
S
n
1.96
1  e
N
n
1.962 S 2
n
1.962 S 2
for which
 e2
N
For obtaining a specified relative precision, the equation is
S
n
1.96
1   e | yU |
N
n
Step 3: Estimate unknown quantities
To use the above equations to estimate the needed sample size,
we need an estimate of the population standard deviation S .
For proportions, S 
p(1  p)
N
N  1 , which is approximately
p(1  p) for large populations. The maximum value of
p(1  p) is 0.5.
3
For proportions, we can conservatively choose S to be 0.5.
Example 1: Suppose we want to estimate the proportion of
recipes in the Better Homes and Garden New Cook Book that do
not involve animal products. We plan to take a simple random
sample of the N  1251 recipes, and we want a margin of error
of at most 0.03. Then using the conservative estimate of S of
0.5, we want to solve
0.5
n
1.96
1
 0.03
1251
n
Solving for n gives n  576 .
Example 2: Consider a poll of a large population for which we
want the margin of error to be at most 0.03. For a large
population, the finite population correction will be about 1 and
we can solve
0.5
1.96
 0.03 ,
n
4
n  1067 .
For quantities other than proportions, S must be estimated or
guessed at. Some methods for estimating S include:
1. Use sample quantities obtained from a pilot sample. A pilot
sample is a small sample taken to provide information and
guidance for the design of the main survey, can be used to
estimate quantities needed for setting the sample size.
2. Use previous studies or data available in the literature.
3. If nothing else is available, guess the variance. Sometimes a
hypothesized distribution of the data will give us information
about the variance. For example, if you believe the population
will be normally distributed, you may not know what the
variance is, but you may have an idea of the range of the data.
You could then estimate S by range/4 or range/6 becaues
approximately 95% of the values from a normal population are
within 2 standard deviations of the mean and 99.7% of the
values are within 3 standard deviation of the mean.
Example of a pilot sample: Consider the Vietnam Veteran Agent
Orange study from Sample 2. Suppose we want to obtain a
margin of error of 0.25. We take a pilot sample of size 20 and
estimate the standard deviation S .
# Pilot sample for Agent Orange data
agent.orange.data=read.table("agent_orange_data.txt",header=TRUE);
dioxin=agent.orange.data$dioxin;
N=646;
5
pilotsamplesize=20;
pilotsample=sample(N,pilotsamplesize,replace=FALSE);
s.pilotsample=sd(dioxin[pilotsample]);
> s.pilotsample
[1] 1.832456
We now find the sample size that would make the margin of
error 0.25 if the population standard deviation was 1.83.
1.962 S 2
1.962 *1.832
n

 156.11
2
2
1.962 S 2
1.96
*1.83
 e2
 0.252
N
646
We set the sample size to be 157. We include the 20
observations already chosen and then choose 157-20=137 units
from the remaining population.
# Sample the remaining needed observations from the population that
# the units already sampled in the pilot sample
full.sample.size=ceiling(nhat);
remaining.sample.size=full.sample.size-pilotsamplesize;
remainingsample=sample(seq(1,N,1)[pilotsample],remaining.sample.size,replace=FALSE);
fullsample=c(dioxin[pilotsample],dioxin[remainingsample]);
mean.fullsample=mean(fullsample);
s.fullsample=sd(fullsample);
se.ybar=sqrt((s.fullsample^2/full.sample.size)*(1-full.sample.size/N));
lowerci=mean.fullsample-1.96*se.ybar;
upperci=mean.fullsample+1.96*se.ybar;
> s.fullsample
[1] 1.981706
> full.sample.size
6
[1] 157
> nhat
[1] 156.4195
> mean(fullsample)
[1] 4.216561
> se.ybar
[1] 0.1376029
> lowerci
[1] 3.946859
> upperci
[1] 4.486262
Recall that the population mean and standard deviation are:
> mean(dioxin)
[1] 4.260062
> sd(dioxin)
[1] 2.642617
Using the pilot sample, we found a confidence interval of width
4.49-3.95=0.54, which corresponds to a margin of error of 0.27.
This is about what we were hoping for, a margin of error of
0.25.
Note, we got a little lucky with our sample as because the
standard deviation in the pilot sample, 1.83, was less than the
population standard deviation, 2.64, we took a smaller sample
size, 157 units, than we would have done if we knew the
population standard deviation. If we knew the population
standard deviation, we would have taken a sample size of
1.962 S 2
1.962 * 2.642
n

 257.58
2 2
2
2
1.96 S
1.96 * 2.64
 e2
 0.252
N
646
7
II. Derivation of Mean and Variance of Sample Mean (Chapter
2.7)
In Notes 2, we claimed the results that for a simple random
sample, the sampling distribution of the sample mean under
simple random sampling has the following properties
Mean: E ( y )  yU .
The sample mean is an unbiased estimator of the population
mean.
S2 
n
Var
(
y
)

1



Variance:
n  N .
We will now derive these results. We do not make any
assumptions about the population in deriving the results; we use
only the fact that we have taken a simple random sample. This
is called the design based theory or randomization theory.
8
9
10
III. Model Based Approach and Superpopulations
In the design based theory, we do not make any assumptions
about the population. In more complicated settings, it is
sometimes useful to consider a model for the population and
also to consider the finite population as being randomly drawn
from an infinite superpopulation, e.g.,
y1 , , yN iid N (  ,  2 )
We then take a probability sample from the finite population
y1 , , yN .
In the model based approach, we might be interested in either
the superpopulation mean  or the finite population mean
1 N
 yi
N i 1 .
11