* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Stat 475/920 Notes 3
Survey
Document related concepts
Transcript
Stat 475/920 Notes 3 Reading: Lohr, Chapter 2.5-2.8 I. Sample Size Estimation An important part of planning a survey is choosing the sample size. Observations cost money. If the sample is too large, time and talent are wasted. Conversely, if the number of observations included in the sample is too small, we have bought inadequate information for the time and effort expended and again have been wasteful. The following are the general steps needed to estimate the sample size: 1. Specify the tolerable error: Ask “What is expected of the sample, and how much precision do I need?” What are the consequences of the sample results and how much error is tolerable? 2. Find an equation relating the sample size n and the tolerable error. 3. Estimate any unknown quantities and solve for n. Step 1: Specify the tolerable error. 1 For the sample mean, the desired precision is often expressed in absolute terms as P(| y yU | e) 0.95 , i.e., we want to have a 95% chance that the difference between the sample mean and the population mean is at most e. For surveys in which a proportion is measured, e is often chosen to be 0.03. e is sometimes called the margin of error. The margin of error is half the width of a 95% confidence interval. Sometimes, we would like to achieve a desired relative precision. In this case the precision may be expressed as y yU P e 0.95 y U Step 2: Find an equation relating the sample size n and the tolerable error. Suppose we want P(| y yU | e) 0.95 . A large sample approximate 95% confidence interval for yU is y 1.96 S n 1 N , n S n S n i.e., P y 1.96 n 1 N yU y 1.96 n 1 N 0.95 or equivalently, S n P | y yU | 1.96 1 0.95 N n 2 Thus, our equation for the sample size is S n 1.96 1 e N n 1.962 S 2 n 1.962 S 2 for which e2 N For obtaining a specified relative precision, the equation is S n 1.96 1 e | yU | N n Step 3: Estimate unknown quantities To use the above equations to estimate the needed sample size, we need an estimate of the population standard deviation S . For proportions, S p(1 p) N N 1 , which is approximately p(1 p) for large populations. The maximum value of p(1 p) is 0.5. 3 For proportions, we can conservatively choose S to be 0.5. Example 1: Suppose we want to estimate the proportion of recipes in the Better Homes and Garden New Cook Book that do not involve animal products. We plan to take a simple random sample of the N 1251 recipes, and we want a margin of error of at most 0.03. Then using the conservative estimate of S of 0.5, we want to solve 0.5 n 1.96 1 0.03 1251 n Solving for n gives n 576 . Example 2: Consider a poll of a large population for which we want the margin of error to be at most 0.03. For a large population, the finite population correction will be about 1 and we can solve 0.5 1.96 0.03 , n 4 n 1067 . For quantities other than proportions, S must be estimated or guessed at. Some methods for estimating S include: 1. Use sample quantities obtained from a pilot sample. A pilot sample is a small sample taken to provide information and guidance for the design of the main survey, can be used to estimate quantities needed for setting the sample size. 2. Use previous studies or data available in the literature. 3. If nothing else is available, guess the variance. Sometimes a hypothesized distribution of the data will give us information about the variance. For example, if you believe the population will be normally distributed, you may not know what the variance is, but you may have an idea of the range of the data. You could then estimate S by range/4 or range/6 becaues approximately 95% of the values from a normal population are within 2 standard deviations of the mean and 99.7% of the values are within 3 standard deviation of the mean. Example of a pilot sample: Consider the Vietnam Veteran Agent Orange study from Sample 2. Suppose we want to obtain a margin of error of 0.25. We take a pilot sample of size 20 and estimate the standard deviation S . # Pilot sample for Agent Orange data agent.orange.data=read.table("agent_orange_data.txt",header=TRUE); dioxin=agent.orange.data$dioxin; N=646; 5 pilotsamplesize=20; pilotsample=sample(N,pilotsamplesize,replace=FALSE); s.pilotsample=sd(dioxin[pilotsample]); > s.pilotsample [1] 1.832456 We now find the sample size that would make the margin of error 0.25 if the population standard deviation was 1.83. 1.962 S 2 1.962 *1.832 n 156.11 2 2 1.962 S 2 1.96 *1.83 e2 0.252 N 646 We set the sample size to be 157. We include the 20 observations already chosen and then choose 157-20=137 units from the remaining population. # Sample the remaining needed observations from the population that # the units already sampled in the pilot sample full.sample.size=ceiling(nhat); remaining.sample.size=full.sample.size-pilotsamplesize; remainingsample=sample(seq(1,N,1)[pilotsample],remaining.sample.size,replace=FALSE); fullsample=c(dioxin[pilotsample],dioxin[remainingsample]); mean.fullsample=mean(fullsample); s.fullsample=sd(fullsample); se.ybar=sqrt((s.fullsample^2/full.sample.size)*(1-full.sample.size/N)); lowerci=mean.fullsample-1.96*se.ybar; upperci=mean.fullsample+1.96*se.ybar; > s.fullsample [1] 1.981706 > full.sample.size 6 [1] 157 > nhat [1] 156.4195 > mean(fullsample) [1] 4.216561 > se.ybar [1] 0.1376029 > lowerci [1] 3.946859 > upperci [1] 4.486262 Recall that the population mean and standard deviation are: > mean(dioxin) [1] 4.260062 > sd(dioxin) [1] 2.642617 Using the pilot sample, we found a confidence interval of width 4.49-3.95=0.54, which corresponds to a margin of error of 0.27. This is about what we were hoping for, a margin of error of 0.25. Note, we got a little lucky with our sample as because the standard deviation in the pilot sample, 1.83, was less than the population standard deviation, 2.64, we took a smaller sample size, 157 units, than we would have done if we knew the population standard deviation. If we knew the population standard deviation, we would have taken a sample size of 1.962 S 2 1.962 * 2.642 n 257.58 2 2 2 2 1.96 S 1.96 * 2.64 e2 0.252 N 646 7 II. Derivation of Mean and Variance of Sample Mean (Chapter 2.7) In Notes 2, we claimed the results that for a simple random sample, the sampling distribution of the sample mean under simple random sampling has the following properties Mean: E ( y ) yU . The sample mean is an unbiased estimator of the population mean. S2 n Var ( y ) 1 Variance: n N . We will now derive these results. We do not make any assumptions about the population in deriving the results; we use only the fact that we have taken a simple random sample. This is called the design based theory or randomization theory. 8 9 10 III. Model Based Approach and Superpopulations In the design based theory, we do not make any assumptions about the population. In more complicated settings, it is sometimes useful to consider a model for the population and also to consider the finite population as being randomly drawn from an infinite superpopulation, e.g., y1 , , yN iid N ( , 2 ) We then take a probability sample from the finite population y1 , , yN . In the model based approach, we might be interested in either the superpopulation mean or the finite population mean 1 N yi N i 1 . 11