Statistics 522: Sampling and Survey Techniques
Topic 6

Topic Overview
This topic will cover
• Sampling with unequal probabilities
• Sampling one primary sampling unit
• One-stage sampling with replacement

Unequal probabilities
• Recall that π_i is the probability that unit i is selected as part of the sample.
• Most designs we have studied so far have the π_i equal.
• Now we consider general designs where the π_i can vary with i.
• There are situations where this can give much better results.

Example 6.1
• Survey of nursing home residents in Philadelphia to determine preferences on life-sustaining treatments.
• 294 nursing homes with a total of 37,652 beds (the number of residents was not known at the planning stage).
• Use cluster sampling.
• Suppose we choose an SRS of the 294 nursing homes and then an SRS of 10 residents of each selected home.
• A nursing home with 20 beds then has the same probability of being sampled as a nursing home with 1000 beds.
• But 10 residents from the 20-bed home represent far fewer people than 10 residents from the 1000-bed home.

Self-weighting
• This procedure gives a sample that is not self-weighting.
• Self-weighting alternatives:
– A one-stage cluster sample.
– Sample a fixed percentage of the residents of each selected nursing home.

The two-stage cluster design
• The two-stage cluster design (SRS of homes, then an equal-proportion SRS of residents in each selected home)
– gives a mathematically valid estimator.

SRS at the first stage
Three shortcomings:
• We would expect t_i to be proportional to the number of beds M_i in nursing home i, so the estimators will have large variance.
• Equal-percentage sampling in each selected home may be difficult to administer.
• The cost is not known in advance (we don't know whether large or small homes will fall in the sample).

The study
• They drew a sample of 57 nursing homes with probabilities proportional to the number of beds.
• Then they took an SRS of 30 beds (and their occupants) from a list of all beds within each selected nursing home.

Properties
• Each bed is equally likely to be in the sample (note beds vs. occupants).
• The cost is known before selecting the sample.
• The same number of interviews is taken at each nursing home.
• The estimators will have smaller variance.

Key ideas
• When sampling with unequal probabilities, we deliberately vary the selection probabilities.
• We compensate by using weights in the estimation.
• The key is that we know the selection probabilities.

Notation
• The probability that psu i is in the sample is π_i.
• The probability that psu i is selected on the first draw is ψ_i.
• We will consider an artificial situation where n = 1, so π_i = ψ_i.

Sampling one psu
• Sample size is n = 1.
• Suppose we are interested in estimating the population total.
• t_i is the total for psu i.
• To illustrate the ideas, we will assume that we know the whole population.

The Example
• N = 4 supermarkets.
• Size (in square meters) varies.
• Select n = 1 with probability proportional to size.
• Record total sales.
• Using the data from the one selected store, we want to estimate total sales for the four stores in the population.

The population

  Store   Size (m²)    ψ_i    t_i
  A            100    1/16     11
  B            200    2/16     20
  C            300    3/16     24
  D           1000   10/16    245
  Total       1600       1    300

Weights
• The weights w_i are the inverses of the selection probabilities ψ_i.
• The weighted estimator of the population total is t̂_ψ = Σ_{i ∈ S} w_i t_i (here the sample S is a single store).
• There are four possible samples.
• We calculate t̂_ψ for each; the tables that follow list the results, and the sketch below reproduces them.
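A minimal Python sketch, assuming nothing beyond the numbers on these slides, that reproduces the four estimates, the unbiasedness check, and the variance comparison worked out next:

```python
# Supermarket example: select n = 1 store with probability
# proportional to size (psi_i) and weight by w_i = 1/psi_i.
sizes = {"A": 100, "B": 200, "C": 300, "D": 1000}
sales = {"A": 11, "B": 20, "C": 24, "D": 245}               # t_i
K = sum(sizes.values())                                      # 1600 m^2

psi = {s: sizes[s] / K for s in sizes}                       # 1/16, 2/16, 3/16, 10/16
t_hat = {s: sales[s] / psi[s] for s in sizes}                # t_hat_psi = t_i / psi_i

t = sum(sales.values())                                      # true total: 300
mean = sum(psi[s] * t_hat[s] for s in sizes)                 # 300, so unbiased
var_pps = sum(psi[s] * (t_hat[s] - t) ** 2 for s in sizes)   # 14248
var_srs = sum((4 * sales[s] - t) ** 2 for s in sizes) / 4    # 154488 (SRS, n = 1)
print(t_hat)       # {'A': 176.0, 'B': 160.0, 'C': 128.0, 'D': 392.0}
print(mean, var_pps, var_srs)
```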
The samples

  Sample    ψ_i    t_i    w_i     t̂_ψ = w_i t_i
  A        1/16     11    16            176
  B        2/16     20     8            160
  C        3/16     24    16/3          128
  D       10/16    245    16/10         392

Sampling distribution of the estimator t̂_ψ

  Sample    ψ_i    t̂_ψ
  1        1/16    176
  2        2/16    160
  3        3/16    128
  4       10/16    392

Mean of the sampling distribution of t̂_ψ

  E[t̂_ψ] = (1/16)(176) + (2/16)(160) + (3/16)(128) + (10/16)(392) = 300 = t

• So t̂_ψ is unbiased.
• This will always be true:

  E[t̂_ψ] = Σ_i ψ_i w_i t_i = Σ_i t_i = t.

Variance of the sampling distribution of t̂_ψ

  Var(t̂_ψ) = (1/16)(176 − 300)² + (2/16)(160 − 300)² + (3/16)(128 − 300)² + (10/16)(392 − 300)² = 14,248

Compare with the variance for an SRS of size 1, where t̂_SRS = 4t_i takes the values 44, 80, 96, and 980:

  Var(t̂_SRS) = (1/4)(44 − 300)² + (1/4)(80 − 300)² + (1/4)(96 − 300)² + (1/4)(980 − 300)² = 154,488

Interpretation
• Store D is the largest, and we expect it to account for a large portion of the total sales.
• Therefore, we give it a higher probability of being in the sample (10/16) than it would have under an SRS (1/4).
• If it is selected, we multiply its sales by its weight 16/10 to estimate total sales.

One-stage sampling with replacement
• Suppose n > 1 and we sample with replacement.
• This implies π_i = 1 − (1 − ψ_i)ⁿ.
• The probability that item i is selected on the first draw is the same as the probability that item i is selected on any other draw.
• Sampling with replacement gives us n independent estimates of the population total, one for each unit in the sample.
• We average these n estimates.
• The estimated variance is the variance of the n estimates divided by n.

Example 6.2
• N = 15 classes of elementary statistics.
• M_i students in class i (i = 1, …, 15).
• Values of M_i range from 20 to 100.
• We want a sample of 5 classes.
• Each student in the selected classes will fill out a questionnaire.
• (It is possible for the same class to be selected more than once.)

Randomization
• There are a total of 647 students in these classes.
• Select 5 random numbers between 1 and 647.
• Think of the students as ordered by class.
• Each random number corresponds to a student, and that student's class enters the sample.

This method
• This method is called the cumulative-size method.
• It is based on the running totals M_1, M_1 + M_2, M_1 + M_2 + M_3, ….
• An alternative is to use the cumulative sums of the ψ_i and select random numbers between 0 and 1.
• For this example, ψ_i = M_i/647.

Alternative
• Systematic sampling is often used as an alternative in this setting.
– The basic idea is the same.
– It is not technically sampling with replacement.
– It works well when systematic sampling works well.
– See page 186 for details.
• Lahiri's method
– Involves two stages of randomization.
– Rejection sampling: corresponds to the classroom problem in Problem Set 2.
– Can be inefficient.
– See page 187 for details.

Estimation Theory
• Let Q_i be the number of times unit i occurs in the sample.
• Then

  t̂_ψ = (1/n) Σ_i Q_i t_i/ψ_i.

• The estimated variance of t̂_ψ is

  V̂(t̂_ψ) = [1/(n(n − 1))] Σ_i Q_i (t_i/ψ_i − t̂_ψ)².

• The estimator and its estimated variance are both unbiased.

Choosing the selection probabilities
• We want small variance for our estimator.
– Often, t_i is related to the size of the psu.
– We can take ψ_i proportional to M_i or some other measure of the size of psu i.

PPS
• This procedure is called sampling with probability proportional to size (pps).
• The formulas for the estimate and variance simplify in this special case: with K = Σ_i M_i,

  ψ_i = M_i/K   and   t_i/ψ_i = K ȳ_i,

  where ȳ_i = t_i/M_i is the mean of psu i.
• See page 190 for details.
• See Example 6.5 on pages 190-192.
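Before moving to two stages, here is a minimal Python sketch of the one-stage with-replacement estimator above. The class data are hypothetical, in the spirit of Example 6.2; only the formulas come from the notes:

```python
import random

def ppswr_estimate(t, psi, n, rng=random):
    """One-stage pps sampling with replacement: each draw i yields an
    unbiased estimate t[i]/psi[i] of the population total; average the
    n independent estimates and divide their sample variance by n."""
    draws = rng.choices(range(len(t)), weights=psi, k=n)   # with replacement
    est = [t[i] / psi[i] for i in draws]
    t_hat = sum(est) / n
    se = (sum((e - t_hat) ** 2 for e in est) / (n * (n - 1))) ** 0.5
    return t_hat, se

# Hypothetical class sizes (they sum to 647, as in Example 6.2) and
# hypothetical per-class totals t_i for some survey variable.
M = [20, 33, 26, 22, 39, 45, 100, 38, 30, 50, 45, 55, 60, 48, 36]
t = [18, 30, 25, 20, 35, 44, 95, 37, 28, 47, 41, 52, 58, 45, 33]
psi = [Mi / 647 for Mi in M]          # psi_i = M_i / 647
print(ppswr_estimate(t, psi, n=5))
```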
Two-stage sampling with replacement
• Basic ideas are very similar to one-stage sampling.
• ψ_i is the probability that psu i is selected on the first (or any) draw.
• We take a sample of m_i ssus from each selected psu.

Sampling ssu's
• Usually we use an SRS.
• Alternatives include
– systematic sampling
– any other probability sampling method
• Note that if a psu is selected more than once, a separate independent second-stage sample is required each time.

Estimates and SE's
• Weights are used to make the estimators unbiased.
• The formulas are similar to those for one-stage sampling.
• See (6.8) and (6.9) on page 192.

Outline of the procedure
1. Determine the ψ_i.
2. Select the n psus (with replacement).
3. Select the ssus.
4. For each selected psu, estimate the population total as weight × t̂_i, i.e. t̂_i/ψ_i.
5. The average of these n estimates is t̂_ψ.
6. The SE is the standard error of these estimates (sd/√n).

Unequal probability sampling without replacement
• ψ_i is the probability of selection on the first draw.
• The probability of selection on later draws depends on which units were selected on earlier draws.

Estimation
• π_i is called the inclusion probability. (Summed over the population, Σ_i π_i = n.)
• π_ij is the probability that both psu i and psu j are in the sample. (Σ_{j ≠ i} π_ij = (n − 1)π_i.)
• Weights are again the inverses of the selection probabilities: π_i/n plays the role that ψ_i played with replacement.
• The recommended procedure is to use the Horvitz-Thompson (HT) estimator and the associated SE:

  t̂_HT = Σ_{i ∈ sample} t̂_i/π_i.

• See pages 196-197 for details.
• This estimator can be generalized to other designs that do not use replacement.

Randomization Theory
The framework is
• probability sampling without replacement for the psus at the first stage,
• with sampling at the second stage independent of sampling at the first stage.

Horvitz-Thompson
• Randomization theory can be used to prove the Horvitz-Thompson Theorem:
– The expected value of the estimator is t.
– There is a formula for the variance of the estimator.

The estimator
• t̂_HT = Σ t̂_i/π_i,
– where the sum is over the psus selected in the first stage.
• The idea behind the proofs is to condition on which psus are in the sample.
• Study pages 205-210.

Model
• One-way random effects ANOVA model:

  Y_ij = A_i + ε_ij

where
– the A_i are random variables with mean µ and variance σ_A²,
– the ε_ij are random variables with mean 0 and variance σ²,
– the A_i and the ε_ij are uncorrelated.

The pps estimator
• π_i = nM_i/K is the inclusion probability.
• The estimator is

  t̂_P = Σ_{i ∈ sample} (K/(nM_i)) t̂_i.

• We rewrite this as a weighted estimator. With t̂_i = (M_i/m_i) Σ_j Y_ij,

  t̂_P = Σ_i Σ_j w_ij Y_ij,   where w_ij = (K/(nM_i)) (M_i/m_i) = K/(n m_i).

• Take expected values to show that the estimator is unbiased.

Variance
• The variance can be computed; see page 211.
• The variance depends on which psus are selected, through the M_i.
• The variance is smallest when the psus with the largest M_i are chosen.

Recall
• The estimate of the population total is the weighted average of the t̂_i for the selected psus.
• The weights w_i are the inverses of the probabilities of selection.
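As a concrete check of the arithmetic, here is a minimal Python sketch of the HT calculation for a pps design with π_i = nM_i/K. All numbers are hypothetical:

```python
def horvitz_thompson(psu_est_totals, pi):
    """HT estimator of the population total: sum, over the sampled
    psus, of the estimated psu total divided by its inclusion
    probability."""
    return sum(t / p for t, p in zip(psu_est_totals, pi))

# Hypothetical pps design: n = 3 psus sampled from a population with
# K = 1000 ssus in total; M and ybar describe the sampled psus only.
n, K = 3, 1000
M = [200, 100, 50]                                 # sizes of the sampled psus
ybar = [4.1, 3.6, 5.0]                             # sample means within each psu
pi = [n * Mi / K for Mi in M]                      # pi_i = n * M_i / K
t_hat_i = [Mi * yb for Mi, yb in zip(M, ybar)]     # t_hat_i = M_i * ybar_i
print(horvitz_thompson(t_hat_i, pi))               # here equals (K/n) * sum(ybar)
```

Note that with pps inclusion probabilities, each term t̂_i/π_i reduces to K ȳ_i/n, which is why the weights in the text collapse to w_ij = K/(n m_i).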
Elephants
• A circus needed to ship its 50 elephants.
• They needed to estimate the total weight of the animals.
• It is not easy to weigh 50 elephants, and they were in a hurry.
• They had weight data from three years ago.

Sample
• The owner wanted to base the estimate on a sample.
• Dumbo's weight three years ago was equal to the herd average.
• The owner wanted to weigh Dumbo and multiply by 50.
• The statistician said: NO.
• You have to use probability sampling and the Horvitz-Thompson estimator.
• They compromised:
– The probability of selecting Dumbo was set at 99/100.
– The probability of selecting each of the other 49 elephants was 1/4900.

Who was selected
• Dumbo, of course.
• The owner was happy and said: now we can estimate the weight of the 50 elephants as 50 times Dumbo's weight, 50y.
• The statistician said NO.
• The estimate of the total weight of the 50 elephants should be Dumbo's weight divided by his probability of selection.
• This is y/(99/100), or 100y/99.
• The theory behind this estimator is rigorous.

What if
• The owner asked: what if the randomization had selected Jumbo, the largest elephant in the herd?
• The statistician replied: 4900y, where y is Jumbo's weight.

Conclusion
• The statistician lost his circus job and became a teacher of statistics.
• Bad model; highly variable estimator.
• Due to Basu (1971).
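Basu's point can be seen numerically. The following simulation sketch uses invented elephant weights (the story specifies none); it shows that the HT estimator under this design is unbiased in the long run yet wildly variable from sample to sample:

```python
import random

random.seed(1)
# Hypothetical weights (kg): Dumbo (index 0) is average-sized,
# Jumbo (index 49) is the largest; the rest are spread in between.
y = [4000.0] + [random.uniform(3000, 5000) for _ in range(48)] + [7000.0]
p = [99 / 100] + [1 / 4900] * 49       # Basu's selection probabilities
t = sum(y)                             # true total, about 200,000 kg

draws = random.choices(range(50), weights=p, k=200_000)
est = [y[i] / p[i] for i in draws]     # HT estimate from a sample of size 1
print(t)
print(sum(est) / len(est))             # long-run mean is close to t (unbiased)
print(min(est), max(est))              # but single estimates range from
                                       # 100y/99 (about 4,040) when Dumbo is
                                       # drawn up to 4900y for Jumbo
```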