Unequal Probability Sampling
Statistics 522: Sampling and Survey Techniques
Topic 6
Topic Overview
This topic will cover
• Sampling with unequal probabilities
• Sampling one primary sampling unit
• One-stage sampling with replacement
Unequal probabilities
• Recall πi is the probability that unit i is selected as part of the sample.
• Most designs we have studied so far have the πi equal.
• Now we consider general designs where the πi can vary with i.
• There are situations where this can give much better results.
Example 6.1
• Survey of nursing home residents in Philadelphia to determine preferences on life-sustaining treatments
• 294 nursing homes with a total of 37,652 beds (number of residents not known at the
planning stage)
• Use cluster sampling
• Suppose we choose an SRS of the 294 nursing homes and then an SRS of 10 residents
of each selected home.
• A nursing home with 20 beds has the same probability of being sampled as a nursing
home with 1000 beds.
• 10 residents from the 20 bed home represent fewer people than 10 residents from 1000
bed home.
Self-weighting
• This procedure gives a sample that is not self-weighting.
• Self-weighting alternatives:
– A one-stage cluster sample
– Sample a fixed percentage of the residents of each selected nursing home.
The two-stage cluster design
• The two-stage cluster design (SRS of homes, then equal proportion SRS of residents
in each selected home)
– Gives a mathematically valid estimator
SRS at first stage
Three shortcomings:
• We would expect ti to be proportional to Mi, the number of beds in nursing home i, so the estimators will have large variance.
• Equal percentage sampling in each selected home may be difficult to administer.
• Cost is not known in advance (we don't know whether large or small homes will fall in the sample).
The study
• They drew a sample of 57 nursing homes with probabilities proportional to the number
of beds.
• Then they took an SRS of 30 beds (and their occupants) from a list of all beds within
each selected nursing home.
Properties
• Each bed is equally likely to be in the sample (note beds vs occupants).
• The cost is known before selecting the sample.
• The same number of interviews is taken at each nursing home.
• The estimators will have smaller variance
Key ideas
• When sampling with unequal probabilities, we deliberately vary the selection probabilities.
• We compensate by using weights in the estimation.
• The key is that we know the selection probabilities
Notation
• The probability that psu i is in the sample is πi .
• The probability that psu i is selected on the first draw is ψi .
• We will consider an artificial situation where n = 1, so πi = ψi .
Sampling one psu
• Sample size is n = 1.
• Suppose we are interested in estimating the population total.
• ti is the total for psu i.
• To illustrate the ideas, we will assume that we know the whole population.
The Example
• N = 4 supermarkets
• Size (in square meters) varies.
• Select n = 1 with probabilities proportional to size.
• Record total sales
• Using the data from one store we want to estimate total sales for the four stores in the
population.
The population

Store    Size     ψi      ti
A         100    1/16     11
B         200    2/16     20
C         300    3/16     24
D        1000   10/16    245
Total    1600       1    300
Weights
• The weights wi are the inverses of the selection probabilities: wi = 1/ψi.
• Since n = 1, the weighted estimator of the population total is t̂ψ = wi ti = ti/ψi.
• There are four possible samples.
• We calculate t̂ψ for each.
The samples

Sample     ψi      ti      wi     t̂ψ
A         1/16     11      16     176
B         2/16     20       8     160
C         3/16     24    16/3     128
D        10/16    245   16/10     392
Sampling distribution of the estimate t̂ψ

Sample     ψi      t̂ψ
1         1/16     176
2         2/16     160
3         3/16     128
4        10/16     392
Mean of the sampling distribution of t̂ψ

E t̂ψ = (1/16)(176) + (2/16)(160) + (3/16)(128) + (10/16)(392) = 300 = t

• So t̂ψ is unbiased.
• This will always be true:

E t̂ψ = Σ ψi wi ti = Σ ti = t
Variance of the sampling distribution of t̂ψ

Var(t̂ψ) = (1/16)(176 − 300)² + (2/16)(160 − 300)² + (3/16)(128 − 300)² + (10/16)(392 − 300)² = 14248

Compare with the variance for an SRS of size 1, where t̂SRS = 4ti takes the values 44, 80, 96, and 980:

Var(t̂SRS) = (1/4)(44 − 300)² + (1/4)(80 − 300)² + (1/4)(96 − 300)² + (1/4)(980 − 300)² = 154488
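These calculations can be checked with a short script (a sketch using the four-store values from the tables above; the variable names are illustrative, not part of the notes):

```python
# One psu drawn with probability psi_i; estimator t_hat = t_i / psi_i.
psi = [1/16, 2/16, 3/16, 10/16]   # selection probabilities for stores A-D
t = [11, 20, 24, 245]             # store totals t_i

t_hat = [ti / pi for ti, pi in zip(t, psi)]               # [176, 160, 128, 392]
mean = sum(p * th for p, th in zip(psi, t_hat))           # E[t_hat] = 300 = t
var = sum(p * (th - mean) ** 2 for p, th in zip(psi, t_hat))  # 14248

# Under an SRS of size 1, t_hat = 4 * t_i, each with probability 1/4.
t_hat_srs = [4 * ti for ti in t]                          # [44, 80, 96, 980]
var_srs = sum((th - mean) ** 2 / 4 for th in t_hat_srs)   # 154488

print(mean, var, var_srs)
```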
Interpretation
• Store D is the largest and we expect it to account for a large portion of the total sales.
• Therefore, we give it a higher probability of being in the sample (10/16) than it would
have with an SRS (1/4).
• If it is selected, we multiply its sales by (16/10) to estimate total sales.
One-stage sampling with replacement
• Suppose n > 1 and we sample with replacement.
• This implies πi = 1 − (1 − ψi)^n.
• Probability that item i is selected on the first draw is the same as the probability that
item i is selected on any other draw.
• Sampling with replacement gives us n independent estimates of the population total, one for each draw.
• We average these n estimates.
• The estimated variance is the sample variance of these estimates divided by n.
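A minimal Python sketch of this procedure (the function name and interface are illustrative, not from the notes):

```python
import random

def pps_with_replacement(t, psi, n, rng=random):
    """Draw n psus with replacement, each draw with probabilities psi[i],
    and return the estimate of the population total plus its SE."""
    draws = rng.choices(range(len(t)), weights=psi, k=n)
    estimates = [t[i] / psi[i] for i in draws]   # one unbiased estimate per draw
    t_hat = sum(estimates) / n                   # average of the n estimates
    var_hat = sum((e - t_hat) ** 2 for e in estimates) / (n * (n - 1))
    return t_hat, var_hat ** 0.5                 # estimate and its SE
```

For instance, `pps_with_replacement([11, 20, 24, 245], [1/16, 2/16, 3/16, 10/16], 5)` averages five draws of ti/ψi from the supermarket population above.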
Example 6.2
• N = 15 classes of elementary stat
• Mi students in class i (i = 1 to 15)
• Values of Mi range from 20 to 100.
• We want a sample of 5 classes.
• Each student in the selected classes will fill out a questionnaire.
• (It is possible for the same class to be selected more than once.)
Randomization
• There are a total of 647 students in these classes.
• Select 5 random numbers between 1 and 647.
• Think about ordering the students by class.
• Each random number corresponds to a student and the corresponding class will be in
the sample.
This method
• This method is called the cumulative-size method.
• It is based on M1 , M1 + M2 , M1 + M2 + M3 , . . .
• An alternative is to use the cumulative sums of the ψi and select random numbers
between 0 and 1.
• For this example, ψi = Mi /647
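The cumulative-size method can be sketched in Python (the helper name is illustrative):

```python
import bisect
import random
from itertools import accumulate

def cumulative_size_draws(M, n, rng=random):
    """n with-replacement draws of psus with probability proportional
    to size M[i], via the cumulative sums M1, M1+M2, M1+M2+M3, ..."""
    cum = list(accumulate(M))   # cumulative class sizes
    total = cum[-1]             # e.g. 647 students in Example 6.2
    # each random number names one student; return the class containing it
    return [bisect.bisect_left(cum, rng.randint(1, total)) for _ in range(n)]
```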
Alternative
• Systematic sampling is often used as an alternative in this setting.
– The basic idea is the same.
– Not technically sampling with replacement
– Works well when systematic sampling works well.
– See page 186 for details.
• Lahiri's method
– Involves two stages of randomization
– Rejection sampling: corresponds to classroom problem in Problem Set 2.
– Can be inefficient.
– See page 187 for details
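Lahiri's rejection idea can be sketched as follows (a sketch, not the textbook's exact description; see page 187 for that):

```python
import random

def lahiri_draw(M, rng=random):
    """Pick a psu uniformly at random and a random integer up to max(M);
    keep the psu if the integer is at most its size M[i], else try again.
    The accepted psu is selected with probability proportional to M[i]."""
    m_max = max(M)
    while True:
        i = rng.randrange(len(M))            # candidate psu
        if rng.randint(1, m_max) <= M[i]:    # accept with prob M[i]/m_max
            return i
```

Rejection keeps the bookkeeping simple but can be inefficient: when most Mi are far below max(M), many candidate draws are thrown away.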
Estimation Theory
• Let Qi be the number of times unit i occurs in the sample.
• Then t̂ψ = (1/n) Σ Qi ti/ψi.
• The estimated variance of t̂ψ is

V̂(t̂ψ) = (1/(n(n − 1))) Σ Qi (ti/ψi − t̂ψ)²

• The estimate and its estimated variance are both unbiased.
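These formulas can be written out directly (a sketch; `Counter` tallies the Qi, and the function name is illustrative):

```python
from collections import Counter

def t_hat_and_var(sample, t, psi):
    """With-replacement pps estimate of the population total and its
    estimated variance, from the formulas above.  `sample` lists the
    drawn psu indices (repeats allowed); Q[i] counts unit i's draws."""
    n = len(sample)
    Q = Counter(sample)
    t_hat = sum(q * t[i] / psi[i] for i, q in Q.items()) / n
    var_hat = sum(q * (t[i] / psi[i] - t_hat) ** 2
                  for i, q in Q.items()) / (n * (n - 1))
    return t_hat, var_hat
```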
Choosing the selection probabilities
• We want small variance for our estimator.
– Often, ti is related to the size of the psu.
– We can take ψi proportional to Mi or some other measure of the size of psu i.
PPS
• This procedure is called sampling with probability proportional to size (pps).
• The formulas for the estimate and variance simplify in this special case: with K = Σ Mi,

ψi = Mi/K  and  ti/ψi = K ti/Mi = K ȳi

• See page 190 for details.
• See Example 6.5 on pages 190-192.
Two-stage sampling with replacement
• Basic ideas are very similar to one-stage sampling.
• ψi is the probability that psu i is selected on the first (or any) draw.
• We take a sample of mi ssus from each selected psu.
Sampling ssu’s
• Usually we use an SRS.
• Alternatives include
– systematic sampling
– any other probability sampling method
• Note if a psu is selected more than once, a separate independent second stage sample
is required.
Estimates and SE’s
• Weights are used to make the estimators unbiased.
• Formulas are similar to those for one-stage.
• See (6.8) and (6.9) on page 192
Outline of the procedure
1. Determine the ψi .
2. Select the n psus (with replacement).
3. Select the ssus.
4. Estimate the t for each selected psu and weight it:
   t̂ψ = weight × t̂
5. The average of these is t̂ψ.
6. The SE is the standard error of these (sd/√n).
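The whole outline can be sketched in Python (names and interface are illustrative; an SRS of m ssus is assumed at the second stage):

```python
import random
import statistics

def two_stage_pps(psus, psi, n, m, rng=random):
    """Two-stage pps with replacement, following the outline above.
    `psus` is a list of lists of ssu values; psi[i] is psu i's draw
    probability.  Each of the n psu draws gets its own SRS of m ssus."""
    draws = rng.choices(range(len(psus)), weights=psi, k=n)
    per_draw = []
    for i in draws:                     # independent second stage per draw
        ssu_sample = rng.sample(psus[i], m)
        t_hat_i = len(psus[i]) * statistics.mean(ssu_sample)  # M_i * ybar_i
        per_draw.append(t_hat_i / psi[i])                     # weight = 1/psi_i
    t_hat = statistics.mean(per_draw)   # step 5: average the n values
    se = statistics.stdev(per_draw) / n ** 0.5 if n > 1 else float("nan")
    return t_hat, se                    # step 6: SE = sd / sqrt(n)
```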
Unequal probability sampling without replacement
• ψi is the probability of selection on the first draw.
• The probability of selection on later draws depends on which units were selected on
earlier draws.
Estimation
• πi is called the inclusion probability (summed over the population, Σ πi = n).
• πi,j is the probability that both psu i and psu j are in the sample (Σ_{j≠i} πi,j = (n − 1)πi).
• Weights are the inverses of the selection probabilities:
  – we use πi/n in place of the with-replacement ψi.
• The recommended procedure is to use the Horvitz-Thompson (HT) estimator and the associated SE:

t̂HT = Σ_sample t̂i/πi

• See pages 196-197 for details.
• This estimator can be generalized to other designs that do not use replacement.
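In code, the HT estimator is just a weighted sum over the sampled psus (a sketch):

```python
def horvitz_thompson(t_hats, pis):
    """HT estimate of the population total: each sampled psu's estimated
    total divided by its inclusion probability, summed over the sample."""
    return sum(t / p for t, p in zip(t_hats, pis))
```

For a single unit sampled with inclusion probability 99/100, the estimate is its value divided by 99/100, i.e. 100y/99, which is the calculation in the elephant example below.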
Randomization Theory
Framework is
• Probability sampling without replacement for the psus for the first stage
• Sampling at the second stage is independent of sampling at the first stage
Horvitz-Thompson
• Randomization theory can be used to prove the Horvitz-Thompson Theorem.
– Expected value of the estimator is t.
– Formula for the variance of the estimator
The estimator
• t̂HT = Σ t̂i/πi
  – where the sum is over the psu's selected in the first stage.
• The idea behind the proofs is to condition on which psus are in the sample.
• Study pages 205-210.
Model
• One-way random effects anova model

Yi,j = Ai + εi,j

where
  – the Ai are random variables with mean µ and variance σA²
  – the εi,j are random variables with mean 0 and variance σ²
  – the Ai and the εi,j are uncorrelated
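A quick simulation from this model (a sketch; normal distributions are an assumption here, since the model only specifies means and variances):

```python
import random

def simulate_model(N, m, mu, sigma_A, sigma, rng):
    """Generate Y_ij = A_i + eps_ij for i = 1..N psus, j = 1..m ssus:
    A_i has mean mu and sd sigma_A; eps_ij has mean 0 and sd sigma."""
    Y = []
    for _ in range(N):
        A = rng.gauss(mu, sigma_A)                        # psu effect
        Y.append([A + rng.gauss(0, sigma) for _ in range(m)])
    return Y
```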
The pps estimator
• πi = nMi/K is the inclusion probability.

t̂P = Σ (K/(nMi)) t̂i

• We rewrite this as a weighted estimator: with

t̂i = (Mi/mi) Σ_j Yi,j

we get

t̂P = Σ_i Σ_j wi,j Yi,j,  where wi,j = (K/(nMi))(Mi/mi) = K/(n mi)

• Take expected values to show that the estimator is unbiased.
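A quick numerical check that the per-psu form and the per-observation weighted form agree (a sketch with made-up numbers):

```python
def t_hat_pps_two_ways(samples, M, m, K, n):
    """Compute the pps estimate two ways: via psu-level estimates
    t_hat_i = (M_i/m_i) * sum_j Y_ij weighted by K/(n*M_i), and via
    per-observation weights; the M_i cancel, so the two must agree."""
    via_psus = sum(K / (n * M[i]) * (M[i] / m[i]) * sum(ys)
                   for i, ys in enumerate(samples))
    via_obs = sum(K / (n * m[i]) * y
                  for i, ys in enumerate(samples) for y in ys)
    return via_psus, via_obs
```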
Variance
• The variance can be computed.
• See page 211
• The variance depends on which psu’s are selected through the Mi .
• The variance is smallest when psu’s with the largest Mi are chosen.
Recall
• Estimate of population total is the weighted average of the t̂i for the selected psus.
• The weights wi are the inverses of the probabilities of selection.
Elephants
• A circus needed to ship its 50 elephants.
• They needed to estimate the total weight of the animals.
• It is not easy to weigh 50 elephants and they were in a hurry.
• They had data from three years ago.
Sample
• The owner wanted to base the estimate on a sample.
• Dumbo had a weight equal to the average three years ago.
• The owner wanted to weigh Dumbo and multiply by 50.
• The statistician said:
NO
• You have to use probability sampling and the Horvitz-Thompson estimator.
• They compromised:
– The probability of selecting Dumbo was set as 99/100.
– The probability of selecting each of the other elephants was 1/4900.
Who was selected
• Dumbo, of course.
• The owner was happy and said now we can estimate the weight of the 50 elephants as 50 times Dumbo's weight, 50y.
• The statistician said
NO
• The estimate of the total weight of the 50 elephants should be Dumbo's weight divided by his probability of selection.
• This is y/(99/100), or 100y/99.
• The theory behind this estimator is rigorous.
What if
• The owner asked
– What if the randomization had selected Jumbo, the largest elephant in the herd?
• The statistician replied: 4900y, where y is Jumbo's weight.
Conclusion
• The statistician lost his circus job and became a teacher of statistics.
• Moral: a bad model gives a highly variable estimator.
• Due to Basu (1971).