Ch 4: Stratified Random
Sampling (STS)

DEFN: A stratified random sample is
obtained by separating the population
units into non-overlapping groups,
called strata, and then selecting a
random sample from each stratum
Procedure

Divide sampling frame into mutually exclusive and exhaustive strata
    Assign each SU to one and only one stratum
Select a random sample from each stratum
    Select random sample from stratum 1
    Select random sample from stratum 2
    …
[Figure: population partitioned into strata h = 1, 2, ..., H (Stratum 1 through Stratum H)]
Ag example

Divide 3078 counties into 4 strata corresponding to regions of the country
    Northeast (h = 1)
    North central (h = 2)
    South (h = 3)
    West (h = 4)
Select a SRS from each stratum
    In this example, stratum sample size is proportional to stratum population size
    300 is 9.75% of 3078
    Each stratum sample size is 9.75% of stratum population
Ag example – 2
Stratum (h)   Stratum size (Nh)   Sample size (nh)
1 (NE)        220                 21
2 (NC)        1054                103
3 (S)         1382                135
4 (W)         422                 41
Total         3078                300
Procedure – 2

Need to have a stratum value for each SU in
the frame

Minimum set of variables in sampling frame: SU
id, stratum assignment
Stratum (h)   SU (j)
1             1
1             2
1             3
2             1
2             2
…             …
Ag example – 3
Stratum (h)   SU (j)
1             1
1             2
1             3
…             …
1             220
2             1
2             2
…             …
4             421
4             422
Procedure – 3

Each stratum sample is selected independently of others
    New set of random numbers for each stratum
    Basis for deriving properties of estimators
Design within a stratum
    For Ch 4, we will assume a SRS is selected within each stratum
    Can use any probability design within a stratum
    Sample designs do not need to be the same across strata
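The selection procedure above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `select_stratified_sample`, the toy frame, and the toy stratum sizes are all hypothetical. It assumes a SRS without replacement is drawn in every stratum and that the frame carries a stratum label for each SU.

```python
import random

def select_stratified_sample(frame, n_h, seed=None):
    """Draw an SRS without replacement independently within each stratum.

    frame: list of (stratum, su_id) pairs, one per sampling unit (hypothetical layout).
    n_h:   dict mapping stratum label -> stratum sample size.
    """
    rng = random.Random(seed)
    # Group SU ids by their stratum assignment.
    units_by_stratum = {}
    for h, su_id in frame:
        units_by_stratum.setdefault(h, []).append(su_id)
    # Each call to rng.sample uses fresh random numbers, so strata are sampled independently.
    sample = {}
    for h, units in units_by_stratum.items():
        sample[h] = sorted(rng.sample(units, n_h[h]))
    return sample

# Toy frame (assumed values): four strata with 6, 2, 4 and 5 SUs.
toy_sizes = {1: 6, 2: 2, 3: 4, 4: 5}
toy_frame = [(h, j) for h, N in toy_sizes.items() for j in range(1, N + 1)]
toy_n_h = {1: 3, 2: 2, 3: 1, 4: 3}
toy_sample = select_stratified_sample(toy_frame, toy_n_h, seed=1)
print(toy_sample)
```

The only inputs the sketch needs are the SU id and the stratum assignment, which matches the minimum set of frame variables listed earlier.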
Uses for STS

To improve representativeness of sample
    In SRS, can get ANY combination of n elements in the sample
        Can get “bad” samples
    In SYS, we severely restricted the set to k possible samples
        Less likely to get unbalanced samples if frame is sorted using a variable correlated with Y
Uses for STS – 2

To improve representativeness of
sample - 2

In STS, we also exclude samples


Explicitly choose strata to restrict possible
samples
Improve chance of getting representative
samples if use strata to encourage spread
across variation in population
9
Uses for STS – 3

To improve precision of estimates for
population parameters

Achieved by creating strata so that




variation WITHIN stratum is small
variation AMONG strata is large
Uses same principle as “blocking” in
experimental design
Improve precision of estimate for
population parameter by obtaining precise
estimates within each stratum
10
Uses for STS – 4

To study specific subpopulations


Define strata to be subpopulations of interest
Examples






Male v. female
Racial/ethnic minorities
Geographic regions
Population density (rural v. urban)
College classification
Can establish sample size within each stratum to
achieve desired precision level for estimates of
subpopulations
11
Uses for STS – 5

To assist in implementing operational aspects
of survey


May wish to apply different sampling and data
collection procedures for different groups
Agricultural surveys (sample designs)



Large farms in one stratum are selected using a list
frame
Smaller farms belong to a second stratum, and are
selected using an area sample
Survey of employers (data collection methods)


Large firms: use mail survey because information is too
voluminous to get over the phone
Small firms: telephone survey
12
Estimation strategy


Objective: estimate population total
Obtain estimates for each stratum
    Estimate stratum population total
        Use SRS estimator for stratum total
    Estimate variance of estimator in each stratum
        Use SRS estimator for variance of estimated stratum total
Pool estimates across strata
    Sum stratum total estimates and variance estimates across strata
    Variance formula justified by independence of samples across strata
Ag example – 4
Stratum (h)   Stratum size (Nh)   Sample size (nh)   Sample mean (\bar{y}_h)   Estimated stratum total (\hat{t}_h)
1 (NE)        220                 21                 97,630                    21,478,558
2 (NC)        1054                103                300,504                   316,731,379
3 (S)         1382                135                211,315                   292,037,391
4 (W)         422                 41                 662,295                   279,488,706
Total         3078                300
Sample mean = acres devoted to farms / county; estimated stratum total = total farm acres for stratum
Ag example – 5

Estimated total farm acres in US
\[
\hat{t}_{str} = \sum_{h=1}^{H} \hat{t}_h = \sum_{h=1}^{H} N_h \bar{y}_h
\]
\[
= 220(97{,}630) + 1054(300{,}504) + 1382(211{,}315) + 422(662{,}295)
= 909{,}736{,}034 \text{ farm acres in US}
\]
Ag example – 6
Stratum (h)   Stratum size (Nh)   Sample size (nh)   Sample variance (s_h^2)
1 (NE)        220                 21                 7,647,472,708
2 (NC)        1054                103                29,618,183,543
3 (S)         1382                135                53,587,487,856
4 (W)         422                 41                 396,185,950,266
Total         3078                300
Ag example – 7

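Estimated variance for estimated total farm acres in US
\[
\hat{V}(\hat{t}_{str}) = \sum_{h=1}^{H} \hat{V}(\hat{t}_h)
= \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{s_h^2}{n_h}
\]
\[
= 220^2 \left(1 - \frac{21}{220}\right) \frac{7{,}647{,}472{,}708}{21}
+ 1054^2 (\ldots) + 1382^2 (\ldots) + 422^2 (\ldots)
= 2.5419 \times 10^{15}
\]
\[
SE(\hat{t}_{str}) = \sqrt{\hat{V}(\hat{t}_{str})} = 50{,}417{,}248 \text{ acres}
\]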
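As a check on the arithmetic in the Ag example, the stratified total and its standard error can be recomputed from the tabulated values. This is a sketch only; the function name `stratified_total` and the tuple layout are assumptions made for illustration. Because the sample means in the table are rounded to whole acres, the recomputed total differs from the slide's 909,736,034 by a few hundred acres.

```python
import math

# (N_h, n_h, sample mean, sample variance) for each stratum, copied from the Ag example tables.
ag_strata = {
    "NE": (220, 21, 97_630, 7_647_472_708),
    "NC": (1054, 103, 300_504, 29_618_183_543),
    "S": (1382, 135, 211_315, 53_587_487_856),
    "W": (422, 41, 662_295, 396_185_950_266),
}

def stratified_total(strata):
    """Return (t_hat_str, V_hat(t_hat_str)) assuming a SRS within each stratum."""
    t_hat = 0.0
    v_hat = 0.0
    for N_h, n_h, ybar_h, s2_h in strata.values():
        t_hat += N_h * ybar_h                              # SRS estimate of stratum total
        v_hat += N_h**2 * (1 - n_h / N_h) * s2_h / n_h     # SRS variance with FPC
    return t_hat, v_hat

t_hat_str, v_hat_str = stratified_total(ag_strata)
se_str = math.sqrt(v_hat_str)
print(f"t_hat_str = {t_hat_str:,.0f}")
print(f"V_hat     = {v_hat_str:.4e}")
print(f"SE        = {se_str:,.0f}")
```

Summing the variance terms across strata is valid only because the stratum samples are selected independently.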
Ag example – 8

Compare with SRS estimates
\[
N\bar{y} = 916{,}927{,}100 \text{ acres}
\]
\[
\hat{V}(N\bar{y}) = 3.38368 \times 10^{15}
\]
\[
SE(N\bar{y}) = \sqrt{\hat{V}(N\bar{y})} = 58{,}169{,}381 \text{ acres}
\]
Estimation strategy - 2


Objective: estimate population mean
Divide estimated total by population size
\[
\bar{y}_{str} = \frac{\hat{t}_{str}}{N}
\]
OR equivalently,

Obtain estimates for each stratum


Estimate stratum mean with stratum sample mean
Pool estimates across strata

Use weighted average of stratum sample means with
weights proportional to stratum sizes Nh
19
Ag example – 9

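Estimated mean farm acres / county
\[
\bar{y}_{str} = \frac{\hat{t}_{str}}{N} = \frac{909{,}736{,}034}{3078}
\]
or
\[
\bar{y}_{str} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h
= \frac{220}{3078}(97{,}630) + \frac{1054}{3078}(300{,}504) + \frac{1382}{3078}(211{,}315) + \frac{422}{3078}(662{,}295)
\]
\[
\approx 295{,}561 \text{ farm acres / county}
\]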
Ag example – 10

Estimate variance of estimated mean farm acres / county
\[
\hat{V}(\bar{y}_{str}) = \frac{1}{N^2} \hat{V}(\hat{t}_{str})
\]
or
\[
\hat{V}(\bar{y}_{str}) = \sum_{h=1}^{H} \frac{N_h^2}{N^2} \hat{V}(\bar{y}_h)
\]
Notation
[Figure: population partitioned into strata h = 1, 2, ..., H (Stratum 1 through Stratum H)]

Index set for stratum h = 1, 2, …, H
    Uh = {1, 2, …, Nh }
    Nh = number of OUs in stratum h in the population
Partition sample of size n across strata
    nh = number of sample units from stratum h (fixed)
    Sh = index set for sample belonging to stratum h
Notation – 2

Population sizes



Nh = number of OUs in stratum h in the
population
N = N1 + N2 + … + NH
Partition sample of size n across strata



nh = number of sample units from stratum h
n = n1 + n2 + … + nH
The stratum sample sizes are fixed


In domain estimation, they are random
For now, we will assume that the sampling
unit (SU) is an observation unit (OU)
23
Notation – 3

Response variable
    \(y_{hj}\) = characteristic of interest for OU j in stratum h
Population and stratum totals
\[
t_h = \sum_{j=1}^{N_h} y_{hj} = \text{population total in stratum } h
\]
\[
t = \sum_{h=1}^{H} t_h = \text{population total}
\]
Notation – 4

Population and stratum means
\[
\bar{y}_{hU} = \frac{1}{N_h} \sum_{j=1}^{N_h} y_{hj} = \text{population mean in stratum } h
\]
\[
\bar{y}_U = \frac{t}{N} = \frac{1}{N} \sum_{h=1}^{H} \sum_{j=1}^{N_h} y_{hj} = \text{overall population mean}
\]
Notation – 5

Population stratum variance
\[
S_h^2 = \frac{\sum_{j=1}^{N_h} \left(y_{hj} - \bar{y}_{hU}\right)^2}{N_h - 1} = \text{population variance in stratum } h
\]
Notation – 6

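SRS estimators for stratum parameters
\[
\bar{y}_h = \frac{1}{n_h} \sum_{j \in S_h} y_{hj}
\]
\[
\hat{t}_h = \frac{N_h}{n_h} \sum_{j \in S_h} y_{hj} = N_h \bar{y}_h
\]
\[
s_h^2 = \frac{\sum_{j \in S_h} \left(y_{hj} - \bar{y}_h\right)^2}{n_h - 1}
\]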
STS estimators

For population total
\[
\hat{t}_{str} = \sum_{h=1}^{H} \hat{t}_h = \sum_{h=1}^{H} N_h \bar{y}_h
\]
\[
\hat{V}(\hat{t}_{str}) = \sum_{h=1}^{H} \hat{V}(\hat{t}_h)
= \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{s_h^2}{n_h}
\]
STS estimators – 2

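For population mean
\[
\bar{y}_{str} = \frac{\hat{t}_{str}}{N} = \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h
\]
\[
\hat{V}(\bar{y}_{str}) = \frac{1}{N^2} \hat{V}(\hat{t}_{str}) = \sum_{h=1}^{H} \frac{N_h^2}{N^2} \hat{V}(\bar{y}_h)
\]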
STS estimators – 3

For population proportion
30
Properties

STS estimators are unbiased

    \(\bar{y}_{str}\) is unbiased estimator of \(\bar{y}_U\)
    \(\hat{t}_{str}\) is unbiased estimator of t
    \(\hat{p}_{str}\) is unbiased estimator of p

Each estimate of stratum population mean
or total is unbiased (from SRS)
\[
E\left[\sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_h\right]
= \sum_{h=1}^{H} \frac{N_h}{N} E[\bar{y}_h]
= \sum_{h=1}^{H} \frac{N_h}{N} \bar{y}_{hU}
= \bar{y}_U
\]
Properties – 2

Inclusion probability for SU j in stratum h
    Definition in words:
    Formula: \(\pi_{hj}\) =
Properties – 3

In general, for any stratification scheme,
STS will provide a more precise estimate of
the population parameters (mean, total,
proportion) than SRS

For example
\[
V(\bar{y}_{str}) \le V(\bar{y})
\]
Confidence intervals
    Same form (using \(z_{\alpha/2}\))
    Different CLT
Sampling weights

Note that
\[
\hat{t}_{str} = \sum_{h=1}^{H} \hat{t}_h = \sum_{h=1}^{H} N_h \bar{y}_h
= \sum_{h=1}^{H} \sum_{j \in S_h} \frac{N_h}{n_h} y_{hj}
= \sum_{h=1}^{H} \sum_{j \in S_h} w_{hj} y_{hj}
\]
Sampling weight for SU j in stratum h
\[
w_{hj} = \frac{N_h}{n_h}
\]
A sampling weight is a measure of the
number of units in the population represented
by SU j in stratum h
Example

\[
w_{hj} = \frac{N_h}{n_h}
\]
Stratum (h)   Nh   nh   whj
h=1           6    3    6/3 = 2
h=2           2    2    2/2 = 1
h=3           4    1    4/1 = 4
h=4           5    3    5/3 ≈ 1.67
Total         17   9
Note: weights for each OU within a stratum
are the same
Example – 2

Dataset from study
Stratum (h)   Nh   nh   whj    yhj
1             6    3    2      53
1             6    3    2      107
1             6    3    2      83
2             2    2    1      34
2             2    2    1      22
3             4    1    4      90
4             5    3    1.67   12
4             5    3    1.67   34
4             5    3    1.67   15
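The weighted form of the estimator can be applied directly to the small dataset above. The sketch below is illustrative only; the variable names are made up, and the weights are computed as exact ratios Nh / nh rather than the rounded 1.67 shown in the table.

```python
# Rows of the example dataset: (stratum h, N_h, n_h, y_hj).
dataset = [
    (1, 6, 3, 53), (1, 6, 3, 107), (1, 6, 3, 83),
    (2, 2, 2, 34), (2, 2, 2, 22),
    (3, 4, 1, 90),
    (4, 5, 3, 12), (4, 5, 3, 34), (4, 5, 3, 15),
]

# Sampling weight for each sampled unit: w_hj = N_h / n_h.
weights = [N_h / n_h for (_, N_h, n_h, _) in dataset]

# Estimated population total: sum of w_hj * y_hj over all sampled units.
t_hat_weighted = sum(w * y for w, (_, _, _, y) in zip(weights, dataset))

# The weights add up to the population size N = 17.
sum_of_weights = sum(weights)

print(f"t_hat_str      = {t_hat_weighted:.2f}")
print(f"sum of weights = {sum_of_weights:.2f}")
```

The sum of the weights recovering N = 17 is a quick sanity check that each sampled unit "represents" the right number of population units.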
Sampling weights – 2

For STS estimators presented in Ch 4,
sampling weight is the inverse inclusion
probability
\[
\pi_{hj} = \frac{n_h}{N_h}, \qquad w_{hj} = \frac{1}{\pi_{hj}} = \frac{N_h}{n_h}
\]
Defining strata

Depends on purpose of stratification
    Improved representativeness
    Improved precision
    Subpopulation estimates
    Implementing operational aspects
If possible, use factors related to variation in
characteristic of interest, Y
    Geography, political boundaries, population density
    Gender, ethnicity/race, ISU classification
    Size or type of business
Remember
    Stratum variable must be available for all OUs
Allocation strategies



Want to sample n units from the population
An allocation rule defines how n will be spread across
the H strata and thus defines values for nh
Overview for estimating population parameters

Stratum costs same   Stratum variances same   Allocation rule
No                   No                       Optimal
Yes                  No                       Neyman (special case of optimal allocation)
Yes                  Yes                      Proportional (special case of optimal allocation)
Allocation strategies – 2

Focus is on estimating parameter for
entire population


We’ll look at subpopulations later
Factors affecting allocation rule



Number of OUs in stratum
Data collection costs within strata
Within-stratum variance
40
Proportional allocation

Stratum sample size allocated in proportion to
population size within stratum
\[
\frac{n_h}{n} = \frac{N_h}{N}
\]
Allocation rule
\[
n_h = \frac{N_h}{N}\, n
\]
Ag example – 11
Stratum h   Stratum Total Nh   Stratum Sample Size nh = n (Nh / N)
1 (NE)      220                21 ≈ .0975 (220) = 21.4
2 (NC)      1054               103 ≈ .0975 (1054) = 102.7
3 (S)       1382               135 ≈ .0975 (1382) = 134.7
4 (W)       422                41 ≈ .0975 (422) = 41.1
Total       N = 3078           300 = n
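The proportional allocation for the Ag example can be reproduced with a short calculation. The helper name `proportional_allocation` is hypothetical, and simple rounding to the nearest integer is an assumption; in general rounded allocations need not add up to n exactly, although they do here.

```python
def proportional_allocation(N_h, n):
    """Allocate n across strata in proportion to stratum sizes N_h (unrounded)."""
    N = sum(N_h)
    return [n * Nh / N for Nh in N_h]

ag_N_h = [220, 1054, 1382, 422]           # NE, NC, S, W
ag_raw = proportional_allocation(ag_N_h, 300)
ag_n_h = [round(x) for x in ag_raw]       # nearest-integer rounding (assumed)

# Sampling weights implied by the rounded allocation: w_hj = N_h / n_h.
ag_weights = [Nh / nh for Nh, nh in zip(ag_N_h, ag_n_h)]

print([f"{x:.1f}" for x in ag_raw])
print(ag_n_h, sum(ag_n_h))
print([f"{w:.1f}" for w in ag_weights])
```

The implied weights are close to, but not exactly, N / n = 10.26 because of the rounding step.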
Proportional allocation – 2

Proportional allocation rule implies
    Sampling fraction for stratum h is constant across strata
\[
\frac{n_h}{N_h} = \frac{n}{N}
\]
    Inclusion probability is constant for all SUs in population
\[
\pi_{hj} = \frac{n_h}{N_h} = \frac{n}{N}
\]
    Sampling weight for each unit is constant
\[
w_{hj} = \frac{1}{\pi_{hj}} = \frac{N}{n}
\]
Proportional allocation – 3


STS with proportional allocation leads to a
self-weighting sample
What is a self-weighting sample?




If whj has the same value for every OU in the
sample, a sample is said to be self-weighting
Since each weight is the same, each sample unit
represents the same number of units in the
population
For self-weighting samples, estimator for
population mean reduces to sample mean \(\bar{y}\)
Estimator for variance does NOT necessarily
reduce to SRS estimator for variance of \(\bar{y}\)
Proportional allocation – 4
Check to see that a STS with proportional
allocation generates a self-weighting sample
    Is the sample weight whj the same for each OU?
    Is estimator for population mean \(\bar{y}_{str}\) equal to the sample mean \(\bar{y}\)?
    What happens to the variance of \(\bar{y}_{str}\)?
Ag example – 12
Stratum h   Stratum Total Nh   Stratum Sample Size nh   Sample Weight whj
1 (NE)      220                21                       220/21 = 10.5
2 (NC)      1054               103                      1054/103 = 10.2
3 (S)       1382               135                      1382/135 = 10.2
4 (W)       422                41                       422/41 = 10.3
Total       N = 3078           n = 300

Even though we have used proportional allocation,
rounding in setting sample sizes can lead to unequal
(but approximately equal) weights
Neyman allocation


Suppose within-stratum variances \(S_h^2\) vary
across strata
Stratum sample size allocated in proportion to
    Population size within stratum Nh
    Population standard deviation within stratum Sh
Allocation rule
\[
n_h = \frac{N_h S_h}{\sum_{l=1}^{H} N_l S_l}\, n
\]
Caribou survey example
Stratum h   Nh        Sh       NhSh        nh = n NhSh / (sum of NlSl)   whj = Nh / nh
A           400       3,000    1,200,000   96.26 ≈ 96                    400/96 = 4.17
B           30        2,000    60,000      4.81 ≈ 5                      30/5 = 6.00
C           61        9,000    549,000     44.04 ≈ 44                    61/44 = 1.39
D           18        2,000    36,000      2.89 ≈ 3                      18/3 = 6.00
E           70        12,000   840,000     67.38 ≈ 67                    70/67 = 1.04
F           120       1,000    120,000     9.63 ≈ 10                     120/10 = 12.00
Total       N = 699            2,805,000   n = 225
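The Neyman allocation column of the caribou table can be recomputed as follows. This is a sketch under the stated rule nh = n NhSh / (sum of NlSl); the function name `neyman_allocation` is an illustrative choice, and rounding to the nearest integer is assumed.

```python
def neyman_allocation(N_h, S_h, n):
    """Allocate n in proportion to N_h * S_h (unrounded stratum sample sizes)."""
    products = [Nh * Sh for Nh, Sh in zip(N_h, S_h)]
    total = sum(products)
    return [n * p / total for p in products]

# Caribou survey strata A-F: population sizes and standard deviations from the table.
caribou_N_h = [400, 30, 61, 18, 70, 120]
caribou_S_h = [3000, 2000, 9000, 2000, 12000, 1000]

caribou_raw = neyman_allocation(caribou_N_h, caribou_S_h, 225)
caribou_n_h = [round(x) for x in caribou_raw]

print([f"{x:.2f}" for x in caribou_raw])
print(caribou_n_h, sum(caribou_n_h))
```

Strata C and E receive far more than their population share because their standard deviations are large, which is the point of the rule.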
Optimal allocation



Suppose data collection costs ch vary across strata
Let C = total budget
    c0 = fixed costs (office rental, field manager)
    ch = cost per SU in stratum h (interviewer time, travel cost)
Express budget constraints as
\[
C = c_0 + \sum_{h=1}^{H} c_h n_h
\]
and determine nh
Optimal allocation – 2


Assume general case: stratum population sizes,
stratum variances, and stratum data collection costs
vary across strata
Sample size is allocated to strata in proportion to
    Stratum population size Nh
    Stratum standard deviation Sh
    Inverse square root of stratum data collection costs, \(1/\sqrt{c_h}\)
Allocation rule
\[
n_h = \frac{N_h S_h / \sqrt{c_h}}{\sum_{l=1}^{H} N_l S_l / \sqrt{c_l}}\, n
\]
Optimal allocation – 3

Obtain this formula by finding nh such that \(V(\bar{y}_{str})\)
is minimized given cost constraints
    The optimal stratum allocation will generate the smallest
    variance of \(\bar{y}_{str}\) for a given stratification and cost constraint
Sample size for stratum h (nh) is larger in strata
where one or more of the following conditions exist
    Stratum size Nh is large
    Stratum variance \(S_h^2\) is large
    Stratum per-unit data collection costs ch are small
Welfare example

Objective
    Estimate fraction of welfare participant households
    in NE Iowa that have access to a reliable vehicle
    for work
Sample design
    Frame = welfare participant list
    Stratum 1: Phone
        N1 = 4500 households, p1 = 0.85, c1 = $100
    Stratum 2: No phone
        N2 = 500 households, p2 = 0.50, c2 = $300
    Sample size n = 500
Welfare example – 2

Optimal allocation with phone strata

Stratum h     Nh         S_h^2 = ph(1-ph)   ch   NhSh/sqrt(ch)   [NhSh/sqrt(ch)] / [sum of NlSl/sqrt(cl)]   nh        whj
1: phone
2: no phone
Total         N = 5000                           sum =                                                     n = 500
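One way to work through the phone / no-phone table is sketched below. It assumes the stratum standard deviation for a proportion is Sh = sqrt(ph(1 - ph)) and applies the optimal allocation rule with n = 500; the function name `optimal_allocation` and the nearest-integer rounding are illustrative assumptions, not values stated on the slide.

```python
import math

def optimal_allocation(N_h, S_h, c_h, n):
    """Allocate n in proportion to N_h * S_h / sqrt(c_h) (unrounded)."""
    terms = [Nh * Sh / math.sqrt(ch) for Nh, Sh, ch in zip(N_h, S_h, c_h)]
    total = sum(terms)
    return [n * t / total for t in terms]

# Welfare example inputs: stratum 1 = phone, stratum 2 = no phone.
welfare_N_h = [4500, 500]
welfare_p_h = [0.85, 0.50]
welfare_c_h = [100, 300]

# Standard deviation of a 0/1 variable with proportion p_h (assumed form).
welfare_S_h = [math.sqrt(p * (1 - p)) for p in welfare_p_h]

welfare_raw = optimal_allocation(welfare_N_h, welfare_S_h, welfare_c_h, 500)
welfare_n_h = [round(x) for x in welfare_raw]
welfare_w = [Nh / nh for Nh, nh in zip(welfare_N_h, welfare_n_h)]

print([f"{x:.1f}" for x in welfare_raw])
print(welfare_n_h)
print([f"{w:.2f}" for w in welfare_w])
```

With equal costs the square-root cost terms cancel and the same function returns the Neyman allocation.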
Optimal allocation – 4


Proportional and Neyman allocation are
special cases of optimal allocation
Neyman allocation
    Data collection costs per sample unit ch are
    approximately constant across strata
        Telephone survey of US residents with regional strata
    ch term cancels out of optimal allocation formula
\[
n_h = \frac{N_h S_h}{\sum_{l=1}^{H} N_l S_l}\, n
\]
Optimal allocation – 5

Proportional allocation
    Data collection costs per sample unit ch are
    approximately constant across strata
    Within stratum variances \(S_h^2\) are approximately
    constant across strata
        Y = number of persons per household is relatively
        constant across regions
    ch and Sh terms drop out of allocation formula
\[
n_h = \frac{N_h}{N}\, n
\]
Subpopulation allocation

Suppose main interest is in estimating
stratum parameters


Define strata to be subpopulations


Subpopulation (stratum) mean, total, proportion
Estimate stratum population parameters:
y hU or t hU or p hU
Allocation rules derived from independent
SRS within each stratum (subpopulation)


Equal allocation for equal stratum costs, variances
Stratum variances change across strata
56
Subpopulation allocation – 2

Equal allocation

Assume
    Desired precision levels for each subpopulation (stratum)
    are constant across strata
    Stratum costs, stratum variances equal across strata
    Stratum FPCs near 1
Allocation rule is to divide n equally across the
H strata (subpopulations)
\[
n_h = \frac{n}{H}
\]
If Nh vary much, equal allocation will lead to less
precise estimates of parameters for full population
Welfare example – 3

Suppose we wanted to estimate
proportion of welfare households that
have access to a car for households in
each of three subpopulations in NE
Iowa



Metropolitan county
Counties adjacent to metropolitan county
Counties not adjacent to metro county
58
Welfare example – 4

Equal allocation with population density
strata
Stratum h
1: Metro
Nh
h
whj
3,800
2: Adjacent
to metro
700
3: Not
adjacent to
metro
500
Total
nh
N = 5000
n = 500
59
Subpopulation allocation – 3

More complex settings: If Sh vary across strata, can
use SRS formulas for determining stratum sample
sizes, e.g., for stratum mean
\[
n_h = \frac{z_{\alpha/2}^2 S_h^2}{e_h^2 + \dfrac{z_{\alpha/2}^2 S_h^2}{N_h}}
\]
Result is
\[
n = \sum_{h=1}^{H} n_h
\]
May get sample sizes (nh) that are too large or small
relative to budget
    Relax margin of error eh and/or confidence level 100(1-\(\alpha\))%
    Recalibrate stratum sample sizes to get desired sample size
Welfare example – 5

95% CI, e = 0.10 for all pop density strata
Stratum h                   Nh         ph     S_h^2   Initial nh   Recalibrate nh
1: Metro                    3,800      0.70   0.21
2: Adjacent to metro        700        0.80   0.16
3: Not adjacent to metro    500        0.90   0.09
Total                       N = 5000                               n = 500
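The "Initial nh" column could be filled in with the SRS sample size formula for each stratum, as in the sketch below. It assumes z = 1.96 for a 95% CI (the slides elsewhere appear to round z squared to 4, which would give somewhat larger values) and takes S_h^2 = ph(1 - ph) from the table; the function name is hypothetical, and no recalibration step is attempted.

```python
def stratum_sample_size(S2_h, N_h, e_h, z):
    """SRS sample size for a stratum mean with margin of error e_h, including the FPC."""
    numerator = z**2 * S2_h
    return numerator / (e_h**2 + numerator / N_h)

# Population density strata: metro, adjacent to metro, not adjacent to metro.
density_N_h = [3800, 700, 500]
density_S2_h = [0.21, 0.16, 0.09]

Z_95 = 1.96  # assumed critical value for a 95% confidence interval
initial_n_h = [
    stratum_sample_size(S2, Nh, e_h=0.10, z=Z_95)
    for S2, Nh in zip(density_S2_h, density_N_h)
]
initial_n = sum(initial_n_h)

print([f"{x:.1f}" for x in initial_n_h])
print(f"initial n = {initial_n:.1f}")
```

Under these assumptions the initial total falls well below the budgeted n = 500, so the stratum sample sizes could be scaled up or the margins of error tightened.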
Compromise allocations
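Proportional allocation: nh = nNh /N
Equal allocation: nh = n /H
Square root allocation:
\[
n_h = n\, \frac{\sqrt{N_h}}{\sum_{l=1}^{H} \sqrt{N_l}}
\]
[Figures: nh plotted against Nh for proportional, equal, and square root allocation]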
Square root allocation
\[
n_h = n\, \frac{\sqrt{N_h}}{\sum_{l=1}^{H} \sqrt{N_l}}
\]
[Figure: nh plotted against Nh for square root allocation]

More SUs to small strata than proportional allocation
Fewer SUs to small strata than equal allocation
Variance for subpopulation estimates is smaller than proportional allocation
Variance for whole population estimates is smaller than equal allocation
Compromise allocations – 2

May want to set
    Minimum number of SUs in a stratum
    Cap on max number of SUs in a stratum
Rule
    nh = min nh for Nh < A
    nh = max nh for Nh > B
    Apply rule in between A and B
        Square root
        Proportional
[Figures: nh plotted against Nh, bounded below by min nh (for Nh < A) and above by max nh (for Nh > B)]
Welfare example – 6

Comparing equal, proportional and
square root allocation
Stratum h                   Nh         Equal allocation   Proportional allocation   Square root of Nh   Square root allocation
1: Metro                    3,800      167
2: Adjacent to metro        700        167
3: Not adjacent to metro    500        166
Total                       N = 5000   n = 500            n = 500                   Sum =               n = 500
Other allocations

Certainty stratum is used to guarantee
inclusion in sample


Census (sample all) the units in a stratum
For certainty stratum h



Allocation: nh = Nh
Inclusion probability: hj = 1
Ad hoc allocations


The sample allocation does not have to follow any
of the rules mentioned so far
However, you should determine the stratum
allocation in relation to analysis objectives and
operational constraints
66
Welfare example – 7

Ad hoc allocation
Stratum h                   Nh         Equal allocation   Square root allocation   Proportional allocation   Actual allocation
1: Metro                    3,800      167                279                      380                       200
2: Adjacent to metro        700        167                120                      70                        150
3: Not adjacent to metro    500        166                101                      50                        150
Total                       N = 5000   n = 500            n = 500                  n = 500                   n = 500
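The equal, proportional, and square root columns in the table above can be reproduced with the three rules side by side. The function names are hypothetical and the results are left unrounded until the final step; rounding each stratum to the nearest integer is an assumption.

```python
import math

def equal_allocation(N_h, n):
    """Same sample size in every stratum: n / H."""
    return [n / len(N_h) for _ in N_h]

def proportional_allocation(N_h, n):
    """Sample size proportional to stratum size: n * N_h / N."""
    N = sum(N_h)
    return [n * Nh / N for Nh in N_h]

def square_root_allocation(N_h, n):
    """Sample size proportional to sqrt(N_h)."""
    roots = [math.sqrt(Nh) for Nh in N_h]
    total = sum(roots)
    return [n * r / total for r in roots]

metro_N_h = [3800, 700, 500]  # metro, adjacent, not adjacent
alloc_equal = equal_allocation(metro_N_h, 500)
alloc_prop = [round(x) for x in proportional_allocation(metro_N_h, 500)]
alloc_sqrt = [round(x) for x in square_root_allocation(metro_N_h, 500)]

print([f"{x:.1f}" for x in alloc_equal])
print(alloc_prop)
print(alloc_sqrt)
```

Equal allocation gives 166.7 per stratum, which the table splits as 167, 167, 166 so that the sizes add to 500. The square root values sit between the equal and proportional values in every stratum.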
Determining sample size n

Determine allocation using rule expressed in terms of
relative sample size nh /n
\[
\frac{n_h}{n} = \frac{N_h S_h / \sqrt{c_h}}{\sum_{l=1}^{H} N_l S_l / \sqrt{c_l}}
\]
Rewrite variance of \(\hat{t}_{str}\) as a function of relative
sample sizes (ignoring stratum FPCs)
\[
V(\hat{t}_{str}) \approx \frac{1}{n} \sum_{h=1}^{H} \frac{n}{n_h} N_h^2 S_h^2 = \frac{\nu}{n}
\quad \text{where} \quad
\nu = \sum_{h=1}^{H} \frac{n}{n_h} N_h^2 S_h^2
\]
Sample size calculation based on margin of error e
for population total
\[
n = \frac{z_{\alpha/2}^2\, \nu}{e^2}
\]
Determining sample size n – 2

Rewrite variance of \(\bar{y}_{str}\) as a function of relative
sample sizes (ignoring stratum FPCs)
\[
V(\bar{y}_{str}) \approx \frac{1}{n N^2} \sum_{h=1}^{H} \frac{n}{n_h} N_h^2 S_h^2 = \frac{\nu}{n N^2}
\quad \text{where} \quad
\nu = \sum_{h=1}^{H} \frac{n}{n_h} N_h^2 S_h^2
\]
Sample size calculation based on margin of error e
for population mean
\[
n = \frac{z_{\alpha/2}^2\, \nu}{e^2 N^2}
\]
Welfare example – 8

Relative sample size for equal allocation
\[
\frac{n_h}{n} = \frac{1}{H}
\]
Value of \(\nu\)
\[
\nu = \sum_{h=1}^{H} \frac{n}{n_h} N_h^2 S_h^2 = H \sum_{h=1}^{H} N_h^2 S_h^2
= 3\left[3800^2 (.21) + 700^2 (.16) + 500^2 (.09)\right] = 9{,}399{,}900
\]
For 95% CI with e = 0.1
\[
n = \frac{z_{\alpha/2}^2\, \nu}{e^2 N^2} = \frac{4(9{,}399{,}900)}{.01(25{,}000{,}000)} \approx 150
\]
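The sample size calculation for the welfare example can be verified numerically. This sketch follows the slide in taking z squared to be approximately 4 for a 95% CI and in ignoring stratum FPCs; the function names and the symbol name `nu` for the weighted variance sum are illustrative choices.

```python
def nu_from_relative_allocation(N_h, S2_h, rel_alloc):
    """Sum over strata of (n / n_h) * N_h^2 * S_h^2, where rel_alloc[h] = n_h / n."""
    return sum(Nh**2 * S2 / a for Nh, S2, a in zip(N_h, S2_h, rel_alloc))

def sample_size_for_mean(nu, N, e, z_squared):
    """Total n needed for margin of error e on the population mean (FPCs ignored)."""
    return z_squared * nu / (e**2 * N**2)

ss_N_h = [3800, 700, 500]
ss_S2_h = [0.21, 0.16, 0.09]
H = len(ss_N_h)

# Equal allocation: n_h / n = 1 / H in every stratum.
nu_equal = nu_from_relative_allocation(ss_N_h, ss_S2_h, [1 / H] * H)
n_required = sample_size_for_mean(nu_equal, N=sum(ss_N_h), e=0.1, z_squared=4)

print(f"nu = {nu_equal:,.0f}")
print(f"n  = {n_required:.1f}")
```

Passing a different list of relative sample sizes (for example, proportional shares Nh / N) lets the same two functions compare the total n required under other allocation rules.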
STS Summary

Choose stratification scheme
    Scheme depends on objectives, operational constraints
    Must know stratum identifier for each SU in the frame
Set a design for each stratum
    Design for each stratum – SRS, SYS, …
    Determine n and nh
Select sample independently within each stratum
Pool stratum estimates to get estimates of population
parameters