Ch 4: Stratified Random Sampling (STS)

DEFN: A stratified random sample is obtained by separating the population units into non-overlapping groups, called strata, and then selecting a random sample from each stratum.

Procedure
- Divide sampling frame into mutually exclusive and exhaustive strata ($h = 1, 2, \ldots, H$)
  - Assign each SU to one and only one stratum
- Select a random sample from each stratum
  - Select random sample from stratum 1
  - Select random sample from stratum 2
  - …

Ag example
- Divide 3078 counties into 4 strata corresponding to regions of the country
  - Northeast (h = 1)
  - North central (h = 2)
  - South (h = 3)
  - West (h = 4)
- Select an SRS from each stratum
- In this example, stratum sample size is proportional to stratum population size
  - 300 is 9.75% of 3078
  - Each stratum sample size is 9.75% of the stratum population

Ag example – 2

  Stratum (h)   Stratum size ($N_h$)   Sample size ($n_h$)
  1 (NE)               220                    21
  2 (NC)              1054                   103
  3 (S)               1382                   135
  4 (W)                422                    41
  Total               3078                   300

Procedure – 2
- Need to have a stratum value for each SU in the frame
- Minimum set of variables in sampling frame: SU id, stratum assignment

  Stratum (h)   SU (j)
  1             1
  1             2
  1             3
  2             1
  2             2
  …             …

Ag example – 3

  Stratum (h)   SU (j)
  1             1
  1             2
  1             3
  …             …
  1             220
  2             1
  2             2
  …             …
  4             421
  4             422

Procedure – 3
- Each stratum sample is selected independently of the others
  - New set of random numbers for each stratum
  - Basis for deriving properties of estimators
- Design within a stratum
  - For Ch 4, we will assume an SRS is selected within each stratum
  - Can use any probability design within a stratum
  - Sample designs do not need to be the same across strata

Uses for STS
- To improve representativeness of sample
  - In SRS, can get ANY combination of n elements in the sample
    - Can get “bad” samples
  - In SYS, we severely restricted the set to k possible samples
    - Less likely to get unbalanced samples if frame is sorted using a variable correlated with Y

Uses for STS – 2
- To improve representativeness of sample (continued)
  - In STS, we also exclude samples
    - Explicitly choose strata to restrict possible samples
  - Improve the chance of getting representative samples if strata are used to encourage spread across the variation in the population

Uses for STS – 3
- To improve precision of estimates for population parameters
  - Achieved by creating strata so that
    - variation WITHIN stratum is small
    - variation AMONG strata is large
  - Uses the same principle as “blocking” in experimental design
  - Improve precision of the estimate for a population parameter by obtaining precise estimates within each stratum

Uses for STS – 4
- To study specific subpopulations
  - Define strata to be subpopulations of interest
  - Examples
    - Male v. female
    - Racial/ethnic minorities
    - Geographic regions
    - Population density (rural v. urban)
    - College classification
  - Can establish sample size within each stratum to achieve the desired precision level for estimates of subpopulations

Uses for STS – 5
- To assist in implementing operational aspects of survey
  - May wish to apply different sampling and data collection procedures for different groups
  - Agricultural surveys (sample designs)
    - Large farms in one stratum are selected using a list frame
    - Smaller farms belong to a second stratum, and are selected using an area sample
  - Survey of employers (data collection methods)
    - Large firms: use mail survey because information is too voluminous to get over the phone
    - Small firms: telephone survey

Estimation strategy
- Objective: estimate population total
- Obtain estimates for each stratum
  - Estimate stratum population total
    - Use SRS estimator for stratum total
  - Estimate variance of estimator in each stratum
    - Use SRS estimator for variance of estimated stratum total
- Pool estimates across strata
  - Sum stratum total estimates and variance estimates across strata
  - Variance formula justified by independence of samples across strata

Ag example – 4
($\bar y_h$ = sample mean of acres devoted to farms per county; $\hat t_h$ = estimated total farm acres for the stratum)

  Stratum (h)   Stratum size ($N_h$)   Sample size ($n_h$)   Sample mean ($\bar y_h$)   Estimated stratum total ($\hat t_h$)
  1 (NE)               220                    21                   97,630                    21,478,558
  2 (NC)              1054                   103                  300,504                   316,731,379
  3 (S)               1382                   135                  211,315                   292,037,391
  4 (W)                422                    41                  662,295                   279,488,706
  Total               3078                   300

Ag example – 5
- Estimated total farm acres in US

  $\hat t_{str} = \sum_{h=1}^{H} \hat t_h = \sum_{h=1}^{H} N_h \bar y_h = 220(97,630) + 1054(300,504) + 1382(211,315) + 422(662,295) = 909,736,034$ farm acres in US

Ag example – 6

  Stratum (h)   Stratum size ($N_h$)   Sample size ($n_h$)   Sample variance ($s_h^2$)
  1 (NE)               220                    21                7,647,472,708
  2 (NC)              1054                   103               29,618,183,543
  3 (S)               1382                   135               53,587,487,856
  4 (W)                422                    41              396,185,950,266
  Total               3078                   300

Ag example – 7
- Estimated variance for estimated total farm acres in US

  $\hat V(\hat t_{str}) = \sum_{h=1}^{H} \hat V(\hat t_h) = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{s_h^2}{n_h}$
  $= 220^2 \left(1 - \frac{21}{220}\right) \frac{7,647,472,708}{21} + 1054^2 (\ldots) + 1382^2 (\ldots) + 422^2 (\ldots) = 2.5419 \times 10^{15}$

  $SE(\hat t_{str}) = \sqrt{\hat V(\hat t_{str})} = 50,417,248$ acres

Ag example – 8
- Compare with SRS estimates

  $\hat t = N \bar y = 916,927,100$ acres
  $\hat V(N \bar y) = 3.38368 \times 10^{15}$
  $SE(N \bar y) = \sqrt{\hat V(N \bar y)} = 58,169,381$ acres

Estimation strategy – 2
- Objective: estimate population mean
- Divide estimated total by population size: $\bar y_{str} = \hat t_{str} / N$
- OR equivalently,
  - Obtain estimates for each stratum
    - Estimate stratum mean with stratum sample mean
  - Pool estimates across strata
    - Use weighted average of stratum sample means with weights proportional to stratum sizes $N_h$

Ag example – 9
- Estimated mean farm acres / county

  $\bar y_{str} = \frac{\hat t_{str}}{N} = \frac{909,736,034}{3078} \approx 295,561$
  or
  $\bar y_{str} = \sum_{h=1}^{H} \frac{N_h}{N} \bar y_h = \frac{220}{3078}(97,630) + \frac{1054}{3078}(300,504) + \frac{1382}{3078}(211,315) + \frac{422}{3078}(662,295) \approx 295,561$ farm acres / county

Ag example – 10
- Estimate variance of estimated mean farm acres / county

  $\hat V(\bar y_{str}) = \frac{1}{N^2} \hat V(\hat t_{str})$
  or
  $\hat V(\bar y_{str}) = \sum_{h=1}^{H} \frac{N_h^2}{N^2} \hat V(\bar y_h)$

Notation
- Index set for stratum $h = 1, 2, \ldots, H$
  - $U_h = \{1, 2, \ldots, N_h\}$
  - $N_h$ = number of OUs in stratum h in the population
- Partition sample of size n across strata
  - $n_h$ = number of sample units from stratum h (fixed)
  - $\mathcal{S}_h$ = index set for sample belonging to stratum h

Notation – 2
- Population sizes
  - $N_h$ = number of OUs in stratum h in the population
  - $N = N_1 + N_2 + \ldots + N_H$
- Partition sample of size n across strata
  - $n_h$ = number of sample units from stratum h
  - $n = n_1 + n_2 + \ldots + n_H$
  - The stratum sample sizes are fixed
    - In domain estimation, they are random
- For now, we will assume that the sampling unit (SU) is an observation unit (OU)

Notation – 3
- Response variable
  - $y_{hj}$ = characteristic of interest for OU j in stratum h
- Population and stratum totals

  $t_h = \sum_{j=1}^{N_h} y_{hj}$   (population total in stratum h)
  $t = \sum_{h=1}^{H} t_h$   (population total)

Notation – 4
- Population and stratum means

  $\bar y_{hU} = \frac{1}{N_h} \sum_{j=1}^{N_h} y_{hj}$   (population mean in stratum h)
  $\bar y_U = \frac{t}{N} = \frac{1}{N} \sum_{h=1}^{H} \sum_{j=1}^{N_h} y_{hj}$   (overall population mean)

Notation – 5
- Population stratum variance

  $S_h^2 = \frac{\sum_{j=1}^{N_h} (y_{hj} - \bar y_{hU})^2}{N_h - 1}$   (population variance in stratum h)

Notation – 6
- SRS estimators for stratum parameters

  $\bar y_h = \frac{1}{n_h} \sum_{j \in \mathcal{S}_h} y_{hj}$
  $\hat t_h = \frac{N_h}{n_h} \sum_{j \in \mathcal{S}_h} y_{hj} = N_h \bar y_h$
  $s_h^2 = \frac{\sum_{j \in \mathcal{S}_h} (y_{hj} - \bar y_h)^2}{n_h - 1}$

STS estimators
- For population total

  $\hat t_{str} = \sum_{h=1}^{H} \hat t_h = \sum_{h=1}^{H} N_h \bar y_h$
  $\hat V(\hat t_{str}) = \sum_{h=1}^{H} \hat V(\hat t_h) = \sum_{h=1}^{H} N_h^2 \left(1 - \frac{n_h}{N_h}\right) \frac{s_h^2}{n_h}$

STS estimators – 2
- For population mean

  $\bar y_{str} = \frac{\hat t_{str}}{N} = \sum_{h=1}^{H} \frac{N_h}{N} \bar y_h$
  $\hat V(\bar y_{str}) = \frac{1}{N^2} \hat V(\hat t_{str}) = \sum_{h=1}^{H} \frac{N_h^2}{N^2} \hat V(\bar y_h)$, where $\hat V(\bar y_h) = \left(1 - \frac{n_h}{N_h}\right) \frac{s_h^2}{n_h}$

STS estimators – 3
- For population proportion (the mean of a 0/1 variable, with $\hat p_h$ the sample proportion in stratum h)

  $\hat p_{str} = \sum_{h=1}^{H} \frac{N_h}{N} \hat p_h$
  $\hat V(\hat p_{str}) = \sum_{h=1}^{H} \frac{N_h^2}{N^2} \left(1 - \frac{n_h}{N_h}\right) \frac{\hat p_h (1 - \hat p_h)}{n_h - 1}$

Properties
- STS estimators are unbiased
  - $\bar y_{str}$ is an unbiased estimator of $\bar y_U$
  - $\hat t_{str}$ is an unbiased estimator of $t$
  - $\hat p_{str}$ is an unbiased estimator of $p$
- Each estimate of stratum population mean or total is unbiased (from SRS)

  $E[\bar y_{str}] = E\left[\sum_{h=1}^{H} \frac{N_h}{N} \bar y_h\right] = \sum_{h=1}^{H} \frac{N_h}{N} E[\bar y_h] = \sum_{h=1}^{H} \frac{N_h}{N} \bar y_{hU} = \bar y_U$

Properties – 2
- Inclusion probability for SU j in stratum h
  - Definition in words: the probability that SU j in stratum h is included in the sample
  - Formula (SRS within each stratum): $\pi_{hj} = \frac{n_h}{N_h}$

Properties – 3
- In general, for most stratification schemes, STS will provide a more precise estimate of the population parameters (mean, total, proportion) than an SRS of the same size
  - For example, $V(\bar y_{str}) \le V(\bar y)$
- Confidence intervals
  - Same form (using $z_{\alpha/2}$)
  - Different CLT

Sampling weights
- Note that

  $\hat t_{str} = \sum_{h=1}^{H} \hat t_h = \sum_{h=1}^{H} N_h \bar y_h = \sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} \frac{N_h}{n_h} y_{hj} = \sum_{h=1}^{H} \sum_{j \in \mathcal{S}_h} w_{hj} y_{hj}$

- Sampling weight for SU j in stratum h: $w_{hj} = \frac{N_h}{n_h}$
- A sampling weight is a measure of the number of units in the population represented by SU j in stratum h

Example

  Stratum (h)   $N_h$   $n_h$   $w_{hj} = N_h / n_h$
  h=1             6       3      6/3 = 2
  h=2             2       2      2/2 = 1
  h=3             4       1      4/1 = 4
  h=4             5       3      5/3 = 1.67
  Total          17       9

- Note: weights for each OU within a stratum are the same

Example – 2
- Dataset from study

  Stratum (h)   $N_h$   $n_h$   $w_{hj}$   $y_{hj}$
  1               6       3      2          53
  1               6       3      2         107
  1               6       3      2          83
  2               2       2      1          34
  2               2       2      1          22
  3               4       1      4          90
  4               5       3      1.67       12
  4               5       3      1.67       34
  4               5       3      1.67       15

Sampling weights – 2
- For STS estimators presented in Ch 4, the sampling weight is the inverse of the inclusion probability

  $w_{hj} = \frac{1}{\pi_{hj}} = \frac{N_h}{n_h}$, since $\pi_{hj} = \frac{n_h}{N_h}$

Defining strata
- Depends on purpose of stratification
  - Improved representativeness
  - Improved precision
  - Subpopulation estimates
  - Implementing operational aspects
- If possible, use factors related to variation in characteristic of interest, Y
  - Geography, political boundaries, population density
  - Gender, ethnicity/race, ISU classification
  - Size or type of business
- Remember: stratum variable must be available for all OUs

Allocation strategies
- Want to sample n units from the population
- An allocation rule defines how n will be spread across the H strata and thus defines values for $n_h$
- Overview for estimating population parameters (special cases of optimal allocation)

  Stratum costs same   Stratum variances same   Allocation rule
  No                   No                       Optimal
  Yes                  No                       Neyman
  Yes                  Yes                      Proportional

Allocation strategies – 2
- Focus is on estimating a parameter for the entire population
  - We’ll look at subpopulations later
- Factors affecting allocation rule
  - Number of OUs in stratum
  - Data collection costs within strata
  - Within-stratum variance

Proportional allocation
- Stratum sample size allocated in proportion to population size within stratum: $\frac{n_h}{n} = \frac{N_h}{N}$
- Allocation rule: $n_h = n \frac{N_h}{N}$
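The proportional allocation rule above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original slides: the function name `proportional_allocation` and the choice to round each $n_h$ to the nearest integer are assumptions, while the stratum sizes are the county counts from the Ag example.

```python
# Proportional allocation: n_h = n * N_h / N, then rounded to whole units.
# The function name and the rounding rule are illustrative assumptions.

def proportional_allocation(stratum_sizes, n):
    """Return (exact, rounded) stratum sample sizes under n_h = n * N_h / N."""
    N = sum(stratum_sizes.values())
    exact = {h: n * N_h / N for h, N_h in stratum_sizes.items()}
    rounded = {h: round(x) for h, x in exact.items()}
    return exact, rounded

# Stratum sizes N_h from the Ag example (counties per region).
stratum_sizes = {"NE": 220, "NC": 1054, "S": 1382, "W": 422}
exact, rounded = proportional_allocation(stratum_sizes, n=300)

for h in stratum_sizes:
    print(f"{h}: exact {exact[h]:.1f}, rounded {rounded[h]}")
print("total allocated:", sum(rounded.values()))
```

Simple rounding happens to sum to n = 300 for these strata; in general the rounded sizes may need a small adjustment to hit the target total.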
Ag example – 11

  Stratum h   Stratum total ($N_h$)   Stratum sample size ($n_h$)   $n (N_h / N)$
  1 (NE)             220                      21                    .0975 (220) = 21.4
  2 (NC)            1054                     103                    .0975 (1054) = 102.7
  3 (S)             1382                     135                    .0975 (1382) = 134.7
  4 (W)              422                      41                    .0975 (422) = 41.1
  Total         N = 3078                 n = 300

Proportional allocation – 2
- Proportional allocation rule implies
  - Sampling fraction for stratum h is constant across strata: $\frac{n_h}{N_h} = \frac{n}{N}$
  - Inclusion probability is constant for all SUs in the population: $\pi_{hj} = \frac{n_h}{N_h} = \frac{n}{N}$
  - Sampling weight for each unit is constant: $w_{hj} = \frac{1}{\pi_{hj}} = \frac{N}{n}$

Proportional allocation – 3
- STS with proportional allocation leads to a self-weighting sample
- What is a self-weighting sample?
  - If $w_{hj}$ has the same value for every OU in the sample, the sample is said to be self-weighting
  - Since each weight is the same, each sample unit represents the same number of units in the population
  - For self-weighting samples, the estimator for the population mean reduces to the sample mean $\bar y$
  - The estimator for the variance does NOT necessarily reduce to the SRS estimator for the variance of $\bar y$

Proportional allocation – 4
- Check to see that an STS with proportional allocation generates a self-weighting sample
  - Is the sample weight $w_{hj}$ the same for each OU?
  - Is the estimator for the population mean $\bar y_{str}$ equal to the sample mean $\bar y$?
  - What happens to the variance of $\bar y_{str}$?
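The three self-weighting questions above can be checked numerically. The sketch below uses a small hypothetical data set (the stratum sizes and y values are made up for illustration and do not come from the slides) in which the allocation is exactly proportional, so $n_h / N_h = n / N = 0.1$ in both strata.

```python
from statistics import mean, variance

# Hypothetical example (not from the source): two strata, exactly proportional
# allocation, so n_h / N_h = 6/60 = 4/40 = 0.1.
N_h = {1: 60, 2: 40}
samples = {1: [10, 12, 11, 9, 13, 11], 2: [30, 28, 32, 30]}

N = sum(N_h.values())
n = sum(len(y) for y in samples.values())

# Sampling weights w_hj = N_h / n_h (same for every unit within a stratum).
weights = {h: N_h[h] / len(y) for h, y in samples.items()}

# Stratified estimator of the mean: weighted average of stratum sample means.
y_str = sum(N_h[h] / N * mean(y) for h, y in samples.items())

# Plain sample mean over all sampled units.
all_y = [v for y in samples.values() for v in y]
y_bar = mean(all_y)

# Stratified variance estimator vs. the SRS variance estimator for y_bar.
v_str = sum(
    (N_h[h] / N) ** 2 * (1 - len(y) / N_h[h]) * variance(y) / len(y)
    for h, y in samples.items()
)
v_srs = (1 - n / N) * variance(all_y) / n

print("weights:", weights)
print("y_str:", y_str, "sample mean:", y_bar)
print("V_hat(y_str):", v_str, "SRS V_hat(y_bar):", v_srs)
```

In this made-up data set the weights are all equal and $\bar y_{str}$ matches the sample mean, but the two variance estimators differ, because the stratified estimator uses only within-stratum variation.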
Ag example – 12

  Stratum h   Stratum total ($N_h$)   Stratum sample size ($n_h$)   Sample weight ($w_{hj}$)
  1 (NE)             220                      21                    220/21 = 10.5
  2 (NC)            1054                     103                    1054/103 = 10.2
  3 (S)             1382                     135                    1382/135 = 10.2
  4 (W)              422                      41                    422/41 = 10.3
  Total         N = 3078                 n = 300

- Even though we have used proportional allocation, rounding in setting sample sizes can lead to unequal (but approximately equal) weights

Neyman allocation
- Suppose within-stratum variances $S_h^2$ vary across strata
- Stratum sample size allocated in proportion to
  - Population size within stratum $N_h$
  - Population standard deviation within stratum $S_h$
- Allocation rule

  $n_h = n \frac{N_h S_h}{\sum_{l=1}^{H} N_l S_l}$

Caribou survey example

  Stratum h   $N_h$    $S_h$     $N_h S_h$    $n \frac{N_h S_h}{\sum_l N_l S_l}$   $n_h$   $w_{hj} = N_h / n_h$
  A            400     3,000    1,200,000     96.26                                 96     400/96 = 4.17
  B             30     2,000       60,000      4.81                                  5     30/5 = 6.00
  C             61     9,000      549,000     44.04                                 44     61/44 = 1.39
  D             18     2,000       36,000      2.89                                  3     18/3 = 6.00
  E             70    12,000      840,000     67.38                                 67     70/67 = 1.04
  F            120     1,000      120,000      9.63                                 10     120/10 = 12.00
  Total    N = 699              $\sum_l N_l S_l$ = 2,805,000                     n = 225

Optimal allocation
- Suppose data collection costs $c_h$ vary across strata
- Let
  - $C$ = total budget
  - $c_0$ = fixed costs (office rental, field manager)
  - $c_h$ = cost per SU in stratum h (interviewer time, travel cost)
- Express budget constraints as

  $C = c_0 + \sum_{h=1}^{H} c_h n_h$

  and determine $n_h$

Optimal allocation – 2
- Assume general case: stratum population sizes, stratum variances, and stratum data collection costs vary across strata
- Sample size is allocated to strata in proportion to
  - Stratum population size $N_h$
  - Stratum standard deviation $S_h$
  - Inverse square root of stratum data collection costs, $1 / \sqrt{c_h}$
- Allocation rule

  $n_h = n \frac{N_h S_h / \sqrt{c_h}}{\sum_{l=1}^{H} N_l S_l / \sqrt{c_l}}$

Optimal allocation – 3
- Obtain this formula by finding $n_h$ such that $V(\bar y_{str})$ is minimized given cost constraints
- The optimal stratum allocation will generate the smallest variance of $\bar y_{str}$ for a given stratification and cost constraint
- Sample size for stratum h ($n_h$) is larger in strata where one or more of the following conditions exist
  - Stratum size $N_h$ is large
  - Stratum variance $S_h^2$ is large
  - Stratum per-unit data collection costs $c_h$ are small

Welfare example
- Objective
  - Estimate fraction of welfare participant households in NE Iowa that have access to a reliable vehicle for work
- Sample design
  - Frame = welfare participant list
  - Stratum 1: Phone
    - $N_1 = 4500$ households, $p_1 = 0.85$, $c_1 = \$100$
  - Stratum 2: No phone
    - $N_2 = 500$ households, $p_2 = 0.50$, $c_2 = \$300$
  - Sample size n = 500

Welfare example – 2
- Optimal allocation with phone strata

  Stratum h     $N_h$   $p_h$   $S_h^2 \approx p_h (1 - p_h)$   $c_h$   $N_h S_h / \sqrt{c_h}$   $n \frac{N_h S_h / \sqrt{c_h}}{\sum_l N_l S_l / \sqrt{c_l}}$   $n_h$   $w_{hj}$
  1: phone
  2: no phone
  Total         N = 5000                                                $\sum_l N_l S_l / \sqrt{c_l}$                                                    n = 500

Optimal allocation – 4
- Proportional and Neyman allocation are special cases of optimal allocation
- Neyman allocation
  - Data collection costs per sample unit $c_h$ are approximately constant across strata
    - Telephone survey of US residents with regional strata
  - $c_h$ term cancels out of optimal allocation formula

  $n_h = n \frac{N_h S_h}{\sum_{l=1}^{H} N_l S_l}$

Optimal allocation – 5
- Proportional allocation
  - Data collection costs per sample unit $c_h$ are approximately constant across strata
  - Within-stratum variances $S_h^2$ are approximately constant across strata
    - Y = number of persons per household is relatively constant across regions
  - $c_h$ and $S_h$ terms drop out of allocation formula

  $n_h = n \frac{N_h}{N}$

Subpopulation allocation
- Suppose main interest is in estimating stratum parameters
  - Define strata to be subpopulations
  - Subpopulation (stratum) mean, total, proportion
  - Estimate stratum population parameters: $\bar y_{hU}$ or $t_h$ or $p_h$
- Allocation rules derived from independent SRS within each stratum (subpopulation)
  - Equal allocation for equal stratum costs, variances
  - Stratum variances change across strata

Subpopulation allocation – 2
- Equal allocation
- Assume
  - Desired precision levels for each subpopulation (stratum) are constant across strata
  - Stratum costs, stratum variances equal across strata
  - Stratum FPCs near 1
- Allocation rule is to divide n equally across the H strata (subpopulations): $n_h = \frac{n}{H}$
- If the $N_h$ vary much, equal allocation will lead to less precise estimates of parameters for the full population

Welfare example – 3
- Suppose we wanted to estimate the proportion of welfare households that have access to a car for households in each of three subpopulations in NE Iowa
  - Metropolitan county
  - Counties adjacent to metropolitan county
  - Counties not adjacent to metro county

Welfare example – 4
- Equal allocation with population density strata

  Stratum h                    $N_h$     $n_h$   $w_{hj}$
  1: Metro                     3,800
  2: Adjacent to metro           700
  3: Not adjacent to metro       500
  Total                     N = 5000   n = 500

Subpopulation allocation – 3
- More complex settings: if $S_h$ vary across strata, can use SRS formulas for determining stratum sample sizes, e.g., for stratum mean

  $n_h = \frac{z_{\alpha/2}^2 S_h^2}{e_h^2 + \frac{z_{\alpha/2}^2 S_h^2}{N_h}}$

- Result is $n = \sum_{h=1}^{H} n_h$
- May get sample sizes ($n_h$) that are too large or small relative to budget
  - Relax margin of error $e_h$ and/or confidence level $100(1 - \alpha)\%$
  - Recalibrate stratum sample sizes to get desired sample size

Welfare example – 5
- 95% CI, e = 0.10 for all pop density strata

  Stratum h                    $N_h$     $p_h$   $S_h^2$   Initial $n_h$   Recalibrated $n_h$
  1: Metro                     3,800     0.70    0.21
  2: Adjacent to metro           700     0.80    0.16
  3: Not adjacent to metro       500     0.90    0.09
  Total                     N = 5000                                        n = 500

Compromise allocations
[Figure: $n_h$ plotted against $N_h$ for each allocation]
- Proportional allocation: $n_h = n N_h / N$
- Equal allocation: $n_h = n / H$
- Square root allocation: $n_h = n \frac{\sqrt{N_h}}{\sum_{l=1}^{H} \sqrt{N_l}}$

Square root allocation

  $n_h = n \frac{\sqrt{N_h}}{\sum_{l=1}^{H} \sqrt{N_l}}$

[Figure: $n_h$ plotted against $N_h$ under square root allocation]
- More SUs to small strata than proportional allocation
- Fewer SUs to large strata than equal allocation
- Variance for subpopulation estimates is smaller than under proportional allocation
- Variance for whole population estimates is smaller than under equal allocation

Compromise allocations – 2
[Figure: $n_h$ plotted against $N_h$, with $\min n_h$, $\max n_h$ and the thresholds A and B marked]
- May want to set
  - Minimum number of SUs in a stratum
  - Cap on max number of SUs in a stratum
- Rule
  - $n_h = \min n_h$ for $N_h < A$
  - $n_h = \max n_h$ for $N_h > B$
  - Apply rule in between A and B
    - Square root
    - Proportional

Welfare example – 6
- Comparing equal, proportional and square root allocation

  Stratum h                    $N_h$     Equal allocation   Proportional allocation   Square root of $N_h$   Square root allocation
  1: Metro                     3,800          167
  2: Adjacent to metro           700          167
  3: Not adjacent to metro       500          166
  Total                     N = 5000       n = 500           n = 500                   Sum =                  n = 500

Other allocations
- Certainty stratum is used to guarantee inclusion in sample
  - Census (sample all) the units in a stratum
  - For certainty stratum h
    - Allocation: $n_h = N_h$
    - Inclusion probability: $\pi_{hj} = 1$
- Ad hoc allocations
  - The sample allocation does not have to follow any of the rules mentioned so far
  - However, you should determine the stratum allocation in relation to analysis objectives and operational constraints

Welfare example – 7
- Ad hoc allocation

  Stratum h                    $N_h$     Equal allocation   Square root allocation   Proportional allocation   Actual allocation
  1: Metro                     3,800          167                279                      380                      200
  2: Adjacent to metro           700          167                120                       70                      150
  3: Not adjacent to metro       500          166                101                       50                      150
  Total                     N = 5000       n = 500            n = 500                  n = 500                  n = 500

Determining sample size n
- Determine allocation using a rule expressed in terms of relative sample size $n_h / n$, e.g.

  $\frac{n_h}{n} = \frac{N_h S_h / \sqrt{c_h}}{\sum_{l=1}^{H} N_l S_l / \sqrt{c_l}}$

- Rewrite variance of $\hat t_{str}$ as a function of relative sample sizes (ignoring stratum FPCs)

  $V(\hat t_{str}) \approx \frac{1}{n} \sum_{h=1}^{H} \frac{n}{n_h} N_h^2 S_h^2 = \frac{\nu}{n}$, where $\nu = \sum_{h=1}^{H} \frac{n}{n_h} N_h^2 S_h^2$

- Sample size calculation based on margin of error e for population total

  $n = \frac{z_{\alpha/2}^2 \nu}{e^2}$

Determining sample size n – 2
- Rewrite variance of $\bar y_{str}$ as a function of relative sample sizes (ignoring stratum FPCs)

  $V(\bar y_{str}) \approx \frac{1}{n N^2} \sum_{h=1}^{H} \frac{n}{n_h} N_h^2 S_h^2 = \frac{\nu}{n N^2}$, where $\nu = \sum_{h=1}^{H} \frac{n}{n_h} N_h^2 S_h^2$

- Sample size calculation based on margin of error e for population mean

  $n = \frac{z_{\alpha/2}^2 \nu}{e^2 N^2}$

Welfare example – 8
- Relative sample size for equal allocation: $\frac{n_h}{n} = \frac{1}{H}$
- Value of $\nu$

  $\nu = \sum_{h=1}^{H} \frac{n}{n_h} N_h^2 S_h^2 = H \sum_{h=1}^{H} N_h^2 S_h^2 = 3 [3800^2 (.21) + 700^2 (.16) + 500^2 (.09)] = 9,399,900$

- For 95% CI with e = 0.1 (taking $z_{\alpha/2}^2 \approx 4$)

  $n = \frac{z_{\alpha/2}^2 \nu}{e^2 N^2} = \frac{4 (9,399,900)}{.01 (25,000,000)} \approx 150$

STS Summary
- Choose stratification scheme
  - Scheme depends on objectives, operational constraints
  - Must know stratum identifier for each SU in the frame
- Set a design for each stratum
  - Design for each stratum: SRS, SYS, …
- Determine n and $n_h$
- Select sample independently within each stratum
- Pool stratum estimates to get estimates of population parameters
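The final step of the summary, pooling stratum estimates, can be sketched with the Ag example figures. This is an illustrative sketch under stated assumptions rather than the course's own code: the function name `sts_total` and the tuple layout are hypothetical, and the inputs are the rounded stratum means and sample variances from the Ag example tables, so the results agree with the slide values only up to rounding.

```python
from math import sqrt

# Per stratum: (N_h, n_h, sample mean y_bar_h, sample variance s_h^2),
# taken from the Ag example tables (means are rounded to whole acres).
strata = {
    "NE": (220, 21, 97_630, 7_647_472_708),
    "NC": (1054, 103, 300_504, 29_618_183_543),
    "S": (1382, 135, 211_315, 53_587_487_856),
    "W": (422, 41, 662_295, 396_185_950_266),
}

def sts_total(strata):
    """Pool SRS stratum estimates into the STS total and its estimated variance."""
    # t_hat_str = sum_h N_h * y_bar_h
    t_hat = sum(N_h * y_bar for N_h, n_h, y_bar, s2 in strata.values())
    # V_hat(t_hat_str) = sum_h N_h^2 * (1 - n_h/N_h) * s_h^2 / n_h
    v_hat = sum(
        N_h ** 2 * (1 - n_h / N_h) * s2 / n_h
        for N_h, n_h, y_bar, s2 in strata.values()
    )
    return t_hat, v_hat

t_hat, v_hat = sts_total(strata)
N = sum(N_h for N_h, *_ in strata.values())

se_total = sqrt(v_hat)   # standard error of the estimated total
y_str = t_hat / N        # estimated mean farm acres per county
se_mean = se_total / N   # V_hat(y_str) = V_hat(t_hat_str) / N^2

print(f"t_hat_str = {t_hat:,.0f} acres, SE = {se_total:,.0f}")
print(f"y_str = {y_str:,.1f} acres per county, SE = {se_mean:,.1f}")
```

Because the strata are sampled independently, the variance of the pooled total is just the sum of the stratum variances, which is why no covariance terms appear in `sts_total`.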