Statistics 522: Sampling and Survey Techniques
Topic 3

Topic Overview
This topic will cover
• Ratio and Regression Estimation
• Estimation in Domains

Laplace
• Wanted to determine the population of France in 1802
• The number of births is easy to obtain from public records.
• The size of the population is difficult to determine.
• Use births to predict population
• Sampled 30 "communes"
  – Total population: 2,037,615
  – Births in last 3 years: 215,599, or 71,866.33 per year
• Persons per birth: 2037615/71866.33 = 28.35
• Multiply births by 28.35

Ratio estimation basics
• $y_i$ is the characteristic of interest (response variable).
• $x_i$ is the auxiliary variable or subsidiary variable.
• $t_y = \sum_{\text{pop}} y_i$
• $t_x = \sum_{\text{pop}} x_i$
• $B = t_y / t_x = \bar{y}_U / \bar{x}_U$

Procedure
• We assume that $t_x$ is known.
• Therefore, $\bar{x}_U = t_x / N$ is known.
• Use an SRS and measure $y_i$ and $x_i$ in the sample.
• Calculate $\bar{y}$ and $\bar{x}$ for the sample.
• $\hat{B} = \bar{y} / \bar{x} = \sum_{\text{sample}} y_i / \sum_{\text{sample}} x_i$
• $\hat{t}_{y,\text{ratio}} = \hat{B} t_x$ or
• $\hat{\bar{y}}_{\text{ratio}} = \hat{B} \bar{x}_U$

Why
Sometimes we are interested in the ratio
• acres per farm differ
• yield per acre
• per capita income

Notes
• Both $\bar{y}$ and $\bar{x}$ are random variables.
  – They will differ from sample to sample.
  – We exclude the possibility that $\bar{x}$ is zero.
• Denominator often looks like a sample size.
  – The usual estimator for an SRS can be viewed this way: $x_i = 1$ for all items.

$$N = t_x = \sum_{\text{pop}} x_i, \qquad n = \sum_{\text{sample}} x_i$$

$$\hat{B} = \frac{\sum_{\text{sample}} y_i}{\sum_{\text{sample}} x_i} = \bar{y}, \qquad \hat{t} = \hat{B} t_x = \hat{B} N = N \bar{y}$$

Why (2)
• Sometimes we want to estimate a population total but N is not known, so we can't use $N\bar{y}$.
  – Estimate the number of fish in a catch
  – Weigh and count a sample; weigh the total
  – Multiply the average fish per pound times the total weight of the catch
• Increase the precision
  – Laplace could have estimated the population of France by computing the average number of persons per commune and multiplying by the number of communes.
  – Ratio estimate has smaller MSE (because of positive correlation between births and population size).
• We can adjust estimates to reflect demographic totals.
  – Example in text on page 62 concerning gender
  – This is called poststratification.
  – We will discuss this in Section 4.7 and Chapters 7 and 8.
• Can be used to adjust for nonresponse
  – Example in text on page 63
  – Discussed in Chapter 8

Example 3.2
• The U.S. Census of Agriculture
• We have an SRS of size n = 300 from the population of N = 3078 counties.
• Suppose we have the population totals for 1987 but have only the sample information for 1992.
• We want to estimate
  – average acres per farm
  – total acres

Get the sample data (SLL063.sas)
    * import the file agsrs.dat, check it, then create a permanent SAS data set;
    proc print data=asrs;
    run;
    libname xxx 'C:\Purdue\Stat522\SASdata';
    data xxx.agsrs;
      set asrs;
    run;

Plot the data
    symbol1 v=circle i=sm70;
    proc gplot data=asrs;
      plot acres92*acres87/frame;
    run;
    proc reg data=asrs;
      model acres92=acres87/noint;
    run;
    proc univariate data=asrs;
      var acres92 acres87;
    run;

[Figure: Regression through the origin]

[Figure: Smoothed plot]

Ratio estimate
    proc univariate data=asrs;
      var acres92 acres87;
    run;

Sums
    Variable   Sum Observations
    ACRES92    89369114
    ACRES87    90586117

A calculation
    data acalc;
      tot92=89369114;
      tot87=90586117;
      ratio=tot92/tot87;
      output;
    run;

Output
    Obs   tot92      tot87      ratio
    1     89369114   90586117   0.98657

The ratio is $\hat{B}$: in the sample of n = 300, 1992 acres are 98.7% of 1987 acres.

Total acres for 1987
    proc univariate data=apop;
      var acres87;
    run;

Output
    Variable   Mean         Sum Observations
    ACRES87    313016.378   963464412

NOTE: These values differ slightly from the values in the text.

The estimates
• Total acres for 1987: 963464412
• $\hat{B} = 0.98657$ (1992 acres / 1987 acres)
• Estimate of total 1992 acres is $\hat{B}$ × (total 1987 acres) = 0.98657(963464412) ≈ 950,000,000
• Estimate of mean acres per county for 1992 is $\hat{B}$ × (mean 1987 acres per county) = 0.98657(313016.378) ≈ 309,000

Comments
• Ratio estimators are biased.
• The random variable is $\hat{B} = \bar{y}_{\text{sam}} / \bar{x}_{\text{sam}}$.
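• The relative bias: $|\text{Bias}(\hat{B})| / \text{SE}(\hat{B}) \le |CV(\bar{x})|$

Proof
• Use $E(\hat{t}_x) = t_x$; $E(\hat{t}_y) = t_y$; $\hat{t}_{y,\text{ratio}} = \frac{\hat{t}_y}{\hat{t}_x} t_x = \hat{B} t_x$.

$$\begin{aligned}
E(\hat{t}_{y,\text{ratio}} - t_y) &= E(\hat{t}_{y,\text{ratio}} - \hat{t}_y + \hat{t}_y - t_y) \\
&= E\left(\frac{\hat{t}_y}{\hat{t}_x} t_x - \hat{t}_y\right) \\
&= E\left(\frac{\hat{t}_y}{\hat{t}_x} (t_x - \hat{t}_x)\right) \\
&= -E[\hat{B}(\hat{t}_x - t_x)] \\
&= -\left[E(\hat{B}\hat{t}_x) - E(\hat{B})E(\hat{t}_x)\right] \\
&= -\text{Cov}(\hat{B}, \hat{t}_x)
\end{aligned}$$

• Therefore,

$$\begin{aligned}
|E(\hat{B} - B)| &= \frac{|E(\hat{t}_{y,\text{ratio}} - t_y)|}{|t_x|} = \frac{|\text{Cov}(\hat{B}, \hat{t}_x)|}{|t_x|} = \frac{|\text{Cov}(\hat{B}, \bar{x}_{\text{sam}})|}{|\bar{x}_{\text{pop}}|} \\
&= \frac{|\text{Corr}(\hat{B}, \bar{x})| \sqrt{\text{Var}(\hat{B})\text{Var}(\bar{x})}}{|\bar{x}_{\text{pop}}|} \\
&\le \text{SE}(\hat{B})\,\text{SE}(\bar{x}) / |\bar{x}_{\text{pop}}| = \text{SE}(\hat{B})\,|CV(\bar{x})|
\end{aligned}$$

More Comments
• A similar argument shows that, with fpc = 1 − n/N,

$$\text{Bias}(\hat{B}) \approx \text{fpc}\,\frac{1}{n\bar{x}_U^2}\left(B S_x^2 - \text{Cov}(x, y)\right)$$

• The bias will be small when
  – n is large.
  – the sampling fraction n/N is large.
  – $\bar{x}_U$ is large.
  – the standard deviation of x, $S_x$, is small.
  – the correlation R between y and x is close to one.

MSE of $\hat{B}$

$$\widehat{\text{var}}(\hat{B}) \approx \text{MSE}(\hat{B}) \approx \frac{\text{Var}(\bar{d})}{\bar{x}_U^2},$$

where $d_i = y_i - B x_i$.

Idea behind proof
• Write the error as $\hat{B} - B = \frac{\bar{y}}{\bar{x}} - B = \frac{\bar{y} - B\bar{x}}{\bar{x}}$.
• Note that we have the random variable $\bar{x}$ in the denominator of this expression.
• Approximate it by $\bar{x}_U$: $\hat{B} - B \approx \frac{\bar{y} - B\bar{x}}{\bar{x}_U}$.

Standard error for $\hat{B}$
• To estimate the standard deviation of $\hat{B}$, substitute sample estimates for unknown quantities:

$$\widehat{\text{var}}(\hat{B}) = \text{fpc}\,\frac{s_e^2}{n\bar{x}_U^2},$$

where $s_e^2 = \sum_{\text{sam}} e_i^2 / (n - 1)$ and $e_i = y_i - \hat{B} x_i$.
• SE is the square root of the variance.

Other standard errors
• $\hat{t}_{\text{ratio}} = t_x \hat{B}$
  – Estimated variance of $\hat{t}_{\text{ratio}}$ is $t_x^2\,\widehat{\text{var}}(\hat{B}) = \text{fpc}\,\frac{N^2}{n} s_e^2$
• $\hat{\bar{y}}_{\text{ratio}} = \bar{x}_U \hat{B}$
  – Estimated variance of $\hat{\bar{y}}_{\text{ratio}}$ is $\bar{x}_U^2\,\widehat{\text{var}}(\hat{B}) = \text{fpc}\,\frac{s_e^2}{n}$
• Take square roots to obtain the SEs.

Confidence intervals
For 95%, the general form is estimate ± 1.96 SE(estimate).

Example 3.3
• The U.S. Census of Agriculture
• We have an SRS of size n = 300 from the population of N = 3078 counties.
• Estimate total acres for 1992 using a ratio estimate.
• B is the ratio of 1992 acres to 1987 acres for the population; $\hat{B}$ is the same ratio for the sample of n = 300.
• We use the known value of total acres in 1987 for the population of N = 3078 counties.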
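The standard-error and confidence-interval formulas above can be sketched in a few lines of Python (Python is used here only for illustration; the notes themselves use SAS). The function name `ratio_estimate`, the toy data `y_toy` and `x_toy`, and the values N = 50 and t_x = 700 are hypothetical and are not the agriculture data.

```python
import math


def ratio_estimate(y, x, t_x, N, z=1.96):
    """Ratio estimate of a population total from an SRS, with SE and CI.

    y, x : sample values of the response and the auxiliary variable
    t_x  : known population total of x
    N    : population size
    """
    n = len(y)
    b_hat = sum(y) / sum(x)                    # B-hat = sum(y) / sum(x)
    x_bar_U = t_x / N                          # known population mean of x
    fpc = 1 - n / N                            # finite population correction
    resid = [yi - b_hat * xi for yi, xi in zip(y, x)]   # e_i = y_i - B-hat x_i
    s2_e = sum(e * e for e in resid) / (n - 1)
    var_b = fpc * s2_e / (n * x_bar_U ** 2)    # estimated variance of B-hat
    t_hat = b_hat * t_x                        # ratio estimate of the total
    se_t = abs(t_x) * math.sqrt(var_b)         # SE(t-hat) = |t_x| * SE(B-hat)
    return {
        "b_hat": b_hat,
        "se_b": math.sqrt(var_b),
        "t_hat": t_hat,
        "se_t": se_t,
        "ci": (t_hat - z * se_t, t_hat + z * se_t),
    }


# Hypothetical toy sample (n = 5) from a population of N = 50 with t_x = 700
y_toy = [12.0, 15.0, 9.0, 20.0, 14.0]
x_toy = [10.0, 16.0, 8.0, 19.0, 15.0]
result = ratio_estimate(y_toy, x_toy, t_x=700.0, N=50)
print(result)
```

Setting every $x_i$ to 1 and $t_x = N$ in this sketch gives back the usual SRS estimate $N\bar{y}$ and its usual standard error, which matches the remark in the Notes that the SRS estimator is a special case.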

The estimates
• Total acres for 1987: 963464412
• $\hat{B} = 0.98657$ (1992 acres / 1987 acres)
• Estimate of total 1992 acres is $\hat{B}$ × (total 1987 acres) = 0.98657(963464412) ≈ 950,000,000
• We need to calculate the standard error and a 95% confidence interval.

Estimate B (SLL068.sas)
    libname xxx 'xxx\SASdata';
    data asrs;
      set xxx.agsrs;
    run;
    * sample totals of acres87 and acres92;
    proc means data=asrs;
      var acres87 acres92;
      output out=a2 sum=sum87 sum92;
    run;
    data a2;
      set a2;
      Bhat=sum92/sum87;
    run;
    proc print data=a2;
    run;

Output
    Obs   sum87      sum92      Bhat
    1     90586117   89369114   0.98657

Define e and compute SE
    * attach Bhat to every record, then form the residuals e;
    data asrs2;
      set asrs;
      if _n_ eq 1 then set a2;
      e=acres92-Bhat*acres87;
    run;
    * stderr is s_e/sqrt(n);
    proc means data=asrs2;
      var e;
      output out=a4 stderr=se_e n=nsrs;
    run;

Find population total for 1987
    proc means data=xxx.agpop;
      var acres87;
      output out=a5 sum=sum87pop n=Npop;
    run;

Put it together
    data a6;
      merge a2 a4 a5;
      fpc=(1-nsrs/Npop);
      var_tot=(Npop*Npop)*fpc*se_e*se_e;
      se_tot=sqrt(var_tot);
      moe=1.96*se_tot;
      tot_est=Bhat*sum87pop;
      lcl95=tot_est-moe;
      ucl95=tot_est+moe;
    run;

Print
    proc print data=a6;
      var tot_est se_tot moe lcl95 ucl95;
    run;

Output
    Obs   tot_est     se_tot    moe        lcl95       ucl95
    1     950520496   5344567   10475351   940045144   960995848

Evaluation
• The 95% CI is 940 to 961 million acres.
• The SE for the ratio estimator is 5.3 million acres.
• If we use the SRS estimate (N × mean acres for the sample), the standard error is 58.2 million acres.
• The ratio estimate works very well for this problem.

MSE approximation
Text (page 71) suggests that the approximation may severely underestimate the true MSE (i.e., miss the bias) unless
• the sample size n is at least 30
• $CV(\bar{x}) \le 0.1$
• $CV(\bar{y}) \le 0.1$

When is the ratio estimate better?
• When the deviations of $y_i$ from $\bar{y}$ are larger than the deviations of $y_i$ from $\hat{B} x_i$.
• We want to compare the MSEs of the usual and the ratio estimators.
• The MSE of the ratio estimator is smaller ($\text{MSE}(\hat{\bar{y}}_{\text{ratio}}) \le \text{MSE}(\bar{y})$) whenever

$$R \ge \frac{CV(x)}{2\,CV(y)}$$

Model
If the relationship between y and x is a straight line through the origin with variance proportional to x, $\hat{B}$ is the weighted least squares estimate of the slope with weights proportional to 1/x.

$$\hat{B} = \arg\min_B \sum_{\text{sam}} \frac{1}{x_i} (y_i - B x_i)^2$$

Ratio estimators for proportions
• $\hat{B} = \bar{y} / \bar{x}$ is the quantity of interest.
• Use the same approach.
• See Section 3.1.3, including Example 3.5 on pages 72-73.

Regression estimation
• Statistics 512
• Regression fit is $\hat{\bar{y}}_{\text{reg}} = \hat{B}_0 + \hat{B}_1 \bar{x}_{\text{pop}} = \bar{y} + \hat{B}_1 (\bar{x}_{\text{pop}} - \bar{x}_{\text{sam}})$
• Ratio estimator: $\hat{B}_0 = 0$.
• The regression estimate is the predicted value when we substitute $\bar{x}_U$ for x.
• We need to do more work to calculate the standard error, MOE and CI.

$$\text{MSE}(\hat{\bar{y}}_{\text{reg}}) \approx \text{fpc}\,\frac{RSS_{\text{pop}}}{n(N - 1)}$$

Example 3.6
• Estimate the number of dead trees in an area.
• Divide the area into 100 square plots.
• Photo counts (x) are easy to obtain and are available for all N = 100 plots.
• For a sample of n = 25 plots, measure the actual numbers of dead trees (y).

Estimation
• Use a regression to describe the relationship between the actual count of dead trees (y) and the photo count of dead trees (x) in the n = 25 sample.
• Find the average number of dead trees in the photos ($\bar{x}_U = 11.3$) and use this to get the predicted average number of dead trees in the population.
• Multiply by N to get the total.

Enter the data (SLL075.sas)
    * the last record (11.3 with missing field) yields the prediction at the population mean;
    data a1;
      input photo field @@;
      datalines;
    10 15 12 14 7 9 13 14 13 8 6 5 17 18 16 15 15 13 10 15
    14 11 12 15 10 12 5 8 12 13 10 9 10 11 9 12 6 9 11 12
    7 13 9 11 11 10 10 9 10 8 11.3 .
    ;

Run the regression
    proc reg data=a1;
      model field=photo/clm;
    run;

Output
    Var     DF   Par Est   St Error   t      Pr>|t|
    Int     1    5.059     1.763      2.87   0.0087
    photo   1    0.613     0.160      3.83   0.0009

Predicted average
    Obs 26   Dep Var field .
    Predicted Value 11.9893

Standard error
• Different approximations are available.
• We will use (3.14) on page 75:

$$\widehat{\text{var}}(\hat{\bar{y}}_{\text{reg}}) = \text{fpc}\,\frac{s_e^2}{n}$$

• See Example 3.6 on pages 75-76.

Difference estimation
• Special case where the slope is assumed to be one
• In many cases the difference between y and x is zero.
• Differences are equally likely to be positive or negative.
• Sometimes useful in auditing
• See text Section 3.2.2 on page 77.

Estimation in domains
• Suppose we want to estimate a mean and/or a total for a subset of the population.
• We call the subpopulation a domain or a subdomain.
• In Example 3.7 on page 79, we use the SRS of counties to estimate the mean and total 1992 acres for the western states.

One view
• After we take the sample of size n, select the observations that are within the domain of interest and proceed as if this were an SRS of size $n_d$ from the domain.
• The fpc would be $\text{fpc}_d = 1 - n_d / N_d$.
• Note, this requires that we know $N_d$.
• We also need $N_d$ to convert $\bar{y}_d$ to $\hat{t}_d$.

A technical difficulty
• The sample size in the domain, $n_d$, is a random variable; it varies from sample to sample.
• With this view we are ignoring the variability that this fact introduces into our estimate.
• We are conditioning on $n_d$.
• This is what we do in regression when we treat the values of the explanatory variables as fixed and not random.

Section 3.3 view
• For $\bar{y}_d$, use the mean of the observations in the sample that are in the domain.
• Treat the denominator $n_d$ as a random variable.
• In this view, $\bar{y}_d$ is a ratio estimator (a $\hat{B}$) and the methods of this chapter apply.

Some details
• Let $u_i = y_i$ if i is in the domain, and 0 otherwise.
• Similarly, $x_i = 1$ if i is in the domain, and 0 otherwise.
• $\bar{y}_d = \hat{B} = \sum_{\text{sam}} u_i / \sum_{\text{sam}} x_i = \sum_{\text{dom}} y_i / n_d$
• $\hat{t}_d = N_d \bar{y}_d$ if $N_d$ is known
• Use the formula on the next to the last line of page 78 for $\text{SE}(\bar{y}_d)$.
• This is a rewrite of the formula for a ratio estimate

$$\widehat{\text{var}}(\hat{B}) = \frac{\text{fpc}}{n\bar{x}_{\text{pop}}^2}\,\frac{\sum_{\text{sam}} (y_i - \hat{B} x_i)^2}{n - 1}$$

using the notation of this section.
• Use $\text{SE}(\hat{t}_d) = N_d\,\text{SE}(\bar{y}_d)$ for the total.

$N_d$ unknown
• Use the same estimator for the domain mean: the mean of this subset of the sample, $\bar{y}_d$.
• Use the last line on page 78 for the SE of $\bar{y}_d$, an approximation that assumes
  – $n_d / n \approx N_d / N$.
  – $(n_d - 1)/(n - 1) \approx n_d / n$.

Standard error for $\bar{y}_d$

$$\text{SE}(\bar{y}_d) = \sqrt{\text{fpc}}\,\frac{s_{y,d}}{\sqrt{n_d}}$$

• Proof

$$\begin{aligned}
\widehat{\text{var}}(\bar{y}_d) &= \frac{\text{fpc}}{n\bar{x}_{\text{pop}}^2}\,\frac{\sum_{\text{sam}} (u_i - \hat{B} x_i)^2}{n - 1} \\
&= \frac{\text{fpc}}{n\bar{x}_{\text{pop}}^2}\,\frac{\sum_{\text{sam}} (y_i - \hat{B} x_i)^2\, I(i \in \text{dom})}{n - 1} \\
&= \frac{\text{fpc}}{n} \left(\frac{N}{N_d}\right)^2 \frac{\sum_{\text{dom}} (y_i - \hat{B} x_i)^2}{n - 1} \\
&\approx \frac{\text{fpc}}{n} \left(\frac{n}{n_d}\right)^2 \frac{(n_d - 1)\, s_{y,d}^2}{n - 1} \\
&\approx \frac{\text{fpc}}{n} \left(\frac{n}{n_d}\right)^2 \frac{n_d\, s_{y,d}^2}{n} \\
&= \text{fpc}\,\frac{s_{y,d}^2}{n_d}
\end{aligned}$$

• The term $s_{y,d} / \sqrt{n_d}$ is the usual (sampling from an infinite population) standard error for a sample mean.
• So, for the estimation of the domain mean, we treat the $n_d$ observations as an SRS and approximate $\text{fpc}_d$ with the fpc for the entire SRS.

Estimation of the total
• (If $N_d$ is known, we multiply $\bar{y}_d$ and the SE by $N_d$; MOE and CI follow.)
• When $N_d$ is unknown, we have a bit of a mess.
• The domain proportion in the population ($N_d / N$) and sample ($n_d / n$) should be approximately the same.
• So, $N_d$ is approximately $N n_d / n$.

A consequence

$$\hat{t}_{y,d} = N\bar{u}, \qquad \text{SE}(\hat{t}_{y,d}) = N \sqrt{\text{fpc}} \times \text{SE}(\bar{u}),$$

• where $\text{SE}(\bar{u})$ is the infinite population standard error estimate.
• This is a strange SE/variance.
• The variance is used to summarize variability in a situation where many/most observations are 0.

Example 3.7 (SLL078.sas)
• The U.S. Census of Agriculture
• We have an SRS of size n = 300 from the population of N = 3078 counties.
• B is the ratio of 1992 acres in western states to the number of counties in western states, that is, the mean acres per county.
• The total of interest is the total number of 1992 acres in western states.

Get data and define West
    libname xxx 'C:\...\SASdata';
    data asrs;
      set xxx.agsrs;
      if state eq 'AK' . . .
      or state eq 'WY' then region = 'West';

Define u and x
      * u carries acres92 inside the domain and x is the domain indicator;
      if region = 'West' then do;
        u=acres92; x=1;
      end;
      else do;
        u=0; x=0;
      end;
    run;

Estimates and SE for mean
    proc means data=asrs;
      var acres92;
      where region eq 'West';
      output out=a2 mean=ybard stderr=sey;
    run;

Estimates and SE for total
    proc means data=asrs;
      var u;
      output out=a3 mean=ubar stderr=seu;
    run;

Merge and calculate
    data a4;
      merge a2 a3;
      npop=3078; nsrs=300;
      fpc=1-nsrs/npop;
      seybard=sqrt(fpc)*sey;
      thatd=npop*ubar;
      seubar=sqrt(fpc)*seu;
      sethatd=npop*(seubar);
    run;

Print
    proc print data=a4;
      var ybard seybard thatd sethatd;
    run;

Output
    Obs   ybard          seybard    thatd          sethatd
    1     598680.58974   78520.29   239556051.18   46090456.81

Models
• Models:
  $Y_i = \beta x_i + \varepsilon_i$, $\text{Var}(\varepsilon_i) = \sigma^2 x_i$
  $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, $\text{Var}(\varepsilon_i) = \sigma^2$
  – Get parameter estimates, make predictions, and take summaries of predictions.
  – SEs are SEs of expected values.
• Estimates of parameters are generally the same for randomization theory and model-based inference.
• Estimators are "model unbiased".
  – Variability is in the $\varepsilon$'s, not in the sampling design.
• Standard errors may be slightly different, calculated from
  – design or
  – model
• Requires stronger assumptions
  – If model is true, evaluate only large x's or x's far from mean.
  – Diagnostics are key.

Nonconstant sampling
• In SRS, each data point is sampled with equal probability ($\pi_i = n / N$).
• Sometimes, we want to sample with unequal probabilities.
  – to reduce bias (randomization)
  – to minimize variance (model): $\text{Var}(y_i) = \sigma^2 x_i \Rightarrow \pi_i \propto \sqrt{x_i}$

Comparison
• SRS, ratio, regression
• See page 88 for summary.
• Ratio and regression estimators are useful when there is an informative x.
• Ratio estimators are useful when we have cluster sampling.
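To make the comparison of the three estimators concrete, here is a small Python sketch (Python is used only for illustration; the notes use SAS) that computes the SRS, ratio, and regression estimates of the mean number of dead trees per plot from the Example 3.6 data listed earlier. The variable names are hypothetical. Only point estimates are computed; standard errors are left to the formulas in the notes. The regression estimate should agree with the predicted value 11.9893 in the SAS output.

```python
# Example 3.6 sample: (photo count x, field count y) for the n = 25 plots
pairs = [
    (10, 15), (12, 14), (7, 9), (13, 14), (13, 8),
    (6, 5), (17, 18), (16, 15), (15, 13), (10, 15),
    (14, 11), (12, 15), (10, 12), (5, 8), (12, 13),
    (10, 9), (10, 11), (9, 12), (6, 9), (11, 12),
    (7, 13), (9, 11), (11, 10), (10, 9), (10, 8),
]
x_bar_U = 11.3   # known population mean photo count
N = 100          # number of plots in the population

n = len(pairs)
x_bar = sum(x for x, _ in pairs) / n
y_bar = sum(y for _, y in pairs) / n

# SRS estimate: ignore the auxiliary variable
srs_mean = y_bar

# Ratio estimate: B-hat times the known population mean of x
ratio_mean = (y_bar / x_bar) * x_bar_U

# Regression estimate: least-squares line evaluated at x_bar_U
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in pairs)
s_xx = sum((x - x_bar) ** 2 for x, _ in pairs)
b1 = s_xy / s_xx
b0 = y_bar - b1 * x_bar
reg_mean = b0 + b1 * x_bar_U

for name, est in [("SRS", srs_mean), ("ratio", ratio_mean),
                  ("regression", reg_mean)]:
    print(f"{name:>10}: mean = {est:.4f}, total = {N * est:.1f}")
```

Multiplying each mean by N = 100 gives the corresponding estimate of the total number of dead trees.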