Statistics 522: Sampling and Survey Techniques
Topic 3
Topic Overview
This topic will cover
• Ratio and Regression Estimation
• Estimation in Domains
Laplace
• Wanted to determine population of France in 1802
• Number of births is easy to obtain from public records.
• Size of population is difficult to determine.
• Use births to predict population
• Sampled 30 “communes”
– Total population: 2,037,615
– Births in last 3 years: 215,599
or 71,866.33 per year
• persons per birth:
2037615/71866.33 = 28.35
• Multiply births by 28.35
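The arithmetic above can be checked with a short sketch. The function `estimate_population` and its argument are hypothetical names added for illustration; the notes do not give the national birth count, so none is supplied here.

```python
# Laplace's persons-per-birth ratio, using the figures quoted above.
sample_population = 2_037_615     # total population of the 30 sampled communes
births_three_years = 215_599      # births in those communes over 3 years

births_per_year = births_three_years / 3
persons_per_birth = sample_population / births_per_year


def estimate_population(annual_births_nationwide):
    """Ratio estimate: nationwide annual births times persons per birth.

    The argument is a hypothetical input; no value for it appears in the notes.
    """
    return persons_per_birth * annual_births_nationwide


print(round(births_per_year, 2))     # 71866.33
print(round(persons_per_birth, 2))   # 28.35
```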
Ratio estimation basics
• yi is the characteristic of interest (response variable).
• xi is the auxiliary variable or subsidiary variable.
• $t_y = \sum_{\text{pop}} y_i$
• $t_x = \sum_{\text{pop}} x_i$
• $B = t_y / t_x = \bar{y}_U / \bar{x}_U$
Procedure
• We assume that tx is known.
• Therefore, $\bar{x}_U = t_x / N$ is known.
• Use an SRS and measure yi and xi in the sample.
• Calculate ȳ and x̄ for the sample.
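• $\hat{B} = \bar{y} / \bar{x} = \sum_{\text{sample}} y_i \big/ \sum_{\text{sample}} x_i$
• $\hat{t}_{y,\text{ratio}} = \hat{B}\, t_x$ or
• $\hat{\bar{y}}_{\text{ratio}} = \hat{B}\, \bar{x}_U$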
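The procedure can be sketched in a few lines of Python. This is an illustration only: the function name `ratio_estimates` and the toy numbers are made up, and the sample is assumed to be an SRS.

```python
def ratio_estimates(y, x, t_x, N):
    """Return (B_hat, t_hat_ratio, ybar_hat_ratio) from an SRS.

    y, x : sample values of the response and the auxiliary variable
    t_x  : known population total of x
    N    : population size (used only to get xbar_U = t_x / N)
    """
    if len(y) != len(x) or not y:
        raise ValueError("y and x must be non-empty and the same length")
    sum_x = sum(x)
    if sum_x == 0:
        raise ValueError("sample total of x must be nonzero")
    b_hat = sum(y) / sum_x          # B_hat = ybar / xbar
    xbar_U = t_x / N
    return b_hat, b_hat * t_x, b_hat * xbar_U


# Toy numbers, made up for illustration only.
b_hat, t_hat, ybar_hat = ratio_estimates(y=[2, 3, 7, 8], x=[1, 2, 3, 4], t_x=30, N=10)
print(b_hat, t_hat, ybar_hat)   # 2.0 60.0 6.0
```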
Why
Sometimes we are interested in the ratio
• acres per farm differ
• yield per acre
• per capita income
Notes
• Both ȳ and x̄ are random variables.
– They will differ from sample to sample.
– We exclude the possibility that x̄ is zero.
• Denominator often looks like a sample size.
– The usual estimator for an SRS can be viewed this way: xi = 1 for all items.
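\[ N = t_x = \sum_{\text{pop}} x_i, \qquad n = \sum_{\text{sample}} x_i \]
\[ \hat{B} = \frac{\sum_{\text{sample}} y_i}{\sum_{\text{sample}} x_i} = \bar{y}, \qquad \hat{t} = \hat{B}\, t_x = \hat{B} N = N \bar{y} \]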
Why (2)
• Sometimes we want to estimate a population total but N is not known, so we can’t
use N ȳ.
– Estimate the number of fish in a catch
– Weigh and count a sample; weigh the total
– Multiply the average fish per pound times the total weight of the catch
• Increase the precision
– Laplace could have estimated the population of France by computing the average
number of persons per commune and multiplying by the number of communes.
– The ratio estimate has a smaller MSE (because of the positive correlation between births
and population size).
• We can adjust estimates to reflect demographic totals.
– example in text on page 62 concerning gender
– This is called poststratification.
– We will discuss this in Section 4.7 and Chapters 7 and 8
• Can be used to adjust for nonresponse
– Example in text on page 63
– Discussed in Chapter 8
Example 3.2
• The U.S. Census of Agriculture
• We have a SRS size n = 300 from the population of N = 3078 counties.
• Suppose we have the population totals for 1987 but have only the sample information
for 1992.
• We want to estimate
– average acres per farm
– total acres
Get the sample data ( SLL063.sas )
*import the file agsrs.dat, check it,
then create a permanent SAS data set;
proc print data=asrs;
run;
libname xxx 'C:\Purdue\Stat522\SASdata';
data xxx.agsrs; set asrs;
run;
Plot the data
symbol1 v=circle i=sm70;
proc gplot data=asrs;
plot acres92*acres87/frame;
run;
proc reg data=asrs;
model acres92=acres87/noint;
run;
proc univariate data=asrs;
var acres92 acres87;
run;
Regression through the origin
Smoothed plot
Ratio estimate
proc univariate data=asrs;
var acres92 acres87;
Sums
Variable    Sum Observations
ACRES92     89369114
ACRES87     90586117
A calculation
data acalc;
tot92=89369114;
tot87=90586117;
ratio=tot92/tot87;
output;
run;
proc print data=acalc;
run;
Output
Obs    tot92       tot87       ratio
1      89369114    90586117    0.98657
Ratio is B̂: in the sample of n = 300, 1992 acres are 98.7% of 1987 acres.
Total acres for 1987
proc univariate data=apop;
var acres87;
Output
Variable: ACRES87
Mean                313016.378
Sum Observations    963464412
NOTE: These values differ slightly from the values in the text.
The estimates
• Total acres for 1987: 963464412
• $\hat{B} = 0.98657$ = (1992 acres) / (1987 acres) in the sample
• Estimate of total 1992 acres is
$\hat{B} \times$ (total 1987 acres) $= 0.98657 \times 963464412 \approx 950{,}000{,}000$
• Estimate of mean acres per county for 1992 is
$\hat{B} \times$ (mean 1987 acres per county) $= 0.98657 \times 313016.378 \approx 309{,}000$
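As a check on the arithmetic, the same estimates can be recomputed without rounding $\hat{B}$. All inputs are the sums printed in the SAS output above and N = 3078 from the example; the variable names are illustrative.

```python
# Sums printed by proc univariate above.
sum92_sample = 89_369_114     # sample total, 1992 acres (n = 300)
sum87_sample = 90_586_117     # sample total, 1987 acres (n = 300)
total87_pop = 963_464_412     # population total, 1987 acres
N = 3078                      # number of counties

b_hat = sum92_sample / sum87_sample
total92_est = b_hat * total87_pop            # ratio estimate of total 1992 acres
mean92_est = b_hat * (total87_pop / N)       # ratio estimate of mean acres per county

print(round(b_hat, 5))       # 0.98657
print(round(total92_est))    # about 950.5 million acres
print(round(mean92_est))     # about 309 thousand acres per county
```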
Comments
• Ratio estimators are biased.
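• The bias comes from the random factor $\bar{x}_{\text{pop}} / \bar{x}_{\text{sam}}$: $\hat{t}_{y,\text{ratio}} = \hat{t}_y\, (\bar{x}_{\text{pop}} / \bar{x}_{\text{sam}})$, and $\bar{x}_{\text{sam}}$ varies from sample to sample.
• The relative bias is bounded: $|\mathrm{bias}(\hat{B})| / SE(\hat{B}) \le |CV(\bar{x})|$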
Proof
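• Use $E(\hat{t}_x) = t_x$; $E(\hat{t}_y) = t_y$; $\hat{t}_{y,\text{ratio}} = \dfrac{\hat{t}_y}{\hat{t}_x}\, t_x = \hat{B}\, t_x$.
\[
\begin{aligned}
E(\hat{t}_{y,\text{ratio}} - t_y) &= E(\hat{t}_{y,\text{ratio}} - \hat{t}_y + \hat{t}_y - t_y) \\
&= E\left[\frac{\hat{t}_y}{\hat{t}_x}\, t_x - \hat{t}_y\right] \\
&= E\left[\frac{\hat{t}_y}{\hat{t}_x}\, (t_x - \hat{t}_x)\right] \\
&= -E[\hat{B}(\hat{t}_x - t_x)] \\
&= -\left[E(\hat{B}\hat{t}_x) - E(\hat{B})E(\hat{t}_x)\right] \\
&= -\mathrm{Cov}(\hat{B}, \hat{t}_x)
\end{aligned}
\]
• Therefore,
\[
\begin{aligned}
|E(\hat{B} - B)| &= \frac{|E(\hat{t}_{y,\text{ratio}} - t_y)|}{|t_x|} = \frac{|\mathrm{Cov}(\hat{B}, \hat{t}_x)|}{|t_x|} = \frac{|\mathrm{Cov}(\hat{B}, \bar{x}_{\text{sam}})|}{|\bar{x}_{\text{pop}}|} \\
&= \frac{|\mathrm{Corr}(\hat{B}, \bar{x})| \sqrt{\mathrm{Var}(\hat{B})\, \mathrm{Var}(\bar{x})}}{|\bar{x}_{\text{pop}}|} \\
&\le SE(\hat{B})\, SE(\bar{x}) / |\bar{x}_{\text{pop}}| \\
&= SE(\hat{B})\, |CV(\bar{x})|
\end{aligned}
\]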
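The identity and the bound can be verified exactly on a small population by enumerating every possible SRS. The population values below are hypothetical and chosen only for illustration; expectations are taken over all 20 samples of size 3.

```python
from itertools import combinations
from math import sqrt

# Hypothetical population, chosen only for illustration.
x_pop = [2, 4, 5, 7, 8, 10]
y_pop = [3, 5, 9, 10, 13, 14]
n = 3

N = len(x_pop)
xbar_U = sum(x_pop) / N
B = sum(y_pop) / sum(x_pop)

b_hats, xbars = [], []
for idx in combinations(range(N), n):      # every possible SRS of size n
    sx = sum(x_pop[i] for i in idx)
    sy = sum(y_pop[i] for i in idx)
    b_hats.append(sy / sx)
    xbars.append(sx / n)


def mean(v):
    return sum(v) / len(v)


def cov(u, v):
    """Covariance over the (equally likely) samples."""
    mu, mv = mean(u), mean(v)
    return mean([(a - mu) * (b - mv) for a, b in zip(u, v)])


bias = mean(b_hats) - B
se_b = sqrt(cov(b_hats, b_hats))
cv_xbar = sqrt(cov(xbars, xbars)) / xbar_U

print(bias, -cov(b_hats, xbars) / xbar_U)   # the two values agree
print(abs(bias) / se_b <= cv_xbar)          # True
```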
More Comments
• Similar argument shows that
\[ \mathrm{Bias}(\hat{B}) \approx \mathrm{fpc}\, \frac{1}{n \bar{x}_U^2} \left( B S_x^2 - \mathrm{Cov}(x, y) \right) \]
• The bias will be small when
– n is large.
– the sampling fraction n/N is large.
– x̄U is large.
– the standard deviation of x – Sx – is small.
– the correlation R between y and x is close to one.
MSE of B̂
\[ \mathrm{Var}(\hat{B}) \approx \mathrm{MSE}(\hat{B}) \approx \frac{\mathrm{Var}(\bar{d})}{\bar{x}_U^2}, \]
where $d_i = y_i - B x_i$.
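Idea behind proof
\[ \hat{B} - B = \frac{\bar{y}}{\bar{x}} - B = \frac{\bar{y} - B\bar{x}}{\bar{x}} \]
• Note that we have the random variable $\bar{x}$ in the denominator of this expression.
• Approximate it by $\bar{x}_U$.
\[ \hat{B} - B \approx \frac{\bar{y} - B\bar{x}}{\bar{x}_U} \]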
Standard error for B̂
• To estimate the standard deviation of $\hat{B}$, substitute sample estimates for unknown quantities:
\[ \widehat{\mathrm{var}}(\hat{B}) = \mathrm{fpc}\, \frac{s_e^2}{n \bar{x}_U^2}, \]
where $s_e^2 = \sum e_i^2 / (n - 1)$ and $e_i = y_i - \hat{B} x_i$.
• SE is the square root of the variance.
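Other standard errors
• $\hat{t}_{\text{ratio}} = t_x \hat{B}$
– Estimated variance of $\hat{t}_{\text{ratio}}$ is
\[ t_x^2\, \widehat{\mathrm{var}}(\hat{B}) = \mathrm{fpc}\, \frac{N^2}{n}\, s_e^2 \]
• $\hat{\bar{y}}_{\text{ratio}} = \bar{x}_U \hat{B}$
– Estimated variance of $\hat{\bar{y}}_{\text{ratio}}$ is
\[ \bar{x}_U^2\, \widehat{\mathrm{var}}(\hat{B}) = \frac{\mathrm{fpc}}{n}\, s_e^2 \]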
• Take square roots to obtain the SE’s
Confidence intervals
For 95%, general form is
estimate ± 1.96SE(estimate)
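Putting the standard error formulas and the interval together, a minimal Python sketch for the ratio estimate of a total might look like this. The function name `ratio_total_ci` and the toy data (N = 10, $t_x$ = 30) are made up for illustration; the divisor n − 1 in $s_e^2$ follows the definition given above.

```python
from math import sqrt


def ratio_total_ci(y, x, t_x, N, z=1.96):
    """Ratio estimate of t_y with its SE and an approximate 95% CI (SRS assumed)."""
    n = len(y)
    b_hat = sum(y) / sum(x)
    e = [yi - b_hat * xi for yi, xi in zip(y, x)]    # residuals e_i = y_i - B_hat * x_i
    s2_e = sum(ei ** 2 for ei in e) / (n - 1)
    fpc = 1 - n / N
    var_total = fpc * N ** 2 * s2_e / n              # equals t_x^2 * var_hat(B_hat)
    se_total = sqrt(var_total)
    t_hat = b_hat * t_x
    return t_hat, se_total, (t_hat - z * se_total, t_hat + z * se_total)


# Toy data, made up for illustration only.
t_hat, se_total, ci = ratio_total_ci(y=[2, 3, 7, 8], x=[1, 2, 3, 4], t_x=30, N=10)
print(t_hat, se_total, ci)
```

For the toy data the residuals are 0, −1, 1, 0, so $s_e^2 = 2/3$ and the estimated variance of the total is $0.6 \times 100 \times (2/3) / 4 = 10$.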
Example 3.3
• The US Census of Agriculture
• We have a SRS size n = 300 from the population of N = 3078 counties.
• Estimate total acres for 1992 using a ratio estimate.
• B is the ratio of 1992 acres to 1987 acres for the population; B̂ for the sample of
n = 300.
• We use the known value of total acres in 1987 for the population N = 3078.
The estimates
• Total acres for 1987: 963464412
• $\hat{B} = 0.98657$ (1992 acres / 1987 acres)
• Estimate of total 1992 acres is
$\hat{B} \times$ (total 1987 acres) $= 0.98657 \times 963464412 \approx 950{,}000{,}000$
• We need to calculate the standard error and a 95% confidence interval.
Estimate B (SLL068.sas)
libname xxx 'xxx\SASdata';
data asrs; set xxx.agsrs;
proc means data=asrs;
var acres87 acres92;
output out=a2 sum=sum87 sum92;
run;
data a2; set a2; Bhat=sum92/sum87;
proc print data=a2;
run;
Output
Obs    sum87       sum92       Bhat
1      90586117    89369114    0.98657
Define e and compute SE
data asrs2; set asrs;
if _n_ eq 1 then set a2;
e=acres92-Bhat*acres87;
proc means data=asrs2;
var e;
output out=a4 stderr=se_e n=nsrs;
Find population total for 1987
proc means data=xxx.agpop;
var acres87;
output out=a5 sum=sum87pop n=Npop;
Put it together
data a6; merge a2 a4 a5;
fpc=(1-nsrs/Npop);
* se_e is the standard error of the mean of e, so se_e*se_e is s_e**2/n;
var_tot=(Npop*Npop)*fpc*se_e*se_e;
se_tot=sqrt(var_tot);
moe=1.96*se_tot;
tot_est=bhat*sum87pop;
lcl95=tot_est-moe;
ucl95=tot_est+moe;
run;
Print
proc print data=a6;
var tot_est se_tot moe lcl95 ucl95;
Output
Obs    tot_est      se_tot     moe         lcl95        ucl95
1      950520496    5344567    10475351    940045144    960995848
Evaluation
• The 95% CI is 940 to 961 million acres
• The SE for the ratio estimator is 5.3 million acres
• If we use the SRS estimate (N × mean acres for the sample), the standard error is 58.2
million acres
• The ratio estimate works very well for this problem
MSE approximation
Text (page 71) suggests that the approximation may severely underestimate the true MSE
(i.e., miss the bias) unless
• n is at least 30
• CV (x̄) ≤ 0.1
• CV (ȳ) ≤ 0.1
When is the ratio estimate better?
• When the deviations of yi from ȳ are larger than the deviations of yi from B̂xi .
• We want to compare the MSEs of the usual and the ratio estimators.
• The MSE of the ratio estimator is smaller ($\mathrm{MSE}(\hat{\bar{y}}_{\text{ratio}}) \le \mathrm{MSE}(\bar{y})$) whenever
\[ R \ge \frac{CV(x)}{2\, CV(y)} \]
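The condition is easy to check once R, CV(x) and CV(y) are known or estimated. A small sketch follows; the function name `ratio_beats_srs` and the input values are hypothetical.

```python
def ratio_beats_srs(r, cv_x, cv_y):
    """True when R >= CV(x) / (2 CV(y)), the condition stated above."""
    if cv_y <= 0:
        raise ValueError("cv_y must be positive")
    return r >= cv_x / (2 * cv_y)


# Hypothetical values for illustration.
print(ratio_beats_srs(r=0.9, cv_x=0.3, cv_y=0.3))   # True  (threshold 0.5)
print(ratio_beats_srs(r=0.2, cv_x=0.3, cv_y=0.2))   # False (threshold 0.75)
```

When x and y have similar coefficients of variation, the threshold is close to 0.5, so a moderate positive correlation is enough for the ratio estimator to win.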
Model
If the relationship between y and x is a straight line through the origin with variance proportional to x, $\hat{B}$ is the weighted least squares estimate of the slope with weights proportional to 1/x.
\[ \hat{B} = \arg\min_{B} \sum_{\text{sam}} \frac{1}{x_i} (y_i - B x_i)^2 \]
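To see why the weights 1/x reproduce the ratio estimator, set the derivative of the weighted sum of squares to zero: $B = \sum w_i x_i y_i / \sum w_i x_i^2$, which becomes $\sum y_i / \sum x_i$ when $w_i = 1/x_i$. The sketch below checks this numerically; the function name and the sample values are hypothetical.

```python
def wls_slope_through_origin(y, x, w):
    """Minimise sum w_i (y_i - B x_i)^2 over B, using the closed form from dQ/dB = 0."""
    num = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    den = sum(wi * xi * xi for wi, xi in zip(w, x))
    return num / den


x = [1.0, 2.0, 4.0, 5.0]     # hypothetical sample values (all positive)
y = [2.0, 3.0, 9.0, 9.0]
weights = [1 / xi for xi in x]

b_wls = wls_slope_through_origin(y, x, weights)
b_ratio = sum(y) / sum(x)
print(b_wls, b_ratio)        # the two slopes agree
```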
Ratio estimators for proportions
• $\hat{B} = \bar{y} / \bar{x}$ is the quantity of interest.
• Use the same approach.
• See Section 3.1.3, including Example 3.5 on pages 72-73.
Regression estimation
• Statistics 512
• Regression fit is $\hat{\bar{y}}_{\text{reg}} = \hat{B}_0 + \hat{B}_1 \bar{x}_{\text{pop}} = \bar{y} + \hat{B}_1 (\bar{x}_{\text{pop}} - \bar{x}_{\text{sam}})$
• Ratio estimator: B̂0 = 0.
• The regression estimate is the predicted value when we substitute x̄U for x.
• We need to do more work to calculate the standard error, MOE and CI.
\[ \mathrm{MSE}(\hat{\bar{y}}_{\text{reg}}) \approx \mathrm{fpc}\, \frac{RSS_{\text{pop}}}{n (N - 1)} \]
Example 3.6
• Estimate the number of dead trees in an area.
• Divide area into 100 square plots.
• Photo counts are easy (x), available for all N = 100 plots.
• For a sample of n = 25 plots, measure actual numbers of dead trees (y).
Estimation
• Use a regression to describe the relationship between the actual count of dead trees
(y) and the photo number of dead trees (x) in the n = 25 sample.
• Find the average number of dead trees in the photos (x̄U = 11.3) and use this to get
the predicted average number of dead trees in the population.
• Multiply by N to get the total.
Enter the data ( SLL075.sas )
data a1;
input photo field @@;
datalines;
10 15 12 14 7 9 13 14 13 8
6 5 17 18 16 15 15 13 10 15
14 11 12 15 10 12 5 8 12 13
10 9 10 11 9 12 6 9 11 12
7 13 9 11 11 10 10 9 10 8
11.3 .
;
Run the regression
proc reg data=a1;
model field=photo/clm;
run;
Output
Var      DF    Par Est    St Error    t       Pr>|t|
Int      1     5.059      1.763       2.87    0.0087
photo    1     0.613      0.160       3.83    0.0009
Predicted average
Obs    Dep Var field    Predicted Value
26     .                11.9893
Standard error
• Different approximations are available
• We will use (3.14) on page 75
\[ \widehat{\mathrm{var}}(\hat{\bar{y}}_{\text{reg}}) = \mathrm{fpc}\, \frac{s_e^2}{n} \]
• See Example 3.6 on pages 75-76
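The regression estimate for the dead-tree example can be reproduced from the 25 data pairs entered above. This is a sketch, not the text's own calculation: it assumes $s_e^2$ uses the divisor n − 1, matching the definition used for the ratio estimator in these notes; other divisors give slightly different standard errors.

```python
from math import sqrt

# (photo, field) pairs from the datalines above.
photo = [10, 12, 7, 13, 13, 6, 17, 16, 15, 10, 14, 12, 10,
         5, 12, 10, 10, 9, 6, 11, 7, 9, 11, 10, 10]
field = [15, 14, 9, 14, 8, 5, 18, 15, 13, 15, 11, 15, 12,
         8, 13, 9, 11, 12, 9, 12, 13, 11, 10, 9, 8]
N, xbar_U = 100, 11.3
n = len(photo)

xbar = sum(photo) / n
ybar = sum(field) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(photo, field))
sxx = sum((x - xbar) ** 2 for x in photo)
b1 = sxy / sxx                       # least squares slope
b0 = ybar - b1 * xbar                # least squares intercept

ybar_reg = b0 + b1 * xbar_U          # predicted mean at the population mean of x
resid = [y - (b0 + b1 * x) for x, y in zip(photo, field)]
s2_e = sum(r ** 2 for r in resid) / (n - 1)    # divisor n - 1 is an assumption
se_reg = sqrt((1 - n / N) * s2_e / n)          # sqrt(fpc * s_e^2 / n)
t_hat = N * ybar_reg                           # estimated total number of dead trees

print(round(b0, 3), round(b1, 3), round(ybar_reg, 4))   # 5.059 0.613 11.9893
print(round(se_reg, 3), round(t_hat, 1))
```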
Difference estimation
• Special case where slope is assumed to be one
• Many cases where the difference between y and x is zero
• Differences are equally likely to be positive or negative
• Sometimes useful in auditing
• See text Section 3.2.2 on page 77
Estimation in domains
• Suppose we want to estimate a mean and/or a total for a subset of the population.
• We call the subpopulation a domain or a subdomain.
• In Example 3.7 on page 79, we use the SRS of counties to estimate mean and total
1992 acres for the western states.
One view
• After we take the sample of size n, select the observations that are within the domain
of interest and proceed as if this were an SRS of size nd from the domain.
• The fpc would be $\mathrm{fpc}_d = 1 - n_d / N_d$.
• Note, this requires that we know Nd .
• We also need Nd to convert ȳd to t̂d .
A technical difficulty
• The sample size in the domain nd is a random variable; it varies from sample to sample.
• With this view we are ignoring the variability introduced into our estimate from this
fact.
• We are conditioning on nd .
• This is what we do in regression when we treat the values of the explanatory variables
as fixed and not random.
Section 3.3 view
• For the ȳd use the mean of the observations in the sample that are in the domain.
• Treat the denominator nd as a random variable.
• In this view, ȳd is a ratio estimator (a B̂) and the methods of this chapter apply.
Some details
• Let ui = yi if i is in the domain, and 0 otherwise.
• Similarly, $x_i = 1$ if i is in the domain, and 0 otherwise.
• $\bar{y}_d = \hat{B} = \dfrac{\sum_{\text{sam}} u_i}{\sum_{\text{sam}} x_i} = \dfrac{\sum_{\text{dom}} y_i}{n_d}$
• t̂d = Nd ȳd if Nd is known
• Use the formula on the next to the last line of page 78 for SE(ȳd ).
• This is a rewrite of the formula for a ratio estimate
\[ \widehat{\mathrm{var}}(\hat{B}) = \frac{\mathrm{fpc}}{n \bar{x}_{\text{pop}}^2}\, \frac{\sum_{\text{sam}} (y_i - \hat{B} x_i)^2}{n - 1} \]
using the notation of this section.
• Use SE(t̂d ) = Nd SE(ȳd ) for the total.
Nd unknown
• Use the same estimator for the domain mean: the mean of this subset of the sample
ȳd .
• Use the last line on page 78 for the SE of $\bar{y}_d$, an approximation that assumes
– $n_d / n \approx N_d / N$.
– $(n_d - 1) / (n - 1) \approx n_d / n$.
Standard error for ȳd
\[ SE(\bar{y}_d) = \sqrt{\mathrm{fpc}}\, \frac{s_{y,d}}{\sqrt{n_d}} \]
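• Proof
\[
\begin{aligned}
\widehat{\mathrm{var}}(\bar{y}_d) &= \frac{\mathrm{fpc}}{n \bar{x}_{\text{pop}}^2}\, \frac{\sum_{\text{sam}} (u_i - \hat{B} x_i)^2}{n - 1} \\
&= \frac{\mathrm{fpc}}{n \bar{x}_{\text{pop}}^2}\, \frac{\sum_{\text{sam}} (y_i - \hat{B} x_i)^2\, I(i \in \text{dom})}{n - 1} \\
&= \frac{\mathrm{fpc}}{n} \left( \frac{N}{N_d} \right)^2 \frac{\sum_{\text{dom}} (y_i - \hat{B} x_i)^2}{n - 1} \\
&\approx \frac{\mathrm{fpc}}{n} \left( \frac{n}{n_d} \right)^2 \frac{(n_d - 1)\, s_{y,d}^2}{n - 1} \\
&\approx \frac{\mathrm{fpc}}{n} \left( \frac{n}{n_d} \right)^2 \frac{n_d\, s_{y,d}^2}{n} \\
&= \mathrm{fpc}\, \frac{s_{y,d}^2}{n_d}
\end{aligned}
\]
• The term $s_{y,d} / \sqrt{n_d}$ is the usual (sampling from an infinite population) standard error
for a $\bar{Y}$.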
• So, for the estimation of the domain mean, we treat the nd observations as an SRS and
approximate f pcd with the f pc for the entire SRS.
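In code, the domain mean and its approximate standard error need only the sampled y values, a domain indicator, and N. The sketch below uses made-up data; the function name `domain_mean_se` is hypothetical.

```python
from math import sqrt


def domain_mean_se(y, in_domain, N):
    """Domain mean and SE = sqrt(fpc) * s_{y,d} / sqrt(n_d), with fpc from the whole SRS."""
    n = len(y)
    y_d = [yi for yi, flag in zip(y, in_domain) if flag]
    n_d = len(y_d)
    if n_d < 2:
        raise ValueError("need at least two sampled units in the domain")
    ybar_d = sum(y_d) / n_d
    s2_d = sum((yi - ybar_d) ** 2 for yi in y_d) / (n_d - 1)
    fpc = 1 - n / N
    return ybar_d, sqrt(fpc * s2_d / n_d)


# Toy SRS of n = 5 from N = 20, made up for illustration.
ybar_d, se_ybar_d = domain_mean_se(
    y=[10, 20, 30, 40, 50],
    in_domain=[True, False, True, True, False],
    N=20,
)
print(ybar_d, se_ybar_d)
```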
Estimation of the total
• (If Nd is known, we multiply Ȳd and the SE by Nd ; MOE and CI follow.)
• When Nd is unknown, we have a bit of a mess.
• The domain proportion in the population (Nd /N ) and sample (nd /n) should be approximately the same.
• So, $N_d$ is approximately $N\, n_d / n$
A consequence
\[ \hat{t}_{y,d} = N \bar{u}, \qquad SE(\hat{t}_{y,d}) = N \sqrt{\mathrm{fpc}} \times SE(\bar{u}), \]
• where SE(ū) is the infinite population standard error estimate.
• This is a strange SE/variance.
• The variance is used to summarize variability in a situation where many/most observations are 0.
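When $N_d$ is unknown, the total is estimated from the zero-filled variable u over the whole sample. A minimal sketch with made-up data follows; the function name `domain_total_se` is hypothetical.

```python
from math import sqrt


def domain_total_se(y, in_domain, N):
    """Estimate the domain total as N * ubar, with SE = N * sqrt(fpc) * SE(ubar)."""
    n = len(y)
    u = [yi if flag else 0.0 for yi, flag in zip(y, in_domain)]   # u_i = y_i in the domain, else 0
    ubar = sum(u) / n
    s2_u = sum((ui - ubar) ** 2 for ui in u) / (n - 1)
    se_ubar = sqrt(s2_u / n)              # infinite-population SE of ubar
    fpc = 1 - n / N
    return N * ubar, N * sqrt(fpc) * se_ubar


# Toy SRS of n = 5 from N = 20, made up for illustration.
t_hat_d, se_t_hat_d = domain_total_se(
    y=[10, 20, 30, 40, 50],
    in_domain=[True, False, True, True, False],
    N=20,
)
print(t_hat_d, se_t_hat_d)
```

The zeros for units outside the domain inflate the variance of u, which is why this standard error is often large relative to the estimate.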
Example 3.7 ( SLL078.sas )
• The US Census of Agriculture
• We have a SRS size n = 300 from the population of N = 3078 counties.
• B is the ratio of 1992 acres in western states to the number of counties in western
states, the mean acres per county.
• The total of interest is the total number of 1992 acres in western states.
Get data and define West
libname xxx 'C:\...\SASdata';
data asrs; set xxx.agsrs;
if
state eq 'AK'
. . .
or state eq 'WY'
then region = 'West';
Define u and x
if region = 'West'
then do;
u=acres92; x=1;
end;
else do;
u=0; x=0; end;
run;
Estimates and SE for mean
proc means data=asrs;
var acres92;
where region eq 'West';
output out=a2
mean=ybard
stderr=sey;
run;
Estimates and SE for total
proc means data=asrs;
var u;
output out=a3
mean=ubar
stderr=seu;
run;
Merge and calculate
data a4; merge a2 a3;
npop=3078; nsrs=300;
fpc=1-nsrs/npop;
seybard=sqrt(fpc)*sey;
thatd=npop*ubar;
seubar=sqrt(fpc)*seu;
sethatd=npop*(seubar);
Print
proc print data=a4;
var ybard seybard thatd sethatd;
Output
Obs    ybard           seybard     thatd           sethatd
1      598680.58974    78520.29    239556051.18    46090456.81
Models
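• Models:
$Y_i = \beta x_i + \varepsilon_i$, $\mathrm{Var}(\varepsilon_i) = \sigma^2 x_i$
$Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, $\mathrm{Var}(\varepsilon_i) = \sigma^2$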
– Get parameter estimates and make predictions and take summaries of predictions.
– SE’s are SE’s of expected values.
• Estimates of parameters are generally the same for randomization theory and model-based inference.
• Estimators are “model unbiased”.
– Variability is in the $\varepsilon_i$'s, not in the sampling design.
• Standard errors may be slightly different, calculated from
– design or
– model
• Requires stronger assumptions
– If model is true, evaluate only large x’s or x’s far from mean.
– Diagnostics are key.
Nonconstant sampling
• In SRS, each data point is sampled with equal probability ($\pi_i = n / N$).
• Sometimes, we want to sample with unequal probabilities.
– to reduce bias (randomization)
– to minimize variance (model)
\[ \mathrm{Var}(y_i) = \sigma^2 x_i \;\Rightarrow\; \pi_i \propto \sqrt{x_i} \]
Comparison
• SRS, ratio, regression
• See page 88 for summary
• Ratio and regression estimators are useful when there is an informative x
• Ratio estimators are useful when we have cluster sampling