Download Notes 6 - Wharton Statistics

Stat 475 Notes 6
Reading: Lohr, Chapter 3.3, 4.1-4.2
Corrections for Note 4 Addendum
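A first-order Taylor expansion of $\hat{B} = g(\bar{x}, \bar{y}) = \bar{y}/\bar{x}$ about $(a_0, b_0) = (\bar{x}_U, \bar{y}_U)$, writing $a = \bar{x}$ and $b = \bar{y}$, is
\[
\hat{B} \approx g(a_0, b_0) + \frac{\partial g(a_0, b_0)}{\partial a}(a - a_0) + \frac{\partial g(a_0, b_0)}{\partial b}(b - b_0)
= \frac{\bar{y}_U}{\bar{x}_U} - \frac{\bar{y}_U}{\bar{x}_U^2}(\bar{x} - \bar{x}_U) + \frac{1}{\bar{x}_U}(\bar{y} - \bar{y}_U).
\]
We approximate $\mathrm{Var}(\hat{B})$ by the variance of the right-hand side
of the above:
\[
\mathrm{Var}(\hat{B}) \approx \frac{\bar{y}_U^2}{\bar{x}_U^4}\mathrm{Var}(\bar{x}) + \frac{1}{\bar{x}_U^2}\mathrm{Var}(\bar{y}) - \frac{2\bar{y}_U}{\bar{x}_U^3}\mathrm{Cov}(\bar{x}, \bar{y}).
\]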
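As a numerical check on the variance approximation above, the following R sketch evaluates it on a made-up population under simple random sampling and compares it with the algebraically equivalent residual form $(1 - n/N) S_e^2 / (n \bar{x}_U^2)$, where $e_i = y_i - B x_i$ and $B = \bar{y}_U / \bar{x}_U$. The population, the seed, and all variable names are assumptions for illustration and do not come from the notes.

```r
# Hypothetical population (illustration only, not data from the notes)
set.seed(1)
N <- 500
n <- 50
x <- runif(N, min = 5, max = 15)
y <- 1.1 * x + rnorm(N)

xbarU <- mean(x)
ybarU <- mean(y)
B <- ybarU / xbarU
fpc <- 1 - n / N                      # finite population correction

# Variances and covariance of the sample means under SRS without replacement
var.xbar <- fpc * var(x) / n
var.ybar <- fpc * var(y) / n
cov.xbar.ybar <- fpc * cov(x, y) / n

# Taylor (linearization) approximation to Var(Bhat)
var.Bhat.taylor <- ybarU^2 / xbarU^4 * var.xbar +
  var.ybar / xbarU^2 -
  2 * ybarU / xbarU^3 * cov.xbar.ybar

# Same quantity written in terms of the residuals e_i = y_i - B * x_i
e <- y - B * x
var.Bhat.resid <- fpc * var(e) / (n * xbarU^2)

c(taylor = var.Bhat.taylor, residual.form = var.Bhat.resid)
```

The two numbers agree up to rounding error because the three terms of the Taylor formula are just $\mathrm{Var}(\bar{y} - B\bar{x}) / \bar{x}_U^2$ expanded.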
I. Ratio and regression estimation review
Ratio and regression estimation are useful for estimating the
mean or population total of a variable y when there is an
auxiliary variable x that is highly correlated with y and whose
population total is known.
The intuition is that when x and y are highly correlated, by
comparing the sample mean of x to the population mean of x, we
can predict whether the sample mean of y is likely to be an
overestimate or an underestimate of the population mean of y, and
we can adjust for this.
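In symbols, the adjustment takes the following forms for the population mean (multiply by N for the total). These are the standard textbook definitions, written with $\hat{B} = \bar{y}/\bar{x}$ for the sample ratio and $\hat{B}_0, \hat{B}_1$ for the least squares intercept and slope, matching the names used in the R code in the example of these notes:

```latex
\hat{\bar{y}}_{\text{ratio}} = \hat{B}\,\bar{x}_U = \bar{y}\,\frac{\bar{x}_U}{\bar{x}},
\qquad
\hat{\bar{y}}_{\text{reg}} = \hat{B}_0 + \hat{B}_1 \bar{x}_U = \bar{y} + \hat{B}_1(\bar{x}_U - \bar{x}).
```

When $\bar{y}$ and the slope are positive and the sample mean of x falls below $\bar{x}_U$, both estimators move the estimate up from $\bar{y}$; when the sample mean of x is above $\bar{x}_U$, they move it down.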
Steps in using ratio and regression estimation:
1. Plot the data. Fit a simple linear regression model. Make a
residual plot.
2. Based on the residual plot, decide whether the simple linear
regression model is a reasonable model for the data. If the
simple linear regression model is not a reasonable model for the
data, then both the ratio and regression estimator could have
large bias. An estimator based on a more appropriate regression
model can be considered (see end of notes).
3. If the simple linear regression model is a reasonable model
for the data, check whether the regression line approximately
goes through the origin, e.g., by testing whether the intercept
equals 0. If the regression line approximately goes through the
origin, either the ratio or regression estimator can be used. If the
regression line does not approximately go through the origin,
then the regression estimator should be used.
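Step 3 can be sketched in R as follows. The helper name `choose.estimator`, the 0.05 cutoff, and the data are assumptions made for illustration; the sketch presumes the residual plot from steps 1 and 2 has already been judged acceptable, which still has to be done by eye.

```r
# Hypothetical helper: tests whether the intercept of the simple linear
# regression of y on x is zero and reports which estimators are reasonable.
choose.estimator <- function(x, y, alpha = 0.05) {
  fit <- lm(y ~ x)
  p.intercept <- summary(fit)$coefficients["(Intercept)", "Pr(>|t|)"]
  if (p.intercept < alpha) "regression" else "ratio or regression"
}

# Made-up data: same x and noise, with and without a large intercept
x <- 1:20
noise <- rep(c(-1, 1), 10)
choose.estimator(x, 50 + 2 * x + noise)   # line clearly misses the origin
choose.estimator(x, 2 * x + noise)        # line passes close to the origin
```

Failing to reject a zero intercept does not prove the intercept is zero, so the second outcome is reported as "ratio or regression" rather than as a recommendation for the ratio estimator alone.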
II. Example of ratio and regression estimation:
An advertising firm is concerned about the effect of a new
regional promotional campaign on the total dollar sales for a
particular product. A simple random sample of n = 20 stores is
drawn from the N = 452 regional stores in which the product
is sold. Quarterly sales data are obtained for the current 3-month
period and the 3-month period prior to the campaign.
The total sales among all 452 stores for the 3-month period prior
to the campaign is known to be 216,256. Use the data to
estimate the total sales for the current period and construct a 95%
confidence interval for the total sales for the current period.
Plot the data:
precampaign.sales=c(208,400,440,259,351,880,273,487,183,863,599,510,828,473,
924,110,829,257,388,244);
present.sales=c(239,428,472,276,363,942,294,514,195,897,626,538,888,510,998,
171,889,265,419,257);
plot(precampaign.sales,present.sales);
Precampaign sales and present sales are highly correlated:
cor(precampaign.sales,present.sales)
[1] 0.9985922
Fit a simple linear regression model:
regmodel=lm(present.sales~precampaign.sales);
summary(regmodel);
Call:
lm(formula = present.sales ~ precampaign.sales)

Residuals:
    Min      1Q  Median      3Q     Max
-19.037  -7.918  -2.345   8.252  45.423

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)
(Intercept)       10.10523    7.08325   1.427    0.171
precampaign.sales  1.04975    0.01314  79.871   <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 14.93 on 18 degrees of freedom
Multiple R-squared: 0.9972, Adjusted R-squared: 0.997
F-statistic: 6379 on 1 and 18 DF, p-value: < 2.2e-16
Residual plot:
plot(precampaign.sales,resid(regmodel),xlab="Precampaign Sales",ylab="Residual",main="Residual plot");
abline(0,0); # Draws a horizontal line at 0
The simple linear regression model appears to be a reasonable
model for the data.
The p-value for testing whether the intercept of the regression
line is 0 is 0.171, so there is no evidence that the intercept is not
zero. It is reasonable to use either the ratio or the regression
estimator.
Comparison of ratio, regression and standard estimators
# Ratio estimator
# Population mean of precampaign sales
population.mean.precampaign.sales=216256/452;
# Sample estimate of ratio
Bhat=mean(present.sales)/mean(precampaign.sales);
# Ratio estimator of population total
ytotal.ratio=452*Bhat*population.mean.precampaign.sales;
# s_e^2 needed for estimating standard error of ratio estimator of population total
se.ratio.sq=sum((present.sales-Bhat*precampaign.sales)^2)/(length(present.sales)-1);
# Standard error of ratio estimator of total
standard.error.ytotal.ratio=452*sqrt((1-20/452)*se.ratio.sq/20);
# Approximate 95% confidence interval
lci.ratio=ytotal.ratio-1.96*standard.error.ytotal.ratio;
uci.ratio=ytotal.ratio+1.96*standard.error.ytotal.ratio;
# Regression estimator
# Fit a simple linear regression model of present sales on precampaign sales
regmodel=lm(present.sales~precampaign.sales);
# Extract least squares estimates of intercept and slope from regression fit
B0hat=coef(regmodel)[1];
B1hat=coef(regmodel)[2];
# Regression estimator of population total
ytotal.reg=452*(B0hat+B1hat*population.mean.precampaign.sales);
# Variance of residuals from regression
sereg.sq=sum(resid(regmodel)^2/(length(present.sales)-2));
# Standard error of regression estimator of total
standard.error.ytotal.reg=452*sqrt((1-(20/452))*sereg.sq/20);
# Approximate 95% confidence interval
lci.reg=ytotal.reg-1.96*standard.error.ytotal.reg;
uci.reg=ytotal.reg+1.96*standard.error.ytotal.reg;
# Standard total estimator that ignores precampaign sales
ybar=mean(present.sales);
ytotal.standard=452*ybar;
standard.error.ytotal.standard=452*sqrt((1-20/452)*var(present.sales)/20);
# Approximate 95% confidence interval
lci.standard=ytotal.standard-1.96*standard.error.ytotal.standard;
uci.standard=ytotal.standard+1.96*standard.error.ytotal.standard;
Method       Estimated Total   SE       95% CI
Standard     230,091           27,073   (177,027, 283,154)
Ratio        231,612           1,537    (228,600, 234,624)
Regression   231,582           1,475    (228,690, 234,474)
The ratio and regression estimators are about equally efficient
and are much more efficient than the standard estimator.
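One way to quantify "much more efficient" is the ratio of estimated variances, that is, the squared ratio of the standard errors reported for the three methods. The short R sketch below does this arithmetic; the vector names are assumptions for illustration, and the inputs are the rounded standard errors from the comparison above.

```r
# Standard errors of the three estimated totals, as reported in the notes
se <- c(standard = 27073, ratio = 1537, regression = 1475)

# Estimated variance of the standard estimator relative to each estimator
rel.eff <- (se[["standard"]] / se)^2
round(rel.eff)
```

Both auxiliary-variable estimators come out with an estimated variance several hundred times smaller than that of the standard estimator, which reflects the 0.9986 correlation between precampaign and present sales.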
III. Domain Estimation
Often we want separate estimates of means for subpopulations;
the subpopulations are called domains. For example, we may
want to take an SRS of visitors who fly to Philadelphia on
September 18th and estimate the proportion of visitors who
intend to stay longer than 1 week. For this survey, there are two
domains of study: visitors from in-state and visitors from out-of-state.
We do not know which persons in the population belong to which
domain until they are sampled, though. Thus, the number of
persons in the sample who fall into each domain is a random
variable, with value unknown at the time the survey is designed.
Suppose there are D domains. Let $U_d$ be the set of units in the
population that are in domain d and let $S_d$ be the set of units in
the sample that are in domain d, for $d = 1, 2, \ldots, D$. Let $N_d$ be
the number of population units in $U_d$ and $n_d$ be the number of
sample units in $S_d$. Suppose we want to estimate the population
mean in domain d:
\[
\bar{y}_{U_d} = \sum_{i \in U_d} \frac{y_i}{N_d}.
\]
A natural estimator of $\bar{y}_{U_d}$ is
\[
\bar{y}_d = \sum_{i \in S_d} \frac{y_i}{n_d}. \qquad (1.1)
\]
Equation (1.1) looks at first just like the sample mean we have
studied for estimating the whole population mean. The quantity
$n_d$ is a random variable, however: if a different simple random
sample is taken, we will very likely have a different value for
$n_d$. Different samples would have different numbers of out-of-state
visitors. Technically, (1.1) is a ratio estimate. To see this,
let
\[
u_i = \begin{cases} y_i & \text{if } i \in U_d \\ 0 & \text{if } i \notin U_d \end{cases}
\qquad
x_i = \begin{cases} 1 & \text{if } i \in U_d \\ 0 & \text{if } i \notin U_d \end{cases}
\]
Then $\bar{x}_U = N_d / N$,
\[
\bar{y}_{U_d} = \frac{\sum_{i=1}^{N} u_i}{\sum_{i=1}^{N} x_i},
\]
and
\[
\bar{y}_d = \frac{\bar{u}}{\bar{x}} = \hat{B} = \frac{\sum_{i \in S} u_i}{\sum_{i \in S} x_i}.
\]
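The identity between the domain mean and the ratio of sample means can be checked directly. The R sketch below uses a made-up sample of 8 units with a logical domain flag; the data and the names `in.d`, `u`, and `x` are assumptions for illustration.

```r
# Made-up SRS of n = 8 visitors; in.d flags membership in domain d
y <- c(3, 10, 5, 8, 2, 12, 7, 4)
in.d <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)

# u_i = y_i inside the domain and 0 outside; x_i = domain indicator
u <- ifelse(in.d, y, 0)
x <- as.numeric(in.d)

ybar.d <- mean(y[in.d])       # domain sample mean, as in (1.1)
Bhat <- mean(u) / mean(x)     # ratio of whole-sample means

c(ybar.d = ybar.d, Bhat = Bhat)
```

Both quantities equal the sum of y over the sampled domain members divided by the number of sampled domain members, so they coincide for any sample with at least one unit in the domain.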
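Because we are estimating a ratio, we use the formula from
Notes 4 for calculating the standard error:
\[
\begin{aligned}
SE(\bar{y}_d) &= \sqrt{\left(1 - \frac{n}{N}\right) \frac{1}{n \bar{x}_U^2} \, \frac{\sum_{i \in S} (u_i - \hat{B} x_i)^2}{n - 1}} \\
&= \sqrt{\left(1 - \frac{n}{N}\right) \frac{1}{n \bar{x}_U^2} \, \frac{\sum_{i \in S_d} (y_i - \hat{B} x_i)^2}{n - 1}} \\
&= \sqrt{\left(1 - \frac{n}{N}\right) \frac{1}{n} \left(\frac{N}{N_d}\right)^2 \frac{(n_d - 1) s_{yd}^2}{n - 1}},
\end{aligned}
\]
where
\[
s_{yd}^2 = \frac{\sum_{i \in S_d} (y_i - \bar{y}_d)^2}{n_d - 1}
\]
is the sample variance in domain d.
If the expected sample size in domain d is large enough, then we
expect that $n_d / n \approx N_d / N$ and $(n_d - 1)/(n - 1) \approx (N_d - 1)/(N - 1)$
and have the approximation
\[
SE(\bar{y}_d) \approx \sqrt{\left(1 - \frac{n}{N}\right) \frac{s_{yd}^2}{n_d}}.
\]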
When the finite population correction factor $(1 - n/N)$ is about 1,
the above is just the standard error we would get if we assumed
that we took a sample of fixed size $n_d$ from domain d. Thus,
in a sufficiently large sample, the technicality that we are using a
ratio estimator makes little difference in practice for estimating a
domain mean.
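The standard error expressions for the domain mean can be compared numerically. The R sketch below is a minimal illustration on made-up data: the function name `domain.se`, the sample, and the values N = 400 and N_d = 180 (treated as known) are all assumptions, not part of the notes.

```r
# Hypothetical helper returning the ratio form, the domain-variance form,
# and the large-sample approximation of SE(ybar_d)
domain.se <- function(y, in.d, N, N.d) {
  n <- length(y)
  n.d <- sum(in.d)
  u <- ifelse(in.d, y, 0)
  x <- as.numeric(in.d)
  Bhat <- mean(u) / mean(x)
  xbarU <- N.d / N                 # population mean of the indicator x
  fpc <- 1 - n / N
  se.ratio <- sqrt(fpc / (n * xbarU^2) * sum((u - Bhat * x)^2) / (n - 1))
  s2.yd <- var(y[in.d])            # sample variance within the domain
  se.domain <- sqrt(fpc / n * (N / N.d)^2 * (n.d - 1) * s2.yd / (n - 1))
  se.approx <- sqrt(fpc * s2.yd / n.d)
  c(ratio.form = se.ratio, domain.form = se.domain, large.sample = se.approx)
}

# Made-up sample of 8 units, 4 of them in the domain
y <- c(3, 10, 5, 8, 2, 12, 7, 4)
in.d <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
se.out <- domain.se(y, in.d, N = 400, N.d = 180)
se.out
```

The first two entries agree exactly, since the residuals $u_i - \hat{B} x_i$ vanish outside the domain. The third is only an approximation, and with a sample this small it should not be expected to match closely.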
The situation is a little more complicated when estimating a
domain total. If $N_d$ is known, estimation is simple: use $N_d \bar{y}_d$.
If $N_d$ is unknown, though, we need to estimate it by $N n_d / n$.
Then
\[
\hat{t}_{yd} = N \frac{n_d}{n} \, \frac{\sum_{i \in S} u_i}{n_d} = N \bar{u}.
\]
The standard error is
\[
SE(\hat{t}_{yd}) = N \, SE(\bar{u}) = N \sqrt{\left(1 - \frac{n}{N}\right) \frac{s_u^2}{n}}.
\]
Example 3.8 from book.
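For the case where the domain size is unknown, the domain total estimator and its standard error described before the reference to Example 3.8 can be sketched in R as follows. The sample and the population size N = 400 are made up for illustration, and the variable names are assumptions.

```r
# Made-up SRS of n = 8 units from a population of N = 400
N <- 400
y <- c(3, 10, 5, 8, 2, 12, 7, 4)
in.d <- c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, TRUE, FALSE)
n <- length(y)

# u_i = y_i inside the domain, 0 outside
u <- ifelse(in.d, y, 0)

# Domain total estimate N * ubar and its standard error N * SE(ubar)
t.hat <- N * mean(u)
se.t.hat <- N * sqrt((1 - n / N) * var(u) / n)

# Same estimate written as (estimated domain size) * (domain mean)
t.hat.alt <- N * (sum(in.d) / n) * mean(y[in.d])

c(t.hat = t.hat, t.hat.alt = t.hat.alt, se = se.t.hat)
```

Because the zeros outside the domain enter the variance of u, this standard error also reflects the uncertainty in the estimated domain size.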
IV. Simulation Studies
(a) Population in which the simple linear regression model
approximately holds, the intercept of the regression line is
approximately zero and the correlation is high (0.91):
Population regression:
Call:
lm(formula = popy ~ popx)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.47783    0.23496  -2.034   0.0425 *
popx         1.14619    0.02287  50.124   <2e-16 ***

For samples of size 50 from population of 500:

Estimator     Bias    Root Mean Squared Error
Sample Mean   0.00    0.330
Ratio         0.00    0.135
Regression    0.00    0.135

Root Mean Squared Error (RMSE) = $\sqrt{\mathrm{MSE}}$
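When reading the bias and RMSE columns in these simulations, it may help to recall the standard decomposition of mean squared error for an estimator $\hat{\theta}$ of $\theta$ (a general fact, not specific to these notes):

```latex
\mathrm{MSE}(\hat{\theta}) = E\big[(\hat{\theta} - \theta)^2\big]
= \mathrm{Var}(\hat{\theta}) + \big[\mathrm{Bias}(\hat{\theta})\big]^2,
\qquad
\mathrm{RMSE}(\hat{\theta}) = \sqrt{\mathrm{MSE}(\hat{\theta})}.
```

When the bias is essentially zero, the RMSE is just the standard deviation of the estimator across repeated samples, so differences in RMSE between estimators reflect differences in variance.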
(b) Population in which there is a linear relationship between
E(y|x) and log(x) and the correlation is moderately high (0.66)
Call:
lm(formula = popy ~ popx)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.404816   0.056785   24.74   <2e-16 ***
popx        0.109780   0.005569   19.71   <2e-16 ***

For samples of size 50 from population of 500:

Estimator     Bias    Root Mean Squared Error
Sample Mean   0.00    0.046
Ratio         0.00    0.052
Regression    0.00    0.035
(c) Population in which there is a linear relationship between
E(y|x) and exp(x) and the correlation is moderate (0.43)
Call:
lm(formula = popy ~ popx)

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -0.28542    0.09928  -2.875  0.00421 **
popx         0.38301    0.03521  10.877  < 2e-16 ***

For samples of size 20 from population of 500:

Estimator     Bias    Root Mean Squared Error
Sample Mean   0.00    0.969
Ratio         -0.04   0.899
Regression    -0.11   0.890
Code for simulation
# Simulation studies to compare ratio, regression and sample mean estimators
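# Generate ONE of the three populations below before running the simulation;
# each block overwrites popy (and the first two also overwrite popx).
# (a) Simple linear regression model
beta=1.1;
popx=rnorm(500,mean=10,sd=2);
popy=beta*popx+rnorm(500,mean=0,sd=1);
# (b) Linear relationship between E(y|x) and log(x)
beta=1.1;
popx=rnorm(500,mean=10,sd=2);
popy=beta*log(popx)+rnorm(500,mean=0,sd=.25);
# (c) Linear relationship between E(y|x) and exp(x)
# Note: this block reuses whatever popx is currently defined
beta=.01;
popy=beta*exp(popx)+rnorm(500,mean=0,sd=1);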
popxmean=mean(popx);
popymean=mean(popy);
samplesize=50;
nosims=50000;
samplemean=rep(0,nosims);
ratioest=rep(0,nosims);
regest=rep(0,nosims);
for(i in 1:nosims){
tempsample=sample(1:500,samplesize,replace=FALSE);
y.sample=popy[tempsample];
x.sample=popx[tempsample];
samplemean[i]=mean(y.sample);
Bhat=mean(y.sample)/mean(x.sample);
ratioest[i]=Bhat*popxmean;
tempreg=lm(y.sample~x.sample);
B0hat=coef(tempreg)[1];
B1hat=coef(tempreg)[2];
regest[i]=B0hat+B1hat*popxmean;
}
bias.samplemean=mean(samplemean)-popymean;
bias.ratioest=mean(ratioest)-popymean;
bias.regest=mean(regest)-popymean;
bias.samplemean;
bias.ratioest;
bias.regest;
rmse.samplemean=sqrt(mean((samplemean-popymean)^2));
rmse.ratioest=sqrt(mean((ratioest-popymean)^2));
rmse.regest=sqrt(mean((regest-popymean)^2));
rmse.samplemean;
rmse.ratioest;
rmse.regest;
plot(popx,popy,xlab="x",ylab="y",main="Population");
# Population regression
popreg=lm(popy~popx);
summary(popreg);
V. Generalized Regression Estimator
Suppose the following model holds in the population:
$y_i = f(x_i) + e_i$, $E(e_i \mid x_i) = 0$, i.e., $E(y_i \mid x_i) = f(x_i)$.
Then
\[
\bar{y}_U \approx \frac{1}{N} \sum_{i=1}^{N} f(x_i).
\]
The generalized regression estimator is
\[
\hat{\bar{y}}_{gen.reg} = \frac{1}{N} \sum_{i=1}^{N} \hat{f}(x_i),
\]
where $\hat{f}$ is an estimate of f based on the sample.
For simulation (c), consider the generalized regression estimator
based on $E(y_i \mid x_i) = \beta_0 + \beta_1 \exp(x_i)$, the model from which the
population was generated.
Estimator                Bias    Root Mean Squared Error
Sample Mean              0.00    0.969
Ratio                    -0.04   0.899
Regression               -0.11   0.890
Generalized Regression   0.00    0.175
The code for estimating the generalized regression estimator in
each sample is:
exp.x.sample=exp(x.sample);
tempreg.exp=lm(y.sample~exp.x.sample);
B0hat.exp=coef(tempreg.exp)[1];
B1hat.exp=coef(tempreg.exp)[2];
regest.exp[i]=B0hat.exp+B1hat.exp*mean(exp(popx));
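The fragment above runs inside the simulation loop. For a single sample, a self-contained sketch of the same estimator is given below. It averages the fitted $\hat{f}$ over all population x values with `predict` and confirms that this matches the closed form used in the fragment. The population here is an assumption for illustration: in particular, x is drawn from a standard normal, which is a made-up choice and not necessarily the distribution behind simulation (c), and the seed is arbitrary.

```r
set.seed(475)                          # arbitrary seed, for reproducibility
N <- 500
n <- 20

# ASSUMPTION: made-up population with E(y|x) linear in exp(x)
popx <- rnorm(N, mean = 0, sd = 1)
popy <- 0.01 * exp(popx) + rnorm(N, mean = 0, sd = 1)

# Draw one simple random sample without replacement
idx <- sample(1:N, n, replace = FALSE)
samp <- data.frame(x = popx[idx], y = popy[idx])

# Estimate f by least squares on exp(x)
fit <- lm(y ~ exp(x), data = samp)

# Generalized regression estimator: average f-hat over the whole population
greg.est <- mean(predict(fit, newdata = data.frame(x = popx)))

# Closed form used in the notes: B0hat + B1hat * mean(exp(popx))
B0hat.exp <- coef(fit)[[1]]
B1hat.exp <- coef(fit)[[2]]
greg.closed <- B0hat.exp + B1hat.exp * mean(exp(popx))

c(greg.est = greg.est, greg.closed = greg.closed)
```

The two values coincide because the fitted model is linear in exp(x), so averaging the predictions is the same as plugging in the population mean of exp(x). Note that the estimator needs exp(x) for every population unit, or at least its population mean, not just the population mean of x.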