Download Notes 6 - Wharton Statistics

Stat 475 Notes 6 Reading: Lohr, Chapter 3.3, 4.1-4.2 Corrections for Note 4 Addendum A first order Taylor expansion of B̂ is g (a0 , b0 ) g (a0 , b0 ) Bˆ  g (a0 , b0 )  (a  a0 )  (b  b0 ) a b y y 1  U  ( x  xU ) U  ( y  yU ) xU xU xU We approximate Var ( Bˆ ) by the variance of the right hand side of the above: 2 y 2y 1 U Var ( Bˆ )  2 Var ( x )  2 Var ( y )  2U Cov( x , y ) xU xU xU I. Ratio and regression estimation review Ratio and regression estimation are useful for estimating the mean or population total of a variable y when there is an auxiliary variable x for which we know the population total and that it is highly correlated with y. The intuition is that when x and y are highly correlated, by comparing the sample mean of x to the population mean of x, we 1 can predict whether the sample mean of y is likely to be an overestimate and underestimate of the population mean of y and we can adjust for this. Steps in using ratio and regression estimation: 1. Plot the data. Fit a simple linear regression model. Make a residual plot. 2. Based on the residual plot, decide whether the simple linear regression model is a reasonable model for the data. If the simple linear regression model is not a reasonable model for the data, then both the ratio and regression estimator could have large bias. An estimator based on a more appropriate regression model can be considered (see end of notes). 3. If the simple linear regression model is a reasonable model for the data, check whether the regression line approximately goes through the origin, e.g., by testing whether the intercept equals 0. If the regression line approximately goes through the origin, either the ratio or regression estimator can be used. If the regression line does not approximately go through the origin, then the regression estimator should be used. II. Example of ratio and regression estimation: An advertising firm is concerned about the effect of a new regional promotional campaign on the total dollar sales for a particular product. A simple random sample of n  20 stores is drawn from then N  452 regional stores in which the product 2 is sold. Quarterly sales data are obtained for the current 3month period and the 3-month period prior to the campaign. The total sales among all 425 stores for the 3-month period prior to the campaign is known to be 216,256. Use the data to estimate the total sales for the current period and a 95% confidence interval for the total sales for the current period. Plot the data: precampaign.sales=c(208,400,440,259,351,880,273,487,183,863,599,510,828,473, 924,110,829,257,388,244); present.sales=c(239,428,472,276,363,942,294,514,195,897,626,538,888,510,998,1 71,889,265,419,257); plot(precampaign.sales,present.sales); Precampaign sales and present sales are highly correlated: 3 cor(precampaign.sales,present.sales) [1] 0.9985922 Fit a simple linear regression model: regmodel=lm(present.sales~precampaign.sales); summary(regmodel); Call: lm(formula = present.sales ~ precampaign.sales) Residuals: Min 1Q Median 3Q Max -19.037 -7.918 -2.345 8.252 45.423 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 10.10523 7.08325 1.427 0.171 precampaign.sales 1.04975 0.01314 79.871 <2e-16 *** --Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 14.93 on 18 degrees of freedom Multiple R-squared: 0.9972, Adjusted R-squared: 0.997 F-statistic: 6379 on 1 and 18 DF, p-value: < 2.2e-16 Residual plot: plot(precampaign.sales,resid(regmodel),xlab="Precampaign Sales",ylab="Residual",main="Residual plot"); abline(0,0); # Draws a horizontal line at 0 4 The simple linear regression model appears to be a reasonable model for the data. The p-value for testing whether the intercept of the regression line is 0 is 0.171, so there is no evidence that the intercept is not zero. It is reasonable to use either the ratio or the regression estimator. Comparison of ratio, regression and standard estimators # Ratio estimator # Population mean of precampaign sales population.mean.precampaign.sales=216256/452; 5 # Sample estimate of ratio Bhat=mean(present.sales)/mean(precampaign.sales) # Ratio estimator of population total ytotal.ratio=452*Bhat*population.mean.precampaign.sales; # s_e^2 needed for estimating standard error of ratio estimator of population total se.ratio.sq=sum((present.sales-Bhat*precampaign.sales)^2)/(length(present.sales)1); # Standard error of ratio estimator of total standard.error.ytotal.ratio=452*sqrt((1-20/452)*se.ratio.sq/20); # Approximate 95% confidence interval lci.ratio=ytotal.ratio-1.96*standard.error.ytotal.ratio; uci.ratio=ytotal.ratio+1.96*standard.error.ytotal.ratio; # Regression estimator # Fit a simple linear regression model of present sales on precampaign sales regmodel=lm(present.sales~precampaign.sales); # Extract least squares estimates of intercept and slope from regression fit B0hat=coef(regmodel)[1]; B1hat=coef(regmodel)[2]; # Regression estimator of population total ytotal.reg=452*(B0hat+B1hat*population.mean.precampaign.sales); # Variance of residuals from regression sereg.sq=sum(resid(regmodel)^2/(length(present.sales)-2)); # Standard error of regression estimator of total standard.error.ytotal.reg=452*sqrt((1-(20/452))*sereg.sq/20); # Standard error of regression estimator # Approximate 95% confidence interval lci.reg=ytotal.reg-1.96*standard.error.ytotal.reg; uci.reg=ytotal.reg+1.96*standard.error.ytotal.reg; # Standard total estimator that ignores precampaign sales ybar=mean(present.sales); ytotal.standard=452*ybar; standard.error.ytotal.standard=452*sqrt((1-20/452)*var(present.sales)/20); # Approximate 95% confidence interval lci.reg=ytotal.reg-1.96*standard.error.ytotal.reg; uci.reg=ytotal.reg+1.96*standard.error.ytotal.reg; 6 Method Estimated Total SE 95% CI Standard 230,091 27,073 (177,027, 283,154) Ratio 231,612 1,537 (228,600, 234,624) Regression 231,582 1,475 (228,690, 234,474) The ratio and regression estimators are about equally as efficient and are much more efficient than the standard estimator. III. Domain Estimation Often we want separate estimates of means for subpopulations; the subpopulations are called domains. For example, we may want to take a SRS of visitors who fly to Philadelphia on September 18th and to estimate the proportion of visitors who intend to stay longer than 1 week. For this survey, there are two domains of study: visitors from in-state and visitors from out-ofstate. We do not know which persons in the population to which domain until they are sampled, though. Thus, the number of persons in the sample who fall into each domain is a random variable, with value unknown at the time the survey is designed. Suppose there are D domains. Let U d be the set of units in the population that are in domain d and let S d be the set of units in 7 the sample that are in domain d for d  1, 2, , D . Let N d be the number of population units in U d and nd be the number of sample units in S d . Suppose we want to estimate the population mean in domain d: y yU d   i iU d N d A natural estimator of yU d is yd   iSd yi nd (1.1) (1.1) looks at first just like the sample mean we have studied for estimating the whole population mean. The quantity nd is a random variable, however: If a different simple random sample is taken, we will very likely to have a different value for nd . Different samples would have different numbers of out-ofstate visitors. Technically, (1.1) is a ratio estimate. To see this, let  yi if i  U d ui   0 if i  U d 1 if i  U d xi   0 if i  U d N Then xU  N d / N , yU d  u i 1 N i x i 1 and i 8 yd u  u  Bˆ   x x iS iS i i Because we are estimating a ratio, we use the formula from Notes 4 for calculating the standard error: n 1  SE ( yd )  1   2  N  nxU n 1   1   2  N  nxU ˆ )  (u  Bx i iS 2 i n 1 2 ˆ ( y  Bx )  i i iS n 1 , 2 2 n  1  N  (nd  1) s yd   1     n 1  N  n  Nd  ( yi  yd ) 2  iS d 2 s  where yd is the sample variance in domain d. nd  1 If the expected sample size in domain d is large enough, then we expect that nd / n  N d / N and (nd  1) /(n  1)  ( Nd  1) /( N 1) and have the approximation 2 n s  SE ( yd )  1   yd  N  nd 9 n  1  When the finite population correction factor  N  is about 1, the above is just the standard error we would get if we assumed that we took a sample of fixed size nd on the domain D. Thus, in a sufficiently large sample, the technicality that we are using a ratio estimator makes little difference in practice for estimating a domain mean. The situation is a little more complicated when estimating a domain total. If N d is known, estimation is simple: use N d yd . If N d is unknown thought, we need to estimate it by Nnd / n . Then ui nd  iS tyd  N  Nu n nd The standard error is n  su2  SE ( tyd )  NSE (u )  N 1   .  N n Example 3.8 from book. IV. Simulation Studies (a) Population in which the simple linear regression model approximately holds, the intercept of the regression line is approximately zero and the correlation is high (0.91): 10 Population regression: Call: lm(formula = popy ~ popx) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.47783 0.23496 -2.034 0.0425 * popx 1.14619 0.02287 50.124 <2e-16 *** For samples of size 50 from population of 500: Bias Root Mean Squared Error Sample Mean 0.00 0.330 Ratio 0.00 0.135 11 Regression 0.00 0.135 Root Mean Squared Error (RMSE) = MSE (b) Population in which there is a linear relationship between E(y|x) and log(x) and the correlation is moderately high (0.66) Call: lm(formula = popy ~ popx) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 1.404816 0.056785 24.74 <2e-16 *** popx 0.109780 0.005569 19.71 <2e-16 *** For samples of size 50 from population of 500: 12 Bias Root Mean Squared Error Sample Mean 0.00 0.046 Ratio 0.00 0.052 Regression 0.00 0.035 (c) Population in which there is a linear relationship between E(y|x) and exp(x) and the correlation is moderate (0.43) Call: lm(formula = popy ~ popx) Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) -0.28542 0.09928 -2.875 0.00421 ** 13 popx 0.38301 0.03521 10.877 < 2e-16 *** For samples of size 20 from population of 500: Bias Root Mean Squared Error Sample Mean 0.00 0.969 Ratio -0.04 0.899 Regression -0.11 0.890 Code for simulation # Simulation studies to compare ratio, regression and sample mean estimators # Simple linear regression model beta=1.1; popx=rnorm(500,mean=10,sd=2); popy=beta*popx+rnorm(500,mean=0,sd=1); # Linear relationship between E(y|x) and log(x) beta=1.1 popx=rnorm(500,mean=10,sd=2); popy=beta*log(popx)+rnorm(500,mean=0,sd=.25); # Linear relationship between E(y|x) and exp(x) beta=.01; popy=beta*exp(popx)+rnorm(500,mean=0,sd=1); popxmean=mean(popx); popymean=mean(popy); samplesize=50; nosims=50000; samplemean=rep(0,nosims); ratioest=rep(0,nosims); regest=rep(0,nosims); for(i in 1:nosims){ tempsample=sample(1:500,samplesize,replace=FALSE); 14 y.sample=popy[tempsample]; x.sample=popx[tempsample]; samplemean[i]=mean(y.sample); Bhat=mean(y.sample)/mean(x.sample); ratioest[i]=Bhat*popxmean; tempreg=lm(y.sample~x.sample); B0hat=coef(tempreg)[1]; B1hat=coef(tempreg)[2]; regest[i]=B0hat+B1hat*popxmean; } bias.samplemean=mean(samplemean)-popymean; bias.ratioest=mean(ratioest)-popymean; bias.regest=mean(regest)-popymean; bias.samplemean; bias.ratioest; bias.regest; rmse.samplemean=sqrt(mean((samplemean-popymean)^2)); rmse.ratioest=sqrt(mean((ratioest-popymean)^2)); rmse.regest=sqrt(mean((regest-popymean)^2)); rmse.samplemean; rmse.ratioest; rmse.regest; plot(popx,popy,xlab="x",ylab="y",main="Population"); # Population regression popreg=lm(popy~popx); summary(popreg); V. Generalized Regression Estimator Suppose the following model holds in the population: yi  f ( xi )  ei , E (ei | xi )  0 , i.e., E ( yi | xi )  f ( xi ) 1 y  Then U N N  f (x ) . i 1 i 15 The generalized regression estimator is 1 N ˆ yˆ gen.reg   f ( xi ) N i 1 where fˆ is an estimate of f based on the sample. For simulation (c), consider the generalized regression estimator based on E ( yi | xi )  0  1 exp( xi ) , the model from which the population was generated. Bias Root Mean Squared Error Sample Mean 0.00 0.969 Ratio -0.04 0.899 Regression -0.11 0.890 Generalized Regression 0.00 0.175 The code for estimating the generalized regression estimator in each sample is: exp.x.sample=exp(x.sample); tempreg.exp=lm(y.sample~exp.x.sample); B0hat.exp=coef(tempreg.exp)[1]; B1hat.exp=coef(tempreg.exp)[2]; regest.exp[i]=B0hat.exp+B1hat.exp*mean(exp(popx)); 16

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Notes 6 - Wharton Statistics