Soc709 Lab 11
Heteroskedasticity and weighting data

Heteroskedasticity

Review of lecture 11: Heteroskedasticity is a problem because the variance of the error term is not the same for each case. As a result, the standard formula for the variance of the coefficients is no longer valid, and estimates of the standard errors will be biased. Note that the point estimates of the coefficients are still unbiased; it is only the standard errors that are a problem. In lecture we discussed how, if the variance-covariance matrix were known, weighting by P (the inverse of the standard error of the error term for each case) would solve the problem. In practice, if we don’t know the variance of the error term, we can use the Huber-White sandwich estimator, which gives us unbiased estimates of the standard errors. If we have more information about the structure of the variance of the error term, such as its relationship to other variables in our analysis, then we can improve upon the Huber-White approach. In lab we will learn how to detect and correct for heteroskedasticity.

Overview of syntax

Here is an example of the syntax:

clear
use lab11a
reg y x
* substitute whatever regression you are running here
estimates store ols
estat hettest, iid
* run this command after your regression to test for heteroskedasticity
predict double u, residual
* I generated the residuals (“double” provides more precision)
gen u2=u^2
* now I am generating the squared residuals (a consistent estimate of the variance of the error term)
reg u2 x
* this tests whether the squared residuals are related to x
gen x2=x^2
reg u2 x2
* If I found that x-squared was a good predictor of the squared errors, I could use this as a weight.
* Note that if x was a better predictor than x-squared, I would have used that instead.
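The detection and robust-SE logic just described can be sketched outside Stata. This is a rough pure-Python illustration with invented data (not part of the lab): fit OLS, regress the squared residuals on x to see whether the error variance trends with x, and compare the classical slope SE with the Huber-White (HC0) version that uses each case's own squared residual.

```python
import math

def ols(x, y):
    """Simple OLS of y on x: returns (slope, intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return b, my - b * mx

# Hypothetical data where the spread of y grows with x (heteroskedastic).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1.1, 2.3, 2.7, 4.8, 4.2, 7.9, 5.5, 10.4]

b, a = ols(x, y)
u = [yi - (a + b * xi) for xi, yi in zip(x, y)]   # residuals
u2 = [ui ** 2 for ui in u]                        # squared residuals

# Detection (the idea behind "reg u2 x"): does u^2 trend with x?
slope_u2, _ = ols(x, u2)

# Classical slope SE assumes one common error variance...
n = len(x)
mx = sum(x) / n
sxx = sum((xi - mx) ** 2 for xi in x)
s2 = sum(u2) / (n - 2)
se_classical = math.sqrt(s2 / sxx)

# ...while the Huber-White (HC0) sandwich weights each squared residual
# by that case's leverage on the slope.
se_robust = math.sqrt(sum(((xi - mx) / sxx) ** 2 * ui2
                          for xi, ui2 in zip(x, u2)))

print(slope_u2, se_classical, se_robust)
```

With these made-up numbers the auxiliary slope is clearly positive, which is the informal signal that the error variance depends on x; estat hettest formalizes the same idea as a test.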
gen wgt=1/((x)^2)
* if you had multiple variables in your model of the weights, you could do something like this:
* reg u2 x1 x2 x3
* predict u2
* gen wgt2=1/(x2)     (if x2 was a predictor of u2)
* reg y x1 x2 x3 [aw=wgt2]
* estat hettest, iid
scatter u x
reg y x, vce(robust)
* this estimates the Huber-White sandwich estimator
estimates store robust
reg y x [aw=wgt]
* I used the weights I generated earlier from x-squared
estimates store fgls
estat hettest, iid
estimates table ols robust fgls, b(%9.4f) se(%5.3f) ///
    title(comparison of estimates)

Note that you can’t go wrong just using reg y x, vce(robust) (unless your data are clustered or you have autocorrelated errors for other reasons), but your standard errors could be larger than if you had other information about the structure of the variance-covariance matrix of the error term.

Weighting data

We will discuss two cases: stratified random sampling and cluster sampling. In stratified random sampling, different demographic groups may be sampled at different rates. The pweights in Stata represent 1/p, where p is the probability of sampling. That is, if we sampled 10% of men and 20% of women, then our weight variable would be 10 for men and 5 for women. In a regression command:

xi: reg lnwage i.sex [pw=wgt]

where wgt is the inverse probability of sampling.

For data where the sampling was done in clusters, we need to take this into account, following the logic of lecture 11. Here I am going to follow the example of cluster sampling from the UCLA Stata website. For more details on the use of Stata for survey data analysis, see the UCLA webpage.

Example: We will use the example from the UCLA webpage on a cluster sampling of schools (ucla survey example.log). Data on schools in California (from the UCLA webpage): California requires that all students in public schools be tested each year.
The State Department of Education then puts together the annual Academic Performance Index (API), which rates how a school is doing overall in terms of test scores. The file apipop.dta contains API ratings and demographic information on 6,194 schools in 757 school districts. To be included in the file, schools must have at least 100 students.

One-stage cluster sampling

Another approach to sampling from the population is cluster sampling. In this example we will use school districts as the cluster, or primary sampling unit. We will take a random sample of 15 school districts and look at all of the schools in each one. In this example, the sampling frame contains the 757 school districts. You have to create the correct pweights, the inverse of the probability that a particular school was sampled. In this case:

p = 15/757
pw = 757/15

svyset syntax:

svyset psu [weight], fpc(fpc-variable)

psu = the variable giving the id of the sampling unit, i.e., the cluster
weight = the pweight, the inverse probability that a case was sampled
fpc-variable = the number of psu’s.
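To see what these pweights do numerically, here is a small pure-Python sketch (not part of the lab's Stata workflow). It reuses the p = 15/757 arithmetic above, and then shows the pweighted mean, sum(w*y)/sum(w), on invented scores with two hypothetical strata sampled at 10% and 20%:

```python
# pw is the inverse sampling probability from the example above.
p = 15 / 757
pw = 1 / p          # = 757/15, about 50.47: each sampled district
                    # "stands in" for roughly 50 districts.

# A pweighted mean is sum(w*y) / sum(w). Toy scores, made-up weights
# (first two cases sampled at 10% -> weight 10, last two at 20% -> 5).
scores = [60, 70, 80, 90]
weights = [10, 10, 5, 5]

unweighted = sum(scores) / len(scores)
weighted = sum(w * y for w, y in zip(weights, scores)) / sum(weights)
print(round(pw, 2), unweighted, round(weighted, 2))  # → 50.47 75.0 71.67
```

The undersampled (heavily weighted) cases pull the weighted mean toward their values, which is exactly why ignoring pweights gives a biased mean when sampling rates differ across groups.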
clear
use http://www.ats.ucla.edu/stat/stata/library/apiclus1, clear
tabulate stype
tabulate dnum
svyset dnum [pw=pw], fpc(fpc)
svy: mean api00
svy: total enroll
svy: regress api00 meals ell avg_ed
estimates store svy
* compare to this:
mean api00
total enroll
regress api00 meals ell avg_ed
estimates store ols
regress api00 meals ell avg_ed [pw=pw]
estimates store ols_pw
regress api00 meals ell avg_ed [pw=pw], vce(cluster dnum)
estimates store ols_pw_cluster
regress api00 meals ell avg_ed [pw=pw], vce(robust)
estimates store sandwich
estimates table svy ols ols_pw ols_pw_cluster sandwich, b(%9.4f) se(%5.3f) ///
    title(comparison of estimates)

comparison of estimates
--------------------------------------------------------------------------
    Variable |      svy        ols     ols_pw  ols_pw_~r   sandwich
-------------+------------------------------------------------------------
       meals |  -2.9487    -2.9487    -2.9487    -2.9487    -2.9487
             |    0.327      0.260      0.229      0.333      0.229
         ell |  -0.2227    -0.2227    -0.2227    -0.2227    -0.2227
             |    0.394      0.365      0.345      0.402      0.345
      avg_ed |  16.4283    16.4283    16.4283    16.4283    16.4283
             |   15.322      9.874      9.180     15.627      9.180
       _cons | 755.4386   755.4386   755.4386   755.4386   755.4386
             |   55.612     35.841     33.346     56.719     33.346
--------------------------------------------------------------------------
                                                             legend: b/se

Lab assignment

1) Use the data set lab11_het.dta. Test and correct for heteroskedasticity. What variable(s) are related to the heteroskedasticity? What is the impact on the standard errors?

2) ps9.dta is a data set of hypothetical math test data in a survey of schools. The students were sampled by schools, so the data are “clustered” at the school level. In addition, students were sampled by gender: 100% of gender A students were sampled, and 40% of gender B students were sampled. What effect does the sampling and clustering have on the mean of math scores and on regression estimates of the effect of SES and gender on math scores?
des

Contains data from ps9.dta
  obs:         1,400
 vars:             5                          4 Apr 2007 20:20
 size:        33,600 (99.7% of memory free)
-------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
-------------------------------------------------------------------------------
schid           float  %9.0g                  school id
gender          float  %9.0g                  gender a=0, b=1
ses             float  %9.0g                  parental soc-econ status
math            float  %9.0g                  math test score
p               float  %9.0g                  sampling probability
-------------------------------------------------------------------------------
Sorted by:

1. Effect of sampling and clustering on means

In Stata,
a. create a weight variable, wgt, that is the inverse of the sampling probability
b. use the command svyset to set the data for survey analysis; the general syntax is

svyset psuid [pw=wgt_var], fpc(varname)
* note: ignore the fpc for this homework

where psuid is the cluster variable and wgt_var is the variable for the sampling weights
c. use the command “mean x” to find the mean of the math scores. Find the mean under three conditions: 1. ignoring weights and clusters, 2. with pweights only, 3. with pweights and clusters. After running the third condition, inspect the deff’s.

Syntax example:
mean x
mean x [pw=wgt_var]
svy: mean x
estat eff, deff deft

d. Explain what is going on with the mean and the standard error of the mean under these three conditions. Why does using the pweights give us the right mean but the wrong standard error?

2. Understanding design effects and intra-class correlation.
a. figure out how many clusters (schools) there are in the data.
b. figure out how many students were sampled per school (tab schid)
c. based on the formula we derived in class, deff = 1 + (M − 1)ρ, what is the intra-class correlation, ρ, for mathematics scores in this example? (Note: the formula in this case may not be precisely correct because of sampling, but it is close enough to use here.) Interpret the meaning of the intra-class correlation.

3.
Run a regression of math scores on ses and gender, using the three conditions that you used in part c of #1. Do the different conditions affect the point estimates of the coefficients? Do they affect the estimates of the standard errors? Explain. Interpret the deff’s for the standard errors.

Example syntax:
reg y x z
reg y x z [pw=wgt_var]
svy: reg y x z
estat eff, deff deft

(Alternatively, in Stata 10 you can estimate “reg y x z [pw=wgt_var], vce(cluster psuid)”. Note that this syntax would allow use of the xi command.)
(Note: don’t use the xi command here. Because gender is dichotomous, you can add it directly without the xi command.)
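For part 2c, the class formula deff = 1 + (M − 1)ρ rearranges to ρ = (deff − 1)/(M − 1). A quick pure-Python check with hypothetical numbers (plug in the deff that estat eff reports and the per-school sample size from your own data):

```python
def icc_from_deff(deff, m):
    """Intra-class correlation rho implied by design effect deff
    when m cases are sampled per cluster (deff = 1 + (m - 1) * rho)."""
    return (deff - 1) / (m - 1)

# Hypothetical values, not from ps9.dta: a deff of 5 with 21 students
# per school implies rho = (5 - 1) / (21 - 1) = 0.2.
print(icc_from_deff(5.0, 21))  # → 0.2
```

A ρ of this size would mean a substantial share of the variance in math scores lies between schools rather than within them, which is what inflates the effective standard errors under cluster sampling.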