Soc709 Lab 11
Heteroskedasticity and weighting data
Review of lecture 11:
Heteroskedasticity is a problem because the variance of the error term is not the same
for each case.
As a result, the standard formula for the variance of the coefficients is no longer valid,
and estimates of the standard errors will be biased.
Note that the point estimates of the coefficients are still unbiased. It is just a problem of
the standard errors.
In lecture we discussed how, if the variance-covariance matrix of the errors were known,
weighting by P (the inverse of the standard deviation of the error term of each case) would
solve the problem.
In practice, if we don’t know the variance of the error term, we can use the Huber-White
sandwich estimator, which gives us consistent estimates of the standard errors. If we have
more information about the structure of the variance of the error term, such as its
relationship to other variables in our analysis, then we can improve upon the Huber-White
approach.
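To make the contrast concrete, here is a minimal Python sketch (standard library only) for a one-regressor model. It computes the OLS slope, the conventional standard error, and the Huber-White standard error in its plain form (HC0) and with the n/(n-k) small-sample correction that Stata's vce(robust) applies (HC1). The function name and the toy data are illustrative assumptions, not part of the lab.

```python
import math

def ols_with_robust_se(x, y):
    """Return (slope, conventional SE, HC0 SE, HC1 SE) for y = a + b*x + e."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    a = ybar - b * xbar
    resid = [yi - a - b * xi for xi, yi in zip(x, y)]

    # Conventional formula: one common error variance s^2 for every case
    s2 = sum(u * u for u in resid) / (n - 2)
    se_conv = math.sqrt(s2 / sxx)

    # Sandwich formula: each case contributes its own squared residual
    meat = sum((xi - xbar) ** 2 * u * u for xi, u in zip(x, resid))
    se_hc0 = math.sqrt(meat / sxx ** 2)
    se_hc1 = se_hc0 * math.sqrt(n / (n - 2))  # k = 2 estimated coefficients
    return b, se_conv, se_hc0, se_hc1

# Toy data (hypothetical): the error spread grows in proportion to x
x = [float(i) for i in range(1, 11)]
y = [2 + 3 * xi + xi * (-1) ** i for i, xi in enumerate(x)]

b, se_conv, se_hc0, se_hc1 = ols_with_robust_se(x, y)
print("slope:", b)
print("conventional SE:", se_conv)
print("robust SE (HC0):", se_hc0)
print("robust SE (HC1):", se_hc1)
```

The slope is the same in every case; only the standard error changes, which is the point made in the review above.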
In lab we will learn how to detect and correct for heteroskedasticity.
Overview of syntax:
Here is an example of the syntax.
use lab11a
reg y x
* substitute whatever regression you are running here
estimates store ols
estat hettest, iid
* run this command after your regression to test for heteroskedasticity
predict double u, residual
* I generated the residuals (“double” provides more precision)
gen u2=u^2
* now I am generating the squared residuals (a consistent estimate of
* the variance of the error term)
reg u2 x
* this tests whether the squared residuals are related to x
gen x2=x^2
reg u2 x2
* If I found that x-squared was a good predictor of the squared errors,
* I could use this as a weight
* Note that if x was a better predictor than x-squared, I would have
* used that instead.
gen wgt=1/((x)^2)
* if you had multiple variables in your model of the weights you could
* do something like this:
* reg u2 x1 x2 x3
* predict u2hat
*   (named u2hat because u2 already exists)
* gen wgt2=1/(x2)
*   (if x2 was a predictor of u2)
* reg y x1 x2 x3 [aw=wgt2]
* estat hettest, iid
scatter u x
reg y x, vce(robust)
* this estimates the Huber-White sandwich estimator
estimates store robust
reg y x [aw=wgt]
* I used the weights I generated earlier from x-squared
estimates store fgls
estat hettest, iid
estimates table ols robust fgls, b(%9.4f) se(%5.3f) ///
title(comparison of estimates)
Note that you can’t go wrong just using
reg y x, vce(robust)
unless your data are clustered or you have autocorrelated errors for other reasons. However,
your standard errors could be larger than if you had other information about the
structure of the variance-covariance matrix of the error term.
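The sequence of steps in the syntax above (fit OLS, square the residuals, check whether they track x-squared, then refit with weights 1/x^2) can be sketched in Python. This is a minimal standard-library sketch for a single regressor, not a replacement for the Stata commands. The function name wls_fit and the toy data are assumptions made for illustration.

```python
def wls_fit(x, y, w=None):
    """Return (intercept, slope) minimizing sum of w_i * (y_i - a - b*x_i)^2.

    With w=None every case gets weight 1, which is ordinary least squares.
    """
    if w is None:
        w = [1.0] * len(x)
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    b = sxy / sxx
    return ybar - b * xbar, b

# Toy data (hypothetical): the error spread grows in proportion to x
x = [float(i) for i in range(1, 11)]
y = [2 + 3 * xi + xi * (-1) ** i for i, xi in enumerate(x)]

a_ols, b_ols = wls_fit(x, y)                                   # reg y x
u2 = [(yi - a_ols - b_ols * xi) ** 2 for xi, yi in zip(x, y)]  # gen u2=u^2
x2 = [xi ** 2 for xi in x]                                     # gen x2=x^2
_, aux_slope = wls_fit(x2, u2)                                 # reg u2 x2
wgt = [1.0 / v for v in x2]                                    # gen wgt=1/((x)^2)
a_fgls, b_fgls = wls_fit(x, y, wgt)                            # reg y x [aw=wgt]

print("OLS slope: ", b_ols)
print("slope of u2 on x2:", aux_slope)
print("FGLS slope:", b_fgls)
```

Multiplying every weight by the same constant leaves the weighted fit unchanged, which is why only the relative size of analytic weights matters.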
Weighting data
We will discuss two cases, stratified random sampling and cluster sampling.
In stratified random sampling, different demographic groups may be sampled at different
rates. The pweights in Stata represent 1/p, where p is the probability of being sampled.
For example, if we sampled 10% of men and 20% of women, then our weight variable would be
10 for men and 5 for women.
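The arithmetic behind these weights can be checked with a short Python sketch. The sampling rates come from the example above; the sample sizes and scores are made up for illustration, and the function names are assumptions.

```python
def pweight(sampling_prob):
    """Probability weight: the inverse of the probability of selection."""
    if not 0 < sampling_prob <= 1:
        raise ValueError("sampling probability must be in (0, 1]")
    return 1.0 / sampling_prob

def weighted_mean(values, weights):
    """Mean in which each case counts in proportion to its weight."""
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

# Sampling rates from the text: 10% of men, 20% of women
w_men, w_women = pweight(0.10), pweight(0.20)

# Hypothetical sample drawn from 1,000 men and 1,000 women:
# 100 men who all score 10 and 200 women who all score 20
values = [10.0] * 100 + [20.0] * 200
weights = [w_men] * 100 + [w_women] * 200

unweighted = sum(values) / len(values)     # women are over-represented
weighted = weighted_mean(values, weights)  # each group stands for 1,000 people

print("weights:", w_men, w_women)
print("unweighted mean:", unweighted)
print("weighted mean:  ", weighted)
```

Because each sampled man stands for 10 people and each sampled woman for 5, the weighted mean gives the two groups equal influence, matching their equal shares of this hypothetical population.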
In a regression command,
xi: reg lnwage [pw=wgt]
where wgt is the inverse probability of sampling.
For data where the sampling was done in clusters (explain) we need to take this into
account (i.e., following the logic of lecture 11).
Here I am going to follow the example of cluster sampling from the UCLA Stata website.
For more details on the use of Stata for survey data analysis, see the UCLA webpage.
Example: We will use the example from the UCLA webpage on a cluster sampling of
schools, ucla survey example.log
Data on schools in California (from the UCLA webpage):
California requires that all students in public schools be tested each year. The State
Department of Education then puts together the annual Academic Performance Index
(API) which rates how a school is doing overall, in terms of the test scores. The file,
apipop.dta, contains api ratings and demographic information on 6,194 schools in 757
school districts. To be included in the file schools must have at least 100 students.
One-Stage Cluster Sampling
Another approach to sampling from the population is cluster sampling. In this example
we will use school districts as the cluster or primary sampling units. We will take a
random sample of 15 school districts and look at all of the schools in each one.
In this example, the sampling frame contains the 757 school districts.
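As an illustration of the design just described, the following Python sketch draws a one-stage cluster sample: it picks whole districts at random and then keeps every school in the chosen districts. The frame here is entirely hypothetical (made-up school ids, one to five schools per district), and the function name and seed are assumptions. It assumes simple random sampling of districts without replacement.

```python
import random

def one_stage_cluster_sample(frame, n_clusters, seed=None):
    """Sample whole clusters and keep every unit inside each chosen cluster.

    frame maps a cluster id to the list of units in that cluster.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(frame), n_clusters)
    return {cid: frame[cid] for cid in chosen}

# Hypothetical sampling frame: 757 districts with made-up school ids
frame = {d: [f"d{d}_s{k}" for k in range(1 + d % 5)] for d in range(1, 758)}

sample = one_stage_cluster_sample(frame, n_clusters=15, seed=709)
n_schools = sum(len(schools) for schools in sample.values())

print("districts sampled:", len(sample))
print("schools in sample:", n_schools)
```

The number of schools in the sample is not fixed in advance, because it depends on which districts happen to be drawn.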
You have to create the correct pweights, the inverse probability that a particular school
was sampled. In this case,
svyset syntax:
svyset psu [weight], fpc(fpc-variable)
psu=the variable giving the id of the sampling unit, i.e., the cluster
weight=the pweight, the inverse probability that a case was sampled
fpc-variable=a variable giving the total number of PSUs in the population (used for the finite population correction).
use, clear
tabulate stype
tabulate dnum
svyset dnum [pw=pw], fpc(fpc)
svy: mean api00
svy: total enroll
svy: regress api00 meals ell avg_ed
estimates store svy
* compare to this:
mean api00
total enroll
regress api00 meals ell avg_ed
estimates store ols
regress api00 meals ell avg_ed [pw=pw]
estimates store ols_pw
regress api00 meals ell avg_ed [pw=pw], vce(cluster dnum)
estimates store ols_pw_cluster
regress api00 meals ell avg_ed [pw=pw], vce(robust)
estimates store sandwich
estimates table svy ols ols_pw ols_pw_cluster sandwich, b(%9.4f) se(%5.3f) ///
title(comparison of estimates)
comparison of estimates
--------------------------------------------------------------------------
    Variable |
-------------+------------------------------------------------------------
       meals |
         ell |
      avg_ed |
       _cons | 755.4386
--------------------------------------------------------------------------
legend: b/se
Lab assignment
1) Use the data set lab11_het.dta. Test and correct for heteroskedasticity. What
variable/s are related to the heteroskedasticity? What is the impact on the standard
errors?
2) ps9.dta is a data set of hypothetical math test data in a survey of schools. The students
were sampled by schools, so the data is “clustered” at the school level. In addition,
students were sampled by gender. 100% of gender A students were sampled, and 40% of
gender B students were sampled. What effect does the sampling and clustering have on
the mean of math scores and regression estimates of the effect of SES and gender on
math scores?
. des
Contains data from ps9.dta
                                                 4 Apr 2007 20:20
 size:        33,600 (99.7% of memory free)
-------------------------------------------------------------------------------
               storage   display
variable name  type      format     variable label
-------------------------------------------------------------------------------
               float     %9.0g      school id
               float     %9.0g      gender a=0, b=1
               float     %9.0g      parental soc-econ status
               float     %9.0g      math test score
               float     %9.0g      sampling probability
-------------------------------------------------------------------------------
Sorted by:
1. Effect of sampling and clustering on means
In Stata,
a. create a weight variable, wgt, that is the inverse of the sampling probability
b. use the command svyset to set the data for survey analysis
the general syntax is
svyset psuid [pw=wgt_var], fpc(varname)
* note: ignore the fpc for this homework.
where psuid is the cluster variable and wgt_var is the variable for the sampling weights
c. use the command “mean x” to find the mean of the math scores. Find the mean under
three conditions 1. ignoring weights and clusters, 2. with pweights only, 3. with pweights
and clusters. After running the third condition, inspect the deff’s
Syntax example:
mean x
mean x [pw=wgt_var]
svy: mean x
estat eff, deff deft
d. Explain what is going on with the mean and the standard error of the mean under these
three conditions. Why does using the pweights give us the right mean but the wrong
standard error?
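For intuition about why clustering matters for the standard error, here is a minimal Python sketch that compares two standard errors for the same mean: one that treats every student as an independent draw, and one that treats schools as the sampled units. It is a simplification that assumes equal-sized clusters, equal weights and no finite population correction. The scores and function names are hypothetical, and this is not what svy: mean computes in general.

```python
import math
import statistics

def se_mean_srs(values):
    """Standard error of the mean treating cases as independent draws."""
    return statistics.stdev(values) / math.sqrt(len(values))

def se_mean_clustered(clusters):
    """Standard error of the mean when whole clusters are the sampled units.

    Assumes equal-sized clusters, equal weights and no finite population
    correction, so the overall mean equals the mean of the cluster means.
    """
    cluster_means = [statistics.mean(c) for c in clusters]
    return statistics.stdev(cluster_means) / math.sqrt(len(cluster_means))

# Hypothetical math scores for 4 schools with 3 students each
schools = [[52, 55, 53], [61, 60, 63], [70, 72, 71], [58, 57, 59]]
all_scores = [s for school in schools for s in school]

print("mean:", statistics.mean(all_scores))
print("SE ignoring clusters:", se_mean_srs(all_scores))
print("SE using clusters:   ", se_mean_clustered(schools))
```

In this toy data, students in the same school score alike, so 12 students carry less independent information than 12 independent draws would, and the cluster-based standard error is the larger of the two.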
2. Understanding design effects and intra-class correlation.
a. figure out how many clusters (schools) there are in the data.
b. figure out how many students were sampled per school
(tab schid)
c. based on the formula we derived in class,
$\text{deff} = 1 + \rho\,(M - 1)$
what is the intra-class correlation, $\rho$, for mathematics scores in this example? (note the
formula in this case may not be precisely correct because of sampling, but it is close
enough to use here) Interpret the meaning of the intra-class correlation.
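Rearranging the formula gives rho = (deff - 1) / (M - 1). The Python sketch below applies it in both directions. It assumes M is the number of cases sampled per cluster, and the numbers plugged in (deff = 3, M = 21) are hypothetical, not taken from ps9.dta.

```python
def icc_from_deff(deff, m):
    """Intra-class correlation implied by deff = 1 + rho * (m - 1)."""
    if m <= 1:
        raise ValueError("need more than one case per cluster")
    return (deff - 1.0) / (m - 1.0)

def deff_from_icc(rho, m):
    """Design effect implied by an intra-class correlation and cluster size."""
    return 1.0 + rho * (m - 1.0)

# Hypothetical values: design effect of 3 with 21 cases per cluster
rho = icc_from_deff(3.0, 21)
print("implied intra-class correlation:", rho)
print("design effect recovered from rho:", deff_from_icc(rho, 21))
```

Even a modest intra-class correlation produces a large design effect when clusters are big, because rho is multiplied by M - 1.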
3. Run a regression of math scores on ses and gender, using the three conditions that you
used in part c of #1. Do the different conditions affect the point estimates of the
coefficients? Do they affect the estimates of the standard errors? Explain. Interpret the
deff’s for the standard errors.
example syntax:
reg y x z
reg y x z [pw=wgt_var]
svy: reg y x z
(alternatively, in Stata 10 you can estimate “reg y x z [pw=wgt_var], vce(cluster psuid)”.
Note that this syntax would allow use of the xi command)
estat eff, deff deft
(note: don’t use the xi command here. Because gender is dichotomous, you can add it
directly without the xi command)