Advanced Labor Economics II

Instrumental Variables
Isaac Mbiti
University of Virginia
Basics
• Goal: we want to estimate the returns to education (for example)
• We collect data on wages/earnings, years of schooling, and other individual-level data.
• Outline of lecture
o Basic OLS (ordinary least squares regression): we look at the basic relationship between wages and schooling
o Problems with OLS: what problems do we face using OLS?
o How to address these problems with IV (instrumental variables) methods
Basics
• Consider estimating the returns to education.
$Y_i = \alpha + \rho S_i + \nu_i$
• Y = wages/earnings
• S = years of schooling
• ρ = returns to schooling, the coefficient of interest
• ν is the error term (often denoted with e)
• For clarity we will focus on simple bivariate regression
o Extends to the multivariate regression case
Basics- OLS regression
[Figure: scatter plot of earnings (vertical axis, 20 to 80) against yrs_schooling (horizontal axis, 6 to 16), with fitted values]
Basics- OLS regression
[Figure: the same scatter plot of earnings against yrs_schooling with fitted values, annotated with the prompts below]
• What can I learn from looking at this scatter plot?
• What command do you use in Stata to do a scatter plot?
• “Error” is the difference between the prediction and the actual data point
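One possible answer to the scatter plot question, as a minimal sketch for a do-file. The data are simulated here only so the commands run on their own; the seed, sample size, and coefficients are assumptions, not the lecture's dataset. Only the variable names follow the slides.

```stata
* Hypothetical data (assumed values, not the lecture's dataset)
clear
set seed 1
set obs 100
generate yrs_schooling = 6 + floor(11*runiform())
generate earnings = 3 + 4.3*yrs_schooling + rnormal(0, 5)

* Scatter plot of the raw data
scatter earnings yrs_schooling

* Same scatter with the OLS fitted line overlaid
twoway (scatter earnings yrs_schooling) (lfit earnings yrs_schooling)
```

The vertical distance between each point and the fitted line is the “error” described on the slide.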
Basics
$Y_i = \alpha + \rho S_i + \nu_i$
• OLS (ordinary least squares) finds the “best fit line”
o Minimizes the sum of squared errors
• Which letters denote the slope and intercept?
ρ is the _____? And α is the _____?
• In Stata we use the command:
regress {dep_var} {independent} [, options]
Other examples:
regress wages yrs_schooling
regress birth_weight mothers_smoking, robust
regress test_score textbooks, cluster(classroomid)
Basics- OLS regression
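. regress earnin yrs_schooling

       Source |       SS       df       MS              Number of obs =     100
--------------+------------------------------           F(  1,    98) =  255.24
        Model |  7459.44217     1  7459.44217           Prob > F      =  0.0000
     Residual |  2864.05109    98  29.2250111           R-squared     =  0.7226
--------------+------------------------------           Adj R-squared =  0.7197
        Total |  10323.4933    99   104.27771           Root MSE      =   5.406

-------------------------------------------------------------------------------
     earnings |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
yrs_schooling |   4.331217   .2711029    15.98   0.000     3.793222    4.869212
        _cons |   2.729021   3.002237     0.91   0.366    -3.228821    8.686862
-------------------------------------------------------------------------------

How do we read this regression output table?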
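As a reading aid, the main numbers in the regression output can be reproduced from one another. This is only a sketch: the critical value 1.984 (roughly the 97.5th percentile of a t distribution with 98 degrees of freedom) is not printed in the output and is supplied here as an assumption.

```latex
\begin{align*}
t &= \frac{\text{Coef.}}{\text{Std. Err.}} = \frac{4.331217}{0.2711029} \approx 15.98 \\
\text{95\% CI} &= 4.331217 \pm 1.984 \times 0.2711029 \approx [3.793,\ 4.869] \\
R^2 &= \frac{\text{Model SS}}{\text{Total SS}} = \frac{7459.44217}{10323.4933} \approx 0.7226
\end{align*}
```

The same arithmetic applied to the _cons row gives a t of about 0.91 and a confidence interval that includes zero.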
Basics- Assumptions of OLS
• Assumption 1: Linearity
• Each person’s earnings (Yi) are a linear function of their education
(Si), plus an individual-specific error term, νi
• νi may be called the error term, the residual, or the deviation. It is
a random variable, meaning that the value that any individual gets
is a random draw from a distribution. I think of νi as representing
the “luck” factor.
• For a given level of schooling (S), not everyone makes the same Y
because some are lucky and make more and others make less.
Basics- Assumptions of OLS
• Assumption 2: Error term has zero mean, E(νi)=0
• This means that the positive errors and the negative errors cancel
out, so that on average, the error is zero.
Basics- Assumptions of OLS
• Assumption 3: Homoskedasticity
• In math: var(νi | S) is constant
• Intuitively this means that the variance of the error term does not
depend on S (our independent variable)
Basics- Assumptions of OLS
• Assumption 3: Homoskedasticity
• An example of a violation of this assumption is if our data are clustered
• Dataset 1: 100 standard 5 students, each from a different school
• Dataset 2: 100 standard 5 students, 10 students each from 10 schools
• Suppose you do the following in Stata with both datasets:
regress test_scores textbooks
• Which analysis would be problematic?
Basics- Assumptions of OLS
• Assumption 3: Homoskedasticity
• Violations of this assumption do not affect the slope coefficients
• BUT: they affect the standard errors of the coefficients
• This means our t-statistics and confidence intervals will be wrong
• SO we could give bad policy advice! (e.g. we say we should implement a program when we really shouldn’t)
• So we fix this by clustering or using robust standard errors
• In Stata:
regress birth_weight mothers_smoking, robust
regress test_score textbooks, cluster(classroomid)
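A minimal do-file sketch of the point that the slope stays the same while the standard error changes. The data are simulated with an error spread that grows with textbooks; the data-generating process, seed, and coefficients are all assumptions made for illustration.

```stata
* Hypothetical heteroskedastic data: error spread grows with textbooks (assumed)
clear
set seed 3
set obs 1000
generate textbooks = floor(6*runiform())
generate test_score = 50 + 2*textbooks + (1 + 2*textbooks)*rnormal()

regress test_score textbooks              // default standard errors
scalar b_default  = _b[textbooks]
scalar se_default = _se[textbooks]

regress test_score textbooks, robust      // heteroskedasticity-robust standard errors
display "slope: default = " b_default "  robust = " _b[textbooks]
display "s.e.:  default = " se_default "  robust = " _se[textbooks]
```

The two slope estimates are identical by construction; only the reported standard error, and therefore the t-statistic and confidence interval, differ.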
Basics- OLS regression
[Same Stata output for . regress earnin yrs_schooling as on the earlier “Basics- OLS regression” slide]
If we violate homoskedasticity, the standard errors, t-statistics, p-values and confidence intervals will be wrong!
Basics- Assumptions of OLS
• Assumption 4: Cov(S, νi) = 0
• Recall our regression:
$Y_i = \alpha + \rho S_i + \nu_i$
• What does this assumption mean?
• No relationship between schooling and the error term
• Recall we are thinking of the error term as “luck”.
• So the assumption states that “lucky people” have similar education to “unlucky people”
Basics- Assumptions of OLS
• Assumption 4: Cov(S,νi ) =0
• Use a simulated data set where I violate this assumption and I
know the true relationship between variables.
$Y_i = 2 + 3.7 S_i + \nu_i$
• I.e., I am going to create a data set where the above relationship is true (what is the intercept? What is the slope?)
• I also create the data such that Cov(S, νi) ≠ 0
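A simulation of this kind could be sketched as follows for a do-file. This is not the lecturer's actual code: the variable `ability`, the seed, and every number other than the intercept 2 and slope 3.7 are assumptions chosen so that schooling and the error term share a common component.

```stata
* Hypothetical simulation: true model is earnings = 2 + 3.7*yrs_schooling + v,
* but schooling and v both depend on ability, so Cov(S, v) != 0.
clear
set seed 12345
set obs 1000
generate ability = rnormal(0, 2)                 // unobserved by the analyst
generate yrs_schooling = 12 + ability + rnormal(0, 2)
generate v = 1.5*ability + rnormal(0, 5)         // error term contains ability
generate earnings = 2 + 3.7*yrs_schooling + v

correlate yrs_schooling v                        // correlation is clearly nonzero
regress earnings yrs_schooling, robust           // slope comes out above 3.7
```

Because schooling and v are positively correlated in this setup, the OLS slope is pushed above the true value of 3.7.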
Basics- Assumptions of OLS
• Assumption 4: Cov(S, νi) = 0
• Use a simulated data set where I violate this assumption.
• What does this look like?
[Figure: scatter plot of v (“Luck”, vertical axis, -20 to 20) against yrs_schooling (horizontal axis, 5 to 20), with fitted values]
• Very important note: I can only create this graph because I am using simulated data. You cannot create such a graph with regular data
So what?
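• Let’s run the OLS regression on the simulated data.
$Y_i = \alpha + \rho S_i + \nu_i$

. regress earnings yrs_school, robust

Linear regression                                       Number of obs =    1000
                                                        F(  1,   998) = 2793.86
                                                        Prob > F      =  0.0000
                                                        R-squared     =  0.7447
                                                        Root MSE      =  5.7537

-------------------------------------------------------------------------------
              |               Robust
     earnings |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
--------------+----------------------------------------------------------------
yrs_schooling |   4.291932    .081199    52.86   0.000     4.132591    4.451272
        _cons |  -4.316475   .9084094    -4.75   0.000    -6.099087   -2.533864
-------------------------------------------------------------------------------

Did my regression give me the right answer?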
OLS Assumptions – When All is OK
[Diagram with nodes S, V, and Y]
Omitted Variables
• Often S and V are actually correlated because of omitted variables
• An omitted variable ⇒ Cov(S, νi) ≠ 0
• From our simulation, what problem will this cause in OLS?
• Recall previous lectures:
o We want to know the impact of a program
o But it’s really hard because people who enroll in programs are more motivated (for example).
o This is an example of an omitted variable problem
• ALL methods we discussed are trying to address this problem
o RCT, Diff in Diff, RD and now we look at one more: IV (instrumental variables)
IV basics
• Often S and V are actually correlated because of omitted variables
• An omitted variable ⇒ Cov(S, νi) ≠ 0
• Classic example of omitted variable problem:
o unobserved ability A (which is correlated with schooling)
$Y_i = \alpha + \rho S_i + A_i' + \nu_i$
• In this case Cov(S, V) > 0, i.e. higher-ability people get more schooling
IV basics
• Could we solve the problem by controlling for A?
• Yes BUT ONLY IF
o WE CAN MEASURE A properly (unlikely in practice)
o A is the only omitted variable (hard to argue and very unlikely)
o SO adding lots of variables to the regression is not sufficient.
• For exposition purposes let us suppose A is the only omitted variable
• If we can’t measure A then we have a problem
• Simply estimating this regression would lead to overestimates of ρ
• Why would we be overestimating?
Omitted Variables
[Diagram with nodes S, Y, and V]
• We can’t measure A so A is part of V
• Since V and S are correlated and BOTH affect Y:
o Is S driving Y or is it V (including A)?
o Could Y be driven solely by V (which includes A)?
Omitted Variable Bias- Just trust me on this
• By knowing (or hypothesizing) the direction of the relationships between S and V, and between Y and V, we can actually figure out whether the omitted variable problem will lead us to overestimate or underestimate the true relationship between Y and S if we use OLS
• If S and V are positively correlated & Y and V are positively correlated, OLS will overestimate the true relationship
• If S and V are negatively correlated & Y and V are negatively correlated, OLS will overestimate the true relationship
• If S and V are positively correlated & Y and V are negatively correlated, OLS will underestimate the true relationship
• If S and V are negatively correlated & Y and V are positively correlated, OLS will underestimate the true relationship
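For readers who would rather not take the sign rules on trust, a standard way to see them is the omitted variable bias formula. The sketch below assumes the true model is $Y_i = \alpha + \rho S_i + \delta A_i + \nu_i$, where the omitted variable $A_i$ is left in the error term and $\nu_i$ is uncorrelated with $S_i$. The coefficient $\delta$ (the effect of A on Y) is notation introduced here, not taken from the slides.

```latex
\begin{align*}
\rho_{OLS} = \frac{Cov(Y_i, S_i)}{V(S_i)}
 = \frac{Cov(\alpha + \rho S_i + \delta A_i + \nu_i,\ S_i)}{V(S_i)}
 = \rho + \delta\,\frac{Cov(A_i, S_i)}{V(S_i)}
\end{align*}
```

The bias term is positive when $\delta$ and $Cov(A_i, S_i)$ have the same sign (OLS overestimates) and negative when they have opposite signs (OLS underestimates), which matches the four cases listed above.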
Let’s Start Simple..
$Y_i = \alpha + \rho S_i + A_i' + \nu_i$
• We can solve the problem using an instrumental variable (z) which is correlated with S but not with A or ν
• The assumption that z is uncorrelated with A or ν is called the exclusion restriction.
IV intuition: The exclusion restriction
[Diagram with nodes Z, S, V, and Y]
IV intuition: The Exclusion Restriction
[Diagram with nodes Z, S, V, and Y, annotated “???”]
How Do We Estimate ρ?
$\rho = \dfrac{Cov(Y_i, Z_i)}{Cov(S_i, Z_i)} = \dfrac{Cov(Y_i, Z_i)/V(Z_i)}{Cov(S_i, Z_i)/V(Z_i)}$
• Note: with a simple regression (e.g. reg y on x), the OLS estimate is Cov(Y,X)/V(X)
• So the denominator is the OLS regression of schooling on our instrument Z
o We call this the FIRST STAGE
o Does Z predict schooling?
• First stage coefficient can’t be zero! That is, the instrument has to have some predictive power
How Do We Estimate ρ?
• The numerator is the OLS regression of earnings on our instrument Z
• We call this the reduced form relationship
o (SIMILAR TO INTENT TO TREAT)
• So IV works by taking the coefficient from the reduced form relationship and dividing by the coefficient from the first stage
• It’s the effect of Z on earnings divided by the effect of Z on schooling
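The ratio described above can be checked numerically. The do-file sketch below uses simulated data with a hypothetical binary instrument `z` and an unobserved `ability` term; all names other than `earnings` and `yrs_schooling`, and all numeric values other than the slope 3.7, are assumptions. With one instrument and one endogenous variable, the reduced form coefficient divided by the first stage coefficient should match the `ivregress 2sls` point estimate.

```stata
* Hypothetical simulated data: z is a binary instrument, ability is unobserved
clear
set seed 2024
set obs 5000
generate z = runiform() < 0.5
generate ability = rnormal()
generate yrs_schooling = 12 + z + ability + rnormal()
generate earnings = 2 + 3.7*yrs_schooling + 3*ability + rnormal(0, 5)

regress yrs_schooling z              // first stage: does z predict schooling?
scalar fs = _b[z]
regress earnings z                   // reduced form: effect of z on earnings
scalar rf = _b[z]
display "IV = reduced form / first stage = " rf/fs

ivregress 2sls earnings (yrs_schooling = z), robust   // same point estimate
```

Running `regress earnings yrs_schooling` on the same data gives a noticeably larger slope, because ability sits in the error term.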
Where do we get instruments
from?
• Economic theory
• Natural/policy experiments
o Eg Duflo (2001) examines impact of rapid school
construction on education and wages.
• Take away – sometimes the variation in diff in
diff can support an IV estimation strategy
Where do we get instruments
from?
• Natural experiments:
o Angrist and Krueger (1991) use quarter of birth + compulsory schooling laws in the US as an instrument for years of schooling
o How/why does this work?
• School year starts Sept 1. You have to be age 6 by that date. Now compare someone born Aug 30 to someone born Sept 3: very similar in age, but the older kid meets the criterion and the younger one has to wait till next year.
• School leaving laws say you have to be in school till age 16. Suppose both leave at 16: who will have more schooling?
• Is this a valid instrument?
Exercise
$GPA_i = \beta_0 + \beta_1 PC_i + u_i$
• Want to examine the effect of owning a personal computer (PC) on GPA in college. PC is a dummy variable
1. Why might PC be correlated with u?
2. Explain why PC is likely to be related to parental income.
Does this mean parental income is a good IV for PC? Why
or why not?
3. Suppose the university randomly gave grants for PC
purchasing to some students. How could you use this to
construct an instrumental variable estimate of the
equation of interest (above)?
Quarter of Birth: An Example of an IV for
Schooling (Angrist and Krueger, QJE 1991)
• Why might this work?
o States only allow kids to start school after they turn 6.
o For exposition pretend school starts Sept 1.
o So, kids turning 6 from Sept 2 to Dec 31 can’t join until the following year. But kids turning 6 on Sept 1 or before are able to enroll in the school.
o But compulsory schooling laws require kids to stay in school until their 16th birthday
o So kids with Sept 2 to December birthdays end up spending less time in school than kids born on or before Sept 1 → quarter of birth is predictive of years of schooling
o Meanwhile, it is hard to imagine that quarter of birth affects earnings for any reason besides completed schooling (or does it?)
Let’s Add Covariates to the
Model
• Important because maybe people born in the South are more likely to give birth later in the year than people in the North (I made this up) and region of birth is correlated with earnings. No problem: just control for region/state of birth
$S_i = X_i'\pi_{10} + \pi_{11} Z_i + \xi_{1i}$
$Y_i = X_i'\pi_{20} + \pi_{21} Z_i + \xi_{2i}$
• The coefficient on Z in the first equation is the first stage and the coefficient on Z in the second equation is the reduced form
$\rho = \dfrac{\pi_{21}}{\pi_{11}}$
Two-Stage Least Squares
(2SLS)
First Stage: Estimate predicted schooling using Z (and the other covariates)
$S_i = \pi_0 + \pi_1 Z_i + e_i$, which gives predicted schooling $\hat{S}_i$ (“S-hat”)
Second Stage: Plug that predicted schooling into the equation of interest (structural equation). The estimated coefficient on the predicted schooling will be the estimated ρ
$Y_i = \alpha + \rho \hat{S}_i + u_i$
How To Think About This
2SLS retains only the variation in S that is generated by variation in the instrument Z. This variation is not correlated with ability and so we can consistently estimate the effect of schooling on earnings
If I use S in the regression of interest: many things drive S, including A and other unobservables
If I use S_hat (predicted from the instrument): ALL the variation in S_hat is driven by the instrument, so we “break the link” between schooling and unobservables
The Wald Estimator
• (This is Abraham Wald, the statistician behind the famous WWII analysis of where to add armor to aircraft)
• Simplest IV estimator: a single dummy instrument, with one endogenous variable and no covariates
• Not so useful in practice but helps to think about intuition
$\rho = \dfrac{E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]}{E[S_i \mid Z_i = 1] - E[S_i \mid Z_i = 0]}$
• The only reason for any relationship between Z and Y is that Z affects S, so if the numerator is nonzero, it must be because of S. The denominator is just for rescaling so we can answer the question of how much S affects Y.
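The Wald estimator is the covariance-ratio formula for ρ specialised to a dummy instrument. A short sketch of why, writing $p = P(Z_i = 1)$ (notation introduced here, not on the slides):

```latex
\begin{align*}
Cov(Y_i, Z_i) &= E[Y_i Z_i] - E[Y_i]\,E[Z_i] \\
 &= p\,E[Y_i \mid Z_i = 1] - p\,\bigl(p\,E[Y_i \mid Z_i = 1] + (1-p)\,E[Y_i \mid Z_i = 0]\bigr) \\
 &= p(1-p)\,\bigl(E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]\bigr)
\end{align*}
```

The same steps give $Cov(S_i, Z_i) = p(1-p)\,(E[S_i \mid Z_i = 1] - E[S_i \mid Z_i = 0])$, so the factor $p(1-p)$ cancels in the ratio $Cov(Y_i, Z_i)/Cov(S_i, Z_i)$ and what remains is the difference in mean outcomes divided by the difference in mean schooling.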
Example with Quarter of Birth
[Table with columns: 1st Quarter | 4th Quarter | Difference]
Compute the Wald estimator comparing Q1 to Q4
Exercises
• Suppose you want to test whether girls who attend a girls’ high school do better in math than girls in coed (mixed) schools. You have a sample of high school girls, and score is a standardized math test. girlhs = attends an all-girls school
1. What other factors would you control for in the equation? (be realistic about things that are in data)
2. Write a regression equation for #1
3. Suppose parental support and motivation are unobserved factors in the error term in #2. Are they likely correlated with girlhs? Explain
4. Discuss the assumptions needed for the number of girls’ high schools within a 20 km radius of a girl’s home to be a valid IV for girlhs
Back to the Basics
• A good instrument must
1. Be correlated with the endogenous right hand side variable
o We also want the first stage to be informative and strongly statistically significant (a good F test)
2. Be uncorrelated with the error term
• Good news is that we can test condition (1): this is the importance of the first stage
• Bad news is that we cannot test (2)
Very Silly and Embarrassing Mistakes
• Don’t do 2SLS by hand (getting predicted values of the endogenous variable, plugging them into the equation of interest, doing OLS) because the standard errors WILL BE WRONG
• Always put all of the controls in the first and second stage equations!
o The first stage residual (S minus S-hat) is uncorrelated by construction with all covariates in the first stage. But these first stage residuals, which are included in the error in the second stage, may be correlated with any X’s that were not in the first stage → INCONSISTENT estimates!
o What’s good enough for the second stage is good enough for the first stage!!
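To make the first warning concrete, the do-file sketch below does 2SLS by hand and then with `ivregress 2sls` on simulated data. The data-generating process, the control `x`, the instrument `z`, and the seed are all assumptions made for illustration. The point estimates should agree; the standard errors from the by-hand second stage should not be reported.

```stata
* Hypothetical simulated data; x is an exogenous control, z a binary instrument
clear
set seed 99
set obs 5000
generate x = rnormal()
generate z = runiform() < 0.5
generate ability = rnormal()
generate yrs_schooling = 12 + z + 0.5*x + ability + rnormal()
generate earnings = 2 + 3.7*yrs_schooling + x + 3*ability + rnormal(0, 5)

* By hand (do NOT report these standard errors)
regress yrs_schooling z x            // first stage includes the control x
predict double s_hat, xb
regress earnings s_hat x             // second stage includes the same control
scalar b_manual  = _b[s_hat]
scalar se_manual = _se[s_hat]

* One command, correct standard errors
ivregress 2sls earnings x (yrs_schooling = z)
display "coef: manual = " b_manual "  ivregress = " _b[yrs_schooling]
display "s.e.: manual = " se_manual "  ivregress = " _se[yrs_schooling]
```

The by-hand second stage computes its residuals with S-hat in place of S, which is why its standard errors differ from the ones `ivregress` reports.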
Forbidden Regressions
• Imagine the endogenous variable is a dummy. It is FORBIDDEN to get the predicted value for this dummy using a probit model and to plug this into the second stage equation
o WHY? Only OLS is guaranteed to produce first stage residuals which are uncorrelated with fitted values.
• If you want to examine the effect of schooling on earnings but believe the relationship is nonlinear, include S and S². But...
o Treat S and S² as two endogenous variables, so you need two instruments (the square of the original instrument is fine)
Another Example of an IV
• Draft lottery numbers
• During the Vietnam War there was a lottery for who would be required to join the army (“the draft”).
• Basically all men were given a random number based on birthday. If the number was high: not drafted. If low: you were drafted (i.e. you were required to do military service)
• Angrist uses this to examine the relationship between military service and earnings
What does IV tell us?
Notice that the IV estimator is the ratio of the change in Y due to a change in Z to the change in S due to a change in Z
$\rho = \dfrac{E[Y_i \mid Z_i = 1] - E[Y_i \mid Z_i = 0]}{E[S_i \mid Z_i = 1] - E[S_i \mid Z_i = 0]}$
• You can see that easily from the Wald estimator
• Let’s go back to the date of birth and compulsory schooling example
What does IV tell us?
• Suppose you have two types of people in the population:
o "ambitious" and "non-ambitious" people.
o Distributed evenly in the population and across AK birth groups.
o Ambitious people get more years of education.
• Who is going to respond to the "treatment" in this setting?
o “Aug 31 and before” + ambitious will get educated anyway
o “Aug 31 and before” + non-ambitious are forced to get an additional year of schooling
o “Sept 2 and after” + ambitious will get educated anyway (so doesn’t respond to treatment)
o “Sept 2 and after” + non-ambitious will drop out asap
o So really the estimates are driven by “Aug 31 and before” + non-ambitious people
• → IV produces a "LOCAL AVERAGE TREATMENT EFFECT", not necessarily the same as the treatment effect on the whole population.
Assumptions Need to Make to
Interpret IV Estimate as LATE
1. Independence: Instrument as good as randomly assigned (eg:
random draft)
2. Exclusion: Instrument only affects outcome via the
endogenous right hand side variable (no other mechanism)
3. First stage: Instrument has an effect on endogenous variable
4. Monotonicity: Instrument either has no effect or moves everyone it affects in the same direction.
• If we don’t have monotonicity, the instrument pushes some people into treatment while pushing other people out of treatment
We’ll get to what a local average treatment effect (LATE) is in a
second, but first, a bit more on the assumptions...
Useful Notation
$Y_i(d, z)$: Potential outcome of person i were this person to have treatment d and IV z.
Causal effect of veteran status (serving in the army): $Y_i(1, z_i) - Y_i(0, z_i)$
Causal effect of draft eligibility: $Y_i(D_i, 1) - Y_i(D_i, 0)$
$D_{1i}$: Whether person i joins the military given z=1 (draft eligible, low number)
$D_{0i}$: Whether person i joins the military given z=0 (draft ineligible, high number)
What is Observed?
$D_i = D_{0i} + (D_{1i} - D_{0i}) z_i = \pi_0 + \pi_{1i} z_i + \xi_i$
$\pi_0 \equiv E[D_{0i}], \quad \pi_{1i} \equiv D_{1i} - D_{0i}$
For any individual, we only see one potential treatment: D1i or D0i (but not both), i.e. you don’t see whether person i would have joined the military if he had gotten a different draft number (z)
$E[\pi_{1i}]$: Average causal effect of $z_i$ on $D_i$.
LATE Theorem
Suppose independence, exclusion, first stage, and monotonicity hold. Then an instrument can be used to estimate the average causal effect on the affected group.
$\dfrac{E[Y_i \mid z_i = 1] - E[Y_i \mid z_i = 0]}{E[D_i \mid z_i = 1] - E[D_i \mid z_i = 0]} = E[Y_{1i} - Y_{0i} \mid D_{1i} > D_{0i}] = E[\rho_i \mid \pi_{1i} > 0]$
Examples:
o IV estimate of the effect of military service on earnings gives us the causal effect of military service on earnings for men who only served because they were drafted (wouldn’t have served otherwise)
o IV estimate of schooling on earnings (when the IV is birth month) gives us the causal effect of schooling on earnings for people who stayed in school a few extra months before dropping out because they started school younger
LATE Theorem
• Compliers: Get treated if z=1 and don’t get treated if z=0: D1i=1 and D0i=0
• Always-takers: Always get treated: D1i=D0i=1
• Never-takers: Never get treated: D1i=D0i=0
LATE is the effect of treatment on compliers.
Analogy: We want to know effect of medicine on health in a
randomized trial. Some people always take medicine and some
never take medicine. IV will only tell us effect of medicine on
compliers.
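The claim that IV recovers the effect on compliers can be illustrated with a simulation in which the three types have different treatment effects. Everything in this do-file sketch is hypothetical: the type shares (20% always-takers, 50% compliers, 30% never-takers), the effect sizes (5 for always-takers, 2 for compliers), and the seed are assumptions chosen for illustration.

```stata
* Hypothetical simulation: three latent types with different treatment effects
clear
set seed 7
set obs 100000
generate u = runiform()
generate byte always   = u < 0.2                 // always-takers, effect = 5
generate byte never    = u >= 0.7                // never-takers, never treated
generate byte complier = !always & !never        // compliers, effect = 2
generate byte z = runiform() < 0.5               // randomly assigned instrument
generate byte d = always | (complier & z)        // treatment actually received
generate effect = 5*always + 2*complier
generate y = 10 + effect*d + rnormal()

ivregress 2sls y (d = z)       // estimate should be close to 2, the complier effect
summarize effect if d == 1     // average effect among the treated is larger
```

The treated group mixes always-takers with compliers, so the average effect among the treated sits between 2 and 5 while the IV estimate stays near the complier effect.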
Treatment Effect on the Treated
• Average causal effect on compliers ≠ treatment effect on the
treated
o The treated consist of compliers (with z=1) + always takers but the always takers may
have a different effect than compliers.
• Examples: people who take medicine no matter what may be those that benefit most
from medicine. People who complete 12 years of schooling regardless of whether
they’re forced to be in school may benefit most from school.
o Effect of treatment on the treated is weighted average of effects on compliers and always
takers (people who would go to military regardless)
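$E[Y_{1i} - Y_{0i} \mid D_i = 1]$ (effect on treated, i.e. people who serve)
$= E[Y_{1i} - Y_{0i} \mid D_{0i} = 1]\, P[D_{0i} = 1 \mid D_i = 1]$ (effect on always-takers, weighted by their share of the treated)
$+\; E[Y_{1i} - Y_{0i} \mid D_{1i} > D_{0i}]\, P[D_{1i} > D_{0i}, z_i = 1 \mid D_i = 1]$ (effect on compliers, weighted by their share of the treated)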
Average Treatment Effect
• Unconditional average treatment effect is weighted average of
effect on compliers, always-takers, and never-takers
IV in Randomized Trials
• Imagine randomized trial where no one in control group has
access to intervention but participation voluntary among those
assigned to treatment
• Can’t simply compare those who got the treatment to those who didn’t because of self-selection (among those offered treatment) into who gets treated. Usually positive selection (those who take the medicine are probably healthier people).
• However, IV solves the compliance problem and estimates the effect of treatment (taking the drug) on the treated (those who actually take the drug)
Impact of Training Program
Earnings as dependent variable (men only)
Comparisons by Training Status (OLS) | Comparisons by Assignment Status (ITT) | Instrumental Variables
3970 | 1117 | 1825
Treatment: JTPA training program
Only 60% of those assigned to training actually received the training; 2% of those assigned to the control group received training
ITT = intention to treat; it measures the causal effect of being offered treatment. Because some of the people offered treatment didn’t receive it, ITT does not measure the causal effect of the treatment
IV: ITT divided by the difference in compliance rates (first stage) measures the effect of treatment on the people who actually get treated. In general, this is LATE, but because there are practically no always-takers, LATE = treatment effect on treated
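As a rough check on the three estimates, the IV figure should equal the ITT figure divided by the first stage. The compliance rates quoted on the slide are rounded, so the arithmetic below reproduces the reported IV estimate only approximately; the "implied first stage" is computed here from the two reported estimates and is not a number given in the source.

```latex
\begin{align*}
\text{IV} &= \frac{\text{ITT}}{\text{first stage}} \\
\text{implied first stage} &= \frac{1117}{1825} \approx 0.61 \\
\text{first stage from rounded rates} &= 0.60 - 0.02 = 0.58,
 \qquad \frac{1117}{0.58} \approx 1926
\end{align*}
```

Either way, the IV estimate is well below the OLS comparison by training status (3970), consistent with positive selection into training.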
IV in Fuzzy RD
• In many applications of RD, we have imperfect compliance
across the discontinuity.
• We can use whether you were above or below the threshold as an instrument for program take-up (in this case completing secondary school)
Compliers
• Different (valid) IVs for same causal relationship can estimate
different things
• Effect of schooling on earnings
o Quarter of birth IVs and compulsory schooling IVs affect same people (potential high
school dropouts) and so should have similar estimates
o Proximity of college would impact different group of people.
o If same results for both, might conclude homogeneous effects of schooling …suggestive
of external validity
• Effect of family size on children’s education
o IV for family size using sex ratio
o IV using twins
o These should generate different complier populations. Since we get similar results (no effect of family size), we might conclude that there really is no effect for anybody (at least in Israel)
Characterizing Compliers
• Of course we can’t see in the data who are the compliers vs. always-takers vs. never-takers (we don’t know what they would’ve done with a different z)
• But we can examine the characteristics of compliers
o Example: Relative likelihood that complier is college grad = first stage of college grads /
first stage of all others
o In studies of effect of family size on kids education,
• Twins compliers are more likely to be older (younger women probably would have
had an additional child even without having had twins)
• Twins compliers more educated while sex ratio compliers less educated