Munich Lecture Series 2
Non-linear panel data models: Binary response
and ordered choice models and bias-corrected
fixed effects models
Stefanie Schurer
[email protected]
RMIT University
School of Economics, Finance, and Marketing
January 29, 2014
1 / 48
Overview
1. A brief review of binary response models and the maximum likelihood principle;
2. Binary response models with unobserved heterogeneity:
   1. Random effects approaches;
   2. Chamberlain's (1980) conditional logit fixed effects model.
3. Extension to ordered choice data (Ferrer-i-Carbonell and Frijters, 2004; Jones and Schurer, 2011);
4. Bias-corrected fixed effects models: very hard! (For a good review: Arellano and Hahn, 2007) + application: Carro and Traferri (2013).
2 / 48
References for Lecture 2
1. Greene, W.H. (2011). Econometric Analysis. Pearson Education Limited. pp. 756-771;
2. Hsiao, C. (2003). Analysis of Panel Data. Econometric Society Monographs. Cambridge University Press: New York. pp. 188-202;
3. Verbeek, M. (2000). A Guide to Modern Econometrics. Wiley. pp. 177-182; 151-160; 336-340;
4. Jones, A.M., Schurer, S. (2011). How does heterogeneity shape the socioeconomic gradient in health satisfaction? Journal of Applied Econometrics 26(4); 549-579;
5. Carro, J., Traferri, A. (2012). State dependence and heterogeneity in health using a bias-corrected fixed effects estimator. Journal of Applied Econometrics. In press;
6. Arellano, M., Hahn, J. (2007). Understanding bias in nonlinear panel models: some recent developments. In: Advances in Economics and Econometrics, Theory and Applications, Ninth World Congress, Vol. 3. Blundell, R., Newey, W., Persson, T. (eds.), CUP, UK; 381-409.
3 / 48
1. A brief review of binary response models
and the maximum likelihood principle
4 / 48
Review of binary response models
Many data applications in health economics involve an outcome variable Y
that takes a discrete set of values representing different
states. For example:
• HEALTH PRODUCTION: Being in good or excellent health:
Yi = 1 if individual i in time period t reports a health status
of e.g. 4 (if the Likert scale is increasing in good health and
bound between 1 and 5), Yi = 0 otherwise;
• HEALTH BEHAVIOR: Smoking or exercising: Yi = 1 if
individual i is a smoker/is exercising, Yi = 0 otherwise;
• HEALTH CARE DEMAND: Visiting a family physician or
specialist: Yi = 1 if individual i consulted a doctor, Yi = 0
otherwise.
A natural extension of binary response models is the ordered choice
model, i.e. when Yi = 0, 1, 2, . . . , J, and where the outcomes have a
natural ordering (Yi = 0 < Yi = 1). These models are more difficult to estimate.
5 / 48
Assume we observe a random sample of N observations (Yi , Xi ) from
the population, where Yi = 0 or Yi = 1 is a binary response
variable, and Xi is a vector of covariates. We are interested in
trying to model/understand the relationship between Yi and Xi ,
and, in particular, we believe that P(Yi = 1) is some function of Xi
that we wish to parameterise.
The usual way economists approach this is in terms of a latent
variable Yi∗ that is a linear function of the observed covariates Xi
plus an unobserved error term. This latent (continuous) variable is
not observed, but we observe a binary variable that takes the value
1 if Yi∗ > 0 and 0 otherwise.
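This latent-variable mechanism is easy to simulate. Below is a minimal sketch with made-up parameter values (β0 = 0.5, β1 = 1), drawing logistic errors by inverse-CDF sampling; the function name and data are illustrative, not from the lecture.

```python
import math, random

random.seed(0)

def simulate_logit(n, beta0, beta1):
    """Draw (Y_i, X_i) from the latent-variable model Y* = b0 + b1*X + eps,
    with eps ~ standard logistic and Y = 1(Y* > 0)."""
    data = []
    for _ in range(n):
        x = random.gauss(0.0, 1.0)
        # inverse-CDF draw from the standard logistic distribution
        u = random.random()
        eps = math.log(u / (1.0 - u))
        y = 1 if beta0 + beta1 * x + eps > 0 else 0
        data.append((y, x))
    return data

data = simulate_logit(5000, 0.5, 1.0)
share_ones = sum(y for y, _ in data) / len(data)
```

With a positive intercept, the simulated share of ones ends up above one half, as the latent-index logic predicts.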
6 / 48
Model set up
More formally this is:

Yi∗ = Xi′β + εi ,    (1)

and

Yi = 1(Yi∗ > 0),    (2)

where the threshold is implicitly built into the intercept term in
this model, and the notation 1(.) is an indicator function that
takes the value 1 if the statement in parentheses is true and 0
otherwise. Combining Eqs. 1 and 2:

Yi = 1(Xi′β + εi > 0).    (3)
7 / 48
Model set up
The form of Equation 1 looks similar to the linear modelling
approach. In particular, we are still assuming a linear relationship
between the latent variable Yi∗ and the regressors of the model Xi .
The only difference is that we do not observe whether or not Yi∗ is
positive. This means that we can only meaningfully consider
discussing the probability that Yi∗ is positive conditional on the
vector of covariates, i.e. P(Yi = 1|Xi ). From Eq. 3 we have:
P(Yi = 1|Xi ) = P(εi > −Xi′β).    (4)

The marginal probability associated with an observation is:

P(Yi |Xi ) = ∫_{Li}^{Ui} f(εi ) dεi ,    (5)

where (Li , Ui ) = (−∞, −Xi′β) if Yi = 0 and (−Xi′β, +∞) if
Yi = 1.
8 / 48
Maximum Likelihood Principle
To evaluate this probability, and hence make the model
operational, requires distributional assumptions regarding the error
term εi . Formally, suppose θ is a vector of parameters that
specifies the model; θ will include β and any other parameters that
characterise the distribution of εi .
fi (Yi |Xi , θ) = P(Yi = 1|Xi , θ)^{Yi} [1 − P(Yi = 1|Xi , θ)]^{1−Yi},    (6)

and θ is estimated by maximising the log-likelihood function:

L(θ|Y , X ) = ∏_{i=1}^N fi (Yi |Xi , θ)

log L(θ|Y , X ) = ∑_{i=1}^N log fi (Yi |Xi , θ)
              = ∑_{i=1}^N {Yi log P(Yi = 1|Xi , θ) + (1 − Yi ) log(1 − P(Yi = 1|Xi , θ))}.
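The log-likelihood above is straightforward to evaluate numerically. A minimal sketch for a logit choice probability, with a hypothetical intercept/slope parameterisation and toy data:

```python
import math

def loglik_logit(beta0, beta1, data):
    """Sum of Y_i log P_i + (1 - Y_i) log(1 - P_i),
    with P_i = Lambda(b0 + b1 * x_i) the logistic CDF."""
    ll = 0.0
    for y, x in data:
        p = 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))
        ll += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return ll

# toy data: (Y_i, X_i) pairs, made up for illustration
data = [(1, 0.5), (0, -1.2), (1, 2.0), (0, 0.1)]
ll = loglik_logit(0.0, 1.0, data)
```

A quick sanity check: parameter values that line up with the data (positive slope here) give a higher log-likelihood than ones that do not.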
9 / 48
The maximum likelihood principle
Let P(Yi ) be some cumulative distribution function F and θ = β;
then the first-order condition is:

∂logL/∂β = ∑_{i=1}^N [(Yi − F(Xi′β)) / (F(Xi′β)[1 − F(Xi′β)])] F′(Xi′β) Xi = 0.    (7)
The second-order derivative is:

∂²logL/∂β∂β′ = − ∑_{i=1}^N {Yi /F(Xi′β)² + (1 − Yi )/[1 − F(Xi′β)]²} F′(Xi′β)² Xi Xi′
             + ∑_{i=1}^N {(Yi − F(Xi′β)) / (F(Xi′β)[1 − F(Xi′β)])} F′′(Xi′β) Xi Xi′.
10 / 48
The maximum likelihood principle
If the likelihood function is concave, then the Newton-Raphson
method can be used to find the MLE of β. One issue is how to
choose the initial values β⁰:

β̂ʲ = β̂ʲ⁻¹ − [∂²logL/∂β∂β′]⁻¹ [∂logL/∂β], both evaluated at β = β̂ʲ⁻¹,    (8)

where β̂ʲ⁻¹ denotes the (j − 1)th iterative solution.
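Equation 8 can be sketched for a one-regressor logit without intercept, where the score and Hessian take the simple forms ∑(Yi − Λ(Xi β))Xi and −∑Λ(1 − Λ)Xi². The toy data below are hypothetical:

```python
import math

def newton_logit(data, beta=0.0, tol=1e-10, max_iter=50):
    """Newton-Raphson for a one-regressor logit (no intercept):
    beta_j = beta_{j-1} - H^{-1} g, with g the score and H the Hessian."""
    for _ in range(max_iter):
        g = 0.0  # score dlogL/dbeta
        h = 0.0  # Hessian d2logL/dbeta2 (negative definite for the logit)
        for y, x in data:
            p = 1.0 / (1.0 + math.exp(-beta * x))
            g += (y - p) * x
            h -= p * (1.0 - p) * x * x
        step = g / h
        beta -= step
        if abs(step) < tol:
            break
    return beta

# made-up, non-separable data so the MLE is finite
data = [(1, 0.5), (0, -1.0), (1, 1.5), (0, 0.8), (1, -0.3), (0, 0.2)]
beta_hat = newton_logit(data)
```

Because the logit log-likelihood is globally concave, the iteration converges from the natural starting value β⁰ = 0.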
11 / 48
The logit model
This model assumes that the probability function in Eq. 6 has the
following form:

P(Yi = 1|Xi ) = exp(Xi′β) / (1 + exp(Xi′β)).    (9)

Alternatively, this expression can be derived by assuming either:
• The errors are logistically distributed: εi ∼ Λ(0, π²/3)
(a bell-shaped distribution), or
• The log-odds are linear:

log[P(Yi = 1|Xi ) / (1 − P(Yi = 1|Xi ))] = Xi′β.    (10)
12 / 48
The logit model: pros and cons
• The first order condition associated with maximising the
log-likelihood function of Eq. 9 has a closed-form solution.
• The logit specification has some useful properties that enable
unobserved heterogeneity to be controlled for using panel
data.
13 / 48
The probit model
Assumes that the errors are normally distributed, i.e.
εi ∼ N(0, σ²), giving:

P(Yi = 1|Xi ) = Φ(Xi′β/σ),    (11)

where Φ is the cumulative distribution function of the standard
normal distribution. As parameterised, β and σ are not separately
identified due to the absence of scale in the outcome variable. For
this reason we normalise σ = 1.
One of the crucial problems associated with the probit model is
that the first-order condition from maximising the log-likelihood
using Eq. 11 does not have a closed-form solution. Hence,
estimating the probit model is computationally more demanding.
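The point can be illustrated numerically: the probit choice probability requires Φ (obtained here via `math.erf`), and the maximiser must be found by numerical search. A crude grid-search sketch on hypothetical data:

```python
import math

def phi_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def loglik_probit(beta, data):
    ll = 0.0
    for y, x in data:
        p = phi_cdf(beta * x)
        p = min(max(p, 1e-12), 1 - 1e-12)  # guard against log(0)
        ll += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return ll

# made-up data; no closed-form FOC, so maximise by brute-force grid search
data = [(1, 0.5), (0, -1.0), (1, 1.5), (0, 0.8), (1, -0.3), (0, 0.2)]
beta_hat = max((b / 100.0 for b in range(-300, 301)),
               key=lambda b: loglik_probit(b, data))
```

In practice one would use Newton-type steps as in Eq. 8 rather than a grid; the grid only makes the "no closed form" point concrete.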
14 / 48
2. Binary response models with unobserved
heterogeneity
15 / 48
Binary choice models with unobserved
heterogeneity
We begin with a simple panel data extension of a cross-sectional
model as outlined in Eq. 3 to allow for fixed unobserved
heterogeneity
Yit = 1(Xit′β + εit > 0),    (12)
for i = 1, . . . , N and t = 1, . . . , T . We assume that εit is iid.
16 / 48
Binary choice models with unobserved
heterogeneity
There are two generic approaches:
1. Random effects estimation: possible only under very strong
assumptions about the unobserved heterogeneity, unless one
imposes some relationship between the unobserved
heterogeneity and the regressors of the model (remember the
Mundlak/Chamberlain solution);
2. Fixed effects estimation: suffers from a phenomenon called the
incidental parameter problem; only a restricted model can be
estimated.
17 / 48
2.a. Random effects approaches
18 / 48
Random effects models
This model specifies the error term as:

εit = αi + uit ,    (13)
where αi and uit are random variables with:
• E[uit |X ] = 0, COV[uit , ujs |X ] = Var[uit |X ] if i = j and t = s; 0
otherwise;
• E[αi |X ] = 0, COV[αi , αj |X ] = Var[αi |X ] = σα² if i = j; 0
otherwise;
• COV[uit , αj ] = 0 for all i , t, j.
Here X captures all exogenous variables.
19 / 48
Random effects models
From these assumptions it follows that (normalising σu² = 1):
• E[εit |X ] = 0
• Var[εit |X ] = σu² + σα² = 1 + σα²
• Corr[εit , εis |X ] = ρ = σα² / (1 + σα²)
20 / 48
Random effects models
The contribution of each individual i to the likelihood is the joint
probability for all T observations (read Greene 2012, p. 759):

Li = P(Yi1 , . . . , YiT |X ) = ∫_{Li,T}^{Ui,T} . . . ∫_{Li,1}^{Ui,1} f(εi1 , . . . , εiT ) dεi1 . . . dεiT.    (14)

We can obtain the joint density of the εit's by integrating αi
out of the joint density, i.e.

f(εi1 , . . . , εiT , αi ) = f(εi1 , . . . , εiT |αi ) f(αi )    (15)

or

f(εi1 , . . . , εiT ) = ∫_{−∞}^{+∞} f(εi1 , . . . , εiT |αi ) f(αi ) dαi.    (16)
21 / 48
Random effects models
Using Eq. 16 and changing the order of integration (conditional
on αi , the εit's are independent), we get:

Li = P[Yi1 , . . . , YiT |X ] = ∫_{−∞}^{+∞} [∏_{t=1}^T ∫_{Lit}^{Uit} f(εit |αi ) dεit] f(αi ) dαi.    (17)

More generally:

Li = P[Yi1 , . . . , YiT |X ] = ∫_{−∞}^{+∞} [∏_{t=1}^T Prob(Yit = yit |Xit′β + αi )] f(αi ) dαi.    (18)
22 / 48
Random effects models
The inner probability can be probit or logit (or any other you can
think of). We can do the outer integration with Butler and
Moffitt's method, assuming that αi ∼ N. Their method uses
Gauss-Hermite quadrature to approximate the integral. (Please read
p. 622 in Greene (2012) for more detail on this method.) This
approach is often criticised for its assumption of equal correlation
across time periods, but it can be estimated efficiently even with
large T.
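A sketch of the Butler-Moffitt idea for a random-effects logit, assuming αi ∼ N(0, σα²) and using numpy's Gauss-Hermite nodes; the data and parameter values below are hypothetical:

```python
import math
import numpy as np

def lambda_(z):
    return 1.0 / (1.0 + math.exp(-z))

def re_logit_contribution(y_seq, x_seq, beta, sigma_a, nodes=12):
    """Individual likelihood contribution for a random-effects logit,
    integrating alpha_i ~ N(0, sigma_a^2) out by Gauss-Hermite quadrature."""
    v, w = np.polynomial.hermite.hermgauss(nodes)
    total = 0.0
    for vk, wk in zip(v, w):
        a = math.sqrt(2.0) * sigma_a * vk  # change of variables for N(0, s^2)
        prod = 1.0
        for y, x in zip(y_seq, x_seq):
            p = lambda_(x * beta + a)
            prod *= p if y == 1 else (1.0 - p)
        total += wk * prod
    return total / math.sqrt(math.pi)

Li = re_logit_contribution([1, 0, 1], [0.5, -0.2, 1.0], beta=1.0, sigma_a=0.8)
```

As a check on the quadrature, setting σα = 0 collapses the contribution to the plain product of independent logit probabilities.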
23 / 48
Random effects models
Alternatively, one can use simulated maximum likelihood, which is
based on an expectation and is more flexible:

Li = E_{αi}[∏_{t=1}^T Prob(Yit = yit |Xit′β + αi )].    (19)

This expectation can be approximated by simulation. We won't
get into the details, but a sample of person-specific draws of αi
from its population distribution can be generated with a random
number generator.
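The same individual contribution can be approximated by simulation, replacing the quadrature with Monte Carlo draws of αi. A sketch with hypothetical values (in practice one would use better-behaved draws, e.g. Halton sequences):

```python
import math, random

random.seed(1)

def lambda_(z):
    return 1.0 / (1.0 + math.exp(-z))

def re_logit_simulated(y_seq, x_seq, beta, sigma_a, draws=20000):
    """Approximate E_alpha[ prod_t Prob(Y_it | X_it'b + alpha) ] (Eq. 19)
    by averaging over random draws of alpha ~ N(0, sigma_a^2)."""
    total = 0.0
    for _ in range(draws):
        a = random.gauss(0.0, sigma_a)
        prod = 1.0
        for y, x in zip(y_seq, x_seq):
            p = lambda_(x * beta + a)
            prod *= p if y == 1 else (1.0 - p)
        total += prod
    return total / draws

Li = re_logit_simulated([1, 0, 1], [0.5, -0.2, 1.0], beta=1.0, sigma_a=0.8)
```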
24 / 48
Random effects models
• One advantage of random effects models is that one can
construct average partial effects in the presence of unobserved
heterogeneity. See Wooldridge (2009) for an extensive
discussion about this.
• The assumption of independence between the unobserved
heterogeneity and the regressors of the model is difficult to
justify - one can use the Mundlak approach to impose some
structure on the relationship between αi and Xit .
25 / 48
2.b. Chamberlain’s (1980) conditional fixed
effects logit model
26 / 48
Fixed effects models
Assume the following model:

Yit∗ = αi dit + Xit′β + uit ,    (20)

where Yit = 1 if Yit∗ > 0, and 0 otherwise. Here dit is a dummy
variable that takes the value one for individual i and zero otherwise.
Xit does not contain a constant. Hence, there are K regressors
and N individual constant terms. The log-likelihood function is:

lnL = ∑_{i=1}^N ∑_{t=1}^T lnP(Yit |αi + Xit′β),    (21)

where P(.) is the probability of the observed outcome (e.g.
Φ(qit (αi + Xit′β)) for the probit model or Λ(qit (αi + Xit′β)) for the
logit model, with qit = 2Yit − 1).
27 / 48
Fixed effects models
• In the linear regression model, we could use deviations from
the mean to get rid of the individual-specific heterogeneity.
This is no longer possible in the nonlinear case (except for some
special cases). Here, you would need to estimate the
sometimes large number of constant terms.
• The problem with estimating the large number of constant
terms is that the estimator relies on Ti increasing for the
constant term to be consistent. Usually, Ti are small, and
thus the estimates for αi are not consistent (they don’t
converge at all if Ti is fixed).
28 / 48
Fixed effects models
• Since the estimator of β is a function of the α's, the MLE
of β is not consistent either. This is the famous incidental
parameter problem (read Lancaster (2000), which will be on
the Blackboard).
• There is also a small-sample bias in β̂: Hsiao found
that the bias for Ti = 2 can be 100% (check Hsiao 2003, pp.
194-195 for the case of T = 2, one regressor, with values
Xi1 = 0 and Xi2 = 1); Heckman and MaCurdy (1980)
estimate that the bias is in the order of 10% for N = 100 and
T = 8.
29 / 48
The conditional fixed effects estimator
Chamberlain's (1980) conditional fixed effects estimator relies on
the notion of a sufficient statistic Ȳi for αi. Sufficiency means:

f(Yit |Xit , Ȳi , αi ) = f(Yit |Xit , Ȳi ).    (22)

Such a sufficient statistic has been shown to be available for some
distributions (e.g. the logit), but not for the probit. The fixed effects
binary logit model is:

Prob(Yit = 1|Xit ) = exp(αi + Xit′β) / (1 + exp(αi + Xit′β)).    (23)
30 / 48
The conditional fixed effects estimator
The unconditional likelihood for the NT observations is:

L = ∏_{i=1}^N ∏_{t=1}^T Fit^{Yit} (1 − Fit )^{1−Yit}.    (24)

Chamberlain (1980) used a result by Andersen (1970) to show
that the conditional log-likelihood is independent of the incidental
parameter αi:

L^c = ∏_{i=1}^N Prob(Yi1 = yi1 , Yi2 = yi2 , . . . , YiT = yiT | ∑_{t=1}^T yit ).    (25)
31 / 48
The conditional fixed effects estimator
The joint likelihood for each set of Ti observations, conditioned on
the number of ones in the set, is:

Prob(Yi1 = yi1 , Yi2 = yi2 , . . . , YiT = yiT | ∑_{t=1}^T yit , Xit )
  = exp(∑_{t=1}^T yit Xit′β) / ∑_{∑t dit = Si} exp(∑_{t=1}^T dit Xit′β).

The function in the denominator is summed over the set of all
(Ti choose Si) different sequences of Ti zeros and ones that have the
same sum as Si = ∑_{t=1}^{Ti} Yit.
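For small Ti the denominator can be enumerated directly. A sketch for a scalar regressor; note that αi has already cancelled, because every sequence in the sum shares ∑t dit = Si, so exp(αi Si) drops out of numerator and denominator alike (values hypothetical):

```python
import math
from itertools import product

def cond_prob(y_seq, x_seq, beta):
    """Chamberlain conditional probability of the observed 0/1 sequence,
    given its sum: exp(b * sum_t y_t x_t) over the sum of
    exp(b * sum_t d_t x_t) across all sequences d with the same number
    of ones. alpha_i cancels and does not appear."""
    T = len(y_seq)
    Si = sum(y_seq)
    num = math.exp(beta * sum(y * x for y, x in zip(y_seq, x_seq)))
    den = sum(math.exp(beta * sum(d * x for d, x in zip(dseq, x_seq)))
              for dseq in product((0, 1), repeat=T) if sum(dseq) == Si)
    return num / den

p = cond_prob([1, 0, 1], [0.5, -0.2, 1.0], beta=0.7)
```

By construction, the conditional probabilities of all sequences with the same number of ones sum to one.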
32 / 48
The conditional fixed effects estimator
Only observations for whom the dependent variable changes at
least once between 0 and 1 can be used.¹ Consider the case of
T = 2. In this case ∑_{s=1}^2 Yis = 0, 1, or 2.
• What if ∑s Yis = 0 or 2? In this case Yi1 and Yi2 are both
determined (they are either both 0 or both 1), so both observations
are uninformative about β and drop out of the
likelihood. In the case where ∑_{t=1}^T Yit = 2 or 0, the
conditional probability is 1, because then αi = +/−∞ (see Hsiao
(2003), p. 194). Note that α̂i = −β/2 if Yi1 + Yi2 = 1.
• What if ∑s Yis = 1? Then either (Yi1 , Yi2 ) = (1, 0) or (0, 1).
See next page for derivations.

¹ (In more technical terms this means that 0 < ∑_{t=1}^{Ti} Yit < Ti )
33 / 48
The conditional fixed effects estimator
Let Yi1 = 1 and Yi2 = 0. Then P(Yi1 = 1, Yi2 = 0 | Yi1 + Yi2 = 1, αi , Xi ) =

  [exp(αi + Xi1′β)/(1 + exp(αi + Xi1′β))] · [1 − exp(αi + Xi2′β)/(1 + exp(αi + Xi2′β))]
  divided by
  [exp(αi + Xi1′β)/(1 + exp(αi + Xi1′β))] · [1 − exp(αi + Xi2′β)/(1 + exp(αi + Xi2′β))]
    + [exp(αi + Xi2′β)/(1 + exp(αi + Xi2′β))] · [1 − exp(αi + Xi1′β)/(1 + exp(αi + Xi1′β))]

  = exp(αi + Xi1′β) / (exp(αi + Xi1′β) + exp(αi + Xi2′β)).
34 / 48
The conditional fixed effects estimator
For simplicity, let's use Λ as a short-cut for the logistic
function:

P(Yi1 = 0, Yi2 = 1 | Yi1 + Yi2 = 1, αi , Xi )
  = Λ(αi + Xi2′β)(1 − Λ(αi + Xi1′β)) / [Λ(αi + Xi1′β)(1 − Λ(αi + Xi2′β)) + Λ(αi + Xi2′β)(1 − Λ(αi + Xi1′β))]
  = Λ((Xi2 − Xi1 )′β).

Thus, Eq. 26 is in the form of a binary logit in which the
two outcomes are (0, 1) and (1, 0), with explanatory variables
(Xi2 − Xi1 ). The conditional log-likelihood is then:
35 / 48
The conditional fixed effects estimator
The conditional log-likelihood function is:

logL = ∑_{i∈B̃} {ωi log Λ[(Xi2 −Xi1 )′β] + (1−ωi ) log(1−Λ[(Xi2 −Xi1 )′β])},    (26)

where B̃ = {i |Yi1 + Yi2 = 1}, ωi = 1 if (Yi1 , Yi2 ) = (0, 1) and
ωi = 0 if (Yi1 , Yi2 ) = (1, 0).
What does this mean in practice? It means that we can use only
individuals who change status at least once within the observed
time periods. Only time-variant variables can be included in the set of
regressors, as the data is de facto first-differenced.
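The key algebraic step, that αi drops out of the conditional probability, can be checked numerically for T = 2 (all values below are hypothetical):

```python
import math

def lam(z):
    """Logistic CDF."""
    return 1.0 / (1.0 + math.exp(-z))

def cond_prob_01(alpha, x1, x2, beta):
    """P(Y1 = 0, Y2 = 1 | Y1 + Y2 = 1) from the full logit probabilities."""
    p1, p2 = lam(alpha + x1 * beta), lam(alpha + x2 * beta)
    return p2 * (1.0 - p1) / (p1 * (1.0 - p2) + p2 * (1.0 - p1))

alpha, x1, x2, beta = 1.7, 0.3, 1.1, 0.9
lhs = cond_prob_01(alpha, x1, x2, beta)
rhs = lam((x2 - x1) * beta)   # alpha_i has dropped out entirely
```

Changing αi leaves the conditional probability untouched, which is exactly why the fixed effect need not be estimated.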
36 / 48
The conditional fixed effects estimator
You can obtain the asymptotic covariance matrix for the conditional
MLE of β as N tends to infinity. Chamberlain (1980) has shown
that the inverse of the information matrix is the
asymptotic covariance matrix. Let di = 1 if Yi1 + Yi2 = 1 and 0
otherwise. Then:

∂²logL/∂β∂β′ = − ∑_i di F((Xi2 − Xi1 )′β)[1 − F((Xi2 − Xi1 )′β)](Xi2 − Xi1 )(Xi2 − Xi1 )′.

The information matrix is I = −E(∂²logL/∂β∂β′).
37 / 48
3. Extensions to Ordered Choice Data
38 / 48
Extensions
What can we do if we want to model ordered choice data, such as
health satisfaction or general health status?
• Ferrer-i-Carbonell and Frijters (2004): suggest finding an
individual-specific, efficient threshold at which the
ordered-choice variable is dichotomised; then apply the
Chamberlain CFE (in practice: use the individual-specific mean of the
dependent variable as the threshold).
• Jones and Schurer (2011): model two individual-specific
effects, one in the health outcome equation and one in each
threshold (in practice: dichotomise the ordered choice
variable at every possible cut-off k; estimate k equations
using the Chamberlain Conditional Fixed Effects approach;
heterogeneous parameter estimates are interpreted as
non-linearities in the effect of e.g. income on health).
39 / 48
Jones and Schurer (2011)
We want to model the socioeconomic gradient of health
satisfaction HSit∗ and allow for non-linearities in the effect of X on
health:

HSit∗ = αi + μ(β′Xit ) + uit.    (27)

We observe reported health as:

HSit = j if τi,j−1 < HSit∗ ≤ τi,j ,    (28)

where

τi,j = τi,j−1 + τ̃i,j ,    (29)

which means that each threshold depends on an individual-specific
effect τ̃i,j (which could reflect e.g. personality traits).
40 / 48
Jones and Schurer (2011)
This means that true health (when a particular
value for health is reported) is bounded by:

τi,j−1 < αi + μ(β′Xit ) + uit ≤ τi,j ,    (30)

which can be rearranged as:

−(αi − τi,j−1 ) − μ(β′Xit ) < uit ≤ −(αi − τi,j ) − μ(β′Xit ),    (31)

where αi and τi,j cannot be separately identified (let
τi,j − αi ≡ αi,j ). We then get:

Pitj = P(HSit = j|αij , Xit ) = F(αij − μ(β′Xit )) − F(αi,j−1 − μ(β′Xit )),    (32)

where F(.) is the logistic function.
41 / 48
In practice
1. Dichotomise HSit K − 1 times, where K is the number of
categories of the health satisfaction variable:

   HSitB1 = 1 if HSit > 1, 0 otherwise    (33)
   HSitB2 = 1 if HSit > 2, 0 otherwise    (34)
   HSitB3 = 1 if HSit > 3, 0 otherwise    (35)
   HSitB4 = 1 if HSit > 4, 0 otherwise    (36)

2. Estimate a Chamberlain Conditional Fixed Effects
Model K − 1 times, once for each binary variable;
3. Interpret heterogeneous coefficient estimates as non-linearities
in the effect of e.g. income on true health.
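Step 1 above amounts to a few lines of code. The health-satisfaction values below are made up; each resulting binary variable would then be passed to a conditional FE logit estimator (e.g. Stata's xtlogit with the fe option):

```python
# hypothetical 1-5 health satisfaction reports for one panel
hs_values = [1, 3, 5, 2, 4, 5, 1, 3]
K = 5  # number of categories

# K - 1 dichotomisations: HS_B{k} = 1 if HS > k, 0 otherwise (Eqs. 33-36)
binaries = {k: [1 if hs > k else 0 for hs in hs_values] for k in range(1, K)}
```

Each `binaries[k]` is one of the K − 1 dependent variables; heterogeneity of the estimated slopes across k is then read as non-linearity in the gradient.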
42 / 48
4. Bias-corrected fixed effects models: VERY
HARD!
43 / 48
Bias-corrected fixed effects estimators
A new literature is evolving that tries to solve the incidental
parameter problem by calculating and correcting for the bias. Most
applications exist for static and dynamic binary/ordered choice
estimators:
• Analytical or numerical bias correction of the fixed effects
estimator: Fernandez-Val (2009) - dynamic binary choice
model;
• Correct the bias in a moment equation, i.e. the expected score
function: Carro (2007) - dynamic binary choice model,
modified MLE;
• Correct the objective function, i.e. the concentrated
log-likelihood: Arellano and Hahn (2006), Bester and Hansen
(2009) - dynamic ordered probit.
All three approaches may yield different results, and their finite-sample
properties need to be assessed, but they all reduce the asymptotic
bias from order T⁻¹ to order T⁻² for a general class of models
(when the T dimension is not too small).
44 / 48
Correct the fixed effect estimator
Let yit ∼ N(α0i , σ0²). To obtain an estimate of σ² = θ:

logf(yit ; σ², αi ) = C − (1/2) log σ² − (yit − αi )²/(2σ²).    (37)

Then:

α̂i = (1/T) ∑_{t=1}^T yit ≡ ȳi    (38)

θ̂ = (1/NT) ∑_{i=1}^N ∑_{t=1}^T (yit − ȳi )².    (39)
45 / 48
Correct the fixed effect estimator
It can be shown, as N → ∞ with fixed T, that:

θ̂ = θ0 − (1/T) θ0 + op(1).    (40)

We are concerned about the term (1/T) θ0. Approach 1 corrects
this bias directly by using the correct degrees of freedom
(replace the denominator NT with N(T − 1)), which reduces the bias from
order 1/T to order 1/T². In general, one needs to find the formula for the bias
(in the limit) and then obtain a sample-analog estimate of it.
This approach does not depend on the log-likelihood
function. It requires transformations/derivations, possibly of
expectations, and usually does not produce an exact bias
correction.
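The degrees-of-freedom correction in this variance example can be verified by simulation: with T = 2 the fixed-effects MLE converges to θ0(T − 1)/T = θ0/2, while dividing by N(T − 1) removes the bias. A sketch with hypothetical parameter values:

```python
import random

random.seed(2)
N, T, sigma2 = 5000, 2, 1.0   # hypothetical panel sizes and true variance

sse = 0.0   # sum over i, t of (y_it - ybar_i)^2
for _ in range(N):
    alpha_i = random.gauss(0.0, 1.5)   # fixed effect; its scale is irrelevant
    y = [alpha_i + random.gauss(0.0, sigma2 ** 0.5) for _ in range(T)]
    ybar = sum(y) / T                  # alpha_i-hat = individual mean
    sse += sum((yit - ybar) ** 2 for yit in y)

theta_mle = sse / (N * T)              # Eq. 39: biased, tends to theta0/2 here
theta_corrected = sse / (N * (T - 1))  # degrees-of-freedom corrected
```

The uncorrected estimate sits near 0.5 rather than the true value 1, exactly the (1/T)θ0 term of Eq. 40.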
46 / 48
Correct moment conditions (use expected score)
In this case, one uses the expected fixed effects score function,
evaluated at θ0, i.e.

E[(1/T) ∑_{t=1}^T (∂/∂θ) logf(yit |θ0 , α̂i (θ0 ))] = (1/T) bi (θ0 ) + o(1/T).    (41)

This approach also requires the calculation of expectations. The
expression for the expected score is then used to construct a
moment condition, which is then adjusted (note: the score is
used to obtain the MLE when set equal to 0). This approach can
produce an exact bias correction (not only a 1/T² correction).
47 / 48
Correct the concentrated log likelihood
Alternatively, one can take the expectation of the log-likelihood,
such as:

E[(1/T) ∑_{t=1}^T logf(yit |θ0 , α̂i (θ0 )) − (1/T) ∑_{t=1}^T logf(yit |θ, ᾱi (θ))] = (1/T) βi (θ0 ) + o(1/T).    (42)

This approach is easier to compute and usually does not require
the calculation of expectations (e.g. Bester and Hansen (2009)'s
version of a dynamic ordered probit model; however, the approach does not
perform well in removing the bias for T < 13).
48 / 48