Munich Lecture Series 2
Non-linear panel data models: Binary response and ordered choice models and bias-corrected fixed effects models
Stefanie Schurer
[email protected]
RMIT University, School of Economics, Finance, and Marketing
January 29, 2014

Overview
1. A brief review of binary response models and the maximum likelihood principle;
2. Binary response models with unobserved heterogeneity:
   a. Random effects approaches;
   b. Chamberlain's (1980) conditional logit fixed effects model;
3. Extension to ordered choice data (Ferrer-i-Carbonell and Frijters, 2004; Jones and Schurer, 2011);
4. Bias-corrected fixed effects models: very hard! (For a good review: Arellano and Hahn, 2007.) Application: Carro and Traferri (2012).

References for Lecture 2
1. Greene, W.H. (2011). Econometric Analysis. Pearson Education Limited. pp. 756-771;
2. Hsiao, C. (2003). Analysis of Panel Data. Econometric Society Monographs. Cambridge University Press: New York. pp. 188-202;
3. Verbeek, M. (2000). A Guide to Modern Econometrics. Wiley. pp. 151-160, 177-182, 336-340;
4. Jones, A.M., Schurer, S. (2011). How does heterogeneity shape the socioeconomic gradient in health satisfaction? Journal of Applied Econometrics 26(4): 549-579;
5. Carro, J., Traferri, A. (2012). State dependence and heterogeneity in health using a bias-corrected fixed effects estimator. Journal of Applied Econometrics. In press;
6. Arellano, M., Hahn, J. (2007). Understanding bias in nonlinear panel models: some recent developments. In: Advances in Economics and Econometrics, Theory and Applications, Ninth World Congress, Vol. 3. Blundell, R., Newey, W., Persson, T. (eds.), CUP, UK; 381-409.

1. A brief review of binary response models and the maximum likelihood principle

Review of binary response models
Many data applications in health economics involve a focus on an outcome variable Y that can take on a discrete range of values which represent different state outcomes.
For example:
• HEALTH PRODUCTION: Being in good or excellent health: Yi = 1 if individual i reports a health status of e.g. 4 (if the Likert scale is increasing in good health and bound between 1 and 5), Yi = 0 otherwise;
• HEALTH BEHAVIOR: Smoking or exercising: Yi = 1 if individual i is a smoker/is exercising, Yi = 0 otherwise;
• HEALTH CARE DEMAND: Visiting a family physician or specialist: Yi = 1 if individual i consulted a doctor, Yi = 0 otherwise.
A natural extension of binary response models are ordered choice models, i.e. when Yi = 0, 1, 2, ..., J, and where there is a natural ordering in the sense that Yi = 1 represents "more" than Yi = 0. These are more difficult to model.

Assume we observe a random sample of N observations (Yi, Xi) from the population, where Yi = 0 or Yi = 1 is a binary response variable and Xi is a vector of covariates. We are interested in modelling the relationship between Yi and Xi; in particular, we believe that P(Yi = 1) is some function of Xi that we wish to parameterise. The usual way economists approach this is in terms of a latent variable Yi* that is a linear function of the observed covariates Xi plus an unobserved error term. This latent (continuous) variable is not observed; instead we observe a binary variable that takes the value 1 if Yi* > 0 and 0 otherwise.

Model set-up
More formally:
Yi* = Xi'β + εi,   (1)
Yi = 1(Yi* > 0),   (2)
where the threshold is implicitly built into the intercept term of the model, and 1(.) is an indicator function that takes the value 1 if the statement in parentheses is true and 0 otherwise. Combining Eqs. 1 and 2:
Yi = 1(Xi'β + εi > 0).   (3)

Model set-up
The form of Equation 1 looks similar to the linear modelling approach. In particular, we are still assuming a linear relationship between the latent variable Yi* and the regressors of the model Xi. The only difference is that we do not observe whether or not Yi* is positive.
This means that we can only meaningfully discuss the probability that Yi* is positive conditional on the vector of covariates, i.e. P(Yi = 1|Xi). From Eq. 3 we have:
P(Yi = 1|Xi) = P(εi > -Xi'β).   (4)
The marginal probability associated with an observation is:
P(Yi|Xi) = ∫_{Li}^{Ui} f(εi) dεi,   (5)
where (Li, Ui) = (-∞, -Xi'β) if Yi = 0 and (Li, Ui) = (-Xi'β, +∞) if Yi = 1.

Maximum likelihood principle
To evaluate this probability, and hence make the model operational, requires distributional assumptions regarding the error term εi. Formally, suppose θ is a vector of parameters that specifies the model; θ will include β and any other parameters that characterise the distribution of εi. The density of one observation is:
fi(Yi|Xi, θ) = P(Yi = 1|Xi, θ)^{Yi} [1 - P(Yi = 1|Xi, θ)]^{1-Yi},   (6)
and θ is estimated by maximising the log-likelihood function:
L(θ|Y, X) = ∏_{i=1}^{N} fi(Yi|Xi, θ)
log L(θ|Y, X) = Σ_{i=1}^{N} log fi(Yi|Xi, θ) = Σ_{i=1}^{N} {Yi log P(Yi = 1|Xi, θ) + (1 - Yi) log[1 - P(Yi = 1|Xi, θ)]}.

The maximum likelihood principle
Let P(Yi = 1|Xi) = F(Xi'β) for some cumulative distribution function F and let θ = β; then the first-order condition is:
∂logL/∂β = Σ_{i=1}^{N} [Yi - F(Xi'β)] / {F(Xi'β)[1 - F(Xi'β)]} F'(Xi'β) Xi = 0.   (7)
The second-order derivative is:
∂²logL/∂β∂β' = Σ_{i=1}^{N} { -[Yi/F(Xi'β)² + (1 - Yi)/[1 - F(Xi'β)]²] F'(Xi'β)² + [Yi - F(Xi'β)] / {F(Xi'β)[1 - F(Xi'β)]} F''(Xi'β) } Xi Xi'.

The maximum likelihood principle
If the likelihood function is concave, then the Newton-Raphson method can be used to find the MLE of β. One issue is how to choose the initial values β^0:
β̂^j = β̂^{j-1} - [∂²logL/∂β∂β']^{-1}|_{β=β̂^{j-1}} [∂logL/∂β]|_{β=β̂^{j-1}},   (8)
where β̂^{j-1} denotes the (j-1)th iterative solution.

The logit model
This model assumes that the probability function in Eq. 6 has the following form:
P(Yi = 1|Xi) = exp(Xi'β) /
[1 + exp(Xi'β)].   (9)
Alternatively, this expression can be derived by assuming either:
• The errors are logistically distributed: εi ~ Λ(0, π²/3) (a bell-shaped distribution), or
• The log-odds are linear:
log[ P(Yi = 1|Xi) / (1 - P(Yi = 1|Xi)) ] = Xi'β.   (10)

The logit model: pros and cons
• The first-order condition associated with maximising the log-likelihood under Eq. 9 takes a simple closed form, Σi [Yi - Λ(Xi'β)]Xi = 0, although it must still be solved numerically.
• The logit specification has some useful properties that enable unobserved heterogeneity to be controlled for using panel data.

The probit model
Assumes that the errors are normally distributed, i.e. εi ~ N(0, σ²), giving:
P(Yi = 1|Xi) = Φ(Xi'β/σ),   (11)
where Φ is the cumulative distribution function of the standard normal distribution. As parameterised, β and σ are not separately identified due to the absence of scale in the outcome variable. For this reason we normalise σ = 1. One of the crucial problems associated with the probit model is that the first-order condition from maximising the log-likelihood based on Eq. 11 does not simplify to a closed form. Hence, estimating the probit model is computationally more demanding.

2. Binary response models with unobserved heterogeneity

Binary choice models with unobserved heterogeneity
We begin with a simple panel data extension of the cross-sectional model in Eq. 3 to allow for fixed unobserved heterogeneity:
Yit = 1(Xit'β + εit > 0),   (12)
for i = 1, ..., N and t = 1, ..., T. We assume that εit is iid.

Binary choice models with unobserved heterogeneity
There are two generic approaches:
1. Random effects estimation: feasible only under very strong assumptions about the unobserved heterogeneity, unless one imposes some relationship between the unobserved heterogeneity and the regressors of the model (remember the Mundlak/Chamberlain solution);
2. Fixed effects estimation: suffers from a phenomenon called the incidental parameter problem; only a restricted model can be estimated.
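Before turning to the panel estimators, the Newton-Raphson iteration of Eq. 8 can be sketched for the logit case reviewed above, where the score of Eq. 7 simplifies to Σi (Yi - Λ(Xi'β))Xi. This is a minimal illustration on simulated data; the variable names and the data-generating process are our own, not from the lecture:

```python
import numpy as np

rng = np.random.default_rng(0)
N, beta_true = 1000, np.array([0.5, -1.0])

# Simulate the latent-variable model of Eq. 3: Y = 1(X'beta + eps > 0),
# with logistic errors so that the logit model is correctly specified.
X = np.column_stack([np.ones(N), rng.normal(size=N)])
eps = rng.logistic(size=N)
Y = (X @ beta_true + eps > 0).astype(float)

def logit_mle(X, Y, tol=1e-8, max_iter=50):
    """Newton-Raphson iteration of Eq. 8 for the logit log-likelihood."""
    beta = np.zeros(X.shape[1])          # starting values beta^0
    for _ in range(max_iter):
        p = 1 / (1 + np.exp(-X @ beta))  # Lambda(X'beta)
        score = X.T @ (Y - p)            # first-order condition, Eq. 7
        hess = -(X * (p * (1 - p))[:, None]).T @ X  # second derivative
        step = np.linalg.solve(hess, score)
        beta = beta - step               # beta^j = beta^{j-1} - H^{-1} score
        if np.max(np.abs(step)) < tol:
            break
    return beta

beta_hat = logit_mle(X, Y)
print(beta_hat)  # close to beta_true in large samples
```

Because the logit log-likelihood is globally concave, the iteration converges from essentially any starting value, which is why the choice of β^0 matters little here.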
2.a. Random effects approaches

Random effects models
This model specifies the error term as:
εit = αi + uit,   (13)
where αi and uit are random variables with:
• E[uit|X] = 0; Cov[uit, ujs|X] = Var[uit|X] = σu² if i = j and t = s, and 0 otherwise;
• E[αi|X] = 0; Cov[αi, αj|X] = Var[αi|X] = σα² if i = j, and 0 otherwise;
• Cov[uit, αj] = 0 for all i, t, j.
Here X captures all exogenous variables.

Random effects models
From these assumptions it follows that (normalising σu² = 1):
• E[εit|X] = 0
• Var[εit|X] = σu² + σα² = 1 + σα²
• Corr[εit, εis|X] = ρ = σα² / (1 + σα²) for t ≠ s.

Random effects models
The contribution of each individual i to the likelihood is the joint probability for all T observations (read Greene 2012, p. 759):
Li = P(Yi1, ..., YiT|X) = ∫_{Li,T}^{Ui,T} ... ∫_{Li,1}^{Ui,1} f(εi1, ..., εiT) dεi1 ... dεiT.   (14)
We can obtain the joint density of the εit's by integrating αi out of the joint density, i.e.
f(εi1, ..., εiT, αi) = f(εi1, ..., εiT|αi) f(αi)   (15)
or
f(εi1, ..., εiT) = ∫_{-∞}^{+∞} f(εi1, ..., εiT|αi) f(αi) dαi.   (16)

Random effects models
Using Eq. 16 and changing the order of integration, and noting that conditional on αi the εit's are independent, we get:
Li = P(Yi1, ..., YiT|X) = ∫_{-∞}^{+∞} [ ∏_{t=1}^{T} ∫_{Lit}^{Uit} f(εit|αi) dεit ] f(αi) dαi.   (17)
More generally:
Li = P(Yi1, ..., YiT|X) = ∫_{-∞}^{+∞} [ ∏_{t=1}^{T} Prob(Yit = yit|Xit'β + αi) ] f(αi) dαi.   (18)

Random effects models
The inner probability can be probit or logit (or any other you can think of). We can do the outer integration with Butler and Moffitt's method, assuming that αi is normally distributed. Their method uses Gauss-Hermite quadrature to approximate the integral (please read p. 622 in Greene (2012) for more detail on this method). This approach is often criticised for its assumption of equal correlation across time periods, but it can be estimated efficiently even with large T.
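The Butler-Moffitt outer integration of Eq. 18 can be sketched in a few lines. The example below uses a logit inner probability (the slides note the inner probability can be probit or logit) and Gauss-Hermite nodes to integrate out αi ~ N(0, σα²); the function and the simulated data are our own illustration, not the lecture's code:

```python
import numpy as np

def re_logit_loglik(beta, sigma_a, Y, X, n_quad=12):
    """Random-effects panel logit log-likelihood via Gauss-Hermite
    quadrature (Butler-Moffitt): approximates the integral over
    alpha_i ~ N(0, sigma_a^2) in Eq. 18.  Y: (N, T), X: (N, T, K)."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_quad)
    alphas = np.sqrt(2.0) * sigma_a * nodes  # change of variables
    ll = 0.0
    for y_i, x_i in zip(Y, X):
        idx = x_i @ beta                               # (T,)
        # P(y_it | alpha) at each quadrature node, shape (T, n_quad)
        p = 1 / (1 + np.exp(-(idx[:, None] + alphas[None, :])))
        p_obs = np.where(y_i[:, None] == 1, p, 1 - p)
        inner = p_obs.prod(axis=0)                     # Prod_t P(y_it|alpha)
        ll += np.log(inner @ weights / np.sqrt(np.pi)) # quadrature average
    return ll

# Simulated panel with a normally distributed individual effect
rng = np.random.default_rng(1)
N, T = 200, 4
X = rng.normal(size=(N, T, 1))
alpha = 0.8 * rng.normal(size=(N, 1))
Y = (rng.random((N, T)) < 1 / (1 + np.exp(-(X[..., 0] + alpha)))).astype(float)
ll = re_logit_loglik(np.array([1.0]), 0.8, Y, X)
print(ll)
```

In practice this log-likelihood would be passed to a numerical optimiser over (β, σα); a handful of quadrature nodes already approximates the integral well because the integrand is smooth.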
Random effects models
Alternatively, one can use simulated maximum likelihood, which is based on an expectation and is more flexible:
Li = E_{αi}[ ∏_{t=1}^{T} Prob(Yit = yit|Xit'β + αi) ].   (19)
This expectation can be approximated by simulation. We won't get into the details, but a sample of person-specific draws for αi can be generated with a random number generator.

Random effects models
• One advantage of random effects models is that one can construct average partial effects in the presence of unobserved heterogeneity. See Wooldridge (2009) for an extensive discussion.
• The assumption of independence between the unobserved heterogeneity and the regressors of the model is difficult to justify; one can use the Mundlak approach to impose some structure on the relationship between αi and Xit.

2.b. Chamberlain's (1980) conditional fixed effects logit model

Fixed effects models
Assume the following model:
Yit* = αi dit + Xit'β + uit,   (20)
where Yit = 1 if Yit* > 0, and 0 otherwise. Here dit is a dummy variable that takes the value 1 for individual i and 0 otherwise. Xit does not contain a constant. Hence, there are K regressors and N individual constant terms. The log-likelihood function is:
lnL = Σ_{i=1}^{N} Σ_{t=1}^{T} lnP(Yit|αi + Xit'β),   (21)
where P(.) is the probability of the observed outcome (e.g. Φ(qit(αi + Xit'β)) for the probit model or Λ(qit(αi + Xit'β)) for the logit model, with qit = 2Yit - 1).

Fixed effects models
• In the linear regression model, we could use deviations from the mean to get rid of the individual-specific heterogeneity. This is no longer possible in the non-linear case (except for some special cases). Instead, you would need to estimate the sometimes large number of constant terms.
• The problem with estimating the large number of constant terms is that the estimator relies on Ti increasing for the constant terms to be consistent.
Usually, Ti are small, and thus the estimates for αi are not consistent (they do not converge at all if Ti is fixed).

Fixed effects models
• Since the estimator of β is a function of α̂, the MLE of β is not consistent either. This is the famous incidental parameter problem (read Lancaster (2000), which will be on the Blackboard).
• There is also a small-sample bias in β̂: Hsiao (2003) found that the bias for Ti = 2 can be 100% (check Hsiao 2003, pp. 194-195 for the case of T = 2 with one regressor taking values Xi1 = 0 and Xi2 = 1); Heckman and MaCurdy (1980) estimate that the bias is in the order of 10% for N = 100 and T = 8.

The conditional fixed effects estimator
Chamberlain's (1980) conditional fixed effects estimator relies on the notion of a sufficient statistic Ȳi for αi, such that:
f(Yit|Xit, Ȳi, αi) = f(Yit|Xit, Ȳi).   (22)
Such a sufficient statistic has been shown to be available for some distributions (e.g. the logit), but not for the probit. The fixed effects binary logit model is:
Prob(Yit = 1|Xit) = exp(αi + Xit'β) / [1 + exp(αi + Xit'β)].   (23)

The conditional fixed effects estimator
The unconditional likelihood for the NT observations is:
L = ∏_{i=1}^{N} ∏_{t=1}^{T} Fit^{Yit} (1 - Fit)^{1-Yit}.   (24)
Chamberlain (1980) used a result by Andersen (1970) to show that the conditional log-likelihood is independent of the incidental parameter αi:
L^c = ∏_{i=1}^{N} Prob(Yi1 = yi1, Yi2 = yi2, ..., YiT = yiT | Σ_{t=1}^{T} yit).   (25)

The conditional fixed effects estimator
The joint likelihood for each set of Ti observations, conditioned on the number of ones in the set, is:
Prob(Yi1 = yi1, ..., YiT = yiT | Σ_{t=1}^{T} yit, Xit) = exp(Σ_{t=1}^{T} yit Xit'β) / Σ_{Σt dit = Si} exp(Σ_{t=1}^{T} dit Xit'β).
The function in the denominator is summed over the set of all (T choose Si) different sequences (di1, ..., diT) of zeros and ones that have the same sum as Si = Σ_{t=1}^{T} Yit.
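For small T, the conditional probability above can be evaluated directly by enumerating every 0/1 sequence with the same sum, which makes the cancellation of αi from numerator and denominator easy to verify. A sketch (our own illustration, not estimation code):

```python
import numpy as np
from itertools import product

def cond_logit_prob(y, X, beta):
    """Conditional probability Prob(Y_i = y | sum_t y_t, X_i) for the
    fixed-effects logit: alpha_i cancels, so it does not appear.
    y: (T,) array of 0/1 outcomes, X: (T, K) regressors."""
    s = y.sum()
    num = np.exp(y @ X @ beta)
    # denominator: all 0/1 sequences d of length T with the same sum S_i
    den = sum(np.exp(np.array(d) @ X @ beta)
              for d in product((0, 1), repeat=len(y)) if sum(d) == s)
    return num / den

beta = np.array([0.7])
X = np.array([[0.2], [1.5]])
p = cond_logit_prob(np.array([0, 1]), X, beta)
# For T = 2 this equals Lambda((X_i2 - X_i1)' beta), as derived for this model
lam = 1 / (1 + np.exp(-(X[1] - X[0]) @ beta))
print(p, lam)
```

The enumeration also shows why sequences with sum 0 or T are uninformative: the denominator then contains only the observed sequence, so the conditional probability is 1 regardless of β.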
The conditional fixed effects estimator
Only observations for whom the dependent variable changes at least once between 0 and 1 can be used.¹ Consider the case of T = 2. In this case Σs Yis = 0, 1 or 2.
• What if Σs Yis = 0 or 2? In this case Yi1 and Yi2 are both determined (they are either both 0 or both 1); both observations would be uninformative about β, as they drop out of the likelihood. In the case where Σt Yit = 2 or 0, the conditional probability is 1, because then αi = +/-∞ (see Hsiao (2003), p. 194). Note that α̂i = -β/2 if Yi1 + Yi2 = 1.
• What if Σs Yis = 1? Then either (Yi1, Yi2) = (1, 0) or (Yi1, Yi2) = (0, 1). See below for the derivations.
¹ In more technical terms this means that 0 < T^{-1} Σ_{t=1}^{T} Yit < 1.

The conditional fixed effects estimator
Let Yi1 = 1 and Yi2 = 0. Then:
P(Yi1 = 1, Yi2 = 0 | Σt Yit = 1, αi, Xi)
= [exp(αi + Xi1'β) / (1 + exp(αi + Xi1'β))] [1 - exp(αi + Xi2'β) / (1 + exp(αi + Xi2'β))]
  / { [exp(αi + Xi1'β) / (1 + exp(αi + Xi1'β))] [1 - exp(αi + Xi2'β) / (1 + exp(αi + Xi2'β))] + [exp(αi + Xi2'β) / (1 + exp(αi + Xi2'β))] [1 - exp(αi + Xi1'β) / (1 + exp(αi + Xi1'β))] }
= exp(αi + Xi1'β) / [exp(αi + Xi1'β) + exp(αi + Xi2'β)].

The conditional fixed effects estimator
For simplicity, let us use Λ as a short-cut for the logistic function:
P(Yi1 = 0, Yi2 = 1 | Σt Yit = 1, αi, Xi)
= Λ(αi + Xi2'β)[1 - Λ(αi + Xi1'β)] / { Λ(αi + Xi1'β)[1 - Λ(αi + Xi2'β)] + Λ(αi + Xi2'β)[1 - Λ(αi + Xi1'β)] }
= Λ((Xi2 - Xi1)'β).
Thus, Eq. 26 is in the form of a binary logit function in which the two outcomes are (0, 1) and (1, 0), with explanatory variables (Xi2 - Xi1).

The conditional fixed effects estimator
The conditional log-likelihood function is:
logL = Σ_{i∈B̃} {ωi log Λ[(Xi2 - Xi1)'β] + (1 - ωi) log(1 - Λ[(Xi2 - Xi1)'β])},   (26)
where B̃ = {i | Yi1 + Yi2 = 1}, ωi = 1 if (Yi1, Yi2) = (0, 1), and ωi = 0 if (Yi1, Yi2) = (1, 0). What does this mean in practice?
It means that we can use only individuals who change status at least once within the time periods observed. Only time-variant variables can be included in the set of regressors, as the data is de facto first-differenced.

The conditional fixed effects estimator
You can obtain the asymptotic covariance matrix of the conditional MLE of β as N tends to infinity. Chamberlain (1980) has shown that the inverse of the information matrix is the asymptotic covariance matrix. Let di = 1 if Yi1 + Yi2 = 1 and 0 otherwise. Then:
∂²logL/∂β∂β' = -Σi di F((Xi2 - Xi1)'β) [1 - F((Xi2 - Xi1)'β)] (Xi2 - Xi1)(Xi2 - Xi1)'.
The information matrix is I = -E(∂²logL/∂β∂β').

3. Extensions to ordered choice data

Extensions
What can we do if we want to model ordered choice data, such as health satisfaction or general health status?
• Ferrer-i-Carbonell and Frijters (2004): Suggest finding an individual-specific, efficient threshold according to which the ordered-choice variable is dichotomised; then apply the Chamberlain CFE (in practice: use the individual-specific mean of the dependent variable as the threshold).
• Jones and Schurer (2011): Model two individual-specific effects: one in the health outcome equation and one in each threshold (in practice: dichotomise the ordered choice variable at every possible cut-off k; estimate one equation per cut-off using the Chamberlain conditional fixed effects approach; heterogeneous parameter estimates are interpreted as non-linearities in the effect of e.g. income on health).

Jones and Schurer (2011)
We want to model the socioeconomic gradient of health satisfaction HSit* and allow for non-linearities in the effect of X on health:
HSit* = αi + μ(β'Xit) + uit.   (27)
We observe reported health as:
HSit = j if τi,j-1 < HSit* ≤ τi,j,   (28)
where
τi,j = τi,j-1 + τ̃i,j,   (29)
which means that each threshold depends on an individual-specific effect τ̃i,j (which could e.g. reflect personality traits).
Jones and Schurer (2011)
This means that true health status (when a particular value for health is reported) is bounded as:
τi,j-1 < αi + μ(β'Xit) + uit ≤ τi,j,   (30)
which can be rearranged to:
-(αi - τi,j-1) - μ(β'Xit) < uit ≤ -(αi - τi,j) - μ(β'Xit),   (31)
where αi and τi,j cannot be separately identified (let αi,j = τi,j - αi). We then get:
Pitj = P(HSit = j|αij, Xit) = F(αi,j - μ(β'Xit)) - F(αi,j-1 - μ(β'Xit)),   (32)
where F(.) is the logistic function.

In practice
1. Dichotomise HSit K - 1 times, where K is the number of categories of the health satisfaction variable (here K = 5):
HSit^{B1} = 1 if HSit > 1, 0 otherwise;   (33)
HSit^{B2} = 1 if HSit > 2, 0 otherwise;   (34)
HSit^{B3} = 1 if HSit > 3, 0 otherwise;   (35)
HSit^{B4} = 1 if HSit > 4, 0 otherwise.   (36)
2. Estimate a Chamberlain conditional fixed effects model K - 1 times, once for each binary variable;
3. Interpret heterogeneous coefficient estimates as non-linearities in the effect of e.g. income on true health.

4. Bias-corrected fixed effects models: VERY HARD!

Bias-corrected fixed effects estimators
A new literature is evolving that tries to solve the incidental parameter problem by calculating and correcting for the bias. Most applications exist for static and dynamic binary/ordered choice estimators:
• Analytical or numerical bias correction of the fixed effects estimator: Fernandez-Val (2009), dynamic binary choice model;
• Correcting the bias in a moment equation, i.e. the expected score function: Carro (2007), dynamic binary choice model, modified MLE;
• Correcting the objective function, i.e. the concentrated log-likelihood: Arellano and Hahn (2006), Bester and Hansen (2009), dynamic ordered probit.
All three approaches may yield different results and their finite-sample properties need to be assessed, but they all reduce the asymptotic bias from order T^{-1} to order T^{-2} for a general class of models (when the T dimension is not too small).
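Step 1 of the procedure above is mechanical. A minimal sketch of the dichotomisation (function and variable names are ours, for illustration only):

```python
import numpy as np

def dichotomise(HS, K):
    """Split an ordered variable HS with categories 1..K into K-1 binary
    variables HS_Bk = 1(HS > k), one per cut-off k = 1, ..., K-1
    (Eqs. 33-36 for K = 5)."""
    HS = np.asarray(HS)
    return {k: (HS > k).astype(int) for k in range(1, K)}

HS = np.array([1, 3, 5, 2, 4])   # hypothetical health satisfaction data
cuts = dichotomise(HS, K=5)
print(cuts[3])  # the binary variable 1(HS > 3)
```

Each of the K - 1 binary variables would then be passed, one at a time, to the Chamberlain conditional fixed effects logit of step 2.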
Correct the fixed effects estimator
Let yit ~ N(α0i, σ0²). To obtain an estimate of σ² = θ:
logf(yit; σ², αi) = C - (1/2) log σ² - (yit - αi)² / (2σ²).   (37)
Then:
α̂i = (1/T) Σ_{t=1}^{T} yit ≡ ȳi,   (38)
θ̂ = (1/(NT)) Σ_{i=1}^{N} Σ_{t=1}^{T} (yit - ȳi)².   (39)

Correct the fixed effects estimator
It can be shown, as N → ∞ with fixed T, that:
θ̂ = θ0 - (1/T)θ0 + op(1).   (40)
We are concerned about the bias term (1/T)θ0. Approach 1 would correct this bias directly by using the correct degrees of freedom (setting the denominator to N(T-1)), which reduces the bias from order 1/T to order 1/T². In general, one needs to find the formula for the bias (in the limit), and then obtain an estimate of its sample analogue. This approach does not depend on the log-likelihood function. It requires transformations/derivations, possibly of expectations, and usually does not produce an exact bias correction.

Correct moment conditions (use the expected score)
In this case, one uses the expected fixed effects score function, evaluated at θ0, i.e.:
E[(1/T) Σ_{t=1}^{T} (∂/∂θ) logf(yit|θ0, α̂i(θ0))] = (1/T) bi(θ0) + o(1/T).   (41)
This approach also requires the calculation of expectations. The expression for the expected score is then used to construct a moment condition, which is adjusted accordingly (note: the score, equated to 0, is what delivers the MLE). This approach can produce an exact bias correction (not only a 1/T² correction).

Correct the concentrated log-likelihood
Alternatively, one can take the expectation of the concentrated log-likelihood:
E[(1/T) Σ_{t=1}^{T} logf(yit|θ, α̂i(θ)) - (1/T) Σ_{t=1}^{T} logf(yit|θ, ᾱi(θ))] = (1/T) βi(θ) + o(1/T).   (42)
This approach is easier to compute and usually does not require the calculation of expectations (e.g. Bester and Hansen (2009)'s version of a dynamic ordered probit model, although the approach does not perform well in removing the bias for T < 13).
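The variance example of Eqs. 37-40 can be checked by simulation: the uncorrected MLE converges to θ0(1 - 1/T), while the degrees-of-freedom correction removes the bias. A small sketch (our own, with hypothetical sample sizes):

```python
import numpy as np

rng = np.random.default_rng(2)
N, T, theta0 = 20000, 3, 2.0

# y_it ~ N(alpha_0i, theta0); alpha_i is estimated by the individual mean
y = rng.normal(loc=rng.normal(size=(N, 1)), scale=np.sqrt(theta0), size=(N, T))
ybar = y.mean(axis=1, keepdims=True)          # alpha_hat_i = ybar_i, Eq. 38

theta_mle = ((y - ybar) ** 2).sum() / (N * T)        # Eq. 39: biased
theta_bc = ((y - ybar) ** 2).sum() / (N * (T - 1))   # degrees-of-freedom fix

print(theta_mle)  # approximately theta0 * (1 - 1/T), here well below theta0
print(theta_bc)   # approximately theta0
```

With T = 3 the uncorrected estimate sits roughly a third below θ0 no matter how large N grows, which is the incidental parameter problem in its simplest form; here the correction is exact because the bias formula is known.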