9. Binary Dependent Variables
• 9.1 Homogeneous models
– Logit, probit models
– Inference
– Tax preparers
• 9.2 Random effects models
• 9.3 Fixed effects models
• 9.4 Marginal models and GEE
• Appendix 9A - Likelihood calculations

9.1 Homogeneous models
• The response of interest, $y_{it}$, may now be only a 0 or a 1, a binary dependent variable.
– It typically indicates whether the $i$th subject possesses an attribute at time $t$.
• Suppose that the probability that the response equals 1 is denoted by $\mathrm{Prob}(y_{it} = 1) = p_{it}$.
– Then we may interpret the mean response as the probability that the response equals 1, that is,
$$\mathrm{E}\, y_{it} = 0 \times \mathrm{Prob}(y_{it} = 0) + 1 \times \mathrm{Prob}(y_{it} = 1) = p_{it}.$$
– Further, straightforward calculations show that the variance is related to the mean through the expression $\mathrm{Var}\, y_{it} = p_{it}(1 - p_{it})$.

Inadequacy of linear models
• Homogeneous means that we will not incorporate subject-specific terms that account for heterogeneity.
• Linear models of the form $y_{it} = \mathbf{x}_{it}'\boldsymbol{\beta} + \varepsilon_{it}$ are inadequate because:
– The expected response is a probability and thus must lie between 0 and 1, although the linear combination, $\mathbf{x}_{it}'\boldsymbol{\beta}$, may vary between negative and positive infinity.
– Linear models assume homoscedasticity (constant variance), yet the variance of the response depends on the mean, which varies over observations.
– The response must be either a 0 or a 1, although the distribution of the error term is typically regarded as continuous.

Using nonlinear functions of explanatory variables
• In lieu of linear, or additive, functions, we express the probability of the response being 1 as a nonlinear function of explanatory variables, $p_{it} = \pi(\mathbf{x}_{it}'\boldsymbol{\beta})$.
• Two special cases are:
– $\pi(z) = \dfrac{1}{1 + e^{-z}} = \dfrac{e^{z}}{e^{z} + 1}$, the logit case
– $\pi(z) = \Phi(z)$, a cumulative standard normal distribution function, the probit case.
• These two functions are similar. I focus on the logit case because it permits closed-form expressions, unlike the cumulative normal distribution function.
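The objections to the linear model and the similarity of the two nonlinear functions can be checked numerically. The following is a minimal Python sketch and is not part of the original slides. The function names, the coefficients in `linear_probability`, and the rescaling constant 1.702 (a commonly quoted value for matching the logistic curve to the normal one) are assumptions introduced here for illustration.

```python
import math


def logit_cdf(z):
    """Logit case: pi(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def probit_cdf(z):
    """Probit case: standard normal distribution function Phi(z)."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def linear_probability(x, b0=0.5, b1=0.3):
    """A linear 'probability' b0 + b1*x; the coefficients are illustrative only."""
    return b0 + b1 * x


# A linear combination is unbounded, so it can leave the interval [0, 1] ...
print(linear_probability(3.0))
# ... whereas both nonlinear functions stay strictly inside (0, 1).
print(logit_cdf(3.0), probit_cdf(3.0))

# The variance p(1 - p) changes with the mean, so it cannot be constant.
for p in (0.1, 0.5, 0.9):
    print(p, p * (1.0 - p))

# After rescaling the argument (assumed constant 1.702), the two curves nearly coincide.
grid = [k / 100.0 for k in range(-500, 501)]
max_gap = max(abs(probit_cdf(z) - logit_cdf(1.702 * z)) for z in grid)
print(max_gap)
```

The last printed value is the largest gap between $\Phi(z)$ and $\pi(1.702\,z)$ on the grid, which gives a rough numerical sense of how close the two specifications are.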
Threshold interpretation
• Suppose that there exists an underlying linear model, $y_{it}^{*} = \mathbf{x}_{it}'\boldsymbol{\beta} + \varepsilon_{it}^{*}$.
– The response $y_{it}^{*}$ is interpreted to be the “propensity” to possess a characteristic.
– We do not observe the propensity, but we do observe when the propensity crosses a threshold, say 0.
– We observe
$$y_{it} = \begin{cases} 0 & \text{if } y_{it}^{*} \le 0 \\ 1 & \text{if } y_{it}^{*} > 0. \end{cases}$$
• Using the logit distribution function, $\mathrm{Prob}(\varepsilon_{it}^{*} \le a) = 1/(1 + \exp(-a))$.
• Note that, by symmetry, $\mathrm{Prob}(-\varepsilon_{it}^{*} \le \mathbf{x}_{it}'\boldsymbol{\beta}) = \mathrm{Prob}(\varepsilon_{it}^{*} \le \mathbf{x}_{it}'\boldsymbol{\beta})$. Thus,
$$\mathrm{Prob}(y_{it} = 1) = \mathrm{Prob}(y_{it}^{*} > 0) = \mathrm{Prob}(\varepsilon_{it}^{*} \le \mathbf{x}_{it}'\boldsymbol{\beta}) = \frac{1}{1 + \exp(-\mathbf{x}_{it}'\boldsymbol{\beta})} = \pi(\mathbf{x}_{it}'\boldsymbol{\beta}).$$

Random utility interpretation
• In economics applications, we think of an individual choosing among $c$ categories (here, $c = 2$).
– Preferences among categories are indexed by an unobserved utility function.
– We model utility as a function of an underlying value plus random noise, that is, $U_{itj} = u_{it}(V_{itj} + \varepsilon_{itj})$, $j = 0, 1$.
– If $U_{it1} > U_{it0}$, then denote this choice as $y_{it} = 1$.
– Assuming that $u_{it}$ is a strictly increasing function, we have
$$\mathrm{Prob}(y_{it} = 1) = \mathrm{Prob}(U_{it0} < U_{it1}) = \mathrm{Prob}\big(u_{it}(V_{it0} + \varepsilon_{it0}) < u_{it}(V_{it1} + \varepsilon_{it1})\big) = \mathrm{Prob}(\varepsilon_{it0} - \varepsilon_{it1} < V_{it1} - V_{it0}).$$
• Parameterize the problem by taking $V_{it0} = 0$ and $V_{it1} = \mathbf{x}_{it}'\boldsymbol{\beta}$.
• We may take the difference in the errors, $\varepsilon_{it0} - \varepsilon_{it1}$, to be normal or logistic, corresponding to the probit and logit cases.

Logistic regression
• This is another phrase used to describe the logit case.
• Using $p = \pi(z)$, the inverse of $\pi$ can be calculated as $z = \pi^{-1}(p) = \ln\big(p/(1-p)\big)$.
– Define $\mathrm{logit}(p) = \ln\big(p/(1-p)\big)$ to be the logit function.
– Here, $p/(1-p)$ is known as the odds (often loosely called the odds ratio). It has a convenient economic interpretation in terms of fair games.
• That is, suppose that $p = 0.25$. Then the odds are $0.25/0.75 = 0.333$.
• The odds of winning are 0.333 to 1, or 1 to 3 (that is, 3 to 1 against). If we bet $1, then in a fair game we should win $3.
• Logistic regression models the linear combination of explanatory variables as the logarithm of the odds, $\mathbf{x}_{it}'\boldsymbol{\beta} = \ln\big(p_{it}/(1-p_{it})\big)$.
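The threshold interpretation and the logit function can be illustrated with a short simulation. This is a minimal Python sketch rather than anything from the original slides. The helper names, the seed, the number of draws, and the value `xb = 0.4` standing in for $\mathbf{x}_{it}'\boldsymbol{\beta}$ are all assumptions made for the example.

```python
import math
import random


def pi(z):
    """Logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-z))


def logit(p):
    """Inverse of pi: the logarithm of the odds."""
    return math.log(p / (1.0 - p))


def odds(p):
    return p / (1.0 - p)


def simulate_threshold(index, n_draws=100_000, seed=0):
    """Share of draws with propensity y* = index + eps* > 0, with eps* logistic."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_draws):
        u = rng.random()
        while u == 0.0:                  # guard against log(0)
            u = rng.random()
        eps = math.log(u / (1.0 - u))    # inverse-cdf draw from the logistic distribution
        if index + eps > 0.0:            # the propensity crosses the threshold 0
            hits += 1
    return hits / n_draws


print(odds(0.25))                        # odds when p = 0.25
print(logit(pi(0.8)))                    # logit undoes pi
xb = 0.4                                 # hypothetical value of x'beta
print(simulate_threshold(xb), pi(xb))    # simulated share versus pi(x'beta)
```

The simulated share of ones should be close to $\pi(\mathbf{x}_{it}'\boldsymbol{\beta})$, which is what the threshold argument predicts.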
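Parameter interpretation
• To interpret $\boldsymbol{\beta} = (\beta_1, \beta_2, \ldots, \beta_K)'$, we begin by assuming that the $j$th explanatory variable, $x_{itj}$, is either 0 or 1.
• Then, with the notation $\mathbf{x}_{it} = (x_{it1}, \ldots, x_{itj}, \ldots, x_{itK})'$, we may interpret
$$\beta_j = (x_{it1}, \ldots, 1, \ldots, x_{itK})\boldsymbol{\beta} - (x_{it1}, \ldots, 0, \ldots, x_{itK})\boldsymbol{\beta} = \ln\frac{\mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 1)}{1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 1)} - \ln\frac{\mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 0)}{1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 0)}.$$
• Thus,
$$e^{\beta_j} = \frac{\mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 1) \,/\, \big(1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 1)\big)}{\mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 0) \,/\, \big(1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj} = 0)\big)}.$$
• To illustrate, if $\beta_j = 0.693$, then $\exp(\beta_j) = 2$.
– The odds (for $y = 1$) are twice as great for $x_j = 1$ as for $x_j = 0$.

More parameter interpretation
• Similarly, assuming that the $j$th explanatory variable is continuous, we have
$$\beta_j = \frac{d}{dx_{itj}}\, \mathbf{x}_{it}'\boldsymbol{\beta} = \frac{d}{dx_{itj}} \ln\frac{\mathrm{Prob}(y_{it} = 1 \mid x_{itj})}{1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj})} = \frac{\dfrac{d}{dx_{itj}}\Big[\mathrm{Prob}(y_{it} = 1 \mid x_{itj}) \,/\, \big(1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj})\big)\Big]}{\mathrm{Prob}(y_{it} = 1 \mid x_{itj}) \,/\, \big(1 - \mathrm{Prob}(y_{it} = 1 \mid x_{itj})\big)}.$$
• Thus, we may interpret $\beta_j$ as the proportional change in the odds, known as an elasticity in economics.

Parameter estimation
• The customary estimation method is maximum likelihood.
• The log likelihood of a single observation is
$$y_{it} \ln \pi(\mathbf{x}_{it}'\boldsymbol{\beta}) + (1 - y_{it}) \ln\big(1 - \pi(\mathbf{x}_{it}'\boldsymbol{\beta})\big) = \begin{cases} \ln\big(1 - \pi(\mathbf{x}_{it}'\boldsymbol{\beta})\big) & \text{if } y_{it} = 0 \\ \ln \pi(\mathbf{x}_{it}'\boldsymbol{\beta}) & \text{if } y_{it} = 1. \end{cases}$$
• The log likelihood of the data set is
$$L(\boldsymbol{\beta}) = \sum_{it} \Big\{ y_{it} \ln \pi(\mathbf{x}_{it}'\boldsymbol{\beta}) + (1 - y_{it}) \ln\big(1 - \pi(\mathbf{x}_{it}'\boldsymbol{\beta})\big) \Big\}.$$
• Taking partial derivatives with respect to $\boldsymbol{\beta}$ yields the score equations
$$\sum_{it} \mathbf{x}_{it}\, \big(y_{it} - \pi(\mathbf{x}_{it}'\boldsymbol{\beta})\big)\, \frac{\pi'(\mathbf{x}_{it}'\boldsymbol{\beta})}{\pi(\mathbf{x}_{it}'\boldsymbol{\beta})\big(1 - \pi(\mathbf{x}_{it}'\boldsymbol{\beta})\big)} = \mathbf{0}.$$
– The solution of these equations, say $\mathbf{b}_{MLE}$, yields the maximum likelihood estimate.
• The score equations can also be expressed as a generalized estimating equation:
$$\sum_{it} \frac{\partial\, \mathrm{E}\, y_{it}}{\partial \boldsymbol{\beta}}\, (\mathrm{Var}\, y_{it})^{-1}\, (y_{it} - \mathrm{E}\, y_{it}) = \mathbf{0},$$
• where
$$\frac{\partial\, \mathrm{E}\, y_{it}}{\partial \boldsymbol{\beta}} = \mathbf{x}_{it}\, \pi'(\mathbf{x}_{it}'\boldsymbol{\beta}), \qquad \mathrm{E}\, y_{it} = \pi(\mathbf{x}_{it}'\boldsymbol{\beta}), \qquad \mathrm{Var}\, y_{it} = \pi(\mathbf{x}_{it}'\boldsymbol{\beta})\big(1 - \pi(\mathbf{x}_{it}'\boldsymbol{\beta})\big).$$

For the logit function
• For the logit function, $\pi'(z) = \pi(z)\big(1 - \pi(z)\big)$, so the weights in the score equations cancel. The normal equations are:
$$\sum_{it} \mathbf{x}_{it}\, \big(y_{it} - \pi(\mathbf{x}_{it}'\boldsymbol{\beta})\big) = \mathbf{0}.$$
– The solution depends on the responses $y_{it}$ only through the vector of statistics $\sum_{it} \mathbf{x}_{it}\, y_{it}$.
• The solution of these equations, say $\mathbf{b}_{MLE}$, is the maximum likelihood estimate.
• This method can be extended to provide standard errors for the estimates.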
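The logit normal equations can be solved by Newton-Raphson. The following is a minimal Python sketch under stated assumptions, not the estimation code behind the slides: the function name `fit_logit`, the restriction to an intercept plus one regressor, and the data set are all made up for illustration. With a single binary regressor the model is saturated, so $\exp(b_1)$ should reproduce the sample odds ratio, which ties the fit back to the parameter interpretation above.

```python
import math


def pi(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logit(rows, tol=1e-12, max_iter=50):
    """Newton-Raphson for logit p = b0 + b1*x; rows is a list of (x, y) pairs."""
    b0 = b1 = 0.0
    for _ in range(max_iter):
        s0 = s1 = 0.0             # score: sum of x_it * (y_it - pi(x_it' b))
        i00 = i01 = i11 = 0.0     # information: sum of pi * (1 - pi) * x_it x_it'
        for x, y in rows:
            p = pi(b0 + b1 * x)
            s0 += y - p
            s1 += x * (y - p)
            w = p * (1.0 - p)
            i00 += w
            i01 += w * x
            i11 += w * x * x
        det = i00 * i11 - i01 * i01
        d0 = (i11 * s0 - i01 * s1) / det   # Newton step = information^{-1} * score
        d1 = (i00 * s1 - i01 * s0) / det
        b0, b1 = b0 + d0, b1 + d1
        if abs(d0) + abs(d1) < tol:
            break
    # Standard errors from the diagonal of the inverse information matrix.
    se0, se1 = math.sqrt(i11 / det), math.sqrt(i00 / det)
    return b0, b1, se0, se1


# Made-up data: x = 0 has 20 ones out of 100, x = 1 has 40 ones out of 100.
rows = [(0, 1)] * 20 + [(0, 0)] * 80 + [(1, 1)] * 40 + [(1, 0)] * 60
b0, b1, se0, se1 = fit_logit(rows)
sample_odds_ratio = (40 / 60) / (20 / 80)
print(b1, math.exp(b1), sample_odds_ratio, se1)
```

The standard errors returned here come from the inverse of the information matrix evaluated at the final iterate, one common way of carrying out the extension mentioned in the last bullet.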
9.2 Random effects models
• We accommodate heterogeneity by incorporating subject-specific variables of the form $p_{it} = \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})$.
– We assume that the intercepts are realizations of random variables from a common distribution.
• We estimate the parameters of the $\{\alpha_i\}$ distribution and the $K$ slope parameters $\boldsymbol{\beta}$.
• By using the random effects specification, we dramatically reduce the number of parameters to be estimated compared to the Section 9.3 fixed effects set-up.
– This is similar to the linear model case.
• This model is computationally difficult to evaluate.

Commonly used distributions
• We assume that subject-specific effects are independent and come from a common distribution.
– It is customary to assume that the subject-specific effects are normally distributed.
• We assume, conditional on subject-specific effects, that the responses are independent. Thus, there is no serial correlation.
• There are two commonly used specifications of the conditional distributions in the random effects panel data model.
– 1. A logistic model for the conditional distribution of a response. That is,
$$\mathrm{Prob}(y_{it} = 1 \mid \alpha_i) = \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}) = \frac{1}{1 + \exp\big(-(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})\big)}.$$
– 2. A normal model for the conditional distribution of a response. That is,
$$\mathrm{Prob}(y_{it} = 1 \mid \alpha_i) = \Phi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}),$$
– where $\Phi$ is the standard normal distribution function.

Likelihood
• Let $\mathrm{Prob}(y_{it} = 1 \mid \alpha_i) = \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})$ denote the conditional probability for both the logistic and normal models.
• Conditional on $\alpha_i$, the likelihood for the $it$th observation is:
$$\pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})^{y_{it}} \big(1 - \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})\big)^{1 - y_{it}} = \begin{cases} 1 - \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}) & \text{if } y_{it} = 0 \\ \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}) & \text{if } y_{it} = 1. \end{cases}$$
• Conditional on $\alpha_i$, the likelihood for the $i$th subject is:
$$\prod_{t=1}^{T_i} \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})^{y_{it}} \big(1 - \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})\big)^{1 - y_{it}}.$$
• Thus, the (unconditional) likelihood for the $i$th subject is:
$$l_i = \int \prod_{t=1}^{T_i} \pi(a + \mathbf{x}_{it}'\boldsymbol{\beta})^{y_{it}} \big(1 - \pi(a + \mathbf{x}_{it}'\boldsymbol{\beta})\big)^{1 - y_{it}}\, \phi(a)\, da.$$
– Here, $\phi$ is the standard normal density function.
• Hence, the total log-likelihood is $\sum_i \ln l_i$.
– Note: this requires many evaluations of a numerical integral.
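To make the numerical integral concrete, the subject-level likelihood $l_i$ can be approximated with a simple quadrature rule. This is a minimal Python sketch assuming the logistic conditional model and a standard normal density for the random intercept, as in the display above. The trapezoid rule, the integration range $[-8, 8]$, the function name `subject_likelihood`, and the values in `xb` (standing in for $\mathbf{x}_{it}'\boldsymbol{\beta}$) are choices made here for illustration, not the method used in the slides.

```python
import math
from itertools import product


def pi(z):
    return 1.0 / (1.0 + math.exp(-z))


def phi(a):
    """Standard normal density."""
    return math.exp(-0.5 * a * a) / math.sqrt(2.0 * math.pi)


def subject_likelihood(y, xb, lo=-8.0, hi=8.0, n_steps=1600):
    """Trapezoid-rule approximation of l_i for responses y and linear predictors xb."""
    h = (hi - lo) / n_steps
    total = 0.0
    for k in range(n_steps + 1):
        a = lo + k * h
        conditional = 1.0                       # likelihood of subject i given alpha_i = a
        for y_t, xb_t in zip(y, xb):
            p = pi(a + xb_t)
            conditional *= p if y_t == 1 else 1.0 - p
        end_weight = 0.5 if k in (0, n_steps) else 1.0
        total += end_weight * conditional * phi(a)
    return total * h


xb = [-0.5, 0.0, 0.7]                           # hypothetical x_it' beta for T_i = 3
l_i = subject_likelihood([1, 0, 1], xb)
print(l_i, math.log(l_i))                       # contribution of subject i to the log-likelihood

# Sanity check: the likelihoods of all 2^3 response patterns should sum to about 1.
total_prob = sum(subject_likelihood(list(y), xb) for y in product((0, 1), repeat=3))
print(total_prob)
```

Each evaluation of the total log-likelihood repeats this integral for every subject, and a maximizer repeats that for every trial value of the parameters, which is the computational burden the note refers to.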
Comparing logit to probit specification
• There are no important advantages or disadvantages when choosing the conditional probability to be:
– logit function (logit model)
– standard normal (probit model)
• The likelihood involves roughly the same amount of work to evaluate and maximize, although the logit function is slightly easier to evaluate than the standard normal distribution function.
• The probit model is slightly easier to interpret because unconditional probabilities can be expressed in terms of the standard normal distribution function.
• That is, with $\tau^2$ denoting the variance of the normally distributed $\alpha_i$,
$$\mathrm{Prob}(y_{it} = 1) = \mathrm{E}\, \Phi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}) = \Phi\!\left(\frac{\mathbf{x}_{it}'\boldsymbol{\beta}}{\sqrt{1 + \tau^2}}\right).$$

9.3 Fixed effects models
• As with homogeneous models, we express the probability of the response being 1 as a nonlinear function of linear combinations of explanatory variables.
• To accommodate heterogeneity, we incorporate subject-specific variables of the form $p_{it} = \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})$.
– Here, the subject-specific effects account only for the intercepts and do not include other variables.
– We assume that $\{\alpha_i\}$ are fixed effects in this section.
• In this chapter, we assume that responses are serially uncorrelated.
• Important point: Panel data with dummy variables provide inconsistent parameter estimates.

Maximum likelihood estimation
• Unlike random effects models, maximum likelihood estimators are inconsistent in fixed effects models.
– The log likelihood of the data set is
$$L = \sum_{it} \Big\{ y_{it} \ln \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}) + (1 - y_{it}) \ln\big(1 - \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})\big) \Big\}.$$
– This log likelihood can still be maximized to yield maximum likelihood estimators.
– However, as the number of subjects $n$ tends to infinity, the number of parameters also tends to infinity.
• Intuitively, our ability to estimate $\boldsymbol{\beta}$ is corrupted by our inability to estimate consistently the subject-specific effects $\{\alpha_i\}$.
– In the linear case, the maximum likelihood estimates are equivalent to the least squares estimates.
• The least squares estimates of $\boldsymbol{\beta}$ were consistent.
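• The least squares procedure “swept out” intercept estimators when producing estimates of $\boldsymbol{\beta}$.

Maximum likelihood estimation is inconsistent
• Example 9.2 (Chamberlain, 1978; Hsiao, 1986).
– Let $T_i = 2$, $K = 1$, $x_{i1} = 0$ and $x_{i2} = 1$.
– Take derivatives of the likelihood function to get the score functions; these are in display (9.8).
– From (9.8), the score functions are
$$\frac{\partial L}{\partial \alpha_i} = y_{i1} + y_{i2} - \frac{e^{\alpha_i}}{1 + e^{\alpha_i}} - \frac{e^{\alpha_i + \beta}}{1 + e^{\alpha_i + \beta}} = 0$$
– and
$$\frac{\partial L}{\partial \beta} = \sum_i \left\{ y_{i2} - \frac{e^{\alpha_i + \beta}}{1 + e^{\alpha_i + \beta}} \right\} = 0.$$
– Appendix 9A.1
• Maximize this to get $b_{mle}$.
• Show that the probability limit of $b_{mle}$ is $2\beta$, and hence $b_{mle}$ is an inconsistent estimator of $\beta$.

Conditional maximum likelihood estimation
• This estimation technique provides consistent estimates of the beta coefficients.
– It is due to Chamberlain (1980) in the context of fixed effects panel data models.
• Let’s consider the logit specification of $\pi$, so that
$$p_{it} = \pi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}) = \frac{1}{1 + \exp\big(-(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta})\big)}.$$
• Big idea: With this specification, it turns out that $\sum_t y_{it}$ is a sufficient statistic for $\alpha_i$.
– Thus, if we condition on $\sum_t y_{it}$, then the distribution of the responses will not depend on $\alpha_i$.

Example of the sufficiency
• To illustrate how to separate the intercept from the slope effects, consider the case $T_i = 2$.
– Suppose that the sum, $\sum_t y_{it} = y_{i1} + y_{i2}$, equals either 0 or 2.
• If the sum equals 0, then $\mathrm{Prob}(y_{i1} = 0, y_{i2} = 0 \mid y_{i1} + y_{i2} = \text{sum}) = 1$.
• If the sum equals 2, then $\mathrm{Prob}(y_{i1} = 1, y_{i2} = 1 \mid y_{i1} + y_{i2} = \text{sum}) = 1$.
• Neither conditional probability depends on $\alpha_i$.
• Both conditional events are certain and will contribute nothing to a conditional likelihood.
– If the sum equals 1,
$$\mathrm{Prob}(y_{i1} + y_{i2} = 1) = \mathrm{Prob}(y_{i1} = 0)\,\mathrm{Prob}(y_{i2} = 1) + \mathrm{Prob}(y_{i1} = 1)\,\mathrm{Prob}(y_{i2} = 0) = \frac{\exp(\alpha_i + \mathbf{x}_{i1}'\boldsymbol{\beta}) + \exp(\alpha_i + \mathbf{x}_{i2}'\boldsymbol{\beta})}{\big(1 + \exp(\alpha_i + \mathbf{x}_{i1}'\boldsymbol{\beta})\big)\big(1 + \exp(\alpha_i + \mathbf{x}_{i2}'\boldsymbol{\beta})\big)}.$$

Example of the sufficiency (continued)
• Thus,
$$\mathrm{Prob}(y_{i1} = 0, y_{i2} = 1 \mid y_{i1} + y_{i2} = 1) = \frac{\mathrm{Prob}(y_{i1} = 0)\,\mathrm{Prob}(y_{i2} = 1)}{\mathrm{Prob}(y_{i1} + y_{i2} = 1)} = \frac{\exp(\alpha_i + \mathbf{x}_{i2}'\boldsymbol{\beta})}{\exp(\alpha_i + \mathbf{x}_{i1}'\boldsymbol{\beta}) + \exp(\alpha_i + \mathbf{x}_{i2}'\boldsymbol{\beta})} = \frac{\exp(\mathbf{x}_{i2}'\boldsymbol{\beta})}{\exp(\mathbf{x}_{i1}'\boldsymbol{\beta}) + \exp(\mathbf{x}_{i2}'\boldsymbol{\beta})}.$$
• This does not depend on $\alpha_i$.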
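– Note that if an explanatory variable $x_{ij}$ is time-constant ($x_{ij,2} = x_{ij,1}$), then the corresponding parameter $\beta_j$ disappears from the conditional likelihood.

Conditional likelihood estimation
• Let $S_i$ be the random variable representing $\sum_t y_{it}$ and let $\text{sum}_i$ be the realization of $\sum_t y_{it}$.
• The conditional likelihood of the data set is
$$\prod_{i=1}^{n} \frac{\prod_{t=1}^{T_i} p_{it}^{\,y_{it}} (1 - p_{it})^{1 - y_{it}}}{\mathrm{Prob}(S_i = \text{sum}_i)}.$$
– Note that the ratio equals one when $\text{sum}_i$ equals 0 or $T_i$.
– The distribution of $S_i$ is messy and is difficult to compute for moderate size data sets with $T$ more than 10.
• This provides a fix for the problem of “infinitely many nuisance parameters.”
– It is computationally difficult, hard to extend to more complex models, and hard to explain to consumers.

9.4 Marginal models and GEE
• Marginal models, also known as “population-averaged” models, only require specification of the first two moments
– Means, variances and covariances
– Not a true probability model
– Ideal for moment estimation (GEE, GMM)
• Begin in the context of the random effects binary dependent variable model
– The mean is $\mathrm{E}\, y_{it} = \mu_{it}$, where
$$\mu_{it} = \mu_{it}(\boldsymbol{\beta}, \boldsymbol{\tau}) = \int \pi(a + \mathbf{x}_{it}'\boldsymbol{\beta})\, dF(a).$$
– The variance is $\mathrm{Var}\, y_{it} = \mu_{it}(1 - \mu_{it})$.
– The covariance is
$$\mathrm{Cov}(y_{ir}, y_{is}) = \int \pi(a + \mathbf{x}_{ir}'\boldsymbol{\beta})\, \pi(a + \mathbf{x}_{is}'\boldsymbol{\beta})\, dF(a) - \mu_{ir}\, \mu_{is}.$$

GEE – generalized estimating equations
• This is a method of moments procedure
– Essentially the same as generalized method of moments
– One matches theoretical moments to sample moments, with appropriate weighting.
• Idea: find the values of the parameters that satisfy
$$\mathbf{0}_K = \sum_{i=1}^{n} \mathbf{G}_{\mu}(\mathbf{b}_{EE}, \boldsymbol{\tau}_{EE})\, \mathbf{V}_i(\mathbf{b}_{EE}, \boldsymbol{\tau}_{EE})^{-1}\, \big(\mathbf{y}_i - \boldsymbol{\mu}_i(\mathbf{b}_{EE}, \boldsymbol{\tau}_{EE})\big).$$
– We have already specified the variance matrix.
– We also use a $K \times T_i$ matrix of derivatives
$$\mathbf{G}_{\mu}(\boldsymbol{\beta}, \boldsymbol{\tau}) = \frac{\partial \boldsymbol{\mu}_i(\boldsymbol{\beta}, \boldsymbol{\tau})'}{\partial \boldsymbol{\beta}} = \left( \frac{\partial \mu_{i1}}{\partial \boldsymbol{\beta}} \;\cdots\; \frac{\partial \mu_{iT_i}}{\partial \boldsymbol{\beta}} \right).$$
– For binary variables, we have
$$\frac{\partial \mu_{it}}{\partial \boldsymbol{\beta}} = \mathbf{x}_{it} \int \pi'(a + \mathbf{x}_{it}'\boldsymbol{\beta})\, dF(a).$$

Marginal Model
• Choose the mean function to be $\mu_{it} = \Phi(\mathbf{x}_{it}'\boldsymbol{\beta})$.
– Motivated by the probit specification
$$\mathrm{Prob}(y_{it} = 1) = \mathrm{E}\, \Phi(\alpha_i + \mathbf{x}_{it}'\boldsymbol{\beta}) = \Phi\!\left(\frac{\mathbf{x}_{it}'\boldsymbol{\beta}}{\sqrt{1 + \tau^2}}\right).$$
• For the variance function, consider $\mathrm{Var}\, y_{it} = \mu_{it}(1 - \mu_{it})$.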
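• Let $\mathrm{Corr}(y_{ir}, y_{is})$ denote the correlation between $y_{ir}$ and $y_{is}$.
– This is known as a working correlation.
• Use the exchangeable correlation structure specified as
$$\mathrm{Corr}(y_{ir}, y_{is}) = \begin{cases} 1 & \text{for } r = s \\ \rho & \text{for } r \ne s. \end{cases}$$
• Here, the motivation is that the latent variable $\alpha_i$ is common to all observations within a subject, thus inducing a common correlation.
• The parameters in $\boldsymbol{\tau}$, which include the correlation $\rho$, constitute the variance components.

Robust Standard Errors
• Model-based standard errors are taken from the square root of the diagonal elements of
$$\left( \sum_{i=1}^{n} \mathbf{G}_{\mu}(\mathbf{b}_{EE}, \boldsymbol{\tau}_{EE})\, \mathbf{V}_i(\mathbf{b}_{EE}, \boldsymbol{\tau}_{EE})^{-1}\, \mathbf{G}_{\mu}(\mathbf{b}_{EE}, \boldsymbol{\tau}_{EE})' \right)^{-1}.$$
• As an alternative, robust or empirical standard errors are from
$$\left( \sum_{i=1}^{n} \mathbf{G}_{\mu} \mathbf{V}_i^{-1} \mathbf{G}_{\mu}' \right)^{-1} \left( \sum_{i=1}^{n} \mathbf{G}_{\mu} \mathbf{V}_i^{-1} (\mathbf{y}_i - \boldsymbol{\mu}_i)(\mathbf{y}_i - \boldsymbol{\mu}_i)' \mathbf{V}_i^{-1} \mathbf{G}_{\mu}' \right) \left( \sum_{i=1}^{n} \mathbf{G}_{\mu} \mathbf{V}_i^{-1} \mathbf{G}_{\mu}' \right)^{-1}.$$
• These are robust to misspecified heteroscedasticity as well as time series correlation.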
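The difference between model-based and robust standard errors can be seen in the simplest possible marginal model. The following is a minimal Python sketch under strong simplifying assumptions that are not in the slides: an intercept-only probit mean $\mu = \Phi(b)$, the same number of observations $T$ for every subject, a working correlation $\rho$ that is supplied rather than estimated, and made-up data. It uses the fact that, for an exchangeable correlation matrix, applying the inverse to a vector of ones rescales it by $1/(1 + (T-1)\rho)$. The function name `gee_intercept_only` is hypothetical.

```python
import math
from statistics import NormalDist

STD_NORMAL = NormalDist()


def gee_intercept_only(subjects, rho):
    """Return (b, model-based SE, robust SE) for mu = Phi(b) with working correlation rho."""
    n, T = len(subjects), len(subjects[0])
    y_bar = sum(sum(y_i) for y_i in subjects) / (n * T)
    b = STD_NORMAL.inv_cdf(y_bar)        # solves the estimating equation when T is common
    mu = y_bar
    dens = STD_NORMAL.pdf(b)             # derivative of mu with respect to b
    # V_i = mu (1 - mu) R(rho); V_i^{-1} applied to a vector of ones divides by this scale.
    scale = mu * (1.0 - mu) * (1.0 + (T - 1) * rho)
    bread = n * dens * dens * T / scale  # sum over i of G V_i^{-1} G'
    meat = sum((dens * sum(y - mu for y in y_i) / scale) ** 2 for y_i in subjects)
    model_se = math.sqrt(1.0 / bread)
    robust_se = math.sqrt(meat) / bread  # square root of bread^{-1} * meat * bread^{-1}
    return b, model_se, robust_se


# Made-up balanced panel: 6 subjects, T = 3 binary responses each.
subjects = [[1, 1, 1], [0, 0, 0], [1, 1, 0], [0, 0, 0], [1, 1, 1], [0, 0, 0]]
for rho in (0.0, 0.5):
    print(rho, gee_intercept_only(subjects, rho))
```

In this balanced, intercept-only setting the model-based standard error changes with the assumed $\rho$ while the robust standard error does not, which illustrates (in a special case only) why the robust version is described as insensitive to a misspecified correlation.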