Economics 140A
Qualitative Dependent Variables

With each of the classic assumptions covered, we turn our attention to extensions of the classic linear regression model. Our first extension is to data in which the dependent variable is limited. Dependent variables that are limited in some way are common in economics, but not all require special treatment. We have seen examples with wages, income, or consumption as the dependent variable; all of these must be positive. As these strictly positive variables take numerous values, we found the log transform to be sufficient. Yet not all restrictions on the dependent variable can be handled so easily. If we model individual choice, the optimal behavior of individuals often results in a sizable fraction of the population at a corner solution. For example, a sizable fraction of working-age adults do not work outside the home, so the distribution of hours worked has a sizable pile-up at zero. If we fit a linear conditional mean, we will likely predict negative hours worked for some individuals. The log transform used for wages will not work, as the log of zero is undefined. Another issue arises with sample selection. It may well be the case that E(Y|X) is linear, but nonrandom sampling requires more detailed inference. Finally, a host of other data issues may arise: linear conditional mean functions that switch over regimes, data recorded as counts, or analysis of durations between events. As we will see, even if only a finite number of values are possible, a linear model for E(Y|X) may still be appropriate. While all these issues may arise, we focus on perhaps the most common restriction, in which the dependent variable is qualitative in nature and so takes discrete values. For this reason, such models are also termed discrete dependent variable models or (less frequently) dummy dependent variable models.
As we recall from our discussion of qualitative regressors, qualitative variables capture the presence or absence of some non-numeric quantity. For example, in studying home ownership the dependent variable is often

    Y_t = 1 if household t owns their home, 0 otherwise.

Many qualitative variables take more than two values. For example, in studies of employment dynamics the dependent variable can take three values:

    Y_t = 1 if individual t is employed,
          0 if individual t is unemployed but seeking employment,
         -1 if individual t is not in the labor force (i.e. not seeking employment).

We focus attention on qualitative dependent variables that take only two values and for ease set these values to 0 and 1. In binary response models, interest is primarily in

    p(X) := P(Y = 1|X) = P(Y = 1|X_1, ..., X_K),

for various values of X. For a continuous regressor X_j, the partial effect of X_j on the response probability is simply dP(Y = 1|X)/dX_j. When multiplied by ΔX_j (for small ΔX_j), the partial effect yields the approximate change in P(Y = 1|X) when X_j increases by ΔX_j, holding all other regressors constant. For a discrete regressor X_K, the partial effect is

    P(Y = 1|x_1, ..., x_{K-1}, X_K = 1) - P(Y = 1|x_1, ..., x_{K-1}, X_K = 0).

Perhaps the most natural extension of the classic linear regression model is to leave the structure of the population model unchanged, so that

    Y_t = β_0 + β_1 X_t + U_t.

How does the presence of a qualitative dependent variable affect our analysis? In quite substantial ways. Consider the familiar relation

    E(Y_t|X) = β_0 + β_1 X_t.

Because Y_t takes only the values 0 and 1, E(Y_t|X) = P(Y_t = 1|X), hence

    p(X) = β_0 + β_1 X_t,

and so the conditional mean is a probability and the model is termed the linear probability model. The coefficient β_1 is interpreted as the effect of a one-unit change in X_t on the probability that Y_t = 1. Similarly, if X_t is a binary regressor, then β_1 captures the effect of moving from X_t = 0 to X_t = 1.
As there is no reason to believe that as X_t varies the conditional mean will remain between 0 and 1, equality between the conditional mean and P(Y_t = 1) is a substantial drawback of the linear probability model. As one can deduce, it is hard to fit points that are clustered at 0 and 1 (on the y-axis) with a straight line, so the R-squared measure is not reliable. (Draw a graph with the points clustered at 0 and 1 on the y-axis and a straight line attempting to fit them.)

Several other features of the linear probability model are easily obtained. Because Y_t is a Bernoulli random variable, U_t is a two-point random variable:

    U_t = 1 - (β_0 + β_1 X_t)   with probability β_0 + β_1 X_t,
    U_t = -(β_0 + β_1 X_t)      with probability 1 - (β_0 + β_1 X_t).

From the definition of U_t it is clear that the error has mean 0 and variance

    E(U_t²) = [1 - (β_0 + β_1 X_t)]² (β_0 + β_1 X_t) + (β_0 + β_1 X_t)² [1 - (β_0 + β_1 X_t)]
            = (β_0 + β_1 X_t)[1 - (β_0 + β_1 X_t)].

Thus the error term is heteroskedastic and two-valued, violating two of the classic assumptions. The OLSE is unbiased and consistent for the linear probability model, although robust standard errors are needed to account for the heteroskedasticity. As an aside, the test statistic for H_0: β_1 = ... = β_K = 0 can be accurately constructed from the OLSE, as under H_0 the error is homoskedastic, with E(U_t²) = β_0(1 - β_0). To improve the efficiency of the OLSE, construct the weighted least squares estimator. Let Y_t^P denote the predicted value of the dependent variable constructed from the OLSE. If 0 < Y_t^P < 1 for all t, then form the estimate of the error standard deviation

    S_t = sqrt( Y_t^P (1 - Y_t^P) ).

The weighted least squares estimator is obtained from the model

    Y_t/S_t = β_0 (1/S_t) + β_1 (X_t/S_t) + U_t/S_t.

Again, the reported standard errors are valid, as follows from our earlier treatment of weighted least squares. If Y_t^P lies outside (0, 1) for some t, then WLS is infeasible without an ad hoc adjustment and should not be done.
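The OLS-then-WLS procedure above can be sketched in a short simulation. This is a minimal sketch with made-up coefficients (β_0 = .2, β_1 = .5, chosen so the true probabilities stay inside the unit interval); the closed-form simple-regression estimator stands in for a general OLS routine.

```python
# Sketch: simulate a linear probability model, fit by OLS, then reweight by
# the estimated error standard deviation S_t = sqrt(p_hat (1 - p_hat)).
import random

random.seed(0)
n = 5000
b0, b1 = 0.2, 0.5                 # illustrative true coefficients (assumption)
x = [random.random() for _ in range(n)]          # x in (0, 1)
y = [1.0 if random.random() < b0 + b1 * xi else 0.0 for xi in x]

def ls(x, y, w=None):
    """(Weighted) least squares for y = a + b*x; w defaults to OLS."""
    w = w or [1.0] * len(x)
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return ybar - b * xbar, b

a_ols, b_ols = ls(x, y)
p_hat = [a_ols + b_ols * xi for xi in x]    # fitted probabilities
# WLS is feasible only if every fitted value lies strictly in (0, 1)
assert all(0 < p < 1 for p in p_hat)
w = [1.0 / (p * (1 - p)) for p in p_hat]    # weight = 1 / Var(U_t | X)
a_wls, b_wls = ls(x, y, w)
```

With the regressor bounded in (0, 1) and these coefficients, all fitted values stay inside the unit interval, so the feasibility check passes and both estimators recover the slope up to sampling error.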
The linear probability model is a convenient approximation and generally gives good estimates of the partial effects on the response probability near the center of the regressor distribution. If one wishes to know the partial effect, averaged over the values of X, then the linear probability model may work well even if it gives poor estimates of the partial effects for extreme values of X.

Example (Married Women's Labor Force Participation) In a survey of 753 women, 428 report working more than zero hours. Also, 606 have no young children while 118 have exactly one young child. The variables in play are:

    inlf      binary, value 1 indicates non-zero working hours
    nonwfinc  non-wife income, in thousands of dollars
    ed        education
    ex        experience
    k6        number of children less than 6 years old
    k+        number of children between 6 and 18, inclusive

The estimated regression (OLS standard errors in parentheses, robust standard errors in brackets) is

    inlf^ = .586  - .0034 nonwfinc + .038 ed + .039 ex - .0006 ex²  - .016 age - .262 k6 + .013 k+
           (.154)  (.0014)          (.007)    (.006)    (.00018)     (.002)     (.034)    (.013)
           [.151]  [.0015]          [.007]    [.006]    [.00019]     [.002]     [.032]    [.013]

    R² = .264

Except for k+, all regressor coefficients have sensible signs and are statistically significant. The regressor k+ is neither statistically significant nor practically important. Also, the OLS and robust standard errors are almost identical. Interpretation: an increase in non-wife income of $10,000 reduces the probability of labor force participation by only .034 (3.4 percentage points). As the sample mean of non-wife income is only $20,129 with a standard deviation of $11,635, a $10,000 increase is quite substantial. Having one more small child is a first-order effect, reducing the probability of being in the labor force by .262 (26.2 percentage points). Finally, of the 753 fitted values, 33 lie outside the unit interval (hence we do not construct WLS estimators). The case for linear probability models grows stronger if most regressors are discrete and take only a few values, so that there are no "extreme" values.
To understand how to construct a discrete regressor from a continuous regressor, return to the preceding example. Partition the variable k6 into three indicator variables: I_0 = 1 (0 young children), I_1 = 1 (1 young child), and I_2 = 1 (2 or more young children). We replace k6 with (I_1, I_2) to allow the impact of the first young child to differ, and obtain the estimated coefficients -.263 for I_1 and -.274 for I_2. It appears that the key impact is having one young child; additional young children do not change labor force participation much.

The use of discrete regressors is familiar to us from our discussion of regressor specification. For the model with union-gender interactions, the predicted values correspond to cell averages. When a model has enough indicator variables and interactions to fit cell averages, the model is saturated. A new fact about saturated models emerges here: if a model is saturated, then the predicted probabilities must lie between 0 and 1 (because they mimic cell averages).

For estimates of the partial effects for extreme values of the regressors, we must develop a new framework. To do so, observe that the linear probability model falls within a broader class of models, termed single index models. The term single index arises because the various regressors affect Y_t through the scalar X_t'β, which is the single index. The class of single index models is

    Y_t = F(X_t'β) + U_t,

where the linear probability model is given by F(X_t'β) = X_t'β. To overcome the problem that the predictions for Y_t can lie outside the unit interval, we constrain F so that

    0 < F(z) < 1 for all z.

Given this constraint, a natural choice for F is a cumulative distribution function (although it is not necessary to use a CDF). Index models in which F is a CDF are derived from a latent model. The latent model concerns a variable that underlies the decision and cannot be observed.
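The saturated-model fact noted above, that predicted probabilities mimic cell averages and hence lie in [0, 1], can be checked directly. The sketch below uses made-up cell probabilities and two binary regressors; with an intercept, both dummies, and their interaction, the OLS coefficients follow from the four cell means.

```python
# Sketch: saturated model with two binary regressors. OLS fitted values equal
# the cell averages of y, so every predicted probability lies in [0, 1].
import random

random.seed(1)
n = 2000
data = []
for _ in range(n):
    d1, d2 = random.randint(0, 1), random.randint(0, 1)
    p = [[0.2, 0.5], [0.4, 0.9]][d1][d2]   # illustrative true cell probabilities
    data.append((d1, d2, 1.0 if random.random() < p else 0.0))

# Cell averages of y
cell = {}
for d1, d2, y in data:
    s, c = cell.get((d1, d2), (0.0, 0))
    cell[(d1, d2)] = (s + y, c + 1)
pbar = {k: s / c for k, (s, c) in cell.items()}

# OLS of y on 1, d1, d2, d1*d2 solves exactly in terms of the cell means
b0 = pbar[(0, 0)]
b1 = pbar[(1, 0)] - pbar[(0, 0)]
b2 = pbar[(0, 1)] - pbar[(0, 0)]
b3 = pbar[(1, 1)] - pbar[(1, 0)] - pbar[(0, 1)] + pbar[(0, 0)]

# Fitted value in each cell reproduces that cell's average
fit = {(d1, d2): b0 + b1 * d1 + b2 * d2 + b3 * d1 * d2
       for d1 in (0, 1) for d2 in (0, 1)}
```

Because each fitted value is an average of 0s and 1s, no prediction can escape the unit interval, which is the claim in the text.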
If Y_t measures whether or not household t owns a home, then the latent variable Y_t* captures the desire of household t to own a home. If the desire is high enough, then household t owns a home. Or, put another way, Y_t* captures the difference in utility between the two options, namely owning and renting, in which case if Y_t* is positive, then the utility from owning exceeds that from renting and household t purchases a home. The (latent) population model that explains the latent variable is

    Y_t* = β_0 + β_1 X_t + V_t,

where {V_t} (t = 1, ..., n) is a sequence of i.i.d. random variables, symmetric about their mean of 0, with variance σ² and distribution F. (Symmetry is convenient but not essential; without it, one writes P(Y_t = 1|X) = 1 - F(-(β_0 + β_1 X_t)).) The latent variable and the observed variable are linked as

    Y_t = 1 if Y_t* > 0, 0 otherwise.

From the measurement rule we can see that if we multiply Y_t* by any positive constant, then Y_t is unchanged. As a result, we can only estimate β_0 and β_1 up to a positive multiple (that is, relative to scale). To identify the coefficients, we set σ = 1. We also see that if the threshold is c ≠ 0, then we return to a zero threshold simply by subtracting c from β_0. To identify the intercept, we set c = 0. To construct an estimator of the coefficients, note

    p(X) = P(Y_t = 1|X) = P(Y_t* > 0|X)
         = P(β_0 + β_1 X_t + V_t > 0)
         = P(V_t > -(β_0 + β_1 X_t))
         = P(V_t ≤ β_0 + β_1 X_t)
         = F(β_0 + β_1 X_t),

where the next-to-last line follows from symmetry about 0. The main criticism of the linear probability model has been addressed: because the distribution function is contained in [0, 1], so too is the probability that Y_t equals 1. The presence of the latent model can give one the impression that we are interested in the effect of X on Y*, which is given by β_1. Yet Y* rarely has sensible units of measurement (desire to own a home, or differences in utility), so the magnitude of β_1 is not generally important. Rather, our goal is to explain the effect of X on the response probability p(X).
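The chain of equalities above can be verified by simulation. This sketch uses illustrative coefficients (β_0 = -0.5, β_1 = 1) and standard Gaussian V_t, so F is the Gaussian CDF, built here from the standard library's error function.

```python
# Sketch: simulate the latent model Y* = b0 + b1*x + V, V ~ N(0,1), observe
# Y = 1(Y* > 0), and check P(Y = 1 | x) = Phi(b0 + b1*x).
import math
import random

def Phi(z):
    """Standard Gaussian CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

random.seed(2)
b0, b1 = -0.5, 1.0                # illustrative coefficients (assumption)
n = 20000
freq = {}
for x in (0.0, 1.0):
    hits = sum(1 for _ in range(n) if b0 + b1 * x + random.gauss(0, 1) > 0)
    freq[x] = hits / n            # observed frequency of Y = 1 at this x
```

At x = 0 the theory gives Φ(-0.5) and at x = 1 it gives Φ(0.5); the simulated frequencies match up to Monte Carlo error, which is the content of the derivation p(X) = F(β_0 + β_1 X_t).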
To understand the link, note that if X is a continuous regressor,

    dp(X)/dX = f(β_0 + β_1 X) β_1,   where f(z) = dF(z)/dz.

If F is a strictly increasing function (as is true for Gaussian and logistic CDFs), then f(z) > 0 for all z, and the sign of β_1 determines the direction of the effect on the response probability. Observe that the magnitude of the effect depends on the value of the regressor, through f(β_0 + β_1 X). If the underlying density is unimodal and symmetric about 0, the maximum value of f(z) occurs at z = 0. For the leading cases:

    Probit:  f(z) = (1/sqrt(2π)) e^{-z²/2},   f(0) = 1/sqrt(2π) ≈ .399
    Logit:   f(z) = e^z / (1 + e^z)²,         f(0) = .25

For the multiple regressor model, relative effects do not vary with X, as

    [dp(X)/dX_i] / [dp(X)/dX_j] = β_i / β_j.

If X is a discrete regressor, then the impact of an increase from c to c + 1 in X on the response probability is

    F(β_0 + β_1 (c + 1)) - F(β_0 + β_1 c).

If X is an indicator regressor, then c = 0.

Example (Effect of Job Training)
    p(X) = probability of employment
    X_K = indicator of participation in job training
The direction of the job training effect is the sign of β_K, while the magnitude of the effect differs depending on age, education, and experience (the other included regressors).

Finally, consider the model with index

    X_t'β = β_0 + β_1 X_{1t} + β_2 X_{1t}² + β_3 ln X_{2t}.

The partial effect of X_1 on the response probability is

    f(X_t'β)(β_1 + 2 β_2 X_{1t}),

so the direction of the partial effect potentially changes at X_1 = -β_1/(2β_2). The partial effect of ln X_2 on the response probability is

    f(X_t'β) β_3.

Because d ln X_2 = dX_2/X_2, a 1 percent change in X_2 is a .01 change in ln X_2. Therefore the partial effect of a 1 percent change in X_2 on the response probability is

    f(X_t'β) (β_3/100).

Given the distributional assumptions on V_t, the maximum likelihood estimator arises naturally.
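The two leading densities and the scaling of partial effects can be written out directly. The coefficient and regressor values below are purely illustrative.

```python
# Sketch: probit and logit densities, their maxima at z = 0, and the partial
# effect f(b0 + b1*x) * b1 of a continuous regressor.
import math

def probit_pdf(z):
    """Standard Gaussian density."""
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

def logit_pdf(z):
    """Logistic density e^z / (1 + e^z)^2."""
    ez = math.exp(z)
    return ez / (1.0 + ez) ** 2

# Maxima at z = 0, as stated in the text
f0_probit = probit_pdf(0.0)      # 1/sqrt(2*pi), about .399
f0_logit = logit_pdf(0.0)        # exactly .25

# Partial effect at an illustrative index value (assumed b0, b1, x)
b0, b1, x = 0.5, -0.3, 2.0
pe_probit = probit_pdf(b0 + b1 * x) * b1
pe_logit = logit_pdf(b0 + b1 * x) * b1
```

Note the sign of each partial effect is the sign of β_1, while the magnitude shrinks as the index moves away from zero, exactly as the f(β_0 + β_1 X) scaling implies.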
The distribution of Y_1 is used to form the likelihood as

    L(β_0, β_1 | Y_1 = y_1, X_1 = x_1) = [F(β_0 + β_1 x_1)]^{y_1} [1 - F(β_0 + β_1 x_1)]^{1-y_1}.

The likelihood for the sample is

    L[β_0, β_1 | (Y_1, X_1) = (y_1, x_1), ..., (Y_n, X_n) = (y_n, x_n)]
        = Π_{t=1}^n [F(β_0 + β_1 x_t)]^{y_t} [1 - F(β_0 + β_1 x_t)]^{1-y_t}.

The log-likelihood is

    ln L(β_0, β_1 | ·) = Σ_{t=1}^n ( y_t ln F(β_0 + β_1 x_t) + (1 - y_t) ln[1 - F(β_0 + β_1 x_t)] ).

(We are able to construct ln L because of the strict inequality 0 < F(z) < 1.) The first-order condition for the estimator of the coefficients is the partial derivative of the log-likelihood with respect to each of the coefficients and is termed the score. The ML estimators B_0 and B_1 are the values that set the scores equal to zero:

    ∂ ln L(β_0, β_1 | ·)/∂β_i |_{β=B}
        = Σ_{t=1}^n [y_t - F(B_0 + B_1 x_t)] / ( F(B_0 + B_1 x_t)[1 - F(B_0 + B_1 x_t)] ) · ∂F(β_0 + β_1 x_t)/∂β_i |_{β=B} = 0,

for i = 0, 1. The MLE is consistent and asymptotically Gaussian. To determine the covariance matrix for the estimators, we construct the expected value of the Hessian conditional on X, for which we note ∂F(β_0 + β_1 x_t)/∂β_i = x_{it} f(β_0 + β_1 x_t). Because E(U|X) = 0, the terms involving the error drop out in expectation and

    -E[ ∂² ln L(β_0, β_1 | ·)/∂β ∂β' | X ] = Σ_{t=1}^n f_t² x_t x_t' / ( F_t(1 - F_t) ),

which is a positive semi-definite matrix. The estimator of the asymptotic variance of B is

    V = [ Σ_{t=1}^n f̂_t² x_t x_t' / ( F̂_t(1 - F̂_t) ) ]^{-1}.

If the inverse exists, the matrix is positive definite. If the inverse does not exist, the problem is likely multicollinear regressors. It does not make sense to compute robust standard errors. The reason: in the latent model we specify all conditional moments of Y|X. Therefore, if we believe the variance is misspecified, then the conditional mean must be misspecified as well. If we follow the classic regression model and assume that V_t is Gaussian, then F(·) is the distribution function of a standard Gaussian random variable.
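The score formula above can be checked numerically: the analytic expression Σ (y_t - F_t)/(F_t(1 - F_t)) f_t x_t must equal the finite-difference derivative of ln L with respect to β_1. The tiny sample below is made up for illustration; F is the Gaussian CDF (probit case).

```python
# Sketch: verify the probit score formula against a numerical derivative of
# the log-likelihood on a small made-up sample.
import math

def Phi(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def phi(z):
    return math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)

xs = [-1.0, 0.5, 1.5, 2.0]        # illustrative regressor values
ys = [0, 1, 0, 1]                 # illustrative outcomes

def lnL(b0, b1):
    return sum(y * math.log(Phi(b0 + b1 * x))
               + (1 - y) * math.log(1 - Phi(b0 + b1 * x))
               for x, y in zip(xs, ys))

def score_b1(b0, b1):
    return sum((y - Phi(b0 + b1 * x))
               / (Phi(b0 + b1 * x) * (1 - Phi(b0 + b1 * x)))
               * phi(b0 + b1 * x) * x
               for x, y in zip(xs, ys))

b0, b1, h = 0.1, 0.2, 1e-6
numeric = (lnL(b0, b1 + h) - lnL(b0, b1 - h)) / (2 * h)   # central difference
analytic = score_b1(b0, b1)
```

Here ∂F/∂β_1 = x_t f(β_0 + β_1 x_t), so the general score expression reduces to the summand coded above.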
The resulting ML estimators are termed probit estimators (because the inverse Gaussian CDF is termed the probit function in statistics) and are obtained through nonlinear optimization (as the score function is not a linear function of the coefficient estimators). Because the Gaussian distribution function cannot be expressed in closed form (that is, an integral must be used), many researchers assume that V_t has a logistic distribution. While the logistic density function is similar to the Gaussian and differs mainly in the tails, the logistic distribution function can be expressed in closed form as

    F(β_0 + β_1 x_t) = exp(β_0 + β_1 x_t) / [1 + exp(β_0 + β_1 x_t)].

The closed form expression for the logistic distribution delivers a simplified likelihood as well:

    L(β_0, β_1 | ·) = Π_{t=1}^n [ exp(β_0 + β_1 x_t) / (1 + exp(β_0 + β_1 x_t)) ]^{y_t} [ 1 / (1 + exp(β_0 + β_1 x_t)) ]^{1-y_t}
                    = exp( β_0 Σ_{t=1}^n y_t + β_1 Σ_{t=1}^n x_t y_t ) / Π_{t=1}^n [1 + exp(β_0 + β_1 x_t)].

Thus

    ln L(β_0, β_1 | ·) = β_0 Σ_{t=1}^n y_t + β_1 Σ_{t=1}^n x_t y_t - Σ_{t=1}^n ln[1 + exp(β_0 + β_1 x_t)].

The ML estimators, which are termed logit estimators, are the values B_0 and B_1 that satisfy

    ∂ ln L(β_0, β_1 | ·)/∂β_i |_{β=B} = Σ_{t=1}^n y_t x_{i,t} - Σ_{t=1}^n [ exp(B_0 + B_1 x_t) / (1 + exp(B_0 + B_1 x_t)) ] x_{i,t} = 0,

for i = 0, 1, with x_{0,t} = 1 and x_{1,t} = x_t. (As for the probit estimators, a nonlinear solution technique must be used.) An immediate consequence of the score for β_0 is that

    Σ_{t=1}^n y_t = Σ_{t=1}^n P̂(y_t = 1),

that is, the observed frequency of y_t = 1 equals the predicted frequency (here P̂(y_t = 1) = exp(B_0 + B_1 x_t) / [1 + exp(B_0 + B_1 x_t)]). (Note, the same feature holds for the linear probability model, because the OLS coefficient estimators satisfy the relation that the sum of observed values of the dependent variable, which yields the observed frequency, equals the sum of predicted values of the dependent variable, which yields the predicted frequency.)
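A nonlinear solution technique for the logit scores can be sketched with Newton's method on simulated data (true coefficients below are illustrative). At convergence the score for the intercept is zero, which delivers the observed-frequency-equals-predicted-frequency property.

```python
# Sketch: two-parameter logit fit by Newton's method; then check that the
# score for the intercept forces sum(y) = sum(p_hat).
import math
import random

random.seed(3)
n = 1000
xs = [random.gauss(0, 1) for _ in range(n)]
# Illustrative true coefficients b0 = 0.3, b1 = 0.8 (assumption)
ys = [1.0 if random.random() < 1.0 / (1.0 + math.exp(-(0.3 + 0.8 * x))) else 0.0
      for x in xs]

b0, b1 = 0.0, 0.0
for _ in range(25):                       # Newton iterations
    p = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]
    g0 = sum(y - pi for y, pi in zip(ys, p))                 # score for b0
    g1 = sum((y - pi) * x for y, pi, x in zip(ys, p, xs))    # score for b1
    w = [pi * (1 - pi) for pi in p]       # negative Hessian is X'WX
    h00 = sum(w)
    h01 = sum(wi * x for wi, x in zip(w, xs))
    h11 = sum(wi * x * x for wi, x in zip(w, xs))
    det = h00 * h11 - h01 * h01
    b0 += (h11 * g0 - h01 * g1) / det     # 2x2 solve of H * step = g
    b1 += (h00 * g1 - h01 * g0) / det

p_hat = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in xs]
```

Because the logit Hessian is globally negative definite, Newton's method from the origin converges here, and the fitted frequencies reproduce the observed frequency exactly (up to floating-point error).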
One additional advantage of the logistic assumption is that

    ln[ F(β_0 + β_1 x_t) / (1 - F(β_0 + β_1 x_t)) ] = β_0 + β_1 x_t.

The slope coefficient is interpreted as the effect of a one-unit change in the regressor on the logarithm of the odds ratio, where the odds ratio is the probability of a success (Y = 1) divided by the probability of a failure (Y = 0). For the special case in which we have a number of observations for each value of the regressor, a simpler estimator can be constructed. Suppose that the regressor takes K distinct values and that there are n_k observations on each value. For each of the distinct regressor values calculate the observed frequency of success, that is, construct p̂_k = (1/n_k) Σ_{t=1}^{n_k} y_t. The estimator of the coefficients is then obtained as the OLS coefficient estimator for the regression model

    ln[ p̂_k / (1 - p̂_k) ] = β_0 + β_1 x_k + u_k

for k = 1, ..., K. The method is sensible if n_k is large for each k. If n_k is not constant across k, then the error is heteroskedastic and weighted least squares should be used.

To perform hypothesis tests, any of the three test statistics can be used. As the tests are asymptotically equivalent (and the finite sample comparisons are specific to the model), simply choose the statistic that is easiest to compute. We begin with tests of exclusion restrictions, of which the leading example is the need to include additional regressors Z (perhaps indicators for region or industry). Set

    Y_t = F(X_t β + Z_t γ) + U_t.

(If Z_t consists only of functions of X_t, we have a pure functional form test.) The Wald test is computed directly in Stata. To construct an LR test, first estimate the full model via probit and obtain the estimated value of the log-likelihood, L̂_U. Next construct the probit estimate for the restricted model, in which the conditional mean is assumed to be F(X_t β), and obtain the estimated value of the log-likelihood, L̂_R. The LR test statistic is

    2( L̂_U - L̂_R ) → χ²(Q),   Q = dim(γ).

If Q is large, then the probit estimate of the full model can be difficult to construct.
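The grouped-data estimator above is easy to demonstrate. The sketch simulates K = 5 cells with many observations each from a logit model with made-up coefficients, forms the cell frequencies, and runs OLS of the log odds on the regressor.

```python
# Sketch: grouped-data logit estimator. With large n_k per cell, OLS of
# ln(p_hat_k / (1 - p_hat_k)) on x_k recovers the logit coefficients.
import math
import random

random.seed(4)
b0, b1 = -0.4, 0.7                 # illustrative true coefficients (assumption)
xks = [-2.0, -1.0, 0.0, 1.0, 2.0]  # K = 5 distinct regressor values
nk = 4000                          # equal, large cell sizes

logodds = []
for xk in xks:
    p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xk)))
    phat = sum(1 for _ in range(nk) if random.random() < p) / nk
    logodds.append(math.log(phat / (1.0 - phat)))   # observed log odds

# OLS of log odds on x (equal n_k, so no reweighting needed here)
K = len(xks)
xbar = sum(xks) / K
ybar = sum(logodds) / K
b1_hat = (sum((x - xbar) * (y - ybar) for x, y in zip(xks, logodds))
          / sum((x - xbar) ** 2 for x in xks))
b0_hat = ybar - b1_hat * xbar
```

With unequal n_k the cell log odds would have unequal variances, roughly 1/(n_k p_k(1 - p_k)), and the text's weighted least squares correction would apply.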
For large Q, the LM statistic is preferred, as only the restricted model is estimated. First, construct the probit estimate for the restricted model, B, and form

    F̂ = F(XB),   f̂ = f(XB),   Û = Y - F̂.

We then regress the residuals on both the included and excluded regressors, where we weight for efficiency:

    w_t Û_t = δ_1 w_t f̂_t X_t + δ_2 w_t f̂_t Z_t + V_t,   with w_t = [ F̂_t (1 - F̂_t) ]^{-1/2}.

Because the residuals sum to zero, there is no need for an intercept. The explained sum of squares from the regression is identical to the LM statistic. Alternatively, nR² can be used, as it is an asymptotically equivalent, although numerically distinct, statistic. Both are distributed as χ²(Q) random variables.

Although less common, there are more general restrictions of interest to test. To test for heteroskedastic errors, the latent model becomes

    Y_t* = X_t β + U_t,   U_t | X ~ N(0, e^{2 Z_t γ}).

We analyze the leading case, in which Z_t consists of all varying regressors (all but the intercept). With heteroskedastic errors, our calculations become

    P(Y_t = 1|X) = P(U_t > -X_t β | X) = P( e^{-Z_t γ} U_t > -e^{-Z_t γ} X_t β | X ) = Φ( e^{-Z_t γ} X_t β ).

As noted before, if the error in the latent model is heteroskedastic, the specification of the conditional mean is altered (hence we do not construct robust standard errors for the original specification). In particular, the conditional mean is no longer a single index model, as the regressors affect the response probability in two ways. To indicate the absence of a single index model, the response probability is often written as

    p(X) = m(Xβ; X, γ),

where the last two arguments emphasize the fact that regressors affect the response probability through more than the single index Xβ. The natural null hypothesis is H_0: γ = 0, under which the latent model is a standard probit model. As the restricted model is clearly the easiest to estimate, we again use the LM statistic. Again, construct the probit estimate for the restricted model, B, and form

    F̂ = F(XB),   f̂ = f(XB),   Û = Y - F̂.
We then regress the residuals on both the included regressors and the score for γ (recall, f̂_t multiplied by the excluded regressors forms the score for γ in the test of exclusion restrictions), where we weight for efficiency:

    w_t Û_t = δ_1 w_t f̂_t X_t + δ_2 w_t ∇_γ m(X_t β; X_t, γ)|_{γ=0} + V_t,   with w_t = [ F̂_t (1 - F̂_t) ]^{-1/2}.

For the heteroskedasticity example (in which, under H_0, γ = 0),

    ∇_γ m(X_t β; X_t, γ) = φ( e^{-Z_t γ} X_t β ) · (X_t β) · e^{-Z_t γ} · (-Z_t),

which evaluated at γ = 0 is

    ∇_γ m(X_t β; X_t, γ)|_{γ=0} = -φ(X_t β)(X_t β) Z_t.

The explained sum of squares from the regression is identical to the LM statistic. Alternatively, nR² can be used, as it is an asymptotically equivalent, although numerically distinct, statistic. Again, both statistics are χ²(Q) random variables.
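The gradient used in the LM regression for the heteroskedasticity test can be checked numerically: for m(γ) = Φ(e^{-Zγ} Xβ), the derivative at γ = 0 should be -φ(Xβ)(Xβ)Z. The index value and Z element below are illustrative.

```python
# Sketch: check the score for gamma in the heteroskedastic probit model
# against a numerical derivative at gamma = 0.
import math

def Phi(v):
    """Standard Gaussian CDF."""
    return 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))

def phi(v):
    """Standard Gaussian density."""
    return math.exp(-v * v / 2.0) / math.sqrt(2.0 * math.pi)

xb, z = 0.8, 1.5                  # illustrative single index X*beta and Z value

def m(g):
    """Response probability under heteroskedasticity parameter g."""
    return Phi(math.exp(-z * g) * xb)

h = 1e-6
numeric = (m(h) - m(-h)) / (2 * h)          # central difference at gamma = 0
analytic = -phi(xb) * xb * z                # -phi(X*beta) * (X*beta) * Z
```

The match confirms the chain-rule calculation: differentiating Φ(e^{-Zγ}Xβ) in γ brings down the factor Xβ·e^{-Zγ}·(-Z), which at γ = 0 reduces to the regressor used in the auxiliary regression.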