Models with Discrete Dependent Variables
These notes are mainly based on:
Damodar N. Gujarati. Basic Econometrics. Fourth Edition.
1. Concepts
In models with discrete dependent variables, the dependent variables can be
1. Binary or Dichotomous. Assume there are two political parties, Democratic or Republican. The
dependent variable here is the choice between the two political parties. We can let Y=1 if the vote is
for a Democratic candidate and Y=0 if the vote is for a Republican candidate;
2. Trichotomous. In the above example, if there are three parties, Democratic, Republican and
Independent, then the dependent variable is trichotomous.
3. Polychotomous or Multiple-category.
In research we may also have an ordinal dependent variable. This is an ordered categorical variable, such
as schooling (less than 8 years, 8-11 years, 12-14 years, and 15 or more years). If there is no inherent ordering
in the categories, we call the variable nominal, such as ethnicity (Black, White, Hispanic, Asian, and other).
Models with discrete dependent variables are a common kind of qualitative response model. In a model
where the dependent variable is quantitative, our objective is to estimate its expected, or mean, value given
the values of the independent variables. In models where the dependent variable is qualitative, our objective is to
find the probability of something happening, such as voting for a Democratic candidate, owning a house,
or belonging to a union. Hence, qualitative response regression models are often known as probability
models. The theoretical framework of probability models is:
Prob(event j occurs) = Prob(Y = j) = F(relevant effects; parameters)
For models with a discrete dependent variable, we need to consider the following questions:
1. How do we estimate models with a discrete dependent variable? Can the OLS procedure be applied here?
2. Are there special inference problems? That is, is the hypothesis testing procedure different here?
3. How can we measure the goodness of fit for such models?
2. Binary Response Regression Model
According to the theoretical framework of probability models, for the binary response regression model
Yi = β1 + β2 Xi + εi, we expect
lim (X → +∞) Prob(Y = 1) = 1 and lim (X → −∞) Prob(Y = 1) = 0
There are mainly three approaches to developing a probability model for a binary response variable:
A. The linear probability model (LPM)
B. The logit model
C. The probit model
The classification is based on the different distributions underlying the binary response regression models.
2.1 The linear probability model (LPM)
Consider the following regression model:
Yi = β1 + β2 Xi + εi    (1)
Here X = family income and Y = 1 if the family owns a house and 0 if it does not. Model (1)
looks like a typical linear regression model. But because the dependent variable is binary, or dichotomous, it
is called a linear probability model (LPM). This is because the conditional expectation of Yi given Xi,
E(Yi | Xi), can be interpreted as the conditional probability that the event will occur given Xi, that is,
Pr(Yi = 1 | Xi). Thus, in this example, E(Yi | Xi) gives the probability of owning a house for a family
whose income is the given amount Xi.
Assuming E(εi) = 0, we obtain from (1):
E(Yi | Xi) = β1 + β2 Xi    (2)
Let Pi = the probability that Yi = 1 and (1 − Pi) = the probability that Yi = 0. Then Yi has the distribution:

Yi      Probability
0       1 − Pi
1       Pi
Total   1

Thus, the LPM is based on the Bernoulli probability distribution, so E(Yi) = 0·(1 − Pi) + 1·Pi = Pi. Also,
E(Yi | Xi) = β1 + β2 Xi = Pi
That is, the conditional expectation of model (1) can be interpreted as the conditional probability of Yi = 1.
Since Pi lies between 0 and 1, we have the restriction 0 ≤ E(Yi | Xi) ≤ 1.
We face several problems for LPM:
(1) Non-normality of the Disturbance εi
Since εi = Yi − β1 − β2 Xi, the distribution of εi is:

              εi                 Probability
When Yi = 1   1 − β1 − β2 Xi    Pi
When Yi = 0   −β1 − β2 Xi       1 − Pi
So εi follows the Bernoulli distribution and cannot be assumed to be normally distributed. However, this
problem may not be critical: if the objective is point estimation, the normality assumption on the
disturbance is not necessary, and the OLS estimators remain unbiased. Also, by the central limit theorem,
the OLS estimators tend to be normally distributed as the sample size increases.
(2) Heteroscedastic Variance of the Disturbance
For the Bernoulli distribution the variance is p(1 − p), so Var(εi) = Pi(1 − Pi). Since
Pi = E(Yi | Xi) = β1 + β2 Xi, the variance of εi ultimately depends on the values of X and hence is not
homoscedastic. However, this problem can be solved using weighted least squares, with √(Pi(1 − Pi))
serving as the weights. In practice, because the true E(Yi | Xi) is unknown, Pi(1 − Pi) is unknown. We can run the
regression Yi = β1 + β2 Xi + εi despite the heteroscedasticity problem, use the fitted value Ŷi as the
estimate of the true E(Yi | Xi), and then use √(Ŷi(1 − Ŷi)) as the weights.
(3) Nonfulfillment of 0 ≤ E(Yi | Xi) ≤ 1
As a probability, E(Yi | Xi) must lie between 0 and 1. However, there is no guarantee that the predictions
from the model (the estimates of E(Yi | Xi)) will satisfy this restriction. This is the real problem with the LPM.
(4) Low R² Values
In the binary dependent variable model, for a given X, the Y values will be either 0 or 1. Therefore, all the Y
values will lie either along the X-axis or along the horizontal line corresponding to 1, and generally no LPM is
expected to fit such a scatter well. As a result, the conventional R² is quite low for such models, and it
is not advisable to use R² as a summary statistic in models with qualitative dependent variables.
[Figure: scatter of Y (0 and 1) against X, with the fitted LPM line Ŷ]
(5) The Linearity is Unrealistic
The LPM assumes that Pi = E(Y = 1 | X) increases linearly with X, that is, the marginal effect of X remains the
same throughout. In our example, a given change in X is assumed to have the same effect on the
probability of owning a house whether the income level is $100 or $100,000. In reality, however, at
very low or very high income levels a small change in income X will have very little effect on
the probability of owning a house.
Because of problems (3) and (5), the LPM is much less popular than the logit or probit model.
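As a quick illustration of problem (3), here is a minimal sketch, with purely made-up income data, that fits an LPM by OLS and shows the fitted "probabilities" escaping the [0, 1] interval at extreme incomes:

```python
import numpy as np

# Hypothetical data: family income (in $1000s) and home ownership (1/0).
X = np.array([6., 10., 14., 18., 22., 26., 30., 34.])
Y = np.array([0., 0., 0., 1., 0., 1., 1., 1.])

# Fit the LPM Y = b1 + b2*X by ordinary least squares.
A = np.column_stack([np.ones_like(X), X])
b, *_ = np.linalg.lstsq(A, Y, rcond=None)

# Predicted "probabilities" at incomes outside the sample range can
# escape the [0, 1] interval -- problem (3) above.
p_low = b[0] + b[1] * 2.0     # a very low income level
p_high = b[0] + b[1] * 60.0   # a very high income level
print(b, p_low, p_high)
```

With these numbers the prediction at X = 2 is negative and the prediction at X = 60 exceeds 1, which is exactly the restriction violation the text describes.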
2.2 The Logit Model
The logit model is based on the cumulative distribution function of the logistic distribution:
Pi = E(Y = 1 | Xi) = 1 / (1 + e^(−Zi)) = e^(Zi) / (1 + e^(Zi))    (3)
(In our example, Zi = β1 + β2 Xi.)
[Figure: the logistic CDF, with P rising from 0 to 1 as Z goes from −∞ to +∞]
As we can see from the figure, as Zi ranges from −∞ to +∞, Pi ranges between 0 and 1, and Pi is
nonlinearly related to Zi (and hence to Xi). Thus problems (3) and (5) are solved in the logit model. On
the other hand, because Pi is nonlinear not only in X but also in the β's, we cannot use the conventional OLS
procedure to estimate the parameters. So we apply the following transformations:
1 − Pi = 1 − e^(Zi) / (1 + e^(Zi)) = 1 / (1 + e^(Zi))
and
Pi / (1 − Pi) = e^(Zi) = e^(β1 + β2 Xi)    (4)
Here, Pi / (1 − Pi) is the odds ratio in favor of owning a house: the ratio of the probability that a family will own
a house to the probability that it will not. For instance, if Pi = 0.8, then the odds are 4 to 1 in
favor of the family owning a house.
Let
Li = ln(Pi / (1 − Pi)) = Zi = β1 + β2 Xi    (5)
Now Li, the log of the odds ratio, is linear not only in X but also in the parameters. L is called the logit,
and model (5) is called the logit model. In the logit model, β2 measures the change in L for a unit change in X,
that is, it tells how the log-odds in favor of owning a house change as income changes by one unit. Similarly, β1 is
the value of the log-odds in favor of owning a house when income is zero, an interpretation that may not have any
physical meaning.
Once we obtain the estimates of β1 and β2 from the logit model, we can estimate the probability of owning a house
from equation (3). The method of estimating the parameters of the logit model differs according to the kind of
data we have:
(1) Grouped or replicated data;
(2) Data at the individual, or micro, level.
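Equations (3) and (5) are inverses of each other, which is what lets us recover probabilities from estimated log-odds. A minimal sketch of this round trip:

```python
import math

def logistic(z):
    """Equation (3): map a log-odds value z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Equation (5): map a probability p to its log-odds."""
    return math.log(p / (1.0 - p))

# Round trip: logit(logistic(z)) recovers z (up to floating-point error).
z = 1.25
p = logistic(z)
print(p, logit(p))
```

Note that logistic(ln 4) = 0.8, matching the text's example of odds of 4 to 1 corresponding to Pi = 0.8.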
4
2.2.1 Grouped or Replicated Data
Consider the data given in the following table:
Table 1
Xi (income, in thousands of dollars):       6    8    10   13   15   20   25   30   35   40
Ni (number of families at income Xi):       40   50   60   80   100  70   65   50   40   25
ni (number owning a house among the Ni):    8    12   18   28   45   36   39   33   30   20
Corresponding to income level Xi, there are Ni families, and among the grouped Ni families there are ni
families that own a house.
Thus, p̂i = ni / Ni, the relative frequency, can be used as an estimate of the true Pi corresponding to each Xi.
(When the sample size becomes infinitely large, the probability of an event is the limit of the relative
frequency.) Therefore,
L̂i = ln(p̂i / (1 − p̂i)) = β1 + β2 Xi    (6)
Thus, when the number of observations Ni at each income level is reasonably large, we can estimate the
parameters of the logit model. Can we apply OLS to equation (6)? Not quite, because the disturbance term is
heteroscedastic. If Ni is fairly large and each observation in a given income class Xi is distributed
independently as a binomial variable, then
εi ~ N[0, 1 / (Ni Pi (1 − Pi))]
That is, εi follows the normal distribution with zero mean and variance equal to 1 / (Ni Pi (1 − Pi)).
The formal proof is as follows:
Let i be given and assume that each of the integers N1, ..., Ni is large. Let F(zi) = Pi = e^(zi) / (1 + e^(zi)).
Since Ni is large, p̂i is close to E(yi) = F(zi) = Pi by the law of large numbers. Also, by the central limit
theorem, the random variable √Ni (p̂i − pi) is approximately normally distributed with mean zero and
variance equal to:
var(yi) = F(zi)[1 − F(zi)] = pi(1 − pi)
Thus we can write:
p̂i = F(zi) + vi
where the errors vi are independent and normally distributed with mean zero and variance:
var(vi) = pi(1 − pi) / Ni
We still do not have a model that is linear in the parameter zi, however. To move in this direction, we use
Slutsky's theorem on convergence in probability. Since Ni is large:
F⁻¹(p̂i) ≈ F⁻¹(pi) = F⁻¹[F(zi)] = zi
and the random variable:
√Ni [F⁻¹(p̂i) − F⁻¹(pi)] = √Ni [F⁻¹(p̂i) − zi]
is approximately normally distributed with mean zero and variance:
pi(1 − pi) [dF⁻¹(pi)/dpi]² = pi(1 − pi) / {f[F⁻¹(pi)]}² = pi(1 − pi) / [f(zi)]²
This brings us to the approximating model:
F⁻¹(p̂i) = zi + ui
That is,
L̂i = ln(p̂i / (1 − p̂i)) = β1 + β2 Xi + ui
where the ui's are independent and normally distributed with mean zero and variance:
var(ui) = pi(1 − pi) / (Ni [f(zi)]²) = F(zi)[1 − F(zi)] / (Ni [f(zi)]²)
Because for the logit model f = F(1 − F), we have:
var(ui) = 1 / (Ni F(zi)[1 − F(zi)]) = 1 / (Ni Pi(1 − Pi))
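This asymptotic variance can be checked numerically. A small Monte Carlo sketch (the group size and true probability are arbitrary choices) simulates the sample logit and compares its variance with 1 / (N p (1 − p)):

```python
import numpy as np

rng = np.random.default_rng(0)
N, p = 1000, 0.3          # group size and true probability (arbitrary choices)
reps = 20000

# Simulate the sample proportion p_hat and the sample logit L_hat.
p_hat = rng.binomial(N, p, size=reps) / N
L_hat = np.log(p_hat / (1.0 - p_hat))

theoretical = 1.0 / (N * p * (1.0 - p))   # var(u_i) from the derivation
print(L_hat.var(), theoretical)
```

The simulated variance should agree with the theoretical value to within a few percent, as the delta-method approximation predicts for large N.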
Therefore, we need to use weighted least squares to estimate equation (6). Letting wi = Ni P̂i(1 − P̂i), we
multiply through by √wi and run the following regression:
√wi Li = β1 √wi + β2 √wi Xi + √wi εi, that is, Li* = β1 √wi + β2 Xi* + vi    (7)
where Li* = √wi Li, Xi* = √wi Xi and vi = √wi εi.
The results are as follows:
L̂i* = −1.59474 √wi + 0.07862 Xi*
t =   (−14.43619)    (14.56675)
We can interpret the results in several ways:
(A). Interpretation of the Estimated Logit Model
The coefficient of Xi* tells us that as weighted income increases by one unit ($1,000), the weighted log of
the odds in favor of owning a house goes up by about 0.08 units. But this is an indirect interpretation.
(B). Odds Interpretation
From equation (4) we have:
Pi / (1 − Pi) = e^(Zi) = e^(β1 + β2 Xi)
Since we have the estimates of β1 and β2, we can calculate the change in the odds in favor of
owning a house caused by a unit change in income Xi.
(C). Computing Probabilities
From equation (3) we have:
Pi = e^(Zi) / (1 + e^(Zi)) = e^(β1 + β2 Xi) / (1 + e^(β1 + β2 Xi))
Using the estimates of β1 and β2, we can calculate the estimated probability of owning a house at a given
income level.
For example, when X = 20 (income is $20,000),
Zi = β1 + β2 Xi = −1.59474 + 0.07862 × 20 = −0.022
P̂i = e^(Zi) / (1 + e^(Zi)) = e^(−0.022) / (1 + e^(−0.022)) = 0.49
That is, given an income of $20,000, the probability of a family owning a house is about 49 percent.
(D). Computing the Rate of Change of Probability
dP/dX = (∂P/∂Z)(∂Z/∂X). Since P = e^Z / (1 + e^Z),
∂P/∂Z = [e^Z (1 + e^Z) − e^Z · e^Z] / (1 + e^Z)² = [e^Z / (1 + e^Z)] · [1 / (1 + e^Z)] = P(1 − P)
so that
dP/dX = β2 P(1 − P)
To calculate the change in the probability of owning a house for a unit increase in income from the level of $20,000:
β̂2 P̂(1 − P̂) = 0.07862 × 0.49 × 0.51 = 0.01965
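The grouped-data procedure of equations (6) and (7) can be sketched end to end with the Table 1 data. This is an illustrative reconstruction, not the textbook's own computation, so the estimates should only come close to the reported −1.59474 and 0.07862:

```python
import numpy as np

# Table 1: income (in $1000s), families at that income, families owning a house.
X = np.array([6., 8., 10., 13., 15., 20., 25., 30., 35., 40.])
N = np.array([40., 50., 60., 80., 100., 70., 65., 50., 40., 25.])
n = np.array([8., 12., 18., 28., 45., 36., 39., 33., 30., 20.])

p_hat = n / N                              # estimated P_i
L_hat = np.log(p_hat / (1.0 - p_hat))      # sample logit, equation (6)
w = N * p_hat * (1.0 - p_hat)              # WLS weights
sw = np.sqrt(w)

# Transformed regression (7): sqrt(w)*L on sqrt(w) and sqrt(w)*X, no intercept.
A = np.column_stack([sw, sw * X])
b, *_ = np.linalg.lstsq(A, sw * L_hat, rcond=None)
b1, b2 = b

# Estimated probability of owning a house at X = 20 (about 0.49 in the text).
P20 = 1.0 / (1.0 + np.exp(-(b1 + b2 * 20.0)))
print(b1, b2, P20)
```

The fitted slope and intercept should land near the reported values, and the implied probability at X = 20 near 0.49.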
2.2.2 Ungrouped or Individual Data
An example of this kind of data is as follows:
Table 2
Family:                            1   2   3   4   5   6   7   8   9
Y (1 if owns house, 0 otherwise):  0   1   1   0   0   1   1   0   0
X (income, in thousands):          8   16  18  11  12  19  20  13  9
Here, Pi = 1 if a family owns a house and Pi = 0 if it does not. If we try to calculate the logit Li:
When a family owns a house: Li = ln(Pi / (1 − Pi)) = ln(1/0)
When a family does not own a house: Li = ln(Pi / (1 − Pi)) = ln(0/1)
Obviously, these expressions are meaningless. Therefore, with data at the micro, or individual, level, we
cannot estimate the logit model by the standard OLS method and must resort to the maximum likelihood
(ML) method to estimate the parameters.
Since
Pi = 1 / (1 + e^(−(β1 + β2 Xi)))
we do not actually observe Pi but only the outcome: Y = 1 if an individual owns a house and Y = 0 if
the individual does not. Since each Yi is a Bernoulli random variable, we can write:
Pr(Yi = 1) = Pi and Pr(Yi = 0) = 1 − Pi
Suppose we have a random sample of N observations and let fi(Yi) denote the probability that Yi = 1 or 0.
Then the likelihood function is:
f(Y1, Y2, ..., YN) = ∏(i=1 to N) fi(Yi) = ∏(i=1 to N) Pi^(Yi) (1 − Pi)^(1 − Yi)
Since it is a little awkward to manipulate, we take its natural logarithm to obtain the log-likelihood function:
ln f(Y1, Y2, ..., YN) = Σ(i=1 to N) [Yi ln Pi + (1 − Yi) ln(1 − Pi)]
                      = Σ(i=1 to N) Yi ln(Pi / (1 − Pi)) + Σ(i=1 to N) ln(1 − Pi)
Since
1 − Pi = 1 / (1 + e^(β1 + β2 Xi)) and ln(Pi / (1 − Pi)) = β1 + β2 Xi
we have:
ln f(Y1, Y2, ..., YN) = Σ(i=1 to N) Yi (β1 + β2 Xi) − Σ(i=1 to N) ln(1 + e^(β1 + β2 Xi))
Next, we differentiate the log-likelihood with respect to β1 and β2, set the resulting equations to zero, and
solve. We then apply the second-order condition of maximization to verify that the values of the parameters
we have obtained do in fact maximize the likelihood function.
However, the first-order equations are nonlinear in the parameters and cannot be solved analytically.
Consequently, the maximum likelihood estimates must be obtained by iterative numerical techniques. All of
the commonly used approaches to the numerical solution of the likelihood equations derive from Newton's
method, also called the Newton-Raphson method. To learn more about the Newton-Raphson method, you can read
the following books:
1. Econometrics of Qualitative Dependent Variables by Christian Gourieroux, or
2. Advanced Econometric Methods by Thomas B. Fomby, R. Carter Hill and Stanley R. Johnson.
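A minimal Newton-Raphson sketch for the logit log-likelihood above may help fix ideas. The data here are made up for illustration (chosen so the outcomes overlap and the MLE exists); real software adds step control and convergence checks:

```python
import numpy as np

# Hypothetical individual-level data (income in $1000s, ownership 0/1).
X = np.array([5., 8., 10., 12., 13., 15., 16., 18., 20., 22.])
Y = np.array([0., 0., 1., 0., 1., 1., 0., 1., 1., 1.])
A = np.column_stack([np.ones_like(X), X])   # design matrix [1, X]

beta = np.zeros(2)
for _ in range(25):
    P = 1.0 / (1.0 + np.exp(-A @ beta))     # fitted probabilities
    grad = A.T @ (Y - P)                    # score (first derivative)
    W = P * (1.0 - P)
    H = -(A * W[:, None]).T @ A             # Hessian of the log-likelihood
    beta = beta - np.linalg.solve(H, grad)  # Newton-Raphson update

print(beta)
```

Because the logit log-likelihood is globally concave, these iterations converge quickly whenever the MLE exists (i.e., the 0s and 1s overlap in X).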
Several points are noteworthy about using the method of maximum likelihood in the binary dependent
variable model:
A.
Asymptotic estimated standard errors
The method of maximum likelihood is generally a large-sample method, so the estimated standard errors are
asymptotic;
B.
Measure of statistical significance
If the sample size is reasonably large, the t distribution converges to the normal distribution. Thus we use
the standard normal Z statistic instead of the t statistic to evaluate statistical significance.
C.
Measure of goodness of fit
The conventional R² is not a meaningful measure of goodness of fit for binary regressand models. Instead, there
are various measures designed for such models, collectively called pseudo R².
Often the goodness-of-fit measures are implicitly or explicitly based on comparison with a model that
contains only a constant term. Let ln L1 be the maximum log-likelihood value of the
model of interest and let ln L0 be the maximum value of the log-likelihood function when all parameters,
except the intercept, are set to zero. Because the log-likelihood is a sum of log probabilities,
ln L0 ≤ ln L1 ≤ 0. The larger the difference between the two log-likelihood values, the more the extended
model adds to the very restrictive model.
To compute ln L0, it is not necessary to estimate a logit model with an intercept term only. If there is only a
constant term in the model, the MLE for P is:
P̂ = N1 / N
where N1 = Σi Yi. That is, the estimated probability is equal to the proportion of ones in the sample.
Hence,
ln L0 = Σ(i=1 to N) Yi ln(N1/N) + Σ(i=1 to N) (1 − Yi) ln(1 − N1/N) = N1 ln(N1/N) + (N − N1) ln(1 − N1/N)
The value of ln L1 can be obtained from the computer package.
McFadden R² is one measure of goodness of fit in binary regressand models:
McFadden R² = 1 − ln L1 / ln L0
Conceptually, ln L1 is equivalent to the RSS and ln L0 to the TSS in the linear regression. McFadden R²
ranges between 0 and 1.
Another simple measure is the count R², defined as:
Count R² = number of correct predictions / total number of observations
Since the regressand in the logit model takes the value 1 or 0, to calculate the count R² we need to classify
each predicted probability as 1 or 0: if the predicted probability is greater than 0.5 we classify it as 1, and
if it is less than 0.5 we classify it as 0.
However, it should be noted that what matters most in binary regressand models is the expected
signs of the regression coefficients and their statistical and/or practical significance.
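Both measures can be computed directly from the outcomes and fitted probabilities. A sketch (the outcomes and fitted values below are invented for illustration):

```python
import numpy as np

def pseudo_r2(Y, P):
    """McFadden and count R^2 from outcomes Y and fitted probabilities P."""
    lnL1 = np.sum(Y * np.log(P) + (1 - Y) * np.log(1 - P))
    N1, N = Y.sum(), len(Y)
    lnL0 = N1 * np.log(N1 / N) + (N - N1) * np.log(1 - N1 / N)
    mcfadden = 1.0 - lnL1 / lnL0
    # Count R^2: classify P > 0.5 as 1 and compare with the observed Y.
    count = np.mean((P > 0.5) == (Y == 1))
    return mcfadden, count

Y = np.array([0, 0, 1, 0, 1, 1, 0, 1])
P = np.array([0.2, 0.3, 0.7, 0.4, 0.6, 0.8, 0.6, 0.9])  # hypothetical fits
print(pseudo_r2(Y, P))
```

Here 7 of the 8 classifications match the observed outcomes, so the count R² is 0.875.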
D. Likelihood ratio statistic (LR)
To test the null hypothesis that all the slope coefficients are simultaneously equal to zero, the equivalent of
the F test in the linear regression model is the likelihood ratio statistic:
LR = −2(ln L0 − ln L1)
Under the null hypothesis, the LR statistic follows the χ² distribution with degrees of freedom equal to the
number of explanatory variables (excluding the intercept). The null hypothesis that all the slope coefficients
are simultaneously equal to zero is rejected for large values of the likelihood ratio.
Problems:
1. For the data in Table 1, interpret the logit regression results in the ways shown in the
notes above at income level X = 35:
A. What is the change in the odds in favor of owning a house if income increases by one unit
($1,000) at income level X = 35?
B. Calculate the estimated probability of owning a house at income level X = 35.
C. Calculate the change in the probability of owning a house for a unit increase in income from the level
of $35,000.
2. Suppose we use a sample of 32 observations to estimate a binary regressand logit model with 3
independent variables. In this sample, the number of ones in the dependent variable is 11. From the
computer package we know the unrestricted log-likelihood for this model is −12.890. Calculate the
likelihood ratio statistic and test, at the 0.05 level, the hypothesis that the coefficients on the 3 independent
variables are all zero.
Answers to problems:
Problem 1.
A. Pi / (1 − Pi) = e^(Zi) = e^(β1 + β2 Xi)
e^(0.07862) = 1.0817
1.0817 − 1 = 0.0817 = 8.17%
So the odds in favor of owning a house increase by about 8.17% if income increases by one unit ($1,000) at
income level X = 35.
B. e^(β1 + β2 Xi) = e^(−1.59 + 0.078 × 35) = e^(1.14) = 3.13
P̂ = e^(β1 + β2 Xi) / (1 + e^(β1 + β2 Xi)) = 3.13 / (1 + 3.13) = 0.76
Thus, the estimated probability of owning a house at income level X = 35 is 0.76.
C. β̂2 P̂(1 − P̂) = 0.07862 × 0.76 × 0.24 = 0.0143
So the change in the probability of owning a house for a unit increase in income from the level of $35,000 is
0.0143.
Problem 2.
The restricted log-likelihood is:
ln L0 = N1 ln(N1/N) + (N − N1) ln(1 − N1/N) = 11 ln(11/32) + 21 ln(21/32) = −20.5917
Thus, LR = −2(ln L0 − ln L1) = −2 × [−20.5917 − (−12.890)] = 15.404
The critical value from the chi-squared distribution with 3 degrees of freedom at the 0.05 level is 7.81. Hence,
we reject, at the 0.05 level, the hypothesis that the coefficients on the 3 independent variables are all zero.
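The arithmetic of Problem 2 can be verified in a few lines (the critical value 7.81 is the 0.05 quantile of χ²(3), taken as given here rather than computed):

```python
import math

N, N1 = 32, 11           # sample size and number of ones
lnL1 = -12.890           # unrestricted log-likelihood (from the problem)

# Restricted (intercept-only) log-likelihood and the LR statistic.
lnL0 = N1 * math.log(N1 / N) + (N - N1) * math.log(1 - N1 / N)
LR = -2.0 * (lnL0 - lnL1)
print(lnL0, LR, LR > 7.81)   # reject H0 when LR exceeds the critical value
```

This reproduces ln L0 ≈ −20.5917 and LR ≈ 15.404, so the rejection stands.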