Topic 12. Binary Logistic Regression

Binary logistic regression is a type of regression analysis that is used to estimate the relationship between a dichotomous dependent variable and dichotomous-, interval-, and ratio-level independent variables. Many different variables of interest are dichotomous – e.g., whether or not someone voted in the last election, whether or not someone is a smoker, whether or not one has a child, whether or not one is unemployed, etc. These types of variables are often referred to as discrete or qualitative. Many discrete or qualitative variables can be thought of as events.

Dichotomous or dummy variables are usually coded 1, indicating "success" or "yes," and 0, indicating "failure" or "no." The mean of a dichotomous variable coded 1 and 0 is equal to the proportion of cases coded 1, which can also be interpreted as a probability. For example, for the ten scores 1 1 1 1 1 1 0 0 0 0, the mean is 6 / 10 = .6, which is the probability that any one case drawn from these ten has a score of 1.

For quite a while, researchers used OLS regression to analyze dichotomous outcomes. This was based on the idea that predicted values (ŷ) – based on the regression results – generally range from 0 to 1 and are equivalent to predicted probabilities, predicted proportions, and predicted percents of "success" given values on the independent variables. In other words, if we regressed a dummy variable, voted or not, on education and got the estimate b = .25, then we could say that a one-unit increase in education increases the probability of voting by .25. Equivalently, a one-unit increase in education increases the proportion voting by .25, or the percent voting by 25 percentage points.

Due to a number of conceptual and statistical problems, however, people no longer use OLS regression to analyze dichotomous dependent variables. There are a number of alternative approaches to modeling dichotomous outcomes, including logistic regression, probit analysis, and discriminant function analysis. Logistic regression is by far the most common, so that will be our main focus. Additionally, we will focus on binary logistic regression as opposed to multinomial logistic regression, which is used for nominal variables with more than two categories.

OLS Regression with a Dichotomous Dependent Variable

What is wrong with using OLS regression with dichotomous dependent variables? There are a number of problems; a short sketch after this list illustrates the first, third, and fourth.

1. One of the regression assumptions that we discussed is that the dependent variable is quantitative (at least at the interval level), continuous (can take on any numerical value), and unbounded. A person's score on the dependent variable is assumed to be a function of their score on each independent variable. Therefore, the dependent variable must be free to take on any value that is predicted by the combination of independent variables. If the dependent variable does not meet these requirements (e.g., it is dichotomous), then predicted scores on the dependent variable may lie outside possible limits. When you use OLS regression with a dichotomous dependent variable, predicted probabilities (based on the estimated OLS regression equation) are not bounded by the values of 0 and 1.

Why is this a problem? In the real world, probabilities (and percents, for that matter) can never be less than 0 and can never be greater than 1. For example, the chance of rain can't be less than 0% and can't be more than 100%: 0% means that it's impossible and 100% means it's guaranteed. You can't get any more or less likely than that!
With dummy dependent variables and OLS regression, it is not uncommon for predicted probabilities to be less than 0 or greater than 1. The likelihood of this increases as the difference between the number of successes and failures increases. In other words, if the split is 90% with a score of 1 and 10% with a score of 0, then you will probably encounter impossible predicted probabilities.

2. Another OLS multiple regression assumption is that the relationship between Y and X is linear and additive in the population. Our estimates cannot be very good if we specify a linear and additive relationship when, in fact, in the population, the relationship is non-linear and/or non-additive.

In many cases, it is not unreasonable to assume that the relationship between two variables is non-linear. The example that Pampel uses in the book is that of income and home ownership. A $10,000 increase in income probably increases the probability of owning a home more for someone with an initial income of $40,000 than for someone with an initial income of $0. Also, an additional $10,000 probably does not have much of an influence on the likelihood that a very rich person owns a home – e.g., earning $1,001,000 versus earning $1,000,000. So a more appropriate functional form (rather than a line) might be an s-shaped curve:

[Figure: an s-shaped curve, with the probability of Y on the vertical axis (0 to 1) and X on the horizontal axis.]

This type of curve suggests that one-unit changes in the independent variable have different effects on the dependent variable at different levels of the independent variable. It takes a much larger increase in X to have the same effect on Y at the extreme ends of the curve. One way to think about this is to consider the fact that the slope of this curve changes at different points; in essence, a different tangent line can be drawn at each point along the curve.

3. Another regression assumption is that the error term is normally distributed. Remember that the error term summarizes all of the causes of the dependent variable not included in the model as well as errors in the functional form of the equation, measurement error, and the randomness in human behavior. The assumption of normality allows you to do hypothesis testing. If the error term is not normally distributed, then we cannot use z (t) to find the probability under the curve.

The error term is not normally distributed when you use OLS regression with a dichotomous dependent variable because, for any value of X, there are only two possible values that the residuals can take. A residual is defined as the observed value on the dependent variable minus the predicted value given X. Here's an example. Consider 10 people with a value of 2 on the independent variable and an estimated regression equation of:

ŷi = .03 + .48 * xi

The residual is equal to yi – ŷi, where ŷi = a + (b * xi). So, after substituting, the residual is equal to yi – (a + (b * xi)). For the value X = 2, there are only two possible values for the residual because there are only two possible observed values for Y (1 and 0):

1 – (.03 + .48 * 2) = .01
0 – (.03 + .48 * 2) = -.99

Thus, for any value of X, there are only two possible residuals, so the distribution is not normal.

4. Another assumption is that of homoskedasticity – that the variance of the error term is constant across all values of the independent variables. Homoskedasticity means that the predicted values of the dependent variable are as good (or as bad) at all levels of the independent variable. This is violated because the residuals vary with the value of X. The linear OLS regression consistently underestimates the slope at moderate levels of X and consistently overestimates it at extreme levels of X. Heteroskedasticity leads to biased estimates of the standard errors, which we use in our t tests. Poor estimates increase the chance of drawing incorrect conclusions in hypothesis testing.
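To see problems 1, 3, and 4 concretely, here is a small Python sketch using made-up data (the ten cases and the predictor values are hypothetical, chosen only for illustration). It fits an OLS line to a 0/1 outcome and shows that the predicted "probabilities" fall outside 0 and 1 and that, for each value of X, only two residuals are possible.

# Illustrative sketch (not from the notes): OLS fit to a 0/1 outcome with made-up data.
import numpy as np

x = np.arange(1, 11)                                        # hypothetical predictor, 1..10
y = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1], dtype=float)   # hypothetical 0/1 outcome

print(y.mean())                          # mean of a dummy = proportion of 1s = P(y = 1)

# OLS fit of y = a + b*x via least squares
X = np.column_stack([np.ones_like(x, dtype=float), x])
(a, b), *_ = np.linalg.lstsq(X, y, rcond=None)
yhat = a + b * x
print(np.round(yhat, 2))                 # "predicted probabilities" fall below 0 and above 1

resid = y - yhat
print(np.round(resid, 2))                # for each x, only two residuals are possible: 1 - yhat or 0 - yhat

Notice that the spread of the residuals also changes with X, which is the heteroskedasticity problem described in point 4.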
The Logit Transformation

So what can we do? As I mentioned earlier, many topics of interest are dichotomous. The "miracle" of logistic regression is that, through the logit transformation, it linearizes the non-linear relationship between X and the probability of Y. It does this through the use of odds and logarithms. So, the logit is a nonlinear function that represents the s-shaped curve. Let's look more closely at how this works.

Probabilities express the likelihood of an event as a proportion of both occurrences and non-occurrences. In other words, probabilities are defined as the number of occurrences divided by the number of occurrences plus the number of non-occurrences. So, if you have a sample of 4,000 people and 3,000 are married, the probability of being married is .75 (there is a 75% chance of being married): 3000 / (3000 + 1000) = .75. Probabilities cannot be less than 0 and cannot be greater than 1. In other words, they are bounded by 0 and 1.

Odds, by contrast, are defined as the likelihood of occurrence divided by the likelihood of non-occurrence. Thus, the odds of being married for our example are: 3000 / 1000 = 3. What difference does dividing only by the number of non-occurrences make? It removes the upper limit of 1. But wait...that's not all...odds are also non-linear. Consider the examples in the Pampel text (p. 11).

P    1 – P   Odds
.1   .9      .111
.2   .8      .250
.3   .7      .429
.4   .6      .667
.5   .5      1.000
.6   .4      1.500
.7   .3      2.333
.8   .2      4.000
.9   .1      9.000

The same change (a .1 increase in P) leads to increasingly large increases in the odds. Notice that the odds are still bounded at the lower end: it is impossible for the odds to fall below 0. So, transforming the probabilities into odds has removed the upper limit. We are halfway there. The next step is to take the natural logarithm of the odds. Here is the example from Pampel:

P    1 – P   Odds    Ln odds
.1   .9      .111    -2.198
.2   .8      .250    -1.386
.3   .7      .429    -.846
.4   .6      .667    -.405
.5   .5      1.000   0
.6   .4      1.500   .405
.7   .3      2.333   .847
.8   .2      4.000   1.386
.9   .1      9.000   2.197

Taking the natural log of the odds eliminates the floor of zero. Taking the natural log of a number above 0 and below 1 yields a negative number. Odds cannot be less than zero, but all odds less than 1 yield natural logs that are negative...the floor is gone. Taking the natural log of the number 1 yields 0. Finally, taking the natural logarithm of a number greater than 1 yields a positive number.

Notice, however, that the distribution of logged odds is symmetrical around 0: the same size increase or decrease in the probabilities has the same absolute value in logged odds. It is also important to point out that the difference between successive logged odds is not constant. For example, 2.197 – 1.386 = .811, while 1.386 – .847 = .539. What does this mean? The logit transformation stretches the distribution at the extreme ends, so the same one-unit change in X (the independent variable) leads to increasingly smaller gains in the probability of Y. In essence, it yields the s-shaped curve.
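If you would like to verify these numbers yourself, here is a minimal Python sketch (mine, not Pampel's) that reproduces the table of probabilities, odds, and logged odds:

# Sketch reproducing the odds / logged-odds table above for p = .1 to .9.
import numpy as np

p = np.linspace(0.1, 0.9, 9)
odds = p / (1 - p)          # occurrences divided by non-occurrences; no upper bound, floor at 0
log_odds = np.log(odds)     # the natural log removes the floor; symmetric around p = .5 (log odds = 0)

for pi, o, lo in zip(p, odds, log_odds):
    print(f"p = {pi:.1f}   odds = {o:6.3f}   ln(odds) = {lo:6.3f}")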
The linear relationship between X and the logged odds is given by the following formula:

Logged odds: ln(Pi / (1 – Pi)) = b0 + b1X1i + b2X2i

Notice that there is no error term in the model. The error term is not necessary. "The random component of the model is inherent in the modeling itself – the logit equation, for example, provides the expression for the probability that an event will occur. For each observation the occurrence or non-occurrence of that event comes about through a chance mechanism determined by this probability, rather than by a draw from a bowl of error terms" (Kennedy 1998, p. 234).

The right-hand side of the equation looks like a normal linear regression equation, but the left-hand side is the logged odds rather than a probability. This equation can be rewritten as follows:

Odds: Pi / (1 – Pi) = e^b0 * e^(b1X1i) * e^(b2X2i)

Probabilities: Pi = E(Yi = 1 | X1i, X2i) = 1 / (1 + e^-(b0 + b1X1i + b2X2i))

This last form is the cumulative logistic distribution function, which you will see in some statistical output.

Estimation

Fantastic...logistic regression allows us to estimate the relationship between dichotomous dependent variables and dichotomous, interval, and ratio independent variables. But...how does it work? Where do the estimates of the b's come from? Logistic regression uses maximum likelihood estimation to generate estimates of the b's. Specifically, ML uses the observed data and probability theory to find the most likely or the most probable population value given the sample data (observations). In logistic regression, the formula that is used to determine the population value most likely to yield the sample data is given by the likelihood function (the product is taken over all cases i):

LF = Π [ Pi^yi * (1 – Pi)^(1 – yi) ]

The likelihood function is an expression for the likelihood of observing the pattern of occurrences (y = 1) and non-occurrences (y = 0) of an event in a given sample. In other words, it tells us the probability of getting our sample data from a population with probabilities equal to Pi. The yi's above can be either 1 or 0, depending on the score of person i on the dependent variable. The Pi's refer to predicted probabilities on the dependent variable (there is one for every person in the sample) given the person's scores on the independent variable(s). You plug predicted probabilities into this equation just like we plugged possible population proportions into the binomial probability distribution. The predicted probabilities are from a logistic regression model. Compare two candidate sets of predicted probabilities for the same four cases:

yi   Pi    Pi^yi * (1 – Pi)^(1 – yi)
1    0.1   0.100000
1    0.2   0.200000
0    0.8   0.200000
0    0.9   0.100000
           LF = 0.000400

yi   Pi    Pi^yi * (1 – Pi)^(1 – yi)
1    0.8   0.800000
1    0.9   0.900000
0    0.1   0.900000
0    0.2   0.800000
           LF = 0.518400

How does the software package do a logistic regression to get the predicted probabilities? It does this by using OLS regression to get a first set of estimates. These estimates yield predicted probabilities for each case, which are plugged into the likelihood function and generate a score for those estimates. The score is equal to the likelihood of obtaining the observed sample data (the combination of 1s and 0s) from a population with that particular set of slope estimates. Next, based on sophisticated computer algorithms, it chooses a new set of estimates, which is used to generate a new set of predicted probabilities for each case. These are plugged into the likelihood function and generate a new score (a new probability) for this new estimate.
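Here is a small Python sketch of the likelihood calculation for the two candidate sets of predicted probabilities tabled above; it also shows the log likelihood, which is the quantity the software actually works with (more on that below).

# Sketch: evaluating LF = prod( Pi^yi * (1 - Pi)^(1 - yi) ) for two candidate sets of
# predicted probabilities, using the four cases from the tables above.
import numpy as np

y = np.array([1, 1, 0, 0])                 # observed outcomes for four cases

def likelihood(p, y):
    return np.prod(p**y * (1 - p)**(1 - y))

p_bad  = np.array([0.1, 0.2, 0.8, 0.9])    # predicts low P for the 1s, high P for the 0s
p_good = np.array([0.8, 0.9, 0.1, 0.2])    # predicts high P for the 1s, low P for the 0s

print(likelihood(p_bad, y))                # 0.0004
print(likelihood(p_good, y))               # 0.5184: far more likely to have produced these data

# The log likelihood replaces the product with a sum; -2 * (log likelihood) is what SPSS reports.
ll = np.sum(y * np.log(p_good) + (1 - y) * np.log(1 - p_good))
print(ll, -2 * ll)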
This process is repeated over and over until the likelihood function is maximized – until increases in the number become extremely small with successive attempts. Believe me, you don't want to do this by hand.

In the book, Pampel talks about the log likelihood function. The log likelihood function is another version of the likelihood function – it is simply the log of the likelihood function. Why does the computer program use this instead of the likelihood function? Because it is easier to do addition than multiplication. By taking the log of the likelihood function, the different components of the equation can be added together instead of multiplied together. One rule of logarithms states that log(a*b) = log a + log b.

Summary

So, to summarize, when we are interested in examining the relationship between a dichotomous dependent variable and dichotomous, interval, and ratio independent variables, we can use logistic regression. Logistic regression uses maximum likelihood to generate the estimates of the slopes. Logistic regression yields unbiased and efficient estimates of the coefficients, and OLS regression does not. Logistic regression is now available in most statistical packages. It is fairly simple to run in SPSS.

Getting Results from SPSS and Interpretation

For our example, we will examine people's attitudes toward women in politics. Our dependent variable is dichotomous: whether or not the respondent agrees that men are better suited emotionally for politics (1 = yes). Independent variables include: sex (male), race (white), age, marital status (married), the number of children (childs), and education (educ). We will run two logistic regressions. In the first, we will control for all variables except education. In the second, we will control for all variables.

Here is the process for using the menu system: Analyze, Regression, Binary Logistic. That's all there is to it. Once you have entered the dependent and independent variables, click on Paste and run the command from the syntax window. Here is the command syntax for SPSS for the first model:

LOGISTIC REGRESSION VAR=menpol2
  /METHOD=ENTER male white age married childs
  /CRITERIA PIN(.05) POUT(.10) ITERATE(20) CUT(.5) .

Here is the output:

Logistic Regression

Case Processing Summary
Unweighted Cases(a)                      N      Percent
Selected Cases   Included in Analysis    1436   93.9
                 Missing Cases           94     6.1
                 Total                   1530   100.0
Unselected Cases                         0      .0
Total                                    1530   100.0
a. If weight is in effect, see classification table for the total number of cases.

Dependent Variable Encoding
Original Value   Internal Value
.00  no          0
1.00 yes         1

SPSS uses a listwise deletion of missing cases procedure for logistic regression – cases with missing data on any variable in the model are dropped from the analysis. In this first model, we lose 94 cases (or 6.1% of our sample).

Block 0: Beginning Block

Classification Table(a,b)
                               Predicted
Observed                 no     yes    Percentage Correct
Step 0   no              729    0      100.0
         yes             707    0      .0
         Overall Percentage            50.8
a. Constant is included in the model.
b. The cut value is .500

Notice that the n = 1436 (729 + 707). 50.8% is the percent that SPSS would guess correctly if it guessed "no" for every case (729/1436 = .508). Without any additional information (e.g., independent variables), this would be the best guess because "no" is the modal category.

Variables in the Equation
                  B      S.E.   Wald   df   Sig.   Exp(B)
Step 0  Constant  -.031  .053   .337   1    .562   .970
Variables not in the Equation
                            Score    df   Sig.
Step 0  Variables  MALE      .090    1    .764
                   WHITE     .089    1    .766
                   AGE       59.009  1    .000
                   MARRIED   5.646   1    .017
                   CHILDS    14.435  1    .000
        Overall Statistics   67.258  5    .000

Nothing terribly important here. This first model, or baseline model, is just used to calculate the –2 log likelihood value for the fully unconditional or null model (the model with no variables), based on the likelihood function:

LF = Π [ Pi^yi * (1 – Pi)^(1 – yi) ]

I believe that the "scores" are chi-square statistics that test the null hypothesis that the variable is not associated with the dependent variable.

Block 1: Method = Enter

Omnibus Tests of Model Coefficients
                Chi-square   df   Sig.
Step 1  Step    68.454       5    .000
        Block   68.454       5    .000
        Model   68.454       5    .000

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      1921.927            .047                   .062

SPSS does not actually give you the baseline –2 log likelihood value. You have to calculate it based on the chi-square and –2 log likelihood values listed above. 1921.927 is the –2 log likelihood for the model with independent variables. 68.454 is the improvement in this model over the baseline –2 log likelihood value. So the baseline value is 1990.381 (1921.927 + 68.454 = 1990.381). "0" = perfect fit; the closer to 0, the better the fit. These numbers do not have any useful meaning in and of themselves. The larger the improvement in the model, the better able the independent variables are to predict the outcome. More correctly, the smaller the –2 log likelihood, the more probable the population estimates given the sample data (smaller because it is multiplied by negative 2).

The difference between these two –2 log likelihood values can be thought of as a test of the significance of the whole model. If the chi-square value is significant, then it means that at least one of the independent variables has a significant effect on the dependent variable. It is, therefore, roughly equivalent to the F test in OLS multiple regression.

Another way to evaluate the fit of the model is to examine the pseudo r-squared values reported in the output. As Pampel mentions in the Sage text, the various pseudo r-squareds are not equivalent to the r-squared in OLS multiple regression. In OLS multiple regression, the r-squared indicates the amount of variation in the dependent variable that is explained by the independent variables – in other words, the degree to which we are able to accurately predict the observed values on the dependent variable over guessing the mean. The total variation in the dependent variable is defined as the sum of squared deviations between the mean of the dependent variable and each case's observed score. The explained variation is defined as the sum of squared deviations between the predicted score on the dependent variable and the mean. Thus, the r-squared is defined as the explained variation divided by the total variation. This type of summary measure of the goodness of fit is commonly referred to as a proportional reduction of error measure.

The pseudo r-squareds that are available in logistic regression use the same logic. The simplest pseudo r-squared is simply the improvement in the –2 log likelihood divided by the –2 log likelihood of the baseline model. This indicates the proportional improvement in the model, much like the r-squared in linear regression. A number of different alternative pseudo r-squareds are available. Two included in the SPSS output are the Cox and Snell pseudo r-squared and the Nagelkerke pseudo r-squared. Others discussed in the Pampel book include: Aldrich and Nelson, Hagle and Mitchell, and McKelvey and Zavoina. All of these pseudo r-squareds indicate the goodness of fit of the model – that is, the proportional reduction in the chi-square value from the baseline to the fitted model.

What is the difference between all of the different pseudo r-squareds? The –2 log likelihood value is partially determined by sample size and the number of variables. Also, some pseudo r-squareds are not bounded by 0 and 1. Different versions of the pseudo r-squared correct for one or both of these problems. I would suggest presenting several in your output. In terms of those available from SPSS, the Nagelkerke version is based on the Cox and Snell pseudo r-squared, but has a value of 1 for the upper limit.
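As a quick check on the arithmetic described above, here is a short Python sketch that recovers the baseline –2 log likelihood and the simplest pseudo r-squared from the values in the SPSS output:

# Sketch of the calculations described above, using the values from the Model Summary
# and Omnibus Tests tables.
model_2ll = 1921.927          # -2 log likelihood of the fitted model
chi_square = 68.454           # improvement over the baseline model

baseline_2ll = model_2ll + chi_square
print(baseline_2ll)           # 1990.381, the -2 log likelihood of the null model

pseudo_r2 = chi_square / baseline_2ll
print(round(pseudo_r2, 3))    # about .034: proportional improvement over the baseline model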
Classification Table(a)
                               Predicted
Observed                 no     yes    Percentage Correct
Step 1   no              441    288    60.5
         yes             296    411    58.1
         Overall Percentage            59.3
a. The cut value is .500

Any case with a predicted probability greater than .5 (the cut value) is classified as "yes," and any case with a predicted probability less than .5 is classified as "no." Here you have the percent correctly classified. Earlier, we learned that if SPSS guessed the modal category ("no") for each case, it would have been correct 50.8% of the time. So, by taking into account the independent variables, it is now able to be correct 59.3% of the time – an improvement of about 8.5 percentage points. Obviously, the bigger the difference, the better the model.

Variables in the Equation
                   B       S.E.   Wald     df   Sig.   Exp(B)
Step 1(a) MALE     .065    .110   .353     1    .552   1.067
          WHITE    -.169   .167   1.019    1    .313   .845
          AGE      .024    .003   48.850   1    .000   1.024
          MARRIED  .259    .117   4.911    1    .027   1.296
          CHILDS   .034    .031   1.247    1    .264   1.035
          Constant -1.220  .221   30.597   1    .000   .295
a. Variable(s) entered on step 1: MALE, WHITE, AGE, MARRIED, CHILDS.

Here are the coefficients. The B column presents the logged odds. These do not have an intuitive interpretation. For example, the logged odds of agreeing that men are better emotionally suited for politics is .065 higher for men than for women, and each additional child increases the logged odds of agreeing by .034.

The S.E. column provides the standard error for each estimated slope. Remember, this is an estimate of how much variability there is in the estimated slope across all possible samples. Dividing B by the S.E. yields a z score, which you can use to test whether the slope is significantly different from zero. If you square the z value (multiply it by itself), you get the Wald statistic that is presented in the fourth column. The Wald statistic has a chi-square distribution, so you can also use the Wald statistic and the degrees of freedom to test whether the slope is different from zero. The Sig. column gives the p-value for this test.

Finally, the Exp(B) column presents the odds ratio. This is a simple transformation of B: raise e (the base of the natural logarithm, approximately 2.718) to a power equal to B. For example, e^.065 = 1.067. The odds ratio has a more intuitive interpretation: it is greater than 1 if B is positive, less than 1 if B is negative, and equal to 1 if there is no relationship. Here is the interpretation: the odds of agreeing that men are better emotionally suited for politics are 6.7% higher for men than for women, and each additional child increases the odds of agreeing by 3.5%. Odds are different than probabilities!
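To tie the columns together, here is a small Python sketch that converts the Model 1 B coefficients into Exp(B) values and into a predicted probability for one illustrative profile. The particular profile (a 40-year-old married white man with two children) is just an example I made up; it is not in the SPSS output.

# Sketch: turning the Model 1 coefficients into odds ratios and a predicted probability.
import numpy as np

b = {"const": -1.220, "male": .065, "white": -.169, "age": .024,
     "married": .259, "childs": .034}

# Exp(B): exponentiating a logged-odds coefficient gives the odds ratio
print(round(np.exp(b["male"]), 3))     # 1.067: odds of agreeing are 6.7% higher for men
print(round(np.exp(b["childs"]), 3))   # 1.035: each child raises the odds of agreeing by 3.5%

# Predicted probability for one hypothetical profile: P = 1 / (1 + e^-(b0 + b1*x1 + ...))
xb = b["const"] + b["male"]*1 + b["white"]*1 + b["age"]*40 + b["married"]*1 + b["childs"]*2
print(round(1 / (1 + np.exp(-xb)), 3))  # about .49 for this profile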
Let's run a second model controlling for education. Here is the output.

Case Processing Summary
Unweighted Cases(a)                      N      Percent
Selected Cases   Included in Analysis    1428   93.3
                 Missing Cases           102    6.7
                 Total                   1530   100.0
Unselected Cases                         0      .0
Total                                    1530   100.0
a. If weight is in effect, see classification table for the total number of cases.

Notice that we have fewer cases. We lost a few more cases when we entered the education variable (n = 1,428).

Block 0: Beginning Block

Dependent Variable Encoding
Original Value   Internal Value
.00  no          0
1.00 yes         1

Classification Table(a,b)
                               Predicted
Observed                 no     yes    Percentage Correct
Step 0   no              724    0      100.0
         yes             704    0      .0
         Overall Percentage            50.7
a. Constant is included in the model.
b. The cut value is .500

Variables in the Equation
                  B      S.E.   Wald   df   Sig.   Exp(B)
Step 0  Constant  -.028  .053   .280   1    .597   .972

Variables not in the Equation
                            Score    df   Sig.
Step 0  Variables  MALE      .065    1    .798
                   WHITE     .039    1    .844
                   AGE       60.445  1    .000
                   MARRIED   5.108   1    .024
                   CHILDS    14.686  1    .000
                   EDUC      54.437  1    .000
        Overall Statistics   94.604  6    .000

Block 1: Method = Enter

Omnibus Tests of Model Coefficients
                Chi-square   df   Sig.
Step 1  Step    97.391       6    .000
        Block   97.391       6    .000
        Model   97.391       6    .000

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      1881.958            .066                   .088

Classification Table(a)
                               Predicted
Observed                 no     yes    Percentage Correct
Step 1   no              479    245    66.2
         yes             311    393    55.8
         Overall Percentage            61.1
a. The cut value is .500

What I want to point out is that you are able to compare the fit of two models with different sets of independent variables. The logic is the same as before, when we compared the null model to the first fitted model. Now we can compare the first fitted model (without education) to the second fitted model (with education). The improvement in the chi-square value is: 1921.927 – 1881.958 = 39.969. With 1 degree of freedom, the new chi-square value of 39.969 is significant. So we can conclude that adding the education variable significantly improves the fit of the model (see the sketch following the coefficient table below).

Variables in the Equation
                   B       S.E.   Wald     df   Sig.   Exp(B)
Step 1(a) MALE     .085    .111   .586     1    .444   1.089
          WHITE    -.051   .172   .088     1    .766   .950
          AGE      .020    .004   30.337   1    .000   1.020
          MARRIED  .298    .119   6.222    1    .013   1.347
          CHILDS   .008    .032   .061     1    .805   1.008
          EDUC     -.099   .019   27.306   1    .000   .906
          Constant .054    .332   .026     1    .872   1.055
a. Variable(s) entered on step 1: MALE, WHITE, AGE, MARRIED, CHILDS, EDUC.
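Here is a short Python sketch of the model comparison just described, using scipy to get the p-value for the chi-square difference:

# Sketch of the model comparison described above: the drop in -2 log likelihood when
# education is added is itself a chi-square statistic with 1 degree of freedom.
from scipy.stats import chi2

model1_2ll = 1921.927    # -2 log likelihood without education
model2_2ll = 1881.958    # -2 log likelihood with education

improvement = model1_2ll - model2_2ll
p_value = chi2.sf(improvement, df=1)   # upper-tail probability for chi-square = 39.969, df = 1

print(round(improvement, 3), p_value)  # 39.969, p far below .05: education improves the fit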
Bells and Whistles

You have probably noticed that I have not said much about standardized or semi-standardized coefficients. Pampel discusses both in reference to logistic regression. I'll show my bias...in general, I am not a big fan of standardized coefficients. I think it is much more meaningful to describe the relationships between variables in their natural metric. That being said, only semi-standardized coefficients are available in logistic regression. Semi-standardized coefficients are obtained by standardizing the independent variables (converting them to z scores) and using the new standardized variables in the logistic regression. Alternatively, you can multiply the logged-odds coefficient of each independent variable (from your logistic regression) by that variable's standard deviation. It is impossible to have a truly standardized coefficient in logistic regression: to obtain a standardized coefficient that can be compared across different regression models, both the independent and dependent variables must be standardized, and you cannot convert a dichotomous dependent variable into a standardized variable. Standardizing a dummy variable only changes the values from 0 and 1 to two other values.

If you do get semi-standardized coefficients, they allow you to compare the strength of different independent variables when they are in the same model. Dichotomous independent variables can be standardized to compare their relative strength to that of interval- and ratio-level variables, even though a one-standard-deviation increase in a dummy variable doesn't make much sense. Pampel says that semi-standardized coefficients are available in SPSS...if they are, I can't find them anywhere.
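Since SPSS does not seem to report semi-standardized coefficients, here is a sketch of the "multiply by the standard deviation" approach using the Model 2 coefficients. The standard deviations below are hypothetical placeholders; you would substitute the actual standard deviations from your own descriptive statistics.

# Sketch: semi-standardized coefficients as b * SD(x). The SD values are hypothetical.
b  = {"age": .020, "childs": .008, "educ": -.099}   # logged-odds coefficients from Model 2
sd = {"age": 17.0, "childs": 1.7, "educ": 3.0}      # hypothetical standard deviations

for var in b:
    semi_std = b[var] * sd[var]                     # change in logged odds per SD increase in x
    print(f"{var}: semi-standardized b = {semi_std:.3f}")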