An Introduction to Logistic Regression (GV917): For Categorical Dependent Variables

What do we do when the dependent variable in a regression is a dummy variable? Suppose we have the dummy variable turnout:

  1 - if a survey respondent turns out to vote
  0 - if they don't vote

One thing we could do is simply run an ordinary least squares regression.

Turnout and Interest in the Election (N = 30)

  Turnout:  1 = yes, 0 = no
  Interest in the Election:  1 = not at all interested, 2 = not very interested,
                             3 = fairly interested, 4 = very interested

  Turnout:  1 0 1 1 1 0 1 1 1 0 0 1 1 0 1 1 0 1 0 1 0 1 1 1 1 0 1 1 1 1
  Interest: 4 1 4 3 3 1 3 2 3 1 2 2 2 1 2 4 2 2 1 1 2 4 1 2 1 2 3 3 4 3

OLS Regression of Turnout on Interest – The Linear Probability Model

Model Summary
  Model 1:  R = .540   R Square = .291   Adjusted R Square = .266
  Std. Error of the Estimate = .39930
  Predictors: (Constant), interest

ANOVA (dependent variable: turnout)
  Regression:  Sum of Squares = 1.836   df = 1    Mean Square = 1.836
  Residual:    Sum of Squares = 4.464   df = 28   Mean Square = .159
  Total:       Sum of Squares = 6.300   df = 29
  F = 11.513   Sig. = .002

Coefficients (dependent variable: turnout)
              B      Std. Error   Beta    t       Sig.
  (Constant)  .152   .177                 .856    .399
  interest    .238   .070         .540    3.393   .002

The Residuals of the OLS Turnout Regression

Casewise diagnostics (dependent variable: turnout). The 30 cases fall into six patterns (no respondent with interest level 3 or 4 failed to vote):

  Turnout   Interest   Predicted Value   Residual   Std. Residual
  1         1          .3901             .60991      1.527
  0         1          .3901            -.39009      -.977
  1         2          .6285             .37152       .930
  0         2          .6285            -.62848     -1.574
  1         3          .8669             .13313       .333
  1         4          1.1053           -.10526      -.264

What's Wrong?

- Predicted probabilities which exceed 1.0, which makes no sense.
- The test statistics t and F are not valid, because the sampling distribution of the residuals does not meet the required assumptions (heteroscedasticity).
- We can correct for the heteroscedasticity, but a better option is to use a logistic regression model.

Some Preliminaries Needed for Logistic Regression: Odds Ratios

Odds are defined as the probability of an event occurring divided by the probability of it not occurring. Thus if p is the probability of an event:

  Odds = p / (1 - p)

For example, in the 2005 British Election Study face-to-face survey 48.2 per cent of the sample were men and 51.8 per cent women. Thus the odds of being a man were 0.482/0.518 = 0.93, and the odds of being a woman were 0.518/0.482 = 1.07. Note that if the odds ratio were 1.00 it would mean that women were equally likely to appear in the survey as men.

Log Odds

The natural logarithm of a number is the power to which we must raise e (2.718) to give the number in question. So the natural logarithm of 100 is 4.605 because 100 = e^4.605, which can be written 100 = exp(4.605). Similarly, the anti-log of 4.605 is 100, because e^4.605 = 100.

In the 2005 BES study 70.5 per cent of men and 72.9 per cent of women voted.
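These odds and log odds are easy to verify in code. A quick check in Python, using the BES percentages quoted above:

```python
import math

def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)

# 2005 BES: 70.5% of men and 72.9% of women reported voting
odds_men = odds(0.705)       # 0.705 / 0.295, approx. 2.39
odds_women = odds(0.729)     # 0.729 / 0.271, approx. 2.69

log_odds_men = math.log(odds_men)      # approx. 0.87
log_odds_women = math.log(odds_women)  # approx. 0.99

# An odds ratio of 1.0 corresponds to log odds of 0
assert math.log(1.0) == 0.0
```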
The odds of men voting were 0.705/0.295 = 2.39, and the log odds were ln(2.39) = 0.8712. The odds of women voting were 0.729/0.271 = 2.69, and the log odds were ln(2.69) = 0.9896. Note that ln(1.0) = 0, so that when the odds ratio is 1.0 the log odds ratio is zero.

Why Use Logarithms?

Logs have three advantages:

1. Odds vary from 0 to ∞, whereas log odds vary from -∞ to +∞ and are centred on 0. Odds less than 1 have negative log odds, and odds greater than 1 have positive log odds. This accords better with the real number line, which runs from -∞ to +∞.
2. Multiplying any two numbers together is equivalent to adding their logs. Logs therefore make it possible to convert multiplicative models to additive models – a useful property, since logistic regression is a non-linear multiplicative model when not expressed in logs.
3. A useful statistic for evaluating the fit of models is -2 × log likelihood (also known as the deviance). The model has to be expressed in logarithms for this to work.

Logistic Regression

  ln[ p̂(y) / (1 - p̂(y)) ] = a + bXi

where p̂(y) is the predicted probability of being a voter and 1 - p̂(y) is the predicted probability of not being a voter. If we express this in terms of anti-logs, or odds ratios, then

  p̂(y) / (1 - p̂(y)) = exp(a + bXi)

and

  p̂(y) = exp(a + bXi) / [1 + exp(a + bXi)]

The Logistic Function

The logistic function can never be greater than one, so there are no impossible probabilities, and it corrects the problems with the test statistics.

Estimating a Logistic Regression

In OLS regression the least squares solution can be defined analytically: there are equations, called the normal equations, which we use to find the values of a and b. In logistic regression there are no such equations. The solutions are derived iteratively, by a process of trial and error. Doing this involves identifying a likelihood function.
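The contrast between the two models can be sketched numerically. The sketch below uses the OLS coefficients reported above (a = 0.152, b = 0.238) and the logistic coefficients that appear in the SPSS output later in these slides (a = -2.582, b = 1.742); the point is simply that the linear probability model can leave the [0, 1] interval while the logistic function cannot:

```python
import math

def linear_probability(x, a=0.152, b=0.238):
    """OLS linear probability model: a straight line, can exceed 1."""
    return a + b * x

def logistic(x, a=-2.582, b=1.742):
    """Logistic model: exp(a + bx) / (1 + exp(a + bx)), always in (0, 1)."""
    z = a + b * x
    return math.exp(z) / (1 + math.exp(z))

for interest in (1, 2, 3, 4):
    print(interest, round(linear_probability(interest), 3),
          round(logistic(interest), 3))

# The LPM predicts an impossible probability above 1 for interest = 4,
# while the logistic prediction stays strictly below 1.
```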
A likelihood is a measure of how typical a sample is of a given population. For example, we could calculate how typical the ages of the students in this class are in comparison with students in the university as a whole. Applied to our regression problem, we are working out how likely individuals are to be voters given their level of interest in the election and given values for the a and b coefficients. We 'try out' different values of a and b, and maximum likelihood estimation identifies the values which are most likely to reproduce the distribution of voters and non-voters we see in the sample, given their levels of interest in the election.

Maximum Likelihood

Define the probability of getting a head when tossing a fair coin as p(H) = 0.5, so that the probability of a tail is 1 - p(H) = 0.5. The probability of two heads followed by a tail is then:

  P(H, H, T) = (0.5)(0.5)(0.5) = 0.125

We can get this outcome in 3 different ways (the tail can come first, second or third), so the probability of getting two heads and a tail, without worrying about the sequence, is 0.125 × 3 = 0.375.

But suppose we did not know the value of p(H). We could 'try out' different values and see how well they fitted an experiment consisting of repeated sets of three coin tosses. For example, if we thought p(H) = 0.4, then two heads and a tail would give (0.4)(0.4)(0.6)(3) = 0.288; if we thought it was 0.3 we would get (0.3)(0.3)(0.7)(3) = 0.189.

Maximum Likelihood in General

More generally, we can write a likelihood function for this exercise:

  LF = k · p² · (1 - p)

where p is the probability of getting a head and k (here 3) is the number of orderings in which the sequence can occur. Of the values tried above, 0.5 gives the highest likelihood (0.375 against 0.288 and 0.189); the function is in fact maximised at p = 2/3, the sample proportion of heads, making 2/3 the maximum likelihood estimate of p given the sequence two heads and a tail.
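The 'trying out' of candidate values can be sketched as a simple grid search over p for the two-heads-one-tail experiment. This is a toy illustration of the idea, not the actual iterative algorithm a statistics package uses:

```python
def likelihood(p):
    """Likelihood of two heads and one tail in any order: 3 * p^2 * (1 - p)."""
    return 3 * p**2 * (1 - p)

# Try every value of p from 0.01 to 0.99 and keep the best
candidates = [i / 100 for i in range(1, 100)]
best = max(candidates, key=likelihood)

print(round(likelihood(0.3), 3))  # 0.189
print(round(likelihood(0.4), 3))  # 0.288
print(round(likelihood(0.5), 3))  # 0.375
print(best)  # 0.67 -- the sample proportion of heads, 2/3
```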
Explaining Variance

In OLS regression we defined the following decomposition:

  Σ(Yi - Ȳ)² = Σ(Ŷi - Ȳ)² + Σ(Yi - Ŷi)²

or: Total Variation = Explained Variation + Residual Variation. In logistic regression, measures of the deviance replace the sums of squares as the building blocks of measures of fit and statistical tests.

Deviance

Deviance measures are built from maximum likelihoods calculated using different models. For example, suppose we fit a model with no slope coefficient (b) but an intercept coefficient (a). We can call this model zero, because it has no predictors. We then fit a second model, model one, which has both a slope and an intercept. We can form the ratio of the maximum likelihoods of these models:

  Likelihood ratio = (maximum likelihood of model zero) / (maximum likelihood of model one)

Expressed in logs this becomes:

  Log likelihood ratio = ln(maximum likelihood of model zero) - ln(maximum likelihood of model one)

Note that the log of the (likelihood ratio)² is the same as 2 × (log likelihood ratio). The deviance is defined as -2 × (log likelihood ratio).

What does this mean? The maximum likelihood of model zero is analogous to the total variation in OLS, and the maximum likelihood of model one is analogous to the explained variation. If the maximum likelihoods of models zero and one were the same, the likelihood ratio would be 1 and the log likelihood ratio 0. This would mean that model one was no better than model zero in accounting for turnout, so the deviance captures how much we improve things by taking account of interest in the election. The bigger the deviance, the greater the improvement.

Logistic Regression of Turnout

Omnibus Tests of Model Coefficients (Step 1)
           Chi-square   df   Sig.
  Step     10.757       1    .001
  Block    10.757       1    .001
  Model    10.757       1    .001

Model Summary (Step 1)
  -2 Log likelihood = 25.894   Cox & Snell R Square = .301   Nagelkerke R Square = .427
  (Estimation terminated at iteration number 6 because parameter estimates changed by less than .001.)

Classification Table (cut value = .500)
                       Predicted turnout
  Observed turnout     .00    1.00    Percentage Correct
  .00                   5      4      55.6
  1.00                  3     18      85.7
  Overall Percentage                  76.7

Variables in the Equation (Step 1; variable entered on step 1: interest)
              B       S.E.    Wald    df   Sig.   Exp(B)
  interest    1.742   .697    6.251   1    .012   5.708
  Constant   -2.582   1.302   3.934   1    .047   .076

The Meaning of the Omnibus Test

SPSS starts by fitting what it calls Block 0, which is the model containing the constant term and no predictor variables. It then proceeds to Block 1, which fits the full model and gives us another estimate of the likelihood function. These two can then be compared, and the Omnibus Tests table shows a chi-square test of the improvement in the model achieved by adding interest in the election to the equation. This chi-square statistic is significant at the 0.001 level. In a multiple logistic regression this table tells us how much all of the predictor variables together improve things compared with model zero. Here we have significantly improved on the baseline model by adding the variable interest to the equation.

The Model Summary Table

The -2 log likelihood statistic for our model appears in the Model Summary table, but it is only really meaningful for comparing different models. The Cox & Snell and the Nagelkerke R squares are different ways of approximating the percentage of variance explained (R square) in multiple regression. The Cox & Snell statistic is problematic because it has a maximum value of 0.75.
The Nagelkerke R square corrects this and has a maximum value of 1.0, so it is often the preferred measure.

The Classification Table

The classification table tells us the extent to which the model correctly predicts the actual turnout, so it is another goodness-of-fit measure. The main diagonal from top left to bottom right contains the cases predicted correctly (5 + 18 = 23), whereas the off-diagonal cells contain the cases predicted incorrectly (4 + 3 = 7). So overall 76.7 per cent of the cases are predicted correctly.

Interpreting the Coefficients

The B column gives the coefficients of the logistic regression model. The estimate for interest means that a unit change in the level of interest in the election increases the log odds of voting by 1.742. The standard error appears in the next column (0.697) and the Wald statistic in the third column. The latter is the t statistic squared (6.251), and as we can see it is significant at the 0.012 level. Finally, Exp(B) is the anti-log of the B column, so that e^1.742 = 5.708. This is the effect on the odds of voting of a one-unit increase in the level of interest in the election. Since odds ratios are rather easier to understand than log odds ratios, the effects are often reported using these coefficients.

Making Sense of the Coefficients

  ln[ p̂(y) / (1 - p̂(y)) ] = -2.582 + 1.742Xi

so that

  p̂(y) = exp(-2.582 + 1.742Xi) / [1 + exp(-2.582 + 1.742Xi)]

Translating into Probabilities

Suppose a person scores 4 on the interest in the election variable (they are very interested).
Then according to the model the probability that they will vote is:

  p̂(y) = exp(-2.582 + 1.742(4)) / [1 + exp(-2.582 + 1.742(4))]
        = exp(4.386) / (1 + exp(4.386)) = 0.99

If they are not at all interested and score 1, then:

  p̂(y) = exp(-2.582 + 1.742(1)) / [1 + exp(-2.582 + 1.742(1))]
        = exp(-0.84) / (1 + exp(-0.84)) = 0.30

Consequently, a change from being not at all interested to being very interested increases the probability of voting by 0.99 - 0.30 = 0.69.

Probabilities

  Level of Interest    Probability of Voting
  1                    0.30
  2                    0.71
  3                    0.93
  4                    0.99

Conclusions

Logistic regression allows us to model relationships when the dependent variable is a dummy variable. It can be extended to multinomial logistic regression, in which the dependent variable has several categories – this produces several sets of coefficients. The results are more reliable than if we had just used ordinary least squares regression.
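As a closing check, the headline numbers in the SPSS output above can be reproduced by hand. The sketch below uses the reported coefficients (a = -2.582, b = 1.742), the fact that 21 of the 30 respondents voted, and the model's reported -2 log likelihood of 25.894:

```python
import math

a, b = -2.582, 1.742  # from the Variables in the Equation table

# Exp(B): the effect of one extra unit of interest on the odds of voting
print(round(math.exp(b), 2))  # 5.71

# Predicted probability of voting at each level of interest
def prob(x):
    z = a + b * x
    return math.exp(z) / (1 + math.exp(z))

print([round(prob(x), 2) for x in (1, 2, 3, 4)])  # [0.3, 0.71, 0.93, 0.99]

# Deviance of model zero (intercept only): 21 of 30 respondents voted
p0 = 21 / 30
null_deviance = -2 * (21 * math.log(p0) + 9 * math.log(1 - p0))  # approx. 36.65

# Omnibus chi-square = null deviance minus the model's -2 log likelihood
print(round(null_deviance - 25.894, 2))  # 10.76, matching the reported 10.757
```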