An Introduction to Regression with Binary Dependent Variables

Brian Goff
Department of Economics
Western Kentucky University

Introduction and Description

- Examples of binary regression
- Features of linear probability models
- Why use logistic regression?
- Interpreting coefficients
- Evaluating the performance of the model

Binary Dependent Variables

In many regression settings, the Y variable takes only the values 0 and 1.

A few examples:
- A consumer chooses a brand (1) or not (0)
- A quality defect occurs (1) or not (0)
- A person is hired (1) or not (0)
- A household evacuates during a hurricane (1) or not (0)
- Other examples?

Scatterplot with Y = (0,1)

[Figure: scatterplot of Y = Hired / Not Hired (1 or 0) against X = Experience]

The Linear Probability Model (LPM)

If we estimate the slope using OLS regression,

    Hired = α + β*Income + e

the result is called a "Linear Probability Model".
- The predicted values are probabilities that Y equals 1
- The equation is linear: the slope is constant

Picture of LPM

[Figure: LPM regression line (slope coefficient β), with Y running from 0 to 1]

Points on the regression line represent the predicted probabilities for Y at each value of X.

An Example: Loan Approvals

Data:
- Dependent variable: Loaned = 1 if the loan was approved by Bank Z, 0 if not
- Independent variables:
  - ROA = net income as % of total assets of applicant (nita in the output)
  - Debt = debt as % of total assets of applicant (tdta in the output)
  - Officer = 1 if the loan was handled by loan officer A, 0 if handled by officer B

Scatterplot (Loaned - NITA)

[Figure: scatterplot of Loaned against NITA]

LPM Results

Coefficients (a)

              Unstandardized Coefficients   Standardized Coefficients
Model 1       B        Std. Error           Beta                         t        Sig.
(Constant)    1.087    .192                                              5.659    .000
nita           .022    .013                  .237                        1.655    .105
tdta          -.063    .029                 -.291                       -2.156    .036
officer       -.279    .138                 -.291                       -2.020    .049

a. Dependent Variable: loaned

The coefficient on NITA implies that a 1 percentage point increase in ROA raises the probability of a loan by 2.2 percentage points (0.022).

LPM Weaknesses

- The predicted probabilities can be greater than 1 or less than 0
  - Probabilities, by definition, have max = 1 and min = 0
  - This is not a big issue if the predictions are very close to 0 and 1
- The error terms vary with the size of the X-variable ("heteroskedastic")
  - There may be models that have lower variance, that is, are more "efficient"
- The errors are not normally distributed because Y takes on only two values
  - Creates problems for
  - More of an issue for statistical theorists

Predicted Probabilities in LPM Loans Model

In the loan case, all of the predicted probabilities fall within the (0,1) range.

Descriptive Statistics

                      N     Minimum    Maximum    Mean        Std. Deviation
predicted_loans       51    .01273     .97245     .6666667    .19034701
Valid N (listwise)    51

(Binary) Logistic Regression or "Logit"

- Selects regression coefficients so that the predicted values for Y fall between 0 and 1
- Produces S-shaped regression predictions rather than a straight line
- Selects these coefficients through the "Maximum Likelihood" estimation technique

Picture of Logistic Regression

[Figure: logistic regression curve (non-linear slope), with Y running from 0 to 1]

Points on the regression line represent the predicted probabilities for Y at each value of X.

LPM & Logit Regressions

- LPM and logit regressions in some cases provide similar answers
- If there are few "outlying" X-values at the upper or lower ends, the LPM often produces predicted values within the (0,1) band
- In such cases, the non-linear sections of the logit regression are not needed
- In such cases, the simplicity of the LPM may be a reason to use it
- See the following slide for an illustration

Example where LPM & Logit Results Similar

[Figure: LP model line and logit curve plotted together, with Y running from 0 to 1]

LPM & Logit: Loan Case

- In the loan example the results are similar: R-square = 98% for a regression relating the LPM-predicted and logit-predicted probabilities
- Descriptive statistics for both sets of probabilities appear below
- The main difference is that the LPM's maximum and minimum are closer to 1 and 0

Descriptive Statistics

                      N     Minimum    Maximum    Mean        Std. Deviation
pred_lpm              51    .01273     .97245     .6666667    .19034701
pred_logit            51    .06948     .91364     .6666667    .19209809
Valid N (listwise)    51

SPSS Logistic Regression Output for Loan Approval

Variables in the Equation

Step 1 (a)    B         S.E.     Wald     df    Sig.    Exp(B)
nita           .108     .070     2.393    1     .122     1.114
tdta          -.325     .180     3.241    1     .072      .723
officer      -1.455     .767     3.599    1     .058      .233
Constant      2.968    1.187     6.248    1     .012    19.443

a. Variable(s) entered on step 1: nita, tdta, officer.

Note: Instead of t-statistics, "Wald" statistics are used to test whether the coefficients differ from zero; the associated p-values (Sig.) have the same interpretation as in any other regression output.

Interpreting Logistic Regression (Logit) Coefficients

- The slope coefficient from a logistic regression (β) = the rate of change in the "log odds" of the event under study as X changes by one unit
- What in the world does that mean?
- We want to know the change in the probability of the event as X changes
- In logistic regression, this value changes as X changes (S-shape instead of linear)

Loan Example: Effect of NITA on Probability of Loan

NITA coefficient (B) = 0.11

                      P      (1-P)    B*(P)*(1-P)
Low probability       0.1    0.9      0.0099
Medium probability    0.5    0.5      0.0275
High probability      0.9    0.1      0.0099

Meaning?

- At moderate probabilities of getting a loan (around 0.5, which corresponds to an average NITA of about 5), the probability of getting a loan increases by 2.75 percentage points for each 1 percentage point increase in NITA
- This estimate is very close to the LPM estimate of 2.2 percentage points
- At the lower and upper extremes (NITA values in the negative or positive teens), the probability changes by only about 1 percentage point for a 1-unit increase in NITA

Alternative Methods of Evaluating Logit Regressions

Statistics for comparing alternative logit models:
- Model Chi-Square
- Percent Correct Predictions
- Pseudo-R2

Chi-Square Test for Fit

Omnibus Tests of Model Coefficients

Step 1    Chi-square    df    Sig.
Step      8.498         3     .037
Block     8.498         3     .037
Model     8.498         3     .037

- The Chi-Square statistic and its associated p-value (Sig.) test whether the model coefficients as a group equal zero
- Larger Chi-squares and smaller p-values indicate greater confidence in rejecting the null hypothesis that the coefficients are jointly zero

Percent Correct Predictions

- The "Percent Correct Predictions" statistic assumes that if the estimated p is greater than or equal to .5, the event is expected to occur, and otherwise it is not
- By converting the estimated probabilities to 0s and 1s in this way and comparing them to the actual 0s and 1s, the % correct Yes, % correct No, and overall % correct scores are calculated
- Note: the % correctly predicted within each subgroup is also important, especially if most of the data are 0s or 1s

Percent Correct Results

Classification Table (a)

                            Predicted loaned
Observed loaned       .00     1.00     Percentage Correct
.00                     6       11     35.3
1.00                    2       32     94.1
Overall Percentage                     74.5

a. The cut value is .500

- 35% of loan-rejected cases (0) were correctly predicted
- 94% of loan-accepted cases (1) were correctly predicted
- 75% of all cases (0,1) were correctly predicted
- Note: The model is much better at predicting loan acceptance than loan rejection. This may serve as a basis for thinking about additional variables to improve the model

R2 Problems

[Figure: observations at Y = 0 and Y = 1 with the fitted regression line between them]

Notice that whether using LPM or logit, the predicted values on the regression lines are not near the actual observations (which are all either 0 or 1). This makes the typical R-square statistic of no value in assessing how well the model "fits" the data.

Pseudo-R2 Values

Model Summary

Step    -2 Log likelihood    Cox & Snell R Square    Nagelkerke R Square
1       56.427 (a)           .153                    .213

a. Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.

- There are pseudo-R2 statistics that adjust for the (0,1) nature of the actual data: two are listed above
- Their computation is somewhat complicated, but they yield measures that vary between 0 and (somewhat close to) 1, much like the R2 in an LP model
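The fit statistics reported for the loan model can be reproduced from the printed output alone. The sketch below is illustrative only: it assumes the output uses the standard definitions Cox & Snell R2 = 1 - exp(-chi-square/n) and Nagelkerke R2 = Cox & Snell R2 divided by its maximum attainable value, 1 - exp(-(-2LL of the constant-only model)/n). The variable names are made up for this example; the numbers are the ones shown in the tables.

```python
import math

# Counts read from the Classification Table (cut value .500).
# Keys are the predicted outcomes; one dict per observed outcome.
observed_0 = {"pred_0": 6, "pred_1": 11}   # loans actually rejected
observed_1 = {"pred_0": 2, "pred_1": 32}   # loans actually approved

n = sum(observed_0.values()) + sum(observed_1.values())  # 51 cases

# Percent correct within each observed group and overall.
pct_correct_0 = 100 * observed_0["pred_0"] / sum(observed_0.values())
pct_correct_1 = 100 * observed_1["pred_1"] / sum(observed_1.values())
pct_correct_all = 100 * (observed_0["pred_0"] + observed_1["pred_1"]) / n

# Values read from the Model Summary and Omnibus Tests tables.
neg2ll_model = 56.427      # -2 log likelihood of the fitted model
model_chi_square = 8.498   # improvement over the constant-only model
neg2ll_null = neg2ll_model + model_chi_square

# Assumed standard pseudo-R2 definitions (see the note above).
cox_snell = 1 - math.exp(-model_chi_square / n)
max_cox_snell = 1 - math.exp(-neg2ll_null / n)
nagelkerke = cox_snell / max_cox_snell

print(f"% correct (0): {pct_correct_0:.1f}")
print(f"% correct (1): {pct_correct_1:.1f}")
print(f"% correct (all): {pct_correct_all:.1f}")
print(f"Cox & Snell R2: {cox_snell:.3f}")
print(f"Nagelkerke R2: {nagelkerke:.3f}")
```

Under these assumptions the computed values agree with the reported 35.3, 94.1, 74.5, .153 and .213 after rounding, which suggests the printed tables are internally consistent.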
Appendix: Calculating the Effect of an X-Variable on the Probability of Y

- Effect on the probability of Y from a 1-unit change in X = β*(Probability)*(1-Probability)
- The probability changes as the value of X changes
- To calculate (1-P) for given X values:

    (1-P) = 1/(1 + exp[α + β1*X1 + β2*X2 …])

- With multiple X-variables it is common to focus on one at a time and use average values for all but one
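As a rough illustration of the appendix calculation, the sketch below evaluates β*P*(1-P) at the three probability levels used in the loan example, with the rounded NITA coefficient B = 0.11. It assumes the standard logistic form P = 1/(1 + exp(-z)), equivalently 1-P = 1/(1 + exp(z)), where z = α + β1*X1 + β2*X2 …; the function names are hypothetical and chosen only for this example.

```python
import math

def logistic_probability(z):
    """P(Y = 1) for a linear index z = alpha + beta1*X1 + beta2*X2 + ..."""
    return 1.0 / (1.0 + math.exp(-z))

def marginal_effect(beta, p):
    """Change in probability for a 1-unit change in X, evaluated at probability p."""
    return beta * p * (1.0 - p)

B_NITA = 0.11  # rounded NITA coefficient from the loan example

# Effect of NITA at low, medium and high probabilities of approval.
effects = {p: marginal_effect(B_NITA, p) for p in (0.1, 0.5, 0.9)}
for p, effect in effects.items():
    print(f"P = {p:.1f}   (1-P) = {1 - p:.1f}   effect = {effect:.4f}")

# Sanity check of the (1-P) formula at an index of zero, where P = 0.5.
z = 0.0
p_at_zero = logistic_probability(z)
one_minus_p = 1.0 / (1.0 + math.exp(z))
print(f"P = {p_at_zero:.2f}, 1-P = {one_minus_p:.2f}")
```

The effect is largest at P = 0.5 and shrinks symmetrically toward the extremes, which is the S-shape of the logit curve expressed in numbers. To evaluate the effect at particular X values instead, one would first compute z from the estimated coefficients and the chosen X values, then pass the resulting probability to `marginal_effect`.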