Stat 701 Handout on Binary Logistic Regression

The Study Of Interest (Example on page 575 of text): The data come from a study assessing the ability of programmers to complete a complex programming task within a specified time, and relating this ability to the programmer's experience level. Twenty-five programmers took part in the study, and all were given the same task. The data set is listed in the SAS program later in this handout. The variables are X = Months of Programming Experience and Y = Success in Task (1 = Successful, 0 = Failure). Note that X, the predictor variable, is quantitative, while Y, the response variable, is dichotomous and qualitative. The scatterplot of the data is given below.

[Figure: scatterplot of Y = Task Success (0 or 1) versus X = Months Of Experience (0 to 30).]

The problem is to obtain a model relating the response variable (Y) to the predictor variable (X). The model used is the logistic regression model, described as follows. Let

\[ \pi(x) = P\{Y = 1 \mid X = x\} \]

be the conditional probability of a successful outcome in performing the task when the subject's level of programming experience is x. The logistic regression model assumes that

\[ \log\left( \frac{\pi(x)}{1 - \pi(x)} \right) = \beta_0 + \beta_1 x. \]

This is equivalent to assuming that

\[ \pi(x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}. \]

Here are two graphs of this logistic function, corresponding to two sets of values of (\beta_0, \beta_1). Note that one of the graphs would be a very bad model for the data above, while the other might be a good model for the success probability of the programming data.

[Figure: Two Graphs of the Logistic Probability Function. Vertical axis: Probability of Success (0.0 to 1.0); horizontal axis: x (0 to 30). Curves: Beta0 = 4, Beta1 = -2; and Beta0 = -3.05, Beta1 = .16.]

Interpretation of the Coefficients (discussed in more detail in class):

\beta_0 = intercept term for the linear model of the log-odds.

First, the ODDS of the probability \pi(x) is given by

\[ \mathrm{ODDS}[\pi(x)] = \frac{\pi(x)}{1 - \pi(x)}. \]

The coefficient \beta_1 can be interpreted in several ways.

1. It can be viewed as the change in the value of the log-odds when the value of the predictor variable is increased by one unit.
2. exp(\beta_1) can be interpreted as the ODDS RATIO (OR), which is the ratio of the odds when the predictor value is (x+1) to the odds when the predictor value is x. Under the model, ODDS[\pi(x)] = exp(\beta_0 + \beta_1 x), so symbolically,

\[ \exp(\beta_1) = \frac{\mathrm{ODDS}[\pi(x+1)]}{\mathrm{ODDS}[\pi(x)]} = \text{ODDS RATIO (OR)}. \]

Thus, \beta_1 can also be interpreted as the LOGARITHM of the ODDS RATIO, that is, \beta_1 = ln(OR).

Estimation and Testing when Dealing with Logistic Model

1. Estimation is by the Maximum Likelihood Estimation Procedure.
2. Testing of hypotheses is via likelihood ratio tests.

We will not go into detail about these methods of inference, but simply illustrate them using the results of the logistic regression analysis in Minitab. It should be noted that there are no closed-form expressions for the regression coefficient estimates. They are obtained iteratively, and the object of the iterative procedure is to find the regression coefficients that maximize the likelihood function. As such, the estimation procedure is computer-intensive. We now illustrate the results of the Minitab analysis.
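Before turning to the software output, the iterative estimation described above can be sketched in a few lines. The sketch below is an illustration only, not the routine Minitab or SAS actually runs: it applies plain Newton-Raphson steps to the 25 observations listed in the SAS program later in this handout. The function names (`prob`, `log_likelihood`, `fit_logistic`), the starting values, and the stopping tolerance are assumptions made for this example.

```python
import math

# (months of experience, task success) pairs, copied from the SAS data step
# that appears later in this handout.
DATA = [
    (14, 0), (29, 0), (6, 0), (25, 1), (18, 1),
    (4, 0), (18, 0), (12, 0), (22, 1), (6, 0),
    (30, 1), (11, 0), (30, 1), (5, 0), (20, 1),
    (13, 0), (9, 0), (32, 1), (24, 0), (13, 1),
    (19, 0), (4, 0), (28, 1), (22, 1), (8, 1),
]


def prob(b0, b1, x):
    """Logistic success probability pi(x) = exp(b0 + b1 x) / (1 + exp(b0 + b1 x))."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def log_likelihood(data, b0, b1):
    """Bernoulli log-likelihood of the logistic model at (b0, b1)."""
    total = 0.0
    for x, y in data:
        p = prob(b0, b1, x)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


def fit_logistic(data, tol=1e-10, max_iter=50):
    """Newton-Raphson for a single predictor (hypothetical helper).

    Returns (b0, b1, se0, se1); the standard errors come from the inverse
    of the information matrix at the last iteration.
    """
    n = len(data)
    n1 = sum(y for _, y in data)
    # Assumed start: intercept-only fit, slope zero.
    b0, b1 = math.log(n1 / (n - n1)), 0.0
    for _ in range(max_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in data:
            p = prob(b0, b1, x)
            w = p * (1.0 - p)
            g0 += y - p            # score for the intercept
            g1 += (y - p) * x      # score for the slope
            h00 += w               # information matrix entries
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (information) * step = score by hand.
        d0 = (h11 * g0 - h01 * g1) / det
        d1 = (h00 * g1 - h01 * g0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    se0 = math.sqrt(h11 / det)
    se1 = math.sqrt(h00 / det)
    return b0, b1, se0, se1


b0, b1, se0, se1 = fit_logistic(DATA)
neg2_log_l = -2.0 * log_likelihood(DATA, b0, b1)
odds_ratio = math.exp(b1)
print(f"b0={b0:.4f}  b1={b1:.4f}  se0={se0:.4f}  se1={se1:.4f}")
print(f"-2 log L={neg2_log_l:.4f}  odds ratio={odds_ratio:.3f}")
```

The printed estimates, standard errors, -2 log L, and odds ratio can be compared directly with the Minitab and SAS tables that follow.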
Binary Logistic Regression (Minitab Output)

    Step   Log-Likelihood
      0       -17.148
      1       -12.866
      2       -12.714
      3       -12.712
      4       -12.712
      5       -12.712

Link Function: Logit

Response Information

    Variable   Value   Count
    TaskSucc   1          11   (Event)
               0          14
               Total      25

Logistic Regression Table

                                                     Odds       95% CI
    Predictor      Coef      StDev       Z       P   Ratio   Lower   Upper
    Constant     -3.060      1.259   -2.43   0.015
    MonOfExp    0.16149    0.06498    2.49   0.013    1.18    1.03    1.33

Log-Likelihood = -12.712
Test that all slopes are zero: G = 8.872, DF = 1, P-Value = 0.003

Goodness-of-Fit Tests

    Method            Chi-Square   DF       P
    Pearson               19.623   17   0.294
    Deviance              19.879   17   0.280
    Hosmer-Lemeshow        5.946    8   0.653

Table of Observed and Expected Frequencies:
(See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic)

              Value 1        Value 0
    Group    Obs    Exp     Obs    Exp    Total
      1        0    0.2       2    1.8        2
      2        0    0.3       3    2.7        3
      3        1    0.3       1    1.7        2
      4        1    1.0       3    3.0        4
      5        1    1.2       2    1.8        3
      6        1    1.0       1    1.0        2
      7        2    1.2       0    0.8        2
      8        1    1.4       1    0.6        2
      9        1    1.6       1    0.4        2
     10        3    2.6       0    0.4        3
    Total     11             14              25

Measures of Association:
(Between the Response Variable and Predicted Probabilities)

    Pairs        Number   Percent
    Concordant      127     82.5%
    Discordant       25     16.2%
    Ties              2      1.3%
    Total           154    100.0%

    Summary Measures
    Somers' D                0.66
    Goodman-Kruskal Gamma    0.67
    Kendall's Tau-a          0.34

A Goodness-Of-Fit Criterion

Model Deviance: compares the log-likelihood of the fitted logistic model with that of the perfectly fitting model (called the saturated model). The smaller the value of this deviance, the better the fit. The DEVIANCE statistic is given by:

\[ \mathrm{DEV}(X) = -2 \sum_{i=1}^{n} \left[ Y_i \ln(p(X_i)) + (1 - Y_i) \ln(1 - p(X_i)) \right]. \]

Here p(X_i) is the estimate of the success probability at the predictor value X_i. Under the hypothesis that the logistic model is correct, the statistic DEV(X) follows a chi-square distribution with n - 2 degrees of freedom when there is a single predictor (in general, n - p, where p - 1 is the number of predictor variables).

Chi-Square Statistic: The data are grouped into classes according to their fitted logit values. Let there be c groups.
For each group, determine the number of observed successes (denoted by O_{j1}) and the number of observed failures (denoted by O_{j0}). Also, for each group, obtain the expected numbers of successes and failures (denoted by E_{j1} and E_{j0}). If the logistic regression model is appropriate, then the observed and expected frequencies in each cell will tend to be close to each other. This closeness, or lack thereof, is measured by the chi-square statistic given by:

\[ \chi^2 = \sum_{j=1}^{c} \sum_{k=0}^{1} \frac{(O_{jk} - E_{jk})^2}{E_{jk}}. \]

If the model is appropriate, then this statistic follows a chi-square distribution with c - 2 degrees of freedom, so to test the model it is compared to the 100(1 - \alpha)th percentile of the chi-square distribution with c - 2 degrees of freedom.

Some Diagnostic Plots

These diagnostic plots are obtained by computing the above statistics when a given observation is deleted.

[Figure: Delta Chi-Square versus Probability.]

[Figure: Delta Deviance versus Probability.]

Implementation Using SAS

THE PROGRAM

    /* Logistic Regression Illustration */
    data prgtask;
    input MonExp TskSucc Est;
    cards;
    14 0 0.310262
    29 0 0.835263
    6 0 0.109996
    25 1 0.726602
    18 1 0.461837
    4 0 0.082130
    18 0 0.461837
    12 0 0.245666
    22 1 0.620812
    6 0 0.109996
    30 1 0.856299
    11 0 0.216980
    30 1 0.856299
    5 0 0.095154
    20 1 0.542404
    13 0 0.276802
    9 0 0.167100
    32 1 0.891664
    24 0 0.693379
    13 1 0.276802
    19 0 0.502134
    4 0 0.082130
    28 1 0.811825
    22 1 0.620812
    8 1 0.145815
    ;
    proc print;
    proc logistic DESCENDING;  /* The keyword DESCENDING is to indicate that 1=Success */
    model TskSucc = MonExp / waldcl corrb covb itprint lackfit plcl plrl rsquare;
    run;

The OUTPUT

    Obs   MonExp   TskSucc       Est
      1       14         0   0.31026
      2       29         0   0.83526
      3        6         0   0.11000
      4       25         1   0.72660
      5       18         1   0.46184
      6        4         0   0.08213
      7       18         0   0.46184
      8       12         0   0.24567
      9       22         1   0.62081
     10        6         0   0.11000
     11       30         1   0.85630
     12       11         0   0.21698
     13       30         1   0.85630
     14        5         0   0.09515
     15       20         1   0.54240
     16       13         0   0.27680
     17        9         0   0.16710
     18       32         1   0.89166
     19       24         0   0.69338
     20       13         1   0.27680
     21       19         0   0.50213
     22        4         0   0.08213
     23       28         1   0.81183
     24       22         1   0.62081
     25        8         1   0.14582

The LOGISTIC Procedure

Model Information

    Data Set                     WORK.PRGTASK
    Response Variable            TskSucc
    Number of Response Levels    2
    Number of Observations       25
    Link Function                Logit
    Optimization Technique       Fisher's scoring

Response Profile

    Ordered Value   TskSucc   Total Frequency
                1         1                11
                2         0                14

Maximum Likelihood Iteration History

    Iter   Ridge    -2 Log L    Intercept     MonExp
       0       0   34.296490    -0.241162          0
       1       0   25.732187    -2.401052   0.127956
       2       0   25.428428    -2.982504   0.157626
       3       0   25.424575    -3.058497   0.161427
       4       0   25.424574    -3.059696   0.161486

    Last Change in -2 Log L: 9.1283891E-7

    Last Evaluation of Gradient
    Intercept   -1.577658E-7
    MonExp       5.635832E-7

Convergence criterion (GCONV=1E-8) satisfied.

Model Fit Statistics

    Criterion   Intercept Only   Intercept and Covariates
    AIC                 36.296                     29.425
    SC                  37.515                     31.862
    -2 Log L            34.296                     25.425

    R-Square 0.2987    Max-rescaled R-Square 0.4003

Testing Global Null Hypothesis: BETA=0

    Test               Chi-Square   DF   Pr > ChiSq
    Likelihood Ratio       8.8719    1       0.0029
    Score                  7.9742    1       0.0047
    Wald                   6.1760    1       0.0129

Analysis of Maximum Likelihood Estimates

    Parameter   DF   Estimate   Standard Error   Chi-Square   Pr > ChiSq
    Intercept    1    -3.0597           1.2594       5.9029       0.0151
    MonExp       1     0.1615           0.0650       6.1760       0.0129

Odds Ratio Estimates

    Effect   Point Estimate   95% Wald Confidence Limits
    MonExp            1.175        1.035        1.335

Association of Predicted Probabilities and Observed Responses

    Percent Concordant   82.5    Somers' D   0.662
    Percent Discordant   16.2    Gamma       0.671
    Percent Tied          1.3    Tau-a       0.340
    Pairs                 154    c           0.831

Profile Likelihood Confidence Interval for Parameters

    Parameter   Estimate   95% Confidence Limits
    Intercept    -3.0597    -6.0369    -0.9159
    MonExp        0.1615     0.0500     0.3140

Wald Confidence Interval for Parameters

    Parameter   Estimate   95% Confidence Limits
    Intercept    -3.0597    -5.5280    -0.5914
    MonExp        0.1615     0.0341     0.2888

Profile Likelihood Confidence Interval for Adjusted Odds Ratios

    Effect     Unit   Estimate   95% Confidence Limits
    MonExp   1.0000      1.175      1.051      1.369

Estimated Covariance Matrix

    Variable    Intercept     MonExp
    Intercept    1.585967    -0.0754
    MonExp        -0.0754   0.004222

Estimated Correlation Matrix

    Variable    Intercept    MonExp
    Intercept      1.0000   -0.9214
    MonExp        -0.9214    1.0000

Partition for the Hosmer and Lemeshow Test

                       TskSucc = 1            TskSucc = 0
    Group   Total   Observed   Expected   Observed   Expected
      1       3          0       0.26          3       2.74
      2       3          1       0.37          2       2.63
      3       3          0       0.63          3       2.37
      4       3          1       0.86          2       2.14
      5       3          1       1.43          2       1.57
      6       3          3       1.78          0       1.22
      7       3          2       2.23          1       0.77
      8       4          3       3.44          1       0.56

Hosmer and Lemeshow Goodness-of-Fit Test

    Chi-Square   DF   Pr > ChiSq
        5.1453    6       0.5253

Another Example: Multiple Logistic Regression

Study Considered (Example on page 582 but using the whole data set): To investigate an epidemic outbreak of a disease that is spread by mosquitoes, individuals were randomly sampled within two sectors of a city to determine whether each person had recently contracted the disease under study. The response variable was coded 1 = Yes, 0 = No. The predictor variables considered are:

1. Age, a quantitative variable;
2. SocioEconomic status, a qualitative variable taking the values Upper, Middle, Lower, which was coded using two dummy variables as follows: (0, 0) = Upper, (1, 0) = Middle, and (0, 1) = Lower;
3. CitySector, a qualitative variable taking the values Sector 1 (coded 1) and Sector 2 (coded 2).

To give you an idea of the data set, the plot below is a scatterplot of Disease Status versus Age.

[Figure: scatterplot of DiseaseStatus (0 or 1) versus Age (0 to 90).]

Using Minitab, we fit a multiple logistic regression model. The results of this analysis are summarized next.
Binary Logistic Regression

Link Function: Logit

Response Information

    Variable   Value   Count
    DiseaseS   1         107   (Event)
               0          89
               Total     196

Logistic Regression Table

                                                     Odds       95% CI
    Predictor      Coef      StDev       Z       P   Ratio   Lower   Upper
    Constant     0.1963     0.7011    0.28   0.780
    Age         0.03596    0.01001    3.59   0.000    1.04    1.02    1.06
    SocEcoSt    -0.9768     0.2012   -4.85   0.000    0.38    0.25    0.56
    SocEcoSt     0.7751     0.3584    2.16   0.031    2.17    1.08    4.38
    CitySect    -0.0213     0.3927   -0.05   0.957    0.98    0.45    2.11

Log-Likelihood = -107.826
Test that all slopes are zero: G = 54.406, DF = 4, P-Value = 0.000

Goodness-of-Fit Tests

    Method            Chi-Square    DF       P
    Pearson              165.767   165   0.469
    Deviance             185.154   165   0.135
    Hosmer-Lemeshow        9.343     8   0.314

Table of Observed and Expected Frequencies:
(See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic)

              Value 1         Value 0
    Group    Obs     Exp     Obs     Exp    Total
      1        6     2.9      13    16.1       19
      2        3     4.5      17    15.5       20
      3        3     6.0      16    13.0       19
      4        9     8.2      11    11.8       20
      5        9    10.5      11     9.5       20
      6       13    11.5       6     7.5       19
      7       14    15.0       7     6.0       21
      8       15    14.4       4     4.6       19
      9       18    16.5       2     3.5       20
     10       17    17.6       2     1.4       19
    Total    107              89              196

Measures of Association:
(Between the Response Variable and Predicted Probabilities)

    Pairs        Number   Percent
    Concordant     7560     79.4%
    Discordant     1930     20.3%
    Ties             33      0.3%
    Total          9523    100.0%

    Summary Measures
    Somers' D                0.59
    Goodman-Kruskal Gamma    0.59
    Kendall's Tau-a          0.29

CONCLUSIONS??

Question: Suppose now that we want to see the effect of SocioEconomic Status on Disease Outbreak, given that the predictors AGE and CITY SECTOR are already in the model.

To answer this question, we need to fit the reduced model, which contains only AGE and CITY SECTOR as predictors, in order to compute the DEVIANCE statistic for SOCIOECONOMIC STATUS after accounting for AGE and CITY SECTOR. This statistic is denoted by

\[ \mathrm{DEV}(\text{SocEconStat} \mid \text{Age}, \text{CitySect}) = \mathrm{DEV}(\text{Age}, \text{CitySect}) - \mathrm{DEV}(\text{Age}, \text{SocEconStat}, \text{CitySect}). \]

This is called the partial deviance and is analogous to the extra sum of squares idea in multiple linear regression.
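As a worked illustration of the partial deviance, the sketch below plugs in the log-likelihood of the full model reported above (-107.826) and the log-likelihood of the reduced model reported in the Minitab output that follows (-126.065). Each deviance is computed here as -2 times the reported log-likelihood, and the difference is referred to a chi-square distribution with 2 degrees of freedom, one for each SocioEconomic dummy variable dropped. The function names are hypothetical, and the closed-form tail probability exp(-x/2) is valid only for 2 degrees of freedom.

```python
import math

# Log-likelihoods as reported in the handout's Minitab output.
LOGLIK_FULL = -107.826      # Age, SocEcoSt (two dummy variables), CitySect
LOGLIK_REDUCED = -126.065   # Age and CitySect only
DF = 2                      # number of dummy variables dropped


def partial_deviance(loglik_reduced, loglik_full):
    """DEV(reduced) - DEV(full), with DEV taken as -2 * log-likelihood."""
    return -2.0 * (loglik_reduced - loglik_full)


def chi2_upper_tail_2df(x):
    """Upper-tail probability of a chi-square variable with exactly 2 df."""
    return math.exp(-x / 2.0)


dev = partial_deviance(LOGLIK_REDUCED, LOGLIK_FULL)
p_value = chi2_upper_tail_2df(dev)
print(f"partial deviance = {dev:.3f} on {DF} df, p-value = {p_value:.2e}")
```

The same number can be obtained from the two reported G statistics, 54.406 - 17.928, since both are measured against the same intercept-only model.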
The results of fitting the reduced model are given below:

Binary Logistic Regression

Link Function: Logit

Response Information

    Variable   Value   Count
    DiseaseS   1         107   (Event)
               0          89
               Total     196

Logistic Regression Table

                                                      Odds       95% CI
    Predictor       Coef      StDev       Z       P   Ratio   Lower   Upper
    Constant     -0.6875     0.2599   -2.65   0.008
    Age         0.034064   0.009345    3.65   0.000    1.03    1.02    1.05
    CitySect      0.1739     0.3449    0.50   0.614    1.19    0.61    2.34

Log-Likelihood = -126.065
Test that all slopes are zero: G = 17.928, DF = 2, P-Value = 0.000

Goodness-of-Fit Tests

    Method            Chi-Square   DF       P
    Pearson               93.182   91   0.417
    Deviance             119.708   91   0.023
    Hosmer-Lemeshow       13.116    8   0.108

Table of Observed and Expected Frequencies:
(See Hosmer-Lemeshow Test for the Pearson Chi-Square Statistic)

              Value 1         Value 0
    Group    Obs     Exp     Obs     Exp    Total
      1       11     7.5      10    13.5       21
      2        8     7.4      11    11.6       19
      3        9     8.9      12    12.1       21
      4        5     8.9      14    10.1       19
      5       10     9.6       9     9.4       19
      6       12    10.9       8     9.1       20
      7       11    11.2       8     7.8       19
      8        9    12.9      11     7.1       20
      9       13    13.7       6     5.3       19
     10       19    16.0       0     3.0       19
    Total    107              89              196

SAS IMPLEMENTATION

    /* Multiple Logistic Regression */
    data DisOut;
    input ObsNum Age SocEcD1 SocEcD2 CitySect DisSta;
    label SocEcD1 = "Indicator for Middle SocioEcon Status"
          SocEcD2 = "Indicator for Lower SocioEcon Status"
          CitySect = "City Sector (0 = Sector 1)"
          DisSta = "Disease Status (1=Diseased)";
    cards;
    (Data Set to be Inserted here)
    ;
    run;
    proc print;
    run;
    proc logistic;
    model DisSta = Age SocEcD1 SocEcD2 CitySect / itprint plcl plrl rsquare lackfit;
    run;

Note that DESCENDING is not specified in this program, so SAS models the probability that DisSta = 0 (see the Response Profile below). The signs of the coefficients are therefore reversed relative to the Minitab output.

THE OUTPUT

    Data Set: WORK.DISOUT
    Response Variable: DISSTA    Disease Status (1=Diseased)
    Response Levels: 2
    Number of Observations: 196
    Link Function: Logit

Response Profile

    Ordered Value   DISSTA   Count
                1        0      89
                2        1     107

Maximum Likelihood Iterative Phase

    Iter   Step        -2 Log L    INTERCPT         AGE    SOCECD1     SOCECD2   CITYSECT
       0   INITIAL   270.058302   -0.184192           0          0           0          0
       1   IRLS      217.769394   -0.316255   -0.025919   0.811906   -0.580761   0.016157
       2   IRLS      215.679039   -0.213630   -0.034772   0.958907   -0.751783   0.021381
       3   IRLS      215.652532   -0.196544   -0.035939   0.976515   -0.774711   0.021312
       4   IRLS      215.652526   -0.196262   -0.035956   0.976768   -0.775062   0.021305
       5   IRLS      215.652526   -0.196262   -0.035956   0.976768   -0.775062   0.021305

    Last Change in -2 Log L: 1.136868E-13

    Last Evaluation of Gradient
    INTERCPT   -8.223913E-7
    AGE        -0.000054821
    SOCECD1    -4.526762E-7
    SOCECD2    -1.795719E-6
    CITYSECT   -5.701351E-7

The LOGISTIC Procedure

Model Fitting Information and Testing Global Null Hypothesis BETA=0

    Criterion   Intercept Only   Intercept and Covariates   Chi-Square for Covariates
    AIC                272.058                    225.653   .
    SC                 275.336                    242.043   .
    -2 LOG L           270.058                    215.653   54.406 with 4 DF (p=0.0001)
    Score                    .                          .   48.404 with 4 DF (p=0.0001)

    RSquare = 0.2424    Max-rescaled RSquare = 0.3241

Analysis of Maximum Likelihood Estimates

                     Parameter   Standard         Wald         Pr >   Standardized    Odds
    Variable   DF     Estimate      Error   Chi-Square   Chi-Square       Estimate   Ratio
    INTERCPT    1      -0.1963     0.7011       0.0784       0.7795              .       .
    AGE         1      -0.0360     0.0100      12.9026       0.0003      -0.374763   0.965
    SOCECD1     1       0.9768     0.2012      23.5669       0.0001       0.467169   2.656
    SOCECD2     1      -0.7751     0.3584       4.6760       0.0306      -0.210140   0.461
    CITYSECT    1       0.0213     0.3927       0.0029       0.9567       0.005348   1.022

Association of Predicted Probabilities and Observed Responses

    Concordant = 79.4%    Somers' D = 0.590
    Discordant = 20.4%    Gamma     = 0.591
    Tied       =  0.2%    Tau-a     = 0.294
    (9523 pairs)          c         = 0.795

Parameter Estimates and 95% Confidence Intervals

                            Profile Likelihood Confidence Limits
    Variable   Parameter Estimate     Lower      Upper
    INTERCPT              -0.1963   -1.5875     1.1740
    AGE                   -0.0360   -0.0565    -0.0170
    SOCECD1                0.9768    0.5926     1.3843
    SOCECD2               -0.7751   -1.4879    -0.0772
    CITYSECT               0.0213   -0.7506     0.7961

Conditional Odds Ratios and 95% Confidence Intervals

                                   Profile Likelihood Confidence Limits
    Variable     Unit   Odds Ratio   Lower   Upper
    AGE        1.0000        0.965   0.945   0.983
    SOCECD1    1.0000        2.656   1.809   3.992
    SOCECD2    1.0000        0.461   0.226   0.926
    CITYSECT   1.0000        1.022   0.472   2.217

Hosmer and Lemeshow Goodness-of-Fit Test

                       DISSTA = 0             DISSTA = 1
    Group   Total   Observed   Expected   Observed   Expected
      1      21          2       1.71         19      19.29
      2      20          2       3.71         18      16.29
      3      20          5       4.95         15      15.05
      4      20          7       5.90         13      14.10
      5      21          6       8.62         15      12.38
      6      20         12       9.78          8      10.22
      7      21         14      13.05          7       7.95
      8      20         15      14.11          5       5.89
      9      20         17      16.03          3       3.97
     10      13          9      11.14          4       1.86

    Goodness-of-fit Statistic = 7.1833 with 8 DF (p=0.5170)

SELECTING BEST VARIABLES

You may also use SAS to select the appropriate variables to include in your model. You do this by using the INCLUDE = p and SELECTION = STEPWISE options in the MODEL statement. The value of p tells SAS to always include in the model the first p variables listed. Thus, for the above data set, we could use the command

    proc logistic;
    model DisSta = SocEcD1 SocEcD2 CitySect Age / include = 2 selection=stepwise;
    run;

The relevant part of the output is given below:

Stepwise Selection Procedure

The following variables will be included in each model: INTERCPT SOCECD1 SOCECD2

Step 0. The INCLUDE variables were entered.

Model Fitting Information and Testing Global Null Hypothesis BETA=0

    Criterion   Intercept Only   Intercept and Covariates   Chi-Square for Covariates
    AIC                272.058                    236.851   .
    SC                 275.336                    246.685   .
    -2 LOG L           270.058                    230.851   39.207 with 2 DF (p=0.0001)
    Score                    .                          .   37.067 with 2 DF (p=0.0001)

    Residual Chi-Square = 14.7090 with 2 DF (p=0.0006)

Step 1. Variable AGE entered:

Model Fitting Information and Testing Global Null Hypothesis BETA=0

    Criterion   Intercept Only   Intercept and Covariates   Chi-Square for Covariates
    AIC                272.058                    223.655   .
    SC                 275.336                    236.768   .
    -2 LOG L           270.058                    215.655   54.403 with 3 DF (p=0.0001)
    Score                    .                          .   48.402 with 3 DF (p=0.0001)

    Residual Chi-Square = 0.0029 with 1 DF (p=0.9567)

NOTE: No (additional) variables met the 0.05 significance level for entry into the model.

Summary of Stepwise Procedure

           Variable   Variable   Number        Score         Wald         Pr >
    Step   Entered    Removed        In   Chi-Square   Chi-Square   Chi-Square
       1   AGE                        3      14.7000            .       0.0001

Analysis of Maximum Likelihood Estimates

                     Parameter   Standard         Wald         Pr >   Standardized    Odds
    Variable   DF     Estimate      Error   Chi-Square   Chi-Square       Estimate   Ratio
    INTERCPT    1      -0.2009     0.6960       0.0833       0.7729              .       .
    SOCECD1     1       0.9772     0.2010      23.6272       0.0001       0.467385   2.657
    SOCECD2     1      -0.7700     0.3463       4.9459       0.0262      -0.208780   0.463
    AGE         1      -0.0358    0.00978      13.4347       0.0002      -0.373561   0.965

Association of Predicted Probabilities and Observed Responses

    Concordant = 79.4%    Somers' D = 0.591
    Discordant = 20.3%    Gamma     = 0.592
    Tied       =  0.3%    Tau-a     = 0.294
    (9523 pairs)          c         = 0.795

Conclusions: This procedure determined that the variable City Sector is not an important predictor.

Note: If you do not include the option INCLUDE = 2, then the procedure will also check whether the SocioEconomic variables are important. Below is the program and output:

Relevant Program Portion:

    proc logistic;
    model DisSta = SocEcD1 SocEcD2 CitySect Age / selection=stepwise;
    run;
