Logistic Regression

Logistic Regression: When?
Just like multiple regression, but when the dependent variable is dichotomous, e.g. improved or not improved; successful or not successful.

Logistic Regression: Why?
Logistic regression can be used for classification purposes (it includes χ²).

Why not perform a discriminant analysis?
Logistic regression gives the probability of an effect (outcome) and evaluates the risk (odds).
Discriminant analysis can produce a probability of success that lies outside [0,1].
Discriminant analysis requires normality.

Why not perform a multiple regression?
Multiple regression can produce a probability of success that lies outside [0,1].
Multiple regression requires homoscedasticity.
Multiple regression requires normality.

Logistic Regression: Example
Suppose we want to predict whether someone has a coronary disease (DV) using age in years (IV). It is customary to code a binary DV as either 0 or 1.

Logistic Regression: The logistic curve
Linear part:
$$u = b_0 + b_1 x_1 + \dots + b_p x_p$$
Nonlinear part:
$$\hat{y} = \frac{e^{u}}{1 + e^{u}} = \frac{1}{1 + e^{-u}}$$
where $\hat{y}$ is the probability of a 1, e is the base of the natural logarithm (about 2.718) and the b are the parameters of the model. The value of $b_0$ yields $\hat{y}$ when X is zero, and $b_1$ adjusts how quickly the probability changes when X changes by a single unit (we can have standardized and unstandardized b in logistic regression, just as in ordinary linear regression). Because the relation between X and $\hat{y}$ is nonlinear, b does not have as straightforward an interpretation in this model as it does in ordinary linear regression.

Logistic Regression: Where did it come from?
Suppose we only know a person's age and we want to predict whether that person has a coronary disease or not.
We can talk about the probability of having the disease, or we can talk about the odds of having the disease. Let's say that the probability of not having the disease for a given age is .95. Then the odds of not having the disease are
$$\text{odds} = \frac{\hat{y}}{1 - \hat{y}} = \frac{0.95}{1 - 0.95} = 19$$
Now the odds of having the disease would be .05/.95, or 1/19, or 0.0526. This asymmetry is unappealing, because the odds of having the disease should be the opposite of the odds of not having the disease.

Logistic Regression: Where did it come from? (continued)
We can take care of this asymmetry through the natural logarithm, ln. The natural log of 19 is 2.9444 (ln(0.95/0.05) = 2.9444). The natural log of 1/19 is -2.9444 (ln(0.05/0.95) = -2.9444), so the log odds of having a coronary disease are exactly opposite to the log odds of not having the disease.
$$\ln(\text{odds}) = \ln\left(\frac{\hat{y}}{1 - \hat{y}}\right) = b_0 + b_1 x_1 + \dots + b_p x_p = u$$
Solving for $\hat{y}$, in terms of odds:
$$\frac{\hat{y}}{1 - \hat{y}} = e^{u}$$
and in terms of probability:
$$\hat{y} = \frac{e^{u}}{1 + e^{u}} = \frac{1}{1 + e^{-u}}$$

Logistic Regression: Finding the regression weights
In multiple regression, we wanted to minimize the residual sum of squares. This yields the formula
$$b = (X^T X)^{-1} X^T y$$
With the logistic curve, there is no closed-form solution that will produce least squares estimates of the parameters. We will use the maximum (log) likelihood instead. A likelihood is a conditional probability, P(y | X): the probability of the observed outcomes y given X and the regression weights. The idea is to choose the regression weights that give the maximum (log) likelihood between the data and the logistic curve.
Maximum likelihood:
$$L(b) = \prod_{i=1}^{n} \left[ y_i \hat{y}_i + (1 - y_i)(1 - \hat{y}_i) \right], \quad b = (b_0, b_1, \dots, b_p)$$
Maximum log likelihood:
$$LL(b) = \sum_{i=1}^{n} \left[ y_i \ln(\hat{y}_i) + (1 - y_i) \ln(1 - \hat{y}_i) \right]$$
The maximum of this expression can then be found numerically using an optimization algorithm.
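The numerical search for the maximum log likelihood can be sketched as follows. The slides do not name the optimization algorithm; Newton-Raphson is assumed here as one common choice. The data are synthetic, since the coronary-disease data set is not reproduced in this transcript, and the names `sigmoid`, `log_likelihood` and `fit_logistic` are illustrative only.

```python
import numpy as np

def sigmoid(u):
    """Logistic curve: y_hat = 1 / (1 + e^(-u))."""
    return 1.0 / (1.0 + np.exp(-u))

def log_likelihood(b, X, y):
    """LL(b) = sum of y*ln(y_hat) + (1 - y)*ln(1 - y_hat)."""
    y_hat = sigmoid(X @ b)
    eps = 1e-12  # guards against log(0)
    return float(np.sum(y * np.log(y_hat + eps)
                        + (1 - y) * np.log(1 - y_hat + eps)))

def fit_logistic(X, y, n_iter=25, tol=1e-10):
    """Maximize LL(b) by Newton-Raphson (an assumed choice of optimizer)."""
    b = np.zeros(X.shape[1])
    for _ in range(n_iter):
        y_hat = sigmoid(X @ b)
        w = y_hat * (1 - y_hat)          # diagonal of W
        grad = X.T @ (y - y_hat)         # gradient of LL(b)
        info = X.T @ (X * w[:, None])    # X^T W X
        step = np.linalg.solve(info, grad)
        b = b + step
        if np.max(np.abs(step)) < tol:
            break
    return b

# Synthetic stand-in data (assumption): age in years, binary outcome.
rng = np.random.default_rng(0)
age = rng.uniform(20, 70, size=200)
X = np.column_stack([np.ones_like(age), age])   # column of 1s = constant
true_b = np.array([-5.0, 0.1])                  # hypothetical weights
y = (rng.uniform(size=200) < sigmoid(X @ true_b)).astype(float)

b_hat = fit_logistic(X, y)
print("estimated weights:", b_hat)
print("LL(b):", log_likelihood(b_hat, X, y))
print("LL(0):", log_likelihood(np.zeros(2), X, y))
```

At the maximum the gradient X^T (y - ŷ) is essentially zero, and LL(b) is at least as large as the log likelihood of any other set of weights.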
Logistic Regression: Hypothesis testing
The idea is to compare the full model with the constant-only model using a chi-square test.
$$\chi^2 = 2\left[LL(b) - LL(0)\right] = 2\left[-53.6765 - (-68.3315)\right] = 29.3099$$
$$\chi^2(1) = 29.3099, \quad p < 0.01$$
There is only 1 predictor, hence 1 degree of freedom. This indicates that age can reliably distinguish people having a coronary disease from those who do not.

Logistic Regression: Hypothesis testing (model building)
We can use the same idea to build a regression model.
$$\chi^2 = 2\left[LL(\text{bigger model}) - LL(\text{smaller model})\right]$$
Also, the Wald statistic can be used (Z test). Sometimes the values are squared, and then a chi-square is used.
$$z_i = \frac{b_i}{SE_i}, \quad SE = \sqrt{\operatorname{diag}\left(I(b)^{-1}\right)}$$
where $I(b)$ is the Fisher information matrix:
$$I(b) = X^T W X, \quad W = \operatorname{diag}\left(\hat{y}_i (1 - \hat{y}_i)\right)$$

Logistic Regression: Hypothesis testing (Wald statistic)
Constant:
$$z_{const} = \frac{b_{const}}{SE_{const}} = \frac{-5.30945}{1.13365} = -4.68348$$
IV (age, labeled "disease" in the slides):
$$z_{disease} = \frac{b_{disease}}{SE_{disease}} = \frac{0.110921}{0.0240598} = 4.61022$$

Logistic Regression: Explained variability
There are three popular measures that approximate the variance interpretation found in linear regression ($R^2$).
1- McFadden's
$$\rho^2 = 1 - \frac{LL(b)}{LL(0)} = 1 - \frac{-53.6765}{-68.3315} = 0.214468$$
2- Cox and Snell's
$$R^2_{CS} = 1 - e^{-\frac{2[LL(b) - LL(0)]}{n}} = 1 - e^{-\frac{2[-53.6765 - (-68.3315)]}{n}} = 0.254052$$
3- Nagelkerke's
$$R^2_{N} = \frac{R^2_{CS}}{R^2_{Max}} = \frac{0.254052}{0.745035} = 0.340993$$
where
$$R^2_{Max} = 1 - e^{\frac{2 LL(0)}{n}} = 1 - e^{\frac{2(-68.3315)}{n}} = 0.745035$$

Logistic Regression: Odds Ratio (OR)
The odds ratio is the increase (or decrease) in the odds of being in one outcome category when the value of the predictor increases by one unit. If the odds are the same across groups, then OR = 1. If the OR is greater than 1, then there is an increased probability of being classified into the category. If the OR is smaller than 1, then there is a decreased probability of being classified into the given category.
$$OR_i = e^{b_i}$$
$$OR_{Disease} = e^{b_{disease}} = e^{0.110921} = 1.11731$$
Thus, at each of my birthdays my odds of having a coronary disease are multiplied by 1.12.
In other words, each year my odds of developing a coronary disease increase by 12 percent.

Logistic Regression: Odds Ratio (OR), continued
For a 5 year age difference, say, the increase is $\exp(b)^5 = 1.11731^5 = 1.74$, or a 74% increase in the odds.

Classification table (cut-off = 0.5)
Constant only: total correct percentage = 57
All predictors: total correct percentage = 74

Logistic Regression: Prediction
If I am 50 years old (x = 50), what is my probability of having a coronary disease?
$$\hat{y} = \frac{1}{1 + e^{-(-5.30945 + 0.110921 x_1)}} = \frac{1}{1 + e^{-(-5.30945 + 0.110921 \times 50)}} = \frac{1}{1 + e^{-0.236604}} = 0.558877$$

Logistic Regression: Confidence intervals (CI = 0.95)
$$SE(u(x)) = \sqrt{x^T I(b)^{-1} x} = \sqrt{x^T \operatorname{Var}(b)\, x}$$
$$SE(u) = \sqrt{\begin{pmatrix} 1 & 50 \end{pmatrix} \begin{pmatrix} 1.28517 & -0.026677 \\ -0.026677 & 0.000578876 \end{pmatrix} \begin{pmatrix} 1 \\ 50 \end{pmatrix}} = \sqrt{0.0646601} = 0.254284$$
$$u(x) \pm Z_{\alpha/2}\, SE(u(x)) = 0.236604 \pm 1.96 \times 0.254284 = [-0.261783,\ 0.73499]$$
Transforming the two limits back to probabilities with $1 / (1 + e^{-u(x)})$ gives
$$[0.434925,\ 0.675899]$$

Logistic Regression: Confidence bands (CI = 0.95)
$$SE(u(x)) = \sqrt{x^T I(b)^{-1} x} = \sqrt{x^T \operatorname{Var}(b)\, x}$$
$$\frac{1}{1 + e^{-\left[u(x) \pm Z_{\alpha/2}\, SE(u(x))\right]}}$$

Logistic Regression: Recoding a continuous variable into a dichotomous variable (cutoff at 55)

Contingency table
[Contingency table]

Regression weights and Wald test
$$\hat{y} = \frac{1}{1 + e^{-(-0.840783 + 2.09355 x_1)}}$$

Explained variability
1- McFadden's $\rho^2 = 0.136861$
2- $R^2_{CS} = 0.170588$
3- $R^2_{N} = 0.228967$

Classification table
Constant only: total correct percentage = 57
All predictors: total correct percentage = 72

Odds ratio
$$OR_{Disease} = e^{b_{disease}} = e^{2.09355} = 8.11364$$
If I am 55 years old and up, my odds of having a coronary disease are about 8 times higher.
The same value follows from the cross-product of the contingency table:
$$OR_{Disease} = \frac{21 \times 51}{22 \times 6} = 8.11364$$

Logistic Regression: Recoding a continuous variable into a dichotomous variable (cutoff at 55)
Confidence intervals
$$e^{b_i \pm Z_{\alpha/2}\, SE_{b_i}} = [2.87956,\ 22.8615]$$
The CI (0.95) is asymmetric. It suggests that the odds of coronary disease are 2.9 to 22.9 times higher if I am 55 years old and up.
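As a rough check, the figures quoted in these slides can be recomputed from the reported weights and log likelihoods. The sketch below makes two assumptions that are not stated in the transcript: the covariance between the constant and the age weight is taken as negative (-0.026677), which is what reproduces the reported 0.0646601, and n = 100 is inferred from the contingency counts 21 + 6 + 22 + 51. All variable and function names are illustrative.

```python
import math

# Weights, variances and log likelihoods as reported in the slides.
b0, b1 = -5.30945, 0.110921
var_b0, var_b1 = 1.28517, 0.000578876
cov_b0_b1 = -0.026677        # sign assumed (lost in the transcript)
ll_b, ll_0 = -53.6765, -68.3315
n = 100                      # assumed: 21 + 6 + 22 + 51

def inv_logit(u):
    """Logistic curve: 1 / (1 + e^(-u))."""
    return 1.0 / (1.0 + math.exp(-u))

def predict_with_ci(x, z=1.959964):
    """Predicted probability at age x, with a 95% CI built on the logit scale."""
    u = b0 + b1 * x
    # x^T Var(b) x for x = (1, age)
    se_u = math.sqrt(var_b0 + 2 * x * cov_b0_b1 + x ** 2 * var_b1)
    return inv_logit(u), inv_logit(u - z * se_u), inv_logit(u + z * se_u)

p50, lo50, hi50 = predict_with_ci(50)

# Likelihood-ratio chi-square and the three pseudo-R^2 measures.
chi2 = 2 * (ll_b - ll_0)
mcfadden = 1 - ll_b / ll_0
cox_snell = 1 - math.exp(-chi2 / n)
nagelkerke = cox_snell / (1 - math.exp(2 * ll_0 / n))

# Odds ratios: one year, five years, and the 2x2 table at cutoff 55.
or_year = math.exp(b1)
or_5_years = math.exp(5 * b1)
or_table = (21 * 51) / (22 * 6)

print(f"P(disease | age 50) = {p50:.6f}, 95% CI [{lo50:.6f}, {hi50:.6f}]")
print(f"chi2 = {chi2:.4f}, McFadden = {mcfadden:.6f}")
print(f"Cox-Snell = {cox_snell:.6f}, Nagelkerke = {nagelkerke:.6f}")
print(f"OR per year = {or_year:.5f}, per 5 years = {or_5_years:.2f}")
print(f"OR from table = {or_table:.5f}")
```

Building the interval on the logit scale and only then applying the logistic curve keeps both limits inside [0,1], which is also why the resulting interval is asymmetric around the predicted probability.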