Multivariate Data Analysis
Dr. Kateřina Schindlerová
Logistic Regression
Winter semester 2014-2015
The contents of the course are coordinated with Univ.-Prof. Dr. von Eye.

Linear Regressions

Simple linear regression: $y = \alpha + \beta_1 x$
- The regression coefficient $b_1$ measures the relationship between $y$ and $x$.
- It is estimated by the least squares method.

Multiple linear regression: $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m$
- Relation between a continuous variable $y$ and a set of continuous variables $x_i$.
- The partial regression coefficients $b_i$ measure the association between $x_i$ and $y$, adjusted for all other $x_j$.

Multivariate linear regression: $y$ is a vector.

Linear Regression

(Multiple) linear regression: $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m$
- The $x_i$, $i = 1, \dots, m$, are the independent variables (also called predictor variables, explanatory variables, or covariables).
- $y$ is the dependent variable (also called predicted variable, response variable, or outcome variable).
- The outcome variable of linear regression is continuous, e.g. blood pressure (BP).
- Example: BP (as $y$) versus age, weight, height, etc. (as $x$).

Multivariate Analysis: Model Choice

Model                   Outcome
Linear regression       continuous
Poisson regression      count data and contingency tables
Logistic regression     binary or categorical, a probability function
Discriminant analysis   a category or group to which a subject belongs

Model choice depends on the study, its objectives, and the variables.

Logistic Regression (LoR)

- Logistic regression analyzes the relationship between multiple independent variables and a categorical dependent variable.
- It models and estimates the probability of occurrence of an event by fitting data to a logistic curve.
- Binary logistic regression is used when the dependent variable is dichotomous and the independent variables are either continuous or categorical.
- Multinomial logistic regression is used when the dependent variable is not dichotomous and comprises more than two categories.

History

- Malthus' population growth theory (1798): the Earth's population will increase geometrically (i.e. exponential growth).
- The theory warned of an impending famine catastrophe (Malthusianism).
- The Belgian statistician P. F. Verhulst (1837-47) was the first to give an analytic form of the logistic function and used it to model population growth. He fitted the logistic curve to the populations of Belgium, France and Russia.
- In 1920, Pearl and Reed applied the logistic curve to population modelling in the USA.

(Source: J. S. Cramer, Origins of Logistic Regression, University of Amsterdam, 2002; Wikipedia)

Odds

The odds of an event are the ratio of the probability that the event will occur to the probability that it will not occur.
- Probability of the event occurring: $p$
- Probability of the event not occurring: $1 - p$
- This is a Bernoulli trial, as it has exactly two outcomes.

$$\mathrm{odds}(\mathrm{Event}) = \frac{p}{1 - p}$$

- Logistic regression works with the probability of an event occurring relative to the probability of it not occurring, so the impact of the independent variables can be explained in terms of odds.
- Odds are not confined to the interval $[0, 1]$; they range over $[0, \infty)$. Taking the natural logarithm maps the odds to $(-\infty, \infty)$, and the logistic function maps the log odds back to a probability in $(0, 1)$.

Logistic Model

With logistic regression, the natural log odds are modelled as a linear function of the explanatory variable:

$$\mathrm{logit}(y) = \ln(\mathrm{odds}) = \ln\left(\frac{p}{1 - p}\right) = \alpha + \beta x$$

- $p$ is the probability of the outcome of interest.
- $x$ is the explanatory variable.
- The parameters of the logistic regression are $\alpha$ and $\beta$.
- $\mathrm{logit}(y)$ is the link function (see generalized linear models).

Simple Logistic Model

Equation for predicting the probability of occurrence of the outcome of interest, where $x$ is a specific value of $X$:

$$p = P(Y = \text{outcome of interest} \mid X = x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$

Complex Logistic Model

An extension to multiple predictors:

$$\mathrm{logit}(y) = \ln(\mathrm{odds}) = \ln\left(\frac{p}{1 - p}\right) = \alpha + \beta_1 x_1 + \cdots + \beta_m x_m$$

$$p = P(Y = \text{outcome of interest} \mid X_1 = x_1, \dots, X_m = x_m) = \frac{e^{\alpha + \beta_1 x_1 + \cdots + \beta_m x_m}}{1 + e^{\alpha + \beta_1 x_1 + \cdots + \beta_m x_m}}$$

Odds Ratio

The odds ratio (OR) is a comparative measure of two odds relative to different events. For events A and B, the odds of A occurring relative to B occurring are

$$OR(A, B) = \frac{\mathrm{odds}(A)}{\mathrm{odds}(B)} = \frac{p_A / (1 - p_A)}{p_B / (1 - p_B)}$$

- The OR measures an association between an exposure and an outcome.
- It represents the odds that an outcome (a disease) will occur given a particular exposure (a treatment, a healthy lifestyle), compared to the odds of the outcome occurring in the absence of that exposure.

An Example of Odds Ratio

One investigates the relationship between the occurrence of heart attack and smoking. There are 10000 patients; for each we know whether he/she smokes and whether he/she had a heart attack.

              heart attack: yes   heart attack: no
smoker        130                 1870
non-smoker    70                  7930

Of the 2000 smokers, 130 had a heart attack. The odds ratio is

$$OR = \frac{130 \times 7930}{70 \times 1870} \approx 7.88$$

The odds of having a heart attack are about 8 times higher for smokers than for non-smokers.

(Source: Wikipedia)
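The odds ratio computation in the heart attack example can be checked with a few lines of Python. This is only an illustrative sketch; the helper names `odds` and `odds_ratio` are hypothetical and not part of the lecture.

```python
def odds(events, non_events):
    """Odds of an event: p / (1 - p), which equals events / non_events."""
    return events / non_events


def odds_ratio(a, b, c, d):
    """Odds ratio for a 2x2 table.

    a: exposed with outcome      b: exposed without outcome
    c: unexposed with outcome    d: unexposed without outcome
    """
    return odds(a, b) / odds(c, d)


# Counts from the heart attack example: smokers 130 / 1870, non-smokers 70 / 7930.
or_smoking = odds_ratio(130, 1870, 70, 7930)
print(round(or_smoking, 2))  # 7.88
```

Dividing the two odds is algebraically the same as the cross-product form (130 × 7930) / (70 × 1870) used on the slide.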
Odds Ratio

- The exponential of the regression coefficient, $e^{b_1}$, is the OR associated with a one-unit increase in the independent variable.
- The OR can determine whether a particular exposure is a risk factor for a particular outcome.
- The OR can compare the magnitude of various risk factors for that outcome.

Interpreting the value of the OR:
- OR = 1: the exposure does not affect the odds of the outcome.
- OR > 1: the exposure is associated with higher odds of the outcome.
- OR < 1: the exposure is associated with lower odds of the outcome.

Example: the variable for smoking is coded as 0 (= non-smoker) and 1 (= smoker), and the OR for this variable is 2.4. Then the odds of a positive outcome are 2.4 times higher for smokers than for non-smokers.

Logistic Regression and Curve

- Logistic regression generalizes the OR beyond two binary variables (Peng and So, 2002).
- Logistic regression fits a regression curve $y = f(x)$ for a binary variable $y$.
- The logistic curve is a sigmoid curve:

$$y = \frac{e^{x}}{1 + e^{x}} = \frac{1}{1 + e^{-x}}$$

[Figure: the logistic curve. Source: Wikipedia]

Logistic Curve for Regression

With the linear predictor $t = \alpha + \beta x$ (as in linear regression):

$$y = \frac{e^{t}}{1 + e^{t}} = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}} = \frac{1}{1 + e^{-(\alpha + \beta x)}}$$

- Logistic regression determines the coefficients $\alpha$ and $\beta$.
- The logit transformation changes the range of the proportion from $(0, 1)$ to $(-\infty, \infty)$.

Logit Function

With $p = P(y \mid x)$:

$$\mathrm{logit}(y) = \ln(\mathrm{odds}) = \ln\left(\frac{p}{1 - p}\right) = \ln\left(\frac{P(y \mid x)}{1 - P(y \mid x)}\right) = \alpha + \beta x$$

- $p$ is the probability of the outcome of interest, and $x$ is the predictor.
- The logit function transforms the S-shaped (sigmoid) curve into a straight line.
- The logit function is the inverse of the sigmoidal logistic function.
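The statement that the logit is the inverse of the logistic function can be illustrated numerically. The sketch below is a minimal illustration. The function names are hypothetical, and the coefficients are the values reported in Example II of this deck, used here only as sample inputs.

```python
import math


def logistic(t):
    """Logistic (sigmoid) function: maps a real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-t))


def logit(p):
    """Logit function: natural log of the odds p / (1 - p), defined for 0 < p < 1."""
    return math.log(p / (1.0 - p))


# Sample coefficients (alpha, beta); illustrative inputs, not a new fit.
alpha, beta = -0.841, 2.094

for x in (0, 1):
    t = alpha + beta * x   # linear predictor, i.e. the log odds
    p = logistic(t)        # predicted probability
    # logit(p) recovers t, which shows the two functions are inverses.
    print(x, round(p, 3), round(logit(p), 3))
```

Applying `logit` to the output of `logistic` returns the linear predictor, which is why the model is linear on the log-odds scale even though it is S-shaped on the probability scale.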
Logit - Advantages

- Simple transformation of $P(y \mid x)$
- Linear relationship with $x$
- Can be continuous (the logit lies between $-\infty$ and $\infty$)
- The outcome follows a binomial distribution ($P$ is between 0 and 1)
- Directly related to the odds of disease

Interpretation of β

                  Exposure (x): yes          Exposure (x): no
Disease (y): yes  P(y = 1 | x = 1)           P(y = 1 | x = 0)
Disease (y): no   1 - P(y = 1 | x = 1)       1 - P(y = 1 | x = 0)

Notation: $d$ = disease, $e$ = exposure ($x = 1$), $\bar{e}$ = no exposure ($x = 0$).

$$\frac{p}{1 - p} = e^{\alpha + \beta x}$$

$$\mathrm{Odds}(d \mid e) = e^{\alpha + \beta}, \qquad \mathrm{Odds}(d \mid \bar{e}) = e^{\alpha}$$

$$OR(d, e) = \frac{\mathrm{Odds}(d \mid e)}{\mathrm{Odds}(d \mid \bar{e})} = \frac{e^{\alpha + \beta}}{e^{\alpha}} = e^{\beta}$$

$$\ln(OR) = \beta$$

$\beta$ is the increase in log odds for a one-unit increase in $x$.

Example

Age (< 55 and 55+ years) and risk of developing coronary heart disease (CD):

Age     CD present (1)   CD absent (0)
55+     21               6
< 55    22               51

- Odds of disease among the exposed: $\mathrm{Odds}(d \mid e) = 21/6$
- Odds of disease among the unexposed: $\mathrm{Odds}(d \mid \bar{e}) = 22/51$
- Odds ratio: $OR = \frac{21/6}{22/51} \approx 8.1$

(Adapted from Salmi et al., University of Tunghai, Taiwan)

Assumptions of Logistic Regression

- Not required: linearity of the relationship between the dependent and independent variables, normality of the errors, and homoscedasticity of the errors.
- LoR can handle both continuous and discrete data as independent variables.
- The dependent variable is discrete, mostly dichotomous.
- Since LoR estimates the probability of the event occurring, $P(Y = 1)$, the dependent variable must be coded accordingly, i.e. the desired outcome should be coded as 1.
- The model should have no multicollinearity (i.e. the independent variables are not linear functions of each other).
- Although LoR does not require a linear relationship between the dependent and independent variables, it does require that the independent variables are linearly related to the log odds of the event.
- LoR requires large sample sizes for data fitting (because of the maximum likelihood method).

Sample Size for the Multiple Logistic Regression (mLoR)

Minimum number of observations for mLoR:

$$N = \frac{10k}{p}$$

where $p$ is the smaller of the proportions of negative and positive cases in the population and $k$ is the number of independent variables (Peduzzi, Concato, Kemper, Holford and Feinstein, 1996).

Sample sizes greater than 400 are recommended (Hosmer and Lemeshow, 2000).

Fitting Logistic Regression Model to Data

- Linear regression: least squares (LS).
- Logistic regression: the underlying distribution is binomial, and least squares does not suffice for estimating $\alpha$ and $\beta$.
- Maximum likelihood: an iterative procedure that computes the values of $\alpha$ and $\beta$ which maximize the probability that the observed values of the dependent variable in the data set are predicted from the observed values of the independent variables.
- It is easier to work with the log-likelihood.

Likelihood Function

Assume that in a sample of size $n$ each individual has probability $p$ that an event occurs.
- $y_i = 1$ if the event occurs for the $i$-th subject, otherwise $y_i = 0$.
- Data: $y_1, \dots, y_n$ and $x_1, \dots, x_n$.

The joint probability of the data (the likelihood) is

$$L = \prod_{i=1}^{n} p(y \mid x)^{y_i} \, (1 - p(y \mid x))^{1 - y_i} = p(y \mid x)^{\sum_{i=1}^{n} y_i} \, (1 - p(y \mid x))^{n - \sum_{i=1}^{n} y_i}$$

Log Likelihood

Taking logarithms:

$$\ell = \log(L) = \sum_{i=1}^{n} y_i \log[p(y \mid x)] + \left(n - \sum_{i=1}^{n} y_i\right) \log[1 - p(y \mid x)]$$

where

$$p(y \mid x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$

- The first derivatives of $\ell$ are computed and set to zero to solve for $\alpha$ and $\beta$.
- Initialization: arbitrary values are chosen for the coefficients (usually 0).
- The log-likelihood is computed and its change under variation of the coefficient values is observed.
- Iteration is then performed until $\ell$ is maximal (equivalent to maximizing $L$).
- The results are the maximum likelihood estimates of $\alpha$ and $\beta$ and estimates of $P(y)$ for a given value of $x$.

Multiple Logistic Regression

More than one independent variable, which may be dichotomous (binary), ordinal, nominal, continuous, etc.:

$$\ln\left(\frac{p}{1 - p}\right) = \alpha + \beta_1 x_1 + \cdots + \beta_m x_m$$

- $\beta_i$ is the increase in log odds for a one-unit increase in $x_i$ with all other $x_j$, $j \neq i$, held constant.
- It measures the association between $x_i$ and the log odds, adjusted for all other $x_j$.

Multiple Logistic Regression with Interaction Terms

Effect modification can be modelled by including interaction terms, for example:

$$\ln\left(\frac{p}{1 - p}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$$

Coefficients

If the model fits well, the next question is how important each of the independent variables is.

The contribution of individual predictors: the logistic regression coefficient for the $i$-th independent variable shows the change in the predicted log odds of the outcome for a one-unit change in the $i$-th independent variable, all other variables being equal. That is, if the $i$-th independent variable is changed by 1 unit while all other predictors are held constant, the log odds of the outcome are expected to change by $\beta_i$ units.

Statistical Testing of LoR for Individual Regression Coefficients

Question: does the model that includes a given independent variable provide more information about the dependent variable than the model without this variable?

Tests:
- Likelihood ratio statistic (LRS)
- Wald test
- Odds ratios with 95% CI
- Hosmer-Lemeshow test
- Score test (Lagrange multiplier test)
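The iterative maximum likelihood procedure (start from coefficients of 0, then update until the log-likelihood stops improving) can be sketched in plain Python. The slides do not say which iteration scheme is meant; Newton-Raphson is assumed here as one common choice, and the function name `fit_simple_logistic` is hypothetical. The data are the age and coronary heart disease counts from the example in this deck.

```python
import math


def sigmoid(t):
    """Logistic function 1 / (1 + e^(-t))."""
    return 1.0 / (1.0 + math.exp(-t))


def fit_simple_logistic(xs, ys, max_iter=25, tol=1e-10):
    """Maximum likelihood fit of ln(p / (1 - p)) = alpha + beta * x.

    Assumption: Newton-Raphson steps, starting from alpha = beta = 0.
    """
    alpha, beta = 0.0, 0.0
    for _ in range(max_iter):
        g_a = g_b = 0.0              # first derivatives of the log-likelihood
        h_aa = h_ab = h_bb = 0.0     # entries of the 2x2 information matrix
        for x, y in zip(xs, ys):
            p = sigmoid(alpha + beta * x)
            w = p * (1.0 - p)
            g_a += y - p
            g_b += (y - p) * x
            h_aa += w
            h_ab += w * x
            h_bb += w * x * x
        det = h_aa * h_bb - h_ab * h_ab
        # Newton step: solve (information matrix) * step = gradient.
        step_a = (h_bb * g_a - h_ab * g_b) / det
        step_b = (h_aa * g_b - h_ab * g_a) / det
        alpha += step_a
        beta += step_b
        if max(abs(step_a), abs(step_b)) < tol:
            break
    return alpha, beta


# Age and coronary heart disease example: x = 1 for 55+, x = 0 for < 55.
xs = [1] * 27 + [0] * 73
ys = [1] * 21 + [0] * 6 + [1] * 22 + [0] * 51
alpha_hat, beta_hat = fit_simple_logistic(xs, ys)
print(round(alpha_hat, 3), round(beta_hat, 3))  # -0.841 2.094
```

With a single binary predictor the estimates have a closed form: the intercept is ln(22/51) and the slope is the log of the sample odds ratio, so the iteration here mainly illustrates the mechanics. In practice a statistics package would normally be used for the fit.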
Likelihood Ratio Statistic (LRS)

The LRS compares two nested models:

$$\log(\mathrm{odds}) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_m x_m \quad \text{(model 1)}$$

$$\log(\mathrm{odds}) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_q x_q \quad \text{(model 2)}$$

where $q < m$.

- The overall fit of the model with $m$ coefficients can be examined by the LRS, which tests the null hypothesis $H_0: \beta_1 = \beta_2 = \cdots = \beta_m = 0$.
- The likelihood of the null model is the likelihood of obtaining the observations if the independent variables had no effect on the outcome.
- The likelihood of the given model is the likelihood of obtaining the observations with all independent variables in the model.

The difference between the two models yields a goodness-of-fit index $G$, a $\chi^2$ statistic whose degrees of freedom equal the number of coefficients by which the two models differ ($m - q$ for models 1 and 2; $m$ when the given model is compared with the null model) (Bewick, Cheek, Ball, 2005). $G$ measures how well all of the independent variables together affect the outcome:

$$G = \chi^2 = (-2 \log \text{likelihood of null model}) - (-2 \log \text{likelihood of given model}) = -2 \log\left(\frac{\text{likelihood of null model}}{\text{likelihood of given model}}\right)$$

If the p-value for the overall model fit statistic is < 0.05, then $H_0$ is rejected: at least one of the independent variables contributes to the prediction of the outcome.

Wald Test

- The Wald test assesses the contribution of individual predictors, i.e. the significance of individual coefficients in a given model (Bewick et al., 2005).
- It is the ratio of the square of the regression coefficient to the square of the standard error of the coefficient.
- The Wald statistic is asymptotically distributed as a $\chi^2$ distribution.

$$W_j = \frac{\beta_j^2}{SE_{\beta_j}^2}$$

- Each Wald statistic is compared with a $\chi^2$ distribution with 1 degree of freedom.
- Wald statistics are easy to calculate, but their reliability is questionable.
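Both statistics are simple to compute once the coefficients, standard errors and -2 log-likelihoods are available. The following sketch uses only the Python standard library. The helper names are hypothetical, and the numeric inputs are the values reported in Examples II and III of this deck. For 1 degree of freedom the upper-tail chi-square probability can be written with the complementary error function, which avoids any external dependency.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability P(X > x) for a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def wald_statistic(beta, se):
    """Wald statistic: squared coefficient divided by its squared standard error."""
    return (beta / se) ** 2


def lr_statistic(neg2_loglik_reduced, neg2_loglik_full):
    """Likelihood ratio statistic G: difference of the two -2 log-likelihoods."""
    return neg2_loglik_reduced - neg2_loglik_full


# Example II: coefficient of age and its standard error.
w_age = wald_statistic(2.094, 0.529)
p_age = chi2_sf_1df(w_age)

# Example III: -2 log(L) without and with the interaction term.
g_interaction = lr_statistic(342.836, 342.092)
p_interaction = chi2_sf_1df(g_interaction)

print(round(w_age, 2), p_age < 0.05)
print(round(g_interaction, 2), round(p_interaction, 2))
```

The `chi2_sf_1df` helper only covers the 1-degree-of-freedom case; comparing models that differ by several coefficients would need the general chi-square distribution from a statistics library.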
Odds Ratios with 95% CI

- The odds ratio with its 95% confidence interval (CI) is used to assess the contribution of individual predictors (Katz, 1999).
- Unlike the p-value, the 95% CI does not report a measure's statistical significance.
- The 95% CI is used to estimate the precision of the OR.
- A wide CI indicates a low level of precision of the OR, whereas a narrow CI indicates a higher precision of the OR.

An approximate confidence interval for the population log odds ratio is

$$95\%\ \text{CI for } \ln(OR) = \ln(OR) \pm 1.96 \times SE_{\ln(OR)}$$

where $\ln(OR)$ is the sample log odds ratio and $SE_{\ln(OR)}$ is the standard error of the log odds ratio (Morris and Gardner, 1988).

By exponentiating, we get the 95% CI for the odds ratio:

$$95\%\ \text{CI for } OR = e^{\ln(OR) \pm 1.96 \times SE_{\ln(OR)}} = e^{\beta \pm 1.96 \times SE_{\beta}}$$

Hosmer-Lemeshow Test

- The H-L test is a statistical test of goodness of fit for logistic regression models.
- It is used frequently in risk prediction models.
- The test assesses whether or not the observed event rates match the expected event rates in subgroups of the model population.
- The H-L test specifically identifies the subgroups as the deciles of fitted risk values.
- The test statistic asymptotically follows a $\chi^2$ distribution.
- It is not recommended to use the test for $n < 400$ (Hosmer and Lemeshow, 2000).

Rao's Score Test

- The score test (Lagrange multiplier test) tests a simple null hypothesis that a parameter of interest $\theta$ is equal to some particular value $\theta_0$.
- It is the most powerful test when the true value of $\theta$ is close to $\theta_0$.
- The main advantage of the score test is that it does not require an estimate of the information under the alternative hypothesis or the unconstrained maximum likelihood.
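The confidence interval formula for the odds ratio translates directly into code. The sketch below is illustrative only. The function name `odds_ratio_ci` is hypothetical, and the inputs are the coefficient and standard error for lack of exercise reported in Example I of this deck.

```python
import math


def odds_ratio_ci(beta, se, z=1.96):
    """Return (OR, lower, upper): the odds ratio e^beta and its approximate 95% CI."""
    return (math.exp(beta), math.exp(beta - z * se), math.exp(beta + z * se))


# Example I: beta_1 = 1.0047 with standard error 0.2614.
or_exc, lower, upper = odds_ratio_ci(1.0047, 0.2614)
print(round(or_exc, 2), round(lower, 2), round(upper, 2))  # 2.73 1.64 4.56
```

Because the interval is built on the log scale and then exponentiated, it is not symmetric around the odds ratio itself.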
Example I.

Adapted from: Kerr, Handbook of Public Health Methods, McGraw-Hill, 1998.

- $p$: probability of cardiac arrest
- exc: 1 = lack of exercise, 0 = exercise
- smk: 1 = smoker, 0 = non-smoker

$$\ln\left(\frac{p}{1 - p}\right) = \alpha + \beta_1\,\mathrm{exc} + \beta_2\,\mathrm{smk} = 0.7102 + 1.0047\,\mathrm{exc} + 0.7005\,\mathrm{smk}$$

with $SE_{\beta_1} = 0.2614$ and $SE_{\beta_2} = 0.2664$.

OR for lack of exercise: $e^{1.0047} = 2.73$ (adjusted for smoking).

$$95\%\ \text{CI} = e^{\beta_1 \pm 1.96 \times SE_{\beta_1}} = e^{1.0047 \pm 1.96 \times 0.2614} = 1.64 \text{ to } 4.56$$

Example II.

Results of fitting the logistic regression model:

$$\ln\left(\frac{p}{1 - p}\right) = \alpha + \beta \times \mathrm{Age} = -0.841 + 2.094 \times \mathrm{Age}$$

            Coefficient   SE      Coeff/SE
Age         2.094         0.529   3.96
Constant    -0.841        0.255   -3.3

- $\ln(OR) = 2.094$, so $OR = e^{2.094} = 8.1$.
- Wald test for the effect of age: $3.96^2 \approx 15.7$ with 1 degree of freedom, $p < 0.05$.
- 95% confidence interval for the OR: $e^{2.094 \pm 1.96 \times 0.529}$, i.e. from 2.9 to 22.9.

Example III.

Is there an interactive effect between smoking and exercise?

$$\ln\left(\frac{p}{1 - p}\right) = \alpha + \beta_1\,\mathrm{exc} + \beta_2\,\mathrm{smk} + \beta_3\,\mathrm{smk} \times \mathrm{exc}$$

- Product term: $\beta_3 = -0.4604$ (SE 0.5332)
- Wald test = 0.75 (1 df)
- $-2 \log(L)$ = 342.092 with the interaction term and 342.836 without the interaction term
- LR statistic = 0.74 (1 df), $p = 0.39$
- Conclusion: no evidence of any interaction

Validation of Logistic Regression

Question: can the results of the LoR analysis on the sample be extended to the population from which the sample was drawn?

- Model validation: estimate the coefficients on one data set, use this model to predict the outcome variable in a second data set, then check the residuals, and so on.
- The performance of a model that is validated on the same data on which it was developed is likely to be over-estimated.
- Thus, the validity of the model should be assessed by carrying out tests of goodness of fit and discrimination on a different data set (Giancristofaro and Salmaso, 2003).
- Internal validation: the model is computed on a subsample of the observations and validated with the remaining sample. Techniques: data-splitting, repeated data-splitting, jackknife technique, bootstrapping.
- External validation: validity is tested with a new, independent data set.

Evaluation of the Results of Logistic Regression

To be evaluated:
- an overall evaluation of the logistic model;
- statistical tests of individual predictors;
- goodness-of-fit statistics;
- assessment of the predicted probabilities.

Evaluation of the Results of Logistic Regression: An Example I.

Source: Park, H.: An Introduction to Logistic Regression: From Basic Concepts to Interpretation with Particular Attention to Nursing Domain, J. Korean Acad. Nurs., Vol. 43, No. 2, 2013.

Statistical tests of individual predictors. Example output from logistic regression:

Predictor   β        SE(β)   Wald's χ²   df   p         e^β (OR)   95% CI for OR
                                                                    Lower    Upper
chol.       1.48     0.45    10.98       1    < 10^-3   4.04       1.83     10.58
const.      -12.78   1.98    44.82       1    < 10^-3

chol. = cholesterol, const. = constant

Evaluation of the Results of Logistic Regression: An Example II.

- Cholesterol was a significant predictor of the event ($p < 0.05$).
- The slope coefficient 1.48 represents the change in the log odds for a one-unit increase in cholesterol.
- The test of the intercept was significant ($p < 0.05$), suggesting that the intercept should be included in the model.
- The odds ratio of 4.04 indicates that the odds of an event are multiplied by 4.04 when the value of cholesterol is increased by 1 unit.
Evaluation of the Results of Logistic Regression: An Example III.

Example output from logistic regression: overall model evaluation and goodness-of-fit statistics.

Test category              Test                    χ²      df   p
Overall model evaluation   Likelihood ratio test   12.02   2    0.002
                           Score test              11.52   2    0.003
                           Wald test               11.06   2    0.004
Goodness-of-fit test       Hosmer and Lemeshow     7.76    8    0.457

- Model evaluation tests: likelihood ratio, score, and Wald tests. All three tests conclude for the given data set that the logistic model with independent variables was more effective than the null model.
- Inferential goodness-of-fit test: the Hosmer-Lemeshow test statistic of 7.76 was not significant ($p > 0.05$), suggesting that the model fits the data well.

Evaluation of the Results of Logistic Regression: An Example IV.

Example output from logistic regression: a classification table.

Observed            Predicted: yes   Predicted: no   % correct
yes                 3                57              5.00
no                  6                124             95.38
Overall % correct                                    66.84

- The classification table shows the degree to which the predicted probabilities agree with the actual outcomes.
- The overall correct prediction rate of 66.84 % shows an improvement over the chance level, which is 50 %.

The table also yields the following measures:
- Sensitivity: the proportion of correctly classified events.
- Specificity: the proportion of correctly classified nonevents.
- False positives: the proportion of observations misclassified as events among all those classified as events.
- False negatives: the proportion of observations misclassified as nonevents among all those classified as nonevents.
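The four measures defined for the classification table can be computed from its counts. This is a minimal sketch: the function name `classification_summary` and the dictionary keys are hypothetical, and the counts are those of the classification table in this example (observed yes: 3 predicted yes, 57 predicted no; observed no: 6 predicted yes, 124 predicted no).

```python
def classification_summary(tp, fn, fp, tn):
    """Summary measures of a 2x2 classification table.

    tp: events classified as events          fn: events classified as nonevents
    fp: nonevents classified as events       tn: nonevents classified as nonevents
    """
    total = tp + fn + fp + tn
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        # Shares among the *predicted* classes, following the slide's definitions.
        "false_positive_proportion": fp / (tp + fp),
        "false_negative_proportion": fn / (tn + fn),
        "overall_correct": (tp + tn) / total,
    }


summary = classification_summary(tp=3, fn=57, fp=6, tn=124)
for name, value in summary.items():
    print(f"{name}: {100 * value:.2f} %")
```

Note that the false positive and false negative proportions here are taken over the predicted classes, as defined on the slide, and so differ from the rates 1 - specificity and 1 - sensitivity that some texts use under the same names.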
Basic Literature

- von Eye, A., Mun, E.-Y.: Log-linear modeling - Concepts, interpretation and applications. New York: Wiley, 2013.
- Sabine Fromm: Binäre logistische Regressionsanalyse: Eine Einführung für Sozialwissenschaftler mit SPSS für Windows (in German), available online as a PDF.
- http://user.demogr.mpg.de/doblhammer/logreg.pdf (in German)
- http://www.empirical-methods.hslu.ch/h-logistische-regression.htm

Logistic Regression - Videos

Recommended videos:
- https://www.youtube.com/watch?v=P o-xZJflPM (logistic regression in SPSS, in English)
- https://www.google.com/search?sourceid=navclient&ie=UTF-8&rlz=1T 4WQIAe nAT 538AT 538&q=spss+logistische+regression