Logistic Regression & Survival Analysis
Analysis of binary outcome & time to event data
Larry Holmes, Jr
Joabyer Hossain
Stats Research, Lecture 7
November 13, 2008

Presentation Objectives
At the end of this presentation, participants should be able to:
Explain the rationale for logistic regression, how it is conducted, and how its results are interpreted
Survival analysis
– Measure Time and Events
– Understand Truncation and Censoring
– Understand Survival and Hazard Functions
– Define Competing Risks
– Understand Models and Hypothesis Testing
Log rank
Kaplan-Meier survival curve & estimates
Cox Proportional Hazards Model (semi-parametric model)

What is Logistic Regression?
– Logistic regression is often used because the relationship between the DV (a discrete variable) and a predictor is non-linear
Blood glucose level and diabetes mellitus
Hypertension and LDL level

Logistic Regression
In logistic regression:
Outcome variable is binary
Purpose of the analysis is to assess the effects of multiple explanatory variables, which can be numeric and/or categorical, on the outcome variable.

Requirements for Logistic Regression
The following need to be specified:
1) An outcome variable with two possible categorical outcomes (1 = success; 0 = failure).
2) A way to estimate the probability P of the outcome variable.
3) A way to link the outcome variable to the explanatory variables.
4) A way to estimate the coefficients of the regression equation, as well as their confidence intervals.
5) A way to test the goodness of fit of the regression model.

Measuring the Probability of Outcome
The probability of the outcome is measured by the odds of occurrence of an event.
If P is the probability of an event, then (1-P) is the probability of it not occurring.
Odds of success = $\dfrac{P}{1 - P}$

The logistic function
$\hat{Y}_i = \dfrac{e^{u}}{1 + e^{u}}$
Where $\hat{Y}_i$ is the estimated probability that the ith case is in a category and $u$ is the regular linear regression equation:
$u = A + B_1 X_1 + B_2 X_2 + \dots + B_K X_K$

Logistic function
For a response variable y with p(y=1) = P and p(y=0) = 1 - P
$P(y \mid x) = \dfrac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$
[Figure: S-shaped logistic curve; y-axis: Probability of disease (0.0 to 1.0), x-axis: x]
Logistic regression will allow for the estimation of an equation that fits a curve to the age/probability of CHD relationship
A regression method to deal with the case when the dependent variable y is binary (dichotomous)

The logistic function
Change in probability is not constant (linear) with constant changes in X
This means that the probability of a success (Y = 1) given the predictor variable (X) is a non-linear function, specifically a logistic function

The logistic function
It is not obvious how the regression coefficients for X are related to changes in the dependent variable (Y) when the model is written this way
Change in Y (in probability units) given X depends on the value of X. Look at the S-shaped function

The Logistic Regression
The joint effect of all explanatory variables put together on the odds is
Odds = $\dfrac{P}{1 - P} = e^{\alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p}$
Taking the logarithms of both sides
$\log\left\{\dfrac{P}{1 - P}\right\} = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$
Logit P = $\alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$
The coefficients $\beta_1, \beta_2, \dots, \beta_p$ are estimated by maximum likelihood: they are the values that make the observed outcomes most probable. (Unlike linear regression, they are not chosen to minimize the sum of squared distances between observed and predicted values.)

The Logistic Regression
Logit P = $\alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$
α represents the baseline log odds of disease (the log odds when every X is 0)
β1 represents the amount by which the log odds of disease change for a unit change in X1
β2 is the amount by which the log odds of disease change for a unit change in X2
……. and so on.
What changes is the log odds.
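The point that a unit change in X shifts the log odds by a constant amount, while the change in probability depends on where you are on the S-shaped curve, can be checked numerically. This is a minimal sketch; the coefficient values `alpha` and `beta` and the helper names are illustrative assumptions, not values from the lecture.

```python
import math

def logistic(u):
    """Probability P = e^u / (1 + e^u) for a linear predictor u."""
    return math.exp(u) / (1.0 + math.exp(u))

def logit(p):
    """Log odds: log(P / (1 - P))."""
    return math.log(p / (1.0 - p))

# Assumed, illustrative coefficients (not taken from the slides' data).
alpha, beta = -4.0, 1.6

xs = [0, 1, 2, 3, 4]
probs = [logistic(alpha + beta * x) for x in xs]

# Change in probability for each unit step in x: not constant.
prob_steps = [probs[i + 1] - probs[i] for i in range(len(xs) - 1)]

# Change in log odds for each unit step in x: always beta.
logit_steps = [logit(probs[i + 1]) - logit(probs[i]) for i in range(len(xs) - 1)]

# Ratio of odds for each unit step in x: always e^beta.
odds = [p / (1.0 - p) for p in probs]
odds_ratios = [odds[i + 1] / odds[i] for i in range(len(xs) - 1)]

for x, p in zip(xs, probs):
    print(f"x={x}  P={p:.3f}  log odds={logit(p):.2f}")
```

Printing `prob_steps` next to `logit_steps` shows the contrast directly: the first list varies from step to step, the second does not.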
The odds themselves are multiplied by $e^{\beta}$
If β = 1.6 the odds are multiplied by $e^{1.6}$ = 4.95

Logistic Regression-Demo
MS-Excel: No default functions
SPSS: Analyze > Regression > Binary Logistic > Select Dependent variable: > Select independent variable (covariate)

Logistic Regression SPSS output

Dependent Variable Encoding

| Original Value | Internal Value |
| --- | --- |
| 0 | 0 |
| 1 | 1 |

Categorical Variables Codings

| Shades | Frequency | Parameter coding (1) |
| --- | --- | --- |
| 1 | 30 | 1.000 |
| 2 | 30 | .000 |

Classification Table(a,b), Step 0

| Observed pc | Predicted pc = 0 | Predicted pc = 1 | Percentage Correct |
| --- | --- | --- | --- |
| 0 | 0 | 30 | .0 |
| 1 | 0 | 30 | 100.0 |
| Overall Percentage | | | 50.0 |

a Constant is included in the model.
b The cut value is .500

Variables in the Equation, Step 0

| | B | S.E. | Wald | df | Sig. | Exp(B) |
| --- | --- | --- | --- | --- | --- | --- |
| Constant | .000 | .258 | .000 | 1 | 1.000 | 1.000 |

Variables not in the Equation, Step 0

| | Score | df | Sig. |
| --- | --- | --- | --- |
| Variables: Shades(1) | 17.067 | 1 | .000 |
| Overall Statistics | 17.067 | 1 | .000 |

Logistic Regression SPSS output

Omnibus Tests of Model Coefficients, Step 1

| | Chi-square | df | Sig. |
| --- | --- | --- | --- |
| Step | 17.985 | 1 | .000 |
| Block | 17.985 | 1 | .000 |
| Model | 17.985 | 1 | .000 |

Model Summary

| Step | -2 Log likelihood | Cox & Snell R Square | Nagelkerke R Square |
| --- | --- | --- | --- |
| 1 | 65.193(a) | .259 | .345 |

a Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.

Classification Table(a), Step 1

| Observed pc | Predicted pc = 0 | Predicted pc = 1 | Percentage Correct |
| --- | --- | --- | --- |
| 0 | 23 | 7 | 76.7 |
| 1 | 7 | 23 | 76.7 |
| Overall Percentage | | | 76.7 |

a The cut value is .500

Variables in the Equation, Step 1(a)

| | B | S.E. | Wald | df | Sig. | Exp(B) |
| --- | --- | --- | --- | --- | --- | --- |
| Shades(1) | -2.379 | .610 | 15.189 | 1 | .000 | .093 |
| Constant | 1.190 | .432 | 7.594 | 1 | .006 | 3.286 |

a Variable(s) entered on step 1: Shades.

Regression vs. Survival Analysis

| Technique | Predictor Variables | Outcome Variable | Censoring permitted? |
| --- | --- | --- | --- |
| Linear Regression | Categorical or continuous | Normally distributed | No |
| Logistic Regression | Categorical or continuous | Binary (except in polytomous log. regression) | No |
| Survival Analyses | Time and categorical or continuous | Binary | Yes |

Regression vs.
Survival Analysis

| Technique | Mathematical model | Yields |
| --- | --- | --- |
| Linear Regression | $Y = B_1 X + B_0$ (linear) | Linear changes |
| Logistic Regression | $\ln(P/(1-P)) = B_1 X + B_0$ (sigmoidal prob.) | Odds ratios |
| Survival Analyses | $h(t) = h_0(t)\exp(B_1 X + B_0)$ | Hazard rates |

What is survival analysis?
Model time to failure or time to event
– Unlike linear regression, survival analysis has a dichotomous (binary) outcome
– Unlike logistic regression, survival analysis analyzes the time to an event
Why is that important?
Able to account for censoring
Can compare survival between 2+ groups
Assess relationship between covariates and survival time

Survival Analysis
Survival analysis deals with making inference about EVENT RATES
Rate at t = Rate among those at risk at t
Deals with median survival (50%), not mean survival (need everyone to have an event) ….. Why?

Survival vs. time-to-event
Outcome variable = event time
Examples of events:
– Death, infection, MI, prostate cancer death, hospitalization
– Recurrence of cancer after treatment

Types of censoring
Subject does not experience event of interest
Incomplete follow-up
– Lost to follow-up
– Withdraws from study
– Dies (if not being studied)
Left or right censored

Survival Function
S(t) = P[ T ≥ t ] = 1 – P[ T < t ]
Plot: Y axis = % alive, X axis = time
Proportion of population still without the event by time t

[Figure: Survival Curve; y-axis: Proportion Alive (0.0 to 1.0), x-axis: Months since surgery (0 to 9)]

Hazard Function
Also termed incidence rate, instantaneous risk, force of mortality
λ(t)
Event rate at t among those at risk for an event
Key function
Estimated in a straightforward way
– Censored
– Truncated

[Figure: Time to Cardiovascular Adverse Event in VIGOR Trial]

Hazard Function
Event = death, scale = months since Tx
“λ(t) = 1% at t = 12 months”
“At 1 year, patients are dying at a rate of 1% per month”
“At 1 year the chance of dying in the following month is 1%”

Relationship between survivor function and hazard function
Survivor function, S(t) defines
the probability of surviving longer than time t – this is what the Kaplan-Meier curves show.
– Hazard function is the instantaneous risk of event at time t (conditional failure rate): the negative derivative of the survivor function over time, divided by the survivor function, $h(t) = -\dfrac{dS(t)/dt}{S(t)}$
Survivor and hazard functions can be converted into each other

Use of survival analysis: clinical trial
Accrual into the study over 2 years
Data analysis at year 3
Reasons for exiting a study
– Died
– Alive at study end
– Withdrawal for non-study related reasons (LTFU)
– Died from other causes

Kaplan-Meier
One way to estimate survival
Nice, simple, can compute by hand
Can add stratification factors
Cannot evaluate covariates the way the Cox model can
No sensible interpretation for competing risks

Kaplan-Meier estimate
Multiply together a series of conditional probabilities

| Time ti | # at risk | # events | Ŝ |
| --- | --- | --- | --- |
| 0 | 20 | 0 | 1.00 |
| 5 | 20 | 2 | [1-(2/20)]*1.00 = 0.90 |
| 6 | 18 | 0 | [1-(0/18)]*0.90 = 0.90 |
| 10 | 15 | 1 | [1-(1/15)]*0.90 = 0.84 |
| 13 | 14 | 2 | [1-(2/14)]*0.84 = 0.72 |

[Figure: Kaplan-Meier Curve; y-axis: Proportion Surviving (95% Confidence), x-axis: Survival Time]

Limit of Kaplan-Meier curves
What happens when you have several covariates that you believe contribute to survival?
Example – Smoking, hyperlipidemia, diabetes, and hypertension contribute to time to myocardial infarction
Can use stratified K-M curves – for 2 or maybe 3 covariates
Need another approach – multivariate Cox proportional hazards model is most common – for many covariates
– (think multivariate regression or logistic regression rather than a Student’s t-test or the odds ratio from a 2 x 2 table)

Multivariable method: Cox proportional hazards
Needed to assess effect of multiple covariates on survival
Cox proportional hazards is the most commonly used multivariable survival method

Cox proportional hazard model
Works with hazard model
Conveniently separates baseline hazard function from covariates
– Baseline hazard function over time: $h(t) = h_0(t)\exp(B_1 X + B_0)$
– Covariates are time independent
– B1 is used to calculate the hazard ratio, which is similar to the relative risk
Semi-parametric

Cox Proportional Hazards Model
Add covariates to the model
Change in a prognostic factor → proportional change in the hazard (on the log scale)
Can test the effect of the prognostic factor as in linear regression - H0: β = 0

Limitations of Cox PH model
Does not accommodate variables that change over time
– Most variables (e.g. gender, ethnicity, or congenital condition) are constant
If necessary, one can program time-dependent variables
When might you want this?
Baseline hazard function, $h_0(t)$, is never specified
– You can estimate $h_0(t)$ accurately if you need to estimate S(t).

Summary
Survival analysis quantifies time to a single, dichotomous event
Handles censored data well
Survival and hazard can be mathematically converted to each other
Kaplan-Meier survival curves can be compared statistically and graphically
Cox proportional hazards models help distinguish individual contributions of covariates on survival, provided certain assumptions are met.
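The statement that survival and hazard can be converted to each other rests on two standard relations: $S(t) = \exp\left(-\int_0^t h(u)\,du\right)$ and $h(t) = -\dfrac{d}{dt}\log S(t)$. The sketch below illustrates both numerically. It assumes a constant hazard of 1% per month, echoing the figure quoted on the Hazard Function slide; the function names, step counts, and the constant-hazard assumption are illustrative choices, not part of the lecture.

```python
import math

def survival_from_hazard(hazard, t, steps=10_000):
    """S(t) = exp(-integral of h from 0 to t), using the trapezoidal rule."""
    dt = t / steps
    cumulative_hazard = 0.0
    for i in range(steps):
        left, right = i * dt, (i + 1) * dt
        cumulative_hazard += 0.5 * (hazard(left) + hazard(right)) * dt
    return math.exp(-cumulative_hazard)

def hazard_from_survival(survival, t, eps=1e-5):
    """h(t) = -d/dt log S(t), using a central finite difference."""
    upper = math.log(survival(t + eps))
    lower = math.log(survival(t - eps))
    return -(upper - lower) / (2.0 * eps)

def constant_hazard(t):
    """Assumed hazard: 1% per month at every time t."""
    return 0.01

# Hazard -> survival: proportion still event-free at 12 months.
s12 = survival_from_hazard(constant_hazard, 12.0)

# Survival -> hazard: recover the 1% per month rate at 12 months.
h12 = hazard_from_survival(
    lambda t: survival_from_hazard(constant_hazard, t), 12.0
)

print(f"S(12) = {s12:.4f}")
print(f"h(12) = {h12:.4f}")
```

Under this assumption about 89% of subjects are still event-free at one year, slightly more than the 88% that subtracting 1% twelve times would suggest, because the hazard applies only to those still at risk.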
SPSS output of Survival functions

Survival Table

| | Time | Status | Cumulative Proportion Surviving at the Time: Estimate | Std. Error | N of Cumulative Events | N of Remaining Cases |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | 6.000 | 1 | .800 | .179 | 1 | 4 |
| 2 | 14.000 | 1 | .600 | .219 | 2 | 3 |
| 3 | 21.000 | 0 | . | . | 2 | 2 |
| 4 | 44.000 | 1 | .300 | .239 | 3 | 1 |
| 5 | 62.000 | 1 | .000 | .000 | 4 | 0 |

Means and Medians for Survival Time

| | Estimate | Std. Error | 95% Confidence Interval: Lower Bound | Upper Bound |
| --- | --- | --- | --- | --- |
| Mean(a) | 35.800 | 11.810 | 12.652 | 58.948 |
| Median | 44.000 | 23.875 | .000 | 90.794 |

a. Estimation is limited to the largest survival time if it is censored.

[Figure: SPSS output of KM plot]

[Figure: SPSS output of cumulative hazard]

SPSS output of Cox Regression

Omnibus Tests of Model Coefficients(a,b)

| -2 Log Likelihood | Overall (score): Chi-square | df | Sig. | Change From Previous Step: Chi-square | df | Sig. | Change From Previous Block: Chi-square | df | Sig. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 6.732 | .468 | 1 | .494 | .646 | 1 | .422 | .646 | 1 | .422 |

a. Beginning Block Number 0, initial Log Likelihood function: -2 Log likelihood: 7.378
b. Beginning Block Number 1. Method = Enter

Variables in the Equation

| | B | SE | Wald | df | Sig. | Exp(B) |
| --- | --- | --- | --- | --- | --- | --- |
| psa | -1.393 | 2.305 | .365 | 1 | .546 | .248 |
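The SPSS figures above can be cross-checked by hand. The sketch below recomputes the Kaplan-Meier estimates, the median, and the mean from the five (Time, Status) pairs in the Survival Table, and recomputes Exp(B) from the Cox coefficient for psa. The function names are hypothetical, and the median rule used here (smallest event time at which the estimate falls to 0.5 or below) is an assumption that happens to agree with the printed value.

```python
import math

def kaplan_meier(times, status):
    """Product-limit estimate: list of (time, S) at each event time.

    status: 1 = event observed, 0 = censored.
    """
    data = sorted(zip(times, status))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        tied = [s for (tt, s) in data if tt == t]
        events = sum(tied)
        if events > 0:
            # Multiply by the conditional probability of surviving past t.
            surv *= 1.0 - events / at_risk
            curve.append((t, surv))
        # Events and censored subjects both leave the risk set.
        at_risk -= len(tied)
        i += len(tied)
    return curve

def median_survival(curve):
    """Smallest event time with S(t) <= 0.5, or None if never reached."""
    for t, s in curve:
        if s <= 0.5:
            return t
    return None

def mean_survival(curve):
    """Area under the step curve up to the last event time."""
    area, prev_t, prev_s = 0.0, 0.0, 1.0
    for t, s in curve:
        area += (t - prev_t) * prev_s
        prev_t, prev_s = t, s
    return area

# Data copied from the SPSS Survival Table.
times = [6, 14, 21, 44, 62]
status = [1, 1, 0, 1, 1]

curve = kaplan_meier(times, status)
median = median_survival(curve)
mean = mean_survival(curve)

# Hazard ratio for psa from the Cox output: Exp(B) = e^B.
hazard_ratio = math.exp(-1.393)

for t, s in curve:
    print(f"t={t}  S={s:.3f}")
print(f"median={median}  mean={mean:.1f}  Exp(B)={hazard_ratio:.3f}")
```

The censored subject at time 21 produces no step in the curve but still shrinks the risk set, which is why the estimate drops from .600 to .300 at time 44 (one event among two at risk).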