Logistic Regression & Survival Analysis
Analysis of binary outcome & time to event data
Larry Holmes, Jr
Joabyer Hossain
Stats Research, Lecture 7
November 13, 2008
Presentation Objectives
At the end of this presentation, participants should be able to:
• Explain the rationale for logistic regression, conduct the analysis, and interpret the results
• Survival analysis
– Measure time and events
– Understand truncation and censoring
– Understand survival and hazard functions
– Define competing risks
– Understand models and hypothesis testing
  • Log-rank test
  • Kaplan-Meier survival curve and estimates
  • Cox proportional hazards model (semi-parametric model)
What is Logistic Regression?
– Logistic regression is often used because the relationship between the DV (a discrete variable) and a predictor is non-linear
• Blood glucose level and diabetes mellitus
• Hypertension and LDL level
Logistic Regression
In logistic regression:
• Outcome variable is binary
• Purpose of the analysis is to assess the effects of multiple explanatory variables, which can be numeric and/or categorical, on the outcome variable.
Requirements for Logistic Regression
The following need to be specified:
1) An outcome variable with two possible categorical
outcomes (1=success; 0=failure).
2) Estimating the probability P of the outcome variable.
3) Linking the outcome variable to the explanatory
variables.
4) Estimating the coefficients of the regression equation, as
well as their confidence intervals.
5) Testing the goodness of fit of the regression model.
Measuring the Probability of Outcome
The probability of the outcome is measured
by the odds of occurrence of an event.
If P is the probability of an event, then (1-P) is
the probability of it not occurring.
Odds of success = P / (1 - P)
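The conversion between probability and odds can be sketched in a few lines of Python. The function names `odds_from_prob` and `prob_from_odds` are illustrative and do not come from the slides.

```python
def odds_from_prob(p: float) -> float:
    """Odds of success, P / (1 - P), for a probability with 0 <= p < 1."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must satisfy 0 <= p < 1")
    return p / (1.0 - p)


def prob_from_odds(odds: float) -> float:
    """Probability that corresponds to non-negative odds."""
    if odds < 0.0:
        raise ValueError("odds must be non-negative")
    return odds / (1.0 + odds)


# A few example probabilities and their odds.
for p in (0.1, 0.5, 0.8):
    print(f"P = {p:.1f} -> odds = {odds_from_prob(p):.3f}")
```

A probability of 0.5 corresponds to odds of 1 (an even chance), and the odds grow without bound as P approaches 1.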
The logistic function
$\hat{Y}_i = \frac{e^{u}}{1 + e^{u}}$
Where $\hat{Y}_i$ (Y-hat) is the estimated probability that the ith case is in a category and u is the regular linear regression equation:
$u = A + B_1 X_1 + B_2 X_2 + \dots + B_K X_K$
Logistic function
For a response variable y with p(y=1) = P and p(y=0) = 1 - P:
$P(y \mid x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$
[Figure: logistic curve. Y axis: probability of disease (0.0 to 1.0); X axis: x]
Logistic regression is a regression method for the case when the dependent variable y is binary (dichotomous). It allows the estimation of an equation that fits a curve to the relationship between age and probability of CHD.
The logistic function


Change in probability is not constant
(linear) with constant changes in X
This means that the probability of a
success (Y = 1) given the predictor
variable (X) is a non-linear function,
specifically a logistic function
The logistic function


It is not obvious how the regression
coefficients for X are related to changes in
the dependent variable (Y) when the
model is written this way
The change in Y (in probability units) for a given change in X depends on the value of X. Look at the S-shaped function.
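A small numerical sketch may make the non-constant change in probability concrete. The coefficients A = -5 and B1 = 1 are hypothetical values chosen only for illustration; they are not estimates from the lecture.

```python
import math


def logistic(u: float) -> float:
    """Logistic function e^u / (1 + e^u), written to avoid overflow for large |u|."""
    if u >= 0:
        return 1.0 / (1.0 + math.exp(-u))
    e = math.exp(u)
    return e / (1.0 + e)


# Hypothetical intercept and slope, for illustration only.
A, B1 = -5.0, 1.0

# Change in probability for each one-unit step in X.
steps = [(x, logistic(A + B1 * (x + 1)) - logistic(A + B1 * x)) for x in range(0, 10)]
for x, delta in steps:
    print(f"X: {x} -> {x + 1}   change in P = {delta:.4f}")
```

The same one-unit step in X moves the probability only slightly in the tails of the curve and much more near its middle, which is why the coefficients are interpreted on the log-odds scale rather than the probability scale.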
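The Logistic Regression
The joint effect of all explanatory variables put together on the odds is
$\text{Odds} = \frac{P}{1-P} = e^{\alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p}$
Taking the logarithms of both sides
$\log\frac{P}{1-P} = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$
$\text{Logit } P = \alpha + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$
The coefficients α, β1, β2, …, βp are estimated by maximum likelihood: they are the values that make the observed outcomes most probable under the model. This differs from linear regression, where the coefficients minimize the sum of squared distances between the observed and predicted values.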
The Logistic Regression
Logit P = α + β1X1 + β2X2 + … + βpXp
α represents the baseline log odds of disease (when all X are 0)
β1 represents the amount by which the log odds of disease is altered by a unit change in X1
β2 is the amount by which the log odds of disease is altered by a unit change in X2
…and so on.
What changes is the log odds. The odds themselves are multiplied by $e^{\beta}$.
If β = 1.6, the odds are multiplied by $e^{1.6} = 4.95$
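Why a unit change in X1 multiplies the odds by a constant factor follows directly from the logit equation. The short derivation below holds the other predictors fixed; the notation odds(X1) is introduced here only for illustration.

```latex
\begin{aligned}
\log \text{odds}(X_1 + 1) - \log \text{odds}(X_1)
  &= \left[\alpha + \beta_1 (X_1 + 1) + \dots + \beta_p X_p\right]
   - \left[\alpha + \beta_1 X_1 + \dots + \beta_p X_p\right] = \beta_1 \\
\frac{\text{odds}(X_1 + 1)}{\text{odds}(X_1)} &= e^{\beta_1},
  \qquad \text{e.g. } \beta_1 = 1.6 \;\Rightarrow\; e^{1.6} \approx 4.95
\end{aligned}
```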
Logistic Regression-Demo

MS-Excel: No default functions

SPSS: Analyze > Regression > Binary Logistic > Select
Dependent variable: > Select independent variable
(covariate)
Logistic Regression SPSS output
Dependent Variable Encoding
Original Value   Internal Value
0                0
1                1

Categorical Variables Codings
             Frequency   Parameter coding (1)
Shades   1   30          1.000
         2   30          .000

Classification Table(a,b), Step 0
                     Predicted pc
Observed             0      1      Percentage Correct
pc   0               0      30     .0
pc   1               0      30     100.0
Overall Percentage                 50.0
a Constant is included in the model.
b The cut value is .500

Variables in the Equation
                    B      S.E.   Wald   df   Sig.    Exp(B)
Step 0   Constant   .000   .258   .000   1    1.000   1.000

Variables not in the Equation
                                  Score    df   Sig.
Step 0   Variables   Shades(1)    17.067   1    .000
         Overall Statistics       17.067   1    .000
Logistic Regression SPSS output
Omnibus Tests of Model Coefficients
                  Chi-square   df   Sig.
Step 1   Step     17.985       1    .000
         Block    17.985       1    .000
         Model    17.985       1    .000

Model Summary
Step   -2 Log likelihood   Cox & Snell R Square   Nagelkerke R Square
1      65.193(a)           .259                   .345
a Estimation terminated at iteration number 4 because parameter estimates changed by less than .001.

Classification Table(a), Step 1
                     Predicted pc
Observed             0      1      Percentage Correct
pc   0               23     7      76.7
pc   1               7      23     76.7
Overall Percentage                 76.7
a The cut value is .500

Variables in the Equation
                        B        S.E.   Wald     df   Sig.   Exp(B)
Step 1(a)   Shades(1)   -2.379   .610   15.189   1    .000   .093
            Constant    1.190    .432   7.594    1    .006   3.286
a Variable(s) entered on step 1: Shades.
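The figures in this output can be checked by hand. With a single binary predictor, the maximum likelihood estimates equal the observed log odds in each group, so no iterative fitting is needed. The sketch below assumes the 2 x 2 counts implied by the Step 1 classification table (7 of 30 with pc = 1 in the group coded Shades(1) = 1, and 23 of 30 in the reference group); those counts are an inference from the output, not something SPSS prints, and the variable names are illustrative.

```python
import math

# Counts implied by the Step 1 classification table (an assumption, see lead-in):
# group coded Shades(1) = 1: 7 of 30 with pc = 1; reference group (coded 0): 23 of 30.
events_coded1, n_coded1 = 7, 30
events_ref, n_ref = 23, 30


def log_odds(events: int, n: int) -> float:
    """Observed log odds of the outcome in a group."""
    return math.log(events / (n - events))


# With one binary predictor, the ML estimates equal the observed log odds.
constant = log_odds(events_ref, n_ref)
b = log_odds(events_coded1, n_coded1) - constant

# Large-sample standard errors from the cell counts, and the Wald statistic for B.
se_constant = math.sqrt(1 / events_ref + 1 / (n_ref - events_ref))
se_b = math.sqrt(1 / events_coded1 + 1 / (n_coded1 - events_coded1) + se_constant**2)
wald_b = (b / se_b) ** 2


def minus2_loglik(events: int, n: int, p: float) -> float:
    """-2 log likelihood of a group with a common event probability p."""
    return -2.0 * (events * math.log(p) + (n - events) * math.log(1.0 - p))


# Fitted probabilities are the group proportions; the null model uses the overall proportion.
m2ll_model = (minus2_loglik(events_coded1, n_coded1, events_coded1 / n_coded1)
              + minus2_loglik(events_ref, n_ref, events_ref / n_ref))
p_null = (events_coded1 + events_ref) / (n_coded1 + n_ref)
m2ll_null = minus2_loglik(events_coded1 + events_ref, n_coded1 + n_ref, p_null)
model_chi_square = m2ll_null - m2ll_model

print(f"B = {b:.3f}, S.E. = {se_b:.3f}, Wald = {wald_b:.3f}, Exp(B) = {math.exp(b):.3f}")
print(f"Constant = {constant:.3f}, S.E. = {se_constant:.3f}, Exp(B) = {math.exp(constant):.3f}")
print(f"-2 log likelihood = {m2ll_model:.3f}, model chi-square = {model_chi_square:.3f}")
```

Under these assumed counts the hand calculation agrees with the SPSS table to the printed precision, which also shows where Exp(B) comes from: it is the odds ratio between the two Shades groups.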
Regression vs. Survival Analysis
Technique             Predictor Variables                  Outcome Variable                                 Censoring permitted?
Linear Regression     Categorical or continuous            Normally distributed                             No
Logistic Regression   Categorical or continuous            Binary (except in polytomous log. regression)    No
Survival Analyses     Time and categorical or continuous   Binary                                           Yes
Regression vs. Survival Analysis
Technique             Mathematical model                           Yields
Linear Regression     Y = B1X + Bo (linear)                        Linear changes
Logistic Regression   Ln(P/(1-P)) = B1X + Bo (sigmoidal prob.)     Odds ratios
Survival Analyses     h(t) = ho(t)exp(B1X + Bo)                    Hazard rates
What is survival analysis?

Model time to failure or time to event
– Unlike linear regression, survival analysis has a dichotomous
(binary) outcome
– Unlike logistic regression, survival analysis analyzes the time
to an event

Why is that important?
• Able to account for censoring
• Can compare survival between 2+ groups
• Assess relationship between covariates and survival time

Survival Analysis

• Survival analysis deals with making inference about EVENT RATES
• Rate at t = rate among those at risk at t
• Deals with median survival (50%), not mean survival (the mean needs everyone to have an event). Why?
• Survival vs. time-to-event
• Outcome variable = event time
• Examples of events:
– Death, infection, MI, prostate cancer death, hospitalization
– Recurrence of cancer after treatment
Types of censoring
• Subject does not experience event of interest
• Incomplete follow-up
– Lost to follow-up
– Withdraws from study
– Dies (if not being studied)
• Left or right censored
Survival Function
S(t) = P[ T > t ] = 1 – P[ T ≤ t ]
• Plot: Y axis = % alive, X axis = time
• Proportion of population still without the event by time t

[Figure: Survival Curve. Y axis: proportion alive (0.0 to 1.0); X axis: months since surgery (0 to 9)]
Hazard Function
• Also termed incidence rate, instantaneous risk, force of mortality
• λ(t)
• Event rate at t among those at risk for an event
• Key function
• Estimated in a straightforward way
– Censored
– Truncated
[Figure: Time to Cardiovascular Adverse Event in VIGOR Trial]
Hazard Function

• Event = death, scale = months since Tx
• “λ(t) = 1% at t = 12 months”
• “At 1 year, patients are dying at a rate of 1% per month”
• “At 1 year the chance of dying in the following month is 1%”
Relationship between survivor function and hazard function
• Survivor function, S(t), defines the probability of surviving longer than time t
– this is what the Kaplan-Meier curves show.
– Hazard function is the negative derivative of the log of the survivor function over time: $h(t) = -\frac{d}{dt}\log S(t) = -\frac{dS(t)/dt}{S(t)}$
• instantaneous risk of event at time t (conditional failure rate)
• Survivor and hazard functions can be converted into each other
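The conversion between the two functions can be written out as a short derivation. This is the standard textbook relationship; f(t), the density of the event time, and H(t), the cumulative hazard, are symbols introduced here and are not used on the slides.

```latex
\begin{aligned}
h(t) &= \frac{f(t)}{S(t)} = -\frac{S'(t)}{S(t)} = -\frac{d}{dt}\log S(t) \\
H(t) &= \int_0^t h(u)\,du = -\log S(t) \\
S(t) &= \exp\{-H(t)\}
\end{aligned}
```

As an illustration only, if the hazard were constant at 1% per month, then S(12) = exp(-0.12) ≈ 0.887, so about 89% of patients would still be alive at one year.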
Use of survival analysis: clinical trial

• Accrual into the study over 2 years
• Data analysis at year 3
• Reasons for exiting a study
– Died
– Alive at study end
– Withdrawal for non-study related reasons (LTFU)
– Died from other causes
Kaplan-Meier

• One way to estimate survival
• Nice, simple, can compute by hand
• Can add stratification factors
• Cannot evaluate covariates like Cox model
• No sensible interpretation for competing risks
Kaplan-Meier estimate

• Multiply together a series of conditional probabilities
Time ti   # at risk   # events   Ŝ
0         20          0          1.00
5         20          2          [1-(2/20)]*1.00=0.90
6         18          0          [1-(0/18)]*0.90=0.90
10        15          1          [1-(1/15)]*0.90=0.84
13        14          2          [1-(2/14)]*0.84=0.72
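The running product in this table is easy to reproduce in code. The following is a minimal sketch of the product-limit calculation, taking the slide's (time, number at risk, number of events) rows as input; the function name `kaplan_meier` is illustrative.

```python
from typing import List, Tuple


def kaplan_meier(rows: List[Tuple[float, int, int]]) -> List[Tuple[float, float]]:
    """Product-limit estimate from (time, number at risk, number of events) rows."""
    survival = 1.0
    curve = []
    for time, at_risk, events in rows:
        # Conditional probability of surviving past this time, given at risk at this time.
        survival *= 1.0 - events / at_risk
        curve.append((time, survival))
    return curve


# Rows copied from the slide's table: (time, # at risk, # events).
table = [(0, 20, 0), (5, 20, 2), (6, 18, 0), (10, 15, 1), (13, 14, 2)]
curve = kaplan_meier(table)
for time, s in curve:
    print(f"t = {time:>2}: S = {s:.2f}")
```

The number at risk drops between rows both because of events and because of censoring, which is why it is supplied as an input here rather than derived from the event counts.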
[Figure: Kaplan-Meier Curve. Y axis: proportion surviving with 95% confidence (0.6 to 1.0); X axis: survival time (0 to 20)]
Limit of Kaplan-Meier curves


• What happens when you have several covariates that you believe contribute to survival?
• Example
– Smoking, hyperlipidemia, diabetes, and hypertension contribute to time to myocardial infarct
• Can use stratified K-M curves for 2 or maybe 3 covariates
• Need another approach for many covariates: the multivariate Cox proportional hazards model is most common
– (think multivariate regression or logistic regression rather than a Student’s t-test or the odds ratio from a 2 x 2 table)
Multivariable method: Cox proportional
hazards
• Needed to assess effect of multiple covariates on survival
• Cox proportional hazards is the most commonly used multivariable survival method

Cox proportional hazard model

• Works with hazard model
• Conveniently separates baseline hazard function from covariates
– Baseline hazard function over time
h(t) = ho(t)exp(B1X+Bo)
– Covariates are time independent
– B1 is used to calculate the hazard ratio, which is similar to the relative risk
• Semi-parametric
Cox Proportional Hazards Model

• Add covariates to the model
• Change in a prognostic factor → proportional change in the hazard (on the log scale)
• Can test the effect of the prognostic factor as in linear regression: H0: β=0
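The "proportional" part of the model can be seen by comparing two subjects whose covariate differs by one unit. The derivation below uses the lecture's form of the model, h(t) = ho(t)exp(B1X+Bo); the conditional notation h(t | X) is added here for clarity.

```latex
\frac{h(t \mid X + 1)}{h(t \mid X)}
  = \frac{h_0(t)\exp\{B_1 (X + 1) + B_0\}}{h_0(t)\exp\{B_1 X + B_0\}}
  = e^{B_1}
```

The baseline hazard cancels, so the hazard ratio does not depend on t, and testing H0: β=0 is the same as testing whether the hazard ratio equals 1.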
Limitations of Cox PH model

• Does not accommodate variables that change over time
– Most variables (e.g. gender, ethnicity, or congenital condition) are constant
• If necessary, one can program time-dependent variables
• When might you want this?
• Baseline hazard function, ho(t), is never specified
– You can estimate ho(t) accurately if you need to estimate S(t).
Summary





• Survival analysis quantifies time to a single, dichotomous event
• Handles censored data well
• Survival and hazard can be mathematically converted to each other
• Kaplan-Meier survival curves can be compared statistically and graphically
• Cox proportional hazards models help distinguish individual contributions of covariates to survival, provided certain assumptions are met.
SPSS output of Survival functions
Survival Table
                       Cumulative Proportion
                       Surviving at the Time      N of Cumulative   N of Remaining
     Time     Status   Estimate   Std. Error      Events            Cases
1    6.000    1        .800       .179            1                 4
2    14.000   1        .600       .219            2                 3
3    21.000   0        .          .               2                 2
4    44.000   1        .300       .239            3                 1
5    62.000   1        .000       .000            4                 0

Means and Medians for Survival Time
                                    95% Confidence Interval
          Estimate   Std. Error     Lower Bound   Upper Bound
Mean(a)   35.800     11.810         12.652        58.948
Median    44.000     23.875         .000          90.794
a. Estimation is limited to the largest survival time if it is censored.
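The survival table can be reproduced from the five raw observations. The sketch below is a minimal Kaplan-Meier calculation with Greenwood standard errors, assuming no tied times (as in this example) and that the longest time is an event; the function names are illustrative and this is not how SPSS is implemented internally.

```python
import math
from typing import List, Optional, Tuple

Row = Tuple[float, float, float]  # (event time, survival estimate, std. error)


def km_from_raw(times: List[float], status: List[int]) -> List[Row]:
    """Kaplan-Meier rows from raw data; status is 1 for an event, 0 for censored."""
    data = sorted(zip(times, status))
    at_risk = len(data)
    survival, greenwood_sum = 1.0, 0.0
    rows = []
    for time, event in data:
        if event == 1:
            survival *= (at_risk - 1) / at_risk
            if at_risk > 1:
                # Greenwood's formula: S * sqrt(sum of d / (n * (n - d))), with d = 1.
                greenwood_sum += 1.0 / (at_risk * (at_risk - 1))
                std_error = survival * math.sqrt(greenwood_sum)
            else:
                std_error = 0.0  # the estimate has dropped to zero
            rows.append((time, survival, std_error))
        at_risk -= 1  # events and censored cases both leave the risk set
    return rows


def km_mean(rows: List[Row]) -> float:
    """Area under the step curve from time 0 up to the last event time."""
    area, previous_time, previous_survival = 0.0, 0.0, 1.0
    for time, survival, _ in rows:
        area += previous_survival * (time - previous_time)
        previous_time, previous_survival = time, survival
    return area


def km_median(rows: List[Row]) -> Optional[float]:
    """First event time at which the estimate is at or below 0.5."""
    for time, survival, _ in rows:
        if survival <= 0.5:
            return time
    return None


# Time and Status columns copied from the Survival Table.
rows = km_from_raw([6, 14, 21, 44, 62], [1, 1, 0, 1, 1])
for time, s, se in rows:
    print(f"t = {time:>2}: S = {s:.3f}, SE = {se:.3f}")
print(f"mean = {km_mean(rows):.3f}, median = {km_median(rows)}")
```

The censored case at time 21 produces no row of its own but still shrinks the risk set, which is why the estimate falls from .600 to .300 at time 44 rather than to .400.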
[Figure: SPSS output of KM plot]
[Figure: SPSS output of cumulative hazard]
SPSS output of Cox Regression
Omnibus Tests of Model Coefficients(a,b)
-2 Log       Overall (score)           Change From Previous Step   Change From Previous Block
Likelihood   Chi-square   df   Sig.    Chi-square   df   Sig.      Chi-square   df   Sig.
6.732        .468         1    .494    .646         1    .422      .646         1    .422
a. Beginning Block Number 0, initial Log Likelihood function: -2 Log likelihood: 7.378
b. Beginning Block Number 1. Method = Enter

Variables in the Equation
       B        SE      Wald   df   Sig.   Exp(B)
psa    -1.393   2.305   .365   1    .546   .248
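The columns of the Cox "Variables in the Equation" table are related by simple formulas, which the sketch below recomputes from the printed B and SE for psa. Because the inputs are rounded to three decimals, the results agree with the table only approximately; the 95% confidence interval for the hazard ratio is an addition that SPSS did not print in this output.

```python
import math

# Values copied from the "Variables in the Equation" row for psa.
b, se = -1.393, 2.305

hazard_ratio = math.exp(b)   # Exp(B): hazard ratio per unit increase in psa
wald = (b / se) ** 2         # Wald chi-square statistic with 1 df

# Two-sided p-value for a 1-df chi-square, computed via the standard normal tail.
p_value = math.erfc(abs(b / se) / math.sqrt(2.0))

# 95% Wald confidence interval for the hazard ratio (not part of the SPSS output).
z = 1.959964
ci_low, ci_high = math.exp(b - z * se), math.exp(b + z * se)

print(f"Exp(B) = {hazard_ratio:.3f}, Wald = {wald:.3f}, Sig. = {p_value:.3f}")
print(f"95% CI for the hazard ratio: {ci_low:.4f} to {ci_high:.2f}")
```

The interval is very wide and includes 1, consistent with the non-significant Wald test; the difference between the two -2 log likelihoods in the omnibus table (7.378 - 6.732 = .646) gives the likelihood ratio chi-square reported there.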