6. Multivariate Analysis (6th Summer Course on RMHS 2015)

Practicum
1. The First Session
1. Introduction of program
2. Descriptive statistics
3. Univariate analysis
4. Multivariate analysis
2. The Second Session (Homework)
Presentations on Friday Morning
Multivariate Analysis
Önder Ergönül, MD, MPH
Koç University, School of Medicine
Summer Course on Research Methodology and Ethics in Medical Sciences
June 16-20, 2014, Istanbul
Background
Use of statistical methods in NEJM articles (%):

Method                                         1978-79   1989   2004-05
Descriptive only                                  27      12      13
Statistics tables (contingency tables)            27      36      53
Epidemiologic measures (relative risk, odds)      10      22      35
Survival analysis                                 11      32      61
Multivariate analysis (regression)                 5      14      51
Power analysis                                     3       3      39
Horton NJ, Switzer SS. NEJM 2005; 353.
Multivariate analyses
Regression   Dependent variable (outcome)
Linear       Continuous
Logistic     Dichotomous
Cox          Dichotomous (time to event)
Poisson      Count
Confounder

Variable A → Outcome, distorted by a confounder (Variable B)

Example: Acinetobacter infection → fatality; confounder: severity of illness (APACHE score)
Control of the confounders
1. Randomization
2. Stratification
3. Adjustment by multivariate analysis
Odds Ratio
p = probability (or proportion)
The lower bound is 0, and the upper bound is 1.
Probability of success: Pr(y = 1) = p
Probability of failure: Pr(y = 0) = 1 – p
What is the p of success or failure?

           Failure   Success   Total
General     1 - p       p      (1 - p) + p = 1
Example      .25       .75      1

Odds = p/(1 - p) = .75/(1 - .75) = .75/.25 = 3

Odds Ratio = [pA/(1 - pA)] / [pB/(1 - pB)]
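The odds and odds-ratio arithmetic above can be sketched in a few lines of Python (not part of the lecture; a minimal illustration of the same formulas):

```python
def odds(p: float) -> float:
    """Odds corresponding to a probability p (0 <= p < 1): p / (1 - p)."""
    return p / (1 - p)

def odds_ratio(p_a: float, p_b: float) -> float:
    """Odds ratio comparing group A to group B."""
    return odds(p_a) / odds(p_b)

# The slide's example: p = .75, so odds = .75 / .25 = 3
print(odds(0.75))  # 3.0
```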
Relative Risk

           DVT   No DVT   Total
Heparin     8      92      100
Placebo    18      82      100

Risk(heparin) = 8/100 = 0.08
Risk(placebo) = 18/100 = 0.18

Relative risk = Risk(placebo) / Risk(heparin) = 0.18 / 0.08 = 2.25
Odds Ratio

           DVT   No DVT
Heparin     8      92
Placebo    18      82

Odds(heparin) = 8/92 = 0.087
Odds(placebo) = 18/82 = 0.22

Odds ratio = Odds(placebo) / Odds(heparin) = 0.22 / 0.087 = 2.53
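Both measures can be checked from the 2×2 table with a short Python sketch (the OR comes out 2.52 here because the slide rounds the intermediate odds before dividing):

```python
def relative_risk(a, b, c, d):
    """RR from a 2x2 table: risk a/(a+b) in group 1 over risk c/(c+d) in group 2."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """OR from the same table: odds (a/b) in group 1 over odds (c/d) in group 2."""
    return (a / b) / (c / d)

# Placebo: 18 DVT, 82 no DVT; heparin: 8 DVT, 92 no DVT
rr = relative_risk(18, 82, 8, 92)   # 0.18 / 0.08 = 2.25
or_ = odds_ratio(18, 82, 8, 92)     # ~2.52 (slide's 2.53 uses rounded odds)
```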
Comparing risks and odds

Risk          Odds
0.05 or 5%    0.053
0.1 or 10%    0.11
0.2 or 20%    0.25
0.3 or 30%    0.43
0.4 or 40%    0.67
0.5 or 50%    1
0.6 or 60%    1.5
0.7 or 70%    2.3
0.8 or 80%    4
0.9 or 90%    9
0.95 or 95%   19
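The whole comparison table follows from one formula, odds = risk / (1 - risk); a quick sketch reproduces it and shows how odds outgrow risk past 0.5:

```python
def risk_to_odds(risk: float) -> float:
    """Convert a risk (probability) to the corresponding odds."""
    return risk / (1 - risk)

# Reproduce the table's left column and recompute the right one
for r in [0.05, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]:
    print(f"risk {r:4.2f}  ->  odds {risk_to_odds(r):.3f}")
```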
Confounder

OC → MI, with smoking as the confounder
Oral contraceptives (OC) and myocardial infarction (MI)
Case-control study, unstratified data

OC       MI     Controls   OR
Yes      693    320        4.8
No       307    680        Ref.
Total   1000   1000
Oral contraceptives (OC) and myocardial infarction (MI)
Case-control study, unstratified data

Smoking   MI     Controls   OR
Yes       700    500        2.3
No        300    500        Ref.
Total    1000   1000
Smokers

OC       MI    Controls   OR
Yes      517   160        6.0
No       183   340        Ref.
Total    700   500

Nonsmokers

OC       MI    Controls   OR
Yes      176   160        3.0
No       124   340        Ref.
Total    300   500

Odds ratio for OC adjusted for smoking = 4.5
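The smoking-adjusted OR of 4.5 can be reproduced from the two strata with the Mantel-Haenszel formula, a standard stratified alternative to multivariate adjustment (a sketch, not the lecture's own computation):

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel pooled OR over strata of (a, b, c, d) =
    (exposed cases, exposed controls, unexposed cases, unexposed controls)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

strata = [
    (517, 160, 183, 340),  # smokers
    (176, 160, 124, 340),  # nonsmokers
]
print(round(mantel_haenszel_or(strata), 1))  # 4.5
```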
Regression

From Correlation to Regression
Correlation → Linear Regression → Logistic Regression

The simplest case: Y depends linearly on X
E(Y) = β0 + β1x
β0, β1: PARAMETERS

Regression
X: predictor or independent variable
Y: outcome or dependent variable

[Figure: straight line Y = a + bx plotted on X-Y axes]
Scatterplot and Linear Regression

Linear Regression
E(Y)   = β0     + β1·x
Weight = -44.16 + 0.55 × height

Parameters:
β0: INTERCEPT
β1: SLOPE, regression coefficient
[Figure: scatterplot of SBP (mm Hg) against Age (years), with fitted line SBP = 81.54 + 1.222 × Age]
Simple linear regression
• Relation between 2 continuous variables (SBP and age):

  y = α + β1x1

• Regression coefficient β1
  – Measures the association between y and x
  – Amount by which y changes on average when x changes by one unit
  – Estimated by the least squares method
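The least squares method named above has a closed form: the slope is the covariance of x and y over the variance of x. A minimal sketch, using made-up data (not the lecture's SBP/age measurements):

```python
def least_squares(xs, ys):
    """Fit y = a + b*x by ordinary least squares; return (a, b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx          # line passes through the means
    return a, b

xs = [20, 30, 40, 50, 60]            # hypothetical ages
ys = [105, 117, 130, 141, 155]       # hypothetical roughly linear outcomes
a, b = least_squares(xs, ys)         # a = 80.0, b = 1.24
```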
Logistic regression

Age and signs of coronary heart disease (CD)

Age  CD   Age  CD   Age  CD
22   0    40   0    54   0
23   0    41   1    55   1
24   0    46   0    58   1
27   0    47   0    60   1
28   0    48   0    60   0
30   0    49   1    62   1
30   0    49   0    65   1
32   0    50   1    67   1
33   0    51   0    71   1
35   1    51   1    77   1
38   0    52   0    81   1
Dot-plot

[Figure: dot-plot of signs of coronary disease (Yes/No) against AGE (years), 0-100]
Logistic function

P(y|x) = e^(α+βx) / (1 + e^(α+βx))

[Figure: S-shaped logistic curve; probability of disease (0.0-1.0) plotted against x]
Why is it called "Logistic" regression?
• It uses the logit transformation.
• The logit transformation can be interpreted as the logarithm of the odds of success vs. failure.

logit(p) = log[ p / (1 - p) ] = ln[ p / (1 - p) ]
Transformation

P(y|x) = e^(α+βx) / (1 + e^(α+βx))

Odds: P(y|x) / [1 - P(y|x)]

ln{ P(y|x) / [1 - P(y|x)] } = α + βx     (the logit of P(y|x))

α = log odds of disease in the unexposed
β = log odds ratio associated with being exposed
e^β = odds ratio
Fitting equation to the data
• Linear regression: least squares
• Logistic regression: maximum likelihood
• Likelihood function
  – Estimates parameters α and β
  – Practically easier to work with the log-likelihood:

L(β) = ln l(β) = Σi=1..n [ yi ln π(xi) + (1 - yi) ln(1 - π(xi)) ]
Maximum likelihood
Iterative computing
– Choice of an arbitrary value for the coefficients (usually 0)
– Computing of the log-likelihood
– Variation of the coefficients' values
– Reiteration until maximisation (plateau)
Results
– Maximum Likelihood Estimates (MLE) for α and β
– Estimates of P(y) for a given value of x
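The iterative scheme above (start at 0, recompute the log-likelihood, stop at the plateau) can be sketched with Newton-Raphson on the age/CHD data; in practice a package such as Stata or statsmodels does this, so treat this as an illustration of the algorithm, not a reference implementation:

```python
import numpy as np

# Age / CD data from the lecture table (33 subjects)
age = np.array([22,23,24,27,28,30,30,32,33,35,38,
                40,41,46,47,48,49,49,50,51,51,52,
                54,55,58,60,60,62,65,67,71,77,81], dtype=float)
cd  = np.array([0,0,0,0,0,0,0,0,0,1,0,
                0,1,0,0,0,1,0,1,0,1,0,
                0,1,1,1,0,1,1,1,1,1,1], dtype=float)

X = np.column_stack([np.ones_like(age), age])   # intercept + age
beta = np.zeros(2)                              # arbitrary start (0, 0)

def log_lik(b):
    z = X @ b
    # sum of y*z - log(1 + e^z), equal to sum y*ln(pi) + (1-y)*ln(1-pi)
    return float(np.sum(cd * z - np.log1p(np.exp(z))))

ll_old = log_lik(beta)
for _ in range(25):
    p = 1 / (1 + np.exp(-(X @ beta)))           # fitted probabilities
    W = p * (1 - p)                             # weights for Newton step
    beta = beta + np.linalg.solve(X.T @ (X * W[:, None]), X.T @ (cd - p))
    ll_new = log_lik(beta)
    if abs(ll_new - ll_old) < 1e-10:            # plateau reached
        break
    ll_old = ll_new

alpha_hat, beta_hat = beta                      # beta_hat > 0: risk rises with age
```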
Why Do We Do Logistic Regression So Much?
1. It predicts the likelihood of discrete outcomes
   a. Group membership
   b. Binary outcome (disease/no disease)
2. Quite flexible statistical assumptions
   a. No need for assumptions about the distributions of the predictor variables
   b. Predictors do not have to be normally distributed
   c. Predictors do not have to be linearly related to the outcome
   d. Predictors do not have to have equal variance within each group
3. Very good at giving odds ratios
Construction of Model

1. Perform univariate statistics (outliers, distribution, gaps)
2. Perform univariate analysis
3. Transform nominal independent variables to dichotomous (dummy) variables

Occupation   Nominal   Dummy (Farmer)   Dummy (Nurse)
Farmer          1            1               0
Housewife       2            0               0
Physician       3            0               0
Nurse           4            0               1
Policeman       5            0               0
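The dummy coding in the occupation table can be sketched in plain Python; only the two dummies shown on the slide are built, with the remaining categories serving as the reference:

```python
occupations = ["farmer", "housewife", "physician", "nurse", "policeman"]
levels = ["farmer", "nurse"]            # the two dummy columns from the table

# One 0/1 indicator column per chosen level
dummies = {
    lvl: [1 if occ == lvl else 0 for occ in occupations] for lvl in levels
}
print(dummies["farmer"])   # [1, 0, 0, 0, 0]
print(dummies["nurse"])    # [0, 0, 0, 1, 0]
```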
Construction of Model

Multicollinearity
4. Run a correlation matrix. If any pair of independent variables is correlated at > 0.90 (multicollinearity), decide which one to keep and which one to exclude. A more practical way is to consider the biologic relation (smoking, carrying matches, and cancer).
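The > 0.90 screening rule amounts to computing Pearson correlations between predictor pairs; a small sketch with hypothetical values (the two predictors below are deliberately collinear, like smoking and carrying matches):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

smoking_years = [0, 5, 10, 20, 30]           # hypothetical
packs_per_day = [0.0, 0.5, 1.0, 2.0, 3.0]    # tracks smoking_years closely
r = pearson_r(smoking_years, packs_per_day)
collinear = abs(r) > 0.90                     # True -> keep only one predictor
```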
Diagnosing confounder

Variable A → Outcome, distorted by a confounder (Variable B)

Example: Acinetobacter infection → fatality; confounder: APACHE score
How to choose variables?
1. Kitchen sink
2. Inclusion of significant variables
3. Forward selection
4. Backward selection
An example

Outcome: Deep Vein Thrombosis (DVT)
Independent variables: heparin, gender, coronary heart disease, aspirin use

Y = a + b1x1 + b2x2 + b3x3 + b4x4
DVT = a + b1(heparin) + b2(female) + b3(CAD) + b4(aspirin)
ORs:         0.5          1.5         3          0.6
Logistic regression                        Number of obs   =         10
                                           LR chi2(4)      =       0.97
                                           Prob > chi2     =     0.9141
Log likelihood = -6.2444702                Pseudo R2       =     0.0722

------------------------------------------------------------------------
     DVT | Odds Ratio   Std. Err.      z    P>|z|   [95% Conf. Interval]
---------+--------------------------------------------------------------
 Heparin |        .50        .023    -2.81   0.003        .15        .72
   Kadin |       1.48        1.08     0.01   0.504       .095      23.17
     KAH |       3.06         .03     2.36   0.009       1.34      12.37
 aspirin |        .58         .08     0.31   0.622        .46       1.03
------------------------------------------------------------------------
(Kadin = female, KAH = coronary heart disease)
Variable      OR     p-value   95% CI
Heparin use   0.5    0.003     0.15-0.72
Female        1.48   0.504     0.095-23.17
CHD           3.06   0.009     1.34-12.37
Aspirin use   0.58   0.622     0.46-1.03
Assessment of the Model

Methods for measuring how well a model accounts for the outcome:

                                      Multiple linear   Multiple logistic   Proportional hazard
                                      regression        regression          analysis
Model accounts for outcome            F test            Likelihood ratio    Likelihood ratio
better than chance                                      test (LR)           test (LR)
Quantitative/qualitative assessment   R2                R2 (rarely used)    -
of how well model accounts
for outcome
Comparison of estimated               NA                Hosmer-Lemeshow     Comparison of estimated
to observed value                                                           to observed values
Prediction of outcome                 -                 Sensitivity,        -
                                                        Specificity,
                                                        Accuracy, c index
LR and p value of the test
For both logistic regression and proportional hazard analysis: if the chi-square of the LR test is large, the p value will be small, and the null hypothesis can be rejected.
R2
R2 is a quantitative measure of how well the independent variables account for the outcome. When R2 is multiplied by 100, it can be thought of as the percentage of the variance in the dependent variable explained by the independent variables.
Goodness of Fit
r² is between 0 and 1
r² = 1    perfect fit
r² > 0.8  pretty good fit
r² ≈ 0.3  what we usually see!
r² ≈ 0    poor fit (x does not add any information about y)
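r² is computed as 1 minus the residual sum of squares over the total sum of squares; a minimal sketch with illustrative numbers (not data from the lecture):

```python
def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y) / len(y)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)          # total variation
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))  # unexplained
    return 1 - ss_res / ss_tot

y     = [3.0, 5.0, 7.0, 9.0]     # observed outcomes (illustrative)
y_hat = [3.5, 4.5, 7.5, 8.5]     # predictions from some fitted model
print(r_squared(y, y_hat))        # 0.95, i.e. 95% of variance explained
```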
Interactions
How can interactions be taken into consideration? Let X and Y be two explanatory variables; then include X·Y in the model. If X is a categorical variable with dummy variables (X1, ..., Xm-1), then X·Y = (X1·Y, ..., Xm-1·Y). Interactions change the multiplicative character of the OR.
SURVIVAL ANALYSIS
TIME-TO-EVENT
TIME-TO-DEATH
Exposure and outcome

[Figure: follow-up timelines for exposed vs. non-exposed subjects]
Person-Time (follow-up from Jan to June)

Subject   Total time at risk
A         3 months
B         6 months
C         2 months

Total person-time = 3 + 6 + 2 = 11 person-months
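Person-time feeds directly into an incidence rate (events per unit of person-time); a small sketch of the arithmetic, with a hypothetical event count:

```python
# Months at risk for subjects A, B, C from the table above
months_at_risk = {"A": 3, "B": 6, "C": 2}
events = 2                                   # hypothetical number of events

person_time = sum(months_at_risk.values())   # 11 person-months
rate = events / person_time                  # events per person-month
print(person_time)  # 11
```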
The Reason for Survival Analysis
• Censored cases
  – During the follow-up:
    • the expected outcome did not happen for the case
    • the case was lost to follow-up or dropped out
• Patients do not all need to start at the same time.
Survival Analysis
• Life table
• Kaplan-Meier curve
• Cox regression
Kaplan-Meier Curve
• Duration = time to event (time until an event occurs)
• "Event" = fatality, survival, relapse...
• The log-rank test compares 2 curves statistically

[Figure: Kaplan-Meier curve with the number of patients at risk (50, 49, 44, 42, 40, 39) at 2, 4, 6, 8 months]
Kaplan-Meier Curve

[Figure: Kaplan-Meier curves for treatment vs. placebo; number of patients at risk 50, 49, 44, 42, 40, 39 over 2, 4, 6, 8 months]
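A Kaplan-Meier curve like the ones above can be computed with the product-limit formula, where the survival estimate drops by a factor (1 - d/n) at each event time; a minimal sketch on illustrative data (a library such as lifelines would be used in practice):

```python
def kaplan_meier(times, events):
    """Product-limit estimator. times: follow-up times; events: 1 = event
    observed, 0 = censored. Returns [(t, S(t))] at each distinct event time."""
    pairs = sorted(zip(times, events))
    s, out = 1.0, []
    event_times = sorted({t for t, e in pairs if e == 1})
    for t in event_times:
        n = sum(1 for ti, _ in pairs if ti >= t)           # still at risk
        d = sum(1 for ti, e in pairs if ti == t and e == 1)  # events at t
        s *= 1 - d / n
        out.append((t, s))
    return out

times  = [2, 4, 4, 6]      # months (illustrative)
events = [1, 1, 0, 1]      # the subject censored at month 4 leaves the risk set
print(kaplan_meier(times, events))   # [(2, 0.75), (4, 0.5), (6, 0.0)]
```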
Cox Regression

Logistic regression → Cox regression

Cox regression                             Number of obs   =         10
                                           LR chi2(4)      =       0.97
                                           Prob > chi2     =     0.9141
Log likelihood = -6.2444702                Pseudo R2       =     0.0722

-------------------------------------------------------------------------
     DVT | Hazard Ratio   Std. Err.      z    P>|z|   [95% Conf. Interval]
---------+---------------------------------------------------------------
 Heparin |        .50        .023    -2.81   0.003        .15        .72
   Kadin |       1.48        1.08     0.01   0.504       .095      23.17
     KAH |       3.06         .03     2.36   0.009       1.34      12.37
 aspirin |        .58         .08     0.31   0.622        .46       1.03
-------------------------------------------------------------------------
(Kadin = female, KAH = coronary heart disease)