Logistic Regression and the new:
Residual Logistic Regression
F. Berenice Baez-Revueltas
Wei Zhu
Outline
1. Logistic Regression
2. Confounding Variables
3. Controlling for Confounding Variables
4. Residual Linear Regression
5. Residual Logistic Regression
6. Examples
7. Discussion
8. Future Work
1. Logistic Regression Model
In 1938, Ronald Fisher and Frank Yates suggested the logit link for regression with a binary response variable:

\ln(\text{Odds of } Y=1 \mid x) = \ln\frac{P(Y=1 \mid x)}{P(Y=0 \mid x)} = \ln\frac{P(Y=1 \mid x)}{1 - P(Y=1 \mid x)} = \ln\frac{\pi(x)}{1 - \pi(x)} = \beta_0 + \beta_1 x
A popular model for a categorical response variable
• The logistic regression model is the most popular model for binary data.
• It is generally used to study the relationship between a binary response variable and a group of predictors (which can be either continuous or categorical):
Y = 1 (true, success, YES, etc.) or Y = 0 (false, failure, NO, etc.)
• The logistic regression model can be extended to model a categorical response variable with more than two categories. The resulting model is sometimes referred to as the multinomial logistic regression model (in contrast to the 'binomial' logistic regression for a binary response variable).
More on the rationale of the logistic regression model
• Consider a binary response variable Y = 0 or 1 and a single predictor variable x. We want to model E(Y|x) = P(Y=1|x) as a function of x. The logistic regression model expresses the logistic transform of P(Y=1|x) as a linear function of the predictor:

\ln\frac{P(Y=1 \mid x)}{1 - P(Y=1 \mid x)} = \beta_0 + \beta_1 x

This model can be rewritten as

P(Y=1 \mid x) = \frac{\exp(\beta_0 + \beta_1 x)}{1 + \exp(\beta_0 + \beta_1 x)}

• E(Y|x) = P(Y=1|x)·1 + P(Y=0|x)·0 = P(Y=1|x) is bounded between 0 and 1 for all values of x. The following linear model may sometimes violate this condition:

P(Y=1 \mid x) = \beta_0 + \beta_1 x
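For a concrete illustration (hypothetical numbers, not from the slides): with \beta_0 = -1 and \beta_1 = 0.5, the linear model gives P(Y=1 \mid x=10) = -1 + 0.5(10) = 4, an impossible probability, whereas the logistic form gives \exp(4)/(1+\exp(4)) \approx 0.982, safely inside (0, 1).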
More on the properties of the logistic regression model
• In simple logistic regression, the regression coefficient \beta_1 has the interpretation that it is the log of the odds ratio of a success event (Y=1) for a unit change in x:

\ln\frac{P(Y=1 \mid x+1)}{P(Y=0 \mid x+1)} - \ln\frac{P(Y=1 \mid x)}{P(Y=0 \mid x)} = [\beta_0 + \beta_1(x+1)] - [\beta_0 + \beta_1 x] = \beta_1

• For multiple predictor variables, the logistic regression model is

\ln\frac{P(Y=1 \mid x_1, x_2, \ldots, x_k)}{P(Y=0 \mid x_1, x_2, \ldots, x_k)} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k
Logistic Regression, SAS Procedure
• http://www.ats.ucla.edu/stat/sas/output/SAS_logit_output.htm
• Proc Logistic
• This page shows an example of logistic regression with footnotes explaining the output. The data were collected on 200 high school students, with measurements on various tests, including science, math, reading and social studies. The response variable is a high writing test score (honcomp), where a writing score greater than or equal to 60 is considered high and less than 60 is considered low; we explore its relationship with gender (female), reading test score (read), and science test score (science). The dataset used on this page can be downloaded from http://www.ats.ucla.edu/stat/sas/webbooks/reg/default.htm.

data logit;
  set "c:\temp\hsb2";
  honcomp = (write >= 60);
run;

proc logistic data=logit descending;
  model honcomp = female read science;
run;
Logistic Regression, SAS Output
[SAS output table not reproduced in this transcript.]
2. Confounding Variables
• Correlated with both the dependent and independent variables
• Represent a major threat to the validity of inferences on cause and effect
• Add to multicollinearity
• Can lead to over- or underestimation of an effect, and can even change the direction of the conclusion
• They add error to the interpretation of what may be an accurate measurement
For a variable to be a confounder it needs to have:
• A relationship with the exposure
• A relationship with the outcome, even in the absence of the exposure (not an intermediary)
• No place on the causal pathway
• An uneven distribution in the comparison groups

[Diagram: a third variable associated with both the exposure and the outcome.]
Confounding:
[Diagram: Birth order → Down Syndrome, with Maternal Age as the third variable.]
Maternal age is correlated with birth order and is a risk factor for Down Syndrome, even if birth order is low.

No confounding:
[Diagram: Alcohol → Lung Cancer, with Smoking as the third variable.]
Smoking is correlated with alcohol consumption and is a risk factor for Lung Cancer, even for persons who don't drink alcohol.
3. Controlling for Confounding Variables
• In study designs:
– Restriction
– Random allocation of subjects to study groups to attempt to even out unknown confounders
– Matching subjects using potential confounders
• In data analysis:
– Stratified analysis using the Mantel-Haenszel method to adjust for confounders
– Case-control studies
– Cohort studies
– Restriction (still possible, but it means throwing data away)
– Model fitting using regression techniques
Pros and Cons of Controlling Methods
• Matching methods call for subjects with exactly the same characteristics
• Risk of over- or under-matching
• Cohort studies can lead to too much loss of information when excluding subjects
• Some strata might become too thin and thus insignificant, which also creates loss of information
• Regression methods, if well handled, can control for confounding factors
4. Residual Linear Regression
• Consider a dependent variable Y and a set of n independent covariates, of which the first k (k < n) are potential confounding factors
• The initial model treats only the confounding variables:

Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \varepsilon

• Residuals are calculated from this model; let

\hat{Y} = \hat\beta_0 + \hat\beta_1 X_1 + \hat\beta_2 X_2 + \cdots + \hat\beta_k X_k
The residuals are e_j = Y_j - \hat{Y}_j, with the following properties:
• Zero mean
• Homoscedasticity
• Normally distributed
• \mathrm{Corr}(e_i, e_j) = 0, \; i \neq j

This residual will be considered the new dependent variable. That is, the new model to be fitted is (see the SAS sketch below):

Y - \hat{Y} = \gamma_0 + \gamma_{k+1} X_{k+1} + \gamma_{k+2} X_{k+2} + \cdots + \gamma_n X_n + \varepsilon

which is equivalent to:

Y = \gamma_0 + \hat{Y} + \gamma_{k+1} X_{k+1} + \gamma_{k+2} X_{k+2} + \cdots + \gamma_n X_n + \varepsilon
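A minimal SAS sketch of this two-stage fit (a hedged illustration: the dataset name mydata, outcome Y, confounders X1, X2, and remaining covariates X3, X4 are hypothetical placeholders, taking k = 2 and n = 4):

proc reg data=mydata;
  model Y = X1 X2;          /* stage 1: confounding variables only */
  output out=stage1 r=e;    /* e = Y - Yhat, the stage-1 residual */
run;

proc reg data=stage1;
  model e = X3 X4;          /* stage 2: residual regressed on the remaining covariates */
run;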
The Usual Logistic Regression Approach to 'Control for' Confounders
• Consider a binary outcome Y and n covariates, where the first k (k < n) are potential confounding factors
• The usual way to 'control for' these confounding variables is to simply put all n variables in the same model:

\log\frac{\pi}{1-\pi} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k + \cdots + \beta_n X_n
5. Residual Logistic Regression
• Each subject has a binary outcome Y
• Consider n covariates, where the first k (k < n) are potential confounding factors
• Initial model, with \pi as the probability of success, in which only the confounding effect is analyzed:

\log\frac{\pi}{1-\pi} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_k X_k
Method 1
• The confounding variables' effect is retained and plugged into the second-level regression model along with the variables of interest, following the residual linear regression approach.
• That is, let

T = \hat\beta_1 X_1 + \hat\beta_2 X_2 + \cdots + \hat\beta_k X_k

• The new model to be fitted is (see the SAS sketch below):

\log\frac{\pi}{1-\pi} = \gamma_0 + T + \gamma_{k+1} X_{k+1} + \gamma_{k+2} X_{k+2} + \cdots + \gamma_n X_n
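A minimal SAS sketch of Method 1, under the same hypothetical names as before (mydata, Y, confounders X1, X2, variables of interest X3, X4). Note that XBETA= returns \hat\beta_0 + \hat\beta_1 X_1 + \cdots + \hat\beta_k X_k, i.e., T plus the intercept; since the second-level model fits its own intercept \gamma_0, that extra constant is simply absorbed there.

proc logistic data=mydata descending;
  model Y = X1 X2;             /* initial model: confounders only */
  output out=step1 xbeta=T;    /* estimated linear predictor from the initial model */
run;

proc logistic data=step1 descending;
  model Y = T X3 X4;           /* second level: T plus the variables of interest */
run;

Here T enters as an ordinary covariate and receives its own estimated coefficient; to force the unit coefficient written on the slide, one could instead use model Y = X3 X4 / offset=T;.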
Method 2
• Pearson residuals are calculated from the initial model using the Pearson residual (Hosmer and Lemeshow, 1989):

Z = \frac{Y - \hat\pi}{\sqrt{\hat\pi(1 - \hat\pi)}}

where \hat\pi is the estimated probability of success based on the confounding variables alone:

\hat\pi = \frac{\exp(\hat\beta_0 + \sum_{i=1}^{k} \hat\beta_i X_i)}{1 + \exp(\hat\beta_0 + \sum_{i=1}^{k} \hat\beta_i X_i)}, \qquad Y \in \{0, 1\}

• The second-level regression will use this residual as the new dependent variable. Because Z is no longer dichotomous, we can apply a multiple linear regression model to analyze the effect of the rest of the covariates. The new model to be fitted is the linear regression model (see the SAS sketch below):

Z = \gamma_0 + \gamma_{k+1} X_{k+1} + \gamma_{k+2} X_{k+2} + \cdots + \gamma_n X_n + \varepsilon
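A minimal SAS sketch of Method 2, under the same hypothetical names; the RESCHI= keyword in PROC LOGISTIC's OUTPUT statement produces exactly the Pearson residual defined above.

proc logistic data=mydata descending;
  model Y = X1 X2;             /* initial model: confounders only */
  output out=step1 reschi=Z;   /* Pearson residual: (Y - pihat)/sqrt(pihat*(1 - pihat)) */
run;

proc reg data=step1;
  model Z = X3 X4;             /* ordinary least squares on the new dependent variable */
run;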
6. Example 1
• Data: Low Birth Weight
– Low: indicator of birth weight less than 2.5 kg
– Age: mother's age in years
– Lwt: mother's weight in pounds
– Smk: smoking status during pregnancy
– Ht: history of hypertension

Correlation matrix (alpha = 0.05):

        Age      Lwt      Smk      Ht
Age    1.0000   0.1738  -0.0444  -0.0158
Lwt             1.0000  -0.0408   0.2369
Smk                      1.0000   0.0134
Ht                                1.0000
• Potential confounding factor: Age
• Model for \pi (probability of low birth weight)
• Logistic regression:

\log\frac{\pi}{1-\pi} = \beta_0 + \beta_1\,\mathrm{age} + \beta_2\,\mathrm{lwt} + \beta_3\,\mathrm{smk} + \beta_4\,\mathrm{ht}

• Residual logistic regression, initial model:

\log\frac{\pi}{1-\pi} = \beta_0 + \beta_1\,\mathrm{age}

• Method 1, with T = \hat\beta_1\,\mathrm{age}:

\log\frac{\pi}{1-\pi} = \gamma_0 + T + \gamma_2\,\mathrm{lwt} + \gamma_3\,\mathrm{smk} + \gamma_4\,\mathrm{ht}

• Method 2 (a concrete SAS sketch follows):

Z = \gamma_0 + \gamma_2\,\mathrm{lwt} + \gamma_3\,\mathrm{smk} + \gamma_4\,\mathrm{ht} + \varepsilon
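A hedged end-to-end sketch for this example (the dataset name lowbwt and the lowercase outcome name low are assumptions; the steps follow Methods 1 and 2 above):

proc logistic data=lowbwt descending;
  model low = age;                    /* initial model: Age as the only confounder */
  output out=step1 xbeta=T reschi=Z;  /* stage-1 linear predictor and Pearson residual */
run;

proc logistic data=step1 descending;
  model low = T lwt smk ht;           /* Method 1 */
run;

proc reg data=step1;
  model Z = lwt smk ht;               /* Method 2 */
run;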
Results

             Logistic Regression           RLR Method 1
Variables    Odds ratio  P-value  SE       Odds ratio  P-value  SE
lwt          0.988       0.060    0.0064   0.989       0.078    0.0065
smk          3.480       0.001    0.3576   3.455       0.001    0.3687
ht           3.395       0.053    0.6322   3.317       0.059    0.6342

RLR Method 2
Variables    P-value   SE
lwt          0.077     0.0024
smk          0.000     0.1534
ht           0.042     0.3094

Confounding factor P-values
Variables    Log reg   Initial model
Age          0.055     0.027
Example 2
• Data: Alzheimer patients
– Decline: whether or not the subject's cognitive capabilities deteriorate
– Age: subject's age
– Gender: subject's gender
– MMS: Mini-Mental Score
– PDS: Psychometric deterioration scale
– HDT: Depression scale

Correlation matrix (alpha = 0.05):

          Age      Gender    MMS      PDS      HDT
Age      1.0000   0.0413   -0.2120   0.3327   0.9679
Gender            1.0000   -0.1074   0.2020  -0.1839
MMS                         1.0000   0.3784  -0.1839
PDS                                  1.0000   0.0110
HDT                                           1.0000
• Potential confounding factors: Age, Gender
• Model for \pi (probability of declining)
• Logistic regression:

\log\frac{\pi}{1-\pi} = \beta_0 + \beta_1\,\mathrm{age} + \beta_2\,\mathrm{gender} + \beta_3\,\mathrm{mms} + \beta_4\,\mathrm{pds} + \beta_5\,\mathrm{hdt}

• Residual logistic regression, initial model:

\log\frac{\pi}{1-\pi} = \beta_0 + \beta_1\,\mathrm{age} + \beta_2\,\mathrm{gender}

• Method 1, with T = \hat\beta_1\,\mathrm{age} + \hat\beta_2\,\mathrm{gender}:

\log\frac{\pi}{1-\pi} = \gamma_0 + T + \gamma_3\,\mathrm{mms} + \gamma_4\,\mathrm{pds} + \gamma_5\,\mathrm{hdt}

• Method 2:

Z = \gamma_0 + \gamma_3\,\mathrm{mms} + \gamma_4\,\mathrm{pds} + \gamma_5\,\mathrm{hdt} + \varepsilon
Results

             Logistic Regression           RLR Method 1
Variables    Odds ratio  P-value  SE       Odds ratio  P-value  SE
mms          0.717       0.023    0.1451   0.720       0.023    0.1443
pds          1.691       0.001    0.1629   1.674       0.001    0.1565
hdt          1.018       0.643    0.0380   1.018       0.644    0.0377

RLR Method 2
Variables    P-value   SE
mms          <0.001    0.0915
pds          <0.001    0.0935
hdt          0.061     0.0273

Confounding factor P-values
Variables    Log reg   Initial model
Age          0.004     0.000
Gender       0.935     0.551
7. Discussion
• The usual logistic regression is not designed to control for confounding factors, and there is a risk of multicollinearity.
• Method 1 is designed to control for confounding factors; however, from the given examples we can see that it yields results similar to the usual logistic regression approach.
• Method 2 appears to be more accurate, with some standard errors significantly reduced and thus significantly smaller p-values for some regressors. However, it does not yield odds ratios as Method 1 can.
8. Future Work
We will further examine the assumptions behind Method 2 to understand why it sometimes yields more significant results. We will also study residual longitudinal data analysis, including survival analysis, where one or more time-dependent variables will be taken into account.
Selected References
• Menard, S. Applied Logistic Regression Analysis. Series: Quantitative Applications in the Social Sciences. Sage University Series.
• Lemeshow, S.; Teres, D.; Avrunin, J.S.; and Pastides, H. Predicting the Outcome of Intensive Care Unit Patients. Journal of the American Statistical Association 83, 348-356.
• Hosmer, D.W.; Jovanovic, B.; and Lemeshow, S. Best Subsets Logistic Regression. Biometrics 45, 1265-1270. 1989.
• Pregibon, D. Logistic Regression Diagnostics. The Annals of Statistics 9(4), 705-724. 1981.
Questions?