Introduction to
Logistic Regression
Rachid Salmi, Jean-Claude Desenclos, Thomas Grein,
Alain Moren, Viviane Bremer
Objectives
• When do we need to use logistic
regression?
• Principles of logistic regression
• Uses of logistic regression
• What to keep in mind
Chlamorea
• Sexually transmitted infection
–Virus recently identified
–Leads to general rash, blush, pimples
and feeling of shame
–Increasing prevalence with age
–Risk factors unknown so far
Case control study
• Population of Berlin
• 150 cases, 150 controls
• Hypothesis: Consistent use of condoms
protects against chlamorea
• Questionnaire with questions on
demographic characteristics, sexual
behaviour
• OR, t-test
Results bivariate analysis

                           Cases    Controls   Odds ratio
                           n=150    n=150
Used condoms at last sex   40       90         0.17
Did not use condoms        110      60         Ref
Results bivariate analysis

                              Cases    Controls   Odds ratio
                              n=150    n=150
Single                        125      50         4.7
Currently in a relationship   25       100        Ref
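The crude odds ratio of a 2×2 table is the cross-product ratio ad/(bc). The sketch below is a minimal illustration with a hypothetical helper name (`odds_ratio`), applied to the counts as they appear in the two tables above. With these counts the cross-product comes to about 0.24 for condom use and 10.0 for single status, which differs from the printed 0.17 and 4.7. The slides do not say how the printed values were obtained, so the counts or the printed ratios may have been altered somewhere along the way.

```python
def odds_ratio(a, b, c, d):
    """Cross-product odds ratio for a 2x2 table.

    a: exposed cases,   b: exposed controls
    c: unexposed cases, d: unexposed controls
    """
    return (a * d) / (b * c)


# Condom use at last sex (exposed = used condoms)
or_condoms = odds_ratio(40, 90, 110, 60)

# Single status (exposed = single)
or_single = odds_ratio(125, 50, 25, 100)

print(round(or_condoms, 2), round(or_single, 1))
```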
Results bivariate analysis

                                      Cases    Controls   p-value
                                      n=150    n=150      (t-test)
Number of partners during last year   4        2          0.001
Mean age in years                     39       26         0.001

Confounding?
Stratification
[Figure: the raw 2×2 table for chlamorea and condom use (cells a, b, c, d; OR raw) is split successively by age group, single status and number of partners. Each split multiplies the number of stratum-specific tables (ai, bi, ci, di), each with its own odds ratio OR1, OR2, ..., ORi.]
Let’s go one step back
Simple linear regression
Table 1
Age and systolic blood pressure (SBP) among 33 adult women

Age   SBP      Age   SBP      Age   SBP
22    131      41    139      52    128
23    128      41    171      54    105
24    116      46    137      56    145
27    106      47    111      57    141
28    114      48    115      58    153
29    123      49    133      59    157
30    117      49    128      63    155
32    122      50    183      67    176
33    99       51    130      71    172
35    121      51    133      77    178
40    147      51    144      81    217
[Figure: scatter plot of SBP (mm Hg) against age (years) with the fitted line $\text{SBP} = 81.54 + 1.222 \times \text{Age}$]
adapted from Colton T. Statistics in Medicine. Boston: Little Brown, 1974
Simple linear regression
• Relation between 2 continuous variables
(SBP and age)
$$y = \alpha + \beta_1 x_1$$
[Figure: straight line in the x-y plane with intercept α and slope β1]
• Regression coefficient β1
–Measures association between y and x
–Amount by which y changes on average
when x changes by one unit
–Least squares method
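As a minimal sketch of the least squares method, the code below fits a straight line to the 33 age/SBP pairs of Table 1 using the closed-form formulas for slope and intercept. The function name `least_squares` is a hypothetical helper, and the data are transcribed from the table, so the result should land close to the fitted line quoted on the slide.

```python
ages = [22, 23, 24, 27, 28, 29, 30, 32, 33, 35, 40,
        41, 41, 46, 47, 48, 49, 49, 50, 51, 51, 51,
        52, 54, 56, 57, 58, 59, 63, 67, 71, 77, 81]
sbp = [131, 128, 116, 106, 114, 123, 117, 122, 99, 121, 147,
       139, 171, 137, 111, 115, 133, 128, 183, 130, 133, 144,
       128, 105, 145, 141, 153, 157, 155, 176, 172, 178, 217]


def least_squares(x, y):
    """Return (intercept, slope) of the least squares line y = alpha + beta1 * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares around the means
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


alpha, beta1 = least_squares(ages, sbp)
print(f"SBP = {alpha:.2f} + {beta1:.3f} * Age")
```

The slope is read as the average change in SBP (mm Hg) per additional year of age.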
What if we have more than one
independent variable?
Multiple risk factors
• Objective:
To attribute to each risk factor its
respective effect (RR) on the
occurrence of disease.
Types of multivariable analysis
• Multiple models
–Linear regression
–Logistic regression
–Cox model
–Poisson regression
–Loglinear model
–Discriminant analysis…
• Choice of the tool according to objectives,
study design and variables
Multiple linear regression
• Relation between a continuous variable and
a set of i variables
$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$$
• Partial regression coefficients βi
–Amount by which y changes when xi changes
by one unit and all the other xi remain constant
–Measures association between xi and y
adjusted for all other xi
• Example
–Number of partners in relation to age & income
Multiple linear regression
$$y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$$
• y: predicted, response, outcome or dependent variable
• x1 ... xi: predictor variables, explanatory variables, covariables, independent variables
y (number of partners) = α + β1 age + β2 income + β3 gender
What if our outcome variable is
dichotomous?
Logistic regression (1)
Table 2
Age and chlamorea

Age   Chlamorea      Age   Chlamorea      Age   Chlamorea
22    0              40    0              54    0
23    0              41    1              55    1
24    0              46    0              58    1
27    0              47    0              60    1
28    0              48    0              60    0
30    0              49    1              62    1
30    0              49    0              65    1
32    0              50    1              67    1
33    0              51    0              71    1
35    1              51    1              77    1
38    0              52    0              81    1
How can we analyse these data?
• Compare mean age of diseased
and non-diseased
–Non-diseased: 26 years
–Diseased: 39 years (p=0.0001)
• Linear regression?
Dot-plot: Data from Table 2
[Figure: presence of chlamorea (yes/no) plotted against age (years)]
Logistic regression (2)
Table 3
Prevalence (%) of chlamorea according to age group

                          Diseased
Age group   # in group    #     %
20 - 29     5             0     0
30 - 39     6             1     17
40 - 49     7             2     29
50 - 59     7             4     57
60 - 69     5             4     80
70 - 79     2             2     100
80 - 89     1             1     100
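The grouped prevalences in Table 3 can be derived from the individual records of Table 2 by counting subjects and diseased subjects per ten-year age band. The sketch below shows one way to do this; the function name `prevalence_by_decade` is a hypothetical helper, and the two lists are transcribed from Table 2.

```python
from collections import defaultdict

ages = [22, 23, 24, 27, 28, 30, 30, 32, 33, 35, 38,
        40, 41, 46, 47, 48, 49, 49, 50, 51, 51, 52,
        54, 55, 58, 60, 60, 62, 65, 67, 71, 77, 81]
diseased = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
            0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0,
            0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]


def prevalence_by_decade(ages, diseased):
    """Map decade start -> (number in group, number diseased, rounded % diseased)."""
    counts = defaultdict(lambda: [0, 0])  # [n in group, n diseased]
    for age, status in zip(ages, diseased):
        decade = (age // 10) * 10
        counts[decade][0] += 1
        counts[decade][1] += status
    return {d: (n, k, round(100 * k / n)) for d, (n, k) in sorted(counts.items())}


table3 = prevalence_by_decade(ages, diseased)
for decade, (n, k, pct) in table3.items():
    print(f"{decade} - {decade + 9}: {k}/{n} = {pct}%")
```

Grouping turns the 0/1 outcome into a proportion per age band, which is what makes the rising, S-shaped pattern visible.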
Dot-plot: Data from Table 3
[Figure: percentage diseased (0 to 100%) plotted against age group]
Logistic function (1)
$$P(y \mid x) = \frac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$$
[Figure: S-shaped curve of the probability of disease (0.0 to 1.0) against x]
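To make the shape of the logistic function concrete, the sketch below evaluates it for a few values of x. The coefficients alpha = -5 and beta = 0.1 are hypothetical values chosen only for illustration and are not estimates from the slides. The function name `logistic` is likewise an assumption.

```python
import math


def logistic(x, alpha, beta):
    """P(y|x) = e^(alpha + beta*x) / (1 + e^(alpha + beta*x))."""
    z = alpha + beta * x
    if z >= 0:
        # Equivalent form that keeps the exponent non-positive
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


# Hypothetical coefficients, for illustration only
ALPHA, BETA = -5.0, 0.1

for x in (20, 40, 50, 60, 80):
    print(x, round(logistic(x, ALPHA, BETA), 3))
```

Whatever the value of x, the result stays between 0 and 1, which is why this function suits a dichotomous outcome better than a straight line.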
Logistic function
• Logistic regression models the logit
of the outcome
= natural logarithm of the odds of the outcome
$$\ln\left(\frac{\text{Probability of the outcome } (p)}{\text{Probability of not having the outcome } (1-p)}\right)$$
$$\ln\left(\frac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$$
Logistic function
$$\ln\left(\frac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$$
• α = log odds of disease
in unexposed
• β = log odds ratio associated
with being exposed
• $e^{\beta}$ = odds ratio
Multiple logistic regression
• More than one independent variable
–Dichotomous, ordinal, nominal, continuous …
$$\ln\left(\frac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$$
• Interpretation of βi
–Increase in log-odds for a one unit increase in xi
with all the other xi constant
–Measures association between xi and log-odds
adjusted for all other xi
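The step from "increase in log-odds" to an odds ratio can be written out as a short derivation. Here $P(x_i)$ is shorthand, introduced for this sketch, for the probability of the outcome at a given value of $x_i$ with all other variables held fixed.

```latex
\begin{align*}
\ln\frac{P(x_i + 1)}{1 - P(x_i + 1)} - \ln\frac{P(x_i)}{1 - P(x_i)}
  &= \bigl(\alpha + \dots + \beta_i (x_i + 1) + \dots\bigr)
   - \bigl(\alpha + \dots + \beta_i x_i + \dots\bigr) \\
  &= \beta_i \\[4pt]
\text{OR} = \frac{P(x_i + 1) / \bigl(1 - P(x_i + 1)\bigr)}{P(x_i) / \bigl(1 - P(x_i)\bigr)}
  &= e^{\beta_i}
\end{align*}
```

All terms other than $\beta_i$ cancel, so the exponentiated coefficient is the odds ratio for a one unit increase in $x_i$, adjusted for the other variables in the model.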
Uses of multivariable analysis
• Etiologic models
–Identify risk factors adjusted for
confounders
–Adjust for differences in baseline
characteristics
• Predictive models
–Determine diagnosis
–Determine prognosis
Fitting equation to the data
• Linear regression:
–Least squares
• Logistic regression:
–Maximum likelihood
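The slides only name maximum likelihood as the fitting principle. As one possible illustration, the sketch below maximises the log-likelihood of the model ln(P/(1-P)) = alpha + beta × age for the Table 2 data by Newton-Raphson iteration. The choice of Newton-Raphson, the fixed number of iterations and the function name `fit_logistic` are assumptions of this sketch and do not describe the software used by the authors.

```python
import math

ages = [22, 23, 24, 27, 28, 30, 30, 32, 33, 35, 38,
        40, 41, 46, 47, 48, 49, 49, 50, 51, 51, 52,
        54, 55, 58, 60, 60, 62, 65, 67, 71, 77, 81]
diseased = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
            0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0,
            0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]


def fit_logistic(x, y, iterations=25):
    """Fit ln(P/(1-P)) = alpha + beta*x by Newton-Raphson on the log-likelihood."""
    alpha, beta = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = 0.0          # gradient of the log-likelihood
        h00 = h01 = h11 = 0.0  # entries of the information matrix
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(alpha + beta * xi)))
            w = p * (1.0 - p)
            g0 += yi - p
            g1 += (yi - p) * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Newton step: solve the 2x2 system (information matrix) * step = gradient
        det = h00 * h11 - h01 * h01
        alpha += (h11 * g0 - h01 * g1) / det
        beta += (h00 * g1 - h01 * g0) / det
    return alpha, beta


alpha, beta = fit_logistic(ages, diseased)
age_at_half = -alpha / beta  # age at which the fitted probability is 0.5
print(f"alpha = {alpha:.3f}, beta = {beta:.4f}, OR per year = {math.exp(beta):.3f}")
```

Unlike least squares, there is no closed-form solution here, so the estimates are reached by repeated updates until they stop changing.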
Elaborating $e^{\beta}$
• $e^{\beta}$ = OR
⇒ What if the independent variable
is continuous?
⇒ What’s the effect of a change in x
by more than one unit?
The Q fever example
• Distance to farm as independent
continuous variable counted in meters
–β in logistic regression was -0.00050013
and statistically significant
• OR for each 1 meter distance is 0.9995
–Too small to use
• What’s the OR for every 1000 meters?
–$e^{1000\beta} = e^{-1000 \times 0.00050013} = 0.6064$
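The rescaling in the Q fever example amounts to multiplying the coefficient by the size of the change before exponentiating. A minimal sketch, using the coefficient quoted above and a hypothetical helper name `odds_ratio_for_change`:

```python
import math

beta_per_metre = -0.00050013  # coefficient quoted for distance to farm, in metres


def odds_ratio_for_change(beta, delta_x):
    """Odds ratio associated with a change of delta_x units in x."""
    return math.exp(beta * delta_x)


or_1m = odds_ratio_for_change(beta_per_metre, 1)
or_1000m = odds_ratio_for_change(beta_per_metre, 1000)
print(round(or_1m, 4), round(or_1000m, 4))
```

The same result is obtained by recoding distance in kilometres before fitting the model, which changes the coefficient but not the fitted probabilities.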
Continuous variables
• Increase in OR for a one unit change in
exposure variable
• Logistic model is multiplicative ⇒
OR increases exponentially with x
–If OR = 2 for a one unit change in exposure
and x increases from 2 to 5:
OR = 2 × 2 × 2 = $2^3$ = 8
• Verify if OR increases exponentially with x
–When in doubt, treat as qualitative variable
Coding of variables (2)
• Nominal variables or ordinal with unequal
classes:
–Preferred hair colour of partners:
» No hair=0, grey=1, brown=2, blond=3
–Model assumes that OR for blond partners
= $(\text{OR for grey-haired partners})^3$
–Use indicator variables (dummy variables)
Indicator variables: Hair colour

Hair colour      Dummy variables
of partners      blond   brown   grey
grey             0       0       1
brown            0       1       0
blond            1       0       0
no hair          0       0       0
• Neutralises artificial hierarchy between
classes in variable “hair colour of partners"
• No assumptions made
• 3 variables in model using same reference
• OR for each type of hair adjusted for the
others in reference to “no hair”
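The coding in the table above can be sketched as a small function that turns one nominal value into three 0/1 indicator variables, with "no hair" as the reference class. The names `HAIR_LEVELS` and `indicator_variables` are hypothetical and used only for this illustration.

```python
# Non-reference classes; "no hair" is the reference and gets all zeros
HAIR_LEVELS = ["blond", "brown", "grey"]


def indicator_variables(hair_colour, levels=HAIR_LEVELS):
    """Return one 0/1 indicator per non-reference class."""
    return {level: int(hair_colour == level) for level in levels}


for colour in ("grey", "brown", "blond", "no hair"):
    print(colour, indicator_variables(colour))
```

Each indicator then receives its own coefficient in the model, so no ordering or spacing between the hair colours is imposed.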
Classes
• Relationship between number of partners during last
year and chlamorea
– Code number of partners: 0-1 = 1, 2-3 = 2, 4-5 = 3

Code nr partners   Cases   Controls   OR
1                  20      40         1.0
2                  22      30         1.5
3                  12      11         2.2

$1.5^2 \approx 2.2$
• Compatible with assumption of multiplicative model
– If not compatible, use indicator variables
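The compatibility check can be reproduced from the counts in the table: compute each class's odds ratio against class 1 and compare the OR for class 3 with the square of the OR for class 2. This is a minimal sketch with a hypothetical helper name `or_vs_reference`.

```python
cases = {1: 20, 2: 22, 3: 12}
controls = {1: 40, 2: 30, 3: 11}


def or_vs_reference(code, ref=1):
    """Odds ratio of a coded class against the reference class."""
    return (cases[code] * controls[ref]) / (controls[code] * cases[ref])


or2 = or_vs_reference(2)
or3 = or_vs_reference(3)

# Under a multiplicative model, two steps up the code should square the one-step OR
print(round(or2, 1), round(or3, 1), round(or2 ** 2, 1))
```

If the OR for class 3 were far from the squared OR for class 2, the single coded variable would misrepresent the data and indicator variables would be the safer choice.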
Risk factors for Chlamorea
• Sex
• Hair colour
• Age group
• Single
• Visiting bars
• Number of partners
• No condom use
Outcome: Chlamorea
Unconditional Logistic Regression

Term                Odds Ratio   95% C.I.             Coef.     S.E.     Z-Statistic   P-Value
# partners          1.2664       0.2634   10.7082     0.2362    0.9452   0.5486        0.5833
Single (Yes/No)     1.0345       0.3277   3.2660      0.0339    0.5866   0.0578        0.9539
Hair colour (1/0)   1.6126       0.2675   9.7220      0.4778    0.9166   0.5213        0.6022
Hair colour (2/0)   0.7291       0.0991   5.3668      -0.3159   1.0185   -0.3102       0.7564
Hair colour (3/0)   1.1137       0.1573   7.8870      0.1076    0.9988   0.1078        0.9142
Visiting bars       1.5942       0.4953   5.1317      0.4664    0.5965   0.7819        0.4343
Used no Condoms     9.0918       3.0219   27.3533     2.2074    0.5620   3.9278        0.0001
Sex (f/m)           1.3024       0.2278   7.4468      0.2642    0.8896   0.2970        0.7665
CONSTANT            *            *        *           -3.0080   2.0559   -1.4631       0.1434
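The columns of this output are linked: the odds ratio is the exponentiated coefficient, and the Z statistic is the coefficient divided by its standard error. The sketch below recomputes the "Used no Condoms" row. It assumes a Wald-type 95% interval, exp(coef ± 1.96 × S.E.), which the slides do not state explicitly. The function name `or_with_ci` is a hypothetical helper.

```python
import math


def or_with_ci(coef, se, z=1.96):
    """Return (odds ratio, lower limit, upper limit), assuming a Wald interval."""
    return (math.exp(coef),
            math.exp(coef - z * se),
            math.exp(coef + z * se))


# "Used no Condoms" row: coefficient and standard error from the table
coef, se = 2.2074, 0.5620
odds_ratio, lower, upper = or_with_ci(coef, se)
z_statistic = coef / se

print(round(odds_ratio, 2), round(lower, 2), round(upper, 2), round(z_statistic, 2))
```

An interval that excludes 1, as it does for this row, corresponds to a small p-value for that term.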
Last but not least
Why do we need multivariable analysis?
• Our real world is multivariable
• Multivariable analysis is a tool to
determine the relative contribution
of all factors
Sequence of analysis
• Descriptive analysis
–Know your dataset
• Bivariate analysis
–Identify associations
• Stratified analysis
–Confounding and effect modifiers
• Multivariable analysis
–Control for confounding
What can go wrong
• Small sample size and too few cases
• Wrong coding
• Skewed distribution of independent
variables
–Empty “subgroups”
• Collinearity
–Independent variables express the same information
Do not forget
• Rubbish in - rubbish out
• Check for confounders first
• Number of subjects >> variables in the
model
• Keep the model simple
–Statisticians can help with the model but
you need to understand the interpretation
• You will need several attempts to find
the “best” model
• If in doubt…
Really call a statistician !!!!