Chapter 16
Logistic Regression Analysis
Content
• Logistic regression
• Conditional logistic regression
• Application

Purpose: To derive the logistic regression equation, which is used to estimate the dependent variable (outcome factor) from the independent variables (risk factors). Logistic regression is a kind of nonlinear regression.
Data: 1. The dependent variable is a binary categorical variable with two values, such as "yes" and "no". 2. All, or at least most, of the independent variables should be categorical, although some may be numerical. Categorical variables must be coded numerically.
Implication: Logistic regression can be used to study the quantitative relationship between the occurrence of a disease or phenomenon and multiple risk factors.

By comparison, the χ² test (or u test) has two drawbacks:
1. It can study only one risk factor at a time.
2. It can yield only a qualitative conclusion.
Category:
1. Between-subjects (non-conditional) logistic regression equation
2. Paired (conditional) logistic regression equation
§1 Logistic regression (non-conditional logistic regression)
I Basic concepts

The dependent variable:
$$Y = \begin{cases} 1 & \text{the outcome happens} \\ 0 & \text{the outcome does not happen} \end{cases}$$
The independent variables: $X_1, X_2, \ldots, X_m$.
The probability of a positive outcome under the action of the $m$ independent variables is written as
$$P = P(Y = 1 \mid X_1, X_2, \ldots, X_m), \qquad 0 \le P \le 1$$
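Regression model
$$P = \frac{1}{1 + \exp[-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_m X_m)]}$$
where $\beta_0$ is the constant term and $\beta_1, \beta_2, \ldots, \beta_m$ are the regression coefficients.
If we let
$$Z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_m X_m$$
then
$$P = \frac{1}{1 + e^{-Z}}$$
and
$$\ln\frac{P}{1 - P} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_m X_m = \operatorname{logit} P$$
Scale: the probability $P$ ranges from 0 to 1, while $\operatorname{logit} P$ ranges from $-\infty$ to $+\infty$.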
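The two transformations above are inverses of each other, which a few lines of Python can illustrate. This is a minimal sketch; the function names `logistic` and `logit` are chosen here for illustration and do not come from the slides.

```python
import math

def logistic(z: float) -> float:
    """Map a linear predictor Z to a probability P = 1 / (1 + e^(-Z))."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p: float) -> float:
    """Map a probability P in (0, 1) back to ln(P / (1 - P))."""
    return math.log(p / (1.0 - p))

# Z is unbounded, P stays inside (0, 1), and logit(P) recovers Z.
for z in (-4, -2, 0, 2, 4):
    p = logistic(z)
    print(f"Z = {z:+d}  P = {p:.4f}  logit(P) = {logit(p):+.4f}")
```

At Z = 0 the probability is exactly 0.5, which is the midpoint of the curve.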
Figure 16-1 The logistic function: P (vertical axis, 0 to 1) plotted against Z (horizontal axis). As Z goes from −∞ through 0 to +∞, P goes from 0 through 0.5 to 1.
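The meaning of the model parameters
$$\ln\frac{P}{1 - P} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_m X_m = \operatorname{logit} P$$
The constant $\beta_0$ is the natural logarithm of the odds, that is, of the ratio between the probability of the outcome happening and the probability of it not happening, when the exposure dose is zero.
The regression coefficient $\beta_j$ ($j = 1, 2, \ldots, m$) is the change in $\operatorname{logit} P$ when the independent variable $X_j$ changes by one unit and the other variables stay fixed.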
Odds ratio (OR)
The odds ratio is the statistical indicator used in epidemiology to measure the effect of a risk factor. It is computed as
$$OR_j = \frac{P_1/(1 - P_1)}{P_0/(1 - P_0)}$$
In the formula, $P_1$ is the incidence of the disease when $X_j$ is $c_1$, and $P_0$ is the incidence of the disease when $X_j$ is $c_0$. $OR_j$ is called the adjusted odds ratio: it shows the effect of the risk factor with the influence of the other independent variables removed.
The relationship with logit P:
Comparing the disease status at two different exposure levels of one risk factor ($X_j = c_1$, $X_j = c_0$), the natural logarithm of the odds ratio is
$$\ln OR_j = \ln\left[\frac{P_1/(1 - P_1)}{P_0/(1 - P_0)}\right] = \operatorname{logit} P_1 - \operatorname{logit} P_0$$
$$= \Big(\beta_0 + \beta_j c_1 + \sum_{t \ne j}^{m} \beta_t X_t\Big) - \Big(\beta_0 + \beta_j c_0 + \sum_{t \ne j}^{m} \beta_t X_t\Big)$$
$$= \beta_j (c_1 - c_0)$$
That is,
$$OR_j = \exp[\beta_j (c_1 - c_0)]$$
If $X_j = 1$ for exposure and $X_j = 0$ for non-exposure, then $c_1 - c_0 = 1$ and
$$OR_j = \exp \beta_j, \qquad \begin{cases} \beta_j = 0, & OR_j = 1 \quad \text{no effect} \\ \beta_j > 0, & OR_j > 1 \quad \text{risk factor} \\ \beta_j < 0, & OR_j < 1 \quad \text{protective factor} \end{cases}$$
When $P \ll 1$,
$$OR = \frac{P_1/(1 - P_1)}{P_0/(1 - P_0)} \approx RR$$
$\beta_0$ is usually regarded as a nuisance parameter, because $OR_j$ does not depend on $\beta_0$.
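The last two statements can be checked numerically. The sketch below assumes a single binary exposure with a hypothetical coefficient `BETA_J = 0.7` (not a value from the slides) and varies the constant term: the odds ratio stays at exp(β_j) whatever β0 is, while the relative risk P1/P0 approaches the odds ratio only as the baseline probability becomes small.

```python
import math

def prob(beta0: float, beta_j: float, x: float) -> float:
    """P(Y = 1) from the logistic model with one independent variable."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta_j * x)))

def odds_ratio(p1: float, p0: float) -> float:
    """Odds of disease at exposure level c1 divided by odds at level c0."""
    return (p1 / (1.0 - p1)) / (p0 / (1.0 - p0))

BETA_J = 0.7  # hypothetical coefficient, for illustration only

rows = []
for beta0 in (0.0, -2.0, -5.0, -8.0):
    p0 = prob(beta0, BETA_J, 0)   # non-exposed
    p1 = prob(beta0, BETA_J, 1)   # exposed
    rows.append((beta0, p0, odds_ratio(p1, p0), p1 / p0))

for beta0, p0, or_, rr in rows:
    print(f"beta0 = {beta0:+.1f}  P0 = {p0:.5f}  OR = {or_:.4f}  RR = {rr:.4f}")
```

With β0 = 0 the baseline probability is 0.5 and RR is far from OR; with β0 = −8 the two nearly coincide.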
II Parameter estimation for the logistic regression model
1. Parameter estimation
Theory: maximum likelihood estimation
$$L = \prod_{i=1}^{n} P_i^{Y_i} (1 - P_i)^{1 - Y_i}$$
$$\ln L = \sum_{i=1}^{n} \left[ Y_i \ln P_i + (1 - Y_i) \ln(1 - P_i) \right]$$
Maximizing $\ln L$ gives the estimates $b_0, b_1, b_2, \ldots, b_m$ of the parameters.
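One step the slide leaves implicit is how the maximum is located. Differentiating ln L with respect to each parameter and setting the result to zero gives the score equations below. The notation $X_{ij}$ (the value of $X_j$ for subject $i$, with $X_{i0} \equiv 1$ for the constant term) is introduced here for illustration and is not used on the slide.

```latex
\frac{\partial \ln L}{\partial \beta_j}
  = \sum_{i=1}^{n} (Y_i - P_i)\, X_{ij} = 0,
  \qquad j = 0, 1, \ldots, m, \qquad X_{i0} \equiv 1
```

These equations have no closed-form solution in general, so the estimates are found iteratively, for example by the Newton-Raphson method.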
2. Estimation of OR
The odds ratio between two levels $(c_1, c_0)$ of a factor is estimated by
$$\widehat{OR}_j = \exp[b_j (c_1 - c_0)]$$
If the independent variable $X_j$ has only two levels, exposure and non-exposure, the $1 - \alpha$ confidence interval of $OR_j$ is
$$\exp(b_j \pm u_{\alpha/2} S_{b_j})$$
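The point estimate and confidence interval for a two-level factor can be sketched in a few lines of Python. The helper name `odds_ratio_ci` is an assumption made for this illustration; the inputs are the smoking and drinking coefficients reported in Example 16-1 of this chapter, with u = 1.96 for a 95% interval.

```python
import math

def odds_ratio_ci(b: float, se: float, u: float = 1.96):
    """Return (OR, lower, upper) for a binary factor: exp(b) and exp(b -/+ u * se)."""
    return math.exp(b), math.exp(b - u * se), math.exp(b + u * se)

# Coefficients and standard errors as reported in Example 16-1.
smoking = odds_ratio_ci(0.8856, 0.1500)
drinking = odds_ratio_ci(0.5261, 0.1572)

for name, (point, lower, upper) in (("smoking", smoking), ("drinking", drinking)):
    print(f"{name}: OR = {point:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

Note that the interval is symmetric on the log scale, not on the OR scale, because it is built around b before exponentiating.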
Example 16-1 Table 16-1 shows case-control data used to study the relationship among smoking, drinking, and esophageal cancer. Run a logistic regression analysis.
Define the coding of every variable:
$$X_1 = \begin{cases} 1 & \text{smoking} \\ 0 & \text{no smoking} \end{cases} \qquad X_2 = \begin{cases} 1 & \text{drinking} \\ 0 & \text{no drinking} \end{cases} \qquad Y = \begin{cases} 1 & \text{case} \\ 0 & \text{control} \end{cases}$$
Table 16-1 Case-control data on the relationship between smoking and esophageal cancer

| Stratum g | Smoking X1 | Drinking X2 | Total n_g | Cases (positive) d_g | Controls (negative) n_g − d_g |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 199 | 63 | 136 |
| 2 | 0 | 1 | 170 | 63 | 107 |
| 3 | 1 | 0 | 101 | 44 | 57 |
| 4 | 1 | 1 | 416 | 265 | 151 |
Results of the logistic regression:
$b_0 = -0.9099$, $S_{b_0} = 0.1358$; $b_1 = 0.8856$, $S_{b_1} = 0.1500$; $b_2 = 0.5261$, $S_{b_2} = 0.1572$
The OR of smoking versus non-smoking:
$$\widehat{OR}_1 = \exp b_1 = \exp 0.8856 = 2.42$$
The 95% confidence interval of $OR_1$:
$$\exp[b_1 \pm u_{0.05/2} S_{b_1}] = \exp(0.8856 \pm 1.96 \times 0.1500) = (1.81, 3.25)$$
The OR of drinking versus non-drinking:
$$\widehat{OR}_2 = \exp b_2 = \exp 0.5261 = 1.69$$
The 95% confidence interval of $OR_2$:
$$\exp(b_2 \pm 1.96 S_{b_2}) = \exp(0.5261 \pm 1.96 \times 0.1572) = (1.24, 2.30)$$
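The estimates reported for Example 16-1 can be reproduced from the four strata of Table 16-1. The sketch below is one possible way to do it, not necessarily the software the author used: it maximizes the grouped-data log-likelihood by Newton-Raphson in plain Python. The helper names (`solve`, `score_and_information`, `fit`) are chosen for this illustration.

```python
import math

# Table 16-1 as grouped data: (X1 smoking, X2 drinking, n_g subjects, d_g cases)
STRATA = [(0, 0, 199, 63), (0, 1, 170, 63), (1, 0, 101, 44), (1, 1, 416, 265)]

def solve(a, rhs):
    """Solve a x = rhs by Gauss-Jordan elimination with partial pivoting."""
    k = len(rhs)
    m = [row[:] + [r] for row, r in zip(a, rhs)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(k):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][k] / m[i][i] for i in range(k)]

def score_and_information(b):
    """Score vector and information matrix of ln L at the coefficients b."""
    score = [0.0] * 3
    info = [[0.0] * 3 for _ in range(3)]
    for x1, x2, n, d in STRATA:
        x = (1.0, x1, x2)
        z = sum(bj * xj for bj, xj in zip(b, x))
        p = 1.0 / (1.0 + math.exp(-z))
        for j in range(3):
            score[j] += (d - n * p) * x[j]
            for k in range(3):
                info[j][k] += n * p * (1.0 - p) * x[j] * x[k]
    return score, info

def fit(iterations=25):
    """Newton-Raphson: b <- b + information^{-1} * score until the step vanishes."""
    b = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        score, info = score_and_information(b)
        step = solve(info, score)
        b = [bj + sj for bj, sj in zip(b, step)]
        if max(abs(s) for s in step) < 1e-10:
            break
    _, info = score_and_information(b)
    se = []
    for j in range(3):
        unit = [1.0 if i == j else 0.0 for i in range(3)]
        se.append(math.sqrt(solve(info, unit)[j]))  # j-th diagonal of the inverse
    return b, se

b, se = fit()
for name, bj, sj in zip(("b0", "b1", "b2"), b, se):
    print(f"{name} = {bj:.4f}, standard error = {sj:.4f}")
```

The standard errors are the square roots of the diagonal of the inverse information matrix evaluated at the final estimates.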
III Hypothesis tests for the logistic regression model
1. Likelihood ratio test
2. Wald test: compares each parameter estimate with zero, using its standard error as the yardstick. The statistics are
$$u = \frac{b_j}{S_{b_j}} \qquad \text{or} \qquad \chi^2 = \left(\frac{b_j}{S_{b_j}}\right)^2, \quad \nu = 1$$
$$H_0: \beta_1 = 0, \quad H_1: \beta_1 \ne 0, \quad \alpha = 0.05, \quad \chi_1^2 = \left(\frac{0.8856}{0.1500}\right)^2 = 34.86$$
$$H_0: \beta_2 = 0, \quad H_1: \beta_2 \ne 0, \quad \alpha = 0.05, \quad \chi_2^2 = \left(\frac{0.5261}{0.1572}\right)^2 = 11.20$$
Both $\chi^2$ values exceed 3.84, so both smoking and drinking are associated with esophageal cancer.
The conclusion is the same as above.
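The Wald statistics above are easy to recompute. In this minimal sketch the function name `wald_test` is an assumption for illustration; the P value uses the identity that, for one degree of freedom, the upper tail of the χ² distribution equals `erfc(sqrt(x / 2))`, so only the standard library is needed.

```python
import math

def wald_test(b: float, se: float):
    """Return (u, chi-square, P value) for H0: beta_j = 0 with 1 degree of freedom."""
    u = b / se
    chi2 = u * u
    # For nu = 1, P(chi-square > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return u, chi2, p_value

# Estimates and standard errors from Example 16-1.
smoking = wald_test(0.8856, 0.1500)
drinking = wald_test(0.5261, 0.1572)

for name, (u, chi2, p) in (("smoking", smoking), ("drinking", drinking)):
    print(f"{name}: u = {u:.3f}, chi-square = {chi2:.2f}, P = {p:.2g}")
```

The critical value 3.84 quoted in the text corresponds to P = 0.05 under the same identity.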
IV Variable selection
Methods: forward selection, backward elimination, and stepwise regression.
Test statistic: not the F statistic, but one of the likelihood ratio, Wald, or score test statistics.
Example 16-2 To explore the risk factors related to coronary heart disease, a case-control study was carried out on 26 coronary heart disease patients and 28 controls. Table 16-2 and Table 16-3 give the definitions of the factors and the data. Use stepwise logistic regression to select the risk factors ($\alpha_{in} = 0.10$, $\alpha_{out} = 0.15$).
Table 16-2 Eight possible risk factors for coronary heart disease and their coding

| Factor | Variable | Coding |
|---|---|---|
| Age | X1 | <45=1, 45-54=2, 55-64=3, ≥65=4 |
| Hypertension | X2 | no=0, yes=1 |
| Family history of hypertension | X3 | no=0, yes=1 |
| Smoking | X4 | non-smoking=0, smoking=1 |
| High blood lipids | X5 | no=0, yes=1 |
| Animal fat intake | X6 | low=0, high=1 |
| Body mass index (BMI) | X7 | <24=1, 24 to <26=2, ≥26=3 |
| Type A personality | X8 | no=0, yes=1 |
| Coronary heart disease | Y | control=0, case=1 |
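Table 16-3 Case-control data on the risk factors for coronary heart disease

| Order | X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | Y |
|---|---|---|---|---|---|---|---|---|---|
| 1 | 3 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 0 |
| 2 | 2 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 0 |
| 3 | 2 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 4 | 2 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 5 | 3 | 0 | 0 | 1 | 0 | 1 | 1 | 1 | 0 |
| 6 | 3 | 0 | 1 | 1 | 0 | 0 | 2 | 1 | 0 |
| 7 | 2 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| 8 | 3 | 0 | 1 | 1 | 1 | 0 | 1 | 0 | 0 |
| 9 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 10 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 51 | 2 | 0 | 1 | 1 | 0 | 1 | 2 | 1 | 1 |
| 52 | 2 | 1 | 1 | 1 | 0 | 0 | 2 | 1 | 1 |
| 53 | 2 | 1 | 0 | 1 | 0 | 0 | 1 | 1 | 1 |
| 54 | 3 | 1 | 1 | 0 | 1 | 0 | 3 | 1 | 1 |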
Learn how to read the results!
Table 16-4 Example 16-2: independent variables entered into the equation and the related parameter estimates

| Variable | Regression coefficient b | Standard error S_b | Wald χ² | P | Standardized regression coefficient b' | OR estimate |
|---|---|---|---|---|---|---|
| Constant | -4.705 | 1.543 | 9.30 | 0.0023 | -- | -- |
| X1 | 0.924 | 0.477 | 3.76 | 0.0525 | 0.401 | 2.52 |
| X5 | 1.496 | 0.744 | 4.04 | 0.0443 | 0.406 | 4.46 |
| X6 | 3.136 | 1.249 | 6.30 | 0.0121 | 0.703 | 23.00 |
| X8 | 1.947 | 0.847 | 5.29 | 0.0215 | 0.523 | 7.01 |
In the end, four risk factors enter the logistic regression model: increasing age ($X_1$), history of high blood lipids ($X_5$), animal fat intake ($X_6$), and type A personality ($X_8$).
The standardized regression coefficient
$$b'_j = \frac{b_j S_j}{\pi / \sqrt{3}}$$
can be used to compare the importance of the factors, where $S_j$ is the standard deviation of $X_j$ and $\pi = 3.1416$.
§2 Conditional logistic regression
I Principle
In matched data, the most common design places one case and several controls in each matched group, that is, a 1:M matched study (usually $M \le 3$).
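Table 16-5 Data format for 1:M conditional logistic regression

| Matched group i | Number in group t | Dependent variable Y | X1 | X2 | … | Xm |
|---|---|---|---|---|---|---|
| 1 | 0 | 1 | X_101 | X_102 | … | X_10m |
| | 1 | 0 | X_111 | X_112 | … | X_11m |
| | 2 | 0 | X_121 | X_122 | … | X_12m |
| | … | … | … | … | … | … |
| | M | 0 | X_1M1 | X_1M2 | … | X_1Mm |
| … | … | … | … | … | … | … |
| n | 0 | 1 | X_n01 | X_n02 | … | X_n0m |
| | 1 | 0 | X_n11 | X_n12 | … | X_n1m |
| | 2 | 0 | X_n21 | X_n22 | … | X_n2m |
| | … | … | … | … | … | … |
| | M | 0 | X_nM1 | X_nM2 | … | X_nMm |

* t = 0 is the case and the others are the controls.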
The conditional logistic model
$$P_i = \frac{1}{1 + \exp[-(\beta_{0i} + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_m X_m)]}, \qquad i = 1, 2, \ldots, n$$
$P_i$ is the probability of disease in stratum (matched group) $i$ under the action of a set of risk factors. $\beta_{0i}$ is the effect of stratum $i$, and $\beta_1, \beta_2, \ldots, \beta_m$ are the parameters to be estimated.
The difference from the non-conditional logistic regression model lies in the constant term: the $\beta_{0i}$ may differ from one matched group to another, while each risk factor is assumed to have the same disease-causing effect in every matched group.
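The slide does not show how the stratum effects are handled in estimation. In the standard treatment of a 1:M matched design, the likelihood is written conditionally on each matched group containing exactly one case, and the $\beta_{0i}$ then cancel. The block below sketches that conditional likelihood in the indexing of Table 16-5, where $X_{itj}$ is the value of $X_j$ for member $t$ of group $i$ and $t = 0$ is the case.

```latex
L = \prod_{i=1}^{n}
    \frac{\exp\!\Big(\sum_{j=1}^{m} \beta_j X_{i0j}\Big)}
         {\sum_{t=0}^{M} \exp\!\Big(\sum_{j=1}^{m} \beta_j X_{itj}\Big)}
  = \prod_{i=1}^{n}
    \frac{1}{1 + \sum_{t=1}^{M} \exp\!\Big[\sum_{j=1}^{m} \beta_j \,(X_{itj} - X_{i0j})\Big]}
```

Only the differences between each control and its matched case enter the likelihood, which is why the $\beta_{0i}$ need not be estimated.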
II Applied example
Example 16-3 A study of the risk factors for laryngeal cancer in a northern city used a 1:2 matched case-control design. Six possible risk factors and 25 matched sets have been selected. The coding is given in Table 16-6 and the data in Table 16-7 ($\alpha_{in} = 0.10$, $\alpha_{out} = 0.15$).

Table 16-6 Risk factors for laryngeal cancer and their coding

| Factor | Variable | Coding |
|---|---|---|
| Pharyngitis | X1 | no=1, occasional=2, often=3 |
| Smoking (cigarettes/day) | X2 | 0=1, 1-4=2, 5-9=3, 10-20=4, ≥20=5 |
| Hoarseness | X3 | no=1, occasional=2, often=3 |
| Fresh vegetable intake | X4 | little=1, occasional=2, every day=3 |
| Fruit intake | X5 | rare=1, little=2, often=3 |
| Family history of cancer | X6 | no=0, yes=1 |
| Laryngeal cancer | Y | case=1, control=0 |
Table 16-7 Data from the 1:2 matched case-control study of laryngeal cancer (see p. 344)
Using stepwise variable selection on the six risk factors, four factors enter the equation. Table 16-8 shows the results.

Table 16-8 Example 16-3: independent variables entered into the equation and the related parameter estimates

| Entered variable | Regression coefficient b | Standard error S_b | Wald χ² | P | OR estimate |
|---|---|---|---|---|---|
| X2 | 1.4869 | 0.5506 | 7.29 | 0.0069 | 4.42 |
| X3 | 1.9166 | 0.9444 | 4.12 | 0.0424 | 6.80 |
| X4 | -3.7641 | 1.8251 | 4.25 | 0.0392 | 0.02 |
| X6 | 3.6321 | 1.8657 | 3.79 | 0.0516 | 37.79 |

The four entered risk factors are smoking ($X_2$), hoarseness ($X_3$), fresh vegetable intake ($X_4$), and family history of cancer ($X_6$). Of these, fresh vegetable intake is a protective factor ($b_4 < 0$).
§3 Applications of logistic regression and points to note
I Applications of logistic regression
1. Analysis of epidemiologic risk factors
A key feature of logistic regression is that its parameters have a clear meaning, which makes it well suited to epidemiologic studies.
2. Analysis of clinical trials
The goal of a clinical trial is to assess the effect of a drug or treatment. If confounding factors are present and are not balanced across the treatment groups, the results will be biased, so these factors must be adjusted for in the analysis. When the dependent variable is binary, logistic regression can be used to obtain the adjusted results.
3. Analysis of the dose-response relationship of drugs or poisons
In dose-response studies of drugs or poisons, when the dose is expressed on a logarithmic scale, the response probability approximately follows a normal distribution function. Because the normal distribution function is very similar in shape to the logistic function, the relationship can be expressed by the following model:
$$P = \frac{1}{1 + \exp[-(\beta_0 + \beta \ln X)]}$$
where $P$ is the positive (response) rate and $X$ is the dose.
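As an illustration of the dose-response model, the sketch below evaluates P over a range of doses and solves for the dose at which P = 0.5, which follows directly from setting β0 + β ln X = 0. The coefficients `BETA0` and `BETA` are hypothetical values chosen for the example, and the function names are not from the slides.

```python
import math

def response_probability(dose: float, beta0: float, beta: float) -> float:
    """P = 1 / (1 + exp[-(beta0 + beta * ln X)]) for a positive dose X."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta * math.log(dose))))

def median_effective_dose(beta0: float, beta: float) -> float:
    """Dose at which P = 0.5, obtained from beta0 + beta * ln X = 0."""
    return math.exp(-beta0 / beta)

BETA0, BETA = -4.0, 2.0  # hypothetical coefficients, for illustration only

ed50 = median_effective_dose(BETA0, BETA)
print(f"dose giving P = 0.5: {ed50:.3f}")
for dose in (1, 2, 5, 10, 20, 50):
    print(f"X = {dose:>2}  P = {response_probability(dose, BETA0, BETA):.3f}")
```

With a positive β the response rate rises monotonically with dose, tracing the same S-shaped curve as Figure 16-1 on the log-dose axis.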
4. Prediction and discrimination
Logistic regression is a probability model, so it can be used to predict the probability that an event occurs. In clinical practice, for example, it can be used to estimate the probability of a disease given certain indicators. See Chapter 18 on discrimination.
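A fitted model turns covariate values into predicted probabilities. The sketch below plugs the coefficients reported in Example 16-1 into the logistic model for each combination of smoking and drinking; the function name `predicted_probability` is an assumption for illustration. One caveat applies: Example 16-1 is a case-control study, so the constant term reflects the ratio of cases to controls in the sample, and these values describe the proportion of cases within the sample rather than disease risk in the population.

```python
import math

# Coefficients as reported in Example 16-1: constant, smoking (X1), drinking (X2).
B0, B1, B2 = -0.9099, 0.8856, 0.5261

def predicted_probability(x1: int, x2: int) -> float:
    """P(Y = 1 | X1, X2) from the fitted logistic regression equation."""
    z = B0 + B1 * x1 + B2 * x2
    return 1.0 / (1.0 + math.exp(-z))

predictions = {(x1, x2): predicted_probability(x1, x2)
               for x1 in (0, 1) for x2 in (0, 1)}

for (x1, x2), p in predictions.items():
    print(f"smoking = {x1}, drinking = {x2}: P = {p:.3f}")
```

A simple discrimination rule would classify a subject as a case when the predicted P exceeds a chosen cut-off such as 0.5; the choice of cut-off is a separate decision not covered on this slide.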
II Points to note when applying logistic regression
1. The coding of variable values (the same as in Chapter 15)
2. Sample size: $n \ge 20p$ ($p$ is the number of independent variables)
3. Evaluation of the model
• Tests of the independent variables in the model
• Test of goodness of fit of the regression equation
4. Multi-category logistic regression
Summary:
Purpose: To derive the logistic regression equation, which is used to estimate the dependent variable (outcome factor) from the independent variables (risk factors). Logistic regression is a probability-type, nonlinear regression.
Data: 1. The dependent variable is a binary categorical variable with two values, such as "yes" and "no". 2. All, or at least most, of the independent variables should be categorical, although some may be numerical. Categorical variables must be coded numerically.
Implication: Logistic regression can be used to study the quantitative relationship between the occurrence of a disease or phenomenon and multiple risk factors.
Category:
1. Between-subjects (non-conditional) logistic regression equation
2. Paired (conditional) logistic regression equation
Thinking:
To analyze the factors that influence the outcome of rescue in AMI patients, a hospital collected five years of data on AMI patients, 200 cases in total. Many factors are involved, but for reasons of space only three are listed here. The data are shown in the following table. P=0 means successful rescue and P=1 means death. X1=1 means shock before rescue, X1=0 means no shock before rescue. X2=1 means heart failure before rescue, X2=0 means no heart failure before rescue. X3=1 means that more than 12 hours passed between the onset of AMI symptoms and rescue, X3=0 means that no more than 12 hours passed.
Which analysis method is the most suitable, and why?
What results can be obtained?
Data on the risk factors for rescue outcome in AMI patients

| X1 | X2 | X3 | N with P=0 (successfully rescued) | N with P=1 (death) |
|---|---|---|---|---|
| 0 | 0 | 0 | 35 | 4 |
| 0 | 0 | 1 | 34 | 10 |
| 0 | 1 | 0 | 17 | 4 |
| 0 | 1 | 1 | 19 | 15 |
| 1 | 0 | 0 | 17 | 6 |
| 1 | 0 | 1 | 6 | 9 |
| 1 | 1 | 0 | 6 | 6 |
| 1 | 1 | 1 | 6 | 6 |
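The exercise has a binary outcome and several binary factors, which is the setting this chapter describes for non-conditional logistic regression. As a first look before fitting that model, the sketch below computes the crude (unadjusted) odds ratio of death for each factor by collapsing the table over the other two. The row layout and the function name `crude_odds_ratio` are assumptions made for this illustration; adjusted odds ratios would require fitting the multivariable model.

```python
# Each row: (X1 shock, X2 heart failure, X3 delay > 12 h, survivors, deaths)
ROWS = [
    (0, 0, 0, 35, 4),
    (0, 0, 1, 34, 10),
    (0, 1, 0, 17, 4),
    (0, 1, 1, 19, 15),
    (1, 0, 0, 17, 6),
    (1, 0, 1, 6, 9),
    (1, 1, 0, 6, 6),
    (1, 1, 1, 6, 6),
]

def crude_odds_ratio(factor_index: int) -> float:
    """Unadjusted OR of death, factor present vs absent, ignoring the other factors."""
    deaths_exposed = sum(r[4] for r in ROWS if r[factor_index] == 1)
    alive_exposed = sum(r[3] for r in ROWS if r[factor_index] == 1)
    deaths_unexposed = sum(r[4] for r in ROWS if r[factor_index] == 0)
    alive_unexposed = sum(r[3] for r in ROWS if r[factor_index] == 0)
    return (deaths_exposed * alive_unexposed) / (alive_exposed * deaths_unexposed)

total = sum(r[3] + r[4] for r in ROWS)
crude = {name: crude_odds_ratio(i) for i, name in enumerate(("X1", "X2", "X3"))}

print(f"total patients: {total}")
for name, value in crude.items():
    print(f"crude OR for {name}: {value:.2f}")
```

Each crude odds ratio mixes in the effects of the other two factors, which is the limitation of single-factor analysis noted at the start of the chapter.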