Chapter 16 Logistic Regression Analysis

Content
Logistic regression
Conditional logistic regression
Application

Purpose: Derive the logistic regression equation, which is used to estimate the dependent variable (outcome) from the independent variables (risk factors). Logistic regression is a kind of nonlinear regression.

Data:
1. The dependent variable is a binary categorical variable with two values, such as "yes" and "no".
2. Most, if not all, of the independent variables should be categorical, although some may be numerical. Categorical variables must be coded numerically.

Implication: Logistic regression can be used to study the quantitative relationship between the occurrence of a disease or phenomenon and many risk factors.

The $\chi^2$ test (or u test) has two drawbacks for this purpose:
1. It can study only one risk factor at a time.
2. It yields only a qualitative conclusion.

Category:
1. Between-subjects (non-conditional) logistic regression equation
2. Paired (conditional) logistic regression equation

§1 Logistic regression (non-conditional logistic regression)

I Basic concepts

The dependent variable: $Y = 1$ if the event happens, $Y = 0$ if it does not.
The independent variables: $X_1, X_2, \ldots, X_m$.
The probability of a positive outcome under the action of the m independent variables is written

$$P = P(Y = 1 \mid X_1, X_2, \ldots, X_m), \qquad 0 \le P \le 1.$$

Regression model:

$$P = \frac{1}{1 + \exp[-(\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_m X_m)]}$$

If $Z = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_m X_m$, then

$$P = \frac{1}{1 + e^{-Z}}.$$

On the logit scale:

$$\ln\frac{P}{1 - P} = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_m X_m = \operatorname{logit} P$$

where $\beta_0$ is the constant term and $\beta_1, \beta_2, \ldots, \beta_m$ are the regression coefficients. P ranges from 0 to 1, while logit P ranges from $-\infty$ to $+\infty$.

Figure 16-1 The logistic function (P plotted against Z for Z from -4 to 4). As $Z \to -\infty$, $P \to 0$; at $Z = 0$, $P = 0.5$; as $Z \to +\infty$, $P \to 1$.

The meaning of the model parameters

The constant $\beta_0$ is the natural logarithm of the odds (the ratio of the probability of the event happening to that of it not happening) when every exposure is zero.
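As a small numerical check of the model above, the following Python sketch evaluates the logistic function and its inverse, the logit. The coefficient values `beta0` and `beta1` are hypothetical and chosen only for illustration; they do not come from the chapter.

```python
import math

def logistic(z):
    """P = 1 / (1 + exp(-z)): maps the linear predictor Z to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """logit P = ln(P / (1 - P)): maps a probability back to the real line."""
    return math.log(p / (1.0 - p))

# Hypothetical coefficients for a single exposure X1 (illustration only).
beta0, beta1 = -1.0, 0.8

def prob(x1):
    """Probability of the outcome for exposure level x1 under the model."""
    return logistic(beta0 + beta1 * x1)

# With the exposure at zero, the log odds equal the constant term beta0.
log_odds_at_zero = logit(prob(0.0))

# A few points on the curve of Figure 16-1.
for z in (-4, -2, 0, 2, 4):
    print(f"Z = {z:+d}  P = {logistic(z):.4f}")
print("log odds at X1 = 0:", round(log_odds_at_zero, 6))
```

The loop reproduces the S-shape of the figure: P stays strictly between 0 and 1 and passes through 0.5 at Z = 0.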
The regression coefficient $\beta_j$ $(j = 1, 2, \ldots, m)$ is the change in logit P when the independent variable $X_j$ increases by one unit.

Odds ratio (OR)

In epidemiology the odds ratio is the statistical indicator used to measure the effect of a risk factor. It is computed as

$$OR_j = \frac{P_1 / (1 - P_1)}{P_0 / (1 - P_0)}$$

where $P_1$ is the incidence of the disease when $X_j = c_1$ and $P_0$ is the incidence when $X_j = c_0$. $OR_j$ is called the adjusted odds ratio when the other variables have been adjusted for: it shows the effect of the risk factor free of the influence of the other independent variables.

Relationship with logit P: comparing the disease odds at two exposure levels of one risk factor ($X_j = c_1$ and $X_j = c_0$), the natural logarithm of the odds ratio is

$$\ln OR_j = \ln\frac{P_1 / (1 - P_1)}{P_0 / (1 - P_0)} = \operatorname{logit} P_1 - \operatorname{logit} P_0 = \Bigl(\beta_0 + \beta_j c_1 + \sum_{t \ne j} \beta_t X_t\Bigr) - \Bigl(\beta_0 + \beta_j c_0 + \sum_{t \ne j} \beta_t X_t\Bigr) = \beta_j (c_1 - c_0),$$

that is,

$$OR_j = \exp[\beta_j (c_1 - c_0)].$$

If $X_j = 1$ for exposure and $X_j = 0$ for non-exposure, then $c_1 - c_0 = 1$ and $OR_j = \exp \beta_j$:
$\beta_j = 0$, $OR_j = 1$: no effect
$\beta_j > 0$, $OR_j > 1$: risk factor
$\beta_j < 0$, $OR_j < 1$: protective factor

When $P \ll 1$,

$$OR = \frac{P_1 / (1 - P_1)}{P_0 / (1 - P_0)} \approx RR.$$

$\beta_0$ is usually regarded as a parameter of no interest, because $OR_j$ does not depend on $\beta_0$.

II Parameter estimation of the logistic regression model

1. Parameter estimation

Theory: maximum likelihood estimation. The likelihood and log-likelihood are

$$L = \prod_{i=1}^{n} P_i^{Y_i} (1 - P_i)^{1 - Y_i}$$

$$\ln L = \sum_{i=1}^{n} \bigl[ Y_i \ln P_i + (1 - Y_i) \ln(1 - P_i) \bigr]$$

Maximizing $\ln L$ gives the estimates $b_0, b_1, b_2, \ldots, b_m$.

2. Estimation of OR

The estimated OR for two levels ($c_1$, $c_0$) of one factor is

$$\widehat{OR}_j = \exp[b_j (c_1 - c_0)].$$

If the independent variable $X_j$ has only two levels, exposure and non-exposure, the $1 - \alpha$ confidence interval of $OR_j$ is

$$\exp(b_j \pm u_{\alpha/2} S_{b_j}).$$

Example 16-1: Table 16-1 gives case-control data used to study the relationship between smoking, drinking and esophageal cancer. Run a logistic regression analysis.
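One way to carry out the analysis requested in Example 16-1 is sketched below in plain Python. The chapter states only that the estimates come from maximum likelihood; the use of Newton-Raphson iteration, and the helper names `solve` and `fit`, are assumptions made for this sketch. The grouped counts are copied from Table 16-1.

```python
import math

# Grouped counts from Table 16-1: (X1 smoking, X2 drinking, n_g observed, d_g cases).
groups = [(0, 0, 199, 63), (0, 1, 170, 63), (1, 0, 101, 44), (1, 1, 416, 265)]

def solve(a, rhs):
    """Solve the linear system a x = rhs by Gauss-Jordan elimination with pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [x - f * y for x, y in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit(groups, iters=25):
    """Maximise ln L = sum[d ln P + (n - d) ln(1 - P)] by Newton-Raphson."""
    k = len(groups[0]) - 1          # intercept plus the covariates
    b = [0.0] * k
    for _ in range(iters):
        score = [0.0] * k           # first derivatives of ln L
        info = [[0.0] * k for _ in range(k)]   # information matrix
        for *xs, n, d in groups:
            x = [1.0] + [float(v) for v in xs]
            z = sum(bj * xj for bj, xj in zip(b, x))
            p = 1.0 / (1.0 + math.exp(-z))
            for i in range(k):
                score[i] += (d - n * p) * x[i]
                for j in range(k):
                    info[i][j] += n * p * (1.0 - p) * x[i] * x[j]
        b = [bj + sj for bj, sj in zip(b, solve(info, score))]
    # Standard errors: square roots of the diagonal of the inverse information
    # matrix, evaluated at the last iterate.
    se = []
    for i in range(k):
        unit = [1.0 if j == i else 0.0 for j in range(k)]
        se.append(math.sqrt(solve(info, unit)[i]))
    return b, se

b, se = fit(groups)
for name, bj, sj in zip(("b0", "b1", "b2"), b, se):
    print(f"{name} = {bj:.4f}  S_b = {sj:.4f}  exp(b) = {math.exp(bj):.2f}")
```

Because the data are grouped, each stratum contributes $d_g \ln P_g + (n_g - d_g) \ln(1 - P_g)$ to the log-likelihood, which is the same sum as the individual-level formula above.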
Variable coding:
X1 = 1 smoking, 0 no smoking
X2 = 1 drinking, 0 no drinking
Y = 1 case, 0 control

Table 16-1 Case-control data on the relationship between smoking and esophageal cancer

Stratum g   Smoking X1   Drinking X2   Observed n_g   Positive (cases) d_g   Negative (controls) n_g - d_g
1           0            0             199            63                     136
2           0            1             170            63                     107
3           1            0             101            44                     57
4           1            1             416            265                    151

Results of the logistic regression calculation:

$b_0 = -0.9099$, $S_{b_0} = 0.1358$; $b_1 = 0.8856$, $S_{b_1} = 0.1500$; $b_2 = 0.5261$, $S_{b_2} = 0.1572$

The OR of smoking versus non-smoking:

$$\widehat{OR}_1 = \exp b_1 = \exp 0.8856 = 2.42$$

95% confidence interval of $OR_1$:

$$\exp(b_1 \pm u_{0.05/2} S_{b_1}) = \exp(0.8856 \pm 1.96 \times 0.1500) = (1.81, 3.25)$$

The OR of drinking versus non-drinking:

$$\widehat{OR}_2 = \exp b_2 = \exp 0.5261 = 1.69$$

95% confidence interval of $OR_2$:

$$\exp(b_2 \pm 1.96\, S_{b_2}) = \exp(0.5261 \pm 1.96 \times 0.1572) = (1.24, 2.30)$$

III Hypothesis tests for the logistic regression model

1. Likelihood ratio test

2. Wald test
The Wald test compares each parameter estimate with zero, using its standard error as the yardstick. The statistics are

$$u = \frac{b_j}{S_{b_j}} \quad \text{or} \quad \chi^2 = \left(\frac{b_j}{S_{b_j}}\right)^2, \quad \nu = 1.$$

$H_0: \beta_1 = 0$, $H_1: \beta_1 \ne 0$, $\alpha = 0.05$: $\chi_1^2 = \left(\dfrac{0.8856}{0.1500}\right)^2 = 34.86$

$H_0: \beta_2 = 0$, $H_1: \beta_2 \ne 0$, $\alpha = 0.05$: $\chi_2^2 = \left(\dfrac{0.5261}{0.1572}\right)^2 = 11.20$

Both $\chi^2$ values exceed 3.84, the critical value for $\nu = 1$ at $\alpha = 0.05$, so smoking and drinking are each associated with esophageal cancer. This agrees with the conclusion from the confidence intervals above.

IV Variable selection

Methods: forward selection, backward elimination and stepwise regression.
Test statistic: not the F statistic, but one of the likelihood ratio, Wald and score test statistics.

Example 16-2: To explore the risk factors for coronary heart disease, a case-control study was carried out on 26 coronary heart disease patients and 28 controls. Table 16-2 gives the definition of each factor and Table 16-3 gives the data. Use stepwise logistic regression to select the risk factors.
(Significance level for entry 0.10, for removal 0.15.)

Table 16-2 Eight possible risk factors for coronary heart disease and their coding

Factor                            Variable   Coding
Age                               X1         <45 = 1, 45-54 = 2, 55-64 = 3, ≥65 = 4
Hypertension                      X2         no = 0, yes = 1
Family history of hypertension    X3         no = 0, yes = 1
Smoking                           X4         non-smoking = 0, smoking = 1
High blood lipid                  X5         no = 0, yes = 1
Animal fat intake                 X6         low = 0, high = 1
Weight index (BMI)                X7         <24 = 1, 24 to <26 = 2, ≥26 = 3
Type A personality                X8         no = 0, yes = 1
Coronary heart disease            Y          control = 0, case = 1

Table 16-3 Case-control data on the risk factors for coronary heart disease

Order   X1   X2   X3   X4   X5   X6   X7   X8   Y
1       3    1    0    1    0    0    1    1    0
2       2    0    1    1    0    0    1    0    0
3       2    1    0    1    0    0    1    0    0
4       2    0    0    1    0    0    1    0    0
5       3    0    0    1    0    1    1    1    0
6       3    0    1    1    0    0    2    1    0
7       2    0    1    0    0    0    1    0    0
8       3    0    1    1    1    0    1    0    0
9       2    0    0    0    0    0    1    1    0
10      1    0    0    1    0    0    1    0    0
...     ...  ...  ...  ...  ...  ...  ...  ...  ...
51      2    0    1    1    0    1    2    1    1
52      2    1    1    1    0    0    2    1    1
53      2    1    0    1    0    0    1    1    1
54      3    1    1    0    1    0    3    1    1

Learn how to read the results.

Table 16-4 Example 16-2: independent variables entering the equation and the estimates of the related parameters

Variable   Regression coefficient b   Standard error S_b   Wald χ²   P        Standardized coefficient b'   OR estimate
Constant   -4.705                     1.543                9.30      0.0023   --                            --
X1         0.924                      0.477                3.76      0.0525   0.401                         2.52
X5         1.496                      0.744                4.04      0.0443   0.406                         4.46
X6         3.136                      1.249                6.30      0.0121   0.703                         23.00
X8         1.947                      0.847                5.29      0.0215   0.523                         7.01

Four risk factors finally enter the logistic regression model: increasing age ($X_1$), history of high blood lipid ($X_5$), animal fat intake ($X_6$) and type A personality ($X_8$).

The standardized regression coefficient

$$b'_j = \frac{b_j S_j}{\pi / \sqrt{3}}$$

can be used to compare the importance of the factors, where $S_j$ is the standard deviation of $X_j$ and $\pi = 3.1416$.

§2 Conditional logistic regression

I Principle

In matched data, the most common design puts one case and several controls in each matched group, that is, a 1:M matched study (usually $M \le 3$).

Table 16-5 Data format for 1:M conditional logistic regression

Matched group i   Number in group t*   Dependent variable Y   X1        X2        ...   Xm
1                 0                    1                      X_{101}   X_{102}   ...   X_{10m}
                  1                    0                      X_{111}   X_{112}   ...   X_{11m}
                  2                    0                      X_{121}   X_{122}   ...   X_{12m}
                  ...                  ...                    ...       ...       ...   ...
                  M                    0                      X_{1M1}   X_{1M2}   ...   X_{1Mm}
...
n                 0                    1                      X_{n01}   X_{n02}   ...   X_{n0m}
                  1                    0                      X_{n11}   X_{n12}   ...   X_{n1m}
                  2                    0                      X_{n21}   X_{n22}   ...   X_{n2m}
                  ...                  ...                    ...       ...       ...   ...
                  M                    0                      X_{nM1}   X_{nM2}   ...   X_{nMm}

* t = 0 is the case; the others are the controls.

The conditional logistic model:

$$P_i = \frac{1}{1 + \exp[-(\beta_{0i} + \beta_1 X_1 + \beta_2 X_2 + \cdots + \beta_m X_m)]}, \qquad i = 1, 2, \ldots, n$$

$P_i$ is the probability of disease in matched group (stratum) i under the action of a set of risk factors. $\beta_{0i}$ is the effect of stratum i, and $\beta_1, \beta_2, \ldots, \beta_m$ are the parameters to be estimated. The difference from the non-conditional logistic regression model lies in the constant term: the $\beta_{0i}$ may differ from one matched group to another, but the effect of each risk factor on the disease is assumed to be the same in all matched groups.

II Applied example

Example 16-3: A study of the risk factors for larynx cancer in a northern city used a 1:2 matched case-control design. Six possible risk factors and 25 matched groups have been selected. The coding is given in Table 16-6 and the data in Table 16-7.
(Significance level for entry 0.10, for removal 0.15.)

Table 16-6 Possible risk factors for larynx cancer and their coding

Factor                    Variable   Coding
Pharyngitis               X1         no = 1, occasional = 2, often = 3
Smoking (cig/day)         X2         0 = 1, 1-4 = 2, 5-9 = 3, 10-20 = 4, over 20 = 5
Hoarseness                X3         no = 1, occasional = 2, often = 3
Fresh vegetable intake    X4         little = 1, occasional = 2, every day = 3
Fruit intake              X5         rare = 1, little = 2, often = 3
Family cancer history     X6         no = 0, yes = 1
Larynx cancer             Y          case = 1, control = 0

Table 16-7 (textbook p. 344) Data of the 1:2 matched case-control study of larynx cancer

Stepwise variable selection applied to the six risk factors retains four factors in the equation. Table 16-8 shows the results.

Table 16-8 Example 16-3: independent variables entering the equation and the estimates of the related parameters

Variable   Regression coefficient b   Standard error S_b   Wald χ²   P        OR estimate
X2         1.4869                     0.5506               7.29      0.0069   4.42
X3         1.9166                     0.9444               4.12      0.0424   6.80
X4         -3.7641                    1.8251               4.25      0.0392   0.02
X6         3.6321                     1.8657               3.79      0.0516   37.79

The four factors that enter are smoking ($X_2$), hoarseness ($X_3$), fresh vegetable intake ($X_4$) and family cancer history ($X_6$). Of these, eating fresh vegetables is a protective factor ($b_4 < 0$).

§3 Applications of logistic regression and points to note

I Applications of logistic regression

1. Analysis of epidemiologic risk factors
One feature of logistic regression is that its parameters have a clear meaning, which makes it well suited to epidemiologic studies.

2. Analysis of clinical trials
The goal of a clinical trial is to assess the effect of a drug or treatment. If confounding factors are present and are not balanced between the groups, the final result will be biased, so these factors must be adjusted for in the analysis. When the dependent variable is binary, logistic regression can be used to obtain the adjusted results.
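The effect of adjustment can be illustrated with the chapter's own Example 16-1, used here only as a stand-in because no clinical-trial data are given. The sketch below compares the crude OR for smoking, obtained by collapsing Table 16-1 over drinking, with the adjusted OR computed from the reported coefficient $b_1 = 0.8856$ and standard error 0.1500. The variable names are illustrative.

```python
import math

# Counts from Table 16-1: (X1 smoking, X2 drinking, cases, controls).
strata = [(0, 0, 63, 136), (0, 1, 63, 107), (1, 0, 44, 57), (1, 1, 265, 151)]

# Crude OR for smoking: ignore drinking and form a single 2x2 table.
exposed_cases = sum(ca for x1, _, ca, _ in strata if x1 == 1)
exposed_controls = sum(co for x1, _, _, co in strata if x1 == 1)
unexposed_cases = sum(ca for x1, _, ca, _ in strata if x1 == 0)
unexposed_controls = sum(co for x1, _, _, co in strata if x1 == 0)
crude_or = (exposed_cases * unexposed_controls) / (exposed_controls * unexposed_cases)

# Adjusted OR for smoking from the logistic model that also contains drinking,
# using the coefficient and standard error reported for Example 16-1.
b1, s_b1 = 0.8856, 0.1500
u = 1.96                                  # two-sided 5% normal critical value
adjusted_or = math.exp(b1)
ci_low, ci_high = math.exp(b1 - u * s_b1), math.exp(b1 + u * s_b1)
wald_chi2 = (b1 / s_b1) ** 2

print(f"crude OR    = {crude_or:.2f}")
print(f"adjusted OR = {adjusted_or:.2f}  95% CI ({ci_low:.2f}, {ci_high:.2f})")
print(f"Wald chi2   = {wald_chi2:.2f}")
```

The crude OR is about 2.87, while the OR adjusted for drinking is 2.42, so part of the crude association between smoking and esophageal cancer in these data is attributable to drinking.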
3. Analysis of the dose-response of drugs or poisons
In dose-response studies of drugs or poisons, if the dose is expressed as its logarithm, the distribution of the response is close to normal. The normal distribution function is very similar in shape to the logistic function, so the relationship can be expressed by the following model:

$$P = \frac{1}{1 + \exp[-(\beta_0 + \beta \ln X)]}$$

(where P is the positive rate and X is the dose).

4. Prediction and discrimination
Logistic regression is a probability model, so it can be used to predict the probability of an event. In clinical practice, for example, it can be used to assess the probability of a disease given certain indices. See chapter 18 on discrimination.

II Points to note when applying logistic regression

1. The form in which variables are coded (the same as in chapter 15)
2. Sample size: $n \ge 20p$, where p is the number of independent variables
3. Evaluation of the model: evaluation of the independent variables in the model, and testing the goodness of fit of the regression equation
4. Multi-category logistic regression

Summary

Purpose: Derive the logistic regression equation, which is used to estimate the dependent variable (outcome) from the independent variables (risk factors). Logistic regression is a probability-type, nonlinear regression.

Data:
1. The dependent variable is a binary categorical variable with two values, such as "yes" and "no".
2. Most, if not all, of the independent variables should be categorical, although some may be numerical. Categorical variables must be coded numerically.

Implication: Logistic regression can be used to study the quantitative relationship between the occurrence of a disease or phenomenon and many risk factors.

Category:
1. Between-subjects (non-conditional) logistic regression equation
2. Paired (conditional) logistic regression equation

Thinking: To analyze the factors that influence the outcome of rescuing AMI patients, a hospital collected five years of data on AMI patients, 200 cases in total. (There are many related factors; only three are listed here for reasons of space.) The data are shown in the following table. P = 0 means successful rescue and P = 1 means death. X1 = 1 means shock before rescue and X1 = 0 means no shock before rescue. X2 = 1 means heart failure before rescue and X2 = 0 means no heart failure before rescue. X3 = 1 means that more than 12 hours passed from the onset of AMI symptoms to rescue and X3 = 0 means that no more than 12 hours passed. Which analysis method is best, and why? What results can be obtained?

Risk factor data for the rescue of AMI patients

P = 0 (successfully rescued)      P = 1 (death)
X1   X2   X3   N                  X1   X2   X3   N
0    0    0    35                 0    0    0    4
0    0    1    34                 0    0    1    10
0    1    0    17                 0    1    0    4
0    1    1    19                 0    1    1    15
1    0    0    17                 1    0    0    6
1    0    1    6                  1    0    1    9
1    1    0    6                  1    1    0    6
1    1    1    6                  1    1    1    6
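One possible approach to this exercise: the outcome is binary, the data are unmatched and there are several risk factors, so a non-conditional logistic regression with X1, X2 and X3 is a natural candidate. The plain-Python sketch below fits it by Newton-Raphson. The fitting routine and the names `solve` and `fit` are assumptions of this sketch, not the chapter's own method. The counts are copied from the table above.

```python
import math

# Rows copied from the AMI table: (X1, X2, X3, N rescued, N died).
table = [
    (0, 0, 0, 35, 4), (0, 0, 1, 34, 10), (0, 1, 0, 17, 4), (0, 1, 1, 19, 15),
    (1, 0, 0, 17, 6), (1, 0, 1, 6, 9), (1, 1, 0, 6, 6), (1, 1, 1, 6, 6),
]
# One cell per covariate pattern: (x with leading 1 for the constant, total n, deaths d).
cells = [([1.0, x1, x2, x3], s + d, d) for x1, x2, x3, s, d in table]

def solve(a, rhs):
    """Solve a x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    m = [row[:] + [rhs[i]] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(n):
            if r != c:
                f = m[r][c] / m[c][c]
                m[r] = [u - f * v for u, v in zip(m[r], m[c])]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit(cells, iters=30):
    """Newton-Raphson maximisation of the grouped-data log-likelihood."""
    k = len(cells[0][0])
    b = [0.0] * k
    for _ in range(iters):
        score = [0.0] * k
        info = [[0.0] * k for _ in range(k)]
        for x, n, d in cells:
            p = 1.0 / (1.0 + math.exp(-sum(bj * xj for bj, xj in zip(b, x))))
            for i in range(k):
                score[i] += (d - n * p) * x[i]
                for j in range(k):
                    info[i][j] += n * p * (1.0 - p) * x[i] * x[j]
        b = [bj + sj for bj, sj in zip(b, solve(info, score))]
    # Standard errors from the inverse information matrix at the last iterate.
    se = [math.sqrt(solve(info, [float(i == j) for j in range(k)])[i])
          for i in range(k)]
    return b, se

b, se = fit(cells)
labels = ("constant", "X1 shock", "X2 heart failure", "X3 over 12 h")
for name, bj, sj in zip(labels, b, se):
    print(f"{name:18s} b = {bj:7.4f}  S_b = {sj:.4f}  "
          f"OR = {math.exp(bj):5.2f}  Wald chi2 = {(bj / sj) ** 2:5.2f}")
```

Each exp(b) in the output estimates the OR of death for that factor adjusted for the other two, and each Wald $\chi^2$ can be compared with 3.84 as in Example 16-1.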