Introduction to Logistic Regression
Rachid Salmi, Jean-Claude Desenclos, Thomas Grein, Alain Moren, Viviane Bremer

Objectives
• When do we need to use logistic regression
• Principles of logistic regression
• Uses of logistic regression
• What to keep in mind

Chlamorea
• Sexually transmitted infection
  – Virus recently identified
  – Leads to general rash, blush, pimples and feeling of shame
  – Increasing prevalence with age
  – Risk factors unknown so far

Case control study
• Population of Berlin
• 150 cases, 150 controls
• Hypothesis: Consistent use of condoms protects against chlamorea
• Questionnaire with questions on demographic characteristics, sexual behaviour
• OR, t-test

Results bivariate analysis
                               Cases n=150   Controls n=150   Odds ratio
  Used condoms at last sex          40             90            0.17
  Did not use condoms              110             60            Ref

Results bivariate analysis
                               Cases n=150   Controls n=150   Odds ratio
  Single                           125             50            4.7
  Currently in a relationship       25            100            Ref

Results bivariate analysis
                               Cases n=150   Controls n=150
  Nr partners during last year       4              2            p=0.001
  Mean age in years                 39             26            p=0.001

Confounding?
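The crude odds ratios in the bivariate tables are cross-product ratios of the 2x2 counts. The sketch below recomputes them from the counts as transcribed; the function and variable names are illustrative and not from the slides. With these counts the result is about 0.24 for condom use and 10.0 for single status, not the 0.17 and 4.7 printed in the tables, so either the counts or the printed ratios may not have survived transcription intact.

```python
def odds_ratio(exposed_cases, exposed_controls, unexposed_cases, unexposed_controls):
    """Crude odds ratio of a 2x2 case-control table (cross-product ratio)."""
    return (exposed_cases * unexposed_controls) / (exposed_controls * unexposed_cases)


# Counts as they appear in the bivariate tables (cases, controls).
or_condoms = odds_ratio(40, 90, 110, 60)   # used condoms vs did not use condoms
or_single = odds_ratio(125, 50, 25, 100)   # single vs currently in a relationship

print(f"Used condoms at last sex: OR = {or_condoms:.2f}")
print(f"Single:                   OR = {or_single:.2f}")
```
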
[Diagram: stratification. Labels: T-test; Stratification by age group, single status and number of partners. The crude 2x2 table for chlamorea and condom use (cells a, b, c, d; OR raw) is split into stratum-specific tables (a1, b1, c1, d1 ... ai, bi, ci, di) with odds ratios OR1, OR2, ... ORi. Each additional stratification variable multiplies the number of strata.]

Let’s go one step back

Simple linear regression
Table 1. Age and systolic blood pressure (SBP) among 33 adult women

  Age  SBP     Age  SBP     Age  SBP
   22  131      41  139      52  128
   23  128      41  171      54  105
   24  116      46  137      56  145
   27  106      47  111      57  141
   28  114      48  115      58  153
   29  123      49  133      59  157
   30  117      49  128      63  155
   32  122      50  183      67  176
   33   99      51  130      71  172
   35  121      51  133      77  178
   40  147      51  144      81  217

[Figure: scatter plot of SBP (mm Hg) against age (years) with fitted line SBP = 81.54 + 1.222 × Age. Adapted from Colton T. Statistics in Medicine. Boston: Little Brown, 1974.]

Simple linear regression
• Relation between 2 continuous variables (SBP and age)
  $y = \alpha + \beta_1 x_1$
  [Figure: straight line of y against x with intercept α and slope β1.]
• Regression coefficient β1
  – Measures association between y and x
  – Amount by which y changes on average when x changes by one unit
  – Least squares method

What if we have more than one independent variable?

Multiple risk factors
• Objective: To attribute to each risk factor the respective effect (RR) it has on the occurrence of disease.
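The least squares fit behind the simple linear regression of SBP on age can be reproduced directly from Table 1. This is a minimal sketch in plain Python using the closed-form slope and intercept; the variable names are illustrative. The result should land close to the fitted line quoted with the figure (SBP = 81.54 + 1.222 × Age), with small differences due to rounding.

```python
# Table 1: age (years) and systolic blood pressure (mm Hg) of 33 adult women.
ages = [22, 23, 24, 27, 28, 29, 30, 32, 33, 35, 40,
        41, 41, 46, 47, 48, 49, 49, 50, 51, 51, 51,
        52, 54, 56, 57, 58, 59, 63, 67, 71, 77, 81]
sbp = [131, 128, 116, 106, 114, 123, 117, 122, 99, 121, 147,
       139, 171, 137, 111, 115, 133, 128, 183, 130, 133, 144,
       128, 105, 145, 141, 153, 157, 155, 176, 172, 178, 217]

n = len(ages)
mean_age = sum(ages) / n
mean_sbp = sum(sbp) / n

# Least squares: slope = Sxy / Sxx, intercept = mean(y) - slope * mean(x).
sxy = sum((x - mean_age) * (y - mean_sbp) for x, y in zip(ages, sbp))
sxx = sum((x - mean_age) ** 2 for x in ages)
slope = sxy / sxx
intercept = mean_sbp - slope * mean_age

print(f"SBP = {intercept:.2f} + {slope:.3f} x Age")
```

The slope is the average change in SBP (mm Hg) per additional year of age, which is the interpretation of β1 given on the slide.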
Types of multivariable analysis
• Multiple models
  – Linear regression
  – Logistic regression
  – Cox model
  – Poisson regression
  – Loglinear model
  – Discriminant analysis…
• Choice of the tool according to objectives, study design and variables

Multiple linear regression
• Relation between a continuous variable and a set of i variables
  $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$
• Partial regression coefficients βi
  – Amount by which y changes when xi changes by one unit and all the other xi remain constant
  – Measures association between xi and y adjusted for all other xi
• Example
  – Number of partners in relation to age & income

Multiple linear regression
  $y = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$
• y: predicted, response variable, outcome variable, dependent variable
• x1 ... xi: predictor variables, explanatory variables, covariables, independent variables
• y (number of partners) = α + β1 age + β2 income + β3 gender

What if our outcome variable is dichotomous?

Logistic regression (1)
Table 2. Age and chlamorea

  Age  Chlamorea     Age  Chlamorea     Age  Chlamorea
   22      0          40      0          54      0
   23      0          41      1          55      1
   24      0          46      0          58      1
   27      0          47      0          60      1
   28      0          48      0          60      0
   30      0          49      1          62      1
   30      0          49      0          65      1
   32      0          50      1          67      1
   33      0          51      0          71      1
   35      1          51      1          77      1
   38      0          52      0          81      1

How can we analyse these data?
• Compare mean age of diseased and non-diseased
  – Non-diseased: 26 years
  – Diseased: 39 years (p=0.0001)
• Linear regression?
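A simple first look at binary outcome data like Table 2 is to tabulate the proportion diseased by 10-year age group. The sketch below does this in plain Python; the grouping by decade and the variable names are assumptions made for illustration. The resulting counts and percentages should agree with the grouped prevalences in Table 3.

```python
from collections import defaultdict

# Table 2: age (years) and chlamorea status (1 = diseased, 0 = not diseased).
ages = [22, 23, 24, 27, 28, 30, 30, 32, 33, 35, 38,
        40, 41, 46, 47, 48, 49, 49, 50, 51, 51, 52,
        54, 55, 58, 60, 60, 62, 65, 67, 71, 77, 81]
chlamorea = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
             0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0,
             0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]

# Map the lower bound of each 10-year age group to [number in group, number diseased].
counts = defaultdict(lambda: [0, 0])
for age, ill in zip(ages, chlamorea):
    decade = (age // 10) * 10
    counts[decade][0] += 1
    counts[decade][1] += ill

prevalence = {}
for decade in sorted(counts):
    n_group, n_diseased = counts[decade]
    percent = round(100 * n_diseased / n_group)
    prevalence[decade] = (n_group, n_diseased, percent)
    print(f"{decade}-{decade + 9}: {n_diseased}/{n_group} diseased ({percent}%)")
```
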
[Figure: dot plot of presence of chlamorea (yes/no) against age (years), data from Table 2.]

Logistic regression (2)
Table 3. Prevalence (%) of chlamorea according to age group

                             Diseased
  Age group   # in group     #      %
  20 - 29          5         0      0
  30 - 39          6         1     17
  40 - 49          7         2     29
  50 - 59          7         4     57
  60 - 69          5         4     80
  70 - 79          2         2    100
  80 - 89          1         1    100

[Figure: dot plot of diseased (%) against age group, data from Table 3.]

Logistic function (1)
[Figure: S-shaped curve of the probability of disease (0.0 to 1.0) against x.]
  $P(y \mid x) = \dfrac{e^{\alpha + \beta x}}{1 + e^{\alpha + \beta x}}$

Logistic function
• Logistic regression models the logit of the outcome
  = natural logarithm of the odds of the outcome
  $\ln\left(\dfrac{\text{Probability of the outcome } (p)}{\text{Probability of not having the outcome } (1-p)}\right)$
  $\ln\left(\dfrac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$

Logistic function
  $\ln\left(\dfrac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$
• α = log odds of disease in unexposed
• β = log odds ratio associated with being exposed
• $e^{\beta}$ = odds ratio

Multiple logistic regression
• More than one independent variable
  – Dichotomous, ordinal, nominal, continuous …
  $\ln\left(\dfrac{P}{1-P}\right) = \alpha + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$
• Interpretation of βi
  – Increase in log-odds for a one unit increase in xi with all the other xi constant
  – Measures association between xi and log-odds adjusted for all other xi

Uses of multivariable analysis
• Etiologic models
  – Identify risk factors adjusted for confounders
  – Adjust for differences in baseline characteristics
• Predictive models
  – Determine diagnosis
  – Determine prognosis

Fitting equation to the data
• Linear regression:
  – Least squares
• Logistic regression:
  – Maximum likelihood

Elaborating $e^{\beta}$
• $e^{\beta}$ = OR
• What if the independent variable is continuous?
• What’s the effect of a change in x by more than one unit?

The Q fever example
• Distance to farm as independent continuous variable, counted in meters
  – β in logistic regression was -0.00050013 and statistically significant
• OR for each 1 meter distance is 0.9995
  – Too small to use
• What’s the OR for every 1000 meters?
  – $e^{1000\beta} = e^{-1000 \times 0.00050013} \approx 0.6064$

Continuous variables
• Increase in OR for a one unit change in exposure variable
• Logistic model is multiplicative: OR increases exponentially with x
  – If OR = 2 for a one unit change in exposure and x increases from 2 to 5: OR = 2 × 2 × 2 = $2^3$ = 8
• Verify if OR increases exponentially with x
  – When in doubt, treat as qualitative variable

Coding of variables (2)
• Nominal variables or ordinal with unequal classes:
  – Preferred hair colour of partners:
    » No hair=0, grey=1, brown=2, blond=3
  – Model assumes that OR for blond partners = (OR for grey-haired partners)$^3$
  – Use indicator variables (dummy variables)

Indicator variables: Hair colour

                               Dummy variables
  Hair colour of partners    blond   brown   grey
  grey                         0       0      1
  brown                        0       1      0
  blond                        1       0      0
  no hair                      0       0      0

• Neutralises artificial hierarchy between classes in variable “hair colour of partners”
• No assumptions made
• 3 variables in model using same reference
• OR for each type of hair adjusted for the others in reference to “no hair”

Classes
• Relationship between number of partners during last year and chlamorea
  – Code number of partners: 0-1 = 1, 2-3 = 2, 4-5 = 3

  Code nr partners   Cases   Controls   OR
         1             20       40      1.0
         2             22       30      1.5
         3             12       11      2.2

  $1.5^2 \approx 2.2$
• Compatible with assumption of multiplicative model
  – If not compatible, use indicator variables

Risk factors for Chlamorea
[Diagram: sex, hair colour, age group, single, visiting bars, number of partners and no condom use as candidate risk factors for chlamorea.]

Unconditional Logistic Regression

  Term                Odds Ratio   95% C.I.             Coef.     S.E.     Z-Statistic   P-Value
  # partners            1.2664     0.2634   10.7082     0.2362    0.9452     0.5486      0.5833
  Single (Yes/No)       1.0345     0.3277    3.2660     0.0339    0.5866     0.0578      0.9539
  Hair colour (1/0)     1.6126     0.2675    9.7220     0.4778    0.9166     0.5213      0.6022
  Hair colour (2/0)     0.7291     0.0991    5.3668    -0.3159    1.0185    -0.3102      0.7564
  Hair colour (3/0)     1.1137     0.1573    7.8870     0.1076    0.9988     0.1078      0.9142
  Visiting bars         1.5942     0.4953    5.1317     0.4664    0.5965     0.7819      0.4343
  Used no Condoms       9.0918     3.0219   27.3533     2.2074    0.5620     3.9278      0.0001
  Sex (f/m)             1.3024     0.2278    7.4468     0.2642    0.8896     0.2970      0.7665
  CONSTANT                 *          *         *      -3.0080    2.0559    -1.4631      0.1434

Last but not least

Why do we need multivariable analysis?
• Our real world is multivariable
• Multivariable analysis is a tool to determine the relative contribution of all factors

Sequence of analysis
• Descriptive analysis
  – Know your dataset
• Bivariate analysis
  – Identify associations
• Stratified analysis
  – Confounding and effect modifiers
• Multivariable analysis
  – Control for confounding

What can go wrong
• Small sample size and too few cases
• Wrong coding
• Skewed distribution of independent variables
  – Empty “subgroups”
• Collinearity
  – Independent variables express the same information

Do not forget
• Rubbish in - rubbish out
• Check for confounders first
• Number of subjects >> variables in the model
• Keep the model simple
  – Statisticians can help with the model but you need to understand the interpretation
• You will need several attempts to find the “best” model
• If in doubt… Really call a statistician!

References
• Norman GR, Streiner DL. Biostatistics. The Bare Essentials. BC Decker, London, 2000
• Hosmer DW, Lemeshow S. Applied logistic regression. Wiley & Sons, New York, 1989
• Schwartz MH. Multivariable analysis. Cambridge University Press, 2006
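Worked example: from coefficient to odds ratio. The odds ratio, 95% confidence interval and Z statistic columns of the unconditional logistic regression table follow from the coefficient and standard error columns, and the same exponentiation answers the Q fever question about a 1000 meter change. The sketch below assumes a Wald interval with the normal quantile 1.96; the function names are illustrative and do not come from the slides or from any particular statistics package.

```python
import math

Z_95 = 1.96  # assumed normal quantile for a two-sided 95% Wald interval


def odds_ratio_summary(coef, se):
    """Return (OR, lower 95% limit, upper 95% limit, Wald Z) for one coefficient."""
    odds_ratio = math.exp(coef)
    lower = math.exp(coef - Z_95 * se)
    upper = math.exp(coef + Z_95 * se)
    z_statistic = coef / se
    return odds_ratio, lower, upper, z_statistic


def rescaled_odds_ratio(coef, units):
    """OR for a change of `units` in a continuous predictor: exp(units * coef)."""
    return math.exp(units * coef)


# "Used no Condoms" row: Coef. = 2.2074, S.E. = 0.5620.
no_condoms = odds_ratio_summary(2.2074, 0.5620)
print("Used no condoms: OR = %.2f (95%% CI %.2f to %.2f), Z = %.2f" % no_condoms)

# Q fever example: coefficient per meter of distance to the farm.
beta_per_meter = -0.00050013
or_per_meter = rescaled_odds_ratio(beta_per_meter, 1)
or_per_1000_meters = rescaled_odds_ratio(beta_per_meter, 1000)
print(f"OR per 1 meter:     {or_per_meter:.4f}")
print(f"OR per 1000 meters: {or_per_1000_meters:.3f}")
```

Because the model is multiplicative on the odds scale, rescaling the exposure multiplies the coefficient before exponentiating, which is the same as raising the per-unit OR to the power of the number of units.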