Practicum

1. The First Session
   1. Introduction of the program
   2. Descriptive statistics
   3. Univariate analysis
   4. Multivariate analysis
2. The Second Session (Homework)
   Presentations on Friday morning

Multivariate Analysis
Önder Ergönül, MD, MPH
Koç University, School of Medicine
Summer Course on Research Methodology and Ethics in Medical Sciences
June 16-20, 2014, Istanbul

Background (%)

                                                1978-79   1989   2004-05
  Descriptive only                                 27      12       13
  Statistics tables (contingency tables)           27      36       53
  Epidemiologic measures (relative risk, odds)     10      22       35
  Survival analysis                                11      32       61
  Multivariate analysis (regression)                5      14       51
  Power analysis                                    3       3       39

Horton NJ, Switzer SS. NEJM 2005; 353.

Multivariate analyses

  Regression   Dependent variable (outcome)
  Linear       Continuous
  Logistic     Dichotomous
  Cox          Dichotomous (time to event)
  Poisson      Count

Confounder
A confounder (Variable B) distorts the association between Variable A and the outcome.
Example: Acinetobacter infection → fatality, confounded by severity of illness (APACHE score).

Control of the confounders
1. Randomization
2. Stratification
3. Adjustment by multivariate analysis

Odds Ratio
p = probability (or proportion); the lower bound is 0 and the upper bound is 1.
Probability of success: Pr(y = 1) = p
Probability of failure: Pr(y = 0) = 1 - p
What is the p of success or failure?
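The probability-odds relationship can be checked numerically before working through the examples. A minimal sketch in Python (the function names are illustrative, not from the course material):

```python
def odds(p):
    """Convert a probability p (0 < p < 1) to odds = p / (1 - p)."""
    return p / (1 - p)

def probability(o):
    """Convert odds back to a probability: p = o / (1 + o)."""
    return o / (1 + o)

# A probability of success of 0.75 corresponds to odds of 3 to 1.
print(odds(0.75))        # 3.0
print(probability(3.0))  # 0.75
```

Note that odds of 1 correspond to a probability of 0.5, the point where success and failure are equally likely.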
            Failure    Success    Total
            1 - p      p          (1 - p) + p = 1
  Example:  .25        .75        1

Odds = p/(1 - p) = .75/(1 - .75) = .75/.25 = 3

Odds ratio = [pA/(1 - pA)] / [pB/(1 - pB)]

Relative Risk

             DVT    No DVT
  Heparin     8       92
  Placebo    18       82

Risk(heparin) = 8/100 = 0.08
Risk(placebo) = 18/100 = 0.18
Relative risk = Risk(placebo) / Risk(heparin) = 0.18/0.08 = 2.25

Odds Ratio

             DVT    No DVT
  Heparin     8       92
  Placebo    18       82

Odds(heparin) = 8/92 = 0.087
Odds(placebo) = 18/82 = 0.22
Odds ratio = Odds(placebo) / Odds(heparin) = 0.22/0.087 = 2.53

Comparing risks and odds

  Risk            Odds
  0.05 (5%)       0.053
  0.1  (10%)      0.11
  0.2  (20%)      0.25
  0.3  (30%)      0.43
  0.4  (40%)      0.67
  0.5  (50%)      1
  0.6  (60%)      1.5
  0.7  (70%)      2.3
  0.8  (80%)      4
  0.9  (90%)      9
  0.95 (95%)      19

Confounder
Smoking confounds the association between oral contraceptive (OC) use and myocardial infarction (MI).

Oral contraceptives (OC) and myocardial infarction (MI)
Case-control study, unstratified data

  OC      MI     Controls    OR
  Yes     693      320       4.8
  No      307      680       Ref.
  Total  1000     1000

  Smoking   MI     Controls    OR
  Yes       700      500       2.3
  No        300      500       Ref.
  Total    1000     1000

Stratified by smoking:

  Smokers
  OC      MI     Controls    OR
  Yes     517      160       6.0
  No      183      340       Ref.
  Total   700      500

  Nonsmokers
  OC      MI     Controls    OR
  Yes     176      160       3.0
  No      124      340       Ref.
  Total   300      500

Odds ratio for OC adjusted for smoking = 4.5

Regression

From correlation to regression: Correlation → Linear regression → Logistic regression

The simplest case: Y depends linearly on X
E(Y) = β0 + β1x
β0, β1: parameters

Regression
X: predictor or independent variable
Y: outcome or dependent variable
Y = a + bx

[Figure: scatterplot with fitted linear regression line]

Linear regression
E(Y) = β0 + β1x
Example: Weight = -44.16 + 0.55 × height
Parameters: β0 = intercept; β1 = slope (regression coefficient)

[Figure: scatterplot of SBP (mm Hg) against age (years), 20-90, with fitted line SBP = 81.54 + 1.222 × Age]

Simple linear regression
• Relation between 2 continuous variables (SBP and age): y = α + β1x1
• Regression coefficient β1
  – Measures the association between y and x
  – Amount by which y changes on average when x changes by one unit
  – Estimated by the least squares method

Logistic regression

Age and signs of coronary heart disease (CD)

  Age  CD    Age  CD    Age  CD
  22   0     40   0     54   0
  23   0     41   1     55   1
  24   0     46   0     58   1
  27   0     47   0     60   1
  28   0     48   0     60   0
  30   0     49   1     62   1
  30   0     49   0     65   1
  32   0     50   1     67   1
  33   0     51   0     71   1
  35   1     51   1     77   1
  38   0     52   0     81   1

[Figure: dot-plot of signs of coronary disease (yes/no) against age, 0-100 years]

Logistic function

P(y|x) = e^(α + βx) / (1 + e^(α + βx))

[Figure: S-shaped curve of the probability of disease (0 to 1) against x]

Why is it called "logistic" regression?
• It uses the logit transformation.
• The logit transformation can be interpreted as the logarithm of the odds of success vs. failure.
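The logistic function and its inverse, the logit, can be sketched directly from the formulas above (a minimal illustration; the function names are my own):

```python
import math

def logistic(z):
    """The logistic function P = e^z / (1 + e^z), mapping any real z to (0, 1)."""
    return math.exp(z) / (1 + math.exp(z))

def logit(p):
    """The logit (log-odds) transformation, the inverse of the logistic function."""
    return math.log(p / (1 - p))

print(logistic(0.0))  # 0.5: zero log-odds means a 50/50 chance
print(logit(0.75))    # log of odds 3, i.e. ln(3)
```

Applying logit after logistic recovers the original value, which is exactly why modeling logit(P) as α + βx is equivalent to the S-shaped probability curve.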
The logit transformation

logit(p) = log[p / (1 - p)] = ln[p / (1 - p)]

Transformation

P(y|x) = e^(α + βx) / (1 + e^(α + βx))

ln[ P(y|x) / (1 - P(y|x)) ] = α + βx        (the logit of P(y|x))

α = log odds of disease in the unexposed
β = log odds ratio associated with being exposed
e^β = odds ratio

Fitting the equation to the data
• Linear regression: least squares
• Logistic regression: maximum likelihood
• Likelihood function
  – Estimates the parameters α and β
  – Practically, it is easier to work with the log-likelihood:

    L(β) = ln l(β) = Σ_{i=1..n} [ y_i ln π(x_i) + (1 - y_i) ln(1 - π(x_i)) ]

Maximum likelihood
• Iterative computing
  – Choose an arbitrary starting value for the coefficients (usually 0)
  – Compute the log-likelihood
  – Vary the coefficients' values
  – Reiterate until maximization (plateau)
• Results
  – Maximum likelihood estimates (MLE) for α and β
  – Estimates of P(y) for a given value of x

Why do we use logistic regression so much?
1. It predicts the likelihood of discrete outcomes
   1. Group membership
   2. Binary outcomes (disease/no disease)
2. Quite flexible statistical assumptions
   1. No assumptions are needed about the distributions of the predictor variables
   2. Predictors do not have to be normally distributed
   3. Predictors do not have to be linearly related to the outcome
   4. Predictors do not have to have equal variance within each group
3. It is very good at giving odds ratios

Construction of the model
1. Perform univariate statistics (outliers, distribution, gaps)
2. Perform univariate analysis
3. Transform nominal independent variables into dichotomous (dummy) variables

  Occupation   Nominal   Dummy Farmer   Dummy Nurse
  Farmer          1           1              0
  Housewife       2           0              0
  Physician       3           0              0
  Nurse           4           0              1
  Policeman       5           0              0

Construction of the model: multicollinearity
4. Run a correlation matrix. If any pair of independent variables is correlated at > 0.90 (multicollinearity), decide which one to keep and which one to exclude.
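The correlation screen in step 4 can be illustrated with a small Pearson-correlation sketch in pure Python (the data and variable names below are hypothetical):

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical predictors: weight in kg and the same weight in lb are
# perfectly correlated (r = 1 > 0.90), so only one should enter the model.
kg = [60, 72, 81, 95]
lb = [x * 2.2 for x in kg]
age = [35, 60, 41, 55]
print(round(pearson(kg, lb), 2))   # 1.0
print(pearson(kg, age) < 0.9)      # a weaker correlation is acceptable
```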
A more practical way is to consider the "biologic relation" (smoking, carrying matches, and cancer).

Diagnosing a confounder
A confounder (Variable B) distorts the association between Variable A and the outcome.
Example: Acinetobacter infection → fatality, confounded by the APACHE score.

How to choose variables?
1. Kitchen sink
2. Inclusion of significant variables
3. Forward selection
4. Backward selection

An example
Outcome: deep vein thrombosis (DVT)
Independent variables: heparin, gender, coronary heart disease, aspirin use

Y = a + b1x1 + b2x2 + b3x3 + b4x4
DVT = a + b1(heparin) + b2(female) + b3(CHD) + b4(aspirin)
(estimated odds ratios: 0.5, 1.5, 3, and 0.6, respectively)

Logistic regression                         Number of obs   =        10
                                            LR chi2(4)      =      0.97
                                            Prob > chi2     =    0.9141
Log likelihood = -6.2444702                 Pseudo R2       =    0.0722

----------------------------------------------------------------------------
        DVT | Odds Ratio   Std. Err.       z    P>|z|   [95% Conf. Interval]
------------+---------------------------------------------------------------
    Heparin |       .50        .023     -2.81   0.003        .15         .72
      Kadin |      1.48        1.08      0.01   0.504       .095       23.17
        KAH |      3.06         .03      2.36   0.009       1.34       12.37
    aspirin |       .58         .08      0.31   0.622        .46        1.03
----------------------------------------------------------------------------
(Kadin = female; KAH = coronary heart disease, in the Turkish-labeled output.)

                  OR      p-value     95% CI
  Heparin use    0.5      0.003       0.15-0.72
  Female         1.48     0.504       0.095-23.17
  CHD            3.06     0.009       1.34-12.37
  Aspirin use    0.58     0.622       0.46-1.03

Assessment of the model
Methods for measuring how well a model accounts for the outcome:

                                Multiple linear   Multiple logistic   Proportional
                                regression        regression          hazards analysis
  Model accounts for outcome    F test            Likelihood ratio    Likelihood ratio
  better than chance                              (LR) test           (LR) test
  Quantitative/qualitative      R2                R2 (rarely used)    -
  assessment of how well the
  model accounts for outcome
  Comparison of estimated to    -                 Hosmer-Lemeshow     LR and p value
  observed values                                                     of the test
  Prediction of outcome         NA                Sensitivity,        -
                                                  specificity,
                                                  accuracy, c index

For both logistic regression and proportional hazards analysis: if the chi-square of the LR test is large, the p value will be small, and the null hypothesis can be rejected.
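The iterative maximum-likelihood fitting described earlier can be sketched in pure Python, here as a Newton-Raphson update applied to the age/coronary-disease data shown in the logistic regression section (an illustrative implementation, not the algorithm of any particular package):

```python
import math

# Age / signs-of-coronary-disease data from the slides (CD: 1 = signs present)
ages = [22, 23, 24, 27, 28, 30, 30, 32, 33, 35, 38,
        40, 41, 46, 47, 48, 49, 49, 50, 51, 51, 52,
        54, 55, 58, 60, 60, 62, 65, 67, 71, 77, 81]
cd   = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0,
        0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0,
        0, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1]

def fit_logistic(x, y, iters=25):
    """Fit P(y|x) = e^(a + b*x) / (1 + e^(a + b*x)) by Newton-Raphson,
    maximizing the log-likelihood L = sum[y ln(p) + (1 - y) ln(1 - p)]."""
    a = b = 0.0                       # the usual starting values
    for _ in range(iters):
        p = [1 / (1 + math.exp(-(a + b * xi))) for xi in x]
        # Gradient of the log-likelihood
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum(xi * (yi - pi) for xi, yi, pi in zip(x, y, p))
        # Observed information matrix (negative Hessian)
        w = [pi * (1 - pi) for pi in p]
        i00 = sum(w)
        i01 = sum(xi * wi for xi, wi in zip(x, w))
        i11 = sum(xi * xi * wi for xi, wi in zip(x, w))
        det = i00 * i11 - i01 * i01
        # Newton step: add (information matrix)^-1 times the gradient
        a += (i11 * g0 - i01 * g1) / det
        b += (i00 * g1 - i01 * g0) / det
    return a, b

a, b = fit_logistic(ages, cd)
print(math.exp(b))  # e^b: the odds ratio for one extra year of age (> 1 here)
```

The fitted slope is positive, so the estimated probability of coronary disease rises with age, matching the dot-plot.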
R2
R2 is a quantitative measure of how well the independent variables account for the outcome. When R2 is multiplied by 100, it can be thought of as the percentage of the variance in the dependent variable explained by the independent variables.

Goodness of fit
r2 is between 0 and 1:
  r2 = 1     perfect fit
  r2 > 0.8   pretty good fit
  r2 ≈ 0.3   what we usually see!
  r2 ≈ 0     poor fit (x does not add any information about y)

Interactions
How can interactions be taken into consideration? Let X and Y be two explanatory variables; then include X·Y in the model. If X is a categorical variable with dummy variables (X1, ..., Xm-1), then X·Y = (X1·Y, ..., Xm-1·Y). Through interactions, the multiplicative character of the OR is changed.

SURVIVAL ANALYSIS
Time to event; time to death.

Exposure and outcome
[Figure: follow-up timelines comparing an exposed and an unexposed group]

Person-time
[Figure: follow-up of three subjects from January to June]

  Subject   Time at risk
  A         3 months
  B         6 months
  C         2 months
  Total person-time: 3 + 6 + 2 = 11 months

The reason for survival analysis
• Censored cases: during the follow-up,
  – the expected outcome did not happen for the case, or
  – the case was lost to follow-up or dropped out.
• All the patients do not need to start together.

Survival analysis
• Life table
• Kaplan-Meier curve
• Cox regression

Kaplan-Meier curve
• Duration = time to event (time until an event occurs)
• "Event" = death, survival, relapse, ...
• The log-rank test compares 2 curves statistically.

[Figure: Kaplan-Meier curves for treatment vs. placebo; the number of patients falls from 50 to 49, 44, 42, 40, and 39 over 8 months of follow-up]

Cox regression
Cox regression output is read like logistic regression output, but it reports a hazard ratio instead of an odds ratio:

Cox regression                              Number of obs   =        10
                                            LR chi2(4)      =      0.97
                                            Prob > chi2     =    0.9141
Log likelihood = -6.2444702                 Pseudo R2       =    0.0722

----------------------------------------------------------------------------
        DVT | Hazard Ratio  Std. Err.       z    P>|z|  [95% Conf. Interval]
------------+---------------------------------------------------------------
    Heparin |       .50        .023     -2.81   0.003        .15         .72
      Kadin |      1.48        1.08      0.01   0.504       .095       23.17
        KAH |      3.06         .03      2.36   0.009       1.34       12.37
    aspirin |       .58         .08      0.31   0.622        .46        1.03
----------------------------------------------------------------------------
(Kadin = female; KAH = coronary heart disease.)
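The Kaplan-Meier estimate behind such survival curves can be sketched in a few lines of Python (the follow-up data below are hypothetical):

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimator.

    times  : follow-up time for each subject
    events : 1 if the event (e.g. death) occurred, 0 if censored
    Returns one (time, survival) pair per distinct event time:
    S(t) = product over event times t_i <= t of (1 - d_i / n_i),
    where d_i events occur among n_i subjects still at risk.
    """
    pairs = sorted(zip(times, events))
    at_risk, s, curve = len(pairs), 1.0, []
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        d = c = 0                      # events and censorings at time t
        while i < len(pairs) and pairs[i][0] == t:
            d += pairs[i][1]
            c += 1 - pairs[i][1]
            i += 1
        if d:
            s *= 1 - d / at_risk
            curve.append((t, s))
        at_risk -= d + c               # remove events and censored subjects
    return curve

# Hypothetical follow-up of five patients: (months, event indicator)
km = kaplan_meier([1, 2, 2, 3, 4], [1, 1, 0, 1, 0])
print([(t, round(s, 3)) for t, s in km])  # [(1, 0.8), (2, 0.6), (3, 0.3)]
```

Note how the censored patient at month 2 still counts as "at risk" at that time but simply drops out of the denominator afterward; this is exactly why censored cases motivate survival analysis instead of a simple proportion.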