STA 107: Logistic Regression and Categorical Data Analysis
Lecturer: Dr. Daisy Dai
Department of Medical Research

Contents
• Binary Logit Analysis
• Simple Logistic Regression
• Multiple Logistic Regression
• Stepwise or Backward Model Selections
• Collinearity

Categorical Data Analysis
• Binomial Test
• Chi-square Test
• Fisher's Exact Test
• McNemar's Test
• Cochran-Mantel-Haenszel Test

Binomial test
Makes inferences about a proportion of binary outcomes by comparing the observed proportion, and its confidence interval, with a target value.

Null hypothesis: $H_0: p = p_0$
Alt. hypothesis: $H_A: p \neq p_0$
Test statistic: $Z = \dfrac{|\hat{p} - p_0| - \frac{1}{2n}}{\sqrt{p_0 (1 - p_0) / n}}$
Decision rule: reject $H_0$ if $|Z| > Z_{\alpha/2}$

Case Study: Genital Wart
• A company markets a therapeutic product for genital warts with a known cure rate of 40% in the general population. In a study of 25 patients with genital warts treated with this product, patients were also given high doses of vitamin C. As shown in the table below, 14 patients were cured. Is this consistent with the cure rate in the general population?

Treatment to Genital Wart
ID   Effectiveness   ID   Effectiveness
1    YES             15   YES
2    NO              16   NO
3    YES             17   NO
4    NO              18   YES
5    YES             19   YES
6    YES             20   NO
7    NO              21   YES
8    YES             22   YES
9    NO              23   NO
10   NO              24   YES
11   YES             25   YES
12   NO
13   YES
14   NO

Results
• 64% (16/25) of patients were cured by the treatment.
• The 95% confidence interval extends from 44% to 80%.
• If the probability of "success" in each trial or subject is 0.300, then the chance of observing 16 or more successes in 25 trials is 0.045 (p-value).
• The cure rate of genital warts with the experimental therapy was significantly higher than 30%.

Fisher's Exact Test
A conservative non-parametric test of the relationship between two categorical variables. The groups being compared should be independent.
            Responders   Non-responders   Total
Group 1     N11          N12              N11+N12
Group 2     N21          N22              N21+N22
Combined    N11+N21      N12+N22          N

Case Study: CHF Incidence
A new adenosine-releasing agent (ARA), thought to reduce side effects in patients undergoing coronary artery bypass surgery (CABG), was studied in a pilot trial.

            CHF        No CHF   Total
ARA         2 (6%)     33       35
Placebo     5 (20%)    20       25
Combined    7          53       60

Fisher's exact test: p=0.0455

Chi-square test
Tests the relationship between two categorical variables. Groups should be independent. The chi-square test assumes that the expected value for each cell is five or higher.

Null hypothesis: $H_0: p_1 = p_2$
Alt. hypothesis: $H_A: p_1 \neq p_2$
Test statistic: $\chi^2 = \dfrac{NUM^2}{DEN}$
Decision rule: reject $H_0$ if $\chi^2 > \chi^2_1(\alpha)$

Case Study: ADR Frequency with Antibiotic Treatment
A study was conducted to monitor the incidence of GI adverse drug reactions of a new antibiotic used in lower respiratory tract infections.

                          Responders   Non-responders   Total
Test (new antibiotic)     22 (33%)     44               66
Control (erythromycin)    28 (54%)     24               52
Combined                  50 (42%)     68               118

Chi-square test: p=0.0252; Fisher's exact test: p=0.0385

McNemar's test
Compares response rates in binary data between two related populations. It is analogous to the chi-square test or Fisher's exact test for independent populations.

                            After
Before            Responders   Non-responders
Responders        A            B
Non-responders    C            D

Null hypothesis: $H_0: p_1 = p_2$
Alt. hypothesis: $H_A: p_1 \neq p_2$
Test statistic: $\chi^2 = \dfrac{(B - C)^2}{B + C}$
Decision rule: reject $H_0$ if $\chi^2 > \chi^2_1(\alpha)$

Case Study: Bilirubin
A study was conducted to evaluate the toxic side effects of an experimental therapy. Patients (n=86) were treated with the experimental drug for 3 months. The clinical laboratory measured each patient's bilirubin level at baseline and 3 months after therapy.

Results of McNemar's Test
                            After
Before             Normal   Abnormally high
Normal             60       14
Abnormally high    6        6

• At baseline, 14% (12/86) of patients had an abnormally high bilirubin level.
• At 3 months post treatment, 23% (20/86) of patients had an abnormally high bilirubin level.
• P-value = 0.1175
• Odds ratio = 2.3; 95% CI: 0.8 - 7.4
• There is not enough evidence to conclude that the treatment increases the risk of high bilirubin.

Cochran-Mantel-Haenszel (CMH) Test
• The Cochran-Mantel-Haenszel test is a method to compare the probability of an event among independent groups in stratified samples.
• The stratification factor can be study center, gender, race, age group, obesity status or disease severity. These underlying sub-populations can be confounding factors that affect the associations between risk factors and the outcome variables.

Null hypothesis: $H_0: p_1 = p_2$
Alt. hypothesis: $H_A: p_1 \neq p_2$
Test statistic: $\chi^2_{CMH} = \dfrac{\left( \sum_{j=1}^{k} NUM_j \right)^2}{\sum_{j=1}^{k} DEN_j}$
Decision rule: reject $H_0$ if $\chi^2_{CMH} > \chi^2_1(\alpha)$

Case Study: Diabetic Ulcers
• A multi-center study with 4 centers is testing an experimental treatment, Dermotel, used to accelerate the healing of dermal foot ulcers in diabetic patients. Sodium hyaluronate was used in a control group. Patients whose ulcer size, by surface area measurement, decreased by at least 90% after 20 weeks of treatment were considered 'responders'. The numbers of responders in each group are shown in Table 19.2 for each study center. Is there an overall difference in response rates between the Dermotel and control groups?

Response Frequencies by Study Center
Study Center   Treatment Group   Response    Non-Response
1              Dermotel          26 (87%)    4
1              Control           18 (62%)    11
2              Dermotel          8 (73%)     3
2              Control           7 (58%)     5
3              Dermotel          7 (58%)     5
3              Control           4 (40%)     6
4              Dermotel          11 (65%)    6
4              Control           9 (64%)     5

• The interest in this study is to compare the response rates of the two treatments. Because the study was conducted in four centers, there is a concern that study center may influence the response rate.
By including the study center, the researcher can examine associations between the treatment and the response rate while adjusting (controlling) for the effect of study center.
• The Cochran-Mantel-Haenszel test assumes a common odds ratio and tests the null hypothesis that the explanatory variable X (treatment) and the outcome variable Y (response) are conditionally independent, given the control variable Z (study center). In other words, CMH tests whether the response is conditionally independent of the explanatory variable when adjusting for the control variable.
• One can also measure the average conditional association between the explanatory variable (treatment) and the response variable by calculating the common odds ratio conditioned on the control variable (study center).

Results
                      Response Rates
Study Center   Active (n)    Control (n)   Chi-Square   p-Value
1              86.7% (30)    62.1% (29)    4.706        0.030*
2              72.7% (11)    58.3% (12)    0.524        0.469
3              58.3% (12)    40.0% (10)    0.733        0.392
4              64.7% (17)    64.3% (14)    0.001        0.981
Overall        74.3% (70)    58.5% (65)    4.039        0.044*
*P-value from CMH test

Chi-Square Test, Ignoring Strata
Group     Response       Non-Response   Total
Active    52 (74.3%)     18             70
Control   38 (58.5%)     27             65
Total     90 (100.0%)    45             135

Chi-square value = 3.798, p = 0.051 (p. 313)

Counter-Intuitive Combined Results
Stratum    Group   Responders   Non-Responders   Total   Response Rate
1          A       10           38               48      21%
1          B       4            21               25      16%
2          A       20           10               30      67%
2          B       27           17               44      61%
Combined   A       30           48               78      38%
Combined   B       31           38               69      45%

Logistic Regression
• Logistic regression is a method to identify the associations between a categorical outcome variable and explanatory variables.
• In most cases, the outcome variable is dichotomous. The explanatory variables can be categorical or continuous. The probability of the outcome can be predicted from the values of the explanatory variables.
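As a minimal sketch of the last point, the snippet below turns a set of explanatory-variable values into a predicted probability under the logit model log(P/(1-P)) = a + b1*x1 + b2*x2 + …. The function name `predicted_probability` and the coefficient values are hypothetical and chosen only for illustration; they are not estimated from any data in this lecture.

```python
import math

def predicted_probability(intercept, coefs, xs):
    """Probability of the event when log(P/(1-P)) = a + b1*x1 + b2*x2 + ..."""
    linear_predictor = intercept + sum(b * x for b, x in zip(coefs, xs))
    # Invert the logit: P = 1 / (1 + exp(-(a + b1*x1 + ...)))
    return 1.0 / (1.0 + math.exp(-linear_predictor))

# Hypothetical coefficients, for illustration only (not fitted to any data)
a = -1.0
b = [0.8, -0.05]       # b1 for a binary covariate, b2 for a continuous covariate
patient = [1, 12.0]    # x1 = 1, x2 = 12.0

p = predicted_probability(a, b, patient)
log_odds = math.log(p / (1 - p))   # recovers the linear predictor a + b1*x1 + b2*x2

print(f"Predicted probability: {p:.3f}")
print(f"Log odds:              {log_odds:.3f}")
```

Taking the log odds of the predicted probability returns the linear predictor, which is why the coefficients of a logistic model are read on the log-odds scale.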
Dichotomous outcome variable ← explanatory variables
Log(P/(1-P)) = a + b1*x1 + b2*x2 + b3*x3 + b4*x4 + …

Odds Ratio
• Let Y be the dichotomous variable where y=1 indicates an event and y=0 indicates no event.
• Odds = probability of an event / probability of no event = P(Y=1)/P(Y=0) = P(Y=1)/(1-P(Y=1))
• Odds Ratio = Odds in the Test Group / Odds in the Control Group
• Logistic Model: Log(Odds of an event) ← explanatory variables
• Log(odds) = a + b1*x1 + b2*x2 + b3*x3 + b4*x4 + …

Case Study: CHF Incidence
• A new adenosine-releasing agent (ARA), thought to reduce side effects in patients undergoing coronary artery bypass surgery (CABG), was studied in a pilot trial.

Odds of CHF incidence in the ARA group = (2/35)/(33/35) = 2/33 = 0.06.
Odds of CHF incidence in the Placebo group = (5/25)/(20/25) = 5/20 = 0.25.
Odds Ratio = Odds in the ARA group / Odds in the Placebo group = (2/33)/(5/20) = 0.24
The odds of CHF incidence in the ARA group are only 24% of the odds in the Placebo group.

            CHF         No CHF   Total
ARA         2 (5.7%)    33       35
Placebo     5 (20%)     20       25
Combined    7           53       60

Fisher's exact test: p=0.0455

Properties of Odds Ratio
• The odds ratio is non-negative.
• If the odds ratio < 1, then the risk is smaller than in the control group.
• If the odds ratio > 1, then the risk is larger than in the control group.
• Odds ratio of no event = 1 / odds ratio of an event.
• One can calculate the confidence interval of an odds ratio. The confidence interval of a significant odds ratio does not contain 1.

Case Study: CHF Incidence
A new adenosine-releasing agent (ARA), thought to reduce side effects in patients undergoing coronary artery bypass surgery (CABG), was studied in a pilot trial.
• Odds ratio for ARA versus Control = (2/33)/(5/20) = 0.24 < 1, so the risk of CHF incidence in the ARA group is relatively smaller.
• One can also calculate the odds ratio for Control versus ARA as 1/0.24 = 4.1 > 1, which indicates that the odds of CHF in the Placebo group are 4.1 times the odds in the ARA group.
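The odds-ratio arithmetic in this case study can be sketched in a few lines. The helper name `odds` is hypothetical; the counts are the CHF and no-CHF counts given for the ARA and Placebo groups.

```python
def odds(events, non_events):
    """Odds of an event = P(event) / P(no event) = events / non_events."""
    return events / non_events

# CHF case study counts: (CHF, no CHF) per group
ara_chf, ara_no_chf = 2, 33
placebo_chf, placebo_no_chf = 5, 20

odds_ara = odds(ara_chf, ara_no_chf)              # 2/33
odds_placebo = odds(placebo_chf, placebo_no_chf)  # 5/20

or_ara_vs_placebo = odds_ara / odds_placebo       # ARA relative to Placebo
or_placebo_vs_ara = 1 / or_ara_vs_placebo         # reciprocal: Placebo relative to ARA

print(f"Odds of CHF, ARA group:        {odds_ara:.3f}")
print(f"Odds of CHF, Placebo group:    {odds_placebo:.3f}")
print(f"Odds ratio, ARA vs Placebo:    {or_ara_vs_placebo:.2f}")
print(f"Odds ratio, Placebo vs ARA:    {or_placebo_vs_ara:.2f}")
```

Swapping which group is the reference simply inverts the odds ratio, so the two directions (about 0.24 and about 4.1) carry the same information.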
            CHF         No CHF   Total
ARA         2 (5.7%)    33       35
Placebo     5 (20%)     20       25
Combined    7           53       60

Fisher's exact test: p=0.0455

Logistic Probability Curve
• Log(p/(1-p)) = a + bx
• p/(1-p) = exp(a + bx)
• p = 1/(1 + exp(-a - bx))
(Figure: logistic curve, Probability plotted against X)

Logistic Regression vs. Linear Regression
In common: in regression we are looking for a dependence of one variable, the dependent variable, on others, the independent variable(s).
• In linear regression the dependent variable is continuous. The relationship is summarized by a regression equation consisting of a slope and an intercept. The slope represents the amount by which the dependent variable increases with a unit increase in the independent variable, and the intercept represents the value of the dependent variable when the independent variable takes the value zero.
• In logistic regression the dependent variable is binary. The slope represents the change in log odds for a unit increase in the independent variable.
• In multiple regression we are interested in the simultaneous relationship between one dependent variable and a number of independent variables.
(Figure: scatter plot of Hb against Age, grouped by Menopause (No/Yes); R Sq Linear = 0.774)

Case Study: Relapse Rate in AML
One hundred and two patients with acute myelogenous leukemia (AML) in remission were enrolled in a study of a new antisense oligonucleotide (asODN). The patients were randomly assigned to receive a 10-day infusion of asODN or no treatment (Control), and the effects were followed for 90 days. The time of remission from diagnosis or prior relapse (X, in months) at study enrollment was considered an important covariate in predicting response. The response data are shown below, with Y=1 indicating relapse, death, or major intervention, such as bone marrow transplant, before Day 90. Is there any evidence that administration of asODN is associated with a decreased relapse rate?
AML Data (p. 323)

asODN Group
Patient    1   2   4   6   7  10  11  14  15  17  20  21  22  25  26  28  29
X          3   3   3   6  15   6   6   6  15  15  12  18   6  15   6  15  12
Y          0   1   1   1   0   1   1   1   0   0   0   0   1   0   1   0   1

Patient   32  33  36  39  42  44  46  49  50  52  54  56  58  60  62  63  66
X          9   6   6   6   6   3  18   9  12   6   9   9   3   9  12  12   3
Y          0   1   0   0   0   1   0   0   1   0   1   1   0   1   0   0   0

Patient   67  69  71  73  74  77  79  81  83  85  88  90  92  94  95  98  99 102
X         12  12  12   9   6  12   6  15   9   3   9   9   9   9   9  12   3   6
Y          0   0   0   1   1   0   0   1   0   1   0   0   0   0   1   1   1   1

Control Group
Patient    3   5   8   9  12  13  16  18  19  23  24  27  30  31  34  35  37
X          9   3  12   3   3  15   9  12   3   9  15   9   6   9   6  12   9
Y          1   0   1   1   1   1   1   1   1   1   1   1   1   1   1   0   0

Patient   38  40  41  43  45  47  48  51  53  55  57  59  61  64  65  68  70
X         15  15   9   9  12   3   6   6  12  12  12   3  12   3  12   6   6
Y          0   1   0   0   1   1   1   1   0   0   1   1   1   1   1   1   1

Patient   72  75  76  78  80  82  84  86  87  89  91  93  96  97 100 101
X          9  15  15  12   9  12  15  18  12  15  15  15  18  18  18  18
Y          1   0   0   0   0   0   0   1   0   1   0   0   0   1   0   0

Effect Selection Methods
Statistical model selection facilitates the selection and screening of explanatory variables from a set of candidate variables. The commonly used model selection methods include:
• Backward selection: start with all candidate variables, test them one by one for statistical significance, and delete any that are not significant.
• Forward selection: start with no variables in the model, try out the variables one by one, and include them if they are statistically significant.
• Stepwise selection: a combination of both methods. Select the most significant variable from the candidate pool, and remove it if it is not significant in the joint model. Repeat this process step by step for all remaining variables.

(Figure: Flow Chart of Forward Selection)

Multicollinearity
• Multicollinearity occurs when two or more explanatory variables in a multiple regression model are highly correlated. In other words, there are redundant explanatory variables in the multiple regression model.
• Multicollinearity can cause problematic estimates of the individual effects. A high degree of multicollinearity can also cause computer software packages to be unable to perform the matrix inversion that is required for computing the regression coefficients, or it may make the results of that inversion inaccurate.
• Note that in statements of the assumptions underlying regression analyses such as ordinary least squares, the phrase "no multicollinearity" is sometimes used to mean the absence of perfect multicollinearity, which is an exact (non-stochastic) linear relation among the regressors.

Detection of Multicollinearity
• Large changes in the estimated regression coefficients when a predictor variable is added or deleted.
• Tests of the individual effects of the affected variables are not significant, but a global test of the overall model is significant (using an F-test).
• Use the variance inflation factor (VIF) to detect multicollinearity: regress an explanatory variable on all the other explanatory variables. A high coefficient of determination, r2, indicates that the regressed explanatory variable is highly correlated with the other explanatory variables. Tolerance = 1 - r2, and VIF = 1/tolerance. A tolerance of less than 0.20 or 0.10, or a VIF of 5 or 10 and above, indicates a multicollinearity problem.

Caveats
• Sometimes logistic regression is carried out after a dependent variable has been dichotomized. It is important that the cut point is not derived by direct examination of the data, for example to find a 'gap' in the data which maximizes the discrimination between the selected groups, as this can lead to biased results. It is best if there are a priori grounds for choosing a particular cut point.

References
• Common Statistical Methods for Clinical Research, 2nd Edition, by Glenn Walker
• Logistic Regression Using the SAS System, by Paul Allison
• Medical Statistics, by Campbell et al.