SPECIAL CONTRIBUTIONS

Advanced Statistics: Up with Odds Ratios! A Case for Odds Ratios When Outcomes Are Common

Thomas D. Cook, PhD

Abstract. Treatment comparisons from clinical studies involving dichotomous outcomes are often summarized using risk ratios. Risk ratios are typically used because the underlying statistical model is often consistent with the underlying biological mechanism of the treatment and because they are easily interpretable. The use of odds ratios to summarize treatment effects has been discouraged, especially in studies in which outcomes are common, largely because odds ratios differ from risk ratios and are frequently interpreted incorrectly as risk ratios. In this article, the author contends that risk ratios can themselves be easily misinterpreted and that, in many cases, odds ratios should be preferred, especially in studies in which outcomes are common.

Key words: odds ratios; risk ratios; statistics; differences; outcomes. ACADEMIC EMERGENCY MEDICINE 2002; 9:1430–1434.

In a 1999 article, Schwartz and colleagues1 comment on the reporting of a study of the effect of gender and race on physicians' recommendations for cardiac catheterization.2 They contend that, as reported, the differences between African Americans and whites, and between men and women, in rates of referral for cardiac catheterization were overstated. (Schwartz et al. indicate other reasons why this finding may be misleading, but they are not relevant to this discussion. As the authors suggest, formal comparisons of rates in this setting may not even be meaningful.)1 Schwartz et al. blame this overstatement, in part, on the use of odds ratios (ORs) to summarize the results and argue against such use. Additionally, a number of other articles have appeared in recent years discouraging the use of ORs for reporting the results of medical studies,3–9 especially when outcomes are common.
Very little has been published since then to refute this recommendation.10 As is argued below, in the study cited above, and when properly interpreted, ORs may be the most meaningful summary measures of the differences observed.

As an illustration, consider the results of a recent study of diaspirin cross-linked hemoglobin (DCLHb) in patients suffering from severe traumatic hemorrhagic shock.11 (For this study, the effect of treatment, whether overall, within subgroups, or covariate-adjusted, was reported using ORs.) Overall mortality in patients receiving DCLHb (see the first two columns of Table 1) was 46.2% (24/52), and mortality in patients receiving normal saline (control) was 17.4% (8/46). The risk ratio (RR) is the ratio of these mortality rates: RR = 46.2%/17.4% = 2.65. The odds of mortality, in contrast, are the ratio of the mortality rate to the survival rate or, equivalently, the ratio of the number of deaths to the number of survivors. In the DCLHb study, the odds are 24/28 = 0.857 and 8/38 = 0.211 in the DCLHb and control groups, respectively. The odds ratio is the ratio of the odds in the two groups: OR = 0.857/0.211 = 4.07. These measures are numerically quite different and, hence, must be interpreted differently. As discussed below, each requires careful consideration and each can be easily misinterpreted.

Criticisms of ORs fall principally into two categories: 1) ORs are not as intuitive as RRs and, therefore, are difficult to understand and easily misinterpreted and misapplied; and 2) ORs often differ significantly from RRs. Arguments of the first category are important, but they suffer from a major flaw: risk ratios may seem intuitive and easily applied, but they are easily misapplied, and the conclusions drawn from their use may be inappropriate. An intuitive, easily understood summary measure is worthwhile only to the extent that it results in valid conclusions. Arguments in the second category appear to be based implicitly on two assumptions.
The first is that the most appropriate summary of differences between groups is the RR and that this measure should be reported whenever possible. The second is that, since ORs, especially when the underlying risk is high, are more extreme than RRs (larger than the RR when RR > 1 and smaller than the RR when RR < 1), they overstate the differences between treatment groups. Again, a case can be made that precisely the opposite is true: when they differ, RRs actually understate treatment differences. The purpose of this article is to argue that in many cases the OR is a more appropriate summary measure that can be applied to a broader population of patients than the RR. In such cases ORs should be preferred, especially when ORs and RRs differ, i.e., when outcomes are common.

From the Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI (TDC). Received October 8, 2001; revision received May 7, 2002; accepted July 17, 2002. Series editor: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor–UCLA Medical Center, Torrance, CA. Address for correspondence and reprints: Thomas D. Cook, PhD, 209 WARF Building, 610 Walnut Street, Madison, WI 53705. Fax: 608-263-0415; e-mail: [email protected].

ACAD EMERG MED • December 2002, Vol. 9, No. 12 • www.aemj.org

TABLE 1. Mortality in the Diaspirin Cross-linked Hemoglobin (DCLHb) Study11 Overall and by Baseline Predicted Probability of Death Using the TRISS Method

TRISS-predicted
Probability of Survival              Dead (Mortality)   Alive   Total
Overall       DCLHb                  24 (46.2%)         28      52
              Control                 8 (17.4%)         38      46
80%–100%      DCLHb                   5 (21.7%)         18      23
              Control                 1 (4.5%)          21      22
20%–80%       DCLHb                   5 (38.5%)          8      13
              Control                 1 (8.3%)          11      12
0%–20%        DCLHb                  12 (92.3%)          1      13
              Control                 6 (60.0%)          4      10

Note that five patients had insufficient baseline data upon which to compute a TRISS score.

Notwithstanding the errors in the interpretation of the results reported by Schulman et
al.,2 there is no evidence that, in practice, errors resulting from the misinterpretation of ORs are more frequent than errors resulting from the misinterpretation of RRs. We suggest, and illustrate below, that practitioners who are likely to misinterpret or misuse ORs are also likely to misinterpret or misuse RRs.

RISK RATIOS VERSUS ODDS RATIOS

Given an outcome of interest (considered a failure), the risk of failure is the probability that a patient will experience failure. For a given population, the risk is usually estimated by the proportion of the population observed to fail. It is important to keep in mind, however, that there is likely to be variation in risk within the population. The observed population risk is actually an average of the risks for the individuals in the population and, therefore, the average risk may not necessarily apply to individuals within the population. Again, this can be illustrated by considering data from the DCLHb study shown in Table 1. Three subpopulations are defined by the probability of survival using the TRISS method.12 We consider a low-risk group (45 patients), a middle-risk group (25 patients), and a high-risk group (23 patients). Note that five patients had insufficient baseline data to compute the TRISS score.

Now, given two groups of patients, for example, treated and control, the (unadjusted) RR is the ratio of the risks in the two groups. As above, because the average risk does not necessarily represent the risk for any particular individual, the RR calculated using the average risk may not represent the RR for any particular individual. Assuming that the two groups are balanced with respect to underlying patient risk (no confounding), the aggregate unadjusted RR will apply to individuals only if there is a common RR over the population (the homogeneous-RR assumption). Understanding this fact is critical to the correct application of RRs in practice.
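As a quick check, the overall and subgroup RRs and ORs quoted in this article can be reproduced from the Table 1 counts with plain arithmetic. The sketch below (Python; an illustration, not part of the original analysis) matches the unadjusted figures reported in Table 2.

```python
# Risk ratios and odds ratios recomputed from the Table 1 counts,
# given as (deaths, total) for the DCLHb and control groups.

def risk_ratio(d1, n1, d0, n0):
    """(d1/n1) / (d0/n0): ratio of the two mortality rates."""
    return (d1 / n1) / (d0 / n0)

def odds_ratio(d1, n1, d0, n0):
    """[d1/(n1-d1)] / [d0/(n0-d0)]: ratio of the two odds of death."""
    return (d1 / (n1 - d1)) / (d0 / (n0 - d0))

groups = {
    "overall":               (24, 52, 8, 46),
    "low risk (80%-100%)":   (5, 23, 1, 22),
    "middle risk (20%-80%)": (5, 13, 1, 12),
    "high risk (0%-20%)":    (12, 13, 6, 10),
}
for name, g in groups.items():
    print(f"{name}: RR = {risk_ratio(*g):.2f}, OR = {odds_ratio(*g):.2f}")
```

The subgroup RRs (4.78, 4.62, 1.54) scatter widely around the overall RR of 2.65, while the subgroup ORs (5.83, 6.88, 8.00) are far less dispersed, which is the heterogeneity discussed next.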
Considering Table 2, the overall observed RR of 2.65 likely does not represent the RR for any of the subgroups, especially the high-risk group (it is outside the 95% confidence interval for the RR for this group). It is also well below the observed RR in the other two groups (although well within the corresponding confidence intervals). These differences suggest that the homogeneity-of-RR assumption does not hold. (This example is primarily for purposes of illustration, and no attempt at statistical inference is intended. Because of the relatively small numbers of patients in this study, observed differences among groups, here and in what follows, may not reach statistical significance. This fact should have no bearing on the principles being illustrated.)

TABLE 2. Risk Ratios (RRs) and Odds Ratios (ORs) in the Diaspirin Cross-linked Hemoglobin (DCLHb) Study11 Overall and by Baseline Predicted Probability of Death Using the TRISS Method

TRISS-predicted
Probability of Survival     RR     95% CI           OR     95% CI
Overall                    2.65   (1.32, 5.32)     4.07   (1.59, 10.4)
80%–100%                   4.78   (0.61, 37.7)     5.83   (0.62, 54.7)
20%–80%                    4.62   (0.63, 34.1)     6.88   (0.67, 70.8)
0%–20%                     1.54   (0.91, 2.61)     8.00   (0.73, 88.2)
TRISS-adjusted             2.07   (1.22, 4.50)     7.15   (2.18, 23.5)

For the TRISS-adjusted RR, the confidence interval (CI) was computed using a bootstrap method.

In contrast, the odds of failure are the ratio of the failure probability to the success probability. In a population, the odds can usually be estimated by the number (or proportion) observed to fail divided by the number (or proportion) observed not to fail. While the risk must be between 0 and 100%, the odds can be any number greater than or equal to zero. Given two groups of patients, the (unadjusted) OR is the ratio of the odds of failure in the two groups. The OR is always more extreme (further from 1) than the RR, but when the risks are small (less than, say, 10%), the RR and the OR will roughly agree.
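The article does not state how the unadjusted intervals in Table 2 were obtained (only the bootstrap for the TRISS-adjusted RR is mentioned). Assuming the standard log-scale Wald construction, a common default for 2x2 tables, the overall row of Table 2 is reproduced almost exactly:

```python
import math

def rr_with_ci(d1, n1, d0, n0, z=1.96):
    """Risk ratio with a log-scale Wald 95% CI (an assumed method)."""
    rr = (d1 / n1) / (d0 / n0)
    se = math.sqrt(1/d1 - 1/n1 + 1/d0 - 1/n0)  # SE of log(RR)
    return rr, math.exp(math.log(rr) - z * se), math.exp(math.log(rr) + z * se)

def or_with_ci(d1, n1, d0, n0, z=1.96):
    """Odds ratio with a log-scale Wald 95% CI (an assumed method)."""
    a, b, c, d = d1, n1 - d1, d0, n0 - d0      # the four 2x2 cell counts
    orr = (a * d) / (b * c)
    se = math.sqrt(1/a + 1/b + 1/c + 1/d)      # SE of log(OR)
    return orr, math.exp(math.log(orr) - z * se), math.exp(math.log(orr) + z * se)

print(rr_with_ci(24, 52, 8, 46))  # roughly (2.65, 1.32, 5.32)
print(or_with_ci(24, 52, 8, 46))  # roughly (4.07, 1.60, 10.4)
```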
As the underlying risks increase, however, the difference between the RR and the OR can become quite large. As with risk, the true odds of failure may vary significantly among members of the population and, again, the unadjusted OR may not represent the OR for any particular individual. The principal benefit of ORs is that the assumption of homogeneity of ORs is more tenable than the assumption of homogeneity of RRs, and thus it is more likely that an estimate of the OR can be reliably applied to all individuals within a population.

We now illustrate how the use of RRs can be misleading. Note that the overall RR of 2.65 can be expressed by the following statement:

Statement 1: Mortality in patients receiving DCLHb was 2.65 times higher than in patients receiving normal saline.

We first note two things regarding statement 1. First, it is a summary of the aggregate results observed in this study, neither inferring causation nor explicitly quantifying the effect of DCLHb. Second, as discussed above, it is a statement about the population under study in aggregate, and it does not directly apply to individual patients. Given statement 1, it may be natural for a reader to infer a statement such as the following:

Statement 2: DCLHb increases the risk of death 2.65 times relative to normal saline.

Statement 2 differs from statement 1 in two immediate ways. First, it directly addresses the effect of DCLHb, and second, it makes sense only when applied to individual patients. It also differs from statement 1 in that it is false, at least to the extent that it does not hold for a significant number of patients. In particular, it cannot hold for those patients whose underlying (saline) risk is above 38% (estimated to be about 16% of the study population based on TRISS-predicted survival probabilities). For these patients it is nonsensical to suggest that they would have a risk of more than 100% (38% × 2.65 = 101%) if given DCLHb.
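The contrast between applying an RR and applying an OR to an individual baseline risk can be made concrete. In this sketch (illustrative only), the RR is applied by multiplying the risk directly, while the OR is applied on the odds scale and converted back to a probability, which can never exceed 100%:

```python
# Applying the aggregate RR vs. the aggregate OR to an individual
# baseline (saline) risk p0. Multiplying a risk by an RR can exceed
# 100%; multiplying odds by an OR cannot.

def risk_under_rr(p0, rr):
    return p0 * rr                      # may exceed 1.0

def ris_or_odds(p0):
    return p0 / (1 - p0)                # convert a risk to odds

def risk_under_or(p0, or_):
    odds = or_ * risk_or_odds(p0)       # scale the odds, not the risk
    return odds / (1 + odds)            # convert back to a probability

def risk_or_odds(p0):
    return p0 / (1 - p0)

p0 = 0.38
print(risk_under_rr(p0, 2.65))  # about 1.007: an impossible "risk" above 100%
print(risk_under_or(p0, 4.07))  # about 0.71: a valid probability
```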
It is also likely to be false for patients with very low risk. In fact, for low-risk patients (about half of the DCLHb study population has a TRISS-predicted baseline risk below 6%), the OR is a good approximation to the RR. In addition, to the extent that the aggregate (population) OR reflects the common individual OR (assuming that a common OR exists), the RR for low-risk patients is probably more accurately estimated by the crude OR of 4.07. In reality, and as suggested by Table 2, given the heterogeneity of the population, the crude OR most likely underestimates the true OR, but the crude OR is sufficient for this discussion. Indeed, the observed RR for the low-risk group in Table 2 is 4.78. Thus, statement 2 is likely to hold for only a relatively small number of patients. A reader who is likely to misinterpret ORs is also likely to believe, incorrectly, that statement 2 follows directly from statement 1.

Clearly, without additional consideration of the underlying assumptions (in particular, the assumption of a homogeneous RR), RRs can be easily misinterpreted, obscuring the actual effect of the treatment. In cases where there is large variation in risk, RRs can be uninterpretable. In contrast, the ORs computed for the three risk categories are not too dissimilar, and the assumption of homogeneity of ORs is quite reasonable. Under this assumption, the TRISS-adjusted OR of 7.15 shown in Table 2 may be used as the estimate of a common OR for all patients in the study. Furthermore, this may represent the most reliable estimate of the RR for the lowest-risk patients. (Note that the estimate in the low-risk group of 4.78 is based on only one death in the saline group and has a much wider confidence interval.) Given that the homogeneous-RR assumption cannot possibly hold, the TRISS-adjusted RR shown in Table 2 is probably meaningless, especially since the predicted risk for a number of patients under this model is greater than 100%.
This leads to a second issue: RRs may not adequately summarize the effect of treatment when outcomes are common. To illustrate, consider subgroups of patients defined by baseline risk in the DCLHb study. From Table 2, in the high-risk group (predicted survival probability < 20%), the observed saline (control) mortality rate is 60.0% and the observed DCLHb mortality rate is 92.3%. (Recall that in this study, treatment was observed to have an adverse effect on mortality.) Thus, the observed RR in this subgroup is 1.54 and the observed OR is 8.0. Given the high control group rate, the largest possible observed RR (assuming 60.0% mortality in the control group and 100% mortality in the DCLHb group) is 1.67 (= 100%/60.0%). On the other hand, the observed RR in the low-risk group is 4.78 and the observed OR is 5.83.

This brings us to a potential contradiction arising from the use of RRs. Even though it is quite plausible that the biological effect of DCLHb is as great or greater in the higher-risk patients, it would be literally impossible to conclude this using the RR, no matter what the data show, unless the RRs for the lower-risk groups were below 1.67. In fact, to take the extreme case, had the observed mortality in patients in the high-risk group receiving DCLHb been 100%, and ignoring random error, it would be reasonable to argue that the effect of DCLHb is actually much greater in this subgroup (it would be 100% fatal!), despite the lower RR. This extreme effect would, however, be reflected by a very large (infinite) OR. When outcomes are common, the RR can be seriously misleading regarding the clinical significance of observed treatment differences. These arguments apply equally to situations where the treatment has a beneficial effect (OR < 1 or RR < 1).
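The ceiling on the RR is simple arithmetic: with a control event rate p0, no observed RR can exceed 1/p0, because the treated rate cannot exceed 100%. A small sketch (illustrative only; the 4.5% low-risk control rate is the 1/22 from Table 1):

```python
# The largest observable RR given a control event rate p0 is 1/p0.
# The OR has no such ceiling: as the treated rate approaches 100%,
# the OR grows without bound.

def max_observable_rr(p0):
    return 1.0 / p0

print(max_observable_rr(0.600))  # high-risk subgroup: RR can never exceed ~1.67
print(max_observable_rr(0.045))  # low-risk subgroup: ceiling of ~22
```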
Again, if events are common and the RR is less than 1, say, 0.6, then this implies that the highest possible risk for a treated patient is 60% (= 100% × 0.6), which would result if a patient had a baseline risk of 100%. It is highly unlikely that any treatment could have such a dramatic effect on patients who are otherwise certain to either die or experience failure. Again, if outcomes are sufficiently common, it is almost certain that there will be a substantial and unidentified subset of patients for whom the RR does not apply. On the other hand, given any estimated OR, we would still conclude that a patient who is at 100% risk without treatment will also be at 100% risk when treated, and that others who are at very high risk will remain at very high risk. This behavior is likely to be more consistent with the true effect of treatment.

In part, as pointed out by Senn,10 the difficulty with RRs results from the somewhat arbitrary decision to use either the rate of success or the rate of failure as the summary outcome measure. Since the OR accounts for both failure and success rates symmetrically, it does not suffer from this difficulty. To illustrate, consider the gender and race study discussed earlier.2 The results were reported by Schulman et al. as ORs, which were interpreted incorrectly by some as RRs. For example, an OR of 0.57 was interpreted as "Blacks and women with chest pain are 40 percent less likely than whites or men to be referred for cardiac catheterization."13 The actual referral rates cited by Schulman et al. are 90.6% for whites and men and 84.7% for African Americans and women. Schwartz et al. argue correctly that this represents only a 7% reduction in the rate of referral based on a "risk ratio" of 0.93. What is peculiar about this conclusion is that, while in most settings "risk" refers to the probability of failure, the authors use "risk" to refer to the probability of referral, which in this study is implicitly viewed as success.
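The sensitivity of the RR to this choice of outcome, and the symmetry of the OR, can be checked directly from the referral rates just quoted (a sketch for illustration):

```python
# RRs and ORs for the referral data, computed for both choices of
# outcome: referral ("success") and non-referral ("failure").

p_wm, p_aaw = 0.906, 0.847   # referral rates: whites/men, African Americans/women

rr_referral = p_aaw / p_wm                                  # about 0.93
rr_non_referral = (1 - p_aaw) / (1 - p_wm)                  # about 1.63
or_referral = (p_aaw / (1 - p_aaw)) / (p_wm / (1 - p_wm))   # about 0.57
or_non_referral = 1 / or_referral                           # about 1.74

# The two RRs tell very different stories; the two ORs are exact reciprocals.
print(rr_referral, rr_non_referral, or_referral, or_non_referral)
```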
Based on its more traditional meaning, it would seem more appropriate that "risk" refer to the probability of non-referral. In fact, the RR for non-referral is 1.63 (= 15.3%/9.4%). That is, African Americans and women are 63% more likely not to be referred than are whites and men. This RR is not too dissimilar to the OR for non-referral of 1.74 (= 1/0.57), and it suggests a far more pronounced difference than does the 7% reduction in the rate of referral. The apparent discrepancy results from the arbitrary decision to use either the rate of success or the rate of failure as the outcome measure. The perception of the stated difference can be heavily influenced by this choice. Since the OR combines the two outcomes (success and failure) symmetrically, it is not subject to such arbitrariness and therefore is usually a more robust measure of treatment differences.

CONCLUSIONS

The assertion that odds ratios can mislead is true only when odds ratios are misinterpreted. The best solution to the problem is to encourage the proper use of odds ratios, especially when outcomes are common or when the range of underlying risks in the population is large, rather than to discourage their use in the reporting of clinical studies. Furthermore, readers need to be educated in their proper interpretation. One can be misled by odds ratios only when they are applied as if they were risk ratios. The assumption of homogeneity of odds ratios is far more tenable in most situations than the implicit assumption of homogeneity of risk ratios. The apparent ease of application of risk ratios is negated by the fact that they are not as well understood as many believe, and naive applications may be incorrect.

This paper has benefited from the helpful comments of Michael Kosorok, the senior statistical editor, and the reviewers.

References

1. Schwartz LM, Woloshin S, Welch HG. Misunderstandings about the effects of race and sex on physicians' referrals for cardiac catheterization.
N Engl J Med. 1999; 341:279–83.
2. Schulman KA, Berlin JA, Harless W, et al. The effect of race and sex on physicians' recommendations for cardiac catheterization. N Engl J Med. 1999; 340:618–26.
3. Davies HTO, Crombie IK, Tavakoli M. When can odds ratios mislead? BMJ. 1998; 316:989–91.
4. Bracken MB, Sinclair JC. When can odds ratios mislead? Avoidable systematic error in estimating treatment effects must not be tolerated [letter; comment]. BMJ. 1998; 317:1156–7.
5. Deeks JJ. When can odds ratios mislead? Odds ratios should be used only in case–control studies and logistic regression analyses [letter; comment]. BMJ. 1998; 317:1156–7.
6. Altman DG, Deeks JJ, Sackett DL. Odds ratios should be avoided when events are common [letter]. BMJ. 1998; 317:1318.
7. Taeger D, Sun Y, Straif K. On the use, misuse and interpretation of odds ratios. eBMJ, 1998. Website: http://bmj.com/cgi/eletters/316/7136/989.
8. Sackett DL, Deeks JJ, Altman DG. Down with odds ratios. Evid Based Med. 1996; 1:164–6.
9. Zhang J, Yu KF. What's the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes. JAMA. 1998; 280:1690–1.
10. Senn S. Rare distinction and common fallacy [letter]. eBMJ, 1999. Website: http://bmj.com/cgi/eletters/317/7168/1318.
11. Sloan EP, Koenigsberg M, Gens D, et al. Diaspirin cross-linked hemoglobin (DCLHb) in the treatment of severe traumatic hemorrhagic shock, a randomized controlled efficacy trial. JAMA. 1999; 282:1857–63.
12. Boyd CR, Tolson MA, Copes WS. Evaluating trauma care: the TRISS method. J Trauma. 1987; 27:370–8.
13. Rubin R. Heart care reflects race and sex, not symptoms. USA Today. Feb 25, 1999:1A.