SPECIAL CONTRIBUTIONS
Advanced Statistics: Up with Odds Ratios! A Case for
Odds Ratios When Outcomes Are Common
Thomas D. Cook, PhD

From the Department of Biostatistics and Medical Informatics, University of Wisconsin, Madison, WI (TDC).
Received October 8, 2001; revision received May 7, 2002; accepted July 17, 2002.
Series editor: Roger J. Lewis, MD, PhD, Department of Emergency Medicine, Harbor–UCLA Medical Center, Torrance, CA.
Address for correspondence and reprints: Thomas D. Cook, PhD, 209 WARF Building, 610 Walnut Street, Madison, WI 53705. Fax: 608-263-0415; e-mail: [email protected].
Abstract
Treatment comparisons from clinical studies involving dichotomous outcomes are often summarized using risk ratios. Risk ratios are typically used because the underlying statistical model is often consistent with the biological mechanism of the treatment and because they are easily interpretable. The use of odds ratios to summarize treatment effects has been discouraged, especially in studies in which outcomes are common, largely because odds ratios differ from risk ratios and are frequently interpreted incorrectly as risk ratios. In this article, the author contends that risk ratios can be easily misinterpreted and that, in many cases, odds ratios should be preferred, especially in studies in which outcomes are common. Key words: odds ratios; risk ratios; statistics; differences; outcomes. ACADEMIC EMERGENCY MEDICINE 2002; 9:1430–1434.
In a 1999 article, Schwartz and colleagues1 comment on the reporting of a study of the effect of
gender and race on physicians’ recommendations
for cardiac catheterization.2 They contend that, as
reported, the differences between African Americans and whites, and between men and women, in
rates of referral for cardiac catheterization study
were overstated (Schwartz et al. indicate other reasons why this finding may be misleading, but they
are not relevant to this discussion. As the authors
suggest, formal comparisons of rates in this setting
may not even be meaningful).1 Schwartz et al.
blame this overstatement, in part, on the use of
odds ratios (ORs) to summarize the results and argue against such use. Additionally, a number of
other articles have appeared in recent years discouraging the use of ORs for reporting the results
of medical studies,3–9 especially when outcomes are
common. Very little has been published since then
to refute this recommendation.10 As is argued below, in the study cited above and when properly
interpreted, ORs may be the most meaningful summary measures of the differences observed.
As an illustration, consider the results of a recent
study of diaspirin cross-linked hemoglobin
(DCLHb) in patients suffering from severe traumatic hemorrhagic shock.11 (For this study, the effect of treatment—overall, within subgroups, and
covariate-adjusted—was reported using ORs.)
Overall mortality in patients receiving DCLHb (see
the first two columns of Table 1) was 46% (24/52)
and mortality in patients receiving normal saline
(control) was 17% (8/46). The risk ratio (RR) is the
ratio of these mortality rates, RR = 2.65 = 46%/17%.
In contrast, the odds of mortality is the ratio of the
mortality rate to the survival rate, or equivalently,
the ratio of the number of deaths to the number of
survivors. In the DCLHb study, the odds are 0.857
= 24/28 and 0.211 = 8/38 in the DCLHb and control
groups, respectively. The odds ratio is the ratio of
odds in the two groups, OR = 4.07 = 0.857/0.211.
These measures are numerically quite different and,
hence, must be interpreted differently. As discussed
below, each requires careful consideration and each
can be easily misinterpreted.
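As a purely illustrative aside (not part of the original analysis), the arithmetic above can be checked with a few lines of Python; the variable names below are hypothetical.

```python
# Minimal sketch: reproduce the overall RR and OR from the Table 1 counts.
deaths_dclhb, survivors_dclhb = 24, 28   # DCLHb group (24/52 died)
deaths_ctrl, survivors_ctrl = 8, 38      # normal saline group (8/46 died)

risk_dclhb = deaths_dclhb / (deaths_dclhb + survivors_dclhb)   # ~0.462
risk_ctrl = deaths_ctrl / (deaths_ctrl + survivors_ctrl)       # ~0.174
odds_dclhb = deaths_dclhb / survivors_dclhb                    # ~0.857
odds_ctrl = deaths_ctrl / survivors_ctrl                       # ~0.211

rr = risk_dclhb / risk_ctrl     # risk ratio, ~2.65
or_ = odds_dclhb / odds_ctrl    # odds ratio, ~4.07
print(f"RR = {rr:.2f}, OR = {or_:.2f}")
```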
Criticisms of ORs fall principally into two categories: 1) ORs are not as intuitive as RRs and, therefore, are difficult to understand and easily misinterpreted and misapplied, and 2) ORs often differ
significantly from RRs. Arguments of the first category are important, but they suffer from a major
flaw. Risk ratios may seem intuitive and easily applied; however, they are easily misapplied and the
conclusions drawn from their use may be inappropriate. An intuitive, easily understood summary
measure is worthwhile only to the extent that it results in valid conclusions.
Arguments in the second category appear to be
based implicitly on two assumptions. The first is
that the most appropriate summary of differences
between groups is the RR and that this measure
should be reported whenever possible. Second,
since ORs, especially when the underlying risk is
high, are more extreme than RRs (larger than RR when RR > 1 and smaller than RR when RR < 1), they overstate the differences between treatment groups. Again, a case can be made that precisely the opposite is true: when they differ, RRs actually understate treatment differences.

TABLE 1. Mortality in the Diaspirin Cross-linked Hemoglobin (DCLHb) Study11 Overall and by Baseline Predicted Probability of Death Using the TRISS Method

                            TRISS-predicted Probability of Survival
                    Overall              80%–100%             20%–80%              0%–20%
                    DCLHb     Control    DCLHb     Control    DCLHb     Control    DCLHb     Control
Dead (Mortality)    24        8          5         1          5         1          12        6
                    (46.2%)   (17.4%)    (21.7%)   (4.5%)     (38.5%)   (8.3%)     (92.3%)   (60.0%)
Alive               28        38         18        21         8         11         1         4
Total               52        46         23        22         13        12         13        10

Note that five patients had insufficient baseline data upon which to compute a TRISS score.
The purpose of this article is to argue that in
many cases the OR is a more appropriate summary
measure that can be applied to a broader population of patients than the RR. In such cases ORs
should be preferred, especially when ORs and RRs
differ, i.e., when outcomes are common. Notwithstanding the errors in the interpretation of the results reported by Schulman et al.,2 there is no evidence that, in practice, errors resulting from the
misinterpretation of ORs are more frequent than errors resulting from the misinterpretation of RRs. We
suggest, and illustrate below, that practitioners who
are likely to misinterpret or misuse ORs are also
likely to misinterpret or misuse RRs.
RISK RATIOS VERSUS ODDS RATIOS
Given an outcome of interest (considered a failure),
the risk of failure is the probability that a patient
will experience failure. For a given population, the
risk is usually estimated by the proportion of the
population observed to fail. It is important to keep
in mind, however, that there is likely to be variation
in risk within the population. The observed population risk is actually an average of the risks for the
individuals in the population, and therefore, the average risk may not necessarily apply to individuals
within the population. Again, this can be illustrated
by considering data from the DCLHb study shown
in Table 1. Three subpopulations are defined by the
probability of survival using the TRISS method.12
We consider a low-risk group (45 patients), a middle-risk group (25 patients), and a high-risk group
(23 patients). Note that five patients had insufficient
baseline data to compute the TRISS score.
Now, given two groups of patients, for example,
treated and control, the (unadjusted) RR is the ratio
of the risks in the two groups. As above, because
the average risk does not necessarily represent the
risk for any particular individual, the RR calculated
using the average risk may not represent the RR for
any particular individual. Assuming that the two
groups are balanced with respect to underlying patient risk (no confounding), the aggregate unadjusted RR will apply to individuals only if there is
a common RR over the population (homogeneous
RR assumption). Understanding this fact is critical
to the correct application of RRs in practice. Considering Table 2, the overall observed RR of 2.65
likely does not represent the RR for any of the subgroups, especially the high-risk group (it is outside
the 95% confidence interval for the RR for this
group). It is also well below the observed RR in the
other two groups (although well within the corresponding confidence intervals). These differences
suggest that the homogeneity of RR assumption does not hold. (This example is primarily for purposes of illustration, and no attempt at statistical inference is intended. Because of the relatively small numbers of patients in this study, observed differences among groups, here and in what follows, may not reach statistical significance. This fact should have no bearing on the principles being illustrated.)
TABLE 2. Risk Ratios (RRs) and Odds Ratios (ORs) in the Diaspirin Cross-linked Hemoglobin (DCLHb) Study11 Overall and by Baseline Predicted Probability of Death Using the TRISS Method

TRISS-predicted
Probability of Survival    RR      95% CI          OR      95% CI
Overall                    2.65    (1.32, 5.32)    4.07    (1.59, 10.4)
80%–100%                   4.78    (0.61, 37.7)    5.83    (0.62, 54.7)
20%–80%                    4.62    (0.63, 34.1)    6.88    (0.67, 70.8)
0%–20%                     1.54    (0.91, 2.61)    8.00    (0.73, 88.2)
TRISS-adjusted             2.07    (1.22, 4.50)    7.15    (2.18, 23.5)

For the TRISS-adjusted RR, the confidence interval (CI) was computed using a bootstrap method.
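For readers who wish to verify the subgroup values in Table 2, the point estimates follow directly from the counts in Table 1; the short sketch below (Python, with illustrative names; confidence intervals omitted) reproduces them.

```python
# Reproduce the subgroup RRs and ORs in Table 2 from the Table 1 counts.
# Each entry: ((deaths, survivors) for DCLHb, (deaths, survivors) for control).
strata = {
    "80%-100%": ((5, 18), (1, 21)),
    "20%-80%":  ((5, 8),  (1, 11)),
    "0%-20%":   ((12, 1), (6, 4)),
}

for label, ((d1, s1), (d0, s0)) in strata.items():
    rr = (d1 / (d1 + s1)) / (d0 / (d0 + s0))   # ratio of mortality rates
    or_ = (d1 / s1) / (d0 / s0)                # ratio of odds of death
    print(f"{label}: RR = {rr:.2f}, OR = {or_:.2f}")
# Expected output: 4.78/5.83, 4.62/6.88, and 1.54/8.00, matching Table 2.
```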
In contrast, the odds of failure is the ratio of the
failure probability to the success probability. In a
population, the odds can usually be estimated by
the number (or proportion) observed to fail divided
by the number (or proportion) observed not to fail.
While the risk must be between 0 and 100%, the
odds can be any number greater than or equal to
zero. Given two groups of patients, the (unadjusted) OR is the ratio of the odds of failure in the
two groups. The OR is always more extreme (further from 1) than the RR, but when the risks are
small (less than, say, 10%), the RR and the OR will
roughly agree. As the underlying risks increase,
however, the difference between the RR and the OR
can become quite large. As with risk, the true odds
of failure may vary significantly among members
of the population and, again, the unadjusted OR
may not represent the OR for any particular individual. The principal benefit of ORs is that homogeneity of ORs is a more tenable assumption than homogeneity of RRs, and thus it is more likely that an estimate of the OR can be reliably applied to all individuals within a population.
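To illustrate the agreement and divergence described above, the following sketch (made-up risk pairs, not study data) compares the two measures as the underlying risks grow.

```python
# Illustrative check: OR roughly equals RR when risks are small; they diverge as risks grow.
def rr_and_or(p_treated, p_control):
    rr = p_treated / p_control
    or_ = (p_treated / (1 - p_treated)) / (p_control / (1 - p_control))
    return rr, or_

for p_control, p_treated in [(0.02, 0.05), (0.10, 0.25), (0.40, 0.70)]:
    rr, or_ = rr_and_or(p_treated, p_control)
    print(f"control {p_control:.0%}, treated {p_treated:.0%}: RR = {rr:.2f}, OR = {or_:.2f}")
# Prints RR/OR of roughly 2.50/2.58, 2.50/3.00, and 1.75/3.50.
```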
We now illustrate how the use of RRs can be misleading. Note that the overall RR of 2.65 can be
expressed by the following statement:
    (1) Mortality in patients receiving DCLHb was 2.65 times higher than in patients receiving normal saline.
We first note two things regarding statement 1,
above. First, it is a summary of the aggregate results observed in this study, neither implying causation nor explicitly quantifying the effect of
DCLHb. Second, as discussed above, it is a statement about the population under study in aggregate, and does not directly apply to individual patients.
Given statement 1, it may be natural for a reader
to infer a statement such as the following:
    (2) DCLHb increases the risk of death 2.65 times relative to normal saline.
Statement 2 differs from statement 1 in two immediate ways. First, it directly addresses the effect
of DCLHb, and second, it makes sense only when
applied to individual patients. It also differs from
statement 1 in that it is false, at least to the extent
that it does not hold for a significant number of
patients. In particular, it cannot hold for those patients whose underlying (saline) risk is above 38%
(estimated to be about 16% of the study population
based on TRISS-predicted survival probabilities).
For these patients it is nonsensical to suggest that
they would have risk of more than 100% (38% ×
2.65 = 101%) if given DCLHb. It is also likely to be
false for patients with very low risk. In fact, for
low-risk patients (about half of the DCLHb study
population has TRISS-predicted baseline risk below
6%), the OR is a good approximation to the RR. In
addition, to the extent that the aggregate (population) OR reflects the common individual OR (assuming that a common OR exists), the RR for low-risk patients is probably more accurately estimated
by the crude OR of 4.07. In reality, and as suggested
by Table 2, given the heterogeneity of the population, the crude OR most likely underestimates the
true OR, but the crude OR is sufficient for this discussion. Indeed, the observed RR for the low-risk
group in Table 2 is 4.78. Thus, statement 2 is likely
to hold for only a relatively small number of patients. A reader who is likely to misinterpret ORs
is also likely to believe, incorrectly, that statement
2 follows directly from statement 1. Clearly, without additional consideration of the underlying assumptions (in particular the assumption of homogeneous RR), RRs can be easily misinterpreted,
obscuring the actual effect of the treatment. In cases
where there is large variation in risk, RRs can be
uninterpretable.
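The following sketch makes the same point numerically, applying the crude estimates from Table 2 to a few illustrative baseline risks (not individual study patients): a constant RR of 2.65 can produce impossible probabilities, whereas a constant OR cannot.

```python
# Apply the aggregate RR (2.65) and OR (4.07) to illustrative baseline risks.
RR, OR = 2.65, 4.07

def risk_under_constant_rr(p0):
    return RR * p0                       # can exceed 1.0 (an impossible "risk")

def risk_under_constant_or(p0):
    odds = OR * p0 / (1 - p0)            # treated odds under a constant OR
    return odds / (1 + odds)             # always a valid probability below 1

for p0 in [0.05, 0.38, 0.60, 0.90]:
    print(f"baseline {p0:.0%}: constant RR -> {risk_under_constant_rr(p0):.0%}, "
          f"constant OR -> {risk_under_constant_or(p0):.0%}")
# At a 38% baseline the constant-RR "risk" already exceeds 100%.
```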
In contrast, the ORs computed for the three risk
categories are not too dissimilar and the assumption of homogeneity of ORs is quite reasonable. Under this assumption, the TRISS-adjusted OR of 7.15
shown in Table 2 may be used as the estimate of a
common OR for all patients in the study. Furthermore, this may represent the most reliable estimate
of the RR for the lowest-risk patients. (Note that the
estimate in the low-risk group of 4.78 is based on
only one death in the saline group and has a much
wider confidence interval.) Given that the homogeneous RR assumption cannot possibly hold, the
TRISS-adjusted RR shown in Table 2 is probably
meaningless, especially since predicted risk for a
number of patients under this model is greater than
100%.
This leads to a second issue—that RRs may not
adequately summarize the effect of treatment when
outcomes are common. To illustrate, consider subgroups of patients defined by baseline risk in the
DCLHb study. From Table 2, in the high-risk group
(predicted survival probability < 20%) the observed
saline (control) mortality rate is 60.0% and the observed DCLHb mortality rate is 92.3%. (Recall that
in this study, treatment was observed to have an
adverse effect on mortality.) Thus, the observed RR
in this subgroup is 1.54 and the observed OR is 8.0.
Given the high control group rate, the largest possible observed RR (assuming 60.0% mortality in the
control group and 100% mortality in the DCLHb
group) is 1.67 (=100%/60.0%). On the other hand,
the observed RR in the low-risk group is 4.78 and
the OR is 5.83.
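A short numerical sketch (illustrative treated-mortality values, not study data) shows the ceiling on the observable RR in this subgroup and the absence of any such ceiling for the OR.

```python
# With 60% control mortality, the observable RR is capped at 1/0.60 = 1.67,
# while the OR grows without bound as treated mortality approaches 100%.
control_risk = 0.60

for treated_risk in [0.80, 0.923, 0.99, 0.999]:
    rr = treated_risk / control_risk
    or_ = (treated_risk / (1 - treated_risk)) / (control_risk / (1 - control_risk))
    print(f"treated mortality {treated_risk:.1%}: RR = {rr:.2f}, OR = {or_:.1f}")

print(f"largest possible RR = {1 / control_risk:.2f}")
```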
This brings us to a potential contradiction arising
from the use of RRs. Even though it is quite plausible that the biological effect of DCLHb is as great
or greater in the higher-risk patients, it would be
literally impossible to conclude this using the RR
no matter what the data show unless the RRs for
the lower-risk groups were below 1.67. In fact, to
take the extreme case, had the observed mortality
in patients in the high-risk group receiving DCLHb
been 100%, and ignoring random error, it would be
reasonable to argue that the effect of DCLHb is actually much greater in this subgroup (it would be
100% fatal!), despite the lower RR. In contrast, this
extreme effect would be reflected by a very large
(infinite) OR. When outcomes are common, the RR
can be seriously misleading regarding the clinical
significance of observed treatment differences.
These arguments apply equally to situations
where the treatment has a beneficial effect (OR < 1
or RR < 1). Again, if events are common, and the
RR is less than 1, say, 0.6, then this implies that the
highest possible risk for a treated patient is 60%
(=100% × 0.6), which would result if a patient had
a baseline risk of 100%. It is highly unlikely that
any treatment could have such a dramatic effect on
patients who are otherwise certain to either die or
experience failure. Again, if outcomes are sufficiently common, it is almost certain that there will
be a substantial and unidentified subset of patients
for whom the RR does not apply. On the other
hand, given any estimated OR, we would still conclude that a patient who is at 100% risk without
treatment will also be at 100% risk when treated,
and others who are at very high risk will remain at
very high risk. This behavior is likely to be more
consistent with the true effect of treatment.
In part, as pointed out by Senn,10 the difficulty
with RRs results from the somewhat arbitrary decision to use either the rate of success or the rate of
failure as the summary outcome measure. Since the
OR accounts for both failure and success rates symmetrically, it does not suffer from this difficulty. To
illustrate, consider the gender and race study discussed earlier.2 The results were reported by Schulman et al. as ORs, which were interpreted incorrectly by some as RRs. For example, an OR of 0.57 was interpreted as "Blacks and women with chest pain are 40 percent less likely than whites or men to be referred for cardiac catheterization."13 The actual referral rates cited by Schulman et al. are 90.6% for whites and men and 84.7% for African Americans and women. Schwartz et al. argue correctly that this represents only a 7% reduction in the rate of referral based on a "risk ratio" of 0.93. What is peculiar about this conclusion is that, while in most settings "risk" refers to the probability of failure, the authors use "risk" to refer to the probability of referral, which in this study is implicitly viewed as success. Based on its more traditional meaning, it would seem more appropriate that "risk" refer to the probability of non-referral. In fact, the RR for non-referral is 1.63 (15.3/9.4). That is, African Americans and women are 63% more likely not to be referred than are whites and men. This RR is not too dissimilar to the OR for non-referral of 1.74 (=1/0.57) and suggests a far more pronounced difference than does the 7% reduction in rate of referral. The apparent discrepancy results from the arbitrary decision to use either the rate of success or
the rate of failure as the outcome measure. The perception of the stated difference can be heavily influenced by this choice. Since the OR combines the
two outcomes (success and failure) symmetrically,
it is not subject to such arbitrariness, and therefore
is usually a more robust measure of treatment differences.
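The symmetry argument can be seen directly from the referral rates quoted above; the sketch below (Python, illustrative variable names) relabels the outcome and recomputes both measures.

```python
# Relabeling the outcome (referral vs. non-referral) inverts the OR exactly,
# but changes the apparent size of the RR dramatically.
p_white_men, p_black_women = 0.906, 0.847   # referral rates cited in the text

rr_referral = p_black_women / p_white_men                   # ~0.93 ("7% lower")
rr_non_referral = (1 - p_black_women) / (1 - p_white_men)   # ~1.63 ("63% higher")

or_referral = (p_black_women / (1 - p_black_women)) / (p_white_men / (1 - p_white_men))
or_non_referral = 1 / or_referral                           # ~1.74, the exact inverse

print(f"RR(referral) = {rr_referral:.2f}, RR(non-referral) = {rr_non_referral:.2f}")
print(f"OR(referral) = {or_referral:.2f}, OR(non-referral) = {or_non_referral:.2f}")
```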
CONCLUSIONS
The assertion that odds ratios can mislead is true
only when odds ratios are misinterpreted. The best solution to the problem is to encourage the proper use of odds ratios, especially when outcomes are common or when the range of underlying risks in the population is large, rather than to discourage their use in the reporting of clinical studies. Furthermore, readers need to be educated in their proper interpretation. One can be
misled by odds ratios only when they are applied
as if they were risk ratios.
The assumption of homogeneity of odds ratios is
far more tenable in most situations than the implicit
assumption of homogeneity of risk ratios. The apparent ease of application of risk ratios is negated
by the fact that they are not as well understood as
many believe, and naive applications may be incorrect.
This paper has benefited from the helpful comments of Michael
Kosorok, the senior statistical editor, and the reviewers.
References

1. Schwartz LM, Woloshin S, Welch HG. Misunderstandings about the effects of race and sex on physicians' referrals for cardiac catheterization. N Engl J Med. 1999; 341:279–83.
2. Schulman KA, Berlin JA, Harless W, et al. The effect of race and sex on physicians' recommendations for cardiac catheterization. N Engl J Med. 1999; 340:618–26.
3. Davies HTO, Crombie IK, Tavakoli M. When can odds ratios mislead? BMJ. 1998; 316:989–91.
4. Bracken MB, Sinclair JC. When can odds ratios mislead? Avoidable systematic error in estimating treatment effects must not be tolerated [letter; comment]. BMJ. 1998; 317:1156–7.
5. Deeks JJ. When can odds ratios mislead? Odds ratios should be used only in case–control studies and logistic regression analyses [letter; comment]. BMJ. 1998; 317:1156–7.
6. Altman DG, Deeks JJ, Sackett DL. Odds ratios should be avoided when events are common [letter]. BMJ. 1998; 317:1318.
7. Taeger D, Sun Y, Straif K. On the use, misuse and interpretation of odds ratios. eBMJ, 1998. Website: http://bmj.com/cgi/eletters/316/7136/989.
8. Sackett DL, Deeks JJ, Altman DG. Down with odds ratios. Evid Based Med. 1996; 1:164–6.
9. Zhang J, Yu KF. What's the relative risk? A method of correcting the odds ratio in cohort studies of common outcomes. JAMA. 1998; 280:1690–1.
10. Senn S. Rare distinction and common fallacy [letter]. eBMJ, 1999. Website: http://bmj.com/cgi/eletters/317/7168/1318.
11. Sloan EP, Koenigsberg M, Gens D, et al. Diaspirin cross-linked hemoglobin (DCLHb) in the treatment of severe traumatic hemorrhagic shock, a randomized controlled efficacy trial. JAMA. 1999; 282:1857–63.
12. Boyd CR, Tolson MA, Copes WS. Evaluating trauma care: the TRISS method. J Trauma. 1987; 27:370–8.
13. Rubin R. Heart care reflects race and sex, not symptoms. USA Today. Feb 25, 1999:1A.