Netherlands Journal of Critical Care
Copyright © 2011, Nederlandse Vereniging voor Intensive Care. All Rights Reserved. Received April 2011; accepted June 2011
Review
Introduction to statistics in the ICU; what is a ROC curve?
OM Dekkers
Department of Clinical Epidemiology. Leiden University Medical Center, Leiden, The Netherlands
Abstract - Receiver Operating Characteristic (ROC) curves are used in medical research as a graphical display for the relationship
between sensitivity and specificity for a continuous test variable. In a ROC curve the sensitivity is plotted against 1-specificity for every
cut-off point of the diagnostic variable. Although often shown as an overall measure of test performance, the ROC curve has no direct
clinical interpretation. To be applicable for clinical practice, a cut-off value should be chosen, balancing sensitivity, specificity, the severity of the disease under study, availability of treatments, as well as consequences from false positivity and false negativity in terms of
costs, psychological impact and effect on health outcomes.
Keywords - Receiver Operating Characteristic, sensitivity, specificity
Introduction
Diagnosis and prognosis are the cornerstone of medical decision
making. Unfortunately, clinical, biochemical and radiological tests
used for diagnosis and prediction are not infallible. This means
that people without a disease can be classified as ‘diseased’
according to a test, whereas patients might be wrongly classified
as ‘healthy’. Studies on diagnostic and prognostic tools serve
two goals. The first goal is to find optimal tests that are practical
and reliable in medical practice. The second goal is to calculate
the uncertainty associated with a given test.
In a recent article, the role of oxygen saturation and blood
lactate levels in predicting outcome after paediatric cardiac
surgery was studied [1]. High lactate levels as well as low oxygen
saturation are markers of inadequate tissue oxygenation. The
authors concluded that venous oxygen saturation <68% and
peak lactate > 3 mmol/l during the surgical procedure were
associated with higher morbidity and mortality. To enable a direct
comparison between the performances of the two test variables,
the authors plotted Receiver Operating Characteristic (ROC)
curves and calculated the area under the curve (AUC). The study
showed that for morbidity, both variables predicted similarly (AUC
of 0.73), whereas peak lactate had a higher accuracy in predicting
mortality (AUC 0.87, vs. 0.73 for venous oxygen saturation).
The main aim of the present article is to give an overview of ROC
curves. Since ROC curves are based on sensitivity and specificity,
the characteristics of diagnostic tests will be described first.
Evaluation of diagnostic tests
Diagnostic tests are used to answer questions such as whether
a disease is present or not. Similarly, prognostic tests are used
to predict the possible presence or absence of disease in the
Correspondence
OM Dekkers
E-mail: [email protected]
NETH J CRIT CARE - VOLUME 15 - NO 6 - DECEMBER 2011
future. These tests should help to distinguish between patients
truly having the disease and patients without the disease. The
performance of diagnostic (and prognostic) binary tests can be
expressed as sensitivity, specificity, negative predictive value
(NPV) and positive predictive value (PPV) (see table 1).
The proportion of patients with the disease that is correctly
classified as diseased by the test is called the sensitivity of a test.
The specificity of a test is the proportion of patients without the
disease who have a negative test. The NPV is the proportion of
patients with a negative test who do not have the disease; the
PPV is the proportion of patients with a positive test who have
the disease. A global measure of the performance of a diagnostic
test is the diagnostic accuracy. This represents the proportion of
persons correctly classified by the test ((A+D)/N from table 1).
In a population tested for a disease, for example, adrenal
insufficiency, a certain proportion will have the disease ((A+B)/N
from table 1). This proportion is called the prevalence of the
disease. If we test in a low risk population the prevalence of
the disease will be low, whereas in a high risk population the
prevalence will be much higher. For example, the prevalence
of adrenal insufficiency in patients with only tiredness as a
symptom will be much lower than in patients with sepsis on
the ICU. Sensitivity and specificity are measures starting with
the distinction between diseased and non-diseased patients.
They are characteristics of the test as they do not depend on
the prevalence of the disease. In other words, whether a test is
used in a population with high disease prevalence or low disease
prevalence will not influence the sensitivity and specificity. This
is not true, however, for NPV and PPV, which do clearly depend
on the disease prevalence. For example, if a test is used in a
population with a low prevalence (low number of A+B from
table 1), the PPV will be low due to the relatively high number
of false positive patients (C from table 1). This is an important
consideration when interpreting a positive test result in clinical
practice: a positive test in a patient from a population with a low
disease prevalence will be accompanied by a high probability
of a false positive result (low PPV), whereas a positive result in
patients sampled from a high risk population (high prevalence of
the disease) will give a much higher PPV. For example, in a setting
of only few patients having adrenal insufficiency (tired subjects
in an outpatient setting), a positive test should not be interpreted
as compelling evidence for the disease: there will be many false
positives among the tested, whereas the number of true positives
will be low, leading to a low PPV.
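The dependence of PPV and NPV on prevalence can be made concrete with a short calculation. The sensitivity and specificity of 90% below are assumed example values, not figures from the article:

```python
# How PPV and NPV shift with prevalence for a fixed test
# (sensitivity and specificity of 0.9 are assumed for illustration).

def ppv(sens, spec, prev):
    # true positives divided by all positives
    return (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))

def npv(sens, spec, prev):
    # true negatives divided by all negatives
    return (spec * (1 - prev)) / (spec * (1 - prev) + (1 - sens) * prev)

for prev in (0.01, 0.10, 0.50):   # low-risk versus high-risk populations
    print(f"prevalence {prev:4.0%}: PPV {ppv(0.9, 0.9, prev):.2f}, "
          f"NPV {npv(0.9, 0.9, prev):.2f}")
```

Even a fairly good test yields a low PPV when the prevalence is low, which is exactly the situation described above.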
Sensitivity and specificity are measures mainly used in
research settings. In that situation the disease status of the
study population is known and sensitivity and specificity can
be determined. In clinical practice the disease status of tested
patients is not known; otherwise performing a diagnostic test
would make no sense. The measures used in daily practice are
therefore NPV and PPV.
Determining a cut-off for diagnostic tests
Test characteristics are determined by comparing a test to a
reference standard (often called gold standard). Such a reference
test is presumed to have a perfect sensitivity and specificity
and is used in such a way that the classification of the study
population into diseased/non-diseased represents the truth.
However, for most reference tests this assumption is too strong
to hold. A reference can be either another test (for example,
pulmonary angiography is considered the reference standard
for the diagnosis of pulmonary embolism), or the course of the
disease (new techniques to detect early stages of recurrent
malignancies can be compared to the course of the disease). A
test that is already part of the diagnostic criteria for a disease is
difficult to test.
If a diagnostic or prognostic test is based on a continuous
variable, there are several options to relate the variable to
disease status. For example, if lactate levels are used as a test
for predicting mortality, the risk might be calculated for specific
lactate level ranges. If the aim is to create a binary diagnostic test
from a continuous variable, a cut-off value should be chosen to
determine the sensitivity, specificity, NPV and PPV of the test.
In the example above, morbidity and mortality were predicted
based on two continuous measures: venous oxygen saturation
and plasma lactate levels. The authors reported an NPV of 94%
for a nadir venous oxygen saturation < 70% as well as for a peak
lactate > 3 mmol/l.
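Dichotomizing a continuous variable at a cut-off can be sketched as follows. The lactate values and outcomes below are made up for illustration; they are not the data from the study cited above:

```python
# Turning a continuous variable into a binary test at a cut-off
# (hypothetical lactate values in mmol/l and hypothetical outcomes).

lactate = [1.2, 2.8, 3.4, 5.1, 2.1, 4.0, 6.3, 1.9]
died    = [0,   0,   1,   1,   0,   0,   1,   0]

cutoff = 3.0
positive = [x > cutoff for x in lactate]   # the binary test result

tp = sum(p and d for p, d in zip(positive, died))
fn = sum((not p) and d for p, d in zip(positive, died))
fp = sum(p and (not d) for p, d in zip(positive, died))
tn = sum((not p) and (not d) for p, d in zip(positive, died))

print(f"sensitivity {tp/(tp+fn):.2f}, specificity {tn/(tn+fp):.2f}")
```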
The choice of a cut-off is not straightforward and to a certain
extent arbitrary. Firstly, problems can arise from the lack of an
adequate reference standard. This is, for example, the case when
determining an adequate cut-off for the diagnosis of relative adrenal
insufficiency in the intensive care unit [2]. Since there is no
standard that distinguishes beyond doubt between patients with
and patients without relative adrenal insufficiency, establishing
the optimal cortisol cut-off values for the diagnosis is a difficult
enterprise. A second problem in the choice of a cut-off value
is that sensitivity and specificity cannot both be optimal for a
specific cut-off, at least for the majority of tests. The reason is
that values of a certain variable overlap between diseased and
non-diseased persons, meaning that the test cannot discriminate
with certainty between diseased and non-diseased for a given
test value [3].
High sensitivity often implies low specificity and vice versa.
In table 2 sensitivity and specificity have been calculated for
various cut-off points for a diagnostic continuous variable (with
an outcome between 1 and 40). Calculations are presented for
only the highest and the lowest five cut-off points. As can be seen
from the table, low cut-off values will identify the vast majority of
diseased patients (high sensitivity), at the cost of a low specificity.
In contrast, using a high cut-off value (for example >= 36) results
in a near-perfect specificity, but almost all diseased patients will
be missed. The effect of the choice of a cut-off on the diagnostic
accuracy is less pronounced than the effect on sensitivity and
specificity. The reason is that diagnostic accuracy is a weighted
average of sensitivity and specificity; the values that the diagnostic
accuracy can attain are therefore always in between, and less
extreme than, the values of sensitivity and specificity.
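The trade-off shown in table 2 can be reproduced in miniature by sweeping the cut-off over a small data set. The test values below are invented for illustration:

```python
# Sensitivity, specificity and accuracy for every candidate cut-off
# of a continuous test variable (hypothetical test values).

scores_diseased = [3, 5, 6, 8, 9]      # test values in diseased persons
scores_healthy  = [1, 2, 4, 4, 5, 7]   # test values in non-diseased persons

for cutoff in range(1, 10):
    tp = sum(x >= cutoff for x in scores_diseased)
    fn = len(scores_diseased) - tp
    fp = sum(x >= cutoff for x in scores_healthy)
    tn = len(scores_healthy) - fp
    sens = tp / (tp + fn)
    spec = tn / (tn + fp)
    acc = (tp + tn) / (tp + fn + fp + tn)
    print(f">= {cutoff}: sensitivity {sens:.2f}, "
          f"specificity {spec:.2f}, accuracy {acc:.2f}")
```

As in table 2, sensitivity falls and specificity rises as the cut-off is raised, while accuracy stays between the two.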
An imaginary example that is not derived from clinical practice
can demonstrate the relationship between sensitivity and
specificity: if e.g., we use the presence of two eyes as a positive
test indicating the presence of relative adrenal insufficiency, all
patients with the disease would have a positive test (sensitivity
100%). However, also all patients without the disease would
have a positive test (specificity 0%). In the example outlined
above, increasing the threshold cut-off for oxygen saturation
Table 1. Characteristics of diagnostic tests

                 Disease present         Disease absent          Total
Test positive    A (true positive)       C (false positive)      A+C    PPV = A/(A+C)
Test negative    B (false negative)      D (true negative)       B+D    NPV = D/(B+D)
Total            A+B                     C+D                     N
                 Sensitivity = A/(A+B)   Specificity = D/(C+D)
(for example, from <70% to <80%) would increase the number
of diseased patients adequately detected (i.e. increase of
sensitivity), but this would also increase the number of patients
with a false positive result (decrease in specificity).
The choice of a cut-off value depends largely on the disease
to be diagnosed and the consequences of false positive and
false negative test results. If the main goal is to identify the
largest proportion of diseased patients, the test should be highly
sensitive, at the expense of a suboptimal specificity. This might
be the case for many conditions in the ICU setting where early
recognition and treatment of serious diseases could be life
saving. On the other hand, if the applied treatment has many
serious side effects, one might be more concerned about a low
specificity than for treatments that are rather safe. It is, however,
important to keep in mind that false positive test results lead to
unnecessary treatment, add to unnecessary costs and can have
a large psychological impact if the disease for which the test was
performed is serious, whereas false negative test results may
withhold patients from being adequately treated.
ROC curves
If the test used is based on a continuous variable (for example,
oxygen saturation or cortisol), for every cut-off value of the test
variable the sensitivity and specificity can be calculated. The
ROC curve is used as a graphical display for the relationship
between sensitivity and specificity for a continuous test variable.
In a ROC curve the sensitivity is plotted against 1-specificity for
every cut-off point of the diagnostic variable. In figure 1 the ROC
curve for the data from table 2 is shown. Each point represents
the ratio of the probability that a patient with the disease has a
positive test (sensitivity) to the probability of a positive test in a
non-diseased person (1-specificity). Such a ratio (formally called
the likelihood ratio) represents how many times more likely a
positive test is in a diseased than in a non-diseased person.
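At any single point on the curve, this likelihood ratio follows directly from sensitivity and specificity. The values below are those given in table 2 for the cut-off >= 5:

```python
# Likelihood ratio for a positive test at one point of the ROC curve.

def positive_likelihood_ratio(sensitivity, specificity):
    return sensitivity / (1.0 - specificity)

# Cut-off >= 5 from table 2: sensitivity 80%, specificity 39%.
print(round(positive_likelihood_ratio(0.80, 0.39), 2))  # → 1.31
```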
Table 2. Sensitivity and specificity for various cut-offs of a continuous diagnostic variable

Cut-off    Sensitivity    Specificity    Diagnostic accuracy
>= 1       100%           0%             54%
>= 2       96%            9%             56%
>= 3       93%            20%            59%
>= 4       85%            30%            60%
>= 5       80%            39%            61%
>= 35      9%             96%            49%
>= 36      9%             98%            50%
>= 37      7%             100%           50%
>= 38      5%             100%           49%
>= 40      3%             100%           48%
The point on a ROC curve closest to the upper left corner is
the point that optimizes the combination of sensitivity and specificity.
In figure 1, one arrow is drawn that represents the cut-off of >=
5. The accompanying sensitivity is 80% and 1-specificity is 61%,
meaning that for a diseased person, the likelihood of a positive
test is 1.3 times as high as for a non-diseased person. A test that perfectly
discriminates between diseased and non-diseased at every value
of the test variable would coincide with the left and top side of
the plot. A test variable that does not in any way discriminate
between diseased and non-diseased would result in a straight
diagonal line from the bottom left corner to the upper right corner.
Such a line means that for every cut-off the diseased and the
non-diseased have the same probability of having a positive test.
Such a test would be nothing better (but much more expensive)
than flipping a coin.
The area under the curve (AUC) is a global assessment of
the test performance. When comparing two tests, a higher AUC
indicates a better test performance. Importantly, this does not
imply that the test with the higher AUC performs better for every
cut-off value. The AUC is equal to the probability that a tested
person with the disease has a higher value of the test variable
than a person without the disease [4]. An AUC of 0.5 represents
a useless test. Although often shown as an overall measure of
test performance, the ROC curve and the accompanying AUC
have no direct clinical interpretation. They only give a visual
and quantitative overview of the test characteristics at different
cut-offs. However, to be applicable for clinical practice, a cut-off value should be chosen, balancing sensitivity, specificity, the
severity of the disease under study, availability of treatments,
costs of the test as well as consequences from false positivity
and false negativity in terms of costs, psychological impact and
effect on health outcomes.
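The probabilistic interpretation of the AUC given above can be sketched directly: over all diseased/non-diseased pairs, count how often the diseased person has the higher test value, with ties counted as half. The test values are invented for illustration:

```python
# AUC as a concordance probability: the chance that a randomly chosen
# diseased person has a higher test value than a randomly chosen
# non-diseased person (hypothetical test values).

def auc(diseased, healthy):
    concordant = 0.0
    for x in diseased:
        for y in healthy:
            if x > y:
                concordant += 1.0
            elif x == y:
                concordant += 0.5   # ties count as half
    return concordant / (len(diseased) * len(healthy))

print(auc([3, 5, 6, 8, 9], [1, 2, 4, 4, 5, 7]))
```

A useless test scatters the two groups identically and gives an AUC of 0.5; perfect separation gives 1.0.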
Figure 1. ROC curve plotting sensitivity against 1-specificity for
a continuous diagnostic variable
In conclusion, Receiver Operating Characteristic (ROC) curves
are used in medical research as a graphical display for the
relationship between sensitivity and specificity for a continuous
test variable. A ROC curve gives a global view of the test
performance, but cannot be applied directly to clinical practice.
For that purpose a cut-off value of the test variable should be
chosen, a choice that depends on the disease to be diagnosed
and the consequences of false positive and false negative test
results.
References
1. Ranucci M, Isgro G, Carlucci C, De La Torre T, Enginoli S, Frigiola A: Central venous
oxygen saturation and blood lactate levels during cardiopulmonary bypass are associated
with outcome after pediatric cardiac surgery. Crit Care 14:R149, 2010
2. Fleseriu M, Loriaux DL: “Relative” adrenal insufficiency in critical illness. Endocr Pract
15:632-640, 2009
3. Ware JH: The limitations of risk factors as prognostic tools. N Engl J Med 355:2615-2617, 2006
4. Altman DG, Bland JM: Diagnostic tests 3: receiver operating characteristic plots.
BMJ 309:188, 1994