Netherlands Journal of Critical Care
Copyright © 2011, Nederlandse Vereniging voor Intensive Care. All Rights Reserved.
Received April 2011; accepted June 2011

Review

Introduction to statistics in the ICU; what is a ROC curve?

OM Dekkers
Department of Clinical Epidemiology, Leiden University Medical Center, Leiden, The Netherlands

Abstract - Receiver Operating Characteristic (ROC) curves are used in medical research as a graphical display of the relationship between sensitivity and specificity for a continuous test variable. In a ROC curve the sensitivity is plotted against 1-specificity for every cut-off point of the diagnostic variable. Although often shown as an overall measure of test performance, the ROC curve has no direct clinical interpretation. To be applicable in clinical practice, a cut-off value should be chosen, balancing sensitivity, specificity, the severity of the disease under study, the availability of treatments, as well as the consequences of false positivity and false negativity in terms of costs, psychological impact and effect on health outcomes.

Keywords - Receiver Operating Characteristic, sensitivity, specificity

Introduction

Diagnosis and prognosis are the cornerstones of medical decision making. Unfortunately, the clinical, biochemical and radiological tests used for diagnosis and prediction are not infallible. This means that people without a disease can be classified as 'diseased' according to a test, whereas patients with the disease might be wrongly classified as 'healthy'. Studies on diagnostic and prognostic tools serve two goals. The first goal is to find optimal tests that are practical and reliable in medical practice. The second goal is to quantify the uncertainty associated with a given test. In a recent article, the role of oxygen saturation and blood lactate levels in predicting outcome after paediatric cardiac surgery was studied [1]. High lactate levels as well as low oxygen saturation are markers of inadequate tissue oxygenation.
The authors concluded that a venous oxygen saturation <68% and a peak lactate >3 mmol/l during the surgical procedure were associated with higher morbidity and mortality. To enable a direct comparison between the performances of the two test variables, the authors plotted Receiver Operating Characteristic (ROC) curves and calculated the area under the curve (AUC). The study showed that both variables predicted morbidity similarly (AUC of 0.73), whereas peak lactate had a higher accuracy in predicting mortality (AUC 0.87, vs. 0.73 for venous oxygen saturation). The main aim of the present article is to give an overview of ROC curves. Since ROC curves are based on sensitivity and specificity, the characteristics of diagnostic tests will be described first.

Evaluation of diagnostic tests

Diagnostic tests are used to answer questions such as whether a disease is present or not. Similarly, prognostic tests are used to predict the possible presence or absence of disease in the future. These tests should help to distinguish between patients truly having the disease and patients without the disease. The performance of diagnostic (and prognostic) binary tests can be expressed as sensitivity, specificity, negative predictive value (NPV) and positive predictive value (PPV) (see table 1). The sensitivity of a test is the proportion of patients with the disease who are correctly classified as diseased by the test. The specificity of a test is the proportion of patients without the disease who have a negative test. The NPV is the proportion of patients with a negative test who do not have the disease; the PPV is the proportion of patients with a positive test who have the disease. A global measure of the performance of a diagnostic test is the diagnostic accuracy: the proportion of persons correctly classified by the test ((A+D)/N from table 1).

Correspondence: OM Dekkers, e-mail: [email protected]
In a population tested for a disease, for example adrenal insufficiency, a certain proportion will have the disease ((A+B)/N from table 1). This proportion is called the prevalence of the disease. If we test in a low risk population the prevalence of the disease will be low, whereas in a high risk population the prevalence will be much higher. For example, the prevalence of adrenal insufficiency in patients with only tiredness as a symptom will be much lower than in patients with sepsis in the ICU. Sensitivity and specificity are measures that start from the distinction between diseased and non-diseased patients. They are characteristics of the test, as they do not depend on the prevalence of the disease. In other words, whether a test is used in a population with high disease prevalence or low disease prevalence will not influence the sensitivity and specificity. This is not true, however, for the NPV and PPV, which do clearly depend on the disease prevalence. For example, if a test is used in a population with a low prevalence (low number of A+B from table 1), the PPV will be low due to the relatively high number of false positive patients (C from table 1). This is an important consideration when interpreting a positive test result in clinical practice: a positive test in a patient from a population with a low disease prevalence will be accompanied by a high probability of a false positive result (low PPV), whereas a positive result in patients sampled from a high risk population (high prevalence of the disease) will give a much higher PPV. For example, in a setting where only few patients have adrenal insufficiency (tired subjects in an outpatient setting), a positive test should not be interpreted as compelling evidence for the disease: there will be many false positives among those tested, whereas the number of true positives will be low, leading to a low PPV.
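The dependence of the predictive values on prevalence follows from Bayes' rule, and can be made concrete with a short sketch (the sensitivity, specificity and prevalences below are assumed for illustration, not taken from the article):

```python
# Sketch: how prevalence drives PPV and NPV for a test with fixed
# sensitivity and specificity (Bayes' rule on expected cell proportions).

def predictive_values(sens, spec, prevalence):
    """Return (PPV, NPV) for a given disease prevalence."""
    tp = sens * prevalence                # true positives per tested person
    fp = (1 - spec) * (1 - prevalence)    # false positives per tested person
    tn = spec * (1 - prevalence)          # true negatives per tested person
    fn = (1 - sens) * prevalence          # false negatives per tested person
    return tp / (tp + fp), tn / (tn + fn)

sens, spec = 0.90, 0.90
# Schematically: tired outpatients (low prevalence) vs. ICU sepsis (high prevalence)
for prev in (0.01, 0.10, 0.50):
    ppv, npv = predictive_values(sens, spec, prev)
    print(f"prevalence {prev:.0%}: PPV {ppv:.2f}, NPV {npv:.2f}")
```

With sensitivity and specificity fixed at 90%, the PPV rises from about 0.08 at 1% prevalence to 0.90 at 50% prevalence: the same positive result means something very different in the two settings.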
Sensitivity and specificity are measures mainly used in research settings. In that situation the disease status of the study population is known, and sensitivity and specificity can be determined. In clinical practice the disease status of tested patients is not known; otherwise performing a diagnostic test would not make sense. This means that the measures used in daily practice are the NPV and PPV.

Determining a cut-off for diagnostic tests

Test characteristics are determined by comparing a test to a reference standard (often called the gold standard). Such a reference test is presumed to have perfect sensitivity and specificity and is used in such a way that the classification of the study population into diseased/non-diseased represents the truth. However, for most reference tests this assumption is too strong to hold. A reference can be either another test (for example, pulmonary angiography is considered the reference standard for the diagnosis of pulmonary embolism), or the course of the disease (new techniques to detect early stages of recurrent malignancies can be compared with the course of the disease). A test that is already part of the diagnostic criteria for a disease is difficult to evaluate. If a diagnostic or prognostic test is based on a continuous variable, there are several options to relate the variable to disease status. For example, if lactate levels are used as a test for predicting mortality, the risk might be calculated for specific ranges of lactate levels. If the aim is to create a binary diagnostic test from a continuous variable, a cut-off value should be chosen to determine the sensitivity, specificity, NPV and PPV of the test. In the example above, morbidity and mortality were predicted based on two continuous measures: venous oxygen saturation and plasma lactate levels. The authors reported an NPV of 94% for a nadir venous oxygen saturation <70% as well as for a peak lactate >3 mmol/l.
The choice of a cut-off is not straightforward and is, to a certain extent, arbitrary. Firstly, problems can arise from the lack of an adequate reference standard. This is, for example, the case when determining an adequate cut-off for the diagnosis of relative adrenal insufficiency in the intensive care unit [2]. Since there is no standard that distinguishes beyond doubt between patients with and patients without relative adrenal insufficiency, establishing the optimal cortisol cut-off values for the diagnosis is a difficult enterprise. A second problem in the choice of a cut-off value is that sensitivity and specificity cannot both be optimal for a specific cut-off, at least for the majority of tests. The reason is that values of a given variable overlap between diseased and non-diseased persons, meaning that the test cannot discriminate with certainty between diseased and non-diseased for a given test value [3]. High sensitivity often implies low specificity and vice versa. In table 2 sensitivity and specificity have been calculated for various cut-off points of a continuous diagnostic variable (with an outcome between 1 and 40). Calculations are presented for only the highest and the lowest five cut-off points. As can be seen from the table, low cut-off values will identify the vast majority of diseased patients (high sensitivity), at the cost of a low specificity. In contrast, using a high cut-off value (for example >= 36) results in a perfect specificity, but almost all diseased patients will be missed. The effect of the choice of cut-off on the diagnostic accuracy is less pronounced than its effect on sensitivity and specificity. The reason is that the diagnostic accuracy is a weighted average of sensitivity and specificity. The values that the diagnostic accuracy can take are therefore always in between, and less extreme than, the values of sensitivity and specificity.
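The cut-off sweep behind a table such as table 2 can be sketched as follows (the data are hypothetical and chosen only to show the trade-off, not the article's underlying data):

```python
# Sketch: sweeping cut-offs over a continuous test variable shows the
# sensitivity/specificity trade-off; accuracy stays in between the two.

def characteristics_at_cutoff(values_diseased, values_healthy, cutoff):
    """Call the test positive when value >= cutoff; return (sens, spec, accuracy)."""
    tp = sum(v >= cutoff for v in values_diseased)   # diseased, correctly positive
    tn = sum(v < cutoff for v in values_healthy)     # healthy, correctly negative
    n = len(values_diseased) + len(values_healthy)
    sens = tp / len(values_diseased)
    spec = tn / len(values_healthy)
    acc = (tp + tn) / n
    return sens, spec, acc

diseased = [8, 12, 15, 20, 25, 30, 33, 36]   # tend to have higher values
healthy = [1, 2, 4, 5, 7, 9, 11, 14]         # tend to have lower values, but overlap
for cutoff in (1, 10, 20, 37):
    sens, spec, acc = characteristics_at_cutoff(diseased, healthy, cutoff)
    print(f">= {cutoff}: sensitivity {sens:.2f}, specificity {spec:.2f}, accuracy {acc:.2f}")
```

The lowest cut-off yields sensitivity 1.00 with specificity 0.00, the highest the reverse, while the accuracy never strays far from the middle because the two groups overlap.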
An imaginary example, not derived from clinical practice, can demonstrate the relationship between sensitivity and specificity: if, for example, we use the presence of two eyes as a positive test indicating the presence of relative adrenal insufficiency, all patients with the disease would have a positive test (sensitivity 100%). However, all patients without the disease would also have a positive test (specificity 0%). In the example outlined above, increasing the threshold cut-off for oxygen saturation (for example, from <70% to <80%) would increase the number of diseased patients adequately detected (i.e. an increase in sensitivity), but it would also increase the number of patients with a false positive result (a decrease in specificity).

Table 1. Characteristics of diagnostic tests

                Disease present         Disease absent          Total
Test positive   A (true positive)       C (false positive)      A+C    PPV = A/(A+C)
Test negative   B (false negative)      D (true negative)       B+D    NPV = D/(B+D)
Total           A+B                     C+D                     N
                Sensitivity = A/(A+B)   Specificity = D/(C+D)

The choice of a cut-off value depends largely on the disease to be diagnosed and the consequences of false positive and false negative test results. If the main goal is to identify the largest proportion of diseased patients, the test should be highly sensitive, at the expense of a suboptimal specificity. This might be the case for many conditions in the ICU setting, where early recognition and treatment of serious diseases can be life saving. On the other hand, if the applied treatment has many serious side effects, a low specificity is of greater concern than it would be for treatments that are rather safe.
It is, however, important to keep in mind that false positive test results lead to unnecessary treatment, add unnecessary costs and can have a large psychological impact if the disease for which the test was performed is serious, whereas false negative test results may withhold adequate treatment from patients.

ROC curves

If the test used is based on a continuous variable (for example, oxygen saturation or cortisol), the sensitivity and specificity can be calculated for every cut-off value of the test variable. The ROC curve is used as a graphical display of the relationship between sensitivity and specificity for a continuous test variable. In a ROC curve the sensitivity is plotted against 1-specificity for every cut-off point of the diagnostic variable. In figure 1 the ROC curve for the data from table 2 is shown. Each point relates the probability that a patient with the disease has a positive test (sensitivity) to the probability of a positive test in a non-diseased person (1-specificity). The ratio of these two probabilities (formally called the likelihood ratio) represents how many times more likely a positive test is to occur in a diseased than in a non-diseased person.

Table 2. Sensitivity and specificity for various cut-offs of a continuous diagnostic variable

Cut-off   Sensitivity   Specificity   Diagnostic accuracy
>= 1      100%          0%            54%
>= 2      96%           9%            56%
>= 3      93%           20%           59%
>= 4      85%           30%           60%
>= 5      80%           39%           61%
>= 35     9%            96%           49%
>= 36     9%            98%           50%
>= 37     7%            100%          50%
>= 38     5%            100%          49%
>= 40     3%            100%          48%

The point on a ROC curve closest to the upper left corner is the point optimizing sensitivity and specificity. In figure 1, an arrow is drawn that represents the cut-off of >= 5. The accompanying sensitivity is 80% and 1-specificity is 61%, meaning that for a diseased person the likelihood of a positive test is 1.3 times higher than for a non-diseased person.
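The construction of the curve amounts to repeating the cut-off calculation for every candidate threshold. A minimal sketch, again with hypothetical data rather than the article's:

```python
# Sketch: ROC points are (1-specificity, sensitivity) pairs computed at
# every observed cut-off of the continuous test variable.

def roc_points(values_diseased, values_healthy):
    """Return (1-specificity, sensitivity) pairs for every candidate cut-off."""
    points = []
    for cutoff in sorted(set(values_diseased) | set(values_healthy)):
        sens = sum(v >= cutoff for v in values_diseased) / len(values_diseased)
        spec = sum(v < cutoff for v in values_healthy) / len(values_healthy)
        points.append((1 - spec, sens))
    return points

diseased = [8, 12, 15, 20, 25, 30, 33, 36]
healthy = [1, 2, 4, 5, 7, 9, 11, 14]
for fpr, sens in roc_points(diseased, healthy):
    # Positive likelihood ratio: sensitivity / (1-specificity)
    lr = sens / fpr if fpr > 0 else float("inf")
    print(f"1-spec {fpr:.2f}, sens {sens:.2f}, LR+ {lr:.2f}")
```

Plotting these pairs (e.g. with matplotlib) traces the ROC curve from the top right corner (everyone positive) down to the bottom left corner (everyone negative).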
A test that perfectly discriminates between diseased and non-diseased at every value of the test variable would coincide with the left and top sides of the plot. A test variable that does not discriminate in any way between diseased and non-diseased would result in a straight diagonal line from the bottom left corner to the upper right corner. Such a line means that for every cut-off the diseased and the non-diseased have the same probability of having a positive test. Such a test would be no better (but much more expensive) than flipping a coin. The area under the curve (AUC) is a global assessment of the test performance. When comparing two tests, a higher AUC indicates a better test performance. Importantly, this does not imply that the test with the higher AUC performs better for every cut-off value. The AUC is equal to the probability that a tested person with the disease has a higher value of the test variable than a person without the disease [4]. An AUC of 0.5 represents a useless test. Although often shown as an overall measure of test performance, the ROC curve and the accompanying AUC have no direct clinical interpretation. They only give a visual and quantitative overview of the test characteristics at different cut-offs. To be applicable in clinical practice, a cut-off value should be chosen, balancing sensitivity, specificity, the severity of the disease under study, the availability of treatments, the costs of the test, as well as the consequences of false positivity and false negativity in terms of costs, psychological impact and effect on health outcomes.

Figure 1. ROC curve plotting sensitivity against 1-specificity for a continuous diagnostic variable

In conclusion, Receiver Operating Characteristic (ROC) curves are used in medical research as a graphical display of the relationship between sensitivity and specificity for a continuous test variable.
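The pairwise interpretation of the AUC described above can be checked directly: count, over all (diseased, non-diseased) pairs, how often the diseased person has the higher test value. A sketch with the same hypothetical data as before:

```python
# Sketch: the AUC equals the probability that a randomly chosen diseased person
# has a higher test value than a randomly chosen non-diseased person [4].

def auc_pairwise(values_diseased, values_healthy):
    """Mann-Whitney style AUC: fraction of (diseased, healthy) pairs ordered correctly."""
    wins = 0.0
    for d in values_diseased:
        for h in values_healthy:
            if d > h:
                wins += 1.0
            elif d == h:
                wins += 0.5   # ties count half
    return wins / (len(values_diseased) * len(values_healthy))

diseased = [8, 12, 15, 20, 25, 30, 33, 36]
healthy = [1, 2, 4, 5, 7, 9, 11, 14]
print(auc_pairwise(diseased, healthy))   # close to 1: good discrimination
print(auc_pairwise(healthy, healthy))    # 0.5: a useless "test", the diagonal line
```

Comparing a group against itself gives exactly 0.5, the coin-flip diagonal; well separated groups push the value towards 1.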
A ROC curve gives a global view of test performance, but cannot be applied directly to clinical practice. For that purpose a cut-off value of the test variable should be chosen, a choice that depends on the disease to be diagnosed and the consequences of false positive and false negative test results.

References
1. Ranucci M, Isgro G, Carlucci C, De La Torre T, Enginoli S, Frigiola A: Central venous oxygen saturation and blood lactate levels during cardiopulmonary bypass are associated with outcome after pediatric cardiac surgery. Crit Care 14:R149, 2010
2. Fleseriu M, Loriaux DL: "Relative" adrenal insufficiency in critical illness. Endocr Pract 15:632-640, 2009
3. Ware JH: The limitations of risk factors as prognostic tools. N Engl J Med 355:2615-2617, 2006
4. Altman DG, Bland JM: Diagnostic tests 3: receiver operating characteristic plots. BMJ 309:188, 1994