Advanced Statistical Analysis in Epidemiology: Inter-rater Reliability, Diagnostic Cutpoints, Test Comparison, Discrepant Analysis, Polychotomous Logistic Regression, and Generalized Estimating Equations

Jeffrey J. Kopicko, MSPH
Tulane University School of Public Health and Tropical Medicine

Diagnostic Statistics

Diagnostic statistics typically assess a 2 x 2 contingency table taking the form:

           True +   True -   Total
Test +       a        b       a+b
Test -       c        d       c+d
Total       a+c      b+d    a+b+c+d

Inter-rater Reliability

Suppose that two different tests exist for the diagnosis of a specific disease. We are interested in determining whether the new test is as reliable in diagnosing the disease as the old test (the "gold standard").

In 1960, Cohen proposed a statistic that would provide a measure of reliability between the ratings of two different radiologists in the interpretation of x-rays. He called it the Kappa coefficient.

Cohen's Kappa can be used to assess the reliability between two raters or diagnostic tests. Based on the previous contingency table, it has the following form and interpretation:

K = 2(ad - bc) / [(a+c)(c+d) + (b+d)(a+b)]

where
K > 0.75           excellent reproducibility
0.4 <= K <= 0.75   good reproducibility
0 <= K < 0.4       marginal reproducibility
(Rosner, 1986)

Cohen's Kappa is appropriately used when the prevalence of the disease is low and the marginal totals of the contingency table are distributed evenly. When these conditions do not hold, Cohen's Kappa will be erroneously low.

Byrt et al. proposed a solution to these possible biases in 1994. They called their solution the "Prevalence-Adjusted Bias-Adjusted Kappa," or PABAK. It has the same interpretation as Cohen's Kappa and the following form:

1. Take the mean of b and c: m = (b + c) / 2
2. Take the mean of a and d: n = (a + d) / 2
3.
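The Kappa formula above translates directly into code; a minimal sketch (the function name and the counts are hypothetical, chosen only to illustrate the 2 x 2 layout):

```python
def cohens_kappa(a, b, c, d):
    """Cohen's Kappa for a 2 x 2 table, in the form given by Rosner (1986)."""
    return 2 * (a * d - b * c) / ((a + c) * (c + d) + (b + d) * (a + b))

# Hypothetical counts: 40 concordant positives, 45 concordant negatives,
# and 10 + 5 discordant pairs.
print(cohens_kappa(40, 10, 5, 45))   # 0.7, i.e. good reproducibility
```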
Compute PABAK using these means and the original Cohen's Kappa formula, applied to this table:

        Yes   No
Yes      n    m
No       m    n

• PABAK is preferable in all instances, regardless of the prevalence or the potential bias between raters.
• More meaningful statistics regarding the diagnostic value of a test can be computed, however.

Diagnostic Measures

• Prevalence
• Sensitivity
• Specificity
• Predictive Value Positive
• Predictive Value Negative

Prevalence quantifies the proportion of individuals in a population who have the disease at a specific instant, and provides an estimate of the probability (risk) that an individual will be ill at a point in time.

prevalence = (a + c) / (a + b + c + d)

Sensitivity is defined as the probability of testing positive if the disease is truly present.

sensitivity = a / (a + c)

Specificity is defined as the probability of testing negative if the disease is truly absent.

specificity = d / (b + d)

Predictive Value Positive (PV+) is defined as the probability that a person actually has the disease given that he or she tests positive.

PV+ = a / (a + b)

Predictive Value Negative (PV-) is defined as the probability that a person is actually disease-free given that he or she tests negative.

PV- = d / (c + d)

Example: Cervical Cancer Screening

The standard of care for cervical cancer/dysplasia detection is the Pap smear. We want to assess a new serum DNA detection test for the human papillomavirus (HPV).
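The three PABAK steps above can be sketched the same way; plugging the averaged table into the Kappa formula simplifies algebraically to (n - m)/(n + m), which equals 2 x (observed agreement) - 1 (the counts below are hypothetical):

```python
def pabak(a, b, c, d):
    """Prevalence-Adjusted Bias-Adjusted Kappa (Byrt et al., 1994)."""
    m = (b + c) / 2          # mean of the discordant cells
    n = (a + d) / 2          # mean of the concordant cells
    # Cohen's Kappa applied to the averaged table reduces to:
    return (n - m) / (n + m)

print(pabak(40, 10, 5, 45))  # 0.7, same scale and interpretation as Kappa
```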
           Pap +   Pap -   Total
DNA +       50      35       85
DNA -        5     410      415
Total       55     445      500

Prevalence  = 55/500  = 0.110
Sensitivity = 50/55   = 0.909
Specificity = 410/445 = 0.921
PV+         = 50/85   = 0.588
PV-         = 410/415 = 0.988

Receiver Operating Characteristic (ROC) Curves

Sensitivities and specificities are used to:
1. Determine the diagnostic value of a test.
2. Determine the appropriate cutpoint for continuous data.
3. Compare the diagnostic values of two or more tests.

1. For every gap in continuous data, the mean value is taken as the cutoff. This is where there is a change in the contingency table distribution.
2. At each new cutpoint, the sensitivity and specificity are calculated.
3. The sensitivity is graphed versus 1 - specificity.
4. Since the sensitivity and specificity are proportions, the total area of the graph is 1.0 units.
5. The area under the curve is the statistic of interest.
6. The area under a curve produced by chance alone is 0.50 units.
7. If the area under the diagnostic test curve is significantly above 0.50, then the test is a good predictor of disease.
8. If the area under the diagnostic test curve is significantly below 0.50, then the test is an inverse predictor of disease.
9. If the area under the diagnostic test curve is not significantly different from 0.50, then the test is a poor predictor of disease.
10. An individual curve can be compared to 0.50 using the N(0, 1) distribution.
11. Two or more diagnostic tests can also be compared using the N(0, 1) distribution.
12. A diagnostic cutpoint can be determined for tests with continuous outcomes in order to maximize the sensitivity and specificity of the test.
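The five diagnostic measures can be checked against the cervical cancer screening table; a minimal sketch, with the Pap smear treated as the true state:

```python
# 2 x 2 counts from the cervical cancer screening example:
# a = DNA+/Pap+, b = DNA+/Pap-, c = DNA-/Pap+, d = DNA-/Pap-
a, b, c, d = 50, 35, 5, 410

prevalence  = (a + c) / (a + b + c + d)   # 55/500  = 0.110
sensitivity = a / (a + c)                 # 50/55   ~ 0.909
specificity = d / (b + d)                 # 410/445 ~ 0.921
pv_pos      = a / (a + b)                 # 50/85   ~ 0.588
pv_neg      = d / (c + d)                 # 410/415 ~ 0.988
```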
[Figure: Nonparametric receiver operating characteristic (ROC) plot of serum RCP level; true positive fraction (sensitivity) versus 1 - specificity, with cutpoints 0.9, 1.0, and 1.1 marked.]

Determining Diagnostic Cutpoints

optimum cutpoint = sup(sensitivity x specificity)

Cut Point   Sensitivity   Specificity   Sens*Spec
0.4         1             0             0
0.5         1             0.04          0.04
0.6         1             0.2           0.2
0.7         1             0.4           0.4
0.8         0.980769      0.68          0.666923
0.9         0.923077      0.82          0.756923
1.0         0.923077      0.88          0.812308
1.1         0.865385      0.96          0.830770
1.2         0.846154      0.96          0.812308
1.3         0.807692      0.98          0.791538
1.4         0.788462      1             0.788462
1.5         0.788462      1             0.788462
1.6         0.730769      1             0.730769
1.7         0.730769      1             0.730769
1.8         0.730769      1             0.730769

Diagnostic Value of a Test

z = (a1 - a0) / se(a1 - a0) ~ N(0, 1)

where a1 is the area under the diagnostic test curve, a0 = 0.50, se(a1) is the standard error of the area, and se(a0) = 0.00.

For the RCP example, the area under the curve is 0.987, with a p-value of < 0.001. The optimal cutpoint for this test is 1.1 ng/ml.

Comparing the Areas Under 2 or More Curves

In order to compare the areas under two or more ROC curves, use the same formula, substituting the values for the second curve for those previously defined for chance alone.

[Figure: Nonparametric receiver operating characteristic curves for different tests of CMV retinitis (Antigenemia, Digene, Amplicor, and chance); sensitivity versus 1 - specificity.]

For the CMV retinitis example, the Digene test had the largest area (although not significantly greater than antigenemia).
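The cutpoint search and the area under the curve can be sketched for a small dataset. In the sketch below the function names and toy data are mine, not from the talk; the area is computed in its Mann-Whitney form (equivalent to the trapezoidal area under the nonparametric ROC), and the cutpoint follows the sup(sensitivity x specificity) rule above:

```python
import numpy as np

def roc_points(values, diseased):
    """Sensitivity and specificity at each candidate cutpoint: the
    midpoints of gaps in the sorted data ('positive' means value >= cut)."""
    values = np.asarray(values, dtype=float)
    diseased = np.asarray(diseased, dtype=bool)
    u = np.unique(values)
    cuts = np.concatenate(([u[0] - 1], (u[:-1] + u[1:]) / 2, [u[-1] + 1]))
    sens = np.array([(values[diseased] >= c).mean() for c in cuts])
    spec = np.array([(values[~diseased] < c).mean() for c in cuts])
    return cuts, sens, spec

def nonparametric_auc(values, diseased):
    """Mann-Whitney form of the area under the nonparametric ROC curve."""
    values = np.asarray(values, dtype=float)
    diseased = np.asarray(diseased, dtype=bool)
    pos, neg = values[diseased], values[~diseased]
    gt = (pos[:, None] > neg[None, :]).sum()
    eq = (pos[:, None] == neg[None, :]).sum()
    return (gt + 0.5 * eq) / (len(pos) * len(neg))

def optimum_cutpoint(values, diseased):
    """Cutpoint maximizing sensitivity * specificity."""
    cuts, sens, spec = roc_points(values, diseased)
    return cuts[np.argmax(sens * spec)]
```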
The cutpoint was determined to be 1,400 cells/cc. The sensitivity was 0.85 and the specificity was 0.84. Bonferroni adjustments must be made for more than 2 comparisons.

Another Application?

Remember when Cohen's Kappa was unstable at extreme prevalence and/or when there was bias among the raters? What about using ROC curves to assess inter-rater reliability?

Another limitation of K is that it provides only a measure of agreement, regardless of whether the raters correctly classify the state of the items. K can be high, indicating excellent reliability, even though both raters incorrectly assess the items.

The two areas under the curves may be compared as a measure of overall inter-rater reliability. This comparison is made by applying the following formula:

droc = 1 - |Area1 - Area2|

By subtracting the difference in areas from one, droc is on a similar scale to K, ranging from 0 to 1.

If both raters correctly classify the objects at the same rate, their sensitivities and specificities will be equal, resulting in a droc of 1. If one rater correctly classifies all the objects, and the second rater misclassifies all the objects, droc will equal 0.

Statistics for Figure 1 (N=20):
Rater One:                      Rater Two:
% Correct = 80%                 % Correct = 55%
sensitivity = 0.80              sensitivity = 0.60
specificity = 0.80              specificity = 0.533
Area under ROC = 0.80           Area under ROC = 0.567
droc = 0.7667

Monte Carlo Simulation

Several different levels of disease prevalence, sample size, and rater error rate were assessed using Monte Carlo methods. Total sample sizes of 20, 50, and 100 were generated, each for disease prevalences of 5, 15, 25, 50, 75, and 90 percent. Two raters were used in this study. Rater One was fixed at a 5 percent probability of misclassifying the true state of the disease, while Rater Two was allowed varying levels of misclassification.
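For a single binary rater, the nonparametric ROC has one interior point at (1 - specificity, sensitivity), so the area under it reduces to (sensitivity + specificity) / 2, and droc follows directly. A sketch using the Figure 1 numbers (function names are mine):

```python
def single_rater_auc(sensitivity, specificity):
    """Area under the two-segment ROC through (0,0), (1-spec, sens), (1,1)."""
    return (sensitivity + specificity) / 2

def d_roc(area1, area2):
    """Inter-rater agreement index: 1 - |difference in ROC areas|."""
    return 1 - abs(area1 - area2)

a1 = single_rater_auc(0.80, 0.80)     # Rater One:  0.80
a2 = single_rater_auc(0.60, 0.5333)   # Rater Two: ~0.567
print(d_roc(a1, a2))                  # ~0.767, matching the Figure 1 droc
```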
For each condition of disease prevalence, rater error, and sample size, 1000 valid samples were generated and analyzed using SAS PROC IML.

[Figure Two: Actual Percent Agreement (N=50); agreement versus Rater Two error probability (0.05 to 0.75), one line per prevalence (0.05, 0.15, 0.25, 0.5, 0.75, 0.9).]

[Figure Three: Difference in ROC Curves (N=50); 1 - ROC curve difference versus Rater Two error probability, one line per prevalence.]

[Figure Four: Cohen's Kappa Coefficient (N=50); Kappa versus Rater Two error probability, one line per prevalence.]

[Figure Five: PABAK Coefficient (N=50); PABAK versus Rater Two error probability, one line per prevalence.]

Based on the above results, it appears that the difference in two ROC curves may be a more stable estimate of inter-rater agreement than K. Based on the metric used to assess K, a similar metric can be formed for the difference in two ROC curves. We propose the following:

0.95 < droc <= 1.0    excellent reliability
0.8 < droc <= 0.95    good reliability
0 <= droc <= 0.8      marginal reliability

From the example data provided with Figure 1, it can be seen that droc behaves similarly to K. The droc from these data is 0.7667, while K is 0.30. Both result in a decision of marginal inter-rater reliability. However, from the ROC plot and the percent correct for each rater, it is seen that Rater One is much more correct in his observations than Rater Two, with percent agreements of 80% and 55%, respectively.

Without the individual calculation of the sensitivities and specificities, information about the correctness of the raters would have remained obscure. Additionally, with the large differential rater error, K may have been underestimated. The difference in ROC curves offers many advantages over K, but only when the true state of the objects being rated is known. Finally, with very little adaptation, these methods may be extended to more than two raters and to continuous outcome data.

So, we now know how to assess whether a test is a good predictor of disease, how to compare two or more tests, and how to determine cutpoints.
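The simulation design above can be sketched in a few lines; here Python stands in for the SAS PROC IML used in the study, and the details (function name, seeding, skipping degenerate samples) are simplifications of mine:

```python
import random

def mean_droc(n, prevalence, err1, err2, reps=1000, seed=1):
    """Monte Carlo sketch: mean droc over `reps` samples of size n, with
    two raters who each misclassify the true disease state independently
    at the given error probabilities."""
    rng = random.Random(seed)
    vals = []
    for _ in range(reps):
        truth = [rng.random() < prevalence for _ in range(n)]
        if not any(truth) or all(truth):
            continue                          # sens/spec undefined: skip
        raters = [[t != (rng.random() < e) for t in truth]
                  for e in (err1, err2)]
        areas = []
        for r in raters:
            sens = sum(x and t for x, t in zip(r, truth)) / sum(truth)
            spec = sum(not x and not t for x, t in zip(r, truth)) / (n - sum(truth))
            areas.append((sens + spec) / 2)   # single-rater ROC area
        vals.append(1 - abs(areas[0] - areas[1]))
    return sum(vals) / len(vals)
```

With equal rater error rates the mean droc stays near 1; increasing Rater Two's error pulls it down, mirroring the pattern in Figure Three.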
But what if there is no established "gold standard"?

Discrepant Analysis

Discrepant Analysis (DA) is a commonly used (and commonly misused) technique for estimating the sensitivity and specificity of diagnostic tests when the "gold standard" is imperfect. This technique often results in upwardly biased estimates of the diagnostic statistics.

Example: Chlamydia trachomatis is a common STI that has been diagnosed using cervical swab culture for years. Often, though, patients only present for screening when they are symptomatic. Symptomatic screening may be closely associated with organism load. Therefore, culture diagnosis may miss carriers and patients with low organism loads.

GenProbe testing has also been used to capture some cases that are not captured by culture. New polymerase chain reaction (PCR) and ligase chain reaction (LCR) DNA assays may be better diagnostic tests. But there is obviously no good "gold standard." Possible comparisons include:

1. Culture vs. PCR
2. Culture + GenProbe vs. PCR
3. Culture vs. LCR
4. Culture + GenProbe vs. LCR
...and many other combinations.

The goal is to maximize the sensitivity and specificity of the new tests, since we think that the new tests are probably more accurate. The major limitation is that this is often seen as a "fishing expedition," with a great possibility of Type I error and inflation of the diagnostic statistics.

Polychotomous Logistic Regression

Simple logistic regression is useful when the outcome of interest is binary (e.g., yes/no, male/female). Linear regression is useful when the outcome of interest is continuous (e.g., age, blood pressure). But what if the outcome is categorical with more than two levels?
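A polychotomous (multinomial, or softmax) logistic model handles exactly that case. A from-scratch sketch fit by gradient ascent, with hypothetical synthetic data, not a substitute for a statistical package:

```python
import numpy as np

def fit_polychotomous_logit(X, y, n_classes, lr=0.5, steps=2000):
    """Softmax (polychotomous logistic) regression fit by gradient ascent
    on the log-likelihood. Returns one coefficient column per class."""
    n = len(y)
    Xb = np.hstack([np.ones((n, 1)), X])          # add an intercept column
    W = np.zeros((Xb.shape[1], n_classes))
    Y = np.eye(n_classes)[y]                      # one-hot outcome matrix
    for _ in range(steps):
        Z = Xb @ W
        P = np.exp(Z - Z.max(axis=1, keepdims=True))
        P /= P.sum(axis=1, keepdims=True)         # class probabilities
        W += lr * Xb.T @ (Y - P) / n              # score (gradient) step
    return W

def classify(W, X):
    Xb = np.hstack([np.ones((len(X), 1)), X])
    return (Xb @ W).argmax(axis=1)
```

In practice one class is taken as the reference, and each remaining class gets its own set of log-odds coefficients against it; the sketch above is the equivalent unconstrained parameterization.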
Generalized Estimating Equations

GEE is used when there are repeated measures on continuous, ordinal, or categorical outcomes, and there are different numbers of measurements on each subject. It is useful in that it accounts for missing data at different time points. The interpretation of the GEE model is the same as for other regressions.
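In the simplest special case, an identity link with an independence working correlation, the GEE point estimate is just pooled least squares; what changes is the variance, which uses a cluster-robust "sandwich" so that repeated measures on the same subject (including unequal numbers of visits) are handled correctly. A minimal sketch of that special case only, with names of my choosing; full GEE iterates with other working correlation structures:

```python
import numpy as np

def gee_gaussian_independence(X, y, cluster):
    """GEE sketch: identity link, independence working correlation.
    Point estimate = pooled OLS; variance = cluster-robust sandwich."""
    Xb = np.hstack([np.ones((len(y), 1)), X])     # intercept column
    bread = np.linalg.inv(Xb.T @ Xb)
    beta = bread @ Xb.T @ y
    resid = y - Xb @ beta
    meat = np.zeros((Xb.shape[1], Xb.shape[1]))
    for g in np.unique(cluster):                  # one term per subject
        idx = cluster == g
        score = Xb[idx].T @ resid[idx]            # subject-level score
        meat += np.outer(score, score)
    cov = bread @ meat @ bread
    return beta, np.sqrt(np.diag(cov))            # coefficients, robust SEs
```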