Sensitivity, Specificity, and Useful Measures of
Diagnostic Utility
Frank E Harrell Jr
Department of Biostatistics
Vanderbilt University School of Medicine
Statistics and Methodology Core Training Seminar
Vanderbilt Kennedy Center
3 May 2012
Sensitivity, Specificity, and Bayes’ Rule
sensitivity = Prob[T+ | D+]
specificity = Prob[T− | D−]

Prob[D+ | T+] = (sens × prev) / (sens × prev + (1 − spec) × (1 − prev))
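A minimal numeric sketch of Bayes' rule, added here for concreteness and not part of the original slides; the test characteristics and prevalence below are hypothetical.

def post_test_prob(sens, spec, prev):
    # Bayes' rule: Prob[D+ | T+] from sensitivity, specificity, and prevalence
    return (sens * prev) / (sens * prev + (1 - spec) * (1 - prev))

# Hypothetical test: sensitivity 0.90, specificity 0.80, prevalence 0.10
print(post_test_prob(0.90, 0.80, 0.10))  # 0.333: a positive test still leaves odds 2:1 against disease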
Problems with Traditional Indexes of Diagnostic
Utility
Diagnosis forced to be binary
Test forced to be binary
Sensitivity and specificity are in backwards time order
Confuse decision making for groups vs. individuals
Inadequate utilization of pre-test information
Dichotomization of continuous variables in general
The Dichotomizers
The speed limit is 60.
I am going faster than the speed limit.
Will I be caught?

A response from a dichotomizer:
  Are you going faster than 70?
A response from a better dichotomizer:
  If you are among other cars, are you going faster than 73?
  If you are exposed, are you going faster than 67?
Better:
  How fast are you going, and are you exposed?
Motorist, continued
Analogy to most medical diagnosis research, in which a +/− diagnosis is a false dichotomy of an underlying disease severity:
The speed limit is moderately high.
I am going fairly fast.
Will I be caught?
The "sensitive" motorist (who is also a lover of P-values):
Of all the motorists receiving speeding tickets, what proportion of them were going semi-fast?
BI-RADS Scores
Score  Diagnosis                         Meaning
0      Incomplete                        Your mammogram or ultrasound didn't give the radiologist enough information to make a clear diagnosis; follow-up imaging is necessary
1      Negative                          There is nothing to comment on; routine screening recommended
2      Benign                            A definite benign finding; routine screening recommended
3      Probably Benign                   Findings that have a high probability of being benign (> 98%); six-month short-interval follow-up
4      Suspicious Abnormality            Not characteristic of breast cancer, but reasonable probability of being malignant (3 to 94%); biopsy should be considered
5      Highly Suspicious of Malignancy   Lesion that has a high probability of being malignant (≥ 95%); take appropriate action
6      Known Biopsy-Proven Malignancy    Lesions known to be malignant that are being imaged prior to definitive treatment; assure that treatment is completed

Breast Imaging Reporting and Data System, American College of Radiologists
http://breastcancer.about.com/od/diagnosis/a/birads.htm
American College of Radiology. BI-RADS US (PDF document) Copyright 2004.
How to Reduce False Positives and Negatives?
Do away with “positive” and “negative”
Provide risk estimates
Defer decision to decision maker
Risks have self-contained error rates
Risk of 0.2 → Prob[error] = 0.2 if we don't treat
Risk of 0.8 → Prob[error] = 0.2 if we treat
Against Diagnosis
The act of diagnosis requires that patients be placed in a binary category of either having or not having a certain disease. Accordingly, the diseases of particular concern for industrialized countries—such as type 2 diabetes, obesity, or depression—require that a somewhat arbitrary cut-point be chosen on a continuous scale of measurement (for example, a fasting glucose level > 6.9 mmol/L [> 125 mg/dL] for type 2 diabetes). These cut-points do not adequately reflect disease biology, may inappropriately treat patients on either side of the cut-point as 2 homogeneous risk groups, fail to incorporate other risk factors, and are invariant to patient preference.

Vickers et al. [2008]
Problems with Sensitivity and Specificity
Backwards time order
Irrelevant to both physician and patient
Improper discontinuous scoring rules
Are not test characteristics
  Are characteristics of the test and patients
  Not constant; vary with patient characteristics
  Sensitivity ↑ with any covariate related to disease severity if diagnosis is dichotomized
Require adjustment for workup bias
  Diagnostic risk models do not; they only suffer from under-representation
Good for proof of concept of a diagnostic method in a case–control study; not useful for utility

Hlatky et al. [1984]; Moons et al. [1997]; Moons and Harrell [2003]; Gneiting and Raftery [2007]
Sensitivity of Exercise ECG for Diagnosing CAD
                 Sensitivity
Age (years)
  < 40           0.56
  40–49          0.65
  50–59          0.74
  ≥ 60           0.84
Sex
  male           0.72
  female         0.57
# Diseased CAs
  1              0.48
  2              0.68
  3              0.85

Hlatky et al. [1984]
Types of Bias
Asymmetric error in an estimator of the appropriate quantity
Zero error in estimating the wrong quantity
Damage Caused by Improper Discontinuous
Scoring Rules
Example: Predicting Prob[disease]
N = 400; 0.57 of subjects have disease
Classify as diseased if prob. > 0.5

Model      C Index    χ²     Proportion Correct
age        .592       10.5   .622
sex        .589       12.4   .588
age+sex    .639       22.8   .600
constant   .500        0.0   .573

Adjusted odds ratios:
  age (IQR 58y:42y)   1.6 (0.95 CL 1.2–2.0)
  sex (f:m)           0.5 (0.95 CL 0.3–0.7)

Test of sex effect adjusted for age (22.8 − 10.5): P = 0.0005

Sensitivity & specificity are also improper rules
Problems with ROC Curves and Cutoffs
. . . statistics such as the AUC are not especially relevant to someone who must make a decision about a particular xc. . . . ROC curves lack or obscure several quantities that are necessary for evaluating the operational effectiveness of diagnostic tests. . . . ROC curves were first used to check how radio receivers (like radar receivers) operated over a range of frequencies. . . . This is not how most ROC curves are used now, particularly in medicine. The receiver of a diagnostic measurement . . . wants to make a decision based on some xc, and is not especially interested in how well he would have done had he used some different cutoff.

Briggs and Zaretzki [2008]. In the discussion of this paper, David Hand states: "when integrating to yield the overall AUC measure, it is necessary to decide what weight to give each value in the integration. The AUC implicitly does this using a weighting derived empirically from the data. This is nonsensical. The relative importance of misclassifying a case as a non-case, compared to the reverse, cannot come from the data itself. It must come externally, from considerations of the severity one attaches to the different kinds of misclassifications."
Optimum Decision Making for an Individual
Minimize expected loss/cost/disutility
Uses
  utility function (e.g., inverse of cost of missing a diagnosis, cost of over-treatment if disease is absent)
  probability of disease
d = decision, o = outcome
Utility for outcome o: U(o)
Expected utility of decision d: U(d) = ∫ p(o|d) U(o) do
d_opt = the d maximizing U(d)
http://en.wikipedia.org/wiki/Optimal_decision
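A minimal sketch of this optimization for the two-decision, two-outcome case; the utility values are illustrative assumptions, not from the talk.

# Hypothetical utilities U(o) under each decision (larger is better)
UTILITY = {("treat", "diseased"): -1,      # treatment burden
           ("treat", "healthy"): -2,       # over-treatment
           ("no treat", "diseased"): -10,  # missed diagnosis
           ("no treat", "healthy"): 0}

def expected_utility(d, p_disease):
    # Discrete form of U(d) = ∫ p(o|d) U(o) do with two possible outcomes
    return (p_disease * UTILITY[(d, "diseased")]
            + (1 - p_disease) * UTILITY[(d, "healthy")])

def optimal_decision(p_disease):
    return max(("treat", "no treat"), key=lambda d: expected_utility(d, p_disease))

print(optimal_decision(0.1))  # 'no treat'
print(optimal_decision(0.5))  # 'treat'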
Diagnostic Risk Modeling Assuming (Atypical)
Binary Disease Status
Y   1: diseased, 0: normal
X   vector of subject characteristics (e.g., demographics, risk factors, symptoms)
T   vector of test (biomarker, . . . ) outputs
α   intercept
β   vector of coefficients of X
γ   vector of coefficients of T

pre(X) = Prob[Y = 1 | X] = 1 / (1 + exp[−(α* + β*X)])
post(X, T) = Prob[Y = 1 | X, T] = 1 / (1 + exp[−(α + βX + γT)])

Note: the proportional odds model extends this to ordinal disease severity Y.
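A minimal sketch (not from the slides) of fitting pre(X) and post(X, T) as binary logistic models; the data are simulated and the variable names are hypothetical.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 400
age = rng.uniform(30, 75, n)             # subject characteristic X
marker = rng.normal(0, 1, n)             # test output T
lp = -6 + 0.08 * age + 0.9 * marker      # assumed true linear predictor
y = rng.binomial(1, 1 / (1 + np.exp(-lp)))

# pre(X): Prob[Y = 1 | X], i.e., alpha* and beta* in the notation above
pre_fit = sm.Logit(y, sm.add_constant(age)).fit(disp=0)
pre = pre_fit.predict(sm.add_constant(age))

# post(X, T): Prob[Y = 1 | X, T], i.e., alpha, beta, gamma
XT = sm.add_constant(np.column_stack([age, marker]))
post_fit = sm.Logit(y, XT).fit(disp=0)
post = post_fit.predict(XT)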
Significant Coronary Artery Disease
Pryor et al. [1983]
Bacterial vs. Viral Meningitis
Spanos et al. [1989]
Model for Ordinal Diagnostic Classes
Brazer et al. [1991]
Assessing Diagnostic Yield
Absolute Yield
Pencina et al. [2008]: absolute incremental information in a new set of markers
  Consider the change in predicted risk when the new variables are added
  Average increase in risk of disease when disease present
  + average decrease in risk of disease when disease absent

Formal Test of Added Information
  Likelihood ratio χ² test of partial association of the new markers, adjusted for the old markers
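A minimal sketch of the absolute-yield measure above (the integrated discrimination improvement of Pencina et al. [2008]); the risk vectors below are hypothetical.

import numpy as np

def idi(old_risk, new_risk, y):
    # Mean rise in predicted risk among the diseased (y == 1)
    # plus mean drop among the non-diseased (y == 0)
    old_risk, new_risk, y = map(np.asarray, (old_risk, new_risk, y))
    delta = new_risk - old_risk
    return delta[y == 1].mean() - delta[y == 0].mean()

y   = np.array([1, 1, 1, 1, 0, 0, 0, 0])
old = np.array([0.6, 0.5, 0.7, 0.4, 0.4, 0.3, 0.5, 0.2])
new = np.array([0.7, 0.6, 0.8, 0.5, 0.3, 0.2, 0.4, 0.1])
print(idi(old, new, y))  # 0.2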
Assessing Relative Diagnostic Yield
Variation in relative log odds of disease = Tγ̂, holding X constant
Summarize with Gini's mean difference or the inter-quartile range, then anti-log
E.g.: the typical modification of pre-test odds of disease is by a factor of 3.4
Gini’s mean difference = mean absolute difference between any pair of values
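A minimal sketch of this summary: Gini's mean difference of the test's log-odds contribution Tγ̂, then anti-logged. The values below are hypothetical.

import numpy as np

def gini_mean_difference(x):
    # Mean |x_i - x_j| over all pairs i != j
    x = np.asarray(x, dtype=float)
    n = len(x)
    return np.abs(x[:, None] - x[None, :]).sum() / (n * (n - 1))

t_gamma = np.array([0.2, 1.1, -0.4, 0.9, 1.8])  # hypothetical T * gamma-hat values
print(np.exp(gini_mean_difference(t_gamma)))    # typical factor change in pre-test odds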
Relationship between Odds Ratio and Absolute
Change in Risk
[Figure: increase in risk with T+ (vertical axis, 0 to 0.6) plotted against risk of disease for a subject with T− (horizontal axis, 0 to 1.0), one curve per odds ratio (1.25, 1.5, 1.75, 2, 3, 4, 5, 10). Numbers above curves are odds ratios.]
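The curves above follow from post-test odds = pre-test odds × odds ratio; a minimal sketch (not from the slides):

def risk_given_positive(risk_neg, odds_ratio):
    # Apply the odds ratio on the odds scale, then convert back to risk
    odds = risk_neg / (1 - risk_neg) * odds_ratio
    return odds / (1 + odds)

# With risk 0.2 for T- and an odds ratio of 3, the risk for T+ is about 0.43,
# an absolute increase of about 0.23
print(risk_given_positive(0.2, 3) - 0.2)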
Assessing Absolute Diagnostic Yield: Cohort Study
Patient i = 1, 2, 3, . . . , n
In-sample sufficient statistics: pre(X1), . . . , pre(Xn); post(X1, T1), . . . , post(Xn, Tn)
Summarize with quantile regression to estimate the 10th and 90th percentiles of post as a function of pre

Hlatky et al. [2009]
Assessing Absolute Yield, continued
Out-of-sample assessment: compute pre(X) and post(X, T) for any X and T of interest
Summary measures
  quantile regression curves as a function of pre (Koenker and Bassett [1978]; see the sketch after this list)
  overall mean |post − pre|
  quantiles of post − pre
  du50: distribution of post when pre = 0.5
    diagnostic utility at maximum pre-test uncertainty
    choose X so that pre = 0.5
    examine the distribution of post at this pre
    summarize with quantiles and Gini's mean difference on the probability scale
Special case where the test is binary (atypical): compute post for T+ and for T−
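A minimal sketch of the quantile-regression summary using statsmodels' QuantReg on simulated pre/post probabilities; it is linear in pre for brevity, whereas the example later in the talk uses restricted cubic splines.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
pre = rng.uniform(0.05, 0.95, 400)                          # simulated pre-test probabilities
post = np.clip(pre + rng.normal(0, 0.15, 400), 0.01, 0.99)  # simulated post-test probabilities

X = sm.add_constant(pre)
q10 = sm.QuantReg(post, X).fit(q=0.1).predict(X)  # 0.1 quantile curve
q90 = sm.QuantReg(post, X).fit(q=0.9).predict(X)  # 0.9 quantile curve
print(np.mean(np.abs(post - pre)))                # overall mean |post - pre|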
Assessing Diagnostic Yield: Case-Control & Other
Oversampling Designs
Intercept α is meaningless
Choose X and solve for α so that pre = 0.5
Proceed as above to estimate du50
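A minimal sketch of the intercept adjustment with hypothetical numbers: pre = 0.5 exactly when the linear predictor is zero, so for a chosen covariate vector x0 solve α* = −β*x0.

import numpy as np

beta_star = np.array([0.08, -0.70])  # hypothetical fitted coefficients (e.g., age, sex)
x0 = np.array([55.0, 1.0])           # chosen covariate values
alpha_star = -beta_star @ x0         # makes pre(x0) = 1 / (1 + exp(0)) = 0.5
print(alpha_star)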
Example: Diagnosis of Coronary Artery Disease
(CAD): Test = Total Cholesterol
[Figure: log odds of CAD vs. total cholesterol (100–400 mg%), two panels (3-vessel or left-main CAD; significant CAD), each with curves for ages 40 and 70.]
Relative effect of total cholesterol for ages 40 and 70;
data from the Duke Cardiovascular Disease Databank, n = 2258
Utility of Cholesterol for Diagnosing Significant CAD

[Figure: post-test probability (age + sex + cholesterol) vs. pre-test probability (age + sex), both on a 0–1 scale.]
Curves are 0.1 and 0.9 quantiles from quantile regression using restricted cubic splines
Summary
Diagnostic utility needs to be estimated using measures of relevance to individual decision makers
Improper scoring rules lead to suboptimal decisions
Traditional risk modeling is a powerful tool in this setting
Cohort studies are ideal, but useful measures can be obtained even with oversampling
Avoid categorization of any continuous or ordinal variables

This work used only free software: LaTeX
References
S. R. Brazer, F. S. Pancotto, T. T. Long III, F. E. Harrell, K. L. Lee, M. P. Tyor, and D. B. Pryor. Using
ordinal logistic regression to estimate the likelihood of colorectal neoplasia. J Clin Epi, 44:1263–1270,
1991.
W. M. Briggs and R. Zaretzki. The skill plot: A graphical technique for evaluating continuous diagnostic
tests (with discussion). Biometrics, 64:250–261, 2008.
T. Gneiting and A. E. Raftery. Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc,
102:359–378, 2007.
M. A. Hlatky, D. B. Pryor, F. E. Harrell, R. M. Califf, D. B. Mark, and R. A. Rosati. Factors affecting the
sensitivity and specificity of exercise electrocardiography. Multivariable analysis. Am J Med, 77:64–71,
1984.
M. A. Hlatky, P. Greenland, D. K. Arnett, C. M. Ballantyne, M. H. Criqui, M. S. Elkind, A. S. Go, F. E.
Harrell, Y. Hong, B. V. Howard, V. J. Howard, P. Y. Hsue, C. M. Kramer, J. P. McConnell, S. L.
Normand, C. J. O’Donnell, S. C. Smith, and P. W. Wilson. Criteria for evaluation of novel markers of
cardiovascular risk: a scientific statement from the American Heart Association. Circulation, 119(17):
2408–2416, 2009. American Heart Association Expert Panel on Subclinical Atherosclerotic Diseases and
Emerging Risk Factors and the Stroke Council; PMID 19364974.
R. Koenker and G. Bassett. Regression quantiles. Econometrica, 46:33–50, 1978.
K. G. M. Moons and F. E. Harrell. Sensitivity and specificity should be de-emphasized in diagnostic accuracy
studies. Academic Radiology, 10:670–672, 2003. Editorial.
K. G. M. Moons, G.-A. van Es, J. W. Deckers, J. D. F. Habbema, and D. E. Grobbee. Limitations of
sensitivity, specificity, likelihood ratio, and Bayes’ theorem in assessing diagnostic probabilities: A
clinical example. Epidemiology, 8(1):12–17, 1997.
M. J. Pencina, R. B. D’Agostino Sr, R. B. D’Agostino Jr, and R. S. Vasan. Evaluating the added predictive
ability of a new marker: From area under the ROC curve to reclassification and beyond. Stat Med, 27:
157–172, 2008.
D. B. Pryor, F. E. Harrell, K. L. Lee, R. M. Califf, and R. A. Rosati. Estimating the likelihood of significant
coronary artery disease. Am J Med, 75:771–780, 1983.
A. Spanos, F. E. Harrell, and D. T. Durack. Differential diagnosis of acute meningitis: An analysis of the
predictive value of initial observations. JAMA, 262:2700–2707, 1989.
A. J. Vickers, E. Basch, and M. W. Kattan. Against diagnosis. Ann Int Med, 149:200–203, 2008.
Reducing Bias and Increasing Diagnostic Utility Through Diagnostic
Risk Models
Frank E Harrell Jr
Department of Biostatistics
Vanderbilt University
Medical diagnostic research, as usually practiced, is prone to bias and even more importantly to yielding
information that is not useful to patients or physicians and sometimes overstates the value of diagnostics.
Important sources of these problems are conditioning on the wrong statistical information, reversing the flow
of time, and categorization of inherently continuous test outputs and disease severity. It will be shown that
sensitivity and specificity are not properties of tests in the usual sense of the word, and that they were never
natural choices for describing test performance. This implies that ROC curves are unhelpful. So is categorical
thinking.
The many advantages of diagnostic risk modeling will be discussed, and this talk will show how pre- and
post-test diagnostic models give rise to clinically useful displays of pre-test vs. post-test probabilities that
themselves quantify diagnostic utility in a way that is useful to patients, physicians, and diagnostic device
makers. And unlike sensitivity and specificity, post-test probabilities are immune to certain biases, including
workup bias.