SPECIAL CONTRIBUTION
biostatistics
Introduction to Biostatistics: Part 3,
Sensitivity, Specificity, Predictive Value,
and Hypothesis Testing
Diagnostic tests guide physicians in assessment of clinical disease states, just as statistical tests guide scientists in the testing of scientific hypotheses. Sensitivity and specificity are properties of diagnostic tests and are not predictive of disease in individual patients. Positive and negative predictive values are predictive of disease in patients and are dependent on both the diagnostic test used and the prevalence of disease in the population studied. These concepts are best illustrated by study of a two by two table of possible outcomes of testing, which shows that diagnostic tests may lead to correct or erroneous clinical conclusions. In a similar manner, hypothesis testing may or may not yield correct conclusions. A two by two table of possible outcomes shows that two types of errors in hypothesis testing are possible. One can falsely conclude that a significant difference exists between groups (type I error). The probability of a type I error is α. One can falsely conclude that no difference exists between groups (type II error). The probability of a type II error is β. The consequence and probability of these errors depend on the nature of the research study. Statistical power indicates the ability of a research study to detect a significant difference between populations when a significant difference truly exists. Power equals 1 − β. Because hypothesis testing yields "yes" or "no" answers, confidence intervals can be calculated to complement the results of hypothesis testing. Finally, just as some abnormal laboratory values can be ignored clinically, some statistical differences may not be relevant clinically. [Gaddis GM, Gaddis ML: Introduction to biostatistics: Part 3, sensitivity, specificity, predictive value, and hypothesis testing. Ann Emerg Med May 1990;19:591-597.]
Gary M Gaddis, MD, PhD*
Monica L Gaddis, PhD†
Kansas City, Missouri
From the Departments of Emergency Health Services* and Surgery,† University of Missouri -- Kansas City School of Medicine, Truman Medical Center, Kansas City.
Received for publication September 1,
1989. Accepted for publication
January 30, 1990.
Address for reprints: Gary M Gaddis, MD,
PhD, Department of Emergency Health
Services, University of Missouri -- Kansas
City School of Medicine, Truman Medical
Center, 2301 Holmes, Kansas City,
Missouri 64108.
INTRODUCTION
Diagnostic tests guide the physician in assessment of clinical disease
entities. In a similar manner, statistical inference theory guides the scientist in the testing of scientific hypotheses. Before discussing inferential
techniques (parts 4 and 5 of this series), it is necessary to understand the
basis of hypothesis testing, to gain an appreciation of the type of questions
inferential statistics help answer. Clinical diagnostic testing and hypothesis testing have many parallels, but most clinicians are more familiar
with diagnostic than hypothesis testing. Therefore, this article will focus
on the components of diagnostic testing theory, including sensitivity, specificity, and predictive value. This will be followed by analogies to facilitate
understanding of hypothesis testing.
EVALUATION OF DIAGNOSTIC TESTS
Sensitivity and Specificity
Physicians make medical diagnoses with the aid of the patient history,
physical examination, and diagnostic testing. Numerous new diagnostic
tests are presented each year in the medical literature, and each must be
evaluated before it is introduced into the clinical setting. Most new diagnostic tests are evaluated in relation to another older, previously accepted,
often more invasive, and historically reliable test (the "gold standard"
test). Common examples of gold standards include the use of ECG changes
plus cardiac enzyme levels to diagnose acute myocardial infarction, or pulmonary angiography to diagnose pulmonary embolism. For the purposes of
our discussion, it will be assumed that results obtained by the gold standard test are always correct.

Hypothetically, imagine that a new magnetic resonance imaging (MRI) venogram has been proposed as a noninvasive means of evaluating patients suspected by clinical criteria of having a deep venous thrombosis. The MRI venogram, the proposed new diagnostic test, will be evaluated against the traditional and widely used gold standard, the intravenous contrast venogram. Table 1 shows that there are four possible outcomes of diagnostic testing. Patients can be diagnosed as having deep venous thrombosis or not having deep venous thrombosis by both the gold standard test and by the new MRI diagnostic test, if patients undergo both tests.
In Table 1, 250 patients clinically suspected of having deep venous thrombosis undergo both tests. Of the 250 patients clinically suspected to have deep venous thrombosis, 150 actually do have deep venous thrombosis, with 130 shown to have deep venous thrombosis by both the gold standard test and by the new MRI test. This group of 130 is termed the true positive (TP) group by the new diagnostic test because they are shown to have disease by the new test and are also proven to have disease by the gold standard test. However, 20 of the 150 patients who are proven by the gold standard test to have deep venous thrombosis had a negative MRI diagnostic test. These 20 are termed the false negative (FN) group because they were classified incorrectly as disease free by the new MRI test.

Similarly, 100 of the patients were judged disease free by the contrast venogram, but of these, only 87 had a negative MRI test. This group of 87 constitutes the true negative (TN) group. The remaining 13 were incorrectly classified by the new MRI test as having a deep venous thrombosis, when in fact they did not have the disease. This constitutes the false positive (FP) group.

The two by two outcome table in Table 1 can now be used to help us evaluate how well the new MRI test does in detecting deep venous thrombosis. We want to know the answers to two questions: Is the test sensitive enough to detect the presence of a deep venous thrombosis in a diseased
patient? Is the test specific enough to indicate the absence of deep venous thrombosis disease only in patients who in fact are not afflicted by it?

Sensitivity, which can be thought of as "positivity (of the test) in disease," is derived by working down the first column of Table 1:

Sensitivity (%) = 100 x TP/(TP + FN)

In this example, sensitivity equals 100 x 130/(130 + 20), or 86.7%.

Specificity, which can be thought of as "negativity (of the test) in health," is also derived by working vertically, in the second column of Table 1:

Specificity (%) = 100 x TN/(TN + FP)

Here, specificity equals 100 x 87/(87 + 13), or 87.0%.

TABLE 1. Gold standard versus diagnostic test

Diagnostic Test            Gold Standard Test (Contrast Venogram)
(MRI Venogram)             Disease Evident    No Disease Evident    Total
Disease Evident            TP (130)           FP (13)               143
No Disease Evident         FN (20)            TN (87)               107
Total                      150                100                   250

TABLE 2. Gold standard versus diagnostic test

Diagnostic Test            Gold Standard Test (Contrast Venogram)
(MRI Venogram)             Disease Evident    No Disease Evident    Total
Disease Evident            TP (35)            FP (21)               56
No Disease Evident         FN (5)             TN (139)              144
Total                      40                 160                   200

TABLE 3. Possible outcomes of hypothesis testing

Decision From              Reality
Statistical Test           H0 False, H1 True               H0 True, H1 False
Reject H0, Accept H1       Correct, No Error (A)           Incorrect, Type I Error (B)
Accept H0, Reject H1       Incorrect, Type II Error (C)    Correct, No Error (D)
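The column-wise arithmetic above can be sketched in a few lines of code. The counts are those of Table 1; the function names are illustrative, not from the article.

```python
# Sensitivity and specificity computed from 2 x 2 outcome-table counts.
# TP and FN come from the gold standard's "disease evident" column;
# TN and FP come from its "no disease evident" column.

def sensitivity(tp: int, fn: int) -> float:
    """Positivity in disease: 100 x TP / (TP + FN)."""
    return 100.0 * tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """Negativity in health: 100 x TN / (TN + FP)."""
    return 100.0 * tn / (tn + fp)

# Table 1: TP = 130, FN = 20, TN = 87, FP = 13
print(round(sensitivity(130, 20), 1))  # 86.7
print(round(specificity(87, 13), 1))   # 87.0
```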
The ideal diagnostic test would be 100% sensitive and 100% specific, and thus would have no FP or FN outcomes. Because virtually all diagnostic tests have some FP and FN outcomes, they do not have 100% sensitivity and specificity.

Unfortunately, many clinicians believe that sensitivity and specificity can be used to predict whether an individual patient is diseased or disease free. This is an error. Sensitivity and specificity are merely properties of a test. Sensitivity and specificity should not be used to make predictive statements about an individual patient.

FIGURE 1. Operating characteristic curve. β is dependent on α, n, and Δ. In this example, α is fixed at .05. All else held constant, increasing Δ or increasing n decreases β.

TABLE 4. Prior probability and chance of error

                        Chance of Error
Prior Probability       Type I      Type II
Low                     High        Low
High                    Low         High

Predictive Value
Predictive values can be used to help predict the likelihood of disease in an individual. A positive predictive value (PPV) is useful to indicate the proportion of individuals who actually have the disease when the diagnostic test indicates the presence
of that disease. A negative predictive value (NPV) is useful to determine the proportion of individuals who are truly free of the disease tested for when the diagnostic test indicates the absence of that disease.

Predictive values are derived by working horizontally on the two by two outcome table in Table 1:

PPV (%) = 100 x TP/(TP + FP)

NPV (%) = 100 x TN/(TN + FN)

From the example in Table 1, PPV = 100 x 130/(130 + 13), or 90.9%, and NPV = 100 x 87/(87 + 20), or 81.3%.

PPV and NPV are affected by the prevalence of disease in the population. Prevalence is defined as the proportion of the population afflicted by the disease in question. In the example in Table 1, the prevalence of deep venous thrombosis when it was clinically suspected was 60% because the total number of patients studied was 250, and the number of patients who actually had a contrast venogram (the gold standard test) indicative of deep venous thrombosis was 150.
Next, the effects of decreased prevalence of deep venous thrombosis on the predictive value of the MRI venogram test will be examined. Imagine a sample of 200 patients, only 20% of whom have a deep venous thrombosis (prevalence, 20%). This group is depicted in Table 2. Because 20% of the patients have a deep venous thrombosis, the sum of TP + FN in column 1 must be 0.2 x 200, or 40. Of these, about 35 will constitute the TP group because the sensitivity of the test has already been shown to be 86.7% (0.867 x 40 = 34.7). The remaining five can be expected to be in the FN group because sensitivity is a property of the test independent of disease prevalence. Because the prevalence of deep venous thrombosis is only 20%, the remaining 0.8 x 200, or 160, will not have a deep venous thrombosis, so the sum of TN + FP results in column 2 will be 160. Of this set of 160, 87%, or about 139, will be in the TN group, and the remaining 21 will be in the FP group because specificity is also a property of the test, independent of disease prevalence.

The change of prevalence markedly influences the PPV and NPV values obtained (Table 2). With a 20% prevalence, the PPV falls to 100 x 35/(35 + 21), or 62.5%, while the NPV increases to 100 x 139/(139 + 5), or 96.5%. Note that as disease prevalence falls, the PPV of any test will fall and the NPV of any test will increase.
From this, it is easy to see why many new diagnostic tests that seem from initial reports to be useful may not represent a diagnostic improvement when in common use. Many diagnostic tests are validated in settings on populations with a high prevalence of the disease for which testing is done. However, when the new test is used in different clinical settings with a lower prevalence of
that disease, the test does not perform up to reported expectations. A clinical example of the interrelationship between prevalence of disease and predictive value is the use of amylase levels to screen for pancreatitis. An elevated amylase level is more likely indicative of pancreatitis in persons previously afflicted with pancreatitis than it is predictive of pancreatitis among all patients with abdominal pain or other possible causes of an elevated serum amylase level.
In summary, sensitivity and specificity are properties that indicate the degree of reliability of a diagnostic test. Sensitivity and specificity do not indicate predictive value. Predictive values can be applied to an individual patient's test result and are affected by the prevalence of the disease in the population to which the test is applied. The PPV will fall and the NPV will rise as the prevalence of disease decreases.
HYPOTHESIS TESTING
Formulation of the Hypothesis
Statistical inference involves the testing of hypotheses. A hypothesis is a numerical statement about an unknown parameter.3 Just as a two by two table can be constructed for the four possible outcomes of a clinical diagnostic test, a two by two table can be constructed for the four possible outcomes of hypothesis testing.
Before constructing this table, it is necessary to understand what a hypothesis states. The first step in hypothesis testing is a statement of a hypothesis in positive terms. This defines the "research" or "alternative" hypothesis, H1.2 For example, one could hypothesize that experienced emergency physicians (those with more than five years of full-time postgraduate emergency department experience) can examine, diagnose, and treat more patients per hour than inexperienced emergency physicians (less than five years of full-time ED experience).

The next step is to state the "null" or "statistical" hypothesis, H0, which follows logically from H1.1,2
The hypothesis tested statistically is H0. In this example, H0 would state, "Experienced emergency physicians and inexperienced emergency physicians do not differ significantly in the number of patients they can examine, diagnose, and treat per hour."

FIGURE 2. Clinical testing.

Sensitivity: The ability of a test to reliably detect the presence of disease (positivity in disease). Sensitivity (%) = 100 x TP/(TP + FN)
Specificity: The ability of a test to reliably detect the absence of disease (negativity in health). Specificity (%) = 100 x TN/(TN + FP)
Prevalence: The proportion of the population with disease. Prevalence (%) = 100 x (TP + FN)/n
Positive Predictive Value: The proportion of individuals with disease when the presence of disease is indicated by the diagnostic test. PPV (%) = 100 x TP/(TP + FP)
Negative Predictive Value: The proportion of individuals free of disease when the absence of disease is indicated by the diagnostic test. NPV (%) = 100 x TN/(TN + FN)
TN, true negative; FN, false negative; TP, true positive; FP, false positive.
We "reject" or "fail to reject" ("accept") H0 based on our inferential statistical testing.1-3 H0 hypothesizes a difference of zero between population samples tested, while H1 hypothesizes a nonzero difference between population samples tested. There exist an infinite number of possible nonzero differences between populations. Therefore, the reason that H0 rather than H1 is tested is that mathematically, H0 theorizes a single magnitude of difference between populations studied, and it is possible to statistically assess this single hypothesis. In contrast, H1 is actually an infinite number of hypotheses because there exist an infinite number of possible magnitudes of difference between populations.4 It would be impossible to calculate the required statistics for each of the infinite number of possible magnitudes of difference between population samples H1 hypothesizes.
If H0 is "accepted" as tenable, then H1 must be "rejected," and vice versa, because the two hypotheses are mutually exclusive. When H0 is tested, the probability that numerical differences between population samples are not due strictly to chance is assessed.2 H0 does recognize that nonzero differences between groups are possible, even if two samples of the same population are tested, simply due to random scatter of the data.2 If H0 is "accepted" as tenable, this signifies the likelihood that no significant difference exists between the populations studied and that any numerical differences between groups are due to chance alone. If H0 is rejected, this signifies that a significant difference does exist between the populations studied and that the numerical differences between the groups are not due to chance alone.
Errors in Hypothesis Testing
Hypothesis testing may lead to erroneous inferential statistical conclusions, just as diagnostic testing may lead to erroneous diagnostic conclusions. Just as a two by two table of possible outcomes of diagnostic tests can be constructed, so can a two by two table of possible outcomes of inferential statistical tests be constructed (Table 3). Two types of incorrect conclusions are possible. Box B of Table 3 indicates cases in which the statistical test falsely indicates that a significant difference exists between groups, when in fact no true difference exists. It is analogous to a false-positive diagnostic test result. In other words, box B shows cases where H0 is rejected when it is in fact true. This rejection of H0 when H0 is true is arbitrarily called a type I error.1-3
Box C of Table 3 indicates cases in which the statistical test falsely indicates the lack of a significant difference between groups, when in fact a true difference exists (H1 is true). This is analogous to a false-negative diagnostic test result. In other words, box C shows cases in which H0 is accepted when it is in fact false. The acceptance of H0 when H0 is false is arbitrarily called a type II error.1-3
FIGURE 3. Hypothesis testing.

Research (Alternative) Hypothesis (H1): A hypothesis that states a difference exists between two (or more) populations studied. H1 is a positive statement that a difference exists between groups.
Null (Statistical) Hypothesis (H0): A hypothesis of no difference between two or more populations studied. H0 is a negative statement, that no difference exists between groups.
Type I Error: To reject the null hypothesis (H0) when in fact H0 is true; to falsely conclude that a significant difference exists between populations.
Type II Error: To accept the null hypothesis (H0) when in fact H0 is false; to falsely conclude that no significant difference exists between populations.
Alpha (α): The probability of making a type I error.
P < .05: Statistical calculations from the experimental data indicate that the probability of making a type I error is less than 5%.
Beta (β): The probability of making a type II error.
Power: The ability of an experiment to find a significant difference between populations, when in fact a significant difference truly exists. Power = 1 − β.
Delta (Δ): The degree of difference between populations tested.
Operating Characteristic Curve: A function that relates the dependent variable β to independent values of α, Δ, and n.
Prior Probability: The likelihood that a hypothesized difference between populations is in fact correct.
Box A and box D of Table 3 denote correct conclusions, analogous to true-positive and true-negative diagnostic test results. Thus, Table 3 shows that there exist two correct and two incorrect conclusions possible whenever H0 is tested.

Next, the probability of making incorrect conclusions must be assessed. The probability of making a type I error is defined as alpha (α).1,2,4 α is derived from the raw data, statistical calculations, and statistical tables appropriate for the inferential statistical test used. By convention, statistical significance is generally accepted if the probability α of making a type I error is less than .05, which is commonly denoted on figures and tables as P < .05.3,4
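What α = .05 means can be illustrated with a small simulation (ours, not the article's): draw two samples from the same population many times and count how often a two-tailed z test with known σ declares them "significantly different." Roughly one comparison in 20 crosses the |z| > 1.96 threshold by chance alone.

```python
import random
from statistics import mean

# Simulate repeated two-sample z tests when H0 is true (both samples
# come from the same normal population, sigma known) and count the
# proportion of "significant" results at alpha = .05 (|z| > 1.96).

random.seed(1990)
SIGMA, N, TRIALS = 1.0, 30, 4000
false_positives = 0

for _ in range(TRIALS):
    a = [random.gauss(0.0, SIGMA) for _ in range(N)]
    b = [random.gauss(0.0, SIGMA) for _ in range(N)]
    se = SIGMA * (2.0 / N) ** 0.5        # standard error of the mean difference
    z = (mean(a) - mean(b)) / se
    if abs(z) > 1.96:                    # rejecting H0 here is a type I error
        false_positives += 1

print(false_positives / TRIALS)          # close to 0.05
```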
Though conventional, selection of an alpha level of .05 as the crucial level of significance is arbitrary. Accepting significance at α = .05 means that it is recognized that one time out of 20, a type I error will be committed, a consequence that the investigator is willing to accept. If the consequences of making a type I error are judged to be sufficiently severe, it may be appropriate to select more stringent levels of α, such as .01, as the cutoff for statistical significance. When a caption or text indicates that for some statistical comparison, P = .XY, the probability of a type I error, based on the calculations performed for that inferential statistical test, is .XY, and the reader is left to judge whether this level of α is indicative of a true difference between populations tested. Another advantage of the reporting of P values is that the arbitrary designation of significance at .05, and the improper and arbitrary designation of a trend if .10 > P > .05, can be avoided.
The probability of making a type II error is defined as beta (β).1,2,4 β is more difficult to derive than α, and unlike α, actually is not one single probability value. β is often ignored by researchers.5 However, it is important. If some treatment yields a 10% increase in survival or a 10% decrease in some complication, it would likely be readily incorporated into medical practice. Unfortunately, numerous clinical trials have suffered from errors of experimental design that cause β to be unacceptably high, such that type II errors are easily made, and treatments that are significantly better than older methods are rejected because of statistical artifact resulting from poor experimental design.5 By convention, β should be less than .20, and ideally less than .10, to minimize the chance of making a type II error.6
α and β are interrelated. All else held constant (such as the populations studied, the number of subjects, and the method of testing), as α is arbitrarily decreased, β is increased. As α is increased, β is decreased.1,2
Statistical power is defined as (1 − β).1,2,4 Because β indicates the probability of making a type II error, power indicates mathematically the probability of not making a type II error. Power is analogous to sensitivity in hypothesis testing. Sensitivity indicates the probability that the diagnostic test can detect disease when it is present. Power indicates the probability that the statistical test can detect significant differences between populations, when in fact such differences truly exist.
Power depends on several variables:1,2,4,7

α: As α increases, β decreases, and power increases.

n (sample size): As n increases, power increases.

The magnitude of the difference actually present between the populations tested, delta (Δ): Just as it is easier to find a pitchfork than a needle in a haystack, so it is easier to find a large difference than it is to find a small difference between populations tested.

One-tailed versus two-tailed tests: One-tailed tests are more powerful than two-tailed tests, because a statistical test result must not vary as much from the mean to achieve significance at any level of α chosen. (If α is .05, for a two-tailed test, a result must fall in either the top or bottom 2.5% of results to achieve significance, but for a one-tailed test, the result must merely fall in either the top or bottom 5% of a distribution.) In the original hypothesis example about how quickly emergency physicians can treat patients, the appropriate test would be one-tailed, because H1 specifies the direction of the difference between groups hypothesized.

Parametric versus nonparametric statistical testing: Parametric tests are generally more powerful. (This will be further discussed in Part 4 of this series.)

Use of proper experimental design and statistics: Errors in these areas decrease power.
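Under simplifying assumptions (a two-sample z test with known σ), the dependence of power on α, n, Δ, and the number of tails can be written out explicitly. This is a sketch of the standard normal-approximation formula, not a calculation from the article.

```python
from statistics import NormalDist

def power(delta: float, sigma: float, n: int, alpha: float = 0.05,
          two_tailed: bool = True) -> float:
    """Approximate power of a two-sample z test (normal approximation).

    delta: true difference between population means
    sigma: common population standard deviation
    n:     subjects per group
    """
    nd = NormalDist()
    z_crit = nd.inv_cdf(1 - alpha / 2) if two_tailed else nd.inv_cdf(1 - alpha)
    se = sigma * (2.0 / n) ** 0.5            # SE of the difference in means
    return nd.cdf(delta / se - z_crit)       # probability of rejecting H0

# All else held constant: larger n, larger delta, larger alpha, or a
# one-tailed test each increase power (and so decrease beta = 1 - power).
print(power(0.5, 1.0, 30))                    # two-tailed baseline
print(power(0.5, 1.0, 60))                    # more subjects -> more power
print(power(0.5, 1.0, 30, two_tailed=False))  # one-tailed -> more power
```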
Because so many variables can affect β, β is not one single value. This follows from the fact that α is the probability of erroneously concluding that H0 is false, and H0 specifies a single magnitude of difference between populations. However, as has been explained, β is the probability of erroneously concluding that H1 is false, and H1 hypothesizes an infinite number of possible magnitudes of difference between populations tested. β is expressed as a function of Δ, n, and α by a function called the operating characteristic curve of the test5 (Figure 1).

The most common use of β is in the calculation of the approximate number of subjects that must be studied to keep α and β acceptably small. This calculation uses estimates of population standard deviations and estimates of Δ, acceptable values of α and β, and numbers from statistical tables, to derive a value of n of sufficient size. The determination of adequate sample size for an experiment is readily referenced.8-10
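The sample-size calculation described above can be sketched with the usual normal-approximation formula for two equal groups. The function and its defaults are illustrative assumptions, not the article's own method.

```python
import math
from statistics import NormalDist

def n_per_group(delta: float, sigma: float,
                alpha: float = 0.05, beta: float = 0.20) -> int:
    """Approximate subjects per group for a two-sample, two-tailed
    comparison of means: n = 2 * ((z_alpha/2 + z_beta) * sigma / delta)^2."""
    nd = NormalDist()
    z_a = nd.inv_cdf(1 - alpha / 2)   # about 1.96 for alpha = .05
    z_b = nd.inv_cdf(1 - beta)        # about 0.84 for 80% power
    return math.ceil(2 * ((z_a + z_b) * sigma / delta) ** 2)

# Detecting a difference of half a standard deviation with alpha = .05
# and power = .80 takes roughly 63 subjects per group by this formula.
print(n_per_group(delta=0.5, sigma=1.0))  # 63
```

Halving Δ roughly quadruples the required n, which is why small expected differences demand large trials.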
P Values Versus Confidence Intervals
Hypothesis testing yields yes or no answers about statistical significance, answers that can be fraught with errors, and answers that may represent oversimplifications. P values imply little about the magnitude of difference present between populations. Therefore, some feel that the use of confidence intervals (CIs) is complementary or even preferable to the use of P values in reporting clinical data.11 (Confidence intervals were discussed in part 2 of this series.12) It is correct to report both CI and P values for scientific data, and the two are often complementary.1,11
Clinical Versus Statistical Significance
Statistically significant numerical differences between study groups may not be clinically significant or relevant. An analogy to clinical testing is again useful. It is common experience to ignore or place little emphasis on a single diagnostic test result that lies outside the expected range for that test when large numbers of tests are done. An example is the interpretation of an isolated elevated amylase level in a patient having otherwise normal routine laboratory data after a normal screening physical examination at his family physician's office. Many experienced clinicians can intuitively sense when to place little emphasis on isolated laboratory test results outside the normal range when an abnormal result is not expected. Alternatively stated, when there is very little prior probability of disease, an isolated abnormal laboratory value is generally not cause for great concern, and the clinician avoids a clinical error analogous to a type I error by avoiding concluding that disease is present in a disease-free patient.
Similarly, if enough statistical comparisons are made, eventually type I and type II statistical errors are inevitable. The problem comes in discerning which statistically significant differences are meaningful and which are meaningless. Just as prevalence affects the predictive value of a positive diagnostic test, so the prior probability of a difference affects the predictive value of a statistical test. Prior probability is an expression of how likely a hypothesis will be true when assessed before doing statistical calculations. Prior probability is derived from previously available knowledge that led to the formulation of the hypothesis being tested.

When a hypothesis has a low prior probability of being true, yet achieves statistical significance, such as a link between coffee consumption and pancreatic cancer,13 a significant result must be interpreted cautiously. Furthermore, if a type I error is being made, repetitive study will probably not replicate a significant difference, as subsequently occurred in the case of the alleged link between coffee consumption and pancreatic cancer.14 However, in cases of high prior probability, a significant statistical difference is usually correct, just as in cases of high disease prevalence, a positive clinical test result is more likely to be correct.

Table 4 summarizes the interrelationship between prior probability and the chance of making a type I or type II error. This relationship is further explained by Bayes theorem, which the reader is invited to explore.
SUMMARY
An understanding of the interpretation of diagnostic tests facilitates an understanding of hypothesis testing. A diagnostic test result may be a true-positive, true-negative, false-positive, or false-negative result. For diagnostic tests, sensitivity and specificity are properties of the diagnostic test and do not indicate predictive value. Prevalence of disease is a determinant of the predictive value of both positive and negative test results.

Similarly, hypothesis testing can yield erroneous results. A false-positive result, which accepts the presence of a significant difference between populations when in fact no significant difference exists (type I error), occurs with a probability of α. A false-negative result, rejecting the presence of a significant difference between populations when in fact they actually do differ (type II error), occurs with a probability of β.

Power is 1 − β, and is analogous to the sensitivity of a diagnostic test in that both sensitivity and power address whether a test can detect what it is designed to detect. As sensitivity and specificity are not predictive, so also power is not predictive. As prevalence of disease affects the predictive value of a positive test result, so the prior probability of a difference being present affects the predictive value of a significant statistical test result. Figures 2 and 3 summarize these points.
REFERENCES
1. Hopkins KD, Glass GV: Basic Statistics for the Behavioral Sciences. Englewood Cliffs, New Jersey, Prentice-Hall, Inc, 1978.
2. Keppel G: Design and Analysis: A Researcher's Handbook. Englewood Cliffs, New Jersey, Prentice-Hall, Inc, 1978.
3. Elenbaas RM, Elenbaas JK, Cuddy PG: Evaluating the medical literature, Part II: Statistical analysis. Ann Emerg Med 1983;12:610-620.
4. Sokal RR, Rohlf FJ: Biometry (ed 2). New York, WH Freeman and Co, 1981.
5. Freiman JA, Chalmers TC, Smith H, et al: The importance of beta, the type II error, and sample size in the design and interpretation of the randomized clinical trial. N Engl J Med 1978;299:690-694.
6. Reed JF, Slaichert W: Statistical proof in inconclusive "negative" trials. Arch Intern Med 1981;141:1307-1310.
7. Cohen J: Differences between proportions, in Statistics in Medicine. Boston, Little, Brown & Co, 1974.
8. Arkin CG, Wachtel MS: How many patients are necessary to assess test performance? JAMA 1990;263:275-278.
9. Fleiss JL: Statistical Methods for Rates and Proportions (ed 2). New York, John Wiley & Sons, 1981.
10. Young MJ, Bresnitz EA, Strom BL: Sample size nomograms for interpreting negative clinical studies. Ann Intern Med 1983;99:248-251.
11. Gardner MJ, Altman DG: Confidence intervals rather than P values: Estimation rather than hypothesis testing. Br Med J 1986;292:746-750.
12. Gaddis GM, Gaddis ML: Introduction to biostatistics: Part 2, descriptive statistics. Ann Emerg Med 1990;19:309-315.
13. MacMahon B, Yen S, Trichopoulos D, et al: Coffee and cancer of the pancreas. N Engl J Med 1981;304:630-633.
14. Gorham ED, Garland CF, Garland FL, et al: Coffee and pancreatic cancer in a rural California county. West J Med 1988;148:48-51.