Download CORP: Minimizing the chances of false positives

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts

Receiver operating characteristic wikipedia, lookup

History of statistics wikipedia, lookup

Student's t-test wikipedia, lookup

Foundations of statistics wikipedia, lookup

Statistical hypothesis testing wikipedia, lookup

Misuse of statistics wikipedia, lookup

Transcript
J Appl Physiol 122: 91–95, 2017.
First published December 1, 2016; doi:10.1152/japplphysiol.00937.2016.
REVIEW
Cores of Reproducibility in Physiology
CORP: Minimizing the chances of false positives and false negatives
Douglas Curran-Everett
Division of Biostatistics and Bioinformatics, National Jewish Health, Denver, Colorado; and Department of Biostatistics
and Informatics, Colorado School of Public Health, University of Colorado Denver, Denver, Colorado
Submitted 21 October 2016; accepted in final form 23 November 2016
estimation; hypothesis test; power; reproducibility; significance test
[Populations] always display variation ... .
—Sir Ronald A. Fisher (1925)
get a false negative. In this review I outline strategies to accomplish each of these things. These strategies are related intimately
to fundamental concepts of statistics and the inherent uncertainty
embedded in them. I begin with a brief overview of the null
hypothesis, the foundation of any scientific experiment.
IT IS ONLY BECAUSE POPULATIONS always display variation that
statistics is essential to the process of scientific discovery. An
inescapable tenet of statistics, however, is the notion of uncertainty that has reared its head within the arena of reproducibility of research.
It just so happens that improving the reproducibility of
research is also the focus of the Journal of Applied Physiology’s recent initiative, “Cores of Reproducibility in Physiology”
(22). Each article in this initiative is designed to elucidate the
principles and nuances of using some piece of scientific equipment or some experimental technique so that other researchers
can obtain reproducible results (22). But other researchers can
use some piece of equipment or some technique with expert
skill and still fail to replicate a pivotal experimental result if
they neglect to consider fundamental concepts of statistics (6,
7, 11, 14) and their inescapable connection to the reproducibility of research.
If we want to improve the reproducibility of our research, then
we want to minimize the chance that we get a false positive
and—at the same time—we want to minimize the chance that we
When we design an experiment, we want to test a scientific
idea. By tradition, this idea is called the null hypothesis (see
Ref. 7). When we make an inference about a null hypothesis,
we can make a mistake (Fig. 1). We can reject a true null
hypothesis—we get a false positive— or we can fail to reject a
false null hypothesis—we get a false negative.1
We control the chance that we get a false positive when we
define the critical significance level ␣, the probability that we
reject the null hypothesis given that the null hypothesis is true.
When we define ␣, we declare that we are willing to reject a
true null hypothesis 100␣% of the time.
In contrast, we control the chance that we get a false
negative when we prescribe instead the chance that we get a
true positive. We do this when we define power, the probability
that we reject the null hypothesis given that the null hypothesis
Address for reprint requests and other correspondence: D. Curran-Everett,
Division of Biostatistics and Bioinformatics, M222, National Jewish Health,
1400 Jackson St., Denver, CO 80206-2761 (e-mail: [email protected]).
1
Neyman and Pearson described a false positive as an error of the first kind
and a false negative as an error of the second kind (see Ref. 7). Errors of the
first kind and second kind are known also as type I and type II errors.
http://www.jappl.org
The Null Hypothesis: a Scientific Idea
8750-7587/17 Copyright © 2017 the American Physiological Society
91
Downloaded from http://jap.physiology.org/ by 10.220.33.5 on January 8, 2017
Curran-Everett D. CORP: Minimizing the chances of false positives and false
negatives. J Appl Physiol 122: 91–95, 2017. First published December 1, 2016;
doi:10.1152/japplphysiol.00937.2016.—Statistics is essential to the process of
scientific discovery. An inescapable tenet of statistics, however, is the notion
of uncertainty which has reared its head within the arena of reproducibility of
research. The Journal of Applied Physiology’s recent initiative, “Cores of Reproducibility in Physiology,” is designed to improve the reproducibility of research:
each article is designed to elucidate the principles and nuances of using some piece
of scientific equipment or some experimental technique so that other researchers
can obtain reproducible results. But other researchers can use some piece of
equipment or some technique with expert skill and still fail to replicate an
experimental result if they neglect to consider the fundamental concepts of statistics
of hypothesis testing and estimation and their inescapable connection to the
reproducibility of research. If we want to improve the reproducibility of our
research, then we want to minimize the chance that we get a false positive and—at
the same time—we want to minimize the chance that we get a false negative. In this
review I outline strategies to accomplish each of these things. These strategies are
related intimately to fundamental concepts of statistics and the inherent uncertainty
embedded in them.
92
Cores of Reproducibility in Physiology • Curran-Everett D
is false. In general, four things affect power: the critical significance level ␣, the standard deviation ␴ of the underlying population, the sample size n, and the magnitude of the difference we
want to be able to detect (Table 1; see also Ref. 8).
Fig. 2. The distribution of observed P values from 500,000 replications of a
simulation in which the null hypothesis, H0: ⌬␮ ⫽ 0, is true. We drew at
random two samples, each with n observations, from a standard normal
distribution (see Refs. 7 and 9), did a two-sample t-test (using a critical
significance level ␣ ⫽ 0.05), and then repeated this process to generate a total
of 500,000 replications. As expected (see Ref. 7), in ⬃ 5% of the replications
(gray area), P ⬍ 0.05: we reject a true null hypothesis. In other words, we
obtain a false positive (see Fig. 1). The proportion of false positives depends
solely on the critical significance level ␣. After Ref. 9, Fig. 1.
might ask if our two samples came from the same or different
populations. This means we define the null and alternative
hypotheses, H0 and H1, as
H0 : The samples come from the same population.
H1 : The samples come from the different populations.
If we want to know whether the populations have the same
mean, we can write these as
When the Null Hypothesis Is True: to Minimize False
Positives
When we test a null hypothesis, rarely, if ever, do we expect
that null hypothesis to be true (see Refs. 7 and 11). Rather, we
usually know—at the very least, suspect—that our null hypothesis is not exactly true. If we happen to be mistaken about the
validity of our null hypothesis, however, we would like to
guard against an unsubstantiated discovery: we would like to
avoid a false positive.
As I posited in Ref. 9, suppose we want to learn if some
intervention affects the biological thing we care about. If we
use two groups—a control group and a treated group—we
Table 1. Power estimates for a two-sample t-test
Sample Size, n
␣
Effect Size
10
20
30
40
50
60
0.01
0.25
0.50
0.75
1.00
0.02
0.06
0.15
0.29
0.03
0.14
0.38
0.67
0.05
0.24
0.60
0.88
0.07
0.35
0.76
0.96
0.09
0.45
0.87
0.99
0.11
0.55
0.93
1.00
0.05
0.25
0.50
0.75
1.00
0.08
0.18
0.35
0.56
0.12
0.34
0.64
0.87
0.16
0.48
0.81
0.97
0.20
0.60
0.91
0.99
0.24
0.70
0.96
1.00
0.27
0.78
0.98
1.00
0.10
0.25
0.50
0.75
1.00
0.13
0.28
0.49
0.69
0.19
0.46
0.75
0.93
0.25
0.61
0.89
0.99
0.30
0.72
0.95
1.00
0.34
0.80
0.98
1.00
0.39
0.86
0.99
1.00
Values represent power for different combinations of sample size, critical
significance level ␣, and effect size. Effect size is the difference between
population means ␮1 and ␮0 expressed as a fraction of the population standard
deviation ␴: |␮1 ⫺ ␮0|/␴ ⫽ ⌬␮/␴ (see Ref. 8). An effect size of 0.5 means that
⌬␮ ⫽ 0.5␴, and an effect size of 1 means that ⌬␮ ⫽ ␴. If all else is equal,
power increases when the sample size increases, the critical significance level
␣ increases, and the effect size increases.
H0 : ⌬␮ ⫽ 0
H1 : ⌬␮ ⫽ 0,
where ⌬␮, the difference in population means, is the difference
between the means of the treated and control populations.
In any experiment, regardless of the number of observations,
the chance that we get a false positive depends only on the
critical significance level ␣: we will obtain a false positive
100␣% of the time (Fig. 2).2 If we want to minimize the chance that we get a false positive, then we want ␣ to be more
stringent than the traditional 0.05. A smaller, more stringent ␣
2
This refers to a false positive that occurs for statistical reasons. Bias or
experimental error can result in a spurious true positive: there is an apparent
effect for which there is convincing statistical evidence but that is detectable
only because of bias or error.
Fig. 3. The populations. Population 0 is a standard normal distribution with
mean ␮0 ⫽ 0 and standard deviation ␴ ⫽ 1. Population 1 is a normal distribution with mean ␮1 ⫽ 0.5 and standard deviation ␴ ⫽ 1. Therefore, the true
difference in population means, ⌬␮, is ⌬␮ ⫽ ␮1 ⫺ ␮0 ⫽ 0.5. [Reprinted with
permission from Ref. 9.]
J Appl Physiol • doi:10.1152/japplphysiol.00937.2016 • www.jappl.org
Downloaded from http://jap.physiology.org/ by 10.220.33.5 on January 8, 2017
Fig. 1. The relationship between the state of a null hypothesis (rows) and our
inference about that null hypothesis (columns) after we analyze our data. If we
reject a true null hypothesis, we make an error of the first kind (see Ref. 7): we
get a false positive. If we fail to reject a false null hypothesis, we make an error
of the second kind (see Ref. 7): we get a false negative. If we want to improve
the reproducibility of our research, then we want to do two things simultaneously: maximize the chance that we get a true negative and maximize the
chance that we get a true positive. This means we want to minimize the chance
that we get a false positive and minimize the chance that we get a false
negative (see text).
93
Cores of Reproducibility in Physiology • Curran-Everett D
Table 2. Probability a duplicate experiment will achieve
P ⬍ 0.05
P1
Pr {P2 ⬍␣ ⫽ 0.05}
0.40
0.30
0.20
0.10
0.05
0.025
0.01
0.005
0.001
0.00031
0.00002
0.13
0.18
0.25
0.38
0.50
0.61
0.73
0.80
0.91
0.95
0.99
Values represent P1, the P value from an initial experiment, and Pr {P2 ⬍
␣ ⫽ 0.05}, the probability that the P value from a duplicate experiment will be
less than ␣ ⫽ 0.05. Ref. 9 discusses the calculation of these probabilities. See
Fig. 4.
n
Power
%Inflated
10
20
30
40
50
60
64
70
80
90
100
200
0.18
0.34
0.48
0.60
0.70
0.78
0.80
0.84
0.88
0.92
0.94
1.00
100
100
96
83
73
64
63
60
57
55
54
50
Values represent the results of 10,000 replications of a two-sample t-test in
which the n observations for each sample were drawn at random from the
populations in Fig. 3. For each replication, we obtained a P value with which
to assess our null hypothesis H0: ⌬␮ ⫽ 0 and ⌬y៮ , the difference in sample
means that estimates the true difference ⌬␮ ⫽ 0.5. Let us consider only
those simulations for which we reject our null hypothesis: that is, for which
P ⬍ 0.05. When power was 0.18 (n ⫽ 10), the estimated difference ⌬y៮
exceeded the true difference ⌬␮ in all those statistically significant simulations: 100% of the ⌬y៮ values were inflated. In contrast, when power was
1.00 (n ⫽ 200), the estimated difference ⌬y៮ exceeded the true difference
⌬␮ in only half those simulations: 50% of the ⌬y៮ values were inflated.
After Ref. 9, Fig. 5.
When the Null Hypothesis Is False: to Minimize False
Negatives
Now imagine we truly do know that our null hypothesis is
false. Suppose we use the same null and alternative hypotheses
we just did:
H0 : ⌬␮ ⫽ 0
H1 : ⌬␮ ⫽ 0.
If we define ⌬␮, the difference in population means, to be 0.5
units (Fig. 3), then we know our null hypothesis is false.
It—almost— goes without saying that we would like to
reject our null hypothesis if it is false. We can up our chances
of doing that if we design our experiment so that power, the
probability we reject our null hypothesis given it is false, is
high. Because we have defined our populations, power depends
only on the number of observations in our two groups (8).
With this background we are ready to explore the relationship of fundamental concepts of statistics— hypothesis testing
and estimation—to reproducibility.3
Hypothesis testing. Suppose we have a unique experimental
result that is statistically unusual. Had we defined ␣ ⫽ 0.05,
this means our P value is less than 0.05. If our null hypothesis
is true, our result is unusual. We reject our null hypothesis. We
have discovered something! We assume our result reflects a
general phenomenon that other researchers will reproduce if
they repeat our experiment. But will they?
If the P value from our initial experiment is 0.05, then the
probability a duplicate experiment will achieve P ⬍ 0.05—the
probability it will achieve statistical significance—is 50%
3
The theoretical foundation of the next section is developed in—and was
adapted from—Ref. 9.
Fig. 4. Probability a second experiment will result in P ⬍ 0.05 when the test statistic is z. If the critical significance level ␣ is 0.05, then the critical value z*
is 1.96 (see Ref. 7). Suppose we conduct an experiment for which we compute an observed value z1 that happens to be more extreme than z*: that is, z1 ⬎ z*.
This means P1, the P value associated with z1, is less than the critical significance level ␣. This is unusual if the null hypothesis is true: we have a statistically
significant result. If we assume the magnitude of the effect observed in this initial experiment equals the magnitude of the true effect, then the probability that
the P value from a second experiment will be less than 0.05 is depicted by the gray area. See Table 2. [Adapted with permission from Ref. 9.]
J Appl Physiol • doi:10.1152/japplphysiol.00937.2016 • www.jappl.org
Downloaded from http://jap.physiology.org/ by 10.220.33.5 on January 8, 2017
will also help us minimize the chance that we get a false
negative when we attempt to reproduce the results of a pivotal,
pioneering experiment.
Table 3. Percent inflated estimates of ⌬␮ when P ⬍ 0.05
94
Cores of Reproducibility in Physiology • Curran-Everett D
(Table 2). If the P value from our initial experiment is 0.01,
then the probability a duplicate experiment will achieve P ⬍
0.05 is about 75%. Only when the P value from our initial
experiment is 0.001 does the probability a duplicate experiment will achieve P ⬍ 0.05 exceed 90% (Fig. 4). Power,
experimental design, and the actual test statistic (see Ref. 7)
have little impact on this phenomenon (3).
Estimation. In contrast, power—with its inherent connection
to the statistical benchmark of ␣— does impact the reproducibility of point and interval estimates4 of the magnitude of
some biologic effect (18). If an experiment of lower power
happens to reject its null hypothesis—if the effect is statistically unusual—then our estimate of the magnitude of that
effect will be exaggerated (9, 15); see Table 3 and Ref. 9.
Experimental design and the actual test statistic have little
impact on this phenomenon (18).
The Impact of Multiple Comparisons
Summary
If we want to improve the reproducibility of our research,
then we want to revamp how we apply the fundamental
concepts of hypothesis testing and estimation to our science.
This is how we can do that:
4
A sample mean y៮ estimates some population mean ␮: a sample mean is
a point estimate for some population mean. A confidence interval [y៮ ⫺a,
y៮ ⫺a], where a is an allowance, is a range that we expect, with some level
of confidence, to include the true value of a population parameter such as
the mean: a confidence interval is an interval estimate for some population
mean.
Table 4. Probability that at least one of k comparisons will
reject a true null hypothesis
Number of Comparisons, k
␣
1
2
3
4
0.05
0.01
0.005
0.001
0.05
0.01
0.005
0.001
0.10
0.02
0.010
0.002
0.14
0.03
0.015
0.003
0.19
0.04
0.020
0.004
5
...
10
...
20
...
100
0.23 . . . 0.40 . . . 0.64 . . . 0.99
0.05
0.10
0.18
0.63
0.025
0.05
0.10
0.39
0.005
0.01
0.02
0.10
Values assume that each of k independent null hypotheses in a family—a
group—is true and is tested at a critical significance level ␣. For any ␣, as
the number of comparisons in a family increases, it is more likely that at
least one true null hypothesis will be rejected by mistake. In other words,
it is more likely that we will obtain at least one false positive. If ␣ ⫽ 0.05
and we make 10 comparisons, then there is a 40% chance that we will obtain
at least one false positive. If we make enough comparisons, then we will
almost inevitably obtain at least one false positive.
Research workers . . . have to accustom themselves to the
fact that in many branches of research the really critical
experiment is rare, and that it is frequently necessary to combine the results of numbers of experiments dealing with the
same issue in order to form a satisfactory picture of the true
situation. . . . In such circumstances a number of experiments
of moderate accuracy are of far greater value than a single
experiment of very high accuracy.
In addition to the strategies listed above, perhaps it is time
to also revisit that decades-old cornerstone of reproducibility.
ACKNOWLEDGMENTS
I dedicate this review to John Ludbrook (retired, Department of Surgery,
The University of Melbourne, Melbourne, Victoria, Australia) who, in his own
words, was a sometime academic surgeon, a sometime applied cardiovascular
physiologist, and a biostatistician accredited by the Statistical Society of
Australia. John and I have been corresponding since 2004 when he congratulated Dale Benos and me on the publication of our guidelines for reporting
statistics. Over the years, on those rare occasions when he was able to tear
himself away from watching cricket matches, John graciously reviewed some
of the papers in my Explorations in Statistics series. My papers are better for
it. And I have enjoyed tremendously our correspondence.
I thank Peter D. Wagner (University of California, San Diego, La Jolla,
California) for his encouragement to write this review and for his helpful
comments.
5
6
See Ref. 9 for some resources that can help with this.
See Ref. 9 for details.
J Appl Physiol • doi:10.1152/japplphysiol.00937.2016 • www.jappl.org
Downloaded from http://jap.physiology.org/ by 10.220.33.5 on January 8, 2017
When we do an experiment, we typically test more than one
null hypothesis. This makes sense: an experiment can be costly
in terms of time and resources, and we want to maximize what
we learn. When we test more than one null hypothesis—when
we have a family of comparisons—we are more likely to reject
a true null hypothesis simply by virtue of the fact that we are
testing more than one null hypothesis (Table 4). In this setting,
we can minimize the chance that we get a false positive if we
use a statistical procedure that controls for this phenomenon.
The false discovery rate procedure has advantages over the
Bonferroni procedure (1, 2, 5).
• When we design an experiment, estimate sample size so
that power approaches 0.90.5 This minimizes the chance
that we get a false negative and makes it less likely that we
exaggerate the magnitude of the true effect.
• Define the critical significance level ␣—the benchmark
for how statistically unusual our result needs to be before
we reject our null hypothesis—to be 0.005 or 0.001 (19,
20). This minimizes the chance that we get a false positive
and makes it more likely a subsequent experiment will
achieve P ⬍ 0.05.
• Power and ␣ impact sample size. If we define ␣ to be a
more stringent 0.005 and if we design our experiment so
that power approaches 0.90, might the sample size for our
experiment be so large as to be practically impossible?
Not necessarily (20).
• Think less about a simple P value and more about the
scientific importance of the confidence interval bounds for
some experimental result (6, 10, 11, 14, 15, 20). Even a
convincing P value of 0.005 can be associated with a
confidence interval whose bounds indicate an effect that is
scientifically inconsequential: for example, [0.004, 0.01]
(see Ref. 6).
• Treat with caution the potential magnitude of that experimental result (18).
• Control for multiple comparisons (5, 10). This minimizes
the chance that we get a false positive simply because we
make more than one comparison in a single experiment.
• Rely on repeated studies to accumulate evidence for some
biological phenomenon (12, 17, 23). We can do this using
simple inspection, as envisioned by Fisher (12, 13), or a
formal meta-analysis (4, 16).
The notion of using repeated studies to accumulate evidence
for some phenomenon is long established in science (12, 13,
21).6 In 1951 Yates (23) wrote
Cores of Reproducibility in Physiology • Curran-Everett D
DISCLOSURES
No conflicts of interest, financial or otherwise, are declared by the
author.
REFERENCES
11. Curran-Everett D, Taylor S, Kafadar K. Fundamental concepts in statistics: elucidation and illustration. J Appl Physiol (1985) 85: 775–786, 1998.
12. Fisher RA. Statistical Methods for Research Workers. London: Oliver and
Boyd, 1925.
13. Fisher RA. The arrangement of field experiments. J Minist Agric GB 33:
503–513, 1926. http://hdl.handle.net/2440/15191.
14. Gardner MJ, Altman DG. Confidence intervals rather than P values:
estimation rather than hypothesis testing. Br Med J (Clin Res Ed) 292:
746 –750, 1986. doi:10.1136/bmj.292.6522.746.
15. Halsey LG, Curran-Everett D, Vowler SL, Drummond GB. The fickle
P value generates irreproducible results. Nat Methods 12: 179 –185, 2015.
doi:10.1038/nmeth.3288.
16. Hedges LV, Olkin I. Statistical Methods for Meta-Analysis. New York:
Academic, 1985.
17. Ioannidis JPA. Why most published research findings are false. PLoS
Med 2: e124, 2005. doi: 10.1371/journal.pmed.0020124.
18. Ioannidis JPA. Why most discovered true associations are inflated.
Epidemiology 19: 640 –648, 2008. doi:10.1097/EDE.0b013e31818131e7.
19. Johnson VE. Revised standards for statistical evidence. Proc Natl Acad
Sci U S A 110: 19313–19317, 2013. doi:10.1073/pnas.1313476110.
20. Sterne JAC, Davey Smith G. Sifting the evidence-what’s wrong with
significance tests? BMJ 322: 226 –231, 2001. doi:10.1136/bmj.322.
7280.226.
21. Student. The probable error of a mean. Biometrika 6: 1–25, 1908.
doi:10.1093/biomet/6.1.1.
22. Wagner PD. New opportunities for authors. J Appl Physiol (1985) 121:
365–366, 2016. doi:10.1152/japplphysiol.00555.2016.
23. Yates F. The influence of Statistical Methods for Research Workers on
the development of the science of statistics. J Am Stat Assoc 46: 19 –34,
1951.
J Appl Physiol • doi:10.1152/japplphysiol.00937.2016 • www.jappl.org
Downloaded from http://jap.physiology.org/ by 10.220.33.5 on January 8, 2017
1. Benjamini Y, Drai D, Elmer G, Kafkafi N, Golani I. Controlling the
false discovery rate in behavior genetics research. Behav Brain Res 125:
279 –284, 2001. doi:10.1016/S0166-4328(01)00297-2.
2. Benjamini Y, Yekutieli D. The control of the false discovery rate in
multiple testing under dependency. Ann Stat 29: 1165–1188, 2001.
3. Boos DD, Stefanski LA. P-value precision and reproducibility. Am Stat
65: 213–221, 2011. doi:10.1198/tas.2011.10129.
4. Borenstein H, Hedges LV, Higgins JPT, Rothstein H. Introduction to MetaAnalysis. Chichester, UK: Wiley, 2009. doi:10.1002/9780470743386.
5. Curran-Everett D. Multiple comparisons: philosophies and illustrations.
Am J Physiol Regul Integr Comp Physiol 279: R1–R8, 2000.
6. Curran-Everett D. Explorations in statistics: confidence intervals. Adv
Physiol Educ 33: 87–90, 2009. doi:10.1152/advan.00006.2009.
7. Curran-Everett D. Explorations in statistics: hypothesis tests and P
values. Adv Physiol Educ 33: 81–86, 2009. doi:10.1152/advan.
90218.2008.
8. Curran-Everett D. Explorations in statistics: power. Adv Physiol Educ
34: 41–43, 2010. doi:10.1152/advan.00001.2010.
9. Curran-Everett D. Explorations in statistics: statistical facets of reproducibility. Adv Physiol Educ 40: 248 –252, 2016. doi:10.1152/advan.
00042.2016.
10. Curran-Everett D, Benos DJ. Guidelines for reporting statistics in
journals published by the American Physiological Society. J Appl Physiol
97: 457–459, 2004. doi:10.1152/japplphysiol.00513.2004.
95