Survey

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Survey

Document related concepts

Transcript

J Appl Physiol 122: 91–95, 2017. First published December 1, 2016; doi:10.1152/japplphysiol.00937.2016. REVIEW Cores of Reproducibility in Physiology CORP: Minimizing the chances of false positives and false negatives Douglas Curran-Everett Division of Biostatistics and Bioinformatics, National Jewish Health, Denver, Colorado; and Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Denver, Denver, Colorado Submitted 21 October 2016; accepted in final form 23 November 2016 estimation; hypothesis test; power; reproducibility; significance test [Populations] always display variation ... . —Sir Ronald A. Fisher (1925) get a false negative. In this review I outline strategies to accomplish each of these things. These strategies are related intimately to fundamental concepts of statistics and the inherent uncertainty embedded in them. I begin with a brief overview of the null hypothesis, the foundation of any scientific experiment. IT IS ONLY BECAUSE POPULATIONS always display variation that statistics is essential to the process of scientific discovery. An inescapable tenet of statistics, however, is the notion of uncertainty that has reared its head within the arena of reproducibility of research. It just so happens that improving the reproducibility of research is also the focus of the Journal of Applied Physiology’s recent initiative, “Cores of Reproducibility in Physiology” (22). Each article in this initiative is designed to elucidate the principles and nuances of using some piece of scientific equipment or some experimental technique so that other researchers can obtain reproducible results (22). But other researchers can use some piece of equipment or some technique with expert skill and still fail to replicate a pivotal experimental result if they neglect to consider fundamental concepts of statistics (6, 7, 11, 14) and their inescapable connection to the reproducibility of research. If we want to improve the reproducibility of our research, then we want to minimize the chance that we get a false positive and—at the same time—we want to minimize the chance that we When we design an experiment, we want to test a scientific idea. By tradition, this idea is called the null hypothesis (see Ref. 7). When we make an inference about a null hypothesis, we can make a mistake (Fig. 1). We can reject a true null hypothesis—we get a false positive— or we can fail to reject a false null hypothesis—we get a false negative.1 We control the chance that we get a false positive when we define the critical significance level ␣, the probability that we reject the null hypothesis given that the null hypothesis is true. When we define ␣, we declare that we are willing to reject a true null hypothesis 100␣% of the time. In contrast, we control the chance that we get a false negative when we prescribe instead the chance that we get a true positive. We do this when we define power, the probability that we reject the null hypothesis given that the null hypothesis Address for reprint requests and other correspondence: D. Curran-Everett, Division of Biostatistics and Bioinformatics, M222, National Jewish Health, 1400 Jackson St., Denver, CO 80206-2761 (e-mail: [email protected]). 1 Neyman and Pearson described a false positive as an error of the first kind and a false negative as an error of the second kind (see Ref. 7). Errors of the first kind and second kind are known also as type I and type II errors. http://www.jappl.org The Null Hypothesis: a Scientific Idea 8750-7587/17 Copyright © 2017 the American Physiological Society 91 Downloaded from http://jap.physiology.org/ by 10.220.33.5 on January 8, 2017 Curran-Everett D. CORP: Minimizing the chances of false positives and false negatives. J Appl Physiol 122: 91–95, 2017. First published December 1, 2016; doi:10.1152/japplphysiol.00937.2016.—Statistics is essential to the process of scientific discovery. An inescapable tenet of statistics, however, is the notion of uncertainty which has reared its head within the arena of reproducibility of research. The Journal of Applied Physiology’s recent initiative, “Cores of Reproducibility in Physiology,” is designed to improve the reproducibility of research: each article is designed to elucidate the principles and nuances of using some piece of scientific equipment or some experimental technique so that other researchers can obtain reproducible results. But other researchers can use some piece of equipment or some technique with expert skill and still fail to replicate an experimental result if they neglect to consider the fundamental concepts of statistics of hypothesis testing and estimation and their inescapable connection to the reproducibility of research. If we want to improve the reproducibility of our research, then we want to minimize the chance that we get a false positive and—at the same time—we want to minimize the chance that we get a false negative. In this review I outline strategies to accomplish each of these things. These strategies are related intimately to fundamental concepts of statistics and the inherent uncertainty embedded in them. 92 Cores of Reproducibility in Physiology • Curran-Everett D is false. In general, four things affect power: the critical significance level ␣, the standard deviation of the underlying population, the sample size n, and the magnitude of the difference we want to be able to detect (Table 1; see also Ref. 8). Fig. 2. The distribution of observed P values from 500,000 replications of a simulation in which the null hypothesis, H0: ⌬ ⫽ 0, is true. We drew at random two samples, each with n observations, from a standard normal distribution (see Refs. 7 and 9), did a two-sample t-test (using a critical significance level ␣ ⫽ 0.05), and then repeated this process to generate a total of 500,000 replications. As expected (see Ref. 7), in ⬃ 5% of the replications (gray area), P ⬍ 0.05: we reject a true null hypothesis. In other words, we obtain a false positive (see Fig. 1). The proportion of false positives depends solely on the critical significance level ␣. After Ref. 9, Fig. 1. might ask if our two samples came from the same or different populations. This means we define the null and alternative hypotheses, H0 and H1, as H0 : The samples come from the same population. H1 : The samples come from the different populations. If we want to know whether the populations have the same mean, we can write these as When the Null Hypothesis Is True: to Minimize False Positives When we test a null hypothesis, rarely, if ever, do we expect that null hypothesis to be true (see Refs. 7 and 11). Rather, we usually know—at the very least, suspect—that our null hypothesis is not exactly true. If we happen to be mistaken about the validity of our null hypothesis, however, we would like to guard against an unsubstantiated discovery: we would like to avoid a false positive. As I posited in Ref. 9, suppose we want to learn if some intervention affects the biological thing we care about. If we use two groups—a control group and a treated group—we Table 1. Power estimates for a two-sample t-test Sample Size, n ␣ Effect Size 10 20 30 40 50 60 0.01 0.25 0.50 0.75 1.00 0.02 0.06 0.15 0.29 0.03 0.14 0.38 0.67 0.05 0.24 0.60 0.88 0.07 0.35 0.76 0.96 0.09 0.45 0.87 0.99 0.11 0.55 0.93 1.00 0.05 0.25 0.50 0.75 1.00 0.08 0.18 0.35 0.56 0.12 0.34 0.64 0.87 0.16 0.48 0.81 0.97 0.20 0.60 0.91 0.99 0.24 0.70 0.96 1.00 0.27 0.78 0.98 1.00 0.10 0.25 0.50 0.75 1.00 0.13 0.28 0.49 0.69 0.19 0.46 0.75 0.93 0.25 0.61 0.89 0.99 0.30 0.72 0.95 1.00 0.34 0.80 0.98 1.00 0.39 0.86 0.99 1.00 Values represent power for different combinations of sample size, critical significance level ␣, and effect size. Effect size is the difference between population means 1 and 0 expressed as a fraction of the population standard deviation : |1 ⫺ 0|/ ⫽ ⌬/ (see Ref. 8). An effect size of 0.5 means that ⌬ ⫽ 0.5, and an effect size of 1 means that ⌬ ⫽ . If all else is equal, power increases when the sample size increases, the critical significance level ␣ increases, and the effect size increases. H0 : ⌬ ⫽ 0 H1 : ⌬ ⫽ 0, where ⌬, the difference in population means, is the difference between the means of the treated and control populations. In any experiment, regardless of the number of observations, the chance that we get a false positive depends only on the critical significance level ␣: we will obtain a false positive 100␣% of the time (Fig. 2).2 If we want to minimize the chance that we get a false positive, then we want ␣ to be more stringent than the traditional 0.05. A smaller, more stringent ␣ 2 This refers to a false positive that occurs for statistical reasons. Bias or experimental error can result in a spurious true positive: there is an apparent effect for which there is convincing statistical evidence but that is detectable only because of bias or error. Fig. 3. The populations. Population 0 is a standard normal distribution with mean 0 ⫽ 0 and standard deviation ⫽ 1. Population 1 is a normal distribution with mean 1 ⫽ 0.5 and standard deviation ⫽ 1. Therefore, the true difference in population means, ⌬, is ⌬ ⫽ 1 ⫺ 0 ⫽ 0.5. [Reprinted with permission from Ref. 9.] J Appl Physiol • doi:10.1152/japplphysiol.00937.2016 • www.jappl.org Downloaded from http://jap.physiology.org/ by 10.220.33.5 on January 8, 2017 Fig. 1. The relationship between the state of a null hypothesis (rows) and our inference about that null hypothesis (columns) after we analyze our data. If we reject a true null hypothesis, we make an error of the first kind (see Ref. 7): we get a false positive. If we fail to reject a false null hypothesis, we make an error of the second kind (see Ref. 7): we get a false negative. If we want to improve the reproducibility of our research, then we want to do two things simultaneously: maximize the chance that we get a true negative and maximize the chance that we get a true positive. This means we want to minimize the chance that we get a false positive and minimize the chance that we get a false negative (see text). 93 Cores of Reproducibility in Physiology • Curran-Everett D Table 2. Probability a duplicate experiment will achieve P ⬍ 0.05 P1 Pr {P2 ⬍␣ ⫽ 0.05} 0.40 0.30 0.20 0.10 0.05 0.025 0.01 0.005 0.001 0.00031 0.00002 0.13 0.18 0.25 0.38 0.50 0.61 0.73 0.80 0.91 0.95 0.99 Values represent P1, the P value from an initial experiment, and Pr {P2 ⬍ ␣ ⫽ 0.05}, the probability that the P value from a duplicate experiment will be less than ␣ ⫽ 0.05. Ref. 9 discusses the calculation of these probabilities. See Fig. 4. n Power %Inflated 10 20 30 40 50 60 64 70 80 90 100 200 0.18 0.34 0.48 0.60 0.70 0.78 0.80 0.84 0.88 0.92 0.94 1.00 100 100 96 83 73 64 63 60 57 55 54 50 Values represent the results of 10,000 replications of a two-sample t-test in which the n observations for each sample were drawn at random from the populations in Fig. 3. For each replication, we obtained a P value with which to assess our null hypothesis H0: ⌬ ⫽ 0 and ⌬y , the difference in sample means that estimates the true difference ⌬ ⫽ 0.5. Let us consider only those simulations for which we reject our null hypothesis: that is, for which P ⬍ 0.05. When power was 0.18 (n ⫽ 10), the estimated difference ⌬y exceeded the true difference ⌬ in all those statistically significant simulations: 100% of the ⌬y values were inflated. In contrast, when power was 1.00 (n ⫽ 200), the estimated difference ⌬y exceeded the true difference ⌬ in only half those simulations: 50% of the ⌬y values were inflated. After Ref. 9, Fig. 5. When the Null Hypothesis Is False: to Minimize False Negatives Now imagine we truly do know that our null hypothesis is false. Suppose we use the same null and alternative hypotheses we just did: H0 : ⌬ ⫽ 0 H1 : ⌬ ⫽ 0. If we define ⌬, the difference in population means, to be 0.5 units (Fig. 3), then we know our null hypothesis is false. It—almost— goes without saying that we would like to reject our null hypothesis if it is false. We can up our chances of doing that if we design our experiment so that power, the probability we reject our null hypothesis given it is false, is high. Because we have defined our populations, power depends only on the number of observations in our two groups (8). With this background we are ready to explore the relationship of fundamental concepts of statistics— hypothesis testing and estimation—to reproducibility.3 Hypothesis testing. Suppose we have a unique experimental result that is statistically unusual. Had we defined ␣ ⫽ 0.05, this means our P value is less than 0.05. If our null hypothesis is true, our result is unusual. We reject our null hypothesis. We have discovered something! We assume our result reflects a general phenomenon that other researchers will reproduce if they repeat our experiment. But will they? If the P value from our initial experiment is 0.05, then the probability a duplicate experiment will achieve P ⬍ 0.05—the probability it will achieve statistical significance—is 50% 3 The theoretical foundation of the next section is developed in—and was adapted from—Ref. 9. Fig. 4. Probability a second experiment will result in P ⬍ 0.05 when the test statistic is z. If the critical significance level ␣ is 0.05, then the critical value z* is 1.96 (see Ref. 7). Suppose we conduct an experiment for which we compute an observed value z1 that happens to be more extreme than z*: that is, z1 ⬎ z*. This means P1, the P value associated with z1, is less than the critical significance level ␣. This is unusual if the null hypothesis is true: we have a statistically significant result. If we assume the magnitude of the effect observed in this initial experiment equals the magnitude of the true effect, then the probability that the P value from a second experiment will be less than 0.05 is depicted by the gray area. See Table 2. [Adapted with permission from Ref. 9.] J Appl Physiol • doi:10.1152/japplphysiol.00937.2016 • www.jappl.org Downloaded from http://jap.physiology.org/ by 10.220.33.5 on January 8, 2017 will also help us minimize the chance that we get a false negative when we attempt to reproduce the results of a pivotal, pioneering experiment. Table 3. Percent inflated estimates of ⌬ when P ⬍ 0.05 94 Cores of Reproducibility in Physiology • Curran-Everett D (Table 2). If the P value from our initial experiment is 0.01, then the probability a duplicate experiment will achieve P ⬍ 0.05 is about 75%. Only when the P value from our initial experiment is 0.001 does the probability a duplicate experiment will achieve P ⬍ 0.05 exceed 90% (Fig. 4). Power, experimental design, and the actual test statistic (see Ref. 7) have little impact on this phenomenon (3). Estimation. In contrast, power—with its inherent connection to the statistical benchmark of ␣— does impact the reproducibility of point and interval estimates4 of the magnitude of some biologic effect (18). If an experiment of lower power happens to reject its null hypothesis—if the effect is statistically unusual—then our estimate of the magnitude of that effect will be exaggerated (9, 15); see Table 3 and Ref. 9. Experimental design and the actual test statistic have little impact on this phenomenon (18). The Impact of Multiple Comparisons Summary If we want to improve the reproducibility of our research, then we want to revamp how we apply the fundamental concepts of hypothesis testing and estimation to our science. This is how we can do that: 4 A sample mean y estimates some population mean : a sample mean is a point estimate for some population mean. A confidence interval [y ⫺a, y ⫺a], where a is an allowance, is a range that we expect, with some level of confidence, to include the true value of a population parameter such as the mean: a confidence interval is an interval estimate for some population mean. Table 4. Probability that at least one of k comparisons will reject a true null hypothesis Number of Comparisons, k ␣ 1 2 3 4 0.05 0.01 0.005 0.001 0.05 0.01 0.005 0.001 0.10 0.02 0.010 0.002 0.14 0.03 0.015 0.003 0.19 0.04 0.020 0.004 5 ... 10 ... 20 ... 100 0.23 . . . 0.40 . . . 0.64 . . . 0.99 0.05 0.10 0.18 0.63 0.025 0.05 0.10 0.39 0.005 0.01 0.02 0.10 Values assume that each of k independent null hypotheses in a family—a group—is true and is tested at a critical significance level ␣. For any ␣, as the number of comparisons in a family increases, it is more likely that at least one true null hypothesis will be rejected by mistake. In other words, it is more likely that we will obtain at least one false positive. If ␣ ⫽ 0.05 and we make 10 comparisons, then there is a 40% chance that we will obtain at least one false positive. If we make enough comparisons, then we will almost inevitably obtain at least one false positive. Research workers . . . have to accustom themselves to the fact that in many branches of research the really critical experiment is rare, and that it is frequently necessary to combine the results of numbers of experiments dealing with the same issue in order to form a satisfactory picture of the true situation. . . . In such circumstances a number of experiments of moderate accuracy are of far greater value than a single experiment of very high accuracy. In addition to the strategies listed above, perhaps it is time to also revisit that decades-old cornerstone of reproducibility. ACKNOWLEDGMENTS I dedicate this review to John Ludbrook (retired, Department of Surgery, The University of Melbourne, Melbourne, Victoria, Australia) who, in his own words, was a sometime academic surgeon, a sometime applied cardiovascular physiologist, and a biostatistician accredited by the Statistical Society of Australia. John and I have been corresponding since 2004 when he congratulated Dale Benos and me on the publication of our guidelines for reporting statistics. Over the years, on those rare occasions when he was able to tear himself away from watching cricket matches, John graciously reviewed some of the papers in my Explorations in Statistics series. My papers are better for it. And I have enjoyed tremendously our correspondence. I thank Peter D. Wagner (University of California, San Diego, La Jolla, California) for his encouragement to write this review and for his helpful comments. 5 6 See Ref. 9 for some resources that can help with this. See Ref. 9 for details. J Appl Physiol • doi:10.1152/japplphysiol.00937.2016 • www.jappl.org Downloaded from http://jap.physiology.org/ by 10.220.33.5 on January 8, 2017 When we do an experiment, we typically test more than one null hypothesis. This makes sense: an experiment can be costly in terms of time and resources, and we want to maximize what we learn. When we test more than one null hypothesis—when we have a family of comparisons—we are more likely to reject a true null hypothesis simply by virtue of the fact that we are testing more than one null hypothesis (Table 4). In this setting, we can minimize the chance that we get a false positive if we use a statistical procedure that controls for this phenomenon. The false discovery rate procedure has advantages over the Bonferroni procedure (1, 2, 5). • When we design an experiment, estimate sample size so that power approaches 0.90.5 This minimizes the chance that we get a false negative and makes it less likely that we exaggerate the magnitude of the true effect. • Define the critical significance level ␣—the benchmark for how statistically unusual our result needs to be before we reject our null hypothesis—to be 0.005 or 0.001 (19, 20). This minimizes the chance that we get a false positive and makes it more likely a subsequent experiment will achieve P ⬍ 0.05. • Power and ␣ impact sample size. If we define ␣ to be a more stringent 0.005 and if we design our experiment so that power approaches 0.90, might the sample size for our experiment be so large as to be practically impossible? Not necessarily (20). • Think less about a simple P value and more about the scientific importance of the confidence interval bounds for some experimental result (6, 10, 11, 14, 15, 20). Even a convincing P value of 0.005 can be associated with a confidence interval whose bounds indicate an effect that is scientifically inconsequential: for example, [0.004, 0.01] (see Ref. 6). • Treat with caution the potential magnitude of that experimental result (18). • Control for multiple comparisons (5, 10). This minimizes the chance that we get a false positive simply because we make more than one comparison in a single experiment. • Rely on repeated studies to accumulate evidence for some biological phenomenon (12, 17, 23). We can do this using simple inspection, as envisioned by Fisher (12, 13), or a formal meta-analysis (4, 16). The notion of using repeated studies to accumulate evidence for some phenomenon is long established in science (12, 13, 21).6 In 1951 Yates (23) wrote Cores of Reproducibility in Physiology • Curran-Everett D DISCLOSURES No conflicts of interest, financial or otherwise, are declared by the author. REFERENCES 11. Curran-Everett D, Taylor S, Kafadar K. Fundamental concepts in statistics: elucidation and illustration. J Appl Physiol (1985) 85: 775–786, 1998. 12. Fisher RA. Statistical Methods for Research Workers. London: Oliver and Boyd, 1925. 13. Fisher RA. The arrangement of field experiments. J Minist Agric GB 33: 503–513, 1926. http://hdl.handle.net/2440/15191. 14. Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. Br Med J (Clin Res Ed) 292: 746 –750, 1986. doi:10.1136/bmj.292.6522.746. 15. Halsey LG, Curran-Everett D, Vowler SL, Drummond GB. The fickle P value generates irreproducible results. Nat Methods 12: 179 –185, 2015. doi:10.1038/nmeth.3288. 16. Hedges LV, Olkin I. Statistical Methods for Meta-Analysis. New York: Academic, 1985. 17. Ioannidis JPA. Why most published research findings are false. PLoS Med 2: e124, 2005. doi: 10.1371/journal.pmed.0020124. 18. Ioannidis JPA. Why most discovered true associations are inflated. Epidemiology 19: 640 –648, 2008. doi:10.1097/EDE.0b013e31818131e7. 19. Johnson VE. Revised standards for statistical evidence. Proc Natl Acad Sci U S A 110: 19313–19317, 2013. doi:10.1073/pnas.1313476110. 20. Sterne JAC, Davey Smith G. Sifting the evidence-what’s wrong with significance tests? BMJ 322: 226 –231, 2001. doi:10.1136/bmj.322. 7280.226. 21. Student. The probable error of a mean. Biometrika 6: 1–25, 1908. doi:10.1093/biomet/6.1.1. 22. Wagner PD. New opportunities for authors. J Appl Physiol (1985) 121: 365–366, 2016. doi:10.1152/japplphysiol.00555.2016. 23. Yates F. The influence of Statistical Methods for Research Workers on the development of the science of statistics. J Am Stat Assoc 46: 19 –34, 1951. J Appl Physiol • doi:10.1152/japplphysiol.00937.2016 • www.jappl.org Downloaded from http://jap.physiology.org/ by 10.220.33.5 on January 8, 2017 1. Benjamini Y, Drai D, Elmer G, Kafkafi N, Golani I. Controlling the false discovery rate in behavior genetics research. Behav Brain Res 125: 279 –284, 2001. doi:10.1016/S0166-4328(01)00297-2. 2. Benjamini Y, Yekutieli D. The control of the false discovery rate in multiple testing under dependency. Ann Stat 29: 1165–1188, 2001. 3. Boos DD, Stefanski LA. P-value precision and reproducibility. Am Stat 65: 213–221, 2011. doi:10.1198/tas.2011.10129. 4. Borenstein H, Hedges LV, Higgins JPT, Rothstein H. Introduction to MetaAnalysis. Chichester, UK: Wiley, 2009. doi:10.1002/9780470743386. 5. Curran-Everett D. Multiple comparisons: philosophies and illustrations. Am J Physiol Regul Integr Comp Physiol 279: R1–R8, 2000. 6. Curran-Everett D. Explorations in statistics: confidence intervals. Adv Physiol Educ 33: 87–90, 2009. doi:10.1152/advan.00006.2009. 7. Curran-Everett D. Explorations in statistics: hypothesis tests and P values. Adv Physiol Educ 33: 81–86, 2009. doi:10.1152/advan. 90218.2008. 8. Curran-Everett D. Explorations in statistics: power. Adv Physiol Educ 34: 41–43, 2010. doi:10.1152/advan.00001.2010. 9. Curran-Everett D. Explorations in statistics: statistical facets of reproducibility. Adv Physiol Educ 40: 248 –252, 2016. doi:10.1152/advan. 00042.2016. 10. Curran-Everett D, Benos DJ. Guidelines for reporting statistics in journals published by the American Physiological Society. J Appl Physiol 97: 457–459, 2004. doi:10.1152/japplphysiol.00513.2004. 95