Eur J Epidemiol (2013) 28:939–944
DOI 10.1007/s10654-013-9861-4

DIALOGUE

The researcher and the consultant: a dialogue on null hypothesis significance testing

Andreas Stang · Charles Poole

Received: 23 July 2013 / Accepted: 30 October 2013 / Published online: 14 November 2013
© Springer Science+Business Media Dordrecht 2013

Abstract: Since its introduction, null hypothesis significance testing (NHST) has caused much debate. Many publications on common misunderstandings have appeared. Despite the many cautions, NHST remains one of the most prevalent, misused and abused statistical procedures in the biomedical literature. This article is directed at practicing researchers with limited statistical background who are driven by subject matter questions and have empirical data to be analyzed. We use a dialogue, as in ancient Greek literature, for didactic purposes. We illustrate several, though only a few, irritations that can come up when a researcher with minimal statistical background, but a good sense of what she wants her study to do and of what she wants to do with her study, asks a statistician for consultation. We provide insights into the meaning of several concepts, including null and alternative hypotheses, one- and two-sided null hypotheses, statistical models, test statistics, rejection and acceptance regions, type I and II errors, p values, and the frequentists' concept of endless study repetitions.

Keywords: Significance testing · p value · Type I error · Type II error · Estimation

A. Stang: Medical Faculty, Institute of Clinical Epidemiology, Martin-Luther-University of Halle-Wittenberg, Magdeburger Str. 8, 06097 Halle, Germany; e-mail: [email protected]
A. Stang: Department of Epidemiology, School of Public Health, Boston University, 715 Albany Street, Talbot Building, Boston, MA 02118, USA
C. Poole: Department of Epidemiology, Gillings School of Global Public Health, University of North Carolina, Chapel Hill, NC 27599-7435, USA; e-mail: [email protected]

Introduction

Since its introduction early in the twentieth century, null hypothesis significance testing (NHST) has caused much debate. It is a constantly mutating hybridization of Fisher significance testing (ST) and Neyman–Pearson null hypothesis testing (NHT) [7]. Many publications on common misunderstandings have appeared. A count in 2000 of over 300 warnings of limitations of ST, NHT and NHST [1] was followed a year later by a list of 402 references (http://warnercnr.colostate.edu/~anderson/thompson1.html, accessed July 17, 2012), among which we found 89 in biomedical publications. Despite the many cautions, NHST remains one of the most prevalent, misused and abused statistical procedures in the biomedical literature.

This article aims to illustrate some of the many misconceptions of ST, NHT, and NHST. In principle, it does not present new methodological issues. However, it uses a different didactic modality to clarify several misconceptions: a dialogue between a researcher and a statistician (called the Consultant). Potential irritations of the researcher are flagged. This article is directed at practicing researchers with limited statistical background who are driven by subject matter questions and have empirical data to be analyzed.

Researcher: I want to study the association between an immunohistochemical factor A and the prognosis of patients with skin melanoma. Can you help me with the data analysis?

Consultant: I can help you. However, I do not really understand your study question. Can you be more specific?
Researcher: There is a hypothesis that the presence of factor A is associated with a poorer prognosis than its absence.

Consultant: What do you mean by "poorer prognosis"?

Researcher: I mean a higher 5-year overall mortality risk.

Consultant: We have to re-formulate your substantive hypothesis into a statistical null hypothesis that may be rejected by your data (irritation #1). An appropriate null hypothesis would be, "There is either no association or an inverse association between factor A and the 5-year mortality risk among patients with newly diagnosed skin melanoma." The alternative hypothesis would be, "There is a positive association." However, there is another option. If factor A is associated with a decreased mortality risk, would it be of interest for you to detect this association?

Researcher: Yes, of course, though I believe a risk reduction is much less likely.

Consultant: If you believe it is possible, you need a two-sided NHT. Your null hypothesis is, "There is no association." Your alternative hypothesis is, "There is a risk reduction or a risk increase."

Researcher: That is interesting. One part of this alternative hypothesis is my original hypothesis, but the other part is its opposite. The null hypothesis that I am testing, however, is neither of them (irritation #2). If we reject the null hypothesis, will we accept the alternative hypothesis?

Consultant: Yes. This is called "statistical inference."

Researcher: So, how will we decide whether or not to reject the null hypothesis?

Consultant: Using a statistical model, we define critical values of a test statistic that define rejection regions and an acceptance region. For these definitions, we do not have to use your study data.

Researcher: You used several terms with which I'm unfamiliar (irritation #3). How do we do this?

Consultant: Let me explain them to you. A statistical model is a set of assumptions, including distributional assumptions about probabilities (or probability densities) of observations given hypotheses [18, 23]. When we combine empirical observations with a statistical model and draw inferences, we assume validity given what has been controlled in the study design and data analysis (irritation #4) [3, 10]. The statistical models we use for null hypothesis testing give probabilities that values of special variables, called test statistics, will fall within specified ranges. Consider the well-known test statistic, Z, for instance. If the validity assumptions and the null hypothesis are true, there is a 2.5 % chance we will observe Z ≤ −1.96 in your study and a 2.5 % chance we will observe Z ≥ +1.96. If we choose those two values of Z as the critical values for our test, we will reject the null hypothesis if we obtain a value of Z in either of those rejection regions and accept the null hypothesis if the value of Z falls between the critical values and into the acceptance region. Rejecting the null hypothesis when it is true is a mistake known as a type I error. By choosing those particular values of Z to define those two rejection regions and that one acceptance region, we have set our probability of making a type I error in advance, at 5 %. This probability of rejecting the null hypothesis, if it happens to be true, is called the alpha level of the test.
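The consultant's decision rule can be written down in a few lines. The sketch below, in Python with numpy (an assumption of ours, not anything used in the article), applies the two rejection regions at Z ≤ −1.96 and Z ≥ +1.96 to a long run of hypothetical studies in which the null hypothesis and the validity assumptions hold; the rule rejects in roughly 5 % of them, the pre-set alpha.

import numpy as np

rng = np.random.default_rng(1)   # reproducible pseudo-random numbers
z_crit = 1.96                    # critical values for a two-sided test with alpha = 0.05

def two_sided_nht(z):
    """Reject the null hypothesis if the test statistic falls in either rejection region."""
    return z <= -z_crit or z >= z_crit

# Hypothetical long run: many studies in which the null hypothesis is true,
# so the Z statistic follows a standard normal distribution.
z_values = rng.standard_normal(100_000)
rejection_rate = np.mean([two_sided_nht(z) for z in z_values])
print(f"Share of true-null studies rejected: {rejection_rate:.3f}")   # close to 0.05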
Researcher: But how did I decide that "no association" is my null hypothesis and that 5 % is my highest tolerable probability of mistakenly rejecting it if it is true?

Consultant: In the original formulation of hypothesis testing [20, 21], substantive considerations were used to select the hypothesis to be tested in each analysis. The hypothesis for which the consequences would be worse if one falsely rejected it was to be chosen as the null hypothesis [19]. Here we are saying it would be worse to think there is a true association (positive or negative) when there is not than to accept the hypothesis of no association when there really is one. The original formulation also called for specifying the alpha level anew in each analysis, based on a judgment of how serious a mistake it would be to reject the tested hypothesis if it were true. Over the years, it became orthodox in the biomedical field and the social sciences to test the hypothesis of no true association (here, the hypothesis of equal 5-year overall mortality risks) and to use a standard alpha, nearly always 5 %, in nearly all tests. Having defined the rejection regions, we use the statistical model and the study data to calculate the realized value of the test statistic. If it falls within a rejection region, we reject the null hypothesis. If it falls within an acceptance region, we accept the null hypothesis.

Researcher: I struggle with the unfamiliar terms you used in explaining other unfamiliar terms. I struggle as well with the condition that the study and data analysis are perfectly valid. Doesn't this imply that I first have to determine potential biases in my study before I think of interpreting this test?

Consultant: Yes. The more validity is compromised, the more misleading the NHT becomes.

Researcher: I think I have a rough idea of what you are saying. I think I do get the concept of a type I error, but I am having difficulty with the probability of this error and how I can set or fix it before analyzing my data. Can you explain this?

Consultant: A 5 % type I error probability means that, in the long run, if we undertook an endless number of studies identical to yours, with only chance causing them to produce different results, we would expect 5 of 100 NHTs to reject the null hypothesis if that hypothesis were true and the data model were valid.

Researcher: How do you know? I just did this one study. How can we make statements about an endless number of studies that will never be undertaken? And how could they be identical to my study? Isn't every study unique (irritation #5)?

Consultant: Based on the statistical model and the null hypothesis, we can describe the statistical expectation of the distribution of the hypothetical endless number of test statistics.

Researcher: How do I know whether or not a type I error occurred?

Consultant: If the test does not reject the null hypothesis, we can say with 100 % certainty that a type I error was not made. If the null hypothesis is rejected, we cannot tell whether or not a type I error has been made, or the probability that a type I error has been made (irritation #6). We can only give the probability that a type I error would have been made if the null hypothesis were true and your study design, data collection protocol, and analysis model were perfectly valid.

Researcher: So, if my test with α = 0.05 rejects the null hypothesis, the probability that I've made a type I error is 5 %? And the probability is 95 % that I haven't made a type I error and that there really is a true underlying association between factor A and 5-year overall mortality risk? If this test were a screening test for a disease, its positive predictive value, the probability of disease given a positive test result, would be 95 %?

Consultant: No. 100 % minus alpha, or 95 % in this test and nearly all others, is not the positive predictive value of an NHT (irritation #7). It is the specificity of the NHT: the probability of not rejecting the null hypothesis when it is true, akin to the probability of a negative screening test when the disease is absent.

Researcher: How would we determine the positive predictive value of a rejection of a null hypothesis?

Consultant: That would require us to use Bayesian statistics, which we are not using here, and to determine the prior probabilities of the null and alternative hypotheses. We would do so analogously to how you determine the prevalence of the absence and presence of disease when you calculate predictive values of disease screening tests.
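A rough numerical sketch of this analogy, in Python, with all numbers invented for illustration rather than taken from the article: the prior probability that the alternative hypothesis is true plays the role of disease prevalence, and the power of the test, a topic the consultant has deferred, plays the role of sensitivity.

alpha = 0.05   # probability of rejecting the null when it is true (1 - specificity)
power = 0.80   # assumed probability of rejecting the null when it is false (sensitivity)
prior = 0.10   # assumed prior probability that the alternative hypothesis is true (prevalence)

# Positive predictive value of a rejection, by Bayes' theorem,
# analogous to the PPV of a disease screening test.
ppv = (power * prior) / (power * prior + alpha * (1 - prior))
print(f"Probability the alternative is true, given a rejection: {ppv:.2f}")   # about 0.64, not 0.95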
Researcher: I am curious as to why you have not yet introduced the p value that I've heard so much about. Don't we need this value?

Consultant: A p value does not have to be used to conduct an NHT [20], but it can be [21]. The p value is central to the ST of Fisher [5]. Whereas alpha is sometimes called the "significance level" of an NHT, Fisher called the p value the "significance level" of the ST, as did Cox [4]. In Fisher's ST, the p value is interpreted as a continuous, inverse measure of evidence against the tested hypothesis: the smaller the p value, the stronger the evidence. Fisher did not introduce, and in fact vehemently opposed, the notions of critical regions, type I errors, and alternative hypotheses. Fisher did divide the p value range (0–1) into approximate categories of strength of evidence from time to time, but he objected philosophically to the NHT. He believed that the goal of science is learning, not decision-making.

Researcher: Like all things philosophical, that sounds very abstract to me. What exactly is the p value?

Consultant: Think of throwing a coin, an experiment. What is the expected number of heads if you throw the coin 250 times?

Researcher: If the coin is fair, I would expect 125 heads.

Consultant: Would you expect exactly 125 heads every time you toss the coin 250 times?

Researcher: I suppose not.

Consultant: Try it and see. I have. I found that I hardly ever got exactly 125 heads. Because of random fluctuations, I expect that the number of heads is usually somewhere close to 125 but hardly ever exactly 125. Values closer to 125 are more probable than values farther away. For example, I would expect to get 124 heads or 126 heads more often than 110 heads or 130 heads.

Researcher: How do you know how much more probable 124 or 126 heads would be than 110 or 130 heads? What has this to do with the p value?

Consultant: For the coin tosses, I use a statistical model called the binomial distribution. My assumptions are that the coin is balanced and tossed fairly, that it has a head on only one side, and that the result of each toss is reported and recorded accurately. These assumptions lead to a null hypothesis with an expected probability of 50 % heads and, as we have noted, an expected number of 125 heads in 250 tosses. But "expected" just means the average result if I repeated the 250 tosses many, many times. Suppose I actually get 110 heads. I can deduce from the binomial model that, if all the assumptions in that model are true, the probability of obtaining 110 or fewer heads is a little greater than 3 %. And, since 110 is 15 away from the expected value of 125 heads, I note that getting 125 + 15 = 140 heads or more is just about as "extreme" as getting 110 or fewer when the expected value is 125. The probability of getting 140 heads or more is also about 3 %. So, I can add the probability, under the null hypothesis (which includes the validity assumptions), of 110 or fewer heads to the probability of 140 or more heads and obtain a two-sided p value of about 6 %.
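The tail calculation the consultant describes can be reproduced directly from the binomial distribution. The sketch below assumes Python with scipy; the exact tail probabilities it prints may differ slightly from the rounded figures quoted in the dialogue.

from scipy.stats import binom

n_tosses, p_null = 250, 0.5                       # 250 tosses, fair-coin null hypothesis
lower_tail = binom.cdf(110, n_tosses, p_null)     # P(110 or fewer heads), roughly 0.03
upper_tail = binom.sf(139, n_tosses, p_null)      # P(140 or more heads), the mirror-image tail
p_two_sided = lower_tail + upper_tail             # two-sided p value, roughly 0.06 to 0.07
print(lower_tail, upper_tail, p_two_sided)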
Researcher: In 250 tosses, 110 or fewer heads and 140 or more heads would occur so rarely?

Consultant: Yes.

Researcher: Shouldn't we then be concerned about whether the coin was fair?

Consultant: Thank you. In a Fisher ST, we would say that p = 0.06 is the evidence against the null hypothesis of a fair coin tossed fairly, given the statistical model (binomial distribution, heads and tails recorded correctly, etc.). It would be considered some evidence, but not particularly strong evidence, against the null hypothesis. In a Neyman–Pearson NHT with a pre-specified, two-sided alpha of 5 %, we would accept the null hypothesis.

Researcher: What do you mean by a "two-sided" alpha?

Consultant: I mean what I meant in the case of your NHT: that we have two rejection regions.

Researcher: But there are many more results from the coin tossing experiment that could have occurred and that are even further from the result expected under the null hypothesis than the observed result. Yet the coin experiment produced just one result (irritation #8).

Consultant: You are right. A p value is more of a statement about the events that did not occur than it is a concise statement of the evidence from your actual observed data. Jeffreys wrote amusingly that this aspect of the use of p values in NHT implies "that a hypothesis that may be true is rejected because it has not predicted observable results that have not occurred" ([16], p. 316). Nevertheless, with the binomial probability distribution and most other distributions we use in biomedical research, the lower the probability of a given result, the lower the probability of that result plus the more extreme results that did not occur.

Researcher: Can't you tell me the p value before I do the study?

Consultant: No. You specify the alpha level in an NHT before you conduct the test, but you calculate the p value from your data and your model to conduct your test, whether that test be an NHT or an ST.

Researcher: You used the example of throwing a coin, where we have a good subjective guess of what the expectation is, at least theoretically, after throwing the coin 250 times.

Consultant: The subjective guess is a good one if the coin came out of my pocket and you tossed it. But suppose you were strolling through a carnival and you saw a small crowd around a man who was tossing a coin and asking onlookers to bet on heads, with him taking tails in each bet. Would you really expect half the tosses to be heads in that setting? I wouldn't. I'd expect fewer heads, more tails. This example shows that the most reasonable hypothesis to test, and therefore the most reasonable expected value, can depend on the circumstances and on the researcher's judgments about them. But in the coin tossing example, whether the expected number of heads were 125 in 250 (here) or substantially fewer than 125 (at the carnival), the binomial distribution would still be the one to use.
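To make the consultant's last point concrete, the same binomial model can be used to test different null hypotheses. The sketch below assumes Python with a recent version of scipy (which provides scipy.stats.binomtest) and uses an invented "carnival" null of 40 % heads that does not appear in the dialogue.

from scipy.stats import binomtest

observed_heads, n_tosses = 110, 250

# Same data, same binomial model, two different null hypotheses.
fair_null = binomtest(observed_heads, n_tosses, p=0.5, alternative="two-sided")
carnival_null = binomtest(observed_heads, n_tosses, p=0.4, alternative="two-sided")

print(f"p value against 50 % heads: {fair_null.pvalue:.3f}")      # modest evidence against fairness
print(f"p value against 40 % heads: {carnival_null.pvalue:.3f}")  # 110 heads is unremarkable here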
Researcher: But you never observed an endless series of 250 throws?

Consultant: Never. But I could simulate a series of 10 million sets of 250 throws in a second or two on my notebook computer. Fisher objected to Neyman and Pearson's concept of endless study repetitions. Fisher preferred a hypothetical concept of sampling the study participants from an infinitely large "superpopulation" [6]. But I suspect that the philosophical nature of this disagreement would make it unattractive to you.

Researcher: That is correct. However, I frequently observe that peer-reviewed papers use both approaches in their data analysis. They use Neyman–Pearson NHTs to distinguish between significant and non-significant findings and additionally use the actual value of the p value to emphasize or de-emphasize them, as in Fisher STs. For instance, when an NHT just barely fails to reject the null hypothesis, as in your coin tossing example, researchers will sometimes point to the p value of 0.06 and call it a "trend" [22]. When an NHT rejects the null hypothesis by a wide margin, they will often report the p value to show how far beneath alpha it is, sometimes even changing alpha to a lower value after the fact to make the test result appear more impressive [9] (irritation #9).

Consultant: You are right. This is an unfortunate mixture of Neyman and Pearson's NHT and Fisher's ST that has evolved over the years [7]. It has several variants and is constantly mutating, but some version or another of this hybrid is very frequently applied nowadays. The mixing can create considerable confusion, such as the common inability to distinguish alpha from the p value [9, 14, 15], worsened by calling each of them the "significance level." The hybrid nature of the blend is even reflected in the name it has been given, "NHST": NHT from Neyman and Pearson, ST from Fisher.

Researcher: The reliance on NHST in a single study appears to me misleading when I read systematic reviews and meta-analyses. Those publications almost never focus on NHST results from individual studies. They show, compare and sometimes combine estimates of measures like the difference of 5-year mortality risks in my study, and they use confidence intervals to show how precise those estimates are. As far as I understand, these are the factors that contribute to the meta-analyses, and not study-by-study NHTs, STs or any blend of the two. So why did you spend so much time explaining NHT, ST and NHST for use in my study, when they produce results that are not used in systematic reviews?

Consultant: Everything you say is true. Systematic reviewers and meta-analysts are loath to view a literature as a series of tests, whether those tests be of the NHT, ST, or NHST variety. They vastly prefer to view literatures as consisting of estimates of meaningful parameters, estimates that vary from study to study in their precision and in their internal and external validity. Nonetheless, we statistical consultants are in the habit of starting off by advising our clients on how to do NHST, as a form of "stand-alone inference," in which each study is viewed as an act of testing a null hypothesis all by itself, in isolation. Estimates of measures such as your difference in five-year risks, and confidence intervals for assessing the precision of those estimates, are more suited for a view of your study as a contribution to a scientific literature that, in the aggregate, will help guide decisions about future actions in research and beyond. Let us make another appointment for our next consultation. I've been keeping a list of topics we've deferred: type II errors, Bayesian statistics, stand-alone inference, estimation, systematic reviews and meta-analysis. Which would you like to take up next?
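As a sketch of the estimation-oriented summary the consultant prefers, the following Python lines compute a 5-year mortality risk difference with a normal-approximation (Wald) 95 % confidence interval. The patient counts are invented for illustration; they are not data from the researcher's melanoma study.

from math import sqrt

# Hypothetical counts: deaths within 5 years / patients, by factor A status
deaths_a, n_a = 45, 100        # factor A present
deaths_b, n_b = 30, 100        # factor A absent

risk_a, risk_b = deaths_a / n_a, deaths_b / n_b
risk_difference = risk_a - risk_b

# Wald standard error of a difference of two independent proportions
se = sqrt(risk_a * (1 - risk_a) / n_a + risk_b * (1 - risk_b) / n_b)
lower, upper = risk_difference - 1.96 * se, risk_difference + 1.96 * se

print(f"Risk difference: {risk_difference:.2f} (95 % CI {lower:.2f} to {upper:.2f})")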
Discussion

As early as the fourth century BCE, dialogues were used for didactic purposes in ancient Greek literature, most famously in the Socratic dialogues of Plato [2]. We have used this didactic modality to illustrate several, though only a few, irritations that can come up when a researcher with minimal statistical background, but a good sense of what she wants her study to do and of what she wants to do with her study, asks a statistician for consultation. For more irritations and misconceptions in applied uses of ST, NHT and NHST, please see some of our references, in particular Goodman [8] and Greenland [11, 12], the latter especially for the crucial point that failure to reject the null hypothesis, even with high power, does not imply support for that hypothesis over plausible alternatives.

The two characters in our dialogue are somewhat uncommon. First, the researcher, who obviously has little statistical background, is thoughtful and asks many basic questions. She wants to understand the statistics, not just use them. Second, among the many roles a consultant can choose, including those of helper, leader, data-blesser, collaborator, and teacher [17], our consultant takes up the challenge of being a teacher of the researcher. She starts off with the conventional advice, to conduct an NHT, but carefully explains the concepts related to it, to the ST and even to the NHST in response to the client's questions. Consultants can easily get into the habit of urging the client's thinking along a particular line. In our dialogue, however, the consultant is not tempted to do that, or to give simple answers to complex questions [26]. In 1954, Tukey stated, "In the long run, it does not pay a statistician to fool either himself or his clients." He continued, "Statisticians have an obligation to clarify the foundations of their techniques for their clients" [27].

A researcher with minimal statistical background typically goes through a series of irritations when it comes to NHST, including the re-formulation of the substantive hypothesis into a statistical null hypothesis, two-sided instead of one-sided alternative hypotheses, and several technical terms such as "statistical model", "critical value", "test statistic", "rejection region", and "type I and II error". The validity issue addressed by the consultant further irritates researchers, as they usually have some intuition about potential biases in their own study. The frequentists' concept of endless repetitions of studies irritates because it requires counterfactual thinking.

The consultant can be involved at different steps of a study: prior to data collection, after data have been collected, or after data have been analyzed [25]. These steps require different consulting priorities. In our dialogue, the researcher contacts the consultant after all of the data have been collected but before they have been analyzed. The best time for a researcher to get statistical or, more generally, methodologic advice is when a study is on the drawing board, before the first datum has been collected. For readers interested in reading more about NHT, ST and the modern hybrid NHST, the publications in our reference list and a few others (e.g., [7, 13, 24]) would be a good start.
Acknowledgments: We would like to thank Sander Greenland, PhD, Department of Epidemiology and Department of Statistics, University of California, Los Angeles, for helpful comments and suggestions.

References

1. Anderson DR, Burnham KP, Thompson WL. Null hypothesis testing: problems, prevalence, and an alternative. J Wildl Manag. 2000;64:912–23.
2. Baldick C. Oxford dictionary of literary terms. Oxford: Oxford University Press; 2008.
3. Box GEP. Sampling and Bayes' inference in scientific modelling and robustness. J R Stat Soc A. 1980;143:383–430.
4. Cox DR. The role of significance tests. Scand J Stat. 1977;4:49–70.
5. Fisher RA. Statistical methods for research workers. Edinburgh: Oliver and Boyd; 1925.
6. Fisher RA. Statistical methods and scientific inference. Edinburgh: Oliver and Boyd; 1956.
7. Gigerenzer G, Swijtink Z, Porter T, et al. The empire of chance: how probability changed science and everyday life. Cambridge: Cambridge University Press; 1989.
8. Goodman S. A dirty dozen: twelve p-value misconceptions. Semin Hematol. 2008;45:135–40.
9. Goodman SN. P values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. Am J Epidemiol. 1993;137:485–96.
10. Greenland S. Multiple-bias modelling for analysis of observational data. J R Stat Soc A. 2005;168:267–306.
11. Greenland S. Null misinterpretation in statistical testing and its impact on health risk assessment. Prev Med. 2011;53:225–8.
12. Greenland S. Nonsignificance plus high power does not imply support for the null over the alternative. Ann Epidemiol. 2012;22:364–8.
13. Greenland S, Poole C. Problems in common interpretations of statistics in scientific articles, expert reports, and testimony. Jurimetrics. 2011;51:129.
14. Hubbard R. Alphabet soup: blurring the distinction between p's and α's in psychological research. Theory Psychol. 2004;14:295–327.
15. Hubbard R, Bayarri MJ. Confusion over measures of evidence (p's) versus errors (α's) in classical statistical testing (with discussion). Am Stat. 2003;57:171–82.
16. Jeffreys H. Theory of probability. Oxford: Clarendon Press; 1939.
17. Kirk RE. Statistical consulting in a university: dealing with people and other challenges. Am Stat. 1991;45:28–34.
18. Leamer EE. Specification searches. New York: Wiley; 1978.
19. Neyman J. Frequentist probability and frequentist statistics. Synthese. 1977;36:97–131.
20. Neyman J, Pearson ES. On the use and interpretation of certain test criteria for purposes of statistical inference: Part I. Biometrika. 1928;20A:175–240.
21. Neyman J, Pearson ES. The testing of statistical hypotheses in relation to probabilities a priori. Proc Cambridge Philos Soc. 1933;29:492–510.
22. Pocock SJ, Ware JH. Translating statistical findings into plain English. Lancet. 2009;373:1926–8.
23. Robins JM, Greenland S. The role of model selection in causal inference from nonexperimental data. Am J Epidemiol. 1986;123:392–402.
24. Rothman KJ, Greenland S, Lash TL. Precision and validity in epidemiologic studies. In: Rothman KJ, Greenland S, Lash TL, editors. Modern epidemiology. Philadelphia: Wolters Kluwer, Lippincott Williams and Wilkins; 2008. p. 148–67.
25. Section on Statistical Consulting, American Statistical Association. When you consult a statistician… what to expect. 2003.
26. Stegman CE. Statistical consulting in the university: a faculty member's perspective. J Educ Stat. 1985;10:269–82.
27. Tukey JW. Unsolved problems of experimental statistics. J Am Stat Assoc. 1954;49:706–31.