Eur J Epidemiol (2013) 28:939–944
DOI 10.1007/s10654-013-9861-4
DIALOGUE
The researcher and the consultant: a dialogue on null hypothesis
significance testing
Andreas Stang • Charles Poole
Received: 23 July 2013 / Accepted: 30 October 2013 / Published online: 14 November 2013
© Springer Science+Business Media Dordrecht 2013
Abstract Since its introduction, null hypothesis significance testing (NHST) has caused much debate. Many
publications on common misunderstandings have
appeared. Despite the many cautions, NHST remains one
of the most prevalent, misused and abused statistical procedures in the biomedical literature. This article is directed
at practicing researchers with limited statistical background
who are driven by subject matter questions and have
empirical data to be analyzed. We use a dialogue as in
ancient Greek literature for didactic purposes. We illustrate
several, though only a few, irritations that can come up
when a researcher with minimal statistical background but
a good sense of what she wants her study to do, and of what
she wants to do with her study, asks for consultation by a
statistician. We provide insights into the meaning of several concepts including null and alternative hypothesis,
one- and two-sided null hypotheses, statistical models, test
statistic, rejection and acceptance regions, type I and II
error, p value, and the frequentist concept of endless study
repetitions.
A. Stang (✉)
Medical Faculty, Institute of Clinical Epidemiology, Martin-Luther-University of Halle-Wittenberg, Magdeburger Str. 8,
06097 Halle, Germany
e-mail: [email protected]
A. Stang
Department of Epidemiology, School of Public Health, Boston
University, 715 Albany Street, Talbot Building, Boston,
MA 02118, USA
C. Poole
Department of Epidemiology, Gillings School of Global Public
Health, University of North Carolina, Chapel Hill,
NC 27599-7435, USA
e-mail: [email protected]
Keywords Significance testing · p value · Type I error · Type II error · Estimation
Introduction
Since its introduction early in the twentieth century, null
hypothesis significance testing (NHST) has caused much
debate. It is a constantly mutating hybridization of Fisher
significance testing (ST) and Neyman–Pearson null
hypothesis testing (NHT) [7]. Many publications on common misunderstandings have appeared. A count in 2000 of
over 300 warnings of limitations of ST, NHT and NHST
[1] was followed a year later by a list of 402 references
(http://warnercnr.colostate.edu/~anderson/thompson1.html,
accessed July 17, 2012), among which we found 89 in
biomedical publications. Despite the many cautions, NHST
remains one of the most prevalent, misused and abused
statistical procedures in the biomedical literature.
This article aims to illustrate some of the many misconceptions of ST, NHT, and NHST. In principle, it does not
present new methodological issues. However, it uses a different didactic modality to clarify several misconceptions: a
dialogue between a researcher and a statistician (called
Consultant). Potential irritations of the researcher are flagged. This article is directed at practicing researchers with
limited statistical background who are driven by subject
matter questions and have empirical data to be analyzed.
Researcher I want to study the association between an
immunohistochemical factor A and the prognosis of
patients with skin melanoma. Can you help me with the
data analysis?
Consultant I can help you. However, I do not really
understand your study question. Can you be more specific?
Researcher There is a hypothesis that the presence of
factor A is associated with a poorer prognosis than its
absence.
Consultant What do you mean by ‘‘poorer prognosis’’?
Researcher I mean a higher 5-year overall mortality
risk.
Consultant We have to re-formulate your substantive
hypothesis into a statistical null hypothesis that may be
rejected by your data (irritation #1). An appropriate null
hypothesis would be, ‘‘There is either no association or an
inverse association between factor A and the 5-year mortality risk among patients with newly diagnosed skin
melanoma’’. The alternative hypothesis would be, ‘‘There
is a positive association.’’ However, there is another option.
If factor A is associated with a decreased mortality risk,
would it be of interest for you to detect this association?
Researcher Yes, of course, though I believe a risk
reduction is much less likely.
Consultant If you believe it is possible, you need a two-sided NHT. Your null hypothesis is, ‘‘There is no association.’’ Your alternative hypothesis is, ‘‘There is a risk
reduction or a risk increase.’’
Researcher That is interesting. One part of this alternative hypothesis is my original hypothesis, but the other
part is its opposite. The null hypothesis that I am testing,
however, is neither of them (irritation #2). If we reject the
null hypothesis, we will accept the alternative hypothesis?
Consultant Yes. This is called ‘‘statistical inference.’’
Researcher So, how will we decide whether or not to
reject the null hypothesis?
Consultant Using a statistical model, we define critical
values of a test statistic that define rejection regions and an
acceptance region. For these definitions, we do not have to
use your study data.
Researcher You used several terms with which I’m
unfamiliar (irritation #3). How do we do this?
Consultant Let me explain them to you. A statistical
model is a set of assumptions, including distributional
assumptions about probabilities (or probability densities) of
observations given hypotheses [18] [23]. When we combine empirical observations with a statistical model and
draw inferences, we assume validity given what has been
controlled in the study design and data analysis (irritation
#4) [3] [10].
The statistical models we use for null hypothesis testing
give probabilities that values of special variables, called
test statistics, will fall within specified ranges. Consider the
well-known test statistic, Z, for instance. If the validity
assumptions and the null hypothesis are true, there is a
2.5 % chance we will observe Z ≤ -1.96 in your study and
a 2.5 % chance we will observe Z ≥ +1.96. If we choose
those two values of Z as the critical values for our test, we
will reject the null hypothesis if we obtain a value of Z in either
of those rejection regions and accept the null hypothesis if
the value of Z falls between the critical values and into the
acceptance region. Rejecting the null hypothesis when it is
true is a mistake known as a type I error. By choosing those
particular values of Z to define those two rejection regions
and that one acceptance region, we have set our probability
of making a type I error in advance, at 5 %. This probability
of rejecting the null hypothesis, if it happens to be true, is
called the alpha level of the test.
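As an illustrative aside, the decision rule the consultant has just described can be sketched in a few lines of Python; it is only a sketch, and the realized value z_observed is a hypothetical placeholder for the test statistic computed from the study data.

```python
# Illustrative sketch of a two-sided Z test with alpha = 0.05 (not part of the
# original dialogue); z_observed is a hypothetical realized test statistic.
from scipy.stats import norm

alpha = 0.05
z_crit = norm.ppf(1 - alpha / 2)           # about 1.96, the critical value

def nht_decision(z_observed):
    """Neyman-Pearson decision for a realized value of the Z statistic."""
    if abs(z_observed) >= z_crit:          # value falls in a rejection region
        return "reject the null hypothesis"
    return "accept the null hypothesis"    # value falls in the acceptance region

print(round(z_crit, 2))                    # 1.96
print(nht_decision(2.3))                   # reject the null hypothesis
print(nht_decision(-0.8))                  # accept the null hypothesis
```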
Researcher But how did I decide that ‘‘no association’’
is my null hypothesis and that 5 % is my highest tolerable
probability of mistakenly rejecting it if it is true?
Consultant In the original formulation of hypothesis
testing [20] [21] substantive considerations were used to
select the hypothesis to be tested in each analysis. The
hypothesis for which the consequences would be worse if
one falsely rejected it was to be chosen as the null
hypothesis [19]. Here we are saying it would be worse to
think there is a true association (positive or negative) when
there is not than to accept the hypothesis of no association
when there really is one.
The original formulation also called for specifying the
alpha level anew in each analysis, based on a judgment of
how serious a mistake it would be to reject the tested
hypothesis if it were true. Over the years, it became
orthodox in the biomedical field and social sciences to test
the hypothesis of no true association (here, the hypothesis
of equal 5-year overall mortality risks) and to use a standard alpha, nearly always 5 %, in nearly all tests. Having
defined the rejection regions, we use the statistical model
and the study data to calculate the realized value of the test
statistic. If it falls within a rejection region, we reject the
null hypothesis. If it falls within an acceptance region, we
accept the null hypothesis.
Researcher I struggle with the unfamiliar terms you
used in explaining other unfamiliar terms. I struggle as well
with the condition that the study and data analysis are
perfectly valid. Doesn’t this imply that I first have to
determine potential biases in my study before I think of
interpreting this test?
Consultant Yes. The more validity is compromised, the
more misleading the NHT becomes.
Researcher I think I have a rough idea what you are
saying. I think I do get the concept of a type I error, but I
am having difficulty with the probability of this error and
how I can set or fix it before analyzing my data. Can you
explain this?
Consultant A 5 % type I error probability means that, in
the long run, if we were to undertake an endless number of
studies identical to yours with only chance causing them to
produce different results, we would expect 5 of 100 NHTs
to reject the null hypothesis if that hypothesis were true and
the data model were valid.
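As an illustrative aside, this long-run claim can be mimicked with a small simulation; the number of repetitions below is an arbitrary stand-in for the hypothetical endless series of identical studies.

```python
# Illustrative simulation: under a true null hypothesis and a valid model,
# a two-sided Z test with alpha = 5 % rejects in roughly 5 % of repetitions.
import numpy as np

rng = np.random.default_rng(seed=1)
n_studies = 100_000                   # stand-in for the "endless" series
z = rng.standard_normal(n_studies)    # one Z statistic per repeated study
rejections = np.abs(z) >= 1.96        # statistic lands in a rejection region
print(rejections.mean())              # close to 0.05
```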
Researcher How do you know? I just did this one study.
How can we make statements about an endless number of
studies that will never be undertaken? And how could they
be identical to my study? Isn’t every study unique (irritation #5)?
Consultant Based on the statistical model and the null
hypothesis, we can describe the statistical expectation of
the distribution of the hypothetical endless number of test
statistics.
Researcher How do I know whether or not a type I error
occurred?
Consultant If the test does not reject the null hypothesis,
we can say with 100 % certainty that a type I error was not
made. If the null hypothesis is rejected, we cannot tell
whether or not a type I error has been made, or the probability that a type I error has been made (irritation #6). We
can only give the probability that a type I error would have
been made if the null hypothesis were true and your study
design, data collection protocol, and analysis model were
perfectly valid.
Researcher So, if my test with a = 0.05 rejects the null
hypothesis, the probability that I’ve made a type I error is
5 %? And the probability is 95 % that I haven’t made a
type I error and that there really is a true underlying
association between factor A and 5-year overall mortality
risk? If this test were a screening test for a disease, its
positive predictive value—the probability of disease given
a positive test result—would be 95 %?
Consultant No. 100 % minus alpha, or 95 % in this test
and nearly all others, is not the positive predictive value of
an NHT (irritation #7). It is the specificity of the NHT: the
probability of not rejecting the null hypothesis when it is
true, akin to the probability of a negative screening test
when the disease is absent.
Researcher How would we determine the positive predictive value of a rejection of a null hypothesis?
Consultant That would require us to use Bayesian statistics, which we are not using here, and to determine the
prior probabilities of the null and alternative hypotheses.
We would proceed analogously to how you determine the
prevalence of the absence and presence of disease when
you calculate predictive values of disease screening tests.
Researcher I am curious as to why you have not yet
introduced the p value that I’ve heard so much about. Don’t
we need this value?
Consultant A p value does not have to be used to conduct an NHT [20], but it can be [21]. The p value is central
to the ST of Fisher [5]. Whereas alpha is sometimes called
the ‘‘significance level’’ of an NHT, Fisher called the
p value the ‘‘significance level’’ of the ST, as did Cox [4].
In Fisher’s ST, the p value is interpreted as a continuous,
inverse measure of evidence against the tested hypothesis:
the smaller the p value, the stronger the evidence. Fisher
did not introduce, and in fact vehemently opposed, the
notions of critical regions, type I errors, and alternative
hypotheses. Fisher did divide the p value range (0–1) into
approximate categories of strength of evidence from time
to time, but he objected philosophically to the NHT. He
believed that the goal of science is learning, not decision-making.
Researcher Like all things philosophical, that sounds
very abstract to me. What exactly is the p value?
Consultant Think of throwing a coin, an experiment.
What is the expected number of heads if you throw the coin
250 times?
Researcher If the coin is fair, I would expect 125 heads.
Consultant Would you expect exactly 125 heads every
time you toss the coin 250 times?
Researcher I suppose not.
Consultant Try it and see. I have. I found that I hardly
ever got exactly 125 heads. Because of random fluctuations, I expect that the number of heads is usually somewhere close to 125 but hardly ever exactly 125. Values
closer to 125 are more probable than values farther away.
For example, I would expect to get 124 heads or 126 heads
more often than 110 heads or 130 heads.
Researcher How do you know how much more probable
124 or 126 heads would be than 110 or 130 heads? What
has this to do with the p value?
Consultant For the coin tosses, I use a statistical model
called the binomial distribution. My assumptions are that
the coin is balanced and tossed fairly, that it has a head
on only one side, and that the result of each toss is
reported and recorded accurately. These assumptions lead
to a null hypothesis with an expected probability of 50 %
heads and, as we have noted, an expected number of 125
heads in 250 tosses. But ‘‘expected’’ just means the
average result if I repeated the 250 tosses many, many
times.
Suppose I actually get 110 heads. I can deduce from the
binomial model that, if all the assumptions in that model
are true, the probability of obtaining 110 or fewer heads is
a little greater than 3 %. And, since 110 is 15 away from
the expected value of 125 heads, I note that getting
125 + 15 = 140 heads or more is just about as ‘‘extreme’’
as getting 110 or fewer when the expected value is 125.
The probability of getting 140 heads or more is a little less
than 3 %. So, I can add the probability, under the null
hypothesis (which includes the validity assumptions), of
110 or fewer heads to the probability of 140 or more heads
and obtain a two-sided p value of about 6 %.
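As an illustrative aside, the tail probabilities in this calculation can be reproduced with a few lines of Python using the binomial model just described; the figures in the comments are approximate restatements of the values in the dialogue.

```python
# Illustrative sketch of the coin-tossing calculation: n = 250 tosses,
# fair-coin null hypothesis of p = 0.5.
from scipy.stats import binom

n, p = 250, 0.5
lower = binom.cdf(110, n, p)    # P(110 or fewer heads), roughly 3 %
upper = binom.sf(139, n, p)     # P(140 or more heads), roughly 3 %
print(lower + upper)            # two-sided p value, roughly 0.06-0.07

# Results near the expected 125 heads are far more probable than results far away.
print(binom.pmf(124, n, p))     # about 0.05
print(binom.pmf(110, n, p))     # under 0.01
```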
Researcher In 250 tosses, 110 or fewer heads and 140 or
more heads would occur so rarely?
Consultant Yes.
Researcher Shouldn’t we then be concerned about whether the coin was fair?
Consultant Thank you. In a Fisher ST, we would say
that p = 0.06 is the evidence against the null hypothesis of
a fair coin tossed fairly, given the statistical model (binomial distribution, heads and tails recorded correctly, etc.).
It would be considered some evidence, but not particularly
strong evidence, against the null hypothesis. In a Neyman–
Pearson NHT with a pre-specified, two-sided alpha of 5 %,
we would accept the null hypothesis.
Researcher What do you mean by a ‘‘two-sided’’ alpha?
Consultant I mean what I meant in the case of your
NHT: that we have two rejection regions.
Researcher But there are many more results from the
coin tossing experiment that could have occurred and that
are even further from the result expected under the null
hypothesis than the observed result. Yet the coin experiment produced just one result (irritation #8).
Consultant You are right. A p value is more of a
statement about the events that did not occur than it is a
concise statement of the evidence from your actual
observed data. Jeffreys wrote amusingly that this aspect of
the use of p values in NHT implies ‘‘that a hypothesis that
may be true is rejected because it has not predicted
observable results that have not occurred.’’ ([16] p 316).
Nevertheless, with the binomial probability distribution
and most other distributions we use in biomedical research,
the lower the probability of a given result, the lower the
probability of that result plus the more extreme results that
did not occur.
Researcher Can’t you tell me the p value before I do the
study?
Consultant No, you specify the alpha level in an NHT
before you conduct the test, but you calculate the p value
from your data and your model to conduct your test,
whether that test be an NHT or an ST.
Researcher You used the example of throwing a coin where we have a good subjective guess of what the expectation is, at least theoretically, after throwing the coin 250 times.
Consultant The subjective guess is a good one if the
coin came out of my pocket and you tossed it. But suppose you were strolling through a carnival and you saw a
small crowd around a man who was tossing a coin and
asking onlookers to bet on heads, with him taking tails in
each bet. Would you really expect half the tosses to be
heads in that setting? I wouldn’t. I’d expect fewer heads,
more tails. This example shows that the most reasonable
hypothesis to test, and therefore the most reasonable
expected value, can depend on the circumstances and on
the researcher’s judgments about them. But in the coin
tossing example, whether the expected number of heads
were 125 in 250 (here) or substantially fewer than 125 (at
the carnival), the binomial distribution would still be the
one to use.
Researcher But you never observed an endless series of
250 throws?
Consultant Never. But I could simulate a series of 10
million sets of 250 throws in a second or two on my
notebook computer. Fisher objected to Neyman and Pearson’s concept of endless study repetitions. Fisher preferred
a hypothetical concept of sampling the study participants
from an infinitely large ‘‘superpopulation’’ [6]. But I suspect that the philosophical nature of this disagreement
would make it unattractive to you.
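As an illustrative aside, such a simulation might look like the following sketch, scaled down from 10 million to 100,000 sets of 250 fair tosses to keep it quick.

```python
# Illustrative simulation of repeated sets of 250 fair coin tosses.
import numpy as np

rng = np.random.default_rng(seed=42)
heads = rng.binomial(n=250, p=0.5, size=100_000)  # heads in each simulated set
as_extreme = (heads <= 110) | (heads >= 140)      # at least as extreme as 110 heads
print(as_extreme.mean())      # close to the binomial two-sided p value of about 6 %
print((heads == 125).mean())  # exactly 125 heads occurs only about 5 % of the time
```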
Researcher That is correct. However, I frequently
observe that peer-reviewed papers use both approaches in
their data analysis. They use Neyman–Pearson NHTs to
distinguish between significant and non-significant findings
and additionally use the actual value of the p value to
emphasize or de-emphasize them, as in Fisher STs. For
instance, when an NHT just barely fails to reject the null hypothesis, as in your coin tossing example, researchers will sometimes point to the p value of 0.06 and call it a ‘‘trend’’ [22]. When an NHT rejects the null hypothesis by a wide margin, they will often report the p value to show how far beneath alpha it is, sometimes even changing
alpha to a lower value after the fact to make the test result
appear more impressive [9] (irritation #9).
Consultant You are right. This is an unfortunate mixture
of Neyman and Pearson’s NHT and Fisher’s ST that has
evolved over the years [7]. It has several variants and is
constantly mutating, but some version or another of this
hybrid is very frequently applied nowadays. The mixing
can create considerable confusion, such as the common
inability to distinguish alpha from the p value [9] [15] [14],
worsened by calling each of them the ‘‘significance level.’’
The hybrid nature of the blend is even reflected in the name
it has been given: ‘‘NHST’’: NHT from Neyman and
Pearson, ST from Fisher.
Researcher The reliance on NHST in a single study appears misleading to me when I read systematic reviews
and meta-analyses. Those publications almost never focus
on NHST results from individual studies. They show,
compare and sometimes combine estimates of measures
like the difference of 5-year mortality risks in my study,
and they use confidence intervals to show how precise
those estimates are. As far as I understand, these are the
factors that contribute to the meta-analyses and not study-by-study NHTs, STs or any blend of the two. So why did
you spend so much time explaining NHT, ST and NHST
for use in my study, when they produce results that are not
used in systematic reviews?
Consultant Everything you say is true. Systematic
reviewers and meta-analysts are loath to view a literature as a series of tests, whether those tests be of the NHT, ST, or
NHST variety. They vastly prefer to view literatures as
consisting of estimates of meaningful parameters, estimates
that vary from study to study in their precision and in their
internal and external validity. Nonetheless, we statistical
consultants are in the habit of starting off by advising our
clients on how to do NHST, as a form of ‘‘stand-alone
inference,’’ in which each study is viewed as an act of
testing a null hypothesis all by itself, in isolation. Estimates
of measures such as your difference in five-year risks, and
confidence intervals for assessing the precision of those
estimates, are more suited for a view of your study as a
contribution to a scientific literature that, in the aggregate,
will help guide decisions about future actions in research
and beyond. Let us make another appointment for our next
consultation. I’ve been keeping a list of topics we’ve
deferred: type II errors, Bayesian statistics, stand-alone
inference, estimation, systematic reviews and meta-analysis. Which would you like to take up next?
Discussion
As early as the fourth century BCE, dialogues were used for didactic purposes in ancient Greek literature: for example, the Socratic dialogues of Plato [2]. We have used this didactic modality to illustrate
several, though only a few, irritations that can come up
when a researcher with minimal statistical background but
a good sense of what she wants her study to do, and of
what she wants to do with her study, asks for consultation
by a statistician. For more irritations and misconceptions
in applied uses of ST, NHT and NHST, please see some of
our references, Goodman 2008 [8] and Greenland [11]
[12], the latter especially for the crucial point that failure
to reject the null hypothesis, even with high power, does
not imply support for that hypothesis over plausible
alternatives.
The two characters in our dialogue are somewhat
uncommon. First, the researcher, who obviously has little
statistical background, is thoughtful and asks many basic
questions. She wants to understand the statistics, not just
use them. Second, among the many roles a consultant can
choose including the role of a helper, leader, data-blesser,
collaborator, and teacher [17], our consultant takes the
challenge to be a teacher of the researcher. She starts off
with the conventional advice, to conduct an NHT, but
carefully explains the concepts related to that, to the ST
and even the NHST in response to the client’s questions.
Consultants can easily get into the habit of urging the
client’s thinking along a particular line. However, in our
dialogue, the consultant is not tempted to do that, or to give
simple answers to complex questions [26]. In 1954, Tukey
stated, ‘‘In the long run, it does not pay a statistician to fool
either himself or his clients.’’ He continued, ‘‘Statisticians
have an obligation to clarify the foundations of their
techniques for their clients’’ [27].
A researcher with minimal statistical background typically goes through a series of irritations when it comes to NHST, including the re-formulation of the substantive hypothesis into a statistical null hypothesis, two-sided instead of one-sided alternative hypotheses, and several technical terms like ‘‘statistical model’’, ‘‘critical value’’, ‘‘test statistic’’, ‘‘rejection region’’, and ‘‘type I and II error’’. The validity issue addressed by the consultant further irritates researchers, as they usually have some intuition about potential biases that occurred in their own study. The frequentist concept of endless repetitions of studies irritates because it follows counterfactual thinking.
The consultant can be involved at different steps of a
study: prior to data collection, after data have been collected, or after data have been analyzed [25]. These steps
require different consulting priorities. In our dialogue, the
researcher contacts the consultant after all of the data have been collected but before they have been analyzed. The best time
for a researcher to get statistical or, more generally,
methodologic advice is when a study is on the drawing
board, before the first datum has been collected. For
readers interested in reading more about NHT, ST and the
modern hybrid NHST, the publications in our reference list
and a few others (e.g., [7] [24] [13]) would be a good start.
Acknowledgments We would like to thank Sander Greenland PhD,
Department of Epidemiology & Department of Statistics, University
of California, Los Angeles, for helpful comments and suggestions.
References
1. Anderson DR, Burnham KP, Thompson WL. Null hypothesis
testing: problems, prevalence, and an alternative. J Wildl Manag.
2000;64:912–23.
2. Baldick C. Oxford dictionary of literary terms. Oxford: Oxford
University Press; 2008.
3. Box GEP. Sampling and Bayes’ inference in scientific modelling
and robustness. J R Stat Soc A. 1980;143:383–430.
4. Cox DR. The role of significance tests. Scand J Stat.
1977;4:49–70.
5. Fisher RA. Statistical methods for research workers. Edinburgh:
Oliver and Boyd; 1925.
6. Fisher RA. Statistical methods and scientific inference. Edinburgh: Oliver and Boyd; 1956.
7. Gigerenzer G, Swijtink Z, Porter T, et al. The empire of chance: how probability changed science and everyday life. Cambridge: Cambridge University Press; 1989.
8. Goodman S. A dirty dozen: twelve p-value misconceptions. Semin Hematol. 2008;45:135–40.
9. Goodman SN. P values, hypothesis tests, and likelihood: implications for epidemiology of a neglected historical debate. Am J
Epidemiol. 1993;137:485–96.
10. Greenland S. Multiple-bias modelling for analysis of observational data. J R Stat Soc A. 2005;168:267–306.
11. Greenland S. Null misinterpretation in statistical testing and its
impact on health risk assessment. Prev Med. 2011;53:225–8.
12. Greenland S. Nonsignificance plus high power does not imply
support for the null over the alternative. Ann Epidemiol.
2012;22:364–8.
13. Greenland S, Poole C. Problems in common interpretations of
statistics in scientific articles, expert reports, and testimony. Jurimetrics. 2011;51:129.
14. Hubbard R. Alphabet soup: blurring the distinction between p’s and α’s in psychological research. Theory Psychol. 2004;14:295–327.
15. Hubbard R, Bayarri MJ. Confusion over measures of evidence (p’s) versus errors (α’s) in classical statistical testing (with discussion). Am Stat. 2003;57:171–82.
16. Jeffreys H. Theory of probability. Oxford: Clarendon Press; 1939.
17. Kirk RE. Statistical consulting in a University: dealing with
people and other challenges. Am Stat. 1991;45:28–34.
18. Leamer EE. Specification searches. New York: Wiley; 1978.
19. Neyman J. Frequentist probability and frequentist statistics.
Synthese. 1977;36:97–131.
20. Neyman J, Pearson ES. On the use and interpretation of certain
test criteria for purposes of statistical inference: Part I. Biometrika. 1928;20A:175–240.
21. Neyman J, Pearson ES. The testing of statistical hypotheses in
relation to probabilities a priori. Proc Cambridge Philos Soc.
1933;29:492–510.
22. Pocock SJ, Ware JH. Translating statistical findings into plain
English. Lancet. 2009;373:1926–8.
23. Robins JM, Greenland S. The role of model selection in causal
inference from nonexperimental data. Am J Epidemiol.
1986;123:392–402.
24. Rothman KJ, Greenland S, Lash TL. Precision and validity in
epidemiologic studies. In: Rothman KJ, Greenland S, Lash TL,
editors. Modern epidemiology. Philadelphia: Wolters Kluwer,
Lippincott Williams and Wilkins; 2008. p. 148–67.
25. Section on Statistical Consulting, American Statistical Association. When you consult a statistician… what to expect. 2003.
26. Stegman CE. Statistical consulting in the university: a faculty
member’s perspective. J Educ Stat. 1985;10:269–82.
27. Tukey JW. Unsolved problems of experimental statistics. J Am
Stat Assoc. 1954;49:706–31.