QUANTITATIVE METHODS IN PSYCHOLOGY

On the Probability of Making Type I Errors

P. Pollard
School of Psychology, Lancashire Polytechnic, Preston, Lancashire, United Kingdom

J. T. E. Richardson
Department of Human Sciences, Brunel University, Uxbridge, Middlesex, United Kingdom

A statistical test leads to a Type I error whenever it leads to the rejection of a null hypothesis that is in fact true. The probability of making a Type I error can be characterized in the following three ways: the conditional prior probability (the probability of making a Type I error whenever a true null hypothesis is tested), the overall prior probability (the probability of making a Type I error across all experiments), and the conditional posterior probability (the probability of having made a Type I error in situations in which the null hypothesis is rejected). In this article, we show (a) that the alpha level can be equated with the first of these and (b) that it provides an upper bound for the second but (c) that it does not provide an estimate of the third, although it is commonly assumed to do so. We trace the source of this erroneous assumption first to statistical texts used by psychologists, which are generally ambiguous about which of the three interpretations is intended at any point in their discussions of Type I errors and which typically confound the conditional prior and posterior probabilities. Underlying this, however, is a more general fallacy in reasoning about probabilities, and we suggest that this may be the result of erroneous inferences about probabilistic conditional statements. Finally, we consider the possibility of estimating the (posterior) probability of a Type I error in situations in which the null hypothesis is rejected and, hence, the proportion of statistically significant results that may be Type I errors.

A psychological experiment typically produces an outcome that is consistent in direction with one or more effects.
At this point, the researcher has to make a decision between (a) accepting the existence of such an effect and (b) accepting the null hypothesis that no effect is present in the population from which the sample was drawn. In the latter case, any apparent effects are assumed simply to be chance fluctuations around the zero-effect value that would be observed in the whole population. For instance, if the researcher finds performance on some measure in one condition to be 60%, compared with 40% in a second condition, he or she has to make a decision between the conclusion that the first condition yields better performance than the second and the conclusion that the two scores merely reflect chance (sample) variations around a single level of performance that would be obtained in a study of the population as a whole.

Conventionally, researchers make such decisions by assuming the null hypothesis to be true and, given this assumption, attempting to make inferences based on the probability of obtaining the actual pattern of results observed. Specifically, a statistical test yields the probability of a given result (or one more extreme) being produced by chance if the null hypothesis is true. If D denotes an outcome or one more extreme and H0 denotes the null hypothesis's being true, then the probability produced by such a statistical test can be expressed as P(D|H0), that is, the conditional probability of D, given H0. If this figure is less than a threshold probability or alpha level (typically .05), then chance is concluded to be a sufficiently unlikely explanation of the outcome, and the existence of an effect is held to be supported by the data. Thus, for an alpha level of .05, one rejects the null hypothesis whenever P(D|H0) is less than .05. In other words, if D* denotes an outcome that would lead to the rejection of the null hypothesis, then by definition, P(D*|H0) is equal to the alpha level.
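The claim that P(D*|H0) equals the alpha level can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: it uses a two-sided z test with known standard deviation, and the sample size, number of experiments, seed, and function names are hypothetical choices that do not come from the article. Every simulated experiment samples from a population in which the null hypothesis is true, so the proportion of rejections estimates P(D*|H0).

```python
import math
import random


def two_sided_p_value(sample, sigma=1.0):
    """P(D|H0) for a z test of H0: population mean = 0 (sigma assumed known)."""
    n = len(sample)
    z = (sum(sample) / n) * math.sqrt(n) / sigma
    # For a standard normal Z, P(|Z| >= |z|) = erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


def rejection_rate_under_true_null(n_experiments=20000, n_per_sample=25,
                                   alpha=0.05, seed=1987):
    """Proportion of experiments on a TRUE null hypothesis that yield D*."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    rejections = 0
    for _ in range(n_experiments):
        # Every sample comes from the null-hypothesis population (mean 0).
        sample = [rng.gauss(0.0, 1.0) for _ in range(n_per_sample)]
        if two_sided_p_value(sample) < alpha:
            rejections += 1  # outcome D*: the null hypothesis is rejected
    return rejections / n_experiments


if __name__ == "__main__":
    rate = rejection_rate_under_true_null()
    print(f"Estimated P(D*|H0): {rate:.3f}")
```

Under these assumptions the estimated rate should fall close to .05. The simulation says nothing about how often a rejected null hypothesis is true, because no false null hypotheses are ever tested in it.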
If the conclusion that chance is an insufficient explanation of the outcome is incorrect, then an error will have been made: The null hypothesis will have been rejected when it is in fact true. This is conventionally described as a Type I error. When a true null hypothesis is tested, then a Type I error is made if and only if the experimental outcome leads to the rejection of the null hypothesis. We have defined such an outcome as D*, and thus, the probability of such an outcome when the null hypothesis is true can be expressed as P(D*|H0), which is equal to the alpha level. For instance, if a 5% criterion is adopted prior to experimentation, then the experimenter will (on average) make Type I errors on 5% of those occasions when the null hypothesis is true. We refer to this as the conditional prior probability of making a Type I error because the probability is conditional on the null hypothesis's being true.¹

We are grateful to Larry Phillips for his comments on a previous version of this article and to an anonymous reviewer for bringing the article by Carver (1978) to our attention. Correspondence concerning this article should be addressed to P. Pollard, School of Psychology, Lancashire Polytechnic, Preston, Lancashire PR1 2TQ, United Kingdom.

Psychological Bulletin, 1987, Vol. 102, No. 1, 159-163. Copyright 1987 by the American Psychological Association, Inc. 0033-2909/87/$00.75

¹ For convenience, we assume two points regarding the alpha level throughout this article: First, a given alpha level will produce approximately that proportion of errors when the null hypothesis is true; second, the significance level derived from a statistical test is veridical. The first of these seems reasonable, although it is a simplification if short sequences of experiments are considered. We do not wish to wholly defend the second assumption. Statistical tests yield estimates of probabilities, and these estimates may vary as a function of the specific null hypothesis being tested (that is, the type of test chosen) and the extent to which the distributional characteristics of the data deviate from those of the theoretical null-hypothesis population being tested. A certain amount of such deviation is likely to occur in many psychological applications, and hence, such estimates may be inexact. Thus, our statement that the prior probability of a Type I error is given by the alpha level is a simplification. This point does not, however, affect the arguments presented here unless the errors in estimates produced by such tests are gross (in which case, of course, any discussion of the present nature would be pointless).

Of course, when the null hypothesis is false, the experimenter cannot make a Type I error, only the converse error of failing to reject the null hypothesis (that is, a Type II error). It follows that the alpha level defines the maximum number of Type I errors to be expected across a series of experiments in which the null hypothesis may sometimes be true and sometimes be false. As Pagano (1981) expressed this point, "The alpha level which the scientist sets at the beginning of the experiment is the level to which he wishes to limit the probability of making a Type I error" (p. 203). For instance, the outcome of 20 tests of a true null hypothesis (not necessarily the same one in each case) would be expected to yield 19 correct acceptances and 1 Type I error (that is, 5%). If, however, during the same period, several tests were also made on a false null hypothesis, none of these could yield a Type I error, and thus, the overall percentage of experiments that yielded Type I errors would be less than 5%.

In general, a Type I error is made in any experiment if and only if the experimental outcome leads to the rejection of the null hypothesis and the null hypothesis is in fact true. Thus, the probability of making such an error is equal to P(D* & H0), which is equivalent to P(H0)·P(D*|H0). This formulation makes it clear that the probability is less than P(D*|H0), that is, the alpha level, if P(H0) is less than one. Thus, it can be held to be equal to the alpha level only if it is assumed that only true null hypotheses are ever tested. We refer to this as the overall prior probability of making a Type I error because it is based on the frequency of Type I errors as a proportion of all experiments.

If one decides to reject the null hypothesis, however, what is the probability of having made a Type I error? When the null hypothesis is rejected, the probability of having made a Type I error is a probability about the null hypothesis because a Type I error has been made if and only if the null hypothesis is in fact true. It follows that the probability of having made a Type I error is equal to P(H0|D*). We refer to this as the conditional posterior probability of making a Type I error because it is conditional on the decision to reject the null hypothesis. This yields the probability of any particular rejection's having been a Type I error, and therefore, it could be used to estimate the proportion of all significant results that are Type I errors.

Our informal inquiries within a wide and varied cross section of our professional colleagues indicated a widespread assumption that the probability of having made a Type I error in rejecting the null hypothesis is the same as the alpha level; that is, if a null hypothesis is rejected at the .05 level of significance, then there is a 5% chance of having made a Type I error. A more colloquial way of expressing this is to say that in any particular instance, there is a 5% probability that the results are due to chance. A corollary of this assumption is that at 5%, 1 in every 20 significant results will be a Type I error.

Probabilities of the form P(A|B) and P(B|A) are not, however, algebraically equivalent. Nor do they necessarily have the same value in practice. For example, the (high) probability of a population of firemen "generating" a sample person in uniform obviously is not the same as the (lower) probability that a sample person in uniform was "generated" by a population of firemen. It follows that P(H0|D*) is not the same as P(D*|H0) and, hence, that the posterior conditional probability of making a Type I error is not the same as the alpha level (cf. Cronbach & Snow, 1977, p. 52). In more colloquial terms, the alpha level cannot be equated with the probability that the research results were due to chance or with the probability that the alternative hypothesis is false (Carver, 1978).

Why, then, should there be such a common assumption that the alpha level yields the probability of having made a Type I error? One possible reason is that many statistical texts that psychologists and their students read and use appear to encourage this idea.² They generally do so by being vague about the domain of events across which the proportion of Type I errors is to be computed, thus leaving indeterminate which of the three possible interpretations of the expression "probability of making a Type I error" is intended.

² Carver (1978) observed that general textbooks in psychology also encouraged it (e.g., Anastasi, 1976, p. 109; Hebb, 1966, p. 173). Our purpose in this article is to illuminate the extent of the fallacy even in those texts that psychologists and their students might consider the most authoritative and to offer an explanation of its source in terms of current ideas about human decision making.

Most of these texts define this probability as equivalent to the alpha level, so they apparently have the conditional prior probability in mind. For example, "The probability of making a Type I error is very simply and directly indicated by α" (Guilford & Fruchter, 1978, p. 174); "The probability of committing a Type I error, which is denoted by α, is called the significance level of the test" (Hoel & Jessen, 1977, p. 223); "The probability of committing such an error is actually equivalent to the significance level we select" (Miller, 1975, p. 59); "Alpha determines the probability of making a Type I error" (Pagano, 1981, p. 202); and "The significance level is simply the probability of making a Type I error" (Robson, 1973, p. 35).

Of course, the alpha level does indeed give the probability of making a Type I error when the null hypothesis is true, but these quotations involve an unfortunate shorthand in which the conditional nature of this definition is left unstated. In its absence, the reader is likely to interpret the probability as either the overall prior probability (that is, ". . . across all experiments") or the conditional posterior probability (that is, ". . . whenever the null hypothesis is rejected"). In the former case, the reader is being led to equate the probability of making a Type I error with an expression that is merely an upper bound to that probability (unless the texts assume that their readers will never test any false null hypotheses). In the latter case, the reader is being encouraged to use the alpha level invalidly, not as a conditional probability of the obtained outcome but as a conditional probability of the truth of the null hypothesis itself.
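The three interpretations can be put side by side numerically. The sketch below is a hedged illustration, not a procedure from the article: the function name is hypothetical, and both the proportion of true null hypotheses, P(H0), and the probability of rejecting a false null hypothesis (here called `power`) are assumed inputs. The article itself argues that P(H0) cannot be known in practice. The overall prior probability is P(H0)·P(D*|H0), as stated above. The conditional posterior probability is obtained by dividing that joint probability by P(D*), which the sketch computes from the law of total probability.

```python
def type_i_error_probabilities(p_h0, alpha, power):
    """Return the three probabilities distinguished in the text.

    p_h0  : assumed P(H0), the proportion of tested null hypotheses that are true
    alpha : P(D*|H0), the threshold probability level
    power : assumed P(D*|not H0), the chance of rejecting a false null hypothesis
    """
    for name, value in (("p_h0", p_h0), ("alpha", alpha), ("power", power)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie between 0 and 1")

    conditional_prior = alpha                          # P(D*|H0)
    overall_prior = p_h0 * alpha                       # P(D* & H0)
    p_reject = overall_prior + (1.0 - p_h0) * power    # P(D*)
    if p_reject == 0.0:
        raise ValueError("P(D*) is zero, so P(H0|D*) is undefined")
    conditional_posterior = overall_prior / p_reject   # P(H0|D*)
    return conditional_prior, overall_prior, conditional_posterior


if __name__ == "__main__":
    # Hypothetical scenarios: (P(H0), alpha, power).
    for p_h0, alpha, power in [(0.5, 0.05, 0.8), (0.9, 0.05, 0.3), (1.0, 0.05, 0.8)]:
        prior, overall, posterior = type_i_error_probabilities(p_h0, alpha, power)
        print(f"P(H0)={p_h0:.1f} power={power:.1f}: "
              f"P(D*|H0)={prior:.3f}  P(D*&H0)={overall:.3f}  P(H0|D*)={posterior:.3f}")
```

In every scenario the overall prior probability stays at or below the alpha level, reaching it only when P(H0) is 1. The conditional posterior probability moves independently of alpha and can sit well above or below .05 depending on the assumed inputs.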
A common device that promotes this ambiguity is a 2 × 2 table showing the probability of a correct or incorrect decision as a function of (a) the truth or falsehood of the null hypothesis and (b) a decision to accept or reject the null hypothesis. The alpha level is shown in the cell corresponding to "null hypothesis true" and "null hypothesis rejected." The headings on such tables, however, typically do not identify the contents as being conditional (rather than absolute) probabilities. Even if the reader realizes that the alpha level is a conditional probability, the table usually gives no indication of whether it is conditional within rows or within columns and, thus, fails to distinguish between the conditional prior and the conditional posterior interpretations of "the probability of making a Type I error."

These problems worsen when the authors in question discuss the frequency of Type I errors. For instance, Christensen (1980) reported, "If the .05 significance level is set, you run the risk of being wrong and committing Type I error five times in 100" (p. 311). Keppel and Saufley (1980) wrote, "We will make a Type I error a small percentage of the time – the exact amount being specified by our significance level" (p. 137). The puzzled reader might legitimately ask of such quotations, five times in which 100, or a small percentage of what time? Neither author made clear that the answer is, five times in 100 when a true null hypothesis is tested, and once again the reader is led toward alternative interpretations, such as, five times in 100 experiments (that is, the overall prior probability), or five times in 100 when the null hypothesis is rejected (that is, the conditional posterior probability).

It is perhaps more serious that this uncertainty concerning the interpretation of the expression "the probability of making a Type I error" is shared by at least some of the authors of statistical texts in frequent use.
In particular, they tend to confound the conditional prior probability of making a Type I error with the conditional posterior probability. For instance, Greene and D'Oliveira (1982), after correctly describing the significance level as a conditional prior probability, later refer to it as the "percentage probability . . . that your results are due to chance" (p. 31). The earlier definition of the probability has been changed, and the alpha level is now held to be the probability of a particular source of the data (namely, the null hypothesis). In a similar manner, Miller (1975) states, "If we reject the null hypothesis whenever the chance of it being true is less than 0.05, then obviously we shall be wrong 5 per cent of the time" (p. 59). Once again, the alpha level is used specifically as a probability of the truth of the null hypothesis. Finally, when discussing a decision to reject the null hypothesis on the basis of a test statistic whose value lies in the appropriate rejection region, Siegel (1956) states, "We may explain the actual occurrence of that value in two ways: first we may explain it by deciding that the null hypothesis is false, or second, we may explain it by deciding that a rare and unlikely event has occurred" (p. 14). After stating that the first alternative would be chosen, although the second might be true (that is, the null hypothesis might be true), he continues, "In fact, the probability that the second explanation is the correct one is given by [italics added] α, for rejecting H0 when in fact it is true is the Type I error" (Siegel, 1956, p. 14). The italicized portion of the last quotation again clearly shows the interpretation of the alpha level as a probability of the truth of the null hypothesis.

In short, one possible reason for the common assumption among psychologists and their students that the alpha level represents the probability of having made a Type I error is that standard statistical texts promote this fallacy.
On one hand, they may correctly define the alpha level as the probability of making a Type I error, given a true null hypothesis, but then subsequently make no reference to the prior conditional nature of this probability. On the other hand, they may leave the intended interpretation of the expression "the probability of making a Type I error" entirely unclear. In either case, the resulting ambiguity between P(D*|H0) and P(H0|D*) is compounded further until the alpha level comes to be identified with the conditional posterior probability of making a Type I error.

Of course, this might be seen as a problem that relates simply to the training of psychologists rather than one of general theoretical interest to psychology itself. It is pertinent, however, to ask why the authors of statistics textbooks as well as their readers should be vulnerable to such problems in their reasoning about probabilities. Why should there be such a fundamental tendency to confuse the posterior probability that the null hypothesis is true, given that it has been rejected, with the prior probability that the null hypothesis will be rejected, given that it is true?

Kahneman and Tversky (1973) demonstrated a widespread tendency for subjects to confuse (intuitive) judgments of the form P(D|H) with judgments of the form P(H|D) and showed that psychologists themselves will be vulnerable to such statistical biases (Tversky & Kahneman, 1971). Kahneman and Tversky described such errors as examples of a "base rate" or "prior probability" fallacy encouraged by superficial judgments of similarity, or "representativeness." Bar-Hillel (1974) demonstrated explicitly that this tendency to base judgments on similarity leads subjects to make no distinction between P(D|H) and P(H|D) judgments.
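The role of the base rate in separating P(D|H) from P(H|D) can be shown with a short calculation based on Bayes's theorem. The sketch below is illustrative only: the function name and all numerical values are hypothetical assumptions, chosen to show that P(H|D) changes with the base rate P(H) even when P(D|H) is held fixed.

```python
def invert_conditional(p_d_given_h, p_h, p_d_given_not_h):
    """Compute P(H|D) from P(D|H), the base rate P(H), and P(D|not H)."""
    # Law of total probability: P(D) = P(D|H)P(H) + P(D|not H)P(not H).
    p_d = p_d_given_h * p_h + p_d_given_not_h * (1.0 - p_h)
    if p_d == 0.0:
        raise ValueError("P(D) is zero, so P(H|D) is undefined")
    # Bayes's theorem: P(H|D) = P(D|H)P(H) / P(D).
    return p_d_given_h * p_h / p_d


if __name__ == "__main__":
    p_d_given_h = 0.05      # held fixed throughout (hypothetical value)
    p_d_given_not_h = 0.5   # hypothetical value
    for base_rate in (0.1, 0.5, 0.9):
        p_h_given_d = invert_conditional(p_d_given_h, base_rate, p_d_given_not_h)
        print(f"P(H)={base_rate:.1f}  P(D|H)={p_d_given_h:.2f}  P(H|D)={p_h_given_d:.3f}")
```

A judgment based on similarity alone treats the fixed value of P(D|H) as if it were P(H|D). The calculation shows that the two agree only by coincidence.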
Analogously, the significance level seems to operate as a measure of the similarity, or representativeness, between the observed sample and the null-hypothesis population and, thus, creates the illusion that it predicts both the likelihood of rejecting the null hypothesis, given that it is true, and the likelihood of the null hypothesis being true, given that it has been rejected.

Another way of characterizing errors of similarity or representativeness is to view the confusion as stemming from a faulty logical inference rather than a faulty statistical one. Consider the following argument:

If H0 then not D*
D*
Therefore: Not H0.

It should be clear that this is a valid (modus tollens) inference. However, consider the following variant:

If H0 then D* very unlikely
D*
Therefore: H0 very unlikely.

It should be equally clear that this is now invalid (substitute "This person is American" for H0 and "This person is a member of Congress" for D*). Such (conditional) inferences do not work with probabilistic relations. One may make inferences on the basis of a premise of the form "If H0 then not D*" but not on the basis of a premise of the form "If H0 then probably not D*." Possibly, however, the latter statement leads to the assumption of a symmetrical relation and, thus, to the erroneous inference "If D* then probably not H0." There is little research evidence for this specific transformation, although there is abundant evidence that standard universal and conditional statements tend to be "converted" in this way (see, for instance, the chapters on conditional and syllogistic reasoning in Evans, 1982), and there is some evidence that any relation will tend to be seen as symmetrical (Newstead, Pollard, & Griggs, 1986; Tsal, 1977).

If such a transformation is intuitively made, then it will lead from "If H0 then the probability of D* is equal to the alpha level" to "If D* then the probability of H0 is equal to the alpha level." This provides a characterization of this similarity, or representativeness, error in terms of a logical fallacy rather than a statistical one. Also, according to the foregoing discussion, the hypothetico-deductive method is not strictly workable when combined with the conventional procedures of statistical inference.

Of course, the conditional prior probability of making a Type I error and the conditional posterior probability of making such an error can be directly related to one another by means of Bayes's theorem. Specifically,

P(H0|D*) = P(D*|H0)·P(H0)/P(D*).

Such a formulation shows that the conditional prior and conditional posterior probabilities will be equivalent only if P(H0) and P(D*) are fortuitously equal. Otherwise, the conditional posterior probability of making a Type I error will be greater or less than the alpha level, depending on whether P(H0) is greater or less than P(D*).

Nevertheless, it is impossible to quantify the value of P(H0), either across all hypotheses or in any individual case. It would in principle be possible for individual researchers to arrive at an empirical estimate of P(D*) in their own work by enumerating all the instances of significant and nonsignificant results that they had ever obtained. It would not, however, be possible to estimate the value of P(D*) in the general domain of psychological research because it is not known how many nonsignificant findings go unreported in the published literature (Greenwald, 1975).

Because it is impossible to estimate either P(H0) or P(D*), the precise value of the conditional posterior probability of making a Type I error is in principle unquantifiable and, hence, indeterminate. From this follow two conclusions of particular interest. First, there are no circumstances under which it would be legitimate to claim that the alpha level was an index of the conditional posterior probability of making a Type I error. Second, the alpha level cannot be used to estimate the proportion of Type I errors in the psychological research literature. Nevertheless, both P(D*) and P(H0) are under the control of psychological researchers, as we now point out.

The proportion of Type I errors in the literature should be computed as the ratio of such errors to the total number of rejections of the null hypothesis. Thus, anything that maximizes the number of correct rejections (such as experimental power) will reduce this proportion. It follows from this that whereas the (prior) probability of a Type II error shows the familiar inverse relation with the conditional prior probability of a Type I error, it is positively related to the conditional posterior probability of a Type I error. Thus, to the extent that most psychologists run sensitive experiments that are associated with high values of P(D*) and that minimize Type II errors, the proportion of Type I errors in the literature will not be unduly high.

Furthermore, to the extent that most psychologists frame good alternative hypotheses (that is, ones more likely to be true than false), P(H0) will be likely to be low. A special case of this is when one attempts to replicate an outcome that led to the rejection of the null hypothesis. Because the overall probability of rejecting the null hypothesis, P(D*), will be greater than the threshold probability level, P(D*|H0), the Bayesian relation already stated entails that the conditional posterior probability of making a Type I error, P(H0|D*), will be less than the prior probability of the null hypothesis's being true, P(H0). By a similar argument, it can be shown that the conditional posterior probability of making a Type I error in a replication of an outcome that was statistically significant must be less than the conditional posterior probability of making a Type I error in the original experiment, provided that the threshold probability level remains the same in both cases.

In short, although it is not possible to quantify the conditional posterior probability of having made a Type I error, one may nevertheless be confident that this probability is reduced by successive replications of the effect. In other words, even an experiment that fails to achieve statistical significance but replicates the direction of a previous significant effect decreases rather than increases the probability that the original finding was a Type I error (cf. Humphreys, 1980). A corollary of this is that to ensure the same conditional posterior probability of making a Type I error as in the original experiment, a researcher should use a less conservative threshold probability level in its replication. (These points are, of course, implicitly recognized in techniques of research synthesis; e.g., Green & Hall, 1984, and Rosenthal, 1978.)

Given the assumption that psychologists tend to frame good alternative hypotheses and, hence, that the value of P(H0) is generally low, together with the fact that P(H0|D*) will be less than P(H0), the posterior probability of Type I errors is likely to be substantially lower than the alpha level. Thus, that the posterior probability of having made a Type I error is in principle unquantifiable is not necessarily cause for general concern because the values of P(D*) and P(H0) are to some extent under the researcher's control and there are reasons for believing that the overall number of Type I errors in the literature is small. Nevertheless, in any specific case of rejection of the null hypothesis, the probability of having made a Type I error is indeterminate.

References

Anastasi, A. (1976). Psychological testing (4th ed.). New York: Macmillan.
Bar-Hillel, M. (1974). Similarity and probability. Organizational Behavior and Human Performance, 11, 277-282.
Carver, R. P. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378-399.
Christensen, L. B. (1980). Experimental methodology (2nd ed.). Boston: Allyn & Bacon.
Cronbach, L. J., & Snow, R. E. (1977). Aptitudes and instructional methods: A handbook for research on interactions. New York: Irvington.
Evans, J. St. B. T. (1982). The psychology of deductive reasoning. London: Routledge & Kegan Paul.
Green, B. F., & Hall, J. A. (1984). Quantitative methods for literature reviews. Annual Review of Psychology, 35, 37-53.
Greene, J., & D'Oliveira, M. (1982). Learning to use statistical tests in psychology: A students' guide. Milton Keynes, United Kingdom: Open University Press.
Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis. Psychological Bulletin, 82, 1-20.
Guilford, J. P., & Fruchter, B. (1978). Fundamental statistics in psychology and education. New York: McGraw-Hill.
Hebb, D. O. (1966). A textbook of psychology. Philadelphia, PA: Saunders.
Hoel, P. G., & Jessen, R. J. (1977). Basic statistics for business and economics (2nd ed.). New York: Wiley.
Humphreys, L. G. (1980). The statistics of failure to replicate: A comment on Buriel's (1978) conclusions. Journal of Educational Psychology, 72, 71-75.
Kahneman, D., & Tversky, A. (1973). On the psychology of prediction. Psychological Review, 80, 237-251.
Keppel, G., & Saufley, W. R., Jr. (1980). Introduction to design and analysis: A students' handbook. San Francisco: Freeman.
Miller, S. (1975). Experimental design and statistics. London: Methuen.
Newstead, S. E., Pollard, P., & Griggs, R. E. (1986). Response bias in relational reasoning. Bulletin of the Psychonomic Society, 24, 95-98.
Pagano, R. R. (1981). Understanding statistics in the behavioral sciences. St. Paul, MN: West.
Robson, C. (1973). Experiment, design, and statistics in psychology. Harmondsworth, England: Penguin.
Rosenthal, R. (1978). Combining results of independent studies. Psychological Bulletin, 85, 185-193.
Siegel, S. (1956). Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill.
Tsal, Y. (1977). Symmetry and transitivity assumptions about a nonspecified logical relation. Quarterly Journal of Experimental Psychology, 29, 677-684.
Tversky, A., & Kahneman, D. (1971). The belief in the law of small numbers. Psychological Bulletin, 76, 105-110.

Received January 1, 1986
Revision received May 1, 1986
Accepted October 27, 1986