QUANTITATIVE METHODS IN PSYCHOLOGY

On the Probability of Making Type I Errors

P. Pollard
School of Psychology
Lancashire Polytechnic
Preston, Lancashire, United Kingdom

J. T. E. Richardson
Department of Human Sciences
Brunel University
Uxbridge, Middlesex, United Kingdom
A statistical test leads to a Type I error whenever it leads to the rejection of a null hypothesis that is
in fact true. The probability of making a Type I error can be characterized in the following three
ways: the conditional prior probability (the probability of making a Type I error whenever a true
null hypothesis is tested), the overall prior probability (the probability of making a Type I error
across all experiments), and the conditional posterior probability (the probability of having made a
Type I error in situations in which the null hypothesis is rejected). In this article, we show (a) that
the alpha level can be equated with the first of these and (b) that it provides an upper bound for the
second but (c) that it does not provide an estimate of the third, although it is commonly assumed to
do so. We trace the source of this erroneous assumption first to statistical texts used by psychologists,
which are generally ambiguous about which of the three interpretations is intended at any point in
their discussions of Type I errors and which typically confound the conditional prior and posterior
probabilities. Underlying this, however, is a more general fallacy in reasoning about probabilities,
and we suggest that this may be the result of erroneous inferences about probabilistic conditional
statements. Finally, we consider the possibility of estimating the (posterior) probability of a Type I
error in situations in which the null hypothesis is rejected and, hence, the proportion of statistically
significant results that may be Type I errors.
A psychological experiment typically produces an outcome
that is consistent in direction with one or more effects. At this
point, the researcher has to make a decision between (a) accepting the existence of such an effect and (b) accepting the null
hypothesis that no effect is present in the population from
which the sample was drawn. In the latter case, any apparent
effects are assumed simply to be chance fluctuations around the
zero-effect value that would be observed in the whole population. For instance, if the researcher finds performance on some
measure in one condition to be 60%, compared with 40% in a
second condition, he or she has to make a decision between the
conclusion that the first condition yields better performance
than the second and the conclusion that the two scores merely
reflect chance (sample) variations around a single level of performance that would be obtained in a study of the population
as a whole.
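By way of illustration, the 60% versus 40% comparison could be evaluated with a conventional two-proportion z test. The sketch below is our addition, not part of the original article, and the sample sizes (50 observations per condition) are assumed purely for the sake of the example:

```python
import math

def two_proportion_z(p1, p2, n1, n2):
    """Two-sided z test for a difference between two independent proportions."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    z = (p1 - p2) / se
    # Two-sided p value from the standard normal distribution (via erf)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 60% vs. 40% performance, with an assumed 50 observations per condition
z, p = two_proportion_z(0.60, 0.40, 50, 50)
print(f"z = {z:.2f}, p = {p:.3f}")  # z = 2.00, p ≈ .046
```

With these assumed sample sizes the p value falls just below .05, so the researcher would reject the null hypothesis; with smaller samples the same 60%/40% split would not reach significance.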
Conventionally, researchers make such decisions by assuming the null hypothesis to be true and, given this assumption, attempting to make inferences based on the probability of obtaining the actual pattern of results observed. Specifically, a statistical test yields the probability of a given result's (or one more extreme) being produced by chance if the null hypothesis is true. If D denotes an outcome or one more extreme and H0 denotes the null hypothesis's being true, then the probability produced by such a statistical test can be expressed as P(D|H0), that is, the conditional probability of D, given H0. If this figure is less than a threshold probability or alpha level (typically .05), then chance is concluded to be a sufficiently unlikely explanation of the outcome, and the existence of an effect is held to be supported by the data. Thus, for an alpha level of .05, one rejects the null hypothesis whenever P(D|H0) is less than .05. In other words, if D* denotes an outcome that would lead to the rejection of the null hypothesis, then by definition, P(D*|H0) is equal to the alpha level.
If the conclusion that chance is an insufficient explanation of the outcome is incorrect, then an error will have been made: The null hypothesis will have been rejected when it is in fact true. This is conventionally described as a Type I error. When a true null hypothesis is tested, then a Type I error is made if and only if the experimental outcome leads to the rejection of the null hypothesis. We have defined such an outcome as D*, and thus, the probability of such an outcome when the null hypothesis is true can be expressed as P(D*|H0), which is equal to the alpha level. For instance, if a 5% criterion is adopted prior to experimentation, then the experimenter will (on average) make Type I errors on 5% of those occasions when the null hypothesis is true. We refer to this as the conditional prior probability of making a Type I error because the probability is conditional on the null hypothesis's being true.1
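This conditional prior probability is easy to verify numerically. The following simulation is our own illustrative sketch (not the authors'); it assumes a test whose statistic is standard normal under a true null hypothesis and counts how often that true null is rejected at the .05 level:

```python
import random

random.seed(1)
CRITICAL = 1.96        # two-sided critical value for alpha = .05
N_EXPERIMENTS = 100_000

# Under a true null hypothesis the test statistic is (asymptotically)
# standard normal, so we draw it directly rather than simulating raw data.
rejections = sum(
    abs(random.gauss(0.0, 1.0)) > CRITICAL for _ in range(N_EXPERIMENTS)
)
print(rejections / N_EXPERIMENTS)  # close to .05: P(D* | H0) equals alpha
```

Roughly 5% of these tests of a true null hypothesis end in rejection, matching the conditional prior interpretation of the alpha level.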
We are grateful to Larry Phillips for his comments on a previous version of this article and to an anonymous reviewer for bringing the article by Carver (1978) to our attention.

Correspondence concerning this article should be addressed to P. Pollard, School of Psychology, Lancashire Polytechnic, Preston, Lancashire PR1 2TQ, United Kingdom.

1 For convenience, we assume two points regarding the alpha level throughout this article: First, a given alpha level will produce approximately that proportion of errors when the null hypothesis is true; second, the significance level derived from a statistical test is veridical. The first of these seems reasonable, although it is a simplification if short sequences of experiments are considered. We do not wish to wholly defend the second assumption. Statistical tests yield estimates of probabilities, and these estimates may vary as a function of the specific null hypothesis being tested (that is, the type of test chosen) and the extent to which the distributional characteristics of the data deviate from those of the theoretical null-hypothesis population being tested. A certain amount of such deviation is likely to occur in many psychological applications, and hence, such estimates may be inexact. Thus, our statement that the prior probability of a Type I error is given by the alpha level is a simplification. This point does not, however, affect the arguments presented here unless the errors in estimates produced by such tests are gross (in which case, of course, any discussion of the present nature would be pointless).

Psychological Bulletin, 1987, Vol. 102, No. 1, 159-163. Copyright 1987 by the American Psychological Association, Inc. 0033-2909/87/$00.75

Of course, when the null hypothesis is false, the experimenter cannot make a Type I error, only the converse error of failing to reject the null hypothesis (that is, a Type II error). It follows that the alpha level defines the maximum number of Type I errors to be expected across a series of experiments in which the null hypothesis may sometimes be true and sometimes be false. As Pagano (1981) expressed this point, "The alpha level which the scientist sets at the beginning of the experiment is the level to which he wishes to limit the probability of making a Type I error" (p. 203). For instance, the outcome of 20 tests of a true null hypothesis (not necessarily the same one in each case) would be expected to yield 19 correct acceptances and 1 Type I error (that is, 5%). If, however, during the same period, several tests were also made on a false null hypothesis, none of these could yield a Type I error, and thus, the overall percentage of experiments that yielded Type I errors would be less than 5%.

In general, a Type I error is made in any experiment if and only if the experimental outcome leads to the rejection of the null hypothesis and the null hypothesis is in fact true. Thus, the probability of making such an error is equal to P(D* & H0), which is equivalent to P(H0)·P(D*|H0). This formulation makes it clear that the probability is less than P(D*|H0), that is, the alpha level, if P(H0) is less than one. Thus, it can be held to be equal to the alpha level only if it is assumed that only true null hypotheses are ever tested. We refer to this as the overall prior probability of making a Type I error because it is based on the frequency of Type I errors as a proportion of all experiments.

If one decides to reject the null hypothesis, however, what is the probability of having made a Type I error? When the null hypothesis is rejected, the probability of having made a Type I error is a probability about the null hypothesis because a Type I error has been made if and only if the null hypothesis is in fact true. It follows that the probability of having made a Type I error is equal to P(H0|D*). We refer to this as the conditional posterior probability of making a Type I error because it is conditional on the decision to reject the null hypothesis. This yields the probability of any particular rejection's having been a Type I error, and therefore, it could be used to estimate the proportion of all significant results that are Type I errors.

Our informal inquiries within a wide and varied cross section of our professional colleagues indicated a widespread assumption that the probability of having made a Type I error in rejecting the null hypothesis is the same as the alpha level; that is, if a null hypothesis is rejected at the .05 level of significance, then there is a 5% chance of having made a Type I error. A more colloquial way of expressing this is to say that in any particular instance, there is a 5% probability that the results are due to chance. A corollary of this assumption is that at 5%, 1 in every 20 significant results will be a Type I error.

Probabilities of the form P(A|B) and P(B|A) are not, however, algebraically equivalent. Nor do they necessarily have the same value in practice. For example, the (high) probability of a population of firemen "generating" a sample person in uniform obviously is not the same as the (lower) probability that a sample person in uniform was "generated" by a population of firemen. It follows that P(H0|D*) is not the same as P(D*|H0) and, hence, that the posterior conditional probability of making a Type I error is not the same as the alpha level (cf. Cronbach & Snow, 1977, p. 52). In more colloquial terms, the alpha level cannot be equated with the probability that the research results were due to chance or with the probability that the alternative hypothesis is false (Carver, 1978). Why, then, should there be such a common assumption that the alpha level yields the probability of having made a Type I error?

One possible reason is that many statistical texts that psychologists and their students read and use appear to encourage this idea.2 They generally do so by being vague about the domain of events across which the proportion of Type I errors is to be computed, thus leaving indeterminate which of the three possible interpretations of the expression "probability of making a Type I error" is intended. Most of these texts define this probability as equivalent to the alpha level, so they apparently have the conditional prior probability in mind. For example, "The probability of making a Type I error is very simply and directly indicated by α" (Guilford & Fruchter, 1978, p. 174); "The probability of committing a Type I error, which is denoted by α, is called the significance level of the test" (Hoel & Jessen, 1977, p. 223); "The probability of committing such an error is actually equivalent to the significance level we select" (Miller, 1975, p. 59); "Alpha determines the probability of making a Type I error" (Pagano, 1981, p. 202); and "The significance level is simply the probability of making a Type I error" (Robson, 1973, p. 35).

2 Carver (1978) observed that general textbooks in psychology also encouraged it (e.g., Anastasi, 1976, p. 109; Hebb, 1966, p. 173). Our purpose in this article is to illuminate the extent of the fallacy even in those texts that psychologists and their students might consider the most authoritative and to offer an explanation of its source in terms of current ideas about human decision making.

Of course, the alpha level does indeed give the probability of making a Type I error when the null hypothesis is true, but these quotations involve an unfortunate shorthand in which the conditional nature of this definition is left unstated. In its absence, the reader is likely to interpret the probability as either the overall prior probability (that is, ". . . across all experiments") or the conditional posterior probability (that is, ". . . whenever the null hypothesis is rejected"). In the former case, the reader is being led to equate the probability of making a Type I error with an expression that is merely an upper bound to that probability (unless the texts assume that their readers will never test any false null hypotheses). In the latter case, the reader is being encouraged to use the alpha level invalidly, not as a conditional
probability of the obtained outcome but as a conditional probability of the truth of the null hypothesis itself.
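The gap between these two conditional probabilities can be made concrete with a small simulation. This sketch is our addition, and the parameters in it are assumptions chosen only for illustration: the share of tested hypotheses that are truly null and the standardized effect size when the null is false are both hypothetical values, not figures from the article.

```python
import random

random.seed(2)
CRITICAL = 1.96    # two-sided critical value for alpha = .05
P_H0 = 0.5         # assumed share of tested hypotheses that are truly null
EFFECT = 1.0       # assumed standardized effect when H0 is false
N = 200_000

type_i, rejections = 0, 0
for _ in range(N):
    h0_true = random.random() < P_H0
    # Test statistic: standard normal under H0, mean-shifted when H0 is false
    z = random.gauss(0.0 if h0_true else EFFECT, 1.0)
    if abs(z) > CRITICAL:
        rejections += 1
        if h0_true:
            type_i += 1

# P(D* | H0) is fixed at .05 by construction, yet P(H0 | D*) comes out
# quite different (around .23 under these assumed parameters).
print(type_i / rejections)
```

Under these assumptions the proportion of significant results that are Type I errors is several times the alpha level, because the modest effect size gives the test little power; with larger assumed effects (higher power) the same ratio falls well below .05. Either way, the simulation makes the article's point: the alpha level fixes P(D*|H0), not P(H0|D*).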
A common device that promotes this ambiguity is a 2 X 2
table showing the probability of a correct or incorrect decision
as a function of (a) the truth or falsehood of the null hypothesis
and (b) a decision to accept or reject the null hypothesis. The
alpha level is shown in the cell corresponding to "null hypothesis true" and "null hypothesis rejected." The headings on such
tables, however, typically do not identify the contents as being
conditional (rather than absolute) probabilities. Even if the
reader realizes that the alpha level is a conditional probability,
the table usually gives no indication of whether it is conditional
within rows or within columns and, thus, fails to distinguish
between the conditional prior and the conditional posterior interpretations of "the probability of making a Type I error."
These problems worsen when the authors in question discuss the frequency of Type I errors. For instance, Christensen (1980) reported, "If the .05 significance level is set, you run the risk of being wrong and committing a Type I error five times in 100" (p. 311). Keppel and Saufley (1980) wrote, "We will make a Type I error a small percentage of the time—the exact amount being specified by our significance level" (p. 137). The puzzled reader might legitimately ask of such quotations, five times in which 100, or a small percentage of what time? The authors never made clear that the answer is five times in 100 when a true null hypothesis is tested, and once again they led the reader toward alternative interpretations, such as five times in 100 experiments (that is, the overall prior probability) or five times in 100 when the null hypothesis is rejected (that is, the conditional posterior probability).
It is perhaps more serious that this uncertainty concerning
the interpretation of the expression "the probability of making
a Type I error" is shared by at least some of the authors of statistical texts in frequent use. In particular, they tend to confound
the conditional prior probability of making a Type I error with
the conditional posterior probability. For instance, Greene and
D'Oliveira (1982), after correctly describing the significance
level as a conditional prior probability, later refer to it as the
"percentage probability . . . that your results are due to
chance" (p. 31). The earlier definition of the probability has
been changed, and the alpha level is now held to be the probability of a particular source of the data (namely, the null hypothesis). In a similar manner, Miller (1975) states, "If we reject the
null hypothesis whenever the chance of it being true is less than
0.05, then obviously we shall be wrong 5 per cent of the time"
(p. 59). Once again, the alpha level is used specifically as a probability of the truth of the null hypothesis. Finally, when discussing a decision to reject the null hypothesis on the basis of a test
statistic whose value lies in the appropriate rejection region,
Siegel (1956) states, "We may explain the actual occurrence of
that value in two ways: first we may explain it by deciding that
the null hypothesis is false, or second, we may explain it by deciding that a rare and unlikely event has occurred" (p. 14). After
stating that the first alternative would be chosen, although the
second might be true (that is, the null hypothesis might be true),
he continues, "In fact, the probability that the second explanation is the correct one is given by [italics added] α, for rejecting H0 when in fact it is true is the Type I error" (Siegel, 1956, p.
14). The italicized portion of the last quotation again clearly
shows the interpretation of the alpha level as a probability of
the truth of the null hypothesis.
In short, one possible reason for the common assumption
among psychologists and their students that the alpha level represents the probability of having made a Type I error is that
standard statistical texts promote this fallacy. On one hand, they
may correctly define the alpha level as the probability of making
a Type I error, given a true null hypothesis, but then subsequently make no reference to the prior conditional nature of
this probability. On the other hand, they may leave the intended
interpretation of the expression "the probability of making a
Type I error" entirely unclear. In either case, the resulting ambiguity between P(D*|H0) and P(H0|D*) is compounded further
until the alpha level comes to be identified with the conditional
posterior probability of making a Type I error.
Of course, this might be seen as a problem that relates simply
to the training of psychologists rather than one of general theoretical interest to psychology itself. It is pertinent, however, to
ask why the authors of statistics textbooks as well as their readers should be vulnerable to such problems in their reasoning
about probabilities. Why should there be such a fundamental
tendency to confuse the posterior probability that the null hypothesis is true, given that it has been rejected, with the prior
probability that the null hypothesis will be rejected, given that
it is true?
Kahneman and Tversky (1973) demonstrated a widespread tendency for subjects to confuse (intuitive) judgments of the form P(D|H) with judgments of the form P(H|D) and showed that psychologists themselves are vulnerable to such statistical biases (Tversky & Kahneman, 1971). Kahneman and Tversky described such errors as examples of a "base rate" or "prior probability" fallacy encouraged by superficial judgments of similarity, or "representativeness." Bar-Hillel (1974) demonstrated explicitly that this tendency to base judgments on similarity leads subjects to make no distinction between P(D|H) and P(H|D) judgments. Analogously, the significance level
seems to operate as a measure of the similarity, or representativeness, between the observed sample and the null-hypothesis
population and, thus, creates the illusion that it predicts both
the likelihood of rejecting the null hypothesis, given that it is
true, and the likelihood of the null hypothesis being true, given
that it has been rejected.
Another way of characterizing errors of similarity or representativeness is to view the confusion as stemming from a faulty
logical inference rather than a faulty statistical one. Consider
the following argument:
If H0 then not D*
D*
Therefore: Not H0.
It should be clear that this is a valid (modus tollens) inference.
However, consider the following variant:
If H0 then D* very unlikely
D*
Therefore: H0 very unlikely.
It should be equally clear that this is now invalid (substitute "This person is American" for H0 and "This person is a member of Congress" for D*). Such (conditional) inferences do not work with probabilistic relations. One may make inferences on the basis of a premise of the form "If H0 then not D*" but not on the basis of a premise of the form "If H0 then probably not D*." Possibly, however, the latter statement leads to the assumption of a symmetrical relation and, thus, to the erroneous inference "If D* then probably not H0." There is little research evidence for this specific transformation, although there is abundant evidence that standard universal and conditional statements tend to be "converted" in this way (see, for instance, the chapters on conditional and syllogistic reasoning in Evans, 1982), and there is some evidence that any relation will tend to be seen as symmetrical (Newstead, Pollard, & Griggs, 1986; Tsal, 1977).

If such a transformation is intuitively made, then it will lead from "If H0 then the probability of D* is equal to the alpha level" to "If D* then the probability of H0 is equal to the alpha level." This provides a characterization of this similarity, or representativeness, error in terms of a logical fallacy rather than a statistical one. Also, according to the foregoing discussion, the hypothetico-deductive method is not strictly workable when combined with the conventional procedures of statistical inference.

Of course, the conditional prior probability of making a Type I error and the conditional posterior probability of making such an error can be directly related to one another by means of Bayes's theorem. Specifically, P(H0|D*) = P(D*|H0)·P(H0)/P(D*). Such a formulation shows that the conditional prior and conditional posterior probabilities will be equivalent only if P(H0) and P(D*) are fortuitously equal. Otherwise, the conditional posterior probability of making a Type I error will be greater or less than the alpha level, depending on whether P(H0) is greater or less than P(D*).

Nevertheless, it is impossible to quantify the value of P(H0), either across all hypotheses or in any individual case. It would in principle be possible for individual researchers to arrive at an empirical estimate of P(D*) in their own work by enumerating all the instances of significant and nonsignificant results that they had ever obtained. It would not, however, be possible to estimate the value of P(D*) in the general domain of psychological research because it is not known how many nonsignificant findings go unreported in the published literature (Greenwald, 1975).

Because it is impossible to estimate either P(H0) or P(D*), the precise value of the conditional posterior probability of making a Type I error is in principle unquantifiable and, hence, indeterminate. From this follow two conclusions of particular interest. First, there are no circumstances under which it would be legitimate to claim that the alpha level was an index of the conditional posterior probability of making a Type I error. Second, the alpha level cannot be used to estimate the proportion of Type I errors in the psychological research literature. Nevertheless, both P(D*) and P(H0) are under the control of psychological researchers, as we now point out.

The proportion of Type I errors in the literature should be computed as the ratio of such errors to the total number of rejections of the null hypothesis. Thus, anything that maximizes the number of correct rejections (such as experimental power) will reduce this proportion. It follows from this that whereas the (prior) probability of a Type II error shows the familiar inverse relation with the conditional prior probability of a Type I error, it is positively related to the conditional posterior probability of a Type I error. Thus, to the extent that most psychologists run sensitive experiments that are associated with high values of P(D*) and that minimize Type II errors, the proportion of Type I errors in the literature will not be unduly high.

Furthermore, to the extent that most psychologists frame good alternative hypotheses (that is, ones more likely to be true than false), P(H0) will be likely to be low. A special case of this is when one attempts to replicate an outcome that led to the rejection of the null hypothesis. Because the overall probability of rejecting the null hypothesis, P(D*), will be greater than the threshold probability level, P(D*|H0), the Bayesian relation already stated entails that the conditional posterior probability of making a Type I error, P(H0|D*), will be less than the prior probability of the null hypothesis's being true, P(H0). By a similar argument, it can be shown that the conditional posterior probability of making a Type I error in a replication of an outcome that was statistically significant must be less than the conditional posterior probability of making a Type I error in the original experiment, provided that the threshold probability level remains the same in both cases. In short, although it is not possible to quantify the conditional posterior probability of having made a Type I error, one may nevertheless be confident that this probability is reduced by successive replications of the effect. In other words, even an experiment that fails to achieve statistical significance but replicates the direction of a previous significant effect decreases rather than increases the probability that the original finding was a Type I error (cf. Humphreys, 1980). A corollary of this is that to ensure the same conditional posterior probability of making a Type I error as in the original experiment, a researcher should use a less conservative threshold probability level in its replication. (These points are, of course, implicitly recognized in techniques of research synthesis; e.g., Green & Hall, 1984, and Rosenthal, 1978.)

Given the assumption that psychologists tend to frame good alternative hypotheses and, hence, that the value of P(H0) is generally low, together with the fact that P(H0|D*) will be less than P(H0), the posterior probability of Type I errors is likely to be substantially lower than the alpha level. Thus, that the posterior probability of having made a Type I error is in principle unquantifiable is not necessarily cause for general concern because the values of P(D*) and P(H0) are to some extent under the researcher's control and there are reasons for believing that the overall number of Type I errors in the literature is small. Nevertheless, in any specific case of rejection of the null hypothesis, the probability of having made a Type I error is indeterminate.

References

Anastasi, A. (1976). Psychological testing (4th ed.). New York: Macmillan.
Bar-Hillel, M. (1974). Similarity and probability. Organizational Behavior and Human Performance, 11, 277-282.
Carver, R. P. (1978). The case against statistical significance testing. Harvard Educational Review, 48, 378-399.
Christensen, L. B. (1980). Experimental methodology (2nd ed.). Boston: Allyn & Bacon.
Cronbach, L. J., & Snow, R. E. (1977). Aptitudes and instructional methods: A handbook for research on interactions. New York: Irvington.
Evans, J. St. B. T. (1982). The psychology of deductive reasoning. London: Routledge & Kegan Paul.
Green, B. F., & Hall, J. A. (1984). Quantitative methods for literature reviews. Annual Review of Psychology, 35, 37-53.
Greene, J., & D'Oliveira, M. (1982). Learning to use statistical tests in psychology: A student's guide. Milton Keynes, United Kingdom: Open University Press.
Greenwald, A. G. (1975). Consequences of prejudice against the null hypothesis. Psychological Bulletin, 82, 1-20.
Guilford, J. P., & Fruchter, B. (1978). Fundamental statistics in psychology and education. New York: McGraw-Hill.
Hebb, D. O. (1966). A textbook of psychology. Philadelphia, PA: Saunders.
Hoel, P. G., & Jessen, R. J. (1977). Basic statistics for business and economics (2nd ed.). New York: Wiley.
Humphreys, L. G. (1980). The statistics of failure to replicate: A comment on Buriel's (1978) conclusions. Journal of Educational Psychology, 72, 71-75.
Kahneman, D., & Tversky, A. (1973). On the psychology of prediction.
Psychological Review, 80, 237-251.
Keppel, G., & Saufley, W. R., Jr. (1980). Introduction to design and analysis: A student's handbook. San Francisco: Freeman.
Miller, S. (1975). Experimental design and statistics. London: Methuen.
Newstead, S. E., Pollard, P., & Griggs, R. E. (1986). Response bias in relational reasoning. Bulletin of the Psychonomic Society, 24, 95-98.
Pagano, R. R. (1981). Understanding statistics in the behavioral sciences. St. Paul, MN: West.
Robson, C. (1973). Experiment, design, and statistics in psychology.
Harmondsworth, England: Penguin.
Rosenthal, R. (1978). Combining results of independent studies. Psychological Bulletin, 85, 185-193.
Siegel, S. (1956). Nonparametric statistics for the behavioral sciences. New York: McGraw-Hill.
Tsal, Y. (1977). Symmetry and transitivity assumptions about a nonspecified logical relation. Quarterly Journal of Experimental Psychology, 29, 677-684.
Tversky, A., & Kahneman, D. (1971). The belief in the law of small
numbers. Psychological Bulletin, 76, 105-110.
Received January 1, 1986
Revision received May 1, 1986
Accepted October 27, 1986