$Revision 1.8$
A brief note on ...
Combining statistical tests by multiplying p-values
James Theiler
Astrophysics and Radiation Measurements Group, NIS-2
Los Alamos National Laboratory, MS-D436
Los Alamos, NM 87545
Abstract
A general discussion of combining statistical hypothesis tests is followed by an application involving point source detection in skymaps.
The traditional literature on combining statistical tests has concentrated on a regime
in which p-values as large as 0.05 are still of interest, and in which it is dangerous to
treat tests as truly independent. But for point source detection, very small p-values are
desired, and situations can arise in which one can fairly assume independence of the
tests.
1. Introduction
Classical hypothesis testing requires that both the null hypothesis H₀ and the desired "level" α be specified beforehand, before ever looking at the data. Then a statistic is computed on the data, and this statistic is converted to a p-value. If the p-value is less than α, then the hypothesis is rejected. The level α is also the false alarm rate: if the null hypothesis is true, and assuming that the test is well-calibrated, then it will be rejected with probability α.
In practice, many researchers prefer not to specify α beforehand and then simply report acceptance or rejection. Instead, these authors compute and report the p-value. This is sensible, since it provides no less information to the reader than the single accept/reject bit. Indeed, it tells the reader that for all α > p, the hypothesis would be rejected, and for α < p, it would be accepted. The reader can choose his own α and determine whether or not to reject the null.
Despite its name, the p-value is not as directly interpretable as a probability as the level α is. In particular, the p-value certainly is not the probability that the null hypothesis is true. (If this point is not clear, you'll have to ask a Bayesian!)
But since α is a probability, and the p-value tells you at what α you can reject the null hypothesis, the p-value is often treated informally as a probability. One place where this is sometimes done incorrectly is with the combination of multiple independent tests.
Suppose you have a null hypothesis H₀ and two independent data sets D₁ and D₂. When you perform the hypothesis tests, you get two p-values p₁ and p₂. Based on this information, at what level can you reject the null hypothesis? The informal and incorrect answer is the product p₁p₂, for reasons that I will try to clarify.
But first, I will provide a "correct" approach. You know beforehand that you are going to do two tests, and you want the combination to have a "level" (aka size, aka false alarm rate) α. So you choose numbers α₁ and α₂ such that α₁α₂ = α. For instance, you might take α₁ = α₂ = √α. This choice is arbitrary, but it is crucial that you make the choice before looking at the data. Then, you compute the p-values, and if p₁ < α₁ and p₂ < α₂, you reject the null hypothesis at a level α.
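A minimal Monte Carlo sketch (my own, not from the text; the threshold names a1 and a2 are illustrative) confirms that this pre-specified two-test procedure has false alarm rate α under the null:

    import random
    from math import sqrt

    random.seed(1)
    alpha = 0.05
    a1 = a2 = sqrt(alpha)          # alpha1 * alpha2 = alpha
    TRIALS = 200_000
    # Under the null, each p-value is U(0,1); reject only if BOTH fall
    # below their pre-chosen thresholds.
    rejects = sum(1 for _ in range(TRIALS)
                  if random.random() < a1 and random.random() < a2)
    print(rejects / TRIALS)        # ~0.05, the intended false alarm rate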
Again, though, you'd prefer not to report a single accept/reject bit, but a (hopefully) more informative p-value for the pair of tests taken together. Also, you'd like to be more flexible than requiring that both tests reject the null. If one test decisively rejects the null, but the other is entirely consistent with the null, you'd still like to reject the null. For instance, the data may be taken on different days, and the null may be false one day and true the next; in that case the statement that "the null is always true" is false. In short, you'd like to multiply p-values.
But if you think about it, straight multiplication of p-values can't be right. All p-values are less than one, so if you multiply enough of them together, you'll be able to get as small a value as you like, even when the null is true.
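A quick simulation (my own sketch, not from the text) makes this concrete: under the null, the probability that a product of k uniform p-values falls below 0.05 grows rapidly with k:

    import random

    random.seed(0)
    TRIALS = 100_000
    for k in (1, 2, 5, 10):
        hits = 0
        for _ in range(TRIALS):
            prod = 1.0
            for _ in range(k):
                prod *= random.random()   # each p-value is U(0,1) under the null
            if prod < 0.05:
                hits += 1
        # the fraction below 0.05 grows toward 1 as k increases
        print(f"k={k:2d}  P(product < 0.05) ~ {hits / TRIALS:.3f}")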
2. Multiplying p-values
The defining property of a p-value is that if the null is true, then p is uniformly distributed on the unit interval: p ∼ U(0, 1). Put another way,

    P(p < x) ≤ x                                            (1)

for 0 ≤ x ≤ 1. Here, we write P(X) for the probability of observing event X. If this inequality is not satisfied, then the test will have an excess of "Type I" errors, and will not be considered accurate. The more nearly equality is approached, the more efficient is the test.
If we have two p-values, p₁ and p₂, we can always multiply them (let p = p₁p₂), but we cannot interpret their product p as an honest p-value, because the product is not uniformly distributed on the unit interval.
What we can do, though, is the following:
1. Suppose that there are k independent tests, and they will produce p-values p₁, p₂, ..., pₖ.
2. Assuming H₀, compute the probability Rₖ(α) that the product p = p₁p₂⋯pₖ is less than α.
3. Go ahead and compute the p-values referred to in step 1, and compute their product as well: p = p₁p₂⋯pₖ.
4. Report Rₖ(p) as the p-value of the combined test.
The only hard part here is step 2, but fortunately this calculation only needs to be done once, and it will apply to all situations in which tests are to be combined by multiplying p-values. And it is worth emphasizing that the final p-value, though not equal to the product p, will depend on the individual p-values only through their product.
Since the individual p-values are uniform on the unit interval, the computation in step 2 is straightforward. For k = 1, we have R₁(α) = α by definition. For k = 2, R₂(α) is the area inside the unit square that is "under" the hyperbola given by p₁p₂ = α. We can show that

    R₂(α) = α + ∫_α^1 (α/p₁) dp₁ = α (1 + log(1/α))         (2)

More generally, we can write the recursion relation

    Rₖ(α) = α + ∫_α^1 Rₖ₋₁(α/p) dp                          (3)

The derivation is left as an exercise (at least until I come up with a cleaner one myself). This leads to
    R₃(α) = α { 1 + log(1/α) + (1/2) log²(1/α) }            (4)

but I have not actually integrated it for k ≥ 4.
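The closed forms in Eqs. (2) and (4) are easy to code up; here is a minimal Python sketch (the function name combined_pvalue is mine, and it deliberately stops at k = 3, since the text leaves k ≥ 4 unintegrated):

    from math import log

    def combined_pvalue(pvals):
        """Corrected p-value R_k(p) for the product p = p1*p2*...*pk,
        using Eq. (2) for k = 2 and Eq. (4) for k = 3."""
        k = len(pvals)
        p = 1.0
        for pi in pvals:
            p *= pi
        L = log(1.0 / p)
        if k == 1:
            return p
        if k == 2:
            return p * (1.0 + L)                 # Eq. (2)
        if k == 3:
            return p * (1.0 + L + 0.5 * L * L)   # Eq. (4)
        raise NotImplementedError("closed forms given only for k <= 3")

    print(combined_pvalue([0.04, 0.5]))   # ~0.098, as in the example below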
As an example, suppose you perform two tests; the first gives a p-value of 0.04 (certainly significant at the traditional 5% level), but the second gives a p-value of 0.5 (just plain insignificant). If we ignore the second test (which we are free to do as long as we decide to ignore it before actually performing it!), we can confidently reject the null hypothesis. If we multiply p-values, we get 0.02, but it is (properly) counter-intuitive that an insignificant result should increase the significance of a test. When we compute the correct p-value for the combined tests, we obtain 0.02 (1 − log 0.02) = 0.098, which is no longer significant at the 5% level.
Was it a mistake to combine tests? Well, suppose both tests gave a p-value of 0.06. Both tests are (just) insignificant at the 5% level, but the combined test has a p-value of 0.024, which is significant.
As a second example, suppose that we had four independent tests, each giving a hardly significant p-value of 0.1; this is equivalent to two pairwise tests, each with a combined p-value of 0.056, or to a single combined test with p = 0.021. This is certainly significant, but nowhere near the 10⁻⁴ level that a naive multiplication of p-values would provide. It is of obvious importance that the tests not only be independent, but that they not be "selected" either. If in fact there were really 8 tests, four giving p-values of 0.1 and the other four giving p-values of 0.5, say, then combining the four insignificant tests gives a p-value of 0.7236, and then combining this with the significant four-way test gives a combined p-value of 0.08, which is not significant.
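Continuing with the combined_pvalue sketch above, the hierarchical combinations in this example can be reproduced by pairing tests exactly as the text does:

    pair = combined_pvalue([0.1, 0.1])        # ~0.056
    print(combined_pvalue([pair, pair]))      # ~0.021

    # Folding in the four additional p-values of 0.5:
    half = combined_pvalue([combined_pvalue([0.5, 0.5]),
                            combined_pvalue([0.5, 0.5])])          # ~0.7236
    print(combined_pvalue([combined_pvalue([pair, pair]), half]))  # ~0.08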
Interestingly, there is a "break-even" point for two tests which are equally significant. Suppose the two tests individually each have significance p; then their combined significance is p' = p² − p² log p². As long as p < 0.2847, then p' < p and the combined p-value will be less than the individual p-values. For p larger than this value, combining the tests will increase the p-value, which makes the combined test less significant.
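The break-even point is the root of p (1 − 2 log p) = 1, which a few lines of bisection (my own sketch) will locate:

    from math import log

    # Solve p * (1 - 2*log(p)) = 1, i.e. where p' = p^2 (1 - log p^2) equals p.
    def f(p):
        return p * (1.0 - 2.0 * log(p)) - 1.0

    lo, hi = 0.1, 0.5     # f(0.1) < 0 < f(0.5), so the root is bracketed
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    print(0.5 * (lo + hi))   # ~0.2847, the break-even point quoted in the text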
3. Standard methods
The most commonly used method of combining statistical tests is called the Bonferroni procedure. Let p_(1), p_(2), ..., p_(m) be the p-values for the m tests, sorted in increasing order. Then the appropriate p-value for the combined test is no larger than

    p_B = m p_(1).                                          (5)
Expressed another way: if you want to test a hypothesis at some pre-determined level α, then you can reject the null if at least one of the m tests has a p-value less than α/m. This is actually a special case of Ruger's inequality, which points out that for any k,

    p_R = m p_(k) / k                                       (6)

is an upper bound on the actual p-value of the combined test. However, k must be chosen beforehand. It bears remarking that the inequalities in Bonferroni's and Ruger's methods hold even when the statistical tests are not independent. That is to say, the tests are conservative and will not improperly underestimate the p-value (i.e., overestimate the significance). On the other hand, it is also worth emphasizing that these are inequalities, and that when the individual tests are independent, they underestimate the significance of the combined test. For example, when the tests are independent, we find

    P(p_B < x) = x − ((m − 1)/(2m)) x² + O(x³)              (7)

for Bonferroni's method.
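A quick numerical check of Eq. (7) (my own; it uses the standard fact that the minimum of m independent uniforms has CDF 1 − (1 − u)^m):

    m, x = 4, 0.05
    exact = 1.0 - (1.0 - x / m) ** m            # P(p_B < x) = P(min p_i < x/m)
    approx = x - (m - 1) / (2.0 * m) * x ** 2   # Eq. (7), through second order
    print(exact, approx)                        # 0.049070... vs 0.0490625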
A number of authors have suggested "sharper" procedures for combining statistical tests [1-3], but a particularly noteworthy suggestion is due to Simes [4], who suggests using

    p_S = min_k [ m p_(k) / k ]                             (8)

as a p-value for the combined test. When the tests are independent, it can be shown that p_S is the exact p-value for the combined tests. When the tests are not independent, Simes [4] argues from numerical evidence that in most such situations the approximation will still be conservative.
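Simes' rule is a one-liner; here is a sketch (the function name is mine):

    def simes_pvalue(pvals):
        """Simes combined p-value, Eq. (8): the minimum over k of m * p_(k) / k."""
        ps = sorted(pvals)
        m = len(ps)
        return min(m * pk / k for k, pk in enumerate(ps, start=1))

    print(simes_pvalue([0.04, 0.5]))   # min(2*0.04/1, 2*0.5/2) = 0.08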
The method of multiplying p-values will also be exact if the tests are independent. Also, if both tests are very significant, then this way of combining tests will give an extremely significant result. However, if the tests are not independent, it can seriously overestimate the significance.
4. Weighted combination of tests
The method of multiplying p-values produces an overall p-value which has equal contributions from each of the tests. However, if it is known in advance that one test is more powerful than another, one may want to weight its contribution appropriately. One straightforward way to do this is to write

    p = p₁^a p₂^b                                           (9)

where a > b weights the first test more heavily, and vice versa. One can show in this case that
    p' = p^(1/a) a/(a − b) + p^(1/b) b/(b − a)                          (10)

       = p₁p₂ [ p₂^(b/a − 1)/(1 − b/a) + p₁^(a/b − 1)/(1 − a/b) ]       (11)
Figure 1: Comparison of the Simes contours of combined p-value (left panel) with the product contours (right panel), for α = 0.1, 0.05, 0.02, and 0.01. The middle panel is a direct comparison of the contours for α = 0.05. [Plot axes: p₁ versus p₂ over the unit square.]
is the corrected p-value. Note that the p-value obtained by combining two p-values p₁ and p₂ in this way depends only on the ratio a/b, and not on the absolute values of a and b. In particular, when a = b the formula reverts to

    p' = p₁p₂ (1 − log p₁p₂).                               (12)
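Eqs. (10) and (12) are straightforward to evaluate; a minimal sketch (the function name is mine), reproducing the numbers used in the example of Fig. 3 below:

    from math import log

    def weighted_combined_pvalue(p1, p2, a, b):
        """Corrected p-value for the weighted product p1^a * p2^b, Eq. (10);
        the a = b case falls back to the unweighted formula, Eq. (12)."""
        if a == b:
            p = p1 * p2
            return p * (1.0 - log(p))                                        # Eq. (12)
        p = p1**a * p2**b
        return (a / (a - b)) * p**(1.0 / a) + (b / (b - a)) * p**(1.0 / b)   # Eq. (10)

    print(weighted_combined_pvalue(0.05, 0.00005, 1.0, 1.0))    # ~0.0000347
    print(weighted_combined_pvalue(0.05, 0.00005, 1.0, 1.49))   # ~0.00002 (cf. Fig. 3)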
Figure 2: Contours of combined p-value for α = 0.1, 0.05, 0.02, and 0.01. The left panel here (which is identical to the right panel of the previous figure) is for the unweighted product combination, and the right panel is for the weighted case.
It is straightforward to generalize this weighted combination to the case of more than two p-values,

    log p = a₁ log p₁ + a₂ log p₂ + ⋯                       (13)

though converting p to a true p-value may be nontrivial.
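One way (my own suggestion, not from the text) to convert the weighted product of Eq. (13) into an honest p-value when no closed form is at hand is Monte Carlo under the null, where each p-value is uniform:

    import random
    from math import log

    def mc_weighted_pvalue(pvals, weights, trials=1_000_000, seed=2):
        """Monte Carlo p-value for the weighted statistic of Eq. (13)."""
        rng = random.Random(seed)
        obs = sum(a * log(p) for a, p in zip(weights, pvals))
        hits = sum(1 for _ in range(trials)
                   if sum(a * log(rng.random()) for a in weights) <= obs)
        return hits / trials

    # Should agree with Eq. (10) (~0.00002) up to Monte Carlo noise:
    print(mc_weighted_pvalue([0.05, 0.00005], [1.0, 1.49]))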
It bears remarking that the advantage obtained using the weighted combination is slight unless there is a large disparity between p₁ and p₂. For instance, suppose that when the null hypothesis is false, then p₁ = 0.05 (barely significant) and p₂ = 0.00005 (three orders of magnitude more significant). Since the second test is more powerful, we expect a stronger combined test when b/a > 1. Fig. 3 shows the combined p-value as a function of the ratio b/a. If we just throw out the first (barely significant) test, our final p-value is p = p₂ = 0.00005. (Using Bonferroni's procedure, we would get p_B = 0.0001 for this example.)
Figure 3: Combined p-value for a weighted combination of tests as a function of the ratio b/a of the weights. The solid line is the case p₁ = 0.05 and p₂ = 0.00005, and the dashed line is for p₁ = 0.005 and p₂ = 0.0005. The curves cross at b/a = 1, since the unweighted combination depends only on the product p = p₁p₂. Only when the individual p-values are orders of magnitude apart is there much gain in using a weighted combination.
We get a p-value that is 30% smaller than this when we combine it with the first test using the unweighted Eq. (12). If we use b/a = 1.49, we can reduce this further, to p = 0.00002. On the other hand, when p₁ and p₂ are of the same order of magnitude, there is virtually no gain in using a weighted instead of an unweighted combination.
5. An application
Consider the problem of testing for a "point source" in a map of counts. If there is a point source at a candidate location, then there will be more counts per unit area in a source region centered on this location than there will be in an annulus drawn around this location. If As and Ab are the areas of the source region and background annulus, then the p-value associated with seeing Ns and Nb counts in these two regions is given by [5]

    p = I_f(Ns, Nb + 1)                                     (14)

where f = As/(As + Ab) and I_x(a, b) = ∫₀ˣ t^(a−1) (1 − t)^(b−1) dt / B(a, b) is the (normalized) incomplete beta function.
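Eq. (14) can be evaluated with SciPy's regularized incomplete beta function; a sketch (the counts and areas below are invented for illustration):

    from scipy.special import betainc   # regularized incomplete beta I_x(a, b)

    def source_pvalue(Ns, Nb, As, Ab):
        """p-value of Eq. (14): Ns counts in the source region (area As),
        Nb counts in the background annulus (area Ab)."""
        f = As / (As + Ab)
        return betainc(Ns, Nb + 1, f)   # p = I_f(Ns, Nb + 1)

    # Invented numbers: 20 counts where ~10.9 are expected under the null.
    print(source_pvalue(Ns=20, Nb=100, As=1.0, Ab=10.0))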
In general, the size of the source region should match the size of the "point spread function" (PSF) of the telescope, and in fact one can show that for a gaussian source, the optimum is given by a source region with radius 1.4σ.
If the PSF is not a "tophat" function, then this will not be the most powerful test. The most powerful test would involve convolving the data with the PSF, and although a suitable statistic can be derived in the high count rate limit, it is difficult (and computationally expensive) to obtain correct p-values in general.
However, it is also possible to use two annuli and to separately test two independent null hypotheses. Consider a source region S and two concentric but disjoint annuli, an outer annulus A and an inner annulus B, both of which surround but are disjoint from the source region. Then the following two null hypotheses are strictly independent:
1. Treating S as source and B as background, the hypothesis is that there are no more counts in S than expected, given the number of counts in B.
2. Treating S ∪ B as source and A as background, the hypothesis again is that the count rate in the source region is equal to the count rate in the background region.
Let f₁ = As/(As + Ab) and f₂ = (As + Ab)/(As + Ab + Aa) be the ratios of areas for the two hypotheses. Then

    p₁ = I_f₁(Ns, Nb + 1)                                   (15)
    p₂ = I_f₂(Ns + Nb, Na + 1)                              (16)

are the p-values associated with the two hypotheses. And these p-values can be combined using either Eq. (11) or Eq. (12).
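Putting the pieces together, here is a sketch of the full two-annulus test (the argument names and example numbers are mine), combining Eqs. (15) and (16) via the unweighted Eq. (12):

    from math import log
    from scipy.special import betainc

    def two_annulus_pvalue(Ns, Nb, Na, As, Ab, Aa):
        """Combine the two independent annulus tests of Eqs. (15)-(16)
        with the unweighted product rule, Eq. (12)."""
        f1 = As / (As + Ab)                   # source region vs. inner annulus
        f2 = (As + Ab) / (As + Ab + Aa)       # source + inner vs. outer annulus
        p1 = betainc(Ns, Nb + 1, f1)          # Eq. (15)
        p2 = betainc(Ns + Nb, Na + 1, f2)     # Eq. (16)
        p = p1 * p2
        return p * (1.0 - log(p))             # Eq. (12)

    # Invented counts and areas, for illustration only:
    print(two_annulus_pvalue(Ns=25, Nb=40, Na=200, As=1.0, Ab=3.0, Aa=20.0))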
References
1. S. Holm, "A simple sequentially rejective multiple test procedure." Scand. J. Statist. 6, 65-70 (1979).
2. G. Hommel, "Tests of the overall hypothesis for arbitrary dependence structures." Biom. J. 25, 423-430 (1983).
3. Y. Hochberg, "A sharper Bonferroni procedure for multiple tests of significance." Biometrika 75, 800-802 (1988).
4. R. Simes, "An improved Bonferroni procedure for multiple tests of significance." Biometrika 73, 751-754 (1986).
5. M. Lampton, "Two-sample discrimination of Poisson means." Ap. J. 436, 784-786 (1994).