$Revision 1.8$

A brief note on ... Combining statistical tests by multiplying p-values

James Theiler
Astrophysics and Radiation Measurements Group, NIS-2
Los Alamos National Laboratory, MS-D436
Los Alamos, NM 87545

Abstract

A general discussion on combining statistical hypothesis tests is followed by an application involving point source detection in skymaps. The traditional literature on combining statistical tests has concentrated on a regime in which p-values as large as 0.05 are still of interest, and in which it is dangerous to treat tests as truly independent. But for point source detection, very small p-values are desired, and situations can arise in which one can fairly assume independence of the tests.

1. Introduction

Classical hypothesis testing requires that both the null hypothesis H0 and the desired "level" α be specified beforehand, before ever looking at the data. Then a statistic is computed on the data, and this statistic is converted to a p-value. If the p-value is less than α, then the hypothesis is rejected. The level α is also the false alarm rate: if the null hypothesis is true, and assuming that the test is well-calibrated, then it will be rejected with probability α.

In practice, many researchers prefer not to specify α beforehand and then simply report acceptance or rejection. Instead, these authors compute and report the p-value. This is sensible, since it provides no less information to the reader than the single accept/reject bit. Indeed, it tells the reader that for all α > p the hypothesis would be rejected, and for all α < p it would be accepted. The reader can choose his own α and determine whether or not to reject the null. Despite its name, the p-value is not as directly interpretable as a probability as the level α is. In particular, the p-value certainly is not the probability that the null hypothesis is true.
But since α is a probability, and the p-value tells you at what α you can reject the null hypothesis, the p-value is often treated informally as a probability.[1] One place where this is sometimes done incorrectly is with the combination of multiple independent tests. Suppose you have a null hypothesis H0 and two independent data sets D1 and D2. When you perform the hypothesis tests, you get two p-values p1 and p2. Based on this information, at what level α can you reject the null hypothesis? The informal (and incorrect) answer is the product p1 p2, for reasons that I will try to clarify.

[1] If this point is not clear, you'll have to ask a Bayesian!

But first, I will provide a "correct" approach. You know beforehand that you are going to do two tests, and you want the combination to have a "level" (aka size, aka false alarm rate) α. So you choose numbers α1 and α2 such that α = α1 α2. For instance, you might take α1 = α2 = √α. This choice is arbitrary, but it is crucial that you make the choice before looking at the data. Then, you compute p-values, and if p1 < α1 and p2 < α2, you reject the null hypothesis at level α. Again, though, you'd prefer not to report a single accept/reject bit, but a (hopefully) more informative p-value for the pair of tests taken together. Also, you'd like to be more flexible than requiring both tests to reject the null. If one decisively rejects the null, but the other is entirely consistent with the null, you'd still like to reject the null. For instance, the data may be taken on different days, and the null may be false one day and true the next; in that case the statement that "the null is always true" is false. In short, you'd like to multiply p-values. But if you think about it, straight multiplication of p-values can't be right. All p-values are less than one, so if you multiply enough of them together, you'll be able to get as small a value as you like, even when the null is true.
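As a sketch (mine, not the paper's), a short Monte Carlo confirms that the pre-specified two-threshold procedure has false alarm rate α1 α2 under the null; the choice α = 0.05, the sample size, and the seed are arbitrary:

```python
import random

def two_test_reject(p1, p2, alpha1, alpha2):
    """Pre-specified procedure: reject H0 only if both tests reject."""
    return p1 < alpha1 and p2 < alpha2

random.seed(1)
alpha = 0.05
a1 = a2 = alpha ** 0.5          # one arbitrary split with a1 * a2 = alpha
n = 200_000
# Under H0 the two p-values are independent U(0, 1) draws.
false_alarms = sum(
    two_test_reject(random.random(), random.random(), a1, a2) for _ in range(n)
)
rate = false_alarms / n         # should come out close to alpha
```

The observed rejection rate converges to α1 α2 = α, independent of how the split between α1 and α2 is chosen.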
2. Multiplying p-values

The defining property of a p-value is that if the null is true, then p is uniformly distributed on the unit interval: p ~ U(0, 1). Put another way,

    P(p < x) ≤ x                                                    (1)

for 0 ≤ x ≤ 1. Here, we write P(X) as the probability of observing event X. If this inequality is not satisfied, then the test will have an excess of "Type I" errors, and will not be considered accurate. The more nearly equality is approached, the more efficient the test.

If we have two p-values, p1 and p2, we can always multiply them (let p = p1 p2), but we cannot interpret their product p as an honest p-value, because the product is not uniformly distributed on the unit interval. What we can do, though, is the following:

1. Suppose that there are k independent tests, and they will produce p-values p1, p2, ..., pk.
2. Assuming H0, compute the probability Rk(α) that the product p = p1 p2 ... pk is less than α.
3. Go ahead and compute the p-values referred to in step 1, and compute their product as well: p = p1 p2 ... pk.
4. Report Rk(p) as the p-value of the combined test.

The only hard part here is step 2, but fortunately this calculation only needs to be done once, and it will apply to all situations in which tests are to be combined by multiplying p-values. And it is worth emphasizing that the final p-value, though not equal to the product p, will depend on the individual p-values only through their product.

Since the individual p-values are uniform on the unit interval, the computation in step 2 is straightforward. For k = 1, we have R1(α) = α by definition. For k = 2, R2(α) is the area inside the unit square that is "under" the hyperbola given by p1 p2 = α. We can show that

    R2(α) = α + ∫_α^1 (α/p1) dp1 = α [1 + log(1/α)]                 (2)

More generally, we can write the recursion relation

    Rk(α) = α + ∫_α^1 Rk-1(α/p1) dp1                                (3)

The derivation is left as an exercise (at least until I come up with a cleaner one myself).
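As a quick numerical check (mine, not the paper's), the closed form of Eq. (2) and the recursion of Eq. (3) can be sketched in Python; `Rk_numeric` applies one recursion step by midpoint quadrature:

```python
import math

def R2(alpha):
    """Eq. (2): probability that p1*p2 < alpha when p1, p2 ~ U(0,1)."""
    return alpha * (1.0 + math.log(1.0 / alpha))

def Rk_numeric(alpha, R_prev, steps=100_000):
    """One step of the recursion in Eq. (3):
    Rk(alpha) = alpha + integral from alpha to 1 of R_{k-1}(alpha/p) dp,
    evaluated by the midpoint rule."""
    h = (1.0 - alpha) / steps
    total = 0.0
    for i in range(steps):
        p = alpha + (i + 0.5) * h
        total += R_prev(alpha / p)
    return alpha + h * total

# One recursion step from R1(alpha) = alpha reproduces Eq. (2),
# and a second step reproduces the k = 3 closed form.
a = 0.02
r2 = Rk_numeric(a, lambda x: x)
L = math.log(1.0 / a)
r3_closed = a * (1.0 + L + 0.5 * L * L)
r3 = Rk_numeric(a, R2)
```

In particular R2(0.02) ≈ 0.098 and R2(0.06²) ≈ 0.024, matching the worked examples in the text.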
Carrying the recursion one step further leads to

    R3(α) = α [1 + log(1/α) + (1/2) log²(1/α)]                      (4)

but I have not actually integrated it for k ≥ 4.

As an example, suppose you perform two tests; the first gives a p-value of 0.04 (certainly significant at the traditional 5% level), but the second gives a p-value of 0.5 (just plain insignificant). If we ignore the second test (which we are free to do as long as we decide to ignore it before actually performing it!), we can confidently reject the null hypothesis. If we multiply p-values, we get 0.02, but it is (properly) counter-intuitive that an insignificant result should increase the significance of a test. When we compute the correct p-value for the combined tests, we obtain 0.02(1 − log 0.02) = 0.098, which is no longer significant at the 5% level. Was it a mistake to combine tests? Well, suppose both tests gave a p-value of 0.06. Both tests are (just) insignificant at the 5% level, but the combined test has a p-value of 0.024, which is significant.

As a second example, suppose that we had four independent tests, each giving a hardly significant p-value of 0.1; this is equivalent to two pairwise tests, each with a combined p-value of 0.056, or to a single combined test with p = 0.021. This is certainly significant, but at nowhere near the 10⁻⁴ level that a naive multiplication of p-values would provide.

It is of obvious importance that the tests not only be independent, but that they not be "selected" either. If in fact there were really 8 tests, four giving p-values of 0.1 and the other four giving p-values of 0.5, say, then combining the four insignificant tests gives a p-value of 0.7236, and then combining this with the significant four-way test gives a combined p-value of 0.08, which is not significant.

Interestingly, there is a "break-even" point for two tests which are equally significant. Suppose the two tests individually each have significance p; then their combined significance is p′ = p² − p² log p².
As long as p < 0.2847, then p′ < p and the combined p-value will be less than the individual p-values. For p larger than this value, combining the tests will increase the p-value, which makes the combined test less significant.

3. Standard methods

The most commonly used method of combining statistical tests is called the Bonferroni procedure. Let p(1), p(2), ..., p(m) be the p-values for the m tests, sorted in increasing order. Then the appropriate p-value for the combined test is no larger than

    pB = m p(1)                                                     (5)

Expressed another way: if you want to test a hypothesis at some pre-determined level α, then you can reject the null if at least one of the m tests has a p-value less than α/m. This is actually a special case of Rüger's inequality, which points out that for any k,

    pR = m p(k) / k                                                 (6)

is an upper bound on the actual p-value of the combined test. However, k must be chosen beforehand.

It bears remarking that the inequalities in Bonferroni's and Rüger's methods hold even when the statistical tests are not independent. That is to say, the tests are conservative and will not improperly underestimate the p-value (i.e., overestimate the significance). On the other hand, it is also worth emphasizing that these are inequalities, and that when the individual tests are independent, they underestimate the significance of the combined test. For example, when the tests are independent, we find

    P(pB < x) = x − ((m − 1)/(2m)) x² + O(x³)                       (7)

for Bonferroni's method.

A number of authors have suggested "sharper" procedures for combining statistical tests [1-3], but a particularly noteworthy suggestion is due to Simes [4], who suggests using

    pS = min_k { m p(k) / k }                                       (8)

as a p-value for the combined test. When the tests are independent, it can be shown that pS is the exact p-value for the combined tests. When the tests are not independent, Simes [4] argues from numerical evidence that in most such situations the approximation will still be conservative.
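For concreteness, the Bonferroni and Simes combined p-values can be sketched as follows (a minimal illustration; the function names are mine):

```python
def bonferroni(pvals):
    """Eq. (5): combined p-value bound m * p_(1)."""
    return min(1.0, len(pvals) * min(pvals))

def simes(pvals):
    """Eq. (8): min over k of m * p_(k) / k, with p-values sorted ascending."""
    m = len(pvals)
    return min(1.0, min(m * p / k for k, p in enumerate(sorted(pvals), start=1)))
```

For the earlier example of p-values 0.04 and 0.5, both procedures give 0.08; Simes is never larger than Bonferroni, and is exact when the tests are independent.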
The method of multiplying p-values will also be exact if the tests are independent. Also, if both tests are very significant, then this way of combining tests will give an extremely significant result. However, if the tests are not independent, this can seriously over-estimate the significance.

4. Weighted combination of tests

The method of multiplying p-values produces an overall p-value which has equal contributions from each of the tests. However, if it is known in advance that one test is more powerful than another, one may want to weight its contribution appropriately. One straightforward way to do this is to write

    p = p1^a p2^b                                                   (9)

where a > b weights the first test more heavily, and vice versa. One can show in this case that

    p′ = p^(1/a) a/(a−b) + p^(1/b) b/(b−a)                          (10)

       = p1 p2 { p2^(b/a−1)/(1 − b/a) + p1^(a/b−1)/(1 − a/b) }      (11)

is the corrected p-value. Note that the p-value obtained by combining p1 and p2 in this way depends only on the ratio a/b and not on the absolute values of a and b.

Figure 1: Comparison of Simes (left panel) contours of combined p-value with the product contours (right panel), for α = 0.1, 0.05, 0.02, and 0.01. The middle panel is a direct comparison of the contours for α = 0.05.

In particular, when a = b the formula reverts to

    p′ = p1 p2 (1 − log p1 p2)                                      (12)

Figure 2: Contours of combined p-value for α = 0.1, 0.05, 0.02, and 0.01. The left panel here (which is identical to the right panel of the previous figure) is for the unweighted product combination, and the right panel is for the weighted case.

It is straightforward to generalize this weighted combination to the case of more than two p-values,

    log p = a1 log p1 + a2 log p2 + ...                             (13)

though converting p to a true p-value may not be trivial.
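A minimal sketch of the weighted combination (my own code, not the paper's): the a = b branch uses Eq. (12) directly, since Eq. (10) is singular there, and the sample p-values are illustrative:

```python
import math

def combine_weighted(p1, p2, a, b):
    """Corrected p-value for the weighted product p = p1**a * p2**b.
    Uses Eq. (10) for a != b and Eq. (12) for a == b (where Eq. (10)
    is singular); only the ratio a/b matters."""
    if a == b:
        prod = p1 * p2
        return prod * (1.0 - math.log(prod))  # Eq. (12)
    p = p1**a * p2**b
    return p ** (1.0 / a) * a / (a - b) + p ** (1.0 / b) * b / (b - a)  # Eq. (10)

# Illustrative values: one barely significant test, one very significant one.
unweighted = combine_weighted(0.05, 0.00005, 1.0, 1.0)
weighted = combine_weighted(0.05, 0.00005, 1.0, 1.49)
```

Doubling both exponents leaves the result unchanged, since only a/b enters the formula.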
It bears remarking that the advantage obtained using the weighted combination is slight unless there is a large disparity between p1 and p2. For instance, suppose that when the null hypothesis is false, p1 = 0.05 (barely significant) and p2 = 0.00005 (three orders of magnitude more significant). Since the second test is more powerful, we expect a stronger combined test when b/a > 1. Fig. 3 shows the combined p-value as a function of the ratio b/a. If we just throw out the first (barely significant) test, our final p-value is p = p2 = 0.00005. (Using Bonferroni's procedure, we would get pB = 0.0001 for this example.) We get a p-value that is 30% smaller than this when we combine it with the first test using the unweighted Eq. (12). If we use b/a = 1.49, we can reduce this by a further factor of about 0.6, to obtain p = 0.00002. On the other hand, when p1 and p2 are the same order of magnitude, there is virtually no gain in using a weighted instead of an unweighted combination.

Figure 3: Combined p-value for a weighted combination of tests as a function of the ratio b/a of the weights. Solid line is the case that p1 = 0.05 and p2 = 0.00005; the dashed line is for p1 = 0.005 and p2 = 0.0005. The curves cross at b/a = 1, since the unweighted combination depends only on the product p = p1 p2. Only when the individual p-values are orders of magnitude apart is there much gain in using a weighted combination.

5. An application

Consider the problem of testing for a "point source" in a map of counts. If there is a point source at a candidate location, then there will be more counts per unit area in a source region centered on this location than in an annulus drawn around this location. If As and Ab are the areas of the source region and background annulus, then the p-value associated with seeing Ns and Nb counts in these two regions is given by [5]

    p = I_f(Ns, Nb + 1)                                             (14)
where f = As/(As + Ab) and I_x(a, b) = [1/B(a, b)] ∫_0^x t^(a−1) (1 − t)^(b−1) dt is the (regularized) incomplete beta function. In general, the size of the source region should match the size of the "point spread function" (PSF) of the telescope, and in fact one can show that for a Gaussian source, the optimum is given by a source region with radius 1.4σ. If the PSF is not a "tophat" function, then this will not be the most powerful test. That test will involve convolving the data with the PSF, and although a statistic can be derived in the high count rate limit, it is difficult (and computationally expensive) to obtain correct p-values in general.

However, it is also possible to use two annuli and to separately test two independent null hypotheses. Consider a source region S and two concentric but disjoint annuli, an outer annulus A and an inner annulus B, both of which surround but are disjoint from the source region. Then the following two null hypotheses are strictly independent:

1. Treating S as source and B as background, the hypothesis is that there are no more counts in S than expected, given the number of counts in B.
2. Treating S ∪ B as source and A as background, the hypothesis again is that the count rate in the source region is equal to the count rate in the background region.

Let f1 = As/(As + Ab) and f2 = (As + Ab)/(As + Ab + Aa) be the ratios of areas for the two hypotheses. Then

    p1 = I_f1(Ns, Nb + 1)                                           (15)
    p2 = I_f2(Ns + Nb, Na + 1)                                      (16)

are the p-values associated with the two hypotheses. And these p-values can be combined using either Eq. (11) or Eq. (12).

References

1. S. Holm, "A simple sequentially rejective multiple test procedure." Scand. J. Statist. 6, 65-70 (1979).
2. G. Hommel, "Tests of the overall hypothesis for arbitrary dependence structures." Biom. J. 25, 423-430 (1983).
3. Y. Hochberg, "A sharper Bonferroni procedure for multiple tests of significance." Biometrika 75, 800-802 (1988).
4. R. Simes, "An improved Bonferroni procedure for multiple tests of significance."
Biometrika 73, 751-754 (1986).
5. M. Lampton, "Two-sample discrimination of Poisson means." Ap. J. 436, 784-786 (1994).
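The two-annulus test of Section 5 can be sketched in pure Python. For integer arguments the incomplete beta in Eqs. (15)-(16) reduces to a binomial tail, I_f(Ns, Nb + 1) = P(X >= Ns) with X ~ Binomial(Ns + Nb, f); this reduction, the function names, and the counts and areas below are my own illustration, not the paper's:

```python
from math import comb, log

def binom_tail(n, k, f):
    """P(X >= k) for X ~ Binomial(n, f); for integer arguments this equals
    the regularized incomplete beta I_f(k, n - k + 1)."""
    return sum(comb(n, j) * f**j * (1.0 - f) ** (n - j) for j in range(k, n + 1))

def point_source_pvalues(Ns, Nb, Na, As, Ab, Aa):
    """Eqs. (15)-(16): independent p-values for source S vs inner annulus B,
    and for S+B vs outer annulus A."""
    f1 = As / (As + Ab)
    f2 = (As + Ab) / (As + Ab + Aa)
    p1 = binom_tail(Ns + Nb, Ns, f1)            # Eq. (15)
    p2 = binom_tail(Ns + Nb + Na, Ns + Nb, f2)  # Eq. (16)
    return p1, p2

def combine_unweighted(p1, p2):
    """Eq. (12): combined p-value from the product p1*p2."""
    prod = p1 * p2
    return prod * (1.0 - log(prod))

# Hypothetical counts and areas: 20 counts in the source region,
# 10 in the inner annulus, 30 in the outer annulus.
p1, p2 = point_source_pvalues(Ns=20, Nb=10, Na=30, As=1.0, Ab=1.0, Aa=3.0)
```

The two p-values can then be fed to the unweighted combination of Eq. (12), or to the weighted Eq. (11) if one annulus is expected to give the more powerful test.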