7 Multiple testing. Seek and ye shall find

As we have seen, we can control the probability of making errors in statistical inference, but we can't eliminate them. In particular, the hypothesis testing framework controls the probability of rejecting a null hypothesis when it is true. Nonetheless, if we test enough hypotheses we will find p-values that are small just by chance and not because we have detected a real departure from the null. This is something that, as statisticians and scientists, we have to accept. However, the problem is very much exacerbated by multiple testing.

Problems due to choosing the best

For example, suppose we wanted to compare the characteristics of two groups, say 100 cases and 100 controls for some disease. Also suppose that we diligently collected a set of 10 measurements on them, carried out 10 t-tests and got the following p-values:

0.869 0.635 0.600 0.480 0.112 0.796 0.021 0.387 0.674 0.439

Would we then be justified in claiming that measurement 7 was significantly correlated with the disease, with p-value 0.021? Well, looking at the empirical distribution function, plotted in figure 14, we see that the 10 p-values we generated are consistent with being a sample from a Uniform(0,1) distribution, which is what we expect when the null hypothesis is true. We have generated a false positive association by trying too many tests and not allowing for this in our analysis.

Hidden tests

A second example shows that multiple testing can also occur when we only perform one test. Suppose we have 5 samples from 5 different groups and want to test the hypothesis that the groups have the same mean. The appropriate way to do this, as we saw in week 7, is to use analysis of variance. Suppose instead, though, that we simply compared the smallest group mean with the largest group mean using a t-test. Figure 15 shows box and whisker plots for the 5 group samples.
Carrying out a t-test between samples 2 and 5 gives a p-value of 0.007, a very significant result. However, the samples were all generated randomly from a Normal(0,1) distribution and the finding is false. Although we have made only 1 test, by comparing the smallest and largest means we know that the p-value we get is the smallest possible of all the n(n − 1)/2 = 10 comparisons we could have made. There are in effect multiple implicit tests.

Publication bias and data mining

A similar effect is seen in publication bias. A statistically significant finding is usually more interesting, and hence more likely to be submitted and accepted for publication, than a non-significant result. Therefore, any published result of a test of hypothesis is a selected one of many tests carried out by the same and other researchers.

[Figure 14: The empirical distribution of a set of 10 p-values]

[Figure 15: Box and whisker plots for 5 samples from Normal(0,1)]

The scale of the multiple testing problem is very much on the increase due to the ability to accumulate and store vast amounts of information electronically. This is particularly true in the biological sciences following the recent genomics boom. It is now possible to assay the genotypes of a group of individuals at 500,000 different genetic loci almost overnight. In a study to find associations between disease case-control status and genetic markers there are therefore 500,000 potential statistical tests to perform. Similar issues arise in many other fields where large numbers of covariates can potentially be used as predictors of an outcome. There are probably more opportunities for careless and speculative analysis of data now than ever before. Many discoveries made by data mining techniques should be viewed with skepticism. These may be interesting hypotheses but should not be considered conclusive.
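The "choosing the best" problem above is easy to check by simulation: under the null hypothesis p-values are Uniform(0,1), so the chance that the smallest of 10 independent p-values falls below 0.05 is 1 − 0.95^10 ≈ 0.40, not 0.05. A minimal sketch (the trial count and seed are arbitrary choices, not from the notes):

```python
import random

random.seed(1)

trials = 20000
hits = 0
for _ in range(trials):
    # Under the null, each of the 10 p-values is Uniform(0,1).
    pvals = [random.random() for _ in range(10)]
    if min(pvals) < 0.05:  # "best" result looks significant
        hits += 1

observed = hits / trials
expected = 1 - 0.95 ** 10  # theoretical value, about 0.401
print(observed, expected)
```

So simply reporting the best of 10 null tests produces a "significant" finding about 40% of the time.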
There are, however, ways of avoiding and alleviating the multiple testing problem.

7.1 Restrict the number of tests

This first solution is used in very formal and established statistical environments where a standard procedure for analysis is specified before any data are collected and carefully adhered to afterwards. Clinical trials are an example of this. If different hypotheses are to be tested, the protocol must allow for this in advance, just as it must specify contingencies for early stopping of a trial. This approach is limited in a more research-oriented environment where ideas are tried out, rejected, amended and retried, and new data is continually being generated. Real data analysis involves a good deal of exploration before any formal methods are applied.

7.2 Use the right test

As we saw above, in comparing the means of the 5 groups we should not have compared the most extreme values but used analysis of variance. This would have given us a p-value of 0.0529, which although unusual is far less misleading than the 0.007 derived above. This solution is not always available, but we can often do well with a randomization test provided we randomize so as to mimic the entire analysis process. In this case a simple randomization of the observations in groups 2 and 5 would be wrong. Instead we should randomize all the observations into 5 groups, pick out the groups with the smallest and largest means, and obtain a t-statistic for this pairwise comparison. By repeating our complete procedure under randomization we can obtain a fair p-value.

7.3 Generation and validation sets

If we split our sample into two sets we can use one for hypothesis generation and one for hypothesis testing. If analysis of the generation set is done completely blind of the validation set, and any p-value is assessed only on the validation set, then we are free to fit as many models and test as many hypotheses as we want in order to generate the one, or small number, that we will validate.
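The full-procedure randomization test of section 7.2 can be sketched as follows. The group sizes, the seed, and the helper name are illustrative assumptions; the key point is that each randomization repeats the entire selection step, not just the final pairwise comparison:

```python
import random
import statistics


def min_max_t(groups):
    """t-statistic comparing the groups with the smallest and largest means,
    using a pooled variance estimate from just those two groups."""
    means = [statistics.mean(g) for g in groups]
    lo = groups[means.index(min(means))]
    hi = groups[means.index(max(means))]
    n1, n2 = len(lo), len(hi)
    sp2 = ((n1 - 1) * statistics.variance(lo) +
           (n2 - 1) * statistics.variance(hi)) / (n1 + n2 - 2)
    return (statistics.mean(hi) - statistics.mean(lo)) / (sp2 * (1 / n1 + 1 / n2)) ** 0.5


random.seed(3)

# "Observed" data: 5 groups of 10 null observations.
data = [[random.gauss(0, 1) for _ in range(10)] for _ in range(5)]
t_obs = min_max_t(data)

# Randomize ALL observations into 5 new groups and rerun the complete
# min-vs-max procedure, so the implicit selection is built into the null.
pooled = [x for g in data for x in g]
reps, extreme = 2000, 0
for _ in range(reps):
    random.shuffle(pooled)
    perm = [pooled[i * 10:(i + 1) * 10] for i in range(5)]
    if min_max_t(perm) >= t_obs:
        extreme += 1

fair_p = extreme / reps
print(fair_p)
```

A naive randomization of only the two selected groups would omit the selection step and reproduce the misleadingly small p-value.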
Sometimes, also, only the generation will be done, with the validation being left to other studies. This is essentially the approach applied to the problem of publication bias. While a fresh study with negative results is not often published, it is notable if it was intended to confirm or disprove a previously published positive result. Thus, the first publication of a result is generally regarded with skepticism until the result is replicated by another study, preferably carried out by an independent group. For instance, the world is still waiting for a replication of the cold fusion experiment.

7.4 Correcting for multiple tests

Suppose we carry out n independent tests of hypothesis, the most significant of which has a nominal p-value of p. Although we realize that p is not the true p-value, we suspect that it is so small that it would be unlikely to have arisen by chance even though we selected it from a set of results. However, if we want to report the best result we need to find the corrected or true p-value, q, that allows for this selection. We can find q in terms of p and n as follows:

q = P(the best test of n gives a more extreme result than observed)
  = P(at least one test of n gives a more extreme result than observed)
  = 1 − P(all n tests give less extreme values than observed)
  = 1 − P(a single test gives a less extreme result than observed)^n
  = 1 − [1 − P(a single test gives a more extreme result than observed)]^n
  = 1 − (1 − p)^n

Thus, a simple correction is to report the actual p-value as 1 − (1 − p)^n when p is the nominal p-value. Note that

1 − (1 − p)^n = 1 − [1 − np + n(n − 1)/2 · p^2 − n(n − 1)(n − 2)/6 · p^3 + ...]
             ≈ np   for small p,

which is known as the Bonferroni correction for multiple testing. These corrections are for independent tests. If the tests are positively correlated then the correction is usually too severe: it is conservative. Better corrections may be available, but these will depend very much on the circumstances.
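These two corrections take only a few lines of code; the function names below are our own, and the numbers reuse the hidden-tests example's best p-value of 0.007 among its n(n − 1)/2 = 10 implicit comparisons:

```python
def exact_correction(p, n):
    """Corrected p-value for the best of n independent tests:
    q = 1 - (1 - p)^n."""
    return 1 - (1 - p) ** n


def bonferroni(p, n):
    """First-order approximation q ~ n*p, capped at 1."""
    return min(1.0, n * p)


p, n = 0.007, 10
print(exact_correction(p, n), bonferroni(p, n))  # about 0.0678 vs 0.07
```

For small p the two agree closely, with Bonferroni always slightly the more conservative of the pair.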
Note also that these corrections will reduce the power. For example, suppose a test based on 100 observations has 80% power to reject the null hypothesis at a size of 0.05 when the alternative is true. In the face of, say, 10 multiple tests the nominal size will have to be reduced to near 0.005 to maintain the same performance. To maintain the same power we would, therefore, need many more observations.

7.5 False discovery rate

While the ability to generate huge data resources has increased, the same technology has also enabled more extensive studies to follow up on potential hypotheses. It is often the case that we can tolerate some false positive leads provided the pay-off for the true positives is substantial enough. Drug development is an example where this is the case.

The false discovery rate is a recent approach to multiple testing that addresses the problem from this point of view. Suppose we have performed a large number of hypothesis tests and can sort the p-values in increasing order. For any particular choice of the critical value, c say, we will reject R(c) hypotheses. Of these, F(c) will occur when the null hypothesis is true and so be false positives. The ratio F(c)/R(c) is the false discovery rate. Clearly, while R(c) is known, we have to estimate F(c). Moreover, we may want to choose c so that F(c)/R(c) is estimated to be a reasonable value. We know that under the null hypothesis p-values are Uniformly distributed, and we can use this as a basis for estimating the false discovery rate. This is illustrated in figure 16.

[Figure 16: This plot shows an example where a threshold p-value of 0.05 gives a false discovery rate of approximately 50%]

7.6 Worksheet

1. * Simulate 10 pairs of sets of 100 random observations each from the same distribution and perform 10 t-tests to compare the means of each pair.
Calculate the raw p-value, the Bonferroni corrected p-value, and the 1 − (1 − p)^n corrected p-value.

2. * Repeat the above exercise 1000 times and draw the empirical distribution functions of the raw p-value and the two corrected p-values.

3. Simulate 15 datasets of 50 observations from some Normal distribution. Use analysis of variance to test the hypothesis that the means are equal. Calculate the t-test statistic for the difference between the smallest and largest of the 15 means, and find the raw p-value.

4. Repeat the above 1000 times and compare the empirical distribution functions of the p-values.

5. Construct a randomization test to give an empirical p-value for the above situation.

6. * In the above example, we know that testing equality of the largest and smallest means implies that 15(15 − 1)/2 = 105 hidden comparisons have been made. A Bonferroni correction to the p-value, therefore, would be to multiply it by 105. Or we could use 1 − (1 − p)^105. However, since the comparisons are not independent, this is likely a conservative correction. Construct a simulation experiment to evaluate m, the effective number of tests that the actual test makes.

7. Download the file multiregress from the class web page. This has 100 lines of 11 columns each. The first number on each line is the value of a dependent variable while the following 10 are independent variables. For each independent variable test the hypothesis that it is a significant predictor of the response. Do this with and without a multiple testing correction. What do you conclude?

8. * Download the file lotsofps from the class web page. This has 10000 p-values from a case control study using 100x100 DNA expression chips. Each p-value is for a test of whether one of 10000 genes is differentially expressed in cases and controls for asthma. Compare the conclusions you would derive from this data if you used multiple test corrections or the false discovery rate.
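For exercise 8, the false discovery rate estimate of section 7.5 can be sketched as follows: since null p-values are Uniform(0,1), F(c) can be estimated by m·c. The simulated p-values below are only a stand-in for the lotsofps file, and the 90/10 split between true nulls and real effects is an assumption made purely for illustration:

```python
import random

random.seed(4)

m = 10000
# Stand-in data: 9000 true nulls (Uniform(0,1) p-values) plus 1000
# crude "real effects" with p-values concentrated below 0.01.
pvals = [random.random() for _ in range(9000)]
pvals += [random.random() * 0.01 for _ in range(1000)]


def estimated_fdr(pvals, c):
    """Estimate F(c)/R(c): under the null p-values are uniform, so
    F(c) is estimated by m*c, while R(c) is the observed count of
    p-values at or below c."""
    r = sum(1 for p in pvals if p <= c)
    return len(pvals) * c / r if r else 0.0


print(estimated_fdr(pvals, 0.05))
```

One would then vary c and pick a threshold where the estimated false discovery rate is tolerable for the follow-up study in hand.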