7 Multiple testing. Seek and ye shall find
As we have seen, we can control the probability of making errors in statistical inference,
but we can’t eliminate them. In particular, the hypothesis testing framework controls
the probability of rejecting a null hypothesis when it is true. Nonetheless, if we test
enough hypotheses we will find p-values that are small just by chance and not because we
have detected a real departure from the null. This is something that, as statisticians and
scientists, we have to accept. However, the problem is very much exacerbated by multiple
testing.
Problems due to choosing the best
For example, suppose we wanted to compare the characteristics of two groups, 100 cases
and 100 controls for some disease, say. Also suppose that we diligently collected a set of
10 measurements on them, carried out 10 t-tests and got the following p-values:
0.869 0.635 0.600 0.480 0.112 0.796 0.021 0.387 0.674 0.439
Would we then be justified in claiming that measurement 7 was significantly correlated
with the disease with p-value 0.021? Well, looking at the empirical distribution function,
plotted in figure 14, we see that the 10 p-values we generated are consistent with being a
sample from a Uniform(0,1) distribution, which is what we expect when the null hypothesis
is true. Claiming significance for measurement 7 would therefore be a false positive, generated
by trying too many tests and not allowing for this in our analysis.
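A quick simulation makes the point concrete. The sketch below, in Python with numpy and scipy (an assumption; the notes do not prescribe any software), repeats the two-group setup with every null hypothesis true and counts how often the smallest of the 10 p-values falls below 0.05.

    # Sketch: how often does the best of 10 null t-tests look "significant"?
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(1)
    n_experiments, n_tests, n_per_group = 2000, 10, 100

    count = 0
    for _ in range(n_experiments):
        # 10 measurements on 100 cases and 100 controls, all drawn from the
        # same distribution, so every null hypothesis is true.
        pvals = [ttest_ind(rng.normal(size=n_per_group),
                           rng.normal(size=n_per_group)).pvalue
                 for _ in range(n_tests)]
        count += min(pvals) < 0.05

    # With 10 independent null tests, the chance of at least one p < 0.05
    # is 1 - 0.95**10, roughly 0.40.
    print(count / n_experiments)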
Hidden tests
A second example shows that multiple testing can also occur when we only perform one
test. Suppose we have 5 samples from 5 different groups and want to test the hypothesis
that the groups have the same mean. The appropriate way to do this, as we saw in week
7, is to use analysis of variance. Suppose instead, though, that we simply compared the
smallest group mean with the largest group mean using a t-test. Figure 15 shows box and
whisker plots for the 5 group samples. Carrying out a t-test between samples 2 and 5
gives a p-value of 0.007, a very significant result. However, the samples were all generated
randomly from a Normal(0,1) distribution, so the finding is a false positive. Although we have made
only 1 test, by comparing the smallest and largest means we know that the p-value we get
is the smallest possible of all the n(n − 1)/2 = 10 comparisons we could have made. There
are in effect multiple implicit tests.
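The size of this effect is easy to check by simulation. In the sketch below (group sizes of 20 are an assumption; the notes do not state them) all five groups are drawn from Normal(0,1), only the groups with the smallest and largest means are compared, and the proportion of apparently significant t-tests is recorded.

    # Sketch: the hidden-tests effect when only the extreme groups are compared.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(2)
    n_sims, n_groups, n_per_group = 2000, 5, 20

    false_positives = 0
    for _ in range(n_sims):
        groups = rng.normal(size=(n_groups, n_per_group))   # all nulls true
        means = groups.mean(axis=1)
        lo, hi = means.argmin(), means.argmax()
        false_positives += ttest_ind(groups[lo], groups[hi]).pvalue < 0.05

    # Far above the nominal 5%: the single test hides 10 implicit comparisons.
    print(false_positives / n_sims)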
Publication bias and data mining
A similar effect is seen in publication bias. A statistically significant finding is usually more
interesting, and hence more likely to be submitted and accepted for publication, than a
non-significant result. Therefore, any published result of a test of hypothesis is a selected one
of many tests carried out by the same and other researchers.
[Figure 14: The empirical distribution of a set of 10 p-values; cumulative frequency plotted against p-value.]
[Figure 15: Box and whisker plots for 5 samples from Normal(0,1), labelled 1 to 5.]
The scale of the multiple testing problem is very much on the increase due to the ability
to accumulate and store vast amounts of information electronically. This is particularly true
in the biological sciences following the recent genomics boom. It is now possible to assay
the genotypes of a group of individuals at 500,000 different genetic loci almost overnight. In
a study to find associations between disease case-control status and genetic markers there
are therefore 500,000 potential statistical tests to perform. Similar issues arise in many
other fields where large numbers of covariates can potentially be used as predictors of an
outcome. There are probably more opportunities for careless and speculative analysis of
data now than ever before. Many discoveries by data mining techniques should be viewed
with skepticism. These may be interesting hypotheses but should not be considered as
conclusive.
There are, however, ways of avoiding and alleviating the multiple testing problem.
7.1 Restrict the number of tests
This first solution is used in very formal and established statistical environments where
a standard procedure for analysis is specified before any data is collected and carefully
adhered to afterwards. Clinical trials are an example of this. If different hypotheses are to
be tested, the protocol must allow for this in advance, just as it must allow for contingencies
for early stopping of a trial.
This approach is limited in a more research-oriented environment where ideas are tried
out, rejected, amended and retried, and new data is continually being generated. Real data
analysis involves a good deal of exploration before any formal methods are applied.
7.2 Use the right test
As we saw above, in comparing the means of the 5 groups, we should not have
compared the most extreme values but used analysis of variance. This would have given
us a p-value of 0.0529, which, although unusual, is far less misleading than the 0.007 derived
above. This solution is not always available, but we can often do well with a randomization
test provided we randomize so as to mimic the entire analysis process.
In this case a simple randomization of the observations in groups 2 and 5 would be
wrong. Instead we should randomize all the observations into 5 groups, pick out the groups
with the smallest and largest means and obtain a t-statistic for this pairwise comparison.
By repeating our complete procedure under randomization we can obtain a fair p-value.
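A minimal sketch of such a randomization test is given below. The list `groups`, holding five numpy arrays of observations, is hypothetical; the essential point is that the permutation step repeats the whole procedure, including the selection of the extreme groups.

    # Randomization test that mimics the entire analysis: re-randomize all the
    # observations into 5 groups and redo the min-vs-max comparison each time.
    import numpy as np
    from scipy.stats import ttest_ind

    def min_max_t(groups):
        """|t| for comparing the groups with the smallest and largest means."""
        means = [g.mean() for g in groups]
        lo, hi = int(np.argmin(means)), int(np.argmax(means))
        return abs(ttest_ind(groups[lo], groups[hi]).statistic)

    def randomization_p_value(groups, n_perm=5000, seed=0):
        rng = np.random.default_rng(seed)
        observed = min_max_t(groups)
        sizes = [len(g) for g in groups]
        pooled = np.concatenate(groups)
        count = 0
        for _ in range(n_perm):
            rng.shuffle(pooled)                                    # permute everything
            perm_groups = np.split(pooled, np.cumsum(sizes)[:-1])  # re-form 5 groups
            count += min_max_t(perm_groups) >= observed
        return count / n_perm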
7.3 Generation and validation sets
If we split our sample into two sets we can use one for hypothesis generation and one
for hypothesis testing. If analysis of the generation set is done completely blind of the
validation set, and any p-value is assessed only on the validation set, then we are free to
fit as many models and test as many hypotheses as we want in order to generate the one,
or small number, that we will validate. Sometimes, also, only the generation will be done,
with the validation being left to other studies.
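One possible shape for this discipline is sketched below. The data matrix `X` (subjects by measurements) and the case-control indicator `y` are hypothetical numpy arrays, since the notes do not fix a data format; all exploratory testing is confined to the generation half, and only the single chosen hypothesis is tested on the validation half.

    # Sketch of a generation/validation split.
    import numpy as np
    from scipy.stats import ttest_ind

    def split_generate_validate(X, y, seed=0):
        rng = np.random.default_rng(seed)
        idx = rng.permutation(len(y))
        gen, val = idx[: len(y) // 2], idx[len(y) // 2:]

        # Hypothesis generation: try every measurement on the generation half only.
        gen_p = [ttest_ind(X[gen][y[gen] == 1, j], X[gen][y[gen] == 0, j]).pvalue
                 for j in range(X.shape[1])]
        best = int(np.argmin(gen_p))

        # Validation: one pre-specified test on data never used for generation.
        val_p = ttest_ind(X[val][y[val] == 1, best], X[val][y[val] == 0, best]).pvalue
        return best, val_p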
This is essentially the approach applied to the problem of publication bias. While a
fresh study with negative results is not often published, it is notable if it was intended to
confirm or disprove a previously published positive result. Thus, the first publication of a
result is generally regarded with skepticism until it is replicated by another study,
preferably carried out by an independent group. For instance, the world is still waiting for
a replication of the cold fusion experiment.
7.4 Correcting for multiple tests
Suppose we carry out n independent tests of hypothesis, the most significant of which has
a nominal p-value of p. Although we realize that p is not the true p-value we suspect that
it is so small that it would be unlikely to have arisen by chance even though we selected
it from a set of results. However, if we want to report the best result we need to find the
corrected or true p-value, q, that allows for this selection. We can find q in terms of p and
n as follows:
    q = P(the best test of n gives a more extreme result than observed)
      = P(at least one test of n gives a more extreme result than observed)
      = 1 − P(all n tests give less extreme values than observed)
      = 1 − P(a single test gives a less extreme result than observed)^n
      = 1 − (1 − P(a single test gives a more extreme result than observed))^n
      = 1 − (1 − p)^n
Thus, a simple correction is to report the actual p-value as 1 − (1 − p)^n when p is the
nominal p-value.
Note that
    1 − (1 − p)^n = 1 − (1 − np + n(n − 1)/2 p^2 − n(n − 1)(n − 2)/6 p^3 + ...)
                  ≈ np for small p,
which is known as the Bonferroni correction for multiple testing.
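As a quick illustration of these formulas, the sketch below (Python is an assumption) applies the exact correction and the Bonferroni approximation to the smallest of the ten p-values from the first example.

    # Correct the best of n = 10 p-values from the first example (p = 0.021).
    p, n = 0.021, 10

    corrected_exact = 1 - (1 - p) ** n       # 1 - (1 - p)^n, about 0.19
    corrected_bonferroni = min(1.0, n * p)   # n*p capped at 1, here 0.21
    print(corrected_exact, corrected_bonferroni)

Neither corrected value is anywhere near conventional significance, which is exactly the point of the first example.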
These corrections are for independent tests. If the tests are positively correlated then
the correction is usually too severe: it is conservative. Better corrections may be available
but these will depend very much on the circumstances.
Note also that these corrections will reduce the power. For example, suppose a test
based on 100 observations has 80% power to reject the null hypothesis at a size of 0.05
when the alternative is true. In the face of, say, 10 multiple tests the nominal size will have
to be reduced to near 0.005 to maintain the same performance. To maintain the same
power we would, therefore, need many more observations.
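The size of this loss can be sketched with a normal approximation to the test statistic; the calculation below is only an illustration under that assumption, not a computation from the notes.

    # Power of a two-sided z-test when the size drops from 0.05 to 0.05/10.
    from scipy.stats import norm

    alpha, alpha_corrected, target_power = 0.05, 0.05 / 10, 0.80

    # Non-centrality that gives 80% power at the uncorrected size.
    delta = norm.ppf(1 - alpha / 2) + norm.ppf(target_power)

    # Power of the same test once the size is reduced to 0.005.
    print(norm.cdf(delta - norm.ppf(1 - alpha_corrected / 2)))      # about 0.50

    # Sample-size inflation needed to restore 80% power (delta scales with sqrt(n)).
    delta_needed = norm.ppf(1 - alpha_corrected / 2) + norm.ppf(target_power)
    print((delta_needed / delta) ** 2)                              # about 1.7

So allowing for 10 tests drops the power from 80% to roughly 50%, and around 70% more observations are needed to win it back.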
7.5 False discovery rate
While the ability to generate huge data resources has increased, the same technology has
also enabled more extensive studies to follow up on potential hypotheses. It is often the
case that we can tolerate some false positive leads provided the pay-off for the true positives
is substantial enough. Drug development is an example where this is the case.
[Figure 16: An example where a threshold p-value of 0.05 gives a false discovery rate of approximately 50%; proportion of significant tests plotted against the critical value.]
The false discovery rate is a recent approach to multiple testing that addresses the problem from this point of view. Suppose we have performed a large number of hypothesis tests
and can sort the p-values in increasing order. For any particular choice of the critical value,
c, say, we will reject R(c) hypotheses. Of these F(c) will occur when the null hypothesis is
true and be false positives. The ratio F(c)/R(c) is the false discovery rate. Clearly, while R(c)
is known, we have to estimate F(c). Moreover, we may want to choose c so that F(c)/R(c) is
estimated to be a reasonable value. We know that under the null hypothesis p-values are
Uniformly distributed and we can use this as a basis for estimating the false discovery rate.
This is illustrated in figure 16.
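A minimal sketch of that estimate follows (Python is an assumption). It uses the fact that null p-values are Uniform(0,1), so among m tests we expect about m·c null p-values below a critical value c; treating all m hypotheses as null makes the estimate conservative.

    # Estimate the false discovery rate at critical value c.
    import numpy as np

    def estimated_fdr(pvals, c):
        pvals = np.asarray(pvals)
        R = (pvals <= c).sum()        # R(c): number of hypotheses rejected at c
        F_hat = len(pvals) * c        # estimated F(c) under uniform null p-values
        return F_hat / max(R, 1)

    # One way to choose c: the largest critical value whose estimated FDR
    # stays below some tolerable level, say 10%.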
7.6 Worksheet
1. * Simulate 10 pairs of sets of 100 random observations each from the same distribution
and perform 10 t-tests to compare the means of each pair. Calculate the raw p-value,
the Bonferroni corrected p-value, and the 1 − (1 − p)^n corrected p-value. (One possible
sketch for this exercise is given after the worksheet.)
2. * Repeat the above exercise 1000 times and draw the empirical distribution functions
of the raw p-value and the two corrected p-values.
3. Simulate 15 datasets of 50 observations from some Normal distribution. Use analysis
of variance to test the hypothesis that the means are equal. Calculate the t-test
statistic for the difference between the smallest and largest of the 15 means, and find
the raw p-value.
4. Repeat the above 1000 times and compare the empirical distribution functions of the
p-values.
5. Construct a randomization test to give an empirical p-value for the above situation.
6. * In the above example, we know that testing equality of the largest and smallest
means implies that 15(15 − 1)/2 = 105 hidden comparisons have been made. A
Bonferroni correction to the p-value, therefore, would be to multiply it by 105. Or
we could use 1 − (1 − p)^105. However, since the comparisons are not independent this
is likely a conservative correction. Construct a simulation experiment to evaluate m,
the effective number of tests that the actual test makes.
7. Download the file multiregress from the class web page. This has 100 lines of 11
columns each. The first number on each line is the value of a dependent variable
while the following 10 are independent variables. For each independent variable test
the hypothesis that it is a significant predictor of the response. Do this with and
without a multiple testing correction. What do you conclude?
8. * Download the file lotsofps from the class web page. This has 10000 p-values from
a case-control study using 100x100 DNA expression chips. Each p-value is for a test
of whether one of the 10000 genes is differentially expressed in cases and controls
for asthma. Compare the conclusions you would derive from this data if you used
multiple-testing corrections or the false discovery rate.
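One possible sketch for exercise 1 is given below; Python, scipy and a Normal(0,1) distribution are assumptions, and any common distribution and software would do.

    # Exercise 1: 10 null t-tests with raw and corrected p-values.
    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(0)
    n_tests, n_obs = 10, 100

    raw = np.array([ttest_ind(rng.normal(size=n_obs),
                              rng.normal(size=n_obs)).pvalue
                    for _ in range(n_tests)])

    bonferroni = np.minimum(1.0, n_tests * raw)   # n*p, capped at 1
    exact = 1 - (1 - raw) ** n_tests              # 1 - (1 - p)^n
    print(np.column_stack([raw, bonferroni, exact]))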