Multiple comparisons, multiple testing, permutation testing, bootstrap
Multiple comparisons
“The more you look, the more you discover ... that’s actually false!”
Data dredging, data snooping, fishing expedition.
Suppose you do K hypothesis tests. Let’s evaluate the strategy:
“report the best P-value”.
Suppose all the H0’s are true! (No “discoveries” to be made.)
Suppose they are all statistically independent.
(Assume continuous data & continuous P-values.)
Then the P-values, P1,...,PK, are i.i.d. Uniform(0,1). What is the actual distribution
of the reported best P-value? Hint: it’s not Uniform(0,1).
For example, with K = 2:
Pr(at least one P-value ≤ 0.05)
= Pr( Pmin ≤ 0.05 )
= Pr( P1 ≤ 0.05 or P2 ≤ 0.05 )
= 2×0.05 − 0.05²
= 0.0975
[Figure: the unit square of (P1, P2), with the region where min(P1, P2) ≤ 0.05 shaded.]
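A quick R simulation of this (a minimal sketch; K and the number of replications are arbitrary choices):

# Under H0, K independent P-values are i.i.d. Uniform(0,1);
# the "report the best P-value" strategy then rejects too often.
set.seed(1)
K <- 2
nsim <- 100000
pmin_sim <- replicate(nsim, min(runif(K)))   # smallest of K uniform P-values
mean(pmin_sim <= 0.05)   # close to 2*0.05 - 0.05^2 = 0.0975, not 0.05
hist(pmin_sim)           # clearly not Uniform(0,1)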
The classical view of multiple testing (independent case)
If you test
H0i versus HAi for i = 1, …, k,
and then you report the best of the k P-values ("nominal"),
Pmin = Pbest = min{ Pi : i = 1, …, k },
then
a) the true Type I error for this procedure is bigger than the nominal Type I error, and
b) the true P-value is bigger than the nominal P-value.
For (a), the reason is that the actual Type I error = Pr(rejection region | H0), and the
overall rejection region = ∪i (rejection region for test i).
For (b), the reason is that the true "tail of surprise" includes tails for other
hypotheses, not just the tail for Hibest (where ibest = argmin{ Pi : i = 1, …, k }).
Disturbing examples with multiple testing adjustments
Example 1: Response of cancer patients to IL-2: effect of immunologic HLA type (Rubin
et al., 1995).
Data = one 2×2 table: HLA type DQ1 = present or absent, response to IL-2 = yes or no.
Fisher "exact" test P = 0.01.
Data = three 2×2 tables, for HLA types DQ1…DQ3.
Minimum P-value = 0.01 for DQ1. Times 3 → 0.03.
Data = 3 groups of 2×2 tables, including 3 DQ types, 5 DP types, and 7 DR types.
Minimum P-value = 0.01 for DQ1. Times 15 → 0.15.
Data = 2 groups of groups of 2x2 tables,
including MHC1 (A, B, C) and MHC2 (DP,DQ,DR).
Total # tables = 120.
Minimum P-value = 0.01 for DQ1. Times 120 → 1.2. Sidak: 0.70.
What is the “proper” collection of tests to control the Type I error over?
Just the DQ1 test? All DQ tests? All MHC2 tests? All HLA tests?
Multiple comparisons, multiple testing, permutation testing, bootstrap
p.3
Example 2: "Comparisons of a priori interest" (Cohen, Anwar, Day 1983).
Testing 6 methods of measuring echocardiograms—are they equivalent?
6x5/2 = 15 comparisons
Nominal P for method B versus method C = 0.005.
But not “of a priori interest”, so P = 15 * 0.005 = 0.075, “not significant”.
But now the investigator states: "the comparisons of a priori interest were
A versus D, A versus E, A versus F, D versus E, D versus F, E versus F."
Now the adjusted P-value for B versus C is
P = (15 − 6) × 0.005 = 0.045, "significant".
So the inference on B versus C changed depending on how many others were “of a
priori interest”.
Example 3: ECOG 5592 cooperative group clinical trial
Arms: (A) etoposide + cisplatin, (B) taxol+cisplatin+G-CSF, (C) taxol+cisplatin.
Should multiple comparisons adjustments be made? Which ones?
The mystery answer: “There are four comparisons:
B > A, C > A, B > C, C > B
so the required significance level will be 0.05 / 4 = 0.0125”.
Issues with multiple comparisons methods
• This is sometimes too cautious ("conservative"), if the tests are positively
correlated.
• Sometimes it's difficult to decide what collection of tests to throw together into
one "bag".
• For huge numbers of tests (for example, high-throughput biological data, where
n < K and the degrees of freedom are negative), the "family-wise error rate" (FWER)
may be far too conservative. One would allow a high probability of at least one
"false positive" (Type I error) in exchange for making some true discoveries.
Classic methods for multiple comparisons
Bonferroni and Sidak
The simplest multiple comparison method is the Bonferroni correction:
Padjusted = k Pmin.
Almost identical, and perhaps slightly better justified, is the Sidak correction:
Padjusted = 1 − (1 − Pmin)^k, which is roughly k·Pmin − [k(k−1)/2]·Pmin².
This is exactly correct when the k P-values are statistically independent U(0,1)
given H0.
Since the P values are generally positively correlated, it is usually conservative (too
big).
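A minimal R sketch of the two corrections, using the numbers from Example 1 above (Pmin = 0.01 over k = 120 tables):

pmin_obs <- 0.01
k <- 120
bonferroni <- min(1, k * pmin_obs)   # Bonferroni-adjusted P, capped at 1
sidak <- 1 - (1 - pmin_obs)^k        # Sidak; exact if the k P-values are independent U(0,1)
c(bonferroni = bonferroni, sidak = sidak)   # roughly 1.00 and 0.70
# For a whole vector of P-values, base R provides p.adjust(pvals, method = "bonferroni").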
Non-independent cases
There are many special adjustments for special cases that avoid being too
conservative: Tukey's HSD, Duncan's multiple range test, Newman-Keuls,
randomization tests, permutation tests, …
Permutation and randomization methods
Randomization tests are often used as a basis for these methods, or to evaluate
their performance. These tests condition on the observed collection of numbers, but break a
connection (for example, between group labels and outcomes) in order to ensure that the null hypothesis is true.
In the simplest case there is just ONE test, and one wants to check a questionable
test.
Example: checking a chi-square test.
                 Responder (R)   Nonresponder (N)   TOTAL
Dark Hair (D)          3                 2             5
Light Hair (L)         5                90            95
TOTAL                  8                92           100
Is hair color associated with response?
Chi-square P-value:
See randomizationTest-example.R
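The file itself is not reproduced here, but a minimal sketch of the same idea for the table above (shuffle the hair-color labels, holding the responses fixed, and recompute the chi-square statistic):

set.seed(1)
hair     <- rep(c("Dark", "Light"), times = c(5, 95))
response <- c(rep(c("R", "N"), times = c(3, 2)),    # Dark-haired: 3 R, 2 N
              rep(c("R", "N"), times = c(5, 90)))   # Light-haired: 5 R, 90 N
chisq_stat <- function(x, y) suppressWarnings(chisq.test(table(x, y))$statistic)
obs <- chisq_stat(hair, response)                   # observed chi-square statistic
B <- 10000
perm <- replicate(B, chisq_stat(sample(hair), response))  # break the label-outcome connection
mean(perm >= obs)   # randomization P-value, to compare with the chi-square P-value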
Benjamini-Hochberg method (1995) for the "false discovery rate",
http://www.jstor.org/stable/2346101 ;
implemented in Significance Analysis of Microarrays (SAM) and in BRB-ArrayTools
from the NCI.
B-H is a classical frequentist technique.
False discovery rate could be defined as
Qe = E( V / R ),
where V = the number of true null hypotheses declared significant (false discoveries)
and R = the total number declared significant.
But what if R = # declared significant can equal 0? Then V/R is 0/0, undefined.
And when all null hypotheses are true, then all discoveries are false and V/R = 1 whenever R > 0.
Instead we define
FDR = E(Q), where Q = V/R if R > 0 and Q = 0 if R = 0.
The BH procedure is: sort the P-values, P(1) ≤ P(2) ≤ … ≤ P(m); find the largest k
such that P(k) ≤ (k/m)·q*; declare the hypotheses with the k smallest P-values "significant".
Then FDR is no bigger than q*.
Example: an RCT of rt-PA versus APSAC for treating myocardial infarction.
There are 15 different clinical endpoints. The 15 P-values, ranked, and the critical
values with q* = 0.05, are

P-value  0.0001  0.0004  0.0019  0.0095  0.0201  0.0278  0.0298  ...  1
cutoff   0.0033  0.0067  0.0100  0.0133  0.0167  0.0200  0.0233  ...  0.05

The first 4 hypotheses are "significant" with this rule.
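A minimal sketch of the step-up rule in R (the full vector of 15 endpoint P-values is not reproduced on the slide, so pvals below stands in for it):

bh_reject <- function(pvals, qstar = 0.05) {
  m <- length(pvals)
  cutoffs <- (1:m) / m * qstar
  k <- suppressWarnings(max(which(sort(pvals) <= cutoffs)))  # largest k with P(k) <= (k/m) q*
  reject <- rep(FALSE, m)
  if (is.finite(k)) reject[order(pvals)[1:k]] <- TRUE
  reject
}
# Equivalently, with base R: p.adjust(pvals, method = "BH") <= qstar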
This method does not really help with the problems listed before.
Storey-Tibshirani method for "q-values": conditional false discovery rate
http://www.pnas.org/content/100/16/9440.abstract
Define pFDR = E(V/R | R > 0). (If R = 0, why would we care about FDR?)
(In this paper, V/R is written as F/S.)
Write FDR as a posterior probability:
pFDR(t) = Pr( H0 true | P ≤ t ) = π0 · Pr( P ≤ t | H0 ) / Pr( P ≤ t ) = π0 · t / Pr( P ≤ t ),
where π0 = m0 / (m0 + m1) is the prior probability that a null hypothesis is true
(m0 and m1 being the numbers of true and false null hypotheses).
We estimate π0 from the observed distribution of all the P-values, assuming a
mixture of a Uniform(0,1) component with weight π0 and a component concentrated
near zero with weight 1 − π0. Different methods can estimate π0. In the Figure in the
paper, the LOWER dotted line corresponds to π̂0. See the R package qvalue by Dabney
and Storey. For each threshold t, define
FDR̂(t) = π̂0 · m · t / #{ i : Pi ≤ t }.
Finally, for each null hypothesis i,
qvalue_i = min{ FDR̂(t) : t ≥ Pi }.
The interpretation is the minimum FDR you get if Pi is called "significant".
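A minimal sketch with the Bioconductor package qvalue (pvals stands for the full vector of P-values from the experiment):

# BiocManager::install("qvalue")   # install once
library(qvalue)
qobj <- qvalue(pvals)        # estimates pi0 and a q-value for every hypothesis
qobj$pi0                     # estimated proportion of true null hypotheses
head(qobj$qvalues)           # q-value for each hypothesis
sum(qobj$qvalues <= 0.05)    # number of "discoveries" at an estimated FDR of 5%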
The Bootstrap
The bootstrap is a technique for checking calculations of standard errors and
confidence intervals. It is a non-parametric, non-model-based version of the classical
rationale for estimating a property of a statistic S. (S must be a continuous statistic.)
F̂ "is near" FTrue.
Therefore,
distribution( S | F = F̂ ) "is near" distribution( S | F = FTrue ),
and likewise for any feature of the distribution, for example
var( S | F = F̂ ) "is near" var( S | F = FTrue ).
THE RESAMPLING IDEA
Motivation
Our sample = { Y1, …, Yn }. Our statistic S = S(Y1, …, Yn). Our question: what is var(S)?
If we had many other independent samples of size n, we could calculate S on each one to
get S1, …, SB, and we'd use their sample variance:
var̂(S) = (1/(B − 1)) Σb ( Sb − S̄ )².
But you only have one sample, so RESAMPLE to create new samples -- WITH replacement.
Again we are assuming that F̂ “is near” FTrue -- but this time F̂ is the empirical c.d.f.
[Diagram: ORIGINAL SAMPLE → BOOTSTRAP SAMPLES → BOOTSTRAPPED STATISTICS → various summaries]
Then, for example, we estimate
var̂(S) = (1/(B − 1)) Σb ( S*b − S̄* )²,
where S*b is the statistic computed on the b-th bootstrap sample.
There is MUCH MUCH more to the bootstrap, but this is the central idea.
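A minimal sketch of the resampling idea in R, for the standard error of a sample median (the data y and the choice of statistic are illustrative):

set.seed(1)
y <- rnorm(50)     # stand-in for the real sample Y1, ..., Yn
B <- 2000
S_star <- replicate(B, median(sample(y, replace = TRUE)))  # resample WITH replacement
var(S_star)        # bootstrap estimate of var(S)
sd(S_star)         # bootstrap standard error of the median
# The boot package does the same bookkeeping:
# library(boot); boot(y, function(d, i) median(d[i]), R = B)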