False Discovery Rate (FDR) =
proportion of false positive results out
of all positive results (positive result =
statistically significant result)
Ladislav Pecen
Outline
• Introduction
• Familywise Error Rate
• False Discovery Rate
• Positive False Discovery Rate
• Comparison of approaches
• Example – microarray study
• Estimate of FDR and pFDR
• Bayes approach to FDR and pFDR
Background
The aim is to find an approach to the problem of multiple testing
Problems that arise in multiple testing
• each single test of one null hypothesis has a probability of type I error equal to α
• when several tests are carried out, each with probability of type I error equal to α, the overall probability of a type I error increases
• this leads to a high number of false positive results
• in the worst case the probabilities of type I error add up (upper bound k * α for k tests); for independent tests the probability of at least one type I error is 1 – (1 – α)^k
• assume 100 independent tests in which all null hypotheses are true
  • each test has probability of type I error equal to α = 0.05
  • in about 5% of the tests (= 5 tests) you will make a type I error, i.e. you will reject the null hypothesis although it is true
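The arithmetic above can be checked with a short Python sketch. It assumes that all null hypotheses are true and that the tests are independent; the function names are illustrative and do not come from the slides.

```python
def expected_false_positives(k, alpha):
    """Expected number of type I errors among k tests when every null is true."""
    return k * alpha


def prob_at_least_one_false_positive(k, alpha):
    """P(at least one type I error) for k independent tests, all nulls true."""
    return 1.0 - (1.0 - alpha) ** k


k, alpha = 100, 0.05
print(expected_false_positives(k, alpha))          # about 5 false positives
print(prob_at_least_one_false_positive(k, alpha))  # above 0.99
print(min(1.0, k * alpha))                         # additive upper bound, capped at 1
```

With 100 tests the additive bound k * α exceeds 1 and is no longer informative, while the exact value for independent tests is already very close to 1.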
Typical areas where you can meet multiple testing
• microarray studies
• genetics and all lab testing
Theory of hypothesis testing
We test the null hypothesis H0 against the alternative hypothesis H1
One of the following four possibilities will happen

Testing result            True: null hypothesis    True: alternative hypothesis
reject null hypothesis    type I error             correct decision
accept null hypothesis    correct decision         type II error

Usually we require restrictions on the two errors:
• the probability of type I error, i.e. the false positive rate, should be limited by α
  P(rejection of H0 | H0 is true) = P(type I error) ≤ α
• the power of the test should not be lower than 1 – β
  P(rejection of H0 | H0 is not true) = 1 – P(acceptance of H0 | H0 is not true) = 1 – P(type II error) ≥ 1 – β
Multiple hypotheses testing
Testing of multiple hypotheses

True \ Testing result       # of rejected null hypotheses   # of accepted null hypotheses   Total
# of true null hypotheses   S                               T                               m0
# of true alternatives      U                               W                               m1 = m – m0
Total                       R0                              R1 = m – R0                     m

Also here we have restrictions on the false results
• these could be similar to those in a single test of hypothesis:
  • control the false positive rate: (# of incorrectly rejected H0)/(# of true H0) = S/m0
  • this is the classical approach; controlling the probability of at least one incorrect rejection, P(S ≥ 1), is called the “familywise error rate approach” (FWER)
• or the problem can be approached from a different point of view
  • control the false discovery rate: (# of incorrectly rejected H0)/(# of rejected H0) = S/R0
Connected characteristics
• sensitivity – proportion of correctly identified differentially expressed (DE) genes: U/m1
• specificity – proportion of correctly identified non-DE genes: T/m0
Example
Example from genetics, microarray studies
• 10 000 genes examined
• search for differentially expressed (DE) genes

Results of testing

Reality               Determined as DE genes   Determined as non-DE genes   Total
Truly DE genes        400                      100                          500
Truly non-DE genes    475                      9 025                        9 500
Total                 875                      9 125                        10 000

• type I error rate (false positive rate) = 475 / 9 500 = 5%
• type II error rate (false negative rate) = 100 / 500 = 20%
• sensitivity of the test (power) = 400 / 500 = 80%
• false discovery rate = 475 / 875 = 54%
  • more than half of the discovered DE genes are false discoveries
• false non-discovery rate = 100 / 9 125 = 1%
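The rates listed above follow directly from the four cells of the results table. A minimal Python sketch, with illustrative argument names and assuming none of the denominators is zero:

```python
def error_rates(true_pos, false_neg, false_pos, true_neg):
    """Rates computed from the four cells of a 2x2 testing-result table."""
    return {
        "type_I_error_rate": false_pos / (false_pos + true_neg),
        "type_II_error_rate": false_neg / (true_pos + false_neg),
        "sensitivity": true_pos / (true_pos + false_neg),
        "false_discovery_rate": false_pos / (true_pos + false_pos),
        "false_non_discovery_rate": false_neg / (false_neg + true_neg),
    }


# Counts taken from the microarray example: 400 / 100 truly DE, 475 / 9 025 truly non-DE.
rates = error_rates(true_pos=400, false_neg=100, false_pos=475, true_neg=9025)
for name, value in rates.items():
    print(f"{name}: {value:.1%}")
```

The false positive rate and the false discovery rate share the same numerator (475) but have different denominators, which is why a 5% type I error rate can coexist with an FDR above 50%.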
Common approach
Control of familywise error rate (FWER)
• probability of making one or more false discoveries (type I errors)
• estimate: number of false discoveries out of all tests done
• assume we perform k independent tests, each at significance level α = 0.05
  • i.e. in each test we have a 5% probability of making a false positive decision
  • using the Bonferroni inequality, we can bound the overall probability of type I error, i.e. the probability of making at least one false positive decision, from above by k * α = k * 0.05
  • already for 10 tests the upper bound for the probability of at least one false positive result is 50%
• to keep the overall significance level controlled (e.g. equal to 5%), one has to decrease the significance level in each particular test
  • to be sure that the overall significance level is at most α = 0.05, each test has to be performed at significance level α / k = 0.05 / k
  • for 10 tests, each has to have significance level 0.005; for thousands of tests, ...
Such an approach leads to highly conservative results
• for thousands of tests it is very difficult to prove a truly positive result
• the more tests, the lower the power
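The Bonferroni correction described above can be sketched as follows. The second function assumes independent tests with all null hypotheses true, so that the resulting FWER can be computed exactly; both function names are illustrative.

```python
def bonferroni_level(alpha, k):
    """Per-test significance level that keeps the FWER at or below alpha."""
    return alpha / k


def fwer_independent(per_test_alpha, k):
    """FWER for k independent tests, all nulls true, each run at per_test_alpha."""
    return 1.0 - (1.0 - per_test_alpha) ** k


for k in (10, 1000, 10000):
    level = bonferroni_level(0.05, k)
    # The per-test level shrinks with k, while the FWER stays just below 0.05.
    print(k, level, fwer_independent(level, k))
```

The per-test level falls in proportion to 1/k, which is the source of the loss of power noted above.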
False Discovery Rate
Definition
False Discovery Rate (FDR) = proportion of false positive results out of all positive (= statistically significant) results
Advantages
• if a null hypothesis is rejected, we know the probability that it is correctly rejected
  • FDR = 5% means that, on average, out of 100 positive tests about 5 are false positives and the remaining 95 are true positives
• it is usually more powerful than the traditional FWER approach
• convenient especially when testing a large number of null hypotheses
Disadvantages
• it does not keep the probability of at least one wrongly rejected null hypothesis lower than α
• one has to take care of the situation when the number of rejected H0 is zero
Factors determining False Discovery Rate (FDR)
• proportion of truly DE genes: m1/m
• distribution of true differences
• variability
• sample size
False Discovery Rate
The FDR as defined above works nicely when at least one of the null hypotheses is rejected, i.e. when R0 > 0
When P(R0 = 0) > 0, three possibilities are available
  FDR1 = E(S / R0 | R0 > 0) * P(R0 > 0)
  FDR2 = E(S / R0 | R0 > 0)
  FDR3 = E(S) / E(R0)
The second and third alternatives are equal to 1 if m0 = m
• FDR2 and FDR3 cannot be controlled (limited by α < 1) whenever m0 = m, hence Benjamini and Hochberg decided to work with FDR1
  • in what follows, it is called the False Discovery Rate (FDR)
• controlling FDR1 at level α means that FDR2 = E(S / R0 | R0 > 0) is controlled only at level α / P(R0 > 0), hence Storey decided to work with FDR2
  • the fact that FDR2 = 1 when m0 = m is not a problem, since this result is obvious
  • in what follows, it is called the positive False Discovery Rate (pFDR)
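The difference between the three definitions can be illustrated numerically. The sketch below replaces the expectations by averages over a list of repeated experiments, each summarised by a pair (S, R0); the function name and the toy data are hypothetical.

```python
def fdr_variants(outcomes):
    """outcomes: list of (S, R0) pairs, one per repeated experiment.

    Returns empirical (FDR1, FDR2, FDR3); FDR2 and FDR3 are None when
    no experiment rejected anything.
    """
    n = len(outcomes)
    ratios = [s / r for s, r in outcomes if r > 0]   # S / R0 where it is defined
    p_positive = len(ratios) / n                     # empirical P(R0 > 0)
    fdr2 = sum(ratios) / len(ratios) if ratios else None
    fdr1 = fdr2 * p_positive if ratios else 0.0
    total_r = sum(r for _, r in outcomes)
    fdr3 = sum(s for s, _ in outcomes) / total_r if total_r > 0 else None
    return fdr1, fdr2, fdr3


# Hypothetical experiments with m0 = m: every rejection is a false one (S == R0).
all_null = [(0, 0), (0, 0), (0, 0), (1, 1)]
print(fdr_variants(all_null))  # (0.25, 1.0, 1.0)
```

In this toy case FDR2 and FDR3 equal 1 as stated above, while FDR1 equals the empirical P(R0 > 0) and can therefore still be kept below α.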
Benjamini-Hochberg - FDR
Procedure controlling false discovery rate
• consider testing m null hypotheses H1, H2, H3, ..., Hm
• order the respective p-values such that p(1) ≤ p(2) ≤ p(3) ≤ ... ≤ p(m) and denote the null hypothesis corresponding to p(i) as H(i)
• choose k* as the largest k for which p(k) ≤ α * k / m, i.e.
  k* = max{k: p(k) ≤ α * k / m; 1 ≤ k ≤ m}
• then reject all hypotheses H(i) with i ≤ k*, i.e. those for which p(i) ≤ p(k*)
• if no such k exists, no hypothesis is rejected
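The steps of the procedure can be sketched in plain Python as follows. The function name and the example p-values are illustrative; the sketch returns one rejection flag per hypothesis, in the original order of the p-values.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Step-up procedure: returns a list of booleans, True = reject H0."""
    m = len(p_values)
    # Indices of the hypotheses sorted by increasing p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_star = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k_star = rank                 # keep the largest rank that passes
    rejected = [False] * m
    for idx in order[:k_star]:            # reject the k* smallest p-values
        rejected[idx] = True
    return rejected


p = [0.6, 0.012, 0.038, 0.019, 0.025]
print(benjamini_hochberg(p, alpha=0.05))  # [False, True, True, True, True]
```

In this example the smallest p-value (0.012) exceeds its own threshold α * 1 / m = 0.01, yet it is rejected because larger p-values pass their thresholds; this is the step-up character of the procedure.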
Properties
• for independent test statistics and any configuration of false null hypotheses, the procedure controls the FDR at level α
  E(S / R0) = E(falsely rejected / number of rejected) ≤ α * m0 / m ≤ α
  (with S / R0 taken as 0 when R0 = 0)
• Benjamini, Y. and Hochberg, Y. (1995): Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. Ser. B 57, 289–300
Positive False Discovery Rate
Definition
• false discovery rate given that at least one test has a positive result
  • i.e. the proportion of false positive results among all the positive results, given that at least one positive result occurs
  pFDR = E(S / R0 | R0 > 0)
Additional characteristics can be defined
• q-value: a natural pFDR counterpart to the common p-value
  • the p-value is the probability, under the null hypothesis, that the test statistic is greater than or equal to the observed value: P(T ≥ t | H0)
  • the q-value is a Bayesian analogue of the p-value
  • the q-value is the posterior probability of the null hypothesis given that the test statistic is greater than or equal to the observed value: P(H0 | T ≥ t)
  • for more hypotheses: q-value(t) = inf {pFDR(Γα); t ∈ Γα}
    • the minimum pFDR that can occur when rejecting a statistic with value t
    • the minimum posterior probability of the null hypothesis over all significance regions containing the statistic
    • the q-value minimizes the ratio of the type I error to the power over all significance regions that contain the statistic
• pFNR (positive False Non-discovery Rate): a natural counterpart to the pFDR
  • rate of false negative results among all negative results
  • first we define the False Non-discovery Rate (FNR)
    FNR = E(W / (m – R0) | (m – R0) > 0) * P((m – R0) > 0)
  • the positive False Non-discovery Rate (pFNR) is defined as
    pFNR = E(W / (m – R0) | (m – R0) > 0)
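In practice q-values are estimated from the observed p-values. The sketch below uses a common plug-in estimate, q(i) = min over j ≥ i of π0 * m * p(j) / j, which the slides do not spell out; the estimate of π0 is assumed to be supplied by the user (argument `pi0`, a hypothetical name). With `pi0 = 1` the result coincides with Benjamini-Hochberg adjusted p-values.

```python
def q_values(p_values, pi0=1.0):
    """Plug-in q-value estimates, returned in the original order of p_values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping the running minimum so that
    # the q-values are monotone in the p-values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pi0 * m * p_values[idx] / rank)
        q[idx] = running_min
    return q


print(q_values([0.04, 0.001, 0.2, 0.01]))
```

Rejecting every hypothesis whose estimated q-value is at most a chosen level then corresponds to fixing the rejection region by the tolerated rate of false discoveries.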
Comparison of FWER, FDR and pFDR
FWER – control of the multiple-testing error rate
• we fix the error rate and estimate the rejection region
FDR – control of false positives among all positives
• we fix the rejection region and estimate the error rate
Interpretation of FDR and pFDR
• FDR – rate at which false discoveries occur
• pFDR – rate at which discoveries are false
• when all the null hypotheses are true (i.e. when all genes are non-DE and m0 = m), the pFDR is always equal to 1 (and hence cannot be controlled at a prespecified level α < 1)
• when controlling the FDR at level α and positive findings have occurred, the rate of false discoveries given R0 > 0 (the pFDR) has really only been controlled at level α / P(R0 > 0)
Comparison of FWER, FDR and pFDR
Two approaches to false discovery rate
• fix the acceptable rate α and estimate the corresponding significance threshold
  • available only when using the FDR, since the pFDR cannot be controlled
• fix the significance threshold and estimate the corresponding rate α
  • both the FDR and the pFDR can be used; the latter leads to “stronger” results
Example
Back to the genetics and the microarray study
• the problems come from the high percentage of truly non-DE genes
• lowering the significance level also decreases the FDR
Assume the following situation
• evaluate about 10 000 genes to decide whether they are DE or non-DE
• the genes are independent or only slightly dependent
• for each gene, compare two independent groups with equal variance
• n arrays per group
• use of the standard t-statistic with pooled variance
Example
• α denotes the significance level of each single test, not an overall significance level for all the tests together
Any formal statistical testing procedure
• compute the relevant test statistic for each gene
• sort the statistics (or p-values)
• determine the cut-off point dividing the genes into DE and non-DE
In such a situation it makes sense to care about
• FDR – what proportion of the rejected null hypotheses are rejected wrongly
• FNR – proportion of true alternatives missed by the test
Example
Genetics, microarray studies, particular situations
• the FDR varies depending on the sample size per group (n), the significance level (α) and the proportion of true null hypotheses (π0 = m0 / m)
• n = 5 microarrays per group
  • at significance level α = 5% we get
    • sensitivity of the test about 35%
    • FDR at π0 = 0.9 is greater than 60% and at π0 = 0.995 it is around 95%
  • at sensitivity equal to 80% we get
    • significance level around 0.45
    • FDR at π0 = 0.9 is around 82% and at π0 = 0.99 it is almost 99%
  • hence, n = 5 leads to an underpowered study
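The dependence of the FDR on α, sensitivity and π0 can be approximated by the ratio of the expected number of false positives to the expected number of all positives, FDR ≈ π0 * α / (π0 * α + (1 – π0) * sensitivity). This is a rough sketch under that assumption (it corresponds to E(S) / E(R0)); it reproduces the 54% of the earlier 10 000-gene table, but the figures quoted on this slide come from a specific t-test setting and need not match it exactly. The function name is illustrative.

```python
def expected_fdr(alpha, sensitivity, pi0):
    """Approximate FDR as expected false positives / expected positives."""
    false_pos = pi0 * alpha                 # share of all tests: true nulls rejected
    true_pos = (1.0 - pi0) * sensitivity    # share of all tests: true alternatives found
    return false_pos / (false_pos + true_pos)


# Earlier table: alpha = 5%, sensitivity = 80%, pi0 = 9 500 / 10 000.
print(expected_fdr(0.05, 0.80, 0.95))  # about 0.54, i.e. 475 / 875

# For fixed alpha and sensitivity, the FDR grows as pi0 approaches 1.
for pi0 in (0.9, 0.99, 0.995):
    print(pi0, expected_fdr(0.05, 0.35, pi0))
```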
Example
• n = 20 microarrays per group
  • at significance level α = 5% we get
    • sensitivity of the test around 90%
    • FDR at π0 = 0.9 is around 35% and at π0 = 0.99 it is more than 80%
• n = 30 microarrays per group
  • at significance level α = 5% the results are still poor
  • at significance level α = 0.4% we get
    • sensitivity of the test around 80%
    • FDR at π0 = 0.9 is slightly above 20% and at π0 = 0.99 it is around 72%
• for any n, the best results attainable at significance level α = 5% are
  • minimal FDR at π0 = 0.9 is 18% and at π0 = 0.99 it is 71%
Example
Genetics, microarray study
• the FDR enables another approach to sample size estimation
  • the sample size depends on the number and distribution of truly DE genes and on the tolerated value of the FDR
• if n = 5, we have to have about π0 = 80% of truly DE genes to obtain a reasonable FDR
  • another possibility is to hope for large differences
  • in both cases we use a really small significance level
• if π0 = 0.9 and we desire an FDR of less than 10%, then we should have at least n = 30 observations per group
• if π0 = 0.99 and we classify the top 1% of genes as DE, we need a sample size of n = 45 to observe an FDR around 10% (a sample size of n = 35 is necessary for an FDR of less than 20%)
• if we can estimate π0, then it makes sense to apply the rule of rejecting the top (1 – π0) * 100% of hypotheses, since then FDR = FNR = 1 – sensitivity
  • we control both these statistics together
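The identity FDR = FNR = 1 – sensitivity under the "reject the top (1 – π0) * 100%" rule follows from counting: the number of rejections equals m1, so every missed true alternative is replaced by exactly one false discovery. A small Python check, with FNR taken as the proportion of true alternatives missed (as defined in the earlier example) and with illustrative names and counts:

```python
def top_m1_rule(m1, true_pos):
    """Reject exactly m1 hypotheses, of which true_pos are true alternatives."""
    rejected = m1
    false_pos = rejected - true_pos   # rejected true nulls
    missed = m1 - true_pos            # true alternatives that were not rejected
    return {
        "FDR": false_pos / rejected,
        "FNR": missed / m1,
        "sensitivity": true_pos / m1,
    }


# Hypothetical case: 500 truly DE genes, 400 of them among the top 500.
print(top_m1_rule(m1=500, true_pos=400))
```

Here `false_pos` and `missed` are the same number, so controlling one of the two rates controls the other.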
Typical examples
Situations where control of the FWER (i.e. of the probability of at least one false rejection of H0) is not needed, and where control of the FDR is meaningful
• multiple endpoints
  • typical goal: to recommend a new test or treatment over the standard one
  • the aim is to find as many endpoints as possible in which the new treatment exceeds the standard one
  • the limit on false positive results is not so strict, but too many false discoveries are also bad
• multiple separate decisions without an overall decision
  • multiple subgroups problem, where two treatments are compared in various subgroups of patients
  • we want to find as many subgroups as possible with potentially different reactions to the two treatments
  • but we want to control the rate of false discoveries
• screening of multiple potential effects
  • multiple potential effects are screened to weed out the null effects
  • screening of various chemicals for potential drug development
  • again, we want as many discoveries as possible while controlling the FDR
Thank you
Sometimes I would like to exchange all my knowledge
for a bottle of Whisky