Tiffany Chao
Beth Johnson
Steven Lee
Hypothesis testing
● Test for each gene
○ null hypothesis: no differential expression
● Two kinds of errors
○ type I error (false positive)
say that a gene is differentially expressed when it
actually isn't; wrongly reject a true null hypothesis
○ type II error (false negative)
say that a gene isn't differentially expressed when it
actually is; fail to reject a false null hypothesis
Thinking about p-values
● Probability of obtaining a test statistic at
least as extreme as the one that was actually
observed, assuming the null hypothesis is true
● Minimum false positive rate at which an
observed statistic can be called significant
● If the null hypothesis is simple, then a null p-value is uniformly distributed on [0, 1]
Multiple comparison problem
● Even if we have useful approximations for
our p-values, we still face the multiple
comparison problem
● When performing many independent tests,
p-values no longer have the same interpretation
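A minimal sketch of why the interpretation changes, assuming m independent tests in which every null is true and each test is run at level α. The function name is illustrative only.

```python
def prob_at_least_one_false_positive(m: int, alpha: float = 0.05) -> float:
    """Pr(at least one false positive) across m independent true-null tests.

    Each test avoids a false positive with probability (1 - alpha), so all m
    tests avoid one with probability (1 - alpha) ** m.
    """
    return 1.0 - (1.0 - alpha) ** m


if __name__ == "__main__":
    for m in (1, 10, 100, 1000):
        print(m, round(prob_at_least_one_false_positive(m), 4))
```

With a single test the probability equals α, but it climbs toward 1 quickly as m grows.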
Not only in genomics!
● "Statistical Comparisons of Classifiers over
Multiple Datasets", Demsar, JMLR 2006
● "Permutation Tests for Studying Classifier
Performance", Ojala, JMLR 2010
● "On Comparing Classifiers: Pitfalls to avoid
and a recommended approach", Salzberg,
1997, Data Mining and Knowledge Discovery
Multiple hypothesis testing
                   Called significant   Called not significant   Total
Null true          F                    m0 – F                   m0
Alternative true   T                    m1 – T                   m1
Total              S                    m – S                    m
Suppose we care about p-values ≤ 0.05?
Error rates (more on this later)
● Per comparison error rate (PCER)
○ E[F] / m
● Per family error rate (PFER)
○ E[F]
● Family-wise error rate (FWER)
○ Pr(F ≥ 1)
● False discovery rate (FDR)*
○ E[F/S] (and set F/S = 0 when S = 0)
= E[F/S | S > 0] Pr(S > 0)
● Positive false discovery rate (pFDR)*
○ E[F/S | S > 0]
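As an illustration, the definitions above can be approximated empirically if we imagine repeating the whole experiment many times and recording (F, S) each time. This is only a sketch: in real data F is never observed, and the function name and input format are assumptions made for the example.

```python
def error_rates(outcomes, m):
    """Empirical error rates from replicate experiments.

    outcomes: list of (F, S) pairs, one per replicate, where F is the number
              of false positives and S the number of features called significant.
    m:        number of hypotheses tested in each replicate.
    """
    n = len(outcomes)
    mean_f = sum(f for f, _ in outcomes) / n
    # FWER: fraction of replicates with at least one false positive
    fwer = sum(1 for f, _ in outcomes if f >= 1) / n
    # FDR: average of F/S, with F/S set to 0 when S = 0
    fdr = sum((f / s if s > 0 else 0.0) for f, s in outcomes) / n
    # pFDR: average of F/S over replicates with S > 0 only
    positive = [(f, s) for f, s in outcomes if s > 0]
    pfdr = sum(f / s for f, s in positive) / len(positive) if positive else float("nan")
    return {"PCER": mean_f / m, "PFER": mean_f, "FWER": fwer, "FDR": fdr, "pFDR": pfdr}


if __name__ == "__main__":
    print(error_rates([(0, 0), (1, 2), (0, 4), (2, 4)], m=10))
```

In the toy input, FDR equals pFDR times the fraction of replicates with S > 0, matching the identity FDR = E[F/S | S > 0] Pr(S > 0).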
MHT error controlling procedure
● Suppose you test m hypotheses and get m p-values: p1, p2, p3, ..., pm
● A multiple hypothesis test error controlling
procedure is a function T(p; α) such that
rejecting all nulls with pi ≤ T(p; α) implies that
Error ≤ α
● Error is a population quantity (not random)
Weak and strong control
● Weak: T(p; α) is such that Error ≤ α only when
m0 = m
● Strong: T(p; α) is such that Error ≤ α for any
value of m0
○ note that m0 is not an argument for T(p; α)!
Bonferroni correction
provides strong control of the FWER: reject pi ≤ α/m, so Pr(F ≥ 1) ≤ m0 · α/m ≤ α
but too restrictive
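A short sketch of the Bonferroni rule as a thresholding procedure T(p; α) = α/m. The function names are illustrative, not taken from any particular library.

```python
def bonferroni_threshold(p_values, alpha=0.05):
    """Bonferroni cutoff: the same threshold alpha / m for every test."""
    return alpha / len(p_values)


def bonferroni_reject(p_values, alpha=0.05):
    """Return a list of booleans: True where the null is rejected."""
    t = bonferroni_threshold(p_values, alpha)
    return [p <= t for p in p_values]


if __name__ == "__main__":
    p = [0.001, 0.012, 0.03, 0.2, 0.0124]
    print(bonferroni_threshold(p))  # cutoff shrinks as the number of tests grows
    print(bonferroni_reject(p))
```

Because the cutoff shrinks like 1/m, with thousands of genes only extremely small p-values survive, which is the sense in which the procedure is restrictive.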
Why FDR and q-value?
To help us interpret these values, two pieces of
information would be useful
Estimate of the overall proportion of features that are
truly alternative (even if they cannot be precisely identified)
Measure of significance that can be associated with
each feature so that thresholding the numbers at a
particular value has an easy interpretation
Would like an error measure that provides a
balance between
Number of false positive features (F)
Number of true positive features (T)
The false discovery rate is the expected value of
the proportion of false positive features among
all those called significant
*Some possibility S = 0, so some adjustment has to be made to definition of FDR
Estimating FDR
Therefore, the FDR depends on what threshold
(t) we are using to determine significance:
FDR(t) = E[F(t) / S(t)]
Estimating FDR
Because we are considering many features (m is
very large), we can approximate
FDR(t) ≈ E[F(t)] / E[S(t)]
Estimating FDR
We now need to approximate E[S(t)] and E[F(t)]
To illustrate how FDR is determined: for m genes
we have m p-values,
denoted p1, p2, …, pm
Define F(t) = #{null pi ≤ t} and S(t) = #{pi ≤ t}
S(t) is observed: we can count the p-values ≤ t for a
given t
Estimating FDR
Approximating F(t) is more difficult because we
do not know how many of the values called significant
were truly null
Assuming null p-values are uniformly distributed,
Pr(null p ≤ t) = t, so
E[F(t)] = m0 · t
(# of null features × probability of a null feature being called significant)
Estimating FDR
We do not know the true value of m0 (# of null
features), so we must estimate it
Equivalently, we can estimate the proportion of
features that are truly null (denoted by π0 = m0/m)
Assuming a uniform distribution for null p-values, we can estimate this quantity using a tuning parameter λ:
π0(λ) = #{pi > λ} / (m (1 – λ))
Estimating π0
Find where p-values look like a
uniform distribution and
set λ
Estimating π0
Note π0 does
not depend on
Estimating π0
Can also fit a cubic spline to the π0(λ)
vs λ data to determine π0(1)
(because “most” of the p-values near 1
would be expected to be null)
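A minimal sketch of the fixed-λ estimate, π0(λ) = #{pi > λ} / (m (1 – λ)), assuming null p-values are uniform so that the region above λ is mostly null. The spline fit over many λ values is not shown. The function name and the cap at 1 are choices made for this example.

```python
def estimate_pi0(p_values, lam=0.5):
    """Estimate the proportion of truly null features at a fixed lambda.

    If all features were null, about m * (1 - lam) p-values would exceed lam.
    The observed count divided by that expectation estimates pi0.
    """
    m = len(p_values)
    count_above = sum(1 for p in p_values if p > lam)
    pi0 = count_above / (m * (1.0 - lam))
    # A proportion cannot exceed 1; cap the estimate (a choice for this sketch).
    return min(1.0, pi0)


if __name__ == "__main__":
    p = [0.001, 0.002, 0.01, 0.02, 0.03, 0.04, 0.2, 0.55, 0.7, 0.9]
    print(estimate_pi0(p, lam=0.5))
```

Smaller λ uses more p-values (less variance) but includes more alternative features (more bias), which is the trade-off the spline approach tries to smooth over.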
Estimate for False Discovery Rate is
FDR(t) = π0 · m · t / S(t) = π0 · m · t / #{pi ≤ t}
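A sketch of the plug-in estimate at one threshold t, assuming E[F(t)] ≈ π0 · m · t (uniform null p-values) and using the observed count S(t) = #{pi ≤ t} in the denominator. The function name, the return value of 0 when nothing is called significant, and the cap at 1 are assumptions for this example.

```python
def estimate_fdr(p_values, t, pi0=1.0):
    """Estimated FDR when calling every feature with p <= t significant."""
    m = len(p_values)
    s = sum(1 for p in p_values if p <= t)  # observed S(t)
    if s == 0:
        return 0.0  # nothing called significant; F/S is defined as 0
    expected_false = pi0 * m * t  # approximate E[F(t)]
    return min(1.0, expected_false / s)


if __name__ == "__main__":
    p = [0.001, 0.002, 0.01, 0.02, 0.03, 0.04, 0.2, 0.55, 0.7, 0.9]
    print(estimate_fdr(p, t=0.05, pi0=0.6))
```

Setting pi0 = 1 gives the most conservative version of the estimate.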
Graphical Interpretation
q-value definition
for a given feature, the q-value is the
expected FDR incurred if it is called significant
(every other p_j <= p_i is also called significant)
in practical terms: a q-value threshold is the
"proportion of significant features that turn
out to be false leads"
Graphical Interpretation
a measure of each feature's significance
p-value is in terms of the false positive rate
q-value is in terms of the FDR
this takes into account that thousands of features
are simultaneously being tested (via FDR)
uses a better model of where the significant features
are likely to be
p vs q
m = 10000
p-value cutoff at .01 means you expect about
100 false positives (.01 × 10000) if nearly all features are null
p-value cutoff of .0001 means you expect only about 1
false positive, but at what cost?
set q-value cutoff at .05, and expect that only about 5%
of the significant genes found are false positives
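The arithmetic behind the p-value cutoffs can be sketched as follows, assuming uniform null p-values so that the expected number of false positives at cutoff t is π0 · m · t. Taking π0 = 1 (every feature null) gives the worst case used in the numbers above. The function name is illustrative.

```python
def expected_false_positives(m, t, pi0=1.0):
    """Expected number of null features with p <= t, given uniform null p-values."""
    return pi0 * m * t


if __name__ == "__main__":
    m = 10000
    for t in (0.01, 0.0001):
        print(t, expected_false_positives(m, t))
```

The count alone says nothing about how many true positives were kept, which is why the q-value, a ratio to the number called significant, is easier to interpret.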
Algorithm for Determining qValues
Compute test statistic (p-value) for m genes
Estimate π0
Using histogram
Find region where p-values are uniform + set λ
Count p-values > λ and divide by (1-λ)m (number of p-values expected above λ if all features were null)
Using cubic spline
For each p-value
calculate FDR for each threshold t >= p
only need to consider t values equal to the unique p-values in the gene set
choose minimum FDR as q-value
q-value (cutoff)
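The per-feature step can be sketched as below, assuming π0 has already been estimated and that the FDR at threshold t is estimated by π0 · m · t / #{pi ≤ t}. Walking from the largest p-value down and keeping a running minimum yields the minimum estimated FDR over all thresholds t >= p. The function name is illustrative, and this is not the code of any specific q-value package.

```python
def q_values(p_values, pi0=1.0):
    """Return q-values in the same order as the input p-values."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0  # an FDR can never exceed 1
    # Visit p-values from largest to smallest; rank = #{p_j <= p_i} at first
    # visit of each distinct value, and ties inherit the running minimum.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        fdr_at_p = pi0 * m * p_values[i] / rank
        running_min = min(running_min, fdr_at_p)
        q[i] = running_min
    return q


if __name__ == "__main__":
    print(q_values([0.01, 0.04, 0.03, 0.5]))
```

Calling every feature with q-value at most some cutoff significant then corresponds to choosing the largest threshold whose estimated FDR stays below that cutoff.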
q-value accuracy
assumes that the dependence between features is
generally weak
genes are actually dependent in pathways, which
can be modeled as blocks
if so, when m is large, calling all features significant
with q <= alpha implies the FDR <= alpha
the estimated q-value of each feature is greater than
or equal to its true q-value
conservative is desirable
q-value summary
A standard measure of significance that can
be universally interpreted between studies
better than using just p-values
arbitrary selection of alpha, where it is selected so
the expected number of false positives is < 1
throws away too many likely truly significant features
FDR plug-in
Create K permutations of the data, producing
statistics tj^k for features j=1,...,M and permutations k=1,...,K
For a range of cutoffs C, let
S_obs = #{j : |tj| > C},  E[F] = (1/K) Σk #{j : |tj^k| > C}
Estimate the FDR by
FDR = E[F] / S_obs
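A sketch of the plug-in estimate at a single cutoff C, assuming the observed statistics and the K sets of permutation statistics have already been computed. The expected number of false positives is taken as the average number of permutation statistics exceeding C in absolute value, and is divided by the number of observed statistics exceeding C. The function name and argument layout are assumptions for this example.

```python
def plug_in_fdr(t_obs, t_perm, cutoff):
    """Plug-in FDR estimate at one cutoff.

    t_obs:  list of M observed statistics, one per feature.
    t_perm: list of K lists, each holding the M statistics from one permutation.
    cutoff: threshold C; a feature is called significant when |t| > C.
    """
    s_obs = sum(1 for t in t_obs if abs(t) > cutoff)
    if s_obs == 0:
        return 0.0  # nothing called significant at this cutoff
    k = len(t_perm)
    # Average number of "significant" statistics per permutation, where the
    # null holds by construction
    expected_false = sum(1 for perm in t_perm for t in perm if abs(t) > cutoff) / k
    return expected_false / s_obs


if __name__ == "__main__":
    observed = [3.0, -2.5, 0.5, 1.0]
    permuted = [[0.1, 2.2, -0.3, 0.4], [1.0, -0.2, 0.3, 0.9]]
    print(plug_in_fdr(observed, permuted, cutoff=2.0))
```

Repeating the call over a grid of cutoffs gives the estimated FDR as a function of C.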