Significance Tests
P-values and Q-values
Outline
Statistical significance in multiple testing
Empirical distribution of test statistics
Family-wide p-values
Correlation and p-values
False discovery rates
Tests and Test Statistics
T-test is fairly robust to skew, but not robust to outliers (“thick tails” of the distribution)
Non-parametric tests are robust, but lose too much ability to detect differences (power)
Robust tests can be useful
Permutation tests are simple and easy to program
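Some authors use
$$s_i = \frac{\bar{x}_{i,\mathrm{group1}} - \bar{x}_{i,\mathrm{group2}}}{SD_i + q_{\alpha,SD}}$$
rather than
$$t_i = \frac{\bar{x}_{i,\mathrm{group1}} - \bar{x}_{i,\mathrm{group2}}}{SD_i}$$
to reduce the number of genes with low fold-changes among the highly significant scores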
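As a rough illustration of the two statistics above, the following Python sketch computes both per gene. The function name `t_and_s`, the use of a pooled standard error for SD_i, and the choice of the median of the SD values as the added constant are assumptions made for the example, not choices stated in the slides.

```python
import numpy as np

def t_and_s(group1, group2, quantile=0.5):
    """Return (t, s) per gene; rows are genes, columns are samples."""
    n1, n2 = group1.shape[1], group2.shape[1]
    diff = group1.mean(axis=1) - group2.mean(axis=1)
    pooled_var = ((n1 - 1) * group1.var(axis=1, ddof=1)
                  + (n2 - 1) * group2.var(axis=1, ddof=1)) / (n1 + n2 - 2)
    sd = np.sqrt(pooled_var * (1.0 / n1 + 1.0 / n2))  # plays the role of SD_i
    # Assumed constant: a quantile of the SD values across all genes.
    offset = np.quantile(sd, quantile)
    t = diff / sd
    s = diff / (sd + offset)
    return t, s
```

Because the same positive constant is added to every denominator, genes with a very small SD_i (and often a small fold-change) are penalized most, which is the effect the slide describes.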
Distribution of test statistics
[Figure: quantile plots of t-statistics. Left: random distribution; right: experiment]
Distribution of Set of p-values
Multiple comparisons
Suppose 10,000 genes on a chip, none actually differentially expressed
Each gene has a 5% chance of exceeding the threshold score for a p-value of .05 (the definition of Type I error)
On average, 500 genes should exceed the .05 threshold ‘by chance’
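The arithmetic here is 10,000 × 0.05 = 500. A small simulation makes the same point; it assumes that p-values of truly null genes are independent and uniform on [0, 1], and the variable names are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, alpha = 10_000, 0.05

# Under the null hypothesis every p-value is uniform on [0, 1].
p_values = rng.uniform(size=n_genes)

expected = n_genes * alpha                 # expected number of false positives
n_false = int((p_values < alpha).sum())    # count observed in this simulated chip
print(expected, n_false)
```

The simulated count fluctuates around the expected 500 from run to run.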
Family-Wide Error Rate
‘Corrected’ p-value: probability of finding at least one false positive among all N tests
Normally all tests are run at the same threshold
Simplest correction (Bonferroni): $p_i^* = \min(N p_i, 1)$
Fairly close to the true false positive rate in simulations of independent tests
Too conservative in practice!
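The Bonferroni rule can be sketched in a few lines of Python. The function name `bonferroni` is made up for this example, and N is taken to be the number of p-values passed in.

```python
import numpy as np

def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capping at 1."""
    p = np.asarray(p_values, dtype=float)
    return np.minimum(p * p.size, 1.0)

corrected = bonferroni([0.001, 0.02, 0.5])
```

With three tests, 0.001 becomes 0.003, 0.02 becomes 0.06, and 0.5 is capped at 1.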
P-Values from Correlated Genes

Null distribution from    Null distribution from        Null distribution from
independent genes         perfectly correlated genes    highly correlated genes
.5    .3    .9            .5    .3    .9                .5    .3    .9
.7    .03   .1            .5    .3    .9                .45   .2    .95
.4    .9    .05           .5    .3    .9                .65   .25   .8
.6    .8    .4            .5    .3    .9                .4    .35   .75
.2    .2    .9            .5    .3    .9                .5    .4    .85

Rows: genes; columns: samples; entries: p-values from randomized distribution
The Effect of Correlation
If all genes are uncorrelated, the Šidák correction is exact
If all genes were perfectly correlated, p-values for one are p-values for all, so no multiple-comparisons correction is needed
Typical gene data is highly correlated
First eigenvalue of SVD may be more than half the variance
More sensitive tests are possible if we can generate the joint null distribution of p-values
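For reference, the Šidák correction for N independent tests is p* = 1 − (1 − p)^N, the probability that at least one of N independent uniform p-values falls below p. A minimal Python sketch follows; the function name `sidak` and the optional `n_tests` argument are assumptions of the example.

```python
import numpy as np

def sidak(p_values, n_tests=None):
    """Sidak-corrected p-values: 1 - (1 - p)^N for N independent tests."""
    p = np.asarray(p_values, dtype=float)
    n = p.size if n_tests is None else n_tests
    return 1.0 - (1.0 - p) ** n

# With perfectly correlated genes there is effectively one test (N = 1),
# so the correction leaves the p-value unchanged.
unchanged = sidak(0.05, n_tests=1)
ten_tests = sidak(0.01, n_tests=10)
```

For small p the result is close to the Bonferroni value N·p but never larger, so Šidák is slightly less conservative when the tests really are independent.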
Re-formulating the Question
Independent: ~5% of genes exceed the .05 threshold, all the time
Perfectly correlated: all genes exceed the .05 threshold, ~5% of the time
Realistically correlated: a fraction .05 < f1 < 1 of genes exceeds the .05 threshold in a fraction .05 < f2 < 1 of the cases
New question: for a given f1 and α, how likely is it that a fraction f1 of genes will exceed the α threshold?
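The two extreme cases can be checked by simulation. The sketch below assumes null z-scores that are standard normal with a common pairwise correlation `rho` (an equicorrelation model chosen only for illustration) and returns, for each simulated experiment, the fraction of genes passing a two-sided threshold. The function name and defaults are hypothetical.

```python
import numpy as np
from statistics import NormalDist

def fraction_exceeding(rho, n_genes=2000, n_experiments=500, alpha=0.05, seed=0):
    """Fraction of null genes with two-sided p < alpha, per simulated experiment."""
    rng = np.random.default_rng(seed)
    shared = rng.standard_normal((n_experiments, 1))      # component common to all genes
    own = rng.standard_normal((n_experiments, n_genes))   # gene-specific component
    z = np.sqrt(rho) * shared + np.sqrt(1.0 - rho) * own  # N(0, 1), correlation rho
    z_crit = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return (np.abs(z) > z_crit).mean(axis=1)

independent = fraction_exceeding(0.0)  # close to 0.05 in every experiment
perfect = fraction_exceeding(1.0)      # either 0 or 1; equal to 1 in roughly 5% of experiments
```

Intermediate values of `rho` give a spread of fractions between these extremes, which is the situation the new question addresses.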
Step-Down p-Values
Calculate single-step p-values for genes: p1, …, pN
Order the smallest k p-values: p(1), …, p(k)
For each k, ask: how likely are we to get k p-values less than p(k) if no differences are real?
Generate the null distribution by permutations
Yields more significant genes, at the same level of Type I error, than single-step procedures
See Ge et al., Test, 2003
Bioconductor package multtest
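One widely used permutation procedure in this family is the Westfall-Young step-down maxT method. The Python sketch below implements that method as an illustration; it is not necessarily the exact variant the slide has in mind, and it is not the multtest code. The function names, the Welch t-statistic, and the (count + 1)/(B + 1) convention for avoiding zero p-values are assumptions of the example.

```python
import numpy as np

def abs_t(data, labels):
    """Absolute Welch t-statistic per gene (rows: genes, columns: samples)."""
    a, b = data[:, labels == 0], data[:, labels == 1]
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1]
                 + b.var(axis=1, ddof=1) / b.shape[1])
    return np.abs(a.mean(axis=1) - b.mean(axis=1)) / se

def stepdown_maxt(data, labels, n_perm=1000, seed=0):
    """Step-down maxT adjusted p-values from random label permutations."""
    rng = np.random.default_rng(seed)
    observed = abs_t(data, labels)
    order = np.argsort(-observed)              # most significant gene first
    counts = np.zeros(observed.size)
    for _ in range(n_perm):
        perm_t = abs_t(data, rng.permutation(labels))[order]
        # Successive maxima: for rank j, the max over ranks j, j+1, ..., N.
        succ_max = np.maximum.accumulate(perm_t[::-1])[::-1]
        counts += succ_max >= observed[order]
    adj_sorted = (counts + 1.0) / (n_perm + 1.0)
    adj_sorted = np.maximum.accumulate(adj_sorted)   # enforce monotonicity
    adjusted = np.empty_like(adj_sorted)
    adjusted[order] = adj_sorted                     # back to original gene order
    return adjusted
```

The step-down idea is visible in `succ_max`: the top-ranked gene is compared with the maximum over all genes, the second with the maximum over the remaining ones, and so on, which is why it can call more genes significant than a single-step correction at the same error level.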
False Discovery Rate
At threshold t*, what fraction of genes are likely to be true positives?
Illustration: 10,000 independent genes

t       p       #sig    E(FP)   FDR*
1.96    .05     600     500     87%
2.57    .01     200     100     50%
3.29    .001    40      10      20%

In practice use permutation algorithm to compute FDR
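The E(FP) column is simply the number of genes times the p-value threshold. The short Python sketch below reproduces it and also prints the plain ratio E(FP)/#sig as a naive FDR estimate; treating that ratio as the estimate is an assumption of the example.

```python
n_genes = 10_000
# (t threshold, p-value, number of genes called significant), from the illustration
rows = [(1.96, 0.05, 600), (2.57, 0.01, 200), (3.29, 0.001, 40)]

ratios = []
for t, p, n_sig in rows:
    expected_fp = n_genes * p            # expected false positives under the null
    ratios.append(expected_fp / n_sig)   # naive estimate of the false discovery rate
    print(f"t={t}: E(FP)={expected_fp:.0f}, E(FP)/#sig={ratios[-1]:.0%}")
```

The plain ratio comes out at about 83%, 50% and 25%. The first and last values differ from the 87% and 20% listed under FDR*, so that column may have been computed with a different estimator than this simple ratio.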
pFDR
How to estimate the FDR?
‘Positive’ False Discovery Rate: E(#false positives / #positives | #positives > 0)
The ordinary FDR is this quantity multiplied by P(#positives > 0)
Simes’ inequality allows the FDR to be controlled using p-values
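A standard way to work with false discovery rates directly from p-values is the Benjamini-Hochberg step-up procedure, which is built on Simes' inequality. The Python sketch below computes Benjamini-Hochberg adjusted p-values; it targets the ordinary FDR rather than the positive FDR, and the function name is made up for the example.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    p = np.asarray(p_values, dtype=float)
    n = p.size
    order = np.argsort(p)
    # Simes-type scaling: the k-th smallest p-value is multiplied by n / k.
    scaled = p[order] * n / np.arange(1, n + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adj_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(n)
    adjusted[order] = np.minimum(adj_sorted, 1.0)
    return adjusted

adjusted = benjamini_hochberg([0.01, 0.04, 0.03, 0.005])
```

Calling genes with an adjusted value below 0.05 significant keeps the expected fraction of false discoveries at or below 5% when the tests are independent.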