Significance Tests
P-values and Q-values

Outline
Statistical significance in multiple testing
Empirical distribution of test statistics
Family-wide p-values
Correlation and p-values
False discovery rates

Tests and Test Statistics
The t-test is fairly robust to skew, but not robust to outliers ("thick tails" of the distribution)
Non-parametric tests are robust, but lose too much ability to detect differences (power)
Robust tests can be useful
Permutation tests are simple and easy to program
Some authors use
  $s_i = \dfrac{\bar{x}_{i,\mathrm{group1}} - \bar{x}_{i,\mathrm{group2}}}{SD_i + q_{\alpha,SD}}$
rather than
  $t_i = \dfrac{\bar{x}_{i,\mathrm{group1}} - \bar{x}_{i,\mathrm{group2}}}{SD_i}$
to reduce the number of genes with low fold-changes among the highly significant scores

Distribution of test statistics
[Figure: quantile plots of t-statistics. Left: random distribution; right: experiment]

Distribution of Set of p-values

Multiple comparisons
Suppose 10,000 genes on a chip, none actually differentially expressed
Each gene has a 5% chance of exceeding the threshold score for a p-value of .05 (this is the definition of Type I error)
On average, 500 genes should exceed the .05 threshold 'by chance'

Family-Wide Error Rate
'Corrected' p-value: probability of finding at least one false positive among all N tests
Normally all tests are held to the same threshold
Simplest correction (Bonferroni): $p_i^* = N p_i$ if $N p_i < 1$, otherwise $p_i^* = 1$
Fairly close to the true false positive rate in simulations of independent tests
Too conservative in practice!
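The multiple-comparisons arithmetic and the Bonferroni correction above can be sketched in a few lines of Python. This is a minimal illustration, not code from the slides: the function names, the seed, and the use of uniform random numbers as null p-values are all assumptions. The Šidák formula, 1 - (1 - p)^N, is the standard one for independent tests and is included only for comparison.

```python
import random


def bonferroni(pvals):
    """Bonferroni-corrected p-values: p_i* = min(1, N * p_i)."""
    n = len(pvals)
    return [min(1.0, n * p) for p in pvals]


def sidak(pvals):
    """Sidak-corrected p-values: 1 - (1 - p_i)^N (assumes independent tests)."""
    n = len(pvals)
    return [1.0 - (1.0 - p) ** n for p in pvals]


def count_null_hits(n_genes=10_000, threshold=0.05, seed=0):
    """Count genes passing the threshold when no gene is truly different.

    Null p-values are modelled as independent Uniform(0, 1) draws
    (an assumption made for this illustration).
    """
    rng = random.Random(seed)
    return sum(rng.random() < threshold for _ in range(n_genes))


if __name__ == "__main__":
    print("null genes below .05:", count_null_hits())   # expected value is 500
    example = [0.000001, 0.00001, 0.01, 0.5]
    print("Bonferroni:", bonferroni(example))
    print("Sidak:     ", sidak(example))
```

With 10,000 null genes the count fluctuates around 500 from seed to seed, which is the 'by chance' figure quoted above. The Šidák value is never larger than the Bonferroni value, and for small p-values the two are nearly identical.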
P-Values from Correlated Genes
Null distribution from     Null distribution from        Null distribution from
independent genes          perfectly correlated genes    highly correlated genes
.5   .3   .9               .5   .3   .9                  .5   .3   .9
.7   .03  .1               .5   .3   .9                  .45  .2   .95
.4   .9   .05              .5   .3   .9                  .65  .25  .8
.6   .8   .4               .5   .3   .9                  .4   .35  .75
.2   .2   .9               .5   .3   .9                  .5   .4   .85
Rows: genes; columns: samples; entries: p-values from the randomized distribution

The Effect of Correlation
If all genes are uncorrelated, the Šidák correction is exact
If all genes were perfectly correlated, the p-values for one gene are the p-values for all, and no multiple-comparisons correction is needed
Typical gene data is highly correlated: the first eigenvalue of the SVD may account for more than half the variance
More sensitive tests are possible if we can generate the joint null distribution of p-values

Re-formulating the Question
Independent: ~5% of genes exceed the .05 threshold, all the time
Perfectly correlated: all genes exceed the .05 threshold ~5% of the time
Realistically correlated: a fraction f1 (.05 < f1 < 1) of genes exceeds the .05 threshold in a fraction f2 (.05 < f2 < 1) of the cases
New question: for a given f1 and α, how likely is it that a fraction f1 of genes will exceed the threshold?

Step-Down p-Values
Calculate single-step p-values for genes: p1, …, pN
Order the smallest k p-values: p(1), …, p(k)
For each k, ask: how likely are we to get k p-values less than p(k) if no differences are real?
Generate the null distribution by permutations
Yields more significant genes, at the same level of Type I error, than single-step procedures
See Ge et al., Test, 2003
Bioconductor package multtest

False Discovery Rate
At threshold t*, what fraction of the genes called significant are likely to be false positives?
Illustration: 10,000 independent genes
t       p       #sig    E(FP)   FDR*
1.96    .05     600     500     87%
2.57    .01     200     100     50%
3.29    .001     40      10     20%
In practice, use a permutation algorithm to compute the FDR

pFDR
How to estimate the FDR?
'Positive' False Discovery Rate: E(#false positives / #positives) * P(#positives > 0)
Simes' inequality allows this to be computed from p-values
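One common way to turn a list of p-values into FDR estimates is the Benjamini-Hochberg step-up procedure, which rests on Simes' inequality. The slides do not name this procedure, so treating it as the intended method is an assumption. The sketch below uses a hypothetical function name, controls the FDR rather than the pFDR, and relies on independent (or positively dependent) tests; it is not the permutation algorithm recommended above.

```python
def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (an illustrative sketch).

    For the p-value of rank r among n (rank 1 = smallest), the raw
    adjustment is p * n / r. Walking from the largest rank down and
    keeping a running minimum makes the adjusted values monotone.
    Results are returned in the original input order.
    """
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])  # indices, smallest p first
    adjusted = [0.0] * n
    running_min = 1.0
    for rank in range(n, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    pvals = [0.01, 0.04, 0.03, 0.5]
    for p, q in zip(pvals, benjamini_hochberg(pvals)):
        print(f"p = {p:<5} adjusted = {q:.4f}")
```

Genes whose adjusted value falls below a chosen level, say 0.05, form a list whose expected proportion of false positives is at most that level under the stated assumptions. Because real expression data is highly correlated, these values are best read as a rough guide alongside a permutation-based estimate.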