Probabilistic Models in Bioinformatics
Sepp Hochreiter
Institute of Bioinformatics, Johannes Kepler University, Linz, Austria
IFAS, Linz, 03.04.2014

Outline
• FARMS: Microarray Analysis
• FABIA: Biclustering
• cn.MOPS: Copy Number Estimation in NGS Data
• DEXUS: Differential Expression in NGS Data

Statistics / Bioinformatics

Hardy's 1908 paper:
"To the Editor of Science: I am reluctant to intrude in a discussion concerning matters of which I have no expert knowledge, and I should have expected the very simple point which I wish to make to have been familiar to biologists. However, some remarks of Mr. Udny Yule, to which Mr. R. C. Punnett has called my attention, suggest that it may still be worth making..."

Hardy–Weinberg principle: both allele and genotype frequencies in a population remain constant from generation to generation.

Example: mice
Genotype to phenotype: AA, Aa: white coat; aa: gray coat
Allele frequencies: A: 80%, a: 20%
Genotype frequencies under random mating (aA = Aa):
• 64% AA homozygous (0.8 * 0.8 = 0.64)
• 32% Aa heterozygous (0.8 * 0.2 * 2 = 0.32)
• 4% aa homozygous (0.2 * 0.2 = 0.04)
Phenotype frequencies: 96% white coats, 4% gray coats
Will gray-coated mice disappear? No! By the Hardy–Weinberg principle, both allele and genotype frequencies in a population remain constant from generation to generation.
Allele frequencies in the next generation:
• A: 80% (0.64 + 0.5 * 0.32)
• a: 20% (0.04 + 0.5 * 0.32)

Microarray Analysis

Beerse, Belgium
Berlin, Germany
Group: Dept. Psychiatry and Psychotherapy, Prof. Andreas Heinz
Group: Dept. Nephrology and Internal Intensive Care, Prof.
Petra Reinke

Gene Network Science
"GNS is collaborating with researchers at Johannes Kepler University, Linz, by way of genomics data processing algorithms developed by the researchers and licensed by GNS."

[Figure: Affymetrix fluidics station (wash / stain) and Affymetrix scanner]

[Figure: probes and probe set along the mRNA reference sequence (5' to 3'); perfect match probe; fluorescence intensity image]
5' …TGTGATGGTGGGAATGGGTCAGAAGGACTCCTATGTGGGTGACGAGGCC… 3'
                |||||||||||||||||||||||||
                TTACCCAGTCTTCCTGAGGATACAC   (perfect match)

FARMS (Factor Analysis for Robust Microarray Summarization)
[Figure: factor z, connected through the loading matrix (λ1, …, λ9) to the observations x1, …, x9, each with additive noise ε1, …, ε9]
• z = variation in mRNA concentration
• λi = sensitivity of log-PMi
• εi = measurement noise for log-PMi
• xi = observed log-PMi

Model assumption: measurement x is Gaussian
• Naef et al.
2002: difference of replicates is Gaussian; the assumption follows if the noise is symmetric
• Freudenberg 2004 showed it for log2-PM
• On real-world data we ran a Shapiro-Wilk test for normality (the continuous curve is log2-transformed)

Generative model: x = λz + ε
• z: factor, N(0, 1)
• ε: noise, N(0, Ψ)
• λ: loading vector
• x: data, N(0, λλ^T + Ψ)
• Ψ, λ: estimated with the EM algorithm
• Ψ is diagonal; covariance of x: λλ^T + Ψ
Correlations between probes can only be explained by the hidden factor.

FARMS: maximum a posteriori
[Equation: posterior of the parameters given the data, in terms of likelihood and prior]

Prior knowledge used in the model via a prior on the loading matrix:
• more mRNA gives more intensity (positive correlation)
• high values of λ are seldom (high variance)
• most genes have constant signal
Rectified Gaussian as prior: [Equation]

[Equations: E-step and M-step]

Efficient through algebraic reformulations
Matrix inversion lemma leading to (note that Ψ is diagonal): [Equation]

International competition: "Affycomp" (http://affycomp.biostat.jhsph.edu/)
Participants from Berkeley, Affymetrix, EBI, Roche, etc.
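As a rough illustration of the FARMS generative model introduced earlier (observation = loading × factor + noise, with a diagonal noise covariance Ψ), the following sketch builds the model covariance λλ^T + Ψ, inverts it with the rank-one form of the matrix inversion lemma, and compares one empirical covariance entry with its model value. This is not the FARMS implementation: the loadings, the noise variances, and all function names are hypothetical, and no prior or EM step is included.

```python
import random

def model_covariance(lam, psi):
    """Covariance lam lam^T + Psi of x = lam * z + eps, with Psi = diag(psi)."""
    n = len(lam)
    return [[lam[i] * lam[j] + (psi[i] if i == j else 0.0) for j in range(n)]
            for i in range(n)]

def inverse_via_lemma(lam, psi):
    """Inverse of lam lam^T + Psi via the matrix inversion lemma (rank-one case).

    Because Psi is diagonal, only element-wise divisions are needed.
    """
    n = len(lam)
    u = [lam[i] / psi[i] for i in range(n)]              # Psi^{-1} lam
    denom = 1.0 + sum(lam[i] * u[i] for i in range(n))   # 1 + lam^T Psi^{-1} lam
    return [[(1.0 / psi[i] if i == j else 0.0) - u[i] * u[j] / denom
             for j in range(n)] for i in range(n)]

def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def sample(lam, psi, rng):
    """Draw one observation x = lam * z + eps from the generative model."""
    z = rng.gauss(0.0, 1.0)
    return [l * z + rng.gauss(0.0, p ** 0.5) for l, p in zip(lam, psi)]

# Hypothetical loadings and noise variances for a probe set with 9 probes.
lam = [0.3, 0.5, 0.8, 1.0, 1.2, 0.9, 0.6, 0.4, 0.7]
psi = [0.2, 0.1, 0.3, 0.25, 0.5, 0.15, 0.2, 0.1, 0.3]

# Check the lemma: covariance times its claimed inverse should be the identity.
cov = model_covariance(lam, psi)
product = matmul(cov, inverse_via_lemma(lam, psi))
identity_error = max(abs(product[i][j] - (1.0 if i == j else 0.0))
                     for i in range(9) for j in range(9))

# Empirical covariance of probes 3 and 4 (the model mean is zero);
# under the model it should be close to lam[3] * lam[4].
rng = random.Random(0)
xs = [sample(lam, psi, rng) for _ in range(50_000)]
emp_cov_34 = sum(x[3] * x[4] for x in xs) / len(xs)

print(identity_error, emp_cov_34, lam[3] * lam[4])
```

The off-diagonal entry of the covariance comes entirely from the shared factor, which is the sense in which correlations between probes are explained by the hidden factor.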
AUC (most relevant criterion)
Johnson & Johnson tested FARMS on 30 internal data sets; it is now their default microarray normalization method.

[Figure: Affycomp data set A and Affycomp data set C, each comparing FARMS, RMA, and MAS 5.0]

Informative/Non-Informative Call (I/NI Call)

Accession number   Chip          Total   I/NI calls   A/P calls
E-MEXP-101         hgu133a       22283    1726        12898
E-MEXP-120         hgu133a       22283    5027        13850
E-MEXP-121         hgu133a       22283    5105        16574
E-MEXP-714         hgu133a       22283    1242        13711
E-MEXP-72          hgu133a       22283    4385        13801
Spike-in U133      hgu133a       22300     113        12869
E-MEXP-882         hgu133plus2   54675   16022        41355
E-TABM-127         hgu133plus2   54675    4962        41022
E-TABM-34          hgu133plus2   54675   12810        35162
E-TABM-84          hgu133plus2   54675    6781        38258
GSE3744            hgu133plus2   54675   10673        42625
E-MEXP-834         Mouse430_2    45101    8067        26382
E-MEXP-835         Mouse430_2    45101    5247        26891
E-MEXP-839         Mouse430_2    45101    8107        28485
E-MEXP-842         Mouse430_2    45101    1756        27945
E-TABM-102         Mouse430_2    45101    8858        29934
E-MEXP-856         Mouse430A_2   22690    5014        16569
GSE2867            Mouse430A_2   22690    3027        16412
GSE2882            Mouse430A_2   22690    4080        15035
GSE3858            Mouse430A_2   22690    2801        14379
GSE4065            Mouse430A_2   22690     984        12181
E-MEXP-553         Rat230_2      31099    3255        19261
E-MEXP-920         Rat230_2      31099     954        22725
E-MEXP-948         Rat230_2      31099    4080        19378
GSE5606            Rat230_2      31099    2723        20626
GSE6119            Rat230_2      31099    7449        22030
GSE1491            ATH1-121501   22810    3138        17855
GSE3326            ATH1-121501   22810    8186        17827
GSE3350            ATH1-121501   22810    5716        16646
GSE3416            ATH1-121501   22810    4635        15159
GSE431             ATH1-121501   22810    3593        15653

Real-life datasets to assess I/NI calls:
• accession number: GEO/ArrayExpress
• chip type
• number of probe sets (total)
• number of probe sets after A/P call
• number of probe sets after I/NI call

[Figure: A/P call vs. I/NI call]

Effect of gene filtering on tests for differential expression
t-test after filtering using both A/P and I/NI calls. The proportion of significant probe sets (α = 0.05) is given for the two filtering techniques before and after multiple testing correction with an FDR of 10% (Benjamini and Hochberg, 1995).
• A/P call: 740 DE genes
• I/NI call: 36 DE genes
• True DE: 35 of them

An appropriate filter (enrich with low p-values, control the type I error rate) must:
1) be dependent on the test statistic for alternative hypotheses, to enrich the remaining hypotheses with low p-values
2) not introduce dependencies between hypotheses
3) be independent of the subsequent test statistic for null hypotheses, in order to control the type I error rate
Item 1) ensures an increase of the study's power.
Items 2) and 3) ensure control of the type I error rate.

Talloen, Hochreiter, et al., PNAS, 2010
• t-test statistic: two-sample t-test, Gauss assumption, for every scale- and location-invariant test
• permutation invariant: non-parametric test, any test independent of the order of its arguments

Results for CNA detection on breast cancer. "FP" and "FDR" are the number of falsely detected segments and the false detection rate on the normal cell lines, respectively. The I/NI call filter reduces the FDR 18- to 22-fold.

Area under the precision-recall curves for detecting CNVs that were previously confirmed multiple times. The I/NI call filter clearly outperforms variance-based filters.

[Figure: panels A and B] Precision-recall curves (PRCs) for detecting CNVs that were previously confirmed multiple times.
Panels (A) and (B) give the PRC for the whole genome for 3 loci and for 5 loci, respectively. The I/NI call filter has a considerably lower FDR than variance-based filtering.

A perfect filtering method would call all true CNVs (red circles at 1) and would not call others (dark-blue background at 0). The I/NI call filter separates called true positives (true CNVs) from true negatives better than the other filters.

Rare events: few arrays have signal, so few factors differ: the factor is Laplace distributed.

Laplace FARMS
• likelihood cannot be analytically computed
• variational approach (from physics, "calculus of variations")
[Equations: Laplace, Gauss, and variational densities; the variational parameter is the variance of the local Gaussian approximation of the factor]
[Equations: E-step and M-step]

[Figure: Laplace distribution vs. Gaussian]

Biclustering

Beerse, Belgium
University Hasselt, Statistics
Univ. Leuven, Statistics
Cleveland, Ohio, United States
Group: Experimental Haematology and Hematopoiesis, Taussig Cancer Center, Prof. Dr. J. Maciejewski

Biclustering: cluster rows and columns of a matrix simultaneously
Bicluster: a set of row indices and a set of column indices
Bicluster criterion: row vectors are similar to each other on the subset of column indices, and columns are similar to each other on the subset of row indices
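The bicluster criterion can be made concrete with a small sketch. It rests on assumptions that the slides do not specify: similarity is measured here as the mean pairwise Pearson correlation, and the data matrix, the planted bicluster, and all function names are hypothetical. It only evaluates the criterion for a given pair of index sets; it is not a biclustering algorithm and not FABIA.

```python
import random
from itertools import combinations

def pearson(a, b):
    """Pearson correlation of two equally long sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def mean_pairwise(vectors):
    """Mean Pearson correlation over all pairs of vectors."""
    pairs = list(combinations(vectors, 2))
    return sum(pearson(a, b) for a, b in pairs) / len(pairs)

def row_similarity(matrix, rows, cols):
    """Similarity of the given rows, restricted to the given columns."""
    return mean_pairwise([[matrix[r][c] for c in cols] for r in rows])

def col_similarity(matrix, rows, cols):
    """Similarity of the given columns, restricted to the given rows."""
    return mean_pairwise([[matrix[r][c] for r in rows] for c in cols])

# Hypothetical 20 x 12 data matrix: Gaussian noise everywhere ...
rng = random.Random(0)
n_rows, n_cols = 20, 12
matrix = [[rng.gauss(0.0, 0.1) for _ in range(n_cols)] for _ in range(n_rows)]

# ... plus a planted bicluster: a row effect times a column pattern.
bic_rows = [2, 5, 7, 11]
bic_cols = [1, 3, 4, 8, 9]
row_effect = [1.0, 2.0, 3.0, 4.0]
col_pattern = [1.0, 3.0, 2.0, 4.0, 1.5]
for r, a in zip(bic_rows, row_effect):
    for c, v in zip(bic_cols, col_pattern):
        matrix[r][c] += a * v

other_rows = [r for r in range(n_rows) if r not in bic_rows]

inside_rows = row_similarity(matrix, bic_rows, bic_cols)
inside_cols = col_similarity(matrix, bic_rows, bic_cols)
outside_rows = row_similarity(matrix, other_rows, bic_cols)

print(inside_rows, inside_cols, outside_rows)
```

On the planted index sets both similarities are high, while rows outside the bicluster show no comparable similarity on the same columns. A different similarity measure could be substituted in `mean_pairwise` without changing the rest of the sketch.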