A Factor Graph Model for Minimal Gene Set Enrichment Analysis
Diana Uskat
Computational Biology - Gene Center Munich
24.03.2010

Motivation
Problem outline:
• Single gene analysis of microarray experiments entails a large multiple testing problem
• Even after appropriate multiple testing correction, the result is usually a long list of differentially expressed genes
• Interpreting such a list by hand is difficult
Possible improvement: Gene set enrichment analysis
1. Group genes into biologically meaningful categories (Gene Ontology, KEGG Pathways, transcription factor targets)
2. Use a statistical method to find the categories that are enriched for differentially expressed genes
[Figure: Cutout of Gene Ontology. Graph from Ontologizer by S. Bauer, J. Gagneur, P. N. Robinson (NAR 2010)]

Motivation
Established methods:
• GSEA (Subramanian, Tamayo)
• TopGO (Alexa)
• Globaltest (Goeman, Mansmann)
• GOStats (Falcon, Gentleman)
Drawbacks:
• There are often thousands of overlapping categories, and genes can belong to multiple categories, which creates a difficult new multiple testing problem
• Group testing often returns a large number of significant categories, which makes it difficult to identify the biologically relevant ones

Minimal Gene Set Enrichment
Idea (Bauer, Gagneur et al., Nucleic Acids Research 2010)
• Search for a sparse explanation, i.e. a minimal number of categories that explain the data (sufficiently well)
• Use a simplistic probabilistic graphical model relating categories and genes, and do Bayesian inference on the marginal posterior for each category
[Figure: categories T1, T2, T3 linked to genes E1, E2, E3; an edge means, for example, "gene E3 is element of category T3"; coloured means "on". Two panels: "Correct explanation" and "Correct minimal explanation".]

Minimal Gene Set Enrichment
The model:
• T1, T2, T3: categories
• E1, E2, E3: genes
• D1, D2, D3: observations (data)
A Bayesian network factorization of the full posterior:
$\text{Posterior} \propto \text{Likelihood} \times \text{Prior}$
Main trick: Use a prior favoring sparse solutions

Factor Graphs
Our method: Factor Graphs
• Graphical model (Kschischang IEEE, 2001)
• Bipartite graph with factor nodes and variable nodes
• Each factor node encodes a function of its neighbouring variables
• Efficient computation of marginal distributions with the sum-product algorithm (if the factor graph is a tree...)
Factorization of the posterior:
$$\Pr(T, E \mid D) \propto \prod_{j \in J} f_j(E_j) \cdot \prod_{j \in J} g_j\bigl(E_j, T_{\mathrm{next}(g_j)}\bigr) \cdot f_T(T)$$
• $f_j$: $\Pr(D \mid E)$, given by the dataset
• $g_j$: E is only active if at least one parent is active
• $f_T$: sparse prior, $f_T(T) = \prod_{j=1}^{N} p^{T_j} (1-p)^{1-T_j}$ with $0 < p < 0.5$
[Figure: factor graph with prior factor fT, category nodes T1 to T3, factors g1 to g6, gene nodes E1 to E3, factors f1 to f3, and data nodes D1 to D3]

Estimation Methods for Factor Graphs
Computation of the posterior for T, E:
• Message-passing algorithm: sum-product algorithm
• Stops at the correct result after one round if the graph has a tree structure
• No guarantees if the graph has cycles (e.g., oscillation may occur); however, it works well in practice
Principle:
• Start in the leaf nodes
• Message propagation:
  – variable to factor node: product of the messages arriving from the variable's other factors ("Product")
  – factor to variable node: factor times incoming messages, summed over the factor's other variables ("Sum")
• Termination: Compute the marginal distribution of the variable nodes

Application: Yeast Salt Stress
• Categories: Transcription factors (with their targets) instead of GO categories
• Given:
  – List of transcription factors with their corresponding genes
  – List of genes (with their p-values) from a yeast salt stress experiment
• Question: Which transcription factors are active during salt stress?
• Task: Find a set of transcription factors that are most likely to be active
[Figure: transcription factors TF1, TF2 linked to target genes g1 to g5; an edge means, for example, "g2 is target of TF2"]

Results
~2,000 genes, 118 transcription factors
[Figure: transcription factor network. Graph obtained from re-analysis of Harbison TF binding data (Nat, 2004) by MacIsaac et al. (BMC Bioinformatics, 2006)]
Transcription factors labelled in the figure: YML081W, DAL81, STB4, HSF1, UME6, SNT2, RGT1, MET28, MSN2, GAL4, SKO1
Figure legend:
• Previously known transcription factors involved in salt stress (Capaldi et al., Nat. Gen 2008; Wu and Chen, Bioinform Biol Insights 2009)
• Differentially phosphorylated transcription factors (Soufi et al., Mol. Biosyst 2009)

Summary and Outlook
• Todo: scalability and speed
• Lists of (meaningful) gene sets are better than lists of genes
• Search for biologically meaningful explanations requires a new minimal model (MGSE) for gene set enrichment analysis
• We use factor graphs for parameter estimation
• Wide application to GO analysis, TF-target analysis, pathway enrichment

Acknowledgments
Gene Center Munich: Achim Tresch, Theresa Niederberger, Björn Schwalb, Sebastian Dümcke
Collaborating Partners:
• Gene Center Munich: Patrick Cramer, Christian Miller, Daniel Schulz, Dietmar Martin, Andreas Mayer
• EMBL Heidelberg: Julien Gagneur
(Talk, Nov. 2009, working group conference of the GMDS "AG Statistische Methoden in der Bioinformatik", Munich)
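The factorization from the Factor Graphs slide can be checked numerically on a toy network. The sketch below is only an illustration under stated assumptions: the category memberships, the likelihood values Pr(D_j | E_j) and the choice p = 0.1 are all made up, the factor g_j is read literally as "a gene may be active only if at least one of its categories is active", and the marginals are computed by brute-force enumeration over all states rather than by sum-product message passing.

```python
from itertools import product

# Hypothetical toy network: which genes belong to which category.
MEMBERS = {"T1": {"E1", "E2"}, "T2": {"E2", "E3"}, "T3": {"E1", "E2", "E3"}}
GENES = ["E1", "E2", "E3"]
# Hypothetical likelihoods Pr(D_j | E_j): all three genes look "on" in the data.
LIK = {g: {1: 0.9, 0: 0.1} for g in GENES}
P = 0.1  # assumed prior activity probability, chosen so that 0 < p < 0.5


def prior(t):
    """Sparse prior f_T(T) = prod_j p^T_j * (1 - p)^(1 - T_j)."""
    w = 1.0
    for on in t.values():
        w *= P if on else 1.0 - P
    return w


def gate(gene, e, t):
    """Factor g_j: 0 if the gene is active without any active category, else 1."""
    has_active_parent = any(t[c] for c, genes in MEMBERS.items() if gene in genes)
    return 0.0 if (e == 1 and not has_active_parent) else 1.0


def category_posteriors():
    """Marginal Pr(T_i = 1 | D) by summing the factor product over all (T, E)."""
    cats = list(MEMBERS)
    total = 0.0
    on_mass = {c: 0.0 for c in cats}
    for t_bits in product((0, 1), repeat=len(cats)):
        t = dict(zip(cats, t_bits))
        for e_bits in product((0, 1), repeat=len(GENES)):
            w = prior(t)
            for gene, e in zip(GENES, e_bits):
                w *= LIK[gene][e] * gate(gene, e, t)
            total += w
            for c in cats:
                if t[c]:
                    on_mass[c] += w
    return {c: on_mass[c] / total for c in cats}


posteriors = category_posteriors()
for name, value in posteriors.items():
    print(f"Pr({name} = 1 | D) = {value:.3f}")
```

With these made-up numbers, T3 (which alone covers all three active genes) receives a marginal of roughly 0.79, while T1 and T2 each stay near 0.22, so the sparse prior favours the minimal explanation. Enumeration grows exponentially with the number of categories and genes, which is why message passing is needed at the scale of the yeast application.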