Computational Methodology for Microbial and Metagenomic Characterization using Large Scale Functional Genomic Data Integration
Curtis Huttenhower
Harvard School of Public Health, Department of Biostatistics
03-08-10

Outline
1. Network models of functional data
2. Network models of microbes
3. Network models of microbiomes

Meta-analysis for unsupervised functional data integration
(Huttenhower 2006; Hibbs 2007)
$z = \frac{1}{2}\log\frac{1+\rho}{1-\rho}$
Following up with round-robin and semi-supervised evaluations

Functional network prediction from diverse microbial data
• 486 bacterial expression experiments
• 876 raw datasets
• 310 postprocessed datasets
• 304 normalized coexpression networks in 27 species
• 307 bacterial interaction experiments
• 154796 raw interactions
• 114786 postprocessed interactions
• Integrated functional interaction networks in 15 species

Functional maps for cross-species knowledge transfer
(Huttenhower 2008; Huttenhower 2009)
Following up with unsupervised and partially anchored network alignment

Functional maps for functional metagenomics
[Diagram: GOS 4441599.3 (Hypersaline Lagoon, Ecuador) + integrated functional interaction networks in 27 species]
• Mapping genes into pathways
• Mapping pathways into organisms
• Mapping organisms into phyla

Functional maps for functional metagenomics
Summarizes information from ~10M metagenomic reads and ~500 genome-scale microbial experiments.
Edges: process association in obesity (Less Coregulated, Baseline (no change), More Coregulated)
Nodes: process cohesiveness in obesity (Very Downregulated, Baseline (no change), Very Upregulated)

Efficient Computation For Biological Discovery
Massive datasets and genomes require efficient algorithms and implementations.
• Sleipnir C++ library for computational functional genomics
• Data types for biological entities
  – Microarray data, interaction data, genes and gene sets, functional catalogs, etc.
• Network communication, parallelization
• Efficient machine learning algorithms
  – Generative (Bayesian) and discriminative (SVM)
• And it's fully documented!
It's also speedy: microbial data integration computation takes <3hrs.

Thanks!
Olga Troyanskaya, Matt Hibbs, Chad Myers, David Hess, Chris Park, Ana Pop, Aaron Wong, Jacques Izard, Hilary Coller, Erin Haley, Sarah Fortune, Tracy Rosebrock, Wendy Garrett
http://huttenhower.sph.harvard.edu/sleipnir
http://function.princeton.edu/hefalmp

Functional mapping: Functional associations between processes
Edges: associations between processes (Moderately Strong to Very Strong)
Nodes: cohesiveness of processes (Below Baseline, Baseline (genomic background), Very Cohesive)
Borders: data coverage of processes (Sparsely Covered to Well Covered)
Information mapped from ~100 E. coli experiments

Meta-analysis for unsupervised functional data integration
(Huttenhower 2006; Hibbs 2007; Evangelou 2007)
$z = \frac{1}{2}\log\frac{1+\rho}{1-\rho}$
Following up with round-robin and semi-supervised evaluations

Functional mapping: mining integrated networks
Predicted relationships between genes (Low Confidence to High Confidence)
The strength of these relationships indicates how cohesive a process is.
Process shown: Chemotaxis

Functional mapping: mining integrated networks
Predicted relationships between genes (Low Confidence to High Confidence)
Process shown: Chemotaxis

Functional mapping: mining integrated networks
Predicted relationships between genes (Low Confidence to High Confidence)
The strength of these relationships indicates how associated two processes are.
Processes shown: Chemotaxis, Flagellar assembly

Functional maps for cross-species knowledge transfer
[Figure: genes from different species (ECG1, ECG2, BSG1; ECG3, BSG2; …) collected into groups (O1: G1, G2, G3; O2: G4; O3: G6; …), drawn as a network over genes G1 to G17 and groups O1 to O9]

Functional network prediction from diverse microbial data
• 486 bacterial expression experiments
• 876 raw datasets
• 310 postprocessed datasets
• 304 normalized coexpression networks in 27 species
• 307 bacterial interaction experiments
• 154796 raw interactions
• 114786 postprocessed interactions
• Integrated functional interaction networks in 15 species
[Plot: E. coli integration; annotation: ← Precision ↑, Recall ↓]

Functional maps for functional metagenomics
[Diagram: GOS 4441599.3 (Hypersaline Lagoon, Ecuador) + integrated functional interaction networks in 27 species; panel labels: KEGG Pathways, Organisms, Env., Pathogens]
• Mapping genes into pathways
• Mapping pathways into organisms
• Mapping organisms into phyla

Functional maps for cross-species knowledge transfer
Following up with unsupervised and partially anchored network alignment
[Plot annotation: ← Precision ↑, Recall ↓]

Functional network prediction from diverse microbial data
• 486 bacterial expression experiments
• 876 raw datasets
• 310 postprocessed datasets
• 304 normalized coexpression networks in 27 species
• 307 bacterial interaction experiments
• 154796 raw interactions
• 114786 postprocessed interactions
• Integrated functional interaction networks in 15 species
[Plot: E. coli integration]

Functional Maps: Focused Data Summarization
[Graphic: sequence reads ACGGTGAACGTACA, GTACAGATTACTAG, GACATTAGGCCGTA, TCCGATACCCGATA]
Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?

Functional Maps: Focused Data Summarization
[Graphic: sequence reads ACGGTGAACGTACA, GTACAGATTACTAG, GACATTAGGCCGTA, TCCGATACCCGATA]
How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?
Functional mapping
• Very large collections of genomic data
• Specific predicted molecular interactions
• Pathway, process, or disease associations
• Underlying experimental results and functional activities in data

Functional Mapping: Scoring Functional Associations
How can we formalize these relationships?
Any sets of genes G1 and G2 in a network can be compared using four measures:
• Edges between their genes
• Edges within each set
• The background edges incident to each set
• The baseline of all edges in the network
$FA_{G_1,G_2} = \frac{\mathrm{between}(G_1,G_2)}{\mathrm{background}(G_1,G_2)} \cdot \frac{\mathrm{baseline}}{\mathrm{within}(G_1,G_2)}$
Stronger connections between the sets increase association.
Stronger within self-connections or nonspecific background connections decrease association.

Functional Mapping: Bootstrap p-values
• Scoring functional associations is great… how do you interpret an association score?
  – For gene sets of arbitrary sizes?
  – In arbitrary graphs?
  – Each with its own bizarre distribution of edges?
For any graph, compute FA scores for many randomly chosen gene sets of different sizes.
Null distribution is approximately normal with mean 1: $\hat{\mu}_{FA}(G_i,G_j) = 1$
Standard deviation is asymptotic in the sizes of both gene sets: $\hat{\sigma}_{FA}(G_i,G_j)$, expressed in terms of $A(|G_i|)$, $|G_j|$, $B$, $|G_i|$, and $C(|G_j|)$ [formula not legible]
$P(FA_{G_1,G_2} \ge x) = 1 - \Phi_{\hat{\mu}(G_1,G_2),\,\hat{\sigma}(G_1,G_2)}(x)$
Maps FA scores to p-values for any gene sets and underlying graph.
[Figures: histograms of FA scores for random gene sets (# genes: 1, 5, 10, 50); null distribution σs for one graph, plotted against |G1| and |G2|]

Microbial Communities and Functional Metagenomics
With Jacques Izard, Wendy Garrett
• Metagenomics: data analysis from environmental samples
  – Microflora: environment includes us!
• Pathogen collections of “single” organisms form similar communities
• Another data integration problem
  – Must include datasets from multiple organisms
• What questions can we answer?
  – What pathways/processes are present, over-enriched, or under-enriched in a newly sequenced microbe/community?
  – What’s shared within community X? What’s different? What’s unique?
  – How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, …
  – Current functional methods annotate ~50% of synthetic data, <5% of environmental data

Data Integration for Microbial Communities
~350 available expression datasets, ~25 species
• Data integration works just as well in microbes as it does in yeast and humans
• We know an awful lot about some microorganisms and almost nothing about others
• Sequence-based and network-based tools for function transfer both work in isolation
• We can use data integration to leverage both and mine out additional biology
(Weskamp et al 2004; Flannick et al 2006; Kanehisa et al 2008; Tatusov et al 1997)

Functional Maps for Functional Metagenomics

Validating Orthology-Based Functional Mapping
• Does unweighted data integration predict functional relationships?
• What is the effect of “projecting” through an orthologous space?
[Plots: GO and KEGG panels for each question; x-axis: Recall; y-axis: log(Precision/Random); series: Individual datasets, Unsupervised integration]

Validating Orthology-Based Functional Mapping
[Figure: yeast genes YG1 to YG17 split into a holdout set (uncharacterized “genome”) and random subsets (characterized “genomes”)]

Validating Orthology-Based Functional Mapping

Validating Orthology-Based Functional Mapping
• Can subsets of the yeast genome predict a heldout subset’s functional maps?
• Can subsets of the yeast genome predict a heldout subset’s interactome?
[Figure panels labeled GO]
What have we learned?
• Yeast is incredibly well-curated
• KEGG tends to be more specific than GO
• Predicting interactomes by projecting through functional maps works decently in the absolute best case
[Figure panels labeled KEGG; values shown across the panels: 0.68, 0.48, 0.30, 0.37, 0.40, 0.39, 0.25, 0.27, 0.43, 0.39]
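
The functional association score and its bootstrap p-value from the "Scoring Functional Associations" and "Bootstrap p-values" slides can be illustrated with a small Python sketch. It is not the Sleipnir implementation, and several choices here are assumptions: the four measures are taken to be mean edge weights, the network is a complete weighted graph stored in a dictionary, singleton sets fall back to the baseline for their "within" term, and the null mean and standard deviation are estimated empirically for one pair of set sizes rather than from the fitted form on the slide. The names `fa_score`, `null_parameters`, and `fa_p_value` are hypothetical.

```python
import math
import random
from itertools import combinations, product
from statistics import mean, stdev


def fa_score(weights, genes, g1, g2):
    """FA = (between / background) * (baseline / within) for disjoint gene sets.

    Assumption: each measure is the mean weight of the relevant edges.
    """
    g1, g2 = set(g1), set(g2)
    union = g1 | g2

    def avg(pairs, default):
        vals = [weights[frozenset(p)] for p in pairs]
        return mean(vals) if vals else default

    all_pairs = list(combinations(genes, 2))
    baseline = avg(all_pairs, 0.0)            # all edges in the network
    between = avg(product(g1, g2), baseline)  # edges joining the two sets
    # Singleton sets have no internal edges; fall back to baseline (assumption).
    within = avg(list(combinations(g1, 2)) + list(combinations(g2, 2)), baseline)
    # Edges with at least one endpoint in either set.
    background = avg([p for p in all_pairs if p[0] in union or p[1] in union],
                     baseline)
    return (between / background) * (baseline / within)


def null_parameters(weights, genes, size1, size2, trials=200, seed=0):
    """Empirical mean and sd of FA over random disjoint sets of the given sizes."""
    rng = random.Random(seed)
    scores = []
    for _ in range(trials):
        sample = rng.sample(genes, size1 + size2)
        scores.append(fa_score(weights, genes, sample[:size1], sample[size1:]))
    return mean(scores), stdev(scores)


def fa_p_value(score, mu, sigma):
    """Upper-tail normal p-value: P(FA >= x) = 1 - Phi((x - mu) / sigma)."""
    return 0.5 * math.erfc((score - mu) / (sigma * math.sqrt(2)))


# Toy network (fabricated for illustration only): 30 genes with weak random
# edges, plus one planted strong association between two small gene sets.
rng = random.Random(0)
genes = list(range(30))
weights = {frozenset(p): rng.uniform(0.1, 0.3) for p in combinations(genes, 2)}
set_a, set_b = [0, 1, 2], [3, 4, 5]
for a, b in product(set_a, set_b):
    weights[frozenset((a, b))] = 0.9

score = fa_score(weights, genes, set_a, set_b)
mu, sigma = null_parameters(weights, genes, len(set_a), len(set_b))
p_value = fa_p_value(score, mu, sigma)
print(f"FA = {score:.2f}, null mean = {mu:.2f}, null sd = {sigma:.2f}, "
      f"p = {p_value:.2g}")
```

On a graph where every edge has the same weight, all four measures coincide and the score is exactly 1, which matches the slide's statement that the null distribution is centered at 1. Estimating the null separately for each pair of set sizes is the simplest option for a sketch; the slide instead describes the standard deviation as a function of both set sizes, which avoids re-sampling for every query.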