Large scale functional data mining: What can we find in the data we have?
Curtis Huttenhower
Harvard School of Public Health, Department of Biostatistics
08-12-10

Slide 2: Greatest biological discoveries?
Our job is to create computational microscopes: to ask and answer specific biological questions using millions of experimental results.

Slide 3: A computational definition of functional genomics
[Diagram labels: Genomic data; Prior knowledge; Gene → Function; Gene → Gene; Data → Function; Function → Function]

Slide 4: A framework for functional genomics
[Figure: labeled gene pairs (+/-) with per-dataset scores; frequency histograms contrasting low vs. high correlation, low vs. high similarity (dissimilar vs. similar), and "Let." vs. "Not let."; scale: 1Ks of datasets by 100Ms of gene pairs. Example query pair G2-G5: P(G2-G5|Data) = 0.85]

Slide 5: Functional network prediction and analysis
HEFalMp: currently includes data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases.
[Figure panels: global interaction network; carbon metabolism network; extracellular signaling network; gut community network]

Slide 6: Functional network prediction from diverse microbial data
• 486 bacterial expression experiments
• 876 raw datasets
• 310 postprocessed datasets
• 304 normalized coexpression networks in 27 species
• 307 bacterial interaction experiments
• 154796 raw interactions
• 114786 postprocessed interactions
• Integrated functional interaction networks in 15 species
[Plot: E. coli integration, precision vs. recall]

Slide 7: Cross-species knowledge transfer using functional data
TaFTan (Pinaki Sarder)
\[ P(FR_s \mid D_s) \propto P(D_s \mid FR_s)\, P(FR_s) \]
\[ P(FR_s \mid \{FR_t\}_{t \ne s}, D_s) \propto P(\{FR_t\}_{t \ne s}, D_s \mid FR_s)\, P(FR_s) = P(FR_s)\, P(D_s \mid FR_s) \prod_{t \ne s} P(FR_t \mid FR_s) \]
[Diagram labels: P(FR_s, FR_t); P(FR_s | D); D_s]

Slide 8: TaFTan: Cross-species knowledge transfer using functional data
[Plots: log(precision/random) vs. log(recall) for E. coli, P. aeruginosa, B. subtilis, and M. tuberculosis; curves: species-specific data, species' data excluded, all species' data]
• Important to take advantage of all available data for any one organism
• Important to take advantage of all available data for every organism
• Scalable to dozens of organisms with hundreds of functional datasets
• Currently working on making this more context-specific

Slide 9: Meta-analysis for unsupervised functional data integration
Huttenhower 2006; Hibbs 2007; Evangelou 2007
Correlations are Fisher z-transformed:
\[ z = \frac{1}{2} \log \frac{1 + \rho}{1 - \rho} \]
• Simple regression: all datasets are equally accurate
• Random effects: variation within and among datasets and interactions
\[ \hat{\mu}_e = \frac{\sum_i w^*_{e,i}\, y_{e,i}}{\sum_i w^*_{e,i}}, \qquad w^*_{e,i} = \frac{1}{s^2_{e,i} + \hat{\tau}^2_e} \]
Here y_{e,i} is the score for interaction e in dataset i, s^2_{e,i} is the within-dataset variance, and \hat{\tau}^2_e is the estimated among-dataset variance (Greek symbols follow standard random-effects notation; the originals were lost in extraction).

Slide 10: Meta-analysis for unsupervised functional data integration
Huttenhower 2006; Hibbs 2007; Evangelou 2007
[Figure: z-transformed per-dataset results combined (+) into a single integrated result (=)]

Slide 11: So what does all of this have to do with microbial communities?
[Diagram, ~2000: AML/ALL; temperature; DNA damage; batch effects; gene expression; functional modules]

Slide 12:
[Diagram, 2010: intervention/perturbation; healthy/IBD; temperature; location; ???; biological story?; cross-validate; taxa & orthologs; niches & phylogeny; independent sample; confounds/stratification/environment; test for correlates; multiple hypothesis correction; feature selection; p >> n]

Slide 13: What features to test?
[Diagram labels: Microbiome data; Genomic data (reference genomes); Functional data (experimental models); 16S reads; WGS reads; Taxa; Orthologous clusters; Functional roles; Pathways/modules; Pathway activity; Binning; Clustering]

Slide 14: MetaHIT: Data features
• Cohorts: 85 healthy, 15 IBD + 12 healthy, 12 IBD
• Features from WGS reads: taxa (Phymm, Brady 2009), KO clusters, pathways/modules (KEGG pathways)
• ReBLASTed against KEGG since published data obfuscates read counts
• 10x bootstrap within training cohort, test on 12+12 as validation

Slide 15: MetaHIT: Taxonomic CD biomarkers
[Cladogram (iTOL, Letunic 2007); labeled taxa: Bacteroidetes, Methanomicrobia, Enterobacteriaceae, Firmicutes, Chromatiales, Desulfobacterales, Bradyrhizobiaceae, Rhodobacteraceae, Oxalobacteraceae]

Slide 16: MetaHIT: Taxonomic CD biomarkers
[Figure: taxa down in CD vs. up in CD]

Slide 17: MetaHIT: Functional CD biomarkers
[Figure: functions down in CD vs. up in CD; categories: growth/replication, motility, transporters, sugar metabolism]

Slide 18: MetaHIT: KO IBD biomarkers
[Figure: KOs down in IBD vs. up in IBD; categories: growth/replication, motility, transporters, sugar metabolism. LEfSe: Nicola Segata]

Slide 19: Metagenomic differential analysis: LEfSe
1. Is there a statistically significant difference? t-tests, ANOVA, MANOVA, Friedman, Kruskal–Wallis…
2. Is the difference biologically significant? Expert supervision, specific post-hoc tests…
3. How large is the difference? PCA, LDA, mean difference, class or cluster distance…
LEfSe: p(ANOVA) < 0.05; pairwise post-hoc Wilcoxon OK; Log(Score(LDA)) = 3.68

Slide 20: LEfSe: A non-human example
Viromes vs. bacterial metagenomes (Dinsdale 2008)
Metastats (White 2009): p < 0.001; ANOVA: p < 0.05; LEfSe: NO DIFF! / DIFF!
[Figure panels, microbial vs. viral samples, by high-level functional category: Nucleosides and Nucleotides; Carbohydrates; Transporters]

Slide 21: Sleipnir: Software for scalable functional genomics
Massive datasets require efficient algorithms and implementations.
• Sleipnir C++ library for computational functional genomics
• Data types for biological entities
  • Microarray data, interaction data, genes and gene sets, functional catalogs, etc.
• Network communication, parallelization
• Efficient machine learning algorithms
  • Generative (Bayesian) and discriminative (SVM)
It’s also speedy: microbial data integration computation takes <3hrs.
And it’s fully documented!

Slide 22: Recap
• Network framework for scalable data integration
• Cross-species knowledge transfer from functional data (TaFTan)
• Unsupervised system for data mining without curated prior knowledge (meta-analytic integration)
• Comparative microbiome analysis by taxa, orthologs, and pathways (LEfSe)
• Sleipnir software for scalable functional genomics

Slide 23: Thanks!
Pinaki Sarder, Nicola Segata, Jacques Izard, Sarah Fortune, Levi Waldron, Larisa Miropolsky, Willythssa Pierre-Louis, Wendy Garrett
http://huttenhower.sph.harvard.edu/sleipnir

Slides 25-26: Predicting Gene Function
[Figure: predicted relationships between genes, from low to high confidence; highlighted: cell cycle genes]

Slide 27: Predicting Gene Function
[Figure: predicted relationships between genes, from low to high confidence]
These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
[Figure label: cell cycle genes]

Slide 28: Comprehensive Validation of Computational Predictions
With David Hess, Amy Caudy
[Workflow diagram labels: Genomic data; Prior knowledge; Computational predictions of gene function (SPELL, bioPIXIE, MEFIT; Hibbs et al 2007, Myers et al 2005); Genes predicted to function in mitochondrion organization and biogenesis; Laboratory experiments (growth curves, petite frequency, confocal microscopy); New known functions for correctly predicted genes; Retraining]

Slide 29: Evaluating the Performance of Computational Predictions
Genes involved in mitochondrion organization and biogenesis:
• 106 original GO annotations
• 135 under-annotations
• 82 novel confirmations, first iteration
• 17 novel confirmations, second iteration
340 total: >3x previously known genes in ~5 person-months

Slide 30: Evaluating the Performance of Computational Predictions
Genes involved in mitochondrion organization and biogenesis
[Chart categories: original GO annotations; under-annotations; confirmed under-annotations; novel confirmations, first iteration; novel confirmations, second iteration. Values shown: 106, 95, 40, 80, 17. 340 total: >3x previously known genes in ~5 person-months]
Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.

Slide 31: Validating Human Predictions
With Erin Haley, Hilary Coller
Autophagy: 5½ of 7 predictions currently confirmed
[Figure: Luciferase (negative control); ATG5 (positive control); predicted novel autophagy proteins LAMP2 and RAB11A; conditions: not starved vs. starved (autophagic)]

Slide 32: Functional mapping: mining integrated networks
[Figure: predicted relationships between genes, from low to high confidence]
The strength of these relationships indicates how cohesive a process is.
[Figure label: chemotaxis]

Slide 33: Functional mapping: mining integrated networks
[Figure: predicted relationships between genes, from low to high confidence; highlighted: chemotaxis]

Slide 34: Functional mapping: mining integrated networks
[Figure: predicted relationships between genes, from low to high confidence; highlighted: chemotaxis, flagellar assembly]
The strength of these relationships indicates how associated two processes are.

Slide 35: Functional Mapping: Scoring Functional Associations
How can we formalize these relationships?
Any sets of genes G1 and G2 in a network can be compared using four measures:
• Edges between their genes
• Edges within each set
• The background edges incident to each set
• The baseline of all edges in the network
\[ FA_{G_1,G_2} = \frac{\mathrm{between}(G_1, G_2)}{\mathrm{background}(G_1, G_2)} \cdot \frac{\mathrm{baseline}}{\mathrm{within}(G_1, G_2)} \]
Stronger connections between the sets increase association. Stronger within self-connections or nonspecific background connections decrease association.

Slide 36: Functional Mapping: Bootstrap p-values
Scoring functional associations is great… how do you interpret an association score?
  • For gene sets of arbitrary sizes?
  • In arbitrary graphs?
  • Each with its own bizarre distribution of edges?
For any graph, compute FA scores for many randomly chosen gene sets of different sizes.
Null distribution is approximately normal with mean 1:
\[ \hat{\mu}_{FA}(G_i, G_j) = 1 \]
Standard deviation is asymptotic in the sizes of both gene sets.
[Equation for \hat{\sigma}_{FA}(G_i, G_j): fitted in terms of A(|G_i|), |G_j|, B and |G_i|, C(|G_j|); exact form garbled in the transcript]
\[ P(FA_{G_1,G_2} \ge x) = 1 - \Phi_{\hat{\mu}(G_1,G_2),\, \hat{\sigma}(G_1,G_2)}(x) \]
Maps FA scores to p-values for any gene sets and underlying graph.
[Figures: histograms of FAs for random sets (# genes: 1, 5, 10, 50); null distribution σs for one graph, plotted against |G1| and |G2|]
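As a rough illustration of the functional association score and its conversion to a p-value, the sketch below scores two gene sets on a toy weighted network and then applies a normal null with mean 1. It is not the Sleipnir or HEFalMp implementation. Treating each of the four measures as a mean edge weight, the function names fa_score and normal_p_value, the toy network, and the sigma value are all assumptions made for illustration.

```python
from itertools import combinations, product
from math import erfc, sqrt


def mean(values):
    values = list(values)
    return sum(values) / len(values)


def fa_score(weights, g1, g2):
    """FA-style score for two disjoint gene sets (each with >= 2 genes).

    weights maps frozenset({a, b}) -> edge weight for every gene pair.
    Assumption: each of the four measures is a mean edge weight.
    """
    g1, g2 = set(g1), set(g2)
    union = g1 | g2
    between = mean(weights[frozenset(p)] for p in product(g1, g2))
    within = mean(weights[frozenset(p)]
                  for g in (g1, g2) for p in combinations(sorted(g), 2))
    # Background: every edge with at least one endpoint in either set.
    background = mean(w for pair, w in weights.items() if pair & union)
    baseline = mean(weights.values())
    return (between / background) * (baseline / within)


def normal_p_value(x, sigma, mu=1.0):
    """P(FA >= x) under a normal null; sigma must come from bootstrapping."""
    return 0.5 * erfc((x - mu) / (sigma * sqrt(2.0)))


# Hypothetical toy network: weak edges everywhere, strong edges between sets.
genes = ["A", "B", "C", "D", "E", "F"]
g1, g2 = {"A", "B"}, {"C", "D"}
toy = {frozenset(p): 0.1 for p in combinations(genes, 2)}
for p in product(g1, g2):
    toy[frozenset(p)] = 0.9

score = fa_score(toy, g1, g2)
p_value = normal_p_value(score, sigma=0.25)  # sigma is a made-up value
```

In practice sigma would be estimated per graph from FA scores of many random gene sets of matching sizes, as the bootstrap slide describes; a network with uniform edge weights yields a score of exactly 1, the null mean.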
Slide 37: Functional Mapping: Functional Associations Between Processes
[Functional map; processes shown: Hydrogen Transport, Electron Transport, Cellular Respiration, Aldehyde Metabolism, Cell Redox Homeostasis, Peptide Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Protein Processing, Negative Regulation of Protein Metabolism, Protein Depolymerization, Organelle Fusion, Organelle Inheritance]
Legend:
• Edges: associations between processes (moderately strong to very strong)
• Nodes: cohesiveness of processes (below baseline; baseline (genomic background); very cohesive)
• Borders: data coverage of processes (sparsely covered to well covered)

Slide 38: Functional Mapping: Functional Associations Between Processes
[Functional map with the same legend: edges show associations between processes, nodes show cohesiveness of processes, borders show data coverage of processes]

Slide 39: Functional maps for cross-species knowledge transfer
[Diagram: genes G1 to G17 assigned to groups O1 to O9 (O1: G1, G2, G3; O2: G4; O3: G6; …), with species-specific gene lists ECG1, ECG2; BSG1; ECG3, BSG2; …]

Slide 40: Functional maps for functional metagenomics
[Diagram labels: GOS 4441599.3; Hypersaline Lagoon, Ecuador; KEGG pathways; Organisms; Env.; Pathogens; Integrated functional interaction networks in 27 species; Mapping organisms into phyla; Mapping genes into pathways; Mapping pathways into organisms]

Slide 41: Functional Maps: Focused Data Summarization
ACGGTGAACGTACA GTACAGATTACTAG GACATTAGGCCGTA TCCGATACCCGATA
Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?

Slide 42: Functional Maps: Focused Data Summarization
ACGGTGAACGTACA GTACAGATTACTAG GACATTAGGCCGTA TCCGATACCCGATA
How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?
Functional mapping:
• Very large collections of genomic data
• Specific predicted molecular interactions
• Pathway, process, or disease associations
• Underlying experimental results and functional activities in data

Slide 43: Functional maps for cross-species knowledge transfer
Following up with unsupervised and partially anchored network alignment
[Plot: precision vs. recall]

Slide 44: LEfSe: A non-human example
Viromes vs. bacterial metagenomes
Metastats (White 2009): p < 0.001; ANOVA: p < 0.05; LEfSe: NO DIFF! / DIFF!
[Figure panels, microbial vs. viral samples, by high-level functional category: Nitrogen Metabolism; Carbohydrates; Nucleosides and Nucleotides; Membrane Transport]
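The three questions from the LEfSe differential-analysis slide (is there a significant difference, does it hold up in pairwise post-hoc comparisons, and how large is it) can be sketched as a small filter. This is a simplified stand-in and not the LEfSe implementation: permutation tests replace ANOVA and Wilcoxon, a plain mean difference replaces the LDA score, and the function names, the alpha threshold, and the toy abundances are all hypothetical.

```python
import random
from itertools import product
from statistics import mean


def permutation_p(a, b, n_perm=2000, seed=0):
    """Two-sided permutation p-value for a difference in means."""
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(mean(pooled[:len(a)]) - mean(pooled[len(a):])) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


def pooled(groups):
    """Flatten a dict of subclass -> abundances into one list."""
    return [x for values in groups.values() for x in values]


def consistent_difference(class_a, class_b, alpha=0.05):
    """True if the classes differ overall and in every subclass pair."""
    all_a, all_b = pooled(class_a), pooled(class_b)
    # Step 1: is there any class-level difference?
    if permutation_p(all_a, all_b) >= alpha:
        return False
    # Step 2: every subclass pair must differ, and in the same direction.
    direction = mean(all_a) > mean(all_b)
    for sub_a, sub_b in product(class_a.values(), class_b.values()):
        if (mean(sub_a) > mean(sub_b)) != direction:
            return False
        if permutation_p(sub_a, sub_b) >= alpha:
            return False
    return True


def effect_size(class_a, class_b):
    """Step 3: size of the difference (mean difference, not LDA)."""
    return abs(mean(pooled(class_a)) - mean(pooled(class_b)))


# Hypothetical abundances of one feature in two classes, two subclasses each.
viral = {"v1": [10, 11, 12, 13, 12], "v2": [11, 12, 10, 13, 14]}
microbial = {"m1": [1, 2, 3, 2, 1], "m2": [2, 3, 1, 2, 3]}
# A class whose subclasses disagree with each other.
mixed = {"x1": [10, 11, 12, 13, 12], "x2": [1, 2, 1, 2, 1]}
steady = {"y1": [5, 6, 5, 6, 5], "y2": [6, 5, 6, 5, 6]}

keep_first = consistent_difference(viral, microbial)
keep_second = consistent_difference(mixed, steady)
size_first = effect_size(viral, microbial)
```

The second toy comparison is rejected because its subclasses disagree with each other, which is the kind of case the subclass-level check in step 2 is meant to screen out.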