Genomic Data Integration
Curtis Huttenhower
Harvard School of Public Health, Department of Biostatistics
07-10-10

Slide 2: A Definition of Integrative Data Mining
[Diagram: genomic data and prior knowledge, with the mappings Gene → Function, Gene → Gene, Data → Function, and Function → Function.]

Slide 3: Machine Learning for Data Integration
[Figure: example gene pairs among G1 to G9, labeled + or -, with per-dataset scores such as 0.9, 0.7, 0.1, and 0.2, and a query pair G2-G5 with score 0.8. Per-dataset frequency histograms contrast Low Correlation with High Correlation, Not coloc. with Coloc., and Dissim./Low Similarity with Similar/High Similarity. Scale: 1Ks of datasets, 100Ms of gene pairs. Result shown: P(G2-G5|Data) = 0.85.]

Slide 4: Machine Learning for Data Integration
[Figure: Functional Relationship. Cited: Jansen 2003; Troyanskaya 2003; Golub 1999; Butte 2000; Whitfield 2002; Hansen 1998.]

Slide 5: Alternative Data Integration Frameworks
Cited: Lee 2004; Lanckriet 2004; Aerts 2006.

Slide 6: Functional Networks
[Figure panels: Global interaction network, Metabolism network, Conserved network, Kidney network.]
• string-db.org
• function.princeton.edu/hefalmp
• funcnet.eu
• homes.esat.kuleuven.be/~bioiuser/endeavour

Slide 7: Biological Networks: Clusters, Hubs, Bottlenecks, and Flow

Slide 8: Biological Networks: Network Motifs
Motifs shown: Bi-fan, Feedback, Positive auto-regulation, Negative auto-regulation, Coherent feed-forward (filter), Incoherent feed-forward (pulse).
Other property labels on the slide: memory, delay, WGD and evolvability, speed + stability.
• www.weizmann.ac.il/mcb/UriAlon/groupNetworkMotifSW.html
• mavisto.ipk-gatersleben.de
• theinf1.informatik.uni-jena.de/~wernicke/motifs
Cited: Milo 2002; Alon 2007.

Slides 9-11: Predicting Gene Function (Huttenhower 2009)
[Figure: predicted relationships between genes, shaded from Low Confidence to High Confidence; highlighted gene set: Cell cycle genes.]
These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
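As a rough illustration of the statement above, a gene's edges to a process can be turned into a score by comparing its mean predicted edge weight to known process members against its mean weight to every other gene. This is a minimal sketch under assumptions, not the scoring used in the cited work: the function name `process_specificity`, the ratio form, and the toy weights are all hypothetical.

```python
from statistics import mean

def process_specificity(gene, process_genes, all_genes, weights):
    """Ratio of a gene's mean edge weight to process members over its mean
    edge weight to all other genes. Values above 1 suggest the gene is more
    tightly connected to the process than to the network at large.
    Assumption: `weights` maps frozenset({a, b}) to a confidence in [0, 1];
    missing pairs count as 0."""
    def w(a, b):
        return weights.get(frozenset((a, b)), 0.0)

    to_process = [w(gene, g) for g in process_genes if g != gene]
    to_everyone = [w(gene, g) for g in all_genes if g != gene]
    background = mean(to_everyone)
    return mean(to_process) / background if background > 0 else 0.0

# Toy network with invented confidences: A and B are known process members.
genes = ["A", "B", "C", "D", "E"]
weights = {
    frozenset(pair): value for pair, value in [
        (("A", "B"), 0.9), (("A", "C"), 0.8), (("B", "C"), 0.7),
        (("A", "D"), 0.1), (("B", "D"), 0.2), (("C", "D"), 0.1),
        (("A", "E"), 0.1), (("B", "E"), 0.1), (("C", "E"), 0.1),
        (("D", "E"), 0.9),
    ]
}
known = {"A", "B"}
scores = {g: process_specificity(g, known, genes, weights) for g in ("C", "D", "E")}
```

In this toy network C scores above 1 while D and E score below 1, so C would be the candidate to prioritize for the process.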
[Figure label: Cell cycle genes. Cited: Rodrigues 2007; Pena-Castillo 2008.]

Slide 12: Comprehensive Validation of Computational Predictions (Hess, 2009; Hibbs, 2009)
[Diagram: Genomic data and Prior knowledge → Computational Predictions of Gene Function (SPELL, bioPIXIE, MEFIT; Hibbs et al 2007, Myers et al 2005) → Genes predicted to function in mitochondrion organization and biogenesis → Laboratory Experiments (growth curves, petite frequency, confocal microscopy) → New known functions for correctly predicted genes → Retraining.]

Slide 13: Evaluating the Performance of Computational Predictions (Huttenhower, 2009)
Genes involved in mitochondrion organization and biogenesis:
• 106 Original GO Annotations
• 135 Under-annotations
• 82 Novel Confirmations, First Iteration
• 17 Novel Confirmations, Second Iteration
340 total: >3x previously known genes in ~5 person-months

Slide 14: Evaluating the Performance of Computational Predictions (Huttenhower, 2009)
[Chart: genes involved in mitochondrion organization and biogenesis, with categories Original GO Annotations, Under-annotations, Confirmed Under-annotations, Novel Confirmations (First Iteration), and Novel Confirmations (Second Iteration); values shown: 106, 95, 40, 80, 17. 340 total: >3x previously known genes in ~5 person-months.]
Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.

Slides 15-17: Functional Mapping: Mining Integrated Networks
[Figure: predicted relationships between genes, shaded from Low Confidence to High Confidence; highlighted process: Chemotaxis.]
Within a process: the strength of these relationships indicates how cohesive a process is.
Between processes: the strength of these relationships indicates how associated two processes are.
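The two readings of edge strength above (cohesiveness within a process, association between two processes) can be sketched as mean edge weights normalized by a genome-wide baseline. This is an illustrative sketch only: the normalization by a simple mean, the function names, the placeholder gene identifiers, and the weights are assumptions, not the formulas from the cited work.

```python
from itertools import combinations, product
from statistics import mean

def edge(weights, a, b):
    """Confidence of the predicted relationship a-b (0 if absent)."""
    return weights.get(frozenset((a, b)), 0.0)

def baseline(weights, genes):
    """Mean edge weight over all gene pairs (the genomic background)."""
    return mean([edge(weights, a, b) for a, b in combinations(genes, 2)])

def cohesiveness(weights, process, base):
    """Mean within-process edge weight relative to the baseline."""
    pairs = combinations(sorted(process), 2)
    return mean([edge(weights, a, b) for a, b in pairs]) / base

def association(weights, proc_a, proc_b, base):
    """Mean between-process edge weight relative to the baseline."""
    pairs = [(a, b) for a, b in product(proc_a, proc_b) if a != b]
    return mean([edge(weights, a, b) for a, b in pairs]) / base

# Placeholder gene identifiers and invented confidences.
chemotaxis = {"c1", "c2", "c3"}
flagellar = {"f1", "f2"}
other = {"x1", "x2"}
genes = sorted(chemotaxis | flagellar | other)

def toy_weight(a, b):
    if a in other or b in other:
        return 0.1          # unrelated genes: weak edges
    if a in chemotaxis and b in chemotaxis:
        return 0.9          # strong edges inside chemotaxis
    if a in flagellar and b in flagellar:
        return 0.8          # strong edges inside flagellar assembly
    return 0.6              # chemotaxis-flagellar edges

weights = {frozenset((a, b)): toy_weight(a, b) for a, b in combinations(genes, 2)}
base = baseline(weights, genes)
chemotaxis_cohesion = cohesiveness(weights, chemotaxis, base)
chemo_flagellar = association(weights, chemotaxis, flagellar, base)
chemo_other = association(weights, chemotaxis, other, base)
```

Values above 1 indicate stronger-than-background relationships; here the chemotaxis set is cohesive, it is associated with the flagellar set, and it falls below baseline against the unrelated genes.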
[Figure labels: Chemotaxis, Flagellar assembly.]

Slides 18-20: Functional Mapping: Associations Between Gene Sets
Processes shown:
• Hydrogen Transport
• Electron Transport
• Cellular Respiration
• Aldehyde Metabolism
• Cell Redox Homeostasis
• Peptide Metabolism
• Energy Reserve Metabolism
• Vacuolar Protein Catabolism
• Protein Processing
• Negative Regulation of Protein Metabolism
• Protein Depolymerization
• Organelle Fusion
• Organelle Inheritance
Legend:
• Edges: associations between processes (Moderately Strong to Very Strong)
• Nodes: cohesiveness of processes (Below Baseline; Baseline (genomic background); Very Cohesive)
• Borders: data coverage of processes (Sparsely Covered to Well Covered)

Slides 21-22: Functional Maps: Focused Data Summarization
[Figure: raw sequence data, e.g. ACGGTGAACGTACA GTACAGATTACTAG GACATTAGGCCGTA TCCGATACCCGATA.]
Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?
How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?
Functional mapping
• Very large collections of genomic data
• Specific predicted molecular interactions
• Pathway, process, or disease associations
• Underlying experimental results and functional activities in data

Slide 23: Efficient Computation For Biological Discovery
Massive datasets and genomes require efficient algorithms and implementations.
• Sleipnir C++ library for computational functional genomics (huttenhower.sph.harvard.edu/sleipnir)
• Data types for biological entities
  • Microarray data, interaction data, genes and gene sets, functional catalogs, etc.
  • Network communication, parallelization
• Efficient machine learning algorithms
  • Generative (Bayesian) and discriminative (SVM)
• And it's fully documented!
It's also speedy: microbial data integration computation takes <3hrs.

Thanks!
Curtis Huttenhower
Harvard School of Public Health, Department of Biostatistics
http://huttenhower.sph.harvard.edu

Slides 26-27: Meta-Analysis for Data Integration (Evangelou 2007)
z = \frac{1}{2} \log \frac{1 + \rho}{1 - \rho}
[Equation: model for y_{e,i}; symbols lost in extraction.]
\hat{\mu}_e = \frac{\sum_i w^*_{e,i} \, y_{e,i}}{\sum_i w^*_{e,i}}, \qquad w^*_{e,i} = \frac{1}{s^2_{e,i} + \hat{\tau}^2_e}
• Simple regression: all datasets are equally accurate.
• Random effects: variation within and among datasets and interactions.
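The random-effects combination on the meta-analysis slides can be sketched for a single gene pair measured in several datasets. This is a minimal sketch under stated assumptions: correlations are Fisher z-transformed, each dataset's within-study variance is taken as 1/(n - 3), and the between-dataset variance is estimated with the DerSimonian-Laird method, which the slides do not specify. The names `fisher_z`, `random_effects`, and `tau2`, and the example numbers, are hypothetical.

```python
import math

def fisher_z(r):
    """Fisher z-transform of a correlation r in (-1, 1)."""
    return 0.5 * math.log((1.0 + r) / (1.0 - r))

def random_effects(ys, variances):
    """Combine per-dataset effects ys with within-dataset variances.
    Returns (pooled mean, between-dataset variance tau2).
    Assumption: tau2 uses the DerSimonian-Laird moment estimator."""
    k = len(ys)
    w = [1.0 / v for v in variances]                   # fixed-effect weights
    fixed = sum(wi * yi for wi, yi in zip(w, ys)) / sum(w)
    q = sum(wi * (yi - fixed) ** 2 for wi, yi in zip(w, ys))
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c) if c > 0 else 0.0
    w_star = [1.0 / (v + tau2) for v in variances]     # random-effects weights
    mu = sum(ws * yi for ws, yi in zip(w_star, ys)) / sum(w_star)
    return mu, tau2

# Invented example: one gene pair's correlation in three datasets
# with 20, 30, and 25 conditions respectively.
correlations = [0.80, 0.35, 0.60]
sizes = [20, 30, 25]
zs = [fisher_z(r) for r in correlations]
variances = [1.0 / (n - 3) for n in sizes]
mu, tau2 = random_effects(zs, variances)
pooled_r = math.tanh(mu)   # back-transform the pooled z to a correlation
```

When the datasets agree closely, `tau2` shrinks to zero and the result reduces to the inverse-variance weighted mean; when they disagree, `tau2` grows and the weights become more nearly equal across datasets.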