Large scale genomic data mining
Curtis Huttenhower
Harvard School of Public Health, Department of Biostatistics
10-23-09

Mining Biological Data
[Figure: biological data collections, labeled ~100 GB and More than 100GB]
How can we ask and answer specific biomedical questions using thousands of genome-scale datasets?

Outline
1. Methodology: Algorithms for mining genome-scale datasets
2. Applications: Human molecular data and clinical cancer cohorts
3. Next steps: Methods for microbial communities and functional metagenomics

A Definition of Functional Genomics
[Diagram: Genomic data and Prior knowledge; Gene ↓ Function, Gene ↓ Gene, Data ↓ Function, Function ↓ Function]

MEFIT: A Framework for Functional Genomics
Related Gene Pairs
BRCA1 BRCA2 0.9
BRCA1 RAD51 0.8
RAD51 TP53 0.85
…
Unrelated Gene Pairs
BRCA2 SOX2 0.1
RAD51 FOXP2 0.2
ACTR1 H6PD 0.15
…
[Figure: MEFIT histograms of Frequency against correlation, from Low Correlation to High Correlation, for related and unrelated gene pairs]

MEFIT: A Framework for Functional Genomics
[Diagram: a Functional Relationship node linked to datasets Golub 1999, Butte 2000, Whitfield 2002, Hansen 1998, with an added Biological Context node (Functional area, Tissue, Disease, …)]

Functional Interaction Networks
Currently have data from 30,000 human experimental results (15,000 expression conditions + 15,000 diverse others), analyzed for 200 biological functions and 150 diseases.
[Figure: MEFIT produces a Global interaction network and context-specific networks: Autophagy network, Vacuolar transport network, Translation network]

Predicting Gene Function
Predicted relationships between genes
Low Confidence
High Confidence
These edges provide a measure of how likely a gene is to specifically participate in the process of interest.
[Figure label: Cell cycle genes]

Comprehensive Validation of Computational Predictions
With David Hess, Amy Caudy
[Diagram: Genomic data and Prior knowledge feed Computational Predictions of Gene Function (SPELL, Hibbs et al 2007; bioPIXIE, Myers et al 2005; MEFIT). Genes predicted to function in mitochondrion organization and biogenesis go to Laboratory Experiments (Growth curves, Petite frequency, Confocal microscopy). New known functions for correctly predicted genes feed Retraining.]

Evaluating the Performance of Computational Predictions
Genes involved in mitochondrion organization and biogenesis
• Original GO Annotations: 106
• Under-annotations: 135
• Novel Confirmations, First Iteration: 82
• Novel Confirmations, Second Iteration: 17
340 total: >3x previously known genes in ~5 person-months
[A second version of this chart adds a Confirmed Under-annotations segment; its values as extracted are 106, 95, 40, 80, 17.]
Computational predictions from large collections of genomic data can be accurate despite incomplete or misleading gold standards, and they continue to improve as additional data are incorporated.

Functional Associations Between Contexts
Predicted relationships between genes (Low Confidence to High Confidence)
[Figure label: Cell cycle genes]
The average strength of these relationships indicates how cohesive a process is.
The average strength of these relationships indicates how associated two processes are.
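The predicted relationships above come out of the Bayesian data integration introduced on the MEFIT slides. As a rough illustration only, the sketch below assumes a naive Bayes model in which each dataset contributes a binned correlation per gene pair; the function names, the single bin edge at 0.5, the pseudocount, and the two placeholder query pairs are all hypothetical, and this is not MEFIT's actual implementation. The training pairs reuse the example values from the Related and Unrelated Gene Pairs slide.

```python
import math


def bin_index(value, edges):
    """Return the index of the bin containing value, given ascending bin edges."""
    return sum(1 for edge in edges if value >= edge)


def train(datasets, labels, edges, pseudocount=1.0):
    """Estimate P(bin | related) and P(bin | unrelated) for every dataset.

    datasets: {dataset_name: {gene_pair: correlation}}
    labels:   {gene_pair: True if known related, False if known unrelated}
    """
    n_bins = len(edges) + 1
    model = {}
    for name, scores in datasets.items():
        counts = {True: [pseudocount] * n_bins, False: [pseudocount] * n_bins}
        for pair, related in labels.items():
            if pair in scores:
                counts[related][bin_index(scores[pair], edges)] += 1
        model[name] = {
            cls: [c / sum(row) for c in row] for cls, row in counts.items()
        }
    return model


def posterior(pair, datasets, model, edges, prior=0.5):
    """P(related | data) for one gene pair; datasets lacking the pair are skipped."""
    log_rel = math.log(prior)
    log_unrel = math.log(1.0 - prior)
    for name, scores in datasets.items():
        if pair not in scores:
            continue
        b = bin_index(scores[pair], edges)
        log_rel += math.log(model[name][True][b])
        log_unrel += math.log(model[name][False][b])
    top = max(log_rel, log_unrel)  # subtract the max for numerical stability
    rel = math.exp(log_rel - top)
    unrel = math.exp(log_unrel - top)
    return rel / (rel + unrel)


# Gold standard built from the example pairs shown on the MEFIT slide.
labels = {
    ("BRCA1", "BRCA2"): True, ("BRCA1", "RAD51"): True, ("RAD51", "TP53"): True,
    ("BRCA2", "SOX2"): False, ("RAD51", "FOXP2"): False, ("ACTR1", "H6PD"): False,
}
# One toy dataset; GENE_A..GENE_D are placeholder names for unlabeled pairs.
datasets = {
    "toy_dataset": {
        ("BRCA1", "BRCA2"): 0.9, ("BRCA1", "RAD51"): 0.8, ("RAD51", "TP53"): 0.85,
        ("BRCA2", "SOX2"): 0.1, ("RAD51", "FOXP2"): 0.2, ("ACTR1", "H6PD"): 0.15,
        ("GENE_A", "GENE_B"): 0.88, ("GENE_C", "GENE_D"): 0.05,
    }
}
edges = [0.5]  # two bins: low correlation and high correlation
model = train(datasets, labels, edges)
p_high = posterior(("GENE_A", "GENE_B"), datasets, model, edges)
p_low = posterior(("GENE_C", "GENE_D"), datasets, model, edges)
```

With more datasets, each one simply adds another term to the two log sums, which is what keeps this kind of model cheap to evaluate over many gene pairs.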
[Figure labels: Cell cycle genes, DNA replication genes]

Functional mapping: Scoring functional associations
How can we formalize these relationships?
Any sets of genes G1 and G2 in a network can be compared using four measures:
• Edges between their genes
• Edges within each set
• The background edges incident to each set
• The baseline of all edges in the network

$$FA_{G_1,G_2} = \frac{\mathrm{between}(G_1, G_2)}{\mathrm{background}(G_1, G_2)} \cdot \frac{\mathrm{baseline}}{\mathrm{within}(G_1, G_2)}$$

Stronger connections between the sets increase association.
Stronger within self-connections or nonspecific background connections decrease association.

Functional mapping: Bootstrap p-values
• Scoring functional associations is great…
…how do you interpret an association score?
– For gene sets of arbitrary sizes?
– In arbitrary graphs?
– Each with its own bizarre distribution of edges?

For any graph, compute FA scores for many randomly chosen gene sets of different sizes.
Null distribution is approximately normal with mean 1:
$$\hat{\mu}_{FA}(G_i, G_j) = 1$$
Standard deviation is asymptotic in the sizes of both gene sets: $\hat{\sigma}_{FA}(G_i, G_j)$ is fitted as a function of $|G_i|$ and $|G_j|$ (fitted terms A, B, C).
$$P(FA_{G_1,G_2} > x) = 1 - \Phi_{\hat{\mu}(G_1,G_2),\,\hat{\sigma}(G_1,G_2)}(x)$$
[Figures: histograms of FAs for random sets (null distribution for one graph); null distribution σs against |G1| and |G2| for sets of 1, 5, 10, and 50 genes]
Maps FA scores to p-values for any gene sets and underlying graph.
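The score and its bootstrap p-value can be sketched in a few lines. This is a minimal illustration under stated assumptions: each of the four measures is taken to be a mean edge weight, "background" is read as every edge touching either set, and the null mean and standard deviation are estimated directly from random sets rather than from the fitted asymptotic form on the slide. The toy network, the gene names g0 to g19, and the helper names are hypothetical.

```python
import itertools
import random
from statistics import NormalDist, mean, stdev


def mean_weight(weights, pairs):
    """Average weight over a collection of unordered gene pairs."""
    unique = {frozenset(p) for p in pairs if p[0] != p[1]}
    return mean(weights[e] for e in unique)


def fa_score(weights, genes, g1, g2):
    """Functional association of two gene sets (each needs at least 2 genes)."""
    g1, g2 = set(g1), set(g2)
    between = mean_weight(weights, itertools.product(g1, g2))
    within = mean_weight(
        weights,
        itertools.chain(itertools.combinations(g1, 2), itertools.combinations(g2, 2)),
    )
    # Assumption: background = all edges incident to either set.
    background = mean_weight(weights, ((a, b) for a in g1 | g2 for b in genes))
    baseline = mean(weights.values())
    return (between / background) * (baseline / within)


def fa_p_value(weights, genes, g1, g2, n_samples=200, seed=0):
    """Compare the observed FA score to scores of random sets of the same sizes."""
    rng = random.Random(seed)
    observed = fa_score(weights, genes, g1, g2)
    null = [
        fa_score(weights, genes, rng.sample(genes, len(g1)), rng.sample(genes, len(g2)))
        for _ in range(n_samples)
    ]
    null_dist = NormalDist(mean(null), stdev(null))  # normal approximation
    return observed, 1.0 - null_dist.cdf(observed)


# Toy network: 20 genes, weak random edges, strong edges only between two sets.
rng = random.Random(1)
genes = [f"g{i}" for i in range(20)]
set_a, set_b = genes[0:4], genes[4:8]
weights = {}
for a, b in itertools.combinations(genes, 2):
    linked = a in set_a and b in set_b
    weights[frozenset((a, b))] = 0.8 if linked else rng.uniform(0.05, 0.25)

observed, p_value = fa_p_value(weights, genes, set_a, set_b)
```

Note that two sets that are each very cohesive score lower than two sets linked mainly to each other, since the within term sits in the denominator; that is the behavior the slide describes for self-connections.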
Functional Associations Between Processes
[Figure: functional map of processes: Hydrogen Transport, Electron Transport, Cellular Respiration, Aldehyde Metabolism, Cell Redox Homeostasis, Peptide Metabolism, Energy Reserve Metabolism, Vacuolar Protein Catabolism, Protein Processing, Negative Regulation of Protein Metabolism, Protein Depolymerization, Organelle Fusion, Organelle Inheritance]
Legend:
• Edges: Associations between processes (Moderately Strong to Very Strong)
• Nodes: Cohesiveness of processes (Below Baseline, Baseline (genomic background), Very Cohesive)
• Borders: Data coverage of processes (Sparsely Covered to Well Covered)
Example gene callouts: AHP1, DOT5, GRX1, GRX2, …; APE3, LAP4, PAI3, PEP4, …

Functional Maps: Focused Data Summarization
ACGGTGAACGTACA GTACAGATTACTAG GACATTAGGCCGTA TCCGATACCCGATA
Data integration summarizes an impossibly huge amount of experimental data into an impossibly huge number of predictions; what next?
How can a biologist take advantage of all this data to study his/her favorite gene/pathway/disease without losing information?
Functional mapping
• Very large collections of genomic data
• Specific predicted molecular interactions
• Pathway, process, or disease associations
• Underlying experimental results and functional activities in data

Outline
1. Methodology: Algorithms for mining genome-scale datasets
2. Applications: Human molecular data and clinical cancer cohorts
3. Next steps: Methods for microbial communities and functional metagenomics

HEFalMp
[Screenshots of the HEFalMp system, one per slide]
• HEFalMp: Predicting human gene function
• HEFalMp: Predicting human genetic interactions
• HEFalMp: Analyzing human genomic data
• HEFalMp: Understanding human disease

Validating Human Predictions
With Erin Haley, Hilary Coller
Autophagy: 5½ of 7 predictions currently confirmed
[Figure: Not Starved vs. Starved (Autophagic) cells for Luciferase (Negative control), ATG5 (Positive control), and predicted novel autophagy proteins LAMP2 and RAB11A]

Current Work: Molecular Mechanisms in a Colon Cancer Cohort
With Shuji Ogino, Charlie Fuchs
Nurses’ Health Study, Health Professionals Follow-Up Study
• ~3,100 gastrointestinal subjects
• ~2,100 cancer mutation tests
• ~1,200 LINE-1 methylation
• ~3,800 tissue samples
• ~1,450 colon cancer samples
• ~1,150 CpG island methylation
• ~700 TMA immunohistochemistry
• ~775 gene expression
LINE-1 Methylation
• Repetitive element making up ~20% of mammalian genomes
• Very easy to assay methylation level (%)
• Good proxy for whole-genome methylation level
DASL Gene Expression
• Gene expression analysis from paraffin blocks
• Thanks to Todd Golub, Yujin Hoshida

Colon Cancer: LINE-1 methylation levels
With Shuji Ogino, Charlie Fuchs
Lower LINE-1 methylation associates with poor colon cancer prognosis.
LINE-1 methylation varies remarkably between individuals…
…but it is highly correlated within individuals.
[Figure: LINE-1 Methylation in Multiple Tumors from the Same Subject (Ogino et al, 2008); y-axis: Methylation %, Tumor #2, 30 to 80]
What does it all mean??
What is the biological mechanism linking LINE-1 methylation to colon cancer?
[Figure, continued: x-axis: Methylation %, Tumor #1, 30 to 80; ρ = 0.718, p < 0.01]

Colon Cancer: LINE-1 methylation levels
With Shuji Ogino, Charlie Fuchs
[Same slide and figure (Ogino et al, 2008), with annotations added:]
• Is anything different about these outliers?
• This suggests linkage to a cancer-related pathway.
• This suggests a copy number variation.
• This suggests a genetic effect.
What is the biological mechanism linking LINE-1 methylation to colon cancer?

Colon Cancer: LINE-1 methylation levels
Preliminary Data
• Six genes differentially expressed even using naïve methods
• One uncharacterized, one oncogene, three malignancy, one histone
• 1/3 are from a family with known variable GI expression, prognostic value
• 2/3 fall in same cytogenetic band, which is also a known CNV hotspot
• HEFalMp links to a set of transmembrane receptors/channels
• Better analysis pulls out mostly one-carbon metabolism and a few more signaling pathways (neurotransmitters??)
Check back in a couple of months!
What is the biological mechanism linking LINE-1 methylation to colon cancer?

Outline
1. Methodology: Algorithms for mining genome-scale datasets
2. Applications: Human molecular data and clinical cancer cohorts
3. Next steps: Methods for microbial communities and functional metagenomics

Next Steps: Microbial Communities
• Data integration is off to a great start in humans
– Complex communities of distinct cell types
– Very sparse prior knowledge
  • Concentrated in a few specific areas
– Variation across populations
– Critical to understand mechanisms of disease
• What about microbial communities?
– Complex communities of distinct species/strains
– Very sparse prior knowledge
  • Concentrated in a few specific species/strains
– Variation across populations
– Critical to understand mechanisms of disease

Next Steps: Functional Metagenomics
• Metagenomics: data analysis from environmental samples
– Microflora: environment includes us!
• Another data integration problem
– Must include datasets from multiple organisms
• Another context-specificity problem
– Now “context” can also mean “species”
• What questions can we answer?
– How do human microflora interact with diabetes, obesity, oral health, antibiotics, aging, …
– What’s shared within community X? What’s different? What’s unique?
– What’s perturbed in disease state Y? One organism, or many? Host interactions?
– Current methods annotate ~50% of synthetic data, <5% of environmental data

Next Steps: Microbial Communities
~120 available expression datasets, ~70 species
• Data integration works just as well in microbes as it does in humans
• We know an awful lot about some microorganisms and almost nothing about others
• Purely sequence-based and purely network-based tools for function transfer both fall short
• We need data integration to take advantage of both and mine out useful biology!
[Citations on slide: Weskamp et al 2004; Flannick et al 2006; Kanehisa et al 2008; Tatusov et al 1997]

Functional Maps for Functional Metagenomics
[Table: orthology groups mapped to genes, e.g. KO1: YG1, YG2, YG3; KO2: YG4; KO3: YG6; …; further entries as extracted: ECG1, ECG2, PAG1, ECG3, PAG2, …]
[Figure: network of genes YG1 to YG17 grouped into orthology groups KO2 to KO9]

Validating Orthology-Based Functional Mapping
Does unweighted data integration predict functional relationships?
What is the effect of “projecting” through an orthologous space?
[Figures: GO and KEGG panels plotting log(Precision/Random) against Recall for Individual datasets and Unsupervised integration]

Validating Orthology-Based Functional Mapping
[Figure: genes YG1 to YG17 divided into a Holdout set (uncharacterized “genome”) and Random subsets (characterized “genomes”)]

Validating Orthology-Based Functional Mapping
Can subsets of the yeast genome predict a heldout subset’s functional maps?
Can subsets of the yeast genome predict a heldout subset’s interactome?
[Figures: GO and KEGG panels for each question; plotted values as extracted: 0.68, 0.48, 0.30, 0.37, 0.40, 0.39, 0.25, 0.27, 0.43, 0.39]
What have we learned?
• Yeast is incredibly well-curated
• KEGG tends to be more specific than GO
• Predicting interactomes by projecting through functional maps works decently in the absolute best case

Functional Maps for Functional Metagenomics
Now, what happens if you do this for characterized microbes?
• ~20 (somewhat) well-characterized species
• 1-35 datasets each
• Integrate within species
• Evaluate using KEGG
• Then cross-validate by holding out species
[Figure: KEGG panel plotting log(Precision/Random) against Recall for Unsupervised integrations]

Next Steps: Missing Methodology, Mining
• Most machine learning algorithms are optimized for one of two cases:
– Small, dense data
– Large, sparse data
• HEFalMp integrates ~300M records using ~1K features, relatively few of which are missing, in ~200 contexts
Simple models, efficient algorithms

Next Steps: Missing Methodology, Models
[Diagram, built up over three slides: Functional Relationship linked to Dataset #1, Dataset #2, Dataset #3, …; then Biological Context added; then Regulation, Cross-Species Orthology, Cellular Processes, Developmental Stage, Tissue/Cell Lineage, Disease State, …, and Types of Interactions added]
This is clearly not a sustainable system; novel large-scale hierarchical modeling is needed to capture the complex biology of metazoan and metagenomic interaction networks.

Efficient Computation For Biological Discovery
Massive datasets and genomes require efficient algorithms and implementations.
• Sleipnir C++ library for computational functional genomics
• Data types for biological entities
– Microarray data, interaction data, genes and gene sets, functional catalogs, etc. etc.
• Network communication, parallelization
• Efficient machine learning algorithms
– Generative (Bayesian) and discriminative (SVM)
• And it’s fully documented!
It’s also speedy: improves on Bayes Net Toolbox by ~22x in memory usage and up to >100x in runtime.

Efficient Computation For Biological Discovery
Massive datasets and genomes require efficient algorithms and implementations.
• Sleipnir C++ library for computational functional genomics
• Data types for biological entities
– Microarray data, interaction data, genes and gene sets, functional catalogs, etc. etc.
• Network communication, parallelization
• Efficient machine learning algorithms
– Generative (Bayesian) and discriminative (SVM)
• And it’s fully documented!
[Table: Original processing time vs. Current processing time: 8 hours vs. 1 minute; 30 years vs. 2 months; 18 hours vs. 2-3 hours]

Outline
1. Methodology: Algorithms for mining genome-scale datasets
• Bayesian system for genomic data integration
• Sleipnir software for efficient large scale data mining
• Functional mapping to statistically summarize large data collections
2. Applications: Human molecular data and clinical cancer cohorts
• HEFalMp system for human data analysis and integration
• Six confirmed predictions in autophagy
• Ongoing analysis of LINE-1 methylation in colon cancer
3. Next steps: Methods for microbial communities and functional metagenomics
• Data integration applied to microbial communities and functional metagenomics
• Efficient machine learning for large, dense feature spaces

Thanks!
Olga Troyanskaya, Matt Hibbs, Chad Myers, David Hess, Edo Airoldi, Florian Markowetz
Hilary Coller, Erin Haley, Tsheko Mutungu
Shuji Ogino, Charlie Fuchs
Interested? We’re looking for students and postdocs!
Biostatistics Department
http://huttenhower.sph.harvard.edu
http://function.princeton.edu/hefalmp
http://function.princeton.edu/sleipnir

Colon Cancer: Immunohistochemistry
[Table: Genes (AKT1, AURKA, CCND1, …) by Conditions (Tumor #1, Tumor #2, …, Tumor #700); Quantities as extracted: 0 0 25 11 5 0 55 0 30 …]
What is the biological mechanism linking LINE-1 methylation to colon cancer?
What does the IHC data tell us about LINE-1 hypomethylation?
The world’s smallest, cheapest microarray!
Colon Cancer: Immunohistochemistry
~700 Tumor Samples: LINE-1 hypomethylated outliers vs. LINE-1 methylation “normal”
[Figure: IHC Pseudoexpression (0 to 80) by LINE-1 Methylation (Low, Normal) for 26 genes: STAT3, EPAS1, VDR, JCVT, HIF1A, CTNNB1, CDKN1B, SIRT1, AURKA, KDM1, MAPK, PTGER2, CDX2, HDAC3, DNMT1, ESR2, PPARG, AKT1, CDK8, PRKAA1, CTSB, MTOR, PTEN, TP53, CCND1, STMN1]
The world’s smallest, cheapest microarray!
What is the biological mechanism linking LINE-1 methylation to colon cancer?
What does the IHC data tell us about LINE-1 hypomethylation?
Can existing microarrays amplify the LINE-1 hypomethylation signal?

Colon Cancer: Mining Microarrays
~650 datasets, ~15,000 expression conditions, ~24,000 genes
26 genes in signature
[Figure: log2( Low / Normal ), from -0.6 to 1.2, for the same 26 genes]
Conditions are ranked from most like our 26-gene LINE-1 differential methylation signature to least like the signature.
Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.

Colon Cancer: Mining Microarrays
“The goal of GSEA is to determine whether members of a gene set S tend to occur toward the top (or bottom) of the list L.” Subramanian et al, 2005
[Diagram: Condition X, Condition Y, Condition Z, Condition A, Condition B, Condition C, Condition D, Condition E ranked from most like the signature to least like the signature, with Dataset 1 and Dataset 2 as the sets tested]
Example datasets:
• Bleomycin effect on mutagen-sensitive lymphoblastoid cells
• Folic acid deficiency effect on colon cancer cells
• Normal tissue of diverse types
• Muscle function and aging
• Bladder tumor stage classification
• Non-diseased lung tissue
What CNV-linked genes are differentially expressed in these datasets?

Colon Cancer: Mining Microarrays
“The goal of GSEA is to determine whether members of a gene set S tend to occur toward the top (or bottom) of the list L.” Subramanian et al, 2005
[Diagram: Gene X, Gene Y, Gene Z, Gene A, Gene B, Gene C, Gene D, Gene E ranked from most upregulated in significantly enriched datasets to most downregulated, with CNV 1 and CNV 2 as the sets tested]
CNV-linked gene groups (Iafrate et al, 2005):
• PSGs (11 genes on 19q13.3)
• PCDHs (~50 genes on 5q31.3)
• Misc. ~12 genes on 16p13.3 ?

Colon Cancer: Mining Microarrays
Pregnancy specific β glycoproteins
Salahshor et al, 2005, “Differential gene expression profile reveals deregulation of pregnancy specific β1 glycoprotein 9 early during colorectal carcinogenesis”:
“PSG9 is not found in the nonpregnant adult except in association with cancer, and it appears to be an early molecular event associated with colorectal cancer.”
Iafrate et al, 2005
What is the biological mechanism linking LINE-1 methylation to colon cancer?
What does the IHC data tell us about LINE-1 hypomethylation?
Can existing microarrays amplify the LINE-1 hypomethylation signal?
Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.
What CNV-linked genes are differentially expressed in these datasets?
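The GSEA idea quoted above can be illustrated with a simplified running-sum statistic. The sketch below uses the unweighted form of the enrichment score and a plain permutation of list positions for significance; it is an illustration of the technique rather than the analysis pipeline behind these slides, and the item names and function names are hypothetical placeholders.

```python
import random


def enrichment_score(ranked_items, item_set):
    """Unweighted running-sum enrichment score of item_set in a ranked list.

    Walk down the list, stepping up at members of the set and down otherwise;
    the score is the largest deviation from zero (positive = top of the list).
    """
    members = set(item_set) & set(ranked_items)
    n_hit = len(members)
    n_miss = len(ranked_items) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("set must contain some, but not all, ranked items")
    running = 0.0
    best = 0.0
    for item in ranked_items:
        running += 1.0 / n_hit if item in members else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best


def permutation_p_value(ranked_items, item_set, n_perm=1000, seed=0):
    """Fraction of shuffled rankings scoring at least as extremely as observed."""
    rng = random.Random(seed)
    observed = enrichment_score(ranked_items, item_set)
    shuffled = list(ranked_items)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(enrichment_score(shuffled, item_set)) >= abs(observed):
            extreme += 1
    return observed, (extreme + 1) / (n_perm + 1)


# Placeholder ranking of 20 conditions, most signature-like first.
ranked = [f"condition_{i}" for i in range(20)]
top_set = ranked[:4]      # a dataset whose conditions all rank at the top
bottom_set = ranked[-4:]  # a dataset whose conditions all rank at the bottom

es_top, p_top = permutation_p_value(ranked, top_set)
es_bottom = enrichment_score(ranked, bottom_set)
```

The same function applies to both uses on the slides: ranking conditions and testing datasets as sets, or ranking genes and testing CNV-linked gene groups as sets.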
Colon Cancer: Generating a Hypothesis
Pregnancy specific β glycoproteins
[Figures spanning two slides]
What is the biological mechanism linking LINE-1 methylation to colon cancer?
What does the IHC data tell us about LINE-1 hypomethylation?
Can existing microarrays amplify the LINE-1 hypomethylation signal?
Identify microarray datasets with conditions enriched for LINE-1 hypomethylation.
What CNV-linked genes are differentially expressed in these datasets?

Colon Cancer: Using All the Data
What’s the state of the data?
• Extremely hypomethylated colon cancer carries a significantly poor prognosis
• In our cohort, these ~20 tumors are weakly enriched for a protein activity signature based on IHC
• The expression datasets most enriched for the same signature represent mainly GI cancer and chemotherapy conditions
• The PSG gene family is upregulated in these datasets and is linked to a known CNV
• HEFalMp associates the PSGs with cancer based on correlation with known colorectal cancer genes in a variety of expression datasets
Questions and current answers:
• What is the biological mechanism linking LINE-1 methylation to colon cancer? Get back to me in a couple of months…
• What does the IHC data tell us about LINE-1 hypomethylation? Nothing definite – yet.
• Can existing microarrays amplify the LINE-1 hypomethylation signal? Yes (caveat investigator)
• Identify microarray datasets with conditions enriched for LINE-1 hypomethylation. GI cancers and chemotherapy
• What CNV-linked genes are differentially expressed in these datasets? Pregnancy specific β glycoproteins
Human Regulatory Networks
Quiescence: reversible exit from the cell cycle (G0)
[Figure: expression of 6,829 genes in clusters I to X across Serum starved (1, 2, 4, 8, 24, 96 hrs) and Serum re-stimulated (1, 2, 4, 8, 24, 48 hrs); motifs from FIRE (Elemento et al. 2007): Elk-1, YY1, NF-Y, Sp1; cluster annotations: Development, RNA processing, Cell cycle, Metabolism, Protein localization, Cholesterol]
• Of only five regulators found, four have generic cell cycle/proliferation targets
• Just five basic regulators for ~7,000 genes?
• These motifs only appear upstream of ~half of the genes

Regulatory Modules: Expression Biclusters + Sequence Motifs
Bicluster: Coregulated subset of genes and conditions
[Figure: genes CRG1 to CRG4 and RND1 to RND8 against conditions 1 to 8, shown unsorted and then reordered so that the coregulated genes and conditions group together]
…any dataset can contain many overlapping biclusters…
…any gene or condition can participate in multiple biclusters…
…do all that, and simultaneously find (under)enriched sequence motifs!

COALESCE: Combinatorial Algorithm for Expression and Sequence-based Cluster Extraction
Inputs: Gene Expression, DNA Sequence (5’ UTR, 3’ UTR, Upstream flank, Downstream flank), Nucleosome Positions, Evolutionary Conservation
• Create a new module
• Feature selection (tests for differential expression/frequency):
– Identify conditions where genes coexpress
– Identify motifs enriched in genes’ sequences
• Bayesian integration: select genes based on conditions and motifs
• Subtract mean from all data
Regulatory modules
• Coregulated genes
• Conditions where they’re coregulated
• Putative regulating motifs

COALESCE: Selecting Coexpressed Conditions
• For each gene expression condition…
– Compare distributions of values for
  • Genes in the module versus
  • Genes not in the module
– If significantly different, include the condition
Preserving data structure:
• If multiple conditions derive from the same dataset, can be included/excluded as a unit
• For example, time course vs. deletion collection
• Test using multivariate z-test
• Precalculate covariance matrix; still very efficient

COALESCE: Selecting Significant Motifs
• COALESCE looks for three kinds of motifs:
– K-mers (e.g. ACGACGT)
– Reverse complement pairs (e.g. ACGACAT | ATGTCGT)
– Probabilistic Suffix Trees (PSTs) [tree diagram]
• For every possible motif…
– Compare distributions of values for
  • Genes in the module versus
  • Genes not in the module
– If significantly different, include the motif
• This can distinguish flanks from UTRs
• Fast!
• Efficient enough to search coding sequence (e.g. exons/introns)

COALESCE: Selecting Probable Genes
• For each gene in the genome…
– For each significant condition…
– For each significant motif…
What’s the probability the gene came from the module’s distribution?
What’s the probability that it came from outside the module?
The probability of a gene being in the module given some data…
$$P(g \in M \mid D) = \frac{P(D \mid g \in M)\,P(g \in M)}{P(D \mid g \in M)\,P(g \in M) + P(D \mid g \notin M)\,P(g \notin M)}$$
Prior is used to stabilize module convergence; genes already in the module are more likely to stay there next iteration.
Distributions of each feature in and out of the developing module are observed from the data.

COALESCE: Integrating Additional Data Types
Nucleosome placement, Evolutionary conservation
• Can be included as additional datasets and feature selected just like expression conditions/motifs.
      N     C
G1    2.5   0.0
G2    0.6   0.5
G3    1.2   0.9
…     …     …
• Or can be used as a prior or weight on the values of individual motifs.
TCCGGTAGAACTACTGGTATTGTTTTGGATTCCGGTGATG

COALESCE Results: S. cerevisiae Modules
The haystack: ~6,000 genes, ~2,200 conditions
A needle: 100 genes, 80 conditions

COALESCE Results: Yeast TF/Target Accuracy
[Figure: Z-Score (-0.3 to 1.3) for COALESCE, cMonkey, FIRE, and Weeder across Bas1p, Hap4p, Met32p, Cup2p, Met31p, Zap1p, Upc2p, Mbp1p, Hsf1p, Gln3p, Hap3p, Gcn4p, Uga3p, Gis1p, Hap5p]

COALESCE Results: Yeast Clustering Accuracy
• ~2,200 yeast conditions
– Recapitulation of known biology from Gene Ontology
Example modules from other organisms:
• C. elegans: Up in larvae, down in adults; GATA in 5’ flank, miR-788 seed in 3’ UTR
• M. musculus: Up in callosal and motor neurons; ASCL1 in 5’ flank, uncharacterized sequences underenriched in 3’ UTR
• H. sapiens: Up in normal muscle, down in diabetic; AAGGGGC (zf?) and enriched in 5’ flank

COALESCE: Coregulated Quiescence Modules
• Up during quiescence entry, down during quiescence exit
• Down with let-7 exposure
• Many known related (proliferation) motifs: Pax4, Staf, NFKB1, Gfi, ESR1, Runx1, Su(H)
• let-7 motifs predicted in 3’ UTR (UACCUC)
• Down during quiescence entry, enriched for transport/trafficking
• Down during quiescence entry, up during quiescence exit, down with adenoviral infection
• miR-297 motif predicted in 3’ UTR (CACATAC)
• Specific predicted uncharacterized reverse complement motif

Summary
• COALESCE algorithm for regulatory module prediction
– Biclustering + putative de novo motifs
– Optimized for complex organisms (fast!)
  • Large genomes, large data collections
– High accuracy, low false positives
– Leverage prior knowledge, multiple data types
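To make the COALESCE condition-selection and gene-selection steps concrete, here is a minimal sketch of one pass over a toy expression matrix. It rests on several assumptions that are not from the slides: a per-condition two-sample z-test stands in for the multivariate z-test, each feature is modeled as an independent Gaussian inside and outside the module, motifs are left out, and the gene names, values, and function names are hypothetical. It is not the COALESCE implementation.

```python
import math
from statistics import NormalDist, mean, stdev, variance


def select_conditions(expr, module, alpha=0.05):
    """Keep conditions where module genes differ from all other genes."""
    inside = [g for g in expr if g in module]
    outside = [g for g in expr if g not in module]
    n_conditions = len(next(iter(expr.values())))
    selected = []
    for c in range(n_conditions):
        x = [expr[g][c] for g in inside]
        y = [expr[g][c] for g in outside]
        se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
        if se == 0.0:
            continue
        z = (mean(x) - mean(y)) / se
        p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))  # two-sided z-test
        if p < alpha:
            selected.append(c)
    return selected


def log_normal_pdf(v, mu, sigma):
    """Log density of a normal distribution, computed without underflow."""
    return -0.5 * ((v - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def membership_posterior(expr, module, conditions, gene, prior_in=0.5):
    """P(gene in module | data) by Bayes' rule over the selected conditions.

    prior_in can be raised for current members so that they tend to stay.
    """
    inside = [g for g in expr if g in module and g != gene]
    outside = [g for g in expr if g not in module and g != gene]
    log_in, log_out = math.log(prior_in), math.log(1.0 - prior_in)
    for c in conditions:
        v = expr[gene][c]
        x = [expr[g][c] for g in inside]
        y = [expr[g][c] for g in outside]
        log_in += log_normal_pdf(v, mean(x), stdev(x))
        log_out += log_normal_pdf(v, mean(y), stdev(y))
    top = max(log_in, log_out)  # subtract the max for numerical stability
    w_in, w_out = math.exp(log_in - top), math.exp(log_out - top)
    return w_in / (w_in + w_out)


# Toy data: module genes are high in conditions 0 and 1, not in condition 2.
expr = {
    "m1": [2.0, 1.8, 0.1], "m2": [2.2, 2.1, -0.2], "m3": [1.9, 2.0, 0.0],
    "m4": [2.1, 1.9, 0.2],  # behaves like the module but is not yet a member
    "b1": [0.1, -0.1, 0.1], "b2": [-0.2, 0.2, -0.1], "b3": [0.0, 0.1, 0.2],
    "b4": [0.2, -0.2, -0.2], "b5": [-0.1, 0.0, 0.0],
}
module = {"m1", "m2", "m3"}

conditions = select_conditions(expr, module)
p_candidate = membership_posterior(expr, module, conditions, "m4")
p_background = membership_posterior(expr, module, conditions, "b1")
```

In an iterative version, genes with a high posterior would be added to the module, genes with a low posterior removed, and the condition selection repeated until membership stops changing.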