Download PPT - NUS

Functions, Networks, and Phenotypes by Integrative Genomics Analysis Xianghong Jasmine Zhou Molecular and Computational Biology University of Southern California IMS, NUS, Singapore, July 18, 2007 Information Flow Cellular systems 1995 DNA 1997 Bacteria Eukaryote ~1.6K genes ~6K genes 1998 2001 Human Animal ~20K genes 30~100K genes Objective: $1000 human genome Gene Expression Datasets: the Transcriptome RNA •Oligonucleotide Array •cDNA Array Protein abundance measurement (Mass Spec) Protein Protein interactions (yeast 2-hybrid system, protein arrays) Protein complexes (Mass Spec) Rapid accumulation of microarray data in public repositories • NCBI Gene Expression Omnibus 137231 experiments • EBI Array Express 55228 experiments The public microarray data increases by 3 folds per year Multiple Microarray Technology Platforms Microarray Platforms Graph-based Approach for the Integrative Microarray Analysis Datasets Coexpression Networks Recurrent Patterns Annotation experiments Transcriptional Annotation f f gene ---------------- a j h a c TF e e b b k d j h c h k d g i e i g h e f f ---------------- j a c h c e b k d e g ---------------- i g Functional Annotation f j j a h c a e e b k k d f h c b d i g i k d f a i h b i g g j a b i g d f f a ---------------- j j a h c h c e e Gene Ontology b b k k d g i d g i Frequent Subgraph Mining Problem is hard! Problem formulation: Given n graphs, identify subgraphs which occur in at least m graphs (m  n) Our graphs are massive! (>10,000 nodes and >1 million edges) The traditional pattern growth approach (expand frequent subgraph of k edges to k+1 edges) would not work, since the time and memory requirements increase exponentially with increasing size of patterns and increasing number of networks. Novel Algorithms to identify diverse frequent network patterns • CoDense (Hu et al. ISMB 2005) – identify frequent coherent dense subgraphs across many massive graphs • Network Biclustering (Huang et al, ISMB 2007) – identify frequent subgraphs across many massive graphs • Network Modules (NeMo) (Yan et al. ISMB 2007) – identify frequent dense vertex sets across many massive graphs CODENSE: identify frequent coherent dense subgraphs across massive graphs Hu et al, ISMB 2005 Identify frequent co-expression clusters across multiple microarray data sets c1 c2… cm g1 .1 .2… .2 g2 .4 .3… .4 … c1 c2… cm g1 .8 .6… .2 g2 .2 .3… .4 … f a c d g a c b k i f e j k d e f e b c j h a c k b f j a d g i h j h k i k i f e g i e b d k d g e a j h .. . b d g k i f c .. . a c h j d g i g c b a h f c1 c2… cm g1 .2 .5… .8 g2 .7 .1… .3 … a b .. . c1 c2… cm g1 .9 .4… .1 g2 .7 .3… .5 … e f h j c h j e b d g k i The common pattern growth approach Find a frequent subgraph of k edges, and expand it to k+1 edge to check occurrence frequency – Koyuturk M., Grama A. & Szpankowski W. An efficient algorithm for detecting frequent subgraphs in biological networks. ISMB 2004 – Yan, Zhou, and Han. Mining Closed Relational Graphs with Connectivity Constraints. ICDE 2005 Problem of the Pattern-growth approach The time and memory requirements increase exponentially with increasing size of patterns and increasing number of networks. The number of frequent dense subgraphs is explosive when there are very large frequent dense subgraphs, e.g., subgraphs with hundreds of edges. Problem of the Pattern-growth approach f a c Pattern Expansion k  k+1 h j e d g k i f c j f i a c j h k d g f j d g i d g h a j e c h a c k i f h d g d g a k i i c h b k d g a c f j h k d g j a b k i c e f c k i j h e b k d g a c f i j h e i b k d g i f h j a e b d g h j d g i b e d g e c b a h f f j j e b k c k i f i j a e d g a k c f h j b h e b e b b d g i f f j e b k c k i d g e b e f i h a b a h e c f h j d g b k e c c k i f h e d g a a e d g a j b a c c f h j b b a f a k i c h k j e b d g k i Our solution We develop a novel algorithm, called CODENSE, to mine frequent coherent dense subgraphs. The target subgraphs have three characteristics: (1) All edges occur in >= k graphs (frequency) (2) All edges should exhibit correlated occurrences in the given graph set. (coherency) (3) The subgraph is dense, where density d is higher than a threshold  and d=2m/(n(n-1)) (density) m: #edges, n: #nodes CODENSE: Mine coherent dense subgraph (1) Builds a summary graph by eliminating infrequent edges f a a h c e b f f a c e b h c h e b f d d i g G1 d i g G2 i g a G3 h c e b f a b f a h c e d b g G4 i f a h c e d b g G5 i d h c i summary graph Ĝ e d g g G6 i CODENSE: Mine coherent dense subgraph (2) Identify dense subgraphs of the summary graph f a f Step 2 h c e h c e b d g i summary graph Ĝ MODES g i Sub(Ĝ) Observation: If a frequent subgraph is dense, it must be a dense subgraph in the summary graph. However, the reverse conclusion is not true. CODENSE: Mine coherent dense subgraph (3) Construct the edge occurrence profiles for each dense summary subgraph E f h c Step 3 e g i Sub(Ĝ) G1 G2 G3 G4 G5 G6 c-e 0 0 1 1 0 1 c-f 0 1 0 1 1 1 c-h 0 0 0 1 1 1 c-i 0 0 1 1 1 0 e-f 0 0 0 1 1 1 … … … … … … … edge occurrence profiles CODENSE: Mine coherent dense subgraph (4) builds a second-order graph for each dense summary subgraph g-h E G1 G2 G3 G4 G5 G6 c-e 0 0 1 1 1 1 c-f 0 1 0 1 1 1 c-h 0 0 0 1 1 1 c-i 0 0 1 1 1 0 e-f 0 0 0 1 1 1 … … … … … … … edge occurrence profiles f-i e-i h-i e-g g-i Step 4 e-h c-h f-h c-f e-f c-e c-i second-order graph S CODENSE: Mine coherent dense subgraph (5) Identify dense subgraphs of the second-order graph g-h g-h f-i e-i h-i e-i h-i e-g g-i Step 4 g-i e-g e-h e-h c-h f-h c-f e-f c-e c-h f-h c-f e-f c-i second-order graph S c-e Sub(S) Observation: if a subgraph is coherent (its edges show high correlation in their occurrences across a graph set), then its 2nd-order graph must be dense. CODENSE: Mine coherent dense subgraph (6) Identify the coherent dense subgraphs g-h h e-i h-i e i g Step 5 g-i e-g e-h f h c c-h f-h e c-f e-f c-e Sub(S) Sub(G) Our solution The identified subgraphs by definition satisfy the three criteria: (1) All edges occur in >= k graphs (frequency) (2) All edges should exhibit correlated occurrences in the given graph set. (coherency) (3) The subgraph is dense, where density d is higher than a threshold  and d=2m/(n(n-1)) (density) m: #edges, n: #nodes a CODENSE: Mine coherent dense subgraph f h c f f a c b e h b e d G1 a b e b h c e d i g G4 f i g a Step 1 G3 f a h d d G2 f c e i g h c b d i g a g c b e d i G5 Step 2 e Add/Cut h e d g i MODES i g Sub(Ĝ) summary graph Ĝ g h c b f a f h c i G6 Step 3 g-h g-h f-i h e-i e-i h-i h-i E e Step 6 i g g-i e-g Step 5 g-i e-g e-h e-h f h c e Restore G and MODES c-h f-h c-f e-f c-e Sub(G) Step 4 Sub(S) MODES c-h f-h c-f e-f c-e c-i second-order graph S G1 G2 G3 G4 G5 G6 c-e 0 0 1 1 1 1 c-f 0 1 0 1 1 1 c-h 0 0 0 1 1 1 c-i 0 0 1 1 1 0 e-f 0 0 0 1 1 1 … … … … … … … edge occurrence profiles CODENSE The design of CODENSE can solve the scalability issue. Instead of mining each biological network individually, CODENSE compresses the networks into two meta-graphs (the summary graph and the second-order graph) and performs clustering in these two graphs only. Thus, CODENSE can handle any large number of networks. MODES: Mine overlapped dense subgraph G Sub(G) a j a f b c h i e Step 1 b HCS’ c g d G’ j f h h Step 2 V condense i e j i g g d HCS’ Sub(G’) a h f Step 4 f b h h Step 3 V e i HCS’ c e i restore d Hu et al. ISMB 2005 i Applying CoDense to 39 yeast microarray data sets f c1 c2… cm g1 .1 .2… .2 g2 .4 .3… .4 … c1 c2… cm g1 .8 .6… .2 g2 .2 .3… .4 … c1 c2… cm g1 .9 .4… .1 g2 .7 .3… .5 … a c e a b d g a c b k i f e j k d a c f j h a c j h e b k d g f a c k b i j h j a f e k i k i d g i h g k i e b d e f c e b d g h j d g i g c b a h f c1 c2… cm g1 .2 .5… .8 g2 .7 .1… .3 … f h j c h j e b d g k i Functional annotation Annotation Functional Annotation (Validation) Method: leave-one-out approach - masking a known gene to be unknown, and assign its function based on the other genes in the subgraph pattern. Functional categories: 166 functional categories at GO level at least 6 Results: 448 predictions with accuracy of 50% Functional Annotation (Prediction) We made functional predictions for 169 genes, covering a wide range of functional categories, e.g. amino acid biosynthesis, ATP biosynthesis, ribosome biogenesis, vitamin biosynthesis, etc. A significant number of our predictions can be supported by literature. However… • How about frequent non-dense graphs? – Many biological modules may form paths • How about subgraphs which are coherent across only a subset of the graphs? – Not all modules are activated across all conditions, and genes may form modules with diff. other genes under diff. conditions Network Biclustering: Identify frequent subgraphs across massive graphs Huang et al, ISMB 2007 Using 65 human co-expression network as an illustration example • 65 co-expression networks generated from 65 microarray data sets • each graph contains 8297 genes, and 1%10% edges of a complete graph Basically, it is a biclustering problem graphs edges Network Biclustering • Objective function graphs c' f  mn  c c’: number of 1 in the bicluster c: number of 1 in the whole matrix mn: size of the bicluster : regularization factor 1 0 1 0 0 … 1 0 1 0 0 1 … 0 0 1 1 1 1 … 0 0 1 0 1 1 … 1 0 1 1 0 1 … 0 1 1 1 1 1 … 0 0 0 1 0 0 … 1 edges However, the matrix is very large with millions of edges … We will first identify robust seed to narrow down the search space Identify Bicluster seed The property of relation graphs: edge labels are unique. Hence, each graph can be treated as a collection of items Thus, Frequent subgraph Mining can be modeled as frequent item set mining Problem: current frequent item set mining algorithms can only efficiently mine across many small item sets In our problem, we have 65 very large item set… We use a trick…. Identify Bicluster seed E Edge occurrence profiles: G1 G2 G3 G4 G5 G6 ... G60 G61 G62 G63 G64 G65 e1 1 1 0 1 0 1 ... 0 0 1 1 1 1 e2 1 1 0 1 0 1 … 0 1 0 1 1 1 e3 1 1 1 1 1 1 … 0 0 0 1 1 1 e4 0 0 1 1 1 0 ... 0 0 1 1 1 0 e5 1 1 0 1 0 1 … 0 0 0 1 1 1 … … … … … … … … … … … … … … Frequent pattern tree {G1, G3, G5, G6, G7,…} common edges {e1, e10, e56, e100, e1000,…} Graph set with more than 5 members and with > 1000 common edges {G2, G3, G5, G7, G8,…} …. {G8, G9, G15, G26, G29} common edges {e4, e12, e33, e56, e890,…} …. common edges {e99, e220, e1545, e2629,…} Very time consuming! It takes more than 2 weeks on 40 Pentium IV nodes Huang et al. ISMB 2007 Expanding the Biclusters E G7 … G61 G62 G63 G64 G65 1 0 0 0 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 .0 0 0 1 1 1 0 0 0 1 0 1 1 0 0 0 1 1 1 … … … … … … … … … … … … G1 G2 G3 G4 G5 G6 e1 1 1 0 1 1 e2 1 1 1 1 e3 1 1 1 e4 1 1 e5 1 … … Simulated Annealing E G7 … G61 G62 G63 G64 G65 1 0 0 0 1 1 1 1 1 0 1 0 1 0 1 1 1 1 1 1 0 0 0 0 1 1 1 1 1 1 1 .0 0 0 1 1 1 0 0 0 1 0 1 1 0 0 0 1 1 1 … … … … … … … … … … … … G1 G2 G3 G4 G5 G6 e1 1 1 0 1 1 e2 1 1 1 1 e3 1 1 1 e4 1 1 e5 1 … … Identify connected components Systematic identification of functional modules in human genome • We identified 143,400 network modules with recurrence >= 5. They vary in size from 4 to 180. • 77.0% of the patterns are functionally homogenous (GO hyper-geometric Pvalue less than 0.01) • The figure shows that the functional homogeneity of modules increase with their recurrences. Results (I): Examples of highly recurring modules Recurrence 20 Recurrence 18 Defense against oxygen and nitrogen species Involved in spermatogenesis Loosely connected network patterns with high recurrence can represent functional modules Functional annotation Annotation We made functional predictions for 779 known and 116 unknown genes by random forest classification with 71% accuracy. Variables for random forest classification: functional enrichment P-value network connectivity average node degree Network size network topology score pattern recurrence numbers unknown gene ratio Network Modules (NeMo) Identify frequent dense vertex sets across many massive graphs microarray coexpression graph ... ... step 1 (neighbor association) clustering summary graph refinement ... partitioning step 2 step 3 step 4 Yan et al. ISMB 2007 105 human microarray data sets NeMo 6477 recurrent coexpression clusters (density > 0.7 and support > 10) Validation based on ChIp-chip data (9176 target genes for 20 TFs) 15.4% homogenous clusters (vs. 0.2% by randomization test) Validation based on human-mouse Conserved Transfac prediction (7720 target genes for 407 TFs) 12.5% homogenous clusters (vs. 3.3% by randomization test) Percentage of potential transcription modules validated by ChIP-Chip data increase with cluster density and recurrence Expression data Microarray Data Phenotype information Phenotype Concepts (e.g. diseases, perturbations, tissues ) in Unified Medical Language System (UMLS) Classifying microarray data based on phenotype Adenocarcinoma Arthritis Asthma … Glaucoma HIV For example, the current NCBI GEO database contains >60 cancer datasets, among which 11 leukemia datasets. Identify phenotype-specific functional or transcriptional modules • Unsupervised approach Cancer Cancer Cancer Cancer Frequent pattern mining Module Module 11, Module 2, Module 3, … Module k 105 microarray data sets NeMo 6477 recurrent coexpression clusters (density > 0.7 and support > 10) A total of 459 of the clusters are statistically significant with a hypergeometric P-value <0.01 in a specific phenotype: e.g. malignant neoplasms, cardiovascular diseases or nervous system disorders. An example 5 out of the 9 support datasets are leukemia datasets (P-value 0.0039). It is potentially regulated by E2F4, and majority genes are involved in cell cycle and DNA repair. Identify phenotype-specific functional or transcriptional modules • Supervised approach Adenocarcinoma Arthritis … AsthmaNon-Adenocarcinoma Glaucoma Functional and transcriptional modules which are active ONLY in Adenocarcinoma Related data sets HIV A case study: Identify Network Modules Characterizing Cancer 32 Cancer Datasets c1 c2… cm c1 c.1 2… cm g1 .9 .4… c 1 c.1 2… cm .4… g2 g.71 .9 .3… .5c1 c2… cm g .9 .4… .5c1.1c2… cm …g2 .71 g.3… .9 .4… .5c1.1c2… cm …g2 .71 g.3… .9 .4… .5 .1 …g2 .71 g.3… .9 .4… .1 …g2 .71 .3… .5 …g2 .7 .3… .5 … f a j h c f e a c d g j h c … e b b k a h e b f j k k d i d i g i g 25 Non-cancer Datasets c1 c2… cm c1 c.8 2… cm g1 .2 .5… c 1 c.8 2… cm .5… g2 g.71 .2 .1… .3c1 c2… cm g .2 .5… .3c1.8c2… cm …g2 .71 g.1… .2 .5… .3c1.8c2… cm …g2 .71 g.1… .2 .5… .3 .8 …g2 .71 g.1… .2 .5… .8 …g2 .71 .1… .3 …g2 .7 .1… .3 … f a f h c j a k g a i k g … e b d j h c e b d h c e f j i b k d g i Examples of identified modules Cell cycle Module Cell adhesion PDGF-signaling across all cancer datasets across all solid tumor datasets in breast cancer datasets Reconstruct transcriptional cascades by second-order analysis Zhou et al. Nature Biotech 2005 Frequently occurring tight clusters Frequently occurring tight clusters Transcription Factors Co-occurrence of tight clusters Coexpression network constructed with the dataset 1 Co-occurrence of tight clusters Coexpression network constructed with the dataset 2 Co-occurrence of tight clusters Coexpression network constructed with the dataset 3 Co-occurrence of tight clusters Coexpression network constructed with the dataset 4 Co-occurrence of tight clusters Coexpression network constructed with the dataset 5 Transcription Factors Set 1 Coexpression Networks Cooperativity Transcription Factor Set 2 Relevance Networks Three types of transcription cascades gene 1 TF2 Type I TF1 gene 2 gene 3 TF3 gene 4 gene 5 gene 1 TF2 Type II TF1 gene 4 gene 2 gene 3 gene 5 gene 1 TF1 Type III TF2 gene 2 Transcription Regulation gene 3 Protein Interaction gene 4 gene 5 Applying to 39 yeast microarray data sets • We identified 60 transcription modules. Among them, we found 34 pairs that showed high 2ndorder correlation. A significant portion (29%, pvalue<10-5 by Monte Carlo simulation ) of those modules pairs are participants in transcription cascades: 2 pairs in Type I, 8 pairs in Type II, and 3 pairs in type III cascades. In fact, these transcription cascades inter-connect into a partial cellular regulatory network. HGH1 LOC1 NOC3 YDR152W YHL013C BUD19 RPL13A RPL13B RPL16A RPL24B RPL28 RPL2B RPL33B RPL39 RPL42B RPS16B RPS18A RPS21B RPS22A RPS23B RPS4B RPS6A BAT1 ILV2 LEU9 YBL113C YRF11 YRF1-7 RCS1 MET10 MET14 MET28 SUL2 RGM1 Leu3 GAL4 ALD5 ARG1 ARG2 ARG3 ARG4 ARG5,6 ARO1 ARO3 ARO4 ASN1 BNA1 CPA2 DED81 ECM40 HIS1 HIS4 HOM3 LEU4 LEU9 LYS20 MET22 ORT1 PCL5 TEA1 TRP2 YBR043C YDR341C YHM1 YHR162W YJL200C YJR111C MET4 PDR1 GCN4 YBL113C YRF1-1 YRF1-3 YRF1-7 YAP5 SWI6 GAT3 YBL113C YIL177C YJL225C YLR464W YML133C YRF1-1 YRF1-3 YRF1-4 YRF1-5 YRF16 YRF1-7 SWI5 MBP1 CDC45, RAD27, RNR1, SPT21, YPL267W MSN4 SWI4 NDD1 CDC21 CDC45 CLB5 CLB6 CLN1 GIN4 IRR1 MCD1 MSH6 PDS5 RAD27 RNR1 SPO16 SPT21 SWE1 TOF1 YBR070C CLB6 RNR1 SPT21 YPL267W GAS1 HTA1 HTB1 YEL077C YRF1-3 YRF1-7 ALK1 BUD4 CDC20 CDC5 CLB2 HST3 SWI5 YIL158W YJL051W YOR315W Regulation of transcription modules by transcription factors, based on ChIP-chip data and supported by recurrent expression clusters Regulation between transcription factors, based on ChIP-chip data and supported by 2nd-order expression correlation Protein interactions between two transcription factors, based on experimental data and supported by 2nd-order expression correlation Two transcription modules with high 2nd-order expression correlations Integrative Array Analyzer (iArray): a software package for cross-platform and cross-species microarray analysis Bioinformatics. 22(13):1665-7 Gene Aging Nexus (GAN): An Integrated Genomics Data Mining Platform for Aging Nucleic Acids Res. 2006 Nov Acknowledgement CoDense Cancer Network Module • Haiyan Hu • Min Xu NetBiclustering • Haifeng Li • Yu Huang NeMo • Xifeng Yan (IBM) • Mike Mehan Second-order TF cascade • Ming-Chih Kao (U. Michigan) • Haiyan Huang (UC Berkeley) • Wing H. Wong (Stanford) Consultation • Jiawei Han (UIUC) • Michael Waterman Funding agencies: NIGMS, NCI, NSF, Seaver Foundation, Sloan Foundation, Zumberg Foundation Thank you!

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download PPT - NUS