Promoter Discovery: A Correlation Mining Approach
Yi Lu
Department of Computer Science
Wayne State University

Outline
Introduction
Related Work
Problem Definition
Correlation Mining
Conclusion and Future Work

Introduction
Central dogma of gene expression: DNA -> (transcription) -> RNA -> (translation) -> Protein

Introduction
The promoter region of a gene (a set of transcription factor binding sites) acts like a light switch: it signals when to turn the gene on and off.
We are interested in the relationship between the promoter region and gene expression, i.e., which binding sites determine whether a gene is expressed or not?

Introduction - Microarray
Microarray chips produce images that are scanned by laser. Each scan yields one expression value per gene:

Gene         Value
D26528_at    193
D26561_at    70
D26579_at    318
D26598_at    1764
D26599_at    1537
D26600_at    1204
D28114_at    707
H29189_at    899
G29183_at    9210

Several such datasets (D1, D2, D3, D4, ...) are combined into a time course:

Gene         Day 1   Day 2   Day 3   ...
D26528_at    193     4157    556
D26561_at    70      11557   476
D26579_at    318     12125   498
D26598_at    1764    8484    1211
D26599_at    1537    3537    131
D26600_at    1207    4578    94
D28114_at    707     2431    209
...

Introduction
Transcription factor binding sites (motifs) in the promoter region should "explain" changes in transcription.
[Figure: promoter sequences of a set of genes, with a shared motif marked, shown next to their expression responses R(t1) and R(t2) over the time course.]

Related Work
Cluster the gene expression profiles.
Search for motifs in the promoter regions of the clustered genes.
[Figure: the promoter regions of Gene 1 to Gene 6 are grouped by clustering, and the motif TTGTGC is found in the clustered promoters.]

Related Work
Clustering partitions the N genes into a set of disjoint groups so that the expression profiles of genes in the same group are highly similar to each other, and the expression profiles of genes in different groups are dissimilar.
Most widely used algorithms: K-means clustering and hierarchical clustering.
Genetic K-means algorithms (Lu et al. 2003, 2004).

Related Work
Motif discovery after clustering: given a set of upstream sequences of coexpressed genes, find subsequences that are overrepresented and significant enough to be distinguished from other subsequences.
MEME, Gibbs Sampling, Winnower algorithms.
PDC algorithm (Lu et al. 2006).
These methods usually have a high false positive rate.
[Figure: upstream sequences of several genes annotated with the candidate motifs CCAAT, GATAC, TCGACTGC and GCAGTT.]

Motivation
Research indicates that multiple transcription factor binding sites are involved in each transcription process.
This leads us to study modules (pairs of motifs) instead of single motifs.

Motivation
Not all genes that contain the same motif show the same gene expression change.
Not all genes with the same gene expression change contain the same motif.

Gene      Promoter sequence
Gene 1    ATCTTGTGCACATGTACTAC
Gene 2    AGCTAGTTGTGCACACACTT
Gene 3    AATTTCGTTGTGCGCTATAT
Gene 4    GAGCTCTTGTGCAGCTAGTA
Gene 5    TTCGTTGTGCGCTATATAGA
Gene 6    TTGTGCAGCTAGTAGAGCTC
Gene 7    CTAGAGCTCTATTTGTGCCG
Gene 8    ATTGCGGGGCGTCTGAGCTC
Gene 9    TTTGCTCTTTTGTGCCGCTT

Problem Definition
Gene expression by day:

Gene         Day0     Day3     Day6     ...
Mm.100117    16.75    65.3     119.15
Mm.100118    150.85   137.55   130.55
Mm.100125    84.55    96.9     119.15
Mm.10154     84.55    96.9     119.15
Mm.10174     223.05   181.55   200.9
Mm.10178     16.75    65.3     119.15
Mm.10182     79.6     80.3     94.75

Module presence (1 = present, 0 = absent), one column per module (ETSFETSF, NFKBSTAT, STATETSF, ...), listed in gene order as extracted from the slide:
1 0 0 0 0 0 1
0 0 1 0 1 0 0
0 0 0 1 0 0 1

Given a list of genes, the corresponding module presence information and the gene expression information, find the relationship between modules and gene expression, i.e., which modules or module combinations may be related to a gene expression change.
Example: M1 M2 => increase in gene expression from Day 1 to Day 4

Method - Quantify Gene Expression
Expression of Mm.116803 by day:

Day      1      4      8      11     14     18    21    26    29     60
Value    189.9  398.3  224.1  123.4  602.7  2218  8624  9901  11748  18519

Log ratio of consecutive time points, with the mean and bounds for each interval:

Days             1-4     4-8    8-11   11-14  14-18  18-21   21-26   26-29  29-60
log10(Di+1/Di)   0.322   -0.25  -0.26  0.689  0.566  0.59    0.06    0.074  0.198
Mean             0.014   0.006  0.006  0.017  0.04   0.063   0.052   0.019  0.044
Lower Bound      -0.110  -0.15  -0.12  -0.23  -0.22  -0.165  -0.225  -0.22  -0.32
Upper Bound      0.138   0.165  0.132  0.269  0.297  0.291   0.328   0.258  0.410

[Figure: chart over the intervals Day 1-4 through Day 29-60, with a y-axis from -0.8 to 1.]

Method - Quantify Gene Expression
Ei = log10(Di+1/Di). Each Ei is then labeled + (above the upper bound), - (below the lower bound) or 0 (within the bounds):

                 E1      E2     E3     E4     E5     E6      E7      E8     E9
Days             1-4     4-8    8-11   11-14  14-18  18-21   21-26   26-29  29-60
Ei               0.322   -0.25  -0.26  0.689  0.566  0.59    0.06    0.074  0.198
Lower Bound      -0.110  -0.15  -0.12  -0.23  -0.22  -0.165  -0.225  -0.22  -0.32
Upper Bound      0.138   0.165  0.132  0.269  0.297  0.291   0.328   0.258  0.410
Mm.116803        +       -      -      +      +      +       0       0      0

Method - Generate Frequent Module Set

          M1  M2  M3  M4
Gene 1    1   1   0   0
Gene 2    1   0   1   0
Gene 3    1   1   1   0
Gene 4    1   1   0   1

Occurrence counts of the candidate module sets (a set is frequent if its occurrence >= 2):
M1 (4), M2 (3), M3 (2), M4 (1)
M1M2 (3), M1M3 (2), M2M3 (1)
M1M2M3 (1)

Method - Generate Frequent Gene Expression Set

          E1  E2  E3
Gene 1    +   +   0
Gene 2    0   -   -
Gene 3    +   -   -
Gene 4    0   -   0

The same data in binary form:

          E1+  E2+  E3+  E1-  E2-  E3-
Gene 1    1    1    0    0    0    0
Gene 2    0    0    0    0    1    1
Gene 3    1    0    0    0    1    1
Gene 4    0    0    0    0    1    0

Occurrence counts of the candidate gene expression sets (a set is frequent if its occurrence >= 2):
E1+ (2), E1- (0), E2+ (1), E2- (3), E3+ (0), E3- (2)
E1+E2- (1), E1+E3- (1), E2-E3- (2)

Correlation Measure - Contingency Table
The relation between u and v in the pair (u, v):

          v     ^v
u         O11   O12   R1
^u        O21   O22   R2
          C1    C2    N

Liddell Measure

          E1+      ^E1+
M2        O11=2    O12=1    R1=3
^M2       O21=0    O22=1    R2=1
          C1=2     C2=2     N=4

Liddell = (O11*O22 - O12*O21) / (C1*C2) = (2*1 - 1*0) / (2*2) = 0.5

Method - Correlate Module Set with Gene Expression Set
Minimize the module set.
Maximize the gene expression set.

          M1  M2  M3  M4    E1  E2  E3
Gene 1    1   1   0   0     +   +   0
Gene 2    1   0   1   0     0   -   -
Gene 3    1   1   1   0     +   -   -
Gene 4    1   1   0   1     0   -   0

Liddell values:

          E1+   E2-       E3-    E2-E3-
M1        0     0         0      0
M2        0.5   -0.3333   -0.5   -0.5
M3        0     0.66667   1      1
M1M2      0.5   -0.3333   -0.5   -0.5
M1M3      0     0.66667   1      1

With the minimum Liddell value set to 0.5 (or -0.5 for negative correlation), the result sets are:
M2 -> E1+
M2 -> ^(E2- E3-)
M3 -> E2- E3-

Result on Spermatogenesis
Spermatogenesis is the biological process of sperm formation.
Two gene expression datasets were downloaded from GEO (Gene Expression Omnibus).
The time course of one dataset covers days 0, 3, 6, 8, 10, 14, 18, 20, 30, 35 and 56; the other covers days 1, 4, 8, 11, 14, 18, 21, 26, 29 and 60.
[Figure: concordance (y-axis, 0 to 0.6) plotted against the Liddell threshold (x-axis, 0.5 to 0.8).]

System Workflow
GEO: Gene Expression Omnibus
DBTSS: DataBase of Transcriptional Start Sites
TRANSFAC: the Transcription Factor database
JASPAR: the high-quality transcription factor binding profile database
[Figure: workflow diagram. Labels: GEO, cDNA, Gene IDs, Expression Data, DBTSS, Gene Expression Clustering, Upstream Sequences, Clustered Genes, Motif Discovery, Motifs, K-SPMM, Motif Matrices (TRANSFAC, JASPAR), Modules, Correlation Mining of Modules.]

Conclusion
Not only the same module combinations, but also the same genes containing those module combinations, were recovered from both datasets.
The promoters detected by our approach are statistically more significant than those found in randomly generated datasets.
Some promoters found by our approach are confirmed by the literature.

Future Work
The concordance between the two gene expression datasets downloaded from GEO is low; a new method to reconcile the differences between the two datasets is needed.
The number of motifs found by the different algorithms is overwhelming; we may incorporate weight matrices and gene ontology to identify the significant ones.

References

Gene Expression Clustering:
Yi Lu, Shiyong Lu, Farshad Fotouhi, Youping Deng and Susan Brown, "FGKA: A Fast Genetic K-means Clustering Algorithm", in Proceedings of the 19th ACM Symposium on Applied Computing, Nicosia, Cyprus, March 2004.
Yi Lu, Shiyong Lu, Farshad Fotouhi, Youping Deng and Susan Brown, "Incremental Genetic K-means Algorithm and its Application in Gene Expression Data Analysis", BMC Bioinformatics, 5(172), October 2004.

Motif Discovery:
Yi Lu, Shiyong Lu, Farshad Fotouhi, Yan Sun and Zijiang Yang, "PDC: Pattern Discovery with Confidence in DNA Sequences", in Proceedings of the IASTED International Conference on Advances in Computer Science and Technology (ACST 2006), Puerto Vallarta, Mexico, January 2006.

Motif Extraction, Module Integration:
Adrian E. Platts, Yi Lu and Stephen A. Krawetz, "K-SPMM, an Online System for Data Mining Regulatory Elements from Murine Spermatogenic Promoter Sequences", presented at the 2006 Great Lakes Mammalian Development Meeting, Toronto, March 3-5, 2006.
Yi Lu, Adrian E. Platts, Charles G. Ostermeier and Stephen A. Krawetz, "A Database of Murine Spermatogenic Promoters Modules & Motifs", submitted to BMC Bioinformatics.

Correlation Mining:
Yi Lu, Adrian Platts, Shiyong Lu, Jeffrey L. Ram and Stephen Krawetz, "Correlation Mining to Reveal the Regulation of Transcription Factor Binding Site Modules", 4th Great Lake Bioinformatics Retreat, Frankenmuth, Michigan, August 2005.
Yi Lu, Adrian Platts, Shiyong Lu, Jeffrey L. Ram and Stephen Krawetz, "Mining of Correlation Between Transcription Binding Sites and Gene Expression Profiles", in preparation.

Acknowledgements
Dr. Shiyong Lu
Dr. Stephen Krawetz
Mr. Adrian Platts
Dr. Jeffrey Ram
Dr. Youping Deng

Questions?
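
Appendix: a sketch of the worked examples
The toy examples on the Method and Liddell Measure slides can be reproduced with a short Python sketch. It is only an illustration under stated assumptions, not the implementation used in this work: the function names (log_ratios, discretize, support, frequent_sets, liddell) are hypothetical, frequent sets are found by plain brute-force enumeration because the slides do not name an algorithm, and the Liddell formula is the one read off the worked example, (O11*O22 - O12*O21) / (C1*C2).

```python
import math
from itertools import combinations

def log_ratios(values):
    """E_i = log10(D_{i+1} / D_i) for consecutive time points."""
    return [math.log10(nxt / cur) for cur, nxt in zip(values, values[1:])]

def discretize(ratios, lower, upper):
    """Label each ratio '+', '-' or '0' against its per-interval bounds."""
    return ['+' if r > hi else '-' if r < lo else '0'
            for r, lo, hi in zip(ratios, lower, upper)]

def support(table, items):
    """Count the genes whose item set contains every item in `items`."""
    return sum(1 for present in table.values() if items <= present)

def frequent_sets(table, min_count=2):
    """Brute-force enumeration (an assumption; the slides name no algorithm)."""
    universe = sorted(set().union(*table.values()))
    found = {}
    for size in range(1, len(universe) + 1):
        for combo in combinations(universe, size):
            count = support(table, set(combo))
            if count >= min_count:
                found[combo] = count
    return found

def liddell(modules, expression, module_set, expr_set):
    """Liddell value from the 2x2 contingency table of (module_set, expr_set)."""
    o11 = o12 = o21 = o22 = 0
    for gene in modules:
        has_m = module_set <= modules[gene]
        has_e = expr_set <= expression[gene]
        if has_m and has_e:
            o11 += 1
        elif has_m:
            o12 += 1
        elif has_e:
            o21 += 1
        else:
            o22 += 1
    c1, c2 = o11 + o21, o12 + o22  # column totals
    if c1 * c2 == 0:
        return None  # undefined when a column of the table is empty
    return (o11 * o22 - o12 * o21) / (c1 * c2)

# Toy data copied from the slides.
MODULES = {
    "Gene 1": {"M1", "M2"},
    "Gene 2": {"M1", "M3"},
    "Gene 3": {"M1", "M2", "M3"},
    "Gene 4": {"M1", "M2", "M4"},
}
EXPRESSION = {
    "Gene 1": {"E1+", "E2+"},
    "Gene 2": {"E2-", "E3-"},
    "Gene 3": {"E1+", "E2-", "E3-"},
    "Gene 4": {"E2-"},
}
MM_116803 = [189.9, 398.3, 224.1, 123.4, 602.7, 2218, 8624, 9901, 11748, 18519]
LOWER = [-0.110, -0.15, -0.12, -0.23, -0.22, -0.165, -0.225, -0.22, -0.32]
UPPER = [0.138, 0.165, 0.132, 0.269, 0.297, 0.291, 0.328, 0.258, 0.410]

if __name__ == "__main__":
    print(discretize(log_ratios(MM_116803), LOWER, UPPER))
    print(frequent_sets(MODULES))
    print(liddell(MODULES, EXPRESSION, {"M2"}, {"E1+"}))  # 0.5, as on the slide
```

Representing each gene as a set of items keeps the support count and the contingency table to simple subset tests, which is enough for a four-gene example; a real dataset would likely need a proper frequent-itemset miner.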