Storytelling and Clustering for Cellular Signaling Pathways
M. Shahriar Hossain, Monika Akbar, Nicholas F. Polys
Department of Computer Science, Virginia Tech, Blacksburg, VA 24061.

Objective
• STKE Dataset
    Cell interactions through chemical signals
• Discover relationships between the pathways
• Graph structure
    Subgraph discovery problem
• Pathway relationships
    Clustering
    Storytelling

[Figure: Myocyte Adrenergic Pathway (CMP_9043)]

Dataset properties
Total Pathways = 50
[Figure: histogram of the number of pathways in each size range; x-axis: Size Range (1-10 through 100-110); y-axis: Number of Pathways in Size Range]

Design Pipeline
[Figure: pipeline diagram with the components STKE Dataset, Preprocessor, Pathway Graphs, Frequent Subgraph Discovery, Frequent Subgraphs, Clustering, NN, Storytelling]

Subsequent Candidate Generation
• Apriori: incremental approach [17]
• FSG [2]
• Generate a (k+1)-edge candidate subgraph by combining two k-edge subgraphs that share a common core subgraph of (k-1) edges.
• The cost of comparing subgraphs (and core subgraphs) is reduced by using a hash code of each subgraph object.
[Figure: two k-edge subgraphs over nodes l, m, n, o, p, q joined into a (k+1)-edge candidate]

Subsequent Candidate Generation
[Figure: candidate subgraphs over nodes l, m, n, o, p, q, r, s, t, z; one candidate is marked "Not generated"]
Instance:
• Number of 5-edge subgraphs: 21
• Core subgraph comparisons for s1: 20

Master Pathway Graph (MPG)
Total Unique Nodes: 1205
Total Relations: 1376

SEG - Subgraph Extension Generation
• Neighborhood Extension
• Neighborhood list: {q, r, s}
• Comparison is not required.
• The subgraph is extended from physical evidence.
[Figure: a subgraph over nodes l, m, n, o, p extended with the neighboring nodes q, r, s]

Subgraph Discovery
• What is so novel about pruning edges?
min_sup = 2%
    k:                          1; 2; 3; 4; 5
    # of subgraphs generated:   1,376; 5,380; 29,565; 187,508; 1,274,852; ---
    Time (sec.), existing:      41; 149; 971; 7518; ----

'Importance Factor' of a subgraph: sfipf
For the i-th subgraph and the j-th pathway:
    Subgraph frequency: $sf_j = \frac{1}{n_j}$
    Inverse pathway frequency: $ipf_i = \frac{|D|}{|\{p_j : s_i \in p_j\}|}$
    $sfipf_{i,j} = sf_j \times ipf_i$

Dataset Properties (sfipf)
Number of edges in MPG = 1376
Total pathways = 50
[Figure: two plots against min_sfipf (0.00 to 0.20); y-axes: # of pathways left (0 to 50) and # of edges left (0 to 1400)]

Subgraph Discovery
min_sup = 4.0%, min_sfipf = 0.01
[Figure: Time (ms) versus k (3 to 21) for FSG and SEG; y-axis up to 400x10^3 ms]

Subgraph Discovery
min_sup = 4.0%, min_sfipf = 0.01
[Figure: Time (ms) versus k (3 to 21) for FSG and SEG; y-axis up to 3000 ms]

Subgraph Discovery
min_sup = 4.0%, min_sfipf = 0.01
[Figure: # of Attempts versus k (3 to 21) for FSG and SEG; y-axis up to 1,250,000]

    k    Number of Subgraphs    Time Saved (%)    Attempts Saved (%)
    2            186                99.83               98.98
    3            246                98.33               86.15
    4            305                98.57               86.38
    5            323                98.95               86.91
    6            313                98.96               85.64
    7            279                98.88               83.25
    8            263                98.67               78.91
    9            292                98.38               74.76
    10           364                98.58               74.75
    11           470                98.76               78.08
    12           608                99.04               81.84
    13           785                99.22               85.02
    14           980                99.38               87.63
    15          1117                99.48               89.48
    16          1075                99.53               90.26
    17           804                99.51               89.40
    18           430                99.34               85.22
    19           141                98.76               71.22
    20            20                96.15                9.19
    21             1                75.74             -574.47

Overall attempts saved = 89.52%
Overall time saved = 99.39%

Clustering
• Hierarchical Agglomerative Clustering (HAC)
• k-means
• Unsupervised measure of clusters' validity
    Average Silhouette Coefficient (ASC) [19]

Clustering
min_sup = 4%, min_sfipf = 0.01
[Figure: two plots, HAC and k-means; x-axis: # of Clusters (2 to 20); y-axis: ASC (0.0 to 0.4); legend: Cosine sfipf, Dice, Jaccard, Overlap]

Clustering
[Figure: ASC contour map for 10 clusters using HAC and ASC contour map for 10 clusters using k-means; axes: min_sup and min_sfipf]
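The sfipf importance factor defined on the 'Importance Factor' slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation. It assumes that n_j is the number of distinct subgraphs found in pathway j, that D is the set of all pathways, and that a (subgraph, pathway) pair is kept when its score is at least min_sfipf; the slides do not spell out any of these. The names `sfipf_scores`, `prune`, and the toy pathway data are hypothetical.

```python
from collections import Counter


def sfipf_scores(pathways):
    """Return {(subgraph, pathway): sfipf} for every subgraph in every pathway.

    pathways maps a pathway id to the set of subgraph ids it contains.
    Assumption: n_j is the number of distinct subgraphs in pathway j.
    """
    total_pathways = len(pathways)  # |D|
    # For each subgraph, count the pathways that contain it.
    containing = Counter(s for subs in pathways.values() for s in subs)

    scores = {}
    for pid, subs in pathways.items():
        if not subs:
            continue  # a pathway without subgraphs contributes no scores
        sf = 1.0 / len(subs)  # sf_j = 1 / n_j
        for s in subs:
            ipf = total_pathways / containing[s]  # ipf_i = |D| / |{p_j : s_i in p_j}|
            scores[(s, pid)] = sf * ipf
    return scores


def prune(scores, min_sfipf):
    """Keep only pairs whose score reaches the threshold (assumed direction)."""
    return {key: value for key, value in scores.items() if value >= min_sfipf}


# Hypothetical toy data: three pathways and four subgraphs.
toy = {
    "p1": {"s1", "s2"},
    "p2": {"s1"},
    "p3": {"s1", "s2", "s3", "s4"},
}
scores = sfipf_scores(toy)
kept = prune(scores, 0.5)
```

In this toy data, s1 occurs in every pathway, so its ipf is 1 and its score in the large pathway p3 is low; the rare subgraphs s3 and s4 score higher in the same pathway.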
Pathway Relations (StoryTelling)
• Bidirectional Search
• Cover tree for NN
[Figure: search between S and T through intermediate pathways p1, p2, p3, p7, p8, p9]

Day-to-day life example
From Roman Holiday: Roman Holiday, Sabrina, Funny Face, Deep in my Heart, Singing in the rain, An American in Paris, Kismet, Take me out to the Ball Game, On the Town, Anchors Aweigh, High Society, Kiss me Kate
From Terminator 3: Terminator 3, Collateral damage, Blade: Trinity, van Helsing, Lethal Weapon 4, Die Hard 2, Speed, Air Force One
From: / To: (titles in slide order): Roman Holiday, Sabrina, Breakfast at Tiffany's, Some Like it Hot, The day after Tomorrow, Die Another Day, Golden Eye, U.S. Marshals, Rear Window, Terminator 3, S.W.A.T., 2001: A Space Odyssey, Roman Holiday, Terminator 3

Examples in STKE
http://people.cs.vt.edu/msh/infoviz/3/

Pathway Relations (StoryTelling)
Numbers of varying length stories for different branching factor
[Figure: Number of t-length stories (0 to 350) versus story length t (3 to 16) for b = 2, 4, 6, 8]

Pathway Relations (StoryTelling)
Numbers of varying length stories for different branching factor
[Figure: Number of t-length stories (0 to 350) versus story length t (3 to 16) for b = 2 through 10]

Pathway Relations (StoryTelling)
[Figure: three plots against branching factor b (2 to 10); y-axes: Length of the longest story, Total stories from all pairs, Time to generate all stories (ms)]

Future Directions
• Compare our SEG graph methods with text-based clustering and storytelling
• Examine costs and benefits of combining text and graph mining techniques

References
[1] Science Signaling, The Signal Transduction Knowledge Environment (STKE), "The Database of Cell Signaling", http://stke.sciencemag.org/cm/
[2] Kuramochi, M. and Karypis, G., "An efficient algorithm for discovering frequent subgraphs", IEEE Transactions on KDE, Vol. 16(9), September 2004, pp. 1038-1051.
[3] Breslin, T., Krogh, M., Peterson, C., and Troein, C., "Signal transduction pathway profiling of individual tumor samples", BMC Bioinformatics, June 29, 2005.
[4] Kumar, D., Ramakrishnan, N., Helm, R. F., and Potts, M., "Algorithms for Storytelling", IEEE Transactions on KDE, Vol. 20(6), June 2008, pp. 736-751.
[5] Ratprasartporn, N., Cakmak, A., and Ozsoyoglu, G., "On Data and Visualization Models for Signaling Pathways", 18th SSDBM, 2006, pp. 133-142.
[6] Xu, X., and Yu, Y., "Modeling and Verifying WNT Signaling Pathway", 3rd Intl. Conf. on ICNC, 2007, Vol. 2, pp. 319-323.
[7] Schreiber, F., "Comparison of metabolic pathways using constraint graph drawing", 1st Asia-Pacific Bioinformatics Conf. on Bioinfo., Australia, Vol. 19, 2003, pp. 105-110.
[8] Abello, J., van Ham, F., and Krishnan, N., "ASK-GraphView: A Large Scale Graph Visualization System", IEEE Transactions on Visualization and Computer Graphics, Vol. 12(5), 2006, pp. 669-676.
[9] Miyake, S., Tohsato, A., Takenaka, Y., and Matsuda, H., "A clustering method for comparative analysis between genomes and pathways", 8th Intl. Conf. on Database Systems for Advanced Applications, March 2003, pp. 327-334.
[10] Yan, X., and Han, J., "gSpan: graph-based substructure pattern mining", IEEE ICDM, 2002, pp. 721-724.
[11] Moti, C., and Ehud, G., "Diagonally Subgraphs Pattern Mining", 9th ACM SIGMOD workshop on Research issues in data mining and knowledge discovery, 2004, pp. 51-58.
[12] Ketkar, N., Holder, L., Cook, D., Shah, R., and Coble, J., "Subdue: Compression-based Frequent Pattern Discovery in Graph Data", ACM KDD Workshop on Open-Source Data Mining, August 2005, pp. 71-76.
[13] Zhang, T., Ramakrishnan, R., and Livny, M., "BIRCH: An Efficient Data Clustering Method for Very Large Databases", ACM SIGMOD Intl. Conf. on Management of Data, Canada, 1996, pp. 103-114.
[14] Wagstaff, K., Cardie, C., Rogers, S., and Schroedl, S., "Constrained K-means Clustering with Background Knowledge", ICML 2001, pp. 577-584.
[15] Lin, F., and Hsueh, C. M., "Knowledge map creation and maintenance for virtual communities of practice", Intl. Journal of Information Processing and Management, ACM, Vol. 42(2), 2006, pp. 551-568.
[16] Beygelzimer, A., Kakade, S., Langford, J., "Cover trees for nearest neighbor", ICML 2006, pp. 97-104.
[17] Agrawal, R., and Srikant, R., "Fast Algorithms for Mining Association Rules", Intl. Conf. on Very Large Data Bases, Santiago, Chile, September 1994, pp. 487-499.
[18] Agrawal, R., Mehta, M., Shafer, J., Srikant, R., Arning, A., and Bollinger, T., "The Quest Data Mining System", KDD'96, USA, 1996, pp. 244-249.
[19] Tan, P. N., Steinbach, M., and Kumar, V., "Introduction to Data Mining", Addison-Wesley, ISBN: 0321321367, April 2005, pp. 539-547.
[20] http://people.cs.vt.edu/amonika/infoviz/

Thank You