PROGRESS REVIEW
Mike Langston's Research Team
Department of Computer Science, University of Tennessee
with collaborative efforts at Oak Ridge National Laboratory
June 27, 2005

Team Members in Attendance: Bhavesh Borate, Suman Duvvuru, John Eblen, Mike Langston, Xinxia Peng, Andy Perkins, Jon Scharff, Henry Suters, Yun Zhang
Team Members Absent: Josh Steadmon

Mike Langston's Progress Report, Summer 2005
• Team Changes
  – Graduating Soon: Xinxia Peng, Jon Scharff
  – New Member: Andy Perkins
• Team Foci
  – FPT Tools and Applications
  – Computational Biology
• Recent Conference Talks
  – AICCSA-05 (Egypt), RTST-05 (Lebanon), DIMACS (New Jersey)
• Recent Visits
  – Cold Spring Harbor Lab (New York)
• Upcoming Conference Talks
  – ACiD-05 (England), Dagstuhl-05 (Germany), COCOON-05 (China)
• Upcoming Major Program Committee Service
  – AICCSA-06 (Program Chair), IWPEC-06 (Program Co-Chair)

John Eblen

Dr. Ivan Gerling's Data
• Details
  – Leukocyte data: 2 ages, 3 strains
  – Islet data: 3 ages, 4 strains
• Current Project: Adding Proteins
  – Add 60 proteins to the leukocyte data of 22,690 probe sets
  – How can we improve correlation?
  – What other types of analysis are possible?

General Clique Problem
• Specific Approaches
  – "Biographs", that is, graphs created from correlation values
  – Brock graphs
  – An approach for Keller graphs?
• Information Gathering
  – Markov chains
  – General graph properties
• Combining Algorithms

Additional Projects
• Fast Direct Clique Codes
  – Currently testing on DIMACS challenge graphs
  – Work continues
• Common Neighbor Preprocessing

Jon Scharff

Differential Expression
• Student's t-test on two normally distributed populations:
  – Null hypothesis: the means are equal
  – Variances assumed to be equal
• Applied on a gene-by-gene basis

Differential Correlation

Differential Cliquification
• Cliques that appear in one graph but not in the comparison graph

Nucleus Cliques / Clique Nuclei

Yun Zhang

Clique Enumeration Problem (1)
• Proposed a new maximal clique enumeration algorithm
  – Inspired by the algorithm of Kose et al.
  – Enumerates cliques in non-decreasing order of size
  – Uses bitwise operations to speed up computation and reduce space requirements
  – The sequential algorithm is parallelizable
  – Serial code is almost 400 times faster than Kose RAM on the 0.85 threshold MAS5.0 graph (size 12,422)

Clique Enumeration Problem (2)
• Space required to hold the cliques is enormous
  [Figure: memory usage (GBytes) versus clique size on a graph with 2,895 vertices]
• On the 0.7 threshold MAS5.0 graph, it used almost 1 terabyte of memory after 12 hours of running

Clique Enumeration Problem (3)
• Parallelism on a shared-memory machine
  – SGI Altix: 256 processors, 2 terabytes of shared memory, 8 GB per CPU
  – Uses a dynamic task scheduler to
    • synchronize multiple threads
    • make load-balancing decisions
  – Achieves a super-linear speedup on up to 64 processors

Clique Enumeration Problem (4)
  [Figure: run times (seconds) with and without load balancing, on up to 64 processors (threads), on a graph with 2,895 vertices]

Maximum Common Subgraph
• Clique branching algorithm by Henry (COCOON-05)
  – Takes advantage of the special structure of the association graph built from the two input graphs
• Finished serial implementation
• Preliminary performance testing on small graphs
• Next steps:
  – Benchmarking
  – Parallel implementation
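The sketch below is not Henry's COCOON-05 branching code; it is only a minimal Python illustration of the standard association-graph (modular product) construction that such a clique-based approach to maximum common subgraph builds on. The function name and the vertex/edge-list representation are hypothetical. A maximum clique in this graph corresponds to a maximum common induced subgraph of the two inputs.

```python
from itertools import combinations

def association_graph(g1_vertices, g1_edges, g2_vertices, g2_edges):
    """Modular-product ("association") graph of two simple graphs.

    Vertices are pairs (u, v); two pairs are adjacent exactly when their
    coordinates are distinct and the u's and v's are either both adjacent
    or both non-adjacent in their own graphs.  A maximum clique here
    corresponds to a maximum common induced subgraph of G1 and G2.
    """
    e1 = {frozenset(e) for e in g1_edges}
    e2 = {frozenset(e) for e in g2_edges}
    nodes = [(u, v) for u in g1_vertices for v in g2_vertices]
    edges = set()
    for (u1, v1), (u2, v2) in combinations(nodes, 2):
        if u1 == u2 or v1 == v2:
            continue  # an injective vertex mapping cannot reuse a vertex
        if (frozenset((u1, u2)) in e1) == (frozenset((v1, v2)) in e2):
            edges.add(frozenset(((u1, v1), (u2, v2))))
    return nodes, edges

# Tiny example: a triangle versus a 4-cycle
nodes, edges = association_graph(
    [1, 2, 3], [(1, 2), (2, 3), (1, 3)],
    ["a", "b", "c", "d"], [("a", "b"), ("b", "c"), ("c", "d"), ("d", "a")],
)
```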
Andy Perkins
• Working with Jon on Brynn Voy's low-dose IR mouse data
• Finding and examining paracliques in the low-dose data
• Thresholding via spectral graph theory
• Clique on the MPSS mouse data

Bhavesh Borate

Thresholding in High-Throughput Data

Ways of Getting to the Threshold

Graph & Statistical Analysis
• Graph features/characteristics
• Using confidence intervals with Bayesian statistics
• Random: 0.5% of edges in the graph

Utilizing Biological Info
• Gene Ontology
• Utilization of info from pathway databases

  [Figure: distribution of the number of edges across correlation thresholds for the Spleen, Skin, MAS5, RMA, and PDNN data]

Comparison with Other Datasets

  Data                 No. of edges   Max. degree   Max. clique size   Avg. maximal clique size
  Spleen data (0.85)        34753          349             39                20.03229
  Skin data (0.87)          32384          606             66                48.009
  MAS5 data (0.84)           3704          134             19                10.35285
  RMA data (0.92)           34814          698            116                --
  PDNN data (0.87)          34225          678             88                68.1974

Gene Ontology
  [Figure: scores from GO versus correlation scores]

Limitations
• GO data is helpful, but blind reliance on it is questionable
• Only applicable to genes with GO annotation
• For Elissa's data, Bing got inexplicable results (a rather flat curve)

Info from Pathway Databases
• What the graphs mean in a biological context
• Extrapolate info from what is "known" to the "unknown"
• Expression data from housekeeping genes is invaluable

Limitations
• Info in pathway databases is not arranged in a tissue-specific or condition-specific manner

A Combinatorial Strategy
• Gather the info and develop an algorithm to make sense of it all and suggest a threshold to the user
• Also suggest to the biologist an ideal threshold from each method
• Provide a facility for displaying the graph at each threshold
• Better still if it is interactive and dynamic (perhaps too ambitious?)
• In the end, user discretion determines the right threshold

Suman Duvvuru

Comparison of Clustering Algorithms

What is Clustering?
• Clustering:
  – Partitioning objects (in our case, genes) into dissimilar groups of similar objects
  – Cluster analysis is used to identify genes that show similar expression patterns over a wide range of experimental conditions
• Traditional definition of a "good" clustering:
  – Points assigned to the same cluster should be highly similar
  – Points assigned to different clusters should be highly dissimilar

Overview of Clustering Algorithms
• K-cores (implemented):
  – A k-core of a graph is a maximal subgraph in which every vertex has degree at least k
  – The k-cores of a graph can be generated by
    • repeatedly deleting vertices whose degree is less than k, and
    • performing a DFS on the resulting graph to find all the cores (see the sketch below)
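A minimal Python sketch of the peeling procedure just described, assuming the graph is supplied as an adjacency dictionary; the representation and function names are illustrative, not the team's implementation. Vertices of degree below k are repeatedly removed, and a DFS over the survivors then yields the cores as connected components.

```python
from collections import deque

def k_core(adj, k):
    """Vertex set of the k-core: repeatedly delete vertices of degree < k."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}   # work on a copy
    queue = deque(v for v, nbrs in adj.items() if len(nbrs) < k)
    while queue:
        v = queue.popleft()
        if v not in adj:
            continue                                   # already deleted
        for u in adj.pop(v):
            nbrs = adj.get(u)
            if nbrs is not None:
                nbrs.discard(v)
                if len(nbrs) < k:
                    queue.append(u)                    # u now falls below k
    return set(adj)

def k_core_components(adj, k):
    """Split the k-core into connected components via DFS (the 'cores')."""
    core = k_core(adj, k)
    seen, components = set(), []
    for start in core:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            v = stack.pop()
            if v in comp:
                continue
            comp.add(v)
            stack.extend(u for u in adj[v] if u in core and u not in comp)
        seen |= comp
        components.append(comp)
    return components
```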
• HCS (Highly Connected Subgraphs):
  – The edge connectivity, or simply the connectivity k(G), of a graph G is the minimum number k of edges whose removal results in a disconnected graph
  – A minimum cut, abbreviated mincut, is a cut with a minimum number of edges
  – A graph G with n vertices is called highly connected if k(G) > n/2
  – A highly connected subgraph (HCS) is an induced subgraph H such that H is highly connected
  – This algorithm identifies highly connected subgraphs as clusters

HCS Algorithm
• Uses Dinic's algorithm to compute the mincut; the complexity of this computation is O(nm^(2/3))
• Guarantees only about half the edge density of our clique-based method

HCS: An Example

Other Clustering Methods
• Using the Cluster 3.0 software:
  – K-means
  – Hierarchical clustering
• Disadvantage:
  – None of these methods allows a single gene to be present in multiple clusters

Quality Assessment
• Different measures of the quality of a clustering solution are applicable in different situations
• The choice depends on the data and on the availability of the true solution
• If the true solution is known and we wish to compare another solution against it, we can use the Minkowski measure or the Jaccard coefficient (see the sketch below)
• When the true solution is not known, edge density, homogeneity and separation, and average silhouette are used as evaluation criteria
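As a small illustration of the pair-based form of the Jaccard coefficient mentioned above, here is a sketch under the assumption that each clustering is given as a list of gene sets (this is not the team's evaluation code): it counts gene pairs co-clustered in both solutions relative to pairs co-clustered in at least one.

```python
from itertools import combinations

def co_clustered_pairs(clusters):
    """All unordered gene pairs placed in the same cluster."""
    pairs = set()
    for cluster in clusters:
        for a, b in combinations(sorted(cluster), 2):
            pairs.add((a, b))
    return pairs

def jaccard_coefficient(solution, truth):
    """Jaccard = |pairs co-clustered in both| / |pairs co-clustered in either|."""
    p1, p2 = co_clustered_pairs(solution), co_clustered_pairs(truth)
    union = p1 | p2
    return len(p1 & p2) / len(union) if union else 1.0

# Example: comparing a computed clustering against a known partition
computed = [{"g1", "g2", "g3"}, {"g4", "g5"}]
known = [{"g1", "g2"}, {"g3", "g4", "g5"}]
print(jaccard_coefficient(computed, known))  # 0.33 (2 shared pairs out of 6)
```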