Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Graph mining in bioinformatics Laur Tooming Graphs in biology • Graphs are often used in bioinformatics for describing processes in the cell • Vertices are genes or proteins • The meaning of an edge depends on the type of the graph – Protein-protein interaction – Gene regulation What we’re looking for • We want to find sets of genes that have a biological meaning. • Idea: find graph-theoretically relevant sets of vertices and find out if they are also biologically meaningful. • Simple example: connected components • A more advanced idea: graph clustering. Find subgraphs that have a high edge density. Markov Cluster Algorithm (MCL) • If there is cluster structure in a graph, random walks tend to remain in a cluster for a long time • Graph modelled as a stochastic matrix: sum of entries in a column is 1 • aij - probability that randomly walking out of j will go to i on the next step • Bigger edge weight means greater probability of choosing that edge Stijn van Dongen, Graph Clustering by Flow Simulation. PhD thesis, University of Utrecht, May 2000. http://micans.org/mcl/ Markov Cluster Algorithm (MCL) • Two procedures, inflation and expansion, are applied alternatively • Expansion: matrix squaring – considers longer random walks • Inflation: raising entries to some power, rescaling to remain stochastic – Weakens weak edges and strengthens strong ones • Converges to a steady state Markov Cluster Algorithm (MCL) Images from http://micans.org/mcl/ani/mcl-animation.html Betweenness centrality clustering • An edge between different clusters is on many shortest paths from one cluster to another. • An edge inside a cluster is on less shortest paths, because there are more alternative paths inside a cluster. • Betweenness centrality of an edge - the number of shortest paths in the graph containing that edge. • Remove edges with the highest centrality from the graph to obtain clustering. • Optimisations: – instead of all shortest paths, pick a sample of vertices and calculate shortest paths from them – remove several edges at once GraphWeb • Web interface for analysing biological graphs • Simple syntax for entering graphs – multiple datasets – directed edges – edge weights • Visualising graphs with GraphViz • Finding biological meaning with g:Profiler ds1: ds2: ds1: ds2: A > B 10 A > B 4 B C 5 C > D 12 Combining several datasets • Whether or not there is an edge between two vertices is determined in biological experiments, which may sometimes give false results. • For a given graph different sources may give different information. Some sources may be more trustworthy than others. • We would like to combine different sources and assess the trustworthyness of each edge in the resulting graph. • Edge weight in summary graph: sum over datasets – w(e,G) = Σ w(e,Gi)*w(Gi) Combining several datasets The end