DISCOVERING LARGER NETWORK MOTIFS
Li Chen
4/16/2009
CSC 8910 Analysis of Biological Network, Spring 2009
Dr. Yi Pan

THE REVIEW ON MODELS AND ALGORITHMS FOR MOTIF DISCOVERY IN PROTEIN-PROTEIN INTERACTION NETWORKS

Two distinct definitions of a motif, based on frequency and statistical significance
Definition 1: a motif is a sub-graph that appears more than a threshold number of times.
Definition 2: a motif is a sub-graph that appears more often than expected by chance (an over-represented motif).

Two characteristics used to evaluate a motif

Frequency, counted under one of three overlap rules:
1. Arbitrary overlaps of nodes and edges (non-identical case)
2. Only overlaps of nodes (edge-disjoint case)
3. No overlaps (edge- and vertex-disjoint case)

Statistical Significance: compares the frequencies obtained for the observed network with those obtained for random networks.
1. Z-score
2. Abundance

Models of Random Graphs
- Preserve the same degree distribution as biological networks
- Preserve the degree sequence (search of n-node motifs)
- Based on geometric random networks and a Poisson distribution of the degree
- Incorporate node clustering into the model

3. Compact Topological Motifs: introduces a compact graph representation obtained by grouping together maximal sets of nodes that are 'indistinguishable'. The graph on the left shows the sets U1 and U2 as compact nodes and U1U2 as a compact edge.

Motif Discovery Algorithm

Exact algorithms for motifs with a small number of nodes
1. Exhaustive Recursive Search (ERS): the input network is represented by an adjacency matrix M. (motif size <= 4)
2. ESU: starts with individual nodes and adds one node at a time until the required size k is reached. (motif size <= 14)

Approximate Algorithms
1. Search Algorithm Based on Sampling (MFINDER): it picks edges of the input graph at random until a set of k nodes is obtained, which gives a sample sub-graph, and assigns weights to the samples to correct for the non-uniform sampling. It scales well with large networks, but does not scale well with large motifs.

2. Rand-ESU: unlike MFINDER, it does not need to compute the weights of all samples. ESU builds a tree whose leaves correspond to sub-graphs of size k, while internal nodes correspond to sub-graphs of size 1 up to k-1, depending on the tree level. Rand-ESU assigns to each level of the tree a probability that its nodes are explored further, so as to guarantee that all leaves are visited with uniform probability.

3. NeMoFINDER: combines approaches from the data mining and computational biology communities. It searches for repeated trees and extends them to sub-graphs. This reduces the computation time for discovering larger motifs, but at the cost of missing some potentially interesting sub-graphs.

4. Sub-graph Counting by Scalar Computation: it characterizes a biological network by a set of measures based on scalars and functionals of the adjacency matrix associated with the network. Its advantages are mathematical elegance and computational efficiency.
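
As a rough illustration of counting sub-graphs from scalars of the adjacency matrix, the sketch below uses one familiar instance of the idea: for a simple undirected graph with adjacency matrix A, trace(A^3) / 6 is the number of triangles, and the degree sequence gives the number of 3-node paths. This is not necessarily the set of measures used in the reviewed work, and the function names (mat_mul, count_triangles, count_wedges) are made up for this example.

```python
from math import comb


def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def count_triangles(adj):
    """Count triangles in a simple undirected graph via trace(A^3) / 6.

    Each triangle contributes 6 closed walks of length 3
    (3 starting nodes times 2 directions), hence the division by 6.
    """
    a3 = mat_mul(mat_mul(adj, adj), adj)
    trace = sum(a3[i][i] for i in range(len(adj)))
    return trace // 6


def count_wedges(adj):
    """Count paths of length 2 (pairs of edges sharing a centre node).

    A node of degree d is the centre of C(d, 2) such paths. Every triangle
    contains 3 of them, so wedges - 3 * triangles gives the number of
    3-node paths whose end nodes are not adjacent.
    """
    return sum(comb(sum(row), 2) for row in adj)


# Hypothetical 4-node network: a triangle 0-1-2 with a pendant node 3 on node 2.
adj = [
    [0, 1, 1, 0],
    [1, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 1, 0],
]
triangles = count_triangles(adj)
wedges = count_wedges(adj)
print(triangles, wedges, wedges - 3 * triangles)
```

For the example network this finds one triangle and five length-2 paths, two of which (0-2-3 and 1-2-3) are open. No sub-graph is enumerated explicitly, which is where the efficiency comes from, but closed-form counts of this kind become harder to derive as the motif size grows.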
5. A-priori-based Motif Detection: the basic idea is that if a sub-graph is frequent, so are all of its sub-graphs. It builds candidate motifs of size k by joining motifs of size k-1 and then evaluates their frequency.

A ROADMAP OF CLUSTERING ALGORITHM IN BIOINFORMATICS APPLICATIONS

Desirable features to evaluate in clustering algorithms
- Scalability
- Robustness
- Order insensitivity
- Minimum user-specified input
- Mixed data types
- Arbitrary-shaped clusters
- Point proportion admissibility: duplicating data and reclustering should not alter the results.

Six categories of clustering algorithms
- Partitioning Clustering Algorithm
- Hierarchical Clustering Algorithm
- Grid-based Clustering Algorithm
- Density-based Clustering Algorithm
- Model-based Clustering Algorithm
- Graph-based Clustering Algorithm

Partition Clustering Algorithm

Numerical Methods
1. K-means algorithm and Farthest First Traversal k-center (FFT) algorithm
2. K-medoids or PAM (Partitioning Around Medoids)
3. CLARA (Clustering Large Applications)
4. CLARANS (Clustering Large Applications Based upon Randomized Search) and Fuzzy K-means

Discrete Methods
1. K-modes
2. Fuzzy K-modes
3. Squeezer and COOLCAT

Mixed Discrete and Numerical Clustering Methods
1. K-prototypes

Hierarchical Clustering Algorithm

Divides the data into a tree of nodes, where each node represents a cluster.

Two categories based on methods or purposes
1. Agglomerative vs. Divisive
2. Single vs. Complete vs. Average linkage

Popular: nature can have various levels of subsets

Drawbacks:
1. Slow
2. Errors are not tolerated
3. Information is lost when moving between levels

Two kinds of methods
1. Numerical Methods: BIRCH, CURE, Spectral clustering
2. Discrete Methods: ROCK, Chameleon, LIMBO

Grid-based Clustering Algorithm

Forms a grid structure of cells from the input data. Each data item is then assigned to a cell of the grid.
STING combines a numerical grid-based clustering method with a hierarchical method.

Density-based Clustering Algorithm
- Uses a local density standard
- Clusters are dense subspaces separated by low-density spaces
- Example of a bioinformatics application: finding the densest subspaces in interactome (protein-protein interaction) networks

DBSCAN, OPTICS, DENCLUE, WaveCluster and CLIQUE use numerical values for clustering.
SEQOPTICS is used for sequence clustering.
HIERDENC (Hierarchical Density-based Clustering), MULIC (Multiple Layer Incremental Clustering), Projected (subspace) clustering, CACTUS, STIRR, CLICK and CLOPE use discrete values for clustering.

Model-based Clustering Algorithm

Uses a model, often derived from a statistical distribution.

Bioinformatics applications
1. Gene expression
2. Interactomes
3. Sequences

Numerical model-based methods
1. Self-Organizing Maps

Discrete model-based clustering algorithm
1. COBWEB

Numerical and discrete model-based clustering methods
1. BILCOM (Bi-level clustering of Mixed Discrete and Numerical Biomedical Data), using an empirical Bayesian approach

Examples
1. Gene expression clustering
2. Protein sequence clustering
3. AutoClass
4. SVM Clustering methods

Graph-based Clustering Algorithm

Applied to interactomes for complex prediction and to sequence networks.

Examples:
1. MCODE (Molecular Complex Detection)
2. SPC (Super Paramagnetic Clustering)
3. RNSC (Restricted Neighborhood Search Clustering)
4. MCL (Markov Clustering)
5. TribeMCL
6. CD-HIT
7. ProClust
8. BAG algorithms

Usage in Bioinformatics Applications

Gene expression clustering
1. K-means algorithm
2. Hierarchical algorithm
3. SOMs

Interactomes
1. AutoClass
2. SVM clustering
3. COBWEB
4. MULIC

Sequence clustering
1. Hierarchical clustering algorithm

REFERENCES
[1] Bill Andreopoulos, Aijun An, Xiaogang Wang, and Michael Schroeder. A roadmap of clustering algorithms: finding a match for a biomedical application. Brief Bioinform, pages bbn058+, February 2009.
[2] Alberto Apostolico, Matteo Comin, and Laxmi Parida. Bridging Lossy and Lossless Compression by Motif Pattern Discovery. Electronic Notes in Discrete Mathematics, 21:219-225, 2005. General Theory of Information Transfer and Combinatorics.
[3] Giovanni Ciriello and Concettina Guerra. A review on models and algorithms for motif discovery in protein-protein interaction networks. Brief Funct Genomic Proteomic, 7(2):147-156, 2008.
[4] Jun Huan, Wei Wang, and Jan Prins. Efficient Mining of Frequent Subgraphs in the Presence of Isomorphism. Data Mining, IEEE International Conference on, 0:549, 2003.
[5] Michihiro Kuramochi and George Karypis. Finding Frequent Patterns in a Large Sparse Graph. Data Mining and Knowledge Discovery, 11(3):243-271, November 2005.
[6] Laxmi Parida. Discovering Topological Motifs Using a Compact Notation. Journal of Computational Biology, 14(3):300-323, 2007.

Thank you so much!