DISCOVERING LARGER NETWORK
MOTIFS
Li Chen
4/16/2009
CSC 8910 Analysis of Biological Network, Spring 2009
Dr. Yi Pan
THE REVIEW ON MODELS AND
ALGORITHMS FOR MOTIF DISCOVERY
IN PROTEIN-PROTEIN INTERACTION
NETWORKS

Two distinct definitions of a motif based on frequency and
statistical significance

Definition 1: a motif is a sub-graph that appears more than
a threshold number of times.

Definition 2: a motif is a sub-graph that appears more often
than expected by chance (an over-represented motif).

Two characteristics used to evaluate a motif

Frequency:
1. Arbitrary overlaps of nodes and edges (non-identical case)
2. Only overlaps of nodes (edge-disjoint case)
3. No overlaps (edge and vertex-disjoint case)
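
The three counting rules above can be compared on a small list of occurrences. The following is a minimal sketch, assuming each occurrence is given as a list of edges; the function name `count_occurrences` and the mode names are illustrative and do not come from the review. Finding the largest set of disjoint occurrences is a maximum independent set problem, which is hard in general, so the two disjoint modes use a simple greedy pass and return a lower bound rather than the exact value.

```python
def count_occurrences(occurrences, mode="any"):
    """Count motif occurrences under one of three overlap rules.

    occurrences: list of occurrences, each a list of (u, v) edges.
    mode: "any" (arbitrary overlaps), "edge_disjoint" (only nodes may be
          shared) or "vertex_disjoint" (nothing may be shared).
    The disjoint modes are greedy, so they give a lower bound.
    """
    if mode == "any":
        return len(occurrences)
    if mode not in ("edge_disjoint", "vertex_disjoint"):
        raise ValueError(f"unknown mode: {mode}")
    used = set()   # edges or vertices already claimed by an accepted occurrence
    count = 0
    for occ in occurrences:
        edges = {frozenset(e) for e in occ}
        if mode == "edge_disjoint":
            items = edges
        else:
            items = {v for e in edges for v in e}
        if used.isdisjoint(items):
            used |= items
            count += 1
    return count


# Three triangles: the second shares only node "c" with the first,
# the third shares the edge a-b with the first.
triangles = [
    [("a", "b"), ("b", "c"), ("a", "c")],
    [("c", "d"), ("d", "e"), ("c", "e")],
    [("a", "b"), ("b", "f"), ("a", "f")],
]
for mode in ("any", "edge_disjoint", "vertex_disjoint"):
    print(mode, count_occurrences(triangles, mode))
```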

Statistical Significance: compares the obtained values of
the frequencies for the observed and random networks.
1. Z-score
2. Abundance
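
Both measures can be computed from the frequency in the observed network and the frequencies in an ensemble of random networks. The sketch below assumes the usual definitions: the Z-score is the difference between the observed frequency and the random mean, divided by the random standard deviation, and the abundance is the normalized difference (observed - mean) / (observed + mean + eps). The default `eps=4.0` and the use of the population standard deviation are assumptions made for illustration, not values stated on the slide.

```python
from statistics import mean, pstdev


def z_score(f_observed, f_random):
    """Z-score of an observed motif frequency against random-network frequencies."""
    mu = mean(f_random)
    sigma = pstdev(f_random)  # population standard deviation (assumed choice)
    if sigma == 0:
        raise ValueError("random frequencies have zero variance")
    return (f_observed - mu) / sigma


def abundance(f_observed, f_random, eps=4.0):
    """Normalized difference between observed and mean random frequency.

    eps keeps the value small when the sub-graph is rare in both networks.
    """
    mu = mean(f_random)
    return (f_observed - mu) / (f_observed + mu + eps)


# Hypothetical counts: 20 occurrences observed, 8/10/12 in three random networks.
print(z_score(20, [8, 10, 12]))
print(abundance(20, [8, 10, 12]))
```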

Models of Random Graphs

Preserve the same degree distribution as the
biological network

Preserve degree sequence (search of n-node motifs)

Based on geometric random networks and Poisson
distribution of the degree

Incorporate node clustering into model
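
One common way to obtain a random network with the same degree sequence as the input is repeated double-edge swapping. The sketch below is one such procedure under stated assumptions (undirected simple graph, at least two edges); it is not taken from the review, and the function names are illustrative.

```python
import random
from collections import Counter


def degree_preserving_shuffle(edges, n_swaps, seed=None):
    """Randomize an undirected simple graph by double-edge swaps.

    Each successful swap replaces edges (a, b) and (c, d) with (a, d) and
    (c, b), which leaves every node degree unchanged. Attempts that would
    create a self-loop or a duplicate edge are skipped, so n_swaps is the
    number of attempts, not of successful swaps.
    """
    rng = random.Random(seed)
    edge_list = [tuple(e) for e in edges]
    edge_set = {frozenset(e) for e in edge_list}
    for _ in range(n_swaps):
        i, j = rng.sample(range(len(edge_list)), 2)
        a, b = edge_list[i]
        c, d = edge_list[j]
        if len({a, b, c, d}) < 4:
            continue  # shared endpoint: the swap would create a self-loop
        new1, new2 = frozenset((a, d)), frozenset((c, b))
        if new1 in edge_set or new2 in edge_set:
            continue  # the swap would create a duplicate edge
        edge_set -= {frozenset((a, b)), frozenset((c, d))}
        edge_set |= {new1, new2}
        edge_list[i], edge_list[j] = (a, d), (c, b)
    return edge_list


def degree_sequence(edges):
    """Map each node to its degree."""
    return Counter(v for e in edges for v in e)


edges = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1), (1, 4)]
shuffled = degree_preserving_shuffle(edges, 200, seed=0)
print(degree_sequence(shuffled) == degree_sequence(edges))
```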
3. Compact Topological Motifs: introduces a compact graph
representation obtained by grouping together maximal
sets of nodes that are ‘indistinguishable’.
(Figure: the graph on the left shows the sets U1 and U2 as
compact nodes and U1U2 as a compact edge.)

Motif Discovery Algorithm

Exact algorithms for motifs with a small number of nodes
1. Exhaustive Recursive Search (ERS): the input
network is represented by an adjacency matrix M.
(motif size <= 4)
2. ESU: starts with individual nodes and adds
one node at a time until the required size k is
reached. (motif size <= 14)
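
The node-by-node extension used by ESU can be sketched as follows. This is a minimal version written for illustration, assuming the network is given as a dictionary that maps each node to the set of its neighbours and that node labels are comparable; it returns only the vertex sets and does no isomorphism classification. A sub-graph is extended only with nodes that have a larger label than the starting node and that are not already adjacent to the current sub-graph, which is what makes each connected vertex set appear exactly once.

```python
def esu(adj, k):
    """Enumerate the vertex sets of all connected induced sub-graphs of size k.

    adj: dict mapping node -> set of neighbouring nodes (undirected graph).
    """
    results = []

    def extend(sub, ext, root):
        if len(sub) == k:
            results.append(frozenset(sub))
            return
        ext = set(ext)
        # Nodes in the sub-graph or adjacent to it; they are never "exclusive".
        closed = sub | {u for s in sub for u in adj[s]}
        while ext:
            w = ext.pop()
            # Neighbours of w that are new to the sub-graph's neighbourhood.
            exclusive = {u for u in adj[w] if u > root and u not in closed}
            extend(sub | {w}, ext | exclusive, root)

    for v in adj:
        extend({v}, {u for u in adj[v] if u > v}, v)
    return results


# A triangle 1-2-3 with a pendant node 4 attached to node 3.
adj = {1: {2, 3}, 2: {1, 3}, 3: {1, 2, 4}, 4: {3}}
print(sorted(sorted(s) for s in esu(adj, 3)))
```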

Approximate Algorithms
1. Search Algorithm Based on Sampling (MFINDER): it
picks edges of the input graph at random until a set of
k nodes is obtained, which gives a sample sub-graph, and
assigns weights to the samples to correct for the non-uniform
sampling. It scales well with large networks, but does not
scale well with large motifs.
2. Rand-ESU: unlike MFINDER, it does not need to compute
weights for the samples. ESU builds a tree
whose leaves correspond to sub-graphs of size k, while
internal nodes correspond to sub-graphs of size 1 up to
k-1, depending on the tree level. Rand-ESU assigns to each
level of the tree a probability that its nodes are explored
further, so as to guarantee that all leaves are visited with
uniform probability.
3. NeMoFINDER: combines approaches from the data mining and
computational biology communities. It searches for repeated
trees and extends them to sub-graphs. This reduces the
computation time for the discovery of larger
motifs, but at the cost of missing some potentially
interesting sub-graphs.
4. Sub-graph Counting by Scalar Computation: it
characterizes a biological network by a set of measures
based on scalars and functionals of the adjacency matrix
associated with the network. Its advantages are
mathematical elegance and computational efficiency.
5. A-priori-based Motif Detection: the basic idea is that if a
sub-graph is frequent, so are all its sub-graphs. It builds
candidate motifs of size k by joining motifs of size k-1 and
then evaluates their frequency.
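
As a small instance of the scalar idea in item 4, the number of triangles in an undirected simple graph equals trace(A^3) / 6, because each triangle contributes six closed walks of length three. The sketch below uses plain Python lists for the adjacency matrix so that it stays self-contained; the helper names are illustrative, and a real implementation would more likely use a numerical library.

```python
def matmul(x, y):
    """Multiply two square matrices given as lists of rows."""
    n = len(x)
    return [[sum(x[i][t] * y[t][j] for t in range(n)) for j in range(n)]
            for i in range(n)]


def count_triangles(a):
    """Number of triangles in an undirected simple graph with adjacency matrix a."""
    a3 = matmul(matmul(a, a), a)
    return sum(a3[i][i] for i in range(len(a))) // 6


# Complete graph on four nodes: every choice of three nodes is a triangle.
k4 = [[0, 1, 1, 1],
      [1, 0, 1, 1],
      [1, 1, 0, 1],
      [1, 1, 1, 0]]
print(count_triangles(k4))
```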
A ROADMAP OF CLUSTERING
ALGORITHM IN BIOINFORMATICS
APPLICATIONS
Desirable features to evaluate in clustering algorithms

Scalability

Robustness

Order insensitivity

Minimum user-specified input

Mixed data types

Arbitrary-shaped clusters

Point proportion admissibility: Duplicating data and reclustering should not alter the results.
Six categories of clustering algorithms

Partitioning Clustering Algorithm

Hierarchical Clustering Algorithm

Grid-based Clustering Algorithm

Density-based Clustering Algorithm

Model-based Clustering Algorithm

Graph-based Clustering Algorithm

Partitioning Clustering Algorithm

Numerical Methods
1. K-means algorithm and Farthest First Traversal k-center
(FFT) algorithm
2. K-medoids or PAM (Partitioning Around Medoids)
3. CLARA (Clustering Large Applications)
4. CLARANS (Clustering Large Applications Based upon
Randomized Search) and Fuzzy K-means
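
Of the numerical methods listed, farthest-first traversal is the simplest to write down: start from one point and repeatedly add the point that is farthest from all centres chosen so far. The sketch below assumes points are numeric tuples, Euclidean distance, and that the first point serves as the initial centre; these choices and the function names are illustrative.

```python
import math


def farthest_first_centers(points, k):
    """Pick k centres by farthest-first traversal (requires k <= len(points))."""
    centers = [points[0]]
    # dist[i] is the distance from points[i] to its nearest chosen centre.
    dist = [math.dist(p, centers[0]) for p in points]
    while len(centers) < k:
        idx = max(range(len(points)), key=dist.__getitem__)
        centers.append(points[idx])
        dist = [min(d, math.dist(p, points[idx])) for d, p in zip(dist, points)]
    return centers


def assign(points, centers):
    """Label each point with the index of its nearest centre."""
    return [min(range(len(centers)), key=lambda c: math.dist(p, centers[c]))
            for p in points]


points = [(0, 0), (0, 1), (10, 10), (10, 11), (20, 0)]
centers = farthest_first_centers(points, 3)
print(centers)
print(assign(points, centers))
```

K-means would start from such centres and then alternate between reassigning points and moving each centre to the mean of its cluster.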
Discrete Methods
1. K-modes
2. Fuzzy K-modes
3. Squeezer and COOLCAT.

Mixed Discrete and Numerical Clustering Methods
1. K-prototypes

Hierarchical Clustering Algorithm

Divide the data into a tree of nodes, where each node
represents a cluster.

Two categories based on methods or purposes
1. Agglomerative vs. Divisive
2. Single vs. Complete vs. Average linkage
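
The linkage rules differ only in how the distance between two clusters is derived from the pairwise point distances. The following is a minimal agglomerative sketch, assuming numeric tuples and Euclidean distance; it recomputes all cluster distances at every step, so it is meant to illustrate the idea and is not efficient. The function names are illustrative.

```python
import math


def linkage_distance(c1, c2, method):
    """Distance between two clusters under single, complete or average linkage."""
    d = [math.dist(p, q) for p in c1 for q in c2]
    if method == "single":
        return min(d)            # closest pair
    if method == "complete":
        return max(d)            # farthest pair
    if method == "average":
        return sum(d) / len(d)   # mean over all pairs
    raise ValueError(f"unknown linkage: {method}")


def agglomerative(points, n_clusters, method="single"):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        pairs = ((i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters)))
        i, j = min(pairs, key=lambda ij: linkage_distance(
            clusters[ij[0]], clusters[ij[1]], method))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


points = [(0,), (1,), (2,), (10,), (11,)]
print(agglomerative(points, 2, "single"))
print(agglomerative(points, 2, "complete"))
```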

Popular: natural data can have various levels of nested subsets

Drawbacks:
1. Slow
2. Errors are not tolerated (an early wrong merge or split cannot be undone)
3. Information is lost when moving between levels

Two kinds of methods
1. Numerical Methods: BIRCH, CURE, Spectral clustering
2. Discrete Methods: ROCK, Chameleon, LIMBO

Grid-based Clustering Algorithm

Form a grid structure of cells from the input data. Each
data point is then assigned to a cell of the grid.

STING combines a numerical grid-based clustering method
with a hierarchical method
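
The cell assignment step can be sketched in a few lines. This is a generic illustration, assuming numeric tuples and square cells of a fixed size; it is not STING, and the function names and the `min_points` threshold are hypothetical. Cells holding enough points are kept as dense cells, which a grid-based method would then merge into clusters.

```python
import math
from collections import defaultdict


def grid_cells(points, cell_size):
    """Map each grid cell (a tuple of integer indices) to the points it contains."""
    cells = defaultdict(list)
    for p in points:
        key = tuple(math.floor(x / cell_size) for x in p)
        cells[key].append(p)
    return dict(cells)


def dense_cells(cells, min_points):
    """Return the keys of cells that hold at least min_points points."""
    return {key for key, members in cells.items() if len(members) >= min_points}


points = [(0.1, 0.2), (0.4, 0.9), (0.8, 0.3), (2.5, 2.5), (5.1, 0.2)]
cells = grid_cells(points, 1.0)
print(cells)
print(dense_cells(cells, 2))
```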

Density-based Clustering Algorithm

Use a local density criterion

Clusters are dense subspaces separated by low density
spaces

Example of a bioinformatics application: finding the densest
subspaces in interactome (protein-protein interaction)
networks

DBSCAN, OPTICS, DENCLUE, WaveCluster, CLIQUE use
numerical values for clustering

SEQOPTICS is used for sequence clustering

HIERDENC (Hierarchical Density-based Clustering),
MULIC (Multiple Layer Incremental Clustering), Projected
(subspace) clustering, CACTUS, STIRR, CLICK, CLOPE use
discrete values for clustering

Model-based Clustering Algorithm

Uses a model, often derived from a statistical distribution

Bioinformatics applications
1. gene expression
2. interactomes
3. sequences

Numerical model-based methods
1. Self-Organizing Maps

Discrete model-based clustering algorithm
1. COBWEB

Numerical and discrete model-based clustering methods
1. BILCOM (Bi-level clustering of Mixed Discrete and
Numerical Biomedical Data) using empirical Bayesian
approach
Examples
1. Gene expression clustering
2. Protein sequence clustering
3. AutoClass
4. SVM Clustering methods

Graph-based Clustering Algorithm

Applied to interactomes for complex prediction and to
sequence networks

Examples:
1. MCODE (Molecular Complex Detection)
2. SPC (Super Paramagnetic Clustering)
3. RNSC (Restricted Neighborhood Search Clustering)
4. MCL (Markov Clustering)
5. TribeMCL
6. SPC
7. CD-HIT
8. ProClust
9. BAG algorithms

Usage in Bioinformatics Applications
Gene expression clustering
1. K-means algorithm
2. Hierarchical algorithm
3. SOMs
Interactomes
1. AutoClass
2. SVM clustering
3. COBWEB
4. MULIC
Sequence clustering
1. Hierarchical clustering algorithm
REFERENCES

[1] Bill Andreopoulos, Aijun An, Xiaogang Wang, and Michael Schroeder. A
roadmap of clustering algorithms: finding a match for a biomedical
application. Brief Bioinform, pages bbn058+, February 2009.
[2] Alberto Apostolico, Matteo Comin, and Laxmi Parida. Bridging Lossy
and Lossless Compression by Motif Pattern Discovery. Electronic Notes in
Discrete Mathematics, 21:219-225, 2005. General Theory of Information
Transfer and Combinatorics.
[3] Giovanni Ciriello and Concettina Guerra. A review on models and
algorithms for motif discovery in protein-protein interaction networks. Brief
Funct Genomic Proteomic, 7(2):147-156, 2008.
[4] Jun Huan, Wei Wang, and Jan Prins. Efficient Mining of Frequent
Subgraphs in the Presence of Isomorphism. Data Mining, IEEE International
Conference on, 0:549, 2003.
[5] Michihiro Kuramochi and George Karypis. Finding Frequent Patterns in
a Large Sparse Graph. Data Mining and Knowledge Discovery, 11(3):243-271,
November 2005.
[6] Laxmi Parida. Discovering Topological Motifs Using a Compact Notation.
Journal of Computational Biology, 14(3):300-323, 2007.
Thank you so much!