BFAM Project BF-S15T07 “Efficient clustering algorithms for genome-wide expression analysis”
BFAM Project BF-S15T08 “Modeling and visualization of biochemical networks”
Sebastian Wernicke
Arno Buchner
Jan Griebsch
Jens Ernst

Misc. projects in Bioinformatics
Hanjo Täubig
Moritz Maass

Project I: Efficient Clustering Algorithms for Genome-wide Expression Analysis

[Figure: workflow diagram with the labels Expression Profiles, Normalization, Similarity Measure, Gene Expression Data, Clustering]

1. Retrospect: The SR-Algorithm
• Powerful algorithm for similarity-based clustering
• Based on methods from spectral graph theory, numerical linear algebra and randomization
• Applicable not only to gene expression profiles but to any class of biological objects for which a pair-wise similarity is defined
• Rigorously analyzed mathematically with respect to noise robustness and running time
• Complexity: Θ(n²), and hence optimal (the input itself already consists of Θ(n²) pair-wise similarity values)
• New: a parallelized version and a version optimized for sparse similarity matrices

2. Tests on Synthetic Data (1)
Output quality as a function of n and the amount of noise (false positive / false negative rate α). The number of clusters is specified to the algorithm.
[Figure: two plots of output quality over n and α; tick values shown: 1.0 and 0.45]
500 – 2000 genes forming 4 clusters with 20%-49% false positives/negatives

Tests on Synthetic Data (2)
Output quality as a function of n and the amount of noise (false positive / false negative rate α). The number of clusters is found by the algorithm.
[Figure: two plots of output quality over n and α; tick values shown: 1.0 and 0.45]
500 – 4000 genes forming 4 clusters with 20%-45% false positives/negatives

Tests on Synthetic Data (3)
Running time as a function of n and the amount of noise (false positive / false negative rate α) on a 1 GHz machine.
[Figure: two plots of running time in seconds over n and α, one of them for α = 0.45; tick values shown: 5.0 and 293.0 (time), 5000 and 30,000 (n), 0.45 (α)]
5,000 – 30,000 genes, i.e. 25,000,000 – 900,000,000 similarity values

4. Clustering Protein Interaction Networks
• Experiments with a network from the STRING system provided by the Bork group at EMBL.
• Data: Escherichia coli, orthologous group-based
• Edge scores: interaction intensities defined by
  score = 1 − (1 − neighborhood score) × (1 − fusion score) × (1 − co-occurrence score)
[Courtesy of C. von Mering, Nucleic Acids Res. 2003 Jan 1;31(1):258-61]

4.1 Methods Currently Applied in STRING
• Functional module extraction: generic partition-based clustering methods (Single Linkage, Markov Clustering) have been applied to identify functional modules in the network.
• However: because the interaction score combines three different channels, multiple cluster structures are superimposed in this data set.
• Generalized clustering: grouping such that any protein (or orthologous group) can belong to multiple clusters. The density of each cluster should be as high as possible, whereas the inter-cluster connectivity (excluding overlaps) should be minimized.

4.2 Schematic representation: “Lsets”
[Figure: three panels titled Cluster Structure, Interaction Matrix (permuted with respect to cluster structure), and Interaction Matrix (original form); Lsets are labeled by the clusters they belong to: 1; 1,3; 1,2; 2; 2,3; 1,2,3; 2,4; 3; 3,4; 4]

4.3 Construction of Intersecting Clusters:
1. Construction of elementary sets by SR-techniques
Result: a partition of the protein set into a fixed number k of elementary sets (Lsets). The value of k may safely be overestimated.
Intra- and inter-Lset edge densities for k = 150:
• Mean intra-Lset density: 0.309
• Inter-Lset connectivity: 0.024
[Figure: frequency distribution of edge densities within and between Lsets; annotation: Lsets belonging to the same cluster]

2. Definition of the Lset-graph
Some pairs of Lsets are still highly connected. This is represented by a graph structure whose nodes are Lsets. Maximal cliques in this graph are macroscopic clusters, which can overlap.
[Figure: Lset-graph with nodes labeled 1; 1,2; 2; 1,2,3; 2,3; 3,4; 1,3; 4; 2,4]
Note: This means that the method self-corrects an overestimated value of k.

3. Construction of the intersecting clusters
The cliques are extracted using the Tsukiyama algorithm.
Result: 144 clusters
• Intra-cluster density: 0.269
• Inter-cluster connectivity: 0.020 (excl. overlaps)
Quality assessment based on biological expert knowledge: currently pending. The clusters are being compared with a known set of protein-to-pathway assignments.

5. Mathematical Result Evaluation in Comparative Analysis of Clustering Algorithms
• Mathematical scoring scheme for clustering quality:
• Suppose a clustering has induced the partition C = {C1, C2, …, Ck} of the set of genes {X1, X2, …, Xn}.
• Denote the similarity between a pair of genes Xi, Xj by s(Xi, Xj).
• Denote the cluster containing Xi by C(Xi) and the center of a cluster C by XC.
Cluster Homogeneity: [formula missing]
Separation: [formula missing]
• Remarks:
1. The cluster analysis was conducted as a blind test. The use of expert knowledge or supervised learning techniques was not intended.
2. No prior selection of genes was requested.
3. Normalization/standardization of the expression data or of the similarity/distance measure was not explicitly required.
• Choice of similarity measure s for the evaluation: Pearson correlation coefficient (because it is invariant under scaling and translation of expression profiles, transformations that some participants applied).
• Homogeneity and Separation in the Clusterings (NRO)
[Figure: NRO Data Set (Pearson correlation); plot of Separation versus Homogeneity; labels in extraction order: “Average” (2), “Average” (3), Kröger (10), “Binary” (16), (20), “SR”, “SOM” (2), “Ward” (2), (2), “Optimum”]
• Using |Pearson| to accommodate anti-correlation
[Figure: NRO Data Set (absolute Pearson correlation); plot of Separation versus Homogeneity; labels in extraction order: “SOM” (2), “Ward” (2), Kröger (10), “Binary” (16), (16), (20), “Average” (2), “SR”, “Average” (3), (3), “Optimum”]
• An SR-Clustering with 16 Clusters on the NRO Data:
• The appropriately permuted similarity matrix
[Figure: permuted similarity matrix; annotation: Isolated clusters with high confidence]
The gray off-diagonal blocks suggest some inter-cluster similarity. Cluster overlap is conceivable here.

6. Cooperation within the BFAM Network:
1. Cooperation with Genomatix Software GmbH:
• Extension of cluster analysis by integration of information from biological databases and expert knowledge
2. Cooperation with Genomatix Software GmbH, Biomax Informatics GmbH, the group of Prof. Lasser and the group of Prof. Kriegel:
• Comparative analysis of clustering algorithms
3. Publications:
[1] “Similarity-Based Clustering Algorithms for Gene Expression Profiles”, J. Ernst, Dissertation, Technische Universität München, 2002
[2] “Generalized Clustering of Gene Expression Profiles – A Spectral Approach”, J. Ernst, Proc. of the Int. Conference on Bioinformatics, Bangkok, 2002
[3] “The Complexity of Detecting Fixed-Density Clusters”, H. Täubig et al., Proc. of the 5th Italian Conference on Algorithms and Complexity, 2003

Chair for Efficient Algorithms
• Algorithms for Bioinformatics
• Graph Theory
• Combinatorial Optimization
• Randomized Algorithms
• Algorithm Visualization
• Complexity Theory
• Computer Algebra
• Petri Nets
• Scheduling
• Project “Clustering”
• Project “Biological Networks”
• Misc. Bioinformatics Projects
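
Sketch for section 5 (scoring scheme): the slides name Cluster Homogeneity and Separation but do not show how they are computed here. The following Python sketch illustrates one common way to compute such scores and is not necessarily the definition used in the comparison. It assumes that homogeneity is the average similarity s(Xi, XC(Xi)) between each gene and the center of its cluster, that separation is the size-weighted average similarity between the centers of distinct clusters, and that a cluster center is the mean profile of its members. The function names and the toy data are hypothetical. The similarity s is the Pearson correlation coefficient, with an option for |Pearson|.

```python
from math import sqrt

def pearson(x, y, absolute=False):
    """Pearson correlation of two non-constant profiles; optionally |r|."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    r = cov / sqrt(vx * vy)
    return abs(r) if absolute else r

def centers_and_sizes(profiles, labels):
    """Assumption: the center X_C is the mean profile of the cluster members."""
    groups = {}
    for p, c in zip(profiles, labels):
        groups.setdefault(c, []).append(p)
    centers = {c: [sum(col) / len(ps) for col in zip(*ps)]
               for c, ps in groups.items()}
    sizes = {c: len(ps) for c, ps in groups.items()}
    return centers, sizes

def homogeneity(profiles, labels, absolute=False):
    """Assumed definition: mean of s(X_i, X_C(X_i)) over all genes."""
    centers, _ = centers_and_sizes(profiles, labels)
    scores = [pearson(p, centers[c], absolute) for p, c in zip(profiles, labels)]
    return sum(scores) / len(scores)

def separation(profiles, labels, absolute=False):
    """Assumed definition: average similarity between centers of distinct
    clusters, weighted by |C_a| * |C_b|. Needs at least two clusters."""
    centers, sizes = centers_and_sizes(profiles, labels)
    ids = list(centers)
    num = den = 0.0
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            w = sizes[ids[a]] * sizes[ids[b]]
            num += w * pearson(centers[ids[a]], centers[ids[b]], absolute)
            den += w
    return num / den

# Toy data: two rising and two falling profiles, clustered accordingly.
profiles = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [8, 6, 4, 2]]
labels = [0, 0, 1, 1]
print(homogeneity(profiles, labels))
print(separation(profiles, labels))
print(separation(profiles, labels, absolute=True))
```

Under these assumptions a good clustering has high homogeneity and low separation. In the toy data the two cluster centers are perfectly anti-correlated, so switching to |Pearson| changes the separation score substantially, which is the effect the |Pearson| slide addresses.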