Divisive Correlation Clustering Algorithm (DCCA) for grouping of genes
Anindya Bhattacharya and Rajat K. De, Bioinformatics, 2008

Outline: Introduction / Divisive Correlation Clustering Algorithm / Results / Conclusions

Correlation Clustering

Correlation clustering was proposed by Bansal et al. (Machine Learning, 2004). It is based on the notion of graph partitioning.

How is the graph constructed?
- Nodes: the genes.
- Edges: correlations between genes, of two types:
  - Positive edge: genes X and Y have a positive correlation coefficient.
  - Negative edge: genes X and Y have a negative correlation coefficient.

[Figure: an example graph over genes A-H, partitioned into Cluster 1 = {A, B, C} and Cluster 2 = {D, E, F, G, H}.]

How is the quality of a clustering measured? By the number of agreements (edges consistent with the clustering) and the number of disagreements (edges violated by the clustering).

Example (Cluster 1 = {A, B, C, D}, Cluster 2 = {E}):
- The measure of agreements is the sum of (1) the number of positive edges within the same cluster and (2) the number of negative edges between different clusters: 4 + 4 = 8.
- The measure of disagreements is the sum of (1) the number of negative edges within the same cluster and (2) the number of positive edges between different clusters: 0 + 2 = 2.

The goal is to minimize the number of disagreements, or equivalently to maximize the number of agreements. However, Bansal et al. (2004) proved this problem NP-complete. A further limitation is that it uses only the signs of the correlation coefficients, not their magnitudes.

Divisive Correlation Clustering Algorithm

Pearson correlation coefficient. Consider a set of n genes x_1, ..., x_n, for each of which m expression values are given. The Pearson correlation coefficient between two genes x_i and x_j is defined as

    λ(x_i, x_j) = Σ_{l=1}^{m} (x_il − x̄_i)(x_jl − x̄_j)
                  / sqrt( Σ_{l=1}^{m} (x_il − x̄_i)² · Σ_{l=1}^{m} (x_jl − x̄_j)² )

where x_il is the l-th sample value of gene x_i and x̄_i is the mean value of gene x_i over the m samples.

λ(x_i, x_j) > 0: x_i and x_j are positively correlated, with the magnitude |λ(x_i, x_j)| as the degree of correlation.
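As a concrete illustration, the Pearson correlation coefficient above can be computed as follows. This is a minimal sketch: the function name `pearson` and the toy expression profiles are made up for this example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles
    of equal length m (one value per sample)."""
    m = len(x)
    mean_x = sum(x) / m
    mean_y = sum(y) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Toy expression profiles over 4 samples (hypothetical values).
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [2.0, 4.0, 6.0, 8.0]   # same variation pattern as g1 -> +1
g3 = [4.0, 3.0, 2.0, 1.0]   # opposite variation pattern   -> -1

print(pearson(g1, g2))  # → 1.0
print(pearson(g1, g3))  # → -1.0
```

In DCCA's terms, g1 and g2 would share a positive edge (attraction) and g1 and g3 a negative edge (repulsion).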
λ(x_i, x_j) < 0: x_i and x_j are negatively correlated, with the magnitude |λ(x_i, x_j)| as the degree of correlation.

Terms and measurements used in DCCA: attraction, repulsion, attraction/repulsion value, and average correlation value.

- Attraction: there is an attraction between x_i and x_j if λ(x_i, x_j) > 0.
- Repulsion: there is a repulsion between x_i and x_j if λ(x_i, x_j) < 0.
- Attraction/Repulsion value: the magnitude |λ(x_i, x_j)| is the strength of the attraction or repulsion.
- Average correlation value: the genes are to be grouped into K disjoint clusters C_1, ..., C_K. The average correlation value of a gene x_i with respect to a cluster C is

      Λ(x_i, C) = (1 / |C|) Σ_{x_j ∈ C} λ(x_i, x_j)

  where |C| is the number of data points in C. Λ(x_i, C) is the average correlation of gene x_i with the other genes inside cluster C; it reflects the degree of inclusion of x_i in cluster C.

DCCA overview: from n genes x_1, ..., x_n, each with m samples, DCCA produces K disjoint clusters C_1, ..., C_K.

Step 1: start with all genes in a single cluster.
Step 2: in each iteration, do:
- Step 2-i, Step 2-ii: [figures].
- Step 2-iii: identify the cluster C_p containing the largest repulsion value among C_1, ..., C_K.
- Step 2-iv: split cluster C_p into two clusters, C_p and C_q, separating mutually repelling genes. [figure]
- Step 2-v: place a copy of each gene x_k into the cluster for which it has the highest average correlation value, forming the new clusters.
- Step 2-vi: compare the new clusters with the old ones; if any change occurred, continue iterating, otherwise stop.

Results

Performance comparison on a synthetic dataset (ADS) and nine gene-expression datasets.

The synthetic dataset ADS contains three groups. Experimental results: [figure: correct clustering; figures: undesired clusters].

The nine real datasets comprise five yeast datasets (Yeast ATP, Yeast PHO, Yeast AFR, Yeast AFRt, and Yeast Cho et al.) and four mammalian datasets (GDS958 wild type, GDS958 knocked out, GDS1423, and GDS2745).

Performance comparison: the z-score is calculated by observing the relation between a clustering result and the functional annotation of the genes in each cluster. It is based on mutual information (MI), which is computed from:
- the entropies of each cluster-attribute pair;
- the entropy of the clustering result, independent of the attributes;
- the entropy of each of the N_A attributes, independent of the clusters.

The z-score is defined as

    z = (MI − MI_random) / σ_random

where MI is the mutual information computed for the clustered data using the attribute database, and MI_random and σ_random are the mean and standard deviation of the MI values for random clusterings. MI_random is computed by repeatedly assigning genes at random to clusters of uniform size and computing MI, until a distribution of values is obtained.

A higher value of z indicates that genes are better clustered by function, i.e., a more biologically relevant clustering result. The Gibbons ClusterJudge tool is used to calculate the z-score for the five yeast datasets.

Experimental results: [result tables/figures for the nine datasets].

Conclusions

Pros:
- DCCA obtains clustering solutions of high biological significance from gene-expression datasets.
- DCCA detects clusters of genes with similar variation patterns in their expression profiles, without taking the expected number of clusters as an input.

Cons:
- The computational cost of repairing misplacements made during the clustering step is high.
- DCCA does not work if the dataset contains fewer than 3 samples, since with so few samples every correlation value is either +1 or −1.
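The divisive loop described in the DCCA section (split the cluster containing the strongest repulsion, then reassign each gene to the cluster with the highest average correlation value) can be sketched as follows. This is a simplified illustration, not the authors' implementation: the splitting heuristic (seeding the two new clusters from the most strongly repelling pair), the iteration cap, and all function names are assumptions made for this example. The input is a precomputed Pearson correlation matrix.

```python
def avg_corr(k, cluster, corr):
    """Average correlation of gene k with the members of a cluster (excluding k)."""
    others = [j for j in cluster if j != k]
    if not others:
        return 0.0
    return sum(corr[k][j] for j in others) / len(others)

def dcca_sketch(corr, max_iter=50):
    """Simplified divisive loop over a correlation matrix corr[i][j]."""
    n = len(corr)
    clusters = [set(range(n))]            # Step 1: one cluster of all genes
    for _ in range(max_iter):
        # Step 2-iii (sketch): find the strongest repulsion inside any cluster.
        worst, pair = 0.0, None
        for c in clusters:
            for i in c:
                for j in c:
                    if i < j and corr[i][j] < worst:
                        worst, pair = corr[i][j], (i, j, c)
        if pair is None:
            break                          # no repulsion left: clusters are stable
        i, j, c = pair
        # Step 2-iv (sketch): split around the repelling pair (i, j).
        clusters.remove(c)
        a, b = {i}, {j}
        for k in c - {i, j}:
            (a if corr[k][i] >= corr[k][j] else b).add(k)
        clusters += [a, b]
        # Step 2-v (sketch): reassign each gene to the cluster with the
        # highest average correlation value.
        targets = [max(range(len(clusters)),
                       key=lambda t: avg_corr(k, clusters[t], corr))
                   for k in range(n)]
        new_clusters = [set() for _ in clusters]
        for k, t in enumerate(targets):
            new_clusters[t].add(k)
        clusters = [c for c in new_clusters if c]
    return clusters

# Toy correlation matrix: genes 0,1 co-expressed; genes 2,3 co-expressed;
# the two groups repel each other (hypothetical values).
corr = [[ 1.0,  0.9, -0.8, -0.7],
        [ 0.9,  1.0, -0.6, -0.9],
        [-0.8, -0.6,  1.0,  0.8],
        [-0.7, -0.9,  0.8,  1.0]]
print(sorted(sorted(c) for c in dcca_sketch(corr)))  # → [[0, 1], [2, 3]]
```

Note how the stopping rule mirrors Step 2-vi: the loop ends once no cluster contains a repulsion, i.e., further iterations would leave the clusters unchanged.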