Divisive Correlation Clustering Algorithm (DCCA) for grouping of genes
Anindya Bhattacharya and Rajat K. De, Bioinformatics, 2008

Outline: Introduction / Divisive Correlation Clustering Algorithm / Results / Conclusions

Correlation Clustering

Correlation clustering was proposed by Bansal et al. (Machine Learning, 2004). It is based on the notion of graph partitioning.

How is the graph constructed?
- Nodes: the genes.
- Edges: correlations between genes, of two types:
  - Positive edge: genes X and Y have a positive correlation coefficient.
  - Negative edge: genes X and Y have a negative correlation coefficient.

[Figure: an example graph over genes A-H, partitioned into Cluster 1 = {A, B, C} and Cluster 2 = {D, E, F, G, H}.]

How is the quality of a clustering measured? By the number of agreements (edges consistent with the clustering) and the number of disagreements (edges violated by the clustering).

Example (Cluster 1 = {A, B, C, D}, Cluster 2 = {E}):
- The measure of agreements is the sum of (1) the number of positive edges within the same cluster and (2) the number of negative edges between different clusters: 4 + 4 = 8.
- The measure of disagreements is the sum of (1) the number of negative edges within the same cluster and (2) the number of positive edges between different clusters: 0 + 2 = 2.

The goal is to minimize the number of disagreements, or equivalently to maximize the number of agreements. However, Bansal et al. (2004) proved this problem NP-complete. A further limitation is that it uses only the signs of the correlation coefficients, not their magnitudes.

Divisive Correlation Clustering Algorithm

Pearson correlation coefficient. Consider a set of n genes x_1, ..., x_n, for each of which m expression values are given. The Pearson correlation coefficient between two genes x_i and x_j is defined as

    λ(x_i, x_j) = Σ_{l=1}^{m} (x_il − x̄_i)(x_jl − x̄_j)
                  / sqrt( Σ_{l=1}^{m} (x_il − x̄_i)² · Σ_{l=1}^{m} (x_jl − x̄_j)² )

where x_il is the l-th sample value of gene x_i and x̄_i is the mean value of gene x_i over the m samples.

λ(x_i, x_j) > 0: x_i and x_j are positively correlated, with the magnitude |λ(x_i, x_j)| as the degree of correlation.
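As a concrete illustration, the Pearson correlation coefficient above can be computed as follows. This is a minimal sketch: the function name `pearson` and the toy expression profiles are made up for this example.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two expression profiles
    of equal length m (one value per sample)."""
    m = len(x)
    mean_x = sum(x) / m
    mean_y = sum(y) / m
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Toy expression profiles over 4 samples (hypothetical values).
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [2.0, 4.0, 6.0, 8.0]   # same variation pattern as g1 -> +1
g3 = [4.0, 3.0, 2.0, 1.0]   # opposite variation pattern   -> -1

print(pearson(g1, g2))  # → 1.0
print(pearson(g1, g3))  # → -1.0
```

In DCCA's terms, g1 and g2 would share a positive edge (attraction) and g1 and g3 a negative edge (repulsion).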
λ(x_i, x_j) < 0: x_i and x_j are negatively correlated, with the magnitude |λ(x_i, x_j)| as the degree of correlation.

Terms and measurements used in DCCA: attraction, repulsion, attraction/repulsion value, and average correlation value.

- Attraction: there is an attraction between x_i and x_j if λ(x_i, x_j) > 0.
- Repulsion: there is a repulsion between x_i and x_j if λ(x_i, x_j) < 0.
- Attraction/Repulsion value: the magnitude |λ(x_i, x_j)| is the strength of the attraction or repulsion.
- Average correlation value: the genes are to be grouped into K disjoint clusters C_1, ..., C_K. The average correlation value of a gene x_i with respect to a cluster C is

      Λ(x_i, C) = (1 / |C|) Σ_{x_j ∈ C} λ(x_i, x_j)

  where |C| is the number of data points in C. Λ(x_i, C) is the average correlation of gene x_i with the other genes inside cluster C; it reflects the degree of inclusion of x_i in cluster C.

DCCA overview: from n genes x_1, ..., x_n, each with m samples, DCCA produces K disjoint clusters C_1, ..., C_K.

Step 1: start with all genes in a single cluster.
Step 2: in each iteration, do:
- Step 2-i, Step 2-ii: [figures].
- Step 2-iii: identify the cluster C_p containing the largest repulsion value among C_1, ..., C_K.
- Step 2-iv: split cluster C_p into two clusters, C_p and C_q, separating mutually repelling genes. [figure]
- Step 2-v: place a copy of each gene x_k into the cluster for which it has the highest average correlation value, forming the new clusters.
- Step 2-vi: compare the new clusters with the old ones; if any change occurred, continue iterating, otherwise stop.

Results

Performance comparison on a synthetic dataset (ADS) and nine gene-expression datasets.

The synthetic dataset ADS contains three groups. Experimental results: [figure: correct clustering; figures: undesired clusters].

The nine real datasets comprise five yeast datasets (Yeast ATP, Yeast PHO, Yeast AFR, Yeast AFRt, and Yeast Cho et al.) and four mammalian datasets (GDS958 wild type, GDS958 knocked out, GDS1423, and GDS2745).

Performance comparison: the z-score is calculated by observing the relation between a clustering result and the functional annotation of the genes in each cluster. It is based on mutual information (MI), which is computed from:
- the entropies of each cluster-attribute pair;
- the entropy of the clustering result, independent of the attributes;
- the entropy of each of the N_A attributes, independent of the clusters.

The z-score is defined as

    z = (MI − MI_random) / σ_random

where MI is the mutual information computed for the clustered data using the attribute database, and MI_random and σ_random are the mean and standard deviation of the MI values for random clusterings. MI_random is computed by repeatedly assigning genes at random to clusters of uniform size and computing MI, until a distribution of values is obtained.

A higher value of z indicates that genes are better clustered by function, i.e., a more biologically relevant clustering result. The Gibbons ClusterJudge tool is used to calculate the z-score for the five yeast datasets.

Experimental results: [result tables/figures for the nine datasets].

Conclusions

Pros:
- DCCA obtains clustering solutions of high biological significance from gene-expression datasets.
- DCCA detects clusters of genes with similar variation patterns in their expression profiles, without taking the expected number of clusters as an input.

Cons:
- The computational cost of repairing misplacements made during the clustering step is high.
- DCCA does not work if the dataset contains fewer than 3 samples, since with so few samples every correlation value is either +1 or −1.
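The divisive loop described in the DCCA section (split the cluster containing the strongest repulsion, then reassign each gene to the cluster with the highest average correlation value) can be sketched as follows. This is a simplified illustration, not the authors' implementation: the splitting heuristic (seeding the two new clusters from the most strongly repelling pair), the iteration cap, and all function names are assumptions made for this example. The input is a precomputed Pearson correlation matrix.

```python
def avg_corr(k, cluster, corr):
    """Average correlation of gene k with the members of a cluster (excluding k)."""
    others = [j for j in cluster if j != k]
    if not others:
        return 0.0
    return sum(corr[k][j] for j in others) / len(others)

def dcca_sketch(corr, max_iter=50):
    """Simplified divisive loop over a correlation matrix corr[i][j]."""
    n = len(corr)
    clusters = [set(range(n))]            # Step 1: one cluster of all genes
    for _ in range(max_iter):
        # Step 2-iii (sketch): find the strongest repulsion inside any cluster.
        worst, pair = 0.0, None
        for c in clusters:
            for i in c:
                for j in c:
                    if i < j and corr[i][j] < worst:
                        worst, pair = corr[i][j], (i, j, c)
        if pair is None:
            break                          # no repulsion left: clusters are stable
        i, j, c = pair
        # Step 2-iv (sketch): split around the repelling pair (i, j).
        clusters.remove(c)
        a, b = {i}, {j}
        for k in c - {i, j}:
            (a if corr[k][i] >= corr[k][j] else b).add(k)
        clusters += [a, b]
        # Step 2-v (sketch): reassign each gene to the cluster with the
        # highest average correlation value.
        targets = [max(range(len(clusters)),
                       key=lambda t: avg_corr(k, clusters[t], corr))
                   for k in range(n)]
        new_clusters = [set() for _ in clusters]
        for k, t in enumerate(targets):
            new_clusters[t].add(k)
        clusters = [c for c in new_clusters if c]
    return clusters

# Toy correlation matrix: genes 0,1 co-expressed; genes 2,3 co-expressed;
# the two groups repel each other (hypothetical values).
corr = [[ 1.0,  0.9, -0.8, -0.7],
        [ 0.9,  1.0, -0.6, -0.9],
        [-0.8, -0.6,  1.0,  0.8],
        [-0.7, -0.9,  0.8,  1.0]]
print(sorted(sorted(c) for c in dcca_sketch(corr)))  # → [[0, 1], [2, 3]]
```

Note how the stopping rule mirrors Step 2-vi: the loop ends once no cluster contains a repulsion, i.e., further iterations would leave the clusters unchanged.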