An Unsupervised Learning Approach for Overlapping Co-clustering
Machine Learning Project Presentation
Rohit Gupta and Varun Chandola
{rohit,chandola}@cs.umn.edu

Outline
• Introduction to Clustering
• Description of Application Domain
• From Traditional Clustering to Overlapping Co-clustering
• Current State of the Art
• A Frequent Itemsets Based Solution
• An Alternate Minimization Based Solution
• Application to Gene Expression Data
• Experimental Results
• Conclusions and Future Directions

Clustering
• Clustering is an unsupervised machine learning technique
  - Uses unlabeled samples
• In the simplest form, the goal is to determine groups (clusters) of data objects such that the objects in one cluster are similar to each other and dissimilar to objects in other clusters
  - Each data object is a set of attributes (or features) with a definite notion of proximity
• Most traditional clustering algorithms
  - Are partitional in nature: they assign a data object to exactly one cluster
  - Perform clustering along one dimension

Application Domains
• Gene Expression Data
  - Genes vs. Experimental Conditions
  - Find similar genes based on their expression values across different experimental conditions
  - Each cluster would represent a potential functional module in the organism
• Text Document Data
  - Documents vs. Words
• Movie Recommendation Systems
  - Users vs. Movies

Overlapping Clustering
• Also known as soft clustering or fuzzy clustering
• A data object can be assigned to more than one cluster
• The motivation is that many real-world data sets have inherently overlapping clusters
  - A gene can be part of multiple functional modules (clusters)

Co-clustering
• Co-clustering is the problem of simultaneously clustering the rows and columns of a data matrix
  - Also known as bi-clustering, subspace clustering, bi-dimensional clustering, simultaneous clustering, and block clustering
• The resulting clusters are blocks in the input data matrix
• These blocks often represent more coherent and meaningful clusters
  - Only a subset of genes participate in any cellular process of interest, and that process may be active for only a subset of conditions

Overlapping Co-clustering
• [Figure: positioning of this work relative to prior approaches – overlaps (Segal et al, 2003; Banerjee et al, 2005), co-clusters (Dhillon et al, 2003; Cho et al, 2004; Banerjee et al, 2005), and overlapping co-clusters (Bergmann et al, 2003)]

Current State of the Art
• Traditional Clustering – numerous algorithms such as k-means
• Overlapping Clustering – probabilistic relational model based approach by Segal et al and Banerjee et al
• Co-clustering – Dhillon et al for gene expression data and document clustering.
  (Banerjee et al provided a general framework using a general class of Bregman distortion functions)
• Overlapping Co-clustering – Iterative Signature Algorithm (ISA) by Bergmann et al for gene expression data
  - Uses an alternate minimization technique
  - Involves thresholding after every iteration
• We propose a more formal framework based on the co-clustering approach of Dhillon et al, as well as a simpler frequent itemsets based solution

Frequent Itemsets Based Approach
• Based on the concept of frequent itemsets from the association analysis domain
  - A frequent itemset is a set of items (features) that occur together more than a specified number of times (the support threshold) in the data set
• The data must be binary (only presence or absence is considered)

Frequent Itemsets Based Approach (2)
Application to gene expression data:
• Normalization – first along columns (conditions) to remove scaling effects, then along rows (genes)
• Binarization – three alternatives:
  - Values above a preset threshold λ are set to 1 and the rest to 0
  - Values above a preset percentile are set to 1 and the rest to 0
  - Split each gene column into three components g+, g0 and g-, signifying the up-regulation and down-regulation of the gene's expression.
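The three binarization schemes above can be sketched as follows (a minimal sketch assuming a NumPy expression matrix with genes as columns; the threshold values are illustrative):

```python
import numpy as np

def binarize_threshold(expr, lam=1.0):
    """Scheme 1: values above a preset threshold lam become 1, the rest 0."""
    return (expr > lam).astype(int)

def binarize_percentile(expr, pct=90):
    """Scheme 2: values above a preset percentile become 1, the rest 0."""
    return (expr > np.percentile(expr, pct)).astype(int)

def binarize_three_way(expr, lam=1.0):
    """Scheme 3: split each gene column into g+, g0 and g- components,
    tripling the number of items."""
    g_plus = (expr > lam).astype(int)    # up-regulated
    g_minus = (expr < -lam).astype(int)  # down-regulated
    g_zero = 1 - g_plus - g_minus        # neither
    return np.hstack([g_plus, g_zero, g_minus])
```

The third scheme preserves the direction of regulation that the simple threshold schemes discard, at the cost of a three-fold larger item space.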
  - This triples the number of items (or genes)
• The gene expression matrix is converted to transaction-format data – each experiment is a transaction containing the index values of the genes expressed in that experiment

Frequent Itemsets Based Approach (3)
Algorithm:
• Run a closed frequent itemset algorithm to generate frequent closed itemsets with a specified support threshold σ
Post-processing:
• Prune frequent itemsets (sets of genes) of length < α
• For each remaining itemset, scan the transaction data to record all the transactions (experiments) in which the itemset occurs
  (Note: the combination of these transactions (experiments) and the itemset (genes) gives the desired sets of genes together with the subsets of conditions under which they are most tightly co-expressed)

Limitations of Frequent Itemsets Based Approach
• Binarization of the gene expression matrix may lose some of the patterns in the data
• Up-regulation and down-regulation of genes are not directly taken into account
• Setting the right support threshold while incorporating domain knowledge is not trivial
• A large number of modules is obtained – difficult to evaluate biologically
• Traditional association analysis approaches only consider dense blocks, so noise may break up an actual module – Error Tolerant Itemsets (ETI) offer a potential solution

Alternate Minimization (AM) Based Approach
• Extends the non-overlapping co-clustering approach of [Dhillon et al, 2003; Banerjee et al, 2005]
• Algorithm
  - Input: data matrix A (size m x n) and k, l (the number of row and column clusters)
  - Initialize the row and column cluster mappings X (size m x k) and Y (size n x l)
    - Random assignment of rows (or columns) to row (or column) clusters, or
    - Any traditional one-dimensional clustering can be used to initialize X and Y
  - Objective function: ||A – Â||², where Â is a matrix approximation of A computed in one of two ways:
    - Each element of a co-cluster (obtained using the current X and Y) is replaced by the co-cluster mean a_IJ
    - Each element of a co-cluster is replaced by a_iJ + a_Ij – a_IJ, i.e. row mean + column mean – co-cluster mean

Alternate Minimization (AM) Based Approach (2)
While (not converged):
- Phase 1:
  - Compute row cluster prototypes (based on the current X and matrix A)
  - Compute the Bregman distance dΦ(r_i, R_r) from each row to each row cluster prototype
  - Compute the probability with which each of the m rows falls into each of the k row clusters
  - Update the row clustering X, keeping the column clustering Y fixed (some thresholding is required here to allow limited overlap)
- Phase 2:
  - Compute column cluster prototypes (based on the current Y and matrix A)
  - Compute the Bregman distance dΦ(c_j, C_c) from each column to each column cluster prototype
  - Compute the probability with which each of the n columns falls into each of the l column clusters
  - Update the column clustering Y, keeping the row clustering X fixed
- Compute the objective function ||A – Â||²
- Check convergence

Observations
• Each row or column can be assigned to multiple row or column clusters, respectively, with a probability based on its distance from the cluster prototypes. This produces an overlapping co-clustering.
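The AM iteration can be sketched as follows. This is a simplified, hard-assignment version using squared Euclidean distance (one member of the Bregman family) and the co-cluster-mean approximation a_IJ; the probabilistic assignments and overlap thresholding described above are omitted, and all function names are illustrative:

```python
import numpy as np

def cocluster_mean_approx(A, row_lab, col_lab, k, l):
    """A-hat: every element of co-cluster (I, J) is replaced by the mean a_IJ."""
    A_hat = np.zeros_like(A, dtype=float)
    for I in range(k):
        for J in range(l):
            rows, cols = row_lab == I, col_lab == J
            if rows.any() and cols.any():
                A_hat[np.ix_(rows, cols)] = A[np.ix_(rows, cols)].mean()
    return A_hat

def _collapse(A, lab, n_clusters):
    """Profile of each row of A over the clustering of the columns (one mean per cluster)."""
    m = A.shape[0]
    return np.stack(
        [A[:, lab == c].mean(axis=1) if (lab == c).any() else np.zeros(m)
         for c in range(n_clusters)], axis=1)

def am_cocluster(A, k, l, n_iter=20, seed=0):
    """Alternate minimization of ||A - A_hat||^2 with hard assignments."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    row_lab = rng.integers(k, size=m)   # random initialization of X
    col_lab = rng.integers(l, size=n)   # random initialization of Y
    for _ in range(n_iter):
        # Phase 1: update row clusters, keeping the column clusters fixed
        prof = _collapse(A, col_lab, l)                      # (m, l) row profiles
        protos = np.stack(
            [prof[row_lab == I].mean(axis=0) if (row_lab == I).any() else prof.mean(axis=0)
             for I in range(k)])                             # row cluster prototypes
        row_lab = ((prof[:, None, :] - protos[None]) ** 2).sum(-1).argmin(axis=1)
        # Phase 2: update column clusters, keeping the row clusters fixed
        prof_c = _collapse(A.T, row_lab, k)                  # (n, k) column profiles
        protos_c = np.stack(
            [prof_c[col_lab == J].mean(axis=0) if (col_lab == J).any() else prof_c.mean(axis=0)
             for J in range(l)])                             # column cluster prototypes
        col_lab = ((prof_c[:, None, :] - protos_c[None]) ** 2).sum(-1).argmin(axis=1)
    A_hat = cocluster_mean_approx(A, row_lab, col_lab, k, l)
    return row_lab, col_lab, ((A - A_hat) ** 2).sum()
```

In the full algorithm, X and Y would instead be initialized with a one-dimensional clustering such as k-means, and the soft (probabilistic) assignment step with thresholding is what lets rows and columns join multiple clusters.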
• The maximum number of overlapping co-clusters that can be obtained is k x l
• Initialization of X and Y can be done in multiple ways – two are explored in the experiments
• Thresholding to control the percentage of overlap is tricky and requires domain knowledge
• Cluster evaluation is important – both internal and external
  - SSE and entropy of each co-cluster
  - Biological evaluation using GO (Gene Ontology) for results on gene expression data

Experimental Results (1)
• Frequent Itemsets Based Approach
  - A synthetic data set (40 x 40)
  - Total number of co-clusters detected = 3

Experimental Results (2)
• Frequent Itemsets Based Approach
  - Another synthetic data set (40 x 40)
  - Total number of co-clusters detected = 7
  - All 4 blocks (in the original data set) were detected
  - Post-processing is needed to eliminate unwanted co-clusters

Experimental Results (3)
• AM Based Approach
  - Synthetic data sets (20 x 20)
  - Finds the co-clusters in each case

Experimental Results (4)
• AM Based Approach on a Gene Expression Dataset
  - Human lymphoma microarray data [described in Cho et al, 2004]
  - # genes = 854, # conditions = 96
• k = 5, l = 5; one-dimensional k-means for initialization of X and Y
• Total number of co-clusters = 25
• [Figures: input data; objective function vs. iterations]
• A preliminary analysis of the 25 co-clusters shows that only one meaningful co-cluster is obtained

Conclusions
• The frequent itemsets based approach is guaranteed to find dense overlapping co-clusters
  - The Error Tolerant Itemset approach offers a potential solution to the problem of noise
• The AM based approach is a formal algorithm for finding overlapping co-clusters
  - It simultaneously performs clustering in both dimensions while minimizing a global objective function
  - Results on synthetic data demonstrate the correctness of the algorithm
• Preliminary results on gene expression data show promise and will be evaluated further
• A key insight is that applying these techniques to gene expression data requires domain knowledge for pre-processing, initialization, and thresholding, as well as for post-processing of the obtained co-clusters

References
• [Bergmann et al, 2003] Sven Bergmann, Jan Ihmels and Naama Barkai, Iterative signature algorithm for the analysis of large-scale gene expression data, Phys. Rev. E 67, pp. 31902, 2003
• [Liu et al, 2004] Jinze Liu, Susan Paulsen, Wei Wang, Andrew Nobel and Jan Prins, Mining Approximate Frequent Itemsets from Noisy Data, Proc. IEEE ICDM, pp. 463-466, 2004
• [Cho et al, 2004] H. Cho, I. S. Dhillon, Y. Guan and S. Sra, Minimum Sum-Squared Residue Co-clustering of Gene Expression Data, Proc. SIAM Data Mining Conference, pp. 114-125, 2004
• [Dhillon et al, 2003] Inderjit S. Dhillon, Subramanyam Mallela and Dharmendra S. Modha, Information-Theoretic Co-Clustering, Proc. ACM SIGKDD, pp. 89-98, 2003
• [Banerjee et al, 2004] A Generalized Maximum Entropy Approach to Bregman Co-clustering and Matrix Approximation, Proc. 10th ACM SIGKDD (KDD '04), pp. 509-514, 2004