
Introduction to Bioinformatics: Lecture VII
Clustering and Unsupervised Learning
Jarek Meller, Division of Biomedical Informatics, Children's Hospital Research Foundation & Department of Biomedical Engineering, UC
JM - http://folding.chmcc.org

Outline of the lecture
From information flood to knowledge: finding an inherent "structure" in the data using an unsupervised learning approach
Clustering analysis as a data mining and knowledge discovery approach
The notion of similarity and its importance in clustering
K-means and hierarchical clustering algorithms

Literature watch: "Mathematics in Biology" featured in Science
From Bayesian networks to game theory in biology ... http://www.sciencemag.org/
Reading assignment: Friedman N., "Inferring Cellular Networks Using Probabilistic Graphical Models", Science 303 (2004)

Clustering and cluster analysis: general considerations
Definition: The goal of cluster analysis is to group a collection of objects into subsets called clusters, such that the objects within each cluster are more closely related to one another than to objects assigned to different clusters.
One classical example: gene expression data
[Figure: gene expression profiles for 45 individuals (33 JRA patients + 12 controls), each annotated with disease course (Pauci, Poly, Syst, Spond, or Control), MTX treatment status, and X-ray findings; of 242 differentially expressed genes, 105 show significantly lower and 137 significantly higher expression in poly-articular JRA. Picture: courtesy of B. Aronow]

Another example: clustering and hierarchically organizing protein structures in CATH
http://www.biochem.ucl.ac.uk/bsm/cath/ (Orengo et al.)
Overall structure of the protein universe: which cluster does my protein belong to?
Measuring similarity to cluster representatives ...

Similarity and distance measures in clustering
Clustering (or segmentation) of objects starts with the choice of a similarity measure that describes proximity between different objects.
The choice of the similarity (or dissimilarity/distance) measure ultimately defines the outcome of the clustering and is far more important than the choice of the actual clustering algorithm.
Subject-specific considerations provide a suitable notion of similarity, externally to the actual clustering.

Distance and similarity measures for string matching

Similarity and distance measures in clustering (continued)
In general, any similarity measure can be converted into a dissimilarity measure by applying a suitable monotone-decreasing function.
For N objects one may define an N-by-N matrix D with non-negative entries (and zero diagonal elements) representing the dissimilarity of each pair of objects.
Some clustering algorithms assume that the dissimilarity or distance measure is a metric.
Problem: Assuming that D is not symmetric, define a modified symmetric distance measure.

Similarity measures

Some observations regarding similarity measures

K-means heuristic as a solution to the clustering problem
Definition: For N objects and K < N postulated clusters, find an assignment of each object to one of the clusters that minimizes the "within cluster" point scatter cost function.
The number of possible assignments scales exponentially (again), and in fact the clustering problem is another instance of a global optimization problem.
K-means is an iterative greedy descent algorithm that attempts to find a good solution (a local minimum).
Trying different initial solutions may be a good idea.

K-means algorithm
Step 1: Choose K initial cluster centers (e.g. randomly chosen data points) and the resulting cluster assignment: each point is assigned to the closest cluster center.
Step 2: Given the current cluster assignment, find new cluster centers as the geometric centers of all the data points (vectors) in each cluster. Redefine cluster assignments using the new centers.
Iterate Step 2 until cluster assignments do not change.

Hierarchical clustering and its applications
As opposed to K-means, there is no predefined number of clusters.
Instead, groups (clusters) are built iteratively from pairwise dissimilarities (distances), such that at each level of the hierarchy clusters are created by merging (or splitting) clusters at the next (previous) level.
Representations of the data structure as rooted binary trees are a convenient analysis tool.

Hierarchical clustering: agglomerative vs. divisive strategies
One may use the height of the nodes to indicate the intergroup dissimilarity between the two daughter nodes (subclusters at this level).
One may also order the tree (dendrogram) by flipping the branches, such that the tighter cluster is placed e.g. to the left.
Cut K=2, Cut K=3: the example above is just an illustration of the divisive (top-down) strategy.
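The K-means procedure described above (Step 1, Step 2, iterate until assignments stop changing) can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code; the function names, the use of numpy, and the toy two-blob data set are my own choices, and the restart loop at the end follows the slide's advice to try different initial solutions:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Minimal K-means: Step 1 picks k random data points as centers;
    Step 2 alternates assignment and centroid updates until the
    cluster assignments stop changing."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    assignment = None
    for _ in range(n_iter):
        # Assign each point to the closest center (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        new_assignment = dists.argmin(axis=1)
        if assignment is not None and np.array_equal(assignment, new_assignment):
            break  # converged: assignments no longer change
        assignment = new_assignment
        # Recompute each center as the geometric center of its cluster.
        for j in range(k):
            if np.any(assignment == j):
                centers[j] = X[assignment == j].mean(axis=0)
    return centers, assignment

def within_scatter(X, centers, assignment):
    """'Within cluster' point scatter: total squared distance to centers."""
    return float(((X - centers[assignment]) ** 2).sum())

# Toy data: two well-separated blobs of 10 points each.
X = np.vstack([np.random.default_rng(1).normal(0.0, 0.1, (10, 2)),
               np.random.default_rng(2).normal(5.0, 0.1, (10, 2))])

# Greedy descent only finds a local minimum, so try several
# initial solutions and keep the one with the smallest scatter.
runs = [kmeans(X, k=2, seed=s) for s in range(8)]
centers, labels = min(runs, key=lambda run: within_scatter(X, *run))
```

Because each run converges only to a local minimum of the scatter, the restart loop mirrors the lecture's point that trying different initial solutions may be a good idea.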
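The agglomerative (bottom-up) counterpart of the divisive strategy illustrated on the slide can be sketched directly from a pairwise dissimilarity matrix D. Below is a minimal single-linkage sketch in plain Python; the function name and the toy distance matrix are my own, and D is assumed symmetric (a non-symmetric D can first be symmetrized, e.g. by averaging it with its transpose, as in the Problem above):

```python
def agglomerative(D, k):
    """Bottom-up hierarchical clustering from a symmetric dissimilarity
    matrix D: start with singleton clusters and repeatedly merge the
    closest pair (single linkage) until k clusters remain."""
    clusters = [[i] for i in range(len(D))]
    while len(clusters) > k:
        # Find the pair of clusters with the smallest single-linkage
        # distance, i.e. the minimal dissimilarity over cross-cluster pairs.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(D[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the closest pair
        del clusters[b]
    return clusters

# Four points on a line: two tight pairs, {0, 1} and {2, 3}.
xs = [0.0, 0.5, 4.0, 4.5]
D = [[abs(p - q) for q in xs] for p in xs]
clusters = agglomerative(D, k=2)  # cut the hierarchy at K=2
```

Stopping the merging at k = 2 or k = 3 clusters corresponds to the "Cut K=2" and "Cut K=3" levels of the dendrogram on the slide.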