Microarray Data Analysis II: Clustering

Copyright notice
• Many of the images in this PowerPoint presentation are from other people. The copyright belongs to the original authors. Thanks!

Aim of clustering: group objects according to their similarity.
Cluster: a set of objects that are similar to each other and separated from the other objects.
Example: the green/red data points were generated from two different normal distributions.

What is clustering?
• Clustering: a way of grouping together data samples that are similar in some way - according to some criteria that you pick
• A form of unsupervised learning - you generally don't have examples demonstrating how the data should be grouped together
• It's a method of data exploration - a way of looking for patterns or structure in the data that are of interest
• The results vary drastically depending on
– the clustering method
– the similarity or dissimilarity metric
– additional parameters specific to each clustering method (e.g. number of centres for k-means, agglomeration rule for hierarchical clustering, ...)

Why cluster microarray gene data?
• Identify groups of possibly co-regulated genes (e.g. in conjunction with sequence data).
• Identify typical temporal or spatial gene expression patterns (e.g. cell cycle data).
• Arrange a set of genes in a linear order that is at least not totally meaningless.

Distance and similarity metrics

Distance/similarity measures
• A crucial parameter for clustering is the choice of an appropriate metric to measure the similarity or dissimilarity between objects
– The similarity measure is often more important than the clustering algorithm used - don't overlook this choice!
• Jagota defines a dissimilarity measure as a function f(x,y) such that f(x,y) > f(w,z) if and only if x is less similar to y than w is to z
• This is always a pair-wise measure

Choice of a distance/similarity metric
• There are plenty of distance/similarity metrics:
– Euclidean distance
– Manhattan distance
– Pearson's coefficient of correlation
– Mahalanobis distance
– χ² distance
– Mutual information
• The choice of metric depends on the data type

Euclidean distance
• You are probably familiar with the calculation of Euclidean distance in a 2-dimensional space:
[Figure: points a and b in the X-Y plane; $d$ is the hypotenuse of a right triangle with legs $x_b - x_a$ and $y_b - y_a$]
$d_E = \sqrt{(x_a - x_b)^2 + (y_a - y_b)^2}$
• The concept naturally extends to spaces with higher dimension (a p-dimensional space):
$d_E = \sqrt{\sum_{i=1}^{p} (x_{ai} - x_{bi})^2}$

[Figure: three pairs of expression profiles, with d_euc = 0.5846, 2.6115 and 1.1345]
These examples of Euclidean distance match our intuition of dissimilarity pretty well…

[Figure: two pairs of expression profiles, with d_euc = 1.41 and 1.22]
…But what about these? What might be going on with the expression profiles on the left? On the right?

Correlation
• We might care more about the overall shape of expression profiles than about the actual magnitudes
• That is, we might want to consider genes similar when they are "up" and "down" together
• When might we want this kind of measure? What experimental issues might make this appropriate?

Pearson's coefficient of correlation
• Pearson's coefficient of correlation is a classical metric of similarity between two objects:
$c_P = \frac{\sum_{i=1}^{p} (x_{ai} - m_i)(x_{bi} - m_i)}{p \, s_a s_b} = \frac{1}{p} \sum_{i=1}^{p} z_{ai} \, z_{bi}$
where
– a is the index of an object (e.g. a gene)
– b is the index of another object (e.g. a gene)
– i is an index of dimension (e.g. a chip)
– m_i is the mean value of the ith dimension
– s_a and s_b are the standard deviations of objects a and b
Note the correspondence with z-scores.
• The correlation lies between -1 and 1. It can be converted to a distance metric by a simple operation:
$d_P = (1 - c_P)/2$
• Pearson correlation only measures the degree of a linear relationship between two expression profiles!
• If you want to measure a general relationship, use mutual information.
[Figure: the green curve is the square of the blue curve - this relationship is not captured by Pearson correlation; here $c_P = 0.0249$, so $d_P = 0.4876$]
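To make the contrast concrete, here is a minimal sketch (Python with numpy/scipy; the three toy profiles and the function names are invented for illustration) comparing Euclidean distance with the correlation-based distance $d_P = (1 - c_P)/2$ on profiles that share shape but differ in magnitude:

```python
import numpy as np
from scipy.stats import pearsonr

# Toy expression profiles over 6 conditions (hypothetical values):
# b has the same shape as a but twice the amplitude; c is flat.
a = np.array([1.0, 2.0, 3.0, 2.0, 1.0, 0.0])
b = 2 * a                                       # co-regulated, larger magnitude
c = np.full(6, 1.5)                             # flat profile

def d_euclidean(x, y):
    """Euclidean distance: sensitive to magnitude differences."""
    return np.sqrt(np.sum((x - y) ** 2))

def d_pearson(x, y):
    """Correlation-based distance d_P = (1 - c_P)/2: sensitive to shape only."""
    c_p, _ = pearsonr(x, y)
    return (1 - c_p) / 2

print(d_euclidean(a, b), d_euclidean(a, c))  # ~4.36 vs ~2.35: a is Euclidean-closer to flat c
print(d_pearson(a, b))                       # 0.0: a and b are perfectly correlated
```

By Euclidean distance the flat profile looks closer to a than its co-regulated partner does; by correlation distance the co-regulated pair is identical, which is the behavior the slides motivate.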
Impact of the distance metrics
[Figure: two panels. Panel A: points a, b and c in a plane with axes 0-5; panel B: profiles d, e and f over positions 0-10]
• Panel A: by Euclidean distance, a is close to b; by the correlation coefficient, a is close to c
• Panel B: by Euclidean distance, d is close to f; by the correlation coefficient, d is close to e

Dot product
• The dot product combines advantages of the Euclidean distance and of the coefficient of correlation:
$d_p = \sum_{i=1}^{p} x_{ai} \cdot x_{bi}$
– It takes positive values to represent co-regulation and negative values to represent anti-regulation (like the coefficient of correlation)
– It reflects the strength of the regulation of both genes, since it uses the real values (like the Euclidean distance) rather than the standardized ones (like the coefficient of correlation)

Mutual Information
Entropy-based distances: Mutual Information (semi-metric distance)
• Mutual Information (MI) is a statistical representation of the correlation of two signals A and B.
• MI is a measure of the additional information known about one expression pattern when given another.
• MI is not based on linear models and can therefore also detect non-linear dependencies (see picture).

Clustering Algorithms

Generic Clustering Tasks
• Estimating the number of clusters
• Assigning each object to a cluster
• Assessing strength/confidence of cluster assignments for individual objects
• Assessing cluster homogeneity

How hard is clustering?
• One idea is to consider all possible clusterings, and pick the one that has the best inter- and intra-cluster distance properties
• Suppose we are given n points, and would like to cluster them into k clusters
– How many possible clusterings? Approximately $\frac{k^n}{k!}$
• Too hard to do by brute force or optimally
• Solution: heuristic methods (a quick sanity check of this count follows below)
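As that sanity check, here is a small sketch (Python; `stirling2` is a helper name of my own) comparing the exact number of ways to partition n points into k non-empty clusters (the Stirling number of the second kind) with the $k^n/k!$ approximation:

```python
from functools import lru_cache
from math import factorial

@lru_cache(maxsize=None)
def stirling2(n, k):
    """Stirling number of the second kind: ways to partition n labelled
    points into k non-empty, unlabelled clusters."""
    if k == 0:
        return 1 if n == 0 else 0
    if k > n:
        return 0
    # The nth point either starts a new cluster (S(n-1, k-1))
    # or joins one of the k existing clusters (k * S(n-1, k)).
    return stirling2(n - 1, k - 1) + k * stirling2(n - 1, k)

for n, k in [(10, 3), (19, 4), (25, 5)]:
    exact = stirling2(n, k)
    approx = k ** n / factorial(k)
    print(f"n={n}, k={k}: exact={exact}, k^n/k! ~ {approx:.3g}")
```

Even for modest n the count explodes (for n=25, k=5 it is already in the quadrillions), which is why practical algorithms settle for heuristics.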
Classical clustering methods
• Partitioning methods
• Hierarchical methods

Partitioning Methods
• Partition the objects into a prespecified number of groups K
• Iteratively reallocate objects to clusters until some criterion is met (e.g. minimize within-cluster sums of squares)
• Examples: k-means, self-organizing maps (SOM), partitioning around medoids (PAM), model-based clustering

Hierarchical Clustering
• Produces a dendrogram
• Avoids prespecification of the number of clusters K
• The tree can be built in two distinct ways:
– Bottom-up: agglomerative clustering
– Top-down: divisive clustering

Agglomerative Methods
• Start with n mRNA-sample (or G gene) clusters
• At each step, merge the two closest clusters using a measure of between-cluster dissimilarity which reflects the shape of the clusters

Divisive Methods
• Start with only one cluster
• At each step, split clusters into two parts
• Advantage: obtains the main structure of the data (i.e. focuses on the upper levels of the dendrogram)
• Disadvantage: computational difficulties when considering all possible divisions into two groups

Partitioning vs. Hierarchical
• Partitioning
– Advantage: provides clusters that satisfy some optimality criterion (approximately)
– Disadvantages: needs the initial K; long computation time
• Hierarchical
– Advantage: fast computation (agglomerative)
– Disadvantages: rigid; cannot correct later for erroneous decisions made earlier

K-means clustering

K-Means Clustering Algorithm
• The number of centres (k) has to be specified a priori
• Algorithm:
– (1) Arbitrarily select k initial centres
– (2) Assign each element to the closest centre
– (3) Re-calculate the centres (mean position of the assigned elements)
– (4) Repeat (2) and (3) until one of the stopping conditions is reached:
• the clusters are the same as in the previous iteration
• the difference between two iterations is smaller than a specified threshold
• the max number of iterations has been reached

K-means example - initial conditions
• Two sets of random points are randomly generated:
– 200 points centred on (0,0)
– 50 points centred on (1,1)
• Two points are randomly chosen as seeds (blue dots)

K-means example - first iteration
• Each dot is assigned to the cluster with the closest centre
• Centres are recalculated (blue stars) on the basis of the new clusters

K-means example - iterations 2 through 6
• At each step, points are reassigned to clusters and centres are recalculated
• Cluster boundaries and centre positions evolve at each iteration
• After some iterations (6 in this case), the clusters and centres do not change anymore

K-means clustering - summary
• Strengths
– Simple to use
– Fast
– Can be used with very large data sets
• Weaknesses
– The choice of the number of groups is arbitrary
– The results vary depending on the initial positions of the centres
• Solutions
– Try different values for k and compare the results
– For each value of k, run repeatedly to sample different initial conditions
• Weakness of the solution
– Instead of one clustering, you obtain hundreds of different clusterings; how do you decide among them?

Cell cycle data: K-means clustering

Determining the "correct" number of clusters
• We'd like to have a measure of cluster quality Q and then try different values of k until we get an optimal value for Q
• But, since clustering is an unsupervised learning method, we can't really expect to find a "correct" measure Q…
• So, once again there are different choices of Q, and our decision will depend on what dissimilarity measure we're using and what types of clusters we want

Cluster Quality Measures
• Jagota suggests a measure that emphasizes cluster tightness or homogeneity:
$Q = \sum_{i=1}^{k} \frac{1}{|C_i|} \sum_{x \in C_i} d(x, \mu_i)$
• $|C_i|$ is the number of data points in cluster i, and $\mu_i$ is its centre
• Q will be small if (on average) the data points in each cluster are close
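A minimal sketch of the algorithm and of Jagota's Q measure (Python/numpy; the data mirrors the 200-point/50-point example above, but all constants and names are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
# Mirror the example data: 200 points around (0,0), 50 around (1,1)
X = np.vstack([rng.normal(0, 0.5, (200, 2)),
               rng.normal(1, 0.5, (50, 2))])

def kmeans(X, k, max_iter=100, tol=1e-6):
    # (1) arbitrarily select k initial centres (here: k random data points)
    centres = X[rng.choice(len(X), k, replace=False)]
    for _ in range(max_iter):
        # (2) assign each element to the closest centre
        dists = np.linalg.norm(X[:, None, :] - centres[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # (3) recalculate centres as the mean of the assigned elements
        new_centres = np.array([X[labels == j].mean(axis=0)
                                if np.any(labels == j) else centres[j]
                                for j in range(k)])
        # (4) stop when the centres barely move between iterations
        if np.linalg.norm(new_centres - centres) < tol:
            break
        centres = new_centres
    return labels, centres

def jagota_q(X, labels, centres):
    """Q = sum_i (1/|C_i|) * sum_{x in C_i} d(x, mu_i): small for tight clusters."""
    return sum(np.linalg.norm(X[labels == i] - mu, axis=1).mean()
               for i, mu in enumerate(centres))

for k in range(2, 6):
    labels, centres = kmeans(X, k)
    print(k, round(jagota_q(X, labels, centres), 3))
```

Running it for several k and plotting Q against k is exactly the diagnostic the next slide shows.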
Cluster Quality (cont.)
[Figure: Q (y-axis) plotted against k (x-axis) for k-means clustering on the data shown earlier, using the Q measure given in Jagota]
How many clusters do you think there actually are?

Cluster Quality
• The Q measure given in Jagota takes into account homogeneity within clusters, but not separation between clusters
• Other measures try to combine these two characteristics (e.g., the Davies-Bouldin measure)
• An alternative approach is to look at cluster stability:
– Add random noise to the data many times and count how many pairs of data points no longer cluster together
– How much noise to add? It should reflect the estimated variance in the data

Self-Organizing Maps

Self-Organizing Maps
• Based on the work of Kohonen on learning/memory in the human brain
• As with k-means, we specify the number of clusters
• However, we also specify a topology - a 2D grid that gives the geometric relationships between the clusters (i.e., which clusters should be near or distant from each other)
• The algorithm learns a mapping from the high-dimensional space of the data points onto the points of the 2D grid (there is one grid point for each cluster)

Self-Organizing Maps
[Figure: an 11x11 grid, i.e. 121 clusters in this example; each grid point corresponds to a cluster, and grid points map to cluster means in the high-dimensional space of the data points]

Self-Organizing Maps
• Suppose we have an r x s grid, with each grid point associated with a cluster mean $\mu_{1,1}, \ldots, \mu_{r,s}$
• The SOM algorithm moves the cluster means around in the high-dimensional space, maintaining the topology specified by the 2D grid (think of a rubber sheet)
• A data point is put into the cluster with the closest mean
• The effect is that nearby data points tend to map to nearby clusters (grid points)

SOM Algorithm
    Initialize map
    For t from 0 to 1:        // t tracks learning progress; the learning factor decreases as t grows
        Randomly select a sample
        Get best matching unit
        Scale neighbors
        Increase t a small amount
    End for
From: http://davis.wpi.edu/~matt/courses/soms/
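A compact sketch of this loop (Python/numpy; the grid size, learning-rate schedule, and Gaussian neighborhood width are illustrative assumptions, not Kohonen's or the WPI tutorial's exact settings):

```python
import numpy as np

rng = np.random.default_rng(1)
grid_r, grid_c = 10, 10
dim = 3                                        # e.g. RGB colors as 3D data points
weights = rng.random((grid_r, grid_c, dim))    # initialize map with random weights
samples = rng.random((500, dim))               # toy data set

# Grid coordinates, used to measure distances *on the 2D grid*
gy, gx = np.mgrid[0:grid_r, 0:grid_c]

n_steps = 2000
for step in range(n_steps):
    t = step / n_steps                   # learning progress, runs from 0 to 1
    lr = 1.0 - t                         # learning rate decreases as t grows
    sigma = 3.0 * (1.0 - t) + 0.5        # neighborhood radius also shrinks
    x = samples[rng.integers(len(samples))]    # randomly select a sample
    # Best matching unit: the grid cell whose weight is closest to the sample
    d = np.linalg.norm(weights - x, axis=2)
    by, bx = np.unravel_index(d.argmin(), d.shape)
    # Gaussian neighborhood on the grid: nearer units learn more
    g = np.exp(-((gy - by) ** 2 + (gx - bx) ** 2) / (2 * sigma ** 2))
    # Scale neighbors: move weights toward the sample
    weights += lr * g[..., None] * (x - weights)
```

After training, nearby grid cells hold similar weight vectors, which is exactly the "similar colors group together" behavior the following example illustrates.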
An Example Using Color
• Three-dimensional data: red, blue, green
• Will be converted into a 2D image map, with clustering of dark blue and greys together, and yellow close to both the red and the green
From http://davis.wpi.edu/~matt/courses/soms/

An Example Using Color
• Each color in the map is associated with a weight
From http://davis.wpi.edu/~matt/courses/soms/

An Example Using Color
1. Initialize the weights
– Random values
– Colors in the corners
– Equidistant
From http://davis.wpi.edu/~matt/courses/soms/

An Example Using Color (continued)
2. Get best matching unit
– After randomly selecting a sample, go through all weight vectors and calculate the best match (in this case using Euclidean distance)
– Think of colors as 3D points, with each component (red, green, blue) on an axis
From http://davis.wpi.edu/~matt/courses/soms/

An Example Using Color (continued)
2. Getting the best matching unit, continued…
For example, let's say we chose green as the sample. Then it can be shown that light green is closer to green than red is:
Green: (0,6,0); Light green: (3,6,3); Red: (6,0,0)
$d(\text{Green}, \text{LightGreen}) = \sqrt{3^2 + 0^2 + 3^2} \approx 4.24$
$d(\text{Green}, \text{Red}) = \sqrt{6^2 + (-6)^2 + 0^2} \approx 8.49$
This step is repeated for the entire map, and the weight with the shortest distance is chosen as the best match
From http://davis.wpi.edu/~matt/courses/soms/

An Example Using Color (continued)
3. Scale neighbors
1. Determine which weights are considered neighbors
2. Determine how much each weight can become more like the sample vector

1. Determine which weights are considered neighbors
In the example, a Gaussian function is used, where every point above 0 is considered a neighbor:
$f(x, y) = e^{-(x^2 + y^2)/6.67}$
From http://davis.wpi.edu/~matt/courses/soms/

An Example Using Color (continued)
2. Determine how much each weight can become more like the sample
When the weight with the smallest distance is chosen and the neighbors are determined, it and its neighbors 'learn' by changing to become more like the sample… The farther away a neighbor is, the less it learns
From http://davis.wpi.edu/~matt/courses/soms/

An Example Using Color (continued)
NewColorValue = CurrentColor*(1-t) + sampleVector*t
Here t is the learning rate: it is 1 for the first iteration and decreases towards 0 in later iterations (as the loop variable grows from 0 to 1), so later samples change the map less and less
From http://davis.wpi.edu/~matt/courses/soms/

Conclusion of Example
Samples continue to be chosen at random until the loop variable reaches 1 (learning stops).
At the conclusion of the algorithm, we have a nicely clustered data set. Also note that we have achieved our goal: similar colors are grouped closely together
From http://davis.wpi.edu/~matt/courses/soms/

A SOFM Example With Yeast
"Interpreting patterns of gene expression with self-organizing maps: Methods and application to hematopoietic differentiation" by Tamayo et al.

Benefits of SOM
• SOM contains the set of features extracted from the input patterns (reduces dimensions)
• SOM yields a set of clusters
• A gene will always be more similar to a gene in its immediate neighborhood than to a gene further away
From "Data Analysis Tools for DNA Microarrays" by Sorin Draghici

SOM Issues
• The algorithm is complicated and there are a lot of parameters (such as the "learning rate") - these settings will affect the results
• The idea of a topology in high-dimensional gene expression spaces is not exactly obvious
– How do we know what topologies are appropriate?
– In practice people often choose nearly square grids for no particularly good reason
• As with k-means, we still have to worry about how many clusters to specify…

Hierarchical clustering

Hierarchical Agglomerative Clustering (HAC) [Hartigan 1975]
• Agglomerative (bottom-up)
• Algorithm:
– Initialize: each item is a cluster
– Iterate:
• select the two most similar clusters
• merge them
– Halt: when the required number of clusters is reached

HAC (cont.)
• This produces a binary tree or dendrogram
• The final cluster is the root and each data item is a leaf
• The height of the bars indicates how close the items are

Hierarchical Clustering Demo
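A short sketch of agglomerative clustering in practice (Python with scipy.cluster.hierarchy, which implements the linkage methods discussed next; the toy "expression matrix" is made up):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(2)
# Toy expression matrix: 12 genes x 8 conditions, two shape families
base_up = np.linspace(0, 1, 8)
genes = np.vstack([base_up + rng.normal(0, 0.1, 8) for _ in range(6)] +
                  [1 - base_up + rng.normal(0, 0.1, 8) for _ in range(6)])

# Build the dendrogram bottom-up; 'single', 'complete' or 'average' selects
# the between-cluster dissimilarity (linkage) discussed below.
# (scipy.cluster.hierarchy.dendrogram can plot Z if matplotlib is available.)
Z = linkage(genes, method='average', metric='euclidean')

# Cut the tree to obtain a required number of clusters (here 2)
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)   # the 6 "up" genes and the 6 "down" genes separate
```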
Linkage in Hierarchical Clustering
• We already know about distance measures between data items, but what about between a data item and a cluster, or between two clusters?
• We just treat a data point as a cluster with a single item, so our only problem is to define a linkage method between clusters
• As usual, there are lots of choices…

Hierarchical: Single Link
• cluster similarity = similarity of the two most similar members
– Potentially long and skinny clusters
+ Fast

Example: single link
Starting distance matrix:
        1    2    3    4    5
   1    0    2    6   10    9
   2         0    3    9    8
   3              0    7    5
   4                   0    4
   5                        0
Merge the closest pair, 1 and 2 (distance 2), then update:
$d_{(1,2),3} = \min\{d_{1,3}, d_{2,3}\} = \min\{6, 3\} = 3$
$d_{(1,2),4} = \min\{d_{1,4}, d_{2,4}\} = \min\{10, 9\} = 9$
$d_{(1,2),5} = \min\{d_{1,5}, d_{2,5}\} = \min\{9, 8\} = 8$

        (1,2)   3    4    5
 (1,2)    0     3    9    8
   3            0    7    5
   4                 0    4
   5                      0
Merge (1,2) with 3 (distance 3), then update:
$d_{(1,2,3),4} = \min\{d_{(1,2),4}, d_{3,4}\} = \min\{9, 7\} = 7$
$d_{(1,2,3),5} = \min\{d_{(1,2),5}, d_{3,5}\} = \min\{8, 5\} = 5$

          (1,2,3)   4    5
 (1,2,3)     0      7    5
    4               0    4
    5                    0
Merge 4 with 5 (distance 4), and finally:
$d_{(1,2,3),(4,5)} = \min\{d_{(1,2,3),4}, d_{(1,2,3),5}\} = 5$
The last merge joins (1,2,3) and (4,5) at height 5.

Hierarchical: Complete Link
• cluster similarity = similarity of the two least similar members
+ Tight clusters
– Slow

Example: complete link
Starting from the same distance matrix, merge 1 and 2 (distance 2), then update:
$d_{(1,2),3} = \max\{d_{1,3}, d_{2,3}\} = \max\{6, 3\} = 6$
$d_{(1,2),4} = \max\{d_{1,4}, d_{2,4}\} = \max\{10, 9\} = 10$
$d_{(1,2),5} = \max\{d_{1,5}, d_{2,5}\} = \max\{9, 8\} = 9$

        (1,2)   3    4    5
 (1,2)    0     6   10    9
   3            0    7    5
   4                 0    4
   5                      0
The closest pair is now 4 and 5 (distance 4); merge them and update:
$d_{(1,2),(4,5)} = \max\{d_{(1,2),4}, d_{(1,2),5}\} = \max\{10, 9\} = 10$
$d_{3,(4,5)} = \max\{d_{3,4}, d_{3,5}\} = \max\{7, 5\} = 7$

        (1,2)   3   (4,5)
 (1,2)    0     6    10
   3            0     7
 (4,5)                0
Merge (1,2) with 3 (distance 6), and finally:
$d_{(1,2,3),(4,5)} = \max\{d_{(1,2),(4,5)}, d_{3,(4,5)}\} = 10$
The last merge is at height 10 - note the different merge order and heights compared to single link.

Hierarchical: Average Link
• cluster similarity = average similarity of all pairs
+ Tight clusters
– Slow

Software: TreeView [Eisen et al. 1998]
• Fig. 1 in Eisen's 1998 PNAS paper
• Time course of serum stimulation of primary human fibroblasts
• cDNA arrays with approx. 8600 spots
• Similar to average-link
• Free download at: http://rana.lbl.gov/EisenSoftware.htm

Hierarchical Clustering Issues
• Distinct clusters are not produced - sometimes this can be good, if the data has a hierarchical structure without clear boundaries
• There are methods for producing distinct clusters, but these usually involve specifying somewhat arbitrary cutoff values
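The worked examples above can be checked mechanically (Python/scipy; the condensed vector lists the upper triangle of the slides' 5x5 distance matrix row by row):

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Condensed form of the slides' distance matrix, ordered
# (1,2) (1,3) (1,4) (1,5) (2,3) (2,4) (2,5) (3,4) (3,5) (4,5)
d = np.array([2, 6, 10, 9, 3, 9, 8, 7, 5, 4], dtype=float)

for method in ("single", "complete"):
    Z = linkage(d, method=method)
    # Each row of Z: [cluster_a, cluster_b, merge_height, new_cluster_size]
    print(method, "merge heights:", Z[:, 2])
# single   -> [2. 3. 4. 5.]    matching the single-link example
# complete -> [2. 4. 6. 10.]   matching the complete-link example
```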
Motivation of a New Hierarchical Clustering Algorithm
• Traditional hierarchical clustering algorithms have two drawbacks:
– The topological structure of the hierarchy is fixed
– Once data is improperly clustered in an early stage, it cannot be re-evaluated later

Dynamically Growing Self-Organizing Tree (DGSOT)
• DGSOT is a divisive hierarchical clustering algorithm
• DGSOT grows in two directions: vertical and horizontal
– In each vertical growth, DGSOT adds two children to any leaf whose heterogeneity is greater than a threshold, turning it into a node
– In each horizontal growth, DGSOT dynamically finds the proper number of children (sub-clusters) of the lowest-level nodes
• The proper number of clusters is determined by cluster validation criteria
• Each vertical growth step is followed by a horizontal growth step
• After each vertical and horizontal growth, a learning process is applied

Clustering Results
A) The whole tree constructed by SOTA
B) The whole tree constructed by DGSOT
C) An enlarged picture of the part of the DGSOT tree indicated by the red bar

Clustering Results
A) The six highest-level clusters (1-6)
B-G) The sub-clusters of the six clusters in A

Comparison of DGSOT with SOTA

Cluster number (DGSOT / SOTA): 11 / 39, containing 250 / 184 ORFs
    MIPS Functional Category      ORFs in category (DGSOT / SOTA)    P-value, -log10 (DGSOT / SOTA)
    Ribosomal proteins            122 / 50                           124 / 32
    Organization of cytoplasm     141 / 67                           87 / 25
    Translation                   10 / 6                             4 / 2

Cluster number (DGSOT / SOTA): 12 / 26, containing 182 / 142 ORFs
    MIPS Functional Category      ORFs in category (DGSOT / SOTA)    P-value, -log10 (DGSOT / SOTA)
    Ribosomal proteins            29 / 49                            12 / 37
    Organization of cytoplasm     43 / 64                            9 / 30

SOTA is available at http://gepas.bioinfo.cnio.es/

Comparison of DGSOT with SOTA
• The hierarchical structure of SOTA (A) and DGSOT (B) when each has 25 leaves
A) Result of SOTA
B) Result of DGSOT