Clustering
CSE 6363 – Machine Learning
Vassilis Athitsos
Computer Science and Engineering Department
University of Texas at Arlington

Supervised vs. Unsupervised Learning
• In supervised learning, the training data is a set of pairs (x_i, y_i), where x_i is an example input and y_i is the target output for that input.
– The goal is to learn a general function H(x) that maps inputs to outputs.
– The majority of the methods we have covered this semester fall under the category of supervised learning.
• In unsupervised learning, the training data is a set of values x_i.
– There is no associated target value for any x_i.
– Because of that, the values x_i are called unlabeled data.
• The goal in supervised learning is regression or classification.
• The goal in unsupervised learning is to discover hidden structure in the data.
• We have already seen some examples of unsupervised learning in this class. Can you think of any such examples?
– Learning mixtures of Gaussians using EM: the hidden structure is the decoupling of the overall distribution into a combination of multiple Gaussians.
– PCA: the hidden structure is the representation of the data as a linear combination of a few eigenvectors plus noise.
– SVD: the hidden structure is the representation of the data as dot products of low-dimensional vectors.
– HMMs: Baum-Welch can be seen as unsupervised learning, as it estimates probabilities of hidden states given the training observation sequences.

Clustering
• The goal of clustering is to separate the data into coherent subgroups.
– Data within a cluster should be more similar to each other than to data in other clusters.
• For example: can you identify clusters in this dataset? How many? Which ones?
– Many people would identify three clusters.
• Sometimes the clusters are not as obvious.
• Identifying clusters can be even harder with high-dimensional data.

Applications of Clustering
• Finding subgroups of similar items is useful in many fields:
– In biology, clustering is used to identify relationships between organisms and to uncover possible evolutionary links.
– In marketing, clustering is used to identify segments of the population that would be specific targets for specific products.
– In anomaly detection, anomalous data can be identified as data that cannot be assigned to any of the "normal" clusters.
– In search engines and recommender systems, clustering can be used to group similar items together.

K-Means Clustering
• K-means clustering is a simple and widely used clustering method.
• First, clusters are initialized by assigning each object randomly to a cluster.
• Then, the algorithm alternates between:
– Re-assigning objects to clusters, based on distances from each object to the mean of each current cluster.
– Re-computing the means of the clusters.
• The number of clusters must be chosen manually.
• A minimal implementation of this procedure is sketched below.
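The following is a minimal NumPy sketch of the procedure just described; it is illustrative rather than code from the slides. It initializes by assigning each point to a random cluster, then alternates between recomputing cluster means and reassigning points, stopping when the assignments no longer change. The function and variable names (`kmeans`, `assignments`, etc.) are my own.

```python
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    """Minimal k-means sketch: X is an (n, D) array of points, K is the number of clusters."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Initialization: assign each object randomly to one of the K clusters.
    assignments = rng.integers(0, K, size=n)
    for _ in range(max_iters):
        # Re-compute the mean of each current cluster.
        # (If a cluster happens to be empty, re-seed it with a random data point.)
        means = np.array([X[assignments == j].mean(axis=0)
                          if np.any(assignments == j) else X[rng.integers(n)]
                          for j in range(K)])
        # Re-assign each object to the cluster with the nearest mean.
        distances = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        new_assignments = distances.argmin(axis=1)
        # Terminate when the assignments no longer change.
        if np.array_equal(new_assignments, assignments):
            break
        assignments = new_assignments
    return assignments, means
```

For example, `kmeans(X, 3)` on a 2-D dataset like the one in the figures returns one cluster label per point, together with the three cluster means.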
K-Means Clustering: Initialization
• To start the iterative process, we first need to provide some initial values.
• We manually pick the number of clusters.
– In the example shown, we pick 3 as the number of clusters.
• We randomly assign objects to clusters.
• In the example figure, cluster membership is indicated with color: there is a red cluster, a green cluster, and a brown cluster.
• The means of these random clusters are shown with × marks.

K-Means Clustering: Main Loop
• The main loop alternates between:
– Computing new assignments of objects to clusters, based on distances from each object to each cluster mean.
– Computing new means for the clusters, using the current cluster assignments.
• At this point, we have provided some random initial assignments of points to clusters. What is the next step?
• The next step is to compute new assignments of objects to clusters.
– The figure shows the new assignments.
• Next, we compute the new means.
– Again, the new means are shown with × marks.
• Next, we compute new assignments again.
– We end up with the same assignments as before.
– When this happens, we can terminate the algorithm.
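In practice, k-means is usually run via a library rather than implemented by hand. The short sketch below is not from the slides; it uses scikit-learn, whose `KMeans` initializes cluster centers with the k-means++ heuristic by default rather than the random cluster assignment described above. The example data is hypothetical, standing in for the 2-D points in the figures.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D data with three loose groups, standing in for the figure's dataset.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2))
               for c in ([0, 0], [4, 0], [2, 3])])

km = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = km.fit_predict(X)        # one cluster label per point
centers = km.cluster_centers_     # the three cluster means
```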
K-Means Clustering: Another Example
• Here is a more difficult example, where there are no obvious clusters.
• Again, we specify manually that we want to find 3 clusters.
• First step: randomly assign points to the three clusters.
– We see the random assignments and the resulting means for the three clusters.
• Main loop, iterations 1 through 5: recompute cluster assignments and cluster means.
• Main loop, iteration 6: the cluster assignments do not change compared to iteration 5, so we can stop.

K-Means Clustering – Theory
• Let X = {x_1, x_2, …, x_n} be our set of points.
– We assume that X is a subset of the D-dimensional vector space ℝ^D, so each x_i is a D-dimensional vector.
• Let S_1, S_2, …, S_K be some set of clusters for X.
– The union of all clusters should be X:  S_1 ∪ S_2 ∪ … ∪ S_K = X.
– The intersection of any two distinct clusters should be empty:  ∀ j ≠ l: S_j ∩ S_l = ∅.
• Let m_j be the mean of all elements of S_j.
• The error measure for such a set of clusters is defined as:
  E(S_1, S_2, …, S_K) = Σ_{j=1}^{K} Σ_{x_i ∈ S_j} ‖x_i − m_j‖,
  where ‖x_i − m_j‖ is the Euclidean distance between x_i and m_j.
• The optimal set of clusters is the one that minimizes this error measure.
• Finding the optimal set of clusters is NP-hard.
– Solving this problem exactly takes time O(n^{KD+1}), where:
  • n is the number of objects.
  • D is the number of dimensions.
  • K is the number of clusters.
• The iterative algorithm we described earlier decreases the error E at each iteration.
• Thus, this iterative algorithm converges.
– However, it only converges to a local optimum.
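As a small illustration (not from the slides), the error measure E can be computed directly from the assignments and the cluster means; evaluating it before and after an iteration of the main loop shows the error going down. Note that many textbooks define the k-means objective with squared Euclidean distances; the sketch below follows the definition given above.

```python
import numpy as np

def clustering_error(X, assignments, means):
    """E(S_1, ..., S_K): sum over all points of the distance to their cluster's mean."""
    # means[assignments] picks, for each point, the mean of the cluster it belongs to.
    return np.linalg.norm(X - means[assignments], axis=1).sum()
```

Here `assignments` is an array of cluster indices (such as the one returned by the k-means sketch earlier) and `means` is the K × D array of cluster means.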
K-Medoid Clustering
• The main loop in k-means clustering is:
– Computing new assignments of objects to clusters, based on distances from each object to each cluster mean.
– Computing new means for the clusters, using the current cluster assignments.
• K-medoid clustering is a variation of k-means, where we use medoids instead of means.
• The main loop in k-medoid clustering is:
– Computing new assignments of objects to clusters, based on distances from each object to each cluster medoid.
– Computing new medoids for the clusters, using the current cluster assignments.
• To complete the definition of k-medoid clustering, we need to define what a medoid of a set is.
• Let S be a set, and let F be a distance measure that evaluates distances between any two elements of S.
• Then, the medoid m_S of S is defined as:
  m_S = argmin_{v ∈ S} Σ_{x ∈ S} F(v, x)
• In words, m_S is the object in S with the smallest sum of distances to all other objects in S.

K-Medoid vs. K-Means
• K-medoid clustering can be applied to non-vector data, with non-Euclidean distance measures. For example:
– K-medoid can be used to cluster a set of time series, using DTW as the distance measure.
– K-medoid can be used to cluster a set of strings, using the edit distance as the distance measure.
• K-means cannot be used in such cases, because means may not make sense for non-vector data.
– For example, it does not make sense to talk about the mean of a set of strings. However, we can define (and find) the medoid of a set of strings under the edit distance.
• K-medoid clustering is more robust to outliers.
– A single outlier can dominate the mean of a cluster, but it typically has only a small influence on the medoid.
• The k-means algorithm can be proven to converge to a local optimum.
– The k-medoid algorithm may not converge.
• A sketch of the medoid computation is shown below.
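Below is a small illustrative sketch (not from the slides) of the medoid definition, written for an arbitrary distance function so that it also works for non-vector data. The `edit_distance` helper is a standard Levenshtein-distance implementation added here only to make the string example concrete.

```python
def medoid(S, F):
    """Return the element of S with the smallest sum of distances F to all elements of S."""
    return min(S, key=lambda v: sum(F(v, x) for x in S))

def edit_distance(a, b):
    """Levenshtein edit distance between strings a and b (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

# Example: the medoid of a set of strings under the edit distance.
strings = ["cluster", "clusters", "class", "cluttered"]
print(medoid(strings, edit_distance))
```

A full k-medoid update step would call `medoid(cluster, F)` for each cluster in place of recomputing the mean.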
EM for Clustering
• Earlier in the course, we saw how to use the Expectation-Maximization (EM) algorithm to estimate the parameters of a mixture of Gaussians.
• This EM algorithm can be viewed as a clustering algorithm.
– Each Gaussian defines a cluster.

EM Clustering: Initialization
• Let's assume that each x_i is a D-dimensional column vector.
• We want to learn the parameters of K Gaussians N_1, N_2, …, N_K.
• Gaussian N_j is defined by these parameters:
– Mean μ_j, which is a D-dimensional column vector.
– Covariance matrix Σ_j, which is a D × D matrix.
• We initialize each μ_j and each Σ_j to random values.
• The figure shows those initial assignments.
– Every object x_i is given the color of the cluster j for which p_ij is the highest.

EM Clustering: Main Loop
• The main loop alternates between:
– Computing new assignment probabilities p_ij that object x_i belongs to Gaussian N_j.
– Computing, for each Gaussian N_j, a new mean μ_j, a new covariance matrix Σ_j, and a new weight w_j, using the current assignment probabilities p_ij.
• The figures show the results after one through five iterations of the main loop.
• After six iterations, the results have not changed, so we stop.
• Note that, for this example, EM has not found the same clusters that k-means produced.
– In general, k-means and EM may perform better or worse, depending on the nature of the data we want to cluster and on our criteria for what defines a good clustering result.

EM Clustering: Another Example
• Here is another example. What clusters would you identify here, if we were looking for two clusters?
• In general, there are no absolute criteria for what defines a "good" clustering result, and different people may give different answers.
• However, for many people, the two clusters are:
– A central cluster of points densely crowded together.
– A peripheral cluster of points spread out far from the center.
• Let's see how EM does on this dataset.
– The figures show the initialization and the results after one, two, three, six, and nine iterations of the main loop.
– The result after nine iterations is the final result; subsequent iterations do not lead to any changes.

EM vs. K-Means
• The k-means result (on the left) is different from the EM result (on the right).
• EM can assign an object to cluster A even if the object is closer to the mean of cluster B.
• Another important difference between k-means clustering and EM clustering is:
– K-means makes hard assignments: an object belongs to one and only one cluster.
– EM makes soft assignments: weights p_ij specify how much each object x_i belongs to cluster j.
• A compact EM sketch is given below.
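The following is a compact NumPy/SciPy sketch of EM for a mixture of Gaussians, written to mirror the loop described above; it is illustrative rather than the slides' code. For numerical stability it initializes the means at randomly chosen data points and the covariances at the data covariance (the slides initialize the parameters randomly), and it adds a small diagonal regularizer to each covariance.

```python
import numpy as np
from scipy.stats import multivariate_normal

def em_gmm(X, K, n_iters=100, seed=0):
    """Minimal EM sketch for a mixture of K Gaussians; X is an (n, D) array."""
    rng = np.random.default_rng(seed)
    n, D = X.shape
    # Initialization (chosen for stability; the slides use random parameter values).
    mu = X[rng.choice(n, K, replace=False)]                   # K x D means
    Sigma = np.stack([np.cov(X.T) + 1e-6 * np.eye(D)] * K)    # K x D x D covariances
    w = np.full(K, 1.0 / K)                                   # mixing weights
    p = np.zeros((n, K))                                      # soft assignments p_ij
    for _ in range(n_iters):
        # E-step: probability that each object x_i belongs to each Gaussian N_j.
        for j in range(K):
            p[:, j] = w[j] * multivariate_normal.pdf(X, mu[j], Sigma[j])
        p /= p.sum(axis=1, keepdims=True)
        # M-step: new weights, means, and covariances from the soft assignments.
        Nj = p.sum(axis=0)                                    # effective cluster sizes
        w = Nj / n
        mu = (p.T @ X) / Nj[:, None]
        for j in range(K):
            diff = X - mu[j]
            Sigma[j] = (p[:, j, None] * diff).T @ diff / Nj[j] + 1e-6 * np.eye(D)
    return w, mu, Sigma, p
```

Hard cluster labels can be read off as `p.argmax(axis=1)`, while `p` itself gives the soft assignments p_ij; library implementations such as scikit-learn's `GaussianMixture` expose the same distinction through `predict` and `predict_proba`.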
Agglomerative Clustering
• In agglomerative clustering, there are different levels of clustering.
– At the top level, every object is its own cluster.
– Under the top level, each level is obtained by merging two clusters from the previous level.
– The bottom level has just one cluster, covering the entire dataset.
• There are different variants of agglomerative clustering.
– Each variant is specified by the method for selecting which two clusters to merge at each level.
• Suppose that, at each level, we merge the two clusters from the previous level that have the smallest minimum distance.
– Let D be a distance measure between objects in our data.
– The minimum distance D_min(A, B) between two sets A and B is defined as:
  D_min(A, B) = min_{x ∈ A, y ∈ B} D(x, y)
• For the seven objects A–G in the example figure, the merges proceed as follows (see the dendrogram in the figure, and the linkage sketch below):
– Clusters {F} and {G} are the closest to each other, so they are merged into {F, G}.
– Next, clusters {A} and {D} are the closest to each other, and are merged into {A, D}.
– Next, clusters {B} and {C} are the closest to each other, and are merged into {B, C}.
– Next, clusters {A, D} and {B, C} are the closest to each other, and are merged into {A, D, B, C}.
– Next, {A, D, B, C} and {E} are merged into {A, D, B, C, E}.
– Finally, {A, D, B, C, E} and {F, G} are merged into {A, D, B, C, E, F, G}, which covers the entire dataset.
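The merging procedure above is single-linkage agglomerative clustering. As an illustration (not from the slides), SciPy's hierarchical-clustering routines implement it directly: `method='single'` corresponds to D_min, while `method='complete'` and `method='average'` correspond to the maximum and mean set distances discussed next. The coordinates below are hypothetical stand-ins for objects A–G.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical 2-D coordinates standing in for objects A-G in the figure.
points = np.array([[0.0, 2.0],   # A
                   [2.5, 3.0],   # B
                   [3.0, 3.2],   # C
                   [0.5, 1.8],   # D
                   [2.0, 1.0],   # E
                   [6.0, 0.5],   # F
                   [6.2, 0.8]])  # G

# Single linkage: at each level, merge the two clusters with the smallest minimum distance.
Z = linkage(points, method='single')              # the full merge hierarchy (dendrogram data)
labels = fcluster(Z, t=3, criterion='maxclust')   # cut the hierarchy into 3 clusters
```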
Agglomerative Clustering
• Instead of using D_min(A, B), we could use other definitions of the distance between two sets. For example:
  D_max(A, B) = max_{x ∈ A, y ∈ B} D(x, y)
  D_mean(A, B) = mean_{x ∈ A, y ∈ B} D(x, y)
• Each of these choices can lead to different merges at each level.

Hierarchical Clustering
• Agglomerative clustering is an example of what we call hierarchical clustering.
• In hierarchical clustering, there are different levels of clustering.
– Each level is obtained by merging, or splitting, clusters from the previous level.
• If we merge clusters from the previous level, we get agglomerative clustering.
• If we split clusters from the previous level, it is called divisive clustering.

Clustering – Recap
• The goal in clustering is to split a set of objects into groups of similar objects.
• There is no single criterion for measuring the quality of a clustering result.
• The number of clusters typically needs to be specified in advance.
– Methods (that we have not seen) do exist for trying to find the number of clusters automatically.
– In hierarchical clustering (e.g., agglomerative clustering), we do not need to pick a number of clusters.
• A large variety of clustering methods exist.
• We saw a few examples of such methods: k-means, k-medoid, EM, and agglomerative clustering.