COMP1942 Clustering 1 (Introduction and k-means)
Prepared by Raymond Wong. Some parts of these notes are borrowed from LW Chan's notes. Screenshots of XLMiner captured by Kai Ho CHAN. Presented by Raymond Wong (raywong@cse).

Clustering
Example: each record holds a student's Computer score and History score.

  Name     Computer  History
  Raymond  100       40
  Louis    90        20
  Wyman    45        95
  …        …         …

[Figure: scatter plot of Computer score against History score. Cluster 1 contains points with a high score in Computer and a low score in History; Cluster 2 contains points with a high score in History and a low score in Computer.]

Problem: to find all clusters.

Why Clustering?
Clustering for utility:
- Summarization
- Compression

Why Clustering?
Clustering for understanding. Applications:
- Biology: group different species
- Psychology and Medicine: group medicine
- Network: group different types of traffic patterns
- Software: group different programs for data analysis
- Business: group different customers for marketing

Clustering Methods
- K-means clustering
  - Original k-means clustering
  - Sequential k-means clustering
  - Forgetful sequential k-means clustering
  - How to use the data mining tool
- Hierarchical clustering methods
  - Agglomerative methods
  - Divisive methods (polythetic approach and monothetic approach)
  - How to use the data mining tool

K-means Clustering
Suppose that we have n example feature vectors x1, x2, …, xn, and we know that they fall into k compact clusters, with k < n. Let mi be the mean of the vectors in cluster i. We can say that x is in cluster i if the distance from x to mi is the minimum of all the k distances.

Procedure for finding k-means
- Make initial guesses for the means m1, m2, …, mk.
- Until there is no change in any mean:
  - Assign each data point to the cluster whose mean is nearest.
  - Calculate the mean of each cluster: for i from 1 to k, replace mi with the mean of all examples assigned to cluster i.

[Figure: k-means iterations on a 2-D example with k = 2: arbitrarily choose k means, assign each object to its most similar center, update the cluster means, reassign, and repeat until the assignments stop changing.]

Initialization of k-means
The procedure does not specify how to initialize the means. One popular way is to randomly choose k of the examples. The results depend on the initial values of the means, and it frequently happens that suboptimal partitions are found. The standard solution is to try a number of different starting points.

Disadvantages of k-means
- With a "bad" initial guess, a cluster may end up with no points assigned to its initial mean mi.
- The value of k is not user-friendly: we usually do not know the number of clusters before we start looking for them.
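Before turning to the sequential variants, the batch procedure above can be summarized in code. The following is a minimal Python/NumPy sketch (not part of the original slides); the function names, the small Computer/History sample data, and the random-restart helper are illustrative assumptions.

```python
import numpy as np

def kmeans(X, k, max_iter=100, rng=None):
    """Original (batch) k-means: assign every point to its nearest mean, then recompute all means."""
    rng = np.random.default_rng(rng)
    # One popular initialization: randomly choose k of the examples as the starting means.
    means = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(max_iter):
        # Assign each data point to the cluster whose mean is nearest (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_means = means.copy()
        for i in range(k):
            members = X[labels == i]
            if len(members):                  # a "bad" start can leave a cluster empty
                new_means[i] = members.mean(axis=0)
        if np.allclose(new_means, means):     # stop when no mean changes
            break
        means = new_means
    return means, labels

def kmeans_with_restarts(X, k, n_starts=5):
    """Run k-means from several random starting points and keep the lowest-SSE result."""
    best = None
    for seed in range(n_starts):
        means, labels = kmeans(X, k, rng=seed)
        sse = ((X - means[labels]) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, means, labels)
    return best[1], best[2]

# Example: Computer/History scores in the spirit of the earlier slide (values are made up).
X = np.array([[100, 40], [90, 20], [45, 95], [95, 35], [40, 90], [85, 30]], dtype=float)
means, labels = kmeans_with_restarts(X, k=2)
print(means)
print(labels)
```

Running k-means from several random starting points and keeping the partition with the smallest sum of squared errors is one way to apply the "standard solution" of trying different starting points mentioned above.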
Clustering Methods
(Outline repeated; next topic: Sequential k-means clustering.)

Sequential k-Means Clustering
Another way to modify the k-means procedure is to update the means one example at a time, rather than all at once. This is particularly attractive when we acquire the examples over a period of time and we want to start clustering before we have seen all of them. Here is a modification of the k-means procedure that operates sequentially:
- Make initial guesses for the means m1, m2, …, mk.
- Set the counts n1, n2, …, nk to zero.
- Until interrupted:
  - Acquire the next example, x.
  - If mi is closest to x:
    - Increment ni.
    - Replace mi by mi + (1/ni)(x − mi).

Clustering Methods
(Outline repeated; next topic: Forgetful sequential k-means clustering.)

Forgetful Sequential k-means
This also suggests another alternative in which we replace the counts by constants. In particular, suppose that a is a constant between 0 and 1, and consider the following variation:
- Make initial guesses for the means m1, m2, …, mk.
- Until interrupted:
  - Acquire the next example, x.
  - If mi is closest to x, replace mi by mi + a(x − mi).

The result is called the "forgetful" sequential k-means procedure. It is not hard to show that mi is a weighted average of the examples that were closest to mi, where the weight decreases exponentially with the "age" of the example. That is, if m0 is the initial value of the mean vector and xk is the k-th of the n examples that were used to form mi, then

  m_n = (1 − a)^n m_0 + a Σ_{k=1}^{n} (1 − a)^(n−k) x_k

Thus, the initial value m0 is eventually forgotten, and recent examples receive more weight than ancient examples. This variation of k-means is particularly simple to implement, and it is attractive when the nature of the problem changes over time and the cluster centres "drift".

Clustering Methods
(Outline repeated; next topic: How to use the data mining tool.)

How to use the data mining tool
We can use XLMiner to perform k-means clustering. Open "cluster.xls".
The slides that follow are XLMiner screenshots; the dialog settings they show are:
- Data source: Workbook, Worksheet, Data range.
- Data: # Rows: 10, # Cols: 3; first row contains headers; variables in input data: Name, Computer, History.
- Parameters: Normalize input data; # Clusters: 2; # Iterations: 10.
- Options: Fixed Start / Random Starts; No. of Starts: 5; Set Seed: 12345.
- Output options: Show data summary; Show distances from each cluster center.
[The remaining slides are screenshots with no recoverable text.]
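The sequential and forgetful sequential updates described above are easy to implement directly. Below is a minimal Python/NumPy sketch (not from the slides); the streaming input, the initial means, and the choice a = 0.2 are illustrative assumptions.

```python
import numpy as np

def sequential_kmeans(stream, initial_means):
    """Sequential k-means: update one mean per incoming example, weighting by 1/n_i."""
    means = np.array(initial_means, dtype=float)
    counts = np.zeros(len(means), dtype=int)           # n_1, ..., n_k start at zero
    for x in stream:                                    # "until interrupted": here, until the stream ends
        x = np.asarray(x, dtype=float)
        i = np.linalg.norm(means - x, axis=1).argmin()  # find the m_i closest to x
        counts[i] += 1
        means[i] += (1.0 / counts[i]) * (x - means[i])  # m_i <- m_i + (1/n_i)(x - m_i)
    return means, counts

def forgetful_sequential_kmeans(stream, initial_means, a=0.2):
    """Forgetful variant: replace 1/n_i with a constant a in (0, 1), so old examples fade out."""
    means = np.array(initial_means, dtype=float)
    for x in stream:
        x = np.asarray(x, dtype=float)
        i = np.linalg.norm(means - x, axis=1).argmin()
        means[i] += a * (x - means[i])                  # m_i <- m_i + a(x - m_i)
    return means

# Example usage on a small stream of Computer/History scores (illustrative values).
stream = [[100, 40], [90, 20], [45, 95], [95, 35], [40, 90]]
init = [[100, 40], [45, 95]]                            # e.g. the first example seen from each cluster
print(sequential_kmeans(stream, init))
print(forgetful_sequential_kmeans(stream, init, a=0.2))
```

With the 1/ni weight, each mean is simply the average of the examples assigned to it so far; with the constant a, it becomes the exponentially weighted average given by the formula above, which is what lets the cluster centres "drift" as the data changes.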
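For readers who do not have XLMiner, roughly the same analysis can be reproduced with scikit-learn. This is only an approximate, assumed mapping of the dialog settings listed above; the contents of "cluster.xls", the column names, and the parameter correspondence (2 clusters, 10 iterations, 5 random starts, seed 12345, normalization) are assumptions, and scikit-learn is not the tool used in the course.

```python
# A rough scikit-learn analogue of the XLMiner setup described above (not the course's tool).
# Assumes cluster.xls has columns Name, Computer, History as shown on the slides.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

df = pd.read_excel("cluster.xls")                  # reading .xls needs the xlrd package installed
X = df[["Computer", "History"]].to_numpy(dtype=float)

X_norm = StandardScaler().fit_transform(X)         # "Normalize input data"

km = KMeans(n_clusters=2,                          # "# Clusters: 2"
            max_iter=10,                           # "# Iterations: 10"
            n_init=5,                              # "No. of Starts: 5" (random starts)
            random_state=12345)                    # "Set Seed: 12345"
labels = km.fit_predict(X_norm)

print(df.assign(Cluster=labels))                   # data summary with cluster membership
print(km.transform(X_norm))                        # distances from each cluster center
```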