Clustering Algorithms
Sunida Ratanothayanon
What is Clustering?
Clustering

Clustering is a form of classification that divides data into groups in a meaningful and useful way.
It is an unsupervised form of classification.
Outline

K-Means Algorithm

Hierarchical Clustering Algorithm
K-Means Algorithm

A partitional clustering algorithm.
It produces k clusters (the number of clusters k is specified by the user).
Each cluster has a cluster center called a centroid.
The algorithm iteratively groups data into k clusters based on a distance function.
K-Means Algorithm

The centroid of a cluster is the mean of all data points in that cluster.
Stop when the centers no longer change (a short code sketch of this loop follows).
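To make the loop concrete, here is a minimal sketch of K-Means in Python (assuming NumPy is available; the function name kmeans and its arguments are illustrative rather than taken from the slides):

import numpy as np

def kmeans(X, centers, max_iter=100):
    """Iteratively group the rows of X into k clusters, where k = len(centers)."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iter):
        # assign every point to its closest center (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # recompute each centroid as the mean of the points assigned to it
        # (this sketch assumes no cluster ever becomes empty)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(len(centers))])
        if np.allclose(new_centers, centers):  # stop when the centers no longer change
            break
        centers = new_centers
    return centers, labels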
A numerical example
K-Means example
We have five data points with 2 attributes and want to group the data into 2 clusters (k = 2).

Data Point    x1    x2
1             22    21
2             19    20
3             18    22
4              1     3
5              4     2
K-Means example
We can plot the five data points as follows.
[Scatter plot of the five data points over x1 and x2; the points fall into two visible groups, labelled cluster C1 and cluster C2.]
K-Means example (1st iteration)
Step 1: Choosing the centers and defining k
C1 = (18, 22), C2 = (4, 2)

Step 2: Computing cluster centers
We have already defined C1 and C2, so there is nothing more to compute in this first iteration.
Step 3: Finding the Euclidean distance of each data point from each center and assigning each data point to a cluster

d = sqrt( Σ_{i=1}^{n} (x_i - y_i)^2 )
K-Means example (1st iteration)
Step 3 (cont.):
Distance table for all data points

Data Point    C1 (18, 22)    C2 (4, 2)
(22, 21)          4.12         26.17
(19, 20)          2.23         23.43
(18, 22)          0            24.41
(1, 3)           25.49          3.16
(4, 2)           24.41          0

Then we assign each data point to a cluster by comparing its distances to the two centers: each data point is assigned to its closest center. This gives cluster 1 = {(22, 21), (19, 20), (18, 22)} and cluster 2 = {(1, 3), (4, 2)}.
K-Means example (2nd iteration)
Step 2: Computing cluster centers
We compute new cluster centers from the assignments of the 1st iteration.

Members of cluster 1 are (22, 21), (19, 20) and (18, 22). We take the average of these data points:
( (22 + 19 + 18)/3 , (21 + 20 + 22)/3 ) = ( 59/3 , 63/3 ) = (19.7, 21)
The new C1 is (19.7, 21).

Members of cluster 2 are (1, 3) and (4, 2):
( (1 + 4)/2 , (3 + 2)/2 ) = ( 5/2 , 5/2 ) = (2.5, 2.5)
The new C2 is (2.5, 2.5).
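The same centroid update can be checked in a couple of lines (assuming NumPy; the variables below are illustrative):

import numpy as np

cluster1 = np.array([[22, 21], [19, 20], [18, 22]], dtype=float)
cluster2 = np.array([[1, 3], [4, 2]], dtype=float)
print(cluster1.mean(axis=0))   # approximately [19.67, 21] -> new C1
print(cluster2.mean(axis=0))   # [2.5, 2.5]                -> new C2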
K-Means example (2nd iteration)
Step 3: Finding the Euclidean distance of each data point from each new center and assigning each data point to a cluster

Distance table for all data points with the new centers

Data Point    C1' (19.7, 21)    C2' (2.5, 2.5)
(22, 21)           2.3               26.88
(19, 20)           1.22              24.05
(18, 22)           1.97              24.91
(1, 3)            25.96               1.58
(4, 2)            24.65               1.58

Assign each data point to its closest center by comparing its distances to the two centers. Because the centers still changed (from (18, 22) and (4, 2) to (19.7, 21) and (2.5, 2.5)), repeat Steps 2 and 3 for the next iteration.
K-Means example (3rd iteration)
Step 2: Computing cluster centers
We compute the new cluster centers again.

Members of cluster 1 are still (22, 21), (19, 20) and (18, 22):
( (22 + 19 + 18)/3 , (21 + 20 + 22)/3 ) = (19.7, 21), so C1 stays (19.7, 21).

Members of cluster 2 are still (1, 3) and (4, 2):
( (1 + 4)/2 , (3 + 2)/2 ) = (2.5, 2.5), so C2 stays (2.5, 2.5).
K-Means example (3rd iteration)
Step 3: Finding the Euclidean distance of each data point from each center and assigning each data point to a cluster

Distance table for all data points with the new centers

Data Point    C1'' (19.7, 21)    C2'' (2.5, 2.5)
(22, 21)            2.3                26.88
(19, 20)            1.22               24.05
(18, 22)            1.97               24.91
(1, 3)             25.96                1.58
(4, 2)             24.65                1.58
Assign each data point to its closest center. The assignments and the centers are the same as in the 2nd iteration, so we stop the algorithm: the centers no longer change.
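The whole worked example can be reproduced in a few lines, for instance with scikit-learn (assuming it is installed; the initial centers below are the ones chosen in Step 1):

import numpy as np
from sklearn.cluster import KMeans

X = np.array([[22, 21], [19, 20], [18, 22], [1, 3], [4, 2]], dtype=float)
init_centers = np.array([[18, 22], [4, 2]], dtype=float)

km = KMeans(n_clusters=2, init=init_centers, n_init=1).fit(X)
print(km.labels_)            # expected: [0 0 0 1 1] (first three points in C1, last two in C2)
print(km.cluster_centers_)   # approximately [[19.67, 21.0], [2.5, 2.5]]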
Hierarchical Clustering Algorithm

[Dendrogram sketch over five points A, B, C, D, E.]

Produces a nested sequence of clusters, like a tree.
Allows clusters to have subclusters.
Individual data points at the bottom of the tree are called "singleton clusters".
Hierarchical Clustering Algorithm

Agglomerative method

The tree is built up from the bottom level; at each level the nearest pair of clusters is merged to move one level up.
Continue until all the data points are merged into a single cluster (a short code sketch follows below).
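A minimal Python sketch of this bottom-up merging, assuming the pairwise-average distance update used in the numerical example that follows (all names here are illustrative):

import numpy as np
from itertools import combinations

def agglomerative(points):
    """points: dict name -> coordinate tuple. Returns the merge order, bottom-up."""
    # distance between every pair of current clusters, keyed by a frozenset of names
    dist = {frozenset((a, b)): float(np.linalg.norm(np.subtract(points[a], points[b])))
            for a, b in combinations(points, 2)}
    clusters = set(points)
    merges = []
    while len(clusters) > 1:
        pair = min(dist, key=dist.get)          # closest pair of clusters
        a, b = tuple(pair)
        merged = a + "&" + b
        merges.append((a, b, dist[pair]))
        # average-distance update: d(merged, k) = (d(a, k) + d(b, k)) / 2
        for k in clusters - {a, b}:
            dist[frozenset((merged, k))] = (dist[frozenset((a, k))] + dist[frozenset((b, k))]) / 2
        # drop every entry that still mentions a or b
        dist = {p, d for p, d in dist.items()} if False else {p: d for p, d in dist.items() if a not in p and b not in p}
        clusters = (clusters - {a, b}) | {merged}
    return merges

# Example usage with the five points from the example that follows:
data = {"A": (9, 3, 7), "B": (10, 2, 9), "C": (1, 9, 4), "D": (6, 5, 5), "E": (1, 10, 3)}
for a, b, d in agglomerative(data):
    print(f"merge {a} and {b} at distance {d:.2f}")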
A numerical example
Hierarchical Clustering example
We have five data points with 3 attributes.

Data Point    x1    x2    x3
A              9     3     7
B             10     2     9
C              1     9     4
D              6     5     5
E              1    10     3
Hierarchical Clustering example (1st iteration)
Step 1: Calculating the Euclidean distance between every two data points

We obtain the following distance table:

Data Point      A (9, 3, 7)   B (10, 2, 9)   C (1, 9, 4)   D (6, 5, 5)   E (1, 10, 3)
A (9, 3, 7)         0             2.45          10.44          4.12          11.36
B (10, 2, 9)        -             0             12.45          6.4           13.45
C (1, 9, 4)         -             -              0             6.48           1.41
D (6, 5, 5)         -             -              -             0              7.35
E (1, 10, 3)        -             -              -             -              0
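A short snippet (assuming NumPy; the names are illustrative) that reproduces this pairwise distance table:

import numpy as np

points = {"A": (9, 3, 7), "B": (10, 2, 9), "C": (1, 9, 4), "D": (6, 5, 5), "E": (1, 10, 3)}
names = list(points)
for i, p in enumerate(names[:-1]):
    row = [f"{p}-{q}: {np.linalg.norm(np.subtract(points[p], points[q])):.2f}" for q in names[i + 1:]]
    print("  ".join(row))
# first row: A-B: 2.45  A-C: 10.44  A-D: 4.12  A-E: 11.36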
Hierarchical Clustering example (1st iteration)
Step 2: Forming a tree

Consider the most similar pair of data points in the distance table above: C and E are the most similar (distance 1.41).
We obtain the first cluster, C&E.
[Dendrogram so far: C and E joined.]

Repeat Steps 1 and 2 until all data points are merged into a single cluster.
Hierarchical Clustering example (2nd iteration)
Step 1: Calculating the Euclidean distance between every two clusters

We redraw the distance table, now including the merged entity C&E.

Data Point           A (9, 3, 7)   B (10, 2, 9)   D (6, 5, 5)   C&E (1, 9.5, 3.5)
A (9, 3, 7)              0             2.45           4.12            10.9
B (10, 2, 9)             -             0              6.4             12.95
D (6, 5, 5)              -             -              0                6.90
C&E (1, 9.5, 3.5)        -             -              -                0

The distance from C&E to A is obtained from
d((C,E),A) = avg( d(C,A), d(E,A) )
Using the previous table for the distances from C to A and from E to A: avg(10.44, 11.36) = 10.9.
Hierarchical Clustering example (2nd iteration)
Step 2: Forming a tree

Consider the most similar pair in the distance table above: A and B are the most similar (distance 2.45).
We obtain the second cluster, A&B.
[Dendrogram so far: C and E joined; A and B joined.]

Repeat Steps 1 and 2 until all data points are merged into a single cluster.
Hierarchical Clustering example (3rd iteration)
Step 1: Calculating the Euclidean distance between every two clusters

We redraw the distance table, now including the merged entities C&E and A&B. From the previous table we obtain the following distances for the new table:

Data Point     A&B    D (6, 5, 5)    C&E
A&B             0         5.26       11.93
D (6, 5, 5)     -         0           6.9
C&E             -         -           0

d((A,B),D) = avg( d(A,D), d(B,D) ) = avg(4.12, 6.40) = 5.26
d((C,E),D) = 6.90
d((C,E),(A,B)) = avg( d((C,E),A), d((C,E),B) ) = avg(10.9, 12.95) = 11.93
Hierarchical Clustering example (3rd iteration)
Step 2: Forming a tree

Consider the most similar pair in the distance table above: A&B and D are the most similar (distance 5.26).
We obtain the new cluster A&B&D.
[Dendrogram so far: C and E joined; A and B joined, then joined with D.]

Repeat Steps 1 and 2 until all data points are merged into a single cluster.
Hierarchical Clustering example (4th iteration)
Step 1: Calculating the Euclidean distance between every two clusters

We redraw the distance table, now including the merged entities C&E and A&B&D. From the previous table we obtain the distance from cluster A&B&D to C&E as follows:

Data Point    A&B&D    C&E
A&B&D            0      9.4
C&E              -      0

d((A,B,D),(C,E)) = avg( d((A,B),(C,E)), d(D,(C,E)) ) = avg(11.93, 6.9) = 9.4
Hierarchical Clustering example (4th iteration)
Step 2: Forming a tree

Only the pair A&B&D and C&E remains in the distance table, so no more recalculation has to be made: we merge all data points into the single cluster A&B&D&C&E and form the final tree.
[Final dendrogram: C and E joined; A and B joined, then joined with D; the two branches joined at the root.]
Stop the algorithm.
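The whole hierarchy can also be reproduced with SciPy (assuming it is installed); its "weighted" linkage method applies the same pair-averaging distance update used above:

import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[9, 3, 7], [10, 2, 9], [1, 9, 4], [6, 5, 5], [1, 10, 3]], dtype=float)  # A, B, C, D, E
Z = linkage(X, method="weighted", metric="euclidean")
print(np.round(Z, 2))
# Each row is one merge: [cluster i, cluster j, merge distance, new cluster size].
# Expected merge order: C with E (~1.41), A with B (~2.45), A&B with D (~5.26), then the root (~9.4).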
Conclusion

Two major clustering algorithms:

K-Means algorithm
An algorithm that iteratively groups data into k clusters based on a distance function.
The number of clusters k is specified by the user.

Hierarchical Clustering algorithm
Produces a nested sequence of clusters, like a tree.
The tree is built up from the bottom level, merging clusters until all the data points are merged into a single cluster.
Thank you