KDD: Part II Clustering
Dae-Won Kim
School of Computer Science & Engineering, Chung-Ang University

What is Data Mining?
Too much data and not enough information: this is a problem facing many businesses and industries. A solution lies here, with data mining. Most businesses have an enormous amount of data, with a great deal of information hiding within it, but "hiding" is usually exactly what it is doing: so much data exists that it overwhelms traditional methods of data analysis. Data mining provides a way to get at the information buried in the data. Data mining finds hidden patterns in large, complex collections of data, patterns that elude traditional statistical approaches to analysis. - Oracle

Data Mining Solution Issues
Classification vs. Clustering vs. Rule Mining

What is Classification?
Can you tell the name of this fish?
Construct a classifier for making a decision.
Fish: x^T = (x1, x2) = (Lightness, Width)

Classification Definition
"The act of taking in raw data and taking an action based on the category of the pattern."
"Build a machine that can recognize or predict patterns: character, speech, face, cancer, protein, DNA sequence, etc."
An example: predict patients by cancer subtype or treatment.
Leukemia (Golub et al., Science, 1999)
- 2-class problem
- 38 training samples, 34 independent test samples
- 6,817 genes

What is Clustering?
Cluster analysis discovers the grouping (b) from the raw data (a).
[Figure: (a) input patterns, (b) discovered clusters]

Clustering vs. Classification
Clustering (cluster analysis)
• Unsupervised pattern classification
• No training patterns, no prior knowledge
• Discovers homogeneous groups in data based on proximity
Classification (discriminant analysis)
• Supervised pattern classification
• Labeled training patterns; the groups are known a priori
• Constructs rules for classifying new data into the known groups

Interchangeable Terms
Cluster analysis is the preferred generic term:
• Clustering in computer science
• Numerical taxonomy in biology
• Q-analysis in psychology
• Segmentation in market research
Cluster is the preferred generic term; group or class are also often used.
Proximity is the preferred generic term; (dis)similarity or distance are also often used.

More Examples
Image segmentation, gene function prediction, intrusion detection systems, atopic dermatitis.

Notation
Given data set: an 'n x d' pattern (or data) matrix, with n data patterns (objects, observations, vectors) as rows and d variables (features, attributes, dimensions, fields) as columns:

X = \begin{pmatrix} x_{11} & x_{12} & \cdots & x_{1d} \\ x_{21} & x_{22} & \cdots & x_{2d} \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} & x_{n2} & \cdots & x_{nd} \end{pmatrix}

Example: air pollution in US cities (rows = cities, columns = variables)

City      SO2   TEMP   WIND   DAYS
Phoenix    10   70.3    6.0     36
Miami      10   75.5    8.8    128
Seattle    29   51.1    9.4    164
Detroit    36   49.9    8.4    113

Example: gene expression over time (rows = genes, columns = time points)

          Time_1   Time_2   ...   Time_d
Gene_1       2.1      3.6   ...     -2.6
Gene_2       3.5      7.1   ...     -2.1
...
Gene_n      -1.2      8.9   ...      6.5

The same layout covers genes measured under conditions Cond_1, ..., Cond_d, EEG signals EEG_1, ..., EEG_n over time, or samples Sample_1, ..., Sample_n over features.

Q: cluster the data into three groups.
Observations: A = (1, 1), B = (3, 1), C = (3, 2), D = (9, 2), E = (11, 2), F = (9, 6), G = (11, 6)
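The algorithms that follow consume such a pattern matrix through its pairwise proximities. As a minimal sketch (plain Python; the choice of Euclidean distance and the names `points` and `euclidean` are illustrative assumptions, not from the slides), here is the proximity matrix for the seven quiz observations:

```python
import math

# The seven quiz observations as an n x d pattern matrix (n = 7, d = 2).
points = {
    "A": (1, 1), "B": (3, 1), "C": (3, 2), "D": (9, 2),
    "E": (11, 2), "F": (9, 6), "G": (11, 6),
}

def euclidean(p, q):
    """Euclidean distance between two d-dimensional patterns."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Pairwise proximity (dissimilarity) matrix: the input to cluster analysis.
labels = sorted(points)
print("   " + " ".join(f"{j:>5}" for j in labels))
for i in labels:
    row = " ".join(f"{euclidean(points[i], points[j]):5.2f}" for j in labels)
    print(f"{i:>3}", row)
```

Reading the matrix, {A, B, C}, {D, E}, and {F, G} have small within-group distances, which is one reasonable three-group answer to the quiz.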
Basic Algorithms for Cluster Analysis
Hierarchical Clustering Algorithm
Dae-Won Kim
School of Computer Science and Engineering, CAU

Idea
[Figure: the two nearest points, A and B, are linked first]

Hierarchical Algorithm
1. Start with each point as its own cluster.
2. At each iteration, merge the two clusters with the smallest distance.
[Figure: points in the Feature 1 / Feature 2 / Feature 3 space being merged]

Hierarchical Algorithm
Eventually all points will be linked into a single cluster.

Hierarchical Algorithm
The sequence of mergers is represented in a hierarchical tree (dendrogram).
[Figure: dendrogram over the points a, b, c, d, e, f, g]

Hierarchical Algorithm
Agglomerative vs. divisive:
• Agglomerative (bottom-up): start from singletons and merge step by step, e.g., {a}, {b} -> {a,b}; {d}, {e} -> {d,e}; {c}, {d,e} -> {c,d,e}; {a,b}, {c,d,e} -> {a,b,c,d,e}
• Divisive (top-down): start from one all-inclusive cluster and split, reversing the same sequence
[Figure: the nested grouping of a-e, read left to right for agglomerative and right to left for divisive]

Example
Observations from students' heights and weights:
Data 1 = (180, 70), Data 2 = (180, 71), Data 3 = (180, 73), ...

Initial distance matrix (six objects):

     1    2    3    4    5    6
1    0
2    1    0
3    3    2    0
4    6    7    6    0
5    5    6    5    3    0
6    7   10    9    4    5    0

Example: Single-Link Method
Step 1: group (1,2) identified at distance 1.

        (1,2)   3    4    5    6
(1,2)    0
3        2      0
4        6      6    0
5        5      5    3    0
6        7      9    4    5    0

Step 2: group (1,2,3) identified at distance 2.

          (1,2,3)   4    5    6
(1,2,3)    0
4          6        0
5          5        3    0
6          7        4    5    0

Step 3: group (4,5) identified at distance 3.

          (1,2,3)  (4,5)   6
(1,2,3)    0
(4,5)      5        0
6          7        4      0

Step 4: group (4,5,6) identified at distance 4.

          (1,2,3)  (4,5,6)
(1,2,3)    0
(4,5,6)    5        0

Step 5: finalized; (1,2,3) and (4,5,6) merge at distance 5.

Variant: Single-Link Method
The distance between two groups is based on the distance between their two closest elements.
[Figure: element-wise distances vs. the group-wise distance]

Variant: Average-Link Method
The distance between two groups is based on the average of all pairs of distances.
[Figure: element-wise distances vs. the group-wise distance]

Quiz: find two groups using
1. the hierarchical algorithm with single-link
2. the hierarchical algorithm with average-link
[Figure: four points with their pairwise distances]

Example
Typical students' data: Student 1 = (180, 70), Student 2 = (176, 70), ...

Initial distance matrix:

     1    2    3    4
1    0
2    4    0
3    8    3    0
4    5    9   11    0

Single-link: after merging (2,3) at distance 3,

        1   (2,3)   4
1       0
(2,3)   4    0
4       5    9      0

Average-link variant: after merging (2,3),

        1   (2,3)   4
1       0
(2,3)   6    0
4       5   10      0

Basic Algorithms for Cluster Analysis
K-Means Clustering Method
Dae-Won Kim
School of Computer Science and Engineering, CAU

Review: Hierarchical
1. Hierarchical algorithm
"Yields a dendrogram representing the nested grouping and the similarity levels at which groupings change."
2. Three popular schemes
"Two clusters are merged into a larger cluster based on a minimum-distance criterion."
1) single-link: minimum distance over all pairs of patterns from the two clusters
2) complete-link: maximum distance
3) average-link: average distance
[Figure: inter-cluster distances between Cluster A and Cluster B]

Review: Hierarchical
3. Single-link vs. complete-link
• Single-link is more versatile than complete-link
• Complete-link does not suffer from the "chaining effect" that makes single-link stretch clusters into elongated chains
[Figure: single-link vs. complete-link results on the same data]

Review: Hierarchical
4. Pros and cons
• More versatile than partitional algorithms
• The dendrogram provides a visual inspection of the cluster structure
• Computationally prohibitive: O(n^2)
• Cannot repair faulty mergers made in previous steps
• May produce large chunks of clusters
• Most widely used, owing to its ease of use and visualization
• The average-link algorithm has shown superior performance, though in some reports the performance was close to random

Review: Hierarchical
5. Graph-theoretic clustering is a hierarchical clustering approach
- Single-link clusters are subgraphs of the minimum spanning tree
- Complete-link clusters are subgraphs of the maximal spanning tree
[Figure: using the minimal spanning tree to form clusters]

Review: Hierarchical
Tip. To speed up the implementation, use a dissimilarity matrix and an indexing structure.
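As an illustration of this tip, here is a minimal sketch (names illustrative; a naive O(n^3) pair search rather than an optimized indexing structure) of the single-link algorithm driven entirely by the six-object dissimilarity matrix from the worked example; it reproduces the merge sequence of Steps 1-5:

```python
# Dissimilarity matrix of the six-object example (symmetric, zero diagonal).
D = [
    [0, 1, 3, 6, 5, 7],
    [1, 0, 2, 7, 6, 10],
    [3, 2, 0, 6, 5, 9],
    [6, 7, 6, 0, 3, 4],
    [5, 6, 5, 3, 0, 5],
    [7, 10, 9, 4, 5, 0],
]

clusters = [{i} for i in range(len(D))]   # start: every object is a cluster

def single_link(a, b):
    """Single-link distance: minimum over all cross-cluster pairs."""
    return min(D[i][j] for i in a for j in b)

while len(clusters) > 1:
    # Find the pair of clusters with the smallest single-link distance.
    p, q = min(
        ((p, q) for p in range(len(clusters))
                for q in range(p + 1, len(clusters))),
        key=lambda pq: single_link(clusters[pq[0]], clusters[pq[1]]),
    )
    d = single_link(clusters[p], clusters[q])
    print("merge", sorted(i + 1 for i in clusters[p]),
          "and", sorted(i + 1 for i in clusters[q]), "at distance", d)
    clusters[p] |= clusters[q]   # agglomerate q into p ...
    del clusters[q]              # ... and drop the old cluster

# Expected output: (1,2) at 1, then (1,2,3) at 2, (4,5) at 3, (4,5,6) at 4,
# and the final merger at distance 5, matching Steps 1-5 above.
```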
K-Means Clustering Algorithm
• Fast clustering
• A cluster center represents each cluster
• Each cluster center is obtained as the average of its members
• Cluster membership is computed from the distance to each cluster center

K-Means Clustering Algorithm
1. Objective function of the K-means algorithm
"Yields a single partition of the data at each iteration using the distance between patterns and centroids, leading to intracluster compactness and intercluster separation."

\min J(X, k) = \sum_{i=1}^{k} \sum_{j=1}^{n} d(x_j, v_i)

where v_i is the centroid of cluster i and each pattern x_j contributes its distance to the centroid of the cluster it is assigned to.

2. Incremental greedy procedure
1) Select k initial centroids
2) Assign each pattern to its closest cluster centroid
3) Update the cluster centroids
4) Repeat steps 2) and 3) until there is no improvement in J(X, k)

Procedure
Step 1. Guess the K centers at random.
Step 2. Classify the data into the K groups (by closest center).
Step 3. Update the centers.
Step 4. If there is no change in the centers, stop; otherwise go to Step 2.
[Figure: two centers, center1 and center2, among the data points]

Key Steps
Classify data: assign each datum to the closest group.
Update centers: compute the average of the data in each group. For example, a group holding {1, 2, 3, 7} gets the new center c1 = (1 + 2 + 3 + 7) / 4.

Example: cluster the data into K = 2 groups
1. Initialize: G1-center = A = (1, 1), G2-center = B = (2, 1)
2. Classify data: G1 = {A, C}, G2 = {B, D, E, F, G, H}
   Update centers: G1-center = (1.0, 1.5), G2-center = (3.67, 3.5)
   Check stop: change (yes) -> continue
3. Classify data: G1 = {A, B, C, D}, G2 = {E, F, G, H}
   Update centers: G1-center = (1.5, 1.5), G2-center = (4.5, 4.5)
   Check stop: change (yes) -> continue
4. Classify data: G1 = {A, B, C, D}, G2 = {E, F, G, H}
   Update centers: G1-center = (1.5, 1.5), G2-center = (4.5, 4.5)
   Check stop: change (no) -> stop
[Figure: scatter plot with A-D in the lower-left and E-H in the upper-right]

Example: cluster the data into K = 2 groups
1. Initialize: G1-center = (0, 0), G2-center = (1, 0)
2. Classify data: G1 = {x1, x3}, G2 = {x2, x4, ..., x20}
   Update centers: G1-center = (0.0, 0.5), G2-center = (5.67, 5.33)
   Check stop: change (yes) -> continue
3. Classify data: G1 = {x1, ..., x8}, G2 = {x9, ..., x20}
   Update centers: G1-center = (1.25, 1.13), G2-center = (7.67, 7.33)
   Check stop: change (yes) -> continue
4. Classify data: G1 = {x1, ..., x8}, G2 = {x9, ..., x20}
   Update centers: G1-center = (1.25, 1.13), G2-center = (7.67, 7.33)
   Check stop: change (no) -> stop
[Figure: grid of points x1-x8 in the lower-left and x9-x20 in the upper-right]

Issues in K-means
Pros and cons
• The simplest and most commonly used algorithm
• Computationally efficient: O(nks) for n patterns, k clusters, and s iterations
• Tends to work well with isolated and compact clusters
• Induces a fixed cluster shape that depends on the distance measure
• Sensitive to the initial selection of centroids:
  1) an 'ellipsoidal' clustering result comes from selecting {A, B, C}
  2) a 'rectangular' clustering result comes from selecting {A, D, F}
  3) density-based initial selection is popular: mountain clustering
[Figure: two different clustering results when grouping 7 data points into 3 clusters]
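A minimal sketch of the four-step procedure, run on the first worked example above (K = 2; the coordinates of C through H are inferred from the worked centroid arithmetic and are an assumption, as is the name `kmeans`):

```python
import math

# A-H from the first example; C-H inferred from the worked centroids.
data = [(1, 1), (2, 1), (1, 2), (2, 2),   # A, B, C, D
        (4, 4), (5, 4), (4, 5), (5, 5)]   # E, F, G, H

def kmeans(data, centers):
    while True:
        # Step 2: classify each datum to its closest center (Euclidean).
        groups = [[] for _ in centers]
        for x in data:
            i = min(range(len(centers)), key=lambda c: math.dist(x, centers[c]))
            groups[i].append(x)
        # Step 3: update each center as the mean of its members.
        # Note: an empty group would silently break this update; see the
        # dead-cluster tip below.
        new = [tuple(sum(v) / len(g) for v in zip(*g)) for g in groups]
        # Step 4: stop when no center changes; otherwise reclassify.
        if new == centers:
            return centers, groups
        centers = new

# Step 1: initial guess, as in the example: A and B.
centers, groups = kmeans(data, [(1, 1), (2, 1)])
print(centers)   # [(1.5, 1.5), (4.5, 4.5)], as in iteration 4 above
```

Running it reproduces iterations 2-4 of the example: the centers move from (1.0, 1.5) and (3.67, 3.5) to (1.5, 1.5) and (4.5, 4.5), then stop.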
Issues in K-means
Tip. Why do K-means-type algorithms remain so popular?
1) A clustering solution is affected by the choice of internal parameters, and the K-means algorithm is easy to use across applications because it has only a small number of them.
2) It is mathematically sound: convergence of K-means-type algorithms in a finite number of iterations has been proved, as has the local optimality of the partial optimal solution.

Tip. In an implementation, a 'dead' cluster can arise from random initialization: a center to which no data point is assigned. Always check the number of data points belonging to each cluster in each iteration (see the sketch at the end of this section).
[Figure: a dead cluster whose center attracts no data points]

Other Issues in K-means
• How can we guess desirable initial centers for the K-means algorithm?
• How can we cluster uncertain data that have vague boundaries?
• How can we know the optimal number of clusters?
• K-means-type algorithms detect only sphere-shaped clusters; how can other cluster shapes be found?
• How can we cluster symbolic, categorical data?
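Following up on the dead-cluster tip above, a minimal sketch of the per-iteration check (the re-seeding policy of picking a random datum as the new center is one common fix and an assumption here, as are the names):

```python
import random

def check_dead_clusters(centers, groups, data):
    """After each classification step, re-seed any cluster with no members."""
    for i, members in enumerate(groups):
        if not members:                        # dead cluster detected
            centers[i] = random.choice(data)   # re-seed with a random datum
            print(f"cluster {i} was dead; re-seeded at {centers[i]}")
    return centers
```

Called between the classify and update steps of the K-means sketch above, this keeps every centroid attached to at least one datum, so the mean update never operates on an empty group.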