Topic 6: Clustering and Unsupervised Learning
Credits: Padhraic Smyth lecture notes; Hand et al., Chapter 9; David Madigan lecture notes
Data Mining - Volinsky - 2011 - Columbia University

Clustering Outline
• Introduction to clustering
• Distance measures
• k-means clustering
• Hierarchical clustering
• Probabilistic clustering

Clustering
• "Automated detection of group structure in data"
– Typically: partition N data points into K groups (clusters) such that the points in each group are more similar to each other than to points in other groups
– A descriptive technique (contrast with predictive)
– Identify "natural" groups of data objects and qualitatively describe groups of the data
• Often useful, if a bit reductionist
– For real-valued vectors, clusters can be thought of as clouds of points in p-dimensional space
– Also called unsupervised learning

Clustering
• Sometimes easy, sometimes impossible, and usually in between

What is Cluster Analysis?
• A good cluster analysis produces objects that are
– Similar (close) to one another within the same cluster
– Dissimilar (far) from the objects in other clusters
• In other words
– High intra-cluster similarity (low intra-cluster variance)
– Low inter-cluster similarity (high inter-cluster variance)

Example
[Figure: scatterplot of anemia patients and controls, red blood cell hemoglobin concentration vs. red blood cell volume]

Why is Clustering Useful?
• "Discovery" of new knowledge from data
– Contrast with supervised classification (where labels are known)
– Can be very useful for summarizing large data sets, especially for large n and/or high dimensionality
• Applications of clustering
– WWW: clustering of documents produced by a search engine (e.g., Google News)
– Customer segmentation
– Spatial data analysis: geographical clusters of events (cancer rates, sales, etc.)
– Clustering of genes with similar expression profiles
– Many more

General Issues in Clustering
• No golden truth! The answer is often subjective
• Cluster representation: what types or "shapes" of clusters are we looking for? What defines a cluster?
• Other issues
– The distance function D[x(i), x(j)] is a critical aspect of clustering, both for distances between individual pairs of objects and for distances between objects and clusters
– How is K selected?

Distance Measures
• In order to cluster, we need some kind of "distance" between points.
• Sometimes distances are not obvious, but we can create them, e.g. from binary attributes:

case  sex  glasses  moustache  smile  hat
  1    0      1         0        1     0
  2    1      0         0        1     0
  3    0      1         0        0     0
  4    0      0         0        0     0
  5    0      0         0        1     0
  6    0      0         1        0     1
  7    0      1         0        1     0
  8    0      0         0        1     0
  9    0      1         1        1     0
 10    1      0         0        0     0
 11    0      0         1        0     0
 12    1      0         0        0     0

Some Distances
• Euclidean distance (L2): d(x, y) = sqrt( (x1 - y1)^2 + ... + (xp - yp)^2 )
– The most common notion of "distance."
• Manhattan distance (L1): d(x, y) = |x1 - y1| + ... + |xp - yp|
– The distance if you had to travel along coordinates only.
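To make the jump from the binary attribute table to actual distances concrete, here is a minimal sketch in Python with NumPy (the slides do not prescribe a language or library); it encodes the 12 x 5 table above and computes the full pairwise Euclidean (L2) and Manhattan (L1) distance matrices.

import numpy as np

# Binary attribute table from the slide: rows = cases 1..12,
# columns = sex, glasses, moustache, smile, hat.
X = np.array([
    [0, 1, 0, 1, 0],
    [1, 0, 0, 1, 0],
    [0, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 1, 0, 1, 0],
    [0, 0, 0, 1, 0],
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 0, 0, 0, 0],
])

# Pairwise Euclidean (L2) and Manhattan (L1) distance matrices.
diff = X[:, None, :] - X[None, :, :]          # shape (12, 12, 5)
d_l2 = np.sqrt((diff ** 2).sum(axis=2))       # d(x,y) = sqrt(sum_i (x_i - y_i)^2)
d_l1 = np.abs(diff).sum(axis=2)               # d(x,y) = sum_i |x_i - y_i|

print(d_l2[0, 1], d_l1[0, 1])                 # distance between case 1 and case 2

Either matrix can then be handed to the clustering algorithms discussed in the rest of the deck.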
Examples of Euclidean Distances
• x = (5, 5), y = (9, 8)
– L2: dist(x, y) = sqrt(4^2 + 3^2) = 5
– L1: dist(x, y) = 4 + 3 = 7

Non-Euclidean Distances
• Some observations are not appropriate for Euclidean distance:
– Binary vectors: 10011 vs. 11000
– Strings: "statistics" vs. "sadistics"
– Ordinal variables: "M.S." vs. "B.A."
– Categorical: blue vs. green
• How do we calculate distances for variables like these?

Distances for Binary Vectors
• A = 101110; B = 100111
– Hamming distance: # of changes to get from A to B
• Hamming(A, B) = 2
• Can be normalized by the length of the string: 2/6
– Jaccard similarity: intersection over union
• Intersection: # of positions where both are 1 = 3
• Union: # of positions with at least one 1 = 5
• Jaccard similarity = 3/5; Jaccard distance = 1 - 3/5 = 2/5
– Both of these are metrics => they satisfy the triangle inequality

Cosine Distance (Similarity)
• Think of a point as a vector from the origin (0, 0, ..., 0) to its location.
• Two points' vectors make an angle; the cosine of this angle is a measure of similarity
– Recall cos(0) = 1; cos(90) = 0
– The cosine is the normalized dot product of the vectors: cos(θ) = (p1 · p2) / (||p1|| ||p2||)
– Example: p1 = 00111; p2 = 10011
– cos(θ) = 2/3; θ is about 48 degrees
[Figure: cosine-measure diagram showing vectors p1 and p2, the projection p1 · p2 / ||p2||, and the angle between them]

Edit Distance for Strings
• Edit distance for strings: the number of inserts and deletes of characters needed to turn one string into the other.
• Equivalently: d(x, y) = |x| + |y| - 2|LCS(x, y)|
– LCS = longest common subsequence = the longest string obtainable both by deleting from x and by deleting from y

Example
• x = statistics; y = sadistic
• Turn x into y by deleting t, deleting t, inserting d, and deleting s
– Edit distance = 4
• Or: LCS(x, y) = saistic
• |x| + |y| - 2|LCS(x, y)| = 10 + 8 - 14 = 4

Categorical Variables
• A generalization of the binary variable in that it can take more than two states, e.g., red, yellow, blue, green
• Method 1: simple matching
– m: # of matches, p: total # of variables
– d(i, j) = (p - m) / p

        Hat    Coat   Shoes  Belt
Alice   Brown  Black  Black  Brown
Bob     Brown  Gray   Red    Red
Craig   None   Black  Black  Brown
Dave    None   Black  Brown  None

– Distance(Alice, Craig) = (4 - 3)/4 = 1/4

Ordinal Variables
• An ordinal variable can be discrete or continuous
• Order is important, e.g., rank
• Pretend they are interval scaled:
– Replace x_if by its rank r_if in {1, ..., M_f}
– Map the range of each variable onto [0, 1], e.g., z_if = (r_if - 1) / (M_f - 1)
– Compute the dissimilarity using Euclidean or another distance

Clustering Methods
• Enough about distances!
• Now we have an (n x n) matrix of distances.
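The non-Euclidean distances above are short functions in any language. This sketch (Python; the NumPy dependency for the cosine is an assumption, not something the slides use) reproduces the worked examples: Hamming and Jaccard on A = 101110 and B = 100111, the cosine of p1 = 00111 and p2 = 10011, and the LCS-based edit distance between "statistics" and "sadistic".

import numpy as np

def hamming(a, b):
    # Number of positions where the binary strings differ.
    return sum(x != y for x, y in zip(a, b))

def jaccard_distance(a, b):
    # 1 - (number of shared 1s) / (number of positions with at least one 1).
    inter = sum(x == '1' and y == '1' for x, y in zip(a, b))
    union = sum(x == '1' or y == '1' for x, y in zip(a, b))
    return 1 - inter / union

def cosine_similarity(u, v):
    # Normalized dot product of the two vectors.
    u, v = np.asarray(u, float), np.asarray(v, float)
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

def lcs_length(x, y):
    # Classic dynamic program for the longest common subsequence.
    dp = [[0] * (len(y) + 1) for _ in range(len(x) + 1)]
    for i, cx in enumerate(x, 1):
        for j, cy in enumerate(y, 1):
            dp[i][j] = dp[i-1][j-1] + 1 if cx == cy else max(dp[i-1][j], dp[i][j-1])
    return dp[len(x)][len(y)]

def edit_distance(x, y):
    # Insert/delete-only edit distance: |x| + |y| - 2|LCS(x, y)|.
    return len(x) + len(y) - 2 * lcs_length(x, y)

print(hamming("101110", "100111"))                             # 2
print(jaccard_distance("101110", "100111"))                    # 0.4 (= 1 - 3/5)
print(round(cosine_similarity([0,0,1,1,1], [1,0,0,1,1]), 3))   # 0.667 (= 2/3)
print(edit_distance("statistics", "sadistic"))                 # 4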
• Two major types of clustering algorithms:
– Partitioning
• Partitions the set into clusters with defined boundaries
• Place each point in its nearest cluster
– Hierarchical
• Agglomerative: each point starts in its own cluster; iteratively combine
• Divisive: all data start in one cluster; iteratively dissect

k-Means Algorithm(s)
• Assumes Euclidean space.
• Start by picking k, the number of clusters.
• Initialize clusters by picking one point per cluster.
– Typically, k random points

K-Means Algorithm
1. Arbitrarily select K objects from the data (e.g., K customers) to be the cluster centers.
2. Assign each of the remaining objects to the cluster whose center it is closest to.
[Figure: scatterplot of points with two labeled cluster centers]

Then repeat the following steps until the clusters converge (no change in clusters):
1. Compute the new center of each current cluster.
2. Assign each object to the cluster whose center it is closest to.
3. Go back to Step 1, or stop if the centers do not change.
[Figures: the same points, showing centers being recomputed and points reassigned]

The K-Means Clustering Method: Example
[Figure: four panels showing successive k-means iterations on a small two-dimensional data set]

K-Means (example courtesy of Andrew Moore, CMU)
1. Decide on k (e.g., k = 5).
2. Randomly guess k cluster center locations.
3. Each datapoint finds out which center it is closest to (thus each center "owns" a set of datapoints).
4. Each center finds the centroid of the points it owns.
5. New centers => new boundaries.
6. Repeat until no change.

K-Means Example
• Given: {2, 4, 10, 12, 3, 20, 30, 11, 25}, k = 2
• Randomly assign means: m1 = 3, m2 = 4
• K1 = {2, 3}, K2 = {4, 10, 12, 20, 30, 11, 25}; m1 = 2.5, m2 = 16
• K1 = {2, 3, 4}, K2 = {10, 12, 20, 30, 11, 25}; m1 = 3, m2 = 18
• K1 = {2, 3, 4, 10}, K2 = {12, 20, 30, 11, 25}; m1 = 4.75, m2 = 19.6
• K1 = {2, 3, 4, 10, 11, 12}, K2 = {20, 30, 25}; m1 = 7, m2 = 25
• Stop, as the clusters with these means stay the same.
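The worked example is easy to reproduce. Below is a minimal sketch of Lloyd's iterations in plain Python (not taken from the slides) on the same one-dimensional data, starting from the same initial means m1 = 3 and m2 = 4; it prints the same sequence of cluster memberships and means as above (cluster elements appear in data order) and stops when the means stop changing.

data = [2, 4, 10, 12, 3, 20, 30, 11, 25]
means = [3.0, 4.0]                      # initial guesses m1, m2 from the slide

while True:
    # Assignment step: each point joins the cluster with the nearest mean.
    clusters = [[], []]
    for x in data:
        k = min(range(len(means)), key=lambda j: abs(x - means[j]))
        clusters[k].append(x)
    # Update step: each mean moves to the average of the points it owns.
    new_means = [sum(c) / len(c) for c in clusters]
    print(clusters, new_means)
    if new_means == means:              # stop once the means no longer change
        break
    means = new_means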
Getting k Right
• Hard! Often done subjectively (by feel).
• Try different k, looking at the change in the average distance to centroid as k increases.
• Looking for a balance between within-cluster variance and between-cluster variance.
– Calinski index = [B(k) / (k - 1)] / [W(k) / (n - k)], the ratio of between-cluster to within-cluster variance
• The average distance to centroid falls rapidly until the right k, then changes little.
[Plot: average distance to centroid vs. k; the "elbow" marks the best value of k]

Comments on the K-Means Method
• Strengths
– Relatively efficient; easy to implement; often comes up with good, if not the best, solutions
– Intuitive
• Weaknesses
– Need to specify k, the number of clusters, in advance
– Unable to handle noisy data and outliers
– Not suited to discovering clusters with non-convex shapes
– Quite sensitive to the initial starting points; will find a local optimum. Run it several times and see how much the results change.

Variations on k-Means
• Make it more robust by using k-modes or k-medoids
– K-medoids: a medoid is the most centrally located object in a cluster.
• Make the initialization better
– Take a small random sample and cluster it to find a starting point
– Pick k points on a grid
– Do several runs with different starting points

Simple Example of Hierarchical Clustering
[Figure: simple example of hierarchical clustering]

Hierarchical Clustering
• Does not require the number of clusters k as an input.
• Two extremes
– All data in one cluster
– Each data point in its own cluster
[Diagram: points a, b, c, d, e merged step by step into ab, de, cde, and abcde; agglomerative clustering reads the steps left to right, divisive clustering reads them right to left]

Hierarchical Clustering
• Representation: a tree of nested clusters
• Greedy algorithm
– Find the two most similar points
– Join them
– Repeat
• Can also run backwards: divisive
• Effective visualization via "dendrograms"
– Shows the nesting structure
– Merges or splits = tree nodes
• The algorithm requires a distance measure between clusters, or between a point and a cluster

Distances Between Clusters
• Single link:
– Smallest distance between points
– Nearest neighbor; can be outlier sensitive
• Complete link:
– Largest distance between points
– Enforces "compactness"
• Average link:
– Mean: gets average behavior
– Centroid: more robust
• Ward's measure
– Merge the clusters that minimize the increase in within-cluster distances:
– D(i, j) = SS(C_{i+j}) - SS(C_i) - SS(C_j), where SS is the within-cluster sum of squares

Dendrograms
• By cutting the dendrogram at the desired level, each connected component forms a cluster.
[Figure: dendrogram over points A-I, with a cut producing the clusters ABC, DEF, G, and HI]
• Plots can be made so that the height of the crossbar shows the change in within-cluster SS.
[Figure: dendrogram of the Old Faithful data]

Dendrogram Using the Single-Link Method
[Figure: single-link dendrogram of the Old Faithful eruption duration vs. wait data]
• Notice how single-link tends to "chain".
• The dendrogram y-axis is the crossbar's distance score.
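Dendrograms like these can be generated with a few lines of code. The sketch below assumes SciPy and Matplotlib (neither is specified in the slides) and uses synthetic two-dimensional data standing in for the Old Faithful measurements; the linkage methods "single", "complete", "average", and "ward" correspond to the between-cluster distances listed above.

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

# Toy two-dimensional data standing in for e.g. the Old Faithful measurements.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal([2, 50], [0.3, 5], (30, 2)),
               rng.normal([4.5, 80], [0.4, 6], (30, 2))])

for method in ["single", "ward"]:
    Z = linkage(X, method=method)        # agglomerative merge tree
    plt.figure()
    dendrogram(Z)                        # crossbar height = distance at each merge
    plt.title(f"{method}-linkage dendrogram")

# Cut the tree into k = 2 clusters: each connected component below the cut.
labels = fcluster(linkage(X, method="ward"), t=2, criterion="maxclust")
plt.show()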
Dendrogram Using Ward's SSE Distance
[Figure: Ward's-linkage dendrogram of the Old Faithful eruption duration vs. wait data]
• More balanced than single-link.

Hierarchical Clustering
• Pros
– Don't have to specify k beforehand
– Visual representation of various cluster characteristics from the dendrogram
• Cons
– Different linkage options give very different results

Estimating Probability Densities
• Using probability densities is one way to describe data.
• Finite mixtures of probability densities can be viewed as clusters.
• Because we have a probability model, the log-likelihood can be used to evaluate it:
– S_L(θ) = Σ_{i=1..n} log f(x_i; θ)

Mixture Models
• Example: weekly credit card usage, x weeks of use out of 52, modeled as a two-component mixture:
– f(x) = p · λ1^x e^(-λ1) / x! + (1 - p) · λ2^(52-x) e^(-λ2) / (52 - x)!
• Two-stage model:
– Assign data points to clusters
– Assess the fit of the model

Mixture Models and EM
• How do we find the models to mix over?
• EM (Expectation / Maximization) is a widely used technique that converges to a solution for finding mixture models.
• Assume multivariate normal components. To apply EM:
– Take an initial solution
– Calculate the probability that each point comes from each component and assign it (E-step)
– Re-estimate the parameters of the components based on the new assignments (M-step)
– Repeat until convergence
• Results in probabilistic membership in clusters
• Can be slow to converge; can find local maxima

The E (Expectation) Step
• Given the n data points and the current K clusters and parameters:
– E-step: compute p(data point i is in group k)

The M (Maximization) Step
• Given the n data points and their memberships:
– M-step: compute new parameters θ for the K clusters

Comments on Mixtures and EM Learning
• Probabilistic assignment to clusters ... not a partition
• K-means is a special case of EM
– Gaussian mixtures with isotropic (diagonal, equal-variance) covariance matrices Σ_k
– Approximate the E-step by choosing the most likely cluster (instead of using membership probabilities)
• EM can be used more broadly to estimate generic distributions
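To make the E-step and M-step concrete, here is a hand-rolled sketch of EM for a two-component univariate Gaussian mixture on synthetic data (Python with NumPy and SciPy, which are assumptions of this sketch, not of the slides; the anemia example in the figures below uses bivariate Gaussians, and in practice one would typically use a library such as scikit-learn's GaussianMixture or mclust in R).

import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 1, 100)])  # toy data

# Initial guesses for mixing weights, means, and standard deviations.
w = np.array([0.5, 0.5])
mu = np.array([-1.0, 1.0])
sd = np.array([1.0, 1.0])

for _ in range(50):
    # E-step: posterior probability that each point belongs to each component.
    dens = np.vstack([w[k] * norm.pdf(x, mu[k], sd[k]) for k in range(2)]).T
    resp = dens / dens.sum(axis=1, keepdims=True)          # n x 2 memberships

    # M-step: re-estimate weights, means, and variances from the memberships.
    n_k = resp.sum(axis=0)
    w = n_k / len(x)
    mu = (resp * x[:, None]).sum(axis=0) / n_k
    sd = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k)

# Log-likelihood of the fitted mixture, as in S_L(theta) above.
loglik = np.log(np.vstack([w[k] * norm.pdf(x, mu[k], sd[k])
                           for k in range(2)]).sum(axis=0)).sum()
print(w, mu, sd, loglik)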
[Figure sequence: the anemia data (red blood cell hemoglobin concentration vs. red blood cell volume), the EM fit at iterations 1, 3, 5, 10, 15, and 25, and finally the anemia data with its true labels]
[Plot: log-likelihood as a function of EM iteration, over 25 iterations]

Selecting K in Mixture Models
• Cannot just choose the K that maximizes the likelihood
– The likelihood L(θ) is always larger for larger K
• Model-selection alternatives for choosing K:
– 1) In-sample: penalize complexity
• e.g., BIC = 2 L(θ) - d log n, where d = # of parameters (Bayesian information criterion)
• Easy to implement; asymptotically correct
– 2) Bayesian: compute the posteriors p(k | data)
• p(k | data) requires computation of p(data | k), the marginal likelihood
• Can be tricky to compute for mixture models
– 3) Out-of-sample: (cross-)validation
• Split the data into training and validation sets
• Score different models by the likelihood of the test data, log p(X_test | θ)
• Can be noisy on small data (the log-likelihood is sensitive to outliers)

Example of BIC Score for the Red-Blood-Cell Data
[Figure: BIC score vs. number of components for the red-blood-cell data; the true number of classes (2) is selected by BIC]

Model-Based Clustering
• f(x) = Σ_{k=1..K} w_k f_k(x; θ_k)
• Mixture of K multivariate Gaussians
• Optimal complexity fit via in-sample penalties
• EM or Bayesian methods used to fit the clusters
• 'mclust' in R

Name  Distribution  Volume    Shape     Orientation
EII   Spherical     equal     equal     NA
VII   Spherical     variable  equal     NA
EEI   Diagonal      equal     equal     coordinate axes
VEI   Diagonal      variable  equal     coordinate axes
VVI   Diagonal      variable  variable  coordinate axes
EEE   Ellipsoidal   equal     equal     equal
EEV   Ellipsoidal   equal     equal     variable
VEV   Ellipsoidal   variable  equal     variable
VVV   Ellipsoidal   variable  variable  variable

mclust Output
[Figure: mclust output]
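As an illustration of the in-sample route (option 1 above), the sketch below fits Gaussian mixtures for several values of k on synthetic two-group data standing in for the red-blood-cell measurements and compares BIC scores. It assumes scikit-learn (the slides use mclust in R instead); note that scikit-learn's bic() is defined as -2 log L + d log n, so smaller is better there, while the slide's 2 L(θ) - d log n is larger-is-better; the selected k is the same under either convention.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
# Toy two-group data standing in for the red-blood-cell measurements.
X = np.vstack([rng.normal([3.5, 3.8], 0.05, (150, 2)),
               rng.normal([3.7, 4.1], 0.08, (50, 2))])

bic = {}
for k in range(1, 7):
    gm = GaussianMixture(n_components=k, covariance_type="full",
                         n_init=5, random_state=0).fit(X)
    bic[k] = gm.bic(X)          # -2 log L + d log n  (smaller is better here)

best_k = min(bic, key=bic.get)
print(bic, "-> chosen k =", best_k)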