Cluster Analysis
Mark Stamp

Cluster Analysis
* Grouping objects in meaningful way
  o Clustered data fits together in some way
  o Can help to make sense of (big) data
  o Useful analysis technique in many fields
* Many different clustering strategies
* Overview, then details on 2 methods
  o K-means: simple and can be effective
  o EM clustering: not as simple

Intrinsic vs Extrinsic
* Intrinsic clustering relies on unsupervised learning
  o No predetermined labels on objects
  o Apply analysis directly to data
* Extrinsic relies on category labels
  o Requires pre-processing of data
  o Can be viewed as a form of supervised learning

Agglomerative vs Divisive
* Agglomerative
  o Each object starts in its own cluster
  o Clustering merges existing clusters
  o A "bottom up" approach
* Divisive
  o All objects start in one cluster
  o Clustering process splits existing clusters
  o A "top down" approach

Hierarchical vs Partitional
* Hierarchical clustering
  o "Child" and "parent" clusters
  o Can be viewed as dendrograms
* Partitional clustering
  o Partition objects into disjoint clusters
  o No hierarchical relationship
* We consider K-means and EM in detail
  o These are both partitional

Hierarchical Clustering
* Example of a hierarchical approach...
  1. start: every point is its own cluster
  2. while number of clusters exceeds 1: find 2 nearest clusters and merge them
  3. end while
* OK, but no real theoretical basis
  o And some find that "disconcerting"
  o Even K-means has some theory behind it

Dendrogram Example
* (figure: dendrogram) Obtained by hierarchical clustering
  o Maybe...

Distance
* Distance between data points?
* Suppose x = (x1,x2,...,xn) and y = (y1,y2,...,yn), where each xi and yi are real numbers
* Euclidean distance is
  d(x,y) = sqrt((x1-y1)^2 + (x2-y2)^2 + ... + (xn-yn)^2)
* Manhattan (taxicab) distance is
  d(x,y) = |x1-y1| + |x2-y2| + ... + |xn-yn|

Distance
* (figure: two paths between points a and b)
* Euclidean distance: red line
* Manhattan distance: blue or yellow path
  o Or any similar right-angle-only path

Distance
* Lots and lots more distance measures
* Other examples include
  o Mahalanobis distance: takes mean and covariance into account
  o Simple substitution distance: measure of "decryption" distance
  o Chi-squared distance: statistical
  o Or just about anything you can think of...

One Clustering Approach
* Given data points x1,x2,x3,...,xm
* Want to partition into K clusters
  o I.e., each point in exactly one cluster
* A centroid specified for each cluster
  o Let c1,c2,...,cK denote current centroids
* Each xi associated with one centroid
  o Let centroid(xi) be centroid for xi
  o If cj = centroid(xi), then xi is in cluster j

Clustering
* Two crucial questions
  1. How to determine centroids, cj?
  2. How to determine clusters, that is, how to assign xi to centroids?
* But first, what makes a cluster "good"?
  o For now, focus on one individual cluster
  o Relationship between clusters later...
* What do you think?

Distortion
* Intuitively, "compact" clusters good
  o Depends on data and K, which are given
  o And depends on centroids and assignment of xi to clusters, which we can control
* How to measure "goodness"?
* Define distortion = Σ d(xi,centroid(xi))
  o Where d(x,y) is a distance measure
* Given K, let's try to minimize distortion

Distortion
* (figure: two clusterings of the same 2-d data)
* Consider this 2-d data, with K = 3 clusters
* Same data for both... which has smaller distortion?
* How to minimize distortion?
  o Good question...
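To make the definitions concrete, here is a minimal Python sketch (my own illustration, not from the slides; the function names are arbitrary) of the two distance measures and of distortion, with each point charged to its nearest centroid:

```python
import math

def euclidean(x, y):
    # d(x,y) = sqrt((x1-y1)^2 + ... + (xn-yn)^2)
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # d(x,y) = |x1-y1| + ... + |xn-yn|
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def distortion(points, centroids, d=euclidean):
    # distortion = sum over all xi of d(xi, centroid(xi)),
    # where centroid(xi) is the centroid nearest to xi
    return sum(min(d(x, c) for c in centroids) for x in points)

points = [(1, 1), (1, 2), (8, 8), (9, 8)]
centroids = [(1, 1.5), (8.5, 8)]
print(distortion(points, centroids))  # small value => compact clusters
```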
Distortion
* Note, distortion depends on K
  o So, should probably write distortionK
* Problem we want to solve...
  o Given: K and x1,x2,x3,...,xm
  o Minimize: distortionK
* Best choice of K is a different issue
  o Briefly considered later
* For now, assume K is given and fixed

How to Minimize Distortion?
* Given m data points and K
* Minimize distortion via exhaustive search?
  o Try all "m choose K" different cases?
  o Too much work for realistic-size data sets
* An approximate solution will have to do
  o Exact solution is an NP-complete problem
* Important observation: for minimum distortion...
  o Each xi is grouped with its nearest centroid
  o Each centroid must be the center of its group

K-Means
* Previous slide implies that we can improve a suboptimal clustering by either...
  1. Re-assigning each xi to its nearest centroid
  2. Re-computing centroids so they're centers
* No improvement from applying either 1 or 2 more than once in succession
* But alternating might be useful
  o In fact, that is the K-means algorithm

K-Means Algorithm
* Given a dataset...
  1. Select a value for K (how?)
  2. Select initial centroids (how?)
  3. Group data by nearest centroid
  4. Recompute centroids (cluster centers)
  5. If "significant" change, then goto 3; else stop
* (A code sketch of this loop follows the initialization slides below.)

K-Means Animation
* Very good animation here: http://shabal.in/visuals/kmeans/2.html
* Nice animations of movement of centroids in different cases here: http://www.ccs.neu.edu/home/kenb/db/examples/059.html (near bottom of web page)
* Other?

K-Means
* Are we assured of an optimal solution?
  o Definitely not
* Why not?
  o For one, initial centroid locations are critical
  o There is a (sensitive) dependence on initial conditions
  o This is a common issue in iterative processes (HMM training is an example)

K-Means Initialization
* Recall, K is the number of clusters
* How to choose K?
* No obvious "best" way to do so
* But K-means is fast
  o So trial and error may be OK
  o That is, experiment with different K
  o Similar to choosing N in an HMM
* Is there a better way to choose K?

Optimal K?
* Even for trial and error, we need a way to measure "goodness" of results
* Choosing optimal K is tricky
* Most intuitive measures will tend to improve for larger K
* But K "too big" may overfit data
* So, when is K "big enough"?
  o But not too big...

Schwarz Criterion
* Choose the K that minimizes
  f(K) = distortionK + λ d K log m
  o Where d is the dimension, m is the number of data points, and λ is ???
* Recall that distortion depends on K
  o Tends to decrease as K increases
  o Essentially, adding a penalty as K increases
* Related to Bayes Information Criterion (BIC)
  o And some other similar things
* Consider choice of K in more detail later...

K-Means Initialization
* How to choose initial centroids?
* Again, no best way to do this
  o Counterexamples to any "best" approach
* Often just choose at random
* Or uniform/maximum spacing
  o Or some variation on this idea
* Other?

K-Means Initialization
* In practice, we might do the following
* Try several different choices of K
  o For each K, test several initial centroids
* Select the result that is best
  o How to measure "best"?
  o We look at that next
* May not be very scientific
  o But often it's good enough
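As promised, the five steps above translate almost directly into code. Below is a minimal K-means sketch in Python (my own, not from the slides), using random initial centroids and stopping when the centroids no longer move significantly:

```python
import random

def kmeans(points, K, tol=1e-6, seed=0):
    """Minimal K-means: returns (centroids, clusters)."""
    random.seed(seed)
    centroids = random.sample(points, K)  # step 2: initial centroids (random)
    while True:
        # Step 3: group data by nearest centroid
        clusters = [[] for _ in range(K)]
        for x in points:
            j = min(range(K),
                    key=lambda j: sum((a - b) ** 2 for a, b in zip(x, centroids[j])))
            clusters[j].append(x)
        # Step 4: recompute centroids as cluster centers (means)
        new_centroids = [
            tuple(sum(coords) / len(c) for coords in zip(*c)) if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
        # Step 5: stop when there is no significant change
        shift = max(sum((a - b) ** 2 for a, b in zip(c0, c1))
                    for c0, c1 in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            return centroids, clusters

points = [(1, 1), (1.5, 2), (8, 8), (9, 8), (8.5, 9), (0.5, 1)]
centroids, clusters = kmeans(points, K=2)
print(centroids)
```

Re-running with different seeds illustrates the sensitive dependence on initial conditions mentioned above.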
K-Means Variations
* One variation is K-medoids
  o Centroids must be actual data points
* Fuzzy K-means
  o In K-means, any data point is in one cluster and not in any other
  o In fuzzy case, a data point can be partly in several different clusters
  o "Degree of membership" vs distance
* Many other variations...

Measuring Cluster Quality
* How can we judge clustering results?
  o In general, that is, not just for K-means
* Compare to typical training/scoring...
  o Suppose we test a new scoring method
  o E.g., score malware and benign files
  o Compute ROC curves, AUC, etc.
  o Many tools to measure success/accuracy
* Clustering is different (Why? How?)

Clustering Quality
* Clustering is a fishing expedition
  o Not sure what we are looking for
  o Hoping to find structure, "data discovery"
  o If we knew the answer, there would be no point to clustering
* Might find something that's not there
  o Even random data can be clustered
* Some things to consider on next slides
  o Relative to the data to be clustered

Cluster-ability?
* Clustering tendency
  o How suitable is the dataset for clustering?
  o (figure: two datasets) Which dataset below is cluster-friendly?
  o We can always apply clustering...
  o ...but expect better results in some cases

Validation
* External validation
  o Compare clusters based on data labels
  o Similar to usual training/scoring scenario
  o Good idea if we know something about the data
* Internal validation
  o Determine quality based only on clusters
  o E.g., spacing between and within clusters
  o Harder to do, but always applicable

It's All Relative
* Comparing clustering results
  o That is, compare one clustering result with others for the same dataset
  o Can be very useful in practice
  o Often, lots of trial and error
  o Could enable us to "hill climb" to better clustering results...
  o ...but we still need a way to quantify things

How Many Clusters?
* Optimal number of clusters?
  o Already mentioned this wrt K-means
  o But what about the general case?
  o I.e., not dependent on clustering technique
  o Can the data tell us how many clusters?
  o Or the topology of the clusters?
* Next, we consider relevant measures

Internal Validation
* Direct measurement of clusters
  o Might call it "topological" validation
* We'll consider the following
  o Cluster correlation
  o Similarity matrix
  o Sum of squares error
  o Cohesion and separation
  o Silhouette coefficient

Correlation Coefficient
* For X = (x1,x2,...,xn) and Y = (y1,y2,...,yn), the correlation coefficient rXY is
  rXY = cov(X,Y) / (σX σY)
* Can show -1 ≤ rXY ≤ 1
  o If rXY > 0, then positive correlation (and vice versa)
  o Magnitude is strength of correlation

Examples of rXY in 2-d
* (figure: 2-d scatter plots illustrating different values of rXY)

Cluster Correlation
* Given data x1,x2,...,xm, and clusters, define 2 matrices
* Distance matrix D = {dij}
  o Where dij is distance between xi and xj
* Adjacency matrix A = {aij}
  o Where aij is 1 if xi and xj are in the same cluster
  o And aij is 0 otherwise
* Now what?

Cluster Correlation
* Compute correlation between D and A
* High inverse correlation implies nearby things are clustered together
  o Why inverse?
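A small sketch (my own, not from the slides) makes the "why inverse" question concrete: entries of A are large (1) exactly when two points share a cluster, while a good clustering makes the corresponding entries of D small, so the entrywise correlation should be strongly negative:

```python
import math

def pearson(u, v):
    # rXY = cov(X,Y) / (sigma_X * sigma_Y)
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v)) / n
    su = math.sqrt(sum((a - mu) ** 2 for a in u) / n)
    sv = math.sqrt(sum((b - mv) ** 2 for b in v) / n)
    return cov / (su * sv)

def cluster_correlation(points, labels):
    d, a = [], []
    m = len(points)
    for i in range(m):
        for j in range(i + 1, m):
            d.append(math.dist(points[i], points[j]))         # entries of D
            a.append(1.0 if labels[i] == labels[j] else 0.0)  # entries of A
    return pearson(d, a)

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
labels = [0, 0, 0, 1, 1, 1]
print(cluster_correlation(points, labels))  # strongly negative => good clustering
```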
Correlation Examples
* (figure: example clusterings with the correlation between D and A)

Similarity Matrix
* Form a "similarity matrix"
  o Could be based on just about anything
  o Typically, distance matrix D = {dij}, where dij = d(xi,xj)
* Group rows and columns by cluster
* Heat map for resulting matrix
  o Provides visual representation of similarity within clusters (so look at it...)

Similarity Matrix
* Same examples as above
* Corresponding heat maps on next slide

Heat Maps
* (figure: heat maps of the grouped similarity matrices)

Residual Sum of Squares
* Residual Sum of Squares (RSS)
  o Aka Sum of Squared Errors (SSE)
  o RSS is the sum of squared "error" terms
  o Definition of error depends on problem
* What is "error" when clustering?
  o Distance from centroid?
  o Then it's the same as distortion
  o But, could use other measures instead

Cohesion and Separation
* Cluster cohesion
  o How "tightly packed" is a cluster
  o The more cohesive a cluster, the better
* Cluster separation
  o Distance between clusters
  o The more separation, the better
* Can we measure these things?
  o Yes, easily

Notation
* Same notation as K-means
  o Let ci, for i = 1,2,...,K, be centroids
  o Let x1,x2,...,xm be data points
  o Let centroid(xi) be centroid of xi
  o Clusters determined by centroids
* Following results apply generally
  o Not just for K-means

Cohesion
* Lots of measures of cohesion
  o Previously defined distortion is useful
  o Recall, distortion = Σ d(xi,centroid(xi))
* Or, could use distance between all pairs of points in a cluster

Separation
* Again, many ways to measure this
  o Here, using distances to other centroids
* Or distances between all points in different clusters
* Or distance from centroids to a "midpoint"
* Or distance between centroids, or...

Silhouette Coefficient
* Essentially, combines cohesion and separation into a single number
* Let Ci be the cluster that xi belongs to
  o Let a be average of d(xi,y) for all y in Ci
  o For Cj ≠ Ci, let bj be avg d(xi,y) for y in Cj
  o Let b be minimum of the bj
* Then let S(xi) = (b - a) / max(a,b)
  o What the ... ?

Silhouette Coefficient
* The idea...
  o (figure: point xi in its cluster Ci, with a = avg distance within Ci, and b = min over other clusters Cj, Ck of the avg distance to that cluster)
* Usually, S(xi) = 1 - a/b

Silhouette Coefficient
* For given point xi...
  o Let a be avg distance to points in its cluster
  o Let b be dist to nearest other cluster (in a sense)
* Usually, a < b and hence S(xi) = 1 - a/b
* If a is a lot less than b, then S(xi) ≈ 1
  o Points inside cluster much closer together than to nearest other cluster (this is good)
* If a is almost same as b, then S(xi) ≈ 0
  o Some other cluster is almost as close as things inside cluster (this is bad)

Silhouette Coefficient
* Silhouette coefficient is defined for each point
* Avg silhouette coefficient for a cluster
  o Measure of how "good" a cluster is
* Avg silhouette coefficient for all points
  o Measure of overall clustering "goodness"
* Numerically, what is a good result?
  o Rule of thumb on next slide
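Here is a direct Python translation (my own, not from the slides) of the definition S(xi) = (b - a)/max(a,b), averaged over all points; it assumes every cluster has at least two points:

```python
import math

def silhouette(points, labels):
    """Average silhouette coefficient, S(xi) = (b - a) / max(a, b)."""
    clusters = set(labels)
    scores = []
    for i, x in enumerate(points):
        # a: average distance from x to the other points in its own cluster
        own = [p for j, p in enumerate(points) if labels[j] == labels[i] and j != i]
        a = sum(math.dist(x, p) for p in own) / len(own)
        # b: minimum, over the other clusters, of the average distance to that cluster
        b = min(
            sum(math.dist(x, p) for j, p in enumerate(points) if labels[j] == c)
            / sum(1 for l in labels if l == c)
            for c in clusters if c != labels[i]
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
print(silhouette(points, [0, 0, 0, 1, 1, 1]))  # near 1 => strong structure
```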
Silhouette Coefficient
* Average coefficient (to 2 decimal places)
  o 0.71 to 1.00: strong structure found
  o 0.51 to 0.70: reasonable structure found
  o 0.26 to 0.50: weak or artificial structure
  o 0.25 or less: no significant structure
* Bottom line on silhouette coefficient
  o Combines cohesion and separation in one number
  o A useful measure of cluster quality

External Validation
* "External" implies that we measure quality based on the data in clusters
  o Not relying on cluster topology ("shape")
* Suppose clustering data is of several different types
  o Say, different malware families
* We can compute statistics on clusters
  o We only consider 2 stats here

Entropy and Purity
* Entropy
  o Standard measure of uncertainty or randomness
  o High entropy implies clusters less uniform
* Purity
  o Another measure of uniformity
  o Ideally, a cluster should be more "pure", that is, more uniform

Entropy
* Suppose a total of m data elements
  o As usual, x1,x2,...,xm
* Denote cluster j as Cj
  o Let mj be number of elements in Cj
  o Let mij be count of type i in cluster Cj
* Compute probabilities based on relative frequencies
  o That is, pij = mij / mj

Entropy
* Then entropy of cluster Cj is
  Ej = -Σ pij log pij, where the sum is over i
* Compute entropy Ej for each cluster Cj
* Overall (weighted) entropy is then
  E = Σ (mj/m) Ej, where the sum is from 1 to K, and K is the number of clusters
* Smaller E is better
  o Implies clusters less uncertain/random

Purity
* Ideally, each cluster is all one type
* Using same notation as for entropy...
  o Purity of Cj defined as Uj = max pij
  o Where the max is over i (different types)
* If Uj is 1, then Cj is all one type of data
  o If Uj is near 0, no dominant type
* Overall (weighted) purity is
  U = Σ (mj/m) Uj, where the sum is from 1 to K
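Both statistics are a few lines of code. The sketch below (my own; the family counts are made up for illustration) computes the weighted entropy E and purity U from per-cluster type counts:

```python
import math

def entropy_purity(cluster_type_counts, total):
    """cluster_type_counts: one dict per cluster, mapping type -> count m_ij."""
    E = U = 0.0
    for counts in cluster_type_counts:
        mj = sum(counts.values())
        probs = [mij / mj for mij in counts.values()]        # p_ij = m_ij / m_j
        Ej = -sum(p * math.log2(p) for p in probs if p > 0)  # E_j = -sum p_ij log p_ij
        Uj = max(probs)                                      # U_j = max over i of p_ij
        E += (mj / total) * Ej                               # weighted overall entropy
        U += (mj / total) * Uj                               # weighted overall purity
    return E, U

# Two clusters of samples from three hypothetical malware families
clusters = [{"A": 9, "B": 1, "C": 0}, {"A": 2, "B": 5, "C": 3}]
print(entropy_purity(clusters, total=20))  # lower E and higher U are better
```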
Entropy and Purity Examples
* (figure: example clusterings with their entropy and purity values)

EM Clustering
* Data might be from different probability distributions
  o Then "distance" might be a poor measure
  o Maybe better to use mean and variance
* Cluster on probability distributions?
  o But distributions are unknown...
* Expectation Maximization (EM)
  o Technique to determine unknown parameters of probability distributions

EM Clustering Animation
* Good animation on the Wikipedia page: http://en.wikipedia.org/wiki/Expectation–maximization_algorithm
* Another animation here: http://www.cs.cmu.edu/~alad/em/
* Probably others too...

EM Clustering Example
* (figure: clustered Old Faithful data) Old Faithful in Yellowstone NP
* Measure "wait" and duration
* Two clusters
  o Centers are means
  o Shading based on standard deviation

Maximum Likelihood Estimator
* Maximum Likelihood Estimator (MLE)
* Suppose you flip a coin and obtain X = HHHHTHHHTT
* What is the most likely value of p = P(H)?
* Coin flips follow a binomial distribution:
  P(k heads in N flips) = C(N,k) p^k (1-p)^(N-k)
  o Where p is prob of "success" (heads)

MLE
* Suppose X = HHHHTHHHTT
* Maximum likelihood function for X is
  L(θ) = θ^7 (1-θ)^3
* And log likelihood function is
  log L(θ) = 7 log θ + 3 log(1-θ)
* Optimize log likelihood function
* In this case, MLE is θ = P(H) = 0.7

Coin Experiment
* Given 2 biased coins, A and B
  o Randomly select coin, then...
  o Flip selected coin 10 times, and...
  o Repeat 5 times, for 50 total coin flips
* Can we determine P(H) for each coin?
* Easy, if you know which coin was selected
  o For each coin, just divide number of heads by number of flips of that coin

Coin Example
* For example, suppose
  Coin B: HTTTHHTHTH (5 H and 5 T)
  Coin A: HHHHTHHHHH (9 H and 1 T)
  Coin A: HTHHHHHTHH (8 H and 2 T)
  Coin B: HTHTTTHHTT (4 H and 6 T)
  Coin A: THHHTHHHTH (7 H and 3 T)
* Then maximum likelihood estimate is
  PA(H) = 24/30 = 0.80 and PB(H) = 9/20 = 0.45

Coin Example
* Suppose we have the same data, but we don't know which coin was selected
  Coin ?: 5 H and 5 T
  Coin ?: 9 H and 1 T
  Coin ?: 8 H and 2 T
  Coin ?: 4 H and 6 T
  Coin ?: 7 H and 3 T
* Can we estimate PA(H) and PB(H)?

Coin Example
* We do not know which coin was flipped
* So, there is "hidden" information
  o This sounds familiar...
* Train an HMM on the sequence of H and T?
  o Using 2 hidden states
  o Use resulting model to find most likely hidden state sequence (HMM "problem 2")
  o Use that sequence to estimate PA(H) and PB(H)

Coin Example
* HMM is very "heavy artillery"
  o And HMM needs lots of data to converge (or lots of different initializations)
  o EM gives us the info we need, with less work and less data
* EM algorithm: initial guess for params
  o Then alternate between these 2 steps:
  o Expectation: recompute "expected values"
  o Maximization: recompute params via MLEs

EM for Coin Example
* Start with a guess (initialization)
  o Say, PA(H) = 0.6 and PB(H) = 0.5
* Compute expectations (E-step)
* First, from current PA(H) and PB(H) we find
  5 H, 5 T: P(A) = .45, P(B) = .55
  9 H, 1 T: P(A) = .80, P(B) = .20
  8 H, 2 T: P(A) = .73, P(B) = .27
  4 H, 6 T: P(A) = .35, P(B) = .65
  7 H, 3 T: P(A) = .65, P(B) = .35
* Why? See next slide...

EM for Coin Example
* Suppose PA(H) = 0.6 and PB(H) = 0.5
  o And in 10 flips of 1 coin, we find 8 H and 2 T
* Assuming coin A was flipped, we have
  a = 0.6^8 × 0.4^2 = 0.0026874
* Assuming coin B was flipped, we have
  b = 0.5^8 × 0.5^2 = 0.0009766
* Then by Bayes' Formula,
  P(A) = a/(a + b) = 0.73 and P(B) = b/(a + b) = 0.27

E-step for Coin Example
* Assuming PA(H) = 0.6 and PB(H) = 0.5, we have
  5 H, 5 T: P(A) = .45, P(B) = .55
  9 H, 1 T: P(A) = .80, P(B) = .20
  8 H, 2 T: P(A) = .73, P(B) = .27
  4 H, 6 T: P(A) = .35, P(B) = .65
  7 H, 3 T: P(A) = .65, P(B) = .35
* Next, compute expected (weighted) H and T
* For example, in the 1st line
  o For A we have 5 × .45 = 2.25 H and 2.25 T
  o For B we have 5 × .55 = 2.75 H and 2.75 T
* And in the 2nd line
  o For A, we have 9 × .8 = 7.2 H and 1 × .8 = .8 T
  o For B, we have 9 × .2 = 1.8 H and 1 × .2 = .2 T

E-step for Coin Example
* Rounded to nearest 0.1:
                                    Coin A          Coin B
  5 H, 5 T: P(A)=.45, P(B)=.55    2.2 H, 2.2 T    2.8 H, 2.8 T
  9 H, 1 T: P(A)=.80, P(B)=.20    7.2 H, 0.8 T    1.8 H, 0.2 T
  8 H, 2 T: P(A)=.73, P(B)=.27    5.9 H, 1.5 T    2.1 H, 0.5 T
  4 H, 6 T: P(A)=.35, P(B)=.65    1.4 H, 2.1 T    2.6 H, 3.9 T
  7 H, 3 T: P(A)=.65, P(B)=.35    4.5 H, 1.9 T    2.5 H, 1.1 T
  expected totals:               21.2 H, 8.5 T   11.8 H, 8.5 T
* This completes the E-step
* We computed these expected values based on current PA(H) and PB(H)

M-step for Coin Example
* M-step: re-estimate PA(H) and PB(H) using results from E-step:
  PA(H) = 21.2/(21.2 + 8.5) ≈ 0.71
  PB(H) = 11.8/(11.8 + 8.5) ≈ 0.58
* Next?
  o E-step with these new PA(H), PB(H)
  o Then M-step, then E-step, then...
  o ...until convergence (or we get tired)
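The whole worked example fits in a few lines of Python. This sketch (my own, not from the slides) reproduces the numbers above: the first E/M pass moves the guesses PA(H) = 0.6, PB(H) = 0.5 to about 0.71 and 0.58:

```python
def em_coins(data, pA, pB, iters=10):
    """data: list of (heads, tails) per 10-flip experiment; equal prior on A and B."""
    for _ in range(iters):
        hA = tA = hB = tB = 0.0
        for h, t in data:
            # E-step: Bayes' Formula with the current parameter guesses
            a = pA ** h * (1 - pA) ** t    # likelihood if coin A was flipped
            b = pB ** h * (1 - pB) ** t    # likelihood if coin B was flipped
            wA, wB = a / (a + b), b / (a + b)
            hA += wA * h; tA += wA * t     # expected H and T credited to coin A
            hB += wB * h; tB += wB * t     # expected H and T credited to coin B
        # M-step: MLE from the expected counts
        pA, pB = hA / (hA + tA), hB / (hB + tB)
        print(round(pA, 2), round(pB, 2))
    return pA, pB

data = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]
em_coins(data, pA=0.6, pB=0.5)  # first iteration prints 0.71 0.58, as on the slides
```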
EM for Clustering
* How is EM relevant to clustering?
* Can use EM to obtain parameters of K "hidden" distributions
  o That is, means and variances, μi and σi^2
* Then, use the μi as centers of clusters
  o And the σi (standard deviations) as "radii"
  o Often use Gaussian (normal) distributions
* Is this better than K-means?

EM vs K-Means
* Whether it is better or not, EM is obviously different from K-means...
  o ...or is it?
* Actually, K-means is a special case of EM
  o Using "distance" instead of "probabilities"
* In K-means, we re-assign points to centroids
  o Like the "E" in EM, which "re-shapes" clusters
* In K-means, we recompute centroids
  o Like the "M" of EM, where we recompute parameters

EM Algorithm
* Now, we give the EM algorithm in general
  o Eventually, consider a realistic example
* For simplicity, we assume data is a mixture from 2 distributions
  o Easy to generalize to 3 or more
* Choose probability distribution type
  o Like choosing distance function in K-means
* Then iterate EM to determine params

EM Algorithm
* Assume 2 distributions (2 clusters)
  o Let θ1 be parameter(s) of 1st distribution
  o Let θ2 be parameter(s) of 2nd distribution
* For binomial, θ1 = PA(H), θ2 = PB(H)
* For Gaussian distribution, θi = (μi,σi^2)
* Also, mixture parameters, τ = (τ1,τ2)
  o Fraction of samples from distribution i is τi
  o Since 2 distributions, we have τ1 + τ2 = 1

EM Algorithm
* Let f(xi,θj) be the probability function for distribution j
  o For now, assuming 2 distributions
  o And the xi are "experiments" (data points)
* Make initial guess for parameters
  o That is, θ = (θ1,θ2) and τ
* Let pji be probability of xi assuming distribution j

EM Algorithm
* Initial guess for parameters θ and τ
  o If you know something, use it
  o If not, "random" may be OK
  o In any case, choose reasonable values
* Next, apply E-step, then M-step...
  o ...then E, then M, then E, then M, ...
* So, what are the E and M steps?
  o Want to state these for the general case

E-Step
* Using Bayes' Formula, we compute
  pji = τj f(xi,θj) / (τ1 f(xi,θ1) + τ2 f(xi,θ2))
  o Where j = 1,2 and i = 1,2,...,n
  o Assuming n data points and 2 distributions
* Then pji is the probability that xi is in cluster j
  o Or the "part" of xi that's in cluster j
* Note p1i + p2i = 1 for i = 1,2,...,n
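In code, the E-step is one application of Bayes' Formula per data point. Below is a generic sketch (my own) with a pluggable probability function f, checked here against the coin data:

```python
def e_step(xs, f, thetas, taus):
    """p[j][i] = tau_j * f(x_i, theta_j) / sum over k of tau_k * f(x_i, theta_k)."""
    p = [[0.0] * len(xs) for _ in thetas]
    for i, x in enumerate(xs):
        weights = [tau * f(x, theta) for tau, theta in zip(taus, thetas)]
        total = sum(weights)
        for j, w in enumerate(weights):
            p[j][i] = w / total  # posterior "share" of x_i belonging to cluster j
    return p

# Example with the binomial probability function from the coin problem (N = 10):
from math import comb

def binom(x, theta, N=10):
    return comb(N, x) * theta ** x * (1 - theta) ** (N - x)

p = e_step([8, 5, 9, 4, 7], binom, thetas=(0.6, 0.5), taus=(0.7, 0.3))
print([round(v, 2) for v in p[0]])  # p_1i per experiment: ~[0.87, 0.66, 0.91, 0.56, 0.81]
```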
M-Step
* Use probabilities from E-step to re-estimate parameters θ and τ
* Best estimate for τj is given by
  τj = Σ pji / Σ (p1i + p2i), where the sums are over i = 1,2,...,n
* Since p1i + p2i = 1, this simplifies to
  τj = (1/n) Σ pji

M-Step
* Params θ are functions of the μj and σj^2
  o Depends on specific distributions used
* Based on the pji, we have...
* Means:
  μj = Σ pji xi / Σ pji
* Variances:
  σj^2 = Σ pji (xi - μj)^2 / Σ pji
  o Where all sums are over i = 1,2,...,n

EM Example: Binomial
* Mean of binomial is μ = Np
  o Where p = P(H) and N is trials per experiment
* Suppose N = 10, and 5 experiments:
  x1: 8 H and 2 T
  x2: 5 H and 5 T
  x3: 9 H and 1 T
  x4: 4 H and 6 T
  x5: 7 H and 3 T
* Assuming 2 coins, determine PA(H) and PB(H)

Binomial Example
* Let X = (x1,x2,x3,x4,x5) = (8,5,9,4,7)
  o Have N = 10, and means are μ = Np
  o We want to determine p for both coins
* Initial guesses for parameters:
  o Probabilities PA(H) = 0.6 and PB(H) = 0.5
  o That is, θ = (θ1,θ2) = (0.6,0.5)
  o Mixture params τ1 = 0.7 and τ2 = 0.3
  o That is, τ = (τ1,τ2) = (0.7,0.3)

Binomial Example: E-Step
* Compute the pji using current guesses for parameters θ and τ
  o (table of pji values shown on slide)

Binomial Example: M-Step
* Recompute τ using the new pji
* So, τ = (τ1,τ2) = (0.7593,0.2407)

Binomial Example: M-Step
* Recompute θ = (θ1,θ2)
* First, compute means μ = (μ1,μ2)
* Obtain μ = (μ1,μ2) = (6.9180,5.5969)
* So, θ = (θ1,θ2) = μ/N = (0.6918,0.5597)

Binomial Example: Convergence
* Next E-step:
  o Compute new pji using the τ and θ
* Next M-step:
  o Compute τ and θ using pji from E-step
* And so on...
* In this example, EM converges to
  τ = (τ1,τ2) = (0.5228, 0.4772)
  θ = (θ1,θ2) = (0.7934, 0.5139)
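This sketch (my own, not from the slides) runs the E and M steps above for the binomial mixture; a single iteration reproduces the values τ = (0.7593, 0.2407) and θ = (0.6918, 0.5597):

```python
from math import comb

def em_binomial(xs, N, theta, tau, iters=1):
    """EM for a 2-component binomial mixture; xs = heads per N-flip experiment."""
    n = len(xs)
    for _ in range(iters):
        # E-step: p[j][i] = tau_j f(x_i,theta_j) / sum over k of tau_k f(x_i,theta_k)
        p = [[0.0] * n for _ in range(2)]
        for i, x in enumerate(xs):
            w = [tau[j] * comb(N, x) * theta[j] ** x * (1 - theta[j]) ** (N - x)
                 for j in range(2)]
            for j in range(2):
                p[j][i] = w[j] / sum(w)
        # M-step: tau_j = (1/n) sum p_ji; mu_j = sum p_ji x_i / sum p_ji; theta_j = mu_j / N
        tau = tuple(sum(p[j]) / n for j in range(2))
        mu = tuple(sum(p[j][i] * xs[i] for i in range(n)) / sum(p[j]) for j in range(2))
        theta = tuple(m / N for m in mu)
    return theta, tau

theta, tau = em_binomial([8, 5, 9, 4, 7], N=10, theta=(0.6, 0.5), tau=(0.7, 0.3))
print(theta, tau)  # first iteration: theta ~ (0.6918, 0.5597), tau ~ (0.7593, 0.2407)
```

Raising iters shows the convergence toward the values quoted above.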
Gaussian Mixture Example
* We'll consider 2-d data
  o That is, each data point is of the form (x,y)
* Suppose we want 2 clusters
  o That is, 2 distributions to determine
* We'll assume Gaussian distributions
  o Recall, Gaussian is the normal distribution
  o Since 2-d data, bivariate Gaussian
* A Gaussian mixture problem

Data
* (figure: scatter plot of "Old Faithful" geyser data, Yellowstone NP)

Bivariate Gaussian
* Gaussian (normal) distribution of 1 variable:
  f(x) = 1/(σ sqrt(2π)) exp(-(x-μ)^2 / (2σ^2))
* Bivariate Gaussian distribution:
  f(x,y) = 1/(2π σx σy sqrt(1-ρ^2)) exp(-z/2)
  o Where z = 1/(1-ρ^2) [ (x-μx)^2/σx^2 - 2ρ(x-μx)(y-μy)/(σx σy) + (y-μy)^2/σy^2 ]
  o And ρ = cov(x,y)/(σx σy)

Bivariate Gaussian: Matrix Form
* Bivariate Gaussian can be written as
  f(x) = 1/(2π sqrt(det S)) exp(-(x-μ)^T S^(-1) (x-μ)/2)
* Where x = (x,y) and μ = (μx,μy)
* And S is the covariance matrix, that is
  S = | σx^2       cov(x,y) |
      | cov(x,y)   σy^2     |
* Also det S = σx^2 σy^2 (1-ρ^2), and
  S^(-1) = (1/det S) |  σy^2      -cov(x,y) |
                     | -cov(x,y)   σx^2     |

Why Use Matrix Form?
* Generalizes to multivariate Gaussian
  o Formulas for det(S) and S^(-1) change
  o Can cluster data that's more than 2-d
* Re-estimation formulas in E and M steps have the same form as before
  o Simply replace scalars with vectors
* In matrix form, params of (bivariate) Gaussian: θ = (μ,S), where μ = (μx,μy)

Old Faithful Data
* Data is 2-d, so bivariate Gaussians
* Parameters of a bivariate Gaussian:
  θ = (μ,S), where μ = (μx,μy)
* We want 2 clusters, so must determine
  θ1 = (μ1,S1) and θ2 = (μ2,S2)
  o Where the Si are 2x2 and the μi are 2x1
* Make initial guesses, then iterate EM
  o What to use as initial guesses?

Old Faithful Example
* Recall that S is 2x2 and symmetric
  o Which implies that ρ = s12/sqrt(s11 s22)
  o And must always have -1 ≤ ρ ≤ 1
* This imposes a restriction on S
* We'll use mean and variance of the x and y components when initializing

Old Faithful: Initialization
* Want 2 clusters
* Initialize 2 bivariate Gaussians
  o (initial μi and Si values shown on slide)
* For both Si, can verify that ρ = 0.5
* Also, initialize τ = (τ1,τ2) = (0.6, 0.4)

Old Faithful: E-Step
* First E-step yields
  o (table of pji values shown on slide)

Old Faithful: M-Step
* First M-step yields
  τ = (τ1,τ2) = (0.5704, 0.4296)
  o (updated μi and Si values shown on slide)
* Easy to verify that -1 ≤ ρ ≤ 1
  o For both distributions

Old Faithful: Convergence
* After 100 iterations of EM,
  τ = (τ1,τ2) = (0.5001, 0.4999)
  o (converged μi and Si values shown on slide)
* Again, we have -1 ≤ ρ ≤ 1

Old Faithful Clusters
* (figure: final clusters)
* Centroids are means: μ1, μ2
* Shading is standard devs
  o Darker area is within one σ
  o Lighter is within two standard devs
* (A code sketch of EM for a bivariate Gaussian mixture follows the references.)

EM Clustering
* Clusters based on probabilities
  o Each data point has a probability related to each cluster
  o A point is assigned to the cluster that gives the highest probability
* K-means uses distance, not probability
* But we can view probability as a "distance"
* So, K-means and EM are not so different...

Conclusion
* Clustering is fun, entertaining, very useful
  o Can explore mysterious data, and more...
* And K-means is really simple
  o EM is powerful and not that difficult either
* Measuring success is not so easy
  o "Good" clusters? And useful information?
  o Or just random noise?
  o Anything can be clustered
* Clustering is often just a starting point
  o Helps us decide if any "there" is there

References: K-Means
* A.W. Moore, K-means and hierarchical clustering
* P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison-Wesley, 2006, Chapter 8: Cluster analysis: basic concepts and algorithms
* R. Jin, Cluster validation
* M.J. Norusis, IBM SPSS Statistics 19 Statistical Procedures Companion, Chapter 17: Cluster analysis

References: EM Clustering
* C.B. Do and S. Batzoglou, What is the expectation maximization algorithm?, Nature Biotechnology, 26(8):897-899, 2008
* J.A. Bilmes, A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models, ICSI Report TR-97-021, 1998
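As promised above, here is a self-contained sketch (my own, with made-up toy data standing in for the real Old Faithful measurements) of EM for a mixture of two bivariate Gaussians in matrix form:

```python
import math

def bivariate_gaussian(x, mu, S):
    """f(x) = 1/(2*pi*sqrt(det S)) * exp(-(x-mu)^T S^-1 (x-mu) / 2)."""
    dx, dy = x[0] - mu[0], x[1] - mu[1]
    det = S[0][0] * S[1][1] - S[0][1] * S[1][0]
    inv = [[S[1][1] / det, -S[0][1] / det],   # inverse of a 2x2 matrix
           [-S[1][0] / det, S[0][0] / det]]
    q = dx * (inv[0][0] * dx + inv[0][1] * dy) + dy * (inv[1][0] * dx + inv[1][1] * dy)
    return math.exp(-q / 2) / (2 * math.pi * math.sqrt(det))

def em_gaussian_mixture(points, mus, Ss, taus, iters=100):
    n = len(points)
    for _ in range(iters):
        # E-step: posterior probability of each cluster for each point
        p = [[0.0] * n for _ in range(2)]
        for i, x in enumerate(points):
            w = [taus[j] * bivariate_gaussian(x, mus[j], Ss[j]) for j in range(2)]
            for j in range(2):
                p[j][i] = w[j] / sum(w)
        # M-step: re-estimate tau, mu, and S from the weighted points
        for j in range(2):
            wsum = sum(p[j])
            taus[j] = wsum / n
            mus[j] = [sum(p[j][i] * points[i][k] for i in range(n)) / wsum
                      for k in range(2)]
            Ss[j] = [[sum(p[j][i] * (points[i][a] - mus[j][a]) * (points[i][b] - mus[j][b])
                          for i in range(n)) / wsum for b in range(2)] for a in range(2)]
    return mus, Ss, taus

# Toy 2-d (duration, wait) data; initial mus/Ss from rough per-component guesses
points = [(1.8, 56), (1.9, 52), (2.0, 60), (4.1, 82), (4.3, 84), (4.5, 80)]
mus = [[2.0, 55.0], [4.0, 80.0]]
Ss = [[[1.0, 0.5], [0.5, 25.0]], [[1.0, 0.5], [0.5, 25.0]]]
mus, Ss, taus = em_gaussian_mixture(points, mus, Ss, [0.6, 0.4])
print(taus, mus)
```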