Cluster Analysis
Mark Stamp
Cluster Analysis
 Grouping objects in a meaningful way
o Clustered data fits together in some way
o Can help to make sense of (big) data
o Useful analysis technique in many fields
 Many different clustering strategies
 Overview, then details on 2 methods
o K-means  simple and can be effective
o EM clustering  not as simple
Intrinsic vs Extrinsic
 Intrinsic clustering relies on unsupervised learning
o No predetermined labels on objects
o Apply analysis directly to data
 Extrinsic clustering relies on category labels
o Requires pre-processing of data
o Can be viewed as a form of supervised learning
Agglomerative vs Divisive
 Agglomerative
o Each object starts in its own cluster
o Clustering merges existing clusters
o A “bottom up” approach
 Divisive
o All objects start in one cluster
o Clustering process splits existing clusters
o A “top down” approach
Hierarchical vs Partitional
 Hierarchical clustering
o “Child” and “parent” clusters
o Can be viewed as dendrograms
 Partitional clustering
o Partition objects into disjoint clusters
o No hierarchical relationship
 We consider K-means and EM in detail
o These are both partitional
Hierarchical Clustering
 Example of a hierarchical approach…
1. start: Every point is its own cluster
2. while number of clusters exceeds 1
o Find 2 nearest clusters and merge
3. end while
 OK, but no real theoretical basis
o And some find that “disconcerting”
o Even K-means has some theory behind it
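A naive sketch of the merge loop above, using single-linkage distance between clusters (the function name agglomerative and the default dist are my own choices, not from the slides; this is meant to illustrate the idea, not to be efficient):

```python
import numpy as np

def agglomerative(X, dist=lambda a, b: np.linalg.norm(a - b)):
    """Naive single-linkage agglomerative clustering: merge until one cluster remains."""
    clusters = [[i] for i in range(len(X))]           # every point starts in its own cluster
    merges = []
    while len(clusters) > 1:
        # find the two clusters whose closest members are nearest to each other
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(dist(X[i], X[j]) for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))   # record merges, e.g. for a dendrogram
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges
```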
Dendrogram
 Example
 Obtained by hierarchical clustering
o Maybe…
Distance
 Distance between data points?
 Suppose x = (x1,x2,…,xn) and y = (y1,y2,…,yn), where each xi and yi are real numbers
 Euclidean distance is
d(x,y) = sqrt((x1-y1)^2 + (x2-y2)^2 + … + (xn-yn)^2)
 Manhattan (taxicab) distance is
d(x,y) = |x1-y1| + |x2-y2| + … + |xn-yn|
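Both measures are easy to compute directly; a minimal Python sketch (the function names euclidean and manhattan are mine, not from the slides):

```python
import math

def euclidean(x, y):
    # square root of the sum of squared coordinate differences
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def manhattan(x, y):
    # sum of absolute coordinate differences
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

print(euclidean((0, 0), (3, 4)))   # 5.0
print(manhattan((0, 0), (3, 4)))   # 7
```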
Distance
 Euclidean distance  red line
 Manhattan distance  blue or yellow
o Or any similar right-angle only path
[Figure: two points a and b connected by the different paths]
Distance
 Lots and lots more distance measures
 Other examples include
o Mahalanobis distance  takes mean and covariance into account
o Simple substitution distance  measure of “decryption” distance
o Chi-squared distance  statistical
o Or just about anything you can think of…
One Clustering Approach
 Given data points x1,x2,x3,…,xm
 Want to partition into K clusters
o I.e., each point in exactly one cluster
 A centroid specified for each cluster
o Let c1,c2,…,cK denote current centroids
 Each xi associated with one centroid
o Let centroid(xi) be centroid for xi
o If cj = centroid(xi), then xi is in cluster j
Clustering
 Two crucial questions
1. How to determine centroids, cj?
2. How to determine clusters, that is, how to assign xi to centroids?
 But first, what makes a cluster “good”?
o For now, focus on one individual cluster
o Relationship between clusters later…
 What do you think?
Distortion
 Intuitively, “compact” clusters are good
o Depends on data and K, which are given
o And depends on centroids and assignment of xi to clusters, which we can control
 How to measure “goodness”?
 Define distortion = Σ d(xi,centroid(xi))
o Where d(x,y) is a distance measure
 Given K, let’s try to minimize distortion
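Distortion is just a sum over the dataset; a small sketch under the definition above (the helper names and the toy data are illustrative only, not from the slides):

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def distortion(points, centroids, assign, dist=euclidean):
    """Sum of distances from each point to its assigned centroid."""
    return sum(dist(x, centroids[assign[i]]) for i, x in enumerate(points))

# toy usage: two well-separated 1-d clusters
points = [(1.0,), (2.0,), (9.0,), (10.0,)]
centroids = [(1.5,), (9.5,)]
assign = [0, 0, 1, 1]          # assign[i] = index of the centroid for points[i]
print(distortion(points, centroids, assign))   # 4 * 0.5 = 2.0
```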
Distortion
 Consider this 2-d data
o Choose K = 3 clusters
 Same data for both
o Which has smaller distortion?
 How to minimize distortion?
o Good question…
Distortion
 Note, distortion depends on K
o So, should probably write distortionK
 Problem we want to solve…
o Given: K and x1,x2,x3,…,xm
o Minimize: distortionK
 Best choice of K is a different issue
o Briefly considered later
 For now, assume K is given and fixed
How to Minimize Distortion?
 Given m data points and K
 Minimize distortion via exhaustive search?
o Try all “m choose K” different cases?
o Too much work for realistic size data set
 An approximate solution will have to do
o Exact solution is NP-complete problem
 Important Observation: For min distortion…
o Each xi grouped with nearest centroid
o Centroid must be center of its group
K-Means
 Previous slide implies that we can improve a suboptimal clustering by either…
1. Re-assign each xi to nearest centroid
2. Re-compute centroids so they’re centers
 No improvement from applying either 1 or 2 more than once in succession
 But alternating might be useful
o In fact, that is the K-means algorithm
K-Means Algorithm
 Given dataset…
1. Select a value for K (how?)
2. Select initial centroids (how?)
3. Group data by nearest centroid
4. Recompute centroids (cluster centers)
5. If “significant” change, then goto 3; else stop
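A minimal NumPy sketch of steps 2 through 5 (the initialization strategy, the tolerance, and names such as kmeans and max_iter are my own choices; the slides deliberately leave steps 1 and 2 open):

```python
import numpy as np

def kmeans(X, K, max_iter=100, tol=1e-6, seed=0):
    """Minimal K-means sketch: X is an (m, n) array, K the number of clusters."""
    rng = np.random.default_rng(seed)
    # Step 2: pick K distinct data points as initial centroids (one common choice)
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign each point to its nearest centroid (Euclidean distance)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of its assigned points
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        # Step 5: stop when the centroids no longer move significantly
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    return centroids, labels
```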
K-Means Animation
 Very good animation here
http://shabal.in/visuals/kmeans/2.html
 Nice animations of movement of centroids in different cases here
http://www.ccs.neu.edu/home/kenb/db/examples/059.html
(near bottom of web page)
 Other?
K-Means
 Are we assured of optimal solution?
o Definitely not
 Why not?
o For one, initial centroid locations critical
o There is a (sensitive) dependence on initial conditions
o This is a common issue in iterative processes (HMM training is an example)
K-Means Initialization
 Recall, K is the number of clusters
 How to choose K?
 No obvious “best” way to do so
 But K-means is fast
o So trial and error may be OK
o That is, experiment with different K
o Similar to choosing N in HMM
 Is there a better way to choose K?
Optimal K?
 Even for trial and error, we need a way to measure “goodness” of results
 Choosing optimal K is tricky
 Most intuitive measures will tend to improve for larger K
 But K “too big” may overfit data
 So, when is K “big enough”?
o But not too big…
Schwarz Criterion
 Choose K that minimizes
f(K) = distortionK + λdK log m
o Where d is the dimension, m is the number of data points, and λ is ???
 Recall that distortion depends on K
o Tends to decrease as K increases
o Essentially, adding a penalty as K increases
 Related to Bayesian Information Criterion (BIC)
o And some other similar things
 Consider choice of K in more detail later…
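A minimal sketch of this criterion in Python (the function names and the default λ = 1 are my own choices; the slide leaves λ unspecified, so it must be tuned for the problem at hand):

```python
import numpy as np

def schwarz_f(distortion_K, K, d, m, lam=1.0):
    """f(K) = distortion_K + lambda * d * K * log(m); lam is problem-dependent."""
    return distortion_K + lam * d * K * np.log(m)

def best_k(distortions, d, m, lam=1.0):
    """Given a dict {K: distortion_K}, return the K with smallest f(K)."""
    return min(distortions, key=lambda K: schwarz_f(distortions[K], K, d, m, lam))
```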
K-Means Initialization
 How to choose initial centroids?
 Again, no best way to do this
o Counterexamples to any “best” approach
 Often just choose at random
 Or uniform/maximum spacing
o Or some variation on this idea
 Other?
K-Means Initialization
 In practice, we might do the following
 Try several different choices of K
o For each K, test several initial centroids
 Select the result that is best
o How to measure “best”?
o We look at that next
 May not be very scientific
o But often it’s good enough
K-Means Variations
 One variation is K-medoids
o Centroid must be an actual data point
 Fuzzy K-means
o In K-means, any data point is in one cluster and not in any other
o In fuzzy case, data point can be partly in several different clusters
o “Degree of membership” vs distance
 Many other variations…
Measuring Cluster Quality
 How can we judge clustering results?
o In general, that is, not just for K-means
 Compare to typical training/scoring…
o Suppose we test new scoring method
o E.g., score malware and benign files
o Compute ROC curves, AUC, etc.
o Many tools to measure success/accuracy
 Clustering is different (Why? How?)
Clustering Quality
 Clustering is a fishing expedition
o Not sure what we are looking for
o Hoping to find structure, “data discovery”
o If we know the answer, no point to clustering
 Might find something that’s not there
o Even random data can be clustered
 Some things to consider on next slides
o Relative to the data to be clustered
Cluster-ability?
 Clustering tendency
o How suitable is dataset for clustering?
o Which dataset below is cluster-friendly?
o We can always apply clustering…
o …but expect better results in some cases
Validation
 External validation
o Compare clusters based on data labels
o Similar to usual training/scoring scenario
o Good idea if we know something about data
 Internal validation
o Determine quality based only on clusters
o E.g., spacing between and within clusters
o Harder to do, but always applicable
It’s All Relative
 Comparing clustering results
o That is, compare one clustering result with others for same dataset
o Can be very useful in practice
o Often, lots of trial and error
o Could enable us to “hill climb” to better clustering results…
o …but still need a way to quantify things
How Many Clusters?
 Optimal number of clusters?
o Already mentioned this wrt K-means
o But what about the general case?
o I.e., not dependent on cluster technique
o Can the data tell us how many clusters?
o Or the topology of the clusters?
 Next, we consider relevant measures
Internal Validation
 Direct measurement of clusters
o Might call it “topological” validation
 We’ll consider the following
o Cluster correlation
o Similarity matrix
o Sum of squares error
o Cohesion and separation
o Silhouette coefficient
Correlation Coefficient
 For X=(x1,x2,…,xn) and Y=(y1,y2,…,yn)
 Correlation coefficient rXY is
rXY = cov(X,Y) / (σXσY)
 Can show -1 ≤ rXY ≤ 1
o If rXY > 0 then positive correlation (and vice versa)
o Magnitude is strength of correlation
Examples of rXY in 2-d
Cluster Correlation
 Given data x1,x2,…,xm, and clusters, define 2 matrices
 Distance matrix D = {dij}
o Where dij is distance between xi and xj
 Adjacency matrix A = {aij}
o Where aij is 1 if xi and xj in same cluster
o And aij is 0 otherwise
 Now what?
Cluster Correlation
 Compute correlation between D and A
 High inverse correlation implies nearby things clustered together
o Why inverse?
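A sketch of this computation, assuming the data is an (m, n) NumPy array and labels is an array of cluster assignments (the function name cluster_correlation is mine; using each pair only once and skipping the diagonal is my choice of convention):

```python
import numpy as np

def cluster_correlation(X, labels):
    """Correlation between pairwise distances and same-cluster indicators."""
    labels = np.asarray(labels)
    m = len(X)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # distance matrix
    A = (labels[:, None] == labels[None, :]).astype(float)      # adjacency matrix
    iu = np.triu_indices(m, k=1)            # each pair once, diagonal excluded
    return np.corrcoef(D[iu], A[iu])[0, 1]  # strongly negative = nearby points clustered together
```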
Correlation
 Correlation examples
Similarity Matrix
 Form “similarity matrix”
o Could be based on just about anything
o Typically, distance matrix D = {dij}, where dij = d(xi,xj)
 Group rows and columns by cluster
 Heat map for resulting matrix
o Provides visual representation of similarity within clusters (so look at it…)
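One possible way to produce such a heat map with NumPy and matplotlib (a sketch; sorting rows and columns by cluster label, and the name similarity_heatmap, are my own choices):

```python
import numpy as np
import matplotlib.pyplot as plt

def similarity_heatmap(X, labels):
    """Show the distance matrix with rows/columns grouped by cluster."""
    labels = np.asarray(labels)
    order = np.argsort(labels)                               # group rows/columns by cluster
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    plt.imshow(D[np.ix_(order, order)], cmap="viridis")
    plt.colorbar(label="distance")
    plt.show()
```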
Similarity Matrix
 Same examples as above
 Corresponding heat maps on next slide
Heat Maps
Residual Sum of Squares
 Residual Sum of Squares (RSS)
o Aka Sum of Squared Errors (SSE)
o RSS is the sum of squared “error” terms
o Definition of error depends on problem
 What is “error” when clustering?
o Distance from centroid?
o Then it’s the same as distortion
o But, could use other measures instead
Cohesion and Separation
 Cluster cohesion
o How “tightly packed” is a cluster
o The more cohesive a cluster, the better
 Cluster separation
o Distance between clusters
o The more separation, the better
 Can we measure these things?
o Yes, easily
Notation
 Same notation as K-means
o Let ci, for i=1,2,…,K, be centroids
o Let x1,x2,…,xm be data points
o Let centroid(xi) be centroid of xi
o Clusters determined by centroids
 Following results apply generally
o Not just for K-means
Cohesion
 Lots of measures of cohesion
o Previously defined distortion is useful
o Recall, distortion = Σ d(xi,centroid(xi))
 Or, could use distance between all pairs of points in a cluster
Separation
 Again, many ways to measure this
o Here, using distances to other centroids
o Or distances between all points in clusters
o Or distance from centroids to a “midpoint”
o Or distance between centroids, or…
Silhouette Coefficient
 Essentially, combines cohesion and separation into a single number
 Let Ci be the cluster that xi belongs to
o Let a be average of d(xi,y) for all y in Ci
o For Cj ≠ Ci, let bj be avg d(xi,y) for y in Cj
o Let b be minimum of bj
 Then let S(xi) = (b – a) / max(a,b)
o What the … ?
Silhouette Coefficient
 The idea…
[Figure: point xi in cluster Ci; a = avg distance to points in Ci, b = min over other clusters Cj, Ck of the avg distance to that cluster]
 Usually, S(xi) = 1 - a/b
Silhouette Coefficient
 For given point xi…
o Let a be avg distance to points in its cluster
o Let b be dist to nearest other cluster (in a sense)
 Usually, a < b and hence S(xi) = 1 – a/b
 If a is a lot less than b, then S(xi) ≈ 1
o Points inside cluster much closer together than nearest other cluster (this is good)
 If a is almost same as b, then S(xi) ≈ 0
o Some other cluster is almost as close as things inside cluster (this is bad)
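Putting the definition of S(xi) into code, assuming Euclidean distance, an (m, n) NumPy data array, and at least two clusters (the function name silhouette is my own; singleton clusters are given a = 0 as a simple convention):

```python
import numpy as np

def silhouette(X, labels):
    """Per-point silhouette S(x_i) = (b - a) / max(a, b)."""
    labels = np.asarray(labels)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = (labels == labels[i])
        same[i] = False                       # exclude the point itself from a
        a = D[i, same].mean() if same.any() else 0.0
        # average distance to each other cluster; b is the smallest of these
        b = min(D[i, labels == c].mean() for c in set(labels) if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores   # average per cluster, or over all points, for cluster / overall quality
```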
Silhouette Coefficient
 Silhouette coefficient is defined for each point
 Avg silhouette coefficient for a cluster
o Measure of how “good” a cluster is
 Avg silhouette coefficient for all points
o Measure of overall clustering “goodness”
 Numerically, what is a good result?
o Rule of thumb on next slide
Silhouette Coefficient
 Average coefficient (to 2 decimal places)
o 0.71 to 1.00  strong structure found
o 0.51 to 0.70  reasonable structure found
o 0.26 to 0.50  weak or artificial structure
o 0.25 or less  no significant structure
 Bottom line on silhouette coefficient
o Combines cohesion and separation in one number
o A useful measure of cluster quality
External Validation
 “External” implies that we measure quality based on data in clusters
o Not relying on cluster topology (“shape”)
 Suppose clustering data is of several different types
o Say, different malware families
 We can compute statistics on clusters
o We only consider 2 stats here
Entropy and Purity
 Entropy
o Standard measure of uncertainty or randomness
o High entropy implies clusters less uniform
 Purity
o Another measure of uniformity
o Ideally, cluster should be more “pure”, that is, more uniform
Entropy
 Suppose total of m data elements
o As usual, x1,x2,…,xm
 Denote cluster j as Cj
o Let mj be number of elements in Cj
o Let mij be count of type i in cluster Cj
 Compute probabilities based on relative frequencies
o That is, pij = mij / mj
Entropy
 Then entropy of cluster Cj is
Ej = − Σ pij log pij, where sum is over i
 Compute entropy Ej for each cluster Cj
 Overall (weighted) entropy is then
E = Σ (mj/m) Ej, where sum is from 1 to K and K is number of clusters
 Smaller E is better
o Implies clusters less uncertain/random
Purity
 Ideally, each cluster is all one type
 Using same notation as in entropy…
o Purity of Cj defined as Uj = max pij
o Where max is over i (different types)
 If Uj is 1, then Cj is all one type of data
o If Uj is near 0, no dominant type
 Overall (weighted) purity is
U = Σ (mj/m) Uj, where sum is from 1 to K
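Both statistics are short to compute; a sketch assuming labels holds the cluster ids and types holds the known data types (names are mine; base-2 logarithm is assumed since the slides do not fix the base):

```python
import numpy as np
from collections import Counter

def entropy_purity(labels, types):
    """Weighted entropy E and purity U; labels = cluster ids, types = known classes."""
    m = len(labels)
    E, U = 0.0, 0.0
    for c in set(labels):
        members = [t for l, t in zip(labels, types) if l == c]
        m_j = len(members)
        p = np.array(list(Counter(members).values())) / m_j    # p_ij = m_ij / m_j
        E += (m_j / m) * -(p * np.log2(p)).sum()               # E_j = -sum_i p_ij log p_ij
        U += (m_j / m) * p.max()                               # U_j = max_i p_ij
    return E, U
```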
Entropy and Purity
 Examples
EM Clustering
 Data might be from different probability distributions
o Then “distance” might be poor measure
o Maybe better to use mean and variance
 Cluster on probability distributions?
o But distributions are unknown…
 Expectation Maximization (EM)
o Technique to determine unknown parameters of probability distributions
EM Clustering Animation
 Good animation on Wikipedia page
http://en.wikipedia.org/wiki/Expectation–maximization_algorithm
 Another animation here
http://www.cs.cmu.edu/~alad/em/
 Probably others too…
EM Clustering Example
 Old Faithful in Yellowstone NP
 Measure “wait” and duration
 Two clusters
o Centers are means
o Shading based on standard deviation
Maximum Likelihood Estimator
 Maximum Likelihood Estimator (MLE)
 Suppose you flip a coin and obtain
X = HHHHTHHHTT
 What is most likely value of p = P(H)?
 Coin flips follow binomial distribution:
P(k heads in n flips) = C(n,k) p^k (1−p)^(n−k)
 Where p is prob of “success” (heads)
MLE
 Suppose X = HHHHTHHHTT (7 heads, 3 tails)
 Maximum likelihood function for X is
L(θ) = C(10,7) θ^7 (1−θ)^3
 And log likelihood function is
log L(θ) = 7 log θ + 3 log(1−θ) + constant
 Optimize log likelihood function
 In this case, MLE is θ = P(H) = 0.7
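The same MLE can be found numerically by scanning the log likelihood over a grid of θ values; a small sketch (the grid size and variable names are my own choices):

```python
import numpy as np

flips = "HHHHTHHHTT"
n, h = len(flips), flips.count("H")

# log likelihood of the binomial model, up to the constant log C(n, h)
thetas = np.linspace(0.01, 0.99, 99)
loglik = h * np.log(thetas) + (n - h) * np.log(1 - thetas)

print(thetas[loglik.argmax()])   # ~0.7, matching the closed-form MLE h/n
```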
Coin Experiment
 Given 2 biased coins, A and B
o Randomly select coin, then…
o Flip selected coin 10 times, and…
o Repeat 5 times, for 50 total coin flips
 Can we determine P(H) for each coin?
 Easy, if you know which coin was selected
o For each coin, just divide number of heads by number of flips of that coin
Coin Example
 For example, suppose
Coin B: HTTTHHTHTH  5 H and 5 T
Coin A: HHHHTHHHHH  9 H and 1 T
Coin A: HTHHHHHTHH  8 H and 2 T
Coin B: HTHTTTHHTT  4 H and 6 T
Coin A: THHHTHHHTH  7 H and 3 T
 Then maximum likelihood estimate is
PA(H) = 24/30 = 0.80 and PB(H) = 9/20 = 0.45
Coin Example
 Suppose we have same data, but we don’t know which coin was selected
Coin ??: 5 H and 5 T
Coin ??: 9 H and 1 T
Coin ??: 8 H and 2 T
Coin ??: 4 H and 6 T
Coin ??: 7 H and 3 T
 Can we estimate PA(H) and PB(H)?
Coin Example
 We do not know which coin was flipped
 So, there is “hidden” information
o This sounds familiar…
 Train HMM on sequence of H and T ??
o Using 2 hidden states
o Use resulting model to find most likely hidden state sequence (HMM “problem 2”)
o Use sequence to estimate PA(H) and PB(H)
Coin Example
 HMM is very “heavy artillery”
o And HMM needs lots of data to converge (or lots of different initializations)
o EM gives us info we need, less work/data
 EM algorithm: Initial guess for params
o Then alternate between these 2 steps:
o Expectation: Recompute “expected values”
o Maximization: Recompute params via MLEs
EM for Coin Example
 Start with a guess (initialization)
o Say, PA(H) = 0.6 and PB(H) = 0.5
 Compute expectations (E-step)
 First, from current PA(H) and PB(H) we find
5 H, 5 T  P(A) = .45, P(B) = .55
9 H, 1 T  P(A) = .80, P(B) = .20
8 H, 2 T  P(A) = .73, P(B) = .27
4 H, 6 T  P(A) = .35, P(B) = .65
7 H, 3 T  P(A) = .65, P(B) = .35
 Why? See next slide…
EM for Coin Example
 Suppose PA(H) = 0.6 and PB(H) = 0.5
o And in 10 flips of 1 coin, we find 8 H and 2 T
 Assuming coin A was flipped, we have
a = 0.6^8 × 0.4^2 = 0.0026874
 Assuming coin B was flipped, we have
b = 0.5^8 × 0.5^2 = 0.0009766
 Then by Bayes’ Formula
P(A) = a/(a + b) = 0.73 and P(B) = b/(a + b) = 0.27
E-step for Coin Example
 Assuming PA(H) = 0.6 and PB(H) = 0.5, we have
5 H, 5 T  P(A) = .45, P(B) = .55
9 H, 1 T  P(A) = .80, P(B) = .20
8 H, 2 T  P(A) = .73, P(B) = .27
4 H, 6 T  P(A) = .35, P(B) = .65
7 H, 3 T  P(A) = .65, P(B) = .35
 Next, compute expected (weighted) H and T
 For example, in 1st line
o For A we have 5 x .45 = 2.25 H and 2.25 T
o For B we have 5 x .55 = 2.75 H and 2.75 T
E-step for Coin Example
 Assuming PA(H) = 0.6 and PB(H) = 0.5, we have
5 H, 5 T  P(A) = .45, P(B) = .55
9 H, 1 T  P(A) = .80, P(B) = .20
8 H, 2 T  P(A) = .73, P(B) = .27
4 H, 6 T  P(A) = .35, P(B) = .65
7 H, 3 T  P(A) = .65, P(B) = .35
 Compute expected (weighted) H and T
 For example, in 2nd line
o For A, we have 9 x .8 = 7.2 H and 1 x .8 = .8 T
o For B, we have 9 x .2 = 1.8 H and 1 x .2 = .2 T
E-step for Coin Example
 Rounded to nearest 0.1:
5 H, 5 T  P(A) = .45, P(B) = .55  Coin A: 2.2 H, 2.2 T  Coin B: 2.8 H, 2.8 T
9 H, 1 T  P(A) = .80, P(B) = .20  Coin A: 7.2 H, 0.8 T  Coin B: 1.8 H, 0.2 T
8 H, 2 T  P(A) = .73, P(B) = .27  Coin A: 5.9 H, 1.5 T  Coin B: 2.1 H, 0.5 T
4 H, 6 T  P(A) = .35, P(B) = .65  Coin A: 1.4 H, 2.1 T  Coin B: 2.6 H, 3.9 T
7 H, 3 T  P(A) = .65, P(B) = .35  Coin A: 4.5 H, 1.9 T  Coin B: 2.5 H, 1.1 T
Expected totals: Coin A: 21.2 H, 8.5 T  Coin B: 11.8 H, 8.5 T
 This completes the E-step
 We computed these expected values based on current PA(H) and PB(H)
M-step for Coin Example
 M-step
 Re-estimate PA(H) and PB(H) using results from E-step:
PA(H) = 21.2/(21.2+8.5) ≈ 0.71
PB(H) = 11.8/(11.8+8.5) ≈ 0.58
 Next? E-step with these PA(H), PB(H)
o Then M-step, then E-step, then…
o …until convergence (or we get tired)
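The whole alternation described on the last few slides fits in a few lines; a sketch of the two-coin EM, assuming each coin is a priori equally likely to be chosen (variable names are my own):

```python
import numpy as np

# five experiments of 10 flips each: (heads, tails)
data = [(5, 5), (9, 1), (8, 2), (4, 6), (7, 3)]
pA, pB = 0.6, 0.5                     # initial guesses

for _ in range(20):
    hA = tA = hB = tB = 0.0
    for h, t in data:
        # E-step: responsibility of each coin for this experiment (Bayes' formula,
        # with the coins assumed equally likely a priori)
        a = pA**h * (1 - pA)**t
        b = pB**h * (1 - pB)**t
        wA, wB = a / (a + b), b / (a + b)
        hA += wA * h; tA += wA * t
        hB += wB * h; tB += wB * t
    # M-step: maximum likelihood re-estimates from the expected counts
    pA, pB = hA / (hA + tA), hB / (hB + tB)

print(round(pA, 2), round(pB, 2))     # roughly 0.80 and 0.52
```

The first pass through the loop reproduces the responsibilities (.45/.55, .80/.20, …) and the updated estimates (about 0.71 and 0.58) shown on the slides.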
EM for Clustering
 How is EM relevant to clustering?
 Can use EM to obtain parameters of K “hidden” distributions
o That is, means and variances, μi and σi^2
 Then, use μi as centers of clusters
o And σi (standard deviations) as “radii”
o Often use Gaussian (normal) distributions
 Is this better than K-means?
EM vs K-Means
 Whether it is better or not, EM is obviously different than K-means…
o …or is it?
 Actually, K-means is a special case of EM
o Using “distance” instead of “probabilities”
 In K-means, we re-assign points to centroids
o Like “E” in EM, which “re-shapes” clusters
 In K-means, we recompute centroids
o Like “M” of EM, where we recompute parameters
EM Algorithm
 Now, we give EM algorithm in general
o Eventually, consider a realistic example
 For simplicity, we assume data is a mixture from 2 distributions
o Easy to generalize to 3 or more
 Choose probability distribution type
o Like choosing distance function in K-means
 Then iterate EM to determine params
EM Algorithm
 Assume 2 distributions (2 clusters)
o Let θ1 be parameter(s) of 1st distribution
o Let θ2 be parameter(s) of 2nd distribution
 For binomial, θ1 = PA(H), θ2 = PB(H)
 For Gaussian distribution, θi = (μi,σi^2)
 Also, mixture parameters, τ = (τ1,τ2)
o Fraction of samples from distribution i is τi
o Since 2 distributions, we have τ1 + τ2 = 1
EM Algorithm
 Let f(xi,θj) be the probability function for distribution j
o For now, assuming 2 distributions
o And xi are “experiments” (data points)
 Make initial guess for parameters
o That is, θ = (θ1,θ2) and τ
 Let pji be probability of xi assuming distribution j
EM Algorithm
 Initial guess for parameters θ and τ
o If you know something, use it
o If not, “random” may be OK
o In any case, choose reasonable values
 Next, apply E-step, then M-step…
o …then E then M then E then M…
 So, what are E and M steps?
o Want to state these for the general case
E-Step
 Using Bayes’ Formula, we compute
pji = τj f(xi,θj) / (τ1 f(xi,θ1) + τ2 f(xi,θ2))
o Where j = 1,2 and i = 1,2,…,n
o Assuming n data points and 2 distributions
 Then pji is prob. that xi is in cluster j
o Or the “part” of xi that’s in cluster j
 Note p1i + p2i = 1 for i=1,2,…,n
M-Step
 Use probabilities from E-step to re-estimate parameters θ and τ
 Best estimate for τj given by
τj = Σi pji / (Σj Σi pji)
 This simplifies to
τj = (1/n) Σi pji
M Step
 Params θ and τ are functions of μi and σi^2
o Depends on specific distributions used
 Based on pji, we have…
 Means: μj = Σi pji xi / Σi pji
 Variances: σj^2 = Σi pji (xi − μj)^2 / Σi pji
EM Example: Binomial
 Mean of binomial is μ = Np
o Where p = P(H) and N trials per experiment
 Suppose N = 10, and 5 experiments:
x1: 8 H and 2 T
x2: 5 H and 5 T
x3: 9 H and 1 T
x4: 4 H and 6 T
x5: 7 H and 3 T
 Assuming 2 coins, determine PA(H), PB(H)
Binomial Example
 Let X = (x1,x2,x3,x4,x5) = (8,5,9,4,7)
o Have N = 10, and means are μ = Np
o We want to determine p for both coins
 Initial guesses for parameters:
o Probabilities PA(H) = 0.6 and PB(H) = 0.5
o That is, θ = (θ1,θ2) = (0.6,0.5)
o Mixture params τ1 = 0.7 and τ2 = 0.3
o That is, τ = (τ1,τ2) = (0.7,0.3)
Binomial Example: E Step
 Compute pji using current guesses for parameters θ and τ
 With θ = (0.6,0.5) and τ = (0.7,0.3), this gives
p1i ≈ (0.865, 0.655, 0.906, 0.559, 0.811) and p2i = 1 − p1i
Binomial Example: M Step
 Recompute τ using the new pji
 So, τ = (τ1,τ2) = (0.7593,0.2407)
Binomial Example: M Step
 Recompute θ = (θ1,θ2)
 First, compute means μ = (μ1,μ2)
 Obtain μ = (μ1,μ2) = (6.9180,5.5969)
 So, θ = (θ1,θ2) = (0.6918,0.5597)
Binomial Example: Convergence
 Next E-step:
o Compute new pji using the τ and θ
 Next M-step:
o Compute τ and θ using pji from E-step
 And so on…
 In this example, EM converges to
τ = (τ1,τ2) = (0.5228, 0.4772)
θ = (θ1,θ2) = (0.7934, 0.5139)
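A sketch of this binomial-mixture EM, including the mixture weights τ (variable names are mine; the binomial coefficients cancel in the E-step ratio and so are omitted):

```python
import numpy as np

x = np.array([8, 5, 9, 4, 7])          # heads in N = 10 flips per experiment
N = 10
theta = np.array([0.6, 0.5])           # initial P(H) for the two coins
tau = np.array([0.7, 0.3])             # initial mixture weights

for _ in range(200):
    # E-step: p[j, i] = prob that experiment i came from coin j
    like = np.array([t**x * (1 - t)**(N - x) for t in theta])   # shape (2, 5)
    p = tau[:, None] * like
    p /= p.sum(axis=0)
    # M-step: re-estimate mixture weights and per-coin P(H)
    tau = p.mean(axis=1)
    theta = (p @ x) / (N * p.sum(axis=1))

print(tau.round(4), theta.round(4))    # about (0.52, 0.48) and (0.79, 0.51), as on the slide
```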
Gaussian Mixture Example
 We’ll consider 2-d data
o That is, each data point is of form (x,y)
 Suppose we want 2 clusters
o That is, 2 distributions to determine
 We’ll assume Gaussian distributions
o Recall, Gaussian is normal distribution
o Since 2-d data, bivariate Gaussian
 Gaussian mixture problem
Data
 “Old Faithful” geyser, Yellowstone NP
Bivariate Gaussian
 Gaussian (normal) dist. of 1 variable
f(x) = 1/(σ sqrt(2π)) exp(−(x−μ)^2 / (2σ^2))
 Bivariate Gaussian distribution
f(x,y) = 1/(2π σx σy sqrt(1−ρ^2)) exp(−z / (2(1−ρ^2)))
o Where z = (x−μx)^2/σx^2 − 2ρ(x−μx)(y−μy)/(σxσy) + (y−μy)^2/σy^2
o And ρ = cov(x,y)/σxσy
Bivariate Gaussian: Matrix Form
 Bivariate Gaussian can be written as
f(x) = 1/(2π sqrt(det S)) exp(−(1/2)(x−μ)^T S^(−1) (x−μ))
 Where x = (x,y) and μ = (μx,μy)
 And S is the covariance matrix
S = [ σx^2, ρσxσy ; ρσxσy, σy^2 ]
 Also, det S = σx^2 σy^2 (1−ρ^2), and S^(−1) follows from the usual 2x2 inverse
Why Use Matrix Form?
 Generalizes to multivariate Gaussian
o Formulas for det(S) and S^(−1) change
o Can cluster data that’s more than 2-d
 Re-estimation formulas in E and M steps have the same form as before
o Simply replace scalars with vectors
 In matrix form, params of (bivariate) Gaussian: θ = (μ,S) where μ = (μx,μy)
Old Faithful Data
 Data is 2-d, so bivariate Gaussians
 Parameters of a bivariate Gaussian:
θ = (μ,S) where μ = (μx,μy)
 We want 2 clusters, so must determine
θ1 = (μ1,S1) and θ2 = (μ2,S2)
o Where Si are 2x2 and μi are 2x1
 Make initial guesses, then iterate EM
o What to use as initial guesses?
Old Faithful Example
 Initial guesses?
 Recall that S is 2x2 and symmetric
o Which implies that ρ = s12/sqrt(s11 s22)
o And must always have −1 ≤ ρ ≤ 1
 This imposes a restriction on S
 We’ll use mean and variance of x and y components when initializing
Old Faithful: Initialization
 Want 2 clusters
 Initialize 2 bivariate Gaussians
 For both Si, can verify that ρ = 0.5
 Also, initialize τ = (τ1,τ2) = (0.6, 0.4)
Old Faithful: E-Step
 First E-step yields…
Old Faithful: M-Step
 First M-step yields
τ = (τ1,τ2) = (0.5704, 0.4296)
 And updated means and covariance matrices
 Easy to verify that −1 ≤ ρ ≤ 1
o For both distributions
Old Faithful: Convergence
 After 100 iterations of EM
τ = (τ1,τ2) = (0.5001, 0.4999)
 With converged means μ1, μ2
 And covariance matrices S1, S2
 Again, we have −1 ≤ ρ ≤ 1
Old Faithful Clusters
 Centroids are means: μ1, μ2
 Shading is standard devs
o Darker area is within one σ
o Lighter is two standard dev’s
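In practice this fit would typically not be hand-coded; for example, scikit-learn's GaussianMixture runs the same kind of EM procedure. A sketch, where the file name old_faithful.csv is hypothetical and the data is assumed to be an (m, 2) array of (duration, wait) pairs:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# X assumed to be an (m, 2) array; "old_faithful.csv" is a hypothetical local file
X = np.loadtxt("old_faithful.csv", delimiter=",")

gmm = GaussianMixture(n_components=2, covariance_type="full", random_state=0)
labels = gmm.fit_predict(X)       # cluster assignment = highest-probability component

print(gmm.weights_)       # mixture parameters tau
print(gmm.means_)         # cluster centers mu_1, mu_2
print(gmm.covariances_)   # covariance matrices S_1, S_2
```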
EM Clustering
 Clusters based on probabilities
o Each data point has a probability related to each cluster
o Point is assigned to the cluster that gives highest probability
 In K-means, we use distance, not probability
 But can view probability as a “distance”
 So, K-means and EM are not so different…
Conclusion
 Clustering is fun, entertaining, very useful
o Can explore mysterious data, and more…
 And K-means is really simple
o EM is powerful and not that difficult either
 Measuring success is not so easy
o “Good” clusters? And useful information?
o Or just random noise? Anything can be clustered
 Clustering is often just a starting point
o Helps us decide if any “there” is there
References: K-Means
 A.W. Moore, K-means and hierarchical clustering
 P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining, Addison-Wesley, 2006, Chapter 8, Cluster analysis: Basic concepts and algorithms
 R. Jin, Cluster validation
 M.J. Norusis, IBM SPSS Statistics 19 Statistical Procedures Companion, Chapter 17, Cluster analysis
References: EM Clustering
 C.B. Do and S. Batzoglou, What is the expectation maximization algorithm?, Nature Biotechnology, 26(8):897-899, 2008
 J.A. Bilmes, A gentle tutorial of the EM algorithm and its application to parameter estimation for Gaussian mixture and hidden Markov models, ICSI Report TR-97-021, 1998