4. Cluster Analysis

Advanced Course in Statistics. Lecturer: Amparo Baíllo

• Hierarchical clustering
• k-means clustering
• Classification maximum likelihood (model-based clustering)
• How many clusters?

The basic objective in cluster analysis is to discover natural homogeneous groupings (clusters) in the observed vectors x1, ..., xn. There is no dependent random variable: none of the components of the observed vector X is treated differently from the others. No assumptions are made on the number of groups. The notion of a cluster cannot be precisely defined, which is why there are many clustering models and algorithms. Grouping is done on the basis of similarities or distances (dissimilarities) between the observations.

Data clustering has been used for three main purposes:

• Underlying structure: to gain insight into the data, generate hypotheses, detect anomalies, and identify salient features.
• Natural classification: to identify the degree of similarity among individuals.
• Compression: as a method for summarizing the data through cluster prototypes.

Cluster analysis is also called unsupervised classification or unsupervised pattern recognition, to distinguish it from discriminant analysis (supervised classification). A good review of the historical evolution of clustering procedures and of current perspectives is Jain (2010).

[Taxonomy of clustering approaches from the original slides: Clustering divides into Hierarchical (agglomerative, divisive) and Partitional methods; the latter include centroid-based (k-means), model-based, Bayesian (decision based, nonparametric), graph-theoretic and spectral procedures.]

4.1. Hierarchical clustering

Hierarchical clustering seeks to build a hierarchy of clusters: it generates a nested sequence of clusters, in which the higher levels of the hierarchy contain the lower ones. The nesting can be increasing or decreasing.

Agglomerative hierarchical methods start with the individual data points as singleton clusters (initially there are as many clusters as objects). The most similar elements are grouped first, and afterwards the nearest or most similar of these groups are merged. This is repeated until, eventually, all the subgroups are merged into a single cluster. It is a "bottom up" approach: each observation starts in its own cluster, and pairs of clusters are merged as one moves up the hierarchy.

Divisive hierarchical methods start from the single, global group containing all the data points, which is divided into two subgroups such that the objects in one subgroup are far from the objects in the other. These subgroups are then further split into dissimilar subgroups, and the process is repeated until each individual in the sample forms its own group. It is a "top down" approach: all observations start in one cluster, and splits are performed as one moves down the hierarchy. Divisive methods are often more computationally demanding than agglomerative ones because, in principle, at each stage all possible ways of dividing a cluster into two must be considered. For a description of divisive hierarchical methods, see Everitt et al. (2011) and Kaufman and Rousseeuw (2005). It is conventional to limit all merges to two input clusters and all splits to two output clusters.
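Both strategies are readily available in R: hclust() (agglomerative, base R) and, in the cluster package, diana() (divisive). A minimal sketch; the simulated two-group data below are only an illustration, not part of the course example.

# Hedged sketch: agglomerative vs divisive hierarchical clustering in R.
# Requires the 'cluster' package; the simulated data are illustrative only.
library(cluster)
set.seed(1)
X <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
           matrix(rnorm(40, mean = 3), ncol = 2))
D <- dist(X)                                # pairwise Euclidean distances
hc.agg <- hclust(D, method = "complete")    # agglomerative ("bottom up")
hc.div <- diana(D)                          # divisive ("top down", DIANA)
par(mfrow = c(1, 2))
plot(hc.agg, main = "Agglomerative")
pltree(hc.div, main = "Divisive (DIANA)")   # dendrogram of the diana object

Both dendrograms recover the two simulated groups here; the point of the sketch is only to show the two construction directions on the same dissimilarity matrix.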
Hierarchical clustering methods work for many types of variables (nominal, ordinal, quantitative, ...). The original observations are not needed: only the matrix of pairwise dissimilarities.

In order to decide which clusters should be combined (for agglomerative methods), or where a cluster should be split (for divisive ones), a measure of dissimilarity between sets of observations is required. The choice of a similarity measure is subjective and depends, for instance, on the nature of the variables. For numerical variables, we can consider the Minkowski metric

$$ d(\mathbf{x}, \mathbf{y}) = \Big( \sum_{j=1}^{p} |x_j - y_j|^m \Big)^{1/m}. $$

For m = 1, d(x, y) is the city-block or Manhattan metric. For m = 2, d is the Euclidean metric, the most usual in clustering. Once the dissimilarity between individual observations has been chosen, we must define the dissimilarity between groups of observations.

The linkage criterion determines the distance between sets of observations as a function of the pairwise distances between individual objects. We list three well-known linkage criteria:

• Single linkage: the distance between two groups is the distance between their nearest members (minimum distance or nearest neighbour clustering).
• Complete linkage: the distance between two groups is the distance between their farthest members (maximum distance or farthest neighbour clustering).
• Average linkage: the distance between two groups is the average distance between their members (mean or average distance clustering).

[Illustration from the original slides: for two clusters U = {x1, x2} and V = {x3, x4, x5}, the single-linkage distance was d24, the complete-linkage distance was d15, and the average-linkage distance was (d13 + d14 + d15 + d23 + d24 + d25)/6.]

Agglomerative hierarchical clustering algorithm

1. We have n observations x1, ..., xn. Start with n clusters, each containing one observation, and an n × n matrix D of distances or similarities.
2. Search the matrix D for the nearest pair of clusters (say, U and V).
3. Merge clusters U and V into a new cluster UV. Update the entries in the distance matrix by deleting the rows and columns corresponding to clusters U and V and adding a row and column giving the distances between cluster UV and the remaining clusters.
4. Repeat Steps 2 and 3 until all the observations are in a single cluster. Record the identity of the merged clusters and the distances or similarities at which the mergers take place.

The results of any hierarchical method are displayed in a dendrogram, a tree-like diagram whose branches represent clusters. The position of each node along the distance axis measures the distance at which the corresponding merger takes place.
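A quick way to see the effect of the linkage criterion is to run the three criteria above on the same dissimilarity matrix; a brief sketch (the simulated data are only an illustration, not the course example):

# Hedged sketch: single, complete and average linkage on one dissimilarity matrix.
set.seed(2)
X <- matrix(rnorm(60), ncol = 2)
D <- dist(X)                               # Euclidean metric (Minkowski with m = 2)
par(mfrow = c(1, 3))
for (link in c("single", "complete", "average")) {
  plot(hclust(D, method = link), main = link, xlab = "", sub = "")
}

Single linkage tends to produce "chained" clusters, while complete and average linkage tend to produce more compact groups; the dendrograms make this visible even on noise.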
Example 4.1. Data on protein consumption in twenty-five European countries for nine food groups (Weber 1973).

Table = read.table('ProteinConsumptionEurope.txt', header=TRUE)
c = ncol(Table)
Data = Table[,2:c]    # the first column contains the country names
D = dist(Data)

[The resulting 25 × 25 matrix D of pairwise Euclidean distances between countries is displayed on the original slide; output omitted here.]

hc = hclust(D, method='complete')
plot(hc)

[Figure: cluster dendrogram of the raw data with complete linkage, hclust(*, "complete"); leaves labelled by observation number.]

hc = hclust(D, method='complete')
plot(hc, labels=Table[,1])

[Figure: the same dendrogram with leaves labelled by country name.]

Some results are surprising. We standardize the data and carry out the analysis again.

n = nrow(Data)
S = var(Data)
m = colMeans(Data)
mMat = matrix(rep(m, times=n), nrow=n, byrow=TRUE)
MinusSqrtD = diag(diag(S)^(-0.5))
DataStandar = as.matrix(Data - mMat) %*% MinusSqrtD
DStandar = dist(DataStandar)
hcStandar = hclust(DStandar, method='complete')
plot(hcStandar, labels=Table[,1])

[Figure: cluster dendrogram of the standardized data, complete linkage, with country labels.]

It is difficult to give a clear-cut rule for choosing the "best" number of clusters, k, whatever clustering method is used. Hence, the analyst has to interpret the output of the clustering procedure, and several solutions may be equally interesting. One possibility is to place a threshold where the clusters become too far apart, i.e. where there is a big jump between two adjacent merge levels of the dendrogram.
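Once a number of clusters (or a height threshold) has been chosen, the hierarchy can be cut into an explicit partition with cutree(); a brief sketch, assuming the hcStandar object and the Table data frame from the example above:

# Hedged sketch: extracting a concrete partition from the dendrogram above.
groups <- cutree(hcStandar, k = 5)         # or cut at a height, e.g. cutree(hcStandar, h = 4)
table(groups)                              # cluster sizes
split(as.character(Table[,1]), groups)     # countries belonging to each cluster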
If n denotes the sample size, a simple rule of thumb (see Mardia et al. 1989) is k ≈ √(n/2).

In the "elbow method" we compute the within-cluster sum of squares for k clusters,

$$ \mathrm{WCSS}(k) = \sum_{j=1}^{k} \; \sum_{x_i \in \text{cluster } j} \| x_i - \bar{x}_j \|^2 , $$

where x̄_j is the sample mean in cluster j. We plot WCSS(k) against the number k of clusters and choose k at the "elbow" of the graph.

## determine a "good" k using the elbow method
library("GMD")
par(mar = c(5, 10, 3, 1) + 0.1)
css.obj <- css.hclust(D, hc)
plot(css.obj$k, css.obj$tss - css.obj$totbss, pch=19,
     ylab="WCSS(k)", xlab="number of clusters k", cex.lab=1.5)

[Figure: WCSS(k) against the number of clusters k for the protein data, showing an elbow at a small number of clusters.]

4.2. k-means Clustering

In contrast to hierarchical techniques, which give a nested sequence of clusterings, usually based on a dissimilarity, partitional clustering procedures start with an initial clustering (a partition of the data) and refine it iteratively, typically until the final clustering satisfies an optimality criterion. In general, the clusterings in the sequence are not nested, so no dendrogram is possible. Hierarchical methods are based on a dissimilarity, while partitional procedures are usually based on an objective function.

Objective functions are of two types. They can serve to maximize the homogeneity of the points in each cluster (internal criteria); in this case the objective function measures the within-groups spread. Or they can focus on how different the clusters are from each other, in which case the objective function measures the between-groups variability.

4.2.1 Introduction

k-means clustering is a nonhierarchical, centroid-based clustering technique designed to group the n individuals in the sample into k clusters. Compared to hierarchical techniques, nonhierarchical ones can be applied to larger datasets, because a matrix of pairwise distances does not have to be computed and stored.

Let us assume that the number of clusters, k (≤ n), is specified in advance. Given the sample x1, ..., xn of vectors taking values in R^p, k-means clustering aims to partition this collection of observations into k groups, G = {G1, ..., Gk}, so as to minimize (over all possible G's) the within-cluster sum of squares

$$ \mathrm{WCSS} = \sum_{j=1}^{k} \sum_{x_i \in G_j} \| x_i - \bar{x}_j \|^2 , $$

where $\bar{x}_j := \sum_{x_i \in G_j} x_i / \mathrm{card}(G_j)$ is the sample mean in G_j and ‖·‖ is the Euclidean norm.

Since $\bar{x} = \arg\min_{a \in \mathbb{R}^p} \sum_{i=1}^{n} \| x_i - a \|^2$, minimizing WCSS is equivalent to minimizing

$$ \frac{1}{n} \sum_{i=1}^{n} \min_{1 \le j \le k} \| x_i - a_j \|^2 \qquad (1) $$

over all possible collections {a1, ..., ak} in R^p, and then assigning each x_i, i = 1, ..., n, to its nearest cluster center. The sample means x̄1, ..., x̄k of the resulting groups G1, ..., Gk are precisely the arguments a1, ..., ak minimizing (1).

Remark: For each j = 1, ..., k, the set of points of R^p closer to x̄_j than to any other cluster center is the Voronoi cell of x̄_j.

Remark: Let Fn denote the empirical probability distribution on R^p corresponding to the sample. The problem of minimizing (1) can be restated as that of minimizing

$$ \Phi(A, F_n) := \int \min_{a \in A} \| x - a \|^2 \, dF_n(x) \qquad (2) $$

over all choices of the set A ⊂ R^p containing k (or fewer) points.
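The criterion (1) is, up to the factor 1/n, the quantity that R's kmeans() function (introduced in Section 4.2.3 below) returns as tot.withinss. A brief check on the standardized protein data of Example 4.1; this is a sketch assuming DataStandar is available, and the choice of 3 clusters is arbitrary:

# Hedged sketch: the k-means criterion computed "by hand" coincides with
# the tot.withinss component returned by kmeans().
fit3 <- kmeans(DataStandar, centers = 3, nstart = 10)
wcss <- sum(sapply(1:3, function(j) {
  Xj <- DataStandar[fit3$cluster == j, , drop = FALSE]
  sum(sweep(Xj, 2, fit3$centers[j, ])^2)   # sum of ||x_i - centre_j||^2 within cluster j
}))
c(by_hand = wcss, kmeans = fit3$tot.withinss)   # the two numbers agree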
4.2.2 Asymptotic behaviour of k-means clustering

For any probability measure Q on R^p and any finite subset A of R^p define

$$ \Phi(A, Q) := \int \min_{a \in A} \| x - a \|^2 \, dQ(x) $$

and

$$ m_k(Q) := \inf\{ \Phi(A, Q) : A \text{ contains } k \text{ or fewer points} \}. $$

For a given k, the set $A_n = A_n(k)$ of optimal sample cluster centers satisfies $\Phi(A_n, F_n) = m_k(F_n)$, and the optimal population cluster centers $\bar{A} = \bar{A}(k)$ satisfy $\Phi(\bar{A}, F) = m_k(F)$, where F is the probability distribution of X.

If $E\|X\|^2 = \int \|x\|^2 \, dF(x) < \infty$, then Φ(A, F) < ∞ for each finite A, since $\int \|x - a\|^2 \, dF(x) < \infty$ for each a ∈ R^p. Hence these minimization problems make sense.

We define the Hausdorff distance between two sets A and B in R^p as

$$ d_H(A, B) = \inf\{ \varepsilon > 0 : A \subset B^{\varepsilon}, \; B \subset A^{\varepsilon} \}, \quad \text{where } A^{\varepsilon} := \{ x \in \mathbb{R}^p : \| x - a \| < \varepsilon \text{ for some } a \in A \}. $$

Theorem 4.1 (Consistency of k-means, Pollard 1981): Suppose that $E\|X\|^2 < \infty$ and that, for each j = 1, ..., k, there is a unique set $\bar{A}(j)$ for which $\Phi(\bar{A}(j), F) = m_j(F)$. Then

$$ d_H(A_n, \bar{A}(k)) \to 0 \quad \text{a.s. and} \quad \Phi(A_n, F_n) \to m_k(F) \quad \text{a.s. as } n \to \infty. $$

The uniqueness condition on the $\bar{A}(j)$ implies that $m_1(F) > m_2(F) > \ldots > m_k(F)$ and that $\bar{A}(j)$ contains exactly j distinct points.

Consistency of k-means in dimension p = 1

Let $\mathcal{F}$ be a class of measurable functions defined on R, equipped with the Borel σ-algebra, on which a probability measure F is defined. Assume that, for each $g \in \mathcal{F}$,

$$ F|g| := E_F |g(X)| = \int |g(x)| \, dF(x) < \infty. $$

Let X1, ..., Xn be a sample of i.i.d. random variables from F. The empirical (probability) measure Fn on R assigns weight 1/n to each Xi in the sample, i = 1, ..., n, so that

$$ F_n g := E_{F_n}(g(X)) = \int g(x) \, dF_n(x) = \frac{1}{n} \sum_{i=1}^{n} g(X_i). $$

The strong law of large numbers (SLLN) then states that, for each fixed $g \in \mathcal{F}$,

$$ F_n g \xrightarrow[n \to \infty]{} F g \quad \text{almost surely (a.s.).} $$

We have the following one-sided uniformity result.

Theorem II.3 in Pollard (1984): Suppose that, for each ε > 0, there exists a finite class $\mathcal{F}_{\varepsilon}$ of functions such that, for each $g \in \mathcal{F}$, there exists a $g_{\varepsilon} \in \mathcal{F}_{\varepsilon}$ satisfying $g_{\varepsilon} \le g$ and $F g - F g_{\varepsilon} \le \varepsilon$. Then

$$ \liminf_{n \to \infty} \, \inf_{g \in \mathcal{F}} \, (F_n g - F g) \ge 0 \quad \text{almost surely.} $$

For simplicity, assume that we want to partition the sample X1, ..., Xn into k = 2 groups. The k-means method minimizes the within-cluster sum of squares, which in this case is equivalent to minimizing

$$ \frac{1}{n} \sum_{i=1}^{n} \min\{ (X_i - a)^2, (X_i - b)^2 \} \qquad (3) $$

over all (a, b) ∈ R². If we denote $g_{a,b}(x) := \min\{ (x - a)^2, (x - b)^2 \}$, then minimizing (3) is equivalent to minimizing, over all (a, b) ∈ R²,

$$ W(a, b, F_n) := F_n g_{a,b}. \qquad (4) $$

By the SLLN, for each fixed (a, b),

$$ W(a, b, F_n) = F_n g_{a,b} \xrightarrow[n \to \infty]{} W(a, b, F) = F g_{a,b} \quad \text{a.s.} $$

This suggests that the pair $(a_n, b_n)$ minimizing W(·, ·, Fn) might converge to the pair $(a^*, b^*)$ minimizing W(·, ·, F). Let us adopt the convention that a ≤ b in any pair (a, b), so that $(b^*, a^*)$ is not counted as a distinct minimizing pair of W(·, ·, F).

Theorem (Example II.4 of Pollard 1984): Assume that $F x^2 < \infty$ and that W(·, ·, F) achieves its unique minimum at $(a^*, b^*)$. Then

$$ (a_n, b_n) \xrightarrow[n \to \infty]{} (a^*, b^*) \quad \text{a.s.} $$

Sketch of the proof:

• There exists a sufficiently large constant M > 0 such that $(a^*, b^*)$ belongs to the set $C := ([-M, M] \times \mathbb{R}) \cup (\mathbb{R} \times [-M, M])$ and, for n large enough, $(a_n, b_n) \in C$ a.s. too. From now on we assume that this is the case.
• Construction of the finite approximating class $\mathcal{F}_{\varepsilon}$: First note that, for (a, b) ∈ C,

$$ g_{a,b}(x) \le g_M(x) := (x - M)^2 + (x + M)^2 $$

and that $F g_M < \infty$. Then we can find a constant D > M > 0 such that $F(g_M 1_{[-D,D]^c}) < \varepsilon$. Thus we only have to worry about the approximation of $g_{a,b}(x)$ for x ∈ [−D, D]. We may also assume that both a and b lie in the interval [−3D, 3D]. Let $C_{\varepsilon}$ be a finite subset of [−3D, 3D]² such that each (a, b) in that square has an (a′, b′) ∈ $C_{\varepsilon}$ with |a − a′| < ε/D and |b − b′| < ε/D. Then, for each x ∈ [−D, D], $|g_{a,b}(x) - g_{a',b'}(x)| \le 16\varepsilon$. We take the class

$$ \mathcal{F}_{\varepsilon} := \big\{ (g_{a',b'} - 16\varepsilon)\, 1_{\{|x| \le D\}} : (a', b') \in C_{\varepsilon} \big\}. $$

Then Theorem II.3 in Pollard (1984) implies that

$$ \liminf_{n \to \infty} \inf_{(a,b) \in C} (F_n g_{a,b} - F g_{a,b}) \ge 0 \quad \text{a.s.,} $$

and thus

$$ \liminf_{n \to \infty} \big( W(a_n, b_n, F_n) - W(a_n, b_n, F) \big) \ge 0 \quad \text{a.s.} $$

• Now observe that this uniformity result allows us to transfer the optimality of $(a_n, b_n)$ for Fn to asymptotic optimality for F:

$$ W(a_n, b_n, F_n) \le W(a^*, b^*, F_n) \quad \text{because } (a_n, b_n) \text{ is optimal for } F_n, $$
$$ W(a^*, b^*, F_n) \to W(a^*, b^*, F) \quad \text{a.s. by the SLLN,} $$
$$ W(a^*, b^*, F) \le W(a_n, b_n, F) \quad \text{because } (a^*, b^*) \text{ is optimal for } F. $$

This implies that

$$ W(a_n, b_n, F) \xrightarrow[n \to \infty]{} W(a^*, b^*, F) \quad \text{a.s.} $$

Since $(a^*, b^*)$ is the unique minimizer of W(·, ·, F), it follows that $(a_n, b_n)$ converges a.s. to $(a^*, b^*)$.

4.2.3 k-means algorithm

The k-means clustering procedure is also used in quantization. Quantization is the procedure of constraining a continuous, large set of values (such as the real numbers) to a relatively small discrete set (such as the integers). In our context, a k-level quantizer is a map q : R^p → {a1, ..., ak}, where the a_j ∈ R^p, j = 1, ..., k, are called the prototype vectors. Such a map can be used to convert a p-dimensional input signal X into a code q(X) that can take on at most k different values (a fixed-length code). This technique was originally used in signal processing for data compression. An optimal k-level quantizer for a probability distribution F on R^p minimizes the distortion, as measured by the mean squared error

$$ E\|X - q(X)\|^2 = \int \| x - q(x) \|^2 \, dF(x). $$

Thus, searching for an optimal k-level quantizer is equivalent to the k-means problem. For more information on quantization see, e.g., Gersho and Gray (1992) or Sayood (2005).

Finding the minimum of (1) is computationally difficult, but several fast algorithms have been proposed that approximate the optimum. The most popular algorithm was proposed by Lloyd (1982) in the quantization context and is known as the k-means or Lloyd algorithm. The name "k-means" was coined by MacQueen (1967). A history of the k-means algorithm can be found in Bock (2007). The k-means algorithm employs an iterative refinement similar to that of the EM algorithm for mixtures of distributions.

k-means algorithm

Take k initial centers $a_1^{(1)}, \ldots, a_k^{(1)} \in \mathbb{R}^p$. For instance, the Forgy method randomly chooses k observations from the sample and uses them as the initial centers.

Assignment or expectation step: Cluster the data points x1, ..., xn around the centers $a_1^{(\ell)}, \ldots, a_k^{(\ell)}$ into k sets such that the j-th set $G_j^{(\ell)}$ contains the x_i's that are closer to $a_j^{(\ell)}$ than to any other center. This means assigning the observations to the Voronoi cells determined by $a_1^{(\ell)}, \ldots, a_k^{(\ell)}$.

Update or maximization step: Determine the new centers $a_1^{(\ell+1)}, \ldots, a_k^{(\ell+1)}$ as the averages of the data within the clusters:

$$ a_j^{(\ell+1)} = \frac{1}{\mathrm{card}(G_j^{(\ell)})} \sum_{x_i \in G_j^{(\ell)}} x_i . $$

The two steps are iterated until there are no changes in the cluster centers. No iteration increases the empirical squared Euclidean distortion.
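A minimal R sketch of this assignment/update iteration, kept deliberately simple (Forgy-style initialization, no safeguard against empty clusters); it illustrates the scheme above and is not a replacement for R's optimized kmeans() discussed next:

# Hedged sketch of Lloyd's algorithm: alternate assignment and update steps
# until the centers stop changing. Illustration only.
lloyd <- function(X, k, max.iter = 100) {
  X <- as.matrix(X)
  centers <- X[sample(nrow(X), k), , drop = FALSE]      # Forgy initialization
  for (iter in 1:max.iter) {
    # Assignment step: each point goes to its nearest center (its Voronoi cell)
    d2 <- sapply(1:k, function(j) colSums((t(X) - centers[j, ])^2))
    cluster <- max.col(-d2)                              # index of the smallest squared distance
    # Update step: new centers are the within-cluster averages
    new.centers <- t(sapply(1:k, function(j) colMeans(X[cluster == j, , drop = FALSE])))
    if (all(abs(new.centers - centers) < 1e-10)) break   # no change: stop
    centers <- new.centers
  }
  list(cluster = cluster, centers = centers)
}

For example, lloyd(DataStandar, k = 5) produces the kind of partition obtained with kmeans() below, up to the randomness of the initialization.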
k-means with R:

kmeans(x, centers, iter.max = 10, nstart = 1,
       algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"))

The algorithm of Hartigan and Wong (1979) is used by default: it generally does a better job than those of Lloyd (1982), Forgy (1965) or MacQueen (1967). Observe that there is no guarantee that the k-means algorithm converges to the global optimum, and the result (a local optimum) may depend on the initial clusters. As the algorithm is usually very fast, trying several random starts (nstart > 1) is often recommended.

Example 4.1 (protein consumption in Europe): A plot of the within-groups sum of squares against the number of clusters extracted can help determine the appropriate number of clusters. We look for a bend in the plot, similar to a scree plot in PCA or factor analysis.

n = nrow(Data)   # Sample size
wss = (n-1)*sum(apply(DataStandar, 2, var))
for (i in 2:15) {
  wss[i] = sum(kmeans(DataStandar, centers=i)$withinss)
}
plot(1:15, wss, type="b", xlab="Number of Clusters",
     ylab="Within groups sum of squares")

[Figure: within-groups sum of squares against the number of clusters for the standardized protein data.]

fit = kmeans(DataStandar, centers = 5, nstart = 10)
aggregate(DataStandar, by=list(fit$cluster), FUN=mean)

  Group.1        V1        V2        V3        V4        V5        V6        V7        V8        V9
1       1  0.006572 -0.229015  0.191478  1.345874  1.158254 -0.872272  0.167678 -0.955339 -1.114804
2       2 -0.570049  0.580387 -0.085897 -0.460493 -0.453779  0.318183  0.785760 -0.267918  0.068739
3       3 -0.807569 -0.871935 -1.553305 -1.078332 -1.038637  1.720033 -1.423426  0.996131 -0.643604
4       4 -0.508801 -1.108800 -0.412484 -0.832041  0.981915  0.130025 -0.184201  1.310884  1.629244
5       5  1.011180  0.742133  0.940841  0.570058 -0.267153 -0.687758  0.228874 -0.508389  0.021619

NewTable = data.frame(Table, fit$cluster)
head(NewTable)

         Country RedMeat WhiteMeat Eggs Milk Fish Cereals Starch Nuts FrVeg fit.cluster
1        Albania    10.1       1.4  0.5  8.9  0.2    42.3    0.6  5.5   1.7           5
2        Austria     8.9      14.0  4.3 19.9  2.1    28.0    3.6  1.3   4.3           2
3        Belgium    13.5       9.3  4.1 17.5  4.5    26.6    5.7  2.1   4.0           2
4       Bulgaria     7.8       6.0  1.6  8.3  1.2    56.7    1.1  3.7   4.2           5
5 Czechoslovakia     9.7      11.4  2.8 12.5  2.0    34.3    5.0  1.1   4.0           3
6        Denmark    10.6      10.8  3.7 25.0  9.9    21.9    4.8  0.7   2.4           1

o = order(fit$cluster)
data.frame(NewTable$Country[o], NewTable$fit.cluster[o])

   NewTable.Country.o. NewTable.fit.cluster.o.
1              Denmark                       1
2              Finland                       1
3               Norway                       1
4               Sweden                       1
5       Czechoslovakia                       3
6             EGermany                       3
7              Hungary                       3
8               Poland                       3
9                 USSR                       3
10             Albania                       5
11            Bulgaria                       5
12             Romania                       5
13          Yugoslavia                       5
14              Greece                       4
15               Italy                       4
16            Portugal                       4
17               Spain                       4
18             Austria                       2
19             Belgium                       2
20              France                       2
21             Ireland                       2
22         Netherlands                       2
23         Switzerland                       2
24                  UK                       2
25            WGermany                       2
4.2.4 Generalizations and modifications of the k-means algorithm

• Clustering with non-Euclidean divergences (see, e.g., Banerjee et al. 2005). Instead of using the squared Euclidean norm in the objective function

$$ \Phi(A, Q) = \int \min_{a \in A} \| x - a \|^2 \, dQ(x), $$

we can use a more general distortion function φ, continuous, non-decreasing and satisfying certain regularity properties. For instance, k-medians clustering is obtained by replacing the squared Euclidean norm with $\|x\|_1 = |x_1| + \ldots + |x_p|$, the norm associated with the Manhattan metric.

• Further modifications of k-means can be found in Jain (2010).

4.3. Classification Maximum Likelihood (ML)

Here we regard the clusters as representing local modes of the underlying distribution, which is then modeled as a mixture of components. One way to represent the uncertainty about whether a point belongs to a cluster is to back off from determining the absolute membership of each x_i in a cluster and ask only for a degree of membership (soft assignment to the clusters). Then we say that x_i is in cluster G_j with probability π_ij, with $\sum_{j=1}^{k} \pi_{ij} = 1$. Hard clustering can be recovered by assigning x_i to the cluster $G_{j^*}$ with $j^* = \arg\max_j \pi_{ij}$.

The idea of soft membership can be formalized through a parametric mixture model

$$ f(x; \psi) = \sum_{j=1}^{k} \pi_j f_j(x; \theta_j), \qquad (5) $$

where $\psi = (\pi_1, \ldots, \pi_{k-1}, \xi')'$ and ξ is the vector containing all the parameters in θ1, ..., θk known a priori to be distinct.

Denote by $z_i = (z_{i1}, \ldots, z_{ik})'$ a k-dimensional vector with $z_{ij} = 1$ or 0 according to whether x_i did or did not arise from the j-th component of the mixture, i = 1, ..., n, j = 1, ..., k. The complete data sample is $x^c_1, \ldots, x^c_n$, where $x^c_i = (x_i, z_i)$. The complete-data log-likelihood for ψ is

$$ \log L_c(\psi) = \sum_{i=1}^{n} \sum_{j=1}^{k} z_{ij} \big( \log \pi_j + \log f_j(x_i; \theta_j) \big). $$

In the classification likelihood approach the parameter vector ψ and the unknown component indicator vectors z1, ..., zn are chosen to maximize $L_c(\psi)$. That is, the vectors z1, ..., zn are treated as parameters to be estimated along with ψ. Classification ML with Gaussian components is also called Gaussian Mixture Model (GMM) classification. A more detailed account can be found in McLachlan and Basford (1988).

Unless n is small, however, searching over all possible combinations of values of the z_i's is computationally prohibitive. McLachlan (1982) proved that a solution corresponding to a local maximum can be computed iteratively via the so-called classification EM algorithm, in which a classification step is inserted between the E-step and the M-step of the original EM algorithm. The algorithm starts with an initial guess ψ^(0) for ψ. Let ψ^(g) be the approximate value of ψ after the g-th iteration.

E-step: Compute the posterior probability that the i-th member of the sample, X_i, belongs to the j-th component of the mixture:

$$ \tau_{ij}^{(g)} := \frac{ \pi_j^{(g)} f_j(x_i; \theta_j^{(g)}) }{ \sum_{l=1}^{k} \pi_l^{(g)} f_l(x_i; \theta_l^{(g)}) } . $$

Classification step: Set $z_i^{(g)} = (z_{i1}^{(g)}, \ldots, z_{ik}^{(g)})'$ with

$$ z_{ij}^{(g)} = \begin{cases} 1 & \text{if } j = \arg\max_{l=1,\ldots,k} \tau_{il}^{(g)}, \\ 0 & \text{otherwise.} \end{cases} $$

M-step for Gaussian mixtures:

$$ \pi_j^{(g+1)} = \frac{1}{n} \sum_{i=1}^{n} z_{ij}^{(g)}, \qquad \mu_j^{(g+1)} = \frac{ \sum_{i=1}^{n} z_{ij}^{(g)} x_i }{ \sum_{i=1}^{n} z_{ij}^{(g)} } $$

and

$$ \Sigma_j^{(g+1)} = \frac{ \sum_{i=1}^{n} z_{ij}^{(g)} \, (x_i - \mu_j^{(g+1)})(x_i - \mu_j^{(g+1)})' }{ \sum_{i=1}^{n} z_{ij}^{(g)} } . $$

The three steps above are repeated until the difference $L_c(\psi^{(g+1)}) - L_c(\psi^{(g)})$ is small enough.
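A minimal R sketch of one possible implementation of this classification EM (CEM) cycle for Gaussian components. It is illustrative only, not the course's reference implementation: it is initialized from a k-means hard assignment instead of an initial ψ^(0), has no safeguard against empty or degenerate components, and uses mvtnorm::dmvnorm for the component densities; the function name cem_gmm is made up here.

# Hedged sketch of the classification EM (CEM) cycle for a Gaussian mixture.
# Illustration only: k-means initialization, no protection against empty
# clusters or singular covariance matrices.
library(mvtnorm)   # for dmvnorm()

cem_gmm <- function(X, k, max.iter = 100, tol = 1e-8) {
  X <- as.matrix(X); n <- nrow(X)
  z <- kmeans(X, centers = k, nstart = 5)$cluster   # initial hard classification
  loglik.old <- -Inf
  for (g in 1:max.iter) {
    # M-step: proportions, means and (MLE) covariances from the current classification
    pi.hat <- tabulate(z, k) / n
    mu     <- t(sapply(1:k, function(j) colMeans(X[z == j, , drop = FALSE])))
    Sigma  <- lapply(1:k, function(j) {
      Xj <- X[z == j, , drop = FALSE]
      crossprod(sweep(Xj, 2, mu[j, ])) / nrow(Xj)
    })
    # E-step: (unnormalized) posterior probabilities tau_ij
    dens <- sapply(1:k, function(j) pi.hat[j] * dmvnorm(X, mu[j, ], Sigma[[j]]))
    # Classification step: hard assignment to the most probable component
    z <- max.col(dens)
    # complete-data log-likelihood, used for the stopping rule
    loglik <- sum(log(dens[cbind(1:n, z)]))
    if (abs(loglik - loglik.old) < tol) break
    loglik.old <- loglik
  }
  list(cluster = z, pi = pi.hat, mean = mu, Sigma = Sigma, loglik = loglik)
}

# e.g. cem_gmm(faithful, k = 2)$cluster  (the built-in bivariate faithful data)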
Celeux and Govaert (1993) compared the classification EM algorithm with the original EM algorithm and found that the classification approach tends to perform better for small samples, whereas the mixture likelihood approach is preferable for large samples (in this case, the classification likelihood approach usually yields inconsistent parameter estimates).

The classification EM procedure can be shown to be equivalent to some commonly used clustering criteria under the assumption of Gaussian populations with various constraints on their covariance matrices. For instance, if we assume equal priors π1 = ... = πk = 1/k and Σ1 = ... = Σk = Σ with Σ = σ² I_p (a scalar matrix), GMM classification is exactly k-means clustering. This model is referred to as isotropic or spherical, since the variance is equal in all directions.

With the availability of powerful MCMC methods, there is renewed interest in the Bayesian approach to model-based clustering. In the Bayesian Maximum A Posteriori (MAP) classification approach, a prior distribution is placed on the unknown parameters ξ of the mixture.

4.4. How many clusters?

In k-means and in GMM classification the number k of clusters, or of components in the mixture model, is assumed to be known in advance. However, choosing k is a problem that still receives considerable attention in the literature (see McLachlan and Peel 2000; Sugar and James 2003; Oliveira-Brochado and Martins 2005). Here we focus on choosing the order of a Gaussian mixture model. Specifically, we consider the estimation of the number of components in the mixture based on a penalized form of the likelihood: since the maximized likelihood increases with every component added to the mixture, a term accounting for the number of parameters in the model is subtracted from the log-likelihood. This penalized log-likelihood yields the so-called information criteria for the choice of k.

4.4.1 Information criteria in model selection

Let f and g denote two probability densities in R^p (in general, two probability distribution models). The Kullback-Leibler information between the models f and g is defined as

$$ I(f, g) := \int f(x) \log \frac{f(x)}{g(x)} \, dx = E_f\!\left( \log \frac{f(X)}{g(X)} \right) \qquad (6) $$

and represents the information lost when g is used to approximate f (Kullback and Leibler 1951). The Kullback-Leibler information can be conceptualized as a directed "distance" between the two models f and g (Kullback 1959). Strictly speaking, I(f, g) is a measure of "discrepancy", not a distance, because it is not symmetric. The K-L distance has also been called the K-L discrepancy, divergence and number. It happens to be the negative of Boltzmann's (1877) concept of generalized entropy in physics and thermodynamics.

The K-L discrepancy (6) can be expressed as

$$ I(f, g) = E_f(\log f(X)) - E_f(\log g(X)). $$

Typically, f = f(x) is the true density of the random vector of interest, X, and g is some estimate, f̂, of f.
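As a small numerical illustration of (6), not part of the original notes, the K-L discrepancy between two univariate normal densities can be computed by numerical integration and checked against the closed form $\log(\sigma_2/\sigma_1) + (\sigma_1^2 + (\mu_1 - \mu_2)^2)/(2\sigma_2^2) - 1/2$:

# Hedged illustration of (6): I(f, g) for f = N(0, 1) and g = N(1, 1.5^2).
# Using log-densities keeps the integrand numerically stable in the tails.
integrand <- function(x) dnorm(x) * (dnorm(x, log = TRUE) - dnorm(x, mean = 1, sd = 1.5, log = TRUE))
kl.numeric <- integrate(integrand, -Inf, Inf)$value
kl.exact   <- log(1.5 / 1) + (1^2 + (0 - 1)^2) / (2 * 1.5^2) - 1/2   # closed form for two normals
c(kl.numeric, kl.exact)   # both approximately 0.3499; note that I(g, f) is different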
In our framework, f̂ is a density from a parametric family, namely a mixture of continuous distributions $f(x; \psi) = \sum_{j=1}^{k} \pi_j f_j(x; \theta_j)$. Intuitively, we would like to find a number k of components and a value of ψ such that the K-L divergence of f relative to the parametric model f(·; ψ),

$$ I(f, f(\cdot\,; \psi)) = E_f(\log f(X)) - E_f(\log f(X; \psi)), \qquad (7) $$

is small. We would need to know f, k and ψ to compute (7) fully, but f is usually unknown. However, these requirements can be relaxed by noting that we only want to minimize (7) with respect to k and ψ, and that $C_f := E_f(\log f(X))$ is a constant that does not depend on these parameters, only on the unknown, true f. Hence, we would like to find the values of k and ψ minimizing the relative information

$$ \eta(\psi, F) := I(f, f(\cdot\,; \psi)) - C_f = -E_f(\log f(X; \psi)) = -\int \log f(x; \psi) \, dF(x), \qquad (8) $$

where F denotes the distribution of X.

Let x1, ..., xn be a sample from X. A natural estimator of the probability distribution F of X is the empirical distribution Fn, the discrete probability measure giving mass 1/n to each sample point. Plugging Fn in the place of the unknown F gives an estimator of (8),

$$ \eta(\psi, F_n) = -\int \log f(x; \psi) \, dF_n(x) = -\frac{1}{n} \sum_{i=1}^{n} \log f(x_i; \psi) = -\frac{1}{n} \log L(\psi; x_1, \ldots, x_n). $$

It is tempting simply to estimate ψ by its m.l.e. However, Akaike (1973) showed that minus the maximized log-likelihood is a downward-biased estimator of (8). Given the m.l.e. ψ̂ = ψ̂(x1, ..., xn) of ψ, the bias of η(ψ̂, Fn) is the functional

$$ b(F) := E_f\big( \eta(\hat{\psi}, F_n) \big) - \eta(\psi, F) = -\left[ E_f\!\left( \frac{1}{n} \sum_{i=1}^{n} \log f(X_i; \hat{\psi}) \right) - \int \log f(x; \psi) \, dF(x) \right]. $$

The information criteria for model selection seek the model (in our case, the mixture order) minimizing the bias-corrected relative information η(ψ̂, Fn) − b(F), usually written (dropping the factor 1/n, and multiplying by 2 for "historical reasons") as

$$ -2\big( \log L(\hat{\psi}) + b(F) \big), $$

where −2 log L(ψ̂) measures the model's lack of fit and the bias term acts as a penalty measuring the model's complexity, so that the whole expression is a bias-corrected log-likelihood, computed with an appropriate estimate of the bias b(F). Different bias estimators give rise to different information criteria.

Akaike's Information Criterion (AIC)

Akaike (1973, 1974) approximated the bias b(F) by −d, where d is the total number of unknown parameters in the model. Thus AIC selects the model minimizing

$$ \mathrm{AIC} = -2 \log L(\hat{\psi}) + 2d. $$

Many authors (see McLachlan and Peel 2000 and the references therein) have observed that AIC tends to overestimate the correct number of components in a mixture and thus leads to overfitting.

Bayesian Information Criterion (BIC)

The BIC was derived within a Bayesian framework for model selection, but it can also be applied in a non-Bayesian context. Schwarz (1978) proposed minimizing the negative penalized log-likelihood

$$ \mathrm{BIC} = -2 \log L(\hat{\psi}) + d \log n. $$

For n > 8 (so that log n > 2), the penalty term of BIC penalizes complex models more heavily than that of AIC. BIC works well and is popular in practice.
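As a small sketch (an illustration, not part of the original notes) of how these criteria are computed once a mixture has been fitted: given the maximized log-likelihood and the number of free parameters, which for a k-component Gaussian mixture in R^p with unrestricted covariances is d = (k − 1) + kp + kp(p + 1)/2, AIC and BIC are just the two formulas above. The helper names below are made up for the illustration.

# Hedged sketch: AIC and BIC from a maximized log-likelihood.
aic <- function(loglik, d)    -2 * loglik + 2 * d
bic <- function(loglik, d, n) -2 * loglik + d * log(n)

# number of free parameters of a k-component Gaussian mixture in R^p
# with unrestricted covariance matrices
d.gmm <- function(k, p) (k - 1) + k * p + k * p * (p + 1) / 2

# e.g., for a 2-component bivariate (p = 2) mixture fitted to n observations:
# aic(loglik, d.gmm(2, 2));  bic(loglik, d.gmm(2, 2), n)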
Example: Times between Old Faithful eruptions (Y) and durations of the eruptions (X).

Datos = read.table('Datos-geyser.txt', header=TRUE)
XY = cbind(Datos$X, Datos$Y)
library(mclust)
faithfulBIC = mclustBIC(XY)
faithfulSummary = summary(faithfulBIC, data = XY)
faithfulSummary

classification table:
 1  2
80 27

best BIC values:
    VVV,2     EEE,2     EEI,2
-962.3372 -962.5509 -966.3832

Note that in the package mclust the criterion is defined as BIC = 2 log L(ψ̂) − d log n, so the best model is the one with the largest BIC value. The faithful dataset included in the R distribution has a larger sample size than Datos-geyser.txt; see the documentation of the mclust package.

faithfulBIC

[The full mclustBIC output lists the BIC values of the ten covariance models (EII, VII, EEI, VEI, EVI, VVI, EEE, EEV, VEV, VVV) for k = 1, ..., 9 components; output abridged. Top 3 models based on the BIC criterion: VVV,2 (−962.3372), EEE,2 (−962.5509), EEI,2 (−966.3832).]

4.4.2 Number of clusters = mixture order?

Any continuous distribution can be approximated arbitrarily well by a finite mixture of normal densities with common covariance matrix (see McLachlan and Peel 2000). But, for instance, more than one component of the mixture might be needed to model a skewed shape of the density in a neighbourhood of a single local mode.

[Figure: density of a mixture of 3 Gaussian distributions, illustrating that several components may be needed to reproduce one skewed mode.]

Thus, when a GMM is used for clustering, the number of mixture components is not necessarily the same as the number of clusters. There are some recent proposals to overcome this problem. Most of them are algorithms for hierarchically merging mixture components into the same cluster, once the "optimal" number of components in the mixture has been chosen (for example, using BIC); see Hennig (2010). One of the difficulties encountered is that the statistician has to decide under which conditions different Gaussian mixture components should be regarded as a common cluster: there is no objective, unique definition of what a "true" cluster is. Should clusters be identified with regions surrounding local modes? Ray and Lindsay (2005) state that "the modes are the dominant feature, and are themselves potentially symptomatic of underlying population structures" and prove that there is a (k − 1)-dimensional surface, the ridgeline manifold, which contains all the critical points (modes, antimodes and saddle points) of the mixture density.

4.5. Clustering high-dimensional data

Some references:

Bouveyron, C. and Brunet-Saumard, C. (2014). Model-based clustering of high-dimensional data: A review. Computational Statistics and Data Analysis, 56, 4462–4469.

Bouveyron, C., Girard, S. and Schmid, C. (2007). High-dimensional data clustering. Computational Statistics and Data Analysis, 52, 502–519.

Kriegel, H.-P., Kröger, P. and Zimek, A. (2009). Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering. ACM Transactions on Knowledge Discovery from Data, 3, Article 1.
Sun, W., Wang, J. and Fang, Y. (2012). Regularized k-means clustering of high-dimensional data and its asymptotic consistency. Electronic Journal of Statistics, 6, 148–167.

Tomasev, N., Radovanović, M., Mladenić, D. and Ivanović, M. (2014). The role of hubness in clustering high-dimensional data. IEEE Transactions on Knowledge and Data Engineering, 26, 739–751.

References

Akaike, H. (1973). Information theory as an extension of the maximum likelihood principle. In B.N. Petrov and F. Csaki (eds.), Second International Symposium on Information Theory, Akademiai Kiado, Budapest, pp. 267–281.

Akaike, H. (1974). A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19, 716–723.

Banerjee, A., Merugu, S., Dhillon, I.S. and Ghosh, J. (2005). Clustering with Bregman divergences. Journal of Machine Learning Research, 6, 1705–1749.

Bock, H.-H. (2007). Clustering methods: a history of k-means algorithms. In Selected Contributions in Data Analysis and Classification, Springer, pp. 161–172.

Boltzmann, L. (1877). Über die Beziehung zwischen dem zweiten Hauptsatze der mechanischen Wärmetheorie und der Wahrscheinlichkeitsrechnung respective den Sätzen über das Wärmegleichgewicht. Wiener Berichte, 76, 373–435.

Celeux, G. and Govaert, G. (1993). Comparison of the mixture and the classification maximum likelihood in cluster analysis. Journal of Statistical Computation and Simulation, 47, 127–146.

Everitt, B.S., Landau, S., Leese, M. and Stahl, D. (2011). Cluster Analysis. Wiley.

Forgy, E.W. (1965). Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics, 21, 768–769.

Gersho, A. and Gray, R.M. (1992). Vector Quantization and Signal Compression. Springer.

Hartigan, J.A. and Wong, M.A. (1979). Algorithm AS 136: a k-means clustering algorithm. Journal of the Royal Statistical Society, Series C, 28, 100–108.

Hennig, C. (2010). Methods for merging Gaussian mixture components. Advances in Data Analysis and Classification, 4, 3–34.

Jain, A.K. (2010). Data clustering: 50 years beyond K-means. Pattern Recognition Letters, 31, 651–666.

Kaufman, L. and Rousseeuw, P.J. (2005). Finding Groups in Data: An Introduction to Cluster Analysis. Wiley.

Kullback, S. (1959). Information Theory and Statistics. Wiley.

Kullback, S. and Leibler, R.A. (1951). On information and sufficiency. Annals of Mathematical Statistics, 22, 79–86.

Lloyd, S.P. (1982). Least squares quantization in PCM. IEEE Transactions on Information Theory, 28, 129–137.

Mardia, K.V., Kent, J.T. and Bibby, J.M. (1989). Multivariate Analysis. Academic Press.

McLachlan, G. (1982). The classification and mixture maximum likelihood approaches to cluster analysis. In P.R. Krishnaiah and L. Kanal (eds.), Handbook of Statistics, vol. 2, North-Holland, pp. 199–208.

McLachlan, G. and Basford, K.E. (1988). Mixture Models: Inference and Applications to Clustering. Marcel Dekker.

McLachlan, G. and Peel, D. (2000). Finite Mixture Models. Wiley.

MacQueen, J.B. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, University of California Press, pp. 281–297.
Oliveira-Brochado, A. and Martins, F.V. (2005). Assessing the number of components in mixture models: a review. FEP Working Papers 194.

Pollard, D. (1981). Strong consistency of k-means clustering. The Annals of Statistics, 9, 135–140.

Pollard, D. (1984). Convergence of Stochastic Processes. Springer. Downloadable at www.stat.yale.edu/~pollard/.

Ray, S. and Lindsay, B.G. (2005). The topography of multivariate normal mixtures. The Annals of Statistics, 33, 2042–2065.

Sayood, K. (2005). Introduction to Data Compression. Elsevier.

Schwarz, G. (1978). Estimating the dimension of a model. The Annals of Statistics, 6, 461–464.

Sugar, C.A. and James, G.M. (2003). Finding the number of clusters in a dataset: an information-theoretic approach. Journal of the American Statistical Association, 98, 750–763.

Weber, A. (1973). Agrarpolitik im Spannungsfeld der internationalen Ernährungspolitik. Institut für Agrarpolitik und Marktlehre, Kiel.