ON A STATISTICAL MODEL OF CLUSTER STABILITY

Z. Volkovich, Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel; Department of Mathematics and Statistics, University of Maryland, Baltimore County, USA
Z. Barzily, Software Engineering Department, ORT Braude College of Engineering, Karmiel 21982, Israel

Concept
We propose a general statistical approach to the cluster validation problem. Our concept suggests that the partitions obtained by rerunning a clustering algorithm can be considered realizations of a random variable, such that the most stable random variable indicates the true number of clusters. We propose to measure stability by means of probability distances within the observed realizations, i.e., by applying probability metrics between random variables.

Motivating works
– T. Lange, V. Roth, M. L. Braun, and J. M. Buhmann. Stability-based validation of clustering solutions. Neural Computation, 16(6):1299–1323, 2004.
– V. Roth, T. Lange, M. Braun, and J. M. Buhmann. A resampling approach to cluster validation. In COMPSTAT, 2002.

Clustering
Goal: partition a set S ⊂ R^d by means of a clustering algorithm CL such that
CL(x) = CL(y) if x and y are "similar";
CL(x) ≠ CL(y) if x and y are "dissimilar".

Clustering (cont.)
(figure: similar points share a label, dissimilar points get different labels)
An important question: how many clusters are there?

Example: a three-cluster set partitioned into 2 and 4 clusters
(figure)

Implication
It is observed that, when the number of clusters is not chosen correctly, inconsistent clusters can be formed. Such a cluster is a union of several heterogeneous parts having different distributions.

Concept
DATA → SAMPLE S → INITIAL PARTITION → CLUSTERING MACHINE CL → CLUSTERED SAMPLE S = CL(S)

Concept (cont. 1)
DATA → S1 = CL(S1), S2 = CL(S2), …
These clustered samples can be considered estimates of the sought-for "true partition"; accordingly, our concept views these partitions as instances of a random variable such that the steadiest random variable is associated with the true number of clusters.

Concept (cont. 2)
Outliers in the samples and limitations of the clustering algorithms significantly add to the noise level and increase the model's volatility. Thus, the clustering procedure has to be iterated many times to obtain meaningful results. The stability of the random variables is measured via probability distances. According to our principle, it is natural to assume that the true number of clusters corresponds to the distance distribution having the minimal inner variation.

Some probability metrics
Probability metrics are introduced on a space Λ of real-valued random variables defined on the same probability space. A functional dis : Λ × Λ → R+ is called a probability metric if it satisfies the following properties:
– Identity: P(X = Y) = 1 implies dis(X, Y) = 0;
– Symmetry: dis(X, Y) = dis(Y, X);
– Triangle inequality: dis(X, Z) ≤ dis(X, Y) + dis(Y, Z).
The last two properties are not always required. If a probability metric identifies the distribution, i.e., dis(X, Y) = 0 whenever X and Y are identically distributed, then the metric is called simple; otherwise the metric is called compound.

Examples
– Lp-metric
– Ky Fan metrics
– Hellinger distance
– Relative entropy (Kullback–Leibler divergence)
– Kolmogorov (uniform) metric
– Levy metric
– Prokhorov metric
– Total variation distance
– Wasserstein (Kantorovich) metric
– Chi-squared distance
and so on.
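As a hedged illustration of a few of the metrics listed above (the sample sizes, the SciPy helpers, and the histogram binning used for total variation are our own choices, not from the slides), the following Python sketch estimates three of the simple metrics from two one-dimensional samples:

```python
# Hedged sketch: empirical estimates of three simple probability metrics.
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=500)   # sample from F
y = rng.normal(0.5, 1.0, size=500)   # sample from G

# Kolmogorov (uniform) metric: sup-distance between the empirical CDFs.
kolmogorov = ks_2samp(x, y).statistic

# Wasserstein (Kantorovich) metric between the empirical distributions.
wasserstein = wasserstein_distance(x, y)

# Total variation distance, estimated on a shared histogram grid.
edges = np.histogram_bin_edges(np.concatenate([x, y]), bins=30)
p, _ = np.histogram(x, bins=edges)
q, _ = np.histogram(y, bins=edges)
total_variation = 0.5 * np.abs(p / p.sum() - q / q.sum()).sum()

print(kolmogorov, wasserstein, total_variation)
```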
Concentration measure index
The Lp and the Ky Fan metrics are compound; the other metrics shown above are simple. A difference between simple and compound metrics is that, unlike the compound ones, simple metrics equal zero for two independent identically distributed random variables. Moreover, if dis is a compound metric, dis(X, X') = 0, and X, X' are independent realizations of the same random variable, then X is constant almost surely. A compound distance can therefore be used as a measure of uncertainty. In particular, d(X) = dis(X, X'), where X' is an independent copy of X, is called the concentration measure index derived from the compound distance dis. The stability of a random variable can be estimated by the average value of this index.

Simple and Compound Metrics
Simple metrics → Geometrical Algorithms
Compound metrics → Membership Algorithms

Geometrical Algorithm
Several algorithms can be offered here. The first type considers geometrical stability. The basic idea is that if one "properly" clusters two independent samples then, under the assumption of a consistent clustering algorithm, the clustered samples can be classified as samples drawn from the same population.

Example: Cluster stability via samples
(figure: two samples falling into Cluster 1 and Cluster 2)
The samples reveal the same clusters; the partition is stable.

General algorithm
Given a probability metric dis(·, ·):
1. For each tested number of clusters k, execute steps 2 to 6 as follows.
2. Draw pairs of disjoint samples (St, St+1), t = 1, 2, …
3. Cluster the united sample, S = CL(St ∪ St+1), and separate S into the two clustered samples (St, St+1).
4. Calculate the distance values dt = dis(St, St+1).
5. Average each consecutive T distances.
6. Normalize the set of averages to obtain the set DISk.
7. The number of clusters k which yields the most concentrated normalized distribution DISk is chosen as the estimate of the true number of clusters.

How is the concentration measured?
The distances are inherently positive, and we are interested in distributions concentrated near zero. Thus we can use, as concentration measures, the sample mean, the sample standard deviation, and the lower quartile.

Simple distances
Many distances applicable in two-sample tests are simple metrics. For instance, the well-known Kolmogorov–Smirnov test is based on the max-distance between the empirical distribution functions in the one-dimensional case. In the multivariate case, the following tests can be mentioned in this context:
• the Friedman–Rafsky test, 1979 (minimal-spanning-tree based);
• the nearest-neighbors test of Henze, 1988;
• the energy test of Zech and Aslan, 2005;
• the N-distances test of Klebanov, 2005.
Such metrics can be used in the Geometrical Algorithm.

Simple distances (cont.)
Recall that the classical two-sample problem tests the hypothesis H0 : F(x) = G(x) against the general alternative H1 : F(x) ≠ G(x), when the distributions F and G are unknown. Here we consider applications of the N-distances test of Klebanov and the Friedman–Rafsky test.

Klebanov's N-distances
N-distances are defined as follows:
dis(X, Y) = 2E[N(X1, Y1)] − E[N(X1, X1')] − E[N(Y1, Y1')],
where X1, X1' and Y1, Y1' are independent realizations of X and Y, respectively, and N is a negative definite kernel. We use N(x, y) = ||x − y||^α, 0 < α ≤ 2.
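A minimal Python sketch of this N-distance for two multivariate samples, with the kernel N(x, y) = ||x − y||^α; the function name and the off-diagonal averaging used to estimate the expectations are our own choices:

```python
# Hedged sketch of Klebanov's N-distance with N(x, y) = ||x - y||**alpha.
import numpy as np
from scipy.spatial.distance import cdist

def n_distance(X, Y, alpha=1.0):
    """2 E N(X1, Y1) - E N(X1, X1') - E N(Y1, Y1'), estimated empirically."""
    n, m = len(X), len(Y)
    e_xy = (cdist(X, Y) ** alpha).mean()
    # The diagonals are zero, so these sums run over distinct pairs only.
    e_xx = (cdist(X, X) ** alpha).sum() / (n * (n - 1))
    e_yy = (cdist(Y, Y) ** alpha).sum() / (m * (m - 1))
    return 2.0 * e_xy - e_xx - e_yy

rng = np.random.default_rng(1)
X = rng.normal(0.0, 1.0, size=(100, 2))
Y = rng.normal(0.0, 1.0, size=(100, 2))
print(n_distance(X, Y))  # close to zero: identically distributed samples
```

In the Geometrical Algorithm this value is computed cluster by cluster, as illustrated next.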
Graphical illustration
Let us consider two samples, S1 and S2, partitioned into clusters. (figure: S1 and S2)

Graphical illustration (cont. 1)
(figure: Cluster C1 and Cluster C2, highlighting the distances between points belonging to different samples)
Σ_{i=1}^{k} [2 dis_i(S1, S2) − dis_i(S1, S1) − dis_i(S2, S2)]

Graphical illustration (cont. 2)
(figure: Cluster C1 and Cluster C2, highlighting the distances between points of the same sample entering the same sum)

Remark
Note that the metric is actually the average distance between the samples within the clusters. If this value is close to zero, it can be concluded that the partition is stable. Namely, the elements of the samples are closely located inside the clusters and can be deemed a consistent set within each of the clusters.

Distances calculations
Dis(S1, S2) = Σ_{i=1}^{k} [2 dis_i(S1, S2) − dis_i(S1, S1) − dis_i(S2, S2)],
where
dis_i(S1, S2) = (1 / (|Ci ∩ S1| · |Ci ∩ S2|)) Σ_{X1 ∈ Ci∩S1} Σ_{X2 ∈ Ci∩S2} ||X1 − X2||,
dis_i(S1, S1) = (1 / (|Ci ∩ S1| · (|Ci ∩ S1| − 1))) Σ_{X1, X2 ∈ Ci∩S1; X1 ≠ X2} ||X1 − X2||,
dis_i(S2, S2) = (1 / (|Ci ∩ S2| · (|Ci ∩ S2| − 1))) Σ_{X1, X2 ∈ Ci∩S2; X1 ≠ X2} ||X1 − X2||.

Histograms of the distances' values
Two cases are presented: the case when the quantity T of averaged distance values is big, and the case when T is small, respectively. In the second case, measuring the concentration via the lower quartile appears more appropriate.

Example: The Iris Flower Dataset
• The kernel: N(x, y) = ||x − y||
• Sample size: 70
• Number of samples: 20*80
• Number of averaged samples: 80

Graph of the normalized mean value
(figure)

Graph of the normalized quartile value
(figure)

Euclidean Minimal Spanning Tree
The Euclidean minimum spanning tree, or EMST, is a minimum spanning tree of a set S of points in a Euclidean space R^d, where the weight of the edge between each pair of points is the distance between those two points. In simpler terms, an EMST connects a set of dots using lines such that the total length of all the lines is minimized and any dot can be reached from any other by following the lines (see Wikipedia).

Euclidean minimal spanning tree (cont.)
A tree on S is a graph which has no loops and whose vertices are elements of S. If all distances between the items of S are distinct, then the set S is called nice, and it has a unique EMST. An EMST can be built in O(n²) time (including the distance calculations) using Prim's (also attributed to Dijkstra), Kruskal's, or Borůvka's algorithm.

An EMST of 60 random points
(figure)

How can an EMST be used in the cluster validation problem?
We draw two samples, S1 and S2, and determine S = S1 ∪ S2, the pooled sample, and S = CL(S), the clustered pooled sample. We expect that, in the case of a stable clustering, the two samples' items are closely located inside the clusters. This fact is characterized by the number of edges connecting points from different samples. These edges are marked in red in the diagrams in the examples.

Graphical illustration. Stable clustering
(figure)

Graphical illustration. Non-stable clustering
(figure)

The two-sample MST-test
The two-sample test tests the hypothesis H0 : F(x) = G(x) against the general alternative H1 : F(x) ≠ G(x), when the distributions F and G are unknown. The number of edges connecting points from different samples is considered in the Friedman–Rafsky MST test. Particularly, let X = {xi}, i = 1, …, n, and Y = {yj}, j = 1, …, m, be two samples of independent random elements distributed according to F and G, respectively.

The two-sample MST-test (cont. 1)
Suppose that the set Z = X ∪ Y is nice. Friedman and Rafsky's test statistic Rmn can be defined as the number of edges of the EMST of Z which connect a point of X to a point of Y. Friedman and Rafsky actually introduced the statistic 1 + Rmn, which expresses the number of disjoint subtrees resulting from removing all edges uniting vertices of different samples.
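A hedged sketch of the statistic Rmn (the function and variable names are ours; SciPy's minimum_spanning_tree performs the EMST construction): build the EMST of the pooled sample and count the edges whose endpoints come from different samples.

```python
# Hedged sketch of the Friedman-Rafsky statistic R_mn.
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse.csgraph import minimum_spanning_tree

def friedman_rafsky(X, Y):
    Z = np.vstack([X, Y])                                # pooled sample
    labels = np.r_[np.zeros(len(X)), np.ones(len(Y))]
    mst = minimum_spanning_tree(squareform(pdist(Z)))    # EMST of Z
    rows, cols = mst.nonzero()
    # R_mn = number of EMST edges connecting points of different samples.
    return int(np.sum(labels[rows] != labels[cols]))

rng = np.random.default_rng(2)
X = rng.normal(0, 1, size=(50, 2))
Y = rng.normal(0, 1, size=(50, 2))
print(friedman_rafsky(X, Y))  # large when the two samples are well mixed
```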
The two-sample MST-test (cont. 2)
Henze and Penrose (1999) considered the asymptotic behavior of Rmn. Suppose that m, n → ∞ such that m/(m+n) → p ∈ (0, 1). Introducing q = 1 − p and r = 2pq, they obtained
(Rmn − 2mn/(m+n)) / √(m+n) → N(0, σ²),
where the convergence is in distribution and N(0, σ²) denotes the normal distribution with zero expectation and variance σ² = r(r + Cd(1 − 2r)) for some constant Cd depending only on the space's dimension d.

The two-sample MST-test (cont. 3)
This result coincides with our intuition: if two sets are close, then there are many edges which unite points from different samples. In the spirit of the Central Limit Theorem, this quantity is expected to be asymptotically normally distributed.

Theorem's application
To use this theorem, for each possible number of clusters k = 2, …, k*, we draw many pairs of disjoint samples S1 and S2 having the same, sufficiently big, size n and calculate S = S1 ∪ S2 and St = CL(S). We consider the sub-samples S1,j = S1 ∩ Πj(St) and S2,j = S2 ∩ Πj(St), where Πj(St), j = 1, …, k, is the jth cluster in the partition of St obtained by means of CL.

Theorem's application (cont. 1)
For each j = 1, …, k we compute a value of the two-sample MST-test statistic Rnn(S1,j, S2,j) and set
Rn(S1, S2) = (1/k) Σ_{j=1}^{k} Rnn(S1,j, S2,j).
In the case where all clusters are homogeneous sets, this statistic is normally distributed. We collect many such values and look at the distance between their standardized distribution and the standard normal distribution. The number of clusters k which yields the minimal distance is chosen as the estimate of the true number of clusters.

Example: Calculation of Rn(S1, S2)
For n = m = 10 and k = 2, with cross-sample edge counts of 3 and 2 in the two clusters, Rn(S1, S2) = (1/2)(3 + 2) = 5/2, while the asymptotic mean is 2nm/(n+m) = 2·10·10/20 = 10.

Distances from normality
We can consider the following distances from normality:
1. The Kolmogorov–Smirnov test statistic
2. Friedman's Pursuit Index
3. Entropy-based pursuit indexes
4. The BCM function
Generally, any simple metric can be used here.

The Kolmogorov–Smirnov Distance
The Kolmogorov–Smirnov distance is based on the empirical distribution function. Given N ordered data items R1 ≤ R2 ≤ … ≤ RN, a function FN(i) is defined as the fraction of items less than or equal to Ri. The distance to the standard normal distribution is given by
D = max_i max(FN(i) − G(Ri), G(Ri) − FN(i)),
where G is the standard normal cumulative distribution function, evaluated at Ri.

The Kolmogorov–Smirnov Distance
(figure)

Example: synthetic data
Mixture of three Gaussian clusters. (figure)

Example: synthetic data (cont. 1)
(figure)

Another application
Another approach, by Jain, Xu, Ho and Xiao and by Smith and Jain, proposes to carry out uniformity testing for cluster detection. The main idea is to locate an "inconsistent" edge whose length is significantly larger than the average length of nearby edges. The distribution of these lengths must be normal under the null hypothesis. This method is unable to assess the true number of clusters in the data.

Membership Stability Algorithm
Such an algorithm uses a simulation of random variables obtained by repeated clusterings of the same sample. The obtained results are compared by means of a compound (non-simple) metric, which is a function of variables defined on the set Ck = {1, …, k}, where k is the number of clusters considered.

Lp-metric
The Lp-metric is the best-known example of a compound metric. For every p > 0 the Lp-metric is defined by
dis_p(X, Y) = [E(|X − Y|^p)]^{min(1, 1/p)}.
In particular:
dis_∞(X, Y) = inf{c ≥ 0 : P(|X − Y| > c) = 0}, the L∞ (essential supremum) metric;
dis_0(X, Y) = E[1(X ≠ Y)], the indicator metric.
The last metric has been, de facto, applied in:
– T. Lange, V. Roth, M. L. Braun, and J. M. Buhmann. Stability-based validation of clustering solutions. Neural Computation, 16(6):1299–1323, 2004.
– V. Roth, T. Lange, M. Braun, and J. M. Buhmann. A resampling approach to cluster validation. In COMPSTAT, 2002.
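As a small illustration (the variable names are ours), the indicator metric between two clusterings of the same sample is just the empirical misclassification rate:

```python
# Hedged sketch: empirical indicator metric dis_0 = E[1(X != Y)].
import numpy as np

def indicator_metric(labels_a, labels_b):
    a, b = np.asarray(labels_a), np.asarray(labels_b)
    return float(np.mean(a != b))

print(indicator_metric([0, 0, 1, 1, 2, 2],
                       [0, 0, 1, 2, 2, 2]))  # 1/6 of the labels disagree
```

Note that the two labelings must first be brought into correspondence; this is the clusters correspondence problem treated below.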
A family of clustering algorithms
S → CLUSTERING MACHINE 1 (CL1) → S1 = CL1(S)
S → CLUSTERING MACHINE 2 (CL2) → S2 = CL2(S)
…
S → CLUSTERING MACHINE M (CLM) → SM = CLM(S)
In contrast with the geometrical algorithm, here we use a family of clustering algorithms. It can be the same algorithm starting from different initial points.

Clusters correspondence problem
Let us consider clustering the same set S twice. (figure: two partitions of S into Cluster 1 and Cluster 2, with the labels swapped)

The same cluster can be labeled differently
The correspondence between the labelings α and β obtained for a sample S is found by solving for the permutation ψ of the set Ck which achieves the smallest misclassification between α and the permuted β, i.e.
ψ* = argmin_ψ Σ_{x∈S} 1(α(x) ≠ ψ(β(x))).
The computational complexity of solving this problem by the well-known Hungarian method is O(k³). This technique has been used in [Lange T. et al., 2004] and is also known in the cluster-combining area (see, for example, [Topchy A. et al., 2003]).
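A hedged sketch of this correspondence step using SciPy's Hungarian-method implementation, linear_sum_assignment (the helper name and the agreement-matrix construction are our own):

```python
# Hedged sketch: optimal label correspondence via the Hungarian method.
import numpy as np
from scipy.optimize import linear_sum_assignment

def best_label_permutation(alpha, beta, k):
    alpha, beta = np.asarray(alpha), np.asarray(beta)
    # agreement[i, j] = number of points labeled i by alpha and j by beta.
    agreement = np.zeros((k, k), dtype=int)
    np.add.at(agreement, (alpha, beta), 1)
    # Maximizing total agreement minimizes the misclassification count.
    rows, cols = linear_sum_assignment(-agreement)
    return {int(c): int(r) for c, r in zip(cols, rows)}  # beta label -> alpha label

alpha = [0, 0, 1, 1, 2, 2]
beta = [2, 2, 0, 0, 1, 1]        # the same partition, labels permuted
print(best_label_permutation(alpha, beta, k=3))  # {2: 0, 0: 1, 1: 2}
```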