Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
A new method to determine a similarity threshold in clustering problems 1 G. SANCHEZ-DIAZ, 2J. FCO. MARTINEZ-TRINIDAD, 1J. WAISSMAN-VILANOVA 1 Center of Technologies Research on Information and Systems The Autonomous University of the Hidalgo State Carr. Pachuca – Tulancingo Km. 4.5; C.U.; C.P. 42084; Pachuca, Hgo. 2 National Institute of Astrophysics, Optics and Electronics Luis Enrique Erro No. 1, Sta. María Tonantzintla, Puebla CP 72840. MEXICO Abstract: - In this paper, a new automatic method based on an intra-cluster criterion, to obtain a similarity threshold that generates a well defined clustering (or near to it) of the data set, is proposed. This method is focussed on unsupervised Logical Combinatorial Pattern Recognition approach. Some experimentations of the new method are presented with some data sets. Key-Words: - Unsupervised classification, Clustering, Graph theory, Similarity threshold, Logical combinatorial pattern recognition. 1 Introduction In unsupervised Pattern Recognition area many algorithms have been proposed [1]. Some of them are based on graph theory. In this paper we will consider the approach based on graph proposed in the Logical Combinatorial Pattern Recognition [2, 3]. In this context it is assumed that the structure of an universe is not known. To find such structure, in general an initial sample is given. The problem is precisely to find the classes, the groupings. The main idea consists in consider the data as vertexes in a graph and the similarity among the objects as edges. In this way the problem of unsupervised classification can be seen as finding subgraphs (clusters) in the initial graph (initial sample). Note that there exists a natural correspondence among data, their similarity and a graph whose vertexes are objects and the weight of their edges is the similarity between adjacent vertexes. In this context a parameter β0 can be introduced for controlling how many similar a pair of objects must be in order to be considered similar. As result then a new graph containing only edges with weight greater or equal than β0 (the parameter) is obtained. Therefore depending on the desired closeness in similarity, an appropriate value for this parameter can be chosen by the user and then different graphs are obtained. Now the problem is reduced to find in the resultant graph certain subgraphs. For example, we can find connected components in the graph fixing a certain β0, where each vertex is an object of the sample and all the edges have a weight greater than β0. Note that when the value of β0 is modified the graph may change and then the connected components also can change obtaining a different clustering for each value of β0. Here rises a natural question, what value of β0 must be chosen?. There are many criteria to find subgraphs (clustering criteria), in [4] are presented some of them as β0-connected components, β0-compact sets, β0strictly compact sets, and β0-complete maximal sets. The problem of choosing an adequate value for β0 is studied in this paper. A new automatic method to obtain a similarity threshold β0 to generate a welldefined clustering (or near to it) of a data set is proposed. This method is based on the maximum and minimum values of similarity among objects and it calculates an intra-cluster similarity for each cluster. Then the method uses this intra-cluster similarity to obtain a global value, which indicates what clustering has the best average intra-cluster similarity. 2 Related Works A method to determine the parameter β0 for a hierarchical algorithm is proposed in [5]. The method uses as similarity between two clusters Ci and Cj the expression β (Ci , C j ) = max{β (O, O′ )}. Then using a O ∈Ci O′∈C j traditional hierarchical algorithm, i.e, grouping the two clusters more similar in each level, a dendograme is built. Using the dendograme the user can choose the parameter β0 that prefers according to the number of clusters generated by this β0 value. In this case, the user determines the value of β0 in function of the number of cluster that he want to get analyzing the dendograme, the method automatically not determine the value β0. Another related work is [6], where some concepts and theorems to demonstrate the number of forms that a sample of objects can be partitioned in β0connected components, are introduced. The values β0 that characterize the different forms of partitioning the sample and the cardinality for each component are also studied. In this paper, an algorithm such that given a number 0<k<m, (where m is the number of objects in the sample), the k β0-connected components are generated, is proposed. Note that this method is very useful if the number of β0-connected components to form is known, otherwise when the number of clusters to form is an incognita in the problem we can not use the method. The problem of determining the value of β0 that generates natural clusters is very important in the context of Logical Combinatory Pattern Recognition. Therefore in the next sections a new method to estimate β0 is introduced. 3 Basic Concepts In this section, the context of an unsupervised classification problem in the Logical Combinatorial Pattern Recognition is explained and also some basic concepts that our method to determine β0 uses are introduced. Let Ω={O1,O2,...,Om}be a set of objects and R={x1,x2,…,xn} a set of features. A description I(O) is defined for every O ∈Ω and this is represented by an n-tuple, e.g. I(O)=(x1(O),...,xn(O))∈D1×…×Dn (initial representation space), where xi(O)∈Di; i=1,...,n; and Di is the domain of admissible values for the feature xi. Di can be a set of nominal, ordinal and/or numerical values. Hereafter we will use O instead of I(O) to simplify the notation. Definition 1. A comparison criterion [4] is a function ϕi:Di×Di→Li which is associated to each feature xi (i=1,...,n), where: 1. ϕi(xi(O),xi(O))=min{y}, y∈Li, if ϕi is a dissimilarity comparison criterion between values of variable xi, or 2. ϕi(xi(O),xi(O))=max{y}, y∈Li, if ϕi is a similarity comparison criterion between values of variable xi, for i=1,…,n. ϕi is an evaluation of the similarity or dissimilarity degree between any two values of the variable xi. Li i=1,…,n is a total ordered set, usually it is considered as Li=[0,1]. Definition 2. Let a function β: (D1×...×Dn)2→ L, this function is named similarity function [7], where L is a total ordered set. The similarity function is defined using a comparison criterion for each attribute. Definition 3. Let β be a similarity function and β0∈L a similarity threshold. Then two objects Oi, Oj ∈Ω are β0-similars if β(Oi,Oj)≥β0. If for all Oj∈Ω β(Oi,Oj)<β0, then Oi is a β0-isolated object. Definition 4. Let |Γij|m×m be a matrix where each entry Γij=β(Oi,Oj)∈L [4]. This matrix is denominated similarity matrix. Definition 5 (Intra_i criterion). Let C1,…,Ck be k clusters obtained after apply a clustering criteria with a certain β0. The intra-cluster similarity criterion (Intra_i) is defined as follows: maxs if maxCi = min Ci = max s Intra _ i(Ci ) = maxCi − min Ci if maxCi ≠ min Ci min if maxCi = min Ci ≠ max s s where maxs and mins are the maximum and minimum similarity values that β, the similarity function, can takes (i.e. maxs=1.0 and mins=0.0, if L=[0, 1] for example). Besides, maxCi and minCi are the maximum and minimum similarity values among objects belonging to the cluster Ci. According to Han and Kamber [8] a good clustering method will produce high quality clusters (well-defined), with high intra-cluster similarity and low inter-cluster similarity. The proposed Intra_i criterion was inspired in two facts. First, The Intra_i criterion gives a low weight to the clusters where the difference between the maximum and the minimum similarity between objects is low (or null). Also, this criterion gives a high weight to those clusters formed by only one object. This is because these clusters may be outliers or noise, and its global contribution can generates not adequate results. Second, in this approach of unsupervised classification based in Graph Theory there are two trivial solutions: when β0=mins, obtaining one cluster with all objects of the sample, and when β0=maxs, generating m clusters, each one formed by only one object. Then, while β0 is increased from 0.0 to 1.0 the Intra_i takes several values in function of the difference between the maximum and the minimum similarity between objects in the different clusters for each clustering. Therefore, we propose that a reasonable way to determine an adequate value of β0 (associated to a well-defined clustering) is considering a clustering (a set of clusters) with minimum average intra-cluster similarity. 4 The Method for Determining β0 In this section, we describe the proposed method to obtain a threshold β0 such that a well-defined clustering is associated to this value. 4.1 Description of the method The proposed method works in the following way. First, the threshold β0 is initialized with the minimum value of similarity between objects (β0=0.0). After this, in order to generate several clustering, the method handles a loop, which increases the threshold value with a small constant (INC=0.01, for example), and then, a clustering is generated with this β0 value and the specified clustering criterion. For each cluster in a generated clustering, the maximum and minimum values of similarity among objects are calculated and the Intra_i criterion is computed. Finally, the method calculates the average Intra_i for each clustering and takes the minimum value obtaining the threshold β0 that generates a welldefined clustering in the data set. This process continues until INC reaches the maximum similarity value β0=1.0. The proposed method in this paper is the following: method Input: |Γij|m×m (similarity matrix), CC (Clustering criterion), INC (β0 increase value); Output: β0 (threshold calculated); β0= mins Repeat CCk(β0) Cik=Clusters(CCk, i), i=1,...,hk maxci-ik=Max(|Γij|m×m, Cik, i), i=1,...,hk minci-ik=Min(|Γij|m×m, Cik, i), i=1,..., hk valueik=Intra_i(Cik, maxci-ik , minci-ik) meanCCk=∑ valueik / |CCk| β0=β0 + INC until β0= maxs β0=β0_min_average_Intra_i(meanCCk) The CCk(β0) function build a clustering (CCk) with a specific similarity threshold β0, applying the clustering criterion given as parameter (β0-connected components). The function Clusters(CCk, i) returns the cluster i, from the clustering CCk. The functions Max(|Γij|m×m, Cik, i) and Min(|Γij|m×m, Cik, i) return the maximum and minimum values of similarity among objects for the cluster i, in the clustering CCk. The Intra_i(Cik, maxci-ik , minci-ik) function calculates and return the intra-cluster criterion for the cluster i. Finally, β0_min_mean_Intra_i(meanCCk) obtains and return the minimum average value of the Intra_i criterion. The threshold β0 obtained by the method indicates that the clustering associated to β0 is a well defined clustering. 4.2 Parameters of the method The method handles the following parameters: |Γij|m×m (similarity matrix), CC (Clustering criterion), INC (β0 increase value). The |Γij|m×m was calculated using the definitions 1, 2 and 4, for a specific algorithm. In this paper, the CC clustering criterion handled was β0-connected components criterion. Finally, the increase value for β0 can takes several values, depending of the accurate required in the problem (0.1, 0.15, 0.5, 0.01, etc., for example). For this parameter, we propose handle an increase of 0.01, if β0 ∈[0.0, 1.0]. 5 Experimental Results In this section, three examples of applications of the method to data sets are presented. The first and second data sets show graphically two examples with linear separable data. The first data set (DS1) contains twelve objects, in table 1, we show the symmetric similarity matrix for this data set. The figure 1 shows the 12 objects corresponding to DS1. The method was applied to DS1 and 7 clustering were obtained, each one with 1, 2, 3, 4, 6, 9 and 12 clusters respectively. The clusters obtained in each clustering are as follows: C1: β0=0.00; (c1,1)={1,2,3,4,5,6,7,8,9,10,11,12}. C2: β0=0.71; c1,2={1,2,3, 4,5,7,8,9,10,11,12}, c2,2={6}. C3: β0=0.74; c1,3={1,2,3,4,5,9,10,11,12}, c2,3={6}, c3,3={7,8}. C4: β0=0.81; c1,4={1,2,3,4,5},c2,4={6},c3,4={7,8}, c4,4={9,10,11,12}. C5: β0=0.84; c1,5={1,2,3,4,5},c2,5={6},c3,5={7},c4,5={8}, c5,5={9}, c6,5={10,11,12}. C6: β0=0.86; c1,6={1},c2,6={2,3,4},c3,6={5},c4,6={6}, c5,6={7}, c6,6={8}, c7,6={9}, c8,6={10,11}, c9,6={12}. C7: β0=0.91; c1,7={1}, c2,7={2},..., c12,7={12}. The uppercase C means a clustering and the lowercase c means a cluster in the clustering. The Intra_i criterion calculated for each cluster, in each clustering was the following: C1={.90}, C2={.77,1.0}, C3={.77,1.0,.00}, C4={.27,1.0,.00,.15}, C5={.27,1.0,1.0,1.0,1.0,.80}, C6={1.0,.80,1.0,1.0,1.0,1.0,1.0,.00,1.0}, and C7={1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0}. C1: AI=0.9667; with β0=0.00; and 1 cluster. C2: AI=0.8167; with β0=0.67; and 2 clusters. C3: AI=0.6444; with β0=0.80; and 3 clusters. C4: AI=0.45; with β0=0.84; and 4 clusters. C5: AI=0.3467; with β0=0.87; and 5 clusters. C6: AI=0.2667; with β0=0.90; and 6 clusters. C7: AI=1.0; with β0=0.97; and 45 clusters. Figure 2. Clustering obtained with β0=0.81, for DS1 Figure 1. The objects corresponding to DS1 The average Intra_i for each clustering is as follows: C1=0.90; C2=0.885; C3=0.59; C4=0.355; C5=0.845; C6=0.866; and C7=1.0. The minimum value of these averages determines the value β0=0.81, which corresponds to a welldefined clustering (i.e. C4), formed by the clusters shown in figure 2. 1.0 .85 .75 .78 .63 .70 .51 .38 .43 .27 .22 .13 1.0 .90 .90 .76 .62 .58 .45 .56 .38 .36 .27 1.0 .88 .83 .50 .60 .47 .65 .47 .47 .33 1.0 .85 .58 .68 .55 .63 .45 .42 .30 1.0 .48 .73 .65 .80 .65 .58 .50 1.0 .51 .36 .27 .11 .05 .00 1.0 .83 .66 .58 .48 .51 1.0 .66 .63 .53 .61 1.0 .83 .80 .70 1.0 .88 .85 1.0 .80 1.0 Figure 3. The objects corresponding to DS2 The well-defined cluster, using the proposed method (obtained with β0=0.90) is shown in the figure 4. Table 1. Similarity matrix of DS1 The second data set (DS2) is shows in figure 3. This data set contains 45 numerical objects in 2D. Has it is possible to see in figure 3, the objects are graphically clustering in 5 well-defined clusters. The method was applied to DS2, and seven clustering were obtained with the following average Intra_i (AI): Figure 4. Clustering obtained with β0=0.90, for DS2 In this third experimentation, the behavior of the proposed method is shown step by step. The third data set (DS3) contains 350 numerical objects in 2D, and it is shown in figure 5. The clusters shown in figure 5 have several shapes, sizes and densities, and they are not lineally separable. In order to show the behavior of the proposed method, the several clustering obtained with their respective β0, are presented in figures 6 to 9. Again, we show the well-defined cluster generated for DS3, which corresponds with a β0=0.94 value, generating 5 clusters, with an average Intra_i value of 0.2378. The cases when β0=0.00, β0=0.98, and β0=0.99 are not shown, because for the first case, the obtained clustering has all the objects. And, for the remaining values of β0, the number of clusters obtained is too large (i.e. 86 and 350 clusters respectively). The method allows obtaining a well-defined clustering, based in an intra-cluster similarity criterion. The proposed method gives a value of threshold β0 to obtain a well-defined clustering of a data set. The method does not establish any assumptions about the shape, the size or cluster density characteristics of the resultant clusters in each clustering generated. However, the proposed method is still susceptible to noise. The proposed method uses the β0-connected components clustering criterion. We will work in the generalization of the proposed Intra_i criterion in order to handle any criterion that uses a similarity matrix, for example: methods based in the Graph theory [9], Compact sets [4], etc. The application of the method is not feasible to large data sets, and we will work in this direction, using techniques like GLC [10] in order to process large data sets, without calculate the similarity matrix. Figure 5. The objects corresponding to DS3 Figure 7. Clustering obtained with β0=0.90, for DS3 Figure 6. Clustering obtained with β0=0.89, for DS3 6 Conclusions Figure 8. Clustering obtained with β0=0.93, for DS3 [8] J. Han and M. Kamber, Data mining: concepts and techniques, The Morgan Kaufmann Series in Data Management Systems, Jim Gray Series Editor, 2000. [9] J. Auguston and J. Minker, An analysis of some graph theorical cluster techniques, Journal ACM, 17, 1970, pp. 571-588. [10] G. Sanchez-Diaz and J. Ruiz-Shulcloper, MID mining: a logical combinatorial pattern recognition approach to clustering large data sets, Proc. 5th Iberoamerican Symposium on Pattern Recognition, Lisbon, Portugal, 2000, pp. 475-483. Figure 9. Clustering obtained with β0=0.94, for DS3 (well-defined clustering discovered) Acknowledgement.- This work was financially supported by CONACyT (Mexico) through project J38707-A. References: [1] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification (2nd ed), Wiley, New York, NY 2000. [2] J. F. Martínez-Trinidad and A. Guzmán-Arenas, The logical combinatorial approach to pattern recognition an overview through selected works, Pattern Recognition 34(4), 2001, pp 741-751. [3] J. Ruiz-Shulcloper and A. Mongi, A. Logical Combinatorial Pattern Recognition: A Review, In Recent Research Developments in Pattern Recognition, ed. Pandalai, Pub. Transword Research Networks, USA. (To appear) [4] J. F. Martinez Trinidad, J. Ruiz Shulcloper, and M. Lazo Cortes, Structuraliation of universes, Fuzzy Sets and Systems, Vol. 112 No. 3, 2000, pp.485-500. [5] R. Pico Peña, Determining the similarity threshold for clustering algorithms in the Logical Combinatorial Pattern Recognition through a dendograme, Proc. 4th Iberoamerican Simposium of Pattern Recognition , Havana Cuba, 1999, pp 259-265. [6] R. Reyes Gonzales and J. Ruiz-Shulcloper, An algorithm for restricted structuralization of spaces, Proc. 4th Iberoamerican Simposium of Pattern Recognition , Havana Cuba, 1999, pp 267-278. [7] J. Ruiz-Shulcloper and J. Montellano-Ballesteros, A new model of fuzzy clustering algorithms, Proc. of the 3rd EUFIT, Aachen, Germany, 1995, pp.1484-1488.