gSOM - a new gravitational clustering algorithm based on the self-organizing map
Nejc Ilc
University of Ljubljana, Faculty of Computer and Information Science
Trzaska 25, 1000 Ljubljana, Slovenia
[email protected]
Data clustering is a fundamental data analysis method, widely used in data mining, feature extraction, pattern recognition, function approximation, vector quantization and image segmentation. Numerous clustering algorithms exist, based on various theories and approaches. In this paper we present a novel two-level clustering algorithm (gSOM) based on Kohonen's self-organizing map (SOM) and gravitational clustering. In the first level, the SOM is applied to the input data and prototypes are extracted. In the second level, each prototype acts as a unit-mass point in a feature space in which the presence of a gravitational force is simulated. The proposed approach is able to discover complex cluster shapes, not limited only to spherical ones. It automatically determines the number of clusters in the data and can handle noise and outliers. Experiments with synthetic and real data were conducted to show the performance of the presented method in comparison with two other well-known clustering techniques.
Key words: clustering, data mining, self-organizing map, gravitational force.
1 Introduction
Clustering is the process of organizing data into clusters or natural groups such that data points assigned to the same cluster have high similarity, while the similarity between points assigned to different clusters is low. Unlike in classification, the groups are not predefined and the input data points are unlabelled. Clustering has been widely used in data mining, feature extraction, pattern recognition, function approximation, vector quantization and image segmentation [1].
In this paper, we present gSOM, a new clustering algorithm based on two established data analysis approaches: we combine the self-organizing map with the gravitational clustering approach, exploiting the advantages of both.
The self-organizing map (SOM), proposed by Kohonen [2], is a type of artificial neural network where a set of neurons is arranged in a low-dimensional structure, usually a 2-D grid. Through unsupervised training, each neuron or prototype is attached to a feature vector of the same dimension as the input space, considering the neighborhood relations between neurons. Therefore, the SOM preserves topological relations between points in the input space. By assigning each input data point to the neuron with the nearest feature vector, the SOM is able to divide the input space into regions with a common nearest feature vector, also called the best matching unit (BMU). Several attempts have been made to
cluster data with the SOM using a two-level approach. In the first level, the SOM is used to form a feature map of neurons, which are in the second level divided into as many regions as the predefined number of clusters. Each input data point can then be assigned to a cluster according to its nearest neuron. This approach has been addressed in [3] and [4]. In [3], the SOM is clustered using hierarchical clustering, where the merging criterion is based on inter-cluster and intra-cluster density and inter-cluster distances. In [4], a comparison was made between classical clustering methods and those based on the SOM, showing the greatly reduced computational complexity of the latter. Both hierarchical and partitional approaches were involved.
As mentioned, the SOM is combined with gravitational clustering, which assumes that every point in the data set can be viewed as a mass particle in the input space. If a gravitational force between points exists, they begin to move towards each other with respect to mass and distance. This nature-inspired notion was first adopted in the algorithm proposed by Wright [5]. Several other papers extend his work, proposing an algorithm robust to noise [6] and a solution for handling overlapping clusters with a fuzzy model [7].
In this paper we present a new algorithm based on both the SOM and gravitational clustering. The clustering is carried out in a two-level process: in the first level, the SOM is applied to the input data and best matching units, i.e. prototypes, are obtained. In the second level, each prototype acts as a unit-mass point in a feature space in which we simulate the presence of a gravitational force to group similar prototypes together.
The paper is organized as follows. In Section 2, the proposed algorithm, called gSOM, is described. Section 3 presents the performance of gSOM in comparison with the K-means and EM GMM algorithms on some synthetic and real data sets, along with a discussion. Finally, the conclusion is drawn in Section 4.
2 Proposed algorithm - the gSOM
We developed a clustering algorithm based on a two-level approach, depicted in Figure 1. First, a set of prototypes is created using the SOM as a vector quantization method. The nearest data points are attached to each prototype, which thus becomes a small, first-level cluster. There are a few times more prototypes than the expected number of clusters. In the next step, the prototypes are treated as objects in a feature space, where a force of gravity moves them towards each other. When two prototypes are close enough, they merge into a single prototype with mass unchanged, for reasons explained later in this section.
The main benefit of using the SOM in the first level of the proposed algorithm is to obtain topologically ordered representatives of the data. The prototypes are connected with each other in a grid, and the neighbors of each of them are known. We use this information to bound the influence of the gravitational field around a prototype and therefore stabilize the clustering process in the second step. Another advantage of the SOM is the reduction of noise, which alleviates the problem with outliers. The prototypes are local averages of the data and therefore less sensitive to random
variations than the original data. Finally, the SOM reduces the computational complexity of clustering a data set, especially when the number of input points is huge, as shown in [4].
Fig. 1. Two-level approach scheme. Data points in a) are mapped to the SOM prototypes in b). BMUs are then merged together by the gravitational clustering algorithm, as shown in c).
2.1 The first level - training the SOM
The SOM is a regular two-dimensional grid of M map units, each of them represented by a prototype vector m_i = [m_i1, ..., m_id], where d is the dimension of the input vectors. The units are connected to adjacent ones by a neighborhood relation. The SOM can be viewed as an elastic net that folds onto the shape of the input data during training. Data points lying near each other in the input space are mapped onto nearby map units, thus preserving topology.
The SOM is trained iteratively in a "batch" mode, where the whole data set is presented to the map before any adjustments are made. In each training step, the data set is partitioned according to the Voronoi regions of the map units. In simple terms, each data point x_j belongs to the map unit to which it is closest.
After this, the new map units are calculated as:
m_i(t+1) = \frac{\sum_{j=1}^{n} h_{ic(j)}(t) \, x_j}{\sum_{j=1}^{n} h_{ic(j)}(t)},    (1)
where c(j) is the best matching unit (BMU) of data point x_j and n is the number of points in the data set. The new value of map unit m_i is a weighted average of the data points, where the weight of each data point is the neighborhood kernel value h_ic(j)(t) centered on its BMU c(j). The neighborhood kernel function has a width defined by a parameter σ that decreases monotonically with time.
Before training, the map has to be initialized. Linear initialization is used to fold the map into the subspace spanned by the two eigenvectors with the greatest eigenvalues computed from the input data. The maps are then trained in two
phases: a rough phase with a larger initial neighborhood width σ1 and a fine-tuning phase with a smaller initial neighborhood width σ2. For training the SOM, we use a toolbox for MATLAB - the SOM Toolbox, available under the GNU General Public License at http://www.cis.hut.fi/somtoolbox/.
As the result of the first level of the algorithm, we obtain the SOM prototypes, which represent the original data. Interpolating prototypes - those that are not the BMU of any data point - are eliminated. Only BMUs are taken into the second level, each representing the portion of the original data points attached to it.
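A minimal sketch of this filtering step (our own illustration, with hypothetical names): keep only the prototypes that serve as the BMU of at least one data point, together with the sizes of their first-level clusters.

```python
import numpy as np

def extract_bmus(X, M):
    """Drop interpolating prototypes; return BMUs and their cluster sizes."""
    d2 = ((X[:, None, :] - M[None, :, :]) ** 2).sum(axis=2)
    bmu = d2.argmin(axis=1)                    # BMU index of each data point
    units, counts = np.unique(bmu, return_counts=True)
    return M[units], counts                    # only prototypes that represent data
```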
2.2 The second level - applying the gravity
The BMUs identified in the first level of the algorithm are now interpreted as d-dimensional particles in the d-dimensional space with mass equal to 1. During the iterations of the algorithm, each particle is moved around according to a simplified version of the Law of Gravitation combined with Newton's second law of motion, as proposed in [6]. The new position of point x influenced by the gravity of point y is:
x(t+1) = x(t) + \frac{G}{\lVert d \rVert} \cdot \frac{d}{\lVert d \rVert^2},    (2)
where d is the vector pointing from point x to point y (so ||d|| is the Euclidean distance between them) and G is the gravitational constant. In order to avoid the big-crunch effect, where all points collapse into a single point, G is reduced in each iteration by a proportion ΔG, so G = (1 − ΔG) · G. When two points are close enough, i.e. ||d|| is lower than a parameter α, they are merged into a single point with mass still equal to 1. This may seem strange, but in this way clusters with greater density do not affect those with smaller density. The experiments presented in Section 3 show that such an approach is beneficial. Obviously, the number of points decreases over the iterations when an appropriate G is chosen. In each iteration of the algorithm, every point x in the remaining set of points, denoted P, is visited once. We then choose another point y and move them according to Equation 2. Point y is either one of the neighbors of x or a random point: with probability p, which is a parameter of the algorithm, a random point is selected; otherwise one of the existing neighbors of x is chosen. The algorithm stops when G has been reduced enough that the movements of points fall below some threshold value. Alternative stopping criteria are reached when only a single point remains in P or when a predefined maximum number of iterations is exceeded. The points in the set P are the final cluster representatives, as illustrated in Figure 1c). The number of finally discovered clusters obviously depends on the data set itself and on the values of the parameters mentioned above.
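Putting the above together, here is a compact Python sketch of the second level. It follows the paper's parameters (G, ΔG, α, p), but simplifies one detail: the "neighbor" of a point is taken to be its nearest remaining point rather than a SOM grid neighbor, so this is an approximation of gSOM's second level, not a faithful re-implementation.

```python
import numpy as np

def gravity_level(P, G=8e-4, dG=0.045, alpha=0.01, p=0.1,
                  move_tol=1e-6, max_iter=1000, seed=0):
    """Move unit-mass points under Eq. (2), merging pairs closer than alpha."""
    rng = np.random.default_rng(seed)
    P = P.astype(float).copy()
    for _ in range(max_iter):
        if len(P) == 1:
            break
        largest_move = 0.0
        i = 0
        while i < len(P):
            d2 = ((P - P[i]) ** 2).sum(axis=1)
            d2[i] = np.inf
            # random partner y with probability p, otherwise the nearest point
            j = int(rng.integers(len(P))) if rng.random() < p else int(d2.argmin())
            if j != i:
                delta = P[j] - P[i]
                dist = float(np.linalg.norm(delta))
                if dist < alpha:
                    P = np.delete(P, j, axis=0)   # merge; mass stays 1 by design
                    if j < i:
                        i -= 1                    # indices shifted after delete
                else:
                    step = (G / dist) * (delta / dist ** 2)   # Eq. (2)
                    P[i] += step
                    largest_move = max(largest_move, float(np.linalg.norm(step)))
            i += 1
        G *= (1.0 - dG)                # shrink G to avoid the big crunch
        if largest_move < move_tol:    # movements below threshold: stop
            break
    return P                           # remaining points = cluster representatives
```

In gSOM itself, each data point would finally receive the label of the representative into which its BMU was merged.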
3 Experiments and results
In the experiments, five synthetic (a-e) and two real (f and g) data sets were used, demonstrating the capabilities of the proposed clustering algorithm in comparison with the well-known K-means and Expectation-Maximization Gaussian Mixture
Model (EM GMM) algorithms [8]. Table 1 summarizes the properties of each data set, and their plots are presented in Figure 2, showing the data already clustered by the gSOM algorithm. Note that data sets f) and g) are plotted using the PCA (principal component analysis) projection due to their higher dimensionality. All variables in each data set were linearly scaled to fit in the range [0, 1] before clustering was carried out.
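The rescaling step can be written in a couple of lines; this generic min-max sketch is ours and not tied to any particular toolbox:

```python
import numpy as np

def minmax_scale(X):
    """Linearly scale each variable (column) of X into the range [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / np.where(hi > lo, hi - lo, 1.0)  # guard constant columns
```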
3.1 Data sets
a) Data set Giant consists of 862 2-D points and has two clusters: one small spherical cluster on the right side (10 points) and one huge spherical cluster (852 points) on the left side. The much greater density of the leftmost cluster, compared to the other one, is the difficulty here, leading algorithms to split the giant instead of finding the dwarf.
b) Data set Ring consists of 800 2-D points forming two concentric rings, each containing 400 points. Non-linearity and sophisticated connectivity are present here to challenge the methods.
c) Data set Wave was generated to measure the algorithms' performance on highly irregular, longitudinal and linearly non-separable clusters. The 2-D data consists of 148 points in the upper and 145 points in the lower wavy curve.
d) Data set Moon is another problem domain with linearly non-separable clusters. Here, four clusters are defined, containing 104, 150, 150 and 110 2-D points, from the topmost to the lowermost cluster, respectively.
e) Data set Flag consists of 640 points that form three clusters. The spherical cluster in the middle contains 100 2-D points; the clusters above and beneath it contain 270 points each. We created this domain to see how the methods deal with different cluster shapes.
f) The Iris data set [10] has been widely used in classification tasks. It has 150 points of four dimensions, divided into three classes of iris plant with 50 points each. The first class is linearly separable from the other two. The second and the third classes overlap and are linearly non-separable.
g) The Wine data set [10] has 178 13-D points with three known classes of wines derived from three different cultivars. The numbers of data points in the classes are 59, 71 and 48, respectively.
Table 1. Data sets used for performance measurements. The number of clusters is given as the ground truth determined by a human.
data set   points  dimensions  clusters
a) Giant      862           2         2
b) Ring       800           2         2
c) Wave       293           2         2
d) Moon       514           2         4
e) Flag       640           2         3
f) Iris       150           4         3
g) Wine       178          13         3
Fig. 2. Data sets used in the experiments. The results of clustering with gSOM are displayed using different marker shapes and colors for different clusters.
3.2 Parameter settings
As described in Section 2, eight parameters need to be set for the gSOM algorithm to work, which is the main disadvantage of the proposed approach. Fortunately, experiments on parameter sensitivity show that constant or automatically selected values can be used for the majority of them. Table 2 displays the experimentally selected parameter values for each problem domain. The size of the SOM was automatically determined for all data sets, except the one called Moon, using the following heuristic: the number of prototypes is M = 5 · n^{1/2}, where n is the number of points in the data set. The ratio between the numbers of prototypes along the two dimensions of the 2-D grid is the square root of the ratio between the two largest eigenvalues of the data in the input space. The lattice that defines the neighborhood of the prototypes is chosen to be rectangular, except in the case of the Wine data set, where a hexagonal one is used. Based on our work, we suggest the following parameter values for the second-level clustering of unknown data: G = 0.0008, ΔG = 0.045, α = 0.01 and p = 0.1. The maximal allowed number of iterations for EM GMM and K-means was 30 and 100, respectively, in order to assure convergence.
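A sketch of this sizing heuristic in Python (our own reading of the description above; the rounding policy is an assumption, so the resulting grids may differ slightly from those in Table 2):

```python
import numpy as np

def som_grid_size(X):
    """Heuristic map size: M = 5 * n**0.5 units, side ratio from eigenvalues."""
    n = len(X)
    M = 5.0 * np.sqrt(n)
    # eigenvalues of the data covariance matrix, in ascending order
    eigvals = np.linalg.eigvalsh(np.cov(X, rowvar=False))
    ratio = np.sqrt(eigvals[-1] / max(eigvals[-2], 1e-12))
    rows = max(1, int(round(np.sqrt(M * ratio))))   # longer side of the grid
    cols = max(1, int(round(M / rows)))             # keep rows * cols close to M
    return rows, cols
```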
3.3 Comparison and evaluation
Clustering of the seven data sets was performed by the proposed gSOM and two other well-known algorithms, EM GMM and K-means. The results were compared with the predefined classes, i.e. the ground truth, and the clustering error was computed - its minimal, maximal and mean value over 100 runs of each algorithm.
Table 2. Selected values of the parameters for the gSOM algorithm. SOM size is the size of the 2-D grid of prototypes; lattice denotes their neighborhood relations (rectangular or hexagonal); σ1 and σ2 are the initial neighborhood widths for the rough and fine-tuning phases of the SOM; G is the initial gravitational constant; ΔG the reduction proportion of G; α the merging distance; and p the probability of choosing a random point instead of a neighbor.
data set   SOM size  lattice  σ1  σ2  G        ΔG     α     p
a) Giant   14 × 10   rect      2   1  0.00010  0.020  0.01  0.0
b) Ring    13 × 11   rect      2   1  0.00085  0.040  0.01  0.1
c) Wave    11 × 8    rect      2   1  0.00080  0.045  0.01  0.1
d) Moon    20 × 10   rect      3   1  0.00080  0.045  0.01  0.1
e) Flag    17 × 8    rect      3   1  0.00080  0.045  0.01  0.1
f) Iris    13 × 5    rect      2   1  0.00080  0.045  0.01  0.1
g) Wine    11 × 6    hexa      2   1  0.00085  0.030  0.07  0.2
Clustering error is a performance criterion defined as the proportion of wrongly labeled points compared to the true class information [9]. The average running time was also measured. The results of the experiments are collected in Table 3.
Considering the minimal and mean clustering errors on data sets a-e, it is clear that gSOM outperformed both EM GMM and K-means. More concerning are the relatively high deviation of its results, which points to unstable behavior in the second level of gSOM, and the longer execution time, by a factor of ten approximately. The latter is an effect of using a two-level algorithm on relatively small data sets. However, only gSOM was able to perfectly cluster the complex data shapes given in sets c) and d), which is quite a success. It was less successful when it came to the real data sets Iris and Wine (sets f and g), where the other two algorithms were slightly better.
4 Conclusion
A new two-level clustering algorithm, gSOM, was presented in this paper. According to the results of the experiments, the performance of the presented method is acceptable, and the method is useful for discovering clusters in complex, linearly non-separable data sets with arbitrary shapes. It also automatically determines the number of clusters in the data and is robust to noise and outliers, thus fully exploiting the advantages of a two-level approach. Our further work will concentrate on improving the algorithm's speed by employing more efficient data structures such as graphs. Furthermore, the relations between the parameters, especially G and ΔG, have to be studied in order to set them automatically according to the features of the input data. Additional comparisons with other two-level methods will be made; preliminary experiments have already shown some promising results.
Table 3. Performance comparison of the gSOM, EM GMM and K-means algorithms.

                        clustering error               elapsed
data set   algorithm    max    min    mean   std       time (s)
a) Giant   gSOM         0.466  0.000  0.041  0.0942    0.3200
           EM GMM       0.095  0.000  0.080  0.0128    0.0342
           K-means      0.463  0.000  0.442  0.0902    0.0076
b) Ring    gSOM         0.388  0.230  0.339  0.0267    0.2598
           EM GMM       0.500  0.498  0.500  0.0007    0.0314
           K-means      0.500  0.498  0.500  0.0006    0.0315
c) Wave    gSOM         0.495  0.000  0.215  0.1909    0.1537
           EM GMM       0.498  0.259  0.492  0.0334    0.0206
           K-means      0.457  0.160  0.422  0.0805    0.0081
d) Moon    gSOM         0.494  0.000  0.058  0.1077    0.4350
           EM GMM       0.523  0.274  0.418  0.0579    0.0390
           K-means      0.502  0.502  0.502  0.0000    0.0151
e) Flag    gSOM         0.000  0.000  0.000  0.0000    0.2496
           EM GMM       0.366  0.000  0.114  0.1634    0.0201
           K-means      0.348  0.000  0.228  0.1325    0.0069
f) Iris    gSOM         0.333  0.033  0.253  0.0890    0.1336
           EM GMM       0.493  0.033  0.106  0.1487    0.0179
           K-means      0.113  0.113  0.113  0.0000    0.0063
g) Wine    gSOM         0.404  0.056  0.209  0.0770    0.1852
           EM GMM       0.461  0.017  0.064  0.0957    0.0268
           K-means      0.051  0.045  0.047  0.0029    0.0110
References
1. Dunham, M.H.: Data Mining: Introductory and Advanced Topics. Prentice Hall (2003)
2. Kohonen, T.: Self-Organizing Maps. Springer (2001)
3. Wu, S., Chow, T.W.: Clustering of the self-organizing map using a clustering validity index based on inter-cluster and intra-cluster density. Pattern Recognition 37, 175–188 (2004)
4. Vesanto, J., Alhoniemi, E.: Clustering of the Self-Organizing Map. IEEE Trans. on Neural Networks 11(3), 586–600 (2000)
5. Wright, W.E.: Gravitational clustering. Pattern Recognition 9, 151–166 (1977)
6. Gomez, J., Dasgupta, D., Nasraoui, O.: A new gravitational clustering algorithm. In: Proceedings of the Third SIAM International Conference on Data Mining (2003)
7. Orhan, U., Hekim, M., Ibrikci, T.: Gravitational Fuzzy Clustering. In: Gelbukh, A., Morales, E.F. (eds.) MICAI 2008. LNAI, vol. 5317, pp. 524–531. Springer, Heidelberg (2008)
8. Bishop, C.M.: Pattern Recognition and Machine Learning. Springer (2006)
9. Meilă, M., Heckerman, D.: An Experimental Comparison of Model-Based Clustering Methods. Machine Learning 42, 9–29 (2001)
10. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository, http://www.ics.uci.edu/~mlearn/MLRepository.html