Analysis of the efficiency of Data Clustering Algorithms on high
dimensional data
S. R. Pande 1
Associate Professor, Department of Computer Science,
Shivaji Science College, Congressnagar, Nagpur, India.
Email: [email protected]

Mrs. S. S. Pande 2
Assistant Professor, Department of Computer Applications,
Dhanwate National College, Congressnagar, Nagpur, India.
Email: [email protected]
ABSTRACT
In this paper we analyze the performance of four clustering techniques, namely Fuzzy C-means clustering, K-means clustering, Subtractive clustering and Mountain clustering, in conjunction with fuzzy modeling, on a medical problem of cancer disease diagnosis. The medical data used in this performance analysis, downloaded from the KDD database repositories, consists of 13 input attributes related to the clinical diagnosis of a cancer disease and one output attribute which indicates whether the patient is diagnosed with the cancer disease or not. The data set is partitioned into two subsets: two-thirds of the data for training and one-third for evaluation. The data set is to be partitioned into two clusters, i.e. patients diagnosed with the cancer disease and patients not diagnosed with it. Because of the high number of dimensions in the problem (13 dimensions), no visual representation of the clusters can be presented. The similarity metric used to calculate the similarity between an input vector and a cluster center is the Euclidean distance. Since most similarity metrics are sensitive to large ranges of the elements in the input vectors, each input variable must be normalized to the unit interval [0, 1], i.e. the data set has to be normalized to lie within the unit hypercube.
Each clustering technique is presented with the training data set and is implemented and analyzed in MATLAB. The performance of these techniques is presented and compared. The results of the experiments clearly show that K-means clustering outperforms the other techniques for this type of problem.
Keywords: Clustering, K-means Clustering, Fuzzy C-means Clustering, Mountain Clustering, Subtractive Clustering techniques
1. INTRODUCTION
Data mining is the process of extracting
previously unknown, valid and actionable information
from large databases and then using the information to
make crucial business decisions. In essence, data
mining is distinguished by the fact that it is aimed at the
discovery of information, without a previously
formulated hypothesis [1]. Data clustering [8], [14] plays
an important role in many disciplines, including data
mining, machine learning, bioinformatics, pattern
recognition, and other fields, where there is a need to
learn the inherent grouping structure of data in an
unsupervised manner. There are many clustering
approaches proposed in the literature with different
quality/complexity tradeoffs. Each clustering algorithm
works on its domain space with no optimum solution to
all datasets of different properties, sizes, structures, and
distributions. Challenges in data clustering include identifying the proper number of clusters, scalability of the clustering approach, robustness to noise, tackling distributed datasets, and handling clusters of different configurations.
In this paper, K-means Clustering, Fuzzy C-means Clustering, Mountain Clustering and Subtractive
Clustering techniques are reviewed. These techniques
are usually used in conjunction with radial basis
function networks (RBFNs) and Fuzzy Modeling.
Those four techniques are implemented and tested
against a medical diagnosis problem for cancer disease.
The results are presented with a comprehensive
comparison of the different techniques and the effect of
different parameters in the process.
2. DATA CLUSTERING OVERVIEW
The term cluster analysis (CA) was first used
by Tryon in 1939 [2] to denominate the group of
different algorithms and methods for grouping objects
of similar kind into respective categories. The main goal of clustering is to find proper and well-separated clusters of the objects. Cluster analysis groups objects (observations, events) based on the information found in the data describing the objects or their relationships. The aim is that objects in a group should be similar (or related) to one another and different from
(or unrelated to) the objects in other groups. The greater
the similarity (or homogeneity) within a group and the
greater the difference between groups, the better the
clustering. Cluster analysis is a classification of objects
from the data, where by “classification” we mean a
labeling of objects with class (group) labels. As such,
clustering does not use previously assigned class labels,
except perhaps for verification of how well the
clustering worked. Thus, cluster analysis is sometimes
referred to as “unsupervised classification” and is
distinct from “supervised classification,” or more
commonly just “classification,” which seeks to find
rules for classifying objects given a set of pre-classified
objects.
As mentioned above, the term, cluster, does
not have a precise definition. However, several working
definitions of a cluster are commonly used and are
given below. There are two aspects of clustering that
should be mentioned in conjunction with these
definitions. First, clustering is sometimes viewed as
finding only the most “tightly” connected points while
discarding “background” or noise points. Second, it is
sometimes acceptable to produce a set of clusters where
a true cluster is broken into several subclusters (which
are often combined later, by another technique). The
key requirement in this latter situation is that the
subclusters are relatively "pure," i.e., most points in a subcluster are from the same "true" cluster.
The common approach of all the clustering
techniques presented here is to find cluster centers that
will represent each cluster. A cluster center is a way to
tell where the heart of each cluster is located, so that
later when presented with an input vector, the system
can tell which cluster this vector belongs to by
measuring a similarity metric between the input vector
and all the cluster centers, and determining which cluster
is the nearest or most similar one.
Some of the clustering techniques rely on
knowing the number of clusters a priori. In that case the
algorithm tries to partition the data into the given
number of clusters. K-means [12] and Fuzzy C-means [9] clustering are of that type. In other cases it is
not necessary to have the number of clusters known
from the beginning; instead the algorithm starts by
finding the first large cluster, and then goes to find the
second, and so on. Mountain and Subtractive
clustering [14] are of that type. In both cases a problem with a known number of clusters can be handled; however, if the number of clusters is not known, K-means and Fuzzy C-means clustering cannot be used. A brief
overview of the four techniques is presented here. Full
detailed discussion will follow in the next section.
K-means is perhaps the most popular
clustering method in metric spaces [3]-[5]. Initially k
cluster centroids are selected at random. k-means then
reassigns all the points to their nearest centroids and
recomputes centroids of the newly assembled groups.
The iterative relocation continues until the criterion
function, e.g. square-error, converges. Despite its wide
popularity, k-means is very sensitive to noise and
outliers since a small number of such data can
substantially influence the centroids. Other weaknesses
are sensitivity to initialization, entrapments into local
optima, poor cluster descriptors, inability to deal with clusters of arbitrary shape, size and density, and reliance on the user to specify the number of clusters.
Fuzzy C-means clustering was proposed by Bezdek [7] as an improvement over the earlier hard C-means clustering. Fuzzy clustering methods assign
degrees of membership in several clusters to each input
pattern. The resulting fuzzy partition matrix (U)
describes the relationship of the objects and the clusters.
The fuzzy partition matrix U = [\mu_{i,k}] is a c \times N matrix, where \mu_{i,k} denotes the degree of membership of x_k in cluster C_i, so the i-th row of U contains the values of the membership function of the i-th fuzzy subset of X.
Mountain clustering was proposed by Yager
and Filev [6]. This technique calculates a
mountain function (density function) at every possible
position in the data space, and chooses the position with
the greatest density value as the center of the first
cluster. It then subtracts the effect of the first cluster from the mountain function and finds the second cluster center.
This process is repeated until the desired number of
clusters have been found.
Subtractive clustering was proposed by Chiu
[6]. This technique is similar to mountain clustering,
except that instead of calculating the density function at
every possible position in the data space, it uses the
positions of the data points to calculate the density
function, thus reducing the number of calculations
significantly.
3. DATA CLUSTERING TECHNIQUES
In this section a detailed discussion of each
technique is presented. Implementation and results are
presented in the following sections.
3.1 K-means Clustering
The k-means clustering technique is one of the simplest algorithms. We assume we have a set of data points, D = (X1, ..., Xn). We first choose from these data points k initial centroids, where k is a user parameter, the number of clusters desired. Each point is then assigned to its nearest centroid. The idea is to choose random cluster centres, one for each cluster. The centroid of each cluster is then updated as the mean of the points assigned to it, and this mean becomes the new centroid. We repeat the assignment and centroid-update steps until no point changes cluster, i.e. no point moves from one cluster to another, or equivalently, each centroid remains the same.
Algorithm: K-means Clustering
1: Choose k points as initial centroids.
2: Repeat
3:   Assign each point to the closest cluster centre.
4:   Recompute the cluster centre of each cluster.
5: Until the convergence criterion is met.
Fig. 1 Algorithm K-means clustering
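To make these steps concrete, the following is a minimal NumPy sketch of the procedure in Fig. 1. The paper's actual experiments were implemented in MATLAB; the function and variable names and the centroid-movement convergence test used here are our own illustrative choices.

import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, rng=None):
    # X: (N, d) data matrix, assumed normalized to the unit hypercube.
    rng = np.random.default_rng(rng)
    # Step 1: pick k data points at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of the points assigned to it.
        new_centroids = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                  else centroids[i] for i in range(k)])
        # Step 5: stop once the centroids stop moving.
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()   # objective of Eq. (1)
    return centroids, labels, sse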
We now consider each step of the basic K-means algorithm in more detail and then provide an analysis of the algorithm's space and time complexity.
1) Assigning points to the closest centroid: To assign a point to the closest centroid we need a proximity measure that quantifies the notion of "closest" for the specific data under consideration. Euclidean distance is often used for data points. However, several other proximity measures may be appropriate for given data; for example, Manhattan distance can be used for Euclidean data, while the Jaccard measure is often used for documents. Sometimes calculating the similarity measure for each point is time consuming; in Euclidean space it is possible to avoid some of these calculations and thus speed up the K-means algorithm.
2) Centroid and objective function: Step 4 of the algorithm is "Recompute the cluster centres of each cluster". The centroid can vary depending on the goal of clustering; for example, when proximity is measured by distance, the goal of clustering is to minimize the squared distance of each point to its closest centroid, and this goal is expressed by an objective function.
3) Data in Euclidean space: Consider the case where the proximity measure is Euclidean distance. For our objective function we use the Sum of the Squared Error (SSE), which is also known as scatter; that is, we calculate the error of each data point. The SSE is formally defined as follows:
SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \mathrm{dist}(c_i, x)^2          (1)

The centroid (mean) of the ith cluster is defined by Equation (2):

c_i = \frac{1}{m_i} \sum_{x \in C_i} x          (2)

where m_i is the number of objects in the ith cluster.

Steps 3 and 4 of the algorithm directly attempt to minimize the SSE. Step 3 forms groups by assigning points to their nearest centroid, which minimizes the SSE for the given set of centroids, and Step 4 recomputes the centroids so as to further minimize the SSE. The performance of the K-means algorithm depends on the initial positions of the cluster centers, thus it is advisable to run the algorithm several times, each with a different set of initial cluster centers.

3.2 Fuzzy C-means Clustering

The fuzzy c-means algorithm (FCM) [9], [11] is one of the most widely used methods in fuzzy clustering. The fuzzy c-means algorithm is very similar to the k-means algorithm, but in contrast to the k-means algorithm it aims to find a fuzzy partitioning of the data set. The objective of the FCM algorithm is to minimize the fuzzy c-means cost function formulated as:

J(X, U, V) = \sum_{i=1}^{c} \sum_{k=1}^{N} [\mu_{i,k}]^m \|x_k - v_i\|^2          (3)

where U = [\mu_{i,k}] is a fuzzy partition matrix, V = \{v_1, v_2, \dots, v_c\}, v_i \in \mathbb{R}^n, is the set of the cluster centers, \|x_k - v_i\| is a dissimilarity measure between the object x_k and the center v_i, and m is a weighting parameter that determines the fuzziness of the resulting clusters. The value of the cost function (3) is a measure of the total weighted within-group squared error.

The FCM algorithm is an iterative process similar to the k-means algorithm. As initialization it generates the membership matrix U with random values. The two-step iterative process works as follows. First the cluster centers are calculated. The cluster centers v_i are given as the weighted means of the data items that belong to a cluster, where the weights are the membership degrees. This can be formulated as follows:

v_i = \frac{\sum_{k=1}^{N} (\mu_{i,k})^m x_k}{\sum_{k=1}^{N} (\mu_{i,k})^m}, \quad 1 \le i \le c          (4)

In the next step FCM updates the fuzzy membership values based on the following formula:

\mu_{i,k} = \frac{1}{\sum_{j=1}^{c} \left( \frac{\|x_k - v_i\|}{\|x_k - v_j\|} \right)^{2/(m-1)}}, \quad 1 \le i \le c, \ 1 \le k \le N          (5)

The iteration terminates when the difference between the fuzzy partition matrices in two consecutive iterations is lower than a predefined threshold, or a predefined number of iterations is reached.

Algorithm: Fuzzy c-means
Given the data set X, choose the number of clusters 1 < c < N, the weighting exponent m > 1 and the termination tolerance \epsilon > 0. Initialize the fuzzy partition matrix randomly, such that U^{(0)} \in M_{fc}.
Repeat for t = 1, 2, ...
Step 1: Calculate the cluster centers v_i^{(t)} for all 1 \le i \le c with U^{(t-1)}:
  v_i^{(t)} = \frac{\sum_{k=1}^{N} (\mu_{i,k}^{(t-1)})^m x_k}{\sum_{k=1}^{N} (\mu_{i,k}^{(t-1)})^m}, \quad 1 \le i \le c
Step 2: Update the fuzzy partition matrix. If x_k = v_i^{(t)} then \mu_{i,k}^{(t)} = 1, else
  \mu_{i,k}^{(t)} = \frac{1}{\sum_{j=1}^{c} \left( \frac{\|x_k - v_i^{(t)}\|}{\|x_k - v_j^{(t)}\|} \right)^{2/(m-1)}}, \quad 1 \le i \le c, \ 1 \le k \le N
Until \|U^{(t)} - U^{(t-1)}\| < \epsilon
Fig. 2 Algorithm Fuzzy c-means clustering

As in K-means clustering, the performance of FCM depends on the initial membership matrix values; it is therefore advisable to run the algorithm several times, each starting with different values of the membership grades of the data points.
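For concreteness, the update rules (4) and (5) and the loop of Fig. 2 can be sketched as follows. This is an illustrative NumPy version under our own naming; the reported experiments were carried out in MATLAB.

import numpy as np

def fuzzy_c_means(X, c, m=2.0, eps=1e-5, max_iter=200, rng=None):
    # X: (N, d) data; returns centers V (c, d) and memberships U (c, N).
    rng = np.random.default_rng(rng)
    N = X.shape[0]
    U = rng.random((c, N))
    U /= U.sum(axis=0, keepdims=True)            # each column of U sums to one
    for _ in range(max_iter):
        Um = U ** m
        # Eq. (4): centers are membership-weighted means of the data items.
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # Eq. (5): update memberships from the squared distances to the centers.
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2) + 1e-12
        p = d2 ** (1.0 / (m - 1.0))              # equals ||x_k - v_i||^{2/(m-1)}
        U_new = 1.0 / (p * (1.0 / p).sum(axis=0, keepdims=True))
        # Terminate when the partition matrix barely changes between iterations.
        if np.abs(U_new - U).max() < eps:
            U = U_new
            break
        U = U_new
    return V, U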
3.3 Mountain Clustering
The mountain clustering approach is a simple
way to find cluster centers based on a density measure
called the mountain function. This method is a simple
way to find approximate cluster centers, and can be
used as a preprocessor for other sophisticated clustering
methods.
The first step in mountain clustering involves
forming a grid on the data space, where the
intersections of the grid lines constitute the potential
cluster centers, denoted as a set V .
The second step entails constructing a mountain
function representing a data density measure. The
height of the mountain function at a point 𝑣 ∈ 𝑉 is
equal to
m(v) = \sum_{i=1}^{N} \exp\left( - \frac{\|v - x_i\|^2}{2\sigma^2} \right)          (6)

where x_i is the ith data point and \sigma is an application-specific constant.
This equation states that the data density measure at a
point 𝑣 is affected by all the points 𝑥𝑖 in the data set,
and this density measure is inversely proportional to the
distance between the data points 𝑥𝑖 and the point under
consideration 𝑣 . The constant 𝜎 determines the height
as well as the smoothness of the resultant mountain
function.
The third step involves selecting the cluster
centers by sequentially destructing the mountain
function. The first cluster center 𝑐1 is determined by
selecting the point with the greatest density measure.
Obtaining the next cluster center requires eliminating
the effect of the first cluster. This is done by revising
the mountain function: a new mountain function is
formed by subtracting a scaled Gaussian function
centered at 𝑐1 :
m_{new}(v) = m(v) - m(c_1) \exp\left( - \frac{\|v - c_1\|^2}{2\beta^2} \right)          (7)
The subtracted amount eliminates the effect of
the first cluster. Note that after subtraction, the new
mountain function 𝑚𝑛𝑒𝑤 (𝑣) reduces to zero at 𝑣 = 𝑐1 .
After subtraction, the second cluster center is selected
as the point having the greatest value for the new
mountain function. This process continues until a
sufficient number of cluster centers is attained.
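The three steps above can be condensed into the following illustrative sketch of Eqs. (6) and (7). The grid construction and the values of sigma and beta are assumptions made for the example only; the comment on grid growth anticipates the cost analysis in Section 4.3.

import numpy as np
from itertools import product

def mountain_clustering(X, k, grid_per_dim=10, sigma=0.1, beta=0.15):
    # X: (N, d) data normalized to the unit hypercube. The grid grows as s^d,
    # which is exactly why this method becomes impractical in high dimensions.
    d = X.shape[1]
    axes = [np.linspace(0.0, 1.0, grid_per_dim)] * d
    V = np.array(list(product(*axes)))           # candidate centers: all grid points
    # Eq. (6): mountain (density) function evaluated at every grid point.
    dist2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    m = np.exp(-dist2 / (2.0 * sigma ** 2)).sum(axis=1)
    centers = []
    for _ in range(k):
        best = m.argmax()
        centers.append(V[best])
        # Eq. (7): subtract a scaled Gaussian centered at the newly found center.
        m = m - m[best] * np.exp(-((V - V[best]) ** 2).sum(axis=1) / (2.0 * beta ** 2))
    return np.array(centers)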
3.4 Subtractive Clustering
The problem with the previous clustering
method, mountain clustering, is that its computation
grows exponentially with the dimension of the problem;
that is because the mountain function has to be
evaluated at each grid point. Subtractive clustering
solves this problem by using data points as the
candidates for cluster centers, instead of grid points as
in mountain clustering. This means that the computation
is now proportional to the problem size instead of the
problem dimension. However, the actual cluster centers
are not necessarily located at one of the data points, but
in most cases it is a good approximation, especially
with the reduced computation this approach introduces. The density measure at a data point x_i is defined as:

D_i = \sum_{j=1}^{n} \exp\left( - \frac{\|x_i - x_j\|^2}{(r_a/2)^2} \right)          (8)
where 𝑟𝑎 is a positive constant representing a
neighborhood radius. Hence, a data point will have a
high density value if it has many neighboring data
points. The first cluster center 𝑥𝑐1 is chosen as the point
having the largest density value 𝐷𝑐1 . Next, the density
measure of each data point 𝑥𝑖 is revised as follows:
D_i \leftarrow D_i - D_{c_1} \exp\left( - \frac{\|x_i - x_{c_1}\|^2}{(r_b/2)^2} \right)          (9)
where 𝑟𝑏 is a positive constant which defines a
neighborhood that has measurable reductions in density
measure. Therefore, the data points near the first cluster
center 𝑥𝑐1 will have significantly reduced density
measure. After revising the density function, the next
cluster center is selected as the point having the greatest
density value. This process continues until a sufficient number of clusters is attained.
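A compact sketch of Eqs. (8) and (9) is given below. Stopping after a fixed number of clusters k, and fixing r_b = 1.5 r_a (the rule used later in the experiments of Section 4.4), are simplifying assumptions for this example rather than part of the original formulation.

import numpy as np

def subtractive_clustering(X, k, ra=0.5):
    # X: (N, d) normalized data; the data points themselves are the candidate centers.
    rb = 1.5 * ra                                 # revision radius, assumed as in Sec. 4.4
    dist2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # Eq. (8): density of each data point, driven by its neighbours within radius ra.
    D = np.exp(-dist2 / (ra / 2.0) ** 2).sum(axis=1)
    centers = []
    for _ in range(k):
        c = D.argmax()
        centers.append(X[c])
        # Eq. (9): reduce the density of points close to the newly selected center.
        D = D - D[c] * np.exp(-((X - X[c]) ** 2).sum(axis=1) / (rb / 2.0) ** 2)
    return np.array(centers)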
4. IMPLEMENTATION AND RESULTS
After discussions of the different clustering
techniques and their mathematical foundations, we now
turn to the practical study. This study involves the
implementation of each of these techniques on a set of
medical data related to cancer disease diagnosis
problem. The medical data used consists of 13 input
attributes related to clinical diagnosis of a cancer
disease, and one output attribute which indicates
whether the patient is diagnosed with the cancer disease
or not. The whole data set consists of 300 cases. The
data set is partitioned into two data sets: two-thirds of
the data for training, and one-third for evaluation. The
number of clusters into which the data set is to be
partitioned is two clusters; i.e. patients diagnosed with
the cancer disease, and patients not diagnosed with the
cancer disease. Because of the high number of
dimensions in the problem (13-dimensions), no visual
representation of the clusters can be presented; only 2-D or 3-D clustering problems can be visually inspected.
We will rely heavily on performance measures to
evaluate the clustering techniques rather than on visual
approaches. As mentioned earlier, the similarity metric
used to calculate the similarity between an input vector
and a cluster center is the Euclidean distance. Since
most similarity metrics are sensitive to the large ranges
of elements in the input vectors, each of the input
variables must be normalized to within the unit interval [0, 1], i.e. the data set has to be normalized to lie within the unit hypercube.
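In practice this normalization can be done by a simple min-max scaling of each attribute. The small sketch below assumes, purely for illustration, that the training-set minima and maxima are reused for the evaluation set; the paper does not state how its scaling was implemented.

import numpy as np

def normalize_unit_hypercube(train, test):
    # Min-max scale every attribute to [0, 1] using the training-set ranges.
    lo, hi = train.min(axis=0), train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)        # guard against constant attributes
    scale = lambda A: np.clip((A - lo) / span, 0.0, 1.0)
    return scale(train), scale(test)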
Each clustering algorithm is presented with the
training data set, and as a result two clusters are
produced. The data in the evaluation set is then tested
against the found clusters and an analysis of the results
is conducted. The following sections present the results
of each clustering technique, followed by a comparison
of the four techniques.
4.1 K-means Clustering
As mentioned in the previous section, K-means clustering works on finding the cluster centers by trying to minimize a cost function J. It alternates between updating the membership matrix and updating the cluster centers using Equations (1) and (2), respectively, until no further improvement in the cost function is noticed. Since the algorithm initializes the cluster centers randomly, its performance is affected by those initial cluster centers, so several runs of the algorithm are advised to obtain better results.
Evaluating the algorithm is realized by testing
the accuracy of the evaluation set. After the cluster
centers are determined, the evaluation data vectors are
assigned to their respective clusters according to the
distance between each vector and each of the cluster
centers. An error measure is then calculated; the root
mean square error (RMSE) is used for this purpose.
An accuracy measure is also calculated as the percentage of correctly classified vectors. The algorithm was run 10 times to determine the best performance. Table 1 lists the results of those runs.
Fig. 3 shows a plot of the cost function over time for the
best test case.
Table 1. K-means Clustering Performance Results

Test | No. of iterations | RMSE  | Accuracy | Regression line slope
1    | 8                 | 0.479 | 79%      | 0.561
2    | 7                 | 0.479 | 79%      | 0.561
3    | 8                 | 0.446 | 81%      | 0.610
4    | 6                 | 0.468 | 77%      | 0.564
5    | 4                 | 0.630 | 60%      | 0.385
6    | 4                 | 0.691 | 50%      | 0.067
7    | 3                 | 0.691 | 50%      | 0.056
8    | 8                 | 0.446 | 81%      | 0.611
9    | 10                | 0.446 | 81%      | 0.611
10   | 8                 | 0.460 | 79%      | 0.562
To further measure how accurately the identified clusters represent the actual classification of the data, a regression analysis of the resultant clustering against the original classification is performed. Performance is considered better if the regression line slope is close to 1.
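The three measures (RMSE, accuracy and regression-line slope) can be computed as in the sketch below, in which each evaluation vector is assigned to its nearest cluster center. The exact scoring code used in the experiments is not given in the paper, so this is only an illustrative reconstruction that assumes the cluster indices have already been matched to the diagnosis labels.

import numpy as np

def evaluate_clustering(centers, X_eval, y_eval):
    # centers: (2, d) cluster centers whose indices are assumed already matched
    # to the diagnosis labels (0 = not diagnosed, 1 = diagnosed).
    dists = np.linalg.norm(X_eval[:, None, :] - centers[None, :, :], axis=2)
    y_pred = dists.argmin(axis=1)                 # nearest-center assignment
    rmse = np.sqrt(np.mean((y_pred - y_eval) ** 2))
    accuracy = np.mean(y_pred == y_eval)
    # Slope of the regression of the resulting clustering on the original labels.
    slope = np.polyfit(y_eval, y_pred, 1)[0]
    return rmse, accuracy, slope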
Fig. 3: K-means cost function history
As seen from the results, the best case achieved 81% accuracy with an RMSE of 0.446. This relatively moderate performance is related to the high dimensionality of the problem; too many dimensions tend to disrupt the coupling of the data and introduce overlap in some of these dimensions, which reduces the accuracy of clustering. It is also noticed that the cost function converges rapidly to a minimum value, as seen from the small number of iterations in each test run. However, this has no effect on the accuracy measure.
4.2 Fuzzy C-means Clustering
FCM allows for data points to have different
degrees of membership to each of the clusters; thus
eliminating the effect of hard membership introduced
by K-means clustering. This approach employs fuzzy
measures as the basis for membership matrix
calculation and for cluster centers identification.
As is the case in K-means clustering, FCM starts by assigning random values to the membership matrix U, thus several runs have to be conducted to have a higher probability of getting good performance. However, the results showed no (or insignificant) variation in performance or accuracy when the algorithm was run several times. For testing the results, every vector in the evaluation data set is assigned to one of the clusters with a certain degree of belongingness (as done for the training set). However,
because the output values we have are crisp values
(either 1 or 0), the evaluation set degrees of
membership are defuzzified to be tested against the
actual outputs.
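Here defuzzification simply means taking, for each evaluation vector, the cluster with the highest membership degree, for example:

import numpy as np

def defuzzify(U_eval):
    # U_eval: (c, N) fuzzy membership matrix of the evaluation vectors.
    # The crisp label of a vector is the cluster with the largest membership degree.
    return np.argmax(U_eval, axis=0)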
The same performance measures applied in K-means clustering are used here; however, only the effect of the weighting exponent m is analyzed, since the random initial membership grades have an insignificant effect on the final cluster centers. Table 2 lists the results of the tests showing the effect of varying the weighting exponent m. It is noticed that very low or very high values of m reduce the accuracy; moreover, high values tend to increase the time taken by the algorithm to find the clusters. A value of 2 seems adequate for this problem since it gives good accuracy and requires a smaller number of iterations. Fig. 4 shows the accuracy and number of iterations against the weighting exponent.
Table 2. Fuzzy C-means Clustering Performance Results

Weighting exponent m | No. of iterations | RMSE  | Accuracy | Regression line slope
1.1                  | 19                | 0.469 | 79%      | 0.561
1.2                  | 17                | 0.469 | 79%      | 0.561
1.5                  | 18                | 0.480 | 76%      | 0.538
2                    | 20                | 0.469 | 79%      | 0.561
3                    | 25                | 0.458 | 77%      | 0.385
5                    | 28                | 0.480 | 76%      | 0.538
8                    | 34                | 0.480 | 76%      | 0.538
12                   | 37                | 0.480 | 76%      | 0.538
Fig. 4 Fuzzy C-means Clustering Performance

In general, the FCM technique showed no improvement over K-means clustering for this problem. Both showed similar accuracy; moreover, FCM was found to be slower than K-means because of the fuzzy calculations involved.

4.3 Mountain Clustering

Mountain clustering relies on dividing the data space into grid points and calculating a mountain function at every grid point. This mountain function is a representation of the density of data at this point. The performance of mountain clustering is severely affected by the dimension of the problem; the computation needed rises exponentially with the dimension of the input data because the mountain function has to be evaluated at each grid point in the data space. For a problem with k clusters, d dimensions, t data points, and a grid size of s per dimension, the required number of calculations is:

N = t \times s^d + (k - 1) s^d          (10)

So for the problem at hand, with input data of 13 dimensions, 200 training inputs, and a grid size of 10 per dimension, the required number of mountain function calculations is approximately 2.011 \times 10^{15}. In addition, the value of the mountain function needs to be stored for every grid point for later use when finding subsequent clusters, which requires s^d storage locations; for our problem this would be 10^{13} storage locations. Obviously this is impractical for a problem of this dimension.

In order to be able to test this algorithm, the dimension of the problem has to be reduced to a reasonable number, e.g. 4 dimensions. This is achieved by randomly selecting 4 variables from the original 13 input variables and performing the test on those variables. Several tests involving differently selected random variables are conducted in order to have a better understanding of the results. Table 3 lists the results of 10 test runs with randomly selected variables. The accuracy achieved ranged between 51% and 79% with an average of 69%, and an average RMSE of 0.5479. Those results are quite discouraging compared to the results achieved with K-means and FCM clustering. This is because not all of the variables of the input data contribute to the clustering process; only 4 are chosen at random to make it possible to conduct the tests. However, even with only 4 attributes, mountain clustering required far more time than any other technique during the tests; this is because the number of computations required is exponentially proportional to the number of dimensions in the problem, as stated in Equation (10). So apparently mountain clustering is not suitable for problems with more than two or three dimensions. Fig. 5 shows a plot of accuracy against the test runs.

Table 3. Mountain Clustering Performance Results

Test | RMSE  | Accuracy | Regression line slope
1    | 0.567 | 67%      | 0.350
2    | 0.470 | 79%      | 0.555
3    | 0.567 | 67%      | 0.344
4    | 0.500 | 75%      | 0.510
5    | 0.548 | 70%      | 0.427
6    | 0.567 | 67%      | 0.346
7    | 0.567 | 67%      | 0.346
8    | 0.528 | 72%      | 0.489
9    | 0.695 | 51%      | 0.027
10   | 0.470 | 79%      | 0.555

Fig. 5 Mountain Clustering Performance

4.4 Subtractive Clustering

This method is similar to mountain clustering, with the difference that the density function is calculated only at every data point, instead of at every grid point, so the data points themselves are the candidates for cluster centers. This has the effect of reducing the number of computations significantly, making it proportional to the number of input data points instead of exponentially proportional to the problem dimension. For a problem of k clusters and t data points, the required number of calculations is:

N = t^2 + (k - 1) t          (11)

As seen from the equation, the number of calculations does not depend on the dimension of the problem. For the problem at hand, with t = 200 training points and k = 2 clusters, this amounts to roughly 4 \times 10^4 calculations. Since the algorithm is fixed and does not rely on any randomness, the results are fixed. However, we can test the effect of the two variables r_a and r_b on the accuracy of the algorithm. These variables represent a radius of neighborhood beyond which the effect (or contribution) of other data points to the density function is diminished. Usually the r_b variable is taken to be 1.5 r_a. Table 4 shows the results of varying r_a, and Fig. 6 shows a plot of accuracy and RMSE against r_a.
Table 4. Subtractive Clustering Performance Results

Neighborhood radius r_a | RMSE  | Accuracy | Regression line slope
0.1                     | 0.678 | 54%      | 0.0994
0.2                     | 0.645 | 57%      | 0.1925
0.3                     | 0.645 | 57%      | 0.1925
0.4                     | 0.499 | 76%      | 0.5070
0.5                     | 0.499 | 76%      | 0.5070
0.6                     | 0.499 | 76%      | 0.5070
0.7                     | 0.499 | 76%      | 0.5070
0.8                     | 0.499 | 76%      | 0.5070
0.9                     | 0.645 | 57%      | 0.1925
Fig. 6 Subtractive Clustering Performance
It is clear from the results that choosing r_a very small or very large results in poor accuracy: if r_a is chosen very small, the density function does not take into account the effect of neighboring data points, while if it is taken very large, the density function is affected by all the data points in the data space. So a value between 0.4 and 0.7 should be adequate for the radius of neighborhood. As seen from Table 4, the maximum achieved accuracy was 76% with an RMSE of 0.499. Compared to K-means and FCM, this result falls slightly behind the accuracy achieved by those techniques.
5. RESULTS AND DISCUSSION
According to the previous discussion of the
implementation of the four data clustering techniques
and their results, it is useful to summarize the results
and present some comparison of performances. A
summary of the best achieved results for each of the
four techniques is presented in Table 5.
Table 5. Comparison of Performance

Comparison Parameters | K-means | FCM   | Mountain | Subtractive
RMSE                  | 0.440   | 0.469 | 0.470    | 0.499
Accuracy              | 81%     | 79%   | 79%      | 76%
Regression line slope | 0.610   | 0.561 | 0.555    | 0.5070
Time (sec)            | 0.8     | 2.3   | 117.0    | 3.60
From this comparison we can conclude that K-means clustering produces higher accuracy and lower RMSE than the other techniques, and requires less computation time. Mountain clustering performs very poorly, given the huge number of computations it requires and its low accuracy. However, we have to note that the tests conducted on mountain clustering used only part of the input variables in order to make it feasible to run the tests. Mountain clustering is suitable only for problems with two or three dimensions.
FCM produces close results to K-means
clustering, yet it requires more computation time
than K-means because of the fuzzy measures
calculations involved in the algorithm. In subtractive
clustering, care has to be taken when choosing the value
of the neighborhood radius 𝑟𝑎 , since too small radii
will result in neglecting the effect of neighboring data
points, while large radii will result in a neighborhood of
all the data points, thus canceling the effect of the cluster. Since none of the algorithms achieved a sufficiently high accuracy rate, it is assumed that the problem data itself contains some overlap in some of the dimensions; the high number of dimensions tends to disrupt the coupling of the data and reduce the accuracy of clustering.
As stated earlier in this paper, clustering
algorithms are usually used in conjunction with radial
basis function networks and fuzzy models. The
techniques described here can be used as preprocessors
for RBF networks for determining the centers of the
radial basis functions. In such cases, more accuracy can
be gained by using gradient descent or other advanced derivative-based optimization schemes for further refinement.
6. CONCLUSION

Four clustering techniques have been reviewed in this paper, namely K-means clustering, Fuzzy C-means clustering, Mountain clustering, and Subtractive clustering. These approaches solve the problem of categorizing data by partitioning a data set into a number of clusters based on some similarity measure, so that the similarity within each cluster is larger than between clusters. The four methods have been implemented and tested against a data set for the medical diagnosis of cancer disease. The comparative study done here is concerned with the accuracy of each algorithm, with care being taken toward the efficiency of calculation and other performance measures.

The medical problem presented has a high number of dimensions, which might involve some complicated relationships between the variables in the input data. It was obvious that mountain clustering is not one of the good techniques for problems with this high number of dimensions, due to its exponential proportionality to the dimension of the problem. K-means clustering seemed to outperform the other techniques for this type of problem. However, in other problems where the number of clusters is not known, K-means and FCM cannot be used, leaving the choice only to mountain or subtractive clustering. Subtractive clustering seems to be a better alternative to mountain clustering since it is based on the same idea and uses the data points as cluster center candidates instead of grid points; however, mountain clustering can lead to better results if the grid granularity is small enough to capture the potential cluster centers, but with the side effect of increasing the computation needed for the larger number of grid points.

Finally, the clustering techniques discussed here do not have to be used as stand-alone approaches; they can be used in conjunction with other neural or fuzzy systems for further refinement of the overall system performance.

REFERENCES

1. Cabena, P., Hadjinian, P., Stadler, R., Verhees, J., and Zanasi, A., Discovering Data Mining: From Concepts to Implementation, Prentice Hall, Upper Saddle River, NJ, 1997.
2. J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006.
3. J. Hartigan and M. Wong, "Algorithm AS 136: A k-means clustering algorithm," Applied Statistics, 28:100-108, 1979.
4. A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice-Hall, 1988.
5. J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. 5th Berkeley Symp. Math. Statist. Prob., pages 281-297, 1967.
6. Jang, J.-S. R., Sun, C.-T., Mizutani, E., Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence, Prentice Hall.
7. J. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
8. Azuaje, F., Dubitzky, W., Black, N., Adamson, K., "Discovering Relevance Knowledge in Data: A Growing Cell Structures Approach," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Vol. 30, No. 3, June 2000, pp. 448.
9. Lin, C., Lee, C., Neural Fuzzy Systems, Prentice Hall, NJ, 1996.
10. Tsoukalas, L., Uhrig, R., Fuzzy and Neural Approaches in Engineering, John Wiley & Sons, Inc., NY, 1997.
11. Nauck, D., Kruse, R., Klawonn, F., Foundations of Neuro-Fuzzy Systems, John Wiley & Sons Ltd., NY, 1997.
12. J. A. Hartigan and M. A. Wong, "A k-means clustering algorithm," Applied Statistics, 28:100-108, 1979.
13. The MathWorks, Inc., "Fuzzy Logic Toolbox - For Use With MATLAB," The MathWorks, Inc., 1999.
14. Jiawei Han and M. Kamber, Data Mining Concepts and Techniques, Morgan Kaufmann Publishers, an imprint of Elsevier, 2006.
15. K. Krishna and M. Murty, "Genetic k-means algorithm," IEEE Transactions on Systems, Vol. 29, No. 3, 1999.
16. U. Maulik and S. Bandyopadhyay, "Genetic algorithm-based clustering technique," Pattern Recognition 33, 1999.
17. K. A. Abdul Nazeer and M. P. Sebastian, "Improving the Accuracy and Efficiency of the K-means Clustering Algorithm," WCE 2009, London.