SYNTHETIC DATASETS FOR CLUSTERING ALGORITHMS
By
Jhansi Rani Vennam
A THESIS SUBMITTED IN PARTIAL FULFILLMENT OF THE
REQUIREMENTS FOR THE DEGREE OF
MASTER OF SCIENCE BY RESEARCH
(COMPUTER SCIENCES)
Under the Guidance of
Prof. Kamalakar Karlapalem
at the
International Institute of Information Technology - Hyderabad
June 2006
Abstract
The process of grouping similar objects in a given dataset is known as clustering. A large variety of clustering algorithms have been proposed to find clusters in a given dataset. Not many real-life datasets are available for testing the proposed algorithms, and the existing datasets do not come with the actual clustering result. This leads to the idea of generating "benchmarking" datasets with high dimensionality and noise, which can evaluate clustering algorithms on various aspects like scalability, accuracy and robustness to noise.
We first propose a few algorithms and methodologies that generate high-dimensional cluster datasets in R^d space along with the original clustering results. We developed a toolkit called SynDECA[1] that generates synthetic datasets based on the proposed algorithms. Given user inputs such as the number of clusters, the dimensionality, the maximum value of a dimension and the size of the dataset, SynDECA generates the clustering dataset. The proposed methods ensure that there are exactly the requested number of clusters in the dataset.
Traditional clustering algorithms try to find clusters in all dimensions of the dataset. When the dimensionality of the dataset increases, some dimensions could be irrelevant for some data points. There could be clusters which are spread in only a subset of the dimensions of the dataset; such clusters may not be visible when seen in all the dimensions. A number of subspace clustering algorithms have been proposed to find such clusters. We propose methods to generate datasets which are useful for evaluating subspace clustering algorithms. When the subspace cluster datasets were used as input, DBSCAN could not identify the subspace clusters, whereas K-Means got confused by the presence of noise, as expected.
Contents
1 Introduction
  1.1 Cluster Data
  1.2 Contributions
  1.3 Structure of Thesis
2 Clustering Methods
  2.1 Categories of Clustering Algorithms
  2.2 Outliers
3 Related Work
4 Synthetic Datasets
  4.1 SynDECA - Framework
    4.1.1 Notation
  4.2 Algorithms
    4.2.1 Algorithm for Cluster Placement
    4.2.2 Algorithm for Cluster Points Generation
    4.2.3 Algorithm for Noise Points Generation
    4.2.4 Complexity of Algorithms
  4.3 Evaluation of datasets generated
    4.3.1 Discussion
  4.4 Summary
5 Subspace Cluster Data
  5.1 Subspace Clustering Methods
  5.2 Subspace Clustering Datasets
  5.3 Summary
6 Experimental Results
  6.1 Comparison with the Existing Work
  6.2 Validation of SynDECA Datasets
7 Conclusions and Future Work
List of Figures
2.1 Working of partition clustering method
2.2 Working of hierarchical clustering methods
2.3 Working of DBSCAN algorithm
4.1 Generation of data
4.2 Algorithm for Cluster Placement
4.3 Division of the given space into cells which can accommodate clusters with minimum radius
4.4 Algorithm for Cluster Points Generation
4.5 Algorithm for Noise Points Generation
5.1 Example how different dimensions are important to different clusters
5.2 Clusters according to projected subspace clustering
5.3 Subspace clustering dataset
6.1 Few examples of the generated data in 2-Dimensions
6.2 Few examples of the generated data in 3-Dimensions
6.3 Top row datasets are from "clusutils"; bottom row datasets are from "SynDECA"
6.4 Plot of generated datasets (a) dataset1 (b) dataset2
6.5 Plot of clustering result of K-Means algorithm on (a) dataset1 (b) dataset2
6.6 Plot of dataset3 (a) Original (b) Result of K-Means algorithm
6.7 Plot of dataset4 (a) Original (b) Result of K-Means algorithm
6.8 Plot of clustering result of DBSCAN algorithm on (a) dataset1 (b) dataset2
6.9 Plot of dataset5 (a) Original (b) Result of DBSCAN algorithm
6.10 Plot of subspace-dataset1 (a) three dimensional plot (b) two dimensional projection
6.11 Clustering result of subspace-dataset1 (a) K-Means (b) DBSCAN
6.12 Plot of subspace-dataset2 (a) three dimensional plot (b) two dimensional projection
6.13 Clustering result of subspace-dataset2 (a) K-Means (b) DBSCAN
6.14 Plot of subspace-dataset3 (a) Original (b) Result of DBSCAN
6.15 Plot of clustering result of ReCkless algorithm on dataset1 (a) k=25 (b) k=60
6.16 Plot of clustering result of ReCkless algorithm on dataset1 (a) k=90 (b) k=162
6.17 Plot of clustering result of ReCkless algorithm on dataset5 (a) k=32 (b) k=33
6.18 Plot of clustering result of ReCkless algorithm on dataset5 (a) k=50 (b) k=78
7.1 Clusters in arbitrary shaped bounding boxes
Chapter 1
Introduction
For complex problems such as data clustering, intrusion detection and spatio-temporal applications, it is difficult to evaluate solutions and algorithms because not enough real-life datasets are available. Therefore there is a need to come up with synthetic datasets for comprehensive evaluation of the algorithms. Manually generating test data is a time-consuming process and prone to human error, so when real datasets are not available, people do create synthetic data. While generating data for a particular algorithm or problem, the following properties must be satisfied.
1. Validity of dataset.
2. Circumstances for which the dataset is valid.
3. Expected behavior of the algorithm with the dataset.
4. Required format.
In the area of spatio-temporal data management, datasets are required in order to evaluate spatio-temporal databases,
spatio-temporal data modeling, query languages, spatio-temporal data mining and spatio-temporal indexing. A recent
work on data generation for spatio-temporal data can be found in [2]. Given a few parameters such as the duration, shift and resizing of an object, tools like GSTD[3], G-TERD[4], and Oporto[5] will generate spatio-temporal data. Of these, Oporto mimics a very specific scenario: fishing at sea. However, since these techniques are spatio-temporal, the generated data is at most four-dimensional.
In fields such as intrusion detection, synthetic data helps in analyzing at which level an intrusion can be predicted or detected without affecting security. Even if such an attack occurs, there will be a financial loss but not a loss of life. In contrast, when analyzing the effects of earthquakes or changes in a nuclear reactor, synthetic datasets help mankind in estimating the effects and taking the necessary actions in advance.
Benchmarks are similar to synthetic datasets, but benchmarks help in evaluating the performance of a system rather than just checking the validity of an algorithm. In the field of databases, benchmarking attempts to measure how fast a database is and how much cost is involved, so that it can help in determining the best suited database for an application. The Transaction Processing Performance Council (TPC), a group of hardware vendors and database companies, was formed in the late 1980s to give third-party objectivity to database benchmarking. The TPC benchmarks have evolved and multiplied over time, but they have always provided two measurements: the transaction rate of a database, and the
cost per transaction including hardware, software, and maintenance. Almost all the major companies in this field benchmark their DBMSs.
1.1 Cluster Data
Clustering is the process of grouping a set of objects into classes of similar objects. A cluster is a collection of data
objects that are similar to objects within the same cluster and dissimilar to those in other clusters [6]. Similarity
between two objects is calculated using a distance measure. The family of L_k-norm distances and the Mahalanobis distance functions are a few distance measures to mention.
The dimensionality of the data plays an important role in the process of clustering. The result of any clustering algorithm can be visualized and verified only if the dimensionality of the data is less than or equal to three. If there are datasets with the actual clustering result, they help in correcting the behavior of the clustering algorithm. There are very few real-life datasets with large dimensionality in this field, and those datasets do not provide the actual clustering result. Our aim is to generate datasets in R^d (d is the number of dimensions) together with the clustering result, given the following inputs from the user.
1. No. of clusters
2. No. of Points
3. Dimensionality
4. Max value for each Dimension.
Though there are a few existing methods to generate cluster data, they do not provide solid proof about the number of clusters. They divide the attribute range into non-overlapping sub-ranges and assign each sub-range to a cluster. A few methods generate a random point and then distribute the points according to the randomly generated range and variance assigned to that cluster. In such cases there is a possibility that two clusters merge into a single cluster.
The generated data must contain exactly the requested number of clusters, otherwise the basic aim of generating synthetic cluster data will not be met. We ensure that the data generated will have exactly the requested number of clusters. In general there are a few objects in the dataset which are not attached to any cluster; those points are referred to as outliers/noise. The data generated should include some amount of noise. The attributes of each object could be floating-point or categorical. Categorical values do not have an order defined among them, so it is difficult to work with such attributes.
Of late, researchers have found that some clusters may not be spread in all the dimensions of the dataset; that is, a cluster can be identified in some subset of the given dimensions. We also propose a method to generate subspace cluster datasets.
1.2 Contributions
The contributions of this thesis are as follows.
We proposed efficient algorithms for generating cluster datasets in R^d space. The method of generation ensures that there are exactly the requested number of clusters in the dataset. None of the existing techniques generate arbitrary shaped clusters; we proposed a method to generate such clusters. We also provide the clustering result along with the dataset, which can be used to verify the result of clustering algorithms. Along with the traditional cluster data, we generate datasets containing subspace clusters.
1.3 Structure of Thesis
The remainder of this thesis is divided into six chapters. In chapter two we introduce the concept of clustering and the various techniques available. Chapter three deals with the existing cluster data generators. Chapter four is the heart of this thesis; it explains how we generate the cluster data and gives the required proofs. The fifth chapter discusses subspace clustering methods and subspace cluster data. We provide some experimental results in chapter six. Conclusions and future work are given in chapter seven, which is followed by the bibliography.
Chapter 2
Clustering Methods
Given a set of objects, the process of dividing this set into subsets of objects is known as clustering. Objects in a cluster are more similar to each other when compared to those of a different cluster. Similarity between two objects is calculated using a distance measure.¹ The family of L_k-norm distances and the Mahalanobis distance functions are a few distance measures to mention. The major goal of clustering is to identify underlying patterns based on the similarities between the objects. Clustering is also known as unsupervised learning. Since clustering forms groups, it can be used as a pre-processing step for methods like classification.
Of late, with the explosion of data, the amount of data to be processed for clustering has increased tremendously. This imposes the following requirements on clustering algorithms:
1. Scalability of the algorithm: Firstly, the order of the input records in a dataset should not affect the output of the
clustering algorithm. The algorithm should also be able to handle very large datasets.
2. High-dimensional data: The algorithms should be able to handle large dimensional data.
3. Heterogeneous attributes: The ability to handle various types of attributes (numerical, categorical) is important, due to the increasing prevalence of heterogeneous data.
4. Complex shapes of clusters: The accuracy of the clustering result should be acceptable despite the presence of
complex shaped clusters.
5. Noise: The algorithm should be able to handle noise too. The presence of noise should not degrade the accuracy or efficiency of the clustering algorithm.
The most recent work in clustering algorithms [7], [8], [9], [10] addresses almost all the issues mentioned above. However, any clustering algorithm needs to be evaluated for its ability to handle the above mentioned issues. For the sake of evaluation, we need datasets which are large, noisy and high-dimensional with the presence of complex clusters. In addition to this, the "actual" clustering result should also be available to benchmark the results given by various clustering algorithms. Such datasets along with the clustering results are very few in number. Though there are a few readily available real-life datasets, the actual clustering results are not known. This necessitates a tool capable of generating high-dimensional, noisy datasets along with the original clustering results.
¹ The terms similarity and distance are used interchangeably: the greater the distance between two objects, the less similar they are.
Our toolkit SynDECA (Synthetic Datasets to Evaluate Clustering Algorithms) generates large, noisy, high-dimensional datasets. SynDECA also provides the information about each point (whether it belongs to a cluster or to noise) and a brief statistical description of the clusters present in the dataset.
2.1 Categories of Clustering Algorithms
Grouping of the objects can be done using partitioning, hierarchical, density-based, grid-based or model-based techniques, each of which is explained below.
1. Partitioning Method: Given n objects and the number of partitions k (k < n), the partitioning method divides the objects into k partitions (i.e. k clusters). These algorithms work iteratively to improve the quality of the partitioning. In each iteration a representative object is found for each partition and all the remaining objects are associated with one of these representatives. The most popular partitioning algorithms are K-Means and K-Medoids. The K-Means algorithm works as follows: in each iteration the mean of each partition is calculated, and then the distance of each object with respect to all the means; each object is associated with the mean to which its distance is minimum. The method is illustrated in figure 2.1.
Figure 2.1: Working of partition clustering method
The major drawback of partitioning algorithms is that they cannot tackle noise/outliers in the dataset; they are not able to recognize such points. Moreover, the user is supposed to give the number of clusters as input. Figuring out the number of clusters may not be easy all the time, due to the dimensionality of the data and heterogeneous attributes. Also, these algorithms are not suitable for discovering non-convex shapes.
2. Hierarchical Method: In hierarchical clustering methods, objects are grouped into a tree of clusters, known as a dendrogram. There are two ways to perform the grouping: agglomerative and divisive. Agglomerative clustering is a bottom-up approach. It takes each object as a cluster and at each iteration merges these clusters to form larger and larger clusters until all the objects form a single cluster or until certain threshold conditions are satisfied. The divisive method is the reverse of agglomeration. It starts with all the objects in one single cluster and divides the clusters into smaller and smaller clusters until each object is a cluster or until threshold conditions are met. AGNES, DIANA, BIRCH, CURE, CHAMELEON and ROCK are a few hierarchical clustering algorithms. The working of these algorithms is illustrated in figure 2.2.
Figure 2.2: Working of hierarchical clustering methods (agglomerative and divisive)
The problem with these clustering algorithms is that a decision once taken cannot be undone, and there is no way to move objects from one group to another so as to form a better clustering.
3. Density-Based Method: Most of the distance based algorithms try to find only spherical shaped clusters. In order to find arbitrary shaped clusters, density based methods were proposed. The idea is to grow a cluster as long as the density around a point in the cluster is above a certain threshold. These methods not only find arbitrary shaped clusters but also outliers/noise points. DBSCAN and OPTICS are examples of density based clustering algorithms. The DBSCAN algorithm takes a radius epsilon and minpoints as input. It takes each object and verifies whether it has at least minpoints within an epsilon radius; if so, it is considered as a part of some cluster. If minpoints is three then points A, B, C, D, E, F and G in figure 2.3 help the cluster to grow, whereas points H and I are considered as outliers (a minimal sketch of this core-point test is given after Figure 2.3).
DBSCAN needs epsilon and minpoints as input, and calculating these in high-dimensional space is a tough task. The OPTICS algorithm was proposed to overcome this difficulty. OPTICS computes an augmented cluster ordering for automatic and interactive density based clustering.
4. Grid based Method: Grid based algorithms divide the given data space into a finite number of cells that form a grid structure. This grid structure is used to perform all the clustering operations. The advantage here is that the clustering processing can be done in parallel. The quality of the clusters depends on the granularity of the grid. STING, CLIQUE and WaveCluster are a few grid based methods.
5. Model based Method: In this method, a model is hypothesized for each cluster that will best fit the given data.
The algorithm might construct a density function that reflects the spatial distribution of the data. Model based
methods follow two approaches: statistical approach and neural network based approach.
Sometimes it is tough to classify a clustering algorithm into one of the above methods, as the algorithms integrate
ideas from more than one of the methods.
Figure 2.3: Working of DBSCAN algorithm
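To make the neighborhood test concrete, the following Python sketch shows a core-point check in the spirit of DBSCAN as described above. It is only an illustration: the function names, the toy data and the parameter values are assumptions, not part of any particular implementation.

# Minimal sketch of the DBSCAN core-point test (assumed names and values).
import math

def region_query(points, p, eps):
    # All points lying within distance eps of point p (p itself included).
    return [q for q in points if math.dist(p, q) <= eps]

def is_core_point(points, p, eps, min_points):
    # A point is a core point if its eps-neighborhood contains at least min_points points.
    return len(region_query(points, p, eps)) >= min_points

# Toy 2-d data: a small dense blob plus two isolated points (outliers).
data = [(1.0, 1.0), (1.1, 1.0), (1.0, 1.2), (1.2, 1.1), (5.0, 5.0), (9.0, 0.5)]
for pt in data:
    print(pt, "core" if is_core_point(data, pt, eps=0.5, min_points=3) else "not core")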
2.2 Outliers
A few objects in the data behave very differently from the general behavior of the entire data; these objects are called outliers. Outlier mining can be seen as two subproblems: (1) defining what constitutes inconsistent data and (2) finding methods to determine such data. Defining what an outlier is, is not a trivial job. Data visualization methods help in detecting outliers to an extent, but in the presence of categorical and/or very high-dimensional data, visualization techniques are not effective. Detecting such points is useful, as applications such as fraud detection need to identify such objects.
Chapter 3
Related Work
There have been some early attempts to generate synthetic data to test clustering algorithms. A few schemes for generating artificial data have appeared in the literature ([11, 12, 13]). One of the processes used multivariate normal mixtures with fairly complex covariance matrices, which leads to the generation of overlapping clusters. The algorithm proposed by Glenn [11] generates data in either four, six, or eight dimensional space containing up to five clusters. Three different methods are followed in assigning points to each cluster.
The overall approach followed by Glenn's algorithm is as follows. Initially the extent of each cluster in the first dimension is fixed in such a way that there will not be any overlap between clusters in this dimension. Then the extent of each cluster in the remaining dimensions is calculated. Points are generated within the bounding box¹ of each cluster. Outliers associated with each cluster are generated; these outliers are not within the bounding box of the cluster. Finally error measures (such as adding an error perturbation to each dimension of each point in the dataset) are added. Since the clusters are non-overlapping in the first dimension, they remain non-overlapping even if other dimensions are also considered. In this case any clustering algorithm will be able to identify the clusters properly in any n − i (i = 1 . . . n − 1, where n is the dimensionality of the data) dimensions. For each cluster some percentage of the cluster points are added as noise to that cluster. There is every chance that a noise point assigned to a cluster may fall within the bounding box of some other cluster. The major drawback is that the algorithm limits the number of dimensions as well as the number of clusters. "Clusutils" [14] has a component called "clusgen", which is a tool based on Glenn's [11] algorithm. Though this tool does not limit the number of dimensions and the number of clusters, it can generate only rectangular/square shaped (a fixed shape) clusters but not random shaped clusters. Moreover, clusgen does not guarantee the presence of the requested number of clusters.
While evaluating their clustering algorithms, the authors of [15, 16, 17, 18, 19, 20, 21, 12] generated some synthetic datasets. These datasets are tailor-made for their particular needs, i.e. as the DBSCAN algorithm needs arbitrary shaped clusters, they generated such datasets. Most of the papers used the method explained in BIRCH [12]. The synthetic generator explained in the BIRCH paper takes a pattern, the number of clusters, the maximum and minimum number of points in a cluster, the maximum and minimum radius of a cluster and a few more inputs, and generates the data. Cluster centers are fixed depending on the pattern of the clusters. After fixing a few more characteristics of the cluster, data points are generated using a 2-d normal distribution whose mean is the cluster center and whose variance is r²/2. They also generate some amount of noise points. Since the maximum distance between the cluster center and a cluster point is
¹ The term bounding box is used extensively to mean the minimum bounding box.
unbounded, there are many chances that a point that belongs to cluster c_i is much closer to the center of cluster c_j, which makes the generated data a bit fuzzy. As there is an overlap, there is no guarantee that the data will have exactly the requested number of clusters.
Algorithms like [22] used the data generation method explained by M. Zait et al. in [13]. The data generator explained by M. Zait et al. takes the number of objects, the number of dimensions, and the type and range of values for each dimension as input. The process is divided into three steps. In the first step a file containing details about each and every dimension is generated. The second step is to define the ranges of each cluster for each dimension. Each dimension is divided into non-overlapping ranges, each of which is in turn assigned to one of the clusters. If there are categorical dimensions then the values are divided into a set of subsets and each subset is randomly assigned to a cluster. An equal number of points is generated according to the ranges of dimensions assigned to each cluster. It is evident that the generated data has the required number of clusters, but the clusters can be identified by just looking at one dimension, much like the output of the algorithm proposed by Glenn. Generating such data is not a good option. Moreover, none of the clusters generated has an arbitrary shape.
The method proposed by us takes the number of clusters, the number of points, the number of dimensions and the maximum range of each dimension as input and generates the data. We guarantee that there will be exactly the requested number of clusters. We are not restricted to rectangular shaped clusters; in fact we generate random shaped clusters. We include some amount of noise too. The generated clusters can overlap when seen in a subset of the dimensions, so there is little chance of figuring out the clusters by just looking at some dimension.
Chapter 4
Synthetic Datasets
4.1 SynDECA - Framework
SynDECA currently generates datasets in R^d, where d is the number of dimensions. In the following sections, we describe the various components of SynDECA and their functionalities.
4.1.1 Notation
Let the dataset to be generated be represented as X. Let the dimensionality of the dataset be d, the dimensional space be D ⊆ R^d and the number of points in the dataset be n. Let the range of each axis be [0, m], where m is the maximum allowable value in each dimension. Let c be the number of clusters to be present in the dataset X, X_c be the set of points that belong to clusters and X_n be the set of points that are noise. Let η be the noise percentage, i.e. η = (|X_n| / |X|) × 100. Let µ represent the set of cluster centers, µ_i represent the cluster center of cluster i and µ_ij be the value of the j-th dimension of cluster center i.
With the above notation, we define the problem of synthetic numerical dataset generation as:
Given the number of points to be present in the dataset n, the dimensionality of the dataset d, the maximum range
of each dimension m, the number of non-overlapping clusters c to be present in the dataset, our aim is to generate
a dataset that is spread across all the d dimensions with c non-overlapping clusters and η percentage of noise points
within the dataset.
We list out the tasks, step-by-step, that together address the above problem definition. We divide the whole problem
into four smaller tasks:
1. Cluster placement: For all the c clusters, the radius of the cluster and the cluster center are to be determined such that there is no overlapping between the clusters in all d dimensions. The algorithm is given in Figure 4.2.
2. Cluster and Noise Cardinality estimation: For each cluster, the number of points to be placed is in proportion to its radius. The number of points to be placed as noise also needs to be determined.
3. Filling clusters with points: With the cluster centers and their radii in place, we sprinkle randomly generated points into the various clusters, till the required number of cluster points |X_c| is achieved (as described in Figure 4.4).
4. Sprinkling noise: After filling up the clusters with random points, noise points need to be added such that they do not lie within the region of any cluster in which cluster points are placed. Noise points are added till |X_n| is achieved (as described in Figure 4.5).
Figure 4.1: Generation of data
4.2 Algorithms
With the brief mention of the various tasks in the previous section, we now explain the algorithms in detail. The
inputs from the user are: the number of points in the dataset n, the dimensionality of the dataset d, the number of
non-overlapping clusters c and the maximum allowable range m.
Apart from these user-given parameters, we use another set of parameters which are set dynamically during the
algorithm execution. They are:
1. Parameter ε: This parameter is used to ensure a minimum gap between any two clusters, i.e. within a (1 + ε) ∗ r (r is the radius of the cluster) region around the center of any cluster no other cluster can be placed. The range of ε is (0, 1] and the value is set randomly.
2. Parameter r_max: r_max denotes the maximum allowable value for the cluster radius, which is calculated from the user inputs and ε.
3. Parameter r_min: This parameter determines the minimum allowable radius for a cluster. It depends on how small a cluster can be when compared with the largest cluster. It is determined by k (the ratio of r_max to r_min) and r_max. In our experiments we have taken k = 3.
4. Parameter η: η denotes the percentage of noise points. The number of noise points lies in the range (0, n/(1+c)]. The rest of the points are allocated to each cluster in proportion to its radius.
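As a rough illustration of how these derived parameters interact, the following Python sketch computes ε, r_max, r_min and the noise budget from the user inputs. The variable names are assumptions, and the lower bound used for ε anticipates the condition derived in Section 4.3; this is a sketch, not the SynDECA implementation.

# Illustrative computation of the dynamically set parameters (assumed names).
import random

def derived_parameters(n, d, c, m, k=3.0):
    s = m ** d                                     # total space provided
    eps_low = ((1.0 + c) / c) ** (1.0 / d) - 1.0   # epsilon must exceed this (see Section 4.3)
    eps = random.uniform(eps_low, 1.0)             # epsilon drawn from (eps_low, 1]
    r_max = (s / c) ** (1.0 / d) / (2.0 * (1.0 + eps))
    r_min = r_max / k                              # k is the ratio r_max / r_min (k = 3 in the thesis)
    eta = random.uniform(0.0, 100.0 / (1.0 + c))   # noise percentage
    n_noise = int(eta / 100.0 * n)                 # at most n / (1 + c) noise points
    return eps, r_max, r_min, n_noise

print(derived_parameters(n=10000, d=2, c=10, m=100))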
4.2.1 Algorithm for Cluster Placement
The algorithm proceeds by assigning the first cluster's center and radius randomly. For the remaining clusters, the cluster center and radius are set depending on the placement of the previous clusters. For every cluster, the center µ_i and radius r_i are generated randomly. The generated center is checked to be at a distance of at least the radius r_i from the edges along each dimension, i.e. m − µ_ij is at least the radius of that cluster (j = 1 . . . d). Once the center is adjusted, it is checked against each of the already placed clusters for the minimum gap that needs to be maintained. If there is any cluster with which the current cluster does not have the minimum required gap, the radius of the current cluster is reduced. If the reduced radius is less than the minimum allowable radius then the process is restarted by generating a new center point as well as a new radius. The algorithm is given in Figure 4.2.
Input: µ_i and the existing set of cluster centers µ; radii of already placed clusters
Output: µ_i and radius r_i of the current cluster C_i
Algorithm: place_cluster(i, µ_i, r_i): Cluster Placement
1. r_i ← Generate_radius(r_min, r_max)
   /* Generate_radius randomly generates a value between r_min and r_max */
2. µ_i ← Generate_center(r_i)
   /* Generate_center randomly generates a point in d-dimensional space such that the distance between the center and any edge of the bounding box of the given space is at least the radius of that cluster */
3. do
4.   is-cluster-placed ← true
5.   for j in 1 to (i−1) do
6.     if distance(µ_i, µ_j) < (1+ε) ∗ (r_i + r_j) then
7.       is-cluster-placed ← false
8.       Reduce(r_i)
9.       if (r_i < r_min) then
10.        place_cluster(i, µ_i, r_i)
11.      end if
12.      break
13.    end if
14.  end for
15. while (! is-cluster-placed)
16. end
Figure 4.2: Algorithm for Cluster Placement
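A compact Python rendering of the placement loop of Figure 4.2 is sketched below. It follows the same logic (random radius and center, shrink the radius on a conflict, restart when the radius falls below r_min), but the helper names and the shrink factor are assumptions rather than the actual SynDECA code.

# Minimal sketch of the cluster-placement logic of Figure 4.2 (assumed names).
import math
import random

def place_clusters(c, d, m, r_min, r_max, eps, shrink=0.9):
    centers, radii = [], []
    for _ in range(c):
        while True:
            r = random.uniform(r_min, r_max)
            # Center at least r away from every edge of the [0, m]^d box.
            mu = [random.uniform(r, m - r) for _ in range(d)]
            ok = True
            for mu_j, r_j in zip(centers, radii):
                while math.dist(mu, mu_j) < (1 + eps) * (r + r_j):
                    r *= shrink          # reduce the radius on a conflict
                    if r < r_min:        # too small: give up and restart with a new center
                        ok = False
                        break
                if not ok:
                    break
            if ok:
                centers.append(mu)
                radii.append(r)
                break
    return centers, radii

print(place_clusters(c=5, d=2, m=100.0, r_min=4.4, r_max=13.2, eps=0.2))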
The algorithm for cluster placement will definitely be able to find continuous free space in the given space s, so that it can place a cluster with r_min as radius. The following proof shows the same. Please look into section 4.3 to verify the equations.
\[ r_{max} = \left(\frac{s}{c}\right)^{\frac{1}{d}} \cdot \frac{1}{2(1+\epsilon)}, \qquad s = m^d \]
\[ m = r_{max} \cdot c^{\frac{1}{d}} \cdot 2(1+\epsilon) \]
As \( r_{min} = \frac{r_{max}}{k} \),
\[ m = k \cdot r_{min} \cdot c^{\frac{1}{d}} \cdot 2(1+\epsilon) \tag{4.1} \]
Figure 4.3: Division of the given space into cells which can accommodate clusters with minimum radius (each axis of length m is divided into k·c^{1/d} cells of width 2(1+ε)·r_min)
From equation 4.1, we can divide each dimension into k ∗ c^{1/d} units as shown in figure 4.3. While placing a cluster with r_max as radius, it might at most span k + 1 units along each dimension. Similarly, the number of units occupied in each dimension by a cluster having radius r_min is at most two. The dotted squares in figure 4.3 are the bounding boxes of clusters. The algorithm generates a radius in the range [r_min, r_max], so not all the clusters are going to have the maximum radius. Each cluster will leave (k + 1)^d − (cells_i)^d empty cells, where cells_i is the number of cells occupied along each dimension by the i-th cluster. As d increases the number of unoccupied cells increases, and there is a chance that some cells are occupied by more than one cluster. This shows that there is an ample number of cells left free, each of which is capable of hosting a cluster with r_min as radius.
In some iteration, the algorithm will generate a center in the available free space. If the generated radius does not allow the cluster to be placed there, the algorithm shrinks the radius and tries to accommodate the cluster in that position.
We ran the cluster placement algorithm 50 times for each input. The results are tabulated in Table 4.1. The results show that the average radius of the clusters is not even half of r_max, which shows that there is an ample amount of free space left. The algorithm is recursively called at most about 2 times on average in most of the cases. In some cases the value is a little higher due to the randomness in finding the center of the cluster; the same applies to the number of times the cluster radius is reduced to fit in that position.
Dimensions | Clusters | r_max   | r_min    | Avg Radius  | Avg Trials | Avg Radius Reduction
2          | 10       | 13.1762 | 4.38766  | 6.9583308   | 1.618      | 1.496
2          | 100      | 4.16667 | 1.3875   | 2.1744378   | 1.6226     | 1.5188
2          | 1000     | 1.31762 | 0.438766 | 0.68921596  | 1.6068     | 1.51924
3          | 10       | 19.34   | 6.4402   | 9.9235446   | 1.588      | 1.506
3          | 100      | 8.97681 | 2.98928  | 4.5308042   | 1.5952     | 1.5472
3          | 1000     | 4.16667 | 1.3875   | 2.1070136   | 1.56458    | 1.50642
5          | 10       | 26.2899 | 8.75453  | 12.399806   | 1.628      | 1.78
5          | 100      | 16.5878 | 5.52374  | 7.8629638   | 1.4942     | 1.5088
5          | 1000     | 10.4662 | 3.48524  | 4.9784418   | 1.48072    | 1.46624
10         | 10       | 33.097  | 11.0213  | 14.526136   | 7.85       | 13.762
10         | 100      | 26.2899 | 8.75453  | 11.594366   | 1.35       | 1.3378
10         | 1000     | 20.8828 | 6.95397  | 9.2591416   | 1.3389     | 1.32902
20         | 10       | 37.1355 | 12.3661  | 15.864384   | 4.442      | 7.386
20         | 100      | 33.097  | 11.0213  | 13.58244    | 4.7032     | 7.9074
20         | 1000     | 29.4977 | 9.82275  | 12.394634   | 1.25274    | 1.27538
50         | 10       | 39.7914 | 13.2505  | 16.450032   | 57.964     | 109.782
50         | 100      | 38.0005 | 12.6542  | 15.341142   | 3.896      | 6.3966
50         | 1000     | 36.2901 | 12.0846  | 14.555292   | 2.30472    | 3.37338
100        | 10       | 40.7182 | 13.5592  | 17.001032   | 11.7       | 21.232
100        | 100      | 39.7914 | 13.2505  | 15.94729    | 3.5456     | 5.7674
100        | 1000     | 38.8856 | 12.9489  | 15.538592   | 1.8887     | 2.58478
Table 4.1: Average radius, average number of trials to place the cluster and average number of times the radius was reduced
4.2.2 Algorithm for Cluster Points Generation
After obtaining the cluster positions and radii from the Cluster Placement algorithm, a random number (∈ (0, n/(1+c)]) of points is allotted for noise. The remaining points are allotted to each cluster in proportion to the cluster's radius. A shape (circle, ellipse, rectangle, square or irregular) for each cluster is assigned randomly. Points for a cluster are generated according to its shape. For a circular shaped cluster, points are generated in such a way that all the points are within a distance of the radius r_i from its center. For rectangular and elliptical shaped clusters, the extent of the radius (along a dimension) is reduced in at most d − 1 dimensions, and the points generated will be within the structure. For irregular shaped clusters there are two different methods that we employ:
1. The first method is analogous to the reverse mechanism of the DBSCAN algorithm [15]. A point p is randomly chosen and a small percentage of the total allocated points (min_pts) are sprinkled around p within a distance of eps (eps ≪ r). A new point p′, which is at a distance q′ (eps < q′ < eps + eps′, and eps′ ≪ r) from p, is chosen and min_pts′ points are sprinkled around p′ within a distance of eps′. This process is continued till the required number of points is generated. The condition that the distance q′ between points p and p′ should be within (eps, eps + eps′) ensures the connectivity of the generated points.
2. The second technique is based on hyper-dimensional grids. In this method, the bounding box of the cluster is divided into small d-dimensional boxes, from which one box is randomly chosen to fill with points. Following that, an adjacent box of the filled box is chosen to fill with points, and this process is continued till the required number of points is generated for that cluster. By using grids, it is ensured that each point that is generated for a cluster is within the bounding box of the cluster.
Since the rand function in C generates random numbers with a uniform distribution, the density of regular shaped clusters is uniform, whereas irregular shaped clusters do not follow any particular distribution. The algorithm is given in Figure 4.4. Given a center µ_i and radius r_i, the function generate-point generates a point within the bounding box of the cluster. The function is-within-required-boundary checks the extent of a point around the center of the cluster depending upon the shape of the cluster. The function sprinkle-points generates min_pts points within the eps region of a selected point, and the function get-a-point generates a point p′ which is at a distance q′ from p.
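The first ("reverse DBSCAN") method can be sketched as follows in Python. The helper names mirror the functions described above but are assumptions, the parameter values are arbitrary, and the bounding-box check is omitted for brevity; this is an illustration, not the SynDECA source.

# Sketch of irregular cluster generation: sprinkle min_pts points around a pivot,
# then hop to a nearby new pivot and repeat (assumed names and values).
import random

def sprinkle_points(p, eps, min_pts):
    # min_pts points placed uniformly inside the eps-box around pivot p.
    return [[x + random.uniform(-eps, eps) for x in p] for _ in range(min_pts)]

def get_a_point(p, eps, eps2):
    # A new pivot shifted by between eps and eps + eps2 along each dimension.
    return [x + random.choice([-1, 1]) * random.uniform(eps, eps + eps2) for x in p]

def irregular_cluster(center, total, eps=1.0, min_pts=20):
    points, p = [], list(center)
    while len(points) < total:
        points.extend(sprinkle_points(p, eps, min_pts))
        p = get_a_point(p, eps, eps)   # hop to a connected neighbouring pivot
    return points[:total]

pts = irregular_cluster(center=(50.0, 50.0), total=200)
print(len(pts), pts[:3])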
4.2.3 Algorithm for Noise Points Generation
With the cluster points generated, we now generate η percent of n (the number of points in the dataset) noise points. While generating noise points, care is taken that a point generated as noise does not lie within the space of any cluster where cluster points are generated. The generated point p is a potential noise point if it is not within the bounding box of any cluster. Otherwise, if one of the following conditions is satisfied then p is treated as a valid noise point. If the shape of the cluster is
• circle: p is not within a distance r_i from the center µ_i of the cluster.
• ellipse: p is not within the space of the ellipse.
• rectangle: for some dimension j, |p_j − µ_ij| > Radii_ij (Radii_ij is the radius of cluster i along dimension j).
• irregular: p is not within the eps distance of any point which was used as a pivot to sprinkle the points.
The algorithm is given in Figure 4.5.
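A compressed sketch of this rejection test might look as follows in Python, handling only circular and rectangular clusters; the names and values are assumptions and the ellipse and irregular cases are omitted for brevity.

# Sketch of noise generation by rejection: keep a uniformly drawn point only if it
# does not fall inside the occupied region of any cluster (circles and rectangles only).
import math
import random

def is_inside_cluster(p, center, radii, shape):
    if shape == "circle":
        return math.dist(p, center) <= radii[0]
    if shape == "rectangle":
        return all(abs(pj - cj) <= rj for pj, cj, rj in zip(p, center, radii))
    return False

def generate_noise(count, d, m, clusters):
    noise = []
    while len(noise) < count:
        p = [random.uniform(0, m) for _ in range(d)]
        if not any(is_inside_cluster(p, c, r, s) for (c, r, s) in clusters):
            noise.append(p)
    return noise

clusters = [((30.0, 30.0), (10.0,), "circle"), ((70.0, 60.0), (8.0, 5.0), "rectangle")]
print(len(generate_noise(50, d=2, m=100.0, clusters=clusters)))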
4.2.4 Complexity of Algorithms
SynDECA generates the required datasets within a reasonable time (it does not take more than 10 seconds to generate a 1,000,000-point, 2-dimensional dataset containing 100 clusters). For high-dimensional large datasets, it is not easy to give a time analysis because, at times, getting the position of a point such that it satisfies the various conditions is a time consuming task.
4.3 Evaluation of datasets generated
Given the user inputs (the number of points in the dataset n, the dimensionality d, the number of clusters c and the maximum allowable value in each dimension m), the toolkit generates the required n points in R^d space with c clusters, along with some noise points (calculated based on the other parameter values).
Input: No. of points to be generated, center µ_i, radius r_i and shape of the cluster;
       Radii_i, the extent in each dimension for ellipse and rectangle
Output: Required number of points within the bounding box of the cluster.
Algorithm: Cluster Points Generation
1. Let i ← 1
2. switch (Shape)
3.   case CIRCLE:
4.   case ELLIPSE:
5.   case RECTANGLE:
6.   case SQUARE:
7.     while (i < No. of points to be generated)
8.       point ← generate-point(µ_i, r_i)
9.       if (is-within-required-boundary(µ_i, point, r_i, Shape))
10.        generated-points[i] ← point
11.        increment i
12.      end if
13.    end while
14.  case IRREGULAR:
15.    p ← generate-point(µ_i, r_i)
16.    sprinkle-points(p, eps, min_pts)
17.    no-of-generated-points ← min_pts
18.    while (no-of-generated-points < No. of points to be generated)
19.      p′ ← get-a-point(p, eps, eps′)
20.      sprinkle-points(p′, eps′, min_pts′)
21.      increment no-of-generated-points by min_pts′
22.      p ← p′; eps ← eps′
23.    end while
24. end switch
25. end
Figure 4.4: Algorithm for Cluster Points Generation
A random number between 0 and 100/(1+c) is selected as the noise percent η. The maximum limit on the noise percent ensures that the average number of points given to a cluster is greater than or equal to the number of noise points, which is explained below.
\[ \frac{100-\eta}{c} \cdot n \;\geq\; \eta \cdot n \]
\[ 100 \;\geq\; (1+c)\cdot\eta \]
\[ \eta \;\leq\; \frac{100}{1+c} \]
The amount of space provided for generating the dataset is s = m^d. Since bounding boxes of arbitrary size (for c clusters) cannot fit in the space provided, there should be a restriction on the size of the bounding box of a cluster. So we need to define a maximum and a minimum allowable radius (r_max, r_min) for each cluster.
Input: No. of noise points (η% of n), center µ_i, shape and radius/radii of each cluster, and the selected pivot points.
Output: η% of n noise points.
Algorithm: Noise Points Generation
1. Let i ← 1
2. while (i < No. of noise points)
3.   point ← Generate-point-in-space()
4.   if (not within-any-cluster-bounding-box(point))
5.     generated-noise[i] ← point
6.     increment i
7.   else if (is-within-free-space-of-cluster(point, µ_j, r_j, Shape_j))
8.     generated-noise[i] ← point
9.     increment i
10.  end if
11. end while
12. end
Figure 4.5: Algorithm for Noise Points Generation
Computation of r_max:
The maximum space allocated to each cluster is s/c. The size of the bounding box of the largest cluster (including the ε space) in terms of r_max is [2 ∗ r_max ∗ (1 + ε)]^d.
\[ \frac{s}{c} = \left[2\, r_{max}\,(1+\epsilon)\right]^d \]
\[ r_{max} = \left(\frac{s}{c}\right)^{\frac{1}{d}} \cdot \frac{1}{2(1+\epsilon)} \]
The minimum allowable radius is
\[ r_{min} = \frac{r_{max}}{k} = \left(\frac{s}{c}\right)^{\frac{1}{d}} \cdot \frac{1}{2k(1+\epsilon)} \]
where k (> 1) is the ratio of r_max to r_min. k can take any value, depending on how small a cluster can be when compared to the largest cluster. In our experiments we have taken k as 3.
The size of the bounding box of a cluster with r_max as radius is
\[ \left[2\left(\frac{s}{c}\right)^{\frac{1}{d}} \cdot \frac{1}{2(1+\epsilon)}\right]^d = \frac{s}{c}\cdot\left(\frac{1}{1+\epsilon}\right)^d \]
Similarly, the size of the bounding box of a cluster with r_min as radius is
\[ \frac{s}{c}\cdot\left(\frac{1}{k(1+\epsilon)}\right)^d \]
If every cluster takes r_max as its radius, then
\[ \text{Minimum free space available} = s - \frac{s}{c}\left(\frac{1}{1+\epsilon}\right)^d \cdot c = s\cdot\frac{(1+\epsilon)^d - 1}{(1+\epsilon)^d} \]
Similarly, if every cluster takes r_min as its radius, then
\[ \text{Maximum free space available} = s - \frac{s}{c}\left(\frac{1}{k(1+\epsilon)}\right)^d \cdot c = s\cdot\frac{[k(1+\epsilon)]^d - 1}{[k(1+\epsilon)]^d} \]
Step 6 of the algorithm for Cluster Placement (given in Figure 4.2) checks whether the current cluster j is within the (1 + ε) ∗ r_i (r_i is the radius of cluster i) space around µ_i (the center of cluster i, which is already placed). If there is any such overlap then the radius of cluster j is reduced. Moreover, an ε ∗ r_i space around the bounding box of cluster i is left free. Hence we have the following lemma.
Lemma 1 The algorithm for Cluster Placement (Figure 4.2) generates clusters that are non-overlapping.
Theorem 1 The dataset generated by the step-by-step process described above will contain exactly c clusters.
Proof: From the above lemma it is assured that the bounding boxes of the clusters will not overlap. Each cluster has its own private bounding box, within which there will not be any other cluster's points or noise points. Now, if we can prove that the density of the points within the bounding box of each cluster is more than that of the noise, then we can assure that there exist exactly c clusters (i.e. there will be exactly c sets of points which are denser than the surrounding space) in the generated dataset.
Case 1:
When every cluster has r_max as radius and the noise percentage is η = 100/(1+c):
\[ \text{Number of noise points} = \frac{\eta}{100}\cdot n = \frac{1}{1+c}\cdot n \]
\[ \text{Total number of cluster points} = \frac{100-\eta}{100}\cdot n = \left(1 - \frac{1}{1+c}\right) n = \frac{c}{1+c}\cdot n \]
Since all the clusters have the same radius,
\[ \text{Number of points for each cluster} = \frac{c}{(1+c)\,c}\cdot n = \frac{1}{1+c}\cdot n \]
The number of points allotted to every cluster is the same as the number of noise points.
\[ \text{Bounding box size of a cluster} = [2\,r_{max}]^d = \left[2\left(\frac{s}{c}\right)^{\frac{1}{d}}\cdot\frac{1}{2(1+\epsilon)}\right]^d = \frac{s}{c}\left(\frac{1}{1+\epsilon}\right)^d \]
\[ \text{Total free space} = s - \sum_{i=1}^{c} (\text{bounding box size of cluster } i) = s - c\cdot\frac{s}{c}\left(\frac{1}{1+\epsilon}\right)^d = s\cdot\frac{(1+\epsilon)^d - 1}{(1+\epsilon)^d} \]
In order for the density of any cluster to be more than the density of the noise (since each cluster has the same number of points as the noise), the available free space should be greater than the size of the bounding box of the cluster:
\[ s\cdot\frac{(1+\epsilon)^d - 1}{(1+\epsilon)^d} > \frac{s}{c}\cdot\left(\frac{1}{1+\epsilon}\right)^d \]
Since \( \frac{s}{(1+\epsilon)^d} > 0 \),
\[ (1+\epsilon)^d - 1 > \frac{1}{c}, \qquad (1+\epsilon)^d > \frac{1+c}{c}, \qquad \epsilon > \left(\frac{1+c}{c}\right)^{\frac{1}{d}} - 1 \]
If the above condition is ensured then the amount of free space is greater than the size of the bounding box of each cluster. So we make ε take values within the range \( \left(\left(\frac{1+c}{c}\right)^{\frac{1}{d}} - 1,\; 1\right] \).
Even when every cluster takes r_min as its radius, the number of points in every cluster is the same as the number of noise points. In this case the size of the bounding box of each cluster is reduced by a factor of 1/k^d and the amount of free space is increased. This means that the same number of noise points (as above) are spread over an increased free space and the same number of cluster points (as above) occupy less space. Therefore, in this case also the density of any cluster is more than the noise density.
Case 2:
When x clusters get r_min as radius, the remaining c − x clusters get r_max as radius and η = 100/(1+c), the c − x clusters will get more points (due to these x clusters having the minimum radius) than in Case 1, and the amount of free space is also increased. Therefore, the density of the c − x clusters (with r_max as radius) is greater than the noise density.
\[ \text{Points allotted to a cluster with } r_{max} \text{ as radius} = \frac{k}{k(c-x)+x}\cdot\frac{c}{1+c}\cdot n \]
\[ \text{Points allotted to a cluster with } r_{min} \text{ as radius} = \frac{1}{k(c-x)+x}\cdot\frac{c}{1+c}\cdot n \]
When x = 1 the cluster with r_min as radius gets the least share of points, since the denominator is larger than in all the other cases.
The free space is increased by
\[ (2\,r_{max})^d - (2\,r_{min})^d = (2\,r_{max})^d - \left(\frac{2}{k}\,r_{max}\right)^d = \frac{s}{c}\cdot\frac{k^d - 1}{[k(1+\epsilon)]^d} \]
\[ \text{Total free space} = s\cdot\frac{(1+\epsilon)^d - 1}{(1+\epsilon)^d} + \frac{s}{c}\cdot\frac{k^d - 1}{[k(1+\epsilon)]^d} \]
\[ \text{Size of the bounding box of the cluster with } r_{min} \text{ as radius} = \frac{s}{c}\left(\frac{1}{k(1+\epsilon)}\right)^d \]
\[ \text{Density of the cluster with } r_{min} \text{ as radius} = \frac{\frac{1}{k(c-1)+1}\cdot\frac{c}{1+c}\cdot n}{\frac{s}{c}\left(\frac{1}{k(1+\epsilon)}\right)^d} \]
\[ \text{Density of the noise} = \frac{\frac{n}{1+c}}{s\cdot\frac{(1+\epsilon)^d-1}{(1+\epsilon)^d} + \frac{s}{c}\cdot\frac{k^d-1}{[k(1+\epsilon)]^d}} \]
Since density of cluster should be greater than noise density,
\[ \frac{\frac{1}{k(c-1)+1}\cdot\frac{c}{1+c}\cdot n}{\frac{s}{c}\left(\frac{1}{k(1+\epsilon)}\right)^d} \;>\; \frac{\frac{n}{1+c}}{s\cdot\frac{(1+\epsilon)^d-1}{(1+\epsilon)^d} + \frac{s}{c}\cdot\frac{k^d-1}{[k(1+\epsilon)]^d}} \]
\[ \frac{c^2\, n\, k^d (1+\epsilon)^d}{s(1+c)\,(k(c-1)+1)} \;>\; \frac{n\, c\, k^d (1+\epsilon)^d}{s(1+c)\,[c\,k^d((1+\epsilon)^d - 1) + (k^d - 1)]} \]
Since \( \frac{n\, c\, k^d (1+\epsilon)^d}{s(1+c)} > 0 \),
\[ \frac{c}{k(c-1)+1} \;>\; \frac{1}{c\,k^d((1+\epsilon)^d - 1) + (k^d - 1)} \]
\[ c^2 k^d\left((1+\epsilon)^d - 1\right) \;>\; kc - k + 1 - c(k^d - 1) \]
\[ c^2 k^d\left((1+\epsilon)^d - 1\right) \;>\; 1 - k - c(k^d - k - 1) \]
\[ (1+\epsilon)^d \;>\; 1 + \frac{1 - k - c(k^d - k - 1)}{c^2 k^d} \]
\[ \epsilon \;>\; \left(1 + \frac{1 - k - c(k^d - k - 1)}{c^2 k^d}\right)^{\frac{1}{d}} - 1 \tag{4.2} \]
Since \( 1 + \frac{1 - k - c(k^d - k - 1)}{c^2 k^d} < 1 \), the right hand side of equation 4.2 is always negative, and as ε is always positive, the above condition is satisfied.
Therefore, for each cluster the density is more than that of noise. Hence, the theorem is proved.
4.3.1 Discussion
Let m_i be the maximum allowable value in each dimension. When max_{i=1..d}(m_i) ≫ min_{i=1..d}(m_i), the given space is skewed towards a rectangular shape. The bounding box of a cluster can no longer be square shaped, since the r_max calculated will be more than min_{i=1..d}(m_i). So in this case we should consider rectangular shaped bounding boxes to generate points.
Let l_i be the length of the side of the bounding box of a large cluster along dimension i, and let m_i / l_i = ρ, i = 1 . . . d. When all the bounding boxes of the clusters are large,
\[ \text{space occupied by each cluster} = \frac{s}{c} = \prod_{i=1}^{d} l_i \cdot (1+\epsilon)^d \]
where \( \epsilon\, l_i / 2 \) is the amount of space left free around the bounding box of the cluster along dimension i.
\[ s = \prod_{i=1}^{d} m_i \]
\[ \frac{\prod_{i=1}^{d} m_i}{\prod_{i=1}^{d} l_i} = \rho^d = c\,(1+\epsilon)^d \]
The value of either ρ or ε needs to be fixed in order to get the size of the large bounding box (and hence the density of the cluster with the largest bounding box). If ρ is fixed then there is every chance that for some values of c and d, ε could be negligible (and the gap between clusters would also be negligible); in this scenario we cannot ensure non-overlapping clusters. Therefore ε (< 1) is fixed to get the value of ρ with the help of c and d. We can prove that there are exactly c clusters in the generated data in a way similar to the above theorem.
4.4 Summary
In this chapter we proposed methods to generate clustering datasets in R^d (d is the dimensionality of the dataset). Our algorithms are capable of generating regular as well as arbitrary shaped clusters. The algorithms generate the requested number of clusters in the dataset along with the actual clustering result. This property makes them helpful in evaluating the results of clustering algorithms.
Chapter 5
Subspace Cluster Data
A subspace cluster can be defined as a pair consisting of a subset of the given data and a subspace. The data objects in a subspace cluster are similar in the associated subspace. When we consider all the dimensions, we may not be able to figure out the similarity between these data points. Traditional clustering algorithms take all the dimensions into account in order to figure out the clusters. In the case of very high-dimensional data, all the dimensions may not be relevant. If a dimension is irrelevant for all the objects in the database, it can be removed using dimensionality reduction techniques. Feature transformation techniques, such as principal component analysis, might look lucrative, but after applying these methods the distance between any two points remains approximately the same as in the original dataset. Moreover, interpreting the meaning of the new dimensions is difficult. Dimensionality reduction techniques may not always work, as different dimensions play an important role in finding different clusters, so removing a dimension could be a great loss for some cluster. For example, points A, B, C and D form a cluster in dimensions Y and Z, whereas points E, F, G, H and I form a cluster in dimensions X and Z (Figure 5.1). If the dimensionality reduction techniques remove either the X or the Y dimension, one of the clusters will not be identified. Subspace clustering algorithms try to give a solution to all these problems.
Figure 5.1: Example of how different dimensions are important to different clusters (points A-I plotted in two different two-dimensional projections)
5.1 Subspace Clustering Methods
Lance Parsons et al. [23] give a very good survey of subspace clustering algorithms. The first algorithm proposed to figure out subspace clusters was CLIQUE [24]. CLIQUE is a grid based technique and works in a bottom-up fashion. The given data space is divided into axis-parallel hyper-cubes of equal size, and each cube is labeled as dense if it contains a certain number of points. The algorithm first checks for one-dimensional dense hyper-cubes and proceeds to higher dimensions. At each stage the candidate subspaces are generated in an apriori manner, and these candidates are further used to find clusters in higher dimensions. Adjacent dense cubes at each stage are combined into one single cluster. Adjusting the density parameter and the granularity of the cuboid is difficult: if the size of the cuboid is not set properly then there is a chance that some noise points are added into a cluster or some cluster points are ignored. Objects might belong to many subspace clusters at each level. CLIQUE does not partition the data into non-overlapping subsets of points, which is required for most applications, like clustering documents based on language.
Charu Aggarwal et al. proposed PROCLUS (PROjected CLUStering) [22], which works like the K-medoids algorithm. Initially a large set of medoids is found, out of which K are chosen, and the rest of the objects are attached to one of the medoids. In the next run the medoids are calculated again, to refine the quality of the clusters. Each medoid is associated with some subspace. A dimension is relevant to a cluster if the variance of the cluster points along it is small. PROCLUS takes two inputs, the number of clusters and the average number of dimensions per subspace cluster, which are again difficult to figure out. ORCLUS [25], proposed by the same authors, handles non-axis-aligned subspaces. In the case of these projected clustering algorithms, a few points can be reported in multiple clusters, as shown in Figure 5.2.
Figure 5.2: Clusters according to projected subspace clustering
LAC (Locally Adaptive Clustering) [26] calculates clusters using an attribute-weighting method. It partitions the data into the requested number of clusters. Each cluster is associated with a weight vector; weights are larger if the dimension is more relevant, and dimensions are more relevant if the variance along them is smaller. Though LAC resembles PROCLUS, the algorithms are quite different.
H.P. Kriegel et al. proposed a density based subspace clustering algorithm [27], which also works in a bottom-up fashion like CLIQUE. The algorithm computes 1-dimensional subspace clusters by applying the DBSCAN algorithm on one dimension. In the next iterations it generates candidate subspaces in an apriori manner, and again applies the DBSCAN algorithm to find the clusters in those subspaces. The same authors, along with Christian, proposed another density based approach with local subspace preferences. The variance is calculated within the ε-neighborhood of a point along a dimension to determine whether it is a relevant dimension, and according to the calculated variance each dimension is given a weight. This algorithm works much like DBSCAN, but at each stage the weights of the dimensions are considered in the calculation of the ε-neighborhood.
5.2 Subspace Clustering Datasets
Synthetic datasets have been generated for testing the subspace clustering algorithms. Most of them used the methods given in BIRCH [12] or by M. Zait [13]. For the dimensions in which the cluster is present, they followed the given method, but for the remaining dimensions they used the entire range of those dimensions. The problem with such datasets is that there is a possibility that some points of a cluster mix with some other cluster's points. It might happen that the density in the area common to both clusters is higher, and that particular portion can be identified as a cluster in the high-dimensional space. In Figure 5.3, the black dots belong to one cluster, for which X is chosen as one of the subspace dimensions, and Y is chosen for the gray dots. The other dimension is not selected, so the data points are spread over the entire range of that dimension.
Figure 5.3: Subspace clustering dataset
While generating datasets for subspace clustering algorithms, we restrict the whole range of the data points to lie within the bounding box of the corresponding cluster. Restricting the points to the bounding box ensures that the cluster points do not mix with points of any other cluster or with noise. The aim is to generate subspace cluster points which form a cluster only when seen in a subset of the dimensions, but not in all dimensions, i.e., the subspace cluster points should look almost like noise when seen in all the dimensions. As we show below, it is not possible to generate subspace cluster points whose density is at most that of noise when seen in all dimensions while being higher than that of noise when seen in the particular subspace associated with the cluster. Since the main issue is density, the number of points allotted to a cluster plays a major role. If we allot points to a cluster based on the size of its bounding box, then, because the cluster points are restricted to lie within that bounding box, the density of all clusters would be almost the same when seen in all the dimensions. Therefore, we allot points to a cluster based on the area/volume it occupies in its associated dimensions. The notation is as given in Chapter 4.
Let n be the total number of points, d the total number of dimensions, c the number of clusters, η the percentage of noise, s the total space determined by the user's inputs, m the maximum value of each dimension (so that s = m^d), r the average radius of a cluster, x the number of subspace clusters, and ds the average number of dimensions chosen for the subspace clusters. As in Chapter 4, ε and k are the cluster-placement parameters, with r_min = r_max / k.
Volume of a normal cluster = (2r)^d

Volume occupied by a subspace cluster = (2r)^{ds}

Maximum points allotted to noise = n / (1 + c)

Points allotted to a normal cluster
    = [ (2r)^d / ( (c - x)(2r)^d + x(2r)^{ds} ) ] * [ c / (1 + c) ] * n

Points allotted to a subspace cluster
    = [ (2r)^{ds} / ( (c - x)(2r)^d + x(2r)^{ds} ) ] * [ c / (1 + c) ] * n

Free space when seen in all the given d dimensions = s - c * (2r)^d

    r_max = [ 1 / (2(1 + ε)) ] * (s / c)^{1/d}
    r_min = r_max / k

Average radius r = (1/2) * (1 + 1/k) * r_max = [ (k + 1) / (4k(1 + ε)) ] * (s / c)^{1/d},

so that s = (2r)^d * [ 2k(1 + ε) / (k + 1) ]^d * c.

Density of noise (seen in all d dimensions)
    = [ n / (1 + c) ] / [ s - c * (2r)^d ]
    = [ n / (1 + c) ] / ( r^d * c * [ (4k(1 + ε) / (k + 1))^d - 2^d ] )

Density of a subspace cluster (seen in all d dimensions)
    = [ (2r)^{ds} / ( (c - x)(2r)^d + x(2r)^{ds} ) ] * [ c / (1 + c) ] * n / (2r)^d
Density of noise when seen in all dimensions should be greater than or equal to the density of the subspace cluster:

    [ n / (1 + c) ] / [ s - c * (2r)^d ]
        ≥ [ (2r)^{ds} / ( (c - x)(2r)^d + x(2r)^{ds} ) ] * [ c / (1 + c) ] * n / (2r)^d

which simplifies to

    (c - x)(2r)^d + x(2r)^{ds}  ≥  c^2 * (2r)^{ds} * [ (2k(1 + ε) / (k + 1))^d - 1 ]        (5.1)

Free space in the ds dimensions of a subspace cluster = m^{ds} - c * (2r)^{ds}, where

    m = s^{1/d} = 2r * [ 2k(1 + ε) / (1 + k) ] * c^{1/d},

so that

    Free space in ds dimensions = (2r)^{ds} * [ (2k(1 + ε) / (1 + k))^{ds} * c^{ds/d} - c ]

    Density of noise in ds dimensions
        = [ n / (1 + c) ] / ( (2r)^{ds} * [ (2k(1 + ε) / (1 + k))^{ds} * c^{ds/d} - c ] )

    Volume of a subspace cluster in its ds dimensions = (2r)^{ds}

    Density of a subspace cluster in its ds dimensions
        = [ (2r)^{ds} / ( (c - x)(2r)^d + x(2r)^{ds} ) ] * [ c / (1 + c) ] * n / (2r)^{ds}

In order to identify the subspace cluster in its associated ds dimensions, the density of noise when seen in those ds dimensions should be less than the density of the subspace cluster:

    [ (2r)^{ds} / ( (c - x)(2r)^d + x(2r)^{ds} ) ] * [ c / (1 + c) ] * n / (2r)^{ds}
        > [ n / (1 + c) ] / ( (2r)^{ds} * [ (2k(1 + ε) / (1 + k))^{ds} * c^{ds/d} - c ] )

which simplifies to

    c * (2r)^{ds} * [ (2k(1 + ε) / (1 + k))^{ds} * c^{ds/d} - c ]  >  (c - x)(2r)^d + x(2r)^{ds}        (5.2)

From Equations 5.1 and 5.2,

    c * (2r)^{ds} * [ (2k(1 + ε) / (1 + k))^{ds} * c^{ds/d} - c ]  >  c^2 * (2r)^{ds} * [ (2k(1 + ε) / (k + 1))^d - 1 ],

which reduces to

    c^{ds/d}  >  [ 2k(1 + ε) / (1 + k) ]^{d - ds} * c        (5.3)
Since c^{ds/d} < c (because ds/d < 1) while [ 2k(1 + ε) / (1 + k) ]^{d - ds} > 1, Equation 5.3 can never be satisfied. Thus it is not possible to generate subspace cluster points whose density is at most that of noise in all dimensions while being greater than that of noise in the particular subspace associated with the cluster. However, when d is large (ds ≪ d) and we compare the density of a normal cluster in all d dimensions with that of a subspace cluster, the density of the normal cluster is much larger than that of the subspace cluster, so the subspace cluster points can still be treated as noise.
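As a worked check of this argument, the short program below (in C++, like SynDECA) plugs arbitrary sample values into the two sides of Equation 5.3 and confirms that the left-hand side stays below the right-hand side; the particular parameter values are illustrative only.

    // Worked check of Equation 5.3 for sample parameter values (arbitrary choices):
    // the left-hand side c^(ds/d) stays below the right-hand side
    // (2k(1+eps)/(1+k))^(d-ds) * c, so the inequality can never be satisfied.
    #include <cmath>
    #include <iostream>

    int main() {
        double c = 20;      // number of clusters
        double d = 50;      // total number of dimensions
        double ds = 10;     // average subspace dimensionality
        double k = 3;       // ratio r_max / r_min
        double eps = 0.1;   // separation parameter

        double lhs = std::pow(c, ds / d);
        double rhs = std::pow(2 * k * (1 + eps) / (1 + k), d - ds) * c;
        std::cout << "c^(ds/d)                     = " << lhs << "\n";
        std::cout << "(2k(1+eps)/(1+k))^(d-ds) * c = " << rhs << "\n";
        std::cout << (lhs < rhs ? "Equation 5.3 fails, as argued.\n"
                                : "Equation 5.3 holds (unexpected).\n");
    }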
After fixing the cardinality of the clusters, subspace cluster points are generated in the same way as described for the traditional clusters, except that for the non-selected dimensions the entire range of the cluster's bounding box is used.
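A minimal sketch of this generation step is given below; it is an illustration under simplifying assumptions (a spherical feasible region for the selected dimensions and rejection sampling), not the exact SynDECA routine.

    // Sketch of generating one point of a subspace cluster (illustrative, not the
    // exact SynDECA routine). For the selected dimensions the point is drawn inside
    // the cluster's feasible region (here: a ball of radius r, i.e. a circular
    // cluster); for the non-selected dimensions it is drawn uniformly over the
    // entire range of the cluster's bounding box, [centre - r, centre + r].
    #include <cstddef>
    #include <iostream>
    #include <random>
    #include <vector>

    std::vector<double> subspaceClusterPoint(const std::vector<double>& centre,
                                             const std::vector<bool>& selected,
                                             double r, std::mt19937& rng) {
        std::uniform_real_distribution<double> offset(-r, r);
        std::vector<double> p(centre.size());
        while (true) {
            double sq = 0.0;                     // squared distance in selected dims only
            for (std::size_t i = 0; i < centre.size(); ++i) {
                double o = offset(rng);
                p[i] = centre[i] + o;
                if (selected[i]) sq += o * o;
            }
            if (sq <= r * r) return p;           // inside the ball in the chosen subspace
        }
    }

    int main() {
        std::mt19937 rng(7);
        std::vector<double> centre = {50.0, 50.0, 50.0};
        std::vector<bool> selected = {true, true, false};   // subspace = dimensions 0 and 1
        for (int i = 0; i < 3; ++i) {
            for (double v : subspaceClusterPoint(centre, selected, 5.0, rng)) std::cout << v << " ";
            std::cout << "\n";
        }
    }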
5.3 Summary
In this chapter we discussed what a subspace cluster is and a few of the existing techniques for subspace clustering, and we proposed a method to generate subspace cluster datasets. A subspace cluster in such a dataset looks almost like noise when seen in all the dimensions, whereas it is identifiable in the subspace associated with it.
Chapter 6
Experimental Results
SynDECA [28] is developed in C++ on the Linux platform. In a high-dimensional data space, not all points generated within the bounding box of a cluster will fall inside the feasible region of circular, elliptical or random shaped clusters: as the number of dimensions increases, the ratio of the feasible region to the size of the cluster's bounding box decreases, and the time taken to generate cluster points would increase. To overcome this, care is taken to ensure that each generated point lies within the feasible region of the cluster.
In the case of the grid method for random shaped clusters, the number of cells in the grid explodes as the number of dimensions increases; for example, with a maximum value of 100 per dimension and unit-sized cells, a 10-dimensional grid already has 100^10 = 10^20 cells. Moreover, there is no control over the size of the bounding box of the cluster, and since the maximum allowable value of each dimension can be large, it is not possible to maintain information about each and every cell. Since the number of points allotted to a cluster can be very large while only a small percentage of points falls in each cell, maintaining information about the filled cells alone can also be cumbersome.
A few examples of two-dimensional datasets are given in Figure 6.1, and a few three-dimensional datasets without noise are shown in Figure 6.2. The time taken to generate datasets for various inputs is given in Table 6.1. Since generating points for a subspace cluster is not very different from generating normal cluster points, the time taken is almost the same. SynDECA can generate a large number of data points in a very high-dimensional data space, including complex-shaped clusters, with some amount of noise. The generated datasets help in evaluating clustering algorithms on many of the issues listed in Chapter 2.
6.1 Comparison with the Existing Work
Since the current version of the “clusutils” tool is not working properly, we could not carry out an extensive comparison of its output with SynDECA datasets. Figure 6.3 shows results from “clusutils” in the top row and results from SynDECA in the bottom row. The major inputs to clusutils are as follows.
• Number of clusters.
• Number of points.
• Number of dimensions.
No. of Points    No. of Dimensions    No. of Clusters    Max Value in each dimension    Time taken (in sec)
1000             2                    10                 100                            0.30
10000            2                    10                 100                            0.10
100000           2                    50                 100                            0.99
1000000          2                    50                 100                            8.22
1000000          2                    100                100                            8.25
1000000          2                    100                1000                           8.93
10000000         2                    100                1000                           88.21
100000           10                   100                100                            8.55
100000           10                   100                1000                           7.42
1000000          10                   100                1000                           47.06
100000           50                   50                 100                            397.92
100000           50                   100                100                            500.54
100000           100                  50                 100                            1196.1
100000           100                  50                 1000                           1568.31

Table 6.1: Time taken for generating datasets for various inputs
• Density level.
• Random noise (this value will be added to each and every dimension of every point).
• Percentage of outliers (the percentage of extra points to be added as outliers/noise).
The inputs given to both tools are as follows:

    Number of points     = 500
    Number of clusters   = 5
    Number of dimensions = 2

The density level parameter of clusutils is set to 1, meaning all clusters have the same density. The first two columns show datasets without noise points; SynDECA was modified slightly to generate noise-free datasets. In the last column the clusutils dataset is given a random noise of 50.0, while noise points are generated in the case of SynDECA. Even with identical inputs, clusutils chose different ranges for the x and y dimensions, whereas in the case of SynDECA the first image is in a 10 x 10 space and the second in a 100 x 100 space (recall that the user can specify the maximum value taken by a dimension). The last image from clusutils shows no clusters at all, even though five clusters were requested, whereas SynDECA ensures that the required number of clusters is present in every generated dataset.
The method explained in [12] generates only two-dimensional datasets, and since the regions in which clusters are spread can overlap, there is no guarantee that the requested number of clusters is present in the generated dataset. The method of M. Zait et al. [13] can handle any number of dimensions, but the datasets contain no noise and each cluster has a dedicated range in each dimension. SynDECA, in contrast, can generate clusters that do not have any dedicated range and guarantees the requested number of clusters.
6.2 Validation of SynDECA Datasets
We generated some datasets (details in Table 6.2) to study the behavior of some of the existing clustering algorithms. We used the implementation of K-Means available at [29], the DBSCAN implementation available at [30], and the ReCkless algorithm [31] to test our datasets. Since we could not obtain the code of any of the subspace clustering algorithms, we are unable to show results for those algorithms. The generated datasets are in two and three dimensions.
Name of Dataset       No. of Points    No. of Dimensions    No. of Clusters
dataset1              10000            2                    10
dataset2              10000            3                    6
dataset3              10000            2                    10
dataset4              10000            2                    30
dataset5              10000            2                    30
subspace-dataset1     10000            3                    5 (2 subspace)
subspace-dataset2     10000            3                    5 (1 subspace)
subspace-dataset3     10000            3                    6 (3 subspace)

Table 6.2: Details of datasets used in the experiments
Each cluster is plotted in a different color, and the noise in yet another color, for better visibility of the datasets. The original datasets can be seen in Figures 6.4, 6.6, 6.7, 6.9, 6.10, 6.12 and 6.14.
Clustering algorithms that do not account for the presence of noise are unable to find the clusters properly when noisy datasets are given as input; the results of the K-Means algorithm in Figures 6.5, 6.11 and 6.13 show this. When there is no noise and the clusters are well separated, K-Means is able to find the clusters (Figure 6.6). But when ε takes a very small value, K-Means gets confused (Figure 6.7) and gives a wrong result.
Density-based algorithms such as DBSCAN are able to find the clusters properly (Figure 6.8) when all the clusters are traditional clusters. When ε takes a very small value, the distance between two clusters can become negligible, and in such cases DBSCAN clubs two clusters into a single one (Figure 6.9). In the presence of subspace clusters it is sometimes able to identify the correct number of clusters (Figures 6.11, 6.13), but sometimes it is unable to figure out the clusters (Figure 6.14). The DBSCAN implementation used has a provision for not specifying the epsilon value, in which case the program chooses it itself. Different minpoints values were given to the program and the results verified. For subspace-dataset3, when minpoints is in [1, 7] the number of clusters identified is in the range [7, 26]; for minpoints in [8, 18] the number of clusters identified is only five; for other inputs it identified very few clusters.
ReCkless [9] is an agglomerative algorithm based on reverse nearest neighbors. With a small k value it finds a large number of clusters and gradually merges them as the k value increases; this change can be seen in Figure 6.15. As k is increased, the random-shaped clusters merge and form a single cluster (Figure 6.16), but at this point more noise points are treated as cluster points. When dataset5 (whose ε value is very small) is used as input, ReCkless is able to identify the regular clusters for smaller k values (≤ 32) (Figure 6.17), but there is a sudden merge at k = 33 because the clusters are nearby. Even at higher k values the two clusters remain identified as one cluster (Figure 6.18), since ReCkless is a hierarchical clustering algorithm.
Grid-based clustering algorithms such as STING build a hierarchical grid structure and try to find the dense regions. Datasets generated with some subspace clusters will make grid-based methods fail, but when only normal clusters are generated they give the correct result.
Projection-based subspace clustering methods are easily misled by the presence of noise, as they ignore the noise points. Density-based subspace clustering algorithms can also get confused by noise, since the density of the noise is almost the same as the density of a subspace cluster when seen in all dimensions. The various datasets presented here show that SynDECA is capable of generating datasets on which algorithms can form the clusters properly, and also, by choosing different parameter values, datasets which confuse the clustering algorithms. Table 6.3 gives an overview of the results from the various algorithms along with some remarks.
Figure 6.1: Few examples of the generated data in 2-Dimensions.
Figure 6.2: Few examples of the generated data in 3-Dimensions.
Figure 6.3: Top row datasets are from “clusutils” — Bottom row datasets are from “SynDECA”.
Figure 6.4: Plot of generated datasets (a) dataset1 (b) dataset2
Figure 6.5: Plot of clustering result of K-Means algorithm on (a) dataset1 (b) dataset2
Figure 6.6: Plot of dataset3 (a) Original (b) Result of K-Means algorithm
Figure 6.7: Plot of dataset4 (a) Original (b) Result of K-Means algorithm
Figure 6.8: Plot of clustering result of DBSCAN algorithm on (a) dataset1 (b) dataset2
Figure 6.9: Plot of dataset5 (a) Original (b) Result of DBSCAN algorithm
Figure 6.10: Plot of subspace-dataset1 (a) three dimensional plot (b) two dimensional projection
Figure 6.11: Clustering result of subspace-dataset1 (a) K-Means (b) DBSCAN
Figure 6.12: Plot of subspace-dataset2 (a) three dimensional plot (b) two dimensional projection
Figure 6.13: Clustering result of subspace-dataset2 (a) K-Means (b) DBSCAN
Figure 6.14: Plot of subspace-dataset3 (a) Original (b) Result of DBSCAN
Figure 6.15: Plot of clustering result of ReCkless algorithm on dataset1 (a) k=25 (b) k=60
Figure 6.16: Plot of clustering result of ReCkless algorithm on dataset1 (a) k=90 (b) k=162
Figure 6.17: Plot of clustering result of ReCkless algorithm on dataset5 (a) k=32 (b) k=33
Figure 6.18: Plot of clustering result of ReCkless algorithm on dataset5 (a) k=50 (b) k=78
Chapter 7
Conclusions and Future Work
Datasets are required for testing the correctness of any clustering algorithm. Since not many real-life datasets are available, there is a need for a tool that generates clustering datasets, and the existing methods to generate such datasets are not good enough. We proposed algorithms to generate clustering datasets in R^d (where d is the number of dimensions). We generate clusters having regular and arbitrary shapes, and the proposed methods make sure that there are exactly the requested number of clusters in the dataset. Researchers have found that clusters may spread over only a subset of the dimensions of a dataset; we therefore also generate datasets which help in evaluating subspace clustering algorithms. Experimental results show that the existing traditional clustering algorithms may not identify the subspace clusters.
The work presented in this thesis can be extended as follows.
1. The proposed algorithms take a cuboid as the minimum bounding box. Methods are to be developed to handle arbitrary bounding boxes; datasets generated in such bounding boxes would contain clusters like those shown in Figure 7.1.
2. We concentrated only on numerical datasets. Some clustering algorithms have been proposed [32, 33] to deal with purely categorical data or a mixture of categorical and real-valued dimensions. Techniques need to be devised to generate datasets for the evaluation of such clustering methods.
Figure 7.1: Clusters in arbitrary shaped bounding boxes
Bibliography
[1] J. R. Vennam, “Syndeca.” http://cde.iiit.net/syndeca/.
[2] M. A. Nascimento, D. Pfoser, and Y. Theodoridis, “Synthetic and real spatiotemporal datasets,” Bulletin of the
IEEE Computer Society Technical Committee on Data Engineering, 2003.
[3] “Gstd: Generate spatio temporal data.” http://www.cti.gr/RD3/GSTD/index2.html.
[4] “G-terd: Generator for time-evolving regional data.” http://delab.csd.auth.gr/stdbs/g-terd.html.
[5] J.-M. Saglio and J. Moreira, “Oporto: A realistic scenario generator for moving objects,” GeoInformatica, vol. 5, no. 1, pp. 71–93, 2001.
[6] J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers, 2001.
[7] L. Ertoz, M. Steinbach, and V. Kumar, “Finding clusters of different sizes, shapes, and densities in noisy, high
dimensional data,” SIAM International Conference on Data Mining (SIAM), 2003.
[8] L. Ertoz, M. Steinbach, and V. Kumar, “A new shared nearest neighbor clustering algorithm and its applications,”
Workshop on Clustering High Dimensional Data and its Applications at 2nd SIAM International Conference on
Data Mining, 2002.
[9] S. Vadapalli, S. R. Valluri, K. Karlapalem, and P. Gupta, “Cluster analysis and outlier detection using reverse
nearest neighbors,” Tech. Rep. IIIT-H/TR/2004/009, International Institute of Information Technology, Hyderabad.
[10] A. Foss and O. R. Zaïane, “A parameterless method for efficiently discovering clusters of arbitrary shape in large datasets,” International Conference on Data Mining (ICDM), pp. 179–186, 2002.
[11] G. W. Milligan, “An algorithm for generating artificial test clusters,” Psychometrika, 1985.
[12] T. Zhang, R. Ramakrishnan, and M. Livny, “BIRCH: An efficient data clustering method for very large databases,” Proc. of ACM SIGMOD International Conference on Management of Data, 1996.
[13] M. Zait and H. Messatfa, “A comparative study of clustering methods,” FGCS Journal, Special Issue on Data Mining, 1997.
[14] D. X. Pape, “Clusutils.” http://clusutils.sourceforge.net/, http://clusutils.sourceforge.net/manual/cg-man.html,
September 2000.
[15] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases,” International Conference on Knowledge Discovery and Data Mining (KDD), pp. 226–231, 1996.
[16] S. Guha, R. Rastogi, and K. Shim, “CURE: An efficient clustering algorithm for large databases,” Proc. of ACM SIGMOD International Conference on Management of Data, 1998.
[17] R. T. Ng and J. Han, “Efficient and effective clustering methods for spatial data mining,” Proc. of International
Conference on Very Large Databases (VLDB), 1994.
[18] M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander, “OPTICS: Ordering points to identify the clustering structure,” Proc. of ACM SIGMOD International Conference on Management of Data, 1999.
[19] E. Knorr and R. Ng, “Algorithms for mining distance-based outliers in large datasets,” Proc. of International
Conference on Very Large Databases (VLDB), 1998.
[20] G. Sheikholeslami, S. Chatterjee, and A. Zhang, “WaveCluster: A multi-resolution clustering approach for very large spatial databases,” Proc. of International Conference on Very Large Databases (VLDB), 1998.
[21] W. Wang, J. Yang, and R. Muntz, “STING: A statistical information grid approach to spatial data mining,” Proc. of International Conference on Very Large Databases (VLDB), 1997.
[22] C. C. Aggarwal, C. Procopiuc, J. L. Wolf, P. S. Yu, and J. S. Park, “Fast algorithms for projected clustering,” Proc. of ACM SIGMOD International Conference on Management of Data, 1999.
[23] L. Parsons, E. Haque, and H. Liu, “Evaluating subspace clustering algorithms,” SIAM International Conference on Data Mining (SIAM), 2004.
[24] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, “Automatic subspace clustering of high dimensional data for data mining applications,” Proc. of ACM SIGMOD International Conference on Management of Data, 1998.
[25] C. C. Aggarwal and P. S. Yu, “Finding generalized projected clusters for high dimensional spaces,” Proc. of ACM SIGMOD International Conference on Management of Data, 2000.
[26] C. Domeniconi, D. Papadopoulos, D. Gunopulos, and S. Ma, “Subspace clustering of high dimensional data,” SIAM International Conference on Data Mining (SIAM), 2004.
[27] K. Kailing, H.-P. Kriegel, and P. Kröger, “Density-connected subspace clustering for high-dimensional data,” SIAM International Conference on Data Mining (SIAM), 2004.
[28] J. R. Vennam and S. Vadapalli, “SynDECA: A toolkit to generate synthetic datasets for evaluation of clustering algorithms,” 11th International Conference on Management of Data (COMAD), 2005.
[29] “Efficient algorithms for k-means clustering.” http://www.cs.umd.edu/~mount/Projects/KMeans/.
[30] “Matlab implementation of dbscan.” http://www.chemometria.us.edu.pl/download.html.
[31] S. Vadapalli, “k-reverse nearest neighbor (k-rnn) cluster analysis and outlier detection.” http://cde.iiit.ac.in/RNNs/.
[32] V. Ganti, J. Gehrke, and R. Ramakrishnan, “CACTUS: Clustering categorical data using summaries,” International Conference on Knowledge Discovery and Data Mining (KDD), 1999.
[33] M. Peters and M. J. Zaki, “CLICK: Clustering categorical data using k-partite maximal cliques,” International Conference on Data Engineering (ICDE), 2005.