Mr. Chetan J. Awati & Prof. D.G. Chougule

Clustering Algorithm for 2D Multi-Density Large Dataset Using Adaptive Grids

Mr. Chetan J. Awati
[email protected]
Lecturer in Department of Information Technology
Bharati Vidyapeeth’s College of Engineering, Kolhapur
Shivaji University, Kolhapur, 416013, India

Prof. D.G. Chougule
[email protected]
Professor in Department of Computer Science & Engineering
Tatyasaheb Kore Institute of Engineering & Technology, Warnanagar
Shivaji University, Kolhapur, 416113, India

Abstract

Clustering is a key data mining problem. Density-based clustering algorithms have recently gained popularity in the data mining field. The density- and grid-based technique is a popular way to mine clusters in large spatial datasets, wherein clusters are regarded as regions denser than their surroundings. The attribute values and the ranges of these attributes characterize the clusters. In this paper we adapt a density-based clustering algorithm, Grid Density Clustering using Triangle subdivision (GDCT), capable of identifying arbitrarily shaped embedded clusters as well as multi-density clusters over massive spatial datasets. Experimental results on a wide variety of synthetic and real datasets demonstrate the effectiveness of the adaptive grids and the triangle subdivision method.

Keywords: Clustering, Density Based, Intrinsic Cluster, Adaptive Grids

1. INTRODUCTION

Clustering involves grouping similar objects together. It is one of the fundamental data mining tasks and can serve as an independent data mining tool or as a preprocessing step for other data mining tasks such as classification. Clustering is a versatile unsupervised learning method that can be used in several ways, e.g., outlier detection, data reduction, and identification of natural data types and classes. Major clustering techniques have been classified into partitional, hierarchical, density based, grid based and model based.
Among these techniques, the density-based approach is known for its capability of discovering arbitrarily shaped clusters of good quality even in noisy datasets. The grid-based clustering approach is well known for its fast processing time, especially for large datasets. In the proposed algorithm the data points are grouped into blocks and the density of each block is calculated. The blocks are then clustered by forming adaptive grids. Rough clusters are formed first, and a triangle subdivision algorithm is then used to refine them. This yields clusters of good quality.

International Journal of Data Engineering

1.1 Related Work

Clustering methods can be categorized into partitioning, hierarchical, density-based, grid-based, and model-based methods. DBSCAN is a density-based clustering algorithm capable of discovering clusters of various shapes even in the presence of noise. However, because it uses global density parameters, it fails to detect embedded or nested clusters. Grid-based clustering algorithms are computationally efficient: their cost depends on the number of cells in each dimension of the quantized space rather than on the number of data points, and they are not sensitive to input order. STING uses a multi-resolution approach, which is query-independent and easy to parallelize. However, the clusters it finds have only horizontal or vertical boundaries; no diagonal boundary is detected. WaveCluster also uses a multidimensional grid structure and can detect clusters at varying levels of accuracy by removing outliers. However, it is not suitable for high-dimensional datasets. CLIQUE automatically finds subspaces of the highest dimensionality and is insensitive to the order of input. However, the simplicity of the method may come at the cost of clustering accuracy. Some clustering algorithms can handle multi-density datasets.
Chameleon [2] can handle multi-density datasets, but for large datasets its time complexity is too high. SNN (shared nearest neighbor) can find clusters of varying shapes, sizes and densities and can deal with multi-density datasets. The disadvantage of SNN is its low precision in multi-density clustering and in finding outliers. DGCL [5] is a density-grid based clustering approach. However, since it uses a uniform density threshold, low-density clusters are lost. Real-life datasets can also have skewed distributions and may contain nested cluster structures, which are very difficult to discover. OPTICS and EnDBSCAN attempt to handle such situations. OPTICS can identify embedded clusters; however, it is very sensitive to the three input parameters that the user must provide. EnDBSCAN can detect embedded clusters; however, its performance degrades as the volume of data increases, and it is highly dependent on two input parameters.

Based on our selected survey and experimental analysis, it has been observed that the density-based approach is most suitable for quality cluster detection over massive datasets, while the grid-based approach is suitable for fast processing of large datasets. Most real-life datasets have skewed distributions, so relying on a global input parameter to detect good-quality clusters in all such datasets seems impractical. An algorithm that is capable of handling large data and at the same time effectively detects multiple, nested or embedded clusters even in the presence of noise is of utmost importance. This paper presents a clustering algorithm, GDCT, which can effectively address the previously mentioned clustering challenges.

2. GRID DENSITY CLUSTERING USING TRIANGLE SUBDIVISION

The aim of our clustering algorithm is to discover basic as well as global clusters over spatial datasets of variable density.
We introduce some new terms which are used in GDCT.

Cell Density: The number of spatial point objects within a particular grid cell.

Useful Cell: Only cells that contain data points are treated as useful cells.

Neighbor Cell: The cells that are edge neighbors or vertex neighbors of the current cell are the neighbors of the current cell.

Density Confidence of a cell: If the ratio of the densities of the current cell and one of its neighbors is greater than or equal to some threshold β (user input), then β is the density confidence between them. For two cells p and q to be merged into the same cluster the following condition should be satisfied: dn(q) / dn(p) ≥ β, where dn represents the density of that particular cell.

Reachability of a cell: A cell p is reachable from a cell q if p is a neighbor cell of q and cell p satisfies the density confidence condition w.r.t. cell q.

Triangle Density: The number of spatial point objects within a particular triangle of a grid cell.

Useful Triangle: Only triangles that are populated, i.e., that contain data points, are treated as useful triangles.

Neighbor Triangle: The triangles that share a common edge with the current triangle are the neighbors of the current triangle.

Density Confidence of a triangle: Two triangles can be merged into the same cluster if the ratio of the densities of that triangle and one of its neighbors is greater than or equal to β/4, i.e., dn(Tp) / dn(Tq) ≥ β/4, where dn represents the density of the triangle.

Reachability of a triangle: A triangle p is reachable from a triangle q if p is a neighbor triangle of q and triangle p satisfies the density confidence condition w.r.t. triangle q.

Cluster: A cluster C is defined to be the set of points belonging to the set of reachable cells and triangles, i.e., if p ∈ C and q is reachable from p w.r.t. β, then q ∈ C, where p and q are cells or triangles.
Noise: Noise is the set of points belonging to the cells (or triangles) that do not belong to any cluster. Let C1, C2, ..., Ck be the clusters w.r.t. β; then Noise = { no_p | p ∈ n×n, ∀i: no_p ∉ Ci }, where no_p is the set of points in cell p and Ci (i = 1, ..., k) are the clusters.

Density Confidence of a Useful Cell: The density confidence for a given set of cells reflects the general trend of that set. If the density of one cell is abnormal compared with the others, it is excluded from the set. Similarly, each useful cell has a density confidence with each of its neighbor cells. If the current cell and one of its neighbor cells do not satisfy the density confidence condition, then that neighbor cell is not included in the local dense area.

FIGURE 1: Process Flow Diagram (Spatial dataset → Conversion of spatial dataset into image → Finding pixels from image dataset → Grid formation and density calculation → Applying GDCT algorithm → Clusters)

The process flow diagram above depicts the sequence of operations. Since we are dealing with spatial datasets, the dataset is first converted into an image. The pixels are then extracted from the image. Next, the grid structure is applied and the cell densities are calculated. Finally, the GDCT algorithm is applied to obtain the clusters. Here are some modules to implement the algorithm.

Initially, the GDCT algorithm divides the data space into n × n non-overlapping square grid cells, where n is a user input, and maps the data set onto the cells. It then computes the density of each cell and sorts the cells according to their densities. Next, it identifies the densest cell among the unclassified cells, traverses the neighbor cells starting from that cell, and forms the adaptive grid (rough cluster).

Algorithm 1 Expand_adaptive_grid(Cell)
--------------------------------------------------------------------------
IF Cell.classified == CLASSIFIED
    RETURN
ELSE
    Cell.classified = CLASSIFIED
    Cell.Clus_id = Clus_id    // assign the cluster id to the cell
    Neighb = NeighborQuery(Cell)
    FOR i FROM 1 TO Neighb.size DO
        IF (Neighb.get(i).density / Cell.density) >= β
            Expand_adaptive_grid(Neighb.get(i))
        END IF
    END FOR
END IF
END // End Expand_adaptive_grid
--------------------------------------------------------------------------

The formation of the rough cluster is explained in Algorithm 1 (Expand_adaptive_grid). It starts with the maximum-density cell from the sorted list. The adaptive grid expands to a neighboring cell based on the position of the cell: a neighboring cell is merged if it is not already a member of an adaptive grid and it satisfies the density confidence condition. The expansion continues recursively in the same way until no neighboring cell satisfies the condition. The cells falling in the same adaptive grid are given the same cluster id.

Algorithm 2 Expand_triangle(Triangle)
--------------------------------------------------------------------------
IF Triangle.classified == CLASSIFIED
    Neighbor_Triangle = NeighborQuery(Triangle)
    FOR i FROM 1 TO Neighbor_Triangle.size DO
        IF Neighbor_Triangle.get(i).classified == UNCLASSIFIED
            IF (Neighbor_Triangle.get(i).density / Triangle.density) >= β/4
                Neighbor_Triangle.get(i).Cluster_id = Triangle.Cluster_id
                Neighbor_Triangle.get(i).classified = CLASSIFIED
                Expand_triangle(Neighbor_Triangle.get(i))
            END IF
        END IF
    END FOR
END IF
END // End Expand_triangle
--------------------------------------------------------------------------

The process then checks the neighbors of the cells of the last formed adaptive grid. If any of the neighbors is an unclassified useful cell, then both the adaptive grid cell and the unclassified neighbor cell are triangulated. The process of triangle subdivision is explained in Algorithm 2 (Expand_triangle).
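The rough-cluster stage described above (mapping points to an n × n grid, sorting cells by density, and expanding as in Algorithm 1) can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the names cell_densities, neighbors and rough_clusters, the bounds argument, and the dictionary representation are all hypothetical. The recursion of Algorithm 1 is replaced here by an explicit queue, which reaches the same set of cells while avoiding deep recursion on large grids. Triangle subdivision and noise labelling are not covered, so an isolated useful cell simply keeps its own cluster id.

```python
from collections import deque


def cell_densities(points, n, bounds):
    """Count the points falling in each cell of an n x n grid.

    bounds = (xmin, ymin, xmax, ymax) is an assumed description of the
    data space. Only useful (non-empty) cells appear as keys.
    """
    xmin, ymin, xmax, ymax = bounds
    density = {}
    for x, y in points:
        # Clamp so that points on the upper boundary fall in the last cell.
        col = min(int((x - xmin) / (xmax - xmin) * n), n - 1)
        row = min(int((y - ymin) / (ymax - ymin) * n), n - 1)
        density[(row, col)] = density.get((row, col), 0) + 1
    return density


def neighbors(cell, n):
    """Edge and vertex neighbors (up to 8) of a cell inside the grid."""
    r, c = cell
    return [(r + dr, c + dc)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0) and 0 <= r + dr < n and 0 <= c + dc < n]


def rough_clusters(density, n, beta):
    """Form adaptive grids, starting each one from the densest unclassified cell."""
    cluster_id = {}
    next_id = 0
    for seed in sorted(density, key=density.get, reverse=True):
        if seed in cluster_id:
            continue  # already part of an earlier adaptive grid
        cluster_id[seed] = next_id
        queue = deque([seed])
        while queue:
            cell = queue.popleft()
            for nb in neighbors(cell, n):
                if nb not in density or nb in cluster_id:
                    continue  # empty cell, or already classified
                # Density confidence condition: dn(nb) / dn(cell) >= beta
                if density[nb] / density[cell] >= beta:
                    cluster_id[nb] = next_id
                    queue.append(nb)
        next_id += 1
    return cluster_id
```

For example, with n = 4 over the square (0, 0, 4, 4), ten points in cell (0, 0), eight in the adjacent cell (0, 1), and one in the far cell (3, 3), a threshold β = 0.5 merges the first two cells (8/10 ≥ 0.5) into one adaptive grid, whereas β = 0.9 keeps them apart.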
First, the neighbors of a triangle are identified. During triangle subdivision each grid cell is divided into four triangles. If a triangle is not part of any adaptive grid and it satisfies the density confidence of a triangle as given in Algorithm 2 (Expand_triangle), then it is added to the rough cluster. This process continues recursively until no neighboring triangle satisfies the condition. After triangle subdivision, the triangles are merged into the cluster and assigned the same cluster id.

3. EXPERIMENTAL RESULTS

The proposed algorithm was applied to various synthetic and real datasets, as well as to the Chameleon datasets t4.8k.dat, t5.8k.dat and t7.10k.dat. The experiments show that the clustering result depends on the threshold β, which was varied in the interval [0.3, 0.9]. The experiments were run on a personal computer with 1 GB of RAM and a 1.60 GHz Intel Core 2 Duo processor, running Windows 7 Ultimate.

3.1 Clustering on Real Dataset

FIGURE 2: Earthquake Dataset

FIGURE 3: Clustering Result for Earthquake Dataset

The earthquake dataset is taken from the UCI machine learning repository and contains 65000 data points giving the longitude and latitude of earthquake locations. Four different clusters are found. Based on these clusters, the frequency of earthquakes in a required area can easily be determined.

FIGURE 4: Hypsography Supplemental Points of Alaska State in U.S.A.

FIGURE 5: Clustering Result for Hypsography Supplemental Points of Alaska State in U.S.A.

The result shows eight different clusters in the real dataset of hypsography supplemental points of Alaska State in U.S.A.
FIGURE 6: Global Summation of Vegetation for Calendar Year 1986 to 1988

FIGURE 7: Clustering Result for Global Summation of Vegetation for Calendar Year 1986 to 1988

For the real dataset of global summation of vegetation, the clusters are found accurately according to density.

FIGURE 8: Map of Caribbean Island

FIGURE 9: Clustering Result for Map of Caribbean Island

This is the map of a Caribbean island; the clustering result shows ten clusters.

3.2 Clustering on Chameleon Data

FIGURE 10: Chameleon t4.8k.dat Dataset

FIGURE 11: Clustering Result for Chameleon t4.8k.dat Dataset

This is the Chameleon t4.8k.dat dataset containing 8000 data points. Six different clusters are detected efficiently, so embedded and nested clusters are detected properly on the Chameleon dataset. Outliers are also detected successfully.

3.3 Clustering on Synthetic Dataset

FIGURE 12: Synthetic Dataset

FIGURE 13: Clustering Result for Synthetic Dataset

This is the synthetic dataset containing 2913 points. Two clusters are found efficiently. From the results obtained above we can conclude that GDCT is capable of detecting intrinsic as well as multi-density clusters efficiently.

4. COMPARISON WITH ITS COUNTERPARTS

Algorithm      | Input Parameters    | Outlier Handling | Scalability | Varying Density | Embedded Cluster | Complexity
DBSCAN         | 2 (MinPts, ε)       | Yes              | Yes         | No              | No               | O(n log n)
OPTICS (Int.)  | 3 (MinPts, ε, ε′)   | Yes              | Yes         | Yes             | No               | O(n log n) + O(n)
GDCT           | 2 (n, β)            | Yes              | Yes         | Yes             | Yes              | O(n log n)

TABLE 1: Comparison with DBSCAN and OPTICS

A detailed comparative study among the three algorithms (i.e., GDCT, DBSCAN and OPTICS (Int.)) was carried out in light of those real and synthetic datasets.
Table 1 presents the same in terms of six crucial factors. Column 1 shows that GDCT, like DBSCAN, requires fewer input parameters than OPTICS. Similarly, column 5 shows that embedded clusters can be detected only by the GDCT algorithm. From the complexity point of view, column 6 shows that the complexity of GDCT is similar to that of DBSCAN. OPTICS, however, requires an additional O(n) (at least) to classify the points after ordering, apart from the O(n log n) needed when a spatial index is used. The remaining columns establish that, in terms of the other quality parameters, GDCT performs as well as its two counterparts.

FIGURE 14: Time Cost of GDCT for Different Data Points

The time cost of GDCT for different numbers of data points is shown in Figure 14. Three different categories are shown in the graph. The time required for the formation of the rough cluster is sometimes greater than the time required for the triangle subdivision. The time cost of GDCT is less than that of DBSCAN, OPTICS and EnDBSCAN.

5. CONCLUSION

In this paper, a clustering algorithm for massive spatial datasets with variable density has been presented. The results show that GDCT can find the clusters correctly and eliminate outliers efficiently. It is also fast. Users who want a more exact result can adjust the threshold β. Moreover, GDCT is not affected by outliers, handles them properly, and is insensitive to input order. Its time complexity is observed to depend on the number of cells. From the experimental results given above, we can conclude that GDCT is highly capable of detecting intrinsic as well as multi-density clusters with good quality. All the sequential algorithms compared were found to be slower on the above datasets, and they could not detect embedded clusters.
6. REFERENCES

[1] Zhi-Wei SUN. “A Cluster Algorithm Identifying the Clustering Structure”, IEEE International Conference on CSSE 2008.

[2] Karypis, G., Han, & Kumar, V. “CHAMELEON: A hierarchical clustering algorithm using dynamic modeling”. IEEE Computer, 32(8), pp. 68-75, 1999.

[3] Nagesh, H. S., et al. “A scalable parallel subspace clustering algorithm for massive data sets”, in Proc. of Parallel Processing, pp. 477, 2000.

[4] Ertoz, L., et al. “Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High Dimensional Data”, in Proc. of SDM '03.

[5] Ho Seok Kim, Song Gao, Ying Xia, Gyoung Bae Kim, and Hae Young Bae. “DGCL: An Efficient Density and Grid Based Clustering Algorithm for Large Spatial Database”, WAIM 2006, LNCS 4016, pp. 362-371, 2006.

[6] S. Roy and D.K. Bhattacharyya. “An Approach to Find Embedded Clusters Using Density Based Techniques”, ICDCIT 2005, LNCS 3816, pp. 523-535, 2005.

[7] Ying Xia, GuoYin Wang, Song Gao. “An Efficient Clustering Algorithm for 2D Multi-density Dataset in Large Database”, 2007 IEEE International Conference on Multimedia and Ubiquitous Engineering (MUE'07).

[8] Ozge Uncu, William A. Gruver, Dilip B. Kotak. “GRIDBSCAN: GRId Density-Based Spatial Clustering of Applications with Noise”, 2006 IEEE International Conference on Systems, Man, and Cybernetics, October 8-11, 2006, Taipei, Taiwan.

[9] Song Gao, Ying Xia. “GDCIC: A Grid-based Density-Confidence-Interval Clustering Algorithm for Multidensity Dataset in Large Spatial Database”, Sixth IEEE International Conference on Intelligent Systems Design and Applications (ISDA'06).