International Journal of Computer Science And Technology (IJCST), Vol. 3, Issue 4, Oct - Dec 2012
ISSN : 0976-8491 (Online) | ISSN : 2229-4333 (Print)

A Study of DBSCAN Algorithms for Spatial Data Clustering Techniques

1Dr. Mohammed Ali Hussain, 2Dr. R. Satya Rajesh, 3Md. Abdul Ahad
1,2Dept. of ECE, KL University, Guntur, AP, India

Abstract
Spatial data mining is the application of data mining techniques to spatial data. Data mining in general is the search for hidden patterns that may exist in large databases. Spatial data mining is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases. Because of the huge amounts (usually terabytes) of spatial data that may be obtained from satellite images, medical equipment, video cameras, etc., it is costly and often unrealistic for users to examine spatial data in detail. Spatial data mining aims to automate this knowledge discovery process. Thus, new and efficient methods are needed to discover knowledge from large databases. For this purpose, clustering is one of the most valuable methods in spatial data mining. The main advantage of clustering is that interesting structures or clusters can be found directly from the data without using any prior knowledge. This paper presents an overview of density-based methods for spatial data clustering.

Keywords
Clustering, DBSCAN, Density-based method, Spatial data.

I. Introduction
Huge amounts of data have been collected through advances in data collection techniques and database technologies. This explosive growth of data creates the need for automated discovery of knowledge and information from data, which has led to a promising and emerging field called data mining or knowledge discovery in databases [1]. Spatial data mining is the discovery of interesting relationships and characteristics that may exist implicitly in spatial databases [2].
Knowledge discovery in databases (KDD) follows several stages, including data selection, data preprocessing, information extraction or spatial data mining, interpretation and reporting [3]. Data mining is a core component of the KDD process. Spatial data mining is a demanding field, since huge amounts of spatial data have been collected in various applications such as real-estate marketing, traffic accident analysis, environmental assessment, disaster management and crime analysis. Thus, new and efficient methods are needed to discover knowledge from large databases such as crime databases. Because prior knowledge about the data is usually lacking, clustering is one of the most valuable methods in spatial data mining. The main advantage of clustering is that interesting structures or clusters can be found directly from the data without using any prior knowledge. Clustering algorithms can be roughly classified into hierarchical methods and non-hierarchical methods. Non-hierarchical methods can be further divided into four categories [4]:
1. Partitioning methods
2. Density-based methods
3. Grid-based methods
4. Model-based methods

In this paper, we present an overview of the family of density-based algorithms built on DBSCAN. DBSCAN is a well-known spatial clustering technique that has been shown to find clusters of arbitrary shapes by defining a cluster to be a maximal set of density-connected points. DBSCAN can find the genuine clusters of different shapes as long as the density of the clusters can be estimated correctly in advance and the density of the clusters is uniform.

II. DBSCAN Algorithm
The DBSCAN (Density Based Spatial Clustering of Applications with Noise) algorithm is based on a center-based approach. In the center-based approach, density is estimated for a particular point in the dataset by counting the number of points within a specified radius, Eps, of that point. This includes the point itself.
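The center-based density estimate described above can be illustrated with a short Python sketch. It is only a minimal brute-force illustration, not the implementation of any cited paper: the function name `eps_neighborhood_counts`, the use of Euclidean distance, and the choice to treat a point at distance exactly Eps as inside the radius are all assumptions made for the example.

```python
import math

def eps_neighborhood_counts(points, eps):
    """Return, for each point, the number of points within distance eps.

    Assumptions (not from the paper): Euclidean distance, and a point at
    distance exactly eps counts as inside the radius. The point itself is
    always counted, as the text requires.
    """
    counts = []
    for p in points:
        # Brute-force scan over all points, including p itself (distance 0).
        n = sum(1 for q in points if math.dist(p, q) <= eps)
        counts.append(n)
    return counts

# Hypothetical toy dataset: three nearby points and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
counts = eps_neighborhood_counts(points, eps=1.0)
print(counts)  # [3, 2, 2, 1]
```

The scan compares every pair of points, so it costs quadratic time in the number of points; a spatial index would normally replace it on large datasets.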
The center-based approach to density allows us to classify a point as a core point, a border point, or a noise (background) point. A point is a core point if the number of points within Eps, a user-specified parameter, exceeds a certain threshold, MinPts, which is also a user-specified parameter. The DBSCAN algorithm is given as follows:
• Step 1: Label all points as core, border, or noise points.
• Step 2: Eliminate noise points.
• Step 3: Put an edge between all core points that are within Eps of each other.
• Step 4: Make each group of connected core points into a separate cluster.
• Step 5: Assign each border point to one of the clusters of its associated core points.

Any two core points that are within a distance Eps of one another are put in the same cluster. Likewise, any border point that is close enough to a core point is put in the same cluster as that core point. Noise points are discarded. The basic approach to determining the parameters Eps and MinPts is to look at the behavior of the distance from a point to its kth nearest neighbor, which is called the k-dist. The k-dists are computed for all the data points for some k.

III. VDBSCAN Algorithm
Because DBSCAN uses a density-based definition of a cluster, it is relatively resistant to noise and can handle clusters of different shapes and sizes. Thus, DBSCAN can find many clusters that could not be found using some other clustering algorithms, such as K-means. However, the main weakness of DBSCAN is that it has trouble when the clusters have greatly varied densities. To overcome this limitation, VDBSCAN (Varied Density Based Spatial Clustering of Applications with Noise) [5] was introduced. The VDBSCAN algorithm is given as follows:
Step 1: Calculate and store the k-dist for each object and partition the k-dist plots.
Step 2: The number of densities is given intuitively by the k-dist plot.
Step 3: Choose the parameter Eps_i automatically for each density.
Step 4: Scan the dataset and cluster the different densities using the corresponding Eps_i.
Step 5: Finally, display the valid clusters corresponding to the varied densities.

IV. IDBSCAN Algorithm
The IDBSCAN (Improved Sampling-based DBSCAN) algorithm [6] was designed to eliminate the sampling on DBSCAN. Let Eps denote the radius of the region searched for an object's neighbors, and let MinPts represent the minimum number of objects. In Fig. 1 (DBSCAN), if the neighborhood of point C contains at least MinPts points, then C is called a core point. Points in the search region are clustered repeatedly until fewer than MinPts remain. For instance, point B, which is searched from point P, involves fewer than MinPts points. Consequently, point B must be treated as a border point. Point O has fewer than MinPts neighboring points, so it cannot form a new cluster.

Fig. 1: Density Reachable

In IDBSCAN, representative points on the circle form the Marked Boundary Objects (MBO). The MBO identify the eight distinct objects in the neighborhood of an object. The points that are closest to these eight positions are selected as seeds. Therefore, at most eight objects are labeled at once. IDBSCAN requires fewer seeds than DBSCAN, as indicated in Fig. 2.

Fig. 2: Circle with Eight Distinct Points

V. FDBSCAN Algorithm
The FDBSCAN (Fast DBSCAN) algorithm [7] extends DBSCAN by adding non-linear searching, and has the same parameters as DBSCAN. DBSCAN scans many objects repeatedly. FDBSCAN solves this problem by applying an external scan to reduce the search times. FDBSCAN defines clusters by density and reduces the number of point searches and the time cost, depending on the point at which clusters are merged.

Fig. 3: Clustering of Objects with Intersections

In Fig. 3, point O is involved in two different clusters, P and Q.
Additionally, point O is defined as a core point. Hence, clusters P and Q are the same cluster.

VI. DBSCAN-GM Algorithm
DBSCAN-GM (DBSCAN and Gaussian Means) combines the Gaussian-Means and DBSCAN algorithms. The idea of DBSCAN-GM is to cover the limitations of DBSCAN by exploiting the benefits of Gaussian-Means. DBSCAN can automatically find noisy objects, which most other methods cannot do. Density-based clustering is capable of finding clusters of arbitrary shape and handles noise well. On the other hand, it is difficult to set the density threshold, because DBSCAN requires two input parameters (Eps and MinPts) that are hard to guess.

The K-means [10] clustering algorithm is a well-known technique for clustering data. It is a distance-based clustering algorithm that partitions the data into a predetermined number of clusters. It starts the iterative process by randomly selecting K points as initial cluster centers (the initial centroids of the K data groups). A main problem of K-means is that the user must supply the number of clusters K in order to construct the partitions, and it is not obvious how to find the best value for K.

Gaussian-Means [8] is an improved algorithm that learns K and the initial center locations while clustering. It is based on a statistical test of the hypothesis that a subset of the data follows a Gaussian distribution. It consists of applying the expectation-maximization (EM) [9] algorithm together with K-means to estimate the number of clusters that best fits the data and the initial centroid of each cluster. The DBSCAN-GM algorithm is given as follows:
Step 1: Run Gaussian-Means and find the centers of all the data.
Step 2: Estimate the parameter Eps of DBSCAN-GM.
Step 3: Estimate the parameter MinPts of DBSCAN-GM.
Step 4: After computing the global parameters Eps and MinPts, run the density algorithm DBSCAN with these two determined inputs: DBSCAN (Eps, MinPts).

VII. GMDBSCAN Algorithm
DBSCAN is one of the most popular algorithms for cluster analysis. It can discover clusters of arbitrary shape and separate out noise. However, the algorithm cannot choose its parameters according to the distribution of the dataset. It simply uses the global MinPts parameter, so the clustering result for a multi-density database is inaccurate. In addition, when it is used to cluster large databases, it costs too much time. To solve these problems, GMDBSCAN (Multi-Density DBSCAN Cluster Based on Grid) is used. The GMDBSCAN algorithm [11] is based on a spatial index and a grid technique. The GMDBSCAN algorithm is given as follows:
Step 1: Take the data sets as input.
Step 2: Divide the data space into grids.
Step 3: Compute the density of each grid and construct the SP-Tree. The SP-Tree [12] is a space index structure based on space division, which can hold the position information of each grid.
Step 4: Record the neighborhood relationship of two points in a bitmap, and compute distances only for the points in the grids neighboring the grid of the point. This is called bitmap formation.
Step 5: GMDBSCAN relies mainly on the idea of local clustering, identifying a local MinPts for each grid. Each grid is clustered with its local MinPts to form a number of distributed local clusters. Because the data density within the same cluster should be similar, two sub-clusters that share points and have similar density can be merged into a single cluster.

VIII. Conclusion
DBSCAN is a well-known spatial clustering technique that has been shown to find clusters of arbitrary shapes by defining a cluster to be a maximal set of density-connected points. DBSCAN has many limitations, and an alternative method has been developed in response to each of them. For example, DBSCAN has trouble when the clusters have greatly varied densities. To avoid this, VDBSCAN was introduced.
The IDBSCAN (Improved Sampling-based DBSCAN) algorithm was designed to eliminate the sampling on DBSCAN. FDBSCAN extends DBSCAN by adding non-linear searching, and has the same parameters as DBSCAN. DBSCAN scans many objects repeatedly; FDBSCAN solves this problem by applying an external scan to reduce the search times. DBSCAN-GM (DBSCAN and Gaussian Means) combines the Gaussian-Means and DBSCAN algorithms in order to cover the limitations of DBSCAN by exploiting the benefits of Gaussian-Means. DBSCAN can discover clusters of arbitrary shape and separate out noise, but it cannot choose its parameters according to the distribution of the dataset. It simply uses the global MinPts parameter, so the clustering result for a multi-density database is inaccurate. In addition, when it is used to cluster large databases, it costs too much time. To solve these problems, GMDBSCAN is used.

References
[1] U. M. Fayyad, P. Smyth, "Advances in Knowledge Discovery and Data Mining", AAAI/MIT Press, Menlo Park, CA, 1996.
[2] R. T. Ng, J. Han, "Efficient and effective clustering methods for spatial data mining", VLDB Conference, Santiago, Chile, 1994.
[3] F. Karimipour, M. R. Delavar, M. Kianie, "Water quality management using GIS data mining", Journal of Environmental Informatics, Vol. 5, No. 2, pp. 61-71, 2005.
[4] X. Hu, I. Yoo, "Cluster ensemble and its application in gene expression analysis", Proceedings, The Second Conference on Asia-Pacific Bioinformatics, Vol. 29, Dunedin, New Zealand, pp. 297-302, 2004.
[5] P. Liu, D. Zhou, N. Wu, "Varied Density Based Spatial Clustering of Applications with Noise", IEEE, 2007.
[6] B. Borah, D. K. Bhattacharyya, "An Improved Sampling-Based DBSCAN for Large Spatial Databases", in Proceedings of International Conference on Intelligent Sensing and Information, pp. 92-96, 2004.
[7] B. Liu, "A Fast Density-Based Clustering Algorithm for Large Databases", in Proceedings of International Conference on Machine Learning and Cybernetics, pp. 996-1000, 2006.
[8] G. Hamerly, C. Elkan, "Learning the k in k-means", Vol. 17, MIT Press, 2003.
[9] T. K. Moon, "The expectation-maximization algorithm", IEEE Signal Processing Magazine, Vol. 13, No. 6, pp. 47-60, Nov. 1996.
[10] J. B. MacQueen, "Some methods for classification and analysis of multivariate observations", in Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, L. M. Le Cam and J. Neyman, Eds., Vol. 1, University of California Press, pp. 281-297, 1967.
[11] Zeng Donghai, "The Study of Clustering Algorithm Based on Grid-Density and Spatial Partition Tree", XiaMen University, PRC, 2006.
[12] A. H. Pilevar, M. Sukumar, "GCHL: a grid-clustering algorithm for high-dimensional Data Mining", SBIA 2004, LNAI 3171.

Dr. Md. Ali Hussain, M.Tech., Ph.D., is working as Assoc. Professor in the Dept. of Electronics and Computer Engineering, KL University, Guntur District, A.P., India. His research interests include Computer Networks, Wireless Networks, Mobile Networks and Web Commerce. He has published a large number of papers in National and International Conferences and International Journals. He has also served as a Program Committee (PC) Member for many International Conferences. He is at present Chief Technical Advisory Board Member, Chief Editor, Editor and Technical Reviewer of reputed International Journals. He is a member of IACSIT, IRACST, UACEE, ISTE, IAENG, AIRCC, AICIT, and IARCS.

Dr. K. Satya Rajesh, M.Tech., Ph.D., is working as Assoc. Professor in the Department of Electronics and Computer Engineering, K L University, Guntur District, Andhra Pradesh. His research interests include Wireless and Sensor Networks. He has published a large number of papers in National and International Journals and has presented a large number of papers at National and International conferences.
He is a technical review member of IJRRAN, IJCSIT, etc., and holds membership in IACSIT, SSRGJ, etc.

Md. Abdul Ahad, M.Tech., (Ph.D), is working as Asst. Professor in the Department of Electronics and Computer Engineering, K L University, Guntur District, Andhra Pradesh. His research includes Wireless Networks. He has published many papers in National and International Journals and has presented many papers at National and International conferences.