International Journal on Advanced Computer Theory and Engineering (IJACTE)

A roadmap to varied density dataset issue of DBSCAN and its variants

1 Neha R. Soni, 2 Amit P. Ganatra
1 Asst. Prof., SVIT, Vasad, Gujarat; 2 Dean, Faculty of Tech. & Engg., CHARUSAT, Changa, Gujarat
Email: [email protected], [email protected]

Abstract - A wide variety of methods has been designed for cluster analysis, an unsupervised learning task, including partitioning based, hierarchical, density based and model based methods. DBSCAN, one of the most widely applied density based clustering algorithms, outperforms partitioning based clustering algorithms such as k-means, CLARA and CLARANS as well as hierarchical algorithms, in that it does not require prior knowledge of the number of clusters or a termination condition and it generates clusters of arbitrary shape, which need not be convex. Despite its wide applicability, it also exhibits a few issues: i) its time complexity is O(n^2) if R* indexing is not used, ii) it does not work properly for datasets of varying density and iii) the choice of its two input parameters, Eps and MinPts, greatly changes the output. To overcome these issues, different modifications of the original DBSCAN have been proposed in the literature. The algorithms proposed for handling varied density datasets are surveyed in this paper.

Index Terms - DBSCAN, density based clustering, varied density dataset

I. INTRODUCTION

Clustering, or cluster analysis, is an unsupervised learning process that groups objects of a similar kind. Clustering plays an outstanding role in data mining applications and is the subject of active research in several fields such as statistics, pattern recognition and machine learning [2]. Thousands of clustering algorithms have been proposed in the literature in many different disciplines and for many different applications [5].
Clustering algorithms have also been categorized from a number of perspectives, as presented in [1][2][3][4][15]; the major categories are partitioning, hierarchical, density based, grid based and model based.

Density based clustering is one of the primary methods for clustering in data mining. It is well suited to detecting clusters of arbitrary shape. Density based clustering considers clusters as dense regions separated by sparse regions and can be applied very efficiently to spatial databases. The main representative algorithms in this category are DBSCAN [6], OPTICS [7], DENCLUE [8] and DBCLASD [9].

DBSCAN is the most widely used algorithm in this category. It takes two input parameters: Eps and MinPts. The main weakness of DBSCAN is that it is unable to produce proper clusters when the dataset has greatly varied densities. Because it uses a global radius (Eps), it can find clusters at a single density level only. A large number of modifications of DBSCAN have been proposed in the literature to handle this issue. This paper discusses a few of them, with comparison and remarks.

The rest of the paper is organized as follows. Section 2 briefly describes the working of DBSCAN. Section 3 discusses the issue of varied density datasets and its consequences for the output of DBSCAN. Section 4 summarizes the different algorithms proposed to address the issue of varied density datasets, with a detailed comparison. Finally, Section 5 presents the conclusion and directions for future work.

II. THE DBSCAN ALGORITHM

DBSCAN [6] was the first density based clustering algorithm to become very popular. The basic idea of DBSCAN is that a cluster, being a dense region, has to contain at least some minimum number of points (MinPts) within a specified neighbourhood radius (Eps); both are given as input parameters.
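The condition described above can be written compactly. The block below is a sketch in the spirit of the definitions of the original DBSCAN paper [6]; the symbols D (the dataset), dist (the chosen distance function) and N_Eps (the Eps-neighbourhood) are assumptions of this sketch and are not notation used elsewhere in this survey.

```latex
\begin{align*}
N_{Eps}(p) &= \{\, q \in D \mid \mathrm{dist}(p, q) \le Eps \,\} \\
p \text{ is a core point} &\iff |N_{Eps}(p)| \ge MinPts \\
q \text{ is directly density reachable from } p &\iff q \in N_{Eps}(p) \ \text{and}\ |N_{Eps}(p)| \ge MinPts
\end{align*}
```

Density reachability is then the chaining of directly density reachable steps through core points.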
To find a cluster, DBSCAN starts with an arbitrary point p and finds the Eps-neighbourhood of p. If the neighbourhood contains at least MinPts points, p is considered a core point and DBSCAN retrieves all points that are density reachable from p with respect to Eps and MinPts to form a cluster. A point that does not have the minimum number of points in its neighbourhood is considered a border point if it lies in the neighbourhood of a core point, and a noise point otherwise. DBSCAN continues checking the other points in the dataset until all points are classified.

ISSN (Print): 2319-2526, Volume 3, Issue 5, 2014

III. ISSUE OF VARIED DENSITY DATASET IN DBSCAN

The two input parameters of DBSCAN, Eps and MinPts, are global parameters. As a result, DBSCAN produces incorrect results when the clusters in the dataset have different densities and are not well separated by sparse regions. With a low value of Eps as input, the highly dense clusters can be extracted but the sparse clusters are treated as outliers. With a high value of Eps as input, the densest clusters are merged into the sparse clusters [10] [11] [14]. As shown in Figure 1, if the value of Eps is small enough, DBSCAN generates two clusters, C1 and C2, with C3 as an outlier; if the value of Eps is large enough, DBSCAN produces two clusters: C3 as one cluster and the merger of C1 and C2 as the other. Thus, DBSCAN is unable to produce all three clusters C1, C2 and C3 with a global value of Eps. In many real datasets, clusters of different densities may be present and may be useful for further analysis. Therefore it becomes necessary to find both the dense and the sparse clusters.
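The clustering procedure of Section II and the global-Eps limitation described above can be illustrated with a small self-contained sketch. It is a minimal pure-Python DBSCAN written for illustration only, not the implementation of [6]; the toy coordinates (two dense groups and one sparse group placed on a line), the function names and the choice MinPts = 3 are all assumptions picked to mimic the situation of Figure 1.

```python
from collections import deque
from math import dist

NOISE = -1


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for outliers."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may be relabelled later as a border point
            continue
        cluster += 1  # points[i] is a core point: start a new cluster
        labels[i] = cluster
        queue = deque(neighbours)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point: joins, but is not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:  # j is itself a core point
                queue.extend(j_neighbours)
    return labels


def summarize(labels):
    """Return (number of clusters, number of noise points)."""
    return len({l for l in labels if l != NOISE}), labels.count(NOISE)


# Hypothetical toy data: C1 and C2 are dense (spacing 1, gap 6 between them),
# C3 is sparse (spacing 10) and lies far away from both.
c1 = [(x, 0) for x in range(0, 5)]
c2 = [(x, 0) for x in range(10, 15)]
c3 = [(x, 0) for x in range(100, 150, 10)]
points = c1 + c2 + c3

small = dbscan(points, eps=1.5, min_pts=3)   # small global Eps
large = dbscan(points, eps=15.0, min_pts=3)  # large global Eps

print("small Eps:", summarize(small))  # C1 and C2 found, C3 is noise
print("large Eps:", summarize(large))  # C1 and C2 merged, C3 found
```

With these assumed coordinates, separating the two dense groups requires Eps < 6 while clustering the sparse group requires Eps >= 10, so no single global Eps recovers all three groups.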
To handle this issue, several new algorithms that extend or modify DBSCAN have been proposed in the literature. Section 4 surveys several of these algorithms.

Figure 1: Example of varied density dataset issue

IV. COMPARATIVE STUDY

The following is a summary of the different modifications of DBSCAN proposed to handle the issue of varied density datasets. Table 1 compares them in summarized form, with the authors' own reviews and remarks on each, providing guidelines for selection as well as directions for further work or improvement.

1) Enhancing density-based clustering: Parameter reduction and outlier detection [12]

In this paper the authors address three common issues of density based clustering: (i) selection of data dependent parameters; (ii) sensitivity of the algorithm's behaviour to the density of the starting object; (iii) improper identification of adjacent clusters with different densities. To address these issues, a new density function is proposed based on the concepts of knn-stratification and the influence function. First, knn-stratification is applied to the dataset to identify its different density levels efficiently. The original dataset is then projected onto a new space by adding the rank of each object, derived using knn-stratification, as one more dimension. Density based clustering is then applied using the knowledge of the k-influence space ISk. A random point p is selected and a cluster is constructed around p until a border point or an outlier is reached. A border point is determined by comparing the size of ISk for the point under consideration with a threshold, whose value is set to 2k/3 based on experimental results.

2) Grid-based DBSCAN Algorithm with Referential Parameters [16]

The proposed algorithm, Grid-based DBSCAN Algorithm with Referential Parameters, is based on the grid partition technique and multi-density based clustering. The authors propose a technique for automatic generation of the Eps and MinPts parameters of the DBSCAN algorithm. The algorithm starts by performing grid division of the dataset and then bins each data object to map it to the corresponding grid cell. Eps and MinPts are then determined from the grid structure, and DBSCAN is applied treating as a core object each grid unit whose number of data objects is larger than MinPts. An undirected graph is then constructed by placing an edge from each core grid unit to every adjacent core grid unit. Every connected component represents a cluster.

3) Enhanced Density Based Spatial Clustering of Applications with Noise [14]

In this paper the authors propose a new algorithm, an extension of DBSCAN, to handle the issue of varied density datasets. It starts by finding the kNN of each point p and stores them in ascending order of distance to p. A local density function, the sum of the distances to the kNN, is then computed for each point p, and the dataset is rearranged in descending order of the local density of each point. From the input parameter Maxpts, an Eps is determined as the distance from the point p to its Maxpts-th neighbour, and DBSCAN is then applied for each value of Eps, ignoring the previously clustered points.

4) DDSC: Density Differentiated Spatial Clustering [19]

DDSC is an extension of the DBSCAN algorithm to detect clusters with differing densities. The algorithm finds natural density based clusters that may not be separated by sparse regions, by assuming that the local density within a cluster is reasonably homogeneous and that adjacent regions are separated into different clusters if there is a significant change in density. It starts a cluster with a homogeneous core object and goes on expanding it by including directly density reachable homogeneous core objects until a non-homogeneous core object is detected.
The homogeneous core object is determined based on the parameter α, a density threshold.

5) VDBSCAN: Varied Density Based Spatial Clustering of Applications with Noise [10]

VDBSCAN is an improvement of DBSCAN for handling varied density datasets. The basic idea is to use a different Eps value for each density level present in the dataset, instead of a single global value of Eps for all the clusters to be formed. To do so, the algorithm first calculates and stores the k-dist of each point and plots the k-dist graph. If the dataset contains density variation, the k-dist graph shows sharp changes, each of which corresponds to a suitable value of Eps. Thus a different value of Eps, denoted Epsi, can be chosen at each sharp change between smooth segments of the curve, and DBSCAN is run for each Epsi, ignoring the points that have already been clustered.

6) ST-DBSCAN: An algorithm for clustering spatial-temporal data [11]

ST-DBSCAN extends DBSCAN in three respects, as suggested by its authors: i) it can cluster spatial-temporal data according to its non-spatial, spatial and temporal attributes; ii) it detects noise in the case of varying density by means of a density factor assigned to each cluster; iii) it accounts for the effect of spatial and non-spatial attributes on border objects residing on opposite sides of adjacent clusters. The algorithm takes four parameters, Eps1, Eps2, MinPts and Δε, where Eps1 is used for the spatial attributes and Eps2 for the non-spatial attributes. It starts with the first point p and retrieves all points that are density reachable from p with respect to Eps1 and Eps2. If p is a core object, a cluster is formed; otherwise the algorithm visits the next point in the dataset. Thus the first and third aspects are addressed by also considering the non-spatial and temporal attributes when forming clusters.

7) Locally Scaled Density Based Clustering [17]

As the name suggests, the proposed algorithm is based on the concept of local scaling, a technique that makes use of the local statistics of the data during the identification of clusters. LSDBC clusters the points by connecting dense regions until the density falls below a threshold determined by the centre of the cluster. It first calculates an Eps value for each point based on its kNN distance and then sorts the dataset in ascending order of Eps. The locally densest point is then selected and the cluster is expanded from that point, comparing the density at each step. This allows the algorithm to work with different density variations.

| Sr. No. | Name | Proposed By | Year | Complexity | Input Parameters | Issues Addressed | Main Concept Used | Research Findings/Issues |
|---|---|---|---|---|---|---|---|---|
| 1 | ISDBSCAN | Carmelo Cassisi, Alfredo Ferro, Rosalba Giugno, Giuseppe Pigola, Alfredo Pulvirenti | 2013 | Same as DBSCAN | a) number of nearest neighbours k | a) parameter selection; b) input order dependency; c) varying density dataset | Space stratification based on both the INFLO function and knn distances. | a) sorting of the dataset as per the w function; b) adds one more dimension to the dataset; c) a threshold has to be set |
| 2 | GRPDBSCAN | Huang Darong, Wang Peng | 2012 | Same as DBSCAN | a) number of grid units N | a) handling varied density datasets; b) reduction in parameters | Combines the grid partition technique and multi-density based clustering. | a) grid division and data binning need to be explored; b) selection of Eps and MinPts needs to be more specific |
| 3 | Enhanced DBSCAN | A. Fahim, G. Saake, A. Salem, F. Torkey and M. Ramadan | 2009 | Same as DBSCAN | a) number of nearest neighbours k; b) limit on the highest density, Maxpts | | Based on the concept of a local density function to find the local density at each point, which is an approximation of the overall density function. | a) sorting is performed twice, once for each point's kNN and once for the whole dataset; b) introduces a new parameter to control the highest density in a cluster; c) the absolute distance to the Maxpts-th neighbour is used as Eps, which may be explored further |
| 4 | DDSC | B. Borah, D. K. Bhattacharyya | 2008 | O(n log n) | a) radius Eps; b) minimum points MinPts; c) density threshold α | a) varied density dataset; b) reduction in the sensitivity of Eps | Partitions the dataset such that adjacent regions differ significantly in density, using a homogeneity test to detect variations in density. | a) introduces a new parameter for the density threshold; b) reduces the sensitivity of the parameters but does not eliminate it |
| 5 | VDBSCAN | Peng Liu, Dong Zhou, Naijun Wu | 2007 | Same as DBSCAN | a) radius Epsi; b) minimum points MinPts; c) number of nearest neighbours k | Varied density dataset | For a varied density dataset, different values of Epsi can be used, which can be determined by plotting the k-dist graph. | a) the result may vary with the value of k; b) the values of Epsi are chosen subjectively from the k-dist plot |
| 6 | ST-DBSCAN | Derya Birant and Alp Kut | 2007 | Same as DBSCAN | a) radii Eps1 and Eps2; b) minimum points MinPts; c) density threshold Δε | a) handling of spatial-temporal datasets; b) varied density dataset; c) border points on opposite sides of adjacent clusters | a) two Eps values for two dimensions, spatial and temporal; b) handles varied density (noise point identification) by defining a density factor | a) selection of the threshold value is to be explored |
| 7 | LSDBC | Ergun Bicici and Deniz Yuret | 2007 | Same as DBSCAN | a) number of nearest neighbours k; b) density threshold α | a) varied density dataset; b) reduction in parameters | Uses the notion of local scaling in density based clustering, which determines the density threshold based on the local statistics of the data. | a) introduces a new parameter for the density threshold; b) sorting of the dataset |

Table 1: Comparative study of density based clustering algorithms

V. CONCLUSION AND FUTURE WORK

Density based clustering is one of the primary methods for clustering in data mining, and DBSCAN is the most widely used algorithm in this category. Despite its wide applicability, it also exhibits a few problems: its time complexity is high, the selection of its input parameters is crucial, and it is unable to produce proper clusters when the clusters in the dataset have greatly varied densities. A number of modifications of DBSCAN have been proposed in the literature to address the issue of varying density datasets. This paper discusses a few of them, providing a detailed comparison and remarks. It has been observed that each modification either introduces some new input parameter or results in some other issue. A direction for future research is to develop parameter free clustering or a method with automatic selection of the parameters.

REFERENCES

[1] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann, 2001.
[2] P. Berkhin, "Survey of clustering data mining techniques", Technical report, Accrue Software, San Jose, CA, 2002.
[3] A. K. Jain, M. N. Murty and P. J. Flynn, "Data clustering: a review", ACM Computing Surveys, Vol. 31, Issue 3, pp. 264-323, 1999.
[4] R. Xu and D. Wunsch, "Survey of clustering algorithms", IEEE Transactions on Neural Networks, Vol. 16, No. 3, pp. 645-678, May 2005.
[5] A. K. Jain, "Data Clustering: 50 Years Beyond K-Means", Pattern Recognition Letters, Vol. 31, No. 8, pp. 651-666, 2010.
[6] M. Ester, H. P. Kriegel, J. Sander and X. Xu, "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise", Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, Ore, USA, pp. 226-231, 1996.
[7] M. Ankerst, M. Breunig, H. P. Kriegel and J. Sander, "OPTICS: Ordering Objects to Identify the Clustering Structure", Proc. of International Conference on Management of Data, ACM SIGMOD, pp. 49-60, New York, USA, 1999, ACM Press.
[8] A. Hinneburg and D. Keim, "An efficient approach to clustering large multimedia databases with noise", Proc. of the 4th ACM SIGKDD, pp. 58-65, New York, NY, 1998.
[9] X. Xu, M. Ester, H. P. Kriegel and J. Sander, "A distribution-based clustering algorithm for mining in large spatial databases", Proc. of the 14th ICDE, pp. 324-331, Orlando, FL, 1998.
[10] P. Liu, D. Zhou and N. Wu, "VDBSCAN: Varied Density Based Spatial Clustering of Applications with Noise", Proc. of IEEE Conference, ICSSSM, pp. 528-531, Shanghai, China, 2007.
[11] D. Birant and A. Kut, "ST-DBSCAN: An algorithm for clustering spatial-temporal data", Data and Knowledge Engineering, pp. 208-221, 2007.
[12] C. Cassisi, A. Ferro, R. Giugno, G. Pigola and A. Pulvirenti, "Enhancing density-based clustering: Parameter reduction and outlier detection", Information Systems, 38, pp. 317-330, 2013, Elsevier.
[13] A. M. Fahim, G. Saake, A. M. Salem, F. A. Torkey and M. A. Ramadan, "Dcbor: a density clustering based on outlier removal", International Journal of Computer Science, Vol. 4, No. 3, 2009.
[14] A. M. Fahim, G. Saake, A. M. Salem, F. A. Torkey and M. A. Ramadan, "Enhanced density based spatial clustering of application with noise", Proceedings of the International Conference on Data Mining, Las Vegas, USA, pp. 517-523, 2009.
[15] N. Soni and A. Ganatra, "Categorization of several clustering algorithms from different perspective: a review", International Journal of Advanced Research in Computer Science and Software Engineering, Vol. 1, No. 8, pp. 1-6, 2012.
[16] D. Huang and P. Wang, "Grid-based DBSCAN Algorithm with Referential Parameters", Physics Procedia, pp. 1166-1170, 2012, Elsevier.
[17] E. Biçici and D. Yuret, "Locally scaled density based clustering", In Adaptive and Natural Computing Algorithms, pp. 739-748, Berlin Heidelberg, 2007, Springer.
[18] D. R. Edla and P. K. Jana, "A Prototype-Based Modified DBSCAN for Gene Clustering", Procedia Technology 6, pp. 485-492, 2012, Elsevier.
[19] B. Borah and D. K. Bhattacharyya, "DDSC: A density differentiated spatial clustering technique", Journal of Computers, Vol. 3, No. 2, pp. 72-79, 2008.