Parameter Reduction for Density-based Clustering on Large Data Sets

Baoying Wang, William Perrizo
{baoying.wang, william.perrizo}@ndsu.nodak.edu
Computer Science Department, North Dakota State University, Fargo, ND 58105
Tel: (701) 231-6257, Fax: (701) 231-8255
This work is partially supported by GSA Grant ACT#: K96130308.

Abstract

Clustering on large datasets has become one of the most intensively studied areas as data volumes increase. One of the problems of clustering on large datasets is that there is minimal domain knowledge with which to determine the input parameters. In density-based clustering, the main input is the minimum neighborhood radius. The problem becomes more difficult when the clusters have different densities. In this paper, we explore an automatic approach to determining the minimum neighborhood radius based on the distribution of the dataset. The algorithm, MINR, determines the minimum neighborhood radii for clusters of different densities and is based on many experiments and observations. MINR can be used together with any density-based clustering method to obtain a nonparametric clustering algorithm. In this paper, we combine MINR with an enhanced DBSCAN, e-DBSCAN. Experiments show our approach is more efficient and scalable than TURN* [2].

Keywords: Data mining, Density-based clustering, Parameter reduction.

1. INTRODUCTION

Clustering on large datasets has become one of the most intensively studied areas in data mining. In particular, density-based clustering is widely used in spatial applications such as geographical information analysis, medical applications, and satellite image analysis. In density-based clustering, clusters are dense areas of points in the data space that are separated by areas of low density (noise) [4]. A cluster is regarded as a connected dense area of data points, which grows in any direction that density leads.

One of the problems of clustering on large spatial datasets is that there is minimal domain knowledge with which to determine the input parameters. A dataset may consist of clusters with the same density or with different densities. Figure 1 shows some possible distributions of a dataset.

Figure 1. Clusters with (a) the same density and (b) different densities.

In density-based clustering, the main input is the minimum neighborhood radius. When clusters have different densities, it is more difficult to determine the minimum neighborhood radii. Although there have been many efforts to make clustering parameter-free, they either try to give users all possible choices [7] or adopt a trial-and-error approach based on statistical information. We explore an automatic approach to determining the minimum neighborhood radii based on the distribution of the dataset. The algorithm, MINR, determines the minimum neighborhood radii for clusters of different densities and is based on many experiments and observations. MINR can be used together with any density-based clustering method to obtain a nonparametric clustering algorithm. In this paper, we combine MINR with an enhanced DBSCAN, e-DBSCAN, into a nonparametric density-based clustering algorithm (NPDBC). Experiments show NPDBC is more efficient and scalable than TURN* [2]. The reason is that in NPDBC the parameters are computed once at the beginning of the clustering process, while TURN* tries different neighborhood radii until the first "turn" is found in the case of clusters with two different densities.

This paper is organized as follows. In section 2, we give a brief review of related work.
In section 3, we present the parameter reduction method for density-based clustering and a nonparametric clustering method. We give a performance analysis in section 4. Finally, we conclude the paper in section 5.

2. RELATED WORK

2.1. Clustering methods

There are mainly two kinds of clustering methods: similarity-based partitioning methods and density-based clustering methods. A similarity-based partitioning algorithm breaks a dataset into k subsets, called clusters. The major problems with partitioning methods are: (1) k has to be predetermined; (2) it is difficult to identify clusters of different sizes; (3) they only find convex clusters.

Density-based clustering methods are used to discover clusters with arbitrary shapes. The most typical algorithm is DBSCAN [1]. The basic idea of DBSCAN is that each cluster is a maximal set of density-connected points; two points are connected when one is density-reachable from the other through a chain of neighborhoods. DBSCAN is very sensitive to its input parameters, which are the neighborhood radius (r) and the minimum number of neighbors (MinPts).

Another density-based method is WaveCluster [10], which applies a wavelet transform to the feature space. It can detect arbitrary-shape clusters at different scales. The algorithm is grid-based and only applicable to low-dimensional data. Input parameters include the number of grid cells for each dimension, the wavelet to use, and the number of applications of the wavelet transform. In [5], another density-based algorithm, DenClue, is proposed. This algorithm uses a grid but is very efficient because it only keeps information about grid cells that actually contain data points and manages these cells in a tree-based access structure. The algorithm generalizes some other clustering approaches, which, however, results in a large number of input parameters.

2.2. Attempts to reduce parameters

There have been many efforts to make the clustering process parameter-free, such as OPTICS [7], CHAMELEON [6] and TURN* [2]. OPTICS computes an augmented cluster ordering. This ordering represents the density-based clustering structure of the data. The method is used for interactive cluster analysis. CHAMELEON operates on a derived similarity graph. The algorithm first uses a graph partitioning approach to divide the dataset into a set of small clusters, and then merges the small clusters based on their similarity measure. CHAMELEON has been found to be very effective in clustering convex shapes. However, the algorithm cannot handle outliers and needs parameter settings to work effectively.

TURN* is a brute-force approach. It first decreases the neighborhood radius until it is so small that every data point becomes noise. The radius is then doubled for each clustering run until a "turn" is found where stabilization occurs in the clustering process [3]. TURN* uses two constant step sizes, 2 and 0.4, to increase and decrease the neighborhood radius respectively. Obviously the step sizes depend on the data distribution of the dataset. Even though it chooses big steps, the computation time is not promising for large datasets with various densities.

2.3. Enhanced DBSCAN clustering

Given a dataset X, the neighborhood radius r, and the minimum number of points in a neighborhood, k, we introduce some definitions for density-based clustering and then present our enhanced DBSCAN clustering algorithm.

Definition 1. The neighborhood of a data point p with radius r is defined as the set Nbr(p, r) = {x ∈ X : |p − x| ≤ r}, where |p − x| is the distance between x and p.
Definition 2. A point p is an internal point if it has at least k neighbors within its neighborhood Nbr(p, r), i.e. |Nbr(p, r)| ≥ k. Its neighborhood is called a core.

Definition 3. A point p is an external point if the number of its neighbors within its neighborhood Nbr(p, r) is less than k, i.e. |Nbr(p, r)| < k, and it is located within a core.

Figure 2 shows internal points and external points, given k = 4.

Figure 2. Internal and external points (k = 4): (a) five internal points; (b) two internal points and one external point.

Definition 4. A point p is directly density-reachable from a point q if p ∈ Nbr(q, r) and q is an internal point.

Definition 5. A point p is density-reachable from a point q if there is a chain of points x1, x2, ..., xn with q = x1 and p = xn such that xi+1 is directly density-reachable from xi.

Definition 6. A cluster C is a collection of cores, the centers of which are density-reachable from each other.

Definition 7. The boundary points of a cluster are the external points located within that cluster.

Enhanced DBSCAN: We develop an enhanced DBSCAN algorithm (e-DBSCAN). e-DBSCAN is used as a nested clustering procedure, which is called repeatedly to perform clustering at different densities. e-DBSCAN differs from the original DBSCAN in that the boundary points of each cluster are stored as a separate set. The boundary sets are used for cluster merging at a later stage. The enhanced DBSCAN process is summarized as follows:

1. Pick an arbitrary point x. If it is not an internal point, label it as noise. Otherwise its neighborhood becomes a rudiment cluster C. Insert all neighbors of point x into the seed store.
2. Retrieve the next point from the seed store. If it is an internal point, merge its neighborhood into cluster C and insert all its neighbors into the seed store; if it is an external point, insert it into the boundary set of C.
3. Go back to step 2 with the next seed until the seed store is empty.
4. Go back to step 1 with the next unclustered point in the dataset.

When the process is finished, there are a number of cluster sets, a noise set, and a boundary set for each cluster.
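As a concrete illustration, here is a minimal Python sketch of the e-DBSCAN steps above. It is not the authors' implementation (the paper's code is in C); the brute-force neighborhood search, the function names (region_query, e_dbscan), and the label convention (-1 for unclustered or noise, cluster ids counted upward) are illustrative assumptions. Because each call skips points that are already labeled, the same routine can be invoked repeatedly with larger radii, as section 3.3 requires.

```python
import numpy as np

def region_query(X, i, r):
    """Indices of all points within distance r of point i (brute-force scan)."""
    return np.where(np.linalg.norm(X - X[i], axis=1) <= r)[0]

def e_dbscan(X, r, k, labels):
    """Sketch of e-DBSCAN: cluster the still-unclustered points of X with radius r
    and minimum neighbor count k. labels holds -1 for unclustered/noise and is
    updated in place; returns {cluster id: set of boundary point indices}."""
    boundaries = {}
    next_id = int(labels.max()) + 1 if (labels >= 0).any() else 0
    for start in range(len(X)):
        if labels[start] != -1:
            continue                          # already clustered in an earlier pass
        nbrs = region_query(X, start, r)
        if len(nbrs) < k:
            continue                          # not an internal point: stays noise for now
        cid = next_id
        next_id += 1
        boundaries[cid] = set()
        seeds = []
        for q in nbrs:                        # the neighborhood is a rudiment cluster C
            if labels[q] == -1:
                labels[q] = cid
                seeds.append(q)
        while seeds:                          # grow C from the seed store
            p = seeds.pop()
            p_nbrs = region_query(X, p, r)
            if len(p_nbrs) >= k:              # internal point: merge its neighborhood
                for q in p_nbrs:
                    if labels[q] == -1:
                        labels[q] = cid
                        seeds.append(q)
            else:                             # external point inside C: boundary point
                boundaries[cid].add(int(p))
    return boundaries
```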
3. PARAMETER REDUCTION FOR DENSITY-BASED CLUSTERING

There are two input parameters in the DBSCAN algorithm: the minimum number of neighbors, k, and the minimum neighborhood radius, r. In fact, k is the size of the smallest cluster; it need not be varied for different datasets. DBSCAN sets k to 4 [1], and TURN* also treats it as a fixed value [2]. We also set k to 4. Therefore, the only input parameter is the minimum neighborhood radius, r.

Intuitively, r should depend on the cluster density of the dataset: clusters of different densities should have different values of r. For this reason, DBSCAN presents the user with a graph of the sorted distances between each point and its 4th nearest neighbor, and the user is asked to find the "valley" which represents the optimal r. This method only works for clusters with the same density. TURN* treats the whole dataset as an image and tries a range of resolutions (radii), from one end where every point is classified as noise to the other end where all data points fall into a single cluster; an optimal resolution is then chosen from this range by a statistical method.

In this section, we first present a few observations based on our experiments on many different datasets. We then develop a built-in algorithm, MINR, to determine the minimum neighborhood radii for clusters of different densities based on the data distribution. Finally, we develop a nonparametric density-based clustering method by combining MINR with e-DBSCAN.

3.1. Experiments and Observations

Observation 1: We define R as the distance between each point x and its 4th nearest neighbor. The points are then sorted by R in ascending order. Figure 3 shows two datasets, DS1 and DS2, and their R-x graphs after sorting. DS1 is a dataset used by DBSCAN; its size is 200 points. DS2 is reproduced from a dataset used by CHAMELEON; the original data has 10K points and clusters of similar density. In order to test our algorithm, we inserted more data into the 3 clusters in the upper left part, giving DS2 a size of 17.5K. As we can see from Figure 3, for a noisy dataset there is a turning point in the R-x graph where R starts to increase dramatically. Our experiments show that most points to the right of the turning point are noise. If the dataset were clean, there would be no turning point in the graph. DS1 and DS2 are both noisy datasets, so there are turning points in Figure 3 (c) and (d). We can even check this observation on DS1 by eye: the turning point in (c) is at around 175, and there are 24 points to its right; in fact DS1 has 20 noise points.

Figure 3. DS1 and DS2 and their sorted R-x graphs: (a) DS1; (b) DS2; (c) R-x graph of DS1; (d) R-x graph of DS2.

Observation 2: Given a neighborhood radius r, we calculate the number of neighbors of each point within that radius, denoted K, sort the points in descending order of K, and obtain the sorted K-x graph. When r is small, the curve is quite smooth. As r increases, the graph starts to show "knees". When we continue to increase r, the graph becomes smooth again. The rationale is that if r is very small or very large, the numbers of neighbors of all points are close to each other: one extreme case is when r is so small that every point has no neighbor but itself; the other extreme is when r is large enough to cover the whole dataset as the neighborhood. Figure 4 shows K-x graphs for DS1 and DS2 for three different radii. Figure 4 (a) and (b) are the cases where r is very small, (c) and (d) are the cases where r is close to the maximum R in the R-x graph, and (e) and (f) are the cases where r is very large.

Figure 4. Sorted K-x graphs for DS1 and DS2 with different neighborhood radii: (a) DS1, r = 2; (b) DS2, r = 5; (c) DS1, r = 22; (d) DS2, r = 30; (e) DS1, r = 50; (f) DS2, r = 250.

Both DS1 and DS2 consist of clusters of two different densities and some noise. The knees are close to the points with peak differentials, as we can see in (c) and (d). The number of "knees" equals the number of cluster densities in the dataset. Intuitively, we infer that the points divided by the "knees" belong to clusters of different densities or to noise.

From Figure 4, we can see that when the neighborhood radius is close to the maximum R, the K-x graph shows the "knees" very clearly. In order to find the "knees", we need to calculate the differentials ΔK of the graphs. Figure 5 (a) and (b) show the sorted K-x graphs for DS1 and DS2 when the neighborhood radius is close to the maximum R; (c) and (d) show the differentials of the graphs.

Figure 5. Sorted K-x graphs of datasets DS1 and DS2 and their differentials ΔK: (a) K-x graph for DS1; (b) K-x graph for DS2; (c) ΔK for DS1; (d) ΔK for DS2.
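A small sketch of the two graphs used in Observations 1 and 2 may help make them concrete. The synthetic two-cluster data, the function names, and the brute-force distance computation below are assumptions for illustration only; they do not reproduce DS1 or DS2.

```python
import numpy as np

def sorted_r_graph(X, k=4):
    """R-x graph of Observation 1: distance from each point to its k-th nearest
    neighbor, sorted in ascending order."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)   # pairwise distances
    D.sort(axis=1)
    return np.sort(D[:, k])          # column 0 is the point itself (distance 0)

def sorted_k_graph(X, r):
    """K-x graph of Observation 2: number of neighbors within radius r for each
    point, sorted in descending order."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    K = (D <= r).sum(axis=1) - 1     # exclude the point itself
    return np.sort(K)[::-1]

# Usage on a synthetic mixture of a dense and a sparse cluster: the knees appear
# in the differential of the K-x graph when r is close to the maximum R.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (300, 2)),    # dense cluster
               rng.normal(8.0, 2.0, (150, 2))])   # sparse cluster
R = sorted_r_graph(X)
K = sorted_k_graph(X, r=R.max())
dK = np.diff(K)                      # differential, Delta-K
```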
Observation 3: In order to justify the intuition above, we sort the dataset DS2 by K and then partition the sorted dataset into three subsets separated by the two "knees" in Figure 5 (b). The two "knees" are at positions 10000 and 15500, so the three partitions are 0 – 10000, 10001 – 15500, and 15501 – 17500. The three partitions are shown in Figure 6. We can see that partition (a) consists of the denser clusters, partition (b) consists of the less dense clusters, and partition (c) is mainly noise.

Figure 6. Partitions of the sorted DS2 separated by the two "knees" at 10000 and 15500: (a) partition 0 – 10000; (b) partition 10001 – 15500; (c) partition 15501 – 17500.

3.2. Determination of the neighborhood radii

Based on the experiments above, we develop an algorithm, MINR, to automatically determine the minimum neighborhood radii for mining clusters of different densities based on the data distribution. The process is as follows:

1. Calculate the distance between each point and its 4th neighbor, R, and find the maximum R.
2. Compute the number of neighbors, K, within the maximum neighborhood radius R for each point.
3. Sort the points in descending order of K.
4. Calculate the differential ΔK.
5. Search for the peak ΔK values and find the "knee" point right before each peak point, where ΔK = 0.

The "knee" points are denoted KNi, where i = 1, 2, ..., m and m is the number of "knees". The distance between KNi and its 4th neighbor is the neighborhood radius for clustering the ith density group. The algorithm is summarized in Figure 7.

Figure 7. The MINR algorithm.

MINR Algorithm
Input: a dataset X
Output: neighborhood radii ri
1. Calculate the distance between each point and its 4th neighbor, R. Get Rm = max(R).
2. Compute the number of neighbors within Rm for each point, K.
3. Sort the points in descending order of K.
4. Calculate the differential ΔK and find the next peak ΔK position, XPi. Stop if it is at the end of the dataset.
5. For the ith peak ΔK position, find the "knee" point KNi: if x < XPi, ΔKx = 0, and |x − XPi| is the smallest, then KNi = x.
6. Set ri = Rx, the 4th-neighbor distance of the knee point x. Increase i and go back to step 4.
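The following Python sketch follows the steps of Figure 7. The order computations (4th-neighbor distances, neighbor counts within Rm, descending sort, differential ΔK) mirror the algorithm; the peak test (a fraction peak_frac of the largest |ΔK|) and the backward walk to the nearest zero of ΔK are assumed heuristics, since the paper does not give numeric thresholds.

```python
import numpy as np

def minr(X, k=4, peak_frac=0.5):
    """Sketch of MINR: estimate one neighborhood radius per cluster-density group.
    peak_frac (what counts as a 'peak' in |Delta-K|) is an assumed heuristic."""
    # Step 1: distance from each point to its k-th nearest neighbor, R; Rm = max(R).
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    R = np.sort(D, axis=1)[:, k]
    Rm = R.max()
    # Step 2: number of neighbors of each point within Rm.
    K = (D <= Rm).sum(axis=1) - 1
    # Step 3: sort the points in descending order of K, remembering the permutation.
    order = np.argsort(-K)
    K_sorted = K[order].astype(float)
    R_sorted = R[order]
    # Step 4: differential of the sorted K-x graph.
    dK = np.diff(K_sorted)
    mag = np.abs(dK)
    threshold = peak_frac * mag.max()
    # Steps 5 and 6: for each peak of |Delta-K|, walk back to the nearest position
    # where Delta-K is 0 (the "knee") and use that point's k-th neighbor distance.
    radii = []
    in_peak = False
    for i, m in enumerate(mag):
        if m >= threshold and not in_peak:
            in_peak = True
            j = i
            while j > 0 and dK[j] != 0:
                j -= 1
            radii.append(float(R_sorted[j]))
        elif m < threshold:
            in_peak = False
    return sorted(set(radii))        # smallest radius (densest group) first
```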
3.3. Nonparametric Density-based Clustering

In this section, we first propose an iterative clustering process given a series of neighborhood radii for the different density groups in the dataset, and then develop our nonparametric density-based clustering method.

We start clustering using the enhanced DBSCAN algorithm, e-DBSCAN, with k = 4 and r = r1. The densest cluster(s) are formed, as shown in Figure 8.

Figure 8. Resulting clusters after clustering with r1: the densest cluster is formed.

Then we set r = r2 and process only the points that are still unclustered. The next, sparser cluster(s) are formed (see Figure 9). The process continues until r = rm. The remaining unclustered points are noise.

Figure 9. Resulting clusters after clustering with r2: the sparser cluster is formed; the unclustered points are noise.

Our nonparametric density-based clustering algorithm proceeds as follows. First, it calculates a series of neighborhood radii for the different density groups using MINR; then it runs the iterative clustering process using e-DBSCAN with those radii; finally, it merges any pair of clusters which share most of the boundary points of either cluster. The whole process is summarized in Figure 10.

Figure 10. The nonparametric clustering algorithm.

Nonparametric Clustering Algorithm
Input: a dataset X
Output: clusters and noise
1. Calculate the neighborhood radii r1, r2, ..., rm for the different density groups with MINR().
2. Perform iterative clustering with e-DBSCAN.
3. Check the boundaries of each pair of clusters. If two clusters share most of the boundary of either cluster, merge the two clusters into one.
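Putting the pieces together, this sketch mirrors the three steps of Figure 10, reusing the minr() and e_dbscan() sketches above. The paper's merge criterion ("share most of the boundary points of either cluster") is not quantified, so the proximity-based share_fraction test and the 0.5 threshold below are assumptions.

```python
import numpy as np

def share_fraction(bnd_a, bnd_b, X, r):
    """Fraction of cluster A's boundary points lying within distance r of some
    boundary point of cluster B (an assumed reading of 'sharing the boundary')."""
    if not bnd_a or not bnd_b:
        return 0.0
    A, B = X[sorted(bnd_a)], X[sorted(bnd_b)]
    d = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=2)
    return float((d.min(axis=1) <= r).mean())

def npdbc(X, k=4):
    """Nonparametric density-based clustering sketch following Figure 10, reusing
    the minr() and e_dbscan() sketches above. Returns a label per point (-1 = noise)."""
    labels = np.full(len(X), -1)
    radii = minr(X, k=k)                               # step 1: radii from MINR
    boundaries = {}
    for r in radii:                                    # step 2: iterative e-DBSCAN
        boundaries.update(e_dbscan(X, r, k, labels))   # passes, densest group first
    # Step 3: merge clusters that share most of the boundary of either cluster.
    # The 0.5 fraction and the proximity test are assumptions.
    r_max = max(radii) if radii else 0.0
    merged = True
    while merged:
        merged = False
        ids = sorted(boundaries)
        for a in ids:
            for b in ids:
                if a >= b or a not in boundaries or b not in boundaries:
                    continue
                if (share_fraction(boundaries[a], boundaries[b], X, r_max) > 0.5 or
                        share_fraction(boundaries[b], boundaries[a], X, r_max) > 0.5):
                    labels[labels == b] = a            # relabel cluster b's points as a
                    boundaries[a] |= boundaries.pop(b)
                    merged = True
    return labels
```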
4. PERFORMANCE ANALYSIS

In this section, we compare our nonparametric density-based clustering algorithm (NPDBC) with TURN*. We tested the algorithms on several datasets; here we show the run time comparison on the dataset DS2 discussed above. In order to make the data contain clusters of different densities, we artificially inserted more data into some clusters to make them denser than the others. The resulting datasets have sizes from 10K to 200K. We implemented NPDBC in the C language and ran it on a 1GHz Pentium PC with 1GB of main memory running Debian Linux 4.0. The run time comparison of NPDBC and TURN* is shown in Figure 11.

Figure 11. Run time comparison of NPDBC and TURN*.

From Figure 11, we see that NPDBC is more efficient than TURN* for large datasets. The reason is that for NPDBC the parameters are computed once at the beginning of the clustering process, while TURN* tries different neighborhood radii until the first "turn" is found in the case of two different densities. We only compare NPDBC with TURN* on datasets with two different densities. If the number of distinct densities increases, NPDBC will outperform TURN* by an even larger margin: in that case, TURN* would not stop at the first turning point but would have to continue searching for more knees until the very end. It is obvious that TURN* will fail for large datasets with various densities.

5. CONCLUSION

One of the major challenges of clustering is the minimal domain knowledge available to determine the input parameters. It is even more difficult to determine the input parameters when the dataset contains clusters of different densities. Although many algorithms have tried to make clustering parameter-free, they either give users all possible choices or adopt a trial-and-error approach based on statistical information, which is not practical for very large datasets. In this paper, we explored an automatic approach to determining this parameter based on the distribution of the dataset. The algorithm, MINR, determines the minimum neighborhood radii for clusters of different densities. We developed a nonparametric clustering method (NPDBC) by combining MINR with the enhanced DBSCAN, e-DBSCAN. Experiments show that NPDBC is more efficient and scalable than TURN* [2] for clusters of two different densities. The reason is that in NPDBC the parameters are computed once at the beginning of the clustering process, while TURN* tries different neighborhood radii until the first "turn" is found in the case of clusters of two different densities. When the dataset contains clusters of various densities, our algorithm will be much more efficient. In our future work, we will implement NPDBC using the vertical data structure P-tree, an efficient data-mining-ready data representation.

6. REFERENCES

[1]. Ester, M., Kriegel, H.-P., Sander, J. and Xu, X. A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the 2nd ACM SIGKDD, Portland, Oregon (1996) 226-231.
[2]. Foss, A. and Zaiane, O. R. A parameterless method for efficiently discovering clusters of arbitrary shape in large datasets. In Proceedings of ICDM 2002.
[3]. Halkidi, M., Vazirgiannis, M. and Batistakis, Y. On clustering validation techniques. Journal of Intelligent Information Systems, 17(2-3):107-145, December 2001.
[4]. Han, J. and Kamber, M. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
[5]. Hinneburg, A. and Keim, D. A. An efficient approach to clustering in large multimedia databases with noise. In Proceedings of the 4th Int. Conf. on Knowledge Discovery and Data Mining, AAAI Press (1998).
[6]. Karypis, G., Han, E.-H. and Kumar, V. CHAMELEON: A hierarchical clustering algorithm using dynamic modeling. IEEE Computer, 32(8):68-75, August 1999.
[7]. Ankerst, M., Breunig, M., Kriegel, H.-P. and Sander, J. OPTICS: Ordering points to identify the clustering structure. In Proc. 1999 ACM SIGMOD Conf. on Management of Data (SIGMOD'99), pages 49-60, 1999.
[8]. Ng, R. T. and Han, J. Efficient and effective clustering methods for spatial data mining. In Proc. of the 20th Int'l Conf. on Very Large Data Bases, 1994.
[9]. Palmer, C. R. and Faloutsos, C. Density biased sampling: an improved method for data mining and clustering. In Proceedings of the Int'l Conf. on Management of Data, ACM SIGMOD 2000.
[10]. Sheikholeslami, G., Chatterjee, S. and Zhang, A. A wavelet-based clustering approach for spatial data in very large databases. The International Journal on Very Large Databases, 8(4):289-304, February 2000.