Parameter Reduction for Density-based Clustering on Large Data Sets
Baoying Wang, William Perrizo
{baoying.wang, william.perrizo}@ndsu.nodak.edu
Computer Science Department
North Dakota State University
Fargo, ND 58105
Tel: (701) 231-6257
Fax: (701) 231-8255
Abstract
Clustering on large datasets has become one of the most intensively studied areas with increasing data volumes. One of the problems of clustering on large datasets is minimal domain knowledge to determine the input parameters. In density-based clustering, the main input is the minimum neighborhood radius. The problem becomes more difficult when the clusters have different densities. In this paper, we explore an automatic approach to determine the minimum neighborhood radius based on the distribution of the dataset. The algorithm, MINR, is developed to determine the minimum neighborhood radii for different density clusters based on many experiments and observations. MINR can be used together with any density-based clustering method to make a nonparametric clustering algorithm. In this paper, we combine MINR with the enhanced DBSCAN, e-DBSCAN. Experiments show our approach is more efficient and scalable than TURN* [2].
Keywords: Data mining, Density-based clustering,
Parameter reduction.
1. INTRODUCTION
Clustering on large datasets has become one of the
most intensively studied areas in data mining. In
particular, density-based clustering is widely used in
various spatial applications such as geographical
information analysis, medical applications, and satellite
image analysis. In density-based clustering, clusters are
dense areas of points in the data space that are separated
by areas of low density (noise) [4]. A cluster is regarded
as a connected dense area of data points, which grows in
any direction that density leads.
One of the problems of clustering on large spatial datasets is minimal domain knowledge to determine the input parameters. A dataset may consist of clusters with the same density or with different densities. Figure 1 shows some possible distributions of a dataset.

Figure 1. Clusters with the same or different densities: (a) same density; (b) different densities

(This work is partially supported by GSA Grant ACT#: K96130308.)
In density-based clustering, the main input is the minimum neighborhood radius. When clusters have different densities, it is more difficult to determine the minimum neighborhood radii. Although there have been many efforts to make clustering parameter-free, they either try to give users all possible choices [7], or adopt a trial-and-error approach based on statistical information.
We explore an automatic approach to determine the minimum neighborhood radii based on the distribution of the dataset. The algorithm, MINR, is developed to determine the minimum neighborhood radii for different density clusters based on many experiments and observations. MINR can be used together with any density-based clustering method to make a nonparametric clustering algorithm. In this paper, we combine MINR with the enhanced DBSCAN, e-DBSCAN, into a nonparametric density-based clustering algorithm (NPDBC). Experiments show NPDBC is more efficient and scalable than TURN* [2]. The reason is that for NPDBC, the parameters are computed once at the beginning of the clustering process, while the TURN* algorithm tries different neighborhood radii until the first "turn" is found in the case of clusters with two different densities.
This paper is organized as follows. In section 2, we
give a brief review of related work. In section 3, we
present a parameter reduction method for density-based clustering and a nonparametric clustering method. We give a performance analysis in section 4. Finally, we
conclude the paper in section 5.
2. RELATED WORK

2.1. Clustering methods
There are two main kinds of clustering methods: similarity-based partitioning methods and density-based clustering methods. A similarity-based partitioning algorithm breaks
a dataset into k subsets, called clusters. The major
problems with partitioning methods are: (1) k has to be
predetermined; (2) it is difficult to identify clusters with
different sizes; (3) it only finds convex clusters.
Density-based clustering methods are used to discover
clusters with arbitrary shapes. The most typical algorithm
is DBSCAN [1]. The basic idea of DBSCAN is that each
cluster is a maximal set of density-connected points.
Two points are connected when one is density-reachable from the other. DBSCAN is very
sensitive to input parameters, which are the neighborhood
radius (r) and a minimum number of neighbors (MinPts).
Another density-based method is WaveCluster [10],
which applies wavelet transform to the feature space. It
can detect arbitrary-shape clusters at different scales. The
algorithm is grid-based and only applicable to low-dimensional data. Input parameters include the number of
grid cells for each dimension, the wavelet to use and the
number of applications of the wavelet transform. In [5],
another density-based algorithm DenClue is proposed.
This algorithm uses a grid but is very efficient because it
only keeps information about grid cells that do actually
contain data points and manages these cells in a tree-based access structure. This algorithm generalizes some
other clustering approaches which, however, results in a
large number of input parameters.
2.2. Attempts to reduce parameters
There have been many efforts to make the clustering process parameter-free, such as OPTICS [7], CHAMELEON [6] and TURN* [2]. OPTICS computes an
augmented cluster ordering. This ordering represents the
density-based clustering structure of the data. This
method is used for interactive cluster analysis.
CHAMELEON operates on a derived similarity graph.
The algorithm first uses a graph partitioning approach to
divide the dataset into a set of small clusters. Then the
small clusters are merged based on their similarity
measure. CHAMELEON has been found to be very
effective in clustering convex shapes. However, the
algorithm cannot handle outliers and needs parameter
setting to work effectively.
TURN* is a brute-force approach. It first decreases the neighborhood radius until it is so small that every data point becomes noise. Then the radius is doubled each time to do
clustering until it finds a “turn” where stabilization occurs
in the clustering process [3]. TURN* uses two constant
step sizes 2 and 0.4 to increase and decrease the
neighborhood radius, respectively. Obviously, the appropriate step sizes depend on the data distribution of the dataset. Even with big steps, the computation time is not promising for large datasets with various densities.
2.3. Enhanced DBSCAN clustering
Given a data set X, the neighborhood radius, r, and
the minimum points in the neighborhood, k, we introduce
some definitions of density-based clustering and then
present our enhanced DBSCAN clustering algorithm.
Definition 1. The neighborhood of a data point p with a radius r is defined as the set Nbr(p, r) = {x ∈ X : |p − x| ≤ r}, where |p − x| is the distance between x and p.
Definition 2. A point p is an internal point if it has at least k neighbors within its neighborhood Nbr(p, r), denoted as |Nbr(p, r)| ≥ k. Its neighborhood is called a core.
Definition 3. A point p is an external point if the number of its neighbors within its neighborhood Nbr(p, r) is less than k, i.e., |Nbr(p, r)| < k, and it is located within a core.
Figure 2 shows the internal points and external points,
given k = 4.
Figure 2. Internal and external points (k = 4): (a) five internal points; (b) two internal points and one external point
Definition 4: A point p is directly density-reachable from a point q if p ∈ Nbr(q, r) and q is an internal point.
Definition 5: A point p is density-reachable from a point q if there is a chain of points x1, x2, ..., xn, with q = x1 and p = xn, such that xi+1 is directly density-reachable from xi.
Definition 6: A cluster C is a collection of cores, the centers of which are density-reachable from each other.
Definition 7: The boundary points of a cluster are the collection of external points within the cluster.
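To make Definitions 1–4 concrete, here is a minimal Python sketch. It is our own illustration, not the authors' implementation (which is in C); the Euclidean distance and the brute-force neighbor search are assumptions made for clarity.

```python
import numpy as np

def neighborhood(X, p_idx, r):
    """Definition 1: Nbr(p, r) = indices of all points of X within distance r of point p."""
    dists = np.linalg.norm(X - X[p_idx], axis=1)   # Euclidean distance from p to every point
    return np.where(dists <= r)[0]                  # note: includes p itself

def is_internal(X, p_idx, r, k=4):
    """Definition 2: p is internal if it has at least k neighbors within Nbr(p, r)."""
    return len(neighborhood(X, p_idx, r)) >= k

def directly_density_reachable(X, p_idx, q_idx, r, k=4):
    """Definition 4: p is directly density-reachable from q if p is in Nbr(q, r) and q is internal."""
    return p_idx in neighborhood(X, q_idx, r) and is_internal(X, q_idx, r, k)
```

For example, with X an n-by-2 NumPy array of 2-D points, is_internal(X, 0, r) tells whether point 0 is an internal point for the chosen radius r.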
Enhanced DBSCAN: We develop an enhanced DBSCAN algorithm (e-DBSCAN). e-DBSCAN is used as a nested clustering procedure, which is called repeatedly to cluster at different densities. e-DBSCAN differs from the original DBSCAN in that the boundary points of each cluster are stored as a separate set. The boundary sets are used to merge clusters at a later stage. The enhanced DBSCAN process is summarized as follows:
1. Pick an arbitrary point x. If it is not an internal point, it is labeled as noise. Otherwise, its neighborhood becomes a rudimentary cluster C; insert all neighbors of point x into the seed store.
2. Retrieve the next point from the seed store. If it is an internal point, merge its neighborhood into cluster C and insert all its neighbors into the seed store; if it is an external point, insert it into the boundary set of C.
3. Go back to step 2 with the next seed until the seed store is empty.
4. Go back to step 1 with the next unclustered point in the dataset.
When the process is finished, there will be some cluster
sets, a noise set and a boundary set for each cluster.
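A minimal Python sketch of this procedure is given below. It is our own rendering, not the authors' C implementation: the brute-force pairwise-distance computation and the set-based seed store are assumptions made for clarity, and the optional "unclustered" argument restricts processing to a subset of points so the same routine can be reused for the iterative clustering of Section 3.3.

```python
import numpy as np

def e_dbscan(X, r, k=4, unclustered=None):
    """Enhanced DBSCAN sketch: returns (clusters, boundaries, noise) as sets of point indices."""
    n = len(X)
    active = set(range(n)) if unclustered is None else set(unclustered)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)      # pairwise distances
    nbrs = [set(np.where(dists[i] <= r)[0]) & active for i in range(n)]  # neighbors among active points

    visited, clusters, boundaries, noise = set(), [], [], set()
    for x in active:
        if x in visited:
            continue
        if len(nbrs[x]) < k:                    # step 1: not an internal point -> tentative noise
            noise.add(x)
            continue
        cluster, boundary = set(nbrs[x]), set()  # rudimentary cluster C
        seeds = list(nbrs[x])
        visited.add(x)
        while seeds:                             # steps 2-3: grow C from the seed store
            q = seeds.pop()
            if q in visited:
                continue
            visited.add(q)
            if len(nbrs[q]) >= k:                # internal: merge its neighborhood into C
                cluster |= nbrs[q]
                seeds.extend(nbrs[q] - visited)
            else:                                # external: record it as a boundary point of C
                boundary.add(q)
        clusters.append(cluster)
        boundaries.append(boundary)
        noise -= cluster | boundary              # points first seen as noise may join a later cluster
    return clusters, boundaries, noise
```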
3. PARAMETER REDUCTION FOR DENSITY-BASED CLUSTERING
There are two input parameters in the DBSCAN algorithm: the minimum number of neighbors, k, and the minimum neighborhood radius, r. In fact, k is the size of the smallest cluster and should not vary with different datasets. DBSCAN sets k to 4 [1], and TURN* also treats it as a fixed value [2]. We also set k to 4.
Therefore, the only input parameter is the minimum neighborhood radius, r. Intuitively, r should depend on the cluster density of the dataset: clusters of different densities should have different values of r. For this reason, DBSCAN presents the user with a graph of the sorted distances between each point and its 4th nearest neighbor, and the user is asked to find the "valley" which represents the optimal r. This method works only for clusters with the same density. TURN* treats the whole dataset as an image and tries a range of resolutions (radii), from one end where each point is classified as noise to the other end where all data points are included in a single cluster. An optimal resolution is found within this range by statistical methods.
In this section, we first present a few observations based on our experiments on many different datasets. We then develop a built-in algorithm, MINR, to determine the minimum neighborhood radii for clusters of different densities based on the data distribution. Finally, we develop a nonparametric density-based clustering method by combining MINR with e-DBSCAN.
3.1. Experiments and Observations
Observation 1: We define R as the distance between each point x and its 4th nearest neighbor. The points are then sorted based on R in ascending order. Figure 3 shows two datasets, DS1 and DS2, and their R-x graphs after sorting. DS1 is a dataset used by DBSCAN; its size is 200 points. DS2 is reproduced from a dataset used by CHAMELEON: the original data has 10K points and its clusters have similar density, so in order to test our algorithm we inserted more data into the three clusters in the upper left part. The size of DS2 is 17.5K.
As we can see from Figure 3, for a noisy dataset there is a turning point in the R-x graph where R starts to increase dramatically. Our experiments show that most points to the right of the turning point are noise; if the dataset were clean, there would be no turning point in the graph. DS1 and DS2 are both noisy datasets, so there are turning points in Figure 3 (c) and (d). We can even check this observation on DS1 by eye: the turning point in (c) is at around 175, and there are 24 points to its right; in fact, DS1 has 20 noise points.

Figure 3. DS1 and DS2 and their sorted R-x graphs: (a) DS1; (b) DS2; (c) R-x of DS1; (d) R-x of DS2
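The R values and the sorted R-x curve of Observation 1 can be computed with a short NumPy sketch (our own illustration, assuming Euclidean distances and a brute-force distance matrix, not the authors' C code):

```python
import numpy as np

def sorted_4nn_distances(X, k=4):
    """R value for each point (distance to its k-th nearest neighbor), sorted ascending."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    dists.sort(axis=1)            # column 0 is each point's distance to itself (0)
    R = dists[:, k]               # k-th nearest neighbor, excluding the point itself
    return np.sort(R)             # ascending R-x curve
```

Plotting the returned array against the point index reproduces graphs like Figure 3 (c) and (d); the turning point is where the curve starts to rise sharply.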
Observation 2: Given a neighborhood radius r, we
calculate the number of neighbors for each point within
the given radius, denoted as K, sort the points in
descending order, and get the sorted K-x graph. When r is
small, the line is quite smooth. As r increases, the graph
starts to have “knees”. When we continue to increase r,
the graph becomes smooth again. The rationale is that if r is very small or very big, the number of neighbors will be similar across points. One extreme case is when r is so small that every point has no neighbor but itself; the other extreme case is when r is large enough to cover the whole data set as the neighborhood. Figure 4 shows K-x
graphs for DS1 and DS2 for three different radii
respectively. Figure 4 (a) and (b) are the cases when r is
very small. (c) and (d) are the cases when r is close to the
maximum R in the R-x graph. (e) and (f) are the cases
when r is very large.
Figure 4. Sorted K-x graphs for datasets DS1 and DS2 with different neighborhood radii: (a) DS1, r = 2; (b) DS2, r = 5; (c) DS1, r = 22; (d) DS2, r = 30; (e) DS1, r = 50; (f) DS2, r = 250

From Figure 4, we can see that when the neighborhood radius is close to the maximum R, the K-x graph shows "knees" very clearly. In order to find the "knees", we need to calculate the differentials of the graphs, ΔK. Figure 5 (a) and (b) show the sorted K-x graphs for DS1 and DS2 when the neighborhood radius is close to the maximum R; (c) and (d) show the differentials of the graphs.

Figure 5. Sorted K-x graphs of datasets DS1 and DS2 and their differentials ΔK: (a) K-x graph for DS1; (b) K-x graph for DS2; (c) ΔK for DS1; (d) ΔK for DS2

Both DS1 and DS2 consist of clusters in two different densities plus some noise. The knees are close to the points with peak differentials, as we can see in Figure 5 (c) and (d). The number of "knees" is equal to the number of cluster densities in the dataset. Intuitively, we infer that the points divided by the "knees" belong to different density clusters or to noise.

Observation 3: In order to justify the intuition above, we sort the dataset DS2 based on K and then partition the sorted dataset into three subsets separated by the two "knees" in Figure 5 (b). The two "knees" are at positions 10000 and 15500, so the three partitions are 0 – 10000, 10001 – 15500, and 15501 – 17500. The three partitions are shown in Figure 6. We can see that partition (a) consists of the denser clusters, partition (b) consists of the less dense clusters, and partition (c) is mainly noise.

Figure 6. Partitions of the sorted DS2 separated by the two "knees" at 10000 and 15500: (a) partition 0 – 10000; (b) partition 10001 – 15500; (c) partition 15501 – 17500
3.2. Determination of the neighborhood radii

Based on the experiments above, we develop an algorithm, MINR, to automatically determine the minimum neighborhood radii for mining clusters with different densities, based on the data distribution. The process is as follows:
1. Calculate the distance between each point and its 4th neighbor, R; find the maximum R.
2. Compute the number of neighbors, K, within the maximum neighborhood radius R for each point.
3. Sort the points in descending order based on K.
4. Calculate the differential ΔK; search for the peak ΔK values.
5. Find the "knee" point right before each peak point with ΔK = 0.

The "knee" points are denoted as KNi, where i = 1, 2, …, m, and m is the number of "knees". The distance between KNi and its 4th neighbor will be the neighborhood radius for clustering the ith dense cluster group. The algorithm is summarized in Figure 7.
MINR Algorithm
Input: A data set X
Output: neighborhood radii ri
1. Calculate the distance between each point and its 4th neighbor, R. Get Rm = max(R).
2. Compute the number of neighbors within Rm for each point, K.
3. Sort the points in descending order based on K.
4. Calculate the differential ΔK, and find the ith peak ΔK position, XPi. Stop if it is at the end of the dataset.
5. For the ith peak ΔK position, find the "knee" point KNi: if x < XPi, ΔKx = 0, and |x − XPi| is the smallest, then KNi = x.
6. ri = Rx, the R value of the knee point x = KNi. Increase i and go back to step 4.

Figure 7. MINR algorithm
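Below is a possible NumPy rendering of the MINR steps in Figure 7. It is a sketch rather than the authors' implementation: in particular, the rule used to select "peak ΔK" positions (all drops whose magnitude is at least peak_frac of the largest drop) is our own assumption, since the paper does not spell one out.

```python
import numpy as np

def minr(X, k=4, peak_frac=0.5):
    """MINR sketch: estimate one neighborhood radius per cluster-density level."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    R = np.sort(dists, axis=1)[:, k]            # step 1: distance to each point's 4th neighbor
    Rm = R.max()

    K = (dists <= Rm).sum(axis=1)               # step 2: neighbor counts within Rm
    order = np.argsort(-K)                      # step 3: sort points by K, descending
    K_sorted = K[order].astype(float)

    dK = np.diff(K_sorted)                      # step 4: differential of the sorted curve
    drop = -dK                                  # positive spikes = sharp drops ("peak ΔK")
    peaks = np.where(drop >= peak_frac * drop.max())[0]

    radii = []
    for xp in sorted(peaks):                    # step 5: knee = closest flat point left of each peak
        flat = np.where(dK[:xp] == 0)[0]
        if len(flat) == 0:
            continue
        knee = flat[-1]
        radii.append(float(R[order[knee]]))     # step 6: radius = R value of the knee point
    return sorted(set(radii))                   # smallest radius first (densest clusters first)
```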
3.3. Nonparametric Density-based Clustering

In this section, we first propose an iterative clustering process given a series of neighborhood radii for the different density cluster groups in the dataset, and then develop our nonparametric density-based clustering method.
We start clustering using the enhanced DBSCAN algorithm, e-DBSCAN, with k = 4 and r = r1. The densest cluster(s) are formed, as shown in Figure 8.

Figure 8. Resulting clusters after clustering with r1: the densest cluster is formed

Then we set r = r2 and process only the unclustered points. The next sparser cluster(s) are formed (see Figure 9). The process continues until r = rm; the remaining unclustered points are noise.

Figure 9. Resulting clusters after clustering with r2: the sparser cluster is formed; the unclustered points are noise

Our nonparametric density-based clustering algorithm proceeds as follows. First, it calculates a series of neighborhood radii for the different density clusters using MINR; then it runs an iterative clustering process using e-DBSCAN with those radii; finally, it merges any pair of clusters which share most of the boundary points of either cluster. The whole process of our nonparametric clustering algorithm is summarized in Figure 10.

Nonparametric Clustering Algorithm
Input: A dataset X
Output: Clusters and noise
1. Calculate the neighborhood radii r1, r2, …, rm for the different density clusters with MINR().
2. Iteratively cluster with e-DBSCAN, using each radius in turn.
3. Check the boundaries of each pair of clusters. If two clusters share most of the boundary of either cluster, merge the two clusters into one.

Figure 10. Nonparametric clustering algorithm
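The overall NPDBC loop of Figure 10 can be sketched by combining the minr() and e_dbscan() sketches given earlier. The 50% overlap threshold used below to decide that two clusters "share most of the boundary points of either cluster" is our own assumption.

```python
def npdbc(X, k=4):
    """NPDBC sketch (Figure 10), reusing the minr() and e_dbscan() sketches above."""
    radii = minr(X, k=k)                         # step 1: one radius per density level
    clusters, boundaries = [], []
    unclustered = set(range(len(X)))
    for r in radii:                              # step 2: iterative clustering, densest first
        cs, bs, _ = e_dbscan(X, r, k=k, unclustered=unclustered)
        clusters.extend(cs)
        boundaries.extend(bs)
        for c in cs:
            unclustered -= c
        for b in bs:
            unclustered -= b
    # step 3: merge clusters that share most of either cluster's boundary points
    merged = True
    while merged:
        merged = False
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                shared = boundaries[i] & boundaries[j]
                if shared and len(shared) > 0.5 * min(len(boundaries[i]), len(boundaries[j])):
                    clusters[i] |= clusters.pop(j)
                    boundaries[i] |= boundaries.pop(j)
                    merged = True
                    break
            if merged:
                break
    return clusters, unclustered                 # remaining unclustered points are noise
```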
4. PERFORMANCE ANALYSIS
In this section, we compare our nonparametric density-based clustering algorithm (NPDBC) with TURN*. We tested the algorithms on several data sets; here we show the run time comparison on the dataset DS2 discussed above. In order to make the data contain clusters of different densities, we artificially inserted more data into some clusters to make them denser than the others. The resulting datasets have sizes from 10K to 200K.
We implemented NPDBC in the C language and ran it on a 1 GHz Pentium PC with 1 GB of main memory running Debian Linux 4.0. The run time comparison of NPDBC and TURN* is shown in Figure 11.
Figure 11. Comparison of NPDBC and TURN*
From Figure 11, we can see that NPDBC is more efficient than TURN* for large datasets. The reason is that for NPDBC, the parameters are computed once at the beginning of the clustering process, while the TURN* algorithm tries different neighborhood radii until the first "turn" is found in the case of two different densities. We only compare NPDBC with TURN* on datasets with two different densities; if the density variety increases, NPDBC will outperform TURN* by an even larger margin. In that case, TURN* would not stop at the first turning point: it would have to continue searching for more knees until the very end. It is obvious that TURN* will fail for large datasets with various densities.
5. CONCLUSION
One of the major challenges of clustering is minimal
domain knowledge to determine the input parameters. It is
even more difficult to determine the input parameters
when the dataset contains clusters in different densities.
Although many algorithms have tried to make clustering parameter-free, they either try to give users all possible choices, or adopt a trial-and-error approach based on statistical information, which is not practical for very large datasets.
In this paper, we explore an automatic approach to
determine this parameter based on the distribution of
datasets. The algorithm, MINR, is developed to determine
the minimum neighborhood radii for different density
clusters. We developed a nonparametric clustering
method (NPDBC) by combining MINR with the
enhanced DBSCAN, e-DBSCAN. Experiments show our NPDBC is more efficient and scalable than TURN* for clusters in two different densities. The reason is that in NPDBC, the parameters are computed once at the beginning of the clustering process, while the TURN* algorithm tries different neighborhood radii until the first "turn" is found in the case of clusters in two different densities. When the dataset contains clusters of various densities, our algorithm will be much more efficient. In our future work, we will implement our NPDBC using the vertical data structure, P-tree, an efficient data-mining-ready data representation.
6. REFERENCES
[1]. Ester, M., Kriegel, H.-P., Sander, J. & Xu, X.: A density-based algorithm for discovering clusters in large spatial
databases with noise. In Proceedings of the 2nd ACM
SIGKDD, Portland, Oregon (1996) 226-231
[2]. Foss, A. & Zaiane, O. R. A Parameterless Method for
Efficiently Discovering Clusters of Arbitrary Shape in
Large Datasets. In Proceedings of ICDM 2002.
[3]. Halkidi, M., Batistakis, Y. and Vazirgiannis, M. On clustering
validation techniques. Journal of Intelligent Information
Systems, 17(2-3):107–145, December 2001.
[4]. Han, J. and Kamber, M. Data Mining, Concepts and
Techniques. Morgan Kaufmann, 2001.
[5]. Hinneburg, A., and Keim, D. A.: An Efficient Approach to
Clustering in Large Multimedia Databases with Noise.
Proceeding 4th Int. Conf. on Knowledge Discovery and
Data Mining, AAAI Press (1998)
[6]. Karypis, G., Han, E.-H., and Kumar, V. CHAMELEON: A
hierarchical clustering algorithm using dynamic modeling.
IEEE Computer, 32(8):68–75, August 1999.
[7]. M.Ankerst, M.Breunig, H.-P. Kriegel, and J.Sander.
OPTICS: Ordering points to identify the clustering
structure. In Proc. 1999 ACM-SIGMOD Conf. on
Management of Data (SIGMOD’99), pages 49–60, 1999.
[8]. Ng, R. T. and Han, J., Efficient and effective clustering
methods for spatial data mining. In Proc. of the 20th Int’l
Conf. on Very Large Data Bases, 1994.
[9]. Palmer, C. R. and Faloutsos, C. Density biased sampling:
an improved method for data mining and clustering. In
Proceedings of Int’l Conf. on Management of Data, ACM
SIGMOD 2000.
[10]. Sheikholeslami, G., Chatterjee, S. and A. Zhang. A
wavelet-based clustering approach for spatial data in very
large databases. The International Journal on Very Large
Databases, 8(4):289–304, February 2000.