International Journal on Advanced Computer Theory and Engineering (IJACTE)
_______________________________________________________________________________________________
A roadmap to varied density dataset issue of DBSCAN and its
variants
Neha R. Soni (1), Amit P. Ganatra (2)
(1) Asst. Prof., SVIT, Vasad, Gujarat
(2) Dean, Faculty of Tech. & Engg., CHARUSAT, Changa, Gujarat
Email: [email protected], [email protected]
Abstract - A wide variety of methods have been designed for cluster
analysis, a form of unsupervised learning, including partitioning
based, hierarchical, density based and model based methods. DBSCAN,
one of the most widely applied density based clustering algorithms,
outperforms partitioning based clustering algorithms such as k-means,
CLARA and CLARANS, as well as hierarchical algorithms, because it does
not require prior knowledge of the number of clusters or a termination
condition, and it generates clusters of arbitrary shape, which need
not be convex. Despite its wide applicability, it exhibits a few
issues: i) its time complexity is O(n^2) if an R*-tree index is not
used, ii) it does not work properly on datasets of varying density,
and iii) the choice of its two input parameters, Eps and MinPts,
greatly changes the output. To overcome these issues, different
modifications of the original DBSCAN have been proposed in the
literature. The algorithms proposed for handling varied density
datasets are surveyed in this paper.
Index Terms - DBSCAN, density based clustering, varied density dataset
I. INTRODUCTION
Clustering, or cluster analysis, a form of unsupervised learning, is
the process of grouping objects of a similar kind. Clustering plays an
outstanding role in data mining applications and is the subject of
active research in several fields such as statistics, pattern
recognition and machine learning [2]. Thousands of clustering
algorithms have been proposed in the literature in many different
disciplines and for many different applications [5]. Clustering
algorithms have also been categorized from a number of perspectives,
as presented in [1][2][3][4][15]; the major categories are
partitioning, hierarchical, density based, grid based and model based.
Density based clustering is one of the primary methods for clustering
in data mining. It is more effective at detecting clusters of
arbitrary shape. Density based clustering considers clusters as dense
regions separated by sparse regions and can be applied very
efficiently to spatial databases. The main representative algorithms
in this category are DBSCAN [6], OPTICS [7], DENCLUE [8] and
DBCLASD [9]. DBSCAN is the most widely used algorithm in this
category. It takes two input parameters: Eps and MinPts. The main
weakness of DBSCAN is that it is unable to produce proper clusters
when the dataset has greatly varied densities. As it makes use of a
global radius (Eps), it can find clusters at a single density level
only. A large number of modifications of DBSCAN have been proposed in
the literature to handle this issue. This paper discusses a few of
them, with a comparison and remarks.
The rest of the paper is organized as follows. Section 2 briefly
describes the working of DBSCAN. Section 3 discusses the issue of
varied density datasets and its consequences for the output of
DBSCAN. Section 4 summarizes the different algorithms proposed to
address the issue of varied density datasets, together with a
detailed comparison. Finally, Section 5 presents the conclusion and
directions for future work.
II. THE DBSCAN ALGORITHM
DBSCAN [6] was the first density based clustering algorithm to become
very popular. The basic idea of DBSCAN is that a cluster, being a
dense region, has to contain some minimum number of points (MinPts)
within a specified neighbourhood radius (Eps); these are the two
input parameters. To find a cluster, DBSCAN starts with an arbitrary
point p and finds the Eps-neighbourhood of p. If the neighbourhood
contains at least MinPts points, p is considered a core point and
DBSCAN retrieves all points that are density reachable from p with
respect to Eps and MinPts. A point that does not have the minimum
number of points in its neighbourhood is considered a border point,
if it lies in the neighbourhood of a core point, or a noise point
otherwise. DBSCAN continues checking the other points in the dataset
until all points are classified.
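The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the implementation from [6]: it uses a brute-force neighbourhood search (hence O(n^2) time, with no spatial index), and the names region_query, dbscan and NOISE are chosen here for illustration only.

```python
from math import dist

NOISE = -1  # label used in this sketch for noise points


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (including i)."""
    return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, NOISE for noise."""
    labels = [None] * len(points)          # None means "not yet visited"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbours = region_query(points, i, eps)
        if len(neighbours) < min_pts:
            labels[i] = NOISE              # may be relabelled as a border point later
            continue
        cluster += 1                       # points[i] is a core point: start a cluster
        labels[i] = cluster
        seeds = list(neighbours)
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster        # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbours = region_query(points, j, eps)
            if len(j_neighbours) >= min_pts:
                seeds.extend(j_neighbours)  # j is also core: keep expanding
    return labels
```

For example, two well separated squares of four points and one isolated point yield two clusters and one noise point with eps=1.5 and min_pts=3.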
III. ISSUE OF VARIED DENSITY
DATASET IN DBSCAN
The two input parameters of DBSCAN, Eps and MinPts, are global
parameters. As a result, DBSCAN produces incorrect results when the
clusters in the dataset have different densities and are not well
separated by sparse regions. With a low value of Eps as input, the
highly dense clusters can be extracted, but the sparse clusters
will be treated as outliers, whereas with a high
value of Eps as input, the densest clusters will be merged into the
sparse clusters [10] [11] [14]. As shown in Figure 1, if the value of
Eps is small enough, DBSCAN will generate two clusters, C1 and C2,
with C3 treated as outliers; if the value of Eps is large enough,
DBSCAN will again produce two clusters: C3 as one cluster, and C1 and
C2 merged into a single cluster. Thus, DBSCAN is unable to produce
all three clusters C1, C2 and C3 with a single global value of Eps.
In many real datasets, clusters of different densities may be present
and useful for further analysis. It is therefore necessary to find
both the dense and the sparse clusters. To handle this issue, several
new algorithms have been proposed in the literature as extensions or
modifications of DBSCAN. Section 4 surveys these algorithms.
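The effect of a global Eps can be reproduced on a small one-dimensional example. The data below are hypothetical, chosen only to mimic the situation described for C1, C2 and C3 (two dense groups close together and one sparse group); the helper names are likewise illustrative.

```python
def count_neighbours(xs, x, eps):
    """Number of values in xs within eps of x (x itself included)."""
    return sum(1 for y in xs if abs(x - y) <= eps)


def core_points(xs, eps, min_pts):
    """Values whose Eps-neighbourhood holds at least min_pts values."""
    return [x for x in xs if count_neighbours(xs, x, eps) >= min_pts]


def bridged(a, b, eps):
    """True if some value of a lies within eps of some value of b."""
    return any(abs(x - y) <= eps for x in a for y in b)


# Hypothetical data: c1 and c2 are dense and close together, c3 is sparse.
c1 = [0.0, 0.1, 0.2, 0.3]
c2 = [1.0, 1.1, 1.2, 1.3]
c3 = [10.0, 11.0, 12.0, 13.0]
data = c1 + c2 + c3

# Small Eps: c1 and c2 stay separate, but no point of c3 is a core point,
# so c3 is reported as noise.
small = core_points(data, eps=0.25, min_pts=3)

# Large Eps: c3 becomes a cluster, but the gap between c1 and c2 is now
# bridged, so they merge into one cluster.
large = core_points(data, eps=2.5, min_pts=3)
```

No single value of Eps in this example recovers all three groups: any Eps large enough to make the points of c3 core points also bridges the gap between c1 and c2.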
Figure 1: Example of the varied density dataset issue
IV. COMPARATIVE STUDY
The following is a summary of the different modifications of DBSCAN
proposed to handle the issue of varied density datasets. Table 1
presents the comparison in summarized form, with the authors' own
reviews and remarks on each algorithm, providing guidelines for
selection as well as directions for further work or improvements.
1) Enhancing density-based clustering: Parameter reduction and
outlier detection [12]
In this paper the authors address three common issues of density
based clustering: (i) selection of data dependent parameters; (ii)
sensitivity of the algorithm's behaviour to the density of the
starting object; (iii) improper identification of adjacent clusters
with different densities. To address these issues, a new density
function is proposed, based on the concepts of kNN-stratification and
the influence function. First, kNN-stratification is applied to the
dataset to identify its different density levels efficiently. The
original dataset is then projected onto a new space by adding the
rank of each object, derived using kNN-stratification, as one more
dimension. Density based clustering is then applied using the
knowledge of the k-influence space ISk. A random point p is selected
and a cluster around p is constructed until a border point or an
outlier is reached. A border point is determined by comparing the
size of ISk for the point under consideration with a threshold, whose
value is set to 2k/3 based on experimental results.
2) Grid-based DBSCAN Algorithm with Referential Parameters [16]
The algorithm, Grid-based DBSCAN Algorithm with Referential
Parameters (GRPDBSCAN), is based on the grid partition technique and
multi-density based clustering. The authors propose a technique for
automatic generation of the Eps and MinPts parameters of the DBSCAN
algorithm. The algorithm starts by dividing the data space into a
grid and binning each data object into its corresponding grid cell.
Eps and MinPts are then determined from the grid structure, and
DBSCAN is applied with grid units as the objects: a grid unit is a
core object if the number of data objects in it is larger than
MinPts. An undirected graph is then constructed by placing an edge
between each core grid unit and its adjacent core grid units. Every
connected component represents a cluster.
3) Enhanced Density Based Spatial Clustering of Applications with
Noise [14]
In this paper the authors propose a new algorithm, an extension of
DBSCAN, to handle the issue of varied density datasets. It starts by
finding the k nearest neighbours (kNN) of each point p and storing
them in ascending order of their distance to p. A local density
function is then computed for each point p as the sum of the
distances to its kNN, and the dataset is rearranged in descending
order of the local density of each point. From the input parameter
Maxpts, an Eps is determined as the distance from the point p to its
Maxpts-th neighbour, and DBSCAN is then applied for each value of
Eps, ignoring the previously clustered points.
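The quantities used by the Enhanced DBSCAN description above (sorted kNN distances, a local density computed as their sum, and an Eps taken from the Maxpts-th neighbour) can be sketched as follows. This is an illustration under assumptions, not the authors' code: the function names are hypothetical, the neighbour search is brute force, and the ordering helper assumes that a smaller sum of kNN distances means a denser point, so it lists the densest points first.

```python
from math import dist


def knn_distances(points, i, k):
    """Distances from points[i] to its k nearest neighbours, in ascending order."""
    ds = sorted(dist(points[i], q) for j, q in enumerate(points) if j != i)
    return ds[:k]


def local_density(points, i, k):
    """Sum of the kNN distances of points[i]; a smaller sum means a denser region."""
    return sum(knn_distances(points, i, k))


def densest_first(points, k):
    """Point indices ordered from the densest to the sparsest point (assumed reading)."""
    return sorted(range(len(points)), key=lambda i: local_density(points, i, k))


def eps_for_point(points, i, maxpts):
    """Eps for points[i]: the distance to its maxpts-th nearest neighbour."""
    return knn_distances(points, i, maxpts)[-1]
```

With the points (0, 0), (1, 0), (2, 0) and (10, 0) and k = 2, the middle point of the dense group is visited first and receives Eps = 1.0, while the isolated point receives Eps = 9.0, which shows how Eps adapts to the local density.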
4) DDSC: Density Differentiated Spatial Clustering [19]
DDSC is an extension of the DBSCAN algorithm that detects clusters
with differing densities. The algorithm finds natural density based
clusters that may not be separated by sparse regions, on the
assumption that the local density within a cluster is reasonably
homogeneous and that adjacent regions are separated into different
clusters if there is a significant change in density. It starts a
cluster with a homogeneous core object and keeps expanding it by
including directly density reachable homogeneous core objects until a
non-homogeneous core object is detected. Whether a core object is
homogeneous is determined by the parameter α, a density threshold.
5) VDBSCAN: Varied Density Based Spatial Clustering of Applications
with Noise [10]
VDBSCAN is an improvement to DBSCAN for handling varied density
datasets. The basic idea is to use a different Eps value for each
density level that exists in the dataset, instead of a single global
value of Eps for all the clusters to be formed. To do so, the
algorithm first
calculates and stores k-dist for each point and plots the k-dist
graph. If density variation exists in the dataset, there will be a
sharp change in the k-dist graph, which corresponds to a suitable
value of Eps. Thus a different value of Eps, denoted Epsi, can be
chosen at each sharp change in an otherwise smooth curve, and DBSCAN
is run for each Epsi in turn, ignoring the points that have already
been clustered.
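As a rough sketch of the k-dist step described above: the code below computes the k-dist of every point and picks one candidate Epsi per density level from the sorted curve. The survey notes that these values are normally read off the plot by eye; the automatic rule used here (a jump by more than a factor jump_ratio between consecutive values) is an assumption made for this illustration, as are the function names.

```python
from math import dist


def k_dist(points, k):
    """k-dist of every point: the distance to its k-th nearest neighbour."""
    result = []
    for i, p in enumerate(points):
        ds = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(ds[k - 1])
    return result


def eps_candidates(points, k, jump_ratio=2.0):
    """Pick one Eps_i per density level from the sorted k-dist curve.

    jump_ratio is a hypothetical threshold: a "sharp change" is declared
    when a k-dist value exceeds jump_ratio times the previous one.
    """
    curve = sorted(k_dist(points, k))
    candidates = []
    for prev, cur in zip(curve, curve[1:]):
        if prev > 0 and cur > jump_ratio * prev:
            candidates.append(prev)    # last value before the jump
    candidates.append(curve[-1])       # Eps for the sparsest level
    return candidates
```

DBSCAN would then be run once per candidate, from the smallest Epsi to the largest, skipping points that already carry a cluster label.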
6) ST-DBSCAN: An algorithm for clustering spatial-temporal data [11]
ST-DBSCAN extends DBSCAN in three respects, as described by its
authors: i) it can cluster spatial-temporal data according to its
non-spatial, spatial and temporal attributes; ii) noise can be
detected in the case of varying density by means of a density factor
assigned to each cluster; iii) it handles the effect of spatial and
non-spatial attributes on border objects lying on opposite sides of
adjacent clusters.
The algorithm takes four parameters, Eps1, Eps2, MinPts and Δε, where
Eps1 is used for the spatial attributes and Eps2 for the non-spatial
attributes. It starts with the first point p and retrieves all points
that are density reachable from p with respect to Eps1 and Eps2. If p
is a core object, a cluster is formed; otherwise the algorithm visits
the next point in the dataset. Thus issues one and three are
addressed by also considering the non-spatial and temporal attributes
in the formation of the clusters.
7) Locally Scaled Density Based Clustering [17]
As the name suggests, the algorithm (LSDBC) is based on the concept
of local scaling, a technique that makes use of the local statistics
of the data during the identification of clusters. LSDBC clusters the
points by connecting dense regions until the density falls below a
threshold determined by the centre of the cluster. It first
calculates an Eps value for each point based on its kNN distance and
then sorts the dataset in ascending order of Eps. The densest local
point is then selected and the cluster is expanded from that point,
comparing the density each time. This allows the algorithm to work
with different density variations.

1. ISDBSCAN
   Proposed by: Carmelo Cassisi, Alfredo Ferro, Rosalba Giugno,
      Giuseppe Pigola, Alfredo Pulvirenti
   Year: 2013
   Complexity: As of DBSCAN
   Input parameters: a) number of nearest neighbours k
   Issues addressed: a) parameter selection; b) input order
      dependency; c) varying density dataset
   Main concept used: space stratification based on both the INFLO
      function and kNN distances
   Research findings/issues: a) sorting of the dataset as per the w
      function; b) adds one more dimension to the dataset; c) a
      threshold has to be set

2. GRPDBSCAN
   Proposed by: Huang Darong, Wang Peng
   Year: 2012
   Complexity: As of DBSCAN
   Input parameters: a) number of grid units N
   Issues addressed: a) handling varied density datasets; b)
      reduction in parameters
   Main concept used: combines the grid partition technique and
      multi-density based clustering
   Research findings/issues: a) grid division and data binning need
      to be explored; b) selection of Eps and MinPts needs to be more
      specific

3. Enhanced DBSCAN
   Proposed by: A. Fahim, G. Saake, A. Salem, F. Torkey and
      M. Ramadan
   Year: 2009
   Complexity: As of DBSCAN
   Input parameters: a) number of nearest neighbours k; b) limit on
      the highest density, Maxpts
   Main concept used: based on the concept of a local density
      function to find the local density at each point, which is an
      approximation of the overall density function
   Research findings/issues: a) sorting is performed twice, once for
      each point's kNN and once for the whole dataset; b) introduces
      a new parameter to control the highest density in a cluster; c)
      the absolute distance to the Maxpts-th neighbour is used as
      Eps, which may be explored further

4. DDSC
   Proposed by: B. Borah, D. K. Bhattacharyya
   Year: 2008
   Complexity: O(n log n)
   Input parameters: a) radius Eps; b) minimum points MinPts; c)
      density threshold α
   Issues addressed: a) varied density dataset; b) reduction in the
      sensitivity of Eps
   Main concept used: partitions the dataset such that adjacent
      regions differ significantly in density, making use of a
      homogeneity test to detect variations in density
   Research findings/issues: a) introduction of a new parameter for
      the density threshold; b) reduces the sensitivity of the
      parameters but does not eliminate it completely

5. VDBSCAN
   Proposed by: Peng Liu, Dong Zhou, Naijun Wu
   Year: 2007
   Complexity: As of DBSCAN
   Input parameters: a) radius Epsi; b) minimum points MinPts; c)
      number of nearest neighbours k
   Issues addressed: varied density dataset
   Main concept used: for a varied density dataset, different values
      of Epsi can be used, which can be determined by plotting the
      k-dist graph
   Research findings/issues: a) the result may vary with the value of
      k; b) the values of Epsi are chosen subjectively from the
      k-dist plot

6. ST-DBSCAN
   Proposed by: Derya Birant and Alp Kut
   Year: 2007
   Complexity: As of DBSCAN
   Input parameters: a) radii Eps1 and Eps2; b) minimum points
      MinPts; c) density threshold Δε
   Issues addressed: a) handling of spatial-temporal datasets; b)
      varied density dataset; c) border points on opposite sides of
      adjacent clusters
   Main concept used: a) two Eps values for two dimensions, spatial
      and temporal; b) handles varied density (noise point
      identification) by defining a density factor
   Research findings/issues: a) selection of the threshold value is
      to be explored

7. LSDBC
   Proposed by: Ergun Bicici and Deniz Yuret
   Year: 2007
   Complexity: As of DBSCAN
   Input parameters: a) number of nearest neighbours k; b) density
      threshold α
   Issues addressed: a) varied density dataset; b) reduction in
      parameters
   Main concept used: uses the notion of local scaling in density
      based clustering, which determines the density threshold based
      on the local statistics of the data
   Research findings/issues: a) introduction of a new parameter for
      the density threshold; b) sorting of the dataset

Table 1: Comparative study of density based clustering algorithms
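To make the two-radius idea of ST-DBSCAN (item 6 of Section IV) concrete, the following sketch shows a neighbourhood test that requires closeness under both Eps1 and Eps2. It is an illustration under assumptions rather than the algorithm from [11]: each object is represented as a pair of coordinate tuples (spatial, non-spatial), the function names are hypothetical, and the density factor Δε is left out.

```python
from math import dist


def st_neighbours(objects, i, eps1, eps2):
    """Indices of objects within eps1 (spatial) and eps2 (non-spatial) of objects[i].

    Each object is a (spatial, non_spatial) pair of coordinate tuples;
    this record layout is an assumption made for the sketch.
    """
    s_i, n_i = objects[i]
    return [
        j for j, (s_j, n_j) in enumerate(objects)
        if dist(s_i, s_j) <= eps1 and dist(n_i, n_j) <= eps2
    ]


def is_core(objects, i, eps1, eps2, min_pts):
    """True if objects[i] has at least min_pts neighbours under both radii."""
    return len(st_neighbours(objects, i, eps1, eps2)) >= min_pts
```

An object that is spatially close but differs strongly in its non-spatial value is excluded from the neighbourhood, which is what keeps adjacent clusters with different attribute values apart.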
V. CONCLUSION AND FUTURE WORK
Density based clustering is one of the primary methods for clustering
in data mining, and DBSCAN is the most widely used algorithm in this
category. Despite its wide applicability, it exhibits a few problems:
high time complexity, sensitivity to the choice of input parameters,
and an inability to produce proper clusters when the clusters in the
dataset have greatly varied densities. A number of modifications of
DBSCAN have been proposed in the literature to address the issue of
varying density datasets. This paper discusses a few of them, with a
detailed comparison and remarks. It has been observed that each
modification either introduces some new input parameter or results in
some other issue. A future direction for research is to develop a
parameter free clustering method, or a method with automatic
selection of the parameters.
REFERENCES
[1] J. Han and M. Kamber, "Data Mining: Concepts and Techniques",
Morgan Kaufmann, 2001.
[2] P. Berkhin, "Survey of clustering data mining techniques",
Technical report, Accrue Software, San Jose, CA, 2002.
[3] A. K. Jain, M. N. Murty and P. J. Flynn, "Data clustering: a
review", ACM Computing Surveys, Vol. 31, Issue 3, pp. 264-323, 1999.
[4] Rui Xu and D. Wunsch, "Survey of clustering algorithms", IEEE
Transactions on Neural Networks, Vol. 16, No. 3, pp. 645-678, May
2005.
[5] A. K. Jain, "Data Clustering: 50 Years Beyond K-Means", Pattern
Recognition Letters, Vol. 31, No. 8, pp. 651-666, 2010.
[6] M. Ester, H. P. Kriegel, J. Sander and X. Xu, "A Density-Based
Algorithm for Discovering Clusters in Large Spatial Databases with
Noise", Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining,
Portland, Ore, USA, pp. 226-231, 1996.
[7] M. Ankerst, M. Breunig, H. P. Kriegel and J. Sander, "OPTICS:
Ordering Objects to Identify the Clustering Structure", Proc. of
International Conference on Management of Data, ACM SIGMOD,
pp. 49-60, New York, USA, 1999, ACM Press.
[8] A. Hinneburg and D. Keim, "An efficient approach to clustering
large multimedia databases with noise", In Proceedings of the 4th ACM
SIGKDD, pp. 58-65, New York, NY, 1998.
[9] X. Xu, M. Ester, H. P. Kriegel and J. Sander, "A
distribution-based clustering algorithm for mining in large spatial
databases", In Proceedings of the 14th ICDE, pp. 324-331, Orlando,
FL, 1998.
[10] P. Liu, D. Zhou and N. Wu, "VDBSCAN: Varied Density Based
Spatial Clustering of Applications with Noise", Proc. of IEEE
Conference, ICSSSM, pp. 528-531, Shanghai, China, 2007.
[11] D. Birant and A. Kut, "ST-DBSCAN: An algorithm for clustering
spatial-temporal data", Data and Knowledge Engineering, pp. 208-221,
2007.
[12] C. Cassisi, A. Ferro, R. Giugno, G. Pigola and A. Pulvirenti,
"Enhancing density-based clustering: Parameter reduction and outlier
detection", Information Systems, 38, pp. 317-330, 2013, Elsevier.
[13] A. M. Fahim, G. Saake, A. M. Salem, F. A. Torkey and M. A.
Ramadan, "DCBOR: a density clustering based on outlier removal",
International Journal of Computer Science, Vol. 4, No. 3, 2009.
[14] A. M. Fahim, G. Saake, A. M. Salem, F. A. Torkey and M. A.
Ramadan, "Enhanced density based spatial clustering of application
with noise", Proceedings of the International Conference on Data
Mining, Las Vegas, USA, pp. 517-523, 2009.
[15] N. Soni and A. Ganatra, "Categorization of several clustering
algorithms from different perspective: a review", International
Journal of Advanced Research in Computer Science and Software
Engineering, Vol. 1, No. 8, pp. 1-6, 2012.
[16] H. Darong and W. Peng, "Grid-based DBSCAN Algorithm with
Referential Parameters", Physics Procedia, pp. 1166-1170, 2012,
Elsevier.
[17] E. Biçici and D. Yuret, "Locally scaled density based
clustering", In Adaptive and Natural Computing Algorithms,
pp. 739-748, Berlin Heidelberg, 2007, Springer.
[18] D. R. Edla and P. K. Jana, "A Prototype-Based Modified DBSCAN
for Gene Clustering", Procedia Technology 6, pp. 485-492, 2012,
Elsevier.
[19] B. Borah and D. K. Bhattacharyya, "DDSC: A density
differentiated spatial clustering technique", Journal of Computers,
Vol. 3, No. 2, pp. 72-79, 2008.

_______________________________________________________________________________________________
ISSN (Print): 2319-2526, Volume 3, Issue 5, 2014