IJCST Vol. 3, Issue 4, Oct - Dec 2012
ISSN : 0976-8491 (Online) | ISSN : 2229-4333 (Print)
A Study of DBSCAN Algorithms for Spatial Data
Clustering Techniques
1Dr. Mohammed Ali Hussain, 2Dr. R. Satya Rajesh, 3Md. Abdul Ahad
1,2Dept. of ECE, KL University, Guntur, AP, India
Abstract
Spatial data mining is the application of data mining techniques
to spatial data. Data mining in general is the search for hidden
patterns that may exist in large databases. Spatial data mining is
the discovery of interesting relationships and characteristics
that may exist implicitly in spatial databases. Because of the huge
amounts (usually terabytes) of spatial data that may be obtained
from satellite images, medical equipment, video cameras, etc.,
it is costly and often unrealistic for users to examine spatial data
in detail. Spatial data mining aims to automate such a knowledge
discovery process. Thus, new and efficient methods are needed to
discover knowledge from large databases. For this purpose, clustering
is one of the most valuable methods in spatial data mining. The
main advantage of using clustering is that interesting structures
or clusters can be found directly from the data without using any
prior knowledge. This paper presents an overview of density-based
methods for spatial data clustering.
Keywords
Clustering, DBSCAN, Density-based method, Spatial data.
I. Introduction
Huge amounts of data have been collected through advances
in data collection and database technologies. This explosive
growth of data creates the need for automated
knowledge/information discovery from data, which has led to
a promising and emerging field called data mining or
knowledge discovery in databases (KDD) [1]. Spatial data mining is
the discovery of interesting relationships and characteristics that
may exist implicitly in spatial databases [2]. KDD follows several
stages, including data selection, data preprocessing, information
extraction or spatial data mining, interpretation and reporting [3].
Data mining is a core component of the KDD process.
Spatial data mining is a demanding field since huge amounts of
spatial data have been collected in various applications such as
real-estate marketing, traffic accident analysis, environmental
assessment, disaster management and crime analysis. Thus, new
and efficient methods are needed to discover knowledge from
large databases such as crime databases. Because prior knowledge
about the data is usually lacking, clustering is one of the most
valuable methods in spatial data mining.
The main advantage of using clustering is that interesting
structures or clusters can be found directly from the data without
using any prior knowledge. Clustering algorithms can be roughly
classified into hierarchical methods and non-hierarchical methods.
Non-hierarchical methods can be further divided into four
categories [4]:
1. Partitioning methods
2. Density-based methods
3. Grid-based methods
4. Model-based methods
In this paper, we present an overview of the density-based
algorithms known as DBSCAN algorithms. DBSCAN is a well-known
spatial clustering technique that has been shown to find
clusters of arbitrary shapes by defining a cluster to be a maximal
set of density-connected points. DBSCAN can find the genuine
clusters of different shapes as long as the density of the clusters
can be estimated correctly in advance and the density of the
clusters is uniform.
II. DBSCAN Algorithm
DBSCAN (Density Based Spatial Clustering of Applications with
Noise) is based on a center-based approach to density. In the
center-based approach, density is estimated for a particular point
in the dataset by counting the number of points within a specified
radius, Eps, of that point. This count includes the point itself.
The center-based approach to density allows us to classify a point
as a core point, a border point, or a noise (background) point.
A point is a core point if the number of points within Eps of it,
where Eps is a user-specified parameter, is at least a certain
threshold, MinPts, which is also a user-specified parameter.
A border point is not a core point but lies within Eps of a core
point, and a noise point is neither a core point nor a border point.
The DBSCAN algorithm is given as follows:
• Step 1: Label all points as core, border, or noise points.
• Step 2: Eliminate noise points.
• Step 3: Put an edge between all core points that are within
Eps of each other.
• Step 4: Make each group of connected core points into a
separate cluster.
• Step 5: Assign each border point to one of the clusters with
its associated core points.
Any two core points that are within a distance Eps of one another
are put in the same cluster. Likewise, any border point that is
within Eps of a core point is put in the same cluster as that core
point. Noise points are discarded.
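The five steps above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes Euclidean distance and a brute-force neighbor search, the function name `dbscan` and the label -1 for noise are choices made here, and a point counts as core when its Eps-neighborhood (itself included) holds at least MinPts points.

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or -1 for noise."""
    n = len(points)
    # Eps-neighborhood of every point; the point itself is included.
    neighbors = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
                 for i in range(n)]
    # Step 1: a core point has at least min_pts points in its neighborhood.
    is_core = [len(nb) >= min_pts for nb in neighbors]

    # Step 2: every point starts as noise (-1) and stays so unless reached.
    labels = [-1] * n
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        # Steps 3-4: follow the "edges" between core points within eps
        # of each other; each connected group becomes one cluster.
        labels[i] = cluster
        stack = [i]
        while stack:
            p = stack.pop()          # p is always a core point
            for q in neighbors[p]:
                if labels[q] == -1:
                    labels[q] = cluster   # Step 5 when q is a border point
                    if is_core[q]:
                        stack.append(q)
        cluster += 1
    return labels
```

With two small squares of points and one isolated point, `dbscan(points, eps=1.5, min_pts=3)` would place each square in its own cluster and leave the isolated point labeled -1. A border point that is within Eps of core points from two clusters is given to whichever cluster reaches it first.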
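The basic approach to determining the parameters Eps and
MinPts is to look at the behavior of the distance from a point to
its kth nearest neighbor, which is called the k-dist. The k-dists
are computed for all the data points for some k. When the k-dists
are sorted in increasing order and plotted, a sharp change in the
curve suggests a suitable value of Eps, with k taken as MinPts.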
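A short sketch of the k-dist computation follows. It assumes Euclidean distance, a brute-force search, and that a point is not counted as its own neighbor; the function names are illustrative only.

```python
from math import dist

def k_dist(points, k):
    """Distance from each point to its k-th nearest neighbor (self excluded)."""
    result = []
    for i, p in enumerate(points):
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(d[k - 1])
    return result

def sorted_k_dist(points, k):
    """The k-dist values in increasing order, ready to be plotted."""
    return sorted(k_dist(points, k))
```

Plotting `sorted_k_dist(points, k)` against the point rank gives the k-dist plot; points that lie inside clusters produce small, similar values, while isolated points appear as a steep rise at the right end of the curve.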
III. VDBSCAN Algorithm
Because DBSCAN uses a density-based definition of a cluster, it is
relatively resistant to noise and can handle clusters of different
shapes and sizes. Thus, DBSCAN can find many clusters that could
not be found using some other clustering algorithms, such as
K-means. However, the main weakness of DBSCAN is that it has
trouble when the clusters have greatly varied densities. To
overcome this limitation of DBSCAN, VDBSCAN (Varied Density Based
Spatial Clustering of Applications with Noise) [5] was introduced.
The VDBSCAN algorithm is given as follows:
1. Step 1: Calculate and store the k-dist for each object and
partition the k-dist plot.
2. Step 2: Determine the number of densities intuitively from the
k-dist plot.
3. Step 3: Choose a parameter Eps_i automatically for each
density.
4. Step 4: Scan the dataset and cluster the different densities
using the corresponding Eps_i.
5. Step 5: Finally, display the valid clusters corresponding to
the varied densities.
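The steps above can be illustrated with the following sketch. It is not the procedure of [5]: the rule used to split the sorted k-dist curve (a new density level wherever the curve grows by more than a factor `jump`) is a hypothetical heuristic introduced here, and all function names are illustrative. The sketch assumes Euclidean distance and runs a plain DBSCAN pass for each Eps_i in increasing order, clustering only the points that earlier passes left unlabeled.

```python
from math import dist

def k_dist(points, k):
    """Distance from each point to its k-th nearest neighbor (self excluded)."""
    out = []
    for i, p in enumerate(points):
        d = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        out.append(d[k - 1])
    return out

def eps_levels(points, k, jump=2.0):
    """Hypothetical heuristic: close a density level wherever the sorted
    k-dist curve grows by more than the factor `jump`."""
    kd = sorted(k_dist(points, k))
    levels = [prev for prev, cur in zip(kd, kd[1:]) if cur > jump * prev]
    levels.append(kd[-1])  # the last level may consist of noise only
    return levels

def dbscan(points, eps, min_pts):
    """Plain DBSCAN: cluster ids 0, 1, ... and -1 for noise."""
    n = len(points)
    nbrs = [[j for j in range(n) if dist(points[i], points[j]) <= eps]
            for i in range(n)]
    core = [len(nb) >= min_pts for nb in nbrs]
    labels, cluster = [-1] * n, 0
    for i in range(n):
        if core[i] and labels[i] == -1:
            labels[i], stack = cluster, [i]
            while stack:
                for q in nbrs[stack.pop()]:
                    if labels[q] == -1:
                        labels[q] = cluster
                        if core[q]:
                            stack.append(q)
            cluster += 1
    return labels

def vdbscan(points, levels, min_pts):
    """Run DBSCAN once per Eps_i, densest level first, on unlabeled points."""
    labels, next_id = [-1] * len(points), 0
    for eps in sorted(levels):
        remaining = [i for i, lab in enumerate(labels) if lab == -1]
        sub = dbscan([points[i] for i in remaining], eps, min_pts)
        for i, lab in zip(remaining, sub):
            if lab != -1:
                labels[i] = next_id + lab
        if sub:
            next_id += max(sub) + 1
    return labels
```

Because the k-dist here excludes the point itself while the Eps-neighborhood includes it, a natural pairing is `min_pts = k + 1`. The largest level returned by `eps_levels` may describe only isolated points, so a caller would typically inspect the levels and drop it before calling `vdbscan`.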
International Journal of Computer Science And Technology 191
IV. IDBSCAN Algorithm
The IDBSCAN (Improved Sampling-based DBSCAN) algorithm
[6] was designed to improve the sampling on DBSCAN.
Let Eps denote the radius of the region searched for an object's
neighbors, and let MinPts represent the minimum number of objects
in that region. In fig. 1 (DBSCAN), if the neighborhood of point C
contains at least MinPts points, then C is called a core point.
Points in the search region are clustered repeatedly until fewer
than MinPts remain. For instance, point B is reached from point P,
but its own neighborhood contains fewer than MinPts points.
Consequently, point B must be treated as a border point. Point O
has fewer than MinPts neighboring points, so it cannot start
a new cluster.
Fig. 1: Density Reachable
In IDBSCAN, representative points on the circle form the Marked
Boundary Objects (MBO). The MBO identify eight distinct
positions in the neighborhood of an object. The points that are
closest to these eight positions are selected as seeds. Therefore,
at most eight objects are labeled as seeds at once. IDBSCAN thus
requires fewer seeds than DBSCAN, as indicated in fig. 2.
Fig. 2: Circle with Eight Distinct Points
V. FDBSCAN Algorithm
The FDBSCAN (Fast DBSCAN) algorithm [7] extends DBSCAN
by adding non-linear searching, and has the same parameters as
DBSCAN. DBSCAN scans many objects repeatedly.
FDBSCAN solves this problem by applying an external
scan to reduce the number of searches. In fig. 3, point O belongs
to two different clusters, P and Q. Additionally, O is defined as
a core point. Hence, clusters P and Q are the same cluster.
FDBSCAN defines clusters by density and reduces the point search
action and time cost depending on the point at which clusters are
merged.
Fig. 3: Clustering of Objects with Intersections
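The merge rule described for fig. 3 (two clusters that share a core point are the same cluster) can be sketched with a union-find structure. This is only one possible way to express the rule, not the implementation of [7]; the function name and the representation of provisional clusters as sets of point ids are assumptions made for illustration.

```python
def merge_clusters_sharing_core(clusters, core_points):
    """Merge provisional clusters that have a core point in common.

    clusters    -- list of sets of point ids
    core_points -- set of point ids known to be core points
    """
    parent = list(range(len(clusters)))

    def find(x):
        # Follow parent links to the representative, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    owner = {}  # core point id -> index of the first cluster containing it
    for idx, members in enumerate(clusters):
        for p in members:
            if p not in core_points:
                continue  # a shared border point does not merge clusters
            if p in owner:
                parent[find(idx)] = find(owner[p])
            else:
                owner[p] = idx

    merged = {}
    for idx, members in enumerate(clusters):
        merged.setdefault(find(idx), set()).update(members)
    return list(merged.values())
```

For example, with provisional clusters `{1, 2, 3}`, `{3, 4, 5}` and `{6, 7}`, and point 3 marked as a core point, the first two clusters are merged into one; if point 3 were only a border point, all three clusters would stay separate.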
VI. DBSCAN-GM Algorithm
DBSCAN-GM (DBSCAN and Gaussian-Means) combines the
Gaussian-Means and DBSCAN algorithms. The idea of DBSCAN-GM
is to cover the limitations of DBSCAN by exploiting the
benefits of Gaussian-Means. DBSCAN can automatically find
noisy objects, which most other methods cannot.
Indeed, density-based clustering is capable of finding clusters
of arbitrary shape and handles noise well. On the other hand, it
faces difficulty in setting the density threshold because it
requires two input parameters (Eps and MinPts) that are difficult
to guess.
K-means [10] is a well-known technique for clustering data.
It is a distance-based clustering algorithm that
partitions the data into a predetermined number of clusters. It
starts the iterative process by randomly selecting K points as
initial cluster centers (the initial centroids of the K data groups).
A main problem of K-means is that the user must supply the number
of clusters K in order to construct the partitions, and the best
value of K is not obvious. Gaussian-Means [8] is an
improved algorithm that learns K and the initial center locations
while clustering. It is based on a statistical test of the hypothesis
that a subset of the data follows a Gaussian distribution. It consists
of applying the expectation maximization (EM) [9] algorithm
together with K-means to estimate the number of clusters that
best fits the data and the initial centroid of each cluster.
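The K-means iteration described above can be sketched as follows. This is a minimal illustration assuming Euclidean distance; the function name, the optional `init` argument (added here so that a run can be made reproducible) and the `seed` parameter are not part of the original description. It does not cover Gaussian-Means, which additionally decides how many centers to use.

```python
import random
from math import dist

def k_means(points, k, init=None, max_iter=100, seed=0):
    """Return (centers, labels) after alternating assignment and update steps."""
    rng = random.Random(seed)
    # Randomly pick k data points as initial centers unless `init` is given.
    centers = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest center.
        new_labels = [min(range(k), key=lambda c: dist(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments are stable, so the centers are too
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster becomes empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return centers, labels
```

The value of `k` has to be supplied by the caller, which is exactly the limitation the text attributes to K-means.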
The DBSCAN-GM algorithm is given as follows:
1. Step 1: Run Gaussian-Means and find the centers of all the data.
2. Step 2: Estimate the parameter Eps of DBSCAN-GM.
3. Step 3: Estimate the parameter MinPts of DBSCAN-GM.
4. Step 4: After computing the global parameters Eps and
MinPts, run the density-based algorithm DBSCAN with these
two determined inputs: DBSCAN(Eps, MinPts).
VII. GMDBSCAN Algorithm
DBSCAN is one of the most popular algorithms for cluster analysis.
It can discover clusters of arbitrary shape and separate out noise.
However, the algorithm cannot choose its parameters according to
the distribution of the dataset. It simply uses a global MinPts
parameter, so the clustering result for a multi-density database
is inaccurate. In addition, when it is used to cluster large
databases, it costs too much time. To solve these problems,
GMDBSCAN (Multi-Density DBSCAN Cluster Based on Grid) is used.
The GMDBSCAN algorithm [11] is based on a spatial index and grid
technique.
The GMDBSCAN algorithm is given as follows:
1. Step 1: Take the data set as input.
2. Step 2: Divide the data space into grids.
3. Step 3: Compute the density of each grid and construct the
SP-Tree. The SP-Tree [12] is a spatial index structure based on
space partitioning, which can hold the position information of
the grids.
4. Step 4: Record the neighborhood relationship between two points
in a bitmap, computing it only for the points in the grids
neighboring the point's own grid. This is called bitmap formation.
5. Step 5: GMDBSCAN is mainly built on the idea of local
clustering, identifying a local MinPts for each grid. For each
grid, clustering is performed with its local MinPts to form
a number of distributed local clusters. Because the data density
within the same cluster should be similar, two sub-clusters
that share points and have similar density can be merged
into a single cluster.
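Steps 2 to 4 can be illustrated with a small sketch. It is deliberately simplified and is not the method of [11]: it assumes two-dimensional points and square cells, uses a dictionary in place of the SP-Tree, returns neighbor lists instead of building a bitmap, and does not attempt the local-MinPts selection of Step 5. All names are illustrative.

```python
from collections import defaultdict
from math import dist, floor

def build_grid(points, cell_size):
    """Step 2: map each grid cell (ix, iy) to the ids of the points inside it."""
    grid = defaultdict(list)
    for idx, (x, y) in enumerate(points):
        grid[(floor(x / cell_size), floor(y / cell_size))].append(idx)
    return grid

def grid_density(grid):
    """Step 3 (density part): number of points in every non-empty cell."""
    return {cell: len(members) for cell, members in grid.items()}

def eps_neighbors(points, grid, cell_size, i, eps):
    """Step 4 (simplified): find the Eps-neighbors of point i by checking
    only its own cell and the eight adjacent cells.
    Assumes eps <= cell_size, otherwise neighbors could be missed."""
    cx = floor(points[i][0] / cell_size)
    cy = floor(points[i][1] / cell_size)
    found = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            for j in grid.get((cx + dx, cy + dy), ()):
                if dist(points[i], points[j]) <= eps:
                    found.append(j)
    return sorted(found)
```

Restricting the distance computations to adjacent cells is what saves time on large databases: a point in a distant cell is never compared with point `i` at all.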
VIII. Conclusion
DBSCAN is a well-known spatial clustering technique that
has been shown to find clusters of arbitrary shapes by defining
a cluster to be a maximal set of density-connected points.
DBSCAN has several limitations, and an alternative method has
been developed for each of them. For example, DBSCAN has
trouble when the clusters have greatly varied densities;
VDBSCAN was introduced to avoid this. The IDBSCAN (Improved
Sampling-based DBSCAN) algorithm was designed to improve
the sampling on DBSCAN. FDBSCAN extends DBSCAN
by adding non-linear searching and has the same parameters as
DBSCAN. DBSCAN scans many objects repeatedly;
FDBSCAN solves this problem by applying an external
scan to reduce the number of searches. DBSCAN-GM (DBSCAN and
Gaussian-Means) combines the Gaussian-Means and DBSCAN
algorithms to cover the limitations of DBSCAN by exploiting
the benefits of Gaussian-Means. DBSCAN can discover clusters
of arbitrary shape and separate out noise, but it cannot choose
its parameters according to the distribution of the dataset.
It simply uses a global MinPts parameter, so the clustering
result for a multi-density database is inaccurate. In
addition, when it is used to cluster large databases, it costs too
much time. GMDBSCAN is used to solve these problems.
References
[1] U.M. Fayyad, P. Smyth,"Advances in Knowledge Discovery
and Data Mining", AAAI/MIT Press, Menlo Park, CA,
1996.
[2] R. T. Ng, J. Han,"Efficient and effective clustering
methods for spatial data mining", VLDB Conference,
Santiago, Chile, 1994.
[3] F. Karimipour, M.R. Delavar, M. Kianie,"Water quality
management using GIS data mining", Journal of
Environmental Informatics, Vol. 5, No. 2, pp. 61-71, 2005.
[4] X. Hu, I. Yoo.,"Cluster ensemble and its application in gene
expression analysis", Proceedings, The Second Conference
on Asia-Pacific Bioinformatics, Vol. 29, Dunedin, New
Zealand, pp. 297-302, 2004.
[5] Peng Liu, Dong Zhou, Naijun Wu,“Varied Density Based
Spatial Clustering of Applications with Noise”, 2007,
IEEE.
[6] Borah, B., Bhattacharyya, D. K.,“An Improved
Sampling-Based DBSCAN for Large Spatial Databases”, In:
Proceedings of International Conference on Intelligent
Sensing and Information, pp. 92-96, 2004.
[7] Liu, Bing.,“A Fast Density-Based Clustering Algorithm
for Large Databases”, In: Proceedings of International
Conference on Machine Learning and Cybernetics, 2006,
pp. 996-1000.
[8] G. Hamerly, C. Elkan,“Learning the k in k-means”, Vol. 17.
MIT Press, 2003.
[9] T. K. Moon,“The expectation-maximization algorithm”,
Signal Processing Magazine, IEEE, Vol. 13, No. 6, pp. 47–60,
Nov. 1996.
[10] J. B. MacQueen,“Some methods for classification and
analysis of multivariate observations”, in Proceeding of
the fifth Berkeley Symposium on Mathematical Statistics
and Probability, L. M. L. Cam and J. Neyman, Eds., Vol. 1.
University of California Press, 1967, pp. 281–297.
[11] Zeng Donghai,"The Study of Clustering Algorithm Based on
Grid-Density and Spatial Partition Tree", XiaMen University,
PRC, 2006.
[12] A. H. Pilevar, M. Sukumar,"GCHL: a grid-clustering
algorithm for high-dimensional Data Mining", SBIA 2004,
LNAI 3171.
Dr. Md. Ali Hussain, M.Tech., Ph.D., is
working as an Assoc. Professor in the Dept. of
Electronics and Computer Engineering,
KL University, Guntur District, A.P.,
India. His research interests include
Computer Networks, Wireless Networks,
Mobile Networks and Web Commerce.
He has published a large number of papers
in National & International Conferences
and International Journals. He has also served
as a Program Committee (PC) Member for many International
Conferences. He is at present Chief Technical Advisory Board
Member, Chief Editor, Editor and Technical Reviewer of reputed
International Journals. He is a member of IACSIT, IRACST,
UACEE, ISTE, IAENG, AIRCC, AICIT, and IARCS.
Dr. K. Satya Rajesh, M.Tech., Ph.D., is
working as an Assoc. Professor in the Department
of Electronics and Computer Engineering
at K L University, Guntur District, Andhra
Pradesh. His research interests include
Wireless and Sensor Networks. He has
published a large number of papers in
National and International Journals. He has
presented a large number of papers at
National and International conferences.
He is a Technical Review member of
IJRRAN, IJCSIT, etc. He is a member of IACSIT, SSRGJ,
etc.
Md. Abdul Ahad, M.Tech., (Ph.D), is
working as an Asst. Professor in the Department
of Electronics and Computer Engineering
at K L University, Guntur District, Andhra
Pradesh. His research includes Wireless
Networks. He has published many
papers in National and International
Journals. He has presented many
papers at National and International
conferences.