Download 2. The DBSCAN algorithm - Linköpings universitet

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Expectation–maximization algorithm wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

K-nearest neighbors algorithm wikipedia , lookup

Nearest-neighbor chain algorithm wikipedia , lookup

Cluster analysis wikipedia , lookup

K-means clustering wikipedia , lookup

Transcript
Data Mining
TNM033
Linköpings Universitet - ITN
2011-11-30
DBSCAN
A Density-Based Spatial Clustering of Application with
Noise
Henrik Bäcklund (henba892), Anders Hedblom (andh893), Niklas Neijman (nikne866)
1
Data Mining
TNM033
Linköpings Universitet - ITN
2011-11-30
1. Introduction
Today data is received automatically from many different kinds of equipments. Satellites, x-rays and
traffic cameras are just a few of them. To make this information/data understandable for us, it has to be
processed.
When working with large data sets it is in most scenarios useful to be able to separate information by
dividing the data into smaller categories, and eventually, to do class identification. Not least is this
important when treating large spatial databases. A satellite, for example, gathers images as it travels
around our earth. It is desired to classify what parts of the images are houses, cars, roads, lakes, forests,
etc. Since the image database is big, a good classification algorithm is needed. Classification can, for
instance, be done with the help of clustering algorithms, which clumps similar data together into different
clusters. However, using clustering algorithms involves some problems: It can often be difficult to know
which input parameters that should be used for a specific database, if the user does not have enough
knowledge of the domain. Furthermore, spatial data sets can contain huge amounts of data, and trying to
find cluster patterns in several dimensions is very computationally costly. Short computing time is always
favorable. Last, the shapes of the clusters can be arbitrary and in bad cases very complex. Finding these
shapes can be very cumbersome.
There are some well-used clustering algorithms out there; one of them is the famous CLARANS.
Other methods are K-means, K-medoid, Hierarchical Clustering and Self-Organized Maps. Nevertheless,
none of these algorithms can handle all these three mentioned problems in a good way. This report will
not discuss these methods but focus on the DBSCAN [1] (Density Based Spatial Clustering of
Applications with Noise) algorithm, which introduces solutions to these problems.
The following structure will be used in this paper. Section 2 will discuss how the DBSCAN algorithm
works in. Section 3 will present various possible applications for the DBSCAN and some comparisons
with CLARANS, in terms of efficiency. Finally, section 4 contains the conclusion of this paper and sums
up the positive and negative aspects of the DBSCAN algorithm.
2
Data Mining
TNM033
Linköpings Universitet - ITN
2011-11-30
2. The DBSCAN algorithm
The DBSCAN algorithm can identify clusters in large spatial data sets by looking at the local density of
database elements, using only one input parameter. Furthermore, the user gets a suggestion on which
parameter value that would be suitable. Therefore, minimal knowledge of the domain is required. The
DBSCAN can also determine what information should be classified as noise or outliers. In spite of this,
its working process is quick and scales very well with the size of the database – almost linearly.
By using the density distribution of nodes in the database, DBSCAN can categorize these nodes into
separate clusters that define the different classes. DBSCAN can find clusters of arbitrary shape, as can be
seen in figure 1 [1]. However, clusters that lie close to each other tend to belong to the same class.
Figure 1. The node distribution of three different databases, taken from SEQUOIA 2000 benchmark database.
The following section will describe further how the DBSCAN algorithm works. Its computing process is
based on six rules or definitions, creating two lemmas.
Definition 1: (The Eps-neighborhood of a point)
NEps(p) = {q ϵ D|dist(p,q)<Eps}
For a point to belong to a cluster it needs to have at least one other point that lies closer to it than the
distance Eps.
Definition 2: (Directly density-reachable)
There are two kinds of points belonging to a cluster; there are border points and core points, as can be
seen in figure 2 [1].
Figure 2. Border- and core points
“The Eps-neighborhood of a border point tends to have significantly less points than the Epsneighborhood of a core point.”[1]. The border points will still be a part of the cluster and in order to
include these points, they must belong to the Eps-neighborhood of a core point q as seen in figure 3 [1].
3
Data Mining
TNM033
Linköpings Universitet - ITN
2011-11-30
1) p ϵ NEps(q)
In order for point q to be a core point it needs to have a minimum number of points within its Epsneighborhood.
2) | NEps(q)| ≥ MinPts (core point condition)
Figure 3. Point p is directly density-reachable from point q but not vice versa.
Definition 3: (Density-reachable)
“A point p is density-reachable from a point q with respect to Eps and MinPts if there is a chain of points
p1…,pn, p1=q, pn=p such that pi+1 is directly density-reachable from pi.” [1] Figure 4 [1] shows an
illustration of a density-reachable point.
Figure 4. Point p is density-reachable from point q and not vice versa.
Definition 4: (Density-connected)
There are cases when two border points will belong to the same cluster but where the two border points
don’t share a specific core point. In these situations the points will not be density-reachable from each
other. There must however be a core point q from which they are both density-reachable. Figure 5 [1]
shows how density connectivity works.
“A point p is density-connected to a point q with respect to Eps and MinPts if there is a
point o such that both, p and q are density-reachable from o with respect to Eps and
MinPts.”
Figure 5. Density connectivity.
4
Data Mining
TNM033
Linköpings Universitet - ITN
2011-11-30
Definition 5: (cluster)
If point p is a part of a cluster C and point q is density-reachable from point p with respect to a given
distance and a minimum number of points within that distance, then q is also a part of cluster C.
1) "∀ p, q: if p ϵ C and q is density-reachable from p with respect to Eps and MinPts, then
q ϵ C.
Two points belongs to the same cluster C, is the same as saying that p is density-connected to q with
respect to the given distance and the number of points within that given distance.
2) "∀ p, q ϵ C: p is density-connected to q with respect to Eps and MinPts.
Definition 6: (noise)
Noise is the set of points, in the database, that don’t belong to any of the clusters.
Lemma 1:
A cluster can be formed from any of its core points and will always have the same shape.
Lemma 2:
Let p be a core point in cluster C with a given minimum distance (Eps) and a minimum number of points
within that distance (MinPts). If the set O is density-reachable from p with respect to the same Eps and
MinPts, then C is equal to the set O.
“To find a cluster, DBSCAN starts with an arbitrary point p and retrieves all points density-reachable
from p with respect to Eps and MinPts. If p is a core point, this procedure yields a cluster with respect to
Eps and MinPts (see Lemma 2). If p is a border point then no points are density-reachable from p and
DBSCAN visits the next point of the database.” [1]
3. Applications
An example of software program that has the DBSCAN algorithm implemented is WEKA. The following
of this section gives some examples of practical application of the DBSCAN algorithm.
Satellites images
A lot of data is received from satellites all around the world and this data have to be translated into
comprehensible information, for instance, classifying areas of the satellite-taken images according to
forest, water and mountains. Before the DBSCAN algorithm can classify these three elements in the
database, some work have to be done with image processing. Once the image processing is done, the data
appears as spatial data where the DBSCAN can classify the clusters as desired.
X-ray crystallography
X-ray crystallography is another practical application that locates all atoms within a crystal, which results
in a large amount of data. The DBSCAN algorithm can be used to find and classify the atoms in the data.
5
Data Mining
TNM033
Linköpings Universitet - ITN
2011-11-30
Anomaly Detection in Temperature Data
This kind of application focuses on pattern anomalies in data, which is important in several cases, e.g.
credit fraud, health condition etc. This application measures anomalies in temperatures [3], which is
relevant due to the environmental changes (global warming). It can also discover equipment errors and so
forth. These unusual patterns need to be detected and examined to get control over the situation. The
DBSCAN algorithm has the capability to discover such patterns in the data.
4. Comparisons (DBSCAN vs. CLARANS)
Through the original report [1], the DBSCAN algorithm is compared to another clustering algorithm. This
one is called CLARANS (Clustering Large Applications based on RANdomized Search). It is an
improvement of the k-medoid algorithms (one object of the cluster located near the center of the cluster,
instead of the gravity point of the cluster, i.e. k-means). The good properties compared to k-medoid are
that CLARANS works efficient for databases with about a thousand objects. When the database grows
larger, CLARANS will fall behind because the algorithm temporarily stores all the objects in the main
memory, i.e. the run time will increase.
The following part shows that the difference in performance between DBSCAN and CLARANS is vast.
The used databases are the ones shown in figure 1 and the result from using CLARANS is shown in
figure 6 [1].
Figure 6. The classification of the CLARANS algorithm.
As seen in the figure 6, CLARANS did not manage to classify all the clusters correctly for the three
different databases, in contrast to DBSCAN that does, as shown in figure 7 [1].
Figure 7. The classification of the DBSCAN algorithm.
6
Data Mining
TNM033
Linköpings Universitet - ITN
2011-11-30
So far the DBSCAN algorithm have done a good job classifying all the clusters. Nevertheless, it is also
superior when comparing the run time between the two algorithms. Graph 1 shows the comparisons, and
the resulting differences in time are large. Data taken from paper [1].
Graph 1. Run time in seconds.
To summarize the table, the DBSCAN has an almost linear increase in computing time, relative to the
number of points in the database. The CLARANS gain, however, is exponential and gets outperformed
with a factor of 250 to 1900 in this test. The factor will continue to get bigger as the size of the database
increases.
5. Conclusion
Classification on large spatial databases is a difficult and computational heavy task. Using clustering
algorithms is one way of doing it, and we have seen that there are several methods out there that get the
job done. However, we have also seen the problems included when using these methods; finding the right
input parameters, localizing clusters of arbitrary shapes and, last but not least, doing the whole process in
a reasonable time. The DBSCAN algorithm has a solution to all these problems as it can find those
complicated cluster shapes in a very quickly, with only one assigned input parameter. A value for this
parameter is also suggested to the user.
We have discussed the efficiency, in terms of accuracy, of the DBSCAN algorithm compared to the
famous CLARANS algorithm. The results were extremely one-sided. In some cases with huge databases,
the DBSCAN algorithm was over 1900 times faster and still ended up with the better final result. Its only
real weakness is that it has some difficulties in distinguishing separated clusters if they are located too
close to each other, even though they have different densities.
7
Data Mining
TNM033
Linköpings Universitet - ITN
2011-11-30
Bibliography
[1] - A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise
Martin Ester, Hans-Peter Kriegel, Jörg Sander, Xiaowei Xu
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.71.1980
[2] - Advances in knowledge discovery and data mining:
9th Pacific-Asia conference, PAKDD 2005, Hanoi, Vietnam, May 18-20, 2005; proceedings
Tu Bao Ho, David Cheung, Huan Liu
[3] - Anomaly Detection in Temperature Data Using DBSCAN Algorithm:
Erciyes Univeristy, http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5946052&tag=1
Mete Celic, Filiz Dadaser-Celic, Ahmet Sakir DOKUZ
8