© October 2016 | IJIRT | Volume 3 Issue 5 | ISSN: 2349-6002
Running Resilient Distributed Datasets Using DBSCAN
on Apache Spark
Amaresh A M
Assistant Professor, IS&E Department, GSSSIETW, Mysore
Abstract—Density-based data clustering is used because of its ability to find arbitrarily shaped clusters in noisy data. However, DBSCAN is hard to scale, which limits its utility when working with large data sets. Resilient Distributed Datasets (RDDs), on the other hand, are a fast data-processing abstraction created explicitly for in-memory computation on large data sets.
This paper presents a new algorithm based on
DBSCAN using the Resilient Distributed Datasets
approach: RDD-DBSCAN. RDD-DBSCAN overcomes
the scalability limitations of the traditional DBSCAN
algorithm by operating in a fully distributed fashion.
This paper presents a parallel DBSCAN algorithm on top of Apache Spark; the experiments conducted in the paper show that the proposed method works well with maritime data.
I. INTRODUCTION
Trajectory pattern mining has become a hot topic of data mining in recent years because of the increasing ubiquity of location-reporting devices (GPS, AIS, etc.).
The information collected by these systems can be
used in many different ways, from predicting
possible failures to recommending the best solutions
to different problems based
on previous experiences. However, when the amount
of data gets very large, the difficulty of deriving
useful conclusions from the data increases. A popular approach to overcoming this difficulty is machine learning, specifically clustering algorithms. Clustering algorithms reduce the complexity of the data by grouping similar data into groups, or clusters, which
can then be more readily analyzed.
Among clustering algorithms, Density-based Spatial
Clustering of Applications with Noise (DBSCAN) is
one of the most widely used. Its popularity stems
from the benefits that DBSCAN provides over other
clustering algorithms: the ability to discover
arbitrarily shaped clusters, robustness in the presence
of noise and simplicity in its input parameters.
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering algorithm. Given a set of points in some space, it groups together points that are closely packed together (points with many nearby neighbors), marking as outliers points that lie alone in low-density regions (whose nearest neighbors are too far away). DBSCAN is one of the most common clustering algorithms and also one of the most cited in the scientific literature.
Numerous applications require the management of
spatial data, i.e. data related to space. Spatial
Database Systems (SDBS) (Gueting 1994) are
database systems for the management of spatial data.
Increasingly large amounts of data are obtained from satellite images, X-ray crystallography or other automatic equipment. Therefore, automated knowledge discovery becomes more and more important in spatial databases.
Several tasks of knowledge discovery in databases
(KDD) have been defined in the literature (Matheus, Chan & Piatetsky-Shapiro 1993). The task considered
in this paper is class identification, i.e. the grouping
of the objects of a database into meaningful
subclasses. In an earth observation database,e.g., we
might want to discover classes of houses along some
river.
Clustering algorithms are attractive for the task of
class identification. However, their application to large spatial databases raises the following requirements for clustering algorithms:
(1) Minimal requirements of domain knowledge to determine the input parameters, because appropriate values are often not known in advance when dealing with large databases.
(2) Discovery of clusters with arbitrary shape, because
the shape of clusters in spatial databases may be
spherical, drawn-out, linear, elongated etc.
(3) Good efficiency on large databases, i.e. on databases of significantly more than just a few thousand objects.
The well-known clustering algorithms offer no solution to the combination of these requirements. In this paper, we present the new clustering algorithm DBSCAN, which relies on a density-based notion of clusters. It requires only one input parameter and supports the user in determining an appropriate value for it. It discovers clusters of arbitrary shape. Finally, DBSCAN is efficient even for large spatial databases. In 2014, the algorithm was awarded the Test of Time Award (an award given to algorithms which have received substantial attention in theory and practice) at the leading data mining conference, KDD [1].
The rest of the paper is organized as follows. We discuss clustering algorithms, evaluating them according to the above requirements. In Section 3, we present our notion of clusters, which is based on the concept of density in the database, and introduce the algorithm DBSCAN, which discovers such clusters in a spatial database. We performed an experimental evaluation of the effectiveness and efficiency of DBSCAN using synthetic data and data from the SEQUOIA 2000 benchmark. Finally, we conclude with a summary and some directions for future research.
Spark is a comparatively new technology that has been rapidly adopted by enterprises across a wide range of industries, and the Spark community has become one of the largest open source communities in the big data area. Working on this topic is therefore challenging but definitely useful for future study and work.
In this paper, the work is as follows:
• Describing a modified DBSCAN algorithm that takes full advantage of Apache Spark's parallel capabilities and obtains the same results as the original algorithm.
• Implementing the modified algorithm in the Scala programming language and running it on top of Apache Spark.
• Proving experimentally that the modified algorithm scales as the number of nodes available to Apache Spark for computation increases.
• Listing the limitations of the given approach and the places in which it can be improved.
II. RELATED WORK
B.-R. Dai and I.-C. Lin [8] and Y. He, et al. [9] have both proposed variations of DBSCAN that allow the algorithm to run on top of the Apache Hadoop framework, the most popular implementation of the MapReduce paradigm.
One of the big drawbacks of Hadoop's implementation of MapReduce is that the only communication that can happen between data-processing steps in a data-processing pipeline is through the file system. This communication restriction introduces a very high cost, in the form of latency, on top of the already costly computation. This cost is acceptable for pipelines that only do transformations on the data. However, most machine learning algorithms (including DBSCAN) perform many iterations over the same set of data, with each step also depending on the results of the previous steps. This characteristic has made machine learning algorithms poor performers on the Hadoop platform [10]. M. Zaharia, et al. noticed this drawback and provided a solution in the form of an abstraction for in-memory computing: Resilient Distributed Datasets (RDDs).
M. Zaharia et al. noticed that, while MapReduce effectively provides an abstraction on top of the computing resources of a cluster [10], iterative algorithms, in order to achieve reasonable levels of performance, also need to manage another of the cluster's resources: memory. Utilizing memory effectively allows iterative algorithms to avoid the performance penalties of reading data from the data store at each of their steps, since the data is already present in memory.
M. Zaharia, et al. proposed Resilient Distributed Datasets (RDDs) as a solution to the shortcomings of MapReduce. RDDs, as defined by their creators, are an
abstraction that "enables efficient data reuse in a broad range of applications" [10]. RDDs allow for
algorithms to do coarse operations over data (e.g.
map, filter, reduce) while also allowing for
algorithms to manage the persistence of data in
memory.
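As a brief illustration of this programming model, the following minimal Scala sketch (the input path and field layout are assumptions for illustration, not from the paper) applies coarse-grained operations to an RDD and persists an intermediate result in memory, so that later actions reuse it instead of re-reading the data store:

    import org.apache.spark.{SparkConf, SparkContext}

    object RddOpsSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("rdd-ops-sketch"))
        // Coarse-grained transformations applied in parallel to all data items.
        val points = sc.textFile("hdfs:///data/points.csv") // hypothetical input path
          .map(_.split(','))
          .filter(_.length == 2)
          .map(a => (a(0).toDouble, a(1).toDouble))
        points.persist() // keep the parsed data in worker memory for reuse
        val n = points.count() // first action computes and caches the RDD
        val sumX = points.map(_._1).reduce(_ + _) // reuses the cached data, no second read from HDFS
        println(s"n = $n, mean x = ${sumX / n}")
        sc.stop()
      }
    }

The call to persist() is what distinguishes this from a plain MapReduce pipeline: the second pass over points runs against memory rather than the file system.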
III. OVERVIEW OF PROPOSED SOLUTION
RDDs solve the drawbacks of the existing system by providing an abstraction for in-memory computing. Like MapReduce with Hadoop, RDDs have a dominant implementation that is widely used: Apache Spark.
Apache Spark is an implementation of the RDDs
abstraction that provides a programming framework
that allows the programmer to process large amounts
of data and also to take advantage of volatile memory in order to speed up computation. Thus, machine
learning algorithms perform much faster on Apache
Spark than on the Hadoop platform.
3.1 Introduction to Apache Spark
Apache Spark was developed in 2009 in UC Berkeley's AMPLab, with the aim of building a fast and general engine for large-scale data processing. From benchmark results, Spark claims it can run 100x faster than Hadoop when Spark uses a RAM cache, and 10x faster than Hadoop when running on disk [4]. Spark is developed in Scala
and can provide API support for Java, Scala and
Python, which makes it easy to build parallel apps.
A general model of the Spark runtime is demonstrated in Figure 1 [19]. To use Spark, developers write the high-level control flow of their application, and the driver program launches multiple workers for the parallel tasks. The data blocks are read from a distributed file system (e.g. HDFS) and partitioned into the RAM of the workers.
Figure 1: Apache Spark Runtime
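To make the driver/worker model concrete, the sketch below (runnable in the spark-shell, where the SparkContext sc is predefined; the file path and partition count are illustrative assumptions) shows a driver program whose transformations are executed by workers, one task per partition block:

    // The driver holds the high-level control flow; each partition block becomes one worker task.
    val blocks = sc.textFile("hdfs:///data/tracks.txt", 8) // hypothetical path, 8 partitions
    blocks.cache() // partition blocks stay in worker RAM after first use
    val sizes = blocks.mapPartitionsWithIndex { (id, lines) =>
      Iterator((id, lines.size)) // runs in parallel on each worker's block
    }
    sizes.collect().foreach { case (id, n) => println(s"partition $id holds $n lines") }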
3.2 Clustering Algorithm
Consider a set of points in some space to be
clustered. For the purpose of DBSCAN clustering,
the points are classified as core points, (density-)
reachable points and outliers, as follows:
A point p is a core point if at least minPts points are within distance d of it; those points are said to be directly reachable from p. No points are directly reachable from a non-core point.
A point q is reachable from p if there is a path p1, ..., pn with p1 = p and pn = q, where each pi+1 is directly reachable from pi (so all the points on the path must be core points, with the possible exception of q). All points not reachable from any other point are outliers.
Now if p is a core point, then it forms a cluster
together with all points (core or non-core) that are
reachable from it. Each cluster contains at least one
core point. Non-core points can be part of a cluster, but they form its "edge", since they cannot be used to reach more points.
In Figure 2, minPts = 3. Point A and the other red points are core points, because at least three points lie within distance d of each of them. Because they are all reachable from one another, they form a single cluster. Points B and C are not core points, but they are reachable from A (via other core points) and thus belong to the cluster as well. Point N is a noise point that is neither a core point nor density-reachable.
Reachability is not a symmetric relation since, by definition, no point may be reachable from a non-core point, regardless of distance (so a non-core point may be reachable, but nothing can be reached from it). Therefore, a further notion of connectedness is needed to formally define the extent of the clusters found by DBSCAN. Two points p and q are density-connected if there is a point o such that both p and q are density-reachable from o. Density-connectedness is symmetric.
Figure 2: Clustering Diagram
A cluster then satisfies two properties:
(1) All points within the cluster are mutually density-connected.
(2) If a point is density-reachable from any point of the cluster, it is part of the cluster as well.
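These definitions map directly onto code. The following small Scala sketch (an in-memory illustration with an assumed Euclidean metric; it is not the paper's implementation) expresses the d-neighborhood and the core-point test:

    case class Point(x: Double, y: Double)

    def distance(a: Point, b: Point): Double =
      math.hypot(a.x - b.x, a.y - b.y) // assumed Euclidean metric

    // The d-neighborhood of p: all points within distance d (p itself included).
    def neighborhood(p: Point, all: Seq[Point], d: Double): Seq[Point] =
      all.filter(q => distance(p, q) <= d)

    // p is a core point if its d-neighborhood holds at least minPts points;
    // exactly those neighbors are directly reachable from p.
    def isCore(p: Point, all: Seq[Point], d: Double, minPts: Int): Boolean =
      neighborhood(p, all, d).size >= minPts

Reachability and density-connectedness then follow by chaining directly reachable steps through core points, as defined above.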
3.3 DBSCAN Algorithm
DBSCAN requires two parameters: the radius d (eps) and the minimum number of points required to form a dense region (minPts). It starts with an arbitrary starting point that has not been visited. This point's d-neighborhood is retrieved, and if it contains sufficiently many points, a cluster is started. Otherwise, the point is labeled as noise. Note that this point might later be found in the sufficiently sized d-neighborhood of a different point and hence be made part of a cluster.
If a point is found to be a dense part of a cluster, its d-neighborhood is also part of that cluster. Hence, all points that are found within the d-neighborhood are added, as is their own d-neighborhood when they are also dense. This process continues until the density-connected cluster is completely found. Then, a new unvisited point is retrieved and processed, leading to the discovery of a further cluster or noise.
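The expansion process just described can be written out compactly. The following sequential Scala sketch (reconstructed from the description above, not the paper's exact pseudo code; the neighborhood scan is the naive O(n^2) version) assigns each point a cluster id or marks it as noise:

    object DbscanSketch {
      val Noise = -1
      val Unvisited = 0

      def dbscan(points: IndexedSeq[(Double, Double)], d: Double, minPts: Int): Array[Int] = {
        val labels = Array.fill(points.length)(Unvisited)
        var cluster = 0

        // Naive d-neighborhood query; a spatial index (e.g. an R-tree) would accelerate this.
        def neighbors(i: Int): Seq[Int] = points.indices.filter { j =>
          val dx = points(i)._1 - points(j)._1
          val dy = points(i)._2 - points(j)._2
          math.sqrt(dx * dx + dy * dy) <= d
        }

        for (i <- points.indices if labels(i) == Unvisited) {
          val n = neighbors(i)
          if (n.size < minPts) labels(i) = Noise // provisional: may later join a cluster as an edge point
          else {
            cluster += 1
            labels(i) = cluster
            val queue = scala.collection.mutable.Queue(n: _*)
            while (queue.nonEmpty) { // region growing
              val j = queue.dequeue()
              if (labels(j) == Noise) labels(j) = cluster // reachable edge point, reclassified
              if (labels(j) == Unvisited) {
                labels(j) = cluster
                val nj = neighbors(j)
                if (nj.size >= minPts) queue ++= nj // j is also dense, expand through it
              }
            }
          }
        }
        labels
      }
    }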
As mentioned before, clustering algorithms are used to reduce data sets into groups, or clusters, which can be more readily analyzed and reasoned about. These clusters can then be used to make predictions about new data, or to find previously unnoticed connections among existing data. That makes clustering algorithms useful in a variety of fields, from machine learning, to data mining, to image analysis. For a given data set, there are multiple ways to find commonalities between the data in order to generate clusters. This characteristic of clustering has led to the development of several clustering algorithms. Each one of those algorithms uses different criteria to form clusters from the data. Among the most widely used algorithms for clustering we can find K-Means and DBSCAN.
DBSCAN is a density-based clustering algorithm. Density-based clustering algorithms define a cluster as an area that has a higher data density than its surrounding area. In DBSCAN, density is measured by analyzing whether a point has at least a minimum number of points (minPts) inside a given radius (d). That is, if the d-neighborhood (points within d) of a given point exceeds a particular density threshold (minPts), that given point, and its neighbors, form a cluster. The DBSCAN algorithm is described in pseudo code in Figure 4.
3.4 DBSCAN using Resilient Distributed Datasets (RDD-DBSCAN)
Despite all of its virtues, the DBSCAN algorithm does not lend itself well to the set of operations that are enabled by RDDs: coarse-grained transformations that are applied in parallel to many data items [10]. Although DBSCAN itself does not require a specific starting point or a sequential order, once a point has been selected for processing and has entered the region-growing phase (lines 10 through 26 of the algorithm shown in Figure 5), no other points can enter the same region. This exclusion effectively limits the amount of parallel processing that can be done on the data set and limits the ability of RDDs to effectively execute DBSCAN.
3.5 Architecture
Figure 3: System Architecture (Dataset → Preprocess → Partition → Clustered 1 / Clustered 2 → Cluster Reduction → Cluster Results; Reporting of Time and RAM produces the Performance Graph)
Figure 3 depicts the system architecture of this paper, which has the following steps:
Preprocess: In this module the input dataset is loaded into HDFS.
Partition: The data from the dataset is divided and loaded as partition blocks.
Clustering: Each partition is clustered and the resulting clusters can be viewed.
Cluster Reduction: All the partitions are loaded and clustered; the merging and reduction of clusters is done in this module.
Report: This module shows the performance results; we record the running time and report it to the manager to view the time comparison.
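A rough Spark skeleton of these steps might look as follows (all paths and parameter values are illustrative assumptions, the local clustering call reuses the hypothetical DbscanSketch from Section 3.3, and the cross-partition merge rule is only indicated, since the paper does not list its code):

    // Preprocess: load the input dataset from HDFS and parse it into points.
    val pts = sc.textFile("hdfs:///data/input.csv")
      .map(_.split(','))
      .map(a => (a(0).toDouble, a(1).toDouble))

    // Partition: divide the data and load it as partition blocks in worker RAM.
    val partitioned = pts.repartition(8).cache()

    // Clustering: run a local DBSCAN inside each partition.
    val local = partitioned.mapPartitionsWithIndex { (part, it) =>
      val block = it.toIndexedSeq
      val labels = DbscanSketch.dbscan(block, d = 0.01, minPts = 5)
      block.iterator.zip(labels.iterator).map { case (p, l) => (p, (part, l)) }
    }

    // Cluster Reduction: merge local clusters that share points near partition
    // borders into global clusters (merge rule omitted here), then report timing.
    val t0 = System.nanoTime()
    val results = local.collect()
    println(s"clustered ${results.length} points in ${(System.nanoTime() - t0) / 1e9} s")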
IV. IMPLEMENTATION
4.1 DBSCAN Algorithm
Figure 4: DBSCAN Algorithm
4.2 RDD-DBSCAN Algorithm
Figure 5: RDD-DBSCAN Algorithm
V. RESULTS
5.1 Evaluation Results
We show the results in terms of the time taken. The figure below shows that the time is reduced with the help of Apache Spark compared with the MapReduce scheme.
Figure 6: Performance Graph in Terms of Time Taken to Run the Dataset
VI. CONCLUSION AND FUTURE WORK
DBSCAN does not require one to specify the number of clusters in the data a priori, as opposed to k-means. DBSCAN can find arbitrarily shaped clusters. It can even find a cluster completely surrounded by (but not connected to) a different cluster. Due to the minPts parameter, the so-called single-link effect (different clusters being connected by a thin line of points) is reduced. DBSCAN has a notion of noise, and is robust to outliers. DBSCAN requires just two parameters and is mostly insensitive to the ordering of the points in the database. (However, points sitting on the edge of two different clusters might swap cluster membership if the ordering of the points is changed, and the cluster assignment is unique only up to isomorphism.) DBSCAN is designed for use with databases that can accelerate region queries, e.g. using an R*-tree. The parameters minPts and d can be set by a domain expert, if the data is well understood.
This paper presented a new algorithm based on DBSCAN using the Resilient Distributed Datasets approach: a parallel DBSCAN algorithm on top of Apache Spark.
Many sets of data that need clustering cannot be
adequately represented by just two dimensions.
Improving the partitioning scheme so that it may successfully partition n-dimensional data would allow RDD-DBSCAN to cluster more diverse data sets in the future.
REFERENCES
[1] (2014, August 18) 2014 SIGKDD Test of Time Award. [Online]. Available: http://www.kdd.org/blog/2014-sigkdd-test-time-award
[2] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise," in KDD, vol. 96, no. 34, 1996, pp. 226–231.
[3] M. Chen, X. Gao, and H. Li, "Parallel DBSCAN with priority R-tree," in Information Management and Engineering (ICIME), 2010 2nd IEEE International Conference on. IEEE, 2010, pp. 508–511.
[4] D. Arlia and M. Coppola, "Experiments in parallel clustering with DBSCAN," in Euro-Par 2001 Parallel Processing. Springer, 2001, pp. 326–331.
[5] M. M. A. Patwary, D. Palsetia, A. Agrawal, W.-k. Liao, F. Manne, and A. Choudhary, "A new scalable parallel DBSCAN algorithm using the disjoint-set data structure," in High Performance Computing, Networking, Storage and Analysis (SC), 2012 International Conference for. IEEE, 2012, pp. 1–11.
[6] J. Dean and S. Ghemawat, "MapReduce: simplified data processing on large clusters," Communications of the ACM, vol. 51, no. 1, pp. 107–113, 2008.
[7] Wikipedia. (2015, April 4) MapReduce — Wikipedia, the free encyclopedia. Accessed 22-July-2004. [Online]. Available: http://en.wikipedia.org/wiki/MapReduce
[8] B.-R. Dai and I.-C. Lin, "Efficient map/reduce-based DBSCAN algorithm with optimized data partition," in Cloud Computing (CLOUD), 2012 IEEE 5th International Conference on. IEEE, 2012, pp. 59–66.
[9] Y. He, H. Tan, W. Luo, H. Mao, D. Ma, S. Feng, and J. Fan, "MR-DBSCAN: an efficient parallel density-based clustering algorithm using MapReduce," in Parallel and Distributed Systems (ICPADS), 2011 IEEE 17th International Conference on. IEEE, 2011, pp. 473–480.
[10] M. Zaharia, M. Chowdhury, T. Das, A. Dave, J. Ma, M. McCauley, M. J. Franklin, S. Shenker, and I. Stoica, "Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing," in Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2012, pp. 2–2.
[11] Wikipedia. (2015, April 5) Scalability — Wikipedia, the free encyclopedia. Accessed 22-July-2004. [Online]. Available: http://en.wikipedia.org/wiki/Scalability
[12] K.-H. Lee, Y.-J. Lee, H. Choi, Y. D. Chung, and B. Moon, "Parallel data processing with MapReduce: a survey," ACM SIGMOD Record, vol. 40, no. 4, pp. 11–20, 2012.
[13] Y. X. Fu, W. Z. Zhao, and H. F. Ma, "Research on parallel DBSCAN algorithm design based on MapReduce," Advanced Materials Research, vol. 301, pp. 1133–1138, 2011.
[14] Y. He, H. Tan, W. Luo, S. Feng, and J. Fan, "MR-DBSCAN: a scalable MapReduce-based DBSCAN algorithm for heavily skewed data," Frontiers of Computer Science, vol. 8, no. 1, pp. 83–99, 2014.
[15] M. J. Berger and S. H. Bokhari, "A partitioning strategy for nonuniform problems on multiprocessors," Computers, IEEE Transactions on, vol. 100, no. 5, pp. 570–580, 1987.
[16] P. Wendell and M. Zaharia. (2015, February 13) Spark: A review of 2014 and looking ahead to 2015 priorities. [Online]. Available: https://databricks.com/blog/2015/02/13/spark-a-review-of-2014-and-looking-ahead-to-2015-priorities.html
[17] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay, "Scikit-learn: Machine learning in Python," Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2011.
[18] (2015, April 12) Dataset loading utilities. [Online]. Available: http://scikit-learn.org/stable/datasets/
[19] A. Guttman, R-trees: a dynamic index structure for spatial searching. ACM, 1984, vol. 14, no. 2.