ISSN:2249-5789
Rimmy Chuchra et al, International Journal of Computer Science & Communication Networks, Vol 2(2), 201-204
DRID- A New Merging Approach
Rimmy Chuchra
M.tech (Computer Science)
Lovely Professional University
Phagwara, India
[email protected]
ABSTRACT
Merging of clusters is a deterministic approach that produces results
efficiently. It takes input data values suited to the algorithm. Merging
clusters has several advantages: it improves the quality of the clusters,
reduces the noise level and increases the performance of the algorithm.
Clusters can be merged in any environment; whether merging is possible
depends entirely on the type of dataset values available. In this paper,
we propose an algorithm, DRID, for merging clusters. The proposed
algorithm merges clusters that lie close to each other, because cluster
balancing is a key factor in achieving good performance. The performance
of DRID depends heavily on the available dataset and on the type of
environment chosen by the user. The crucial step in the algorithm is
selecting the best next cluster for merging and splitting. Experimental
results and comparisons indicate that the proposed DRID is an effective
approach that helps to reduce execution time and increase the overall
performance of the algorithm.
Keywords: Clustering algorithms, text mining.
INTRODUCTION
Data mining is essentially a “sorting technique” that helps to detect
patterns which may be hidden or unknown. Generally, data mining
parameters include path analysis, classification, association and
clustering. Each parameter has one specific goal. The goal of
classification is to look for new patterns. The goal of path analysis is
to search for events of which one part happens now and the other occurs
later. The goal of association is to look for patterns that show
interrelated behaviour. The goal of clustering is to find patterns that
were previously unknown. Here, we combine two concepts: text mining and
clustering. Text mining works on text in its natural form and is the
process of deriving high-quality data from it. In essence, text mining
gives a structure to the input text and derives patterns from the
structured data. It comprises various tasks such as text clustering, text
classification, document summarization and sentiment analysis. The two
major advantages of text mining are visualization and customization as
per user requirements. Clustering is a technique that places similar
objects together. It is used in many diverse applications such as image
compression, market segmentation and spatial discovery. Various methods
are used to implement clustering, such as partitioning-based,
hierarchical, density-based, grid-based and model-based methods. Each
method itself comprises a variety of clustering algorithms. I
am using the Enhanced K-means clustering algorithm, a partitioning method
that works in a distance-based environment, and the Orthogonal
Partitioning Clustering (OC) algorithm, a grid-based method that works in
a grid-based environment. The two algorithms therefore belong to two
different environments. The
purpose of the Enhanced K-means clustering algorithm is to find better
initial centroids with reduced time complexity. It is built entirely on
the K-means clustering algorithm, a distance-based algorithm that defines
a distance measure over the data instances and partitions them so that
the distance between objects within the same cluster is minimized and the
distance between different clusters is maximized. The purpose of
the Orthogonal Partitioning Clustering (OC) algorithm is to create a
hierarchical grid-based clustering model; that is, it creates
axis-parallel (orthogonal) partitions in the input attribute space.
O-Cluster separates areas of high density by placing cutting planes
through areas of low density. The
advantage of using a grid-based environment is that objects are
represented in a multi-resolution grid form, with fast processing time
that is independent of the number of objects. In general, clustering
algorithms are divided into four categories: hierarchical clustering,
probabilistic clustering, exclusive clustering and overlapping
clustering. We use exclusive clustering in two separate environments,
distance based and grid based. Exclusive clustering means that each
object must belong to one specific cluster and must not be included in
any other cluster. For example, this kind of clustering can be pictured
as a separating line, with the existing clusters lying above and below
the line.
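As a rough illustration of the distance-based, exclusive clustering described above, the following sketch runs plain K-means (Lloyd's iteration) on one-dimensional values. It is not the Enhanced K-means algorithm used in this paper; the function name `kmeans_1d`, the initial centroids and the sample values are assumptions made only for this example.

```python
def kmeans_1d(values, centroids, max_iter=100):
    """Plain K-means on 1-D data; returns (labels, centroids)."""
    centroids = list(centroids)
    labels = []
    for _ in range(max_iter):
        # Exclusive assignment: each value gets exactly one cluster,
        # the one whose centroid is nearest.
        labels = [
            min(range(len(centroids)), key=lambda k: abs(v - centroids[k]))
            for v in values
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for k, old in enumerate(centroids):
            members = [v for v, lab in zip(values, labels) if lab == k]
            new_centroids.append(sum(members) / len(members) if members else old)
        if new_centroids == centroids:
            break  # centroids no longer move, so assignments are stable
        centroids = new_centroids
    return labels, centroids


# Illustrative input (assumed, not taken from the paper's experiments).
values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
labels, centroids = kmeans_1d(values, centroids=[1.0, 12.0])
print(labels)     # one cluster label per value
print(centroids)  # final centroid positions
```

Every value receives exactly one label, which is the exclusive property; the quality of the result depends on the initial centroids, which is the point the Enhanced K-means variant is said to address.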
OUR CONTRIBUTION
Our contribution is an algorithm named DRID (Distance + Grid), which is
used to merge clusters. The algorithm runs in two different environments:
a distance-based environment and a grid-based environment. We thus use
one common strategy, DRID, for merging clusters in both environments. The
performance of DRID is affected by the input given by the user: it may
increase or decrease depending on the type of input the user supplies in
a particular environment.
The basic steps of DRID are as follows:
Step 1. Find the Euclidean distance, ED = D2 - D1, where D1 is the
distance of the first cluster, D2 is the distance of the second cluster
and ED is the Euclidean distance.
Step 2. On the basis of the calculated Euclidean distance, move the
centroid towards it.
Step 3. Compute the distance again after moving the centroid. If the
nearest-neighbour distance is not found, mark the reassignment and go
back to Step 2.
Step 4. If the nearest-neighbour distance is found, mark the assignment
and set the centroid.
Step 5. Repeat the loop for N clusters.
Step 6. End of the loop.
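The steps above leave several details open, so the following is only one possible reading of them, not the authors' implementation. It assumes that each cluster is a list of coordinate tuples, that the distance in Step 1 is the standard Euclidean distance between cluster centroids, and that merging stops at a hypothetical `threshold`; the names `centroid` and `merge_nearest` are likewise illustrative.

```python
import math
from itertools import combinations


def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def merge_nearest(clusters, threshold):
    """Repeatedly merge the two clusters whose centroids are closest.

    `threshold` is a hypothetical stopping parameter: merging stops once
    the closest pair of centroids is farther apart than this value.
    """
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        # Step 1 (as read here): Euclidean distance between centroids.
        (i, j), dist = min(
            (((a, b), math.dist(centroid(clusters[a]), centroid(clusters[b])))
             for a, b in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > threshold:
            break  # no sufficiently close neighbour is left
        # Steps 2-4 (as read here): merge the pair; the centroid of the
        # merged cluster is recomputed on the next pass of the loop.
        clusters[i].extend(clusters[j])
        del clusters[j]  # safe because i < j
    return clusters


# Illustrative input (assumed, not taken from the paper's experiments).
clusters = [[(0.0, 0.0), (1.0, 0.0)], [(1.0, 1.0)], [(10.0, 10.0)]]
merged = merge_nearest(clusters, threshold=2.0)
print(len(merged))  # number of clusters left after merging
```

In this reading the two nearby clusters are merged and the distant one is left alone, which matches the stated aim of merging clusters that lie close to each other.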
RESULT ANALYSIS
Figure 1: DRID in Distance Based Environment
Figure 2: DRID in Grid Based Environment
When we ran DRID in the distance-based environment, we entered an array
of one hundred values as input, and the algorithm assigned a different
ClusterID to each element of the array at some specific point. In the
grid-based environment, when we entered an array of one hundred values as
input, the algorithm assigned the same ClusterID to each element of the
array at some specific point. The ClusterID is generated automatically by
the algorithm. It indicates the distance between a specific cluster and
its neighbouring cluster; clusters are placed either nearest to or
farthest from each other. In this way, the ClusterID helps to find the
location of one cluster relative to another. However, the performance of
the algorithm depends entirely on the type of dataset used in a specific
environment. For example, the distance-based environment is well suited
to numeric values, whereas the grid-based environment is less so.
Comparing Figure 1 with Figure 2, the performance of DRID in the
distance-based environment is greater than in the grid-based environment,
because in the grid-based environment every point of the array is located
at the same ClusterID, whereas the distance-based environment shows
different locations (i.e. ClusterIDs). In the grid-based environment,
clusters are described in the input by intervals along the attribute axes
and by the corresponding centroids and histograms. The same DRID
algorithm can also be applied in the grid-based environment; the only
difference is the input supplied by the user as required by the
algorithm.
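To make the grid-based description more concrete, the sketch below builds a histogram along a single attribute axis, places one cut in a low-density region and describes each side by its interval and centroid. It is a simplified illustration, not the O-Cluster algorithm itself, which applies further checks that are omitted here. The function names, the bin count and the sample values are assumptions.

```python
def histogram(values, lo, hi, bins):
    """Equal-width bin counts for values in [lo, hi]."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for v in values:
        idx = min(int((v - lo) / width), bins - 1)  # clamp v == hi
        counts[idx] += 1
    return counts


def split_at_valley(values, bins=5):
    """Cut one axis at the centre of its emptiest interior bin.

    Assumes bins >= 3 and that the values are not all equal.
    """
    lo, hi = min(values), max(values)
    counts = histogram(values, lo, hi, bins)
    width = (hi - lo) / bins
    # Look only at interior bins so that both sides keep some data.
    valley = min(range(1, bins - 1), key=lambda b: counts[b])
    cut = lo + (valley + 0.5) * width
    left = [v for v in values if v < cut]
    right = [v for v in values if v >= cut]
    parts = [
        {"interval": (a, b), "centroid": sum(m) / len(m), "size": len(m)}
        for a, b, m in ((lo, cut, left), (cut, hi, right))
    ]
    return cut, counts, parts


# Illustrative input (assumed, not taken from the paper's experiments).
values = [1, 2, 2, 3, 3, 3, 17, 18, 18, 19, 20]
cut, counts, parts = split_at_valley(values)
print(counts)  # histogram along the axis
print(cut)     # position of the cutting plane
print(parts)   # interval, centroid and size of each side
```

Each resulting part is described by an interval, a centroid and a count, which mirrors the interval, centroid and histogram description of grid-based clusters given above.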
CONCLUSIONS
We propose DRID, a new algorithmic approach that brings together two
different environments, the distance-based environment and the grid-based
environment. Merging two clusters that lie close to each other increases
performance, reduces the noise problem and reduces execution time. A
common strategy is followed in both environments. The distance-based
algorithm (K-means) is suitable for optimization-type problems, and the
grid-based algorithm (Orthogonal Partitioning Clustering) is suitable for
problems that involve placing cutting hyperplanes.
FUTURE WORK
In future, this approach can be extended so that DRID automatically
detects the type of dataset available and, on that basis, automatically
chooses the type of environment for merging.
ACKNOWLEDGEMENT
There are a bunch of people to thank for
this paper, including Mr. Iqbal Singh. This
paper would not exist but for their faith in
me, and I offer them my heartfelt thanks.