International Journal of Enterprise Computing and Business Systems
ISSN (Online) : 2230-8849
Volume 4 Issue 1 January - July 2014
International Manuscript ID : ISSN22308849-V4I1M1-012014
AN EMPIRICAL REVIEW ON UNSUPERVISED CLUSTERING
ALGORITHMS IN MULTIPLE DOMAINS
Aditi Chawla
M.Tech. Research Scholar
Department of Computer Science and Engineering
Punjab Technical University
Jallandhar, Punjab, India
Navneet Kaur
Assistant Professor
Department of Computer Science and Engineering
RIMT Institute of Engineering and Technology
Mandi Gobindgarh, Punjab, India
ABSTRACT
Data mining refers to the analysis of the large quantities of data stored in computers. It is also called exploratory data analysis in multiple streams. Masses of data generated by cash registers, by scanning, and by subject-specific databases throughout an organization are explored, analyzed, reduced, and reused. Searches are performed across different models proposed for predicting sales, marketing response, and profit. Classical statistical approaches are fundamental to data mining, and automated AI methods are also used. Data mining requires identification of a problem, along with collection of data that can lead to better understanding, and computer models to provide statistical or other means of analysis. Data comes in, possibly from many sources. It is integrated and placed in a common data store. Part of it is then taken and preprocessed into a standard format. This prepared data is then passed to a data mining algorithm, which produces output in the form of rules or some other kind of patterns. In this manuscript, we analyze clustering algorithms for multiple applications.
Keywords - Data Mining, Clustering, Dynamic Clustering
INTRODUCTION
Numerous analytic computer models have been used in the domain of data mining. The
standard model types in data mining include normal regression for prediction, logistic regression for classification, neural networks, and decision trees. These techniques are well known in the academic and research domains.
Figure 1: Data Mining and Knowledge Discovery Process
Data mining requires the identification of a problem, together with the collection of data that can lead to better understanding and computer models that deliver statistical or other means of analysis. This may be assisted by visualization tools that display data, or by fundamental statistical analysis such as correlation analysis. Data mining tools need to be versatile, scalable, capable of accurately predicting responses between actions and results, and capable of automatic implementation. Versatile refers to the ability of the tool to be applied to a wide variety of models. Scalable means that if a tool works on a small data set, it should also work on larger data sets. Automation is useful, but its application is relative. Some analytic functions are often automated, but human setup prior to implementing procedures is required. In fact, analyst judgment is critical to the successful implementation of data mining. Proper selection of the data to include in searches is critical, and data transformation is often required as well. Too many variables produce too much output, while too few can overlook key relationships in the data. A fundamental understanding of statistical concepts is mandatory for successful data mining.
NOTION OF CLUSTERING AND RELATED ASPECTS
Clustering is an important knowledge discovery technique with numerous applications, such as marketing and customer segmentation. Clustering groups data into sets in such a way that intra-cluster similarity is maximized while inter-cluster similarity is minimized. It is a form of unsupervised learning that examines data to find groups of items that are similar. For example, an insurance company might group customers according to income, age, types of policy purchased, and prior claims experience. In a fault diagnosis application, electrical faults might be grouped according to the values of certain key variables.
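The intra-cluster versus inter-cluster criterion can be illustrated with a small sketch. It assumes Euclidean distance as the inverse of similarity (small distance means high similarity); the function names and the customer records are hypothetical and chosen only for illustration.

```python
import math
from itertools import combinations

def mean_intra_cluster_distance(clusters):
    """Average Euclidean distance between pairs of points in the same cluster."""
    dists = [math.dist(a, b)
             for pts in clusters
             for a, b in combinations(pts, 2)]
    return sum(dists) / len(dists)

def mean_inter_cluster_distance(clusters):
    """Average Euclidean distance between pairs of points in different clusters."""
    dists = [math.dist(a, b)
             for c1, c2 in combinations(clusters, 2)
             for a in c1
             for b in c2]
    return sum(dists) / len(dists)

# Hypothetical customers described by (age, income in thousands).
good = [[(25, 30), (27, 32), (26, 31)], [(55, 90), (57, 95), (56, 92)]]
bad = [[(25, 30), (55, 90), (26, 31)], [(27, 32), (57, 95), (56, 92)]]

for name, clustering in (("good", good), ("bad", bad)):
    print(name,
          round(mean_intra_cluster_distance(clustering), 2),
          round(mean_inter_cluster_distance(clustering), 2))
```

The grouping labelled good has a smaller mean intra-cluster distance and a larger mean inter-cluster distance than the one labelled bad, which is what the criterion asks for.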
Figure 2: Clustering of Data
Many earlier clustering algorithms focus on numerical data, whose inherent geometric properties can be exploited naturally to define distance functions between data points. However, much of the data in databases is categorical, where attribute values cannot be naturally ordered as numerical values. Because of the special properties of categorical attributes, clustering categorical data is more complicated than clustering numerical data. To overcome this problem, several data-driven similarity measures have been proposed for categorical data. The behavior of such measures depends directly on the data.
Clustering is often regarded as the most important unsupervised learning problem; like every other problem of this kind, it deals with finding a structure in a collection of unlabeled data. A loose definition of clustering could be “the process of organizing objects into groups whose members are similar in some way”. A cluster is therefore a collection of objects which are “similar” to one another and “dissimilar” to the objects belonging to other clusters.
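For the categorical case discussed above, a minimal sketch may help show what "data-driven" means. The first function is the plain overlap (simple matching) measure, which ignores the data set. The second is an illustrative frequency-weighted variant written for this example, not one of the published measures: a match on a rare value contributes more than a match on a common one, so its result depends on the data. All names and records are hypothetical.

```python
from collections import Counter

def overlap_similarity(a, b):
    """Fraction of attributes on which two categorical records agree."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of attributes")
    return sum(x == y for x, y in zip(a, b)) / len(a)

def frequency_weighted_similarity(a, b, records):
    """Illustrative data-driven variant: a match on a rare value counts more."""
    n = len(records)
    total = 0.0
    for k, (x, y) in enumerate(zip(a, b)):
        if x == y:
            # Relative frequency of the shared value in attribute k.
            freq = Counter(r[k] for r in records)[x] / n
            total += 1.0 - freq  # rarer shared value -> larger contribution
    return total / len(a)

# Hypothetical records with attributes (colour, wood).
records = [("red", "oak"), ("red", "oak"), ("red", "pine"),
           ("red", "pine"), ("red", "elm"), ("blue", "elm")]

common_match = frequency_weighted_similarity(records[0], records[2], records)
rare_match = frequency_weighted_similarity(records[4], records[5], records)
print(overlap_similarity(records[0], records[2]), common_match, rare_match)
```

Both pairs agree on exactly one attribute, so the overlap measure cannot tell them apart, while the frequency-weighted variant ranks the pair sharing the rarer value higher.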
Figure 3: Formation of Clusters
Here, we easily identify the 4 clusters into which the data can be divided; the similarity criterion is distance: two or more objects belong to the same cluster if they are “close”
according to a given distance (in this case geometrical distance). This is called distance-based clustering.
Another kind of clustering is conceptual clustering: two or more objects belong to the same cluster if it defines a concept common to all of those objects. In other words, objects are grouped according to their fit to descriptive concepts, not according to simple similarity measures.
MAJOR FOCUS IN CLUSTERING
The goal of clustering is to determine the intrinsic grouping in a set of unlabeled data. But how do we decide what constitutes a good clustering? It can be shown that there is no absolute “best” criterion that is independent of the final aim of the clustering. Consequently, it is the user who must supply this criterion, in such a way that the result of the clustering suits their needs. For instance, we could be interested in finding representatives for homogeneous groups (data reduction), in finding “natural clusters” and describing their unknown properties (“natural” data types), in finding useful and suitable groupings (“useful” data classes), or in finding unusual data objects (outlier detection).
TAXONOMY
PARTITIONAL CLUSTERING
Partition-based methods construct the clusters by creating various partitions of the dataset. A partition assigns to each data object x_i a cluster index p_i. The user provides the desired number of clusters M, and some criterion function is used to evaluate the proposed partition or solution. This measure of quality could be, for instance, the average distance between clusters. Some well-known algorithms in this category are k-means, PAM and CLARA. One of the most popular and widely studied clustering methods for objects in Euclidean space is k-means clustering. Given a set of N data objects x_i and an integer number of clusters M, the problem is to determine C, a set of M cluster representatives c_j, so as to minimize the mean squared Euclidean distance from each data object to its nearest centroid.
HIERARCHICAL CLUSTERING
Hierarchical clustering methods build a cluster hierarchy, i.e. a tree of clusters, also known as a dendrogram. A dendrogram is a tree diagram often used to represent the results of a cluster analysis. Hierarchical clustering methods are categorized into agglomerative (bottom-up) and divisive (top-down). An agglomerative clustering starts with one-point clusters and recursively merges the two or more most appropriate clusters. In contrast, a divisive clustering starts with one cluster containing all data points and recursively splits it into nonoverlapping clusters.
DENSITY-BASED AND GRID-BASED CLUSTERING
The key idea of density-based methods is that for each object of a cluster, the neighbourhood of a given radius has to contain at least a certain number of objects; i.e. the density in the neighborhood has to exceed some threshold.
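The density criterion just stated can be sketched as a neighbourhood count. This is only the density test, not a complete density-based clustering algorithm. The names radius and min_pts are assumptions for this example (they play the role of the Eps and MinPts parameters in DBSCAN), and counting the object itself as part of its own neighbourhood is a choice made here.

```python
import math

def neighbourhood(points, i, radius):
    """Indices of all points within `radius` of points[i], including i itself."""
    return [j for j, q in enumerate(points)
            if math.dist(points[i], q) <= radius]

def core_objects(points, radius, min_pts):
    """Indices whose neighbourhood of the given radius holds at least min_pts objects."""
    return [i for i in range(len(points))
            if len(neighbourhood(points, i, radius)) >= min_pts]

# Four tightly packed points and one isolated point (hypothetical data).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
dense = core_objects(points, radius=0.2, min_pts=4)
print(dense)
```

Only the four packed points satisfy the threshold; the isolated point does not, which is how such methods can set noise aside.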
The shape of a neighborhood is determined by the choice of a distance function for two objects. These algorithms can efficiently separate noise. DBSCAN and DBCLASD are well-known methods in the density-based category. The basic concept of grid-based clustering algorithms is that they quantize the space into a finite number of cells that form a grid structure and then perform all operations on the quantized space. The main advantage of the approach is its fast processing time, which is typically independent of the number of objects and depends only on the number of grid cells in each dimension. Well-known methods in this category are STING and CLIQUE.
OUTLIERS
An outlying observation, or outlier, is one that appears to deviate markedly from other members of the sample in which it occurs. Outlier points can therefore indicate faulty data, erroneous procedures, or areas where a certain theory might not be valid. However, in large samples, a small number of outliers is to be expected.
Figure 4: Outliers in two-dimensional dataset
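One simple way to make "deviate markedly" concrete for a one-dimensional sample is to flag values that lie more than three standard deviations from the mean. The sketch below assumes roughly normally distributed data; the function name, the threshold parameter and the sample values are illustrative assumptions.

```python
import statistics

def three_sigma_outliers(values, threshold=3.0):
    """Values whose distance from the mean exceeds `threshold` standard deviations."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)  # population standard deviation
    if std == 0:
        return []  # all values identical: nothing deviates
    return [v for v in values if abs(v - mean) / std > threshold]

# Hypothetical measurements: thirty readings near 10 and one far away.
sample = [10.0] * 10 + [10.2] * 10 + [9.8] * 10 + [25.0]
flagged = three_sigma_outliers(sample)
print(flagged)
```

Note that in very small samples a single extreme value inflates the standard deviation so much that it may never exceed the threshold, so this rule is better suited to larger samples.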
OUTLIER DETECTION
Most outlier detection techniques treat objects with K attributes as points in ℜ^K space, and these techniques can be divided into three main categories. The first, the distance-based approach, distinguishes potential outliers from other objects based on the number of objects in the neighborhood. The second, the distribution-based approach, relies on statistical methods based on a probabilistic data model. The probabilistic model can be either given a priori or constructed automatically from the given data. If the object does not fit the probabilistic model, it is considered to be an outlier. Third, the density-based approach detects local outliers based on the local density of an object’s neighborhood. These methods use different density estimation strategies. A low local density at an observation is an indication of a possible outlier.
DISTANCE-BASED APPROACH
In distance-based methods, an outlier is defined as an object that is at least dmin distance away
from k percent of the objects in the dataset. The problem is then to find appropriate dmin and k such that outliers are correctly detected with a small number of false detections. This process usually requires domain knowledge.
Definition: A point x in a dataset is an outlier with respect to the parameters k and d if no more than k points in the dataset are at a distance d or less from x.
To explain the definition by example, we take parameter k = 3 and distance d. The points xi and xj are outliers, because no more than 3 other points lie inside the circle of radius d around each of them. The point x′ is an inlier, because the number of points inside its circle exceeds k for the given parameters k and d.
Figure 5: Illustration of outlier definition by Knorr and Ng.
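The definition above translates almost directly into code. The following is a brute-force sketch that compares every pair of points, assuming Euclidean distance and assuming that a point is not counted as its own neighbour (consistent with "3 other points" in the example). The function name and the data are hypothetical.

```python
import math

def distance_based_outliers(points, k, d):
    """Indices of points with no more than k other points within distance d."""
    outliers = []
    for i, p in enumerate(points):
        neighbours = sum(1 for j, q in enumerate(points)
                         if j != i and math.dist(p, q) <= d)
        if neighbours <= k:
            outliers.append(i)
    return outliers

# Five points packed together and two isolated ones (hypothetical data).
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (0.5, 0.5), (0.25, 0.25),
          (5.0, 5.0), (9.0, 0.0)]
found = distance_based_outliers(points, k=3, d=1.0)
print(found)
```

With k = 3 and d = 1.0, each packed point has four neighbours and is an inlier, while the two isolated points have none and are reported as outliers. The pairwise comparison costs time quadratic in the number of points, so this form is only suitable for small data sets.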
DISTRIBUTION-BASED APPROACH
Distribution-based methods originate from statistics, where an object is considered an outlier if it deviates too much from the underlying distribution. For example, under a normal distribution, an outlier is an object whose distance from the mean is more than three standard deviations.
DENSITY-BASED APPROACH
Density-based methods have been developed for finding outliers in spatial data. These methods can be grouped into two categories: multi-dimensional metric space-based methods and graph-based methods. In the first category, the definition of spatial neighbourhood is based on Euclidean distance, while in graph-based spatial outlier detection the definition is based on graph connectivity. Whereas distribution-based methods consider only the statistical distribution of attribute values, ignoring the spatial relationships among items, density-based approaches consider both attribute values and spatial relationships.
RELATED ASPECTS OF THE CLUSTER FORMATION PROBLEM
Given a set of unclassified training data:
• To find an efficient way of partitioning and classifying the training data into classes.
• To construct the representation that enables the category or cluster of any new example to be determined.
• Although the two subtasks are logically distinct, they are usually performed together.
• Classification learning programs are successful if the predictions they make are correct, i.e. if they agree with an externally defined classification.
In clustering, there is no externally defined notion of correctness. There are a huge number of ways in which a training set could be partitioned, and some of these are better than others. The classical methods suggest that members of a cluster should resemble each other more than they resemble members of other classes. Hence a good partition should
• Maximise similarity within classes
• Minimise similarity between classes.
Clustering is a well-studied data mining problem that has found applications in many areas. For example, clustering can be applied to a document collection to reveal which documents are about the same topic. The objective in any clustering application is to minimize the inter-cluster similarities and maximize the intra-cluster similarities. There are different clustering algorithms, each of which may or may not be suited to a particular application.
The traditional clustering paradigm pertains to a single dataset. Recently, attention has been drawn to the problem of clustering multiple heterogeneous datasets, where the datasets are related but may contain information about different types of objects, and the attributes of the objects in the datasets may differ significantly. A clustering based on related but different object sets may reveal significant information that cannot be obtained by clustering a single dataset.
Table 1 : Comparative Analysis of assorted clustering algorithms
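As a concrete reference point for the partitional family in this comparison, the k-means objective from the taxonomy (choose M representatives c_j that minimize the mean squared Euclidean distance from each x_i to its nearest centroid) can be sketched with a Lloyd-style iteration. This is a minimal sketch, not a tuned implementation: the function names are hypothetical, and taking the first M data objects as initial centroids is a simplifying assumption made here so that the run is deterministic. Practical implementations usually choose initial centroids more carefully.

```python
import math

def kmeans(points, m, initial=None, max_iterations=100):
    """Lloyd-style iteration; returns (centroids, labels) for m clusters."""
    # Assumption for this sketch: default to the first m data objects.
    centroids = list(initial) if initial is not None else list(points[:m])
    labels = None
    for _ in range(max_iterations):
        # Assignment step: each object goes to its nearest centroid.
        new_labels = [min(range(m), key=lambda j: math.dist(p, centroids[j]))
                      for p in points]
        # Update step: each centroid moves to the mean of its members.
        new_centroids = []
        for j in range(m):
            members = [p for p, l in zip(points, new_labels) if l == j]
            if members:
                new_centroids.append(
                    tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # keep an empty cluster's centroid
        converged = new_labels == labels and new_centroids == centroids
        labels, centroids = new_labels, new_centroids
        if converged:
            break
    return centroids, labels

def mean_squared_distance(points, centroids, labels):
    """Mean squared Euclidean distance from each object to its assigned centroid."""
    total = sum(sum((a - b) ** 2 for a, b in zip(p, centroids[l]))
                for p, l in zip(points, labels))
    return total / len(points)

# Two well-separated groups of four points each (hypothetical data).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
centroids, labels = kmeans(points, m=2)
print(centroids, labels, mean_squared_distance(points, centroids, labels))
```

The result depends on the initial centroids: the iteration only guarantees a local minimum of the objective, so different starting points can yield different partitions on the same data.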
CONCLUSION
Cluster formation, or simply clustering, is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar, in some sense or another, to each other than to those in other groups or clusters. It is a main task of exploratory data mining, and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics. Cluster analysis itself is not one particular algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their notion of what constitutes a cluster and how to find clusters efficiently. Popular notions of clusters include groups with small distances among the cluster members, dense areas of the data space, intervals, or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. A large amount of research on different algorithms is under way around the world. In
this research work, we have proposed a novel algorithmic approach and model that makes use of a mathematical foundation and an evolutionary approach for the formation of clusters in an efficient and effective manner with respect to execution time and associated results. The future scope of this research can be extended to a hybrid approach, in which two or more algorithmic approaches are combined in a single scheme to obtain optimal results. The hybrid approach can make use of ant colony optimization or a genetic algorithm. If the presented algorithm is iterated together with a genetic algorithmic approach, the best result may be obtained. In future work, cluster formation could be integrated with best-first search or other heuristic search strategies for the removal of noise.