ISSN: 2249-5789
Mamta Mor, International Journal of Computer Science & Communication Networks, Vol. 6(3), 138-142
A Review on Various Clustering Techniques in Data Mining
Mamta Mor
[email protected]
Abstract
This paper presents a review on various clustering
techniques used in data mining. Data mining is the task
of retrieving useful and hidden knowledge from data
sets [1] [2]. Clustering is one of the important tasks of
data mining. Clustering is an unsupervised learning
problem which is used to determine the intrinsic
grouping in a set of unlabeled data [3]. The grouping
of objects is done on the principle of maximizing the
intra-cluster similarity and minimizing the inter-cluster
similarity in such a way that the objects in the same
group/cluster share some similar properties/traits [4].
1. Introduction
Clustering techniques are useful in various real-world applications, including data/text mining, voice mining, image processing, and web mining. Clustering is a main task of exploratory data mining and a common technique for statistical data analysis used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, and bioinformatics [3] [5]. Clustering is the technique of partitioning the data being mined into several clusters of data objects, in such a way that:
a) The objects in a cluster resemble each other to a great extent; and
b) The objects of a cluster are much different from the objects in another cluster.
A cluster should exhibit the properties of external
isolation and internal cohesion. External isolation
requires that objects/instances in one cluster should be
separated from objects/instances in another cluster by
fairly empty areas of space. Internal cohesion requires
that objects/instances within the same cluster should be
similar to each other, at least within the local metric
[6]. It can also be stated that a good clustering algorithm always maximizes the intra-cluster similarity and minimizes the inter-cluster similarity [1] [2].
Clustering is often based on one of the following:
a) Similarity measure
b) Distance measure
The notion of similarity is always problem dependent.
The dissimilarity (or similarity) between objects is typically computed based on the distance between each pair of objects. On the basis of similarity or dissimilarity, clustering can be classified into two types:
a) Distance based clustering
b) Conceptual clustering
In distance based clustering, the objects/instances are put into clusters on the basis of a distance criterion: two or more objects belong to the same cluster if they are "closer" to the centroid of that particular cluster. The basic idea behind distance based clustering is to minimize the intra-cluster distance and maximize the inter-cluster distance, as shown in Figure 1.
Figure 1: Distance based Clustering
In conceptual clustering, two or more objects belong to the same cluster if they are conceptually the same or similar. In other words, clusters are formed according to descriptive concepts, not according to a distance measure, as shown in Figure 2.
Figure 2: Conceptual clustering
2. Clustering Techniques/Methods
The various clustering methods that are used in the field of data mining can be categorized as:
a) Hierarchical methods
b) Partitioning methods
c) Density-based methods
d) Grid-based methods
e) Model-based methods
2.1 Hierarchical Clustering
Hierarchical clustering (also called hierarchical cluster analysis or HCA) is a technique of cluster analysis which seeks to build a hierarchy of clusters, also known as a dendrogram. Hierarchical clustering involves creating clusters that have a predetermined ordering from top to bottom. It is based on the core idea of objects being more related to nearby objects than to objects farther away. It can thus be concluded that these algorithms connect "objects" to form "clusters" on the basis of distance based clustering [5].
It can be further divided into two subtypes as shown in
Fig. 3:
Figure 3: Subtypes of hierarchical clustering (agglomerative method and divisive method)
• Agglomerative Method
It is a bottom-up approach which starts by assigning each data instance to its own cluster and then iteratively merges the two most similar clusters. This technique builds the hierarchy from the individual objects by progressively merging clusters, as shown in Figure 4.
The general algorithm for agglomerative clustering is as follows [5][6]; a minimal code sketch is given after the steps:
1. Find the two closest objects and merge them into a cluster.
2. Find and merge the next two closest points, where a point is either an individual object or a cluster of objects.
3. If more than one cluster remains, return to step 2.
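As an illustration, here is a minimal sketch of agglomerative clustering using SciPy's hierarchical clustering routines (the toy data, the single-linkage choice, and the distance cut of 2.0 are assumptions for the example, not part of the original text):

```python
# Minimal agglomerative clustering sketch using SciPy.
# Assumptions: toy 2-D data, single linkage, distance cut at 2.0.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # one tight group
              [8.0, 8.0], [8.1, 7.9], [7.8, 8.2]])  # another tight group

# linkage() repeatedly merges the two closest clusters (steps 1-3 above)
# and records every merge in a dendrogram matrix Z.
Z = linkage(X, method="single")

# Cut the dendrogram at distance 2.0 to obtain flat cluster labels.
labels = fcluster(Z, t=2.0, criterion="distance")
print(labels)  # e.g. [1 1 1 2 2 2]
```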
• Divisive Method
It is a top-down approach which starts by assigning all the objects to a single cluster and then iteratively divides it into smaller and smaller clusters; both the agglomerative and the divisive techniques are illustrated in Figure 4.
The general algorithm for divisive clustering is as follows [5] (let us assume there are n objects); a minimal sketch of one split is given after the steps:
1. Find the two farthest objects and split them into two clusters.
2. Find and split the next two farthest points, where a point is either an individual object or a cluster of objects.
3. Repeat until n clusters are obtained.
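Below is a minimal sketch of one divisive step under stated assumptions (toy 2-D data; the farthest pair of points seeds the two new clusters, and each remaining point joins the nearer seed). It is an illustration of the steps above, not the authors' implementation:

```python
# One divisive split: seed two clusters with the farthest pair of points,
# then assign every remaining point to the nearer seed.
import numpy as np

def divisive_split(X):
    # Pairwise distance matrix between all points.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Step 1: the two farthest objects seed the two new clusters.
    i, j = np.unravel_index(np.argmax(d), d.shape)
    # Step 2: every point joins the cluster of the nearer seed.
    to_i = np.linalg.norm(X - X[i], axis=1) <= np.linalg.norm(X - X[j], axis=1)
    return X[to_i], X[~to_i]

X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
left, right = divisive_split(X)  # applied recursively until n clusters remain
print(left)
print(right)
```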
Figure 4: Hierarchical Clustering
Advantages of hierarchical clustering include:
a) Embedded flexibility regarding the level of granularity.
b) Ease of handling any form of similarity or distance.
c) Consequently, applicability to any attribute type.
Disadvantages of hierarchical clustering are related to:
a) Vagueness of the termination criteria.
b) The fact that most hierarchical algorithms do not revisit (intermediate) clusters once they are constructed with the purpose of improving them.
2.2 Partitioning Clustering
The partitioning methods generally result in a set of M
clusters, each object belonging to one cluster. Each
cluster may be represented by a centroid [5]. It can be
further divided into two subtypes as shown in Fig. 5:
Figure 5: Subtypes of partitioning clustering (the iterative partitioning method, with overlapping and non-overlapping variants)
Perhaps the most popular class of clustering algorithms is the combinatorial optimization algorithms, also known as iterative relocation algorithms. These algorithms minimize a given clustering criterion by iteratively relocating data points between clusters until a (locally) optimal partition is attained. In the overlapping method, a data instance can belong to more than one cluster at the same time, whereas in the non-overlapping method a data instance can belong to only one cluster at a time. Among the partition-based clustering algorithms, k-means clustering is one of the most popular and simplest unsupervised learning techniques; it belongs to the non-overlapping class.
• K-means Clustering
K-means partitions the data points into K groups or clusters, where the parameter K is a positive integer specified by the user. K-means starts with K centroids, and it iteratively performs the following steps [3][8]:
a) Assign each data instance to the cluster whose centroid is nearest to it.
b) Compute the new centroid of each cluster.
Repeat the above steps until no data instance moves from one cluster to another. A formal description of k-means is given below:
Let X = {x1, x2, …, xn} be a dataset of n objects, where each object has p attributes. K-means clustering aims to partition the n observations into k (≤ n) sets S = {S1, S2, …, Sk} so as to minimize the intra-cluster distance and maximize the inter-cluster distance. In other words, its objective is to find [4][9]:

$$\arg\min_{S} \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - m_i \rVert^2$$

where x is a point representing a data object, S_i is the i-th cluster, and m_i is the mean of the i-th cluster.
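The following is a minimal NumPy sketch of the two-step iteration described above (the toy data, the value k = 2, and the random initialization are assumptions for the example; empty clusters are not handled in this sketch):

```python
# Minimal k-means (Lloyd's algorithm) sketch with NumPy.
# Assumptions: toy 2-D data, k = 2, random initial centroids.
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Start with k centroids drawn from the data points themselves.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step a) assign each data instance to the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        # Step b) recompute each centroid as the mean of its cluster.
        new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new_centroids, centroids):  # no instance moved
            break
        centroids = new_centroids
    return labels, centroids

X = np.array([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [6.0, 8.5]])
labels, centroids = kmeans(X, k=2)
print(labels)
print(centroids)
```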
The k-means clustering algorithm is fast, robust, relatively efficient, and easy to understand.
Advantages of k-means clustering include:
1) If the number of data objects is large, k-means is usually computationally faster than hierarchical clustering, provided the value of k is kept small.
2) K-means produces tighter clusters than hierarchical clustering, especially if the clusters are globular.
Disadvantages of k-means clustering include:
1) It is difficult to predict the value of K.
2) Different initial partitions can result in different final clusters.
3) It often converges to a sub-optimal solution because of the large clustering space.
2.3 Density based Clustering
Density-based clustering groups together data objects/points that are tightly packed (points with many nearby neighbours), marking as outliers the points that lie alone in low-density regions (points whose nearest neighbours are too far away) [5]. Clusters are defined as areas of higher density than the remainder of the data set. Objects in the sparse areas that are required to separate clusters are usually considered to be noise or border points. DBSCAN (density-based spatial clustering of applications with noise) is one of the most common and popular density-based clustering algorithms [6].
Density Reachability - A point "x" is said to be density
reachable from a point "y" if point "x" is within ε
distance from point "y" and "y" has sufficient number
of points in its neighbors which are within distance
ε[5].
Density Connectivity - A point "x" and "y" are said to
be density connected if there exist a point "z" which has
sufficient number of points in its neighbors and both
the points "x" and "y" are within the ε distance. This is
chaining process. So, if "y" is neighbor of "z", "z" is
neighbor of "s", "s" is neighbor of "t" which in turn is
neighbor of "x" implies that "y" is neighbor of "x"[5].
The algorithm for DBSCAN clustering is as follows; a usage sketch is given after the steps.
Let X = {x1, x2, x3, ..., xn} be the set of n data points. DBSCAN requires two parameters: ε (eps) and the minimum number of points required to form a cluster (minPts) [10].
1) Start with an arbitrary starting point that has not been visited.
2) Extract the neighbourhood of this point using ε (all points within ε distance form the neighbourhood).
3) If there are sufficiently many points in this neighbourhood, the clustering process starts and the point is marked as visited; otherwise the point is labelled as noise (later this point may still become part of a cluster).
4) If a point is found to be part of a cluster, then its ε-neighbourhood is also part of that cluster, and the above procedure from step 2 is repeated for all ε-neighbourhood points. This is repeated until all points in the cluster are determined.
5) A new unvisited point is retrieved and processed, leading to the discovery of a further cluster or of noise.
6) This process continues until all points are marked as visited.
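For illustration, here is a minimal sketch that runs DBSCAN through scikit-learn (the toy data and the eps and min_samples values are assumptions for the example, not taken from the paper):

```python
# Minimal DBSCAN usage sketch via scikit-learn.
# Assumptions: toy 2-D data, eps = 0.5, min_samples = 5 (illustrative values).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
blob1 = rng.normal(loc=[0.0, 0.0], scale=0.2, size=(40, 2))  # dense region 1
blob2 = rng.normal(loc=[4.0, 4.0], scale=0.2, size=(40, 2))  # dense region 2
noise = rng.uniform(low=-2.0, high=6.0, size=(8, 2))         # sparse points
X = np.vstack([blob1, blob2, noise])

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# DBSCAN labels noise points -1; cluster ids start at 0.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters:", n_clusters, "noise points:", int(np.sum(labels == -1)))
```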
Advantages
1) It does not require a priori specification of the number of clusters, unlike k-means.
2) It is able to identify noise data while clustering, so it is more robust in nature.
3) It is able to find clusters of arbitrary size and arbitrary shape.
Disadvantages
1) The DBSCAN algorithm fails in the case of clusters of varying density.
2) It fails in the case of neck-type datasets.
2.4 Grid based Clustering
Grid based clustering is popular for mining clusters in a large multi-dimensional space. Its main benefit is the reduction of computational complexity, especially when the datasets are very large. This approach is concerned not with the data points themselves but with the value space associated with the data points.
• Algorithm
A typical grid-based clustering algorithm consists of the following basic steps [11]; a minimal sketch is given after the list:
1) Create a grid structure, i.e., partition the data space into a finite number of grid cells.
2) Assign data objects to the appropriate grid cells and calculate the density of each cell.
3) Sort the cells according to their density and eliminate cells whose density is below a certain threshold t.
4) Identify cluster centres.
5) Traverse neighbour cells.
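The sketch below is one simple way to realize these steps under stated assumptions (2-D data, cell size 1.0, density threshold t = 2; steps 4-5 are realized by flood-filling over orthogonally adjacent dense cells); it is an illustration, not a specific published algorithm:

```python
# Minimal grid-based clustering sketch: bin points into cells, keep dense
# cells, and join orthogonally adjacent dense cells into clusters.
import numpy as np
from collections import defaultdict, deque

def grid_cluster(X, cell_size=1.0, t=2):
    # Steps 1-2: assign each point to a grid cell and count cell densities.
    cells = defaultdict(list)
    for idx, p in enumerate(X):
        cells[tuple((p // cell_size).astype(int))].append(idx)
    # Step 3: keep only cells whose density reaches the threshold t.
    dense = {c for c, pts in cells.items() if len(pts) >= t}
    # Steps 4-5: flood-fill over neighbouring dense cells to form clusters.
    labels, cluster_id = {}, 0
    for start in dense:
        if start in labels:
            continue
        labels[start] = cluster_id
        queue = deque([start])
        while queue:
            cx, cy = queue.popleft()
            for nb in [(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)]:
                if nb in dense and nb not in labels:
                    labels[nb] = cluster_id
                    queue.append(nb)
        cluster_id += 1
    # Map cell labels back to points (-1 marks points in sparse cells).
    point_labels = np.full(len(X), -1)
    for c, pts in cells.items():
        if c in labels:
            point_labels[pts] = labels[c]
    return point_labels

X = np.array([[0.1, 0.2], [0.3, 0.4], [0.6, 0.1],
              [5.1, 5.2], [5.3, 5.4], [9.0, 0.0]])
print(grid_cluster(X))  # e.g. [0 0 0 1 1 -1]
```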
The grid approach includes the STING (STatistical INformation Grid) approach and CLIQUE. Let us discuss STING in brief.
2.4.1 STING
The STING clustering method, proposed by Wang et al. (1997), is used to cluster spatial databases. It can be used to answer different kinds of spatial queries [12].
• Pseudo-Code
1) Divide the spatial area into rectangular cells representing a hierarchical structure.
2) Cells at a higher level are divided into a number of smaller cells at the next/lower level.
3) Statistical information about each cell is calculated and used to answer spatial queries.
4) Parameters of higher-level cells can easily be calculated from the parameters of lower-level cells.
5) Use a top-down approach to answer spatial data queries.
6) Start from a pre-selected layer, typically one with a small number of cells, and repeat the following until the bottom layer is reached:
7) For each cell in the current level, compute the confidence interval indicating the cell's relevance to the given query:
a. If it is relevant, include the cell in a cluster.
b. If it is irrelevant, remove the cell from further consideration; otherwise, look for relevant cells at the next lower layer.
c. Combine relevant cells into relevant regions (based on grid-neighbourhood) and return the clusters so obtained as the answer.
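As a rough illustration of steps 1-4, the following sketch builds per-cell counts at a fine grid level and aggregates them into parent cells one level up (the grid sizes and the use of counts as the cell statistic are assumptions for the example; STING also maintains other statistics such as means and distributions):

```python
# Two-level STING-style grid sketch: per-cell counts at the fine level,
# parent-cell counts aggregated from their four children.
# Assumptions: 2-D data in [0, 1) x [0, 1), a 4x4 fine grid, a 2x2 coarse grid.
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((200, 2))  # toy points in the unit square

# Fine level: count points per cell of a 4x4 grid (step 3).
fine, _, _ = np.histogram2d(X[:, 0], X[:, 1], bins=(4, 4), range=[[0, 1], [0, 1]])

# Coarse level: each parent cell's count is the sum of its 2x2 children
# (higher-level parameters derived from lower-level cells, step 4).
coarse = fine.reshape(2, 2, 2, 2).sum(axis=(1, 3))

# A query is then answered top-down: inspect the coarse cells first and
# descend into fine cells only where the coarse statistic is relevant.
print(coarse)
```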
Advantages:
• Query-independent, easy to parallelize, and supports incremental updates.
• The query-processing complexity is O(K), where K is the number of grid cells at the lowest level.
Disadvantages:
• All the cluster boundaries are either horizontal or
vertical, and no diagonal boundary is detected.
3. Application Areas of Clustering
Clustering algorithms can be applied in many fields, for
instance:
• Marketing: finding groups of customers with
similar interests and behaviour given a large
database of customer data containing their
properties and past buying records.
• Medicine: IMRT segmentation, Analysis of
antimicrobial activity, Medical imaging.
• Financial tasks: forecasting the stock market, currency exchange rates and bank bankruptcies; understanding and managing financial risk; trading futures; credit rating.
• Computer Science: Software evolution, Image
segmentation, Anomaly detection.
• Biology: classification of plants and animals given
their features, human genetic clustering,
transcriptomics.
• Insurance: identifying groups of motor insurance
policy holders with a high average claim cost;
identifying frauds.
• City-planning: identifying groups of houses
according to their house type, value and
geographical location.
• Earthquake studies: clustering observed earthquake
epicentres to identify dangerous zones.
• WWW: document classification; clustering web
log data to discover groups of similar access
patterns.
4. Conclusion
Several clustering algorithms have been proposed for the task of data mining. The clustering algorithm to be used in a particular case depends on the type and nature of the dataset. Clustering has a large area of application. It is a descriptive technique: the solution is not unique, and it strongly depends upon the analyst's choices.
5. References
[1] J. Han, M. Kamber, and J. Pei, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2006.
[2] A. A. Freitas, "A survey of evolutionary algorithms for data mining and knowledge discovery," in Advances in Evolutionary Computing, Springer, 2003, pp. 819-845.
[3] M. Mor, P. Gupta, and P. Sharma, "A Genetic Algorithm Approach for Clustering," 2014, pp. 6442-6447.
[4] P. Sharma, "Comparative Analysis of Various Clustering Algorithms," 2015, pp. 107-112.
[5] "Hierarchical clustering," Wikipedia, the free encyclopedia, 09-Feb-2015.
[6] G. W. Milligan, "An examination of the effect of six types of error perturbation on fifteen clustering algorithms."
[7] A. K. Jain, M. N. Murty, and P. J. Flynn, "Data clustering: a review."
[8] J. A. Hartigan, Clustering Algorithms. John Wiley & Sons, Inc., 1975.
[9] J. A. Hartigan and M. A. Wong, "Algorithm AS 136: A K-Means Clustering Algorithm," Journal of the Royal Statistical Society, Series C, 28(1): 100-108, 1979.
[10] J. Sander, M. Ester, H.-P. Kriegel, and X. Xu, "Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and Its Applications," Data Mining and Knowledge Discovery (Berlin: Springer-Verlag), 2(2): 169-194, 1998.
[11] "Grid based clustering," Wikipedia, the free encyclopedia.
[12] T. Soni Madhulatha, "An Overview on Clustering Methods," IOSR Journal of Engineering, Apr. 2012, 2(4): 719-72.