Analysis and Approach: K-Means and K-Medoids Data Mining
Algorithms
Dr. Aishwarya Batra
Asst. Professor, L. J. Institute of Computer Applications, Ahmedabad, India.
E-mail: [email protected]
Abstract
Clustering is similar to classification in that data are grouped. A cluster is therefore a collection of objects which are similar to one another and dissimilar to the objects belonging to other clusters. A large number of clustering algorithms exist in the literature, and the choice of clustering algorithm depends both on the type of data available and on the particular purpose and application. Clustering analysis is one of the main analytical methods in data mining. K-means is the most popular partition-based clustering algorithm, but it is computationally expensive and the quality of the resulting clusters depends heavily on the selection of the initial centroids and on the dimension of the data. Several methods have been proposed in the literature for improving the performance of the k-means clustering algorithm. In this research, the most representative algorithms, K-Means and K-Medoids, were examined and analyzed based on their basic approach, and the better-performing algorithm was identified. The input data points are generated in two ways: by using a normal distribution and by applying a uniform distribution.
Keywords: K-Means, K-Medoids, Clustering, Partitional Algorithm
Introduction
Clustering techniques have a wide use and importance
nowadays. This importance tends to increase as the amount of
data grows and the processing power of the computers
increases. Clustering applications are used extensively in
various fields such as artificial intelligence, pattern
recognition, economics, ecology, psychiatry and marketing.
Data clustering is under vigorous development. Contributing
areas of research include data mining, statistics, machine
learning, spatial database technology, biology, and marketing.
Owing to the huge amounts of data collected in databases,
cluster analysis has recently become a highly active topic in
data mining research. As a branch of statistics, cluster analysis
has been extensively studied for many years, focusing mainly
on distance-based cluster analysis.
The main purpose of clustering techniques is to partition a
set of entities into different groups, called clusters. Cluster
analysis tools based on k-means, k-medoids, and several other
methods have also been built into many statistical analysis
software packages or systems, such as S-Plus, SPSS, and SAS.
Categorization of Major Clustering Methods
In general, the major clustering methods can be classified into
the following categories:
• Partitioning methods
• Hierarchical methods
• Density-based methods
• Grid-based methods
• Model-based methods
Classical Partitioning Methods: K-Means & K-Medoids
The most well-known and commonly used partitioning
methods are k-means, k-medoids, and their variations.
Partitional clustering techniques create a one-level partitioning of the data points. There are a number of such techniques, but we shall only describe two approaches in this section: K-means and K-medoids. Both techniques are based on the idea that a centre point can represent a cluster. For K-means we use the notion of a centroid, which is the mean or median point of a group of points; note that a centroid almost never corresponds to an actual data point. For K-medoids we use the notion of a medoid, which is the most representative (central) point of a group of points. Partitional techniques create a one-level (un-nested) partitioning of the data points: if K is the desired number of clusters, then partitional approaches typically find all K clusters at once.
Clustering:
• k-means (MacQueen '67): each cluster is represented by the center of the cluster.
• k-medoids or PAM (Partitioning Around Medoids) (Kaufman & Rousseeuw '87): each cluster is represented by one of the objects in the cluster.
Centroid-Based Technique: The K-Means Method
Basic Algorithm
The K-means clustering technique is very simple and we
immediately begin with a description of the basic algorithm.
We elaborate in the following sections.
Basic K-means Algorithm for finding K clusters
1. Select K points as the initial centroids.
2. Assign all points to the closest centroid.
3. Recompute the centroid of each cluster.
4. Repeat steps 2 and 3 until the centroids don't change.
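As an illustration of how these four steps fit together, here is a minimal sketch of our own (not code from the paper); it assumes NumPy, Euclidean distance, and randomly chosen initial centroids:

```python
import numpy as np

def kmeans(points, k, max_iter=100, seed=0):
    """Basic K-means following steps 1-4 above (Euclidean distance)."""
    rng = np.random.default_rng(seed)
    # Step 1: select K points as the initial centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2: assign all points to the closest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute the centroid of each cluster
        # (keep the old centroid if a cluster happens to be empty).
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  if np.any(labels == j) else centroids[j]
                                  for j in range(k)])
        # Step 4: repeat steps 2 and 3 until the centroids don't change.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids
```

A call such as kmeans(data, k=3) returns a cluster label for every point together with the final centroids.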
The k-means algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low. The algorithm uses the square-error criterion

E = Σ (i = 1..k) Σ (p ∈ Ci) |p − mi|²

where E is the sum of the square error for all objects in the data set; p is the point in space representing a given object; and mi is the mean of cluster Ci (both p and mi are multidimensional). In other words, for each object in each cluster, the distance from the object to its cluster center is squared, and the distances are summed. This criterion tries to make the resulting k clusters as compact and as separate as possible.

For example, if we consider the following data set, the K-Means algorithm will work like this (distances are computed with the function defined below):

Point   Coords     Dist to mean1 (2,10)   Dist to mean2 (5,8)   Dist to mean3 (1,2)   Cluster
A1      (2, 10)    0                      5                     9                     1
A2      (2, 5)     5                      6                     4                     3
A3      (8, 4)     12                     7                     9                     2
A4      (5, 8)     5                      0                     10                    2
A5      (7, 5)     10                     5                     9                     2
A6      (6, 4)     10                     5                     7                     2
A7      (1, 2)     9                      10                    0                     3
A8      (4, 9)     3                      2                     10                    2
The initial cluster centres (means) are (2, 10), (5, 8) and (1, 2), chosen randomly. Next, we will calculate the distance from the first point (2, 10) to each of the three means by using the distance function:
ρ(a, b) = |x2 − x1| + |y2 − y1|

For point (2, 10) and mean1 (2, 10):
ρ(point, mean1) = |2 − 2| + |10 − 10| = 0 + 0 = 0

For point (2, 10) and mean2 (5, 8):
ρ(point, mean2) = |5 − 2| + |8 − 10| = 3 + 2 = 5

For point (2, 10) and mean3 (1, 2):
ρ(point, mean3) = |1 − 2| + |2 − 10| = 1 + 8 = 9

Point (2, 10) is therefore closest to mean1 and is assigned to cluster 1; repeating the calculation for the remaining points gives the cluster column of the table above.
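The same calculation can be repeated for all eight points in a few lines; this sketch (ours, for illustration) reproduces the distance table above using the Manhattan distance function just defined:

```python
import numpy as np

points = np.array([(2, 10), (2, 5), (8, 4), (5, 8),
                   (7, 5), (6, 4), (1, 2), (4, 9)])  # A1..A8
means = np.array([(2, 10), (5, 8), (1, 2)])          # initial centres

# Manhattan distance |x2 - x1| + |y2 - y1| from every point to every mean.
dists = np.abs(points[:, None, :] - means[None, :, :]).sum(axis=2)
clusters = dists.argmin(axis=1) + 1                  # 1-based cluster numbers

for i, (d, c) in enumerate(zip(dists, clusters), start=1):
    print(f"A{i} {tuple(points[i - 1])}: distances {d.tolist()} -> cluster {c}")
```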
Next, we need to re-compute the new cluster centers. We do so by taking the mean of all points in each cluster.

Flow Chart of K-Means Algorithm (flow chart figure, not reproduced)

Some further observations on K-means:
• The k-means algorithm is sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data. Such values are often found in data relating to classified images.
• In the k-means algorithm (MacQueen, 1967), the prototype, called the center, is the mean value of all objects belonging to a cluster.
• Furthermore, the algorithm requires several passes over the entire dataset, which can make it very expensive for large datasets, such as the dataset in our application.
• The k-medoids approach is more robust in this respect.
K-Medoid Algorithm
K-medoid, or PAM (Partitioning Around Medoids) (Kaufmann and Rousseeuw, 1990), was one of the first k-medoids algorithms introduced. It attempts to determine k partitions for n objects. After an initial random selection of k medoids, the algorithm repeatedly tries to make a better choice of medoids.
Instead of taking the mean value of the objects in a cluster as a reference point, a medoid can be used, which is the most centrally located object in a cluster. The algorithm minimizes the absolute-error criterion

E = Σ (j = 1..k) Σ (p ∈ Cj) |p − Oj|

where E is the sum of the distances for all objects in the data set; p is the point in space representing a given object; and Oj is the representative object (medoid) of cluster Cj (both p and Oj are multidimensional).
The PAM algorithm is based on the search for k representative objects, or medoids, among the objects of the dataset. These objects should represent the structure of the data. After finding a set of k medoids, k clusters are constructed by assigning each object to the nearest medoid. The goal is to find k representative objects which minimize the sum of the dissimilarities of the objects to their closest representative object.

Basic K-medoid algorithm for finding K clusters:
1. Select K initial points. These points are the candidate medoids and are intended to be the most central points of their clusters.
2. Consider the effect of replacing one of the selected objects (medoids) with one of the non-selected objects. Conceptually, this is done in the following way. The distance of each non-selected point from the closest candidate medoid is calculated, and this distance is summed over all points. This distance represents the "cost" of the current configuration. All possible swaps of a non-selected point for a selected one are considered, and the cost of each configuration is calculated.
3. Select the configuration with the lowest cost. If this is a new configuration, repeat step 2.

Numerical experiments
Cluster the given data set of ten objects into two clusters, i.e. k = 2. Consider a data set of ten objects as follows:

Object   x   y
X1       2   6
X2       3   4
X3       3   8
X4       4   7
X5       6   2
X6       6   4
X7       7   3
X8       7   4
X9       8   5
X10      7   6

Data Set – Ten Objects (the distribution of the data objects is shown in a figure, not reproduced)

Step 1
Initialise the k centres. Let us assume c1 = (3, 4) and c2 = (7, 4); c1 and c2 are selected as the medoids. Calculate the distances so as to associate each data object with its nearest medoid. Cost is calculated using the Minkowski distance metric (here of order 1, i.e. Manhattan distance).

Cost Calculation

Data object Xi   Cost to c1 (3, 4)   Cost to c2 (7, 4)
(2, 6)           3                   7
(3, 8)           4                   8
(4, 7)           4                   6
(6, 2)           5                   3
(6, 4)           3                   1
(7, 3)           5                   1
(8, 5)           6                   2
(7, 6)           6                   2
Then the clusters become:
Cluster1 = {(3, 4), (2, 6), (3, 8), (4, 7)}
Cluster2 = {(7, 4), (6, 2), (6, 4), (7, 3), (8, 5), (7, 6)}
Since the points (2, 6), (3, 8) and (4, 7) are closer to c1, they form one cluster, whilst the remaining points form the other cluster.
The cost between any two points is found using the formula

cost(x, c) = Σ (i = 1..d) |xi − ci|

where x is any data object, c is the medoid, and d is the dimension of the object, which in this case is 2. The total cost is the summation of the costs of the data objects from the medoids of their clusters, so here:
Total cost = { cost((3,4),(2,6)) + cost((3,4),(3,8)) + cost((3,4),(4,7)) }
           + { cost((7,4),(6,2)) + cost((7,4),(6,4)) + cost((7,4),(7,3)) + cost((7,4),(8,5)) + cost((7,4),(7,6)) }
           = (3 + 4 + 4) + (3 + 1 + 1 + 2 + 2)
           = 20
So the total cost involved is 20.
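This assignment and the total cost of 20 can be verified mechanically; here is a short sketch (our illustration of the cost formula above, not the paper's code):

```python
import numpy as np

objects = np.array([(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
                    (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)])  # X1..X10
medoids = np.array([(3, 4), (7, 4)])                          # c1, c2

# cost(x, c) = sum over the d dimensions of |x_i - c_i| (Manhattan distance).
costs = np.abs(objects[:, None, :] - medoids[None, :, :]).sum(axis=2)
labels = costs.argmin(axis=1)            # nearest medoid for each object
print("total cost:", costs.min(axis=1).sum())   # prints 20
```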
Step 2
Select a non-medoid O′ at random. Let us assume O′ = (7, 3), so now the medoids are c1 = (3, 4) and O′ = (7, 3). If c1 and O′ are the new medoids, calculate the total cost involved by using the formula from Step 1:
Total cost = 3 + 4 + 4 + 2 + 2 + 1 + 3 + 3 = 22
So the cost of swapping the medoid from c2 to O′ is
S = current total cost − past total cost = 22 − 20 = 2 > 0
Since S > 0, moving to O′ would be a bad idea; the previous choice was good, and the algorithm terminates here (i.e. there is no change in the medoids). It may happen that some data points shift from one cluster to another, depending upon their closeness to the medoid.
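The swap test of Step 2 can be checked the same way; this sketch (again ours) recomputes the total cost with O′ = (7, 3) in place of c2 and compares it against the Step 1 cost:

```python
import numpy as np

objects = np.array([(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
                    (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)])

def total_cost(medoids):
    """Sum of Manhattan distances from each object to its nearest medoid."""
    costs = np.abs(objects[:, None, :] - np.asarray(medoids)[None, :, :]).sum(axis=2)
    return costs.min(axis=1).sum()

old = total_cost([(3, 4), (7, 4)])   # 20
new = total_cost([(3, 4), (7, 3)])   # 22
print("S =", new - old)              # S = 2 > 0, so the swap is rejected
```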
Artificial Data for Comparison & Handling Outliers
In order to evaluate the performance of the clusters, artificial data will be generated and clustered using K-means clustering and PAM. We generate 120 objects having 2 variables for each of three classes, shown in Fig. 1. For convenience, we call the first group, marked by squares, class A; the second group, marked by circles, class B; and the third group, marked by triangles, class C.

Artificial Data for Comparison (Fig. 1, not reproduced)

Data is generated from a multivariate normal distribution whose mean vector and variance of each variable (the variance of each variable is assumed to be equal and the covariance is zero) are given in Table 1. In order to compare the performance when some outliers are present among the objects, we add outliers to class B. The outliers are generated from a multivariate normal distribution which has the same mean as class B but a larger variance, as shown in Table 1.
Table 1 – Mean and variance when generating objects

Class                Mean Vector   Variance of each Variable
A                    (0, 0)        1.5²
B                    (6, 2)        0.5²
C                    (6, −1)       0.5²
Outliers (Class B)   (6, 2)        2²
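A sketch of how data following Table 1 could be generated (the paper does not give its generator; the random seed and the number of outliers below are our own placeholders):

```python
import numpy as np

rng = np.random.default_rng(42)   # arbitrary seed for this sketch

# Mean vector, per-variable variance, and object count for each class,
# following Table 1 (equal variances, zero covariance -> diagonal covariance).
specs = {
    "A":        ((0, 0),  1.5 ** 2, 120),
    "B":        ((6, 2),  0.5 ** 2, 120),
    "C":        ((6, -1), 0.5 ** 2, 120),
    "outliers": ((6, 2),  2.0 ** 2, 12),   # outlier count not stated in the paper
}

data = {name: rng.multivariate_normal(mean, var * np.eye(2), size=n)
        for name, (mean, var, n) in specs.items()}
```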
Typical K-Medoids Algorithm (PAM) (flow chart figure, not reproduced)

The adjusted Rand index will be used as the performance measure; it was proposed by Hubert and Arabie (1985) and is popularly used for comparing clustering results. In terms of the pair counts below, the adjusted Rand index is calculated as

ARI = 2(ad − bc) / ((a + b)(b + d) + (a + c)(c + d))

where
a = number of pairs which are in the identical cluster of the compared clustering solution, for pairs of objects in a certain cluster of the correct clustering solution;
b = number of pairs which are not in the identical cluster of the compared clustering solution, for pairs of objects in a certain cluster of the correct clustering solution;
c = number of pairs which are not in the identical cluster of the correct clustering solution, for pairs of objects in a certain cluster of the compared clustering solution;
d = number of pairs which are not in the identical cluster of either the correct clustering solution or the compared clustering solution.
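With a, b, c and d defined this way, the index can be computed directly by counting pairs. A straightforward O(n²) sketch of our own follows (scikit-learn's adjusted_rand_score can serve as a cross-check):

```python
from itertools import combinations

def adjusted_rand_index(correct, compared):
    """Adjusted Rand index from the pair counts a, b, c, d defined above."""
    a = b = c = d = 0
    for i, j in combinations(range(len(correct)), 2):
        same_correct = correct[i] == correct[j]
        same_compared = compared[i] == compared[j]
        if same_correct and same_compared:
            a += 1          # together in both solutions
        elif same_correct:
            b += 1          # together only in the correct solution
        elif same_compared:
            c += 1          # together only in the compared solution
        else:
            d += 1          # apart in both solutions
    return 2 * (a * d - b * c) / ((a + b) * (b + d) + (a + c) * (c + d))

print(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 1]))   # 1.0 for identical clusterings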
Features of the K-Medoid Algorithm:
• It operates on the dissimilarity matrix of the given data set; when it is presented with an n × p data matrix, the algorithm first computes a dissimilarity matrix.
• It is more robust, because it minimizes a sum of dissimilarities instead of a sum of squared Euclidean distances.
• It provides a novel graphical display, the silhouette plot, which allows the user to select the optimal number of clusters (see the sketch after this list).
• However, PAM lacks scalability for very large databases and presents high time and space complexity.
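As an aside, the silhouette idea is easy to try with standard tooling; this sketch applies scikit-learn's silhouette_score to the ten-object example clustered earlier (our stand-in, not the PAM-native silhouette display the text refers to):

```python
import numpy as np
from sklearn.metrics import silhouette_score

X = np.array([(2, 6), (3, 4), (3, 8), (4, 7), (6, 2),
              (6, 4), (7, 3), (7, 4), (8, 5), (7, 6)])
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 1])   # clusters found in Step 1

# Mean silhouette width; comparing this value across candidate values
# of k is one way to select the number of clusters.
print(silhouette_score(X, labels, metric="manhattan"))
```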
Unsupervised classification, or clustering as it is more often referred to, is a data mining activity that aims to differentiate groups (classes or clusters) inside a given set of objects; it is considered the most important unsupervised learning problem. The resulting subsets or groups, distinct and non-empty, are to be built so that the objects within each cluster are more closely related to one another than to objects assigned to different clusters. Central to the clustering process is the notion of degree of similarity (or dissimilarity) between the objects.

Let O = {O1, O2, ..., On} be the set of objects to be clustered. The measure used for discriminating objects can be any metric or semi-metric function; the distance expresses the dissimilarity between objects. The partitioning process is iterative and heuristic; it stops when a "good" partitioning is achieved. Finding a "good" partitioning coincides with optimizing a criterion function defined either locally (on a subset of the objects) or globally (over all of the objects, as in k-means).
These algorithms try to minimize certain criteria (a squared error function in K-Means); the squared error criterion tends to work well with isolated and compact clusters. In the k-medoids or PAM (Partitioning Around Medoids) algorithm, each cluster is represented by one of the objects in the cluster. It finds representative objects, called medoids, in clusters. The algorithm starts with k initial representative objects for the clusters (the medoids), then iteratively recalculates the clusters (each object is assigned to the closest medoid) and their medoids until convergence is achieved. At a given step, a medoid of a cluster is replaced with a non-medoid if doing so improves the total distance of the resulting clustering.
It is observed that the average time for the normal distribution is greater than the average time for the uniform distribution. This is true for both algorithms, K-Means and K-Medoids.

If the number of data points is small, the K-Means algorithm takes less execution time. But when the number of data points is increased to the maximum, the K-Means algorithm takes the maximum time, and the K-Medoids algorithm performs reasonably better than K-Means. A characteristic feature of the K-Medoids algorithm is that it requires the distance between every pair of objects only once and uses this distance at every stage of iteration.
Conclusion
Since measuring similarity between data objects is simpler than mapping data objects to data points in feature space, these pairwise-similarity-based clustering algorithms can greatly reduce the difficulty of developing clustering-based pattern recognition applications. The advantage of the K-means algorithm is its favourable execution time. Its drawback is that the user has to know in advance how many clusters are to be searched for. It is observed that the K-means algorithm is efficient for smaller data sets, while K-medoids seems to perform better for large data sets.
Bibliography
[1] J. Han, M. Kamber, Data Mining: Concepts and Techniques, 2nd Edition, Morgan Kaufmann, 2006, http://www.cs.sfu.ca/~han/dmbook
[2] I. H. Witten, E. Frank, Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition, Morgan Kaufmann, 2005, http://www.cs.waikato.ac.nz/~ml/weka/book.html
[3] Pang-Ning Tan, Michael Steinbach, and Vipin Kumar, Introduction to Data Mining, 1st Edition, Addison Wesley, May 2, 2005.
[4] Margaret H. Dunham, Data Mining: Introductory and Advanced Topics, Prentice Hall, 2003, http://lyle.smu.edu/~mhd/book
[5] Hae-Sang Park, Jong-Seok Lee and Chi-Hyuck Jun, "A K-means-like Algorithm for K-Medoids Clustering and Its Performance", Department of Industrial and Management Engineering, POSTECH, South Korea, [email protected], [email protected], [email protected]
[6] T. Velmurugan and T. Santhanam, "Computational Complexity between K-Means and K-Medoids Clustering Algorithms for Normal and Uniform Distributions of Data Points", Department of Computer Science, DG Vaishnav College, Chennai, India.
[7] "Best of Both: A Hybridized Centroid-Medoid Clustering Heuristic", [email protected], Michael [email protected], National Institute of Informatics, Japan.
[8] Douglas Roberts, "K-Medoids: CUDA Implementation", May 21, 2009.
[9] International Journal of Database Management Systems (IJDMS), Vol. 3, No. 1, February 2011.
[10] Xindong Wu, Vipin Kumar, J. Ross Quinlan, "Top 10 Algorithms in Data Mining".
[11] T. Soni Madhu Latha, "Comparison between K-Means & K-Medoids Clustering Algorithms", International Journal of Advanced Computing, April 2011.
[12] Yixin Chen, "Clustering Parallel Data Streams", Department of Computer Science, Washington University, St. Louis, Missouri.
[13] Gabriela Czibula, Grigoreta Sofia Cojocar, Istvan Gergely Czibula, "A Partitional Clustering Algorithm for Crosscutting Concerns Identification".
[14] Data Mining and Soft Computing Research Group on Soft Computing and Information Intelligent Systems (SCI2S), Dept. of Computer Science and A.I., University of Granada, Spain.
[15] Por-Shen Lai, Hsin-Chia Fu, "Variance Enhanced K-Medoid Clustering", Department of Computer Science, National Chiao-Tung University, Hsin-Chu 300, Taiwan, ROC.