Analysis of the efficiency of Data Clustering Algorithms on high
dimensional data
S. R. Pande 1
Associate Professor, Department of Computer Science,
Shivaji Science College, Congressnagar, Nagpur, India.
Email: [email protected]

Mrs. S. S. Pande 2
Assistant Professor, Department of Computer Applications,
Dhanwate National College, Congressnagar, Nagpur, India.
Email: [email protected]
ABSTRACT
In this paper we analyze the performance of four clustering techniques, namely Fuzzy C-means clustering, K-means clustering, Subtractive clustering and Mountain clustering, in conjunction with fuzzy modeling, on a medical problem of cancer disease diagnosis. The medical data used in this performance analysis, downloaded from the KDD database repositories, consists of 13 input attributes related to the clinical diagnosis of a cancer disease and one output attribute which indicates whether the patient is diagnosed with the cancer disease or not. The data set is partitioned into two subsets: two-thirds of the data for training and one-third for evaluation. The data set is to be partitioned into two clusters, i.e. patients diagnosed with the cancer disease and patients not diagnosed with it. Because of the high number of dimensions in the problem (13 dimensions), no visual representation of the clusters can be presented. The similarity metric used to calculate the similarity between an input vector and a cluster center is the Euclidean distance. Since most similarity metrics are sensitive to large ranges of the elements in the input vectors, each input variable must be normalized to the unit interval [0, 1], i.e. the data set has to be normalized to lie within the unit hypercube.
Each clustering technique is presented with the training data set and is implemented and analyzed in MATLAB. The performance of these techniques is presented and compared. The results of the experiments clearly show that K-means clustering outperforms the other techniques for this type of problem.
Keywords: Clustering, K-means Clustering, Fuzzy C-means Clustering, Mountain Clustering, Subtractive Clustering techniques
1. INTRODUCTION
Data mining is the process of extracting
previously unknown, valid and actionable information
from large databases and then using the information to
make crucial business decisions. In essence, data
mining is distinguished by the fact that it is aimed at the
discovery of information, without a previously
formulated hypothesis [1]. Data clustering [8], [14] plays
an important role in many disciplines, including data
mining, machine learning, bioinformatics, pattern
recognition, and other fields, where there is a need to
learn the inherent grouping structure of data in an
unsupervised manner. There are many clustering
approaches proposed in the literature with different
quality/complexity tradeoffs. Each clustering algorithm
works on its domain space with no optimum solution to
all datasets of different properties, sizes, structures, and
distributions. Challenges in data clustering include identifying the proper number of clusters, scalability of the clustering approach, robustness to noise, tackling distributed datasets, and handling clusters of different configurations.
In this paper, K-means Clustering, Fuzzy C-means Clustering, Mountain Clustering and Subtractive
Clustering techniques are reviewed. These techniques
are usually used in conjunction with radial basis
function networks (RBFNs) and Fuzzy Modeling.
Those four techniques are implemented and tested
against a medical diagnosis problem for cancer disease.
The results are presented with a comprehensive
comparison of the different techniques and the effect of
different parameters in the process.
2. DATA CLUSTERING OVERVIEW
The term cluster analysis (CA) was first used
by Tryon in 1939 [2] to denominate the group of
different algorithms and methods for grouping objects
of similar kind into respective categories. The main goal of clustering is to find proper and well-separated clusters of the objects. Cluster analysis groups objects (observations, events) based on the information found in the data describing the objects or their relationships. The aim is that objects in a group should be similar (or related) to one another and different from
(or unrelated to) the objects in other groups. The greater
the similarity (or homogeneity) within a group and the
greater the difference between groups, the better the
clustering. Cluster analysis is a classification of objects
from the data, where by “classification” we mean a
labeling of objects with class (group) labels. As such,
clustering does not use previously assigned class labels,
except perhaps for verification of how well the
clustering worked. Thus, cluster analysis is sometimes
referred to as “unsupervised classification” and is
distinct from “supervised classification,” or more
commonly just “classification,” which seeks to find
rules for classifying objects given a set of pre-classified
objects.
As mentioned above, the term, cluster, does
not have a precise definition. However, several working
definitions of a cluster are commonly used and are
given below. There are two aspects of clustering that
should be mentioned in conjunction with these
definitions. First, clustering is sometimes viewed as
finding only the most “tightly” connected points while
discarding “background” or noise points. Second, it is
sometimes acceptable to produce a set of clusters where
a true cluster is broken into several subclusters (which
are often combined later, by another technique). The
key requirement in this latter situation is that the
subclusters are relatively "pure," i.e., most points in a subcluster are from the same "true" cluster.
The common approach of all the clustering
techniques presented here is to find cluster centers that
will represent each cluster. A cluster center is a way to
tell where the heart of each cluster is located, so that
later when presented with an input vector, the system
can tell which cluster this vector belongs to by
measuring a similarity metric between the input vector
and all the cluster centers, and determining which cluster
is the nearest or most similar one.
Some of the clustering techniques rely on
knowing the number of clusters a priori. In that case the
algorithm tries to partition the data into the given
number of clusters. K-means [12] and Fuzzy C-means [9] clustering are of that type. In other cases it is
not necessary to have the number of clusters known
from the beginning; instead the algorithm starts by
finding the first large cluster, and then goes to find the
second, and so on. Mountain and Subtractive
clustering [14] are of that type. In both cases a problem with a known number of clusters can be handled; however, if the number of clusters is not known, K-means and Fuzzy C-means clustering cannot be used. A brief
overview of the four techniques is presented here. Full
detailed discussion will follow in the next section.
K-means is perhaps the most popular
clustering method in metric spaces [3]-[5]. Initially k
cluster centroids are selected at random. k-means then
reassigns all the points to their nearest centroids and
recomputes centroids of the newly assembled groups.
The iterative relocation continues until the criterion
function, e.g. square-error, converges. Despite its wide
popularity, k-means is very sensitive to noise and
outliers since a small number of such data can
substantially influence the centroids. Other weaknesses
are sensitivity to initialization, entrapments into local
optima, poor cluster descriptors, inability to deal with clusters of arbitrary shape, size and density, and reliance on the user to specify the number of clusters.
Fuzzy C-means clustering was proposed by Bezdek [7] as an improvement over the earlier hard C-means clustering. Fuzzy clustering methods assign
degrees of membership in several clusters to each input
pattern. The resulting fuzzy partition matrix (U)
describes the relationship of the objects and the clusters.
The fuzzy partition matrix U = [\mu_{i,k}] is a c \times N matrix, where \mu_{i,k} denotes the degree of membership of x_k in cluster C_i, so the i-th row of U contains the values of the membership function of the i-th fuzzy subset of X.
Mountain clustering was proposed by Yager
and Filev [6]. This technique calculates a
mountain function (density function) at every possible
position in the data space, and chooses the position with
the greatest density value as the center of the first
cluster. It then subtracts the effect of the first cluster from the mountain function and finds the second cluster center.
This process is repeated until the desired number of
clusters have been found.
Subtractive clustering was proposed by Chiu
[6]. This technique is similar to mountain clustering,
except that instead of calculating the density function at
every possible position in the data space, it uses the
positions of the data points to calculate the density
function, thus reducing the number of calculations
significantly.
3. DATA CLUSTERING TECHNIQUES
In this section a detailed discussion of each
technique is presented. Implementation and results are
presented in the following sections.
3.1 K-means Clustering
The k-means clustering technique is one of the simplest algorithms. We assume we have a set of data points, D = (X1, ..., Xn). We first choose from these data points k initial centroids, where k is a user parameter, the number of clusters desired. Each point is then assigned to its nearest centroid. The idea is to choose random cluster centres, one for each cluster. The centroid of each cluster is then updated as the mean of the points assigned to it, and this mean becomes the new centroid. We repeat the assignment and centroid-update steps until no point changes cluster, i.e. no point moves from one cluster to another, or equivalently, each centroid remains the same.
Algorithm: K-means Clustering
1: Choose k points as initial centroids.
2: Repeat
3:   Assign each point to the closest cluster centre.
4:   Recompute the cluster centre of each cluster.
5: Until the convergence criterion is met.
Fig. 1 Algorithm K-means clustering
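To make these steps concrete, the following is a minimal NumPy sketch of the procedure in Fig. 1. The paper's actual experiments were implemented in MATLAB; the function and variable names and the centroid-movement convergence test used here are our own illustrative choices.

import numpy as np

def kmeans(X, k, max_iter=100, tol=1e-6, rng=None):
    # X: (N, d) data matrix, assumed normalized to the unit hypercube.
    rng = np.random.default_rng(rng)
    # Step 1: pick k data points at random as the initial centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 3: assign every point to its closest centroid (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 4: recompute each centroid as the mean of the points assigned to it.
        new_centroids = np.array([X[labels == i].mean(axis=0) if np.any(labels == i)
                                  else centroids[i] for i in range(k)])
        # Step 5: stop once the centroids stop moving.
        if np.linalg.norm(new_centroids - centroids) < tol:
            centroids = new_centroids
            break
        centroids = new_centroids
    sse = ((X - centroids[labels]) ** 2).sum()   # objective of Eq. (1)
    return centroids, labels, sse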
We now consider each step of the basic K-means algorithm in more detail and then provide an analysis of the algorithm's space and time complexity.
1) Assigning points to the closest centroid: To assign a point to the closest centroid we need a proximity measure that quantifies the notion of "closest" for the specific data under consideration. Euclidean distance is often used for data points. However, several other proximity measures may be appropriate for given data; for example, Manhattan distance can be used for Euclidean data, while the Jaccard measure is often used for documents. Sometimes calculating the similarity measure for each point is time consuming; in Euclidean space it is possible to avoid some of these calculations and thus speed up the K-means algorithm.
2) Centroid and objective function: Step 4 of the algorithm is "Recompute the cluster centres of each cluster". The centroid can vary depending on the goal of clustering; for example, when proximity is measured by distance, the goal of clustering is to minimize the squared distance of each point to its closest centroid, and this goal is expressed by an objective function.
3) Data in Euclidean space: Consider the case where the proximity measure is Euclidean distance. For our objective function we use the Sum of the Squared Error (SSE), which is also known as scatter; that is, we calculate the error of each data point. The SSE is formally defined as follows:
SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \mathrm{dist}(c_i, x)^2          (1)

The centroid (mean) of the ith cluster is defined by Equation (2):

c_i = \frac{1}{m_i} \sum_{x \in C_i} x          (2)

where m_i is the number of objects in the ith cluster.

Steps 3 and 4 of the algorithm directly attempt to minimize the SSE. Step 3 forms groups by assigning points to their nearest centroid, which minimizes the SSE for the given set of centroids, and Step 4 recomputes the centroids so as to further minimize the SSE. The performance of the K-means algorithm depends on the initial positions of the cluster centers, thus it is advisable to run the algorithm several times, each with a different set of initial cluster centers.

3.2 Fuzzy C-means Clustering

The fuzzy c-means algorithm (FCM) [9], [11] is one of the most widely used methods in fuzzy clustering. The fuzzy c-means algorithm is very similar to the k-means algorithm, but in contrast to the k-means algorithm it aims to find a fuzzy partitioning of the data set. The objective of the FCM algorithm is to minimize the fuzzy c-means cost function formulated as:

J(X, U, V) = \sum_{i=1}^{c} \sum_{k=1}^{N} [\mu_{i,k}]^m \|x_k - v_i\|^2          (3)

where U = [\mu_{i,k}] is a fuzzy partition matrix, V = \{v_1, v_2, \dots, v_c\}, v_i \in \mathbb{R}^n, is the set of the cluster centers, \|x_k - v_i\| is a dissimilarity measure between the object x_k and the center v_i, and m is a weighting parameter that determines the fuzziness of the resulting clusters. The value of the cost function (3) is a measure of the total weighted within-group squared error.

The FCM algorithm is an iterative process similar to the k-means algorithm. As initialization it generates the membership matrix U with random values. The two-step iterative process works as follows. First the cluster centers are calculated. The cluster centers v_i are given as the weighted means of the data items that belong to a cluster, where the weights are the membership degrees. This can be formulated as follows:

v_i = \frac{\sum_{k=1}^{N} (\mu_{i,k})^m x_k}{\sum_{k=1}^{N} (\mu_{i,k})^m}, \quad 1 \le i \le c          (4)

In the next step FCM updates the fuzzy membership values based on the following formula:

\mu_{i,k} = \frac{1}{\sum_{j=1}^{c} \left( \frac{\|x_k - v_i\|}{\|x_k - v_j\|} \right)^{2/(m-1)}}, \quad 1 \le i \le c, \ 1 \le k \le N          (5)

The iteration terminates when the difference between the fuzzy partition matrices in two consecutive iterations is lower than a predefined threshold, or a predefined number of iterations is reached.

Algorithm: Fuzzy c-means
Given the data set X, choose the number of clusters 1 < c < N, the weighting exponent m > 1 and the termination tolerance \epsilon > 0. Initialize the fuzzy partition matrix randomly, such that U^{(0)} \in M_{fc}.
Repeat for t = 1, 2, ...
Step 1: Calculate the cluster centers v_i^{(t)} for all 1 \le i \le c with U^{(t-1)}:
  v_i^{(t)} = \frac{\sum_{k=1}^{N} (\mu_{i,k}^{(t-1)})^m x_k}{\sum_{k=1}^{N} (\mu_{i,k}^{(t-1)})^m}, \quad 1 \le i \le c
Step 2: Update the fuzzy partition matrix. If x_k = v_i^{(t)} then \mu_{i,k}^{(t)} = 1, else
  \mu_{i,k}^{(t)} = \frac{1}{\sum_{j=1}^{c} \left( \frac{\|x_k - v_i^{(t)}\|}{\|x_k - v_j^{(t)}\|} \right)^{2/(m-1)}}, \quad 1 \le i \le c, \ 1 \le k \le N
Until \|U^{(t)} - U^{(t-1)}\| < \epsilon
Fig. 2 Algorithm Fuzzy c-means clustering

As in K-means clustering, the performance of FCM depends on the initial membership matrix values; it is therefore advisable to run the algorithm several times, each starting with different values of the membership grades of the data points.
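For concreteness, the update rules (4) and (5) and the loop of Fig. 2 can be sketched as follows. This is an illustrative NumPy version under our own naming; the reported experiments were carried out in MATLAB.

import numpy as np

def fuzzy_c_means(X, c, m=2.0, eps=1e-5, max_iter=200, rng=None):
    # X: (N, d) data; returns centers V (c, d) and memberships U (c, N).
    rng = np.random.default_rng(rng)
    N = X.shape[0]
    U = rng.random((c, N))
    U /= U.sum(axis=0, keepdims=True)            # each column of U sums to one
    for _ in range(max_iter):
        Um = U ** m
        # Eq. (4): centers are membership-weighted means of the data items.
        V = (Um @ X) / Um.sum(axis=1, keepdims=True)
        # Eq. (5): update memberships from the squared distances to the centers.
        d2 = ((X[None, :, :] - V[:, None, :]) ** 2).sum(axis=2) + 1e-12
        p = d2 ** (1.0 / (m - 1.0))              # equals ||x_k - v_i||^{2/(m-1)}
        U_new = 1.0 / (p * (1.0 / p).sum(axis=0, keepdims=True))
        # Terminate when the partition matrix barely changes between iterations.
        if np.abs(U_new - U).max() < eps:
            U = U_new
            break
        U = U_new
    return V, U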
3.3 Mountain Clustering
The mountain clustering approach is a simple
way to find cluster centers based on a density measure
called the mountain function. This method is a simple
way to find approximate cluster centers, and can be
used as a preprocessor for other sophisticated clustering
methods.
The first step in mountain clustering involves
forming a grid on the data space, where the
intersections of the grid lines constitute the potential
cluster centers, denoted as a set V .
The second step entails constructing a mountain
function representing a data density measure. The
height of the mountain function at a point 𝑣 ∈ 𝑉 is
equal to
m(v) = \sum_{i=1}^{N} \exp\left( - \frac{\|v - x_i\|^2}{2\sigma^2} \right)          (6)

where x_i is the ith data point and \sigma is an application-specific constant.
This equation states that the data density measure at a
point 𝑣 is affected by all the points 𝑥𝑖 in the data set,
and this density measure is inversely proportional to the
distance between the data points 𝑥𝑖 and the point under
consideration 𝑣 . The constant 𝜎 determines the height
as well as the smoothness of the resultant mountain
function.
The third step involves selecting the cluster
centers by sequentially destructing the mountain
function. The first cluster center 𝑐1 is determined by
selecting the point with the greatest density measure.
Obtaining the next cluster center requires eliminating
the effect of the first cluster. This is done by revising
the mountain function: a new mountain function is
formed by subtracting a scaled Gaussian function
centered at 𝑐1 :
m_{new}(v) = m(v) - m(c_1) \exp\left( - \frac{\|v - c_1\|^2}{2\beta^2} \right)          (7)
The subtracted amount eliminates the effect of
the first cluster. Note that after subtraction, the new
mountain function 𝑚𝑛𝑒𝑤 (𝑣) reduces to zero at 𝑣 = 𝑐1 .
After subtraction, the second cluster center is selected
as the point having the greatest value for the new
mountain function. This process continues until a
sufficient number of cluster centers is attained.
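The three steps above can be condensed into the following illustrative sketch of Eqs. (6) and (7). The grid construction and the values of sigma and beta are assumptions made for the example only; the comment on grid growth anticipates the cost analysis in Section 4.3.

import numpy as np
from itertools import product

def mountain_clustering(X, k, grid_per_dim=10, sigma=0.1, beta=0.15):
    # X: (N, d) data normalized to the unit hypercube. The grid grows as s^d,
    # which is exactly why this method becomes impractical in high dimensions.
    d = X.shape[1]
    axes = [np.linspace(0.0, 1.0, grid_per_dim)] * d
    V = np.array(list(product(*axes)))           # candidate centers: all grid points
    # Eq. (6): mountain (density) function evaluated at every grid point.
    dist2 = ((V[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    m = np.exp(-dist2 / (2.0 * sigma ** 2)).sum(axis=1)
    centers = []
    for _ in range(k):
        best = m.argmax()
        centers.append(V[best])
        # Eq. (7): subtract a scaled Gaussian centered at the newly found center.
        m = m - m[best] * np.exp(-((V - V[best]) ** 2).sum(axis=1) / (2.0 * beta ** 2))
    return np.array(centers)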
3.4 Subtractive Clustering
The problem with the previous clustering
method, mountain clustering, is that its computation
grows exponentially with the dimension of the problem;
that is because the mountain function has to be
evaluated at each grid point. Subtractive clustering
solves this problem by using data points as the
candidates for cluster centers, instead of grid points as
in mountain clustering. This means that the computation
is now proportional to the problem size instead of the
problem dimension. However, the actual cluster centers
are not necessarily located at one of the data points, but
in most cases it is a good approximation, especially
with the reduced computation this approach introduces. The density measure at a data point x_i is defined as:

D_i = \sum_{j=1}^{n} \exp\left( - \frac{\|x_i - x_j\|^2}{(r_a/2)^2} \right)          (8)
where 𝑟𝑎 is a positive constant representing a
neighborhood radius. Hence, a data point will have a
high density value if it has many neighboring data
points. The first cluster center 𝑥𝑐1 is chosen as the point
having the largest density value 𝐷𝑐1 . Next, the density
measure of each data point 𝑥𝑖 is revised as follows:
D_i \leftarrow D_i - D_{c_1} \exp\left( - \frac{\|x_i - x_{c_1}\|^2}{(r_b/2)^2} \right)          (9)
where 𝑟𝑏 is a positive constant which defines a
neighborhood that has measurable reductions in density
measure. Therefore, the data points near the first cluster
center 𝑥𝑐1 will have significantly reduced density
measure. After revising the density function, the next
cluster center is selected as the point having the greatest
density value. This process continues until a sufficient number of clusters is attained.
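A compact sketch of Eqs. (8) and (9) is given below. Stopping after a fixed number of clusters k, and fixing r_b = 1.5 r_a (the rule used later in the experiments of Section 4.4), are simplifying assumptions for this example rather than part of the original formulation.

import numpy as np

def subtractive_clustering(X, k, ra=0.5):
    # X: (N, d) normalized data; the data points themselves are the candidate centers.
    rb = 1.5 * ra                                 # revision radius, assumed as in Sec. 4.4
    dist2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    # Eq. (8): density of each data point, driven by its neighbours within radius ra.
    D = np.exp(-dist2 / (ra / 2.0) ** 2).sum(axis=1)
    centers = []
    for _ in range(k):
        c = D.argmax()
        centers.append(X[c])
        # Eq. (9): reduce the density of points close to the newly selected center.
        D = D - D[c] * np.exp(-((X - X[c]) ** 2).sum(axis=1) / (rb / 2.0) ** 2)
    return np.array(centers)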
4. IMPLEMENTATION AND RESULTS
After discussions of the different clustering
techniques and their mathematical foundations, we now
turn to the practical study. This study involves the
implementation of each of these techniques on a set of
medical data related to cancer disease diagnosis
problem. The medical data used consists of 13 input
attributes related to clinical diagnosis of a cancer
disease, and one output attribute which indicates
whether the patient is diagnosed with the cancer disease
or not. The whole data set consists of 300 cases. The
data set is partitioned into two data sets: two-thirds of
the data for training, and one-third for evaluation. The
number of clusters into which the data set is to be
partitioned is two clusters; i.e. patients diagnosed with
the cancer disease, and patients not diagnosed with the
cancer disease. Because of the high number of
dimensions in the problem (13-dimensions), no visual
representation of the clusters can be presented; only 2-D or 3-D clustering problems can be visually inspected.
We will rely heavily on performance measures to
evaluate the clustering techniques rather than on visual
approaches. As mentioned earlier, the similarity metric
used to calculate the similarity between an input vector
and a cluster center is the Euclidean distance. Since
most similarity metrics are sensitive to the large ranges
of elements in the input vectors, each of the input
variables must be normalized to within the unit interval [0, 1], i.e. the data set has to be normalized to lie within the unit hypercube.
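In practice this normalization can be done by a simple min-max scaling of each attribute. The small sketch below assumes, purely for illustration, that the training-set minima and maxima are reused for the evaluation set; the paper does not state how its scaling was implemented.

import numpy as np

def normalize_unit_hypercube(train, test):
    # Min-max scale every attribute to [0, 1] using the training-set ranges.
    lo, hi = train.min(axis=0), train.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)        # guard against constant attributes
    scale = lambda A: np.clip((A - lo) / span, 0.0, 1.0)
    return scale(train), scale(test)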
Each clustering algorithm is presented with the
training data set, and as a result two clusters are
produced. The data in the evaluation set is then tested
against the found clusters and an analysis of the results
is conducted. The following sections present the results
of each clustering technique, followed by a comparison
of the four techniques.
4.1 K-means Clustering
As mentioned in the previous section, K-means clustering works on finding the cluster centers by trying to minimize a cost function J. It alternates between updating the membership matrix and updating the cluster centers using Equations (1) and (2), respectively, until no further improvement in the cost function is noticed. Since the algorithm initializes the cluster centers randomly, its performance is affected by those initial cluster centers, so several runs of the algorithm are advised to obtain better results.
Evaluating the algorithm is realized by testing
the accuracy of the evaluation set. After the cluster
centers are determined, the evaluation data vectors are
assigned to their respective clusters according to the
distance between each vector and each of the cluster
centers. An error measure is then calculated; the root
mean square error (RMSE) is used for this purpose.
An accuracy measure is also calculated as the percentage of correctly classified vectors. The algorithm was run 10 times to determine the best performance. Table 1 lists the results of those runs.
Fig. 3 shows a plot of the cost function over time for the
best test case.
Table 1. K-means Clustering Performance Results

Test | No. of iterations | RMSE  | Accuracy | Regression line slope
1    | 8                 | 0.479 | 79%      | 0.561
2    | 7                 | 0.479 | 79%      | 0.561
3    | 8                 | 0.446 | 81%      | 0.610
4    | 6                 | 0.468 | 77%      | 0.564
5    | 4                 | 0.630 | 60%      | 0.385
6    | 4                 | 0.691 | 50%      | 0.067
7    | 3                 | 0.691 | 50%      | 0.056
8    | 8                 | 0.446 | 81%      | 0.611
9    | 10                | 0.446 | 81%      | 0.611
10   | 8                 | 0.460 | 79%      | 0.562
To further measure how accurately the identified clusters represent the actual classification of the data, a regression analysis of the resultant clustering against the original classification is performed. Performance is considered better if the regression line slope is close to 1.
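The three measures (RMSE, accuracy and regression-line slope) can be computed as in the sketch below, in which each evaluation vector is assigned to its nearest cluster center. The exact scoring code used in the experiments is not given in the paper, so this is only an illustrative reconstruction that assumes the cluster indices have already been matched to the diagnosis labels.

import numpy as np

def evaluate_clustering(centers, X_eval, y_eval):
    # centers: (2, d) cluster centers whose indices are assumed already matched
    # to the diagnosis labels (0 = not diagnosed, 1 = diagnosed).
    dists = np.linalg.norm(X_eval[:, None, :] - centers[None, :, :], axis=2)
    y_pred = dists.argmin(axis=1)                 # nearest-center assignment
    rmse = np.sqrt(np.mean((y_pred - y_eval) ** 2))
    accuracy = np.mean(y_pred == y_eval)
    # Slope of the regression of the resulting clustering on the original labels.
    slope = np.polyfit(y_eval, y_pred, 1)[0]
    return rmse, accuracy, slope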
Fig. 3: K-means cost function history
As seen from the results, the best case achieved 81% accuracy with an RMSE of 0.446. This relatively moderate performance is related to the high dimensionality of the problem; too many dimensions tend to disrupt the coupling of the data and introduce overlap in some of these dimensions, which reduces the accuracy of clustering. It is also noticed that the cost function converges rapidly to a minimum value, as seen from the small number of iterations in each test run. However, this has no effect on the accuracy measure.
4.2 Fuzzy C-means Clustering
FCM allows for data points to have different
degrees of membership to each of the clusters; thus
eliminating the effect of hard membership introduced
by K-means clustering. This approach employs fuzzy
measures as the basis for membership matrix
calculation and for cluster centers identification.
As is the case in K-means clustering, FCM starts by assigning random values to the membership matrix U, thus several runs have to be conducted to have a higher probability of getting good performance. However, the results showed no (or insignificant) variation in performance or accuracy when the algorithm was run several times. For testing the results, every vector in the evaluation data set is assigned to one of the clusters with a certain degree of belongingness (as done for the training set). However,
because the output values we have are crisp values
(either 1 or 0), the evaluation set degrees of
membership are defuzzified to be tested against the
actual outputs.
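Here defuzzification simply means taking, for each evaluation vector, the cluster with the highest membership degree, for example:

import numpy as np

def defuzzify(U_eval):
    # U_eval: (c, N) fuzzy membership matrix of the evaluation vectors.
    # The crisp label of a vector is the cluster with the largest membership degree.
    return np.argmax(U_eval, axis=0)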
The same performance measures applied in K-means clustering are used here; however, only the effect of the weighting exponent m is analyzed, since the random initial membership grades have an insignificant effect on the final cluster centers. Table 2 lists the results of the tests showing the effect of varying the weighting exponent m. It is noticed that very low or very high values of m reduce the accuracy; moreover, high values tend to increase the time taken by the algorithm to find the clusters. A value of 2 seems adequate for this problem since it gives good accuracy and requires a smaller number of iterations. Fig. 4 shows the accuracy and number of iterations against the weighting exponent.
Table 2. Fuzzy C-means Clustering Performance Results

Weighting exponent m | No. of iterations | RMSE  | Accuracy | Regression line slope
1.1                  | 19                | 0.469 | 79%      | 0.561
1.2                  | 17                | 0.469 | 79%      | 0.561
1.5                  | 18                | 0.480 | 76%      | 0.538
2                    | 20                | 0.469 | 79%      | 0.561
3                    | 25                | 0.458 | 77%      | 0.385
5                    | 28                | 0.480 | 76%      | 0.538
8                    | 34                | 0.480 | 76%      | 0.538
12                   | 37                | 0.480 | 76%      | 0.538
Fig. 4 Fuzzy C-means Clustering Performance

In general, the FCM technique showed no improvement over K-means clustering for this problem. Both showed similar accuracy; moreover, FCM was found to be slower than K-means because of the fuzzy calculations involved.

4.3 Mountain Clustering

Mountain clustering relies on dividing the data space into grid points and calculating a mountain function at every grid point. This mountain function is a representation of the density of data at this point. The performance of mountain clustering is severely affected by the dimension of the problem; the computation needed rises exponentially with the dimension of the input data because the mountain function has to be evaluated at each grid point in the data space. For a problem with k clusters, d dimensions, t data points, and a grid size of s per dimension, the required number of calculations is:

N = t \times s^d + (k - 1) s^d          (10)

So for the problem at hand, with input data of 13 dimensions, 200 training inputs, and a grid size of 10 per dimension, the required number of mountain function calculations is approximately 2.011 \times 10^{15}. In addition, the value of the mountain function needs to be stored for every grid point for later use when finding subsequent clusters, which requires s^d storage locations; for our problem this would be 10^{13} storage locations. Obviously this is impractical for a problem of this dimension.

In order to be able to test this algorithm, the dimension of the problem has to be reduced to a reasonable number, e.g. 4 dimensions. This is achieved by randomly selecting 4 variables from the original 13 input variables and performing the test on those variables. Several tests involving differently selected random variables are conducted in order to have a better understanding of the results. Table 3 lists the results of 10 test runs with randomly selected variables. The accuracy achieved ranged between 51% and 79% with an average of 69%, and an average RMSE of 0.5479. Those results are quite discouraging compared to the results achieved with K-means and FCM clustering. This is because not all of the variables of the input data contribute to the clustering process; only 4 are chosen at random to make it possible to conduct the tests. However, even with only 4 attributes, mountain clustering required far more time than any other technique during the tests; this is because the number of computations required is exponentially proportional to the number of dimensions in the problem, as stated in Equation (10). So apparently mountain clustering is not suitable for problems with more than two or three dimensions. Fig. 5 shows a plot of accuracy against the test runs.

Table 3. Mountain Clustering Performance Results

Test | RMSE  | Accuracy | Regression line slope
1    | 0.567 | 67%      | 0.350
2    | 0.470 | 79%      | 0.555
3    | 0.567 | 67%      | 0.344
4    | 0.500 | 75%      | 0.510
5    | 0.548 | 70%      | 0.427
6    | 0.567 | 67%      | 0.346
7    | 0.567 | 67%      | 0.346
8    | 0.528 | 72%      | 0.489
9    | 0.695 | 51%      | 0.027
10   | 0.470 | 79%      | 0.555

Fig. 5 Mountain Clustering Performance

4.4 Subtractive Clustering

This method is similar to mountain clustering, with the difference that the density function is calculated only at every data point, instead of at every grid point, so the data points themselves are the candidates for cluster centers. This has the effect of reducing the number of computations significantly, making it proportional to the number of input data points instead of exponentially proportional to the problem dimension. For a problem of k clusters and t data points, the required number of calculations is:

N = t^2 + (k - 1) t          (11)

As seen from the equation, the number of calculations does not depend on the dimension of the problem. For the problem at hand, with t = 200 training points and k = 2 clusters, this amounts to roughly 4 \times 10^4 calculations. Since the algorithm is fixed and does not rely on any randomness, the results are fixed. However, we can test the effect of the two variables r_a and r_b on the accuracy of the algorithm. These variables represent a radius of neighborhood beyond which the effect (or contribution) of other data points to the density function is diminished. Usually the r_b variable is taken to be 1.5 r_a. Table 4 shows the results of varying r_a, and Fig. 6 shows a plot of accuracy and RMSE against r_a.
Table 4. Subtractive Clustering Performance Results

Neighborhood radius r_a | RMSE  | Accuracy | Regression line slope
0.1                     | 0.678 | 54%      | 0.0994
0.2                     | 0.645 | 57%      | 0.1925
0.3                     | 0.645 | 57%      | 0.1925
0.4                     | 0.499 | 76%      | 0.5070
0.5                     | 0.499 | 76%      | 0.5070
0.6                     | 0.499 | 76%      | 0.5070
0.7                     | 0.499 | 76%      | 0.5070
0.8                     | 0.499 | 76%      | 0.5070
0.9                     | 0.645 | 57%      | 0.1925
Fig. 6 Subtractive Clustering Performance
It is clear from the results that choosing r_a very small or very large results in poor accuracy: if r_a is chosen very small, the density function does not take into account the effect of neighboring data points, while if it is taken very large, the density function is affected by all the data points in the data space. So a value between 0.4 and 0.7 should be adequate for the radius of neighborhood. As seen from Table 4, the maximum achieved accuracy was 76% with an RMSE of 0.499. Compared to K-means and FCM, this result falls slightly behind the accuracy achieved by those techniques.
5. RESULTS AND DISCUSSION
According to the previous discussion of the
implementation of the four data clustering techniques
and their results, it is useful to summarize the results
and present some comparison of performances. A
summary of the best achieved results for each of the
four techniques is presented in Table 5.
Table 5. Comparison of Performance

Comparison Parameters | K-means | FCM   | Mountain | Subtractive
RMSE                  | 0.440   | 0.469 | 0.470    | 0.499
Accuracy              | 81%     | 79%   | 79%      | 76%
Regression line slope | 0.610   | 0.561 | 0.555    | 0.5070
Time (sec)            | 0.8     | 2.3   | 117.0    | 3.60
From this comparison we can conclude that K-means clustering produces higher accuracy and lower RMSE than the other techniques, and requires less computation time. Mountain clustering performs very poorly, given the huge number of computations it requires and its low accuracy. However, we have to note that the tests conducted on mountain clustering used only part of the input variables in order to make it feasible to run the tests. Mountain clustering is suitable only for problems with two or three dimensions.
FCM produces close results to K-means
clustering, yet it requires more computation time
than K-means because of the fuzzy measures
calculations involved in the algorithm. In subtractive
clustering, care has to be taken when choosing the value
of the neighborhood radius 𝑟𝑎 , since too small radii
will result in neglecting the effect of neighboring data
points, while large radii will result in a neighborhood of
all the data points, thus canceling the effect of the cluster. Since none of the algorithms achieved a sufficiently high accuracy rate, it is assumed that the problem data itself contains some overlap in some of the dimensions; the high number of dimensions tends to disrupt the coupling of the data and reduce the accuracy of clustering.
As stated earlier in this paper, clustering
algorithms are usually used in conjunction with radial
basis function networks and fuzzy models. The
techniques described here can be used as preprocessors
for RBF networks for determining the centers of the
radial basis functions. In such cases, more accuracy can
be gained by using gradient descent or other advanced derivative-based optimization schemes for further refinement.
6. CONCLUSION

Four clustering techniques have been reviewed in this paper, namely K-means clustering, Fuzzy C-means clustering, Mountain clustering, and Subtractive clustering. These approaches solve the problem of categorizing data by partitioning a data set into a number of clusters based on some similarity measure, so that the similarity within each cluster is larger than between clusters. The four methods have been implemented and tested against a data set for the medical diagnosis of cancer disease. The comparative study done here is concerned with the accuracy of each algorithm, with care being taken toward the efficiency of calculation and other performance measures.

The medical problem presented has a high number of dimensions, which might involve some complicated relationships between the variables in the input data. It was obvious that mountain clustering is not one of the good techniques for problems with this high number of dimensions, due to its exponential proportionality to the dimension of the problem. K-means clustering seemed to outperform the other techniques for this type of problem. However, in other problems where the number of clusters is not known, K-means and FCM cannot be used, leaving the choice only to mountain or subtractive clustering. Subtractive clustering seems to be a better alternative to mountain clustering since it is based on the same idea and uses the data points as cluster center candidates instead of grid points; however, mountain clustering can lead to better results if the grid granularity is small enough to capture the potential cluster centers, but with the side effect of increasing the computation needed for the larger number of grid points.

Finally, the clustering techniques discussed here do not have to be used as stand-alone approaches; they can be used in conjunction with other neural or fuzzy systems for further refinement of the overall system performance.

REFERENCES

1. Cabena, P., Hadjinian, P., Stadler, R., Verhees, J., and Zanasi, A., Discovering Data Mining: From Concepts to Implementation, Prentice Hall, Upper Saddle River, NJ, 1997.
2. J. Han and M. Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann, 2006.
3. J. Hartigan and M. Wong, "Algorithm AS 136: A k-means clustering algorithm," Applied Statistics, 28:100-108, 1979.
4. A. K. Jain and R. C. Dubes, Algorithms for Clustering Data, Prentice-Hall, 1988.
5. J. MacQueen, "Some methods for classification and analysis of multivariate observations," in Proc. 5th Berkeley Symp. Math. Statist. Prob., pages 281-297, 1967.
6. Jang, J.-S. R., Sun, C.-T., Mizutani, E., Neuro-Fuzzy and Soft Computing: A Computational Approach to Learning and Machine Intelligence, Prentice Hall.
7. J. Bezdek, Pattern Recognition with Fuzzy Objective Function Algorithms, Plenum Press, New York, 1981.
8. Azuaje, F., Dubitzky, W., Black, N., Adamson, K., "Discovering Relevance Knowledge in Data: A Growing Cell Structures Approach," IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, Vol. 30, No. 3, June 2000, pp. 448.
9. Lin, C., Lee, C., Neural Fuzzy Systems, Prentice Hall, NJ, 1996.
10. Tsoukalas, L., Uhrig, R., Fuzzy and Neural Approaches in Engineering, John Wiley & Sons, Inc., NY, 1997.
11. Nauck, D., Kruse, R., Klawonn, F., Foundations of Neuro-Fuzzy Systems, John Wiley & Sons Ltd., NY, 1997.
12. J. A. Hartigan and M. A. Wong, "A k-means clustering algorithm," Applied Statistics, 28:100-108, 1979.
13. The MathWorks, Inc., "Fuzzy Logic Toolbox - For Use With MATLAB," The MathWorks, Inc., 1999.
14. Jiawei Han and M. Kamber, Data Mining Concepts and Techniques, Morgan Kaufmann Publishers, an imprint of Elsevier, 2006.
15. K. Krishna and M. Murty, "Genetic k-means algorithm," IEEE Transactions on Systems, Vol. 29, No. 3, 1999.
16. U. Maulik and S. Bandyopadhyay, "Genetic algorithm-based clustering technique," Pattern Recognition 33, 1999.
17. K. A. Abdul Nazeer and M. P. Sebastian, "Improving the Accuracy and Efficiency of the K-means Clustering Algorithm," WCE 2009, London.