International Journal of Electronic Business Management, Vol. 3, No. 3, pp. 202-208 (2005)
A TWO-STAGED CLUSTERING ALGORITHM FOR
MULTIPLE SCALES
Chien-Lung Chan* and Rung-Ting Chien
Department of Information Management
Yuan Ze University
Chung-Li (320), Taiwan
*Corresponding author: [email protected]
ABSTRACT
Cluster analysis is a data mining technique used to identify hidden patterns within data.
Most clustering algorithms treat different fields of data with equal weights and calculate the
“distance” using the same method. They ignore the fact that different fields of data have
different scales; therefore, the “distance” should be calculated differently. This study
combined a traditional clustering algorithm with expert subjective judgment, and used
different methods to calculate the degree of similarity for four different scales -- nominal,
ordinal, interval and ratio. This study proposes a two-staged clustering algorithm to
improve the process. In the first stage, training data was used to determine the parameters
that improved clustering quality. In the second stage, different methods were used to
calculate the degree of similarity for the four different scales of data, and different fields
were treated with unequal weights. To evaluate the outcomes of this proposed clustering method, four
standard data sets were used for testing. They were the Wisconsin Breast Cancer Data,
Contraceptive Method Choice Data, Iris Education Data, and Balance Scale Weight &
Distance Data. The results were positive; the algorithm using multi-scales resulted in a
better quality clustering. Also, the algorithm incorporating expert subjective weighting had
better accuracy in clustering.
Keywords: Data Mining, Clustering Algorithm, Multi-scales Analysis, Expert Weight
1. INTRODUCTION
Clustering is a method of grouping objects into
clusters according to their similarity [2]. A cluster is a
set of like objects. Objects from different clusters
are not alike. This method helps to discover important
attributes within the same cluster in large datasets.
Objects can be separated according to their attributes.
Objects in the same cluster share common attributes.
However, previous research mentioned the difficulty
of interpreting the outcome of clustering [7,8].
Researchers have applied clustering algorithms
to datasets for different purposes, even when using
the same clustering algorithm. This study evaluated
how expert subjective judgment affects the quality
of clustering, and investigated whether expert
subjective weightings in clustering influence the
quality of clustering. Experiments were designed to
compare the quality of clustering between equally
weighted and differently weighted attributes. Most
clustering algorithms use the same method to
calculate the “distance” between different objects
regardless of their scales (nominal, ordinal, interval
and ratio). In this study, the values of different scales
were treated differently, and experiments were
designed to compare the quality of clustering between
traditional clustering algorithms and the algorithms
used in this study.
2. CLUSTERING ALGORITHM
Knowledge discovery is a process that extracts
potentially interesting and previously unknown
information from large amounts of data [3,4]. The
data in a large database might contain hidden patterns.
According to Fayyad's (1996) study, there are six
steps in the process of knowledge discovery. The six
steps are:
1. Learning the application domain
2. Creating a target data set
3. Data cleaning and preprocessing
4. Data mining
5. Result interpretation
6. Applying the discovered knowledge
Data mining is one step in knowledge discovery,
and clustering is one method of data mining. It is
used to segment data into different clusters. Objects
have similar attributes within the same cluster.
Clustering is an unsupervised classification
method for grouping similar objects into the same
cluster. The objective is to infer common
characteristics for the objects in the same cluster. A
good clustering method produces quality clusters,
meaning a high intra-class similarity and a low
inter-class similarity. The quality of a clustering
method is also measured by its ability to discover
hidden patterns [1]. There are two kinds of clustering
methods -- hierarchical and partitioning. This study
used the k-means method (one of the most popular
partitioning methods) as the experimental method
because it produces higher-quality clusters than the
hierarchical method.
The k-means clustering method is an algorithm
for clustering n data points into k subsets so as to
minimize the cost function (usually expressed as
sum-of-squares error, SSE). It comprises a simple
re-estimation procedure, as follows [9]:
• Initialize the centroids
• Assign the data points at random to the k sets
• Compute the centroid for each set
• Iterate the three steps above until a stopping
criterion is met, which is defined as:
Minimize the cost function:

F = \sum_{i=1}^{k} \Big( \sum_{x_j \in G_i} d_{ij} \Big) = \sum_{i=1}^{k} \sum_{j=1}^{n} U_{ij} \, d_{ij}    (1)

U_{ij} = \begin{cases} 1, & \text{if } \| x_j - m_i \| \le \| x_j - m_l \|, \; \forall l \ne i \\ 0, & \text{otherwise} \end{cases}    (2)

d_{ij}: the distance between object j and the centroid of group i
x_j: object j
m_i: the center of cluster i
U_{ij}: indicator of whether object j belongs to cluster i

In equation (1), d_{ij} denotes the distance between
object j and the centroid of group i. It is the most
important factor for clustering. The similarity between
two objects is a measure of how closely they resemble
each other; dissimilarity is the opposite concept,
measured by the distance between the two objects.
The most popular distance is the Euclidean distance,
which was used in this study to calculate the similarity.
The weakness of the k-means method is that it
treats different kinds of scales the same, and uses the
same algorithm to calculate the distance between two
objects. Consequently, the k-means algorithm was
improved by treating different scales differently and
calculating the distance with different methods.
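For concreteness, the following is a minimal sketch of the standard k-means loop described above, written in Python rather than the authors' Matlab implementation; the function and variable names are illustrative only.

```python
import numpy as np

def kmeans(X, k, max_iter=100, seed=0):
    """Minimal k-means loop minimizing the SSE cost F = sum_ij U_ij * d_ij."""
    rng = np.random.default_rng(seed)
    # Initialize the centroids with k randomly chosen data points.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: U_ij = 1 for the nearest centroid, 0 otherwise.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its members.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stopping criterion: the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    sse = float(((X - centroids[labels]) ** 2).sum())
    return labels, centroids, sse

# Example usage on random two-dimensional data.
data = np.random.default_rng(1).normal(size=(100, 2))
labels, centers, cost = kmeans(data, k=3)
```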
3. TWO-STAGED CLUSTERING
A traditional k-means clustering algorithm is
designed to find clusters by assuming that all data
attributes are numeric, and thus, numeric distances
can be calculated. Researchers have tried to relax
this assumption to bring the algorithm closer to real data.
Instead of calculating the numeric distance, Huang
(1998) calculated the total mismatching when
clustering the categorical data [5,6]. Ralambondrainy
(1995) transformed the categorical data into a binary
code to calculate the distance [9]. This study used a
two-staged clustering algorithm and treated
multiple-scale data differently.
In the first stage, the training data was drawn
randomly from the database to find the cluster
parameters. Distances for different scales were
calculated using different methods. Then, the training
data was clustered with equal weights. After that, the
domain expert reviewed the outcome of clustering
and discriminant analysis was used to determine the
weights. In the second stage, the parameters derived
from the first stage were used to cluster all data. The
two-staged clustering algorithm is illustrated in
Figure 1.
Figure 1: Two-staged clustering algorithm
In this study, we tried four different methods to
cluster standard data sets. The first clustering method
is traditional K-means. Distance between two objects
was calculated using numeric values with equal
weight. The second clustering method is K-means
with different weights. Distance between two objects
was calculated using numeric values with unequal
weights. The third method is to calculate the distances
between two objects by treating different types of
scale differently. Finally, the fourth method is to use
unequal weights and multi-scales calculation
simultaneously for clustering.
When the multi-scale method was applied to
cluster data, the scale of each attribute first had to be
identified. For the interval and ratio scales, the
distance can be calculated directly using the numeric
values. For the nominal scale, the concept of
“similarity” was applied to calculate distance: if two
objects have the same value in a nominal attribute,
their distance is 0; otherwise their distance is 1. For
the ordinal scale, we first transform the original value
to a new value, (value - min value) / (max value -
min value), which represents the object's relative
location. Then we calculate the distance between two
objects using these two transformed values.
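As an illustration of these per-scale rules, the sketch below computes a weighted distance between two records: a numeric difference for interval/ratio attributes, a 0/1 match for nominal attributes, and the normalized position for ordinal attributes. The attribute names, the mixed_distance function, and the choice of a weighted Euclidean combination are our own illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative scale declaration: one entry per attribute.
# "numeric" covers both the interval and ratio scales.
SCALES = ["numeric", "nominal", "ordinal"]

def ordinal_position(value, min_value, max_value):
    # Map an ordinal value into [0, 1]: (value - min) / (max - min).
    return (value - min_value) / (max_value - min_value)

def mixed_distance(a, b, scales, weights=None, ordinal_ranges=None):
    """Weighted distance between records a and b, attribute by attribute."""
    weights = weights if weights is not None else [1.0] * len(scales)
    parts = []
    for idx, scale in enumerate(scales):
        if scale == "nominal":
            # Match/mismatch: distance 0 if the values are equal, 1 otherwise.
            d = 0.0 if a[idx] == b[idx] else 1.0
        elif scale == "ordinal":
            lo, hi = ordinal_ranges[idx]
            d = abs(ordinal_position(a[idx], lo, hi) - ordinal_position(b[idx], lo, hi))
        else:  # interval or ratio: plain numeric difference
            d = abs(a[idx] - b[idx])
        parts.append(weights[idx] * d ** 2)
    return float(np.sqrt(sum(parts)))

# Example: two records with a numeric, a nominal and a 1-5 ordinal attribute.
x = (2.5, "urban", 4)
y = (3.0, "rural", 2)
print(mixed_distance(x, y, SCALES, ordinal_ranges={2: (1, 5)}))
```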
In this study, the expert's role is to confirm the
accuracy of the clustering results in the first stage.
This is very similar to the concept of “training” in
data mining. In this stage, the expert makes a
subjective judgment as to whether each object has
been clustered into the cluster it should belong to. If
not, the expert can exclude this “misclassified” object
subjectively. After the expert's confirmation,
discriminant analysis was applied to find the weights
of these attributes. This is why we call it the expert's
subjective weighting process.
The expert's involvement is very important to
our proposed two-staged clustering method; however,
it is difficult to find domain experts for our data
sets. Therefore, instead of recruiting real domain
experts, this study chose a satisfactory alternative by
referring to the standard data sets and their associated
journal papers. In this way, the clustering result in the
first stage can be checked against the standard data
sets, which have the real values for every attribute,
including the real cluster each object should belong
to. After eliminating the “misclassified” cases by
comparison with the standard data sets, we calculated
the weights for the attributes that determine clustering.
For example, the second data set records the
contraceptive method that every subject finally chose;
therefore, we can confirm the result of clustering
without consulting real domain experts.
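The paper does not spell out how the discriminant analysis turns into attribute weights, so the following sketch shows only one plausible reading of this step, using scikit-learn's LinearDiscriminantAnalysis and assuming the first-stage cluster labels have already been matched to the reference classes; the function name and the normalization are illustrative, not the authors' Matlab code.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def expert_weights(X_train, cluster_labels, reference_classes):
    """Stage-1 weighting sketch: keep objects whose cluster agrees with the
    reference class, then derive attribute weights from discriminant analysis."""
    keep = cluster_labels == reference_classes           # drop "misclassified" objects
    lda = LinearDiscriminantAnalysis().fit(X_train[keep], reference_classes[keep])
    # One plausible weighting: mean absolute discriminant coefficient per attribute.
    w = np.abs(lda.coef_).mean(axis=0)
    return w / w.sum()                                    # normalize so the weights sum to 1
```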
4. EXPERIMENT DESIGN
To verify the effectiveness of the proposed
algorithm, results of the two-staged algorithm were
compared with traditional k-means using four
standard data sets. They were the Wisconsin Breast
Cancer Data, Contraceptive Method Choice Data, Iris
Education Data, and Balance Scale Weight &
Distance Data.
The two-staged clustering algorithm was
programmed in Matlab 6.0, and the test environment
was Windows XP on a PC (Pentium III 866 MHz with
256 MB SDRAM). The details of these four data
sets are presented in the appendix. To compare the
quality of three algorithms with the four data sets,
experiments were designed as illustrated in Table 1.
Table 1: A comparison of three algorithms with 4 data sets

Methods \ Data                           Data Set 1   Data Set 2   Data Set 3   Data Set 4
K-means                                      X            X            X            X
Two-staged Algorithm with Multi-scales       X            X            X            X
Algorithm with Different Weights             X            X            X            X
The detailed descriptions of these four data sets
are as follows:
Wisconsin breast cancer data
This data set is a breast cancer database obtained
from the University of Wisconsin Hospital in
Madison, Wisconsin. Dr. William H. Wolberg (1992)
constructed the database. Samples arrived
periodically as Dr. Wolberg reported his clinical cases.
The database therefore reflects this chronological
grouping of the data. Every record has numeric
attributes (Clump Thickness, Uniformity of Cell Size,
Uniformity of Cell Shape, Marginal Adhesion, Single
Epithelial Cell Size, Bare Nuclei, Bland Chromatin,
Normal Nucleoli, and Mitoses), and a class attribute
(benign and malignant).
Contraceptive method choice data
This dataset is a subset of the 1987 National
Indonesia Contraceptive Prevalence Survey. The
samples are married women who were either not
pregnant or did not know they were at the time of
interview. The problem was to predict the current
contraceptive method choice (no use, long-term
methods, or short-term methods) of a woman based
on her demographic and socio-economic
characteristics. There were 1473 instances with
multi-type attributes (numeric, nominal and ordinal)
including the Wife's age, Wife's education, Husband's
education, Number of children born, Wife's religion,
Wife's working status, Husband's occupation,
Standard-of-living index, and Media exposure.
Iris education data
This data set concerned educational transitions
for a sample of 500 Irish schoolchildren aged 11 in
1967. The data were collected by Greaney and
Kelleghan (1984), and reanalyzed by Raftery and
Hout (1985, 1993). There were 441 instances with
multi-type attributes (numeric, nominal and ordinal)
including Sex, DVRT (Drumcondra Verbal Reasoning
Test Score), Educational level attained, Leaving
Certificate, Prestige score for father's occupation, and
a class attribute (Type of school ).
Balance scale weight & distance data
This data set was generated to model
psychological experiments reported by Siegler (1976).
Each example was classified as having the balance
scale tip to the right, tip to the left, or balanced. There
were 625 examples with four numeric attributes (the
left weight, the left distance, the right weight, and the
right distance). There were three kinds of scales in the
four data sets:
• Only numeric attributes in the data set (data set 1).
• Multi-scale attributes in the data set (data sets 2 and 3).
• Only non-numeric attributes in the data set (data set 4).
The four data sets were clustered using four
algorithms (k-means, k-means with weight,
Multi-Scales Clustering, and Multi-Scales Clustering
with weight) to compare the influence of multi-scales
and weight. The following four criteria evaluated the
quality of the clustering algorithms.
(1) Accuracy of grouping:

F = \frac{\sum_{i=1}^{n} U_i}{n}, \qquad U_i = \begin{cases} 1 & \text{if } f_i = r_i \\ 0 & \text{otherwise} \end{cases}    (3)

f_i: the object's cluster as determined by the algorithm
r_i: the object's real class

(2) The difference between groups:
• Total difference between groups:

T = \sum_{i=1}^{n} \sum_{j=i+1}^{n} P(i, j)\, Q(i, j)    (4)

• Average difference between groups:

\Gamma = \frac{2T}{n(n-1)}    (5)

P(i, j): the distance between objects i and j
Q(i, j): the distance between clusters C_i and C_j
Object i belongs to cluster C_i, and object j belongs to
cluster C_j. When objects i and j belong to the same
cluster, Q(i, j) is equal to zero. The total difference
between groups is the summation of the products of
P(i, j) and Q(i, j). When object i is very different from
object j, a better-quality clustering algorithm will cause
P(i, j) and Q(i, j) to increase; consequently, the total
difference between groups will increase. Therefore,
\Gamma can be a criterion for evaluating the quality of
the clustering algorithm.

(3) The difference within groups:
• Total difference within groups:

D_{\text{within-group}} = \sum_{i=1}^{n} \sum_{j=1}^{k} (X_i - C_j)\, U_{ij}, \qquad U_{ij} = \begin{cases} 1 & \text{if object } i \text{ belongs to cluster } j \\ 0 & \text{otherwise} \end{cases}    (6)

• Average difference within groups:

D_{\text{within-group-avg}} = \frac{D_{\text{within-group}}}{n}    (7)

X_i: data point i; C_i: the centroid of cluster i

(4) The distance between groups' centers:
• Total distance between groups' centers:

D_c = \sum_{i=1}^{k} \sum_{j=i+1}^{k} (C_i - C_j)    (8)

• Average distance between groups' centers:

D_{c\text{-avg}} = \frac{2 D_c}{n(n-1)}    (9)

A better-quality clustering algorithm will increase the
total distance between groups' centers (D_c).
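A small sketch of how some of these criteria might be computed, following equations (3), (6)-(7) and (8)-(9) as reconstructed above (Euclidean distance, and the 2/(n(n-1)) normalization exactly as written in the paper); the function names are illustrative and this is not the authors' code.

```python
import numpy as np

def grouping_accuracy(assigned, real):
    """Accuracy of grouping, eq. (3): fraction of objects whose assigned
    cluster matches the real class (assumes clusters are matched to classes)."""
    return float(np.mean(np.asarray(assigned) == np.asarray(real)))

def within_group_avg(X, labels, centroids):
    """Average difference within groups, eqs. (6)-(7): mean distance of each
    point to the centroid of its own cluster."""
    d = np.linalg.norm(X - centroids[labels], axis=1)
    return float(d.sum() / len(X))

def center_distance_avg(centroids, n):
    """Average distance between groups' centers, eqs. (8)-(9), using the
    2 / (n(n-1)) normalization as written in the paper."""
    k = len(centroids)
    total = sum(
        np.linalg.norm(centroids[i] - centroids[j])
        for i in range(k) for j in range(i + 1, k)
    )
    return float(2.0 * total / (n * (n - 1)))
```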
5. RESULTS
5.1 Data Set with All Numeric Attributes
For the data that had only numeric attributes, a
traditional k-means created tighter clusters. This is
because k-means with equal weight would not change
the inherent structure of the data. However, k-means
with unequal weights increases the importance of
some attributes. For example, there were three objects
A (5, 2), B (5, 3) and C (7, 2) in the dataset. The
k-means algorithm grouped A and B into the same
cluster because of their similarity, and placed C into
another cluster. But after weighting the second
attribute, A and C were placed into the same cluster.
In this case, the average difference within groups and
the average difference between groups for k-means
with unequal weights were greater than those of
k-means with equal weights. Therefore, k-means
increases the similarity of objects within the same
cluster, and decreases the similarity of objects
between groups. As far as the accuracy of
classification is concerned, k-means with unequal
weights is more accurate than k-means with equal
weights (Table 2). The reason is that weighting can
strengthen the influence of significant attributes.
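A worked version of this A, B, C example (the weight value below is illustrative; the paper does not report the actual weights). With plain Euclidean distance,

\[
d(A,B) = \sqrt{(5-5)^2 + (2-3)^2} = 1, \qquad d(A,C) = \sqrt{(5-7)^2 + (2-2)^2} = 2,
\]

so A and B are grouped together. With an illustrative weight vector w = (1, 9) in a weighted Euclidean distance d_w(x, y) = \sqrt{\sum_a w_a (x_a - y_a)^2},

\[
d_w(A,B) = \sqrt{1 \cdot 0 + 9 \cdot 1} = 3, \qquad d_w(A,C) = \sqrt{1 \cdot 4 + 9 \cdot 0} = 2,
\]

so A and C are now grouped together.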
Table 2. A comparison of clustering quality between
traditional k-means and k-means with expert's weight
for Wisconsin Breast Cancer Data (n=699)

Run       Method                    K-means   K-means with Expert's Weight
1         H-BOV*                    1.1201    1.1032
          M-BOV&                    1.1337    1.1521
          Accuracy of clustering†   94.13%    94.85%
          Center distance‡          4.1510    4.1046
2         H-BOV                     1.1201    1.0894
          M-BOV                     1.1337    1.1635
          Accuracy of clustering    94.13%    94.71%
          Center distance           4.1510    4.0768
3         H-BOV                     1.1201    1.0706
          M-BOV                     1.1337    1.1815
          Accuracy of clustering    94.13%    94.28%
          Center distance           4.1510    4.0433
Average   H-BOV                     1.1201    1.0877
          M-BOV                     1.1337    1.1657
          Accuracy of clustering    94.13%    94.61%
          Center distance           4.151     4.0749

* H-BOV is the average difference between groups (Herbert Γ); the larger the better.
& M-BOV is the average difference within groups (D_within-group-avg); the smaller the better.
† Accuracy of clustering is the accuracy of object clustering (F).
‡ Center distance is the average distance between groups' centers (D_c-avg).

5.2 Data Sets with Multi Scales
Data set 2 was the Contraceptive Method Choice
Data. To be consistent, before clustering these data,
all numeric attributes were preprocessed and
standardized. In data set 3, not only were numeric
attributes standardized, but dummy variables for the
nominal attributes were also used.
Table 3. A comparison of clustering quality between
traditional k-means and k-means with multi-scale
clustering method for Contraceptive Method Choice
Data and Iris Education Data

                   Contraceptive Method       Iris Education
                   Choice Data (n=1473)       Data (n=441)
Run                K*        MSC&              K         MSC
1        H-BOV†    4.4679    5.3816            1.7291    2.3339
         M-BOV‡    3.0658    2.4035            1.8816    1.6298
2        H-BOV     4.4576    4.9506            1.8304    2.4090
         M-BOV     3.0113    2.6766            2.0620    1.8204
3        H-BOV     4.4846    5.9258            1.8087    2.3928
         M-BOV     3.0008    2.5033            1.9415    1.6962
Average  H-BOV     4.4700    5.4193            1.7894    2.3786
         M-BOV     3.0259    2.5278            1.9617    1.7155

* K is the k-means algorithm
& MSC is the Multi-Scale Cluster method
† H-BOV is the average difference between groups (Herbert Γ)
‡ M-BOV is the average difference within groups (D_within-group-avg)
Table 4: A comparison of clustering quality between
k-means with weights and multi-scale clustering
method with weights for Contraceptive Method
Choice and Iris Education Data

                                    Contraceptive Method      Iris Education
                                    Choice Data (n=1473)      Data (n=441)
Run                                 KW*       MSCW&           KW        MSCW
1        Accuracy of clustering†    39.44%    40.19%          65.31%    70.75%
         Center distance‡           3.0019    3.013           2.8922    4.1068
2        Accuracy of clustering     38.09%    40.60%          65.53%    67.35%
         Center distance            3.6976    3.4158          3.4781    1.5864
3        Accuracy of clustering     43.65%    43.25%          71.66%    71.66%
         Center distance            3.9246    4.8624          3.708     4.6272
Average  Accuracy of clustering     40.39%    41.34%          67.35%    70.07%
         Center distance            3.5414    3.7637          3.3594    3.4401

* KW is the k-means with weights
& MSCW is the multi-scale clustering method with weights
† Accuracy of clustering is the accuracy of object clustering (F)
‡ Center distance is the average distance between groups' centers (D_c-avg)
After analyzing the data, it was found that
k-means with multiple scale calculation causes
similar objects to group together and dissimilar
objects to separate. This is because
different ways
were used to calculate distance for different scales.
For the nominal scale, match and mismatch was used
to calculate the similarity. When the two values of the
nominal attributes from two records were the same,
the distance was 0. Otherwise, the distance was 1.
The mode was used to find the center of the cluster.
For the ordinal scale, the ordinal attribute was
transformed into a value between 0 and 1. The center
of the cluster was represented by the median rather
than the mean. The results show that multi-scale
calculation had a smaller average difference within
groups and a larger average difference between groups.
For the accuracy of clustering, multi-scale calculation
with weight was more accurate than k-means with
weight. Also, the center distance in multi-scale
calculation was larger than that of k-means with
weight.
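A minimal sketch of the center update just described, assuming the mean for numeric attributes, the median for ordinal attributes (after the 0-1 transformation), and the mode for nominal attributes; the function and variable names are illustrative, not from the paper.

```python
import numpy as np
from collections import Counter

def cluster_center(records, scales):
    """Per-attribute cluster center: mean for numeric, median for ordinal
    (values assumed already transformed into [0, 1]), mode for nominal."""
    center = []
    for idx, scale in enumerate(scales):
        column = [r[idx] for r in records]
        if scale == "nominal":
            center.append(Counter(column).most_common(1)[0][0])   # mode
        elif scale == "ordinal":
            center.append(float(np.median(column)))               # median
        else:
            center.append(float(np.mean(column)))                 # mean
    return tuple(center)

# Example usage with one numeric, one nominal and one ordinal attribute.
members = [(2.5, "urban", 0.25), (3.0, "urban", 0.50), (2.0, "rural", 0.50)]
print(cluster_center(members, ["numeric", "nominal", "ordinal"]))
```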
Table 5: A comparison of clustering quality between
k-means with weights and multi-scale clustering
method for Balance Scale Weight & Distance Data
(n=625)

Run                K*        MSC&
1        H-BOV†    9.4262    10.0080
         M-BOV‡    3.9278    3.9408
2        H-BOV     8.1604    9.4743
         M-BOV     3.9205    3.7904
3        H-BOV     6.7353    8.2132
         M-BOV     4.1727    4.0704
Average  H-BOV     8.1073    9.2318
         M-BOV     4.007     3.9338

* K is the k-means algorithm
& MSC is the multi-scale clustering method
† H-BOV is the average difference between groups (Herbert Γ)
‡ M-BOV is the average difference within groups (D_within-group-avg)
5.3 Data Set with All Ordinal Scales
There were four ordinal attributes in the
Balance Scale Weight & Distance data. Traditional
clustering algorithms use a numeric calculation
with this kind of data, but ordinal data should
use its own method to calculate the distance. So
k-means and multi-scale k-means were compared,
just as for the Contraceptive Method Choice Data,
and the explanation of the results is the same.
However, in this data set, k-means had a larger
distance between cluster centers than MSC. This was
because the median was used to determine the cluster
center for the ordinal attributes, and it is difficult to
produce an extreme value with the median. So the
center distance in the multi-scale algorithm is shorter
than with k-means.
Table 6. A comparison of clustering quality between
k-means with weights and multi-scale clustering
method with weights for Balance Scale Weight &
Distance Data (n=625)

Run                                 KW*       MSCW&
1        Accuracy of clustering†    55.04%    60.32%
         Center distance‡           4.4127    4
2        Accuracy of clustering     43.04%    52.80%
         Center distance            4.3425    4
3        Accuracy of clustering     44.64%    57.76%
         Center distance            4.2987    2
Average  Accuracy of clustering     47.52%    56.96%
         Center distance            4.3513    3.3333

* KW is the k-means with weights
& MSCW is the multi-scale clustering method with weights
† Accuracy of clustering is the accuracy of object clustering (F)
‡ Center distance is the average distance between groups' centers (D_c-avg)
6. CONCLUSION
A k-means clustering algorithm uses the same
method to calculate the distances for all kinds of
scales. It is a simple and quick way to cluster objects.
However, it ignores the inherent meaning of each
kind of scale. Therefore, the results of clustering are
difficult to interpret. In this study, a two-staged
clustering algorithm taking multiple scales into
account has been proposed. Through the designed
experiments, the results of the clustering algorithm with
multi-scales were more interpretable. Also, the
quality of the clustering, measured by the average
difference between groups and the average difference
within groups, was better. Furthermore, the clustering
algorithm using unequal weight improved the
accuracy of clustering and the average distance of a
group’s center. The limitations for this study are as
follows:
• Only four standard data sets were used to
compare the performance of the algorithms. More
data sets are needed to confirm the findings.
• Instead of getting insight from domain experts,
this study applied the knowledge from published
journal papers for each data set as expert weights.
In practice, the domain experts should be
involved during the clustering process.
The future works for this study are as follows:
• Apply this proposed algorithm to more data
sets to verify whether the findings are consistent.
• Combine this algorithm with optimization
methods to improve the efficiency.
• Involve the domain experts in the weighting
process to see if the quality of clustering can be
improved even more.
REFERENCES
1. Berry, M. J. A. and Linoff, G., 1997, Data Mining
Techniques: For Marketing, Sales, and Customer
Support, Wiley.
2. Biswas, G., Weinberg, J. and Fisher, D. H., 1992,
“ITERATE: A conceptual clustering method for
knowledge discovery in databases,” Artificial
Intelligence in the Petroleum Industry, B.
Braunschweig and R. Day eds.
3. Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P.,
1996, “The KDD process for extracting useful
knowledge from volumes of data,” Communications
of the ACM, Vol. 39, No. 11, pp. 27-34.
4. Guape, F. H. and Owrang, M. M., 1995,
“Database mining discovering new knowledge
and cooperative advantage,” Information Systems
Management, Vol. 12, pp. 26-31.
5. Huang, Z., 1997, “A fast clustering algorithm to
cluster very large categorical data sets in data
mining,” Research Issues on Data Mining and
Knowledge Discovery.
6. Huang, Z., 1998, “Extensions to the k-means
algorithm for clustering large data sets with
categorical values,” Data Mining and Knowledge
Discovery, Vol. 2, pp. 283-304.
7. Jain, A. K. and Dubes, R. C., 1988, Algorithms for
Clustering Data, Prentice Hall Advanced
Reference Series.
8. Kaufman, L. and Rousseeuw, P. J., 1990, Finding
Groups in Data: An Introduction to Cluster
Analysis, A Wiley-Interscience Publication.
9. Ralambondrainy, H., 1995, “A conceptual
version of the k-means algorithm,” Pattern
Recognition Letters, pp. 1147-1157.
ABOUT THE AUTHORS
Chien-Lung Chan is an Associate Professor and
Chairman in the Department of Information
Management at Yuan Ze University (YZU), Taiwan
R.O.C. He received his Ph.D. degree in Industrial
Engineering at University of Wisconsin-Madison in
1995. His current research and teaching interests are
in the area of Decision Science, Decision Support
System and Healthcare Informatics.
Rung-Ting Chien received his master's degree from the
Department of Information Management, Yuan Ze
University (YZU). His research interests are Data
Mining and Decision Support.
(Received August 2004, revised October 2004,
accepted November 2004)