International Journal of Computer, Mathematical Sciences and Applications
Vol. 5, No. 1-2, January-June 2011, pp. 39–47
© Serials Publications
ISSN: 0973-6786
An Evaluation of Two Clustering Algorithms in
Data Mining - a Case Study
P.K. SRIMANI¹ AND YETHIRAJ N.G.²
¹ Director, R and D, Bangalore. E-mail: [email protected]
² Department of Computer Science, Maharani’s Science College for Women, Bangalore. E-mail: [email protected]
Abstract: Data Mining is fundamentally an applied discipline and this paper aims at integrating
mathematical and computer science concepts by taking a case study. This paper discusses the testing
of two clustering algorithms (K-means and Expectation Maximization) on two datasets. Data Mining
is the process of extracting the potential information, patterns and trends from large quantities of
data possibly stored in databases. Clustering is alternatively referred to as unsupervised learning
or segmentation and can be accomplished by using many algorithms. From the experiments it is
concluded that EM is the best-suited algorithm for the given datasets, since EM calculates the log
likelihood for each of the attributes in the datasets. Also, the maximum log likelihood gives the best
quality of clustering, which predicts the number of burst instances that are present in the datasets.
Experimental results are presented through charts and tables. The results of the comparison are
presented in detail.
Keywords: Clustering, Algorithms, K-means, EM (Expectation Maximization), Comparison,
datasets.
1. INTRODUCTION
It is interesting to note that the progress in digital data acquisition and storage technology has
resulted in the growth of huge databases. This has occurred in almost all areas of human endeavor
from the mundane to the more exotic. The science of extracting useful/potential information from
large data sets or databases is known as Data Mining. Data Mining is not simply a process, but an
ongoing voyage of discovery, interpretation and re-investigation. It is a new and exciting discipline
lying at the intersection of Statistics, Machine Learning, Data Management,
Databases, Pattern Recognition, Artificial Intelligence and other areas [1] and [6].
In fact, Data Mining is an interdisciplinary exercise involving Mathematics, Statistics, Database
Technology, Machine Learning, Pattern Recognition, Artificial Intelligence, and Visualization. Further,
it is difficult to draw a sharp boundary between any two of these disciplines. Therefore, for understanding
the complexities of Data Mining, an excellent understanding of both “The Mathematical Modeling” and
“The Computational Algorithmic” views is essential. Consequently, the requirement to master two different
areas of expertise presents a challenge for students, instructors and researchers. A lack of
detail regarding statistical, mathematical and algorithmic concepts results in a very poor
understanding of the subject of Data Mining, and probably the topic of regression is the most
mathematically challenging one. Data Mining is fundamentally an applied discipline and this paper
aims at integrating mathematical and computer science concepts by taking a case study.
Data Mining is the process of posing various queries and extracting useful information and hidden
patterns from large quantities of data possibly stored in databases. Essentially, the goals of data
mining, with regard to many organizations, include improving marketing capabilities, detecting
abnormal patterns, and predicting the future based on past experience and current trends. Data
Mining is an integration of multiple technologies, which include data management (such as database
management and data warehousing), statistics, machine learning, decision support, and others such as
visualization and parallel computing. Data mining research is being carried out in various disciplines.
There are various steps in data mining [7].
Traditional database queries access a database using a well-defined query stated in a language
such as SQL. The output of the query consists of the data from the database that satisfies the query.
Data mining access of a database differs from this traditional access in several ways. Data mining
involves many different algorithms to accomplish different tasks. All of these algorithms attempt to
fit a model to the data. The algorithms examine the data and determine a model that is closest to the
characteristics of the data being examined. Data mining algorithms can be characterized as consisting
of three parts viz., Model, Preference and Search.
Some of the basic data mining functions are (i) Classification (ii) Regression (iii) Time series
analysis (iv) Prediction (v) Clustering (vi) Summarization (vii) Association rules and (viii) Sequence
discovery.
Clustering: In this study, our attention is focused on two clustering algorithms. Clustering is
alternatively referred to as unsupervised learning or segmentation [1]. It can be thought of as
partitioning or segmenting the data into groups that might or might not be disjoint. The clustering is
usually accomplished by determining the similarities among the data on predefined attributes or
parameters. The most similar data are grouped into clusters.
In high-dimensional data, clusters can exist in subspaces that hide themselves from traditional
clustering methods. A number of algorithms have been proposed to identify such projected clusters,
but most of them rely on some user parameters to guide the clustering process. There are many
clustering algorithms. Clustering algorithms can be applied in many fields for instance: Marketing,
Biology, Libraries, Insurance, City-Planning, Earthquake studies, and the WWW. Among them, the
K-means and EM clustering algorithms are considered here. K-means is an iterative clustering algorithm in which
items are moved among sets of clusters until the desired set is reached [2]. As such, it may be viewed
as a type of squared error algorithm [8], although the convergence criteria need not be defined based
on the squared error. A high degree of similarity among elements in clusters is obtained, while a high
degree of dissimilarity among elements in different clusters is achieved simultaneously.
The K-means clustering algorithm can also be implemented in a disk-based manner. This disk-based
approach is designed to work inside a relational database management system. It can cluster large
data sets having very high dimensionality.
As discussed earlier, these algorithms are compared on different data sets. The comparison is
based on two criteria, viz. time and cluster quality. The best-suited algorithm for the
given data set is then identified.
The main requirements that a clustering algorithm should satisfy are:
(1) scalability; (2) dealing with different types of attributes; (3) discovering clusters with arbitrary
shape; (4) minimal requirements for domain knowledge to determine input parameters; (5) ability to
deal with noise and outliers; (6) insensitivity to the order of input records; (7) high dimensionality; and
(8) interpretability and usability.
2. STATEMENT OF THE PROBLEM
The assumption here is that the number of clusters to be created is an input value k. The actual
content of each cluster, Kj, 1 ≤ j ≤ k, is determined as a result of the function definition. Without loss
of generality, the result of solving a clustering problem is that a set of clusters is created: K =
{K1, K2, . . ., Kk}.
Given a database D = {t1, t2, . . ., tn} of tuples and an integer value k, the clustering problem is to
define a mapping f : D → {1, . . ., k} where each ti is assigned to one cluster Kj, 1 ≤ j ≤ k. A cluster,
Kj, contains precisely those tuples mapped to it; that is, Kj = {ti | f(ti) = j, 1 ≤ i ≤ n, and ti ∈ D}.
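As a small illustration (the tuples and labels below are hypothetical, not from the paper's data sets), the mapping f and the clusters it induces can be written directly in Python:

# The mapping f: D -> {1, ..., k} represented as a list of cluster labels.
D = [(25, 1200.0), (47, 300.0), (31, 980.0), (52, 150.0)]   # database of tuples
f = [1, 2, 1, 2]        # f(t_i) = index of the cluster the i-th tuple belongs to
k = 2
# Cluster K_j contains precisely those tuples mapped to j.
K = {j: [t for t, label in zip(D, f) if label == j] for j in range(1, k + 1)}
print(K)   # {1: [(25, 1200.0), (31, 980.0)], 2: [(47, 300.0), (52, 150.0)]}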
3. CLASSIFICATION OF CLUSTERING ALGORITHMS
Clustering algorithms may be classified as listed below:
• Exclusive Clustering
• Overlapping Clustering
• Hierarchical Clustering
• Probabilistic Clustering
3.1 Maximum Likelihood Estimate (MLE)
Likelihood can be defined as a value proportional to the actual probability of observing a given
sample under a specific distribution, so the sample gives an estimate for a parameter of the distribution.
The higher the likelihood value, the more likely it is that the underlying distribution will produce the
results observed. Given a sample set of values X = {x1, . . ., xn} from a known distribution function
f(xi | Θ), the MLE can estimate the parameters of the population from which the sample is drawn [5].
The approach obtains parameter estimates that maximize the probability that the sample data occur
for the specific model. It looks at the joint probability of observing the sample data by multiplying
the individual probabilities. The likelihood function, L, is thus defined as
L(Θ | x1, . . ., xn) = ∏_{i=1}^{n} f(xi | Θ)

where Θ is the parameter being estimated.
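As a numerical illustration of this maximization (assuming SciPy and a normal distribution, neither of which the paper fixes), one can minimize the negative log of the product above:

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Hypothetical sample; in practice these would be attribute values from a data set.
x = np.array([4.9, 5.1, 5.6, 4.7, 5.3, 5.0])

def neg_log_likelihood(theta):
    mu, log_sigma = theta          # Theta = (mean, log of standard deviation)
    # log L(Theta | x) = sum_i log f(x_i | Theta); minimize its negative.
    return -norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)).sum()

result = minimize(neg_log_likelihood, x0=[0.0, 0.0])
mu_hat, sigma_hat = result.x[0], np.exp(result.x[1])
print(mu_hat, sigma_hat)   # close to the sample mean and standard deviation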
3.2 Expectation Maximization Algorithm (EM)
This is one of the clustering algorithms that solve the estimation problem with incomplete
data. The EM algorithm finds an MLE for a parameter (such as the mean) using a two-step process:
expectation and maximization. The basic EM algorithm is shown below. An initial set of estimates
for the parameters is obtained. Given these estimates and the training data as input, the algorithm
then calculates a value for the missing data. For example, it might use the estimated mean to predict
a missing value. These data are then used to determine an estimate for the mean that maximizes the
likelihood. These steps are applied iteratively until successive parameter estimates converge. Any
approach can be used to find the initial parameter estimates. In the algorithm it is assumed that the
input database has actual observed values Xobs = {x1, . . . . . ., xk} as well as values that are missing
Xmiss = {xk + 1, . . . ., xn}. We assume that the entire database is actually X = Xobs ∪ Xmiss. The parameters
to be estimated are Θ = {θ1, . . . . . θp}. The likelihood function is defined by
L(Θ | X) = ∏_{i=1}^{n} f(xi | Θ)

We are looking for the Θ that maximizes L. The MLEs of Θ are the estimates that satisfy

∂ ln L(Θ | X) / ∂θi = 0, for i = 1, . . ., p.
The expectation part of the algorithm estimates the missing values using the current estimates
of Θ. This can initially be done by finding a weighted average of the observed data. The maximization
step then finds the new estimates for the Θ parameters that maximize the likelihood by using those
estimates of the missing data.
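For concreteness, in the common Gaussian-mixture instance of EM (an illustrative special case; the text above describes the general missing-data formulation), with cluster priors \pi_j and densities f(x_i \mid \theta_j), the expectation step computes the responsibilities

w_{ij} = \frac{\pi_j \, f(x_i \mid \theta_j)}{\sum_{l=1}^{k} \pi_l \, f(x_i \mid \theta_l)},

and the maximization step re-estimates the parameters from them, for example

\pi_j = \frac{1}{n} \sum_{i=1}^{n} w_{ij}, \qquad \mu_j = \frac{\sum_{i=1}^{n} w_{ij} x_i}{\sum_{i=1}^{n} w_{ij}}.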
Before invoking the EM model, the following three options are presented:
(i) Number of Clusters
(ii) Maximum iterations
(iii) Minimum allowable standard deviation
The first option allows the selection of the number of clusters that are to be created for the data.
The second option determines the maximum number of times to loop through the algorithm. In
general, decreasing the maximum number of iterations results in a less precise clustering with a
reduction in time, while increasing the maximum number of iterations yields a more precise
clustering at the cost of more time. The third option determines the minimum standard deviation for each
attribute in each cluster. Generally, increasing this number will yield clusters that encompass more
volume in the data set, and decreasing it will yield clusters that encompass less volume in
the data set.
4. ALGORITHMS FOR EXPECTATION MAXIMIZATION (EM) AND K-MEANS
Algorithm
Input:
   Θ = {θ1, . . ., θp}         // parameters to be estimated
   Xobs = {x1, . . ., xk}      // input database values observed
   Xmiss = {xk+1, . . ., xn}   // input database values missing
Output:
   Θ̂                           // estimates for Θ [4]
EM algorithm:
   i = 0;
   Obtain initial parameter MLE estimate, Θ^i;
   Repeat
      Estimate missing data, X^i_miss;
      i++;
      Obtain next parameter estimate, Θ^i, to maximize likelihood;
   Until estimate converges;
EM assigns a probability distribution to each instance, which indicates its probability of belonging
to each of the clusters. EM can decide how many clusters to create by cross-validation, or the number
of clusters to generate may be specified a priori.
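The following is a minimal runnable sketch of this procedure for a one-dimensional Gaussian mixture in Python/NumPy; the mixture form, the initialization, and the stopping rule are illustrative assumptions, not the paper's exact setup:

import numpy as np

def em_gaussian_mixture(x, k=2, max_iter=100, tol=1e-6, seed=100):
    rng = np.random.default_rng(seed)
    n = len(x)
    # Initial estimates: means sampled from the data, pooled variance, uniform priors.
    mu = rng.choice(x, size=k, replace=False)
    var = np.full(k, x.var())
    prior = np.full(k, 1.0 / k)
    prev_ll = -np.inf
    for _ in range(max_iter):
        # E-step: responsibility of each cluster for each instance.
        dens = np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
        weighted = prior * dens                       # shape (n, k)
        resp = weighted / weighted.sum(axis=1, keepdims=True)
        # Log likelihood: sum over instances of the log of the mixture density.
        ll = np.log(weighted.sum(axis=1)).sum()
        # M-step: re-estimate priors, means and variances from the responsibilities.
        nk = resp.sum(axis=0)
        prior = nk / n
        mu = (resp * x[:, None]).sum(axis=0) / nk
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        if abs(ll - prev_ll) < tol:                   # convergence of the estimate
            break
        prev_ll = ll
    return prior, mu, var, ll

# Example on a hypothetical sample drawn from two overlapping Gaussians.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 300), rng.normal(4, 1, 200)])
print(em_gaussian_mixture(x))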
5. K-MEANS CLUSTERING
K-means is an iterative clustering algorithm in which items are moved among sets of clusters until
the desired set is reached [3].
Input:
   D = {t1, t2, . . ., tm}     // set of elements
   k                           // number of desired clusters
Output:
   K                           // set of clusters
K-means algorithm:
   Assign initial values for means m1, . . ., mk;
   Repeat
      Assign each item ti to the cluster which has the closest mean;
      Calculate the new mean [4] for each cluster;
   Until the convergence criterion is met;
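A minimal runnable sketch of the loop above in Python/NumPy; the initialization (a random sample of k items as the means) and the Euclidean distance are illustrative assumptions:

import numpy as np

def k_means(data, k=2, max_iter=100, seed=100):
    rng = np.random.default_rng(seed)
    # Assign initial values for the means: k items sampled from the data.
    means = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign each item to the cluster which has the closest mean.
        dists = np.linalg.norm(data[:, None, :] - means[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Calculate the new mean for each cluster (keep the old mean if a cluster is empty).
        new_means = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                              else means[j] for j in range(k)])
        if np.allclose(new_means, means):             # convergence criterion
            break
        means = new_means
    return labels, means

# Example on hypothetical two-dimensional data.
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
labels, means = k_means(data, k=2)
print(means)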
6. DATA SETS
In this section, two data sets are presented, viz. the Australian and German credit card data sets.
The Australian credit data set consists of credit card applications. This data set is a good mix of
attributes: continuous, nominal with a small number of values, and nominal with a larger number of
values. There are also a few missing values. The number of instances is 690 and the number of
attributes is 14 plus a class attribute; of these, 6 attributes are numerical and 8 are categorical.
All the attribute values are numerical.
The German credit data set consists of German credit card details. It contains 1000 instances and
20 attributes, of which 7 are numerical and 13 are categorical. Some of the attribute values are
numerical and some are characters.
7. OUTPUTS AND RESULTS
In this section, the results of the present study are presented in Tables 1 to 5. The given data sets
are the banking data sets described above. The two algorithms, viz. K-means and EM, were applied
on each of the data sets. The summarized output is shown below:
Table 1
The table below shows the outputs of the EM algorithm on the Australian data set.
No. of clusters: 2; Seed: 100

Sl.   No. of      Prior probabilities    Clustered instances   Log-likelihood   Time required to build
No.   instances   Clust 0    Clust 1     Clust 0    Clust 1                     the model (in sec)
1     1000        0.5706     0.4292      57%        43%        –25.33392        01
2     2000        0.5703     0.4297      57%        43%        –25.32674        01
3     3000        0.5703     0.4297      57%        43%        –25.32707        02
4     4000        0.5704     0.4296      57%        43%        –25.32724        01
5     5000        0.5705     0.4295      57%        43%        –25.3313         02
6     6000        0.5705     0.4295      57%        43%        –25.33076        02
7     7000        0.5705     0.4295      57%        43%        –25.33121        03
8     8000        0.5706     0.4294      57%        43%        –25.33224        03
9     9000        0.5706     0.4294      57%        43%        –25.33337        03
10    10000       0.5707     0.4293      57%        43%        –25.33397        05
Table 2
The table below shows the outputs of the EM algorithm on the German data set.
No. of clusters: 2; Seed: 100

Sl.   No. of      Prior probabilities    Clustered instances   Log-likelihood   Time required to build
No.   instances   Clust 0    Clust 1     Clust 0    Clust 1                     the model (in sec)
1     1000        0.3003     0.6997      30%        70%        –21.81065        01
2     2000        0.3003     0.6997      30%        70%        –21.8097         01
3     3000        0.3005     0.6995      30%        70%        –21.80852        01
4     4000        0.3005     0.6995      30%        70%        –21.8087         01
5     5000        0.3004     0.6995      30%        70%        –21.8087         01
6     6000        0.3004     0.6990      30%        70%        –21.80889        02
7     7000        0.3004     0.6996      30%        70%        –21.80895        03
8     8000        0.3004     0.6996      30%        70%        –21.809          03
9     9000        0.3004     0.6996      30%        70%        –21.80903        04
10    10000       0.3004     0.6996      30%        70%        –21.80906        04
Table 3
The table below shows the outputs of the K-means algorithm on the Australian data set.
No. of clusters: 2; Seed: 100

Sl.   No. of      Clustered instances     Time reqd. to build
No.   instances   Cluster 0   Cluster 1   the model (in secs)
1     1000        44%         56%         00
2     2000        56%         44%         00
3     3000        44%         56%         01
4     4000        44%         56%         01
5     5000        44%         56%         01
6     6000        44%         56%         01
7     7000        32%         69%         02
8     8000        53%         47%         03
9     9000        56%         44%         04
10    10000       56%         44%         04
Table 4
The table below shows the outputs of the K-means algorithm on the German data set.

Sl.   No. of      Clustered instances     Time reqd. to build
No.   instances   Cluster 0   Cluster 1   the model (in secs)
1     1000        53%         47%         00
2     2000        53%         47%         00
3     3000        60%         40%         01
4     4000        52%         48%         01
5     5000        62%         38%         01
6     6000        49%         51%         01
7     7000        24%         76%         01
8     8000        50%         50%         02
9     9000        25%         75%         02
10    10000       61%         39%         03
Table 5
The table below shows a relative comparison of the EM and K-means algorithms.

Sl.   No. of      Log-likelihood (EM)     EM clustered instances        K-means clustered instances
No.   instances   Aus       Ger           Aus C0%  C1%   Ger C0%  C1%   Aus C0%  C1%   Ger C0%  C1%
1     1000        –25.33    –21.81        57       43    30       70    44       56    53       47
2     2000        –25.34    –21.80        57       43    30       70    56       44    53       47
3     3000        –25.37    –21.808       57       43    30       70    44       56    60       40
4     4000        –25.32    –21.808       57       43    30       70    44       56    52       48
5     5000        –25.33    –21.808       57       43    30       70    44       56    62       38
6     6000        –25.33    –21.809       57       43    30       70    44       56    49       51
7     7000        –25.33    –21.80895     57       43    30       70    32       69    24       76
8     8000        –25.24    –21.809       57       43    30       70    53       47    50       50
9     9000        –25.37    –21.81        57       43    30       70    56       44    25       75
10    10000       –25.97    –21.81        57       43    30       70    56       44    61       39
EM calculates the log likelihood, which gives information about cluster quality: the maximum log
likelihood represents the best clustering.
Log likelihood can be defined as the logarithm of a value proportional to the actual probability of
observing the given sample under a specific distribution, so the sample gives an estimate for a parameter
of the distribution. The higher the likelihood value, the more likely it is that the underlying distribution
will produce the results observed. From the graph it is observed that, as the number of instances
increases, the log likelihood of the Australian data set remains practically constant.
From the tables it is clear that the EM algorithm works better on the German data set.
8. COMPARISON OF K-MEANS AND EM
Comparing the clustered instances of both K-means and EM, it is found that EM is the best-suited
algorithm for the given data sets, since it depends on probability distributions, where each
distribution represents a cluster. EM calculates the log likelihood, which gives information about
cluster quality: the maximum log likelihood represents the best clustering. The likelihood is computed
by multiplying together, over all instances, the sum of the cluster probabilities for each instance.
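In symbols, for a mixture model (a standard formulation, stated here only as an illustration):

L(\Theta \mid X) = \prod_{i=1}^{n} \sum_{j=1}^{k} \pi_j \, f_j(x_i \mid \theta_j)

where the \pi_j are the cluster prior probabilities (cf. the prior-probability columns of Tables 1 and 2) and f_j is the density of cluster j; taking the logarithm gives the log likelihood.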
In the case of K-means, the results are entirely dependent on the value of k, i.e. the number of
clusters. There is no way of knowing in advance how many clusters exist: the same algorithm applied
to the same dataset can produce two or three clusters. There is no general theoretical solution for
finding the optimal number of clusters for a given data set. A simple approach is to compare the
results of multiple runs with different values of k, as sketched below. However, increasing k yields a
smaller error function but also an increasing risk of overfitting.
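A hedged sketch of this multiple-runs approach, assuming scikit-learn as the toolkit and synthetic stand-in data (the paper does not name its implementation): K-means is scored by its squared-error objective (inertia), and GaussianMixture, an EM implementation, by the average log likelihood per instance.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(100)
X = rng.normal(size=(1000, 14))           # stand-in for a 14-attribute data set

for k in range(2, 6):
    km = KMeans(n_clusters=k, n_init=10, random_state=100).fit(X)
    gm = GaussianMixture(n_components=k, random_state=100).fit(X)
    # Lower inertia and higher log likelihood indicate a better fit for this k,
    # but monotone improvement with k is exactly the overfitting risk noted above.
    print(k, round(km.inertia_, 1), round(gm.score(X), 3))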
By comparing the results, it is found that the EM algorithm works better on these data sets than
K-means. Hence, for the clustering of such data, it is preferable to apply the EM algorithm.
REFERENCES
[1] Han J., and Kamber M., (2001). “Data Mining: Concepts and Techniques”, Morgan Kaufmann, San
Francisco.
[2] Jain A.K., and Dubes R.C., (1988). “Algorithms for Clustering Data”. Prentice Hall Advanced Reference
Series, Upper Saddle River, N.J.
[3] Ordonez C., and Omiecinski E., “Efficient Disk-based K-means Clustering for Relational Databases”, IEEE
Transactions on Knowledge and Data Engineering, Vol. 16, pp. 909-921, August 2004.
[4] Yip K.Y., and Cheung D.W., “HARP: A Practical Projected Clustering Algorithm”, IEEE Transactions on
Knowledge and Data Engineering, Vol. 16, pp. 1387-1397, November 2004.
[5] Dempster A.P., Laird N.M., and Rubin D.B., “Maximum Likelihood from Incomplete Data via the EM
Algorithm”, Journal of the Royal Statistical Society, Vol. 39, pp. 1-38, 1977.
[6] Cooley R., Mobasher B., and Srivastava J., “Data Preparation for Mining World Wide Web Browsing
Patterns”, Knowledge and Information Systems, Vol. 1, pp. 5-32, 1999.
[7] Wang J., and Net Library I., “Encyclopedia of Data Warehousing and Mining: Idea Group Reference”,
2006.
[8] Arthur D., and Vassilvitskii S., “How Slow is the K-means Method?”, Proceedings of the 2006
Symposium on Computational Geometry (SoCG), 2006.