Data Mining using Conceptual Clustering
Khaled Hammouda
Prof. Mohamed Kamel
University of Waterloo, Ontario, Canada
Abstract – The task of data mining is mainly concerned
with the extraction of knowledge from large sets of data.
Clustering techniques are usually used to find regular
structures in data. Conceptual clustering is one
technique that forms concepts out of data incrementally
by subdividing groups into subclasses iteratively; thus
building a hierarchy of concepts. This paper presents
the use of conceptual clustering in data mining a large
set of documents to find meaningful groupings among
them. An incremental conceptual clustering technique
based on a probabilistic guidance function is implemented
and tested against the data set for cohesion of the
resulting cluster structure.
Index Terms—data mining, conceptual clustering,
document clustering, hierarchical clustering.
I. INTRODUCTION
DATA MINING is the field concerned with
the non-trivial extraction of hidden and
potentially useful information from large sets of
data. With the dramatic increase in the amount of
available data, driven by the wide availability of
low-cost storage and other factors, it has become
attractive to discover knowledge in these data.
Large amounts of data often contain regularities
that can only be uncovered by a smart knowledge-discovery
algorithm. When no classification information is
known about the data, a clustering algorithm is
usually used to cluster the data into groups such
that the similarity within each group is larger than
that among groups. This is known as learning
from observations, as opposed to the
classification task, which is considered learning
from examples.
K. M. Hammouda,
Department of Systems Design Engineering,
University of Waterloo, Waterloo, Ontario, Canada N2L 3G1
In this paper we apply one of the machine
learning methodologies known as “conceptual
clustering” to demonstrate the task of data mining
a large set of documents. Early work on
conceptual clustering was done by Michalski and
Stepp [1] who proposed the conceptual clustering
algorithm known as CLUSTER/2. The choice of
conceptual clustering arises from the interesting
property that it is mostly used
for nominal-valued data. An extension exists for
conceptual clustering that can deal with numeric
data [2], but for the purpose of this paper we only
need to be concerned with nominal-valued data as
the data set we are dealing with is inherently
nominal and symbolic-valued. However, the data
set we are dealing with contains a large number of
attributes, and their values are non-fixed nominal
values. Preprocessing of the data is then a very
important step to make the data usable.
Conceptual clustering builds a structure
out of the data incrementally by trying to
subdivide a group of observations into subclasses.
The result is a hierarchical structure known as the
concept hierarchy. Each node in the hierarchy
subsumes all the nodes underneath it, with the
whole data set at the root of the hierarchy tree.
A system known as COBWEB [3] was
introduced by Fisher (1987) which performs
conceptual clustering as described above using a
probabilistic technique for measuring how well a
certain observation can fit in one of the groups
constructed so far; hence the term ‘incremental’.
UNIMEM [5] is a similar system which performs
incremental conceptual clustering, but it uses
weights of attribute values for reorganization of
the concept hierarchy. Biswas et al [4] proposed
an improved conceptual clustering algorithm
known as ITERATE that alleviates the effect of
random ordering of observations, and iteratively
redistributes the observations among clusters to
improve cluster cohesion.
To demonstrate the task of data mining at
hand, an incremental conceptual clustering
technique is implemented and tested against the
document data set. The data set contains
documents of job offerings collected from a job
offering newsgroup. The data set has been preprocessed to extract key attributes and values.
However, further preprocessing had to be done to
refine the data set and produce more interpretable
results. Section 5 describes the data set and the
types of preprocessing done to improve its
usability. The key difference between the system
implemented here and other conceptual clustering
systems is its ability to deal with free-text-valued
attributes and to solve the document clustering
problem efficiently.
As a measure for assessing the
performance of the clustering technique, we adopt
a probabilistic measure known as cohesion [4] to
measure the intra-cluster similarity of the cluster
structure. Cohesion measures how close the
observations inside a certain cluster are to one
another. We also demonstrate the effect that
choosing different combinations of attributes for
clustering has on the cohesion measure. Finally,
the effect of ordering the observations to
minimize cluster skewing is also studied.
The rest of this paper is organized as
follows. Section 2 discusses the representation of
data. Section 3 introduces the criterion function
and its bias that guides the clustering algorithm.
Section 4 presents the incremental clustering
algorithm. Section 5 presents the implementation
and results. Finally, section 6 gives a brief
summary with conclusions.
II. DATA REPRESENTATION
In clustering schemes, data objects are
usually represented as vectors of feature-value
pairs. Features represent certain attributes of the
objects that are known to be useful for the
clustering task. Attributes that are not relevant in
forming structures out of data can lead to inaccurate results. Attributes can be numeric or
non-numeric, thus forming a mixed-mode data
representation. Conceptual clustering is one of the
algorithms that can deal with mixed-mode data.
However, conceptual clustering has primarily
focused on attributes described by nominal
values. The best way to combine numeric,
ordinal, and nominal-valued data is still an open
question.
If a convention is adopted for the ordering
of the attributes in a given problem context, we
can represent instances of data as feature vectors
consisting of the attribute values only, where the
attribute names themselves are implicitly known
by their order. Sison and Shimura [6] proposed a
relational description model for clustering data,
as opposed to the usual propositional
attribute-value pair representation.
Usually attributes are single-valued, but
sometimes they can be multi-valued, as in the
document clustering problem at hand. In this case
a convention has to be adopted to deal with
multi-valued attributes, depending on the problem
context.
In numeric clustering methods, a distance
measure is used to find the dissimilarity between
two instances. This distance is usually measured as
the Euclidean distance or the Mahalanobis
distance. For nominal valued attributes however,
a distance such as the Manhattan distance can be
used. The Manhattan distance is simply the
number of differences in the attribute-value pairs.
Difficulty arises when attributes are multi-valued
such as the case at hand.
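To make the distance concrete, the following Python sketch (our own illustration, not part of the original system) counts attribute-value mismatches between two nominal feature vectors; treating multi-valued attributes as sets that only mismatch when they share no value is one assumed convention, in the spirit of the treatment discussed in section 5.

def manhattan_distance(x, y):
    """Number of mismatching attribute-value pairs between two instances.

    x, y: value lists aligned by attribute position; each entry may be None
    (missing), a single string, or a set of strings (multi-valued).
    """
    assert len(x) == len(y)
    distance = 0
    for a, b in zip(x, y):
        # Normalize every value to a set so single- and multi-valued
        # attributes are compared the same way.
        sa = set() if a is None else (a if isinstance(a, set) else {a})
        sb = set() if b is None else (b if isinstance(b, set) else {b})
        # Count a mismatch when the two value sets share nothing
        # (an assumed convention for multi-valued attributes).
        if not (sa & sb):
            distance += 1
    return distance

# Example: two postings described by (platform, language, area)
d1 = ["unix", {"C", "C++"}, "Database"]
d2 = ["windows", {"C++", "SQL"}, "Database"]
print(manhattan_distance(d1, d2))  # -> 1 (only the platform differs)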
Each instance in the document data set at
hand is represented as a vector of 17 attribute-value pairs. An attribute can have zero, one, or
multiple values in a certain instance. Thus, care
has to be taken when dealing with multi-valued
and zero-valued attributes (see section 5).
In the next section we present the function
that guides the clustering algorithm to find useful
structures in data.
III. CLUSTERING CRITERION FUNCTION
Clustering techniques typically rely on
nonparametric probabilistic measures to define
groupings. A clustering algorithm can be viewed
as a search algorithm that looks for the “best”
groupings of data among a multitude of different
grouping structures. In this search there has to be
a guidance function (heuristic) that evaluates
certain groupings, and based on this evaluation,
the best one is selected. This has to be done
incrementally as we introduce new instances to
the system. Thus, instead of exhaustively
searching the concept space we only limit
ourselves to the direction given by this criterion
function. A well-known probabilistic criterion
function, called the category utility measure, was
developed by Gluck and Corter [7] and is used by
COBWEB and ITERATE.
The category utility measure is based on a
probability-matching strategy to establish the
usefulness, or utility, of a category. The Category
Utility (CU) of a class $C_k$ is defined as

$$ CU_k = P(C_k) \left[ \sum_i \sum_j P(A_i = V_{ij} \mid C_k)^2 - \sum_i \sum_j P(A_i = V_{ij})^2 \right] \qquad (1) $$

where $P(C_k)$ represents the size of cluster $C_k$ as
a proportion of the entire data set, $P(A_i = V_{ij})$ is
the probability of attribute $A_i$ taking on value $V_{ij}$
over the entire set, and $P(A_i = V_{ij} \mid C_k)$ is its
conditional probability of taking the same value
in class $C_k$. This function represents the increase
in the number of feature values that can be
correctly guessed for class $C_k$ over the expected
number of correct guesses, given that no class
information is available.
To evaluate an entire partition made up of
$K$ clusters, we use the average CU over the $K$
clusters:

$$ \text{Partition Score} = \frac{1}{K} \sum_{k=1}^{K} CU_k \qquad (2) $$
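As an illustration of equations (1) and (2), the following Python sketch estimates the probabilities from raw counts over clusters of single-valued instances (our own code; the function names and the tiny example data are ours, not taken from the data set):

from collections import Counter

def value_counts(instances):
    """Count (attribute index, value) occurrences over a list of instances."""
    counts = Counter()
    for inst in instances:
        for i, value in enumerate(inst):
            if value is not None:
                counts[(i, value)] += 1
    return counts

def category_utility(cluster, dataset):
    """Equation (1): CU_k of one cluster with respect to the whole data set."""
    p_ck = len(cluster) / len(dataset)
    cond = sum((n / len(cluster)) ** 2 for n in value_counts(cluster).values())
    uncond = sum((n / len(dataset)) ** 2 for n in value_counts(dataset).values())
    return p_ck * (cond - uncond)

def partition_score(clusters, dataset):
    """Equation (2): average CU over the K clusters of a partition."""
    return sum(category_utility(c, dataset) for c in clusters) / len(clusters)

# Tiny example with two attributes (language, platform)
data = [("C++", "unix"), ("C", "unix"), ("Java", "windows"), ("C#", "windows")]
print(partition_score([data[:2], data[2:]], data))  # -> 0.375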
An important note is that the CU function
trades off cluster size ($P(C_k)$) against the
predictive accuracy of feature values
($\sum_i \sum_j P(A_i = V_{ij} \mid C_k)^2 - \sum_i \sum_j P(A_i = V_{ij})^2$). Thus
this function favors large clusters over small
ones. If the data contains consecutive similar
objects, they tend to go to the same cluster, and as
the cluster size increases, other less similar objects
are going to be attracted to this oversized cluster,
causing a skewed cluster structure. A method
known as the Anchored Dissimilarity Ordering
(ADO) [4] is employed to order the data objects
before partitioning at each step so that the
distance between any two consecutive objects is
maximized; thus avoiding building oversized
clusters. The Manhattan distance described earlier
is used as the distance measure between two
objects. The object chosen to be next in order is
the one that maximizes the sum of the distances
between it and the previous n objects in the order.
The window size n is user-defined, but is usually
taken to be the expected number of classes in
the data.
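The ADO step can be sketched as a greedy reordering in Python (our own illustration; the distance argument could be the Manhattan distance shown in section 2):

def ado_order(instances, distance, n):
    """Greedy Anchored Dissimilarity Ordering (ADO) sketch.

    instances: list of feature vectors
    distance:  pairwise dissimilarity function (e.g. Manhattan distance)
    n:         window size, typically the expected number of classes
    """
    remaining = list(instances)
    ordered = [remaining.pop(0)]              # arbitrary starting object
    while remaining:
        window = ordered[-n:]                 # the last n objects placed so far
        # Pick the remaining object farthest, in total, from the window.
        best = max(remaining,
                   key=lambda obj: sum(distance(obj, w) for w in window))
        remaining.remove(best)
        ordered.append(best)
    return ordered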
The CU function is the basis for
comparing different cluster partitions, and
selecting the partition with the highest partition
score given by equation (2). A more detailed use
of the function in the context of the clustering
algorithm follows in the next section.
IV. INCREMENTAL CONCEPTUAL CLUSTERING
In conceptual clustering a concept can be
viewed as a node in a hierarchical tree
representing a hierarchy of concepts. Nodes
(concepts) in the higher levels of the tree are
more general than nodes in lower levels. Each
node stores a list of instances that are covered by
the concept at that node. Thus, the root of the tree
represents all the instances in the data set. Lower
level nodes represent subclasses of their parents,
covering only the instances that match their
specific concept. Leaves at the lowest level
consist of one instance each, and represent the
most specific concepts. The representation of
each concept takes the form of a probabilistic
distribution of each possible attribute value
calculated over the set of instances associated
with this concept. For example, if we have two
features per instance, language and platform, and
a concept (node) carries only one instance with
language=C++ and platform=unix, then the
probability distribution for feature values in this
concept is C++=1.0 and unix=1.0, respectively, while
any other attribute values will have probability of
zero. If we add another instance to this concept
that has language=C and platform=unix,
then the new feature value probability distribution
for this concept will be C++=0.5, C=0.5 and
unix=1.0. Figure 1 shows an example of such
a concept in a concept hierarchy tree.
Figure 1. Example of a concept in a concept hierarchy: a node with P(C2) = 2/4 and feature-value probabilities P(V|C) of Unix = 1.0 and Windows = 0.0 for Platform, and C = 0.5 and C++ = 0.5 for Language.
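The probabilistic concept representation can be illustrated with a small Python sketch that stores value counts per attribute and reproduces the language/platform example above (illustrative code, not the paper's implementation):

from collections import defaultdict

class ConceptNode:
    """A node in the concept hierarchy: value counts per attribute."""

    def __init__(self):
        self.n = 0                            # number of instances covered
        self.counts = defaultdict(lambda: defaultdict(int))
        self.children = []

    def add(self, instance):
        """Incorporate one instance given as {attribute: value} pairs."""
        self.n += 1
        for attribute, value in instance.items():
            self.counts[attribute][value] += 1

    def p_value(self, attribute, value):
        """P(A_i = V_ij | C) estimated from the stored counts."""
        return self.counts[attribute][value] / self.n if self.n else 0.0

node = ConceptNode()
node.add({"language": "C++", "platform": "unix"})
node.add({"language": "C", "platform": "unix"})
print(node.p_value("language", "C++"), node.p_value("platform", "unix"))  # 0.5 1.0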
The following clustering algorithm is
based on both COBWEB [3] and ITERATE [4].
At each node in the tree we try to partition the list
of instances associated with this node among a set
of classes, guided by the Category Utility
function given by equation (1). We start by
selecting the next instance to be considered and
try to add it to each of the node’s children, and
each time the partition score is calculated. In
addition, we try putting the instance in a new
child by itself, and calculate the partition score as
well. The partition with the highest partition score
is chosen. If there are no children for the current
node, a new child is created and the instance is
added to it.
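In code form, one step of this placement procedure might look like the following Python sketch (our own simplified illustration, reusing a partition_score function such as the one given in section III):

def best_placement(children, instance, dataset, score):
    """Return the partition with the best score after adding `instance`.

    children: current partition at this node, as lists of instances
    dataset:  all instances covered by the node, including the new one
    score:    partition evaluation function, e.g. partition_score
    """
    candidates = []
    for i in range(len(children)):
        trial = [list(c) for c in children]
        trial[i].append(instance)             # host the instance in child i
        candidates.append(trial)
    # Also try placing the instance in a new child by itself.
    candidates.append([list(c) for c in children] + [[instance]])
    return max(candidates, key=lambda p: score(p, dataset))

Applying this step recursively down the hierarchy, one instance at a time, grows the concept tree; ITERATE additionally redistributes instances afterwards to reduce the effect of presentation order [4].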
The concept hierarchy that is created is a
representation of the classification of concepts
and sub-concepts. However, for the purpose of
clustering we need to extract from the tree the
potential clusters that form a “good”
representation of the underlying groupings in
data. After creating the concept hierarchy, the
second step is to extract the candidate clusters
from the tree using the following algorithm.
Along the path from the root to any child
the value of the CU function is known to initially
increase and then drop [4]. This fact is exploited
to extract the clusters in the following manner.
We start traversing the tree from the root. At each
node we calculate the CU function for it and each
of its children. If the CU of the parent is larger
than every child’s CU we take the parent as a
candidate cluster, and we don’t consider any
nodes under that parent any more. If some of the
children have CU larger than the parent those are
recursively processed in the same manner as their
parent. The other children, with CU lower than the
parent, are taken as candidate clusters in the final
partition. Detailed algorithm steps are given in
[4].
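The extraction procedure can be sketched recursively in Python as follows (our own illustration; `cu` denotes a function computing the category utility of a list of instances, and the simple Node type is defined here only for the sketch):

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class Node:
    instances: list
    children: List["Node"] = field(default_factory=list)

def extract_clusters(node: Node, cu: Callable[[list], float]) -> List[Node]:
    """Return candidate clusters by comparing a node's CU with its children's."""
    if not node.children:
        return [node]                         # a leaf is its own cluster
    parent_cu = cu(node.instances)
    child_cus = [cu(child.instances) for child in node.children]
    if all(c <= parent_cu for c in child_cus):
        return [node]                         # the parent dominates all children
    clusters = []
    for child, child_cu in zip(node.children, child_cus):
        if child_cu > parent_cu:
            clusters.extend(extract_clusters(child, cu))   # recurse into it
        else:
            clusters.append(child)            # kept as a final candidate cluster
    return clusters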
V. IMPLEMENTATION AND RESULTS
The incremental conceptual clustering
outlined above was implemented and tested
against the job offerings data set. Before
discussing the results we first opt to discuss an
important step for the preparation of the data set,
as the data set in its given format is not suitable
for the algorithm.
A. Data set description
The original data consists of 100
documents taken from a job offering newsgroup.
The original documents are in free text format,
which has been processed to extract certain
keywords from each document and produce 17
feature-value pairs for each document. However,
the feature values are still free text values, and
not suitable for the conceptual clustering
algorithm because it expects nominal-valued
features. Moreover, each attribute in a certain
instance can take zero, one, or many values,
making it even more difficult for the algorithm.
A conversion process had to be applied to
put the data set in a usable form. First, for every
attribute we extracted all the possible values that
the attribute can take from all the instances and
compiled a list of attribute values. Then a
dictionary-based approach was adopted to limit
the variation between similar values; i.e. all
values that are known to be similar but only differ
in free-text form are converted to the same value.
For example, the values Object Oriented
Design, Object Oriented Development, OO
Development, and OOD can all be converted to
OOD. This had the effect of greatly improving the
clustering by limiting the variation of values, and
thus similarity is greatly enhanced. Otherwise,
similar objects could be misclassified to different
clusters if their values should be the same but
only differ in free-text form. Table 1 shows the
list of attributes, their description, and sample
values.
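The dictionary-based conversion amounts to a synonym lookup; a minimal Python sketch is shown below (the entries are examples only, not the actual dictionary built for the data set):

# Hypothetical synonym dictionary: free-text variants -> canonical value
SYNONYMS = {
    "object oriented design": "OOD",
    "object oriented development": "OOD",
    "oo development": "OOD",
    "ood": "OOD",
}

def normalize(value):
    """Map a free-text attribute value onto its canonical nominal value."""
    return SYNONYMS.get(value.strip().lower(), value.strip())

print(normalize("Object Oriented Development"))  # -> OOD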
Numeric attributes such as salary and
years of experience were discretized to fixed
average values in the range found in the data.

Attribute            Description                     Sample value
id                   Message identification          [email protected]
title                Job title                       Programmer
salary               Job salary                      60K
company              Offering company                Pencom Software
recruiter            Recruiting company              JobBank USA
city                 City                            Chicago
state                State                           IL
country              Country                         USA
platform             Required platform               UNIX
area                 Required job area               Database
application          Application                     Oracle
language             Programming language            SQL
req_years_exp        Required years of experience    3
desired_years_exp    Desired years of experience     5
req_degree           Required educational degree     BS
desired_degree       Desired educational degree      MS
post_date            Message post date               17 Nov 1996

Table 1. Dataset attributes and sample values

As seen from the attribute table, some of the
attributes are not expected to help in correct
partitioning, such as the id attribute; such
attributes are usually dropped before clustering to
avoid their problematic consequences. A discussion
of the effect of choosing certain combinations of
attributes is presented later in this section.

Another problem faced with this data set is
that attributes can be multi-valued. To solve this
problem we make the assumption that any value of a
multi-valued attribute is representative of the
attribute; i.e. if an attribute has two values, for
example, either one of them is considered a
possible value for the attribute. This has the
effect that the probability distribution of a
certain attribute does not sum to unity. However,
this can be considered valid in the context of this
clustering problem, the reason being that we only
concern ourselves with how well the probability of
a certain attribute value predicts unseen
instances, independent of other attribute values.

B. Evaluating Cluster Partitions

To be able to assess the result of a certain
clustering operation, we adopt a measure known as
cohesion, which measures the degree of intra-class
similarity between objects in the same class. A
more formal definition, given in [4], is the
increased predictability of each feature value of
the objects in the data set, given the assigned
class structure. The increase in predictability for
an object d assigned to cluster k, $M_{dk}$, is defined as

$$ M_{dk} = \sum_{i, j \in \{A_i\}_d} \left( P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij})^2 \right) \qquad (3) $$

The cohesion of the partition structure is
measured as the sum of the $M_{dk}$ values for all
objects in the data set. This can be interpreted as
the increase in match between a data object and
its assigned cluster prototype over the match
between the data object and the data set
prototype. We rely on the cohesion measure to
assess the quality of the resulting partition.
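Equation (3) and the partition cohesion can be computed with a short Python sketch over clusters of single-valued instances (our own illustration; the multi-valued convention discussed above is omitted for brevity):

def predictability_increase(instance, cluster, dataset):
    """Equation (3): M_dk for one object, summed over its observed values."""
    m = 0.0
    for i, value in enumerate(instance):
        if value is None:
            continue
        p_cond = sum(1 for x in cluster if x[i] == value) / len(cluster)
        p_base = sum(1 for x in dataset if x[i] == value) / len(dataset)
        m += p_cond ** 2 - p_base ** 2
    return m

def cohesion(clusters, dataset):
    """Partition cohesion: sum of M_dk over every object in the data set."""
    return sum(predictability_increase(d, c, dataset)
               for c in clusters for d in c)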
C. Experimental Results
As a quick test of the algorithm outcome,
we tested the algorithm on a subset of 20
instances using 3 potentially correlated attributes
(company, application, and area). The result of
this experiment is shown in Table 2.
Cluster   # of instances   Dominating attribute values
                           company                    application       area
C1        7                N/A                        DB2, Foxpro       Database
C2        2                N/A                        N/A               Networking, TCP/IP
C3        6                N/A                        Oracle, Sybase    N/A
C4        3                SOA Consultant Services    N/A               Networking
C5        2                N/A                        MS Test           Software Quality Assurance

Table 2. Results of clustering 20 documents using 3 attributes
As seen from the dominating attribute
values in each cluster, the clusters exhibit an
acceptable degree of intra-class similarity. A
dominating attribute value is a value having a
higher probability in its cluster compared to other
values.
In order to be able to determine the
correct number of clusters in the dataset, we ran a
number of tests on different disjoint subsets of the
data set, each containing 25 instances. The
attributes used in this test were (company, title,
salary, language, platform, application, and
area). The results are shown in Table 3.
Dataset     # of clusters    Cohesion
1-25        4                12.66
26-50       5                15.21
51-75       5                24.21
76-100      2                 2.58

Table 3. Disjoint data subsets results

For the first 3 subsets, the result is almost
around 5 clusters, with the third quarter
exhibiting high cohesion. However, the last
quarter of the data set results in only 2 clusters,
with a very low cohesion value. A look at the
original documents of the last-quarter subset
revealed that the attributes have a large number of
multiple values, making the instances seem closer
to each other than they should be. Thus they
finally go to one of the two clusters, and as the
cluster sizes get larger they attract more
instances (see section 3).

To better verify the above results, a test
was run on different subset sizes of 20, 40, 60,
80, and the full 100 instances. The same attributes
were used for this test. Table 4 summarizes the
results of this test.

Dataset size    # of clusters    Cohesion    Time (sec)
20              3                11.37        10
40              4                11.52        23
60              6                22.64        67
80              6                29.81       180
100             5                23.41       300

Table 4. Overlapping data subsets results

As shown, the results verify that the
correct number of clusters is around 5. The 60-
and 80-instance data sets showed 6 clusters with
higher cohesion values, which can be explained
because the more clusters there are, the smaller
the size of each one, and the higher the similarity
will be within each cluster.

Figure 2. Overlapping data subsets results (number of clusters, cohesion, and time in seconds plotted against dataset size).
Figure 2 shows a plot of these results. The
plot shows that the time required to cluster the
data set is exponential in the size of the data set.
This is a very important note that limits the use of
this algorithm for very large sets of data. This
observation stems from the fact that the algorithm
requires an evaluation of the Category Utility
function given by equation (1), a costly function
requiring probability calculations for every value
of every attribute; this is done in a greedy loop
that evaluates the partition score when trying to
incorporate an instance into a child node. Other
experiments also showed that the number of attributes chosen
to do the clustering is an important factor in the
problem dimension.
D. Effect of using a subset of attributes
To be able to have a better understanding
of the underlying correlation between attributes,
we conducted a number of tests by choosing a
subset of the full attribute set. Different
combinations of attributes were chosen to study
the relationship between them. The following is a
list of potentially correlated attributes chosen for
this test:
• company, state, city, and country
• req_years_exp, desired_years_exp, and salary
• title, area, and application
• language, platform, and application

The results of this experiment are shown in Table 5.

Case    Attributes                                   # of clusters    Cohesion
1       company, state, city, country                3                60.84
2       req_years_exp, desired_years_exp, salary     5                27.3
3       title, area, application                     6                30.42
4       language, platform, application              4                22.19

Dominating attribute values (cases 1 and 2):

Attribute            C1             C2                 C3                 C4
company              Lion's Time    Info. Indust.      Lamreen Inc.       ………
state                OR, CA, MA     TN, IL, CO         TX
city                 Portland       Chicago, Denver    Houston, Dallas
country              USA            USA                USA
req_years_exp        2              2, 4               1, 2, 3            3
desired_years_exp    5              3, 4               3                  10
salary               65K            80K                50K                60K …

Dominating attribute values (cases 3 and 4):

Attribute      C1                     C2               C3                C4
title          Programmer             C/S Architect    QA Tester
application    Oracle, DB2, Sybase    DB2              SQA Test Suite
area           Database               Client/Server    Client/Server
language       C, C++, SQL            C, C++           C                 Assembly
platform       Intel, AS/400          Windows          Unix              MVS, Intel
application    Oracle, DB2            MS Test          Oracle            VSDM, DB2

Table 5. Results of using different attribute combinations

The table shows the most dominating
attribute values for each of the test cases. The
results are encouraging to some extent as most of
the attribute values are correlated in a certain
cluster. The number of clusters ranged between 3
and 6, which is consistent with the previously
obtained results. The cohesion measure showed
high values for all the cases, indicating a high
degree of intra-class similarity, especially in the
first case due to the high correlation between
companies and their locations (absolutely
correlated). Another experiment was done using
unrelated attributes (salary, recruiter, and
platform), which showed a very low degree of
cohesion (12.11), thus verifying that these
attributes are not closely related.
VI. CONCLUSION AND SUMMARY
The conceptual clustering scheme proved to
be a powerful tool for dealing with mixed-mode
data, and in particular nominal-valued data. In the
problem at hand we demonstrated the task of data
mining a large set of documents based on an
incremental conceptual clustering technique. The
method employs a probabilistic guidance function
that guides the search for “good” partitions of
data. This function, however, was found to be
costly, and its computation time rises
exponentially with the size of the data set.
Due to the nature of the given data set, it
had to be pre-processed to put it in a usable form.
Pre-processing was necessary since the attribute
values given are in free-text form. This step had a
great impact on the result achieved by the
algorithm since it limits the large variation in the
values of some attributes.
Several tests were conducted to find
groupings of the document data set. The results
were encouraging and showed meaningful
groupings of data. The number of clusters was
acceptable and ranged between 3 and 6 clusters.
The effect of choosing a subset of the attributes
was studied as well. Some attributes showed to be
more highly related than others. This leads us to
an important note: using all attributes for
clustering is not encouraged, since unrelated
attributes can result in meaningless groupings of
data and will decrease the intra-class similarity
of the clusters.
Finally, we note the importance of
conceptual clustering's ability to produce results
that can be readily interpreted, as opposed to
some other clustering techniques which might
produce results that are not interpretable.
VII. REFERENCES

[1] R. Michalski and R. E. Stepp, "Learning from observation: Conceptual clustering," in Machine Learning: An Artificial Intelligence Approach, R. Michalski, J. Carbonell, and T. Mitchell, Eds. Palo Alto, CA: Tioga Press, 1983, pp. 331-364.
[2] C. Li and G. Biswas, "Conceptual clustering with numeric-and-nominal mixed data – A new similarity based system," IEEE Trans. Knowledge and Data Engineering.
[3] D. Fisher, "Knowledge acquisition via incremental conceptual clustering," Machine Learning, vol. 2, no. 2, pp. 139-172, 1987.
[4] G. Biswas, J. B. Weinberg, and D. Fisher, "ITERATE: A Conceptual Clustering Algorithm for Data Mining," IEEE Trans. Systems, Man, and Cybernetics – Part C: Applications and Reviews, vol. 28, no. 2, 1998.
[5] M. Lebowitz, "Experiments with incremental concept formation," Machine Learning, vol. 2, pp. 103-138, 1987.
[6] R. Sison and M. Shimura, "Incremental Clustering of Relational Descriptions," Technical Report, ISSN 0918-2802, 1996.
[7] M. Gluck and J. Corter, "Information, uncertainty, and the utility of categories," in Proc. 7th Ann. Conf. Cognitive Sci. Soc., Irvine, CA, pp. 283-287, 1985.
[8] T. Mitchell, Machine Learning, McGraw Hill, 1997.
[9] ___, "Special Issue on Knowledge Discovery," Communications of the ACM, vol. 42, no. 11, pp. 31-57, Nov. 1999.
[10] ___, "Chameleon: Hierarchical Clustering using Dynamic Modeling," Computer Magazine, vol. 32, no. 8, August 1999.
[11] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, 1995.
[12] W. Iba and P. Langley, "Unsupervised Learning of Probabilistic Concept Hierarchies."
[13] D. Fisher, "Iterative Optimization and Simplification of Hierarchical Clustering," Technical Report CS-96-01.