Data Mining using Conceptual Clustering

Khaled Hammouda
Prof. Mohamed Kamel
University of Waterloo, Ontario, Canada
(K. M. Hammouda, Department of Systems Design Engineering, University of Waterloo, Waterloo, Ontario, Canada N2L 3G1)

Abstract – The task of data mining is mainly concerned with the extraction of knowledge from large sets of data. Clustering techniques are usually used to find regular structures in data. Conceptual clustering is one technique that forms concepts out of data incrementally by subdividing groups into subclasses iteratively, thus building a hierarchy of concepts. This paper presents the use of conceptual clustering to mine a large set of documents and find meaningful groupings among them. An incremental conceptual clustering technique based on a probabilistic guidance function is implemented and tested against the data set, and the cohesion of the resulting cluster structure is measured.

Index Terms – data mining, conceptual clustering, document clustering, hierarchical clustering.

I. INTRODUCTION

DATA MINING is the field concerned with the nontrivial extraction of hidden and potentially useful information from large sets of data. With the current dramatic increase in the amount of data available, due to the high availability of low-cost storage and other factors, it has become interesting to discover knowledge in these data. Often there are regularities in large amounts of data that can only be uncovered by a smart knowledge-discovery algorithm. When no classification information is known about the data, a clustering algorithm is usually used to group the data such that the similarity within each group is larger than the similarity among groups. This is known as learning from observations, as opposed to the classification task, which is considered learning from examples.

In this paper we apply one of the machine learning methodologies, known as "conceptual clustering", to demonstrate the task of data mining a large set of documents. Early work on conceptual clustering was done by Michalski and Stepp [1], who proposed the conceptual clustering algorithm known as CLUSTER/2. The choice of conceptual clustering arises from the interesting property that it is mostly used for nominal-valued data. An extension exists that can deal with numeric data [2], but for the purpose of this paper we only need to be concerned with nominal-valued data, as the data set we are dealing with is inherently nominal and symbolic-valued. However, the data set contains a large number of attributes, and their values are non-fixed nominal values. Preprocessing of the data is therefore a very important step in making the data usable.

Conceptual clustering builds a structure out of the data incrementally by trying to subdivide a group of observations into subclasses. The result is a hierarchical structure known as the concept hierarchy. Each node in the hierarchy subsumes all the nodes underneath it, with the whole data set at the root of the hierarchy tree. A system known as COBWEB [3] was introduced by Fisher (1987); it performs conceptual clustering as described above using a probabilistic technique for measuring how well a certain observation fits in one of the groups constructed so far, hence the term "incremental". UNIMEM [5] is a similar system that performs incremental conceptual clustering, but it uses weights of attribute values for reorganization of the concept hierarchy.
Biswas et al. [4] proposed an improved conceptual clustering algorithm known as ITERATE, which alleviates the effect of the random ordering of observations and iteratively redistributes the observations among clusters to improve cluster cohesion.

To demonstrate the data mining task at hand, an incremental conceptual clustering technique is implemented and tested against the document data set. The data set contains documents of job offerings collected from a job-offering newsgroup. The data set has been preprocessed to extract key attributes and values. However, further preprocessing had to be done to refine the data set and produce more interpretable results. Section 5 describes the data set and the types of preprocessing done to improve its usability. The key difference between the system implemented here and other conceptual clustering systems is its ability to deal with free-text valued attributes and to solve the document clustering problem efficiently.

As a measure for assessing the performance of the clustering technique, we adopt a probabilistic measure known as cohesion [4] to measure the intra-cluster similarity of the cluster structure. Cohesion measures how close the observations inside a certain cluster are to each other. We also demonstrate the effect of choosing different combinations of attributes for clustering on the cohesion measure. Finally, the effect of ordering the observations to minimize cluster skewing is also studied.

The rest of this paper is organized as follows. Section 2 discusses the representation of data. Section 3 introduces the criterion function, and its bias, that guides the clustering algorithm. Section 4 presents the incremental clustering algorithm. Section 5 presents the implementation and results. Finally, section 6 gives a brief summary with conclusions.

II. DATA REPRESENTATION

In clustering schemes, data objects are usually represented as vectors of feature-value pairs. Features represent certain attributes of the objects that are known to be useful for the clustering task. Attributes that are not relevant to forming structures out of the data can lead to inaccurate results. Attributes can be numeric or non-numeric, forming a mixed-mode data representation. Conceptual clustering is one of the algorithms that can deal with mixed-mode data. However, conceptual clustering has primarily focused on attributes described by nominal values. The best way to combine numeric, ordinal, and nominal-valued data is still an open question.

If a convention is adopted for the ordering of the attributes in a given problem context, we can represent instances of data as feature vectors consisting of the attribute values only, where the attribute names themselves are implicitly known by their order. Sison and Shimura [6] proposed a relational description model for clustering data, as opposed to the usual propositional attribute-value pair representation. Usually attributes are single-valued, but sometimes they can be multi-valued, as in the document clustering problem at hand. In this case a convention has to be adopted to deal with multi-valued attributes, depending on the problem context.

In numeric clustering methods, a distance measure is used to find the dissimilarity between two instances. This distance is usually measured as the Euclidean distance or the Mahalanobis distance. For nominal-valued attributes, however, a distance such as the Manhattan distance can be used, which is simply the number of differences in the attribute-value pairs. A minimal sketch of such a distance follows below.
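As an illustration, the following minimal Python sketch counts attribute mismatches between two instances represented as dictionaries mapping attribute names to sets of nominal values, so that missing and multi-valued attributes can be expressed. The representation and the convention that a shared value counts as a match are assumptions made for this sketch, not taken from the paper's implementation.

```python
# Illustrative sketch (not the paper's implementation): instances are dictionaries
# mapping attribute names to sets of nominal values.

def manhattan_distance(a, b):
    """Count the attributes on which two instances differ.

    Convention assumed here: an attribute contributes 0 to the distance when the
    two value sets share at least one value, and 1 otherwise.
    """
    attrs = set(a) | set(b)
    return sum(1 for attr in attrs if not (a.get(attr, set()) & b.get(attr, set())))

# Example: two postings that agree on language but not on platform.
d1 = {"language": {"C++"}, "platform": {"unix"}}
d2 = {"language": {"C++", "C"}, "platform": {"windows"}}
print(manhattan_distance(d1, d2))  # -> 1
```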
Difficulty arises when attributes are multi-valued, as in the case at hand. Each instance in the document data set is represented as a vector of 17 attribute-value pairs. An attribute can have zero, one, or multiple values in a certain instance. Thus, care has to be taken when dealing with multi-valued and zero-valued attributes (see section 5). In the next section we present the function that guides the clustering algorithm to find useful structures in data.

III. CLUSTERING CRITERION FUNCTION

Clustering techniques typically rely on nonparametric probabilistic measures to define groupings. A clustering algorithm can be viewed as a search algorithm that looks for the "best" groupings of data among a multitude of different grouping structures. In this search there has to be a guidance function (heuristic) that evaluates certain groupings; based on this evaluation, the best one is selected. This has to be done incrementally as we introduce new instances to the system. Thus, instead of exhaustively searching the concept space, we limit ourselves to the direction given by this criterion function.

A well known probabilistic criterion function was developed by Gluck and Corter [7], called the category utility measure; it is used by both COBWEB and ITERATE. The category utility measure is based on a probability-matching strategy to establish the usefulness, or utility, of a category. The Category Utility (CU) of a class Ck is defined as

    CU_k = P(C_k) \cdot \left[ \sum_i \sum_j P(A_i = V_{ij} \mid C_k)^2 - \sum_i \sum_j P(A_i = V_{ij})^2 \right]    (1)

where P(Ck) represents the size of cluster Ck as a proportion of the entire data set, P(Ai = Vij) is the probability of attribute Ai taking on value Vij over the entire set, and P(Ai = Vij | Ck) is the conditional probability of its taking the same value in class Ck. This function represents the increase in the number of feature values that can be correctly guessed for class Ck over the expected number of correct guesses given that no class information is available. To evaluate an entire partition made up of K clusters, we use the average CU over the K clusters:

    \text{Partition Score} = \frac{1}{K} \sum_{k=1}^{K} CU_k    (2)

An important note is that the CU function trades off cluster size, P(Ck), against the predictive accuracy of the feature values, the bracketed term in equation (1). Thus this function favors large clusters over small ones. If the data contains consecutive similar objects, they tend to go to the same cluster, and as the cluster size increases, other, less similar objects are attracted to this oversized cluster, causing a skewed cluster structure. A method known as Anchored Dissimilarity Ordering (ADO) [4] is employed to order the data objects before partitioning at each step so that the distance between any two consecutive objects is maximized, thus avoiding the building of oversized clusters. The Manhattan distance described earlier is used as the distance measure between two objects. The object chosen to be next in the order is the one that maximizes the sum of the distances between it and the previous n objects already in the order. The window size n is user defined, but is usually taken to be the expected number of classes in the data.

The CU function is the basis for comparing different cluster partitions and selecting the partition with the highest partition score given by equation (2). A small illustrative sketch of these computations is given below.
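To make equations (1) and (2) concrete, here is a minimal Python sketch written for this text rather than taken from the paper's implementation. It assumes single-valued nominal instances represented as dictionaries mapping attribute names to values; the function names (`value_probs`, `category_utility`, `partition_score`, `ado_order`) and the data layout are illustrative assumptions. The last function sketches the ADO ordering described above; any nominal distance, such as the mismatch count sketched in section II, can be passed in as `dist`.

```python
from collections import Counter
from itertools import chain

def value_probs(instances):
    """P(Ai = Vij): relative frequency of each (attribute, value) pair."""
    counts = Counter(chain.from_iterable(inst.items() for inst in instances))
    n = len(instances)
    return {pair: c / n for pair, c in counts.items()}

def category_utility(cluster, dataset):
    """Equation (1): CU_k = P(Ck) * [sum P(Ai=Vij|Ck)^2 - sum P(Ai=Vij)^2]."""
    p_ck = len(cluster) / len(dataset)
    cond = value_probs(cluster)    # P(Ai = Vij | Ck)
    prior = value_probs(dataset)   # P(Ai = Vij)
    return p_ck * (sum(p * p for p in cond.values())
                   - sum(p * p for p in prior.values()))

def partition_score(partition, dataset):
    """Equation (2): average CU over the K clusters of a partition."""
    return sum(category_utility(c, dataset) for c in partition) / len(partition)

def ado_order(instances, window, dist):
    """Anchored Dissimilarity Ordering: repeatedly pick the remaining instance
    that maximizes the summed distance to the previous `window` instances placed
    so far. `dist` can be the mismatch-count distance sketched in section II."""
    remaining = list(instances)
    ordered = [remaining.pop(0)]
    while remaining:
        best = max(remaining,
                   key=lambda x: sum(dist(x, y) for y in ordered[-window:]))
        remaining.remove(best)
        ordered.append(best)
    return ordered
```

In the multi-valued setting used later in section V, the per-attribute probabilities are estimated per value and need not sum to one; the sketch keeps the single-valued case for brevity.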
A more detailed use of the function in the context of the clustering algorithm follows in the next section.

IV. INCREMENTAL CONCEPTUAL CLUSTERING

In conceptual clustering a concept can be viewed as a node in a hierarchical tree representing a hierarchy of concepts. Nodes (concepts) in the higher levels of the tree are more general than nodes in lower levels. Each node stores a list of instances that are covered by the concept at that node. Thus, the root of the tree represents all the instances in the data set. Lower level nodes represent subclasses of their parents, covering only the instances that match their specific concept. Leaves at the lowest level consist of one instance each, and represent the most specific concepts.

The representation of each concept takes the form of a probability distribution over each possible attribute value, calculated from the set of instances associated with that concept. For example, suppose we have two features per instance, language and platform. If a concept (node) carries only one instance, with language=C++ and platform=unix, then the probability distribution for feature values in this concept is C++=1.0 and unix=1.0, respectively, while any other attribute value has probability zero. If we add another instance to this concept that has language=C and platform=unix, then the new feature-value probability distribution for this concept becomes C++=0.5, C=0.5, and unix=1.0. Figure 1 shows an example of such a concept in a concept hierarchy tree.

Figure 1. Example of a concept in a concept hierarchy: a node C2 with P(C2)=2/4 and feature-value probabilities P(V|C) of Unix=1.0, Windows=0.0 for platform and C=0.5, C++=0.5 for language.

The following clustering algorithm is based on both COBWEB [3] and ITERATE [4]. At each node in the tree we try to partition the list of instances associated with that node among a set of classes, guided by the Category Utility function given by equation (1). We select the next instance to be considered and try to add it to each of the node's children, each time calculating the partition score. In addition, we try putting the instance in a new child by itself, and calculate the partition score as well. The partition with the highest partition score is chosen. If there are no children for the current node, a new child is created and the instance is added to it.

The concept hierarchy that is created is a representation of the classification of concepts and sub-concepts. However, for the purpose of clustering we need to extract from the tree the potential clusters that form a "good" representation of the underlying groupings in the data. After creating the concept hierarchy, the second step is therefore to extract the candidate clusters from the tree using the following procedure. Along the path from the root down to a leaf, the value of the CU function is known to initially increase and then drop [4]. This fact is exploited to extract the clusters in the following manner. We start traversing the tree from the root. At each node we calculate the CU function for it and for each of its children. If the CU of the parent is larger than every child's CU, we take the parent as a candidate cluster and do not consider any nodes under that parent any more. If some of the children have CU larger than the parent, those children are recursively processed in the same manner as their parent; the other children, with CU lower than the parent, are taken as candidate clusters in the final partition. Detailed algorithm steps are given in [4]. A simplified sketch of both steps is given below.
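The two steps just described can be pictured with the following simplified Python sketch. It is an illustration written for this text, not the author's implementation: the `Node` class, `place_instance`, and `extract_clusters` are hypothetical names, the placement step is shown for a single node (the full algorithm applies it recursively down the tree), and the sketch reuses the `category_utility` and `partition_score` helpers from section III, taking the whole instance set at the node (or the data set, for extraction) as the reference for the probability terms.

```python
class Node:
    """A concept: the instances it covers and its sub-concepts."""
    def __init__(self, instances=None):
        self.instances = list(instances or [])
        self.children = []

def place_instance(node, instance):
    """One placement step at a single node: try the instance in every existing
    child and in a new singleton child, and keep the highest partition score."""
    node.instances.append(instance)
    if not node.children:
        node.children.append(Node([instance]))
        return
    candidates = []
    for i, child in enumerate(node.children):          # option 1: an existing child
        trial = [c.instances + ([instance] if j == i else [])
                 for j, c in enumerate(node.children)]
        candidates.append((partition_score(trial, node.instances), ("existing", i)))
    trial = [c.instances for c in node.children] + [[instance]]   # option 2: new child
    candidates.append((partition_score(trial, node.instances), ("new", None)))
    _, (kind, i) = max(candidates, key=lambda t: t[0])
    if kind == "new":
        node.children.append(Node([instance]))
    else:
        node.children[i].instances.append(instance)

def extract_clusters(node, dataset):
    """Second step: keep a node as a candidate cluster once its CU is larger than
    every child's CU; otherwise recurse into the better children and keep the
    remaining children as clusters (after [4])."""
    if not node.children:
        return [node]
    parent_cu = category_utility(node.instances, dataset)
    better = [c for c in node.children
              if category_utility(c.instances, dataset) > parent_cu]
    if not better:
        return [node]
    clusters = []
    for child in node.children:
        if child in better:
            clusters.extend(extract_clusters(child, dataset))
        else:
            clusters.append(child)
    return clusters
```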
V. IMPLEMENTATION AND RESULTS

The incremental conceptual clustering scheme outlined above was implemented and tested against the job-offerings data set. Before discussing the results, we first discuss an important step in the preparation of the data set, as the data set in its given format is not suitable for the algorithm.

A. Data set description

The original data consists of 100 documents taken from a job-offering newsgroup. The original documents are in free-text format, which has been processed to extract certain keywords from each document and produce 17 feature-value pairs per document. However, the feature values are still free-text values and are not suitable for the conceptual clustering algorithm, which expects nominal-valued features. Moreover, each attribute in a certain instance can take zero, one, or many values, making it even more difficult for the algorithm. A conversion process had to be applied to put the data set in a usable form. First, for every attribute we extracted all the possible values that the attribute can take over all the instances and compiled a list of attribute values. Then a dictionary-based approach was adopted to limit the variation between similar values; i.e., all values that are known to be similar but only differ in free-text form are converted to the same value (a small sketch of such a mapping is given below). For example, the values Object Oriented Design, Object Oriented Development, OO Development, and OOD can all be converted to OOD. This greatly improved the clustering by limiting the variation of values, and thus similarity is greatly enhanced. Otherwise, similar objects could be misclassified into different clusters if their values should be the same but differ only in free-text form.

Table 1 shows the list of attributes, their descriptions, and sample values. Numeric attributes such as salary and years of experience were discretized to fixed average values in the range found in the data.

    Attribute          Description                      Sample value
    id                 Message identification           [email protected] dy.net
    title              Job title                        Programmer
    salary             Job salary                       60K
    company            Offering company                 Pencom Software
    recruiter          Recruiting company               JobBank USA
    city               City                             Chicago
    state              State                            IL
    country            Country                          USA
    platform           Required platform                UNIX
    area               Required job area                Database
    application        Application                      Oracle
    language           Programming Language             SQL
    req_years_exp      Required years of experience     3
    desired_years_exp  Desired years of experience      5
    req_degree         Required educational degree      BS
    desired_degree     Desired educational degree       MS
    post_date          Message post date                17 Nov 1996

    Table 1. Dataset attributes and sample values

As seen from the attribute table, some of the attributes are not expected to help in correct partitioning, such as the id attribute; such an attribute is usually dropped before clustering to avoid its problematic consequences. A discussion of the effect of choosing certain combinations of attributes is presented later in this section.
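The dictionary-based normalization can be pictured with a few lines of Python. The sketch is illustrative: only the OOD variants come from the example above, and the remaining dictionary entries and the function names are hypothetical.

```python
# Illustrative value-normalization dictionary; only the OOD variants come from the
# example in the text, the remaining entries are hypothetical.
SYNONYMS = {
    "object oriented design": "OOD",
    "object oriented development": "OOD",
    "oo development": "OOD",
    "ood": "OOD",
    "ms windows": "Windows",
    "win nt": "Windows",
}

def normalize_value(raw):
    """Map a free-text attribute value onto its canonical nominal value."""
    key = raw.strip().lower()
    return SYNONYMS.get(key, raw.strip())

def normalize_instance(instance):
    """Apply the dictionary to every value of every (possibly multi-valued) attribute."""
    return {attr: {normalize_value(v) for v in values}
            for attr, values in instance.items()}
```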
Another problem with this data set is that attributes can be multi-valued. To solve this problem we make the assumption that any value of a multi-valued attribute is a representative of that attribute; i.e., if an attribute has two values, for example, either one of them is considered a possible value for the attribute. This has the effect that the probability distribution of such an attribute does not sum to unity. However, this can be considered valid in the context of this clustering problem, the reason being that we only concern ourselves with how well the probability of a certain attribute value predicts unseen instances, independently of other attribute values.

B. Evaluating Cluster Partitions

To be able to assess the result of a certain clustering operation, we adopt a measure known as cohesion, which measures the degree of intra-class similarity between objects in the same class. A more formal definition, given in [4], is the increase in predictability of each feature value of the objects in the data set, given the assigned class structure. The increase in predictability for an object d assigned to cluster k, M_dk, is defined as

    M_{dk} = \sum_{i,j \in \{A_i\}_d} \left( P(A_i = V_{ij} \mid C_k)^2 - P(A_i = V_{ij})^2 \right)    (3)

where the sum runs over the attribute-value pairs present in object d. The cohesion of the partition structure is measured as the sum of the M_dk values over all objects in the data set. This can be interpreted as the increase in the match between a data object and its assigned cluster prototype over the match between the data object and the data set prototype. We rely on the cohesion measure to assess the quality of the resulting partition. A small sketch of this computation is given below.
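The following minimal Python sketch illustrates equation (3) and the cohesion sum, under the same single-valued data layout assumed earlier and reusing the `value_probs` helper from the section III sketch; `object_match` and `cohesion` are hypothetical names, not the author's code.

```python
def object_match(instance, cluster, dataset):
    """Equation (3): M_dk summed over the (attribute, value) pairs of object d."""
    cond = value_probs(cluster)    # P(Ai = Vij | Ck)
    prior = value_probs(dataset)   # P(Ai = Vij)
    return sum(cond.get(pair, 0.0) ** 2 - prior.get(pair, 0.0) ** 2
               for pair in instance.items())

def cohesion(partition, dataset):
    """Cohesion of a partition: the sum of M_dk over all objects in the data set."""
    return sum(object_match(inst, cluster, dataset)
               for cluster in partition for inst in cluster)
```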
C. Experimental Results

As a quick test of the algorithm's outcome, we ran it on a subset of 20 instances using 3 potentially correlated attributes (company, application, and area). The result of this experiment is shown in Table 2.

    Cluster  # of instances  Dominating attribute values
                             (company / application / area)
    C1       7               N/A / DB2, Foxpro / Database
    C2       2               N/A / N/A / Networking, TCP/IP
    C3       6               N/A / Oracle, Sybase / N/A
    C4       3               SOA Consultant Services / N/A / Networking
    C5       2               N/A / MS Test / Software Quality Assurance

    Table 2. Results of clustering 20 documents using 3 attributes

As seen from the dominating attribute values in each cluster, the clusters exhibit an acceptable degree of intra-class similarity. A dominating attribute value is a value having a higher probability in its cluster compared to other values.

In order to determine the correct number of clusters in the data set, we ran a number of tests on different disjoint subsets of the data set, each containing 25 instances. The attributes used in this test were company, title, salary, language, platform, application, and area. The results are shown in Table 3.

    Dataset          1-25    26-50   51-75   76-100
    # of clusters    4       5       5       2
    Cohesion         12.66   15.21   24.21   2.58

    Table 3. Disjoint data subsets results

For the first three subsets, the result is around 5 clusters, with the third quarter exhibiting high cohesion. However, the last quarter of the data set results in only 2 clusters with a very low cohesion value. A look at the original documents of the last quarter revealed that their attributes have a large number of multiple values, making the instances seem closer to each other than they should be. Thus they all end up in one of the two clusters, and as the cluster sizes grow they attract even more instances (see section 3).

To further verify the above results, a test was run on different subset sizes: 20, 40, 60, 80, and the full 100 instances. The same attributes were used. Table 4 summarizes the results of this test.

    Dataset size     20      40      60      80      100
    # of clusters    3       4       6       6       5
    Cohesion         11.37   11.52   22.64   29.81   23.41
    Time (sec)       10      23      67      180     300

    Table 4. Overlapping data subsets results

As shown, the results verify that the correct number of clusters is around 5. The 60- and 80-instance subsets showed 6 clusters with higher cohesion values, which is consistent with the fact that the more clusters there are, the smaller the size of each one, and the higher the similarity within each cluster.

Figure 2. Overlapping data subsets results (number of clusters, cohesion, and time plotted against data set size).

Figure 2 shows a plot of these results. The plot shows that the time required to cluster the data set grows exponentially with the size of the data set. This is a very important observation that limits the use of this algorithm for very large sets of data. It stems from the fact that the algorithm requires evaluation of the Category Utility function given by equation (1), which is costly because it requires a probability calculation for every value of every attribute, and this is done in a greedy loop that evaluates the partition score each time an instance is tentatively incorporated into a child node. Other experiments also showed that the number of attributes chosen for clustering is an important factor in the problem dimension.

D. Effect of using a subset of attributes

To gain a better understanding of the underlying correlation between attributes, we conducted a number of tests by choosing subsets of the full attribute set. Different combinations of attributes were chosen to study the relationships between them. The following potentially correlated attribute combinations were chosen for this test:

• company, state, city, and country
• req_years_exp, desired_years_exp, and salary
• title, area, and application
• language, platform, and application

The results of this experiment are shown in Table 5, which lists the most dominating attribute values for each test case; cells that were elided in the original table are marked "…".

    Case 1: company, state, city, country – cohesion 60.84 (3 clusters)
        C1: Lion's Time; OR, CA, MA; Portland; USA
        C2: Info. Indust.; TN, IL, CO; Chicago, Denver; USA
        C3: Lamreen Inc.; TX; Houston, Dallas; USA
    Case 2: req_years_exp, desired_years_exp, salary – cohesion 27.3 (5 clusters)
        C1: 2; 5; 65K
        C2: 2, 4; 3, 4; 80K
        C3: 1, 2, 3; 3; 50K
        C4: 3; 10; 60K
    Case 3: title, application, area – cohesion 30.42 (6 clusters)
        C1: Programmer; Oracle, DB2, Sybase; Database
        C2: C/S Architect; DB2; Client/Server
        C3: QA Tester; SQA Test Suite; Client/Server
        C4: …
    Case 4: language, platform, application – cohesion 22.19 (4 clusters)
        C1: C, C++, SQL; Intel, AS/400; Oracle, DB2
        C2: C, C++; Windows; MS Test
        C3: C; Unix; Oracle
        C4: Assembly; MVS, Intel; VSDM, DB2

    Table 5. Results of using different attribute combinations

The results are encouraging to some extent, as most of the attribute values within a given cluster are correlated. The number of clusters ranged between 3 and 6, which is consistent with the previously obtained results. The cohesion measure showed high values for all the cases, indicating a high degree of intra-class similarity, especially in the first case due to the high correlation between companies and their locations (absolutely correlated). Another experiment was done using unrelated attributes (salary, recruiter, and platform), which showed a very low degree of cohesion (12.11), thus verifying that these attributes are not strongly related.

VI. CONCLUSION AND SUMMARY

The conceptual clustering scheme proved to be a powerful tool for dealing with mixed-mode data, and in particular nominal-valued data. In the problem at hand we demonstrated the task of data mining a large set of documents using an incremental conceptual clustering technique. The method employs a probabilistic guidance function that guides the search for "good" partitions of the data. This function, however, was found to be costly, and its computation rises exponentially with the size of the data set.
Due to the nature of the given data set, it had to be pre-processed to put it in a usable form. Pre-processing was necessary since the attribute values given are in free-text form. This step had a great impact on the results achieved by the algorithm, since it limits the large variation in the values of some attributes.

Several tests were conducted to find groupings of the document data set. The results were encouraging and showed meaningful groupings of the data. The number of clusters was acceptable and ranged between 3 and 6. The effect of choosing a subset of the attributes was studied as well. Some attributes proved to be more highly related than others. This leads to the important note that using all attributes for clustering is not encouraged, since unrelated attributes can result in meaningless groupings of the data and decrease the intra-class similarity of the clusters. Finally, we note that conceptual clustering lets us better interpret the results of the algorithm, as opposed to some other clustering techniques, which might produce results that are not interpretable.

VII. REFERENCES

[1] R. Michalski and R. E. Stepp, "Learning from observation: Conceptual clustering," in Machine Learning: An Artificial Intelligence Approach, R. Michalski, J. Carbonell, and T. Mitchell, Eds. Palo Alto, CA: Tioga Press, 1983, pp. 331-364.
[2] C. Li and G. Biswas, "Conceptual clustering with numeric-and-nominal mixed data – A new similarity based system," IEEE Trans. Knowledge and Data Engineering.
[3] D. Fisher, "Knowledge acquisition via incremental conceptual clustering," Machine Learning, vol. 2, no. 2, pp. 139-172, 1987.
[4] G. Biswas, J. B. Weinberg, and D. Fisher, "ITERATE: A conceptual clustering algorithm for data mining," IEEE Trans. Systems, Man, and Cybernetics – Part C: Applications and Reviews, vol. 28, no. 2, 1998.
[5] M. Lebowitz, "Experiments with incremental concept formation," Machine Learning, vol. 2, pp. 103-138, 1987.
[6] R. Sison and M. Shimura, "Incremental clustering of relational descriptions," Technical Report, ISSN 0918-2802, 1996.
[7] M. Gluck and J. Corter, "Information, uncertainty, and the utility of categories," in Proc. 7th Ann. Conf. of the Cognitive Science Society, Irvine, CA, 1985, pp. 283-287.
[8] T. Mitchell, Machine Learning. McGraw-Hill, 1997.
[9] ___, "Special Issue on Knowledge Discovery," Communications of the ACM, vol. 42, no. 11, pp. 31-57, Nov. 1999.
[10] ___, "Chameleon: Hierarchical clustering using dynamic modeling," Computer, vol. 32, no. 8, Aug. 1999.
[11] S. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Prentice Hall, 1995.
[12] W. Iba and P. Langley, "Unsupervised learning of probabilistic concept hierarchies."
[13] D. Fisher, "Iterative optimization and simplification of hierarchical clusterings," Technical Report CS-96-01.