2009 International Conference on Computer Technology and Development

Document Clustering using Concept Space and Cosine Similarity Measurement

Lailil Muflikhah
Department of Computer and Information Science, Universiti Teknologi Petronas
Brawijaya University
Bandar Seri Iskandar, Tronoh, Perak, Malaysia
[email protected]

Baharum Baharudin
Department of Computer and Information Science, Universiti Teknologi Petronas
Bandar Seri Iskandar, Tronoh, Perak, Malaysia
[email protected]

978-0-7695-3892-1/09 $26.00 © 2009 IEEE. DOI 10.1109/ICCTD.2009.206

Abstract- Document clustering is related to the data clustering concept, which is one of the data mining tasks and a form of unsupervised classification. It is often applied to huge data collections in order to partition them based on similarity. Initially, it was used in Information Retrieval to improve the precision and recall of a query. Clustering is easiest when the data has a small number of attributes that contain the important items. Furthermore, document clustering is very useful in information retrieval applications for reducing processing time and obtaining high precision and recall. Therefore, we propose to integrate an information retrieval method with document clustering as a concept space approach. The method is known as Latent Semantic Indexing (LSI), which uses Singular Value Decomposition (SVD) or Principal Component Analysis (PCA). The aim of this method is to reduce the matrix dimension by finding patterns in the document collection with reference to the co-occurrence of terms. Each method is applied to the term-document weights in the vector space model (VSM) for document clustering using the fuzzy c-means algorithm. Besides reducing the term-document matrix, this research also uses the cosine similarity measurement as a replacement for Euclidean distance in fuzzy c-means. As a result, the proposed method performs better than the existing method, with an F-measure of around 0.91 and an entropy of around 0.51.

Keywords- data mining; clustering; LSI; SVD; PCA; fuzzy c-means; Euclidean distance; cosine similarity

I. INTRODUCTION

Data mining is a technique for extracting patterns from hidden information. It finds and describes structural patterns in a data collection as a tool for helping to explain that data and make predictions from it. Generally, data mining tasks are divided into two major categories: predictive tasks, which aim to predict the value of a particular attribute based on the values of other attributes, and descriptive tasks, which aim to derive patterns (correlations, trends, clusters, trajectories, and anomalies) [1]. Clustering is a method for automatically organizing a large data collection by partitioning it, so that the objects in the same cluster are more similar to one another than to objects in other clusters. Document clustering concerns organizing a large collection of text data. In the field of Information Retrieval (IR), document clustering is used to automatically group documents that belong to the same topic in order to provide the user with browsing of retrieval results [2]. Some experimental evidence shows that IR applications can benefit from the use of document clustering [3]. Document clustering has long been used as a tool to improve the performance of retrieval and of navigating large data.

Clustering methods can be classified into hierarchical and partitioning methods. Partitioning clustering can be divided into hard clustering and overlapping (fuzzy) clustering. Any given document may cover multiple subjects or categories. This motivates a fuzzy clustering approach, which allows a document to appear in multiple clusters. It differs from hard clustering, in which a document belongs to exactly one cluster and the boundaries among clusters are assumed to be well defined. Many researchers have studied document clustering with hard clustering methods. One study shows that the Bisecting K-Means algorithm is better than both the basic K-Means algorithm and Agglomerative Hierarchical Clustering when using a similarity concept with the cosine formulation [4]. Other research reports better results when the grouping process uses the variance value [5] or Euclidean distance [6]. Furthermore, there are several applications of the Fuzzy C-Means algorithm: it has been applied to clustering symbolic data, where the clustering result has better quality than that of the hierarchical method (hard clustering) [7], and it has also been applied in text mining [8].

When clustering documents, the problem is that a very large number of terms or words are represented in the full-text vector space. Much lexical matching at the term level is inaccurate. A word can have several meanings, and several words can have the same meaning, so matching returns irrelevant documents. It is difficult to judge which documents belong to the same cluster for a specific category without selecting the terms that are meaningful or correlated with each other. Therefore, this research uses an Information Retrieval approach, Latent Semantic Indexing (LSI). In this method, the documents are projected onto a small subspace of the vector space and clustered. This creates a new abstract vector space that captures the important terms in order [2].

In this paper, we first describe information retrieval using the LSI concept. Then, we describe document similarity as the basic concept of clustering, and Fuzzy C-Means as the clustering algorithm applied to document clustering. After that, we present the methodology used to implement document clustering with LSI and cosine similarity, followed by the experimental evaluation. Finally, the performance is analyzed and conclusions are drawn.

II. INFORMATION RETRIEVAL

Information retrieval is the way to search for information that matches what the user desires. An unrelated document may be retrieved simply because terms occur in it accidentally, and on the other hand, related documents may be missed because no term in the document occurs in the query. Therefore, retrieval can be based on concepts rather than on terms, by first mapping items to a "concept space" and using a ranking of similarity, as shown in Fig. 1 [2].

Figure 1. Using concept for Retrieval Information

Fig. 1 shows a middle layer of concepts (c1 and c2) onto which both queries and documents are mapped, instead of relating documents and terms directly as in vector retrieval. In this example, a query for term t3 returns d2, d3, and d4 in the answer set because they relate to concept c2, without requiring that the documents contain term t3.

A. Latent Semantic Indexing (LSI)

Initially, latent semantic indexing (LSI) was developed to find patterns in a document collection in order to improve the accuracy of Information Retrieval. It uses Singular Value Decomposition (SVD) or Principal Component Analysis (PCA) to decompose the original matrix $A$ of the document representation and retains only the k largest singular values from the singular value matrix $\Sigma$. Only the largest singular values of $\Sigma$ are selected, together with the corresponding columns of the two other matrices $U$ and $V^T$. The choice of s determines how many of the "important concepts" the ranking will be based on. The assumption is that concepts with small singular values in $\Sigma$ are to be considered "noise" and thus can be ignored. Therefore, LSI can be depicted by how the sizes of the involved matrices reduce when only the first s singular values are kept for the computation of the ranking, and also by the positions of terms and documents in the matrix [9]. Fig. 2 depicts how LSI finds the pattern in a document collection.

Figure 2. Description of LSI in data collection

B. Singular Value Decomposition (SVD)

The Singular Value Decomposition (SVD) is a method that can find the patterns in a matrix and identify which words and documents are similar to each other. From the term (t) x document (d) matrix $A$, it creates new matrices $U$, $\Sigma$, and $V$ such that $A = U \Sigma V^T$, as illustrated in Fig. 3 [10].

Figure 3. The relative sizes of the three matrices when t > d

In Fig. 3, $U$ has orthogonal, unit-length columns ($U^T U = I$), called the left singular vectors; $V$ has orthogonal, unit-length columns ($V^T V = I$), called the right singular vectors; and $\Sigma$ is a diagonal matrix (k x k) of singular values, where k is the rank of $A$ ($\le \min(t, d)$). Generally, the matrices in $A = U \Sigma V^T$ must all be of full rank. The amount of dimension reduction needs to be chosen correctly in order to represent the real structure in the data [2].

C. Principal Component Analysis (PCA)

Principal component analysis is a method for finding k "principal axes", which form an orthonormal coordinate system that captures most of the variance in the data. Basically, PCA is obtained from the Singular Value Decomposition (SVD) of the covariance matrix, using the eigenvectors and eigenvalues of the covariance matrix [11, 12].

III. DOCUMENT DISSIMILARITY

The dissimilarity of data objects (documents) is given by the distance between a document taken as the cluster center and the other documents. The Euclidean distance, d, between two points, x and y, in one-, two-, three-, or higher-dimensional space is given by equation 1:

$$ d(x, y) = \sqrt{\sum_{k=1}^{n} (x_k - y_k)^2} \qquad (1) $$

where n is the number of dimensions and $x_k$ and $y_k$ are, respectively, the kth attributes (components) of x and y [1].
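Equation 1 can be sketched in a few lines of Python. This is only an illustration: the function name `euclidean_distance` and the three-term document vectors are hypothetical and do not come from the paper's data.

```python
import math


def euclidean_distance(x, y):
    """Equation (1): square root of the summed squared attribute differences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))


# Hypothetical term-frequency vectors over a three-term vocabulary.
doc_a = [3, 0, 1]
doc_b = [1, 0, 1]
doc_c = [6, 0, 2]  # same term proportions as doc_a, twice the length

print(euclidean_distance(doc_a, doc_b))  # 2.0
print(euclidean_distance(doc_a, doc_c))
```

In this toy example, `doc_c` has the same term proportions as `doc_a` yet lies farther from it than `doc_b` does, because the Euclidean distance is sensitive to vector length.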
In contrast, the similarity between data objects (documents) corresponds to a small distance within one cluster. Documents are often represented as vectors, where each attribute represents the frequency with which a particular term (word) occurs in the document. A measure of similarity for document clustering is the cosine of the angle between two vectors, as in equation 2 [1]:

$$ \cos(d_i, d_j) = \frac{d_i \cdot d_j}{\|d_i\| \, \|d_j\|} \qquad (2) $$

where $d_i$ and $d_j$ are two different documents.

IV. FUZZY C-MEANS CLUSTERING

There are various fuzzy clustering algorithms. One simple fuzzy clustering technique is the fuzzy c-means algorithm (FCM) of Bezdek [13], which marked the birth of the fuzzy method. FCM is known to produce a reasonable partitioning of the original data in many cases (see [14]) and is very quick compared to some other approaches. Besides that, FCM converges to a minimum or saddle point from any initialization; it may be either a local or the global minimum of the objective function [15]. In principle, the algorithm is based on minimization of the objective function $J(X; U, V)$. Generally, the objective function sums the dissimilarity weighted by the membership degree ($\mu$), as shown in equation 3 [16]:

$$ J(X; U, V) = \sum_{i=1}^{c} \sum_{k=1}^{n} (\mu_{ik})^m \, d(x_k, v_i) \qquad (3) $$

where d is the distance, V is the set of cluster centers $v_i$, X is the data (document-term matrix), $\mu_{ik}$ is the membership degree of document $x_k$ in cluster i, c is the number of clusters, and m is the fuzziness parameter.

V. METHODOLOGY

The proposed method uses the LSI concept in order to obtain a small vector space. The detailed steps for document clustering are as follows:

1. Document preprocessing, which includes case folding, parsing, stop word removal, and stemming.
2. Removing the terms which have a global frequency less than 2 and a local frequency more than half of the total number of documents.
3. Representing the full-text documents as a term-document matrix A (vector space model) using TF-IDF term weights.
4. Mapping the term-document matrix A to the document matrix V in concept space using the LSI approach. Since $A = U \Sigma V^T$ and $U^T U = I$ (as a property of the SVD matrices), thus:

$$ V = A^T U \Sigma^{-1} $$

5. Implementing the Fuzzy C-Means clustering algorithm with the document vectors of V as the representation of the document collection, using cosine similarity as a replacement for Euclidean distance, defined as $d(x, v) = 1 - \cos(x, v)$. It is applied in the objective function of the Fuzzy C-Means algorithm.

VI. PERFORMANCE EVALUATION

In order to assess the quality of clustering, there are two measurements, F-measure and entropy [17]. The basic idea comes from the information retrieval concept. In this technique, each cluster is treated as if it were the result of a query and each class as if it were the desired set of documents for the query. The formulation of the F-measure involves precision and recall, which for each cluster j and class i are as follows:

$$ \mathrm{Recall}(i, j) = \frac{n_{ij}}{n_i}, \qquad \mathrm{Precision}(i, j) = \frac{n_{ij}}{n_j} \qquad (4) $$

where $n_{ij}$ is the number of documents with class label i in cluster j, $n_i$ is the number of documents with class label i, and $n_j$ is the number of documents in cluster j. Thus, the F-measure of cluster j and class i is obtained as:

$$ F(i, j) = \frac{2 \cdot \mathrm{Recall}(i, j) \cdot \mathrm{Precision}(i, j)}{\mathrm{Precision}(i, j) + \mathrm{Recall}(i, j)} \qquad (5) $$

A higher F-measure means higher accuracy of the clusters, taking both precision and recall into account. Another measurement, related to the internal quality of clustering, is entropy (E), which can be formulated as:

$$ E_j = -\sum_{i} P(i, j) \log P(i, j) \qquad (6) $$

where P(i, j) is the probability that a document has class label i and is assigned to cluster j. Thus, the total entropy of the clusters is obtained by summing the entropies of each cluster weighted by the size of each cluster:

$$ E = \sum_{j} \frac{n_j}{n} E_j \qquad (7) $$

where $n_j$ is the size of cluster j and n is the total number of documents in the corpus. The lower the entropy, the higher the internal quality of the clusters.

VII. EXPERIMENTAL RESULT

We evaluated the performance of the proposed method using datasets taken from 20 Newsgroups [18]. The data are organized into four datasets by data volume: Binary2, Multi5, Multi7, and Multi10. They consist of short news items on various topics, and the topics are used as the class reference in the clustering process. The Binary2 dataset contains 200 documents (10215 terms), with 100 documents in each topic. The other datasets also contain 100 documents per topic: Multi5 contains 500 documents (18367 terms), Multi7 contains 700 documents (18661 terms), and Multi10 contains 1000 documents (25627 terms). Each dataset is clustered separately based on its number of topics.

The first step is document preprocessing, which reduces the volume of the data. The result after preprocessing is shown in Table I.

TABLE I. DESCRIPTION OF DATASET AFTER PREPROCESSING

| Dataset | Topics | Total Docs | Total Terms |
| Binary2 | talk.politics.mideast, talk.politics.misc | 200 | 7432 |
| Multi5 | comp.graphics, rec.motorcycles, rec.sport.baseball, sci.space, talk.politics.mideast | 500 | 13646 |
| Multi7 | alt.atheism, comp.sys.mac.hardware, misc.forsale, rec.autos, rec.sport.hockey, sci.electronics, talk.politics.guns | 700 | 13959 |
| Multi10 | alt.atheism, comp.sys.mac.hardware, misc.forsale, rec.autos, rec.sport.hockey, sci.crypt, sci.electronics, sci.med, sci.space, talk.politics.guns | 1000 | 18655 |

After that, the term weights are represented in a matrix (vector space model) using the TF-IDF method. By applying the LSI method, the dimension of the term-document matrix is reduced as shown in Table II. The size of the reduction is based on the selected k-rank, with the optimum condition searched in the interval [2...50].

TABLE II. REDUCTION DIMENSION OF TERM-DOCUMENT USING LSI

| Data set | Total Documents | Total Pattern-terms (SVD) | Total Pattern-terms (PCA) |
| Binary2 | 200 | 18 | 26 |
| Multi5 | 500 | 22 | 22 |
| Multi7 | 700 | 28 | 24 |
| Multi10 | 1000 | 20 | 24 |

The document collection is then clustered using the Fuzzy C-Means algorithm with the fuzziness parameter (m = 1.1), an error rate of 0.001, and a cluster number (c) set to the number of topics or classes. With the LSI method applied (SVD and PCA), the distribution of the term-document data for the binary2 dataset and the position of the cluster centers at a certain k-rank are illustrated in Fig. 4. The distribution of the dataset in clustering differs between the two methods.

[Figure 4. Dataset and cluster center distribution using LSI approach. Panels: SVD method of binary2 dataset; PCA method of binary2 dataset.]

VIII. DISCUSSION

The reduced dimension means that each document collection is described by k patterns. In order to determine the effect of the k-rank on cluster quality and to find the optimum condition (high performance), we apply these methods with various k-ranks. They are applied to the binary2 dataset using SVD and PCA with k = 1, 2, 3, …, 50, and the results are shown in Fig. 5 and Fig. 6.

[Figure 5. Performance of clustering for binary2 using SVD. Horizontal axis: k-rank (2 to 50); series: precision, recall, F-measure, entropy.]

[Figure 6. Performance of clustering for binary2 using PCA. Horizontal axis: k-rank (2 to 50); series: precision, recall, F-measure, entropy.]

Each LSI method reaches its optimum clustering performance at a different k-rank. For binary2, the optimum for SVD is at k-rank = 18, and for PCA it is at k-rank = 6. The same procedure is applied to the other datasets: multi5, multi7, and multi10. Furthermore, the comparison of document clustering performance without LSI and with LSI, including the cosine similarity measurement as a replacement for Euclidean distance, is shown in Fig. 7 for all datasets.

[Figure 7. Performance comparison of document clustering. Groups: no LSI, SVD, and PCA for binary2, multi5, multi7, and multi10; series: precision, recall, F-measure, and entropy, each for cosine and Euclidean.]

Fig. 7 shows that the accuracy of document clustering without LSI is very low, especially for large data volumes (multi7 and multi10) when Euclidean distance is used. For the multi7 dataset, it yields precision = 0.464, recall = 0.324, and F-measure = 0.381, with a high entropy of 2.337. The same holds for the multi10 dataset, which has the worst performance, with precision = 0.410, recall = 0.341, F-measure = 0.372, and entropy = 2.422. In contrast, when the LSI approach and cosine similarity as a replacement for Euclidean distance are applied in the Fuzzy C-Means algorithm, the performance in terms of both external and internal cluster quality is very high. The multi7 dataset obtains F-measure (svd: 0.906; pca: 0.903) and entropy (svd: 0.597; pca: 0.609), and multi10 obtains F-measure (svd: 0.888; pca: 0.887) and entropy (svd: 0.710; pca: 0.732). The performance on the remaining datasets also increases, although not significantly.

IX. CONCLUSION

Document clustering can be performed using concept space and cosine similarity. This significantly reduces the dimension of the term-document matrix with reference to the k-rank (total number of patterns). The average performance is also very high, with an F-measure of about 0.91 and an entropy of about 0.51. The improvement is significant for large data volumes (the multi7 and multi10 datasets), with an increase of more than 50%.

REFERENCES

1. Tan, P.-N., Steinbach, M., and Kumar, V., Introduction to Data Mining. Pearson International ed. 2006: Pearson Education, Inc.
2. Hearst, M.A. and Pedersen, J.O., Reexamining the cluster hypothesis, in Proceedings of SIGIR '96. 1996.
3. Jardine, N. and van Rijsbergen, C.J., The Use of Hierarchical Clustering in Information Retrieval. Information Storage and Retrieval, 1971. 7.
4. Steinbach, M., Karypis, G., and Kumar, V., A Comparison of Document Clustering Techniques. 2000, University of Minnesota.
5. Savaresi, S.M., Boley, D.L., Bittanti, S., and Gazzaniga, G., Cluster Selection in Divisive Clustering Algorithms. 2002.
6. Larose, D.T., An Introduction to Data Mining. Discovering Knowledge in Data. 2005: Wiley & Sons, Inc.
7. El-Sonbaty, Y. and Ismail, M.A., Fuzzy Clustering for Symbolic Data. IEEE Transactions on Fuzzy Systems, 1998. 6.
8. Rodrigues, M.E.S.M. and S., L., A Scalable Hierarchical Fuzzy Clustering Algorithm for Text Mining, in The 5th International Conference on Recent Advances in Soft Computing. 2004.
9. Aberer, K., EPFL-SSC, L.d.s.d.i. repartis, Editor. 2003.
10. Deerwester, S., et al., Indexing by latent semantic analysis. Journal of American Society for Information Science and Technology, 1990. 41: p. 391-407.
11. Smith, L., A Tutorial on Principal Component Analysis. 2002.
12. Shlens, J., A Tutorial on Principal Component Analysis. 2009.
13. Bezdek, J.C., Fuzzy Mathematics in Pattern Classification. 1973, Cornell University: Ithaca, New York.
14. Bezdek, J.C., Pattern Recognition with Fuzzy Objective Function Algorithms. 1981: Plenum Press.
15. Hathaway, R., Bezdek, J., and Tucker, W., An Improved Convergence Theory for the Fuzzy ISODATA Clustering Algorithms. Analysis of Fuzzy Information, 1987. 3 (Boca Raton: CRC Press): p. 123-132.
16. Miyamoto, S., Ichihashi, H., and Honda, K., Algorithms for Fuzzy Clustering. Methods in c-Means Clustering with Applications. Studies in Fuzziness and Soft Computing, Vol. 229. 2008, Osaka, Japan: Scientific Publishing Services Pvt. Ltd., Chennai, India.
17. Larsen, B. and Aone, C., Fast and Effective Text Mining Using Linear-time Document Clustering, in KDD-99. 1999: San Diego, California.
18. http://kdd.ics.uci.edu/databases/20newgroups.html
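As a closing illustration of the clustering step in the methodology, the following is a minimal Python sketch of fuzzy c-means that uses 1 - cos(x, v) as the dissimilarity. It is not the authors' implementation. The function names, the toy two-dimensional "concept space" vectors, the choice of initial centers, and the membership-weighted mean used to update the centers are all assumptions made for illustration; the paper does not state its update rules. The defaults m = 1.1 and tolerance 0.001 mirror the parameters reported in the experiments.

```python
import math


def cosine_dissimilarity(x, v):
    """Return 1 - cos(x, v); a zero vector is treated as maximally unrelated."""
    dot = sum(a * b for a, b in zip(x, v))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm if norm > 0 else 1.0


def fuzzy_c_means(docs, init_centers, m=1.1, tol=0.001, max_iter=100):
    """Alternate membership and center updates until the centers stop moving.

    Assumes m > 1. Returns (memberships, centers), where memberships[k][i]
    is the degree to which document k belongs to cluster i.
    """
    centers = [list(v) for v in init_centers]
    tiny = 1e-12  # keeps a document that coincides with a center from dividing by zero
    memberships = []
    for _ in range(max_iter):
        # Membership update: clusters with smaller dissimilarity get larger degrees.
        memberships = []
        for x in docs:
            dists = [max(cosine_dissimilarity(x, v), tiny) for v in centers]
            row = []
            for d_i in dists:
                ratio_sum = sum((d_i / d_j) ** (1.0 / (m - 1.0)) for d_j in dists)
                row.append(1.0 / ratio_sum)
            memberships.append(row)
        # Center update (assumed rule): mean of documents weighted by membership^m.
        new_centers = []
        for i in range(len(centers)):
            coeffs = [row[i] ** m for row in memberships]
            denom = sum(coeffs)
            new_centers.append([
                sum(c * x[j] for c, x in zip(coeffs, docs)) / denom
                for j in range(len(docs[0]))
            ])
        shift = max(
            abs(a - b)
            for old, new in zip(centers, new_centers)
            for a, b in zip(old, new)
        )
        centers = new_centers
        if shift < tol:
            break
    return memberships, centers


# Hypothetical document vectors in a two-dimensional concept space.
docs = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.8]]
memberships, centers = fuzzy_c_means(docs, init_centers=[docs[0], docs[2]])
for doc, row in zip(docs, memberships):
    print(doc, [round(u, 3) for u in row])
```

With m close to 1, as in the reported setting, the memberships are nearly crisp, so each toy document ends up almost entirely in one cluster. A larger m would spread membership across clusters more evenly.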