Text Documents Clustering

Greta Ciganaitė, Faculty of Informatics, Vytautas Magnus University, Kaunas, Lithuania. E-mail: [email protected]
Aušra Mackutė-Varoneckienė, Faculty of Informatics, Vytautas Magnus University, Kaunas, Lithuania. E-mail: [email protected]
Tomas Krilavičius, Vytautas Magnus University; Baltics Institute of Advanced Technology, Vilnius, Lithuania. E-mail: [email protected]

Abstract— Large amounts of textual information are generated every day, and existing techniques can hardly cope with such an information flow. However, users expect fast and accurate information management and retrieval tools. Clustering is a well-known technique for grouping similar data and thus making it more manageable and usable. Text clustering is an adaptation of clustering to a very specific kind of data: documents. However, it does not transfer directly to every language, i.e. the specifics of a language influence performance considerably, as the results for English and other well-investigated languages show. In this paper we apply different distance measures and clustering approaches to Lithuanian data, discuss the results, and provide recommendations for clustering documents in Lithuanian.

Keywords— text document clustering, similarity measures.

I. INTRODUCTION

Large amounts of textual information are generated every day, and existing techniques can hardly cope with such an information flow. However, users expect fast and accurate information management and retrieval tools. Clustering is a well-known technique for grouping similar data and thus making it more manageable and usable. Text clustering is an adaptation of clustering to a very specific kind of data: documents. However, it does not transfer directly to every language, i.e. the specifics of a language influence performance considerably, as the results for English and other well-investigated languages show.

There is quite a number of different clustering techniques [1, 2], which can be used with different similarity measures [1, 2]. In this paper we apply the most common similarity measures (Euclidean and cosine) to textual data. Moreover, we experiment with the well-known K-means [1, 2] and Expectation-Maximization (EM) clustering approaches. These approaches, as well as the similarity measures, have been investigated thoroughly for English [1, 2]; it is therefore reasonable to try them on Lithuanian as well.

The paper is organized as follows. In Section II we discuss different clustering techniques and evaluation measures. We then continue with the experiments in Section III and conclude the paper in Section IV.

II. CLUSTERING

Clustering is an unsupervised learning technique whose essence is to divide a set of documents into smaller sets according to the meaning of each document. Sets made up of documents with similar meaning are called clusters. The main properties of clusters are that each cluster must be as dissimilar from any other cluster as possible, while the documents within one cluster must be as similar as possible [1, 2]. Clustering can be performed in two ways: hard and soft clustering [1]. Hard clustering means that each document belongs to exactly one cluster (K-means, see Subsection II.B), whereas in soft clustering (EM, see Subsection II.C) each document can belong to more than one cluster: for every cluster, each document has a probability that expresses how strongly it is associated with that cluster.

The document clustering process consists of the following steps:
1. Document preparation.
2. Choice of a similarity measure.
3. Choice of a clustering algorithm.
4. Evaluation of the method.

A. Similarity

Generally, a similarity measure defines the similarity between two objects. In the clustering problem, a similarity measure is used to describe the homogeneity between two clusters. It is a very important part of clustering because, as mentioned above, documents in the same cluster must be as homogeneous as possible and the clusters must be as heterogeneous as possible [1, 2].

There are many ways to measure distance. The most popular distance measure is the Euclidean distance; additionally, we extend the study to the cosine distance. The Euclidean distance [3, 4], which can be interpreted geometrically as a straight line between two objects, is

    d(x, y) = sqrt( sum_{i=1}^{n} (x_i - y_i)^2 ),    (1)

where x and y are two objects, the i-th component defines an attribute of an object, and n is the number of attributes (dimensions).

Cosine similarity [3, 4] measures the cosine of the angle between two documents:

    cos(d_i, d_j) = (d_i · d_j) / (||d_i|| ||d_j||),    (2)

where d_i and d_j are two different documents.

It is also possible to measure the distance between two clusters: it can be understood as the average distance between all documents in the two clusters, as the distance between the nearest documents of the clusters, or as the distance between the furthest documents of the clusters [4].

B. K-means
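Formulas (1) and (2) can be sketched in a few lines of Python (a minimal illustration of the two measures on plain numeric vectors, not the toolkit used in the experiments):

```python
from math import sqrt

def euclidean(x, y):
    """Euclidean distance (1): straight-line distance between two vectors."""
    return sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def cosine_similarity(x, y):
    """Cosine similarity (2): dot product divided by the product of norms."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = sqrt(sum(xi * xi for xi in x))
    norm_y = sqrt(sum(yi * yi for yi in y))
    return dot / (norm_x * norm_y)

a, b = [1.0, 0.0, 2.0], [1.0, 1.0, 2.0]
print(euclidean(a, b))          # 1.0
print(cosine_similarity(a, b))  # ≈ 0.913
```

In document clustering the vectors would be term-frequency vectors over the attribute (word) space; the toy vectors here are assumptions for illustration.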
K-means is a simple unsupervised learning algorithm used to solve the clustering problem. Its essence is to repeatedly re-elect the cluster centroids, while every object that is not a centroid is assigned to the cluster of the nearest centroid. In each iteration the method computes the distances between the centroids and each object, and every object is assigned to the centroid to which its distance is shortest [1, 3, 4].

To sum up, the K-means algorithm step by step:
1. Fix the number of clusters.
2. Identify the initial centroids.
3. Measure the distances between the centroids and each element.
4. Identify a new centroid in each cluster.
5. Reassign each document to the cluster whose centroid is nearest.
6. Stop if the centroids remain the same after re-identification and each element is assigned to its nearest centroid; otherwise go to step 3.

This algorithm is very simple and suits big data sets, but it has one big problem: the number of clusters. This number must be selected before the first K-means iteration and cannot be changed during the process. Importantly, there is no general method for choosing the number of clusters. The simplest way is to experiment with different numbers of clusters, compare the results, and select the best one.

C. EM (Expectation-Maximization)

EM [5] is an iterative algorithm for likelihood maximization in problems with incomplete data. It consists of the following steps [5]:
1. (E-step) Filling in the missing data.
2. (M-step) Computing new parameters that maximize the likelihood of the completed data.
3. In EM all the completed data must be generated by one distribution (e.g. Gaussian) in the hard EM version, or by several distributions in the soft EM version [8].
The method works until convergence (in practice it converges only to a local maximum).

The EM algorithm is quite precise, but it is not very well suited to big corpora: as the experiments show, its execution time on big data sets is very long and the results are not very satisfactory.

D. Evaluation

External evaluation of a clustering model uses a confusion matrix (the result of the clustering), a table showing the numbers of documents that were assigned to clusters correctly and incorrectly. The following contingency table is computed from the confusion matrix [1, 2]:

TABLE I. CONTINGENCY TABLE

                   Relevant               Not relevant
    Retrieved      True positives (tp)    False positives (fp)
    Not retrieved  False negatives (fn)   True negatives (tn)

The external evaluation values (precision, recall and F-measure) are computed from the values of the contingency table (Table I).

Precision [1, 2] is the proportion of the retrieved information that is relevant, or simply the average probability of correct retrieval:

    P = tp / (tp + fp).    (3)

Recall [1, 2] describes the proportion of the relevant information that was retrieved, in other words, the average probability of complete retrieval:

    R = tp / (tp + fn).    (4)

F-measure (F1) [1, 2] is the harmonic mean of precision and recall:

    F1 = 2PR / (P + R) = 2tp / (2tp + fn + fp).    (5)

Inherently these measures belong to the interval [0; 1] or, converted to a percentage scale, lie between 0 and 100. Different users require different values: users such as students, who want to get precise information, usually require a higher precision than recall; conversely, users such as professional researchers prefer a higher recall than precision [1].

III. EXPERIMENTS

We use two different algorithms, K-means (with Euclidean distance and with cosine similarity) and EM, to cluster internet daily-newspaper articles from Lrytas.lt and posts from the largest internet forum for women, supermama.lt. The analysis of the confusion matrices obtained after clustering with each algorithm is presented in the result tables (all values are rounded).

Stemming and the Bag of Words (BOW) technique were used for document pre-processing. Stemming is the process of reducing words to their stem (e.g. pienas, pieno and pieną would all be stemmed to pien) [6]. Bag of Words splits text into distinct words, disregarding grammar and word order [9]. For clustering we produced an Attribute-Relation File Format (.ARFF) document, which contains the attributes and the data for each attribute [7]; in this case, the attributes are words.
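The pre-processing just described can be illustrated with a small sketch. The toy suffix-stripping "stemmer" below stands in for a real Lithuanian stemmer, and the suffix list and example words are assumptions for illustration only:

```python
from collections import Counter

def toy_stem(word):
    """Naive stand-in for a real Lithuanian stemmer: strip a few
    common noun endings. Illustrative only, not linguistically complete."""
    for suffix in ("as", "o", "ą", "ui", "u", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    """BOW: split text into words and count them, disregarding order."""
    tokens = [toy_stem(w) for w in text.lower().split()]
    return Counter(tokens)

print(bag_of_words("pienas pieno pieną"))  # Counter({'pien': 3})
```

All three inflected forms collapse to the stem pien, so they contribute to a single attribute in the resulting term-frequency vector, which is the effect stemming is meant to achieve before clustering.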
The articles from Lrytas.lt are classified into 3 groups, marked by the labels "1", "3" and "5", which correspond to the real classes "Aktualijos", "Įvykiai" and "Komentarai". The posts from supermama.lt are classified into 14 groups, labeled with the numbers 1 to 14. These labels also correspond to real classes: "Apie nėštumą ir gimdymą", "Bendras", "Grožis ir sveikata", "Konsultacijų centras", "Mamų bendravimas, susitikimai, akcijos", "Motinystė ir tėvystė", "Mūsų namai", "Naudingi patarimai", "Paramos skyrelis", "Poilsis, pomėgiai ir šventės", "Skelbimų lenta", "Socialinis gyvenimas", "Tarp mūsų, mergaičių", "Vaikų auklėjimas ir ugdymas". The data set statistics after pre-processing are presented in Table II.

TABLE II. NUMBER OF DATA OBJECTS AND CLASS DISTRIBUTIONS

    Data set       # of attributes   # of data objects   Class distribution
    Supermama.lt   3841              11353               1: 481, 2: 511, 3: 987, 4: 416, 5: 1552, 6: 186, 7: 1536, 8: 416, 9: 620, 10: 1387, 11: 1304, 12: 712, 13: 824, 14: 412
    Lrytas.lt      6886              1249                1: 402, 2: 286, 3: 561

TABLE III. LRYTAS.LT CLUSTERING RESULT TABLE

    Method                # of clusters   Incorrect, %   P, %    R, %    F1, %
    K-means (cosine)      K=3             35.87          57.26   53.60   51.98
    K-means (Euclidean)   K=3             73.42          24.01   24.58   23.88
    EM                    K=3             37.55          59.00   59.50   58.41

TABLE IV. SUPERMAMA.LT CLUSTERING RESULT TABLE

    Method                # of clusters   Incorrect, %   P, %    R, %    F1, %
    K-means (cosine)      K=14            91.06          7.93    11.44   6.47
    K-means (Euclidean)   K=14            94.79          3.39    7.01    1.98
    EM                    K=14            95.04          7.24    4.37    2.83

TABLE V. EXECUTION TIME TAKEN TO BUILD EACH MODEL

    Algorithm             Lrytas.lt, min   Supermama.lt, min
    K-means (cosine)      ~1.5             ~149
    K-means (Euclidean)   ~0.05            ~48
    EM                    ~3.5             ~200

The computer used was a DELL XPS with the following specification:
1. Intel® Core™ i7-3537U processor,
2. 8.00 GB RAM,
3. Windows 8 64-bit operating system.

In this case the percentage values of F1 and of incorrectly clustered documents are the most important for evaluating clustering effectiveness. The results show that K-means and EM are not equally suitable for all corpus sizes: for a small corpus EM is more effective than K-means, while for a big corpus EM works very slowly and is less effective.
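The precision, recall and F1 percentages reported in the result tables follow definitions (3)-(5) and can be computed directly from the contingency counts; a minimal sketch (the counts used below are hypothetical, not taken from the paper's data):

```python
def precision(tp, fp):
    """P = tp / (tp + fp), formula (3)."""
    return tp / (tp + fp)

def recall(tp, fn):
    """R = tp / (tp + fn), formula (4)."""
    return tp / (tp + fn)

def f1(tp, fp, fn):
    """F1 = 2PR / (P + R) = 2tp / (2tp + fn + fp), formula (5)."""
    return 2 * tp / (2 * tp + fn + fp)

# Hypothetical contingency counts for illustration:
tp, fp, fn = 80, 20, 40
print(round(precision(tp, fp), 3))  # 0.8
print(round(recall(tp, fn), 3))     # 0.667
print(round(f1(tp, fp, fn), 3))     # 0.727
```

Multiplying these values by 100 gives the percentage scale used in Tables III and IV.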
Comparing the F1 values, and knowing that clustering is best when F1 reaches its highest value (F1 = 1 = 100%), we can state that the best clustering results on the Lrytas.lt data set were reached by EM, while on the supermama.lt data set the best method was K-means with cosine similarity. It is important to note that on the Lrytas.lt data set K-means with cosine similarity reached almost the same F1 and incorrectly-clustered values as EM; despite the small difference in F1, and taking the clustering time into account (Table V), the K-means with cosine similarity results can be considered good because of the considerably shorter execution time. In the supermama.lt case the situation is a little different: K-means, as expected, reached the best results, and its execution time is about 51 min shorter than EM's.

Fig. 1. Clustering results.

IV. CONCLUSION

Document clustering techniques were analyzed, all steps of document clustering (similarity measures, clustering algorithms and evaluation of clustering models) were presented, and an experimental analysis was performed. This analysis has shown the differences between the clustering algorithms. K-means is suitable for big corpora because it works quite fast, but it is important to choose a suitable distance measure: in our case, although the time taken to build the K-means model with Euclidean distance is significantly shorter than with cosine similarity (Table V), K-means with cosine similarity is more effective than K-means with Euclidean distance (Table III, Table IV). The EM algorithm is more precise than K-means (Table III), but with a large set of attributes it works very slowly (Table V), and its precision then falls below that of K-means with cosine similarity (Table IV).

We are planning to experiment with different clustering techniques, e.g. hierarchical clustering, as well as with different similarity measures and with dimensionality reduction techniques that minimize the number of features to speed up clustering.

ACKNOWLEDGMENT

This research was funded by ESFA (VP1-3.1-ŠMM-10-V-02025).

REFERENCES

[1] C. D. Manning, P. Raghavan, and H. Schütze, An Introduction to Information Retrieval. Cambridge University Press, 2008.
[2] P.-N. Tan, M. Steinbach, and V. Kumar, Introduction to Data Mining. Pearson International ed.: Pearson Education, Inc., 2006.
[3] N. Sandhya, Y. Sri Lalitha, A. Govardhan, and K. Anuradha, "Analysis of Similarity Measures for Text Clustering," International Journal of Data Engineering, vol. 2, no. 4, July 2011.
[4] Improved Outcomes Software Inc., "Distance Metrics Overview," http://www.improvedoutcomes.com/docs/WebSiteDocs/Clustering/Clustering_Parameters/Distance_Metrics_Overview.htm (last visited: March 8, 2014).
[5] B. Liu, W. S. Lee, P. S. Yu, and X. Li, "Partially Supervised Classification of Text Documents," in Proceedings of the International Conference on Machine Learning (ICML 2002), pp. 387-394, 2002.
[6] "Stemming," http://www.comp.lancs.ac.uk/computing/research/stemming/general/ (last visited: March 8, 2014).
[7] "Attribute-Relation File Format (ARFF)," http://www.cs.waikato.ac.nz/ml/weka/arff.html (last visited: March 9, 2014).
[8] "Expectation Maximization. Hierarchical Clustering," http://www.facweb.iitkgp.ernet.in/~sudeshna/courses/ML06/mllecture19.pdf, 2005.
[9] "Bag-of-words model," http://en.wikipedia.org/wiki/Bag-of-words_model (last visited: March 20, 2014).