BULETINUL INSTITUTULUI POLITEHNIC DIN IAŞI
Publicat de Universitatea Tehnică „Gheorghe Asachi" din Iaşi
Tomul LVII (LXI), Fasc. 1, 2011
Secţia AUTOMATICĂ şi CALCULATOARE

A PRACTICAL CASE STUDY ON THE PERFORMANCE OF TEXT CLASSIFIERS

BY MIRCEA IONUŢ ASTRATIEI* and ALEXANDRU ARCHIP

"Gheorghe Asachi" Technical University of Iaşi,
Faculty of Automatic Control and Computer Engineering

* Corresponding author; e-mail: [email protected]

Received: January 11, 2011
Accepted for publication: March 14, 2011

Abstract. This paper aims to improve the results of the K-Means clustering and K-NN classification algorithms in order to aid the human expert in choosing the number of clusters and their initial centers for the K-Means algorithm and the variable K for the K-NN algorithm. We present a set of comparative results between classifications performed using only the human expert as trainer and an automatic approach that uses clustering results as training sets for the classification of text documents.

Key words: data mining, K-Means, K-Nearest-Neighbor, clustering, classification, English/Romanian documents.

2000 Mathematics Subject Classification: 53B25, 53C15.

1. Introduction

Data mining is the process of extracting patterns from data and establishing relationships. It is used in many areas such as mathematics, cybernetics, genetics, marketing and web search engines. The classification of web search results is only one of the newest applications of data mining: web documents are clustered or classified based on their content, and search results relevant to a given set of keywords are better presented to the client (Grossman & Frieder, 2004).

Considering the huge amount of data gathered from the Internet, it is essential to aid the human expert in applying data mining algorithms. The present paper introduces an automated method to classify the collected documents based on text analysis and similarity determination using the cosine similarity metric. Finally, the paper compares the results of the generic and improved K-Means and K-NN algorithms and highlights the fact that a human expert cannot face the huge amount of data, unlike the computer, which can compute the similarity more accurately using formulas.

2. Clustering and Classification – Generic Data Mining Algorithms

2.1. Clustering

Clustering is the process of grouping data into classes called clusters, based on attribute values that are characteristic of all the objects. The elements in one cluster must have similar attribute values and must differ from the elements of the other clusters. This similarity is usually computed as a distance between the attributes that define the objects. Clustering methods can be organized into several categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, model-based methods, methods for high-dimensional data (such as frequent pattern-based methods) and constraint-based clustering (Han & Kamber, 2006; Hornick et al., 2007).

Partitioning methods divide the given data into a known number of groups that represent the clusters. A cluster must satisfy two conditions: it has to contain at least one object and each object must belong to exactly one cluster. Partitioning methods determine all clusters during the first iteration and then improve the results by moving objects between clusters at each new iteration.
Final results are given when no further beneficial change can be made. Throughout each iteration, the clusters are defined either by a dominant object or by a new object obtained from processing the values of the characteristic attributes.

Hierarchical methods can be either agglomerative or divisive, also called bottom-up and top-down strategies, and consist in building a hierarchy of clusters. The data is not partitioned in a single step; instead, a series of partitions takes place. Agglomerative methods proceed by successively merging objects into groups. The divisive approach, on the other hand, starts with one group, or a small number of groups, which are successively split into a larger number of classes.

Density-based methods focus on the notion of compactness, and the main idea is to start from a cluster already created. Each new object is added to the cluster that contains its closest neighborhood.

One of the partitioning methods used in statistics and machine learning is the K-Means algorithm, which aims to create a group of K clusters from an initial set of n objects, each object being assigned to the cluster with the closest mean. The centroid-based technique for K-Means uses a number n of objects and a constant K representing the number of classes into which the human expert wants to split the data (Han & Kamber, 2006; Hornick et al., 2007).

Brief description of how the algorithm works: for the given K clusters, K objects are needed to represent the centers of the respective clusters. These initial centroids are in most cases chosen randomly from the lot of n objects. The remaining n-K objects are assigned to the most similar cluster with respect to the metric function used; usually the Euclidean metric is employed to compute the distance between the center of a cluster and an object. After all n objects are assigned, a new center is determined for each cluster. This new center may be the most representative object in the lot, or it may be defined as a new object obtained by computing the mean values of the significant attributes over all members (Han & Kamber, 2006; Hornick et al., 2007; Moore, 2001). The algorithm reaches convergence when no objects are moved between clusters or when the centroids of the clusters no longer change their attribute values.
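To make the description above concrete, the fragment below sketches one K-Means pass over numeric attribute vectors. It is a minimal illustration under our own assumptions (the fixed dimensionality DIM, the array layout and the helper names are ours, and the centroid update uses the mean-vector variant); it is not the implementation evaluated later in this paper.

```c
#include <float.h>
#include <stddef.h>

#define DIM 4   /* number of attributes per object (illustrative value) */

/* Squared Euclidean distance between two attribute vectors. */
static double sq_dist(const double a[DIM], const double b[DIM])
{
    double s = 0.0;
    for (size_t i = 0; i < DIM; ++i) {
        double d = a[i] - b[i];
        s += d * d;
    }
    return s;
}

/* One K-Means pass: assign each of the n objects to the nearest of the
 * k centroids, then recompute every centroid as the mean of its members.
 * Returns how many objects changed cluster (0 signals convergence). */
static int kmeans_pass(double obj[][DIM], int n,
                       double cent[][DIM], int k, int assign[])
{
    int moved = 0;

    for (int i = 0; i < n; ++i) {
        int best = 0;
        double best_d = DBL_MAX;
        for (int c = 0; c < k; ++c) {
            double d = sq_dist(obj[i], cent[c]);
            if (d < best_d) { best_d = d; best = c; }
        }
        if (assign[i] != best) { assign[i] = best; ++moved; }
    }

    for (int c = 0; c < k; ++c) {       /* per-cluster attribute means */
        double sum[DIM] = {0.0};
        int count = 0;
        for (int i = 0; i < n; ++i) {
            if (assign[i] != c) continue;
            for (size_t j = 0; j < DIM; ++j) sum[j] += obj[i][j];
            ++count;
        }
        if (count > 0)
            for (size_t j = 0; j < DIM; ++j) cent[c][j] = sum[j] / count;
    }
    return moved;
}
```

Calling kmeans_pass repeatedly until it returns 0 reproduces the convergence condition stated above; choosing the k initial centroids is exactly the problem addressed in Sec. 3.1.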
2.2. Classification

Classification is a form of data analysis in which a model is used to predict the category to which a new object belongs. The main difference between clustering and classification is that in classification the number of classes and their labels are predetermined. The process has two steps. The first step, also known as the learning step, consists of analyzing a predetermined training set of data classes or concepts; unlike clustering, this first stage is a supervised learning strategy. In the second step, the classifier estimates whether new data belongs to one of the classes already known, based on the rules obtained in the first step. There are many classification algorithms: decision trees, Bayesian classification, rule-based classification and learning from your neighbors.

Decision tree algorithms are based on a tree structure where internal nodes represent test conditions for various attributes. For each such node, an edge represents the decision taken to reach another node, and the leaves indicate the classes determined by the resulting set of possible decisions (Han & Kamber, 2006; Hornick et al., 2007). The same idea is used in rule-based classification algorithms, where the learning model is represented by a set of IF-THEN rules (Han & Kamber, 2006; Hornick et al., 2007). Bayesian classification relies on statistical data and on probabilities computed with Bayes' theorem (Han & Kamber, 2006). Lazy learners (or learning-from-your-neighbors methods) do not construct an explicit general model from the training data set; they postpone the work until a new object has to be classified.

Brief description of how the algorithm works: a typical lazy learner is the K-Nearest-Neighbor (K-NN) algorithm, whose efficiency has been demonstrated on large amounts of data. The principle of K-NN classification consists in comparing the new data with the training set and learning from this analogy. Given a new object, the algorithm searches within the training data set for the K most similar objects (similarity is again considered with respect to a given metric function). The result is given by the simple majority of these similar objects. K represents the number of training objects against which the new object is compared and must be a positive, non-null integer. The value of K is important and its choice influences the predictions: for K=1 the unclassified object is assigned to the class of the single most similar object in the training data (Han & Kamber, 2006; Hornick et al., 2007; Moore, 2001).

We chose these two forms of data analysis to group a lot of diverse text documents into clusters based on their semantic meaning. The results can be used successfully to organize the documents extracted by a web crawler. The semiautomatic method is very useful when the amount of data is very large, as in the case of a web crawler that brings in millions of pages every day.

3. Improvements

3.1. K-Means

Our implementation of the K-Means algorithm eliminates the need for a human expert when choosing the number of clusters and the initial centroids. Furthermore, after the clusters are obtained, a re-clusterization is performed for the documents that belong to none of the clusters already obtained. Similar work is done in (Pelleg & Moore, 2000; Muhr & Granitzer, 2009), but unlike those approaches ours determines, in a semiautomatic way, both the variable K and the initial centers for each of the K clusters.

The number of clusters and the initial centroids

We suppose that there is at least one cluster and we randomly choose the center of the first cluster. Then the distance between this centroid and the remaining documents is computed and the most similar ones are set aside. We consider documents similar if their distance does not exceed a threshold; in our case this threshold is obtained by observing the distances between all the documents we have tested (details in (Astratiei & Archip, 2010)). The threshold can be varied if a more refined or more compact classification is needed. Once we have a centroid, we compute the average distance, using cosine similarity (Astratiei & Archip, 2010), between the centroid and the documents that are most similar to it. This average is used to determine the center of the next cluster: the document farthest from the last determined centroid, relative to the average already calculated, indicates the next centroid. These steps are repeated until there are no documents remaining.
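To make the seeding procedure concrete, the fragment below sketches the cosine similarity measure and one seeding step. The term-weight vector representation, the names cosine_sim, next_centroid and the threshold parameter are our own illustrative assumptions; the actual threshold is derived empirically as noted above, and this is a sketch of the idea rather than the authors' implementation.

```c
#include <math.h>
#include <stddef.h>

/* Cosine similarity of two term-weight vectors of length dim:
 * sim(a, b) = (a . b) / (|a| |b|), which lies in [0, 1] for
 * non-negative term weights. */
double cosine_sim(const double *a, const double *b, size_t dim)
{
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (size_t i = 0; i < dim; ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    if (na == 0.0 || nb == 0.0)
        return 0.0;                     /* empty document: treat as dissimilar */
    return dot / (sqrt(na) * sqrt(nb));
}

/* One seeding step: given the newest centroid, mark every document whose
 * cosine distance (1 - similarity) stays within `threshold` as covered by
 * that centroid, compute the average distance of the covered documents,
 * and return the index of the uncovered document lying farthest from the
 * centroid relative to that average (or -1 when all documents are covered).
 * docs[i] is the term-weight vector of document i. */
static int next_centroid(const double *docs[], size_t dim, int n,
                         int centroid, int covered[], double threshold)
{
    double sum = 0.0, avg = 0.0, worst = 0.0;
    int n_covered = 0, next = -1;

    covered[centroid] = 1;
    for (int i = 0; i < n; ++i) {
        if (covered[i])
            continue;
        double dist = 1.0 - cosine_sim(docs[centroid], docs[i], dim);
        if (dist <= threshold) {        /* similar enough: joins this seed */
            covered[i] = 1;
            sum += dist;
            ++n_covered;
        }
    }
    if (n_covered > 0)
        avg = sum / n_covered;

    for (int i = 0; i < n; ++i) {       /* farthest remaining document,   */
        if (covered[i])                 /* measured against the average   */
            continue;
        double dist = 1.0 - cosine_sim(docs[centroid], docs[i], dim);
        if (next < 0 || dist - avg > worst) {
            worst = dist - avg;
            next = i;
        }
    }
    return next;
}
```

Calling next_centroid repeatedly, each time with the index it returned, until it returns -1 yields the candidate centroids and, implicitly, the number of clusters K.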
Finally, we obtain a set of centroids that represents the input data for the K-Means algorithm, obtained automatically. Having the number of clusters and the initial centroids, we can apply the classic K-Means algorithm. Iterations are performed until there are no more changes in the cluster memberships. After each iteration a new center is computed; as the new centroid we choose the document that is closest to all the other ones in the same cluster. The results are then filtered: if there is at least one document in a cluster that is farther from the final centroid than the input threshold, a re-clusterization is performed, applying the same steps to the lot of "rejected" files. After a re-clusterization at least one new cluster must appear. This condition reduces the probability of placing documents in clusters that contain dissimilar documents. This improvement of the algorithm aids the human expert in better detecting the fields described by the documents (Salvador & Chan, 2003; Li et al., 2003; Matveeva, 2006; Arthur & Vassilvitskii, 2007).

3.2. K-NN

The classic K-NN algorithm is independent of other data mining methods and needs a set of input data to learn from. The input data for the K-NN algorithm are the variable K and the training set, which represents the model for learning. K is generally determined experimentally, starting with K = 1 and observing the error rate. This determination also involves human influence; therefore, for an automatic classification algorithm, we choose the value of K depending on the number of files in the clusters. For a minimum error rate we consider K to be double the size of the smallest cluster.

If we associate these two data mining methods, clustering and classification, we obtain an automatic training data set for K-NN. This set consists of the results of the K-Means clusterization: the classes obtained represent the training model for the classifier, and newly brought documents are associated with the clusters to which they are most similar. After testing on 90 files in Romanian and 90 files in English using this association of algorithms and the automatic detection of the input data for K-Means, we managed to improve the precision.
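As an illustration of how the cluster-derived training set and the choice K = 2 x (size of the smallest cluster) fit together, the sketch below classifies one new document by majority vote among its k most similar training documents. The function name knn_classify, the MAX_CLASSES bound and the data layout are our own illustrative assumptions, cosine_sim() is the routine sketched in Sec. 3.1, and this is not the authors' ANSI C implementation.

```c
#include <stddef.h>
#include <stdlib.h>

#define MAX_CLASSES 32    /* illustrative upper bound on the number of clusters */

/* Cosine similarity routine from the seeding sketch in Sec. 3.1. */
double cosine_sim(const double *a, const double *b, size_t dim);

/* Classify one query document by simple majority among its k most similar
 * training documents.  train[i] is the term-weight vector of training
 * document i and label[i] its K-Means cluster id in [0, classes), with
 * classes <= MAX_CLASSES.  Following Sec. 3.2, k would be set to twice the
 * size of the smallest cluster.  Returns the winning cluster id, or -1 on
 * allocation failure. */
static int knn_classify(const double *query, const double *train[],
                        const int label[], int n, size_t dim,
                        int classes, int k)
{
    double *sim = malloc((size_t)n * sizeof *sim);
    int votes[MAX_CLASSES] = {0};
    int winner = -1;

    if (sim == NULL)
        return -1;
    for (int i = 0; i < n; ++i)
        sim[i] = cosine_sim(query, train[i], dim);

    /* Pick the k highest-similarity neighbours and count their labels. */
    for (int picked = 0; picked < k && picked < n; ++picked) {
        int best = -1;
        for (int i = 0; i < n; ++i)
            if (sim[i] >= 0.0 && (best < 0 || sim[i] > sim[best]))
                best = i;
        votes[label[best]]++;
        sim[best] = -1.0;               /* mark this neighbour as consumed */
    }
    free(sim);

    for (int c = 0; c < classes; ++c)   /* simple majority over the votes  */
        if (winner < 0 || votes[c] > votes[winner])
            winner = c;
    return winner;
}
```

In the combined scheme, the training vectors and labels come directly from the clusters produced by the automatic K-Means stage, so no manually labeled training set is required.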
4. Results

The test application was developed in ANSI C. Tests were performed on a total of 180 text documents: 90 in Romanian and 90 in English. Tests for the K-NN classification were performed on 52 new text documents in English and 46 new text documents in Romanian. For the Romanian documents we chose 12 fields: history, physics, animals, gothic art, heart diseases, psychology, computer science, chemistry, philosophy, economy, astrology and geography. The English documents belonged to the following fields: advertising, AIDS, London, Cold War, cloning, Shakespeare, music, pollution, racism, operating systems, Greek mythology, internet and drugs in sports. In Tables 1,…,4 the columns marked with "*" denote the choice made from the human expert's perspective.

4.1. K-Means Tests

For K-Means we performed six tests: one for our automatic implementation and five for the classic method, with centers chosen randomly and centers chosen from the human expert's perspective. We recorded the total number of resulting clusters and the number of incorrect clusters, and then computed the percentage of correct clusters relative to the total number.

Results interpretation

For the Romanian documents there is a significant difference between the automatic approach, which in our case correctly clustered all the files, and the classic K-Means approach, where the percentage of correct clusters in the randomly chosen centroids case is around 50-60%. The classic method with random centroids gives incorrect results in our tests, so it cannot be a reliable solution. We obtained better results when we chose the centroids ourselves, but in a real-world scenario this would require the presence of a good expert. During testing we also discovered cases where the expert tends to extend the meaning of a document to a larger class because of the ideas exposed in the file. More details are given in (Astratiei & Archip, 2010).

Table 1
Results of K-Means Clustering for Romanian Documents

                                  Auto    Classic (random / chosen* centroids)
                                  No 1    No 1     No 2     No 3     No 4     No 5
Total clusters                    20      15       15       15       15       15
Correct clusters                  20      9        8        9        11       11
Incorrect clusters                0       6        7        6        4        4
Percentage of correct clusters    100%    60%      53.33%   60%      73.33%   73.33%

Table 2
Results of K-Means Clustering for English Documents

                                  Auto    Classic (random / chosen* centroids)
                                  No 1    No 1     No 2     No 3     No 4     No 5
Total clusters                    21      14       14       14       14       14
Correct clusters                  20      11       10       6        10       13
Incorrect clusters                1       3        4        8        4        1
Percentage of correct clusters    95.23%  57.14%   71.42%   92.85%   78.57%   71.42%

4.2. K-NN Tests

For K-NN we performed four tests, using as training sets: the automatic K-Means results, the classic K-Means results (with centroids chosen randomly and from the human expert's perspective) and a training set chosen directly from the human expert's perspective. We recorded the total number of documents to classify, the number of training set clusters and the number of incorrectly classified documents, and then computed the percentage of correctly classified documents. We cannot rely on the K-Means method with randomly chosen centroids, because a classification using those classes as training data gives ambiguous results. For better results, the new files used in testing the K-NN algorithm belonged to the same fields as the documents clustered with K-Means. The test results for K-NN are presented in detail in Tables 3 and 4 and in (Astratiei & Archip, 2010).

Table 3
Results of K-NN Classification for Romanian Documents

                             Auto K-Means   Classic K-Means results       Chosen
                             results        Chosen        Random          clusters*
                                            centroids*    centroids
Total training clusters      20             15            15              15
Number of documents          46             46            46              46
Correctly classified         45             41            42              44
Incorrectly classified       1              5             4               2
Correctly classified, [%]    97.82          89.13         91.3            95.65

Table 4
Results of K-NN Classification for English Documents

                             Auto K-Means   Classic K-Means results       Chosen
                             results        Chosen        Random          clusters*
                                            centroids*    centroids
Total training clusters      21             14            14              14
Number of documents          52             52            52              52
Correctly classified         50             48            45              45
Incorrectly classified       2              4             7               7
Correctly classified, [%]    96.15          92.30         86.53           86.53

5. Practical Case Study

These two data mining algorithms can be used for many practical purposes, for example in games or in business, where new marketing solutions can be derived from the analysis of customer data sets. Data mining has also been widely used in science and engineering, for instance in bioinformatics, genetics, medicine and electrical power engineering. In this paper we present a practical case study that involves the K-Means algorithm and techniques for extracting information from the web.
Most search engines extract information from the World Wide Web and present it to users in a mixed fashion, depending on the importance of the resource that contained the searched information. To test our approach for the K-Means algorithm we performed a real-time test by applying the algorithm over a set of documents fetched by a web crawler. The crawler started from a seed list containing links to websites describing the life of the jaguar (the animal) and links pointing to pages describing the Jaguar car. The crawler fetched about 600 pages from both domains, cars and wild animals. The results after applying the semiautomatic K-Means were interesting. We obtained four different clusters that included all 600 documents:
− the first cluster contained links to pages describing the Jaguar car, its performance, the available features and the prices;
− the second one contained links to pages describing accidents involving Jaguar cars;
− the third cluster was composed of links to pages describing the life of the jaguar in the wild;
− the last cluster was composed of links to pages listing the dealers for the Jaguar car.

Another practical example of testing the algorithm used a set of documents concerning operating systems. After reading all the documents we had classified them as a single cluster labeled Operating Systems and had chosen one initial centroid. The automatic method, however, split the documents into two clusters. Indeed, the documents either drew parallels between Windows and Unix or Mac, or described one of these operating systems, apart from a few that explained how to configure an OS. Those files were clustered in a separate class, which is beneficial: even though all the files covered the same field, the ideas were different. To better understand the use of such a fine clusterization, imagine this case: you want to configure Windows, and the search for information leads you to a document that draws a parallel between Windows and Linux instead of leading you directly to the desired information.

We managed to create a simple search engine that, using data mining algorithms, can offer the user a set of results ordered in an intelligent and intuitive manner. This can be a very powerful tool considering the huge amount of information that can be found on the web.

6. Conclusion and Future Work

In this paper we have presented a method of clustering and classification that aids the human expert in choosing the initial number of clusters and the initial centroids for the K-Means algorithm and the training data set for the K-NN algorithm. We have focused on combining two different data mining techniques (a descriptive method as trainer, K-Means clustering, and a predictive technique, the K-NN classifier) in order to improve the accuracy of a classifier. In order to obtain more precise results we intend to develop an analysis that automatically detects derived and compound words. This analysis should take into consideration other descriptive data mining methods (such as frequent sequence miners) in order to further improve a potential classifier. One such improvement consists in determining dominant phrases for a set of computed clusters. Future development also involves addressing larger datasets.

REFERENCES

Arthur D., Vassilvitskii S., K-Means++: The Advantages of Careful Seeding. In SODA '07: Proceedings of the Eighteenth Annual ACM-SIAM Symposium on Discrete Algorithms,
Philadelphia, PA, USA: Society for Industrial and Applied Mathematics, 2007, 1027–1035.
Astratiei M.I., Archip A., A Case Study on Improving the Performance of Text Classifiers. Proceedings of the 14th International Conference on System Theory and Control, Sinaia, 2010.
Garcia E., An Information Retrieval Tutorial on Cosine Similarity Measures, Dot Products and Term Weight Calculations. 2006, http://www.miislita.com/information-retrievaltutorial/cosine-similarity-tutorial.html
Grossman D.A., Frieder O., Information Retrieval: Algorithms and Heuristics. Second Edition, Springer, 2004.
Han J., Kamber M., Data Mining: Concepts and Techniques. Morgan Kaufmann Publications, 2006.
Hornick M.F., Marcadé E., Venkayala S., Java Data Mining: Strategy, Standard, and Practice. Morgan Kaufmann Publications, 2007.
Li B., Yu S., Lu Q., An Improved k-Nearest Neighbor Algorithm for Text Categorization. CoRR, vol. cs.CL/0306099, 2003.
Matveeva I., Document Representation and Multilevel Measures of Document Similarity. In Proceedings of the 2006 Conference of the North American Chapter of the Association for Computational Linguistics on Human Language Technology, Morristown, NJ, USA: Association for Computational Linguistics, 2006, 235–238.
Moore A.W., K-Means and Hierarchical Clustering. 2001. Available: http://www.cs.cmu.edu/afs/cs/user/awm/web/tutorials/kmeans11.pdf
Muhr M., Granitzer M., Automatic Cluster Number Selection Using a Split and Merge k-Means Approach. Database and Expert Systems Applications, International Workshop on, 2009, 363–367.
Pelleg D., Moore A., X-Means: Extending k-Means with Efficient Estimation of the Number of Clusters. In Proceedings of the Seventeenth International Conference on Machine Learning, San Francisco: Morgan Kaufmann, 2000, 727–734.
Salvador S., Chan P., Determining the Number of Clusters/Segments in Hierarchical Clustering/Segmentation Algorithms. Dept. of Computer Sciences, Florida Institute of Technology, Melbourne, 2003.

A PRACTICAL CASE STUDY ON THE PERFORMANCE OF TEXT CLASSIFIERS

(Abstract)

This paper proposes a solution for improving text document classifiers by approaching variants of the clustering and classification algorithms. It has been demonstrated experimentally that, by implementing new approaches for K-Means and K-NN, clearly improved results are obtained. Adapting the K-Means algorithm so that it can be applied to text documents implies an initial choice of the centers, random selection not being a solution in this case. The initial centers must be chosen efficiently in order to achieve a correct clusterization of the documents, unlike the case in which the objects to be clustered have numeric characteristic values and the recomputation of the centers is performed with higher precision, it even being possible to obtain new objects that serve as references for the most representative element of the set. Adapting the K-NN algorithm consists in choosing a metric appropriate to the context, able to compute the similarity between two documents in order to find the closest ones with respect to the training set. The choice of the training set and of the variable K also plays an important role in the precision of the results. As in the case of clustering, the inference of the human expert becomes difficult on a large data set.
In this case, the training set must be well organized and structured according to the field of activity to which it refers; this organization implies a large amount of work from a human factor whose technical knowledge must relate to the context in use.