ABSTRACT Data mining is the discovery of knowledge and useful information from large amounts of data stored in databases. Because a large portion of the available data is stored in text databases, the field of text mining is gaining importance. Text databases are growing rapidly as more data becomes available in electronic form, for example in digital libraries, on the World Wide Web, and in electronic repositories. Given this volume of digitized text, classification systems are increasingly used in text mining to analyse texts and extract the knowledge they contain. Text classification (also called text categorization) is the process of assigning a text document to one of a set of predefined classes. Most existing classification systems use the Bag-of-Words model, which classifies a text document based on the number of occurrences of its component words and ignores the fact that different words may express the same concept. The model therefore suffers from the synonymy problem: occurrences of one concept are split across several words with similar meanings. The proposed approach classifies text documents by enriching the Bag-of-Words representation with synonyms. It uses WordNet, a lexical database of English, to extract the synonyms of every key term in the text document and then combines them with the key terms to form a new representative vector. As a result, the system counts the occurrences of both each key term and its synonyms in the document during classification, which reduces the synonymy problem. The proposed system is evaluated on the 20 Newsgroups corpus against two other classification approaches: the synonym frequency approach and the term frequency approach.
The experimental results show that classifying on the sum of a term's frequency and the frequencies of its synonyms improves the performance of the classification system compared with the other two approaches.
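
The three feature representations compared in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `SYNONYMS` dictionary is a small hypothetical stand-in for WordNet lookups, the tokenizer is deliberately naive, and the function names are chosen here for illustration only. In a real system the synonym sets would presumably come from WordNet, for example through NLTK's `wordnet.synsets` interface.

```python
import re
from collections import Counter

# Hypothetical stand-in for WordNet: maps a key term to its synonyms.
# These entries are illustrative assumptions, not data from the paper.
SYNONYMS = {
    "car": {"automobile", "auto"},
    "film": {"movie", "picture"},
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def feature_vectors(text, key_terms, synonyms=SYNONYMS):
    """Return three vectors over key_terms for one document:
    term frequency, synonym frequency, and their sum."""
    counts = Counter(tokenize(text))
    term_freq, syn_freq, combined = {}, {}, {}
    for term in key_terms:
        tf = counts[term]
        # Count every synonym occurrence, excluding the term itself.
        sf = sum(counts[s] for s in synonyms.get(term, ()) if s != term)
        term_freq[term] = tf
        syn_freq[term] = sf
        combined[term] = tf + sf
    return term_freq, syn_freq, combined


doc = "The car dealer sold an automobile and another auto. The movie was fine."
tf, sf, both = feature_vectors(doc, ["car", "film"])
print(tf)
print(sf)
print(both)
```

In this toy document the plain term-frequency vector records no occurrence of "film", because only its synonym "movie" appears; the combined vector credits that occurrence to the key term, which is the effect the proposed representation relies on.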