ABSTRACT

Data mining is the discovery of knowledge and useful information from large amounts of
data stored in databases. Because a large portion of the available data is stored in text
databases, the field of text mining is gaining importance. Text databases are growing
rapidly as more data becomes available in electronic form, for example in digital
libraries, the World Wide Web, and electronic repositories. Given this vast amount of
digitized text, classification systems are increasingly used in text mining to analyse
texts and extract the knowledge they contain. Text classification (also called text
categorization) is a process that assigns a text document to one of a set of predefined
categories.
Most existing classification systems use the Bag-of-Words model, which classifies
a text document based on the number of occurrences of its component words and ignores
the fact that different words may be used to express a similar concept. Hence
this model suffers from the problem of synonymy, which arises when different words
have similar meanings. The proposed approach classifies text documents by
enriching the Bag-of-Words data representation with synonyms. It uses
WordNet, a lexical database of English, to extract the synonyms of all the key terms in
a text document, and then combines them with the key terms to form a new
representative vector. As a result, the system counts the occurrences of both each key
term and its corresponding synonyms in the document for classification, which
reduces the synonymy problem.
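
The enrichment step described above can be sketched as follows. This is a minimal
illustration under stated assumptions, not the authors' implementation: the `SYNONYMS`
table is a hypothetical hand-written stand-in for a WordNet lookup, and the whitespace
tokenizer and the function name `enriched_vector` are invented for the example.

```python
from collections import Counter

# Hypothetical stand-in for a WordNet synonym lookup (assumption, not WordNet data).
SYNONYMS = {
    "car": {"automobile", "auto"},
    "buy": {"purchase"},
}


def enriched_vector(text, key_terms, synonyms=SYNONYMS):
    """Map each key term to the count of the term plus its synonyms in `text`."""
    # Naive tokenization: lowercase and split on whitespace (assumption).
    counts = Counter(text.lower().split())
    vector = {}
    for term in key_terms:
        # The key term and all of its synonyms contribute to one feature.
        related = {term} | synonyms.get(term, set())
        vector[term] = sum(counts[word] for word in related)
    return vector


doc = "I want to purchase an automobile because my old car broke and a new auto is cheap"
print(enriched_vector(doc, ["car", "buy"]))  # {'car': 3, 'buy': 1}
```

A plain Bag-of-Words count would give "car" a value of 1 and "buy" a value of 0 for this
text; folding the synonyms into the key term's feature is what raises them to 3 and 1.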
The performance of the proposed system is evaluated on the 20Newsgroups data corpus
and compared with two other classification approaches: the synonym frequency approach
and the term frequency approach. The experimental results showed that
using the sum of a term's frequency and the frequencies of its synonyms for
classification improves the performance of the classification system
compared with either of the other two approaches.
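
The three feature weightings being compared can be placed side by side in a small
sketch. The abstract does not define the synonym frequency approach precisely, so this
example assumes it counts only the synonyms of a key term and not the term itself. The
`SYNONYMS` table and all function names are hypothetical, and the example makes no
claim about classification accuracy.

```python
from collections import Counter

# Hypothetical synonym table standing in for WordNet (assumption).
SYNONYMS = {"car": {"automobile", "auto"}}


def term_frequency(counts, term):
    """Occurrences of the key term itself."""
    return counts[term]


def synonym_frequency(counts, term, synonyms=SYNONYMS):
    """Occurrences of the key term's synonyms only (assumed definition)."""
    return sum(counts[s] for s in synonyms.get(term, set()))


def combined_frequency(counts, term, synonyms=SYNONYMS):
    """Sum of the term frequency and the synonym frequency."""
    return term_frequency(counts, term) + synonym_frequency(counts, term, synonyms)


counts = Counter("the car dealer sold one automobile and one auto".split())
features = {
    "term": term_frequency(counts, "car"),
    "synonym": synonym_frequency(counts, "car"),
    "combined": combined_frequency(counts, "car"),
}
print(features)  # {'term': 1, 'synonym': 2, 'combined': 3}
```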