
Text Mining Technique to Extract Relevant Information from Corpus Based on User Input

Miss. Mariam M. Merchant a,*, Miss. Amrita A. Manjrekar b
a, b Computer Science and Technology, Department of Technology, Shivaji University, Kolhapur, India

Abstract

Text mining is the discovery of interesting knowledge in text documents. Finding accurate knowledge in text documents is a challenging issue. The proposed text mining system makes use of a corpus, i.e. a body of data from which useful information can be generated. Preprocessing techniques such as stopword elimination are applied to the corpus to reduce the dimensionality of the representation space. To extract patterns, each preprocessed document will be split into paragraphs. The paragraphs consist of sets of terms (or keywords), which can be extracted using the Pattern Taxonomy Model for the evaluation of relevant documents. Training will be carried out on a set of text documents using a supervised learning algorithm, and the system will then be tested to check whether it produces relevant text. With the help of the proposed system, the user can retrieve the most relevant text from the corpus.

Keywords: Text mining; text classification; pattern extraction

1. Introduction

Text mining helps in the discovery of interesting knowledge from text documents [4]. The challenging issue is to find accurate knowledge in text documents that helps users find what they actually want. There is a difference between text mining and Web search. In Web search, the user is looking for information that is already known and has been written by someone else, so the problem is to push aside all material that is not currently relevant to the user's needs and to find the relevant information. The goal of text mining is to discover information that is not yet known to others and thus has not yet been written down.

In today's world of Information Technology, a huge amount of digital data is widely available.
It is therefore necessary to turn this huge amount of data into useful information. Text mining applies knowledge discovery techniques to unstructured text; it is also termed knowledge discovery in text (KDT). Data mining refers to extracting or mining knowledge from large amounts of data. Mining gold from rocks is called gold mining, not rock mining; by the same reasoning, data mining could more appropriately be named knowledge mining. Knowledge discovery is the process of extracting useful information that is implicitly present in the data but previously unknown. Data mining is therefore an essential step in the process of knowledge discovery in databases (KDD). Data mining is concerned with extracting information from structured databases. In reality, however, a large portion of the available information appears in textual, and hence unstructured, form. Specialized techniques that operate specifically on textual data are therefore needed.

* Corresponding author. E-mail: [email protected]

2. Related Work

In the past decade, various text mining techniques have been presented for performing different knowledge tasks. These techniques were used to develop efficient mining algorithms that find particular patterns within a reasonable time span. These mining approaches generated large numbers of patterns, and the challenge was how to use and update these patterns effectively. Term-based methods were used to address this challenge with the help of Information Retrieval (IR) techniques such as Rocchio and probabilistic models [6]. The advantage of term-based methods is efficient computational performance. However, term-based methods suffer from polysemy and synonymy, where polysemy means a word with multiple meanings and synonymy means multiple words with the same meaning.

According to [1], a phrase-based approach could perform better than a term-based approach. Even though phrases are less ambiguous and more discriminative than individual terms, performance may be reduced by the inferior statistical properties of phrases and their low frequency of occurrence. This drawback of phrase-based approaches was overcome by pattern mining-based approaches (called pattern taxonomy models (PTM) [2]), which make use of the concept of closed sequential patterns and prune non-closed patterns.

There are various text representations, such as bag of words, that use key terms as elements in the vector of the feature space. In [5], the tf*idf weighting scheme was used for such a text representation. It combined the Rocchio and SVM techniques for classifier building, which produced significant results. The problem with this technique was overfitting.

In research work related to data mining, sequential patterns have been studied extensively. Algorithms for discovering patterns in large datasets include Apriori, SLPMiner, PrefixSpan, FP-tree, SPADE, and GST. The challenging issue is to find interesting patterns; thus, closed sequential patterns have been used for text mining in [3], which proposed that the concept of closed patterns is useful in text mining and can improve its performance.

In [7], a statistical method called Latent Semantic Indexing (LSI) was proposed, which considered the implicit higher-order structure in the association of words and objects and improved retrieval performance by up to 30%. The author also stated that iterative retrieval can give performance improvements. [8] and [1] describe selection functions for reducing the number of features. Dimensionality reduction approaches based on feature selection techniques include Information Gain, Mutual Information, Chi-Square, and Odds Ratio. Categorization performance was good with the use of a proportional assignment strategy and a statistical classifier.

Some researchers have used phrases instead of individual words. In [9], a combination of unigrams and bigrams was used for document indexing in text categorization, and evaluation was carried out using a variety of feature evaluation functions (FEF). [10] proposed a phrase-based text representation for Web document management. Text representation has also been performed using term-based ontology mining. In [11] and [12], hierarchical clustering is used to determine hyponymy and synonymy relations between keywords. [13] introduced a pattern evolution technique to improve the performance of term-based ontology mining.

[14] and [15] developed the pattern taxonomy model to improve effectiveness by making effective use of closed patterns in text mining. In [13], a two-stage model that used both term-based and pattern-based methods was introduced to significantly improve the performance of information filtering. In [16] and [17], a concept-based model was used to bridge the gap between NLP and text mining by analyzing terms at the sentence and document levels. The three components of the model analyzed the semantic structure of sentences, constructed a conceptual ontological graph (COG) to describe the semantic structures, and extracted top concepts to build feature vectors using the standard vector space model.

3. Proposed Work

Fig. 1 shows the proposed system, an efficient text mining system in which relevant information is retrieved from a huge collection of text data based on user input in natural language. The proposed system consists of the following steps.

3.1. User Input

It consists of text input and text structure analysis.

3.1.1. Text Input

The user will enter text in natural language (either a keyword or a statement). This will be the input to the proposed system.

3.1.2. Text Structure Analysis

The input text needs to be preprocessed. The preprocessing step produces a structured representation of the original text; i.e., the input text is converted into a form that can be matched against the extracted patterns. For this, stemming is used.
In stemming, syntactically similar words are considered similar; the purpose of this procedure is to obtain the stem or radix of each word, which emphasizes its semantics. If the input is a long statement, it must be segmented so that the keywords can finally be extracted.

Fig. 1. Block diagram for extracting relevant information from corpus

3.2. Text Mining System

It consists of the following phases: Corpus Selection, Preprocessing, Feature/Pattern Extraction, Training Set, and Classified Document.

3.2.1. Corpus Selection

The first task in building any system is to specify the input, the output, and how the output is to be generated. A corpus is a collection of textual data from which text mining can generate useful information. There are popular datasets that can be used as a corpus for experimental purposes. The proposed system will make use of an existing corpus called "Reuters-21578", which contains newswire stories.

3.2.2. Preprocessing

The proposed system will first preprocess the text. Preprocessing will be carried out on both the corpus and the input text. The following techniques will be used:

i. Stopword Elimination: common words that carry no semantics and do not contribute relevant information to the task (e.g., "the", "a") are eliminated.
ii. Term Stemming: syntactically similar words, such as plurals and verbal variations, are considered similar; the purpose of this procedure is to obtain the stem or radix of each word, which emphasizes its semantics.

Text Preprocessing Procedure

Stop Word Removal:
o Read text
o Compare each word from text with words in stop word list
o Remove all stop words in text

Stemming:
o Read text
o Apply Porter Stemmer Algorithm
o Store result

3.3. Feature/Pattern Extraction

After the preprocessing step, each text element is analyzed to select characteristics of the text. To extract patterns, the document will be split into paragraphs. The paragraphs consist of sets of terms (or keywords), which can be extracted for evaluation in further steps. Then, all the rules needed for pattern extraction are entered in the knowledge base of the system, and patterns are extracted by applying those rules. The following technique is used to extract the patterns:

i. Pattern Taxonomy Model: It focuses on finding useful patterns in text documents. It consists of two main stages: first, extracting useful patterns from text, and second, using these discovered patterns to improve the effectiveness of knowledge discovery.

3.4. Training Set

Supervised learning is used for training; in the proposed system, the dataset consists of both features and labels. The task is to construct an estimator that is able to predict the label of an object given its set of features.

3.5. Classified Document

Documents in the above datasets are labeled either positive or negative, where "positive" means the document is relevant to the assigned topic; otherwise the document is classified as "negative", i.e. irrelevant text. The proposed system will extract relevant text based on user input.

4. Result Analysis

Fig. 2. Time required for extraction of files in preprocessing

Fig. 2 shows the time required by the system for preprocessing. The system will first preprocess the text and then perform the text mining operation using the steps described above, so as to extract relevant text based on user input.

5. Conclusion

As a huge amount of information is available in text format, text mining is gaining importance in the commercial world. Pattern extraction algorithms identify terms and linguistic patterns. Text-based navigation enables users to move about in a document collection by relating topics and significant terms. It helps to identify key concepts and additionally presents some of the relationships between key concepts. The proposed system will extract the relevant text by analyzing the user input and the patterns extracted from the corpus.

References

[1] F. Sebastiani, "Machine Learning in Automated Text Categorization," ACM Computing Surveys, vol. 34, no. 1, 2002.
[2] S.-T. Wu, Y. Li, and Y. Xu, "Deploying Approaches for Pattern Refinement in Text Mining," Proc. IEEE Sixth Int'l Conf. Data Mining (ICDM '06), 2006.
[3] S.-T. Wu, Y. Li, Y. Xu, B. Pham, and P. Chen, "Automatic Pattern-Taxonomy Extraction for Web Mining," Proc. IEEE/WIC/ACM Int'l Conf. Web Intelligence (WI '04), pp. 242-248, 2004.
[4] N. Zhong, Y. Li, and S.-T. Wu, "Effective Pattern Discovery for Text Mining," IEEE Trans. Knowledge and Data Eng., vol. 24, no. 1, Jan. 2012.
[5] X. Li and B. Liu, "Learning to Classify Texts Using Positive and Unlabeled Data," Proc. Int'l Joint Conf. Artificial Intelligence (IJCAI '03), pp. 587-594, 2003.
[6] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information Retrieval. Addison Wesley, 1999.
[7] S.T. Dumais, "Improving the Retrieval of Information from External Sources," Behavior Research Methods, Instruments, and Computers, vol. 23, no. 2, pp. 229-236, 1991.
[8] D.D. Lewis, "Feature Selection and Feature Extraction for Text Categorization," Proc. Workshop Speech and Natural Language, pp. 212-217, 1992.
[9] M.F. Caropreso, S. Matwin, and F. Sebastiani, "Statistical Phrases in Automated Text Categorization," Technical Report IEI-B4-07-2000, Instituto di Elaborazione dell'Informazione, 2000.
[10] R. Sharma and S. Raman, "Phrase-Based Text Representation for Managing the Web Document," Proc. Int'l Conf. Information Technology: Computers and Comm. (ITCC), pp. 165-169, 2003.
[11] A. Maedche, Ontology Learning for the Semantic Web. Kluwer Academic, 2003.
[12] C. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[13] Y. Li and N. Zhong, "Mining Ontology for Automatically Acquiring Web User Information Needs," IEEE Trans. Knowledge and Data Eng., vol. 18, no. 4, pp. 554-568, Apr. 2006.
[14] S.-T. Wu, Y. Li, and Y. Xu, "Deploying Approaches for Pattern Refinement in Text Mining," Proc. IEEE Sixth Int'l Conf. Data Mining (ICDM '06), pp. 1157-1161, 2006.
[15] S.-T. Wu, Y. Li, Y. Xu, B. Pham, and P. Chen, "Automatic Pattern-Taxonomy Extraction for Web Mining," Proc. IEEE/WIC/ACM Int'l Conf. Web Intelligence (WI '04), pp. 242-248, 2004.
[16] S. Shehata, F. Karray, and M. Kamel, "Enhancing Text Clustering Using Concept-Based Mining Model," Proc. IEEE Sixth Int'l Conf. Data Mining (ICDM '06), pp. 1043-1048, 2006.
[17] S. Shehata, F. Karray, and M. Kamel, "A Concept-Based Model for Enhancing Text Categorization," Proc. 13th Int'l Conf. Knowledge Discovery and Data Mining (KDD '07), pp. 629-637, 2007.
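The text preprocessing procedure of Section 3.2.2 (stop word removal followed by stemming) can be sketched as below. This is only an illustrative sketch under stated assumptions: the stop word list is a small hypothetical sample, and `simple_stem` is a toy suffix stripper standing in for the Porter Stemmer Algorithm named in the procedure, not an implementation of it. All function names are introduced here for illustration.

```python
import re

# Illustrative stop word list (an assumption; a real system would load a fuller list).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "in", "and", "from"}

# Suffixes tried in order by the toy stemmer below. This is NOT the Porter
# algorithm named in the procedure, only a stand-in that needs no libraries.
SUFFIXES = ("ing", "ed", "es", "s")


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop every token that appears in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def simple_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Read text, remove stop words, then stem the remaining terms."""
    return [simple_stem(t) for t in remove_stop_words(tokenize(text))]


if __name__ == "__main__":
    print(preprocess("The patterns are extracted from the documents"))
```

The same `preprocess` function could be applied to both the corpus documents and the user input, so that input terms and extracted patterns are compared in the same stemmed form.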

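The paragraph-based pattern extraction of Section 3.3 can be illustrated with a small sketch of closed pattern discovery. It rests on simplifying assumptions that are not taken from the paper: each paragraph is treated as an unordered set of already preprocessed terms (the pattern taxonomy work cited above uses sequential patterns), candidate patterns are enumerated by brute force, and the names `support`, `frequent_patterns`, `closed_patterns`, and the `min_support` threshold are hypothetical. A pattern is kept as closed when no proper superset occurs in the same number of paragraphs.

```python
from itertools import combinations


def support(pattern, paragraphs):
    """Count the paragraphs that contain every term of the pattern."""
    return sum(1 for p in paragraphs if pattern <= p)


def frequent_patterns(paragraphs, min_support):
    """Enumerate all termsets whose support is at least min_support.

    Brute force over the subsets of each paragraph, so this is only
    suitable for short paragraphs.
    """
    found = {}
    for para in paragraphs:
        terms = sorted(para)
        for size in range(1, len(terms) + 1):
            for combo in combinations(terms, size):
                pattern = frozenset(combo)
                if pattern not in found:
                    count = support(pattern, paragraphs)
                    if count >= min_support:
                        found[pattern] = count
    return found


def closed_patterns(frequent):
    """Keep patterns that have no proper superset with the same support."""
    return {
        p: s
        for p, s in frequent.items()
        if not any(p < q and s == frequent[q] for q in frequent)
    }


if __name__ == "__main__":
    # One document split into three paragraphs of preprocessed terms.
    paragraphs = [
        {"text", "mining", "pattern"},
        {"text", "mining"},
        {"pattern", "corpus"},
    ]
    frequent = frequent_patterns(paragraphs, min_support=2)
    for pattern, count in closed_patterns(frequent).items():
        print(sorted(pattern), count)
```

In this example the single terms "text" and "mining" are pruned because the larger pattern {"text", "mining"} appears in exactly the same paragraphs, which is the sense in which non-closed patterns carry no extra information.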