Text Mining Technique to Extract Relevant Information from Corpus
Based on User Input
Miss. Mariam M. Merchant a,*, Miss. Amrita A. Manjrekar b
a, b Computer Science and Technology, Department of Technology, Shivaji University, Kolhapur, India
Abstract
Text mining is the discovery of interesting knowledge in text documents. Finding accurate knowledge in text documents is a
challenging issue. The proposed text mining system makes use of a corpus, i.e., data from which knowledgeable information can be
generated. A preprocessing technique such as stopword elimination is applied to the corpus to reduce the dimensionality of the
representation space. In order to extract patterns, the preprocessed document will be split into paragraphs. Each paragraph consists of
a set of terms (or keywords), which can be extracted using the Pattern Taxonomy Model to evaluate document relevance. Training will be
carried out on a set of text documents using a supervised learning algorithm, and the system will then be tested to check whether it
produces relevant text. With the help of the proposed system, the user can retrieve the most relevant text from the corpus.
Keywords: Text mining; text classification; pattern extraction
1. Introduction
In today’s world of Information Technology, a huge amount of
digital data is widely available, so it is necessary to turn
this data into useful information. Text mining deals with
applying knowledge discovery techniques to unstructured text;
it is also termed knowledge discovery in text (KDT).
Data mining refers to extracting or mining knowledge from
large amounts of data. Mining gold from rocks is referred to
as gold mining, not rock mining; thus data mining could more
appropriately be named knowledge mining. Knowledge discovery
is the process of extracting useful information that is
implicitly present in the data but previously unknown. Data
mining is therefore an essential step in the process of
knowledge discovery in databases (KDD). Data mining is
concerned with extracting information from structured
databases. In reality, however, a large portion of the
available information appears in textual and hence
unstructured format, so specialized techniques that operate
specifically on textual data are needed.
Text mining helps in the discovery of interesting knowledge
from text documents [4]. The challenging issue is to find
accurate knowledge in text documents that helps users find
what they actually want. Text mining differs from Web search.
In Web search, the user is looking for information that is
already known and has been written by someone else, so the
problem is to push aside all the material that is not
currently relevant to the user’s needs and to find the
relevant information. The goal of text mining is to discover
unknown information, which is not known by others and thus
has not yet been written down.
* Corresponding author. E-mail: [email protected]
2. Related Work
In the past decade, various text mining techniques have been
presented to perform different knowledge tasks. These
techniques were used to develop efficient mining algorithms
that find particular patterns within a reasonable time span.
These mining approaches generated large numbers of patterns,
and the challenge was how to use and update these patterns
effectively. Term-based methods were used to address this
challenge with the help of Information Retrieval (IR)
techniques such as Rocchio and probabilistic models [6]. The
advantage of term-based methods is efficient computational
performance. However, term-based methods suffer from polysemy
and synonymy, where polysemy means that a word has multiple
meanings and synonymy means that multiple words have the same
meaning.
A phrase-based approach could perform better than a
term-based approach [1]. Even though phrases are less
ambiguous and more discriminative than individual terms,
performance may be reduced by the inferior statistical
properties of phrases and their low frequency of occurrence.
This drawback of phrase-based approaches was overcome by
pattern mining-based approaches, called pattern taxonomy
models (PTM) [2], which make use of closed sequential
patterns and prune non-closed patterns.
There are various text representations, such as bag of words,
that use key terms as elements in the vector of the feature
space. The tf*idf weighting scheme was used for such text
representation [5]. That work combined the Rocchio and SVM
techniques for classifier building, which produced
significant results. The problem with this technique was
overfitting.
In data mining research, sequential patterns have been
studied extensively. Algorithms for discovering patterns in
large datasets include Apriori, SLPMiner, PrefixSpan,
FP-tree, SPADE, and GST.
The challenging issue is to find interesting patterns; thus
closed sequential patterns have been used for text mining in
[3], which proposed that the concept of closed patterns is
useful in text mining and can improve its performance.
A statistical method called Latent Semantic Indexing (LSI)
was proposed in [7]. It considers the implicit higher-order
structure in the association of words and objects, and it
improved retrieval performance by up to 30%. The author also
stated that iterative retrieval can give performance
improvements.
[8] and [1] describe selection functions for reducing the
number of features. Dimensionality reduction approaches based
on feature selection include Information Gain, Mutual
Information, Chi-Square, and Odds Ratio. Categorization
performance was good with the use of a proportional
assignment strategy and a statistical classifier.
Some researchers have used phrases instead of individual
words. For document indexing in text categorization, a
combination of unigrams and bigrams was used in [9], and
evaluation was carried out with a variety of feature
evaluation functions (FEF). [10] proposed a phrase-based text
representation for Web document management.
Text representation was also performed using term-based
ontology mining. In [11] [12], hierarchical clustering is
used to determine hyponymy and synonymy relations between
keywords. [13] introduced a pattern evolution technique to
improve the performance of term-based ontology mining.
[14] [15] developed the pattern taxonomy model to improve
effectiveness by using closed patterns in text mining. A
two-stage model that used both term-based and pattern-based
methods was introduced in [13] to significantly improve the
performance of information filtering.
[16] [17] used a concept-based model to bridge the gap
between NLP and text mining by analyzing terms at the
sentence and document levels. The three components of the
model analyzed the semantic structure of sentences,
constructed a conceptual ontological graph (COG) to describe
the semantic structures, and extracted top concepts to build
feature vectors using the standard vector space model.
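The closed sequential patterns mentioned in this section can be illustrated with a small sketch. It is not the implementation used in the cited PTM papers: it simply enumerates ordered term patterns by brute force (practical only for toy inputs), counts in how many paragraphs each one occurs, and then drops every pattern that has a longer pattern with the same support. The function names and the `max_len` cap are assumptions made for illustration.

```python
from itertools import combinations


def is_subsequence(pattern, sequence):
    """True if the terms of `pattern` occur in `sequence` in order (gaps allowed)."""
    it = iter(sequence)
    return all(term in it for term in pattern)


def frequent_patterns(paragraphs, min_support, max_len=3):
    """Map each ordered pattern (up to max_len terms) to its support,
    i.e. the number of paragraphs containing it, keeping only frequent ones."""
    candidates = set()
    for para in paragraphs:
        for n in range(1, max_len + 1):
            for idx in combinations(range(len(para)), n):
                candidates.add(tuple(para[i] for i in idx))
    supports = {}
    for pat in candidates:
        support = sum(is_subsequence(pat, para) for para in paragraphs)
        if support >= min_support:
            supports[pat] = support
    return supports


def closed_patterns(supports):
    """Keep a pattern only if no longer pattern containing it has the same support.
    Closedness is judged relative to the patterns enumerated (length <= max_len)."""
    closed = {}
    for pat, sup in supports.items():
        absorbed = any(
            len(other) > len(pat) and other_sup == sup and is_subsequence(pat, other)
            for other, other_sup in supports.items()
        )
        if not absorbed:
            closed[pat] = sup
    return closed
```

For the three toy paragraphs `["text", "mining", "pattern"]`, `["text", "mining", "model"]` and `["text", "pattern"]` with `min_support=2`, the sketch keeps `("text",)` with support 3 and `("text", "mining")` and `("text", "pattern")` with support 2. The single terms `mining` and `pattern` are pruned because a longer pattern has the same support.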
3. Proposed Work
Fig. 1 shows the proposed system, an efficient text mining
system in which, based on user input in natural language,
relevant information is retrieved from a huge collection of
text data.
The proposed system consists of the following steps:
3.1. User Input
It consists of text input and text structure analysis.
3.1.1. Text Input
The user will enter text in natural language (either a
keyword or a statement). This will be the input to the
proposed system.
3.1.2. Text Structure Analysis
The input text needs to be preprocessed. The preprocessing
step produces a structured representation of the original
text; i.e., the input text is converted into a form that can
be matched against the extracted patterns. Stemming is used
for this. In stemming, syntactically similar words are
considered similar; the purpose of this procedure is to
obtain the stem or radix of each word, which emphasizes its
semantics. If the input is a long statement, it must first be
segmented, and finally the keywords can be extracted.
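The analysis of the user input described above (segment a long statement, stem the words, keep the keywords) could look roughly like the sketch below. The stop word list is a tiny hypothetical one, and `naive_stem` is a crude suffix stripper used only as a stand-in for a real stemmer such as Porter's; neither is taken from the paper.

```python
import re

# Hypothetical, deliberately small stop word list (assumption, not from the paper).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "to",
              "and", "for", "about", "what", "which"}

# Suffixes tried in order; a stand-in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")


def naive_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def segment(text):
    """Split a long statement into sentences at '.', '!' or '?'."""
    return [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]


def extract_keywords(text):
    """Return the distinct stemmed, non-stop-word terms in order of first appearance."""
    keywords = []
    for sentence in segment(text):
        for token in re.findall(r"[a-z0-9]+", sentence.lower()):
            if token in STOP_WORDS:
                continue
            stem = naive_stem(token)
            if stem not in keywords:
                keywords.append(stem)
    return keywords
```

A single keyword passes through unchanged apart from stemming, so the same function can serve both kinds of input mentioned in Section 3.1.1.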
3.2. Text Mining System
It consists of the following phases: Corpus Selection,
Preprocessing, Feature/Pattern Extraction, Training Set, and
Classified Document.
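To make the last two phases concrete, the following is a minimal sketch of training on labelled documents and then classifying a new one. The paper does not name the supervised learning algorithm, so a nearest-centroid (Rocchio-style) classifier over raw term counts is assumed here; the function names and the label strings are illustrative only.

```python
import math
from collections import Counter


def vectorize(tokens):
    """Represent a document as a term-frequency vector."""
    return Counter(tokens)


def cosine(u, v):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def train(labelled_docs):
    """labelled_docs: list of (tokens, label). Returns one averaged vector per label."""
    sums, counts = {}, Counter()
    for tokens, label in labelled_docs:
        sums.setdefault(label, Counter()).update(vectorize(tokens))
        counts[label] += 1
    return {
        label: Counter({t: w / counts[label] for t, w in total.items()})
        for label, total in sums.items()
    }


def predict(centroids, tokens):
    """Assign the label whose centroid is most similar to the document."""
    vec = vectorize(tokens)
    return max(centroids, key=lambda label: cosine(vec, centroids[label]))
```

The tokens passed to `train` and `predict` are assumed to be already preprocessed. Any other supervised estimator that maps features to a label could be substituted without changing the surrounding phases.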
3.2.1. Corpus Selection
The first task in building any system is to specify the
input, the output, and how the output is generated. A corpus
is a collection of textual data from which text mining can
generate knowledgeable information. There are popular
datasets that can be used as a corpus for experimental
purposes.
The proposed system will make use of the existing
Reuters-21578 corpus, which contains newswire stories.
Fig. 1. Block diagram for extracting relevant information from corpus
3.2.2. Preprocessing
The proposed system will first preprocess the text.
Preprocessing will be carried out on both the corpus and the
input text, using the following techniques:
i. Stopword Elimination: common words that carry no semantics
and do not contribute relevant information to the task
(e.g., “the”, “a”) are eliminated.
ii. Term Stemming: syntactically similar words, such as
plurals and verbal variations, are considered similar; the
purpose of this procedure is to obtain the stem or radix of
each word, which emphasizes its semantics.
Text Preprocessing Procedure
Stop Word Removal:
o Read text
o Compare each word from text with words in stop word list
o Remove all stop words in text
Stemming:
o Read text
o Apply Porter Stemmer Algorithm
o Store Result
3.3. Feature/Pattern Extraction
After the preprocessing step, each text element is analyzed
to select the characteristics of the text. In order to
extract patterns, the document will be split into paragraphs.
Each paragraph consists of a set of terms (or keywords),
which can be extracted for evaluation in further steps. Then
all the rules needed for pattern extraction are entered in
the knowledge base of the system, and patterns are extracted
by applying those rules. The following technique is used for
extracting the patterns:
i. Pattern Taxonomy Model: It focuses on finding useful
patterns in text documents. It consists of two main stages:
first, extracting useful patterns from text, and second,
using these discovered patterns to improve the effectiveness
of knowledge discovery.
3.4. Training Set
Supervised learning is used for training; in the proposed
system the dataset consists of both features and labels. The
task is to construct an estimator that is able to predict the
label of an object given its set of features.
3.5. Classified Document
Documents in the above dataset are labelled either positive
or negative, where “positive” means the document is relevant
to the assigned topic; otherwise it is classified as
“negative”, i.e., irrelevant text. The proposed system will
extract relevant text based on user input.
4. Result Analysis
Fig. 2. Time required for extraction of files in preprocessing
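A measurement of the kind plotted in Fig. 2 could be collected along the following lines. This is only a sketch under assumptions: it times the stop word removal step of the preprocessing procedure for each document, the stop word list is a small hypothetical one, and the documents are passed in as a name-to-text mapping rather than read from the corpus files.

```python
import re
import time

# Hypothetical, deliberately small stop word list (assumption, not from the paper).
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "to", "and"}


def remove_stop_words(text):
    """Read the text, compare each word with the stop word list, drop the matches."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]


def time_preprocessing(documents):
    """Return (name, number of kept tokens, elapsed seconds) for each document."""
    timings = []
    for name, text in documents.items():
        start = time.perf_counter()
        tokens = remove_stop_words(text)
        elapsed = time.perf_counter() - start
        timings.append((name, len(tokens), elapsed))
    return timings
```

`time.perf_counter` is used because it is a monotonic, high-resolution clock intended for measuring short durations. The absolute numbers depend on the machine and on document length, so they are comparable only within a single run.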
Fig. 2 shows the time required by the system for
preprocessing. The system will first preprocess the text and
then perform the text mining operation using the steps
described above, so as to extract relevant text based on user
input.
5. Conclusion
As a huge amount of information is available in text format,
text mining is gaining importance in the commercial world.
Pattern extraction algorithms identify terms and linguistic
patterns. Text-based navigation enables users to move about
in a document collection by relating topics and significant
terms. It helps to identify key concepts and additionally
presents some of the relationships between key concepts. The
proposed system will extract the relevant text by analyzing
the user input and the patterns extracted from the corpus.
References
[1] F. Sebastiani, “Machine Learning in Automated Text
Categorization,” ACM Computing Surveys, vol. 34, no. 1, 2002.
[2] S.-T. Wu, Y. Li, and Y. Xu, “Deploying Approaches for Pattern
Refinement in Text Mining,” Proc. IEEE Sixth Int’l Conf. Data
Mining (ICDM ’06), 2006.
[3] S.-T. Wu, Y. Li, Y. Xu, B. Pham, and P. Chen, “Automatic
Pattern-Taxonomy Extraction for Web Mining,” Proc. IEEE/WIC/ACM
Int’l Conf. Web Intelligence (WI ’04), pp. 242-248, 2004.
[4] N. Zhong, Y. Li, and S.-T. Wu, “Effective Pattern Discovery for
Text Mining,” IEEE Trans. Knowledge and Data Eng., vol. 24,
no. 1, Jan. 2012.
[5] X. Li and B. Liu, “Learning to Classify Texts Using Positive and
Unlabeled Data,” Proc. Int’l Joint Conf. Artificial Intelligence
(IJCAI ’03), pp. 587-594, 2003.
[6] R. Baeza-Yates and B. Ribeiro-Neto, Modern Information
Retrieval. Addison Wesley, 1999.
[7] S.T. Dumais, “Improving the Retrieval of Information from External
Sources,” Behavior Research Methods, Instruments, and Computers,
vol. 23, no. 2, pp. 229-236, 1991.
[8] D.D. Lewis, “Feature Selection and Feature Extraction for Text
Categorization,” Proc. Workshop Speech and Natural Language,
pp. 212-217, 1992.
[9] M.F. Caropreso, S. Matwin, and F. Sebastiani, “Statistical Phrases in
Automated Text Categorization,” Technical Report IEI-B4-07-2000,
Instituto di Elaborazione dell’Informazione, 2000.
[10] R. Sharma and S. Raman, “Phrase-Based Text Representation for
Managing the Web Document,” Proc. Int’l Conf. Information
Technology: Computers and Comm. (ITCC), pp. 165-169, 2003.
[11] A. Maedche, Ontology Learning for the Semantic Web. Kluwer
Academic, 2003.
[12] C. Manning and H. Schütze, Foundations of Statistical Natural
Language Processing. MIT Press, 1999.
[13] Y. Li and N. Zhong, “Mining Ontology for Automatically Acquiring
Web User Information Needs,” IEEE Trans. Knowledge and Data
Eng., vol. 18, no. 4, pp. 554-568, Apr. 2006.
[14] S.-T. Wu, Y. Li, and Y. Xu, “Deploying Approaches for Pattern
Refinement in Text Mining,” Proc. IEEE Sixth Int’l Conf. Data
Mining (ICDM ’06), pp. 1157-1161, 2006.
[15] S.-T. Wu, Y. Li, Y. Xu, B. Pham, and P. Chen, “Automatic
Pattern-Taxonomy Extraction for Web Mining,” Proc. IEEE/WIC/ACM
Int’l Conf. Web Intelligence (WI ’04), pp. 242-248, 2004.
[16] S. Shehata, F. Karray, and M. Kamel, “Enhancing Text Clustering
Using Concept-Based Mining Model,” Proc. IEEE Sixth Int’l Conf.
Data Mining (ICDM ’06), pp. 1043-1048, 2006.
[17] S. Shehata, F. Karray, and M. Kamel, “A Concept-Based Model for
Enhancing Text Categorization,” Proc. 13th Int’l Conf. Knowledge
Discovery and Data Mining (KDD ’07), pp. 629-637, 2007.