International Journal of Computer Trends and Technology (IJCTT) – volume 11 number 1 – May 2014
Effective Classification of Text
A Saritha, Student, SIT, JNTUH
N Naveen Kumar, Asst. Prof., SIT, JNTUH, Hyderabad
ABSTRACT: Text mining is the process of obtaining useful and interesting information from text. A huge amount of text data is available in various formats, and most of it is unstructured. Text mining usually involves structuring the input text (parsing it and inserting the results into a database), deriving patterns from the structured data, and finally evaluating and interpreting the output. Several data mining techniques have been proposed for mining useful patterns in text documents. Mining techniques can use either terms or patterns (phrases). Theoretically, using patterns rather than terms in text mining may yield better results, but this has not been proved, and there is a need for effective ways of mining text.
This paper presents a means to classify text using a term-based approach.

Keywords – Classification, Naïve Bayes classifier, Preprocessing, Stopword Removal, Stemming, Unigram, N-gram models
1. INTRODUCTION
Corporate data is doubling in size. In order to utilize that data for business needs, text mining provides an automated approach: by mining the text, the required knowledge can be retrieved, which is very useful. Knowledge from text usually refers to some combination of relevance, novelty, and interestingness. Typical text mining tasks include text categorization, text clustering, concept/entity extraction, production of granular taxonomies, sentiment analysis, document summarization, and entity relation modeling (i.e., learning relations between named entities).
Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics.
A typical application of text mining is to scan a given set of documents written in a natural language and either model them for predictive classification or populate a database or search index with the extracted information.
Text Mining and Text Analytics: Most business-relevant information originates in unstructured format. The term text analytics describes a set of linguistic, statistical, and machine learning techniques useful for business intelligence, solving business problems, exploratory data analysis, research, or investigation. These techniques and processes discover and present knowledge: facts, business rules, and relationships.

1.1 Data Mining versus Text Mining

Data mining and text mining are both semi-automated processes that seek useful patterns, but they differ in the nature of the data: structured versus unstructured. Structured data is the data present in databases, whereas unstructured data is the data present in documents such as Word documents, PDF files, text excerpts, XML files, and so on. Data mining is a process of extracting knowledge from structured data. Text mining first imposes structure on the data and then mines the structured data.
1.2 Applications of Text Mining
The technology is now broadly applied for a wide variety of government, research, and business needs. Some applications and application areas of text mining are:

- Publishing and media
- Customer service and email support
- Spam filtering
- Measuring customer preferences by analyzing qualitative interviews
- Creating suggestions and recommendations
- Monitoring public opinion (for example, in blogs and review sites)
- Automatic labeling of documents in business libraries
- Political institutions, political analytics, public administration, and legal documents
- Fraud detection by investigating notifications of claims
- Fighting cyberbullying or cybercrime in IM and IRC chat
- Pharmaceutical and research companies and healthcare
1.3 System Architecture
Fig. 1.1: An example text mining system architecture
Starting with a collection of documents, a text
mining tool would retrieve a particular document
and preprocess it by checking format and character
sets. Then it would go through a text analysis
phase, sometimes repeating techniques until
information is extracted. Three text analysis
techniques are shown in the example, but many
other combinations of techniques could be used
depending on the goals of the organization. The
resulting information can be placed in a
management information system, yielding an
abundant amount of knowledge for the user of that
system.
2. CLASSIFICATION
2.1 Introduction
Text classification tasks can be divided into two sorts: supervised document classification, where some external mechanism (such as human feedback) provides information on the correct classification for documents or defines classes for the classifier, and unsupervised document classification (also known as document clustering), where the classification must be done without any external reference and there are no predefined classes. There is also a third task, semi-supervised document classification, where some documents are labeled by the external mechanism (that is, some documents are already classified, for better learning of the classifier).
2.1.1 Need for Automatic Text Classification

Classifying millions of text documents manually is an expensive and time-consuming task. Therefore, an automatic text classifier is constructed using pre-classified sample documents; its accuracy and time efficiency are much better than those of manual text classification.

2.1.2 Text Classification Framework

Fig. 2.1: Text classification framework

2.2 Pre-Processing

2.2.1 Need for Pre-Processing

The main objective of pre-processing is to obtain the key features or key terms from stored text documents and to enhance the relevancy between word and document and between word and category. The pre-processing step is crucial in determining the quality of the next stage, that is, the classification stage. It is important to select the significant keywords that carry the meaning and discard the words that do not contribute to distinguishing between the documents. The pre-processing phase converts the original textual data into a data-mining-ready structure.

Fig. 2.2: Flow diagram of the pre-processing task
2.2.2 Stopword Removal
A stopword is a commonly occurring grammatical word that does not tell us anything about a document's content. Words such as 'a', 'an', 'the', 'and', etc. are stopwords. The process of stopword removal is to examine a document's content for stopwords and write any non-stopwords to a temporary file for the document. We are then ready to perform stemming on that file.
Stop words are words which do not carry enough significance to be used in search queries. Usually these words are filtered out from search queries because they return vast amounts of unnecessary information. Stop words vary from system to system.
For example, consider the following paragraph.
“It is typical of the present leadership of
the Union government that sensitive issues which
impact the public mind are broached or attended to
after the moment has passed, undermining the idea
of communication between citizens and the
government they have elected.”
Stopword removal in the above paragraph means removing words such as 'a', 'an', 'the', and so on. From the above paragraph, the words 'it', 'is', 'of', 'the', 'that', 'which', 'are', 'or', 'to', 'has', 'they', and 'have' will be removed, so the text we get after removing stopwords is as follows:
“typical
present leadership Union
government sensitive issues impact public mind
broached attended moment passed, undermining
idea communication between citizens government
elected.”
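As an illustration, a minimal Java sketch of this filtering step follows; the stopword set here is a small hypothetical sample, not the full list used in the experiments later in the paper.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class StopwordRemoval {
    // Illustrative sample of stopwords; a real system would load a larger list.
    private static final Set<String> STOPWORDS = new HashSet<>(Arrays.asList(
            "a", "an", "the", "and", "it", "is", "of", "that",
            "which", "are", "or", "to", "has", "they", "have"));

    public static String removeStopwords(String text) {
        StringBuilder result = new StringBuilder();
        // Split on whitespace and keep only the non-stopword tokens.
        for (String token : text.split("\\s+")) {
            String word = token.replaceAll("[^a-zA-Z]", "").toLowerCase();
            if (!word.isEmpty() && !STOPWORDS.contains(word)) {
                result.append(token).append(' ');
            }
        }
        return result.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(removeStopwords(
                "It is typical of the present leadership of the Union government"));
        // Prints: typical present leadership Union government
    }
}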
2.2.3 Stemming
In this process we find the root/stem of a word. The purpose of this method is to remove various suffixes, to reduce the number of words, to have exactly matching stems, and to save memory space and time. For example, the words producer, producers, produced, production, and producing can be stemmed to the word "produce".
After performing stemming on the result of stopword removal on the sample text above, we get the following text as the result:
“typical present leadership Union
government sensitive issues impact public mind
broach attend moment pass, undermine idea
communicat between citizen government elect.”
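Since the experiments in Section 3.2 use Weka's Lovins stemmer, a short sketch of invoking it programmatically may be useful; it assumes weka.jar is on the classpath, and the exact stems produced depend on the Lovins suffix-removal rules.

import weka.core.stemmers.LovinsStemmer;

public class StemmingExample {
    public static void main(String[] args) {
        LovinsStemmer stemmer = new LovinsStemmer();
        String[] words = {"producer", "producers", "produced", "production", "producing"};
        for (String w : words) {
            // stem() strips suffixes, mapping inflected forms toward a common stem.
            System.out.println(w + " -> " + stemmer.stem(w));
        }
    }
}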
2.2.4 Bag-of-Words Model
The bag-of-words model is a simplifying representation used in natural language processing and information retrieval. In this model, a text is represented as the bag (multiset) of its words, disregarding grammar and even word order but keeping multiplicity. The bag-of-words model is commonly used in methods of document classification, where the (frequency of) occurrence of each word is used as a feature for training a classifier.
For example, the following models two simple text documents using bag-of-words:
D1: Ajay likes to play cricket. Anil likes cricket too.
D2: Ajay also likes to play basketball.
A dictionary is constructed with the words in these two documents as follows; this does not preserve the order of words in the sentences:
{"Ajay": 1, "play": 2, "likes": 3, "to": 4, "cricket": 5, "basketball": 6, "Anil": 7, "also": 8, "too": 9}
The dictionary has 9 distinct words, so each document is represented by a 9-entry vector as follows:
D1: {1, 1, 2, 1, 2, 0, 1, 0, 1}
D2: {1, 1, 1, 1, 0, 1, 0, 1, 0}
Each entry of the document vector represents the frequency of occurrence of the corresponding dictionary word in the respective document.
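A small Java sketch of this construction follows; it lowercases and strips punctuation before counting, so the dictionary order differs from the hand-built one above, but each document still becomes a vector of word frequencies with order discarded.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class BagOfWords {
    public static void main(String[] args) {
        String[] docs = {
                "Ajay likes to play cricket. Anil likes cricket too.",
                "Ajay also likes to play basketball."
        };
        // LinkedHashMap keeps insertion order, so vector positions are stable.
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        List<String[]> tokenized = new ArrayList<>();
        for (String doc : docs) {
            String[] tokens = doc.toLowerCase().replaceAll("[^a-z ]", "").split("\\s+");
            tokenized.add(tokens);
            for (String t : tokens) {
                dictionary.putIfAbsent(t, dictionary.size());
            }
        }
        // One frequency vector per document, indexed by dictionary position.
        for (String[] tokens : tokenized) {
            int[] vector = new int[dictionary.size()];
            for (String t : tokens) {
                vector[dictionary.get(t)]++;
            }
            System.out.println(java.util.Arrays.toString(vector));
        }
    }
}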
2.3 Feature Vector Generation
In pattern recognition and in image processing, feature extraction is a special form of dimensionality reduction. When the input data to an algorithm is too large to be processed, it is transformed into a reduced representation set of features (also called a feature vector). Transforming the input data into the set of features is called feature extraction.
A feature vector is an n-dimensional
vector of numerical features that represent some
object. Many algorithms in machine learning
require a numerical representation of objects, since
such representations facilitate processing and
statistical analysis. When representing images, the feature values might correspond to the pixels of the image; when representing texts, perhaps to term occurrence frequencies. The vector space
associated with these vectors is often called the
feature space. In order to reduce the dimensionality
of the feature space, a number of dimensionality
reduction techniques can be employed.
If the features extracted are carefully chosen, it is expected that the feature set will extract the relevant information from the input data, so that the desired task can be performed using this reduced representation instead of the full-size input.
Higher-level features can be obtained
from already available features and added to the
feature vector. Feature construction is the
application of a set of constructive operators to a set of existing features, resulting in the construction of new features.
Term Frequency and Inverse Document Frequency (TF-IDF):
The tf-idf weight (term frequency – inverse document frequency) is a statistical measure used to evaluate how important a word is to a document in a collection. The importance increases proportionally to the number of times a word appears in the document but is offset by the frequency of the word in the corpus.
For a given term wj and document di, let nij be the number of occurrences of wj in di, |di| the number of words in di, n the number of documents, and nj the number of documents that contain wj. Then:
Term Frequency: TFij = nij / |di|
Inverse Document Frequency: IDFj = log(n / nj)
and the tf-idf weight of wj in di is TFij × IDFj.
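A minimal Java sketch of these formulas follows; the toy two-document corpus is purely illustrative (not from the Reuters data used later).

import java.util.HashMap;
import java.util.Map;

public class TfIdf {
    public static Map<String, Double> tfIdf(String[] docTokens, String[][] corpus) {
        Map<String, Double> weights = new HashMap<>();
        Map<String, Integer> termCounts = new HashMap<>();
        for (String t : docTokens) {
            termCounts.merge(t, 1, Integer::sum);
        }
        int n = corpus.length; // total number of documents
        for (Map.Entry<String, Integer> e : termCounts.entrySet()) {
            // nj: number of documents that contain the term
            int nj = 0;
            for (String[] doc : corpus) {
                for (String t : doc) {
                    if (t.equals(e.getKey())) { nj++; break; }
                }
            }
            double tf = e.getValue() / (double) docTokens.length; // TFij = nij / |di|
            double idf = Math.log(n / (double) nj);               // IDFj = log(n / nj)
            weights.put(e.getKey(), tf * idf);
        }
        return weights;
    }

    public static void main(String[] args) {
        String[][] corpus = {
                {"ajay", "likes", "cricket"},
                {"ajay", "likes", "basketball"}
        };
        System.out.println(tfIdf(corpus[0], corpus));
        // "cricket" gets a positive weight; "ajay" and "likes" get 0, since log(2/2) = 0.
    }
}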
2.3.1 Unigram model
The simplest model, the unigram model, treats the words in isolation.
If there are 16 words in an example text and the word "X" occurs twice, then "X" has a probability of 2/16 = .125. On the other hand, a word "Y" that occurs only once has a probability of 1/16 = .0625.
This kind of information can be used to judge the well-formedness of texts; this is analogous to, say, identifying some new text as being in the same language as a known sample.
The way this works is that we calculate the overall probability of the new text as a function of the individual probabilities of the words that occur in it. On this view, the likelihood of a text like "X Y Z", where X, Y, and Z are words, is a function of the probabilities of its parts. If we assume the choice of each word is independent and each of the three words has probability .125, then the probability of the whole string is the product of the individual word probabilities, in this case .125 × .125 × .125 ≈ .00195.
The way this model works is that the well-formedness of a text fragment is correlated with its overall probability: higher-probability text fragments are more well-formed than lower-probability ones. A major shortcoming of this model is that it makes no distinction among texts in terms of ordering; thus the model cannot distinguish "a b" from "b a".
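A minimal, unsmoothed Java sketch of the unigram model described above; the 16-token corpus is contrived so that p(X) = .125, matching the running example.

import java.util.HashMap;
import java.util.Map;

public class UnigramModel {
    private final Map<String, Double> probs = new HashMap<>();

    public UnigramModel(String[] tokens) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) counts.merge(t, 1, Integer::sum);
        // Each word's probability is its relative frequency in the corpus.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            probs.put(e.getKey(), e.getValue() / (double) tokens.length);
        }
    }

    public double probability(String[] text) {
        double p = 1.0;
        for (String t : text) {
            // Unseen words get probability 0 in this unsmoothed sketch.
            p *= probs.getOrDefault(t, 0.0);
        }
        return p;
    }

    public static void main(String[] args) {
        // 16-token toy corpus in which "X" occurs twice: p(X) = 2/16 = .125
        String[] corpus = "X a b c X d e f g h i j k l m n".split(" ");
        UnigramModel model = new UnigramModel(corpus);
        System.out.println(model.probability(new String[]{"X", "X", "X"}));
        // .125 * .125 * .125 = 0.001953125
    }
}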
2.3.2 Bigram model
A more complex model, which captures some of the ordering restrictions that occur in language or text, is the bigram model. The basic idea behind higher-order N-gram models is to consider the probability of a word occurring as a function of its immediate context. In a bigram model, this context is the immediately preceding word:
p(w1 w2 ... wi) = p(w1) × p(w2|w1) × ... × p(wi|wi−1)
We calculate the conditional probability in the usual fashion, as a straightforward matter of division:
p(wi|wi−1) = |wi−1 wi| / |wi−1|
For example, the conditional probability of "Z" given "X", where the bigram "X Z" occurs twice and "X" occurs twice:
p(Z|X) = |X Z| / |X| = 2/2 = 1
However, the conditional probability of "Z" given "Q":
p(Z|Q) = |Q Z| / |Q| = 0/1 = 0
Using conditional probabilities thus captures the fact that the likelihood of "Z" varies with the preceding context: it is more likely after "X" than after "Q". Different orderings are also distinguished in the bigram model; consider, for example, the difference between "X Z" and "Z X".
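A short Java sketch of the bigram counts and conditional probabilities above; the toy corpus is contrived so that "Z" always follows "X" and never follows "Q", reproducing the example values.

import java.util.HashMap;
import java.util.Map;

public class BigramModel {
    private final Map<String, Integer> unigramCounts = new HashMap<>();
    private final Map<String, Integer> bigramCounts = new HashMap<>();

    public BigramModel(String[] tokens) {
        for (int i = 0; i < tokens.length; i++) {
            unigramCounts.merge(tokens[i], 1, Integer::sum);
            if (i + 1 < tokens.length) {
                bigramCounts.merge(tokens[i] + " " + tokens[i + 1], 1, Integer::sum);
            }
        }
    }

    // p(word | prev) = count(prev word) / count(prev)
    public double conditional(String word, String prev) {
        int prevCount = unigramCounts.getOrDefault(prev, 0);
        if (prevCount == 0) return 0.0;
        return bigramCounts.getOrDefault(prev + " " + word, 0) / (double) prevCount;
    }

    public static void main(String[] args) {
        BigramModel model = new BigramModel("X Z a Q b X Z".split(" "));
        System.out.println(model.conditional("Z", "X")); // 2/2 = 1.0
        System.out.println(model.conditional("Z", "Q")); // 0/1 = 0.0
    }
}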
2.3.3 N-gram model
N-gram models are not restricted to
unigrams and bigrams; higher-order N-gram
models are also used. These higher-order models
are characterized as we would expect. For example, a trigram model would view a text w1 w2 ... wn as the product of a series of conditional probabilities:
p(w1 w2 ... wn) = p(w1) × p(w2|w1) × p(w3|w1 w2) × ... × p(wn|wn−2 wn−1)
One way to try to appreciate the success of
N-gram language models is to use them to
approximate text in a generative fashion. That is,
we can compute all the occurring N-grams over
some text, and then use those N-grams to generate
new text.
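A minimal Java sketch of this generative use of N-grams, here with bigrams over a toy sentence; because the successor list keeps one entry per occurrence, sampling uniformly from it follows the bigram conditional probabilities.

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class NgramGenerator {
    public static void main(String[] args) {
        String[] tokens = "Ajay likes to play cricket Anil likes cricket too".split(" ");
        // Map each word to the list of words observed to follow it.
        Map<String, List<String>> successors = new HashMap<>();
        for (int i = 0; i + 1 < tokens.length; i++) {
            successors.computeIfAbsent(tokens[i], k -> new ArrayList<>()).add(tokens[i + 1]);
        }
        Random random = new Random();
        String current = tokens[0];
        StringBuilder generated = new StringBuilder(current);
        for (int i = 0; i < 8; i++) {
            List<String> next = successors.get(current);
            if (next == null) break; // no observed successor: stop generating
            current = next.get(random.nextInt(next.size()));
            generated.append(' ').append(current);
        }
        System.out.println(generated);
    }
}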
2.4 Classification Techniques
There are many techniques used for text classification. The following are some of them:
2.4.1 Nearest Neighbour Classifier
2.4.2 Bayesian Classification
2.4.3 Support Vector Machine
2.4.4 Association Based Classification
2.4.5 Term Graph Model
2.4.6 Decision Tree Induction
2.4.7 Centroid Based Classification
2.4.8 Classification using Neural Networks
This paper presents classification of text using the Naïve Bayes classifier.

2.4.2 Bayesian Classification
The Naïve Bayes classifier is the simplest probabilistic classifier used to classify text documents. It makes the strong assumption that each feature word is independent of the other feature words in a document. The basic idea is to use the joint probabilities of words and categories to estimate the class of a given document. Given a document di, the probability of each class cj is calculated as
P(cj|di) = P(di|cj) P(cj) / P(di)
As P(di) is the same for all classes, label(di), the class (or label) of di, can be determined by
label(di) = arg max over cj of P(cj|di) = arg max over cj of P(di|cj) P(cj)
This technique classifies using probabilities and assuming independence among terms:
P(C|Xi Xj Xk) ∝ P(C) P(Xi|C) P(Xj|C) P(Xk|C)
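To make the decision rule concrete, here is a minimal multinomial Naïve Bayes sketch in Java. The Laplace (+1) smoothing is an addition not discussed in the paper, included so unseen words do not zero out the product, and the tiny training set is purely illustrative.

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class NaiveBayesSketch {
    private final Map<String, Map<String, Integer>> wordCounts = new HashMap<>();
    private final Map<String, Integer> classDocCounts = new HashMap<>();
    private final Map<String, Integer> classTokenTotals = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalDocs = 0;

    public void train(String label, String[] tokens) {
        totalDocs++;
        classDocCounts.merge(label, 1, Integer::sum);
        Map<String, Integer> counts = wordCounts.computeIfAbsent(label, k -> new HashMap<>());
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
            classTokenTotals.merge(label, 1, Integer::sum);
            vocabulary.add(t);
        }
    }

    public String classify(String[] tokens) {
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String label : classDocCounts.keySet()) {
            // log P(c) plus the sum of log P(w|c); logs avoid numeric underflow.
            double score = Math.log(classDocCounts.get(label) / (double) totalDocs);
            for (String t : tokens) {
                int count = wordCounts.get(label).getOrDefault(t, 0);
                score += Math.log((count + 1.0)
                        / (classTokenTotals.get(label) + vocabulary.size()));
            }
            if (score > bestScore) { bestScore = score; best = label; }
        }
        return best;
    }

    public static void main(String[] args) {
        NaiveBayesSketch nb = new NaiveBayesSketch();
        nb.train("sports", "cricket match score".split(" "));
        nb.train("business", "market share price".split(" "));
        System.out.println(nb.classify("cricket price match".split(" "))); // sports
    }
}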
3. RESULTS
3.1 Dataset
The data set considered in this work is the Reuters Corpus Volume 1. In 2000, Reuters released a corpus of Reuters news stories for use in research and development of natural language processing, information retrieval, and machine learning systems. The Reuters Corpus is divided into a number of volumes, each identified by a volume number and a description, e.g. Reuters Corpus Volume 1 (English Language, 1996-08-20 to 1997-08-19).
In this work, a set of documents belonging to each of 5 categories is considered; the categories taken from Reuters as training data are labeled acq, alum, barley, carcass, and coffee. Ten documents of each class, i.e., 50 documents in total, are considered as the training data set. For testing, under the same class labels, five documents from each class, i.e., 25 documents in total, are considered.
3.2 Work
Using the Weka converter, the documents are loaded into a single file. The following are the commands that perform the various tasks in text classification using the Naïve Bayes classifier.
Initially, using the class TextDirectoryLoader, the set of documents present in the folder 'traindata' is stored into a single 'traininput.arff' file (and likewise for the test data). This class is defined in the Java package weka.core.converters. The commands are as follows:

java weka.core.converters.TextDirectoryLoader -dir d:\traindata > d:\traininput.arff
java weka.core.converters.TextDirectoryLoader -dir d:\testdata > d:\testinput.arff

Then we can remove stopwords, apply stemming, and build a word vector using the filter StringToWordVector, performing batch processing as follows:

java weka.filters.unsupervised.attribute.StringToWordVector -b -i d:\traininput.arff -o d:\trainoutput.arff -r d:\testinput.arff -s d:\testoutput.arff -R first-last -W 1000 -prune-rate -1.0 -T -I -N 0 -S -stemmer weka.core.stemmers.LovinsStemmer -M 1 -stopwords D:\as\stopwordsnew.txt -tokenizer "weka.core.tokenizers.WordTokenizer -delimiters \" \\r \\t.,;:\\\'\\\"()?!- <>\\n\""

Now apply the Naïve Bayes classifier on the training file to build a model.
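The extracted text does not preserve the exact classifier invocation at this point. A typical Weka command line for this step, assuming the output files produced above and Weka's standard -t/-T/-d/-l evaluation options, would look something like:

java weka.classifiers.bayes.NaiveBayes -t d:\trainoutput.arff -d d:\nb.model
java weka.classifiers.bayes.NaiveBayes -l d:\nb.model -T d:\testoutput.arff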
This classifier model can then be used to classify the test data using Weka.
The Naïve Bayes classifier classified all the instances of the training data correctly except one instance; with the test data, 64% of the instances were classified correctly, whereas 36% were classified incorrectly.
In the same way, we can apply an n-gram tokenizer on the train and test data and then classify the data. Apply the n-gram tokenizer as follows:

java weka.filters.unsupervised.attribute.StringToWordVector -b -i d:\traininput.arff -o d:\trainoutput.arff -r d:\testinput.arff -s d:\testoutput.arff -R first-last -W 1000 -prune-rate -1.0 -T -I -N 0 -S -stemmer weka.core.stemmers.LovinsStemmer -M 1 -stopwords D:\as\stopwordsnew.txt -tokenizer "weka.core.tokenizers.NGramTokenizer -delimiters \" \\r \\t.,;:\\\'\\\"()?!- <>\\n\" -max 3 -min 1"

Now apply the Naïve Bayes classifier on the result.

4. Conclusion and Future Work
Text classification is an important application area in text mining because classifying millions of text documents manually is an expensive and time-consuming task. Therefore, an automatic text classifier is constructed using pre-classified sample documents; its accuracy and time efficiency are much better than those of manual text classification.
If the input to the classifier contains less noisy data, we obtain efficient results, so efficient preprocessing algorithms must be chosen when mining the text. The test data should also be preprocessed before classifying it.
Text can be classified better by identifying patterns. Once patterns are identified, we can classify given text or documents efficiently. Identifying efficient patterns also plays a major role in text classification. When there is a need for efficient classification of text, various methods can be implemented for identifying better patterns for classifying text efficiently.