SUPERVISED
CLASSIFICATION OF
TEXT DOCUMENTS
Ravi. N.
Veer Prakash S.
Vivek Shenoy T
Contents
• Introduction
• Literature Review
• Document Representation
• Text Classifiers
• Implementation Aspects
• Results and Analysis
• Conclusion
• Future Enhancements
• References
INTRODUCTION
• Current scenario of documents on the Web:
• Structured data
• Unstructured data
• Information Retrieval (IR)
• Deals with the representation, storage, organization and access to
information items.
• This representation of the information is used to manipulate the
unstructured data.
• Goal of IR
• To provide users with those documents which satisfy their information
needs.
Objective of the project
• To classify the documents in the corpus into various classes. A particular
document is assigned to a class if there is relevance between the query and the
document.
• To provide a comparative study between two classifiers namely
- Centroid-based classifier
- K-nearest neighbor classifier
• Definition of Information Retrieval (IR)
IR is finding material of an unstructured nature that satisfies an information
need from within large collections [28].
• Different fields of Information Retrieval (IR)
There are two categories:
- General Applications of IR
- Domain Specific Applications
• IR Process
The IR process can be represented schematically as the following sequence of steps:
1. Problem recognition and acceptance
2. Query formulation
3. Query execution
4. Examination of the result
Fig. Schematic representation of Information Retrieval
• Machine learning
A system that can perform autonomous acquisition and integration of knowledge,
and is capable of learning from experience, analytical observation and other
means, resulting in a system that can continuously self-improve and thereby offer
increased efficiency and effectiveness [5].
• Types of Machine Learning:
• Supervised learning, in which the training data is labeled with the correct
answers, e.g. "spam".
• Unsupervised learning, in which the user is given a collection of unlabeled
data, which has to be analyzed and in which patterns have to be discovered.
• Text Classification (Document Classification)
• Classifying a set of documents into classes; the classification is based upon
the relevance of the documents to the query given by the user [4].
• Types of Text Classification:
• Supervised learning: the training data is labeled with the correct
answers, e.g. "spam".
• Unsupervised Document Classification/ Document clustering
The classification must be done entirely without reference to
external information.
• Definition of Text Classification
Let C = {c1, c2, ..., cm} be a set of categories and D = {d1, d2, ..., dn}
a set of documents. The task of text classification consists in assigning to
each pair (ci, dj) of C × D (with 1 ≤ i ≤ m and 1 ≤ j ≤ n) a value of 0 or 1:
the value 1 if document dj belongs to category ci, and the value 0 if it does
not. This mapping is done with the help of a decision matrix [17].
LITERATURE
REVIEW
• Phases of IR Development:
There are several phases in the development of IR:
• 1st phase, 1950s-1960s: the research phase.
• 2nd phase, 1970s: IR struggled for adoption.
• 3rd phase, 1980s-1990s: IR reached the acceptance phase in terms of free-text
search systems.
• Nowadays the influence of IR is such that it is moving towards projects
in sound and image retrieval, along with electronic provision [26].
• Definition of TC by H.P. Luhn
H.P. Luhn gave a definition for TC in 1958, which marked the start of the text
classification era [32]. The definition is as follows:
“…utilize data-processing machines for auto-abstracting and auto-encoding of
documents and for creating interest profiles for each of the ‘action points’ in an
organization. Both incoming and internally generated documents are
automatically abstracted, characterized by a word pattern, and sent
automatically to appropriate action points.”
DOCUMENT
REPRESENTATION
• Need for DR
• The task of information retrieval is to extract relevant documents from a
large collection of documents in response to user queries. The documents
contain primarily unrestricted text.
Document representation basically involves generating a representation R of
a document such that for any text items D1 and D2, R(D1) ≈ R(D2), where R is a
function known as the relevance of the document, obtained by matching the
keywords in the query with the document set.
In order to reduce the complexity of the documents and make them clear and
easier to handle we transform the document from its full text version to a
document vector which describes the contents of the document.
The terms that occur in a document are the parameters of the document
representation. The types of parameters determine the type of the document
representation.
• Different Types:
• Binary Document Representation
• Term Frequency Representation (frequency vector)
• Probabilistic Representation
Example Documents:

Document | Document content                             | No. of unique words
D0       | Gold silver truck                            | 3
D1       | Shipment of gold damaged in a fire           | 4
D2       | Delivery of silver arrived in a silver truck | 4
D3       | Shipment of gold arrived in a truck          | 4
• Binary Document Representation:
The Binary Independence Model (BIM) introduces some simplifying
assumptions. Here, the term "binary" is equivalent to Boolean: documents and
queries are both represented as binary term-incidence vectors. That is, a
document d is represented by the vector x = (x1, …, xM), where xt = 1 if term
t is present in document d and xt = 0 if t is not present in d [22].
• Representation of the Example Documents:

Doc id | Arrived | Damaged | Delivery | Fire | Gold | Shipment | Silver | Truck
D0     |    0    |    0    |    0     |  0   |  1   |    0     |   1    |   1
D1     |    0    |    1    |    0     |  1   |  1   |    1     |   0    |   0
D2     |    1    |    0    |    1     |  0   |  0   |    0     |   1    |   1
D3     |    1    |    0    |    0     |  0   |  1   |    1     |   0    |   1
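As a rough Perl sketch of how such incidence vectors can be built (illustrative only, not the project's actual scripts; the hard-coded document strings assume the stopwords "of", "in" and "a" are already removed):

#!/usr/bin/perl
use strict;
use warnings;

# Example documents from the slides, with stopwords already removed.
my %docs = (
    D0 => 'gold silver truck',
    D1 => 'shipment gold damaged fire',
    D2 => 'delivery silver arrived silver truck',
    D3 => 'shipment gold arrived truck',
);

# Vocabulary: all unique terms across the collection, in column order.
my %seen;
for my $text (values %docs) {
    $seen{$_} = 1 for split ' ', $text;
}
my @vocab = sort keys %seen;  # arrived damaged delivery fire gold shipment silver truck

# Binary representation: x_t = 1 if term t is present in d, 0 otherwise.
for my $id (sort keys %docs) {
    my %present;
    $present{$_} = 1 for split ' ', $docs{$id};
    my @vec = map { $present{$_} ? 1 : 0 } @vocab;
    print "$id: @vec\n";
}

Running this reproduces the rows of the incidence table above.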
• Drawback:
It does not take into account the frequency of a term within a
document. All the terms within a document are given equal weightage,
irrespective of the number of their occurrences in the document.
• Term Frequency Representation (Frequency vector)
In this type of representation we assign to each term in a document a
weight that depends on the number of occurrences of the term in the document.
• Representation of the Example Documents:

Doc id | Arrived | Damaged | Delivery | Fire | Gold | Shipment | Silver | Truck
D0     |    0    |    0    |    0     |  0   |  1   |    0     |   1    |   1
D1     |    0    |    1    |    0     |  1   |  1   |    1     |   0    |   0
D2     |    1    |    0    |    1     |  0   |  0   |    0     |   2    |   1
D3     |    1    |    0    |    0     |  0   |  1   |    1     |   0    |   1
• Drawback:
This approach does not weigh the terms in a document with respect
to the other documents in the dataset.
• Probabilistic representation
In this scheme every component of the vector denotes the probability
of occurrence of the corresponding term within the document. The probability
of a particular term is found as follows:

Probability = (number of occurrences of term t in document d) / (total number of terms in document d)
• Representation of the Example Documents:

Doc id | Arrived | Damaged | Delivery | Fire | Gold | Shipment | Silver | Truck
D0     |    0    |    0    |    0     |  0   | 1/3  |    0     |  1/3   |  1/3
D1     |    0    |   1/4   |    0     | 1/4  | 1/4  |   1/4    |   0    |   0
D2     |   1/4   |    0    |   1/4    |  0   |  0   |    0     |  2/4   |  1/4
D3     |   1/4   |    0    |    0     |  0   | 1/4  |   1/4    |   0    |  1/4
• tf-idf (term frequency - inverse document frequency) representation
The main idea behind tf-idf is that a term occurring infrequently
across the collection should be given a higher weight than a term that occurs
frequently.
• Important definitions in the tf-idf context:
t = number of distinct terms in the document collection
tfij = number of occurrences of term tj in document Di (also referred to as the term frequency)
dfj = number of documents which contain tj
idfj = log(d / dfj), where d is the total number of documents (the inverse document frequency)
• Weighting Factor of each term:
The weighting factor for each term in the document is calculated
by taking the product of the term frequency and the inverse document frequency
for that term:
dij = tfij * idfj
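For example, with the four example documents (d = 4) and base-10 logarithms (which the values in the table on the next slide imply): "gold" occurs in 3 documents, so idf = log(4/3) ≈ 0.12, and with tf = 1 in D0 its weight is 1 × 0.12 ≈ 0.12; "silver" occurs in 2 documents (idf = log(4/2) ≈ 0.30) and twice in D2, giving a weight of 2 × 0.30 ≈ 0.60.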
• tf-idf(t, d) assigns to term t a weight in document d that is:
1. highest when t occurs many times within a small number of documents (thus
lending high discriminating power to those documents);
2. lower when the term occurs fewer times in a document, or occurs in many
documents (thus offering a less pronounced relevance signal);
3. lowest when the term occurs in virtually all documents.
The values thus computed are then filled into the document vectors.
• Representation of the Example Documents:

Doc id | Arrived | Damaged | Delivery | Fire | Gold | Shipment | Silver | Truck
D0     |    0    |    0    |    0     |  0   | 0.12 |    0     |  0.3   |  0.12
D1     |    0    |   0.6   |    0     | 0.6  | 0.12 |   0.3    |   0    |   0
D2     |   0.3   |    0    |   0.6    |  0   |  0   |    0     |  0.6   |  0.12
D3     |   0.3   |    0    |    0     |  0   | 0.12 |   0.3    |   0    |  0.12
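The weights in the table can be reproduced with a short Perl sketch (a simplified stand-in for the project's scripts; the base-10 logarithm and the stopword-stripped document strings are assumptions carried over from the earlier sketch):

#!/usr/bin/perl
use strict;
use warnings;

my %docs = (
    D0 => 'gold silver truck',
    D1 => 'shipment gold damaged fire',
    D2 => 'delivery silver arrived silver truck',
    D3 => 'shipment gold arrived truck',
);
my $d = scalar keys %docs;            # total number of documents

# Term frequencies per document and document frequencies per term.
my (%tf, %df);
for my $id (keys %docs) {
    my %counts;
    $counts{$_}++ for split ' ', $docs{$id};
    $tf{$id} = \%counts;
    $df{$_}++ for keys %counts;
}

# d_ij = tf_ij * idf_j, with idf_j = log(d / df_j) taken base 10.
for my $id (sort keys %docs) {
    for my $term (sort keys %df) {
        my $tfij = $tf{$id}{$term} // 0;
        my $w    = $tfij * log($d / $df{$term}) / log(10);
        printf "%s  %-8s  %.2f\n", $id, $term, $w;
    }
}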
Text Classifiers
• Refined Definition of Text Classification (TC):
TC is defined as the task of approximating the unknown target function
Φ: D × C → {T, F}
where Φ is called the classifier [29], C = {c1, ..., c|C|} is a predefined set
of categories, and D is a (possibly infinite) set of documents.
If Φ(dj, ci) = T, then dj is called a positive example (or a member) of ci.
If Φ(dj, ci) = F, then dj is called a negative example (or not a member) of ci.
In order to build a classifier we need a set Ω of documents such that the
value of Φ(dj, ci) is known for every (dj, ci) ∈ Ω × C. Usually Ω is partitioned
into three disjoint sets: Tr (the training set), Va (the validation set), and
Te (the test set) [31].
• Training set: the set of documents by observing which the learner builds the
classifier.
• Validation set: the set of documents on which the engineer fine-tunes the
classifier, e.g. choosing, for a parameter p on which the classifier depends,
the value that yields the best effectiveness when evaluated on Va.
• Test set: the set on which the effectiveness of the classifier is finally
evaluated.
"Evaluating the effectiveness" means running the classifier on a set of
pre-classified documents (Va or Te) and checking the degree of correspondence
between the output of the classifier and the pre-assigned classes.
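A minimal sketch of such a partition (the 80/10/10 proportions and the file names are assumptions for illustration; the project's actual split is 651 training and 47 testing documents, with no separate validation set):

#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Partition a pre-classified corpus Omega into Tr, Va and Te.
my @omega = shuffle map { "doc$_.txt" } 1 .. 100;   # placeholder document names

my $n_tr = int(0.8 * @omega);                       # 80% training
my $n_va = int(0.1 * @omega);                       # 10% validation

my @Tr = @omega[0 .. $n_tr - 1];
my @Va = @omega[$n_tr .. $n_tr + $n_va - 1];
my @Te = @omega[$n_tr + $n_va .. $#omega];          # remainder: test

printf "Tr=%d Va=%d Te=%d\n", scalar @Tr, scalar @Va, scalar @Te;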
• Types of classifiers
The following are some of the classifiers [37]:
• Naïve Bayesian classifier
• kNN classifier
• Linear classifiers
• C4.5
• Support Vector Machines, etc.
In this project we concentrate mainly on two classifiers:
• Centroid classifier
• kNN classifier
• CENTROID CLASSIFIER
• This type of classifier computes a centroid vector for every pre-defined
class using all the training documents belonging to the class.
• Next, the test document (which must be classified) is compared with all
these centroid vectors to compute the similarity coefficients.
• Finally the class whose centroid most closely matches that of the test
document is chosen (i.e. the class with the highest similarity coefficient
score).
• Pseudo code of Centroid Classifier
Step 1) The input documents (under pre-defined categories) are split into a
training set and a testing set.
Step 2) Scan through the entire training set to identify all the unique words
across the entire collection. The total count of the unique words decides
the length of the document vector.
Step 3) For each of the unique terms (as identified in step 2), compute the
document frequency (i.e. the total number of documents in which a
particular unique term occurs).
Step 4) Represent every input training document as a vector. (Here we shall
assume that we are using tf-idf weights to represent the input documents;
any of the representation schemes explained earlier can also be used.)
Thus a document vector is represented as
dtf = ( tf1 log(N/df1), tf2 log(N/df2), tf3 log(N/df3), … tfm log(N/dfm) )
Step 5) For every pre-defined class compute a centroid vector. This is done using
the following formula:
C = (1 / |S|) * Σ(d ∈ S) d
where S is the training set of the category/class for which the centroid vector
is being computed.
Thus, if there are m different classes, there will be m different
centroid vectors, denoted C1, C2, …, Cm.
Step 6) For every test document d:
1) Use the document frequencies of the various terms computed from the
training set to compute the tf-idf representation of d.
2) Compute the similarity coefficient between d and each of the m centroid
vectors using the normalised cosine measure:
sim(d, Ci) = (d · Ci) / (||d|| ||Ci||)
where Ci is the centroid vector of a class.
3) Based on the similarity coefficient scores, assign document d to the class
with which the score is the highest:
class(d) = arg max_i sim(d, Ci)
Thus, using the formulas discussed above, the classification of the
document can be done.
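Steps 5 and 6 can be sketched in Perl as follows, assuming each document is already a sparse tf-idf vector stored as a {term => weight} hash (an illustrative reimplementation, not the project's script3.pl):

#!/usr/bin/perl
use strict;
use warnings;

# Step 5: the centroid of a class is the component-wise mean of its
# training document vectors.
sub centroid {
    my @vecs = @_;
    my %c;
    for my $v (@vecs) {
        $c{$_} += $v->{$_} for keys %$v;
    }
    $c{$_} /= @vecs for keys %c;
    return \%c;
}

# Normalised cosine measure between two sparse vectors.
sub cosine {
    my ($u, $v) = @_;
    my ($dot, $nu, $nv) = (0, 0, 0);
    $dot += $u->{$_} * ($v->{$_} // 0) for keys %$u;
    $nu  += $_ ** 2 for values %$u;
    $nv  += $_ ** 2 for values %$v;
    return 0 unless $nu && $nv;
    return $dot / (sqrt($nu) * sqrt($nv));
}

# Step 6: assign the test document to the class whose centroid scores highest.
sub classify {
    my ($doc, $centroids) = @_;          # $centroids: {class => centroid vector}
    my ($best, $best_score) = (undef, -1);
    for my $class (keys %$centroids) {
        my $s = cosine($doc, $centroids->{$class});
        ($best, $best_score) = ($class, $s) if $s > $best_score;
    }
    return $best;
}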
• K Nearest Neighbor Classifier
• It is an instance-based learning algorithm which has been applied to text
categorization.
• This classifier first computes the k nearest neighbors of a test document.
Then the similarities of the test document to the k nearest neighbors are
aggregated according to the class of the neighbors, and the test document
is assigned to the most similar class (as measured by the aggregate
similarity) [37].
• Drawbacks:
• Each test document must be compared with all the training documents
in order to decide its class. Thus it requires a huge amount of
computation.
• It uses all the features equally in computing similarities. This may
lead to poor similarity measures and hence to classification errors.
• Pseudo code of KNN Classifier
Step 1) The input documents (under pre-defined categories) are split into a
training set and a testing set.
Step 2) Scan through the entire training set to identify all the unique words
across the entire collection. The total count of the unique words decides
the length of the document vector.
Step 3) Fix a value for k. This value determines the number of nearest neighbors
which will be considered during document classification.
Step 4) For every test document, compute the similarity coefficient with each of
the training documents and record the similarity scores in a hash table.
Step 5) Select the top k scores from the hash.
Step 6) Compute the aggregate score for each class. If several of the k nearest
neighbors share a class, then the per-neighbor weights of that class are
added together and the resulting weighted sum is used as the likelihood
score of that class. Sort the scores of the candidate classes and generate
a ranked list. The decision rule can be mathematically represented as
score(d, ci) = Σ(dj ∈ KNN(d)) sim(d, dj) · y(dj, ci)
where
• d is the test document being classified and KNN(d) indicates the set of
k nearest neighbors of document d.
• y(dj, ci) represents the classification of document dj with respect to class
ci (1 if dj belongs to ci, 0 otherwise).
Step 7) Test document d is assigned to the class that has the highest weighted
aggregate score.
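A corresponding Perl sketch of steps 4-7, reusing the cosine() routine from the centroid sketch (again an illustration, not the project's actual script):

#!/usr/bin/perl
use strict;
use warnings;

# @train holds [$vector, $class] pairs; $doc is the test vector.
sub knn_classify {
    my ($doc, $k, @train) = @_;

    # Steps 4-5: score every training document, keep the top k.
    my @scored = sort { $b->[0] <=> $a->[0] }
                 map  { [ cosine($doc, $_->[0]), $_->[1] ] } @train;
    @scored = @scored[0 .. $k - 1] if @scored > $k;

    # Step 6: aggregate the per-neighbor similarities by class.
    my %score;
    $score{ $_->[1] } += $_->[0] for @scored;

    # Step 7: the class with the highest weighted aggregate score wins.
    my ($best) = sort { $score{$b} <=> $score{$a} } keys %score;
    return $best;
}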
Implementation
Aspects
• PERL
Perl stands for "Practical Extraction and Reporting Language". It
was invented by Larry Wall. Perl is a very powerful, versatile scripting
language, famously known as the "Swiss Army chainsaw" of programming
languages.
• PDL “Perl Data Language”
• PDL is an object-oriented extension to Perl designed for scientific and
bulk numeric data processing and display. It is a very powerful and at the same
time fast array-oriented language.
• The PDL concept gives standard Perl the ability to compactly store and
speedily manipulate the large N-dimensional data sets which are essential
for scientific computing.
• PDL uses Perl `objects' to hold piddle data. An `object' is like a user-defined
data type and is a very powerful feature of Perl; PDL creates its own class of
`PDL' objects to store piddles.
• Advantages of PDL over Perl variables
• It is impossible to manipulate Perl `arrays' arithmetically as we would
like, e.g. @y = @x * 2 does not double each element (see the sketch after
this list).
• Perl lists are intrinsically one-dimensional; we can have `lists of
lists', but this is not the same thing as a pdl.
• Perl lists do not support the range of datatypes that piddles do (byte
arrays, integer arrays, single precision, double precision, etc.).
• Perl lists consume a lot of memory: at least 20 bytes per number, of
which only a few bytes are used for storing the actual value. This is because
Perl lists are flexible and can contain text strings as well as numbers.
• Perl lists are scattered about memory: the list data structure means
consecutive numbers are not stored in a neat block of consecutive memory.
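A small example of the difference (standard PDL usage; the numbers are arbitrary):

#!/usr/bin/perl
use strict;
use warnings;
use PDL;

# With plain Perl lists, @y = @x * 2 multiplies the *length* of @x by 2
# (the list is used in scalar context); element-wise doubling needs map.
my @x = (1, 2, 3);
my @y = map { $_ * 2 } @x;
print "@y\n";                # 2 4 6

# A piddle supports array-oriented arithmetic directly and is stored
# compactly in a single block of memory.
my $p = pdl(1, 2, 3);
my $q = $p * 2;              # element-wise, no explicit loop
print $q, "\n";              # [2 4 6]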
• Advantages of using Perl Data Language
• Both Perl and PDL are easily available, free of cost, under an open source
license.
• Since PDL is an extension of Perl, a Perl programmer has all the powerful
features of Perl at hand. Thus even in mainly numerically oriented
programming it is often extremely handy to have access to non-numeric
functionality.
• Since it is a Perl package, PDL is extensible and interoperable.
• The syntax associated with PDL is very simple, making it a user-friendly
package.
• Usage of PDL in our project:
The document vectors are stored as pdls under four representations:
• Binary representation (pdl elements indicate the presence or absence of a word)
• Term frequency representation (pdl elements indicate the frequency, i.e. the number of times a word occurs in a file)
• Probabilistic representation (pdl elements indicate the probability of occurrence of a word)
• Tf-idf representation (pdl elements indicate the product of term frequency and inverse document frequency)
• Organization of our code
The classifiers (KNN and Centroid) each run over the four representations:
binary, term frequency, probabilistic and tf-idf. The code is organized into
the following folders:
• Textfiles - contains all the training and testing documents
• Freq - contains the files representing the training and testing documents, indicating the frequency of each word in a file
• String - contains all the scripts and the results of classification
• Actuals - contains predefined files which indicate the class to which each file belongs
• Scripts of our project
There are 6 different scripts.
1) init.pl
2) main.pl
3) script1.pl
4) script2.pl
5) script3.pl
6) script4.pl
1) init.pl
This is the initialization script. Its main intention is to make all the
necessary folders available for the smooth functioning of the code. It
deletes selected folders (for example, freq and source code/results, which
hold all the necessary data) and recreates them.
2) main.pl
This is the main script. It invokes all the other scripts sequentially.
3) script1.pl
The main intention of this script is the removal of stopwords and other
unwanted characters from the source file.
Note: the original source file itself is not modified.
4) script2.pl
The main intention of this script is to calculate the document frequency
for each of the unique terms listed in uniquefile.txt.
5) script3.pl
This script performs the task of document classification.
6) script4.pl
The main intention of this script is to generate input for an HTML
browser, so as to display the results to the user.
Results
&
Analysis
The following are the elements of our project:
1) Pre-defined classes - 7
2) Training documents - 651
3) Testing documents - 47
1) Pre-defined classes:
The following are the 7 pre-defined classes:

S.no. | Class name | No. of documents
1     | Cricket    | 101
2     | Formula-1  | 90
3     | Hockey     | 109
4     | Ice-Hockey | 109
5     | Movies     | 122
6     | Politics   | 20
7     | Religion   | 100
• Results for Centroid Classifier

Representation                   | Binary | Term frequency | Probabilistic | Tf-idf
Properly classified              |   36   |      38        |      43       |   33
Misclassified                    |   11   |       9        |       4       |   14
Level of classification accuracy |  0.76  |     0.80       |     0.91      |  0.70
• Level of Accuracy Achieved
[Bar chart: accuracy obtained by the Centroid classifier under each representation - Binary 0.76, Term Frequency 0.80, Probabilistic 0.91, Tf-idf 0.70]
Fig. 6.1.2 Accuracy obtained by the Centroid classifier in each Document Representation
• Results for KNN Classifier
The following table shows the result of applying the KNN classifier on the
document vectors when the value k = 2 is given by the user.

Representation                   | Binary | Term frequency | Probabilistic | Tf-idf
Properly classified              |   39   |      41        |      42       |   37
Misclassified                    |    8   |       6        |       5       |   10
Level of classification accuracy |  0.83  |     0.87       |     0.89      |  0.78
• Level of Accuracy Achieved
[Bar chart: accuracy obtained by the KNN (k=2) classifier under each representation - Binary 0.83, Term Frequency 0.87, Probabilistic 0.89, Tf-idf 0.78]
Fig. 6.2.2 Accuracy obtained by the KNN (k=2) classifier in each Document Representation.
The following table shows the result of applying the KNN classifier on the
document vectors when the value k = 20 is given by the user.

Representation                   | Binary | Term frequency | Probabilistic | Tf-idf
Properly classified              |   44   |      43        |      42       |   37
Misclassified                    |    3   |       4        |       5       |   10
Level of classification accuracy |  0.94  |     0.91       |     0.89      |  0.78
• Level of Accuracy Achieved
[Bar chart: accuracy obtained in KNN (k=20) in each Document Representation - Binary 0.94, Term Frequency 0.91, Probabilistic 0.89, Tf-idf 0.78]
• Comparison of Centroid Classifier and the KNN Classifier
[Grouped bar chart - Binary: Centroid 0.76, KNN 0.83; Term Frequency: Centroid 0.80, KNN 0.87; Probabilistic: Centroid 0.91, KNN 0.89; Tf-idf: Centroid 0.70, KNN 0.78]
Fig. 6.3.1 Comparison of the Centroid and KNN (k=2) classifiers
Conclusion
About the KNN Classifier
We find that the KNN classifier provides the best results
in terms of classification accuracy.
Drawbacks:
1) We cannot decide on the ideal value of k.
2) It requires a huge amount of computational resources.
3) It is impractical in the case of very large document collections.
About the Centroid Classifier
We also find that the Centroid classifier provides a
classification accuracy very near to that of KNN.
Advantages of the Centroid classifier over KNN:
1) It does not require a huge amount of computation.
2) It is very quick to decide the results of classification.
3) It is ideally suited to very large input document collections.
Thus we can conclude that the Centroid classifier is
better than the KNN classifier.
Future Enhancements
• To increase the number of classes
• To build a suitable front end
• To integrate the classifiers with a search engine to provide classification of
websites
• To enhance the centroid classifier by implementing a weighted centroid classifier
• To incorporate a stemming algorithm, e.g. the Porter stemmer
• To upgrade the implementation to incorporate standard data collections such
as Reuters-21578, TREC-5, TREC-6, the OHSUMED collection and the 20 Newsgroups
data set
References
[1] Ricardo Baeza-Yates and Berthier Ribeiro-Neto, "Modern Information Retrieval", Addison Wesley Longman Publishing Co., 1999.
[2] A. Spoerri, "Information Processing & Management", Proceedings of the IEEE First International Conference on Computer Vision, Vol. 43, pp. 1044-1058, 2007.
[3] Forrester, "Coping with Complex Data", The Forrester Report, pp. 2-4, April 1995.
[4] W. Bruce Croft, "Intelligent Information Retrieval", Center for Intelligent Information Retrieval, Computer Science Department, University of Massachusetts, Amherst; D-Lib Magazine, November 1995.
[5] Simon Colton, "AI Bite", The Society for the Study of Artificial Intelligence and Simulation of Behaviour, pp. 66-67.
Thank you.....    
Any Questions???