Lab 5: Text Classification
Evgueni N. Smirnov and Jeroen Donkers
{smirnov,donkers}@cs.unimaas.nl
June 28, 2007
1. Introduction
Text mining is concerned with the task of extracting relevant information from natural-language text and searching for interesting relationships between the extracted entities. Text classification is one of the basic techniques in the area of text mining. It is one of the more difficult data-mining problems, since it deals with very high-dimensional data sets with arbitrary patterns of missing data.
In this lab session you will study three text classification problems. You will try to build
classification models for each of these problems using Weka.
2. Weka and Text Classification
To represent text information in Weka we can use the STRING attribute type (see pages 54-55). A STRING attribute can in principle have an infinite number of values and therefore cannot be handled by any classifier in Weka. That is why we have to convert STRING values into a set of attributes that represent the frequency of each word in the strings. This type of string conversion can be realized by the StringToWordVector filter in Weka (see page 399 for all possible options of the filter). Figure 1 provides an example of string conversion implemented by this filter.
@relation text
@attribute docs string
@attribute class {animals, no_animals}
@data
'cat mouse lion', animals
'tree flower', no_animals

@relation text-converted
@attribute class {animals, no_animals}
@attribute cat numeric
@attribute mouse numeric
@attribute lion numeric
@attribute tree numeric
@attribute flower numeric
@data
animals,1,1,1,0,0
no_animals,0,0,0,1,1
Figure 1. Converting String Attributes using the StringToWordVector filter
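For reference, the same conversion can be scripted against Weka's Java API. The sketch below is a minimal example, not part of the original lab: it assumes weka.jar is on the classpath and uses only standard classes (Instances, Filter, StringToWordVector); the optional setOutputWordCounts call switches from 0/1 indicators to actual word counts.

import java.io.BufferedReader;
import java.io.FileReader;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

public class ConvertText {
    public static void main(String[] args) throws Exception {
        // Load the ARFF file containing the STRING attribute.
        Instances raw = new Instances(new BufferedReader(new FileReader("dataMining.arff")));
        StringToWordVector s2wv = new StringToWordVector();
        // s2wv.setOutputWordCounts(true);  // uncomment to get word counts instead of 0/1 indicators
        s2wv.setInputFormat(raw);
        Instances converted = Filter.useFilter(raw, s2wv);
        System.out.println(converted.toSummaryString());
    }
}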
Please note that after the conversion the class attribute always becomes the first attribute (see Figure 1). This may cause problems, since by default Weka takes the last attribute to be the class attribute. To overcome this obstacle you can duplicate the class attribute using the attribute filter Copy. The Copy filter copies the class attribute into a new last attribute called "Copy of Class". Then you have to remove the original class attribute using the attribute filter Remove.
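Programmatically, the same workaround might look as follows (a sketch continuing the example above; both filters live in weka.filters.unsupervised.attribute):

// Additional imports: weka.filters.unsupervised.attribute.Copy,
//                     weka.filters.unsupervised.attribute.Remove
Copy copy = new Copy();
copy.setAttributeIndices("1");            // after conversion the class is attribute 1
copy.setInputFormat(converted);
Instances withCopy = Filter.useFilter(converted, copy);

Remove remove = new Remove();
remove.setAttributeIndices("1");          // drop the original class attribute
remove.setInputFormat(withCopy);
Instances data = Filter.useFilter(withCopy, remove);

// The copied class attribute is now the last one.
data.setClassIndex(data.numAttributes() - 1);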
Once you have converted your text data into the word-frequency attribute representation, you can train and test any classifier in the standard Weka way. In this lab we will use four classifiers: the Nearest Neighbor Classifier (IBk), the Naïve Bayes Classifier (NB), the J48 Decision Tree Learner, and the JRip Classifier.
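As a sketch of what "the standard Weka way" looks like in code, the four classifiers can be trained and evaluated in one loop (continuing the example above; the class names are the standard ones in the Weka distribution):

// Additional imports: java.util.Random, weka.classifiers.Classifier,
// weka.classifiers.Evaluation, weka.classifiers.lazy.IBk,
// weka.classifiers.bayes.NaiveBayes, weka.classifiers.trees.J48,
// weka.classifiers.rules.JRip
Classifier[] learners = { new IBk(), new NaiveBayes(), new J48(), new JRip() };
for (Classifier learner : learners) {
    Evaluation eval = new Evaluation(data);
    // 10-fold cross-validation, as used throughout this lab.
    eval.crossValidateModel(learner, data, 10, new Random(1));
    System.out.println(learner.getClass().getSimpleName());
    System.out.println(eval.toSummaryString());
    System.out.println(eval.toMatrixString());  // the confusion matrix
}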
3. The Data-Mining Text Problem
3.1 Description
The data-mining text problem is a relatively simple problem. It is defined as follows:
Given:
- A collection of 189 sentences about data mining, derived from Wikipedia and written in English, French, Spanish, and German.
Find:
- Classifiers that recognize the language of a given text.
The data for the task are given in the file dataMining.arff, which can be downloaded from the course website.
3.2 Task
Your task is to train the IBk, NB, J48, and JRip classifiers on dataMining.arff and to analyze these classifiers and the data. To accomplish this task, please follow the main scenario below:
Main Scenario.
1. Study the data file dataMining.arff.
2. Load the data file dataMining.arff in Weka.
3. Convert the data using the StringToWordVector filter.
4. Create a copy of the class attribute as the last attribute and remove the original class attribute.
5. Train the IBk, NB, J48, and JRip classifiers in the standard Weka way.
6. Test the IBk, NB, J48, and JRip classifiers using 10-fold cross-validation.
Once you have followed the main scenario, you can write down the statistics and confusion matrices of the classifiers, as well as the text representations of the J48 and JRip classifiers. You can use this information to answer two types of questions: language-related questions and classifier-related questions.
The language-related questions:
Are the languages represented in the data related? How can you show it?
How do the explicit classifiers (such as J48 and JRip) recognize the languages?
What would happen if we could use stop lists for all four languages when preparing the data?
The classifier-related questions:
The accuracies of the NB, J48, and JRip classifiers are relatively good. Why is that? In this context, please note that the IBk classifier has very low accuracy for low values of the parameter k. When you increase the value of k to, say, 59, the accuracy becomes 90.48%. How can you explain this phenomenon?
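To reproduce this experiment in code, k can be set through IBk's setKNN method (equivalent to the -K command-line option); a brief sketch, continuing the earlier examples:

IBk knn = new IBk();
knn.setKNN(59);  // a large neighbourhood; with k = 1 the accuracy is very low here
Evaluation eval = new Evaluation(data);
eval.crossValidateModel(knn, data, 10, new Random(1));
System.out.println(eval.pctCorrect() + "% correct");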
Another issue is the computational performance of the classifiers. Due to the high-dimensional data (1902 numeric attributes), training some of the classifiers (such as JRip) takes time. To reduce this time you can reduce the number of attributes by selecting the most relevant ones. The selection can be realized using some of the attribute-selection methods from the "Select Attributes" pane of the Weka Explorer. Once the attribute indices have been identified, you can select the attributes using the attribute filter Remove. Then, please repeat the experiments and record again the statistics and confusion matrices of the classifiers. As you will see after the experiments, in addition to the time reduction, the IBk classifier suddenly becomes quite accurate for low values of k. How can you explain this result?
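The same selection can also be scripted. The sketch below uses InfoGainAttributeEval with a Ranker search, which is just one of the methods offered in the "Select Attributes" pane, and the number of attributes kept (100) is an arbitrary illustration:

// Additional imports: weka.attributeSelection.AttributeSelection,
// weka.attributeSelection.InfoGainAttributeEval, weka.attributeSelection.Ranker
AttributeSelection selector = new AttributeSelection();
selector.setEvaluator(new InfoGainAttributeEval());
Ranker ranker = new Ranker();
ranker.setNumToSelect(100);  // illustrative value; tune it in your experiments
selector.setSearch(ranker);
selector.SelectAttributes(data);
Instances reduced = selector.reduceDimensionality(data);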
4. The Music-Jokes Text Problem
4.1 Description
The music-jokes text problem is a somewhat harder problem. It is defined as follows:
Given:
- A collection of 409 music jokes in English from http://www.mit.edu/~jcb/jokes/. The jokes are labeled as funny or not funny by Dr. Jeroen Donkers.
Find:
- Classifiers that recognize whether a joke is funny or not funny.
The data for the task are given in the file musicJokes.arff, which can be downloaded from the course website.
4.2 Task
Your task is to train the IBk, NB, J48, and JRip classifiers on musicJokes.arff and to analyze these classifiers and the data. To accomplish this task, please follow the scenario from Section 3 and set the option useStoplist of the StringToWordVector filter to true. Once you have finished training, please write down the statistics and confusion matrices of the classifiers.
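If you prefer to set this option in code, the corresponding setter should be setUseStoplist (an assumption based on Weka's usual option-to-setter naming; check your Weka version):

StringToWordVector s2wv = new StringToWordVector();
s2wv.setUseStoplist(true);  // drop common English function words before counting
s2wv.setInputFormat(raw);
Instances converted = Filter.useFilter(raw, s2wv);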
Using the text representations of the J48 and JRip classifiers, can you hypothesize what makes a joke funny?
Since none of the classifiers has an accuracy greater than 65%, please try to improve the classifiers by adjusting the options of the StringToWordVector filter and of the classifiers used. In addition, you can select the most promising attributes using some of the attribute-selection methods from the "Select Attributes" pane, or you can use the meta classifier CostSensitiveClassifier to adjust the accuracy using costs.
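A sketch of the cost-sensitive route is given below; the 2x2 cost values are assumptions to experiment with, not prescribed by the lab:

// Additional imports: weka.classifiers.CostMatrix,
// weka.classifiers.meta.CostSensitiveClassifier
CostMatrix costs = new CostMatrix(2);  // two classes: funny / not funny
costs.setElement(0, 1, 5.0);           // heavier penalty for one error type (illustrative)
costs.setElement(1, 0, 1.0);
CostSensitiveClassifier csc = new CostSensitiveClassifier();
csc.setClassifier(new NaiveBayes());   // any of the four classifiers can be wrapped
csc.setCostMatrix(costs);
csc.setMinimizeExpectedCost(true);     // reweight predictions instead of resampling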
5. The Problem of American and Chinese Styles of English Sentence Construction
5.1 Description
The problem of American and Chinese styles of English sentence construction is the most difficult problem. It is defined as follows:
Given:
- A collection of 132 pairs of sentences written in English by Americans and by Chinese people living in the US (see http://www.englishdaily626.com/c-mistakes.php for more information).
Find:
- Classifiers that recognize whether a sentence has been written by an American or by a Chinese writer.
The data for the task are given in the file americanChineseStyles.arff, which can be downloaded from the course website.
5.2 Task
Your task is to train the IBk, NB, J48, and JRip classifiers on americanChineseStyles.arff and to analyze these classifiers and the data. To accomplish this task, please follow the scenario from Section 3. The accuracy of the classifiers you design will vary between 7.95% and 50%. In this context, the main question is: can you employ a classifier with low accuracy to design an acceptable classifier?
Note: If during the experiments Weka crashes due to its memory requirements, please set the option maxheap in the file C:\Program Files\Weka-3-4\RunWeka.ini to 512m or 1024m.
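For reference, the relevant line in RunWeka.ini would then read, for example:

maxheap=1024m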