Data Mining
Laboratory Manual
Department of Information Technology
MLR INSTITUTE OF TECHNOLOGY
Marri Laxman Reddy Avenue,
Dundigal, Gandimaisamma (M), R.R. Dist.
Prepared by
Mr. A. Ramachandra Reddy
Assistant Professor, CSE
Date of Issue:
Document No: MLRIT/IT/LAB MANUAL/DATA MINING
Faculty name: A. RAMACHANDRA REDDY
Date of Revision:
Verified by
Authorized by
HOD
INDEX
1. Preface
2. Lab Code
3. JNTU Syllabus
4. List of Experiments and Problem Statements
5. Fundamentals of Data Mining
6. Introduction to WEKA
7. Launching WEKA
8. The WEKA Explorer
   Preprocessing
   Classification
   Clustering
   Association
   Selecting Attributes
   Visualization
9. Working with WEKA
10. File Formats
11. Credit Risk Management
12. Lab Cycle Tasks
13. JNTU Experiment 1
14. JNTU Experiment 2
15. JNTU Experiment 3
16. JNTU Experiment 4
17. JNTU Experiment 5
18. JNTU Experiment 6
19. JNTU Experiment 7
20. JNTU Experiment 8
21. JNTU Experiment 9
22. JNTU Experiment 10
23. JNTU Experiment 11
24. JNTU Experiment 12
25. Additional Experiment 1
26. Additional Experiment 2
27. Viva-voce Questions
PREFACE
Data mining is one of the important subjects included in the fourth-year curriculum by JNTUH. In
addition to the theory subject, the curriculum includes data mining lab practicals using the WEKA
(Waikato Environment for Knowledge Analysis) tool.
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms can either
be applied directly to a dataset or called from your own Java code.
Weka was developed at the University of Waikato in New Zealand and is named after a bird found in
New Zealand. It is written in Java, a widely used object-oriented programming language. It is
platform-independent software and can run on Windows, Linux and Solaris.
The Weka (pronounced Way-Kuh) workbench contains a collection of visualization tools and
algorithms for data analysis and predictive learning, together with graphical user interfaces for easy
access to this functionality.
Weka contains tools for data preprocessing, classification, regression, clustering, association rules,
and visualization. It is also well suited for developing new machine learning schemes.
LAB CODE
1. Students should report to the concerned lab as per the time table.
2. Students who turn up late to the labs will in no case be permitted to do the program scheduled for the
day.
3. After completion of the program, certification of the concerned staff in-charge in the observation
book is necessary.
4. Students should bring a notebook of 100 pages and should enter the readings/observations into the
notebook while performing the experiment.
5. The record of observations, along with the detailed experimental procedure of the experiment in the
immediate last session, should be submitted to and certified by the staff member in-charge.
6. Not more than 3 students in a group are permitted to perform the experiment on the set.
7. The group-wise division made in the beginning should be adhered to and no mix up of students
among different groups will be permitted.
8. The components required pertaining to the experiment should be collected from stores in-charge after
duly filling in the requisition form.
9. When the experiment is completed, students should disconnect the setup made by them and return all
the components/instruments taken for the purpose.
10. Any damage to the equipment or burnt-out components will be viewed seriously, either by imposing a
penalty or by dismissing the whole group of students from the lab for the semester/year.
11. Students should be present in the labs for the total scheduled duration.
12. Students are required to prepare thoroughly to perform the experiment before coming to the laboratory.
13. Procedure sheets/data sheets provided to the student groups should be maintained neatly and
returned after the experiment.
JNTUH Syllabus
Description:
The business of banks is making loans. Assessing the credit worthiness of an applicant is of crucial
importance. You have to develop a system to help a loan officer decide whether the credit of a customer is
good, or bad. A bank's business rules regarding loans must consider two opposing factors. On the one hand, a
bank wants to make as many loans as possible. Interest on these loans is the bank's profit source. On the other
hand, a bank cannot afford to make too many bad loans. Too many bad loans could lead to the collapse of the
bank. The bank's loan policy must involve a compromise: not too strict, and not too lenient.
To do the assignment, you first and foremost need some knowledge about the world of credit. You can
acquire such knowledge in a number of ways.
1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview her and try to represent her
knowledge in the form of production rules.
2. Books. Find some training manuals for loan officers or perhaps a suitable textbook on finance. Translate
this knowledge from text form to production rule form.
3. Common sense. Imagine yourself as a loan officer and make up reasonable rules which can be used to
judge the credit worthiness of a loan applicant.
4. Case histories. Find records of actual cases where competent loan officers correctly judged when, and
when not to, approve a loan application.
The German Credit Data:
Actual historical credit data is not always easy to come by because of confidentiality rules. Here is one such
dataset, consisting of 1000 actual cases collected in Germany. It is available as the original credit dataset
and as an Excel spreadsheet version of the German credit data.
In spite of the fact that the data is German, you should probably make use of it for this assignment (unless
you really can consult a real loan officer!).
A few notes on the German dataset
• DM stands for Deutsche Mark, the unit of currency, worth about 90 cents Canadian (but looks and acts like
a quarter).
• owns_telephone. German phone rates are much higher than in Canada so fewer people own telephones.
• foreign_worker. There are millions of these in Germany (many from Turkey). It is very hard to get German
citizenship if you were not born of German parents.
• There are 20 attributes used in judging a loan applicant. The goal is to classify the applicant into one of
two categories, good or bad.
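Because the attributes mix nominal and numeric types, it can help to list them by type before modelling. The following is a minimal sketch in Python, assuming the data is stored in Weka's ARFF format, where a nominal attribute is declared with a brace-enclosed list of values and a numeric one with the keyword numeric, real or integer. The three-attribute header and the function name split_attributes are made up for illustration; this is not the actual German credit file, and attribute names containing spaces are not handled.

```python
import re

# Hypothetical ARFF header for illustration only (not the real German credit file).
SAMPLE_ARFF = """\
@relation credit-sample
@attribute checking_status {low, high, none}
@attribute duration numeric
@attribute class {good, bad}
@data
low,12,good
"""


def split_attributes(arff_text):
    """Return (nominal_names, numeric_names) read from an ARFF header."""
    nominal, numeric = [], []
    for line in arff_text.splitlines():
        line = line.strip()
        # The header ends where the data section begins.
        if line.lower().startswith("@data"):
            break
        match = re.match(r"@attribute\s+(\S+)\s+(.+)", line, flags=re.IGNORECASE)
        if not match:
            continue
        name, type_spec = match.group(1), match.group(2).strip()
        if type_spec.startswith("{"):
            # A brace-enclosed value list marks a nominal attribute.
            nominal.append(name)
        elif type_spec.lower() in ("numeric", "real", "integer"):
            numeric.append(name)
    return nominal, numeric


nominal, numeric = split_attributes(SAMPLE_ARFF)
print("Nominal:", nominal)
print("Numeric:", numeric)
```

In the Weka Explorer the same information is shown per attribute in the Preprocess tab, so the script is only a convenience for checking a file outside the GUI.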
Subtasks : (Turn in your answers to the following tasks)
1. List all the categorical (or nominal) attributes and the real-valued attributes separately.
2. What attributes do you think might be crucial in making the credit assessment? Come up with some
simple rules in plain English using your selected attributes.
3. One type of model that you can create is a Decision Tree - train a Decision Tree using the complete
dataset as the training data. Report the model obtained after training. (10 marks)
4. Suppose you use your above model trained on the complete dataset, and classify credit good/bad for
each of the examples in the dataset. What % of examples can you classify correctly? (This is also
called testing on the training set.) Why do you think you cannot get 100% training accuracy? (10
marks)
5. Is testing on the training set as you did above a good idea? Why or why not? (10 marks)
6. One approach for solving the problem encountered in the previous question is using cross-validation.
Describe briefly what cross-validation is. Train a Decision Tree again using cross-validation and
report your results. Does your accuracy increase/decrease? Why? (10 marks)
7. Check to see if the data shows a bias against "foreign workers" (attribute 20), or "personal-status"
(attribute 9). One way to do this (perhaps rather simple minded) is to remove these attributes from the
dataset and see if the decision tree created in those cases is significantly different from the full dataset
case which you have already done. To remove an attribute you can use the preprocess tab in Weka's
GUI Explorer. Did removing these attributes have any significant effect? Discuss. (10 marks)
8. Another question might be, do you really need to input so many attributes to get good results? Maybe
only a few would do. For example, you could try just having attributes 2, 3, 5, 7, 10, 17 (and 21, the
class attribute (naturally)). Try out some combinations. (You had removed two attributes in problem
7. Remember to reload the arff data file to get all the attributes initially before you start selecting the
ones you want.) (10 marks)
9. Sometimes, the cost of rejecting an applicant who actually has good credit (case 1) might be higher
than accepting an applicant who has bad credit (case 2). Instead of counting the misclassifications
equally in both cases, give a higher cost to the first case (say cost 5) and lower cost to the second
case. You can do this by using a cost matrix in Weka. Train your Decision Tree again and report the
Decision Tree and cross-validation results. Are they significantly different from results obtained in
problem 6 (using equal cost)? (10 marks)
10. Do you think it is a good idea to prefer simple decision trees instead of having long complex decision
trees? How does the complexity of a Decision Tree relate to the bias of the model? (10 marks)
11. You can make your Decision Trees simpler by pruning the nodes. One approach is to use Reduced
Error Pruning. Explain this idea briefly. Try reduced error pruning for training your Decision Trees
using cross-validation (you can do this in Weka) and report the Decision Tree you obtain. Also,
report your accuracy using the pruned model. Does your accuracy increase? (10 marks)
12. (Extra Credit): How can you convert a Decision Tree into "if-then-else rules"? Make up your own
small Decision Tree consisting of 2-3 levels and convert it into a set of rules. There also exist
different classifiers that output the model in the form of rules. One such classifier in Weka is
rules.PART; train this model and report the set of rules obtained. Sometimes just one attribute can be
good enough in making the decision, yes, just one! Can you predict what attribute that might be in
this dataset? The OneR classifier uses a single attribute to make decisions (it chooses the attribute
based on minimum error). Report the rule obtained by training a OneR classifier. Rank the performance of
J48, PART and OneR. (10 marks)
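The OneR idea mentioned in task 12 can be sketched in a few lines. This is a minimal illustration in Python, not Weka's own implementation: the function name one_r and the six toy rows are made up for this example, only nominal attributes are handled, and ties are broken arbitrarily. For each attribute, every value predicts its majority class, and the attribute whose rule makes the fewest errors on the training data is kept.

```python
from collections import Counter, defaultdict


def one_r(rows, class_index=-1):
    """Return (attribute_index, rule, errors) for the attribute with the fewest training errors."""
    n_cols = len(rows[0])
    class_index = class_index % n_cols
    best = None
    for attr in range(n_cols):
        if attr == class_index:
            continue
        # Count the class labels seen together with each value of this attribute.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[attr]][row[class_index]] += 1
        # Each attribute value predicts its majority class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        # Errors are the rows that do not belong to the majority class of their value.
        errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in counts.values())
        if best is None or errors < best[2]:
            best = (attr, rule, errors)
    return best


# Hypothetical toy data: (checking_status, housing, class). Not taken from the German dataset.
rows = [
    ("low", "own", "bad"),
    ("low", "rent", "bad"),
    ("low", "own", "good"),
    ("high", "own", "good"),
    ("high", "rent", "good"),
    ("high", "rent", "good"),
]

best_attr, rule, errors = one_r(rows)
print("Best attribute index:", best_attr)
print("Rule:", rule)
print("Training errors:", errors)
```

On these toy rows the first attribute wins because its rule misclassifies one row, while the rule for the second attribute misclassifies two. In Weka the corresponding classifier is OneR in the rules group.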
LIST OF EXPERIMENTS AND PROBLEM STATEMENTS
EXP. NO. / NAME OF THE EXPERIMENT / DATABASE FILE
Experiment 1
JNTU (German Credit Data): To list all the categorical (or nominal) attributes and the real-valued
attributes using the WEKA mining tool.
P1: List all the categorical (or nominal) attributes and the real-valued attributes separately using
contact-lenses.arff.
P2: List all the categorical (or nominal) attributes and the real-valued attributes separately using
cpu.arff.
P3: List all the categorical (or nominal) attributes and the real-valued attributes separately using
diabetes.arff.
P4: List all the categorical (or nominal) attributes and the real-valued attributes separately using
glass.arff.
P5: List all the categorical (or nominal) attributes and the real-valued attributes separately using
ionosphere.arff.
P6: List all the categorical (or nominal) attributes and the real-valued attributes separately using
iris.arff.
P7: List all the categorical (or nominal) attributes and the real-valued attributes separately using
labor.arff.
P8: List all the categorical (or nominal) attributes and the real-valued attributes separately using
ReutersGrain-test.arff.
P9: List all the categorical (or nominal) attributes and the real-valued attributes separately using
weather.arff.
P10: List all the categorical (or nominal) attributes and the real-valued attributes separately using
supermarket.arff.
P11: List all the categorical (or nominal) attributes and the real-valued attributes separately using
vote.arff.
Experiment 2
JNTU (German Credit Data): What attributes do you think might be crucial in making the credit
assessment? Come up with some simple rules in plain English using your selected attributes.
P1: What attributes do you think might be crucial in making the analysis of contact-lenses? Come up
with some simple rules in plain English using your selected attributes from contact-lenses.arff.
P2: What attributes do you think might be crucial in making the analysis of cpu? Come up with some
simple rules in plain English using your selected attributes from the cpu.arff database.
P3: What attributes do you think might be crucial in making the analysis of diabetes? Come up with
some simple rules in plain English using your selected attributes from the diabetes.arff database.
P4: What attributes do you think might be crucial in making the analysis of glass? Come up with some
simple rules in plain English using your selected attributes from the glass.arff database.
P5: What attributes do you think might be crucial in making the analysis of ionosphere? Come up with
some simple rules in plain English using your selected attributes from ionosphere.arff.
P6: What attributes do you think might be crucial in making the analysis of iris? Come up with some
simple rules in plain English using your selected attributes from the iris.arff database.
P7: What attributes do you think might be crucial in making the analysis of labor? Come up with some
simple rules in plain English using your selected attributes from the labor.arff database.
P8: What attributes do you think might be crucial in making the analysis of ReutersGrain-test? Come
up with some simple rules in plain English using your selected attributes from the
ReutersGrain-test.arff database.
P9: What attributes do you think might be crucial in making the analysis of weather? Come up with
some simple rules in plain English using your selected attributes from the weather.arff database.
P10: What attributes do you think might be crucial in making the analysis of supermarket? Come up
with some simple rules in plain English using your selected attributes from supermarket.arff.
P11: What attributes do you think might be crucial in making the analysis of vote? Come up with some
simple rules in plain English using your selected attributes from the vote.arff database.
Experiment 3
JNTU (German Credit Data): One type of model that you can create is a Decision Tree - train a Decision
Tree using the complete dataset as the training data. Report the model obtained after training.
P1 to P10: The same problem statement, applied to each of the following database files in turn.
P1: Contact Lenses
P2: CPU
P3: Diabetes
P4: Glass
P5: Ionosphere
P6: Iris
P7: Labor
P8: ReutersGrain-test
P9: Weather
P10: Vote
Experiment 4
JNTU (German Credit Data): Suppose you use your above model trained on the complete dataset, and
classify credit good/bad for each of the examples in the dataset. What % of examples can you classify
correctly? (This is also called testing on the training set.) Why do you think you cannot get 100%
training accuracy?
P1 to P10: The same problem statement, applied to each of the following database files in turn.
P1: Contact Lenses
P2: CPU
P3: Diabetes
P4: Glass
P5: Ionosphere
P6: Iris
P7: Labor
P8: ReutersGrain-test
P9: Weather
P10: Vote
Experiment 5
JNTU (German Credit Data): Is testing on the training set as you did above a good idea? Why or why not?
P1 to P10: The same problem statement, applied to each of the following database files in turn.
P1: Contact Lenses
P2: CPU
P3: Diabetes
P4: Glass
P5: Ionosphere
P6: Iris
P7: Labor
P8: ReutersGrain-test
P9: Weather
P10: Vote
Experiment 6
JNTU (German Credit Data): One approach for solving the problem encountered in the previous question
is using cross-validation. Describe briefly what cross-validation is. Train a Decision Tree again using
cross-validation and report your results. Does your accuracy increase/decrease? Why?
P1 to P10: The same problem statement, applied to each of the following database files in turn.
P1: Contact Lenses
P2: CPU
P3: Diabetes
P4: Glass
P5: Ionosphere
P6: Iris
P7: Labor
P8: ReutersGrain-test
P9: Weather
P10: Vote
Experiment 7
JNTU (German Credit Data): Another question might be, do you really need to input so many attributes
to get good results? Maybe only a few would do. For example, you could try just having attributes
2, 3, 5, 7, 10, 17 (and 21, the class attribute (naturally)). Try out some combinations. (You had
removed two attributes in problem 7. Remember to reload the arff data file to get all the attributes
initially before you start selecting the ones you want.)
P1 to P10: The same problem statement, applied to each of the following database files in turn.
P1: Contact Lenses
P2: CPU
P3: Diabetes
P4: Glass
P5: Ionosphere
P6: Iris
P7: Labor
P8: ReutersGrain-test
P9: Weather
P10: Vote
Experiment 8
JNTU (German Credit Data): Another question might be, do you really need to input so many attributes
to get good results? Maybe only a few would do. For example, you could try just having attributes
2, 3, 5, 7, 10, 17 (and 21, the class attribute (naturally)). Try out some combinations. (You had
removed two attributes in problem 7. Remember to reload the arff data file to get all the attributes
initially before you start selecting the ones you want.)
P1 to P9: The same problem statement, applied to each of the following database files in turn.
P1: Contact Lenses
P2: CPU
P3: Diabetes
P4: Glass
P5: Ionosphere
P6: Iris
P7: Labor
P8: ReutersGrain-test
P9: Weather
P10 Another question might be, do you really need to input so many attributes to
get good results? Maybe only a few would do. For example, you could try just
having attributes 2, 3, 5, 7, 10, 17 (and 21, the class attribute (naturally)). Try
out some combinations. (You had removed two attributes in problem 7.
Remember to reload the arff data file to get all the attributes initially before
you start selecting the ones you want.)
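As a rough illustration of what "keeping only attributes 2, 3, 5, 7, 10, 17 and the class attribute 21" means, here is a minimal Python sketch. It is not Weka code: the function name keep_attributes and the toy row are assumptions made up for this example. The attribute numbers are 1-based, as in the attribute list of the Preprocess panel, so the sketch converts them to 0-based positions.

```python
def keep_attributes(rows, keep):
    """Return the rows restricted to the 1-based attribute numbers in `keep`."""
    positions = [k - 1 for k in sorted(keep)]  # convert to 0-based positions
    return [[row[p] for p in positions] for row in rows]


# Hypothetical instance with 21 attribute values named a1 .. a21.
row = [f"a{i}" for i in range(1, 22)]
subset = keep_attributes([row], [2, 3, 5, 7, 10, 17, 21])
print(subset[0])
```

In the Explorer itself the same effect is obtained by ticking the unwanted attributes and removing them, which is why the data file has to be reloaded before trying a different combination.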
9
JNTU
Sometimes, the cost of rejecting an applicant who actually has good credit (case 1) might be
higher than the cost of accepting an applicant who has bad credit (case 2). Instead of counting
the misclassifications equally in both cases, give a higher cost to the first case (say cost 5)
and a lower cost to the second case. You can do this by using a cost matrix in Weka. Train your
Decision Tree again and report the Decision Tree and cross-validation results. Are they
significantly different from the results obtained in problem 6 (using equal cost)?
Dataset: German Credit Data
P1 to P10: repeat the same problem on the following datasets.
P1: Contact Lenses
P2: CPU
P3: Diabetes
P4: Glass
P5: Ionosphere
P6: Iris
P7: Labor
P8: ReutersGrain-test
P9: Weather
P10: Vote
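To make the idea of a cost matrix concrete, the following Python sketch computes the total misclassification cost from a confusion matrix. It is not Weka code and is only a sketch under assumptions: the function name total_cost, the class order [good, bad], the convention that rows are actual classes and columns are predicted classes, and the confusion-matrix counts are all made up for illustration.

```python
def total_cost(confusion, cost):
    """Sum cost[i][j] * confusion[i][j] over actual class i and predicted class j."""
    return sum(
        cost[i][j] * confusion[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )


# Rows = actual class, columns = predicted class, class order [good, bad].
# The counts are hypothetical, not results from the German Credit Data.
confusion = [[80, 20],   # 20 good applicants rejected (case 1)
             [30, 70]]   # 30 bad applicants accepted (case 2)

equal_cost = [[0, 1],
              [1, 0]]
weighted_cost = [[0, 5],   # rejecting a good applicant costs 5
                 [1, 0]]   # accepting a bad applicant costs 1

print(total_cost(confusion, equal_cost))     # 20 + 30 = 50
print(total_cost(confusion, weighted_cost))  # 5 * 20 + 30 = 130
```

With equal costs the total is simply the number of errors. With the weighted matrix the same classifier looks much worse, which is why a cost-sensitive learner tends to shift its decisions towards accepting more applicants.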
10
JNTU
Do you think it is a good idea to prefer simple decision trees instead of having long, complex
decision trees? How does the complexity of a Decision Tree relate to the bias of the model?
Dataset: German Credit Data
P1 to P10: answer the same question for the following datasets.
P1: Contact Lenses
P2: CPU
P3: Diabetes
P4: Glass
P5: Ionosphere
P6: Iris
P7: Labor
P8: ReutersGrain-test
P9: Weather
P10: Vote
11
JNTU
You can make your Decision Trees simpler by pruning the nodes. One approach is to use Reduced
Error Pruning. Explain this idea briefly. Try reduced error pruning for training your Decision
Trees using cross-validation (you can do this in Weka) and report the Decision Tree you obtain.
Also, report your accuracy using the pruned model. Does your accuracy increase?
Dataset: German Credit Data
P1 to P10: repeat the same problem on the following datasets.
P1: Contact Lenses
P2: CPU
P3: Diabetes
P4: Glass
P5: Ionosphere
P6: Iris
P7: Labor
P8: ReutersGrain-test
P9: Weather
P10: Vote
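The idea behind reduced error pruning can be sketched in a few lines of Python. This is an illustrative sketch, not Weka's implementation: the dictionary-based tree layout, the function names and the toy data are assumptions for this example. Working bottom-up, each internal node is replaced by a leaf predicting its majority class whenever that leaf makes no more errors on held-out data than the subtree does.

```python
def classify(node, instance):
    """Follow the tree until a leaf is reached; fall back to the majority class."""
    while "label" not in node:
        child = node["children"].get(instance[node["attr"]])
        if child is None:
            return node["majority"]
        node = child
    return node["label"]


def errors(node, data):
    """Count held-out instances (x, y) that the (sub)tree misclassifies."""
    return sum(1 for x, y in data if classify(node, x) != y)


def prune(node, data):
    """Reduced error pruning, bottom-up, using the held-out set `data`."""
    if "label" in node:
        return node
    for value in list(node["children"]):
        subset = [(x, y) for x, y in data if x[node["attr"]] == value]
        node["children"][value] = prune(node["children"][value], subset)
    leaf = {"label": node["majority"]}
    # Keep the simpler leaf if it is at least as accurate on the held-out data.
    return leaf if errors(leaf, data) <= errors(node, data) else node


# Hypothetical tree: "majority" is the majority training class at each node.
tree = {
    "attr": "outlook", "majority": "yes",
    "children": {
        "sunny": {"attr": "windy", "majority": "yes",
                  "children": {"true": {"label": "no"}, "false": {"label": "yes"}}},
        "rainy": {"label": "no"},
    },
}
validation = [
    ({"outlook": "sunny", "windy": "true"}, "yes"),
    ({"outlook": "sunny", "windy": "false"}, "yes"),
    ({"outlook": "rainy", "windy": "false"}, "no"),
    ({"outlook": "rainy", "windy": "true"}, "no"),
]
pruned = prune(tree, validation)
print(pruned)
```

In this toy case the "windy" test under "sunny" is collapsed into a single "yes" leaf, while the root split is kept because removing it would add errors. In Weka, J48 exposes this technique through its reducedErrorPruning option.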
12
JNTU
(Extra Credit): How can you convert a Decision Tree into "if-then-else" rules? Make up your own
small Decision Tree consisting of 2-3 levels and convert it into a set of rules. There also exist
different classifiers that output the model in the form of rules. One such classifier in Weka is
rules.PART; train this model and report the set of rules obtained. Sometimes just one attribute
can be good enough in making the decision, yes, just one! Can you predict what attribute that
might be in this dataset? The OneR classifier uses a single attribute to make decisions (it
chooses the attribute based on minimum error). Report the rule obtained by training a OneR
classifier. Rank the performance of J48, PART and OneR.
Dataset: German Credit Data
P1 to P10: repeat the same problem on the following datasets.
P1: Contact Lenses
P2: CPU
P3: Diabetes
P4: Glass
P5: Ionosphere
P6: Iris
P7: Labor
P8: ReutersGrain-test
P9: Weather
P10: Vote
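The "minimum error" selection used by OneR can be sketched as follows. This is a simplified Python illustration for nominal attributes only, not Weka's OneR implementation (which also handles numeric attributes and missing values). The function name one_r and the toy table are assumptions for this example.

```python
from collections import Counter, defaultdict


def one_r(rows, class_index):
    """Return (attribute index, rule, training errors) for the best single attribute."""
    best = None
    for a in range(len(rows[0])):
        if a == class_index:
            continue
        # For each value of attribute a, count how often each class occurs.
        counts = defaultdict(Counter)
        for row in rows:
            counts[row[a]][row[class_index]] += 1
        # The rule predicts the most frequent class for each attribute value.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        errs = sum(1 for row in rows if rule[row[a]] != row[class_index])
        if best is None or errs < best[2]:
            best = (a, rule, errs)
    return best


# Hypothetical toy table: (outlook, windy, play); play is the class.
rows = [
    ("sunny", "false", "no"),
    ("sunny", "true", "no"),
    ("overcast", "false", "yes"),
    ("overcast", "true", "yes"),
    ("rainy", "false", "yes"),
    ("rainy", "true", "no"),
]
attribute, rule, training_errors = one_r(rows, class_index=2)
print(attribute, rule, training_errors)
```

Here the rule built on "outlook" makes one training error and the rule built on "windy" makes two, so "outlook" is chosen.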
Additional Experiments
13
Generate association rules for the following transactional database using the Apriori algorithm.
Dataset: German Credit Data
14
Generate classification rules for the following database using a decision tree (J48).
Dataset: German Credit Data
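The level-wise search that Apriori performs can be sketched in Python as below. This is a minimal teaching sketch, not Weka's Apriori implementation. The function name apriori, the absolute min_support threshold and the four shopping-basket transactions are assumptions for this example.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset: support count} for all itemsets with count >= min_support."""
    transactions = [frozenset(t) for t in transactions]
    current = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    k = 1
    while current:
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(level)
        k += 1
        # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        current = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k - 1))
        }
    return frequent


# Hypothetical transactional database.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
frequent = apriori(transactions, min_support=2)

# Confidence of the rule bread -> milk = support(bread, milk) / support(bread).
confidence = frequent[frozenset({"bread", "milk"})] / frequent[frozenset({"bread"})]
print(sorted(sorted(s) for s in frequent), confidence)
```

Association rules are then read off the frequent itemsets: a rule such as bread -> milk is reported only if its confidence reaches a chosen minimum.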
Fundamentals of Data Mining
Definition of Data Mining:
Data mining refers to extracting or mining knowledge from large amounts of data.
Data mining is also referred to as knowledge mining from data, knowledge extraction, data
archaeology and data dredging.
Applications of Data Mining:
• Business Intelligence applications
• Insurance
• Banking
• Medicine
• Retail/Marketing etc.
Functionalities of Data Mining:
These functionalities are used to specify the kind of patterns to be found in data mining tasks.
Data mining tasks can be classified into 2 categories:
• Descriptive
• Predictive
The following are the functionalities of data mining:
Concept/Class description: Characterization and Discrimination:
Generalize, summarize and contrast data characteristics.
Mining frequent patterns, Associations and Correlations:
Frequent patterns are patterns (such as item sets, subsequences, or substructures) that appear in a data set
frequently.
Classification and Prediction:
Classification constructs models that describe and distinguish classes or concepts for future prediction.
Prediction estimates unknown or missing numerical values.
Cluster analysis:
The class label is unknown. Data are grouped to form new classes by maximizing intra-class
similarity and minimizing inter-class similarity.
Outlier analysis:
Outlier: a data object that does not comply with the general behavior of the data.
Outliers are often treated as noise or exceptions, but they are quite useful in fraud detection and
rare-event analysis.
Introduction to WEKA
• WEKA (Waikato Environment for Knowledge Analysis) is a popular suite of machine learning
software written in Java, developed at the University of Waikato, New Zealand.
• WEKA is open source software that is freely available under the GNU General Public License.
Originally written in C, the WEKA application has been completely rewritten in Java and
is compatible with almost every computing platform. It is user friendly, with a graphical interface that
allows for quick setup and operation. WEKA operates on the assumption that the user data is
available as a flat file or relation. This means that each data object is described by a fixed number of
attributes that usually are of a specific type, normally alphanumeric or numeric values. The WEKA
application gives novice users a tool to identify hidden information in databases and file systems
with simple-to-use options and visual interfaces.
• The WEKA workbench contains a collection of visualization tools and algorithms for data analysis
and predictive modeling, together with graphical user interfaces for easy access to this functionality.
• The original version was primarily designed as a tool for analyzing data from agricultural domains,
but the more recent fully Java-based version (WEKA 3), for which development started in 1997, is
now used in many different application areas, in particular for educational purposes and research.
ADVANTAGES OF WEKA
The obvious advantage of a package like WEKA is that a whole range of data preparation, feature selection
and data mining algorithms are integrated. This means that only one data format is needed, and trying out
and comparing different approaches becomes really easy. The package also comes with a GUI, which should
make it easier to use.
Portability, since it is fully implemented in the Java programming language and thus runs on almost any
modern computing platform.
A comprehensive collection of data preprocessing and modeling techniques.
Ease of use due to its graphical user interfaces.
WEKA supports several standard data mining tasks, more specifically, data preprocessing, clustering,
classification, regression, visualization, and feature selection.
All of WEKA's techniques are predicated on the assumption that the data is available as a single flat file or
relation, where each data point is described by a fixed number of attributes (normally, numeric or nominal
attributes, but some other attribute types are also supported).
WEKA provides access to SQL databases using Java Database Connectivity and can process the result
returned by a database query.
It is not capable of multi-relational data mining, but there is separate software for converting a collection of
linked database tables into a single table that is suitable for processing using WEKA. Another important area
that the algorithms included in WEKA do not cover is sequence modeling.
Attribute-Relation File Format (ARFF) is the text file format used by WEKA to store a dataset.
An ARFF file contains two sections: the header and the data section. The first line of the header gives the
relation name (@relation).
Then there is the list of the attributes (@attribute...). Each attribute is associated with a unique name and a
type.
The type describes the kind of data contained in the attribute and what values it can have. The attribute
types are: numeric, nominal, string and date.
The class attribute is by default the last one in the list. The header section can also contain comment
lines, identified by a '%' at the beginning, which can describe the dataset content or give the reader
information about the author. After that comes the data itself (@data): each line stores the attribute values of a
single instance, separated by commas.
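The layout described above can be illustrated with a tiny made-up ARFF file and a few lines of Python that split it into its parts. This is only a sketch: the function name parse_arff and the sample data are assumptions for this example, and the parser ignores features of the real format such as quoted names containing spaces, sparse instances and missing-value handling.

```python
SAMPLE = """\
% Toy weather data (made up for this example)
@relation weather
@attribute outlook {sunny, overcast, rainy}
@attribute temperature numeric
@attribute play {yes, no}
@data
sunny,85,no
overcast,83,yes
"""


def parse_arff(text):
    """Split a simple ARFF text into relation name, attribute list and data rows."""
    relation, attributes, data = None, [], []
    in_data = False
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("%"):   # skip blank and comment lines
            continue
        lower = line.lower()
        if in_data:
            data.append([value.strip() for value in line.split(",")])
        elif lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, spec = line.split(None, 2)  # keyword, name, type or value set
            attributes.append((name, spec))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, data


relation, attributes, data = parse_arff(SAMPLE)
print(relation, attributes, data)
```

The last attribute declared (play) plays the role of the class attribute, following the default described above.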
WEKA's main user interface is the Explorer, but essentially the same functionality can be accessed through
the component-based Knowledge Flow interface and from the command line. There is also the Experimenter,
which allows the systematic comparison of the predictive performance of WEKA's machine learning
algorithms on a collection of datasets.
Launching WEKA
The WEKA GUI Chooser window is used to launch WEKA's graphical environments. At the bottom of the
window are four buttons:
1. Simple CLI. Provides a simple command-line interface that allows direct execution of WEKA commands
for operating systems that do not provide their own command-line interface.
2. Explorer. An environment for exploring data with WEKA.
3. Experimenter. An environment for performing experiments and conducting statistical tests between
learning schemes.
4. Knowledge Flow. This environment supports essentially the same functions as the Explorer but with a
drag-and-drop interface. One advantage is that it supports incremental learning.
If you launch WEKA from a terminal window, some text begins scrolling in the terminal. Ignore this text
unless something goes wrong, in which case it can help in tracking down the cause. This User Manual
focuses on using the Explorer but does not explain the individual data preprocessing tools and learning
algorithms in WEKA. For more information on the various filters and learning methods in WEKA, see the
book Data Mining (Witten and Frank, 2005).
The WEKA Explorer
Section Tabs
At the very top of the window, just below the title bar, is a row of tabs. When the Explorer is first started
only the first tab is active; the others are grayed out. This is because it is necessary to open (and potentially
pre-process) a dataset before starting to explore the data.
The tabs are as follows:
1. Preprocess. Choose and modify the data being acted on.
2. Classify. Train and test learning schemes that classify or perform regression.
3. Cluster. Learn clusters for the data.
4. Associate. Learn association rules for the data.
5. Select attributes. Select the most relevant attributes in the data.
6. Visualize. View an interactive 2D plot of the data.
Once the tabs are active, clicking on them flicks between different screens, on which the respective actions
can be performed. The bottom area of the window (including the status box, the log button, and the WEKA
bird) stays visible regardless of which section you are in.
Status Box
The status box appears at the very bottom of the window. It displays messages that keep you informed about
what's going on. For example, if the Explorer is busy loading a file, the status box will say that.
TIP—right-clicking the mouse anywhere inside the status box brings up a little menu. The menu gives two
options:
1. Available memory. Display in the log box the amount of memory available to WEKA.
2. Run garbage collector. Force the Java garbage collector to search for memory that is no longer
needed and free it up, allowing more memory for new tasks. Note that the garbage collector is
constantly running as a background task anyway.
Log Button
Clicking on this button brings up a separate window containing a scrollable text field. Each line of text is
stamped with the time it was entered into the log. As you perform actions in WEKA, the log keeps a record
of what has happened.
WEKA Status Icon
To the right of the status box is the WEKA status icon. When no processes are running, the bird sits down
and takes a nap. The number beside the × symbol gives the number of concurrent processes running. When
the system is idle it is zero, but it increases as the number of processes increases. When any process is
started, the bird gets up and starts moving around. If it's standing but stops moving for a long time, it's sick:
something has gone wrong! In that case you should restart the WEKA Explorer.
1. Preprocessing
Opening files
The first three buttons at the top of the preprocess section enable you to load data into WEKA:
1. Open file.... Brings up a dialog box allowing you to browse for the data file on the local file
system.
2. Open URL.... Asks for a Uniform Resource Locator address for where the data is stored.
3. Open DB.... Reads data from a database. (Note that to make this work you might have to edit the
file in WEKA/experiment/DatabaseUtils.props.)
Using the Open file... button you can read files in a variety of formats: WEKA's ARFF format, CSV format,
C4.5 format, or serialized Instances format. ARFF files typically have a .arff extension, CSV files a .csv
extension, C4.5 files a .data and .names extension, and serialized Instances objects a .bsi extension.
The Current Relation
Once some data has been loaded, the Preprocess panel shows a variety of information. The Current relation
box (the "current relation" is the currently loaded data, which can be interpreted as a single relational table in
database terminology) has three entries:
1. Relation. The name of the relation, as given in the file it was loaded from. Filters (described
below) modify the name of a relation.
2. Instances. The number of instances (data points/records) in the data.
3. Attributes. The number of attributes (features) in the data.
Working with Attributes
Below the Current relation box is a box titled Attributes. There are three buttons, and beneath them is
a list of the attributes in the current relation. The list has three columns:
1. No. A number that identifies the attribute, in the order in which the attributes are specified in the data file.
2. Selection tick boxes. These allow you to select which attributes are present in the relation.
3. Name. The name of the attribute, as it was declared in the data file.
When you click on different rows in the list of attributes, the fields change in the box to the right titled
Selected attribute. This box displays the characteristics of the currently highlighted attribute in the list:
1. Name. The name of the attribute, the same as that given in the attribute list.
2. Type. The type of attribute, most commonly Nominal or Numeric.
3. Missing. The number (and percentage) of instances in the data for which this attribute is missing
(unspecified).
4. Distinct. The number of different values that the data contains for this Attribute.
5. Unique. The number (and percentage) of instances in the data having a value for this attribute that no
other instances have.
Below these statistics is a list showing more information about the values stored in this attribute, which differ
depending on its type. If the attribute is nominal, the list consists of each possible value for the attribute
along with the number of instances that have that value. If the attribute is numeric, the list gives four
statistics describing the distribution of values in the data—the minimum, maximum, mean and standard
deviation. And below these statistics there is a colored histogram, color-coded according to the attribute
chosen as the Class using the box above the histogram. (This box will bring up a drop-down list of available
selections when clicked.) Note that only nominal Class attributes will result in a color-coding. Finally, after
pressing the Visualize All button, histograms for all the attributes in the data are shown in a separate window.
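The statistics shown in the Selected attribute box can be reproduced by hand for a small column. The following is a minimal Python sketch, not WEKA's own code; the function name attribute_summary, the use of None for a missing value, and the choice of the sample standard deviation are assumptions made for illustration.

```python
import math
from collections import Counter

def attribute_summary(values):
    """Summarise one attribute column; None marks a missing value."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    summary = {
        "missing": len(values) - len(present),
        "distinct": len(counts),
        # values that occur in exactly one instance
        "unique": sum(1 for c in counts.values() if c == 1),
    }
    # numeric attributes additionally get min, max, mean and standard deviation
    if present and all(isinstance(v, (int, float)) for v in present):
        mean = sum(present) / len(present)
        # sample standard deviation (n - 1 in the denominator): an assumption
        var = sum((v - mean) ** 2 for v in present) / (len(present) - 1) if len(present) > 1 else 0.0
        summary.update(min=min(present), max=max(present), mean=mean, std=math.sqrt(var))
    return summary

ages = [22, 23, None, 22, 21]   # hypothetical column with one missing value
age_summary = attribute_summary(ages)
print(age_summary)
```

Here the value 22 occurs twice, so it counts towards Distinct but not towards Unique.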
Returning to the attribute list, to begin with all the tick boxes are unticked. They can be toggled on/off by
clicking on them individually. The three buttons above can also be used to change the selection:
1. All. All boxes are ticked.
2. None. All boxes are cleared (unticked).
3. Invert. Boxes that are ticked become unticked and vice versa.
Once the desired attributes have been selected, they can be removed by clicking the Remove button
below the list of attributes. Note that this can be undone by clicking the Undo button, which is located next
to the Save button in the top-right corner of the Preprocess panel.
Working with Filters
The preprocess section allows filters to be defined that transform the data in various ways. The Filter
box is used to set up the filters that are required. At the left of the Filter box is a Choose button. By clicking
this button it is possible to select one of the filters in WEKA. Once a filter has been selected, its name and
options are shown in the field next to the Choose button. Clicking on this box brings up a
GenericObjectEditor dialog box.
The GenericObjectEditor Dialog Box
The GenericObjectEditor dialog box lets you configure a filter. The same kind of dialog box is used
to configure other objects, such as classifiers and clusterers (see below). The fields in the window reflect the
available options. Clicking on any of these gives an opportunity to alter the filter's settings. For example, the
setting may take a text string, in which case you type the string into the text field provided. Or it may give a
drop-down box listing several states to choose from. Or it may do something else, depending on the
information required. Information on the options is provided in a tool tip if you let the mouse pointer hover
over the corresponding field.
More information on the filter and its options can be obtained by clicking on the More button in the
About panel at the top of the GenericObjectEditor window. Some objects display a brief description of what
they do in an About box, along with a More button. Clicking on the More button brings up a window
describing what the different options do.
At the bottom of the GenericObjectEditor dialog are four buttons. The first two, Open... and Save...
allow object configurations to be stored for future use. The Cancel button backs out without remembering
any changes that have been made. Once you are happy with the object and settings you have chosen, click
OK to return to the main Explorer window.
Applying Filters
• Once you have selected and configured a filter, you can apply it to the data by pressing the Apply
button at the right end of the Filter panel in the Preprocess panel.
• The Preprocess panel will then show the transformed data.
• The change can be undone by pressing the Undo button.
• You can also use the Edit... button to modify your data manually in a dataset editor.
• Finally, the Save... button at the top right of the Preprocess panel saves the current version of the
relation in the same formats available for loading data, allowing it to be kept for future use.
Note: Some of the filters behave differently depending on whether a class attribute has been set or not (using
the box above the histogram, which will bring up a drop-down list of possible selections when clicked). In
particular, the supervised filters require a class attribute to be set, and some of the "unsupervised attribute
filters" will skip the class attribute if one is set. Note that it is also possible to set Class to None, in which
case no class is set.
2. Classification
Selecting a Classifier
At the top of the classify section is the Classifier box. This box has a text field that gives the name of
the currently selected classifier, and its options. Clicking on the text box brings up a GenericObjectEditor
dialog box, just the same as for filters, that you can use to configure the options of the current classifier. The
Choose button allows you to choose one of the classifiers that are available in WEKA.
Test Options
The result of applying the chosen classifier will be tested according to the options that are set by
clicking in the Test options box. There are four test modes:
1. Use training set. The classifier is evaluated on how well it predicts the class of the instances it
was trained on.
2. Supplied test set. The classifier is evaluated on how well it predicts the class of a set of instances
loaded from a file. Clicking the Set... button brings up a dialog allowing you to choose the file to
test on.
3. Cross-validation. The classifier is evaluated by cross-validation, using the number of folds that
are entered in the Folds text field.
4. Percentage split. The classifier is evaluated on how well it predicts a certain percentage of the
data which is held out for testing. The amount of data held out depends on the value entered in the
% field.
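The difference between Percentage split and Cross-validation can be illustrated on instance indices alone. This is a minimal Python sketch under stated assumptions: the function names are hypothetical, the percentage is taken to be the share of data used for training, and the folds are not stratified, so it should not be read as WEKA's exact procedure.

```python
import random

def percentage_split(n_instances, train_percent, seed=1):
    """Shuffle instance indices, train on the first part, hold out the rest for testing."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    n_train = round(n_instances * train_percent / 100)
    return indices[:n_train], indices[n_train:]

def cross_validation_folds(n_instances, n_folds, seed=1):
    """Yield (train, test) index lists; every instance is tested exactly once."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for k, test in enumerate(folds):
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

train_idx, test_idx = percentage_split(1000, 66)
cv = list(cross_validation_folds(10, 5))
print(len(train_idx), len(test_idx), len(cv))
```

With a percentage split each instance is used either for training or for testing; with cross-validation each instance is used for testing in exactly one fold and for training in all the others.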
Note: No matter which evaluation method is used, the model that is output is always the one built from all
the training data. Further testing options can be set by clicking on the More options... button:
1. Output model. The classification model on the full training set is output so that it can be viewed,
visualized, etc. This option is selected by default.
2. Output per-class stats. The precision/recall and true/false statistics for each class are output. This option
is also selected by default.
3. Output entropy evaluation measures. Entropy evaluation measures are included in the output. This
option is not selected by default.
4. Output confusion matrix. The confusion matrix of the classifier's predictions is included in the output.
This option is selected by default.
5. Store predictions for visualization. The classifier's predictions are remembered so that they can be
visualized. This option is selected by default.
6. Output predictions. The predictions on the evaluation data are output. Note that in the case of a
cross-validation the instance numbers do not correspond to the location in the data!
7. Cost-sensitive evaluation. The errors are evaluated with respect to a cost matrix. The Set... button allows
you to specify the cost matrix used.
8. Random seed for xval / % Split. This specifies the random seed used
when randomizing the data before it is divided up for evaluation purposes.
The Class Attribute
The classifiers in WEKA are designed to be trained to predict a single 'class attribute', which is the
target for prediction. Some classifiers can only learn nominal classes; others can only learn numeric classes
(regression problems); still others can learn both. By default, the class is taken to be the last attribute in the
data. If you want to train a classifier to predict a different attribute, click on the box below the Test options
box to bring up a drop-down list of attributes to choose from.
Training a Classifier
Once the classifier, test options and class have all been set, the learning process is started by clicking
on the Start button. While the classifier is busy being trained, the little bird moves around. You can stop the
training process at any time by clicking on the Stop button.
When training is complete, several things happen. The Classifier output area to the right of the
display is filled with text describing the results of training and testing. A new entry appears in the Result list
box. We look at the result list below; but first we investigate the text that has been output.
The Classifier Output Text
The text in the Classifier output area has scroll bars allowing you to browse the results. Of course,
you can also resize the Explorer window to get a larger display area. The output is split into several sections:
1. Run information. A list of information giving the learning scheme options, relation name, instances,
attributes and test mode that were involved in the process.
2. Classifier model (full training set). A textual representation of the classification model that was produced
on the full training data.
3. The results of the chosen test mode are broken down thus:
4. Summary. A list of statistics summarizing how accurately the classifier was able to predict the true class
of the instances under the chosen test mode.
5. Detailed Accuracy By Class. A more detailed per-class breakdown of the classifier's prediction accuracy.
6. Confusion Matrix. Shows how many instances have been assigned to each class. Elements show the
number of test examples whose actual class is the row and whose predicted class is the column.
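The row/column convention of the confusion matrix can be checked with a few lines of Python. This is an illustrative sketch only; the function name confusion_matrix and the six PASS/FAIL predictions are made up for the example.

```python
def confusion_matrix(actual, predicted, labels):
    """Rows are actual classes, columns are predicted classes."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix

# hypothetical test instances
actual    = ["PASS", "FAIL", "PASS", "FAIL", "PASS", "PASS"]
predicted = ["PASS", "FAIL", "FAIL", "FAIL", "PASS", "PASS"]
matrix = confusion_matrix(actual, predicted, ["PASS", "FAIL"])

# correctly classified instances lie on the diagonal
accuracy = sum(matrix[i][i] for i in range(len(matrix))) / len(actual)
for row in matrix:
    print(row)
```

The single off-diagonal entry in the first row is the one PASS instance that was predicted as FAIL.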
The Result List
After training several classifiers, the result list will contain several entries. Left-clicking the entries
flicks back and forth between the various results that have been generated. Right-clicking an entry invokes a
menu containing these items:
1. View in main window. Shows the output in the main window (just like left-clicking the entry).
2. View in separate window. Opens a new independent window for viewing the results.
3. Save result buffer. Brings up a dialog allowing you to save a text file containing the textual output.
4. Load model. Loads a pre-trained model object from a binary file.
5. Save model. Saves a model object to a binary file. Objects are saved in Java serialized object form.
6. Re-evaluate model on current test set. Takes the model that has been built and tests its performance on
the data set that has been specified with the Set... button under the Supplied test set option.
7. Visualize classifier errors. Brings up a visualization window that plots the results of classification.
Correctly classified instances are represented by crosses, whereas incorrectly classified ones show up as
squares.
8. Visualize tree or Visualize graph. Brings up a graphical representation of the structure of the classifier
model, if possible (i.e. for decision trees or Bayesian networks). The graph visualization option only appears
if a Bayesian network classifier has been built. In the tree visualizer, you can bring up a menu by
right-clicking a blank area, pan around by dragging the mouse, and see the training instances at each node by
clicking on it. CTRL-clicking zooms the view out, while SHIFT-dragging a box zooms the view in. The
graph visualizer should be self-explanatory.
9. Visualize margin curve. Generates a plot illustrating the prediction margin. The margin is defined as the
difference between the probability predicted for the actual class and the highest probability predicted for the
other classes. For example, boosting algorithms may achieve better performance on test data by increasing
the margins on the training data.
10. Visualize threshold curve. Generates a plot illustrating the trade-offs in prediction that are obtained by
varying the threshold value between classes. For example, with the default threshold value of 0.5, the
predicted probability of positive must be greater than 0.5 for the instance to be predicted as positive. The plot
can be used to visualize the precision/recall tradeoff, for ROC curve analysis (true positive rate vs false
positive rate), and for other types of curves.
11. Visualize cost curve. Generates a plot that gives an explicit representation of the expected cost, as
described by Drummond and Holte (2000).
Options are greyed out if they do not apply to the specific set of results.
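The margin and threshold definitions given for items 9 and 10 can be written out directly. The sketch below is a hedged Python illustration with hypothetical function names and made-up probabilities; it computes one margin and three points of a ROC curve, and is not how WEKA itself draws the plots.

```python
def margin(probabilities, actual_class):
    """Probability of the actual class minus the highest probability among the other classes."""
    others = [p for label, p in probabilities.items() if label != actual_class]
    return probabilities[actual_class] - max(others)

def roc_point(positive_probs, actual_positive, threshold):
    """Return (false positive rate, true positive rate) at one threshold."""
    predicted = [p > threshold for p in positive_probs]
    tp = sum(1 for pred, act in zip(predicted, actual_positive) if pred and act)
    fp = sum(1 for pred, act in zip(predicted, actual_positive) if pred and not act)
    positives = sum(actual_positive)
    negatives = len(actual_positive) - positives
    return fp / negatives, tp / positives

m = margin({"good": 0.7, "bad": 0.3}, "good")

probs = [0.9, 0.8, 0.6, 0.4, 0.2]          # hypothetical predicted probability of "positive"
truth = [True, True, False, True, False]   # whether each instance really is positive
curve = [roc_point(probs, truth, t) for t in (0.1, 0.5, 0.7)]
print(m, curve)
```

Raising the threshold lowers the false positive rate but can also lower the true positive rate, which is the trade-off the threshold curve makes visible.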
3. Clustering
Selecting a Clusterer
By now you will be familiar with the process of selecting and configuring objects. Clicking on the
clustering scheme listed in the Clusterer box at the top of the window brings up a GenericObjectEditor
dialog with which to choose a new clustering scheme.
Cluster Modes
The Cluster mode box is used to choose what to cluster and how to evaluate the results. The first
three options are the same as for classification: Use training set, Supplied test set and Percentage split
(Section 2), except that now the data is assigned to clusters instead of trying to predict a specific class. The
fourth mode, Classes to clusters evaluation, compares how well the chosen clusters match up with a
pre-assigned class in the data. The drop-down box below this option selects the class, just as in the Classify panel.
An additional option in the Cluster mode box, the Store clusters for visualization tick box, determines
whether or not it will be possible to visualize the clusters once training is complete. When dealing with
datasets that are so large that memory becomes a problem, it may be helpful to disable this option.
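The idea behind Classes to clusters evaluation can be sketched as follows. This Python example is a simplification that labels each cluster with its majority class and counts the instances that disagree; the function name and data are hypothetical, and WEKA's own procedure for matching clusters to classes may differ.

```python
from collections import Counter, defaultdict

def classes_to_clusters(cluster_ids, class_labels):
    """Label each cluster with its majority class and count the mismatches."""
    members = defaultdict(Counter)
    for cluster, label in zip(cluster_ids, class_labels):
        members[cluster][label] += 1
    mapping = {c: counts.most_common(1)[0][0] for c, counts in members.items()}
    errors = sum(1 for c, label in zip(cluster_ids, class_labels) if mapping[c] != label)
    return mapping, errors

clusters = [0, 0, 0, 1, 1, 1, 1]                                    # cluster assigned to each instance
classes = ["PASS", "PASS", "FAIL", "FAIL", "FAIL", "PASS", "FAIL"]  # pre-assigned class
mapping, errors = classes_to_clusters(clusters, classes)
print(mapping, errors)
```

The class attribute is used only to evaluate the clusters afterwards; it plays no part in forming them.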
Ignoring Attributes
Often, some attributes in the data should be ignored when clustering. The Ignore attributes button brings up
a small window that allows you to select which attributes are ignored. Clicking on an attribute in the window
highlights it, holding down the SHIFT key selects a range of consecutive attributes, and holding down
CTRL toggles individual attributes on and off. To cancel the selection, back out with the Cancel button. To
activate it, click the Select button. The next time clustering is invoked, the selected attributes are ignored.
Learning Clusters
The Cluster section, like the Classify section, has Start/Stop buttons, a result text area and a result list.
These all behave just like their classification counterparts. Right-clicking an entry in the result list brings up
a similar menu, except that it shows only two visualization options: Visualize cluster assignments and
Visualize tree. The latter is grayed out when it is not applicable.
4. Association
Setting Up
This panel contains schemes for learning association rules, and the learners are chosen and
configured in the same way as the clusterers, filters, and classifiers in the other panels.
Learning Associations
Once appropriate parameters for the association rule learner have been set, click the Start button. When
complete, right-clicking on an entry in the result list allows the results to be viewed or saved.
5. Selecting Attributes
Searching and Evaluating
Attribute selection involves searching through all possible combinations of attributes in the data to
find which subset of attributes works best for prediction. To do this, two objects must be set up: an attribute
evaluator and a search method. The evaluator determines what method is used to assign a worth to each
subset of attributes. The search method determines what style of search is performed.
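The division of labour between evaluator and search method can be shown with a toy example. In the Python sketch below the search method is an exhaustive search over all subsets, and the evaluator is a made-up scoring function; both, along with the MERIT values, are assumptions for illustration and do not correspond to any particular WEKA evaluator or search class.

```python
from itertools import combinations

def exhaustive_search(attributes, evaluate):
    """Search method: try every non-empty subset, keep the one the evaluator scores highest."""
    best_subset, best_score = None, float("-inf")
    for size in range(1, len(attributes) + 1):
        for subset in combinations(attributes, size):
            score = evaluate(subset)
            if score > best_score:
                best_subset, best_score = subset, score
    return best_subset, best_score

# Hypothetical evaluator: an invented merit per attribute minus a small penalty per attribute used
MERIT = {"AGE": 0.05, "MARKS": 0.90, "CITY": 0.02, "BRANCH": 0.20}

def toy_evaluator(subset):
    return sum(MERIT[a] for a in subset) - 0.1 * len(subset)

best, score = exhaustive_search(list(MERIT), toy_evaluator)
print(best, round(score, 2))
```

An exhaustive search visits 2^n - 1 subsets for n attributes, which is why practical search methods examine only part of the space.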
Options
The Attribute Selection Mode box has two options:
1. Use full training set. The worth of the attribute subset is determined using the full set of training
data.
2. Cross-validation. The worth of the attribute subset is determined by a process of
cross-validation. The Fold and Seed fields set the number of folds to use and the random seed used when
shuffling the data.
As with Classify (Section 2), there is a drop-down box that can be used to specify
which attribute to treat as the class.
Performing Selection
Clicking Start starts running the attribute selection process. When it is finished, the results are output
into the result area, and an entry is added to the result list. Right-clicking on the result list gives several
options. The first three, (View in main window, View in separate window and Save result buffer), are the
same as for the classify panel. It is also possible to Visualize reduced data, or if you have used an attribute
transformer such as Principal Components, Visualize transformed data.
6. Visualizing
WEKA's visualization section allows you to visualize 2D plots of the current relation.
The scatter plot matrix
When you select the Visualize panel, it shows a scatter plot matrix for all the attributes, color coded
according to the currently selected class. It is possible to change the size of each individual 2D plot and the
point size, and to randomly jitter the data (to uncover obscured points). It is also possible to change the
attribute used to color the plots, to select only a subset of attributes for inclusion in the scatter plot matrix,
and to subsample the data. Note that changes will only come into effect once the Update button has been
pressed.
Selecting an individual 2D scatter plot
When you click on a cell in the scatter plot matrix, this will bring up a separate window with a
visualization of the scatter plot you selected. (We described above how to visualize particular results in a
separate window, for example classifier errors; the same visualization controls are used here.) Data points
are plotted in the main area of the window. At the top are two drop-down list buttons for selecting the axes to
plot. The one on the left shows which attribute is used for the x-axis; the one on the right shows which is
used for the y-axis.
Beneath the x-axis selector is a drop-down list for choosing the colour scheme. This allows you to
colour the points based on the attribute selected. Below the plot area, a legend describes what values the
colours correspond to. If the values are discrete, you can modify the colour used for each one by clicking on
them and making an appropriate selection in the window that pops up. To the right of the plot area is a series
of horizontal strips. Each strip represents an attribute, and the dots within it show the distribution of values of
the attribute. These values are randomly scattered vertically to help you see concentrations of points.
You can choose what axes are used in the main graph by clicking on these strips. Left-clicking an
attribute strip changes the x-axis to that attribute, whereas right-clicking changes the y-axis. The 'X' and 'Y'
written beside the strips show what the current axes are ('B' is used for both X and Y). Above the attribute
strips is a slider labeled Jitter, which is a random displacement given to all points in the plot. Dragging it to
the right increases the amount of jitter, which is useful for spotting concentrations of points. Without jitter, a
million instances at the same point would look no different to just a single lonely instance.
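The effect of jitter can be imitated outside WEKA. The following Python sketch uses a hypothetical jitter function and an arbitrary displacement of at most 0.5 per axis; it only illustrates the idea and does not reproduce the scale of WEKA's slider.

```python
import random

def jitter(points, amount, seed=0):
    """Displace every (x, y) point by a random offset of at most `amount` per axis."""
    rng = random.Random(seed)
    return [(x + rng.uniform(-amount, amount), y + rng.uniform(-amount, amount))
            for x, y in points]

stacked = [(22, 76)] * 1000           # many instances plotted at the same position
spread = jitter(stacked, amount=0.5)

distinct_before = len(set(stacked))   # a single visible dot
distinct_after = len(set(spread))     # the instances now occupy separate positions
print(distinct_before, distinct_after)
```

Jitter changes only where the points are drawn; the underlying data values are not modified.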
Selecting Instances
There may be situations where it is helpful to select a subset of the data using the visualization tool.
(A special case of this is the User Classifier in the Classify panel, which lets you build your own classifier by
interactively selecting instances.) Below the y-axis selector button is a drop-down list button for choosing a
selection method. A group of data points can be selected in four ways:
1. Select Instance. Clicking on an individual data point brings up a window listing its attributes. If
more than one point appears at the same location, more than one set of attributes is shown.
2. Rectangle. You can create a rectangle, by dragging, that selects the points inside it.
3. Polygon. You can build a free-form polygon that selects the points inside it. Left-click to add
vertices to the polygon, right-click to complete it. The polygon will always be closed off by
connecting the first point to the last.
4. Polyline. You can build a polyline that distinguishes the points on one side from those on the
other. Left-click to add vertices to the polyline, right-click to finish. The resulting shape is open (as
opposed to a polygon, which is always closed).
Once an area of the plot has been selected using Rectangle, Polygon or Polyline, it turns grey. At this point,
clicking the Submit button removes all instances from the plot except those within the grey selection area.
Clicking on the Clear button erases the selected area without affecting the graph. Once any points have been
removed from the graph, the Submit button changes to a Reset button. This button undoes all previous
removals and returns you to the original graph with all points included. Finally, clicking the Save button
allows you to save the currently visible instances to a new ARFF file.
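What the Rectangle and Polygon selections do to the plotted points can be sketched in Python. The helper names and the (AGE, MARKS) pairs below are hypothetical, and the polygon test uses the standard ray-casting technique, which is an assumption about the approach rather than a description of WEKA's implementation.

```python
def in_rectangle(point, corner_a, corner_b):
    """True if the point lies inside the rectangle spanned by two opposite corners."""
    (x, y), (ax, ay), (bx, by) = point, corner_a, corner_b
    return min(ax, bx) <= x <= max(ax, bx) and min(ay, by) <= y <= max(ay, by)

def in_polygon(point, vertices):
    """Ray-casting test; the polygon is closed from the last vertex back to the first."""
    x, y = point
    inside = False
    for i in range(len(vertices)):
        (x1, y1), (x2, y2) = vertices[i], vertices[(i + 1) % len(vertices)]
        if (y1 > y) != (y2 > y):                      # edge crosses the point's horizontal line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

points = [(21, 45), (22, 76), (23, 34), (21, 92)]     # hypothetical (AGE, MARKS) pairs
kept_rect = [p for p in points if in_rectangle(p, (20.5, 40), (22.5, 80))]
triangle = [(20, 30), (24, 30), (22, 100)]
kept_poly = [p for p in points if in_polygon(p, triangle)]
print(kept_rect, kept_poly)
```

The lists kept_rect and kept_poly correspond to the instances that would remain in the plot after pressing Submit.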
Weka GUI Chooser.
WEKA (Waikato Environment for Knowledge Analysis) Supporting File Formats:
• ARFF (Attribute Relation File Format)
• CSV (Comma Separated Values)
• C4.5
• Binary (serialized Instances objects, .bsi extension)
Confusion Matrix
A confusion matrix shows how well your classifier can recognize tuples of different classes:
• True Positives : Refer to the Positive tuples that were correctly labeled by the classifier.
• True Negatives : Refer to the Negative tuples that were correctly labeled by the classifier.
• False Positives : Refer to the Negative tuples that were incorrectly labeled by the classifier.
• False Negatives : Refer to the Positive tuples that were incorrectly labeled by the classifier.
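The four counts can be computed from a list of actual and predicted labels. This is a minimal Python sketch with a hypothetical function name and invented yes/no labels; precision and recall are derived from the counts using their standard definitions.

```python
def binary_counts(actual, predicted, positive):
    """Return (TP, TN, FP, FN) for the given positive class label."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)
    tn = sum(1 for a, p in pairs if a != positive and p != positive)
    fp = sum(1 for a, p in pairs if a != positive and p == positive)
    fn = sum(1 for a, p in pairs if a == positive and p != positive)
    return tp, tn, fp, fn

actual    = ["yes", "yes", "no", "no", "yes", "no"]    # hypothetical true labels
predicted = ["yes", "no", "no", "yes", "yes", "no"]    # hypothetical classifier output
tp, tn, fp, fn = binary_counts(actual, predicted, positive="yes")

precision = tp / (tp + fp)   # share of predicted positives that are really positive
recall = tp / (tp + fn)      # share of real positives that were found
print(tp, tn, fp, fn, precision, recall)
```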
Fig 1. Weka GUI Chooser
Weka Application Interfaces
• Explorer – preprocessing, attribute selection, learning, visualization
• Experimenter – testing and evaluating machine learning algorithms
• Knowledge Flow – visual design of KDD process – Explorer
• Simple Command-line – A simple interface for typing commands
Fig 2. Weka Application Interfaces
Weka Functions and tools
• Preprocessing Filters
• Attribute selection
• Classification/Regression
• Clustering
• Association discovery
• Visualization
Load data file
• Load data file in formats: ARFF, CSV, C4.5, binary
• Import from URL or SQL database (using JDBC)
Attribute Relation File Format (arff)
An ARFF file consists of two distinct sections:
• The Header section defines the relation name and the attribute names and types. Each declaration starts
with a keyword:
@Relation <data-name>
@attribute <attribute-name> <type> or {range}
• The Data section lists the data records. It starts with the keyword
@Data
followed by the list of data instances.
• Any line starting with % is a comment.
Data types supported by ARFF:
• numeric
• string
• nominal specification
• date
Example:
@RELATION STUDENT
@ATTRIBUTE SNO NUMERIC
@ATTRIBUTE NAME STRING
@ATTRIBUTE AGE NUMERIC
@ATTRIBUTE CITY {HYD,DELHI,MUMBAI}
@ATTRIBUTE BRANCH {CSE,IT,ECE,EEE}
@ATTRIBUTE MARKS NUMERIC
@ATTRIBUTE CLASS {PASS,FAIL}
@DATA
1,DEEPIKA,22,HYD,CSE,76,PASS
2,RADHIKA,23,DELHI,IT,34,FAIL
3,PRADEEP,21,MUMBAI,EEE,45,PASS
4,KRISHNA,22,HYD,ECE,23,FAIL
5,RISHI,21,DELHI,IT,88,PASS
6,SHARAN,21,MUMBAI,EEE,92,PASS
7,SHREYANSH,22,HYD,CSE,26,FAIL
8,SUGUNA,23,MUMBAI,ECE,65,PASS
Write the file in Notepad and save it with the .arff extension, choosing "All Files" as the file type.
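The two-section structure of an ARFF file can be seen by reading one programmatically. The Python sketch below is a deliberately small reader written for illustration; the function name parse_arff is hypothetical, and it handles only unquoted, comma-separated values, so it is not a full implementation of the format.

```python
def parse_arff(text):
    """Very small ARFF reader: handles only unquoted, comma-separated values."""
    relation, attributes, rows, in_data = None, [], [], False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):      # blank line or comment
            continue
        keyword = line.split(None, 1)[0].lower()  # keywords are matched case-insensitively
        if in_data:
            rows.append([v.strip() for v in line.split(",")])
        elif keyword == "@relation":
            relation = line.split(None, 1)[1]
        elif keyword == "@attribute":
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif keyword == "@data":
            in_data = True
    return relation, attributes, rows

SAMPLE = """\
% shortened version of the STUDENT example
@RELATION STUDENT
@ATTRIBUTE SNO NUMERIC
@ATTRIBUTE NAME STRING
@ATTRIBUTE CLASS {PASS,FAIL}
@DATA
1,DEEPIKA,PASS
2,RADHIKA,FAIL
"""
relation, attributes, rows = parse_arff(SAMPLE)
nominal = [name for name, kind in attributes if kind.startswith("{")]
print(relation, attributes, rows, nominal)
```

An attribute whose type is written as a list in braces is nominal; the braces enumerate every value the attribute may take.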
CSV (Comma Separated Value)
The CSV File Format
• Each record is one line.
• Fields are separated with commas.
Example: John,Doe,120 any st.,"Anytown, WW",08123
• Leading and trailing space characters adjacent to comma field separators are ignored.
So John , Doe ,... resolves to "John" and "Doe", etc. Space characters can be spaces or tabs.
• Fields with embedded commas must be delimited with double-quote characters.
In the above example, "Anytown, WW" had to be delimited in double quotes because it had an embedded
comma.
• Fields that contain double-quote characters must be surrounded by double quotes, and the embedded
double quotes must each be represented by a pair of consecutive double quotes.
So John "Da Man" Doe would convert to "John ""Da Man""",Doe,120 any st.,...
• A field that contains embedded line breaks must be surrounded by double quotes.
Such a record is a single CSV record even though it takes up more than one line in the CSV file. This works
because the line breaks are embedded inside the double quotes of the field.
• Fields with leading or trailing spaces must be delimited with double-quote characters.
So to preserve the leading and trailing spaces around the last name above: John ," Doe ",...
• The delimiters will always be discarded.
• The first record in a CSV file may be a header record containing column (field) names.
Example:
SNO,NAME,AGE,CITY,BRANCH,MARKS,CLASS
1,DEEPIKA,22,HYD,CSE,76,PASS
2,RADHIKA,23,DELHI,IT,34,FAIL
3,PRADEEP,21,MUMBAI,EEE,45,PASS
4,KRISHNA,22,HYD,ECE,23,FAIL
5,RISHI,21,DELHI,IT,88,PASS
6,SHARAN,21,MUMBAI,EEE,92,PASS
7,SHREYANSH,22,HYD,CSE,26,FAIL
8,SUGUNA,23,MUMBAI,ECE,65,PASS
Write the file in Notepad and save it with the .csv extension, choosing "All Files" as the file type.
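The quoting rules for embedded commas, embedded double quotes and embedded line breaks can be tried out with Python's standard csv module. This is an illustrative sketch with invented sample text; note that this module keeps spaces around separators by default, so it does not follow the rule about ignoring them.

```python
import csv
import io

raw = (
    'John,Doe,120 any st.,"Anytown, WW",08123\n'
    '"John ""Da Man""",Doe,"line one\nline two"\n'
)
# newline="" leaves line breaks untouched so the reader can handle quoted ones itself
records = list(csv.reader(io.StringIO(raw, newline="")))

# Writing applies the same rules in reverse: quotes are added only where needed
out = io.StringIO()
csv.writer(out, lineterminator="\n").writerow(['John "Da Man"', "Anytown, WW"])
written = out.getvalue()

print(records)
print(written)
```

The second record spans two physical lines in the text but is read back as a single record with three fields.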
Credit Risk Assessment
Description: The business of banks is making loans. Assessing the creditworthiness of an applicant is of
crucial importance. You have to develop a system to help a loan officer decide whether the credit of a
customer is good or bad. A bank's business rules regarding loans must consider two opposing factors. On the
one hand, a bank wants to make as many loans as possible. Interest on these loans is the bank's profit source.
On the other hand, a bank cannot afford to make too many bad loans. Too many bad loans could lead to the
collapse of the bank. The bank's loan policy must involve a compromise: not too strict, and not too lenient.
To do the assignment, you first and foremost need some knowledge about the world of credit. You can
acquire such knowledge in a number of ways.
1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview her and try to represent her
knowledge in the form of production rules.
2. Books. Find some training manuals for loan officers or perhaps a suitable textbook on finance. Translate
this knowledge from text form to production rule form.
3. Common sense. Imagine yourself as a loan officer and make up reasonable rules which can be used to
judge the creditworthiness of a loan applicant.
4. Case histories. Find records of actual cases where competent loan officers correctly judged when, and
when not to, approve a loan application.
The German Credit Data: Actual historical credit data is not always easy to come by because of
confidentiality rules. Here is one such dataset, consisting of 1000 actual cases collected in Germany: the credit
dataset (original), an Excel spreadsheet version of the German credit data (download it from the web). In spite of the
fact that the data is German, you should probably make use of it for this assignment (unless you really can
consult a real loan officer!).
A few notes on the German dataset
• DM stands for Deutsche Mark, the unit of currency, worth about 90 cents Canadian (but looks and acts like
a quarter).
• Owns telephone. German phone rates are much higher than in Canada, so fewer people own telephones.
• Foreign worker. There are millions of these in Germany (many from Turkey). It is very hard to get German
citizenship if you were not born of German parents.
• There are 21 attributes used in judging a loan applicant. The goal is to classify the applicant into one of
two categories, good or bad.
Procedure
Download the German dataset from the internet (save the data in ARFF format).
The description of data is as follows:
Description of the German credit dataset.
1. Title: German Credit data
2. Source Information
Professor Dr. Hans Hofmann
Institut für Statistik und Ökonometrie
Universität Hamburg
FB Wirtschaftswissenschaften
Von-Melle-Park 5
2000 Hamburg 13
3. Number of Instances: 1000
4. Number of Attributes german: 21 (7 numerical, 14 categorical)
5. Attribute description for german
List of Attributes:-
Attribute 1: (qualitative)
Status of existing checking account
A11 : ... < 0 DM
A12 : 0 <= ... < 200 DM
A13 : ... >= 200 DM / salary assignments for at least 1 year
A14 : no checking account
Attribute 2: (numerical)
Duration in month
Attribute 3: (qualitative)
Credit history
A30 : no credits taken / all credits paid back duly
A31 : all credits at this bank paid back duly
A32 : existing credits paid back duly till now
A33 : delay in paying off in the past
A34 : critical account / other credits existing (not at this bank)
Attribute 4: (qualitative)
Purpose
A40 : car (new)
A41 : car (used)
A42 : furniture/equipment
A43 : radio/television
A44 : domestic appliances
A45 : repairs
A46 : education
A47 : (vacation - does not exist?)
A48 : retraining
A49 : business
A410 : others
Attribute 5: (numerical)
Credit amount
Attribute 6: (qualitative)
Savings account/bonds
A61 : ... < 100 DM
A62 : 100 <= ... < 500 DM
A63 : 500 <= ... < 1000 DM
A64 : ... >= 1000 DM
A65 : unknown / no savings account
Attribute 7: (qualitative)
Present employment since
A71 : unemployed
A72 : ... < 1 year
A73 : 1 <= ... < 4 years
A74 : 4 <= ... < 7 years
A75 : ... >= 7 years
Attribute 8: (numerical)
Installment rate in percentage of disposable income
Attribute 9: (qualitative)
Personal status and sex
A91 : male : divorced/separated
A92 : female : divorced/separated/married
A93 : male : single
A94 : male : married/widowed
A95 : female : single
Attribute 10: (qualitative)
Other debtors / guarantors
A101 : none
A102 : co-applicant
A103 : guarantor
Attribute 11: (numerical)
Present residence since
Attribute 12: (qualitative)
Property
A121 : real estate
A122 : if not A121 : building society savings agreement / life insurance
A123 : if not A121/A122 : car or other, not in attribute 6
A124 : unknown / no property
Attribute 13: (numerical)
Age in years
Attribute 14: (qualitative)
Other installment plans
A141 : bank
A142 : stores
A143 : none
Attribute 15: (qualitative)
Housing
A151 : rent
A152 : own
A153 : for free
Attribute 16: (numerical)
Number of existing credits at this bank
Attribute 17: (qualitative)
Job
A171 : unemployed/ unskilled - non-resident
A172 : unskilled - resident
A173 : skilled employee / official
A174 : management / self-employed / highly qualified employee / officer
Attribute 18: (numerical)
Number of people being liable to provide maintenance for
Attribute 19: (qualitative)
Telephone
A191 : none
A192 : yes, registered under the customers name
Attribute 20: (qualitative)
foreign worker
A201 : yes
A202 : no
Attribute 21: (qualitative)
class
A211 : Good
A212 : Bad
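The coded values listed above can be turned back into readable labels with a small lookup table. The sketch below is only an illustration: it covers attributes 6 and 21 as described above, and the names `CODE_LABELS` and `decode` are hypothetical helpers, not part of WEKA.

```python
# Lookup table copied from the attribute description above
# (attribute 6: savings account/bonds, attribute 21: class).
CODE_LABELS = {
    "A61": "savings < 100 DM",
    "A62": "100 <= savings < 500 DM",
    "A63": "500 <= savings < 1000 DM",
    "A64": "savings >= 1000 DM",
    "A65": "unknown / no savings account",
    "A211": "good",
    "A212": "bad",
}


def decode(code):
    """Return the readable label for a code, or the code itself if it is not in the table."""
    return CODE_LABELS.get(code, code)


print(decode("A61"))
print(decode("A212"))
```

Codes that are not in the table are returned unchanged, so the table can be extended one attribute at a time.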
LAB CYCLE TASKS
JNTUH Experiment 1:
AIM: List all the categorical (or nominal) attributes and the real-valued
attributes separately.
THEORY:

• Categorical (or nominal) attributes hold values drawn from a fixed set of
categories (characters/words).
• Real-valued attributes hold numeric data.
HARDWARE/SOFTWARE REQUIREMENTS:
WEKA 3.7.4 mining tool.
ALGORITHM/PROCEDURE:
1) Open the WEKA GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select Preprocess Tab.
4) Go to OPEN file and browse the file that is already stored in the system
“german credit data.csv”.
5) Clicking on any attribute in the left panel will show the basic statistics on that
selected attribute.
6) Click on the Edit button to edit the data.
SOURCE CODE/SOLUTION:
Number of attributes in the German credit data: 21
Categorical (or nominal) Attributes: 14
Numerical Attributes: 7
List of Attributes:
1. checking_status
2. duration
3. credit history
4. purpose
5. credit amount
6. savings_status
7. employment duration
8. installment rate
9. personal status
10. debtors
11. residence_since
12. property
13. age in years
14. installment plans
15. housing
16. existing credits
17. job
18. num_dependents
19. telephone
20. foreign worker
21. class
OUTPUT:
Categorical or nominal attributes:
1. checking_status
2. credit history
3. purpose
4. savings_status
5. employment_since
6. personal status
7. debtors
8. property
9. installment plans
10. housing
11. job
12. telephone
13. foreign worker
14. class label
Real-valued attributes:
1. duration
2. credit amount
3. installment rate
4. residence_since
5. age
6. existing credits
7. num_dependents
In Weka Preprocessing, click on Edit button to edit the Data.
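The same separation can be approximated outside WEKA by checking whether every value in a CSV column parses as a number. The following is a minimal sketch under stated assumptions: the function name `split_attribute_types` is hypothetical, and the sample rows are made up for illustration, not taken from the real German credit file.

```python
import csv
import io


def split_attribute_types(csv_text):
    """Return (categorical, numeric) column-name lists for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    categorical, numeric = [], []
    for name in reader.fieldnames:
        # Ignore empty cells and '?' (WEKA's missing-value marker).
        values = [row[name] for row in rows if row[name] not in ("", "?")]
        try:
            for value in values:
                float(value)
            numeric.append(name)
        except ValueError:
            categorical.append(name)
    return categorical, numeric


# Tiny made-up sample, NOT real rows from the German credit data.
SAMPLE = """checking_status,duration,credit_amount,class
<0,6,1169,good
0<=X<200,48,5951,bad
no checking,12,2096,good
"""

categorical, numeric = split_attribute_types(SAMPLE)
print("Categorical:", categorical)
print("Numeric:", numeric)
```

A column that stores category codes as digits would be reported as numeric by this test, so the result should still be checked against the attribute description.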
PROBLEM DEFINITIONS FOR JNTU EXP. 1
P1:
AIM: List all the categorical (or nominal) attributes and the real-valued
attributes separately using contact Lenses.arff.
THEORY:

• Categorical (or nominal) attributes hold values drawn from a fixed set of
categories (characters/words).
• Real-valued attributes hold numeric data.
HARDWARE/SOFTWARE REQUIREMENTS:
WEKA 3.7.4 mining tool.
ALGORITHM/PROCEDURE:
1) Open the Weka GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select Preprocess Tab.
4) Go to OPEN file and browse the file that is already stored in the system
“contact-lenses.arff”.
5) Clicking on any attribute in the left panel will show the basic statistics on that
selected attribute.
6) Click on the Edit button to edit the data.
SOURCE CODE:
OUTPUT:
List of categorical attributes:
1. Age
2. spectacle-prescrip
3. astigmatism
4. tear-prod-rate
5. contact-lenses
List of Real-valued Attributes:
NIL
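For ARFF files the attribute types are declared in the header, so they can be read directly instead of being inferred. The sketch below assumes the usual ARFF syntax (`@attribute name {v1, v2}` for nominal, `@attribute name numeric` for numbers); the helper name `arff_attribute_types` is hypothetical, and the header text was typed by hand for illustration and should be checked against the actual file.

```python
import re


def arff_attribute_types(arff_text):
    """Map attribute name -> 'nominal', 'numeric', or the declared type, from an ARFF header."""
    types = {}
    for line in arff_text.splitlines():
        line = line.strip()
        if line.lower().startswith("@data"):
            break  # the header ends where the data section begins
        match = re.match(r"@attribute\s+('[^']+'|\S+)\s+(.+)", line, re.IGNORECASE)
        if not match:
            continue
        name = match.group(1).strip("'")
        declared = match.group(2).strip()
        if declared.startswith("{"):
            types[name] = "nominal"
        elif declared.lower() in ("numeric", "real", "integer"):
            types[name] = "numeric"
        else:
            types[name] = declared.lower()  # e.g. string or date
    return types


# Hand-written header for illustration; verify against the real contact-lenses.arff.
HEADER = """@relation contact-lenses
@attribute age {young, pre-presbyopic, presbyopic}
@attribute spectacle-prescrip {myope, hypermetrope}
@attribute astigmatism {no, yes}
@attribute tear-prod-rate {reduced, normal}
@attribute contact-lenses {soft, hard, none}
@data
"""

types = arff_attribute_types(HEADER)
nominal = [name for name, kind in types.items() if kind == "nominal"]
numeric = [name for name, kind in types.items() if kind == "numeric"]
print(len(nominal), "nominal,", len(numeric), "numeric")
```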
P2:
AIM: List all the categorical (or nominal) attributes and the real-valued
attributes separately using cpu.arff.
THEORY:

• Categorical (or nominal) attributes hold values drawn from a fixed set of
categories (characters/words).
• Real-valued attributes hold numeric data.
HARDWARE/SOFTWARE REQUIREMENTS:
WEKA 3.7.4 mining tool.
ALGORITHM/PROCEDURE:
1) Open the Weka GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select Preprocess Tab.
4) Go to OPEN file and browse the file that is already stored in the system
“cpu.arff”.
5) Clicking on any attribute in the left panel will show the basic statistics on that
selected attribute.
6) Click on the Edit button to edit the data.
SOURCE CODE:
OUTPUT:
List of categorical Attributes:
NIL
List of Real-valued Attributes:
1. MYCT
2. MMIN
3. MMAX
4. CACH
5. CHMIN
6. CHMAX
7. CLASS
No. of row values (tuples) in the cpu.arff are 209.
P3:
AIM: List all the categorical (or nominal) attributes and the real-valued
attributes separately using pima_diabetes.arff
THEORY:

• Categorical (or nominal) attributes hold values drawn from a fixed set of
categories (characters/words).
• Real-valued attributes hold numeric data.
HARDWARE/SOFTWARE REQUIREMENTS:
WEKA 3.7.4 mining tool.
ALGORITHM/Procedure:
1) Open the Weka GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select Preprocess Tab.
4) Go to OPEN file and browse the file that is already stored in the system
“pima_diabetes.arff”.
5) Clicking on any attribute in the left panel will show the basic statistics on that
selected attribute.
6) Click on the Edit button to edit the data.
SOURCE CODE:
OUTPUT:
List of categorical Attributes:
1. class
List of Real-valued Attributes:
1. preg
2. plas
3. pres
4. skin
5. insu
6. mass
7. pedi
8. age
No. of row values (tuples) in the pima_diabetes.arff are 768.
P4:
AIM: List all the categorical (or nominal) attributes and the real-valued
attributes separately using glass.arff
THEORY:

• Categorical (or nominal) attributes hold values drawn from a fixed set of
categories (characters/words).
• Real-valued attributes hold numeric data.
HARDWARE/SOFTWARE REQUIREMENTS:
WEKA 3.7.4 mining tool.
ALGORITHM/PROCEDURE:
1) Open the Weka GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select Preprocess Tab.
4) Go to OPEN file and browse the file that is already stored in the system
“glass.arff”.
5) Clicking on any attribute in the left panel will show the basic statistics on that
selected attribute.
6) Click on the Edit button to edit the data.
SOURCE CODE:
Output:
List of categorical Attributes:
1. type
List of Real-valued Attributes:
1. RI
2. Na
3. Mg
4. Al
5. Si
6. K
7. Ca
8. Ba
9. Fe
P5:
AIM: List all the categorical (or nominal) attributes and the real-valued
attributes separately using ionosphere.arff
THEORY:

• Categorical (or nominal) attributes hold values drawn from a fixed set of
categories (characters/words).
• Real-valued attributes hold numeric data.
HARDWARE/SOFTWARE REQUIREMENTS:
WEKA 3.7.4 mining tool.
ALGORITHM/Procedure:
1) Open the Weka GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select Preprocess Tab.
4) Go to OPEN file and browse the file that is already stored in the system
“ionosphere.arff”.
5) Clicking on any attribute in the left panel will show the basic statistics on that
selected attribute.
6) Click on the Edit button to edit the data.
SOURCE CODE:
Output:
List of categorical Attributes:(No. of Attributes: 1)
1. class
List of Real-valued Attributes: a01, a02, a03........,a34 (No. of Attribute: 34)
P6:
AIM: List all the categorical (or nominal) attributes and the real-valued
attributes separately using iris.arff
THEORY:

• Categorical (or nominal) attributes hold values drawn from a fixed set of
categories (characters/words).
• Real-valued attributes hold numeric data.
HARDWARE/SOFTWARE REQUIREMENTS:
WEKA 3.7.4 mining tool.
ALGORITHM/Procedure:
1) Open the Weka GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select Preprocess Tab.
4) Go to OPEN file and browse the file that is already stored in the system
“iris.arff”.
5) Clicking on any attribute in the left panel will show the basic statistics on that
selected attribute.
6) Click on the Edit button to edit the data.
SOURCE CODE:
Output: No. of Attributes: 05
List of categorical Attributes: (No. of Attributes: 1)
1. class
List of Real-valued Attributes: (No. of Attributes: 4)
1. Sepallength
2. Sepalwidth
3. Petallength
4. Petalwidth
P7:
AIM: List all the categorical (or nominal) attributes and the real-valued
attributes separately using “labor.arff”
THEORY:

• Categorical (or nominal) attributes hold values drawn from a fixed set of
categories (characters/words).
• Real-valued attributes hold numeric data.
HARDWARE/SOFTWARE REQUIREMENTS:
WEKA 3.7.4 mining tool.
ALGORITHM/Procedure:
1) Open the Weka GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select Preprocess Tab.
4) Go to OPEN file and browse the file that is already stored in the system
“labor.arff”.
5) Clicking on any attribute in the left panel will show the basic statistics on that
selected attribute.
6) Click on the Edit button to edit the data.
SOURCE CODE:
Output: No. of Attributes: 17
List of categorical Attributes:(No. of Attributes: 09)
1. cost-of-living-adjustment
2. Pension
3. education-allowance
4. vacation
5. longterm-disability-assistance
6. contribution-to-dental-plan
7. bereavement-assistance
8. contribution-to-health-plan
9. class
List of Real-valued Attributes: (No. of Attributes: 08)
1. wage-increase-first-year
2. wage-increase-second-year
3. wage-increase-third-year
4. working-hours
5. standby-pay
6. shift-differential
7. statutory-holidays
8. duration
P8:
AIM: List all the categorical (or nominal) attributes and the real-valued
attributes separately using ReutersGrain-test.arff
THEORY:

• Categorical (or nominal) attributes hold values drawn from a fixed set of
categories (characters/words).
• Real-valued attributes hold numeric data.
HARDWARE/SOFTWARE REQUIREMENTS:
WEKA 3.7.4 mining tool.
ALGORITHM/Procedure:
1) Open the Weka GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select Preprocess Tab.
4) Go to OPEN file and browse the file that is already stored in the system
“ReutersGrain-test.arff”.
5) Clicking on any attribute in the left panel will show the basic statistics on that
selected attribute.
6) Click on the Edit button to edit the data.
SOURCE CODE:
Output:
No. of Attributes: 2 No. of instances: 604
Sum of weights: 604
List of categorical Attributes:(No. of Attributes: 01) Text
List of Real-valued Attributes: (No. of Attributes: 01) class
P9:
AIM: List all the categorical (or nominal) attributes and the real-valued
attributes separately using weather.arff
THEORY:

• Categorical (or nominal) attributes hold values drawn from a fixed set of
categories (characters/words).
• Real-valued attributes hold numeric data.
HARDWARE/SOFTWARE REQUIREMENTS:
WEKA 3.7.4 mining tool.
ALGORITHM/Procedure:
1) Open the Weka GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select Preprocess Tab.
4) Go to OPEN file and browse the file that is already stored in the system
“weather.arff”.
5) Clicking on any attribute in the left panel will show the basic statistics on that
selected attribute.
6) Click on the Edit button to edit the data.
SOURCE CODE:
Output:
No. of Attributes: 5 No. of instances: 14
Sum of weights: 14
List of categorical Attributes: (No. of Attributes: 03)
1. outlook
2. windy
3. play
List of Real-valued Attributes: (No. of Attributes: 02)
1. temperature
2. humidity
Click on Edit button in Preprocessing and save the data:
P10:
AIM: List all the categorical (or nominal) attributes and the real-valued
attributes separately using vote.arff
THEORY:

• Categorical (or nominal) attributes hold values drawn from a fixed set of
categories (characters/words).
• Real-valued attributes hold numeric data.
HARDWARE/SOFTWARE REQUIREMENTS:
WEKA 3.7.4 mining tool.
ALGORITHM/Procedure:
1) Open the Weka GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select Preprocess Tab.
4) Go to OPEN file and browse the file that is already stored in the system
“vote.arff”.
5) Clicking on any attribute in the left panel will show the basic statistics on that
selected attribute.
6) Click on the Edit button to edit the data.
SOURCE CODE:
Output:
No. of Attributes: 17 No. of instances: 435
Sum of weights: 435
List of categorical Attributes: (No. of Attributes: 17)
1. handicapped-infants
2. water-project-cost-sharing
3. adoption-of-the-budget-resolution
4. physician-fee-freeze
5. el-salvador-aid
6. religious-groups-in-schools
7. anti-satellite-test-ban
8. aid-to-nicaraguan-contras
9. mx-missile
10. immigration
11. synfuels-corporation-cutback
12. education-spending
13. superfund-right-to-sue
14. crime
15. duty-free-exports
16. export-administration-act-south-africa
17. Class
List of Real-valued Attributes: (No. of Attributes: 0)
JNTUH Experiment 2:
AIM: What attributes do you think might be crucial in making the credit
assessment? Come up with some simple rules in plain English using your
selected attributes.
THEORY:
The business of banks is making loans. Assessing the credit worthiness of an
applicant is of crucial importance. You have to develop a system to help a loan
officer decide whether the credit of a customer is good, or bad. A bank's business
rules regarding loans must consider two opposing factors. On the one hand, a
bank wants to make as many loans as possible. Interest on these loans is the
bank's source of profit. On the other hand, a bank cannot afford to make too many
bad loans. Too many bad loans could lead to the collapse of the bank. The bank's
loan policy must involve a compromise: not too strict, and not too lenient.
To do the assignment, you first and foremost need some knowledge about the
world of credit. You can acquire such knowledge in a number of ways.
1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview
her and try to represent her knowledge in the form of production rules.
2. Books. Find some training manuals for loan officers or perhaps a suitable
textbook on finance. Translate this knowledge from text form to production rule
form.
3. Common sense. Imagine yourself as a loan officer and make up reasonable
rules which can be used to judge the credit worthiness of a loan applicant.
4. Case histories. Find records of actual cases where competent loan officers
correctly judged when, and when not to, approve a loan application.
The German Credit Data: Actual historical credit data is not always easy to
come by because of confidentiality rules. Here is one such dataset, consisting of
1000 actual cases collected in Germany. The original credit dataset and an Excel
spreadsheet version of the German credit data can be downloaded from the web.
Even though the data is German, you should probably make use of it for this
assignment.
A few notes on the German dataset
• DM stands for Deutsche Mark, the unit of currency, worth about 90 cents
Canadian (but looks and acts like a quarter).
• owns telephone. German phone rates are much higher than in Canada so fewer
people own telephones.
• Foreign worker. There are millions of these in Germany (many from Turkey). It
is very hard to get German citizenship if you were not born of German parents.
• There are 21 attributes used in judging a loan applicant. The goal is to classify
the applicant into one of two categories, good or bad.
HARDWARE/SOFTWARE REQUIREMENTS:
WEKA 3.7.4 mining tool.
ALGORITHM/Procedure:
1. Open the german credit data file
2. Read the description of the data
3. Use domain knowledge to analyze the data
4. List all the crucial attributes
SOLUTION:
Input Data: German Credit Data.csv
Number of Attributes in german credit data: 21
Important Attributes: 09
OUTPUT:
According to me the following attributes may be crucial in making the credit risk
assessment.
1. Credit_history
2. Employment
3. Property_magnitude
4. job
5. duration
6. credit_amount
7. installment
8. existing credit
9. class
To make a decision whether to give credit or not, we must analyze the above
important attributes.
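The AIM asks for simple rules in plain English, so a few rules over the selected attributes can be written down and tried out as code. The sketch below is purely illustrative: the three rules, the thresholds (24 months, 10000 DM), the attribute values, and the function name `assess_credit` are assumptions made for this example, not rules derived from the data or produced by WEKA.

```python
def assess_credit(applicant):
    """Apply three hand-written, illustrative rules in order; return 'good' or 'bad'."""
    # Rule 1: if payments were delayed in the past, the credit is bad.
    if applicant["credit_history"] == "delayed previously":
        return "bad"
    # Rule 2: if the applicant is unemployed and the loan runs longer
    # than 24 months (hypothetical threshold), the credit is bad.
    if applicant["employment"] == "unemployed" and applicant["duration"] > 24:
        return "bad"
    # Rule 3: if the amount is large (hypothetical threshold) and the
    # applicant has no known property, the credit is bad.
    if (applicant["credit_amount"] > 10000
            and applicant["property_magnitude"] == "no known property"):
        return "bad"
    # Otherwise the credit is considered good.
    return "good"


# Made-up applicant for illustration only.
example = {
    "credit_history": "existing paid",
    "employment": "unemployed",
    "duration": 36,
    "credit_amount": 2000,
    "property_magnitude": "car",
}
print(assess_credit(example))
```

Rules written this way can later be compared with the rules that a learned decision tree produces on the same attributes.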
PROBLEM DEFINITIONS FOR JNTU EXP 2
P1:
AIM: What attributes do you think might be crucial in making the analysis
of contact-lenses? Come up with some simple rules in plain English using
your selected attributes using contact Lenses.arff database.
THEORY:
To do the assignment, you first and foremost need some knowledge about the
given data. You can acquire such knowledge in a number of ways.
1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview
her and try to represent her knowledge in the form of production rules.
2. Books. Find some training manuals for loan officers or perhaps a suitable
textbook on finance. Translate this knowledge from text form to production rule
form.
3. Common sense. Imagine yourself as a loan officer and make up reasonable
rules which can be used to judge the credit worthiness of a loan applicant.
4. Case histories. Find records of actual cases where competent loan officers
correctly judged when, and when not to, approve a loan application.
HARDWARE/SOFTWARE REQUIREMENTS:
WEKA 3.7.4 mining tool.
ALGORITHM/Procedure:
1. Open the data file
2. Read the description of the data
3. Use domain knowledge to analyze the data
4. List all the crucial attributes
SOLUTION:
Input Data: contact-lenses.arff
Number of Attributes: 5
Important Attributes: 03
OUTPUT:
According to me the following attributes may be crucial in making the analysis
of contact-lenses.
1. spectacle-prescrip
2. tear-prod-rate
3. contact-lenses
P2:
AIM: What attributes do you think might be crucial in making the cpu
analysis? Come up with some simple rules in plain English using your
selected attributes using "cpu.arff" database.
THEORY:
To do the assignment, you first and foremost need some knowledge about the
given data. You can acquire such knowledge in a number of ways.
1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview
her and try to represent her knowledge in the form of production rules.
2. Books. Find some training manuals for loan officers or perhaps a suitable
textbook on finance. Translate this knowledge from text form to production rule
form.
3. Common sense. Imagine yourself as a loan officer and make up reasonable
rules which can be used to judge the credit worthiness of a loan applicant.
4. Case histories. Find records of actual cases where competent loan officers
correctly judged when, and when not to, approve a loan application.
HARDWARE/SOFTWARE REQUIREMENTS:
WEKA 3.7.4 mining tool.
ALGORITHM/Procedure:
1. Open the data file
2. Read the description of the data
3. Use domain knowledge to analyze the data
4. List all the crucial attributes
SOLUTION:
Input Data: cpu.arff
Number of Attributes: 7
Important Attributes: 04
OUTPUT:
According to me the following attributes may be crucial in making the cpu data
assessment.
1. MMIN
2. CACH
3. CHMIN
4. CLASS
P3:
AIM: What attributes do you think might be crucial in making the
assessment of diabetes? Come up with some simple rules in plain English
using your selected attributes using "pima-diabetes.arff" database.
THEORY:
To do the assignment, you first and foremost need some knowledge about the
given data. You can acquire such knowledge in a number of ways.
1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview
her and try to represent her knowledge in the form of production rules.
2. Books. Find some training manuals for loan officers or perhaps a suitable
textbook on finance. Translate this knowledge from text form to production rule
form.
3. Common sense. Imagine yourself as a loan officer and make up reasonable
rules which can be used to judge the credit worthiness of a loan applicant.
4. Case histories. Find records of actual cases where competent loan officers
correctly judged when, and when not to, approve a loan application.
HARDWARE/SOFTWARE REQUIREMENTS:
WEKA 3.7.4 mining tool.
ALGORITHM/Procedure:
1. Open the data file
2. Read the description of the data
3. Use domain knowledge to analyze the data
4. List all the crucial attributes
SOLUTION:
Input Data: pima-diabetes.arff
Number of Attributes: 9
Important Attributes: 05
OUTPUT:
According to me the following attributes may be crucial in making the
assessment of diabetes.
1. pres
2. skin
3. insu
4. age
5. class
P4:
AIM: What attributes do you think might be crucial in making the
assessment of glass sales? Come up with some simple rules in plain English
using your selected attributes using "glass.arff" database.
THEORY:
To do the assignment, you first and foremost need some knowledge about the
given data. You can acquire such knowledge in a number of ways.
1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview
her and try to represent her knowledge in the form of production rules.
2. Books. Find some training manuals for loan officers or perhaps a suitable
textbook on finance. Translate this knowledge from text form to production rule
form.
3. Common sense. Imagine yourself as a loan officer and make up reasonable
rules which can be used to judge the credit worthiness of a loan applicant.
4. Case histories. Find records of actual cases where competent loan officers
correctly judged when, and when not to, approve a loan application.
HARDWARE/SOFTWARE REQUIREMENTS:
WEKA 3.7.4 mining tool.
ALGORITHM/Procedure:
1. Open the data file
2. Read the description of the data
3. Use domain knowledge to analyze the data
4. List all the crucial attributes
SOLUTION:
Input Data: glass.arff
Number of Attributes: 10
Important Attributes: 05
OUTPUT:
According to me the following attributes may be crucial in making the
assessment of glass sales.
1. type
2. RI
3. Mg
4. K
5. Fe
P5:
AIM: What attributes do you think might be crucial in making the
assessment of ionosphere? Come up with some simple rules in plain English
using your selected attributes using "ionosphere.arff" database.
THEORY:
To do the assignment, you first and foremost need some knowledge about the
given data. You can acquire such knowledge in a number of ways.
1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview
her and try to represent her knowledge in the form of production rules.
2. Books. Find some training manuals for loan officers or perhaps a suitable
textbook on finance. Translate this knowledge from text form to production rule
form.
3. Common sense. Imagine yourself as a loan officer and make up reasonable
rules which can be used to judge the credit worthiness of a loan applicant.
4. Case histories. Find records of actual cases where competent loan officers
correctly judged when, and when not to, approve a loan application.
HARDWARE/SOFTWARE REQUIREMENTS:
WEKA 3.7.4 mining tool.
ALGORITHM/Procedure:
1. Open the data file
2. Read the description of the data
3. Use domain knowledge to analyze the data
4. List all the crucial attributes
SOLUTION:
Input Data: ionosphere.arff
Number of Attributes: 35
Important Attributes: 05
OUTPUT:
According to me the following attributes may be crucial in making the
assessment of ionosphere.
1. class
2. a01
3. a02
4. a33
5. a34
P6:
AIM: What attributes do you think might be crucial in making the
assessment of iris? Come up with some simple rules in plain English using
your selected attributes using "iris.arff" database.
THEORY:
To do the assignment, you first and foremost need some knowledge about the
given data. You can acquire such knowledge in a number of ways.
1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview
her and try to represent her knowledge in the form of production rules.
2. Books. Find some training manuals for loan officers or perhaps a suitable
textbook on finance. Translate this knowledge from text form to production rule
form.
3. Common sense. Imagine yourself as a loan officer and make up reasonable
rules which can be used to judge the credit worthiness of a loan applicant.
4. Case histories. Find records of actual cases where competent loan officers
correctly judged when, and when not to, approve a loan application.
HARDWARE/SOFTWARE REQUIREMENTS:
WEKA 3.7.4 mining tool.
ALGORITHM/Procedure:
1. Open the data file
2. Read the description of the data
3. Use domain knowledge to analyze the data
4. List all the crucial attributes
SOLUTION:
Input Data: iris.arff
Number of Attributes: 5
Important Attributes: 3
OUTPUT:
According to me the following attributes may be crucial in making the
assessment of iris.
1. class
2. Sepallength
3. Petalwidth
P7:
AIM: What attributes do you think might be crucial in making the
assessment of labor? Come up with some simple rules in plain English using
your selected attributes using "labor.arff" database.
THEORY:
To do the assignment, you first and foremost need some knowledge about the
given data. You can acquire such knowledge in a number of ways.
1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview
her and try to represent her knowledge in the form of production rules.
2. Books. Find some training manuals for loan officers or perhaps a suitable
textbook on finance. Translate this knowledge from text form to production rule
form.
3. Common sense. Imagine yourself as a loan officer and make up reasonable
rules which can be used to judge the credit worthiness of a loan applicant.
4. Case histories. Find records of actual cases where competent loan officers
correctly judged when, and when not to, approve a loan application.
HARDWARE/SOFTWARE REQUIREMENTS:
WEKA 3.7.4 mining tool.
ALGORITHM/Procedure:
1. Open the data file
2. Read the description of the data
3. Use domain knowledge to analyze the data
4. List all the crucial attributes
SOLUTION:
Input Data: labor.arff
Number of Attributes: 17
Important Attributes: 8
OUTPUT:
According to me the following attributes may be crucial in making the
assessment of labor.
1. cost-of-living-adjustment
2. Pension
3. longterm-disability-assistance
4. contribution-to-health-plan
5. class
6. wage-increase-first-year
7. working-hours
8. duration
P8:
AIM: What attributes do you think might be crucial in making the
assessment of Reuters Grain Mod Aptitude test? Come up with some simple
rules in plain English using your selected attributes using "ReutersGrain-test.arff" database.
THEORY:
To do the assignment, you first and foremost need some knowledge about the
given data. You can acquire such knowledge in a number of ways.
1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview
her and try to represent her knowledge in the form of production rules.
2. Books. Find some training manuals for loan officers or perhaps a suitable
textbook on finance. Translate this knowledge from text form to production rule
form.
3. Common sense. Imagine yourself as a loan officer and make up reasonable
rules which can be used to judge the credit worthiness of a loan applicant.
4. Case histories. Find records of actual cases where competent loan officers
correctly judged when, and when not to, approve a loan application.
ALGORITHM/Procedure:
1. Open the data file
2. Read the description of the data
3. Use domain knowledge to analyze the data
4. List all the crucial attributes
SOLUTION:
Input Data: ReutersGrain-test.arff
Number of Attributes: 2
Important Attributes: 2
OUTPUT:
According to me the following attributes may be crucial in making the
assessment of Routers Grain Mod Aptitude test.
1. Text
2. class
P9:
AIM: What attributes do you think might be crucial in making the
assessment of weather? Come up with some simple rules in plain English
using your selected attributes using "weather.arff" database.
THEORY:
To do the assignment, you first and foremost need some knowledge about the
given data. You can acquire such knowledge in a number of ways.
1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview
her and try to represent her knowledge in the form of production rules.
2. Books. Find some training manuals for loan officers or perhaps a suitable
textbook on finance. Translate this knowledge from text form to production rule
form.
3. Common sense. Imagine yourself as a loan officer and make up reasonable
rules which can be used to judge the credit worthiness of a loan applicant.
4. Case histories. Find records of actual cases where competent loan officers
correctly judged when, and when not to, approve a loan application.
HARDWARE/SOFTWARE REQUIREMENTS:
WEKA 3.7.4 mining tool.
ALGORITHM/Procedure:
1. Open the data file
2. Read the description of the data
3. Use domain knowledge to analyze the data
4. List all the crucial attributes
SOLUTION:
Input Data: weather.arff
Number of Attributes: 5
Important Attributes: 3
OUTPUT:
According to me the following attributes may be crucial in making the
assessment of weather.
1. Humidity
2. windy
3. temperature
P10:
AIM: What attributes do you think might be crucial in making the
assessment of supermarket? Come up with some simple rules in plain
English using your selected attributes using "supermarket.arff" database.
THEORY:
To do the assignment, you first and foremost need some knowledge about the
given data. You can acquire such knowledge in a number of ways.
1. Knowledge Engineering. Find a loan officer who is willing to talk. Interview
her and try to represent her knowledge in the form of production rules.
2. Books. Find some training manuals for loan officers or perhaps a suitable
textbook on finance. Translate this knowledge from text form to production rule
form.
3. Common sense. Imagine yourself as a loan officer and make up reasonable
rules which can be used to judge the credit worthiness of a loan applicant.
4. Case histories. Find records of actual cases where competent loan officers
correctly judged when, and when not to, approve a loan application.
HARDWARE/SOFTWARE REQUIREMENTS:
WEKA 3.7.4 mining tool.
ALGORITHM/Procedure:
1. Open the data file
2. Read the description of the data
3. Use domain knowledge to analyze the data
4. List all the crucial attributes
SOLUTION:
Input Data: supermarket.arff
Number of Attributes: 217
Important Attributes: 7
OUTPUT:
According to me the following attributes may be crucial in making the
assessment of supermarket.
1. department1
2. grocery misc
3. baby needs
4. baking needs
5. coupons
6. vegetables
7. total
JNTU Experiment 3:
AIM: One type of model that you can create is a Decision Tree - train a Decision Tree using
the complete dataset as the training data. Report the model obtained after training.
Theory:
A decision tree is a flowchart-like tree structure in which each internal (non-leaf) node denotes
a test on an attribute, each branch represents an outcome of the test, and each leaf (terminal)
node holds a class label. Decision trees can easily be converted into classification rules.
Examples of decision tree algorithms are ID3, C4.5 and CART.
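The conversion of a tree into classification rules can be sketched by walking every path from the root to a leaf and joining the tests along the way. The example below is a minimal sketch, assuming the tree is stored as nested dictionaries; `TOY_TREE` is a small made-up tree for illustration, not the tree WEKA builds, and `tree_to_rules` is a hypothetical helper.

```python
def tree_to_rules(node, conditions=()):
    """Walk a nested-dict tree and return one IF ... THEN rule per leaf."""
    if isinstance(node, str):  # a leaf holds a class label
        premise = " AND ".join(conditions) if conditions else "TRUE"
        return [f"IF {premise} THEN class = {node}"]
    rules = []
    for test, child in node.items():  # each key is the outcome of a test
        rules.extend(tree_to_rules(child, conditions + (test,)))
    return rules


# Made-up toy tree: internal nodes are dicts, leaves are class labels.
TOY_TREE = {
    "duration <= 11": {
        "existing_credits <= 1": "bad",
        "existing_credits > 1": "good",
    },
    "duration > 11": "bad",
}

for rule in tree_to_rules(TOY_TREE):
    print(rule)
```

Each leaf yields exactly one rule, so the number of rules equals the number of leaves in the tree.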
Classification is a data mining function that assigns items in a collection to target categories or
classes. The goal of classification is to accurately predict the target class for each case in the data.
For example, a classification model could be used to identify loan applicants as low, medium, or
high credit risks.
A classification task begins with a data set in which the class assignments are known. For example,
a classification model that predicts credit risk could be developed based on observed data for many
loan applicants over a period of time.
Different classification algorithms use different techniques for finding relationships. These
relationships are summarized in a model, which can then be applied to a different data set in which
the class assignments are unknown.
Classification models are tested by comparing the predicted values to known target values in a set
of test data. The historical data for a classification project is typically divided into two data sets:
one for building the model; the other for testing the model.
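The split into a model-building set and a testing set, and the comparison of predicted with known values, can be sketched as follows. This is a minimal illustration under stated assumptions: the 30% test fraction, the fixed seed, and the helper names `holdout_split` and `accuracy` are choices made for this example and are not WEKA settings.

```python
import random


def holdout_split(rows, test_fraction=0.3, seed=1):
    """Shuffle a copy of the rows and return (training_rows, test_rows)."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


def accuracy(predicted, actual):
    """Fraction of test cases whose predicted class equals the known class."""
    correct = sum(p == a for p, a in zip(predicted, actual))
    return correct / len(actual)


train, test = holdout_split(list(range(10)))
print(len(train), "training rows,", len(test), "test rows")

# Made-up predictions compared with made-up known classes.
print(accuracy(["good", "bad", "good", "good"], ["good", "bad", "bad", "good"]))
```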
Classification has many applications in customer segmentation, business modeling, marketing,
credit analysis, and biomedical and drug response modeling.
Different Classification Algorithms
Oracle Data Mining provides the following algorithms for classification:
· Decision Tree
Decision trees automatically generate rules, which are conditional statements that reveal the logic
used to build the tree.
· Naive Bayes
Naive Bayes uses Bayes' Theorem, a formula that calculates a probability by counting the
frequency of values and combinations of values in the historical data.
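The frequency counting behind Naive Bayes can be illustrated on a handful of records. The sketch below is an assumption-laden toy, not Oracle's or WEKA's implementation: the data are made up, the helper names are hypothetical, and the add-one smoothing assumes each attribute has two possible values.

```python
from collections import Counter, defaultdict


def train_naive_bayes(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (attribute index, class) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(i, label)][value] += 1
    return class_counts, value_counts


def predict(model, row):
    """Return the class with the largest prior times product of value likelihoods."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_label, best_score = None, -1.0
    for label, count in class_counts.items():
        score = count / total  # prior P(class)
        for i, value in enumerate(row):
            # Add-one smoothing; the +2 assumes two possible values per attribute.
            score *= (value_counts[(i, label)][value] + 1) / (count + 2)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Made-up toy records: (housing, job) -> class.
ROWS = [("own", "skilled"), ("own", "skilled"), ("rent", "unskilled"), ("rent", "skilled")]
LABELS = ["good", "good", "bad", "bad"]

model = train_naive_bayes(ROWS, LABELS)
print(predict(model, ("own", "skilled")))
print(predict(model, ("rent", "unskilled")))
```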
HARDWARE/SOFTWARE REQUIREMENTS: WEKA 3.7.4 mining tool.
Algorithm/Procedure:
1) Open Weka GUI Chooser.
2) Select EXPLORER present in Applications.
3) Select Preprocess Tab.
4) Go to OPEN file and browse the file that is already stored in the system “German Credit
Data.csv”.
5) Go to Classify tab.
6) Click the Choose button to select the classifier. The C4.5 algorithm is implemented in
WEKA under the name J48.
7) Select J48 under trees.
8) Select the Test option “Use training set”.
9) If needed, select the attribute.
10) Click Start.
11) The output details appear in the Classifier output panel.
12) Right-click on the entry in the Result list and select the “Visualize tree” option.
Source Code:
The decision tree constructed by using the implemented C4.5 algorithm
OUTPUT:
=== Run information ===
Scheme: weka.classifiers.trees.J48 -C 0.25 -M 2
Relation: German Credit Data
Instances: 1000
Attributes: 21
checking_status
duration
credit_history
purpose
credit_amount
savings_status
employment
installment commitment
personal status
other parties
residence_since
property_magnitude
age
other_payment_plans
housing
existing_credits
job
num_dependents
own_telephone
foreign
class
Test mode: evaluate on training data
=== Classifier model (full training set) ===
J48 pruned tree
------------------
checking_status = <0
| foreign = yes
| | duration <= 11
| | | existing_credits <= 1
| | | | property_magnitude = real estate: good (8.0/1.0)
| | | | property_magnitude = life insurance
| | | | | own_telephone = yes: good (4.0)
| | | | | own_telephone = none: bad (2.0)
| | | | property_magnitude = no known property: bad (3.0)
| | | | property_magnitude = car: good (2.0/1.0)
| | | existing_credits > 1: good (14.0)
| | duration > 11
| | | job = skilled
| | | | other parties = none
| | | | | duration <= 30
| | | | | | savings_status = no known savings
| | | | | | | existing_credits <= 1
| | | | | | | | own_telephone = yes: good (4.0/1.0)
| | | | | | | | own_telephone = none: bad (9.0/1.0)
| | | | | | | existing_credits > 1: good (2.0)
| | | | | | savings_status = <100
| | | | | | | credit_history = critical/other existing credit: good (14.0/4.0)
| | | | | | | credit_history = existing paid
| | | | | | | | own_telephone = yes: bad (5.0)
| | | | | | | | own_telephone = none
| | | | | | | | | existing_credits <= 1
| | | | | | | | | | property_magnitude = real estate
| | | | | | | | | | | age <= 26: bad (5.0)
| | | | | | | | | | | age > 26: good (2.0)
| | | | | | | | | | property_magnitude = life insurance: bad (7.0/2.0)
| | | | | | | | | | property_magnitude = no known property: good (2.0)
| | | | | | | | | | property_magnitude = car
| | | | | | | | | | | credit_amount <= 1386: bad (3.0)
| | | | | | | | | | | credit_amount > 1386: good (11.0/1.0)
| | | | | | | | | existing_credits > 1: bad (3.0)
| | | | | | | credit_history = delayed previously: bad (4.0)
| | | | | | | credit_history = no credits/all paid: bad (8.0/1.0)
| | | | | | | credit_history = all paid: bad (6.0)
| | | | | | savings_status = 500<=X<1000: good (4.0/1.0)
| | | | | | savings_status = >=1000: good (4.0)
| | | | | | savings_status = 100<=X<500
| | | | | | | credit_history = critical/other existing credit: good (2.0)
| | | | | | | credit_history = existing paid: bad (3.0)
| | | | | | | credit_history = delayed previously: good (0.0)
| | | | | | | credit_history = no credits/all paid: good (0.0)
| | | | | | | credit_history = all paid: good (1.0)
| | | | | duration > 30: bad (30.0/3.0)
| | | | other parties = guarantor: good (12.0/3.0)
| | | | other parties = co applicant: bad (7.0/1.0)
| | | job = unskilled resident
| | | | purpose = radio/tv
| | | | | existing_credits <= 1: bad (10.0/3.0)
| | | | | existing_credits > 1: good (2.0)
| | | | purpose = education: bad (1.0)
| | | | purpose = furniture/equipment
| | | | | employment = >=7: good (2.0)
| | | | | employment = 1<=X<4: good (4.0)
| | | | | employment = 4<=X<7: good (1.0)
| | | | | employment = unemployed: good (0.0)
| | | | | employment = <1: bad (3.0)
| | | | purpose = new car
| | | | | own_telephone = yes: good (2.0)
| | | | | own_telephone = none: bad (10.0/2.0)
| | | | purpose = used car: bad (1.0)
| | | | purpose = business: good (3.0)
| | | | purpose = domestic appliance: bad (1.0)
| | | | purpose = repairs: bad (1.0)
| | | | purpose = other: good (1.0)
| | | | purpose = retraining: good (1.0)
| | | job = high qualif/self emp/mgmt: good (30.0/8.0)
| | | job = unemp/unskilled non res: bad (5.0/1.0)
| foreign = no: good (15.0/2.0)
checking_status = 0<=X<200
| credit_amount <= 9857
| | savings_status = no known savings: good (41.0/5.0)
| | savings_status = <100
| | | other parties = none
| | | | duration <= 42
| | | | | personal status = male single: good (52.0/15.0)
| | | | | personal status = female div/dep/mar
| | | | | | purpose = radio/tv: good (8.0/2.0)
| | | | | | purpose = education: good (4.0/2.0)
| | | | | | purpose = furniture/equipment
| | | | | | | duration <= 10: bad (3.0)
| | | | | | | duration > 10
| | | | | | | | duration <= 21: good (6.0/1.0)
| | | | | | | | duration > 21: bad (2.0)
| | | | | | purpose = new car: bad (5.0/1.0)
| | | | | | purpose = used car: bad (1.0)
| | | | | | purpose = business
| | | | | | | residence_since <= 2: good (3.0)
| | | | | | | residence_since > 2: bad (2.0)
| | | | | | purpose = domestic appliance: good (0.0)
| | | | | | purpose = repairs: good (1.0)
| | | | | | purpose = other: good (0.0)
| | | | | | purpose = retraining: good (0.0)
| | | | | personal status = male div/sep: bad (8.0/2.0)
| | | | | personal status = male mar/wid
| | | | | | duration <= 10: good (6.0)
| | | | | | duration > 10: bad (10.0/3.0)
| | | | duration > 42: bad (7.0)
| | | other parties = guarantor
| | | | purpose = radio/tv: good (18.0/1.0)
| | | | purpose = education: good (0.0)
| | | | purpose = furniture/equipment: good (0.0)
| | | | purpose = new car: bad (2.0)
| | | | purpose = used car: good (0.0)
| | | | purpose = business: good (0.0)
| | | | purpose = domestic appliance: good (0.0)
| | | | purpose = repairs: good (0.0)
| | | | purpose = other: good (0.0)
| | | | purpose = retraining: good (0.0)
| | | other parties = co applicant: good (2.0)
| | savings_status = 500<=X<1000: good (11.0/3.0)
| | savings_status = >=1000: good (13.0/3.0)
| | savings_status = 100<=X<500
| | | purpose = radio/tv: bad (8.0/2.0)
| | | purpose = education: good (0.0)
| | | purpose = furniture/equipment: bad (4.0/1.0)
| | | purpose = new car: bad (15.0/5.0)
| | | purpose = used car: good (3.0)
| | | purpose = business
| | | | housing = own: good (6.0)
| | | | housing = for free: bad (1.0)
| | | | housing = rent
| | | | | existing_credits <= 1: good (2.0)
| | | | | existing_credits > 1: bad (2.0)
| | | purpose = domestic appliance: good (0.0)
| | | purpose = repairs: good (2.0)
| | | purpose = other: good (1.0)
| | | purpose = retraining: good (0.0)
| credit_amount > 9857: bad (20.0/3.0)
checking_status = no checking: good (394.0/46.0)
checking_status = >=200: good (63.0/14.0)
Number of Leaves : 98
Size of the tree : 135
Time taken to build model: 0.06 seconds
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 855 85.5 %
Incorrectly Classified Instances 145 14.5 %
Kappa statistic 0.6251
Mean absolute error 0.2312
Root mean squared error 0.34
Relative absolute error 55.0377 %
Root relative squared error 74.2015 %
Coverage of cases (0.95 level) 100 %
Mean rel. region size (0.95 level) 93.3 %
Total Number of Instances 1000
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.956    0.38     0.854      0.956   0.902      0.857     good
               0.62     0.044    0.857      0.62    0.72       0.857     bad
Weighted Avg.  0.855    0.279    0.855      0.855   0.847      0.857
=== Confusion Matrix ===
a b <-- classified as
669 31 | a = good
114 186 | b = bad
Then we will be getting confusion matrix as follows:
=== Confusion Matrix ===
a b <-- classified as
669 31 | a = 1
114 186 | b = 2
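The figures in the Detailed Accuracy table can be re-derived by hand from the confusion matrix. The following is a minimal Python sketch of that arithmetic; it is not part of Weka, and the function names are illustrative assumptions. Rows are taken as actual classes and columns as predicted classes, as in the Weka output above.

```python
# Confusion matrix from the J48 training-set run above.
# Rows are actual classes, columns are predicted classes.
CLASSES = ["good", "bad"]
MATRIX = [[669, 31],
          [114, 186]]


def accuracy(matrix):
    """Fraction of instances on the diagonal (correctly classified)."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def per_class_rates(matrix, i):
    """Return (TP rate, FP rate, precision) for the class at index i."""
    total = sum(sum(row) for row in matrix)
    tp = matrix[i][i]
    actual = sum(matrix[i])                     # instances whose true class is i
    predicted = sum(row[i] for row in matrix)   # instances predicted as class i
    fp = predicted - tp
    return tp / actual, fp / (total - actual), tp / predicted


def kappa(matrix):
    """Cohen's kappa: agreement beyond what chance alone would give."""
    total = sum(sum(row) for row in matrix)
    observed = accuracy(matrix)
    expected = sum(
        sum(matrix[i]) * sum(row[i] for row in matrix)
        for i in range(len(matrix))
    ) / (total * total)
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    print(f"accuracy = {accuracy(MATRIX):.3f}")
    for i, name in enumerate(CLASSES):
        tpr, fpr, prec = per_class_rates(MATRIX, i)
        print(f"{name}: TP rate={tpr:.3f} FP rate={fpr:.3f} precision={prec:.3f}")
    print(f"kappa = {kappa(MATRIX):.4f}")
```

Rounded to the precision Weka prints, these values agree with the accuracy (85.5 %), the per-class TP rate, FP rate and precision, and the Kappa statistic (0.6251) reported above.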
Visualize threshold curve
Cost benefit analysis
Visualize cost curve
JNTU Experiment 4:
AIM:
Suppose you use your above model trained on the complete dataset, and classify credit
good/bad for each of the examples in the dataset. What % of examples can you classify
correctly? (This is also called testing on the training set) Why do you think you cannot get
100 % training accuracy?
Theory:
Naive Bayes classifier assumes that the presence (or absence) of a particular feature of a class is
unrelated to the presence (or absence) of any other feature. For example, a fruit may be considered
to be an apple if it is red, round, and about 4" in diameter. Even though these features depend on
the existence of the other features, a naive Bayes classifier considers all of these properties to
independently contribute to the probability that this fruit is an apple.
An advantage of the naive Bayes classifier is that it requires a small amount of training data to
estimate the parameters (means and variances of the variables) necessary for classification.
Because independent variables are assumed, only the variances of the variables for each class need
to be determined and not the entire covariance matrix.
The naive Bayes probabilistic model:
The probability model for a classifier is a conditional model
$p(C \mid F_1, \dots, F_n)$ over a dependent class variable C with a small number of outcomes or
classes, conditional on several feature variables F1 through Fn. The problem is that if the number
of features n is large or when a feature can take on a large number of values, then basing such a
model on probability tables is infeasible. We therefore reformulate the model to make it more
tractable.
Using Bayes' theorem, we write
$$p(C \mid F_1, \dots, F_n) = \frac{p(C)\, p(F_1, \dots, F_n \mid C)}{p(F_1, \dots, F_n)}$$
In plain English the above equation can be written as
posterior = (prior × likelihood) / evidence
In practice we are only interested in the numerator of that fraction, since the denominator does not
depend on C and the values of the features Fi are given, so that the denominator is effectively
constant. The numerator is equivalent to the joint probability model $p(C, F_1, \dots, F_n)$, which can be
rewritten as follows, using repeated applications of the definition of conditional probability:
$$\begin{aligned}
p(C, F_1, \dots, F_n) &= p(C)\, p(F_1, \dots, F_n \mid C) \\
&= p(C)\, p(F_1 \mid C)\, p(F_2, \dots, F_n \mid C, F_1) \\
&= p(C)\, p(F_1 \mid C)\, p(F_2 \mid C, F_1)\, p(F_3, \dots, F_n \mid C, F_1, F_2) \\
&= p(C)\, p(F_1 \mid C)\, p(F_2 \mid C, F_1) \cdots p(F_n \mid C, F_1, \dots, F_{n-1})
\end{aligned}$$
Now the "naive" conditional independence assumptions come into play: assume that each feature
Fi is conditionally independent of every other feature Fj for j ≠ i, given the class C.
This means that $p(F_i \mid C, F_j) = p(F_i \mid C)$,
and so the joint model can be expressed as
$$p(C, F_1, \dots, F_n) = p(C)\, p(F_1 \mid C)\, p(F_2 \mid C) \cdots p(F_n \mid C) = p(C) \prod_{i=1}^{n} p(F_i \mid C)$$
This means that under the above independence assumptions, the conditional distribution over the
class variable C can be expressed like this:
$$p(C \mid F_1, \dots, F_n) = \frac{1}{Z}\, p(C) \prod_{i=1}^{n} p(F_i \mid C)$$
where Z is a scaling factor dependent only on F1, ..., Fn, i.e., a constant if the values of the feature
variables are known.
Models of this form are much more manageable, since they factor into a so-called class prior p(C)
and independent probability distributions p(Fi|C). If there are k classes and if a model for
each p(Fi|C=c) can be expressed in terms of r parameters, then the corresponding naive Bayes
model has (k − 1) + n r k parameters. In practice, k = 2 (binary classification) and r = 1
(Bernoulli variables as features) are common, and so the total number of parameters of the naive
Bayes model is 2n + 1, where n is the number of binary features used for prediction.
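As a quick check of the parameter count, the formula (k − 1) + n r k can be evaluated directly. This is a small Python sketch; the function name and the choice of n = 20 in the usage line are illustrative assumptions, not something taken from Weka.

```python
def naive_bayes_parameter_count(n_features, n_classes=2, params_per_feature=1):
    """Evaluate (k - 1) + n * r * k for a naive Bayes model.

    k - 1 free parameters describe the class prior; each of the n features
    needs r parameters for each of the k classes.
    """
    return (n_classes - 1) + n_features * params_per_feature * n_classes


if __name__ == "__main__":
    n = 20  # hypothetical number of binary features
    print(naive_bayes_parameter_count(n))  # equals 2 * n + 1 when k = 2, r = 1
```

With k = 2 and r = 1 the count reduces to 2n + 1, so it grows only linearly with the number of features.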
$$P(h \mid D) = \frac{P(D \mid h)\, P(h)}{P(D)}$$
• P(h) : Prior probability of hypothesis h
• P(D) : Prior probability of training data D
• P(h|D) : Probability of h given D
• P(D|h) : Probability of D given h
Naïve Bayes Classifier: Derivation
• D: set of tuples
– Each tuple is an n-dimensional attribute vector
– X = (x1, x2, x3, …, xn)
• Let there be m classes: C1, C2, C3, …, Cm
• The NB classifier predicts that X belongs to class Ci iff
– P(Ci|X) > P(Cj|X) for 1 <= j <= m, j ≠ i
• Maximum a posteriori hypothesis
– P(Ci|X) = P(X|Ci) P(Ci) / P(X)
– Maximize P(X|Ci) P(Ci), as P(X) is constant
• With many attributes, it is computationally expensive to evaluate P(X|Ci)
• Naïve assumption of "class conditional independence":
$$P(X \mid C_i) = \prod_{k=1}^{n} P(x_k \mid C_i) = P(x_1 \mid C_i) \cdot P(x_2 \mid C_i) \cdots P(x_n \mid C_i)$$
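The derivation above can be sketched in a few lines of Python for nominal attributes. This is a minimal illustration under stated assumptions, not Weka's implementation: probabilities are plain relative frequencies with no smoothing, the function names are made up for the example, and the two-attribute toy data in the usage part is hypothetical.

```python
from collections import Counter, defaultdict


def train(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    # value_counts[(k, c)][v] = number of class-c rows whose k-th attribute is v
    value_counts = defaultdict(Counter)
    for row, c in zip(rows, labels):
        for k, v in enumerate(row):
            value_counts[(k, c)][v] += 1
    return class_counts, value_counts


def predict(model, x):
    """Return the class that maximizes prior * product of per-attribute likelihoods."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for c, n_c in class_counts.items():
        score = n_c / total                          # class prior
        for k, v in enumerate(x):
            score *= value_counts[(k, c)][v] / n_c   # likelihood of value v given class c
        if score > best_score:
            best_class, best_score = c, score
    return best_class


if __name__ == "__main__":
    # Hypothetical toy data: (checking_status, own_telephone) -> class
    rows = [("<0", "none"), ("<0", "yes"), ("no checking", "yes"),
            ("no checking", "none"), ("no checking", "yes")]
    labels = ["bad", "bad", "good", "good", "good"]
    model = train(rows, labels)
    print(predict(model, ("no checking", "yes")))
    print(predict(model, ("<0", "none")))
```

The evidence term P(X) is never computed, since it is the same for every class and does not change which class scores highest.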
HARDWARE/SOFTWARE REQUIREMENTS: WEKA 3.7.4 mining tool.
Algorithm/Procedure:
1) Start with the bank database given for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select the Preprocess tab.
5) Go to "Open file" and browse to the file already stored on the system, "German Credit
Data.csv".
6) Go to the Classify tab.
7) Choose the classifier group "trees".
8) Select "NBTree", i.e., the Naive Bayes tree.
9) Select the test option "Use training set".
10) If needed, select the attribute to use as the class.
11) Click Start.
12) The output details now appear in the Classifier output panel.
Source Code:
Output:
In this way we classified each of the examples in the dataset.
We classified 85.5% of the examples correctly,
and the remaining 14.5% were classified incorrectly.
We cannot get 100% training accuracy because, out of the 20 attributes, some are unnecessary
but are still analyzed and used in training. This affects the accuracy, and hence
we cannot reach 100% training accuracy.
5. Is testing on the training set as you did above a good idea? Why or why not?
SOLUTION:
As a rule of thumb, 2/3 of the dataset is taken as the training set and the remaining 1/3 as the
test set. In the model above, however, the complete dataset was taken as the training set, which
gave only 85.5% accuracy.
In doing so, unnecessary attributes that do not play a crucial role in credit risk assessment were
also analyzed and used in training. This increases complexity and ultimately lowers the
accuracy.
If one part of the dataset is used as the training set and the remainder as the test set, the accuracy
estimate is more reliable, because the model is scored on examples it has not seen, and the
computation takes less time. This is why we prefer not to take the complete dataset as the training set.
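The 2/3 training and 1/3 test division described above can be sketched as follows. This is a minimal Python illustration, not Weka's own procedure; the function name, the fixed seed, and the use of instance numbers 0 to 999 in place of real records are assumptions made for the example.

```python
import random


def holdout_split(instances, train_fraction=2 / 3, seed=1):
    """Shuffle the instances and cut them into a training part and a test part."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)   # fixed seed so the split is repeatable
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


if __name__ == "__main__":
    # Stand-in for the 1000 credit records: just their row numbers.
    train_set, test_set = holdout_split(range(1000))
    print(len(train_set), len(test_set))
```

In the Weka Explorer, the comparable test option is "Percentage split", where the percentage of data used for training is entered directly.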
Testing on the training set is acceptable in some cases, but it is better to use cross-validation.
x-fold cross-validation (N − N/x instances for training; N/x for testing)
Cross-validation is used to prevent the test sets from overlapping.
First step: split the data into x disjoint subsets of equal size.
Second step: use each subset in turn for testing and the remainder for training (repeating
cross-validation). As the resulting rules (if applicable), we take the sum of all rules.
The error (predictive accuracy) estimates are averaged to yield an overall error (predictive
accuracy) estimate.
Standard cross-validation: 10-fold cross-validation
Why 10?
Extensive experiments have shown that this is the best choice to get an accurate estimate.
There is also some theoretical evidence for this.
Tools/ Apparatus: Weka Mining tool
Procedure:
1) In Test options, select the Supplied test set radio button
2) click Set
3) Choose the file which contains records that were not in the training set we used to create the
model.
4) Click Start(WEKA will run this test data set through the model we already created. )
5) Compare the output results with that of the 4th experiment
Sample output:
This can be observed across the different problem solutions during practice.
The important numbers to focus on here are the numbers next to the "Correctly Classified
Instances" (92.3 percent) and the "Incorrectly Classified Instances" (7.6 percent). Other important
numbers are in the "ROC Area" column, in the first row (0.936). Finally, the "Confusion
Matrix" shows the number of false positives and false negatives: 29 false positives and
17 false negatives in this matrix.
Based on our accuracy rate of 92.3 percent, we say that upon initial analysis, this is a good model.
One final step in validating our classification tree is to run our test set through the model
and check its accuracy. Comparing the "Correctly Classified Instances" from this
test set with the "Correctly Classified Instances" from the training set shows how well the
accuracy holds up. If the two are close, the model is unlikely to break down with unknown data
or when future data is applied to it.
6. One approach for solving the problem encountered in the previous question is using cross-validation. Describe briefly what cross-validation is. Train a Decision Tree again using cross-validation and report your results. Does your accuracy increase/decrease? Why?
Cross-validation: In k-fold cross-validation, the initial data are randomly partitioned into k mutually exclusive
subsets or folds D1, D2, D3, ..., Dk, each of approximately equal size. Training and testing are
performed k times. In iteration i, partition Di is reserved as the test set and the remaining
partitions are collectively used to train the model. That is, in the first iteration, subsets D2, D3, ...,
Dk collectively serve as the training set to obtain a first model, which is tested on D1.
The second model is trained on subsets D1, D3, ..., Dk and tested on D2, and so on.
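The partitioning described above can be sketched in Python as follows. This is only an illustration of the k-fold idea under stated assumptions: the function name and seed are made up, and the folds here are not stratified by class, whereas Weka's output reports a stratified cross-validation.

```python
import random


def k_fold_splits(n_instances, k=10, seed=1):
    """Yield (train_indices, test_indices) for each of the k iterations."""
    indices = list(range(n_instances))
    random.Random(seed).shuffle(indices)
    # D1..Dk: every k-th shuffled index, so fold sizes differ by at most one
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    for number, (train, test) in enumerate(k_fold_splits(1000, k=10), start=1):
        print(f"iteration {number}: {len(train)} training, {len(test)} test instances")
```

Each instance lands in exactly one test fold, so every record is used for testing once and for training k − 1 times.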
Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select Preprocess Tab.
5) Go to "Open file" and browse to the file already stored on the system, "bank.csv".
6) In the "Filter" panel, click on the "Choose" button. This will show a popup window with a list
of available filters.
7) Select "weka.filters.unsupervised.attribute.Remove".
8) Next, click on text box immediately to the right of the "Choose" button
9) In the resulting dialog box enter the index of the attribute to be filtered out (Make sure that the
"Invert Selection" option is set to false)
10) Then click "OK". Now, in the filter box you will see "Remove -R 1"
11) Click the "Apply" button to apply this filter to the data. This will remove the "id" attribute and
create a new working relation
12) To save the new working relation as an ARFF file, click on save button in the top panel.
13) Go to OPEN file and browse the file that is newly saved (attribute deleted file)
14) Go to Classify tab.
15) Choose the classifier group "trees".
16) Select the J48 tree.
17) Select the test option "Cross-validation" and set the number of folds (for example, 10).
18) If needed, select the attribute to use as the class.
19) Click Start.
20) The output details appear in the Classifier output panel.
21) Right-click the entry in the result list and select the "Visualize tree" option.
22) Compare the output results with those of the 4th experiment.
23) Check whether the accuracy increased or decreased.
24) Check whether removing these attributes has any significant effect.
Output:
J48 pruned tree
------------------
checking_status = <0
| foreign_worker = yes
| | duration <= 11
| | | existing_credits <= 1
| | | | property_magnitude = real estate: good (8.0/1.0)
| | | | property_magnitude = life insurance
| | | | | own_telephone = none: bad (2.0)
| | | | | own_telephone = yes: good (4.0)
| | | | property_magnitude = car: good (2.0/1.0)
| | | | property_magnitude = no known property: bad (3.0)
| | | existing_credits > 1: good (14.0)
| | duration > 11
| | | job = unemp/unskilled non res: bad (5.0/1.0)
| | | job = unskilled resident
| | | | purpose = new car
| | | | | own_telephone = none: bad (10.0/2.0)
| | | | | own_telephone = yes: good (2.0)
| | | | purpose = used car: bad (1.0)
| | | | purpose = furniture/equipment
| | | | | employment = unemployed: good (0.0)
| | | | | employment = <1: bad (3.0)
| | | | | employment = 1<=X<4: good (4.0)
| | | | | employment = 4<=X<7: good (1.0)
| | | | | employment = >=7: good (2.0)
| | | | purpose = radio/tv
| | | | | existing_credits <= 1: bad (10.0/3.0)
| | | | | existing_credits > 1: good (2.0)
| | | | purpose = domestic appliance: bad (1.0)
| | | | purpose = repairs: bad (1.0)
| | | | purpose = education: bad (1.0)
| | | | purpose = vacation: bad (0.0)
| | | | purpose = retraining: good (1.0)
| | | | purpose = business: good (3.0)
| | | | purpose = other: good (1.0)
| | | job = skilled
| | | | other_parties = none
| | | | | duration <= 30
| | | | | | savings_status = <100
| | | | | | | credit_history = no credits/all paid: bad (8.0/1.0)
| | | | | | | credit_history = all paid: bad (6.0)
| | | | | | | credit_history = existing paid
| | | | | | | | own_telephone = none
| | | | | | | | | existing_credits <= 1
| | | | | | | | | | property_magnitude = real estate
| | | | | | | | | | | age <= 26: bad (5.0)
| | | | | | | | | | | age > 26: good (2.0)
| | | | | | | | | | property_magnitude = life insurance: bad (7.0/2.0)
| | | | | | | | | | property_magnitude = car
| | | | | | | | | | | credit_amount <= 1386: bad (3.0)
| | | | | | | | | | | credit_amount > 1386: good (11.0/1.0)
| | | | | | | | | | property_magnitude = no known property: good (2.0)
| | | | | | | | | existing_credits > 1: bad (3.0)
| | | | | | | | own_telephone = yes: bad (5.0)
| | | | | | | credit_history = delayed previously: bad (4.0)
| | | | | | | credit_history = critical/other existing credit: good (14.0/4.0)
| | | | | | savings_status = 100<=X<500
| | | | | | | credit_history = no credits/all paid: good (0.0)
| | | | | | | credit_history = all paid: good (1.0)
| | | | | | | credit_history = existing paid: bad (3.0)
| | | | | | | credit_history = delayed previously: good (0.0)
| | | | | | | credit_history = critical/other existing credit: good (2.0)
| | | | | | savings_status = 500<=X<1000: good (4.0/1.0)
| | | | | | savings_status = >=1000: good (4.0)
| | | | | | savings_status = no known savings
| | | | | | | existing_credits <= 1
| | | | | | | | own_telephone = none: bad (9.0/1.0)
| | | | | | | | own_telephone = yes: good (4.0/1.0)
| | | | | | | existing_credits > 1: good (2.0)
| | | | | duration > 30: bad (30.0/3.0)
| | | | other_parties = co applicant: bad (7.0/1.0)
| | | | other_parties = guarantor: good (12.0/3.0)
| | | job = high qualif/self emp/mgmt: good (30.0/8.0)
| foreign_worker = no: good (15.0/2.0)
checking_status = 0<=X<200
| credit_amount <= 9857
| | savings_status = <100
| | | other_parties = none
| | | | duration <= 42
| | | | | personal_status = male div/sep: bad (8.0/2.0)
| | | | | personal_status = female div/dep/mar
| | | | | | purpose = new car: bad (5.0/1.0)
| | | | | | purpose = used car: bad (1.0)
| | | | | | purpose = furniture/equipment
| | | | | | | duration <= 10: bad (3.0)
| | | | | | | duration > 10
| | | | | | | | duration <= 21: good (6.0/1.0)
| | | | | | | | duration > 21: bad (2.0)
| | | | | | purpose = radio/tv: good (8.0/2.0)
| | | | | | purpose = domestic appliance: good (0.0)
| | | | | | purpose = repairs: good (1.0)
| | | | | | purpose = education: good (4.0/2.0)
| | | | | | purpose = vacation: good (0.0)
| | | | | | purpose = retraining: good (0.0)
| | | | | | purpose = business
| | | | | | | residence_since <= 2: good (3.0)
| | | | | | | residence_since > 2: bad (2.0)
| | | | | | purpose = other: good (0.0)
| | | | | personal_status = male single: good (52.0/15.0)
| | | | | personal_status = male mar/wid
| | | | | | duration <= 10: good (6.0)
| | | | | | duration > 10: bad (10.0/3.0)
| | | | | personal_status = female single: good (0.0)
| | | | duration > 42: bad (7.0)
| | | other_parties = co applicant: good (2.0)
| | | other_parties = guarantor
| | | | purpose = new car: bad (2.0)
| | | | purpose = used car: good (0.0)
| | | | purpose = furniture/equipment: good (0.0)
| | | | purpose = radio/tv: good (18.0/1.0)
| | | | purpose = domestic appliance: good (0.0)
| | | | purpose = repairs: good (0.0)
| | | | purpose = education: good (0.0)
| | | | purpose = vacation: good (0.0)
| | | | purpose = retraining: good (0.0)
| | | | purpose = business: good (0.0)
| | | | purpose = other: good (0.0)
| | savings_status = 100<=X<500
| | | purpose = new car: bad (15.0/5.0)
| | | purpose = used car: good (3.0)
| | | purpose = furniture/equipment: bad (4.0/1.0)
| | | purpose = radio/tv: bad (8.0/2.0)
| | | purpose = domestic appliance: good (0.0)
| | | purpose = repairs: good (2.0)
| | | purpose = education: good (0.0)
| | | purpose = vacation: good (0.0)
| | | purpose = retraining: good (0.0)
| | | purpose = business
| | | | housing = rent
| | | | | existing_credits <= 1: good (2.0)
| | | | | existing_credits > 1: bad (2.0)
| | | | housing = own: good (6.0)
| | | | housing = for free: bad (1.0)
| | | purpose = other: good (1.0)
| | savings_status = 500<=X<1000: good (11.0/3.0)
| | savings_status = >=1000: good (13.0/3.0)
| | savings_status = no known savings: good (41.0/5.0)
| credit_amount > 9857: bad (20.0/3.0)
checking_status = >=200: good (63.0/14.0)
checking_status = no checking: good (394.0/46.0)
Number of Leaves : 103
Size of the tree : 140
Time taken to build model: 0.07 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 705 70.5 %
Incorrectly Classified Instances 295 29.5 %
Kappa statistic 0.2467
Mean absolute error 0.3467
Root mean squared error 0.4796
Relative absolute error 82.5233 %
Root relative squared error 104.6565 %
Total Number of Instances 1000
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.84     0.61     0.763      0.84    0.799      0.639     good
               0.39     0.16     0.511      0.39    0.442      0.639     bad
Weighted Avg.  0.705    0.475    0.687      0.705   0.692      0.639
=== Confusion Matrix ===
a b <-- classified as
588 112 | a = good
183 117 | b = bad
7. Check to see if the data shows a bias against "foreign workers" (attribute 20), or "personal-status" (attribute 9). One way to do this (perhaps rather simple minded) is to remove these
attributes from the dataset and see if the decision tree created in those cases is significantly
different from the full dataset case which you have already done. To remove an attribute you
can use the preprocess tab in Weka's GUI Explorer. Did removing these attributes have any
significant effect? Discuss.
The increase in accuracy arises because these two attributes are not very important for training and
analysis. Removing them reduces the time taken to some extent and results in an increase
in accuracy.
The decision tree created from the full dataset is very large compared to the decision tree we have
trained now. This is the main difference between the two decision trees.
Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select Preprocess Tab.
5) Go to "Open file" and browse to the file already stored on the system, "German Credit
Data.csv".
6) In the "Filter" panel, click on the "Choose" button. This will show a popup window with a list
of available filters.
7) Select "weka.filters.unsupervised.attribute.Remove".
8) Next, click on text box immediately to the right of the "Choose" button
9) In the resulting dialog box enter the index of the attribute to be filtered out (Make sure that the
"Invert Selection" option is set to false)
10) Then click "OK". Now, in the filter box you will see "Remove -R 1"
11) Click the "Apply" button to apply this filter to the data. This will remove the "id" attribute and
create a new working relation
12) To save the new working relation as an ARFF file, click on save button in the top panel.
13) Go to OPEN file and browse the file that is newly saved (attribute deleted file)
14) Go to Classify tab.
15) Choose the classifier group "trees".
16) Select the J48 tree.
17) Select the test option "Use training set".
18) If needed, select the attribute to use as the class.
19) Click Start.
20) The output details appear in the Classifier output panel.
21) Right-click the entry in the result list and select the "Visualize tree" option.
22) Compare the output results with those of the 4th experiment.
23) Check whether the accuracy increased or decreased.
24) Check whether removing these attributes has any significant effect.
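What the Remove filter does to the data can be sketched in Python as follows. This is an illustration only, not Weka code: the function name is an assumption, attribute positions are 1-based as in the filter's "-R" option, and the four-attribute data in the usage part is hypothetical.

```python
def remove_attributes(header, rows, indices_to_remove):
    """Drop the attributes at the given 1-based positions from header and rows."""
    drop = {i - 1 for i in indices_to_remove}          # convert to 0-based positions
    keep = [i for i in range(len(header)) if i not in drop]
    new_header = [header[i] for i in keep]
    new_rows = [[row[i] for i in keep] for row in rows]
    return new_header, new_rows


if __name__ == "__main__":
    # Hypothetical miniature dataset with four attributes.
    header = ["duration", "personal_status", "age", "class"]
    rows = [[6, "male single", 67, "good"],
            [48, "female div/dep/mar", 22, "bad"]]
    print(remove_attributes(header, rows, [2]))
```

For the full credit dataset, the same call with positions 9 and 20 would correspond to dropping the personal status and foreign worker attributes named in the question.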
Visualize results:-
Visualize classifier errors:
Visualize tree
The difference we observed is that the accuracy improved.
8. Another question might be, do you really need to input so many attributes to get good
results? Maybe only a few would do. For example, you could try just having attributes 2, 3, 5,
7, 10, 17 (and 21, the class attribute (naturally)). Try out some combinations. (You had
removed two attributes in problem 7. Remember to reload the arff data file to get all the
attributes initially before you start selecting the ones you want.)
Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select Preprocess Tab.
5) Go to "Open file" and browse to the file already stored on the system, "bank.csv".
6) Select the attributes to be removed from the attributes list and remove them. After this step, only the
attributes necessary for classification are left in the attributes panel.
7) Then go to the Classify tab.
8) Choose the classifier group "trees".
9) Select J48.
10) Select the test option "Use training set".
11) If needed, select the attribute to use as the class.
12) Click Start.
13) The output details now appear in the Classifier output panel.
14) Right-click the entry in the result list and select the "Visualize tree" option.
15) Compare the output results with those of the 4th experiment.
16) Check whether the accuracy increased or decreased.
17) Check whether removing these attributes has any significant effect.
=== Classifier model (full training set) ===
J48 pruned tree
------------------
credit_history = no credits/all paid: bad (40.0/15.0)
credit_history = all paid
| employment = unemployed
| | duration <= 36: bad (3.0)
| | duration > 36: good (2.0)
| employment = <1
| | duration <= 26: bad (7.0/1.0)
| | duration > 26: good (2.0)
| employment = 1<=X<4: good (15.0/6.0)
| employment = 4<=X<7: bad (10.0/4.0)
| employment = >=7
| | job = unemp/unskilled non res: bad (0.0)
| | job = unskilled resident: good (3.0)
| | job = skilled: bad (3.0)
| | job = high qualif/self emp/mgmt: bad (4.0)
credit_history = existing paid
| credit_amount <= 8648
| | duration <= 40: good (476.0/130.0)
| | duration > 40: bad (27.0/8.0)
| credit_amount > 8648: bad (27.0/7.0)
credit_history = delayed previously
| employment = unemployed
| | credit_amount <= 2186: bad (4.0/1.0)
| | credit_amount > 2186: good (2.0)
| employment = <1
| | duration <= 18: good (2.0)
| | duration > 18: bad (10.0/2.0)
| employment = 1<=X<4: good (33.0/6.0)
| employment = 4<=X<7
| | credit_amount <= 4530
| | | credit_amount <= 1680: good (3.0)
| | | credit_amount > 1680: bad (3.0)
| | credit_amount > 4530: good (11.0)
| employment = >=7
| | job = unemp/unskilled non res: good (0.0)
| | job = unskilled resident: good (2.0/1.0)
| | job = skilled: good (14.0/4.0)
| | job = high qualif/self emp/mgmt: bad (4.0/1.0)
credit_history = critical/other existing credit: good (293.0/50.0)
Number of Leaves : 27
Size of the tree : 40
Time taken to build model: 0.01 seconds
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances 764 76.4 %
Incorrectly Classified Instances 236 23.6 %
Kappa statistic 0.3386
Mean absolute error 0.3488
Root mean squared error 0.4176
Relative absolute error 83.0049 %
Root relative squared error 91.1243 %
Total Number of Instances 1000
9. Sometimes, the cost of rejecting an applicant who actually has a good credit (case 1) might
be higher than accepting an applicant who has bad credit (case 2). Instead of counting the
misclassifications equally in both cases, give a higher cost to the first case (say cost 5) and
lower cost to the second case. You can do this by using a cost matrix in Weka. Train your
Decision Tree again and report the Decision Tree and cross-validation results. Are they
significantly different from results obtained in problem 6 (using equal cost)?
In problem 6, we used equal costs when training the decision tree. Here, we consider two
cases with different costs.
Let us take cost 5 in case 1 and cost 2 in case 2.
When we assign these costs in both cases and train the decision tree, we observe that the tree is
almost identical to the decision tree obtained in problem 6.
The difference appears in the cost figures reported in the summary:
              Case 1 (cost 5)  Case 2 (cost 2)
Total Cost    3820             1705
Average cost  3.82             1.705
We do not find these cost figures in problem 6, as equal costs were used there. This is the major difference
between the results of problem 6 and problem 9.
The cost matrices we used here:
Case 1: 5 1
        1 5
Case 2: 2 1
        1 2
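The total and average cost figures can be reproduced by weighting each cell of a confusion matrix by the matching cell of the cost matrix. The sketch below assumes the confusion matrix is the one from the problem 6 cross-validation run (588, 112 / 183, 117); that pairing is an assumption, although it does reproduce the reported totals. The function name is illustrative.

```python
def total_cost(confusion, cost):
    """Sum of (instance count * cost) over every actual/predicted cell."""
    return sum(
        confusion[i][j] * cost[i][j]
        for i in range(len(confusion))
        for j in range(len(confusion[i]))
    )


# Assumed: confusion matrix from the problem 6 cross-validation output
# (rows = actual good/bad, columns = predicted good/bad).
CV_CONFUSION = [[588, 112],
                [183, 117]]
CASE_1 = [[5, 1], [1, 5]]   # cost matrix for case 1, as listed above
CASE_2 = [[2, 1], [1, 2]]   # cost matrix for case 2, as listed above

if __name__ == "__main__":
    n = sum(sum(row) for row in CV_CONFUSION)
    for name, cost in (("Case 1", CASE_1), ("Case 2", CASE_2)):
        total = total_cost(CV_CONFUSION, cost)
        print(f"{name}: total cost = {total}, average cost = {total / n}")
```

Note that these matrices place the larger cost on the diagonal, i.e. on correctly classified instances. A cost matrix that penalizes only misclassifications would instead have zeros on the diagonal and the costs 5 and 1 in the off-diagonal cells.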
Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select Preprocess Tab.
5) Go to "Open file" and browse to the file already stored on the system, "bank.csv".
6) Go to the Classify tab.
7) Choose the classifier group "trees".
8) Select J48.
9) Select the test option "Use training set".
10) Click on "More options".
11) Select cost-sensitive evaluation and click on the Set button.
12) Set the matrix values and click on Resize. Then close the window.
13) Click OK.
14) Click Start.
15) The output details appear in the Classifier output panel.
16) Select the test option "Cross-validation".
17) Set "Folds", for example to 10.
18) If needed, select the attribute to use as the class.
19) Click Start.
20) The output details now appear in the Classifier output panel.
21) Compare the results of steps 15 and 20.
22) Compare the results with those of experiment 6.
Sample output:
10. Do you think it is a good idea to prefer simple decision trees instead of having long
complex decision trees? How does the complexity of a Decision Tree relate to the bias of the
model?
A long, complex decision tree tends to include many unnecessary attributes and fits the training data
very closely. Such a tree has low bias but high variance: it overfits, so its accuracy on unseen data
can suffer.
This problem can be reduced by preferring a simpler decision tree. With fewer attributes the model has
somewhat higher bias but lower variance, so it usually generalizes better to new data.
So it is a good idea to prefer simple decision trees over long, complex ones, provided the tree is not
so small that it underfits.
Aim: To check whether short rules or long rules are better, and how this affects the bias, by training
on the data set using the Weka mining tool.
Tools/ Apparatus: Weka mining tool.
Procedure:
The choice depends on the attribute set and on the relationships among attributes that we want to
study. It should be judged against the database and the user's requirements.
11. You can make your Decision Trees simpler by pruning the nodes. One approach is to use
Reduced Error Pruning - Explain this idea briefly. Try reduced error pruning for training
your Decision Trees using cross-validation (you can do this in Weka) and report the Decision
Tree you obtain? Also, report your accuracy using the pruned model. Does your accuracy
increase?
Reduced-error pruning: The idea of using a separate pruning set for pruning (which is applicable to
decision trees as well as rule sets) is called reduced-error pruning. The variant described previously
prunes a rule
immediately after it has been grown and is called incremental reduced-error pruning. Another
possibility is to build a full, unpruned rule set first, pruning it afterwards by discarding individual
tests.
However, this method is much slower. Of course, there are many different ways to assess the worth
of a rule based on the pruning set. A simple measure is to consider how well the rule would do at
discriminating the predicted class from other classes if it were the only rule in the theory, operating
under the closed world assumption. If it gets p instances right out of the t instances that it covers,
and there are P instances of this class out of a total T of instances altogether, then it gets p positive
instances right. The instances that it does not cover include N - n negative ones, where n = t – p is
the number of negative instances that the rule covers and N = T - P is the total number of negative
instances. Thus the rule has an overall success ratio of [p + (N - n)] / T, and this quantity, evaluated
on the test set, has been used to evaluate the success of a rule when using reduced-error pruning.
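As a worked example of the success ratio [p + (N - n)] / T, the short sketch below evaluates it for one rule. The counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def success_ratio(p, t, P, T):
    """Overall success ratio of a rule under the closed world assumption.

    p: covered instances the rule gets right
    t: instances the rule covers
    P: instances of the predicted class in the whole set
    T: total number of instances
    """
    n = t - p          # negative instances the rule wrongly covers
    N = T - P          # negative instances overall
    return (p + (N - n)) / T

# Hypothetical rule: covers 12 instances, 10 of them correctly;
# the predicted class has 40 instances out of 100.
ratio = success_ratio(p=10, t=12, P=40, T=100)
print(ratio)  # (10 + (60 - 2)) / 100 = 0.68
```

The rule is credited both for the positives it covers correctly (p) and for the negatives it correctly leaves uncovered (N - n).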
Aim: To create a Decision tree by using Prune mode and Reduced error Pruning and show
accuracy for cross validation trained data set using Weka mining tool.
Tools/ Apparatus: Weka mining tool.
Theory :
Reduced-error pruning
• Each node of the (over-fit) tree is examined for pruning.
• A node is pruned (removed) only if the resulting pruned tree performs no worse than the original
  over the validation set.
• Pruning a node consists of:
    • Removing the sub-tree rooted at the pruned node
    • Making the pruned node a leaf node
    • Assigning the pruned node the most common classification of the training instances attached to
      that node
• Pruning nodes iteratively:
    • Always select a node whose removal most increases the DT accuracy over the validation set
    • Stop when further pruning decreases the DT accuracy over the validation set
IF (Children = yes) ∧ (income >= 30000)
THEN (car = Yes)
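The pruning loop described in the theory above can be sketched on a toy tree as follows. This is a minimal illustration, not Weka's implementation: the nested-dictionary tree format, the attribute names and the four validation rows are all assumptions made up for the example.

```python
import copy

def classify(node, row):
    """Follow branches until a leaf; unseen values fall back to the majority class."""
    while "label" not in node:
        child = node["children"].get(row[node["attr"]])
        if child is None:
            return node["majority"]
        node = child
    return node["label"]

def accuracy(tree, rows):
    return sum(classify(tree, r) == r["class"] for r in rows) / len(rows)

def internal_paths(node, path=()):
    """Yield the branch-value path to every internal (non-leaf) node."""
    if "label" in node:
        return
    yield path
    for value, child in node["children"].items():
        yield from internal_paths(child, path + (value,))

def pruned_copy(tree, path):
    """Copy the tree, replacing the node at `path` by a majority-class leaf."""
    new_tree = copy.deepcopy(tree)
    if not path:
        return {"label": new_tree["majority"]}
    parent = new_tree
    for value in path[:-1]:
        parent = parent["children"][value]
    target = parent["children"][path[-1]]
    parent["children"][path[-1]] = {"label": target["majority"]}
    return new_tree

def reduced_error_prune(tree, validation):
    """Repeatedly apply the single best prune while validation accuracy does not drop."""
    best_acc = accuracy(tree, validation)
    while True:
        candidates = [pruned_copy(tree, p) for p in internal_paths(tree)]
        if not candidates:
            return tree
        best = max(candidates, key=lambda t: accuracy(t, validation))
        acc = accuracy(best, validation)
        if acc < best_acc:          # every possible prune makes things worse
            return tree
        tree, best_acc = best, acc

# Hypothetical over-fit tree: the "credit" test under student = no is spurious.
tree = {
    "attr": "student", "majority": "yes",
    "children": {
        "yes": {"label": "yes"},
        "no": {"attr": "credit", "majority": "no",
               "children": {"fair": {"label": "no"},
                            "excellent": {"label": "yes"}}},
    },
}
validation = [
    {"student": "yes", "credit": "fair", "class": "yes"},
    {"student": "yes", "credit": "excellent", "class": "yes"},
    {"student": "no", "credit": "fair", "class": "no"},
    {"student": "no", "credit": "excellent", "class": "no"},
]

pruned = reduced_error_prune(tree, validation)
print(accuracy(tree, validation), accuracy(pruned, validation))
```

On this toy data the sub-tree under student = no is collapsed into a single "no" leaf, which raises validation accuracy, while pruning the root would lower it and is therefore rejected.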
Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select the Preprocess tab.
5) Go to Open file and browse to the file that is already stored in the system, "bank.csv".
6) Select some of the attributes from the attributes list.
7) Go to the Classify tab.
8) Choose the classifier group "trees".
9) Select "J48".
10) Select Test options "Use training set".
11) Right-click on the text box beside the Choose button and select "Show properties".
12) Leave "unpruned" set to "False" and change "reducedErrorPruning" from "False" to "True".
13) Change "numFolds" as needed; it determines how much of the data is held out for reduced-error
pruning.
14) If needed, select the attribute.
15) Click Start.
16) Now we can see the output details in the Classifier output.
17) Right-click on the result list and select the "Visualize tree" option.
Sample output:
J48 pruned tree
------------------
checking_status = <0
| foreign_worker = yes
| | credit_history = no credits/all paid: bad (11.0/3.0)
| | credit_history = all paid: bad (9.0/1.0)
| | credit_history = existing paid
| | | other_parties = none
| | | | savings_status = <100
| | | | | existing_credits <= 1
| | | | | | purpose = new car: bad (17.0/4.0)
| | | | | | purpose = used car: good (3.0/1.0)
| | | | | | purpose = furniture/equipment: good (22.0/11.0)
| | | | | | purpose = radio/tv: good (18.0/8.0)
| | | | | | purpose = domestic appliance: bad (2.0)
| | | | | | purpose = repairs: bad (1.0)
| | | | | | purpose = education: bad (5.0/1.0)
| | | | | | purpose = vacation: bad (0.0)
| | | | | | purpose = retraining: bad (0.0)
| | | | | | purpose = business: good (3.0/1.0)
| | | | | | purpose = other: bad (0.0)
| | | | | existing_credits > 1: bad (5.0)
| | | | savings_status = 100<=X<500: bad (8.0/3.0)
| | | | savings_status = 500<=X<1000: good (1.0)
| | | | savings_status = >=1000: good (2.0)
| | | | savings_status = no known savings
| | | | | job = unemp/unskilled non res: bad (0.0)
| | | | | job = unskilled resident: good (2.0)
| | | | | job = skilled
| | | | | | own_telephone = none: bad (4.0)
| | | | | | own_telephone = yes: good (3.0/1.0)
| | | | | job = high qualif/self emp/mgmt: bad (3.0/1.0)
| | | other_parties = co applicant: good (4.0/2.0)
| | | other_parties = guarantor: good (8.0/1.0)
| | credit_history = delayed previously: bad (7.0/2.0)
| | credit_history = critical/other existing credit: good (38.0/10.0)
| foreign_worker = no: good (12.0/2.0)
checking_status = 0<=X<200
| other_parties = none
| | credit_history = no credits/all paid
| | | other_payment_plans = bank: good (2.0/1.0)
| | | other_payment_plans = stores: bad (0.0)
| | | other_payment_plans = none: bad (7.0)
| | credit_history = all paid: bad (10.0/4.0)
| | credit_history = existing paid
| | | credit_amount <= 8858: good (70.0/21.0)
| | | credit_amount > 8858: bad (8.0)
| | credit_history = delayed previously: good (25.0/6.0)
| | credit_history = critical/other existing credit: good (26.0/7.0)
| other_parties = co applicant: bad (7.0/1.0)
| other_parties = guarantor: good (18.0/4.0)
checking_status = >=200: good (44.0/9.0)
checking_status = no checking
| other_payment_plans = bank: good (30.0/10.0)
| other_payment_plans = stores: good (12.0/2.0)
| other_payment_plans = none
| | credit_history = no credits/all paid: good (4.0)
| | credit_history = all paid: good (1.0)
| | credit_history = existing paid
| | | existing_credits <= 1: good (92.0/7.0)
| | | existing_credits > 1
| | | | installment_commitment <= 2: bad (4.0/1.0)
| | | | installment_commitment > 2: good (5.0)
| | credit_history = delayed previously: good (22.0/6.0)
| | credit_history = critical/other existing credit: good (92.0/3.0)
Number of Leaves : 47
Size of the tree : 64
Time taken to build model: 0.49 seconds
=== Stratified cross-validation ===
=== Summary ===
Correctly Classified Instances 725 72.5 %
Incorrectly Classified Instances 275 27.5 %
Kappa statistic 0.2786
Mean absolute error 0.3331
Root mean squared error 0.4562
Relative absolute error 79.2826 %
Root relative squared error 99.5538 %
Total Number of Instances 1000
12. (Extra Credit): How can you convert a Decision Trees into "if-then-else rules". Make up
your own small Decision Tree consisting of 2-3 levels and convert it into a set of rules. There
also exist different classifiers that output the model in the form of rules - one such classifier in
Weka is rules.PART, train this model and report the set of rules obtained. Sometimes just one
attribute can be good enough in making the decision, yes, just one ! Can you predict what
attribute that might be in this dataset? OneR classifier uses a single attribute to make
decisions (it chooses the attribute based on minimum error). Report the rule obtained by
training a one R classifier. Rank the performance of j48, PART and oneR.
In Weka, rules.PART is one of the classifiers that converts decision trees into "IF-THEN-ELSE" rules.
Converting Decision trees into "IF-THEN-ELSE" rules using the rules.PART classifier:
PART decision list
outlook = overcast: yes (4.0)
windy = TRUE: no (4.0/1.0)
outlook = sunny: no (3.0/1.0)
: yes (3.0)
Number of Rules : 4
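The question also asks for a small hand-made tree converted into rules. A minimal sketch of that conversion: every path from the root to a leaf becomes one IF-THEN rule. The tuple-based tree format and the toy tree itself are assumptions chosen for illustration.

```python
def tree_to_rules(node, conditions=()):
    """Turn each root-to-leaf path into one IF-THEN rule."""
    if isinstance(node, str):                      # a leaf holds the class label
        body = " AND ".join(f"{a} = {v}" for a, v in conditions) or "TRUE"
        return [f"IF {body} THEN class = {node}"]
    attribute, branches = node                     # an internal node tests one attribute
    rules = []
    for value, child in branches.items():
        rules += tree_to_rules(child, conditions + ((attribute, value),))
    return rules

# Hypothetical two-level tree: (attribute, {value: subtree or class label}).
toy_tree = ("age", {
    "youth": ("student", {"no": "no", "yes": "yes"}),
    "middle_aged": "yes",
    "senior": ("credit_rating", {"fair": "yes", "excellent": "no"}),
})

rules = tree_to_rules(toy_tree)
for rule in rules:
    print(rule)
```

A tree with five leaves yields five rules. Because the rules come from disjoint paths, they can be applied in any order, unlike a PART decision list, which must be read top to bottom.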
Yes, sometimes just one attribute can be good enough in making the decision.
In this dataset (Weather), the single attribute used for making the decision is "outlook":
outlook:
sunny -> no
overcast -> yes
rainy -> yes
(10/14 instances correct)
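The OneR idea (build a one-attribute rule for each attribute, then keep the attribute with the fewest errors) can be sketched as below. The 14 rows are the nominal weather sample data as commonly distributed with Weka. They are not reproduced in this manual, so treat them as an assumption. The sketch ignores OneR's handling of numeric attributes and missing values.

```python
from collections import Counter, defaultdict

ATTRIBUTES = ["outlook", "temperature", "humidity", "windy"]
# Columns: outlook, temperature, humidity, windy, play (class).
ROWS = [row.split(",") for row in """\
sunny,hot,high,FALSE,no
sunny,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
rainy,mild,high,FALSE,yes
rainy,cool,normal,FALSE,yes
rainy,cool,normal,TRUE,no
overcast,cool,normal,TRUE,yes
sunny,mild,high,FALSE,no
sunny,cool,normal,FALSE,yes
rainy,mild,normal,FALSE,yes
sunny,mild,normal,TRUE,yes
overcast,mild,high,TRUE,yes
overcast,hot,normal,FALSE,yes
rainy,mild,high,TRUE,no""".splitlines()]

def one_r(rows, attributes):
    """Return (attribute, rule, correct) for the best single-attribute rule."""
    best = None
    for index, name in enumerate(attributes):
        counts = defaultdict(Counter)              # value -> class frequencies
        for row in rows:
            counts[row[index]][row[-1]] += 1
        # Each attribute value predicts its most frequent class.
        rule = {value: c.most_common(1)[0][0] for value, c in counts.items()}
        correct = sum(c.most_common(1)[0][1] for c in counts.values())
        if best is None or correct > best[2]:      # ties keep the earlier attribute
            best = (name, rule, correct)
    return best

attribute, rule, correct = one_r(ROWS, ATTRIBUTES)
print(attribute, rule, f"({correct}/{len(ROWS)} instances correct)")
```

On this data "humidity" also classifies 10 of 14 instances correctly. The sketch keeps "outlook" only because it comes first, which matches the rule reported above.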
With respect to time, the oneR classifier ranks first, J48 is in 2nd place and PART gets 3rd place.
              J48     PART    oneR
TIME (sec)    0.12    0.14    0.04
RANK          II      III     I
But if you consider accuracy, the J48 classifier ranks first, PART gets second place and oneR gets
last place.
                J48     PART    oneR
ACCURACY (%)    70.5    70.2    66.8
RANK            I       II      III
Aim: To compare the OneR classifier, which uses a single attribute and rule, with the J48 and PART
classifiers, by training on the data set using the Weka mining tool.
Tools/ Apparatus: Weka mining tool.
Procedure:
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select the Preprocess tab.
5) Go to Open file and browse to the file that is already stored in the system, "bank.csv".
6) Select some of the attributes from the attributes list.
7) Go to the Classify tab.
8) Choose the classifier group "trees".
9) Select "J48".
10) Select Test options "Use training set".
11) If needed, select the attribute.
12) Click Start.
13) Now we can see the output details in the Classifier output.
14) Right-click on the result list and select the "Visualize tree" option.
(or)
java weka.classifiers.trees.J48 -t c:\temp\bank.arff
Procedure for "OneR":
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select the Preprocess tab.
5) Go to Open file and browse to the file that is already stored in the system, "bank.csv".
6) Select some of the attributes from the attributes list.
7) Go to the Classify tab.
8) Choose the classifier group "rules".
9) Select "OneR".
10) Select Test options "Use training set".
11) If needed, select the attribute.
12) Click Start.
13) Now we can see the output details in the Classifier output.
Procedure for "PART":
1) Given the Bank database for mining.
2) Use the Weka GUI Chooser.
3) Select EXPLORER present in Applications.
4) Select the Preprocess tab.
5) Go to Open file and browse to the file that is already stored in the system, "bank.csv".
6) Select some of the attributes from the attributes list.
7) Go to the Classify tab.
8) Choose the classifier group "rules".
9) Select "PART".
10) Select Test options "Use training set".
11) If needed, select the attribute.
12) Click Start.
13) Now we can see the output details in the Classifier output.
Attribute relevance with respect to the class – relevant attribute (science)
IF accounting=1 THEN class=A (Error=0, Coverage = 7 instance)
IF accounting=0 THEN class=B (Error=4/13, Coverage = 13 instances)
Sample output:
J48
java weka.classifiers.trees.J48 -t c:/temp/bank.arff
One R
PART
Extra Experiments:
13. Generate Association rules for the following transactional database using Apriori
algorithm.
TID     List of Items
T100    I1,I2,I5
T200    I2,I4
T300    I2,I3
T400    I1,I2,I4
T500    I1,I3
T600    I2,I3
T700    I1,I3
T800    I1,I2,I3,I5
T900    I1,I2,I3
Step-1
Create an Excel document of the above data and save it as a CSV (comma delimited) file:
Tid     I1    I2    I3    I4    I5
T100    yes   yes   no    no    yes
T200    no    yes   no    yes   no
T300    no    yes   yes   no    no
T400    yes   yes   no    yes   no
T500    yes   no    yes   no    no
T600    no    yes   yes   no    no
T700    yes   no    yes   no    no
T800    yes   yes   yes   no    yes
T900    yes   yes   yes   no    no
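The yes/no table can also be produced without a spreadsheet. The sketch below builds the same CSV text from the transaction lists using only the Python standard library. The helper name `to_csv` is an assumption. Writing the text to a file such as a local "transactions.csv" is left to the reader.

```python
import csv
import io

transactions = {
    "T100": ["I1", "I2", "I5"],
    "T200": ["I2", "I4"],
    "T300": ["I2", "I3"],
    "T400": ["I1", "I2", "I4"],
    "T500": ["I1", "I3"],
    "T600": ["I2", "I3"],
    "T700": ["I1", "I3"],
    "T800": ["I1", "I2", "I3", "I5"],
    "T900": ["I1", "I2", "I3"],
}
items = ["I1", "I2", "I3", "I4", "I5"]

def to_csv(transactions, items):
    """Return CSV text with one yes/no column per item."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Tid"] + items)
    for tid, bought in transactions.items():
        writer.writerow([tid] + ["yes" if item in bought else "no" for item in items])
    return buffer.getvalue()

csv_text = to_csv(transactions, items)
print(csv_text)
```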
Step-2
Open the WEKA tool and then click on Explorer.
Step-3
Click on the Open file button and select the file from the location where the (.csv) file was saved
earlier.
Step-4
Click on the Associate tab in the top row of tabs, then choose Apriori and click OK.
Step-5
Then click the Start button and see the generated association rules:
=== Run information ===
Scheme:       weka.associations.Apriori -N 10 -T 0 -C 0.9 -D 0.05 -U 1.0 -M 0.1 -S -1.0 -c -1
Relation:     customer
Instances:    9
Attributes:   6
              Tid
              I1
              I2
              I3
              I4
              I5
=== Associator model (full training set) ===
Apriori
=======
Minimum support: 0.35 (3 instances)
Minimum metric <confidence>: 0.9
Number of cycles performed: 13
Generated sets of large itemsets:
Size of set of large itemsets L(1): 7
Size of set of large itemsets L(2): 13
Size of set of large itemsets L(3): 9
Size of set of large itemsets L(4): 2
Best rules found:
1. I3=yes 6 ==> I4=no 6 conf:(1)
2. I4=no I5=no 5 ==> I3=yes 5 conf:(1)
3. I3=yes I5=no 5 ==> I4=no 5 conf:(1)
4. I1=yes I3=yes 4 ==> I4=no 4 conf:(1)
5. I2=yes I3=yes 4 ==> I4=no 4 conf:(1)
6. I1=no 3 ==> I2=yes 3 conf:(1)
7. I1=no 3 ==> I5=no 3 conf:(1)
8. I3=no 3 ==> I2=yes 3 conf:(1)
9. I1=no I5=no 3 ==> I2=yes 3 conf:(1)
10. I1=no I2=yes 3 ==> I5=no 3 conf:(1)
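Weka's run above treats every yes/no column as an attribute value, which is why rules such as I3=yes ==> I4=no appear. The sketch below instead follows the classical market-basket reading of Apriori, in which only the items present in a transaction count. The minimum support count of 2 is an assumption, not a setting taken from the run above.

```python
from itertools import combinations

transactions = {
    "T100": {"I1", "I2", "I5"}, "T200": {"I2", "I4"}, "T300": {"I2", "I3"},
    "T400": {"I1", "I2", "I4"}, "T500": {"I1", "I3"}, "T600": {"I2", "I3"},
    "T700": {"I1", "I3"}, "T800": {"I1", "I2", "I3", "I5"},
    "T900": {"I1", "I2", "I3"},
}

def support_count(itemset):
    """Number of transactions that contain every item of the itemset."""
    return sum(itemset <= items for items in transactions.values())

def apriori(min_count):
    """Level-wise search: returns {frequent itemset: support count}."""
    all_items = sorted(set().union(*transactions.values()))
    level = [frozenset([i]) for i in all_items
             if support_count(frozenset([i])) >= min_count]
    frequent = {}
    k = 1
    while level:
        for itemset in level:
            frequent[itemset] = support_count(itemset)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into candidates of size k.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        level = [c for c in candidates
                 if all(frozenset(s) in frequent for s in combinations(c, k - 1))
                 and support_count(c) >= min_count]
    return frequent

frequent = apriori(min_count=2)
for itemset, count in sorted(frequent.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), count)

# Confidence of the rule {I1, I5} => {I2}: support(I1, I2, I5) / support(I1, I5).
confidence = frequent[frozenset({"I1", "I2", "I5"})] / frequent[frozenset({"I1", "I5"})]
print("conf({I1, I5} => {I2}) =", confidence)
```

A rule is reported as strong when its confidence meets the chosen minimum confidence. Here both transactions containing I1 and I5 also contain I2, so the confidence is 1.0.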
14. Generate classification rules for the following data base using decision tree (J48).
RID  Age          Income  Student  Credit_rating  Class:buys_computer
1    Youth        High    No       Fair           No
2    Youth        High    No       Excellent      No
3    Middle_aged  High    No       Fair           Yes
4    Senior       Medium  No       Fair           Yes
5    Senior       Low     Yes      Fair           Yes
6    Senior       Low     Yes      Excellent      No
7    Middle_aged  Low     Yes      Excellent      Yes
8    Youth        Medium  No       Fair           No
9    Youth        Low     Yes      Fair           Yes
10   Senior       Medium  Yes      Fair           Yes
11   Youth        Medium  Yes      Excellent      Yes
12   Middle_aged  Medium  No       Excellent      Yes
13   Middle_aged  High    Yes      Fair           Yes
14   Senior       Medium  No       Excellent      No
Step-1
Create an Excel document of the above data and save it as a CSV (comma delimited) file.
Step-2
Open the WEKA tool and then click on Explorer.
Step-3
Click on the Open file button and select the file from the location where the (.csv) file was saved
earlier.
Step-4
Click on the Classify tab in the top row of tabs, then choose J48 as the classifier and select the test
option "Use training set".
Step-5
Then click the Start button and see the generated classification rules:
=== Run information ===
Scheme:       weka.classifiers.trees.J48 -C 0.25 -M 2
Relation:     buys
Instances:    14
Attributes:   6
              Rid
              age
              income
              student
              credit_rating
              class:buys_computer
Test mode:    evaluate on training data
=== Classifier model (full training set) ===
J48 pruned tree
------------------
age = youth
| student = no: no (3.0)
| student = yes: yes (2.0)
age = middle_aged: yes (4.0)
age = senior
| credit_rating = fair: yes (3.0)
| credit_rating = excellent: no (2.0)
Number of Leaves : 5
Size of the tree : 8
Time taken to build model: 0.02 seconds
=== Evaluation on training set ===
=== Summary ===
Correctly Classified Instances          14      100 %
Incorrectly Classified Instances         0        0 %
Kappa statistic                          1
Mean absolute error                      0
Root mean squared error                  0
Relative absolute error                  0 %
Root relative squared error              0 %
Total Number of Instances               14
=== Detailed Accuracy By Class ===
               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               1        0        1          1       1          1         no
               1        0        1          1       1          1         yes
Weighted Avg.  1        0        1          1       1          1
=== Confusion Matrix ===
a b <-- classified as
5 0 | a = no
0 9 | b = yes
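To see why "age" ends up at the root of the tree, the sketch below computes the plain information gain of each attribute on the 14 records. J48 itself ranks splits by gain ratio, so this is a simplified illustration rather than J48's exact criterion. The class labels used here are the ones implied by the J48 tree in the output above.

```python
from collections import Counter
from math import log2

ATTRIBUTES = ["age", "income", "student", "credit_rating"]
# Columns: age, income, student, credit_rating, buys_computer (class).
ROWS = [row.split(",") for row in """\
youth,high,no,fair,no
youth,high,no,excellent,no
middle_aged,high,no,fair,yes
senior,medium,no,fair,yes
senior,low,yes,fair,yes
senior,low,yes,excellent,no
middle_aged,low,yes,excellent,yes
youth,medium,no,fair,no
youth,low,yes,fair,yes
senior,medium,yes,fair,yes
youth,medium,yes,excellent,yes
middle_aged,medium,no,excellent,yes
middle_aged,high,yes,fair,yes
senior,medium,no,excellent,no""".splitlines()]

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum(c / total * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, index):
    """Entropy of the class minus the weighted entropy after splitting on one attribute."""
    labels = [r[-1] for r in rows]
    remainder = 0.0
    for value in set(r[index] for r in rows):
        subset = [r[-1] for r in rows if r[index] == value]
        remainder += len(subset) / len(rows) * entropy(subset)
    return entropy(labels) - remainder

gains = {name: information_gain(ROWS, i) for i, name in enumerate(ATTRIBUTES)}
best = max(gains, key=gains.get)
for name, gain in gains.items():
    print(f"{name:14s} {gain:.3f}")
print("Best split:", best)
```

The attribute with the largest gain is the one whose values separate the yes and no records most cleanly, which here agrees with the root chosen by J48.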
Step6:
Right click on Result list to visualize the tree
Output:
VIVA QUESTIONS:
1. What is Data mining?
Data mining refers to extracting or "mining" knowledge from large amounts of data. It is considered
a synonym for another popularly used term, Knowledge Discovery in Databases (KDD).
2. Define data warehouse
Data warehouse is a repository of information collected from multiple sources stored under a
unified schema and which usually resides at a single site. It is constructed via a process of data
cleaning, data transformation, data integration, data loading and periodic data refreshing.
3. What is an association analysis?
Association analysis is the discovery of association rules showing attribute-value conditions that
occur frequently together in a given set of data. It is widely used for market basket or transaction
data analysis.
4. Define Classification.
It is the process of finding a set of models that describe and distinguish data classes or concepts, for
the purpose of being able to use the model to predict the class of objects whose class label is
unknown. The derived model is based on the analysis of a set of training data.
• What are the GUIs available in Weka?
• What are the data formats supported by Weka?
• What are the extensions of ARFF and CSV files?
• What are filters in Weka?
• What is the difference between a categorical and a real-valued attribute?
• What is a relation?
• What is data?
• What are the four data types we use when creating an ARFF file?
• What are .xls, .arff and .csv?
• How do you convert an Excel document into a CSV document?
• What are the data preprocessing tools available in Weka?
• What is a filter?
• How do you use filters in Weka?
• What is data discretization?
• What is the association functionality of data mining?
• List the algorithms used for association.
• What is the minimum support threshold?
• What is the minimum confidence threshold?
• When is an association rule called strong?
• What is data classification?
• Which techniques are used to implement data classification?
• What are training data and test data?
• What is a decision tree?