Technical Details
Overview
This chapter explains how the data mining algorithms are implemented in the Faculty Support System model to address the problems discussed in Chapter 3.4. To address the issue of final exam grade prediction, a Naïve Bayes classifier is implemented and the grades are predicted. To address the problem of classifying students into different groups based on final exam grade, algorithms such as Naïve Bayes, J48 Decision Trees, Random Forests and Multiple Layer Neural Networks were implemented using the data mining tool Weka, which builds each algorithm from a set of configurable parameters. A study of how the best parameters were chosen is also presented in this chapter. The different algorithms were then evaluated using different evaluation methods and performance metrics.
The rest of the chapter is organized as follows: first, the Naïve Bayes implementation and the process of final exam grade prediction are explained, followed by the evaluation methods and metrics used to assess classifier performance. Then, a brief summary of the Weka tool is given. Finally, the parameter selection for the different algorithms is explained.
Naïve Bayes approach for future grade prediction
Naïve Bayes Classification
As seen in Chapter 2, Bayesian classification is based on Bayes' theorem. Let X be a data tuple. In Bayesian terms, X is considered evidence. Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. For classification problems, we want to determine P(H | X), the probability that the hypothesis H holds given the evidence or observed data tuple X. Bayes' theorem is useful in that it provides a way of calculating the posterior probability P(H | X), as shown in Equation 1:
P(H | X) = P(X | H) * P(H) / P(X)        (Equation 1)

where:
• P(H | X) is the posterior probability of H conditioned on X.
• P(H) is the prior probability of H.
• P(X | H) is the posterior probability of X conditioned on H.
• P(X) is the prior probability of X.
During the training phase, we need to learn the posterior probabilities P(Y | X) for every combination of X (the evidence or attribute value set) and Y (the class variable), based on information gathered from the training data. Knowing these probabilities, a test record Xi can be classified by finding the class Y that maximizes the posterior probability P(Y | Xi).
Let D be a training set of tuples and their associated class labels. Each tuple is represented by an n-dimensional attribute vector, X = {X1, X2, …, Xn}, depicting n measurements made on the tuple from n attributes, respectively, A1, A2, …, An. Suppose that there are m classes, C1, C2, …, Cm. Given a tuple X, the classifier will predict that X belongs to the class having the highest posterior probability conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if

P(Ci | X) > P(Cj | X)   for 1 ≤ j ≤ m, j ≠ i
Thus we maximize P(Ci | X). The class Ci for which P(Ci | X) is maximized is called the maximum a posteriori hypothesis. By Bayes' theorem,

P(Ci | X) = P(X | Ci) * P(Ci) / P(X)
Given data sets with many attributes, it would be extremely computationally expensive to compute P(X | Ci). In order to reduce the computation in evaluating P(X | Ci), the naive assumption of class conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the tuple. Thus,

P(X | Ci) = ∏(k=1..n) P(Xk | Ci)
With the conditional independence assumption, instead of computing the class-conditional
probability for every combination of X, we only have to estimate the conditional probability of
each Xi, given C. This approach is more practical because it does not require a very large training
set to obtain a good estimate of the probability. To classify a test record, the naive Bayes classifier
computes the posterior probability for each class C:
P(C | X) = P(C) * ∏(i=1..d) P(Xi | C) / P(X)

Since P(X) is fixed for every class Y, it is sufficient to choose the class that maximizes the numerator term, P(Y) * ∏(i=1..d) P(Xi | Y).
Prediction of final grade using Naïve Bayes:
Training phase:
The Naïve Bayes classifier is trained with all the training data. In this research, we used 241
instances of data for training. In the training phase we need to calculate the posterior probabilities
P(Y | X) for every combination of X and Y based on information gathered from the training data,
where X = attribute value set and Y = class label.
To calculate the posterior probabilities, the prior and conditional probabilities must be calculated first. The prior and conditional probabilities are calculated by constructing frequency tables for each attribute. A frequency table contains, for each value of the attribute, the number of instances having that value in each class. A frequency table is constructed for every attribute, and the conditional probabilities are then computed from these tables. The frequency table and the conditional probabilities are outlined for each attribute in the following tables.
From Table 1, the conditional probability of the attribute value ATT = Good given the class First is P(ATT = Good | Class = First), which equals the number of students with ATT = Good and Class = First divided by the total number of students with Class = First, i.e. 75/92. Likewise, all other conditional probabilities were computed.
Attendance (ATT)
            First   Second  Third   Fail
Good        75      66      21      10
Average     12      17      8       5
Poor        5       4       7       11
Total       92      87      36      26

Conditional Probability
            First   Second  Third   Fail
Good        75/92   66/87   21/36   10/26
Average     12/92   17/87   8/36    5/26
Poor        5/92    4/87    7/36    11/26

Table 1: Frequencies with conditional probabilities for attribute Attendance
Quizzes (QZ)
            First   Second  Third   Fail
Good        13      2       1       0
Average     59      55      5       2
Poor        20      30      30      24
Total       92      87      36      26

Conditional Probability
            First   Second  Third   Fail
Good        13/92   2/87    1/36    0/26
Average     59/92   55/87   5/36    2/26
Poor        20/92   30/87   30/36   24/26

Table 2: Frequencies with conditional probabilities for attribute Quizzes
Assignments (ASS)
            First   Second  Third   Fail
Good        70      49      9       3
Average     18      21      8       1
Poor        4       17      19      22
Total       92      87      36      26

Conditional Probability
            First   Second  Third   Fail
Good        70/92   49/87   9/36    3/26
Average     18/92   21/87   8/36    1/26
Poor        4/92    17/87   19/36   22/26

Table 3: Frequencies with conditional probabilities for attribute Assignments
Class Projects (CP)
            First   Second  Third   Fail
Good        57      42      7       3
Average     23      24      8       8
Poor        12      21      21      15
Total       92      87      36      26

Conditional Probability
            First   Second  Third   Fail
Good        57/92   42/87   7/36    3/26
Average     23/92   24/87   8/36    8/26
Poor        12/92   21/87   21/36   15/26

Table 4: Frequencies with conditional probabilities for attribute Class Projects
Exams (EX)
            First   Second  Third   Fail
Good        38      6       3       1
Average     52      53      7       3
Poor        2       28      26      22
Total       92      87      36      26

Conditional Probability
            First   Second  Third   Fail
Good        38/92   6/87    3/36    1/26
Average     52/92   53/87   7/36    3/26
Poor        2/92    28/87   26/36   22/26

Table 5: Frequencies with conditional probabilities for attribute Exams
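As a small illustration of the counting procedure described above, the sketch below builds a frequency table for a single attribute from (attribute value, class label) pairs and turns the counts into conditional probabilities. The sample data is made up and much smaller than the real training set; it is only meant to show the mechanics.

import java.util.HashMap;
import java.util.Map;

public class FrequencyTableSketch {
    public static void main(String[] args) {
        // (attribute value, class label) pairs for a single attribute, e.g. Attendance.
        String[][] training = {
            { "Good", "First" }, { "Good", "First" }, { "Average", "Second" },
            { "Poor", "Fail" }, { "Good", "Second" }, { "Average", "First" }
        };

        // counts.get(value).get(class) = number of instances with that value and class.
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        Map<String, Integer> classTotals = new HashMap<>();
        for (String[] row : training) {
            String value = row[0], label = row[1];
            counts.computeIfAbsent(value, v -> new HashMap<>()).merge(label, 1, Integer::sum);
            classTotals.merge(label, 1, Integer::sum);
        }

        // Conditional probability P(value | class) = count(value, class) / count(class).
        for (Map.Entry<String, Map<String, Integer>> e : counts.entrySet()) {
            for (Map.Entry<String, Integer> c : e.getValue().entrySet()) {
                double p = (double) c.getValue() / classTotals.get(c.getKey());
                System.out.printf("P(%s | %s) = %.3f%n", e.getKey(), c.getKey(), p);
            }
        }
    }
}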
Finding Prior probabilities of Response Class:
The prior probabilities of the response classes are found from the training set by simply counting the number of instances of each class variable. The prior probabilities are given in Table 6.

Prior Probability of Response Variables
                  First    Second   Third    Fail
No. of Instances  92       87       36       26
Probability       92/241   87/241   36/241   26/241

Table 6: Prior probabilities of response variables
Testing phase:
Once the training phase is done, the classifier is learned with all the training data and can predict
the class labels of the test data. Consider a test instance with no class label
ATT    QZ       ASS      CP     EX     Grade
Good   Average  Average  Poor   Poor   ????

Table 7: Test instance with no class label
Using the prior and conditional probabilities calculated above, we need to find the class label for
this instance.
From Equation 1, to find the class of the new instance, since P(X) is fixed for every Y, it is sufficient to choose the class that maximizes the numerator term, P(Y) * ∏(i=1..d) P(Xi | Y), where ∏(i=1..d) P(Xi | Y) is the product of the conditional probabilities of each Xi given Y.
So, the grade of the new instance can be calculated with the equation below:

P(Grade | X) ∝ P(Grade) * ∏(i=1..d) P(Xi | Grade)
Here, X is evidence and Grade = {First, Second, Third, Fail}
So, the equation is evaluated for each grade.
P(Grade = First | X ) = P(Grade = First)
* P(ATT = Good | Grade = First)
* P(QZ = Average | Grade = First)
* P(ASS = Average | Grade = First)
* P(CP = Poor | Grade = First)
* P(EX = Poor | Grade = First)
= (92/241) * (75/92) * (59/92) * (18/92) * (12/92) * (2/92)
= 0.198
Similarly, we calculate for Grade = Second, Third and Fail and obtain the results as
P(Grade = Second | X ) = 0.266
P(Grade = Third | X ) = 0.344
P(Grade = Fail | X ) = 0.202
Since the posterior probability for Third Grade is higher than other grades, the new test instance is
classified to be Third Grade.
The test data we considered consist of two data sets with 111 and 130 instances respectively, described in detail in Chapter 5.2. Finding the class label of each instance by hand is a time-consuming and difficult task, as it involves calculating the probabilities for every instance in the testing data set.
To address this problem, we automated the process by implementing the Naïve Bayes classifier as a Java program that calculates all the probabilities and computes the class labels of the test instances very quickly. The data flow of the program is shown in Figure 4.1.
Figure 4.1: Data flow of the Naïve Bayes algorithm with input and output data
Explanation:
1. There are two kinds of input to the program (see Figure 4.1):
   a. the training data set with the explanatory variables and class labels;
   b. the testing data set with only the explanatory variables.
2. The classifier is learned from the training data. In the process of learning:
   a. the likelihood of each attribute value is computed and stored in a separate HashMap (since there are five attributes, five hash maps are used);
   b. the prior probabilities of the class variable are computed and stored in a separate HashMap.
3. After all the probabilities are calculated, the testing set is read line by line; for each attribute, its likelihoods and the prior probability are obtained from the stored hash maps and the posterior probability is computed.
4. Step 3 is repeated for each class (4 in this research).
5. The computed values are compared against each other, and the class with the highest value is chosen as the class of that instance in the test data.
6. Steps 3 to 5 are repeated for each instance of the testing data set.
The implementation of the Naïve Bayes algorithm is presented in Section 4.6.
Evaluation Methods and Metrics:
This section explains the different evaluation methods and metrics used in the Faculty Support System model. The most important part of data mining is to understand the data present in the records, analyze it, find what can be done and achieved with the data, and finally draw conclusions from the results of the analysis. In a given context, data mining metrics are quantitative measures used for comparing or evaluating different data mining algorithms. Generally, data mining metrics are divided into three categories: accuracy, robustness and usefulness.
Accuracy is a measure that tells us how well a model associates an outcome with the given attributes in the data sets provided. Accuracy is measured in multiple ways, but all accuracy measures depend on the data that has been used. Robustness judges how a data mining model performs on different kinds of datasets. A data mining model is robust only if it generates the same type of predictions for the same kind of patterns irrespective of the supplied test data. Usefulness is a combination of several metrics that tells us whether the model generates useful information or not.
Evaluation Methods
The performance of the classifiers is evaluated using different evaluation methods. Evaluation is important for understanding the quality of the model, for refining parameters in the iterative process of learning, and for selecting the most appropriate model from a given set of models. There are several criteria for evaluating a model. As far as classification models are concerned, the performance of a classifier is measured in terms of error rate: if the classifier predicts the class of an instance correctly, it is counted as a success, otherwise as an error. In this research, to choose the best performing algorithm in the post-data-mining phase, we used different kinds of evaluation methods and compared them based on different evaluation metrics.
Hold out Method:
Hold out Method involves a single data split. The data is split into two separate datasets where one
data set is used for training and the other is used for testing. The model is learned with the training
data and finally asked to predict the output values in the testing data.
Random Sub Sampling:
The Random Sub Sampling method is an extension of hold-out method. In the random sub
sampling method, the hold out method can be repeated several times to improve the estimation of
the classifier’s performance.
K-fold Cross Validation Method:
Cross Validation is a popular technique for evaluating the generalization performance of a data
mining model. The basic idea behind cross validation is to split the data, once or many times, to estimate the risk of a data mining algorithm. Part of the data, called the training sample, is used for training each algorithm, and the remaining part, called the validation sample, is used to estimate the risk of the algorithm. The cross validation method finally selects the algorithm with the smallest estimated risk.
The k-fold cross validation method improves on the hold-out method. In this method, the dataset is divided into k subsets and the hold-out method is repeated k times. Each time, one of the k subsets is used as the testing set and the union of the other folds is used as the training set. The k results from the folds are then averaged and the error is computed. The advantage of this method is that all observations are used for both training and testing, and each observation is used for testing exactly once.
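As an illustration, the sketch below shows how a k-fold cross-validation could be run programmatically with the Weka Java API; it is a minimal sketch, and the data file name, the fold count and the choice of classifier are assumptions for illustration rather than the exact setup used in this research.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationSketch {
    public static void main(String[] args) throws Exception {
        // Load the dataset (ARFF or CSV); the last attribute is assumed to be the class.
        Instances data = new DataSource("StudentData.arff").getDataSet();
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation of a Naive Bayes classifier.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        // Results averaged over the k folds, plus the confusion matrix.
        System.out.println(eval.toSummaryString("\n10-fold cross-validation results\n", false));
        System.out.println(eval.toMatrixString());
    }
}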
Evaluation Metrics:
Metrics summarize the performance of a model and give a simplified view of its behavior. Thus, using several performance metrics and checking whether they agree helps in better understanding the model behavior by quantifying its performance. Data mining algorithms can be compared according to a number of measures. When comparing the performance of different data mining algorithms to determine their predictability, quantities that describe the goodness of fit of a model as well as error measurements must be considered. Although empirical studies have claimed that it is difficult to decide which metric to use for a given problem, each metric has specific features that measure particular aspects of the algorithms being evaluated. It is often difficult to state which metric is the most suitable for evaluating algorithms on educational data, due to the large weighted discrepancies that often arise between predicted and actual values. Combining different metrics may therefore reveal more accurate results. For instance, metrics such as the true positive rate (TPR) take higher values when the algorithm gives better results, whereas error metrics take lower values.
The metrics are divided into three families: probabilistic understanding of errors, qualitative understanding of errors, and visual metrics.
Probabilistic understanding of errors:
These metrics are based on a probabilistic understanding of the predictions pi and of the errors (pi − oi), where pi is the predicted outcome and oi the actual outcome. This type of metric is natural mainly for predictions of performance, i.e. the correctness of answers. The most commonly used metrics based on a probabilistic understanding of errors are the Mean Absolute Error (MAE) and the Root Mean Square Error (RMSE). Typically, the lower the values of MAE and RMSE, the higher the performance of the classifier.
Mean absolute error considers the absolute differences between predictions and answers. This is not a suitable performance metric on its own, because it prefers models that are biased towards the majority result. Despite this disadvantage it is sometimes used for the evaluation of student models.
MAE = (1/n) * Σ |oi − pi|

Root mean square error is obtained by using squared values instead of absolute values. In the particular context of student modelling and the evaluation of probabilities, this is not particularly useful, since the resulting numbers are hard to interpret anyway. However, in EDM the use of the RMSE metric is very common, particularly for the evaluation of skill models.

RMSE = sqrt( (1/n) * Σ (oi − pi)^2 )
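For clarity, a small sketch of how MAE and RMSE could be computed from paired observed and predicted values is given below; the sample arrays are made-up values used only for illustration.

public class ErrorMetricsSketch {
    // Mean Absolute Error: average of |o_i - p_i|.
    static double mae(double[] observed, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < observed.length; i++) {
            sum += Math.abs(observed[i] - predicted[i]);
        }
        return sum / observed.length;
    }

    // Root Mean Square Error: square root of the average squared error.
    static double rmse(double[] observed, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < observed.length; i++) {
            double diff = observed[i] - predicted[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum / observed.length);
    }

    public static void main(String[] args) {
        double[] o = { 1.0, 0.0, 1.0, 1.0 };   // actual outcomes
        double[] p = { 0.8, 0.3, 0.6, 0.9 };   // predicted probabilities
        System.out.println("MAE  = " + mae(o, p));
        System.out.println("RMSE = " + rmse(o, p));
    }
}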
Qualitative understanding of errors:
These metrics are based on a qualitative understanding of errors, i.e., a prediction is either correct or incorrect. In student modeling this approach is suitable mainly for predictions of the student state. Under a qualitative understanding of errors, predictions have to be assigned to classes; this can be done easily by choosing a threshold and classifying each prediction by comparison to this threshold. Once predictions are divided into classes, they can be classified as true/false positives/negatives in a confusion matrix. The confusion matrix juxtaposes the observed classifications of a phenomenon (columns) with the predicted classifications of the model (rows). The classifications that lie along the major diagonal of the table are the correct classifications, that is, the true positives and the true negatives. The other fields signify model errors. The most common qualitative performance metrics calculated from the matrix are accuracy, sensitivity, specificity, precision, and F-measure. These statistical measures are commonly used to describe the dataset and to estimate how good and consistent the classifier is.
Accuracy: It compares how close a new test value is to the value predicted by if ... then rules.

Accuracy = ((TP + TN) / (TP + TN + FP + FN)) * 100%

Sensitivity: It measures the ability of a test to be positive when the condition is actually present. It is often referred to as recall.

Sensitivity (Recall) = (TP / (TP + FN)) * 100%

Specificity: It measures the ability of a test to be negative when the condition is actually not present.

Specificity = (TN / (TN + FP)) * 100%

Precision: It measures the positive predictive value.

Precision = (TP / (TP + FP)) * 100%

F-Measure: A measure that combines precision and recall; it is the harmonic mean of precision and recall.

F = (2 * Precision * Recall) / (Precision + Recall)
Visual metrics (ROC)
Receiver Operating Characteristics (ROC) graphs are a useful technique for organizing classifiers
and visualizing their performance. This approach to the evaluation of predictions takes into account the ranking of predictions, i.e., the values of pi are considered relative to each other. The ROC curve summarizes the qualitative error of the prediction model over all possible thresholds. The curve has the false positive rate (1 − Specificity) on the x-axis and the true positive rate (Sensitivity) on the y-axis; each point of the curve corresponds to a choice of threshold. The area under the ROC curve (AUC) provides a summary performance measure across all possible thresholds. It is equal to the probability that a randomly selected positive observation has a higher predicted score than a randomly selected negative observation.
The graph below shows that when the false positive rate (FPR) and the true positive rate (TPR) are plotted on the X and Y axes respectively, the classifier with a high TPR and a low FPR is the one with the better performance; such a classifier classifies most student instances correctly.
Figure 2: Performance Comparison Graph
In order to improve the behavior of educational systems and to gain insight into the learning process and the teaching methods, this study combines the above-mentioned measures of performance to locate the predictor and/or classifier used in the educational model on the ROC graph. We divided the upper area of the ROC graph into three zones: a poor zone, which means that the algorithm's accuracy is questionable; a reasonable zone, where the algorithm is somewhat efficient; and a perfect zone, which means that the algorithm's performance is the best.
Weka Workbench:
Weka provides implementations of learning algorithms that you can easily apply to your dataset.
It also includes a variety of tools for transforming datasets. We can preprocess a dataset, feed it
into a learning scheme, and analyze the resulting classifier and its performance all without writing
any program code at all. The workbench includes methods for all the standard data mining
problems: regression, classification, clustering, association rule mining, and attribute selection.
Getting to know the data is an integral part of the work, and many data visualization facilities and
data preprocessing tools are provided. All algorithms take their input in the form of a single
relational table in the ARFF format or a CSV format, which can be read from a file or generated
by a database query. One way of using Weka is to apply a learning method to a dataset and analyze
its output to learn more about the data. Another is to use learned models to generate predictions on
new instances. A third is to apply several different learners and compare their performance in order
to choose one for prediction. The learning methods are called classifiers, and in the interactive
Weka interface you select the one you want from a menu. Many classifiers have tunable
parameters, which you access through a property sheet or object editor.
In most data mining applications, the machine learning component is just a small part of a far larger software system. If you intend to write a data mining application, you will want to access the programs in Weka from inside your own code. By doing so, you can solve the machine learning subproblem of your application with a minimum of additional programming.
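For example, a classifier built in the Weka Explorer can also be trained and applied from Java code along these lines. This is a minimal sketch assuming the standard Weka 3.x API; the file names are placeholders, not the actual files used in this research.

import weka.classifiers.trees.J48;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaFromCodeSketch {
    public static void main(String[] args) throws Exception {
        // Load training and test sets (ARFF or CSV); the last attribute is the class.
        Instances train = new DataSource("TrainData.arff").getDataSet();
        Instances test  = new DataSource("TestData.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);
        test.setClassIndex(test.numAttributes() - 1);

        // Build a J48 decision tree on the training data.
        J48 tree = new J48();
        tree.buildClassifier(train);

        // Use the learned model to predict the class of each test instance.
        for (int i = 0; i < test.numInstances(); i++) {
            Instance inst = test.instance(i);
            double predicted = tree.classifyInstance(inst);
            System.out.println(test.classAttribute().value((int) predicted));
        }
    }
}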
Parameter estimation of different classifiers
Parameter selection of Multiple Layer Neural Networks
This section explains in detail which parameters are considered, and how, in building a Multiple Layer Neural Network classifier using the Weka data mining tool. The different parameters and the values used are shown in Table 5.2.
Parameter              Value
autoBuild              True
debug                  False
decay                  False
hiddenLayers           a
learningRate           0.3
momentum               0.2
normalizeAttributes    True
nominalToBinaryFilter  True
reset                  False
seed                   0
trainingTime           500
validationSetSize      0
validationThreshold    20
activationFunction     sigmoid

Table 5.2: Parameters used in the Weka tool for the Multiple Layer Neural Network classifier
Table 5.2 shows the parameters and their values taken into consideration when building the Multiple Layer Neural Network classifier with the Weka tool. As listed in Table 5.2, autoBuild is set to True, which means that the tool adds and connects up the hidden layers in the network, and debug is set to False, meaning that the classifier does not output additional information to the console. The decay parameter is important when building a Multiple Layer Neural Network: when it is True, the starting learning rate is divided by the epoch number to determine the current learning rate, so the learning rate decreases over time. This may help stop the network from diverging from the target output, as well as improve general performance.
The hiddenLayers parameter determines the number of hidden layers of the neural network. It generally takes a list of positive whole numbers; there are also wildcard values such as a = (attributes + classes) / 2, i = attributes, o = classes, t = attributes + classes. The learningRate parameter determines by how much the weights are updated. Too low a learning rate makes the network learn very slowly; too large a learning rate makes training proceed much faster but may simply produce oscillations between relatively poor solutions. Typical values for the learning rate are numbers between 0 and 1, so an optimal value that is neither too low nor too high should be chosen; we chose 0.3. The momentum parameter is applied to the weights during updating and can be helpful in speeding up convergence and avoiding local minima. It ranges between 0 and 0.9. The combination of learning rate and momentum is important to choose, because momentum allows a larger learning rate, which speeds up convergence and helps avoid local minima. On the other hand, a learning rate of 1 with no momentum will be much faster when no problems with local minima or non-convergence are encountered.
The parameter nominalToBinaryFilter will preprocess the instances with the filter. This could help
improve performance if there are nominal attributes in the data. The parameter normalizeAttributes
will normalize the attributes. This could help improve performance of the network. This is not
reliant on the class being numeric. The reset parameter, when set to True, allows the network to be reset with a lower learning rate: if the network diverges from the answer, it will automatically be reset with a lower learning rate and begin training again. The seed parameter is used to
initialize the random number generator. Random numbers are used for setting the initial weights
of the connections between nodes, and also for shuffling the training data.
The trainingTime parameter determines the number of epochs to train through. If the validation set size is non-zero, training can terminate early. The validationSetSize parameter determines the percentage size of the validation set: training will continue until the error on the validation set has been observed to get consistently worse, or until the training time is reached. If this is set to zero, no validation set is used and the network trains for the specified number of epochs. The validationThreshold parameter is used to terminate validation testing; its value dictates how many times in a row the validation set error can get worse before training is terminated. Finally, the activationFunction parameter determines the type of activation function used in the model. As described in Chapter 2.1, different activation functions exist; the sigmoid function is used in this model.
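A sketch of how these settings could be passed to Weka's MultilayerPerceptron from Java is shown below. It assumes the standard Weka option flags for this classifier; the training file name is a placeholder.

import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class MlpConfigSketch {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("TrainData.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);

        MultilayerPerceptron mlp = new MultilayerPerceptron();
        // -L learning rate, -M momentum, -N epochs (trainingTime), -H hidden layers,
        // -S seed, -V validation set size, -E validation threshold (values from Table 5.2).
        mlp.setOptions(Utils.splitOptions("-L 0.3 -M 0.2 -N 500 -H a -S 0 -V 0 -E 20"));
        mlp.buildClassifier(train);
        System.out.println(mlp);
    }
}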
Parameter selection of Decision Trees:
This section explains in detail which parameters are considered, and how, in building a Decision Tree classifier using the Weka data mining tool. The different parameters and the values used are shown in Table 5.3.
Parameter         Value
confidenceFactor  0.25
minNumObj         2
numFolds          3
seed              1
unpruned          False
useLaplace        False

Table 5.3: Parameters used in the Weka tool for the Decision Tree classifier
There are usually two criteria for the quality of decision trees: classification accuracy and decision
tree size, expressed as the number of nodes in the tree. With the increasing size of the decision tree
its incomprehensibility also increases, which is particularly undesirable if the decision tree is used
for the interpretation of the discovered relationships in the training data. If a decision tree is used
for classification, its size is not so significant, as the implementation of decision trees in practice
is not demanding. Smaller trees are often preferred to larger ones, as they do not overfit the training
set and are less sensitive to noise. The size of decision trees can be controlled during the
construction process by pruning. There are two approaches to decision tree pruning:
1. Pre pruning - terminating the subtree construction during the tree-building process
2. Post pruning - pruning the subtrees of an already constructed tree
Post-pruning tends to give better results than pre-pruning because it makes pruning decisions based on a fully grown tree, unlike pre-pruning, which can suffer from premature termination of the tree-growing process. However, for post-pruning, the additional computations needed to grow the full tree may be wasted when a subtree is pruned. Parameters such as the confidence factor, the minimum number of objects and the number of folds play a vital role in the post-pruning technique.
Lowering the confidence factor increases the amount of post-pruning: with a lower confidence in the training data, the error estimate for each node goes up, increasing the likelihood that the node will be pruned away in favor of a more stable node upstream. We tested the J48 classifier with confidence factors ranging from 0.1 to 0.5 in increments of 0.1, as well as auxiliary values approaching zero, and found that smaller values incur more pruning. The confidenceFactor was therefore set to 0.25. The minimum number of instances per node (minNumObj) was held at 2, and the number of cross-validation folds for the testing set (numFolds) was held at 3 during confidence factor testing. The unpruned parameter is set to False, which means that pruning is performed. The useLaplace parameter, which determines whether the counts at the leaves are smoothed based on Laplace, is set to False.
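A minimal sketch of configuring J48 with these values through the Weka Java API follows; the training file name is a placeholder.

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48ConfigSketch {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("TrainData.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);

        J48 tree = new J48();
        tree.setConfidenceFactor(0.25f);  // controls the amount of post-pruning
        tree.setMinNumObj(2);             // minimum number of instances per leaf
        tree.setNumFolds(3);              // folds used for reduced-error pruning data
        tree.setSeed(1);
        tree.setUnpruned(false);          // pruning is performed
        tree.setUseLaplace(false);        // no Laplace smoothing at the leaves
        tree.buildClassifier(train);
        System.out.println(tree);
    }
}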
Parameter selection of Random Forests:
This section explains in detail which parameters are considered, and how, in building a Random Forest classifier using the Weka data mining tool. The different parameters and the values used are shown in Table 5.4.
Parameter    Value
debug        True
maxDepth     0
numFeatures  2
numTrees     100
seed         1

Table 5.4: Parameters used in the Weka tool for the Random Forest classifier
As shown in Table 5.4, the debug parameter is set to True, meaning that the classifier may output additional information to the console. The maxDepth parameter determines the maximum depth of the trees in the constructed random forest. Tree depth is not usually a limiting factor in a standard random forest implementation; restricting it matters more for approaches such as gradient boosting or pruned decision trees. A value of 0 means unlimited depth. The seed parameter is used to initialize the random number generator; random numbers are used for the bootstrap sampling of the training data and for selecting the random subsets of attributes considered at each node.
The forest error rate depends on two things:
1. The correlation between any two trees in the forest. Increasing the correlation increases the
forest error rate.
2. The strength of each individual tree in the forest. A tree with a low error rate is a strong
classifier. Increasing the strength of the individual trees decreases the forest error rate.
The important parameters in building a Random Forest are numFeatures and numTrees:
1. Number of trees used in the forest (numTrees)
2. Number of random variables used in each tree (numFeatures)
First, set numFeatures to its default value (the square root of the total number of predictors) and search for the optimal numTrees value. To find the number of trees that corresponds to a stable classifier, we built random forests with different numTrees values (100, 200, 300, …, 1000). We built 10 Random Forest classifiers for each numTrees value, recorded the OOB error rate, and looked for the number of trees at which the out-of-bag error rate stabilizes and reaches its minimum.
There are two ways to find the optimal numFeatures:
1. Apply a similar procedure, running the random forest 10 times. The optimal number of predictors selected per split is the one for which the out-of-bag error rate stabilizes and reaches its minimum.
2. Experiment with the square root of the total number of predictors, half of this square root value, and twice the square root value, and check which numFeatures returns the maximum area under the curve. Thus, for 1000 predictors, the number of predictors to select at each node would be 16, 32, or 64.
Based on these experiments, the numTrees and numFeatures in the Random Forest generation are
chosen to be 100 and 2 respectively.
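A sketch of the corresponding Random Forest configuration in the Weka Java API is shown below; the option flags of a Weka 3.7-style RandomForest are assumed, and the file name is a placeholder.

import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.Utils;
import weka.core.converters.ConverterUtils.DataSource;

public class RandomForestConfigSketch {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("TrainData.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);

        RandomForest forest = new RandomForest();
        // -I number of trees, -K number of random features per split,
        // -S seed, -depth 0 for unlimited depth (values from Table 5.4).
        forest.setOptions(Utils.splitOptions("-I 100 -K 2 -S 1 -depth 0"));
        forest.buildClassifier(train);
        System.out.println(forest);
    }
}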
Parameter selection of Naïve Bayes:
This section explains in detail which parameters are considered, and how, in building a Naïve Bayes classifier using the Weka data mining tool. The different parameters and the values used are shown in Table 5.5.
Parameter                    Value
debug                        False
useKernelEstimator           False
useSupervisedDiscretization  False

Table 5.5: Parameters used in the Weka tool for the Naïve Bayes classifier
As shown in Table 5.5, the debug parameter is set to False, meaning that the classifier does not output any additional information to the console. The useKernelEstimator parameter uses a kernel estimator for numeric attributes rather than a normal distribution. The useSupervisedDiscretization parameter uses supervised discretization to convert numeric attributes to nominal ones. Both useKernelEstimator and useSupervisedDiscretization are mainly relevant for numeric attributes, which is not the case with our data, where the attributes are categorical. Hence, we can leave these parameters at their defaults and build the Naïve Bayes classifier.
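Likewise, a minimal sketch of building the Naïve Bayes classifier in Weka with these (default) settings, using a placeholder training file:

import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class NaiveBayesConfigSketch {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("TrainData.arff").getDataSet();
        train.setClassIndex(train.numAttributes() - 1);

        NaiveBayes nb = new NaiveBayes();
        // Both options matter only for numeric attributes; our attributes are categorical,
        // so the defaults (false) are kept, as in Table 5.5.
        nb.setUseKernelEstimator(false);
        nb.setUseSupervisedDiscretization(false);
        nb.buildClassifier(train);
        System.out.println(nb);
    }
}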
System Implementation
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;

public class Bayes {

    // Frequency tables: for each attribute value, a list of counts per class
    // in the order [First, Second, Third, Fail].
    static HashMap<String, List<Integer>> ATT = new HashMap<>();
    static HashMap<String, List<Integer>> QZ = new HashMap<>();
    static HashMap<String, List<Integer>> ASS = new HashMap<>();
    static HashMap<String, List<Integer>> CP = new HashMap<>();
    static HashMap<String, List<Integer>> EX = new HashMap<>();

    // Class counts used for the prior and conditional probabilities.
    static int firstCount;
    static int secondCount;
    static int thirdCount;
    static int failCount;
    static int totalCount;

    static List<String> results = new ArrayList<String>();

    public static void main(String[] args) {
        initialization();
        process();
        BufferedReader reader = null;
        BufferedWriter writer = null;
        try {
            File file = new File("C:/Users/rohith/workspace/JavaExamples/Result.csv");
            FileWriter fw = new FileWriter(file.getAbsoluteFile());
            reader = new BufferedReader(new FileReader(
                    "C:/Users/rohith/workspace/JavaExamples/TestData.csv"));
            writer = new BufferedWriter(fw);
            String csvLine;
            while ((csvLine = reader.readLine()) != null) {
                String[] row = csvLine.split(",");
                String att = row[0];
                String qz = row[1];
                String ass = row[2];
                String cp = row[3];
                String ex = row[4];
                String st = row[5];

                // Unnormalized posterior score for each class: product of the
                // conditional probabilities of the attribute values times the prior.
                double f, s, t, fail;
                f = ((double) ATT.get(att).get(0) / firstCount)
                        * ((double) QZ.get(qz).get(0) / firstCount)
                        * ((double) ASS.get(ass).get(0) / firstCount)
                        * ((double) CP.get(cp).get(0) / firstCount)
                        * ((double) EX.get(ex).get(0) / firstCount)
                        * ((double) firstCount / totalCount);
                s = ((double) ATT.get(att).get(1) / secondCount)
                        * ((double) QZ.get(qz).get(1) / secondCount)
                        * ((double) ASS.get(ass).get(1) / secondCount)
                        * ((double) CP.get(cp).get(1) / secondCount)
                        * ((double) EX.get(ex).get(1) / secondCount)
                        * ((double) secondCount / totalCount);
                t = ((double) ATT.get(att).get(2) / thirdCount)
                        * ((double) QZ.get(qz).get(2) / thirdCount)
                        * ((double) ASS.get(ass).get(2) / thirdCount)
                        * ((double) CP.get(cp).get(2) / thirdCount)
                        * ((double) EX.get(ex).get(2) / thirdCount)
                        * ((double) thirdCount / totalCount);
                fail = ((double) ATT.get(att).get(3) / failCount)
                        * ((double) QZ.get(qz).get(3) / failCount)
                        * ((double) ASS.get(ass).get(3) / failCount)
                        * ((double) CP.get(cp).get(3) / failCount)
                        * ((double) EX.get(ex).get(3) / failCount)
                        * ((double) failCount / totalCount);

                // Pick the class with the highest score and write it to the result file.
                int i = (f > s) ? ((f > t) ? 0 : 2) : ((s > t) ? 1 : 2);
                writer.write(att + "," + qz + "," + ass + "," + cp + "," + ex
                        + "," + st + ",");
                if (i == 0) {
                    if (f > fail)
                        writer.write("First");
                    else
                        writer.write("Fail");
                } else if (i == 1) {
                    if (s > fail)
                        writer.write("Second");
                    else
                        writer.write("Fail");
                } else {
                    if (t > fail)
                        writer.write("Third");
                    else
                        writer.write("Fail");
                }
                writer.write("\n");
            }
        } catch (IOException ex) {
            throw new RuntimeException("Error in reading CSV file: " + ex);
        } finally {
            try {
                reader.close();
                writer.close();
            } catch (IOException e) {
                throw new RuntimeException("Error while closing Reader: " + e);
            }
        }
    }

    // Initializes each frequency table with zero counts for every attribute value.
    public static void initialization() {
        String[] labels = { "Good", "Average", "Poor" };
        for (int i = 0; i < 3; i++) {
            ATT.put(labels[i], new ArrayList<>(Arrays.asList(0, 0, 0, 0)));
            QZ.put(labels[i], new ArrayList<>(Arrays.asList(0, 0, 0, 0)));
            ASS.put(labels[i], new ArrayList<>(Arrays.asList(0, 0, 0, 0)));
            CP.put(labels[i], new ArrayList<>(Arrays.asList(0, 0, 0, 0)));
            EX.put(labels[i], new ArrayList<>(Arrays.asList(0, 0, 0, 0)));
        }
    }

    // Training phase: reads the training CSV and fills the frequency tables
    // and the class counts.
    public static void process() {
        BufferedReader reader = null;
        try {
            reader = new BufferedReader(new FileReader(
                    "C:/Users/rohith/workspace/JavaExamples/TrainData.csv"));
            String csvLine;
            HashMap<String, Integer> result = new HashMap<>();
            String[] labels = { "First", "Second", "Third", "Fail" };
            for (int i = 0; i <= 3; i++) {
                result.put(labels[i], i);
            }
            while ((csvLine = reader.readLine()) != null) {
                String[] row = csvLine.split(",");
                countIncrement(row[5]);
                int index = result.get(row[5]);
                ATT.get(row[0]).set(index, ATT.get(row[0]).get(index) + 1);
                QZ.get(row[1]).set(index, QZ.get(row[1]).get(index) + 1);
                ASS.get(row[2]).set(index, ASS.get(row[2]).get(index) + 1);
                CP.get(row[3]).set(index, CP.get(row[3]).get(index) + 1);
                EX.get(row[4]).set(index, EX.get(row[4]).get(index) + 1);
            }
        } catch (IOException ex) {
            throw new RuntimeException("Error in reading CSV file: " + ex);
        } finally {
            try {
                reader.close();
            } catch (IOException e) {
                throw new RuntimeException("Error while closing Reader: " + e);
            }
        }
    }

    // Updates the per-class instance counts for one training record.
    public static void countIncrement(String s) {
        totalCount++;
        if (s.equals("First"))
            firstCount++;
        else if (s.equals("Second"))
            secondCount++;
        else if (s.equals("Third"))
            thirdCount++;
        else
            failCount++;
    }
}