Technical Details

Overview

This chapter explains how the data mining algorithms are implemented in the Faculty Support System model to address the problems discussed in Chapter 3.4. To address the issue of final exam grade prediction, a Naïve Bayes classifier is implemented and the grades are predicted. To address the problem of classifying students into different groups based on final exam grade, algorithms such as Naïve Bayes, J48 decision trees, Random Forests and multilayer neural networks were implemented using the data mining tool Weka. The Weka tool implements each of these algorithms with configurable parameters, and a study of how the best parameters were chosen is also presented in this chapter. In addition, the algorithms were evaluated using different evaluation methods and performance metrics. The rest of the chapter is organized as follows: first, the Naïve Bayes implementation and the process of final exam grade prediction are explained, followed by the evaluation methods and metrics used to assess classifier performance. A brief summary of the Weka tool is then given. Finally, the parameter selection for the different algorithms is explained.

Naïve Bayes approach for future grade prediction

Naïve Bayes Classification

As seen in Chapter 2, Bayesian classification is based on Bayes' theorem. Let X be a data tuple. In Bayesian terms, X is considered "evidence". Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. For classification problems, we want to determine P(H | X), the probability that the hypothesis H holds given the evidence, i.e. the observed data tuple X. Bayes' theorem is useful in that it provides a way of calculating the posterior probability P(H | X), as given in Equation 1:

P(H | X) = P(X | H) · P(H) / P(X)    (Equation 1)

where
- P(H | X) is the posterior probability of H conditioned on X,
- P(H) is the prior probability of H,
- P(X | H) is the posterior probability of X conditioned on H, and
- P(X) is the prior probability of X.

During the training phase, we need to learn the posterior probabilities P(Y | X) for every combination of X (the evidence, or attribute value set) and Y (the class variable) based on information gathered from the training data. Knowing these probabilities, a test record Xi can be classified by finding the class Y that maximizes the posterior probability P(Y | Xi).

Let D be a training set of tuples and their associated class labels. Each tuple is represented by an n-dimensional attribute vector X = {X1, X2, ..., Xn}, depicting n measurements made on the tuple from the n attributes A1, A2, ..., An, respectively. Suppose that there are m classes, C1, C2, ..., Cm. Given a tuple X, the classifier will predict that X belongs to the class having the highest posterior probability conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X belongs to the class Ci if and only if

P(Ci | X) > P(Cj | X)    for 1 ≤ j ≤ m, j ≠ i.

Thus we maximize P(Ci | X). The class Ci for which P(Ci | X) is maximized is called the maximum posteriori hypothesis. By Bayes' theorem,

P(Ci | X) = P(X | Ci) · P(Ci) / P(X).

Given data sets with many attributes, it would be extremely computationally expensive to compute P(X | Ci). In order to reduce computation in evaluating P(X | Ci), the naive assumption of class-conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the tuple.
Thus, with the conditional independence assumption, instead of computing the class-conditional probability for every combination of X, we only have to estimate the conditional probability of each Xi given C. This approach is more practical because it does not require a very large training set to obtain a good estimate of the probability. To classify a test record, the naive Bayes classifier computes the posterior probability for each class C:

P(C | X) = P(C) · Π_{i=1..d} P(Xi | C) / P(X)

Since P(X) is fixed for every class, it is sufficient to choose the class that maximizes the numerator term, P(Y) · Π_{i=1..d} P(Xi | Y).

Prediction of final grade using Naïve Bayes

Training phase: The Naïve Bayes classifier is trained with all the training data. In this research, we used 241 instances of data for training. In the training phase we need to calculate the posterior probabilities P(Y | X) for every combination of X and Y based on information gathered from the training data, where X is the attribute value set and Y is the class label. To calculate the posterior probabilities, the prior and conditional probabilities must be calculated first. They are obtained by constructing a frequency table for each attribute, which records, for every value of that attribute, the number of instances belonging to each class variable. The next step is to derive the conditional probabilities from the frequency tables. The frequency table and the conditional probabilities are outlined for each attribute in the following tables. From Table 1, the conditional probability of the attribute value ATT = Good given the class variable First is

P(ATT = Good | class = First) = (number of students with ATT = Good and class = First) / (total number of students with class = First) = 75/92.

Likewise, all other conditional probabilities were computed.
Attendance (ATT)          First    Second    Third    Fail
Good                      75       66        21       10
Average                   12       17        8        5
Poor                      5        4         7        11
Total                     92       87        36       26
Conditional probability:
Good                      75/92    66/87     21/36    10/26
Average                   12/92    17/87     8/36     5/26
Poor                      5/92     4/87      7/36     11/26

Table 1: Frequencies with conditional probabilities for attribute Attendance

Quizzes (QZ)              First    Second    Third    Fail
Good                      13       2         1        0
Average                   59       55        5        2
Poor                      20       30        30       24
Total                     92       87        36       26
Conditional probability:
Good                      13/92    2/87      1/36     0/26
Average                   59/92    55/87     5/36     2/26
Poor                      20/92    30/87     30/36    24/26

Table 2: Frequencies with conditional probabilities for attribute Quizzes

Assignments (ASS)         First    Second    Third    Fail
Good                      70       49        9        3
Average                   18       21        8        1
Poor                      4        17        19       22
Total                     92       87        36       26
Conditional probability:
Good                      70/92    49/87     9/36     3/26
Average                   18/92    21/87     8/36     1/26
Poor                      4/92     17/87     19/36    22/26

Table 3: Frequencies with conditional probabilities for attribute Assignments

Class Projects (CP)       First    Second    Third    Fail
Good                      57       42        7        3
Average                   23       24        8        8
Poor                      12       21        21       15
Total                     92       87        36       26
Conditional probability:
Good                      57/92    42/87     7/36     3/26
Average                   23/92    24/87     8/36     8/26
Poor                      12/92    21/87     21/36    15/26

Table 4: Frequencies with conditional probabilities for attribute Class Projects

Exams (EX)                First    Second    Third    Fail
Good                      38       6         3        1
Average                   52       53        7        3
Poor                      2        28        26       22
Total                     92       87        36       26
Conditional probability:
Good                      38/92    6/87      3/36     1/26
Average                   52/92    53/87     7/36     3/26
Poor                      2/92     28/87     26/36    22/26

Table 5: Frequencies with conditional probabilities for attribute Exams

Finding prior probabilities of the response class: The prior probability of each response class is found from the training set by simple calculations, namely counting the number of instances of each class variable. The prior probabilities are given in Table 6.

                          First     Second    Third     Fail
No. of instances          92        87        36        26
Probability               92/241    87/241    36/241    26/241

Table 6: Prior probability of response variables

Testing phase: Once the training phase is done, the classifier has been learned from all the training data and can predict the class labels of the test data. Consider a test instance with no class label:

ATT      QZ         ASS        CP      EX      Grade
Good     Average    Average    Poor    Poor    ????

Table 7: Test instance with no class label

Using the prior and conditional probabilities calculated above, we need to find the class label for this instance. From Equation 1, since P(X) is fixed for every Y, it is sufficient to choose the class that maximizes the numerator term P(Y) · Π_{i=1..d} P(Xi | Y), where Π_{i=1..d} P(Xi | Y) is the product of the conditional probabilities of each Xi given Y. So, the grade of the new instance can be calculated with the equation below:

P(Grade | X) = P(Grade) · Π_{i=1..d} P(Xi | Grade)

Here, X is the evidence and Grade ∈ {First, Second, Third, Fail}, so the equation is evaluated for each grade:

P(Grade = First | X) = P(Grade = First) × P(ATT = Good | Grade = First) × P(QZ = Average | Grade = First) × P(ASS = Average | Grade = First) × P(CP = Poor | Grade = First) × P(EX = Poor | Grade = First)
= (92/241) × (75/92) × (59/92) × (18/92) × (12/92) × (2/92)
= 0.198

Similarly, we calculate for Grade = Second, Third and Fail and obtain:

P(Grade = Second | X) = 0.266
P(Grade = Third | X) = 0.344
P(Grade = Fail | X) = 0.202

Since the posterior probability for the Third grade is higher than for the other grades, the new test instance is classified as Third grade. The test data considered are two data sets, with 111 and 130 instances respectively, described in Chapter 5.2.
Finding the class label of each and every instance by hand is a very time-consuming and difficult task, as it involves calculating the probabilities for every instance in the testing data set. To address this problem, we automated the process by implementing the Naïve Bayes classifier as a Java program that calculates all the probabilities and computes the class labels of the test instances in very little time. The structure of the program is shown in Figure 4.1.

Figure 4.1: Data flow of the Naïve Bayes algorithm with input and output data

Explanation:
1. There are two kinds of input to the program (see Figure 4.1):
   a. the training data set with the explanatory variables and class labels, and
   b. the testing data set with only explanatory variables.
2. The classifier is learned using the training data. In the process of learning:
   a. the likelihood of each attribute value is computed and stored in a separate hash map (since there are five attributes, five hash maps are used), and
   b. the prior probabilities of the class variable are computed and stored in another hash map.
3. After all the probabilities are calculated, the testing set is read line by line; for each attribute the likelihoods and the prior probability are retrieved from the stored hash maps and the posterior probability is computed (a small sketch of this step follows below).
4. Step 3 is repeated for each class (four classes in this research).
5. The computed values are compared against each other, and the class with the highest value is chosen as the class of that instance in the test data.
6. Steps 3 to 5 are repeated for every instance of the testing data set.

The implementation of the Naïve Bayes algorithm is presented in Section 4.6.
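Step 3 can be written in a few lines of Java. The helper below is only an illustrative sketch (the method name and parameter names are hypothetical, and the list of attribute tables simply mirrors the ATT, QZ, ASS, CP and EX maps); the actual program is listed in full in Section 4.6.

import java.util.HashMap;
import java.util.List;

public class PosteriorSketch {
    // Unnormalised posterior numerator for one class: prior times product of likelihoods.
    // attributeTables holds one frequency map per attribute (attribute value -> per-class counts).
    static double classScore(String[] row,
                             List<HashMap<String, List<Integer>>> attributeTables,
                             int classIndex, int classCount, int totalCount) {
        double score = (double) classCount / totalCount;             // prior P(C)
        for (int a = 0; a < attributeTables.size(); a++) {
            int count = attributeTables.get(a).get(row[a]).get(classIndex);
            score *= (double) count / classCount;                    // likelihood P(Xi | C)
        }
        return score;  // the class with the largest score is chosen (steps 4 and 5)
    }
}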
Evaluation Methods and Metrics

This section explains the different evaluation methods and metrics used in the Faculty Support System model. The most important part of data mining is to understand the data present in the records, analyze it, find what can be done and achieved with the data, and finally draw conclusions from the results of the analysis. In a given context, data mining metrics are quantitative measures used for the comparison or evaluation of different data mining algorithms. Generally, data mining metrics are divided into three categories: accuracy, robustness and usefulness. Accuracy measures how well a model associates an outcome with the given attributes in the data sets provided; it can be measured in multiple ways, but all accuracy measures depend on the data that has been used. Robustness judges the way a data mining model performs on different kinds of datasets: a model is robust only if it generates the same type of predictions for the same kind of patterns irrespective of the supplied test data. Usefulness combines several metrics to tell us whether the model generates useful information or not.

Evaluation Methods

The performance of the classifiers is evaluated using different evaluation methods. Evaluation is important for understanding the quality of the model, for refining parameters in the iterative process of learning, and for selecting the most appropriate model from a given set of models. There are several criteria for evaluating a model. As far as classification models are concerned, the performance of a classifier is measured in terms of error rate: if the classifier predicts the class of an instance correctly, it is counted as a success, otherwise as an error. In this research, for choosing the best performing algorithm in the post-data-mining phase, we used different evaluation methods and compared them based on different evaluation metrics.

Hold-out method: The hold-out method involves a single data split. The data is split into two separate datasets, where one data set is used for training and the other for testing. The model is learned with the training data and is then asked to predict the output values for the testing data.

Random sub-sampling: Random sub-sampling is an extension of the hold-out method, in which the hold-out method is repeated several times to improve the estimate of the classifier's performance.

K-fold cross-validation: Cross-validation is a popular technique for evaluating the generalization performance of a data mining model. The basic idea behind cross-validation is to split the data, once or many times, to estimate the risk of the data mining algorithm. From the split data, one part, called the training sample, is used for training each algorithm, and the remaining part, called the validation sample, is used to estimate the risk of the algorithm. The cross-validation method finally selects the algorithm with the smallest estimated risk. K-fold cross-validation is a refinement of the hold-out method: the dataset is divided into k subsets and the hold-out method is repeated k times. Each time, one of the k subsets is used as the testing set and the union of the other folds is used as the training set. The k results from the folds are averaged and the error is computed. The advantage of this method is that all observations are used for both training and testing, and each observation is used for testing exactly once.
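Because the classifiers in this work are built with Weka, k-fold cross-validation can also be run programmatically through Weka's Evaluation class. The sketch below is a minimal example, not the evaluation script used in this research; it assumes the student data has been exported to an ARFF file (the name students.arff is a hypothetical placeholder) with the grade as the last attribute.

import java.util.Random;

import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationSketch {
    public static void main(String[] args) throws Exception {
        // Load the data and mark the last attribute (the grade) as the class
        Instances data = new DataSource("students.arff").getInstances();
        data.setClassIndex(data.numAttributes() - 1);

        // 10-fold cross-validation of a Naive Bayes classifier
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new NaiveBayes(), data, 10, new Random(1));

        System.out.println(eval.toSummaryString());
        System.out.printf("Correctly classified: %.2f%%%n", eval.pctCorrect());
    }
}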
Evaluation Metrics

Metrics summarize the performance of a model and give a simplified view of its behavior. Using several performance metrics and checking whether they agree therefore helps in understanding the model's behavior by quantifying its performance. Data mining algorithms can be compared according to a number of measures: to compare the performance of different algorithms and determine their predictive ability, quantities that describe the goodness of fit of a model as well as error measurements must be considered. Although empirical studies have found it difficult to decide which metric to use for a given problem, each metric has specific features that measure particular aspects of the algorithms being evaluated. It is often difficult to state which metric is the most suitable for evaluating algorithms on educational data, due to the large weighted discrepancies that often arise between predicted and actual values; combining different metrics may therefore give a more accurate picture. For instance, some metrics, such as the true positive rate (TPR), take higher values when the algorithm gives better results, whereas error metrics take lower values. The metrics are divided into three families: probabilistic understanding of errors, qualitative understanding of errors, and visual metrics.

Probabilistic understanding of errors: These metrics are based on a probabilistic understanding of predictions pi and of errors (pi − oi), where pi is the predicted outcome and oi is the actual outcome. This type of metric is natural mainly for predictions of performance, i.e., correctness of answers. The most commonly used metrics based on probabilistic understanding of errors are Mean Absolute Error (MAE) and Root Mean Square Error (RMSE). Typically, the lower the values of MAE and RMSE, the higher the performance of the classifier. Mean absolute error considers the absolute differences between predictions and answers:

MAE = (1/n) Σ |pi − oi|

This is not a suitable performance metric in every case, because it prefers models that are biased towards the majority result. Despite this disadvantage it is sometimes used for the evaluation of student models. Root mean square error is obtained by using squared values instead of absolute values:

RMSE = sqrt( (1/n) Σ (pi − oi)² )

In the particular context of student modelling and evaluation of probabilities, RMSE is not particularly easy to interpret; nevertheless, its use is very common in EDM, particularly for the evaluation of skill models.

Qualitative understanding of errors: These metrics are based on a qualitative understanding of errors, i.e., whether a prediction is correct or incorrect. In student modeling this approach is suitable mainly for predictions of student state. If predictions have to be assigned to classes, the classification can be done by choosing a threshold and comparing each prediction to this threshold. Once predictions are divided into classes, they can be tallied as true/false positives/negatives in a confusion matrix. The confusion matrix juxtaposes the observed classifications of a phenomenon (columns) with the predicted classifications of a model (rows). The classifications that lie along the major diagonal of the table are the correct classifications, that is, the true positives (TP) and true negatives (TN); the other fields signify model errors, the false positives (FP) and false negatives (FN). The most common qualitative performance metrics calculated from the matrix are accuracy, sensitivity, specificity, precision, and F-measure. These statistical measures are commonly used to describe the dataset and to estimate how good and consistent the classifier is.

Accuracy: measures how close the predicted class values are to the actual values, i.e., the proportion of correctly classified instances.

Accuracy = (TP + TN) / (TP + TN + FP + FN) × 100%

Sensitivity: measures the ability of a test to be positive when the condition is actually present; it is also known as recall.

Sensitivity (Recall) = TP / (TP + FN) × 100%

Specificity: measures the ability of a test to be negative when the condition is actually not present.

Specificity = TN / (TN + FP) × 100%

Precision: measures the positive predictive value.

Precision = TP / (TP + FP) × 100%

F-Measure: a measure that combines precision and recall; it is the harmonic mean of the two.

F = 2 · Precision · Recall / (Precision + Recall)
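For a two-class confusion matrix these quantities are simple ratios of the four cell counts. The snippet below only illustrates the formulas above; the counts are hypothetical placeholders, not values from this study.

public class ConfusionMetrics {
    public static void main(String[] args) {
        double tp = 50, tn = 40, fp = 10, fn = 5;        // hypothetical counts

        double accuracy    = (tp + tn) / (tp + tn + fp + fn);
        double recall      = tp / (tp + fn);             // sensitivity / TPR
        double specificity = tn / (tn + fp);
        double precision   = tp / (tp + fp);
        double fMeasure    = 2 * precision * recall / (precision + recall);

        System.out.printf("Accuracy=%.1f%% Sensitivity=%.1f%% Specificity=%.1f%% "
                        + "Precision=%.1f%% F-measure=%.3f%n",
                accuracy * 100, recall * 100, specificity * 100, precision * 100, fMeasure);
    }
}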
Visual metrics (ROC)

Receiver Operating Characteristic (ROC) graphs are a useful technique for organizing classifiers and visualizing their performance. This approach to evaluating predictions takes the ranking of predictions into account, i.e., the values of pi are considered relative to each other. The ROC curve summarizes the qualitative error of the prediction model over all possible thresholds. The curve has the false positive rate (1 − specificity) on the x-axis and the true positive rate (sensitivity) on the y-axis; each point of the curve corresponds to a choice of threshold. The area under the ROC curve (AUC) provides a summary performance measure across all possible thresholds. It is equal to the probability that a randomly selected positive observation has a higher predicted score than a randomly selected negative observation. When the false positive rate (FPR) and the true positive rate (TPR) are plotted on the x- and y-axes respectively, the classifier with a high TPR and a low FPR is the one with better performance, whereas a classifier with both a high TPR and a high FPR tends to label almost every instance as positive.

Figure 2: Performance Comparison Graph

In order to improve the behavior of educational systems and to gain insight into the learning process and the teaching methods, this study combines the above-mentioned measures of performance to locate the predictor and/or classifier used in the educational model on the ROC graph. We divided the upper AUC region into three zones: a poor zone, in which the algorithm's accuracy is disputed; a reasonable zone, in which the algorithm is somewhat efficient; and a perfect zone, in which the algorithm's performance is the best.

Weka Workbench

Weka provides implementations of learning algorithms that can easily be applied to a dataset. It also includes a variety of tools for transforming datasets. We can preprocess a dataset, feed it into a learning scheme, and analyze the resulting classifier and its performance, all without writing any program code. The workbench includes methods for all the standard data mining problems: regression, classification, clustering, association rule mining, and attribute selection. Getting to know the data is an integral part of the work, and many data visualization facilities and data preprocessing tools are provided. All algorithms take their input in the form of a single relational table in ARFF or CSV format, which can be read from a file or generated by a database query. One way of using Weka is to apply a learning method to a dataset and analyze its output to learn more about the data. Another is to use learned models to generate predictions on new instances. A third is to apply several different learners and compare their performance in order to choose one for prediction. The learning methods are called classifiers, and in the interactive Weka interface the desired one is selected from a menu. Many classifiers have tunable parameters, which are accessed through a property sheet or object editor. In most data mining applications, the machine learning component is just a small part of a far larger software system. If you intend to write a data mining application, you will want to access the programs in Weka from inside your own code; by doing so, you can solve the machine learning sub-problem of your application with a minimum of additional programming.
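As an illustration of calling Weka from code rather than through the GUI, the sketch below trains a J48 tree and uses it to label new, unlabeled instances. It is a minimal example; the file names train.arff and unlabeled.arff are hypothetical placeholders.

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PredictSketch {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("train.arff").getInstances();
        train.setClassIndex(train.numAttributes() - 1);

        J48 tree = new J48();
        tree.buildClassifier(train);                 // learn the model

        Instances unlabeled = new DataSource("unlabeled.arff").getInstances();
        unlabeled.setClassIndex(unlabeled.numAttributes() - 1);
        for (int i = 0; i < unlabeled.numInstances(); i++) {
            double index = tree.classifyInstance(unlabeled.instance(i));
            // map the predicted class index back to its label, e.g. "First"
            System.out.println(unlabeled.classAttribute().value((int) index));
        }
    }
}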
Parameter estimation of different classifiers

Parameter selection of Multiple Layer Neural Networks

This section explains in detail which parameters are considered, and how, in building a multilayer neural network classifier with the Weka data mining tool. The parameters and the values used are shown in Table 5.2.

Parameter                Value
autoBuild                True
debug                    False
decay                    False
hiddenLayers             a
learningRate             0.3
momentum                 0.2
normalizeAttributes      True
nominalToBinaryFilter    True
reset                    False
seed                     0
trainingTime             500
validationSetSize        0
validationThreshold      20
activationFunction       sigmoid

Table 5.2: Parameters used in the Weka tool for the Multiple Layer Neural Network classifier

As shown in Table 5.2, autoBuild is set to True, which means that the tool adds and connects up the hidden layers of the network, and debug is set to False, meaning that the classifier does not output additional information to the console. The decay parameter is important in building a multilayer neural network: when it is true, the starting learning rate is divided by the epoch number to determine the current learning rate, which may help to stop the network from diverging from the target output and can improve general performance. The hiddenLayers parameter determines the hidden layers of the network; it takes a list of positive whole numbers, or one of the wildcard values a = (attributes + classes) / 2, i = attributes, o = classes, t = attributes + classes. The learningRate parameter determines how much the weights are updated. Too low a learning rate makes the network learn very slowly; too large a learning rate proceeds much faster but may simply produce oscillations between relatively poor solutions. Typical values lie between 0 and 1, so an optimal value that is neither too low nor too high should be chosen; we chose 0.3. The momentum parameter, applied to the weights during updating, can help to speed up convergence and to avoid local minima; it ranges between 0 and 0.9. The combination of learning rate and momentum is important to choose, as momentum allows a larger learning rate, which speeds convergence and helps avoid local minima; on the other hand, a learning rate of 1 with no momentum is much faster when there is no problem with local minima or non-convergence. The nominalToBinaryFilter parameter preprocesses the instances with a nominal-to-binary filter, which can improve performance if there are nominal attributes in the data. The normalizeAttributes parameter normalizes the attributes, which can also improve the performance of the network and does not rely on the class being numeric. The reset parameter, when set to True, allows the network to reset with a lower learning rate: if the network diverges from the answer, it is automatically reset with a lower learning rate and training begins again. The seed parameter is used to initialize the random number generator; random numbers are used for setting the initial weights of the connections between nodes and for shuffling the training data. The trainingTime parameter determines the number of epochs to train through; if the validation set size is non-zero, training can terminate early. The validationSetSize parameter determines the percentage size of the validation set: training continues until the error on the validation set has been consistently getting worse, or until the training time is reached. If it is set to zero, no validation set is used and the network trains for the specified number of epochs. The validationThreshold parameter is used to terminate validation testing; its value dictates how many times in a row the validation set error can get worse before training is terminated. Finally, the activationFunction parameter determines the type of activation function used in the model. As discussed in Chapter 2.1, different activation functions exist; the sigmoid function is used in this model.
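For reference, a similar configuration can be applied programmatically. The sketch below is illustrative only: it assumes Weka's MultilayerPerceptron setter names as exposed in its property sheet and a hypothetical train.arff file, and it reproduces only the main values from Table 5.2, not the exact experiment.

import weka.classifiers.functions.MultilayerPerceptron;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class MlpSketch {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("train.arff").getInstances();
        train.setClassIndex(train.numAttributes() - 1);

        MultilayerPerceptron mlp = new MultilayerPerceptron();
        mlp.setLearningRate(0.3);   // learningRate from Table 5.2
        mlp.setMomentum(0.2);       // momentum
        mlp.setHiddenLayers("a");   // wildcard a = (attributes + classes) / 2
        mlp.setTrainingTime(500);   // number of training epochs
        mlp.buildClassifier(train);
    }
}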
Parameter selection of Decision Trees

This section explains in detail which parameters are considered, and how, in building a Decision Tree classifier with the Weka data mining tool. The parameters and the values used are shown in Table 5.3.

Parameter           Value
confidenceFactor    0.25
minNumObj           2
numFolds            3
seed                1
unpruned            False
useLaplace          False

Table 5.3: Parameters used in the Weka tool for the Decision Tree classifier

There are usually two criteria for the quality of decision trees: classification accuracy and decision tree size, expressed as the number of nodes in the tree. As the size of a decision tree increases, so does its incomprehensibility, which is particularly undesirable if the tree is used to interpret the relationships discovered in the training data. If a decision tree is used only for classification, its size is less significant, as applying decision trees in practice is not demanding. Smaller trees are nevertheless often preferred to larger ones, as they do not overfit the training set and are less sensitive to noise. The size of decision trees can be controlled during the construction process by pruning. There are two approaches to decision tree pruning:

1. Pre-pruning - terminating the subtree construction during the tree-building process.
2. Post-pruning - pruning the subtrees of an already constructed tree.

Post-pruning tends to give better results than pre-pruning because it makes pruning decisions based on a fully grown tree, unlike pre-pruning, which can suffer from premature termination of the tree-growing process. However, with post-pruning, the additional computation needed to grow the full tree may be wasted when subtrees are pruned away. Parameters such as the confidence factor, the minimum number of objects per node and the number of folds play a vital role in the post-pruning technique. Lowering the confidence factor increases the amount of post-pruning: with less confidence in the training data, the error estimate for each node goes up, increasing the likelihood that it will be pruned away in favor of a more stable node upstream. We tested the J48 classifier with the confidence factor ranging from 0.1 to 0.5 in increments of 0.1, as well as auxiliary values approaching zero, and found that smaller values incur more pruning. Based on this, the confidenceFactor was set to 0.25. The minimum number of instances per node (minNumObj) was held at 2, and the number of cross-validation folds for the testing set (numFolds) was held at 3 during the confidence factor testing. The unpruned parameter is set to False, which means that pruning is performed. The useLaplace parameter, which determines whether the counts at the leaves are smoothed using Laplace smoothing, is set to False.
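The corresponding J48 configuration can also be set from code. This is a minimal sketch, assuming Weka's J48 setter names and a hypothetical train.arff file, using the values from Table 5.3.

import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class J48Sketch {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("train.arff").getInstances();
        train.setClassIndex(train.numAttributes() - 1);

        J48 tree = new J48();
        tree.setConfidenceFactor(0.25f); // confidence factor used for post-pruning
        tree.setMinNumObj(2);            // minimum instances per leaf
        tree.setUnpruned(false);         // keep pruning enabled
        tree.setUseLaplace(false);       // no Laplace smoothing at the leaves
        tree.buildClassifier(train);
        System.out.println(tree);        // prints the pruned tree
    }
}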
Parameter selection of Random Forests

This section explains in detail which parameters are considered, and how, in building a Random Forest classifier with the Weka data mining tool. The parameters and the values used are shown in Table 5.4.

Parameter      Value
debug          True
maxDepth       0
numFeatures    2
numTrees       100
seed           1

Table 5.4: Parameters used in the Weka tool for the Random Forest classifier

As shown in Table 5.4, the debug parameter is set to True, meaning that the classifier may output additional information to the console. The maxDepth parameter determines the maximum depth of the trees in the constructed random forest. Tree depth is not usually a limiting factor in a standard random forest implementation; it matters more when some form of gradient boosting or pruned decision trees is used. A value of 0 means unlimited depth. The seed parameter is used to initialize the random number generator; the random numbers are used for the bootstrap sampling of the training data and for selecting the random subsets of attributes. The forest error rate depends on two things:

1. The correlation between any two trees in the forest: increasing the correlation increases the forest error rate.
2. The strength of each individual tree in the forest: a tree with a low error rate is a strong classifier, and increasing the strength of the individual trees decreases the forest error rate.

The important parameters in building a Random Forest are therefore:

1. the number of trees used in the forest (numTrees), and
2. the number of random variables considered in each tree (numFeatures).

First, numFeatures is set to its default value (the square root of the total number of predictors) and a search is made for the optimal numTrees value. To find the number of trees that corresponds to a stable classifier, we build random forests with different numTrees values (100, 200, 300, ..., 1000). We build 10 Random Forest classifiers for each numTrees value, record the out-of-bag (OOB) error rate, and identify the number of trees at which the OOB error rate stabilizes and reaches its minimum. There are two ways to find the optimal numFeatures:

1. Apply a similar procedure, running the random forest 10 times, and select the number of predictors per split for which the out-of-bag error rate stabilizes and reaches its minimum.
2. Experiment with the square root of the total number of predictors, half of this square root value, and twice the square root value, and check which numFeatures returns the maximum area under the ROC curve. Thus, for 1000 predictors, the number of predictors to select at each node would be 16, 32, and 64.

Based on these experiments, numTrees and numFeatures in the Random Forest generation are chosen to be 100 and 2 respectively.
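The equivalent configuration in code, as a minimal sketch: it assumes the older (pre-3.8) Weka RandomForest API, in which the number of trees is set with setNumTrees (newer releases use setNumIterations instead), and a hypothetical train.arff file.

import weka.classifiers.trees.RandomForest;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RandomForestSketch {
    public static void main(String[] args) throws Exception {
        Instances train = new DataSource("train.arff").getInstances();
        train.setClassIndex(train.numAttributes() - 1);

        RandomForest forest = new RandomForest();
        forest.setNumTrees(100);   // numTrees from Table 5.4 (setNumIterations in Weka >= 3.8)
        forest.setNumFeatures(2);  // random attributes considered at each split
        forest.setMaxDepth(0);     // 0 = unlimited depth
        forest.setSeed(1);
        forest.buildClassifier(train);
    }
}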
Parameter selection of Naïve Bayes

This section explains in detail which parameters are considered, and how, in building a Naïve Bayes classifier with the Weka data mining tool. The parameters and the values used are shown in Table 5.5.

Parameter                      Value
debug                          False
useKernelEstimator             False
useSupervisedDiscretization    False

Table 5.5: Parameters used in the Weka tool for the Naïve Bayes classifier

As shown in Table 5.5, the debug parameter is set to False, meaning that the classifier does not output any additional information to the console. The useKernelEstimator parameter uses a kernel estimator for numeric attributes rather than a normal distribution. The useSupervisedDiscretization parameter uses supervised discretization to convert numeric attributes to nominal ones. Both useKernelEstimator and useSupervisedDiscretization are mainly relevant for numeric attributes, which is not the case with our data, where all attributes are categorical. Hence, we can ignore these parameters and build the Naïve Bayes classifier.

System Implementation

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;

public class Bayes {

    // Frequency tables: attribute value -> counts for (First, Second, Third, Fail)
    static HashMap<String, List<Integer>> ATT = new HashMap<>();
    static HashMap<String, List<Integer>> QZ = new HashMap<>();
    static HashMap<String, List<Integer>> ASS = new HashMap<>();
    static HashMap<String, List<Integer>> CP = new HashMap<>();
    static HashMap<String, List<Integer>> EX = new HashMap<>();

    // Class counts used for the prior probabilities
    static int firstCount;
    static int secondCount;
    static int thirdCount;
    static int failCount;
    static int totalCount;
    static List<String> results = new ArrayList<String>();

    public static void main(String[] args) {
        intialization();
        process();
        BufferedReader reader = null;
        BufferedWriter writer = null;
        try {
            File file = new File("C:/Users/rohith/workspace/JavaExamples/Result.csv");
            FileWriter fw = new FileWriter(file.getAbsoluteFile());
            reader = new BufferedReader(new FileReader(
                    "C:/Users/rohith/workspace/JavaExamples/TestData.csv"));
            writer = new BufferedWriter(fw);
            String csvLine;
            while ((csvLine = reader.readLine()) != null) {
                String[] row = csvLine.split(",");
                String att = row[0];
                String qz = row[1];
                String ass = row[2];
                String cp = row[3];
                String ex = row[4];
                String st = row[5];
                double f, s, t, fail;
                // Posterior numerator for each class: product of likelihoods times the prior
                f = ((double) ATT.get(att).get(0) / firstCount)
                        * ((double) QZ.get(qz).get(0) / firstCount)
                        * ((double) ASS.get(ass).get(0) / firstCount)
                        * ((double) CP.get(cp).get(0) / firstCount)
                        * ((double) EX.get(ex).get(0) / firstCount)
                        * ((double) firstCount / totalCount);
                s = ((double) ATT.get(att).get(1) / secondCount)
                        * ((double) QZ.get(qz).get(1) / secondCount)
                        * ((double) ASS.get(ass).get(1) / secondCount)
                        * ((double) CP.get(cp).get(1) / secondCount)
                        * ((double) EX.get(ex).get(1) / secondCount)
                        * ((double) secondCount / totalCount);
                t = ((double) ATT.get(att).get(2) / thirdCount)
                        * ((double) QZ.get(qz).get(2) / thirdCount)
                        * ((double) ASS.get(ass).get(2) / thirdCount)
                        * ((double) CP.get(cp).get(2) / thirdCount)
                        * ((double) EX.get(ex).get(2) / thirdCount)
                        * ((double) thirdCount / totalCount);
                fail = ((double) ATT.get(att).get(3) / failCount)
                        * ((double) QZ.get(qz).get(3) / failCount)
                        * ((double) ASS.get(ass).get(3) / failCount)
                        * ((double) CP.get(cp).get(3) / failCount)
                        * ((double) EX.get(ex).get(3) / failCount)
                        * ((double) failCount / totalCount);
                // Pick the largest of First/Second/Third ...
                int i = (f > s) ? ((f > t) ? 0 : 2) : ((s > t) ? 1 : 2);
                writer.write(att + "," + qz + "," + ass + "," + cp + "," + ex + "," + st + ",");
                // ... and compare the winner against Fail before writing the prediction
                if (i == 0) {
                    if (f > fail)
                        writer.write("First");
                    else
                        writer.write("Fail");
                } else if (i == 1) {
                    if (s > fail)
                        writer.write("Second");
                    else
                        writer.write("Fail");
                } else {
                    if (t > fail)
                        writer.write("Third");
                    else
                        writer.write("Fail");
                }
                writer.write("\n");
            }
        } catch (IOException ex) {
            throw new RuntimeException("Error in reading CSV file: " + ex);
        } finally {
            try {
                reader.close();
                writer.close();
            } catch (IOException e) {
                throw new RuntimeException("Error while closing Reader: " + e);
            }
        }
    }

    // Initialize every attribute map with zero counts for each of the four classes
    public static void intialization() {
        String[] labels = { "Good", "Average", "Poor" };
        for (int i = 0; i < 3; i++) {
            ATT.put(labels[i], new ArrayList<Integer>(Arrays.asList(0, 0, 0, 0)));
            QZ.put(labels[i], new ArrayList<Integer>(Arrays.asList(0, 0, 0, 0)));
            ASS.put(labels[i], new ArrayList<Integer>(Arrays.asList(0, 0, 0, 0)));
            CP.put(labels[i], new ArrayList<Integer>(Arrays.asList(0, 0, 0, 0)));
            EX.put(labels[i], new ArrayList<Integer>(Arrays.asList(0, 0, 0, 0)));
        }
    }

    // Training phase: build the frequency tables and class counts from TrainData.csv
    public static void process() {
        BufferedReader reader = null;
        try {
            reader = new BufferedReader(new FileReader(
                    "C:/Users/rohith/workspace/JavaExamples/TrainData.csv"));
            String csvLine;
            HashMap<String, Integer> result = new HashMap<>();
            String[] labels = { "First", "Second", "Third", "Fail" };
            for (int i = 0; i <= 3; i++) {
                result.put(labels[i], i);
            }
            while ((csvLine = reader.readLine()) != null) {
                String[] row = csvLine.split(",");
                countIncrement(row[5]);
                int index = result.get(row[5]);
                // Increment the count of each attribute value for this instance's class
                ATT.get(row[0]).set(index, ATT.get(row[0]).get(index) + 1);
                QZ.get(row[1]).set(index, QZ.get(row[1]).get(index) + 1);
                ASS.get(row[2]).set(index, ASS.get(row[2]).get(index) + 1);
                CP.get(row[3]).set(index, CP.get(row[3]).get(index) + 1);
                EX.get(row[4]).set(index, EX.get(row[4]).get(index) + 1);
            }
        } catch (IOException ex) {
            throw new RuntimeException("Error in reading CSV file: " + ex);
        } finally {
            try {
                reader.close();
            } catch (IOException e) {
                throw new RuntimeException("Error while closing Reader: " + e);
            }
        }
    }

    // Count class occurrences for the prior probabilities
    public static void countIncrement(String s) {
        totalCount++;
        if (s.equals("First"))
            firstCount++;
        else if (s.equals("Second"))
            secondCount++;
        else if (s.equals("Third"))
            thirdCount++;
        else
            failCount++;
    }
}
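Based on the column indices used in the program (row[0] through row[5]), TrainData.csv and TestData.csv are headerless comma-separated files with six values per line, in the order ATT, QZ, ASS, CP, EX and Grade; for the test file, the sixth column holds the actual grade, which is copied into Result.csv next to the predicted grade. An illustrative row (made up for this example, not taken from the real data set) would look like:

Good,Average,Average,Poor,Poor,Third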