DOI 10.4010/2016.1809  ISSN 2321 3361  © 2016 IJESC
Research Article, Volume 6, Issue No. 6

Data Mining of Imbalanced Dataset in Educational Data Using Weka Tool

Mohammad Imran1, Mohammed Afroze2, Dr. Suresh Kumar Sanampudi3, Dr. Ahmed Abdul Moiz Qyser4
Assistant Professor1, 2, 3, Professor and Head4
Department of CSE1, 4, Department of Information Technology2, 3
Muffakham Jah College of Engineering and Technology, Telangana, India1, 2, 4
JNTUH College of Engineering, Telangana, India3
[email protected], [email protected], [email protected]

Abstract: Data mining approaches have been used for business purposes since their inception; today, however, they are also applied successfully in new and emerging areas such as education systems. In this study, we use data mining approaches to predict students' final result, i.e., the final grade in a particular course, by overcoming the problem of an imbalanced dataset. We apply several re-sampling techniques to balance the dataset and thereby improve performance. The re-sampling techniques include SMOTE (Synthetic Minority Over-sampling Technique) and ROS (Random Over-Sampling). Experimental results show that re-sampling improves the performance of the classification models built to predict students' final grade in a particular course.

Keywords: Imbalanced dataset, SMOTE, Naive Bayes, ROS, Educational Data Mining (EDM), Decision Tree.

INTRODUCTION:
Educational data mining is a developing discipline concerned with creating methods for exploring the distinctive kinds of data that come from educational settings, and with using those methods to better understand students and the settings in which they learn [1]. Data mining is the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data.
As is well known, large amounts of data are stored in educational databases, so in order to obtain the required information and to discover hidden relationships, different data mining techniques have been developed and used. There are a variety of popular data mining tasks within educational data mining, e.g., classification, clustering, outlier detection, association rules, prediction, and so on. We use three re-sampling techniques: SMOTE (Synthetic Minority Over-sampling Technique), ROS (Random Over-Sampling), and RUS (Random Under-Sampling). Three different classifiers, i.e., Decision Tree, Naive Bayes and Neural Network, are trained again on the rebalanced data, and the trained models are used to classify the test set. Recent advances in data collection and storage technology have made it possible for organizations to accumulate huge amounts of data at moderate cost. While most data is not stored with predictive modelling or analysis in mind, the collected data may contain potentially valuable information. Exploiting this stored data in order to extract useful and actionable information is the overall goal of the generic activity termed data mining [2]. The class imbalance problem has received significant attention in areas such as machine learning and pattern recognition in recent years. A two-class data set is said to be imbalanced when one of the classes (the minority one) is heavily under-represented in contrast to the other class (the majority one). This concern is especially important in real-world applications where it is costly to misclassify examples from the minority class, for example, detection of fraudulent telephone calls, diagnosis of rare diseases, information retrieval, and text classification and filtering tasks [3].
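The two random re-sampling strategies mentioned above, ROS and RUS, can be sketched in plain Python. This is a simplified illustration with our own function names, not the Weka implementation used in the experiments: ROS duplicates randomly chosen minority examples until the classes are equal in size, while RUS randomly discards majority examples.

```python
import random
from collections import Counter

def random_over_sample(X, y, seed=1):
    """ROS: duplicate randomly chosen examples of each smaller class
    until every class has as many instances as the majority class."""
    rng = random.Random(seed)
    counts = Counter(y)
    majority_size = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        idx = [i for i, lab in enumerate(y) if lab == label]
        for _ in range(majority_size - n):
            i = rng.choice(idx)          # pick a duplicate at random
            X_out.append(X[i])
            y_out.append(label)
    return X_out, y_out

def random_under_sample(X, y, seed=1):
    """RUS: randomly keep only as many examples of each class as the
    minority class has, discarding the rest."""
    rng = random.Random(seed)
    counts = Counter(y)
    minority_size = min(counts.values())
    keep = []
    for label in counts:
        idx = [i for i, lab in enumerate(y) if lab == label]
        keep.extend(rng.sample(idx, minority_size))
    keep.sort()
    return [X[i] for i in keep], [y[i] for i in keep]
```

For instance, on a dataset with 8 "pass" and 2 "fail" students, ROS yields 16 instances (8/8) and RUS yields 4 instances (2/2).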
Class imbalance problems are found in many applications, such as fraud detection, risk management, text classification, and medical diagnosis [4]. Hierarchical agglomerative clustering, K-means clustering, and model-based clustering to group students based on their skill sets are discussed by Ayers et al. [5]. Several predictive models were built by Rus et al. [6] to detect a student's mental model. In order to balance the minority and majority classes, i.e., to provide an equal number of instances to the classification models, over-sampling is used to increase the instances of the minority class and under-sampling is used to decrease the instances of the majority class. The authors used the Synthetic Minority Over-sampling approach, which provides good performance. To get good accuracy from fully imbalanced data, Chen [7] used several re-sampling techniques, e.g., SMOTE, over-sampling by duplicating minority examples, and random under-sampling. The re-sampling techniques improved performance when applied with the classification models, except for the Naive Bayes model. Our main focus is to predict students' final grade and to improve this prediction further by overcoming the imbalanced nature of the dataset.

DATA MINING PROCESS:
Data mining techniques are essential for one of the most important stages of KDD: they are applied in the data analysis phase, and machine learning algorithms are used to produce the models that summarize the knowledge discovered [11]. It is therefore easy to see that educational tasks can benefit from the knowledge extracted by data mining.

CLASSIFICATION TECHNIQUES:

NAIVE BAYES:
Naive Bayes is a simple probabilistic classifier that applies Bayes' theorem. We used Naive Bayes classification to build four different models. We use a probability density function (PDF) to estimate the class-conditional likelihoods of continuous attributes and then apply Naive Bayes classification.
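The PDF-based handling of continuous attributes can be illustrated with a minimal Gaussian Naive Bayes sketch in plain Python. This is our own illustration under the common normality assumption, not Weka's implementation: each attribute gets a per-class normal PDF, and prediction maximizes log prior plus summed log likelihoods.

```python
import math
from collections import defaultdict

def train_gnb(X, y):
    """Estimate per-class priors and per-attribute normal parameters
    (mean, variance) from continuous-valued training data."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        stats = []
        for j in range(len(rows[0])):
            col = [r[j] for r in rows]
            mean = sum(col) / n
            var = sum((v - mean) ** 2 for v in col) / n + 1e-9  # smoothed
            stats.append((mean, var))
        model[label] = (n / len(X), stats)
    return model

def predict_gnb(model, row):
    """Return the class maximizing log prior + sum of log normal PDFs."""
    def log_pdf(v, mean, var):
        return -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
    best, best_score = None, -math.inf
    for label, (prior, stats) in model.items():
        score = math.log(prior) + sum(
            log_pdf(v, m, s) for v, (m, s) in zip(row, stats))
        if score > best_score:
            best, best_score = label, score
    return best
```

Working in log space avoids numerical underflow when many attribute likelihoods are multiplied together.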
This basic model is then extended with data re-sampled by the SMOTE [4], ROS and RUS techniques [7] to overcome the class imbalance problem.

C4.5 Algorithm:
C4.5 is one of the standard methods for classification. It is an extension of ID3 (Iterative Dichotomiser 3). In the ID3 algorithm, a decision tree is generated from a data set; we need to calculate the entropy of each attribute of the data set. Using the attribute of minimum entropy, or maximum information gain, the data set is split into subsets. We split the data using the gain ratio, and the minimal size for a split was set to 4, i.e., those nodes where the number of children is greater than or equal to 4 will be split. Different weights can be applied to the features that make up the training data. C4.5 accepts both discrete and continuous features and handles incomplete data points [8, 9]. These are the major extensions of C4.5 over ID3.

SMOTE:
SMOTE stands for Synthetic Minority Over-sampling Technique, a very well known over-sampling method for adjusting an imbalanced dataset to produce a balanced one. The technique provides a useful way to improve on random over-sampling by distributing the examples of the majority class and the minority class equally. SMOTE creates synthetic examples of the minority class and tends to increase predictive accuracy over the minority class. In addition, inductive learners such as decision trees get the chance to broaden their decision regions for the minority class. Thus, better performance can be attained on imbalanced-data classification problems [4]. The synthetic examples are inserted along the line segments joining any or all of the k minority-class nearest neighbours. Neighbours from the k nearest neighbours are chosen randomly depending on the amount of over-sampling required.
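A minimal sketch of this interpolation step in plain Python may help. This is our own simplified illustration rather than the reference implementation: it performs 100% over-sampling, generating one synthetic example per minority example, with a configurable k (defaulting to the usual 5 nearest neighbours).

```python
import math
import random

def smote(minority, k=5, seed=1):
    """For each minority example, pick one of its k nearest minority
    neighbours at random and create a synthetic example at a random
    point on the line segment between them (100% over-sampling)."""
    rng = random.Random(seed)
    synthetic = []
    for x in minority:
        # k nearest minority neighbours of x (excluding x itself)
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: math.dist(x, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random position along the segment
        synthetic.append([xi + gap * (ni - xi) for xi, ni in zip(x, nb)])
    return synthetic
```

Unlike ROS, which merely duplicates points, each synthetic example lies strictly within the region spanned by the minority class, which is what lets learners widen their decision regions.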
The implementation currently uses five nearest neighbours. For example, if the amount of over-sampling required is 200%, only two neighbours from the five nearest are chosen, and one example is generated in each direction [10].

WEKA TOOL:
The Weka workbench contains a collection of visualization tools and algorithms for data analysis and predictive modelling, together with graphical user interfaces for easy access to this functionality [12]. It is freely available software. It is platform independent and portable because it is fully implemented in the Java programming language and thus runs on almost any modern computing platform. Weka supports several standard data mining tasks: data pre-processing, clustering, classification, association, visualization, and feature selection. The Weka GUI Chooser launches Weka's graphical environment, which has six buttons: Simple CLI, Explorer, Experimenter, Knowledge Flow, ARFF Viewer, and Log.

RESULT AND DISCUSSION:
The data set used in this study was obtained from the different branches of an engineering college. The initial size of the data is 50. The results obtained from the various data mining algorithms, viz., BayesNet, Naive Bayes, Multilayer Perceptron, IB1, Decision Table and PART, on the data sets for the different branches of students are organized and their performance analysed. The comparison table gives the total number of instances, the correctly and incorrectly classified instances, the time taken to build a model, and the confusion matrix.

Figure 1: Student's dataset in Weka 3.6.8, Explorer window.

Figure 2: Weka classifier cost/benefit analysis – trees.J48.
=== Run information ===

Scheme:     weka.classifiers.bayes.NaiveBayes
Relation:   weka.datagenerators.classifiers.classification.BayesNet-S_1_-n_100_-A_20_-C_2-weka.filters.supervised.instance.SMOTE-C0-K5-P100.0-S1
Instances:  148
Attributes: 10

=== Summary ===

Correctly Classified Instances     101    68.2432 %
Incorrectly Classified Instances    47    31.7568 %
Kappa statistic                      0.2609
Mean absolute error                  0.4007
Root mean squared error              0.4582
Relative absolute error             87.7652 %
Root relative squared error         95.9463 %
Total Number of Instances          148

=== Detailed Accuracy By Class ===

               TP Rate  FP Rate  Precision  Recall  F-Measure  ROC Area  Class
               0.423    0.177    0.564      0.423   0.484      0.673     Value1
               0.823    0.577    0.725      0.823   0.771      0.673     Value2
Weighted Avg.  0.682    0.436    0.668      0.682   0.67       0.673

CONCLUSION:
This work is an attempt to use data mining methods to analyse students' academic data and to improve the quality of the technical education system. In this work we applied six classification techniques to student data, i.e., BayesNet, Naive Bayes, Multilayer Perceptron, IB1, Decision Table and PART. We observe that, according to the experimental results, the IB1 classifier is the most suitable technique for this kind of student dataset. Higher management, the executives of the training and placement department of engineering colleges, or company executives can use such classification models to measure or visualize students' performance according to the extracted data. For future work, this study will be useful for institutions and industries. Further insight can be produced by applying other data mining techniques, such as clustering, prediction and association rules, with the help of data mining tools.

References:

[1] International Educational Data Mining Society, www.educationaldatamining.org.

[2] M. J. A. Berry and G. Linoff, Data Mining Techniques: For Marketing, Sales and Customer Support, Wiley, 1997.

[3] V. García, J. S. Sánchez, R. A. Mollineda, R. Alejo, J. M. Sotoca, "The class imbalance problem in pattern classification and learning", Pattern Analysis and Learning Group, Dept. de Llenguatges i Sistemes Informàtics, Universitat Jaume I.

[4] N. V. Chawla et al., "SMOTE: Synthetic Minority Over-sampling Technique", Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.

[5] E. Ayers, R. Nugent, and N. Dean, "A Comparison of Student Skill Knowledge Estimates", in International Conference on Educational Data Mining, Cordoba, Spain, pp. 1-10, 2009.

[6] V. Rus, M. Lintean, R. Azevedo, "Automatic detection of student mental models during prior knowledge activation in MetaTutor", in International Conference on Educational Data Mining, Cordoba, Spain, pp. 161-170, 2009.

[7] Y. Chen, "Learning Classifiers from Imbalanced, Only Positive and Unlabeled Data Sets". Retrieved July 25, 2014, from https://www.cs.iastate.edu/~yetianc/cs573/files/CS573_ProjectReportYetianChen.pdf

[8] J. R. Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993.

[9] P. Tan, V. Kumar and M. Steinbach, Introduction to Data Mining, Addison Wesley, USA, 2006.

[10] R. C. Prati, E. A. Gustavo, P. A. Batista, and M. C. Monard, "Data mining with imbalanced class distributions: concepts and methods", 4th Indian International Conference on Artificial Intelligence, pp. 359-376, 2009.

[11] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, "From Data Mining to Knowledge Discovery in Databases", AI Magazine, vol. 17, pp. 37-54, 1996.

[12] International Educational Data Mining Society, www.educationaldatamining.org.