DOI 10.4010/2016.1809
ISSN 2321 3361 © 2016 IJESC
Research Article
Volume 6 Issue No. 6
Data Mining of Imbalanced Dataset in Educational Data Using
Weka Tool
Mohammad Imran 1, Mohammed Afroze2, Dr. Suresh Kumar Sanampudi3, Dr. Ahmed Abdul Moiz Qyser 4
Assistant Professor 1, 2, 3, Professor and Head4
Department of CSE1, 4
Department of Information Technology2, 3
Muffakham Jah College of Engineering and Technology, Telangana, India1, 2, 4
JNTUH College Of Engineering, Telangana, India3
[email protected], [email protected], [email protected]
Abstract:
Data mining approaches have been used for business purposes since their inception; at present, however, they are also applied successfully in new and emerging areas like education systems. In this study, we use data mining approaches to predict students' final outcome, i.e. the final grade in a particular course, by overcoming the problem of an imbalanced dataset. We apply several re-sampling techniques to balance the dataset so as to improve performance. The re-sampling techniques include SMOTE (Synthetic Minority Oversampling Technique) and ROS (Random Over Sampling). Experimental results demonstrate that re-sampling improves the performance of the classification models that are built to predict students' final grade in a particular course.
Keywords: Imbalanced dataset, SMOTE, Naive Bayes, ROS, Educational Data Mining (EDM), Decision Tree.
INTRODUCTION
Data mining in education is a developing discipline, concerned with creating techniques for exploring the distinctive kinds of data that come from educational settings, and with using those methods to better understand students and the settings in which they learn [1]. Data mining is the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data. Since large amounts of data are stored in educational databases, various data mining techniques have been developed and applied in order to obtain the required information and to find the hidden relationships. There is a variety of popular data mining tasks within educational data mining, e.g. classification, clustering, outlier detection, association rules, prediction and so on. We use three re-sampling techniques: SMOTE (Synthetic Minority Oversampling Technique), ROS (Random Over Sampling), and RUS (Random Under Sampling). Three different classifiers, i.e. Decision Tree, Naive Bayes and Neural Network, are trained again with the rebalanced data, and the trained models are used to classify the
test set.
Recent advances in data collection and storage technology have made it possible for organizations to accumulate huge amounts of data at moderate cost. While most data is not stored with predictive modelling or analysis in mind, the collected data could contain potentially valuable information. Exploiting this stored data, in order to extract useful and actionable information, is the overall goal of the generic activity termed data mining [2].
The class imbalance problem has received significant attention in areas such as machine learning and pattern recognition in recent years. A two-class dataset is said to be imbalanced when one of the classes (the minority one) is heavily under-represented in comparison with the other (majority) class. This concern is especially important in real-world applications where it is costly to misclassify examples from the minority class, for example detection of fraudulent telephone calls, diagnosis of rare diseases, information retrieval, and text classification and filtering tasks [3].
International Journal of Engineering Science and Computing, June 2016
Class imbalance problems are found in many applications, like fraud detection, risk management, text classification, and medical diagnosis [4]. Hierarchical agglomerative clustering, K-means clustering, and model-based clustering to group students based on their skill sets are discussed by Ayers et al. [5]. Several predictive models were built by Rus et al. [6] to detect students' mental models. In order to balance the minority and majority classes, i.e. to provide an equal number of instances to the classification models, over-sampling is used to increase the instances of the minority class and under-sampling is used to decrease the instances of the majority class. The authors used the Synthetic Minority Over-sampling approach, which provides good performance. To get good accuracy from fully imbalanced data, Chen [7] used several re-sampling techniques, e.g. SMOTE, over-sampling by duplicating minority examples, and random under-sampling. The re-sampling techniques could improve performance when applied with classification models, except for the Naive Bayes model. Our main focus is to predict students' final grades and to improve prediction further by overcoming the imbalanced nature of the dataset.
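The over- and under-sampling described above can be sketched in a few lines. This is only an illustrative pure-Python version under our own function names; the paper itself uses Weka's filters, not this code:

```python
# Hedged sketch of ROS (random over-sampling) and RUS (random
# under-sampling); both names and signatures are ours, for illustration.
import random

def random_over_sample(minority, target_size, seed=1):
    """ROS: duplicate randomly chosen minority instances until the class
    reaches target_size."""
    random.seed(seed)
    extra = [random.choice(minority) for _ in range(target_size - len(minority))]
    return minority + extra

def random_under_sample(majority, target_size, seed=1):
    """RUS: keep a random subset of the majority class of target_size."""
    random.seed(seed)
    return random.sample(majority, target_size)
```

ROS only repeats existing minority examples, which is why SMOTE's synthetic interpolation (described later) can generalise better.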
DATA MINING PROCESS:
The data mining techniques are essential for one of the most important phases of KDD: they are applied in the data analysis phase, and machine learning algorithms are used to produce the models that summarize the knowledge discovered [11].
Therefore, it is easy to see that educational tasks can benefit
from the knowledge extracted by data mining.
CLASSIFICATION TECHNIQUES:
NAIVE BAYES:
Naive Bayes is a straightforward probabilistic classifier based on Bayes' theorem. We used Naive Bayes classification to build four different models. We use a probability density function (PDF) to estimate the class likelihoods for continuous attributes and then apply Naive Bayes classification. This basic model is extended with data re-sampled by the SMOTE [4], ROS and RUS procedures [7] to overcome the class imbalance problem.
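As a rough illustration of the approach described above (not the authors' Weka models), a minimal Gaussian Naive Bayes that scores continuous attributes with a normal PDF might look like this; the class name, data and defaults are all made up for the sketch:

```python
# Illustrative Gaussian Naive Bayes: each continuous attribute is modelled
# per class by a normal PDF, and classes are scored by log prior + log
# likelihood. This is a sketch, not Weka's NaiveBayes implementation.
import math
from collections import defaultdict

def gaussian_pdf(x, mean, std):
    """Normal probability density used to score a continuous attribute."""
    return math.exp(-((x - mean) ** 2) / (2 * std ** 2)) / (math.sqrt(2 * math.pi) * std)

class GaussianNB:
    def fit(self, X, y):
        groups = defaultdict(list)
        for row, label in zip(X, y):
            groups[label].append(row)
        self.priors, self.stats = {}, {}
        for label, rows in groups.items():
            self.priors[label] = len(rows) / len(X)
            stats = []
            for col in zip(*rows):                    # per-attribute statistics
                mean = sum(col) / len(col)
                var = sum((v - mean) ** 2 for v in col) / len(col)
                stats.append((mean, max(1e-6, var ** 0.5)))  # floor the std
            self.stats[label] = stats
        return self

    def predict(self, row):
        best, best_score = None, float("-inf")
        for label, prior in self.priors.items():
            score = math.log(prior)
            for x, (mean, std) in zip(row, self.stats[label]):
                score += math.log(gaussian_pdf(x, mean, std))
            if score > best_score:
                best, best_score = label, score
        return best
```

For example, after fitting on two well-separated one-dimensional classes, `predict` returns the label whose Gaussian assigns the point the higher likelihood.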
C4.5 Algorithm:
C4.5 is one of the standard methods for classification. It is an extension of ID3 (Iterative Dichotomiser 3). In the ID3 algorithm, a decision tree is generated from a data set: we calculate the entropy of each attribute of the data set, and using the attribute of minimum entropy, i.e. maximum information gain, the data set is split into subsets. We split the data using the gain ratio, and the minimal size for a split was set to 4, i.e. those nodes where the number of children is greater than or equal to 4 will be split. Different weights can be applied to the features contained in the training data. C4.5 accepts both discrete and continuous features and handles incomplete data points [8, 9]. These are the major extensions of C4.5 over ID3.
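The entropy and gain-ratio computations mentioned above can be sketched as follows. This is an illustrative fragment for discrete attributes, with our own function names, not Weka's J48 implementation:

```python
# Entropy, information gain and gain ratio as used by C4.5 to pick a split.
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """Information gain of splitting on attribute `attr`, normalised by the
    split information (entropy of the partition), as in C4.5's gain ratio."""
    n = len(rows)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    gain = entropy(labels) - remainder
    split_info = entropy([row[attr] for row in rows])  # entropy of the split itself
    return gain / split_info if split_info > 0 else 0.0
```

Normalising by the split information is what distinguishes C4.5's gain ratio from ID3's plain information gain, which is biased toward attributes with many values.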
SMOTE:
SMOTE stands for Synthetic Minority Oversampling Technique, a very well-known over-sampling method for turning an imbalanced dataset into a balanced one. This technique provides a useful way to improve on random over-sampling by distributing the examples of the majority class and the minority class more equally. SMOTE creates synthetic examples of the minority class and tends to increase predictive accuracy over the minority class. In addition, inductive learners such as decision trees get the chance to widen their decision regions for the minority class. Hence, better performance can be attained on the imbalanced-data classification problem [4]. The synthetic examples are inserted along the line segments joining any or all of the k minority-class nearest neighbours. Neighbours from the k nearest neighbours are chosen randomly depending on the amount of over-sampling required. The original implementation uses five nearest neighbours. For instance, if the amount of over-sampling required is 200%, only two neighbours from the five nearest neighbours are chosen and one sample is generated in each direction [10].
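A minimal sketch of the interpolation step described above, assuming numeric features (the paper uses Weka's supervised SMOTE filter; this code and its signature are only illustrative):

```python
# Illustrative SMOTE: for each minority sample, synthesise points by
# interpolating toward randomly chosen members of its k nearest minority
# neighbours, as described in the text above.
import random

def smote(minority, k=5, amount_pct=100, seed=1):
    """Return synthetic minority samples; amount_pct=100 doubles the class."""
    random.seed(seed)
    synthetic = []
    per_sample = amount_pct // 100  # e.g. 200% -> 2 synthetic points per sample
    for point in minority:
        # k nearest minority neighbours by squared Euclidean distance (not self)
        neighbours = sorted(
            (p for p in minority if p is not point),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(point, p)),
        )[:k]
        for _ in range(per_sample):
            nb = random.choice(neighbours)
            gap = random.random()  # random position along the joining segment
            synthetic.append([a + gap * (b - a) for a, b in zip(point, nb)])
    return synthetic
```

Because each synthetic point lies on a segment between two real minority examples, the generated samples stay inside the minority region rather than merely duplicating it.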
WEKA TOOL:
The Weka workbench contains a collection of visualization tools and algorithms for data analysis and predictive modelling, together with graphical user interfaces for easy access to this functionality [12]. It is freely available software. It is portable and platform independent because it is completely implemented in the Java programming language and thus runs on almost any modern computing platform. Weka supports several standard data mining tasks: data pre-processing, clustering, classification, association, visualization, and feature selection. The Weka GUI Chooser launches Weka's graphical environment, which has six buttons: Simple CLI, Explorer, Experimenter, Knowledge Flow, ARFF Viewer, and Log.
RESULT AND DISCUSSION:
The dataset used in this study was obtained from the different branches of an engineering college. The initial size of the dataset is 50 records. The results obtained from the various data mining algorithms, viz. BayesNet, Naive Bayes, Multilayer Perceptron, IB1, Decision Table and PART, on the dataset for students of the different branches are tabulated and their performance analysed. The comparison table gives the total number of instances, the correctly and incorrectly classified instances, the time taken to build a model, and the confusion matrix.
Figure 1: Student's dataset in Weka 3.6.8, Explorer window.
Fig 2: Weka classifier cost/benefit analysis – trees.J48.
=== Run information ===
Scheme:     weka.classifiers.bayes.NaiveBayes
Relation:   weka.datagenerators.classifiers.classification.BayesNet-S_1_-n_100_-A_20_-C_2-weka.filters.supervised.instance.SMOTE-C0-K5-P100.0-S1
Instances:  148
Attributes: 10

=== Summary ===
Correctly Classified Instances        101               68.2432 %
Incorrectly Classified Instances       47               31.7568 %
Kappa statistic                         0.2609
Mean absolute error                     0.4007
Root mean squared error                 0.4582
Relative absolute error                87.7652 %
Root relative squared error            95.9463 %
Total Number of Instances             148
=== Detailed Accuracy By Class ===
                 TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
                 0.423     0.177     0.564       0.423    0.484       0.673      Value1
                 0.823     0.577     0.725       0.823    0.771       0.673      Value2
Weighted Avg.    0.682     0.436     0.668       0.682    0.67        0.673
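The per-class figures in the table above fit together as follows: precision and recall come from confusion-matrix counts, and F-measure is their harmonic mean. A small sketch (the count arguments in the test are hypothetical; note that recomputing F-measure from the table's rounded precision and recall can differ in the last digit):

```python
# How Weka's per-class metrics relate to confusion-matrix counts.
def precision_recall(tp, fp, fn):
    """Per-class precision and recall from confusion-matrix counts."""
    return tp / (tp + fp), tp / (tp + fn)

def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)
```

For the Value2 row, for example, f_measure(0.725, 0.823) ≈ 0.771, matching the table.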
CONCLUSION:
This work is an attempt to use data mining techniques to analyse students' academic data and to improve the quality of the technical education system. In this work we applied six classification techniques to student data, i.e. BayesNet, Naive Bayes, Multilayer Perceptron, IB1, Decision Table and PART. We observe that, according to the experimental results, the IB1 classifier is the most suitable technique for this kind of student dataset. Higher management, the executives of the training and placement departments of engineering colleges, or company executives can use such classification models to measure or visualise students' performance according to the extracted data.
For future work, this study will be useful for institutions and industry. Further insight can be produced by applying other data mining techniques, such as clustering, prediction and association rules, with the help of data mining tools.
References:
[1] International Educational Data Mining Society, www.educationaldatamining.org.
[2] M. J. A. Berry and G. Linoff, Data Mining Techniques: For Marketing, Sales and Customer Support, Wiley, 1997.
[3] V. García, J.S. Sánchez, R.A. Mollineda, R. Alejo, J.M.
Sotoca, “The class imbalance problem in pattern
classification and learning”, Pattern Analysis and Learning
Group, Dept.de Llenguatjes i Sistemes Informàtics,
Universitat Jaume I.
[4] N. V. Chawla et al., "SMOTE: Synthetic Minority Over-sampling Technique", Journal of Artificial Intelligence Research, vol. 16, pp. 321-357, 2002.
[5] E. Ayers, R. Nugent, and N. Dean, "A Comparison of Student Skill Knowledge Estimates", in International Conference on Educational Data Mining, Cordoba, Spain, pp. 1-10, 2009.

[6] V. Rus, M. Lintean, R. Azevedo, "Automatic detection of student mental models during prior knowledge activation in MetaTutor", in International Conference on Educational Data Mining, Cordoba, Spain, pp. 161-170, 2009.
[7] Y. Chen, "Learning Classifiers from Imbalanced, Only Positive and Unlabeled Data Sets". Retrieved July 25, 2014, from https://www.cs.iastate.edu/~yetianc/cs573/files/CS573_ProjectReportYetianChen.pdf
[8] J. R. Quinlan, "C4.5: Programs for Machine Learning", Morgan Kaufmann, 1993.
[9] P. Tan, V. Kumar & M. Steinbach, "Introduction to Data Mining", Addison Wesley, USA, 2006.
[10] R. C. Prati, E. A. Gustavo, P. A. Batista, and M. C. Monard, "Data mining with imbalanced class distributions: concepts and methods", 4th Indian International Conference on Artificial Intelligence, pp. 359-376, 2009.
[11] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, “From
Data Mining to Knowledge Discovery in Databases”, AI
Magazine, vol. 17, pp. 37-54. 1996.
[12] International Educational Data Mining Society, www.educationaldatamining.org.