STUDENT DROP OUT FACTOR ANALYSIS AND TREND PREDICTION USING DECISION TREE

Running head: Student Drop out Factor Analysis and Trend Prediction Using Decision Tree

Jeeranan Chareonrat

Department of Business Computer, Faculty of Management Science, Sakon Nakhon Rajabhat University, Sakon Nakhon 47000, Thailand. Tel. 0-4297-0028; Fax. 0-4297-0028; E-mail: [email protected]

Abstract
Rising student drop-out rates are becoming a top priority for many educational institutions. This paper aims to identify and explore the factors influencing this growing phenomenon, focusing on a university in provincial Thailand. The research, conducted between 2010 and 2014, targeted Management Science students attending Sakon Nakhon Rajabhat University. The data set covered 14 attributes of 4,163 current students. Data analysis was undertaken using the J48 decision-tree classification algorithm with Weka's 10-fold cross-validation. The findings indicated that the four most significant factors associated with student drop-out were low GPA, student loans, earlier educational attainment, and parents' monthly income. Further analysis indicated that in 2010 and 2011 low GPA was the most significant factor, that student loans were added as a factor in 2012 and 2013, and that parents' income was added in 2014. This suggests a trend, expressed as classification rules, that may help predict drop-out in the current year, 2015.

Keywords: Data mining, Classification, Prediction, Student drop out

Introduction
Information Technology (IT), which supports efficient working processes and decision making, has developed rapidly in recent years. It plays a vital role in most organizations for collecting and manipulating data in large databases. Educational institutions, for example, store many kinds of information, including student data, lecturers' websites, and e-learning administration. Despite holding a large amount of data, they have seldom used it for other purposes, especially predictive analysis. Data mining is the process of discovering interesting patterns and knowledge from large amounts of data (Han et al., 2011). Gulati (2015) predicted student drop-out using data mining techniques. Yukselturk et al. (2014) predicted student drop-out from an online program using K-Nearest Neighbour (K-NN), Decision Tree (DT), Naive Bayes (NB), and Neural Network (NN). Omkar and Parag (2015) used the J48, Random Forest, REP Tree, and BF Tree decision-tree algorithms and the JRip rule learner.

The Faculty of Management Science has also been facing student drop-out, which is considered an important problem in the educational system because it directly affects the organization's budget management. Data from the Education Promotion Department showed that the drop-out rate was 20.68 percent during 2010-2014, which is considered high. This study aims to build a model that can identify the factors affecting student drop-out in each year by using data mining techniques. The resulting trend can help administrators prevent students from dropping out.

Materials and Methods
This research was conducted according to the Cross-Industry Standard Process for Data Mining (CRISP-DM) (Chapman et al., 2000). The process of the study was as follows:
1. Business Understanding: Student data were gathered from the Educational Promotion Department and related research was studied, which yielded 14 significant attributes. The study began with an attribute-class relationship analysis. The attributes suitable for analyzing the drop-out rate are shown in Table 1.
2. Data Understanding: The researcher selected 4,163 records of Management Science students stored in the database system of the Department of Academic Promotion for the years 2010-2014, then classified them by year as shown in Table 2.
3. Data Preparation: The steps for preparing the data for use with the WEKA program (Bouckaert et al., 2013) were as follows.
   1) Data Cleansing: Records with missing values, errors, or noisy data were dropped in the first cleaning step. This left 4,163 records out of the total of 4,366.
   2) Data Adjusting: The raw data consisted of both numerical and alphabetic values, so they had to be converted into a common form that could be analyzed.
4. Modeling and selecting the right technique: A classification data mining technique was used to build the model, with the J48 decision-tree algorithm to predict trends and WEKA's 10-fold cross-validation as the test option.
5. Evaluation:
   1) K-fold cross-validation (Hastie et al., 2008):
      divide the data set into K equal parts
      use K-1 parts as the training set
      use the remaining part as the test set
      repeat the process until every part has been used as the test set
   2) Accuracy (Mohammed and Wagner Meira, 2014): The accuracy of the model is expressed as a percentage. It and the related rates are calculated using the following formulas:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$

$$\text{TP Rate} = \frac{TP}{TP + FN} \tag{2}$$

$$\text{TN Rate} = \frac{TN}{TN + FP} \tag{3}$$

$$\text{FP Rate} = \frac{FP}{FP + TN} \tag{4}$$

$$\text{FN Rate} = \frac{FN}{TP + FN} \tag{5}$$

   where:
      TP is the True Positive value
      TN is the True Negative value
      FP is the False Positive value
      FN is the False Negative value
6. Deployment: The validated model is adapted and used as a prediction tool for the year 2015.

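The K-fold procedure listed under Evaluation can be illustrated with a minimal Python sketch. It is not WEKA's implementation (WEKA also stratifies the folds by class, which is omitted here); the function name `k_fold_splits`, the seed, and the record count of 715 (the 2010 data set size) are chosen only for illustration.

```python
import random


def k_fold_splits(n_records, k=10, seed=0):
    """Yield (train_indices, test_indices) pairs for K-fold cross-validation.

    Illustrative sketch only: plain shuffling, no class stratification.
    """
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)
    # Divide the shuffled indices into K nearly equal parts.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        # The other K-1 parts form the training set.
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


# Hypothetical usage with the size of the 2010 data set.
splits = list(k_fold_splits(715, k=10))
tested = sorted(idx for _, test in splits for idx in test)
```

Every record index appears in exactly one test fold, so after the K rounds each record has been used for testing once and for training K-1 times.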
Results and Discussion
The researcher used the data set of each year to build a model using the classification data mining technique with the J48 decision-tree algorithm, with 10-fold cross-validation in WEKA 3.7.5 as the test option. The model was used to find the factors affecting student drop-out and to follow how those factors changed from year to year. Figures 1 to 5 show the resulting trees and classification rules, and Table 3 reports the accuracy and the true and false rates.

According to Figure 1, the decision rules from the decision tree for 2010 were:
IF GPA = Weak THEN student Drop Out
IF GPA = Medium THEN student not Drop Out
IF GPA = Good THEN student not Drop Out
IF GPA = Best THEN student not Drop Out
IF GPA = Excellent THEN student not Drop Out
The important decision rule for 2010 indicated an average GPA of less than 2.00 as the factor affecting student drop-out.

According to Figure 2, the decision rules from the decision tree for 2011 were:
IF GPA = Weak THEN student Drop Out
IF GPA = Medium THEN student not Drop Out
IF GPA = Good THEN student not Drop Out
IF GPA = Best THEN student not Drop Out
IF GPA = Excellent THEN student not Drop Out
The important decision rule for 2011 indicated an average GPA of less than 2.00 as the factor affecting student drop-out.

According to Figure 3, the decision rules from the decision tree for 2012 were:
IF GPA = Weak AND Loan = No THEN student Drop Out
IF GPA = Weak AND Loan = Yes THEN student not Drop Out
IF GPA = Medium THEN student not Drop Out
IF GPA = Good THEN student not Drop Out
IF GPA = Best THEN student not Drop Out
IF GPA = Excellent THEN student not Drop Out
The important decision rule for 2012 indicated an average GPA of less than 2.00 and not being allowed to receive a student loan as the factors affecting student drop-out.

According to Figure 4, the decision rules from the decision tree for 2013 were:
IF GPA = Weak AND Loan = No THEN student Drop Out
IF GPA = Weak AND Loan = Yes THEN student not Drop Out
IF GPA = Medium THEN student not Drop Out
IF GPA = Good THEN student not Drop Out
IF GPA = Best THEN student not Drop Out
IF GPA = Excellent THEN student not Drop Out
The important decision rule for 2013 indicated an average GPA of less than 2.00 and not being allowed to receive a student loan as the factors affecting student drop-out.

According to Figure 5, the decision rules from the decision tree for 2014 were:
IF GPA = Weak AND Loan = No THEN student Drop Out
IF GPA = Weak AND Loan = Yes AND Old_Edu = Old_Edu1 AND Revenue_far = Rev_far1 THEN student Drop Out
IF GPA = Weak AND Loan = Yes AND Old_Edu = Old_Edu1 AND Revenue_far = Rev_far2 THEN student not Drop Out
IF GPA = Weak AND Loan = Yes AND Old_Edu = Old_Edu1 AND Revenue_far = Rev_far3 THEN student not Drop Out
IF GPA = Weak AND Loan = Yes AND Old_Edu = Old_Edu1 AND Revenue_far = Rev_far4 THEN student not Drop Out
IF GPA = Weak AND Loan = Yes AND Old_Edu = Old_Edu1 AND Revenue_far = Rev_far5 THEN student not Drop Out
IF GPA = Weak AND Loan = Yes AND Old_Edu = Old_Edu2 THEN student not Drop Out
IF GPA = Weak AND Loan = Yes AND Old_Edu = Old_Edu3 THEN student not Drop Out
IF GPA = Medium THEN student not Drop Out
IF GPA = Good THEN student not Drop Out
IF GPA = Best THEN student not Drop Out
IF GPA = Excellent THEN student not Drop Out
The important decision rule for 2014 indicated an average GPA of less than 2.00, being allowed to receive a student loan, high school graduation, and a father's monthly income of less than 12,500 baht as the factors affecting student drop-out.

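The 2014 rule set can be read as a small nested decision. The sketch below encodes the rules listed above as a Python function; the function name `predicts_drop_out_2014` is hypothetical, and the treatment of any value not covered by a listed rule (for example Old_Edu4) as "not Drop Out" is an assumption of this sketch, not a result of the study.

```python
def predicts_drop_out_2014(gpa, loan, old_edu=None, revenue_far=None):
    """Return True when the 2014 rules classify the student as Drop Out.

    Attribute values follow Table 1. Combinations without a listed rule
    fall through to False, which is an assumption of this sketch.
    """
    if gpa != "Weak":
        # Medium, Good, Best and Excellent are all "not Drop Out".
        return False
    if loan == "No":
        return True
    # Loan = Yes: only Old_Edu1 combined with Rev_far1 leads to Drop Out.
    return old_edu == "Old_Edu1" and revenue_far == "Rev_far1"


# Hypothetical usage.
case_a = predicts_drop_out_2014("Weak", "No")
case_b = predicts_drop_out_2014("Weak", "Yes", "Old_Edu1", "Rev_far1")
case_c = predicts_drop_out_2014("Weak", "Yes", "Old_Edu1", "Rev_far3")
case_d = predicts_drop_out_2014("Good", "No")
```

Removing the last line of the function body and returning False for every Loan = Yes case gives the simpler 2012 and 2013 rule sets, which shows how each year's tree extends the previous one.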
According to Table 3, the predictive model obtained an accuracy of more than 90 percent in every year (92.59% - 96.90%), with the highest accuracy, 96.90%, in 2013. The factors affecting student drop-out in each year, derived from Figures 1 to 5, are summarized in Table 4.

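For readers who want to recompute rates of the kind reported in Table 3, formulas (1) to (5) from Materials and Methods can be sketched in Python as follows. The confusion-matrix counts used here are hypothetical and are not taken from the study; the function name `classification_rates` is likewise only illustrative.

```python
def classification_rates(tp, tn, fp, fn):
    """Compute accuracy and the four rates from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),  # formula (1)
        "tp_rate": tp / (tp + fn),                    # formula (2)
        "tn_rate": tn / (tn + fp),                    # formula (3)
        "fp_rate": fp / (fp + tn),                    # formula (4)
        "fn_rate": fn / (tp + fn),                    # formula (5)
    }


# Hypothetical counts, chosen only to make the arithmetic easy to follow.
rates = classification_rates(tp=80, tn=900, fp=10, fn=10)
```

By construction the TP Rate and FN Rate share a denominator and sum to 1, as do the TN Rate and FP Rate, which gives a quick consistency check when reading a table of such rates.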
Table 4 shows that an average GPA of less than 2.00 was the common factor throughout the five years, in accordance with the study by Thinsungnoen et al. (2012). Student loans were an additional factor in 2012 and 2013, and two new factors, students' earlier educational attainment and a father's monthly income of less than 12,500 baht, were found in 2014. This is in line with the study by Pahannarat et al. (2009), which found that students from poor families tended to leave school to help their families make ends meet.

Conclusions
The results of this study can be summarized as follows:
1. There were four important factors affecting student drop-out: average GPA, student loans, earlier educational attainment, and father's monthly income. Administrators and advisors can use these factors to plan support and encourage students to complete their studies.
2. The factors changed over time. In 2010 and 2011 the factor was average GPA; in 2012 and 2013 the factors were average GPA and student loans; and in 2014 the factors were average GPA and student loans together with earlier educational attainment and father's monthly income.

This suggests classification rules that can be developed for prediction in 2015.

Acknowledgments
This research was funded by the Faculty of Management Science, Sakon Nakhon Rajabhat University, in fiscal year 2015. The author acknowledges the Office of Academic Promotion and Registration, Sakon Nakhon Rajabhat University, for providing the data used in this research. The author also acknowledges Assoc. Prof. Dr. Kittisak Kerdprasop and Assoc. Prof. Dr. Nittaya Kredprosob from the Suranaree University of Technology for their invaluable guidance and advice on data mining techniques. The author would like to express appreciation and gratitude for all the support and assistance received.

References
Bouckaert, R., Frank, E., Hall, M., Kirkby, R., Reutemann, P., Seewald, A., Scuse, D. (2013). WEKA Manual for Version 3-7-8, University of Waikato, Hamilton, New Zealand, 327p.
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., Wirth, R. (2000). CRISP-DM 1.0 Step-by-step data mining guide, Technical report, SPSS Inc., USA, 78p.
Gulati, H. (2015). IEEE Conference Publications; March 11-13, 2015; New Delhi, India, p.713-716.
Han, J., Kamber, M., Pei, J. (2011). Data mining concepts and techniques. 3rd ed. Elsevier, USA, 703p.
Hastie, T., Tibshirani, R., Friedman, J. (2008). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer, 739p.
Mohammed, J., Wagner Meira, JR. (2014). Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press, NY, USA, 593p.
Omkar, S., Parag, M. (2015). Predicting Dropout Students Using Data-Mining Techniques. IJR J., 2(1):365-375.
Pahannarat, N., Wangrangsimakul, K., Pipatpen, M. (2009). The Problems of Withdrawal of the Youths Receiving Scholarship from World Vision Foundation of Thailand in Songkhla Province. PNU J., 1(3):130-144.
Thinsungnoen, T., Kulnawin, K., Thinsungnoen, M. (2012). Payap University Research Symposium 2012; February 17, 2012; Chiang Mai, Thailand, p.34-42.
Yukselturk, E., Ozekes, S., Türel, Y. (2014). Predicting Dropout Student: An Application of Data Mining Methods in an Online Education Program. EURODL J., 17(1):118-133.

Table 1. Student related variables
Variable | Description | Possible Values
Sex | Gender | {F, M}
Province | Habitation | {Sakon, Nakon, Mukda, Out_sanok}
Occup_farther | Occupation of father | {Occ_far0, Occ_far1, Occ_far2, Occ_far3, Occ_far4, Occ_far5, Occ_far6, Occ_far7, Occ_far8, Occ_far9}
Revenue_far | Income of father | {Rev_far1, Rev_far2, Rev_far3, Rev_far4, Rev_far5}
occup_mother | Occupation of mother | {Occ_mom0, Occ_mom1, Occ_mom2, Occ_mom3, Occ_mom4, Occ_mom5, Occ_mom6, Occ_mom7, Occ_mom8, Occ_mom9}
Revenue_mom | Income of mother | {Rev_mom1, Rev_mom2, Rev_mom3, Rev_mom4, Rev_mom5}
Parent_status | Status of parents | {Par_st1, Par_st2, Par_st3, Par_st4, Par_st5, Par_st6, Par_st7, Par_st8, Par_st9, Par_st10}
GPA_school | Average GPA from secondary school | {Weak, Medium, Good, Best, Excellent}
Old_Edu | Educational attainment from secondary school | {Old_Edu1, Old_Edu2, Old_Edu3, Old_Edu4}
curriculum | Learning program | {Curriculum2, Curriculum4}
Major | Subject oriented | {Major1, Major2, Major3, Major4, Major5, Major6, Major7, Major8, Major9, Major10}
GPA | Average GPA | {Weak, Medium, Good, Best, Excellent}
Loan | Studying loans | {Yes, No}
Drop Out (Class) | Learning abandon | {Yes, No}

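The value sets in Table 1 also suggest how the data cleansing step in Materials and Methods could be carried out. The Python sketch below keeps only records whose attributes are present and take an allowed value. It covers four of the 14 attributes for brevity, the example records are invented, and the names `ALLOWED_VALUES` and `is_clean` are hypothetical; it illustrates the idea and is not the procedure actually used in the study.

```python
# Allowed values for a subset of the attributes in Table 1.
ALLOWED_VALUES = {
    "Sex": {"F", "M"},
    "GPA": {"Weak", "Medium", "Good", "Best", "Excellent"},
    "Loan": {"Yes", "No"},
    "Drop Out": {"Yes", "No"},
}


def is_clean(record):
    """Return True when every checked attribute is present and valid."""
    return all(
        record.get(name) in allowed for name, allowed in ALLOWED_VALUES.items()
    )


# Invented example records, not data from the study.
raw_records = [
    {"Sex": "F", "GPA": "Weak", "Loan": "No", "Drop Out": "Yes"},
    {"Sex": "M", "GPA": None, "Loan": "Yes", "Drop Out": "No"},      # missing value
    {"Sex": "M", "GPA": "Good", "Loan": "maybe", "Drop Out": "No"},  # invalid value
    {"Sex": "F", "GPA": "Best", "Loan": "Yes", "Drop Out": "No"},
]
clean_records = [record for record in raw_records if is_clean(record)]
```

In this example two of the four records are dropped, in the same way that cleansing reduced the study's 4,366 raw records to 4,163.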
Table 2. Data sets by year
Academic year | Data sets
2010 | 715
2011 | 902
2012 | 874
2013 | 775
2014 | 897
Total | 4,163

Table 3. Accuracy comparison by year (Classifier J48)
Year | 2010 | 2011 | 2012 | 2013 | 2014
Accuracy | 92.59% | 96.23% | 94.39% | 96.90% | 95.42%
TP Rate | 0.998 | 0.994 | 0.984 | 0.976 | 0.967
FP Rate | 0.366 | 0.136 | 0.198 | 0.052 | 0.135
TN Rate | 0.634 | 0.864 | 0.802 | 0.748 | 0.868
FN Rate | 0.002 | 0.006 | 0.016 | 0.024 | 0.033

Table 4. The affecting factors by year
Year | Affecting factors
2010 | The average GPA of less than 2.00
2011 | The average GPA of less than 2.00
2012 | The average GPA of less than 2.00, not being allowed to receive student loans
2013 | The average GPA of less than 2.00, not being allowed to receive student loans
2014 | The average GPA of less than 2.00, being allowed to receive student loans, high school graduation, and father's monthly income of less than 12,500 baht

GPA = Weak: Yes (91.0/1.0)
GPA = Medium: No (130.0/24.0)
GPA = Good: No (269.0/17.0)
GPA = Best: No (177.0/9.0)
GPA = Excellent: No (48.0/2.0)
Figure 1 Rule model produced by J48 Decision Tree Year 2010

GPA = Weak: Yes (195.0/4.0)
GPA = Medium: No (101.0/15.0)
GPA = Good: No (263.0/9.0)
GPA = Best: No (279.0/4.0)
GPA = Excellent: No (64.0/2.0)
Figure 2 Rule model produced by J48 Decision Tree Year 2011

GPA = Weak
|   Loan = No: Yes (162.0/9.0)
|   Loan = Yes: No (3.0/1.0)
GPA = Medium: No (74.0/13.0)
GPA = Good: No (229.0/10.0)
GPA = Best: No (316.0/13.0)
GPA = Excellent: No (90.0/2.0)
Figure 3 Rule model produced by J48 Decision Tree Year 2012

GPA = Weak
|   Loan = No: Yes (194.0/12.0)
|   Loan = Yes: No (3.0/1.0)
GPA = Medium: No (83.0/3.0)
GPA = Good: No (234.0/4.0)
GPA = Best: No (194.0/2.0)
GPA = Excellent: No (67.0)
Figure 4 Rule model produced by J48 Decision Tree Year 2013

GPA = Weak
|   Loan = No: Yes (119.0/20.0)
|   Loan = Yes
|   |   Old_Edu = Old_Edu1
|   |   |   Revenue_far = Rev_far1: Yes (2.0)
|   |   |   Revenue_far = Rev_far2 to Rev_far5: No (leaf counts 7.0/2.0, 0.0, 0.0, 0.0)
|   |   Old_Edu = Old_Edu2: No (0.0)
|   |   Old_Edu = Old_Edu3: No (2.0)
GPA = Medium, GPA = Excellent: No (leaf counts 147.0/5.0 and 91.0/3.0)
GPA = Good: No (310.0/2.0)
GPA = Best: No (219.0/1.0)
Figure 5 Rule model produced by J48 Decision Tree Year 2014
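
The leaf labels in Figures 1 to 5 follow the usual J48 output convention: the class predicted at the leaf, followed in parentheses by the number of instances that reached the leaf and, after the slash, the number of those that the leaf misclassifies (omitted when it is zero). The Python sketch below parses such labels and totals them for the 2010 tree. The helper name `parse_leaf` and the regular expression are illustrative assumptions, not part of WEKA.

```python
import re

# Matches labels such as "Yes(91.0/1.0)" or "No(67.0)".
LEAF = re.compile(r"(Yes|No)\s*\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)")


def parse_leaf(label):
    """Return (predicted_class, instances_reached, instances_misclassified)."""
    match = LEAF.search(label)
    if match is None:
        raise ValueError(f"not a J48 leaf label: {label!r}")
    predicted, reached, wrong = match.groups()
    return predicted, float(reached), float(wrong or 0.0)


# Leaf labels of the 2010 tree (Figure 1).
figure_1_leaves = [
    "Yes(91.0/1.0)",
    "No(130.0/24.0)",
    "No(269.0/17.0)",
    "No(177.0/9.0)",
    "No(48.0/2.0)",
]
parsed = [parse_leaf(label) for label in figure_1_leaves]
total = sum(reached for _, reached, _ in parsed)
errors = sum(wrong for _, _, wrong in parsed)
agreement_percent = round(100 * (total - errors) / total, 2)
```

For Figure 1 the leaves account for 715 instances, the size of the 2010 data set in Table 2, with 53 of them misclassified, so the tree agrees with 92.59% of the instances it was built from.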