STUDENT DROP OUT FACTOR ANALYSIS AND TREND PREDICTION USING DECISION TREE

Running head: Student Drop out Factor Analysis and Trend Prediction Using Decision Tree

Jeeranan Chareonrat

Department of Business Computer, Faculty of Management Science, Sakon Nakhon Rajabhat University, Sakon Nakhon 47000, Thailand. Tel. 0-4297-0028; Fax. 0-4297-0028; E-mail: [email protected]

Abstract
Rising student drop-out rates are becoming a top priority in many educational institutions. This paper aims to identify and explore the factors influencing this growing phenomenon, focusing on a university in provincial Thailand. The research covered Management Science students attending Sakon Nakhon Rajabhat University between 2010 and 2014. The survey database contained 14 attributes of 4,163 current students. Data analysis was undertaken using the J48 decision tree classification algorithm with 10-fold cross-validation in Weka. The findings indicated that the four most significant factors that induced student drop-out were low GPA, student loans, earlier educational attainment, and parents' monthly income. Further analysis indicated that in 2010-2011 low GPA was the most significant factor; student loans were added in 2012-2013, and parents' income in 2014. This suggests a trend, in line with the classification rules, that may predict drop-out rates in the current year, 2015.

Keywords: Data mining, Classification, Prediction, Student drop out

Introduction
It is widely known that Information Technology (IT), which brings about efficient working processes and decision making, has developed rapidly in recent years. It plays a vital role in most organizations for collecting and manipulating data in huge databases.
Educational institutions, for example, store many kinds of information, including student data, lecturers' websites, and e-learning administration records. Despite holding large amounts of data, they seldom use them for other purposes, especially predictive analysis. Data mining is the process of discovering interesting patterns and knowledge from large amounts of data (Han et al., 2011). Gulati (2015) predicted student drop-out using data mining techniques. Yukselturk et al. (2014) predicted student drop-out from an online program using K-Nearest Neighbour (K-NN), Decision Tree (DT), Naive Bayes (NB) and Neural Network (NN). Omkar and Parag (2015) used the decision tree algorithms J48, Random Forest, REP Tree and BF Tree, together with the JRip rule learner.

The Management Science faculty has also been facing student drop-out, which is considered an important problem in the educational system because it directly affects the organization's budget management. Data from the Education Promotion Department revealed that the drop-out rate was 20.68 percent between 2010 and 2014, which is considered "high". This study aims to build a model that can predict the factors affecting annual student drop-out using data mining techniques. The resulting trend can help administrators prevent students from dropping out.

Materials and Methods
This research was conducted according to the Cross-Industry Standard Process for Data Mining (CRISP-DM) (Chapman et al., 2000). The process of the study was as follows:
1. Business Understanding: Student data were gathered from the Education Promotion Department and related research was studied, yielding 14 significant attributes. The study started with an attribute-class relationship analysis. The attributes suitable for analyzing the drop-out rate are shown in Table 1.
2. Data Understanding: The researcher selected 4,163 records of Management Science students stored in the database system of the Department of Academic Promotion for the years 2010-2014, then classified them by year as shown in Table 2.
3. Data Preparation: The steps for preparing the data for use with the WEKA program (Bouckaert et al., 2013) were as follows.
   1) Data Cleansing: Records with missing values, errors or noisy data were dropped in the first cleaning step. This left 4,163 records out of the total of 4,366.
   2) Data Adjusting: The raw data consisted of both numerical and alphabetic values, so they had to be converted into a common form that could be analyzed.
4. Modeling and selecting the right technique: The classification data mining technique was used to build a model, with the J48 decision tree algorithm used to predict trends and WEKA's 10-fold cross-validation used as the test option.
5. Evaluation:
   1) K-fold cross-validation (Hastie et al., 2008):
      - divide the dataset into K equal parts
      - use K-1 parts as the training set
      - use the remaining part as the test set
      - repeat the process until every part has been used for testing
   2) Accuracy (Mohammed and Wagner Meira, 2014): The accuracy of a model can be calculated as a percentage using the following formulas:

      $$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{1}$$
      $$\text{TP Rate} = \frac{TP}{TP + FN} \tag{2}$$
      $$\text{TN Rate} = \frac{TN}{TN + FP} \tag{3}$$
      $$\text{FP Rate} = \frac{FP}{FP + TN} \tag{4}$$
      $$\text{FN Rate} = \frac{FN}{TP + FN} \tag{5}$$

      where TP is the True Positive value,
      TN is the True Negative value,
      FP is the False Positive value, and
      FN is the False Negative value.
6. Deployment: A qualified model is adapted and used as a predicting tool for the year 2015.
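
The evaluation step above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the WEKA implementation used in the study: the function names `k_fold_indices` and `confusion_rates` are hypothetical, the confusion-matrix counts in the usage lines are made up, and the folds here are neither shuffled nor stratified, whereas WEKA's cross-validation does both.

```python
def k_fold_indices(n_samples, k):
    """Yield (train, test) index lists for plain, unshuffled k-fold cross-validation."""
    indices = list(range(n_samples))
    # The first (n_samples % k) folds get one extra record,
    # so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, test
        start += size


def confusion_rates(tp, tn, fp, fn):
    """Return the measures of equations (1)-(5) from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "tp_rate": tp / (tp + fn),
        "tn_rate": tn / (tn + fp),
        "fp_rate": fp / (fp + tn),
        "fn_rate": fn / (tp + fn),
    }


if __name__ == "__main__":
    # 4,163 records split into 10 folds, as in the study design.
    folds = list(k_fold_indices(4163, 10))
    print([len(test) for _, test in folds])
    # Illustrative counts only; they are not taken from the study.
    print(confusion_rates(tp=50, tn=40, fp=5, fn=5))
```

By construction, TP Rate and FN Rate sum to 1, as do TN Rate and FP Rate, which gives a quick consistency check on any reported set of rates.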

Results and Discussion
The researcher used the dataset of each year to form a model using the classification data mining technique with the J48 decision tree algorithm, with 10-fold cross-validation in WEKA 3.7.5 as the test option. The model was employed to find the factors affecting student drop-out and to predict how those factors change from year to year. Figures 1 to 5 show the results and classification rules, and Table 3 gives the accuracy and the true and false rates.

According to Figure 1, the decision rules from the decision tree for 2010 were:
IF GPA = Weak THEN student Drop Out
IF GPA = Medium THEN student not Drop Out
IF GPA = Good THEN student not Drop Out
IF GPA = Best THEN student not Drop Out
IF GPA = Excellent THEN student not Drop Out
The important decision rule for 2010 indicated an average GPA of less than 2.00 as the factor affecting student drop-out.

According to Figure 2, the decision rules from the decision tree for 2011 were:
IF GPA = Weak THEN student Drop Out
IF GPA = Medium THEN student not Drop Out
IF GPA = Good THEN student not Drop Out
IF GPA = Best THEN student not Drop Out
IF GPA = Excellent THEN student not Drop Out
The important decision rule for 2011 indicated an average GPA of less than 2.00 as the factor affecting student drop-out.

According to Figure 3, the decision rules from the decision tree for 2012 were:
IF GPA = Weak AND Loan = No THEN student Drop Out
IF GPA = Weak AND Loan = Yes THEN student not Drop Out
IF GPA = Medium THEN student not Drop Out
IF GPA = Good THEN student not Drop Out
IF GPA = Best THEN student not Drop Out
IF GPA = Excellent THEN student not Drop Out
The important decision rule for 2012 indicated an average GPA of less than 2.00 and not being allowed to receive student loans as the factors affecting student drop-out.

According to Figure 4, the decision rules from the decision tree for 2013 were:
IF GPA = Weak AND Loan = No THEN student Drop Out
IF GPA = Weak AND Loan = Yes THEN student not Drop Out
IF GPA = Medium THEN student not Drop Out
IF GPA = Good THEN student not Drop Out
IF GPA = Best THEN student not Drop Out
IF GPA = Excellent THEN student not Drop Out
The important decision rule for 2013 indicated an average GPA of less than 2.00 and not being allowed to receive student loans as the factors affecting student drop-out.

According to Figure 5, the decision rules from the decision tree for 2014 were:
IF GPA = Weak AND Loan = No THEN student Drop Out
IF GPA = Weak AND Loan = Yes AND Old_Edu = Old_Edu1 AND revenue_far = Rev_far1 THEN student Drop Out
IF GPA = Weak AND Loan = Yes AND Old_Edu = Old_Edu1 AND revenue_far = Rev_far2 THEN student not Drop Out
IF GPA = Weak AND Loan = Yes AND Old_Edu = Old_Edu1 AND revenue_far = Rev_far3 THEN student not Drop Out
IF GPA = Weak AND Loan = Yes AND Old_Edu = Old_Edu1 AND revenue_far = Rev_far4 THEN student not Drop Out
IF GPA = Weak AND Loan = Yes AND Old_Edu = Old_Edu1 AND revenue_far = Rev_far5 THEN student not Drop Out
IF GPA = Weak AND Loan = Yes AND Old_Edu = Old_Edu2 THEN student not Drop Out
IF GPA = Weak AND Loan = Yes AND Old_Edu = Old_Edu3 THEN student not Drop Out
IF GPA = Medium THEN student not Drop Out
IF GPA = Good THEN student not Drop Out
IF GPA = Best THEN student not Drop Out
IF GPA = Excellent THEN student not Drop Out
The important decision rule for 2014 indicated an average GPA of less than 2.00, being allowed to receive student loans, high school graduation, and a father's monthly income of less than 12,500 baht as the factors affecting student drop-out.
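
The yearly rule sets above can be expressed as a single function. The following Python sketch is only an illustration: the function name `predict_dropout` and its parameter names are hypothetical, the attribute values are those listed in the rules, and combinations that the 2014 rules do not mention (for example Old_Edu4) are assumed here to map to "not Drop Out".

```python
def predict_dropout(year, gpa, loan, old_edu, revenue_far):
    """Return True if the rule set for the given year predicts drop-out."""
    if year not in (2010, 2011, 2012, 2013, 2014):
        raise ValueError(f"no rule set for year {year}")
    # In every year, only students with GPA = Weak are predicted to drop out.
    if gpa != "Weak":
        return False
    # 2010-2011: a weak GPA alone leads to the Drop Out class.
    if year in (2010, 2011):
        return True
    # From 2012 onward, weak-GPA students without a loan are predicted to drop out.
    if loan == "No":
        return True
    # 2012-2013: weak-GPA students with a loan are predicted not to drop out.
    if year in (2012, 2013):
        return False
    # 2014: loan holders drop out only when Old_Edu = Old_Edu1 and
    # revenue_far = Rev_far1.
    # Assumption: combinations not listed in the rules map to "not Drop Out".
    return old_edu == "Old_Edu1" and revenue_far == "Rev_far1"


if __name__ == "__main__":
    print(predict_dropout(2011, "Weak", "Yes", "Old_Edu2", "Rev_far3"))
    print(predict_dropout(2013, "Weak", "Yes", "Old_Edu2", "Rev_far3"))
    print(predict_dropout(2014, "Weak", "Yes", "Old_Edu1", "Rev_far1"))
```

Writing the rules as nested conditions makes the year-to-year change visible: each later rule set refines only the GPA = Weak branch of the earlier one.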

According to Table 3, the predictive model obtained an accuracy of more than 90 percent (92.57%-96.90%). The model for 2013 obtained the highest accuracy, 96.90%. The factors affecting student drop-out identified in Figures 1 to 5 are summarized by year in Table 4.

Table 4 shows that an average GPA of less than 2.00 was the common factor throughout the five years, in accordance with the study of Thinsungnoen et al. (2012). Student loans were the factor added in 2012 and 2013, and two new factors, namely students' earlier educational attainment and a father's monthly income of less than 12,500 baht, were found in 2014. This is in line with the study of Pahannarat et al. (2009), which found that students from poor families tended to leave school to help their families make ends meet.

Conclusions
The results of this study can be summarized as follows:
1. There were four important factors affecting student drop-out: average GPA, student loans, earlier educational attainment and father's monthly income. Administrators and advisors can use these factors to plan support and encourage students to complete their studies.
2. The factors changed over time: in 2010-2011 the factor was average GPA; in 2012-2013 the factors were average GPA and student loans; and in 2014 they were average GPA and student loans together with earlier educational attainment and father's monthly income.

These results suggest a classification rule that can be developed for predicting drop-out in 2015.

Acknowledgments
This research was funded by the Faculty of Management Science, Sakon Nakhon Rajabhat University, in fiscal year 2015. The author acknowledges the Office of Academic Promotion and Registration, Sakon Nakhon Rajabhat University, for providing the data used in this research. The author also acknowledges Assoc. Prof. Dr. Kittisak Kerdprasop and Assoc. Prof. Dr. Nittaya Kredprosob from the Suranaree University of Technology for their invaluable guidance and advice on data mining techniques. The author would like to express appreciation and gratitude for all their support and assistance.

References
Bouckaert, R., Frank, E., Hall, M., Kirkby, R., Reutemann, P., Seewald, A., Scuse, D. (2013). WEKA Manual for Version 3-7-8. University of Waikato, Hamilton, New Zealand, 327p.
Chapman, P., Clinton, J., Kerber, R., Khabaza, T., Reinartz, T., Shearer, C., Wirth, R. (2000). CRISP-DM 1.0 Step-by-step data mining guide. Technical report, SPSS Inc., USA, 78p.
Gulati, H. (2015). IEEE Conference Publications; March 11-13, 2015; New Delhi, India, p.713-716.
Han, J., Kamber, M., Pei, J. (2011). Data mining concepts and techniques. 3rd ed. Elsevier, USA, 703p.
Hastie, T., Tibshirani, R., Friedman, J. (2008). The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. Springer, 739p.
Mohammed, J., Wagner Meira, JR. (2014). Data Mining and Analysis: Fundamental Concepts and Algorithms. Cambridge University Press, NY, USA, 593p.
Omkar, S., Parag, M. (2015). Predicting Dropout Students Using Data-Mining Techniques. IJR J., 2(1):365-375.
Pahannarat, N., Wangrangsimakul, K., Pipatpen, M. (2009). The Problems of Withdrawal of the Youths Receiving Scholarship from World Vision Foundation of Thailand in Songkhla Province. PNU J., 1(3):130-144.
Thinsungnoen, T., Kulnawin, K., Thinsungnoen, M. (2012). Payap University Research Symposium 2012; February 17, 2012; Chiang Mai, Thailand, p.34-42.
Yukselturk, E., Ozekes, S., Türel, Y. (2014). Predicting Dropout Student: An Application of Data Mining Methods in an Online Education Program. EURODL J., 17(1):118-133.

Table 1.
Student-related variables

| Variable | Description | Possible values |
| --- | --- | --- |
| Sex | Gender | {F, M} |
| Province | Habitation | {Sakon, Nakon, Mukda, Out_sanok} |
| Occup_farther | Occupation of father | {Occ_far0, Occ_far1, Occ_far2, Occ_far3, Occ_far4, Occ_far5, Occ_far6, Occ_far7, Occ_far8, Occ_far9} |
| Revenue_far | Income of father | {Rev_far1, Rev_far2, Rev_far3, Rev_far4, Rev_far5} |
| occup_mother | Occupation of mother | {Occ_mom0, Occ_mom1, Occ_mom2, Occ_mom3, Occ_mom4, Occ_mom5, Occ_mom6, Occ_mom7, Occ_mom8, Occ_mom9} |
| Revenue_mom | Income of mother | {Rev_mom1, Rev_mom2, Rev_mom3, Rev_mom4, Rev_mom5} |
| Parent_status | Status of parents | {Par_st1, Par_st2, Par_st3, Par_st4, Par_st5, Par_st6, Par_st7, Par_st8, Par_st9, Par_st10} |
| GPA_school | Average GPA from secondary school | {Weak, Medium, Good, Best, Excellent} |
| Old_Edu | Educational attainment from secondary school | {Old_Edu1, Old_Edu2, Old_Edu3, Old_Edu4} |
| curriculum | Learning program | {Curriculum2, Curriculum4} |
| Major | Subject oriented | {Major1, Major2, Major3, Major4, Major5, Major6, Major7, Major8, Major9, Major10} |
| GPA | Average GPA | {Weak, Medium, Good, Best, Excellent} |
| Loan | Studying loans | {Yes, No} |
| Drop Out | Learning abandon (class attribute) | {Yes, No} |

Table 2.
Datasets by year

| Academic year | Data sets |
| --- | --- |
| 2010 | 715 |
| 2011 | 902 |
| 2012 | 874 |
| 2013 | 775 |
| 2014 | 897 |
| Total | 4,163 |

Table 3 Accuracy comparison by year (classifier: J48)

| Year | 2010 | 2011 | 2012 | 2013 | 2014 |
| --- | --- | --- | --- | --- | --- |
| Accuracy | 92.59% | 96.23% | 94.39% | 96.90% | 95.42% |
| TP Rate | 0.998 | 0.994 | 0.984 | 0.976 | 0.967 |
| FP Rate | 0.366 | 0.136 | 0.198 | 0.052 | 0.135 |
| TN Rate | 0.634 | 0.864 | 0.802 | 0.748 | 0.868 |
| FN Rate | 0.002 | 0.006 | 0.016 | 0.024 | 0.033 |

Table 4 The affecting factors by year

| Year | Affecting factors |
| --- | --- |
| 2010 | Average GPA of less than 2.00 |
| 2011 | Average GPA of less than 2.00 |
| 2012 | Average GPA of less than 2.00; not being allowed to receive student loans |
| 2013 | Average GPA of less than 2.00; not being allowed to receive student loans |
| 2014 | Average GPA of less than 2.00; being allowed to receive student loans; high school graduation; father's monthly income of less than 12,500 baht |

GPA
  = Weak: Yes(91.0/1.0)
  = Medium: No(130.0/24.0)
  = Good: No(269.0/17.0)
  = Best: No(177.0/9.0)
  = Excellent: No(48.0/2.0)
Figure 1 Rule model produced by J48 Decision Tree, year 2010

GPA
  = Weak: Yes(195.0/4.0)
  = Medium: No(101.0/15.0)
  = Good: No(263.0/9.0)
  = Best: No(279.0/4.0)
  = Excellent: No(64.0/2.0)
Figure 2 Rule model produced by J48 Decision Tree, year 2011

GPA
  = Weak: Loan
      = Yes: No(3.0/1.0)
      = No: Yes(162.0/9.0)
  = Medium: No(74.0/13.0)
  = Good: No(229.0/10.0)
  = Best: No(316.0/13.0)
  = Excellent: No(90.0/2.0)
Figure 3 Rule model produced by J48 Decision Tree, year 2012

GPA
  = Weak: Loan
      = Yes: No(3.0/1.0)
      = No: Yes(194.0/12.0)
  = Medium: No(83.0/3.0)
  = Good: No(234.0/4.0)
  = Best: No(194.0/2.0)
  = Excellent: No(67.0)
Figure 4 Rule model produced by J48 Decision Tree, year 2013

Nodes: GPA (Weak, Medium, Good, Best, Excellent); under GPA = Weak, Loan (Yes, No); under Loan = Yes, Old_Edu (Old_edu1, Old_edu2, Old_edu3); under Old_Edu = Old_edu1, Revenue_far (Rev_far1 to Rev_far5).
Leaf labels: No(310.0/2.0), No(147.0/5.0), No(0.0), No(91.0/3.0), Yes(119.0/20.0), No(2.0), No(7.0/2.0), No(0.0), No(0.0), No(0.0), Yes(2.0), No(219.0/1.0)
Figure 5 Rule model produced by J48 Decision Tree, year 2014
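
The leaf labels in the figures follow J48's output convention: a label such as Yes(91.0/1.0) means that 91 training records reached the leaf and 1 of them does not belong to the predicted class, while a label with a single number, such as No(67.0), has no misclassified records. As a minimal sketch (the names `parse_leaf` and `training_accuracy` are hypothetical), the leaf counts of a tree can be summed to obtain its accuracy on the training data. This resubstitution figure need not equal the cross-validated accuracy reported in Table 3.

```python
import re

# Matches labels such as "Yes(91.0/1.0)" or "No(67.0)".
LEAF_PATTERN = re.compile(r"(Yes|No)\((\d+(?:\.\d+)?)(?:/(\d+(?:\.\d+)?))?\)")


def parse_leaf(label):
    """Split a J48 leaf label into (predicted class, records reached, records misclassified)."""
    match = LEAF_PATTERN.fullmatch(label.strip())
    if match is None:
        raise ValueError(f"unrecognized leaf label: {label!r}")
    predicted, reached, wrong = match.groups()
    # A label without a "/m" part has no misclassified records.
    return predicted, float(reached), float(wrong or 0.0)


def training_accuracy(labels):
    """Fraction of training records classified correctly across all leaves."""
    leaves = [parse_leaf(label) for label in labels]
    total = sum(reached for _, reached, _ in leaves)
    errors = sum(wrong for _, _, wrong in leaves)
    return (total - errors) / total


# Leaf labels of the 2010 tree (Figure 1); they cover 715 records in total.
figure_1_leaves = [
    "No(269.0/17.0)",
    "No(130.0/24.0)",
    "Yes(91.0/1.0)",
    "No(48.0/2.0)",
    "No(177.0/9.0)",
]

if __name__ == "__main__":
    print(f"{training_accuracy(figure_1_leaves):.2%}")  # prints 92.59%
```

The records reaching the five leaves sum to 715, which matches the 2010 row of Table 2 and gives a simple check that no leaf was lost in transcription.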