Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
International Journal on Advanced Computer Theory and Engineering (IJACTE) ________________________________________________________________________ MINING THE FACTORS AFFECTING THE HIGH SCHOOL DROPOUTS IN RURAL AREAS 1 P. Sunil Kumar, 2Ashok Kumar Panda, 3D. Jena 1,2 Assistant Professor, 3Associate Professor, IISIT, BPUT, Bhubaneswar, Odisha, India Email : [email protected] attendance and retention rates, high teacher absenteeism, irregular classes, poor teaching standards and other related issues. The government needs to shift its focus on increasing enrolment rates and also reducing school drop-outs in rural areas which is also a significant problem [24]. Abstract -Government of India has taken number of initiatives in the last decade for retention of school children at primary level. All those initiatives, although successful at that level of education, when it comes to secondary education, the scenario is deplorable. The number of drop outs during secondary education is mainly due to three major factors namely Poverty, disinterestedness and Teaching environment. This paper is based on a survey work which proposes to apply association rule mining measures like support confidence and other interesting measures on these major factors of high school dropouts to understand the problem in a better way and to have a proper planning for retention of students . Data mining is finding hidden patterns in a large collection of data. Data Mining can be used in educational field to enhance our understanding of learning process to focus on identifying, extracting and evaluating Variables related to the learning process of students as described by Alaa el-Halees [2]. Keywords: Data Mining, Association Rule, Aprioi, EDM, Interesting Measures, Higher Secondary Education. In this paper it is tried to find out the association of various factors leading to student dropouts at secondary level of education in rural areas. I. INTRODUCTION II. BACKGROUND AND RELATED WORK Recent years have shown a growing interest and concern in many countries about the problem of school failure and the determination of its main contributing factors. This problem is known as the “the one hundred factors problem” and a great deal of research has been done on identifying the factors that affect the low performance of students (school failure and dropout) at different educational levels (primary, secondary and higher) as described by Araque et al., 2009[1]. Educational data mining has emerged as an independent research area in recent years, culminating in 2008 with the establishment of the annual International Conference on Educational Data Mining, and the Journal of Educational Data Mining. Romero and Ventura [21] provides a comprehensive study of EDM from 1995 to 2005. It describes the need for analyzing the student data which can be used by students, educators and administrators. A study titled “Evolution of Indian Rural Markets” made by the Associated Chambers of Commerce and Industry of India (ASSOCHAM) has revealed that, during 2004-05 to 2009-10”, Odisha ranked lowest with average rural household monthly expenditure on education worth just about Rs.52. Although great strides have been made in improving India’s education scenario, much ground still needs to be covered as our education system is still plagued by low enrolment rates, lower There are examples about how to apply EDM techniques for predicting drop out and school failure as described by Kotsiantis, 2009[3]. Shaeela Ayesha, Tasleem Mustafa, Ahsan Raza Sattar, and M. Inayat Khan [13] used data mining technique named k-means clustering applied to analyze student’s learning behavior that will help the teachers to reduce ________________________________________________________________________ ISSN (Print): 2278-5140, Volume-2, Issue – 3, 2013 1 International Journal of Advanced Computer Engineering and Communication Technology (IJACECT) ________________________________________________________________________ the drop out ratio to a significant level and improve the performance of Students. J.F.Superby et al. [23] to classify 533 first-year university students into three groups: the low-risk, the medium-risk, and the high-risk students (high probability of dropping out) D'Mello et al. [8] studied on students who are bored or frustrated. Dekker et al. [7] Romero et al. [22]; Superby et al. [23] found factors that predict student failure or non-retention in college courses. Dropout predicting in e-learning has been studied by Lykorentzou I., Giannoukos I., Nikopoulos V., Mpardis G. AND Loumos V. 2009[6]. As well as M. Jadrić et al. [14] to analyze the problem of students drop out in the higher education by using the datamining methods and also a model suggested. III. ASSOCIATION RULE MINING Data Mining is the discovery of hidden information found in databases [4] [18]. Dunham [9] categorized various models and tasks of data mining into two groups: predictive and descriptive. One o f the m o s t significant descriptive data mining applications is that of mining association rules. Introduced in 1993 [15], used extensively in marketing and retail communities in addition to many other diverse fields [17]. Association rule mining is one of the important technique which aims at extracting, interesting correlation, frequent patterns, associations or casual structures among set of items in the transaction databases or other data mining repositories [17]. A formal statement of the association rule problem is as follows: Definition: [16] [12] Let I = {i1, i2… im} be a set of m distinct attributes. Let D be a database, where each record (tuple) T has a unique identifier, and contains a set of items such that T⊆I. An association rule is an implication of the form of X⇒Y, where X,Y ⊆ I are sets of items called itemsets, and X∩Y=ϕ. Here, X is called antecedent while Y i s c a l l e d c o n s e q u e n t . Association r u l e s c a n b e c l a s s i f i e d based on the type of vales, dimensions of data, and levels of abstractions involved in the rule. If a rule concerns associations between the presence or absence of items, it is called Boolean association rule. And the dataset consisting of attributes which can assume only binary (0absent, 1-present) values is called Boolean database. itemsets. First, the set of frequent 1-itemsets is found by scanning the database to accumulate the count for each item, and collecting those items that satisfy minimum support. The resulting set is denoted L1.Next, L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. The finding of each Lk requires one full scan of the database.Lk-1 is used to find Lk for k _ 2. A two-step process is followed, consisting of join and prune actions. The join step: To find Lk, a set of candidate k-itemsets is generated by joining Lk-1 with itself. This set of candidates is denoted Ck. Let l1 and l2 be itemsets in Lk-1.The notation li[ j] refers to the jth item in li (e.g., l1[k-2] refers to the second to the last item in l1). By convention, Apriori assumes that items within a transaction or item set are sorted in lexicographic order. For the (k-1)-item set, li, this means that the items are sorted such that li[1] < li[2] < : : : < li[k-1]. The join, Lk-1 ⋈ Lk-1, is performed, where members of Lk-1 are joinable if their first (k-2) items are in common. That is, members l1 and l2 of Lk-1 are joined if (l1[1] = l2[1]) ^ (l1[2] = l2[2]) ^: : :^(l1[k-2] = l2[k-2]) ^(l1[k-1] < l2[k-1]). The condition l1 [k-1] < l2 [k-1] simply ensures that no duplicates are generated. The resulting item set formed by joining l1 and l2 is l1[1], l1[2], : : : , l1[k-2], l1[k-1], l2[k-1]. The prune step: Ck is a superset of Lk, that is, its members may or may not be frequent, but all of the frequent k-itemsets are included in Ck. As can of the database to determine the count of each candidate in Ck would result in the determination of Lk (i.e., all candidates having a count no less than the minimum support count are frequent by definition, and therefore belong to Lk). Ck, however, can be huge, and so this could involve heavy computation. To reduce the size of Ck, the Apriori property is used as follows. Any (k-1)-item set that is not frequent cannot be a subset of a frequent k-item set. Hence, if any (k-1)subset of a candidate k-item set is not in Lk-1, then the candidate cannot be frequent either and so can be removed from Ck. V. DATA MINING TECHNIQUES USED IN THIS PAPER i) Support(X)=P(X) ii) Confidence (X⇒Y)= IV. APRIORI ALGORITHM Apriori is a seminal algorithm proposed by R. Agrawal and R. Srikant in 1994 for mining frequent itemsets for Boolean association rules. The algorithm uses prior knowledge of frequent item set properties. Apriori employs an iterative approach known as a level-wise search, where k-itemsets are used to explore (k+1)- P(X,Y) P(Y) ________________________________________________________________________ ISSN (Print): 2278-5140, Volume-2, Issue – 3, 2013 2 International Journal of Advanced Computer Engineering and Communication Technology (IJACECT) ________________________________________________________________________ iii) Cosine(X⇒Y) = P(X,Y) iv) Added Value(X⇒Y)= Confidence (XY)-P(Y) v) Lift((X⇒Y)= vi) vii) Confidence(X⟹Y) Correlation(X⇒Y)= Conviction(X⇒Y)= P(Y) P(X,Y)−P(X)∗P(Y) √P(X)∗P(Y)∗(1−P(X))∗(1−P(Y) ii) Confidence: The confidence or strength () for an association rule X⇒Y is the ratio of the number of transaction that contain X [9]. v) Correlation: Correlation is a symmetric measure. A correlation around 0 indicates that X and Y are not correlated. A negative figure indicates that X and Y are negatively correlated, and a positive figure indicates that they are positively related. Note that the denominator of the division is positive and smaller than 1.Inother words, if the lift is around 1,correlation can still be significantly different from 0[11] vii) Conviction: Conviction is not a symmetric measure. A conviction around 1 says that X and Y are independent; while conviction is infinite as conf (X⇒Y) is tending to 1.Note that if P(Y) is high then 1-P(Y) is small. In that case even if conf(X, Y) is strong conviction may be small [10]. (1-Confidence(X⟹Y) Support: The support (s) for an association rule X⇒Y is the percentage of transaction that contains X∪Y [9]. iv) vi) 1-P(Y) i) iii) correlation between X and Y. A lift around 1 says that P(X, Y) = P(X)*P(Y). In terms of probability, this means that the occurrence of X and the occurrence of Y in the same transaction are independent events, hence X and Y not correlated. It is easy to show that the lift is 1 exactly when added value is 0; the lift is greater than 1 exactly when added value is positive and the lift is below 1 exactly when added value is negative [10, 11]. √P(X)*P(Y) Cosine: Consider two vectors X and Y and the angle they form when they are placed so that their tails coincide. When this angle nears 0°, then cosine nears 1, i.e. the two vectors are very similar: all their coordinates are pair wise the same. When this angle is 90 degree the vector are perpendicular, the most dissimilar, and the cosine is 0.The usual form that is given for cosine of an association rule is X, Y. The closer t h e cosine (X⇒Y) v a l u e to 1, the more transactions containing item X also contain item Y, and vice versa. On the contrary, the closer cosine (X⇒Y) value to 0, the more transactions contain item X without containing item Y, and vice versa. This equality shows that transactions not containing neither item X nor item Y have no influence on the result of Cosine (X⇒Y). This is known as the null-invariant property. Note also that cosine is a symmetric measure [10, 11]. These measures are calculated on the test data. VI. APPLICATION OF THE TECHNIQUES In this study data was collected from students who discontinued their studies from high schools in Balianta block of Khurda District, Odisha. It is necessary to mention that the current dropout rate for the state of Odisha at secondary education is 49.5% [5]. These data are analyzed using Association rule mining to find out the cause or causes behind their decision to discontinue their education. . In order to apply this following steps are performed in sequence: Added value: The added value of the rule X⇒Y is denoted by AV ( X⇒Y ) and measures whether the proportion of transactions containing Y among the transactions containing X is greater than the proportion of transactions containing Y among all transactions. Then, only if the probability of finding item Y when item X has been found is greater than the probability of finding item Y at all can we say that X and Y are associated and that X implies Y.A positive number indicates that X and Y are related, while a negative number means that the occurrence of X prevents Y from occurring. Added Value is closely related to another well-known measure of interest, the lift [10]. Data set: The data set consisted of the following variables related to each dropout student. All the attributes are listed as follows: Table 1: Data Set Variable Description Possible values Sex Boy or Girl Factor 1 primary, {Poverty,?} secondary, absolute, Relative, persistent transient}as explained by Townsend and Rowntree[19][20] No-drinking{Teaching Factor 2 Lift: A lift well above 1 indicates a strong {boy, girl} ________________________________________________________________________ ISSN (Print): 2278-5140, Volume-2, Issue – 3, 2013 3 International Journal of Advanced Computer Engineering and Communication Technology (IJACECT) ________________________________________________________________________ water, No-Toilet, Environment,?} No-Classroom, Inadequateteacher, Poor teaching weak memory, {Disinterestedness fear of study, ,?} other interest, parent force Table 2: Support analysis Factor for dropout The size of the data set is 80. Poverty Teaching Environment Disinterestedness (Poverty, Disinterestedness) (poverty, Teaching Environment) (Teaching Environment, Disinterestedness) (Poverty, Disinterestedness,Teaching Environment) Factor 3 Data selection and transformation: The preprocessed data set in weka is shown in figure-1 The dropouts are categorized into three categories irrespective of sex i.e. Poverty, Teaching Environment, and Disinterestedness. It has been seen that among the students who are affected by one factor also affected by other two factors. Following Venn diagram (fig. 2 ) shows the complete d r o p o u t picture of t h e students due to different factors. support 0.72 0.31 0.77 0.6 0.07 0.15 0.03 Measures and their analysis The Apriori algorithm is implemented on the data set using weka as shown in figure 3. The results obtained like confidence, lift and conviction along with other interesting measures are analyzed one by one. Figure 3 Table 3 describes that the most of the students (85%) who discontinued due to poverty are also d i s i n t e r e s t e d because it has highest confidence. 74% of disinterested students also suffer due to Poverty. 41 % of students who are not satisfied with the Teaching Environment are also disinterested. Only 18 % of disinterested students have problems Teaching Environment. Table 3:Confidence Analysis Factor for dropout Poverty ⇒ Disinterestedness Disinterestedness ⇒ Poverty Teaching Environment ⇒ Disinterestedness Disinterestedness ⇒ Teaching Environment Figure 2 So, in this study it has been tried to find out the factors affecting the dropouts by using the association rule mining technique. It is also identified that more than one factor affecting the student. Confidence 0.85 0.74 0.41 0.15 Table 4 describes cosine analysis valve. It is a symmetric analysis. It means two sets give same results in either direction. In this research paper it shows the angular value between two different factors affecting the school dropout. Table shows that Poverty and disinterestedness has lower angle (37.72) in comparison to Teaching Environment and disinterestedness. It can be concluded that students discontinuing due to Poverty and disinterestedness have more similarity of than the Teaching Environment and disinterestedness On the basis of available data, it was decided to calculate the support of each factor causing the dropout. Following table 1 shows the support level for each factor. It describes that higher percentage of dropouts is because of the disinterestedness whose support value is 0.77. ________________________________________________________________________ ISSN (Print): 2278-5140, Volume-2, Issue – 3, 2013 4 International Journal of Advanced Computer Engineering and Communication Technology (IJACECT) ________________________________________________________________________ Table 4: Cosine Analysis Factor for dropout Poverty ⇒ Disinterestedness Disinterestedness ⇒ Poverty Teaching Environment ⇒ Disinterestedness Disinterestedness ⇒ Teaching Environment Table7: Correlation Analysis Cosine 0.791 0.791 0.267 Angle 37.72 37.72 74.51 0.267 74.51 Factor for dropout Poverty ⇒ Disinterestedness Disinterestedness ⇒ Poverty Teaching Environment ⇒ Disinterestedness Disinterestedness ⇒ Teaching Environment Table 5 shows the added value analysis. In this table Poverty ⇒ Disinterestedness and Disinterestedness ⇒ Poverty has positive number which shows that they are related to each other. The presence of students with these dual problems cannot prevent one another. Teaching Environment ⇒ Disinterestedness and Disinterestedness ⇒ Teaching Environment has negative number which shows that the occurrence of Teaching Environment prevents occurring of disinterestedness similarly occurrence of disinterestedness p r e v e n t s occurrence of Teaching Environment. Table 5: Added value Analysis Factor for dropout Poverty ⇒ Disinterestedness Disinterestedness ⇒ Poverty Teaching Environment ⇒ Disinterestedness Disinterestedness ⇒ Teaching Environment Added Value 0.0875 0.0775 -0.352 -2.69 Table 8 shows conviction analysis. It shows that highest conviction is found in the association of Poverty ⇒ Disinterestedness with value 1.34. Lowest conviction is found in association of Teaching Environment ⇒ Disinterestedness with value 0.36. Table 8: Conviction Analysis Factor for dropout Poverty ⇒ Disinterestedness Disinterestedness ⇒ Poverty Teaching Environment ⇒ Disinterestedness Disinterestedness ⇒ Teaching Environment Conviction 1.34 1.18 0.36 0.83 VII. CONCLUSION -0.125 Table 6 contains lift analysis. It is a symmetric analysis. It shows the occurrence of one item to another item. In this table the first two relations has similar positive value (1.1) greater than 1 which shows that occurrence of first is strongly correlated with the other. In the case of third and fourth relations, it has also same positive value but less than 1 which shows that they are negatively correlated. Table 6: Lift Analysis Factor for dropout Poverty ⇒ Disinterestedness Disinterestedness ⇒ Poverty Teaching Environment ⇒ Disinterestedness Disinterestedness ⇒ Teaching Environment Correlation 1.57 1.57 -2.69 Lift 1.1 1.1 0.53 Association rules are useful t o find the association between two elements and shows relationship between them. In this paper seven different parameters are used to find the relationship between two different factors affecting the school dropout. From the above analysis it can be concluded that the students who are disinterested are more prone to dropouts than due to Poverty and Teaching Environment. Another conclusion is extracted from confidence, cosine, AV analysis, lift, correlation and conviction analysis is that most of the Poor students are disinterested as well as students not satisfied with the teaching environment are also not interested to continue their education. So as desired by the study made by ASSOCHAM [24], government of Odisha has to take necessary steps to raise the expenditure level per family per month and also organize awareness camps to increase the interest level in the young minds. REFERENCES [1] ARAQUE F., ROLDAN C., SALGUERO A. Factors Influencing University Drop Out Rates. Computers Education, 53, 563–574, 2009. [2] Alaa el-Halees, Mining Students Data to Analyze e- Learning Behavior: A Case Study, 2009 [3] Kotsiantis S.. Educational Data Mining: A Case Study for Predicting Dropout – Prone Students. Int. J.Knowledge Engineering and Soft Data Paradigms, 1(2), 101–111, 2009. [4] Ming-Syan Chen, Jiawei Han and Philip S. 0.53 Table 7 contains correlation value. In this table the first two relations has similar positive value indicating that they are positively correlated to each other. On the other hand third and fourth relations have similar negative value showing that they are negatively correlated to each other. ________________________________________________________________________ ISSN (Print): 2278-5140, Volume-2, Issue – 3, 2013 5 International Journal of Advanced Computer Engineering and Communication Technology (IJACECT) ________________________________________________________________________ Yu.. Data Mining: An Overview from a Database Perspective, IEEE transactions on Knowledge and Data Engineering Vol. 8, No. 6, pp. 866-883, 1996. [5] http://www.odisha.gov.in/p&c/Download /Economic%20Survey_2012-13.pdf page 282 [6] LYKORENTZOU I., GIANNOUKOS I., NIKOPOULOS V., MPARDIS G. AND LOUMOS V.. Dropout Prediction in e-learning Courses through the Combination of Machine Learning Techniques. Computers & Education, 53,950–965, 2009. [7] Dekker, G., Pechenizkiy, M. and Vleeshouwers, J., “Predicting Students Drop Out: A Case Study.” In Proceedings of the International Conference on Educational Data Mining, Cordoba, Spain, T. Barnes, M. Desmarais, C. Romero and S. Ventura Eds.,pp41-50, 2009. [8] D'mello, S.K., Craig, S.D., Witherspoon, A.W., McDaniel, B.T. and Graesser, A.C., “Automatic Detection of Learner’s Affect from Conversational Cues.” User Modeling and UserAdapted Interaction vol 18. pp45-80, 2008. [9] Margret H Dunham, “Data Mining Introductory and Advanced Topics”, Pearson education ISBN 978-81-7758-785-2,2006. 2010. [15] R. Agrawal, T. Imielinski, A. Swami. Mining Associations between Sets of Items in massive D atabases, Proc. of the ACMSIGMOD Int'l Conference on Management of ata, Washington D.C. 1993. [16] V.Umarani, Dr. M. Punithavalli, “Sampling based Association Rules Mining- A Recent Overview” , (IJCSE) International Journal on Computer Science and Engineering Vol. 02, No. 02, 314-318, 2010. [17] V.Umarani, Dr.M.Punithavalli, “A STUDY ON EFFECTIVE MINING OF ASSOCIATION RULES FROM HUGE DATABASES”, IJCSR International Journal of Computer Science and Research, Vol. 1,Issue 1ISSN : 2210-9668, , 2010. [18] Usama M. Fayyad, Gregory Piatetsky- Shapiro, and Padhraic Smyth. From Data Mining to knowledge Discovery: An Overview, Advances in Knowledge Discovery and Data Mining,). AAAI Press, , pp 1-34, 1996. [19] Townsend),The concept of Poverty, London , Heinmann, 1971. [20] Rowntree,S Poverty: A Study Life,London, Macmillan, 1901. of Town [10] Merceron A and Yacef K, “Interestingness Measures for Association Rules in Educational Data”,2007. [21] [11] Merceron A, Yacef K, “Revisiting interestingness of strong symmetric association rules in educational data”, Proceedings of the International Workshop on Applying Data Mining in e-Learning 2007 Romera, C. and Ventura, S., ” Educational Data Mining: A Survey from 1995 to 2005.”Expert Systems with Applications 33, 125-146,2007.. [22] Romero, C ., V e n t u r a , S., Eapejo, P.G. and Hervas, C.,.” Data Mining Algorithms to Classify Students.” In Proceedings of the 1st International Conference on Educational Data Mining, pp8-17, 2008. [23] Superby, J.F., Vandamme, J.-P. and Meskens, N.,. “Determination of factors influencing the achievement of thefirst-year university students using data mining methods.” In Proceedings of the Workshop on Educational Data Mining at the 8th International Conference on Intelligent Tutoring Systems (ITS 2006), pp37-44,2006. [24] http://www.thehindu.com/todays-paper/tpational/tp-newdelhi/haryana-records-highestaverage-rural-expenditure-oneducation/article4975971.ece [12] [13] [14] David Wai-Lok Cheung, Vincent T. Ng, Ada Wai-Chee Fu, and Yongjian Fu, “Efficient Mining of Association Rules in Distributed Databases”, IEEE ... IEEE Transactions on Knowledge and Data Engineering, Vol 8, No 6, pages 866–883, 1996. S. Ayesha, T. Mustafa, A.R. Sattar, and M.I.Khan, Data Mining Model for Higher Education System, European Journal of ScientificResearch, Vol.43, No.1, pp.24-29, 2010. M. Jadrić, Ž. Garača, and M. Ćukušić, Student dropout analysis with application of data mining methods, Management, Vol.15, No.1, ,pp. 31-46, ________________________________________________________________________ ISSN (Print): 2278-5140, Volume-2, Issue – 3, 2013 6