A multi-stage decision algorithm for rule generation for minority class

by Soma Datta, M.C.A

A Dissertation in Computer Science

Submitted to the Graduate Faculty of Texas Tech University in Partial Fulfillment of the Requirements for the Degree of DOCTOR OF PHILOSOPHY

Approved
Dr. Susan Mengel, Chair of Committee
Dr. D’aun Green
Dr. Jim Burkhalter
Mark Sheridan, Dean of the Graduate School

August 2014

Copyright 2014, Soma Datta

Texas Tech University, Soma Datta, August 2014

ACKNOWLEDGMENTS

I wish to thank my committee members for their precious time and help. A special thanks to Dr. Susan Mengel, my committee chairperson, for her immense patience in reading, proofing, encouraging, and enlightening me with the thought process. Thanks to Dr. D’aun Green and Dr. Jim Burkhalter for graciously accepting to serve on my committee. I would like to thank my previous supervisor, James Anderson, for his support and encouragement. Special thanks go to Eric Thompson, my colleague, for compiling the data and creating scripts for computations. Finally, I would like to thank the University Writing Center for all the help they have rendered in getting this manuscript to what it is now. Special thanks go to Elizabeth Bowen and Leslie Akchurin for their help and expertise in modifying this manuscript.

I dedicate my dissertation work to my family and friends. A special feeling of gratitude goes to my loving parents, Ira Ray and Amelendu Ray, for being my friend, my philosopher, and my guide. I also dedicate my dissertation work to my daughter, my son-in-law, and my husband (Sanjana Greenhill, Joshua Greenhill, and Amlan Datta) for their support and confidence. Finally, I dedicate my dissertation work to my sister, Ruma Guha Neogi, and her family for building my confidence and for their love.

TABLE OF CONTENTS

Acknowledgments
Abstract
List of Figures
List of Tables

Chapter I. Introduction

Chapter II. Predictive Modeling of Student Retention Data
    Abstract
    Introduction
    The Background
    The Data
    The Proposed Approach
        Motivational Concepts
        Ensemble Learning
        Ensemble Learning with Clustering
    Experiments and Results
    Conclusion and Future Work
Chapter III. Multi-Stage Decision Method to Generate Rules for Student Retention
    Abstract
    Introduction
    Related Work
        Background
        Multi-Stage Decision Tree
    Data for the Study
    Methodology
        Motivation for Multi-Stage Mining Methods
        Multi-Stage Controlled Decision Tree Method
    Results
        Results from the Multi-Stage Controlled Decision Tree
    Conclusion
    Future Work
    Appendix A
Chapter IV. Dynamic Multi-Stage Decision Rules
    Abstract
    Introduction and Related Work
    Background Work
    Data for the Study
    Dynamic Multi-Stage Decision Rule
        Clustering
        Stage I: Decision Tree
        Stage II: Association Mining
    Results
    Discussion and Conclusions
    Future Work
    Appendix B

Chapter V. Conclusions and Future Work

References
ABSTRACT

This study analyzes student retention data to predict students who are likely to drop out. Retention is an increasingly important problem for institutions that must meet legislative mandates, face budget shortfalls due to decreased tuition or state-based revenue, and do not produce enough graduates in fields of need, such as technology. The study proposes a multiple-stage decision method to improve on rule extraction issues encountered previously when using ensemble learning with clustering and decision trees. These issues include rules with anomalous classes and rules limited to attributes chosen by the decision tree method. To improve rule extraction, the study described in this paper uses a multi-stage decision method with clustering, controlled decision trees, and association mining. The study uses a dynamic method to generate rules: the characteristics of the dataset dictate the path taken toward rule generation, and a dynamic multi-stage decision tree is generated depending on the attribute dimensions and the size of the dataset. Each rule is scored by its coverage and accuracy. This technique generates rules by mining data from the minority class, and the generated rules are grouped into ranges to facilitate rule choice as needed.

LIST OF FIGURES

Figure 1. Tree using proposed Ensemble learning
Figure 2. Tree using Yu's approach
Figure 3. Tree using Recursive Partition and an enlarged version
Figure 4. Tree using CART
Figure 5. Tree using J48, superimposed on the actual size tree
Figure 6. Building decision tree
Figure 7. Building decision tree
Figure 8. Steps of split and prune of the decision tree
Figure 9. Controlled Decision Tree
Figure 10. Breast Cancer 1st level when accuracy stops changing
Figure 11. Breast Cancer dataset 2nd level with repeating accuracy
Figure 12. Breast Cancer 3rd level with repeating accuracy
Figure 13. Steps for Dynamic decision rules

LIST OF TABLES

Table 1. Data Characteristics
Table 2. Statistical characteristics of the datasets
Table 3. Average accuracy in % without clustering
Table 4. Total no. of rules generated without clustering
Table 5. Different measures with different cluster sizes using C4.5
Table 6. Average accuracy in % after clustering
Table 7. Total no. of rules after clustering
Table 8. Sample rules using merged dataset
Table 9. Total number of rules above threshold with three or more conditions
Table 10. Attrition rules from different clusters of the merged dataset
Table 11. Rules for attrition with common attribute conditions
Table 12. Sample rules predicting opposite class in cluster
Table 13. Example of combining rules
Table 14. Accuracy and coverage using classification on retention data
Table 15. Final selected attributes M-2
Table 16. Data characteristics M-2
Table 17. Statistical characteristics of the datasets M-2
Table 18. Sample rules, their respective probability, and the class type
Table 19. Association mining statistics
Table 20. Characteristics of the dataset
Table 21. Sample rules with accuracy and coverage
Table 22. Testing different datasets with rules for attrition
Table 23. Comparisons of rule coverage using different methods
Table 24. Rule coverage over instances in each class
Table 25. Accuracies obtained using incoming freshmen
Table 26. Examples of combining rules
Table 27. Apriori settings
Table 28. Effect of changing default values
Table 29. Sample rules at confidence = 0.70
Table 30. Datasets with reported accuracy
Table 31. Datasets and their characteristics (UCI)
Table 32. Cluster comparisons with EM and K-means
Table 33. EM clustering for Connect dataset
Table 34. Experimental results of accuracy using different datasets
Table 35. Sample rules with their accuracy and coverage
Table 36. Experimental results of total coverage
Table 37. Summary of minority class in the datasets
Table 38. Apriori settings

CHAPTER I
INTRODUCTION

Student retention has been an ongoing problem in institutions of higher learning in the US. Qualitative research on student retention was started by Tinto as early as 1975. In recent years, educational institutions have collected a large amount of student-related data; hence, quantitative research using data mining techniques makes it possible to discover hidden information at an early stage.

The purpose of the study is to answer the following question: How can data mining methodologies be employed to reduce the error rate and to produce more usable rules for the minority class?

Previous studies have used C4.5, CART, recursive partitioning, Naïve Bayes, neural networks, and logistic regression to identify attributes in student attrition. Naïve Bayes and neural networks were used to validate the results from the decision tree algorithms. Decision tree algorithms are widely used because of their ease of understanding and transparency. However, the attribute that is selected as the top attribute might not always be the best one to generate a model.
In addition, some attributes might be correlated with one another. Thus, to investigate the above question, this study developed a multi-stage, domain-driven data mining procedure that minimizes the error rate and merges the models of different datasets, producing more usable rules.

The study is divided into three sections. Chapter II studies and analyzes data mining methods used by other researchers, such as Yu et al. (2010), and then extends them to generate rules using clustering, controlled decision tree techniques, and unrepeatable attributes. Chapter III overcomes some of the limitations of the method applied in Chapter II by generating retention/attrition rules that merge decision trees and association mining to produce multi-stage rules; this method is called the multi-stage decision tree (MSDT). In Chapter IV, the same algorithms are applied to datasets from the UCI data repository for validation. The multi-stage logic had to be further modified depending on the characteristics of the datasets. The new ensemble method is named dynamic multi-stage decision rules (DMDR).

In this study, available attributes from the institution's database were collected. Principal component analysis and multivariate correlation were applied to the datasets to remove any correlation between the attributes. Clustering, which was not used in other studies, was applied to the datasets to produce homogeneous groupings. The unique methods used in this research were the following:

Model-1: stopping the tree when an attribute starts to repeat, to replicate the Yu et al. (2010) study.

Model-2: locking each attribute once it is used, to extend the Yu et al. method, and measuring each rule by its accuracy and coverage.

Model-3: clustering the data into homogeneous groups before applying Model-2's techniques.

Model-4: generating a controlled decision tree by stopping the tree from growing further after it has reached a target accuracy and the accuracy stops improving for the next two steps. The dataset at this point is pruned to remove the attributes that were used in generating the tree. The pruned dataset is then used to generate association mining rules. The decision tree and association mining rules are joined with an AND condition to produce the final rule. These rules are validated for accuracy and coverage against the whole dataset.

Model-5: extending Model-4 to generalize and verify it on known datasets from UCI. Here the model becomes dynamic and follows a path that depends on the characteristics of the dataset.
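Model-4's joining step can be sketched in a few lines. The sketch below is illustrative only: the attribute names, thresholds, and toy records are hypothetical, and the actual study derives its rules from controlled decision trees and association mining rather than hand-written predicates.

```python
# Sketch of Model-4's final step: a decision-tree rule and an association
# rule are joined with AND, and the combined rule is then scored for
# coverage and accuracy against the whole dataset. All names and values
# here are hypothetical.

def rule_stats(predicate, target_class, dataset):
    """Coverage = fraction of instances matched; accuracy = fraction of
    matched instances belonging to the target class."""
    matched = [row for row in dataset if predicate(row)]
    coverage = len(matched) / len(dataset)
    accuracy = (sum(1 for row in matched if row["class"] == target_class)
                / len(matched)) if matched else 0.0
    return coverage, accuracy

# Stage I rule from the controlled decision tree (hypothetical).
tree_rule = lambda r: r["GPA"] < 2.0
# Stage II rule from association mining on the pruned attributes (hypothetical).
assoc_rule = lambda r: r["LOAN"] == "Y"
# Joined with an AND condition to form the multi-stage rule.
joined = lambda r: tree_rule(r) and assoc_rule(r)

data = [
    {"GPA": 1.8, "LOAN": "Y", "class": "D"},
    {"GPA": 1.9, "LOAN": "Y", "class": "D"},
    {"GPA": 1.7, "LOAN": "N", "class": "R"},
    {"GPA": 3.2, "LOAN": "Y", "class": "R"},
]
cov, acc = rule_stats(joined, "D", data)
print(cov, acc)  # 0.5 1.0
```

The same `rule_stats` scoring applies to every rule the pipeline emits, which is what allows rules to be grouped into coverage/accuracy ranges later.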
Even so, models are still useful for dropout prediction, and some data mining models may be examined in detail to determine how the model renders a decision. The authors propose to utilize data mining techniques to find dropout risk indicators for the early identification of at-risk students so that intervention measures can be applied expediently. The study in this paper, therefore, builds on the results of previous studies, but with the intention of lowering model error rates and of producing rules that can be used with some accuracy and clarity to find at-risk students.

The study focuses on specific data mining methodologies that can produce rules, such as classification (ADTree, PART, Random Tree, J48Graft, CART, Recursive Partition, C4.5) and clustering (K-means and EM) techniques, rather than those methodologies that do not allow rules to be derived from the model easily. Results show that these methodologies can enhance the accuracy of identifying at-risk students and can give more specific insight into attrition, such as revealing otherwise unknown indicator combinations.

Key Words and Phrases: Data mining, clustering, retention, attrition, model selection

Introduction

Student retention is a significant issue in higher education because it is an essential indicator of performance for each institution and for enrollment management. It is also important for policy makers due to the potential negative impact of non-retention of students on the image of the university. In addition, low retention rates for an institution can cause substantial losses in tuition, fees, and alumni contributions (DeBerard et al. 2004). The problem of student retention is complex because of the variety of causes and the lack of a single model to predict why students drop out. Further, high withdrawal rates occur even during the students' second year of enrollment (Nara et al.
2005); thus, it is desirable to identify at-risk students and intervene as early as is practicable to facilitate student retention. The rationale behind investigating student retention includes the following:

- to predict whether a student is at risk of dropping out of the institution so that intervention strategies can be implemented to help retain the student,
- to respond to pressure from potential legislative bills stipulating measures to ensure that institutions address retention by making state grants dependent on the number of graduating students,
- to understand the risk factors and causes behind attrition so that effective intervention measures may be determined,
- to help institutions reduce risk factors, and
- to determine effective classification, clustering, and predictive models using data mining techniques on student data to find at-risk students.

Studies on retention in higher education go back as far as the 1800s (Boston et al. 2011) and have found the following common factors that indicate retention:

- high school: high GPA
- standardized tests: high scores on the ACT and/or SAT
- admissions policies: rigorous rather than open
- social integration: higher use of the library and living on campus

In addition, researchers who mine student data using modeling techniques such as decision trees, logistic regression, and neural networks (for example, Baker et al. 2009, Yu et al. 2010, and Zhang 2010) are finding other interesting factors not necessarily directly related to the above list but supportive of it:

- frequent use of the online learning system
- high transfer hours
- ethnic factors
- residency as opposed to non-residency
- high library usage
- social networking

Although interesting results are being found, the modeling techniques can be improved in accuracy and analyzed to reveal the rules used by the model.
Thus, the work in this paper seeks to expand upon prior modeling to find ways to improve accuracy and to determine rules that can be used to pinpoint at-risk students.

The Background

Qualitative studies, for example ACT (2004), Bean (2001), Sewell and Wegner (1970), and Tinto (1975, 2006), have shown that retention depends on social integration, demographics, academic achievement, financial aid or ability to pay, and institutional factors. In recent years, some quantitative studies (Bayer 2012, Delen 2010, Eitel et al. 2012, Herzog 2006, Kotsiantis 2009, Lykourentzou et al. 2009, Macfadyen et al. 2010, Marquez et al. 2013, Pittman 2008, Yadav et al. 2012, Yu et al. 2010, and Zhang et al. 2010) have used different data mining techniques and have achieved varied results, although the attributes used in these studies are similar. Other factors have also emerged; for instance, social networking analysis (Bayer 2012) has shown that students who interact with academic departments are more likely to be retained. In addition, most researchers note that the first year of a student's career is the most critical because the likelihood of dropping out is high, and so their studies included the first-year cohort of students. Some studies used different cohorts; for instance, Herzog and Pittman used both freshmen and the whole university in their datasets. Herzog's study also included the factors for degree completion.

The datasets were balanced in a few studies because some statistical techniques required it. Delen (2010), Eitel (2012), Kotsiantis (2009), and Marquez (2013) argued that the imbalanced nature of the dataset could affect the results of the study for the minority class (the class with fewer instances in the dataset). Marquez's original dataset had a class distribution of 91.04/8.96 (Retained/Dropped), whereas Delen's (2010) was 80/20 (Retained/Dropped).
Lin (2012) duplicated his dataset two and three times over to obtain a larger group. Kotsiantis (2009) used cost analysis as an alternative to balancing an imbalanced dataset, but retained the same accuracy as for the majority class. Thus, it can be seen that qualitative research laid a path to the predictor attributes and that recent research is using similar attributes and cohorts to build retention models.

Qualitative research models are survey-based and are often criticized for their inability to generalize and for the difficulty of collecting data and administering the survey. The model foundation laid by qualitative researchers, however, is the stepping-stone to quantitative research. Educational institutions collect data on students, so data-driven analysis can complement the results obtained from qualitative models (Delen 2010, Marquez et al. 2012, Pittman 2008, and Yadav et al. 2012).

Most of the quantitative studies use data mining techniques with different decision tree algorithms because of the ease of rule interpretation, ease of learning, and widely accepted results (Delen 2010, Eitel et al. 2012, Marquez et al. 2012, and Yadav et al. 2012). Other data mining or statistical techniques used to validate the results of the decision trees are MARS, neural networks, support vector machines, lazy learners, PART (partial decision tree), ADT (alternating decision tree), bagging, logistic regression, Bayesian classification, genetic algorithms, and ensemble learning (the use of more than one data mining technique). Thus, decision tree algorithms are commonly used by researchers in building retention models. Of the decision tree algorithms, the common techniques are C4.5, CART, and recursive partitioning, using the WEKA and JMP software (discussed later), respectively.
Institutions store high-dimensional student data for quantitative research, but not all attributes help to predict retention, and some data may be unreliable due to human error or omission. Also, rules generated from high-dimensional data may not be interpretable, as will be demonstrated later; hence, researchers use attribute selection techniques to lower the dimensionality of the model (Delen 2010, Eitel et al. 2012, Kotsiantis 2009, Macfadyen et al. 2009, Marquez et al. 2012, Yadav et al. 2012). The attribute selection techniques are ranking methods, for example, entropy, chi-square, minimum error, and MDL. Marquez et al. utilize ten different ranking methods and weigh the attributes using the frequency of each attribute's ranking.

To calculate the accuracy, validate the rules, and predict the important attributes, the following measures are used: accuracy, ROC area, F-measure, sensitivity, and specificity. The researchers conclude that data mining techniques are capable of predicting retention given sufficient attributes and data (Delen 2010, Marquez et al. 2012); their reported accuracy is approximately 80% to 90%. Macfadyen et al. (2009) mention that a better retention model can be built using course/credit details. Bayer et al. (2012) mention that retention accuracy would be better after the student's fourth semester and with the use of a social networking attribute. Herzog (2006) and Pittman (2008) achieve better accuracy with the cohort that includes the whole university. Yu et al. (2010) mention ethnicity, transfer hours, and residency as factors for retention. Zhang's (2010) study points out that academic activity is an indicator of retention. Additionally, most of the researchers mention that financial aid is an indicator of retention. In other words, the quantitative models have confirmed the qualitative research.
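The evaluation measures named above can all be computed directly from a binary confusion matrix. The counts in this sketch are hypothetical; ROC area is omitted because it requires ranked prediction scores rather than counts.

```python
# Evaluation measures for a binary retained/dropped classifier, taking
# "dropped" as the positive class. Counts are hypothetical.

def metrics(tp, fp, fn, tn):
    accuracy    = (tp + tn) / (tp + fp + fn + tn)
    sensitivity = tp / (tp + fn)          # recall on the dropped class
    specificity = tn / (tn + fp)          # recall on the retained class
    precision   = tp / (tp + fp)
    f_measure   = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, f_measure

acc, sens, spec, f1 = metrics(tp=30, fp=10, fn=20, tn=140)
print(round(acc, 3), round(sens, 3), round(spec, 3), round(f1, 3))
# 0.85 0.6 0.933 0.667
```

Note how an imbalanced dataset can yield 85% accuracy while sensitivity on the minority (dropped) class is only 60%, which is exactly the concern raised about the minority class above.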
The Data

The approach proposed in this paper is evaluated on an internal Texas Tech University dataset, as detailed below. The characteristics of the data are given in Table 1 and Table 2.

Attributes:
  o demographic: age, ethnicity, and gender
  o academic: attempted hours, earned hours, GPA, degree, and major changed
  o financial aid data: grants, scholarships, and loans
  o social factors: parents' education and student organization

Classes: retained (R) and dropped (D), as determined by the student's enrollment status during their 2nd year in college.

Source: new freshmen admitted for Fall 2010, Fall 2011, and a merged dataset of both.

Table 1. Data Characteristics

  Name                   Instances   Attributes
  Merged 2010 and 2011   9240        14
  Fall 2010              4490        14
  Fall 2011              4750        14

Table 2. Statistical characteristics of the datasets

  Attribute                                          Mean and/or Distribution
  Age                                                21.46
  Gender                                             M/F (52%/48%)
  Test score                                         1099
  Class percentile                                   71.6
  College GPA                                        2.70
  Earned hours                                       12.12
  Attempted hours                                    13.89
  Financial assistance (combined grants/Pell/loan)   52%-63%
  Low income group (only received Pell)              18%-23%

The Proposed Approach

The next section introduces the motivational concepts and the justification for the new method proposed, which is called ensemble learning because multiple models are used. The section after that gives a detailed approach to ensemble learning, and the last section describes ensemble learning with clustering.

Motivational Concepts

Modeling techniques do not always yield a mechanism whereby rules may be derived. In particular, statistical techniques, neural networks, and regression yield effective mathematical models, but do not show decision reasoning. Decision trees, however, yield rules derived from the path taken from the root to a leaf. These rules, however, may contain conflicting attributes since decision tree modeling algorithms may reuse attributes.
In addition, some rules may not be particularly useful, such as those with repeating attributes of different values or those that lead to small subsets of data. Finally, the resulting dataset that satisfies a rule may be a mix of classes, such as 30% drop and 70% retain, causing the rule to have reduced accuracy. For example, the rule below includes the Avg_Attempt attribute three times, each with different and conflicting conditional values. Because the rule represents a path in the decision tree, the rule does not seem to represent a logical line of reasoning that a human might follow. The approach proposed in this study attempts to generate rules that are more usable by focusing on producing rules with non-ambiguous attribute conditions.

Ex. Avg_Attempt>=9 & LOAN=Y & GPA>=1.672 & GRANT=Y & GPA>=2.616 & PERCENTILE>=55 & Avg_Attempt<18 & ETHNIC(HP, AS, WH, HI) & PERCENTILE>=74 & Avg_Attempt>=12.5 & AGE>=21 & Transfer_Hrs<16

The attributes in the datasets are ranked using residual log-likelihood, chi-square, sum of squared error, and clustering. Clustering is used as an additional method for attribute selection to rule out the bias of any one of the attribute ranking methods. For example, residual log-likelihood, chi-square, and minimum sum of squared error rankings use top-ranked attributes, whereas clustering groups instances on a similarity measure. Here the K-means and EM clustering techniques are employed. K-means may perform better than EM clustering in the case of high-dimensional datasets due to EM's numerical precision problems. EM uses the Gaussian distribution, and a delta function may be created, as shown in equation 1 below (Alldrin et al. 2003). A delta function is sometimes called the black hole of mathematics because, in the limit, its width tends to zero, causing no or few records in a cluster.
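The kind of repeated-attribute problem the example rule exhibits can be checked mechanically. This sketch scans a rule string for attributes that appear in more than one condition; it assumes the same "attribute-operator-value" layout as the example, and the rule fragment is shortened from it.

```python
# Flag decision-tree rules whose attributes repeat along the path,
# like Avg_Attempt and GPA in the example rule above.
import re
from collections import Counter

def repeated_attributes(rule):
    """Return attributes that occur in more than one condition of the rule."""
    attrs = [re.match(r"\s*(\w+)", cond).group(1)
             for cond in rule.split("&")]
    return sorted(a for a, n in Counter(attrs).items() if n > 1)

rule = ("Avg_Attempt>=9 & LOAN=Y & GPA>=1.672 & GRANT=Y & GPA>=2.616 "
        "& PERCENTILE>=55 & Avg_Attempt<18")
print(repeated_attributes(rule))  # ['Avg_Attempt', 'GPA']
```

Rules that pass this check have non-ambiguous attribute conditions, which is the property the proposed approach aims for.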
    δ(x) = lim_{a→0} { 1/a   for |x| < a/2
                       0     everywhere else }                      (1)

The models built in this study using ensemble learning (discussed later) are as follows: Model 1 uses attributes that are ranked using residual log-likelihood, chi-square, or sum of squared error; Model 2 divides the dataset into clusters based on a similarity measure.

In Model 1, the tree is generated using recursive partitioning (JMP) (Gaudard et al. 2006). The approach in this study locks each attribute after its split rather than stopping tree growth when attributes start to repeat, as in Yu et al.'s (2010) approach, which might not make use of all attributes in high-dimensional datasets. The motivation behind locking attributes is to avoid repeated attributes while still maximizing accuracy and generating usable rules. An example tree of the proposed approach is shown in Figure 1, a tree generated using Yu et al.'s approach is shown in Figure 2, and trees generated using the default settings for recursive partitioning, CART, and C4.5 on the same dataset are shown in Figures 3, 4, and 5, respectively. Each path through a decision tree from the root to a leaf represents a subset of the dataset being modeled. More paths in the tree mean more subsets, which increases the number of candidate rules and the chance that rules correspond to very small subsets.

Figure 1. Tree using proposed ensemble learning
Figure 2. Tree using Yu's approach
Figure 3. Tree using recursive partitioning, with an enlarged version
Figure 4. Tree using CART
Figure 5. Tree using J48, superimposed on the actual-size tree

For prediction, the rules are validated against all datasets. Rules with low coverage of the dataset are ignored. In Model 2, the dataset is clustered using K-means clustering (Kaufmann et al. 1990).
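As an illustration only (the study itself used WEKA's K-means implementation, not custom code), the sketch below runs a minimal one-dimensional K-means with Euclidean, mean-based centers for several values of k and compares the within-cluster sum of squared error (SSE), the kind of experiment used here to choose k. The points and the k range are hypothetical.

```python
# Minimal 1-D K-means sketch (illustrative; not the WEKA implementation).
# SSE across several k values shows why k is chosen experimentally.

def kmeans_1d(points, k, iters=20):
    s = sorted(points)
    # spread the initial centers across the sorted range
    if k == 1:
        centers = [sum(s) / len(s)]
    else:
        centers = [s[i * (len(s) - 1) // (k - 1)] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda j: (p - centers[j]) ** 2)
            clusters[nearest].append(p)
        # new center = cluster mean (the Euclidean-distance criterion)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    sse = sum(min((p - c) ** 2 for c in centers) for p in points)
    return centers, sse

points = [1.0, 1.2, 0.9, 5.0, 5.1, 4.8, 9.0, 9.2]   # hypothetical values
for k in (1, 2, 3, 4):
    _, sse = kmeans_1d(points, k)
    print(k, round(sse, 3))
```

On data with three natural groups, the SSE drops sharply up to k = 3 and only marginally afterward; in the study the candidate clusterings are instead evaluated by running C4.5 on each cluster, as described later.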
The user has the option of using any k value for clustering. This study tested several cluster sizes to find the optimal k, as discussed later. The dataset is not balanced in this study, as was done in some previous studies (Delen 2010, Kotsiantis 2009, Yadav et al. 2012), because the class distribution is similar to that of other published datasets, such as a University of California Irvine Machine Learning Repository dataset with D = 36.09% and R = 63.91%. Also, data mining methods can perform the analysis without balancing, whereas balancing is necessary for statistical techniques.

Ensemble Learning

Ensemble learning in this study consists of three parts. First, the decision tree is constructed using x unique attributes. Second, the rules are abstracted and aggregated. Finally, the rules are validated using the whole dataset and compared with other decision trees.

The steps used in the first part are given in Figure 6 below. In the first step, all the attributes are ranked by the tree learning algorithm using residual log-likelihood chi-square (2*entropy), which uses LogWorth (categorical attributes) or Sum of Squares (continuous attributes) depending on the type of attribute (Gaudard et al. 2006), both being statistical methods. Recursive partitioning, a statistical method for multivariable analysis (Gaudard et al. 2006), is used to generate the decision tree. Starting from the top-ranked attribute, the decision tree is allowed to split once. The attribute used for this split is locked, and the tree is split at the next level. This process is iterated until x-1 attributes have been used for splitting, where x is the total number of attributes.

Let x = number of attributes
Step 1. Rank all x attributes.
Step 2. Allow the tree to branch.
Step 3. Lock the attribute that generated the branch.
Step 4. Repeat steps 2-3 until x-1 attributes are used or the tree is fully branched or optimized.

Figure 6. Building decision tree        Figure 7. Building decision tree

In the second part, the rules are abstracted from the trees. Let r be the number of rules generated by the tree, and let c be the number of condition attributes in a rule. Predictive rules are identified by the accuracy and significance of each rule: rules whose accuracy is below a threshold t are eliminated (t is a chosen value; here t = 70% because the average baseline accuracy of the three datasets is 69.2857%), and rules are not considered significant if c is small (c < 3) because of the complex nature of this problem. For example, rules with fewer than three attributes tend to have higher false positive rates. Previous studies have identified rules using decision trees (Herzog 2006, Pittman 2008, and Yu et al. 2010) but did not examine coverage and accuracy per rule. The concept of identifying predictive rules in this study is similar to identifying association rules using a coverage measure in association rule mining (Agrawal et al. 1994). When identifying rules, it is possible for a rule to have conflicting classes; in such a situation, the rules are grouped as uncertain rules.

The final part of ensemble learning involves validation against other methods used in previous studies (Delen 2012, Eitel et al. 2012, Marquez et al. 2012, Pittman 2008, and Yadav et al. 2012). The same dataset is used to generate decision trees with C4.5, CART, and recursive partitioning, and the rules generated by these methods are extracted in the same way.

Ensemble Learning with Clustering

Clustering is an alternative method of selecting attributes for the ensemble learning method described above: clustering selects attributes based on a similarity measure, whereas the attribute selection method selects attributes based on ranking.
The method employed in this study is K-means clustering (Kaufmann et al. 1990). The distance function used is Euclidean distance (ED) rather than Manhattan distance; ED uses the mean for the distance calculation, whereas the latter uses the median. The user has the option of choosing the k value for the number of clusters. Since the results can vary with the size of k, the k value is chosen after experimentation.

Experiments and Results

The proposed approach is tested with the following software: recursive partitioning using JMP (Gaudard et al. 2006), and C4.5 and CART using WEKA (Hall et al. 2009). WEKA, an open-source package whose implementations of C4.5 and CART are called J48 and SimpleCart, is also used for K-means clustering. For comparison with other studies, this study uses 10-fold cross validation on the Fall 2010 dataset, the Fall 2011 dataset, and the dataset generated by merging the two.

As mentioned before, for a rule to be predictive, two conditions must be satisfied: the minimum number of attributes in the condition, c, is three, and the accuracy threshold, t, for each significant rule is 70%. For example, a rule with one attribute may cover many instances of the dataset, but the covered instances usually consist of a mix of the classes (conflicting classes), which lowers the rule's accuracy. In general, as attributes are added to a rule, the rule becomes more specialized, covers fewer instances, and rises in accuracy, since more of the conflicting classes are successively eliminated. Although a higher accuracy threshold might seem desirable, a threshold of 80% creates fewer rules, while a 60% threshold covers more instances but produces a larger number of false positives. A threshold of 70% is a reasonable compromise. All three datasets are tested without clustering using the nine methods shown in Table 3.
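The two rule-selection conditions just stated can be sketched as follows. The record encoding (attribute-to-value dicts), the toy data, and the equality-only conditions are hypothetical illustrations, not the study's actual representation.

```python
# Sketch of the rule-selection criteria: keep a rule only if it has at
# least three condition attributes and at least 70% accuracy over the
# instances it covers. Coverage and accuracy follow the definitions used
# for Table 8: covered/total and correct/covered.

def coverage_and_accuracy(conditions, rule_class, rows, labels):
    covered = [(r, y) for r, y in zip(rows, labels)
               if all(r.get(a) == v for a, v in conditions)]
    if not covered:
        return 0.0, 0.0
    correct = sum(1 for _, y in covered if y == rule_class)
    return len(covered) / len(rows), correct / len(covered)

def is_predictive(conditions, rule_class, rows, labels, t=0.70, min_conds=3):
    _, accuracy = coverage_and_accuracy(conditions, rule_class, rows, labels)
    return len(conditions) >= min_conds and accuracy >= t

rows = ([{"GRANT": "N", "LOAN": "N", "GENDER": "M"}] * 8 +
        [{"GRANT": "N", "LOAN": "N", "GENDER": "F"}] * 2)
labels = ["D"] * 7 + ["R", "D", "R"]

rule = [("GRANT", "N"), ("LOAN", "N"), ("GENDER", "M")]
print(coverage_and_accuracy(rule, "D", rows, labels))   # (0.8, 0.875)
print(is_predictive(rule, "D", rows, labels))           # True
print(is_predictive(rule[:2], "D", rows, labels))       # False: 2 conditions
```

Dropping the third condition keeps the accuracy above the threshold in this toy data but disqualifies the rule on the minimum-condition criterion, mirroring the significance requirement described above.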
Default conditions and 10-fold cross validation are used in all the methods, to generate the trees. The recursive partition method generated the most consistent results across the data sets and the highest accuracy in comparison to the other methods. It is known to be an accurate method that can overfit the data. Each method performed the best on the Fall 2010 dataset and the worst on the Fall 2011 dataset, showing that each incoming cohort of students has differing dropped/retained characteristics. All methods performed at about the same accuracy except for recursive partition (best) and random tree (worst). A clue for each method’s performance is provided in the next table discussed. 18 Texas Tech University, Soma Datta, August 2014 Table 4 shows the total number of rules generated using each method and consequently the number of subsets partitioned on the dataset. The random tree method produced far more rules and subsets while Yu’s method produced the lowest. Ensemble learning, ADTree, and Yu’s methods are consistent in the number of rules generated using the three datasets. Recursive partition, however, presents the interesting outlier with a relatively high number of rules, but not nearly so much as random tree. For each splitting point in the tree, recursive partition statistically analyzes each attribute and its potential values for best satisfaction of a purity measure of the resulting dataset partitions. Clearly, such a splitting analysis causes greater success than those of the other decision tree methods. Table 3. Average accuracy in % without clustering Dataset Fall 2010 Fall 2011 Merged dataset C4.5 83.1403 75.4316 78.961 CART 82.5612 75.2842 79.0368 ADTree 83.5412 75.7053 79.3831 PART J48graft 81.5145 82.9844 72.7368 75.5789 76.9372 79.0909 RandomTree 75.2116 66.9895 71.0714 Yu’s method Recursive partition 82.81 87.55 74.84 85.09 78.71 86.17 Ensemble learning 82.81 75.58 79.17 Table 4. Total no. 
of rules generated without clustering Dataset C4.5 CART ADTree PART J48graft RandomTree Yu’s method Recursive partition Ensemble learning Fall 2010 110 6 21 179 99 6701 4 277 14 Fall 2011 11 6 21 193 15 8568 4 452 13 19 Merged dataset 158 10 21 402 163 17544 4 866 12 Texas Tech University, Soma Datta, August 2014 In Model 2, the merged dataset (all attributes) is clustered using K-means and EM (Expectation-Maximization) clustering. The method chosen for testing the optimum cluster size is C4.5 with the default setting utilizing 10-fold validation for each cluster. The reason behind selecting C4.5 is to compare it with previous studies (Delen 2010, Pittman 2008, Yadav et al. 2012, Marquez et al. 2012, and Eitel et al. 2012). In Table 5, several measures are used to evaluate the results of C4.5 on each cluster. The average accuracy is computed across the clusters, as well as the average true positive (TP) rate (correctly classified instances, also known as recall), the average false positive (FP) rate (incorrectly classified instances), the average precision (fraction of retrieved instances that are correctly classified), F-measure (harmonic mean of precision and recall), and average ROC area (area under the curve of the TP versus FP plot). The benefit of using several measures is that potential inadequate performance may not be reflected in all of the measures, but will be shown in others. For example, while the highest accuracy is obtained using EM’s thirteen clusters, the FP rate is also high and the ROC score (below .5) indicates that the classifier is not properly distinguishing among the instances. In fact, as the clusters increase, the ROC score decreases, indicating poorer classification performance. It turns out that several of the clusters in the larger numbers of clusters have poor FP rates (contain conflicting classes) due to the mixture of dropped and retained instances in the cluster. 
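The per-cluster measures used in Table 5, except ROC area (which additionally needs ranking scores), can be computed from a confusion matrix as sketched below. The predicted labels are hypothetical; dropped (D) is treated as the positive class.

```python
# Sketch: the evaluation measures of Table 5 computed from true and
# predicted class labels. ROC area is omitted because it requires
# per-instance ranking scores rather than hard labels.

def binary_measures(y_true, y_pred, positive="D"):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = len(y_true) - tp - fp - fn
    tp_rate = tp / (tp + fn) if tp + fn else 0.0            # recall
    fp_rate = fp / (fp + tn) if fp + tn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    f_measure = (2 * precision * tp_rate / (precision + tp_rate)
                 if precision + tp_rate else 0.0)
    return {"tp_rate": tp_rate, "fp_rate": fp_rate,
            "precision": precision, "f_measure": f_measure}

y_true = ["D", "D", "D", "R", "R", "R", "R", "R"]   # hypothetical
y_pred = ["D", "D", "R", "D", "R", "R", "R", "R"]   # hypothetical
print(binary_measures(y_true, y_pred))
```

For these toy labels, recall and precision are both 2/3 and the FP rate is 0.2, illustrating how a clustering can look accurate overall while the FP rate exposes conflicting classes within a cluster.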
Choosing different attributes on which to cluster might purify the resulting clusters; however, such pure clustering is known to be difficult to achieve. K-means cluster sizes of 3 and 5 are the likely candidates. Cluster size 3 is chosen because of its higher accuracy, precision, and TP/recall rate, as well as its ROC area score, which is close to that of K-means with 5 clusters.

Table 5. Different measures with different cluster sizes using C4.5
                               Avg.      Avg. TP        Avg.       Avg.       Avg.       Avg.
                               accuracy  rate (recall)  FP rate    precision  F-measure  ROC area
Original dataset (C4.5)
  Cluster 1                    0.7861    0.79           0.288      0.789      0.789      0.8
Clusters generated using K-means clustering
  Cluster 3                    0.84564   0.845667       0.699333   0.838      0.804333   0.704667
  Cluster 5                    0.83667   0.8366         0.6266     0.8176     0.8066     0.7286
  Cluster 7                    0.83391   0.8338571      0.6645714  0.7835714  0.7958571  0.6445714
  Cluster 9                    0.84171   0.8417778      0.5866667  0.8097778  0.8191111  0.6323333
  Cluster 11                   0.84675   0.8468182      0.62       0.8107273  0.82       0.6218182
  Cluster 13                   0.85582   0.8661538      0.7476154  0.8271538  0.83       0.5976923
  Cluster 15                   0.85405   0.854          0.7006667  0.8087333  0.8222     0.578
Default clusters generated using EM clustering
  Cluster 13                   0.89085   0.890846       0.866692   0.817923   0.851154   0.468462

Table 6 shows the results of the decision tree methods with a cluster size of three. Compared with the results without clustering, the Fall 2010 dataset did not gain a great deal from clustering (only 2% on two methods), but the Fall 2011 dataset gained as much as 10% in accuracy, and the merged dataset received a more modest 6% boost. Again, recursive partitioning is the most accurate of the nine methods, with ensemble learning not far behind.
Table 6. Average accuracy in % after clustering (all methods applied with clustering)
Dataset          C4.5    CART    ADTree  PART    J48graft  RandomTree  Yu's method  Recursive partition  Ensemble learning
Fall 2010        83.119  83.233  83.109  81.348  82.957    75.348      82.34        89.12                84.503
Fall 2011        82.552  82.579  82.756  81.894  82.577    76.331      81.95        87.646               82.583
Merged dataset   84.564  84.522  83.844  82.758  84.528    76.586      82.163       90.0466              83.56

In Table 7, the number of rules is expressed as the sum of the rules in each cluster; equivalent rules are not deleted. The number of rules within a cluster falls at or below the number of rules in the non-clustered results, which is expected since the clusters form subsets of the larger dataset.

Table 7. Total no. of rules after clustering (all methods applied with clustering)
Dataset          C4.5      CART     ADTree    PART       J48graft   RandomTree      Yu's method  Recursive partition  Ensemble learning
Fall 2010        45+71+78  3+4+22   21+21+21  116+61+30  75+76+101  3528+2079+1193  4+4+2        232+88+77            13+13+13
Fall 2011        0+0+20    14+0+11  21+21+21  86+37+53   1+1+20     2887+1843+2263  4+3+4        170+102+66           13+12+12
Merged dataset   5+30+5    3+7+3    21+21+21  77+137+76  7+61+9     6041+5075+3156  2+3+2        334+265+173          13+13+14

Sample rules from the merged dataset are shown in Table 8. Each rule has conditions followed by a colon and the class, (D)rop or (R)etain. Coverage is calculated as the size of the decision tree partition for the rule over the total number of instances. Accuracy is calculated as the number of instances classified correctly by the rule over the partition size. The coverage ranged from around 41% to less than 1% of the data, and the accuracy ranged from 100% of the covered instances to around 12%.
The results reinforce that the more specialized the rule, the lower the coverage but the higher the accuracy. Table 9 lists the number of rules that met the study criteria of three or more attributes with 70% or more accuracy over the whole dataset. Lower coverage of the dataset tends to come with higher accuracy; recursive partitioning again outperforms the other methods in accuracy, although it has lower coverage and many more rules than ensemble learning.

Table 8. Sample rules using merged dataset
Method | Sample rule | Coverage | Accuracy
C4.5 | GRANT=Y & GPA_CHAR=Fail & ETHNIC=WH & Greek_Life=N & SAT_RANGE1=950<=SAT<=1000 & Degree=BA & GENDER=F : D | 0.152% | 78.571%
C4.5 with clustering | Degree=BA & percentile_range=2ndquarter & LOAN=N & GRANT=N : D | 4.491% | 70.120%
CART | GRANT=N & LOAN!=N & GPA_CHAR=(Fail) & SAT_RANGE1=(850<=SAT<=900)|(750<=SAT<=800)|(700<=SAT<=750)|(1150<=SAT<=1200)|(1200<=SAT<=1250)|(1100<=SAT<=1150)|(950<=SAT<=1000)|(1350<=SAT<=1400)|(0<=SAT<50)|(1400<=SAT<=1450)|(1500<=SAT<=1550)|(1450<=SAT<=1500)|(650<=SAT<=700)|(1550<=SAT<=1600)|(500<=SAT<=550) : D | 1.093% | 60.396%
CART with clustering | Degree=(BA)|(BLA) & percentile_range=(2ndquarter) & GPA_CHAR=(Low)|(High)|(Medium) & LOAN=Y & GRANT!=N : D | 3.983% | 12.228%
Recursive partition | GRANT=N&LOAN=N&GPA_CHAR(Low, Fail, Missing)&GENDER(M)&FALL_ATMP_RANGE(T) : D | 0.195% | 100.000%
Recursive partition with clustering | ^&percentile_range(Next15, 4thQuarter, 3rdQuarter, Top10)&Degree(BGS, BS, NDU) : R | 32.121% | 73.214%
ADTree | GRANT=N & LOAN=N & GPA_CHAR!=Fail : D | 21.331% | 66.362%
ADTree with clustering | Degree=BS AND percentile_range=Next15 : D | 14.719% | 27.647%
PART | GRANT=N AND LOAN=N AND GPA_CHAR=Fail AND GENDER=M AND Greek_Life=N : D | 2.554% | 85.532%
PART with clustering | Degree=BS AND SAT_RANGE1=1400<=SAT<=1450 AND Major_CHANG_FALL_Spring=N : D | 0.006% | 87.719%
J48graft | GRANT=Y & GPA_CHAR=Fail & ETHNIC=WH & Greek_Life=N & SAT_RANGE1=1100<=SAT<=1150 & FALL_ATMP_RANGE=F : D | 0.476% | 65.909%
J48graft with clustering | Degree=BA & percentile_range=2ndquarter & LOAN=N & GRANT=N : D | 4.491% | 70.120%
Random Tree | percentile_range=Next15 & GRANT=Y & Parent_Char=13P & GPA_CHAR=Low & SAT_RANGE1=1050<=SAT<=1100 & Major_CHANG_FALL_Spring=Y : D | 0.011% | 100.000%
Random Tree with clustering | percentile_range=2ndquarter & GPA_CHAR=Medium & Parent_Char=P & SAT_RANGE1=1100<=SAT<=1150 & ETHNIC=WH & Greek_Life=N & FALL_ATMP_RANGE=F : D | 0.087% | 50.000%
Yu's method | GRANT=N&LOAN=N : D | 25.465% | 68.508%
Yu's method with clustering | Degree(BM, BBA, BS, BFA, BID, BAR, BGS, NDU)&percentile_range(Next15, 4thQuarter, 3rdQuarter, Top10) : D | 40.985% | 27.013%
Ensemble learning | GRANT=Y&GPA_CHAR(High, Medium, Low)&SAT_RANGE1(1500<=SAT<=1550, 1550<=SAT<=1600, 500<=SAT<=550, 650<=SAT<=700, 1400<=SAT<=1450, 1350<=SAT<=1400, 1450<=SAT<=1500, 1300<=SAT<=1350, 0<=SAT<50, 1250<=SAT<=1300) : R | 10.855% | 94.417%
Ensemble learning with clustering | GPA_CHAR(High, Medium, Low)&ETHNIC(HP, M, PR, AS, B, WH, AI)&LOAN=N : R | 26.861% | 64.666%

Table 9. Total number of rules above threshold with three or more conditions (merged dataset)
Method | No. of rules | Min. coverage % | Max. coverage % | Total coverage % | Average accuracy %
C4.5 | 52 | 0.011 | 5.768 | 11.86 | 93.00
C4.5 cluster | 0+11+0 | 0.020 | 6.240 | 0+21.73+0 | 0+77.00+0
CART | 10 | 19.500 | 52.370 | 100.00 | 73.145
CART cluster | 0+5+0 | 0.281 | 14.048 | 0+30.00+0 | 0+81.00+0
Recursive partition | 81 | 0.649 | 10.963 | 68.84 | 91.666
Recursive partition cluster | 96+52+52 | 0.011 | 5.812 | 30.66+13.32+15.28 | 96.78+95.249+94.81
C4.5 | 51 | 0.011 | 5.768 | 11.75 | 92.93
C4.5 cluster | 2+11+4 | 0.011 | 62.446 | 1.266+21.017+0.714 | 52.916+60.93+27.00
Ensemble learning | 8 | 8.658 | 31.797 | 77.53 | 85.969
Ensemble learning cluster | 12+6+11 | 0.087 | 44.456 | 92.00+33.04+50.09 | 63.13+72.98+80.32

Table 10.
Attrition rules from different clusters of the merged dataset
Rule | Merged dataset (coverage % / accuracy %) | Fall 2010 (coverage % / accuracy %) | Fall 2011 (coverage % / accuracy %)
Grant=N & Loan=N : D | 25.465 / 68.508 | 22.74 / 74.93 | 28.04 / 63.59
Grant=N & Loan=N & Parent_Char=8P & GPA_CHAR=Fail & Greek_Life=N : D | 1.320 / 88.525 | 1.38 / 91.94 | 1.26 / 85.00
Grant=N & Loan=N & ETHNIC=WH & GPA_CHAR=Low & Greek_Life=N : D | 1.45 / 77.612 | 1.96 / 79.55 | 0.97 / 73.91
GRANT=N&LOAN=N&GENDER(M) : D | 14.13 / 72.43 | 12.49 / 77.54 | 15.68 / 68.59
GPA_CHAR(Fail, Missing)&Degree(BM, BBA, BS, BLA)&percentile_range(4thQuarter, Top10, 3rdQuarter, Next15)&FALL_ATMP_RANGE(T, H) : D | 0.22 / 0.8 | 0.156 / 85.714 | 0.274 / 76.923

The most interesting attrition result from this study concerned financial aid: without financial aid, students tended to drop, as indicated in Table 11. Other attributes that indicated attrition included lower college GPA, lower enrollment (total attempted hours), lower class percentile (high school class rank), lower parental educational attainment (a social-background factor), and low involvement in campus activities (such as Greek life). The attrition attributes common to all three datasets were lack of grants, lack of loans, and male gender, as shown in Table 11.

Table 11. Rules for attrition with common attribute conditions
Merged dataset: GRANT=N&LOAN=N&GENDER(M) : D, accuracy = 72.43%
Fall 2010: GRANT=N&LOAN=N&GENDER(M) : D, accuracy = 77.54%
Fall 2011: GRANT=N&LOAN=N&GENDER(M)&percentile_range(2ndquarter, 3rdQuarter) : D, accuracy = 73.08%

Conclusion and Future Work

This study did find methods that create usable rules, but for such a complex problem, the process needs further investigation to achieve better accuracy and a unified model. It is not enough to use just a few top-ranked attributes to generate rules. For example, the rules generated in the clusters of each dataset were anomalous when tested on the whole dataset, as shown in Table 12.
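The anomaly just described, a cluster-generated rule predicting a different class when re-evaluated on the whole dataset, can be checked with a sketch like the following. The rule encoding, the toy records, and the labels are hypothetical.

```python
# Sketch: re-evaluate a rule learned inside one cluster against the whole
# dataset and flag it when the majority class of the covered instances
# disagrees with the class the cluster predicted.

def whole_dataset_class(conditions, rows, labels):
    covered = [y for r, y in zip(rows, labels)
               if all(r.get(a) == v for a, v in conditions)]
    if not covered:
        return None
    return max(set(covered), key=covered.count)   # majority class

def is_anomalous(rule, rows, labels):
    conditions, cluster_class = rule
    actual = whole_dataset_class(conditions, rows, labels)
    return actual is not None and actual != cluster_class

rows = ([{"Degree": "BA", "GRANT": "Y"}] * 6 +
        [{"Degree": "BA", "GRANT": "N"}] * 4)       # hypothetical records
labels = ["R"] * 5 + ["D"] * 5                       # hypothetical classes

rule = ([("Degree", "BA"), ("GRANT", "Y")], "D")     # class learned in a cluster
print(is_anomalous(rule, rows, labels))              # True: whole dataset says R
```

Rules that pass this check within their cluster but fail it on the full dataset correspond to the anomalous rows of Table 12.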
The rules are numbered for illustrative purposes only. R1 was generated from a cluster using C4.5, but when tested on the whole dataset, the resulting class was R rather than the D generated in that cluster. No anomaly occurred in predicting the classes when CART was used. R4 is a sample rule from recursive partitioning that predicted a different class on the whole dataset. Ensemble learning had another anomaly: different clusters produced two rules with the same attributes and class except for the PELL attribute, and when validated against the whole dataset, both rules had the opposite class.

Table 12. Sample rules predicting the opposite class in a cluster
Method | Rule no. | Rule with the class | Actual class from the whole dataset
C4.5 cluster | R1 | Degree=BA & percentile_range=Top10 & GRANT=Y : D | R
CART cluster | R2 | All rules predicted the same class from the whole dataset. |
J48graft cluster | R3 | Degree=BA & percentile_range=Top10 & GRANT=N & LOAN=Y : D | R
Recursive partition cluster | R4 | GPA_CHAR(High, Medium, Low)&GRANT=N&Greek_Life(N)&Major_CHANG_FALL_Spring(N)&Parent_Char(4P, 7P) : D | R
Ensemble learning clustering | R5 | GPA_CHAR(Fail, Missing)&Degree(BM, BBA, BS, BLA)&percentile_range(2ndquarter)&PELL_Y(N) : R | D
Ensemble learning clustering | R5 | GPA_CHAR(Fail, Missing)&Degree(BM, BBA, BS, BLA)&percentile_range(2ndquarter)&PELL_Y(Y) : R | D

In Table 13, the rules generated are similar to each other except for one attribute condition. If this attribute is omitted or its values are combined, several rules can be merged into one. For rules 1 and 2, the last attribute condition covers both values (Female and Male), allowing the attribute to be deleted from the rule conditions. For rules 3 and 4, the rules can be combined by including an OR, as shown in the combined rule (3, 4). In the future, the study will be extended to combine rules as shown in the example below so that they are clearer and more condensed.
Table 13. Example of combining rules
Rule 1: GPA_CHAR(High, Medium, Low)&GRANT=Y&LOAN=N&ETHNIC(NR, U, HI, B, WH, MX)&GENDER(M) = R (37/0 = 100%)
Rule 2: GPA_CHAR(High, Medium, Low)&GRANT=Y&LOAN=N&ETHNIC(NR, U, HI, B, WH, MX)&GENDER(F) = R (269/16 = 94.38%)
Combined rule (1, 2): GPA_CHAR(High, Medium, Low)&GRANT=Y&LOAN=N&ETHNIC(NR, U, HI, B, WH, MX)&(GENDER(M) OR GENDER(F)) = R, which simplifies to GPA_CHAR(High, Medium, Low)&GRANT=Y&LOAN=N&ETHNIC(NR, U, HI, B, WH, MX) = R
Rule 3: GPA_CHAR(Fail, Missing)&Major_CHANG_FALL_Spring(Y)&Parent_Char(4P, 1P, P, 6P)&SAT_RANGE1(950<=SAT<=1000, 1050<=SAT<=1100, 1150<=SAT<=1200, 1200<=SAT<=1250, 1250<=SAT<=1300, 1300<=SAT<=1350, 650<=SAT<=700)&FALL_ATMP_RANGE(FM) = D (75%)
Rule 4: GPA_CHAR(Fail, Missing)&Major_CHANG_FALL_Spring(Y)&Parent_Char(4P, 1P, P, 6P)&SAT_RANGE1(950<=SAT<=1000, 1050<=SAT<=1100, 1150<=SAT<=1200, 1200<=SAT<=1250, 1250<=SAT<=1300, 1300<=SAT<=1350, 650<=SAT<=700)&FALL_ATMP_RANGE(F, H, T) = D (100%)
Combined rule (3, 4): GPA_CHAR(Fail, Missing)&Major_CHANG_FALL_Spring(Y)&Parent_Char(4P, 1P, P, 6P)&SAT_RANGE1(950<=SAT<=1000, 1050<=SAT<=1100, 1150<=SAT<=1200, 1200<=SAT<=1250, 1250<=SAT<=1300, 1300<=SAT<=1350, 650<=SAT<=700)&FALL_ATMP_RANGE(F, H, T, FM) = D

This study will be expanded to include more data, such as within-semester data like mid-term grades, and to investigate combining different data mining algorithms, such as decision trees and association mining, for rule generation. For example, generating trees in multiple stages with a decision tree algorithm and then applying association mining techniques in the next stage will help to sample rules at earlier stages. Finally, rule combination techniques will be derived with the goal of reducing the number of generated rules to a set that covers as much of the dataset as possible with good accuracy.
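The rule-combination step illustrated in Table 13 can be sketched as follows. The encoding is a hypothetical choice made for illustration: each condition maps an attribute to a set of allowed values, and two rules merge when they predict the same class and differ in exactly one attribute's value set.

```python
# Sketch of combining rules that differ in one attribute condition, as in
# Table 13: the differing attribute's allowed values are unioned; if the
# union covers every possible value, the condition can be dropped entirely.

def merge_once(rule_a, rule_b):
    """Merge two (conditions, class) rules differing in one attribute.
    Conditions map attribute -> frozenset of allowed values."""
    conds_a, class_a = rule_a
    conds_b, class_b = rule_b
    if class_a != class_b or set(conds_a) != set(conds_b):
        return None                      # different class or attribute sets
    differing = [a for a in conds_a if conds_a[a] != conds_b[a]]
    if len(differing) != 1:
        return None                      # must differ in exactly one attribute
    merged = dict(conds_a)
    attr = differing[0]
    merged[attr] = conds_a[attr] | conds_b[attr]   # OR the value sets
    return merged, class_a

r1 = ({"GPA_CHAR": frozenset({"High", "Medium", "Low"}),
       "GRANT": frozenset({"Y"}), "GENDER": frozenset({"M"})}, "R")
r2 = ({"GPA_CHAR": frozenset({"High", "Medium", "Low"}),
       "GRANT": frozenset({"Y"}), "GENDER": frozenset({"F"})}, "R")

combined = merge_once(r1, r2)
print(combined)
```

Here GENDER ends up allowing both M and F; since those are the only values in the data, the condition can be removed, mirroring the combined rule (1, 2) of Table 13.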
CHAPTER III
MULTI-STAGE DECISION METHOD TO GENERATE RULES FOR STUDENT RETENTION

Abstract

The analysis of student retention data to predict students who are likely to drop out is an increasingly important problem for institutions that must meet legislative mandates, face budget shortfalls due to decreased tuition or state-based revenue, and fall short of producing enough graduates in fields of need, such as technology. Several researchers have achieved a level of success in predictive analysis of student retention data by utilizing data modeling techniques, such as those used in statistics (clustering) and data mining (decision trees). Usually, researchers apply one technique at a time to the data and derive rules characterizing students who drop out, but applying more than one technique in stages could give broader and better rule extraction, as well as the additional benefits of higher accuracy and identification of retention indicators. Thus, the authors propose a multi-stage decision method to overcome rule extraction issues encountered previously when using ensemble learning with clustering and decision trees. These issues include rules with anomalous classes and rules whose attributes are chosen only by the decision tree method. To improve rule extraction, the study described in this paper uses a multi-stage decision method with clustering, controlled decision trees, and association mining. In the first stage of the study, clustering and the decision tree model segment the dataset into smaller datasets that have fewer attributes than the original dataset. In the next stage, association mining generates rules for each of the smaller datasets and unifies its rule set with those of the decision tree. Because many rules result, the method filters rules by rule accuracy, data coverage, number of attributes, and similarity.
To characterize the method's performance, the authors measure coverage, precision, and accuracy on different student datasets. The performance of the method is promising, and results are in use at Texas Tech to find students at risk of dropping out.

Key words: association rules, data mining, decision tree, multi-stage decision rules, recursive partition, retention, rule sets

Introduction

Student retention data is difficult to analyze due to changing enrollment trends, the effect of new or discontinued programs at an institution, the characteristics of different cohorts of students, and the multitude of reasons a student may leave an institution that may not be related to financial need or social integration. In addition, high withdrawal rates occur in students' second year of enrollment (Nandeshwar et al. 2011). Thus, identifying at-risk students as early as is practical is desirable so that interventions may be applied quickly. Encouraging students to stay at an institution in situations of mutual benefit is a win-win situation, but institutions may have the additional pressure of boosting retention numbers to qualify for legislative budget increases. Early identification of at-risk students is facilitated through the use of data mining techniques on student data, such as those used in the previous studies of Herzog (2006), Zhang et al. (2010), Yu et al. (2010), Pittman (2008), Luan (2002), Lin (2012), Lykourentzou (2009), and Macfadyen et al. (2010). These studies indicate that decision trees help in identifying the rules for student retention and attrition. Further, Datta and Mengel (2014a) used two years of school data to show that the rules generated by ensemble learning utilizing clustering and the recursive partition data mining technique are effective but could be improved.
For example, in Table 14 three rules are shown where coverage increases as attributes are removed, but accuracy does not follow a consistent pattern. Additional issues found with the rules are:
o Some rules generated from the clustered dataset had anomalous results when tested against the original dataset.
o Rules contained only the attributes chosen by the decision tree method.

Table 14. Accuracy and coverage using classification on retention data (merged dataset)
Rule | Coverage % | Accuracy %
Grant=N & Loan=N & ETHNIC=WH & GPA=Low & Greek_Life=N : D | 1.45 | 77.612
GRANT=N&LOAN=N&GENDER(M) : D | 14.13 | 72.43
GPA=(Fail, Missing)&Degree(BM, BBA, BS, BLA)&percentile_range(4thQuarter, Top10, 3rdQuarter, Next15)&FALL_ATMP_RANGE(T, H) : D | 0.22 | 0.8

In order to generate rules, decision tree methods are expedient since the path from the root to a leaf is one rule. In addition, decision trees have precedent in prior studies, such as Delen (2010), Yadav et al. (2012), Marquez et al. (2012), and Eitel et al. (2012). This study continues the work started in Datta and Mengel (2014a), which looked at several decision tree methods, but proposes to use them in multi-stage steps to improve accuracy and rules, similar to Senator (2005) as described below. Additionally, this study uses the guidelines for attribute selection utilizing techniques such as information gain and chi-square, as discussed by Nandeshwar et al. (2011). This study's contribution is increased modeling and rule accuracy by utilizing a multi-stage decision method.

To provide context for this work, the section Related Work outlines related studies on retention that have used various data mining methodologies; it also briefly introduces studies that have developed multi-stage decision trees for datasets with multiple classes. The section Data for the Study explains the attributes used in the study and gives the data sources.
The section Methodology explains the process used to develop the multi-stage model. The section Results investigates the results of the model and compares it with other studies. The section Conclusion discusses conclusions based on those results and compares them with the results obtained from the studies outlined in Related Work. The section Future Work presents directions for future research.

Related Work

Background

Qualitative and quantitative studies identify the causes of attrition. Some examples of qualitative studies utilizing survey-based data, with attributes such as demographics, academic integration, academic achievement, social background, and financial aid, are Tinto (1975), Bean (2001), Sewell and Wegner (1970), ACT (2004), and Tinto (2006). Comparable attributes are used in quantitative studies that apply data mining techniques to institutional student data: Zhang et al. (2010), Marquez et al. (2013), Yu et al. (2010), Pittman (2008), Bayer (2012), Delen (2010), Herzog (2006), Kotsiantis (2009), Lykourentzou et al. (2009), Macfadyen et al. (2010), Nandeshwar et al. (2011), Eitel et al. (2012), and Yadav et al. (2012). Kerkvliet et al. (2004) show that different institutions experience different attrition causes; for example, grants may favorably influence retention at one institution but not at another. Pittman, Herzog, Zhang, Nandeshwar et al., and Yu et al. note that a student's first year in college is more critical than later years. Delen (2010), Eitel (2012), Kotsiantis (2009), and Marquez (2013) balance the dataset to get better accuracy. Lin (2012) replicated his data to remove the imbalance but had better accuracy with the original class distribution.
The tool used most often in quantitative studies is the decision tree because of its transparency, ease of learning, ease of understanding, and acceptance. Other data mining techniques are MARS, neural networks, logistic regression, Bayesian classification, and genetic algorithms. Researchers use different attribute selection methods to lower the high dimensionality of the datasets because not all attributes are useful to predict attrition.

Multi-stage Decision Tree
The first part of this section highlights the multiple-stage processes used by Huo et al. (2006) and Lu et al. (2009) on datasets with more than two classes. The last part of this section describes Senator's (2005) study using multiple stages to generate rules for datasets having fewer instances of the positive class, or class of interest. The minority-class data distribution in this study is similar to Senator's (2005) data distribution. Two studies, Huo et al. (2006) and Lu et al. (2009), deal with multi-stage decision trees for datasets that have more than two classes. The motivation behind their studies is to generate simple, short rules and achieve lower error rates. Both studies use the same datasets (Car, derm, ecoli, and glass) from the UCI repository (UC Irvine Machine Learning Repository). Huo et al. (2006) propose a new method called a multi-stage decision tree (MDT) to extract general rules with lower error rates. The basis of the method is maximum margin learning using the Support Vector Machine (SVM) (Wang et al., 2005) to separate the data in the multiple-class space into two classes (positive and negative), each of which may contain more than one class. A recursive process ensues where the ID3 algorithm generates a tree for each newly created two-class dataset using maximum margin SVM until only one of the original classes is contained. Huo et al.
mention that the number of rules generated from each of the datasets is smaller than that produced by the traditional C4.5 decision tree algorithm. Lu et al. (2009) wanted to reduce the complexity of the re-labeling and re-grouping of classes that occurred in the study of Huo et al. (2006); hence, Lu et al. utilized maximum margin SVM to separate the given dataset, treated as if without classes, into two classes. The C4.5 algorithm derived a decision tree from the newly created two-class dataset and continued recursively until the newly generated dataset had only two of the original classes. Lu et al. did reduce complexity and generated a set of decision trees with shorter and simpler rules. Senator (2005) aimed to achieve greater accuracy by using a two-stage classification process. In the first stage, the dataset is classified using fewer attributes, and those instances that are suspicious (true negative and false positive in the confusion matrix) are carried over to the next stage for further classification work; the other instances are dropped at this point. In the second stage, more complex methods are utilized, such as human intervention and linkage analysis. In addition, more attributes are added to the remaining suspicious instances from the first stage. The accuracy is calculated using the 'true positives' (x1) and 'true negatives' (y2) from the first and second stages. Senator performed his study in two different domains, both with large datasets but with a low number of instances in the class of interest. He used HIV and counter-terrorism detection datasets where only 5% of the dataset tested as a high-risk group with a very low false-positive rate. The metrics utilized to measure the results are the ROC curve, lift curve, sensitivity, and specificity.

Data for the Study
The approach proposed in this paper is evaluated by applying an internal Texas Tech University system dataset of first-time admitted undergraduates.
During the preprocessing of the data, the attribute selection methods (candidate_G2, Logworth, ChiSquare, NaiveBayesSimple_Accuracy, GainRatio, NaiveBayesSimple_RMSE, and SignificanceAttributeEval) are used with voting to select the final attributes. The final attributes are given in Table 15, and the data characteristics are given in Table 16. Further details of the data are given below.

Attributes
o demographic: age, ethnicity, geographic location, residency, and gender
o academic: attempted hours, test scores, class percentile, earned hours, GPA, and major changed
o financial aid: grants, scholarships, and loans
o social factors: parent's education and student organization

Classes - retained (R) and dropped (D), as determined by the student's enrollment status during the 2nd year in college
Source - new freshmen, all majors, admitted for Fall 2010, Fall 2011, and a merged dataset of both
Validation and testing - Fall 2012 dataset

Table 15. Final Selected Attributes M-2
Attribute category | Attributes
Demographic | Ethnicity, gender
Academic | Attempted credit hours in ranges, GPA, degree, major change binary value, test scores, class percentile
Financial aid | Grants, scholarships, and loans
Social factors | Parent's education and student organization (Greek life)

Table 16. Data characteristics M-2
Name | Instances | Attributes
Merged 2010 and 2011 | 9240 | 14
Fall 2010 | 4490 | 14
Fall 2011 | 4750 | 14
Fall 2012 | 4296 | 14

The statistical characteristics of the datasets are as shown in Table 17 and in later result tables. Table 17.
Statistical characteristics of the datasets M-2
Attribute | Mean / Distribution
Age | 21.46
Gender (M/F) | 52% / 48%
Test score | 1099
Class percentile | 71.6
College GPA | 2.70
Earned hours | 12.12
Attempted hours | 13.89
Financial assistance (combined grants/Pell/loan) | 52%-63%
Greek Life (student organization) | 81% with no involvement
Major Change (N/Y) | 78.139 / 21.861
Ethnicity | WH - 70%
Parents' education | Ranged from no school to graduates
Gender (F/M) | 48.34 / 51.58
Degree | BS - 46.09 / BA - 40.23
Low income group (only received Pell) | 18%-23%

Methodology
In this study, clustering, the recursive partitioning algorithm (Gaudard et al., 2006), and association mining generate decision trees and rules in multiple stages. Clustering and recursive partitioning gave better accuracy in Datta and Mengel (2014a). The trees are generated using default values of the JMP tool (Gaudard et al., 2006). The Fall 2010 and 2011 merged dataset allows exploration of building a unified model rather than two possibly different models for the separate Fall 2010 and 2011 datasets (see Datta and Mengel (2014a) for results in considering the datasets separately). In the first stage, the data is clustered using K-means, and a classification tree is generated from each cluster using recursive partitioning. In the next stage, association mining generates rules using instances from each branch of the classification tree. These rules are then joined with the associated branch from the decision tree. One of the reasons for reducing the dataset using classification is to reduce the very large number of rules extracted by association mining.

Table 18. Sample rules, their respective probability, and the class type
Method | Rule No | Rule with the class | Actual class from the whole dataset
C4.5 cluster | R1 | Degree=BA & percentile_range=Top10 & GRANT=Y : D | R
CART cluster | R2 | All rules predicted the same class from the whole dataset. |
J48graft cluster | R3 | Degree=BA & percentile_range=Top10 & GRANT=N & LOAN=Y : D | R
Recursive partition cluster | R4 | GPA=(High, Medium, Low) & GRANT=N & Greek_Life=N & Major_CHANG_FALL_Spring=N & Parent_educ=(4P, 7P) : D | R
Ensemble learning clustering | R5 | GPA=(Fail, Missing) & Degree=(BM, BBA, BS, BLA) & percentile_range=(2ndquarter) & PELL=N : R | D
 | | GPA=(Fail, Missing) & Degree=(BM, BBA, BS, BLA) & percentile_range=(2ndquarter) & PELL=Y : R | D

Motivation for multi-stage mining methods
This section looks at the use of association mining without a first stage of clustering and classification. Association mining algorithms process the dataset via the WEKA software tool (Hall et al., 2009). The merged dataset is used with the default parameters for the Apriori association mining algorithm, except that the number of rules to be extracted is increased to 100, the minimum accuracy/confidence is lowered to 0.8, and the delta value (the factor by which support is iteratively decreased) is increased to 0.1. Association mining is known to generate a large number of rules; therefore, the number is limited to 100. An unexpectedly low number of rules (four) resulted for only one class (R), with the largest rule having three attributes. The merged dataset yielded 100 rules, all for class=R. Of these rules, only five had six attributes; the other rules had fewer than six attributes. The predictive Apriori algorithm also modeled the data with default parameters, except that "class enable" was selected so that "class" would not be employed as an attribute. The algorithm required at least three hours to return the rules. Out of the 100 rules generated, only three rules were for class=D. Table 19 gives a summary of the results of applying three varieties of Apriori association mining algorithms. Because class=D has a lower number of instances than class=R, Apriori favors rules with class=R.
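Why confidence-based mining favors the majority class can be seen in a minimal, stdlib-only sketch (all names and the toy data are hypothetical, not the WEKA implementation): with a 9:1 class split, only majority-class rules clear a 0.8 confidence threshold.

```python
from itertools import combinations

def class_association_rules(rows, class_key, min_conf=0.8, max_len=2):
    """Enumerate attribute-value conjunctions and keep those whose
    confidence for some class meets min_conf (Apriori-style, brute force)."""
    attrs = [k for k in rows[0] if k != class_key]
    rules = []
    for n in range(1, max_len + 1):
        for combo in combinations(attrs, n):
            # group rows by the value pattern of this attribute combination
            patterns = {}
            for r in rows:
                key = tuple(r[a] for a in combo)
                patterns.setdefault(key, []).append(r[class_key])
            for key, labels in patterns.items():
                for cls in set(labels):
                    conf = labels.count(cls) / len(labels)
                    if conf >= min_conf:
                        rules.append((dict(zip(combo, key)), cls, round(conf, 3)))
    return rules

# toy data: 9 retained (R), 1 dropped (D)
rows = [{"grant": "Y", "class": "R"}] * 6 + \
       [{"grant": "N", "class": "R"}] * 3 + \
       [{"grant": "N", "class": "D"}] * 1
rules = class_association_rules(rows, "class", min_conf=0.8, max_len=1)
# only the majority class clears the threshold: grant=Y => R (conf 1.0);
# the lone D instance yields grant=N => D with confidence 0.25, rejected
```

A minority-class (D) rule must be far purer than its base rate to reach the same threshold, which is why lowering the minimum confidence, as done above, encourages class=D rules.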
In addition, even though several rules are generated, less than 14% of the entire dataset is covered. Each rule covers so few instances that accuracy for each rule is high; however, an extraordinary number of rules would be needed to cover the rest of the dataset, which would be impractical for people without automation. In addition, so many rules would seem to "overfit" the dataset and would be more likely to be dataset specific.

Table 19. Association mining statistics
Technique (changed parameters) | # of rules with above 70% accuracy and 3 or more attributes | Coverage | Accuracy | Class=R: # of rules / Coverage / Accuracy | Class=D: # of rules / Coverage / Accuracy
Apriori (rules=100, Car=True) | 88 | 0.1 | 0.8785 | 100 / 0.1 / 0.8785 | 0 / N/A / N/A
Filtered Apriori (rules=100, Car=True) | 92 | 0.2 | 0.8804 | 100 / 0.2 / 0.8804 | 0 / N/A / N/A
Predictive Apriori (rules=100, Car=True) | 99 | 0.7465 | 0.99475 | 97 / 0.7261 / 0.9948 | 3 / 0.16775 / 0.99469

Multi-stage controlled decision tree method
The merged dataset used in this methodology is first divided into three clusters using K-means clustering, as shown in Table 20 (three was found to be an appropriate number of clusters in Datta and Mengel (2014a)). Each cluster is used to generate a controlled decision tree using the recursive partitioning algorithm and is evaluated using 10-fold cross validation. The decision tree is controlled in that the attributes used to split the tree are locked from further use. As shown in the following example rule, the attributes are locked because the decision tree algorithm otherwise uses the same attribute repeatedly, resulting in a rule with conflicting attribute values:

Ex. Avg_Attempt>=9 & LOAN(Y) & GPA>=1.672 & GRANT(Y) & GPA>=2.616 & Summer(N) & CLASS_PERCENTILE>=55 & Avg_Attempt<18 & ETHNIC(HP, AS, WH, HI) & CLASS_PERCENTILE>=74 & Avg_Attempt_Hour>=12.5 & AGE>=21 & Transfer_Hrs<16

Other researchers, such as Yu et al. (2010), have also adopted attribute locking.
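Attribute locking can be sketched as a greedy split chooser that records each attribute it uses; this is an illustrative toy, not the JMP recursive partitioning implementation (the function names and the accuracy-based split score are assumptions):

```python
from collections import Counter

def best_split(rows, class_key, locked):
    """Pick the attribute (not in `locked`) whose split yields the highest
    weighted majority-class accuracy; lock it so deeper levels cannot
    reuse it with a conflicting condition."""
    attrs = [k for k in rows[0] if k != class_key and k not in locked]

    def split_accuracy(attr):
        groups = {}
        for r in rows:
            groups.setdefault(r[attr], []).append(r[class_key])
        # instances matching the majority class within each branch
        correct = sum(Counter(g).most_common(1)[0][1] for g in groups.values())
        return correct / len(rows)

    best = max(attrs, key=split_accuracy)
    locked.add(best)
    return best

rows = [
    {"grant": "Y", "loan": "Y", "class": "R"},
    {"grant": "Y", "loan": "N", "class": "R"},
    {"grant": "N", "loan": "Y", "class": "D"},
    {"grant": "N", "loan": "N", "class": "D"},
]
locked = set()
first = best_split(rows, "class", locked)   # "grant" separates the classes
second = best_split(rows, "class", locked)  # "grant" is locked, so "loan"
```

Without the `locked` set, the same attribute could be selected again at a deeper level, producing conjunctions with contradictory conditions like the example rule above.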
Each split of the decision tree is monitored to check whether the accuracy changes. The tree continues to split until the accuracy remains the same for the next two levels. If the accuracy remains the same (Figure 8, split 3), the tree is pruned back, as shown in Figure 8, split 2. At this point, the tree is locked. The instances of each of the sub-trees are stored as different datasets. The attributes used in generating the tree are removed from the new datasets. The Apriori association mining algorithm then generates new rules for each of the datasets. The settings for the algorithm are those used by Hall et al. (2009), as shown in Appendix A, Table 27, where the minimum metric parameter changes from 0.9 to 0.85. Lowering the threshold by five percent trades some accuracy for the examination of more instances, encouraging the extraction of class=D rules. The extracted rules are joined back to their respective decision tree branch attributes. The confusion matrix, coverage, and precision are calculated for each rule. The steps used in this algorithm are summarized below.
1. The dataset is divided into three clusters.
2. Each cluster is used to generate a controlled decision tree using recursive partitioning.
3. Apriori association mining analyzes each of the sub-datasets/nodes from each branch to extract rules.
4. The rules extracted from the decision tree and association mining techniques combine to create the final rules.
5. The newly generated rules are validated against the whole dataset.

Figure 8. Steps of split and prune of the decision tree

Results
Results from multi-stage controlled decision tree
Table 20 shows the characteristics of the merged dataset after clustering for clusters one through three. Figure 9 shows the tree derived from the first cluster.
Three splits occur on the grant, class percentile, and loan attributes, resulting in the four rules given below.
1. GRANT=Y and percentile_range=Top10
2. GRANT=Y and percentile_range=(Next15 or 2nd quarter or 3rd quarter or 4th quarter)
3. GRANT=N and LOAN=Y
4. GRANT=N and LOAN=N

As a result, cluster 1, which has 734 instances, is divided into four datasets of sizes 347, 238, 56, and 93. Because the first dataset is characterized by the attribute values GRANT=Y and percentile_range=TOP10, these attributes are removed from consideration when generating the association mining rules. After association mining, the attributes are added back into the rules, as in the example below.

Grant=Y and Percentile_range=TOP10 and Major_change_Fall_Spring=N and Greek Life=N ==> Class=R

Table 20. Characteristics of the dataset M-2 (for each of clusters 1-3: class distribution, gender, test score, class percentile, GPA, attempted hours, ethnicity, Greek life, Pell, loan, grant, and parent education distributions)

Figure 9. Controlled Decision Tree

Some rules across datasets can be a subset of another rule, such as Rule-2 and Rule-3, and Rule-7 and Rule-8, in Table 21, which shows sample attrition rules. The accuracy of the rules is high, but the coverage of the dataset is low, as is characteristic of association mining rules. Table 21.
Sample rules with accuracy and coverage
Rule # | Rule | Accuracy | Coverage
Rule-1 | Grant=Y & percentile=Top10 & GPA=High (R) | 0.9535 | 0.07
Rule-2 | Grant=N & Loan=N & GENDER=M & GPA=Fail & Major_CHANG_FALL_Spring=N & Greek Life=N & FALL_ATMP_RANGE=F (D) | 0.863354 | 0.017424
Rule-3 | Grant=N & Loan=N & GENDER=M & GPA=Fail & Greek Life=N & FALL_ATMP_RANGE=F (D) | 0.856383 | 0.006493
Rule-4 | Grant=N & Loan=N & GENDER=M & GPA=Fail & Greek Life=N (D) | 0.851695 | 0.025541
Rule-5 | Grant=N & Loan=N & GENDER=M & GPA=Fail & Major_CHANG_FALL_Spring=N & Greek Life=N (D) | 0.85 | 0.021645
Rule-6 | Grant=N & Loan=N & ETHNIC=WH & GPA=Fail & Greek Life=N (D) | 0.849398 | 0.001731
Rule-7 | Grant=N & Loan=N & GENDER=M & GPA=Fail & Major_CHANG_FALL_Spring=N & Degree=BS (D) | 0.844961 | 0.013961
Rule-8 | Grant=N & Loan=N & GENDER=M & GPA=Fail & Degree=BS (D) | 0.84 | 0.162337

Table 22 has sample rules and their precisions on the datasets. The fairly consistent precision is expected across the merged and unmerged datasets. The precision rate for Fall 2012, however, differs from the others on several rules. Upon further examination, the Fall 2012 dataset was found to be affected by the higher percentage of students receiving financial aid (loans received increased from 51% to 70% over previous years). Table 22.
Testing different datasets with rules for attrition
Rule No | Rule | Merge 10-11 Accuracy (coverage) | Fall 10 | Fall 11 | Fall 12 Accuracy (coverage)
Rule-1 | Grant=N & Loan=N (D) | 0.685 (0.25) | 0.749 | 0.636 | 0.171 (0.24)
Rule-3 | Grant=N & Loan=N & GENDER=M & GPA=Fail & Greek Life=N & FALL_ATMP_RANGE=F (D) | 0.856 (0.02) | 0.899 | 0.789 | 0.706 (0.001)
Rule-4 | Grant=N & Loan=N & GENDER=M & GPA=Fail & Greek Life=N (D) | 0.852 (0.03) | 0.888 | 0.817 | 0.630 (0.02)
Rule-6 | Grant=N & Loan=N & ETHNIC=WH & GPA=Fail & Greek Life=N (D) | 0.849 (0.02) | 0.883 | 0.737 | 0.696 (0.01)
Rule-12 | Grant=N & Loan=N & ETHNIC=WH & GPA=Low & Greek Life=N (D) | 0.776 (0.01) | 0.795 | 0.739 | 0.214 (0.01)
Rule-13 | Grant=N & Loan=N & Parent_educ=8P & SAT_RANGE=1100<=SAT<=1150 (D) | 0.701 (0.11) | 0.775 | 0.683 | 0.159 (0.02)
Students who received gift aid | | 62% | 64% | 61% | 58%
Students who received loans | | 50% | 52% | 48% | 71%
Students who received neither | | 25% | 23% | 28% | 21%

Table 23 compares the method used in this study with the method of Datta and Mengel (2014a) and other standard data mining algorithms. The table gives the number of rules generated, the coverage of each rule, and the average accuracy of each rule. Ensemble learning gave the best coverage; however, the multi-stage decision method gave higher accuracy with a decent amount of coverage.

Table 23. Comparisons of rule coverage using different methods
Method | # of rules | Min coverage % | Max coverage % | Total coverage % | Average accuracy %
C4.5 cluster | 0+11+0 | 0.020 | 6.240 | 0+37.28+0 | 0+77.00+0
CART cluster | 0+5+0 | 0.281 | 14.048 | 0+30.00+0 | 0+81.00+0
Ensemble learning | 8 | 8.658 | 31.797 | 77.53 | 85.969
Ensemble learning cluster | 12+6+11 | 0.087 | 44.456 | 92.00+33.04+50.09 | 63.13+72.98+80.32
Multi-stage cluster | 40+40+30 | 0.01 | 12.749 | 47.29+65.281+66.418 | 84.74+86.90+86.85

The multiple-stage data mining method combines the rules for each cluster so that redundant and duplicate rules are eliminated.
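One way such elimination of duplicate and redundant rules might look is sketched below. This is an illustrative assumption, not the study's exact procedure: here a rule is treated as redundant when its conditions strictly contain those of another rule predicting the same class.

```python
def prune_rules(rules):
    """Remove exact duplicates, then drop any rule whose condition set is a
    strict superset of another same-class rule's conditions (the more
    general rule already covers it). One possible notion of redundancy."""
    unique, seen = [], set()
    for conds, cls in rules:
        key = (frozenset(conds.items()), cls)
        if key not in seen:
            seen.add(key)
            unique.append((dict(conds), cls))
    kept = []
    for conds, cls in unique:
        subsumed = any(
            ocls == cls and set(oconds.items()) < set(conds.items())
            for oconds, ocls in unique
        )
        if not subsumed:
            kept.append((conds, cls))
    return kept

rules = [
    ({"Grant": "N", "Loan": "N"}, "D"),
    ({"Grant": "N", "Loan": "N"}, "D"),                 # exact duplicate
    ({"Grant": "N", "Loan": "N", "Gender": "M"}, "D"),  # subsumed by rule 1
    ({"GPA": "High", "Grant": "Y"}, "R"),
]
pruned = prune_rules(rules)
# two rules remain: the general D rule and the R rule
```

Dropping the more specific rule sacrifices nothing in coverage, since every instance it matches is already matched by the general rule, though their accuracies can differ, as Table 21 shows for subset pairs.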
Interestingly, the multiple-stage data mining method did not extract any rules that resulted in an anomaly when validated with the whole dataset. One reason may be that the rules covered less data and characterized the covered data more accurately. Table 24 shows the coverage of the total instances of each class.

Table 24. Rule coverage over instances in each class
Class | # of unique rules | Total coverage | Total accuracy
D | 16 | 69.95 | 80.157
R | 59 | 82.45 | 89.54

Conclusion
Each of the data attributes is crucial in determining which intervention process needs to be taken. For example, students with higher test scores and higher class percentiles tend to have different reasons for attrition than students with lower test scores and lower class percentiles. In addition, certain majors like pre-nursing and business, as well as undeclared majors, tend to have higher dropout rates. The verification results of the multi-stage controlled decision tree (shown in Table 22) indicate that it is challenging to develop a definite set of attrition rules. Policies and uncontrollable events, such as more incoming freshmen receiving scholarships, changes in economic conditions, or a natural disaster, may affect a student's financial aid. Hence, the distribution of the attributes may change each year. For example, in Fall 2012, more students received loans than in previous years, so the rules that governed the previous two years' dataset may not be applicable. In addition, the accessible student data from the institution may not include potentially important factors, such as behavioral motivations for attending college (for example, to be at school with a boyfriend/girlfriend or a parental requirement to attend). This study concludes that ethnicity, lack of parental education, lower financial aid, and non-participation in student organizations are causative factors for student attrition.
This methodology can use data from several years and generate a new model. This study inferred from the results that a multi-stage data mining method is a more accurate way to extract rules. None of the rules generated by the multi-stage method had conflicting classes when tested on the whole dataset, and attributes beyond those employed by the decision tree could be incorporated through association mining, as shown in Table 18. The method also had better precision rates compared to Pittman's (2008) and Yu et al.'s (2006) studies, as shown in Table 25. Since this study calculated precision on individual rules, the accuracy and precision have the same value.

Table 25. Accuracies obtained using incoming freshmen
Researchers (year) | Cohort size | Methods used | Measure of accuracy | Average accuracy
Pittman (2008) | 2768 | Logistic, multilayer perceptron, J48, Naïve Bayes | Overall accuracy | 72.8
Yu et al. (2006) | 6690 | Recursive partition, MARS, neural networks | Overall accuracy | 73.53
Herzog (2006) | 3565 | CHAID, C4.5 | Overall accuracy | 85.4
Datta and Mengel (2014a) | 9240 | Clustering, ensemble learning | Average accuracy of each rule | 85.96
Datta and Mengel (this paper) | 9240 | Clustering, recursive partition, association mining | Average accuracy of each rule | 87.5

Future work
To validate and standardize the methodology used in this study and to utilize dynamic rule creation using different data mining techniques, published datasets from UCI's website will be used in the future. This study will generate rules for other datasets that have a lower number of true-positive class instances in comparison to true negatives. Possible examples include a US population dataset, to isolate individuals infected with HIV, and a high school population dataset, to identify female students who would choose a STEM degree.
This study showed that ethnicity, parents' educational level, financial aid, and student organizations affect student attrition. Future studies will split the dataset to investigate diversity issues and specific characteristics related to the student's ethnicity. The study will divide the dataset by ethnicity; minority groups like Hispanics and African Americans will be grouped in different datasets. Another research question for future study will be whether women behave in a similar manner to men in retention/attrition. More data will be collected over future years (at least five), and the multi-stage decision method will be applied to get unified rules. All rules will be tested and used for isolating attrition rules. The dataset's characteristics will be used to match any of the trained cohorts, which could help in deciding which set of rules needs to be applied to a specific cohort. This study will be extended in the future to do reverse engineering and derive a model for undergraduate admissions to recruit freshmen who are likely to be retained, which could indirectly increase retention rates of the institution beyond the students' first year in college. To create the admissions model, these retention rules will be used. Lastly, one issue found in Datta and Mengel (2014a) is still an issue in the multi-stage method. Table 26 shows that the rules extracted are similar to each other except for one attribute condition. By omitting or combining attributes, several rules can be combined into one rule. For rules 1 and 2, the last attribute condition has two values (Female or Male), allowing the attribute to be deleted from the rule conditions. Rules 3 and 4 combine by an OR, as shown in the combined rule (3, 4). Rules 5 and 6 generate a shorter rule eliminating two redundant attributes.
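The attribute-dropping and OR-merging just described can be sketched as follows (a hypothetical representation in which each rule maps attributes to sets of allowed values, plus a class; the function name and domains are assumptions):

```python
def combine_rules(r1, r2, domains):
    """Merge two same-class rules that differ on exactly one attribute.
    If their values jointly cover that attribute's full domain, the
    attribute is dropped; otherwise the values are OR-ed together."""
    (c1, cls1), (c2, cls2) = r1, r2
    if cls1 != cls2 or set(c1) != set(c2):
        return None  # only same-class rules over the same attributes merge
    diff = [a for a in c1 if c1[a] != c2[a]]
    if len(diff) != 1:
        return None  # must differ on exactly one attribute
    attr = diff[0]
    merged = {a: v for a, v in c1.items() if a != attr}
    values = set(c1[attr]) | set(c2[attr])
    if values != set(domains[attr]):
        merged[attr] = values  # partial coverage: keep an OR of the values
    return merged, cls1

domains = {"GENDER": {"M", "F"}, "GRANT": {"Y", "N"}, "LOAN": {"Y", "N"}}
rule_m = ({"GRANT": {"Y"}, "LOAN": {"N"}, "GENDER": {"M"}}, "R")
rule_f = ({"GRANT": {"Y"}, "LOAN": {"N"}, "GENDER": {"F"}}, "R")
combined = combine_rules(rule_m, rule_f, domains)
# GENDER's values cover {M, F}, so the attribute drops out entirely
```

This mirrors rules 1 and 2 in Table 26: once both GENDER values are covered, the condition is vacuous and can be deleted; when the merged values fall short of the full domain, the OR form of the combined rule (3, 4) is produced instead.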
In the future, the study will be extended to combine rules, as shown in the following examples, so that they will be clearer and more condensed.

Table 26. Examples of combining rules
1 | GPA=(High, Medium, Low) & GRANT=Y & LOAN=N & ETHNIC(NR, U, HI, B, WH, MX) & GENDER(M) = R (664/58 = 91.265%)
2 | GPA=(High, Medium, Low) & GRANT=Y & LOAN=N & ETHNIC(NR, U, HI, B, WH, MX) & GENDER(F) = R (838/65 = 92.243%)
Combined rule (1, 2) | GPA=(High, Medium, Low) & GRANT=Y & LOAN=N & ETHNIC(NR, U, HI, B, WH, MX) & (GENDER(M) or GENDER(F)) = R (1502/123 = 91.810%) OR GPA=(High, Medium, Low) & GRANT=Y & LOAN=N & ETHNIC(NR, U, HI, B, WH, MX) = R (1502/123 = 91.810%)
3 | GPA=(Fail, Missing) & Major_CHANG_FALL_Spring(Y) & Parent_Char(4P, 1P, P, 6P) & SAT_RANGE1(950<=SAT<=1000, 1050<=SAT<=1100, 1150<=SAT<=1200, 1200<=SAT<=1250, 1250<=SAT<=1300, 1300<=SAT<=1350, 650<=SAT<=700) & FALL_ATMP_RANGE(FM) = D (8/3 = 62.5%)
4 | GPA=(Fail, Missing) & Major_CHANG_FALL_Spring(Y) & Parent_Char(4P, 1P, P, 6P) & SAT_RANGE1(950<=SAT<=1000, 1050<=SAT<=1100, 1150<=SAT<=1200, 1200<=SAT<=1250, 1250<=SAT<=1300, 1300<=SAT<=1350, 650<=SAT<=700) & FALL_ATMP_RANGE(F, H, T) = D (44/8 = 81.818%)
Combined rule (3, 4) | GPA=(Fail, Missing) & Major_CHANG_FALL_Spring(Y) & Parent_Char(4P, 1P, P, 6P) & SAT_RANGE1(950<=SAT<=1000, 1050<=SAT<=1100, 1150<=SAT<=1200, 1200<=SAT<=1250, 1250<=SAT<=1300, 1300<=SAT<=1350, 650<=SAT<=700) & FALL_ATMP_RANGE(F, H, T, FM) = D (52/11 = 78.846%)
5 | Grant="Y" and GPA=("high" or "medium" or "low") and Greek-life="Y" and GENDER="F" = R (435/39 = 91.03%)
6 | Grant="N" and GPA=("high" or "medium" or "low") and Greek-life="N" and GENDER="F" = R (903/405 = 55.149%)
Combined rule (5, 6) | GPA=("high" or "medium" or "low") and GENDER="F" and (Grant="Y" or Grant="N") and (Greek_life="Y" or Greek_life="N") = R OR GPA=("high" or "medium" or "low") and GENDER="F" = R (3997/939 = 76.507%)

Appendix A
Table 27.
Apriori settings
Option | Default value | Experiment value | What it means
car | False | True | Generates rules with the class attribute instead of general association rules
classIndex | -1 | -1 | Placement of the class attribute in the dataset
delta | 0.05 | 0.1 | Iteratively decreases support by this factor until minimum support is reached or the required number of rules has been generated
lowerBoundMinSupport | 0.1 | 0.1 | The lowest value of minimum support
metricType | confidence | confidence | The metric can be set to any of the following: Confidence (class association rules can only be mined using confidence), Lift, Leverage, Conviction
minMetric | 0.9 | 0.85 | Considers rules with accuracy of 0.85 or higher
numRules | 10 | 100 | Number of rules to generate
outputItemSets | False | True | Item sets are shown in the output
removeAllMissingCols | False | False | Removes columns with missing values
significanceLevel | -1.0 | -1.0 | Significance testing
treatZeroAsMissing | False | False | Zero is treated the same way as a missing value
upperBoundMinSupport | 1.0 | 1.0 | The highest value of minimum support; the process starts here and iteratively decreases until the lower bound
verbose | False | False | The algorithm runs in verbose mode if enabled

Table 28. Effect of changing default values
Delta value 0.05 | Minimum support: 0.1 (9 instances); Minimum metric <confidence>: 0.9; Number of cycles performed: 18
1. ETHNIC=WH SAT_RANGE=1200<=SAT<=1250 Major_CHANG_FALL_Spring=N 11 ==> Class=D 10 conf:(0.91)
2. ETHNIC=WH PELL_Y=N SAT_RANGE=1200<=SAT<=1250 Major_CHANG_FALL_Spring=N 11 ==> Class=D 10 conf:(0.91)
Delta value 0.1 | Minimum support: 0.1 (9 instances); Minimum metric <confidence>: 0.85; Number of cycles performed: 18
1. ETHNIC=WH SAT_RANGE=1200<=SAT<=1250 Major_CHANG_FALL_Spring=N 11 ==> Class=D 10 conf:(0.91)
2. ETHNIC=WH PELL=N SAT_RANGE=1200<=SAT<=1250 Major_CHANG_FALL_Spring=N 11 ==> Class=D 10 conf:(0.91)
3. ETHNIC=WH SAT_RANGE=1200<=SAT<=1250 14 ==> Class=D 12 conf:(0.86)
4.
ETHNIC=WH PELL=N SAT_RANGE=1200<=SAT<=1250 14 ==> Class=D 12 conf:(0.86)

Table 29. Sample rules at confidence=0.70 (for each cluster, confidence was lowered to 0.80, 0.75, and 0.70; for one cluster, minimum support also had to be lowered to 0.05 to generate rules)
Sample rules with one attribute, due to the lowering of confidence:
ETHNIC=WH 41 ==> Class=R 36 conf:(0.88)
Major_CHANG_FALL_Spring=N 47 ==> Class=R 41 conf:(0.87)
GENDER=M 520 ==> Class=D 368 conf:(0.71)
GENDER=M 716 ==> Class=D 532 conf:(0.74)
ETHNIC=WH 687 ==> Class=D 504 conf:(0.73)
Parent_educ=8P 642 ==> Class=D 458 conf:(0.71)

CHAPTER IV
DYNAMIC MULTI-STAGE DECISION RULES
Abstract
After generating "usable" rules with better accuracy on retention data with a multi-stage decision method for the minor class, the motivation was to extend that previous study and validate the results of Datta and Mengel's (2014b) ensemble learning method with other known datasets. Studies that need rules to predict for the minor group would find this method helpful. Typically, rules generated using either a decision tree or class association mining favor the major class, as in the retention dataset (Datta and Mengel, 2014a). Datta and Mengel (2014b) showed that rules can be generated for the minor classes using both decision trees and association mining. This study uses a dynamic method to generate rules: the characteristics of the dataset determine the path taken toward generation of the rules. A dynamic multi-stage decision tree is generated depending on the attribute dimensions and the size of the dataset. Each rule is reported with its coverage and accuracy. This technique generates rules after mining data from the minority class. The rules generated were grouped into ranges to facilitate rule choice as needed.
Key Words: association mining, decision tree, gain ratio, large datasets, multi-stage rule generation, recursive partition, rule sets

Introduction and related work
Data mining algorithms may achieve different accuracy and rules depending on the dataset. Not every data mining algorithm is appropriate for every dataset; however, irrespective of their characteristics, datasets are currently used to generate rules with different techniques. Unlike previous studies, such as Yu et al. (2010), Pittman (2008), Huo et al. (2006), and Lu et al. (2009), but similar to Senator (2005), this study determines the path for rule creation depending on the characteristics of each dataset, using the data mining algorithms. This study originated from Datta and Mengel's (2014a and 2014b) studies to find usable rules with better accuracy for the minor group. The first study applied different decision tree algorithms, such as C4.5, CART, recursive partition, and J48graft, to retention data to generate rules. It generated rules following Yu et al. (2010), but the rules tended to be very simple for such a complex issue. Therefore, Datta and Mengel's (2014a) study used clustering and unique attributes to generate the rules. Some of the enhancements that could be made based on that study were:
- Rules generated from a cluster had an anomalous class when validated over the whole dataset.
- All attributes had to be used for generating the rules, which is not a feasible solution for a high-dimension dataset.
Thus, the next study by Datta and Mengel (2014b) used a different approach. The second study used a multi-stage decision tree (MSDT) to generate viable rules for users and administrators. The MSDT method used student data from an institution to generate the rules and validated those rules using data from a different year. The purpose of the MSDT method was to discover rules that identify student attrition.
This current paper introduces dynamic multi-stage decision rules (DMDR), which use various data mining algorithms. In this study, the MSDT method is applied to other well-known datasets from UCI. When the MSDT method did not show improvements for a dataset, the method or steps were changed in a logical way. The new ensemble method was named the dynamic multi-stage decision rule (DMDR) method. The rules generated using the DMDR method depend on the characteristics of the dataset. Table 30 shows six datasets with three different accuracies and the total number of rules generated using the C4.5 algorithm; these accuracies and rules will be used later to compare with the ensemble method developed in this study. C4.5 was used to allow comparison with other researchers (Huo et al. (2006) and Lu et al. (2009)). Baseline accuracy is the simplest accuracy calculated for the given dataset; here, the baseline accuracy is calculated using the ZeroR classifier (Hall et al., 2009). The accuracy is predicted by calculating the mean of numeric classes or the mode of nominal classes.
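For nominal classes, this baseline reduces to always predicting the mode; a minimal ZeroR-style sketch (the function name is an assumption, and the 693/307 split below is illustrative, approximating the retention dataset's baseline of 69.2857%):

```python
from collections import Counter

def zero_r_baseline(labels):
    """ZeroR-style baseline for nominal classes: always predict the most
    frequent class; the accuracy is that class's relative frequency."""
    mode, count = Counter(labels).most_common(1)[0]
    return mode, count / len(labels)

# a class distribution resembling the retention data (~69% retained)
labels = ["R"] * 693 + ["D"] * 307
mode, acc = zero_r_baseline(labels)
# predicts "R" for everyone; baseline accuracy 0.693
```

Any learned model is only interesting to the extent that it beats this number, which is why Table 30 reports the baseline alongside the C4.5 accuracy.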
Table 30: Datasets with reported accuracy

Data set      | Baseline accuracy % | Reported accuracy % | Accuracy % (C4.5) | # rules (C4.5) | Example of less useful rules
Adult         | 75.919  | 78.58-85.95 | 86.21  | 731  | capital-gain <= 6849 and marital-status = Never-married and education-num > 14 and age > 32 and capital-loss <= 653 and occupation = Adm-clerical: >50K (0.0)
Balance       | 46.08   | -           | 63.2   | 33   | Left-Weight = 1LW: R (125.0/27.0)
Breast Cancer | 70.2797 | 65.0-78.0   | 71.134 | 10   | deg-malig <= 2: no-recurrence-events (201.0/40.0)
Car           | 70.0231 | -           | 92.36  | 131  | safety = low: unacc (576.0)
Connect       | 65.8303 | -           | 80.90  | 4297 | d3 = x & d2 = o & d1 = x & d4 = o & c2 = b & g2 = b & e2 = b & a3 = b & d5 = o & a2 = b & e1 = b & a1 = b & c1 = x & b1 = x: win (0.0)
Lymphography  | 54.7297 | 76.0-85.0   | 76.35  | 19   | change_node = 4: 3C (25.0/2.0)
Retention     | 69.2857 | 85.96-87.5  | 79.35  | 11   | GRANT= Y AND GPA= Low: R (843.0/126.0)

Two research questions were studied: What characteristics of a dataset work better with this process? How does this methodology behave with large, high-dimension datasets or with small, low-dimension datasets?

Background Work

Huo et al. (2006) and Lu et al. (2009) developed algorithms to lower the error rate and obtain fewer rules compared to traditional C4.5. Both of their datasets dealt with multiple classes. The algorithm of Huo et al. re-labeled the classes recursively to divide the dataset into two classes using the maximum-margin learning method from the Support Vector Machine (SVM) (Wang et al. 2005) and used ID3 to generate the tree. Lu et al. (2009) reduced the complexity of re-labeling and re-grouping by dividing the dataset into two classes using the inverse problem of SVM and then applied the C4.5 algorithm to the newly labeled dataset. The objective of both studies was to generate shorter and simpler rules. Senator (2005) used a two-stage classification process to isolate instances that were at high risk.
His study addressed the problem of the minor class in the dataset in two stages: first, classification of the whole dataset and, second, selection of those instances that were classified as positive. Following the study of Senator (2005), which used the multi-stage decision for two specific domains (HIV, fraud detection) with few high-risk instances in their datasets, the study by Datta and Mengel (2014b) applied a multi-stage decision to the retention dataset to discover a few useful rules. In that study, the retention dataset was first preprocessed to normalize the attributes by setting them in ranges, integrating them, or converting them to binary values. The dataset was clustered, and each of the three clusters (the choice of three clusters is detailed in the Clustering section) was treated as an independent dataset. The process is detailed in a later section.

Data for study

The data used in this study were from the UCI data repository, as shown in Table 31. To generalize the MSDT model, Datta and Mengel (2014b) selected datasets whose sizes and dimensions differed from the retention dataset. The numeric attributes were converted to alphanumeric for consistency and to enable the use of association mining.

Table 31: Datasets and their characteristics (UCI)

Data set      | Size  | # attributes | # classes | Baseline accuracy % | Class distribution
Adult (train) | 32561 | 15 | 2 | 76.50   | >50K 23.93%, <=50K 76.07%
Balance       | 625   | 5  | 3 | 40.566  | L 46.08%, B 7.84%, R 46.08%
Breast Cancer | 286   | 10 | 2 | 76.2887 | No-recurrence 70.27%, Recurrence 29.73%
Car           | 1728  | 6  | 4 | 69.56   | unacc 70.023%, acc 22.222%, good 3.993%, v-good 3.762%
Connect       | 67557 | 43 | 3 | 65.65   | Win 65.83%, Loss 24.62%, Draw 9.55%
Lymphography  | 148   | 19 | 4 | 44.00   | normal find 1.351%, metastases 54.729%, malign lymph 41.21%, fibrosis 4 instances
Retention     | 9240  | 15 | 2 | 69.00   | D 30.71%, R 69.28%

Dynamic multi-stage decision rule

To create DMDR, the datasets were first divided into three clusters using K-means clustering (details in the Clustering section). Each cluster in the dataset then followed the same process: a controlled/monitored decision tree was generated using recursive partition; the tree was pruned depending on the characteristics of the datasets listed in Table 31; and each tree generated rules using association mining. Rules for the complete dataset were developed as a combination (logical AND condition) of the attributes used during decision tree creation and those generated from association mining. For example, in the Car dataset, the first split of the decision tree was "Person<4," and one of the rules generated using association mining was "safety=low 274 ==> class=unacc." The combined rule was "Person<4 AND safety=low ==> class=unacc." The coverage and accuracy of this rule were evaluated against the original dataset.

Clustering

To decide the optimum cluster size, two different clustering techniques were used. The data were first clustered using EM clustering. Each cluster generated a tree, and the following measures were averaged to facilitate the decision on cluster size: accuracy, TP rate, FP rate, precision, F-measure, and ROC. Second, K-means clustering was applied to generate three clusters. As shown in Table 32, not all datasets performed well with EM clustering.
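The combine-and-evaluate step described under Dynamic multi-stage decision rule (a tree split ANDed with a mined rule, then checked against the original dataset) can be sketched as follows. The records are toy stand-ins for the Car dataset, not its real instances.

```python
def evaluate_rule(dataset, antecedent, consequent):
    """Coverage = fraction of instances matching the antecedent;
    accuracy = fraction of matching instances carrying the rule's class."""
    matched = [row for row in dataset
               if all(row[a] == v for a, v in antecedent.items())]
    coverage = len(matched) / len(dataset)
    accuracy = (sum(row["class"] == consequent for row in matched) / len(matched)
                if matched else 0.0)
    return coverage, accuracy

# Toy stand-ins for Car instances (attribute names taken from the text).
data = [
    {"persons": "2", "safety": "low",  "class": "unacc"},
    {"persons": "2", "safety": "low",  "class": "unacc"},
    {"persons": "4", "safety": "low",  "class": "unacc"},
    {"persons": "4", "safety": "high", "class": "acc"},
]

# Tree split (persons below 4) ANDed with the mined rule "safety=low ==> unacc".
coverage, accuracy = evaluate_rule(data, {"persons": "2", "safety": "low"}, "unacc")
print(coverage, accuracy)  # 0.5 1.0
```

Evaluating the combined antecedent against the whole dataset, as here, is what catches rules whose accuracy holds only inside their own cluster.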
The Adult dataset generated three clusters using EM; hence, a cluster size of three is justified when using K-means clustering. EM clustering generated two clusters for the Balance dataset, of which one cluster had only one class, which does not help in generating rules via either a decision tree algorithm or association mining; hence, three K-means clusters were used for this study. EM clustering generated four default clusters for the Breast Cancer data, but the average accuracy using three clusters was higher than the average accuracy using four; hence, three clusters were used. EM clustering generated seven default clusters for the Car dataset, but only one cluster had more than one class; hence, the study used the three clusters generated by K-means. EM clustering had difficulty with the Connect dataset: the process ran for 12 days before it was aborted by the user. The dataset was then reduced by removing low-ranked attributes by voting across three attribute selection methods (gain ratio, significance attribute evaluation, and Naïve Bayes), but EM clustering still did not return any clusters after six days. Next, the default parameters were changed as shown in Table 33, and in both cases III and IV the same number of clusters was generated. These clusters were tested using C4.5 for the metrics shown in Table 32; ROC was better using K-means. Hence, the three clusters generated using K-means were used in this study. EM clustering generated six clusters for the Lymphography dataset; of these, only two could be used for testing, because one cluster had five instances, another had three, another had zero, and the last had only one class. Hence, the experiment used the three clusters generated by K-means clustering.
For the Retention dataset, a dataset unique to the series of studies by Datta and Mengel (2014a), EM generated 13 clusters; however, K-means clustering with three clusters had better results for ROC and precision. Hence, K-means clustering with a cluster size of three was used to standardize the process.

Table 32. Cluster comparisons with EM and K-means (averages over the clusters' C4.5 accuracy and TP rate; FP rate, precision, F-measure, and ROC were also averaged and compared)

Data set      | EM clusters (default) | Usable clusters | EM avg. accuracy % | EM avg. TP rate | K-means (3) avg. accuracy % | K-means (3) avg. TP rate
Adult         | 3  | 3  | 86.58  | 0.866    | 89.65   | 0.8966
Balance       | 2  | 1  | 85.46  | 0.855    | 85.145  | 0.862
Breast Cancer | 4  | 4  | 75.08  | 0.751    | 78.89   | 0.789
Car           | 7  | 1  | 98.508 | 0.985    | 90.271  | 0.9026
Connect       | 39 | 39 | 80.926 | 0.8092   | 79.984  | 0.8
Lymphography  | 6  | 2  | 80.895 | 0.809    | 60.1096 | 0.6013
Retention     | 13 | 13 | 89.085 | 0.890846 | 84.564  | 0.845667

Table 33. EM clustering for the Connect dataset

Parameter     | Default | Case I  | Case II | Case III | Case IV
maxIterations | 100     | default | default | 10       | 5
minStdDev     | 1.0E-6  | default | default | default  | 0.001
seed          | 100     | default | default | 10       | 5

Comments: Case I used the default parameters and did not converge after 12 days, hence it was aborted. In Case II, lower-ranked attributes were removed; the process was aborted after six days. Cases III and IV each generated 39 clusters within 24 hours.

Stage I: Decision Tree

The tree was allowed to split once if the dataset had fewer than eight attributes. In this case, after the first split, each node became a sub-dataset for the next stage of the rule generation process.
The attributes used for the first split were removed from the respective subsets, so each subset was reduced in both size and dimension. Datasets with more than eight attributes were split as follows: each split of the decision tree was monitored to check whether the accuracy had changed, and each attribute was locked after it split. The tree continued to split until the accuracy remained the same for the next two levels, in which case the tree was pruned back to the level where the accuracy stopped changing (Datta and Mengel 2014b). An example of the tree captured in Stage I using the Breast Cancer dataset is shown in Figures 10, 11, and 12. At each split, the distribution of the classes was checked for impurity: if the distribution of the classes was at a ratio of 66% to 33%, the tree was locked from further splitting; if the distribution was more impure than this, the tree was split further.

Figure 10. Breast Cancer 1st level when accuracy stops changing
Figure 11. Breast Cancer dataset 2nd level with repeating accuracy
Figure 12. Breast Cancer 3rd level with repeating accuracy

The instances of each of the locked branches were then stored as sub-datasets. The attributes that were used in generating the tree were removed from the subsets created in the previous step. Large, high-dimensional datasets with more than twenty attributes followed the same path as datasets with more than eight attributes, but at their second stage each of the sub-datasets was run through attribute selection methods. Attributes with low or no significance were removed from the subset before the next step. These subsets were transferred to Stage II.
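One way to read the accuracy-monitored splitting above is as a greedy loop that locks each split attribute and prunes back once accuracy has stalled for two levels. The sketch below is a simplification on toy rows, not the actual recursive-partition implementation used in the study.

```python
from collections import Counter

def leaf_accuracy(rows, attrs):
    """Accuracy of predicting the majority class inside each leaf
    defined by the values of the chosen split attributes."""
    leaves = {}
    for row in rows:
        leaves.setdefault(tuple(row[a] for a in attrs), []).append(row["class"])
    correct = sum(Counter(labels).most_common(1)[0][1] for labels in leaves.values())
    return correct / len(rows)

def monitored_split(rows, candidates):
    """Greedily add split attributes, locking each after use; stop once
    accuracy has stayed the same for two consecutive levels, then prune
    back to the level where accuracy stopped changing."""
    chosen, remaining = [], list(candidates)
    accuracy = leaf_accuracy(rows, chosen)
    unchanged = 0
    while remaining and unchanged < 2:
        best = max(remaining, key=lambda a: leaf_accuracy(rows, chosen + [a]))
        new_accuracy = leaf_accuracy(rows, chosen + [best])
        chosen.append(best)          # attribute locked after it splits
        remaining.remove(best)
        unchanged = unchanged + 1 if new_accuracy == accuracy else 0
        accuracy = new_accuracy
    return chosen[:len(chosen) - unchanged], accuracy

# Toy rows: "a" determines the class; "b" and "c" add nothing.
rows = [
    {"a": "x", "b": "p", "c": "z", "class": "yes"},
    {"a": "x", "b": "r", "c": "z", "class": "yes"},
    {"a": "y", "b": "p", "c": "z", "class": "no"},
    {"a": "y", "b": "r", "c": "z", "class": "no"},
]
print(monitored_split(rows, ["a", "b", "c"]))  # (['a'], 1.0)
```

The pruning at the end mirrors the text: splits made after accuracy stopped changing are discarded, and the surviving leaves become the sub-datasets for Stage II.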
Stage II: Association Mining

Association mining was applied to the subsets to generate rules. The Apriori association mining technique was used with default settings for these experiments, with two exceptions. First, the parameter car was changed to TRUE, so that the algorithm uses the class attribute as the consequent when deciding on the rules. Second, the minMetric parameter was changed from 0.9 to 0.85; lowering the minimum confidence by five percentage points allows more rules to be generated at a small cost in accuracy. The rules were generated using class association. Table 38 of Appendix B details the default and experiment parameters. All of the rules created using association mining were combined with the attributes and values taken from the decision tree, and the resulting rules were evaluated for accuracy and coverage using the original dataset. See Figure 13 for an illustration of this process.

Figure 13. Steps for dynamic decision rules

Results

The process followed for each of the datasets was noted in the study of Datta and Mengel (2014b). Of the six datasets that were tested, two, the Balance and Connect datasets, did not generate rules as expected. In comparison to the Retention dataset, Balance (five attributes) and Connect (43 attributes) had either fewer or more attributes; hence, a different method was chosen in these cases. For the Balance dataset, only one attribute remained before progressing to Stage II, association mining. Thus, the decision tree was allowed to split once, and the dataset was then forced to Stage II.
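A brute-force sketch of the Stage II class-association step (consequent fixed to the class attribute, minimum confidence at the lowered 0.85) might look like the following. This is not Weka's Apriori implementation, and the rows are toy data mimicking the Car dataset.

```python
from itertools import combinations

def class_association_rules(rows, class_attr="class",
                            min_support=0.1, min_conf=0.85, max_len=2):
    """Enumerate attribute=value antecedents (up to max_len attributes) and
    keep those implying a single class value with confidence >= min_conf."""
    n = len(rows)
    attrs = [a for a in rows[0] if a != class_attr]
    rules = []
    for k in range(1, max_len + 1):
        for combo in combinations(attrs, k):
            for values in {tuple(r[a] for a in combo) for r in rows}:
                matched = [r for r in rows
                           if all(r[a] == v for a, v in zip(combo, values))]
                if len(matched) / n < min_support:
                    continue
                classes = [r[class_attr] for r in matched]
                best = max(set(classes), key=classes.count)
                confidence = classes.count(best) / len(matched)
                if confidence >= min_conf:
                    rules.append((dict(zip(combo, values)), best, confidence))
    return rules

# Toy rows mimicking the Car dataset.
rows = ([{"safety": "low", "buying": "high", "class": "unacc"}] * 6
        + [{"safety": "high", "buying": "high", "class": "acc"}] * 3
        + [{"safety": "high", "buying": "low", "class": "unacc"}] * 1)
for antecedent, cls, conf in class_association_rules(rows):
    print(antecedent, "==>", cls, round(conf, 2))
```

With the confidence floor at 0.9, the antecedent safety=high (confidence 0.75 here) would still be rejected; the move to 0.85 admits borderline antecedents and therefore yields more rules, which is the trade-off the text describes.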
For the Connect dataset, the system ran for more than 24 hours without generating any association rules, so the dimension of the dataset was lowered by removing the least significant attributes after ranking and voting using symmetrical uncertainty, gain ratio, and chi-squared. The reasons for this deviation are analyzed in the section Discussion and conclusions. The results obtained from all of these datasets are shown in Table 34.

Table 34. Experimental results of accuracy using different datasets

Data set      | Threshold % | Maximum % | Minimum % | Avg. all rules % | Avg. selected rules % | # rules 90-100 | 80-89 | 70-79 | 60-69 | Std. dev.
Adult         | 86.21 | 97.36  | 84.63 | 83.75 | 80.19 | 31 | 13 | 16 | 9  | 0.0329
Balance       | 63.2  | 93.277 | 65.00 | 69.74 | 69.74 | 6  | 4  | 5  | 2  | 0.1075
Breast Cancer | 78.0  | 81.62  | 75.92 | 78.89 | 83.96 | 19 | 13 | 8  | 10 | 0.0829
Car           | 92.36 | 95.72  | 84.76 | 93.47 | 93.34 | 38 | 2  | 5  | 3  | 0
Connect       | 80.9  | 93.16  | 69.66 | 85.96 | 85.94 | 35 | 84 | 14 | -  | 0.0570
Lymphography  | 85.0  | 76     | 28.57 | 77.65 | 77.65 | 34 | 10 | 11 | -  | 0.0925
Retention     | 87.5  | 86.90  | 63.13 | 87.5  | 87.50 | 30 | 42 | 7  | 2  | 0.0478

The threshold value in Table 34 is the highest accuracy listed for each dataset in Table 30. The accuracies in the maximum and minimum columns are derived from comparison of the three clusters of each dataset using J48 from Weka (Hall et al. 2009). The "average all rules" column holds the average accuracy of all rules generated via this process, while the "average selected rules" column holds the average accuracy of rules that have three or more attributes in their conditions. The majority of rules in the Balance dataset had two or three conditions; hence, no rule was left out of its accuracy calculation. The total number of rules in each accuracy range was determined by merging the rules generated from the different clusters after removing duplicates. The accuracy of each rule was determined using the whole dataset, and the rules were then grouped into ranges of accuracy.
The reason for reporting rules in ranges was to help the user choose the most appropriate accuracy range for determining risk. Table 35 has sample rules from each of the datasets along with their accuracy and coverage. These rules were chosen from among the highest accuracies to show that coverage was typically low when accuracy was high.

Table 35. Sample rules with their accuracy and coverage

Adult, rule 52 (accuracy 0.9745, coverage 0.23515): capital-gain_c<7298 & native_county=united-states & NOT Not-in-family_Unmarried_Own-child : <=50K
Balance, rule 17 (accuracy 0.9333, coverage 0.02400): Right-Distance=5RD or 2RD or 1RD & Right-Weight=2RW : L
Breast Cancer, rule 60 (accuracy 0.91176, coverage 0.11888): breast-quad=(central | right_low) AND irradiat=no AND node_caps=no : no-recurrence-events
Car, rule 13 (accuracy 1.0, coverage 0.055556): maint=vhigh & safety=high|med & persons=2 & buying=high : unacc
Connect, rule 64 (accuracy 0.905405, coverage 0.006572): C1=b & c4=b & d1=x & d2=x|b & d3=o & d4=x|b : win
Lymphography, rule 57 (accuracy 0.9729, coverage 0.25): Affer=YES & change_node=2|3 & lymph_s=NO & lymph_enlar=2 : 2
Retention, rule 2 (accuracy 0.93357, coverage 0.09): grant="Y" and percentile_range=Top10 and ETHNIC="WH" and PELL="N" : R

The maximum, minimum, and total coverage for each dataset are shown in Table 36, along with the average accuracy.

Table 36. Experimental results of total coverage

Data set      | Total coverage % | Average accuracy % | Minimum coverage | Maximum coverage
Adult         | 89.65  | 84.86 | 0.000122846 | 0.723595713
Balance       | 79.70  | 69.00 | 0.01600     | 0.12000
Breast Cancer | 99.69  | 78.89 | 0.003496503 | 0.18881118
Car           | 100.00 | 90.27 | 0.006944    | 0.666667
Connect       | 22.377 | 86.32 | 0.000933    | 0.747487
Retention     | 55.90  | 87.5  | 0.01        | 12.749

The Balance dataset did not run through Stage II in the ensemble process because it had fewer attributes. After Stage I was executed, only one attribute remained for Stage II, so a different methodology had to be followed, as mentioned in Figure 13.
The Connect dataset experienced a different problem: during Stage II, association mining would not generate any rules for any of the sub-datasets, even after running for four days. Thus, the method had to be altered for datasets with more than 20 attributes. The attributes with the least significance were removed from the dataset by trial and error: after the least significant attributes were removed, the dataset was tested for rule generation, and if it still failed to generate rules, more attributes were removed. The sub-datasets behaved differently in that some generated rules with 23 attributes while others had to be trimmed further. In the end, the Connect dataset generated 109 rules with three or more attributes and an accuracy above the threshold value of 80.9%. Since the main objective of this study was to generate rules for the minority group, Table 37 shows the accuracy, coverage, and number of rules generated by this process for the minority classes. Balance, Car, and Connect did not generate any minority class rules; nevertheless, the other four datasets did. As shown in Table 37, the datasets that had less than 10% of their data in the minority class failed to generate rules.

Table 37. Summary of minority classes in the datasets

Data set      | Minor class       | Class size | Class size (fraction) | Avg. accuracy | Total coverage % | Total # of rules
Adult         | >50K              | 7841   | 0.24080956 | 0.940324  | 22.0890    | 26
Balance       | B                 | 49     | 0.0784     | -         | -          | -
Breast Cancer | Recurrence-events | 85     | 0.2972028  | 0.768394  | 0.72028    | 17
Car           | vgood             | 65     | 0.03761574 | -         | -          | -
Connect       | draw              | 6449   | 0.09546013 | -         | -          | -
Lymphography  | 1, 2, 3           | 2+4+61 | 0.4527027  | 0.8040516 | 10.3040541 | 0+58+52
Retention     | D                 | 2838   | 0.30714286 | 0.8       | 69.00      | 16

Since no rules were generated for these three datasets using association mining, the parameters (Table 38 of Appendix B) were changed again.
The parameter delta was changed back to 0.05, which had no effect on any of the subsets except cluster II of the Retention dataset, which it allowed to generate rules. The parameter lowerBoundMinSupport was changed from 0.1 to 0.05; in general, the datasets generated rules, but nothing extra was generated for the minority class. minMetric was lowered from 0.85 to 0.65 in intervals of 0.07, but none of the datasets generated any minority class rules. The number of rules in the parameter numRules was increased from 10 to 100; each dataset generated 100 rules, but no additional rules were generated for the minority class. After lowering the accuracy and increasing the number of rules, it was observed that the generated rules had a smaller number of attributes, even as few as one attribute per rule.

Discussion and conclusions

The ensemble learning method developed in this study (DMDR) was an extension of the ensemble learning of MSDT (Datta and Mengel 2014b); it was found that the process of rule generation changed if the dataset had fewer than six or more than 20 attributes. Only three datasets in these extreme conditions were tested. The research questions that arose during this study were the following: Does the rule generation process depend on the characteristics of the datasets? Does selecting the most significant attributes give accurate results? Which attributes should be ignored? To answer these questions, the study used dynamic methods to generate rules. To find the most significant attributes, three different attribute selection methods (symmetrical uncertainty, gain ratio, and chi-squared) were used for each of the sub-datasets. The attributes were weighed and ranked, and those with the lowest weights were deleted; in case of a tie, the average merit was considered. Deleting attributes from the dataset was an iterative process: initially, the attributes with average merit equal to zero were deleted.
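The weigh-rank-and-delete step can be sketched as below. As a simplification, plain information gain stands in for the chi-squared test alongside gain ratio and symmetrical uncertainty, and the rows are toy data rather than any of the study's sub-datasets.

```python
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def info_gain(rows, attr, cls="class"):
    gain = entropy([r[cls] for r in rows])
    n = len(rows)
    for v in {r[attr] for r in rows}:
        part = [r[cls] for r in rows if r[attr] == v]
        gain -= (len(part) / n) * entropy(part)
    return gain

def gain_ratio(rows, attr):
    split_info = entropy([r[attr] for r in rows])
    return info_gain(rows, attr) / split_info if split_info else 0.0

def symmetrical_uncertainty(rows, attr, cls="class"):
    denom = entropy([r[attr] for r in rows]) + entropy([r[cls] for r in rows])
    return 2 * info_gain(rows, attr) / denom if denom else 0.0

def rank_by_average_merit(rows, attrs):
    """Average the three merits per attribute and order worst-first,
    so zero-merit attributes are the first candidates for deletion."""
    merit = {a: (info_gain(rows, a) + gain_ratio(rows, a)
                 + symmetrical_uncertainty(rows, a)) / 3 for a in attrs}
    return sorted(attrs, key=lambda a: merit[a]), merit

# Toy rows: "a" predicts the class perfectly; "noise" is constant.
rows = [
    {"a": "x", "noise": "z", "class": "yes"},
    {"a": "x", "noise": "z", "class": "yes"},
    {"a": "y", "noise": "z", "class": "no"},
    {"a": "y", "noise": "z", "class": "no"},
]
order, merit = rank_by_average_merit(rows, ["a", "noise"])
print(order)           # ['noise', 'a']
print(merit["noise"])  # 0.0
```

Deleting from the front of this worst-first ordering, retrying rule generation, and repeating with a slightly higher merit cutoff reproduces the iterative loop the text describes.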
If a dataset still failed to generate rules, attributes with a higher average merit, such as 0.001, were deleted. The process continued until rules were generated using Apriori association mining. Another finding from this study was that datasets with less than 10% of the data in their minority classes did not generate rules or fell below the accuracy threshold. In the Lymphography dataset, three of the four classes were considered minority classes: the lowest class had two instances, which is not sufficient for generating rules, and the second lowest class had only four, so the third lowest class was also treated as a minority class for this study. Technically, the class with two instances should be the minority class, but a domain expert could determine which class needs to be considered the minority class. On the other hand, the Connect dataset, with 6449 instances in its minority class, did not generate rules because the minority class distribution was less than 10% of the total instances. To address this finding, the accuracy and minimum-support parameters were lowered.

Future work

The ensemble learning method created for this study is a manual process but will be automated in the future. The question of the "significant attribute selection" process should be investigated further by experimenting with datasets having larger attribute sets. The rule generation technique in this study generates rules for both classes, but the process failed to generate rules for minority classes having less than 10% of the instances in the dataset. Future study will isolate these instances to generate rules. This study will also be extended to generate rules for other datasets that have a lower true positive class instead of a true negative class.
For example, this technique could use datasets such as the number of female students who choose a STEM degree in the high school population, as well as other well-known datasets from the UCI database.

Appendix B

Table 38. Apriori settings

Option               | Default    | Experiment | Meaning
car                  | False      | True       | Generates rules with the class attribute as the consequent instead of general association rules
classIndex           | -1         | -1         | Position of the class attribute in the dataset
delta                | 0.05       | 0.1        | Iteratively decrease support by this factor until minimum support is reached or the required number of rules has been generated
lowerBoundMinSupport | 0.1        | 0.1        | The lowest value of minimum support
metricType           | confidence | confidence | Can be set to Confidence (class association rules can only be mined using confidence), Lift, Leverage, or Conviction
minMetric            | 0.9        | 0.85       | Considers only rules with accuracy of 0.85 or higher
numRules             | 10         | 100        | Number of rules to generate
outputItemSets       | False      | True       | Itemsets are shown in the output
removeAllMissingCols | False      | False      | Removes columns with missing values
significanceLevel    | -1.0       | -1.0       | Significance testing
treatZeroAsMissing   | False      | False      | Zero is treated the same way as a missing value
upperBoundMinSupport | 1.0        | 1.0        | The highest value of minimum support; the process starts here and iteratively decreases toward the lower bound
verbose              | False      | False      | The algorithm runs in verbose mode if enabled

CHAPTER V
CONCLUSIONS AND FUTURE WORK

This study proposed several different methods to generate rules for a complex problem like retention. The study was done in three phases. In the first phase, the study applied methods used by other data mining researchers, then extended them by clustering and using different decision tree algorithms. In the second phase, the study was further modified to derive rules in multiple stages using decision trees and association mining.
In the final phase, the method used in the second phase was applied to other well-known datasets from UCI; this method had to be further modified depending on the characteristics of the datasets. The following paragraphs elaborate on each of these phases. In the first phase, the rules were generated using different decision tree algorithms and different methods implemented by other researchers, for example Yu et al. (2010). The rules generated were either too simple or too complex to be used in real situations by practitioners; hence, the rules did not meet the necessary and sufficient conditions. Thus, the study was extended, and the datasets were clustered. Rules were generated from a controlled decision tree. Certain anomalies occurred: first, some rules generated from a cluster showed an opposite class when validated using the whole dataset; second, two or more rules were similar except for attribute values, and these rules could be joined with a logical condition. Hence, the study was extended to use two different algorithms: the decision tree and association mining. In the second phase, these two algorithms were used together to generate the rules. The rules that the multi-stage method extracted were more accurate and were used for student intervention. The attributes that emerged as possible factors for attrition are now collected for incoming freshmen by the Institutional Research office. None of the rules generated by the multi-stage method had conflicting classes when tested on the whole dataset, and attributes beyond those employed by the decision tree could be incorporated through association mining. In the third phase, the study was extended to verify the method from the second phase by applying it to various types of datasets from the UCI repository.
During this study, the following question arose: does rule generation depend on the characteristics of the dataset? The rule generation technique was modified to accommodate datasets with smaller or larger dimensions. Another finding was that datasets with less than 10% of the data in their minority classes did not generate rules or fell below the accuracy threshold; this finding will be studied in the future for minority classes alone. In summary, the multi-stage technique used to generate useful rules for the Retention dataset was verified using datasets from other domains. The rules generated were grouped in ranges. The measures in this study were coverage and accuracy for each rule, in contrast to only the accuracy of the entire dataset. In the future, the study will include rough set theory to generate rules in fuzzy areas. Data mining algorithms typically assign rules to a positive or negative class, but some rules, as seen in this study, cannot be assigned to just one class. These rules need to be isolated, which will increase the total accuracy on the dataset. This study will also be extended to generate rules for other datasets that have a lower true positive class instead of a true negative class; for example, the technique could use datasets such as the number of female students who choose a STEM degree in the high school population, as well as any other educational datasets that become available in the UCI database. This study addressed the complex problem of student retention by creating a multi-stage model to generate retention and attrition rules, which can be used by institutions of higher education. By using these rules, administrators can intervene at an early stage to keep students from dropping out before their second year in college.
Although no methodology can account for every factor that influences students' decisions about attrition, this study has developed a way to provide better accuracy for each rule. Furthermore, because the rules are grouped in ranges, this methodology allows administrators to decide how intense their intervention strategy should be for any given rule. For rule creation in data mining, this study provides a novel approach by combining two well-known algorithms: decision trees and association mining. The newly generated rules are created using a controlled decision tree, and each node of the decision tree becomes an independent dataset for creating rules using association mining. The final rule is derived by combining the rules generated by the decision tree and association mining; hence, the rules generated via the MSDT technique integrate the advantages of both algorithms. Unlike previous studies, this methodology is a multi-stage process in which a controlled decision tree is generated using a decision tree algorithm. Next, to overcome the limitation of selecting only top-ranked attributes in creating a rule, the attributes used by the decision tree are removed from the dataset before rules are generated using association mining. Hence, the performance measure in this study is not overall accuracy or confidence but is measured per rule, to facilitate implementation by administrators.

REFERENCES

ACT, Inc. 2004. What Works in Student Retention: Four-Year Public Institutions. ACT, Inc.
Alldrin, N., Smith, A., and Turnbull, D. 2003. Clustering with EM and K-means. University of San Diego, California, Tech Report.
Agrawal, R. and Srikant, R. 1994. Fast algorithms for mining association rules. International Conference on Very Large Databases, 487-499.
Baker, R.S.J.D. and Yacef, K. 2009. The state of educational data mining in 2009: A review and future visions.
Journal of Educational Data Mining 1(1), 3-17.
Bean, J. and Eaton, B. 2001. The psychology underlying successful retention practices. Journal of College Student Retention: Research, Theory and Practice 3, 73-89.
Boston, W.E., Ice, P., and Gibson, A.M. 2011. Comprehensive assessment of student retention in online learning environments. Online Journal of Distance Learning Administration IV(I).
Bayer, J., Bydzovska, H., Geryk, J., Obsivac, T., and Popelinsky, L. 2012. Predicting dropout from social behaviour of students. Proceedings of the 5th International Conference on Educational Data Mining, 103-109.
Byers González, J. and DesJardins, S. 2002. Artificial neural networks: A new approach for predicting application behavior. Research in Higher Education 43(2), 235-258.
Cattell, R.B. 1966. The scree test for the number of factors. Multivariate Behavioral Research 1, 245-276.
Datta, S. and Mengel, S. 2014a. Readable rules and higher accuracy for retention data with decision tree. [Submitted]
Datta, S. and Mengel, S. 2014b. Multi-stage decision algorithm to generate readable rules. [Manuscript]
DeBerard, M.S., Spielmans, G.I., and Julka, D.C. 2004. Predictors of academic achievement and retention among college freshmen: A longitudinal study. College Student Journal 38, 66-80.
Delen, D. 2010. A comparative analysis of machine learning techniques for student retention management. Decision Support Systems 49, 498-506.
Eitel, J.M.L., Baron, J.D., Devireddy, M., Sundararaju, V., and Jayaprakash, S.M. 2012. Mining academic data to improve college student retention: An open source perspective. International Conference on Learning Analytics and Knowledge, 139-142.
Gaudard, M., Ramsey, P., and Stephens, M. 2006. Interactive Data Mining and Design of Experiments: The JMP Partition and Custom Design Platforms. New Haven Group.
Herzog, S. 2006.
Estimating student retention and degree completion time: Decision trees and neural networks vis-à-vis regression. New Directions for Institutional Research, 17-33.
Hagedorn, L.S. 2005. How to define retention. In Alan Seidman (ed.), College Student Retention, Praeger, 89-106.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I.H. 2009. The WEKA data mining software: An update. SIGKDD Explorations 11(1).
Huo, J., Wang, Xizhao, Lu, Mingzhu, and Chen, Junfen. 2006. Induction of multi-stage decision tree. IEEE International Conference on Systems, Man, and Cybernetics.
Intuitor. 1996. Simpson's Paradox. http://www.intuitor.com/statistics/SimpsonsParadox.html.
Kaiser, H.F. 1960. The application of electronic computers to factor analysis. Educational and Psychological Measurement 20, 141-151.
Kaufman, L. and Rousseeuw, P. 1990. Finding Groups in Data: An Introduction to Cluster Analysis. Wiley.
Kerkvliet, J. and Nowell, C. 2005. Does one size fit all? University differences in the influence of wages, financial aid, and integration on student retention. Economics of Education Review 24, 85-95.
Kotsiantis, S. 2009. Educational data mining: A case study for predicting dropout-prone students. International Journal of Knowledge Engineering and Soft Data Paradigms 1(2), 101-111.
Lin, H.S. 2012. Data mining for student retention management. The Journal of Computing Sciences in Colleges 27(4), 92-99.
Lu, Mingzhu, Huo, Jianbing, Chen, C.L. Philip, and Wang, Xizhao. 2009. Multi-stage decision tree based on inter-class and inner-class margin of SVM. Proceedings of the 2009 IEEE International Conference on Systems, Man, and Cybernetics.
Luan, J. 2002. Data mining and its applications in higher education. In A.M. Serban and J. Luan (eds.), Knowledge Management: Building a Competitive Advantage in Higher Education. New Directions for Institutional Research, no. 113. San Francisco: Jossey-Bass.
Lykourentzou, I., Giannoukos, I., Nikolopoulos, V., Mpardis, G., and Loumos, V. 2009. Dropout prediction in e-learning courses through the combination of machine learning techniques. Computers and Education 53, 950-965.
Macfadyen, L.P. and Dawson, S. 2010. Mining LMS data to develop an early warning system for educators: A proof of concept. Computers and Education 54, 588-599.
Mallinckrodt, B. and Sedlacek, W.E. 1987. Student retention and the use of campus facilities by race. NASPA Journal 24, 28-32.
Marquez-Vera, C., Cano, A., Romero, C., and Ventura, S. 2013. Predicting student failure at school using genetic programming and different data mining approaches with high dimensional and imbalanced data. Applied Intelligence 38, 315-330.
Mellalieu, P.J. 2011. Predicting success, excellence and retention from students' early course performance: Progress results from a data mining decision support system in a first year tertiary education programme. XXIX International Conference of the International Council for Higher Education.
Nandeshwar, A., Menzies, T., and Nelson, A. 2011. Learning patterns of university student retention. Expert Systems with Applications 38, 14984-14996.
Nara, A., Barlow, E., and Crisp, G. 2005. Student persistence and degree attainment beyond the first year in college: The need for research. In Alan Seidman (ed.), College Student Retention, Praeger, 129-153.
National Audit Office. 2007. Staying the course: The retention of students in higher education.
National Center for Education. 2012. http://nced.ed.gov/.
Pittman, K. 2008. Comparison of data mining techniques used to predict student retention. Doctoral dissertation, Nova Southeastern University, Fort Lauderdale.
Schmitt, N., Oswald, F.L., Kim, B.H., Imus, A., Merritt, S., Friede, A., and Shivpuri, S. 2007. The use of background and ability profiles to predict college student outcomes. Journal of Applied Psychology 92(1), 165-179.
Senator, T. 2005. Multi-stage classification. Proceedings of the Fifth IEEE International Conference on Data Mining, 386-393.
Sewell, W. and Wegner, E. 1970. Selection and context as factors affecting the probability of graduation from college. American Journal of Sociology 75(4), 665-679.
Superby, J.F., Vandamme, J-P., and Meskens, N. 2006. Determination of factors influencing the achievement of the first-year university students using data mining methods. Workshop on Educational Data Mining.
Thomas, E. and Galambos, N. 2004. What satisfies students? Mining student-opinion data with regression and decision tree analysis. Research in Higher Education 45(3), 251-269.
Tinto, V. 1975. Dropout from higher education: A theoretical synthesis of recent research. Review of Educational Research 45, 89-125.
Tinto, V. 1993. Leaving College: Rethinking the Causes and Cures of Student Attrition. University of Chicago Press.
Tinto, V., Russo, P., and Kadel, S. 1994. Constructing educational communities in challenging circumstances. Community College Journal 64(1), 26-30.
Tinto, V. 2006. Research and practice of student retention: What next? Journal of College Student Retention 8(1), 1-19.
UCI Repository of machine learning databases and domain theories. ftp://ftp.ics.uci.edu/pub/machine-learning-databases.
UC Irvine Machine Learning Repository (UCI). http://archive.ics.uci.edu/ml/.
Van Nelson, C. and Neff, K. 1990. Comparing and contrasting neural network solutions to classical statistical solutions. Paper presented at the Midwestern Educational Research Association Conference, Chicago, Oct. 19, 1990.
Wang, Xizhao, He, Qiang, and Chen, Degang. 2005. A genetic algorithm for solving the inverse problem of support vector machines. Vol. 68, 225-238.
Witten, I.H. and Frank, E. 2005. Data Mining: Practical Machine Learning Tools and Techniques, 2nd Edition. Morgan Kaufmann, San Francisco.
Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng, A., Liu, B., Yu, P.S., Zhou, Z., Steinbach, M., Hand, D.J., and Steinberg, D. 2007. Top 10 algorithms in data mining. Survey paper.
Yadav, K.S., Bharadway, B., and Pal, S. 2012. Mining education data to predict student's retention: A comparative study. International Journal of Computer Science and Information Security 10(2), 113-117.
Yu, H.C., DiGangi, S., Jannasch-Pennell, A., and Kaprolet, C. 2010. A data mining approach for identifying predictors of student retention from sophomore to junior year. Journal of Data Science 8, 307-325.
Zhang, Y., Oussena, S., Clark, and Kim, H.T. 2010. Use data mining to improve student retention in higher education: A case study. 12th International Conference on Enterprise Information Systems, Paper Nr-129.