A multi-stage decision algorithm for rule generation for minority class
by
Soma Datta, M.C.A
A Dissertation
In
Computer Science
Submitted to the Graduate Faculty
of Texas Tech University in
Partial Fulfillment of
the Requirements for
the Degree of
DOCTOR OF PHILOSOPHY
Approved
Dr. Susan Mengel
Chair of Committee
Dr. D’aun Green
Dr. Jim Burkhalter
Mark Sheridan
Dean of the Graduate School
August 2014
Copyright 2014, Soma Datta
ACKNOWLEDGMENTS
I wish to thank my committee members for their precious time and help. A
special thanks to Dr. Susan Mengel, my committee chairperson, for her immense patience
in reading, proofing, encouraging, and enlightening me with the thought process. Thanks
to Dr. D’aun Green and Dr. Jim Burkhalter for graciously accepting to serve on my
committee.
I would like to thank my previous supervisor James Anderson for his support and
encouragement. Special thanks goes to Eric Thompson, my colleague, for compiling the
data and creating scripts for computations.
Finally, I would like to thank the University Writing Center for all the help they
have rendered in getting this manuscript to what it is now. Special thanks goes to
Elizabeth Bowen and Leslie Akchurin for their help and expertise in modifying this
manuscript.
I dedicate my dissertation work to my family and friends. A special feeling of
gratitude goes to my loving parents, Ira Ray and Amelendu Ray, for being my friend, my
philosopher and my guide.
I also dedicate my dissertation work to my daughter, my son-in-law, and my
husband (Sanjana Greenhill, Joshua Greenhill, and Amlan Datta) for their support and
confidence.
Finally, I dedicate my dissertation work to my sister, Ruma Guha Neogi, and her
family for building confidence in me and for their love.
TABLE OF CONTENTS
Acknowledgments
Abstract
List of Figures
List of Tables
Chapter I. Introduction
Chapter II. Predictive Modeling of Student Retention Data
    Abstract
    Introduction
    The Background
    The Data
    The Proposed Approach
        Motivational Concepts
        Ensemble Learning
        Ensemble Learning with Clustering
    Experiments and Results
    Conclusion and Future Work
Chapter III. Multi-Stage Decision Method to Generate Rules for Student Retention
    Abstract
    Introduction
    Related Work
        Background
        Multi-stage Decision Tree
    Data for the Study
    Methodology
        Motivation for Multi-stage Mining Methods
        Multi-stage Controlled Decision Tree Method
    Results
        Results from Multi-stage Controlled Decision Tree
    Conclusion
    Future Work
    Appendix A
Chapter IV. Dynamic Multi-stage Decision Rules
    Abstract
    Introduction and Related Work
    Background Work
    Data for Study
    Dynamic Multi-stage Decision Rule
        Clustering
        Stage I: Decision Tree
        Stage II: Association Mining
    Results
    Discussion and Conclusions
    Future Work
    Appendix B
Chapter V. Conclusions and Future Work
References
ABSTRACT
This study analyzes student retention data for the prediction of students who are
likely to drop out. Retention is an increasingly important problem for institutions that
must meet legislative mandates, face budget shortfalls due to decreased tuition or
state-based revenue, and do not produce enough graduates in fields of need, such as
technology. The study proposes a multiple-stage decision method to improve on rule
extraction issues encountered previously when using ensemble learning with clustering
and decision trees. These issues include rules with anomalous classes and rules with
attributes chosen only by the decision tree method. To improve rule extraction, the study
described in this paper uses a multi-stage decision method with clustering, controlled
decision trees, and association mining. This study uses a dynamic method to generate
rules: the characteristics of the dataset dictate the path taken toward generating the
rules. A dynamic multi-stage decision tree was generated depending on the attribute
dimensions and the size of the dataset. Each rule is measured by its coverage and
accuracy. This technique generates rules after mining data from the minority class. The
generated rules were grouped into ranges to facilitate choosing rules as needed.
LIST OF FIGURES
Figure 1. Tree using proposed ensemble learning
Figure 2. Tree using Yu's approach
Figure 3. Tree using recursive partition and an enlarged version
Figure 4. Tree using CART
Figure 5. Tree using J48, superimposed on the actual-size tree
Figure 6. Building decision tree
Figure 7. Building decision tree
Figure 8. Steps of split and prune of the decision tree
Figure 9. Controlled decision tree
Figure 10. Breast Cancer 1st level when accuracy stops changing
Figure 11. Breast Cancer dataset 2nd level with repeating accuracy
Figure 12. Breast Cancer 3rd level with repeating accuracy
Figure 13. Steps for dynamic decision rules
LIST OF TABLES
Table 1. Data Characteristics
Table 2. Statistical characteristics of the datasets
Table 3. Average accuracy in % without clustering
Table 4. Total no. of rules generated without clustering
Table 5. Different measures with different cluster sizes using C4.5
Table 6. Average accuracy in % after clustering
Table 7. Total no. of rules after clustering
Table 8. Sample rules using merged dataset
Table 9. Total number of rules above threshold with three or more conditions
Table 10. Attrition rules from different clusters of the merged dataset
Table 11. Rules for attrition with common attribute conditions
Table 12. Sample rules predicting opposite class in cluster
Table 13. Example of combining rules
Table 14. Accuracy and coverage using classification on retention data
Table 15. Final selected attributes M-2
Table 16. Data characteristics M-2
Table 17. Statistical characteristics of the datasets M-2
Table 18. Sample rules, their respective probability, and the class type
Table 19. Association mining statistics
Table 20. Characteristics of the dataset
Table 21. Sample rules with accuracy and coverage
Table 22. Testing different datasets with rules for attrition
Table 23. Comparisons of rules coverage using different methods
Table 24. Rule coverage over instances in each class
Table 25. Accuracies obtained using incoming freshmen
Table 26. Examples of combining rules
Table 27. Apriori settings
Table 28. Effect of changing default values
Table 29. Sample rules at confidence = 0.70
Table 30. Data set with reported accuracy
Table 31. Datasets and their characteristics (UCI)
Table 32. Cluster comparisons with EM and K-means
Table 33. EM clustering for Connect dataset
Table 34. Experimental results of accuracy using different datasets
Table 35. Sample rules with their accuracy and coverage
Table 36. Experimental results of total coverage
Table 37. Summary of minority class in the datasets
Table 38. Apriori settings
CHAPTER I
INTRODUCTION
Student retention has been an ongoing problem in institutions of higher learning
in the US. Qualitative research on student retention was started by Tinto as early as 1975.
In recent years, educational institutions have collected a large amount of student related
data; hence, quantitative research using data mining techniques makes it possible to
discover hidden information at an early stage. The purpose of the study is to answer the
following question:
 How can data mining methodologies be employed to reduce the error rate and to produce
more usable rules for the minority class?
Previous studies have used C4.5, CART, recursive partitioning, Naïve Bayes,
neural networks, and logistic regression to identify attributes in student attrition. Naïve
Bayes and neural networks were used to validate the results from the decision tree
algorithms. Decision tree algorithms are widely used because of their ease of
understanding and transparency. However, the attribute that is selected as the top attribute
might not always be the best one for generating a model. In addition, some attributes might
be correlated with one another. Thus, to investigate the above question, this study
developed a multi-stage domain-driven data mining procedure that minimizes the error
rate and merges the models of different data sets, producing more usable rules.
The study is divided into three sections. Chapter II studies and analyzes data
mining methods used by other researchers, such as Yu et al. 2010, and then extends them
to generate rules using clustering, controlled decision tree techniques, and
unrepeatable attributes. Chapter III overcomes some of the limitations of the method
applied in Chapter II by generating retention/attrition rules through merging decision trees
and association mining to produce multi-stage rules; the method was called the multi-stage
decision tree (MSDT). In Chapter IV, the same algorithms are used to generate rules and
are applied to datasets from the UCI data repository for validation. The multi-stage
logic had to be further modified depending on the characteristics of the datasets. The new
ensemble method was named dynamic multi-stage decision rules (DMDR).
In this study, available attributes from the institution’s database were collected.
Principal component analysis and multivariate correlation were applied to the datasets to
remove any correlation between the attributes. Clustering, which was not used in other
studies, was applied to the datasets to facilitate homogeneous grouping.
The unique methods used in this research were the following:
 Model-1, stopping the tree when an attribute starts to repeat, to replicate Yu et al.’s
2010 study.
 Model-2, locking the attributes once used, to extend Yu et al.’s method, and measuring
each rule using accuracy and coverage.
 Model-3, clustering to form homogeneous groups before applying Model-2’s techniques.
 Model-4, generating a controlled decision tree by stopping the tree from growing
further once it has reached a given accuracy and the accuracy stops improving for
the next two steps. The dataset at this point is pruned to remove the attributes that
were used in generating the tree. The pruned dataset is allowed to generate
association mining rules. Decision tree and association mining rules are joined with
an AND condition to produce the actual rule (a sketch of this join appears after this
list). These rules are validated for accuracy and coverage against the whole dataset.
 Model-5, extending to generalize and verify Model-4 on known datasets from UCI.
Here the model becomes dynamic and follows a path depending on the
characteristics of the dataset.
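The Model-4 join can be sketched in a few lines of Python. This is an illustration only, not the study's implementation; the condition strings and the join_rules helper are hypothetical.

```python
# Sketch of the Model-4 join: a rule from the controlled decision tree is
# conjoined (AND) with an association rule mined from the pruned dataset.
# The condition strings and this helper are hypothetical illustrations.
def join_rules(tree_conditions, assoc_conditions, predicted_class):
    """Both inputs are lists of 'attribute=value' condition strings."""
    combined = tree_conditions + assoc_conditions  # logical AND of all conditions
    return " & ".join(combined) + " : " + predicted_class

# e.g., the tree stage found GRANT=N & LOAN=N; the association stage (run on
# the pruned attributes) added Greek_Life=N; the joined rule is then validated
# for accuracy and coverage against the whole dataset.
print(join_rules(["GRANT=N", "LOAN=N"], ["Greek_Life=N"], "D"))
```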
CHAPTER II
PREDICTIVE MODELING OF STUDENT RETENTION DATA
Abstract
Student retention is an intensely scrutinized issue, but remains a problem area for
many institutions. Further complicating the issue, student retention is taken to be an
essential indicator of university performance and enrollment management, particularly by
state legislators. In addition, even if students do continue after their first year, high
withdrawal rates can occur in the second year, causing substantial revenue loss thereafter.
Reasons for student attrition are well known; however, how to identify potential dropouts
reliably and quickly is less well understood, though several advances are being made
through data mining. To learn more about how to identify potential dropouts,
researchers conduct studies using institutional data to find indicators that may yield
predictive qualities in identifying student attrition. These studies, however, can yield
differing and sometimes conflicting indicators due to the use of different data mining
techniques that yield different models, even on the same datasets. Even so, models are
still useful for dropout prediction, and some data mining models may be examined in
detail to determine the reasons for how the model renders a decision. The authors
propose to utilize data mining techniques for the purpose of finding drop out risk
indicators for early identification of at-risk students to apply intervention measures
expediently. The study in this paper, therefore, looks at building on the results of
previous studies, but with the intention of lowering model error rates and of attempting to
produce rules that can be used with some accuracy and clarity to find at-risk students.
The study focuses on specific data mining methodologies that can produce rules, such as
classification (ADTree, PART, Random Tree, J48Graft, CART, Recursive Partition,
C4.5) and clustering (K-means and EM) techniques rather than those methodologies that
do not allow rules to be derived from the model easily. Results show that these
methodologies can enhance the accuracy of identifying at-risk students and can give more
specific insight into attrition, such as revealing otherwise unknown indicator
combinations.
Key Words and Phrases: Data mining, clustering, retention, attrition, model
selection
Introduction
Student retention is a significant issue in higher education because it is an
essential indicator of performance for each institution and for enrollment management. It
is also important for policy makers due to the potential negative impact of non-retention
of students on the image of the university. In addition, low retention rates for an
institution can cause substantial loss in tuition, fees, and alumni contributions (DeBerard
et al. 2004). The problem of student retention is complex because of the variety of causes
and lack of a single model to predict why students drop out. Further, high withdrawal
rates occur even during the students’ second year of enrollment (Nara et al. 2005); thus, it
is desirable to identify at-risk students and intervene as early as is practicable to facilitate
student retention.
The rationale behind investigating student retention includes the following:
 to predict if a student is at a risk of dropping out from the institution so that intervention
strategies can be implemented to help retain the student,
 to respond to pressure from potential legislative bills stipulating measures to ensure
that institutions address retention by causing state grants to be dependent on the number
of graduating students,
 to understand the risk factors and causes behind attrition so that effective intervention
measures may be determined,
 to help institutions reduce risk factors, and
 to determine effective classification, clustering, and predictive models using data
mining techniques on student data to find at-risk students.
Studies on retention in higher education go back as far as the 1800s (Boston et al.
2011) and have found the following list of common factors that indicate retention:
 high school – high GPA
 standardized tests – high scores on the ACT and/or SAT
 admissions policies – rigorous rather than open
 social integration – higher use of the library and living on-campus
In addition, researchers who mine student data using modeling techniques, such as
decision trees, logistic regression, and neural networks (for example, Baker et al. 2009,
Yu et al. 2010, and Zhang 2010) are finding other interesting factors not necessarily
directly related to the above list but supportive of it:
 frequent use of the online learning system
 high transfer hours
 ethnic factors
 residency as opposed to non-residency
 high library usage
 social networking
Although interesting results are being found, the modeling techniques can be
improved in accuracy and analyzed to reveal rules used by the model. Thus, work in this
paper seeks to expand upon prior modeling to find ways to improve accuracy and
determine rules that can be used to pinpoint at-risk students.
The Background
Qualitative studies, for example ACT (2004), Bean (2001), Sewell and Wegner
(1970), and Tinto (1975, 2006) have shown that retention depends on social integration,
demographics, academic achievements, financial aid or ability to pay, and institutional
factors. In recent years, some quantitative studies—Bayer (2012), Delen (2010), Eitel et
al. (2012), Herzog (2006), Kotsiantis (2009), Lykourentzou et al. (2009), Macfadyen et
al. (2010), Marquez et al. (2013), Pittman (2008), Yadav et al. (2012), Yu et al. (2010),
and Zhang et al. (2010)—have used different data mining techniques and have achieved
varied results, although the attributes used in these studies are similar. Other factors such
as social networking (Bayer 2012) have shown that students who interact with academic
departments are more likely to be retained. In addition, most researchers have mentioned
that the first year of a student’s career is the most critical because the likelihood of
dropping out is high, and so their studies included the first year cohort of students. Some
of these studies have different cohorts; for instance, Herzog and Pittman used both
freshmen and the whole university in their datasets. Herzog’s study also included the
factors for degree completion. The datasets were balanced in a few studies because some
statistical techniques required it. Delen (2010), Eitel (2012), Kotsiantis (2009), and
Marquez (2013) argued that the imbalanced nature of the dataset could affect the results
of the study of the minor class (class with fewer instances in the dataset). Marquez’s
original dataset had a class distribution of 91.04/8.96 (Retained/Dropped), whereas
Delen’s (2010) was 80/20 (Retained/Dropped). Lin (2012) replicated his dataset two and
three times to get a larger group. Kotsiantis (2009) used cost-analysis as an alternative for
an imbalanced dataset, but retained the same accuracy as for the major class. Thus, it can
be seen that qualitative research laid a path for the predictor attributes and that recent
research is using similar attributes and cohorts to build a retention model.
Qualitative research models are survey-based and are often criticized for their limited
generalizability and for the difficulty of collecting data and administering the
survey. The model foundation laid by qualitative researchers, however, is the stepping-stone to quantitative research. Educational institutions collect data on students, so data-driven analysis can complement the results obtained from qualitative models (Delen
2010, Marquez et al. 2012, Pittman 2008, and Yadav et al. 2012). Most of the
quantitative studies use data mining techniques with different decision tree algorithms
because of the ease of rule interpretation, ease of learning, and widely accepted results
(Delen 2010, Eitel et al. 2012, Marquez et al. 2012, and Yadav et al. 2012). Other data
mining or statistical techniques used to validate the results of the decision trees are
MARS, neural networks, support vector machine, lazy learner, PART (Partial decision
tree), ADT (Alternate decision Tree), bagging, logistic regression, Bayesian
classification, genetic algorithms, and ensemble learning (the use of more than one data
mining technique). Thus, decision tree algorithms are commonly used by researchers in
building retention models. Of the decision tree algorithms, the common techniques are
C4.5, CART, and recursive partitioning, using the WEKA and JMP software (discussed
later), respectively.
Institutions store high-dimensional data for use in quantitative research,
but not all attributes help to predict retention, and some data may be unreliable due to
human error or omission. Also, rules generated using high dimensional data may not be
interpretable as will be demonstrated later; hence, researchers use attribute selection
techniques to lower the dimensions for the model (Delen 2010, Eitel et al. 2012,
Kotsiantis 2009, Macfadyen et al. 2009, Marquez et al. 2012, Yadav et al. 2012). The
attribute selection techniques are ranking methods; for example, entropy, chi-square,
minimum-error, and MDL. Marquez et al. utilize ten different ranking methods and
weight the attributes using the frequency of each attribute’s ranking. To calculate the
accuracy, validate the rules, and predict the important attributes, the following measures
are used: accuracy, ROC area, F-measure, sensitivity, and specificity.
The researchers conclude that data-mining techniques are capable of predicting
retention with sufficient attributes and data (Delen 2010, Marquez et al. 2012). The
accuracy for the researchers is approximately 80% to 90%. Macfadyen et al. (2009)
mention a better retention model can be built using the course/credit details. Bayer et al.
(2012) mention that retention accuracy would be better after the student’s fourth semester
and with the use of a social networking attribute. Herzog (2006) and Pittman (2008)
achieve better accuracy with the cohort that includes the whole university. Yu et al.
(2010) mention ethnicity, transfer hours, and residency are factors for retention. Zhang’s
(2010) study points out that academic activity is an indicator of retention. Additionally,
most of the researchers mention that financial aid is an indicator for retention. In other
words, the quantitative models have confirmed the qualitative research.
The Data
The approach proposed in this paper is evaluated by applying it to an internal Texas
Tech University system dataset, as detailed below. The characteristics of the data are
given in Table 1 and Table 2.
 Attributes
o demographic: age, ethnicity, and gender
o academic: attempted hours, earned hours, GPA, degree, and major changed
o financial aid data: grants, scholarships, and loans
o social factor: parent’s education and student organization
 Classes- retained (R) and dropped (D) as determined by the student’s enrollment status
during their 2nd year in college.
 Source- new freshmen admitted for Fall 2010, Fall 2011, and a merged dataset of both.
Table 1. Data Characteristics

Name                    Instances   Attributes
Merged 2010 and 2011    9240        14
Fall 2010               4490        14
Fall 2011               4750        14
Table 2. Statistical characteristics of the datasets

Attribute                                          Mean and/or Distribution
Age                                                21.46
Gender                                             M/F (52%/48%)
Test score                                         1099
Class percentile                                   71.6
College GPA                                        2.70
Earned hours                                       12.12
Attempted hours                                    13.89
Financial assistance (compiled grants/Pell/loan)   52%-63%
Low income group (only received Pell)              18%-23%
The Proposed Approach
The next section introduces the motivational concepts and justification for the
proposed new method, which is called ensemble learning because multiple models are
used. The section that follows details the ensemble learning approach, and the last
section describes ensemble learning with clustering.
Motivational Concepts
Modeling techniques do not always yield a mechanism whereby rules may be
derived. In particular, statistical techniques, neural networks, and regression yield
effective mathematical models, but do not show decision reasoning. Decision trees,
however, yield rules derived from the path taken from the root to a leaf. These rules,
however, may contain conflicting attributes since decision tree modeling algorithms may
reuse attributes. In addition, some rules may not be particularly useful, such as those
with repeating attributes of different values or those that lead to small sets of data.
Finally, the resulting datasets that satisfy the rule may be a mix of classes, such as 30%
drop and 70% retain, causing the rule to have reduced accuracy. For example, the rule
below includes the Avg_Attempt attribute three times, each with different and conflicting
conditional values. Because the rule represents a path in the decision tree, the rule does
not seem to represent a logical line of reasoning that a human might follow. The approach
proposed in this study does attempt to generate rules that are more usable by focusing on
producing rules with non-ambiguous attribute conditions.
Ex. Avg_Attempt>=9 & LOAN=Y & GPA>=1.672 & GRANT=Y & GPA>=2.616
& PERCENTILE>=55 & Avg_Attempt<18 & ETHNIC(HP, AS, WH, HI)
& PERCENTILE>=74 & Avg_Attempt>=12.5 & AGE>=21 & Transfer_Hrs<16
The attributes in the datasets are ranked using residual log-likelihood, chi-square,
sum of squared error, and clustering. Clustering is used as an additional method for
attribute selection to rule out the bias of any of the attribute ranking methods. For
example, residual log-likelihood, chi-square, or minimum sum of squared error ranking
use top-ranked attributes, whereas clustering groups them on a similarity measure. Here
the K-means and EM clustering techniques are employed. K-means may perform better
than EM clustering in the case of high-dimensional datasets due to EM’s numerical
precision problems. EM uses the Gaussian distribution, and a delta function may be
created as shown in equation 1 below (Alldrin et al. 2003). A delta function is sometimes
mentioned as the black hole of mathematics because its limit tends to zero causing no or
few records in a cluster.
1⁄ 𝑓𝑜𝑟 |𝑥| < 𝑎⁄2
𝜕(𝑥) = lim { 𝑎
𝑎→0 0 𝑒𝑣𝑒𝑟𝑦𝑤ℎ𝑒𝑟𝑒 𝑒𝑙𝑠𝑒
1
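As an illustrative aside (not part of the original study, which used WEKA and JMP), the contrast between K-means and EM clustering can be sketched in Python with scikit-learn; the synthetic data and all parameter choices below are assumptions for demonstration.

```python
# Illustrative sketch only: the study used WEKA's SimpleKMeans and EM.
# scikit-learn stands in here to contrast K-means with EM-style Gaussian mixtures.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 14))  # 500 synthetic records, 14 attributes as in Table 1

# K-means partitions records by Euclidean distance to cluster means.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

# EM fits Gaussian components; in high dimensions a component's variance can
# collapse toward zero (the delta-function behavior of equation 1), leaving a
# cluster with few or no records. reg_covar guards against that collapse.
em = GaussianMixture(n_components=3, reg_covar=1e-6, random_state=42).fit(X)

print(np.bincount(kmeans.labels_))  # records per K-means cluster
print(np.bincount(em.predict(X)))   # records per EM cluster
```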
The models built in this study using ensemble learning (discussed later) are as
follows:
 Model 1 uses attributes that are ranked using residual log-likelihood, chi-square, or sum
of squared error.
 Model 2 divides the datasets into clusters based on a similarity measure.
In Model 1, the tree is generated using recursive partitioning (JMP) (Gaudard et
al. 2006). The approach in this study locks the attribute after each split rather than
stopping tree growth when the attributes start to repeat as in Yu et al.’s (2010) approach,
which might not make use of all attributes in high-dimensional datasets. The motivation
behind locking attributes is to avoid repeating attributes in the effort to maximize
accuracy and generate usable rules. An example tree of the proposed approach is shown
in Figure 1, a tree generated using Yu et al.’s approach is shown in Figure 2, and trees
generated using the default setting for recursive partition, CART, and C4.5 for the same
dataset are shown in Figures 3, 4, and 5, respectively. Each path through the decision
tree from the root to the leaf represents a subset of the dataset being modeled. More
paths in the tree and a larger number of subsets increase the potential for a greater
number of candidate rules which could correspond to very small subsets.
Figure 1. Tree using proposed Ensemble learning
Figure 2. Tree using Yu's approach
Figure 3. Tree using Recursive Partition and an enlarged version
Figure 4. Tree using CART
Figure 5. Tree using J48, superimposed on the actual-size tree
For prediction, the rules are validated against all datasets. The rules that have
lower coverage of the dataset are ignored.
In Model 2, the dataset is clustered using K-means clustering (Kaufmann et al.
1990). The user has the option of using any k value for clustering. This study has tested
several cluster sizes to find the optimal k value for clustering, as discussed later.
The dataset is not balanced in this study as is done in some previous studies
(Delen 2010, Kotsiantis 2009, Yadav et al. 2012) because the class distribution is
similar to other published datasets, such as the University of California Irvine Machine
Learning Repository dataset with D=36.09% and R=63.91%. Also, data mining is able to
perform analysis without balancing while balancing is necessary for statistical
techniques.
Ensemble Learning
Ensemble learning in this study consists of three parts. First, the decision tree is
constructed using x unique attributes. Second, the rules are abstracted and aggregated.
Finally, the rules are validated using the whole dataset and compared with other decision
trees.
The steps used in the first part are given in Figure 6 below. In the first step, all
the attributes are ranked by the tree learning algorithm using residual log-likelihood
chi-square (2*entropy), which uses LogWorth (categorical attributes) or Sum of Squares
(continuous attributes) depending on the type of attribute (Gaudard et al. 2006), both being
statistical methods. Recursive partitioning is used to generate the decision tree.
Recursive partitioning (Gaudard et al. 2006) uses a statistical method for multivariable
analysis. Starting from the top ranked attribute, the decision tree is allowed to split once.
The attribute used for this split is locked, and the tree is split for the next level. This
process is iterated until x-1 attributes are used for splitting where x is the total number of
attributes.
Let x= number of attributes
Step 1. Rank all x attributes.
Step 2. Allow the tree to branch.
Step 3. Lock the attribute that generated the branch.
Step 4. Repeat steps 2 and 3 until x-1 attributes are used, or the tree is fully branched or optimized.
Figure 6. Building decision tree
Figure 7. Building decision tree
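A minimal Python sketch of Figure 6's attribute locking follows. It is a deliberate simplification under stated assumptions (binary equality splits, a majority-class purity score instead of JMP's LogWorth/Sum-of-Squares ranking) and is not the recursive-partitioning implementation used in the study.

```python
# Minimal sketch of Figure 6: once an attribute produces a split, it is locked
# (removed from the candidate set) so it can never repeat along a path.
# Purity here is a stand-in for JMP's LogWorth / Sum of Squares measures.
from collections import Counter

def best_split(rows, attrs):
    """Pick the (attribute, value) equality split giving the purest partitions."""
    def purity(subset):
        counts = Counter(r["class"] for r in subset)
        return max(counts.values()) / len(subset) if subset else 0.0
    best = None
    for a in attrs:
        for v in {r[a] for r in rows}:
            left = [r for r in rows if r[a] == v]
            right = [r for r in rows if r[a] != v]
            score = (len(left) * purity(left) + len(right) * purity(right)) / len(rows)
            if best is None or score > best[0]:
                best = (score, a, v)
    return best

def build_locked_tree(rows, attrs):
    if not rows:
        return None  # empty partition
    classes = {r["class"] for r in rows}
    if not attrs or len(classes) == 1:
        return Counter(r["class"] for r in rows).most_common(1)[0][0]  # leaf label
    _, a, v = best_split(rows, attrs)
    remaining = [x for x in attrs if x != a]  # Step 3: lock attribute a
    return {(a, v): build_locked_tree([r for r in rows if r[a] == v], remaining),
            (a, "!=" + str(v)): build_locked_tree([r for r in rows if r[a] != v], remaining)}
```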
In the second part, the rules are abstracted from the trees. Let r be the number of
rules generated by the tree. Let c be the number of condition attributes in each rule. The
predictive rules are identified depending on the accuracy and significance of the rule.
Predictive rules are abstracted by eliminating rules with accuracy less than a given
threshold t% (t is a chosen value; here it is 70% because the
average baseline accuracy of the three datasets is 69.2857%). Rules are not significant if
c is a small number (c < 3) because of the complex nature of this problem. For example,
rules with fewer than three attributes tend to have higher false positive rates. Previous
studies have identified rules using decision trees (Herzog 2006, Pittman 2008, and Yu et
al 2010), but did not look at coverage and accuracy per rule.
The concept of identifying
predictive rules in this study is similar to identifying association rules using a coverage
measure for association rule mining (Agrawal et al. 1994). When identifying rules, it is
possible for a rule to have conflicting classes; in such a situation, the rules are grouped as
uncertain rules.
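The rule-abstraction filter just described can be sketched as below, with the study's thresholds (t = 70% accuracy, c = 3 conditions) hard-coded; the dictionary structure for a rule is a hypothetical representation, not a format used by the study's tools.

```python
# Sketch of the rule filter described above: keep rules with at least three
# condition attributes and accuracy of at least 70%; rules whose covered
# instances split across classes are set aside as "uncertain" rules.
T_ACCURACY = 0.70   # threshold t from the text
MIN_CONDITIONS = 3  # threshold c from the text

def filter_rules(rules):
    """rules: list of dicts like {"conditions": [...], "class": "D",
    "accuracy": 0.85, "conflicting": False} (hypothetical structure)."""
    predictive, uncertain = [], []
    for rule in rules:
        if len(rule["conditions"]) < MIN_CONDITIONS:
            continue  # too general: tends to have a high false-positive rate
        if rule.get("conflicting"):
            uncertain.append(rule)  # same conditions cover conflicting classes
        elif rule["accuracy"] >= T_ACCURACY:
            predictive.append(rule)
    return predictive, uncertain
```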
The final part of the ensemble learning involves validation with other methods
used in previous studies (Delen 2012, Eitel et al. 2012, Marquez et al. 2012, Pittman
2008, and Yadav et al. 2012). The same dataset is used to generate a decision tree using
C4.5, CART, and recursive partitioning. The rules generated from these processes are
extracted similarly.
Ensemble Learning with Clustering
Clustering is an alternative method of selecting attributes for the ensemble
learning method mentioned previously. Clustering selects attributes based on some
similarity as opposed to the attribute selection method, which selects attributes based on
ranking. The method employed in this study is K-means clustering (Kaufmann et al.
1990). The distance function used for the study is Euclidean Distance (ED) as opposed to
Manhattan Distance. ED uses the mean for the distance calculation and the latter uses the
median. The user has the option of choosing the k value for the number of clusters. Since
the results can vary with the size of k, the k value is chosen after experimentation.
Experiments and Results
The proposed approach is tested with the following software: recursive
partitioning using JMP (Gaudard et al. 2006), and C4.5 and CART using WEKA (Hall et
al. 2009). WEKA, which is open-source software for C4.5 (or “J48”) and CART (or
“SimpleCart”), is also used for K-means clustering. For comparison purposes with other
studies, this study uses 10-fold cross validation on the Fall 2010 dataset, Fall 2011
dataset, and the dataset generated by merging the two.
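For readers who want to approximate this protocol outside WEKA and JMP, the 10-fold cross-validation comparison can be sketched with scikit-learn; note that DecisionTreeClassifier only approximates C4.5 (J48) and CART (SimpleCart), and the feature matrix X and labels y are assumed to be prepared elsewhere.

```python
# Sketch of the 10-fold cross-validation protocol using scikit-learn stand-ins.
# DecisionTreeClassifier with entropy/gini only approximates J48 (C4.5) and
# SimpleCart; X and y are assumed to be a prepared feature matrix and labels.
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def compare_methods(X, y):
    methods = {
        "C4.5-like (entropy)": DecisionTreeClassifier(criterion="entropy"),
        "CART-like (gini)": DecisionTreeClassifier(criterion="gini"),
    }
    for name, clf in methods.items():
        scores = cross_val_score(clf, X, y, cv=10)  # 10-fold cross validation
        print(f"{name}: mean accuracy {scores.mean():.4f}")
```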
As mentioned before, to predict a rule, the following two conditions must be
satisfied: the minimum number of attributes in the condition, c, is three, and the
threshold, t, for each significant rule is 70% accuracy. For example, a rule with one
attribute may cover many instances of the dataset, but the instances covered usually
consist of a mix of the classes (conflicting classes) which lowers the rule’s accuracy. In
general, as attributes are added to the rule, the rule becomes more specialized, covers
fewer instances, and rises in accuracy since more of the conflicting classes are
successively eliminated. Although a higher accuracy threshold might seem desirable, it
turns out that a threshold of 80% creates fewer rules. A 60% threshold covers more
instances, but causes a larger number of false positive counts. A threshold of 70% seems
a reasonable compromise.
All three datasets are tested without clustering using the nine methods shown in
Table 3. Default conditions and 10-fold cross validation are used in all the methods to
generate the trees. The recursive partition method generated the most consistent results
across the data sets and the highest accuracy in comparison to the other methods. It is
known to be an accurate method that can overfit the data. Each method performed the
best on the Fall 2010 dataset and the worst on the Fall 2011 dataset, showing that each
incoming cohort of students has differing dropped/retained characteristics. All methods
performed at about the same accuracy except for recursive partition (best) and random
tree (worst). A clue for each method’s performance is provided in the next table
discussed.
Table 4 shows the total number of rules generated using each method and
consequently the number of subsets partitioned on the dataset. The random tree method
produced far more rules and subsets while Yu’s method produced the fewest. Ensemble
learning, ADTree, and Yu’s methods are consistent in the number of rules generated
using the three datasets. Recursive partition, however, presents the interesting outlier
with a relatively high number of rules, but not nearly so much as random tree. For each
splitting point in the tree, recursive partition statistically analyzes each attribute and its
potential values for best satisfaction of a purity measure of the resulting dataset partitions.
Clearly, such splitting analysis yields greater success than that of the other decision
tree methods.
Table 3. Average accuracy in % without clustering

Method                Fall 2010   Fall 2011   Merged dataset
C4.5                  83.1403     75.4316     78.961
CART                  82.5612     75.2842     79.0368
ADTree                83.5412     75.7053     79.3831
PART                  81.5145     72.7368     76.9372
J48graft              82.9844     75.5789     79.0909
RandomTree            75.2116     66.9895     71.0714
Yu's method           82.81       74.84       78.71
Recursive partition   87.55       85.09       86.17
Ensemble learning     82.81       75.58       79.17
Table 4. Total no. of rules generated without clustering

Method                Fall 2010   Fall 2011   Merged dataset
C4.5                  110         11          158
CART                  6           6           10
ADTree                21          21          21
PART                  179         193         402
J48graft              99          15          163
RandomTree            6701        8568        17544
Yu's method           4           4           4
Recursive partition   277         452         866
Ensemble learning     14          13          12
In Model 2, the merged dataset (all attributes) is clustered using K-means and EM
(Expectation-Maximization) clustering. The method chosen for testing the optimum
cluster size is C4.5 with the default setting utilizing 10-fold validation for each cluster.
The reason behind selecting C4.5 is to compare it with previous studies (Delen 2010,
Pittman 2008, Yadav et al. 2012, Marquez et al. 2012, and Eitel et al. 2012). In Table 5,
several measures are used to evaluate the results of C4.5 on each cluster. The average
accuracy is computed across the clusters, as well as the average true positive (TP) rate
(correctly classified instances, also known as recall), the average false positive (FP) rate
(incorrectly classified instances), the average precision (fraction of retrieved instances
that are correctly classified), F-measure (harmonic mean of precision and recall), and
average ROC area (area under the curve of the TP versus FP plot). The benefit of using
several measures is that potential inadequate performance may not be reflected in all of
the measures, but will be shown in others. For example, while the highest accuracy is
obtained using EM’s thirteen clusters, the FP rate is also high and the ROC score
(below 0.5) indicates that the classifier is not properly distinguishing among the instances.
In fact, as the clusters increase, the ROC score decreases, indicating poorer classification
performance. It turns out that several of the clusters in the larger numbers of clusters
have poor FP rates (contain conflicting classes) due to the mixture of dropped and
retained instances in the cluster. Choosing different attributes on which to cluster might
possibly purify the resulting clusters more; however, such pure clustering is known to be
difficult to achieve. K-Means cluster sizes of 3 and 5 seem likely candidates to choose.
Cluster size 3 is chosen because of its higher accuracy, precision, and TP/recall rate, as
well as its competent ROC area score that is close to the score of K-Means with 5
clusters.
Table 5: Different measures with different cluster sizes using C4.5

                                        Avg.      Avg. TP Rate  Avg. FP    Avg.       Avg.       Avg. ROC
                                        Accuracy  (recall)      Rate       Precision  F-measure  Area
Original dataset (C4.5)   Cluster 1     0.7861    0.79          0.288      0.789      0.789      0.8
K-means clustering        Cluster 3     0.84564   0.845667      0.699333   0.838      0.804333   0.704667
K-means clustering        Cluster 5     0.83667   0.8366        0.6266     0.8176     0.8066     0.7286
K-means clustering        Cluster 7     0.83391   0.8338571     0.6645714  0.7835714  0.7958571  0.6445714
K-means clustering        Cluster 9     0.84171   0.8417778     0.5866667  0.8097778  0.8191111  0.6323333
K-means clustering        Cluster 11    0.84675   0.8468182     0.62       0.8107273  0.82       0.6218182
K-means clustering        Cluster 13    0.85582   0.8661538     0.7476154  0.8271538  0.83       0.5976923
K-means clustering        Cluster 15    0.85405   0.854         0.7006667  0.8087333  0.8222     0.578
EM clustering (default)   Cluster 13    0.89085   0.890846      0.866692   0.817923   0.851154   0.468462
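The per-cluster evaluation behind Table 5 can be sketched as follows. This is an assumed reconstruction using scikit-learn (the study used WEKA's C4.5 and clustering): it clusters the data, runs 10-fold cross-validation within each cluster, and averages the measures across clusters.

```python
# Assumed reconstruction of the Table 5 procedure: cluster, run a C4.5-like
# tree with 10-fold cross validation inside each cluster, average the measures.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier

def evaluate_cluster_size(X, y, k):
    """X: numpy feature matrix; y: numpy class labels ('D'/'R'); k: cluster count."""
    labels = KMeans(n_clusters=k, n_init=10, random_state=1).fit_predict(X)
    per_cluster = []
    for c in range(k):
        Xc, yc = X[labels == c], y[labels == c]
        pred = cross_val_predict(DecisionTreeClassifier(criterion="entropy"),
                                 Xc, yc, cv=10)
        per_cluster.append((accuracy_score(yc, pred),
                            recall_score(yc, pred, pos_label="D"),    # TP rate
                            precision_score(yc, pred, pos_label="D"),
                            f1_score(yc, pred, pos_label="D")))
    return np.mean(per_cluster, axis=0)  # averages across the k clusters
```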
Table 6 shows the results of the decision tree methods with a cluster size of three.
In comparison to the results without clustering, the Fall 2010 dataset did not receive a
great deal of advantage from clustering (only 2% on two methods), but the Fall 2011 set
did with as high as a 10% increase in accuracy. The merged dataset received a more
modest 6% boost in accuracy. Again, recursive partitioning is the most accurate of the
nine methods with ensemble learning not far behind.
Table 6: Average accuracy in % after clustering

Method                                Fall 2010   Fall 2011   Merged dataset
C4.5 with clustering                  83.119      82.552      84.564
CART with clustering                  83.233      82.579      84.522
ADTree with clustering                83.109      82.756      83.844
PART with clustering                  81.348      81.894      82.758
J48graft with clustering              82.957      82.577      84.528
Random Tree with clustering           75.348      76.331      76.586
Yu's method with clustering           82.34       81.95       82.163
Recursive partition with clustering   89.12       87.646      90.0466
Ensemble learning with clustering     84.503      82.583      83.56
In Table 7, the number of rules is expressed as the sum of the rules in each
cluster. Equivalent rules are not deleted. The number of rules within a cluster falls at or
below the number of rules in the non-clustered results, which is expected for clusters
forming subsets of the larger dataset.
Table 7: Total no. of rules after clustering

Method                                Fall 2010         Fall 2011         Merged dataset
C4.5 with clustering                  45+71+78          0+0+20            5+30+5
CART with clustering                  3+4+22            14+0+11           3+7+3
ADTree with clustering                21+21+21          21+21+21          21+21+21
PART with clustering                  116+61+30         86+37+53          77+137+76
J48graft with clustering              75+76+101         1+1+20            7+61+9
Random Tree with clustering           3528+2079+1193    2887+1843+2263    6041+5075+3156
Yu's method with clustering           4+4+2             4+3+4             2+3+2
Recursive partition with clustering   232+88+77         170+102+66        334+265+173
Ensemble learning with clustering     13+13+13          13+12+12          13+13+14
Sample rules from the merged dataset are shown in Table 8. Each rule has
conditions followed by a colon and the class (D)rop or (R)etain. The coverage is
calculated using the decision tree partition size for the rule over the total number of
instances. Accuracy is calculated as the number of instances classified correctly by the
rule over the total partition size. The coverage ranged from around 41% to less than 1%
of the data. The accuracy ranged from 100% of the covered instances to around 12%.
The results reinforce that the more specialized the rule, the lower the coverage, but the
higher the accuracy.
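The coverage and accuracy measures as just defined take only a few lines to compute; the rule predicate below is a hypothetical callable standing in for the attribute conditions shown in Table 8.

```python
# Coverage and accuracy exactly as defined above: coverage is the fraction of
# all instances a rule's conditions match; accuracy is the fraction of matched
# instances whose class equals the rule's predicted class.
def coverage_and_accuracy(rows, matches, predicted_class):
    """matches: hypothetical predicate over a row (dict of attribute values)."""
    covered = [r for r in rows if matches(r)]
    coverage = len(covered) / len(rows)
    accuracy = (sum(1 for r in covered if r["class"] == predicted_class)
                / len(covered)) if covered else 0.0
    return coverage, accuracy

# Example with Yu's method rule from Table 8 (GRANT=N & LOAN=N : D):
# cov, acc = coverage_and_accuracy(
#     dataset, lambda r: r["GRANT"] == "N" and r["LOAN"] == "N", "D")
```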
Table 9 lists the number of rules that met the study criteria of three or more
attributes with 70% or more accuracy over the whole dataset. With lower coverage of the
data set comes higher accuracy; recursive partitioning again outperforms the
other methods with higher coverage and accuracy, although it has lower coverage and many
more rules than ensemble learning.
Table 8. Sample rules using merged dataset

C4.5 (coverage 0.152%, accuracy 78.571%):
  GRANT=Y & GPA_CHAR=Fail & ETHNIC=WH & Greek_Life=N & SAT_RANGE1=950<=SAT<=1000 & Degree=BA & GENDER=F : D

C4.5 with clustering (coverage 4.491%, accuracy 70.120%):
  Degree=BA & percentile_range=2ndquarter & LOAN=N & GRANT=N : D

CART (coverage 1.093%, accuracy 60.396%):
  GRANT=N & LOAN!=N & GPA_CHAR=(Fail) & SAT_RANGE1=(850<=SAT<=900)|(750<=SAT<=800)|(700<=SAT<=750)|(1150<=SAT<=1200)|(1200<=SAT<=1250)|(1100<=SAT<=1150)|(950<=SAT<=1000)|(1350<=SAT<=1400)|(0<=SAT<50)|(1400<=SAT<=1450)|(1500<=SAT<=1550)|(1450<=SAT<=1500)|(650<=SAT<=700)|(1550<=SAT<=1600)|(500<=SAT<=550) : D

CART with clustering (coverage 3.983%, accuracy 12.228%):
  Degree=(BA)|(BLA) & percentile_range=(2ndquarter) & GPA_CHAR=(Low)|(High)|(Medium) & LOAN=Y & GRANT!=N : D

Recursive partition (coverage 0.195%, accuracy 100.000%):
  GRANT=N & LOAN=N & GPA_CHAR(Low, Fail, Missing) & GENDER(M) & FALL_ATMP_RANGE(T) : D

Recursive partition with clustering (coverage 32.121%, accuracy 73.214%):
  ^ & percentile_range(Next15, 4thQuarter, 3rdQuarter, Top10) & Degree(BGS, BS, NDU) : R

ADTree (coverage 21.331%, accuracy 66.362%):
  GRANT=N & LOAN=N & GPA_CHAR!=Fail : D

ADTree with clustering (coverage 14.719%, accuracy 27.647%):
  Degree=BS AND percentile_range=Next15 : D

PART (coverage 2.554%, accuracy 85.532%):
  GRANT=N AND LOAN=N AND GPA_CHAR=Fail AND GENDER=M AND Greek_Life=N : D

PART with clustering (coverage 0.006%, accuracy 87.719%):
  Degree=BS AND SAT_RANGE1=1400<=SAT<=1450 AND Major_CHANG_FALL_Spring=N : D

J48graft (coverage 0.476%, accuracy 65.909%):
  GRANT=Y & GPA_CHAR=Fail & ETHNIC=WH & Greek_Life=N & SAT_RANGE1=1100<=SAT<=1150 & FALL_ATMP_RANGE=F : D

J48graft with clustering (coverage 4.491%, accuracy 70.120%):
  Degree=BA & percentile_range=2ndquarter & LOAN=N & GRANT=N : D

Random Tree (coverage 0.011%, accuracy 100.000%):
  percentile_range=Next15 & GRANT=Y & Parent_Char=13P & GPA_CHAR=Low & SAT_RANGE1=1050<=SAT<=1100 & Major_CHANG_FALL_Spring=Y : D

Random Tree with clustering (coverage 0.087%, accuracy 50.000%):
  percentile_range=2ndquarter & GPA_CHAR=Medium & Parent_Char=P & SAT_RANGE1=1100<=SAT<=1150 & ETHNIC=WH & Greek_Life=N & FALL_ATMP_RANGE=F : D

Yu's method (coverage 25.465%, accuracy 68.508%):
  GRANT=N & LOAN=N : D

Yu's method with clustering (coverage 40.985%, accuracy 27.013%):
  Degree(BM, BBA, BS, BFA, BID, BAR, BGS, NDU) & percentile_range(Next15, 4thQuarter, 3rdQuarter, Top10) : D

Ensemble learning (coverage 10.855%, accuracy 94.417%):
  GRANT=Y & GPA_CHAR(High, Medium, Low) & SAT_RANGE1(1500<=SAT<=1550, 1550<=SAT<=1600, 500<=SAT<=550, 650<=SAT<=700, 1400<=SAT<=1450, 1350<=SAT<=1400, 1450<=SAT<=1500, 1300<=SAT<=1350, 0<=SAT<50, 1250<=SAT<=1300) : R

Ensemble learning with clustering (coverage 26.861%, accuracy 64.666%):
  GPA_CHAR(High, Medium, Low) & ETHNIC(HP, M, PR, AS, B, WH, AI) & LOAN=N : R
Table 9: Total number of rules above threshold with three or more conditions

Method                        Rules (merged   Min.        Max.        Total               Average
                              dataset)        coverage %  coverage %  coverage %          accuracy %
C4.5                          52              0.011       5.768       11.86               93.00
C4.5 cluster                  0+11+0          0.020       6.240       0+21.73+0           0+77.00+0
CART                          10              19.500      52.370      100.00              73.145
CART cluster                  0+5+0           0.281       14.048      0+30.00+0           0+81.00+0
Recursive partition           81              0.649       10.963      68.84               91.666
Recursive partition cluster   96+52+52        0.011       5.812       30.66+13.32+15.28   96.78+95.249+94.81
C4.5                          51              0.011       5.768       11.75               92.93
C4.5 cluster                  2+11+4          0.011       62.446      1.266+21.017+0.714  52.916+60.93+27.00
Ensemble learning             8               8.658       31.797      77.53               85.969
Ensemble learning cluster     12+6+11         0.087       44.456      92.00+33.04+50.09   63.13+72.98+80.32
Table 10. Attrition rules from different clusters of the merged dataset

Rule: Grant=N, Loan=N : D
  Merged dataset: coverage 25.465%, accuracy 68.508%
  Fall 2010: coverage 22.74%, accuracy 74.93%
  Fall 2011: coverage 28.04%, accuracy 63.59%

Rule: Grant=N, Loan=N, Parent_Char=8P, GPA_CHAR=Fail, Greek_Life=N : D
  Merged dataset: coverage 1.320%, accuracy 88.525%
  Fall 2010: coverage 1.38%, accuracy 91.94%
  Fall 2011: coverage 1.26%, accuracy 85.00%

Rule: Grant=N, Loan=N, ETHNIC=WH, GPA_CHAR=Low, Greek_Life=N : D
  Merged dataset: coverage 1.45%, accuracy 77.612%
  Fall 2010: coverage 1.96%, accuracy 79.55%
  Fall 2011: coverage 0.97%, accuracy 73.91%

Rule: GRANT=N & LOAN=N & GENDER(M) : D
  Merged dataset: coverage 14.13%, accuracy 72.43%
  Fall 2010: coverage 12.49%, accuracy 77.54%
  Fall 2011: coverage 15.68%, accuracy 68.59%

Rule: GPA_CHAR(Fail, Missing) & Degree(BM, BBA, BS, BLA) & percentile_range(4thQuarter, Top10, 3rdQuarter, Next15) & FALL_ATMP_RANGE(T, H) : D
  Merged dataset: coverage 0.22%, accuracy 0.8%
  Fall 2010: coverage 0.156%, accuracy 85.714%
  Fall 2011: coverage 0.274%, accuracy 76.923%
The most interesting attrition result noted in this study concerned
financial aid. Without financial aid, students tended to drop, as indicated in Table 11.
Other attributes that indicated attrition included lower college GPA, lower enrollment
(total attempted hours), lower class percentile (high school class ranking), lower parent
educational qualifications (relating to social background), and low involvement in campus
activities (like Greek life). The attributes of attrition common to the three datasets were
lack of grants, lack of loans, and male gender, as shown in Table 11.
Table 11: Rules for attrition with common attribute conditions

Merged dataset: GRANT=N&LOAN=N&GENDER(M) : D, accuracy = 72.43%
Fall 2010:      GRANT=N&LOAN=N&GENDER(M) : D, accuracy = 77.54%
Fall 2011:      GRANT=N&LOAN=N&GENDER(M)&percentile_range(2ndquarter, 3rdQuarter) : D, accuracy = 73.08%
Conclusion and Future work
This study did find methods that created usable rules, but for such a complex
problem, the process needs to be investigated further for better accuracy and a unified
model. It is not enough to use just a few top ranked attributes to generate rules. For
example, the rules generated in the clusters of each of the datasets were anomalous when
tested on the whole dataset as shown in Table 12. The rules are numbered for illustrative
purposes only. R1 was generated from a cluster using C4.5, but when it was tested using
the whole dataset, the resulting class was R rather than D as generated in that cluster. No
anomaly occurred in predicting the classes when CART was used. R4 is a sample rule
from recursive partitioning that had a different class predicted for the whole dataset.
Ensemble learning had another anomaly: different clusters produced two rules with the
same attributes and class except for the PELL attribute. When validated against the
whole dataset, both rules had the opposite class.
Table 12: Sample rules predicting opposite class in cluster

C4.5 cluster (R1):
  Degree=BA & percentile_range=Top10 & GRANT=Y : D
  Actual class from the whole dataset: R

CART cluster (R2):
  All rules predicted the same class from the whole dataset.

J48graft cluster (R3):
  Degree=BA & percentile_range=Top10 & GRANT=N & LOAN=Y : D
  Actual class from the whole dataset: R

Recursive partition cluster (R4):
  GPA_CHAR(High, Medium, Low) & GRANT=N & Greek_Life(N) & Major_CHANG_FALL_Spring(N) & Parent_Char(4P, 7P) : D
  Actual class from the whole dataset: R

Ensemble learning clustering (R5):
  GPA_CHAR(Fail, Missing) & Degree(BM, BBA, BS, BLA) & percentile_range(2ndquarter) & PELL_Y(N) : R
  GPA_CHAR(Fail, Missing) & Degree(BM, BBA, BS, BLA) & percentile_range(2ndquarter) & PELL_Y(Y) : R
  Actual classes from the whole dataset: D, D
In Table 13, the rules generated are similar to each other except for one attribute
condition. If this attribute is omitted or combined, several rules can be combined into
one. For rules 1 and 2, the last attribute condition has two values (Female or Male)
allowing the attribute to be deleted from the rule conditions. For rules 3 and 4, the rules
can be combined by including an OR as shown in the combined rule (3, 4). In the future,
the study will be extended to combine rules as shown in the example below so that they
will be clearer and more condensed.
Table 13: Example of combining rules

1 | GPA_CHAR(High, Medium, Low) & GRANT=Y & LOAN=N & ETHNIC(NR, U, HI, B, WH, MX) & GENDER(M) = R (37/0 = 100%)
2 | GPA_CHAR(High, Medium, Low) & GRANT=Y & LOAN=N & ETHNIC(NR, U, HI, B, WH, MX) & GENDER(F) = R (269/16 = 94.38%)
Combined rule (1,2) | GPA_CHAR(High, Medium, Low) & GRANT=Y & LOAN=N & ETHNIC(NR, U, HI, B, WH, MX) & (GENDER(M) or GENDER(F)) = R
  OR
  GPA_CHAR(High, Medium, Low) & GRANT=Y & LOAN=N & ETHNIC(NR, U, HI, B, WH, MX) = R
3 | GPA_CHAR(Fail, Missing) & Major_CHANG_FALL_Spring(Y) & Parent_Char(4P, 1P, P, 6P) & SAT_RANGE1(950<=SAT<=1000, 1050<=SAT<=1100, 1150<=SAT<=1200, 1200<=SAT<=1250, 1250<=SAT<=1300, 1300<=SAT<=1350, 650<=SAT<=700) & FALL_ATMP_RANGE(FM) = D (75%)
4 | GPA_CHAR(Fail, Missing) & Major_CHANG_FALL_Spring(Y) & Parent_Char(4P, 1P, P, 6P) & SAT_RANGE1(950<=SAT<=1000, 1050<=SAT<=1100, 1150<=SAT<=1200, 1200<=SAT<=1250, 1250<=SAT<=1300, 1300<=SAT<=1350, 650<=SAT<=700) & FALL_ATMP_RANGE(F, H, T) = D (100%)
Combined rule (3,4) | GPA_CHAR(Fail, Missing) & Major_CHANG_FALL_Spring(Y) & Parent_Char(4P, 1P, P, 6P) & SAT_RANGE1(950<=SAT<=1000, 1050<=SAT<=1100, 1150<=SAT<=1200, 1200<=SAT<=1250, 1250<=SAT<=1300, 1300<=SAT<=1350, 650<=SAT<=700) & FALL_ATMP_RANGE(F, H, T, FM) = D
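As a minimal sketch of this planned combination step (the rule representation below—an attribute-to-values mapping plus a predicted class—is an illustration of ours, not the study's tooling), two rules that agree everywhere except one attribute condition can be merged by taking the union of that attribute's values:

```python
# Sketch of combining rules that differ in exactly one attribute condition.
# A rule is modeled as (conditions, predicted_class), where conditions maps
# each attribute to the set of values it may take. Names are illustrative.

def try_combine(rule_a, rule_b):
    conds_a, class_a = rule_a
    conds_b, class_b = rule_b
    if class_a != class_b or set(conds_a) != set(conds_b):
        return None                      # must share class and attribute set
    diff = [a for a in conds_a if conds_a[a] != conds_b[a]]
    if len(diff) != 1:
        return None                      # only a single differing condition
    merged = dict(conds_a)
    merged[diff[0]] = conds_a[diff[0]] | conds_b[diff[0]]   # OR the values
    return merged, class_a

rule1 = ({"GRANT": {"Y"}, "LOAN": {"N"}, "GENDER": {"M"}}, "R")
rule2 = ({"GRANT": {"Y"}, "LOAN": {"N"}, "GENDER": {"F"}}, "R")
print(try_combine(rule1, rule2))
# -> ({'GRANT': {'Y'}, 'LOAN': {'N'}, 'GENDER': {'F', 'M'}}, 'R')
# If the merged condition now spans every possible value of the attribute
# (here M and F), it can be dropped entirely, as in combined rule (1,2).
```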
This study will be expanded to include more data, such as within-semester data
like mid-term grades, and to investigate the possibility of combining different data mining
algorithms, such as decision trees and association mining, for rule generation. For
example, generating trees in multiple stages using a decision tree algorithm and, in the
next stage, using association mining techniques will help to sample rules at earlier stages.
Finally, rule combination techniques will be derived with the goal of reducing the number
of generated rules to a set that covers as much of the dataset as possible with good
accuracy.
CHAPTER III
MULTI-STAGE DECISION METHOD TO GENERATE RULES
FOR STUDENT RETENTION
Abstract
The analysis of student retention data to predict which students are likely
to drop out is an increasingly important problem for institutions that must meet legislative
mandates, face budget shortfalls due to decreased tuition or state-based revenue, and fall
short of producing enough graduates in fields of need, such as technology. In fact, several
researchers have achieved a level of success in predictive analysis of student retention
data by utilizing data modeling techniques, such as those used in statistics (clustering)
and data mining (decision trees). Usually, researchers apply one technique at a time to
the data and derive rules characterizing students who drop out, but applying more than
one technique in stages could give broader and better rule extraction as well as the
additional benefits of higher accuracy and identification of retention indicators. Thus, the
authors propose a multiple stage decision method to overcome rule extraction issues
encountered previously when using ensemble learning with clustering and decision trees.
These issues include rules with anomalous classes and rules with attributes only chosen
by the decision tree method. To improve rule extraction, the study described in this paper
uses a multi-stage decision method with clustering, controlled decision trees, and
association mining. In the first stage of the study, clustering and the decision tree model
segment the dataset into smaller datasets that have fewer attributes than those of the
original dataset. In the next stage, association mining generates rules for each of the
smaller datasets and unifies its rule set with those of the decision tree. Because many
rules result, the method filters rules by consideration of rule accuracy, data coverage,
number of attributes, and similarity. To characterize the method’s performance, the
authors measure coverage, precision, and accuracy on different student datasets. The
performance of the method is promising, and results are in use at Texas Tech to find
students at risk of dropping out.
Key Words: association rules, data mining, decision tree, multi-stage decision rules,
recursive partition, retention, rule sets
Introduction
Student retention data is difficult to analyze due to changing enrollment trends,
the effect of new or discontinued programs at an institution, the characteristics of
different cohorts of students, and the multitude of reasons that exist for a student to leave
an institution that may not be related to financial need or social integration. In addition,
high withdrawal rates occur in students’ second year of enrollment (Nandeshwar et al.,
2011). Thus, identification of at-risk students as early as is practical is desirable so that
interventions may be applied quickly. Encouraging students to stay at an institution in
situations of mutual benefit is a win-win situation, but institutions may have the
additional stress of boosting retention numbers to qualify for legislative budget increases.
Early identification of at-risk students is facilitated through the use of data mining
techniques on student data, such as those techniques used in the previous studies of
Herzog (2006), Zhang et al. (2010), Yu et al. (2010), Pittman (2008), Luan (2002), Lin
(2012), Lykourentzou (2009), and Macfadyen et al. (2010). These studies indicate that
decision trees help in identifying the rules for student retention and attrition. Further,
Datta and Mengel (2014a) used two years of school data to show that the rules generated
by ensemble learning utilizing clustering and the recursive partition data mining
technique are effective, but could be improved. For example in Table 14, three rules are
shown where coverage increases with fewer attributes, but the accuracy does not.
Additional issues found with the rules are:
• Some rules generated from the clustered dataset had anomalous results when tested against the original dataset, and
• Rules only had the attributes as chosen by the decision tree method.
Table 14. Accuracy and Coverage using classification on Retention Data (Merged Dataset)

Rules | Coverage% | Accuracy%
Grant=N & Loan=N & ETHNIC=WH & GPA=Low & Greek Life=N : D | 1.45 | 77.612
GRANT=N & LOAN=N & GENDER(M) : D | 14.13 | 72.43
GPA=(Fail, Missing) & Degree(BM, BBA, BS, BLA) & percentile_range(4thQuarter, Top10, 3rdQuarter, Next15) & FALL_ATMP_RANGE(T, H) : D | 0.22 | 0.8
In order to generate rules, decision tree methods are very expedient since the path
from the root to the leaf is one rule. In addition, decision trees have precedent in prior
studies, such as Delen (2010), Yadav et al. (2012), Marquez et al. (2012), and Eitel et al.
(2012). This study continues the work started in Datta and Mengel (2014a) which looked
at several decision tree methods, but proposes to use them in multi-stage steps to find
improvements in accuracy and rules, similar to Senator (2005) as described below.
Additionally, this study uses the guidelines for attribute selection utilizing techniques,
such as information gain and Chi-Square, as discussed by Nandeshwar et al. (2011). This
study’s contribution is increased modeling and rule accuracy by utilizing a multi-stage
decision method.
To provide the context for this work, the Section Related work outlines related
studies on retention that have used various data mining methodologies. It also briefly
introduces studies that have developed multi-stage decision trees that have multiple
classes. The Section Data for the Study explains the attributes used in the study and gives
the data sources for the study. The Section Methodology explains the process to develop
a multiple stage model. The Section Results investigates the results of the model and
compares it to other studies. The Section Conclusion discusses various conclusions based
on the results obtained from Section Results. In addition, the authors compare the results
obtained in this study with the results obtained from other studies outlined in Section
Related work. The Section Future work presents ideas about directions for future
research.
Related work
Background
Qualitative and quantitative studies identify the causes for attrition. Some
examples of qualitative studies utilizing survey-based data, such as attributes of
demographics, academic integration, academic achievement, social background, and
financial aid, are Tinto (1975), Bean (2001), Sewell and Wegner (1970), ACT (2004),
and Tinto (2006). Comparable attributes are used in quantitative studies utilizing
institutional student data that apply data mining techniques: Zhang et al. (2010), Marquez
et al. (2013), Yu et al. (2010), Pittman (2008), Bayer (2012), Delen (2010), Herzog
(2006), Kotsiantis (2009), Lykourentzou et al. (2009), Macfadyen et al. (2010),
Nandeshwar et al. (2011), Eitel et al. (2012), and Yadav et al. (2012).
Kerkvliet et al. (2004) show that different institutions experience different
attrition causes; for example, grants may favorably influence retention at one institution
but not at another. Pittman, Herzog, Zhang, Nandeshwar et al., and Yu et al. have
noted that the first year in college is more critical for a student than later years.
Delen (2010), Eitel (2012), Kotsiantis (2009), and Marquez (2013) balance the dataset to
get better accuracy. Lin (2012) replicated his data to remove the imbalance but had better
accuracy with the original class distribution.
The tool used most often in quantitative studies is the decision tree because of its
transparency, ease of learning, ease of understanding, and acceptance. Other data mining
techniques in use are MARS, neural networks, logistic regression, Bayesian classification,
and genetic algorithms. Researchers use different attribute selection methods to lower the
high dimensionality of the datasets because not all attributes are useful for predicting attrition.
Multi-stage Decision Tree
The first part of this section highlights the multiple stage processes used by Huo
et al. (2006) and Lu et al. (2009) on datasets with more than two classes. The last part of
this section describes Senator’s (2005) study using multiple stages to generate rules for
datasets having fewer instances of the positive class or class of interest. The minority
class data distribution in this study is similar to Senator’s (2005) data distribution.
Two studies, Huo et al (2006) and Lu et al. (2009), deal with multi-stage decision
trees for datasets that have more than two classes. The motivation behind their studies is
to generate simple, short rules and achieve lower error rates. Both studies use the same
datasets (Car, derm, ecoli, and glass) from the UCI repository (UC Irvine Machine
Learning Repository). Huo et al. (2006) propose a new method called a multi-stage
decision tree (MDT) to extract general rules with lower error-rates. The basis of the
method is maximum margin learning using the Support Vector Machine (SVM) (Wang et
al., 2005) to separate the data in the multiple-class space into two classes (positive and
negative) each of which may contain more than one class. A recursive process ensues
where the ID3 algorithm generates a tree for each newly created two-class dataset using
maximum margin SVM until only one of the original classes is contained. Huo et al.
mention that the number of rules generated from each of the datasets is smaller than with
the traditional C4.5 decision tree algorithm.
Lu et al. (2009) wanted to reduce the complexity of re-labeling and re-grouping of
classes that occurred in the study of Huo et al. (2006); hence, Lu et al. utilized maximum
margin SVM to separate the given dataset, treated as if it had no classes, into two classes. The
C4.5 algorithm derived a decision tree from the newly created two-class dataset and
continued recursively until the newly generated dataset had only two of the original
classes. Lu et al. did reduce complexity and generated a set of decision trees with shorter
and simpler rules.
Senator (2005) aimed to achieve greater accuracy by using a two-stage
classification process. In the first stage, the data set is classified using fewer attributes,
and those instances that are suspicious (true negative and false positive in the confusion
matrix) are carried over to the next stage for further classification work; the other
instances are dropped at this point. In the second stage, more complex methods are
utilized, such as human intervention and linkage analysis. In addition, more attributes are
added to the remaining suspicious instances from the first stage. The accuracy is
calculated using the ‘true positives’ (x1) and ‘true negatives’ (y2) from the first and
second stages. Senator performed his study in two different domains, both with large
data sets, but with a low number of instances in the class of interest. He used HIV and
counter-terrorism detection data sets where only 5% of the dataset tested as a high-risk
group with a very low false positive rate. The metrics utilized to measure the results are
the ROC curve, lift curve, sensitivity, and specificity.
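A minimal sketch in the spirit of Senator's two-stage filter follows (an illustration under our own assumptions, not his implementation; the classifiers and NumPy-array inputs are hypothetical stand-ins):

```python
# Sketch of a two-stage classification filter in the spirit of Senator (2005):
# Stage 1 screens with few attributes; only instances flagged suspicious move
# on to Stage 2, which re-classifies them using a richer attribute set.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

def two_stage_predict(Xtr_few, Xtr_more, ytr, Xte_few, Xte_more):
    # Xtr_*/Xte_* are NumPy arrays; "few" uses a small attribute subset.
    stage1 = LogisticRegression(max_iter=1000).fit(Xtr_few, ytr)
    suspicious = stage1.predict(Xte_few) == 1          # cheap first screen
    stage2 = RandomForestClassifier(n_estimators=100).fit(Xtr_more, ytr)
    final = np.zeros(len(Xte_few), dtype=bool)
    if suspicious.any():                               # richer second look
        final[suspicious] = stage2.predict(Xte_more[suspicious]) == 1
    return final                                       # True = high risk
```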
Data for the Study
The approach proposed in this paper is evaluated by applying it to an internal Texas
Tech University dataset of first-time admitted undergraduates. During the
preprocessing of the data, the attribute selection methods—candidate_G2, Logworth,
ChiSquare, NaiveBayesSimple_Accuracy, GainRatio, NaiveBayesSimple_RMSE, and
SignificanceAttributeEval—are used with voting to select the final attributes (a sketch of
this voting step follows the data summary below). The final
attributes are as given in Table 15, and the data characteristics are as given in Table 16.
Further details of the data are given below.
• Attributes
  o demographic: age, ethnicity, geographic location, residency, and gender
  o academic: attempted hours, test scores, class percentile, earned hours, GPA, and major changed
  o financial aid data: grants, scholarships, and loans
  o social factor: parent’s education and student organization
• Classes- retained (R) and dropped (D) as determined by the student’s enrollment status during the 2nd year in college
• Source- new freshmen, all majors, admitted for Fall 2010, Fall 2011, and a merged dataset of both
• Validation and testing- Fall 2012 dataset
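A minimal sketch of the rank-and-vote attribute selection used in preprocessing (scikit-learn's chi-squared and mutual-information scorers stand in for the seven methods named above; the voting scheme shown is an assumption):

```python
# Sketch of voting-based attribute selection: each scorer nominates its
# top_k attributes, and attributes are ranked by how many votes they collect.
# chi2 requires non-negative (e.g., one-hot encoded) features.
import numpy as np
from sklearn.feature_selection import chi2, mutual_info_classif

def vote_attributes(X, y, names, top_k=10):
    scorers = [chi2(X, y)[0], mutual_info_classif(X, y)]
    votes = np.zeros(len(names), dtype=int)
    for scores in scorers:
        votes[np.argsort(scores)[-top_k:]] += 1    # one vote per scorer
    order = np.argsort(-votes)
    return [(names[i], int(votes[i])) for i in order]
```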
Table 15. Final Selected Attributes M-2

Attribute category | Attributes
Demographic | Ethnicity, Gender
Academic | Attempted credit hours in ranges, GPA, degree, major change binary value, test scores, class percentile
Financial aid | Grants, scholarships, and loans
Social factors | Parent’s education and student organization (Greek life)
Table 16. Data characteristics M-2

Name | Instances | Attributes
Merged 2010 and 2011 | 9240 | 14
Fall 2010 | 4490 | 14
Fall 2011 | 4750 | 14
Fall 2012 | 4296 | 14
The statistical characteristics of the datasets are as shown in Table 17 and in later
result tables.
Table 17. Statistical characteristics of the datasets M-2

Attribute | Mean / Distribution
Age | 21.46
Gender | M/F (52%/48%)
Test score | 1099
Class percentile | 71.6
College GPA | 2.70
Earned hours | 12.12
Attempted hours | 13.89
Financial assistance (compiled grants/Pell/loan) | 52%-63%
Greek Life (student organization) | 81% with no involvement
Major Change (N/Y) | 78.139/21.861
Ethnicity | WH- 70%
Parents education | Ranged from no school to graduates
Gender F/M | 48.34/51.58
Degree | BS-46.09/BA-40.23
Low income group (only received Pell) | 18%-23%
Methodology
In this study, clustering, the recursive partitioning algorithm (Gaudard et al., 2006),
and association mining generate decision trees and rules in multiple stages. Clustering
and recursive partitioning gave better accuracy in Datta and Mengel (2014a). The trees
are generated using default values of the JMP tool (Gaudard et al. 2006). The fall 2010
and 2011 merged dataset allows exploration of building a unified model rather than two
possibly different models for the separate fall 2010 and 2011 datasets (see Datta and
Mengel (2014a) for results in considering the datasets separately).
In the first stage, the data is clustered using K-Means, and a classification tree is
generated from each cluster using recursive partitioning. In the next stage, association
mining generates rules using instances from each branch of the classification tree. These
rules are then joined with the associated branch from the decision tree. One of the reasons
for reducing the dataset using classification is to reduce the very large number of rules
extracted by association mining.
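A compact sketch of this pipeline follows (scikit-learn and mlxtend stand in for the JMP and WEKA tools named above; parameters such as the tree depth and support threshold are illustrative assumptions):

```python
# Sketch of the multi-stage pipeline: K-means clusters the data, a shallow
# tree segments each cluster into branches, and association rules are mined
# from the instances of each branch. Stand-ins for the JMP/WEKA tooling.
import pandas as pd
from mlxtend.frequent_patterns import apriori, association_rules
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeClassifier

def multi_stage_rules(df, class_col="Class", n_clusters=3):
    X = pd.get_dummies(df.drop(columns=[class_col]))
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(X)
    tables = []
    for c in range(n_clusters):
        part = df[labels == c]
        Xc = pd.get_dummies(part.drop(columns=[class_col]))
        tree = DecisionTreeClassifier(max_depth=3).fit(Xc, part[class_col])
        leaf_ids = tree.apply(Xc)                  # branch id per instance
        for leaf in set(leaf_ids):
            branch = part[leaf_ids == leaf]
            onehot = pd.get_dummies(branch).astype(bool)   # class included
            items = apriori(onehot, min_support=0.1, use_colnames=True)
            if len(items):
                tables.append(association_rules(items, metric="confidence",
                                                min_threshold=0.85))
    return pd.concat(tables, ignore_index=True) if tables else None
```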
Table 18. Sample rules, their respective probability, and the class type

Method | Rule No | Rule with the class | Class actual from the whole dataset
C4.5 cluster | R1 | Degree=BA & percentile_range=Top10 & GRANT=Y : D | R
CART Cluster | R2 | All rules predicted the same class from the whole dataset. | -
J48graft cluster | R3 | Degree=BA & percentile_range=Top10 & GRANT=N & LOAN=Y : D | R
Recursive partition Cluster | R4 | GPA=(High, Medium, Low) & GRANT=N & Greek_Life=N & Major_CHANG_FALL_Spring=N & Parent_educ=(4P, 7P) : D | R
Ensemble Learning clustering | R5 | GPA=(Fail, Missing) & Degree=(BM, BBA, BS, BLA) & percentile_range=(2ndquarter) & PELL=N : R | D
Ensemble Learning clustering | R5 | GPA=(Fail, Missing) & Degree=(BM, BBA, BS, BLA) & percentile_range=(2ndquarter) & PELL=Y : R | D
Motivation for multi-stage mining methods
This section looks at the use of association mining without a first stage of
clustering and classification. Association mining algorithms process the dataset via the
WEKA software tool (Hall et al., 2009). The merged dataset is used with the default
parameters for the Apriori association mining algorithm except that the number of rules
to be extracted is increased to 100, the minimum accuracy/confidence is lowered to 0.8, and
the delta value (the factor by which support is iteratively decreased) is increased to 0.1.
Association mining is known to generate a large number of rules; therefore, the number is
limited to 100. With the default parameters, an unexpectedly low number of rules (four)
resulted, all for one class (R), with the largest rule having three attributes. With the changed
parameters, the merged dataset yielded 100 rules, all for class=R. Of these rules, only five
had six attributes; the other rules had fewer than six attributes.
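Outside WEKA, these settings can be approximated as follows (mlxtend stands in for WEKA's Apriori; it has no numRules or delta parameters, so the cap is applied after mining, and the class-column naming is an assumption):

```python
# Approximation of the Apriori run described above, using mlxtend in place
# of WEKA; rules are capped at num_rules afterwards since mlxtend has no
# direct numRules/delta options. Assumes one-hot columns like "Class_R".
from mlxtend.frequent_patterns import apriori, association_rules

def mine_class_rules(onehot_df, min_support=0.1, min_conf=0.8, num_rules=100):
    items = apriori(onehot_df.astype(bool), min_support=min_support,
                    use_colnames=True)
    rules = association_rules(items, metric="confidence",
                              min_threshold=min_conf)
    # keep class association rules only: the consequent must be a class item
    is_class = rules["consequents"].apply(
        lambda c: len(c) == 1 and next(iter(c)).startswith("Class_"))
    return (rules[is_class]
            .sort_values("confidence", ascending=False)
            .head(num_rules))
```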
The predictive Apriori algorithm also modeled the data with default parameters
except that “class enable” was selected so that “class” would not be employed as an
attribute. The algorithm required at least three hours to return the rules. Out of the 100
rules generated, only three rules were for class=D. Table 19 gives a summary of the
results of applying three varieties of Apriori association mining algorithms. Because
class=D has a lower number of instances than class=R, Apriori favors rules with class=R.
In addition, even though several rules are generated, less than 14% of the entire dataset is
covered. Each rule covers so few instances that accuracy for each rule is high; however,
an extraordinary number of rules would be needed to cover the rest of the dataset and
would be impractical for people without automation. In addition, so many rules would
seem to “overfit” the dataset and would be more likely to be dataset specific.
Table 19. Association mining statistics

Technique | Changed parameters | Rules with >70% accuracy and 3+ attributes (# / coverage / accuracy) | Class=R (# of rules / coverage / accuracy) | Class=D (# of rules / coverage / accuracy)
Apriori | rules=100, Car=True | 88 / 0.1 / 0.8785 | 100 / 0.1 / 0.8785 | 0 / N/A / N/A
Filtered Apriori | rules=100, Car=True | 92 / 0.2 / 0.8804 | 100 / 0.2 / 0.8804 | 0 / N/A / N/A
Predictive Apriori | rules=100, Car=True | 99 / 0.7465 / 0.99475 | 97 / 0.7261 / 0.9948 | 3 / 0.16775 / 0.99469
Multi-stage controlled decision tree method
The merged dataset used in this methodology is first divided into three clusters
using K-means clustering as shown in Table 20 (three was found to be an appropriate
number of clusters in Datta and Mengel (2014a)). Each cluster is used to generate a controlled
decision tree using the recursive partitioning algorithm and is evaluated using 10-fold
cross validation. The decision tree is controlled in that the attributes used to split the tree
are locked from further use. As shown in the following example rule, the attributes are
locked because the decision tree algorithm uses the same attribute repeatedly, resulting in
a rule with conflicting attribute values:
Ex. Avg_Attempt>=9 & LOAN(Y) & GPA>=1.672 & GRANT(Y) & GPA>= 2.616 &
Summer(N) & CLASS_PERCENTILE>=55 & Avg_Attempt<18 & ETHNIC(HP, AS,
WH, HI) & CLASS_PERCENTILE>= 74 & Avg_Attempt_Hour>=12.5 & AGE>=21 &
Transfer_Hrs<16
Other researchers, such as Yu et al. (2010), have also adopted attribute locking.
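A minimal sketch of attribute locking in a recursive tree builder (plain Python; the study itself relied on JMP's recursive partitioning, and the scoring function here is a supplied placeholder):

```python
# Sketch of a controlled tree with attribute locking: once an attribute
# splits a node, it is removed from the candidate set for the whole subtree,
# so no extracted rule can carry conflicting conditions on one attribute.
from collections import Counter

def majority_class(rows, class_key="Class"):
    return Counter(r[class_key] for r in rows).most_common(1)[0][0]

def grow_locked_tree(rows, attributes, score, depth=0, max_depth=5):
    classes = {r["Class"] for r in rows}
    if depth >= max_depth or not attributes or len(classes) <= 1:
        return {"leaf": majority_class(rows)}
    attr = max(attributes, key=lambda a: score(rows, a))  # e.g., gain ratio
    remaining = [a for a in attributes if a != attr]      # lock the attribute
    return {"split": attr,
            "children": {v: grow_locked_tree(
                             [r for r in rows if r[attr] == v],
                             remaining, score, depth + 1, max_depth)
                         for v in {r[attr] for r in rows}}}
```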
Each split of the decision tree is monitored to check if the accuracy changes. The
tree continues to split until the accuracy remains the same for the next two levels. If the
accuracy remains the same (Figure 8 split 3), the tree is pruned back as shown in Figure
8, split 2. At this point, the tree is locked. The instances of each of the sub-trees are stored
as different datasets. The attributes used in generating the tree are removed from the new
datasets. Now the Apriori association-mining algorithm generates new rules for each of
the datasets. The settings for the algorithm are those used by Hall et al. (2009), as shown
in Appendix A, Table 27, where the minMetric parameter changes from 0.9 to 0.85.
Lowering this confidence threshold by five percent sacrifices some accuracy in order to
examine more instances and encourage the extraction of class=D rules. The extracted
rules are joined back to their respective decision tree branch attributes. The confusion
matrix, coverage, and precision are calculated for each rule.
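A minimal sketch of these per-rule measurements (plain Python over attribute dictionaries; because accuracy is computed only on the instances a single rule matches, it coincides with that rule's precision):

```python
# Sketch of per-rule coverage and accuracy over the whole dataset, which is
# represented as a list of attribute->value dicts. Field names illustrative.

def rule_metrics(rows, conditions, predicted, class_key="Class"):
    """conditions maps each attribute to its set of allowed values."""
    matched = [r for r in rows
               if all(r[a] in vals for a, vals in conditions.items())]
    coverage = len(matched) / len(rows)
    hits = sum(1 for r in matched if r[class_key] == predicted)
    accuracy = hits / len(matched) if matched else float("nan")
    return coverage, accuracy            # accuracy == per-rule precision

rows = [{"GRANT": "N", "LOAN": "N", "GENDER": "M", "Class": "D"},
        {"GRANT": "N", "LOAN": "N", "GENDER": "M", "Class": "R"},
        {"GRANT": "Y", "LOAN": "N", "GENDER": "F", "Class": "R"}]
print(rule_metrics(rows, {"GRANT": {"N"}, "LOAN": {"N"}}, "D"))
```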
The steps used in this algorithm are summarized below.
• The dataset is divided into three clusters.
• Each cluster is used to generate a controlled decision tree using recursive partitioning.
• Apriori association mining analyzes each of the sub-datasets/nodes from each branch to extract rules.
• The rules extracted from the decision tree and association-mining techniques combine to create the final rules.
• The newly generated rules are validated against the whole dataset.
Figure 8. Steps of split and prune of the decision tree
Results
Results from multi-stage controlled decision tree
Table 20 shows the characteristics of the merged dataset after clustering for
clusters one through three. Figure 9 shows the tree derived from the first cluster. Three
splits occur on the grant, class percentile, and loan attributes resulting in four rules given
below.
• GRANT=Y and percentile-range=TOP10
• GRANT=Y and percentile-range=Next15 or 2nd quarter or 3rd quarter or 4th quarter
• GRANT=N and LOAN=Y
• GRANT=N and LOAN=N
As a result, cluster 1, which has 734 instances, is divided into four datasets of sizes 347,
238, 56, and 93. Because the first dataset is characterized by attribute values GRANT=Y
and percentile_range=TOP10, these attributes are removed from consideration when
generating the association mining rules. After association mining, the attributes are added
back into the rules as with the example below.
Grant=Y and Percentile-range=TOP10 and Major_change_Fall_Spring=N and Greek
Life=N ==> Class=R
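A small sketch of this re-joining step (plain Python, with the same rule representation as the earlier sketches; names are illustrative):

```python
# Sketch of prefixing the branch conditions of the decision tree back onto
# each rule mined within that branch, as in the GRANT=Y / Top10 example above.

def rejoin(branch_conditions, mined_rules):
    combined = []
    for conditions, predicted in mined_rules:
        merged = dict(branch_conditions)   # branch attrs were absent in mining
        merged.update(conditions)
        combined.append((merged, predicted))
    return combined

branch = {"GRANT": {"Y"}, "percentile_range": {"Top10"}}
mined = [({"Major_change_Fall_Spring": {"N"}, "Greek Life": {"N"}}, "R")]
print(rejoin(branch, mined))
```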
Table 20. Characteristics of the dataset
[For each of the three clusters of the merged dataset, this table reported the class distribution (R/D), parent education %, grant %, loan %, Pell %, GPA %, attempted hours %, test score, class percentile, ethnicity %, Greek life %, and gender (% female); the flattened extraction does not preserve which value belongs to which cluster, so the individual cell values are not reconstructed here.]
Figure 9. Controlled Decision Tree
Some rules across datasets can be a subset of another rule, such as Rule-2 and
Rule-3, and Rule-7 and Rule-8 in Table 21, which shows sample attrition rules. The
accuracy of the rules is high, but the coverage of the dataset is low, as is characteristic of
association mining rules.
Table 21. Sample rules with accuracy and coverage

Rule # | Rule | Accuracy | Coverage
Rule-1 | Grant=Y, percentile=Top10, GPA=High (R) | 0.9535 | 0.07
Rule-2 | Grant=N, Loan=N, GENDER=M, GPA=Fail, Major_CHANG_FALL_Spring=N, Greek Life=N, FALL_ATMP_RANGE=F (D) | 0.863354 | 0.017424
Rule-3 | Grant=N, Loan=N, GENDER=M, GPA=Fail, Greek Life=N, FALL_ATMP_RANGE=F (D) | 0.856383 | 0.006493
Rule-4 | Grant=N, Loan=N, GENDER=M, GPA=Fail, Greek Life=N (D) | 0.851695 | 0.025541
Rule-5 | Grant=N, Loan=N, GENDER=M, GPA=Fail, Major_CHANG_FALL_Spring=N, Greek Life=N (D) | 0.85 | 0.021645
Rule-6 | Grant=N, Loan=N, ETHNIC=WH, GPA=Fail, Greek Life=N (D) | 0.849398 | 0.001731
Rule-7 | Grant=N, Loan=N, GENDER=M, GPA=Fail, Major_CHANG_FALL_Spring=N, Degree=BS (D) | 0.844961 | 0.013961
Rule-8 | Grant=N, Loan=N, GENDER=M, GPA=Fail, Degree=BS (D) | 0.84 | 0.162337
Table 22 has sample rules and their precisions on the datasets. The fairly
consistent precision is expected across the merged and unmerged data sets. The precision
rate for Fall 2012, however, differs from the others on several rules. Upon further
examination, the Fall 2012 dataset was found to be affected by the higher percentage of
students receiving financial aid (loans received increased from 51% to 70% from
previous years).
Table 22. Testing different datasets with rules for attrition

Rule No | Rules | Merge 10-11 accuracy (coverage) | Fall 10 | Fall 11 | Fall 12 accuracy (coverage)
Rule-1 | Grant=N, Loan=N (D) | 0.685 (0.25) | 0.749 | 0.636 | 0.171 (0.24)
Rule-3 | Grant=N, Loan=N, GENDER=M, GPA=Fail, Greek Life=N, FALL_ATMP_RANGE=F (D) | 0.856 (0.02) | 0.899 | 0.789 | 0.706 (0.001)
Rule-4 | Grant=N, Loan=N, GENDER=M, GPA=Fail, Greek Life=N (D) | 0.852 (0.03) | 0.888 | 0.817 | 0.630 (0.02)
Rule-6 | Grant=N, Loan=N, ETHNIC=WH, GPA=Fail, Greek Life=N (D) | 0.849 (0.02) | 0.883 | 0.737 | 0.696 (0.01)
Rule-12 | Grant=N, Loan=N, ETHNIC=WH, GPA=Low, Greek Life=N (D) | 0.776 (0.01) | 0.795 | 0.739 | 0.214 (0.01)
Rule-13 | Grant=N, Loan=N, Parent_educ=8P, SAT_RANGE=1100<=SAT<=1150 (D) | 0.701 (0.11) | 0.775 | 0.683 | 0.159 (0.02)
Number of students received gift aid | | 62% | 64% | 61% | 58%
Number of students received loans | | 50% | 52% | 48% | 71%
Number of students received none | | 25% | 23% | 28% | 21%
Table 23 compares the method used in this study with the method of Datta and
Mengel (2014a) and other standard data mining algorithms. The table gives the number
of rules generated, the coverage of each rule, and the average accuracy of the rules.
Ensemble learning gave the best coverage; however, the multi-stage decision method
gave higher accuracy with a decent amount of coverage.
Table 23. Comparisons of rules coverage using different methods (values separated by “+” are per cluster)

Method | # of rules | Min coverage% | Max coverage% | Total coverage% | Average accuracy%
C4.5 cluster | 0+11+0 | 0.020 | 6.240 | 0+37.28+0 | 0+77.00+0
CART Cluster | 0+5+0 | 0.281 | 14.048 | 0+30.00+0 | 0+81.00+0
Ensemble learning | 8 | 8.658 | 31.797 | 77.53 | 85.969
Ensemble learning Cluster | 12+6+11 | 0.087 | 44.456 | 92.00+33.04+50.09 | 63.13+72.98+80.32
Multi-stage cluster | 40+40+30 | 0.01 | 12.749 | 47.29+65.281+66.418 | 84.74+86.90+86.85
The multiple stage data mining method combines the rules for each cluster so that
redundant and duplicate rules are eliminated. Interestingly, the multiple stage data
mining method did not extract any rules that resulted in an anomaly when validated with
the whole dataset. One reason may be that the rules covered less data and
characterized the covered data more accurately. Table 24 shows the coverage of the total
instances of each class.
Table 24. Rule coverage over instances in each class

Class | # of unique rules | Total coverage | Total accuracy
D | 16 | 69.95 | 80.157
R | 59 | 82.45 | 89.54
Conclusion
Each of the data attributes is crucial in determining which intervention process
needs to be taken. For example, students with higher test scores and higher class
percentiles tend to have different reasons for attrition than students with lower test scores
and lower class percentiles. In addition, certain majors like pre-nursing and business, as
well as undeclared majors, tend to have higher dropout rates.
The verification results of the multi-stage controlled decision tree (shown in Table
22) indicate that it is challenging to develop a definitive set of attrition rules. Policies and
uncontrollable events, such as more incoming freshmen receiving scholarships, changes
in economic conditions, or a natural disaster, may affect a student’s financial aid. Hence,
the distribution of the attributes may change each year. For example, in Fall 2012, more
students received loans than in previous years, so the rules that governed the previous
two years’ datasets may not be applicable. In addition, the accessible student data from the
institution may not include potentially important factors, such as behavioral motivations
for attending college (for example, to be at school with a boyfriend/girlfriend or a parental
requirement to attend). This study concludes that ethnicity, lack of parental education,
lower financial aid, and non-participation in student organizations are causative factors
for student attrition. This methodology can use data from several years and generate a
new model.
This study infers from the results that a multi-stage data mining method is a
more accurate way to extract rules. None of the rules generated by the multi-stage
method had conflicting classes when tested on the whole dataset, and attributes beyond
those employed by the decision tree could be incorporated through association mining, as
shown in Table 18. The method also had better precision rates than Pittman’s (2008) and
Yu et al.’s (2006) studies, as shown in Table 25. Since this study calculated precision on
individual rules, the accuracy and precision have the same value.
Table 25. Accuracies obtained using incoming freshmen

Researchers (year) | Cohort size | Methods used | Measure of accuracy | Accuracy %
Pittman (2008) | 2768 | Logistic, multilayer perceptron, J48, Naïve Bayes | Overall accuracy | 72.8
Yu et al. (2006) | 6690 | Recursive partition, MARS, neural networks | Overall accuracy | 73.53
Herzog (2006) | 3565 | CHAID, C4.5 | Overall accuracy | 85.4
Datta and Mengel (2014a) | 9240 | Clustering, ensemble learning | Average accuracy of each rule | 85.96
Datta and Mengel (this paper) | 9240 | Clustering, recursive partition, association mining | Average accuracy of each rule | 87.5
Future work
To validate and standardize the methodology used in this study and to support
dynamic rule creation with different data mining techniques, published datasets from
UCI’s website will be used in the future. This study will generate rules for other datasets
in which the true positive class is much smaller than the true negative class. For
example, possible datasets include a US population dataset for isolating individuals
infected with HIV and a high school population dataset for identifying female students
likely to choose a STEM degree.
This study showed that ethnicity, parents’ educational level, financial aid, and
student organizations affect student attrition. Future studies will split the dataset to
investigate diversity issues and specific characteristics related to student ethnicity.
The study will divide the dataset by ethnicity, grouping minority populations such as
Hispanic and African American students into separate datasets. Another research question
for future study will be: do women behave in a similar manner to men in retention/attrition?
More data will be collected in future years (at least five years), and the multi-stage
decision method will be applied to get unified rules. All rules will be tested and
used for isolating attrition rules. The dataset’s characteristics will be used to match any of
the trained cohorts, which could help in deciding which set of rules should be applied
to a specific cohort.
This study will be extended in the future to reverse engineer and derive a
model for undergraduate admissions to recruit freshmen who are likely to be retained,
which could indirectly increase the institution’s retention rates beyond the students’ first
year in college. To create the admissions model, these retention rules will be used.
Lastly, one issue found in Datta and Mengel (2014a) is still an issue in the multi-stage method. Table 26 shows that the rules extracted are similar to each other except for one
attribute condition. By omitting or combining the attributes, several rules can be
combined into one rule. For rules 1 and 2, the last attribute condition has two values
(Female or Male) allowing the attribute to be deleted from the rule conditions. Rules 3
and 4 combine by an OR as shown in the combined rule (3, 4). Rules 5 and 6 generate a
shorter rule eliminating two redundant attributes. In the future, the study will be extended
to combine rules as shown in the following examples so that they will be clearer and
more condensed.
Table 26. Examples of combining rules

1 | GPA=(High, Medium, Low) & GRANT=Y & LOAN=N & ETHNIC(NR, U, HI, B, WH, MX) & GENDER(M) = R (664/58 = 91.265%)
2 | GPA=(High, Medium, Low) & GRANT=Y & LOAN=N & ETHNIC(NR, U, HI, B, WH, MX) & GENDER(F) = R (838/65 = 92.243%)
Combined rule (1,2) | GPA=(High, Medium, Low) & GRANT=Y & LOAN=N & ETHNIC(NR, U, HI, B, WH, MX) & (GENDER(M) or GENDER(F)) = R (1502/123 = 91.810%)
  OR
  GPA=(High, Medium, Low) & GRANT=Y & LOAN=N & ETHNIC(NR, U, HI, B, WH, MX) = R (1502/123 = 91.810%)
3 | GPA=(Fail, Missing) & Major_CHANG_FALL_Spring(Y) & Parent_Char(4P, 1P, P, 6P) & SAT_RANGE1(950<=SAT<=1000, 1050<=SAT<=1100, 1150<=SAT<=1200, 1200<=SAT<=1250, 1250<=SAT<=1300, 1300<=SAT<=1350, 650<=SAT<=700) & FALL_ATMP_RANGE(FM) = D (8/3 = 62.5%)
4 | GPA=(Fail, Missing) & Major_CHANG_FALL_Spring(Y) & Parent_Char(4P, 1P, P, 6P) & SAT_RANGE1(950<=SAT<=1000, 1050<=SAT<=1100, 1150<=SAT<=1200, 1200<=SAT<=1250, 1250<=SAT<=1300, 1300<=SAT<=1350, 650<=SAT<=700) & FALL_ATMP_RANGE(F, H, T) = D (44/8 = 81.818%)
Combined rule (3,4) | GPA=(Fail, Missing) & Major_CHANG_FALL_Spring(Y) & Parent_Char(4P, 1P, P, 6P) & SAT_RANGE1(950<=SAT<=1000, 1050<=SAT<=1100, 1150<=SAT<=1200, 1200<=SAT<=1250, 1250<=SAT<=1300, 1300<=SAT<=1350, 650<=SAT<=700) & FALL_ATMP_RANGE(F, H, T, FM) = D (52/11 = 78.846%)
5 | Grant="Y" and GPA=("high" or "medium" or "low") and Greek-life="Y" and GENDER="F" = R (435/39 = 91.03%)
6 | Grant="N" and GPA=("high" or "medium" or "low") and Greek-life="N" and GENDER="F" = R (903/405 = 55.149%)
Combined rule (5,6) | GPA=("high" or "medium" or "low") and GENDER="F" and (Grant="Y" or Grant="N") and (Greek_life="Y" or Greek_life="N") = R
  OR
  GPA=("high" or "medium" or "low") and GENDER="F" = R (3997/939 = 76.507%)
Appendix A

Table 27. Apriori settings

Option | Default value | Experiment value | What it means
Car | False | True | Generates rules with the class attribute instead of general associations
classIndex | -1 | -1 | Placement of the class attribute in the dataset
Delta | 0.05 | 0.1 | Iteratively decrease support by this factor; support is reduced until the minimum support is reached or the required number of rules has been generated
lowerBoundMinSupport | 0.1 | 0.1 | The lowest value of minimum support
metricType | confidence | confidence | The metric can be set to any of the following: Confidence (class association rules can only be mined using confidence), Lift, Leverage, Conviction
minMetric | 0.9 | 0.85 | Will consider rules with accuracy of 0.85 or higher
numRules | 10 | 100 | Number of rules to generate
outputItemSets | False | True | Itemsets are shown in the output
removeAllMissingCols | False | False | Removes columns with missing values
significanceLevel | -1.0 | -1.0 | Significance testing
treatZeroAsMissing | False | False | Zero is treated the same way as a missing value
upperBoundMinSupport | 1.0 | 1.0 | The highest value of minimum support; the process starts here and iteratively decreases toward the lower bound
verbose | False | False | The algorithm runs in verbose mode if enabled
Table 28. Effect of changing default values

Delta value = 0.05
  Performance: Minimum support: 0.1 (9 instances); Minimum metric <confidence>: 0.9; Number of cycles performed: 18
  Rules:
  1. ETHNIC=WH SAT_RANGE=1200<=SAT<=1250 Major_CHANG_FALL_Spring=N 11 ==> Class=D 10 conf:(0.91)
  2. ETHNIC=WH PELL_Y=N SAT_RANGE=1200<=SAT<=1250 Major_CHANG_FALL_Spring=N 11 ==> Class=D 10 conf:(0.91)

Delta value = 0.1
  Performance: Minimum support: 0.1 (9 instances); Minimum metric <confidence>: 0.85; Number of cycles performed: 18
  Rules:
  1. ETHNIC=WH SAT_RANGE=1200<=SAT<=1250 Major_CHANG_FALL_Spring=N 11 ==> Class=D 10 conf:(0.91)
  2. ETHNIC=WH PELL=N SAT_RANGE=1200<=SAT<=1250 Major_CHANG_FALL_Spring=N 11 ==> Class=D 10 conf:(0.91)
  3. ETHNIC=WH SAT_RANGE=1200<=SAT<=1250 14 ==> Class=D 12 conf:(0.86)
  4. ETHNIC=WH PELL=N SAT_RANGE=1200<=SAT<=1250 14 ==> Class=D 12 conf:(0.86)
Table 29. Sample rules at confidence=0.70 (sample rules with one attribute due to lowering of confidence)

Cluster I — Confidence lowered to 0.80, 0.75, and 0.70:
  GENDER=M 716 ==> Class=D 532 conf:(0.74)
  ETHNIC=WH 687 ==> Class=D 504 conf:(0.73)
  Parent_educ=8P 642 ==> Class=D 458 conf:(0.71)
Cluster II — Confidence lowered to 0.80, 0.75, and 0.70; minimum support: 0.05 (support had to be lowered to generate rules):
  ETHNIC=WH 41 ==> Class=R 36 conf:(0.88)
  Major_CHANG_FALL_Spring=N 47 ==> Class=R 41 conf:(0.87)
Cluster III — Confidence lowered to 0.80, 0.75, and 0.70:
  GENDER=M 520 ==> Class=D 368 conf:(0.71)
CHAPTER IV
DYNAMIC MULTI-STAGE DECISION RULES
Abstract
After generating “usable” rules with better accuracy for the minority class on
retention data with a multi-stage decision method, the motivation was to extend that
previous study and validate the results of Datta and Mengel’s (2014b) ensemble learning
method on other known datasets.
Studies that need rules to predict the minority group would find this method
helpful. Typically, rules generated using either decision trees or class association mining
favor the majority class, as in the retention dataset (Datta and Mengel, 2014a). Datta and
Mengel (2014b) showed that rules can be generated for the minority classes using both
decision trees and association mining.
This study uses a dynamic method to generate rules: the characteristics of the
dataset dictate the path taken toward rule generation. A dynamic multi-stage decision tree
is generated depending on the attribute dimensions and the size of the dataset. Each rule
receives its own coverage and accuracy. This technique generates rules after mining data
from the minority class, and the generated rules are grouped into accuracy ranges to
facilitate rule choice as needed.
Key Words: association mining, decision tree, gain ratio, large datasets, multi-stage rule generation, recursive partition, rule sets
Introduction and related work
Data mining algorithms may achieve different accuracies and rules
depending on the dataset. Not every data mining algorithm is appropriate for every
dataset; however, irrespective of their characteristics, datasets are currently run through
the same techniques to generate rules. Different from previous studies such as Yu et al.
(2010), Pittman (2008), Huo et al. (2006), and Lu et al. (2009), but similar to Senator
(2005), this study determines the path for rule creation from the characteristics of
each dataset.
This study originated from Datta and Mengel’s (2014a and 2014b) studies to find
usable rules with better accuracy for the minority group. The first study applied different
decision tree algorithms, such as C4.5, CART, recursive partition, and J48graft, to
retention data to generate rules. It generated rules following Yu et al. (2010), but the
rules tended to be very simple for such a complex issue. Therefore, Datta and Mengel’s
(2014a) study used clustering and unique attributes to generate the rules. Enhancements
suggested by that study were:
• Rules generated from a cluster had an anomalous class when validated over the whole dataset.
• All attributes had to be used for generating the rules, which is not a feasible solution for a high-dimension dataset.
Thus, the next study by Datta and Mengel (2014b) used a different approach.
The second study used a multi-stage decision tree (MSDT) to generate viable
rules for users and administrators. The MSDT method used student data from an
institution to generate the rules and validated those rules using data from a different year.
The purpose of the MSDT method was to discover rules that identify student attrition.
This current paper introduces dynamic multi-stage decision rules (DMDR) that use
various data mining algorithms. In this study, the MSDT method is applied to other well-known
datasets from UCI. When the MSDT method did not show improvements for a
dataset, the method or steps were changed in a logical way; the new ensemble method
was named the dynamic multi-stage decision rule (DMDR) method. The rules generated
using the DMDR method depend on the characteristics of the dataset. Table 30 shows six
datasets with three different accuracies and the total number of rules generated using the
C4.5 algorithm; these accuracies and rules will be used later for comparison with the
ensemble method developed in this study. C4.5 was used to allow comparison with other
researchers (Huo et al. (2006) and Lu et al. (2009)). Baseline accuracy is the simplest
accuracy calculated for a given dataset; here it is calculated using the ZeroR classifier
(Hall et al., 2009), which predicts the mean of a numeric class or the mode of a nominal
class.
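A minimal sketch of this baseline (plain Python; the retention-like class counts below follow the distribution given in Table 31):

```python
# Sketch of a ZeroR-style baseline: always predict the modal class of a
# nominal target and report the resulting accuracy.
from collections import Counter

def zero_r_accuracy(labels):
    mode, count = Counter(labels).most_common(1)[0]
    return mode, count / len(labels)

labels = ["R"] * 6402 + ["D"] * 2838        # 9240 instances, ~69.28% retained
print(zero_r_accuracy(labels))              # ('R', 0.6928...)
```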
Table 30: Data set with reported Accuracy

Data set | Baseline Accuracy % | Accuracy reported % | Accuracy % using C4.5 | Total # of rules using C4.5 | Example of less useful rules
Adult | 75.919 | 78.58-85.95 | 86.21 | 731 | capital-gain <= 6849 and marital-status = Never-married and education-num > 14 and age > 32 and capital-loss <= 653 and occupation = Adm-clerical: >50K (0.0)
Balance | 46.08 | - | 63.2 | 33 | Left-Weight = 1LW: R (125.0/27.0)
Breast Cancer | 70.2797 | 65.0-78.0 | 71.134 | 10 | deg-malig <= 2: no-recurrence-events (201.0/40.0)
Car | 70.0231 | - | 92.36 | 131 | safety = low: unacc (576.0)
Connect | 65.8303 | - | 80.90 | 4297 | d3 = x & d2 = o & d1 = x & d4 = o & c2 = b & g2 = b & e2 = b & a3 = b & d5 = o & a2 = b & e1 = b & a1 = b & c1 = x & b1 = x: win (0.0)
Lymphography | 54.7297 | 76.0-85.0 | 76.35 | 19 | change_node = 4: 3C (25.0/2.0)
Retention | 69.2857 | 85.96-87.5 | 79.35 | 11 | GRANT= Y AND GPA= Low: R (843.0/126.0)
Two research questions were studied:
• What characteristics of a dataset work better with this process?
• How does this methodology behave with large, high-dimension datasets or with small, low-dimension datasets?
Background Work
Huo et al. (2006) and Lu et al. (2009) developed algorithms to lower error rates
and obtain fewer rules compared to the traditional C4.5. Both studies dealt with
multiple-class datasets. The algorithm of Huo et al. re-labeled the classes recursively to
divide the dataset into two classes using the maximum margin learning method from the
Support Vector Machine (SVM) (Wang et al., 2005) and used ID3 to generate the tree.
Lu et al. (2009) reduced the complexity of re-labeling and re-grouping by dividing the
dataset into two classes using the inverse problem of SVM and then used the C4.5
algorithm with the newly labeled dataset. The objective of both studies was to generate
shorter and simpler rules.
Senator (2005) used a two-stage classification process to isolate instances that
were at high risk. His study undertook the problem of the minor class in the dataset in
two stages: first, classification of the whole dataset and second, selection of those
instances that were classified as positive. Following the study of Senator (2005), which
used the multi-stage decision process for two specific domains (HIV, fraud detection) with
few high-risk instances in their datasets, the study by Datta and Mengel
(2014b) used the multi-stage decision process on the retention dataset to discover a few useful rules.
In the study of Datta and Mengel (2014b), the retention dataset was first
preprocessed to normalize the attributes by setting them in ranges, integrating them, or
converting them to binary values. The dataset was clustered, and each of the three
clusters (the choice of three clusters is explained in the Section Clustering) was treated as an
independent dataset. The process is detailed in a later section.
Data for study
The data used in this study were from the UCI data repository, as shown in Table
31. To generalize the MSDT model of Datta and Mengel (2014b), datasets were selected
whose sizes and dimensions differ from those of the retention dataset. The numeric
attributes were converted to alphanumeric values for consistency and to enable the use of
association mining.
Table 31: Dataset and their characteristics- UCI

Data set | Size | No. of attributes | No. of classes | Baseline Accuracy | Class distribution
Adult (train) | 32561 | 15 | 2 | 76.50 | >50K 23.93%, <=50K 76.07%
Balance | 625 | 5 | 3 | 40.566 | L 46.08%, B 7.84%, R 46.08%
Breast Cancer | 286 | 10 | 2 | 76.2887 | No-recurrence 70.27%, Recurrence 29.73%
Car | 1728 | 6 | 4 | 69.56 | unacc 70.023%, acc 22.222%, good 3.993%, v-good 3.762%
Connect | 67557 | 43 | 3 | 65.65 | Win 65.83%, Loss 24.62%, Draw 9.55%
Lymphography | 148 | 19 | 4 | 44.00 | normal find 1.351%, metastases 54.729%, malign lymph 41.21, fibrosis 4
Retention | 9240 | 15 | 2 | 69.00 | D 30.71%, R 69.28%
Dynamic multi-stage decision rule
To create the DMDR, the datasets were first divided into three clusters using
K-means clustering (details in the Section Clustering). Then each cluster in the dataset followed
the same process: a controlled/monitored decision tree was generated using recursive
partitioning, and the tree was pruned depending on the characteristics of the datasets mentioned
in Table 31. Each tree generated rules using association mining. Rules for the complete
dataset were developed as a combination (a logical AND condition) of the attributes used
during decision tree creation and those generated from association mining. For
example, in the “Car” dataset, the first split using the decision tree was “Person<4,” and
one of the rules generated using the association mining was “safety =low 274 ==>
class=unacc.” The combined rule was “Person<4 AND safety=low ==> class=unacc.”
The coverage and accuracy of this rule were evaluated from the original dataset.
Clustering
To decide the optimum number of clusters, two different clustering techniques were
compared. The data was first clustered using EM clustering; each cluster generated a tree,
and the following measures were averaged to facilitate the decision on cluster count:
accuracy, TP rate, FP rate, precision, F-measure, and ROC. Second, K-means clustering
was applied to generate three clusters. As shown in Table 32, not all datasets performed
well with EM clustering.
The Adult dataset generated three clusters under EM; hence, using three clusters
with K-means is justified. EM clustering generated two clusters for the Balance dataset,
one of which contained only one class, which does not help in generating rules via either
a decision tree algorithm or association mining; hence, three K-means clusters were used.
EM clustering generated four default clusters for the Breast Cancer data, but the average
accuracy using three clusters was higher than that of four clusters, so three clusters were
used. EM clustering generated seven default clusters for the Car dataset, but only one
cluster had more than one class, so the study used the three clusters generated by K-means.
EM clustering had difficulty with the Connect dataset; the process ran for 12 days before
it was aborted. The dataset was reduced by removing low-ranked attributes by voting
across three attribute selection methods (gain ratio, significance attribute evaluation, and
Naïve Bayes), but EM clustering still did not return any clusters after six days. Next, the
default parameters were changed as shown in Table 33; in both cases—III and IV—the
same number of clusters was generated. These clusters were tested using C4.5 for the
metrics shown in Table 32; ROC was better with K-means, so the three K-means clusters
were used. EM clustering generated six clusters for the Lymphography dataset, of which
only two could be used for testing because one cluster had five instances, another had
three, another had zero, and the last had a single class; hence, the experiment used the
three clusters generated by K-means. For the Retention dataset, a dataset unique to the
series of studies by Datta and Mengel (2014a), EM generated 13 clusters, but K-means
with three clusters had better results for ROC and precision. Hence, K-means clustering
with three clusters was used to standardize the process.
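A compact sketch of this comparison (scikit-learn's GaussianMixture stands in for WEKA's EM; the accuracy-only metric and the small-cluster threshold are simplifying assumptions):

```python
# Sketch of the cluster comparison: cluster with EM (GaussianMixture) and
# K-means, grow a tree per cluster, and average the validation accuracy.
# X and y are assumed to be NumPy arrays.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def avg_cluster_accuracy(X, y, labels):
    scores = []
    for c in np.unique(labels):
        Xc, yc = X[labels == c], y[labels == c]
        if len(yc) < 20 or len(np.unique(yc)) < 2:
            continue                     # skip tiny or single-class clusters
        scores.append(cross_val_score(DecisionTreeClassifier(max_depth=5),
                                      Xc, yc, cv=10).mean())
    return float(np.mean(scores)) if scores else float("nan")

def compare_clusterings(X, y, k=3):
    em = GaussianMixture(n_components=k).fit_predict(X)
    km = KMeans(n_clusters=k, n_init=10).fit_predict(X)
    return {"EM": avg_cluster_accuracy(X, y, em),
            "K-means": avg_cluster_accuracy(X, y, km)}
```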
Table 32. Cluster comparisons with EM and K-means

Dataset | Adult | Balance | Breast Cancer | Car | Connect | Lymphography | Retention
EM: default cluster count | 3 | 2 | 4 | 7 | 39 | 6 | 13
EM: usable clusters | 3 | 1 | 4 | 1 | 39 | 2 | 13
EM: average accuracy % | 86.58 | 85.46 | 75.08 | 98.508 | 80.926 | 80.895 | 89.085
EM: average TP rate | 0.866 | 0.855 | 0.751 | 0.985 | 0.8092 | 0.809 | 0.890846
K-means (3 clusters): average accuracy % | 89.65 | 85.145 | 78.89 | 90.271 | 79.984 | 60.1096 | 84.564
K-means (3 clusters): average TP rate | 0.8966 | 0.862 | 0.789 | 0.9026 | 0.8 | 0.6013 | 0.845667
[The remaining averaged measures (FP rate, precision, F-measure, and ROC for both clusterings) were scrambled during extraction and are not reconstructed here.]
Table 33. EM clustering for Connect dataset

Case | maxIterations | minStdDev | seed | Comments
Default | 100 | 1.0E-6 | 100 | Default parameters
Case I | Default | Default | Default | Did not converge after 12 days; hence aborted
Case II | Default | Default | Default | Lower-ranked attributes removed; process aborted after six days
Case III | 10 | Default | 10 | Generated 39 clusters within 24 hours
Case IV | 5 | 0.001 | 5 | Generated 39 clusters within 24 hours
Stage I: Decision Tree
The tree was allowed to split once if the dataset had fewer than eight attributes. In
this case, after the first split, each node became a sub-dataset for the next stage of the rule
generation process. The attributes used for the first split were removed from the
respective subset, reducing each subset in both size and dimension.
Datasets that had more than eight attributes were split as follows: each split of the
decision tree was monitored to check whether the accuracy had changed, and each
attribute was locked after it split. The tree continued to split until the accuracy remained
the same for the next two levels, in which case the tree was pruned back to the level
where the accuracy stopped changing (Datta and Mengel, 2014b). An example of the tree
captured in Stage I using the Breast Cancer dataset is shown in Figures 10, 11, and 12. At
this split, the distribution of the classes was checked for impurity: if the classes were
distributed in a ratio of about 66% to 33%, the tree was locked from further splitting; if
the distribution was below this range, the tree was split further.
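A minimal sketch of the accuracy-monitored deepening loop (plain Python; evaluate(depth) is a hypothetical stand-in for "grow the tree to this depth and cross-validate"):

```python
# Sketch of the split-and-prune control: keep deepening while accuracy
# changes; once it is flat for two consecutive levels, prune back to the
# depth where it last moved. evaluate(depth) is a hypothetical callback.

def choose_depth(evaluate, max_depth=20, patience=2, tol=1e-9):
    best_depth, flat = 1, 0
    prev = evaluate(1)
    for depth in range(2, max_depth + 1):
        acc = evaluate(depth)
        if abs(acc - prev) <= tol:
            flat += 1
            if flat >= patience:        # unchanged for two levels: stop here
                return best_depth       # ...and prune back to best_depth
        else:
            flat, best_depth = 0, depth # accuracy moved; keep this level
        prev = acc
    return best_depth
```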
Figure 10. Breast Cancer 1st level when accuracy stops changing
Figure 11. Breast Cancer dataset 2nd level with repeating accuracy
Figure 12. Breast Cancer 3rd level with repeating accuracy
The instances of each of the locked branches were then stored as sub-datasets.
The attributes that were used in generating the tree were removed from the subsets
created from the previous step.
Large and high dimensional datasets with more than twenty attributes followed
the same path as datasets with more than eight attributes, but at their second stage, each
of the sub-datasets was used to run through attribute selection methods. Attributes with
lower or no significance were removed from the subset before the next step. These
subsets were transferred to Stage II.
Stage II: Association Mining
Association mining techniques were applied to the subsets to generate rules.
The Apriori association mining technique was used with default settings for these
experiments, with two exceptions: first, the parameter Car was changed to TRUE (so
that the class attribute is used to decide on the rules); second, the minMetric
parameter was changed from 0.9 to 0.85. Lowering this confidence threshold by five
percent admits lower-accuracy rules so that more rules can be generated. The rules were
generated using class association. Table 38 of Appendix B shows the details of the
default and the experiment parameters.
All of the rules created using association mining were combined with the
attributes and their values that were created from the decision tree. These rules were
evaluated for accuracy and coverage using the original dataset. See Figure 13 for
illustration of this process.
Figure 13. Steps for Dynamic decision rules
Results
The process followed for each of the datasets was noted in the study of Datta and
Mengel (2014b). Of the six datasets tested with this process, two, the Balance and
Connect datasets, did not generate rules as expected. In comparison to the Retention
dataset, Balance (five attributes) and Connect (43 attributes) had either fewer or more
attributes; hence, a different method was chosen in these cases. For the Balance dataset,
only one attribute remained before progressing to Stage II, association mining; thus, the
path chosen was to allow the decision tree to split once and then force the dataset to
Stage II. For the Connect dataset, since the system ran for more than 24 hours and still
was unable to generate any association rules, the dimension of the dataset was lowered
by removing the least significant attributes after ranking and voting using symmetrical
uncertainty, gain ratio, and Chi-squared. The reasons for this deviation are analyzed in
the Section Discussion and conclusions. The results obtained from all of these datasets
are shown in Table 34.
Table 34. Experimental results of accuracy using different datasets

Data Set | Threshold | Maximum | Minimum | Average, all rules | Average, selected rules
Adult | 86.21 | 97.36 | 84.63 | 83.75 | 80.19
Balance | 63.2 | 93.277 | 65.00 | 69.74 | 69.74 (no rules excluded)
Breast Cancer | 78.0 | 81.62 | 75.92 | 78.89 | 83.96
Car | 92.36 | 95.72 | 84.76 | 93.47 | 93.34
Connect | 80.9 | 93.16 | 69.66 | 85.96 | 85.94
Lymphography | 85.0 | 76 | 28.57 | 77.65 | 77.65
Retention | 87.5 | 86.90 | 63.13 | 87.5 | 87.50

[The table also reported, per dataset, the number of rules falling in the accuracy ranges 90-100, 80-89, 70-79, and 60-69, plus the standard deviation of rule accuracy; these columns were scrambled during extraction and are not reconstructed here.]
The threshold value in Table 34 is the highest accuracy listed for each dataset
in Table 30. The accuracies in the maximum and minimum columns are derived from a
comparison of the three clusters of each dataset using J48 from WEKA (Hall et al., 2009).
The average all rules column has the average of all the rules generated via this process.
The average selected rules column has the average accuracy of rules that have three or
more attributes in their conditions. The majority of rules in the Balance dataset had
two or three conditions; hence, no rule was left out of the accuracy calculation. The
total number of rules in each accuracy range was determined by merging all the rules
generated from the different clusters after removing duplicates. Accuracy for each of
the rules was determined using the whole dataset. The rules were then grouped into
different ranges of accuracy. The reason for reporting rules in various ranges is to help
the user choose the most appropriate accuracy range for determining risk.
Table 35 has sample rules from each of the datasets along with their accuracies
and coverage. These rules were chosen from among the highest accuracies to show that
coverage is typically low when accuracy is high.

Table 35. Sample rules with their accuracy and coverage

Dataset | Rule # | Rule | Accuracy | Coverage
Adult | 52 | capital-gain_c<7298 & native_county=united-states & NOT Not-in-family_Unmarried_Own-child : <=50K | 0.9745 | 0.23515
Balance | 17 | Right-Distance=5RD or 2RD or 1RD & Right-Weight=2RW : L | 0.9333 | 0.02400
Breast Cancer | 60 | breast-quad=(central | right_low) AND irradiat=no AND node_caps=no : no-recurrence-events | 0.91176 | 0.11888
Car | 13 | maint=vhigh & safety=high|med & persons=2 & buying=high : unacc | 1.0 | 0.055556
Connect | 64 | c1=b & c4=b & d1=x & d2=x|b & d3=o & d4=x|b : win | 0.905405 | 0.006572
Lymphography | 57 | Affer=YES & change_node=2|3 & lymph_s=NO & lymph_enlar=2 : 2 | 0.9729 | 0.25
Retention | 2 | grant="Y" and percentile_range=Top10 and ETHNIC="WH" and PELL="N" : R | 0.93357 | 0.09
The maximum, minimum, and total coverage for each of the datasets are shown in
Table 36, along with their average accuracies.
Table 36. Experimental results of total coverage

Data Set        Total          Average        Minimum        Maximum
                coverage (%)   accuracy (%)   coverage       coverage
Adult              89.65          84.86       0.000122846    0.723595713
Balance            79.70          69.00       0.01600        0.12000
Breast Cancer      99.69          78.89       0.003496503    0.18881118
Car               100.00          90.27       0.006944       0.666667
Connect            22.377         86.32       0.000933       0.747487
Retention          55.90          87.5        0.01           12.749
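Total coverage in Table 36 is presumably the fraction of instances matched by at least
one rule, so that overlapping rules are counted once rather than summed; under that
assumption, and reusing the matches helper from the earlier sketch, it could be
computed as follows.

def total_coverage(rules, dataset):
    # `rules` is a list of (conditions, predicted_class) pairs in the
    # same hypothetical representation used by evaluate_rule above.
    covered = set()
    for conditions, _cls in rules:
        for i, record in enumerate(dataset):
            if matches(conditions, record):
                covered.add(i)   # overlapping rules count an instance once
    return len(covered) / len(dataset) if dataset else 0.0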
The Balance dataset did not run through Stage II of the ensemble process because it had
too few attributes. After Stage I was executed, only one attribute remained for Stage II,
so a different methodology had to be followed, as shown in Figure 13. The Connect
dataset experienced a different problem: during Stage II, the association mining step
would not generate any rules for any of the sub-datasets, even after running for four
days. Thus, the method had to be altered for datasets with more than 20 attributes. The
least significant attributes were removed from the dataset by trial and error: after the
least significant attributes were removed, the dataset was tested to see whether rules
could be generated; if it still failed, more attributes were removed. The sub-datasets
behaved differently in that some generated rules with 23 attributes while others had to
be trimmed further. In the end, the Connect dataset generated 109 rules with three or
more attributes and an accuracy above its threshold value of 80.9%.
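The trial-and-error trimming used for wide datasets such as Connect can be summarized
in the sketch below. The rank_attributes and generate_rules callables stand in for the
ranking-and-voting step and the Apriori run; the batch size and the stopping condition
are illustrative assumptions, not values taken from the study.

def prune_until_rules(dataset, attributes, rank_attributes, generate_rules,
                      batch=3, min_attributes=5):
    # Drop the lowest-ranked attributes in small batches and retry the
    # association miner until it yields rules or too few attributes remain.
    attrs = list(attributes)
    while len(attrs) >= min_attributes:
        rules = generate_rules(dataset, attrs)
        if rules:                # the miner produced rules; stop trimming
            return rules, attrs
        ranked = rank_attributes(dataset, attrs)   # least significant last
        attrs = ranked[:-batch]                    # trim the weakest few
    return [], attrs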
Since the main objective of this study was to generate rules for the minority class,
Table 37 shows the accuracy, coverage, and number of rules generated by this process.
Balance, Car, and Connect did not generate any minority-class rules; the other four
datasets did. As shown in Table 37, the datasets that had less than 10% of their data in
the minority class failed to generate rules.
Table 37. Summary of minority class in the datasets

Data Set        Minority class      Class   Class size %   Average     Total         Total # of
                                    size                   accuracy    coverage %    rules
Adult           >50K                7841    0.24080956     0.940324    22.0890       26
Balance         B                   49      0.0784         n/a         n/a           n/a
Breast Cancer   Recurrence-events   85      0.2972028      0.768394    0.72028       17
Car             vgood               65      0.03761574     n/a         n/a           n/a
Connect         draw                6449    0.09546013     n/a         n/a           n/a
Lymphography    1, 2, 3             2+4+61  0.4527027      0.8040516   10.3040541    0+58+52
Retention       D                   2838    0.30714286     0.8         69.00         16
Since no rules were generated for these three datasets using association mining, the
parameters (Table 38 in Appendix B) were changed again. The parameter delta was
changed back to 0.05, which had no effect on any of the subsets except cluster II of the
Retention dataset, where it allowed rules to be generated. The parameter
lowerBoundMinSupport was changed from 0.1 to 0.05; in general, the datasets generated
rules, but nothing extra was generated for the minority class. minMetric was lowered
from 0.85 to 0.65 in intervals of 0.07, but none of the datasets generated any
minority-class rules. The number of rules requested through the numRules parameter
was increased from 10 to 100; each dataset generated 100 rules, but no additional rules
were generated for the minority class. After lowering the accuracy threshold and
increasing the number of rules, the rules generated were observed to have a smaller
number of attributes, even as few as one attribute per rule.
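The parameter exploration above was carried out with Weka's Apriori implementation;
a rough analogue can be sketched in Python using the mlxtend library, as below. The
DataFrame, the minority label, and the threshold grids are illustrative, and mlxtend's
min_support and min_threshold correspond only loosely to Weka's
lowerBoundMinSupport and minMetric.

from mlxtend.frequent_patterns import apriori, association_rules

def sweep_apriori(onehot_df, minority_label,
                  supports=(0.1, 0.05),
                  confidences=(0.85, 0.78, 0.71, 0.65)):
    # Try successively looser thresholds and collect any rules whose
    # consequent contains the minority class label.
    hits = []
    for s in supports:                       # cf. lowerBoundMinSupport
        freq = apriori(onehot_df, min_support=s, use_colnames=True)
        if freq.empty:
            continue
        for c in confidences:                # cf. minMetric 0.85 -> 0.65
            rules = association_rules(freq, metric="confidence",
                                      min_threshold=c)
            minority = rules[rules["consequents"].apply(
                lambda cons: minority_label in cons)]
            if not minority.empty:
                hits.append((s, c, minority))
    return hits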
Discussion and conclusions
The ensemble learning method developed in this study (DMDR) is an extension of the
ensemble learning of MSDT (Datta and Mengel 2014b); it was found that the process of
rule generation changed if the dataset had fewer than six or more than 20 attributes.
Only three datasets in these extreme conditions were tested. The research questions that
arose during this study were the following:
- Does the rule generation process depend on the characteristics of the dataset?
- Does selecting the most significant attributes give accurate results? Which attributes
should be ignored?
To answer these questions, the study used dynamic methods to generate rules. To find
the most significant attributes, three attribute selection methods (symmetrical
uncertainty, gain ratio, and Chi-squared) were applied to each of the sub-datasets. The
attributes were weighed and ranked, and those with the lowest weights were deleted; in
the case of a tie, the average merit was considered. Deleting attributes from the dataset
was an iterative process: initially, the attributes with an average merit of zero were
deleted. If a dataset still failed to generate rules, attributes with a higher average merit,
such as 0.001, were deleted. The process continued until rules were generated using
Apriori association mining.
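The ranking-and-voting step can be made concrete with the following sketch, which is
one plausible reading of the procedure rather than the study's code: each method ranks
the attributes by merit, the ranks are summed as votes, ties are broken by average merit,
and attributes at or below the current merit threshold become deletion candidates.

def rank_and_vote(scores_by_method, merit_threshold=0.0):
    # scores_by_method maps method name -> {attribute: merit}; the three
    # methods stand in for symmetrical uncertainty, gain ratio, and
    # Chi-squared as computed by an attribute evaluator such as Weka's.
    attrs = list(next(iter(scores_by_method.values())))
    avg_merit = {a: sum(s[a] for s in scores_by_method.values())
                    / len(scores_by_method) for a in attrs}
    votes = {a: 0 for a in attrs}
    for scores in scores_by_method.values():
        ordered = sorted(attrs, key=lambda a: scores[a], reverse=True)
        for position, a in enumerate(ordered):
            votes[a] += position          # larger vote = less significant
    # Most significant first; ties broken by higher average merit.
    ranked = sorted(attrs, key=lambda a: (votes[a], -avg_merit[a]))
    to_delete = [a for a in attrs if avg_merit[a] <= merit_threshold]
    return ranked, to_delete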
Another finding from this study was that datasets with less than 10% of their data in the
minority class either did not generate rules or generated rules below the accuracy
threshold. In the Lymphography dataset, three of the four classes were considered
minority classes. Because the smallest class had only two instances, which is not
sufficient for generating rules, and the second smallest class had only four, the third
smallest class was also treated as a minority class for this study. Technically, the class
with two instances should be the minority class, but a domain expert could determine
which class needs to be considered the minority class. On the other hand, the Connect
dataset, with 6449 instances in its minority class, did not generate rules because the
minority class distribution was less than 10% of the total instances. To address this
finding, the accuracy and minimum support parameters were lowered.
Future work
The ensemble learning method created for this study is a manual process but will be
automated in the future. The question of significant attribute selection should be
investigated further through experimentation with datasets having larger attribute sets.
The ensemble learning technique in this study generates rules for both classes, but the
process failed to generate rules for minority classes comprising less than 10% of the
instances in the dataset. Future work will isolate these instances to generate rules.
This study will also be extended to generate rules for datasets in which the positive
class, rather than the negative class, is in the minority; for example, the number of
female students in a high school population who choose a STEM degree, as well as
other well-known datasets from the UCI repository.
Appendix B
Table 38. Apriori settings

Option                 Default      Experiment   What it means
                       value        value
car                    False        True         Generates rules with the class attribute
                                                 instead of general associations
classIndex             -1           -1           Placement of the class attribute in the
                                                 dataset
delta                  0.05         0.1          Iteratively decrease support by this
                                                 factor; support is reduced until the
                                                 minimum support is reached or the required
                                                 number of rules has been generated
lowerBoundMinSupport   0.1          0.1          The lowest value of minimum support
metricType             confidence   confidence   The metric can be set to any of the
                                                 following: confidence (class association
                                                 rules can only be mined using confidence),
                                                 lift, leverage, conviction
minMetric              0.9          0.85         Will consider rules with accuracy of 0.85
                                                 or higher
numRules               10           100          Number of rules to generate
outputItemSets         False        True         Itemsets are shown in the output
removeAllMissingCols   False        False        Removes columns with missing values
significanceLevel      -1.0         -1.0         Significance testing
treatZeroAsMissing     False        False        Zero is treated the same way as a missing
                                                 value
upperBoundMinSupport   1.0          1.0          The highest value of minimum support; the
                                                 process starts here and iteratively
                                                 decreases support until the lower bound
verbose                False        False        The algorithm runs in verbose mode if
                                                 enabled
CHAPTER V
CONCLUSIONS AND FUTURE WORK
This study proposed several different methods to generate rules for a complex problem
like student retention. The study was done in three phases. In the first phase, the study
applied methods used by other data mining researchers and then extended them by
clustering the data and using different decision tree algorithms. In the second phase, the
study was further modified to derive rules in multiple stages using decision trees and
association mining. In the final phase, the method from the second phase was applied to
other well-known datasets from UCI; this method had to be further modified depending
on the characteristics of the datasets. The following paragraphs elaborate on each of
these phases.
In the first phase, the rules were generated using different decision tree algorithms and
methods implemented by other researchers, for example Yu et al. (2010). The rules
generated were either too simple or too complex to be used in real situations by
practitioners; hence, the rules did not meet the necessary and sufficient conditions. The
study was therefore extended, and the datasets were clustered. Rules were generated
from a controlled decision tree. Certain anomalies occurred: first, some rules generated
from a cluster showed the opposite class when validated against the whole dataset;
second, two or more rules were identical except for their attribute values and could be
joined with a logical condition. Hence, the study was extended to use two different
algorithms: decision trees and association mining.
In the second phase, two different algorithms were used to generate the rules. The rules
that the multi-stage method extracted were more accurate and were used for student
intervention. The attributes that emerged as possible factors for attrition are now
collected for incoming freshmen by the Institutional Research office. None of the rules
generated by the multi-stage method had conflicting classes when tested on the whole
dataset, and association mining could incorporate attributes beyond those employed by
the decision tree.
In the third phase, the study verified the method from the second phase by applying it to
other well-known datasets of various types from the UCI repository. During this study,
the following question arose: does rule generation depend on the characteristics of the
dataset? The rule generation technique was modified to accommodate datasets with
smaller or larger dimensions. Another finding was that datasets with less than 10% of
their data in the minority class either did not generate rules or generated rules below the
accuracy threshold. This finding will be studied further in future work focused on
minority classes.
In summary, the multi-stage technique used to generate useful rules for the Retention
dataset was verified using datasets from other domains. The rules generated were
grouped into accuracy ranges, and the measures in this study were coverage and
accuracy for each rule, in contrast to reporting only the accuracy of the entire dataset.
In the future, the study will include rough set theory to generate rules in fuzzy areas.
Data mining algorithms typically assign rules to either the positive or the negative class,
but some rules, as seen in this study, cannot be assigned to just one class. These rules
need to be isolated, which will in turn increase the total accuracy on the dataset.
This study will also be extended to generate rules for other datasets in which the positive
class, rather than the negative class, is in the minority; for example, the number of
female students in a high school population who choose a STEM degree, as well as any
other educational datasets that become available in the UCI repository.
This study addressed the complex problem of student retention by creating a multi-stage
model to generate retention and attrition rules that can be used by institutions of higher
education. By using these rules, administrators can intervene at an early stage to prevent
students from dropping out before their second year in college. Although no
methodology can account for every factor that influences a student's decision to leave,
this study has developed a way to provide better accuracy for each rule. Furthermore,
because the rules are grouped in ranges, this methodology allows administrators to
decide how intense their intervention strategy should be for any given rule.
For rule creation in data mining, this study provides a novel approach by combining two
well-known algorithms: decision trees and association mining. The rules are created
using a controlled decision tree, and each node of the decision tree becomes an
independent dataset for creating rules through association mining. Each final rule is
derived by combining the rules generated by the decision tree and by association
mining; hence, the rules generated via the MSDT technique integrate the advantages of
both algorithms. Unlike previous studies, this methodology is a multi-stage process in
which a controlled decision tree is generated using a decision tree algorithm. Next, to
overcome the limitation of selecting only top-ranked attributes when creating a rule, the
attributes used by the decision tree are removed from the dataset before rules are
generated through association mining. Hence, the performance measure in this study is
not the overall accuracy or confidence of the model but the accuracy and coverage of
each rule, to facilitate implementation by administrators.
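The overall flow can be condensed into a high-level sketch. Here build_controlled_tree
and mine_association_rules stand in for the Stage I and Stage II algorithms, and the tree
object is assumed to expose its leaves (each carrying its instances and path conditions)
and the attributes it used; the names and structure are illustrative, not the study's
implementation.

def msdt_rules(dataset, attributes, class_attr,
               build_controlled_tree, mine_association_rules):
    # Stage I: grow a controlled (shallow) decision tree.
    tree = build_controlled_tree(dataset, attributes, class_attr)
    combined_rules = []
    for leaf in tree.leaves:
        # Stage II: each node becomes an independent dataset, and the
        # attributes already used by the tree are removed beforehand.
        subset = leaf.instances
        remaining = [a for a in attributes if a not in tree.used_attributes]
        for conditions, cls in mine_association_rules(subset, remaining,
                                                      class_attr):
            # The final rule joins the tree path with the mined conditions.
            combined_rules.append(({**leaf.path_conditions, **conditions}, cls))
    return combined_rules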
REFERENCES
ACT, Inc. 2004. What Works in Student Retention: Four-Year Public Institutions. ACT,
Inc.
Alldrin, N., Smith, A., and Turnbull, D. 2003. Clustering with EM and K-means.
University of San Diego, California, Tech Report.
Agrawal, R., and Srikant, R. 1994. Fast algorithms for mining association rules.
International Conference on Very Large Databases, 487-499.
Baker, R.S.J.D and Yacef, K. 2009. The state of educational data mining in 2009: A review
and future visions. Journal of Educational Data Mining 1(1), 3-17.
Bean, J. and Eaton, B. 2001. The psychology underlying successful retention practices.
Journal of College Student Retention: Research, Theory and Practice 3, 73-89.
Boston, W.E., Ice, P., and Gibson, A.M. 2011. Comprehensive assessment of student
retention in online learning environments. Online Journal of Distance Learning
Administration IV(I).
Bayer, J., Bydzovska, H., Geryk, J., Obsivac, T., and Popelinsky, L. 2012. Predicting dropout from social behaviour of students. Proceedings of the 5th International Conference
on Educational Data Mining, 103-109.
Byers González, J. and DesJardins, S. 2002. Artificial neural networks: A new approach
for predicting application behavior. Research in Higher Education 43(2), 235-258.
Cattell, R.B. 1966. The scree test for the number of factors. Multivariate Behavioral
Research, 1, 245-276.
Datta, Soma and Mengel, Susan. 2014a. Readable rules and higher accuracy for Retention
data with decision tree, [Submitted]
Datta, Soma and Mengel, Susan, 2014b. Multi stage decision algorithm to generate
readable rules, [manuscript]
DeBerard, M.S., Spielmans, G.I., and Julka. D.C. 2004. Predictors of academic
achievement and retention among college freshmen: A longitudinal study. College
Student Journal 38, 66-80.
Delen, D. 2010. A comparative analysis of machine learning techniques for student
retention management. Decision Support Systems 49, 498-506.
Lauría, E.J.M., Baron, J.D., Devireddy, M., Sundararaju, V., and Jayaprakash, S.M. 2012.
Mining academic data to improve college student retention: An open source perspective.
International Conference on Learning Analytics and Knowledge, 139-142.
Gaudard, M., Ramsey, P., and Stephens, M. 2006. Interactive Data Mining and Design of
Experiments: The JMP Partition and Custom Design Platforms. New Haven Group.
Herzog, S. 2006. Estimating student retention and degree completion time: Decision trees
and neural networks vis-a-vis regression. New Directions for Institutional Research, 17-33.
Hagedorn, L.S. 2005. How to define retention. In Alan Seidman (ed.), College Student
Retention, Praeger, 89-106.
Hall, M., Frank, E., Holmes, G., Pfahringer, B., Reutemann, P., and Witten, I.H. 2009. The
WEKA Data Mining Software: An Update. SIGKDD Explorations 11(1).
Huo, J., Wang, Xizhao, Lu, Mingzhu, and Chen, Junfen. 2006. Induction of multi-stage
decision tree. IEEE International Conference on Systems, Man, and Cybernetics.
Intuitor. 1996. http://www.intuitor.com/statistics/SimpsonsParadox.html.
Kaiser, H. F. 1960. The application of electronic computers to factor analysis. Educational
and Psychological Measurement, 20, 141-151.
Kaufman, L. and Rousseeuw, P. 1990. Finding Groups in Data: An Introduction to Cluster
Analysis, Wiley.
Kerkvliet, J., and Nowell, C. 2005. Does one size fit all? University differences in the
influence of wages, financial aid, and integration on student retention. Economics of
Education Review 24, 85-95.
Kotsiantis, S. 2009. Educational data mining: A case study for predicting dropout-prone
students. International Journal of Knowledge Engineering and Soft Data Paradigms,
1(2), 101-111.
Lin, H.S. 2012. Data mining for student retention management. The Journal of Computing
Sciences in Colleges 27(4), 92-99.
Lu, Mingzhu, Huo, Jianbing, Chen, C.L. Philip, and Wang, Xizhao. 2009. Multi-stage
decision tree based on inter-class and inner-class margin of SVM. Proceedings of the
2009 IEEE International Conference on Systems, Man, and Cybernetics.
Luan, J. 2002. Data mining and its applications in higher education. In A.M. Serban and J.
Luan (eds.), Knowledge Management: Building a Competitive Advantage in Higher
Education. New Directions for Institutional Research, no. 113. San Francisco: Jossey-Bass.
Lykourentzou, I., Giannoukos, I., Nikolopoulos, V., Mpardis, G., and Loumos, V. 2009.
Dropout prediction in e-learning courses through the combination of machine learning
techniques. Computers and Education 53, 950-965.
Macfadyen, L.P., and Dawson, S. 2010. Mining LMS data to develop an early warning
system for educators: A proof of concept. Computers and Education 54, 588-599.
Mallinckrodt, B. and Sedlacek, W.E. 1987. Student retention and the use of campus
facilities by race. NASPA Journal 24, 28-32.
Marquez-Vera, C., Cano, A., Romero, C., and Ventura, S. 2013. Predicting student failure
at school using genetic programming and different data mining approaches with high
dimensional and imbalanced data. Applied Intelligence 38, 315-330.
Mellalieu, P.J. 2011. Predicting success, excellence and retention from student's early
course performance: Progress results from a data mining decision support system in a
first year tertiary education programme. XXIX International Conference of the
International Council for Higher Education.
Nandeshwar, A., Menzies, T., and Nelson, A. 2011. Learning patterns of university student
retention. Expert Systems with Applications 38, 14984-14996.
Nara, A., Barlow, E., and Crisp, G. 2005. Student persistence and degree attainment
beyond the first year in college: The need for research. In Alan Seidman (ed.), College
Student Retention, Praeger, 129-153.
National Audit Office. 2007. Staying the course: The retention of students in higher
education.
National Center for Education. 2012. http://nced.ed.gov/.
Pittman, K. 2008. Comparison of data mining techniques used to predict student retention.
Doctoral dissertation, Nova Southeastern University, Fort Lauderdale.
Schmitt, N., Oswald, F.L., Kim, B.H., Imus, A., Merritt, S., Friede, A., and Shivpuri, S.
2007. The use of background and ability profiles to predict college student outcomes.
Journal of Applied Psychology 92(1), 165-179.
Senator, T. 2005. Multi-stage classification. Proceedings of the Fifth IEEE International
Conference on Data Mining, 386-393.
Sewell, W., and Wegner, E. 1970. Selection and context as factors affecting the probability
of graduation from college. American Journal of Sociology 75(4), 665-679.
Superby, J.F., Vandamme, J-P., and Meskens, N. 2006. Determination of factors
influencing the achievement of the first-year university students using data mining
methods. Workshop on Educational Data Mining.
Thomas, E., and Galambos, N. 2004. What satisfies students? Mining student-opinion data
with regression and decision tree analysis. Research in Higher Education 45(3), 251-269.
Tinto, V. 1975. Dropout from higher education: A theoretical synthesis of recent research.
Review of Educational Research 45, 89-125.
Tinto, V. 1993. Leaving College: Rethinking the Causes and Cures of Student Attrition,
University of Chicago Press.
Tinto, V., Russo, P., and Kadel, S. 1994. Constructing educational communities in
challenging circumstances. Community College Journal 64(1), 26-30.
Tinto, V. 2006. Research and practice of student retention: What next? Journal of College
Student Retention 8(1), 1-19.
UCI Repository of machine learning databases and domain theories. FTP address:
ftp://ftp.ics.uci.edu/pub/machine-learning-databases.
UC Irvine Machine Learning Repository (UCI). http://archive.ics.uci.edu/ml/.
Van Nelson, C., and Neff, K. 1990. Comparing and contrasting neural network solutions
to classical statistical solutions. Paper presented at the Midwestern Educational
Research Association Conference, Chicago, Oct. 19, 1990.
Wang, Xizhao, He, Qiang, and Chen, Degang. 2005. A genetic algorithm for solving the
inverse problem of support vector machines. Science Direct, vol. 68, 225-238.
Witten, I.H., and Frank, E. 2005. Data Mining: Practical Machine Learning Tools and
Techniques, 2nd Edition, Morgan Kaufmann, San Francisco.
Wu, X., Kumar, V., Quinlan, J.R., Ghosh, J., Yang, Q., Motoda, H., McLachlan, G.J., Ng,
A., Liu, B., Yu, P.S., Zhou, Z., Steinbach, M., Hand, D.J., and Steinberg, D. 2007. Top
10 algorithms in data mining. IEEE International Conference, survey paper.
Yadav, K.S., Bharadway, B., and Pal, S. 2012. Mining education data to predict student's
retention: A comparative study. International Journal of Computer Science and
Information Security 10(2), 113-117.
Yu, H.C., DiGangi, S., Jannasch-Pennell, A., and Kaprolet, C. 2010. A data mining
approach for identifying predictors of student retention from sophomore to junior year,
Journal of Data Science 8, 307-325.
Zhang, Y., Oussena, S., Clark, T., and Kim, H.T. 2010. Use data mining to improve student
retention in higher education: A case study. 12th International Conference on
Enterprise Information Systems 2010, Paper Nr-129.