Developing Data-Driven Predictive Models of Student Success
Kresge Data Mining Project
Phase One Report
University of Maryland, University College
November 20, 2012
Executive Summary
This report documents progress on the Kresge Data Mining Grant: Developing Data-Driven Predictive Models of Student Success. The grant was awarded to University of
Maryland, University College (UMUC) in collaboration with two community college
partners, Prince George's Community College (PGCC) and Montgomery College (MC). The
purpose of the grant is to build an integrated database in order to examine student
progress across multiple institutions using data mining techniques and statistical models to
identify factors that predict online student success.
To date, this project has three major accomplishments:
1. UMUC and the two partner community colleges, PGCC and MC, have a data
sharing partnership that supports academic research and student success.
2. UMUC designed, developed, and implemented a multi-institutional database of
over 250,000 students (called the Kresge Data Mart) that contains the academic
history on transfer students from these two colleges integrated with data on
UMUC student online behavior and academic performance.
3. Researchers used data mining and predictive modeling to investigate
relationships between variables to predict student success.
Initial research yielded the following results:
1. Researchers reviewed the extant literature on online education and educational
data mining and identified key variables that have been found to predict
successful course completion and student re-enrollment and retention.
2. Examination of the student data in the Kresge Data Mart using four classification
data mining algorithms and survival analysis indicated that factors in the
students’ classroom behaviors and previous academic background, such as
number of schools attended, transfer credits, and transfer grade point average,
were strong predictors of student re-enrollment in the following semester.
3. Regression models utilizing predictors revealed that total transfer credit,
transfer GPA, activity in course conferencing, and activity prior to the start of a
course contribute strongly to course success.
4. Predictive models verified that course success is a strong predictor of re-enrollment.
These data mining and statistical modeling efforts identify variables associated with online
student success for students transferring from PGCC and MC. These findings guide next
research steps, which include validating data mining and predictive models on an
expanded dataset, building profiles of successful online students, and developing
meaningful interventions to improve success of students transferring from our partner
institutions.
Introduction
The purpose of this report is to document progress on the Kresge Data Mining Grant:
Developing Data-Driven Predictive Models of Student Success. The research is being conducted
by the Evaluation and Assessment research team in the Office of Institutional Research at
University of Maryland University College (UMUC). This report is presented in six sections:
1) Section I contains the grant overview, objectives, milestones and a financial update.
2) Section II reviews the research design, general methodologies, and research
questions.
3) Section III contains the literature review and research foundations.
4) Section IV reviews the data developed for this project.
5) Section V describes initial results from the analyses of research questions.
6) Section VI describes next steps to stay on track with both project and research goals.
Section I: Grant overview
The grant was awarded to UMUC, in collaboration with two community college partners,
to build a multi-institutional database to examine student progress across multiple institutions.
The purpose of this grant is to use data mining techniques to develop data driven predictive
models and to identify factors that influence the success of online students. Embedded in the
project is the expectation that findings from the research will be integrated into the business
processes of each of the participating institutions for the purpose of evaluating the impact on
student success. Ongoing evaluation is a key aspect of this agile research project.
Over the course of this three-year project, student transcript data from Montgomery College and Prince George's Community College will be integrated with performance and classroom behavior data from UMUC. The research in data mining and predictive modeling will contribute to a process to identify relationships between students' community college academic histories and their performance at the four-year institution.
Grant partnership
UMUC is a four-year public university that offers online degree programs to a diverse
population of working adults. As part of this grant, UMUC established partnerships with two
Maryland community colleges that also serve large and diverse student populations.
Montgomery College (MC), established in 1946, currently enrolls over 35,000 students annually.
MC is a diverse institution that serves a large international student population. Prince George's
Community College (PGCC) enrolls more than 40,000 students and is also very diverse, serving
students from approximately 125 different countries. Both institutions serve the metro-D.C. area,
but are different in that PGCC serves more students who are considered low-income. Both
institutions have endorsed the goals of this project and are committed to working with UMUC to
find ways to better serve transfer students.
A Memorandum of Understanding (MOU) between UMUC and each partner institution
was negotiated and signed in order to clarify the security and use of data used for this research.
This MOU allows the research to move forward while protecting the privacy of students (See
Appendix A).
Objectives and milestones
Specific objectives and milestones are presented below for each stage of the research
project. These objectives and milestones have been modified slightly from the original grant
documents, but are consistent with the grant requirements.
Table 1
UMUC Kresge data mining grant objectives and milestones
Status as of September 1, 2012

Year One

Objective: Develop a project action plan
- Identify project requirements. (Complete)
- Develop a project action and collaboration plan with the partnering agencies. (Complete)
- Hire a full-time researcher and half-time programmer. (Complete)
- Build an evaluation framework. (Complete)

Objective: Data preparation
- Work with the different units at UMUC and partnering institutions to collect data. (Complete)
- Clean and transform data. (Complete)
- Prepare a data "universe" (integrated database system) on transfer students in the population. (In process)
- Understand variables; define student characteristics and retention data; develop data dictionary. (Complete)

Objective: Evaluation of project
- Conduct ongoing project evaluation. (In process)

Year Two

Objective: Build the analytic models
- Analyze data and identify factors that predict success/failure. (In process)
- Build the data mining and analytic techniques and integrate them into the predictive models. (In process)
- Build student profiles based on the models. (In process)

Objective: Assess the validity of the models
- Discuss results with the advisory board and obtain feedback.
- Discuss results with Project Partners and obtain feedback.
- Evaluate the models in predicting students' chances for success and make recommendations for adjustments to the models.

Objective: Evaluation of project
- Conduct ongoing project evaluation.

Objective: Design pilot study
- Design intervention study. Define requirements for the pilot. Develop implementation plan.
- For the resulting predictive models, identify points of integration in the UMUC business processes.
- Partner institutions with transferring students will evaluate their business processes to identify points of integration of modeling techniques for improvement of student retention.

Year Three

Objective: Conduct pilot study
- Deploy interventions and evaluation systems through a pilot at the partnering institutions and at UMUC.
- Conduct student retention and analysis of student satisfaction.

Objective: Evaluate the pilot
- Evaluate the effectiveness of the pilot and assess transferring students' chances of success and retention with the model interventions.

Objective: Dissemination of the data mining model and resources
- Discuss results with Project Partners and develop recommendations.
- Plan project revision based on evaluation data.
- Develop website and repository for resources.
- Present findings at national conferences on higher education to describe research purpose and results.
- Organize a UMUC conference to engage national discussion around research in data mining and predictive modeling for the purpose of improving student success.

Objective: Incorporate feedback from the education community to improve the models
- Evaluate the attendance, interest, and requests from the meetings. Develop and disseminate findings.
- Use feedback to identify and integrate into an enhancement plan.

Objective: Comprehensive evaluation of the project
- Conduct ongoing project evaluation and develop an evaluation report.

Objective: Report on the process and results
- Solidify data-sharing partnerships with community colleges, documenting roles, processes, and measurable outcomes of student success. Develop an integrated research paper outlining research purpose, research design, processes, and products.
Financials
The Kresge Foundation awarded UMUC $1.2 million to build an integrated database,
explore data mining techniques, build predictive models of student success, define intervention
strategies to increase student success, deploy those strategies, and continuously monitor and
evaluate the strategies and student outcomes.
To date, UMUC has expended approximately 41% of the total grant. The expenses
encompass a hardware purchase to house the data mining database, expenditures for data
collection at the partner institutions, and dedicated salaries for a data mining specialist and a
graduate assistant. Additional staff resources are provided in kind by UMUC. The remaining
grant (approximately $703,333) will be used to support external evaluation and conference
development. The financial statement for this grant is in Appendix B.
Section II: Research Design
The research design for this study was generated from the grant's goal to develop data-driven predictive models of online student success using data mining techniques. The design
covers exploratory research questions that have been addressed so far. Additional questions are
expected to be developed after review of the initial results.
This study employs an exploratory multi-method research design. Exploratory research
is useful in this context because it allows researchers to investigate relationships between
variables and build predictive models that could be replicated and tested. A multi-method
research design allows researchers to be flexible in their choice of methodology and to answer
research questions using exploratory analyses.
With this research design in mind, this study addresses the following exploratory research
questions:
Question 1: What relationships exist among variables in the extant dataset?
Question 2: Which variables contribute to the prediction of online course success?
Question 3: Which variables contribute to the prediction of retention in an online
environment?
Section III: Literature Review
This literature review examines online student success and educational data mining
literature as it pertains to student success. This section contains three parts: study definitions,
online student success literature and educational data mining literature.
Definitions
Data mining in education research is a broad term that encompasses both data
management and statistical techniques. Data management refers to how data are prepared for the
application of data mining techniques. These techniques are used to explore variables that may
predict student progression related to online course success and retention.
In order to maintain language consistency, terms used throughout the study are defined below.

- Students: stateside, undergraduate, first bachelor's degree-seeking students who transferred from either PGCC or MC. This is also commonly known as the cohort for the study or the population of interest.
- Course success: obtaining a final grade of A, B, or C in any course.
- Re-enrollment: enrolling in the immediate next semester following the semester in question.
- Retention: re-enrolling at the institution within 12 months after initial enrollment or in a rolling window of three semesters following the semester of interest.
- Degree completion: attaining a bachelor's degree after enrolling as a degree-seeking student.
- Student success: a broad term used to indicate course success, re-enrollment, retention, or degree completion.
These definitions were determined by reviewing a number of sources, including: 1) a review of data mining and online retention literature; 2) institutional publications, such as reports to the Middle States Commission on Higher Education and in-house studies on retention and course success; and 3) UMUC, MC, and PGCC undergraduate course catalogs. Additional
terminology and definitions related to data mining and research methodology are discussed later
in this report.
Online student success literature
Current literature on student success focuses primarily on student outcomes such as
course success, course withdrawal and retention. For example, student and course level variables
such as student characteristics, previous course work, grades, and time spent in course
discussions and activities may be useful in measuring course success (Aragon & Johnson, 2008;
Morris & Finnegan, 2009; Morris, Finnegan & Lee 2009; Park & Choi, 2009). Course-level
variables acquired from student login data may have predictive value in measuring course
withdrawal (Willging & Johnson, 2008; Nistor & Neubauer, 2010). Student, course, program and
institution level variables such as student characteristics, number of transfer credits, final grade
in any given course, experience in online environments, and course load may have utility
measuring re-enrollment and retention (Aragon & Johnson, 2008; Morris & Finnegan, 2009;
Boston, Diaz, Gibson, Ice, Richardson & Swan, 2011).
Although these studies showcase a variety of findings related to student success, the
majority of studies in retention in online learning environments use traditional statistical or
qualitative methods. Park and Choi (2009) point out that expansion of methods such as data
mining might have utility when student, course, program, and institutional level variables are
well defined and institutionally meaningful. Literature related to educational data mining
focuses on exploratory research.
Educational data mining literature
Data mining is a method of discovering new and potentially useful information from
large amounts of data (Baker and Yacef, 2009; Luan, 2001). Educational data mining is a subset
of the field of data mining that draws on a wide variety of literatures such as statistics,
psychometrics, and computational modeling to examine relationships that may predict student
outcomes (Romano and Ventura, 2007; Baker and Yacef, 2009). In educational data mining, data
mining algorithms are used to create and improve models of student behavior in order to improve
student learning (Luan, 2002).
Data mining methods are most helpful for finding patterns already present in data, not
necessarily in testing hypotheses (Luan, 2001). Baker and Yacef (2009) suggest that higher
education research should use a variety of algorithms, such as classification, clustering or
association algorithms in determining relationships between variables. Although many
definitions of these techniques exist in data mining literature, Han and Kamber (2001) offer the
following definitions. Classification is the process of finding a set of models or functions that
describe and distinguish data classes or concepts to predict a class of objects whose class label is
unknown. Clustering analyzes data objects that are related to similar outcomes without
consulting a class label. Association is the discovery of rules showing attribute value conditions
that occur frequently together in a given set of data (Han & Kamber, pp.23-24).
Recent research suggests that these data mining algorithms can be used to examine
variables related to student success. Yu, DiGangi, Jannach-Pennell, Lo, and Kaprolet (2010)
used a classification algorithm to explore potential predictors related to student retention in a
traditional, undergraduate institution. In this study, the authors used a decision tree to explore
demographic, academic performance, and enrollment variables as they related to student
retention. This study revealed a predictable relationship between earned hours and retention, but
also found that at this institution, retention was closely related to state of residence (in-state/out-of-state) and living location (on campus/off campus). The authors speculate that this finding points to the potential utility of online courses in improving retention for out-of-state or off-campus students.
Despite these recent developments in exploring variables related to student success in
traditional higher education settings, research using data mining techniques to uncover
relationships among variables in online courses is limited in scope. This study is designed to fill
this gap in the extant literature by utilizing data on online students who attended multiple
institutions.
Section IV: Project Data
The Kresge project has several goals. One is for UMUC to develop a data sharing
partnership with key community colleges. Another is to build an integrated multi-institutional
data mart that describes the academic history of transfer students who have attended community
colleges prior to enrolling at UMUC. This data mart will be used for data mining and for the
development of statistical models that identify factors related to student success in an online
environment.
To that end, there are several successes in the area of data integration that have been
achieved over the past year. First, a meaningful and significant partnership was established
between UMUC, MC and PGCC. Second, data were collected from source systems and stored in
secure areas for processing. Third, the vision and design for the Kresge Data Mart (KDM) has
been developed. Fourth, the student population of interest was identified and verified. Fifth,
datasets were created for the initial modeling and data mining.
The most significant success of this grant so far is the establishment of the data sharing
collaboration between the community colleges and UMUC. The partnership agreements have
been forged through a series of meetings and an MOU that guides the research and data sharing
activities. Research questions have been developed to benefit the students through actionable
research that will impact the business processes at each institution. Measurable outcomes have
been set and ongoing evaluation is being conducted and shared with partner institutions.
The data used in this effort will be collected from all three institutions and stored in a
data mart called the Kresge Data Mart (KDM), which is housed on a server at UMUC. This
section will describe the overall vision for the development of the infrastructure and the current
status of the project.
Student Data
A core set of data is collected on each student from each institution. This information
includes demographics, enrollment, courses, grades, and recent classroom activity in online
courses. Each is described below.
Demographic data. Demographic data include fields such as address, phone number, age,
gender, and race/ethnicity. While some of this information may change over time, for the
purpose of this study, the demographic data are considered static.
Enrollment data. Enrollment data are typically captured on students who have officially
registered. Enrollment data includes course registration, grades, program of study or major, and
student status. This information changes each term or semester.
Transfer data. Data about a student's academic history at previous institutions are also stored. This information includes number of transfer credits, courses transferred, grades associated with each transfer course, and prior degrees earned. There are two sources of transfer data. One is the community college transcript data that contains all course information on each student. These data are collected and integrated into the KDM. The other source is UMUC transcript files, which may not include all information on the student's prior courses but only the courses that were transferred to UMUC and equated to UMUC courses. The UMUC transcript data are not as
complete as the community college transcript data, but serve as the basis for courses and credits
that are applied to the student's degree.
Classroom behavior data. WebTycho, UMUC's proprietary learning management
system, stores counts of student activity in various modules within each online class. Data are
stored for each session, which is defined as the time between each student login and logoff.
Identifying the population of interest
The students to be included in the KDM are defined as students who attended UMUC
between 2005 and 2011. Students who attended at least one community college prior to UMUC
enrollment were identified and analyzed as a separate subpopulation. The data will eventually
include those who attended a community college of interest, but did not transfer to UMUC.
These two subgroups and the larger population of students will serve as the foundation of data
stored in the KDM and the backbone for the research.
The process of building the populations has multiple steps. The first step was to identify
two partner institutions who would share data on students who transferred to UMUC. The
second step will be to identify other institutions that are willing to do the same. The third step
will be to identify students who transferred from key partner institutions, but who did not transfer
to UMUC.
This report focuses on the first stage of building an integrated database. The population
of interest is transfer students. Two community colleges, Montgomery College and Prince
George's Community College, that transfer a large number of students to UMUC were
identified. UMUC formed a data sharing partnership with each institution to identify factors
related to student success, define academic interventions, implement interventions, and examine
their impact on student success.
UMUC identified 257,903 students who attended UMUC between 2005 and 2011. Of
these students, MC used varying matching strategies to confirm enrollment for 21,444 students,
and PGCC used a different set of matching strategies to confirm enrollment for 11,046 students.
Additionally, the National Student Clearinghouse (NSC) identified 12,776 students who attended
MC and 8,697 students who attended PGCC. Furthermore, UMUC analyzed transcript
information in its student information system and found even fewer matches. Thus, the Kresge
Data Mart includes student data on all 257,903 students, as well as information on the students matched by the community colleges.
The KDM will serve as a critical resource for all predictive analytics projects as well as research projects related to student success and integrated data management. One key finding is that the data from the community colleges yielded much more robust and accurate information about students' academic histories at previous institutions than UMUC's internal data resources.
Data integration
Institutions store student demographic, enrollment and classroom behavior data in a
variety of information systems. For the purpose of this project, data from two UMUC student
applications, PeopleSoft and WebTycho, are included in this database. PeopleSoft is a student
information system that stores current and historical data about the student and their enrollment
history. MC and PGCC have similar systems for storing student data. WebTycho is a proprietary
learning management system that records online classroom activity for every user (student or faculty)
during an online session.
Figure 1
Description of the Integrated Kresge Data Mart
Each institution extracted data for each student in the matched population from the source
student information systems and learning management systems. The process for merging data
across institutions requires standardizing the data and developing a data dictionary.
Data from multiple systems are combined in stages. At the first stage, data from each
partner institution were collected and cleaned. Definitions of each data element were gathered,
standardized, and documented. At the second stage, data from UMUC's student information system and the online classroom were merged and developed in the UMUC data warehouse. The next stage integrated the community college data with the UMUC data in the Kresge Data Mart (KDM), as shown in the diagram above. The KDM combines information on the students'
history at prior institutions, their current course performance information, and their online course
behavior, as well as common outcome measures identified in the literature. The final stage of the
KDM will include information from MC and PGCC on students who transfer elsewhere, as well
as students who have transferred from other community colleges to UMUC. This is the data
mart that will serve as the base of the final research models and intervention analysis.
Current status. The community college data have been collected and a coding structure
has been developed. The code continues to evolve, as the data are still being processed. As a
result, preliminary analyses presented in this report are based on datasets that are derived from
the raw data collected and stored during the data integration process.
Data storage. In order to carry out data mining activities, storage was required for the
various sources of data and for the various stages of the data integration process. To meet the
requirements of building an integrated data mart, UMUC purchased the Oracle Exadata hardware
which houses the KDM, as well as the UMUC data warehouse. The data for the partner
institutions are stored separately from the data warehouse for the purpose of security and
confidentiality. The data sources on the Oracle Exadata hardware, which is in part supported by
the Kresge Foundation, have also contributed to other analytical and predictive modeling projects. The data integration among various sources adds value to all projects related to student success.
Data preparation for initial analyses
Researchers used sample datasets to develop data mining algorithms and predictive models. The datasets were designed specifically for the analysis of the exploratory research questions. The data mining algorithms and predictive models created to explore the initial research questions are described in each section below, along with initial findings.
Data security
Secure File Transfer Protocol (SFTP) is the method that is currently being used to
transfer data between partner institutions. The storage area, which contains both community
college and UMUC data, is secured by the database administrators providing accounts only to
staff who are working on the development of the KDM.
Section V: Initial findings
Research Question 1: Examining variables and relationships
This analysis uses data mining techniques in order to examine relationships amongst
variables in the extant dataset. The question answered in this analysis is: What relationships exist
amongst variables in the extant dataset?
Dataset. The dataset used for this analysis contains 2,643 new, stateside, first bachelor's degree-seeking UMUC students enrolled in 15 gateway courses that students took in Spring 2011 to begin their programs of study. Since each student can be enrolled in more than one course, this dataset contains 4,331 enrollment records and is comprised of 394 unique sections of these gateway courses. The variables in the dataset include transfer history, enrollment data for the students' first semester at UMUC, and online classroom behavior, defined as WebTycho actions in the classroom prior to and after the start day of classes (see Table 2).
Table 2
Dataset variables and definitions

Emplid: Unique ID for each new student.
Subject: Four-letter code for each course, e.g., ACCT, MATH, or PSYCH.
Course Number: Subject and course number together create a class.
Total Transfer Credits: A sum of credits/hours transferred from other institutions.
Number of Schools Attended Previously: A count of the number of institutions in the new student's transfer history.
Time Since Last Institution: The number of months since the previous institution.
Transferred in at Least One English Course: Flag to indicate whether the student had at least one class in the transfer history where the subject was ENGL, EWRT, or WRTG.
Transferred in at Least One Math Course: Flag to indicate whether the student had at least one class in the transfer history where the subject was MATH, STAT, or ENGR.
Earned Associates Degree: The degree earned by the new student (if any) at a previous institution.
Course Grouping: Developmental (EDCP100), General (MATH 009 and 012), Business and Professional, Communication, Science and Social Sciences, Technology.
Length of Course in Weeks: Potential values are 8, 10, 12, and 14.
Semester Load: Total credits attempted in Spring 2011 by the student.
Re-enrolled: A Yes/No flag indicating whether or not the student enrolled in at least one course in the Summer of 2011, excluding the courses enrolled in Spring 2011.

For each of the following classroom activity variables, counts were created for each of the first four weeks of the class as well as for the period prior to the first day of class:

Entering Class: Number of times the new student logged into an area of the online course in WebTycho.
Opening Course Content: Number of times the student opened the Course Content area of the online course.
Opening Class Roster: Number of times the student opened the Class Roster area of the course.
Using the Chat Feature: Number of times the student used the Chat Feature in WebTycho.
Opening a Study Group Conference: Number of times the student went to the conference postings in a study group.
Opening Reserved Reading: Number of times the student opened the Reserved Reading area of the online course.
Opening the Webliography: Number of times the student opened the Webliography area of the online course.
Starting the Conferencing Feature: Number of times the student opened the Conferences area of the online course.
Reading a Conference Note: Number of times the student opened a note for reading in the Conferences area of the course.
Writing a Conference Note: Number of times the student wrote or edited a note in the Conferences area of the course.
Adding an Attachment to a Conference Note: Number of times the student attached a file to a note in the Conferences area of the course.

Notes: the class roster is a numbered listing of everyone registered for the class; the chat feature is a place where classmates can participate in synchronous discussion and view shared dialogue; a conference is an area where students participate in asynchronous group discussion, much like an internet forum; within each conference a student can post a reply, which is also called a "note"; reserved readings is a place where faculty may post read-only documents or links to external websites for students; the webliography is a shared list of web sites relevant to the class.
Method. A data mining program, IBM/SPSS Modeler, was used to develop models that
explored the dataset for relationships between variables. Four classification algorithms were
used: C&R Tree, CHAID, QUEST, and C5.0. Each of the four models has its own advantages
and disadvantages; however, two models, C5.0 and CHAID, proved to be the most valuable in
the identification of relationships between specific variables. Cross-model data mining
techniques were also used to produce a ranking of variable importance.
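The project's models were built in IBM/SPSS Modeler, and the report does not include the underlying code. As an illustration only, the sketch below shows an analogous variable-importance ranking using a single decision tree in scikit-learn; the file name and column names are hypothetical stand-ins for the Table 2 variables, not the project's actual field names.

    # Hypothetical sketch: ranking predictors of re-enrollment with one decision
    # tree (the project itself used C&R Tree, CHAID, QUEST, and C5.0 in
    # IBM/SPSS Modeler and combined their results).
    import pandas as pd
    from sklearn.tree import DecisionTreeClassifier

    df = pd.read_csv("kresge_rq1_sample.csv")  # hypothetical extract of the RQ1 dataset

    features = ["entering_class_week4", "entering_class_week0",
                "num_schools_attended", "writing_conf_note_week0",
                "semester_load"]
    X, y = df[features], df["re_enrolled"]  # y: 1 = re-enrolled in Summer 2011

    tree = DecisionTreeClassifier(max_depth=4, min_samples_leaf=50, random_state=0)
    tree.fit(X, y)

    # Normalized importance scores, roughly comparable to the ranking in Table 3
    print(pd.Series(tree.feature_importances_, index=features)
            .sort_values(ascending=False))

A cross-model ranking such as the one reported in Table 3 would repeat this step for each algorithm and then combine the resulting importance scores across models.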
Findings. Results from all four algorithms were combined to rank variables across models. These
models generated a ranking for variables related to re-enrollment across data mining models (see
Table 3).
Table 3
Variable ranking derived from data mining models

Very High:
1) Frequency of student activity in the online classroom up to the fourth week of class
2) Entering the online classroom prior to the first day of class
3) Participation in any conference feature
4) Opening and reviewing the Course Content module within the classroom
5) Opening and reviewing Reserved Reading within the classroom

High: Number of Schools Previously Attended
Middle: Students' Semester Load
Low: Time Since Last Institution; Transferred in at Least One English Course; Transferred in at Least One Math Course
Negligible: All other variables

Note: Very high, high, middle, low, and negligible are statistical categories that describe the strength of the predictive power of the variables included in the re-enrollment model.
This ranking indicates that classroom behavior is of very high importance to re-enrollment. In
addition, the variables “number of schools previously attended” and “semester load” are
somewhat important, and other variables are either of low importance or negligible importance.
The C5.0 model indicated that the variable “Entering Class Week 4,” followed by
“Entering Class Prior to First Day,” “Number of Schools Previously Attended,” “Writing a
Conference Note Prior to First Day,” and “Semester Load” were most related to re-enrollment
(see Figure 2).
Figure 2
C5.0 model with top five variables related to re-enrollment
The CHAID algorithm also indicated that the variable “Entering Class Week 4” was most related
to re-enrollment (see Figure 3).
Figure 3
CHAID Model with top 10 variables related to re-enrollment
Implications. Data mining techniques indicate that student classroom behaviors are of
very high importance to re-enrollment. Previous academic background, such as number of
schools attended and transfer credits, may also be important. These models suggest that variables related to student behaviors prior to the first day of class and previous academic work merit further examination in statistical predictive models.
Research Question 2: Course success
The purpose of this analysis is to construct a logistic regression model with a set of
covariates and predictor variables that predict course success. This model intends to answer the
research question: Which variables, if any, predict student online course success? The intent of
this preliminary model is to predict course success from online student behavior in the first four
weeks of the course and students' academic background information.1 The ultimate goal of the analysis is to maximize the prediction of online course success.
Dataset. The dataset used for this analysis contains 4,558 new, undergraduate, first
bachelor-degree seeking students enrolled in 15 UMUC online gateway courses in Spring 2011.2
This dataset also contains transfer data on students from partner institutions, Montgomery
College (MC) and Prince George's Community College (PGCC). Table 4 provides descriptive
statistics for all students, passing or not passing, in each gateway course for this dataset.
1 Subsequent models may include growth models, latent class analysis, and structural equation models in order to capture a different type of data/information within one system of predictive models.
2 Sessions included in this dataset are Online Session 1 (OL1), OL2, OL3, OL4, and OL5.
Table 4
Descriptive statistics for key covariates

Course     Not Passed N   %     Passed N   %     All N
ACCT220        91        59%       63     41%      154
BMGT110       102        45%      127     55%      229
CCJS100        75        50%       74     50%      149
CMIS102       133        48%      142     52%      275
EDCP100       164        39%      254     61%      418
EDCP103        53        58%       38     42%       91
GVPT170        46        44%       59     56%      105
HIST157        57        39%       88     61%      145
IFSM201       296        48%      324     52%      620
LIBS150       459        23%     1521     77%     1980
MATH12         11        69%        5     31%       16
MATH9          57        54%       49     46%      106
PSYC100        97        56%       77     44%      174
WRTG101        34        35%       62     65%       96
All          1675        37%     2883     63%     4558
Method. For this study, a logistic regression model was developed that utilized available independent variables (covariates and predictors) to predict the desired dependent outcome, course success. A logistic regression model generates the probability of success (in logits) as a linear function of the set of independent variables. In this study, a pseudo-R2 value3, or measure of prediction accuracy, was used for both model diagnosis and selection. Preliminary methods focused on the identification of covariates (e.g. student demographic information and the record of previous academic work) and predictors based on the students' coursework behavior.
3 Pseudo or generalized R squared is a generalization of the coefficient of determination, R2, to more general regression models such as logistic regression (Nagelkerke, 1991).
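For reference, the model form and fit statistic described above can be written out as follows. This is standard notation rather than material from the report, with x_1 through x_k standing for the categorized covariates and predictors defined in Tables 5 through 9.

    \[
    \operatorname{logit}(p_i) = \ln\frac{p_i}{1-p_i} = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik},
    \qquad p_i = \Pr(\text{course success}_i = 1)
    \]
    \[
    R^2_{\text{Nagelkerke}} = \frac{1 - (L_0 / L_M)^{2/n}}{1 - L_0^{2/n}}
    \]

Here L_0 is the likelihood of an intercept-only model, L_M the likelihood of the fitted model, and n the number of records; the statistic rescales the Cox-Snell R2 so that its maximum possible value is 1.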
Identification of key covariates. Exploratory factor analysis (EFA) was used to identify
key covariates. EFA is an exploratory or data mining method that uncovers underlying
constructs, or common structures, among a set of related variables. Previous data mining
analysis4 indicated that the following covariates in student coursework data were strongly associated with course success: 1) total number of transfer credits (Xfer_total); 2) amount of time since students attended the last institution (Xfer_time); 3) semester course load (LOAD); and 4) GPA from transferred credits (GPA_C). Table 5 shows the rule used to categorize each covariate for use in the logistic regression model.
4 See the section on cluster analysis for a detailed explanation of this process.
Table 5
Rules for key covariates in student coursework data

Total number of transfer credits (Xfer_total):
  0 = 0 credits; 1 = 1-29 credits; 2 = 30-59 credits; 3 = more than 60 credits
Amount of time since a student attended the last institution (Xfer_time):
  0 = did not attend; 1 = 1-24 months; 2 = over 24 months
Semester course load (LOAD):
  1 = 1-3 credits; 2 = over 3 credits
GPA from transferred credits (GPA_C):
  0 = no information; 1 = less than 2.5; 2 = 2.5 to less than 3.0; 3 = 3 to less than 3.5; 4 = over 3.5
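As an illustration of how recoding rules like those in Table 5 can be applied, the sketch below uses pandas. The input column names (transfer_credits, transfer_gpa) are hypothetical, and the handling of boundary values (for example, exactly 60 credits or a 3.5 GPA) is an assumption, since the report's rules leave those edges unstated.

    # Hypothetical sketch of the Table 5 recoding rules; not the project's actual code.
    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "transfer_credits": [0, 12, 45, 78],
        "transfer_gpa": [np.nan, 2.1, 2.8, 3.6],
    })

    # Xfer_total: 0 = 0 credits, 1 = 1-29, 2 = 30-59, 3 = 60 or more (assumed edge)
    df["Xfer_total"] = pd.cut(df["transfer_credits"],
                              bins=[-1, 0, 29, 59, np.inf],
                              labels=[0, 1, 2, 3]).astype(int)

    # GPA_C: 0 = no information, 1 = <2.5, 2 = 2.5-<3.0, 3 = 3.0-<3.5, 4 = 3.5+ (assumed edge)
    gpa = df["transfer_gpa"]
    df["GPA_C"] = np.select([gpa.isna(), gpa < 2.5, gpa < 3.0, gpa < 3.5],
                            [0, 1, 2, 3], default=4)

    print(df)

The Week 0 behavior flags in Table 6 can be built the same way, for example (counts > 0).astype(int) for each activity counter, with TimeZeroTotal as the row sum of the five flags.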
Additional covariates related to student behavior prior to the official class starting day
(called Week 0) were also found to be associated with course success. Table 6 displays
summarized rules for each student behavior covariate in Week 0.
Table 6
Rules for LMS student behaviors in Week 0

Entering a class (Opnclss_cat0):
  0 = Did not enter a class; 1 = Entered a class at least once
Opened a class roster (Opnclssrstr_cat0):
  0 = Did not check the roster; 1 = Checked the roster at least once
Opened class content (Opncrscntnt_cat0):
  0 = Did not open class content; 1 = Opened class content at least once
Created a conference note (Createconfnote_cat0):
  0 = Did not create a conference note; 1 = Created a conference note at least once
Created a response note (Createrespnote_cat0):
  0 = Did not create a response note; 1 = Created a response note at least once
Summary of students' Week 0 behavior prior to the first day (TimeZeroTotal):
  Sum of the five variables above
Using these rules, a logistic regression model was used to evaluate the effect of the identified covariates (in terms of pseudo-R2) on course success for use in the final model (see Table 7). These findings include:
1. Total number of transferred credits is the best covariate predictor of course success (having a pseudo R2 value around .12).5,6
2. GPA from transferred credits is the second best covariate predictor of course success (pseudo R2 around .11).
3. Semester course load contributes less to course success than other covariates.
Analysis of combined covariates (student behaviors and previous coursework behavior) indicates that total number of transfer credits (Xfer_total), amount of time since a student attended the last institution (Xfer_time), GPA from transferred credits (GPA_C), and the summary of students' behavior prior to the first day (TimeZeroTotal) account for approximately 15-20% of the variance based on the pseudo R2 value (see Table 7).
5 It was the best predictor of success amongst the time-invariant covariates, or covariates that have a consistent value over all models in the analysis using different points in time.
6 The covariate number of schools previously attended (Xfer_NBR) exhibited linear dependency with the total number of transferred credits and was thus removed from the analysis.
Table 7
Summary of pseudo R2 estimates

Covariates                                               R2 estimates
xfer_total                                               0.12
xfer_time                                                0.06
Load                                                     0.01
GPA_C                                                    0.11
Opnclss_cat0                                             0.05
Opnclssrstr_cat0                                         0.03
Opncrscntnt_cat0                                         0.06
Createconfnote_cat0                                      0.02
Createrespnote_cat0                                      0.02
TimeZeroTotal                                            0.07
xfer_total + xfer_time + LOAD + TimeZeroTotal + GPA_C    0.19
Building the predictive model. In order to accurately predict course success, predictor variables that measure student online course behavior in Weeks 1-4 were integrated into the logistic regression model. Exploratory analysis based on the data mining techniques and the availability of data indicated the importance of five predictor variables related to online course behavior: 1) entering a class; 2) opening class content; 3) creating a conference note; 4) creating a response note; and 5) reading a conference note. In order to account for high variability in student online course behaviors, predictors were categorized using the following rules (see Table 8):
Table 8
Rules for student online course taking behavior beyond Week 0

Entering a class (Opnclss_catN):
  0 = Did not enter a class; 1 = Entered a class at least once
Opened class content (Opncrscntnt_catN2):
  0 = Did not open class content at least twice; 1 = Opened class content at least twice
Created a conference note (Createconfnote_catN):
  0 = Did not create a conference note; 1 = Created a conference note at least once
Created a response note (Createrespnote_catN):
  0 = Did not create a response note; 1 = Created a response note at least once
Read a conference note (readconfnote_catN2):
  0 = Did not read a conference note at least twice; 1 = Read a conference note at least twice
In this case, each predictor variable represents a weekly total of distinctive actions per student
per class in Weeks 1-4 of an online gateway course (see Table 9):
Table 9
Weekly total of actions per student beyond Week 0 (N = 4,558 for every variable)

Variable                  Mean     SD       P10   P25   P50   P75   P90
OPNCLSS_WEEK1             8.18     8.88     2     3     6     10    16
OPNCLSS_WEEK2             6.65     8.08     1     2     5     8     14
OPNCLSS_WEEK3             6.39     7.24     0     2     5     8     14
OPNCLSS_WEEK4             6.06     7.31     0     2     4     8     13
OPNCLSSRSTR_WEEK1         0.82     3.02     0     0     0     1     2
OPNCLSSRSTR_WEEK2         0.22     1.17     0     0     0     0     1
OPNCLSSRSTR_WEEK3         0.14     1.5      0     0     0     0     0
OPNCLSSRSTR_WEEK4         0.1      0.73     0     0     0     0     0
OPNCRSCNTNT_WEEK1         4.15     5.42     0     1     3     6     10
OPNCRSCNTNT_WEEK2         2.75     3.68     0     0     2     4     7
OPNCRSCNTNT_WEEK3         2.51     3.33     0     0     1     4     7
OPNCRSCNTNT_WEEK4         2.39     3.55     0     0     1     3     6
READCONFNOTE_WEEK1        99.55    115.89   2     21    59    138   248
READCONFNOTE_WEEK2        62.25    87.44    0     5     28    84    172
READCONFNOTE_WEEK3        52.97    75.98    0     4     25    71    141
READCONFNOTE_WEEK4        40.88    71.89    0     0     14    52    114
UPDTCONFNOTE_WEEK1        0.15     0.61     0     0     0     0     0
UPDTCONFNOTE_WEEK2        0.13     0.62     0     0     0     0     0
UPDTCONFNOTE_WEEK3        0.13     0.68     0     0     0     0     0
UPDTCONFNOTE_WEEK4        0.1      0.57     0     0     0     0     0
CREATECONFNOTE_WEEK1      0.61     0.92     0     0     0     1     2
CREATECONFNOTE_WEEK2      0.23     0.63     0     0     0     0     1
CREATECONFNOTE_WEEK3      0.26     0.59     0     0     0     0     1
CREATECONFNOTE_WEEK4      0.17     0.53     0     0     0     0     1
CREATERESPNOTE_WEEK1      2.86     3.62     0     0     2     4     7
CREATERESPNOTE_WEEK2      1.97     3.25     0     0     1     3     6
CREATERESPNOTE_WEEK3      1.79     2.84     0     0     1     2     5
CREATERESPNOTE_WEEK4      1.45     2.87     0     0     0     2     5
This categorization accomplished three purposes: 1) it removed the effect of extreme values of highly variable actions; 2) it facilitated a percentile-level count of each behavior; and 3) it examined the effect of each cut point.7 Categorization of predictor variables also emphasizes the importance of consistent engagement in a classroom activity, which is indicated by the categorization of "1" as a predictor variable for student action from Week 1 to Week 4 and "0" as a variable indicating no action.8
Evaluating the predictive model. The covariates and predictors included in the final logistic regression model represent the best model fit (as represented by pseudo R2) to evaluate student course success. In addition, this model allows for determination of the relative contribution of each covariate and predictor at each point in time, which will aid in identifying at-risk students. For example, the power of the model in predicting course success becomes extremely high (close to 90%), with the pseudo R2 exceeding .60, when 8 weeks of data are used. However, since interventions for increased course success are not useful at the end of the semester, findings focus on the performance of predictors over the first four weeks, particularly the first two weeks of behaviors in the online class.
Findings. Initial analysis using the final model indicates the relative contribution of
covariates and predictors from Week 1 to Week 4 (see Table 10).
Table 10
Relative contribution of covariates and predictors

                                Week 1   Week 2   Week 3   Week 4
Covariates
  xfer_total                     15%      18%      16%      17%
  xfer_time                       2%       2%       2%       2%
  Load                            5%       7%       6%       7%
  TimeZeroTotal                  11%      10%       9%       7%
  GPA_C                           7%       9%       8%       9%
Predictors
  createrespnote_total           10%       3%       6%       2%
  readconfnote_total2            34%      32%      23%      21%
  createconfnote_total           15%       7%      14%      10%
  opnclss_total                   0%      13%      16%      26%
  opncrscntnt_total2              0%       0%       0%       1%

7 "Cut point" refers to numeric values or rules to categorize a continuous value into categories.
8 In other words, the predictor variable represents the sum of student actions across N weeks (0-4) as the final predictor variable in the model (e.g. <action>_total).
Examination of the relative contribution of covariates indicates that total transfer credit (xfer_total) exhibited the highest level of contribution to the predictive model (15%-18%), followed by Week 0 behavior (TimeZeroTotal), transfer GPA (GPA_C), course load (LOAD), and time since the last institution attended (xfer_time), respectively. Examination of the relative contribution of predictors indicates that the variable reading a conference note (readconfnote_total2) has the highest influence in the predictive model (21%-34%). Note that entering a class (opnclss_total) had no impact on student success in Week 1, but that impact increased in Weeks 2-4.
The predictive power of the model increases gradually over the first four weeks of the course, from 79% correct prediction of course success at Week 1 to 82% correct prediction of course success at Week 4. The pseudo R2 value increased from approximately .19 using the covariate-only model (see Table 7) to .50 in the Week 1 predictive model (see Table 11).
Table 11
Predictive power of the model by week

        R2          Count (Outcome/Predicted)       Percent (Outcome/Predicted)      Percent
Week    estimate    P/P    F/F    F/P    P/F         P/P    F/F    F/P    P/F        success
1       0.50        1067   964    252    295         41%    37%    10%    11%        79%
2       0.55        1089   990    226    273         42%    38%     9%    11%        81%
3       0.57        1108   993    223    254         43%    39%     9%    10%        81%
4       0.59        1129   991    225    233         44%    38%     9%     9%        82%

Note that the pseudo R2 value increased from .50 at Week 1 to .59 at Week 4.
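For illustration, a table like Table 11 can be produced by comparing each enrollment's predicted and actual outcomes from a fitted logistic regression. The sketch below uses scikit-learn on simulated data; the features and coefficients are invented for demonstration and do not reproduce the report's model.

    # Hypothetical sketch: weekly prediction table in the spirit of Table 11.
    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix

    rng = np.random.default_rng(0)
    n = 2578  # the counts in Table 11 sum to 2,578 scored enrollments

    # Simulated stand-ins for the categorized covariates/predictors through a week
    X = rng.poisson(lam=[3.0, 2.0, 1.0, 4.0], size=(n, 4)).astype(float)
    true_logit = -1.5 + 0.4 * X[:, 0] + 0.3 * X[:, 1] + 0.2 * X[:, 2]
    passed = rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))  # 1 = grade of A, B, or C

    model = LogisticRegression().fit(X, passed)
    pred = model.predict(X)

    # P = passed, F = not passed; first symbol is the actual outcome, second the prediction
    tn, fp, fn, tp = confusion_matrix(passed, pred).ravel()
    print(f"P/P {tp}  F/F {tn}  F/P {fp}  P/F {fn}")
    print(f"Percent success (correct predictions): {(tp + tn) / n:.0%}")

Repeating the fit with the cumulative behavior counts for Weeks 1 through 4 would yield the week-by-week rows of such a table.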
Group level analysis. Predictive analysis by sub-group based on sessions or courses was conducted using the model. Although findings from this group-level analysis are less informative due to limited sample size, the predictive model maintained a moderate to high level of fit based on the pseudo R-squared value (e.g. above the mid .30s).9
9 The OL5 session had a fair model fit but exhibited a low prediction rate. The sample size of the OL5 session was too small to draw a conclusion.
The group-level analysis demonstrated that the predictive model maintains a high level of model fit, with correct prediction above 75% in the majority of sub-groups (see Table 12).10
Table 12
Predictive power of the model by week and session

Session/Group       Stat       Week 1   Week 2   Week 3   Week 4
OL1                 R2          0.65     0.71     0.74     0.74
                    Success    85.10    87.40    87.60    88.90
OL2                 R2          0.59     0.70     0.75     0.75
                    Success    84.30    85.90    88.20    88.20
OL3                 R2          0.49     0.54     0.56     0.55
                    Success    78.30    79.00    79.50    80.20
OL4                 R2          0.45     0.50     0.53     0.55
                    Success    73.10    74.80    77.00    78.70
OL5                 R2          0.47     0.52     0.55     0.60
                    Success    59.00    60.00    64.80    68.60
OL123               R2          0.58     0.65     0.68     0.69
                    Success    83.70    85.50    86.70    86.90
LIBS150             R2          0.25     0.31     0.40     0.45
                    Success    79.30    81.70    83.80    85.30
No LIBS 150         R2          0.50     0.55     0.57     0.59
                    Success    78.80    80.60    81.50    82.20
Core LIBS           R2          0.27     0.29     0.36     0.41
                    Success    74.00    74.80    78.00    80.40
Core No LIBS        R2          0.48     0.51     0.53     0.55
                    Success    78.10    78.80    79.00    79.10
EDCP100             R2          0.66     0.69     0.71     0.73
                    Success    83.50    86.10    85.60    87.60
IFSM201             R2          0.54     0.59     0.63     0.66
                    Success    78.50    81.90    82.40    83.50
EDCP100 IFSM201     R2          0.58     0.62     0.65     0.68
                    Success    82.20    83.10    84.30    85.80

10 This sample included all students by session level and course category (general education and remedial courses), but excluded students in LIBS 150. LIBS 150 is a general education course; it is shorter and more of an orientation course than a regular academic course, and it has a very high passing rate of 75.3%.
Implications. Preliminary analysis of variables related to course success identified a set of critical covariates and predictors. Four of the five predictors derived from online student behavior show a strong contribution to student success. The final model using the full dataset will examine the effect of these predictors as well as evaluate the impact of other behavioral measures, such as accessing course modules.
Research Question 3: Student retention
The purpose of this preliminary analysis is to construct a logistic regression model with a
set of covariates and predictor variables and determine the impact on student retention. This
model intends to answer the research question: Which variables, if any, predict retention?
Dataset. The analysis utilized the same dataset used for the prediction of course success, with the addition of retention status from Summer 2011, Fall 2011, and Spring 2012. The retention rate was 66% (N = 3,015) in this sample.
Table 13
Percentage of students retained and not retained

               N       %
Retained       3015    66%
Not Retained   1543    34%
Method. As with prediction of course success, this model uses logistic regression to
predict retention. Preliminary methods focused on the evaluation of covariates, identified in the
previous analysis, predictors based on the students' coursework behavior, and course success
information (i.e. passed or not passed).
Evaluation of covariates. The total number of credits transferred (xfer_total) had the highest predictive power, followed by GPA from the transfer credits (GPA_C). The magnitude of the covariates' pseudo R2 is lower for the prediction of retention than for the prediction of course success. The total covariates yield a pseudo R2 of 0.14. Table 14 shows pseudo R2 estimates for the five covariates.
Table 14
Retention pseudo R2 estimates of covariates

Covariates        R2 estimates
xfer_total        0.08
xfer_time         0.02
GPA_C             0.07
Load              0.01
TimeZeroTotal     0.03
Cov Total         0.14
Evaluation of predictors. For this model, predictors identified for the course success model were used and combined with a new set of predictors: 1) number of courses passed (N_Pass); 2) number of courses not passed (N_NotPass); 3) number of course withdrawals (W) or failures due to non-attendance (FN) (N_WFN); 4) passing LIBS 150 (passed_LIBS150); 5) passing a General Education course (passed_GenEd); 6) passing a remedial course (passed_Remed); and 7) term GPA (GPA_Term).
In this model, student behavior indicators did not predict retention as well as they predicted course success (see Table 15).
Table 15
Prediction success of combined student behavior indicators by week

        Retention           Course Success
Week    Low      High       Low      High
1       0.14     0.19       0.37     0.50
2       0.14     0.20       0.41     0.55
3       0.15     0.20       0.43     0.57
4       0.14     0.20       0.45     0.59
Findings. When included in the prediction model for retention, covariates predicting course success generated a higher pseudo R2 value than student behavior predictors (see Table 16).
Table 16
R-squared estimates for retention predictors and covariates

Predictors and covariates                                               R2 estimates
passed_LIBS150                                                          0.15
passed_GenEd                                                            0.15
passed_Remed                                                            0.15
N_Pass                                                                  0.19
N_NotPass                                                               0.17
N_WFN                                                                   0.16
GPA_Term                                                                0.24
COVS + passed_ABC + passed_GenEd + passed_remed + N_Pass +
  N_NotPass + N_WFN + GPA_Term                                          0.35

***COVS = xfer_total xfer_time LOAD GPA_C
GPA and number of passing courses had the highest pseudo R2 estimates among seven
indicators. The full model including covariates and passing indicators had a pseudo R2 of 0.35.
Implications. Covariates and student behavior variables made less of a contribution to this model than they did to the prediction of course success. The final model using the course passing indicators11 indicated a stronger contribution to the prediction of retention. These preliminary results indicate the importance of course success as the primary factor in increasing retention, which means that interventions aimed at improving course success may have a strong indirect effect on increasing retention.
Additional exploratory analysis: Student withdrawal patterns
The purpose of this additional exploratory analysis was to conduct a survival analysis examining student course withdrawal patterns. The goal of this study is to increase understanding of withdrawal patterns so that they can be effectively incorporated into models predicting retention.
Dataset. The dataset for this study contains 19,190 undergraduate UMUC students enrolled in OL1 (Online Session 1) in Fall 2011 across 278 distinct courses. The dataset contained student course information and WebTycho data. Summary statistics for student demographic and academic behavior variables are listed in Table 17.
Table 17
Demographics for OL1 students

Variable                     Frequency    Percent    Cumulative Percent
Gender
  Unknown                    244          1.3        1.3
  Female                     10,872       56.7       57.9
  Male                       8,074        42.1       100.0
Age Group
  Above 32                   8,401        43.8       43.8
  At or below 32             10,789       56.2       100.0
New/Returning Status
  NEW                        3,361        17.5       17.5
  RET                        15,829       82.5       100.0

Note. Age 32 is used as a cut point because it is the median age of UMUC students.
The majority of the students in OL1 of Fall 2011 were female (56.7%). A slight majority (56.2%) of the students were at or below the UMUC median student age of 32 years old. New students made up 17.5% of the sample, and 82.5% were returning students.
Method. An exploratory survival analysis was carried out using a Kaplan-Meier estimator. Survival analysis is a statistical technique that can be used to model "time-to-event" data. In this case, the analysis examines the time it took for a student to withdraw from a particular course (in weeks and days), reflected as a time-to-event.13 Survival analysis generates a table that indicates a hazard (or withdrawal) rate during the semester. Additional univariate survival analyses were also conducted to identify withdrawal patterns by age, gender, and student status.
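A minimal sketch of this kind of Kaplan-Meier analysis in Python is shown below, using the lifelines package as one possible tool (the report does not name its survival-analysis software). The duration and withdrawal columns are assumptions: duration would be days from session start to withdrawal, with students who never withdrew treated as censored at the end of the session.

    # Sketch: Kaplan-Meier estimate of time to course withdrawal (illustrative only).
    import pandas as pd
    from lifelines import KaplanMeierFitter

    df = pd.read_csv("ol1_fall2011_enrollments.csv")  # hypothetical file
    # duration: days from session start to withdrawal (or to session end if no W grade)
    # withdrew: 1 if the student received a W grade, 0 otherwise (censored)

    kmf = KaplanMeierFitter()
    kmf.fit(durations=df["duration"], event_observed=df["withdrew"])

    print(kmf.event_table.head(10))       # life table: at-risk counts and withdrawals
    print(kmf.survival_function_.head())  # cumulative proportion still enrolled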
Survival analysis. Table 18 shows, for each week, the number of students at risk, the number of student withdrawals, the cumulative proportion of students remaining, and the withdrawal rate.
Table 18
Withdrawal rate in OL1

Week    Number of    Number of Student    Cumulative Proportion    Withdrawal
        Students     Withdrawals          Remaining                Rate
1       19,190       407                  0.98                     0.0031
2       18,783       284                  0.96                     0.0022
3       18,499       251                  0.95                     0.0020
4       18,248       217                  0.94                     0.0017
5       18,031       295                  0.92                     0.0024
6       17,736       132                  0.92                     0.0011
The student rate of withdrawal is highest in Week 1 (N = 407, a rate of 0.0031). The withdrawal rate gradually drops until Week 5, the week before the academic withdrawal deadline, when it rises again. The overall pattern is that student withdrawal rates generally decrease after Week 1 of the session.
13 This analysis only examines students who received a grade of W (withdrawal).
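The withdrawal rates in Table 18 are roughly consistent with daily rates, that is, the number of withdrawals in a week divided by seven times the number of students at risk at the start of that week (for Week 1, 407 / (7 × 19,190) ≈ 0.003). Under that assumption, and using a hypothetical week_of_withdrawal column, the calculation could be sketched as follows; the exact denominator used in the report is not stated, so this is an approximation.

    # Sketch: weekly counts and approximate daily withdrawal rates as in Table 18.
    # Assumes a hypothetical week_of_withdrawal column (missing if the student did not withdraw).
    import pandas as pd

    df = pd.read_csv("ol1_fall2011_enrollments.csv")  # hypothetical file
    n_at_risk = len(df)

    for week in range(1, 7):
        n_withdrew = int((df["week_of_withdrawal"] == week).sum())
        rate = n_withdrew / (n_at_risk * 7)  # per-day rate over a 7-day week
        print(week, n_at_risk, n_withdrew, round(rate, 4))
        n_at_risk -= n_withdrew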
Univariate survival analysis. The univariate survival analysis examined three covariates: gender, age, and student status. Age, defined as at or below versus above 32 years old, was not significantly correlated with withdrawal patterns (p = .60). Gender (male versus female) approached significance (p = .06). Student status (new or returning) was significantly correlated with withdrawal (p < .001): new students withdraw at a higher rate than returning students. Figure 4 shows that the withdrawal patterns of new and returning students differ.
Figure 4
Withdrawal rate for new and returning students by day
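The report does not state which test produced these p-values; a log-rank test is one standard choice for comparing survival curves between groups. A minimal sketch of such a test for new versus returning students, using the lifelines package and the same hypothetical columns as above, is:

    # Sketch: log-rank test of withdrawal timing for new vs. returning students
    # (illustrative; column names are assumptions).
    import pandas as pd
    from lifelines.statistics import logrank_test

    df = pd.read_csv("ol1_fall2011_enrollments.csv")  # hypothetical file
    new = df[df["status"] == "NEW"]
    ret = df[df["status"] == "RET"]

    result = logrank_test(
        new["duration"], ret["duration"],
        event_observed_A=new["withdrew"], event_observed_B=ret["withdrew"],
    )
    print(result.p_value)  # the report finds p < .001 for new vs. returning students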
Implications. Students withdraw at a higher rate in Week 1 than in any other week of the course session, with a secondary peak in Week 5, the week before the academic withdrawal deadline. Student status, new or returning, may significantly affect the withdrawal rate: new students withdraw at a higher rate than returning students. These findings suggest that interventions targeting new students in Week 1 may be appropriate.
Section VI: Implications and next steps
In summary, data mining and statistical modeling can identify variables that relate to student success. Data mining techniques identified the importance of academic background and student course behaviors as variables related to student success. Predictive models of course success and retention using variables gleaned from data mining showed similar outcomes. Specifically, online student course behavior, such as opening and reading conference notes in the first four weeks of class, was correlated with course success. Previous academic background, such as total number of transfer credits and transfer GPA, contributed to both course success and retention. Course success, rather than student behavior, was the key predictor of retention. Analysis of withdrawal patterns showed that students withdraw at a higher rate in Week 1 than in any other week of the course session, with a secondary peak in Week 5, and that student status may significantly affect the withdrawal rate.
These findings will guide our next research steps, which will finalize and validate predictive models in order to build student profiles. These steps may include, but are not limited to: running predictive models with a full dataset; validating models; conducting sub-group analyses of community college students in terms of course success, re-enrollment, and retention; identifying course sequences that indicate re-enrollment and retention; and integrating withdrawal information into predictive models. These additional steps will aid in building student profiles that can be used to predict student success and design effective interventions.
Appendix A
Memoranda of Understanding
Appendix B
Financial Statement