UMUC Kresge Data Mining Project: Phase 1 (November 28, 2012)

Developing Data-Driven Predictive Models of Student Success
Kresge Data Mining Project Phase One Report
University of Maryland, University College
November 20, 2012

Executive Summary

This report documents progress on the Kresge Data Mining Grant: Developing Data-Driven Predictive Models of Student Success. The grant was awarded to University of Maryland, University College (UMUC) in collaboration with two community college partners, Prince George's Community College (PGCC) and Montgomery College (MC). The purpose of the grant is to build an integrated database in order to examine student progress across multiple institutions, using data mining techniques and statistical models to identify factors that predict online student success.

To date, this project has three major accomplishments:

1. UMUC and the two partner community colleges, PGCC and MC, have established a data sharing partnership that supports academic research and student success.
2. UMUC designed, developed, and implemented a multi-institutional database of over 250,000 students (the Kresge Data Mart) that integrates the academic histories of transfer students from these two colleges with data on UMUC students' online behavior and academic performance.
3. Researchers used data mining and predictive modeling to investigate relationships between variables that predict student success.

Initial research yielded the following results:

1. Researchers reviewed the extant literature on online education and educational data mining and identified key variables that have been found to predict successful course completion, student re-enrollment, and retention.
2.
Examination of the student data in the Kresge Data Mart using four classification data mining algorithms and survival analysis indicated that factors in students' classroom behaviors and previous academic backgrounds, such as number of schools attended, transfer credits, and transfer grade point average, were strong predictors of student re-enrollment in the following semester.
3. Regression models revealed that total transfer credits, transfer GPA, activity in course conferencing, and activity prior to the start of a course contribute strongly to course success.
4. Predictive models verified that course success is a strong predictor of re-enrollment.

These data mining and statistical modeling efforts identify variables associated with online student success among students transferring from PGCC and MC. The findings guide the next research steps, which include validating the data mining and predictive models on an expanded dataset, building profiles of successful online students, and developing meaningful interventions to improve the success of students transferring from our partner institutions.

Introduction

The purpose of this report is to document progress on the Kresge Data Mining Grant: Developing Data-Driven Predictive Models of Student Success. The research is being conducted by the Evaluation and Assessment research team in the Office of Institutional Research at University of Maryland University College (UMUC). This report is presented in six sections:

1) Section I contains the grant overview, objectives, milestones, and a financial update.
2) Section II reviews the research design, general methodologies, and research questions.
3) Section III contains the literature review and research foundations.
4) Section IV reviews the data developed for this project.
5) Section V describes initial results from the analyses of the research questions.
6) Section VI describes next steps to stay on track with both project and research goals.

Section I: Grant overview

The grant was awarded to UMUC, in collaboration with two community college partners, to build a multi-institutional database to examine student progress across multiple institutions. The purpose of the grant is to use data mining techniques to develop data-driven predictive models and to identify factors that influence the success of online students. Embedded in the project is the expectation that findings from the research will be integrated into the business processes of each participating institution for the purpose of evaluating the impact on student success. Ongoing evaluation is a key aspect of this agile research project. Over the course of this three-year project, student transcript data from Montgomery College and Prince George's Community College will be integrated with performance and classroom behavior data from UMUC. The research in data mining and predictive modeling will contribute to a process for identifying relationships between students' community college academic histories and their performance at the four-year institution.

Grant partnership

UMUC is a four-year public university that offers online degree programs to a diverse population of working adults. As part of this grant, UMUC established partnerships with two Maryland community colleges that also serve large and diverse student populations. Montgomery College (MC), established in 1946, currently enrolls over 35,000 students annually and serves a large international student population. Prince George's Community College (PGCC) enrolls more than 40,000 students and is also very diverse, serving students from approximately 125 different countries. Both institutions serve the metro-D.C. area, but they differ in that PGCC serves more students who are considered low-income.
Both institutions have endorsed the goals of this project and are committed to working with UMUC to find ways to better serve transfer students. A Memorandum of Understanding (MOU) between UMUC and each partner institution was negotiated and signed in order to clarify the security and use of the data used for this research. The MOU allows the research to move forward while protecting the privacy of students (see Appendix A).

Objectives and milestones

Specific objectives and milestones are presented below for each stage of the research project. These objectives and milestones have been modified slightly from the original grant documents, but are consistent with the grant requirements.

Table 1
UMUC Kresge data mining grant objectives and milestones, status as of September 1, 2012

Year One

Objective: Develop a project action plan
- Identify project requirements. (Complete)
- Develop a project action and collaboration plan with the partnering agencies. (Complete)
- Hire a full-time researcher and a half-time programmer. (Complete)
- Build an evaluation framework. (Complete)

Objective: Data preparation
- Work with the different units at UMUC and partnering institutions to collect data. (Complete)
- Clean and transform data. (Complete)
- Prepare a data "universe" (integrated database system) on transfer students in the population. (In process)
- Understand variables; define student characteristics and retention data; develop a data dictionary. (Complete)

Objective: Evaluation of project
- Conduct ongoing project evaluation. (In process)

Year Two

Objective: Build the analytic models
- Analyze data and identify factors that predict success/failure. (In process)
- Build the data mining and analytic techniques and integrate them into the predictive models. (In process)
- Build student profiles based on the models. (In process)

Objective: Assess the validity of the models
- Discuss results with the advisory board and obtain feedback.
- Discuss results with project partners and obtain feedback.
- Evaluate the models in predicting students' chances for success and make recommendations for adjustments to the models.

Objective: Evaluation of project
- Conduct ongoing project evaluation.

Objective: Design pilot study
- Design intervention study.
- Define requirements for the pilot.
- Develop implementation plan.
- For the resulting predictive models, identify points of integration in the UMUC business processes.
- Partner institutions with transferring students will evaluate their business processes to identify points of integration of modeling techniques for improvement of student retention.

Year Three

Objective: Conduct pilot study
- Deploy interventions and evaluation systems through a pilot at the partnering institutions and at UMUC.
- Conduct analyses of student retention and student satisfaction.

Objective: Evaluate the pilot
- Evaluate the effectiveness of the pilot and assess transferring students' chances of success and retention with the model interventions.

Objective: Dissemination of the data mining model and resources
- Discuss results with project partners and develop recommendations.
- Plan project revision based on evaluation data.
- Develop a website and repository for resources.
- Present findings at national conferences on higher education to describe the research purpose and results.
- Organize a UMUC conference to engage national discussion around research in data mining and predictive modeling for the purpose of improving student success.

Objective: Incorporate feedback from the education community to improve the models
- Evaluate the attendance, interest, and requests from the meetings.
- Develop and disseminate findings.
- Use feedback to identify enhancements and integrate them into an enhancement plan.

Objective: Comprehensive evaluation of the project
- Conduct ongoing project evaluation and develop an evaluation report.
- Solidify data-sharing partnerships with the community colleges, documenting roles, processes, and measurable outcomes of student success.

Objective: Report on the process and results
- Develop an integrated research paper outlining the research purpose, research design, processes, and products.

Financials

The Kresge Foundation awarded UMUC $1.2 million to build an integrated database, explore data mining techniques, build predictive models of student success, define intervention strategies to increase student success, deploy those strategies, and continuously monitor and evaluate the strategies and student outcomes. To date, UMUC has expended approximately 41% of the total grant. The expenses encompass a hardware purchase to house the data mining database, expenditures for data collection at the partner institutions, and dedicated salaries for a data mining specialist and a graduate assistant. Additional staff resources are provided in kind by UMUC. The remaining grant funds (approximately $703,333) will be used to support external evaluation and conference development. The financial statement for this grant is in Appendix B.

Section II: Research Design

The research design for this study was generated from the grant's goal to develop data-driven predictive models of online student success using data mining techniques. The design covers the exploratory research questions that have been addressed so far; additional questions are expected to be developed after review of the initial results. This study employs an exploratory multi-method research design. Exploratory research is useful in this context because it allows researchers to investigate relationships between variables and build predictive models that can be replicated and tested. A multi-method design allows researchers to be flexible in their choice of methodology and to answer research questions using exploratory analyses.
With this research design in mind, this study addresses the following exploratory research questions:

Question 1: What relationships exist among variables in the extant dataset?
Question 2: Which variables contribute to the prediction of online course success?
Question 3: Which variables contribute to the prediction of retention in an online environment?

Section III: Literature Review

This literature review examines the online student success and educational data mining literatures as they pertain to student success. This section contains three parts: study definitions, online student success literature, and educational data mining literature.

Definitions

Data mining in education research is a broad term that encompasses both data management and statistical techniques. Data management refers to how data are prepared for the application of data mining techniques. These techniques are used to explore variables that may predict student progression related to online course success and retention. To maintain language consistency, terms used throughout the study are defined below.

Students: stateside, undergraduate, first-bachelor's-seeking students who transferred from either PGCC or MC. This group is also referred to as the cohort for the study or the population of interest.

Course success: obtaining a final grade of A, B, or C in any course.

Re-enrollment: enrolling in the semester immediately following the semester in question.

Retention: re-enrolling at the institution within 12 months after initial enrollment, or in a rolling window of three semesters following the semester of interest.

Degree completion: attaining a bachelor's degree after enrolling as a degree-seeking student.

Student success: a broad term used to indicate course success, re-enrollment, retention, or degree completion.
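The outcome definitions above are simple enough to encode directly. The sketch below illustrates them in Python; the term codes, field names, and functions are invented for illustration and are not the project's actual schema:

```python
# Illustrative encoding of the study's outcome definitions.
# Term codes and function names are invented for this sketch.

SUCCESS_GRADES = {"A", "B", "C"}

def course_success(final_grade):
    """Course success: a final grade of A, B, or C in any course."""
    return final_grade in SUCCESS_GRADES

def re_enrolled(enrolled_terms, term, next_term):
    """Re-enrollment: enrolled in the semester immediately following `term`."""
    return term in enrolled_terms and next_term in enrolled_terms

def retained(enrolled_terms, term, next_three_terms):
    """Retention: re-enrolled within the rolling three-semester window after `term`."""
    return term in enrolled_terms and any(t in enrolled_terms for t in next_three_terms)

# A student enrolled in Spring and Fall 2011 but not Summer 2011:
terms = {"2011SP", "2011FA"}
print(course_success("B"))                                        # True
print(re_enrolled(terms, "2011SP", "2011SU"))                     # False
print(retained(terms, "2011SP", ["2011SU", "2011FA", "2012SP"]))  # True
```

A degree-completion flag would follow the same pattern, keyed on a degree-award record rather than term enrollment.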
These definitions were determined by reviewing a number of sources, including: 1) the data mining and online retention literature; 2) institutional publications, such as reports to the Middle States Commission on Higher Education and in-house studies on retention and course success; and 3) the UMUC, MC, and PGCC undergraduate course catalogs. Additional terminology and definitions related to data mining and research methodology are discussed later in this report.

Online student success literature

Current literature on student success focuses primarily on student outcomes such as course success, course withdrawal, and retention. For example, student- and course-level variables such as student characteristics, previous course work, grades, and time spent in course discussions and activities may be useful in measuring course success (Aragon & Johnson, 2008; Morris & Finnegan, 2009; Morris, Finnegan & Lee, 2009; Park & Choi, 2009). Course-level variables acquired from student login data may have predictive value in measuring course withdrawal (Willging & Johnson, 2008; Nistor & Neubauer, 2010). Student-, course-, program-, and institution-level variables such as student characteristics, number of transfer credits, final grade in any given course, experience in online environments, and course load may have utility in measuring re-enrollment and retention (Aragon & Johnson, 2008; Morris & Finnegan, 2009; Boston, Diaz, Gibson, Ice, Richardson & Swan, 2011). Although these studies showcase a variety of findings related to student success, the majority of studies of retention in online learning environments use traditional statistical or qualitative methods. Park and Choi (2009) point out that expanding the method set to include techniques such as data mining might have utility when student, course, program, and institutional level variables are well defined and institutionally meaningful.
The literature related to educational data mining, by contrast, focuses on exploratory research.

Educational data mining literature

Data mining is a method of discovering new and potentially useful information from large amounts of data (Baker & Yacef, 2009; Luan, 2001). Educational data mining is a subset of the field of data mining that draws on a wide variety of literatures, such as statistics, psychometrics, and computational modeling, to examine relationships that may predict student outcomes (Romero & Ventura, 2007; Baker & Yacef, 2009). In educational data mining, data mining algorithms are used to create and improve models of student behavior in order to improve student learning (Luan, 2002). Data mining methods are most helpful for finding patterns already present in data, not necessarily for testing hypotheses (Luan, 2001). Baker and Yacef (2009) suggest that higher education research should use a variety of algorithms, such as classification, clustering, or association algorithms, in determining relationships between variables. Although many definitions of these techniques exist in the data mining literature, Han and Kamber (2001) offer the following: classification is the process of finding a set of models or functions that describe and distinguish data classes or concepts in order to predict the class of objects whose class label is unknown; clustering analyzes data objects that are related to similar outcomes without consulting a class label; association is the discovery of rules showing attribute-value conditions that occur frequently together in a given set of data (Han & Kamber, 2001, pp. 23-24). Recent research suggests that these data mining algorithms can be used to examine variables related to student success. Yu, DiGangi, Jannach-Pennell, Lo, and Kaprolet (2010) used a classification algorithm to explore potential predictors of student retention at a traditional, undergraduate institution.
In that study, the authors used a decision tree to explore demographic, academic performance, and enrollment variables as they related to student retention. The study revealed a predictable relationship between earned hours and retention, but also found that at this institution, retention was closely related to state of residence (in-state/out-of-state) and living location (on-campus/off-campus). The authors speculate that this finding points to the potential utility of online courses in improving retention for out-of-state or off-campus students. Despite these recent developments in exploring variables related to student success in traditional higher education settings, research using data mining techniques to uncover relationships among variables in online courses is limited in scope. This study is designed to fill this gap in the extant literature by utilizing data on online students who attended multiple institutions.

Section IV: Project Data

The Kresge project has several goals. One is for UMUC to develop a data sharing partnership with key community colleges. Another is to build an integrated, multi-institutional data mart that describes the academic history of transfer students who attended community colleges prior to enrolling at UMUC. This data mart will be used for data mining and for the development of statistical models that identify factors related to student success in an online environment. To that end, several successes in the area of data integration have been achieved over the past year. First, a meaningful and significant partnership was established between UMUC, MC, and PGCC. Second, data were collected from source systems and stored in secure areas for processing. Third, the vision and design for the Kresge Data Mart (KDM) have been developed. Fourth, the student population of interest was identified and verified.
Fifth, datasets were created for the initial modeling and data mining.

The most significant success of this grant so far is the establishment of the data sharing collaboration between the community colleges and UMUC. The partnership agreements were forged through a series of meetings and an MOU that guides the research and data sharing activities. Research questions have been developed to benefit students through actionable research that will affect the business processes at each institution. Measurable outcomes have been set, and ongoing evaluation is being conducted and shared with the partner institutions. The data used in this effort are collected from all three institutions and stored in a data mart called the Kresge Data Mart (KDM), which is housed on a server at UMUC. This section describes the overall vision for the development of the infrastructure and the current status of the project.

Student Data

A core set of data is collected on each student from each institution. This information includes demographics, enrollment, courses, grades, and recent classroom activity in online courses. Each is described below.

Demographic data. Demographic data include fields such as address, phone number, age, gender, and race/ethnicity. While some of this information may change over time, for the purpose of this study the demographic data are considered static.

Enrollment data. Enrollment data are typically captured on students who have officially registered. Enrollment data include course registration, grades, program of study or major, and student status. This information changes each term or semester.

Transfer data. Data about a student's academic history at previous institutions are also stored. This information includes number of transfer credits, courses transferred, grades associated with each transfer course, and prior degrees earned. There are two sources of transfer data.
One is the community college transcript data, which contain all course information on each student. These data are collected and integrated into the KDM. The other source is UMUC's own transcript files, which include only the courses that were transferred to UMUC and equated to UMUC courses, and so may not capture all of a student's prior coursework. The UMUC transcript data are not as complete as the community college transcript data, but they serve as the basis for the courses and credits that are applied to the student's degree.

Classroom behavior data. WebTycho, UMUC's proprietary learning management system, stores counts of student activity in various modules within each online class. Data are stored for each session, which is defined as the time between each student login and logoff.

Identifying the population of interest

The students to be included in the KDM are defined as students who attended UMUC between 2005 and 2011. Students who attended at least one community college prior to UMUC enrollment were identified and analyzed as a separate subpopulation. The data will eventually include students who attended a community college of interest but did not transfer to UMUC. These two subgroups and the larger population of students will serve as the foundation of the data stored in the KDM and the backbone of the research.

The process of building the populations has multiple steps. The first step was to identify two partner institutions willing to share data on students who transferred to UMUC. The second step will be to identify other institutions that are willing to do the same. The third step will be to identify students who transferred from key partner institutions, but who did not transfer to UMUC. This report focuses on the first stage of building an integrated database. The population of interest is transfer students.
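As a rough illustration of the population-building steps just described, the sketch below filters a handful of invented student records into the population and the transfer subpopulation. The field names and values are hypothetical, not the KDM schema:

```python
# Hypothetical student records; field names are invented for illustration.
students = [
    {"emplid": 1, "first_term_year": 2004, "prior_college": None},
    {"emplid": 2, "first_term_year": 2006, "prior_college": "MC"},
    {"emplid": 3, "first_term_year": 2009, "prior_college": "PGCC"},
    {"emplid": 4, "first_term_year": 2011, "prior_college": None},
]

# Step 1: everyone who attended UMUC between 2005 and 2011.
population = [s for s in students if 2005 <= s["first_term_year"] <= 2011]

# Step 2: the transfer subpopulation -- attended a community college first.
transfers = [s for s in population if s["prior_college"] is not None]

print([s["emplid"] for s in transfers])  # [2, 3]
```

In practice the same two filters would run against the full KDM tables rather than an in-memory list.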
Two community colleges that transfer a large number of students to UMUC, Montgomery College and Prince George's Community College, were identified. UMUC formed a data sharing partnership with each institution to identify factors related to student success, define academic interventions, implement those interventions, and examine their impact on student success. UMUC identified 257,903 students who attended UMUC between 2005 and 2011. Of these students, MC used varying matching strategies to confirm enrollment for 21,444 students, and PGCC used a different set of matching strategies to confirm enrollment for 11,046 students. Additionally, the National Student Clearinghouse (NSC) identified 12,776 students who attended MC and 8,697 students who attended PGCC. UMUC also analyzed transcript information in its own student information system and found even fewer matches. Thus, the Kresge Data Mart includes data on all 257,903 students, as well as additional information on the students matched by the community colleges. The KDM will serve as a critical resource for all predictive analytics projects, as well as for research projects related to student success and integrated data management. One key finding is that the data from the community colleges yielded much more robust and accurate information about students' academic histories at previous institutions than UMUC's internal data resources.

Data integration

Institutions store student demographic, enrollment, and classroom behavior data in a variety of information systems. For the purpose of this project, data from two UMUC student applications, PeopleSoft and WebTycho, are included in this database. PeopleSoft is a student information system that stores current and historical data about students and their enrollment history. MC and PGCC have similar systems for storing student data.
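The multi-source matching described above can be pictured as a set of flags on each UMUC record. In this sketch the IDs and match sets are invented; real matching would work from names, birthdates, and similar identifiers rather than a shared ID:

```python
# Invented IDs; each source contributes a match flag for the KDM record.
umuc_ids = [10, 11, 12, 13]
mc_matched = {10, 12}    # confirmed by Montgomery College
pgcc_matched = {11}      # confirmed by PGCC
nsc_matched = {12, 13}   # confirmed by the National Student Clearinghouse

kdm = [
    {
        "emplid": sid,
        "matched_mc": sid in mc_matched,
        "matched_pgcc": sid in pgcc_matched,
        "matched_nsc": sid in nsc_matched,
    }
    for sid in umuc_ids
]

# Every UMUC student stays in the KDM; the flags record who matched where.
print(sum(r["matched_mc"] or r["matched_pgcc"] or r["matched_nsc"] for r in kdm))
```

Keeping all students and attaching flags, rather than filtering to matches, is what lets later analyses compare matched and unmatched subpopulations.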
WebTycho is a proprietary learning management system that records online classroom activity for every user (student or faculty) during an online session.

Figure 1. Description of the integrated Kresge Data Mart.

Each institution extracted data for each student in the matched population from its source student information systems and learning management systems. Merging data across institutions requires standardizing the data and developing a data dictionary. Data from multiple systems are combined in stages. In the first stage, data from each partner institution were collected and cleaned, and definitions of each data element were gathered, standardized, and documented. In the second stage, data from UMUC's student information system and the online classroom were merged and developed in the UMUC data warehouse. The next stage integrated the community college data with the UMUC data in the Kresge Data Mart (KDM), as shown in Figure 1. The KDM combines information on students' histories at prior institutions, their current course performance, and their online course behavior, as well as common outcome measures identified in the literature. The final stage of the KDM will include information from MC and PGCC on students who transfer elsewhere, as well as students who have transferred from other community colleges to UMUC. This is the data mart that will serve as the base of the final research models and intervention analysis.

Current status. The community college data have been collected and a coding structure has been developed. The code continues to evolve, as the data are still being processed. As a result, the preliminary analyses presented in this report are based on datasets derived from the raw data collected and stored during the data integration process.

Data storage.
In order to carry out data mining activities, storage was required for the various sources of data and for the various stages of the data integration process. To meet the requirements of building an integrated data mart, UMUC purchased Oracle Exadata hardware, which houses the KDM as well as the UMUC data warehouse. The data from the partner institutions are stored separately from the data warehouse for reasons of security and confidentiality. The data sources on the Oracle Exadata hardware, which is in part supported by the Kresge Foundation, have also contributed to other analytical and predictive modeling projects; the integration of data from various sources adds value to all projects related to student success.

Data preparation for initial analyses

Researchers used sample datasets to develop data mining algorithms and predictive models. The datasets were designed specifically for the analysis of the exploratory research questions. The data mining algorithms and predictive models created to explore the initial research questions are described in the sections below, along with initial findings.

Data security

Secure File Transfer Protocol (SFTP) is currently used to transfer data between partner institutions. The storage area, which contains both community college and UMUC data, is secured by the database administrators, who provide accounts only to staff working on the development of the KDM.

Section V: Initial findings

Research Question 1: Examining variables and relationships

This analysis uses data mining techniques to examine relationships among variables in the extant dataset. The question answered in this analysis is: What relationships exist among variables in the extant dataset?

Dataset.
The dataset used for this analysis contains 2,643 new, stateside, first-bachelor's-degree-seeking UMUC students enrolled in 15 gateway courses taken in Spring 2011 to begin their programs of study. Since each student can be enrolled in more than one course, the dataset contains 4,331 enrollment records across 394 unique sections of these gateway courses. The variables in the dataset include transfer history, enrollment data for the students' first semester at UMUC, and online classroom behavior, defined as WebTycho actions in the classroom prior to and after the start day of classes (see Table 2).

Table 2
Dataset variables and definitions

Emplid: Unique ID for each new student.
Subject: Four-letter code for each course, e.g., ACCT, MATH, or PSYCH.
Course Number: Subject and course number together identify a class.
Total Transfer Credits: Sum of credits/hours transferred from other institutions.
Number of Schools Attended Previously: Count of the institutions in the new student's transfer history.
Time Since Last Institution: Number of months since the previous institution.
Transferred in at Least One English Course: Flag indicating that the student had at least one class in the transfer history with subject ENGL, EWRT, or WRTG.
Transferred in at Least One Math Course: Flag indicating that the student had at least one class in the transfer history with subject MATH, STAT, or ENGR.
Earned Associate's Degree: The degree earned by the new student (if any) at a previous institution.
Course Grouping: Developmental (EDCP 100); General (MATH 009 and 012); Business and Professional; Communication; Science and Social Sciences; Technology.
Length of Course in Weeks: Potential values are 8, 10, 12, and 14.
Semester Load: Total credits attempted in Spring 2011 by the student.
Re-enrolled: Yes/No flag indicating whether the student enrolled in at least one course in Summer 2011, excluding the courses enrolled in Spring 2011.

For each of the following classroom activity variables, separate counts were created for each of the first four weeks of the class, as well as for the period prior to the first day of class:

Entering Class: Number of times the student logged into an area of the online course in WebTycho.
Opening Course Content: Number of times the student opened the Course Content area of the online course.
Opening Class Roster: Number of times the student opened the Class Roster area of the course.
Using the Chat Feature: Number of times the student used the Chat Feature in WebTycho.
Opening a Study Group Conference: Number of times the student went to the conference postings in a study group.
Opening Reserved Reading: Number of times the student opened the Reserved Reading area of the online course.
Opening the Webliography: Number of times the student opened the Webliography area of the online course.
Starting the Conferencing Feature: Number of times the student opened the Conferences area of the online course.
Reading a Conference Note: Number of times the student opened a note for reading in the Conferences area of the course.
Writing a Conference Note: Number of times the student wrote or edited a note in the Conferences area of the course.
Adding an Attachment to a Conference Note: Number of times the student attached a file to a note in the Conferences area of the course.

Notes: The class roster is a numbered listing of everyone registered for the class. The chat feature is a place where classmates can participate in synchronous discussion and view shared dialogue. A conference is an area where students participate in asynchronous group discussion, much like an internet forum; within each conference a student can post a reply, also called a "note." Reserved readings is a place where faculty may post read-only documents or links to external websites for students. The webliography is a shared list of web sites relevant to the class.

Method. A data mining program, IBM SPSS Modeler, was used to develop models that explored the dataset for relationships between variables. Four classification algorithms were used: C&R Tree, CHAID, QUEST, and C5.0. Each of the four models has its own advantages and disadvantages; however, two models, C5.0 and CHAID, proved the most valuable in identifying relationships between specific variables. Cross-model data mining techniques were also used to produce a ranking of variable importance.

Findings.
All four algorithms were combined to rank variables across models, producing a ranking of variables related to re-enrollment (see Table 3).

Table 3
Variable ranking derived from data mining models

Very High:
1) Frequency of student activity in the online classroom up to the fourth week of class
2) Entering the online classroom prior to the first day of class
3) Participation in any conference feature
4) Opening and reviewing the Course Content module within the classroom
5) Opening and reviewing Reserved Reading within the classroom

High: Number of schools previously attended; student's semester load
Middle: Time since last institution
Low: Transferred in at least one English course; transferred in at least one math course
Negligible: All other variables

Note: Very high, high, middle, low, and negligible are statistical categories that describe the strength of the predictive power of the variables included in the re-enrollment model.

This ranking indicates that classroom behavior is of very high importance to re-enrollment. In addition, the variables "number of schools previously attended" and "semester load" are somewhat important, and the remaining variables are of low or negligible importance. The C5.0 model indicated that the variable "Entering Class Week 4," followed by "Entering Class Prior to First Day," "Number of Schools Previously Attended," "Writing a Conference Note Prior to First Day," and "Semester Load," was most related to re-enrollment (see Figure 2).

Figure 2
C5.0 model with top five variables related to re-enrollment

The CHAID algorithm also indicated that the variable "Entering Class Week 4" was most related to re-enrollment (see Figure 3).

Figure 3
CHAID model with top 10 variables related to re-enrollment

Implications.
Data mining techniques indicate that student classroom behaviors are of very high importance to re-enrollment. Previous academic background, such as number of schools attended and transfer credits, may also be important. These models suggest that variables related to student behaviors prior to the first day of class and to previous academic work merit further examination in statistical predictive models.

Research Question 2: Course success

The purpose of this analysis is to construct a logistic regression model with a set of covariates and predictor variables that predict course success. This model intends to answer the research question: Which variables, if any, predict student online course success? The intent of this preliminary model is to predict course success from online student behavior in the first four weeks of the course and the student's academic background information.1 The ultimate goal of the analysis is to maximize the prediction of online course success.

Dataset. The dataset used for this analysis contains 4,558 new, undergraduate, first bachelor's degree-seeking students enrolled in 15 UMUC online gateway courses in Spring 2011.2 This dataset also contains transfer data on students from the partner institutions, Montgomery College (MC) and Prince George's Community College (PGCC). Table 4 provides descriptive statistics for all students, passing or not passing, in each gateway course in this dataset.

1 Subsequent models may include growth models, latent class analysis, and structural equation models in order to capture different types of data/information within one system of predictive models.
2 Sessions included in this dataset are Online Session 1 (OL1), OL2, OL3, OL4, and OL5.
Table 4
Descriptive statistics for key covariates

Course     Not Passed         Passed             All
           N      %           N      %           N
ACCT220    91     59%         63     41%         154
BMGT110    102    45%         127    55%         229
CCJS100    75     50%         74     50%         149
CMIS102    133    48%         142    52%         275
EDCP100    164    39%         254    61%         418
EDCP103    53     58%         38     42%         91
GVPT170    46     44%         59     56%         105
HIST157    57     39%         88     61%         145
IFSM201    296    48%         324    52%         620
LIBS150    459    23%         1521   77%         1980
MATH12     11     69%         5      31%         16
MATH9      57     54%         49     46%         106
PSYC100    97     56%         77     44%         174
WRTG101    34     35%         62     65%         96
All        1675   37%         2883   63%         4558

Method. For this study, a logistic regression model was developed that utilized available independent variables (covariates and predictors) to predict the desired dependent outcome, course success. A logistic regression model generates the probability of success (in logits) as it is linearly related to the set of independent variables. In this study, a pseudo-R2 value,3 or value of prediction accuracy, was used for both model diagnosis and selection. Preliminary methods focused on the identification of covariates (e.g., student demographic information and the record of previous academic work) and predictors based on the students' coursework behavior.

Identification of key covariates. Exploratory factor analysis (EFA) was used to identify key covariates. EFA is an exploratory or data mining method that uncovers underlying constructs, or common structures, among a set of related variables. Previous data mining analysis4 indicated that the following covariates in student coursework data were strongly associated with course success: 1) total number of transfer credits (Xfer_total); 2) amount of time since the student attended the last institution (Xfer_time); 3) semester course load (LOAD); and 4) GPA from transferred credits (GPA_C). Table 5 shows the rule used to categorize each covariate for use in the logistic regression model.

Table 5
Rules for key covariates in student coursework data

Total number of transfer credits (Xfer_total): 0 = 0 credits; 1 = 1-29 credits; 2 = 30-59 credits; 3 = more than 60 credits
Amount of time since a student attended the last institution (Xfer_time): 0 = did not attend; 1 = 1-24 months; 2 = over 24 months
Semester course load (LOAD): 1 = 1-3 credits; 2 = over 3 credits
GPA from transferred credits (GPA_C): 0 = no information; 1 = less than 2.5; 2 = 2.5 to less than 3.0; 3 = 3.0 to less than 3.5; 4 = over 3.5

Additional covariates related to student behavior prior to the official class starting day (called Week 0) were also found to be associated with course success. Table 6 displays summarized rules for each student behavior covariate in Week 0.

3 Pseudo or generalized R-squared is a generalization of the coefficient of determination, R2, to more general regression models such as logistic regression (Nagelkerke, 1991).
4 See the section on cluster analysis for a detailed explanation of this process.
Table 6
Rules for LMS student behaviors in Week 0

Entering a class (Opnclss_cat0): 0 = Did not enter a class; 1 = Entered a class at least once
Opened the class roster (Opnclssrstr_cat0): 0 = Did not check the roster; 1 = Checked the roster at least once
Opened class content (Opncrscntnt_cat0): 0 = Did not open class content; 1 = Opened class content at least once
Created a conference note (Createconfnote_cat0): 0 = Did not create a conference note; 1 = Created a conference note at least once
Created a response note (Createrespnote_cat0): 0 = Did not create a response note; 1 = Created a response note at least once
Summary of the student's Week 0 behavior prior to the first day (TimeZeroTotal): Sum of the five variables above

Using these rules, a logistic regression model was used to evaluate the effect of each identified covariate (in terms of pseudo-R2) on course success for use in the final model (see Table 7). These findings include:

1. Total number of transferred credits is the best covariate predictor of course success (pseudo-R2 value around .12).5,6
2. GPA from transferred credits is the second-best covariate predictor of course success (pseudo-R2 around .11).
3. Semester course load contributes less to course success than the other covariates.

5 It was the best predictor of success among the time-invariant covariates, that is, covariates whose values are consistent over all models in the analysis using different points in time.
6 The covariate number of schools previously attended (Xfer_NBR) exhibited linear dependency with total number of transferred credits and was therefore removed from the analysis.
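The rules in Tables 5 and 6 are deterministic binning functions, so they are straightforward to express in code. The sketch below is an illustration of those rules, not the project's actual data preparation; the function and key names are invented, and the handling of boundary values the tables leave ambiguous (exactly 60 credits, exactly 3.5 GPA) is an assumption.

```python
# Week 0 behaviors from Table 6 (shorthand keys, not the report's field names).
WEEK0_BEHAVIORS = ["opnclss", "opnclssrstr", "opncrscntnt",
                   "createconfnote", "createrespnote"]

def xfer_total_cat(credits):
    """Bin total transfer credits per Table 5 (Xfer_total)."""
    if credits == 0:
        return 0
    if credits <= 29:
        return 1
    if credits <= 59:
        return 2
    return 3  # Table 5 says "more than 60"; exactly 60 is assumed to fall here

def gpa_c_cat(gpa):
    """Bin transfer GPA per Table 5 (GPA_C); None means no information."""
    if gpa is None:
        return 0
    if gpa < 2.5:
        return 1
    if gpa < 3.0:
        return 2
    if gpa < 3.5:
        return 3
    return 4  # "over 3.5"; exactly 3.5 is assumed to fall here

def week0_features(counts):
    """Dichotomize raw Week 0 action counts per Table 6 and sum the
    five indicators into TimeZeroTotal (0-5)."""
    feats = {b: int(counts.get(b, 0) > 0) for b in WEEK0_BEHAVIORS}
    feats["TimeZeroTotal"] = sum(feats[b] for b in WEEK0_BEHAVIORS)
    return feats
```

For example, a student with 45 transfer credits, a 3.2 transfer GPA, and Week 0 logins plus one content view would be coded Xfer_total = 2, GPA_C = 3, TimeZeroTotal = 2.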
Analysis of combined covariates (student behaviors and previous coursework) indicates that total number of transfer credits (Xfer_total), amount of time since a student attended the last institution (Xfer_time), GPA from transferred credits (GPA_C), and the summary of the student's behavior prior to the first day (TimeZeroTotal) together account for approximately 15-20% of the variance based on the pseudo-R2 value (see Table 7).

Table 7
Summary of pseudo R2 estimates

Covariate                                               R2 estimate
xfer_total                                              0.12
xfer_time                                               0.06
LOAD                                                    0.01
GPA_C                                                   0.11
Opnclss_cat0                                            0.05
Opnclssrstr_cat0                                        0.03
Opncrscntnt_cat0                                        0.06
Createconfnote_cat0                                     0.02
Createrespnote_cat0                                     0.02
TimeZeroTotal                                           0.07
xfer_total + xfer_time + LOAD + TimeZeroTotal + GPA_C   0.19

Building the predictive model. In order to accurately predict course success, predictor variables that measure student online course behavior in Weeks 1-4 were integrated into the logistic regression model. Exploratory analysis based on data mining techniques and the availability of data indicated the importance of five predictor variables related to online course behavior: 1) entering a class; 2) opening class content; 3) creating a conference note; 4) creating a response note; and 5) reading a conference note.
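The logistic regression described above relates such categorized inputs to the probability of course success on the logit scale. The following sketch shows only the functional form; the coefficient values are invented for illustration and are not the fitted estimates from this study.

```python
import math

# Hypothetical coefficients for illustration only (not the study's estimates).
INTERCEPT = -1.0
COEFS = {"xfer_total": 0.40, "GPA_C": 0.30,
         "TimeZeroTotal": 0.20, "readconfnote_cat": 0.50}

def p_success(features):
    """Logistic model: P(success) = 1 / (1 + exp(-(intercept + sum(b_k * x_k))))."""
    logit = INTERCEPT + sum(COEFS[k] * x for k, x in features.items())
    return 1.0 / (1.0 + math.exp(-logit))

# A student who reads conference notes gets a higher predicted
# probability than an otherwise identical student who does not.
base    = p_success({"xfer_total": 1, "GPA_C": 2, "TimeZeroTotal": 1, "readconfnote_cat": 0})
engaged = p_success({"xfer_total": 1, "GPA_C": 2, "TimeZeroTotal": 1, "readconfnote_cat": 1})
```

Because the model is linear in logits, each coefficient shifts the log-odds of success by a fixed amount per unit of its (categorized) input, which is what makes the relative-contribution comparisons reported below possible.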
In order to account for high variability in student online course behaviors, predictors were categorized using the following rules (see Table 8).

Table 8
Rules for student online course-taking behavior beyond Week 0

Entering a class (Opnclss_catN): 0 = Did not enter a class; 1 = Entered a class at least once
Opened class content (Opncrscntnt_catN2): 0 = Did not open class content at least twice; 1 = Opened class content at least twice
Created a conference note (Createconfnote_catN): 0 = Did not create a conference note; 1 = Created a conference note at least once
Created a response note (Createrespnote_catN): 0 = Did not create a response note; 1 = Created a response note at least once
Read a conference note (readconfnote_catN2): 0 = Did not read a conference note at least twice; 1 = Read a conference note at least twice

In this case, each predictor variable represents a weekly total of distinct actions per student per class in Weeks 1-4 of an online gateway course (see Table 9).

Table 9
Weekly total of actions per student beyond Week 0 (N = 4,558 for every variable)

Variable                Mean    SD       10th   25th   50th   75th   90th
OPNCLSS_WEEK1           8.18    8.88     2      3      6      10     16
OPNCLSS_WEEK2           6.65    8.08     1      2      5      8      14
OPNCLSS_WEEK3           6.39    7.24     0      2      5      8      14
OPNCLSS_WEEK4           6.06    7.31     0      2      4      8      13
OPNCLSSRSTR_WEEK1       0.82    3.02     0      0      0      1      2
OPNCLSSRSTR_WEEK2       0.22    1.17     0      0      0      0      1
OPNCLSSRSTR_WEEK3       0.14    1.50     0      0      0      0      0
OPNCLSSRSTR_WEEK4       0.10    0.73     0      0      0      0      0
OPNCRSCNTNT_WEEK1       4.15    5.42     0      1      3      6      10
OPNCRSCNTNT_WEEK2       2.75    3.68     0      0      2      4      7
OPNCRSCNTNT_WEEK3       2.51    3.33     0      0      1      4      7
OPNCRSCNTNT_WEEK4       2.39    3.55     0      0      1      3      6
READCONFNOTE_WEEK1      99.55   115.89   2      21     59     138    248
READCONFNOTE_WEEK2      62.25   87.44    0      5      28     84     172
READCONFNOTE_WEEK3      52.97   75.98    0      4      25     71     141
READCONFNOTE_WEEK4      40.88   71.89    0      0      14     52     114
UPDTCONFNOTE_WEEK1      0.15    0.61     0      0      0      0      0
UPDTCONFNOTE_WEEK2      0.13    0.62     0      0      0      0      0
UPDTCONFNOTE_WEEK3      0.13    0.68     0      0      0      0      0
UPDTCONFNOTE_WEEK4      0.10    0.57     0      0      0      0      0
CREATECONFNOTE_WEEK1    0.61    0.92     0      0      0      1      2
CREATECONFNOTE_WEEK2    0.23    0.63     0      0      0      0      1
CREATECONFNOTE_WEEK3    0.26    0.59     0      0      0      0      1
CREATECONFNOTE_WEEK4    0.17    0.53     0      0      0      0      1
CREATERESPNOTE_WEEK1    2.86    3.62     0      0      2      4      7
CREATERESPNOTE_WEEK2    1.97    3.25     0      0      1      3      6
CREATERESPNOTE_WEEK3    1.79    2.84     0      0      1      2      5
CREATERESPNOTE_WEEK4    1.45    2.87     0      0      0      2      5

Note: 10th through 90th refer to percentiles of the distribution of each weekly count.

This categorization accomplished three purposes: 1) it removed the effect of extreme values of highly variable actions; 2) it facilitated a percentile-level count of each behavior; and 3) it examined the effect of each cut point.7 Categorization of the predictor variables also emphasizes the importance of consistent engagement in classroom activity, indicated by a value of "1" for student action from Week 1 to Week 4 and "0" for no action.8

Evaluating the predictive model. The covariates and predictors included in the final logistic regression model represent the best model fit (as represented by pseudo-R2) for evaluating student course success. In addition, this model allows for determination of the relative contribution of each covariate and predictor at each point in time, which will aid in identifying at-risk students. For example, the power of the model in predicting course success becomes extremely high (close to 90%, with pseudo-R2 exceeding .60) using 8 weeks of data. However, since interventions for increasing course success are not useful at the end of the semester, the findings focus on the performance of predictors over the first four weeks, particularly the first two weeks of behavior in the online class.

Findings. Initial analysis using the final model indicates the relative contribution of covariates and predictors from Week 1 to Week 4 (see Table 10).
Table 10
Relative contribution of covariates and predictors

                        Week 1   Week 2   Week 3   Week 4
Covariates
xfer_total              15%      18%      16%      17%
xfer_time               2%       2%       2%       2%
Load                    5%       7%       6%       7%
TimeZeroTotal           11%      10%      9%       7%
GPA_C                   7%       9%       8%       9%
Predictors
createrespnote_total    10%      3%       6%       2%
readconfnote_total2     34%      32%      23%      21%
createconfnote_total    15%      7%       14%      10%
opnclss_total           0%       13%      16%      26%
opncrscntnt_total2      0%       0%       0%       1%

7 "Cut point" refers to the numeric values or rules used to categorize a continuous value into categories.
8 In other words, each predictor variable represents the sum of student actions across Weeks 0-4 as the final predictor variable in the model (e.g., <action>_total).

Examination of the relative contribution of covariates indicates that total transfer credit (xfer_total) exhibited the highest level of contribution to the predictive model (15%-18%), followed by Week 0 behavior (TimeZeroTotal), transfer GPA (GPA_C), course load (LOAD), and time since a student last attended an institution (xfer_time), respectively. Examination of the relative contribution of predictors indicates that the variable reading a conference note (readconfnote_total2) has the highest influence in the predictive model (21%-34%). Note that entering a class (opnclss_total) had no impact on student success in Week 1, but its impact increased in Weeks 2-4. The predictive power of the model increases gradually over the first four weeks of the course, from 79% correct prediction of course success at Week 1 to 82% at Week 4. The pseudo-R2 value increased from approximately .19 using the covariate-only model (see Table 7) to .50 in the Week 1 predictive model (see Table 11).
Table 11
Predictive power of the model by week

      R2         Count                      Percent
Week  estimate   P/P    F/F   F/P   P/F     P/P   F/F   F/P   P/F    Success
1     0.50       1067   964   252   295     41%   37%   10%   11%    79%
2     0.55       1089   990   226   273     42%   38%   9%    11%    81%
3     0.57       1108   993   223   254     43%   39%   9%    10%    81%
4     0.59       1129   991   225   233     44%   38%   9%    9%     82%

Note that the pseudo-R2 value increased from .50 at Week 1 to .59 at Week 4.

Group-level analysis. Predictive analysis by sub-group based on sessions or courses was conducted using the model. Although findings from this group-level analysis are less informative due to limited sample sizes, the predictive model maintained a moderate to high level of fit based on the pseudo R-squared value (above the mid .30s).9 The group-level analysis demonstrated that the predictive model maintains a high level of model fit and prediction at over 75% in the majority of sub-groups (see Table 12).10

Table 12
Predictive power of the model by week and session

Session             Stat      Week 1   Week 2   Week 3   Week 4
OL1                 R2        0.65     0.71     0.74     0.74
                    Success   85.1     87.4     87.6     88.9
OL2                 R2        0.59     0.70     0.75     0.75
                    Success   84.3     85.9     88.2     88.2
OL3                 R2        0.49     0.54     0.56     0.55
                    Success   78.3     79.0     79.5     80.2
OL4                 R2        0.45     0.50     0.53     0.55
                    Success   73.1     74.8     77.0     78.7
OL5                 R2        0.47     0.52     0.55     0.60
                    Success   59.0     60.0     64.8     68.6
OL123               R2        0.58     0.65     0.68     0.69
                    Success   83.7     85.5     86.7     86.9
LIBS150             R2        0.25     0.31     0.40     0.45
                    Success   79.3     81.7     83.8     85.3
No LIBS150          R2        0.50     0.55     0.57     0.59
                    Success   78.8     80.6     81.5     82.2
Core LIBS           R2        0.27     0.29     0.36     0.41
                    Success   74.0     74.8     78.0     80.4
Core No LIBS        R2        0.48     0.51     0.53     0.55
                    Success   78.1     78.8     79.0     79.1
EDCP100             R2        0.66     0.69     0.71     0.73
                    Success   83.5     86.1     85.6     87.6
IFSM201             R2        0.54     0.59     0.63     0.66
                    Success   78.5     81.9     82.4     83.5
EDCP100 & IFSM201   R2        0.58     0.62     0.65     0.68
                    Success   82.2     83.1     84.3     85.8

9 The OL5 session had a fair model fit but exhibited a low prediction rate; the sample size of the OL5 session was too small to draw a conclusion.
10 This sample included all students by session level and course category (general education and remedial courses), but excluded students in LIBS 150. LIBS 150 is a general education course; it is a shorter, more orientation-focused course than a regular academic course, and it has a very high passing rate of 75.3%.

Implications. Preliminary analysis of variables related to course success identified a set of critical covariates and predictors. Four of the five predictors derived from online student behavior show a strong contribution to student success. The final model, using the full dataset, will examine the effect of these predictors as well as evaluate the impact of other behavioral measures, such as accessing course modules.

Research Question 3: Student retention

The purpose of this preliminary analysis is to construct a logistic regression model with a set of covariates and predictor variables and determine their impact on student retention. This model intends to answer the research question: Which variables, if any, predict retention?

Dataset. The analysis utilized the same dataset used for the prediction of course success, with the addition of retention status from Summer 2011, Fall 2011, and Spring 2012. The retention rate was 66% (N = 3,015) in this sample (see Table 13).

Table 13
Percentage of students retained and not retained

Retained       3,015   66%
Not Retained   1,543   34%

Method. As with the prediction of course success, this model uses logistic regression to predict retention. Preliminary methods focused on the evaluation of the covariates identified in the previous analysis, predictors based on the students' coursework behavior, and course success information (i.e., passed or not passed).
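The pseudo-R2 values reported throughout these analyses follow the Nagelkerke (1991) definition cited earlier. A minimal sketch of that computation from model log-likelihoods follows; the function name and the example numbers are invented for illustration.

```python
import math

def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke pseudo R-squared from the log-likelihood of the
    intercept-only model (ll_null), the fitted model (ll_model),
    and the sample size n."""
    # Cox & Snell R^2, whose maximum is below 1 for binary outcomes.
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_model) / n)
    # Nagelkerke rescales by the maximum attainable value so R^2 can reach 1.
    max_r2 = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_r2

# Illustrative numbers only: 100 students with a 63% pass rate.
r2 = nagelkerke_r2(ll_null=-65.9, ll_model=-50.0, n=100)
```

A model that fits no better than the intercept-only baseline yields 0; a model that predicts every outcome perfectly approaches 1, which is what makes values such as the .50-.59 reported above comparable across weeks.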
Evaluation of covariates. The total number of credits transferred (xfer_total) had the highest predictive power, followed by GPA from transfer credits (GPA_C). The magnitude of the covariates' pseudo-R2 is lower for the prediction of retention than for the prediction of course success. The covariates together yield a pseudo-R2 of 0.14. Table 14 shows pseudo-R2 estimates for the five covariates.

Table 14
Retention pseudo R2 estimates of covariates

Covariate       R2 estimate
xfer_total      0.08
xfer_time       0.02
GPA_C           0.07
Load            0.01
TimeZeroTotal   0.03
All combined    0.14

Evaluation of predictors. For this model, the predictors identified for the course success model were used and combined with a new set of predictors: 1) number of courses passed (N_Pass); 2) number of courses not passed (N_NotPass); 3) number of course withdrawals (W) or failures due to non-attendance (FN) (N_WFN); 4) passing LIBS 150 (passed_LIBS150); 5) passing a general education course (passed_GenEd); 6) passing a remedial course (passed_Remed); and 7) term GPA (GPA_Term). In this model, the student behavior indicators did not predict retention as well as they predicted course success (see Table 15).

Table 15
Prediction success of combined student behavior indicators by week

        Retention       Course Success
Week    Low     High    Low     High
1       0.14    0.19    0.37    0.50
2       0.14    0.20    0.41    0.55
3       0.15    0.20    0.43    0.57
4       0.14    0.20    0.45    0.59

Findings. When included in the prediction model for retention, the covariates predicting course success generated a higher pseudo-R2 value than the student behavior predictors (see Table 16).
Table 16
R-squared estimates for retention predictors and covariates

Predictor                                    R2 estimate
passed_LIBS150                               0.15
passed_GenEd                                 0.15
passed_Remed                                 0.15
N_Pass                                       0.19
N_NotPass                                    0.17
N_WFN                                        0.16
GPA_Term                                     0.24
Full model (COVS + all passing indicators)   0.35

Note: COVS = xfer_total, xfer_time, LOAD, GPA_C.

Term GPA and number of courses passed had the highest pseudo-R2 estimates among the seven indicators. The full model, including covariates and passing indicators, had a pseudo-R2 of 0.35.

Implications. The covariate and student behavior variables made less of a contribution to this model than they did to the prediction of course success. The final model using the course-passing indicators11 showed a stronger contribution to the prediction of retention. These preliminary results indicate the importance of course success as the primary factor in increasing retention, which means that interventions aimed at improving course success may have a strong indirect effect on increasing retention.

Additional exploratory analysis: Student withdrawal patterns

The purpose of this additional exploratory analysis was to conduct a survival analysis that examined student course withdrawal patterns. The goal of this study is to increase understanding of withdrawal patterns so that they can be effectively incorporated into models predicting retention.

Dataset. The dataset for this study contains 19,190 undergraduate UMUC students enrolled in 278 distinct courses in OL1 (Online Session 1) of Fall 2011. This dataset contained student course information and WebTycho data. Summary statistics for student demographic and academic behavior variables are listed in Table 17.
Table 17
Demographics for OL1 students

Variable                  Frequency   Percent   Cumulative Percent
Gender
  Unknown                 244         1.3       1.3
  Female                  10,872      56.7      57.9
  Male                    8,074       42.1      100.0
Age Group12
  Above 32                8,401       43.8      43.8
  At or below 32          10,789      56.2      100.0
New/Returning Status
  New                     3,361       17.5      17.5
  Returning               15,829      82.5      100.0

11 Additional information on course withdrawal related to student behavior variables is provided in the subsequent section.
12 Age 32 is used as the cut point because it is the median age of UMUC students.

The majority of the students in OL1 of Fall 2011 were female (56.7%); 56.2% of the students were at or below the UMUC median student age of 32; and 17.5% of the students were new to UMUC, while 82.5% were returning students.

Method. An exploratory survival analysis was carried out using a Kaplan-Meier estimator. Survival analysis is a statistical technique that can be used to model "time-to-event" data. In this case, the analysis examines the time it took for a student to withdraw from a particular course (in weeks and days), reflected as a time-to-event.13 Survival analysis generates a table that indicates the hazard (or withdrawal) rate during the semester. Additional univariate survival analyses were also conducted to identify retention patterns by age, gender, and student status.

Survival analysis. Table 18 shows the number of student withdrawals, the cumulative proportion of students remaining, and the withdrawal rate.

Table 18
Withdrawal rate in OL1

Week   Number of   Number of Student   Cumulative Proportion   Withdrawal
       Students    Withdrawals         Remaining               Rate
1      19190       407                 0.98                    0.0031
2      18783       284                 0.96                    0.0022
3      18499       251                 0.95                    0.0020
4      18248       217                 0.94                    0.0017
5      18031       295                 0.92                    0.0024
6      17736       132                 0.92                    0.0011

The student rate of withdrawal is highest in Week 1 (N = 407; withdrawal rate .0031). The withdrawal rate gradually drops until Week 5, which is the week before the academic withdrawal deadline.
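Because every event in this window is an observed withdrawal (no censoring), the Kaplan-Meier estimate reduces to multiplying weekly retention fractions. The sketch below is an independent re-derivation for illustration, not the project's actual SPSS output; applied to the weekly counts in Table 18, it reproduces the cumulative proportion column.

```python
def kaplan_meier(at_risk, events):
    """Kaplan-Meier survival curve: S(t) = product over i <= t of (1 - d_i / n_i),
    where n_i students are at risk at the start of week i and d_i withdraw."""
    curve, s = [], 1.0
    for n_i, d_i in zip(at_risk, events):
        s *= 1.0 - d_i / n_i
        curve.append(round(s, 2))
    return curve

# Weekly counts from Table 18.
at_risk     = [19190, 18783, 18499, 18248, 18031, 17736]
withdrawals = [407, 284, 251, 217, 295, 132]
survival = kaplan_meier(at_risk, withdrawals)
# survival matches Table 18's cumulative proportions: [0.98, 0.96, 0.95, 0.94, 0.92, 0.92]
```

Each factor (1 - d_i / n_i) is one minus the weekly hazard, so the curve drops fastest exactly where the withdrawal rate column peaks (Weeks 1 and 5).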
The overall trend is that, in general, student withdrawal rates decrease after Week 1 of the semester.

13 This analysis only examines students who received a grade of W (withdrawal).

Univariate survival analysis. The univariate survival analysis examined three covariates: gender, age, and student status. Age, defined as below or above 32 years old, was not statistically significant (p = .60) in terms of correlation with withdrawal patterns. Gender (male or female) approached significance (p = .06). Student status (new or returning) showed statistical significance (p < .001): new students withdraw at a higher rate than returning students. Figure 4 shows that the withdrawal patterns of new and returning students differ.

Figure 4
Withdrawal rate for new and returning students by day

Implications. Students withdraw at a higher rate in Week 1 than in any other week of the course session, with the exception of Week 5, the week before the academic withdrawal deadline. Student status, new or returning, may significantly affect the withdrawal rate: new students withdraw at a higher rate than returning students. These findings suggest that interventions targeting new students in Week 1 may be appropriate.

Section VI: Implications and next steps

In summary, data mining and statistical modeling can identify variables that relate to student success. Data mining techniques identified the importance of academic background and student course behaviors as variables related to student success. Predictive models of course success and retention using variables gleaned from data mining showed similar outcomes.
Specifically, online student course behaviors, such as opening and reading conference notes in the first four weeks of class, were correlated with course success. Previous academic background, such as total number of transfer credits and transfer GPA, contributed to both course success and retention. Course success, rather than student behavior, was the key predictor of retention. Analysis of student course behaviors showed that students withdraw at a higher rate in Week 1 than in any other week of the course session, with the exception of Week 5, and that student status may significantly affect the withdrawal rate.

These findings will guide our next research steps, which will finalize and validate predictive models in order to build student profiles. These steps may include, but are not limited to: running predictive models with the full dataset; validating models; sub-group analyses examining community college students in terms of course success, re-enrollment, and retention; identifying course sequences that indicate re-enrollment and retention; and integrating withdrawal information into predictive models. These additional steps will aid in building student profiles that can be used for predicting student success and designing effective interventions.

References

Aragon, S. R., & Johnson, E. S. (2008). Factors influencing completion and non-completion in online community college courses. American Journal of Distance Education, 22(3), 146-158.

Baker, R. S., & Yacef, K. (2009). The state of educational data mining: A review and future visions. Journal of Educational Data Mining, 1(1), 3-17.

Boston, W., Diaz, S. R., Gibson, A. M., Ice, P., Richardson, J., & Swan, K. (2011). An exploration of the relationship between indicators of the community of inquiry framework and retention in online programs. Journal of Asynchronous Learning Networks, 13(3), 67-83.

Finnegan, C., Morris, L. V., & Lee, K. (2009). Differences by course discipline on student behavior, persistence, and achievement in online courses of undergraduate general education. Journal of College Student Retention, 10(1), 39-54.

Herzog, S. (2006). Estimating student retention and degree completion time: Decision trees and neural networks vis-a-vis regression. New Directions for Institutional Research, 131, 17-33.

Ho Yu, C., DiGangi, S., Jannasch-Pennell, A., Lo, W., & Kaprolet, C. (2007, February). A data-mining approach to differentiate predictors of retention. Paper presented at the Educause Southwest Conference, Phoenix, AZ.

Luan, J. (2001). Data mining as driven by knowledge management and higher education: Persistence clustering and prediction. Paper presented at the SPSS Public Conference, University of California at San Francisco.

Luan, J. (2002). Data mining and its applications in higher education. New Directions for Institutional Research, 113, 7-36.

Luan, J., & Zhao, C.-M. (2006). Data mining: Going beyond traditional statistics. New Directions for Institutional Research, 131, 7-16.

Morris, L. V., & Finnegan, C. L. (2009). Best practices in predicting and encouraging student persistence and achievement online. Journal of College Student Retention, 10(1), 5-34.

Nagelkerke, N. J. D. (1991). A note on a general definition of the coefficient of determination. Biometrika, 78(3), 691-692.

Nistor, N., & Neubauer, K. (2010). From participation to dropout: Quantitative participation patterns in online university courses. Computers & Education, 55, 663-672.

Park, J.-H., & Choi, H.-J. (2009). Factors influencing adult learners' decision to drop out or persist in online learning. Educational Technology and Society, 12(4), 207-217.

Romero, C., Ventura, S., Espejo, P. G., & Hervas, C. (2008, June). Data mining algorithms to classify students. Paper presented at the 1st International Conference on Educational Data Mining, Montreal, Canada.

Willging, P. A., & Johnson, S. D. (2009). Factors influencing adult student decisions to drop out of online courses.
Journal of Asynchronous Learning Networks, 13(3), 115-127.

Appendix A
Memoranda of Understanding

Appendix B
Financial Statement