2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation

A Prototype for a Data Mining Based Pathfinder to Sudanese Universities

Eltayeb Abuelyaman, Department of Computer Science, University of Nizwa, Nizwa, Oman, [email protected]
Atifa Elgimari, School of Management Studies, Ahfad University for Women, Omdurman, Sudan, [email protected]

978-1-4799-4923-6/14 $31.00 © 2014 IEEE. DOI 10.1109/UKSim.2014.65

Abstract: Recent data mining innovations in college education include novel techniques for guided selection of courses, predictions of grades, and predictions of success in fulfilling graduation requirements. Attempts have also been made to discover associations among students with sharp learning curves in an effort to help slow learners. This paper suggests a customizable enrollment system based on analysis of multidimensional data stores using online analytical mining techniques. Such a system will empower guardians and enrollment officers with hidden information that can be used in recommending majors of study to students. Unguided selection of majors is among the root causes of two problems students face: dropping out of college and changing majors of study. The proposed system is expected to reduce both. The system will also help colleges plan enrollments well.

Keywords: data mining, On-Line Analytical Processing, relational, enrollment, higher education.

I. INTRODUCTION

Some data mining techniques are based on data cubes, the construction of which is a major step in the design of a data warehouse. To analyze data, the cubes must be processed efficiently. In reference [1], Rob and Ellis offered a great deal of theoretical knowledge, as well as practical experience with data warehousing tools and On-Line Analytical Processing (OLAP) techniques. In reference [2], Ivanova and Rachev introduced a different approach for constructing data cubes. Their approach enables users to view aggregated data from multiple angles. Conceptually, the authors described the components of a cube as a base cuboid surrounded by a collection of sub-cubes. The sub-cubes are then used to compute aggregations of the base cuboid across one or more dimensions. According to references [3-5], the three main storage modes for implementing OLAP operations are Relational OLAP (ROLAP), Multidimensional OLAP (MOLAP), and Hybrid OLAP (HOLAP). Significant work has been achieved using these storage modes, as reported in [3]. For example, a criterion was given in the aforesaid references for selecting the storage mode based on the data type and for selecting the physical locations within the data warehouse architecture. The references also suggested blueprints for assessing both the effectiveness of the modes and that of further enhancements thereof by the research community.

On the mining side, several studies addressed limitations of OLAP technology when it lacked the intelligence necessary for concluding relationships. To date, data mining techniques such as classification and association rule mining have been blended with OLAP techniques. This assimilation resulted in the On-Line Analytical Mining (OLAM) technology (also called OLAP mining). Subsequently, several researchers leveraged the power of OLAM in different areas. The mining of association rules over data cubes is a good example [4]. In reference [6], Bogdanova and Georgieva complemented the work in [4] by using OLAM technology in mining association rules via a web-based client/server system. The coupling of OLAP and the association rules technique led to the discovery of interesting correlations among OLAP data cubes [4,6]. Other researchers focused on using OLAM to improve the mining process of association rules via data cubes. In reference [7], Messaoud et al. proposed a general framework for mining inter-dimensional association rules via data cubes. The authors deduced rules from objectives of the analysis to optimize support and confidence parameters using aggregated measures.

Some of the recent technological innovations data mining brought about include mining education data. Ironically, the use of data mining techniques in the domain of its own fostering mother, the education community, still falls far short of expectations when compared, for example, with its commercial uses. Fortunately, the imbalance is gradually changing.

Recent uses of data mining techniques in post-high-school education include attempts to improve students’ performance in classrooms [8-12]. M. Goyal and R. Vohra went well beyond the classroom to discuss improving student life-cycle management, course selection, retention rates, and the management of financial support [12]. In reference [8], Zlatko J. Kovačić used the Chi-squared Automatic Interaction Detection (CHAID) technique to classify students using “success” as the dependent variable. With only “ethnicity”, “course program” and “course block” as predictors, the accuracy he achieved was low, by the author’s own account. However, the author did set out a strategy for handling the limitations he identified. For a similar goal, M. Ramaswami and R. Bhaskaran used CHAID to determine the primary root causes of slow learning among students [10]. The authors used probability to reduce their set of 16 predictor variables to the 7 they considered most influential. They concluded that “high schools”, “geography” and “socio-economic statuses” of families are the most influential factors in the sharpness of students’ learning curves. In reference [11], Vialardi et al. described the generation of domain-specific variables capable of representing students’ past performances. They then analyzed the domain variables for the purpose of suggesting educated course selections to avoid misguided enrollment decisions.
However, the domain variables they chose are the “degree of difficulty” of a course and the student’s “potential”. They based the “degree of difficulty” of a course on the average grade of all students who took the course in previous semesters, and the student’s “potential” in the course on his/her Cumulative Grade Point Average (CGPA). Yet the degree of difficulty students encounter in a course is a function of several variables, and their potential to do well in the course is not easily predictable from their CGPA alone. Moreover, in many countries in the Middle East and the Orient, students do not have a choice of which courses to take once they have chosen a specific major. Another point of concern is that the predictor variables in [9-14] are from three different cultures: the Occident, the Orient and the Middle East. Ironically, some of the primary predictor variables in one culture are secondary in others and vice versa. Thus, a system of predictor variables that nicely fits the US may not hold so well for the UK.

The objective of this paper is to provide a framework for a system that would help students, their guardians and centralized/distributed enrollment offices in predicting majors that best fit students’ backgrounds. Unfortunately, no reliable statistics are available to the authors on students who, in the middle of their course of study, changed majors they had unwisely chosen for themselves. The authors also have no data on the percentages of students who dropped out upon failing to cope with the requirements of majors they hastily chose for themselves or were forced to choose by their guardians. With the availability of a user-friendly, publicly accessible career predictor, most of the aforesaid problems would be either eliminated or reduced. A prototype for such a solution is the Scout for College Students (SCS) system, which the authors propose in the next section. A suggested implementation for SCS is outlined in Section III.
Sections IV and V are about the SCS’s querying and mining, respectively. The conclusion is drawn in Section VI.

II. A PROTOTYPE FOR SCS

The SCS is an anytime, anywhere customizable scouting system for college students. It allows users to choose from a list of predictor variables, add their own, or accept the default set SCS suggests. One challenge that needs more attention is the identification and sorting of a standard set of global predictor variables into the primary and secondary classes necessary for globalizing SCS. Such a set will enable predictions of inter-country enrollments. For the purpose of prototyping, the selected predictor variables are compatible with the Sudanese education requirements. The prototype is intended to predict majors for students who plan to enroll in Sudanese higher education institutions. As such, it will not be as accurate when used for students who graduated from high schools in other countries but plan to enroll in Sudanese colleges and universities.

The predictors for the prototype are a set of key variables that identify the residential and educational background of students. Classification of predictor variables into primary and secondary will not be necessary for prototyping. The set of predictor variables includes geography, socioeconomic status, high school, gender, and education setup (single-sex vs. coed). The prototype is tested using a dataset collected from the Sudanese Ministry of Higher Education and Scientific Research. The dataset covers records of students admitted to Sudanese universities during the period 2005-2009. The On-Line Analytical Processing (OLAP) technique was chosen for analysis of the data. OLAP is known for its powerful data mining techniques. It enables users to perform statistical operations on substructures of any multidimensional data structure. Implementation of the prototype is next.

III. SUGGESTED IMPLEMENTATION FOR SCS

The following predictor variables will henceforth be called dimensions, for consistency with the terminology of OLAP and multidimensional structures:
· Student’s Location: represents the location of the high schools students attended. Its hierarchy is (Country - Province - State).
· Higher education: represents the university to which a student applies. Its hierarchy is (University - Faculty - Department - Program Type).
· Student’s sex: female vs. male.
· Student’s High School: represents the type of high school where students completed their education and took the national standard certificate examination. Types are filtered based on: Major (such as academic vs. agricultural); Category (public vs. private); Designation (female, male, or coed); Category (regular vs. home student).
· Date: (2005-2009).

The measures chosen for this project include:
· Fact Admissions Count (FAC): a function that returns the count of admitted students.
· Percentage of Selections: enrollment application forms offer students 40 different colleges to select from.
· Scores Average: the average intake score.

Upon concluding the design of the dimension tables, their hierarchies, and the fact measures, a snowflake schema was chosen for the prototype’s data warehouse. The design of the snowflake schema is shown in figure 1.

Figure 1. Multidimensional snowflake warehouse for prototyping SCS

The proposed snowflake data warehouse is used as a base for developing a cube. Figure 2 shows the implementation of the cube, where selected dimensions have been added to the cube browser. High Schools’ Majors are measured by Scores Average Percent across the date dimension, which covers the years 2005-2009. Unfortunately, such reports have not been designed for end users and cannot be deployed as stand-alone applications. SCS enables users to mine the data.
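As a rough illustration of how the dimensions and measures listed in Section III fit together, the following Python sketch groups a handful of fact rows by a chosen set of dimensions and returns two of the measures (FAC and Scores Average) for each cell. It is only a sketch under stated assumptions: the names `Admission` and `aggregate`, the flat row layout, and all sample rows are hypothetical and made up for illustration. They are not taken from the ministry dataset or from the SCS prototype, which is built on an OLAP cube over a snowflake warehouse.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Admission:
    """One hypothetical fact row: dimension values plus the intake score."""
    year: int
    state: str
    faculty: str
    sex: str
    school_type: str
    score: float


# Made-up sample rows for illustration only (not real admission records).
FACTS = [
    Admission(2005, "Khartoum", "Economics", "Male", "academic", 78.0),
    Admission(2005, "Khartoum", "Economics", "Female", "academic", 80.0),
    Admission(2006, "OtherState", "Engineering", "Female", "academic", 91.0),
    Admission(2006, "Khartoum", "Engineering", "Male", "academic", 89.0),
]


def aggregate(facts, dims):
    """Group facts by the chosen dimensions.

    Returns a dict mapping each cell (a tuple of dimension values) to
    its Fact Admissions Count (FAC) and its Scores Average.
    """
    cells = defaultdict(list)
    for fact in facts:
        key = tuple(getattr(fact, dim) for dim in dims)
        cells[key].append(fact.score)
    return {
        key: {"FAC": len(scores), "scores_average": sum(scores) / len(scores)}
        for key, scores in cells.items()
    }


# Two dimensions kept: one cell per (year, faculty) pair.
by_year_faculty = aggregate(FACTS, ("year", "faculty"))

# "year" left out of the grouping: the date dimension is aggregated away.
by_faculty = aggregate(FACTS, ("faculty",))
```

Passing fewer dimensions to `aggregate` yields coarser cells, and passing more yields finer ones. A real cube would precompute such aggregates and respect the dimension hierarchies, which this sketch does not attempt.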
Querying SCS by end user is next.

Figure 2. Implementation of the developed cube for prototyping SCS

IV. QUERYING SCS

Users may query OLAP by choosing the dimensions they want to examine or analyze. For example, the user can choose (Sudan) as the “country”, (2005) as the “date”, and (FAC) as the “measure”. These selections are shown in figure 3. The upper part of the window in the figure contains a set of pull-down menus for selecting the desired input dimensions. The bottom part is where the outcome of the chosen measure for the selected dimensions is displayed. For “FAC” as the measure, along with “date” and “location” as dimensions in the input pane, the count, in thousands, of students who were accepted is shown as 4 in the output pane. A user may move down the hierarchy along the “location” dimension by choosing (clicking on) “Middle” as the province and “Khartoum” as the state. The system will then show a count of 1 instead of 4, indicating that, for the chosen “date” and “location” dimensions, only 1000 students were accepted. In the same manner, users may drill down any dimension to retrieve their targeted data.

Figure 3. Example display of results of a two-dimensional query based on count as the only measure

As an example, figure 4 shows the “average” as the measure for the same dimensions as in figure 3. The output pane in figure 4 shows the selected dimensions along with the resulting measure. The latter is the average intake score and is equal to 76.34. That is, in the year “date = 2005”, the “average score” was 76.34 for “Male” students from “academic” high schools in “Sudan” who were accepted by the “University of Khartoum” to study in the college of “Economics”. If, among all the dimensions in the figure, only the “date” is rolled up to include all the years in the “date” dimension (from 2005 to 2009), one can easily obtain the average score for students who were enrolled in the college of “Economics” during the specified time range.

Figure 4. Example of a three-dimensional query based on average score as the only measure

The above queries are not intended to demonstrate OLAP and its ability to analyze huge numbers of records statistically. They only demonstrate the implementation of OLAP operations on the SCS prototype. The choice of the “average score” is justified because, on average, only half of the students who enroll in the said university in any given year graduate on schedule. The rest repeat a year or two, voluntarily drop out, or are forced by the university to withdraw because of poor academic standing. Guardians or enrollment officers may use SCS to advise students on the basis of their scores. That is, a student whose score is just below the average for a specific major may be advised to enroll in another major for which his/her score is above the average. This is possible because scores for different majors are computed from different combinations of subjects. Mining SCS is next.

V. MINING SCS

Two OLAP operations are demonstrated in this section. The first is “dicing” and the second is “drilling down”. Results of these two operations are shown in Table I and Table II respectively. Table I represents SCS’s response to a query that involved “date” and “faculty” as dimensions to measure the “average scores”, rounded to the nearest integer. A quick inspection of these averages shows “Engineering” as the only faculty/college with a monotonically increasing “average score”. Fortunately, this table hides interesting information. Upon drilling down “Engineering” via the “school type” dimension, SCS produced the data in Table II. From this table, one may see that the “female” and the “coed” schools had better averages than “male”-only schools. How useful this information is depends on the imagination of the one who inspects it. One relationship that can be implied here is:

$\text{Engineering} \wedge \text{Female} \Rightarrow \text{Score} > 92 \qquad (1)$

TABLE I: AVERAGE SCORES FOR 5 COLLEGES IN 3 YEARS

Year   Medicine   Engineering   Economics   Science
2005   92         89            91          87
2006   93         91            90          86
2007   92         92            91          86

TABLE II: AVERAGES FOR “ENGINEERING” ENROLLMENTS BASED ON “SCHOOL TYPE”

Engineering (first three years)
School type   2005   2006   2007
Female        90     91     93
Male          87     90     90
Coed          90     92     92

Support and confidence values are not included for Equation (1). The equation implies that an “x” increase in the count of female students with “average score=92” would displace “x” male students to a different college if the total Engineering enrollment is to remain fixed. To put it differently, had Engineering decided in the year 2007 to enroll only students who scored 92 or more, the ratio between males and females for that year would have been, at best, 1 male for every 4 females. Such a hypothesis assumes that equal numbers of males and females compete for admission to the Faculty of Engineering.

VI. CONCLUSION

A prototype for a scouting system for student enrollment has been suggested. It provides college officials with the statistics necessary for making enrollment decisions. The same system can be used to match students with the majors that best fit them. Such usage will reduce the number of students who change majors or fail to complete their college education due to hasty selection of majors.

REFERENCES

[1] Mohammad A. Rob, Michael E. Ellis, 2007, “Case Projects in Data Warehousing and Data Mining”, University of Houston, Clear Lake, Volume VIII No. 1. Available from: http://iacis.org/iis/2007/Rob_Ellis.pdf [downloaded 15 March 2011].
[2] Antoaneta Ivanova, Boris Rachev, 2004, “Multidimensional models - Constructing Data Cube”, International Conference on Computer Systems and Technologies - CompSysTech. Available from: http://ecet.ecs.ru.acad.bg/cst04/Docs/sV/55.pdf [downloaded 24 April 2013].
[3] Jiawei Han, Micheline Kamber, 2006, “Data Mining: Concept And Techniques”, Morgan Kaufmann, San Francisco, CA 94111.
[4] Jigna J. Jadav, Mahesh Panchal, 2012, “Association Rule Mining Method On OLAP Cube”, International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, www.ijera.com, Vol. 2, Issue 2, Mar-Apr 2012, pp. 1147-1151, India. Available from: http://www.ijera.com/papers/Vol2_issue2/GL2211471151.pdf [downloaded 24 April 2013].
[5] Scott Cameron, March 2009, “Microsoft SQL Server 2008 Analysis Services - Step by Step”, Hitachi Consulting, Microsoft Press, A Division of Microsoft Corporation, One Microsoft Way, Redmond, Washington.
[6] Galina Bogdanova, Tsvetanka Georgieva, 2005, “Discovering the Association Rules in OLAP Data Cube with Daily Downloads of Folklore Materials”, International Conference on Computer Systems and Technologies - CompSysTech. Available from: http://ecet.ecs.ru.acad.bg/cst05/Docs/cp/SIII/IIIB.23.pdf [visited 24 April 2013].
[7] Riadh Ben Messaoud, Omar Boussaid, Sabine Loudcher Rabaséda, Rokia Missaoui, “Enhanced mining of association rules from data cubes”, Proceedings of the 9th ACM International Workshop on Data Warehousing and OLAP, 2006. Available from: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.104.41 [downloaded 4/4/2013].
[8] Zlatko J. Kovačić, “Early prediction of student success: Mining students' enrolment data”, Proceedings of Informing Science & IT Education Conference (InSITE), pp. 647-665, Southern Italy, June 19-24, 2010.
[9] Al-Radaideh, Q. A., Al-Shawakfa, E. M., & Al-Najjar, M. I., “Mining student data using decision trees”, Proceedings of the 2006 International Arab Conference on Information Technology.
[10] M. Ramaswami, R. Bhaskaran, “A CHAID Based Performance Prediction Model in Educational Data Mining”, International Journal of Computer Science Issues (IJCSI), Vol. 7, Issue 1, No. 1, January 2010, pp. 10-18.
[11] César Vialardi, Jorge Chue, Alfredo Barrientos, Daniel Victoria, Jhonny Estrella, Juan Pablo Peche and Álvaro Ortigosa, “A Case Study: Data Mining Applied to Student Enrollment”. Available from: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.174.6124&rep=rep1&type=pdf [visited 1/1/2014].
[12] Monika Goyal and Rajan Vohra, “Applications of Data Mining in Higher Education”, IJCSI International Journal of Computer Science Issues, Vol. 9, Issue 2, No. 1, March 2012, pp. 113-120.