2014 UKSim-AMSS 16th International Conference on Computer Modelling and Simulation
A Prototype for a Data Mining Based Pathfinder to Sudanese Universities
Eltayeb Abuelyaman
Department of Computer Science
University of Nizwa
Nizwa, Oman
[email protected]

Atifa Elgimari
School of Management Studies
Ahfad University for Women
Omdurman, Sudan
[email protected]
978-1-4799-4923-6/14 $31.00 © 2014 IEEE
DOI 10.1109/UKSim.2014.65
Abstract—Recent data mining innovations in college
education include novel techniques for guided selection of
courses, predictions of grades, and predictions of success in
fulfilling graduation requirements. Attempts have also been
made to discover associations among students with sharp
learning curves in an effort to help slow learners. This paper
suggests a customizable enrollment system based on analysis
of multidimensional data stores using online analytical mining
techniques. Such a system will empower guardians and
enrollment officers with hidden information that can be used
in recommending majors of study to students. Unguided
selection of majors is among the root causes of two problems
students face: dropping out of college and changing majors.
The proposed system is expected to reduce both. The system
will also help colleges plan enrollments well.
Keywords-Data mining, On-Line Analytical Processing,
relational, enrollment, higher education.
I. INTRODUCTION
Some data mining techniques are based on data cubes,
the construction of which is a major step in the design of a
data warehouse. Efficient processing of the data cubes is
necessary for data analysis. In reference [1], Rob and
Ellis offered a great deal of theoretical knowledge, as well
as practical experience with data warehousing tools and
On-Line Analytical Processing (OLAP) techniques. In
reference [2], Ivanova and Rachev introduced a different
approach for constructing data cubes. Their approach
enables users to view aggregated data from multiple
angles. Conceptually, the authors described the components
of a cube as a base cuboid surrounded by a collection of
sub-cubes. The sub-cubes are then used to compute
aggregations of the base cuboid across one or more
dimensions.
According to references [3-5], the three main storage
modes for implementing OLAP operations are Relational
OLAP (ROLAP), Multidimensional OLAP (MOLAP),
and Hybrid OLAP (HOLAP). Significant work has been
achieved using these storage modes, as reported in [3]. For
example, a criterion was given in the aforesaid references
for selecting the storage mode based on the data type and
for selecting the physical locations within the data warehouse
architecture. The references also suggested blueprints
for assessing both the effectiveness of the modes and that
of further enhancements thereof by the research
community.
On the mining side, several studies addressed
limitations of OLAP technology, which lacked the
intelligence necessary for concluding relationships. Data
mining techniques such as classification and
association rule mining have since been blended with OLAP
techniques. This assimilation resulted in the On-Line
Analytical Mining (OLAM) technology (also called
OLAP mining). Subsequently, several researchers
leveraged the power of OLAM in different areas. The
mining of association rules over data cubes is a good
example [4]. In reference [6], Bogdanova and Georgieva
complemented the work in [4] by using OLAM
technology in mining association rules via a web-based
client/server system. The coupling of OLAP and the
association rules technique led to the discovery of
interesting correlations among OLAP data cubes [4,6].
Other researchers focused on using OLAM to improve the
mining process of association rules via data cubes. In
reference [7], Ben Messaoud et al. proposed a general
framework for mining inter-dimensional association rules
via data cubes. The authors deduced rules from the objectives
of the analysis to optimize support and confidence
parameters using aggregated measures.
Some of the recent technological innovations data
mining brought about include mining education data.
Ironically, the use of data mining techniques in the domain of
its own fostering mother, the education community,
falls far short of expectations compared with, for example,
its commercial uses. Fortunately, the imbalance is gradually
changing. Recent uses of data mining techniques in
post-high-school education include attempts to improve
students’ performance in classrooms [8-12]. M. Goyal and
R. Vohra went beyond the classroom to discuss
improving students’ life cycle management, their
selection of courses, their retention rate and the management
of their financial support [12]. In reference [8], Zlatko J.
Kovačić used Chi-squared Automatic Interaction
Detection (CHAID) to classify students using “success”
as the dependent variable. With the use of only
“ethnicity”, “course program” and “course block” as
predictors, the accuracy he was able to achieve was low,
according to the author himself. However, the author did
set out a strategy for handling the limitations he
identified. For a similar goal, Ramaswami and
Bhaskaran used CHAID to determine the primary root
causes of slow learning among students [10]. The authors
used probability to reduce their set of 16 predictor
variables down to the 7 they considered to be the most
influential. They concluded that the “high schools”,
“geography” and “socio-economic statuses” of families
are the most influential factors in the sharpness of
learning curves for students. In reference [11], Vialardi
et al. described the generation of domain-specific
variables that are capable of representing students’ past
performances. They then analyzed the domain variables
for the purpose of suggesting educated course selections
to avoid misguided enrollment decisions. The
domain variables they chose are the “degree of difficulty”
of a course and the student’s “potential”. They based the
“degree of difficulty” of a course on the average grade of
all students who took the course in previous semesters,
and the student’s “potential” in the course on his/her
Cumulative Grade Point Average (CGPA). However, the
degree of difficulty students encounter in a course is a
function of several variables, and their potential to do well
in the course is not easily predictable from their CGPA
alone. Moreover, in many countries in the Middle East and
the Orient, students do not have a choice of which courses to
take once they choose a specific major.
Another point of concern here is that the predictor
variables in [9-14] are from three different cultures: the
Occident, the Orient and the Middle East. Ironically, some
of the primary predictor variables in one culture are
secondary in others and vice versa. Thus, a system
of predictor variables that nicely fits the US may not hold
so well for the UK.
The objective of this paper is to provide a framework
for a system that would help students, their guardians and
centralized/distributed enrollment offices in predicting
majors that best fit students’ backgrounds. Unfortunately,
no reliable statistics are available to the authors on
students who changed, in the middle of their course of
study, majors they had unwisely chosen for themselves. The
authors also have no data on the percentages of students
who dropped out upon failing to cope with the requirements
of majors they hastily chose for themselves or were forced
to choose by their guardians. With the availability of a
user-friendly, publicly accessible career predictor, most of
the aforesaid problems could be either eliminated or reduced.
A prototype for a solution is the Scout for College
Students (SCS) system the authors propose in the next
section. A suggested implementation for SCS is outlined
in Section III. Sections IV and V are about querying and
mining SCS, respectively. The conclusion is
drawn in Section VI.
II. A PROTOTYPE FOR SCS
The SCS is an anytime, anywhere customizable
scouting system for college students. It allows users to
choose from a list of predictor variables, add their own, or
accept the default set SCS suggests. One challenge that
needs more attention is the identification and sorting of a
standard set of global predictor variables into the primary and
secondary classes necessary for globalizing SCS. Such a set
will enable predictions of inter-country enrollments.
For the purpose of prototyping, the selected predictor
variables are compatible with Sudanese education
requirements. The prototype is intended to predict majors
for students who plan to enroll in Sudanese higher
education institutions. As such, it will not be as accurate
when used for students who graduated from high schools
in other countries but plan to enroll in Sudanese colleges
and universities.
The predictors for the prototype are a set of key
variables that identify the residential and educational
background of students. Classification of predictor
variables into primary and secondary will
not be necessary for prototyping. The set of predictor variables includes
geography, socioeconomic status, high school, gender,
and education setup (single-sex vs. coed).
The prototype is tested using a dataset collected from
the Sudanese Ministry of Higher Education and Scientific
Research. The dataset covers records of students admitted
to Sudanese universities during the period 2005-2009.
The On-Line Analytical Processing (OLAP) technique
was chosen for analysis of the data. OLAP is known for
its powerful data mining techniques. It enables users to
perform statistical operations on substructures of any
multidimensional data structure. Implementation of the
prototype is next.
III. SUGGESTED IMPLEMENTATION FOR SCS
The following predictor variables will henceforth
be called dimensions, for consistency with the terminology
of OLAP and multidimensional structures:
· Student’s Location: represents the location of the high
school the student attended. Its hierarchy is (Country -
Province - State).
· Higher education: represents the university to which a
student applies. Its hierarchy is (University - Faculty -
Department - Program Type).
· Student’s sex: Female vs. male.
· Student’s High School: represents the type of high
school where the student completed their education and took
the national standard certificate examination. Types are
filtered based on: Major (such as academic vs.
agricultural); Category (public vs. private); Designation
(female, male, or coed); Category (regular vs. home
student).
· Date: (2005 - 2009).
The measures chosen for this project include:
· Fact Admissions Count (FAC): This is a function that
returns the count of admissions.
· Percentage of Selections: Enrollment application
forms offer students 40 different colleges to select from.
· Scores Average: The average intake score.
Upon concluding the design of the dimension tables, their
hierarchies, and the fact measures, a snowflake schema
was chosen for the prototype’s data warehouse. The
design of the snowflake schema is shown in figure 1 below.
Figure 1. Multidimensional snowflake warehouse for prototyping SCS
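As a rough illustration of what a snowflake layout implies for the location dimension, the following Python sketch normalizes the (Country - Province - State) hierarchy into separate tables linked by keys, with a fact row pointing at the lowest level. It is not the schema of figure 1: all class names, field names, identifiers and the sample score are assumptions made for illustration; only the hierarchy levels and the names Sudan, Middle and Khartoum come from the text.

```python
from dataclasses import dataclass

# Each level of the location hierarchy lives in its own table (snowflaking).
# All names and keys below are hypothetical, chosen only for illustration.
@dataclass(frozen=True)
class Country:
    country_id: int
    name: str

@dataclass(frozen=True)
class Province:
    province_id: int
    name: str
    country_id: int  # foreign key to Country

@dataclass(frozen=True)
class State:
    state_id: int
    name: str
    province_id: int  # foreign key to Province

@dataclass(frozen=True)
class AdmissionFact:
    state_id: int  # foreign key into the location snowflake
    year: int      # date dimension (2005-2009 in the prototype)
    sex: str
    score: float   # intake score; feeds the Scores Average measure

def location_path(fact, states, provinces, countries):
    """Walk the normalized tables from a fact row up to its country."""
    state = states[fact.state_id]
    province = provinces[state.province_id]
    country = countries[province.country_id]
    return (country.name, province.name, state.name)

countries = {1: Country(1, "Sudan")}
provinces = {10: Province(10, "Middle", 1)}
states = {100: State(100, "Khartoum", 10)}
fact = AdmissionFact(state_id=100, year=2005, sex="Male", score=76.0)  # made-up row

print(location_path(fact, states, provinces, countries))
```

The point of the sketch is only that a fact row stores one key per dimension, and the hierarchy levels are recovered by following keys through the normalized tables.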
The proposed snowflake data warehouse is used as a
base for developing a cube. Figure 2 shows the
implementation of the cube, where selected dimensions
have been added to the cube browser. High school
majors are measured by Scores Average Percent across
the date dimension, which covers the years 2005-2009.
Unfortunately, such reports have not been designed
for end users and cannot be deployed as a stand-alone
application. SCS enables users to mine the data. Querying
SCS by end users is next.
Figure 2. Implementation of the developed cube for prototyping SCS
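The cube view described above, high school major measured by average score across the date dimension, can be approximated with a small aggregation. This is a minimal sketch under stated assumptions, not the prototype's cube implementation: the rows are invented, and the function names `build_cube` and `scores_average` are hypothetical.

```python
from collections import defaultdict

# Hypothetical fact rows: (high-school major, year, intake score).
rows = [
    ("academic", 2005, 80.0),
    ("academic", 2005, 90.0),
    ("academic", 2006, 70.0),
    ("agricultural", 2005, 60.0),
]

def build_cube(rows):
    """Aggregate the admission count and score total for every (major, year) cell."""
    cells = defaultdict(lambda: {"count": 0, "total": 0.0})
    for major, year, score in rows:
        cell = cells[(major, year)]
        cell["count"] += 1
        cell["total"] += score
    return cells

def scores_average(cells, major, year):
    """Average-score measure for one cell, or None if the cell is empty."""
    cell = cells.get((major, year))
    if cell is None:
        return None
    return cell["total"] / cell["count"]

cube = build_cube(rows)
print(scores_average(cube, "academic", 2005))
```

Storing the count and the total per cell, rather than the average itself, keeps the cells additive, so coarser aggregates can be derived later without revisiting the raw rows.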
IV. QUERYING SCS
Users may query OLAP by choosing the dimensions
they want to examine or analyze. For example, the user can
choose (Sudan) as the “country”, (2005) as the “date”,
and (FAC) as the “measure”. These selections are shown
in figure 3. The upper part of the window in the figure
contains a set of pull-down menus for selecting the
desired input dimensions. The bottom part is where the
outcome of the chosen measure for the selected
dimensions is displayed. For “FAC” as the measure, along
with “date” and “location” as dimensions in the input
pane, the count, in thousands, of students who were
accepted is shown as 4 in the output pane.
A user may move down the hierarchy along the
“location” dimension by choosing (clicking on) “Middle”
as the province and “Khartoum” as the state. The system
will then show a count of 1 instead of 4,
indicating that, for the chosen “date” and “location”
dimensions, only 1000 students were accepted.
In the same manner, users may drill down any
dimension to retrieve their targeted data. As an example,
figure 4 shows the “average” as the measure for the same
dimensions in figure 3. The output pane in figure 4 shows
the selected dimensions along with the resulting measure.
The latter is the average intake score and is equal to 76.34.
That is, in the year 2005 (“date = 2005”), the “average score” was
76.34 for “Male” students from “academic”
high schools in “Sudan” who were accepted by the
“University of Khartoum” to study in the college of
“Economics”. If, among all the dimensions in the figure,
only the “date” is rolled up to include all the years in the
“date” dimension (from 2005 to 2009), one can easily
obtain the average score for students who were enrolled in
the college of “Economics” during the specified time
range. The above queries are not intended to demonstrate
OLAP and its ability to analyze huge records statistically.
They only demonstrate the implementation of OLAP operations
on the SCS prototype. The choice of the “average score”
is justified because, on average, only half of the
students who enroll in the said university in any given
year graduate on schedule. The rest repeat a year or two,
voluntarily drop out, or are forced by the university to
withdraw because of poor academic standing.
Guardians or enrollment officers may use SCS to advise
students on the basis of their scores. That is, a student
whose score is right below the average for a specific
major may be recommended to enroll in another for which
his/her score is above the average. Such is the case
because scores for different majors are computed from
different combinations of subjects. Mining SCS is next.
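The score-based advice described in this section, suggesting majors for which a student's score is above the average, could be sketched as below. This is an illustrative assumption rather than a feature of the prototype: the function name `recommend_majors` and the student's scores are hypothetical, and the averages are simply the 2007 values reported for three colleges in Table I.

```python
def recommend_majors(student_scores, major_averages):
    """Return (major, margin) pairs where the student's score exceeds the average.

    student_scores: major -> the student's score as computed for that major
    major_averages: major -> average intake score for that major
    """
    suggestions = []
    for major, average in major_averages.items():
        score = student_scores.get(major)
        if score is not None and score > average:
            suggestions.append((major, round(score - average, 2)))
    # Largest margin above the average first.
    return sorted(suggestions, key=lambda item: item[1], reverse=True)

# 2007 averages from Table I; the student's per-major scores are made up.
major_averages = {"Engineering": 92.0, "Economics": 91.0, "Science": 86.0}
student_scores = {"Engineering": 91.0, "Economics": 91.5, "Science": 90.0}

print(recommend_majors(student_scores, major_averages))
```

The student's score is kept per major because, as noted above, scores for different majors are computed from different combinations of subjects.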
Figure 3. Example display of results of a two-dimensional query based on count as the only measure.
Figure 4. Example of a three-dimensional query based on average score as the only measure.
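The queries behind figures 3 and 4 amount to selecting a cell, drilling down along the location hierarchy, and rolling up the date dimension. The following is a minimal sketch of those three operations over an in-memory list; the records, the extra province and state names, and the function names `select`, `fac` and `average_score` are assumptions for illustration and do not reproduce the prototype's data or its OLAP engine.

```python
# Hypothetical admission records; field names mirror the paper's dimensions.
records = [
    {"country": "Sudan", "province": "Middle", "state": "Khartoum", "year": 2005, "score": 80.0},
    {"country": "Sudan", "province": "Middle", "state": "State B", "year": 2005, "score": 70.0},
    {"country": "Sudan", "province": "Province C", "state": "State D", "year": 2005, "score": 75.0},
    {"country": "Sudan", "province": "Middle", "state": "Khartoum", "year": 2006, "score": 90.0},
]

def select(records, **filters):
    """Keep the records that match every dimension=value filter."""
    return [r for r in records if all(r[k] == v for k, v in filters.items())]

def fac(records, **filters):
    """Fact Admissions Count for the selected cell."""
    return len(select(records, **filters))

def average_score(records, **filters):
    """Average intake score for the selected cell, or None if it is empty."""
    chosen = select(records, **filters)
    if not chosen:
        return None
    return sum(r["score"] for r in chosen) / len(chosen)

# Country-level count for one year.
print(fac(records, country="Sudan", year=2005))
# Drill down the location hierarchy to province and state.
print(fac(records, country="Sudan", province="Middle", state="Khartoum", year=2005))
# Roll up the date dimension by dropping the year filter.
print(average_score(records, country="Sudan"))
```

Drilling down adds filters on finer hierarchy levels, while rolling up removes a filter so that the measure is aggregated over every value of that dimension.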
V. MINING SCS
Two OLAP operations are demonstrated in this
section. The first is “dicing” and the second is “drilling
down”. Results of these two operations are shown in
Table I and Table II, respectively.
Table I represents SCS’s response to a query that involved
“date” and “faculty” as dimensions to measure the
“average scores”, rounded to the nearest integer.
A quick inspection of these averages shows
“Engineering” as the only faculty/college with a
monotonically increasing “average score”.

TABLE I: AVERAGE SCORES FOR 5 COLLEGES IN 3 YEARS
        Medicine   Engineering   Economics   Science
2005    92         89            91          87
2006    93         91            90          86
2007    92         92            91          86

Fortunately, this table hides interesting information.
Upon drilling down “Engineering” via the “school type”
dimension, SCS produced the data in Table II. From this
table, one may see that the “female” and the “coed”
schools had better averages than “male”-only schools.
How useful this information is depends on the
imagination of the one who inspects it. One relationship
that can be implied here is:

$$\text{Engineering} \wedge \text{Female} \Rightarrow \text{Score} > 92 \qquad (1)$$

Support and confidence values are not included for
Equation (1). The equation infers that an “x” increase in
the count of female students with “average score = 92”
would displace “x” male students to a different college if
the total Engineering enrollment is to remain fixed. To put
it in a different way, had Engineering decided in the year
2007 to enroll only students who scored 92 or more, the
ratio between males and females for that year would
have been, at best, 1 male for every 4 females. Such a
hypothesis assumes that equal numbers of males and females
compete for admission to the faculty of Engineering.

TABLE II: AVERAGES FOR “ENGINEERING” ENROLLMENTS BASED
ON “SCHOOL TYPE”
Engineering   First three years
              2005   2006   2007
Female        90     91     93
Male          87     90     90
Coed          90     92     92
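Since support and confidence are not reported for Equation (1), the following sketch only shows how the two quantities are conventionally defined for a rule of that shape: support is the fraction of all records that satisfy both sides, and confidence is the fraction of records satisfying the left side that also satisfy the right side. The records are invented for illustration and the function name is hypothetical; no value computed here reflects the paper's dataset.

```python
def support_and_confidence(records, antecedent, consequent):
    """Support and confidence of the rule antecedent => consequent.

    support    = share of all records matching antecedent and consequent
    confidence = share of antecedent-matching records that match consequent
    """
    total = len(records)
    matching_antecedent = [r for r in records if antecedent(r)]
    matching_both = [r for r in matching_antecedent if consequent(r)]
    support = len(matching_both) / total if total else 0.0
    confidence = (len(matching_both) / len(matching_antecedent)
                  if matching_antecedent else 0.0)
    return support, confidence

# Made-up admission records, not taken from the paper's dataset.
records = [
    {"faculty": "Engineering", "school_type": "Female", "score": 94},
    {"faculty": "Engineering", "school_type": "Female", "score": 93},
    {"faculty": "Engineering", "school_type": "Female", "score": 91},
    {"faculty": "Engineering", "school_type": "Male", "score": 95},
    {"faculty": "Medicine", "school_type": "Female", "score": 96},
]

support, confidence = support_and_confidence(
    records,
    antecedent=lambda r: r["faculty"] == "Engineering" and r["school_type"] == "Female",
    consequent=lambda r: r["score"] > 92,
)
print(support, confidence)
```

A rule is usually reported only when both values clear chosen thresholds, which is why their absence limits how strongly Equation (1) can be interpreted.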
VI. CONCLUSION
A prototype for a scouting system for student
enrollment has been suggested. It provides college
officials with the statistics necessary for making enrollment
decisions. The same system can be used to match students
with the majors that best fit them. Such usage is expected to
reduce the number of students who change majors or fail to
complete their college education due to hasty selection of majors.
REFERENCES
[1] Mohammad A. Rob, Michael E. Ellis, 2007, “Case Projects in
Data Warehousing and Data Mining”, University of Houston,
Clear Lake, Volume VIII, No. 1. Available from:
http://iacis.org/iis/2007/Rob_Ellis.pdf, [downloaded 15 March
2011].
[2] Antoaneta Ivanova, Boris Rachev, 2004, “Multidimensional models
- Constructing Data Cube”, International Conference on Computer
Systems and Technologies - CompSysTech. Available from:
http://ecet.ecs.ru.acad.bg/cst04/Docs/sV/55.pdf, [downloaded 24
April 2013].
[3] Jiawei Han, Micheline Kamber, 2006, “Data Mining: Concepts
and Techniques”, Morgan Kaufmann, San Francisco, CA 94111.
[4] Jigna J. Jadav, Mahesh Panchal, 2012, “Association Rule Mining
Method On OLAP Cube”, International Journal of Engineering
Research and Applications (IJERA), ISSN: 2248-9622,
www.ijera.com, Vol. 2, Issue 2, Mar-Apr 2012, pp. 1147-1151.
India. Available from:
http://www.ijera.com/papers/Vol2_issue2/GL2211471151.pdf,
[downloaded 24 April 2013].
[5] Scott Cameron, March 2009, “Microsoft SQL Server 2008
Analysis Services - Step by Step”, Hitachi Consulting, Microsoft
Press - A Division of Microsoft Corporation - One Microsoft Way,
Redmond, Washington.
[6] Galina Bogdanova, Tsvetanka Georgieva, 2005, “Discovering the
Association Rules in OLAP Data Cube with Daily Downloads of
Folklore Materials”, International Conference on Computer
Systems and Technologies - CompSysTech. Available from:
http://ecet.ecs.ru.acad.bg/cst05/Docs/cp/SIII/IIIB.23.pdf, [visited
24 April 2013].
[7] Riadh Ben Messaoud, Omar Boussaid, Sabine Loudcher Rabaséda,
Rokia Missaoui, “Enhanced mining of association rules from data
cubes”, Proceedings of the 9th ACM International Workshop on
Data Warehousing and OLAP, 2006. Available from:
http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.104.41,
[downloaded 4/4/2013].
[8] Zlatko J. Kovačić, “Early prediction of student success: Mining
students' enrolment data”, Proceedings of Informing Science & IT
Education Conference (InSITE), pp. 647-665, Southern Italy, June
19-24, 2010.
[9] Al-Radaideh, Q. A., Al-Shawakfa, E. M., & Al-Najjar, M. I.,
“Mining student data using decision trees”, Proceedings of the
2006 International Arab Conference on Information Technology.
[10] M. Ramaswami, R. Bhaskaran, “A CHAID Based Performance
Prediction Model in Educational Data Mining”, International
Journal of Computer Science Issues (IJCSI), Vol. 7, Issue
1, No. 1, January 2010, pp. 10-18.
[11] César Vialardi, Jorge Chue, Alfredo Barrientos, Daniel Victoria,
Jhonny Estrella, Juan Pablo Peche and Álvaro Ortigosa, “A Case
Study: Data Mining Applied to Student Enrollment”. Available from:
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.174.6124&rep=rep1&type=pdf
[visited 1/1/2014].
[12] Monika Goyal and Rajan Vohra, “Applications of Data Mining in
Higher Education”, IJCSI International Journal of Computer
Science Issues, Vol. 9, Issue 2, No. 1, March 2012, pp. 113-120.