Download International Journal on Advanced Computer Theory and

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
International Journal on Advanced Computer Theory and Engineering (IJACTE)
________________________________________________________________________
MINING THE FACTORS AFFECTING THE HIGH SCHOOL
DROPOUTS IN RURAL AREAS
1
P. Sunil Kumar, 2Ashok Kumar Panda, 3D. Jena
1,2
Assistant Professor, 3Associate Professor,
IISIT, BPUT, Bhubaneswar, Odisha, India
Email : [email protected]
attendance and retention rates, high teacher absenteeism,
irregular classes, poor teaching standards and other
related issues. The government needs to shift its focus on
increasing enrolment rates and also reducing school
drop-outs in rural areas which is also a significant
problem [24].
Abstract -Government of India has taken number of
initiatives in the last decade for retention of school children
at primary level. All those initiatives, although successful at
that level of education, when it comes to secondary
education, the scenario is deplorable. The number of drop
outs during secondary education is mainly due to three
major factors namely Poverty, disinterestedness and
Teaching environment. This paper is based on a survey
work which proposes to apply association rule mining
measures like support confidence and other interesting
measures on these major factors of high school dropouts to
understand the problem in a better way and to have a
proper planning for retention of students .
Data mining is finding hidden patterns in a large
collection of data. Data Mining can be used in
educational field to enhance our understanding of
learning process to focus on identifying, extracting and
evaluating Variables related to the learning process of
students as described by Alaa el-Halees [2].
Keywords: Data Mining, Association Rule, Aprioi, EDM,
Interesting Measures, Higher Secondary Education.
In this paper it is tried to find out the association of
various factors leading to student dropouts at secondary
level of education in rural areas.
I. INTRODUCTION
II. BACKGROUND AND RELATED WORK
Recent years have shown a growing interest and concern
in many countries about the problem of school failure
and the determination of its main contributing factors.
This problem is known as the “the one hundred factors
problem” and a great deal of research has been done on
identifying the factors that affect the low performance of
students (school failure and dropout) at different
educational levels (primary, secondary and higher) as
described by Araque et al., 2009[1].
Educational data mining has emerged as an independent
research area in recent years, culminating in 2008 with
the establishment of the annual International Conference
on Educational Data Mining, and the Journal of
Educational Data Mining. Romero and Ventura [21]
provides a comprehensive study of EDM from 1995
to 2005. It describes the need for analyzing the student
data which can be used by students, educators and
administrators.
A study titled “Evolution of Indian Rural Markets”
made by the Associated Chambers of Commerce and
Industry of India (ASSOCHAM) has revealed that,
during 2004-05 to 2009-10”, Odisha ranked lowest with
average rural household monthly expenditure on
education worth just about Rs.52. Although great strides
have been made in improving India’s education scenario,
much ground still needs to be covered as our education
system is still plagued by low enrolment rates, lower
There are examples about how to apply EDM techniques
for predicting drop out and school failure as described by
Kotsiantis, 2009[3].
Shaeela Ayesha, Tasleem Mustafa, Ahsan Raza Sattar,
and M. Inayat Khan [13] used data mining technique
named k-means clustering applied to analyze student’s
learning behavior that will help the teachers to reduce
________________________________________________________________________
ISSN (Print): 2278-5140, Volume-2, Issue – 3, 2013
1
International Journal of Advanced Computer Engineering and Communication Technology (IJACECT)
________________________________________________________________________
the drop out ratio to a significant level and improve the
performance of Students.
J.F.Superby et al. [23] to classify 533 first-year
university students into three groups: the low-risk, the
medium-risk, and the high-risk students (high probability
of dropping out)
D'Mello et al. [8] studied on students who are bored or
frustrated. Dekker et al. [7] Romero et al. [22]; Superby
et al. [23] found factors that predict student failure or
non-retention in college courses.
Dropout predicting in e-learning has been studied by
Lykorentzou I., Giannoukos I., Nikopoulos V., Mpardis
G. AND Loumos V. 2009[6].
As well as M. Jadrić et al. [14] to analyze the problem
of students drop out in the higher education by using the
datamining methods and also a model suggested.
III. ASSOCIATION RULE MINING
Data Mining is the discovery of hidden information
found in databases [4] [18]. Dunham [9] categorized
various models and tasks of data mining into two groups:
predictive and descriptive. One o f
the m o s t
significant descriptive data mining applications is that
of mining association rules. Introduced in 1993 [15],
used extensively in marketing and retail communities
in addition to many other diverse fields [17]. Association
rule mining is one of the important technique which
aims at extracting, interesting correlation, frequent
patterns, associations or casual structures among set of
items in the transaction databases or other data
mining repositories [17].
A formal statement of the association rule problem is as
follows:
Definition: [16] [12] Let I = {i1, i2… im} be a set of m
distinct attributes. Let D be a database, where each
record (tuple) T has a unique identifier, and contains a
set of items such that T⊆I. An association rule is an
implication of the form of X⇒Y, where X,Y ⊆ I are sets
of items called itemsets, and X∩Y=ϕ. Here, X is called
antecedent while Y i s c a l l e d c o n s e q u e n t .
Association r u l e s c a n b e c l a s s i f i e d based on
the type of vales, dimensions of data, and levels of
abstractions involved in the rule. If a rule concerns
associations between the presence or absence of items, it
is called Boolean association rule. And the dataset
consisting of attributes which can assume only binary (0absent, 1-present) values is called Boolean database.
itemsets. First, the set of frequent 1-itemsets is found by
scanning the database to accumulate the count for each
item, and collecting those items that satisfy minimum
support. The resulting set is denoted L1.Next, L1 is used
to find L2, the set of frequent 2-itemsets, which is used
to find L3, and so on, until no more frequent k-itemsets
can be found. The finding of each Lk requires one full
scan of the database.Lk-1 is used to find Lk for k _ 2. A
two-step process is followed, consisting of join and
prune actions.

The join step:
To find Lk, a set of candidate k-itemsets is generated by
joining Lk-1 with itself. This set of candidates is denoted
Ck.
Let l1 and l2 be itemsets in Lk-1.The notation li[ j] refers
to the jth item in li (e.g., l1[k-2] refers to the second to
the last item in l1).
By convention, Apriori assumes that items within a
transaction or item set are sorted in lexicographic order.
For the (k-1)-item set, li, this means that the items are
sorted such that li[1] < li[2] < : : : < li[k-1].
The join, Lk-1 ⋈ Lk-1, is performed, where members of
Lk-1 are joinable if their first (k-2) items are in common.
That is, members l1 and l2 of Lk-1 are joined if (l1[1] =
l2[1]) ^ (l1[2] = l2[2]) ^: : :^(l1[k-2] = l2[k-2]) ^(l1[k-1]
< l2[k-1]). The condition l1 [k-1] < l2 [k-1] simply
ensures that no duplicates are generated. The resulting
item set formed by joining l1 and l2 is l1[1], l1[2], : : : ,
l1[k-2], l1[k-1], l2[k-1].

The prune step:
Ck is a superset of Lk, that is, its members may or may
not be frequent, but all of the frequent k-itemsets are
included in Ck. As can of the database to determine the
count of each candidate in Ck would result in the
determination of Lk (i.e., all candidates having a count
no less than the minimum support count are frequent by
definition, and therefore belong to Lk). Ck, however, can
be huge, and so this could involve heavy computation.
To reduce the size of Ck, the Apriori property is used as
follows. Any (k-1)-item set that is not frequent cannot be
a subset of a frequent k-item set. Hence, if any (k-1)subset of a candidate k-item set is not in Lk-1, then the
candidate cannot be frequent either and so can be
removed from Ck.
V. DATA MINING TECHNIQUES USED IN
THIS PAPER
i)
Support(X)=P(X)
ii)
Confidence (X⇒Y)=
IV. APRIORI ALGORITHM
Apriori is a seminal algorithm proposed by R. Agrawal
and R. Srikant in 1994 for mining frequent itemsets for
Boolean association rules. The algorithm uses prior
knowledge of frequent item set properties. Apriori
employs an iterative approach known as a level-wise
search, where k-itemsets are used to explore (k+1)-
P(X,Y)
P(Y)
________________________________________________________________________
ISSN (Print): 2278-5140, Volume-2, Issue – 3, 2013
2
International Journal of Advanced Computer Engineering and Communication Technology (IJACECT)
________________________________________________________________________
iii)
Cosine(X⇒Y) =
P(X,Y)
iv)
Added Value(X⇒Y)= Confidence (XY)-P(Y)
v)
Lift((X⇒Y)=
vi)
vii)
Confidence(X⟹Y)
Correlation(X⇒Y)=
Conviction(X⇒Y)=
P(Y)
P(X,Y)−P(X)∗P(Y)
√P(X)∗P(Y)∗(1−P(X))∗(1−P(Y)
ii)
Confidence: The confidence or strength () for
an association rule X⇒Y is the ratio of the number
of transaction that contain X [9].
v)
Correlation: Correlation
is a symmetric
measure. A correlation around 0 indicates that X
and Y are not correlated. A negative figure
indicates that X and Y are negatively correlated,
and a positive figure indicates that they are
positively related. Note that the denominator of
the division is positive and smaller than 1.Inother
words, if the lift is around 1,correlation can still
be significantly different from 0[11]
vii)
Conviction: Conviction is not a symmetric
measure. A conviction around 1 says that X and
Y are independent; while conviction is infinite as
conf (X⇒Y) is tending to 1.Note that if P(Y) is
high then 1-P(Y) is small. In that case even if
conf(X, Y) is strong conviction may be small
[10].
(1-Confidence(X⟹Y)
Support: The support (s) for an association rule
X⇒Y is the percentage of transaction that contains
X∪Y [9].
iv)
vi)
1-P(Y)
i)
iii)
correlation between X and Y. A lift around 1 says
that P(X, Y) = P(X)*P(Y). In terms of
probability, this means that the occurrence of X
and the occurrence of Y in the same transaction
are independent events, hence X and Y not
correlated. It is easy to show that the lift is 1
exactly when added value is 0; the lift is greater
than 1 exactly when added value is positive
and the lift is below 1 exactly when added
value is negative [10, 11].
√P(X)*P(Y)
Cosine: Consider two vectors X and Y and the
angle they form when they are placed so that
their tails coincide. When this angle nears 0°,
then cosine nears 1, i.e. the two vectors are very
similar: all their coordinates are pair wise the
same. When this angle is 90 degree the vector are
perpendicular, the most dissimilar, and the cosine
is 0.The usual form that is given for cosine of
an association rule is X, Y. The closer t h e
cosine (X⇒Y) v a l u e to 1, the more transactions
containing item X also contain item Y, and vice
versa. On the contrary, the closer cosine (X⇒Y)
value to 0, the more transactions contain item X
without containing item Y, and vice versa. This
equality shows that transactions not containing
neither item X nor item Y have no influence on
the result of Cosine (X⇒Y). This is known as the
null-invariant property. Note also that cosine is a
symmetric measure [10, 11].
These measures are calculated on the test data.
VI. APPLICATION OF THE TECHNIQUES
In this study data was collected from students who
discontinued their studies from high schools in
Balianta block of Khurda District, Odisha. It is
necessary to mention that the current dropout rate for
the state of Odisha at secondary education is 49.5% [5].
These data are analyzed using Association rule mining to
find out the cause or causes behind their decision to
discontinue their education. . In order to apply this
following steps are performed in sequence:

Added value: The added value of the rule
X⇒Y is denoted by AV ( X⇒Y ) and measures
whether the proportion of transactions containing
Y among the transactions containing X is greater
than the proportion of transactions containing Y
among all transactions. Then, only if the
probability of finding item Y when item X has
been found is greater than the probability of
finding item Y at all can we say that X and Y are
associated and that X implies Y.A positive
number indicates that X and Y are related, while
a negative number means that the occurrence of
X prevents Y from occurring. Added Value is
closely related to another well-known measure of
interest, the lift [10].
Data set: The data set consisted of the following
variables related to each dropout student. All the
attributes are listed as follows:
Table 1: Data Set
Variable Description
Possible values
Sex
Boy or Girl
Factor 1
primary,
{Poverty,?}
secondary,
absolute,
Relative,
persistent
transient}as
explained by
Townsend and
Rowntree[19][20]
No-drinking{Teaching
Factor 2
Lift: A lift well above 1 indicates a strong
{boy, girl}
________________________________________________________________________
ISSN (Print): 2278-5140, Volume-2, Issue – 3, 2013
3
International Journal of Advanced Computer Engineering and Communication Technology (IJACECT)
________________________________________________________________________
water, No-Toilet, Environment,?}
No-Classroom,
Inadequateteacher, Poor
teaching
weak memory, {Disinterestedness
fear of study, ,?}
other
interest,
parent force
Table 2: Support analysis
Factor for dropout
The size of the data set is 80.
Poverty
Teaching Environment
Disinterestedness
(Poverty, Disinterestedness)
(poverty, Teaching Environment)
(Teaching Environment, Disinterestedness)
(Poverty,
Disinterestedness,Teaching
Environment)


Factor 3
Data selection and transformation:
The preprocessed data set in weka is shown in
figure-1
The dropouts are categorized into three categories
irrespective of sex i.e. Poverty, Teaching Environment,
and Disinterestedness. It has been seen that among the
students who are affected by one factor also affected by
other two factors. Following Venn diagram (fig. 2 )
shows the complete d r o p o u t picture of t h e students
due to different factors.
support
0.72
0.31
0.77
0.6
0.07
0.15
0.03
Measures and their analysis
The Apriori algorithm is implemented on the data set
using weka as shown in figure 3. The results obtained
like confidence, lift and conviction along with other
interesting measures are analyzed one by one.
Figure 3
Table 3 describes that the most of the students
(85%) who discontinued due to poverty are also
d i s i n t e r e s t e d because it has highest confidence.
74% of disinterested students also suffer due to
Poverty. 41 % of students who are not satisfied with
the Teaching Environment are also disinterested. Only
18 % of disinterested students have problems Teaching
Environment.
Table 3:Confidence Analysis
Factor for dropout
Poverty ⇒ Disinterestedness
Disinterestedness ⇒ Poverty
Teaching Environment
⇒ Disinterestedness
Disinterestedness
⇒ Teaching Environment
Figure 2
So, in this study it has been tried to find out the factors
affecting the dropouts by using the association rule
mining technique. It is also identified that more than
one factor affecting the student.
Confidence
0.85
0.74
0.41
0.15
Table 4 describes cosine analysis valve. It is a symmetric
analysis. It means two sets give same results in either
direction. In this research paper it shows the angular
value between two different factors affecting the school
dropout. Table shows that Poverty and disinterestedness
has lower angle (37.72) in comparison to Teaching
Environment and disinterestedness. It can be concluded
that students discontinuing due to Poverty and
disinterestedness have more similarity of than the
Teaching Environment and disinterestedness
On the basis of available data, it was decided to
calculate the support of each factor causing the dropout.
Following table 1 shows the support level for each
factor. It describes that higher percentage of dropouts is
because of the disinterestedness whose support value is
0.77.
________________________________________________________________________
ISSN (Print): 2278-5140, Volume-2, Issue – 3, 2013
4
International Journal of Advanced Computer Engineering and Communication Technology (IJACECT)
________________________________________________________________________
Table 4: Cosine Analysis
Factor for dropout
Poverty ⇒ Disinterestedness
Disinterestedness ⇒ Poverty
Teaching Environment
⇒ Disinterestedness
Disinterestedness
⇒ Teaching Environment
Table7: Correlation Analysis
Cosine
0.791
0.791
0.267
Angle
37.72
37.72
74.51
0.267
74.51
Factor for dropout
Poverty ⇒ Disinterestedness
Disinterestedness ⇒ Poverty
Teaching Environment
⇒ Disinterestedness
Disinterestedness
⇒ Teaching Environment
Table 5 shows the added value analysis. In this table
Poverty ⇒
Disinterestedness and Disinterestedness ⇒ Poverty
has positive number which shows that they are related
to each other. The presence of students with these dual
problems
cannot
prevent
one
another.
Teaching Environment ⇒
Disinterestedness and
Disinterestedness ⇒ Teaching Environment has
negative number which shows that the occurrence of
Teaching Environment prevents occurring of
disinterestedness
similarly
occurrence
of
disinterestedness p r e v e n t s occurrence of Teaching
Environment.
Table 5: Added value Analysis
Factor for dropout
Poverty ⇒ Disinterestedness
Disinterestedness ⇒ Poverty
Teaching Environment
⇒ Disinterestedness
Disinterestedness
⇒ Teaching Environment
Added Value
0.0875
0.0775
-0.352
-2.69
Table 8 shows conviction analysis. It shows that highest
conviction is found in the association of
Poverty ⇒ Disinterestedness with value 1.34. Lowest
conviction
is
found
in
association
of
Teaching Environment ⇒ Disinterestedness with value
0.36.
Table 8: Conviction Analysis
Factor for dropout
Poverty ⇒ Disinterestedness
Disinterestedness ⇒ Poverty
Teaching Environment
⇒ Disinterestedness
Disinterestedness
⇒ Teaching Environment
Conviction
1.34
1.18
0.36
0.83
VII. CONCLUSION
-0.125
Table 6 contains lift analysis. It is a symmetric analysis.
It shows the occurrence of one item to another item.
In this table the first two relations has similar positive
value (1.1) greater than 1 which shows that occurrence
of first is strongly correlated with the other. In the case
of third and fourth relations, it has also same positive
value but less than 1 which shows that they are
negatively correlated.
Table 6: Lift Analysis
Factor for dropout
Poverty ⇒ Disinterestedness
Disinterestedness ⇒ Poverty
Teaching Environment
⇒ Disinterestedness
Disinterestedness
⇒ Teaching Environment
Correlation
1.57
1.57
-2.69
Lift
1.1
1.1
0.53
Association
rules
are
useful t o
find
the
association between two elements and shows relationship
between them. In this paper seven different parameters
are used to find the relationship between two different
factors affecting the school dropout. From the above
analysis it can be concluded that the students who are
disinterested are more prone to dropouts than due to
Poverty and Teaching Environment. Another conclusion
is extracted from confidence, cosine, AV analysis, lift,
correlation and conviction analysis is that most of the
Poor students are disinterested as well as students not
satisfied with the teaching environment are also not
interested to continue their education. So as desired by
the study made by ASSOCHAM [24], government of
Odisha has to take necessary steps to raise the expenditure
level per family per month and also organize awareness
camps to increase the interest level in the young minds.
REFERENCES
[1]
ARAQUE F., ROLDAN C., SALGUERO A.
Factors Influencing University Drop Out Rates.
Computers Education, 53, 563–574, 2009.
[2]
Alaa el-Halees, Mining Students Data to
Analyze e- Learning Behavior: A Case Study,
2009
[3]
Kotsiantis S.. Educational Data Mining: A Case
Study for Predicting Dropout – Prone Students.
Int. J.Knowledge Engineering and Soft Data
Paradigms, 1(2), 101–111, 2009.
[4]
Ming-Syan Chen, Jiawei Han and Philip S.
0.53
Table 7 contains correlation value. In this table
the first two relations has similar positive value
indicating that they are positively correlated to each
other. On the other hand third and fourth relations have
similar negative value showing that they are negatively
correlated to each other.
________________________________________________________________________
ISSN (Print): 2278-5140, Volume-2, Issue – 3, 2013
5
International Journal of Advanced Computer Engineering and Communication Technology (IJACECT)
________________________________________________________________________
Yu.. Data Mining: An Overview from a
Database Perspective, IEEE transactions on
Knowledge and Data Engineering Vol. 8, No. 6,
pp. 866-883, 1996.
[5]
http://www.odisha.gov.in/p&c/Download
/Economic%20Survey_2012-13.pdf page 282
[6]
LYKORENTZOU I., GIANNOUKOS I.,
NIKOPOULOS V., MPARDIS G. AND
LOUMOS V.. Dropout Prediction in e-learning
Courses through the Combination of Machine
Learning Techniques. Computers & Education,
53,950–965, 2009.
[7]
Dekker, G., Pechenizkiy, M. and Vleeshouwers,
J., “Predicting Students Drop Out: A Case
Study.” In Proceedings of the International
Conference on Educational Data Mining,
Cordoba, Spain, T. Barnes, M. Desmarais, C.
Romero and S. Ventura Eds.,pp41-50, 2009.
[8]
D'mello, S.K., Craig, S.D., Witherspoon, A.W.,
McDaniel, B.T. and Graesser, A.C., “Automatic
Detection
of
Learner’s
Affect
from
Conversational Cues.” User Modeling and UserAdapted Interaction vol 18. pp45-80, 2008.
[9]
Margret
H
Dunham,
“Data
Mining
Introductory and Advanced Topics”, Pearson
education ISBN 978-81-7758-785-2,2006.
2010.
[15]
R. Agrawal, T. Imielinski, A. Swami. Mining
Associations between Sets of Items
in
massive D atabases, Proc. of the ACMSIGMOD Int'l Conference on Management
of ata, Washington D.C. 1993.
[16]
V.Umarani, Dr. M. Punithavalli, “Sampling
based Association Rules Mining- A Recent
Overview” , (IJCSE) International Journal on
Computer Science and Engineering Vol. 02, No.
02, 314-318, 2010.
[17]
V.Umarani, Dr.M.Punithavalli, “A STUDY
ON EFFECTIVE MINING OF ASSOCIATION
RULES FROM HUGE DATABASES”, IJCSR
International Journal of Computer
Science
and Research, Vol. 1,Issue 1ISSN : 2210-9668,
, 2010.
[18]
Usama M. Fayyad, Gregory Piatetsky- Shapiro,
and Padhraic Smyth. From Data Mining to
knowledge Discovery: An Overview, Advances
in Knowledge Discovery and Data Mining,).
AAAI Press, , pp 1-34, 1996.
[19]
Townsend),The concept of Poverty, London ,
Heinmann, 1971.
[20]
Rowntree,S Poverty: A Study
Life,London, Macmillan, 1901.
of
Town
[10]
Merceron A and Yacef K, “Interestingness
Measures for Association Rules in Educational
Data”,2007.
[21]
[11]
Merceron
A,
Yacef
K,
“Revisiting
interestingness of strong symmetric association
rules in educational data”, Proceedings of the
International Workshop on Applying Data
Mining in e-Learning 2007
Romera, C. and Ventura, S., ” Educational
Data Mining: A Survey from 1995 to
2005.”Expert
Systems with Applications 33,
125-146,2007..
[22]
Romero, C ., V e n t u r a , S., Eapejo, P.G. and
Hervas,
C.,.” Data Mining Algorithms to
Classify Students.” In Proceedings of the 1st
International Conference on Educational Data
Mining, pp8-17, 2008.
[23]
Superby, J.F., Vandamme, J.-P. and Meskens,
N.,. “Determination of factors influencing the
achievement of thefirst-year university students
using data mining methods.” In Proceedings of
the Workshop on Educational Data Mining at the
8th International Conference on Intelligent
Tutoring Systems (ITS 2006), pp37-44,2006.
[24]
http://www.thehindu.com/todays-paper/tpational/tp-newdelhi/haryana-records-highestaverage-rural-expenditure-oneducation/article4975971.ece
[12]
[13]
[14]
David Wai-Lok Cheung, Vincent T. Ng, Ada
Wai-Chee Fu, and Yongjian Fu, “Efficient
Mining of Association Rules in Distributed
Databases”, IEEE ... IEEE
Transactions
on
Knowledge and Data Engineering, Vol 8, No 6,
pages 866–883, 1996.
S. Ayesha, T. Mustafa, A.R. Sattar, and
M.I.Khan, Data Mining Model for Higher
Education System, European Journal of
ScientificResearch, Vol.43, No.1,
pp.24-29,
2010.
M. Jadrić, Ž. Garača, and M. Ćukušić, Student
dropout analysis with application of data mining
methods, Management, Vol.15, No.1, ,pp. 31-46,

________________________________________________________________________
ISSN (Print): 2278-5140, Volume-2, Issue – 3, 2013
6