Download A Data Mining Algorithm In Distance Learning

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cluster analysis wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

Nearest-neighbor chain algorithm wikipedia , lookup

K-nearest neighbors algorithm wikipedia , lookup

K-means clustering wikipedia , lookup

Expectation–maximization algorithm wikipedia , lookup

Transcript
A Data Mining Algorithm In Distance Learning
Dai Shangping1, Zhang Ping2
1
Department of Computer Science, Hua Zhong Normal University, Wu Han, China
[email protected]
2
Foreign Languages College, Zhongnan University of Economics and Law, Wuhan,China
[email protected]
Abstract
Currently there is an increasing interest in data
mining and educational systems, making educational
data mining as a new growing research community.
One of the challenges in developing data mining
systems is to integrate and coordinate existing data
mining applications in a seamless manner so that costeffective systems can be developed without the need of
costly proprietary products. The popularity of distance
education has grown rapidly over the last decade in
higher education, yet many fundamental teaching–
learning issues are still in debate. This paper presents
an approach for classifying students in order to predict
their final grade based on features extracted from
logged data in an education web-based system. In this
paper we take advantage of the genetic algorithm (GA)
designed specifically for discovering association rules.
We propose a novel spatial mining algorithm, called
ARMNGA(Association Rules Mining in Novel Genetic
Algorithm), Compared to the algorithm in
Reference[2] , the ARMNGA algorithm avoids
generating impossible candidates, and therefore is
more efficient in terms of the execution time.
Keywords: data mining; genetic algorithm, association
rules, distance learning
1. Introduction
Some of the commonly used data mining algorithms
fall under the following categories: decision trees and
rules, nonlinear regression and classification methods,
example-based methods, probabilistic graphical
dependency models, and relational learning models.
Over the years genetic algorithms have been
successfully applied in learning tasks in different
domains, like chemical process control, financial
classification, manufacturing scheduling, among others.
Many leading educational institutions are working to
establish an online teaching and learning presence.
Several systems with different capabilities and
approaches have been developed to deliver online
education in an academic setting. [1]
Because of the advances in information technology,
_____________________________________
978 -1-4244-1651-6/08/$25.00 © 2008 IEEE
vast number of educational resources has accumulated
on the Internet. Therefore, how to mine interesting
resources from databases has attracted more and more
attention in recent years [2]. Many data mining methods
have been proposed such as association rule mining,
sequential pattern mining, calling path pattern mining,
text mining, temporal data mining, spatial data mining,
etc.
Association rule mining, as originally proposed in
with its apriori algorithm (ARMA), has developed into
an active research area. Many additional algorithms
have been proposed for association rule mining. Also,
the concept of association rule has been extended in
many different ways, such as generalized association
rules, association rules with item constraints, sequence
rules etc. Apart from the earlier analysis of market
basket data, these algorithms have been widely used in
many other practical applications such as customer
profiling, analysis of products and so on[3].
Genetic Algorithm (GA) is one self-adaptive
optimization searching algorithm. GA obtains the best
solution, or the most satisfactory solution through
generations of chromosomes’ constant evolution
includes the reproduce, crossover and mutation etc.
operation, until a certain termination condition is
coincident [4].
Association rules mining Algorithm Based on a
novel Genetic Algorithm (ARMNGA) is an optimal
algorithm combing GA with ARMA.
The contributions of this paper are:
We take advantage of the genetic algorithm (GA)
designed specifically for discovering association rules.
We propose a novel spatial mining algorithm, called
ARMNGA, Compared to the algorithm in Ref [2], and
the ARMNGA algorithm avoids generating impossible
candidates, and therefore is more efficient in terms of
the execution time.
2. Association Rules
Definition 1 confidence
Set up I {i1 , i2 , im } for items of collection, for item
in
i j (1 d j d m)
,
(1 d j d m)
for
lasting
item,
D {T1 , TN ) it is a trade
Ti Ž I (1 d i d N ) here T is the trade.
collection,
Rule r o q is probability that r q concentrates
on including in the trade.
The association rule here is an implication of the
form r o q where X is the conjunction of conditions,
and Y is the type of classification. The rule r o q has
to satisfy specified minimum support and minimum
confidence measure [5].
The support of Rule r o q is the measure of
frequency both r and q in D
(1)
S (r ) r D
The confidence measure of Rule r o q is for the
premise that includes r in the bargain descend, in the
meantime includes q
C (r o q ) S (rq ) S (r ) (2)
Definition 2 Weighting support
Designated ones project to collect I = {i1, i2, im},
each project ij is composed with the value wj of right
(0”wj ” 1, 1”j ”m). If the rule is r ĺq, the weighting
support is
A j  [1, m], j  [1, n] and An may be a repeatedly
equal natural number.
When the distributive method at random is employed
to produce the initial population comprised of certain
individuals, the population must be in a certain scale in
order to achieve the optimal solution on the whole. The
best way is the generated M individuals randomly that
the length is nˈthen the chromosome bunch encoded
by the natural number is calculated as the initial
population.
4.2 The Fitness
Formula (3) is properly transformed into:
F ( r ) Ws u
S (r )
C (r )
Wc u
S min
C min
(4)
Here, Wc+Ws=1, Wc •0, Ws •0, Smin, is minimum
support, and Cmin is minimum confidence.
4.3 Reproduction Operator
S w (r )
1
¦ w j S (r )
k i r
(3)
And, the K is the size of the Set rq of the project.
When the right value Wj is the same as ij, we
calculating the weighting including rule to have the
same support.
3. Genetic Algorithm (GA)
Genetic Algorithm (GA) is a self-adaptive
optimization searching algorithm. GA obtains the best
solution, or the most satisfactory solution through
generations of chromosomes constant evolution
includes reproduction, crossover and mutation etc.
Here is the general description of this problem:
F (r ) a u S (r ) b u C (r ) c u A(r )
(3)
a,b,c is constants ,a •0,b•0, c•0,S(r) is the
support ,and C(r) is the confidence ,A(r) is the mantle.
4. A Data Mining Algorithm
Reproduction is the transmission of personal
information from the father generation to the son
generation. Each individual in each generation
determines the probability that it can reproduce the next
generation according to how big or small the fitness
value is. Through reproducing, the number of excellent
individuals in the population increases constantly, and
the whole process of evolution head for the optimal
direction. We are adopting roulette selection strategy;
each individual reproduction probability is proportion
to fitness value.
1) Compute the reproduction probability of all the
individuals
P (i) =
f (i)
M
¦
(5)
f (i)
i=1
2) Generate a number r randomly, r=random [0, 1] ˗
3) If P(0)+ P(1)+…+ P(i-1)<r< P(0)+ P(1)+…+
P(i),the individual i is selected into the next generation.
4.4 Crossover Operator
4.1 Encoding
This paper employs natural numbers to encode the
variable Aij. That is, the number of the lines of every
range in the matrix A in which the element 1 exists is
regarded as a gene. The genes are independent of each
other. They are marked by A1, A2,…,Aj,…,An, in which
Crossover is the substitution between two
individuals of the father generation that is to generating
new individual .The crossover probability Pc directly
influences the convergence of the algorithm. The larger
Pc is the most likely is the genetic mode of the optimal
individual to be destroyed .However, the over-small of
Pc can slow down the research process [6] .Here is the
definition of the crossover operator:
Computing
crossover
probability
Pc
­
°°p (pc1 pc2)(f f )
Pc ® c1
fmax f
°
¯°pc1
f tf
(6)
f %f
Here, p c1 =0.9, p c 2 =0.7, f max is the maximum
fitness value of the population,
fitness value of the population.
f is the average
Fig 1. Runtime vs. minimum support.
4.5 Mutation Operator
Here is the definition of the mutations operator,
computing the mutation probability Pm
­
( p m1 p m2 )( f f )
° p m1 Pm ®
f max f
°
¯ p m1
f tf
(7)
f %f
Here, p m1 =0.1, p m 2 =0.001, f max is the maximum
fitness value of the population,
fitness value of the population.
f is the average
4.6 Termination Condition
Fig.2 shows the runtime vs. the average size of
transactions for both algorithms, where the average size
of transactions varies from 4 to 14 for the synthetic
dataset. As the average size of transactions increases,
the runtime of the algorithm in Ref [2] increases
dramatically, however, compared to the algorithm in
Ref [2], the runtime of our proposed algorithm
increases slowly. The reason for increase in the
runtimes of both algorithms is that the number of
frequent resources increases as the lengths of
transactions are increased. Therefore, finding
candidates present in a trans- action takes a longer time.
Our proposed algorithm is more scalable than the
algorithm in Ref [2] because a large number of
candidates can be pruned by using the ARMNGA
pruning strategy.
When the matching condition is not coincident, the
process will naturally stop.
5. Experiments
To check the research capability of the operator and
its operational efficiency, such a simulation result is
given compared with the GA in [2] ,The platform of the
simulation experiment is a Dell power Edge2600 server
(double Intel Xeon 1.8GHz CPU,1G memory , RedHat
Linux 9.0).
We first compare the performance of our proposed
method with the algorithm in Ref [2].
Fig.1 shows the runtime vs. the minimum support for
both algorithms, where the minimum support varies
from 0.25% to 2% for the synthetic dataset. Our
proposed algorithm runs 2–5 times faster than the
Apriori algorithm, because a large number of
candidates can be pruned by using the ARMNGA
pruning strategy. Therefore, as the minimum support
threshold decreases, the runtime of the Apriori
algorithm increases dramatically since it generates too
many candidates when the minimum support is small.
Fig2. Runtime vs. average size of transactions.
From fig 1 and 2, we can educe that ARMNGA has a
higher convergence speed and more reasonable
selective scheme which guarantees the non-reduction
performance of the optimal solution. Therefore, it is
better than GA and ARMA through the theoretic
analysis and the experimental results.
6. CONCLUSIONS
Educational data mining is a young research area and
it is necessary more specialized and oriented work
educational domain in order to obtain a similar
application success level to other areas, such as medical
data mining, mining e-commerce data, etc. In this paper,
We propose an association rules mining based on a
novel genetic algorithm, designed specifically for
discovering association rules. We compare the results
of the ARMNGA with the results of Ref [2], and, it is
better than GA and ARM through the theoretic analysis
and the experimental results. We believe that some
future mining tools more easy to use by educators.
Acknowledgement
This work is partially supported by “The research of
foreign Chinese long-distance visualized teaching
model” The National Society Science Foundation of
P.R. China, Under Grant No. 07BYY033 .We thanks
Professor Zheng shijue of the Department of Computer
Science of Central China Normal University
Technology for his supports of various aspects and
thanks also go to all members of our research group.
References
[1] Behrouz Minaei-Bidgoli , William F. Punch III, Using
Genetic Algorithms for Data Mining Optimization in an
Educational Web-based System, Lecture Notes in
computer Science,pp2724 -2735,2003
[2] G. Chen, Q. Wei, Fuzzy association rules and the
extended mining algorithms, Information Sciences 147
(2002) 201–228.
[3] K. Koperski, J. Han, Discovery of spatial association rules
in geographic information databases, in: Proc. of
International Symposium on Advance in Spatial
Databases, SSD, LNCS, vol. 951, Springer Verlag, 1995,
pp. 47–66.
[4] Shijue Zheng,Wanneng Shu,Li Gao,Task Scheduling
Using
Parallel
Genetic
Simulated
Annealing
Algorithm ,2006 IEEE International Conference on
Service Operations and Logistics, and Informatics
Proceedings June 21-23, 2006, Shanghai, pp46-50
[5] P.Y. Hsu, Y.L. Chen, C.C. Ling, Algorithms for mining
association rules in bag databases, Information Sciences
166 (2004) 31–47.
[6] Gao Li,Li Dan,Dai Shangping, A mining Algorithm of
constraint based association rules, journal of Henan
University Vol.33(2003) pp.55-58
[7] Wu zhaohui , Association rule mining based on simulated
annealing genetic algorithm, Comput Applications
Vol.25(2005) pp.1009-1011