A Data Mining Algorithm In Distance Learning

Dai Shangping1, Zhang Ping2
1 Department of Computer Science, Hua Zhong Normal University, Wuhan, China [email protected]
2 Foreign Languages College, Zhongnan University of Economics and Law, Wuhan, China [email protected]

Abstract

There is currently increasing interest in data mining and educational systems, making educational data mining a new and growing research community. One of the challenges in developing data mining systems is to integrate and coordinate existing data mining applications in a seamless manner, so that cost-effective systems can be developed without the need for costly proprietary products. The popularity of distance education in higher education has grown rapidly over the last decade, yet many fundamental teaching-learning issues are still under debate. This paper presents an approach for classifying students in order to predict their final grade based on features extracted from logged data in a web-based education system. We take advantage of a genetic algorithm (GA) designed specifically for discovering association rules. We propose a novel spatial mining algorithm, called ARMNGA (Association Rules Mining in Novel Genetic Algorithm). Compared to the algorithm in Reference [2], the ARMNGA algorithm avoids generating impossible candidates and is therefore more efficient in terms of execution time.

Keywords: data mining, genetic algorithm, association rules, distance learning

1. Introduction

Commonly used data mining algorithms fall under the following categories: decision trees and rules, nonlinear regression and classification methods, example-based methods, probabilistic graphical dependency models, and relational learning models. Over the years, genetic algorithms have been successfully applied to learning tasks in different domains, such as chemical process control, financial classification, and manufacturing scheduling, among others.
Many leading educational institutions are working to establish an online teaching and learning presence. Several systems with different capabilities and approaches have been developed to deliver online education in an academic setting [1]. Because of the advances in information technology, a vast number of educational resources has accumulated on the Internet. Therefore, how to mine interesting resources from databases has attracted more and more attention in recent years [2].

978-1-4244-1651-6/08/$25.00 © 2008 IEEE

Many data mining methods have been proposed, such as association rule mining, sequential pattern mining, calling path pattern mining, text mining, temporal data mining, and spatial data mining. Association rule mining, as originally proposed with its Apriori algorithm (ARMA), has developed into an active research area, and many additional algorithms have been proposed for it. The concept of an association rule has also been extended in many different ways, such as generalized association rules, association rules with item constraints, and sequence rules. Apart from the earlier analysis of market basket data, these algorithms have been widely used in many other practical applications, such as customer profiling and product analysis [3].

A Genetic Algorithm (GA) is a self-adaptive optimization search algorithm. A GA obtains the best, or the most satisfactory, solution through the continual evolution of generations of chromosomes, using operations such as reproduction, crossover, and mutation, until a certain termination condition is met [4]. The Association Rules Mining algorithm based on a Novel Genetic Algorithm (ARMNGA) is an optimization algorithm combining GA with ARMA.

The contributions of this paper are as follows. We take advantage of a genetic algorithm (GA) designed specifically for discovering association rules.
We propose a novel spatial mining algorithm, called ARMNGA. Compared to the algorithm in Ref. [2], the ARMNGA algorithm avoids generating impossible candidates and is therefore more efficient in terms of execution time.

2. Association Rules

Definition 1 (confidence). Let $I = \{i_1, i_2, \ldots, i_m\}$ be a set of items, where each $i_j$ ($1 \le j \le m$) is an item, and let $D = \{T_1, \ldots, T_N\}$ be a collection of transactions, where each transaction $T_i \subseteq I$ ($1 \le i \le N$). The confidence of the rule $r \to q$ is the probability that a transaction containing $r$ also contains $q$.

The association rule here is an implication of the form $r \to q$, where $r$ is the conjunction of conditions and $q$ is the type of classification. The rule $r \to q$ has to satisfy a specified minimum support and minimum confidence [5].

The support of an item set $r$ is its frequency in $D$:

$$S(r) = \frac{|r|}{|D|} \qquad (1)$$

where $|r|$ is the number of transactions in $D$ that contain $r$. The support of the rule $r \to q$ is $S(rq)$, the frequency with which $r$ and $q$ occur together in $D$.

The confidence of the rule $r \to q$ is the fraction of the transactions containing $r$ that also contain $q$:

$$C(r \to q) = \frac{S(rq)}{S(r)} \qquad (2)$$

Definition 2 (weighted support). Given the item set $I = \{i_1, i_2, \ldots, i_m\}$, each item $i_j$ is assigned a weight $w_j$ ($0 \le w_j \le 1$, $1 \le j \le m$). For a rule $r \to q$, the weighted support is

$$S_w(r) = \frac{1}{k} \sum_{i_j \in r} w_j \, S(r) \qquad (3)$$

where $k$ is the size of the item set $rq$. When the weight $w_j$ is the same for every item $i_j$, the weighted calculation yields the same support for the rules.

3. Genetic Algorithm (GA)

A Genetic Algorithm (GA) is a self-adaptive optimization search algorithm. A GA obtains the best, or the most satisfactory, solution through the continual evolution of generations of chromosomes, using operations such as reproduction, crossover, and mutation. The general form of the fitness function for this problem is:

$$F(r) = a \times S(r) + b \times C(r) + c \times A(r) \qquad (3)$$

where $a$, $b$, $c$ are constants with $a \ge 0$, $b \ge 0$, $c \ge 0$; $S(r)$ is the support, $C(r)$ is the confidence, and $A(r)$ is the mantle.

4. A Data Mining Algorithm

4.1 Encoding

This paper employs natural numbers to encode the variable $A_{ij}$. That is, for every column of the matrix $A$, the number of the row in which the element 1 appears is regarded as a gene. The genes are independent of each other. They are denoted $A_1, A_2, \ldots, A_j, \ldots, A_n$, where $A_j \in [1, m]$, $j \in [1, n]$, and the $A_j$ may be repeated natural numbers. When the initial population is produced by random generation, the population must be of a certain scale in order to reach the global optimum. The preferred way is to generate $M$ individuals of length $n$ at random; the resulting natural-number-encoded chromosomes are taken as the initial population.

4.2 The Fitness

Formula (3) of Section 3 is transformed into:

$$F(r) = W_s \times \frac{S(r)}{S_{\min}} + W_c \times \frac{C(r)}{C_{\min}} \qquad (4)$$

Here, $W_c + W_s = 1$, $W_c \ge 0$, $W_s \ge 0$, $S_{\min}$ is the minimum support, and $C_{\min}$ is the minimum confidence.

4.3 Reproduction Operator

Reproduction is the transmission of individual information from the parent generation to the child generation. The probability that an individual reproduces into the next generation is determined by its fitness value. Through reproduction, the number of excellent individuals in the population increases constantly, and the whole process of evolution heads in the optimal direction. We adopt the roulette selection strategy, in which each individual's reproduction probability is proportional to its fitness value.

1) Compute the reproduction probability of all the individuals:

$$P(i) = \frac{f(i)}{\sum_{k=1}^{M} f(k)} \qquad (5)$$

2) Generate a random number $r$, $r = \mathrm{random}[0, 1]$.
3) If $P(0) + P(1) + \ldots + P(i-1) < r < P(0) + P(1) + \ldots + P(i)$ (with $P(0) = 0$), the individual $i$ is selected into the next generation.

4.4 Crossover Operator

Crossover is the exchange of genes between two individuals of the parent generation in order to generate new individuals. The crossover probability $P_c$ directly influences the convergence of the algorithm.
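The operators introduced up to this point (the fitness of Eq. (4), the roulette selection of Eq. (5), and a crossover step on natural-number-encoded chromosomes) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the default weights, and the one-point crossover scheme are hypothetical, since the paper does not state which crossover scheme it uses.

```python
import random
from itertools import accumulate


def fitness(support, confidence, s_min, c_min, w_s=0.5, w_c=0.5):
    """Eq. (4): F(r) = Ws * S(r)/Smin + Wc * C(r)/Cmin, with Ws + Wc = 1.

    The default weights of 0.5 are an assumption made for this sketch.
    """
    return w_s * support / s_min + w_c * confidence / c_min


def roulette_select(fitnesses, rng=random):
    """Eq. (5): return index i with probability f(i) / sum of all f."""
    total = sum(fitnesses)
    # Running sums of P(i) give the upper bound of each individual's slot.
    cumulative = list(accumulate(f / total for f in fitnesses))
    r = rng.random()
    for i, upper in enumerate(cumulative):
        if r < upper:
            return i
    return len(fitnesses) - 1  # guard against floating-point round-off


def one_point_crossover(parent_a, parent_b, rng=random):
    """Swap the tails of two equal-length chromosomes (length >= 2).

    One-point crossover is an assumed scheme, chosen only for illustration.
    """
    point = rng.randrange(1, len(parent_a))
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b
```

Because each gene is an independent natural number in [1, m], swapping tails always yields valid chromosomes, so no repair step is needed after crossover in this sketch.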
The larger $P_c$ is, the more likely the genetic pattern of the optimal individual is to be destroyed. However, an overly small $P_c$ can slow down the search process [6]. The crossover probability $P_c$ is computed as:

$$P_c = \begin{cases} p_{c1} - \dfrac{(p_{c1} - p_{c2})(f - \bar{f})}{f_{\max} - \bar{f}}, & f \ge \bar{f} \\ p_{c1}, & f < \bar{f} \end{cases} \qquad (6)$$

Here, $p_{c1} = 0.9$, $p_{c2} = 0.7$, $f$ is the fitness value of the individual under consideration, $f_{\max}$ is the maximum fitness value of the population, and $\bar{f}$ is the average fitness value of the population.

4.5 Mutation Operator

The mutation probability $P_m$ is computed as:

$$P_m = \begin{cases} p_{m1} - \dfrac{(p_{m1} - p_{m2})(f - \bar{f})}{f_{\max} - \bar{f}}, & f \ge \bar{f} \\ p_{m1}, & f < \bar{f} \end{cases} \qquad (7)$$

Here, $p_{m1} = 0.1$, $p_{m2} = 0.001$, $f_{\max}$ is the maximum fitness value of the population, and $\bar{f}$ is the average fitness value of the population.

4.6 Termination Condition

When the matching condition is no longer satisfied, the process stops naturally.

5. Experiments

To check the search capability of the operators and their operational efficiency, simulation results are compared with those of the GA in [2]. The platform of the simulation experiment is a Dell PowerEdge 2600 server (dual Intel Xeon 1.8 GHz CPUs, 1 GB memory, RedHat Linux 9.0).

We first compare the performance of our proposed method with the algorithm in Ref. [2]. Fig. 1 shows the runtime vs. the minimum support for both algorithms, where the minimum support varies from 0.25% to 2% for the synthetic dataset. Our proposed algorithm runs 2–5 times faster than the Apriori algorithm, because a large number of candidates can be pruned by using the ARMNGA pruning strategy. By contrast, as the minimum support threshold decreases, the runtime of the Apriori algorithm increases dramatically, since it generates too many candidates when the minimum support is small.

Fig. 1. Runtime vs. minimum support.

Fig. 2 shows the runtime vs. the average size of transactions for both algorithms, where the average size of transactions varies from 4 to 14 for the synthetic dataset. As the average size of transactions increases, the runtime of the algorithm in Ref. [2] increases dramatically, whereas the runtime of our proposed algorithm increases slowly. The runtimes of both algorithms increase because the number of frequent resources grows as the transactions become longer, so finding the candidates present in a transaction takes longer. Our proposed algorithm is more scalable than the algorithm in Ref. [2] because a large number of candidates can be pruned by using the ARMNGA pruning strategy.

Fig. 2. Runtime vs. average size of transactions.

From Figs. 1 and 2, we can deduce that ARMNGA has a higher convergence speed and a more reasonable selection scheme, which guarantees that the optimal solution does not degrade. Therefore, it is better than GA and ARMA according to both the theoretical analysis and the experimental results.

6. Conclusions

Educational data mining is a young research area, and more specialized work oriented to the educational domain is necessary in order to reach a level of application success similar to that of other areas, such as medical data mining and e-commerce data mining. In this paper, we propose an association rule mining algorithm based on a novel genetic algorithm designed specifically for discovering association rules. We compare the results of ARMNGA with the results of Ref. [2]; ARMNGA is better than GA and ARM according to both the theoretical analysis and the experimental results. We believe that future mining tools will be easier for educators to use.

Acknowledgement

This work is partially supported by "The research of foreign Chinese long-distance visualized teaching model", The National Society Science Foundation of P.R. China, under Grant No. 07BYY033. We thank Professor Zheng Shijue of the Department of Computer Science of Central China Normal University Technology for his support in various aspects. Thanks also go to all members of our research group.
References

[1] Behrouz Minaei-Bidgoli, William F. Punch III, Using Genetic Algorithms for Data Mining Optimization in an Educational Web-based System, Lecture Notes in Computer Science, pp. 2724-2735, 2003.
[2] G. Chen, Q. Wei, Fuzzy association rules and the extended mining algorithms, Information Sciences 147 (2002) 201–228.
[3] K. Koperski, J. Han, Discovery of spatial association rules in geographic information databases, in: Proc. of International Symposium on Advance in Spatial Databases, SSD, LNCS, vol. 951, Springer Verlag, 1995, pp. 47–66.
[4] Shijue Zheng, Wanneng Shu, Li Gao, Task Scheduling Using Parallel Genetic Simulated Annealing Algorithm, 2006 IEEE International Conference on Service Operations and Logistics, and Informatics Proceedings, June 21-23, 2006, Shanghai, pp. 46-50.
[5] P.Y. Hsu, Y.L. Chen, C.C. Ling, Algorithms for mining association rules in bag databases, Information Sciences 166 (2004) 31–47.
[6] Gao Li, Li Dan, Dai Shangping, A mining Algorithm of constraint based association rules, Journal of Henan University, Vol. 33 (2003) pp. 55-58.
[7] Wu Zhaohui, Association rule mining based on simulated annealing genetic algorithm, Comput Applications, Vol. 25 (2005) pp. 1009-1011.