
Performance Evaluation of Students with Sequential Pattern Mining Algorithm SPAM

Rashmi V. Mane1,∗ and Vijay R. Ghorpade2
1 Shivaji University, Kolhapur, Maharashtra, India.
2 D.Y. Patil College of Engineering, Kolhapur, Maharashtra, India.
e-mail: rvm− [email protected]

Abstract. Sequential pattern mining is one of the most efficient data mining approaches for discovering frequently occurring sequences in large datasets. Any institute holds a large amount of student data. Sequential pattern mining plays a vital role in evaluating student performance by academic year and by the grades obtained in examinations. Many algorithms have been proposed for sequential pattern mining, using strategies such as Apriori-based candidate generation and pattern growth. Adding constraints such as regular expressions to these algorithms enhances their performance. Among them, SPAM, from the candidate generation category, works better on large datasets than PrefixSpan, one of the algorithms of the pattern growth category. In this paper, a novel SPAM algorithm is proposed and analyzed with a regular expression constraint for a student database.

Keywords: constraints, data mining, sequential pattern mining, SPAM.

1. Introduction

Data mining algorithms are widely used to extract data of interest to the user from large datasets. Sequential pattern mining is one of the subtasks of data mining [1]. Nowadays, hardly any search task is carried out without the help of computers, so large amounts of data are collected in these systems. Pattern mining algorithms play an important role in retrieving user-specific data from these huge datasets. They have a wide range of applications in fields such as medicine, bioinformatics, telecommunication and networking. A number of algorithms have been proposed for discovering sequences or patterns.
Adding user-defined constraints [2,3] to a pattern mining algorithm improves its performance. Student data is associated with the class obtained in each examination, so student performance can be evaluated with pattern mining. The input sequences used in this paper are given in Table 1.

∗ Corresponding author
K. R. Venugopal, P. Deepa Shenoy and L. M. Patnaik (Eds.) ICDMW 2013, pp. 161–168. © Elsevier Publications 2013.

Table 1. Sequences of Grades Obtained by Student.

Student Id    Sequence of Class Obtained
1             DZSFFSFSFFD
2             DFZZZFFFFSD
3             DZFFFSFSFDD

In Table 1, the first column is the student id and the second column is the sequence of classes obtained by the student up to graduation in Engineering. We have considered B.E./B.Tech. students. Each sequence in the second column lists, in order, the class obtained in 10th, Diploma, 12th, Ist Semester, IInd Semester, IIIrd Semester, IVth Semester, Vth Semester, VIth Semester, VIIth Semester and VIIIth Semester.

An engineering college has at least four to five branches, and its students are spread over four years: First Year, Second Year, Third Year and Final Year Engineering. The total strength of an engineering college is about 1500 to 2000 students. If data is collected year-wise over the last five years, the number of sequences grows tremendously, and it becomes difficult to trace the performance of a student from 10th to final year engineering in such a large dataset. Pattern mining algorithms can handle this efficiently. By preprocessing the data branch-wise and academic-year-wise, user-specific data can be discovered easily. Academic performance is considered in 10th and 12th and from the Ist to the VIIIth semester of engineering. Some students join the college directly in the IIIrd semester after a diploma.
For these students, the 12th, Ist and IInd semester classes are not considered and are represented with 'Z'. The classes used to classify student marks are given in Table 2.

Table 2. Representation of Student Performance.

Class    Meaning of Class
D        Distinction
F        First Class
S        Second Class
P        Pass Class
N        Not Pass/Fail
A        ATKT
X        Absent
Y        Detained
Z        Not considered

As the dataset contains a large number of sequences, sequential pattern mining algorithms can be used to discover frequently occurring patterns. The Sequential PAttern Mining (SPAM) [4,9,11,16] algorithm discovers sequential patterns efficiently. It performs better on larger numbers of sequences than pattern growth algorithms such as PrefixSpan [6].

2. Related Work

Discovering frequent patterns or sequences in large datasets is the task of sequential pattern mining. It has a wide range of applications, and many algorithms have been proposed to discover useful patterns. The problem was first proposed for market-basket analysis by R. Agrawal and R. Srikant [1,2]. Many enhancements have since been proposed to improve the efficiency of these algorithms. Generally the algorithms are classified into two types: Apriori-based (candidate generation) algorithms and algorithms that follow the pattern growth approach. GSP [2], SPAM [4], SPADE [8] and SPIRIT [3] belong to the candidate generation type. Their drawback is that many unwanted candidate sequences are generated, which uses a lot of memory. The pattern growth algorithms are PrefixSpan [6] and FreeSpan [7]. These algorithms divide the sequence database into smaller sub-databases and then perform mining on these sub-databases. Both approaches generate many unwanted patterns and spend a lot of time generating all these sequences.
To improve time efficiency, constraints are pushed into the algorithms [5]. These constraints are of different types, such as attribute constraint, time constraint [13,14], aggregate constraint [10], weight constraint [12] and regular expression constraint [15].

3. Problem Definition

Given a student sequence database D, a user-defined minimum support minSup and a user-specified constraint (given in the form of a regular expression), find all subsequences x that satisfy the constraint and have xs >= minSup, where xs is the support of subsequence x and minSup is the minimum number of occurrences of a sequence in the transactional database.

4. Applying SPAM Algorithm

SPAM is an Apriori-based algorithm that generates candidate sequences. It works with three main strategies: generation of a lexicographic tree for the given sequences, depth-first traversal of that tree, and a bitmap representation for support counting. During candidate generation, sequences are generated with the sequence-extension step (S-step) and the itemset-extension step (I-step). To improve performance and reduce the search space, SPAM uses two pruning strategies based on the Apriori principle, called S-step pruning and I-step pruning.

One of the best features of SPAM is its support counting [4], which is done by forming a vertical bitmap for each item. In Table 1 the maximum sequence length is 11 and the dataset contains only 3 students, so the bitmap of each item is divided into 3 sections, one per student, of 11 bits each. A bit is 1 if the item occurs at that position of the sequence and 0 otherwise. The support of an item is 100 percent if every section of its bitmap contains at least one 1. For the S-step, a bitmap is transformed so that, in each section, every bit up to and including the first 1 is set to zero and every bit after that position is set to one.
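The bitmap representation and the S-step transformation described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names item_bitmap, support and s_step_transform are hypothetical, bitmaps are plain lists of 0/1 values rather than packed bits, and the three sequences are those of Table 1.

```python
# Sequences of classes from Table 1, one string per student.
SEQUENCES = ["DZSFFSFSFFD", "DFZZZFFFFSD", "DZFFFSFSFDD"]


def item_bitmap(item, sequences):
    """Build one section per student: bit is 1 where `item` occurs at that position."""
    return [[1 if grade == item else 0 for grade in seq] for seq in sequences]


def support(bitmap):
    """Count the sections (students) that contain at least one 1."""
    return sum(1 for section in bitmap if any(section))


def s_step_transform(bitmap):
    """In each section, clear every bit up to and including the first 1, set the rest to 1."""
    transformed = []
    for section in bitmap:
        if 1 in section:
            k = section.index(1)
            transformed.append([0] * (k + 1) + [1] * (len(section) - k - 1))
        else:
            # The item never occurs for this student, so nothing can follow it.
            transformed.append([0] * len(section))
    return transformed


d_bitmap = item_bitmap("D", SEQUENCES)
print(d_bitmap[0])                    # section of student 1 for item D
print(support(d_bitmap))              # number of students with at least one D
print(s_step_transform(d_bitmap)[0])  # transformed section of student 1
```

For student 1, item D occurs at the first and last positions, so the transformed section marks every position after the first one as a place where a following item may appear.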
During the sequence-extension step (S-step), the bitmap of the current sequence is first transformed and then logically ANDed with the bitmap of the other item.

Figure 1. Bitmap representation of data items in Table 1.

For example, for the S-step between items D and F, the vertical bitmap of (D) is transformed as follows: the first 1 bit is set to 0 and every later bit is set to 1 (later 1 bits stay as they are and later 0 bits are changed to 1). This transformed bitmap of (D) is then logically ANDed with the bitmap of (F), as shown in Figure 2.

Figure 2. Transformation for S-step.

For the itemset-extension step (I-step) no transformation is needed. Let (D,F) be the itemset after the I-step of (D) and (F); the two vertical bitmaps of (D) and (F) are simply logically ANDed.

To improve the overall performance of the algorithm and to avoid unnecessary candidate generation, constraints can be pushed into the algorithm. A regular expression constraint is well suited to the student dataset. The pruning strategy at the I-step and S-step prunes all nodes containing a sequence that does not satisfy the constraint. In the tree building process, the regular expression constraint is enforced during traversal of the search tree. Suppose the patterns of interest are all students with Distinction in 10th and First Class in Diploma; the regular expression is then DF.*. Tree building starts at the root node, and candidate generation proceeds as shown in Figure 3.

Figure 3. Lexicographical tree for regular expression (DF.*).

In Figure 3 only the D, F node can be expanded; all other nodes are pruned as they do not satisfy the regular expression constraint. Regular expressions can be given in the same form as the expressions accepted by a finite state machine.

Table 3. Examples of Regular Expression.

Expression    Meaning of Expression
(DF.∗)        All sequences starting with DF
D2            Two occurrences of D in a sequence
D+            One or more occurrences of D
[D|F]         Sequence containing either D or F

When the regular expression constraint is added to the SPAM algorithm, nodes that do not satisfy it are pruned during depth-first search. DFS is applied recursively until all nodes have been compared with minSup and the regular expression. Only nodes that satisfy both constraints are stored in the database.

5. SPAM with Regular Expression Constraint

The SPAM algorithm with the regular expression constraint pushed in is specified in Table 4. During depth-first traversal of the tree, the support of each node is compared with minSup. If the support is at least minSup, the node is then checked against the regular expression constraint specified by the user. Only if both conditions are satisfied is the node stored in the database; otherwise it is not considered further. By the Apriori principle, all sequences extended from a rejected node are ignored as well. The same technique is used for the S-step and the I-step.

Table 4. Algorithm: SPAM with RegExpr (D, R).

Set all items in D to I
For each i belonging to I
    call DFS
For each node in tree
    If node satisfies constraint R
        Add node to tree
    Else
        Ignore

Depth First Search DFS (n, Sn, In)
For S-step
    For each i belonging to Sn
        If (i is frequent and satisfies R)
            Add i to tree
    DFS for all candidate sequence in a tree with greater length than
For I-step
    For each i belonging to In
        If (i is frequent and satisfies R)
            Add i to tree
    DFS for all candidate sequence in a tree with greater length than

Figure 4. Steps for Mining Student Database.

6. Mining User-Interesting Patterns From Student Database

The first process in any mining task is data collection. The collected data contains student records with all of their marks from S.S.C. to engineering.
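
As an illustration of how such records can be mined, the depth-first search of Section 5 can be sketched in Python on the three sequences of Table 1. This is a simplified sketch under several assumptions, not the algorithm of Table 4 itself: the names bitmap, s_extend and mine and the max_len bound are hypothetical; the I-step is omitted because each position in these sequences holds a single class; and the regular expression is used to select which frequent patterns are reported, while only minSup is used to prune the tree.

```python
import re

# Sequences of classes from Table 1, one string per student.
SEQUENCES = ["DZSFFSFSFFD", "DFZZZFFFFSD", "DZFFFSFSFDD"]


def bitmap(item):
    """Vertical bitmap of an item: one list of 0/1 bits per student."""
    return [[int(grade == item) for grade in seq] for seq in SEQUENCES]


def support(bm):
    """Number of students whose section contains at least one 1."""
    return sum(any(section) for section in bm)


def s_extend(pattern_bm, item_bm):
    """S-step: keep only item occurrences strictly after the pattern's first end position."""
    result = []
    for p_sec, i_sec in zip(pattern_bm, item_bm):
        first = p_sec.index(1) if 1 in p_sec else len(p_sec)
        result.append([int(pos > first and bit) for pos, bit in enumerate(i_sec)])
    return result


def mine(min_sup, constraint, max_len):
    """Return {pattern: support} for frequent patterns that fully match `constraint`."""
    items = sorted(set("".join(SEQUENCES)))
    item_bms = {item: bitmap(item) for item in items}
    regex = re.compile(constraint)
    found = {}

    def dfs(pattern, bm):
        sup = support(bm)
        if sup < min_sup:
            return  # Apriori pruning: no extension of this node can be frequent
        if regex.fullmatch(pattern):
            found[pattern] = sup
        if len(pattern) < max_len:
            for item in items:
                dfs(pattern + item, s_extend(bm, item_bms[item]))

    for item in items:
        dfs(item, item_bms[item])
    return found


print(mine(min_sup=3, constraint="DF.*", max_len=3))
```

With minSup set to 3 and the constraint DF.*, the sketch reports the patterns DF, DFD, DFF and DFS, each supported by all three students. Patterns here are subsequences, so gaps between the classes are allowed; tying a pattern to specific examinations such as 10th and Diploma is not modeled.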
In data preprocessing, students are classified according to branch and academic year. From this, the student id and the class obtained in each exam are taken as input for mining. After preprocessing, the next step is to let the user supply a regular expression so that only students with the user-specified classes are found. With SPAM and the regular expression constraint, the algorithm finds the frequently occurring sequences of the given classes of students. This is shown in Figure 4.

7. Conclusion

Student performance can be evaluated by mining a student database. Sequential pattern mining plays an important role in discovering frequently occurring sequences in large datasets. The Apriori-based SPAM algorithm is time efficient for a large number of sequences, and adding constraints improves its performance further. We have analyzed SPAM with a regular expression constraint. The analytical study of SPAM shows that it works well for support counting because it represents data in a vertical bitmap structure. By adding a regular expression constraint to SPAM, only patterns of interest to the user are discovered, without wasting search space on uninteresting patterns or sequences. Pruning can be done with both the Apriori principle and the regular expression constraint. Although the vertical bitmap structures make SPAM space inefficient, it is well suited to mining frequent sequences from large datasets such as a student dataset.

References

[1] Agrawal, R. and Srikant, R.: Fast Algorithms for Mining Association Rules in Large Databases. Proceedings of 20th International Conference on Very Large Data Bases, 487–499 (1994).
[2] Agrawal, R. and Srikant, R.: Mining Sequential Patterns: Generalization and Performance Improvements. IBM Almaden Research Center, San Jose, California (1995).
[3] Garofalakis, M. N., Rastogi, R. and Shim, K.: Spirit: Sequential Pattern Mining with Regular Expression Constraints. Proceedings of 25th International Conference on Very Large Data Bases, 223–234 (1999).
[4] Ayres, J., Gehrke, J., Yiu, T. and Flannick, J.: Sequential Pattern Mining using Bitmap representation. In Proceeding of ACM SIGKDD’02, 429–35 (2002).
[5] Pei, J., Han, J. and Wang, W.: Mining Sequential Patterns with Constraints in Large Databases. Proceedings of the Eleventh International Conference on Information and Knowledge Management, New York, NY, USA, ACM Press, 18–25 (2002).
[6] Pei, J., Han, J. and Mortazavi, B.: PrefixSpan: Mining Sequential Patterns Efficiently by Prefix Projected Pattern Growth. ICDE 2001, 215–224 (2001).
[7] Pei, J., Han, J. and Mortazavi, B.: FreeSpan: Frequent Pattern Projected Sequential Pattern Mining. Proceedings of 2000 International Conference on Knowledge Discovery and Data Mining, 355–359 (2000).
[8] Zaki, M. J.: Spade: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning, 31–60 (2001).
[9] Joshua Ho, Lo Lukov and Sanjay Chawla: Sequential Pattern Mining with Constraints on Large Protein Databases. In ICMD (2005).
[10] Zaki, M.: Sequential Mining in Categorical Domains-Incorporating Constraints. In Proceeding of CIKM’00, 422–429 (2000).
[11] Chen, E., Cao, H., Li, Q. and Qian, T.: Efficient Strategies for Tough Aggregate Constraint Based Sequential Pattern Mining. In Information Sciences, 178, 1498–1518 (2008).
[12] Elena Bonalis, Luca Gagliero, Tania Cerquitelli and Paolo Garza: Generalized Association Rule Mining with Constraints. In Information Sciences, 194, 68–84 (2012).
[13] Unil Yun: Mining Lossless Closed Frequent Patterns with Weight Constraints. In Knowledge-Based Systems, 20, 86–97 (2007).
[14] Shigeaki Sakurai, Youichi Kitahara and Ryohei Orihana: Sequential Pattern Mining Based on New Criteria and Attribute Constraint.
[15] Yu Hirate and H. Yamana: Generalized Sequential Pattern Mining with Item Interval. Proceedings of 10th Pacific-Asia Conference on Knowledge Discovery and Data Mining (2006).
[16] Leticia Gomer and Vaisman, A. A.: Re-SPAM: Using Regular Expression for Sequential Pattern Mining in Trajectory Database. IEEE International Conference on Data Mining workshops (2008).
