Performance Evaluation of Students with Sequential Pattern
Mining Algorithm SPAM
Rashmi V. Mane1,∗ and Vijay R. Ghorpade2
1 Shivaji University, Kolhapur, Maharashtra, India.
2 D.Y. Patil College of Engineering, Kolhapur, Maharashtra, India.
e-mail: rvm− [email protected]
Abstract.
Sequential pattern mining is an efficient data mining technique for discovering frequently
occurring sequences in large datasets. Any institute holds a large amount of student data.
Sequential pattern mining plays a vital role in evaluating student performance by academic
year and by the grades obtained in examinations. Many algorithms have been proposed for
sequential pattern mining. Adding constraints such as regular expressions to a pattern mining
algorithm enhances its performance. These algorithms use strategies such as Apriori-based
candidate generation and the pattern growth approach. Among these, SPAM, a candidate
generation algorithm, works better on large datasets than PrefixSpan, one of the pattern
growth algorithms. In this paper, a novel SPAM algorithm with a regular expression constraint
is proposed and analyzed for a student database.
Keywords:
constraints, data mining, sequential pattern mining, SPAM.
1. Introduction
Data mining algorithms are widely used to extract data of interest to the user from large datasets.
Sequential pattern mining is one of the subtasks of data mining [1]. Nowadays, hardly any search
task is carried out without the help of computers, so large amounts of data are collected in these
systems. Pattern mining algorithms play an important role in retrieving user-specific data from
these huge datasets. These algorithms have a wide range of applications in fields such as medicine,
bioinformatics, telecommunication and networking. A number of algorithms have been proposed for
discovering sequences or patterns. Adding user-defined constraints [2,3] to a pattern mining
algorithm improves its performance.
Student data is associated with the class obtained in each examination. Student performance can
be evaluated with pattern mining. The student data is specified as follows:
∗ Corresponding author
K. R. Venugopal, P. Deepa Shenoy and L. M. Patnaik (Eds.) ICDMW 2013, pp. 161–168.
© Elsevier Publications 2013.
The input sequences used in this paper are given in Table 1.
Table 1. Sequences of Grades Obtained by Student.
Student Id    Sequence of Class Obtained
1             DZSFFSFSFFD
2             DFZZZFFFFSD
3             DZFFFSFSFDD
In Table 1 the first column is the student id and the second column is the sequence of classes
obtained by the student up to graduation in Engineering. We have considered students of
B.E./B.Tech. Each sequence in the second column is ordered as 10th, Diploma, 12th, Ist Semester,
IInd Semester, IIIrd Semester, IVth Semester, Vth Semester, VIth Semester, VIIth Semester,
VIIIth Semester. An engineering college has at least four to five branches. The students belong
to four classes: First Year, Second Year, Third Year and Final Year Engineering. The total
strength of an engineering college is about 1500 to 2000 students. If data is collected year-wise
over the last five years, the number of sequences grows tremendously. It is difficult to trace the
performance of a student from 10th to final year engineering in such a large dataset. Pattern
mining algorithms handle this efficiently. By preprocessing the data branch-wise and
academic-year-wise, user-specific data can be discovered easily.
Academic performance is evaluated over the 10th, the 12th, and the Ist to VIIIth semesters of
engineering. Some students join the college directly in the IIIrd semester after a diploma. For
these students the 12th, Ist semester and IInd semester classes are not considered, which is
represented with 'Z'. The classification of student marks is done as follows:
The representation of classes is given in Table 2.
Table 2. Representation of Student Performance.
Class    Meaning of Class
D        Distinction
F        First Class
S        Second Class
P        Pass Class
N        Not Pass/Fail
A        Atkt
X        Absent
Y        Detained
Z        Not considered
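The two tables can be read together: each character of a sequence in Table 1 is a class code from Table 2 at a fixed examination position. A minimal Python sketch of that decoding follows. The names EXAMS, CLASS_MEANING and decode are illustrative assumptions and do not come from the paper.

```python
# Examination order of each sequence, as described for Table 1.
EXAMS = ["10th", "Diploma", "12th",
         "Sem I", "Sem II", "Sem III", "Sem IV",
         "Sem V", "Sem VI", "Sem VII", "Sem VIII"]

# Class codes from Table 2.
CLASS_MEANING = {
    "D": "Distinction", "F": "First Class", "S": "Second Class",
    "P": "Pass Class", "N": "Not Pass/Fail", "A": "Atkt",
    "X": "Absent", "Y": "Detained", "Z": "Not considered",
}

def decode(sequence):
    """Map each examination to the meaning of the class obtained in it."""
    if len(sequence) != len(EXAMS):
        raise ValueError("expected one class code per examination")
    return {exam: CLASS_MEANING[code] for exam, code in zip(EXAMS, sequence)}

record = decode("DFZZZFFFFSD")  # student 2 in Table 1
print(record["Diploma"], "|", record["12th"])
```

For student 2, a diploma entrant, the 12th, Ist semester and IInd semester positions decode to "Not considered".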
As the dataset contains a large number of sequences, sequential pattern mining algorithms can be
used to discover frequently occurring patterns. The Sequential PAttern Mining (SPAM) algorithm
[4,9,11,16] discovers sequential patterns efficiently. For larger numbers of sequences it performs
better than pattern growth algorithms such as PrefixSpan [6].
2. Related Work
Sequential pattern mining is the task of discovering frequent patterns or sequences in large
datasets. It has a wide range of applications, and many algorithms have been proposed to discover
useful patterns. The problem was first proposed for market-basket analysis by Agrawal and
Srikant [1,2]. Many enhancements have since been proposed to improve the efficiency of these
algorithms. The algorithms are generally classified into two types: Apriori-based or candidate
generation algorithms, and algorithms that follow the pattern growth approach. GSP [2],
SPAM [4], SPADE [8] and SPIRIT [3] belong to the candidate generation type. Their drawback is
that many unwanted candidate sequences are generated, which consumes a lot of memory. The
pattern growth algorithms include PrefixSpan [6] and FreeSpan [7]. These algorithms divide the
sequence database into smaller sub-databases and then mine each sub-database. Both approaches
generate many unwanted patterns and spend a lot of time generating them. To improve time
efficiency, constraints are pushed into the algorithms [5]. These constraints are of different
types, such as attribute constraints, time constraints [13,14], aggregate constraints [10], weight
constraints [12] and regular expression constraints [15].
3. Problem Definition
Given a student sequence database D, a user-defined minimum support minSup and a user-specified
constraint expressed as a regular expression, the problem is to find every subsequence x that
satisfies the constraint and for which xs >= minSup. Here xs is the support of subsequence x and
minSup is the minimum number of occurrences of a sequence required in the database.
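The two conditions of the problem definition can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: it assumes that a subsequence may skip positions of the containing sequence, that the regular expression is applied to the candidate x itself, and that Python's re syntax stands in for the paper's notation. The names is_subsequence, support and qualifies are hypothetical.

```python
import re

SEQUENCES = ["DZSFFSFSFFD", "DFZZZFFFFSD", "DZFFFSFSFDD"]  # Table 1

def is_subsequence(x, sequence):
    """True if the items of x occur in sequence in the same order (gaps allowed)."""
    remaining = iter(sequence)
    # 'item in remaining' advances the iterator, so order is enforced.
    return all(item in remaining for item in x)

def support(x, database):
    """Number of sequences in the database that contain x as a subsequence."""
    return sum(1 for sequence in database if is_subsequence(x, sequence))

def qualifies(x, database, min_sup, constraint):
    """Check both conditions: support >= min_sup and x matches the constraint."""
    return (support(x, database) >= min_sup
            and re.fullmatch(constraint, x) is not None)

print(support("DFS", SEQUENCES))
print(qualifies("DFS", SEQUENCES, 3, r"DF.*"))
print(qualifies("FS", SEQUENCES, 3, r"DF.*"))
```

In this sketch "FS" is frequent in all three sequences but is rejected because it does not match the constraint.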
4. Applying SPAM Algorithm
SPAM is an Apriori-based algorithm that generates candidate sequences. It relies on three main
strategies: a lexicographic tree of sequences, depth-first traversal of that tree, and a bitmap
representation for support counting. During candidate generation, sequences are generated with
the sequence-extension step (S-step) and the itemset-extension step (I-step). To improve
performance and reduce the search space, SPAM uses two pruning strategies based on the Apriori
principle, called S-step pruning and I-step pruning.
One of the best features of SPAM is its support counting [4], which uses a vertical bitmap
structure for each item. In Table 1 the maximum sequence length is 11 and the dataset holds only
3 students. Each data item is therefore represented by a bitmap divided into 3 sections, one per
student, with 11 bits in each section. A bit is set to 1 if the item is present at that position
of the sequence and to 0 otherwise. The support of an item is 100 percent if every section of its
bitmap contains at least one 1.
Figure 1. Bitmap representation of data items in Table 1.
In the sequence-extension step (S-step), the bitmap of the sequence is first transformed and then
logically ANDed with the bitmap of the other item. For the transformation, in each section every
bit up to and including the first 1 is set to zero and every bit after that position is set to
one. Suppose the S-step is carried out between items D and F. The vertical bitmap of (D) is
transformed so that the first 1 bit is set to 0 and every later bit, whether 1 or 0, is set to 1.
The transformed bitmap of (D) is then logically ANDed with the bitmap of (F). This is shown in
Figure 2.
For the itemset-extension step (I-step) no transformation is needed. Let (D,F) be the itemset
obtained by the I-step of (D) and (F). The two vertical bitmaps of (D) and (F) are logically
ANDed for the I-step.
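The bitmap operations described above can be sketched in Python on the three sequences of Table 1. This is a simplified illustration under stated assumptions: each section is stored as a plain list of 0/1 values rather than packed machine words, and the function names vertical_bitmap, s_step_transform, bitmap_and and support are hypothetical.

```python
SEQUENCES = ["DZSFFSFSFFD", "DFZZZFFFFSD", "DZFFFSFSFDD"]  # Table 1

def vertical_bitmap(item, sequences):
    """One section per student; bit j is 1 if the item occurs at position j."""
    return [[1 if code == item else 0 for code in seq] for seq in sequences]

def s_step_transform(bitmap):
    """Per section: zero every bit up to and including the first 1, set the rest to 1."""
    transformed = []
    for section in bitmap:
        if 1 in section:
            k = section.index(1)
            transformed.append([0] * (k + 1) + [1] * (len(section) - k - 1))
        else:
            transformed.append([0] * len(section))
    return transformed

def bitmap_and(a, b):
    """Bitwise AND of two vertical bitmaps, section by section."""
    return [[x & y for x, y in zip(sa, sb)] for sa, sb in zip(a, b)]

def support(bitmap):
    """Number of sections (students) that contain at least one 1."""
    return sum(1 for section in bitmap if any(section))

d = vertical_bitmap("D", SEQUENCES)
f = vertical_bitmap("F", SEQUENCES)
s_step = bitmap_and(s_step_transform(d), f)  # D followed later by F
i_step = bitmap_and(d, f)                    # D and F at the same position
print(support(s_step), support(i_step))
```

Because every examination position in Table 1 holds exactly one class, the I-step bitmap of (D) and (F) is all zeros for this sample, while the S-step result has support 3.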
To improve the overall performance of the algorithm and to avoid unnecessary candidate
generation, constraints can be pushed into the algorithm. A regular expression constraint is well
suited to the student dataset. The pruning strategy at the I-step and S-step prunes every node
whose sequence does not satisfy the constraint. During tree building, the regular expression
constraint is enforced while the search tree is traversed. If the patterns of interest are all
students with Distinction in 10th and First Class in Diploma, the regular expression is DF.*.
Tree building starts at the root node, and candidate generation proceeds as shown in Figure 3.
Figure 2. Transformation for S-step.
Figure 3. Lexicographical tree for regular expression (DF.*).
In Figure 3 only the D and F nodes are expanded; all other nodes are pruned because they do not
satisfy the regular expression constraint. Regular expressions can be specified in the same form
as the expressions accepted by a finite state machine.
Table 3. Examples of Regular Expression.
Expression    Meaning of Expression
(DF.*)        All sequences starting with DF
D2            Two occurrences of D in a sequence
D+            One or more number of D
[D|F]         Sequence containing either D or F
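Some of the expressions in Table 3 can be tried against the sequences of Table 1 with Python's re module. This is only a sketch: the translation from the paper's notation to Python syntax is an assumption. In particular, Python writes the character class [D|F] as [DF], because | inside square brackets is a literal character. The helper name select is hypothetical.

```python
import re

SEQUENCES = ["DZSFFSFSFFD", "DFZZZFFFFSD", "DZFFFSFSFDD"]  # Table 1

def select(expression, sequences, whole=True):
    """Return the sequences matched entirely (whole=True) or anywhere inside."""
    match = re.fullmatch if whole else re.search
    return [s for s in sequences if match(expression, s)]

starting_with_df = select(r"DF.*", SEQUENCES)                # (DF.*)
containing_d = select(r"D+", SEQUENCES, whole=False)         # D+
containing_d_or_f = select(r"[DF]", SEQUENCES, whole=False)  # [D|F]

print(starting_with_df)
```

Only the sequence of student 2 starts with DF, while every sequence in Table 1 contains at least one D and at least one F.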
When a regular expression constraint is added to the SPAM algorithm, nodes that do not satisfy
the constraint are pruned during the depth-first search. The DFS is applied recursively until all
nodes have been compared with the minimum support and the regular expression. Only nodes that
satisfy both constraints are stored in the database.
5. SPAM with Regular Expression Constraint
The SPAM algorithm with a pushed regular expression constraint works as follows. During
depth-first traversal of the tree, the support of each node is compared with minSup. If the
support is at least minSup, the node is checked against the regular expression constraint
specified by the user. Only if both conditions are satisfied is the node stored in the database;
otherwise it is not considered further. By the Apriori principle, all of its supersequences, that
is, its descendants in the tree, are then ignored as well. The same technique is used for the
S-step and the I-step. The algorithm is specified in Table 4.
Table 4. Algorithm: SPAM with RegExpr (D, R).
Set all item in D to I
For Each i belongs to I
    call DFS
For each node in tree
    If node satisfies constraint R
        Add node to tree
    Else
        Ignore

Depth First Search
DFS (n, Sn, In)
    For S – step
        For each i belongs to Sn
            If (i is frequent and satisfies R)
                Add i to tree
        DFS for all candidate sequence in a tree with greater length than
    For I – step
        For each i belongs to In
            If (i is frequent and satisfies R)
                Add i to tree
        DFS for all candidate sequence in a tree with greater length than
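The depth-first search with both checks can be sketched in Python for the sample data. This is a simplified illustration, not the algorithm of Table 4 as implemented by the authors. It makes several assumptions: only S-step extensions are generated (each examination holds a single class, so I-step extensions are empty here), support is counted by a plain subsequence scan instead of bitmaps, the constraint has the form of a literal prefix followed by .* (such as DF.*), and a max_len bound is added to keep the output small. The names viable, mine and max_len are hypothetical.

```python
import re

SEQUENCES = ["DZSFFSFSFFD", "DFZZZFFFFSD", "DZFFFSFSFDD"]  # Table 1
ITEMS = "DFSPNAXYZ"                                        # Table 2

def support(pattern, sequences):
    """Number of sequences that contain pattern as a subsequence."""
    count = 0
    for seq in sequences:
        remaining = iter(seq)
        if all(item in remaining for item in pattern):
            count += 1
    return count

def viable(pattern, literal_prefix):
    """True if pattern can still grow into (or already is) a match of prefix.*"""
    n = min(len(pattern), len(literal_prefix))
    return pattern[:n] == literal_prefix[:n]

def mine(sequences, min_sup, literal_prefix, max_len):
    """Depth-first S-step extension with support and constraint pruning."""
    constraint = re.compile(re.escape(literal_prefix) + ".*")
    found = {}

    def dfs(pattern):
        for item in ITEMS:
            candidate = pattern + item
            if not viable(candidate, literal_prefix):
                continue                      # pruned by the constraint
            sup = support(candidate, sequences)
            if sup < min_sup:
                continue                      # pruned by the Apriori principle
            if constraint.fullmatch(candidate):
                found[candidate] = sup        # frequent and satisfies R
            if len(candidate) < max_len:
                dfs(candidate)

    dfs("")
    return found

print(mine(SEQUENCES, min_sup=3, literal_prefix="DF", max_len=3))
```

At the first level only the node D survives the constraint check, and below it only DF, which corresponds to the expansion described for Figure 3.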
6. Mining User-Interesting Patterns From Student Database
The first process in any mining task is data collection. Here the data consists of student
records with all marks from S.S.C. to engineering. In data preprocessing, students are classified
according to branch and academic year. From these records, the student id and the class obtained
in each examination are taken as input for mining. After preprocessing, the user supplies a
regular expression so that only students with the specified classes are found. With SPAM and the
regular expression constraint, the algorithm finds the frequently occurring sequences of the
given classes. This is shown in Figure 4.
Figure 4. Steps for Mining Student Database.
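The preprocessing step, classifying students by branch and academic year before mining, might look like the following Python sketch. The record layout, the field names and the branch and year values are all invented for illustration; only the class sequences are those of Table 1.

```python
from collections import defaultdict

# Hypothetical records: branch and year values are invented for illustration.
RECORDS = [
    {"student_id": 1, "branch": "CSE", "year": 2013, "classes": "DZSFFSFSFFD"},
    {"student_id": 2, "branch": "MECH", "year": 2013, "classes": "DFZZZFFFFSD"},
    {"student_id": 3, "branch": "CSE", "year": 2013, "classes": "DZFFFSFSFDD"},
]

def partition(records):
    """Group (student id, class sequence) pairs by (branch, academic year)."""
    groups = defaultdict(list)
    for record in records:
        key = (record["branch"], record["year"])
        groups[key].append((record["student_id"], record["classes"]))
    return dict(groups)

databases = partition(RECORDS)
print(sorted(databases))
```

Each group is then a separate sequence database that can be mined with the user's regular expression.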
7. Conclusion
Student performance can be evaluated by mining a student database. Sequential pattern mining
plays an important role in discovering frequently occurring sequences in large datasets. The
Apriori-based SPAM algorithm is time efficient for large numbers of sequences, and adding
constraints improves its performance further. We have analyzed SPAM with a regular expression
constraint. The analytical study shows that SPAM counts support efficiently because it represents
data in a vertical bitmap structure. By adding a regular expression constraint to SPAM, only
patterns of interest to the user are discovered, without searching the large space of
uninteresting patterns or sequences. Pruning uses both the Apriori principle and the regular
expression constraint. Although the vertical bitmap structures make SPAM space inefficient, it is
well suited to mining frequent sequences from large datasets such as student datasets.
References
[1] Agrawal, R. and Srikant, R.: Fast Algorithms for Mining Association Rules in Large Databases.
Proceedings of 20th International Conference on Very Large Data Bases, 487–499 (1994).
[2] Agrawal, R. and Srikant, R.: Mining Sequential Patterns: Generalization and Performance Improvements.
IBM Almaden Research Center, San Jose, California (1995).
[3] Garofalakis, M. N., Rastogi, R. and Shim, K.: Spirit: Sequential Pattern Mining with Regular Expression
Constraints. Proceedings of 25th International Conference on Very Large Data Bases, 223–234 (1999).
[4] Ayres, J., Gehrke, J., Yiu, T. and Flannick, J.: Sequential Pattern Mining using Bitmap representation.
In Proceeding of ACM SIGKDD’02, 429–35 (2002).
[5] Pei, J., Han, J. and Wang, W.: Mining Sequential Patterns with Constraints in Large Databases. Proceedings of the Eleventh International Conference on Information and Knowledge Management, New York,
NY, USA, ACM Press, 18–25 (2002).
[6] Pei, J., Han, J. and Mortazavi, B.: PrefixSpan: Mining Sequential Patterns Efficiently by Prefix Projected
Pattern Growth. ICDE 2001, 215–224 (2001).
[7] Pei, J., Han, J. and Mortazavi, B.: FreeSpan: Frequent Pattern Projected Sequential Pattern Mining.
Proceedings of 2000 International Conference on Knowledge Discovery and Data Mining, 355–359
(2000).
[8] Zaki, M. J.: Spade: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning, 31–60
(2001).
[9] Ho, J., Lukov, L. and Chawla, S.: Sequential Pattern Mining with Constraints on Large Protein
Databases. In ICMD (2005).
[10] Zaki, M.: Sequential Mining in Categorical Domains-Incorporating Constraints. In Proceeding of
CIKM’00, 422–429 (2000).
[11] Chen, E., Cao, H., Li, Q. and Qian, T.: Efficient Strategies for Tough Aggregate Constraint Based
Sequential Pattern Mining. Information Sciences, 178, 1498–1518 (2008).
[12] Baralis, E., Cagliero, L., Cerquitelli, T. and Garza, P.: Generalized Association Rule Mining
with Constraints. Information Sciences, 194, 68–84 (2012).
[13] Yun, U.: Mining Lossless Closed Frequent Patterns with Weight Constraints. Knowledge-Based
Systems, 20, 86–97 (2007).
[14] Sakurai, S., Kitahara, Y. and Orihana, R.: Sequential Pattern Mining Based on New
Criteria and Attribute Constraint.
[15] Hirate, Y. and Yamana, H.: Generalized Sequential Pattern Mining with Item Interval. Proceedings of 10th
Pacific-Asia Conference on Knowledge Discovery and Data Mining (2006).
[16] Gómez, L. and Vaisman, A. A.: Re-SPAM: Using Regular Expression for Sequential Pattern Mining
in Trajectory Database. IEEE International Conference on Data Mining workshops (2008).