International Journal of Applied Engineering Research
ISSN 0973-4562 Volume 10, Number 1 (2015) pp. 1807-1815
© Research India Publications
http://www.ripublication.com
A Survey of Sequence Patterns in Data Mining Techniques
S. Muthuselvan (1), Dr. K. Soma Sundaram (2)
(1) Research Scholar (CSE), St. Peter's University, Chennai, and Assistant Professor Gr.-II, Aarupadai Veedu Institute of Technology, Paiyanoor, Chennai.
(2) Professor, Jaya Engineering College, Chennai.
[email protected], [email protected]
Abstract
Data mining techniques are used in many areas to retrieve useful knowledge from very large amounts of data. Sequential pattern mining is an important data mining technique with a wide range of applications, including weblog click streams, DNA sequences, sales analysis, telephone calling patterns, and stock markets. The methods for sequential pattern mining fall into two categories: Apriori-based approaches and pattern-growth-based approaches. In this paper, a systematic review of sequential pattern mining algorithms is carried out. Finally, a comparative study is made on the basis of the key features supported by the various algorithms, and current research challenges in this area of data mining are discussed.
The survey examines these algorithms through a classification of sequential pattern mining algorithms into two broad classes: first, algorithms designed to increase the efficiency of mining, and second, the various extensions of sequential pattern mining designed for particular applications [4].
Keywords: Data Mining, Sequence Pattern, Association Rule, Pattern Mining.
Introduction
In the knowledge discovery process, data mining techniques are divided into two major categories: descriptive and predictive. Each category has its own types of approaches.
Sequential pattern mining is an important concept in data mining and a further extension of association rule mining [1]. The set of sequences in the given data is called the data-sequences. A customer's list of transactions is a data sequence, and each transaction is a set of items. Each transaction is associated with a transaction time in the sequence database. Association rule mining and sequential pattern mining are broadly similar; the difference is that in sequential pattern mining the events are linked by time. Sequential pattern mining finds correlations between different transactions, whereas association rule mining finds associations among items within the same transaction [2].
The rest of this paper is organized as follows: Section II deals with the categories of sequential pattern mining techniques, Section III discusses the limitations of sequential pattern mining algorithms, Section IV presents the comparative analysis of sequential pattern mining algorithms, and Section V presents a performance-based comparative study. Finally, the conclusions are presented.
Problem definition: Let I = {i1, i2, …, in} be the set of all items. An itemset is a non-empty set of items. A sequence is an ordered list of itemsets. A sequence is denoted by <s1, s2, …, sl>, where sj is an itemset, i.e., sj ⊆ I for 1 ≤ j ≤ l. sj is also called an element of the sequence and is denoted as (x1, x2, …, xm), where xk ∈ I for 1 ≤ k ≤ m. The number of instances of items in a sequence is called the length of the sequence. A sequence with length l is called an l-sequence. A sequence a = <a1, a2, …, an> is called a subsequence of b = <b1, b2, …, bm>, and b a supersequence of a, denoted as a ⊆ b, if there exist integers 1 ≤ j1 < j2 < … < jn ≤ m such that a1 ⊆ bj1, a2 ⊆ bj2, …, an ⊆ bjn.
A sequence database D is a set of tuples <sid, s>, where sid is a sequence-id and s is a sequence. A tuple <sid, s> is said to contain a sequence a if a is a subsequence of s, i.e., a ⊆ s. The number of tuples in a sequence database D containing sequence a is called the support of a, denoted as sup(a) [22].
Given a sequence database D and a user-specified minimum support min_sup, a sequence a is a sequential pattern in D if sup(a) ≥ min_sup. The sequential pattern mining problem is to find the complete set of sequential patterns with respect to D and min_sup.
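The definitions above can be illustrated with a minimal Python sketch. It is not taken from any of the surveyed papers: the names is_subsequence, support and seq, the one-letter items, and the small database D are assumptions chosen for illustration only.

```python
from typing import FrozenSet, List, Sequence, Tuple

Itemset = FrozenSet[str]
Seq = Sequence[Itemset]


def is_subsequence(a: Seq, b: Seq) -> bool:
    """Return True if a ⊆ b: the itemsets of a map, in order, into
    itemsets of b at strictly increasing positions."""
    j = 0
    for element in a:
        # Greedily take the earliest itemset of b that contains this element.
        while j < len(b) and not element <= b[j]:
            j += 1
        if j == len(b):
            return False
        j += 1  # the next element must match at a strictly later position
    return True


def support(a: Seq, database: List[Tuple[int, Seq]]) -> int:
    """Count the tuples <sid, s> whose sequence s contains a."""
    return sum(1 for _sid, s in database if is_subsequence(a, s))


def seq(*elements: str) -> List[Itemset]:
    """Hypothetical helper: seq("a", "bc") builds <(a)(b,c)> from one-letter items."""
    return [frozenset(e) for e in elements]


# Illustrative sequence database with three tuples <sid, s>.
D = [
    (1, seq("a", "abc", "ac", "d")),
    (2, seq("ad", "c", "bc")),
    (3, seq("ef", "ab", "df")),
]
pattern = seq("a", "c")          # the sequence <(a)(c)>
min_sup = 2
print(support(pattern, D) >= min_sup)  # is <(a)(c)> a sequential pattern in D?
```

Matching each element at the earliest possible position is sufficient here, because any later match for one element can only leave fewer positions for the elements that follow.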
Categories of sequence pattern mining Techniques
As noted by Yen-Liang Chen and Ya-Han Hu [4], several methods for sequential pattern mining have been proposed in recent years, and these studies cover a wide range of problems. In general, there are two main research concerns in the area of sequential pattern mining. The first is to improve the efficiency of the sequential pattern mining process; the second is to extend the mining of sequential patterns to other time-related patterns.
Based on the research done in the field, sequential pattern mining algorithms differ in two ways [3]: first, in how candidate sequences are generated and stored, and second, in how support is counted and how candidate sequences are tested for frequency. The main goal of the first is to reduce the number of candidate sequences generated, so that the I/O cost is reduced. The main goal of the second is to eliminate any database or data structure that has to be maintained all the time for support counting purposes only. The main benefits and shortcomings of sequential pattern mining are listed in Table 1.
[Figure: taxonomy tree. Sequential Pattern Mining divides into Apriori-Based Algorithms (breadth-first search, generate-and-test, multiple scans of the database) and Pattern Growth Algorithms. Algorithms shown: GSP, SPIRIT, SPADE, SPAM, FREESPAN, WAP-MINE, PREFIXSPAN.]
Fig. 1 Categories of Sequential Pattern Mining [6]
Fig. 1 shows the broad categories of sequential pattern mining algorithms. The two main types are Apriori-based and pattern-growth. Each type includes several algorithms.
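To make the Apriori-based category concrete, the following is a minimal sketch of level-wise generate-and-test mining. It is a simplification and not GSP itself: it assumes every element of a sequence is a single item (so a sequence is written as a string), and the names contains and levelwise_mine are hypothetical.

```python
from typing import Dict, List, Tuple

Pattern = Tuple[str, ...]


def contains(pattern: Pattern, sequence: str) -> bool:
    """True if the pattern's items occur in the sequence in order (gaps allowed)."""
    it = iter(sequence)
    return all(item in it for item in pattern)


def levelwise_mine(database: List[str], min_sup: int) -> Dict[Pattern, int]:
    """Level-wise generate-and-test over sequences of single items."""
    if min_sup < 1:
        raise ValueError("min_sup must be at least 1")
    items = sorted({x for s in database for x in s})
    candidates: List[Pattern] = [(x,) for x in items]
    frequent: Dict[Pattern, int] = {}
    while candidates:
        # Test step: one pass over the database per level to count support.
        counts = {c: sum(contains(c, s) for s in database) for c in candidates}
        level = {c: n for c, n in counts.items() if n >= min_sup}
        frequent.update(level)
        # Generate step: join frequent k-patterns p and q when p without its
        # first item equals q without its last item, giving a (k+1)-candidate.
        candidates = [p + (q[-1],) for p in level for q in level if p[1:] == q[:-1]]
    return frequent


database = ["abcd", "acd", "abd", "bcd"]   # illustrative data
patterns = levelwise_mine(database, min_sup=3)
print(sorted(patterns.items()))
```

Each level costs one scan of the database, and the second level alone tests every ordered pair of frequent items, which is where the candidate-generation overhead of this family comes from.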
Table 1 Pros and Cons of Sequential Pattern Mining
Type/Technique | Pros | Cons
Apriori-Based Algorithms [5] | Easy to implement. | Needs more memory and storage space, and candidate generation takes more time.
Pattern Growth Algorithms [7] | Can be faster when given a large volume of data. | Normally more complex to develop, test and maintain.
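The pattern-growth idea can be sketched in a few lines. This is a simplified illustration in the spirit of PrefixSpan, not the published algorithm: it assumes sequences of single items written as strings and builds each projected database as an explicit list of suffixes. The name prefix_growth is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple

Pattern = Tuple[str, ...]


def prefix_growth(database: List[str], min_sup: int) -> Dict[Pattern, int]:
    """Pattern growth over sequences of single items, using projected databases."""
    if min_sup < 1:
        raise ValueError("min_sup must be at least 1")
    frequent: Dict[Pattern, int] = {}

    def grow(prefix: Pattern, projected: List[str]) -> None:
        # Count each item once per projected sequence (support, not occurrences).
        counts = Counter(item for suffix in projected for item in set(suffix))
        for item, count in sorted(counts.items()):
            if count < min_sup:
                continue
            pattern = prefix + (item,)
            frequent[pattern] = count
            # Project: keep what follows the first occurrence of the item.
            suffixes = [s[s.index(item) + 1:] for s in projected if item in s]
            grow(pattern, suffixes)

    grow((), database)
    return frequent


database = ["abcd", "acd", "abd", "bcd"]   # illustrative data
patterns = prefix_growth(database, min_sup=3)
print(sorted(patterns.items()))
```

No candidate is ever generated that does not occur in the projected data, which is the main contrast with the generate-and-test approach; the cost moves into building the projected databases instead.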
Limitations of Sequential Pattern Mining Algorithms
Sequential pattern mining algorithms are typically string-oriented. They do not focus on discovering sequential patterns under constraints in a given database. In query languages such as SQL (for example, in MySQL), the use of non-aggregate functions is not permitted as part of query compilation [19].
Sequential pattern mining retrieves the relationships among objects in a sequential dataset [18]. The most familiar approach to sequential pattern mining is Apriori. This algorithm has drawbacks: it generates too many candidate sets and needs many passes over the database. Another disadvantage is that it requires a large amount of memory [20].
The task of discovering all frequent sequences in large databases is quite challenging, because the search space is extremely large [21].
Table 1 Comparative analysis of algorithm performance [3]. The symbol "-" means an algorithm crashes with the parameters provided, and memory usage could not be measured. [3]
Algorithm: GSP (Apriori); SPAM (Apriori); PrefixSpan (Pattern Growth)
Data set size (for each algorithm): Medium (D=200K); Large (D=800K)
Minimum support (for each data set size): Low (0.1%); Medium (1%)
Execution time (sec): >3600; 2126; 136; 674; 31; 5; 1958; 798
Memory usage (MB): 800; 687; 574; 1052; 13; 10; 525; 320
Comparative Analysis of Sequential Pattern Mining Algorithms
Sequential pattern mining is very significant because it is the foundation of numerous applications. A sequential mining algorithm should discover the complete set of patterns, when possible, that satisfy the minimum support. When working with big data, scalability is one of the important issues in mining knowledge from huge amounts of data. This issue is addressed with the MapReduce model on the cloud: the SPAM algorithm significantly decreases mining time on big data and attains extremely high scalability [11]. The important and familiar algorithm for mining data is Apriori. With this algorithm, finding sequential patterns in d-dimensional sequence data is not possible. With the PrefixMDSpan algorithm, sequential patterns can be retrieved from d-dimensional data [14]. The generate-and-test approach generates a huge number of unpromising candidate subsequences. This is overcome by applying the maximum weighted upper-bound model, which gives good pruning efficiency and also improves performance [17].
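The general map-and-reduce shape of support counting over a partitioned sequence database can be sketched as follows. This is only an in-process illustration in plain Python, not the method of [11] and not a real MapReduce job; it counts the support of single items only, and the names map_partition, reduce_counts and global_support are assumptions.

```python
from collections import Counter
from functools import reduce
from typing import Iterable, List


def map_partition(partition: Iterable[str]) -> Counter:
    """Map step: local support counts of single items in one partition."""
    local: Counter = Counter()
    for sequence in partition:
        local.update(set(sequence))  # each item counted once per sequence
    return local


def reduce_counts(left: Counter, right: Counter) -> Counter:
    """Reduce step: merge local counts into global support counts."""
    return left + right


def global_support(partitions: List[List[str]]) -> Counter:
    """Apply the map step to every partition, then fold the results together."""
    return reduce(reduce_counts, map(map_partition, partitions), Counter())


# Illustrative database split into two partitions.
partitions = [["abcd", "acd"], ["abd", "bcd"]]
print(global_support(partitions))
```

Because each partition is counted independently and the merge is a simple sum, the map step can in principle run on separate machines, which is the property that scalability on big data relies on.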
Pattern-growth algorithms create a huge number of repeated projected databases when mining data sets. This is overcome by the SMPM algorithm [13], which avoids repeated projected databases and avoids physical projection. The greedy algorithm raises issues in sensor network applications because of multiple interleaved patterns. The GAIS method [15] finds sequential patterns in low-quality data. The Frequent Pattern Tree is another approach to finding patterns in sequence mining. It scans the database many times, which makes it time-consuming compared with other types of algorithm. Yi Sui, FengJing Shao, Rencheng Sun and Jinlong Wang used the STMFP algorithm, which needs to scan the database only once. After this single scan, the tree stores all the sequences from the source data [9].
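How a tree can hold every sequence after a single scan can be shown with a generic prefix tree. This is not the STMFP algorithm of [9], whose details are not given in this survey; it is a plain prefix tree over sequences of single items, and the names Node, build_tree and prefix_count are hypothetical.

```python
from typing import Dict, Iterable


class Node:
    """One tree node: a count plus children keyed by the next item."""

    def __init__(self) -> None:
        self.count = 0
        self.children: Dict[str, "Node"] = {}


def build_tree(database: Iterable[str]) -> Node:
    """Insert every sequence in a single pass; shared prefixes share nodes."""
    root = Node()
    for sequence in database:
        node = root
        for item in sequence:
            node = node.children.setdefault(item, Node())
            node.count += 1
    return root


def prefix_count(root: Node, prefix: str) -> int:
    """Number of stored sequences that start with the given non-empty prefix."""
    node = root
    for item in prefix:
        if item not in node.children:
            return 0
        node = node.children[item]
    return node.count


root = build_tree(["abcd", "abd", "acd", "bcd"])   # illustrative data
print(prefix_count(root, "ab"))
```

The counts stored here are prefix counts, not subsequence supports; a mining algorithm built on such a tree still has to traverse it to find patterns with gaps, but it no longer needs to reread the database.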
Association rule mining is an important type of algorithm in the Apriori family of mining methods. The Apriori-based association rule algorithm uses a single minimum support, and a single minimum support cannot exactly discover the interesting patterns. The MSCP-growth algorithm allows multiple minimum supports [10], and using several minimum supports helps to reveal the interesting patterns. Xilu Wang and Weill Yao used optimum maximum sequence pattern mining to obtain sequential patterns. The advantage of this algorithm is that the acquired sequential patterns are very reliable. The existing mathematical models for mining sequential patterns fail on noisy data with fewer candidate patterns [12].
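The effect of multiple minimum supports can be illustrated with a small sketch. The rule used here, that a pattern's threshold is the lowest minimum support among its items, is a common convention in the multiple-minimum-support literature and is an assumption; it is not claimed to be the exact rule of MSCP-growth [10]. The item names and threshold values are hypothetical.

```python
from typing import Dict, Tuple

Pattern = Tuple[str, ...]


def pattern_threshold(pattern: Pattern, item_min_sup: Dict[str, int]) -> int:
    """Threshold for a pattern: the lowest minimum support among its items
    (assumed rule, see the lead-in)."""
    return min(item_min_sup[item] for item in pattern)


def is_frequent(pattern: Pattern, support: int, item_min_sup: Dict[str, int]) -> bool:
    """Compare an observed support count against the pattern's own threshold."""
    return support >= pattern_threshold(pattern, item_min_sup)


# Hypothetical per-item thresholds: "bread" is common, "caviar" is rare.
item_min_sup = {"bread": 50, "milk": 40, "caviar": 3}

print(is_frequent(("bread", "caviar"), 5, item_min_sup))  # rare item lowers the bar
print(is_frequent(("bread", "milk"), 5, item_min_sup))    # common items keep it high
```

With a single minimum support of 40, the pattern containing the rare item would be discarded even though five occurrences may be significant for that item; a single low threshold would instead flood the result with patterns over common items.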
The multidimensional attributes of learning material are not all considered at the same time in the modified Apriori and PrefixSpan algorithms. This is overcome by the Learner Preference Tree (LPT) algorithm, whose advantage is that the learner's actual learning preferences can be met accurately [16]. Mining patterns from incremental data is very difficult to handle. This problem is solved by the Direct Appending (DirApp) algorithm, which can easily deal with incremental data as well as a static database [8].
Performance Based Comparative Study
Table 1 above gives a comparative analysis of different algorithms based on their performance in sequential mining. The algorithms were studied on data sets of different sizes. The parameters reported are minimum support, execution time, and memory usage. Execution time is measured in seconds and memory usage in megabytes. The data sets fall into two size categories: medium and large.
The data set size is denoted by D. For all the algorithms, D is either medium (200K) or large (800K). The minimum support for each algorithm is either low or medium, for both the medium and the large data sets.
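A percentage minimum support translates into an absolute number of sequences once the data set size is fixed. The short sketch below works this out for the percentages listed in Table 1 (0.1% and 1%), under the assumption that D counts sequences and that K means thousands; the function name absolute_min_sup is hypothetical.

```python
import math
from fractions import Fraction


def absolute_min_sup(num_sequences: int, percent: str) -> int:
    """Smallest sequence count that meets a percentage minimum support.

    The percentage is passed as a string so that Fraction keeps it exact.
    """
    return math.ceil(num_sequences * Fraction(percent) / 100)


for d in (200_000, 800_000):          # medium and large data sets (assumed counts)
    for percent in ("0.1", "1"):      # low and medium minimum support
        print(d, percent, absolute_min_sup(d, percent))
```

Under these assumptions, low support on the medium data set means a pattern must occur in at least 200 sequences, while medium support on the large data set requires 8000.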
The longest execution time was taken by the GSP (Apriori) algorithm: more than 3600 sec, with a memory usage of 800 MB, on the medium-size data set at low minimum support. The PrefixSpan (pattern-growth) algorithm has a much shorter execution time. Its execution time is 5 sec, reported for the large data set at medium minimum support; the corresponding memory usage is listed in Table 1.
Table 2 Comparative Analysis of Sequential Pattern Mining Algorithms
Reference Paper | Author | Methodology Used | Algorithm Used | Existing System | Proposed System
ABCF [16] | Mojtaba Salehi, Isa Nakhai Kamalabadi and Mohammad Bagher Ghaznavi Ghoushchi | Modified Apriori and PrefixSpan algorithms | Learner Preference Tree (LPT) | Multidimensional attributes of materials are not all considered concurrently. | Learner's actual learning preferences can be met accurately.
OMSPM [12] | Xilu Wang and Weill Yao | Mathematical model | Optimum maximum sequence pattern mining | Noisy data, with fewer candidate patterns. | Acquired sequential patterns are reliable.
MMS [10] | Ya-Han Hu, Fan Wu and Yi-Chun Liao | Apriori-based association rule mining | MSCP-Growth | Single minimum support cannot exactly discover interesting patterns. | Numerous minimum supports possible.
SPMPD [8] | Jen-Wei Huang, Chi-Yao Tseng, Jian-Chih Ou and Ming-Syan Chen | - | Direct Appending (DirApp) | Dealing with incremental data is difficult. | It can easily deal with a static database or an incremental database as well.
IFPT [9] | Yi Sui, FengJing Shao, Rencheng Sun and Jinlong Wang | Frequent Pattern Tree | STMFP algorithm | Need to scan the database many times. | After the single scan, the tree can store all the sequences.
GAIS [15] | Ruotsalainen, M., Ala-Kleemola, T. and Visa | Greedy algorithm | GAIS method | Issues in sensor network applications are multiple interleaved patterns. | Sequential patterns can be identified from low-quality data.
WSP [17] | Guo-Cheng Lan, Tzung-Pei Hong and Hong-Yu Lee | Generate-and-test | Maximum weighted upper-bound model | Generates a large number of unpromising candidate subsequences. | Good pruning efficiency and performance efficiency.
MRMC [11] | Chun-Chieh Chen, Chi-Yao Tseng and Ming-Syan Chen | MapReduce model on the cloud | SPAM algorithm | Scalability issues while working with big data. | Significantly decreases mining time with big data; attains extremely high scalability.
MDSD [14] | Chung-Ching Yu and Yen-Liang Chen | Apriori (AprioriMD) | PrefixMDSpan algorithm | Finding sequential patterns from d-dimensional sequence data is not possible. | Mining d-dimensional sequence data is possible where d>2.
SEME [13] | Yong-Gui Zou and Hong Yu | Pattern growth | SMPM | Creates a huge amount of repeated projected databases in mining data sets. | Avoids repeated projected databases and avoids physical projection.
Conclusions
In this paper, we discussed sequential pattern mining and briefly presented its major categories. Some of the types of algorithm were compared with the help of previously completed work. Research on this topic was initially driven by improving algorithm performance through different data structures and representations. We compared the different types of algorithm used for mining sequential patterns and discussed a comparative analysis of their performance. From the discussion of the pros and cons of sequential mining, the strengths and limitations of each approach can be readily identified. The comparison of the different methodologies and their algorithms was discussed in detail.
References
[1] J. Han and M. Kamber, "Data Mining: Concepts and Techniques", Morgan Kaufmann Publishers, 2001.
[2] Vishal S. Motegaonkar and Madhav V. Vaidya, "A Survey on Sequential Pattern Mining Algorithms", International Journal of Computer Science and Information Technologies, Vol. 5 (2), 2014, pp. 2486-2492.
[3] Nizar R. Mabroukeh and C. I. Ezeife, "A Taxonomy of Sequential Pattern Mining Algorithms", ACM Computing Surveys, Vol. 43, No. 1, Article 3, November 2010.
[4] J. Pei, J. Han, B. Mortazavi-Asl, J. Wang, H. Pinto, Q. Chen, U. Dayal and M.-C. Hsu, "Mining sequential patterns by pattern-growth: The PrefixSpan approach", IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 11, 2004, pp. 1424-1440.
[5] R. J. Hilderman and H. J. Hamilton, "Knowledge Discovery and Interest Measures", Kluwer Academic Publishers, Boston, 2002.
[6] V. Chandra Shekhar Rao and P. Sammulal, "Survey on Sequential Pattern Mining Algorithms", International Journal of Computer Applications (0975-8887), Volume 76, No. 12, August 2013.
[7] Carl H. Mooney and John F. Roddick, ACM Journal Name, Vol. V, No. N, M 20YY, Pages 1-46.
[8] Jen-Wei Huang, Chi-Yao Tseng, Jian-Chih Ou and Ming-Syan Chen, "A General Model for Sequential Pattern Mining with a Progressive Database", IEEE Trans. Knowledge and Data Eng., Volume 20, Issue 9, pp. 1153-1167, September 2008.
[9] Yi Sui, FengJing Shao, Rencheng Sun and Jinlong Wang, "A Sequential Pattern Mining Algorithm Based on Improved FP-tree", Ninth ACIS International Conference on Software Engineering, Artificial Intelligence, Networking, and Parallel/Distributed Computing (SNPD 2008), 2008, pp. 440-444.
[10] Ya-Han Hu, Fan Wu and Yi-Chun Liao, "Sequential pattern mining with multiple minimum supports: A tree based approach", 2nd International Conference on Software Engineering and Data Mining (SEDM), 2010, pp. 428-433.
[11] Chun-Chieh Chen, Chi-Yao Tseng and Ming-Syan Chen, "Highly Scalable Sequential Pattern Mining Based on MapReduce Model on the Cloud", IEEE International Congress on Big Data (Big Data Congress), 2013, pp. 310-317.
[12] Xilu Wang and Weill Yao, "Sequential Pattern Mining: Optimum Maximum Sequential Patterns and Consistent Sequential Patterns", IEEE International Conference on Integration Technology, 2007, pp. 365-368.
[13] Yong-Gui Zou and Hong Yu, "Moving sequential pattern mining based on Spatial Constraints in Mobile Environment", IEEE International Conference on Intelligent Computing and Intelligent Systems (ICIS), 2010, pp. 103-107.
[14] Chung-Ching Yu and Yen-Liang Chen, "Mining Sequential Patterns from Multidimensional Sequence Data", IEEE Transactions on Knowledge and Data Engineering, vol. 17, no. 1, January 2005.
[15] M. Ruotsalainen, T. Ala-Kleemola and A. Visa, "GAIS: A Method for Detecting Interleaved Sequential Patterns from Imperfect Data", IEEE Symposium on Computational Intelligence and Data Mining, 2007, pp. 530-534.
[16] Mojtaba Salehi, Isa Nakhai Kamalabadi and Mohammad Bagher Ghaznavi Ghoushchi, "Personalized recommendation of learning material using sequential pattern mining and attribute based collaborative filtering", Education and Information Technologies, December 2014, Volume 19, Issue 4, pp. 713-735.
[17] Guo-Cheng Lan, Tzung-Pei Hong and Hong-Yu Lee, "An efficient approach for finding weighted sequential patterns from sequence databases", Applied Intelligence, September 2014, Volume 41, Issue 2, pp. 439-452.
[18] Thanh-Trung Nguyen and Phi-Khu Nguyen, "A New Approach for Problem of Sequential Pattern Mining", Computational Collective Intelligence. Technologies and Applications, Lecture Notes in Computer Science, Volume 7653, 2012, pp. 51-60.
[19] Vangipuram Radhakrishna, Chintakindi Srinivas and C. V. Guru Rao, "Constraint Based Sequential Pattern Mining in Time Series Databases - A Two Way Approach", AASRI Conference on Intelligent Systems and Control, AASRI Procedia 4 (2013), pp. 313-318.
[20] Shamila Nasreen, Muhammad Awais Azam, Khurram Shehzad, Usman Naeem and Mustansar Ali Ghazanfar, "Frequent Pattern Mining Algorithms for Finding Associated Frequent Patterns for Data Streams: A Survey", The 5th International Conference on Emerging Ubiquitous Systems and Pervasive Networks (EUSPN-2014), Procedia Computer Science 37 (2014), pp. 109-116.
[21] Mohammed J. Zaki, "SPADE: An Efficient Algorithm for Mining Frequent Sequences", Machine Learning, Kluwer Academic Publishers, 42, pp. 31-60, 2001.
[22] Ramakrishnan Srikant and Rakesh Agrawal, "Mining Sequential Patterns: Generalizations and Performance Improvements", Advances in Database Technology - EDBT '96, Lecture Notes in Computer Science, Springer, Volume 1057, 1996, pp. 1-17.