Download 33. MINING OF SEQUENTIAL PATTERNS WITH A PROGRESSIVE

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Proceedings of the International Conference , “Computational Systems and Communication Technology”
5TH MAY 2010 - by Einstein College of Engineering,
Tirunelveli-Tamil Nadu,PIN-627 012,INDIA
MINING OF SEQUENTIAL PATTERNS WITH A PROGRESSIVE DATABASE
EFFICIENTLY
D. Hepsibha Pearl 1, Dr.S.Selvan2
1
Student II M.E.CSE, 2 Principal,
Francis Xavier Engineering College, Tirunelveli
[email protected],
and sequential pattern mining, to name a few
Abstract:- In this paper, a new algorithm
[3],
[4],
[5].
Among
others,
finding
using fast version of Pisa, FastPisa is
sequential patterns has attracted a significant
proposed to mine the frequent sequences
amount of research attention. Sequential
efficiently which requires less memory
pattern mining was first addressed in [1] as
space than the existing algorithms. FastPisa
the problem: “Given a sequence database,
discovers sequential patterns in defined time
where each sequence consists of a list of
period of interest (POI). FastPisa utilizes a
ordered item sets containing a set of
progressive sequential tree to efficiently
different items, and a user defined minimum
maintain the latest data sequences and delete
support threshold, sequential pattern mining
obsolete data and patterns accordingly. Fast
is to find all subsequences whose occurrence
Pisa only inserts elements which contain a
frequencies are no less than the threshold
single item into Progressive Sequential Tree.
from the set of sequences.” The formal
The number of nodes stored in PS-Tree is
definition is as follows.
reduced significantly in FastPisa, thus
consuming less memory space than Pisa by
Definition 1. Let I = {x1, x2,…, xn} be a set
orders of magnitude.
of different items. An element e, denoted by
Index Terms:Progressive sequential pattern
(xixj….), is a subset of items cI which appear
at the same time. A sequence s, denoted by <
1. Introduction:
e1,e2,…,em > , is an ordered list of elements.
There have been many recent research
A sequence database Db contains a set of
works on data mining area on discovering
sequences, and |Db| represents the number
interesting but unknown knowledge from a
of
sequences
α=<a1,a2,…,an>
in
Db.
is
a
A
sequence
subsequence
of
large amount of data. The data mining
another sequence β =< b1,b2,..., bm > if
techniques include association rules mining,
there exist a set of integers, 1≤ i1 < i2 < ….<
classification, clustering, mining time series,
Proceedings of the International Conference , “Computational Systems and Communication Technology”
5TH MAY 2010 - by Einstein College of Engineering,
Tirunelveli-Tamil Nadu,PIN-627 012,INDIA
in ≤m, such that a1 cbi1, a2 cbi2,… and an cbin.
followed by drapes and rules" is an example
The sequential pattern mining can be
of a sequential pattern in which the elements
defined as “Given a sequence database Db
are sets of items.
and a user-defined minimum support min
sup, find the complete set of subsequences
The assumption of having a static database
whose occurrence frequencies ≥ min sup *
may not hold in practice. The data in real
|Db|.”
world changes on the fly. Moreover, finding
sequential
patterns
in
an
incremental
Database mining is motivated by the
database may cause lack of interest to the
decision support problem faced by most
users.
large retail organizations. Progress in bar-
generated, the newly arriving patterns may
code technology has made it possible for
not be identified as frequent sequential
retail organizations to collect and store
patterns due to the existence of old data and
massive amounts of sales data, referred to as
sequences.
the basket data. A record in such data
sequential patterns that are not frequent
typically consists of the transaction date and
recently may stay in the reported results.
the items bought in the transaction. Very
The incremental mining algorithms do not
often, data records also contain customer-id,
consider the deletion of the obsolete data
particularly when the purchase has been
from the sequence database. That is, these
made using a credit card or a frequent-buyer
works are not applicable to a progressive
card. Catalog companies also collect such
database. It is noted that users are usually
data using the orders they receive. We
more interested in the recent data than the
introduce the problem of mining sequential
old ones. However, if a certain sequence
patterns over this data. An example of such
does not have any newly arriving elements,
a pattern is that customers typically rent
this sequence will still stay in the database
“Star Wars", then “Empire Strikes Back",
and
and then “Return of the Jedi". Note that
Therefore, when new sequential patterns are
these rentals need not be consecutive.
generated, the new patterns which appear
Customers who rent some other videos in
frequently in the recent sequences may not
between also support this sequential pattern.
be considered as frequent sequential patterns
Elements of a sequential pattern need not be
because |Db| is never reduced. In view of
simple items. Fitted Sheet and at sheet and
this, the infrequent sequential patterns
pillow cases", followed by comforter",
whose timestamps are obsolete should be
When
sequential
Even
undesirably
worse,
patterns
the
contribute
are
obsolete
to
|Db|.
removed. Sequential pattern mining with a
Proceedings of the International Conference , “Computational Systems and Communication Technology”
5TH MAY 2010 - by Einstein College of Engineering,
Tirunelveli-Tamil Nadu,PIN-627 012,INDIA
progressive database is widely used in many
progressively discover sequential patterns in
fields.
recent time period of interest.
For
example,
prediction
of
prefetching data to a mobile user on the
wireless gateway is an essential application
The progressive sequential pattern mining
[2], [9], [10]. Whenever a mobile user asks
deals with a progressive database, which not
an item from a wireless gateway, the
only adds new data to the original database
prefetching system decides which other
but also removes obsolete data from the
items the mobile user may want. By an
database. The sequential pattern mining with
accurate and predictive mechanism, the
a static database finds the sequential patterns
prefetching
significantly
in the database in which data do not change
improve the query latency to mobile users.
over time. On the other hand, the sequential
In [6] and [7], Pandey et al. have proved that
pattern mining with an incremental database
prefetching through the web log data of
corresponds to the mining process where
dynamic real-time rules will be more precise
there are new data arriving as time goes by
than static usage log on the web server. In
(i.e., the sequences database is incremental).
system
can
addition, online link recommendation in
adaptive hypermedia educational system is
For the sequential pattern mining with a
based on the mining results of users’ recent
progressive database, new data are added
sequential queries [34]. The applications
into the database and obsolete data are
mentioned above are suitable to apply
removed simultaneously. Therefore, one can
progressive
find the most up-to-date sequential patterns
sequential
pattern
mining
techniques.
without being influenced by obsolete data. It
is noted that the sequential pattern mining
with a
2. Pisa:
static
database
and
with an
incremental database are both special cases
When sequential patterns are generated, the
of the progressive sequential pattern mining.
newly
be
If the obsolete data are not deleted from the
identified as frequent sequential patterns due
database, the proposed algorithms for the
to the existence of old data and sequences.
progressive sequential pattern mining can
In practice, users are usually more interested
solve the incremental sequential pattern
in the recent data than the old ones.
mining problem. In addition, if the database
Sequential pattern mining with a progressive
does not have new data and does not delete
database capture the dynamic nature of data
obsolete data, the progressive sequential
addition and deletion. Progressive concept,
pattern mining algorithm can also deal with
arriving
patterns
may
not
Proceedings of the International Conference , “Computational Systems and Communication Technology”
5TH MAY 2010 - by Einstein College of Engineering,
Tirunelveli-Tamil Nadu,PIN-627 012,INDIA
the static sequential pattern mining problem.
rapidly. Hence, users can identify the latest
Fig1 shows an example database.
information without influences of old data
and recognize the most up-to-date sequential
patterns.
Definition 3. PS-tree represents elements in
the sequence, based on the sequence IDs
and timestamps recorded in the nodes and
Fig1: An example database
the newly arriving data of the progressive
database at each timestamp. PS-tree not
Progressive mInning of Sequential patterns,
PISA takes the concept of time period of
interest (POI) and Progressive Sequential
tree (PS-Tree).
only stores the elements and timestamps of
sequences in each POI but also efficiently
accumulates the occurrence frequency of
every candidate sequential pattern at the
same time.
Definition 2. POI is a sliding window,
whose length is a userspecified time
interval, continuously advancing as the time
goes by. The sequences having elements
whose timestamps fall into this period, POI,
contribute to the |Db| for current sequential
patterns. On the other hand, the sequences
having only elements with timestamps older
The height of PS-tree is bounded by the
length of POI, and PS-tree combines the
same sequential patterns of all sequences
together. By changing Start time and End
time of the POI, Pisa can easily deal with a
static database or an incremental database as
well.
than POI should be pruned away from the
sequence database immediately and will not
3. Fast Pisa:
contribute to the |Db| thereafter.
Pisa with a very slight approximation,
Once the POI is taken into account, the
sequences that do not have any newly
arriving data can be deleted, and the |Db| of
recent sequential patterns can, thus, be
treated correctly. Consequently, obsolete
sequential patterns will be pruned away
enhance the performance of Pisa. For every
arriving element with more than one item,
Pisa needs to insert all combinations of
elements into PS-tree. For example, when
we receive an element having (ABC), all
combinations of elements are A, B, C, (AB),
(AC), (BC), and (ABC). The number of all
Proceedings of the International Conference , “Computational Systems and Communication Technology”
5TH MAY 2010 - by Einstein College of Engineering,
Tirunelveli-Tamil Nadu,PIN-627 012,INDIA
combinations of elements for an arriving
element having |T| items is 2|T| - 1 and these
NPisa = (L(L + 1))/2 * (1 - (1 - Rseq)
|D|
elements will induce huge computation and
*( P *( 2|T| - 1))L ----------------------(1)
)/Rseq
large space in the following timestamps
during each POI. In view of this, we
In addition to the merit of requiring a
employed a slight approximation to speed up
smaller number of elements needed to be
the algorithm and reduce the memory usage.
stored in PS-tree, Pisa utilizes an efficient
traversing method to traverse PS-tree while
4. Complexity Analysis:
dealing
with
each
new
element
and
comparing to the old elements. Therefore,
Pisa combines the same elements of
the execution time of Pisa can be much
different sequences in the same nodes. The
better than the one of DirApp. FastPisa
similarity ratio of the elements in a pair of
reduces the number of combinations for
sequences is denoted by Rseq. That is, Rseq =
each new element, which is stored in PS-
Nsame/N, where Nsame is the number of
tree. Thus, the average number of different
common elements in these two sequences.
combinations of elements appearing in the
Then, the number of different elements in a
whole POI in one sequence becomes (P *
sequence from the elements in the other
(|T| - 1))L. Hence, the average number of
sequence is N * (1 - Rseq). The number of
nodes stored in fast Pisa is
different elements in the kth sequence from
the elements in the other k - 1 sequences is
NFastPisa = (L(L+ 1))/2*(1 - (1 - Rseq)
N *(1 - Rseq)k-1. The total number of different
(P * (|T| - 1))L
|D|
)Rseq*
---------------------(2)
elements in all of the |D| sequences is, thus,
According to (1) and (2), FastPisa does save
N +N *(1 _ Rseq) +N * (1 - Rseq)2 +….. + N*
lots of space from Pisa and, hence,
(1 - Rseq)|D|-1 = N *(1 - (1 - Rseq)|D|) /Rseq
outperforms the comparative algorithms in
execution time.
Considering the tree structure, the average
number of nodes stored in Pisa is, thus,
5. Conclusion:
(L + (L - 1) + (L - 2)+……..+1) * N *(1- (1
If we insert only individual items A, B, and
|D|
- Rseq) )/Rseq
C instead of all combinations of elements
into PS-tree, the sequential patterns formed
Therefore, the average number of node is
by these elements with only a single item
Proceedings of the International Conference , “Computational Systems and Communication Technology”
5TH MAY 2010 - by Einstein College of Engineering,
Tirunelveli-Tamil Nadu,PIN-627 012,INDIA
can still be generated. Therefore, users will
2. M.-S. Chen, J. Han, and P.S. Yu, “Data
not lose any information of these elements
Mining:
with only a single item. For example,
Perspective,” IEEE Trans. Knowledge and
assume there will be sequential patterns XA,
Data Eng., vol. 8, no. 6, pp. 866-883, Dec.
XB, XC, and X(AB), where X is the prefix
1996
An
Overview
from
Database
set of elements of these patterns. According
to the slight modification, the sequential
3. A. Balachandran, G.M. Voelker, P. Bahl,
patterns formed by the elements with only a
and P.V. Rangan, “Characterizing User
single item, XA, XB, and XC can still be
Behavior and Network Performance in a
generated. Users can get the information that
Public
A, B, and C appear following X frequently.
SIGMETRICS Int’l Conf. Measurement and
The only pattern which cannot be found is
Modeling
X(AB). That is, the only information loss is
(SIGMETRICS ’02), June 2002
Wireless
of
LAN,”
Proc.
Computer
ACM
Systems
that the users are not able to know whether
A and B appear together following X
4. U.M. Fayyad, G. Piatetsky-Shapiro, P.
frequently or not. In practice, although, the
Smyth, and R. Uthurasamy, “Advances in
numbers of combinations of elements in
Knowledge Discovery and Data Mining,”
candidate sequential patterns are quite large,
MIT Press,1996
there are seldom frequent sequential patterns
whose elements contain more than one item
5. J. Han and M. Kamber, Data Mining:
at the same time. Therefore, the information
Concepts
loss is marginal. Specifically, assume the
Kaufmann, 2000.
and
Techniques.
Morgan
length of POI is L. From the design of PStree, the cost of space and time can be
6. A.B. Pandey, J. Srivastava, and S.
reduced by a factor of (2|T| - 1)L-|T|L, which
Shekhar, “Web Proxy Server with Intelligent
is about in orders of 10 to 100 times in
Prefetcher
practice, by this approximation.
Association Rules,” Technical Report 01-
for
Dynamic
Pages
Using
004, Univ. of Minnesota, Jan. 2001
6. References:
7. A.B. Pandey, R.R. Vatsavai, X. Ma, J.
1. R. Agrawal and R. Srikant, “Mining
Srivastava, and S. Shekhar, “Data Mining
Sequential Patterns,” Proc. 11th Int’l Conf.
for Intelligent Web Prefetching,” Proc.
Data Eng. (ICDE ’95), pp. 3-14, Feb. 1995.
Workshop Mining Data Across Multiple
Proceedings of the International Conference , “Computational Systems and Communication Technology”
5TH MAY 2010 - by Einstein College of Engineering,
Tirunelveli-Tamil Nadu,PIN-627 012,INDIA
Customer Touchpoints for CRM (MDCRM
’02), May 2002.
8. C.Romero, S.Ventura, J.A.Delgado, &
P.D.Bra,
“Personalized
Links
Recommendation Based on Data Mining.In
Adaptive
Educational
Systems,”
Proc.2
Hypermedia
nd
EuroConf.
TechEnhancedLearning(EC TEL’07)
9. Q. Yang and H.H. Zhang, “Web-Log
Mining for Predictive Web Caching,” IEEE
Trans. Knowledge and Data Eng., vol. 15,
no. 4, pp. 1050-1053, July/Aug. 2003
10. Q. Yang, H.H. Zhang, and T. Li,
“Mining Web Logs for Prediction Models in
www Caching and Prefetching,” Proc.
Seventh
ACM
SIGKDD
Int’l
Conf.
Knowledge Discovery and Data Mining
(KDD’01), pp. 473-478,Aug. 2001