Proceedings of the International Conference, "Computational Systems and Communication Technology", 5th May 2010, by Einstein College of Engineering, Tirunelveli, Tamil Nadu, PIN-627 012, India

MINING OF SEQUENTIAL PATTERNS WITH A PROGRESSIVE DATABASE EFFICIENTLY

D. Hepsibha Pearl 1, Dr. S. Selvan 2
1 Student, II M.E. CSE, 2 Principal, Francis Xavier Engineering College, Tirunelveli
[email protected]

Abstract:- In this paper, a new algorithm, FastPisa, a fast version of Pisa, is proposed to mine frequent sequences efficiently while requiring less memory space than the existing algorithms. FastPisa discovers sequential patterns in a defined time period of interest (POI). FastPisa utilizes a progressive sequential tree to efficiently maintain the latest data sequences and delete obsolete data and patterns accordingly. FastPisa only inserts elements which contain a single item into the Progressive Sequential Tree. The number of nodes stored in the PS-tree is reduced significantly in FastPisa, thus consuming less memory space than Pisa by orders of magnitude.

Index Terms: Progressive sequential pattern

1. Introduction:

There have been many recent research works in the data mining area on discovering interesting but unknown knowledge from a large amount of data. The data mining techniques include association rules mining, classification, clustering, mining time series, and sequential pattern mining, to name a few [3], [4], [5]. Among others, finding sequential patterns has attracted a significant amount of research attention. Sequential pattern mining was first addressed in [1] as the problem: "Given a sequence database, where each sequence consists of a list of ordered item sets containing a set of different items, and a user defined minimum support threshold, sequential pattern mining is to find all subsequences whose occurrence frequencies are no less than the threshold from the set of sequences." The formal definition is as follows.

Definition 1. Let $I = \{x_1, x_2, \ldots, x_n\}$ be a set of different items. An element $e$, denoted by $(x_i x_j \ldots)$, is a subset of items $\subseteq I$ which appear at the same time. A sequence $s$, denoted by $\langle e_1, e_2, \ldots, e_m \rangle$, is an ordered list of elements. A sequence database $Db$ contains a set of sequences, and $|Db|$ represents the number of sequences in $Db$. A sequence $\alpha = \langle a_1, a_2, \ldots, a_n \rangle$ is a subsequence of another sequence $\beta = \langle b_1, b_2, \ldots, b_m \rangle$ if there exists a set of integers $1 \le i_1 < i_2 < \ldots < i_n \le m$ such that $a_1 \subseteq b_{i_1}$, $a_2 \subseteq b_{i_2}$, ..., and $a_n \subseteq b_{i_n}$. Sequential pattern mining can be defined as "Given a sequence database $Db$ and a user-defined minimum support min_sup, find the complete set of subsequences whose occurrence frequencies are $\ge$ min_sup $\cdot |Db|$."

Database mining is motivated by the decision support problem faced by most large retail organizations. Progress in bar-code technology has made it possible for retail organizations to collect and store massive amounts of sales data, referred to as the basket data. A record in such data typically consists of the transaction date and the items bought in the transaction. Very often, data records also contain a customer-id, particularly when the purchase has been made using a credit card or a frequent-buyer card. Catalog companies also collect such data using the orders they receive. We introduce the problem of mining sequential patterns over this data. An example of such a pattern is that customers typically rent "Star Wars", then "Empire Strikes Back", and then "Return of the Jedi". Note that these rentals need not be consecutive. Customers who rent some other videos in between also support this sequential pattern. Elements of a sequential pattern need not be simple items. "Fitted sheet and flat sheet and pillow cases", followed by "comforter", followed by "drapes and ruffles" is an example of a sequential pattern in which the elements are sets of items.

The assumption of having a static database may not hold in practice. Data in the real world change on the fly. Moreover, the sequential patterns found in an incremental database may be of little interest to the users. When sequential patterns are generated, the newly arriving patterns may not be identified as frequent sequential patterns due to the existence of old data and sequences. Even worse, obsolete sequential patterns that are not frequent recently may stay in the reported results. The incremental mining algorithms do not consider the deletion of obsolete data from the sequence database. That is, these works are not applicable to a progressive database. It is noted that users are usually more interested in the recent data than the old ones. However, if a certain sequence does not have any newly arriving elements, this sequence will still stay in the database and undesirably contribute to $|Db|$. Therefore, when new sequential patterns are generated, the new patterns which appear frequently in the recent sequences may not be considered as frequent sequential patterns because $|Db|$ is never reduced. In view of this, the infrequent sequential patterns whose timestamps are obsolete should be removed.

Sequential pattern mining with a progressive database is widely used in many fields. For example, predicting which data to prefetch for a mobile user on a wireless gateway is an essential application [2], [9], [10]. Whenever a mobile user requests an item from a wireless gateway, the prefetching system decides which other items the mobile user may want. With an accurate predictive mechanism, the prefetching system can significantly improve the query latency for mobile users. In [6] and [7], Pandey et al. have shown that prefetching through the web log data of dynamic real-time rules will be more precise than the static usage log on the web server. In addition, online link recommendation in adaptive hypermedia educational systems is based on the mining results of users' recent sequential queries [34]. The applications mentioned above are suitable for progressive sequential pattern mining techniques.

2. Pisa:

When sequential patterns are generated, the newly arriving patterns may not be identified as frequent sequential patterns due to the existence of old data and sequences. In practice, users are usually more interested in the recent data than the old ones. Sequential pattern mining with a progressive database captures the dynamic nature of data addition and deletion. The progressive concept is to progressively discover sequential patterns in the recent time period of interest.

Progressive sequential pattern mining deals with a progressive database, which not only adds new data to the original database but also removes obsolete data from the database. Sequential pattern mining with a static database finds the sequential patterns in a database in which data do not change over time. On the other hand, sequential pattern mining with an incremental database corresponds to the mining process where new data arrive as time goes by (i.e., the sequence database is incremental).

For sequential pattern mining with a progressive database, new data are added into the database and obsolete data are removed simultaneously. Therefore, one can find the most up-to-date sequential patterns without being influenced by obsolete data. It is noted that sequential pattern mining with a static database and with an incremental database are both special cases of progressive sequential pattern mining. If the obsolete data are not deleted from the database, the proposed algorithms for progressive sequential pattern mining can solve the incremental sequential pattern mining problem. In addition, if the database does not have new data and does not delete obsolete data, the progressive sequential pattern mining algorithm can also deal with the static sequential pattern mining problem. Fig. 1 shows an example database.

Fig. 1: An example database

Progressive mining of sequential patterns (Pisa) uses the concepts of the time period of interest (POI) and the Progressive Sequential tree (PS-tree).

Definition 2. POI is a sliding window, whose length is a user-specified time interval, continuously advancing as time goes by. The sequences having elements whose timestamps fall into this period, POI, contribute to the $|Db|$ for current sequential patterns. On the other hand, the sequences having only elements with timestamps older than POI should be pruned away from the sequence database immediately and will not contribute to the $|Db|$ thereafter.

Once the POI is taken into account, the sequences that do not have any newly arriving data can be deleted, and the $|Db|$ of recent sequential patterns can, thus, be treated correctly. Consequently, obsolete sequential patterns will be pruned away rapidly. Hence, users can identify the latest information without the influence of old data and recognize the most up-to-date sequential patterns.

Definition 3. PS-tree represents elements in the sequence, based on the sequence IDs and timestamps recorded in the nodes and the newly arriving data of the progressive database at each timestamp. PS-tree not only stores the elements and timestamps of sequences in each POI but also efficiently accumulates the occurrence frequency of every candidate sequential pattern at the same time.

The height of the PS-tree is bounded by the length of POI, and the PS-tree combines the same sequential patterns of all sequences together. By changing the start time and end time of the POI, Pisa can easily deal with a static database or an incremental database as well.

3. Fast Pisa:

FastPisa is Pisa with a very slight approximation that enhances the performance of Pisa. For every arriving element with more than one item, Pisa needs to insert all combinations of elements into the PS-tree. For example, when we receive an element having (ABC), all combinations of elements are A, B, C, (AB), (AC), (BC), and (ABC). The number of all combinations of elements for an arriving element having $|T|$ items is $2^{|T|} - 1$, and these elements will induce huge computation and large space in the following timestamps during each POI. In view of this, we employ a slight approximation to speed up the algorithm and reduce the memory usage.

4. Complexity Analysis:

Pisa combines the same elements of different sequences in the same nodes. The similarity ratio of the elements in a pair of sequences is denoted by $R_{seq}$. That is, $R_{seq} = N_{same}/N$, where $N_{same}$ is the number of common elements in these two sequences. Then, the number of elements in a sequence that differ from the elements in the other sequence is $N (1 - R_{seq})$. The number of elements in the $k$th sequence that differ from the elements in the other $k - 1$ sequences is $N (1 - R_{seq})^{k-1}$. The total number of different elements in all of the $|D|$ sequences is, thus,

$$N + N(1 - R_{seq}) + N(1 - R_{seq})^{2} + \ldots + N(1 - R_{seq})^{|D|-1} = \frac{N \left(1 - (1 - R_{seq})^{|D|}\right)}{R_{seq}}$$

Considering the tree structure, the average number of nodes stored in Pisa is, thus,

$$\left(L + (L - 1) + (L - 2) + \ldots + 1\right) \cdot \frac{N \left(1 - (1 - R_{seq})^{|D|}\right)}{R_{seq}}$$

Taking the average number of different combinations of elements appearing in the whole POI in one sequence as $N = \left(P (2^{|T|} - 1)\right)^{L}$, the average number of nodes is therefore

$$N_{Pisa} = \frac{L(L + 1)}{2} \cdot \frac{1 - (1 - R_{seq})^{|D|}}{R_{seq}} \cdot \left(P (2^{|T|} - 1)\right)^{L} \qquad (1)$$

In addition to the merit of requiring a smaller number of elements to be stored in the PS-tree, Pisa utilizes an efficient traversing method to traverse the PS-tree while dealing with each new element and comparing it to the old elements. Therefore, the execution time of Pisa can be much better than that of DirApp. FastPisa reduces the number of combinations for each new element that is stored in the PS-tree. Thus, the average number of different combinations of elements appearing in the whole POI in one sequence becomes $\left(P (|T| - 1)\right)^{L}$. Hence, the average number of nodes stored in FastPisa is

$$N_{FastPisa} = \frac{L(L + 1)}{2} \cdot \frac{1 - (1 - R_{seq})^{|D|}}{R_{seq}} \cdot \left(P (|T| - 1)\right)^{L} \qquad (2)$$

According to (1) and (2), FastPisa saves a large amount of space compared with Pisa and, hence, outperforms the comparative algorithms in execution time.

5. Conclusion:

If we insert only the individual items A, B, and C instead of all combinations of elements into the PS-tree, the sequential patterns formed by these elements with only a single item can still be generated. Therefore, users will not lose any information about these elements with only a single item. For example, assume there will be sequential patterns XA, XB, XC, and X(AB), where X is the prefix set of elements of these patterns. With this slight modification, the sequential patterns formed by the elements with only a single item, XA, XB, and XC, can still be generated. Users can get the information that A, B, and C appear following X frequently. The only pattern which cannot be found is X(AB). That is, the only information loss is that the users are not able to know whether A and B appear together following X frequently or not. In practice, although the number of combinations of elements in candidate sequential patterns is quite large, there are seldom frequent sequential patterns whose elements contain more than one item at the same time. Therefore, the information loss is marginal. Specifically, assume the length of POI is $L$. From the design of the PS-tree, the cost of space and time can be reduced by a factor of $(2^{|T|} - 1)^{L} - |T|^{L}$, which is on the order of 10 to 100 times in practice, by this approximation.

6. References:

1. R. Agrawal and R. Srikant, "Mining Sequential Patterns," Proc. 11th Int'l Conf. Data Eng. (ICDE '95), pp. 3-14, Feb. 1995.

2. M.-S. Chen, J. Han, and P.S. Yu, "Data Mining: An Overview from Database Perspective," IEEE Trans. Knowledge and Data Eng., vol. 8, no. 6, pp. 866-883, Dec. 1996.

3. A. Balachandran, G.M. Voelker, P. Bahl, and P.V. Rangan, "Characterizing User Behavior and Network Performance in a Public Wireless LAN," Proc. ACM SIGMETRICS Int'l Conf. Measurement and Modeling of Computer Systems (SIGMETRICS '02), June 2002.

4. U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, Advances in Knowledge Discovery and Data Mining. MIT Press, 1996.

5. J. Han and M. Kamber, Data Mining: Concepts and Techniques. Morgan Kaufmann, 2000.

6. A.B. Pandey, J. Srivastava, and S. Shekhar, "Web Proxy Server with Intelligent Prefetcher for Dynamic Pages Using Association Rules," Technical Report 01-004, Univ. of Minnesota, Jan. 2001.

7. A.B. Pandey, R.R. Vatsavai, X. Ma, J. Srivastava, and S. Shekhar, "Data Mining for Intelligent Web Prefetching," Proc. Workshop Mining Data Across Multiple Customer Touchpoints for CRM (MDCRM '02), May 2002.

8. C. Romero, S. Ventura, J.A. Delgado, and P.D. Bra, "Personalized Links Recommendation Based on Data Mining in Adaptive Educational Hypermedia Systems," Proc. 2nd Euro. Conf. Tech. Enhanced Learning (EC-TEL '07).

9. Q. Yang and H.H. Zhang, "Web-Log Mining for Predictive Web Caching," IEEE Trans. Knowledge and Data Eng., vol. 15, no. 4, pp. 1050-1053, July/Aug. 2003.

10. Q. Yang, H.H. Zhang, and T. Li, "Mining Web Logs for Prediction Models in WWW Caching and Prefetching," Proc. Seventh ACM SIGKDD Int'l Conf. Knowledge Discovery and Data Mining (KDD '01), pp. 473-478, Aug. 2001.
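
As an illustrative supplement to the discussion of POI pruning, element insertion, and node counts, the following minimal Python sketch may help make the ideas concrete. It is not the authors' implementation and does not build a PS-tree. The function names (`pisa_elements`, `fastpisa_elements`, `prune_outside_poi`, `estimated_nodes`) and the data layout (a dictionary mapping a sequence ID to a list of `(timestamp, element)` pairs) are assumptions made only for this example. The sketch also assumes integer timestamps and a POI that covers the most recent `poi_length` timestamps, and it drops elements outside the window as well as sequences left with no elements. The node estimate takes the per-sequence count N as a parameter rather than fixing a particular expression for it.

```python
from itertools import combinations


def pisa_elements(element):
    """All non-empty item combinations of an arriving element.

    Illustrates the 2^|T| - 1 elements that the text says Pisa inserts.
    """
    items = sorted(element)
    return [frozenset(c)
            for r in range(1, len(items) + 1)
            for c in combinations(items, r)]


def fastpisa_elements(element):
    """Only the single-item elements, as in the FastPisa approximation."""
    return [frozenset([item]) for item in sorted(element)]


def prune_outside_poi(db, now, poi_length):
    """Keep elements whose timestamps fall inside the POI.

    Assumption: the POI covers timestamps now - poi_length + 1 .. now.
    Sequences with no element left in the window are removed, so the
    size of the result plays the role of |Db| for the current POI.
    """
    start = now - poi_length + 1
    pruned = {}
    for seq_id, seq in db.items():
        recent = [(t, e) for (t, e) in seq if start <= t <= now]
        if recent:
            pruned[seq_id] = recent
    return pruned


def estimated_nodes(poi_length, num_sequences, r_seq, n_per_sequence):
    """Average node count of the form L(L+1)/2 * N * (1 - (1 - Rseq)^|D|) / Rseq.

    n_per_sequence is N, the number of different element combinations
    per sequence; the caller chooses the value for Pisa or FastPisa.
    """
    tree_factor = poi_length * (poi_length + 1) / 2
    distinct = n_per_sequence * (1 - (1 - r_seq) ** num_sequences) / r_seq
    return tree_factor * distinct
```

For an arriving element (ABC), `pisa_elements` yields seven elements while `fastpisa_elements` yields three, and (AB) appears only in the former, which corresponds to the X(AB) information loss described in the conclusion. Passing the same `poi_length`, `num_sequences`, and `r_seq` to `estimated_nodes` with two different values of N gives a rough feel for how strongly the per-sequence combination count drives the estimated tree size.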