Proceedings of the First Workshop on Knowledge Economy and Electronic Commerce

An Algorithm To Discover Time-Interval Sequential Patterns In Sequence Databases

Y.L. Chen, M.C. Chiang and M. T. Kao
Department of Information Management
National Central University
E-mail: [email protected]

Abstract

Mining sequential patterns means discovering the frequently occurring subsequences hidden in a sequence database. Although conventional sequential patterns tell us the order of the items, they do not tell us how much time passes before the next item occurs; that is, they provide no time intervals between successive items. This paper therefore studies the mining of sequential patterns that take time intervals into account, called time-interval sequential patterns. An efficient algorithm for mining time-interval sequential patterns is developed by modifying the traditional Apriori algorithm.

Keywords: Sequential patterns; Sequence data; Data mining; Time interval

1. Introduction

Data mining extracts implicit, previously unknown and potentially useful information from databases [5, 9]. The discovered information and knowledge are useful for various applications, such as market analysis, decision support, flaw detection and business management. Because of this importance, many approaches have been proposed to extract information, and mining sequential patterns is one of the most important. This approach was introduced in the mid-1990s, and its goal is to discover patterns that occur frequently in a sequence database [4, 11]. A typical sequential pattern is that a customer who has bought a computer comes back later to buy a scanner and a microphone.
Although this kind of sequential pattern tells us the order of the items, it does not tell us how much time passes before the next item occurs; that is, it provides no time intervals between successive items. This paper therefore addresses a new problem: finding sequential patterns with time intervals. We propose a new kind of pattern, the time-interval sequential pattern, which tells us not only the order of the items but also the time intervals between successive items. The following are examples of time-interval sequential patterns: (a) having bought a laser printer, a customer comes back to buy a scanner in three months and then a CD burner in six months; (b) a customer revisits website A within a week; (c) after operation X, the patient is very likely to be infected by virus Y in two weeks.

The time-interval sequential pattern provides more valuable information than the traditional sequential pattern. Take the retail business as an example: with the help of time-interval sequential patterns, the retailer learns not only the habits, interests and needs of his customers but also the timing of their purchases. As a result, the retailer can mail the most suitable catalogues to different types of customers at the right time, or determine when to place orders, for which products and in what quantities, so that future demand can be satisfied. The time-interval sequential pattern thus allows a business to provide the right products and services to the right customers at the right time. Besides retail data, time-interval sequential patterns can also be mined from many other kinds of data, such as the criminal records of a police department, the traveler records of a travel agency, the diagnosis records of a hospital, and other business records.
In all these applications, the discovered time-interval sequential patterns can provide useful information for decision making.

The goal of this paper is to propose an algorithm for mining time-interval sequential patterns from sequence data. Section 2 reviews previous research on mining sequential patterns and discusses some of its potential applications. Section 3 formally defines the problem and the time-interval sequential pattern. Section 4 gives an algorithm that finds time-interval sequential patterns by modifying the traditional Apriori algorithm. Finally, Section 5 gives our conclusion.

2. Previous Research

The problem of mining sequential patterns was first introduced by Agrawal and Srikant [4] and can be described as follows. We are given a sequence database, which consists of a set of data sequences. Each data sequence contains a series of transactions, ordered by their transaction times. The goal is to find all subsequences whose ratios of appearance in the database meet the minimum support threshold.

Since Agrawal and Srikant [4], a number of more efficient algorithms have been proposed for this problem [14, 19, 20, 22]. Others have extended sequential pattern mining to analyze other time-related patterns. Important work in this category includes finding frequent episodes in event sequences [17], finding cyclic patterns in time-stamped transaction databases [12, 13, 16], finding similar patterns in time-series databases [2, 8, 15], finding traversal patterns in web logs [6, 7, 18], and finding sequential alarm patterns in a telecommunication database [21].

Previous research typically deals with time intervals in one of two ways: (a) the time-window approach, or (b) ignoring the time interval completely.
First, the time-window approach requires the length of the time window to be specified in advance. A sequential pattern mined from the database is then a sequence of windows, where each window contains a set of items. Items that appear in the same time window are bought in the same time period. For example, [17] has to specify the window width (win) in order to find the episodes that are frequent in event sequences. That work can find serial episodes, in which, within the time range win, a happens, then b follows, and c happens last; or parallel episodes, in which, within the time range win, a, b and c happen in no specific order. A serial episode indicates only that a, b and c happen in a given order within win; the time intervals between them are unknown. A parallel episode indicates only that a, b and c happen within win; neither their order nor the time intervals between them are known.

The work in [21] specifies an urgent window δ beforehand, meaning that, in a sequential alarm pattern, the time interval between adjacent alarm events lies within the time range δ. For example, if δ is set to six hours, the sequential alarm pattern (a, b, c) indicates that a, b and c happen in order and that the time intervals between a and b and between b and c are each less than six hours. However, this approach cannot discover other possible time-interval patterns between the events.

Lastly, the work in [20] has to specify the maximum time interval (max-interval), the minimum time interval (min-interval), and the sliding time window size (window-size). For example, suppose we specify max-interval as thirty days, min-interval as zero days, and window-size as seven days.
If the resulting pattern is ((a, b) (c, d)), then we know that a and b happen within seven days of each other, that c and d also happen within seven days of each other, and that the time interval between (a, b) and (c, d) is within thirty days. Nevertheless, this pattern tells us neither the order of a and b nor that of c and d. Moreover, the time interval between (a, b) and (c, d) is constrained by a single fixed range, whereas in reality other values of the time interval may exist between (a, b) and (c, d); this approach cannot find patterns with other time intervals.

The second way of dealing with time is employed by [4, 14, 19]. These methods mine traditional sequential patterns, which capture only the temporal order of the items. For example, the sequential pattern (d, e, f) tells us only that d happens, then e, and lastly f.

From the above discussion, we see that none of the previous research makes explicit the time intervals between successive items. This paper therefore proposes a new approach that shows these intervals explicitly: we learn not only the order in which the items happen but also the time intervals between them. We believe that mining sequential patterns with time intervals can find more meaningful patterns and provide more valuable information. Because previous research cannot find patterns that include the time intervals between successive items, this paper develops an efficient method to do so by modifying the traditional Apriori approach.

3. Problem Definition

Traditionally, the data sequence of a customer is represented as an ordered list of itemsets, where each itemset has a transaction time attached. This paper represents the sequence in another way: a sequence A is represented as ((a_1, t_1), (a_2, t_2), (a_3, t_3), …, (a_n, t_n)), where a_j is an item and t_j is the time at which a_j happens for 1 ≤ j ≤ n, and t_{j-1} ≤ t_j for 2 ≤ j ≤ n.
Items in the sequence that occur at the same time are ordered alphabetically. This representation is equivalent to the traditional one. A sequence in the traditional format can be transformed to the new format by sorting all items first by time and then alphabetically. Likewise, a sequence in the new format can be transformed to the traditional format by first combining the items that occur at the same time into an itemset and then sorting these itemsets by time.

In addition, let t denote the length of the time interval between two successive items, and let T_k be given constants for 1 ≤ k ≤ r-1, with 0 < T_1 < T_2 < … < T_{r-1}. Then we divide the time interval into r+1 ranges:

• I_0 stands for the time interval t satisfying t = 0;
• I_1 stands for the time interval t satisfying 0 < t ≤ T_1;
• I_j stands for the time interval t satisfying T_{j-1} < t ≤ T_j for 1 < j ≤ r-1;
• I_r stands for the time interval t satisfying T_{r-1} < t < ∞.

Let the set of time intervals be denoted as TI = {I_0, I_1, I_2, …, I_r}. We can then define the time-interval sequential pattern as follows.

Definition 1. Let I = {i_1, i_2, …, i_m} be the set of all items and TI = {I_0, I_1, I_2, …, I_r} be the set of time intervals. A sequence B = (b_1, &_1, b_2, &_2, …, b_{s-1}, &_{s-1}, b_s) is a time-interval sequence if b_i ∈ I for 1 ≤ i ≤ s and &_i ∈ TI for 1 ≤ i ≤ s-1.

Definition 2. For a sequence A = ((a_1, t_1), (a_2, t_2), (a_3, t_3), …, (a_n, t_n)) and a time-interval sequence B = (b_1, &_1, b_2, &_2, …, b_{s-1}, &_{s-1}, b_s), we say that B is contained in A, or that B is a time-interval subsequence of A, if there exist integers 1 ≤ j_1 < j_2 < … < j_s ≤ n such that
1. b_1 = a_{j_1}, b_2 = a_{j_2}, …, b_s = a_{j_s};
2. t_{j_i} - t_{j_{i-1}} satisfies the condition of interval &_{i-1} for 2 ≤ i ≤ s.

We denote a transaction by <sid, s>, where sid is the identifier of the transaction and s is a sequence.
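The interval assignment and the containment test of Definition 2 can be illustrated with a minimal Python sketch. It is not part of the original paper: the function names interval_index and contains, the encoding of a time-interval sequence as a tuple that alternates items and interval indices, the sample sequence, and the boundaries T_1 = 3 and T_2 = 6 are all assumptions chosen for illustration.

```python
from bisect import bisect_left


def interval_index(gap, bounds):
    """Return j such that the time gap falls in interval I_j.

    bounds = [T_1, ..., T_{r-1}] in increasing order (assumed encoding), so
    I_0: gap == 0, I_j: T_{j-1} < gap <= T_j, I_r: gap > T_{r-1}.
    """
    if gap < 0:
        raise ValueError("gap must be non-negative")
    if gap == 0:
        return 0
    return bisect_left(bounds, gap) + 1


def contains(seq, pattern, bounds):
    """Check Definition 2 by backtracking over positions j_1 < j_2 < ... < j_s.

    seq is a time-ordered list of (item, time) pairs; pattern is a non-empty
    tuple (b_1, i_1, b_2, i_2, ..., b_s) where i_k is an interval index.
    """
    items = pattern[0::2]
    gaps = pattern[1::2]

    def extend(pos, matched):
        # pos: index in seq of the last matched item; matched: items matched so far.
        if matched == len(items):
            return True
        for j in range(pos + 1, len(seq)):
            item, time = seq[j]
            if item != items[matched]:
                continue
            if interval_index(time - seq[pos][1], bounds) == gaps[matched - 1]:
                if extend(j, matched + 1):
                    return True
        return False

    return any(extend(j, 1) for j, (item, _) in enumerate(seq) if item == items[0])


BOUNDS = [3, 6]  # assumed T_1 = 3, T_2 = 6, giving I_0, I_1, I_2, I_3
SEQ = [("b", 15), ("f", 17), ("e", 18), ("b", 22), ("c", 22)]  # sample sequence

print(contains(SEQ, ("b", 1, "e", 2, "c"), BOUNDS))  # gaps 3 and 4
print(contains(SEQ, ("b", 2, "e"), BOUNDS))
```

The search tries every admissible choice of positions, so it is exponential in the worst case; it is meant to mirror the definition, not to be efficient.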
A sequence database S is a set of records <sid, s>. For a given time-interval sequence α, its support count in database S is defined as follows.

Definition 3. support_count_S(α) = |{<sid, s> | <sid, s> ∈ S ∧ α is contained in s}|

A time-interval sequence α is called a time-interval sequential pattern, or a frequent time-interval sequence, if the percentage of transactions in S containing α is greater than or equal to the user-specified minimum support, min_sup. In other words, α is a time-interval sequential pattern in S if support_count_S(α) ≥ |S| × min_sup.

The total number of items in a time-interval sequence is referred to as the length of the sequence. A time-interval sequence of length k is called a k-time-interval sequence. Similarly, a time-interval sequential pattern of length k is called a k-time-interval sequential pattern. Given a sequence database and min_sup, the goal of time-interval sequential pattern mining is to find all time-interval sequences in the database whose supports are greater than or equal to min_sup.

Example 1. Consider the sequence database shown in Fig. 1 with TI = {I_0, I_1, I_2, I_3}, where I_0: t = 0, I_1: 0 < t ≤ 3, I_2: 3 < t ≤ 6 and I_3: 6 < t < ∞. The time-interval sequence (b, I_1, e, I_2, c) includes three items, so its length is 3 and we call it a 3-time-interval sequence. It is a time-interval subsequence of transaction 40, i.e., of the sequence ((b, 15), (f, 17), (e, 18), (b, 22), (c, 22)), through (b, 15), (e, 18) and (c, 22). It is also contained in transactions 10, 20 and 30; in transaction 20, for instance, through (b, 7), (e, 9) and (c, 14). Its support is therefore 100%. If we set min_sup = 50%, then (b, I_1, e, I_2, c) is a time-interval sequential pattern in the database.
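Definition 3 and Example 1 can be checked mechanically. The following Python sketch is an illustration under assumptions, not the authors' implementation: FIG1 transcribes the database of Fig. 1, intervals are encoded by their indices 0 to 3, and the helper names are hypothetical. On the data as listed, the pattern (b, I_1, e, I_2, c) is found in every transaction; transaction 20 matches through (b, 7), (e, 9) and (c, 14).

```python
from bisect import bisect_left

BOUNDS = [3, 6]  # T_1 = 3, T_2 = 6 as in Example 1

# The sequence database of Fig. 1, keyed by sid.
FIG1 = {
    10: [("a", 1), ("c", 3), ("a", 4), ("b", 4), ("a", 6), ("e", 6), ("c", 10)],
    20: [("d", 5), ("a", 7), ("b", 7), ("e", 7), ("d", 9), ("e", 9), ("c", 14), ("d", 14)],
    30: [("a", 8), ("b", 8), ("e", 11), ("d", 13), ("b", 16), ("c", 16), ("c", 20)],
    40: [("b", 15), ("f", 17), ("e", 18), ("b", 22), ("c", 22)],
}


def interval_index(gap):
    # 0 for I_0; otherwise the first j with gap <= T_j, or r when gap > T_{r-1}.
    return 0 if gap == 0 else bisect_left(BOUNDS, gap) + 1


def contains(seq, pattern):
    # pattern = (item, interval, item, ...); steps pairs each interval with the item after it.
    first, steps = pattern[0], list(zip(pattern[1::2], pattern[2::2]))

    def match(pos, k):
        if k == len(steps):
            return True
        want_gap, want_item = steps[k]
        return any(
            seq[j][0] == want_item
            and interval_index(seq[j][1] - seq[pos][1]) == want_gap
            and match(j, k + 1)
            for j in range(pos + 1, len(seq))
        )

    return any(match(j, 0) for j in range(len(seq)) if seq[j][0] == first)


def support_count(db, pattern):
    # Definition 3: number of transactions whose sequence contains the pattern.
    return sum(contains(seq, pattern) for seq in db.values())


def is_frequent(db, pattern, min_sup):
    return support_count(db, pattern) >= len(db) * min_sup


pattern = ("b", 1, "e", 2, "c")
print(support_count(FIG1, pattern), is_frequent(FIG1, pattern, 0.5))
```

Each transaction contributes at most one to the count, no matter how many ways the pattern can be embedded in it.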
    Sid   Sequence
    10    ((a, 1), (c, 3), (a, 4), (b, 4), (a, 6), (e, 6), (c, 10))
    20    ((d, 5), (a, 7), (b, 7), (e, 7), (d, 9), (e, 9), (c, 14), (d, 14))
    30    ((a, 8), (b, 8), (e, 11), (d, 13), (b, 16), (c, 16), (c, 20))
    40    ((b, 15), (f, 17), (e, 18), (b, 22), (c, 22))

Fig. 1. A sequence database.

4. The Algorithm

This section develops an algorithm for mining time-interval sequential patterns from databases. The algorithm is obtained by modifying the well-known Apriori algorithm.

4.1 The I-Apriori algorithm

The algorithm is listed in Fig. 2. In the algorithm, L_k denotes the set of all frequent k-time-interval sequences and C_k the set of candidate k-time-interval sequences. The algorithm proceeds in phases, where the k-th phase finds L_k from C_k. In the first phase, we find L_1 from C_1. C_1 is generated by listing all distinct items in the database. Then, by scanning the database sequentially, we obtain the supports of all time-interval sequences in C_1. Removing the infrequent ones gives L_1. The k-th phase produces L_k. To this end, we first derive C_k by the function apriori_gen(L_{k-1}, TI). Having obtained C_k, we scan the database to compute the supports of its members and delete all infrequent time-interval sequences, which yields L_k.

    L_1 = find_1-frequent_item(S);
    for (k = 2; L_{k-1} ≠ ∅; k++) {
        C_k = apriori_gen(L_{k-1}, TI);
        for each sequence s ∈ S {          // scan S to compute support counts
            C_s = subseq(C_k, s);          // candidates in C_k that are contained in s
            for each candidate c ∈ C_s
                c.count++;
        }
        L_k = { c ∈ C_k | c.count ≥ |S| × min_sup };
    }
    return ∪_k L_k;

Fig. 2. The I-Apriori algorithm.

Our algorithm differs from the traditional Apriori algorithm in two major ways: (1) the method of generating C_k, and (2) the method of computing the supports of candidate sequences.
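The phases of Fig. 2 can be put together in a short Python sketch. It is a simplified illustration, not the authors' implementation: supports are counted with a brute-force containment test in place of the candidate tree, intervals are encoded by their indices, the candidate step is a compact form of the prefix and suffix join detailed later in this section, and every function name is an assumption.

```python
from bisect import bisect_left


def interval_index(gap, bounds):
    # bounds = [T_1, ..., T_{r-1}]; returns j such that gap falls in I_j.
    return 0 if gap == 0 else bisect_left(bounds, gap) + 1


def contains(seq, pattern, bounds):
    # Brute-force test of Definition 2; pattern = (item, interval, item, ...).
    steps = list(zip(pattern[1::2], pattern[2::2]))

    def match(pos, k):
        if k == len(steps):
            return True
        gap, item = steps[k]
        return any(
            seq[j][0] == item
            and interval_index(seq[j][1] - seq[pos][1], bounds) == gap
            and match(j, k + 1)
            for j in range(pos + 1, len(seq))
        )

    return any(match(j, 0) for j in range(len(seq)) if seq[j][0] == pattern[0])


def candidates(prev, k, n_intervals):
    if k == 2:  # L_1 x TI x L_1
        return {a + (i,) + b for a in prev for i in range(n_intervals) for b in prev}
    # Join a with b when a without its first item and interval equals
    # b without its last interval and item.
    return {a + b[-2:] for a in prev for b in prev if a[2:] == b[:-2]}


def i_apriori(db, bounds, min_sup):
    threshold = min_sup * len(db)

    def frequent(cands):
        return {c for c in cands
                if sum(contains(s, c, bounds) for s in db) >= threshold}

    level = frequent({(item,) for s in db for item, _ in s})  # L_1
    result, k = set(), 2
    while level:  # stop once L_{k-1} is empty
        result |= level
        level = frequent(candidates(level, k, len(bounds) + 2))
        k += 1
    return result


# The database of Fig. 1, without the sid column.
FIG1 = [
    [("a", 1), ("c", 3), ("a", 4), ("b", 4), ("a", 6), ("e", 6), ("c", 10)],
    [("d", 5), ("a", 7), ("b", 7), ("e", 7), ("d", 9), ("e", 9), ("c", 14), ("d", 14)],
    [("a", 8), ("b", 8), ("e", 11), ("d", 13), ("b", 16), ("c", 16), ("c", 20)],
    [("b", 15), ("f", 17), ("e", 18), ("b", 22), ("c", 22)],
]
patterns = i_apriori(FIG1, bounds=[3, 6], min_sup=0.5)
print(("b", 1, "e", 2, "c") in patterns)
```

With r-1 boundaries there are r+1 intervals, which is why the sketch passes len(bounds) + 2 as the number of interval indices.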
We introduce these two differences in turn. We first discuss how to generate C_2 and then how to generate C_k for k > 2.

Traditionally, C_2 can be obtained by joining L_1 with L_1 directly. However, since the first and second items of a sequence in C_2, say b and c, may stand in different time-interval relations, we need to generate a pair for every possible time-interval relation. Let us illustrate with an example. Suppose (b) and (c) belong to L_1 and TI = {I_0, I_1, I_2}. Then C_2 contains the following candidate time-interval sequences: (b, I_0, b), (b, I_1, b), (b, I_2, b), (b, I_0, c), (b, I_1, c), (b, I_2, c), (c, I_0, b), (c, I_1, b), (c, I_2, b), (c, I_0, c), (c, I_1, c) and (c, I_2, c). In short, we can generate C_2 by L_1 × TI × L_1, where × denotes join.

Next, we consider how to generate C_k. Let (e_1 &_1 e_2 &_2 … &_{k-1} e_k) be a k-time-interval sequence in L_k. Then the (k-1)-time-interval sequences (e_1 &_1 e_2 &_2 … &_{k-2} e_{k-1}) and (e_2 &_2 … e_{k-1} &_{k-1} e_k) must also be frequent, because every transaction containing (e_1 &_1 e_2 &_2 … &_{k-1} e_k) also contains these two sequences. Therefore, whenever the time-interval sequences (e_1 &_1 e_2 &_2 … &_{k-2} e_{k-1}) and (e_2 &_2 … e_{k-1} &_{k-1} e_k) both exist in L_{k-1}, we place (e_1 &_1 e_2 &_2 … &_{k-1} e_k) in C_k. By joining the time-interval sequences in L_{k-1} in this way, we generate all the time-interval sequences in C_k. For example, if (A, I_0, B) and (B, I_2, D) are in L_2, then (A, I_0, B, I_2, D) is in C_3. Similarly, if (A, I_0, B, I_2, D) and (B, I_2, D, I_3, C) are in L_3, then (A, I_0, B, I_2, D, I_3, C) is in C_4. Finally, the algorithm for producing the candidate set C_k from L_{k-1} is shown in Fig. 3.
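The two generation steps described above can be sketched in Python as follows. The sketch is illustrative only: sequences are encoded as tuples that alternate items and interval labels, and the signature apriori_gen(prev, k, intervals) is an assumption that differs slightly from the paper's apriori_gen(L_{k-1}, TI) by passing k explicitly.

```python
from itertools import product


def apriori_gen(prev, k, intervals):
    """Build candidate k-sequences from the frequent (k-1)-sequences in prev."""
    if k == 2:
        # C_2 = L_1 x TI x L_1: every ordered item pair under every interval.
        return [a + (i,) + b for a, i, b in product(prev, intervals, prev)]
    out = []
    for a, b in product(prev, repeat=2):
        # a without its first item and interval must equal
        # b without its last interval and item.
        if a[2:] == b[:-2]:
            out.append(a + b[-2:])
    return out


TI = ["I0", "I1", "I2", "I3"]

# The C_2 example from the text uses only three intervals.
c2 = apriori_gen([("b",), ("c",)], 2, ["I0", "I1", "I2"])
print(len(c2))

# The C_3 and C_4 examples from the text.
c3 = apriori_gen([("A", "I0", "B"), ("B", "I2", "D")], 3, TI)
c4 = apriori_gen([("A", "I0", "B", "I2", "D"), ("B", "I2", "D", "I3", "C")], 4, TI)
print(c3, c4)
```

Slicing with a[2:] and b[:-2] compares the overlapping parts in one step, which plays the role of the element-by-element equality test in the join condition.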
    Procedure apriori_gen(L_{k-1}, TI)
    for each sequence l_1 ∈ L_{k-1} {
        for each sequence l_2 ∈ L_{k-1} {
            if (k = 2) then {
                for each time interval i ∈ TI {
                    c = l_1 × i × l_2;
                    add c to C_k;
                }
            } else {
                if (l_1[3] = l_2[1] ∧ l_1[4] = l_2[2] ∧ … ∧ l_1[(k-1)*2-1] = l_2[(k-1)*2-3]) then {
                    c = l_1 × l_2;
                    add c to C_k;
                }
            }
        }
    }
    return C_k;

Fig. 3. The program for generating the candidate set.

After finding the candidate set of k-time-interval sequences, the next problem is how to compute their support counts. For this we use a tree structure, called the candidate tree. The candidate tree is similar to the hash tree [3] used in previous research. The major difference is that the traditional approach labels each tree branch with an item name, whereas we label each branch with two components: an item name and a time-interval value. Apart from this difference, building the tree, traversing the tree and computing support counts are all done similarly.

We first show by example how to build a candidate tree. Suppose the candidate set has three 4-time-interval sequences: (a I_0 b I_1 e I_2 b), (a I_0 b I_1 e I_2 c) and (b I_1 e I_2 b I_0 c). Then the constructed candidate tree looks like the one shown in Fig. 4.

    root
      a: (a)
        I_0 b: (a I_0 b)
          I_1 e: (a I_0 b I_1 e)
            I_2 b: (a I_0 b I_1 e I_2 b)
            I_2 c: (a I_0 b I_1 e I_2 c)
      b: (b)
        I_1 e: (b I_1 e)
          I_2 b: (b I_1 e I_2 b)
            I_0 c: (b I_1 e I_2 b I_0 c)

Fig. 4. An example of a candidate tree. Each entry gives a branch label followed by the node it leads to.

Having constructed the candidate tree, the next job is to scan every transaction in the database; for each transaction we traverse the tree and update the supports of the candidates it contains. The procedure Traverse shown in Fig. 5 does this job.
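A candidate tree of this kind, together with a per-transaction traversal, might be sketched in Python as follows. This is an assumption-laden illustration, not the authors' code: the tree is a plain dictionary-based trie with no hashing, intervals are encoded by their indices with assumed boundaries T_1 = 3 and T_2 = 6, positions are 0-based, and the names Node, build_tree, traverse and count_supports are hypothetical. The sketch collects the leaves reached by one transaction in a set, so that a candidate with several embeddings in the same sequence is counted once; this is our reading of the support count in Definition 3.

```python
from bisect import bisect_left
from collections import Counter

BOUNDS = [3, 6]  # assumed T_1, T_2, giving I_0..I_3 as indices 0..3


def interval_of(gap):
    return 0 if gap == 0 else bisect_left(BOUNDS, gap) + 1


class Node:
    def __init__(self):
        self.children = {}   # arc label (interval, item) -> child node
        self.pattern = None  # candidate stored at a leaf, otherwise None


def build_tree(candidates):
    root = Node()
    for cand in candidates:
        # Arcs leaving the root carry only an item; deeper arcs carry (interval, item).
        arcs = [(None, cand[0])] + list(zip(cand[1::2], cand[2::2]))
        node = root
        for arc in arcs:
            node = node.children.setdefault(arc, Node())
        node.pattern = cand
    return root


def traverse(node, seq, pos, hits):
    # pos is the 0-based position matched by the preceding step (-1 at the root).
    if node.pattern is not None:
        hits.add(node.pattern)
        return
    for j in range(pos + 1, len(seq)):
        item, time = seq[j]
        for (interval, arc_item), child in node.children.items():
            if arc_item == item and (
                interval is None or interval == interval_of(time - seq[pos][1])
            ):
                traverse(child, seq, j, hits)


def count_supports(candidates, db):
    root = build_tree(candidates)
    counts = Counter()
    for seq in db:
        hits = set()          # leaves reached by this transaction
        traverse(root, seq, -1, hits)
        counts.update(hits)   # each reached candidate gains one support
    return counts


CANDS = [
    ("a", 0, "b", 1, "e", 2, "b"),
    ("a", 0, "b", 1, "e", 2, "c"),
    ("b", 1, "e", 2, "b", 0, "c"),
]
T = [("b", 15), ("f", 17), ("e", 18), ("b", 22), ("c", 22)]
print(count_supports(CANDS, [T]))
```

Because all candidates in one phase have the same length, a node that stores a pattern is always a leaf, which is what the early return in traverse relies on.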
    Subroutine Traverse(u, T, i, level)
    Parameters:
        u: the node in the candidate tree where we are currently located
        T: the sequence that we are dealing with
        i: the position in the sequence matched by the preceding traversal
        level: the level of node u in the candidate tree
        item(u, v): the item of arc (u, v)
        interval(u, v): the time interval of arc (u, v)
        item(T(j)): the item of T(j)
        interval(time(T(j)) - time(T(i))): the interval to which the time difference between T(j) and T(i) belongs
        T_length: the length of T
    Method:
        if u is a leaf node then
            add 1 to the counter of the leaf node
            return
        else
            for (j = i+1; j ≤ T_length; j++)
                for each child v of u
                    if level = 0 then
                        if item(u, v) = item(T(j)) then
                            Traverse(v, T, j, level+1)
                    else
                        if item(u, v) = item(T(j)) and interval(u, v) = interval(time(T(j)) - time(T(i))) then
                            Traverse(v, T, j, level+1)
            return

Fig. 5. Procedure Traverse.

Example 2. Suppose the transaction T = ((b, 15), (f, 17), (e, 18), (b, 22), (c, 22)) is used to traverse the candidate tree in Fig. 4. The search starts by calling Traverse(root, T, 0, 0), which calls Traverse((b), T, 1, 1) and Traverse((b), T, 4, 1) because item b occurs at the first and fourth positions of the transaction. Executing Traverse((b), T, 1, 1) triggers a series of calls: first Traverse((b I_1 e), T, 3, 2), then Traverse((b I_1 e I_2 b), T, 4, 3), and finally Traverse((b I_1 e I_2 b I_0 c), T, 5, 4). Continuing in this way until the whole procedure stops, we find the pattern (b I_1 e I_2 b I_0 c).

5. Conclusion

This paper raises the new problem of mining time-interval sequential patterns. The time-interval sequential pattern provides knowledge not only about the order of the items (as the traditional sequential pattern does) but also about the time interval between every two successive items.
In daily life, applications of the time-interval sequential pattern include analyzing the purchasing behavior of customers, the exploration of a website, and the diagnosis of a disease, to name just a few. Such information can be used to design appropriate policies that strengthen the competitiveness of corporations and bring substantial benefits to individuals.

In this paper, we assume that the time interval is partitioned into a set of fixed ranges beforehand. In fact, deciding how to partition the time interval and where to place the boundaries is by no means easy. If the specified ranges are too wide, many interesting patterns with narrow time intervals will be missed, because they are absorbed into patterns with wider time intervals. Conversely, if the ranges are too narrow, many interesting time-interval patterns will be missed because their supports are not large enough. How to balance these two extremes so that all interesting time-interval sequential patterns can be found is a valuable but difficult issue. In addition, fuzzy theory or rough set theory may be applied to partition the intervals so that the boundary of an interval is no longer fixed but flexible. This extension would mitigate the sharp-boundary problem and provide a smooth transition between members and non-members of a set. Finally, a taxonomy may be built on the time intervals so that a bigger interval is formed from a set of smaller intervals. With this extension, multiple-level time-interval sequential patterns can be found in sequence data: not only sequential patterns with time intervals at the same level but also patterns with time intervals across different levels.

References

[1] R.C. Agarwal, C. Aggarwal, V.V.V. Prasad, A tree projection algorithm for generation of frequent itemsets, Journal of Parallel and Distributed Computing (2000).
[2] R. Agrawal, C. Faloutsos, A. Swami, Efficient similarity search in sequence databases, Conference on Foundations of Data Organization and Algorithms, Chicago, Illinois, 1993, pp. 69-84.
[3] R. Agrawal, R. Srikant, Fast algorithms for mining association rules, Proc. 1994 Int. Conf. Very Large Data Bases (VLDB'94), Santiago, Chile, 1994, pp. 487-499.
[4] R. Agrawal, R. Srikant, Mining sequential patterns, Proc. 1995 Int. Conf. Data Engineering (ICDE'95), Taipei, Taiwan, 1995, pp. 3-14.
[5] M.S. Chen, J. Han, P.S. Yu, Data mining: an overview from a database perspective, IEEE Transactions on Knowledge and Data Engineering 8 (6) (1996) 866-883.
[6] M.S. Chen, J.S. Park, P.S. Yu, Efficient data mining for path traversal patterns, IEEE Transactions on Knowledge and Data Engineering 10 (2) (1998) 209-221.
[7] R. Cooley, B. Mobasher, J. Srivastava, Data preparation for mining world wide web browsing patterns, Journal of Knowledge and Information Systems 1 (1) (1999) 5-32.
[8] C. Faloutsos, M. Ranganathan, Y. Manolopoulos, Fast subsequence matching in time-series databases, Proceedings of the 1994 ACM SIGMOD International Conference on Management of Data, Minneapolis, Minnesota, 1994, pp. 419-429.
[9] W.J. Frawley, G. Piatetsky-Shapiro, C.J. Matheus, Knowledge Discovery in Databases: An Overview, AAAI/MIT Press, 1991.
[10] V. Guralnik, N. Garg, G. Karypis, Parallel tree projection algorithm for sequence mining, 7th International European Conference on Parallel Processing, Manchester, UK, 2001, pp. 310-320.
[11] J. Han, M. Kamber, Data Mining: Concepts and Techniques, Academic Press, 2001.
[12] J. Han, G. Dong, Y. Yin, Efficient mining of partial periodic patterns in time series database, Proc. 1999 Int. Conf. on Data Engineering (ICDE'99), Sydney, Australia, March 1999, pp. 106-115.
[13] J. Han, W. Gong, Y. Yin, Mining segment-wise periodic patterns in time-related databases, Proc. of 1998 Int. Conf. on Knowledge Discovery and Data Mining (KDD'98), New York City, NY, 1998, pp. 214-218.
[14] J. Han, J. Pei, B. Mortazavi-Asl, Q. Chen, U. Dayal, M.-C. Hsu, FreeSpan: frequent pattern-projected sequential pattern mining, Proc. 2000 Int. Conf. on Knowledge Discovery and Data Mining (KDD'00), Boston, MA, 2000, pp. 355-359.
[15] C. Li, P.S. Yu, V. Castelli, Hierarchyscan: a hierarchical similarity search algorithm for databases of long sequences, Proceedings of the 12th International Conference on Data Engineering, New Orleans, Louisiana, 1996, pp. 546-553.
[16] S. Ma, J.L. Hellerstein, Mining partially periodic event patterns with unknown periods, Proc. 17th Int. Conf. Data Engineering (ICDE'01), Heidelberg, Germany, 2001, pp. 205-214.
[17] H. Mannila, H. Toivonen, A. Inkeri Verkamo, Discovery of frequent episodes in event sequences, Data Mining and Knowledge Discovery 1 (3) (1997) 259-289.
[18] J. Pei, J. Han, B. Mortazavi-Asl, H. Zhu, Mining access patterns efficiently from web logs, Proc. 2000 Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD'00), Kyoto, Japan, 2000, pp. 396-407.
[19] J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, M.-C. Hsu, PrefixSpan: mining sequential patterns efficiently by prefix-projected pattern growth, Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), Heidelberg, Germany, 2001.
[20] R. Srikant, R. Agrawal, Mining sequential patterns: generalizations and performance improvements, Proc. of the Fifth Int'l Conference on Extending Database Technology (EDBT'96), Avignon, France, 1996.
[21] P.-H. Wu, W.-C. Peng, M.-S. Chen, Mining sequential alarm patterns in a telecommunication database, Workshop on Databases in Telecommunications (VLDB 2001), 2001.
[22] M.J. Zaki, SPADE: an efficient algorithm for mining frequent sequences, Machine Learning Journal, special issue on Unsupervised Learning, 42 (2001), pp. 31-60.