National Dong Hwa University
Department of Computer Science and Information Engineering
Master's Thesis

Efficient Partial Multiple Periodic Patterns Mining Without Redundant Rules

Student: 楊文博    Advisor: Dr. 李官陵    June 2003

Acknowledgement

From the first ideas to the final draft of this thesis, I owe thanks to countless teachers and friends, more than words can express. First, I thank my advisor, Professor 李官陵, for his constant care at every stage, from shaping the research direction and providing materials to revising the manuscript; it is under his guidance that this thesis could be completed so thoroughly and smoothly. I also thank the three members of my oral defense committee, Professors 羅壽之, 陳良弼, and 徐嘉連, whose suggestions and comments made this thesis more complete. In addition, I thank all my labmates and juniors; their help allowed me to devote myself fully to research, and their encouragement enabled this thesis to be finished on schedule. Finally, I thank my family for the support that carried me to graduation, especially my grandmother, who passed away on the day of my oral defense; with her blessing from above, I was able to pass.

Abstract

Partial periodic patterns mining is an interesting problem in data mining. Previous studies have considered both full and partial multiple periodic patterns mining. The proposed methods, however, may produce redundant information and are inefficient. In this thesis, a novel concept and new parameters are proposed to improve the performance of partial multiple periodic patterns mining. Moreover, the proposed method does not produce redundant rules, which meets the user's expectation. Instead of mining every period, we check only the necessary periods and use that information for further mining. Instead of considering the whole database, the information needed for mining partial periodic patterns is transformed into a bit vector that can be stored in main memory. Therefore, our approach scans the database at most twice. A set of simulations is also performed to show the benefit of our approach.

Keywords: Data Mining, Partial Periodic Patterns Mining, Multiple Periodic Patterns Mining, Time Series Analysis.
摘要 (Chinese Abstract)

Partial periodic patterns mining is a very interesting part of data mining. In previous research, mining algorithms for both full and partial periodic patterns have been studied. However, earlier methods are inefficient and produce redundant rules. In this thesis, we propose a new concept to improve the performance of partial multiple periodic patterns mining; moreover, the proposed algorithm does not produce redundant rules. Rather than mining all periods, we check only the periods that truly need to be mined, and use the information from already-mined periods to skip the mining of certain periods. Furthermore, we transform the database into a bit matrix that contains the information the algorithm needs and fits in main memory. Hence, the algorithm scans the full database at most twice. Finally, we present a series of simulation experiments to validate the performance gain of the algorithm.

Keywords: time-related databases, partial periodic patterns mining, multiple periodic patterns mining, data mining.

Table of Contents

Acknowledgement
English Abstract
Chinese Abstract
Chapter 1. Introduction
  1.1 Motivation and Objective
  1.2 Method and Achievement
  1.3 Thesis Organization
Chapter 2. Background
  2.1 Introduction to Data Mining
  2.2 Association Rule Mining
  2.3 Periodic Patterns Mining
    2.3.1 Full Periodic Patterns Mining
    2.3.2 Partial Periodic Patterns Mining
    2.3.3 Multiple Periodic Patterns Mining and Our Approach
Chapter 3. Mining Algorithm for Multiple Periodic Patterns
  3.1 Data Pre-Processing
  3.2 Partial Single Periodic Patterns Mining
  3.3 Partial Multiple Periodic Patterns Mining
    3.3.1 Prime Period Mining (PPM)
    3.3.2 Composite Period Mining (CPM)
  3.4 Data Post-Processing
Chapter 4. Experiment and Efficiency Analysis
  4.1 Simulation Platform
  4.2 Efficiency Analysis
Chapter 5. Conclusion and Future Works
  5.1 Conclusion
  5.2 Future Works
Reference

CHAPTER 1 INTRODUCTION

1.1 Motivation and Objective

In previous studies of periodic patterns mining, the focus has been on the efficiency of mining full/partial single periodic patterns [6][12]. In [6], multiple periodic patterns are mined by repeating the procedure for mining single periodic patterns.
This is inefficient, and the result contains many redundant rules. Therefore, we propose a method that mines partial multiple periodic patterns efficiently and generates results without redundant rules.

1.2 Method and Achievement

In this thesis, our objective is to mine partial multiple periodic patterns (multiple periodic patterns for short) efficiently and to generate a concise set of rules. To achieve this, we propose two algorithms: the Prime Period Mining algorithm (PPM) and the Composite Period Mining algorithm (CPM). PPM is responsible for mining the periods to which the pruning properties cannot be applied, and CPM applies the pruning properties to check the remaining periods. Because CPM checks only the itemsets that PPM found not to be large, the output of our approach contains no redundant rules. At the end of the thesis, a set of experiments is presented to show the efficiency gain of our approach.

1.3 Thesis Organization

The remainder of this thesis is organized as follows. Chapter 2 reviews the background knowledge and related work on periodic patterns mining. The method for mining multiple periodic patterns is discussed in chapter 3. The experimental evaluation is presented in chapter 4. Finally, we present our conclusions in chapter 5 and identify directions for future research.

CHAPTER 2 BACKGROUND

2.1 Introduction to Data Mining

In this knowledge-economy age, it is easy for any company to collect and keep huge amounts of data, and high-speed computation has made it feasible to analyze them. Extracting or "mining" knowledge from large amounts of data is called data mining.
Many other terms carry a similar meaning to data mining, such as knowledge mining from databases, knowledge extraction, data/pattern analysis, data archaeology, and data dredging; however, the most popular term is Knowledge Discovery in Databases, or KDD. Whichever term we use, the process consists of the same iterative sequence of steps:

1. Data cleaning: remove noise and inconsistent data.
2. Data integration: multiple data sources may be combined.
3. Data selection: data relevant to the analysis task are selected from the database.
4. Data transformation: data are transformed into appropriate forms for the mining task, for example by performing summary or aggregation operations.
5. Data mining: the essential process in which intelligent methods are applied to extract patterns.
6. Pattern evaluation: identify the interesting patterns representing knowledge, based on interestingness measures.
7. Knowledge presentation: visualization and knowledge-representation techniques are used to present the mined knowledge to the user.

The purpose of data mining is to help decision makers make good and correct decisions. Accordingly, for different application domains there are several data mining functionalities, including concept description [4][7], association analysis [1][18][12][13][10][17][2][11], classification/prediction [3][14], cluster analysis [9][8], outlier analysis, and evolution analysis [8]. In the following discussion, we focus on association analysis.

2.2 Association Rule Mining

In the association analysis of a transaction database, data records are stored as transactions, where each transaction is a set of items. Under this assumption, the problem of discovering association rules is defined as finding relationships between the occurrences of items within transactions [1].
For example, an association rule might be "bread → milk [support = 10%, confidence = 90%]", which means that 10% of all transactions contain both items, and 90% of the transactions that contain the item "bread" also contain the item "milk". Each association rule carries a measure of certainty that assesses its validity, called the confidence. The support of an association rule refers to the percentage of task-relevant transactions for which the rule is true. Therefore, there are two important parameters in the mining process: the minimum support and the minimum confidence. A large itemset is an itemset that satisfies the minimum support. The support of an itemset is the fraction of transactions that contain the itemset; the confidence of a rule X → Y is the fraction of transactions containing X that also contain Y. The association rule X → Y holds if X ∪ Y is large and the confidence of the rule exceeds the given minimum confidence threshold; such a rule is called a strong association rule. Furthermore, an itemset that contains k items is a k-itemset; for example, the set {AB} is a 2-itemset.

There are two major problems in association rule mining: the huge number of candidate itemsets, and the number of scans over the transaction database. In [1], Agrawal et al. propose an algorithm called Apriori, which employs an iterative approach known as level-wise search, where k-itemsets are used to explore (k+1)-itemsets. First, the set of frequent 1-itemsets, denoted L1, is found. L1 is used to find L2, the set of frequent 2-itemsets, which is used to find L3, and so on, until no more frequent k-itemsets can be found. Finding each Lk requires one full scan of the database, so as the transaction database grows larger, the efficiency of the Apriori algorithm drops drastically.
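The level-wise search just described can be sketched in Python. This is an illustrative sketch of Apriori-style candidate generation and counting under our own naming, not the exact algorithm of [1]:

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise search: L1 is used to build C2, C2 is counted to get L2,
    and so on; each level corresponds to one pass over the transactions."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    def support(itemset):
        return sum(1 for t in transactions if itemset <= t) / n

    # L1: frequent 1-itemsets
    items = {i for t in transactions for i in t}
    current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    frequent = list(current)

    k = 2
    while current:
        # join step: combine frequent (k-1)-itemsets into k-item candidates
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # prune step: every (k-1)-subset of a candidate must itself be frequent
        prev = set(current)
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # count step: one pass over the database per level
        current = [c for c in candidates if support(c) >= min_support]
        frequent.extend(current)
        k += 1
    return frequent
```

For instance, with four transactions over {bread, milk, eggs} and a minimum support of 0.5, both {milk} and {bread, milk} come out frequent, while {eggs} is pruned at level 1.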
Many variations of the Apriori algorithm have been proposed to improve the efficiency of the original algorithm, for example hash-based techniques, transaction reduction, partitioning, sampling, and dynamic itemset counting. The discovery of association relationships among huge amounts of data is useful in selective marketing, decision analysis, and business management. A popular application area is market basket analysis, which studies the buying habits of customers by searching for sets of items that are frequently purchased together. Although association rules are useful, they should not be used directly for prediction without further analysis or domain knowledge. They are, however, a helpful starting point for further exploration, making them a popular tool for understanding data.

2.3 Periodic Patterns Mining

In early studies of association analysis, transactions were treated as a single data segment, and no attention was paid to segmenting the data over different time intervals [6][12][10]. In the real world, we may need to find, for instance, that bread and milk are sold together every couple of days or weeks. For this reason, association analysis over a suitable time interval corresponds better to what users expect. The issue of time-interval association analysis is generally called periodic patterns mining [12][6][10]. The periodic patterns mining problem can be categorized into two issues. One is full periodic patterns mining, where every point in time contributes to the cyclic behavior of the time series. The other, more general one is partial periodic patterns mining, which specifies the behavior of the time series at some, but not all, points in time. Furthermore, the problem that covers both issues is how to mine multiple periodic patterns efficiently, and that is the focus of this thesis. The definition of periodic patterns is given below.
Assume the time series database D of n time units has been collected. Let Di denote the unit time database for the ith time unit. Thus, the time series database is represented as

D = D1 ∪ D2 ∪ D3 ∪ ... ∪ Dn.

We define the large-1 matrix as

D' = {{L1}_1, {L1}_2, {L1}_3, ..., {L1}_n},

where {L1}_i denotes the set of large 1-itemsets in Di.

We also define a periodic pattern S[p,o][periodic support], which appears with period length p and offset o from the first timestamp. We use p to denote the period length of S and o to denote the offset inside period p. For example, AB[2,0][90%] represents that pattern AB is frequent in 90% of {Di | i mod 2 = 0}. A frequent periodic pattern is a periodic pattern that satisfies the periodic minimum support. For example, if the periodic minimum support is no larger than 90%, AB[2,0][90%] is a frequent periodic pattern; otherwise, it is not.

The feature list of a periodic pattern S[p,o] is represented as

f_S[p,o] = <f_i>, f_i ∈ {0,1},

an ordered list over the timestamps i = p·k + o for all integers 0 ≤ k < n/p, where f_i appears before f_j whenever i < j. Moreover, f_i is set to 1 if every item of S is contained in {L1}_i, and f_i = 0 otherwise. For example, let D' = {{ABC}, {BCD}, {ACD}, {DEF}}; then f_AB[2,0] = <10>.

2.3.1 Full Periodic Patterns Mining

The problem of full periodic patterns mining was first addressed by Özden et al. in [12], and full periodic patterns are defined as follows. In full periodic patterns mining, the periodic minimum support is equal to 100%; therefore, a pattern S[p,o] is a frequent full periodic pattern if and only if f_i = 1 for every f_i ∈ f_S[p,o]. By applying this characteristic of full periodic patterns to prune irrelevant data, an efficient algorithm, the interleaved algorithm, is proposed in [12]. The strict constraint of full periodic patterns makes the proposed algorithm very efficient.
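The feature-list definition and the full-periodicity test above can be sketched in Python. This is an illustrative sketch: the large-1 matrix is represented as a 0-indexed list of sets of large 1-items, and the function names are ours:

```python
def feature_list(large1, pattern, p, o):
    """Feature list f_S[p,o]: bit k records whether every item of `pattern`
    is a large 1-item in the unit time database at timestamp p*k + o
    (0-indexed), i.e. whether pattern <= {L1}_{p*k+o}."""
    pattern = set(pattern)
    return [1 if pattern <= large1[i] else 0
            for i in range(o, len(large1), p)]

def is_full_periodic(flist):
    # full periodicity requires every position of the feature list to be 1
    return all(b == 1 for b in flist)
```

On the document's example D' = {{ABC}, {BCD}, {ACD}, {DEF}}, `feature_list(Dp, "AB", 2, 0)` yields [1, 0], matching f_AB[2,0] = <10>.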
However, the same strict constraint means the approach cannot be generalized to the more general problem, partial periodic patterns mining.

2.3.2 Partial Periodic Patterns Mining

Partial periodic patterns mining has been discussed thoroughly in [6][10]. By definition, a pattern S[p,o] is a frequent partial periodic pattern if and only if

|{f_i ∈ f_S[p,o] : f_i = 1}| / (n/p) ≥ periodic minimum support,

where |{...}| denotes the number of f_i satisfying the predicate. In [6], Han et al. consider the efficient mining of partial periodic patterns, for a single period as well as for a set of periods. Some interesting properties of partial periodic patterns, including the Apriori property and the max-subpattern hit set property, are explored in the proposed methods. The main contribution of Han's study is a fast mining process for single-period partial periodic patterns. Another interesting method for mining partial periodic patterns is proposed in [10]. Unlike [6] and our approach, the method proposed by Li et al. mines only the periodic patterns that satisfy a specified calendar schema [10].

2.3.3 Multiple Periodic Patterns Mining and Our Approach

In previous studies, the problem of multiple periodic patterns mining has received little attention. Only in [6] do Han et al. propose mining multiple periodic patterns, by repeating the same procedure used for mining single periodic patterns. With regard to multiple periodic patterns mining, therefore, the algorithm of Han's study is slow; furthermore, many redundant patterns are still mined out. For example, if AB[2,0] is a frequent periodic pattern, AB[4,0] may still be mined out. Why is this kind of pattern redundant? Take a supermarket for example, and suppose we have the rules "we need to stock milk every two days" and "we need to stock milk every four days".
It is clear that the second rule is redundant: if we stock milk every two days, that already covers stocking milk every four days. In this thesis, we present properties that speed up the mining of multiple periodic patterns. Given a predefined periodic minimum support and a maximum period length, the set of frequent partial periodic patterns whose period length is no larger than the maximum period length is mined out. The complete mining process is divided into three steps. The first step is data pre-processing, which converts the original transaction database into the large-1 database. The second step is the multiple periodic patterns mining process, which contains two major procedures, Prime Period Mining (PPM) and Composite Period Mining (CPM). As the names imply, PPM deals with the prime-number periods and CPM deals with the composite-number periods. An un-frequent itemset list is constructed to store the un-frequent itemsets (discussed in chapter 3); all un-frequent itemsets of prime-number periods must be further checked by CPM based on the period expansion property. The final step is data post-processing: the unit time databases are scanned to check whether the candidate itemsets generated by the mining process are large in the unit time databases.

CHAPTER 3 MINING ALGORITHM FOR MULTIPLE PERIODIC PATTERNS

3.1 Data Pre-Processing

Let D be the transaction database, with each transaction t = (TIME, ITEMSET), where ITEMSET is the set of items the customer purchases and TIME is the purchase time. Assume the time attribute is in calendar format, such as (year, month, day) [10]. In the transformation phase, the time attribute is aggregated to a time unit that users are interested in. Referring to Figure 1, the redundant portion of the time attribute in each transaction has been dropped; in this case, we consider the daily databases separately.
Time    Item
Day 1   ACD
Day 1   BDE
...
Day 2   DEF
...
Day N   GEF

Figure 1. Time series database

In our algorithm, the transactions are partitioned based on the time attribute, and each partition forms a unit time database. In each unit time database, the Apriori algorithm [1] is executed to generate the large 1-itemsets. We then build a matrix called the large-1 matrix, denoted M, to store these large 1-itemsets. Referring to Figure 2, M(i,j) is set to 1 if item j is a large item in unit time database i, and 0 otherwise.

Time   A  B  C  D  E
1      1  0  1  1  0
2      0  1  0  1  0
3      0  0  1  1  0
...
N      1  1  0  0  1

Figure 2. Large-1 matrix

3.2 Mining Single Periodic Patterns

Let the large-1 matrix be the matrix shown in Figure 2. As illustrated above, row i of the large-1 matrix records the large 1-items in unit time database i. Assume we are interested in periodic patterns of period length P. According to their offsets, the P-length periodic patterns can be partitioned into P sets, with offsets from 0 to (P-1). In our approach, D' is correspondingly partitioned into P sets, D_0^P, D_1^P, ..., D_{P-1}^P, where D_i^P keeps the related information of periodic patterns with period length P and offset i. In each set, an Apriori-like algorithm [1] is run to find all frequent periodic patterns.

Time    A  B  C  D  E  F
Day 1   1  1  1  0  0  1
Day 3   0  1  1  0  1  0
Day 5   0  0  1  0  1  1
Day 7   0  0  0  1  1  1
Day 9   0  0  0  1  1  1
Day 11  1  1  1  0  0  1

Figure 3. D_0^2

For example, Figure 3 shows D_0^2, the related large 1-itemset matrix for period length 2 and offset 0. In the Apriori-like algorithm, the set of candidate 2-itemsets (C2) is generated by joining pairs of large 1-itemsets, and L2 is obtained by scanning D_0^2 again. The process repeats until no more candidate itemsets are generated. To use D_0^2 efficiently, scanning D_0^2 is replaced by Boolean operations. Take the pattern BE for example: f_B[2,0] = 110001 AND f_E[2,0] = 011110 gives f_BE[2,0] = 010000.
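This Boolean substitution for rescanning D_0^2 can be sketched with Python integers as bit vectors (a sketch; the bit strings are written leftmost bit = Day 1, matching the feature lists above):

```python
# Feature lists of items B and E from Figure 3, as bit strings
# (leftmost bit corresponds to Day 1).
f_B = int("110001", 2)
f_E = int("011110", 2)

# Instead of rescanning D_0^2, the feature list of the pattern "BE" is
# obtained with a single bitwise AND of the feature lists of "B" and "E".
f_BE = f_B & f_E

# The support count of BE in period [2,0] is the number of 1-bits.
support_count = bin(f_BE).count("1")
```

This reproduces the example in the text: 110001 AND 011110 = 010000, giving BE a support count of 1 in period [2,0].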
Assume the periodic minimum support count is 2; then BE is clearly not frequent. Furthermore, all frequent periodic patterns found by single periodic patterns mining must be checked again; the details of this check process are discussed in Section 3.4.

3.3 Mining Multiple Periodic Patterns

Multiple periodic patterns mining is an extension of the single case. In previous studies, the mining algorithm for multiple periodic patterns executes the single periodic patterns mining process again and again until all desired period lengths have been covered. We observe that if the multiple periodic patterns are mined period by period, both the cost of mining and the redundancy of the resulting patterns are too high to bear. A pattern is redundant if and only if it is contained by another pattern; this is defined formally as follows.

Definition [Period contain property]
Let [p,o] denote the period with period length p and offset o. A period [p',o'] is contained by period [p,o] if and only if p' = m·p for some integer m and o' mod p = o. By this definition, if [p',o'] is contained by [p,o], then f_S[p',o'] ⊆ f_S[p,o].

Definition [Redundant pattern]
Pattern S[p',o'] is a redundant pattern if and only if there exists a frequent periodic pattern S[p,o] such that [p',o'] is contained by [p,o].

For example, given the rules "we need to stock milk every two days" and "we need to stock milk every four days", the second rule is clearly redundant: stocking milk every two days already covers stocking it every four days. Such redundant patterns obviously have to be pruned. In our approach, the cost of mining partial multiple periodic patterns is reduced from two viewpoints. From the viewpoint of redundant patterns, if we avoid generating redundant patterns, both the mining process and its result become concise.
The pruning of redundant patterns involves two cases: first, S' is a frequent periodic pattern; second, S' is not a frequent periodic pattern. In the first case, S' is a redundant pattern: the information it carries is already covered by another frequent partial periodic pattern, so it need not be mined out. In the second case, S' is not a frequent periodic pattern, so of course it need not be mined out either. From the viewpoint of periods, the data collected while mining one period can supply information for mining the periods contained by it. These two observations are the starting point of our pruning properties. The following lemma shows how the data collected while mining a period p can be used to speed up the mining of the periods contained by p.

Lemma 1 [Lower-bound property]
Let n be the number of unit time databases, p the period length, and m the periodic minimum support. If the count C_S[p,o] of periodic pattern S[p,o] is less than m·n/(t·p) for an integer t, then S cannot be a frequent periodic pattern with period length t·p and offset o.

Proof. By the period contain property, f_S[t·p,o] ⊆ f_S[p,o], and hence C_S[t·p,o] ≤ C_S[p,o] < m·n/(t·p). Since the feature list for period length t·p has n/(t·p) positions, a frequent pattern of that period needs a count of at least m·n/(t·p). Therefore, S cannot be a frequent periodic pattern with period length t·p and offset o.

Lemma 1 guarantees that if the count of a pattern in period p is lower than this threshold, the pattern has no chance of being frequent in any period contained by p. This is one of the pruning properties of our approach. Following the definition of redundant patterns, we discover that only the un-frequent itemsets of the prime-number periods need to be checked in the composite-number periods. Therefore, in order to avoid generating redundant patterns and to achieve the best pruning effect, we first mine the patterns to which Lemma 1 cannot be applied.
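The period contain property and the lower-bound test of Lemma 1 can be sketched in Python as follows (an illustrative sketch; the function names are ours):

```python
def contains(p, o, p2, o2):
    """Period contain property: [p2,o2] is contained by [p,o] iff
    p2 is an integer multiple of p and o2 mod p == o."""
    return p2 % p == 0 and o2 % p == o

def can_be_frequent(count, n, p, t, min_support):
    """Lemma 1: S[p,o] can still be frequent with period length t*p only if
    its count in period p reaches m * n / (t * p); otherwise it is pruned."""
    return count >= min_support * n / (t * p)
```

For example, with n = 24 unit time databases and periodic minimum support 0.5, a pattern of period length 2 needs a count of at least 0.5 · 24 / 4 = 3 to remain a candidate for period length 4.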
These are exactly the prime-number periods. Because every composite number is a product of primes, we can use the information gathered by PPM to prune the candidate itemsets in the composite-number periods.

3.3.1 Prime Period Mining (PPM)

A prime period is a period whose length is a prime number, and so cannot be composed from other periods; therefore, the pruning properties cannot be applied when mining prime periods. The single periodic patterns mining algorithm is used to mine the prime-number periods, with some modifications to adapt it to multiple partial periodic patterns mining. The modified algorithm is shown below.

Procedure PPM(D: large-1 matrix)
  Scan D to find L1;
  C1 := all 1-itemsets;
  For all {a} in C1 - L1
    If count of {a} < (n / Max_period) * periodic_min_sup
      Discard {a} from C1;
  For (k := 2; ; k++)
    Join C(k-1) to generate Ck;
    Scan D to find Lk;
    For all {a} in Ck - Lk
      If count of {a} < (n / Max_period) * periodic_min_sup
        Discard {a} from Ck;
    If Ck = {} then break;
  Output(frequent pattern list);
  Output(un-frequent itemset list);
End

Here (n / Max_period_length) · periodic minimum support is a threshold called the pruning minimum support. The reason we use the pruning minimum support to prune candidate itemsets is the following indefinite property of partial periodicity.

Property 2 [Indefinite property on partial periodicity]
If pattern S[p,o] is a frequent partial periodic pattern, S may not be a frequent partial periodic pattern in a period contained by [p,o]. Conversely, if S[p,o] is not a frequent partial periodic pattern, S may still be a frequent partial periodic pattern in a period contained by [p,o].

Property 2 follows from the nature of the partial periodic pattern definition. Suppose pattern S[2,0] is not a frequent partial periodic pattern: let f_S[2,0] = 101010101000 and the periodic minimum support be 50%. However, f_S[4,0] = 111110, so it is obvious that S[4,0] is a frequent periodic pattern.
In the other case, suppose pattern S[2,0] is a frequent partial periodic pattern: let f_S[2,0] = 011111 and the periodic minimum support be 75%. We then have f_S[4,0] = 011, so it is obvious that S[4,0] is not a frequent periodic pattern.

According to Property 2, because the frequency of a pattern in the current period cannot be predicted exactly from previous periods, the candidate pruning method of [1] must be modified. The candidate itemsets are divided into two parts: un-frequent itemsets and pruned itemsets. The PPM algorithm uses two thresholds: the periodic minimum support and the pruning minimum support. An itemset whose count reaches the periodic minimum support count is a frequent periodic pattern and is put into the frequent pattern list. An itemset whose count is below the pruning minimum support count is pruned. The remaining itemsets, whose counts lie between the pruning minimum support count and the periodic minimum support count, are the un-frequent itemsets and cannot be pruned. The pruning minimum support is set to (n / Max_period_length) · periodic minimum support because this is the minimum count a pattern must reach to be frequent in any period whose length is no larger than the maximum period length. Any itemset whose count is beyond this threshold still has a chance of becoming a frequent periodic pattern; therefore, these un-frequent itemsets, together with the frequent itemsets, are joined to generate the candidate itemsets of the next mining pass.
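The two-threshold bookkeeping of PPM described above can be sketched as follows (a sketch with our own names; `counts` maps each itemset of the current pass to its support count in period length p):

```python
def classify_itemsets(counts, n, p, max_period, min_support):
    """Split the itemsets of period length p into three groups:
    - frequent:  count >= periodic minimum support count, (n/p) * min_support
    - pruned:    count <  pruning minimum support count, (n/max_period) * min_support
    - candidate: everything in between (the un-frequent itemsets), kept for
      joining in the next pass and for later checking by CPM."""
    frequent_cut = (n / p) * min_support
    pruning_cut = (n / max_period) * min_support
    frequent, candidate, pruned = [], [], []
    for itemset, c in counts.items():
        if c >= frequent_cut:
            frequent.append(itemset)
        elif c < pruning_cut:
            pruned.append(itemset)
        else:
            candidate.append(itemset)
    return frequent, candidate, pruned
```

For example, with n = 30 unit time databases, period length 2, maximum period length 30, and periodic minimum support 0.8, the frequent cut is 12 and the pruning cut is 0.8.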
3.3.2 Composite Period Mining (CPM)

A composite-number period is composed of factor periods. Therefore, we can use the information gathered by the prime period mining procedure to judge a composite period, instead of running the prime period mining procedure again. This is called period expansion, which is based on Property 3 and is depicted in Figure 4: period 2-0 expands to 4-0 and 4-2; 4-0 expands to 8-0 and 8-4; and 4-2 expands to 8-2 and 8-6.

Figure 4. Period expansion (the tree [2,0] → [4,0], [4,2] → [8,0], [8,4], [8,2], [8,6])

Property 3 [Period expansion property]
Let S[p,o] be a candidate itemset, [p,o] the base period, and [p',o'] the expansion period. [p,o] and [p',o'] share the same related large-1 matrix D_o^p of S if and only if [p',o'] is contained by [p,o].

The main concept of period expansion is that the support counts of itemsets mined in the base period can serve as the foundation for the expansion period. Because the expansion period and the base period share the same related large-1 matrix, once we have the support count of an itemset in the base period, Lemma 1 lets us predict whether the itemset can be frequent in the expansion period. If the support count of the itemset is lower than the threshold, the itemset is put into the un-frequent itemset list again, to await examination in the next round of period expansion. If it is beyond the threshold, the feature list of the itemset is generated and checked by the Boolean operations discussed above to confirm whether it is a frequent periodic pattern. If it is, it is put into the frequent itemset list; otherwise, it is put into the un-frequent itemset list.

The pruning of redundant periodic patterns is achieved by period expansion as well. During expansion, the itemsets we check are those in the un-frequent itemset list, so no redundant periodic patterns are generated in the expansion periods. For example, suppose we have a frequent periodic pattern BE[2,0] and an un-frequent pattern CD[2,0]. BE and CD are put into the frequent itemset list and the un-frequent itemset list, respectively. When mining the expansion periods, only CD is checked in periods [4,0] and [4,2] to determine whether it is a frequent periodic pattern there. The redundant patterns are thus pruned automatically: because we check only the un-frequent itemsets of the base period, redundant itemsets never appear in the expansion periods.
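One expansion step, as drawn in Figure 4, can be sketched as follows (a sketch under our own naming; `m` is the expansion factor, which is 2 throughout Figure 4):

```python
def expand(p, o, m, max_period):
    """One period-expansion step of CPM: base period [p,o] expands by factor m
    to the m contained periods [m*p, o + k*p] for k = 0..m-1, as long as the
    expanded length does not exceed the maximum period length."""
    if m * p > max_period:
        return []
    return [(m * p, o + k * p) for k in range(m)]
```

Each expanded period [m·p, o + k·p] is contained by [p,o] in the sense of the period contain property, so it shares the base period's related large-1 matrix; for instance, expanding [2,0] by factor 2 yields [4,0] and [4,2], exactly as in Figure 4.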
The check process runs in linear time, so its efficiency is guaranteed.

3.4 Data Post-Processing

Whether we are interested in single or multiple periodic patterns, so far we only have a list of candidate periodic patterns. Without this final check step, we cannot be sure these patterns are truly frequent periodic patterns. The candidate periodic patterns produced by PPM and CPM are "frequent" only in the large-1 matrix, not necessarily in the unit time databases; we cannot guarantee that a pattern frequent in the large-1 matrix is also frequent in the unit time databases. Therefore, in the final step of our algorithm, the unit time databases are scanned to determine whether the candidates generated by PPM and CPM are frequent itemsets. Suppose we have a candidate periodic pattern S[p,o]. We must check whether the pattern S is large in each unit time database with timestamp i = k·p + o; the pattern is then judged again against the periodic minimum support count over those unit time databases. Applying this check process to all candidate periodic patterns, we obtain all the final frequent periodic patterns.

CHAPTER 4 EXPERIMENT AND EFFICIENCY ANALYSIS

4.1 Simulation Platform

In this section, a set of simulations is performed to show the benefit of our approach. A comparison between Han's method (the repeated approach) and our approach is also made; the results show a notable improvement in efficiency for our approach. The test data are generated by the IBM Synthetic Data Generator [6]. The parameters used to generate the synthetic data and run the experiments are shown in Table 1.
Take D100 δ100 T10 I4 N1 S0.8 for example: D100 means there are 100K transactions per unit time database, δ100 means there are 100K unit time databases, T10 means the average transaction size in a unit time database is 10, I4 means the average length of a frequent pattern in a unit time database is 4, N1 means the number of items in a unit time database is 1K, and S0.8 means the periodic minimum support is 0.8.

Notation  Meaning                                         Default  Range
D         Number of transactions per unit time database   100K     -
δ         Number of unit time databases                   100K     50K~200K
T         Average size of transaction                     10       10~20
I         Average length of large itemset                 4        3~7
N         Number of items                                 1,000    0.1K~1K
P         Maximal period length                           30       2~100
S         Periodic minimum support                        0.8      0.8~1

Table 1. Simulation parameters

4.2 Efficiency Analysis

Figure 5. Execution time versus maximal period length (D100 δ100 T10 I4 N1 S0.8); series: Han and CPS

Figure 5 shows the effect of the maximal period length. The improvement of our approach is less obvious when the maximal period length is smaller than 30. The reason is that for numbers smaller than 30, the ratio of prime to non-prime numbers is high, and as mentioned above, the pruning properties apply only to periodic patterns with non-prime period lengths.

Figure 6. Execution time versus number of unit time databases (δ100 T10 I4 N1 P30 S0.8, δ from 50,000 to 200,000); series: Han and CPS

The effect of the number of unit time databases δ on mining efficiency is shown in Figure 6. As the results show, the execution time of both approaches increases as δ increases; moreover, our approach performs much better than Han's approach as δ increases.

Figure 7. Execution time versus periodic minimum support (D100 δ100 T20 I4 N0.1 P30); series: Han and CPS

Figure 7 shows the effect of the periodic minimum support.
As shown in the results, even for a large periodic minimum support, our approach outperforms Han's approach.

[Figure 8. Execution time (s) vs. average length of large itemset (3~7), D100δ100T10N0.1P30S0.8; curves: Han and CPM]

Figure 8 shows the effect of the average length of frequent patterns in the unit time databases. Both approaches are affected by the average length of the frequent patterns; however, our approach outperforms Han's approach, especially when the average length of the frequent patterns is large.

CHAPTER 5 CONCLUSION AND FUTURE WORKS

5.1 Conclusion

In this thesis, an efficient mining method for multiple partial periodic patterns has been studied. We have also explored the full and partial periodic pattern mining issues. The main objective, and the difference from previous studies [12][6][10], is that our proposed method is efficient and avoids generating redundant periodic patterns. By studying several interesting properties related to multiple partial periodic pattern mining, such as the indefinite property of partial periodicity, the lower-bound property, and the period expansion property, an efficient multiple partial periodic pattern mining algorithm is proposed. The experiments show that the proposed algorithm offers excellent performance.

5.2 Future Works

Many issues remain in multiple periodic pattern mining, such as mining multiple periodic association rules, query-based mining of multiple partial periodic patterns, and applying distributed computing environments to improve the efficiency of our approach. We will continue studying these problems and report our progress in due course.

Reference

[1] R. Agrawal and R. Srikant. "Fast algorithms for mining association rules". In Proc. 1994 Int. Conf. Very Large Data Bases, pages 487–499, Santiago, Chile, September 1994.
[2] R. Agrawal and R. Srikant. "Mining sequential patterns". In Proc. 1995 Int. Conf. Data Engineering, pages 3–14, Taipei, Taiwan, March 1995.
[3] L.
Breiman, J. Friedman, R. Olshen, and C. Stone. "Classification and Regression Trees". Monterey, CA: Wadsworth International Group, 1984.
[4] Y. Cai, N. Cercone, and J. Han. "Attribute-oriented induction in relational databases". In G. Piatetsky-Shapiro and W. J. Frawley, editors, Knowledge Discovery in Databases, pages 213–228, 1991.
[5] Z. Huang. "Extensions to the k-means algorithm for clustering large data sets with categorical values". Data Mining and Knowledge Discovery, 2:283–304, 1998.
[6] Jiawei Han, Guozhu Dong, and Yiwen Yin. "Efficient Mining of Partial Periodic Patterns in Time Series Database". In Proc. Fifteenth International Conference on Data Engineering, 1999.
[7] J. Han and Y. Fu. "Exploration of the power of attribute-oriented induction in data mining". In U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, editors, Advances in Knowledge Discovery and Data Mining, pages 399–421, 1996.
[8] E. Knorr and R. Ng. "A unified notion of outliers: Properties and computation". In Proc. 1997 Int. Conf. Knowledge Discovery and Data Mining (KDD'97), pages 219–222, Newport Beach, CA, August 1997.
[9] L. Kaufman and P. J. Rousseeuw. "Finding Groups in Data: An Introduction to Cluster Analysis". New York: John Wiley & Sons, 1990.
[10] Yingjiu Li, Peng Ning, X. Sean Wang, and Sushil Jajodia. "Discovering Calendar-based Temporal Association Rules". In Proc. Eighth International Symposium on Temporal Representation and Reasoning (TIME 2001), 2001.
[11] R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. "Exploratory mining and pruning optimizations of constrained association rules". In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data, pages 13–24, Seattle, Washington, June 1998.
[12] B. Özden, S. Ramaswamy, and A. Silberschatz. "Cyclic association rules". In Proc. 1998 Int. Conf. Data Engineering (ICDE'98), pages 412–421, Orlando, FL, February 1998.
[13] J. S. Park, M. S. Chen, and Philip S. Yu. "An Effective Hash-Based Algorithm for Mining Association Rules". In Proc.
of the 1995 ACM SIGMOD Conference, pages 175–186, San Jose, California, USA, May 1995.
[14] J. R. Quinlan. "Induction of decision trees". Machine Learning, 1:81–106, 1986.
[15] J. R. Quinlan. "C4.5: Programs for Machine Learning". San Mateo, CA: Morgan Kaufmann, 1993.
[16] R. Srikant and R. Agrawal. "Mining Generalized Association Rules". In Proceedings of the 21st International Conference on Very Large Data Bases, pages 407–419, Zurich, Switzerland, September 1995.
[17] R. Srikant and R. Agrawal. "Mining Quantitative Association Rules". In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data, pages 1–12, Montreal, Canada, June 1996.
[18] A. Savasere, E. Omiecinski, and S. Navathe. "An Efficient Algorithm for Mining Association Rules in Large Databases". In Proceedings of the 21st International Conference on Very Large Data Bases, pages 432–444, Zurich, Switzerland, September 1995.
[19] I. H. Toroslu and M. Kantarcioglu. "Mining Cyclic Patterns". In DAWAK 2001 (LNCS 2114), pages 83–92, Munich, Germany, September 2001.