I-Shou University
Institute of Information Management
Master's Thesis

Maintenance of Discovered Informative Rule Sets

Student: Kuan-Wei Huang
Advisor: Dr. Tien-Chin Wang
Co-advisor: Dr. Shyue-Liang Wang

A Thesis Submitted to the Department of Information Management, I-Shou University, in Partial Fulfillment of the Requirements for the Master's Degree in Information Management. June 2003, Kaohsiung, Taiwan.

ABSTRACT

The goal of this research is to study the efficient maintenance of a discovered informative rule set (IRS) when new transaction data are added to and/or deleted from the original transaction database. An informative rule set is the smallest subset of an association rule set that makes the same prediction sequence according to confidence priority. Prediction is, for example, the following process: given a set of rules that describe the shopping behavior of the customers in a store over time, and some purchases made by a particular customer, we wish to predict what other purchases will be made by that customer. The problem of maintaining a discovered informative rule set is: given a transaction database and its informative rule set, when the database receives insertions, deletions, or modifications, maintain the discovered informative rule set as efficiently as possible.

Based on the Fast UPdate (FUP) technique for updating discovered association rules, we present two algorithms to maintain the discovered IRS. The proposed incremental insertion algorithm maintains the discovered IRS efficiently under database insertion, and the proposed incremental deletion algorithm maintains it efficiently under database deletion. A numerical comparison with the non-incremental informative rule set approach demonstrates that the proposed techniques require fewer database scans, fewer candidate rules, and less computation time to maintain the discovered informative rule set.

Keywords: data mining, prediction, informative rule set, incremental discovery, maintenance

ACKNOWLEDGEMENTS

This thesis owes its completion first of all to my two advisors, Professor Tien-Chin Wang and Professor Shyue-Liang Wang, who gave me great encouragement and guidance throughout these two years of graduate study, deepened my understanding of this field, and offered timely correction and instruction whenever I went astray.

I also thank my oral examination committee members, Professor Chien-Hung Lin and Professor Chih-Hung Wu of Shu-Te University, for their guidance during the oral defense; their suggestions made the content of this thesis more complete, and I offer them my heartfelt gratitude.

I further thank the many seniors, classmates, and juniors who accompanied me through these two years; their company and encouragement made this period of study more colorful and more joyful.

Finally, I thank my family, whose quiet support throughout my studies allowed me to finish my schoolwork free of worries.

Kuan-Wei Huang, I-Shou University

Contents

ABSTRACT
ACKNOWLEDGEMENTS
LIST OF FIGURES
LIST OF TABLES
CHAPTER 1 INTRODUCTION
    1.1 Background
    1.2 Motivation
    1.3 Thesis Organization
CHAPTER 2 LITERATURE SURVEY
    2.1 Association Rules for Prediction
    2.2 Informative Rule Set for Prediction
    2.3 Maintenance of Association Rules
CHAPTER 3 DISCOVERY OF INFORMATIVE RULE SET: INCREMENTAL INSERTION
    3.1 Problem Description
    3.2 Notations
    3.3 Algorithm
    3.4 Example
CHAPTER 4 DISCOVERY OF INFORMATIVE RULE SET: INCREMENTAL DELETION
    4.1 Problem Description
    4.2 Algorithm
    4.3 Example
CHAPTER 5 EXPERIMENTAL RESULTS
    5.1 Incremental Insertion Results
    5.2 Incremental Deletion Results
CHAPTER 6 CONCLUSION AND FUTURE WORK
REFERENCES

List of Figures

Figure 2.1 A simple database D
Figure 2.2 A transaction database after deletion and insertion
Figure 3.1 A simple database D
Figure 3.2 New data set △+
Figure 3.3 A candidate tree over the set of items {a, b, c, d}
Figure 4.1 A simple database D
Figure 4.2 Deleted data set △-
Figure 4.3 A candidate tree over the set of items {a, b, c, d}
Figure 5.1 Running time comparison of incremental and non-incremental approaches under various minimum supports, incremental data size 2,000 records
Figure 5.2 Running time comparison of incremental and non-incremental approaches under various minimum supports, incremental data size 10,000 records
Figure 5.3 Running time comparison of incremental and non-incremental approaches under various incremental data sizes, minimum support 10%
Figure 5.4 Running time comparison of incremental and non-incremental approaches under various incremental data sizes, minimum support 2%
Figure 5.5 Running time comparison of incremental and non-incremental approaches under various minimum supports, incremental data size 2,000 records
Figure 5.6 Running time comparison of incremental and non-incremental approaches under various minimum supports, incremental data size 10,000 records
Figure 5.7 Running time comparison of incremental and non-incremental approaches under various incremental data sizes, minimum support 10%
Figure 5.8 Running time comparison of incremental and non-incremental approaches under various incremental data sizes, minimum support 6%

List of Tables

Table 2.1 A summary of recent association rule mining approaches
Table 2.2 Association rule set obtained from Figure 2.1
Table 2.3 Informative rule set obtained from Figure 2.1
Table 2.4 A summary of recent association rule maintenance approaches
Table 3.1 Comparisons of the non-incremental and incremental insertion approaches
Table 4.1 Comparisons of the non-incremental and incremental deletion approaches

CHAPTER 1 INTRODUCTION

1.1 Background

The discovery of association rules in transaction databases is an important data-mining problem because of its wide application in many areas, such as market basket analysis, decision support, financial forecasting, collaborative recommendation, and prediction. Prediction is, for example, the following process: given a set of rules that describe the shopping behavior of the customers in a store over time, and some purchases made by a particular customer, we wish to predict what other purchases will be made by that customer.

Many techniques have been proposed for prediction in the past. In addition to the classical decision-tree induction approach, there are Bayesian classification, neural networks, nearest-neighbor classifiers, case-based reasoning, genetic algorithms, rough sets, fuzzy sets, and data-mining approaches.

For the data-mining approach, the association rule set is usually used for prediction. However, traditional association rule algorithms typically generate a large number of rules, most of which are unnecessary when used for prediction.
Enhancements that simplify the association rule set [2], both directly and indirectly, have therefore been studied extensively. Most indirect algorithms simplify the set by post-pruning and reorganization of association rules. The direct algorithms attempt to reduce the number of association rules directly, for example, constraint association rule sets [17][20], non-redundant rule sets [18][19], and informative rule sets [12].

In this work, we are particularly interested in improving the efficiency of mining informative rule sets when the transaction database is updated, i.e., when a small transaction data set is added to and/or deleted from the original database. This problem is referred to as the maintenance of the discovered informative rule set.

1.2 Motivation

One possible approach to the maintenance problem is to re-run the data-mining algorithm on the whole updated database. However, this approach has an obvious disadvantage: for association rules, all the computation done initially to find the old large itemsets is wasted, and everything has to be computed again from scratch. In the case of the IRS, the support counts of the candidate itemsets likewise have to be re-computed from scratch for the updated database. Therefore, more efficient algorithms that compute the large itemsets of the updated database by utilizing the information in the old large itemsets are quite desirable.

1.3 Thesis Organization

The rest of this thesis is organized as follows. Chapter 2 reviews association rules and informative rule sets for prediction, as well as the maintenance of association rules. Chapter 3 presents the proposed incremental insertion algorithm. Chapter 4 presents the incremental deletion algorithm. Chapter 5 shows the experimental results of the proposed algorithms. Conclusions and future work are finally given in Chapter 6.

CHAPTER 2 LITERATURE SURVEY

In this chapter, we review association rules and informative rule sets for prediction, as well as the maintenance of association rules. In Section 2.1, we review the basic concept of association rule mining and summarize related research of recent years [1][2][17]. In Section 2.2, we review the concept of informative rule sets from transaction databases [12]. In Section 2.3, we review recent work on the maintenance of discovered association rules [3][7][8][10][16].

2.1 Association Rules for Prediction

The association rule has been the most widely studied pattern in the field of data mining. The following table summarizes some of the recent activities.

Method | Author | Subject | Year
Association rules | David W. Cheung | Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique (FUP) | 96'
Association rules | David W. Cheung | Maintenance of Discovered Knowledge: A Case in Multi-level Association Rules (FUP*) | 96'
Association rules | David W. Cheung | A General Incremental Technique for Maintaining Discovered Association Rules (FUP2) | 97'
Association rules | L.P. Chen | Efficient Graph-Based Algorithms for Discovering and Maintaining Association Rules in Large Databases (DUP) | 01'
Association rules | Necip Fazil Ayan | An Efficient Algorithm to Update Large Itemsets with Early Pruning (UWEP) | 99'
Association rules | Shiby Thomas | An Efficient Algorithm for Incremental Updating of Association Rules in Large Databases | 97'
Association rules | T.P. Hong | Incremental Data Mining Using Pre-large Itemsets | 02'
Non-redundant association rules | Zaki | An Efficient Algorithm for Closed Association Rule Mining | 99'
Non-redundant association rules | Zaki | Generating Non-Redundant Association Rules | 00'
Non-redundant association rules | Yves Bastide | Mining Minimal Non-Redundant Association Rules Using Frequent Closed Itemsets | 00'
Constraint-based association rules | David W. Cheung, Jiawei Han | A Fast Distributed Algorithm for Mining Association Rules | 96'
Constraint-based association rules | Agrawal | Mining Association Rules with Item Constraint | 97'
Constraint-based association rules | Chunhua Wang | Distributed Mining for Association Rules With Item Constraints | 00'
Association rules for prediction | Jiuyong Li | Mining the Smallest Association Rule Set for Predictions | 01'

Table 2.1 A summary of recent association rule mining approaches
Here we briefly review the basic concept of an association rule. Let I = {i1, i2, …, im} be a set of items. Let D be a set of transactions, where each transaction T is a set of items such that T ⊆ I. Each transaction is associated with a unique identifier, called its TID. An association rule is an implication of the form X => Y, where X ⊆ I, Y ⊆ I, and X ∩ Y = ∅. The rule X => Y has support s if s% of the transactions in D contain X ∪ Y, and it has confidence c if c% of the transactions in D that contain X also contain Y. This is the original definition of an association rule.

Informally, prediction using an association rule set can be described as follows. For a given association rule set R and an itemset P, the prediction for P from R is a sequence of items Q. The sequence Q is generated by applying the rules in R in descending order of confidence. For each rule r that matches P (i.e., for each rule whose antecedent is a subset of P), each consequent of r is added to Q. After adding a consequent to Q, all rules whose consequents are in Q are removed from R. The following example shows the association rules of a simple data set and their application to prediction.

Example 1. Consider the small database shown in Figure 2.1, with minimum support 0.5 and minimum confidence 0.5. For the rule a => b, the support 0.67 is the percentage of transactions that contain both a and b, and the confidence 0.8 means that 80% of the transactions that contain a also contain b. A set of 12 association rules can be found, as shown in Table 2.2.

TID | Items
1 | abc
2 | abc
3 | abc
4 | abd
5 | acd
6 | bcd

Figure 2.1 A simple database D

AR | Support | Confidence
1. a=>b | 0.67 | 0.8
2. a=>c | 0.67 | 0.8
3. b=>a | 0.67 | 0.8
4. b=>c | 0.67 | 0.8
5. c=>a | 0.67 | 0.8
6. c=>b | 0.67 | 0.8
7. ab=>c | 0.5 | 0.75
8. ac=>b | 0.5 | 0.75
9. bc=>a | 0.5 | 0.75
10. a=>bc | 0.5 | 0.6
11. b=>ac | 0.5 | 0.6
12. c=>ab | 0.5 | 0.6

Table 2.2 Association rule set obtained from Figure 2.1

For prediction, given an itemset P = {a, b}, the predicted sequence of items will be Q = {b, c, a}. It can be observed that not all association rules are used to produce the predicted sequence Q.
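To make the definitions above concrete, the following short Python sketch (our own illustration, not code from the thesis; all names are hypothetical) computes support and confidence over the database of Figure 2.1, enumerates the single-target rules, and produces a prediction sequence in descending confidence order, exactly as described above.

```python
from itertools import combinations

# The database D of Figure 2.1: one transaction per string.
D = [frozenset(t) for t in ("abc", "abc", "abc", "abd", "acd", "bcd")]

def support(itemset, db):
    """Fraction of transactions in db containing every item of itemset."""
    return sum(itemset <= t for t in db) / len(db)

def mine_rules(db, minsup=0.5, minconf=0.5):
    """Enumerate single-target rules X => y meeting both thresholds."""
    items = sorted(set().union(*db))
    rules = []  # entries: (antecedent X, consequent y, support, confidence)
    for size in range(1, len(items)):
        for X in map(frozenset, combinations(items, size)):
            sup_x = support(X, db)
            if sup_x == 0:
                continue
            for y in (i for i in items if i not in X):
                sup = support(X | {y}, db)
                if sup >= minsup and sup / sup_x >= minconf:
                    rules.append((X, y, sup, sup / sup_x))
    return rules

def predict(P, rules):
    """Prediction sequence Q for itemset P, by descending confidence."""
    Q = []
    for X, y, _, _ in sorted(rules, key=lambda r: -r[3]):
        if X <= P and y not in Q:  # rule matches P; y not yet predicted
            Q.append(y)
    return Q

print(predict(frozenset("ab"), mine_rules(D)))  # expected: ['b', 'c', 'a']
```

Running this on Figure 2.1 with P = {a, b} reproduces the predicted sequence Q = {b, c, a} of Example 1.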
2.2 Informative Rule Set for Prediction

Basically, an informative rule set is the smallest subset of an association rule set such that it makes the same prediction sequence according to confidence priority. The definition of the informative rule set introduced in [12] is given as follows.

Definition 2.1. Let RA be an association rule set and RA1 the set of single-target rules in RA. A set RI is informative over RA if (1) RI ⊆ RA1; (2) for every r ∈ RI there does not exist r' ∈ RI such that r' ⊂ r and conf(r') ≥ conf(r); and (3) for every r'' ∈ RA1 − RI there exists r ∈ RI such that r ⊂ r'' and conf(r'') ≤ conf(r).

A top-down, level-wise searching algorithm using a candidate tree is proposed in [12] for the efficient discovery of the informative rule set. Consider again the database shown in Figure 2.1. The informative rule set for the same minimum support and confidence will be the first 6 association rules in Table 2.2, i.e., RD = {a=>b, a=>c, b=>a, b=>c, c=>a, c=>b}, which makes the same prediction sequence as the whole association rule set. The result is shown in Table 2.3.

AR | Support | Confidence
1. a=>b | 0.67 | 0.8
2. a=>c | 0.67 | 0.8
3. b=>a | 0.67 | 0.8
4. b=>c | 0.67 | 0.8
5. c=>a | 0.67 | 0.8
6. c=>b | 0.67 | 0.8

Table 2.3 Informative rule set obtained from Figure 2.1

The confidence priority of the informative rule set can be further illustrated as follows. Assume the following two rules exist:

(1) Purchasing PRINTER => Purchasing PRINTING PAPER (confidence 80%)
(2) Purchasing (PRINTER and PRINTING INK) => Purchasing PRINTING PAPER (confidence 60%)

We can predict with 80% confidence that a customer who purchases a printer also purchases printing paper. The rule that a customer who purchases both a printer and printing ink will also purchase printing paper has lower confidence; that rule is therefore redundant.

Example 2. Consider the rule set {a=>c (0.67, 0.8), b=>c (0.67, 0.8), ab=>c (0.5, 0.75)} from Table 2.2, where the numbers in parentheses are the support and confidence, respectively. Every transaction covered by the rule ab=>c is also covered by rule a=>c or b=>c with higher confidence, so ab=>c can be omitted from the informative rule set without losing predictive capability.

Example 3. Consider the rule set {a=>b (0.67, 0.8), a=>c (0.67, 0.8), a=>bc (0.5, 0.6)} from Table 2.2. Rules a=>b and a=>c provide predictions b and c with higher confidence than rule a=>bc, so rule a=>bc can be omitted from the informative rule set.
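The pruning behind Examples 2 and 3 follows directly from Definition 2.1: a single-target rule is dropped whenever a more general rule with the same consequent has confidence at least as high. The sketch below (our own illustration, reusing mine_rules and D from the previous sketch) checks each rule against the full rule list; Definition 2.1 is stated against RI itself, but for this small example the two coincide.

```python
def informative_rules(rules):
    """Drop X => y when some more general rule X' => y (X' a proper
    subset of X) has confidence at least conf(X => y)."""
    return [(X, y, sup, conf)
            for X, y, sup, conf in rules
            if not any(X2 < X and y2 == y and conf2 >= conf
                       for X2, y2, _, conf2 in rules)]

for X, y, sup, conf in informative_rules(mine_rules(D)):
    print("".join(sorted(X)), "=>", y, round(sup, 2), round(conf, 2))
# Prints the six rules of Table 2.3: a=>b, a=>c, b=>a, b=>c, c=>a, c=>b.
```

On Figure 2.1 this discards ab=>c, ac=>b, and bc=>a, leaving exactly the informative rule set of Table 2.3.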
2.3 Maintenance of Association Rules

The problem of maintaining a discovered association rule set in a large database has been studied extensively in the past [3][7][8][10][16]. Maintenance of a discovered association rule set is the process by which, given a transaction database and its association rule set, when the database receives insertions, deletions, or modifications, we maintain the discovered association rules as efficiently as possible. Figure 2.2 shows the relationship between a given transaction database D, the deleted data set △-, and the inserted data set △+: the updated databases are D- = D − △- and D+ = D ∪ △+.

Figure 2.2 A transaction database after deletion and insertion

The following table summarizes some incremental approaches of recent years, covering incremental insertion, deletion, and modification.

Serial | Subject | Author | Year | From | Increment
1 | Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique (FUP) | Cheung et al. | 96' | In Proceedings of the International Conference on Data Engineering, New Orleans, Louisiana, pp. 106-114 | insert
2 | Maintenance of Discovered Knowledge: A Case in Multi-level Association Rules (FUP*) | Cheung et al. | 96' | In Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 307-310 | insert
3 | Maintenance of Discovered Association Rules: When to Update? | Cheung et al. | 97' | DMKD 1997 | insert
4 | Efficient Mining of Association Rules in Distributed Databases (DMA) | Cheung et al. | 96' | IEEE Trans. on Knowledge and Data Engineering | insert
5 | A General Incremental Technique for Maintaining Discovered Association Rules (FUP2) | Cheung et al. | 97' | In Proceedings of the International Conference on Database Systems for Advanced Applications, 1-4 April 1997 | insert, delete
6 | An Efficient Algorithm for Incremental Updating of Association Rules in Large Databases | Shiby Thomas et al. | 97' | In Proc. KDD 1997 | update
7 | Efficient Graph-Based Algorithms for Discovering and Maintaining Knowledge in Large Databases | L.P. Chen et al. | 99' | In 3rd Pacific-Asia Conference, PAKDD-99 Proceedings, Beijing, China, pp. 409-419 | update
8 | An Efficient Approach for Incremental Association Rules Mining | L.P. Chen et al. | 99' | Methodologies for Knowledge Discovery and Data Mining, 3rd Pacific-Asia Conference, PAKDD-99 | update
9 | An Efficient Algorithm to Update Large Itemsets with Early Pruning (UWEP) | Necip Fazil Ayan et al. | 99' | In Proc. 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 1999 | insert
10 | Incremental Data Mining Using Pre-large Itemsets | T.P. Hong et al. | 02' | Proceedings of the 2002 IEEE International Conference on Data Mining (ICDM 2002) | update
11 | Efficient Graph-Based Algorithms for Discovering and Maintaining Association Rules in Large Databases (DUP) | L.P. Chen et al. | 01' | Knowledge and Information Systems (2001) 3: 338-355 | update

Table 2.4 A summary of recent association rule maintenance approaches

CHAPTER 3 DISCOVERY OF INFORMATIVE RULE SET: INCREMENTAL INSERTION

This chapter presents an incremental insertion algorithm for the maintenance of the discovered informative rule set when a small transaction data set is added to the original database. The proposed approach is based on the Fast UPdate (FUP) technique [7] for updating discovered association rules. In the following, we describe the problem statement, the notations used in this work, the top level of the proposed incremental insertion algorithm, the main functions of the algorithm, and an example demonstrating the proposed approach.

3.1 Problem Description

The problem of maintenance of the discovered informative rule set is: given a transaction database and its informative rule set, when the database receives insertions, deletions, or modifications, maintain the discovered informative rule set as efficiently as possible. For example, consider the database shown in Figure 3.1 and its informative rule set RD = {a=>b, a=>c, b=>a, b=>c, c=>a, c=>b}. When the new data set in Figure 3.2 is inserted, the updated informative rule set becomes RD+ = {a=>b, a=>c, b=>a, b=>c, c=>a, c=>b, d=>a, d=>b, a=>d, b=>d}.

TID | Items
1 | abc
2 | abc
3 | abc
4 | abd
5 | acd
6 | bcd

Figure 3.1 A simple database D

TID | Items
7 | abcd
8 | abd

Figure 3.2 New data set △+

In this chapter, we consider the maintenance of the discovered informative rule set under insertion only.

3.2 Notations

The following notation is used in this work.

D: original database
△+: new (inserted) data set; D+: updated database after insertion, D ∪ △+
△-: deleted data set; D-: updated database after deletion, D − △-
I: items in a database (e.g., ID, ID+)
TN: transactions in a database (e.g., TND, TND+)
X: an itemset, X ⊆ I
s: minimum support
c: minimum confidence
X.sup: support of X (e.g., X.supD, X.sup△+, X.supD+)
Ck: set of level-k candidate itemsets (e.g., CkD, Ck△+)
Lk: set of level-k large itemsets (e.g., LkD, Lk△+)
T: candidate tree (e.g., TD, T△+)
Tk: level k of T
R: informative rule set (e.g., RD, R△+, RD+)

3.3 Algorithm

This section presents the proposed incremental insertion algorithm for the maintenance of the discovered informative rule set. Based on the concept of the informative rule miner [12] and the FUP technique [7], the proposed algorithm generates the updated informative rule set directly, level by level, and stores it in a candidate tree. A candidate tree is an extended set enumeration tree, such as the one in Figure 3.3 over the set of items {a, b, c, d}. Each node in the candidate tree stores two sets {A, Z}: A is an itemset, the identity set of the node, and Z is a subset of the identity itemset, called the potential target set, in which every item can be the consequent of an association rule. For example, a node {{abc}, {ab}} is the set of candidates of two rules, namely bc=>a and ac=>b.
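To make the node layout concrete, here is a minimal sketch (our own illustration; the thesis gives no code, and all names are hypothetical) of a candidate tree node holding the identity set A and the potential target set Z:

```python
class CandidateNode:
    """A candidate-tree node: identity itemset A plus potential target set Z."""
    def __init__(self, A, Z):
        self.A = frozenset(A)            # identity itemset of the node
        self.Z = frozenset(Z) & self.A   # potential targets must be items of A
        self.support = 0                 # support count, filled in by scans
        self.children = {}               # item -> child CandidateNode

    def candidate_rules(self):
        """Rules encoded here: (A - {z}) => z for each potential target z."""
        return [(self.A - {z}, z) for z in sorted(self.Z)]

node = CandidateNode("abc", "ab")
for antecedent, target in node.candidate_rules():
    print("".join(sorted(antecedent)), "=>", target)   # bc => a, ac => b
```

This matches the node {{abc}, {ab}} of the running example: the node encodes exactly the two candidate rules bc=>a and ac=>b.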
Utilizing the FUP technique to update the supports of the 1-itemsets, the proposed algorithm first updates the level-one nodes of the candidate tree TD. It then generates the candidate nodes of the next level. While the set of candidate nodes is not empty, it updates the support counts of the candidate nodes using the FUP technique, prunes the unnecessary candidate rules or nodes, and includes the qualified rules in the result. This process repeats until no candidate node is generated. The following is the top level of the proposed incremental insertion algorithm for the informative rule set.

Figure 3.3 A candidate tree over the set of items {a, b, c, d}

Input: database D, the informative rule set RD, candidate tree TD, new data set △+
Output: the informative rule set RD+

1. RD+ = RD
2. Update supports of 1-itemsets
3. Update level one of candidate tree TD
4. Generate candidate nodes of next level
5. While (candidate nodes are not empty)
6.   Update supports of candidate nodes
7.   Prune the candidate rules or nodes
8.   Include qualified rule sets in RD+
9.   Generate candidate nodes of next level
10. Return rule set RD+

The functions for generating candidate nodes in steps 4 and 9 and for pruning the candidate nodes in step 7 are introduced in [12]. The support-update functions in steps 2 and 6 are given as follows.

Function: Update supports of 1-itemsets
(1) Initialization: W = L1D, C1D+ = ∅, L1D+ = ∅
(2) For all X ∈ I△+:
      If X ∈ W, update the support of X on D+   // X.supD known; scan △+
      Else C1D+ = C1D+ ∪ {X}, and calculate the support of X on △+
(3) For all X ∈ W:
      If X.supD+ ≥ s × (|D| + |△+|), then L1D+ = L1D+ ∪ {X}
(4) For all X ∈ C1D+:
      If X.sup△+ < s × |△+|, then C1D+ = C1D+ − {X}
(5) For all X ∈ C1D+:   // scan D; X is a non-frequent itemset in D
      If X.supD+ ≥ s × (|D| + |△+|), then L1D+ = L1D+ ∪ {X}

Function: Update supports of candidate nodes
(1) Initialization: W = TkD, TkD+ = ∅
(2) CkD+ = Candidate-Rule-Generator(Tk-1D+) − TkD   // Apriori-style generator function
(3) For all nodes ni ∈ W:
      If ni contains a non-frequent subset in Tk-1D+, then W = W − {ni}
(4) For all ni ∈ W:
      Calculate the support of ni on D+   // ni.supD known; scan △+
      If ni.supD+ ≥ s × (|D| + |△+|), then TkD+ = TkD+ ∪ {ni}
(5) For all ni ∈ CkD+:
      If ni.sup△+ < s × |△+|, then CkD+ = CkD+ − {ni}
(6) For all ni ∈ CkD+:   // scan D
      If ni.supD+ ≥ s × (|D| + |△+|), then TkD+ = TkD+ ∪ {ni}
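The 1-itemset update above is the core FUP idea: itemsets already known to be frequent in D only need △+ scanned, while new candidates are first filtered against △+ alone, and only the survivors trigger a single scan of D. A minimal Python sketch of that idea (our own illustration, using absolute counts rather than the percentage form above):

```python
def update_large_1_itemsets(D, delta_plus, old_counts, minsup):
    """FUP-style update of the frequent 1-itemsets under insertion.

    D, delta_plus: lists of transactions (sets of items).
    old_counts: {item: support count in D} for the items frequent in D.
    Returns {item: support count in D+} for the items frequent in D+.
    """
    threshold = minsup * (len(D) + len(delta_plus))

    inc_counts = {}                       # one scan of delta_plus only
    for t in delta_plus:
        for item in t:
            inc_counts[item] = inc_counts.get(item, 0) + 1

    new_large = {}
    for item, cnt in old_counts.items():  # previously frequent: no scan of D
        total = cnt + inc_counts.get(item, 0)
        if total >= threshold:
            new_large[item] = total

    # A previously infrequent item can be frequent in D+ only if it is
    # frequent within delta_plus itself; only such survivors cost a scan of D.
    for item, cnt in inc_counts.items():
        if item in old_counts or cnt < minsup * len(delta_plus):
            continue
        total = cnt + sum(item in t for t in D)   # the one scan of D
        if total >= threshold:
            new_large[item] = total
    return new_large

D = [set(t) for t in ("abc", "abc", "abc", "abd", "acd", "bcd")]
delta_plus = [set(t) for t in ("abcd", "abd")]          # Figure 3.2
old_counts = {"a": 5, "b": 5, "c": 5, "d": 3}           # frequent in D at s = 0.5
print(update_large_1_itemsets(D, delta_plus, old_counts, 0.5))
# {'a': 7, 'b': 7, 'c': 6, 'd': 5}, matching step 2 of Section 3.4.
```

The same two-phase pattern (cheap update for old frequent itemsets, △+-filtered candidates, then one scan of D) carries over to the level-k candidate nodes.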
3.4 Example

This section gives an example showing how the proposed algorithm finds the informative rule set incrementally.

Input: database D in Figure 3.1, the informative rule set RD = {a=>b, a=>c, b=>a, b=>c, c=>a, c=>b}, the candidate tree TD, and the new data set △+ in Figure 3.2
Output: the informative rule set RD+

1. RD+ = RD.
2. Update supports of 1-itemsets: a.supD+ = 7, b.supD+ = 7, c.supD+ = 6, d.supD+ = 5.
3. Update the level-one nodes a, b, c, d, with candidates {{a}, {a}}, {{b}, {b}}, {{c}, {c}}, {{d}, {d}}.
4. The candidate nodes in level two are the nodes b, c, d that are descendants of node a of level one. The other nodes are shown in Figure 3.3.
5. While (candidate nodes are not empty):
6. Update the supports of the candidate nodes of level two, as shown in Figure 3.3.
7. The node d descending from c is pruned.
8. RD+ = {a=>b, a=>c, b=>a, b=>c, c=>a, c=>b, d=>a, d=>b, a=>d, b=>d}.
9. Generate the candidate nodes of level three, as shown in Figure 3.3.
10. The process repeats the while loop until there is no candidate node, as shown in Figure 3.3. The resulting IRS is RD+ = {a=>b, a=>c, b=>a, b=>c, c=>a, c=>b, d=>a, d=>b, a=>d, b=>d}.

In the following, Table 3.1 summarizes the candidate rule sets and the number of database scans for the non-incremental (IRS) and incremental (IIRS) insertion algorithms. The candidate rule sets for the non-incremental approach number 4, 6, and 1 for levels 1, 2, and 3, respectively, whereas the incremental approach generates only 0, 3, and 0. In addition, the incremental approach saves database scans, as shown in Table 3.1.

          Ck             Scanning of database
          IRS   IIRS     IRS       IIRS
Level 1   4     0        D & △+    △+
Level 2   6     3        D & △+    D & △+
Level 3   1     0        D & △+    △+

Table 3.1 Comparisons of the non-incremental and incremental insertion approaches
CHAPTER 4 DISCOVERY OF INFORMATIVE RULE SET: INCREMENTAL DELETION

This chapter presents an incremental deletion algorithm for the maintenance of the discovered informative rule set when a small transaction data set is removed from the original database. The proposed approach is based on the FUP2 technique [8] for updating discovered association rules. In the following, we describe the problem statement, the top level of the proposed incremental deletion algorithm, the main functions of the algorithm, and an example demonstrating the proposed approach.

4.1 Problem Description

The problem of maintenance of the discovered informative rule set is: given a transaction database and its informative rule set, when the database receives insertions, deletions, or modifications, maintain the discovered informative rule set as efficiently as possible. For example, consider the database shown in Figure 4.1 and its informative rule set RD = {a=>b, a=>c, b=>a, b=>c, c=>a, c=>b}. When the data set in Figure 4.2 is deleted, the updated informative rule set becomes RD- = {a=>b, a=>c, b=>a, b=>c, c=>a, c=>b, d=>a, d=>b, a=>d, b=>d}.

TID | Items
1 | abc
2 | abc
3 | abc
4 | abd
5 | acd
6 | bcd

Figure 4.1 A simple database D

TID | Items
1 | abc
2 | abc

Figure 4.2 Deleted data set △-

In this chapter, we consider the maintenance of the discovered informative rule set under deletion only.

4.2 Algorithm

This section presents the proposed incremental deletion algorithm for the maintenance of the discovered informative rule set. Based on the concept of the informative rule miner [12] and the FUP2 technique [8], the proposed algorithm generates the updated informative rule set directly, level by level, and stores it in a candidate tree. The candidate tree is an extended set enumeration tree, such as the one in Figure 4.3 over the set of items {a, b, c, d}.

Figure 4.3 A candidate tree over the set of items {a, b, c, d}

Utilizing the FUP2 technique to update the supports of the 1-itemsets, the proposed algorithm first updates the level-one nodes of the candidate tree TD. It then generates the candidate nodes of the next level. While the set of candidate nodes is not empty, it updates the support counts of the candidate nodes using the FUP2 technique, prunes the unnecessary candidate rules or nodes, and includes the qualified rules in the result. This process repeats until no candidate node is generated. The following is the top level of the proposed incremental deletion algorithm for the informative rule set.

Input: database D, the informative rule set RD, candidate tree TD, removed data set △-
Output: the informative rule set RD-

1. RD- = RD
2. Update supports of 1-itemsets
3. Update level one of candidate tree TD
4. Generate candidate nodes of next level
5. While (candidate nodes are not empty)
6.   Update supports of candidate nodes
7.   Prune the candidate rules or nodes
8.   Include qualified rule sets in RD-
9.   Generate candidate nodes of next level
10. Return rule set RD-

The functions for generating candidate nodes in steps 4 and 9 and for pruning the candidate nodes in step 7 are introduced in [12]. The support-update functions in steps 2 and 6 are given as follows.

Function: Update supports of 1-itemsets
(1) Initialization: W = L1D, C1D- = ∅, L1D- = ∅
(2) For all X ∈ I△-:
      If X ∈ W, update the support of X on D-   // X.supD known; scan △-
      Else C1D- = C1D- ∪ {X}, and calculate the support of X on △-
(3) For all X ∈ W:
      If X.supD- ≥ s × (|D| − |△-|), then L1D- = L1D- ∪ {X}
(4) For all X ∈ C1D-:
      If X.sup△- > s × |△-|, then C1D- = C1D- − {X}
(5) For all X ∈ C1D-:   // scan D; X is a non-frequent itemset in D
      If X.supD- ≥ s × (|D| − |△-|), then L1D- = L1D- ∪ {X}

Function: Update supports of candidate nodes
(1) Initialization: W = TkD, TkD- = ∅
(2) CkD- = Candidate-Rule-Generator(Tk-1D-) − TkD   // Apriori-style generator function
(3) For all nodes ni ∈ W:
      If ni contains a non-frequent subset in Tk-1D-, then W = W − {ni}
(4) For all ni ∈ W:
      Calculate the support of ni on D-   // ni.supD known; scan △-
      If ni.supD- ≥ s × (|D| − |△-|), then TkD- = TkD- ∪ {ni}
(5) For all ni ∈ CkD-:
      If ni.sup△- > s × |△-|, then CkD- = CkD- − {ni}
(6) For all ni ∈ CkD-:   // scan D
      If ni.supD- ≥ s × (|D| − |△-|), then TkD- = TkD- ∪ {ni}
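Mirroring the insertion sketch in Chapter 3, the following is a minimal Python illustration of the deletion update (our own sketch, with absolute counts; names are hypothetical): previously frequent items only need △- scanned, and an item that was infrequent in D can turn frequent in D- only if it is under-represented in the deleted portion, so only those few candidates cost a scan of D.

```python
def update_large_1_itemsets_delete(D, delta_minus, old_counts, minsup):
    """FUP2-style update of the frequent 1-itemsets under deletion.

    old_counts: {item: support count in D} for the items frequent in D.
    Returns {item: support count in D-} for the items frequent in
    D- = D with the transactions of delta_minus removed.
    """
    threshold = minsup * (len(D) - len(delta_minus))

    dec_counts = {}                       # one scan of delta_minus only
    for t in delta_minus:
        for item in t:
            dec_counts[item] = dec_counts.get(item, 0) + 1

    new_large = {}
    for item, cnt in old_counts.items():  # previously frequent: no scan of D
        total = cnt - dec_counts.get(item, 0)
        if total >= threshold:
            new_large[item] = total

    # A previously infrequent item can become frequent in D- only when its
    # count in delta_minus is at most minsup * |delta_minus|. For simplicity
    # the item universe is gathered from D here; a real implementation would
    # carry it over from the original mining run.
    for item in set().union(*D) - set(old_counts):
        if dec_counts.get(item, 0) > minsup * len(delta_minus):
            continue
        total = sum(item in t for t in D) - dec_counts.get(item, 0)
        if total >= threshold:
            new_large[item] = total
    return new_large

D = [set(t) for t in ("abc", "abc", "abc", "abd", "acd", "bcd")]
delta_minus = [set(t) for t in ("abc", "abc")]          # Figure 4.2
old_counts = {"a": 5, "b": 5, "c": 5, "d": 3}
print(update_large_1_itemsets_delete(D, delta_minus, old_counts, 0.5))
# {'a': 3, 'b': 3, 'c': 3, 'd': 3}, matching step 2 of Section 4.3.
```

Note that deletion shrinks the support threshold along with the counts, which is why rules over item d become frequent in the running example even though no transaction containing d was added.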
4.3 Example

This section gives an example showing how the proposed algorithm finds the informative rule set incrementally.

Input: database D in Figure 4.1, the informative rule set RD = {a=>b, a=>c, b=>a, b=>c, c=>a, c=>b}, the candidate tree TD, and the deleted data set △- in Figure 4.2
Output: the informative rule set RD-

1. RD- = RD.
2. Update supports of 1-itemsets: a.supD- = 3, b.supD- = 3, c.supD- = 3, d.supD- = 3.
3. Update the level-one nodes a, b, c, d, with candidates {{a}, {a}}, {{b}, {b}}, {{c}, {c}}, {{d}, {d}}.
4. The candidate nodes in level two are the nodes b, c, d that are descendants of node a of level one. The other nodes are shown in Figure 4.3.
5. While (candidate nodes are not empty):
6. Update the supports of the candidate nodes of level two, as shown in Figure 4.3.
7. The node d descending from c is pruned.
8. RD- = {a=>b, a=>c, b=>a, b=>c, c=>a, c=>b, d=>a, d=>b, a=>d, b=>d}.
9. Generate the candidate nodes of level three, as shown in Figure 4.3.
10. The process repeats the while loop until there is no candidate node, as shown in Figure 4.3. The resulting IRS is RD- = {a=>b, a=>c, b=>a, b=>c, c=>a, c=>b, d=>a, d=>b, a=>d, b=>d}.

In the following, Table 4.1 summarizes the candidate rule sets and the number of database scans for the non-incremental (IRS) and incremental (IIRS) deletion algorithms. The candidate rule sets for the non-incremental approach number 4, 6, and 4 for levels 1, 2, and 3, respectively, whereas the incremental approach generates only 0, 2, and 4. In addition, the incremental approach saves some database scans, as shown in Table 4.1.

          Ck             Scanning of database
          IRS   IIRS     IRS        IIRS
Level 1   4     0        D- & △-    △-
Level 2   6     2        D- & △-    D- & △-
Level 3   4     4        D- & △-    D- & △-

Table 4.1 Comparisons of the non-incremental and incremental deletion approaches

CHAPTER 5 EXPERIMENTAL RESULTS

In this chapter, we present the experimental results of comparing the incremental insertion and deletion algorithms proposed in Chapters 3 and 4 with the non-incremental approach proposed in [12]. All programs are written in C++ and run on the same 600 MHz Pentium III PC with 256 MB of memory, running the Windows XP operating system. We ran the algorithms on a transaction data set of 50,000 records, generated by the synthetic data generator of QUEST from the IBM Almaden Research Center. In this data set, there are 100 different items; the average itemset length is 7.8, with maximum itemset length 16 and minimum itemset length 2.

5.1 Incremental Insertion Results

In the following, we present the processing times of the incremental insertion algorithm and the non-incremental algorithm under various minimum supports and incremental data sets.

Figure 5.1 Running time comparison of incremental and non-incremental approaches under various minimum supports, incremental data size 2,000 records

Figures 5.1 and 5.2 show the running times of both approaches under various minimum supports. The size of the original database D is 40,000 records. The minimum confidence is set at 20%. The sizes of the inserted data set are 2,000 and 10,000 records, respectively. We can observe that for small minimum supports, the incremental approach performs better than the non-incremental approach. For minimum supports greater than 20%, the two approaches perform similarly; this is because the number of candidate itemsets and rules becomes small and about the same as the minimum support increases.

Figure 5.2 Running time comparison of incremental and non-incremental approaches under various minimum supports, incremental data size 10,000 records

Figures 5.3 and 5.4 show the running times of both approaches under various incremental data sizes. The size of the original database D is 40,000 records. The minimum confidence is set at 20%. The minimum supports are 10% and 2%, respectively.

Figure 5.3 Running time comparison of incremental and non-incremental approaches under various incremental data sizes, minimum support 10%

Figure 5.4 Running time comparison of incremental and non-incremental approaches under various incremental data sizes, minimum support 2%

We can observe that for various incremental data sizes, the incremental approach performs better than the non-incremental approach for both minimum supports. In fact, the ratio of processing times is around 1.67 for all incremental data sizes.
5.2 Incremental Deletion Results

In the following, we present the processing times of the incremental deletion algorithm and the non-incremental algorithm under various minimum supports and incremental data sets.

Figure 5.5 Running time comparison of incremental and non-incremental approaches under various minimum supports, incremental data size 2,000 records

Figures 5.5 and 5.6 show the running times of both approaches under various minimum supports. The size of the original database D is 50,000 records. The minimum confidence is set at 20%. The sizes of the deleted data set are 2,000 and 10,000 records, respectively. We can observe that for small minimum supports, the incremental approach performs better than the non-incremental approach. For minimum supports greater than 20%, the two approaches perform similarly; this is because the number of candidate itemsets and rules becomes small and about the same as the minimum support increases.

Figure 5.6 Running time comparison of incremental and non-incremental approaches under various minimum supports, incremental data size 10,000 records

Figures 5.7 and 5.8 show the running times of both approaches under various incremental data sizes. The size of the original database D is 50,000 records. The minimum confidence is set at 20%. The minimum supports are 10% and 6%, respectively. We can observe that for various incremental data sizes, the incremental approach performs better than the non-incremental approach for both minimum supports.

Figure 5.7 Running time comparison of incremental and non-incremental approaches under various incremental data sizes, minimum support 10%

Figure 5.8 Running time comparison of incremental and non-incremental approaches under various incremental data sizes, minimum support 6%

CHAPTER 6 CONCLUSION AND FUTURE WORK

The informative rule set is the smallest subset of an association rule set such that the same prediction sequence by confidence priority can be achieved. The problem of maintaining discovered informative rule sets includes incremental insertion, incremental deletion, and incremental modification. In this thesis, we have studied the maintenance of the discovered informative rule set under insertion and deletion. For each of these two types of maintenance, we have proposed efficient searching algorithms, based on the FUP techniques, to maintain the discovered informative rule sets. In addition, numerical comparisons of the proposed incremental insertion and deletion approaches with the non-incremental approach have been shown. The experimental results show that the proposed approaches are more efficient than the non-incremental approach.

Although the proposed incremental approach works well, it is just a beginning, and much work remains on this topic. Maintenance of discovered informative rule sets under incremental modification has not been carried out. In addition, other maintenance approaches besides the FUP techniques should also be considered and compared.

REFERENCES

[1] R. Agrawal, T. Imielinski, and A. Swami, "Mining Association Rules between Sets of Items in Large Databases", Proc. of the ACM SIGMOD Conference on Management of Data, Washington DC, May 1993, 207-216.

[2] R. Agrawal and R. Srikant, "Fast Algorithms for Mining Association Rules", Proc. of the 20th Int'l Conference on Very Large Data Bases, Santiago, Chile, September 1994, 487-499.

[3] Necip Fazil Ayan, "An Efficient Algorithm to Update Large Itemsets with Early Pruning", Proc. 5th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Diego, CA, USA, August 1999, 287-291.
[4] Yves Bastide, Nicolas Pasquier, Rafik Taouil, Gerd Stumme, and Lotfi Lakhal, "Mining Minimal Non-Redundant Association Rules Using Frequent Closed Itemsets", Proc. DOOD 2000 Conference, LNCS, Springer-Verlag, July 2000, 972-986.

[5] L.P. Chen, "Efficient Graph-Based Algorithms for Discovering and Maintaining Association Rules in Large Databases", Knowledge and Information Systems, London, 2001, 338-355.

[6] David W. Cheung, Vincent T. Ng, and Benjamin W. Tam, "Maintenance of Discovered Knowledge: A Case in Multi-level Association Rules", Proc. 2nd International Conference on Knowledge Discovery and Data Mining, Portland, Oregon, 1996, 307-310.

[7] David W. Cheung, Jiawei Han, Vincent T. Ng, and C.Y. Wong, "Maintenance of Discovered Association Rules in Large Databases: An Incremental Updating Technique", Proc. International Conference on Data Engineering, New Orleans, Louisiana, 1996, 106-114.

[8] David W. Cheung, S.D. Lee, and Benjamin Kao, "A General Incremental Technique for Maintaining Discovered Association Rules", Proc. International Conference on Database Systems for Advanced Applications, Melbourne, 1-4 April 1997, 185-194.

[9] Jun-Hui Her, Sung-Hae Jun, Jun-Heyog Choi, and Jung-Hyun Lee, "A Bayesian Neural Network Model for Dynamic Web Document Clustering", Proc. IEEE Region 10 Conference, Vol. 2, December 1999, 1415-1418.

[10] T.P. Hong, "Incremental Data Mining Using Pre-large Itemsets", Proc. 2002 IEEE International Conference on Data Mining (ICDM 2002), Maebashi City, Japan, 9-12 December 2002.

[11] Guanling Lee, K.L. Lee, and Arbee L.P. Chen, "Efficient Graph-Based Algorithms for Discovering and Maintaining Association Rules in Large Databases", Knowledge and Information Systems 3, 2001, 338-355.

[12] Jiuyong Li, Hong Shen, and Rodney Topor, "Mining the Smallest Association Rule Set for Predictions", Proc. 2001 IEEE International Conference on Data Mining, California, USA, December 2001.

[13] G. Piatetsky-Shapiro, "Discovery, Analysis and Presentation of Strong Rules", Knowledge Discovery in Databases, AAAI/MIT Press, 1991, 229-248.

[14] Mei-Ling Shyu, Shu-Ching Chen, and Chi-Min Shu, "Affinity-Based Probabilistic Reasoning and Document Clustering on the WWW", Proc. 24th Annual International Computer Software and Applications Conference, Taipei, Taiwan, October 25-28, 2000, 149.

[15] Chunhua Wang, Houkuan Huang, and Honglian Li, "A Fast Distributed Mining Algorithm for Association Rules With Item Constraints", Proc. 2000 IEEE International Conference on Systems, Man & Cybernetics, Vol. 1-5, 2000, 1900-1905.

[16] G.I. Webb, "Efficient Search for Association Rules", Proc. 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-00), N.Y., August 20-23, 2000, 99-107.

[17] Show-Jane Yen and Arbee L.P. Chen, "An Efficient Approach to Discovering Knowledge from Large Databases", Proc. IEEE/ACM International Conference on Parallel and Distributed Information Systems, 1996, 8-18.

[18] Mohammed J. Zaki and Ching-Jui Hsiao, "An Efficient Algorithm for Closed Association Rule Mining", 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, August 2000, 34-43.

[19] M.J. Zaki, "Generating Non-Redundant Association Rules", Proc. 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, Boston, MA, 2000.
[20] Tian Zhang, Raghu Ramakrishnan, and Miron Livny, "BIRCH: An Efficient Data Clustering Method for Very Large Databases", Proc. 1996 ACM SIGMOD International Conference on Management of Data, Montreal, Quebec, Canada, 1996, 103-114.