National Science Council Project Final Report

Incremental Data Mining Using Pre-large Itemsets
(利用準大項目集之漸進式資料挖掘)

Project number: NSC 89-2213-E-214-056
Project period: August 1, 2000 to July 31, 2001
Principal investigator: Prof. Tzung-Pei Hong, Department of Information Management, I-Shou University
Co-principal investigator: Assistant Prof. Yu-Hui Tao, Department of Information Management, I-Shou University
Project participants: 王慶堯, 林桂英, 王乾隆, 李詠騏, Institute of Information Engineering, I-Shou University
Report date: October 20, 2001

1、Abstract

In this project, we propose the concept of pre-large itemsets and design a novel, efficient incremental mining algorithm based on it. Pre-large itemsets are defined using two support thresholds, a lower support threshold and an upper support threshold, to reduce rescanning of the original database and to save the cost of maintaining the mined knowledge. The proposed algorithm does not need to rescan the original database until a given number of new transactions has been processed. If the database grows larger, the allowed number of new transactions grows as well; therefore, along with the growth of the database, the proposed approach becomes increasingly efficient. This characteristic is especially useful for real applications.

Keywords: data mining, association rule, large itemset, pre-large itemset, incremental mining.

2、Background and Purpose

In the past, many algorithms for mining association rules from transactions were proposed, most of which were executed in level-wise processes.
That is, itemsets containing single items were processed first, then itemsets with two items, and so on, adding one more item at each level until some criterion was met. These algorithms usually assumed a static database and focused on batch mining. In real-world applications, however, new records are continually inserted into databases, so designing a mining algorithm that can maintain association rules as a database grows is critically important.

When new records are added, the original association rules may become invalid, or new, implicitly valid rules may appear in the updated database [7][8][9][11][12]. In these situations, conventional batch-mining algorithms must re-process the entire updated database to find the final association rules. Two drawbacks exist for conventional batch-mining algorithms in maintaining database knowledge: (a) nearly the same computation time as was spent mining the original database is needed to cope with each batch of new transactions, so if the original database is large, much computation time is wasted in maintaining association rules whenever new transactions arrive; and (b) information previously mined from the original database, such as large itemsets and association rules, provides no help in the maintenance process.

Cheung and his co-workers proposed an incremental mining algorithm, called FUP (Fast UPdate) [7], for incrementally maintaining mined association rules and avoiding the shortcomings mentioned above. The FUP algorithm modifies the Apriori mining algorithm [3] and adopts the pruning techniques of the DHP (Direct Hashing and Pruning) algorithm [10]. It first calculates large itemsets mainly from the newly inserted transactions and compares them with the previous large itemsets from the original database.
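For concreteness, the level-wise (Apriori-style) batch process described at the start of this section can be sketched as follows. This is only an illustrative outline of the general level-wise idea under our own function and variable names, not code from this project or from FUP:

```python
def level_wise_large_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) batch mining sketch.

    transactions: list of item sets; min_support: fraction in [0, 1].
    Returns a dict mapping each large itemset (frozenset) to its count.
    """
    n = len(transactions)
    # Level 1: candidates are the single items appearing in the database.
    candidates = [frozenset([item]) for item in {i for t in transactions for i in t}]
    large = {}
    while candidates:
        # One scan of the database counts every candidate of this level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c: k for c, k in counts.items() if k / n >= min_support}
        large.update(current)
        # Join step: two large k-itemsets differing in one item
        # yield a candidate (k+1)-itemset for the next level.
        prev = list(current)
        candidates = list({a | b for a in prev for b in prev
                           if len(a | b) == len(a) + 1})
    return large
```

Note that every level requires another full scan of the database, which is exactly the cost the incremental approach below tries to avoid when new transactions are inserted.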
According to the comparison results, FUP determines whether re-scanning the original database is needed, thus saving some time in maintaining the association rules. Although the FUP algorithm can indeed improve mining performance for incrementally growing databases, the original database still needs to be scanned when necessary. In this project, we thus propose a new mining algorithm based on two support thresholds to further reduce the need for rescanning the original database. Since rescanning the database takes much computation time, the maintenance cost is thus reduced in the proposed algorithm.

3、Results and Discussions

In this project, we propose the concept of pre-large itemsets to solve the problem represented by case 3 below: the situation in which a candidate itemset is large for the new transactions but is not recorded in the large itemsets already mined from the original database in FUP.

A pre-large itemset is not truly large, but promises to be large in the future. A lower support threshold and an upper support threshold are used to realize this concept. The upper support threshold is the same as that used in conventional mining algorithms: the support ratio of an itemset must be larger than the upper support threshold for the itemset to be considered large. The lower support threshold, in turn, defines the lowest support ratio at which an itemset is treated as pre-large; an itemset with a support ratio below the lower threshold is considered small. Pre-large itemsets act as buffers in the incremental mining process and are used to reduce the movement of itemsets directly from large to small and vice versa.

Considering an original database and newly inserted transactions under the two support thresholds, an itemset falls into one of the nine cases illustrated in Figure 1.

                                 New transactions
                          Large      Pre-large    Small
  Original    Large       Case 1     Case 2       Case 3
  database    Pre-large   Case 4     Case 5       Case 6
              Small       Case 7     Case 8       Case 9

Figure 1: Nine cases arising from adding new transactions to an existing database

Cases 1, 5, 6, 8 and 9 do not affect the final association rules according to the weighted average of the counts. Cases 2 and 3 may remove existing association rules, and cases 4 and 7 may add new association rules. If we retain all large and pre-large itemsets with their counts after each pass, then cases 2, 3 and 4 can be handled easily. Also, in the maintenance phase, the ratio of new transactions to old transactions is usually very small; this is even more apparent when the database grows larger. An itemset in case 7 cannot possibly be large for the entire updated database as long as the number of new transactions is small compared to the number of transactions in the original database. This is shown in the following theorem.

Theorem 1: Let Sl and Su be respectively the lower and the upper support thresholds, and let d and t be respectively the numbers of original and new transactions. If t <= (Su - Sl)d / (1 - Su), then an itemset that is small (neither large nor pre-large) in the original database but is large in the newly inserted transactions is not large for the entire updated database.

The details of the proposed maintenance algorithm are described below. A variable c records the number of new transactions processed since the last rescan of the original database; S^D(I), S^T(I) and S^U(I) denote the counts of an itemset I in the original database, in the new transactions, and in the updated database, respectively.

The proposed maintenance algorithm:

INPUT: A lower support threshold Sl, an upper support threshold Su, a set of large itemsets and pre-large itemsets in the original database consisting of (d + c) transactions, and a set of t new transactions.

OUTPUT: A set of final association rules for the updated database.

STEP 1: Calculate the safety number f of new transactions according to Theorem 1 as f = (Su - Sl)d / (1 - Su).
STEP 2: Set k = 1, where k records the number of items in the itemsets currently being processed.
STEP 3: Find all candidate k-itemsets Ck and their counts from the new transactions.
STEP 4: Divide the candidate k-itemsets into three parts according to whether they are large, pre-large or small in the original database.
STEP 5: For each itemset I in the originally large k-itemsets L_k^D, do the following substeps:
  Substep 5-1: Set the new count S^U(I) = S^T(I) + S^D(I).
  Substep 5-2: If S^U(I)/(d + t + c) >= Su, assign I as a large itemset, set S^D(I) = S^U(I) and keep I with S^D(I); otherwise, if S^U(I)/(d + t + c) >= Sl, assign I as a pre-large itemset, set S^D(I) = S^U(I) and keep I with S^D(I); otherwise, neglect I.
STEP 6: For each itemset I in the originally pre-large k-itemsets P_k^D, do the following substeps:
  Substep 6-1: Set the new count S^U(I) = S^T(I) + S^D(I).
  Substep 6-2: If S^U(I)/(d + t + c) >= Su, assign I as a large itemset, set S^D(I) = S^U(I) and keep I with S^D(I); otherwise, if S^U(I)/(d + t + c) >= Sl, assign I as a pre-large itemset, set S^D(I) = S^U(I) and keep I with S^D(I); otherwise, neglect I.
STEP 7: For each itemset I in the candidate itemsets that is neither in the originally large itemsets L_k^D nor in the pre-large itemsets P_k^D, do the following substeps:
  Substep 7-1: If I is in the large itemsets L_k^T or pre-large itemsets P_k^T from the new transactions, put it in the rescan-set R, which is used if rescanning in STEP 8 is necessary.
  Substep 7-2: If I is small for the new transactions, do nothing.
STEP 8: If t + c <= f or R is empty, do nothing; otherwise, rescan the original database to determine whether the itemsets in the rescan-set R are large or pre-large.
STEP 9: Form candidate (k+1)-itemsets C_{k+1} from the finally large and pre-large k-itemsets (L_k^U ∪ P_k^U) that appear in the new transactions.
STEP 10: Set k = k + 1.
STEP 11: Repeat STEPs 4 to 10 until no new large or pre-large itemsets are found.
STEP 12: Modify the association rules according to the modified large itemsets.
STEP 13: If t + c > f, set d = d + t + c and set c = 0; otherwise, set c = c + t.

After STEP 13, the final association rules for the updated database have been determined.
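The safety-number computation in STEP 1 and the threshold test of Substeps 5-2 and 6-2 can be sketched as follows. This is a minimal Python illustration; the function and variable names (safety_number, classify, count_new, count_old) are ours, not taken from the report:

```python
import math

def safety_number(d, s_l, s_u):
    """STEP 1: the number f of new transactions that can be absorbed
    before the original database must be rescanned (Theorem 1)."""
    return math.floor((s_u - s_l) * d / (1 - s_u))

def classify(count_new, count_old, d, t, c, s_l, s_u):
    """Substeps 5-2/6-2: classify an itemset by its updated support
    ratio S^U(I)/(d + t + c) against the two thresholds.

    count_new corresponds to S^T(I), count_old to S^D(I)."""
    ratio = (count_new + count_old) / (d + t + c)
    if ratio >= s_u:
        return "large"
    if ratio >= s_l:
        return "pre-large"
    return "small"
```

For example, with d = 100 original transactions, Sl = 0.25 and Su = 0.5, up to f = 50 new transactions can arrive before a rescan is forced, and the larger d becomes, the larger f grows.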
4、Self-Evaluation

In this project, we have proposed the concept of pre-large itemsets and designed a novel, efficient incremental mining algorithm based on it. Using two user-specified upper and lower support thresholds, the pre-large itemsets act as a gap to avoid small itemsets becoming large in the updated database when transactions are inserted. Our proposed algorithm also retains the features of the FUP algorithm [7][11]. Moreover, the proposed algorithm can effectively handle the cases in which itemsets are small in the original database but large in the newly inserted transactions, although it does need additional storage space to record the pre-large itemsets; note that the FUP algorithm needs to rescan databases to handle such cases. The proposed algorithm does not require rescanning of the original database until a number of new transactions, determined from the two support thresholds and the size of the database, have been processed. If the database grows larger, then the number of new transactions allowed before rescanning grows as well. Therefore, as the database grows, our proposed approach becomes increasingly efficient. This characteristic is especially useful for real-world applications. The contents of this report have been published in Intelligent Data Analysis, Vol. 5, No. 2, pp. 111-129, 2001.

5、References

[1] R. Agrawal, T. Imielinski and A. Swami, "Mining association rules between sets of items in large databases," The ACM SIGMOD Conference, pp. 207-216, Washington DC, USA, 1993.
[2] R. Agrawal, T. Imielinski and A. Swami, "Database mining: a performance perspective," IEEE Transactions on Knowledge and Data Engineering, Vol. 5, No. 6, pp. 914-925, 1993.
[3] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules," The International Conference on Very Large Data Bases, pp. 487-499, 1994.
[4] R. Agrawal and R. Srikant, "Mining sequential patterns," The Eleventh IEEE International Conference on Data Engineering, pp. 3-14, 1995.
[5] R. Agrawal, R. Srikant and Q. Vu, "Mining association rules with item constraints," The Third International Conference on Knowledge Discovery in Databases and Data Mining, pp. 67-73, Newport Beach, California, 1997.
[6] M.S. Chen, J. Han and P.S. Yu, "Data mining: An overview from a database perspective," IEEE Transactions on Knowledge and Data Engineering, Vol. 8, No. 6, pp. 866-883, 1996.
[7] D.W. Cheung, J. Han, V.T. Ng and C.Y. Wong, "Maintenance of discovered association rules in large databases: An incremental updating approach," The Twelfth IEEE International Conference on Data Engineering, pp. 106-114, 1996.
[8] D.W. Cheung, S.D. Lee and B. Kao, "A general incremental technique for maintaining discovered association rules," Proceedings of Database Systems for Advanced Applications, pp. 185-194, Melbourne, Australia, 1997.
[9] M.Y. Lin and S.Y. Lee, "Incremental update on sequential patterns in large databases," The Tenth IEEE International Conference on Tools with Artificial Intelligence, pp. 24-31, 1998.
[10] J.S. Park, M.S. Chen and P.S. Yu, "Using a hash-based method with transaction trimming for mining association rules," IEEE Transactions on Knowledge and Data Engineering, Vol. 9, No. 5, pp. 812-825, 1997.
[11] N.L. Sarda and N.V. Srinivas, "An adaptive algorithm for incremental mining of association rules," The Ninth International Workshop on Database and Expert Systems, pp. 240-245, 1998.
[12] S. Zhang, "Aggregation and maintenance for database mining," Intelligent Data Analysis, Vol. 3, No. 6, pp. 475-490, 1999.