STUDY ON FREQUENT PATTERN GROWTH ALGORITHM WITHOUT CANDIDATE GENERATION IN DATABASES

Prof. Ambarish S. Durani (1) and Mrs. Rashmi B. Sune (2)
1 Assistant Professor, Datta Meghe Institute of Engineering, Technology & Research, Sawangi (Meghe), Wardha.
2 Lecturer, Sant Chavara School, Wardha.

Abstract- Mining frequent patterns in a large database remains an important and relevant topic in data mining. FP-Growth is one of the best-known and most widely benchmarked algorithms for mining frequent patterns from the FP-tree data structure. Its major drawback, however, is that the FP-tree must be rebuilt from scratch once the original database changes.

Keywords- Data Mining, Dynamic Approach, Knowledge Discovery, Association Mining, Frequent Itemsets.

I. INTRODUCTION
Apriori uses a generate-and-test approach: it generates candidate itemsets and tests whether they are frequent. Its drawbacks are:
- Generation of candidate itemsets is expensive (in both space and time).
- Support counting is expensive.
- Subset checking is computationally expensive.
- Multiple database scans are needed (I/O).

FP-Growth allows frequent itemset discovery without candidate itemset generation. It is an efficient method for mining frequent patterns in a large database: it uses a highly compact FP-tree and is divide-and-conquer in nature. It takes a two-step approach:
Step 1: Build a compact data structure called the FP-tree, using two passes over the data set.
Step 2: Extract frequent itemsets directly from the FP-tree by traversing it.
The FP-tree is a novel data structure that stores compressed, crucial information about frequent patterns; it is compact yet complete for frequent pattern mining.

II. PROCESS OF EXECUTION
Two steps:
1. Scan the transaction DB for the first time, find the frequent items (single-item patterns) and order them into a list L in descending order of frequency, e.g. L = {f:4, c:4, a:3, b:3, m:3, p:3}, in the format (item-name, support).
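The two passes can be sketched in Python. This is a minimal illustration of the standard technique, not the authors' code; the transaction contents are assumed from the classic FP-Growth running example, chosen so the supports match the list L above.

```python
from collections import Counter

# Transactions assumed from the classic FP-Growth example
# (min support = 3); supports come out as f:4, c:4, a:3, b:3, m:3, p:3.
transactions = [
    ["f", "a", "c", "d", "g", "i", "m", "p"],
    ["a", "b", "c", "f", "l", "m", "o"],
    ["b", "f", "h", "j", "o"],
    ["b", "c", "k", "s", "p"],
    ["a", "f", "c", "e", "l", "p", "m", "n"],
]
min_support = 3

# Pass 1: count supports and keep the frequent items in descending
# support order (ties broken arbitrarily) -- this is the list L.
counts = Counter(item for t in transactions for item in t)
L = [item for item, c in counts.most_common() if c >= min_support]

# Pass 2: reorder each transaction's frequent items according to L,
# then insert the transaction into the FP-tree along a shared prefix path.
rank = {item: i for i, item in enumerate(L)}
ordered = [sorted((x for x in t if x in rank), key=rank.get)
           for t in transactions]

class FPNode:
    """FP-tree node: item name, count, children keyed by item."""
    def __init__(self, item=None):
        self.item, self.count, self.children = item, 0, {}

root = FPNode()
for t in ordered:
    node = root
    for item in t:
        node = node.children.setdefault(item, FPNode(item))
        node.count += 1
```

Transactions that share a prefix in L's order share a path, which is why the tree is usually much smaller than the raw data. A real implementation would also maintain the header table and node-links that the mining step uses later.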
2. For each transaction, order its frequent items according to the order in L; then scan the DB a second time, constructing the FP-tree by inserting each frequency-ordered transaction into it.

Step 1: FP-Tree Construction (Example)

DOI: 10.21884/IJMTER.2017.4078.HZ2MB
International Journal of Modern Trends in Engineering and Research (IJMTER), Volume 04, Issue 3, [March 2017], ISSN (Online): 2349-9745; ISSN (Print): 2393-8161

FP-Tree size
The FP-tree usually has a smaller size than the uncompressed data, because many transactions typically share items (and hence prefixes).
- Best case: all transactions contain the same set of items, giving a single path in the FP-tree.
- Worst case: every transaction has a unique set of items (no items in common); the FP-tree is then at least as large as the original data, and its storage requirements are higher, since the pointers between the nodes and the counters must also be stored.
- The size of the FP-tree depends on how the items are ordered. Ordering by decreasing support is typically used, but it does not always lead to the smallest tree (it is a heuristic).

Step 2: Frequent Itemset Generation
FP-Growth extracts frequent itemsets from the FP-tree. It is a bottom-up algorithm, working from the leaves towards the root, and uses divide and conquer: first look for frequent itemsets ending in e, then de, etc.;
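This suffix-by-suffix recursion can be sketched directly. A minimal sketch, assuming the frequency-ordered transactions of the running example: instead of walking an explicit tree via node-links, it represents each conditional pattern base as a list of (path, count) pairs, but the divide-and-conquer recursion is the same.

```python
from collections import Counter

def fp_growth(paths, min_support, suffix=()):
    """Mine frequent itemsets from a conditional pattern base, given as
    (frequency-ordered path, count) pairs.  For each frequent item, emit
    item+suffix, then recurse on that item's own conditional pattern base
    (the prefixes of the paths containing it) -- divide and conquer."""
    counts = Counter()
    for path, n in paths:
        for item in path:
            counts[item] += n
    results = {}
    for item, support in counts.items():
        if support < min_support:
            continue
        results[frozenset((item,) + suffix)] = support
        # Conditional pattern base for `item`: the prefix of every path
        # that contains it, carrying that path's count.
        cond = [(path[:path.index(item)], n)
                for path, n in paths if item in path]
        results.update(fp_growth(cond, min_support, (item,) + suffix))
    return results

# Frequency-ordered transactions of the running example (support >= 3),
# each occurring once.
ordered = [["f", "c", "a", "m", "p"],
           ["f", "c", "a", "b", "m"],
           ["f", "b"],
           ["c", "b", "p"],
           ["f", "c", "a", "m", "p"]]
frequent = fp_growth([(t, 1) for t in ordered], min_support=3)
```

On this example the recursion finds 18 frequent itemsets, for instance {c, p} and {f, c, a, m}, each with support 3.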
then d, then cd, etc. First, extract the prefix-path sub-trees ending in an item(set) (hint: use the linked lists).

Prefix path sub-trees (Example):

Each prefix-path sub-tree is processed recursively to extract the frequent itemsets, and the solutions are then merged. E.g. the prefix-path sub-tree for e is used to extract frequent itemsets ending in e, then in de, ce, be and ae, then in cde, bde, ade, etc. This is a divide-and-conquer approach.

III. ADVANTAGES
a. Only two passes over the data set.
b. "Compresses" the data set.
c. No candidate generation.
d. Much faster than Apriori.

IV. DISADVANTAGES
- The FP-tree may not fit in memory.
- The FP-tree is expensive to build. There is a trade-off: it takes time to build, but once built, frequent itemsets are read off easily. Time is wasted (especially if the support threshold is high), as the only pruning that can be done is on single items.
- Support can only be calculated once the entire data set has been added to the FP-tree.

V. RESULT ANALYSIS
Frequent itemsets found (ordered by suffix and in the order in which they are found):

VI. CONCLUSION
In this paper, we have introduced an Association Rule Mining approach. We also presented the FP-Growth algorithm. The proposed approach periodically performs the data mining process on data updates during the current episode and uses the knowledge captured in the previous episode to produce data mining rules.
In our approach, we dynamically update the knowledge obtained from the previous data mining process. The transaction domain is treated as a set of consecutive episodes: the information gained during the current episode depends both on the current set of transactions and on the information discovered during the previous episode.

VII. FUTURE WORK
As future work, the Association Rule Mining approach will be tested with different datasets that cover a large spectrum of data mining applications, such as web-site access analysis for improvements in e-commerce advertising, fraud detection, screening and investigation, retail site or product analysis, and customer segmentation.
@IJMTER-2017, All rights Reserved.