STUDY ON FREQUENT PATTERN GROWTH ALGORITHM WITHOUT
CANDIDATE GENERATION IN DATABASES
Prof. Ambarish S. Durani¹ and Mrs. Rashmi B. Sune²
¹Assistant Professor, Datta Meghe Institute of Engineering, Technology & Research, Sawangi (Meghe), Wardha.
²Lecturer, Sant Chavara School, Wardha.
Abstract- Mining frequent patterns in a large database is still an important and relevant topic in data
mining. FP-Growth is one of the best-known and most widely benchmarked algorithms for mining frequent
patterns from the FP-tree data structure. Its major drawback, however, is that the FP-tree must be rebuilt
from scratch whenever the original database changes.
Keywords- Data Mining, Dynamic Approach, Knowledge Discovery, Association Mining, Frequent
Itemsets.
I. INTRODUCTION
Apriori: Uses a generate-and-test approach: it generates candidate itemsets and tests whether they are frequent. Its main costs are:
– Generation of candidate itemsets is expensive (in both space and time).
– Support counting is expensive.
– Subset checking is computationally expensive.
– Multiple database scans are required (I/O); a minimal sketch of this generate-and-test loop is given below.
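To make the cost of generate-and-test concrete, the following is a minimal, illustrative Python sketch of Apriori-style candidate generation and support counting. It is background only, not the algorithm studied in this paper; the toy database and the function names are our own.

from itertools import combinations

def apriori_gen(frequent_k, k):
    # Join pairs of frequent k-itemsets into (k+1)-candidates, then prune every
    # candidate that has an infrequent k-subset.
    candidates = set()
    for a in frequent_k:
        for b in frequent_k:
            union = a | b
            if len(union) == k + 1 and all(
                    frozenset(sub) in frequent_k for sub in combinations(union, k)):
                candidates.add(frozenset(union))
    return candidates

def count_support(candidates, transactions):
    # Counting support means yet another full scan over the database.
    return {c: sum(1 for t in transactions if c <= t) for c in candidates}

# Hypothetical toy database, for illustration only.
transactions = [frozenset(t) for t in (['a', 'b', 'c'], ['a', 'b'], ['a', 'c'], ['b', 'c'])]
frequent_1 = {frozenset('a'), frozenset('b'), frozenset('c')}
print(count_support(apriori_gen(frequent_1, 1), transactions))

Even in this tiny example, each level requires generating candidates, checking their subsets, and rescanning the whole database to count supports, which is exactly the overhead FP-Growth avoids.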
FP-Growth: Allows frequent itemset discovery without candidate itemset generation. It is an efficient method for mining frequent patterns in a large database, using a highly compact FP-tree and a divide-and-conquer strategy.
Two-step approach:
Step 1: Build a compact data structure called the FP-tree, using two passes over the data set.
Step 2: Extract frequent itemsets directly from the FP-tree by traversing it.
FP-tree: A novel data structure storing compressed, crucial information about frequent patterns, compact
yet complete for frequent pattern mining.
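As a rough illustration of this data structure, one possible Python representation of an FP-tree node is sketched below; the field names (item, count, parent, children, node_link) are our own choice and are not prescribed by the paper.

from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class FPNode:
    # One node of an FP-tree: an item label, a count, a parent pointer,
    # child pointers, and a node-link to the next node carrying the same item.
    item: Optional[str]                    # None for the root node
    count: int = 0
    parent: Optional["FPNode"] = None
    children: Dict[str, "FPNode"] = field(default_factory=dict)
    node_link: Optional["FPNode"] = None   # threads all nodes holding the same item

# A header table, mapping each frequent item to the chain of nodes carrying it,
# is what later makes prefix-path extraction cheap.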
II. PROCESS OF EXECUTION
Two Steps:
1. Scan the transaction DB for the first time, find the frequent items (single-item patterns), and order them into a list L in descending order of frequency.
e.g., L = {f:4, c:4, a:3, b:3, m:3, p:3}, in the format (item-name, support).
2. For each transaction, order its frequent items according to the order in L. Scan the DB a second time and construct the FP-tree by inserting each frequency-ordered transaction into it.
Step 1: FP-Tree Construction (Example)
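The original construction figure is not reproduced here. As a substitute, the following sketch, which assumes the FPNode class above, carries out the two passes just described. The transaction database is the standard example from the FP-growth literature, chosen because it yields the list L quoted above; for simplicity the header table stores a plain list of nodes per item instead of threading the node_link pointers.

from collections import Counter

def build_fp_tree(transactions, min_support):
    # Pass 1: count item supports and fix the item order L (descending support).
    support = Counter(item for t in transactions for item in t)
    L = [item for item, c in support.most_common() if c >= min_support]
    rank = {item: i for i, item in enumerate(L)}

    # Pass 2: insert each frequency-ordered transaction, sharing common prefixes.
    root = FPNode(item=None)
    header = {}                          # item -> list of FPNodes carrying that item
    for t in transactions:
        ordered = sorted((i for i in t if i in rank), key=rank.get)
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item=item, parent=node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1
            node = child
    return root, header, L

# Hypothetical transaction database (the standard textbook example, which yields
# L = {f:4, c:4, a:3, b:3, m:3, p:3} as quoted above).
db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
root, header, L = build_fp_tree(db, min_support=3)
print(L)   # items with support >= 3; ties between equal supports may be ordered differently

Because every transaction is inserted in the same fixed item order, transactions with common frequent items share a common prefix path in the tree, which is where the compression comes from.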
FP-Tree size
 The FP-tree usually has a smaller size than the uncompressed data, because many transactions typically share items (and hence prefixes).
– Best-case scenario: all transactions contain the same set of items.
• There is a single path in the FP-tree.
– Worst-case scenario: every transaction has a unique set of items (no items in common).
• The size of the FP-tree is at least as large as the original data.
• The storage requirements for the FP-tree are even higher, since the pointers between the nodes and the counters must also be stored.
 The size of the FP-tree depends on how the items are ordered. Ordering by decreasing support is typically used, but it does not always lead to the smallest tree (it is a heuristic).
Step 2: Frequent Itemset Generation
 FP-Growth extracts frequent itemsets from the FP-tree.
 It is a bottom-up algorithm: it works from the leaves towards the root.
 Divide and conquer: first look for frequent itemsets ending in e, then in de, etc., then in d, then in cd, etc.
 First, extract the prefix-path sub-trees ending in an item(set), using the linked lists; a sketch of this step is given below.
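A minimal sketch of this prefix-path extraction step, assuming the build_fp_tree sketch above; the header table plays the role of the linked lists mentioned in the hint.

def prefix_paths(item, header):
    # Conditional pattern base of `item`: for every node in the tree that carries
    # the item, walk up towards the root and record the prefix path together with
    # that node's count.
    paths = []
    for node in header.get(item, []):
        path, parent = [], node.parent
        while parent is not None and parent.item is not None:
            path.append(parent.item)
            parent = parent.parent
        if path:
            paths.append((list(reversed(path)), node.count))
    return paths

# Example: prefix_paths('p', header) lists the prefix paths that end just above
# each 'p' node, each paired with that node's count.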
Prefix path sub-trees (Example): see the prefix_paths sketch above.
 Each prefix-path sub-tree is processed recursively to extract the frequent itemsets, and the solutions are then merged.
– E.g. the prefix-path sub-tree for e is used to extract frequent itemsets ending in e, then in de, ce, be and ae, then in cde, bde, ade, etc.
– This is a divide-and-conquer approach; a sketch of the recursion follows below.
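Putting the pieces together, the recursion can be sketched as follows, reusing build_fp_tree and prefix_paths from the earlier sketches. This version rebuilds a conditional tree from an expanded pattern base, which is written for clarity rather than efficiency.

def fp_growth(transactions, min_support, suffix=()):
    # Build the (conditional) FP-tree, then, for each frequent item taken
    # bottom-up, record the grown itemset and recurse on the item's
    # conditional pattern base.
    root, header, L = build_fp_tree(transactions, min_support)
    results = {}
    for item in reversed(L):                      # least frequent items first
        item_support = sum(node.count for node in header[item])
        grown = (item,) + suffix
        results[grown] = item_support
        # Expand the conditional pattern base into a small transaction list.
        conditional_db = []
        for path, count in prefix_paths(item, header):
            conditional_db.extend([path] * count)
        if conditional_db:
            results.update(fp_growth(conditional_db, min_support, grown))
    return results

Each recursive call works on a strictly smaller conditional database, so the recursion terminates, and every itemset is found exactly once, ending in the suffix it was grown from.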
III. ADVANTAGES
a. Only two passes over the data set.
b. "Compresses" the data set.
c. No candidate generation.
d. Much faster than Apriori.
IV. DISADVANTAGES
– The FP-tree may not fit in memory.
– The FP-tree is expensive to build.
Trade-off: it takes time to build the tree, but once it is built, frequent itemsets can be read off easily. Time may be wasted (especially if the support threshold is high), because the only pruning that can be done during construction is on single items, and support can only be calculated once the entire data set has been added to the FP-tree.
V. RESULT ANALYSIS
Frequent itemsets are found ordered by suffix and in the order in which they are discovered; the original listing is not reproduced here.
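Purely as an illustration (not the paper's own result), the sketches above could be exercised as follows to print the mined itemsets grouped by suffix item, using the same hypothetical database as before.

db = [list("facdgimp"), list("abcflmo"), list("bfhjo"),
      list("bcksp"), list("afcelpmn")]
itemsets = fp_growth(db, min_support=3)
# Group the mined itemsets by their suffix item (the last element of each tuple),
# mirroring the "ordered by suffix" presentation referred to above.
for suffix_item in sorted({iset[-1] for iset in itemsets}):
    group = {iset: s for iset, s in itemsets.items() if iset[-1] == suffix_item}
    print(suffix_item, group)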
VI. CONCLUSION
In this paper, we have introduced an association rule mining approach based on the FP-growth algorithm. The proposed approach periodically performs the data mining process on data updates during the current episode and uses the knowledge captured in the previous episode to produce data mining rules. In our approach, the knowledge obtained from the previous data mining process is updated dynamically: the transaction domain is treated as a set of consecutive episodes, and the information gained during the current episode depends on both the current set of transactions and the information discovered during the previous episode.
VII. FUTURE WORK
As future work, the association rule mining approach will be tested on different datasets covering a broad spectrum of data mining applications, such as web site access analysis for improving e-commerce advertising, fraud detection, screening and investigation, retail site or product analysis, and customer segmentation.