H-Mine: Hyper-Structure Mining of Frequent Patterns in Large Databases

Jiawei Han, Jian Pei, Hongjun Lu, Shojiro Nishio, Shiwei Tang, Dongqing Yang
Proc. of 2001 Int. Conf. on Data Mining (ICDM'01), San Jose, CA, Nov. 2001

Advisor: Professor Hsin-Hsi Chen
Reporter: Clarence Min-Chi Hsieh
Natural Language Processing Laboratory, Dept. of Computer Science and Info. Engineering, NTU
2005/08/23
Copyright © Natural Language Processing Lab., NTU, 2005

Outline
- Introduction
- H-Mine
- Performance Study
- Conclusions

Introduction
- Huge space is required to serve the mining
  - An Apriori-like algorithm generates a huge number of candidates for long or dense patterns.
- Huge database
  - The FP-tree will be large, and the space requirement for recursion is a challenge.

H-Mine
Step 1. Scan TDB once to find and output the complete set of frequent items.
Step 2. Build the H-struct.
Step 3. Mine the projected databases recursively.

H-Mine (Example)

TDB, min_sup_count = 2

ID    Items               Frequent-item projection
100   c, d, e, f, g, i    c, d, e, g
200   a, c, d, e, m       a, c, d, e
300   a, b, d, e, g, k    a, d, e, g
400   a, c, d, h          a, c, d

Scan TDB: the complete set of frequent items can be found and output: { a:3, c:3, d:4, e:3, g:2 }
Following the alphabetical order of frequent items (called the F-list): a-c-d-e-g
Scan TDB: build the H-struct in main memory

H-struct

Header table H
item    a  c  d  e  g
count   3  3  4  3  2

Frequent projections
100   c d e g
200   a c d e
300   a d e g
400   a c d

Header table Ha
item    c  d  e  g
count   2  3  2  1
Patterns: ac:2, ad:3, ae:2

Header table Hac
item    d  e
count   2  1
Patterns: acd:2

Header table Had
item    e  g
count   2  1
Patterns: ade:2

Header table Hc
item    d  e  g
count   3  2  1
Patterns: cd:3, ce:2

Header table Hcd
item    e  g
count   2  1
Patterns: cde:2

Header table Hd
item    e  g
count   3  2
Patterns: de:3, dg:2

Header table Hde
item    g
count   2
Patterns: deg:2

Header table He
item    g
count   2
Patterns: eg:2

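The header-table walkthrough can be reproduced with a short Python sketch. This is a simplified illustration under stated assumptions, not the paper's implementation: the function name `h_mine_mem` is hypothetical, and instead of adjusting hyperlinks in place it keeps, for each prefix, a queue of (transaction index, position) pairs that point into the shared frequent-item projections. That gives a pseudo-projection with no copying of transactions.

```python
from collections import Counter


def h_mine_mem(transactions, min_sup_count):
    """Return {pattern (tuple of items): support count}.

    Hypothetical helper: a simplified, in-memory sketch of
    projection-based mining over shared frequent-item projections.
    """
    # Step 1: one scan to count items and keep the frequent ones.
    counts = Counter(item for tx in transactions for item in set(tx))
    frequent = {item for item, c in counts.items() if c >= min_sup_count}

    # Step 2: frequent-item projections, each sorted in F-list
    # (here: alphabetical) order. These play the role of the H-struct rows.
    projections = [sorted(frequent.intersection(tx)) for tx in transactions]

    patterns = {}

    def mine(prefix, entries):
        # entries: (tid, start) pairs; projections[tid][start:] is the part
        # of transaction tid that follows the current prefix.
        header = {}  # item -> queue of (tid, position after that item)
        for tid, start in entries:
            row = projections[tid]
            for pos in range(start, len(row)):
                header.setdefault(row[pos], []).append((tid, pos + 1))
        for item in sorted(header):
            queue = header[item]
            if len(queue) >= min_sup_count:
                pattern = prefix + (item,)
                patterns[pattern] = len(queue)
                # Step 3: recurse into the (prefix + item)-projected database.
                mine(pattern, queue)

    mine((), [(tid, 0) for tid in range(len(projections))])
    return patterns


tdb = [
    ["c", "d", "e", "f", "g", "i"],   # ID 100
    ["a", "c", "d", "e", "m"],        # ID 200
    ["a", "b", "d", "e", "g", "k"],   # ID 300
    ["a", "c", "d", "h"],             # ID 400
]
patterns = h_mine_mem(tdb, 2)
print(len(patterns))  # 17
```

On the four-transaction TDB with min_sup_count = 2, the sketch finds the 17 patterns of the example, for instance ad:3, acd:2 and deg:2.
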
Output (TDB, min_sup_count = 2)

ID    Items
100   c, d, e, f, g, i
200   a, c, d, e, m
300   a, b, d, e, g, k
400   a, c, d, h

Frequent patterns: a:3, c:3, d:4, e:3, g:2, ac:2, ad:3, ae:2, acd:2, ade:2, cd:3, ce:2, cde:2, de:3, dg:2, deg:2, eg:2

From H-Mine (Mem) to H-Mine
- TDB can be partitioned into k parts TDB1, …, TDBk (TDBi, 1 ≤ i ≤ k)
- Each TDBi can be held in main memory, where TDBi has ni transactions and n1 + … + nk = n
- Find frequent patterns in TDBi with min_supi = min_sup × (ni / n)
- Let Fi (1 ≤ i ≤ k) be the set of (locally) frequent patterns in TDBi
- Gather the patterns in Fi and collect their (global) support in TDB by scanning the transaction database TDB one more time

From H-Mine (Mem) to H-Mine (Cont.)
TDB -> TDB1, TDB2, …, TDBk -> F1, F2, …, Fk -> scan TDB once -> frequent patterns

Performance Study
- Data sets
  - Real data: Gazelle.com
    - 59,602 transactions
    - Up to 267 items per transaction
  - Synthetic data: T25I15D10k
    - 1,000 items
    - 10,000 transactions
    - Up to 25 items per transaction
    - The average longest potentially frequent itemset has 15 items

[Figure: Runtime on data set Gazelle]
[Figure: Space usage on data set Gazelle]
[Figure: Runtime on data set T25I15D10k]
[Figure: Space usage on data set T25I15D10k]
[Figure: Runtime on data set T25I15D200-1280k]
[Figure: Runtime on data set T25I15D1280k]

Conclusions
- H-mine outperforms Apriori and FP-growth and is efficient and highly scalable for mining very large databases
- H-mine and FP-growth do not need to store any frequent patterns or candidates
- H-mine has high performance and is scalable in all kinds of data
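
The partition-based strategy described under "From H-Mine (Mem) to H-Mine" can be illustrated with a minimal Python sketch. The names `mine_local` and `partition_mine` are hypothetical. `mine_local` is a brute-force stand-in for the in-memory miner (it enumerates every subset of each transaction, so it only suits tiny examples), and min_sup is assumed to be an absolute count over the whole TDB, so the local threshold min_sup × (ni / n) is rounded up to the next integer.

```python
from collections import Counter
from itertools import combinations


def mine_local(transactions, min_count):
    """Brute-force stand-in for the in-memory miner (hypothetical helper).

    Returns the set of patterns (sorted tuples) with support >= min_count.
    """
    counts = Counter()
    for tx in transactions:
        items = sorted(set(tx))
        for size in range(1, len(items) + 1):
            counts.update(combinations(items, size))
    return {p for p, c in counts.items() if c >= min_count}


def partition_mine(tdb, min_sup_count, k):
    """Mine TDB in about k memory-sized parts, then verify with one more scan."""
    n = len(tdb)
    size = -(-n // k)  # ceil(n / k) transactions per part
    parts = [tdb[i:i + size] for i in range(0, n, size)]

    # Local mining: min_sup_i = min_sup * (n_i / n), rounded up with
    # integer arithmetic so no globally frequent pattern is missed.
    candidates = set()
    for part in parts:
        local_min = -(-min_sup_count * len(part) // n)
        candidates |= mine_local(part, max(1, local_min))

    # One more scan of TDB to collect the global support of every
    # locally frequent pattern.
    global_counts = Counter()
    for tx in tdb:
        items = set(tx)
        for pattern in candidates:
            if items.issuperset(pattern):
                global_counts[pattern] += 1
    return {p: c for p, c in global_counts.items() if c >= min_sup_count}


tdb = [
    ["c", "d", "e", "f", "g", "i"],
    ["a", "c", "d", "e", "m"],
    ["a", "b", "d", "e", "g", "k"],
    ["a", "c", "d", "h"],
]
result = partition_mine(tdb, min_sup_count=2, k=2)
print(len(result))  # 17
```

The final scan is needed because a pattern that is frequent in one part may be infrequent overall. Conversely, any globally frequent pattern must reach the scaled threshold in at least one part, so the union of the local results is a complete candidate set.
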
