H-Mine: Hyper-Structure Mining of
Frequent Patterns in Large Databases
Jiawei Han, Jian Pei, Hongjun Lu
Shojiro Nishio, Shiwei Tang, Dongqing Yang
Proc. of 2001 Int. Conf. on Data Mining (ICDM'01),
San Jose, CA, Nov. 2001
Advisor: Professor Hsin-Hsi Chen
Reporter: Clarence Min-Chi Hsieh
Natural Language Processing Laboratory,
Dept. of Computer Science and Info. Engineering, NTU
2005/08/23
Outline
Introduction
H-Mine
Performance Study
Conclusions

Copyright © Natural Language Processing Lab., NTU, 2005
Introduction

Huge space is required to support the mining
– An Apriori-like algorithm generates a huge number of candidates for long or dense patterns.

Huge database
– The FP-tree will be large, and the space required for recursion is a challenge.
H-Mine
Step 1. Scan TDB once to find and output the complete set of frequent items
Step 2. Build the H-struct
Step 3. Mine the projected databases recursively
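Step 1 can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `frequent_items`, the list-of-lists representation of TDB, and the toy transactions are all assumptions made here.

```python
from collections import Counter


def frequent_items(tdb, min_sup_count):
    """Scan the transactions once and return {item: support} for frequent items."""
    counts = Counter()
    for items in tdb:
        # Count each item at most once per transaction.
        counts.update(set(items))
    return {item: n for item, n in counts.items() if n >= min_sup_count}


# Hypothetical toy database, used only to exercise the function.
toy_tdb = [["x", "y"], ["x", "z"], ["x", "y", "w"]]
print(sorted(frequent_items(toy_tdb, 2).items()))  # [('x', 3), ('y', 2)]
```

Steps 2 and 3 then work only on the items returned here, so infrequent items never enter main memory structures.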
H-Mine (Example)
TDB (min_sup_count = 2)
  ID    Items
  100   c, d, e, f, g, i
  200   a, c, d, e, m
  300   a, b, d, e, g, k
  400   a, c, d, h
Scan TDB: the complete set of frequent items can be found and output:
  { a:3, c:3, d:4, e:3, g:2 }
Frequent-item projections, following the alphabetical order of frequent items (called the F-list): a-c-d-e-g
  ID    Frequent-item projection
  100   c, d, e, g
  200   a, c, d, e
  300   a, d, e, g
  400   a, c, d
Scan TDB: build the H-struct in main memory
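The two scans in this example can be reproduced with a short Python sketch. The variable names (`TDB`, `f_list`, `projections`) are chosen here for illustration; only the transactions and the threshold come from the slide.

```python
from collections import Counter

# The four transactions from the slide (IDs 100-400).
TDB = {
    100: ["c", "d", "e", "f", "g", "i"],
    200: ["a", "c", "d", "e", "m"],
    300: ["a", "b", "d", "e", "g", "k"],
    400: ["a", "c", "d", "h"],
}
MIN_SUP_COUNT = 2

# First scan: support count of every item.
support = Counter(item for items in TDB.values() for item in set(items))

# F-list: the frequent items in alphabetical order, as on the slide.
f_list = sorted(item for item, n in support.items() if n >= MIN_SUP_COUNT)

# Second scan: keep only frequent items, ordered by the F-list.
projections = {
    tid: [item for item in f_list if item in items]
    for tid, items in TDB.items()
}

print(f_list)  # ['a', 'c', 'd', 'e', 'g']
for tid, proj in projections.items():
    print(tid, proj)
```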
H-Mine (Example) (Cont.)
H-struct
Header table H
  item:   a  c  d  e  g
  count:  3  3  4  3  2
Frequent projections
  100   c, d, e, g
  200   a, c, d, e
  300   a, d, e, g
  400   a, c, d
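The slide's figure (the links between the header table and the projections) did not survive extraction, so the following is only a sketch of one plausible in-memory layout. It assumes that each header entry keeps a queue of the projections whose first item is that entry's item; the names `HeaderEntry` and `build_h_struct` are hypothetical, and transaction IDs stand in for the hyper-links.

```python
from dataclasses import dataclass, field


@dataclass
class HeaderEntry:
    support: int
    # IDs of the projections currently linked under this item (assumed layout).
    queue: list = field(default_factory=list)


# Frequent-item projections and supports from the example slides.
PROJECTIONS = {
    100: ["c", "d", "e", "g"],
    200: ["a", "c", "d", "e"],
    300: ["a", "d", "e", "g"],
    400: ["a", "c", "d"],
}
SUPPORT = {"a": 3, "c": 3, "d": 4, "e": 3, "g": 2}


def build_h_struct(projections, support):
    """Create the header table and link each projection under its first item."""
    header = {item: HeaderEntry(n) for item, n in sorted(support.items())}
    for tid, proj in projections.items():
        if proj:
            header[proj[0]].queue.append(tid)
    return header


header = build_h_struct(PROJECTIONS, SUPPORT)
for item, entry in header.items():
    print(item, entry.support, entry.queue)
```

Under this assumption the projections themselves are stored once; mining a prefix only re-threads the queues instead of copying transactions.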
H-Mine (Example) (Cont.)
Header table Ha
  item:   c  d  e  g
  count:  2  3  2  1
Output: ac:2, ad:3, ae:2
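The counts in header table Ha can be checked with a small sketch: for every projection that contains a, count the items that follow a in F-list order. The helper name `projected_header` is an assumption introduced here for illustration.

```python
from collections import Counter

# Frequent-item projections from the example slides.
PROJECTIONS = {
    100: ["c", "d", "e", "g"],
    200: ["a", "c", "d", "e"],
    300: ["a", "d", "e", "g"],
    400: ["a", "c", "d"],
}
F_LIST = ["a", "c", "d", "e", "g"]
MIN_SUP_COUNT = 2


def projected_header(projections, prefix_item, f_list):
    """Count the items that follow prefix_item in each projection containing it."""
    counts = Counter()
    for proj in projections.values():
        if prefix_item in proj:
            counts.update(proj[proj.index(prefix_item) + 1:])
    return {item: counts[item] for item in f_list if counts[item] > 0}


h_a = projected_header(PROJECTIONS, "a", F_LIST)
print(h_a)  # {'c': 2, 'd': 3, 'e': 2, 'g': 1}

# Items that reach the threshold extend the prefix "a" to a frequent pattern.
frequent = {"a" + item: n for item, n in h_a.items() if n >= MIN_SUP_COUNT}
print(frequent)  # {'ac': 2, 'ad': 3, 'ae': 2}
```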
H-Mine (Example) (Cont.)
Header table Hac
  item:   d  e
  count:  2  1
Header table Ha
  item:   c  d  e  g
  count:  2  3  2  1
Header table H
  item:   a  c  d  e  g
  count:  3  3  4  3  2
Output: acd:2
H-Mine (Example) (Cont.)
Header table Had
  item:   e  g
  count:  2  1
Header table Ha
  item:   c  d  e  g
  count:  2  3  2  1
Output: ade:2
H-Mine (Example) (Cont.)
Header table Hc
  item:   d  e  g
  count:  3  2  1
Output: cd:3, ce:2
H-Mine (Example) (Cont.)
Header table Hcd
  item:   e  g
  count:  2  1
Header table Hc
  item:   d  e  g
  count:  3  2  1
Output: cde:2
H-Mine (Example) (Cont.)
Header table Hd
  item:   e  g
  count:  3  2
Output: de:3, dg:2
H-Mine (Example) (Cont.)
Header table Hde
  item:   g
  count:  2
Header table Hd
  item:   e  g
  count:  3  2
Output: deg:2
H-Mine (Example) (Cont.)
Header table He
  item:   g
  count:  2
Output: eg:2
H-Mine (Example) (Cont.)
TDB (min_sup_count = 2)
  ID    Items
  100   c, d, e, f, g, i
  200   a, c, d, e, m
  300   a, b, d, e, g, k
  400   a, c, d, h
Output:
  a:3, c:3, d:4, e:3, g:2,
  ac:2, ad:3, ae:2,
  acd:2,
  ade:2,
  cd:3, ce:2,
  cde:2,
  de:3, dg:2,
  deg:2,
  eg:2
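The complete output can be reproduced with a compact recursive sketch. Note the simplification: this version copies each projected database as a plain Python list, whereas H-Mine re-threads links inside a single H-struct, so the sketch illustrates only the order of the recursion and not the space savings. The function name `mine` is hypothetical, and string concatenation works for patterns only because the items here are single characters.

```python
from collections import Counter

# Frequent-item projections from the example, in F-list order.
DB = [
    ["c", "d", "e", "g"],
    ["a", "c", "d", "e"],
    ["a", "d", "e", "g"],
    ["a", "c", "d"],
]
F_LIST = ["a", "c", "d", "e", "g"]


def mine(db, f_list, min_sup, prefix=""):
    """Return {pattern: support} for all frequent patterns that extend prefix."""
    patterns = {}
    counts = Counter(item for proj in db for item in proj)
    for item in f_list:
        if counts[item] < min_sup:
            continue
        pattern = prefix + item
        patterns[pattern] = counts[item]
        # Projected database: the items that follow `item` in each projection.
        projected = [proj[proj.index(item) + 1:] for proj in db if item in proj]
        patterns.update(mine(projected, f_list, min_sup, pattern))
    return patterns


patterns = mine(DB, F_LIST, 2)
print(len(patterns))  # 17
for pattern, support in patterns.items():
    print(f"{pattern}:{support}")
```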
From H-Mine (Mem) to H-Mine
TDB can be partitioned into k parts TDB_1, …, TDB_k
Each TDB_i (1 ≤ i ≤ k) can be held in main memory, where TDB_i has n_i transactions and n_1 + … + n_k = n
Find frequent patterns in TDB_i with min_sup_i = min_sup × (n_i / n)
Let F_i (1 ≤ i ≤ k) be the set of (locally) frequent patterns in TDB_i
Gather the patterns in F_1, …, F_k and collect their (global) support in TDB by scanning the transaction database TDB one more time
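The partitioning scheme can be sketched as follows. To keep the example short, a brute-force itemset enumerator (`local_frequent`) stands in for H-Mine (Mem); it is only practical for tiny transactions. All function names and the equal-size split are assumptions made for illustration. The scheme is complete because a pattern whose local support falls below the scaled threshold in every part cannot reach the global threshold.

```python
from itertools import combinations
from math import ceil

# Transactions from the running example.
TDB = [
    ["c", "d", "e", "f", "g", "i"],
    ["a", "c", "d", "e", "m"],
    ["a", "b", "d", "e", "g", "k"],
    ["a", "c", "d", "h"],
]


def local_frequent(part, min_count):
    """Brute-force stand-in for H-Mine (Mem): itemsets with support >= min_count."""
    counts = {}
    for items in part:
        for size in range(1, len(items) + 1):
            for combo in combinations(sorted(items), size):
                counts[combo] = counts.get(combo, 0) + 1
    return {pattern for pattern, n in counts.items() if n >= min_count}


def partitioned_mine(tdb, k, min_sup_count):
    n = len(tdb)
    size = max(1, ceil(n / k))
    parts = [tdb[i:i + size] for i in range(0, n, size)]

    # Mine each part with its scaled threshold: min_sup_i = min_sup * (n_i / n).
    candidates = set()
    for part in parts:
        candidates |= local_frequent(part, min_sup_count * len(part) / n)

    # One more scan of TDB collects the global support of every candidate.
    result = {}
    for cand in candidates:
        support = sum(1 for items in tdb if set(cand) <= set(items))
        if support >= min_sup_count:
            result["".join(cand)] = support
    return result


result = partitioned_mine(TDB, k=2, min_sup_count=2)
print(len(result))  # 17
```

With k = 2 each part holds two transactions, so the local threshold is 2 × (2/4) = 1 and every itemset in a part becomes a candidate; the final scan then removes the candidates that are not globally frequent.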
From H-Mine (Mem) to H-Mine (Cont.)
TDB → TDB_1, TDB_2, …, TDB_k → F_1, F_2, …, F_k → Scan TDB once → Frequent patterns
Performance Study

Data sets
– Real data
   Gazelle.com
   – 59,602 transactions
   – Up to 267 items per transaction
– Synthetic data
   1,000 items
   T25I15D10k
   – 10,000 transactions
   – Up to 25 items per transaction
   – Average longest potentially frequent itemset has 15 items
Performance Study (Cont.)

Runtime on data set Gazelle
Performance Study (Cont.)

Space usage on data set Gazelle
Performance Study (Cont.)

Runtime on data set T25I15D10k
Performance Study (Cont.)

Space usage on data set T25I15D10k
Performance Study (Cont.)

Runtime on data set T25I15D200-1280k
Performance Study (Cont.)

Runtime on data set T25I15D1280k
Conclusions



H-Mine outperforms Apriori and FP-growth, and is efficient and highly scalable for mining very large databases
H-Mine and FP-growth do not need to store any frequent patterns or candidates
H-Mine has high performance and is scalable across all kinds of data