• Study Resource
• Explore

Survey

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Association Rule Mining
CS 685: Special Topics in Data Mining
The UNIVERSITY of KENTUCKY
Frequent Pattern Analysis
Finding inherent regularities in data
What products were often purchased together?— Beer
and diapers?!
What are the subsequent purchases after buying a PC?
What are the commonly occurring subsequences in a
group of genes?
What are the shared substructures in a group of
effective drugs?
2
CS685: Special Topics in Data Mining
What Is Frequent Pattern
Analysis?
Frequent pattern: a pattern (a set of items, subsequences,
substructures, etc.) that occurs frequently in a data set
Applications
Identify motifs in bio-molecules
DNA sequence analysis, protein structure analysis
Identify patterns in micro-arrays
Market basket analysis, cross-marketing, catalog design, sale campaign
analysis, etc.
3
CS685: Special Topics in Data Mining
Data
An item is an element (a literal, a variable, a
symbol, a descriptor, an attribute, a measurement,
etc)
A transaction is a set of items
A data set is a set of transactions
A database is a data set
Transaction-id
4
Items bought
100
f, a, c, d, g, I, m, p
200
a, b, c, f, l,m, o
300
b, f, h, j, o
400
b, c, k, s, p
500
a, f, c, e, l, p, m, n
CS685: Special Topics in Data Mining
Association Rules
Transactionid
Items bought
100
f, a, c, d, g, I, m, p
200
a, b, c, f, l,m, o
300
b, f, h, j, o
400
b, c, k, s, p
500
a, f, c, e, l, p, m, n
Customer
Customer
Itemset X = {x1, …, xk}
Find all the rules X  Y with minimum
support and confidence
support, s, is the probability that a
transaction contains X  Y
confidence, c, is the conditional
probability that a transaction having X
also contains Y
Let supmin = 50%, confmin = 50%
Customer
5
Association rules:
A  C (60%, 100%)
C  A (60%, 75%)
CS685: Special Topics in Data Mining
Apriori-based Mining
Generate length (k+1) candidate itemsets
from length k frequent itemsets, and
Test the candidates against DB
6
CS685: Special Topics in Data Mining
Apriori Algorithm
A level-wise, candidate-generation-and-test
approach (Agrawal & Srikant 1994)
Data base D
TID
10
20
30
40
Items
a, c, d
b, c, e
a, b, c, e
b, e
1-candidates
Scan D
Min_sup=2
3-candidates
Scan D
Itemset
bce
Freq 3-itemsets
Itemset
bce
7
Sup
2
Itemset
a
b
c
d
e
Freq 1-itemsets
Sup
2
3
3
1
3
Itemset
a
b
c
e
Freq 2-itemsets
Itemset
ac
bc
be
ce
Sup
2
2
3
2
2-candidates
Sup
2
3
3
3
Counting
Itemset
ab
ac
ae
bc
be
ce
Sup
1
2
1
2
3
2
Itemset
ab
ac
ae
bc
be
ce
Scan D
CS685: Special Topics in Data Mining
Important Details of Apriori
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
How to count supports of candidates?
8
CS685: Special Topics in Data Mining
How to Generate Candidates?
Suppose the items in Lk-1 are listed in an order
Step 1: self-join Lk-1
INSERT INTO Ck
SELECT p.item1, p.item2, …, p.itemk-1, q.itemk-1
FROM Lk-1 p, Lk-1 q
WHERE p.item1=q.item1, …, p.itemk-2=q.itemk-2, p.itemk-1
< q.itemk-1
Step 2: pruning
For each itemset c in Ck do
For each (k-1)-subsets s of c do if (s is not in Lk-1) then delete c
from Ck
9
CS685: Special Topics in Data Mining
Example of Candidategeneration
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4={abcd}
10
CS685: Special Topics in Data Mining
How to Count Supports of
Candidates?
Why counting supports of candidates a problem?
The total number of candidates can be very huge
One transaction may contain many candidates
Method:
Candidate itemsets are stored in a hash-tree
Leaf node of hash-tree contains a list of itemsets and
counts
Interior node contains a hash table
Subset function: finds all the candidates contained in a
transaction
11
CS685: Special Topics in Data Mining
Apriori: Candidate Generationand-test
Any subset of a frequent itemset must be also
frequent — an anti-monotone property
A transaction containing {beer, diaper, nuts} also
contains {beer, diaper}
{beer, diaper, nuts} is frequent  {beer, diaper} must
also be frequent
No superset of any infrequent itemset should be
generated or tested
Many item combinations can be pruned
12
CS685: Special Topics in Data Mining
The Apriori Algorithm
Ck: Candidate itemset of size k
Lk : frequent itemset of size k
L1 = {frequent items};
for (k = 1; Lk !=; k++) do
Ck+1 = candidates generated from Lk;
for each transaction t in database do increment the
count of all candidates in Ck+1 that are contained in t
Lk+1 = candidates in Ck+1 with min_support
return k Lk;
13
CS685: Special Topics in Data Mining
Challenges of Frequent
Pattern Mining
Challenges
Multiple scans of transaction database
Huge number of candidates
Tedious workload of support counting for candidates
Improving Apriori: general ideas
Reduce number of transaction database scans
Shrink number of candidates
Facilitate support counting of candidates
14
CS685: Special Topics in Data Mining
DIC: Reduce Number of Scans
ABCD
ABC ABD ACD BCD
AB
AC
BC
BD
CD
Once both A and D are determined
frequent, the counting of AD can
begin
Once all length-2 subsets of BCD
are determined frequent, the
counting of BCD can begin
Transactions
A
B
C
D
Apriori
{}
Itemset lattice
S. Brin R. Motwani, J.
Ullman, and S. Tsur,
1997.
15
1-itemsets
2-itemsets
…
1-itemsets
2-items
DIC
3-items
CS685: Special Topics in Data Mining
DHP: Reduce the Number of
Candidates
A hashing bucket count <min_sup  every
candidate in the buck is infrequent
Candidates: a, b, c, d, e
Hash entries: {ab, ad, ae} {bd, be, de} …
Large 1-itemset: a, b, d, e
The sum of counts of {ab, ad, ae} < min_sup 
ab should not be a candidate 2-itemset
J. Park, M. Chen, and P. Yu, 1995
16
CS685: Special Topics in Data Mining
Partition: Scan Database Only
Twice
Partition the database into n partitions
Itemset X is frequent  X is frequent in at
least one partition
Scan 1: partition database and find local
frequent patterns
Scan 2: consolidate global frequent patterns
A. Savasere, E. Omiecinski, and S.
Navathe, 1995
17
CS685: Special Topics in Data Mining
Sampling for Frequent
Patterns
Select a sample of original database, mine
frequent patterns within sample using Apriori
Scan database once to verify frequent itemsets
found in sample, only borders of closure of
frequent patterns are checked
Example: check abcd instead of ab, ac, …, etc.
Scan database again to find missed frequent
patterns
H. Toivonen, 1996
18
CS685: Special Topics in Data Mining
Bottleneck of Frequentpattern Mining
Multiple database scans are costly
Mining long patterns needs many passes of
scanning and generates lots of candidates
To find frequent itemset i1i2…i100
# of scans: 100
# of Candidates:
100  100 
100  100

  
    
  2  1  1.27  1030
 1   2 
100 
Bottleneck: candidate-generation-and-test
Can we avoid candidate generation?
19
CS685: Special Topics in Data Mining
Set Enumeration Tree
Subsets of I can be enumerated
systematically

I={a, b, c, d}
a
ab
ac
abc
b
abd
c
bc
acd
d
bd
cd
bcd
abcd
20
CS685: Special Topics in Data Mining
Borders of Frequent Itemsets
Connected
X and Y are frequent and X is an ancestor of Y
 all patterns between X and Y are frequent

a
ab
ac
abc
b
abd
c
bc
acd
d
bd
cd
bcd
abcd
21
CS685: Special Topics in Data Mining
Projected Databases
To find a child Xy of X, only X-projected
database is needed
The sub-database of transactions containing X
Item y is frequent in X-projected database

a
ab
ac
abc
b
abd
c
bc
acd
d
bd
cd
bcd
abcd
22
CS685: Special Topics in Data Mining
Tree-Projection Method
Find frequent 2-itemsets
For each frequent 2-itemset xy, form a
projected database
The sub-database containing xy
Recursive mining
If x’y’ is frequent in xy-proj db, then xyx’y’ is
a frequent pattern
23
CS685: Special Topics in Data Mining
Borders and Max-patterns
Max-patterns: borders of frequent patterns
A subset of max-pattern is frequent
A superset of max-pattern is infrequent

a
ab
ac
abc
b
abd
c
bc
acd
d
bd
cd
bcd
abcd
24
CS685: Special Topics in Data Mining
MaxMiner: Mining Maxpatterns
1st scan: find frequent items
A, B, C, D, E
2nd scan: find support for
BC, BD, BE, BCDE
CD, CE, CDE, DE,
Tid
Items
10
A,B,C,D,E
20
B,C,D,E,
30
A,C,D,F
Min_sup=2
Potential
max-patterns
Since BCDE is a max-pattern, no need to check
BCD, BDE, CDE in later scan
Baya’98
25
CS685: Special Topics in Data Mining
Frequent Closed Patterns
For frequent itemset X, if there exists no
item y s.t. every transaction containing X
also contains y, then X is a frequent closed
pattern
“acdf” is a frequent closed pattern
Concise rep. of freq pats
Reduce # of patterns and rules
N. Pasquier et al. In ICDT’99
26
Min_sup=2
TID
Items
10
a, c, d, e, f
20
a, b, e
30
c, e, f
40
a, c, d, f
50
c, e, f
CS685: Special Topics in Data Mining
CLOSET: Mining Frequent
Closed Patterns
Flist: list of all freq items in support asc. order
Min_sup=2
Flist: d-a-f-e-c
Divide search space
Patterns having d
Patterns having d but no a, etc.
Find frequent closed pattern recursively
TID
10
20
30
40
50
Items
a, c, d, e, f
a, b, e
c, e, f
a, c, d, f
c, e, f
Every transaction having d also has cfa  cfad is a
frequent closed pattern
PHM’00
27
CS685: Special Topics in Data Mining
Closed and Max-patterns
Closed pattern mining algorithms can be
A max-pattern must be closed
28
CS685: Special Topics in Data Mining
Multiple-level Association
Rules
Items often form hierarchy
Flexible support settings: Items at the lower level
are expected to have lower support.
Transaction database can be encoded based on
dimensions and levels
explore shared multi-level mining
reduced support
uniform support
Level 1
min_sup = 5%
Level 2
min_sup = 5%
29
Milk
[support = 10%]
2% Milk
[support = 6%]
Level 1
min_sup = 5%
Skim Milk
[support = 4%]
Level 2
min_sup = 3%
CS685: Special Topics in Data Mining
Multi-dimensional Association
Rules
Single-dimensional rules:
MD rules:  2 dimensions or predicates
Inter-dimension assoc. rules (no repeated predicates)
age(X,”19-25”)  occupation(X,“student”) 
hybrid-dimension assoc. rules (repeated predicates)
Categorical Attributes: finite number of possible
values, no order among values
Quantitative Attributes: numeric, implicit order
30
CS685: Special Topics in Data Mining
Quantitative/Weighted Association
Rules
Numeric attributes are dynamically discretized
maximize the confidence or compactness of the rules
2-D quantitative association rules: Aquan1  Aquan2  Acat
Cluster “adjacent” association rules to form general rules
using a 2-D grid.
70-80k
60-70k
Income
age(X,”33-34”) 
income(X,”30K - 50K”) 
50-60k
40-50k
30-40k
20-30k
<20k
32
33
34
35
36
37
38
Age
31
CS685: Special Topics in Data Mining
Constraint-based Data Mining
Find all the patterns in a database autonomously?
The patterns could be too many but not focused!
Data mining should be interactive
User directs what to be mined
Constraint-based mining
User flexibility: provides constraints on what to be
mined
System optimization: push constraints for efficient
mining
32
CS685: Special Topics in Data Mining
Constraints in Data Mining
Knowledge type constraint
classification, association, etc.
Data constraint — using SQL-like queries
find product pairs sold together in stores in New York
Dimension/level constraint
in relevance to region, price, brand, customer category
Rule (or pattern) constraint
small sales (price < \$10) triggers big sales (sum >\$200)
Interestingness constraint
strong rules: support and confidence
33
CS685: Special Topics in Data Mining
Related documents