Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of association rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

Definition: Frequent Itemset

Itemset
– A collection of one or more items
– Example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items

Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2

Support (s)
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold

© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
Definition: Association Rule

Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

Rule evaluation metrics
– Support (s): fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer}

s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 = 0.67

Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!

Mining Association Rules

Example of rules (from the market-basket transactions above):
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Observations:
– All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
– Rules originating from the same itemset have identical support but can have different confidence
– Thus, we may decouple the support and confidence requirements

Two-step approach:
1. Frequent Itemset Generation: generate all itemsets whose support ≥ minsup
2. Rule Generation: generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive.
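The support and confidence computations above can be checked directly from the definitions. The following sketch uses the market-basket transactions from the slides; the helper names (`support_count`, `support`, `confidence`) are illustrative, not from the deck.

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support_count(itemset, db):
    """sigma(X): number of transactions that contain every item of X."""
    return sum(1 for t in db if itemset <= t)

def support(itemset, db):
    """s(X): fraction of transactions that contain X."""
    return support_count(itemset, db) / len(db)

def confidence(lhs, rhs, db):
    """c(X -> Y) = sigma(X u Y) / sigma(X)."""
    return support_count(lhs | rhs, db) / support_count(lhs, db)

print(support({"Milk", "Diaper", "Beer"}, transactions))       # 0.4
print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 2/3
```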
Frequent Itemset Generation

Given d items, there are 2^d possible candidate itemsets. They form a lattice from the null itemset through the 1-itemsets A, B, C, D, E and all larger combinations up to ABCDE.

Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database
– Match each transaction against every candidate
– Complexity ~ O(NMw) ⇒ expensive, since M = 2^d!
(N = number of transactions, M = number of candidates, w = maximum transaction width)
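The exponential candidate count is easy to see by enumerating the lattice; a quick sketch over the five items used in the lattice figure (variable names are illustrative):

```python
from itertools import chain, combinations

# Every non-empty subset of the item universe is a candidate itemset,
# so d items yield 2**d - 1 candidates (2**d counting the null itemset).
# Matching each of N transactions against all of them is the O(NMw) cost.
items = ["A", "B", "C", "D", "E"]
candidates = list(chain.from_iterable(
    combinations(items, k) for k in range(1, len(items) + 1)))
print(len(candidates))  # 31 == 2**5 - 1
```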
Frequent Itemset Generation Strategies

Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M

Reduce the number of transactions (N)
– Reduce the size of N as the size of the itemset increases
– Used by DHP and vertical-based mining algorithms

Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction

Reducing Number of Candidates

Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent

The Apriori principle holds due to the following property of the support measure:

∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

– Support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support
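The anti-monotone property can be verified mechanically on the market-basket example: every subset of {Milk, Diaper, Beer} must be at least as frequent as the set itself. A minimal sketch (helper names are illustrative):

```python
from itertools import combinations

transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def support(itemset, db):
    return sum(1 for t in db if itemset <= t) / len(db)

Y = {"Milk", "Diaper", "Beer"}
for r in range(1, len(Y)):
    for X in combinations(Y, r):
        # every proper subset must be at least as frequent as Y itself
        assert support(set(X), transactions) >= support(Y, transactions)
```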
Illustrating Apriori Principle

In the itemset lattice over {A, B, C, D, E}: if {A, B} is found to be infrequent, all of its supersets (ABC, ABD, ABE, ABCD, ABCE, ABDE, ABCDE) are pruned without being counted.

Items (1-itemsets), minimum support = 3:

Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets): no need to generate candidates involving Coke or Eggs.

Itemset          Count
{Bread, Milk}    3
{Bread, Beer}    2
{Bread, Diaper}  3
{Milk, Beer}     2
{Milk, Diaper}   3
{Beer, Diaper}   3

Triplets (3-itemsets):

Itemset                Count
{Bread, Milk, Diaper}  3

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13.
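The candidate counts on this slide can be checked with a few lines of standard-library arithmetic (a quick sketch; the variable names are mine):

```python
from math import comb

# Without pruning: every 1-, 2- and 3-itemset over the 6 items is counted.
naive = comb(6, 1) + comb(6, 2) + comb(6, 3)   # 6 + 15 + 20
print(naive)   # 41

# With support-based pruning (minsup = 3): all 6 single items are counted,
# only the 4 frequent ones (Bread, Milk, Beer, Diaper) produce
# C(4, 2) = 6 pairs, and a single triplet survives the subset check.
pruned = 6 + comb(4, 2) + 1
print(pruned)  # 13
```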
Frequent Itemset Mining

2 strategies:
– Breadth-first (Apriori): exploits monotonicity to the maximum
– Depth-first (Eclat): prunes the database, but does not fully exploit monotonicity

Example database (used on the following slides), minsup = 2:

TID  Items
1    B, C
2    B, C
3    A, C, D
4    A, B, C, D
5    B, D
Apriori

Walkthrough on the example database (minsup = 2):

Pass 1: count the candidate 1-itemsets by scanning the database:
A: 2, B: 4, C: 4, D: 3; all are frequent.

Pass 2: generate the candidate 2-itemsets from the frequent 1-itemsets and count them:
AB: 1, AC: 2, AD: 2, BC: 3, BD: 2, CD: 2.
AB fails minsup and is eliminated.

Pass 3: generate candidate 3-itemsets from the frequent 2-itemsets; candidates containing the infrequent subset AB are pruned, leaving ACD and BCD. Counting gives ACD: 2 (frequent) and BCD: 1 (eliminated). No further candidates can be generated, so the search stops.
Apriori Algorithm

Method:
– Let k = 1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified:
  1. Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  2. Prune candidate itemsets containing subsets of length k that are infrequent
  3. Count the support of each candidate by scanning the DB
  4. Eliminate candidates that are infrequent, leaving only those that are frequent

© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
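The method above translates almost line for line into a small level-wise implementation. This is a sketch, not the book's reference code; `apriori` and its signature are my own.

```python
from itertools import combinations

def apriori(db, minsup):
    """Level-wise search: generate, prune, count, eliminate."""
    db = [frozenset(t) for t in db]
    items = sorted({i for t in db for i in t})
    def count(c):
        return sum(1 for t in db if c <= t)
    # k = 1: frequent single items
    frequent = {frozenset([i]): count(frozenset([i]))
                for i in items if count(frozenset([i])) >= minsup}
    level, k = list(frequent), 1
    while level:
        # generate length-(k+1) candidates by joining length-k frequent sets
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # prune candidates that contain an infrequent length-k subset
        candidates = {c for c in candidates
                      if all(frozenset(s) in set(level)
                             for s in combinations(c, k))}
        # count by scanning the DB and eliminate the infrequent ones
        level = [c for c in candidates if count(c) >= minsup]
        frequent.update({c: count(c) for c in level})
        k += 1
    return frequent

db = [{"B","C"}, {"B","C"}, {"A","C","D"}, {"A","B","C","D"}, {"B","D"}]
result = apriori(db, minsup=2)
```

On the example database this yields the ten frequent itemsets of the walkthrough, including ACD with count 2, while AB is eliminated in pass 2.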
Depth-First Algorithms

Find all frequent itemsets of the example database (minsup = 2) by divide and conquer on item D:

1. Without D: find all frequent itemsets in the database with D removed. Result: A, B, C, AC, BC.
2. With D: restrict the database to the transactions that contain D (3, 4, 5), remove D, and find all frequent itemsets there (A, B, C, AC); adding D again gives AD, BD, CD, ACD.
3. Combine both parts: AD, BD, CD, ACD + A, B, C, AC, BC = A, B, C, AC, BC, AD, BD, CD, ACD.
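The divide-and-conquer split can be sketched with a brute-force subroutine on the example database. The helper `frequent` and the variable names are illustrative; note that the singleton {D} is itself frequent (three transactions contain it), in addition to the combined list on the slide.

```python
from itertools import chain, combinations

db = [{"B","C"}, {"B","C"}, {"A","C","D"}, {"A","B","C","D"}, {"B","D"}]

def frequent(db, minsup):
    """Brute force: all non-empty itemsets with support >= minsup."""
    items = sorted({i for t in db for i in t})
    subsets = chain.from_iterable(combinations(items, k)
                                  for k in range(1, len(items) + 1))
    return {frozenset(s) for s in subsets
            if sum(1 for t in db if set(s) <= t) >= minsup}

# 1. frequent itemsets without D: mine the database with D removed
without_d = frequent([t - {"D"} for t in db], 2)

# 2. frequent itemsets with D: mine the D-projected database
#    (transactions 3, 4, 5 with D removed), then add D back
db_d = [t - {"D"} for t in db if "D" in t]
with_d = {f | {"D"} for f in frequent(db_d, 2)} | {frozenset({"D"})}

# 3. combine both parts
all_frequent = without_d | with_d
```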
Depth-First Algorithm

Walkthrough with projected databases on the example database (minsup = 2):

DB (all five transactions): A: 2, B: 4, C: 4, D: 3.

DB[D]: transactions containing D, with D removed (3: A, C; 4: A, B, C; 5: B).
Counts A: 2, B: 2, C: 2, giving AD: 2, BD: 2, CD: 2.
– DB[CD] (3: A; 4: A, B): A: 2, giving ACD: 2.
– DB[BD] (only transaction 4 remains): A: 1, nothing frequent.

DB[C]: transactions containing C, after the D branch is done (1: B; 2: B; 3: A; 4: A, B).
Counts A: 2, B: 3, giving AC: 2, BC: 3.
– DB[BC] (only transaction 4 remains): A: 1, nothing frequent.

DB[B] (transactions 1, 2, 4, 5): A: 1, nothing frequent.

Final set of frequent itemsets:
A: 2, B: 4, C: 4, D: 3, AC: 2, BC: 3, AD: 2, BD: 2, CD: 2, ACD: 2
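The recursive projection scheme of the walkthrough can be written compactly. This is an illustrative sketch (the function name `mine` is mine, and real Eclat operates on tid-lists rather than on projected transaction sets, as the next slides show):

```python
def mine(db, minsup, suffix=frozenset()):
    """Depth-first mining: for each item, recurse into the projected
    database DB[item] ("with" branch), then drop the item ("without")."""
    result = {}
    items = sorted({i for t in db for i in t})
    for i in items:
        # transactions containing i, with i removed (the projected DB)
        proj = [t - {i} for t in db if i in t]
        if len(proj) >= minsup:                     # {i} | suffix is frequent
            result[frozenset({i}) | suffix] = len(proj)
            result.update(mine(proj, minsup, frozenset({i}) | suffix))
        db = [t - {i} for t in db]                  # "without i" branch
    return result

db = [{"B","C"}, {"B","C"}, {"A","C","D"}, {"A","B","C","D"}, {"B","D"}]
found = mine(db, minsup=2)
```

On the example database this reproduces the final set above, e.g. ACD with support count 2 and BD with support count 2.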
ECLAT

For each item, store a list of transaction ids (tids).

Horizontal data layout:

TID  Items
1    A, B, E
2    B, C, D
3    C, E
4    A, C, D
5    A, B, C, D
6    A, E
7    A, B
8    A, B, C
9    A, C, D
10   B

Vertical data layout (TID-lists):

A: 1, 4, 5, 6, 7, 8, 9
B: 1, 2, 5, 7, 8, 10
C: 2, 3, 4, 8, 9
D: 2, 4, 5, 9
E: 1, 3, 6
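Building the vertical layout from the horizontal one, and reading supports off tid-lists, takes only a few lines. A sketch over the ten-transaction table above (variable names are illustrative):

```python
horizontal = {
    1: {"A","B","E"}, 2: {"B","C","D"}, 3: {"C","E"},  4: {"A","C","D"},
    5: {"A","B","C","D"}, 6: {"A","E"}, 7: {"A","B"},  8: {"A","B","C"},
    9: {"A","C","D"}, 10: {"B"},
}

# invert the horizontal layout: item -> set of transaction ids
vertical = {}
for tid, items in horizontal.items():
    for item in items:
        vertical.setdefault(item, set()).add(tid)

print(sorted(vertical["A"]))              # [1, 4, 5, 6, 7, 8, 9]
# the length of a tid-list is the item's support count
print(len(vertical["A"]))                 # 7
# intersecting tid-lists gives the support of a larger itemset
print(sorted(vertical["A"] & vertical["B"]))  # [1, 5, 7, 8]
```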
Determine the support of any k-itemset by intersecting the tid-lists of two of its (k-1)-subsets:

A: 1, 4, 5, 6, 7, 8, 9  ∧  B: 1, 2, 5, 7, 8, 10  →  AB: 1, 5, 7, 8

– Depth-first traversal of the search lattice
– Advantage: very fast support counting
– Disadvantage: intermediate tid-lists may become too large for memory

Rule Generation

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L – f satisfies the minimum confidence requirement.

– If {A, B, C, D} is a frequent itemset, the candidate rules are:
  ABC → D, ABD → C, ACD → B, BCD → A,
  A → BCD, B → ACD, C → ABD, D → ABC,
  AB → CD, AC → BD, AD → BC, BC → AD, BD → AC, CD → AB
– If |L| = k, then there are 2^k – 2 candidate association rules (ignoring L → ∅ and ∅ → L)
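Enumerating the 2^k – 2 candidate rules of a frequent itemset is a one-liner over its proper subsets; a sketch for L = {A, B, C, D} (variable names are mine):

```python
from itertools import combinations

# every non-empty proper subset f of L yields the candidate rule f -> L - f
L = frozenset({"A", "B", "C", "D"})
rules = [(frozenset(f), L - frozenset(f))
         for r in range(1, len(L))
         for f in combinations(sorted(L), r)]
print(len(rules))  # 14 == 2**4 - 2
```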
Rule Generation for Apriori Algorithm

How to efficiently generate rules from frequent itemsets?
– In general, confidence does not have an anti-monotone property:
  c(ABC → D) can be larger or smaller than c(AB → D)
– But the confidence of rules generated from the same itemset does have an anti-monotone property
– E.g., for L = {A, B, C, D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
– Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule

In the lattice of rules, once a low-confidence rule is found, every rule obtained from it by moving items from the LHS to the RHS can be pruned.
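The anti-monotone chain can be checked on the market-basket example with L = {Milk, Diaper, Beer}: both confidences share the numerator σ(L), and the denominator (the LHS support) can only grow as items are moved out of the LHS, so confidence can only drop. A sketch, with an illustrative helper name:

```python
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(x):
    """Support count of itemset x."""
    return sum(1 for t in transactions if x <= t)

L = {"Milk", "Diaper", "Beer"}
c1 = sigma(L) / sigma({"Milk", "Diaper"})   # c(Milk,Diaper -> Beer) = 2/3
c2 = sigma(L) / sigma({"Milk"})             # c(Milk -> Diaper,Beer) = 2/4
assert c1 >= c2   # moving Diaper to the RHS cannot raise the confidence
```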