Data Mining
Association Analysis: Basic Concepts and Algorithms

Lecture Notes for Chapter 6
Introduction to Data Mining
by Tan, Steinbach, Kumar

Association Rule Mining

• Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-Basket transactions:

  TID | Items
  1   | Bread, Milk
  2   | Bread, Diaper, Beer, Eggs
  3   | Milk, Diaper, Beer, Coke
  4   | Bread, Milk, Diaper, Beer
  5   | Bread, Milk, Diaper, Coke

Example of Association Rules:
  {Diaper} → {Beer}
  {Milk, Bread} → {Eggs, Coke}
  {Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!

Definition: Frequent Itemset

• Itemset
  – A collection of one or more items
    Example: {Milk, Bread, Diaper}
  – k-itemset: an itemset that contains k items
• Support count (σ)
  – Frequency of occurrence of an itemset
  – E.g. σ({Milk, Bread, Diaper}) = 2
• Support (s)
  – Fraction of transactions that contain an itemset
  – E.g. s({Milk, Bread, Diaper}) = 2/5
• Frequent Itemset
  – An itemset whose support is greater than or equal to a minsup threshold

Definition: Association Rule

• Association Rule
  – An implication expression of the form X → Y, where X and Y are itemsets
  – Example: {Milk, Diaper} → {Beer}
• Rule Evaluation Metrics
  – Support (s): fraction of transactions that contain both X and Y
  – Confidence (c): measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} ⇒ {Beer}

  s = σ(Milk, Diaper, Beer) / |T| = 2/5 = 0.4
  c = σ(Milk, Diaper, Beer) / σ(Milk, Diaper) = 2/3 = 0.67

Association Rule Mining Task

• Given a set of transactions T, the goal of association rule mining is to find all rules having
  – support ≥ minsup threshold
  – confidence ≥ minconf threshold
• Brute-force approach:
  – List all possible association rules
  – Compute the support and confidence for each rule
  – Prune rules that fail the minsup and minconf thresholds
  ⇒ Computationally prohibitive!

Mining Association Rules

Example of Rules (from the market-basket transactions above):
  {Milk, Diaper} → {Beer}   (s=0.4, c=0.67)
  {Milk, Beer} → {Diaper}   (s=0.4, c=1.0)
  {Diaper, Beer} → {Milk}   (s=0.4, c=0.67)
  {Beer} → {Milk, Diaper}   (s=0.4, c=0.67)
  {Diaper} → {Milk, Beer}   (s=0.4, c=0.5)
  {Milk} → {Diaper, Beer}   (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
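The support and confidence computations above can be sketched in a few lines of Python. This is a minimal illustration of the definitions, not code from the slides; the set-based transaction encoding and the function names are my own.

    # Market-basket transactions from the example above
    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def support_count(itemset, transactions):
        """sigma(X): number of transactions containing every item of X."""
        return sum(1 for t in transactions if set(itemset) <= t)

    def support(itemset, transactions):
        """s(X): fraction of transactions containing X."""
        return support_count(itemset, transactions) / len(transactions)

    def confidence(antecedent, consequent, transactions):
        """c(X -> Y) = sigma(X union Y) / sigma(X)."""
        x, xy = set(antecedent), set(antecedent) | set(consequent)
        return support_count(xy, transactions) / support_count(x, transactions)

    print(support({"Milk", "Diaper", "Beer"}, transactions))       # 0.4
    print(confidence({"Milk", "Diaper"}, {"Beer"}, transactions))  # 0.666...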
Mining Association Rules

• Two-step approach:
  1. Frequent Itemset Generation
     – Generate all itemsets whose support ≥ minsup
  2. Rule Generation
     – Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset
• Frequent itemset generation is still computationally expensive

Frequent Itemset Generation

[Figure: the itemset lattice over five items A–E, from the empty set (null) at the top, through the 1-itemsets A, B, C, D, E, the 2-itemsets AB, AC, …, DE, and so on, down to ABCDE at the bottom.]

Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation

• Brute-force approach:
  – Each itemset in the lattice is a candidate frequent itemset
  – Count the support of each candidate by scanning the database
  – Match each transaction against every candidate
  – Complexity ~ O(NMw) ⇒ expensive since M = 2^d !!!

[Figure: the N transactions (the market-basket table above) matched against the list of M candidate itemsets, each transaction of width w.]

Computational Complexity

• Given d unique items:
  – Total number of itemsets = 2^d
  – Total number of possible association rules:

      R = \sum_{k=1}^{d-1} \left[ \binom{d}{k} \times \sum_{j=1}^{d-k} \binom{d-k}{j} \right] = 3^d - 2^{d+1} + 1

    If d = 6, R = 602 rules
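As a quick sanity check of the rule-count formula, the closed form 3^d − 2^{d+1} + 1 can be compared against the double sum directly. A small Python sketch of my own, purely for illustration:

    from math import comb

    def num_rules(d):
        """Total number of association rules over d items, via the double sum."""
        return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
                   for k in range(1, d))

    def num_rules_closed(d):
        """Closed form: 3^d - 2^(d+1) + 1."""
        return 3**d - 2**(d + 1) + 1

    print(num_rules(6), num_rules_closed(6))  # both print 602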
Frequent Itemset Generation Strategies

• Reduce the number of candidates (M)
  – Complete search: M = 2^d
  – Use pruning techniques to reduce M
• Reduce the number of transactions (N)
  – Reduce the size of N as the size of the itemset increases
  – Used by DHP and vertical-based mining algorithms
• Reduce the number of comparisons (NM)
  – Use efficient data structures to store the candidates or transactions
  – No need to match every candidate against every transaction

Reducing Number of Candidates

• Apriori principle:
  – If an itemset is frequent, then all of its subsets must also be frequent
• The Apriori principle holds due to the following property of the support measure:

      ∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

  – Support of an itemset never exceeds the support of its subsets
  – This is known as the anti-monotone property of support
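In candidate generation, it is the contrapositive of the Apriori principle that gets used: if any (k−1)-subset of a candidate k-itemset is infrequent, the candidate can be discarded without counting it. A hedged Python sketch of that check (the function name and frozenset encoding are my own):

    from itertools import combinations

    def has_infrequent_subset(candidate, frequent_prev):
        """True if some (k-1)-subset of the candidate k-itemset is not among
        the frequent (k-1)-itemsets, i.e. the candidate can be pruned."""
        k = len(candidate)
        return any(frozenset(sub) not in frequent_prev
                   for sub in combinations(candidate, k - 1))

    # Example: if {Coke} is infrequent, any candidate pair containing Coke is pruned.
    frequent_1 = {frozenset({"Bread"}), frozenset({"Milk"}),
                  frozenset({"Diaper"}), frozenset({"Beer"})}
    print(has_infrequent_subset(frozenset({"Coke", "Milk"}), frequent_1))   # True
    print(has_infrequent_subset(frozenset({"Bread", "Milk"}), frequent_1))  # False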
Illustrating Apriori Principle

[Figure: the itemset lattice with an infrequent 1-itemset marked "Found to be Infrequent" and all of its supersets marked "Pruned supersets".]

Minimum Support (count) = 3

Items (1-itemsets):
  Item   | Count
  Bread  | 4
  Coke   | 2
  Milk   | 4
  Beer   | 3
  Diaper | 4
  Eggs   | 1

Pairs (2-itemsets) — no need to generate candidates involving Coke or Eggs:
  Itemset        | Count
  {Bread,Milk}   | 3
  {Bread,Beer}   | 2
  {Bread,Diaper} | 3
  {Milk,Beer}    | 2
  {Milk,Diaper}  | 3
  {Beer,Diaper}  | 3

Triplets (3-itemsets):
  Itemset             | Count
  {Bread,Milk,Diaper} | 2

If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates.
With support-based pruning: 6 + 6 + 1 = 13 candidates.

Apriori Algorithm

• Method:
  – Let k = 1
  – Generate frequent itemsets of length 1
  – Repeat until no new frequent itemsets are identified:
    ◆ Generate length (k+1) candidate itemsets from length k frequent itemsets
    ◆ Prune candidate itemsets containing subsets of length k that are infrequent
    ◆ Count the support of each candidate by scanning the DB
    ◆ Eliminate candidates that are infrequent, leaving only those that are frequent

Reducing Number of Comparisons

• Candidate counting:
  – Scan the database of transactions to determine the support of each candidate itemset
  – To reduce the number of comparisons, store the candidates in a hash structure
  – Instead of matching each transaction against every candidate, match it against the candidates contained in the hashed buckets
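The method above translates almost line by line into a level-wise loop. Below is a minimal, unoptimized Python sketch of my own of the generate/prune/count cycle; candidate generation joins frequent k-itemsets, and counting is done by a full scan rather than a hash tree.

    from itertools import combinations

    def apriori(transactions, minsup_count):
        """Return {itemset: support_count} for all frequent itemsets."""
        items = {i for t in transactions for i in t}
        counts = {frozenset({i}): sum(i in t for t in transactions) for i in items}
        frequent = {s: c for s, c in counts.items() if c >= minsup_count}
        all_frequent, k = dict(frequent), 1
        while frequent:
            # Generate (k+1)-candidates by joining frequent k-itemsets
            candidates = {a | b for a in frequent for b in frequent if len(a | b) == k + 1}
            # Prune candidates with an infrequent k-subset (Apriori principle)
            candidates = {c for c in candidates
                          if all(frozenset(s) in frequent for s in combinations(c, k))}
            # Count the surviving candidates by scanning the database
            counts = {c: sum(c <= t for t in transactions) for c in candidates}
            frequent = {s: n for s, n in counts.items() if n >= minsup_count}
            all_frequent.update(frequent)
            k += 1
        return all_frequent

    transactions = [frozenset(t) for t in (
        {"Bread", "Milk"}, {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"}, {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"})]
    # With minsup_count = 3, {Bread, Milk, Diaper} is counted as a candidate
    # but dropped (its support count is 2 < 3).
    print(apriori(transactions, minsup_count=3))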
Generate Hash Tree

Suppose you have 15 candidate itemsets of length 3:
  {1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5},
  {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

You need:
• A hash function
• A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets in a leaf exceeds the max leaf size, split the node)

Association Rule Discovery: Hash tree

Hash function: items 1, 4, 7 hash to one branch; 2, 5, 8 to another; 3, 6, 9 to the third.

[Figure, shown over three slides: the candidate hash tree built from the 15 candidates. At the root, each candidate is routed by hashing its first item (on 1, 4 or 7; on 2, 5 or 8; on 3, 6 or 9); at the next level it is routed by hashing its second item, and so on, until the candidates are distributed across the leaf nodes.]
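A compact way to see the same idea in code: a hedged Python sketch (my own, not the slides' implementation) of a candidate hash tree that routes each sorted 3-itemset by hashing successive items with h(i) = i mod 3 and splits a leaf when it exceeds the max leaf size. The class and parameter names are illustrative.

    MAX_LEAF_SIZE = 3

    class HashTreeNode:
        def __init__(self, depth=0):
            self.depth = depth        # position of the itemset used for hashing
            self.children = {}        # bucket -> child node (internal nodes)
            self.itemsets = []        # candidates stored here (leaf nodes)

        def insert(self, itemset):
            if self.children:                              # internal node: route down
                bucket = itemset[self.depth] % 3           # 1,4,7 -> 1; 2,5,8 -> 2; 3,6,9 -> 0
                self.children.setdefault(bucket, HashTreeNode(self.depth + 1)).insert(itemset)
                return
            self.itemsets.append(itemset)                  # leaf node
            if len(self.itemsets) > MAX_LEAF_SIZE and self.depth < len(itemset):
                stored, self.itemsets = self.itemsets, []  # split: push itemsets down a level
                for s in stored:
                    bucket = s[self.depth] % 3
                    self.children.setdefault(bucket, HashTreeNode(self.depth + 1)).insert(s)

    candidates = [(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6), (2,3,4),
                  (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8)]
    root = HashTreeNode()
    for c in candidates:
        root.insert(c)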
Subset Operation

Given a transaction t, what are the possible subsets of size 3?

[Figure: transaction t = 1 2 3 5 6. Its 3-subsets are enumerated by fixing a first item and recursing on the remaining items: 1 + 2 3 5 6, 2 + 3 5 6, 3 + 5 6; then 1 2 + 3 5 6, 1 3 + 5 6, 1 5 + 6, and so on.]

Subset Operation Using Hash Tree

[Figure, shown over three slides: the transaction 1 2 3 5 6 walked down the candidate hash tree. At the root, items 1, 2 and 3 are hashed to select branches (1 + 2 3 5 6, 2 + 3 5 6, 3 + 5 6); at the next level the prefixes 1 2 + 3 5 6, 1 3 + 5 6 and 1 5 + 6 continue down the corresponding branches, until the leaves that are reached are compared against the transaction.]

Match the transaction against 11 out of 15 candidates.
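Without a hash tree, this step means enumerating all C(5,3) = 10 size-3 subsets of the transaction and testing each against every candidate. A small Python sketch of that baseline (my own), which is exactly the comparison work the hash tree is designed to cut down:

    from itertools import combinations

    candidates = {(1,4,5), (1,2,4), (4,5,7), (1,2,5), (4,5,8), (1,5,9), (1,3,6), (2,3,4),
                  (5,6,7), (3,4,5), (3,5,6), (3,5,7), (6,8,9), (3,6,7), (3,6,8)}
    transaction = (1, 2, 3, 5, 6)

    # Enumerate every 3-subset of the transaction (10 of them) and look it up
    # among the candidates; each hit would increment that candidate's support count.
    hits = [s for s in combinations(transaction, 3) if s in candidates]
    print(hits)        # the candidates contained in this transaction
    print(len(hits))   # 3 of the 15 candidates are supported by it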
Factors Affecting Complexity

• Choice of minimum support threshold
  – lowering the support threshold results in more frequent itemsets
  – this may increase the number of candidates and the max length of frequent itemsets
• Dimensionality (number of items) of the data set
  – more space is needed to store the support count of each item
  – if the number of frequent items also increases, both computation and I/O costs may also increase
• Size of database
  – since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions
• Average transaction width
  – transaction width increases with denser data sets
  – this may increase the max length of frequent itemsets and the traversals of the hash tree (the number of subsets in a transaction increases with its width)

Compact Representation of Frequent Itemsets

• Some itemsets are redundant because they have identical support as their supersets

[Table: 15 transactions over 30 items A1–A10, B1–B10, C1–C10. Each transaction contains all ten items of exactly one block (e.g., the first transaction contains A1, …, A10 and 0s elsewhere), so every block of ten items co-occurs in several transactions.]

• Number of frequent itemsets = 3 × Σ_{k=1}^{10} C(10, k)

• Need a compact representation
Maximal Frequent Itemset

• An itemset is maximal frequent if none of its immediate supersets is frequent

[Figure: an itemset lattice split by the border between frequent and infrequent itemsets; the maximal itemsets sit just inside the border, the infrequent itemsets above it.]

Closed Itemset

• An itemset is closed if none of its immediate supersets has the same support as the itemset

  TID | Items
  1   | {A,B}
  2   | {B,C,D}
  3   | {A,B,C,D}
  4   | {A,B,D}
  5   | {A,B,C,D}

  Itemset   | Support
  {A}       | 4
  {B}       | 5
  {C}       | 3
  {D}       | 4
  {A,B}     | 4
  {A,C}     | 2
  {A,D}     | 3
  {B,C}     | 3
  {B,D}     | 4
  {C,D}     | 3
  {A,B,C}   | 2
  {A,B,D}   | 3
  {A,C,D}   | 2
  {B,C,D}   | 3
  {A,B,C,D} | 2

Maximal vs Closed Itemsets

  TID | Items
  1   | ABC
  2   | ABCD
  3   | BCE
  4   | ACDE
  5   | DE

Minimum support = 2

[Figure: the itemset lattice over A–E annotated with the transaction IDs containing each itemset (e.g., A: 1,2,4; B: 1,2,3; C: 1,2,3,4; D: 2,4,5; E: 3,4,5; AB: 1,2). Itemsets that are closed but not maximal and itemsets that are both closed and maximal are highlighted; itemsets not supported by any transactions are greyed out.]

Maximal vs Closed Frequent Itemsets

• # Closed = 9
• # Maximal = 4
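The closed and maximal counts for this example can be checked by brute force. A short Python sketch of my own, using the five transactions and minsup = 2 from the slide:

    from itertools import combinations

    transactions = [set("ABC"), set("ABCD"), set("BCE"), set("ACDE"), set("DE")]
    items, minsup = set("ABCDE"), 2

    def count(itemset):
        return sum(1 for t in transactions if itemset <= t)

    # All frequent itemsets (support count >= minsup)
    frequent = {frozenset(c): count(set(c))
                for k in range(1, len(items) + 1)
                for c in combinations(sorted(items), k)
                if count(set(c)) >= minsup}

    def immediate_supersets(s):
        return [s | {i} for i in items - s]

    # Closed: no immediate superset has the same support
    closed = [s for s, sup in frequent.items()
              if all(count(bigger) < sup for bigger in immediate_supersets(s))]
    # Maximal: no immediate superset is frequent
    maximal = [s for s in frequent
               if all(bigger not in frequent for bigger in immediate_supersets(s))]

    print(len(closed), len(maximal))   # 9 closed, 4 maximal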
Maximal vs Closed Itemsets

[Figure: the containment relationship among the three collections — maximal frequent itemsets ⊆ closed frequent itemsets ⊆ frequent itemsets.]

Alternative Methods for Frequent Itemset Generation

• Traversal of Itemset Lattice
  – General-to-specific vs Specific-to-general
  – Equivalence Classes
  – Breadth-first vs Depth-first
• Representation of Database
  – horizontal vs vertical data layout

Horizontal data layout:
  TID | Items
  1   | A,B,E
  2   | B,C,D
  3   | C,E
  4   | A,C,D
  5   | A,B,C,D
  6   | A,E
  7   | A,B
  8   | A,B,C
  9   | A,C,D
  10  | B

Vertical data layout (TID-list per item):
  A: 1, 4, 5, 6, 7, 8, 9
  B: 1, 2, 5, 7, 8, 10
  C: 2, 3, 4, 8, 9
  D: 2, 4, 5, 9
  E: 1, 3, 6

FP-growth Algorithm

• Use a compressed representation of the database using an FP-tree
• Once an FP-tree has been constructed, it uses a recursive divide-and-conquer approach to mine the frequent itemsets
FP-Tree Construction

Transaction Database:
  TID | Items
  1   | {A,B}
  2   | {B,C,D}
  3   | {A,C,D,E}
  4   | {A,D,E}
  5   | {A,B,C}
  6   | {A,B,C,D}
  7   | {B,C}
  8   | {A,B,C}
  9   | {A,B,D}
  10  | {B,C,E}

After reading TID=1:   null → A:1 → B:1

After reading TID=2:   null → A:1 → B:1
                       null → B:1 → C:1 → D:1

[Figure: the final FP-tree after all ten transactions have been read. The branch rooted at A carries count A:7 (with B:5 below it) and the branch rooted at B carries count B:3; C, D and E nodes appear further down the paths. A header table with one entry per item (A, B, C, D, E) keeps pointers that chain together all nodes labelled with the same item.]

Pointers are used to assist frequent itemset generation.
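A bare-bones insertion routine makes the construction concrete. This is a minimal Python sketch of my own; it assumes the items of each transaction are already sorted in a fixed order (e.g., by decreasing support), and it represents the header-table chains as simple lists.

    from collections import defaultdict

    class FPNode:
        def __init__(self, item, parent):
            self.item, self.parent = item, parent
            self.count = 0
            self.children = {}          # item -> FPNode

    def insert_transaction(root, items, header):
        """Walk/extend one path of the FP-tree for a single (ordered) transaction."""
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
                header[item].append(child)   # header table: chain of nodes per item
            child.count += 1
            node = child

    transactions = [["A","B"], ["B","C","D"], ["A","C","D","E"], ["A","D","E"],
                    ["A","B","C"], ["A","B","C","D"], ["B","C"], ["A","B","C"],
                    ["A","B","D"], ["B","C","E"]]
    root, header = FPNode(None, None), defaultdict(list)
    for t in transactions:
        insert_transaction(root, t, header)
    print(root.children["A"].count, root.children["B"].count)   # 7 and 3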
FP-growth

[Figure: the subtree of the FP-tree containing the paths that end in D, used to build D's conditional pattern base.]

Conditional pattern base for D:
  P = {(A:1,B:1,C:1),
       (A:1,B:1),
       (A:1,C:1),
       (A:1),
       (B:1,C:1)}

Recursively apply FP-growth on P.

Frequent itemsets found (with sup > 1): AD, BD, CD, ACD, BCD

Tree Projection

Set enumeration tree:

[Figure: the lexicographic set enumeration tree over items A–E, from null through A, B, C, D, E, AB, AC, …, down to ABCDE. Possible extensions: E(A) = {B,C,D,E}; E(ABC) = {D,E}.]
Tree Projection

• Items are listed in lexicographic order
• Each node P stores the following information:
  – Itemset for node P
  – List of possible lexicographic extensions of P: E(P)
  – Pointer to the projected database of its ancestor node
  – Bitvector containing information about which transactions in the projected database contain the itemset

Projected Database

Original Database:                Projected Database for node A:
  TID | Items                       TID | Items
  1   | {A,B}                       1   | {B}
  2   | {B,C,D}                     2   | {}
  3   | {A,C,D,E}                   3   | {C,D,E}
  4   | {A,D,E}                     4   | {D,E}
  5   | {A,B,C}                     5   | {B,C}
  6   | {A,B,C,D}                   6   | {B,C,D}
  7   | {B,C}                       7   | {}
  8   | {A,B,C}                     8   | {B,C}
  9   | {A,B,D}                     9   | {B,D}
  10  | {B,C,E}                     10  | {}

For each transaction T, the projected transaction at node A is T ∩ E(A).
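The projection rule is essentially a one-liner. A small Python sketch of my own reproducing the projected database for node A, where E(A) = {B, C, D, E} are the lexicographic extensions of A; as in the table above, transactions that do not contain A project to the empty set.

    transactions = {1: {"A","B"}, 2: {"B","C","D"}, 3: {"A","C","D","E"},
                    4: {"A","D","E"}, 5: {"A","B","C"}, 6: {"A","B","C","D"},
                    7: {"B","C"}, 8: {"A","B","C"}, 9: {"A","B","D"}, 10: {"B","C","E"}}

    extensions_of_A = {"B", "C", "D", "E"}   # E(A): items that come after A

    # Projected transaction at node A: T intersected with E(A), if T contains A
    projected = {tid: (t & extensions_of_A if "A" in t else set())
                 for tid, t in transactions.items()}
    print(projected[1], projected[3], projected[2])   # {'B'}  {'C','D','E'}  set()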
ECLAT

• For each item, store a list of transaction ids (tids) — the vertical data layout shown earlier (e.g., A: 1,4,5,6,7,8,9; B: 1,2,5,7,8,10; C: 2,3,4,8,9; D: 2,4,5,9; E: 1,3,6).

• Determine the support of any k-itemset by intersecting the tid-lists of two of its (k−1)-subsets:

    A: 1, 4, 5, 6, 7, 8, 9   ∧   B: 1, 2, 5, 7, 8, 10   →   AB: 1, 5, 7, 8

• 3 traversal approaches: top-down, bottom-up and hybrid
• Advantage: very fast support counting
• Disadvantage: intermediate tid-lists may become too large for memory
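Tid-list intersection is just a set intersection. A minimal Python sketch of my own reproducing the A ∧ B example:

    # Vertical data layout: item -> set of transaction ids
    tidlists = {"A": {1, 4, 5, 6, 7, 8, 9},
                "B": {1, 2, 5, 7, 8, 10},
                "C": {2, 3, 4, 8, 9},
                "D": {2, 4, 5, 9},
                "E": {1, 3, 6}}

    # Support of a k-itemset = size of the intersection of its items' tid-lists
    tidlist_AB = tidlists["A"] & tidlists["B"]
    print(sorted(tidlist_AB), len(tidlist_AB))   # [1, 5, 7, 8]  support = 4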
Rule Generation

• Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement
  – If {A,B,C,D} is a frequent itemset, the candidate rules are:
      ABC → D,  ABD → C,  ACD → B,  BCD → A,
      A → BCD,  B → ACD,  C → ABD,  D → ABC,
      AB → CD,  AC → BD,  AD → BC,  BC → AD,  BD → AC,  CD → AB
• If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)

Rule Generation

• How to efficiently generate rules from frequent itemsets?
  – In general, confidence does not have an anti-monotone property:
      c(ABC → D) can be larger or smaller than c(AB → D)
  – But confidence of rules generated from the same itemset has an anti-monotone property
  – e.g., L = {A,B,C,D}:

      c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)

    ◆ Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule

Rule Generation for Apriori Algorithm

[Figure: the lattice of rules generated from {A,B,C,D}; once a rule such as CD ⇒ AB has low confidence, all rules below it with larger consequents are marked "Pruned Rules".]

• A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
• join(CD ⇒ AB, BD ⇒ AC) would produce the candidate rule D ⇒ ABC
• Prune rule D ⇒ ABC if its subset AD ⇒ BC does not have high confidence
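A simplified Python sketch of my own of this rule-generation step: starting from 1-item consequents and growing the consequent only when confidence stays above minconf, which is exactly where the anti-monotone property above is exploited. Support counts are recomputed by scanning the transactions rather than looked up from the frequent-itemset table.

    transactions = [frozenset(t) for t in (
        {"Bread", "Milk"}, {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"}, {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"})]

    def sigma(itemset):
        return sum(1 for t in transactions if itemset <= t)

    def rules_from_itemset(L, minconf):
        """Generate rules f -> L - f with confidence >= minconf, growing the consequent."""
        L = frozenset(L)
        rules = []
        consequents = [frozenset({i}) for i in L]          # start with 1-item consequents
        while consequents:
            kept = []
            for Y in consequents:
                X = L - Y
                if not X:
                    continue
                conf = sigma(L) / sigma(X)
                if conf >= minconf:
                    rules.append((set(X), set(Y), conf))
                    kept.append(Y)
            # merge surviving consequents to form consequents that are one item larger
            consequents = list({a | b for a in kept for b in kept if len(a | b) == len(a) + 1})
        return rules

    for lhs, rhs, conf in rules_from_itemset({"Milk", "Diaper", "Beer"}, minconf=0.6):
        print(lhs, "->", rhs, round(conf, 2))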
Effect of Support Distribution

• Many real data sets have skewed support distribution

[Figure: support distribution of a retail data set — a few items have very high support, most items have very low support.]

• How to set the appropriate minsup threshold?
  – If minsup is set too high, we could miss itemsets involving interesting rare items (e.g., expensive products)
  – If minsup is set too low, it is computationally expensive and the number of itemsets is very large
• Using a single minimum support threshold may not be effective

Multiple Minimum Support

• How to apply multiple minimum supports?
  – MS(i): minimum support for item i
  – e.g.: MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%
  – MS({Milk, Broccoli}) = min(MS(Milk), MS(Broccoli)) = 0.1%
  – Challenge: support is no longer anti-monotone
    ◆ Suppose: Support(Milk, Coke) = 1.5% and Support(Milk, Coke, Broccoli) = 0.5%
    ◆ {Milk, Coke} is infrequent but {Milk, Coke, Broccoli} is frequent

[Figure: the itemset lattice over items A–E with per-item thresholds
  Item | MS(I) | Sup(I)
  A    | 0.10% | 0.25%
  B    | 0.20% | 0.26%
  C    | 0.30% | 0.29%
  D    | 0.50% | 0.05%
  E    | 3%    | 4.20%
showing which itemsets survive under the item-specific thresholds.]
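Under this scheme the threshold of an itemset is the minimum of its items' thresholds. A tiny Python sketch of my own of that check, reproducing the Milk/Coke/Broccoli example:

    MS = {"Milk": 0.05, "Coke": 0.03, "Broccoli": 0.001, "Salmon": 0.005}

    def min_support_threshold(itemset):
        """MS(X) = minimum of the per-item minimum supports of the items in X."""
        return min(MS[i] for i in itemset)

    def is_frequent(itemset, support):
        return support >= min_support_threshold(itemset)

    print(is_frequent({"Milk", "Coke"}, 0.015))                # False: 1.5% < 3%
    print(is_frequent({"Milk", "Coke", "Broccoli"}, 0.005))    # True: 0.5% >= 0.1%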
Multiple Minimum Support

[Figure: the same lattice with the items ordered by their minimum support, showing which itemsets are kept or pruned under the item-specific thresholds of the table above.]

Multiple Minimum Support (Liu 1999)

• Order the items according to their minimum support (in ascending order)
  – e.g.: MS(Milk) = 5%, MS(Coke) = 3%, MS(Broccoli) = 0.1%, MS(Salmon) = 0.5%
  – Ordering: Broccoli, Salmon, Coke, Milk
• Need to modify Apriori such that:
  – L1: set of frequent items
  – F1: set of items whose support is ≥ MS(1), where MS(1) is min_i(MS(i))
  – C2: candidate itemsets of size 2 are generated from F1 instead of L1
Multiple Minimum Support (Liu 1999)

• Modifications to Apriori:
  – In traditional Apriori,
    ◆ a candidate (k+1)-itemset is generated by merging two frequent itemsets of size k
    ◆ the candidate is pruned if it contains any infrequent subsets of size k
  – The pruning step has to be modified:
    ◆ Prune only if the subset contains the first item
    ◆ e.g.: Candidate = {Broccoli, Coke, Milk} (ordered according to minimum support)
    ◆ {Broccoli, Coke} and {Broccoli, Milk} are frequent but {Coke, Milk} is infrequent
      – The candidate is not pruned because {Coke, Milk} does not contain the first item, i.e., Broccoli

Pattern Evaluation

• Association rule algorithms tend to produce too many rules
  – many of them are uninteresting or redundant
  – Redundant if {A,B,C} → {D} and {A,B} → {D} have the same support & confidence
• Interestingness measures can be used to prune/rank the derived patterns
• In the original formulation of association rules, support & confidence are the only measures used
Application of Interestingness Measure

[Figure: the knowledge-discovery pipeline — Data → Selection (Selected Data) → Preprocessing (Preprocessed Data, shown as a product-by-feature table) → Mining (Patterns) → Postprocessing (Knowledge) — with interestingness measures applied to the mined patterns during postprocessing.]

Computing Interestingness Measure

• Given a rule X → Y, the information needed to compute rule interestingness can be obtained from a contingency table

Contingency table for X → Y:

            Y     ¬Y
   X       f11   f10    f1+
  ¬X       f01   f00    f0+
           f+1   f+0    |T|

  f11: support of X and Y
  f10: support of X and ¬Y
  f01: support of ¬X and Y
  f00: support of ¬X and ¬Y

• Used to define various measures
  – support, confidence, lift, Gini, J-measure, etc.
Drawback of Confidence

            Coffee   ¬Coffee
   Tea        15        5       20
  ¬Tea        75        5       80
              90       10      100

Association Rule: Tea → Coffee

Confidence = P(Coffee | Tea) = 0.75
but P(Coffee) = 0.9, and P(Coffee | ¬Tea) = 0.9375

⇒ Although confidence is high, the rule is misleading

Statistical Independence

• Population of 1000 students
  – 600 students know how to swim (S)
  – 700 students know how to bike (B)
  – 420 students know how to swim and bike (S,B)

  – P(S∧B) = 420/1000 = 0.42
  – P(S) × P(B) = 0.6 × 0.7 = 0.42

  – P(S∧B) = P(S) × P(B) ⇒ statistical independence
  – P(S∧B) > P(S) × P(B) ⇒ positively correlated
  – P(S∧B) < P(S) × P(B) ⇒ negatively correlated
Statistical-based Measures

• Measures that take into account statistical dependence:

    Lift = P(Y | X) / P(Y)

    Interest = P(X, Y) / ( P(X) P(Y) )

    PS = P(X, Y) − P(X) P(Y)

    φ-coefficient = ( P(X, Y) − P(X) P(Y) ) / sqrt( P(X) [1 − P(X)] P(Y) [1 − P(Y)] )

Example: Lift/Interest

            Coffee   ¬Coffee
   Tea        15        5       20
  ¬Tea        75        5       80
              90       10      100

Association Rule: Tea → Coffee

Confidence = P(Coffee | Tea) = 0.75, but P(Coffee) = 0.9

⇒ Lift = 0.75 / 0.9 = 0.8333  (< 1, therefore negatively associated)

Drawback of Lift & Interest

         Y    ¬Y                       Y    ¬Y
  X     10     0    10          X     90     0    90
 ¬X      0    90    90         ¬X      0    10    10
        10    90   100                90    10   100

  Lift = 0.1 / (0.1 × 0.1) = 10       Lift = 0.9 / (0.9 × 0.9) = 1.11

Statistical independence: if P(X,Y) = P(X) P(Y), then Lift = 1.

Comparing Different Measures

• There are lots of measures proposed in the literature
• Some measures are good for certain applications, but not for others
• What criteria should we use to determine whether a measure is good or bad?
• What about Apriori-style support-based pruning? How does it affect these measures?

10 examples of contingency tables:

  Example |  f11 |  f10 |  f01 |  f00
  E1      | 8123 |   83 |  424 | 1370
  E2      | 8330 |    2 |  622 | 1046
  E3      | 9481 |   94 |  127 |  298
  E4      | 3954 | 3080 |    5 | 2961
  E5      | 2886 | 1363 | 1320 | 4431
  E6      | 1500 | 2000 |  500 | 6000
  E7      | 4000 | 2000 | 1000 | 3000
  E8      | 4000 | 2000 | 2000 | 2000
  E9      | 1720 | 7121 |    5 | 1154
  E10     |   61 | 2483 |    4 | 7452

[Figure: rankings of the ten contingency tables under the various measures.]

Properties of A Good Measure

• Piatetsky-Shapiro: 3 properties a good measure M must satisfy:
  – M(A,B) = 0 if A and B are statistically independent
  – M(A,B) increases monotonically with P(A,B) when P(A) and P(B) remain unchanged
  – M(A,B) decreases monotonically with P(A) [or P(B)] when P(A,B) and P(B) [or P(A)] remain unchanged
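These measures are all simple functions of the 2×2 contingency counts. A small Python sketch of my own that computes lift, PS and the φ-coefficient from (f11, f10, f01, f00), checked against the Tea → Coffee table above:

    from math import sqrt

    def measures(f11, f10, f01, f00):
        """Lift/interest, Piatetsky-Shapiro (PS) and phi-coefficient for X -> Y."""
        n = f11 + f10 + f01 + f00
        p_x, p_y, p_xy = (f11 + f10) / n, (f11 + f01) / n, f11 / n
        lift = p_xy / (p_x * p_y)                      # equals P(Y|X) / P(Y)
        ps = p_xy - p_x * p_y
        phi = ps / sqrt(p_x * (1 - p_x) * p_y * (1 - p_y))
        return lift, ps, phi

    lift, ps, phi = measures(15, 5, 75, 5)             # the Tea -> Coffee table
    print(round(lift, 4), round(ps, 4), round(phi, 4)) # 0.8333  -0.03  -0.25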
Property under Variable Permutation

        B    ¬B                  A    ¬A
  A     p     q            B     p     r
 ¬A     r     s           ¬B     q     s

Does M(A,B) = M(B,A)?

• Symmetric measures: support, lift, collective strength, cosine, Jaccard, etc.
• Asymmetric measures: confidence, conviction, Laplace, J-measure, etc.

Property under Row/Column Scaling

Grade-Gender example (Mosteller, 1968):

          Male  Female                 Male  Female
  High      2      3      5    High      4     30     34
  Low       1      4      5    Low       2     40     42
            3      7     10              6     70     76

(the second table scales the Male column by 2× and the Female column by 10×)

Mosteller: the underlying association should be independent of the relative number of male and female students in the samples.
Property under Inversion Operation

[Figure: three 0/1 column vectors (a), (b), (c) over transactions 1…N for items A–F; the inversion operation flips every 0 to 1 and every 1 to 0 in a pair of vectors.]

Example: φ-Coefficient

• φ-coefficient is analogous to the correlation coefficient for continuous variables

         Y    ¬Y                       Y    ¬Y
  X     60    10    70          X     20    10    30
 ¬X     10    20    30         ¬X     10    60    70
        70    30   100                30    70   100

  φ = (0.6 − 0.7 × 0.7) / sqrt(0.7 × 0.3 × 0.7 × 0.3)     φ = (0.2 − 0.3 × 0.3) / sqrt(0.7 × 0.3 × 0.7 × 0.3)
    = 0.5238                                                = 0.5238

⇒ The φ coefficient is the same for both tables.

Property under Null Addition

        B    ¬B                  B    ¬B
  A     p     q            A     p     q
 ¬A     r     s           ¬A     r     s + k

• Invariant measures: support, cosine, Jaccard, etc.
• Non-invariant measures: correlation, Gini, mutual information, odds ratio, etc.
Different Measures have Different Properties

  Symbol | Measure             | Range
  Φ      | Correlation         | -1 … 0 … 1
  λ      | Lambda              | 0 … 1
  α      | Odds ratio          | 0 … 1 … ∞
  Q      | Yule's Q            | -1 … 0 … 1
  Y      | Yule's Y            | -1 … 0 … 1
  κ      | Cohen's             | -1 … 0 … 1
  M      | Mutual Information  | 0 … 1
  J      | J-Measure           | 0 … 1
  G      | Gini Index          | 0 … 1
  s      | Support             | 0 … 1
  c      | Confidence          | 0 … 1
  L      | Laplace             | 0 … 1
  V      | Conviction          | 0.5 … 1 … ∞
  I      | Interest            | 0 … 1 … ∞
  IS     | IS (cosine)         | 0 … 1
  PS     | Piatetsky-Shapiro's | -0.25 … 0 … 0.25
  F      | Certainty factor    | -1 … 0 … 1
  AV     | Added value         | 0.5 … 1 … 1
  S      | Collective strength | 0 … 1 … ∞
  ζ      | Jaccard             | 0 … 1
  K      | Klosgen's           | (range formula not recovered from the source)

[The original table also scores each measure Yes/No against properties P1–P3 (Piatetsky-Shapiro's properties) and O1–O4 (the operation-based properties discussed on the preceding slides: variable permutation, row/column scaling, inversion, null addition); e.g., support, cosine and Jaccard are invariant under null addition, while correlation, Gini, mutual information and odds ratio are not.]
Support-based Pruning

• Most of the association rule mining algorithms use the support measure to prune rules and itemsets
• Study the effect of support pruning on the correlation of itemsets:
  – Generate 10000 random contingency tables
  – Compute support and pairwise correlation for each table
  – Apply support-based pruning and examine the tables that are removed

Effect of Support-based Pruning

[Figure: histograms of the correlation values of the generated item pairs — for all item pairs, and for item pairs with support < 0.01, support < 0.03, and support < 0.05.]

⇒ Support-based pruning eliminates mostly negatively correlated itemsets

• Investigate how support-based pruning affects other measures
• Steps:
  – Generate 10000 contingency tables
  – Rank each table according to the different measures
  – Compute the pair-wise correlation between the measures
Effect of Support-based Pruning

Without support pruning (all pairs):

[Figure: the 21 × 21 matrix of pair-wise correlations between the rankings produced by the measures (conviction, odds ratio, collective strength, correlation, interest, confidence, Laplace, Jaccard, kappa, Klosgen, reliability, Yule Q, Yule Y, CF, PS, IS, support, lambda, Gini, J-measure, mutual information), together with a scatter plot between the correlation and Jaccard measures.]

• Red cells indicate correlation between the pair of measures > 0.85
• 40.14% of the pairs have correlation > 0.85

0.5% ≤ support ≤ 50%:

[Figure: the same correlation matrix and correlation-vs-Jaccard scatter plot, restricted to tables with 0.005 ≤ support ≤ 0.500.]

• 61.45% of the pairs have correlation > 0.85
Effect of Support-based Pruning

0.5% ≤ support ≤ 30%:

[Figure: the correlation matrix and correlation-vs-Jaccard scatter plot, restricted to tables with 0.005 ≤ support ≤ 0.300.]

• 76.42% of the pairs have correlation > 0.85

Subjective Interestingness Measure

• Objective measure:
  – Rank patterns based on statistics computed from data
  – e.g., the 21 measures of association (support, confidence, Laplace, Gini, mutual information, Jaccard, etc.)
• Subjective measure:
  – Rank patterns according to the user's interpretation
    ◆ A pattern is subjectively interesting if it contradicts the expectation of a user (Silberschatz & Tuzhilin)
    ◆ A pattern is subjectively interesting if it is actionable (Silberschatz & Tuzhilin)
Interestingness via Unexpectedness

• Need to model the expectation of users (domain knowledge)

[Figure: the pattern lattice annotated with patterns expected to be frequent, patterns expected to be infrequent, patterns found to be frequent, and patterns found to be infrequent; the expected patterns are those where expectation and evidence agree (+/+ or −/−), the unexpected patterns are those where they disagree.]

• Need to combine the expectation of users with evidence from the data (i.e., the extracted patterns)

Interestingness via Unexpectedness

• Web Data (Cooley et al 2001)
  – Domain knowledge in the form of site structure
  – Given an itemset F = {X1, X2, …, Xk} (Xi: Web pages)
    ◆ L: number of links connecting the pages
    ◆ lfactor = L / (k × (k − 1))
    ◆ cfactor = 1 (if the graph is connected), 0 (disconnected graph)
  – Structure evidence = cfactor × lfactor
  – Usage evidence = P(X1 ∩ X2 ∩ … ∩ Xk) / P(X1 ∪ X2 ∪ … ∪ Xk)
  – Use Dempster-Shafer theory to combine domain knowledge and evidence from data
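A hedged sketch of how those two evidence scores could be computed (my own illustration, not the cited system): it assumes the site structure is given as a set of undirected links over the pages in F and that the page-view probabilities are estimated from browsing sessions. All names and the example data are hypothetical.

    def structure_evidence(pages, links):
        """cfactor * lfactor for an itemset of k web pages.
        links: set of frozenset({p, q}) edges; assumes k > 1."""
        k = len(pages)
        L = sum(1 for e in links if e <= set(pages))     # links among the k pages
        lfactor = L / (k * (k - 1))
        # cfactor: 1 if the subgraph induced by the pages is connected, else 0
        seen, frontier = {pages[0]}, [pages[0]]
        while frontier:
            p = frontier.pop()
            for q in pages:
                if q not in seen and frozenset({p, q}) in links:
                    seen.add(q)
                    frontier.append(q)
        cfactor = 1 if len(seen) == k else 0
        return cfactor * lfactor

    def usage_evidence(sessions, pages):
        """P(X1 ∩ ... ∩ Xk) / P(X1 ∪ ... ∪ Xk), estimated from sessions."""
        inter = sum(1 for s in sessions if set(pages) <= s)
        union = sum(1 for s in sessions if set(pages) & s)
        return inter / union if union else 0.0

    sessions = [{"home", "products", "cart"}, {"home", "about"}, {"products", "cart"}]
    pages = ["products", "cart"]
    links = {frozenset({"products", "cart"})}
    print(structure_evidence(pages, links), usage_evidence(sessions, pages))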