Data Mining
Association Analysis: Basic Concepts and Algorithms

Lecture Notes for Chapter 6
Introduction to Data Mining
by Tan, Steinbach, Kumar

Association Rule Mining

Given a set of transactions, find rules that will predict the occurrence of an item based on the occurrences of other items in the transaction.

Market-Basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of Association Rules:

{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}

Implication means co-occurrence, not causality!
Definition: Frequent Itemset

Itemset
– A collection of one or more items
– Example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items

Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2 in the transactions above

Support (s)
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold

Definition: Association Rule

Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
– Support (s): the fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X

Example: for {Milk, Diaper} → {Beer},

s = σ({Milk, Diaper, Beer}) / |T| = 2/5 = 0.4
c = σ({Milk, Diaper, Beer}) / σ({Milk, Diaper}) = 2/3 = 0.67
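To make the two metrics concrete, here is a minimal Python sketch (the transaction encoding and the helper name support_count are mine, not from the slides) that reproduces the numbers above:

    # Support and confidence of {Milk, Diaper} -> {Beer} over the
    # five market-basket transactions above.
    transactions = [
        {"Bread", "Milk"},
        {"Bread", "Diaper", "Beer", "Eggs"},
        {"Milk", "Diaper", "Beer", "Coke"},
        {"Bread", "Milk", "Diaper", "Beer"},
        {"Bread", "Milk", "Diaper", "Coke"},
    ]

    def support_count(itemset, transactions):
        # sigma(itemset): number of transactions containing the itemset
        return sum(1 for t in transactions if itemset <= t)

    X, Y = {"Milk", "Diaper"}, {"Beer"}
    s = support_count(X | Y, transactions) / len(transactions)               # 2/5
    c = support_count(X | Y, transactions) / support_count(X, transactions)  # 2/3
    print(round(s, 2), round(c, 2))                                          # 0.4 0.67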
Association Rule Mining Task

Given a set of transactions T, the goal of association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!

Mining Association Rules

Example of rules from the transactions above:

{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Observations:
– All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
– Rules originating from the same itemset have identical support but can have different confidence
– Thus, we may decouple the support and confidence requirements
Mining Association Rules

Two-step approach:
1. Frequent Itemset Generation
   – Generate all itemsets whose support ≥ minsup
2. Rule Generation
   – Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive.

Frequent Itemset Generation

[Figure: the itemset lattice over five items, from the null itemset through the 1-itemsets A, B, C, D, E, the 2-itemsets AB through DE, and so on up to ABCDE.]

Given d items, there are 2^d possible candidate itemsets.
Frequent Itemset Generation

Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the database: match each transaction against every candidate
– Complexity ~ O(NMw), where N is the number of transactions, M the number of candidates, and w the maximum transaction width ⇒ expensive, since M = 2^d!

[Figure: the five market-basket transactions matched against the list of candidate itemsets.]

Computational Complexity

Given d unique items:
– Total number of itemsets = 2^d
– Total number of possible association rules:

R = ∑_{k=1}^{d−1} [ C(d, k) × ∑_{j=1}^{d−k} C(d−k, j) ] = 3^d − 2^(d+1) + 1

If d = 6, R = 602 rules.
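As a quick check on the formula, a short Python sketch (the function name num_rules is mine) that counts the rules both ways for d = 6:

    from math import comb

    def num_rules(d):
        # Brute-force count of possible association rules over d items:
        # choose k antecedent items, then j consequent items from the rest
        return sum(comb(d, k) * sum(comb(d - k, j) for j in range(1, d - k + 1))
                   for k in range(1, d))

    print(num_rules(6))         # 602
    print(3**6 - 2**7 + 1)      # 602, from the closed form with d = 6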
Frequent Itemset Generation Strategies

Reduce the number of candidates (M)
– Complete search: M = 2^d
– Use pruning techniques to reduce M

Reduce the number of comparisons (NM)
– Use efficient data structures to store the candidates or transactions
– No need to match every candidate against every transaction

Reducing Number of Candidates

Apriori principle:
– If an itemset is frequent, then all of its subsets must also be frequent

The Apriori principle holds due to the following property of the support measure:

∀X, Y : (X ⊆ Y) ⇒ s(X) ≥ s(Y)

– Support of an itemset never exceeds the support of its subsets
– This is known as the anti-monotone property of support
Illustrating Apriori Principle

[Figure: the itemset lattice; once an itemset is found to be infrequent, all of its supersets can be pruned.]

Illustrating Apriori Principle

Minimum Support = 3

Items (1-itemsets):

Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets), with no need to generate candidates involving Coke or Eggs:

Itemset          Count
{Bread,Milk}     3
{Bread,Beer}     2
{Bread,Diaper}   3
{Milk,Beer}      2
{Milk,Diaper}    3
{Beer,Diaper}    3

Triplets (3-itemsets):

Itemset               Count
{Bread,Milk,Diaper}   2

If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 6 + 15 + 20 = 41 itemsets.
With support-based pruning: 6 + 6 + 1 = 13.
Apriori Algorithm

Method:
– Let k = 1
– Generate frequent itemsets of length 1
– Repeat until no new frequent itemsets are identified:
  – Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  – Prune candidate itemsets containing subsets of length k that are infrequent
  – Count the support of each candidate by scanning the DB
  – Eliminate candidates that are infrequent, leaving only those that are frequent
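The method above translates almost line for line into code. A minimal Python sketch (assuming transactions are given as sets of hashable item labels; this is an illustration, not a tuned implementation):

    from itertools import combinations

    def apriori(transactions, minsup):
        # Returns {itemset: support count} for every frequent itemset.
        items = sorted({i for t in transactions for i in t})
        candidates = [frozenset([i]) for i in items]   # length-1 candidates
        frequent, k = {}, 1
        while candidates:
            # Count the support of each candidate by scanning the DB
            counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
            level = {c: n for c, n in counts.items() if n >= minsup}
            frequent.update(level)
            # Generate length-(k+1) candidates from length-k frequent itemsets,
            # pruning any candidate with an infrequent length-k subset
            next_candidates = set()
            for a, b in combinations(level, 2):
                union = a | b
                if len(union) == k + 1 and all(
                        frozenset(s) in level for s in combinations(union, k)):
                    next_candidates.add(union)
            candidates = sorted(next_candidates, key=sorted)
            k += 1
        return frequent

On the nine-transaction example on the next slide, apriori([set("ABE"), set("BD"), ...], minsup=2) reproduces the level-1 through level-3 tables and stops with no level-4 candidates.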
Apriori Algorithm: Example

Minsup = 2

TID  Items
1    ABE
2    BD
3    BC
4    ABD
5    AC
6    BC
7    AC
8    ABCE
9    ABC

Apriori Algorithm: Level 1

Candidate  Support  Frequent?
A          6        Y
B          7        Y
C          6        Y
D          2        Y
E          2        Y

Apriori Algorithm: Level 2

Candidate  Support  Frequent?
AB         4        Y
AC         4        Y
AD         1        N
AE         2        Y
BC         4        Y
BD         2        Y
BE         2        Y
CD         0        N
CE         1        N
DE         0        N
Apriori Algorithm: Level 3

Candidate  Support  Frequent?
ABC        2        Y
ABE        2        Y

No level 4 candidates: the only possible join, ABC ∪ ABE = ABCE, contains the infrequent subset ACE and is pruned.

Reducing Number of Comparisons

Candidate counting:
– Scan the database of transactions to determine the support of each candidate itemset
– To reduce the number of comparisons, store the candidates in a hash structure
– Instead of matching each transaction against every candidate, match it against candidates contained in the hashed buckets

[Figure: the five market-basket transactions on one side and the hash structure of candidates on the other; each transaction is hashed into the buckets it can match.]
Generate Hash Tree

Suppose you have 15 candidate itemsets of length 3:

{1 4 5}, {1 2 4}, {4 5 7}, {1 2 5}, {4 5 8}, {1 5 9}, {1 3 6}, {2 3 4}, {5 6 7}, {3 4 5}, {3 5 6}, {3 5 7}, {6 8 9}, {3 6 7}, {3 6 8}

You need:
– A hash function: here, items 1, 4, 7 hash to the first branch, 2, 5, 8 to the second, and 3, 6, 9 to the third
– A max leaf size: the maximum number of itemsets stored in a leaf node (if the number of candidate itemsets exceeds the max leaf size, split the node)

[Figure: the resulting candidate hash tree; interior nodes branch on the hash of successive items of an itemset, and leaves store the candidate itemsets.]
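A simplified Python sketch of such a candidate hash tree (the class layout and the exact split rule are my illustration of the idea, assuming the slide's hash function and a max leaf size of 3):

    class Node:
        def __init__(self):
            self.children = {}   # bucket -> Node; non-empty only for interior nodes
            self.itemsets = []   # candidates stored at a leaf

    def bucket(item):
        # Slide's hash function: 1,4,7 -> 0; 2,5,8 -> 1; 3,6,9 -> 2
        return (item - 1) % 3

    def insert(node, itemset, depth=0, max_leaf=3):
        if node.children:                              # interior node: descend
            child = node.children.setdefault(bucket(itemset[depth]), Node())
            insert(child, itemset, depth + 1, max_leaf)
            return
        node.itemsets.append(itemset)
        if len(node.itemsets) > max_leaf and depth < len(itemset):
            stored, node.itemsets = node.itemsets, []  # split: re-hash on next item
            for s in stored:
                child = node.children.setdefault(bucket(s[depth]), Node())
                insert(child, s, depth + 1, max_leaf)

    candidates = [(1, 4, 5), (1, 2, 4), (4, 5, 7), (1, 2, 5), (4, 5, 8),
                  (1, 5, 9), (1, 3, 6), (2, 3, 4), (5, 6, 7), (3, 4, 5),
                  (3, 5, 6), (3, 5, 7), (6, 8, 9), (3, 6, 7), (3, 6, 8)]
    root = Node()
    for c in candidates:
        insert(root, c)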
Association Rule Discovery: Hash tree

[Figures: locating candidates in the hash tree. At the root, hash on the first item of the itemset (branch on 1, 4 or 7); at the next level, hash on the second item (2, 5 or 8); then on the third (3, 6 or 9) until a leaf with its stored candidate itemsets is reached.]
Subset Operation

Given a transaction t, what are the possible subsets of size 3?

[Figure: systematic enumeration of the 3-subsets of the transaction {1 2 3 5 6}: first fix the smallest item (1 + {2 3 5 6}, 2 + {3 5 6}, 3 + {5 6}), then the second item (1 2 + {3 5 6}, 1 3 + {5 6}, 1 5 + {6}, ...), and so on.]
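The subsets themselves are easy to enumerate; in Python, the standard library does it in the same fixed-prefix order the figure sketches:

    from itertools import combinations

    t = (1, 2, 3, 5, 6)
    subsets = list(combinations(t, 3))   # fix the first item, then the second, ...
    print(len(subsets))                  # 10 = C(5, 3)
    print(subsets[:3])                   # [(1, 2, 3), (1, 2, 5), (1, 2, 6)]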
Subset Operation Using Hash Tree

[Figure: the transaction {1 2 3 5 6} is recursively hashed down the candidate hash tree; at each level, only the branches reachable by the remaining items of the transaction are followed, so only a subset of the leaves is visited.]

Match the transaction against 11 out of 15 candidates.

Factors Affecting Complexity

Choice of minimum support threshold
– Lowering the support threshold results in more frequent itemsets
– This may increase the number of candidates and the max length of frequent itemsets

Dimensionality (number of items) of the data set
– More space is needed to store the support count of each item
– If the number of frequent items also increases, both computation and I/O costs may also increase

Size of database
– Since Apriori makes multiple passes, the run time of the algorithm may increase with the number of transactions

Average transaction width
– Transaction width increases with denser data sets
– This may increase the max length of frequent itemsets and the number of hash tree traversals (the number of subsets in a transaction increases with its width)

Rule Generation

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L − f satisfies the minimum confidence requirement.

– If {A,B,C,D} is a frequent itemset, the candidate rules are:
  ABC→D, ABD→C, ACD→B, BCD→A,
  A→BCD, B→ACD, C→ABD, D→ABC,
  AB→CD, AC→BD, AD→BC, BC→AD, BD→AC, CD→AB

– If |L| = k, then there are 2^k − 2 candidate association rules (ignoring L → ∅ and ∅ → L)
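Enumerating the 2^k − 2 candidate rules is a direct loop over antecedent sizes; a small Python sketch (the name candidate_rules is mine):

    from itertools import combinations

    def candidate_rules(L):
        # All rules f -> L - f with non-empty antecedent and consequent
        L = frozenset(L)
        return [(frozenset(f), L - frozenset(f))
                for r in range(1, len(L))
                for f in combinations(sorted(L), r)]

    print(len(candidate_rules("ABCD")))   # 14 = 2**4 - 2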
Rule Generation

How to efficiently generate rules from frequent itemsets?
– In general, confidence does not have an anti-monotone property: c(ABC→D) can be larger or smaller than c(AB→D)
– But the confidence of rules generated from the same itemset does have an anti-monotone property
– E.g., for L = {A,B,C,D}: c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
– Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule

Rule Generation for Apriori Algorithm

[Figure: the lattice of rules for a frequent itemset; once a low-confidence rule is found, all rules below it in the lattice (those with larger consequents) are pruned.]
Rule Generation for Apriori Algorithm

– A candidate rule is generated by merging two rules that share the same prefix in the rule consequent
– join(CD=>AB, BD=>AC) would produce the candidate rule D=>ABC
– Prune rule D=>ABC if its subset AD=>BC does not have high confidence
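Putting the join and the confidence-based prune together, a Python sketch of level-wise rule generation for a single frequent itemset (it assumes support maps every subset of L to its support count; the names are mine):

    from itertools import combinations

    def generate_rules(L, support, minconf):
        L = frozenset(L)
        rules = []
        level = [frozenset([i]) for i in L]        # consequents of size 1
        while level and len(level[0]) < len(L):
            survivors = []
            for Y in level:
                conf = support[L] / support[L - Y]
                if conf >= minconf:                # keep rule (L - Y) -> Y
                    rules.append((L - Y, Y, conf))
                    survivors.append(Y)
            # Join surviving consequents, e.g. AB (from CD=>AB) and AC (from
            # BD=>AC) join to ABC, i.e. candidate rule D=>ABC; the candidate is
            # kept only if every size-k sub-consequent survived, so D=>ABC is
            # dropped when AD=>BC lacked high confidence.
            k, surv = len(level[0]), set(survivors)
            level = sorted({a | b for a, b in combinations(survivors, 2)
                            if len(a | b) == k + 1
                            and all(frozenset(c) in surv
                                    for c in combinations(sorted(a | b), k))},
                           key=sorted)
        return rules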
Compact Representation of Frequent Itemsets

Some itemsets are redundant because they have identical support as their supersets.

Example: a data set of 15 transactions over 30 items, where TIDs 1–5 contain exactly {A1, …, A10}, TIDs 6–10 contain exactly {B1, …, B10}, and TIDs 11–15 contain exactly {C1, …, C10}.

Every non-empty subset of {A1, …, A10} has support count 5, and likewise for the B and C blocks, so the number of frequent itemsets is

3 × ∑_{k=1}^{10} C(10, k) = 3 × (2^10 − 1) = 3069

We need a compact representation.
Maximal Frequent Itemset

An itemset is maximal frequent if none of its immediate supersets is frequent.

[Figure: the itemset lattice with the border separating frequent from infrequent itemsets; the maximal frequent itemsets lie just inside the border, with the infrequent itemsets beyond it.]

Closed Itemset

An itemset is closed if none of its (immediate) supersets has the same support as the itemset.

Example:
TID  Items
1    {A,B}
2    {B,C,D}
3    {A,B,C,D}
4    {A,B,D}
5    {A,B,C,D}

Itemset  Support      Itemset     Support
{A}      4            {A,B,C}     2
{B}      5            {A,B,D}     3
{C}      3            {A,C,D}     2
{D}      4            {B,C,D}     3
{A,B}    4            {A,B,C,D}   2
{A,C}    2
{A,D}    3
{B,C}    3
{B,D}    4
{C,D}    3
Maximal vs Closed Itemsets

Minimum support = 2

TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE

[Figure: the itemset lattice annotated with the IDs of the transactions that support each itemset; itemsets not supported by any transactions are marked.]

Maximal vs Closed Frequent Itemsets

[Figure: the same lattice with the frequent itemsets highlighted, distinguishing itemsets that are closed but not maximal from those that are both closed and maximal.]

# Closed = 9
# Maximal = 4
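The two counts can be checked directly; a small Python sketch over the five transactions above (minsup = 2):

    from itertools import combinations

    transactions = [set("ABC"), set("ABCD"), set("BCE"), set("ACDE"), set("DE")]
    minsup = 2

    support = {}                       # support count of every occurring itemset
    for t in transactions:
        for r in range(1, len(t) + 1):
            for s in combinations(sorted(t), r):
                support[frozenset(s)] = support.get(frozenset(s), 0) + 1

    frequent = {s for s, n in support.items() if n >= minsup}
    # Maximal: no frequent proper superset; closed: no proper superset
    # with the same support
    maximal = {s for s in frequent if not any(s < f for f in frequent)}
    closed = {s for s in frequent
              if not any(s < u and support[u] == support[s] for u in support)}
    print(len(closed), len(maximal))   # 9 4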