Data Mining:
Association Analysis
This lecture note is adapted from the Lecture Notes for Chapters 6 and 7
of Introduction to Data Mining by Tan, Steinbach, and Kumar, and from
Jiawei Han's slides for the book Data Mining: Concepts and
Techniques by Jiawei Han and Micheline Kamber.
© Tan,Steinbach, Kumar
Introduction to Data Mining
4/18/2004
Association Rule Mining

Given a set of transactions, find rules that will predict the
occurrence of an item based on the occurrences of other
items in the transaction.
Market-basket transactions:

TID  Items
1    Bread, Milk
2    Bread, Diaper, Beer, Eggs
3    Milk, Diaper, Beer, Coke
4    Bread, Milk, Diaper, Beer
5    Bread, Milk, Diaper, Coke

Example of association rules:
{Diaper} → {Beer}
{Milk, Bread} → {Eggs, Coke}
{Beer, Bread} → {Milk}
Definition: Frequent Itemset

Itemset
– A collection of one or more items
– Example: {Milk, Bread, Diaper}
– k-itemset: an itemset that contains k items

Support count (σ)
– Frequency of occurrence of an itemset
– E.g. σ({Milk, Bread, Diaper}) = 2 in the market-basket transactions

Support (s)
– Fraction of transactions that contain an itemset
– E.g. s({Milk, Bread, Diaper}) = 2/5

Frequent Itemset
– An itemset whose support is greater than or equal to a minsup threshold
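The definitions of support count, support, and frequent itemset can be checked with a few lines of Python. This is a minimal sketch: the function names `support_count`, `support`, and `is_frequent` are illustrative choices, and the data is the five-transaction market-basket example.

```python
# Market-basket transactions (TIDs 1 to 5), one set of items per transaction.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def support_count(itemset, transactions):
    """Support count (sigma): number of transactions containing every item."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t)


def support(itemset, transactions):
    """Support (s): fraction of transactions that contain the itemset."""
    return support_count(itemset, transactions) / len(transactions)


def is_frequent(itemset, transactions, minsup):
    """An itemset is frequent if its support is at least minsup."""
    return support(itemset, transactions) >= minsup


print(support_count({"Milk", "Bread", "Diaper"}, transactions))  # 2
print(support({"Milk", "Bread", "Diaper"}, transactions))        # 0.4
```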
Definition: Association Rule

Association Rule
– An implication expression of the form X → Y, where X and Y are itemsets
– Example: {Milk, Diaper} → {Beer}

Rule Evaluation Metrics
– Support (s): fraction of transactions that contain both X and Y
– Confidence (c): measures how often items in Y appear in transactions that contain X

Example: {Milk, Diaper} → {Beer} in the market-basket transactions

$s = \frac{\sigma(\text{Milk, Diaper, Beer})}{|T|} = \frac{2}{5} = 0.4$

$c = \frac{\sigma(\text{Milk, Diaper, Beer})}{\sigma(\text{Milk, Diaper})} = \frac{2}{3} \approx 0.67$
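The two rule metrics can be computed directly from support counts. The sketch below is illustrative only: `rule_metrics` is a hypothetical helper name, and it assumes the antecedent X occurs in at least one transaction.

```python
# Market-basket transactions (TIDs 1 to 5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def rule_metrics(antecedent, consequent, transactions):
    """Return (support, confidence) of the rule antecedent -> consequent."""
    x = set(antecedent)
    xy = x | set(consequent)
    n_x = sum(1 for t in transactions if x <= t)
    n_xy = sum(1 for t in transactions if xy <= t)
    s = n_xy / len(transactions)   # fraction containing both X and Y
    c = n_xy / n_x                 # assumes X occurs at least once
    return s, c


s, c = rule_metrics({"Milk", "Diaper"}, {"Beer"}, transactions)
print(round(s, 2), round(c, 2))  # 0.4 0.67
```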
Association Rule Mining Task

Given a set of transactions T, the goal of
association rule mining is to find all rules having
– support ≥ minsup threshold
– confidence ≥ minconf threshold

Brute-force approach:
– List all possible association rules
– Compute the support and confidence for each rule
– Prune rules that fail the minsup and minconf thresholds
⇒ Computationally prohibitive!
Mining Association Rules
Example of rules (from the market-basket transactions):
{Milk, Diaper} → {Beer} (s=0.4, c=0.67)
{Milk, Beer} → {Diaper} (s=0.4, c=1.0)
{Diaper, Beer} → {Milk} (s=0.4, c=0.67)
{Beer} → {Milk, Diaper} (s=0.4, c=0.67)
{Diaper} → {Milk, Beer} (s=0.4, c=0.5)
{Milk} → {Diaper, Beer} (s=0.4, c=0.5)

Observations:
• All the above rules are binary partitions of the same itemset: {Milk, Diaper, Beer}
• Rules originating from the same itemset have identical support but can have different confidence
• Thus, we may decouple the support and confidence requirements
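The observation that all binary partitions of one itemset share the same support can be reproduced with a short sketch. The helper name `rules_from_itemset` is an assumption made for illustration, and the code assumes the itemset occurs in at least one transaction.

```python
from itertools import combinations

# Market-basket transactions (TIDs 1 to 5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def rules_from_itemset(itemset, transactions):
    """List (lhs, rhs, support, confidence) for every binary partition."""
    items = sorted(itemset)
    n = len(transactions)
    n_full = sum(1 for t in transactions if set(items) <= t)
    rules = []
    for r in range(1, len(items)):          # size of the left-hand side
        for lhs in combinations(items, r):
            rhs = tuple(i for i in items if i not in lhs)
            n_lhs = sum(1 for t in transactions if set(lhs) <= t)
            # Support depends only on the whole itemset; confidence on lhs.
            rules.append((lhs, rhs, n_full / n, n_full / n_lhs))
    return rules


for lhs, rhs, s, c in rules_from_itemset({"Milk", "Diaper", "Beer"}, transactions):
    print(lhs, "->", rhs, f"s={s:.1f}", f"c={c:.2f}")
```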
Mining Association Rules

Two-step approach:
1. Frequent Itemset Generation
– Generate all itemsets whose support ≥ minsup
2. Rule Generation
– Generate high-confidence rules from each frequent itemset, where each rule is a binary partitioning of a frequent itemset

Frequent itemset generation is still computationally expensive
Frequent Itemset Generation
[Figure: itemset lattice over the items A, B, C, D, E, from the null set at the top, through the 1-itemsets (A to E), 2-itemsets (AB to DE), 3-itemsets (ABC to CDE), and 4-itemsets (ABCD to BCDE), down to ABCDE]

Given d items, there are $2^d$ possible candidate itemsets
Frequent Itemset Generation

Brute-force approach:
– Each itemset in the lattice is a candidate frequent itemset
– Count the support of each candidate by scanning the
database
[Figure: the N market-basket transactions (TIDs 1 to 5), each of width at most w, matched against a list of M candidates]
– Match each transaction against every candidate
– Complexity ~ O(NMw) => Expensive since $M = 2^d$ !!!
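A minimal sketch of the brute-force approach, assuming a support-count threshold and the market-basket data: every non-empty subset of the item universe is treated as a candidate and matched against every transaction. The function name `brute_force_frequent` is illustrative. The empty set is skipped, so the code enumerates $2^d - 1$ candidates.

```python
from itertools import combinations

# Market-basket transactions (TIDs 1 to 5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def brute_force_frequent(transactions, minsup_count):
    """Count the support of every non-empty itemset in the lattice."""
    items = sorted(set().union(*transactions))
    frequent = {}
    n_candidates = 0
    for k in range(1, len(items) + 1):
        for candidate in combinations(items, k):
            n_candidates += 1
            # One pass over the database per candidate: O(N * M * w) overall.
            count = sum(1 for t in transactions if set(candidate) <= t)
            if count >= minsup_count:
                frequent[frozenset(candidate)] = count
    return frequent, n_candidates


frequent, n_candidates = brute_force_frequent(transactions, minsup_count=3)
print(n_candidates, len(frequent))  # 63 8
```

With only d = 6 items the 63 candidates are manageable, but the count doubles with every additional item.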
Reducing Number of Candidates

Apriori principle:
– If an itemset is frequent, then all of its subsets must also
be frequent

Apriori principle holds due to the following property
of the support measure:
$\forall X, Y : (X \subseteq Y) \Rightarrow s(X) \geq s(Y)$
– Support of an itemset never exceeds the support of its
subsets
– This is known as the anti-monotone property of support
Illustrating Apriori Principle
[Figure: itemset lattice over the items A, B, C, D, E. Once an itemset is found to be infrequent, all of its supersets are pruned.]
Illustrating Apriori Principle
Minimum Support = 3

Items (1-itemsets):

Item    Count
Bread   4
Coke    2
Milk    4
Beer    3
Diaper  4
Eggs    1

Pairs (2-itemsets) (no need to generate candidates involving Coke or Eggs):

Itemset          Count
{Bread,Milk}     3
{Bread,Beer}     2
{Bread,Diaper}   3
{Milk,Beer}      2
{Milk,Diaper}    3
{Beer,Diaper}    3

Triplets (3-itemsets):

Itemset               Count
{Bread,Milk,Diaper}   2

If every subset is considered,
$\binom{6}{1} + \binom{6}{2} + \binom{6}{3} = 41$
With support-based pruning,
6 + 6 + 1 = 13
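The two candidate totals can be recomputed from the market-basket transactions. This is a sketch under the slide's setting (minimum support count 3, itemsets up to size 3); the variable names are illustrative.

```python
from itertools import combinations
from math import comb

# Market-basket transactions (TIDs 1 to 5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
MINSUP_COUNT = 3


def count(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)


items = sorted(set().union(*transactions))

# Without pruning: every 1-, 2-, and 3-itemset is a candidate.
unpruned = sum(comb(len(items), k) for k in (1, 2, 3))

# With support-based pruning: only frequent items form pairs, and a triplet
# is a candidate only if all three of its pairs are frequent.
frequent_items = [i for i in items if count({i}) >= MINSUP_COUNT]
pairs = list(combinations(frequent_items, 2))
frequent_pairs = {frozenset(p) for p in pairs if count(p) >= MINSUP_COUNT}
triplets = [
    t for t in combinations(frequent_items, 3)
    if all(frozenset(p) in frequent_pairs for p in combinations(t, 2))
]
pruned = len(items) + len(pairs) + len(triplets)

print(unpruned, pruned)  # 41 13
```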
Apriori Algorithm



Let k=1
Generate frequent itemsets of length 1
Repeat until no new frequent itemsets are identified
– Generate length (k+1) candidate itemsets from length k frequent itemsets
  • Let two k-itemsets be (X, Y) and (X, Z), where X is a set of (k-1) items, and Y and Z are single items.
  • The new (k+1)-itemset will be (X, Y, Z).
– Prune candidate itemsets containing subsets of length k that are infrequent
– Count the support of each candidate by scanning the DB
– Eliminate candidates that are infrequent, leaving only those that are frequent
The Apriori Algorithm — Example
Minimum support count is 2 (implied by the example: {4}, with support 1, is the only item dropped between C1 and L1).

Database D:

TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D → C1:

itemset  sup.
{1}      2
{2}      3
{3}      3
{4}      1
{5}      3

L1:

itemset  sup.
{1}      2
{2}      3
{3}      3
{5}      3

C2 (candidates generated from L1):
{1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}

Scan D → C2 with support counts:

itemset  sup
{1 2}    1
{1 3}    2
{1 5}    1
{2 3}    2
{2 5}    3
{3 5}    2

L2:

itemset  sup
{1 3}    2
{2 3}    2
{2 5}    3
{3 5}    2

C3 (candidates generated from L2):
{2 3 5}

Scan D → L3:

itemset  sup
{2 3 5}  2
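The level-wise procedure can be sketched in Python and run on database D. This is a compact illustration, not an optimized implementation: the function name `apriori` and the use of a support-count threshold of 2 are assumptions made for the example, and candidates are counted with a plain scan rather than a hash tree.

```python
from itertools import combinations


def apriori(transactions, minsup_count):
    """Return {itemset: support count} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]

    def count_and_filter(candidates):
        # One scan of the database per level.
        counts = {c: 0 for c in candidates}
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= minsup_count}

    level = count_and_filter({frozenset([i]) for t in transactions for i in t})
    frequent = dict(level)
    k = 1
    while level:
        # Join: two frequent k-itemsets sharing k-1 items give a (k+1)-itemset.
        candidates = {a | b for a in level for b in level if len(a | b) == k + 1}
        # Prune: every k-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in level for s in combinations(c, k))
        }
        level = count_and_filter(candidates)
        frequent.update(level)
        k += 1
    return frequent


D = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]
result = apriori(D, minsup_count=2)
for itemset, sup in sorted(result.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(itemset), sup)
```

The output lists the four frequent 1-itemsets, the four frequent 2-itemsets, and {2, 3, 5} with support 2, matching L1, L2, and L3.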
Maximal Frequent Itemset
An itemset is maximal frequent if none of its immediate supersets
is frequent
[Figure: itemset lattice over the items A, B, C, D, E, marking the maximal itemsets, the infrequent itemsets, and the border between frequent and infrequent itemsets]
Closed Itemset

An itemset is closed if none of its immediate supersets
has the same support as the itemset.
TID  Items
1    {A,B}
2    {B,C,D}
3    {A,B,C,D}
4    {A,B,D}
5    {A,B,C,D}

Itemset    Support
{A}        4
{B}        5
{C}        3
{D}        4
{A,B}      4
{A,C}      2
{A,D}      3
{B,C}      3
{B,D}      4
{C,D}      3
{A,B,C}    2
{A,B,D}    3
{A,C,D}    2
{B,C,D}    3
{A,B,C,D}  2
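The closed itemsets of this five-transaction example can be found by applying the definition directly. A minimal sketch, with `count` and `is_closed` as illustrative helper names:

```python
from itertools import combinations

transactions = [
    {"A", "B"},
    {"B", "C", "D"},
    {"A", "B", "C", "D"},
    {"A", "B", "D"},
    {"A", "B", "C", "D"},
]
items = sorted(set().union(*transactions))


def count(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)


def is_closed(itemset):
    """Closed: no immediate superset has the same support count."""
    itemset = frozenset(itemset)
    sup = count(itemset)
    return all(count(itemset | {i}) < sup for i in items if i not in itemset)


# Consider every itemset that occurs in at least one transaction.
supported = [
    frozenset(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
    if count(c) > 0
]
closed = sorted("".join(sorted(s)) for s in supported if is_closed(s))
print(closed)
```

For example, {A} is not closed because {A,B} has the same support (4), whereas {B} is closed because every superset has support below 5.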
Maximal vs Closed Itemsets
TID  Items
1    ABC
2    ABCD
3    BCE
4    ACDE
5    DE

[Figure: itemset lattice over the items A, B, C, D, E, with each node annotated by the IDs of the transactions that contain it (for example A: 1, 2, 4; B: 1, 2, 3; C: 1, 2, 3, 4; D: 2, 4, 5; E: 3, 4, 5). Some itemsets, such as ABCDE, are not supported by any transactions.]
Maximal vs Closed Frequent Itemsets
Minimum support = 2
[Figure: the same lattice annotated with transaction IDs, for the transactions 1: ABC, 2: ABCD, 3: BCE, 4: ACDE, 5: DE, highlighting itemsets that are closed but not maximal and itemsets that are both closed and maximal]
# Closed = 9
# Maximal = 4
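The counts of closed and maximal frequent itemsets can be verified by enumeration. This sketch assumes the five transactions ABC, ABCD, BCE, ACDE, DE and a minimum support count of 2; the helper names are illustrative.

```python
from itertools import combinations

transactions = [set("ABC"), set("ABCD"), set("BCE"), set("ACDE"), set("DE")]
MINSUP_COUNT = 2
items = sorted(set().union(*transactions))


def count(itemset):
    return sum(1 for t in transactions if set(itemset) <= t)


frequent = {
    frozenset(c): count(c)
    for k in range(1, len(items) + 1)
    for c in combinations(items, k)
    if count(c) >= MINSUP_COUNT
}


def immediate_supersets(itemset):
    return [itemset | {i} for i in items if i not in itemset]


# Closed frequent: no immediate superset has the same support.
closed = [
    s for s, n in frequent.items()
    if all(count(sup) < n for sup in immediate_supersets(s))
]
# Maximal frequent: no immediate superset is frequent.
maximal = [
    s for s in frequent
    if all(sup not in frequent for sup in immediate_supersets(s))
]

print(len(frequent), len(closed), len(maximal))  # 14 9 4
print(sorted("".join(sorted(s)) for s in maximal))  # ['ABC', 'ACD', 'CE', 'DE']
```

Every maximal frequent itemset is also closed, which the sets computed here confirm.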
Maximal vs Closed Itemsets
[Figure: nested sets showing that the maximal frequent itemsets are a subset of the closed frequent itemsets, which are a subset of the frequent itemsets]
Rule Generation

Given a frequent itemset L, find all non-empty subsets f ⊂ L such that f → L – f satisfies the minimum confidence requirement
– If {A,B,C,D} is a frequent itemset, candidate rules:
ABC → D, ABD → C, ACD → B, BCD → A,
A → BCD, B → ACD, C → ABD, D → ABC,
AB → CD, AC → BD, AD → BC, BC → AD,
BD → AC, CD → AB

If |L| = k, then there are $2^k - 2$ candidate association rules (ignoring L → ∅ and ∅ → L)
Rule Generation

How to efficiently generate rules from frequent itemsets?
– In general, confidence does not have an anti-monotone property:
c(ABC → D) can be larger or smaller than c(AB → D)
– But confidence of rules generated from the same itemset has an anti-monotone property
– e.g., L = {A,B,C,D}:
c(ABC → D) ≥ c(AB → CD) ≥ c(A → BCD)
– Confidence is anti-monotone w.r.t. the number of items on the RHS of the rule
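The anti-monotone property suggests growing the consequent (RHS) level by level and never extending a consequent whose rule already failed the confidence threshold. The sketch below illustrates that idea on the market-basket data; `generate_rules` is a hypothetical name, and the code assumes the itemset passed in occurs in at least one transaction.

```python
from itertools import combinations

# Market-basket transactions (TIDs 1 to 5).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]


def generate_rules(itemset, transactions, minconf):
    """Return [(lhs, rhs, confidence)] for rules from one frequent itemset."""
    itemset = frozenset(itemset)

    def count(s):
        return sum(1 for t in transactions if s <= t)

    n_full = count(itemset)
    rules = []
    consequents = [frozenset([i]) for i in itemset]  # 1-item consequents
    size = 1
    while consequents and size < len(itemset):
        kept = set()
        for rhs in consequents:
            lhs = itemset - rhs
            conf = n_full / count(lhs)
            if conf >= minconf:
                rules.append((lhs, rhs, conf))
                kept.add(rhs)
        # Grow consequents only from those that passed; a larger consequent
        # is tried only if all of its subsets one item smaller passed.
        size += 1
        merged = {a | b for a, b in combinations(kept, 2) if len(a | b) == size}
        consequents = [
            c for c in merged
            if all(frozenset(s) in kept for s in combinations(c, size - 1))
        ]
    return rules


for lhs, rhs, conf in generate_rules({"Milk", "Diaper", "Beer"}, transactions, 0.6):
    print(sorted(lhs), "->", sorted(rhs), round(conf, 2))
```

With minconf = 0.6, four of the six candidate rules survive; the two rules with confidence 0.5 are rejected.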
Rule Generation for Apriori Algorithm
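[Figure: lattice of rules for the itemset ABCD, from ABCD => { } at the top, through the rules with one-item consequents (BCD=>A, ACD=>B, ABD=>C, ABC=>D) and two-item consequents (CD=>AB, BD=>AC, BC=>AD, AD=>BC, AC=>BD, AB=>CD), down to D=>ABC, C=>ABD, B=>ACD, and A=>BCD. When a rule has low confidence, the rules below it in the lattice are pruned.]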
Effect of Support Distribution

Many real data sets have skewed support
distribution
[Figure: support distribution of a retail data set]
Effect of Support Distribution

How to set the appropriate minsup threshold?
– If minsup is set too high, we could miss itemsets
involving interesting rare items (e.g., expensive
products)
– If minsup is set too low, it is computationally
expensive and the number of itemsets is very large

Using a single minimum support threshold may
not be effective
Criticism to Support and Confidence

Example:
– Among 5000 students
  • 3000 play basketball
  • 3750 eat cereal
  • 2000 both play basketball and eat cereal
– play basketball → eat cereal [40%, 66.7%] (support 2000/5000, confidence 2000/3000) is misleading because the overall percentage of students eating cereal is 75%, which is higher than 66.7%.
– play basketball → not eat cereal [20%, 33.3%] is far more accurate, although with lower support and confidence

            basketball  not basketball  sum(row)
cereal      2000        1750            3750
not cereal  1000        250             1250
sum(col.)   3000        2000            5000
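The figures in this example follow directly from the contingency table. The short sketch below recomputes them and compares each confidence with the baseline rate of the consequent; the variable names are illustrative.

```python
# Counts from the contingency table.
n_students = 5000
n_basketball = 3000
n_cereal = 3750
n_both = 2000

# play basketball -> eat cereal
support = n_both / n_students         # 0.40
confidence = n_both / n_basketball    # about 0.667
baseline = n_cereal / n_students      # 0.75, share of all students eating cereal

# play basketball -> not eat cereal
n_bb_no_cereal = n_basketball - n_both
neg_support = n_bb_no_cereal / n_students            # 0.20
neg_confidence = n_bb_no_cereal / n_basketball       # about 0.333
neg_baseline = (n_students - n_cereal) / n_students  # 0.25

# The first rule's confidence is below the baseline, so playing basketball
# makes eating cereal less likely; the second rule's is above its baseline.
print(confidence < baseline, neg_confidence > neg_baseline)  # True True
```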
Criticism to Support and Confidence (Cont.)

Example 2:
X 1 1 1 1 0 0 0 0
Y 1 1 0 0 0 0 0 0
Z 0 1 1 1 1 1 1 1
– X and Y: positively correlated
– X and Z: negatively related
– yet the support and confidence of X=>Z dominate

Rule   Support  Confidence
X=>Y   25%      50%
X=>Z   37.50%   75%

We need a measure of dependent or correlated events:

$\mathrm{corr}_{A,B} = \frac{P(A \wedge B)}{P(A)\,P(B)}$

P(B|A)/P(B) is also called the lift of rule A => B
Other Interestingness Measures: Interest

Interest (correlation, lift)

$\frac{P(A \wedge B)}{P(A)\,P(B)}$

– taking both P(A) and P(B) in consideration
– P(A^B)=P(B)*P(A), if A and B are independent events
– A and B are negatively correlated if the value is less than 1, and positively correlated if it is greater than 1

X 1 1 1 1 0 0 0 0
Y 1 1 0 0 0 0 0 0
Z 0 1 1 1 1 1 1 1

Itemset  Support  Interest
X,Y      25%      2
X,Z      37.50%   0.86
Y,Z      12.50%   0.57
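The interest values in the table can be recomputed from the three binary vectors. A minimal sketch, with `interest` as an illustrative function name and each vector position treated as one transaction:

```python
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]


def interest(a, b):
    """Interest (lift): P(A and B) / (P(A) * P(B))."""
    n = len(a)
    p_a = sum(a) / n
    p_b = sum(b) / n
    p_ab = sum(x * y for x, y in zip(a, b)) / n
    return p_ab / (p_a * p_b)


print(round(interest(X, Y), 2))  # 2.0
print(round(interest(X, Z), 2))  # 0.86
print(round(interest(Y, Z), 2))  # 0.57
```

X=>Z has the higher support and confidence, yet its interest is below 1, while X=>Y has interest 2.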
Infrequent Patterns

Example-1:
– The sale of DVDs and VCRs together is low, because
people don’t buy both at the same time.
– They are negatively correlated, and they are competing
items.

Example-2:
– If {Fire = Yes} is frequent but {Fire = Yes, Alarm = No}
is infrequent, then the latter is an important infrequent
pattern because it indicates faulty alarm systems.
Negative Patterns

Let I = {i1, i2, …} be a set of items. A negative item $\overline{i_k}$ denotes the absence of item $i_k$ from a given transaction.
– For example, $\overline{\text{coffee}}$ is a negative item whose value is 1 if a transaction does not contain coffee.

A negative itemset X is an itemset that has the following properties:
– $X = A \cup \overline{B}$, where A is a set of positive items, $\overline{B}$ is a set of negative items, and $|\overline{B}| \geq 1$
– $s(X) \geq \text{minsup}$
Negative Patterns

A negative association rule is an association
rule that has the following properties
– The rule is extracted from a negative itemset.
– The support of the rule is greater than or equal to
minsup.
– The confidence of the rule is greater than or equal to
minconf.

An example: $\text{tea} \rightarrow \overline{\text{coffee}}$
Negatively Correlated Patterns



Let X = {x1, x2, ..., xk} be a k-itemset and P(X) be the probability that a transaction contains X.
The probability is estimated by the itemset support s(X).
Negatively correlated itemset: An itemset X is negatively correlated if

$s(X) < \prod_{j=1}^{k} s(x_j) = s(x_1) \times s(x_2) \times \cdots \times s(x_k)$

An association rule X -> Y is negatively correlated if

$s(X \cup Y) < s(X)\,s(Y)$

where X and Y are disjoint itemsets.
Negatively Correlated Patterns
Because

$$
\begin{aligned}
& s(X \cup Y) - s(X)\,s(Y) \\
&= s(X \cup Y) - [s(X \cup Y) + s(X \cup \overline{Y})]\,[s(X \cup Y) + s(\overline{X} \cup Y)] \\
&= s(X \cup Y)\,[1 - s(X \cup Y) - s(X \cup \overline{Y}) - s(\overline{X} \cup Y)] - s(X \cup \overline{Y})\,s(\overline{X} \cup Y) \\
&= s(X \cup Y)\,s(\overline{X} \cup \overline{Y}) - s(X \cup \overline{Y})\,s(\overline{X} \cup Y)
\end{aligned}
$$

the condition for negative correlation can be stated as follows:

$s(X \cup Y)\,s(\overline{X} \cup \overline{Y}) < s(X \cup \overline{Y})\,s(\overline{X} \cup Y)$
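The identity behind this condition can be checked numerically. The sketch below reuses the basketball and cereal counts from the earlier example as the four cell supports (an illustrative choice of data), with X = plays basketball and Y = eats cereal; `is_negatively_correlated` is a hypothetical helper name.

```python
def is_negatively_correlated(s_xy, s_x_noty, s_notx_y, s_notx_noty):
    """Negative correlation test based on the four cell supports."""
    return s_xy * s_notx_noty < s_x_noty * s_notx_y


# Cell supports from the basketball/cereal contingency table (5000 students).
s_xy = 2000 / 5000         # basketball and cereal
s_x_noty = 1000 / 5000     # basketball, no cereal
s_notx_y = 1750 / 5000     # no basketball, cereal
s_notx_noty = 250 / 5000   # neither

s_x = s_xy + s_x_noty      # 0.60
s_y = s_xy + s_notx_y      # 0.75

# Both sides of the derivation should agree (up to floating-point error).
lhs = s_xy - s_x * s_y
rhs = s_xy * s_notx_noty - s_x_noty * s_notx_y
print(round(lhs, 4), round(rhs, 4))  # -0.05 -0.05
print(is_negatively_correlated(s_xy, s_x_noty, s_notx_y, s_notx_noty))  # True
```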
Comparisons





Comparisons among infrequent patterns, negative patterns, and negatively correlated patterns:
– Many infrequent patterns have corresponding negative patterns.
– Many negatively correlated patterns also have corresponding negative patterns.
– The lower the support s(X ∪ Y), the more negatively correlated the pattern is.
– Negatively correlated patterns that are infrequent tend to be more interesting than negatively correlated patterns that are frequent.