Association Rules
Charles Sutton
Data Mining and Exploration
Spring 2012
Based on slides by Chris Williams and Amos Storkey
Thursday, 8 March 12

The Goal
• Find "patterns": local regularities that occur more often than you would expect. Examples:
  • If a person buys wine at a supermarket, they also buy cheese. (confidence: 20%)
  • If a person likes Lord of the Rings and Star Wars, they like Star Trek. (confidence: 90%)
• Look like they could be used for classification, but:
  • There is not a single class label in mind. They can predict any attribute or a set of attributes. They are unsupervised.
  • Not intended to be used together as a set.
• Often mined from very large data sets.
Example Data
• Market basket analysis, e.g., supermarket
• [Table: transactions (rows, one per trip to market) by items (columns: Chicken, Onion, Rocket, Caviar, Haggis, ....), with a 1 where the item was bought]

Other Examples
• Collaborative-filtering type data: e.g., films a person has watched
• Rows: patients, columns: medical tests (Cabena et al, 1998)
• Survey data (Impact Resources, Inc., Columbus OH, 1987)
• These are databases that companies have already.

TABLE 14.1. Inputs for the demographic data.

Feature  Demographic            # Values  Type
1        Sex                    2         Categorical
2        Marital status         5         Categorical
3        Age                    7         Ordinal
4        Education              6         Ordinal
5        Occupation             9         Categorical
6        Income                 9         Ordinal
7        Years in Bay Area      5         Ordinal
8        Dual incomes           3         Categorical
9        Number in household    9         Ordinal
10       Number of children     9         Ordinal
11       Householder status     3         Categorical
12       Type of home           5         Categorical
13       Ethnic classification  8         Categorical
14       Language in home       3         Categorical

Example rule from the survey data:
[number in household = 1, number of children = 0] ⇒ language in home = English
Itemsets, Frequency, Accuracy
• Call each column an attribute $A_1, A_2, \dots, A_m$
• An itemset is a pattern defined by a set of attribute-value pairs
  $(A_{i_1} = a_{j_1}) \wedge (A_{i_2} = a_{j_2}) \wedge \dots \wedge (A_{i_k} = a_{j_k})$
• The frequency (or support) of an itemset X is simply $P(X)$, its frequency in the data
  Example: in the "Play Tennis" data,
  $P(\text{Humidity} = \text{Normal} \wedge \text{Play} = \text{Yes} \wedge \text{Windy} = \text{False}) = 4/14$
  (a support count of 4)
• The accuracy (or confidence) of an association rule "if Y=y then Z=z" is
  $P(Z = z \mid Y = y)$
  Example:
  $P(\text{Windy} = \text{False} \wedge \text{Play} = \text{Yes} \mid \text{Humidity} = \text{Normal}) = 4/7$

Play Tennis Example

Day  Outlook   Temperature  Humidity  Wind   PlayTennis
D1   Sunny     Hot          High      False  No
D2   Sunny     Hot          High      True   No
D3   Overcast  Hot          High      False  Yes
D4   Rain      Mild         High      False  Yes
D5   Rain      Cool         Normal    False  Yes
D6   Rain      Cool         Normal    True   No
D7   Overcast  Cool         Normal    True   Yes
D8   Sunny     Mild         High      False  No
D9   Sunny     Cool         Normal    False  Yes
D10  Rain      Mild         Normal    False  Yes
D11  Sunny     Mild         Normal    True   Yes
D12  Overcast  Mild         High      True   Yes
D13  Overcast  Hot          Normal    False  Yes
D14  Rain      Mild         High      True   No
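The definitions of support and confidence can be checked directly against the Play Tennis table. The following is a minimal sketch, not code from the lecture: the helper names `count`, `support`, and `confidence` are illustrative assumptions, and the table's "Wind" column is called "Windy" to match the attribute name used in the formulas.

```python
# Play Tennis table from the slides; the "Wind" column is named "Windy" here
# to match the attribute name used in the slide formulas.
COLUMNS = ("Outlook", "Temperature", "Humidity", "Windy", "Play")
ROWS = [
    ("Sunny", "Hot", "High", "False", "No"),        # D1
    ("Sunny", "Hot", "High", "True", "No"),         # D2
    ("Overcast", "Hot", "High", "False", "Yes"),    # D3
    ("Rain", "Mild", "High", "False", "Yes"),       # D4
    ("Rain", "Cool", "Normal", "False", "Yes"),     # D5
    ("Rain", "Cool", "Normal", "True", "No"),       # D6
    ("Overcast", "Cool", "Normal", "True", "Yes"),  # D7
    ("Sunny", "Mild", "High", "False", "No"),       # D8
    ("Sunny", "Cool", "Normal", "False", "Yes"),    # D9
    ("Rain", "Mild", "Normal", "False", "Yes"),     # D10
    ("Sunny", "Mild", "Normal", "True", "Yes"),     # D11
    ("Overcast", "Mild", "High", "True", "Yes"),    # D12
    ("Overcast", "Hot", "Normal", "False", "Yes"),  # D13
    ("Rain", "Mild", "High", "True", "No"),         # D14
]
DATA = [dict(zip(COLUMNS, row)) for row in ROWS]


def count(itemset, data=DATA):
    """Number of rows that match every attribute=value pair in the itemset."""
    return sum(all(row[a] == v for a, v in itemset.items()) for row in data)


def support(itemset, data=DATA):
    """Frequency P(X), returned as (matching rows, total rows)."""
    return count(itemset, data), len(data)


def confidence(antecedent, consequent, data=DATA):
    """P(consequent | antecedent), as (rows matching both, rows matching antecedent)."""
    both = {**antecedent, **consequent}
    return count(both, data), count(antecedent, data)


x = {"Humidity": "Normal", "Play": "Yes", "Windy": "False"}
print("support: %d/%d" % support(x))  # support: 4/14
rule = confidence({"Humidity": "Normal"}, {"Windy": "False", "Play": "Yes"})
print("confidence: %d/%d" % rule)  # confidence: 4/7
```

Returning unreduced counts rather than floats keeps the output in the same form as the slides (4/14 and 4/7).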
Generating rules from itemsets
• First: we will find frequent item sets
• Then: we convert them to rules
• An itemset of size k can give rise to $2^k - 1$ rules
• Example. Itemset
    Windy=False, Play=Yes, Humidity=Normal
  results in 7 rules, including:
    IF Windy=False and Humidity=Normal THEN Play=Yes (4/4)
    IF Play=Yes THEN Humidity=Normal and Windy=False (4/9)
    IF True THEN Windy=False and Play=Yes and Humidity=Normal (4/14)
• Select association rules that have accuracy (confidence) greater than some threshold a

Finding Frequent Itemsets
• Task: find all itemsets with frequency (support) $\geq s$
• Key observation: a set X of variables can be frequent only if all subsets of X are frequent (monotonicity property), i.e. $P(A, B) \leq P(A)$ and $P(A, B) \leq P(B)$
• Insight: a large set can be no more frequent than its subsets, e.g., support(Wind = False) $\geq$ support(Wind = False, Outlook = Sunny)
• So find frequent singleton sets, then sets of size 2, and so on: search through itemsets in order of number of items
• An efficient algorithm using this idea for finding frequent itemsets is the APRIORI algorithm (Agrawal and Srikant, 1994; Mannila et al, 1994)
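Two claims from these slides can be checked on the Play Tennis data: that a larger itemset is never more frequent than its subsets, and that an itemset of size k yields 2^k − 1 rules (every proper subset, including the empty one, as the antecedent). The sketch below is illustrative only; `count`, `rules_from_itemset`, `show`, and the threshold value 0.5 are assumptions introduced here, not part of the lecture.

```python
from itertools import combinations

# Play Tennis table from the slides ("Wind" column named "Windy" as in the formulas).
COLUMNS = ("Outlook", "Temperature", "Humidity", "Windy", "Play")
ROWS = [
    ("Sunny", "Hot", "High", "False", "No"), ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"), ("Rain", "Mild", "High", "False", "Yes"),
    ("Rain", "Cool", "Normal", "False", "Yes"), ("Rain", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"), ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"), ("Rain", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"), ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"), ("Rain", "Mild", "High", "True", "No"),
]
DATA = [dict(zip(COLUMNS, row)) for row in ROWS]


def count(itemset, data=DATA):
    """Number of rows that match every attribute=value pair in the itemset."""
    return sum(all(row[a] == v for a, v in itemset.items()) for row in data)


# Monotonicity example from the slides: adding an item cannot raise the support.
assert count({"Windy": "False"}) >= count({"Windy": "False", "Outlook": "Sunny"})


def rules_from_itemset(itemset):
    """All rules from one itemset: every proper subset (possibly empty) is an antecedent.

    Returns tuples (antecedent, consequent, rows matching itemset, rows matching antecedent).
    """
    pairs = sorted(itemset.items())
    hits = count(itemset)
    rules = []
    for size in range(len(pairs)):  # antecedent sizes 0 .. k-1
        for chosen in combinations(pairs, size):
            antecedent = dict(chosen)
            consequent = {a: v for a, v in pairs if a not in antecedent}
            rules.append((antecedent, consequent, hits, count(antecedent)))
    return rules


def show(side):
    """Render one side of a rule; an empty antecedent is printed as 'True'."""
    return " and ".join(f"{a}={v}" for a, v in side.items()) or "True"


itemset = {"Windy": "False", "Play": "Yes", "Humidity": "Normal"}
rules = rules_from_itemset(itemset)
threshold = 0.5  # hypothetical confidence threshold, chosen only for this demo
for ante, cons, hits, covered in rules:
    kept = covered > 0 and hits / covered > threshold
    print(f"IF {show(ante)} THEN {show(cons)} ({hits}/{covered}){' kept' if kept else ''}")
```

The loop prints all seven rules with their confidence as a fraction and marks those above the threshold; the three rules listed on the slide appear with 4/4, 4/9, and 4/14.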
APRIORI algorithm (for binary variables)

  i = 1
  Ci = {{A} | A is a variable}
  while Ci is not empty
      database pass:
          for each set in Ci test if it is frequent
          let Li be collection of frequent sets from Ci
      candidate formation:
          let Ci+1 be those sets of size i + 1
              all of whose subsets are frequent
      i = i + 1
  end while

• A single database pass is linear in $|C_i| n$; make a pass for each i until $C_i$ is empty
• Candidate formation
  • Find all pairs of sets {U, V} from $L_i$ such that $U \cup V$ has size i + 1 and test if this union is really a potential candidate. $O(|L_i|^3)$
• Example: 5 three-item sets
    (ABC), (ABD), (ACD), (ACE), (BCD)
  Candidate four-item sets
    (ABCD) ok
    (ACDE) not ok because (CDE) is not present above
• Data structure techniques can be used for speedups
• Other algorithms possible for finding frequent itemsets, e.g. Han's FP-growth
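The level-wise loop and the candidate-formation step can be sketched as follows. This is a straightforward, unoptimized rendering of the pseudocode, not the implementation from Agrawal and Srikant; the function names `form_candidates` and `apriori`, the parameter `min_freq`, and the small `baskets` list are assumptions made for illustration.

```python
from itertools import combinations


def form_candidates(frequent_sets):
    """C_{i+1}: unions of two frequent i-sets that have size i + 1
    and all of whose size-i subsets are frequent."""
    candidates = set()
    for u, v in combinations(frequent_sets, 2):
        union = u | v
        if len(union) == len(u) + 1 and all(
            frozenset(subset) in frequent_sets
            for subset in combinations(union, len(u))
        ):
            candidates.add(union)
    return candidates


def apriori(transactions, min_freq):
    """Return {itemset: count} for every itemset with frequency >= min_freq."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)
    # C_1: every single item is a candidate.
    candidates = {frozenset([item]) for t in transactions for item in t}
    frequent = {}
    while candidates:
        # Database pass: count how many transactions contain each candidate.
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        level = {c: k for c, k in counts.items() if k >= min_freq * n}
        frequent.update(level)
        # Candidate formation for the next level.
        candidates = form_candidates(set(level))
    return frequent


# The slide's example: five frequent three-item sets.
three_item_sets = {frozenset(s) for s in ("ABC", "ABD", "ACD", "ACE", "BCD")}
four_item_candidates = form_candidates(three_item_sets)
print(sorted("".join(sorted(c)) for c in four_item_candidates))  # ['ABCD']

# Hypothetical transactions, invented only to exercise the full loop.
baskets = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}]
result = apriori(baskets, min_freq=0.5)
```

On the slide's example only (ABCD) survives: (ACDE) is rejected because (ADE) and (CDE) are not among the frequent three-item sets.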
APRIORI and Algorithm Components
• Task: Rule Pattern Discovery
• Structure: Association Rules
• Score Function: Support
• Search: Breadth First with Pruning
• Data Management Technique: Linear Scans

Comments on Association Rules
• Finding association rules is just the beginning of a data mining effort. Some rules will be trivial, others interesting. The challenge is to select the potentially interesting rules.
• Trivial rule example: pregnant ⇒ female (confidence: 1)
• Can also miss "interesting but rare" rules. Example: vodka ⇒ caviar (low support)
• For a rule A ⇒ B, it can be useful to compare P(B|A) to P(B)
• Really this is a type of exploratory data analysis
• The APRIORI algorithm can be generalized to frequent structure mining, e.g. finding episodes from sequences (subsequences) or frequently-occurring trees (subtrees)
• Example application: the Health Insurance Commission (HIC) in Australia detected patterns of ordering of medical tests that suggested that some of the tests ordered were unnecessary (Cabena et al, 1998)
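The suggestion to compare P(B|A) with P(B) can be illustrated on the Play Tennis data. The ratio of the two is commonly called lift; a value near 1 means the antecedent tells you little about the consequent. The rule Humidity=Normal ⇒ Play=Yes used below is chosen here purely as an illustration, and the helper names `matches` and `prob` are assumptions.

```python
from fractions import Fraction

# Play Tennis table from the slides ("Wind" column named "Windy" as in the formulas).
COLUMNS = ("Outlook", "Temperature", "Humidity", "Windy", "Play")
ROWS = [
    ("Sunny", "Hot", "High", "False", "No"), ("Sunny", "Hot", "High", "True", "No"),
    ("Overcast", "Hot", "High", "False", "Yes"), ("Rain", "Mild", "High", "False", "Yes"),
    ("Rain", "Cool", "Normal", "False", "Yes"), ("Rain", "Cool", "Normal", "True", "No"),
    ("Overcast", "Cool", "Normal", "True", "Yes"), ("Sunny", "Mild", "High", "False", "No"),
    ("Sunny", "Cool", "Normal", "False", "Yes"), ("Rain", "Mild", "Normal", "False", "Yes"),
    ("Sunny", "Mild", "Normal", "True", "Yes"), ("Overcast", "Mild", "High", "True", "Yes"),
    ("Overcast", "Hot", "Normal", "False", "Yes"), ("Rain", "Mild", "High", "True", "No"),
]
DATA = [dict(zip(COLUMNS, row)) for row in ROWS]


def matches(row, itemset):
    """True if the row agrees with every attribute=value pair in the itemset."""
    return all(row[a] == v for a, v in itemset.items())


def prob(itemset, given=None):
    """P(itemset | given) estimated from the table; given=None means no condition."""
    rows = [r for r in DATA if matches(r, given or {})]
    return Fraction(sum(matches(r, itemset) for r in rows), len(rows))


a = {"Humidity": "Normal"}  # illustrative antecedent
b = {"Play": "Yes"}         # illustrative consequent
p_b_given_a = prob(b, given=a)
p_b = prob(b)
ratio = p_b_given_a / p_b   # the ratio commonly called "lift"
print(f"P(B|A) = {p_b_given_a}, P(B) = {p_b}, ratio = {ratio}")
```

Here P(B|A) = 6/7 against a baseline P(B) = 9/14, so the ratio is 4/3: normal humidity makes playing somewhat more likely than it is overall. For a trivial rule such as pregnant ⇒ female the confidence is 1, but the comparison with the baseline is what shows how much the rule actually adds.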