Association rules
Lecturer: JERZY STEFANOWSKI
Institute of Computing Sciences
Poznan University of Technology
Poznan, Poland
Lecture 10
SE Master Course
2008/2009
This lecture is based on the following resources (slides):
G. Piatetsky-Shapiro: Association Rules and Frequent Item Analysis,
partly on J. Han: Mining Association Rules in Large Databases,
and my other notes.
Outline
• Transactions
• Frequent itemsets
• Subset property
• Association rules
• Applications
Association rules
• Transaction data
• Market basket analysis

TID  Products
1    MILK, BREAD, EGGS
2    BREAD, SUGAR
3    BREAD, CEREAL
4    MILK, BREAD, SUGAR
5    MILK, CEREAL
6    BREAD, CEREAL
7    MILK, CEREAL
8    MILK, BREAD, CEREAL, EGGS
9    MILK, BREAD, CEREAL

• Association rule:
  {Cheese, Milk} → Bread [sup = 5%, conf = 80%]
  "80% of customers who buy cheese and milk also buy bread, and 5% of customers buy all these products together."
What Is Frequent Pattern Analysis?
• Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
• First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
• Motivation: finding inherent regularities in data
  • What products were often purchased together? Beer and diapers?!
  • What are the subsequent purchases after buying a PC?
  • What kinds of DNA are sensitive to this new drug?
  • Can we automatically classify web documents?
• Applications: basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis
Why Is Frequent Pattern or Association Mining an Essential Task in Data Mining?
• Foundation for many essential data mining tasks
  • Association, correlation, causality
  • Sequential patterns, temporal or cyclic association, partial periodicity, spatial and multimedia association
  • Associative classification, cluster analysis, fascicles (semantic data compression)
• DB approach to efficient mining of massive data
• Broad applications
  • Basket data analysis, cross-marketing, catalog design, sale campaign analysis
  • Web log (click stream) analysis, DNA sequence analysis, etc.
Transactions Example

TID  Products
1    MILK, BREAD, EGGS
2    BREAD, SUGAR
3    BREAD, CEREAL
4    MILK, BREAD, SUGAR
5    MILK, CEREAL
6    BREAD, CEREAL
7    MILK, CEREAL
8    MILK, BREAD, CEREAL, EGGS
9    MILK, BREAD, CEREAL
Transaction database: Example, 1

TID  Products
1    A, B, E
2    B, D
3    B, C
4    A, B, D
5    A, C
6    B, C
7    A, C
8    A, B, C, E
9    A, B, C

Instances = Transactions
ITEMS: A = milk, B = bread, C = cereal, D = sugar, E = eggs
Transaction database: Example, 2
Attributes converted to binary flags

TID  A  B  C  D  E
1    1  1  0  0  1
2    0  1  0  1  0
3    0  1  1  0  0
4    1  1  0  1  0
5    1  0  1  0  0
6    0  1  1  0  0
7    1  0  1  0  0
8    1  1  1  0  1
9    1  1  1  0  0
Definitions
• Item: attribute=value pair, or simply a value
  • usually attributes are converted to binary flags for each value, e.g. product="A" is written as "A"
• Itemset I: a subset of possible items
  • Example: I = {A, B, E} (order unimportant)
• Transaction: (TID, itemset), where TID is the transaction ID
Support and Frequent Itemsets
• Support of an itemset: sup(I) = number of transactions t that support (i.e. contain) I
• In the example database: sup({A,B,E}) = 2, sup({B,C}) = 4
• A frequent itemset I is one with at least the minimum support count: sup(I) >= minsup
  (a small counting sketch in Python follows below)
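A minimal Python sketch (not from the original slides) of this support count over the example database; the names transactions and sup are illustrative:

    transactions = [
        {"A", "B", "E"}, {"B", "D"}, {"B", "C"}, {"A", "B", "D"},
        {"A", "C"}, {"B", "C"}, {"A", "C"}, {"A", "B", "C", "E"},
        {"A", "B", "C"},
    ]

    def sup(itemset, db):
        # number of transactions containing every item of `itemset`
        return sum(1 for t in db if itemset <= t)

    print(sup({"A", "B", "E"}, transactions))  # 2
    print(sup({"B", "C"}, transactions))       # 4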
SUBSET PROPERTY (Agrawal et al.)
• Every subset of a frequent set is frequent!
• Q: Why is it so?
• A: Suppose {A,B} is frequent. Since each occurrence of {A,B} includes both A and B, both A and B must also be frequent.
• A similar argument holds for larger itemsets.
• Almost all association rule algorithms are based on this subset property.
Association Rules
• Association rule R: Itemset1 => Itemset2
  • Itemset1 and Itemset2 are disjoint, and Itemset2 is non-empty
  • Meaning: if a transaction includes Itemset1, then it also includes Itemset2
• Examples
  • A,B => E,C
  • A => B,C
From Frequent Itemsets to Association Rules
• Q: Given the frequent set {A,B,E}, what are the possible association rules?
  • A => B,E
  • A,B => E
  • A,E => B
  • B => A,E
  • B,E => A
  • E => A,B
  • __ => A,B,E (empty rule), or true => A,B,E
Rule Support and Confidence
• Suppose R: I => J is an association rule
• sup(R) = sup(I ∪ J) is the support count
  • the support of the itemset I ∪ J, i.e. of transactions containing both I and J
• conf(R) = sup(I ∪ J) / sup(I) is the confidence of R
  • the fraction of transactions containing I that also contain J
• Association rules with minimum support and confidence are sometimes called "strong" rules
  (a sketch below computes this on the example database)
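Continuing the sketch above (reusing the illustrative sup() and transactions), confidence follows directly from the definition:

    def conf(I, J, db):
        # conf(I => J) = sup(I ∪ J) / sup(I)
        return sup(I | J, db) / sup(I, db)

    print(conf({"A", "B"}, {"E"}, transactions))  # 0.5, i.e. 50%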
Classification vs Association Rules

Classification rules:
• focus on one target field
• specify class in all cases
• measures: accuracy

Association rules:
• many target fields
• applicable in some cases
• measures: support, confidence, lift
Association Rules Example, 1
• Q: Given the frequent set {A,B,E}, what association rules have minsup = 2 and minconf = 50%?

A,B => E : conf = 2/4 = 50%

TID  List of items
1    A, B, E
2    B, D
3    B, C
4    A, B, D
5    A, C
6    B, C
7    A, C
8    A, B, C, E
9    A, B, C
Association Rules Example, 2
• Q: Given the frequent set {A,B,E}, what association rules have minsup = 2 and minconf = 50%?

Qualify:
A,B => E : conf = 2/4 = 50%
A,E => B : conf = 2/2 = 100%
B,E => A : conf = 2/2 = 100%
E => A,B : conf = 2/2 = 100%

Don't qualify:
A => B,E : conf = 2/6 = 33% < 50%
B => A,E : conf = 2/7 = 28.6% < 50%
__ => A,B,E : conf = 2/9 = 22% < 50%

TID  List of items
1    A, B, E
2    B, D
3    B, C
4    A, B, D
5    A, C
6    B, C
7    A, C
8    A, B, C, E
9    A, B, C
Find Strong Association Rules
• A rule has the parameters minsup and minconf: sup(R) >= minsup and conf(R) >= minconf
• Problem: find all association rules with given minsup and minconf
• First, find all frequent itemsets
Finding Frequent Itemsets
• Start by finding one-item sets (easy)
• Q: How?
• A: Simply count the frequencies of all items (see the one-line sketch below)
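As a sketch (again reusing the illustrative transactions list from above), the first pass is a single frequency count:

    from collections import Counter

    item_counts = Counter(item for t in transactions for item in t)
    print(item_counts)  # e.g. Counter({'B': 7, 'A': 6, 'C': 6, 'D': 2, 'E': 2})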
Finding itemsets: next level
• Apriori algorithm (Agrawal & Srikant, 1994)
• Idea: use one-item sets to generate two-item sets, two-item sets to generate three-item sets, …
  • If (A B) is a frequent item set, then (A) and (B) have to be frequent item sets as well!
  • In general: if X is a frequent k-item set, then all (k−1)-item subsets of X are also frequent
  ⇒ Compute k-item sets by merging (k−1)-item sets
Another example
• Given: five three-item sets
  (A B C), (A B D), (A C D), (A C E), (B C D)
• Lexicographic order improves efficiency!
• Candidate four-item sets:
  (A B C D)  Q: OK?  A: yes, because all 3-item subsets are frequent
  (A C D E)  Q: OK?  A: no, because (C D E) is not frequent
Pruning search space
(figure: the itemset lattice, with supersets of infrequent itemsets pruned)
Apriori: A Candidate Generation-and-Test Approach - Summary
• Any subset of a frequent itemset must be frequent
  • if {beer, diaper, nuts} is frequent, so is {beer, diaper}
  • every transaction having {beer, diaper, nuts} also contains {beer, diaper}
• Apriori pruning principle: if any itemset is infrequent, its supersets should not be generated or tested!
• Method:
  • generate length-(k+1) candidate itemsets from length-k frequent itemsets, and
  • test the candidates against the DB
• Performance studies show its efficiency and scalability
  • Agrawal & Srikant 1994; Mannila et al. 1994
Mining Association Rules—an Example

Transaction-id  Items bought
10              A, B, C
20              A, C
30              A, D
40              B, E, F

Min. support 50%, min. confidence 50%

Frequent pattern  Support
{A}               75%
{B}               50%
{C}               50%
{A, C}            50%

For rule A ⇒ C:
support = support({A} ∪ {C}) = 50%
confidence = support({A} ∪ {C}) / support({A}) = 66.6%
The Apriori Algorithm—An Example

Database TDB:
Tid  Items
10   A, C, D
20   B, C, E
30   A, B, C, E
40   B, E

1st scan → C1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{D}      1
{E}      3

L1:
Itemset  sup
{A}      2
{B}      3
{C}      3
{E}      3

C2 (generated from L1):
{A, B}, {A, C}, {A, E}, {B, C}, {B, E}, {C, E}

2nd scan → C2 with counts:
Itemset  sup
{A, B}   1
{A, C}   2
{A, E}   1
{B, C}   2
{B, E}   3
{C, E}   2

L2:
Itemset  sup
{A, C}   2
{B, C}   2
{B, E}   3
{C, E}   2

C3: {B, C, E}

3rd scan → L3:
Itemset    sup
{B, C, E}  2
The Apriori Algorithm
• Pseudo-code (a runnable Python sketch follows below):
  Ck: candidate itemsets of size k
  Lk: frequent itemsets of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
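To make the pseudo-code concrete, here is a compact, unoptimized Python sketch of the same loop; apriori() and its internals are illustrative, not the lecture's code:

    from collections import Counter
    from itertools import combinations

    def apriori(db, minsup):
        # returns every frequent itemset (as a frozenset) with its support count
        db = [frozenset(t) for t in db]
        first = Counter(i for t in db for i in t)
        L = {frozenset([i]): c for i, c in first.items() if c >= minsup}  # L1
        frequent = dict(L)
        k = 1
        while L:
            # join: unite pairs of frequent k-item sets into (k+1)-item candidates;
            # prune: drop candidates with an infrequent k-item subset (subset property)
            C = {a | b for a in L for b in L if len(a | b) == k + 1}
            C = {c for c in C if all(frozenset(s) in L for s in combinations(c, k))}
            counts = {c: sum(1 for t in db if c <= t) for c in C}  # one DB scan
            L = {c: n for c, n in counts.items() if n >= minsup}
            frequent.update(L)
            k += 1
        return frequent

On the nine-transaction example with minsup = 2, this recovers, among others, {A,B,E} with support 2.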
How to Generate Candidates?
• Suppose the items in Lk−1 are listed in an order
• Step 1: self-joining Lk−1
  insert into Ck
  select p.item1, p.item2, …, p.itemk−1, q.itemk−1
  from Lk−1 p, Lk−1 q
  where p.item1 = q.item1, …, p.itemk−2 = q.itemk−2, p.itemk−1 < q.itemk−1
• Step 2: pruning
  forall itemsets c in Ck do
      forall (k−1)-subsets s of c do
          if (s is not in Lk−1) then delete c from Ck
  (a Python version follows below)
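A Python sketch of the same join/prune on sorted tuples, so the lexicographic condition of the slide applies (names are illustrative); it reproduces the earlier four-item example:

    from itertools import combinations

    def generate_candidates(prev_frequent):
        # prev_frequent: frequent (k-1)-item sets, each a sorted tuple
        prev = sorted(prev_frequent)
        prev_set = set(prev)
        size = len(prev[0])                              # k-1
        Ck = []
        for i, p in enumerate(prev):
            for q in prev[i + 1:]:
                if p[:-1] == q[:-1] and p[-1] < q[-1]:   # self-join step
                    c = p + (q[-1],)
                    # prune step: every (k-1)-subset of c must be frequent
                    if all(s in prev_set for s in combinations(c, size)):
                        Ck.append(c)
        return Ck

    L3 = [("A","B","C"), ("A","B","D"), ("A","C","D"), ("A","C","E"), ("B","C","D")]
    print(generate_candidates(L3))  # [('A', 'B', 'C', 'D')] -- (A,C,D,E) fails on (C,D,E)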
How to Count Supports of Candidates?
• Why is counting supports of candidates a problem?
  • The total number of candidates can be huge
  • One transaction may contain many candidates
• Method:
  • Candidate itemsets are stored in a hash-tree
  • A leaf node of the hash-tree contains a list of itemsets and counts
  • An interior node contains a hash table
  • Subset function: finds all the candidates contained in a transaction
Generating Association Rules
• Two-stage process:
  • Determine frequent itemsets, e.g. with the Apriori algorithm.
  • For each frequent itemset I:
    • for each non-empty proper subset J of I,
    • determine all association rules of the form I−J => J (see the sketch below)
• Main idea used in both stages: the subset property
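A sketch of this second stage in Python, assuming a frequent dict of support counts like the one returned by the apriori() sketch earlier (all names are illustrative):

    from itertools import combinations

    def rules_from_itemset(I, frequent, minconf):
        # every non-empty proper subset of I can serve as a rule body
        I = frozenset(I)
        rules = []
        for r in range(1, len(I)):
            for body in map(frozenset, combinations(I, r)):
                c = frequent[I] / frequent[body]   # conf(body => I - body)
                if c >= minconf:
                    rules.append((set(body), set(I - body), c))
        return rules

For instance, rules_from_itemset({'A','B','E'}, frequent, 0.5) yields exactly the four qualifying rules from the earlier example.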
Example: Generating Rules from an Itemset
• Frequent itemset from the golf data:
  Humidity = Normal, Windy = False, Play = Yes (4)
• Seven potential rules (with confidence):
  If Humidity = Normal and Windy = False then Play = Yes              4/4
  If Humidity = Normal and Play = Yes then Windy = False              4/6
  If Windy = False and Play = Yes then Humidity = Normal              4/6
  If Humidity = Normal then Windy = False and Play = Yes              4/7
  If Windy = False then Humidity = Normal and Play = Yes              4/8
  If Play = Yes then Humidity = Normal and Windy = False              4/9
  If True then Humidity = Normal and Windy = False and Play = Yes     4/12
Rules for the weather data
• Rules with support > 1 and confidence = 100%:

No.  Association rule                                 Sup.  Conf.
1    Humidity=Normal, Windy=False ⇒ Play=Yes          4     100%
2    Temperature=Cool ⇒ Humidity=Normal               4     100%
3    Outlook=Overcast ⇒ Play=Yes                      4     100%
4    Temperature=Cool, Play=Yes ⇒ Humidity=Normal     3     100%
...  ...                                              ...   ...
58   Outlook=Sunny, Temperature=Hot ⇒ Humidity=High   2     100%

• In total: 3 rules with support four, 5 with support three, and 50 with support two
Association Rule Mining - research map

Mining association rules (Agrawal et al., SIGMOD 93)

Better algorithms:
• Fast algorithm (Agrawal et al., VLDB 94)
• Hash-based (Park et al., SIGMOD 95)
• Partitioning (Navathe et al., VLDB 95)
• Direct Itemset Counting (Brin et al., SIGMOD 97)
• Parallel mining (Agrawal et al., TKDE 96)
• Distributed mining (Cheung et al., PDIS 96)
• Incremental mining (Cheung et al., ICDE 96)

Problem extensions:
• Generalized A.R. (Srikant et al.; Han et al., VLDB 95)
• Quantitative A.R. (Srikant et al., SIGMOD 96)
• N-dimensional A.R. (Lu et al., DMKD '98)
• Meta-rule-guided mining
Bottleneck of Frequent-pattern Mining with Apriori
• Multiple database scans are costly
• Mining long patterns needs many passes of scanning and generates lots of candidates
  • To find the frequent itemset i1 i2 … i100:
    • number of scans: 100
    • number of candidates: C(100,1) + C(100,2) + … + C(100,100) = 2^100 − 1 ≈ 1.27 × 10^30 !
• Bottleneck: candidate generation and testing
• Can we avoid candidate generation?
  • Other algorithms → FP-tree
Mining Frequent Patterns Without Complete Candidate Generation
• Grow long patterns from short ones using local frequent items
  • "abc" is a frequent pattern
  • Get all transactions having "abc": DB|abc
  • "d" is a local frequent item in DB|abc → "abcd" is a frequent pattern
FP-Growth vs. Apriori: Scalability With the Support Threshold
(figure: run time in seconds, 0-100, of D1 FP-growth vs. D1 Apriori on data set T25I20D10K, as the support threshold varies from 0% to 3%)
Weka associations
File: weather.nominal.arff
MinSupport: 0.2
(screenshot: the Weka Apriori associator set up on this file)

Weka associations: output
(screenshot: the rules found by the Weka Apriori associator)
Presentation of Association Rules (Table Form)
(figure: rules listed in a table)

Visualization of Association Rules: Plane Graph
(figure: rules visualized as a plane graph)
Filtering Association Rules
• Problem: any large dataset can lead to a very large number of association rules, even with reasonable minimum confidence and support
• Confidence by itself is not sufficient
  • e.g. if all transactions include Z, then any rule I => Z will have confidence 100%
• Other measures are used to filter rules
Association Rule LIFT
• The lift of an association rule I => J is defined as:
  lift = P(J|I) / P(J)
• Note: P(J) = (support of J) / (number of transactions)
• Lift is the ratio of confidence to expected confidence (a one-line sketch follows below)
• Interpretation:
  • lift > 1: I and J are positively correlated
  • lift < 1: I and J are negatively correlated
  • lift = 1: I and J are independent
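Continuing the earlier sup()/conf() sketch (illustrative names, not from the lecture):

    def lift(I, J, db):
        # lift(I => J) = conf(I => J) / P(J)
        return conf(I, J, db) / (sup(J, db) / len(db))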
Interestingness Measure: Correlations and Lift
• play basketball ⇒ eat cereal [40%, 66.7%] is misleading
  • the overall percentage of students eating cereal is 75%, which is higher than 66.7%
• play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although it has lower support and confidence
• Measure of dependent/correlated events: lift or corr

  corr(A,B) = P(A ∪ B) / (P(A) · P(B))

  (here A ∪ B denotes the itemset containing both A and B, so P(A ∪ B) is the probability that a transaction contains both)

            Basketball  Not basketball  Sum (row)
Cereal      2000        1750            3750
Not cereal  1000        250             1250
Sum (col.)  3000        2000            5000
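As a worked check (not on the original slide), the lift values implied by the contingency table:

  lift(basketball, cereal)     = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.40 / 0.45 ≈ 0.89 < 1
  lift(basketball, not cereal) = (1000/5000) / ((3000/5000) × (1250/5000)) = 0.20 / 0.15 ≈ 1.33 > 1

which confirms that playing basketball and eating cereal are negatively correlated despite the 66.7% confidence.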
Beyond Binary Data
• Hierarchies
  • drink → milk → low-fat milk → Stop&Shop low-fat milk …
  • find associations on any level
• Sequences over time
• …
Multiple-level Association Rules
• Items often form a hierarchy
• Flexible support settings: items at a lower level are expected to have lower support
• The transaction database can be encoded based on dimensions and levels
• Explore shared multi-level mining

Uniform support:
  Level 1 (min_sup = 5%): Milk [support = 10%]
  Level 2 (min_sup = 5%): 2% Milk [support = 6%], Skim Milk [support = 4%]

Reduced support:
  Level 1 (min_sup = 5%): Milk [support = 10%]
  Level 2 (min_sup = 3%): 2% Milk [support = 6%], Skim Milk [support = 4%]
Quantitative Association Rules

ID   Age  Salary  Marital status  NumCars
100  44   30 000  married         2
200  55   45 000  married         3
300  45   50 000  divorced        1
400  34   44 000  single          0
500  45   38 000  married         2
600  33   44 000  single          2

Sample rules                                           Support  Confidence
<Age: 44..55> and <Status: married> ==> <NumCars: 2>   50%      100%
<NumCars: 0..1> ==> <Married: No>                      33%      66.7%
Multi-dimensional Association
• Single-dimensional rules:
  buys(X, "milk") ⇒ buys(X, "bread")
• Multi-dimensional rules: ≥ 2 dimensions or predicates
  • Inter-dimension association rules (no repeated predicates):
    age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
  • Hybrid-dimension association rules (repeated predicates):
    age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
• Categorical attributes: finite number of possible values, no ordering among values
• Quantitative attributes: numeric, implicit ordering among values
Sequence Databases and Sequential Pattern Analysis
• Transaction databases, time-series databases vs. sequence databases
• Frequent patterns vs. (frequent) sequential patterns
• Applications of sequential pattern mining
  • Customer shopping sequences: first buy a computer, then a CD-ROM, and then a digital camera, within 3 months
  • Medical treatment, natural disasters (e.g., earthquakes), science & engineering processes, stocks and markets, etc.
  • Telephone calling patterns, Web log click streams
  • DNA sequences and gene structures
What Is Sequential Pattern Mining?
• Given a set of sequences, find the complete set of frequent subsequences

A sequence: <(ef) (ab) (df) c b>

A sequence database:
SID  sequence
10   <a(abc)(ac)d(cf)>
20   <(ad)c(bc)(ae)>
30   <(ef)(ab)(df)cb>
40   <eg(af)cbc>

• An element may contain a set of items; items within an element are unordered, and we list them alphabetically
• <a(bc)dc> is a subsequence of <a(abc)(ac)d(cf)> (a containment check is sketched below)
• Given the support threshold min_sup = 2, <(ab)c> is a sequential pattern
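A minimal containment test for such sequences of itemsets (a sketch, not code from the lecture): s is a subsequence of t if each element of s matches, in order, a superset element of t:

    def is_subsequence(s, t):
        # greedy left-to-right matching of elements (sets of items)
        i = 0
        for element in t:
            if i < len(s) and s[i] <= element:
                i += 1
        return i == len(s)

    t = [{"a"}, {"a","b","c"}, {"a","c"}, {"d"}, {"c","f"}]   # <a(abc)(ac)d(cf)>
    s = [{"a"}, {"b","c"}, {"d"}, {"c"}]                      # <a(bc)dc>
    print(is_subsequence(s, t))  # True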
Challenges in Sequential Pattern Mining
• A huge number of possible sequential patterns are hidden in databases
• A mining algorithm should
  • find the complete set of patterns, when possible, satisfying the minimum support (frequency) threshold
  • be highly efficient and scalable, involving only a small number of database scans
  • be able to incorporate various kinds of user-specific constraints
Applications
• Market basket analysis
  • store layout, client offers
• This analysis is applicable whenever a customer purchases multiple things in proximity
  • telecommunications (each customer is a transaction containing the set of phone calls)
  • weather analysis (each time interval is a transaction containing the set of observed events)
  • credit cards
  • banking services
  • medical treatments
• Finding unusual events
  • WSARE – What's Strange About Recent Events
• …
Conclusions
• Association rule mining
  • probably the most significant contribution from the database community to KDD
  • a large number of papers have been published
  • many interesting issues have been explored
• An interesting research direction
  • association analysis in other types of data: sequence data, spatial data, multimedia data, time series data, etc.
Summary
• Frequent itemsets
• Association rules
• Subset property
• Apriori algorithm
• Extensions of this algorithm
• Sequence patterns
• Applications
Any questions, remarks?