Frequent Itemsets
Association rules
and market basket analysis
Carlo Zaniolo
UCLA CSD
1
Association Rules & Correlations
Basic concepts
Efficient and scalable frequent itemset mining
methods:
Apriori, and improvements
FP-growth
Rule derivation, visualization and validation
Multi-level Associations
Summary
2
Market Basket Analysis: the context
Study customer buying habits by finding associations and
correlations between the different items that
customers place in their "shopping baskets"
Customer 1: milk, eggs, sugar, bread
Customer 2: milk, eggs, cereal, bread
Customer 3: eggs, sugar
3
Market Basket Analysis: the context
Given: a database of customer transactions, where
each transaction is a set of items
Find groups of items which are frequently
purchased together
4
Goal of MBA
Extract information on purchasing behavior
Actionable information: can suggest
new store layouts
new product assortments
which products to put on promotion
MBA is applicable whenever a customer purchases
multiple things in proximity:
credit cards
services of telecommunication companies
banking services
medical treatments
5
MBA: applicable to many other contexts
Telecommunication:
Each customer is a transaction containing the set
of customer’s phone calls
Atmospheric phenomena:
Each time interval (e.g., a day) is a transaction
containing the set of observed events (rain, wind,
etc.)
Etc.
6
Association Rules
Express how products/services relate to
each other and tend to group together
"if a customer purchases three-way calling,
then they will also purchase call-waiting"
simple to understand
actionable information: bundle three-way
calling and call-waiting in a single package
7
Frequent Itemsets
Transaction: two equivalent representations
Relational format <Tid, item>: <1, item1>, <1, item2>, <2, item3>
Compact format <Tid, itemset>: <1, {item1, item2}>, <2, {item3}>
Item: single element; Itemset: set of items
Support of an itemset I: # of transactions containing I
Minimum Support σ: threshold for support
Frequent Itemset: an itemset with support ≥ σ
Frequent itemsets represent sets of items which are
positively correlated
8
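To make these definitions concrete, here is a minimal Python sketch (not from the slides; the function name and the toy baskets are illustrative) that counts the support of an itemset over a set of transactions:

```python
def support(itemset, transactions):
    """Number of transactions that contain every item in `itemset`."""
    items = set(itemset)
    return sum(1 for t in transactions if items <= set(t))

# The three baskets from the market-basket slide:
transactions = [
    {"milk", "eggs", "sugar", "bread"},
    {"milk", "eggs", "cereal", "bread"},
    {"eggs", "sugar"},
]
print(support({"milk", "eggs"}, transactions))  # 2 -> frequent if the threshold is <= 2
```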
Frequent Itemsets Example
Transaction ID | Items Bought
1 | dairy, fruit
2 | dairy, fruit, vegetable
3 | dairy
4 | fruit, cereals

Support({dairy}) = 3 (75%)
Support({fruit}) = 3 (75%)
Support({dairy, fruit}) = 2 (50%)
If σ = 60%, then
{dairy} and {fruit} are frequent while {dairy, fruit}
is not.
9
Itemset Support & Rule Confidence
Let A and B be disjoint itemsets and let:
s = support(A ∪ B) and
c = support(A ∪ B) / support(A)
Then the rule A ⇒ B holds with support s and
confidence c: write A ⇒ B [s, c]
Objective of the mining task: find all rules with
minimum support σ
minimum confidence γ
Thus A ⇒ B [s, c] holds if: s ≥ σ and c ≥ γ
10
Association Rules: Meaning
A  B [ s, c ]
Support: denotes the frequency of the rule within
transactions. A high value means that the rule involve a
great part of database.
support(A  B [ s, c ]) = p(A  B)
Confidence: denotes the percentage of transactions
containing A which contain also B. It is an estimation of
conditioned probability .
confidence(A  B [ s, c ]) = p(B|A) = p(A & B)/p(A).
11
Association Rules - Example
Transaction ID | Items Bought
2000 | A, B, C
1000 | A, C
4000 | A, D
5000 | B, E, F

Min. support 50%
Min. confidence 50%

Frequent Itemset | Support
{A} | 75%
{B} | 50%
{C} | 50%
{A, C} | 50%

For rule A ⇒ C:
support = support({A, C}) = 50%
confidence = support({A, C})/support({A}) = 66.6%

The Apriori principle:
Any subset of a frequent itemset must be frequent
12
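The same computation for rule A ⇒ C, as a small Python check of the numbers above (an illustrative sketch, not part of the original slides):

```python
transactions = {2000: {"A", "B", "C"}, 1000: {"A", "C"},
                4000: {"A", "D"}, 5000: {"B", "E", "F"}}

def support(itemset):
    """Fraction of transactions containing every item of `itemset`."""
    return sum(1 for items in transactions.values() if itemset <= items) / len(transactions)

s = support({"A", "C"})   # 0.50
c = s / support({"A"})    # 0.50 / 0.75 = 0.666...
print(f"A => C  [support = {s:.0%}, confidence = {c:.1%}]")
```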
Scalable Methods for Mining Frequent Patterns
The downward closure property of frequent patterns
Every subset of a frequent itemset must be frequent
[antimonotonic property]
If {beer, diaper, nuts} is frequent, so is {beer, diaper}
i.e., every transaction having {beer, diaper, nuts} also
contains {beer, diaper}
Scalable mining methods: Three major approaches
Apriori (Agrawal & Srikant@VLDB’94)
Freq. pattern growth (FPgrowth—Han, Pei & Yin
@SIGMOD’00)
Vertical data format approach (Charm—Zaki & Hsiao
@SDM’02)
13
Frequent Patterns = Frequent Itemsets
Closed Patterns and Max-Patterns
A long pattern contains very many sub-patterns: a combinatorial explosion
An itemset is closed if none of its supersets has the
same support (i.e., frequency drops for each item added)
A closed pattern is a lossless compression of the frequent patterns,
reducing the # of patterns and rules
An itemset is maximal frequent if none of its
supersets is frequent (i.e., frequency drops below the threshold
for each item added); thus maximal itemsets are also closed.
14
Frequent Itemsets
Minimum support = 2

TID | Items
1 | ABC
2 | ABCD
3 | BCE
4 | ACDE
5 | DE

[Figure: the itemset lattice from null to ABCDE, with each itemset annotated with the TIDs of the transactions that contain it; the frequent itemsets are highlighted.]

# Frequent = 13
15
Maximal Frequent Itemset:
if none of its supersets is frequent
Minimum support = 2

TID | Items
1 | ABC
2 | ABCD
3 | BCE
4 | ACDE
5 | DE

[Figure: the same itemset lattice, with the maximal frequent itemsets highlighted.]

# Frequent = 13
# Maximal = 4
16
Closed Frequent Itemset:
None of its superset has the same support
Minimum support = 2

TID | Items
1 | ABC
2 | ABCD
3 | BCE
4 | ACDE
5 | DE

[Figure: the same itemset lattice, annotating each closed frequent itemset as either "closed and maximal" or "closed but not maximal".]

# Frequent = 13
# Closed = 9
# Maximal = 4
17
Maximal vs Closed Itemsets
As we move from a frequent itemset A to its
supersets, support can:
1. Remain the same.
2. Drop but still remain above the threshold.
Then A is closed.
3. Drop below the threshold: A is maximal
(and closed). Maximal itemsets are sufficient to
determine the frequent itemsets, but the
closed itemsets cannot be reconstructed
from them.
4. Crucial patterns: an information-preserving subset of closed patterns.
[Figure: nested sets, Maximal Frequent Itemsets ⊆ Closed Frequent Itemsets ⊆ Frequent Itemsets.]
18
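These definitions can be checked mechanically. The following Python sketch (my own illustration, using the transaction table from the lattice slides) brute-forces the frequent itemsets and then flags the closed and the maximal ones:

```python
from itertools import chain, combinations

# Transactions from the lattice slides (TIDs 1-5).
transactions = ["ABC", "ABCD", "BCE", "ACDE", "DE"]
minsup = 2
items = sorted(set("".join(transactions)))

def support(itemset):
    return sum(1 for t in transactions if set(itemset) <= set(t))

# Brute-force enumeration is fine at this tiny scale.
candidates = chain.from_iterable(combinations(items, k) for k in range(1, len(items) + 1))
frequent = {frozenset(c): support(c) for c in candidates if support(c) >= minsup}

# Closed: no immediate superset has the same support (supersets may be infrequent).
closed = {s for s, sup in frequent.items()
          if not any(s < frozenset(c) and support(c) == sup
                     for c in combinations(items, len(s) + 1))}
# Maximal: no proper superset is frequent.
maximal = {s for s in frequent if not any(s < t for t in frequent)}

print(len(closed), len(maximal))   # 9 4, matching the closed/maximal counts on the slides
```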
Apriori: A Candidate Generation-and-Test Approach
Apriori pruning principle: If there is any itemset
which is infrequent, its superset should not be
generated/tested! (Agrawal & Srikant @VLDB’94,
Mannila, et al. @ KDD’ 94)
Method:
Initially, scan the DB once to get the frequent 1-itemsets
Generate length-(k+1) candidate itemsets from length-k
frequent itemsets
Test the candidates against the DB
Terminate when no frequent or candidate set can be
generated
19
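A compact Python sketch of this generate-and-test loop (an illustration, not the original implementation; candidate generation here is a simplified stand-in for the self-join shown on a later slide):

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise generate-and-test (illustrative sketch of this slide)."""
    transactions = [frozenset(t) for t in transactions]
    # Scan the DB once for the frequent 1-itemsets.
    counts = {}
    for t in transactions:
        for item in t:
            counts[frozenset([item])] = counts.get(frozenset([item]), 0) + 1
    current = {s: c for s, c in counts.items() if c >= minsup}
    frequent, k = dict(current), 2
    while current:
        items = sorted({i for s in current for i in s})
        # Candidates: k-subsets whose (k-1)-subsets are all frequent
        # (a simplification of the self-join step described later).
        candidates = [frozenset(c) for c in combinations(items, k)
                      if all(frozenset(s) in current for s in combinations(c, k - 1))]
        # Test the candidates against the DB.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        current = {c: n for c, n in counts.items() if n >= minsup}
        frequent.update(current)
        k += 1
    return frequent

# The toy database used on the Apriori example slide below:
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
for itemset, sup in sorted(apriori(tdb, 2).items(), key=lambda x: (len(x[0]), sorted(x[0]))):
    print(set(itemset), sup)
```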
Association Rules & Correlations
Basic concepts
Efficient and scalable frequent itemset mining
methods:
Apriori, and improvements
20
The Apriori Algorithm—An Example
Database TDB
Tid | Items
10 | A, C, D
20 | B, C, E
30 | A, B, C, E
40 | B, E

Supmin = 2

C1 (1st scan):
Itemset | sup
{A} | 2
{B} | 3
{C} | 3
{D} | 1
{E} | 3

L1:
Itemset | sup
{A} | 2
{B} | 3
{C} | 3
{E} | 3

C2:
Itemset
{A, B}
{A, C}
{A, E}
{B, C}
{B, E}
{C, E}

C2 (2nd scan):
Itemset | sup
{A, B} | 1
{A, C} | 2
{A, E} | 1
{B, C} | 2
{B, E} | 3
{C, E} | 2

L2:
Itemset | sup
{A, C} | 2
{B, C} | 2
{B, E} | 3
{C, E} | 2

C3:
Itemset
{B, C, E}

L3 (3rd scan):
Itemset | sup
{B, C, E} | 2
21
Important Details of Apriori
How to generate candidates?
Step 1: self-joining Lk
Step 2: pruning
How to count supports of candidates?
Example of Candidate-generation
L3={abc, abd, acd, ace, bcd}
Self-joining: L3*L3
abcd from abc and abd
acde from acd and ace
Pruning:
acde is removed because ade is not in L3
C4={abcd}
22
How to Generate Candidates?
Suppose the items in L_{k-1} are listed in an order
Step 1: self-joining L_{k-1}
  insert into C_k
  select p.item_1, p.item_2, …, p.item_{k-1}, q.item_{k-1}
  from L_{k-1} p, L_{k-1} q
  where p.item_1 = q.item_1, …, p.item_{k-2} = q.item_{k-2}, p.item_{k-1} < q.item_{k-1}
Step 2: pruning
  forall itemsets c in C_k do
    forall (k-1)-subsets s of c do
      if (s is not in L_{k-1}) then delete c from C_k
23
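The same join-then-prune logic as a Python sketch (illustrative; itemsets are kept as sorted tuples so the "equal on the first k-2 items, smaller on the last" join condition can be applied directly):

```python
from itertools import combinations

def apriori_gen(Lk_1):
    """Generate C_k from L_{k-1}: self-join, then prune (sketch of this slide)."""
    Lk_1 = sorted(Lk_1)                      # itemsets as sorted tuples, in lexical order
    Ck = set()
    # Step 1: self-join -- merge pairs that agree on the first k-2 items.
    for p, q in combinations(Lk_1, 2):
        if p[:-1] == q[:-1] and p[-1] < q[-1]:
            Ck.add(p + (q[-1],))
    # Step 2: prune -- drop candidates having an infrequent (k-1)-subset.
    Lset = set(Lk_1)
    return {c for c in Ck
            if all(sub in Lset for sub in combinations(c, len(c) - 1))}

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
print(apriori_gen(L3))   # {('a','b','c','d')} -- 'acde' is pruned since 'ade' is not in L3
```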
How to Count Supports of Candidates?
Why is counting the supports of candidates a problem?
The total number of candidates can be huge
One transaction may contain many candidates
Data Structures used:
Candidate itemsets can be stored in a hash-tree
or in a prefix-tree (trie)--example
24
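As a much simpler stand-in for a hash-tree (my own sketch, not the hash-tree structure itself), candidates can be counted by enumerating each transaction's k-subsets and looking them up in a hash table; the cost of enumerating subsets of long transactions is exactly what hash-trees and prefix-trees mitigate:

```python
from itertools import combinations

def count_supports(Ck, transactions, k):
    """Count each candidate k-itemset via the k-subsets of every transaction."""
    counts = {c: 0 for c in Ck}
    for t in transactions:
        # A transaction with |t| items contributes C(|t|, k) subsets to check.
        for sub in combinations(sorted(t), k):
            if sub in counts:
                counts[sub] += 1
    return counts

C2 = [("A", "C"), ("B", "C"), ("B", "E"), ("C", "E")]
tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
print(count_supports(C2, tdb, 2))
# {('A', 'C'): 2, ('B', 'C'): 2, ('B', 'E'): 3, ('C', 'E'): 2}
```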
Effect of Support Distribution
Many real data sets have skewed support
distribution
[Figure: support distribution of a retail data set]
© Tan, Steinbach, Kumar, Introduction to Data Mining, 2004
Effect of Support Distribution
How to set the appropriate minsup threshold?
– If minsup is set too high, we could miss itemsets
involving interesting rare items (e.g., expensive
products)
– If minsup is set too low, it is computationally
expensive and the number of itemsets is very large
Using a single minimum support threshold may
not be effective
Rule Generation
How to efficiently generate rules from frequent
itemsets?
– In general, confidence does not have an anti-monotone property:
c(ABC ⇒ D) can be larger or smaller than c(AB ⇒ D)
– But the confidence of rules generated from the same
itemset has an anti-monotone property
– e.g., L = {A,B,C,D}:
c(ABC ⇒ D) ≥ c(AB ⇒ CD) ≥ c(A ⇒ BCD)
Confidence is anti-monotone w.r.t. the number of items on the
RHS of the rule
Rule Generation
Given a frequent itemset L, find all non-empty subsets f ⊂ L
such that f ⇒ L–f satisfies the minimum confidence
requirement.
Theorem: If AB ⇒ C fails confidence, so does A ⇒ BC
If |L| = k, then there are 2^k candidate association rules
(including L ⇒ ∅ and ∅ ⇒ L)
– Example: L = {A,B,C,D} is the frequent itemset, then
– The candidate rules are:
ABC ⇒ D, ABD ⇒ C, ACD ⇒ B, BCD ⇒ A,
AB ⇒ CD, AC ⇒ BD, AD ⇒ BC, BC ⇒ AD, BD ⇒ AC, CD ⇒ AB,
A ⇒ BCD, B ⇒ ACD, C ⇒ ABD, D ⇒ ABC
But antimonotonicity will make things converge fast by building
on (i) successes and (ii) failures.
Rule Generation (i) and (ii)
1. A candidate rule is generated by merging two rules that
share the same prefix in the rule consequent
2. If AD ⇒ BC fails, then prune rule D ⇒ ABC
3. If AD ⇒ BC and BD ⇒ AC succeed,
then test candidate rule D ⇒ ABC
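A Python sketch of this consequent-growing procedure (my own illustration; the supports dictionary and the minconf value below are hypothetical, the dictionary is assumed to hold the support count of every itemset involved):

```python
from itertools import combinations

def gen_rules(L, supports, minconf):
    """Confident rules f => L-f from frequent itemset L, growing consequents level-wise."""
    L = frozenset(L)
    rules = []
    H = [frozenset([i]) for i in L]          # candidate 1-item consequents
    while H and len(next(iter(H))) < len(L):
        confident = []
        for h in H:
            conf = supports[L] / supports[L - h]
            if conf >= minconf:
                rules.append((L - h, h, conf))
                confident.append(h)          # only surviving consequents are extended
            # else: prune -- by anti-monotonicity no superset of h can succeed
        # Next level: unions of surviving consequents that differ by exactly one item.
        H = {a | b for a, b in combinations(confident, 2) if len(a | b) == len(a) + 1}
    return rules

# Hypothetical support counts for L = {A,B,C,D} and all of its non-empty subsets:
supports = {frozenset(s): n for s, n in
            [("ABCD", 2), ("ABC", 3), ("ABD", 2), ("ACD", 2), ("BCD", 2),
             ("AB", 3), ("AC", 4), ("AD", 3), ("BC", 4), ("BD", 3), ("CD", 3),
             ("A", 5), ("B", 5), ("C", 6), ("D", 4)]}
for antecedent, consequent, conf in gen_rules("ABCD", supports, minconf=0.6):
    print(set(antecedent), "=>", set(consequent), round(conf, 2))
```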
Lattice of rules for L = {A,B,C,D}:
confidence(f ⇒ L–f) = support(L)/support(f)
[Figure: the lattice of rules from ABCD ⇒ {} down to A ⇒ BCD, B ⇒ ACD, C ⇒ ABD, D ⇒ ABC. A low-confidence rule (here CD ⇒ AB) causes all rules below it in the lattice, i.e., those whose consequent contains AB, to be pruned.]
Rules: some useful, some trivial,
others inexplicable
Useful: "On Thursdays, grocery store consumers
often purchase diapers and beer together".
Trivial: "Customers who purchase maintenance
agreements are very likely to purchase large
appliances".
Inexplicable: "When a new hardware store
opens, one of the most-sold items is toilet rings."
Conclusion: inferred rules must be validated by a
domain expert before they can be used in the
marketplace: post-mining of association rules.
31
Mining for Association Rules
The main steps in the process
1. Select a minimum support/confidence level
2. Find the frequent itemsets
3. Find the association rules
4. Validate (postmine) the rules so found.
32
Mining for Association Rules: Checkpoint
Apriori opened up a big commercial market for DM
association rules came from the DB field, classifiers from AI;
clustering precedes both … and DM
Many open problem areas, including
1. Performance: faster algorithms needed for frequent
itemsets
2. Improving the statistical/semantic significance of rules
3. Data stream mining for association rules: even faster
algorithms needed, incremental computation, adaptability,
etc. Also the post-mining process becomes more challenging.
33
Performance: Efficient
Implementation of Apriori in SQL
Hard to get good performance out of pure SQL (SQL-92) based
approaches alone
Make use of object-relational extensions like UDFs, BLOBs,
Table functions etc.
S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association
rule mining with relational database systems: Alternatives and
implications. In SIGMOD’98
A much better solution: use UDAs—native or imported.
Haixun Wang and Carlo Zaniolo: ATLaS: A Native Extension of SQL for
Data Mining. SIAM International Conference on Data Mining 2003,
San Francisco, CA, May 1-3, 2003
34
Performance for Apriori
Challenges
Multiple scans of transaction database [not for data
streams]
Huge number of candidates
Tedious workload of support counting for candidates
Many Improvements suggested: general ideas
Reduce passes of transaction database scans
Shrink number of candidates
Facilitate counting of candidates
35
Partition: Scan Database Only Twice
Any itemset that is potentially frequent in DB must be
frequent in at least one of the partitions of DB
Scan 1: partition database and find local frequent patterns
Scan 2: consolidate global frequent patterns
A. Savasere, E. Omiecinski, and S. Navathe. An efficient
algorithm for mining association rules in large databases. In
VLDB'95
Does this scale up to larger partitions?
36
Sampling for Frequent Patterns
Select a sample S of original database, mine
frequent patterns within sample using Apriori
To avoid losses, mine with a support threshold lower than
the one required
Scan rest of database to find exact counts.
H. Toivonen. Sampling large databases for association rules.
In VLDB’96
37
DIC: Reduce Number of Scans
[Figure: the itemset lattice from {} up to ABCD, with a timeline comparing Apriori, which counts 1-itemsets, then 2-itemsets, … in separate passes, against DIC, which starts counting longer itemsets part-way through a pass.]
Once both A and D are determined
frequent, the counting of AD begins
Once all length-2 subsets of BCD are
determined frequent, the counting of
BCD begins
S. Brin, R. Motwani, J. Ullman,
and S. Tsur. Dynamic itemset
counting and implication rules for
market basket data. In
SIGMOD'97
38
Improving Performance (cont.)
Apriori: multiple database scans are costly
Mining long patterns needs many passes of scanning and
generates lots of candidates
To find the frequent itemset i1i2…i100:
# of scans: 100
# of candidates: (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 - 1 ≈
1.27 × 10^30 !
Bottleneck: candidate-generation-and-test
Can we avoid candidate generation?
39
Mining Frequent Patterns
Without Candidate Generation
FP-Growth Algorithm
1. Build FP-tree: items are listed by decreasing frequency
2. For each suffix (recursively)
Build its conditionalized subtree
and compute its frequent items
An order of magnitude faster than Apriori
40
Frequent Patterns (FP) Algorithm
The algorithm consists of two steps:
Step 1:
build the FP-tree (Frequent Pattern Tree).
Step 2:
use the FP-Growth algorithm to find
frequent itemsets from the FP-tree.
_________________________________________
These slides are based on those by:
Yousry Taha, Taghrid Al-Shallali, Ghada AL Modaifer, Nesreen AL Boiez
41
Frequent Pattern Tree Algorithm:
Example
T-ID | List of Items
101 | Milk, bread, cookies, juice
792 | Milk, juice
1130 | Milk, eggs
1735 | Bread, cookies, coffee
• The first scan of the database is the same as in Apriori: it derives the set
of 1-itemsets and their support counts.
• The set of frequent items is sorted in order of descending
support count.
• An FP-tree is constructed.
• The FP-tree is conditionalized and mined for frequent itemsets.
42
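A minimal FP-tree construction in Python (an illustrative sketch of the steps above, not the full FP-growth miner; class and function names are my own). Infrequent items are dropped, each transaction is reordered by descending global frequency, and shared prefixes are merged into counted nodes:

```python
from collections import Counter

class Node:
    def __init__(self, item, parent):
        self.item, self.count, self.parent, self.children = item, 0, parent, {}

def build_fp_tree(transactions, minsup):
    freq = Counter(item for t in transactions for item in t)
    freq = {i: c for i, c in freq.items() if c >= minsup}    # frequent 1-itemsets only
    order = sorted(freq, key=lambda i: -freq[i])             # descending support count
    root, header = Node(None, None), {i: [] for i in order}  # header: item -> node links
    for t in transactions:
        node = root
        # Keep only frequent items, sorted by descending global frequency.
        for item in sorted((i for i in t if i in freq), key=order.index):
            if item not in node.children:
                node.children[item] = Node(item, node)
                header[item].append(node.children[item])
            node = node.children[item]
            node.count += 1
    return root, header

tdb = [["milk", "bread", "cookies", "juice"], ["milk", "juice"],
       ["milk", "eggs"], ["bread", "cookies", "coffee"]]
root, header = build_fp_tree(tdb, minsup=2)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
# {'milk': 3, 'bread': 2, 'cookies': 2, 'juice': 2}  -- the item header table
```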
FP-Tree for
T-ID | List of Items
101 | Milk, bread, cookies, juice
792 | Milk, juice
1130 | Milk, eggs
1735 | Bread, cookies, coffee

Item header table (Item Id, Support, Node-link):
milk | 3
bread | 2
cookies | 2
juice | 2

[Figure: the FP-tree. From the NULL root, one branch Milk:3 with children (Bread:1 → Cookies:1 → Juice:1) and (Juice:1), plus a second branch Bread:1 → Cookies:1.]
43
FP-Growth: for each suffix find (1) its supporting paths,
(2) its conditional FP-tree, and (3) the frequent patterns
with such an ending (suffix)
Suffix | Tree paths supporting suffix (conditional pattern base) | Conditional FP-Tree | Frequent pattern generated
juice | {(milk, bread, cookies: 1), (milk: 1)} | {milk: 2} | {juice, milk: 2}
cookies | {(milk, bread: 1), (bread: 1)} | {bread: 2} | {cookies, bread: 2}
bread | {(milk: 1)} | - | -
milk | - | - | -
… then expand the suffix and repeat these operations
45
Conditional pattern base: prefix paths having a given suffix; we start
from the least frequent one: juice
[Figure: the FP-tree with the two prefix paths ending in Juice:1 highlighted, (Milk → Bread:1 → Cookies:1 → Juice:1) and (Milk → Juice:1), giving the conditional pattern base {(milk, bread, cookies: 1), (milk: 1)}.]
46
Conditionalized tree for Suffix “Juice”
NULL → Milk:2
Thus (Juice, Milk: 2) is a frequent pattern
47
Now Patterns with Suffix “Cookies”
[Figure: the item header table marks juice as done, cookies as the current suffix (NOW), and bread as next. The prefix paths ending in Cookies:1 are (Milk → Bread:1) and (Bread:1); the conditional FP-tree for "cookies" collapses them to Bread:2.]
Thus (Cookies, Bread: 2) is frequent
48
Why Is Frequent Pattern Growth Fast?
• Performance study shows
FP-growth is an order of magnitude faster than Apriori
• Reasoning
− No candidate generation, no candidate test
− Use compact data structure
− Eliminate repeated database scan
− Basic operation is counting and FP-tree building
49
Other types of Association RULES
• Association Rules among Hierarchies.
• Multidimensional Association
• Negative Association
50
FP-growth vs. Apriori: Scalability
With the Support Threshold
Data set T25I20D10K
[Figure: run time (sec.) vs. support threshold (%), comparing D1 FP-growth runtime with D1 Apriori runtime for thresholds between 0 and 3%.]
51
FP-growth vs. Apriori: Scalability
With Number of Transactions
Data set T25I20D100K (1.5%)
[Figure: run time (sec.) vs. number of transactions (0–100K), comparing FP-growth with Apriori.]
52
FP-Growth: pros and cons
FP-tree is complete
Preserves complete information for frequent pattern mining
Never breaks a long pattern of any transaction
FP-tree is compact
Reduces irrelevant info: infrequent items are gone
Items are in frequency-descending order: the more frequently
occurring, the more likely to be shared
Never larger than the original database (not counting node-links
and the count fields)
The FP-tree is generated in one scan of the database (data
stream mining?)
However, deriving the frequent patterns from the FP-tree is
still computationally expensive; improved algorithms are needed for
data streams.
53
Rule Validations
Only a small subset of derived rules might
be meaningful/useful
Domain expert must validate the rules
Useful tools:
Visualization
Correlation analysis
54
Association Rules & Correlations
Basic concepts
Efficient and scalable frequent itemset mining
methods:
Apriori, and improvements
FP-growth
Rule derivation, visualization and validation
Multi-level Associations
Summary
55
Finding Frequent Sequential Patterns
The problem:
Given a sequence of discrete events that may
repeat:
A B A C D A C E B A B C…
Find patterns that repeat frequently.
For example:
A followed by B (A->B), or AC followed by D
(AC->D)
The patterns should occur within a window W.
Applications in telecommunication data, networks,
biology
Summary
Association rule mining
probably the most significant contribution from
the database community in KDD
New interesting research directions
Association analysis in other types of data:
spatial data, multimedia data, time series data,
Association Rule Mining for Data Streams:
a very difficult challenge.
57
Post-Processing: Rule Validation
Interestingness Measures
58
Visualization of Association Rules: Plane Graph
59
Computing Interestingness Measure
Given a rule X  Y, information needed to compute rule
interestingness can be obtained from a contingency table
Contingency table for X  Y
Y
Y
X
f11
f10
f1+
X
f01
f00
fo+
f+1
f+0
|T|
f11: support of X and Y
f10: support of X and Y
f01: support of X and Y
f00: support of X and Y
Used to define various measures
support, confidence, lift, Gini,
J-measure, etc.
60
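A small Python helper (my own sketch) that derives support, confidence, and lift for X ⇒ Y directly from the four contingency counts, checked against the tea/coffee table on the next slides:

```python
def measures(f11, f10, f01, f00):
    """Support, confidence and lift of X => Y from a 2x2 contingency table."""
    n = f11 + f10 + f01 + f00                     # |T|
    support = f11 / n                             # P(X and Y)
    confidence = f11 / (f11 + f10)                # P(Y | X)
    lift = confidence / ((f11 + f01) / n)         # P(Y | X) / P(Y)
    return support, confidence, lift

# Tea/Coffee table below: f11 = 15, f10 = 5, f01 = 75, f00 = 5
print(measures(15, 5, 75, 5))   # (0.15, 0.75, 0.833...) -- lift < 1
```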
Drawback of Confidence
        Coffee   ¬Coffee
Tea        15        5      20
¬Tea       75        5      80
           90       10     100

Association Rule: Tea ⇒ Coffee
Daily drinks of 100 people: 75 {Coffee only}, 5 {Tea only}, 15 {Coffee, Tea}  [support: 15/100]
Confidence = P(Coffee|Tea) = 15/20 = 0.75
but P(Coffee) = 0.9 > 0.75
⇒ Confidence is high, yet the rule is misleading since drinking tea
lowers the probability of coffee. Or conversely:
⇒ P(Coffee|¬Tea) = 75/80 = 0.9375 >> 0.75
61
A Better Measure: Lift/Interest

        Coffee   ¬Coffee
Tea        15        5      20
¬Tea       75        5      80
           90       10     100

Association Rule: Tea ⇒ Coffee
lift(Tea ⇒ Coffee) = c(Tea ⇒ Coffee)/s(Coffee) = s(Tea ∧ Coffee)/(s(Tea) · s(Coffee))
Confidence = P(Coffee|Tea) = 0.75
but P(Coffee) = 0.9
⇒ Lift = 0.75/0.9 = 0.8333 (< 1, therefore Tea and Coffee are negatively associated)
62
Drawback of Lift & Interest
Statistical independence:
If P(X,Y) = P(X)P(Y) => Lift = 1

        Y     ¬Y
 X     10      0     10
¬X      0     90     90
       10     90    100

Lift = 0.1 / (0.1 × 0.1) = 10

        Y     ¬Y
 X     90      0     90
¬X      0     10     10
       90     10    100

Lift = 0.9 / (0.9 × 0.9) = 1.11

Lift favors infrequent items
Other criteria proposed: Gini,
J-measure, etc.
63
Statistical Independence
Population of 1000 students
600 students know how to swim (S)
700 students know how to bike (B)
420 students know how to swim and bike (S,B)
P(SB) = 420/1000 = 0.42
P(S)  P(B) = 0.6  0.7 = 0.42
P(SB) = P(S)  P(B) => Statistical
independence
P(SB) > P(S)  P(B) => Positively correlated
P(SB) < P(S)  P(B) => Negatively correlated
64
Statistical Measures
Measures that take into account statistical
dependence
Lift: does X lift the probability of Y? I.e., the
probability of Y given X over the probability of Y:
Lift = P(Y|X) / P(Y)
Interest:
Interest = P(X,Y) / (P(X) P(Y))
I = 1: independence; I > 1: positive association (< 1: negative)
PS = P(X,Y) - P(X) P(Y)
65
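Plugging the swim/bike numbers from the statistical-independence slide into these formulas (a quick illustrative check):

```python
p_s, p_b, p_sb = 0.6, 0.7, 0.42      # P(S), P(B), P(S,B) from the slide

lift = (p_sb / p_s) / p_b            # P(B|S) / P(B)
interest = p_sb / (p_s * p_b)        # P(S,B) / (P(S) P(B))
ps = p_sb - p_s * p_b                # the PS measure above

print(lift, interest, ps)            # ~1.0, ~1.0, ~0.0 -> statistical independence
```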
There are lots of
measures proposed
in the literature
Some measures are
good for certain
applications, but not
for others
What criteria should
we use to determine
whether a measure
is good or bad?
What about Apriori-style support-based
pruning? How does it affect these
measures?
66
Pattern Evaluation
Association rule algorithms tend to produce
too many rules
many of them are uninteresting or redundant
confidence(A B) = p(B|A) = p(A & B)/p(A)
Confidence is not discriminative enough criterion
Beyond original support & confidence
Interestingness measures can be used to
prune/rank the derived patterns
67
Other Association Mining Methods
CHARM: Mining frequent itemsets by a Vertical Data Format
Mining Frequent Closed Patterns
Mining Max-patterns
Mining Quantitative Associations [e.g., what is the implication
between age and income?]
Constraint-based association mining
Frequent Patterns in Data Streams: very difficult problem.
Performance is a real issue
Constraint-based (Query-Directed) Mining
Mining sequential and structured patterns
68