Market Basket Analysis: an Introduction
José Miguel Hernández Lobato
Computational and Biological Learning Laboratory
Cambridge University
20/09/2011
Market Basket Analysis
• Allows us to identify patterns in customer purchases.
• Ideally, we would like to answer questions like
– What products tend to be bought together?
– What products may benefit from promotion?
– What are the best cross‐selling opportunities?
• The often‐quoted example of beer and nappies (an urban legend, in fact).
Transaction Data
• A store sells a large set of products P.
• A transaction (basket) T ⊆ P is the set of products bought by a customer at a particular time.
• The set of transactions D is often encoded as a sparse binary matrix (which can be very large).
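The sparse binary encoding can be sketched as follows; the product names and baskets below are made up for illustration:

```python
# Encode a small set of transactions as a binary matrix
# (rows = baskets, columns = products; 1 = product in basket).
products = ["beer", "bread", "milk", "nappies"]

transactions = [
    {"beer", "nappies"},
    {"bread", "milk"},
    {"beer", "bread", "milk", "nappies"},
]

# One row per transaction, one column per product.
matrix = [[1 if p in t else 0 for p in products] for t in transactions]

for row in matrix:
    print(row)
```

In practice the matrix is stored in a sparse format, since a typical basket contains only a tiny fraction of the products on sale.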
Association Rules
• Most popular method for MBA.
• Generates rules of the form A → B.
• A and B are arbitrary disjoint sets of products (often |B| = 1).
• A → B implies that if A occurs in a particular basket, then B should occur in that basket too.
Filtering the Rules
• Only a reduced subset of interesting rules is generated.
• R(s, c) is the set of rules obtained from D with
  – Minimum support s ∈ [0, 1].
  – Minimum confidence c ∈ [0, 1].
• Let D_A be the set of transactions containing every product in A (and similarly D_B for B).
• Then R(s, c) is the set of rules A → B such that:

  |D_A ∩ D_B| / |D| ≥ s      (minimum support)
  |D_A ∩ D_B| / |D_A| ≥ c    (minimum confidence)
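The two thresholds can be computed directly from their definitions; the transactions below are hypothetical:

```python
# Support and confidence of a rule A -> B, computed from the definitions:
# support = |D_A ∩ D_B| / |D|, confidence = |D_A ∩ D_B| / |D_A|.
transactions = [
    {"beer", "nappies"},
    {"beer", "bread"},
    {"beer", "bread", "nappies"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every product in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(A, B, transactions):
    """supp(A ∪ B) / supp(A): how often B appears given that A appears."""
    return support(A | B, transactions) / support(A, transactions)

A, B = {"beer"}, {"nappies"}
print(support(A | B, transactions))
print(confidence(A, B, transactions))
```

A rule survives filtering only if both quantities clear their respective thresholds s and c.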
APRIORI Algorithm
• Given D, we can generate R(s, c) very efficiently.
• APRIORI algorithm:
  1. Identify the Frequent Item Sets (FIS), i.e. the sets F ⊆ P such that |D_F| / |D| ≥ s.
     – For reasonably high s the total number of FIS should be small.
     – F is frequent ⇒ any subset of F is also frequent.
     – FIS of size k contain subsets of size k − 1 which are also FIS.
  2. For any frequent item set F and any B ∈ F, generate the rule F \ {B} → B if |D_F| / |D_{F \ {B}}| ≥ c.
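Step 2 can be sketched as follows; the item-set supports below are hypothetical, standing in for values found in step 1:

```python
# Rule-generation step: for a frequent item set F and each product B in F,
# emit the rule F \ {B} -> B when its confidence supp(F)/supp(F \ {B})
# clears the minimum confidence c. Supports here are made up.
supports = {
    frozenset({"beer"}): 0.6,
    frozenset({"nappies"}): 0.5,
    frozenset({"beer", "nappies"}): 0.4,
}

def rules_from_itemset(F, supports, min_conf):
    rules = []
    for b in F:
        antecedent = F - {b}
        if not antecedent:
            continue  # skip rules with an empty left-hand side
        conf = supports[F] / supports[antecedent]
        if conf >= min_conf:
            rules.append((antecedent, frozenset({b}), conf))
    return rules

F = frozenset({"beer", "nappies"})
for A, B, conf in rules_from_itemset(F, supports, min_conf=0.6):
    print(set(A), "->", set(B), round(conf, 2))
```

Because every rule's support equals the support of F itself, the support threshold is already guaranteed by step 1; only confidence needs checking here.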
Efficient Identification of FIS
1. The FIS with k items are stored in a tree structure.
2. The tree is extended with the candidate sets that contain k + 1 items.
3. They are the union of two FIS with k items and a common parent.
4. A single pass through D eliminates the candidates F such that |D_F| / |D| < s.
• The cost is determined by the number of candidate item sets.
• If s is very low, the number of candidates grows exponentially.
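The level-wise search above can be sketched without the tree structure (using plain sets of frozensets); the baskets are made up for illustration:

```python
# Level-wise identification of frequent item sets: candidates of size k+1
# are unions of two frequent k-sets, pruned via the Apriori property, and
# one pass over the data removes candidates below minimum support.
from itertools import combinations

transactions = [
    {"beer", "nappies"},
    {"beer", "bread", "nappies"},
    {"beer", "bread"},
    {"bread", "milk"},
]

def frequent_itemsets(transactions, min_support):
    n = len(transactions)
    items = sorted({p for t in transactions for p in t})
    # Level 1: frequent single products.
    level = {}
    for p in items:
        s = sum(1 for t in transactions if p in t) / n
        if s >= min_support:
            level[frozenset({p})] = s
    result = dict(level)
    k = 1
    while level:
        # Candidates of size k+1: unions of two frequent k-sets.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
        level = {}
        for c in candidates:
            # Apriori pruning: every k-subset must itself be frequent.
            if any(frozenset(sub) not in result for sub in combinations(c, k)):
                continue
            s = sum(1 for t in transactions if c <= t) / n
            if s >= min_support:
                level[c] = s
        result.update(level)
        k += 1
    return result

fis = frequent_itemsets(transactions, min_support=0.5)
for itemset, s in sorted(fis.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), s)
```

The tree structure in the slide serves the same purpose as the subset check here, but avoids scanning all pairs of frequent sets when generating candidates.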
Other Interest Measures
• A large minimum support s is used to keep low the number of rules found.
• But this removes potentially interesting rules.
• In practice, generating a large number of rules is unavoidable.
• Additional interest measures can be used to filter the rules.
• Interestingness is often “deviation from independence”.
• The lift of a rule A → B is defined as

  lift(A → B) = supp(A ∪ B) / (supp(A) · supp(B))

• If A and B are perfectly independent then lift(A → B) = 1.
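Lift follows directly from the definition above; the transactions are hypothetical:

```python
# Lift of a rule A -> B: supp(A ∪ B) / (supp(A) * supp(B)).
# A value near 1 suggests independence; well above 1 suggests a genuine
# association between A and B.
transactions = [
    {"beer", "nappies"},
    {"beer", "nappies"},
    {"beer"},
    {"milk"},
]

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def lift(A, B, transactions):
    return support(A | B, transactions) / (
        support(A, transactions) * support(B, transactions)
    )

print(lift({"beer"}, {"nappies"}, transactions))
```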
Available Software
• R includes an implementation of Apriori in the arules package.
• Based on the efficient C code developed by Christian Borgelt.
• The arulesViz package allows one to visualize association rules.
Calling Apriori
Top 7 Rules According to Lift
Visualizing the Rules: Scatter Plot
o {liquor, red/blush wine} → {bottled beer}
  Support: 0.001931876, Confidence: 0.9047619, Lift: 11.23527
o {root vegetables, yogurt} → {whole milk}
  Support: 0.01453991, Confidence: 0.5629921, Lift: 2.203354
Grouped Matrix‐based Visualization
Graph‐based Visualization
Large Scale Example: Netflix Data
Top 15 Rules According to Lift
Scatter Plot
o {Pirates of the Caribbean: The Curse of the Black Pearl, Lord of the Rings: The Fellowship of the Ring, Lord of the Rings: The Return of the King} → {Lord of the Rings: The Two Towers}
  Support: 0.1439382, Confidence: 0.9549493, Lift: 3.447636
o {Monsters} → {Finding Nemo}
  Support: 0.1678217, Confidence: 0.7320968, Lift: 2.828278
Advantages and Disadvantages of Association Rules
• Advantages:
  1. Computationally efficient (as long as s is large!).
  2. Individual rules are easily interpreted.
  3. Well‐researched method.
• Disadvantages:
  1. Not clear how to choose s and c.
  2. The number of rules obtained is often very large.
  3. Difficult to isolate interesting patterns. No widely‐accepted procedure.
  4. Sometimes most of the rules obtained are trivial.
  5. Lack of a rigorous probabilistic model for the data.
Useful References
• R. Agrawal, T. Imielinski, and A. Swami (1993). Mining association rules between sets of items in large databases. ACM SIGMOD International Conference on Management of Data, 207–216.
• C. Borgelt and R. Kruse (2002). Induction of Association Rules: Apriori Implementation. 15th Conference on Computational Statistics (COMPSTAT 2002).
• M. Hahsler, C. Buchta, B. Gruen, and K. Hornik (2010). arules: Mining Association Rules and Frequent Itemsets. R package version 1.0‐3. http://CRAN.R-project.org/package=arules
• M. Hahsler and S. Chelluboina (2011). arulesViz: Visualizing Association Rules. R package version 0.1‐1. http://CRAN.R-project.org/package=arulesViz
• O. Maimon and L. Rokach (eds.) (2010). Data Mining and Knowledge Discovery Handbook, 2nd ed. Springer.
Thank you for your attention!