Market Basket Analysis: an Introduction
José Miguel Hernández Lobato
Computational and Biological Learning Laboratory
Cambridge University
20/09/2011
Market Basket Analysis
• Allows us to identify patterns in customer purchases.
• Ideally, we would like to answer questions like
– What products tend to be bought together?
– What products may benefit from promotion?
– What are the best cross‐selling opportunities?
• The often‐quoted example of beer and nappies (an urban legend, in fact).
Transaction Data
• A store sells a large set of products P.
• A transaction (basket) T ⊆ P is the set of products bought by a customer at a particular time.
• The set of transactions D is often encoded as a sparse binary matrix (which can be very large).
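The sparse binary encoding can be sketched as follows; the product names and baskets below are made up for illustration:

```python
# Encode a small set of transactions as a binary matrix
# (rows = baskets, columns = products; 1 = product in basket).
products = ["beer", "bread", "milk", "nappies"]

transactions = [
    {"beer", "nappies"},
    {"bread", "milk"},
    {"beer", "bread", "milk", "nappies"},
]

# One row per transaction, one column per product.
matrix = [[1 if p in t else 0 for p in products] for t in transactions]

for row in matrix:
    print(row)
```

In practice the matrix is stored in a sparse format, since a typical basket contains only a tiny fraction of the products on sale.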
Association Rules
• Most popular method for MBA.
• Generates rules of the form A → B.
• A and B are arbitrary disjoint sets of products (often |B| = 1).
• A → B implies that if A occurs in a particular basket, then B should occur in that basket too.
Filtering the Rules
• Only a reduced subset of interesting rules is generated.
• R(s, c) is the set of rules obtained from D with
  – Minimum support s ∈ [0, 1].
  – Minimum confidence c ∈ [0, 1].
• Let D_A be the set of transactions containing every product in A (and similarly D_B for B).
• Then R(s, c) is the set of rules A → B such that:

  |D_A ∩ D_B| / |D| ≥ s      (minimum support)
  |D_A ∩ D_B| / |D_A| ≥ c    (minimum confidence)
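The two thresholds can be computed directly from their definitions; the transactions below are hypothetical:

```python
# Support and confidence of a rule A -> B, computed from the definitions:
# support = |D_A ∩ D_B| / |D|, confidence = |D_A ∩ D_B| / |D_A|.
transactions = [
    {"beer", "nappies"},
    {"beer", "bread"},
    {"beer", "bread", "nappies"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions containing every product in `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(A, B, transactions):
    """supp(A ∪ B) / supp(A): how often B appears given that A appears."""
    return support(A | B, transactions) / support(A, transactions)

A, B = {"beer"}, {"nappies"}
print(support(A | B, transactions))
print(confidence(A, B, transactions))
```

A rule survives filtering only if both quantities clear their respective thresholds s and c.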
APRIORI Algorithm
• Given D, we can generate R(s, c) very efficiently.
• APRIORI algorithm:
  1. Identify the Frequent Item Sets (FIS), i.e. the sets F ⊆ P such that |D_F| / |D| ≥ s.
     – For reasonably high s the total number of FIS should be small.
     – F is frequent ⇒ any subset of F is also frequent.
     – FIS of size k contain subsets of size k − 1 which are also FIS.
  2. For any frequent item set F and any B ∈ F, generate the rule F \ {B} → B if |D_F| / |D_{F \ {B}}| ≥ c.
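Step 2 can be sketched as follows; the item-set supports below are hypothetical, standing in for values found in step 1:

```python
# Rule-generation step: for a frequent item set F and each product B in F,
# emit the rule F \ {B} -> B when its confidence supp(F)/supp(F \ {B})
# clears the minimum confidence c. Supports here are made up.
supports = {
    frozenset({"beer"}): 0.6,
    frozenset({"nappies"}): 0.5,
    frozenset({"beer", "nappies"}): 0.4,
}

def rules_from_itemset(F, supports, min_conf):
    rules = []
    for b in F:
        antecedent = F - {b}
        if not antecedent:
            continue  # skip rules with an empty left-hand side
        conf = supports[F] / supports[antecedent]
        if conf >= min_conf:
            rules.append((antecedent, frozenset({b}), conf))
    return rules

F = frozenset({"beer", "nappies"})
for A, B, conf in rules_from_itemset(F, supports, min_conf=0.6):
    print(set(A), "->", set(B), round(conf, 2))
```

Because every rule's support equals the support of F itself, the support threshold is already guaranteed by step 1; only confidence needs checking here.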
Efficient Identification of FIS
1. The FIS with k items are stored in a tree structure.
2. The tree is extended with the candidate sets that contain k + 1 items.
3. They are the union of two FIS with k items and a common parent.
4. A single pass through D eliminates the candidates F such that |D_F| / |D| < s.
• The cost is determined by the number of candidate item sets.
• If s is very low, the number of candidates grows exponentially.
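The level-wise search above can be sketched without the tree structure (using plain sets of frozensets); the baskets are made up for illustration:

```python
# Level-wise identification of frequent item sets: candidates of size k+1
# are unions of two frequent k-sets, pruned via the Apriori property, and
# one pass over the data removes candidates below minimum support.
from itertools import combinations

transactions = [
    {"beer", "nappies"},
    {"beer", "bread", "nappies"},
    {"beer", "bread"},
    {"bread", "milk"},
]

def frequent_itemsets(transactions, min_support):
    n = len(transactions)
    items = sorted({p for t in transactions for p in t})
    # Level 1: frequent single products.
    level = {}
    for p in items:
        s = sum(1 for t in transactions if p in t) / n
        if s >= min_support:
            level[frozenset({p})] = s
    result = dict(level)
    k = 1
    while level:
        # Candidates of size k+1: unions of two frequent k-sets.
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k + 1}
        level = {}
        for c in candidates:
            # Apriori pruning: every k-subset must itself be frequent.
            if any(frozenset(sub) not in result for sub in combinations(c, k)):
                continue
            s = sum(1 for t in transactions if c <= t) / n
            if s >= min_support:
                level[c] = s
        result.update(level)
        k += 1
    return result

fis = frequent_itemsets(transactions, min_support=0.5)
for itemset, s in sorted(fis.items(), key=lambda kv: (-kv[1], sorted(kv[0]))):
    print(sorted(itemset), s)
```

The tree structure in the slide serves the same purpose as the subset check here, but avoids scanning all pairs of frequent sets when generating candidates.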
Other Interest Measures
• A large minimum support s is used to keep low the number of rules found.
• But this removes potentially interesting rules.
• In practice, generating a large number of rules is unavoidable.
• Additional interest measures can be used to filter the rules.
• Interestingness is often “deviation from independence”.
• The lift of a rule A → B is defined as

  lift(A → B) = supp(A ∪ B) / (supp(A) · supp(B))

• If A and B are perfectly independent then lift(A → B) = 1.
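Lift follows directly from the definition above; the transactions are hypothetical:

```python
# Lift of a rule A -> B: supp(A ∪ B) / (supp(A) * supp(B)).
# A value near 1 suggests independence; well above 1 suggests a genuine
# association between A and B.
transactions = [
    {"beer", "nappies"},
    {"beer", "nappies"},
    {"beer"},
    {"milk"},
]

def support(itemset, transactions):
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

def lift(A, B, transactions):
    return support(A | B, transactions) / (
        support(A, transactions) * support(B, transactions)
    )

print(lift({"beer"}, {"nappies"}, transactions))
```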
Available Software
• R includes an implementation of Apriori in the arules package.
• Based on the efficient C code developed by Christian Borgelt.
• The arulesViz package allows one to visualize association rules.
Calling Apriori
Top 7 Rules According to Lift
Visualizing the Rules: Scatter Plot
o {liquor, red/blush wine} → {bottled beer}
  Support: 0.001931876, Confidence: 0.9047619, Lift: 11.23527
o {root vegetables, yogurt} → {whole milk}
  Support: 0.01453991, Confidence: 0.5629921, Lift: 2.203354
Grouped Matrix‐based Visualization
Graph‐based Visualization
Large Scale Example: Netflix Data
Top 15 Rules According to Lift
Scatter Plot
o {Pirates of the Caribbean: The Curse of the Black Pearl, Lord of the Rings: The Fellowship of the Ring, Lord of the Rings: The Return of the King} → {Lord of the Rings: The Two Towers}
  Support: 0.1439382, Confidence: 0.9549493, Lift: 3.447636
o {Monsters} → {Finding Nemo}
  Support: 0.1678217, Confidence: 0.7320968, Lift: 2.828278
Advantages and Disadvantages of Association Rules
• Advantages:
  1. Computationally efficient (as long as s is large!).
  2. Individual rules are easily interpreted.
  3. Well‐researched method.
• Disadvantages:
  1. Not clear how to choose s and c.
  2. The number of rules obtained is often very large.
  3. Difficult to isolate interesting patterns. No widely‐accepted procedure.
  4. Sometimes most of the rules obtained are trivial.
  5. Lack of a rigorous probabilistic model for the data.
Useful References
• R. Agrawal, T. Imielinski, and A. Swami (1993). Mining association rules between sets of items in large databases. ACM SIGMOD International Conference on Management of Data, 207–216.
• C. Borgelt and R. Kruse (2002). Induction of Association Rules: Apriori Implementation. 15th Conference on Computational Statistics (COMPSTAT 2002).
• M. Hahsler, C. Buchta, B. Gruen, and K. Hornik (2010). arules: Mining Association Rules and Frequent Itemsets. R package version 1.0‐3. http://CRAN.R-project.org/package=arules
• M. Hahsler and S. Chelluboina (2011). arulesViz: Visualizing Association Rules. R package version 0.1‐1. http://CRAN.R-project.org/package=arulesViz
• O. Maimon and L. Rokach (eds.) (2010). Data Mining and Knowledge Discovery Handbook, 2nd ed. Springer.
Thank you for your attention!