Survey

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Survey

Transcript

MIS2502: Data Analytics Association Rule Mining Jeremy Shafer [email protected] http://community.mis.temple.edu/jshafer Association Rule Mining Find out which items predict the occurrence of other items Also known as “affinity analysis” or “market basket” analysis Uses • What products are bought together? • Amazon’s recommendation engine • Telephone calling patterns Examples of Association Rule Mining • Market basket analysis/affinity analysis – What products are bought together? – Where to place items on grocery store shelves? • Amazon’s recommendation engine – “People who bought this product also bought…” • Telephone calling patterns – Who do a set of people tend to call most often? • Social network analysis – Determine who you “may know” Market-Basket Transactions Basket 1 2 3 Items Bread, Milk Bread, Diapers, Beer, Eggs Milk, Diapers, Beer, Coke 4 5 Bread, Milk, Diapers, Beer Bread, Milk, Diapers, Coke Some Association Rules from these transactions XY (antecedent consequent) {Diapers} {Beer}, {Milk, Bread} {Diapers} {Beer, Bread} {Milk}, {Bread} {Milk, Diapers} Core idea: The itemset Itemset A group of items of interest {Milk, Beer, Diapers} Basket Association rules express relationships between itemsets XY {Milk, Diapers} {Beer} “when you have milk and diapers, you also have beer” Items 1 Bread, Milk 2 Bread, Diapers, Beer, Eggs 3 Milk, Diapers, Beer, Coke 4 Bread, Milk, Diapers, Beer 5 Bread, Milk, Diapers, Coke Support • Support count () – In how many baskets does the itemset appear? – {Milk, Beer, Diapers} = 2 (i.e., in baskets 3 and 4) • Support (s) – Fraction of transactions that contain all items in X Y – s({Milk, Diapers, Beer}) = 2/5 = 0.4 X Basket Items 1 Bread, Milk 2 Bread, Diapers, Beer, Eggs 3 Milk, Diapers, Beer, Coke 4 Bread, Milk, Diapers, Beer 5 Bread, Milk, Diapers, Coke 2 baskets have milk, beer, and diapers Y • You can calculate support for both X and Y separately – Support for X = 3/5 = 0.6 – Support for Y = 3/5 = 0.6 5 baskets total Confidence • Confidence is the strength of the association – Measures how often items in Y appear in transactions that contain X c Basket Items 1 Bread, Milk 2 Bread, Diapers, Beer, Eggs 3 Milk, Diapers, Beer, Coke 4 Bread, Milk, Diapers, Beer 5 Bread, Milk, Diapers, Coke ( X Y ) (Milk, Diapers, Beer) 2 0.67 (X ) (Milk, Diapers) 3 c must be between 0 and 1 1 is a complete association 0 is no association This says 67% of the times when you have milk and diapers in the itemset you also have beer! Some sample rules Association Rule Support (s) Confidence (c) {Milk,Diapers} {Beer} 2/5 = 0.4 2/3 = 0.67 {Milk,Beer} {Diapers} 2/5 = 0.4 2/2 = 1.0 {Diapers,Beer} {Milk} 2/5 = 0.4 2/3 = 0.67 {Beer} {Milk,Diapers} 2/5 = 0.4 2/3 = 0.67 {Diapers} {Milk,Beer} 2/5 = 0.4 2/4 = 0.5 {Milk} {Diapers,Beer} 2/5 = 0.4 2/4 = 0.5 All the above rules are binary partitions of the same itemset: {Milk, Diapers, Beer} (That is, we are splitting the set into two parts.) Basket 1 Items Bread, Milk 2 Bread, Diapers, Beer, Eggs 3 Milk, Diapers, Beer, Coke 4 Bread, Milk, Diapers, Beer 5 Bread, Milk, Diapers, Coke Rule Pairs Association Rule Support (s) Confidence (c) 1 {Milk,Diapers} {Beer} 2/5 = 0.4 2/3 = 0.67 2 {Milk,Beer} {Diapers} 2/5 = 0.4 2/2 = 1.0 3 {Diapers,Beer} {Milk} 2/5 = 0.4 2/3 = 0.67 4 {Beer} {Milk,Diapers} 2/5 = 0.4 2/3 = 0.67 5 {Diapers} {Milk,Beer} 2/5 = 0.4 2/4 = 0.5 6 {Milk} {Diapers,Beer} 2/5 = 0.4 2/4 = 0.5 A rule pair is a pair of two association rules that represent the two possible combinations of a binary partition. If I have a rule of X Y then its corresponding rule pair is Y X. One rule pair is starred above. Can you find the others? Don’t blindly follow the numbers i.e., high confidence suggests a strong association… • But this can be deceptive • Consider {Bread} {Diapers} • Support for the total itemset is 0.6 (3/5) • And confidence is 0.75 (3/4) – pretty high • But is this just because both are frequently occurring items (s=0.8)? • You’d almost expect them to show up in the same baskets by chance Lift Takes into account how co-occurrence differs from what is expected by chance – i.e., if items were selected independently from one another s( X Y ) Lift s ( X ) * s (Y ) Support for total itemset X and Y Support for X times support for Y Lift Example • What’s the lift for the rule: {Milk, Diapers} {Beer} • So X = {Milk, Diapers} Y = {Beer} s({Milk, Diapers, Beer}) = 2/5 = 0.4 s({Milk, Diapers}) = 3/5 = 0.6 s({Beer}) = 3/5 = 0.6 0.4 0.4 So Lift 1.11 0.6 * 0.6 0.36 Basket Items 1 Bread, Milk 2 Bread, Diapers, Beer, Eggs 3 Milk, Diapers, Beer, Coke 4 Bread, Milk, Diapers, Beer 5 Bread, Milk, Diapers, Coke When Lift > 1, the occurrence of X Y together is more likely than what you would expect by chance Another example Checking Account Savings Account No Yes No 500 3500 4000 Yes 1000 5000 6000 1500 8500 10000 Are people more inclined to have a checking account if they have a savings account? Support ({Savings} {Checking}) = 5000/10000 = 0.5 Support ({Savings}) = 6000/10000 = 0.6 Support ({Checking}) = 8500/10000 = 0.85 0.5 0.5 Lift 0.98 0.6 * 0.85 0.51 Answer: No In fact, it’s slightly less than what you’d expect by chance! But this can be overwhelming Thousands of products Many customer types Millions of combinations So where do you start? Selecting the rules • We know how to calculate the measures for each rule – Support – Confidence – Lift • Then we set up thresholds for the minimum rule strength we want to accept The steps • List all possible association rules • Compute the support and confidence for each rule • Drop rules that don’t make the thresholds • Use lift to further check the association Once you are confident in a rule, take action {Milk, Diapers} {Beer} Possible Marketing Actions • Create “New Parent Coping Kits” of beer, milk, and diapers • What are some others? Recap In this presentation, we identified four statistics used in Association Rule mining. 1. Support Count 2. Support 3. Confidence 4. Lift Support Count () Support Count answers the question – “How Many?” Specifically, how many occurrences in a data set have a particular combination of items. Support count can be calculated for either a rule or a specific item set. So, if: X = {Milk, Diapers} Y = {Beer} Then: (X) = 3 (Y) = 3 (X Y) = 2 Basket Items 1 Bread, Milk 2 Bread, Diapers, Beer, Eggs 3 Milk, Diapers, Beer, Coke 4 Bread, Milk, Diapers, Beer 5 Bread, Milk, Diapers, Coke Support (s) Support answers the question – “What percent?” Specifically, what percentage of occurrences in a data set have a particular combination of items. Support (s) may be expressed as a percentage or a probability. Support can be calculated for either a rule or a specific item set. So, if: X = {Milk, Diapers} Y = {Beer} Then: s(X) = 3/5 = .6 = 60% s(Y) = 3/5 = .6 = 60% s(X Y) = 2/5 = .4 = 40% Basket Items 1 Bread, Milk 2 Bread, Diapers, Beer, Eggs 3 Milk, Diapers, Beer, Coke 4 Bread, Milk, Diapers, Beer 5 Bread, Milk, Diapers, Coke Confidence (c) Confidence answers about the quality (or strength) of a particular association rule. Specifically, if I see itemsset X in my data, how likely am I to see itemset Y as well? Confidence can be expressed as a percentage or a probability. Confidence can only be calculated for an association rule. So, if: X = {Milk, Diapers} Y = {Beer} Then: s (X ® Y ) 2 c(X ® Y ) = = = 0.67 s (X) 3 Lift Lift answers the question – does the support for a association rule in our data appear because of some influence that is more than just dumb luck? Lift can only be calculated for an association rule. • A lift > 1 indicates that the association is real. The greater the lift, the stronger the association. • A lift = 1 is what we expect to see if there is no connection between the itemsets. • A lift < 1 indicates something called a negative association one. That is if we have X we are actually less likely to see Y. Lift If: X = {Milk, Diapers} Y = {Beer} Then: s(XY) = 2/5 = 0.4 s(X) = 3/5 = 0.6 s(Y) = 3/5 = 0.6 Basket Items 1 Bread, Milk 2 Bread, Diapers, Beer, Eggs 3 Milk, Diapers, Beer, Coke 4 Bread, Milk, Diapers, Beer 5 Bread, Milk, Diapers, Coke When Lift > 1, the occurrence of X Y together is more likely than what you would expect by chance s( X Y ) 0.4 0.4 Lift 1.11 s ( X ) * s (Y ) 0.6 * 0.6 0.36 What are these numbers good for? Measurement Remark Support Count A basic measurement, Support Count can be used to get a rough estimate, and is also useful in the computation of support and confidence. Support Gives me a probability (or percentage) of an itemset's occurrence within a body of data. Confidence A good, preliminary indication of the strength of an association rule. Lift The best indicator of the quality or strength of an association rule.