Frequent itemset mining and temporal extensions
Sunita Sarawagi
[email protected]
http://www.it.iitb.ac.in/~sunita

Association rules
* Given several sets of items, for example:
  - the set of items purchased,
  - the set of pages visited on a website,
  - the set of doctors visited.
* Find all rules that correlate the presence of one set of items with another.
* Rules are of the form X → Y, where X and Y are sets of items.
* E.g.: purchase of books A & B → purchase of C.

Rule parameters: support and confidence
* Every rule X → Z has two parameters:
  - Support: the probability that a transaction contains both X and Z.
  - Confidence: the conditional probability that a transaction containing X also contains Z.
* Association rule mining therefore takes two thresholds: minimum support s and minimum confidence c.

  Transaction ID   Items bought
  2000             A, B, C
  1000             A, C
  4000             A, D
  5000             B, E, F

* With s = 50% and c = 50%:
  - A → C (support 50%, confidence 66.6%)
  - C → A (support 50%, confidence 100%)

Applications of fast itemset counting
* Cross-selling in retail and banking.
* Catalog design and store layout.
* Applications in medicine: finding redundant tests.
* Improving the predictive capability of classifiers that assume attribute independence.
* Improved clustering of categorical attributes.

Finding association rules in large databases
* Number of transactions: in the millions; number of distinct items: tens of thousands.
* Lots of work on scalable algorithms.
* Typically two parts to the algorithm:
  1. Finding all frequent itemsets with support > s.
  2. Finding rules with confidence greater than c.
* The frequent itemset search is the more expensive step: Apriori algorithm, FP-tree algorithm.

The Apriori algorithm
  L1 = {frequent items of size one};
  for (k = 1; Lk != ∅; k++)
      Ck+1 = candidates generated from Lk by
          • joining Lk with itself
          • pruning any (k+1)-itemset that has a subset not in Lk
      for each transaction t in the database do
          • increment the count of all candidates in Ck+1 that are contained in t
      Lk+1 = candidates in Ck+1 with at least min_support
  return ∪k Lk;

How to generate candidates?
* Suppose the items in Lk-1 are listed in some fixed order.
* Step 1: self-join Lk-1
    insert into Ck
    select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1
* Step 2: pruning
    forall itemsets c in Ck do
        forall (k-1)-subsets s of c do
            if (s is not in Lk-1) then delete c from Ck

The Apriori algorithm: example (minimum support count = 2)

  Database D
  TID   Items
  100   1, 2, 4
  200   2, 3, 5
  300   1, 2, 3, 5
  400   2, 5

  C1 (after scanning D): {1}:2, {2}:4, {3}:2, {4}:1, {5}:3
  L1: {1}:2, {2}:4, {3}:2, {5}:3

  C2 (after scanning D): {1 2}:2, {1 3}:1, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
  L2: {1 2}:2, {2 3}:2, {2 5}:3, {3 5}:2

  C3 (after scanning D): {2 3 5}:2
  L3: {2 3 5}:2
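As a concrete illustration of the candidate-generation-and-counting loop above, here is a minimal Python sketch of Apriori run on the example database. The function name `apriori` and its structure are illustrative, not taken from the slides; the pruning and counting follow the pseudocode above.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Return every frequent itemset (as a frozenset) together with its count."""
    # L1: frequent items of size one
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    current = {s: c for s, c in counts.items() if c >= min_support}
    frequent = dict(current)

    k = 1
    while current:
        # Join step: merge frequent k-itemsets whose union has k+1 items
        candidates = set()
        for a in current:
            for b in current:
                union = a | b
                if len(union) == k + 1:
                    # Prune step: every k-subset of the candidate must be frequent
                    if all(frozenset(sub) in current
                           for sub in combinations(union, k)):
                        candidates.add(union)
        # Counting step: one pass over the database per level
        cand_counts = dict.fromkeys(candidates, 0)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    cand_counts[c] += 1
        current = {c: n for c, n in cand_counts.items() if n >= min_support}
        frequent.update(current)
        k += 1
    return frequent

# The example database from the slides; a minimum support count of 2 is 50%.
db = [{"1", "2", "4"}, {"2", "3", "5"}, {"1", "2", "3", "5"}, {"2", "5"}]
print(apriori(db, min_support=2))
```

On this database the sketch reproduces the tables above: {4} is dropped at level one, {1 3} and {1 5} are counted but fail the support threshold at level two, and {2 3 5} is the only frequent 3-itemset.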
Improvements to Apriori
* Apriori with well-designed data structures works well in practice when the frequent itemsets are not too long (the common case).
* Many enhancements have been proposed:
  - Sampling: count in two passes.
  - Invert the database to be column-major instead of row-major and count by intersection.
  - Count itemsets of multiple lengths in one pass.
* Reducing the number of passes is not that useful, since I/O is not the bottleneck.
* The main bottleneck is candidate generation and counting, which is not optimized for long itemsets.

Mining frequent patterns without candidate generation
* Compress a large database into a compact Frequent-Pattern tree (FP-tree) structure: highly condensed, but complete for frequent pattern mining.
* Develop an efficient, FP-tree-based frequent pattern mining method.
* A divide-and-conquer methodology: decompose the mining task into smaller ones.
* Avoid candidate generation.

Constructing the FP-tree from the database (min_support = 0.5)

  TID   Items bought                (ordered) frequent items
  100   f, a, c, d, g, i, m, p      f, c, a, m, p
  200   a, b, c, f, l, m, o         f, c, a, b, m
  300   b, f, h, j, o               f, b
  400   b, c, k, s, p               c, b, p
  500   a, f, c, e, l, p, m, n      f, c, a, m, p

* Scan the DB once and find the frequent 1-itemsets: f:4, c:4, a:3, b:3, m:3, p:3.
* Order the frequent items in each transaction by decreasing frequency.
* Scan the DB again and construct the FP-tree.
* (Figure: the resulting FP-tree, rooted at {}, with paths f:4 → c:3 → a:3 → m:2 → p:2, f:4 → c:3 → a:3 → b:1 → m:1, f:4 → b:1, and c:1 → b:1 → p:1.)

Step 1: FP-tree to conditional pattern base
* Start at the frequent-item header table of the FP-tree (f:4, c:4, a:3, b:3, m:3, p:3).
* Traverse the FP-tree by following the links of each frequent item.
* Accumulate all transformed prefix paths of that item to form its conditional pattern base:

  item   conditional pattern base
  c      f:3
  a      fc:3
  b      fca:1, f:1, c:1
  m      fca:2, fcab:1
  p      fcam:2, cb:1

Step 2: construct the conditional FP-tree
* For each pattern base:
  - accumulate the count for each item in the base;
  - construct the FP-tree over the frequent items of the pattern base.
* m-conditional pattern base: fca:2, fcab:1; m-conditional FP-tree: {} → f:3 → c:3 → a:3.
* All frequent patterns involving m: m, fm, cm, am, fcm, fam, cam, fcam.

Mining frequent patterns by creating conditional pattern bases

  Item   Conditional pattern base      Conditional FP-tree
  p      {(fcam:2), (cb:1)}            {(c:3)} | p
  m      {(fca:2), (fcab:1)}           {(f:3, c:3, a:3)} | m
  b      {(fca:1), (f:1), (c:1)}       empty
  a      {(fc:3)}                      {(f:3, c:3)} | a
  c      {(f:3)}                       {(f:3)} | c
  f      empty                         empty

* Repeat this recursively for higher items.

FP-growth vs. Apriori: scalability with the support threshold
* (Figure: run time in seconds vs. support threshold (%) for FP-growth and Apriori on data set T25I20D10K.)

Criticism of support and confidence
* X and Y are positively correlated and X and Z are negatively correlated, yet the support and confidence of X ⇒ Z dominate (a small numeric check of this example follows the next slide):

  X   1 1 1 1 0 0 0 0
  Y   1 1 0 0 0 0 0 0
  Z   0 1 1 1 1 1 1 1

  Rule     Support   Confidence
  X ⇒ Y    25%       50%
  X ⇒ Z    37.5%     75%

* We need to measure the departure from what is expected. For two items, compare P(A ∧ B) against P(A) P(B).
* For k items, the expected support is derived from the supports of the (k-1)-itemsets using iterative scaling methods.

Prevalent correlations are not interesting
* Analysts already know about prevalent rules.
* Interesting rules are those that deviate from prior expectation.
* Mining's payoff is in finding surprising phenomena.
* (Cartoon: in 1995, "bedsheets and pillow covers sell together!" is news; by 1998 the same rule draws only "Zzzz...")
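To make the criticism of support and confidence concrete, the observed support can be compared with the support expected under independence; their ratio (often called lift or interest) is below 1 for X ⇒ Z even though that rule has the higher confidence. A small sketch using the 8-transaction table from the slide (the helper name `stats` is illustrative):

```python
# The 8 transactions from the "Criticism of support and confidence" slide,
# written as indicator columns X, Y, Z.
X = [1, 1, 1, 1, 0, 0, 0, 0]
Y = [1, 1, 0, 0, 0, 0, 0, 0]
Z = [0, 1, 1, 1, 1, 1, 1, 1]
n = len(X)

def stats(a, b):
    both = sum(1 for i in range(n) if a[i] and b[i])
    support = both / n                                  # P(A and B)
    confidence = both / sum(a)                          # P(B | A)
    lift = support / ((sum(a) / n) * (sum(b) / n))      # P(A,B) / (P(A) P(B))
    return support, confidence, lift

print("X=>Y", stats(X, Y))   # support 0.25, confidence 0.5,  lift 2.0  (positive correlation)
print("X=>Z", stats(X, Z))   # support 0.375, confidence 0.75, lift ~0.86 (negative correlation)
```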
What makes a rule surprising?
* It does not match prior expectation: e.g., the correlation between milk and cereal remains roughly constant over time.
* It cannot be trivially derived from simpler rules:
  - Milk 10%, cereal 10%, but milk and cereal together 10% ... surprising.
  - Eggs 10%, but milk, cereal and eggs together 0.1% ... surprising! (expected 1%).

Finding surprising temporal patterns
* Algorithms to mine for surprising patterns.
* Encode itemsets into bit streams using two models:
  - Mopt: the optimal model, which allows the parameters to change along time.
  - Mcons: the constrained model, which does not allow change along time.
* Surprise = the difference in the number of bits needed under Mopt and Mcons.

One item: the optimal model
* Milk-buying habits are modeled by a biased coin.
* The customer tosses this coin to decide whether to buy milk; a head ("1") denotes "basket contains milk".
* The coin bias is Pr[milk], and the analyst wants to study Pr[milk] along time.
* A single coin with a fixed bias is not interesting; changes in the bias are interesting.

The coin segmentation problem
* Players A and B.
* A has a set of coins with different biases. A repeatedly picks an arbitrary coin and tosses it an arbitrary number of times.
* B observes the head/tail sequence, e.g. 0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1, and guesses the transition points and the biases.

How to explain the data
* Given n head/tail observations:
  - We can assume n different coins, each with bias 0 or 1: the data fits perfectly (with probability one), but many coins are needed.
  - Or we can assume a single coin, which may fit the data poorly.
* The "best explanation" is a compromise, e.g. 0 0 1 0 | 1 1 0 1 0 1 1 | 0 1 0 0 0 1 with biases 1/4, 5/7 and 1/3.

Coding examples
* A sequence of k zeroes: the naïve encoding takes k bits, while run-length encoding takes about log k bits.
* 1000 bits with 10 randomly placed 1's and the rest 0's: posit a coin with bias 0.01. By Shannon's theorem the data encoding cost is -10 log2 0.01 - 990 log2 0.99 ≈ 81 bits « 1000 bits. Note that 10 log2 100 ≈ 66 bits.

How to find optimal segments
* Sequence of 17 tosses: 0 0 1 0 1 1 0 1 0 1 1 0 1 0 0 0 1.
* Build a derived graph with 18 nodes, one per segment boundary, and find the shortest path (a small code sketch of this formulation appears after the next few slides).
* Edge cost = model cost + data cost.
  - Data cost for a segment with Pr[head] = 5/7: the cost of encoding its 5 heads and 2 tails.
  - Model cost = one node ID + one Pr[head] value.

Two or more items
* "Unconstrained" segmentation: k items induce a 2^k-sided coin, with faces such as "milk and cereal" = 11, "milk, not cereal" = 10, "neither" = 00, etc.
* The shortest path then finds significant shifts in any of the coin-face probabilities.
* Problem: some of these shifts may be completely explained by the marginals.

Example (θ = 2)
* A drop in the joint sale of milk and cereal may be completely explained by a drop in the sale of milk.
* The ratio Pr[milk & cereal] / (Pr[milk] · Pr[cereal]) remains constant over time; call this ratio θ.
* (Figure: support of milk, cereal, and both, plotted against time.)

Constant-θ segmentation
* Observed support p11; under independence the expected support is (p10 + p11)(p01 + p11), so θ = p11 / ((p10 + p11)(p01 + p11)).
* Compute a global θ over all time; all coins must share this common value of θ.
* Segment as before and compare with the unconstrained coding cost.

Is all this really needed? A simpler alternative
* Aggregate the data into suitable time windows; compute support, correlation, θ, etc. in each window; use a variance threshold to choose itemsets.
* Pitfalls: the choices of windows and thresholds, missed fine detail, and over-sensitivity to outliers.
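As a concrete illustration of the shortest-path segmentation that the window-based alternative tries to approximate, here is a minimal Python sketch: one node per boundary position, each edge carrying the data cost of encoding that segment with its own empirical bias plus a per-segment model cost. The function names and the 8-bit cost assumed for encoding a bias value are illustrative placeholders, not taken from the slides.

```python
import math

def data_cost(heads, tails):
    """Shannon cost, in bits, of a segment encoded with its own empirical bias."""
    n = heads + tails
    cost = 0.0
    if heads:
        cost -= heads * math.log2(heads / n)
    if tails:
        cost -= tails * math.log2(tails / n)
    return cost

def segment(tosses, bias_bits=8):
    """Minimise total (model cost + data cost) over all segmentations of a 0/1
    sequence by dynamic programming over the n+1 boundary positions, which is
    equivalent to a shortest path in the derived graph."""
    n = len(tosses)
    model_cost = math.log2(n) + bias_bits      # one node ID + one Pr[head] value
    best = [0.0] + [math.inf] * n              # best[j]: cheapest encoding of tosses[:j]
    back = [0] * (n + 1)
    for j in range(1, n + 1):
        heads = tails = 0
        for i in range(j - 1, -1, -1):         # candidate segment tosses[i:j]
            heads += tosses[i]
            tails += 1 - tosses[i]
            cost = best[i] + model_cost + data_cost(heads, tails)
            if cost < best[j]:
                best[j], back[j] = cost, i
    boundaries, j = [], n
    while j > 0:                               # recover the chosen segment boundaries
        boundaries.append(j)
        j = back[j]
    return best[n], sorted(boundaries)

# The 17-toss example from the slides.
seq = [0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1]
print(segment(seq))
```

With a sequence this short the per-segment model cost dominates, so a single segment may win; on longer streams the same recurrence picks out bias shifts. The 8 bits charged for a bias value is only a stand-in for whatever parameter-coding cost the MDL formulation prescribes.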
Experiments
* Millions of baskets over several years.
* Two algorithms: the complete MDL approach, and MDL segmentation combined with statistical tests (MStat).
* Data set: 2.8 million transactions over 7 years (1987 to 1993), 15,800 items, 2.62 items per basket on average.

Little agreement in itemset ranks
* (Figure: scatter plots of Rank(MDL) against Rank(Stat, 4 week) and against Rank(MStat), both ranging from 0 to 1600.)
* The simpler methods do not approximate MDL.

MDL has high selectivity
* (Figure: histograms of frequency vs. score for MDL and for MStat.)
* The scores of the best itemsets stand out from the rest when using MDL.

Three anecdotes
* (Figure: support plotted against time for three itemsets.)
* High MStat score but small marginals: polo shirt & shorts.
* High correlation but small percentage variation: bedsheets & pillow cases.
* High MDL score, showing a significant gradual drift: men's & women's shorts.

Conclusion
* A new notion of surprising patterns based on:
  - the joint support expected from the marginals, and
  - the variation of the joint support along time.
* A robust MDL formulation.
* Efficient algorithms: near-optimal segmentation using shortest paths, and pruning criteria.
* Successful application to real data.

References
* R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB '94, 487-499, Santiago, Chile, 1994.
* S. Chakrabarti, S. Sarawagi and B. Dom. Mining surprising patterns using temporal description length. Proc. of the 24th Int'l Conference on Very Large Databases (VLDB), 1998.
* J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD '00, 1-12, Dallas, TX, May 2000.
* J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers. (Some of the slides in this talk are taken from this book.)
* H. Toivonen. Sampling large databases for association rules. VLDB '96, 134-145, Bombay, India, Sept. 1996.