Data Mining
Presented by: Kevin Seng
CS632, 4/3/01

Papers
- Rakesh Agrawal and Ramakrishnan Srikant: "Fast Algorithms for Mining Association Rules" and "Mining Sequential Patterns".

Outline
For each paper:
- Present the problem.
- Describe the algorithms:
  - intuition
  - design
  - performance

Market Basket Introduction
- Retailers are able to collect massive amounts of sales data ("basket data"):
  - bar-code technology
  - e-commerce
- Sales data generally includes customer ID, transaction date, and items bought.

Market Basket Problem
- It would be useful to find association rules between transactions, e.g. "75% of the people who buy spaghetti also buy tomato sauce."
- Given a set of basket data, how can we efficiently find the set of association rules?

Formal Definition (1)
- L = {i1, i2, ..., im} is the set of items.
- Database D is a set of transactions.
- A transaction T is a set of items such that T ⊆ L.
- A unique identifier, TID, is associated with each transaction.

Formal Definition (2)
- T contains X, a set of some items in L, if X ⊆ T.
- An association rule is an implication X → Y, where X ⊂ L, Y ⊂ L, and X ∩ Y = ∅.
- Confidence: % of transactions containing X which also contain Y.
- Support: % of transactions in D which contain X ∪ Y.

Formal Definition (3)
- Given a set of transactions D, we want to generate all association rules that have support and confidence greater than the user-specified minimum support (minsup) and minimum confidence (minconf).

Problem Decomposition
Two sub-problems:
- Find all itemsets that have transaction support above minsup. These itemsets are called large itemsets.
- From all the large itemsets, generate the set of association rules that have confidence above minconf.

Second Sub-problem
Straightforward approach:
- For every large itemset l, find all nonempty subsets of l.
- For every such subset a, output a rule of the form a → (l − a) if the ratio of support(l) to support(a) is at least minconf.

Discovering Large Itemsets
- Done with multiple passes over the data.
- First pass: find all individual items that are large (have minimum support).
- Each subsequent pass, using the large itemsets found in the previous pass:
  - generate candidate itemsets;
  - count support for each candidate itemset;
  - eliminate itemsets that do not have minimum support.

Algorithm

    L1 = {large 1-itemsets};
    for (k = 2; Lk-1 ≠ ∅; k++) do begin
        Ck = apriori-gen(Lk-1);             // new candidates
        forall transactions t ∈ D do begin  // counting support
            Ct = subset(Ck, t);             // candidates contained in t
            forall candidates c ∈ Ct do
                c.count++;
        end
        Lk = {c ∈ Ck | c.count ≥ minsup};
    end
    Answer = ∪k Lk;

Candidate Generation
- The AIS and SETM algorithms use the transactions in the database to generate new candidates.
- But this generates a lot of candidates which we know beforehand are not large!

Apriori Algorithms
- Generate candidates using only the large itemsets found in the previous pass, without considering the database.
- Intuition: any subset of a large itemset must be large.

Apriori Candidate Generation
- Takes Lk-1 and returns Ck. Two steps:
  - Join Lk-1 with Lk-1.
  - Prune all itemsets in the joined result which contain a (k-1)-subset not found in Lk-1.
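A minimal sketch of these two steps and the surrounding level-wise loop, assuming transactions are Python frozensets of integer item IDs. The function and variable names are illustrative, not from the paper, and the join here unions pairs of (k-1)-itemsets that differ in exactly one item, which yields the same candidate set as the sorted SQL join shown just below.

    from itertools import combinations

    def apriori_gen(prev_large):
        """Join L(k-1) with itself, then prune candidates that have a
        non-large (k-1)-subset (any subset of a large itemset must be large)."""
        k = len(next(iter(prev_large))) + 1
        # Join: the union of two large (k-1)-itemsets has size k exactly when
        # they agree on k-2 items, matching the p.item/q.item join condition.
        joined = {a | b for a in prev_large for b in prev_large if len(a | b) == k}
        # Prune: every (k-1)-subset of a surviving candidate must be large.
        return {c for c in joined
                if all(frozenset(s) in prev_large for s in combinations(c, k - 1))}

    def apriori(transactions, minsup_count):
        """Level-wise loop from the Algorithm slide; minsup given as a count."""
        counts = {}
        for t in transactions:                     # first pass: 1-itemsets
            for item in t:
                c = frozenset([item])
                counts[c] = counts.get(c, 0) + 1
        large = {c for c, n in counts.items() if n >= minsup_count}
        answer = set(large)
        while large:
            candidates = apriori_gen(large)
            counts = {c: 0 for c in candidates}
            for t in transactions:                 # counting support; a plain
                for c in candidates:               # scan stands in for the
                    if c <= t:                     # hash-tree subset() function
                        counts[c] += 1
            large = {c for c, n in counts.items() if n >= minsup_count}
            answer |= large
        return answer

    txns = [frozenset(t) for t in ([1, 2, 3], [1, 2, 4], [1, 3, 4],
                                   [2, 3, 4], [1, 2, 3, 4])]
    print(sorted(map(sorted, apriori(txns, 3))))   # all large 1- and 2-itemsets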
Candidate Generation (Join)

    insert into Ck
    select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
    from Lk-1 p, Lk-1 q
    where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Candidate Gen. (Example)
- L3 = { {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} }
- Join: C4 = { {1 2 3 4}, {1 3 4 5} }
- Prune: C4 = { {1 2 3 4} }  ({1 3 4 5} is deleted because its subset {1 4 5} is not in L3)

Counting Support
- Need to count the number of transactions which support a given itemset.
- For efficiency, use a hash-tree: the subset function.

Subset Function (Hash-tree)
- Candidate itemsets are stored in a hash-tree.
- A leaf node contains a list of itemsets.
- An interior node contains a hash table; each bucket of the hash table points to another node.
- The root is at depth 1; interior nodes at depth d point to nodes at depth d+1.

Hash-tree Example (1)
[Figure: hash-tree storing C3 = { {1 2 3}, {1 2 4}, {1 3 4}, {1 3 5}, {2 3 4} }, with leaves { {1 2 3}, {1 2 4} }, { {1 3 4}, {1 3 5} }, and { {2 3 4} } reached by hashing on successive items.]

Using the Hash-tree
- At the root: hash on every item in the transaction.
- At an interior node: hash on each remaining item in the transaction.
- At a leaf: find all itemsets in the leaf contained in the transaction.

Hash-tree Example (2)
[Figure: the same hash-tree probed with transactions D = { {1 2 3 4}, {2 3 5} }.]

AprioriTid (1)
- Does not use the transactions in the database for counting itemset support.
- Instead stores transactions as sets of possible large itemsets, C̄k.
- Each member of C̄k is of the form <TID, {Xk}>, where Xk is a possible large k-itemset.

AprioriTid (2)
- Advantage of C̄k: if a transaction does not contain any candidate k-itemset, it has no entry in C̄k.
- The number of entries in C̄k may be less than the number of transactions in D, especially for large k. Speeds up counting!

AprioriTid (3)
- However, for small k each entry in C̄k may be larger than its corresponding transaction.
- The usual space vs. time trade-off.

AprioriTid (4)
[Example figure.]

Observation
- When C̄k does not fit in main memory we see a large jump in execution time.
- AprioriTid beats Apriori only when C̄k can fit in main memory.

AprioriHybrid
- It is not necessary to use the same algorithm for all the passes: combine the two algorithms!
- Start with Apriori; when C̄k can fit in main memory, switch to AprioriTid.

Performance (1)
- Measured performance by running the algorithms on generated synthetic data, using the following parameters:
[Table of synthetic-data parameters not reproduced.]

Performance (2)
[Performance graphs.]

Performance (3)
[Performance graphs.]

Mining Sequential Patterns (1)
- Sequential patterns are ordered lists of itemsets.
- Market basket examples:
  - Customers typically rent "Star Wars", then "The Empire Strikes Back", then "Return of the Jedi".
  - "Fitted sheets and pillow cases", then "comforter", then "drapes and ruffles".

Mining Sequential Patterns (2)
- Looks at sequences of transactions as opposed to single transactions.
- Groups transactions based on customer ID: the customer sequence.

Formal Definition (1)
- Given a database D of customer transactions.
- Each transaction consists of: customer ID, transaction-time, items purchased.
- No customer has more than one transaction with the same transaction-time.

Formal Definition (2)
- Itemset i = (i1 i2 ... im), where each ij is an item.
- Sequence s = <s1 s2 ... sn>, where each sj is an itemset.
- Sequence <a1 a2 ... an> is contained in <b1 b2 ... bm> if there exist integers i1 < i2 < ... < in such that a1 ⊆ bi1, a2 ⊆ bi2, ..., an ⊆ bin.
- A sequence s is maximal if it is not contained in any other sequence.
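Support counting in the sequence phase below repeatedly applies this containment test. A direct transcription, assuming sequences are represented as Python lists of sets; the helper name is mine, not the paper's.

    def is_contained(a, b):
        """True if sequence a = <a1 ... an> is contained in b = <b1 ... bm>:
        a greedy left-to-right scan finds i1 < i2 < ... < in with aj ⊆ b[ij]."""
        j = 0
        for aj in a:
            while j < len(b) and not aj <= b[j]:   # <= is subset test on sets
                j += 1
            if j == len(b):
                return False
            j += 1           # later elements of a must map strictly later in b
        return True

    # <(3), (4 5)> is contained in <(7), (3 8), (9), (4 5 6), (8)> ...
    print(is_contained([{3}, {4, 5}], [{7}, {3, 8}, {9}, {4, 5, 6}, {8}]))  # True
    # ... but <(3), (5)> is not contained in <(3 5)>: order, not co-occurrence.
    print(is_contained([{3}, {5}], [{3, 5}]))                               # False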
Formal Definition (3)
- A customer supports a sequence s if s is contained in the customer sequence for this customer.
- Support of a sequence: % of customers who support the sequence.
- (For mining association rules, support was a % of transactions.)

Formal Definition (4)
- Given a database D of customer transactions, find the maximal sequences among all sequences that have a certain user-specified minimum support.
- Sequences that have support above minsup are large sequences.

Algorithm: Sort Phase
- Sort with customer ID as the major key and transaction-time as the minor key.
- Converts the original transaction database into a database of customer sequences.

Algorithm: Litemset Phase (1)
- Find all large itemsets.
- Why? Because each itemset in a large sequence has to be a large itemset.

Algorithm: Litemset Phase (2)
- To get all large itemsets we can use the Apriori algorithms discussed earlier, but support counting must be modified: for sequential patterns, support is measured by the fraction of customers.

Algorithm: Litemset Phase (3)
- Each large itemset is then mapped to a set of contiguous integers.
- Used to compare two large itemsets in constant time.

Algorithm: Transformation (1)
- Need to repeatedly determine which of a given set of large sequences are contained in a customer sequence.
- Represent transactions as sets of large itemsets.
- A customer sequence now becomes a list of sets of itemsets.

Algorithm: Transformation (2)
[Example figure.]

Algorithm: Sequence Phase (1)
- Use the set of large itemsets to find the desired sequences.
- Similar structure to the Apriori algorithms used to find large itemsets:
  - use a seed set to generate candidate sequences;
  - count support for each candidate;
  - eliminate candidate sequences which are not large.

Algorithm: Sequence Phase (2)
Two types of algorithms:
- Count-all: counts all large sequences, including non-maximal sequences (AprioriAll).
- Count-some: tries to avoid counting non-maximal sequences by counting longer sequences first (AprioriSome, DynamicSome).

Algorithm: Maximal Phase (1)
- Find the maximal sequences among the set of large sequences.
- Let S be the set of all large sequences.

Algorithm: Maximal Phase (2)
- Use a hash-tree to find all subsequences of sk in S.
- Similar to the subset function used in finding large itemsets; S is stored in the hash-tree.

AprioriAll (1)
[Algorithm figure.]

AprioriAll (2)
- The hash-tree is used for counting.
- Candidate generation is similar to candidate generation in finding large itemsets, except that order matters, so we drop the condition p.itemk-1 < q.itemk-1.

AprioriAll (3)
- Example of candidate generation:
[Example figure.]

AprioriAll (4)
- Example:
[Example figure.]

Count-some Algorithms
- Try to avoid counting non-maximal sequences by counting longer sequences first.
- Two phases:
  - Forward phase: find all large sequences of certain lengths.
  - Backward phase: find all remaining large sequences.

AprioriSome (1)
- Determines which lengths to count using the next() function.
- next() takes as a parameter the length of the sequences counted in the last pass.
- next(k) = k + 1 is the same as AprioriAll.
- Balances the trade-off between counting non-maximal sequences and counting extensions of small candidate sequences.

AprioriSome (2)
- hitk = |Lk| / |Ck|.
- Intuition: as hitk increases, the time wasted by counting extensions of small candidates decreases, so next() can skip further ahead.
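A sketch of a next() function driven by this hit ratio. The increasing-skip shape follows the slides above, but the cut-off values below are illustrative placeholders of my own, not taken from the paper.

    def next_len(k, hit_k):
        """Choose the next sequence length to count, given hit_k = |Lk|/|Ck|
        from the pass just completed. Cut-offs are illustrative."""
        if hit_k < 0.666:
            return k + 1      # low hit ratio: step one length, like AprioriAll
        elif hit_k < 0.75:
            return k + 2
        elif hit_k < 0.80:
            return k + 3
        elif hit_k < 0.85:
            return k + 4
        return k + 5          # high hit ratio: skip aggressively

    print(next_len(3, 0.5), next_len(3, 0.9))   # 4 8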
AprioriSome (3)
[Forward-phase algorithm figure.]

AprioriSome (4)
Backward phase: for all lengths which we skipped:
- Delete sequences in the candidate set which are contained in some large sequence (a sketch of this containment filter appears below).
- Count the remaining candidates and find all sequences with minimum support.
- Also delete large sequences found in the forward phase which are non-maximal.

AprioriSome (5)
[Backward-phase algorithm figure.]

AprioriSome (6)
- Example, forward phase: next(k) = 2k, minsup = 2.
[Example figure: candidate 3-sequences C3.]

AprioriSome (7)
- Example, backward phase:
[Example figure: candidate 3-sequences C3.]

Performance (1)
- Used generated synthetic datasets again. Parameters for the data:
[Table of parameters not reproduced.]

Performance (2)
- DynamicSome generates too many candidates.
- AprioriSome does a little better than AprioriAll: it avoids counting many non-maximal sequences.

Performance (3)
The advantage of AprioriSome is reduced for two reasons:
- AprioriSome generates more candidates.
- Candidates remain memory-resident even if a pass is skipped, so the algorithm cannot always follow its skipping heuristic.

Wrap Up
Just presented two classic papers on data mining:
- Association Rules
- Sequential Patterns
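Finally, the "contained in some large sequence" test used by the backward and maximal phases can be sketched with the is_contained() helper from earlier. The paper stores S in a hash-tree for this; the quadratic scan below is a simplification of my own.

    def maximal_sequences(large_seqs):
        """Keep only sequences not contained in any other large sequence.
        Assumes large_seqs holds no duplicate sequences."""
        return [s for i, s in enumerate(large_seqs)
                if not any(i != j and is_contained(s, t)
                           for j, t in enumerate(large_seqs))]

    seqs = [[{1}, {2}], [{1}], [{2}], [{3}]]
    print(maximal_sequences(seqs))   # [[{1}, {2}], [{3}]]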