Sequential PAttern Mining using A Bitmap Representation
Jay Ayres, Johannes Gehrke, Tomi Yiu, Jason Flannick
ACM SIGKDD 2002

Outline
- Introduction
- The SPAM Algorithm
- Data Representation
- Experimental Evaluation
- Conclusion

Introduction

Sequential pattern mining analyzes a database of customer purchase transactions to discover whether there are ordering relationships among the items customers buy, i.e., whether purchasing certain items tends to be followed by purchasing certain others.

Input: the item set I = {a, b, c, d} and a dataset sorted by CID and TID:

  CID  TID  Itemset
  1    1    {a, b, d}
  1    3    {b, c, d}
  1    6    {b, c, d}
  2    2    {b}
  2    4    {a, b, c}
  3    5    {a, b}
  3    7    {b, c, d}

Output: sequential patterns such as ({b}, {c, d}, {c}).

Introduction (Frequent Sequential Patterns)

- Sa: a sequence. Example: Sa = ({a}, {c}).
- sup_D(Sa): the support of Sa in D, i.e., the number of customers whose sequence contains Sa. Example: for Sa = ({a}, {b, c}), sup_D(Sa) = 2 (customers 1 and 3).
- Given a support threshold minSup, a sequence Sa is called a frequent sequential pattern on D if sup_D(Sa) >= minSup. Example: with minSup = 2, the Sa above is frequent.

Sequence for each customer:

  CID  Sequence
  1    ({a, b, d}, {b, c, d}, {b, c, d})
  2    ({b}, {a, b, c})
  3    ({a, b}, {b, c, d})

The SPAM Algorithm (Lexicographic Tree for Sequences)

Consider all sequences arranged in a sequence tree with the following structure: the root of the tree is labeled with {}, the empty sequence, and each child of a node extends that node's sequence by one item.

[Figure 1: the lexicographic sequence tree for I = {a, b}, showing the branch below ({a}): the nodes ({a},{a}), ({a},{b}), ({a,b}), ({a},{a},{a}), ({a},{a},{b}), and ({a},{a,b}).]

Each sequence in the sequence tree can be considered as either a sequence-extended sequence or an itemset-extended sequence of its parent. For example, take Sa = ({a, b, c}, {a, b}):
- S-step: appending a new itemset yields a sequence-extended sequence of Sa, e.g., ({a, b, c}, {a, b}, {d}).
- I-step: adding an item to the last itemset yields an itemset-extended sequence of Sa, e.g., ({a, b, c}, {a, b, d}).

S_n is the set of candidate items that are considered for possible S-step extensions of node n (abbreviated s-extensions). Example: S({a}) = {a, b, c, d}, giving the s-extensions ({a},{a}), ({a},{b}), ({a},{c}), and ({a},{d}).

I_n is the set of candidate items that are considered for possible I-step extensions of node n (abbreviated i-extensions). Example: I({a}) = {b, c, d}, giving the i-extensions ({a, b}), ({a, c}), and ({a, d}).

The SPAM Algorithm (Depth-First Tree Traversal)

The tree is traversed in a standard depth-first manner. At each node, the support of the node's sequence S is tested:
- If sup_D(S) >= minSup, store S and repeat the DFS on S's children.
- If sup_D(S) < minSup, there is no need to repeat the DFS below S: by the Apriori principle, no extension of an infrequent sequence can be frequent.
Even so, the search space is huge, which motivates the pruning steps below.

The SPAM Algorithm (Pruning)

S-step pruning. Suppose we are at node ({a}) in the tree, with S({a}) = {a, b, c, d} and I({a}) = {b, c, d}, and suppose that ({a},{c}) and ({a},{d}) are not frequent. By the Apriori principle, ({a},{a},{c}), ({a},{b},{c}), ({a},{a,c}), ({a},{b,c}), ({a},{a},{d}), ({a},{b},{d}), ({a},{a,d}), and ({a},{b,d}) are not frequent either. Hence, when we are at node ({a},{a}) or ({a},{b}), we do not have to perform the I-step or the S-step using items c and d, i.e., S({a},{a}) = S({a},{b}) = {a, b}, I({a},{a}) = {b}, and I({a},{b}) = {}.

I-step pruning. Consider the same node ({a}). The possible itemset-extended sequences are ({a, b}), ({a, c}), and ({a, d}). If ({a, c}) is not frequent, then ({a, b, c}) must also be infrequent by the Apriori principle. Hence, I({a, b}) = {d}, and S({a, b}) = S({a, d}) = {a, b}.

[Figures: the subtrees below ({a}) with the pruned s- and i-extensions removed.]

Data Representation (Data Structure)

Our algorithm uses a vertical bitmap representation of the data. A vertical bitmap is created for each item in the dataset, and each bitmap has a bit corresponding to each transaction in the dataset. If item i appears in transaction j, then the bit corresponding to transaction j of the bitmap for item i is set to one; otherwise, the bit is set to zero.

If the size of a customer's sequence is between 2^k + 1 and 2^(k+1), we consider it as a 2^(k+1)-bit sequence, i.e., the customer occupies a 2^(k+1)-bit section of each bitmap. Example: a sequence of size 3 has k = 1, so it is stored in 2^2 = 4 bits. For the running dataset ("-" marks an unused padding bit):

  CID  TID  {a}  {b}  {c}  {d}
  1    1    1    1    0    1
  1    3    0    1    1    1
  1    6    0    1    1    1
  1    -    0    0    0    0
  2    2    0    1    0    0
  2    4    1    1    1    0
  2    -    0    0    0    0
  2    -    0    0    0    0
  3    5    1    1    0    0
  3    7    0    1    1    1
  3    -    0    0    0    0
  3    -    0    0    0    0

Data Representation (Candidate Generation)

S-step process. We first generate a transformed bitmap from B(Sa): for each customer, all bits less than or equal to k are set to zero, and all bits after k are set to one, where k is the index of the first set bit in that customer's section. We then AND the transformed bitmap with the item's bitmap. The resultant bitmap has the properties described above, and it is exactly the bitmap for the generated sequence.

Example, extending ({a}) with item b (one 4-bit section per customer):

  CID  ({a})  ({a})_s  {b}   ({a},{b})
  1    1000   0111     1110  0110
  2    0100   0011     1100  0000
  3    1000   0111     1100  0100

I-step process. An I-step needs no transformation: the bitmap of the extended sequence is simply the AND of the two bitmaps. Example, extending ({a},{b}) with item d:

  CID  ({a},{b})  {d}   ({a},{b,d})
  1    0110       1110  0110
  2    0000       0000  0000
  3    0100       0100  0100
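To make the bitmap operations concrete, here is a minimal Python sketch of the vertical bitmaps and the S-step/I-step processes on the running example. The fixed 4-bit sections, the integer-as-bitmap encoding, and the helper names (build_bitmaps, s_transform, support) are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of SPAM's vertical bitmaps, assuming a fixed 4-bit section
# per customer (enough for the running example) and Python ints as bitmaps.
SECTION = 4

# Running example: {CID: [(TID, itemset), ...]}, sorted by CID and TID.
dataset = {
    1: [(1, {'a', 'b', 'd'}), (3, {'b', 'c', 'd'}), (6, {'b', 'c', 'd'})],
    2: [(2, {'b'}), (4, {'a', 'b', 'c'})],
    3: [(5, {'a', 'b'}), (7, {'b', 'c', 'd'})],
}

def build_bitmaps(dataset):
    """One vertical bitmap per item: bit j of customer c's section is 1
    iff the item occurs in the j-th transaction of customer c."""
    bitmaps = {}
    for slot, (cid, txns) in enumerate(sorted(dataset.items())):
        for pos, (_tid, itemset) in enumerate(txns):
            for item in itemset:
                bitmaps[item] = bitmaps.get(item, 0) | 1 << (slot * SECTION + pos)
    return bitmaps

def s_transform(bitmap, n_customers):
    """S-step transformation: within each customer section, zero every bit
    up to and including the first set bit k, and set every bit after k."""
    out = 0
    full = (1 << SECTION) - 1
    for slot in range(n_customers):
        section = (bitmap >> (slot * SECTION)) & full
        if section:
            k = (section & -section).bit_length() - 1   # index of first set bit
            out |= (full & ~((1 << (k + 1)) - 1)) << (slot * SECTION)
    return out

def support(bitmap, n_customers):
    """Number of customers whose section contains at least one set bit."""
    full = (1 << SECTION) - 1
    return sum(1 for s in range(n_customers) if (bitmap >> (s * SECTION)) & full)

bitmaps = build_bitmaps(dataset)
b_ab  = s_transform(bitmaps['a'], 3) & bitmaps['b']   # S-step: ({a}) -> ({a},{b})
b_abd = b_ab & bitmaps['d']                           # I-step: ({a},{b}) -> ({a},{b,d})
print(support(b_ab, 3), support(b_abd, 3))            # both 2, as in the tables above
```

Counting support by scanning sections one customer at a time is the simplest correct choice here; the point of the sketch is that, once the data is in vertical bitmaps, both extension steps reduce to a bitwise AND.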
Experimental Evaluation (Comparison with SPADE and PrefixSpan)

We compared the three algorithms on several small, medium, and large datasets for various minimum support values. These tests show that SPAM outperforms SPADE by about a factor of 2.5 on small datasets and by more than an order of magnitude on reasonably large datasets. PrefixSpan outperforms SPAM slightly on very small datasets, but on large datasets SPAM outperforms PrefixSpan by over an order of magnitude.

Conclusion

We presented an algorithm to quickly find all frequent sequences in a list of transactions. The algorithm utilizes a depth-first traversal of the search space combined with a vertical bitmap representation to store each sequence. Experimental results demonstrated that our algorithm outperforms SPADE and PrefixSpan on large datasets.
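To recap how the pieces fit together, the sketch below adds the depth-first traversal with S-step/I-step extension and the pruned candidate sets on top of the bitmap helpers from the earlier sketch (build_bitmaps, s_transform, support, dataset). The signature and traversal details are a simplification for illustration, not the paper's exact pseudocode.

```python
def spam_dfs(seq, bm, s_items, i_items, bitmaps, n_cust, min_sup, out):
    """DFS from a frequent node: seq is the node's sequence (list of
    itemsets), bm its bitmap, s_items/i_items its pruned candidate sets."""
    out.append(seq)
    # S-step: keep only items whose sequence extension is frequent; the
    # surviving set is handed down to every child (S-step pruning).
    s_bm = s_transform(bm, n_cust)
    s_ok = []
    for i in s_items:
        b = s_bm & bitmaps[i]
        if support(b, n_cust) >= min_sup:
            s_ok.append((i, b))
    s_keep = [i for i, _ in s_ok]
    for i, b in s_ok:
        spam_dfs(seq + [{i}], b, s_keep, [j for j in s_keep if j > i],
                 bitmaps, n_cust, min_sup, out)
    # I-step: a plain AND; the surviving items form each child's pruned
    # I-candidate set (I-step pruning).
    i_ok = []
    for i in i_items:
        b = bm & bitmaps[i]
        if support(b, n_cust) >= min_sup:
            i_ok.append((i, b))
    for i, b in i_ok:
        spam_dfs(seq[:-1] + [seq[-1] | {i}], b, s_keep,
                 [j for j, _ in i_ok if j > i],
                 bitmaps, n_cust, min_sup, out)

# Mine the running example with minSup = 2.
bitmaps = build_bitmaps(dataset)
freq = [i for i in sorted(bitmaps) if support(bitmaps[i], 3) >= 2]
patterns = []
for i in freq:
    spam_dfs([{i}], bitmaps[i], freq, [j for j in freq if j > i],
             bitmaps, 3, 2, patterns)
print(len(patterns), "frequent sequences found")
```

Note how the pruning falls out of the recursion: a child node never re-tests items its parent already found infrequent, because only the surviving candidate sets are passed down.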