Association Rules: Algorithms: FP-growth
Alípio Jorge, DCC-FC, Universidade do Porto
[email protected]
Data Mining 2

FP-Growth [Han, Pei, Yin]
- An algorithm more efficient than APRIORI
- Compresses the database into a more efficient structure
- Reuses pattern fragments
- Reduces the number of accesses to the DB
- Reduces memory and computation needs
- A divide-and-conquer method reduces the search space
- One order of magnitude faster than APRIORI

Shortcomings of APRIORI
- Generation of long patterns {a1, a2, …, an}
- How many sub-patterns are needed?

Compressing a DB
- Consider the database: abc, bc, cd, cdf, bd, d
  - what's its size?
  - can we compress it?
- We only want frequent sets, not individual transactions
- Recall APRIORI-TID

FP-Tree Principles
1. We only need the frequent items. We go through the DB once to count the frequency of all items.
2. We keep the item frequencies in a compact data structure to avoid new visits to the DB.
3. If different transactions share a frequent item set, we can partially merge them. Frequent items must be arranged in a predefined order.
4. If two transactions share a prefix (according to the order), we can use a structure based on prefixes. We must, however, keep the counts. That is easier if the items are sorted by descending frequency.
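The four principles can be illustrated with a minimal Python sketch. It uses the small database from the "Compressing a DB" slide; the threshold minsup=2 and all function names are assumptions made only for this illustration. Each node is stored as a plain dictionary entry with a count, and the header table and node-links of the full FP-Tree are left out.

```python
from collections import Counter

def frequent_items(db, minsup):
    """One pass over the DB: count items, keep those with count >= minsup."""
    counts = Counter(item for t in db for item in set(t))
    return {item: n for item, n in counts.items() if n >= minsup}

def ordered_projection(transaction, freq):
    """Keep only frequent items, sorted by descending count (ties by name)."""
    items = [i for i in set(transaction) if i in freq]
    return sorted(items, key=lambda i: (-freq[i], i))

def build_prefix_tree(db, minsup):
    """Merge shared prefixes; a tree level is a dict {item: [count, children]}."""
    freq = frequent_items(db, minsup)
    root = {}
    for t in db:
        node = root
        for item in ordered_projection(t, freq):
            entry = node.setdefault(item, [0, {}])
            entry[0] += 1      # one more transaction shares this prefix
            node = entry[1]    # descend into the children of this node
    return freq, root

def count_nodes(tree):
    """Number of nodes below the root."""
    return sum(1 + count_nodes(children) for _, children in tree.values())

# Database from the "Compressing a DB" slide; minsup=2 is an assumed threshold.
db = ["abc", "bc", "cd", "cdf", "bd", "d"]
freq, tree = build_prefix_tree(db, minsup=2)
print(freq)               # frequent items with their counts
print(count_nodes(tree))  # size of the prefix tree
```

With this threshold, items a and f are dropped, and the six transactions collapse into five nodes because abc/bc and cd/cdf share their prefixes after reordering.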

FP-Growth
- Example: data and frequent items (minsup=3 (0.6))

FP-Tree
- From example

FP-Tree Definition
An FP-Tree consists of
- a root node
- a set of item prefix subtrees
- a frequent-item header table
Each node in the item prefix subtree has 3 fields
- item-name
- count (of the prefix)
- node-link (pointer to next node with same name)
Each entry in the frequent-item header table has
- item name
- head of node-link (pointer to first node with that name)

FP-Tree construction algorithm
Input: DB, minsup ms
Output: FP-Tree
Method
1. Scan DB once, collect frequent items F
2. Sort F in support descending order
3. Create root of Tree and, for each transaction T of DB, insert freq items of T in Tree sorted by descending freq
Inserting items <p1, p2, …> in Tree
- if root of Tree has a child N with p1, increase counter of node N
- else create a new child N with p1 and count 1
- insert <p2, …> in subtree with root N

Analysis
- FP-tree construction takes 2 scans of DB
  - frequent item identification
  - tree construction
- Inserting a transaction T is O(|freq(T)|)
  - freq(T) is the frequent item projection of transaction T
- Completeness
  - the FP-Tree contains all information we need about frequent sets
- Compactness
  - size of Tree <= SUM over T of |freq(T)|
  - height of Tree = MAX over T of |freq(T)|

Analysis
- How much space does APRIORI need?
  - may grow exponentially
- How does frequency ordering influence compactness?

Exercise
Compare Apriori and FP-Tree on the following data. How much space is needed?
- APRIORI: number of candidate sets
- FP-Growth: size of Tree + size of Table
DB={{a,b,c,d},{a,b,c,d},{a,e},{a,b}}, minsup=50%
- Generate Tree now considering increasing frequency ordering
- compare size of both trees
Consider DB={adef, bdef, cdef, a, a, a, b, b, b, c, c, c}
- build tree with decreasing freq order
- can you find a different ordering that achieves a more compact Tree?

Where are we now?
- We have a compact representation of the DB
- How can we obtain the frequency of a pattern?
- How can we obtain ALL frequent patterns?

Obtaining frequent patterns
- Algorithm FP-growth
- Properties
  - all freq sets with item A can be obtained from the Tree by following node-links starting in the table (e.g. freq({b,c})?)

FP-Growth
1. Consider each item A at a time (start with the least frequent)
2. obtain long patterns involving A (conditional pattern base)
3. obtain an FP-Tree from those patterns (conditional FP-Tree)
4. obtain patterns of size 2 from cond FP-Tree and A
5. for each size 2 pattern "repeat" from step 2

FP-Growth: patterns with "p" (minsup=3)
- Conditional pattern base: fcam:2, cb:1
- Conditional FP-Tree: only c is frequent, Tree is [root -> c:3]
- Size 2 patterns: cp:3
- Patterns with "p": p:3, cp:3
- Now we don't have to worry about "p": all patterns with "p" have been derived

FP-Growth: patterns with "m" (minsup=3)
- Conditional pattern base: fca:2, fcab:1
- Conditional FP-Tree: Table: [f:3, c:3, a:3], Tree is [root -> f:3 -> c:3 -> a:3]
- Size 2 patterns: am:3, cm:3, fm:3
- extend am (next slide)

FP-Growth: patterns with "am" (minsup=3)
- Conditional pattern base: fc:3
- Conditional FP-Tree: Table: [f:3, c:3], Tree is [root -> f:3 -> c:3]
- Size 3 patterns: cam:3, fam:3
- extend cam
- fam is not extendable

FP-Growth: patterns with "cam" (minsup=3)
- Now the Tree is [root -> f:3]
- Conditional pattern base: f:3
- Conditional FP-Tree: Tree is [root -> f:3]
- Size 4 pattern: fcam:3, no longer patterns available

FP-Growth (Exercise)
- Continuing… Extend
  - cm
  - fm
- And then find patterns with
  - b
  - a
  - c
  - f

Properties
- To have the frequent patterns with suffix A, we only need to consider the prefix paths of nodes with A.
  - the count is that of A
- E.g. with b

Properties
- An item set B∪{A} is frequent in DB iff B is frequent in the conditional pattern base of A.
  - That's why we just need the tree

Single-path tree (special case)
- An FP-Tree with a single path can be mined by enumerating all the combinations of the subpaths.
  - The count of each subpath is the minimum count of the nodes in it

Single prefix-path Tree (special case)
- Can be mined more efficiently.
  - Separate single prefix-path from multi-path part
  - Mine separately
  - Combine

Finally: FP-Growth algorithm
Input: DB (as a Tree), minsup
Output: the set of frequent patterns
Method: call FP-growth(Tree, null)
Procedure FP-growth(Tree, A)
  If Tree has single prefix-path P and a multipath part Q
    Generate patterns from combinations in P+A (freq-pattern(P))
  Else let Q be Tree
  For each item Ai in Q
    Generate pattern B=Ai+A
    Construct B cond-pattern-base and cond FP-Tree Tree_B
    If Tree_B is not empty
      call freq-pattern(B)=FP-growth(Tree_B, B)
  Return freq-pattern(P) + freq-pattern(Q) + freq-pattern(P) × freq-pattern(Q)

Analysis
It can be shown that the algorithm finds the complete set of frequent patterns in the given transaction database DB.
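The recursion of the FP-growth procedure can be sketched in Python as below. This is a simplified illustration, not the procedure from the slides: each conditional pattern base is kept as a plain list of (prefix, count) pairs instead of being compressed into a conditional FP-Tree, and the single prefix-path shortcut is omitted. The names `mine` and `pattern_growth` are hypothetical, and the data is the first exercise database with minsup=50% (2 of 4 transactions).

```python
from collections import Counter

def pattern_growth(base, minsup, suffix=()):
    """base: list of (items, count), items sorted in one fixed global order.
    Returns {frozenset(pattern): support} for frequent patterns ending in suffix."""
    counts = Counter()
    for items, n in base:
        for item in items:
            counts[item] += n          # count local frequent items
    patterns = {}
    for item, support in counts.items():
        if support < minsup:
            continue
        new_suffix = (item,) + suffix  # pattern fragment concatenation
        patterns[frozenset(new_suffix)] = support
        # Conditional pattern base of item: the prefix before it in each path.
        cond = []
        for items, n in base:
            if item in items:
                prefix = items[:items.index(item)]
                if prefix:
                    cond.append((prefix, n))
        patterns.update(pattern_growth(cond, minsup, new_suffix))
    return patterns

def mine(db, minsup):
    """Order frequent items by descending support, then grow patterns."""
    freq = Counter(i for t in db for i in set(t))
    def key(i):
        return (-freq[i], i)
    base = []
    for t in db:
        items = sorted((i for i in set(t) if freq[i] >= minsup), key=key)
        base.append((tuple(items), 1))
    return pattern_growth(base, minsup)

# First exercise database; minsup=50% of 4 transactions is 2.
db = [{"a", "b", "c", "d"}, {"a", "b", "c", "d"}, {"a", "e"}, {"a", "b"}]
patterns = mine(db, minsup=2)
for p, s in sorted(patterns.items(), key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print("".join(sorted(p)), s)
```

Because every pattern is generated from its last item (in the global order) backwards, each frequent item set is produced exactly once, which mirrors the suffix-based properties stated earlier.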
An FP-Tree is usually much smaller than DB.
It scans the FP-Tree of DB once for each frequent item A
- Generates a small pattern base B_A
- Pattern mining is done recursively on the small B_A
Mining operations consist of
- Prefix count adjustment
- Counting local frequent items
- Pattern fragment concatenation

More…
- Scaling FP-growth by not assuming the Tree fits main memory
  - Divide the database adequately and join the patterns obtained
  - Project the original DB into different DBs
- Experimental evaluation
  - Compare FP-growth with Apriori
  - Compare FP-growth with TreeProjection [Agarwal et al 2001]
  - Reimplement competing algorithms
  - Synthetic and real data sets

Experimental evaluation
- Assess compactness of FP-Trees
  - By measuring their size on connect-4 (and other data)

Experimental evaluation
- Scalability study
  - By measuring time of competing algorithms on different data sets (here a sparse artificial data set and the (dense) connect-4)

Exercises (on paper)
- Consider DB={{a,b,c,d},{a,b,c,d},{f,c,d},{e,c,d},{c}}
  - Obtain the frequent patterns using FP-Growth, minsup=1

Advanced exercises
- Implement your own version of building an FP-Tree from a database of transactions
  - If you did it in R, look at the FP-Tree of the greengrocers dataset
- Try Borgelt's implementation of FP-Growth
  - http://www.borgelt.net/fpgrowth.html
  - Compare it with Borgelt's implementation of Apriori

References
- Jiawei Han, Jian Pei, Yiwen Yin, and Runying Mao. 2004. Mining Frequent Patterns without Candidate Generation: A Frequent-Pattern Tree Approach. Data Min. Knowl. Discov. 8, 1 (January 2004)
- Ramesh C. Agarwal, Charu C. Aggarwal, and V.V.V. Prasad. 2001. A tree projection algorithm for generation of frequent item sets. J. Parallel Distrib. Comput. 61, 3 (March 2001)
- Rakesh Agrawal and Ramakrishnan Srikant. 1994.
  Fast Algorithms for Mining Association Rules in Large Databases. In Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94)
- Rakesh Agrawal, Heikki Mannila, Ramakrishnan Srikant, Hannu Toivonen, and A. Inkeri Verkamo. 1996. Fast discovery of association rules. In Advances in knowledge discovery and data mining, Usama M. Fayyad, Gregory Piatetsky-Shapiro, Padhraic Smyth, and Ramasamy Uthurusamy (Eds.)