Parallel Association Rule Mining
Presented by: Ramoza Ahsan and Xiao Qin
November 5th, 2013
Outline
 Background of Association Rule Mining
 Apriori Algorithm
 Parallel Association Rule Mining
 Count Distribution
 Data Distribution
 Candidate Distribution
 FP tree Mining and growth
 Fast Parallel Association Rule mining without candidate
generation
 More Readings
Association Rule Mining
 Association rule mining
 Finding interesting patterns in data. (Analysis of past transaction data can provide valuable information on customer buying behavior.)
 A record usually contains the transaction date and the items bought.
 The literature has focused mostly on serial mining.
 Support and confidence: the two parameters of association rule mining.
Association Rule Mining Parameters
 The support, supp(X), of an itemset X is the proportion of transactions in the data set which contain the itemset.
 The confidence of a rule X -> Y is the fraction of transactions containing X which also contain Y, i.e. supp(X U Y)/supp(X).
Transaction ID | Milk | Bread | Egg | Juice
             1 |    1 |     1 |   0 |     0
             2 |    0 |     0 |   1 |     0
             3 |    0 |     0 |   0 |     1
             4 |    1 |     1 |   1 |     0
             5 |    0 |     1 |   0 |     0
Supp(milk, bread, egg) = 1/5, and the rule {milk, bread} -> {egg} has confidence = (1/5)/(2/5) = 0.5.
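As a minimal sketch, the two definitions can be computed directly over the table above (the function names supp and conf are just illustrative):

```python
# The five transactions from the table above.
transactions = [
    {"milk", "bread"},          # TID 1
    {"egg"},                    # TID 2
    {"juice"},                  # TID 3
    {"milk", "bread", "egg"},   # TID 4
    {"bread"},                  # TID 5
]

def supp(itemset):
    """Proportion of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def conf(lhs, rhs):
    """Confidence of the rule lhs -> rhs: supp(lhs U rhs) / supp(lhs)."""
    return supp(set(lhs) | set(rhs)) / supp(set(lhs))

print(supp({"milk", "bread", "egg"}))    # 0.2 (= 1/5)
print(conf({"milk", "bread"}, {"egg"}))  # 0.5
```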
Outline
 Background of Association Rule Mining
 Apriori Algorithm
 Parallel Association Rule Mining
 Count Distribution
 Data Distribution
 Candidate Distribution
 FP tree Mining and growth
 Fast Parallel Association Rule mining without candidate
generation
 FP tree over Hadoop
Apriori Algorithm
Apriori runs in two steps.
 Generation of candidate itemsets
 Pruning of itemsets which are infrequent
Level-wise generation of frequent itemsets.
Apriori principle:
If an itemset is frequent, then all of its subsets must also be frequent. (Contrapositively, no superset of an infrequent itemset can be frequent, which is what justifies the pruning step.)
Apriori Algorithm for generating frequent itemsets
[Worked example (figure): level-wise generation of frequent itemsets with minimum support = 2]
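In code, the level-wise loop might look like the following sketch (illustrative Python, not the paper's exact pseudocode; apriori is a hypothetical name):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent itemset generation (illustrative sketch)."""
    transactions = [set(t) for t in transactions]
    # L1: frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_support}
    frequent, k = set(Lk), 2
    while Lk:
        # Generate: join (k-1)-itemsets whose union has size k.
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune: by the Apriori principle, a candidate with any
        # infrequent (k-1)-subset cannot be frequent.
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Count candidates against the data and keep the frequent ones.
        Lk = {c for c in Ck
              if sum(c <= t for t in transactions) >= min_support}
        frequent |= Lk
        k += 1
    return frequent
```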
Parallel Association Rule Mining
 The paper presents parallel algorithms for generating frequent itemsets.
 Each of the N processors has private memory and disk.
 Data is distributed evenly across the disks of the processors.
 The Count Distribution algorithm focuses on minimizing communication.
 Data Distribution utilizes the aggregate memory of the system more effectively.
 Candidate Distribution reduces synchronization between processors.
Algorithm 1: Count Distribution
 Each processor generates the complete candidate set Ck, using the complete frequent itemset Lk-1.
 Each processor traverses its local data partition and develops local support counts.
 Processors exchange the local counts to develop global counts; synchronization is needed here.
 Each processor computes Lk from Ck.
 Each processor independently makes the (identical) decision to continue or stop.
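One pass can be simulated sequentially as a sketch (each loop iteration plays the role of one processor; the Counter sum stands in for the all-to-all count exchange):

```python
from collections import Counter

def count_distribution_pass(partitions, Ck, min_support):
    """One pass of Count Distribution, simulated sequentially:
    every 'processor' holds the complete candidate set Ck and
    counts it against only its own local data partition."""
    local_counts = []
    for partition in partitions:          # each iteration = one processor
        counts = Counter()
        for t in partition:
            counts.update(c for c in Ck if c <= t)
        local_counts.append(counts)
    # The all-to-all count exchange -- the only synchronization point.
    global_counts = sum(local_counts, Counter())
    # Every processor derives the identical Lk from the same global counts.
    return {c for c, n in global_counts.items() if n >= min_support}
```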
Algorithm 2: Data Distribution
 Partition the dataset into N small chunks.
 Partition the set of candidate k-itemsets into N exclusive subsets.
 Each of the N nodes takes one subset and counts the frequency of its itemsets in one chunk at a time, until it has counted through all the chunks.
 Aggregate the counts.
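A sequential sketch of one such pass, under the simplifying assumption that "passing chunks around" is modeled by each node iterating over all chunks (data_distribution_pass is a hypothetical name):

```python
def data_distribution_pass(partitions, Ck, min_support):
    """One pass of Data Distribution, simulated sequentially: each
    'node' owns 1/N of the candidates but must count them against
    every data chunk."""
    N = len(partitions)
    ordered = sorted(Ck, key=sorted)                 # deterministic split
    subsets = [ordered[i::N] for i in range(N)]      # 1/N of Ck per node
    Lk = set()
    for my_candidates in subsets:                    # each iteration = one node
        counts = {c: 0 for c in my_candidates}
        for chunk in partitions:                     # node sees all chunks
            for t in chunk:
                for c in my_candidates:
                    if c <= t:
                        counts[c] += 1
        Lk |= {c for c, n in counts.items() if n >= min_support}
    return Lk
```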
Algorithm 2: Data Distribution
[Diagram (two slides): each of the N nodes holds 1/N of the data and 1/N of Ck; data chunks circulate among the nodes so that every candidate subset is counted against the full dataset, followed by a synchronize step.]
Algorithm 3: Candidate Distribution
 If the workload is not balanced, all processors may end up waiting for whichever processor finishes last in every pass.
 The Candidate Distribution algorithm tries to remove this dependency by partitioning both the data and the candidates.
Algorithm 3: Candidate Distribution
[Diagram (two slides): Lk-1 is split into subsets Lk-1_1 … Lk-1_5, from which the five processors generate their own candidate subsets Ck_1 … Ck_5; the dataset is repartitioned accordingly into Data_1 … Data_5.]
Data Partition and L Partition
 Data
 In each pass, every node grabs the tuples it needs from the dataset.
 L
 Let L3 = {ABC, ABD, ABE, ACD, ACE}.
 The items in the itemsets are lexicographically ordered.
 Partition the itemsets based on common (k-1)-long prefixes: here {ABC, ABD, ABE} share the prefix AB and {ACD, ACE} share AC (see the sketch below).
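A small sketch of the prefix-based partition (partition_candidates is a hypothetical helper):

```python
from collections import defaultdict

def partition_candidates(Lk):
    """Group lexicographically ordered itemsets by their common prefix
    (all items but the last); each group goes to one processor."""
    groups = defaultdict(list)
    for itemset in Lk:
        items = tuple(sorted(itemset))
        groups[items[:-1]].append(items)
    return dict(groups)

L3 = [{"A", "B", "C"}, {"A", "B", "D"}, {"A", "B", "E"},
      {"A", "C", "D"}, {"A", "C", "E"}]
print(partition_candidates(L3))
# {('A', 'B'): [('A', 'B', 'C'), ('A', 'B', 'D'), ('A', 'B', 'E')],
#  ('A', 'C'): [('A', 'C', 'D'), ('A', 'C', 'E')]}
```

Candidates produced by joining two itemsets always share such a prefix, so each processor can generate and count its share of Ck without synchronizing with the others.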
Rule Generation
 Example:
 Frequent itemsets ABCDE and AB.
 The rule that can be generated from this pair is
AB => CDE
Support: sup(ABCDE)
Confidence: sup(ABCDE)/sup(AB)
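The same computation as a tiny sketch (the support values here are made up for illustration):

```python
def rule(lhs, rhs, supports):
    """Support and confidence of the rule lhs -> rhs from a support table."""
    lhs, full = frozenset(lhs), frozenset(lhs) | frozenset(rhs)
    return supports[full], supports[full] / supports[lhs]

# Illustrative support values (invented for this example):
supports = {frozenset("ABCDE"): 0.2, frozenset("AB"): 0.4}
print(rule("AB", "CDE", supports))   # (0.2, 0.5)
```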
Outline
 Background of Association Rule Mining
 Apriori Algorithm
 Parallel Association Rule Mining
 Count Distribution
 Data Distribution
 Candidate Distribution
 FP tree Mining and growth
 Fast Parallel Association Rule mining without candidate
generation
 FP tree over Hadoop
FP Tree Algorithm
Allows frequent itemset discovery without candidate itemset generation:
• Step 1: Build a compact data structure called the FP-tree, using two passes over the data set.
• Step 2: Extract frequent itemsets directly from the FP-tree.
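A minimal sketch of Step 1 (the Node class and build_fp_tree are illustrative names, not a library API):

```python
from collections import Counter

class Node:
    """One FP-tree node: an item, its count, a parent link, children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Pass 1: count items globally and prune the infrequent ones.
    counts = Counter(i for t in transactions for i in t)
    header = dict(sorted(((i, c) for i, c in counts.items()
                          if c >= min_support), key=lambda ic: -ic[1]))
    order = list(header)          # items in descending frequency
    # Pass 2: insert each transaction, reordered by descending
    # global frequency, into the prefix tree.
    root = Node(None, None)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in header), key=order.index):
            node = node.children.setdefault(item, Node(item, node))
            node.count += 1
    return root, header
```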
FP-Tree & FP-Growth example
[Example slide (figure): FP-tree construction and FP-Growth mining with min supp = 3]
Fast Parallel Association Rule Mining Without Candidacy Generation
 Phase 1:
 Each processor is given an equal number of transactions.
 Each processor locally counts the items.
 The local counts are summed to obtain global counts.
 Infrequent items are pruned; frequent items are stored in a header table in descending order of frequency.
 Parallel frequent pattern trees are then constructed, one per processor.
 Phase 2: mining of the FP-trees, similar to the FP-Growth algorithm, using the global counts in the header table (a sketch of Phase 1 follows).
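Phase 1's counting and header-table construction can be sketched as follows, using the nine transactions of the example below (sequential simulation; global_header_table is a hypothetical helper):

```python
from collections import Counter

def global_header_table(processor_partitions, min_support):
    """Phase 1 sketch: sum per-processor local counts into global
    counts, prune infrequent items, sort by descending frequency."""
    local = [Counter(i for t in part for i in t)
             for part in processor_partitions]           # local counting
    global_counts = sum(local, Counter())                # count exchange
    return dict(sorted(((i, c) for i, c in global_counts.items()
                        if c >= min_support), key=lambda ic: -ic[1]))

# The nine transactions of the example below, split across P0, P1, P2:
P0 = [{"A","B","C","D","E"}, {"F","B","D","E","G"}, {"B","D","A","E","G"}]
P1 = [{"A","B","F","G","D"}, {"B","F","D","G","K"}, {"A","B","F","D","G","K"}]
P2 = [{"A","R","M","K","O"}, {"B","F","G","A","D"}, {"A","B","F","M","O"}]
print(global_header_table([P0, P1, P2], min_support=4))
# {'B': 8, 'A': 7, 'D': 7, 'F': 6, 'G': 6}  (ties may print in either order)
```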
Example with min supp = 4

TID | Items       | Processor
  1 | A,B,C,D,E   | P0
  2 | F,B,D,E,G   | P0
  3 | B,D,A,E,G   | P0
  4 | A,B,F,G,D   | P1
  5 | B,F,D,G,K   | P1
  6 | A,B,F,D,G,K | P1
  7 | A,R,M,K,O   | P2
  8 | B,F,G,A,D   | P2
  9 | A,B,F,M,O   | P2
Step 1: local item counts per processor

Item | P0 | P1 | P2
A    |  2 |  2 |  3
B    |  3 |  3 |  2
C    |  1 |  0 |  0
D    |  3 |  3 |  1
E    |  3 |  0 |  0
F    |  1 |  3 |  2
G    |  2 |  3 |  1
K    |  0 |  2 |  1
M    |  0 |  0 |  2
O    |  0 |  0 |  2
R    |  0 |  0 |  1

(Steps 2-3: the local counts are exchanged and summed into global counts.)

Step 4: global counts after pruning infrequent items (min supp = 4)

Item | Global Counter
B    | 8
A    | 7
D    | 7
F    | 6
G    | 6
FP tree for P0

Item | Global Counter
B    | 8
A    | 7
D    | 7
F    | 6
G    | 6

TID | Items     | Reordered Transaction
  1 | A,B,C,D,E | B,A,D
  2 | F,B,D,E,G | B,D,F,G
  3 | B,D,A,E,G | B,A,D,G

[Diagram: the tree grows as the three reordered transactions are inserted (the root's B node goes B:1 → B:2 → B:3), ending with the branches B:3 → A:2 → D:2 → G:1 and B:3 → D:1 → F:1 → G:1.]
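P0's tree can be rebuilt with a small variant of the earlier build_fp_tree sketch: in the parallel algorithm the header table is a global input, so pruning and reordering use the global counts rather than P0's local ones:

```python
# Global header table (from the slide above), in descending frequency.
header = {"B": 8, "A": 7, "D": 7, "F": 6, "G": 6}
order = list(header)

root = Node(None, None)   # Node class from the earlier sketch
for t in [{"A","B","C","D","E"}, {"F","B","D","E","G"}, {"B","D","A","E","G"}]:
    node = root
    # Keep only globally frequent items, reorder, then insert.
    for item in sorted((i for i in t if i in header), key=order.index):
        node = node.children.setdefault(item, Node(item, node))
        node.count += 1
# Resulting tree: B:3 with branches A:2 -> D:2 -> G:1 and D:1 -> F:1 -> G:1.
```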
Construction of local FP trees
[Figure: the local FP-trees built independently by P0, P1, and P2]
Conditional Pattern Bases

Items | Conditional Pattern Base
G     | {D:1, A:1, B:1}, {F:1, D:1, B:1}
F     | {D:1, B:1}
D     | {A:2, B:2}, {B:1}
A     | {B:2}
B     | {}
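A sketch of extracting a conditional pattern base by walking the tree, reusing Node and root from the sketches above (real implementations typically follow the header table's node links instead):

```python
def conditional_pattern_base(root, item):
    """Collect the prefix path of every `item` node, each entry
    weighted by that node's count."""
    bases = []
    def walk(node):
        if node.item == item:
            path, p = [], node.parent
            while p is not None and p.item is not None:
                path.append((p.item, node.count))
                p = p.parent
            if path:
                bases.append(path)
        else:                      # item occurs at most once per branch
            for child in node.children.values():
                walk(child)
    walk(root)
    return bases

print(conditional_pattern_base(root, "G"))
# [[('D', 1), ('A', 1), ('B', 1)], [('F', 1), ('D', 1), ('B', 1)]]
```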
Frequent pattern strings
 All frequent pattern trees are shared by all processors.
 Each processor generates conditional pattern bases for its respective items in the header table.
 Merging all conditional pattern bases of the same item yields the frequent pattern string.
 If the support of an item is less than the threshold, it is not added to the final frequent string.
More Readings
 FP-Growth on Hadoop: the core algorithm runs as 3 Map-Reduce passes
Thank You!