Parallel Association Rule Mining
Presented by: Ramoza Ahsan and Xiao Qin
November 5th, 2013
Outline
 Background of Association Rule Mining
 Apriori Algorithm
 Parallel Association Rule Mining
 Count Distribution
 Data Distribution
 Candidate Distribution
 FP tree Mining and growth
 Fast Parallel Association Rule mining without candidate
generation
 More Readings
Association Rule Mining
 Association rule mining
 Finding interesting patterns in data. (Analysis of past transaction data can provide valuable information on customer buying behavior.)
 A record usually contains the transaction date and the items bought.
 The literature has focused mostly on serial mining.
 Support and confidence: the two parameters of association rule mining.
Association Rule Mining Parameters
 The support, supp(X), of an itemset X is the proportion of transactions in the data set which contain the itemset.
 The confidence of a rule X -> Y is the fraction of transactions containing X which also contain Y, i.e. supp(X U Y)/supp(X).
Transaction ID | Milk | Bread | Egg | Juice
             1 |    1 |     1 |   0 |     0
             2 |    0 |     0 |   1 |     0
             3 |    0 |     0 |   0 |     1
             4 |    1 |     1 |   1 |     0
             5 |    0 |     1 |   0 |     0
Supp(milk, bread, egg) = 1/5, and the rule {milk, bread} -> {egg} has confidence = (1/5)/(2/5) = 0.5.
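As a minimal sketch, the two definitions can be computed directly over the table above (the function names supp and conf are just illustrative):

```python
# The five transactions from the table above.
transactions = [
    {"milk", "bread"},          # TID 1
    {"egg"},                    # TID 2
    {"juice"},                  # TID 3
    {"milk", "bread", "egg"},   # TID 4
    {"bread"},                  # TID 5
]

def supp(itemset):
    """Proportion of transactions containing every item of `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)

def conf(lhs, rhs):
    """Confidence of the rule lhs -> rhs: supp(lhs U rhs) / supp(lhs)."""
    return supp(set(lhs) | set(rhs)) / supp(set(lhs))

print(supp({"milk", "bread", "egg"}))    # 0.2 (= 1/5)
print(conf({"milk", "bread"}, {"egg"}))  # 0.5
```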
Outline
 Background of Association Rule Mining
 Apriori Algorithm
 Parallel Association Rule Mining
 Count Distribution
 Data Distribution
 Candidate Distribution
 FP tree Mining and growth
 Fast Parallel Association Rule mining without candidate
generation
 FP tree over Hadoop
Apriori Algorithm
Apriori runs in two steps.
 Generation of candidate itemsets
 Pruning of itemsets which are infrequent
Level-wise generation of frequent itemsets.
Apriori principle:
If an itemset is frequent, then all of its subsets must also be frequent. (Contrapositively, no superset of an infrequent itemset can be frequent, which is what justifies the pruning step.)
Apriori Algorithm for generating frequent itemsets
[Worked example (figure): level-wise generation of frequent itemsets with minimum support = 2]
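In code, the level-wise loop might look like the following sketch (illustrative Python, not the paper's exact pseudocode; apriori is a hypothetical name):

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Level-wise frequent itemset generation (illustrative sketch)."""
    transactions = [set(t) for t in transactions]
    # L1: frequent 1-itemsets.
    items = {i for t in transactions for i in t}
    Lk = {frozenset([i]) for i in items
          if sum(i in t for t in transactions) >= min_support}
    frequent, k = set(Lk), 2
    while Lk:
        # Generate: join (k-1)-itemsets whose union has size k.
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k}
        # Prune: by the Apriori principle, a candidate with any
        # infrequent (k-1)-subset cannot be frequent.
        Ck = {c for c in Ck
              if all(frozenset(s) in Lk for s in combinations(c, k - 1))}
        # Count candidates against the data and keep the frequent ones.
        Lk = {c for c in Ck
              if sum(c <= t for t in transactions) >= min_support}
        frequent |= Lk
        k += 1
    return frequent
```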
Parallel Association Rule Mining
 The paper presents parallel algorithms for generating frequent itemsets.
 Each of the N processors has private memory and disk.
 Data is distributed evenly across the disks of the processors.
 The Count Distribution algorithm focuses on minimizing communication.
 Data Distribution utilizes the aggregate memory of the system more effectively.
 Candidate Distribution reduces synchronization between processors.
Algorithm 1: Count Distribution
 Each processor generates the complete candidate set Ck, using the complete frequent itemset Lk-1.
 Each processor traverses its local data partition and develops local support counts.
 Processors exchange the local counts to develop global counts; synchronization is needed here.
 Each processor computes Lk from Ck.
 Each processor independently makes the (identical) decision to continue or stop.
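One pass can be simulated sequentially as a sketch (each loop iteration plays the role of one processor; the Counter sum stands in for the all-to-all count exchange):

```python
from collections import Counter

def count_distribution_pass(partitions, Ck, min_support):
    """One pass of Count Distribution, simulated sequentially:
    every 'processor' holds the complete candidate set Ck and
    counts it against only its own local data partition."""
    local_counts = []
    for partition in partitions:          # each iteration = one processor
        counts = Counter()
        for t in partition:
            counts.update(c for c in Ck if c <= t)
        local_counts.append(counts)
    # The all-to-all count exchange -- the only synchronization point.
    global_counts = sum(local_counts, Counter())
    # Every processor derives the identical Lk from the same global counts.
    return {c for c, n in global_counts.items() if n >= min_support}
```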
Algorithm 2: Data Distribution
 Partition the dataset into N small chunks.
 Partition the set of candidate k-itemsets into N exclusive subsets.
 Each of the N nodes takes one subset and counts the frequency of its itemsets in one chunk at a time, until it has counted through all the chunks.
 Aggregate the counts.
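A sequential sketch of one such pass, under the simplifying assumption that "passing chunks around" is modeled by each node iterating over all chunks (data_distribution_pass is a hypothetical name):

```python
def data_distribution_pass(partitions, Ck, min_support):
    """One pass of Data Distribution, simulated sequentially: each
    'node' owns 1/N of the candidates but must count them against
    every data chunk."""
    N = len(partitions)
    ordered = sorted(Ck, key=sorted)                 # deterministic split
    subsets = [ordered[i::N] for i in range(N)]      # 1/N of Ck per node
    Lk = set()
    for my_candidates in subsets:                    # each iteration = one node
        counts = {c: 0 for c in my_candidates}
        for chunk in partitions:                     # node sees all chunks
            for t in chunk:
                for c in my_candidates:
                    if c <= t:
                        counts[c] += 1
        Lk |= {c for c, n in counts.items() if n >= min_support}
    return Lk
```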
Algorithm 2: Data Distribution
[Diagram (two slides): each of the N nodes holds 1/N of the data and 1/N of Ck; data chunks circulate among the nodes so that every candidate subset is counted against the full dataset, followed by a synchronize step.]
Algorithm 3: Candidate Distribution
 If the workload is not balanced, all processors may end up waiting for whichever processor finishes last in every pass.
 The Candidate Distribution algorithm tries to remove this dependency by partitioning both the data and the candidates.
Algorithm 3: Candidate Distribution
[Diagram (two slides): Lk-1 is split into subsets Lk-1_1 … Lk-1_5, from which the five processors generate their own candidate subsets Ck_1 … Ck_5; the dataset is repartitioned accordingly into Data_1 … Data_5.]
Data Partition and L Partition
 Data
 In each pass, every node grabs the tuples it needs from the dataset.
 L
 Let L3 = {ABC, ABD, ABE, ACD, ACE}.
 The items in the itemsets are lexicographically ordered.
 Partition the itemsets based on common (k-1)-long prefixes: here {ABC, ABD, ABE} share the prefix AB and {ACD, ACE} share AC (see the sketch below).
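A small sketch of the prefix-based partition (partition_candidates is a hypothetical helper):

```python
from collections import defaultdict

def partition_candidates(Lk):
    """Group lexicographically ordered itemsets by their common prefix
    (all items but the last); each group goes to one processor."""
    groups = defaultdict(list)
    for itemset in Lk:
        items = tuple(sorted(itemset))
        groups[items[:-1]].append(items)
    return dict(groups)

L3 = [{"A", "B", "C"}, {"A", "B", "D"}, {"A", "B", "E"},
      {"A", "C", "D"}, {"A", "C", "E"}]
print(partition_candidates(L3))
# {('A', 'B'): [('A', 'B', 'C'), ('A', 'B', 'D'), ('A', 'B', 'E')],
#  ('A', 'C'): [('A', 'C', 'D'), ('A', 'C', 'E')]}
```

Candidates produced by joining two itemsets always share such a prefix, so each processor can generate and count its share of Ck without synchronizing with the others.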
Rule Generation
 Example:
 Frequent itemsets ABCDE and AB.
 The rule that can be generated from this pair is
AB => CDE
Support: sup(ABCDE)
Confidence: sup(ABCDE)/sup(AB)
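The same computation as a tiny sketch (the support values here are made up for illustration):

```python
def rule(lhs, rhs, supports):
    """Support and confidence of the rule lhs -> rhs from a support table."""
    lhs, full = frozenset(lhs), frozenset(lhs) | frozenset(rhs)
    return supports[full], supports[full] / supports[lhs]

# Illustrative support values (invented for this example):
supports = {frozenset("ABCDE"): 0.2, frozenset("AB"): 0.4}
print(rule("AB", "CDE", supports))   # (0.2, 0.5)
```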
Outline
 Background of Association Rule Mining
 Apriori Algorithm
 Parallel Association Rule Mining
 Count Distribution
 Data Distribution
 Candidate Distribution
 FP tree Mining and growth
 Fast Parallel Association Rule mining without candidate
generation
 FP tree over Hadoop
FP Tree Algorithm
Allows frequent itemset discovery without candidate itemset generation:
• Step 1: Build a compact data structure called the FP-tree, using two passes over the data set.
• Step 2: Extract frequent itemsets directly from the FP-tree.
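A minimal sketch of Step 1 (the Node class and build_fp_tree are illustrative names, not a library API):

```python
from collections import Counter

class Node:
    """One FP-tree node: an item, its count, a parent link, children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Pass 1: count items globally and prune the infrequent ones.
    counts = Counter(i for t in transactions for i in t)
    header = dict(sorted(((i, c) for i, c in counts.items()
                          if c >= min_support), key=lambda ic: -ic[1]))
    order = list(header)          # items in descending frequency
    # Pass 2: insert each transaction, reordered by descending
    # global frequency, into the prefix tree.
    root = Node(None, None)
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in header), key=order.index):
            node = node.children.setdefault(item, Node(item, node))
            node.count += 1
    return root, header
```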
FP-Tree & FP-Growth example
[Example slide (figure): FP-tree construction and FP-Growth mining with min supp = 3]
Fast Parallel Association Rule Mining Without Candidacy Generation
 Phase 1:
 Each processor is given an equal number of transactions.
 Each processor locally counts the items.
 The local counts are summed to obtain global counts.
 Infrequent items are pruned; frequent items are stored in a header table in descending order of frequency.
 Parallel frequent pattern trees are then constructed, one per processor.
 Phase 2: mining of the FP-trees, similar to the FP-Growth algorithm, using the global counts in the header table (a sketch of Phase 1 follows).
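Phase 1's counting and header-table construction can be sketched as follows, using the nine transactions of the example below (sequential simulation; global_header_table is a hypothetical helper):

```python
from collections import Counter

def global_header_table(processor_partitions, min_support):
    """Phase 1 sketch: sum per-processor local counts into global
    counts, prune infrequent items, sort by descending frequency."""
    local = [Counter(i for t in part for i in t)
             for part in processor_partitions]           # local counting
    global_counts = sum(local, Counter())                # count exchange
    return dict(sorted(((i, c) for i, c in global_counts.items()
                        if c >= min_support), key=lambda ic: -ic[1]))

# The nine transactions of the example below, split across P0, P1, P2:
P0 = [{"A","B","C","D","E"}, {"F","B","D","E","G"}, {"B","D","A","E","G"}]
P1 = [{"A","B","F","G","D"}, {"B","F","D","G","K"}, {"A","B","F","D","G","K"}]
P2 = [{"A","R","M","K","O"}, {"B","F","G","A","D"}, {"A","B","F","M","O"}]
print(global_header_table([P0, P1, P2], min_support=4))
# {'B': 8, 'A': 7, 'D': 7, 'F': 6, 'G': 6}  (ties may print in either order)
```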
Example with min supp = 4

TID | Items       | Processor
  1 | A,B,C,D,E   | P0
  2 | F,B,D,E,G   | P0
  3 | B,D,A,E,G   | P0
  4 | A,B,F,G,D   | P1
  5 | B,F,D,G,K   | P1
  6 | A,B,F,D,G,K | P1
  7 | A,R,M,K,O   | P2
  8 | B,F,G,A,D   | P2
  9 | A,B,F,M,O   | P2
Step 1: local item counts per processor

Item | P0 | P1 | P2
A    |  2 |  2 |  3
B    |  3 |  3 |  2
C    |  1 |  0 |  0
D    |  3 |  3 |  1
E    |  3 |  0 |  0
F    |  1 |  3 |  2
G    |  2 |  3 |  1
K    |  0 |  2 |  1
M    |  0 |  0 |  2
O    |  0 |  0 |  2
R    |  0 |  0 |  1

(Steps 2-3: the local counts are exchanged and summed into global counts.)

Step 4: global counts after pruning infrequent items (min supp = 4)

Item | Global Counter
B    | 8
A    | 7
D    | 7
F    | 6
G    | 6
FP tree for P0

Item | Global Counter
B    | 8
A    | 7
D    | 7
F    | 6
G    | 6

TID | Items     | Reordered Transaction
  1 | A,B,C,D,E | B,A,D
  2 | F,B,D,E,G | B,D,F,G
  3 | B,D,A,E,G | B,A,D,G

[Diagram: the tree grows as the three reordered transactions are inserted (the root's B node goes B:1 → B:2 → B:3), ending with the branches B:3 → A:2 → D:2 → G:1 and B:3 → D:1 → F:1 → G:1.]
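P0's tree can be rebuilt with a small variant of the earlier build_fp_tree sketch: in the parallel algorithm the header table is a global input, so pruning and reordering use the global counts rather than P0's local ones:

```python
# Global header table (from the slide above), in descending frequency.
header = {"B": 8, "A": 7, "D": 7, "F": 6, "G": 6}
order = list(header)

root = Node(None, None)   # Node class from the earlier sketch
for t in [{"A","B","C","D","E"}, {"F","B","D","E","G"}, {"B","D","A","E","G"}]:
    node = root
    # Keep only globally frequent items, reorder, then insert.
    for item in sorted((i for i in t if i in header), key=order.index):
        node = node.children.setdefault(item, Node(item, node))
        node.count += 1
# Resulting tree: B:3 with branches A:2 -> D:2 -> G:1 and D:1 -> F:1 -> G:1.
```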
Construction of local FP trees
[Figure: the local FP-trees built independently by P0, P1, and P2]
Conditional Pattern Bases

Items | Conditional Pattern Base
G     | {D:1, A:1, B:1}, {F:1, D:1, B:1}
F     | {D:1, B:1}
D     | {A:2, B:2}, {B:1}
A     | {B:2}
B     | {}
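A sketch of extracting a conditional pattern base by walking the tree, reusing Node and root from the sketches above (real implementations typically follow the header table's node links instead):

```python
def conditional_pattern_base(root, item):
    """Collect the prefix path of every `item` node, each entry
    weighted by that node's count."""
    bases = []
    def walk(node):
        if node.item == item:
            path, p = [], node.parent
            while p is not None and p.item is not None:
                path.append((p.item, node.count))
                p = p.parent
            if path:
                bases.append(path)
        else:                      # item occurs at most once per branch
            for child in node.children.values():
                walk(child)
    walk(root)
    return bases

print(conditional_pattern_base(root, "G"))
# [[('D', 1), ('A', 1), ('B', 1)], [('F', 1), ('D', 1), ('B', 1)]]
```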
Frequent pattern strings
 All frequent pattern trees are shared by all processors.
 Each processor generates conditional pattern bases for its respective items in the header table.
 Merging all conditional pattern bases of the same item yields the frequent pattern string.
 If the support of an item is less than the threshold, it is not added to the final frequent string.
More Readings
 FP-Growth on Hadoop: the core algorithm runs as 3 Map-Reduce passes
Thank You!