A New PC Cluster Based Distributed Association Rule Mining
Ferenc Kovács, Sándor Juhász
Department of Automation and Applied Informatics, Budapest University of Technology and Economics, Faculty of Electrical
Engineering and Informatics, 1111 Budapest, Goldmann György tér 3. IV. em., Hungary
E-Mail: {kovacs.ferenc, juhasz.sandor }@aut.bme.hu
Abstract – One of the most important problems in data mining is association rule mining. It requires very large computation and I/O capacity. For that reason, several parallel mining algorithms have been developed that can take advantage of the performance of cluster systems. These algorithms were optimised and developed on supercomputer platforms, but nowadays the capacity of PCs makes it possible to build cluster systems more cheaply. The usage of PC cluster systems raises some issues about the optimisation of distributed mining algorithms, especially the cost of node-to-node communication and data distribution. The current association rule mining algorithms do not consider the cost of data distribution, and they do not perform any data processing during the data distribution phase. Most distributed association rule mining algorithms are based on the Apriori algorithm, therefore they suffer from the drawbacks of the Apriori algorithm. In this paper a new distributed association rule mining algorithm is introduced, based on a modified version of the Apriori algorithm that resolves the bottlenecks of the Apriori algorithm. Another advantage of this new distributed algorithm is that it can perform significant data processing already during the data distribution phase.
Keywords: data mining, association rule mining,
distributed data mining
I. INTRODUCTION
Association rule mining (ARM) is a very important task within the area of data mining [1]. Many algorithms have been developed to find association rules, but Apriori is the best-known one [2]. One of the main disadvantages of the Apriori algorithm is its I/O cost. The paper [3] introduces an algorithm for cutting the I/O cost, but it has higher memory requirements.
Because of the complexity of the ARM task, several parallel algorithms have been developed. Most of the distributed algorithms are based on the Apriori algorithm. The count distribution (CD)-like algorithms [4] generate the smallest network traffic, because they send only their own counters to the other nodes. The data distribution (DD)-based algorithms [4] generate higher network traffic, because the nodes send not only their local counters but their own database partitions as well. There are also some distributed algorithms that are not based on Apriori; for instance, [5] contributes such an algorithm.
The paper [6] presents a detailed analysis of the Apriori algorithm and concludes that finding the frequent 2-itemsets takes a significant part of the execution time. The paper [7] introduces two modifications to avoid the bottlenecks of the Apriori algorithm.
The distributed algorithms were developed and evaluated in supercomputer environments, but PC cluster systems differ in several ways from traditional cluster systems. The contributed distributed association rule mining algorithms therefore do not consider the data distribution cost, and they begin the itemset counting only after the data distribution phase. However, an initial data distribution step does appear in the case of PC cluster systems, as the database is usually stored separately from the mining cluster.
Only a few distributed association rule mining algorithms have been investigated in a PC cluster environment. The paper [8] is concerned with the behaviour of the HPA algorithm [9] in a PC cluster environment. Node synchronization and the possibility of heterogeneous node types can cause a serious performance decrease in PC clusters. PC cluster-based modifications of the CD and DD algorithms were discussed in [10]. The algorithm synchronization problem was examined in [11]. An asynchronous distributed algorithm was introduced in [12].
This paper is organized as follows: first the widely used sequential ARM algorithms are described. Then the basic distributed ARM algorithms are summarized. After that, the bottlenecks of the Apriori-based algorithms are investigated and the novel distributed association rule mining algorithm is introduced. Finally, we outline some test results of the new algorithm.
II. PROBLEM STATEMENT
First we elaborate on some basic concepts of association rules using the formalism presented in [1]. Let I = {i1, i2, ..., im} be a set of literals, called items. Let D = {t1, t2, ..., tn} be a set of transactions, where each transaction t is a set of items such that t ⊆ I. The itemset X has support s in the transaction set D if s% of the transactions contain X; we denote this by s = support(X). An association rule is an implication of the form X → Y, where X, Y ⊆ I and X ∩ Y = ∅. Each rule has two measures of value, support and confidence. The support of the rule X → Y is support(X ∪ Y). The confidence c of the rule X → Y in the transaction set D means that c% of the transactions in D that contain X also contain Y, which can be written as c = support(X ∪ Y) / support(X). The problem of mining association rules is to find all the rules that satisfy a user-specified minimum support and minimum confidence. If support(X) is larger than a user-defined minimum support (denoted here min_sup), then the itemset X is called a large itemset.
The association rule mining task can be decomposed into two subproblems:
 Finding all of the large itemsets
 Generating rules from these large itemsets
The second subproblem is much easier than the first one; that is the reason why ARM algorithms differ from each other only in the method of handling the first subproblem.
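The support and confidence definitions above can be illustrated with a short sketch (an illustrative example, not code from the paper; the helper names and the toy transaction set are ours):

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(X, Y, transactions):
    """confidence(X -> Y) = support(X u Y) / support(X)."""
    return support(set(X) | set(Y), transactions) / support(X, transactions)

# A toy transaction set D with four transactions.
D = [frozenset(t) for t in (
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
)]

print(support({"bread"}, D))               # 0.75 (3 of 4 transactions)
print(confidence({"bread"}, {"milk"}, D))  # 0.5 / 0.75 ≈ 0.667
```

A rule such as bread → milk is reported only if both values exceed the user-specified minima.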
III. BASIC ALGORITHMS
In this section some basic association rule mining algorithms are described. First the Apriori algorithm is introduced, because it is the base algorithm of the investigated distributed algorithms. Then three fundamental reference algorithms are described: count distribution [4], data distribution [4] and the hash-based partition algorithm [14]. Table 1 contains the notations used in the detailed descriptions of the sequential algorithms.
k-itemset   An itemset having k items
L           Set of the frequent itemsets
Li          Set of the frequent i-itemsets
Ci          Set of the candidate i-itemsets (potentially frequent itemsets)
|A|         Number of elements in set A
Table 1. Notations used in the sequential algorithms
Apriori Algorithm
The Apriori algorithm [2] uses the following theorem to reduce the search space: if an itemset is large, then all of its subsets are large as well. This means that the potentially large (i+1)-itemsets can be generated from the large i-itemsets, since each i-subset of a candidate (i+1)-itemset must be a large itemset. Hereby it is possible to find all large itemsets by scanning the database repeatedly. During the ith database scan the algorithm counts the occurrences of the candidate i-itemsets, and by the end of pass i it generates the candidates containing i+1 items. Figure 1 shows the pseudo code of the Apriori algorithm. The main disadvantage of this algorithm is the repeated database scan. There are many solutions to reduce the number of database scans [2][3][5][13].
L1 ← {frequent 1-itemsets}
C2 ← GenerateCandidate(L1)
i ← 2
while (L(i-1) ≠ ∅)
{
    foreach (t ∈ D)
    {
        IncrementCounter(Ci, t)
    }
    Li ← { c ∈ Ci | c.counter / |D| ≥ min_supp }
    i ← i + 1
    Ci ← GenerateCandidate(L(i-1))
}
L ← ∪i Li
Figure 1. Pseudo code of the Apriori algorithm
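As a complement to the pseudo code in Figure 1, a minimal runnable sketch of the level-wise search might look as follows (illustrative only; the helper names are ours, and real implementations use hash trees or tries rather than plain sets):

```python
from itertools import combinations

def apriori(D, min_supp):
    """D: list of frozensets (transactions); min_supp: fraction in [0, 1]."""
    def frequent(candidates):
        counts = {c: 0 for c in candidates}
        for t in D:                          # one database scan per level
            for c in counts:
                if c <= t:
                    counts[c] += 1
        return {c for c, n in counts.items() if n / len(D) >= min_supp}

    def generate_candidates(L, k):
        # join + prune: every (k-1)-subset of a k-candidate must be frequent
        items = set().union(*L) if L else set()
        return {c for c in map(frozenset, combinations(sorted(items), k))
                if all(frozenset(s) in L for s in combinations(c, k - 1))}

    L = frequent({frozenset([i]) for t in D for i in t})   # frequent 1-itemsets
    result, k = set(L), 2
    while L:                                  # stop when no level is frequent
        L = frequent(generate_candidates(L, k))
        result |= L
        k += 1
    return result
```

For example, with min_supp = 0.5 on four transactions {a,b}, {a,b,c}, {a,c}, {b}, the pair {b,c} is pruned (support 0.25) while {a,b} and {a,c} survive.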
Count Distribution Algorithm
The count distribution algorithm [4] is a fundamental distributed association rule mining algorithm. The basic idea of this algorithm is that each node keeps the large itemsets and the candidate counters locally, and these relate to the whole database. The counters are maintained according to the local dataset and the incoming counter values. The nodes execute the Apriori algorithm locally, and after reading through the local dataset they broadcast their own counters to the other nodes. Hence each node can generate the new candidates on the basis of the global counter values. Figure 2 shows the pseudo code of the CD algorithm and Figure 3 gives an overview of the CD algorithm.
C1^j ← {items}
i ← 1
do
{
    foreach (t ∈ D^j)
    {
        IncrementCounter(Ci^j, t)
    }
    Broadcast(Ci^j)
    for (k = 1 to N, k ≠ j)
    {
        Receive(Ci^k)
        ComputeCounters(Ci^j, Ci^k)
    }
    Li ← { c ∈ Ci^j | c.counter / |D| ≥ min_supp }
    i ← i + 1
    Ci^j ← GenerateCandidate(L(i-1))
} while (L(i-1) ≠ ∅)
L ← ∪i Li
Figure 2. Pseudo code of the count distribution algorithm
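The counter-exchange step of the CD algorithm (the ComputeCounters part of Figure 2) can be sketched as follows; the all-to-all broadcast is simulated here with plain in-memory dictionaries rather than real message passing:

```python
from collections import Counter

def merge_counters(local_counts):
    """local_counts: one Counter per node, produced by the local scans.
    Returns the global Counter that every node holds after the
    all-to-all counter exchange."""
    total = Counter()
    for c in local_counts:
        total += c            # the ComputeCounters step of Figure 2
    return total

node_counts = [
    Counter({("a", "b"): 3, ("a", "c"): 1}),   # node 1's local scan
    Counter({("a", "b"): 2, ("b", "c"): 4}),   # node 2's local scan
]
print(merge_counters(node_counts))  # ("a","b") -> 5, ("b","c") -> 4, ...
```

Because only counters travel over the network, the per-round traffic is proportional to the candidate set, not to the database.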
(Figure omitted: each of the nodes 1..N runs the Apriori item counting on its own database partition, the nodes exchange their counter values, and each node performs the candidate generation Lk → Ck locally.)
Figure 3. Overview of the count distribution algorithm
This algorithm keeps the data transmission low, but the nodes are synchronized after every database scan. The other disadvantage of this algorithm is that if there are too many candidates, it cannot exploit the fact that the cluster as a whole could have enough memory to keep all the candidates in memory.
Data Distribution Algorithm
The data distribution algorithm [4] gives a solution for the situation when a single node does not have enough memory for all of the candidates. In this case each node is responsible for only a part of the candidates, and each node counts the occurrences of its own candidates over the whole dataset. For this, every node has to broadcast its local database partition to the other nodes. Figure 4 gives an overview of the DD algorithm and Figure 5 shows its pseudo code.
The disadvantage of the algorithm is that it generates very large network traffic, because it does not use any kind of optimisation to reduce it. It can be noticed that the initially huge number of candidates decreases during the later iteration steps; hence the local database broadcast is no longer necessary once the number of counters has decreased.
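The core idea of DD, splitting the candidate set among the nodes so that each node counts only its own slice over the whole database, can be sketched as follows (illustrative helper names; a simple round-robin split is assumed here):

```python
def partition_candidates(candidates, n_nodes):
    """Round-robin split of the candidate set among n_nodes nodes."""
    slices = [[] for _ in range(n_nodes)]
    for idx, c in enumerate(sorted(candidates)):
        slices[idx % n_nodes].append(c)
    return slices

def count_slice(my_candidates, whole_db):
    # In the real algorithm the whole database reaches every node via
    # broadcast; here it is simply passed in as a list of sets.
    return {c: sum(1 for t in whole_db if set(c) <= t) for c in my_candidates}

slices = partition_candidates(
    [("a", "b"), ("a", "c"), ("b", "c"), ("a", "d"), ("b", "d")], 2)
counts = count_slice(slices[0], [{"a", "b"}, {"a", "b", "c"}])
```

The memory per node shrinks by roughly a factor of N, at the cost of shipping every database partition to every node.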
HPA algorithm
The HPA algorithm[14] is an improved version of the data
distribution algorithm; it uses a hash function to determine
the owner node of the candidate. With the help of this hash
function it is possible to decide where to send each of the
read itemsets. In this way the network traffic can be
reduced.
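The owner-node selection of HPA can be sketched as a simple hash mapping (illustrative only; the actual hash function of [14] is not reproduced here):

```python
def owner_node(itemset, n_nodes):
    """Deterministic owner of a candidate itemset.
    frozenset makes the result independent of item order."""
    return hash(frozenset(itemset)) % n_nodes
```

An itemset read from a transaction is then sent only to `owner_node(itemset, N)` instead of broadcasting the whole database partition.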
(Figure omitted: each of the nodes 1..N generates the candidates Ck from Lk for its own candidate slice, the nodes exchange their database partitions, and each node runs the Apriori item counting on both local and remote data.)
Figure 4. Overview of the data distribution algorithm
C1^j ← {items}
i ← 1
do
{
    foreach (t ∈ D^j)
    {
        IncrementCounter(Ci^j, t)
    }
    Broadcast(D^j)
    for (k = 1 to N, k ≠ j)
    {
        Receive(D^k)
        foreach (t ∈ D^k)
        {
            IncrementCounter(Ci^j, t)
        }
    }
    Broadcast(Ci^j)
    for (k = 1 to N, k ≠ j)
    {
        Receive(Ci^k)
    }
    Ci ← Ci^j ∪ (∪k Ci^k)
    Li ← { c ∈ Ci | c.counter / |D| ≥ min_supp }
    i ← i + 1
    Ci^j ← GenerateCandidate(L(i-1))
} while (L(i-1) ≠ ∅)
L ← ∪i Li
Figure 5. Pseudo code of the data distribution algorithm
IV. BOTTLENECKS OF THE APRIORI-BASED ALGORITHMS
Several Apriori-based algorithms have been developed, but the modifications focus on reducing the number of database scans. However, the Apriori-based algorithms have another drawback: the role of the frequent 2-itemsets is fundamental in the Apriori-based algorithms [6]. According to the evaluation of the Apriori algorithm, it is useful to treat the 2-candidates in a special way [7]. The modified version of the Apriori algorithm does not generate 2-candidates, but takes all possible 2-itemsets into consideration as candidates. The itemset counters can be stored in a triangular matrix indexed by the itemset identifiers. This matrix contains the counter values of every 2-itemset, and it can keep the counter values of the 1-itemsets as well: the diagonal of the matrix serves as the placeholder of the 1-itemset counters. The advantages of this solution are as follows:
1. It is not necessary to generate the 2-candidates
2. The 1- and 2-itemsets can be counted in one step
3. Each of the counters can be reached in O(1) time
(Figure omitted: a triangular matrix indexed by the item identifiers; the diagonal cells hold the 1-itemset counters and the off-diagonal cells hold the 2-itemset counters.)
Figure 6. The data structure of the counters
The main benefit of this structure is the O(1)-time counter lookup. This can greatly reduce the cost of the first database scans, which is important because they take a substantial part of the execution time of the level-wise algorithms.
After the first database scan the level-wise algorithms can continue their work as before, but of course it is necessary to convert the frequent itemsets from the triangular matrix into their usual data structure (typically a hash tree or trie). The conversion can be sped up if the counters of the 1-itemsets are checked first, because if a 1-itemset is not frequent, then its row and column cannot contain a frequent counter. This is the consequence of the Apriori condition: if an itemset is frequent, then each of its sub-itemsets is frequent as well.
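The triangular-matrix counter store described above can be sketched as follows (an illustrative layout; the index formula packs the upper triangle, diagonal included, into a flat array so every lookup is O(1)):

```python
class TriangularCounters:
    def __init__(self, n_items):
        self.n = n_items
        # upper triangle incl. diagonal: n*(n+1)/2 cells
        self.data = [0] * (n_items * (n_items + 1) // 2)

    def _index(self, i, j):
        i, j = min(i, j), max(i, j)   # order-independent: (i,j) == (j,i)
        # row i of the upper triangle starts at i*n - i*(i-1)/2
        return i * self.n - i * (i - 1) // 2 + (j - i)

    def increment(self, i, j=None):
        self.data[self._index(i, i if j is None else j)] += 1

    def count(self, i, j=None):
        return self.data[self._index(i, i if j is None else j)]

    def count_transaction(self, items):
        """Counts the 1- and 2-itemsets of one transaction in a single pass."""
        items = sorted(items)
        for a in range(len(items)):
            self.increment(items[a])                  # 1-itemset (diagonal)
            for b in range(a + 1, len(items)):
                self.increment(items[a], items[b])    # 2-itemset

tc = TriangularCounters(3)
tc.count_transaction([0, 1, 2])
```

Compared with a hash tree, there is no candidate generation and no probing: the counter of any 2-itemset is reached directly from the two item identifiers.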
V. THE NOVEL ALGORITHM: COUNTING WHILE
DISTRIBUTING
The Counting While Distributing (CWD) algorithm is based on the count distribution algorithm, but it uses the triangular matrix to find the frequent 2-itemsets [7]. The other modification concerns communication: the traditional CD algorithm uses an all-to-all broadcast mechanism to share the local itemset counters when the nodes have finished a database scan, but in a PC cluster environment the all-to-all broadcast can be very expensive. Therefore in this algorithm there is a coordinator node, which collects the local counter values of the candidates and generates the new candidates [10][12].
The traditional distributed association rule mining algorithms do not consider the initial database distribution cost, hence they start the itemset counting only after the data distribution. Due to the fast lookups of the triangular matrix representation, it is possible to count the 2-itemsets already during the data distribution phase. Therefore, right after the initial data distribution, the coordinator can immediately collect the counters of the 2-itemsets. This solution allows overlapping the initial data distribution with a significant part of the itemset counting.
The asynchronous communication model can improve the overall efficiency of the cluster, as it makes it possible to overlap the communication and the data processing. During the initial data distribution the coordinator node creates small data fragments from the database, and these are sent to the worker nodes. This fragmentation makes it possible to increase the efficiency of the data processing: due to the asynchronous communication, a worker node can process one data fragment while the next one is arriving through the network communication channel. In the current implementation the size of a fragment is 128 kilobytes.
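The overlap of fragment reception and fragment processing can be sketched with a bounded queue (fragment arrival is simulated; real fragments would be 128-kilobyte byte buffers rather than lists of numbers, and `sum` stands in for the itemset counting):

```python
import queue
import threading

FRAGMENT_QUEUE = queue.Queue(maxsize=4)   # buffer of arriving fragments
processed = []

def receiver(fragments):
    for f in fragments:
        FRAGMENT_QUEUE.put(f)             # the "network" delivers a fragment
    FRAGMENT_QUEUE.put(None)              # end-of-distribution marker

def worker():
    while True:
        f = FRAGMENT_QUEUE.get()
        if f is None:
            break
        processed.append(sum(f))          # stand-in for itemset counting

t1 = threading.Thread(target=receiver, args=([[1, 2], [3, 4], [5]],))
t2 = threading.Thread(target=worker)
t1.start(); t2.start(); t1.join(); t2.join()
print(processed)  # [3, 7, 5]
```

While the worker counts one fragment, the receiver can already enqueue the next, so the distribution time is hidden behind the counting.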
The sequence diagram of the communication is shown in Figure 7. The coordinator node can send the following messages to the worker nodes:
 Start: the coordinator informs the worker nodes about the beginning of the data distribution.
 Data Fragment: the coordinator node sends a data fragment to a worker node.
 Delete Candidate: a candidate did not become frequent, hence the worker nodes have to delete the candidate from their memory.
 Insert Candidate: new candidates have been generated, and the coordinator node broadcasts them to the worker nodes.
 Continue Counting: the candidate generation process is finished; the worker nodes can continue the itemset counting.
 Finish Counting: there is no new candidate, hence the itemset counting process is finished.
The worker nodes can send the following messages to the coordinator node:
 Triangular Matrix: the worker node sends back its triangular matrix to the coordinator.
 Counter Update: the worker node sends back its candidate counter values to the coordinator.
(Sequence diagram omitted; the messages between the coordinator and a worker node, in order: Start, Data Fragment, Finished Distribution, Triangular Matrix, Insert Counter, Continue Working, Counter Update, Insert Counter, Delete Counter, Continue Working, Counter Update, Counter Update, Finished.)
Figure 7. Sequence diagram of the communication
VI. SIMULATION RESULTS
The introduced CWD algorithm and the count distribution algorithm were implemented in standard C++ using the Pyramid [15] asynchronous message passing library on the Windows XP platform. The simulation took place in the PC laboratory of the department, using 11 uniform PCs, each having an Intel Pentium IV processor of 2.26 GHz, 256 MB RAM, and an Intel 82801DB PRO/100 VE network adapter of 100 Mbit/s. The nodes were connected through a 3Com SuperStack 4226T switching hub. The datasets used by the algorithms were generated by the IBM synthetic data generator [2]. We used the T20I7D500K dataset for the experiments: it contains 500K transactions, the average size of a transaction is 20 items, and the average size of the maximal frequent itemset is 7. The simulation results are shown in Figure 8 to Figure 14.
(Figures omitted: response-time plots over the minimum support and the number of working nodes; only the captions are recoverable here.)
Figure 8. Response time of the CWD and CD algorithms in case of 10 working nodes
Figure 9. Response time of the CWD and CD algorithms in case of 5 working nodes
Figure 10. Response time of the CWD and CD algorithms in case of 0.5% minimum support
Figure 11. Response time of the CWD and CD algorithms in case of 0.9% minimum support
Figure 12. Finding the frequent 2-itemsets in case of 10 working nodes, using the CWD algorithm
Figure 13. Finding the frequent 2-itemsets in case of 10 working nodes, using the CD algorithm
Figure 14. Comparison of the two algorithms: finding the frequent 2-itemsets in case of 10 working nodes
Figure 8 and Figure 9 show that the response time of the CWD algorithm is significantly better than that of the CD algorithm. Figure 10 and Figure 11 show the scale-up capability of both algorithms: the novel CWD algorithm always has a better response time than the CD algorithm. Both algorithms saturate in their scale-up capability, but the asymptotic limit of the CWD algorithm is significantly smaller than that of the CD algorithm. Figure 12, Figure 13 and Figure 14 show that the speed-up of the CWD algorithm comes from speeding up the Apriori algorithm. These measurement results also reinforce that the main bottleneck of the Apriori algorithm is finding the frequent 2-itemsets. Therefore the Apriori-based association rule mining algorithms have worse performance than the introduced CWD algorithm.
VII. CONCLUSION
In this paper a novel distributed association rule mining algorithm has been introduced. This algorithm is based on a modified version of the Apriori algorithm: it handles the 2-itemsets in a special way, therefore it can significantly reduce the response time of the Apriori algorithm. The basic distributed association rule mining algorithms are based on the Apriori algorithm, hence they inherit all the problems of the Apriori algorithm. The introduced speed-up solution can be used in any kind of Apriori-based algorithm.
The CWD algorithm has a substantially better response time than the reference CD algorithm. One of the main advantages of this algorithm is that it counts the occurrences of the 2-itemsets while the data distribution is taking place. Therefore the cost of the data distribution, which can be high in the case of a very large data source, disappears in the introduced novel algorithm.
ACKNOWLEDGMENTS
This work has been supported by the fund of the Hungarian Academy of Sciences for control research and the Hungarian National Research Fund (grant number: T042741).
REFERENCES
[1] R. Agrawal, T. Imielinski and A. Swami, "Mining association rules between sets of items in large databases", Proc. ACM SIGMOD Conference, 1993
[2] R. Agrawal and R. Srikant, "Fast algorithms for mining association rules", Proc. 20th Very Large Databases Conference, 1994
[3] J. Han, J. Pei and Y. Yin, "Mining Frequent Patterns without Candidate Generation", Proc. ACM SIGMOD International Conference on Management of Data, 2000
[4] R. Agrawal and J. C. Shafer, "Parallel mining of association rules", IEEE Trans. Knowledge and Data Engineering, vol. 8, no. 6, 1996
[5] O. R. Zaïane, M. El-Hajj and P. Lu, "Fast Parallel Association Rule Mining Without Candidacy Generation", Proc. IEEE International Conference on Data Mining, 2001
[6] R. Iváncsy, F. Kovács and I. Vajk, "An Analysis of Association Rule Mining Algorithms", Proc. International Conference of ICSC EIS, 2004
[7] F. Kovács, R. Iváncsy and I. Vajk, "Evaluation of the Serial Association Rule Mining Algorithms", Proc. International Conference of IASTED DBA, 2004
[8] M. Tamura and M. Kitsuregawa, "Dynamic load balancing for parallel association rule mining on heterogeneous PC cluster systems", Proc. 25th Very Large Databases Conference, 1999
[9] T. Shintani and M. Kitsuregawa, "Hash based parallel algorithms for mining association rules", Proc. Parallel and Distributed Information Systems Conference, 1996
[10] T. Shimomura and S. Shibusawa, "Performance Evaluation of Distributed Algorithms for Mining Association Rules on Workstation Cluster", Proc. IEEE International Workshops on Parallel Processing (ICPP'00 Workshops), 2000
[11] J. Zhang, H. Shi and L. Zheng, "A Method and Algorithm of Distributed Mining Association Rules in Synchronism", Proc. IEEE International Conference on Machine Learning and Cybernetics, 2002
[12] F. Kovács, R. Iváncsy and I. Vajk, "Dynamic Itemset Counting in PC Cluster Based Association Rule Mining", Proc. International Conference of ICSC EIS, 2004
[13] S. Brin, R. Motwani, J. D. Ullman and S. Tsur, "Dynamic itemset counting and implication rules for market basket data", Proc. ACM SIGMOD Conference, 1997
[14] T. Shintani and M. Kitsuregawa, "Hash based parallel algorithms for mining association rules", Proc. Parallel and Distributed Information Systems Conference, 1996
[15] S. Juhász et al., "The Pyramid Project", Budapest University of Technology and Economics, Department of Automation and Applied Informatics, 2002. http://avalon.aut.bme.hu/~sanyo/piramis