University of South Australia
M.S. Thesis
In Computer Science
Minor Thesis
ASSOCIATION RULES MINING IN
DISTRIBUTED ENVIRONMENT
By: Shamila Mafazi
Supervisor: Abrar Haider
June 2010
Table of Contents

1. Introduction
   1.1. Motivation
   1.2. Research Question
   1.3. Purpose
   1.4. Methodology
   1.5. Thesis Plan
   1.6. Contribution of the Thesis
2. Literature Review
   2.1. Data mining and association rules mining in centralised environments
        2.1.1. Data mining
        2.1.2. Data pre-processing
        2.1.3. Data cleaning as a problem of distributed DBs
        2.1.4. Association Rules Mining
        2.1.5. Definition of association rules mining difficulties
        2.1.6. Apriori Algorithm
        2.1.7. Subset function
        2.1.8. Applied optimisation on Apriori algorithm
        2.1.9. AprioriTid and AprioriHybrid algorithms
        2.1.10. Sampling
        2.1.11. Partitioning
        2.1.12. Direct Hashing and Pruning (DHP) algorithm
        2.1.13. Dynamic Itemset Counting (DIC algorithm)
        2.1.14. Frequent Pattern (FP) Growth method
        2.1.15. Association rules mining in XML documents
        2.1.16. Trie data structure
        2.1.17. Non-derivable itemsets
                2.1.17.1. Deduction rules
                2.1.17.2. Non-Derivable Itemsets (NDI) algorithm
                2.1.17.3. Producing the frequent itemsets
   2.2. Distributed association rules mining
        2.2.1. Distributed data mining
        2.2.2. Necessity of studying distributed data mining
        2.2.3. Important instances and issues in distributed data mining
        2.2.4. Distributed algorithms for association rules mining
                2.2.4.1. Count Distribution (CD) algorithm
                2.2.4.2. A Fast Distributed algorithm
                        2.2.4.2.1. Candidate set generation
                        2.2.4.2.2. Local pruning of candidate itemsets
                        2.2.4.2.3. FDM algorithm
                2.2.4.3. ODAM algorithm
                2.2.4.4. DDM, PDDM and DDDM algorithms
        2.2.5. Comparing the distributed algorithms
3. Proposed algorithm by this thesis
   3.1. Mining the non-derivable itemsets in distributed environments
   3.2. Proposed algorithm
   3.3. Step by step explanation of the new algorithm
4. Conclusion
5. Future works
6. References
List of Figures

Figure 1. ETL processes
Figure 2. Producing candidate itemsets by the Apriori algorithm
Figure 3. Hash tree
Figure 4. An example of a Trie
Figure 5. An example of the transactions of a DB
Figure 6. Tight bounds on support of (abcd)
Figure 7. Size of concise representation
Figure 8. Distributed memory architecture for distributed data mining
Figure 9. Shared memory architecture for distributed data mining
Figure 10. Horizontal DB layout
Figure 11. Vertical DB layout
Figure 12. Second replication from the count distribution algorithm
Figure 13. ODAM algorithm on 3 sites
Figure 14. Implementation of the new algorithm on the sample distributed DBs
Figure 15. Support counting at the distributed sites
Figure 16. Global support counts
Figure 17. Candidate 2-itemset support counting
Figure 18. Final Trie
List of Tables

Table 1. User DB1
Table 2. Client DB2
Table 3. Users (integrated DB with cleaned data)
Table 4. An example of a DB
Table 5. Notations used in Apriori algorithm (Agrawal & Srikant 1994)
Table 6. Locally large itemsets
Table 7. Globally Large Itemsets
Table 8. Notations used in the new algorithm
Abstract
The tremendous growth of information technology within companies, businesses and governments has created immense databases (DBs). This trend creates a pressing need for novel tools and techniques for intelligent DB analysis. As John Naisbitt observed, ‘We are drowning in information but starving for knowledge!’ These tools and techniques are the subject of the field called “data mining” or “Knowledge Discovery in Databases” (KDD).
Data mining, or KDD, is the process of finding hidden and potentially useful patterns and knowledge in databases. A large body of research has been performed thus far in the field of data mining for traditional centralised databases. Data mining is applicable not only in the centralised setting but also in distributed environments where distributed databases are used.
Mining distributed data is itself a distributed problem and therefore requires distributed algorithms. A distributed data mining algorithm provides data mining results, including knowledge and patterns, without exchanging raw data among the participating sites of a distributed system.
Distributed data mining covers all the data mining tasks, such as classification, clustering and so on. Association rules mining is one of the most well-known data mining methods and has wide applications. In this thesis, mining association rules in a distributed environment, particularly over market basket data, is considered.
Computation and communication are two important factors in Distributed Association Rules Mining (DARM) and, more generally, in distributed data mining. In this study, a new technique to improve the performance of finding association rules in distributed environments is presented. This technique may be incorporated into any existing DARM algorithm.
One of the well-known association rules mining algorithms is the “Frequent Itemset Mining” (FIM) algorithm. The algorithm proposed by this research is the result of applying a new technique inside the DTFIM algorithm, a distributed version of FIM.
Declaration
I declare that:
this thesis presents work carried out by myself and does not incorporate without
acknowledgment any material previously submitted for a degree or diploma in any
university; to the best of my knowledge it does not contain any materials previously
published or written by another person except where due reference is made in the text;
and all substantive contributions by others to the work presented, including jointly
authored publications, are clearly acknowledged.
Shamila Mafazi
14/06/2010
Acknowledgment
Firstly, I would like to thank my supervisor, Dr Abrar Haider, for his valuable support throughout this thesis. I would especially like to extend my thanks to Dr Jiuyong Li, who introduced me to the world of data mining.
1. Introduction
The ubiquity of information technology and its rapid improvement within organisations have profoundly impacted management systems. A significant amount of information is stored in the databases of companies, businesses and government centres. Locally, these data are often used only for producing reports for users and managers. Another, more common and important, use of the data is in data management and data mining operations.
This thesis addresses the problem of discovering frequent itemsets in distributed DBs. The target of data mining is to discover hidden and useful patterns. Most of these patterns can help both managers and customers: managers can make wiser decisions, and customers can shop more easily. The patterns discovered by the different data mining operations are of different types. One of the most useful and well-known types of pattern is association rules. The best-known application of association rules is the analysis of market basket data for supermarkets. For instance, mining the DB of a supermarket may reveal that 60% of the customers who buy milk also buy butter. Finding these rules helps managers to organise shelves and guide customers, and is also useful in general management.
The DBs used for data mining are typically large, growing from gigabytes towards a terabyte or more. Because of time and space limitations, it is hard and sometimes impossible to manage and process these DBs on a single site. Additionally, some DBs are naturally distributed; hence, the importance of parallel and distributed data mining environments becomes evident. Association rules mining is one of the most effective data mining methods for distributed DBs, although other methods, such as clustering or classification, are also studied in distributed environments. Association rules mining in a distributed environment is the main focus of this research.
1.1 Motivation
With the emergence of distributed DBs, alongside large centralised DBs, in organisations and various commercial centres, data mining in these environments has become an important topic. It is vital and inevitable for almost every organisation to analyse its DBs in order to discover useful and interesting patterns. For instance, airlines analyse their DBs to target the appropriate customers for special marketing promotions. Banks require customer behaviour patterns for bankruptcy prediction and for loan and credit card approvals. Insurance companies demand such patterns to make better decisions regarding customer premiums and, finally, packaged goods manufacturers and supermarkets need shopping patterns for supplying goods. Although many achievements have been made regarding association rules mining in distributed environments, there is still room for improvement. Increasing efficiency, reducing communication, and preserving security and site privacy in distributed environments are important considerations in DARM and, more generally, in distributed data mining.
1.2 Research Question
Discovering association rules in a distributed environment is a relatively new area.
Most of the distributed algorithms are not efficient enough and have high
communication and computation complexity. The main question this research asks is:
How can a more efficient method for discovering association rules in a distributed
environment be created?
To address this question, a new algorithm is created which is more efficient than the previous methods. Efficiency is defined as less execution time to achieve the results and a smaller communication volume.
1.3 Purpose
A new algorithm is presented for association rules mining in a distributed environment, which achieves better outcomes than the previous algorithms. The new algorithm is based on the FIM algorithm and utilises several existing techniques, such as pruning local candidate itemsets, producing candidate sets, gathering support counts from the sites and candidate itemset reduction. The algorithm is tested on distributed data.
1.4 Methodology
The research methodology used to answer the research question is discussed in this section. The main research question, ‘How can an efficient method for discovering association rules in distributed environments be created?’, has been answered by designing a new algorithm which is more efficient than the existing algorithms. Examination on different distributed data shows the efficiency of this new algorithm in terms of execution time and transfer volume.
In order to resolve the research question, this research has reviewed literature relating to the following areas:
• Data mining in centralised environments.
• Data mining in distributed environments.
• Association rules mining in distributed/centralised environments.
• Comparing the existing algorithms in distributed/centralised environments and studying their advantages and disadvantages.
• Developing a concise representation, particularly distributed deduction rules.
• Designing the new algorithm based on DTFIM.
1.5 Thesis Plan
This thesis is structured as follows: the first section contains an introduction to the research and a definition of the research questions. The second section presents the literature review. The third section explains the method used to answer the research questions; subsequently, the new method is tested on a sample distributed DB, and this section also presents the new algorithm. The last section concludes the thesis.
1.6 Contribution of the Thesis
The algorithm proposed by this thesis intends to address marketing problems in distributed environments. Additionally, it aims to profile the needs and preferences of customers in transaction-oriented systems such as credit cards. One of the most significant problems with market basket data is dealing with the large number of candidate itemsets; retrieving interesting and meaningful patterns from these candidates is extremely difficult. This thesis presents an algorithm which reduces the number of candidate sets and simplifies the process of producing interesting customer preferences and patterns.
2. Literature Review
This section deals with the major areas related to the research and covers existing work in these respective areas. The first subsection considers data mining in centralised environments and discusses some algorithms in this regard. The second subsection looks at association rules mining in distributed environments and the related algorithms.
2.1 Data mining and association rules mining in centralised environments
2.1.1 Data mining
According to Han & Kamber (2006), ‘data mining refers to extraction or mining knowledge from large amounts of data’. Data mining has attracted significant attention in the information industry and in society in recent years, due to the enormous availability of data and the urgent need for turning this data into useful information and knowledge. This information can be applied in areas ranging from market analysis and fraud detection to production control and scientific data mining.
According to Frawley, Piatetsky-Shapiro & Matheus (1991), knowledge discovery is defined as ‘the nontrivial extraction of implicit, previously unknown and potentially useful information from data.’ Although an immense number of patterns can be extracted from a DB, only patterns which are novel, nontrivial to compute and potentially useful are considered interesting. Knowledge is useful when it satisfies the expectations of the system or the user.
Large databases can be considered rich and trustworthy resources for producing knowledge and information. The discovered knowledge can be utilised in information management, report processing, decision making, and so on. There are two elementary goals of data mining: prediction and description. According to Fayyad et al. (1996, p.12), ‘prediction involves using some variables or fields in the database to predict unknown or future values of other variables of interest. Description focuses on finding human-interpretable patterns describing the data.’ Moreover, they believe that, in the KDD context, description is more important than prediction.
There are various types of data mining techniques, such as association rules mining, classification, clustering, prediction and time series analysis. These are among the most important methods and techniques, and a brief definition of each follows:
i. Association rules mining retrieves relations and correlations between the data in a DB. This data mining operation produces a set of rules called association rules.
ii. Classification is another important method of data mining. In this method, the objects in a DB are divided into separate groups based on their attributes; subsequently, a model based on the data attributes is built for each class from the training data. Classification predicts categorical (discrete, unordered) labels (Han & Kamber 2006, p.285). The result of classification can be a decision tree or a set of rules. According to Berry & Linoff (2003, p.166), a decision tree is defined as ‘a structure that can be used to divide up a large collection of records into successively smaller sets of records by applying a sequence of simple decision rules.’ Additionally, classification can be performed using association rules. For example, if a car dealer can divide its customers based on their interest in different kinds of cars, then the company can send the right catalogues to the right customers and, consequently, increase its income.
iii. Data clustering is another significant method of data mining. Clustering is defined as:
The process of grouping a set of physical or abstract objects into classes of similar
objects. A cluster is a collection of data objects that are similar to one another within the
same cluster and are dissimilar to the objects in other clusters (Han & Kamber 2006,
p.383).
Clustering intends to maximise the similarity between the data objects within a class and to minimise the similarity between the data objects of two different classes. The difference between clustering and classification is that clustering is unsupervised learning, meaning that the number of classes and the class label of each training tuple are unknown at the beginning of the process, whereas classification is supervised learning, meaning that the number of classes is known in advance (Han & Kamber 2006, p. 287).
iv. Prediction is another data mining technique, which predicts possible values for unknown variables. Berry & Linoff (2003, p.10) consider prediction to be classification or estimation, with the difference that in prediction records are classified based on a predicted or estimated future value.
In prediction, the unknown variables are first determined by statistical analysis; subsequently, intelligent methods, such as genetic algorithms and neural networks, perform the prediction. For instance, the salary of an employee can be predicted from the salaries of other employees. There are other methods which are quite effective for prediction, such as regression analysis, correlation analysis and decision trees (Han & Kamber 2006, p. 285).
v. Time series analysis is another data mining method. According to Berry & Linoff (2003, p.128), the key point is the consistency of the value frequencies over time, and time series analysis requires selecting a proper time frame for the data. In this method, an immense amount of time series data is analysed to discover notable features and specific orderings. Occurrences of continuous events, sets of events which occur after a specific event, processes and corruptions are examples of such notable features. For example, the evolution of the price of a specific good in a factory can be predicted using historical data, commercial conditions and competitor information (Fayyad et al. 1996, p.229).
2.1.2 Data pre-processing
In distributed data mining, data comes from several sources in different forms and types. Such data is typically dirty, noisy, incomplete and inconsistent. Fundamentally, data is the raw material of data mining, just as crude oil is mined with impurities and becomes usable only after passing through several stages of refinement. The most powerful engine is unable to use crude oil as fuel, just as the most powerful algorithm cannot find interesting patterns in raw, un-pre-processed data (Berry & Linoff 2003, p.540).
According to Han & Kamber (2006, p.47), knowledge discovery involves the execution of an iterative sequence of steps, which together constitute data pre-processing and KDD, as follows:
i. Data cleaning: removing noise and inconsistency from the data and filling in missing values.
ii. Data integration: combining or integrating multiple data sources into a coherent data store.
iii. Data selection/reduction: retrieving the data relevant to the analysis task from the DB. Moreover, the size of a database can be reduced by aggregation, elimination of redundant tuples or clustering.
iv. Data transformation: transformation and consolidation of data into forms appropriate for the mining tasks, for instance through summary or aggregation operations.
v. Data mining: an essential process in which intelligent methods, such as clustering, classification and association rules mining, are applied to extract data patterns.
vi. Pattern evaluation: evaluating which of the discovered patterns represent interesting knowledge, based on some measures.
vii. Knowledge presentation: presenting the mined knowledge to the user using knowledge presentation techniques.
The first four steps may be executed during the process of data storage, reporting or
data integration. The last three steps can be performed in one step called data mining.
According to Le, Rahayu & Taniar (2006), three types of data are recognised, based on database design approaches: ‘well-defined and structured data such as relational, object oriented and object relational data, semi-structured data such as XML, and unstructured data such as HTML documents…’.
There are also several approaches regarding integration techniques that unify data of different types. The first approach can integrate only relational data into the data warehouse; the study of Calvanese et al. (1998) is an instance in this regard. The second approach can handle more complex data types, for instance the transition from relational data to object oriented data; Filho et al. (2000) have developed an object oriented model which transforms the data warehouse system into a dimensional object model. The third approach integrates XML documents into a data warehouse system. Jensen, Moller & Pedersen (2001) believe that XML is becoming a new standard for data presentation on the internet, and they developed an integrated architecture which transfers XML and relational data sources to the data warehouse system using OLAP (On-Line Analytical Processing) tools. The final approach, proposed by Le, Rahayu & Taniar (2006), handles all three types of data, including HTML documents.
Furthermore, ETL (Extraction, Transformation and Loading) is a novel approach for data preparation. The duties of ETL include data extraction from various sources, cleaning, and loading into a target data warehouse (Li et al. 2005).
According to Rahm & Do (n.d.), the major part of ETL is data cleaning. When multiple DBs need to be integrated, redundancy often occurs, because different distributed sources usually hold the same data in different representations. Since the integrated DB or data warehouse is used for decision making, the correctness of its data is paramount and the need for data cleaning is vital. The following figure illustrates the ETL process.
Figure 1. ETL processes (Rahm & Hai Do n.d.)
In the ETL process illustrated in the above figure, all data cleaning processes are performed in a separate data staging area before the data is loaded into the data warehouse. However, a significant fraction of the cleaning and transformation has to be executed manually or by low-level programs (Rahm & Do n.d.).
As the figure indicates, in the extraction stage an instance and a schema are produced for each DB. In the integration stage, the extracted schemas and instances are matched and integrated into a single implementation schema and into a data staging area respectively. Filtering and aggregation rules are then executed on the final schema before the last stage, which is storage in the data warehouse.
2.1.3 Data cleaning as a problem of distributed DBs
The major concerns in cleaning data from different sources are the recognition of corresponding data that refer to the same real-world entity (the object identity problem) and duplicate elimination. The following is an example of two distributed DBs which are intended to be integrated (Rahm & Do n.d.).
Table 1. User DB 1
uId | firstName | lastName | sex | address                 | phone         | mobile        | vehicleIdNo | dd
1   | carry     | bradshaw | f   | 3 main st, richmond, sa | 0061882331234 |               | 4           | 24
9   | sara      | smith    | f   | 3 main st, Richmond, sa |               | 0061423312415 | 423-410     |
Table 2. Client DB 2
cId | name           | streetNo | suburb   | state | gender | vNo
14  | pitter smith   | 4 lane   | rochmond | sa    | 1      |
1   |                | 3 main   | richmond | sa    | 0      |
147 | carry bradshaw |          |          |       |        |
These tables contain the following conflicts and problems:
• Name conflicts: such as user/client, sex/gender, uId/cId, vehicleIdNo/vNo.
• Structural conflicts: different representations of name and address.
• Heterogeneous data: different representations of the gender value (0/1 and f/m).
• Duplicated records: the client named Carry Bradshaw is repeated in both DBs.
• Incomparable DBs: the same record has different IDs in the two DBs; for instance, uId=1 and cId=147 identify the same record (Carry Bradshaw) with different ID numbers.
• Null attributes: some of the attributes contain null values, such as phone and mobile.
The following table is the integration of both tables with possible corrections:
Table 3. Users (integrated DB with cleaned data)
id | lName    | fName  | gender | streetNo  | suburb   | state | phone         | mobile        | uId | cId
1  | bradshaw | carry  | f      | 3 main st | richmond | sa    | 0061882331234 |               | 1   |
2  | smith    | sara   | f      | 3 main st | richmond | sa    |               | 0061423312415 | 9   |
3  | smith    | pitter | m      | 4 lane st | rochmond | sa    |               |               |     | 14
The integrated DB is still not ready for data mining algorithms. The following preparation actions are recommended by Berry & Linoff (2003, p.555):
• Extracting information from a field: in some cases, numbers or IDs encode meaningful information. For example, phone numbers contain a country code and an area code, both of which carry geographical information.
• Ignoring names: in general, names do not carry useful information for data mining (there may be some exceptions).
• Using special software to standardise the address field: ‘address describes the geography of customers which is important for understanding customer behaviour.’
• Correcting misspelled values: often, sorting the values brings a misspelled value next to the correct one (in the example, the incorrect suburb value of the Users table, rochmond, is obvious after sorting).
• Ignoring columns with only one value.
• Eliminating duplicated records: tuples should be sorted and their occurrences counted; more than one occurrence represents duplication.
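The duplicate-elimination and value-standardisation steps above can be illustrated with a minimal Python sketch. The record layout, the gender mapping and the misspelling correction are hypothetical choices made only for this example.

# illustrative dirty records; the field names are hypothetical
records = [
    {"name": "carry bradshaw", "gender": "f", "suburb": "richmond"},
    {"name": "carry bradshaw", "gender": "0", "suburb": "richmond"},   # duplicate, coded gender
    {"name": "pitter smith",   "gender": "1", "suburb": "rochmond"},   # misspelled suburb
]

GENDER_MAP = {"0": "f", "1": "m", "f": "f", "m": "m"}                  # unify 0/1 and f/m coding
SUBURB_FIXES = {"rochmond": "richmond"}                                # corrections found after sorting

def clean(records):
    seen, cleaned = set(), []
    for r in records:
        r = dict(r)
        r["gender"] = GENDER_MAP.get(r["gender"], r["gender"])
        r["suburb"] = SUBURB_FIXES.get(r["suburb"], r["suburb"])
        key = (r["name"], r["suburb"])          # records agreeing on the key are duplicates
        if key not in seen:                     # keep only the first occurrence
            seen.add(key)
            cleaned.append(r)
    return cleaned

print(clean(records))   # two records remain: one carry bradshaw, one pitter smith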
Since the algorithm proposed by this thesis assumes integrated and clean DBs, a more detailed survey of data cleaning is beyond the scope of the thesis.
2.1.4 Association Rules Mining
Association rules mining is one of the most important and useful data mining methods in unsupervised learning systems. The aim of this method is to find rules in immense DBs. Originally, association rules mining arose from point-of-sale analysis, which reveals what items are likely to be purchased together.
Agrawal, Imielinski & Swami (1993) define the operation of finding such interesting rules as association rules mining. They also describe the necessity and usefulness of discovering these rules as follows:
• Observing all rules which contain a specific item may help the store to decide how the sale of that item can be enhanced. Additionally, these rules may reveal the effect of continuing or discontinuing the sale of an item on the sale of other items.
• Discovering all rules that include two or more specific items may disclose the items which are mostly purchased together, and thereby result in better inventory management.
• Finding all rules that involve items placed on two particular shelves helps managers to arrange the shelves efficiently.
Suppose that the table below belongs to the database of a supermarket. Each record consists of the items purchased by a customer in a single transaction.
Table 4. An example of a DB
Tid | Items
1   | {bread, coke, yogurt}
2   | {bread, butter, yogurt, cream}
3   | {cream, coke}
4   | {detergent, bread, butter, yogurt}
5   | {bread, butter, yogurt}
6   | {detergent, cream, bread, butter}
This DB has only six transactions, but real DBs contain an enormous number of transactions. One of the rules which is evident in this DB is that 80% of the customers who bought bread also bought butter; this rule holds in 66% of the transactions of this supermarket. These kinds of rules are called association rules, and the sets of items which are frequently repeated in a DB and produce the rules are called frequent itemsets or frequent patterns. For instance, the set {bread, butter} in this sample database is a frequent itemset. The probability discussed above is called the confidence, and the percentage of the transactions which contain an itemset is called the itemset's support (Kantardzic 2003, p.166). According to Han & Kamber (2006, p.40), the support and confidence of a rule are two measures of rule interestingness; they demonstrate the usefulness and the certainty of the discovered rules respectively. Generally, an itemset is interesting, and is called a frequent (large) itemset, if its support is higher than a user-specified minimum; likewise, a rule is interesting if its support and confidence are higher than the user-specified minimum thresholds. Additional analysis can be performed to uncover interesting statistical correlations between associated items.
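As a minimal illustration of these two measures, the following Python sketch computes the support and confidence values quoted above directly from Table 4; the data is copied from the table and the function names are chosen only for this example.

transactions = {
    1: {"bread", "coke", "yogurt"},
    2: {"bread", "butter", "yogurt", "cream"},
    3: {"cream", "coke"},
    4: {"detergent", "bread", "butter", "yogurt"},
    5: {"bread", "butter", "yogurt"},
    6: {"detergent", "cream", "bread", "butter"},
}

def support(itemset, db):
    # fraction of transactions that contain every item of the itemset
    itemset = set(itemset)
    return sum(1 for t in db.values() if itemset <= t) / len(db)

def confidence(antecedent, consequent, db):
    # supp(antecedent ∪ consequent) / supp(antecedent)
    return support(set(antecedent) | set(consequent), db) / support(antecedent, db)

print(round(support({"bread", "butter"}, transactions), 2))       # 0.67 -> the 66% quoted above
print(round(confidence({"bread"}, {"butter"}, transactions), 2))  # 0.8  -> the 80% quoted above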
2.1.5 Definition of association rules mining difficulties:
The first problem is the immense number of transactions, which may not fit in the memory of a computer. Secondly, the number of frequent itemsets increases exponentially with the number of items; hence, a scalable algorithm is needed. Kantardzic (2003, p.166) gives the formal, mathematical definition of the problem, originally presented by Agrawal, Imielinski and Swami (1993), as follows:
Let I = {i1, i2, ..., im} be a set of items and D be a set of database transactions, where each transaction t is a set of items such that t ⊆ I. Each transaction has an identifier, called its TID. A transaction t contains x if and only if x ⊆ t. An association rule is an implication of the form x => y, where x ⊂ I, y ⊂ I and x ∩ y = Ø. The rule x => y holds in the transaction set D with support s if s% of the transactions in D contain x ∪ y, and with confidence c if c% of the transactions in D that contain x also contain y.
Given the transaction set D, association rules mining consists of producing all the association rules that satisfy the minimum support and confidence thresholds. D can be a data file, a table of a relational DB or the output of a report from a DB. The major aim of association rules mining is to discover strong and meaningful rules in immense DBs. Therefore, the problem can be summarised in two phases: finding the large itemsets and producing the association rules from them. The first step is the more important one, and most of the time taken by the association rules mining process belongs to this part. The Apriori algorithm represents an early solution to the problem.
2.1.6 Apriori Algorithm
The Apriori algorithm is a basic algorithm which was proposed by Agrawal & Srikant in 1994 for mining frequent itemsets for Boolean association rules (Han & Kamber 2006, p.234). It is a well-known algorithm and forms the basis of almost all existing methods, both in centralised association rules mining and in parallel and distributed rules mining. The algorithm can be considered a dynamic programming algorithm.
Algorithm: Apriori
Input: database D and minimum support minsup
Output: all large itemsets
1) C1 = all distinct items in D
2) L1 = large itemsets in C1; k = 1
3) while Lk is not empty
4)   Ck+1 = candidateGen(Lk)
5)   Lk+1 = large itemsets in Ck+1
6)   k++
7) return ∪k Lk
As the pseudo-code of this algorithm indicates, the results of each stage are used in the next stage. The following table describes the notation used in the algorithm.

Table 5. Notations used in Apriori algorithm (Agrawal & Srikant 1994)
k-itemset | An itemset with k items.
Lk        | The set of large k-itemsets. Each member of this set has two fields: 1. the itemset and 2. its support count.
Ck        | The set of candidate k-itemsets (potentially large itemsets). Each member of this set has two fields: 1. the itemset and 2. its support count.
Ĉk        | The set of candidate k-itemsets in which the TIDs of the generating transactions are kept associated with the candidates (used by the AprioriTid algorithm described below).
To count the number of occurrences of each candidate itemset in Ck, it must be determined which candidate itemsets of Ck are contained in each transaction t. After the number of occurrences of each candidate set has been determined, the frequent itemsets are obtained as those sets whose number of occurrences is at least the minimum support threshold. The figure below illustrates the operation of the Apriori algorithm on a simple example of a DB with four transactions, with the support threshold set to 2.
Figure 2. Producing candidate itemsets by the Apriori algorithm (Kantardzic 2003, p.168)
As illustrated in the figure, at each stage the itemsets whose support is less than 2 are eliminated. The candidate-gen function takes the set of frequent (k−1)-itemsets, Lk−1, as its input; its output is the set of candidate k-itemsets, Ck. Firstly, this function joins Lk−1 with itself:
insert into Ck
select p.item1, p.item2, ..., p.itemk−1, q.itemk−1
from Lk−1 p, Lk−1 q
where p.item1 = q.item1, ..., p.itemk−2 = q.itemk−2, p.itemk−1 < q.itemk−1 (Agrawal & Shafer 1996, p.4)
Then, in the pruning stage, any itemset c in Ck that has one or more subsets of length k−1 which are not in Lk−1 is deleted from Ck:
forall itemsets c in Ck do
forall (k−1)-subsets s of c do
if (s is not in Lk−1) then delete c from Ck (Agrawal & Shafer 1996, p.4)
Example: suppose that L3 = {{1,2,3}, {1,2,4}, {1,3,4}, {1,3,5}, {2,3,4}}. After the join step, C4 would be {{1,2,3,4}, {1,3,4,5}}. In the pruning stage, the itemset {1,3,4,5} is deleted because its subset {1,4,5} is not in L3, while {1,2,3,4} remains in C4 (Agrawal & Shafer 1996, p.4).
In the worst case, for m items the Apriori algorithm may have to consider up to 2^m itemsets that might be frequent. As discussed, the numbers of items and of customer transactions are immense; consequently, Apriori on its own is not considered a scalable algorithm.
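To make the join and prune steps concrete, the following is a minimal Python sketch of the Apriori level-wise loop. It is an illustrative implementation rather than the exact code of any of the cited papers, and it is run here on the Table 4 data with an absolute support threshold of 4.

from itertools import combinations

def candidate_gen(prev_frequent, k):
    # join L(k-1) with itself, then prune candidates that have an infrequent (k-1)-subset
    prev = set(prev_frequent)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in prev for sub in combinations(union, k - 1)
            ):
                candidates.add(frozenset(union))
    return candidates

def apriori(transactions, minsup):
    # returns a dict mapping each frequent itemset (frozenset) to its support count
    counts = {}
    for t in transactions:                         # pass 1: 1-itemsets
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup}
    result = dict(frequent)
    k = 2
    while frequent:
        candidates = candidate_gen(frequent.keys(), k)
        counts = {c: 0 for c in candidates}
        for t in transactions:                     # one pass over the DB per level
            tset = set(t)
            for c in candidates:
                if c <= tset:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= minsup}
        result.update(frequent)
        k += 1
    return result

db = [
    {"bread", "coke", "yogurt"},
    {"bread", "butter", "yogurt", "cream"},
    {"cream", "coke"},
    {"detergent", "bread", "butter", "yogurt"},
    {"bread", "butter", "yogurt"},
    {"detergent", "cream", "bread", "butter"},
]
print(apriori(db, 4))
# frequent itemsets: {bread}:5, {butter}:4, {yogurt}:4, {bread, butter}:4, {bread, yogurt}:4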
2.1.7 Subset function:
In this function, the candidate itemsets Ck are kept in a data structure called a hash tree. A node of the tree contains either a list of itemsets (a leaf node) or a hash table (an interior node). In an interior node, each bucket of the hash table can point to other nodes. The root of the hash tree is at the first depth, and a node at depth d can point to nodes at depth d+1. The itemsets are stored in the leaves. To add an itemset c to the hash tree, the tree is traversed from the root towards the leaves; at an interior node at depth d, the branch to follow is chosen by applying the hash function to the dth item of the itemset. Initially, all nodes are created as leaf nodes, and when the number of itemsets stored in a leaf node exceeds a predetermined threshold, the leaf node is converted into an interior node (Brin et al. 1997).
In the Apriori algorithm, the subset function is used to determine which candidate itemsets are contained in a transaction t. The subset function begins at the root and identifies all the candidate itemsets contained in the transaction. At a leaf node, it determines which of the itemsets stored in the leaf are contained in t, and the related support counts are incremented. At an interior node that has been reached by hashing the ith item of the transaction, every item that comes after the ith item in t is hashed in turn, and the function is executed recursively on the resulting child nodes. The following figure illustrates these procedures: the hash tree presented in the figure stores candidate 3-itemsets, and the figure demonstrates the operation of the subset function on the transaction {1, 2, 3, 5, 6}.
Figure 3. Hash tree (candidate 3-itemsets are stored in the leaves, items are hashed into the buckets {1,4}, {2,5} and {3,6}, and the figure traces the subset function on the transaction {1, 2, 3, 5, 6})
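The following is a minimal Python sketch of such a hash tree, assuming integer items and a simple modulo hash; it illustrates the idea rather than reproducing the exact tree of the figure. Candidates contained in a transaction are collected into a set, so the support counts can be kept in an ordinary dictionary.

class HashTreeNode:
    def __init__(self, depth=0, fanout=3, max_leaf_size=3):
        self.depth = depth
        self.fanout = fanout
        self.max_leaf_size = max_leaf_size
        self.children = None          # dict: bucket -> HashTreeNode (interior node)
        self.itemsets = []            # sorted tuples of items (leaf node)

    def _bucket(self, item):
        return item % self.fanout

    def insert(self, itemset):
        itemset = tuple(sorted(itemset))
        if self.children is not None:                 # interior node: descend by hashing
            b = self._bucket(itemset[self.depth])
            child = self.children.setdefault(
                b, HashTreeNode(self.depth + 1, self.fanout, self.max_leaf_size))
            child.insert(itemset)
            return
        self.itemsets.append(itemset)                 # leaf node: store the candidate
        if len(self.itemsets) > self.max_leaf_size and self.depth < len(itemset):
            stored, self.itemsets, self.children = self.itemsets, [], {}
            for s in stored:                          # an over-full leaf becomes an interior node
                self.insert(s)

    def subset(self, transaction, start, found):
        # collect every stored candidate that is contained in the (sorted) transaction
        if self.children is None:
            tset = set(transaction)
            for s in self.itemsets:
                if set(s) <= tset:
                    found.add(s)
            return
        for i in range(start, len(transaction)):      # hash every remaining item and recurse
            child = self.children.get(self._bucket(transaction[i]))
            if child is not None:
                child.subset(transaction, i + 1, found)

# counting candidate 3-itemsets in one pass over a few transactions
candidates = [(1, 2, 3), (1, 2, 5), (1, 3, 5), (2, 3, 5), (1, 5, 6), (3, 5, 6)]
tree = HashTreeNode()
for c in candidates:
    tree.insert(c)
counts = {c: 0 for c in candidates}
for t in [[1, 2, 3, 5, 6], [1, 3, 5], [2, 3, 5, 6]]:
    matched = set()
    tree.subset(sorted(t), 0, matched)
    for c in matched:
        counts[c] += 1
print(counts)   # e.g. (1, 3, 5) and (2, 3, 5) are contained in two of the three transactions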
2.1.8 Applied optimisations on the Apriori algorithm
A significant amount of research has been carried out to optimise the Apriori algorithm. Most of it tends to maintain the basic structure of the Apriori algorithm while enhancing its efficiency and decreasing its time costs; Agrawal and Srikant (1994), Toivonen (1996), Savasere, Omiecinski & Navathe (1995), Park, Chen & Yu (1995), Brin et al. (1997), Han et al. (2001), Bodon (2003), Borgelt (2003) and Bodon (2004) are some examples.
Since the efficiency of the Apriori algorithm depends on the method of counting the supports of the candidate itemsets and on the number of candidate itemsets present in each transaction, most of the research focuses on these parts of the Apriori algorithm. Some of the optimisations applied to the Apriori algorithm have resulted in new algorithms; the following are two of them.
2.1.9 AprioriTid and AprioriHybrid Algorithms
Along with the Apriori algorithm, Agrawal and Srikant (1994) proposed two further algorithms, called AprioriTid and AprioriHybrid. The AprioriTid algorithm reduces the execution time for computing the support counts by replacing the transactions with the candidate itemsets contained in those transactions. This replacement is performed in every iteration, and the replaced transaction database at the kth iteration is denoted by Ĉk. Although AprioriTid is faster than the Apriori algorithm in the final iterations, it is slower in the early ones. This is because producing Ĉk causes some extra overhead, and in the early iterations Ĉk does not fit in main memory and has to be stored on disk. If a transaction does not include any candidate itemset of length k, then Ĉk contains no entry for this transaction; therefore, the number of entries in Ĉk may be smaller than the number of transactions in the database, especially in the later iterations. In addition, in the later iterations each entry can be smaller than its corresponding transaction, whereas in the early iterations each entry can be larger than its transaction.
AprioriHybrid, a combination of Apriori and AprioriTid, is another algorithm proposed by Agrawal & Srikant (1994). In practice, in the earlier passes over a DB the Apriori algorithm performs better than AprioriTid, while AprioriTid performs better in the later passes. Therefore, AprioriHybrid uses the Apriori algorithm in the early passes and, whenever Ĉk is expected to become small enough to fit in main memory, switches to AprioriTid. Since the size of Ĉk is proportional to the number of candidate itemsets, a heuristic procedure that estimates the size of Ĉk can be used in the current iteration. If the estimated size of Ĉk is small enough and there are fewer itemsets in the current iteration than in the previous one, the algorithm decides to switch to AprioriTid. According to Agrawal & Srikant (1994), ‘experiments show that the Apriori-Hybrid has excellent scale-up properties, opening up the feasibility of mining association rules over very large databases’.
2.1.10 Sampling:
According to Toivonen (1996), the sampling algorithm reads the entire database at most twice to find all the frequent itemsets. Firstly, the algorithm selects a random sample of the database and discovers all the frequent patterns in the sample. Secondly, it checks the result against the entire database. In the cases where the sampling method cannot find all the frequent patterns, the missed patterns can be discovered by generating the remaining candidate patterns and counting their supports in a second pass over the database. The probability of needing a rescan can be reduced by lowering the support threshold used on the sample. Experiments indicate that in most cases the sampling algorithm finds all the frequent patterns in a single pass over the DB, with no need for rescanning. The Apriori algorithm may be applied to mine the sample data. Compared to the Apriori algorithm, this method needs fewer disk reads and is therefore considered efficient.
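A minimal sketch of the sampling idea follows, reusing the apriori function from the sketch in Section 2.1.6. The sample fraction and the factor by which the threshold is lowered are illustrative choices, and the negative-border check that Toivonen uses to detect missed patterns is omitted here.

import random

def sample_then_verify(transactions, minsup, sample_fraction=0.2, lowering=0.8):
    # phase 1: mine a random sample with a lowered absolute threshold
    sample = random.sample(transactions, max(1, int(len(transactions) * sample_fraction)))
    sample_minsup = max(1, int(minsup * sample_fraction * lowering))
    candidates = apriori(sample, sample_minsup)          # frequent in the sample
    # phase 2: one pass over the whole DB to obtain exact support counts
    counts = {c: 0 for c in candidates}
    for t in transactions:
        tset = set(t)
        for c in counts:
            if c <= tset:
                counts[c] += 1
    return {c: n for c, n in counts.items() if n >= minsup}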
2.1.11 Partitioning
The most apparent obstacle to discovering interesting association rules is the size of the DB, because the DBs used for data mining are typically enormous. One basic approach to discovering association rules is database partitioning. The partitioning method divides a database into several partitions such that each partition fits in main memory; because the data being mined resides in main memory, the mining procedures are executed more efficiently. The partitioning method is based on the fact that, if an itemset is frequent, it must be frequent in at least one partition. Since loading the transactions into main memory reduces the amount of reading from disk, the partitioning method is considered to be very fast (Savasere, Omiecinski & Navathe 1995).
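The two-phase idea can again be sketched in Python, reusing the apriori function above: an itemset that is frequent in the whole DB must be locally frequent in at least one partition, so the union of the local results is a complete candidate set that is then counted exactly in a second pass. The number of partitions and the local threshold are illustrative.

def partition_mine(transactions, minsup, num_partitions=4):
    n = len(transactions)
    size = (n + num_partitions - 1) // num_partitions
    global_candidates = set()
    # phase 1: mine each partition in memory with a proportional local threshold
    for start in range(0, n, size):
        part = transactions[start:start + size]
        local_minsup = max(1, int(minsup * len(part) / n))
        global_candidates.update(apriori(part, local_minsup).keys())
    # phase 2: one more pass over the whole DB for the exact global counts
    counts = {c: 0 for c in global_candidates}
    for t in transactions:
        tset = set(t)
        for c in counts:
            if c <= tset:
                counts[c] += 1
    return {c: v for c, v in counts.items() if v >= minsup}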
Some distributed association rules mining algorithms utilise the partitioning method to increase their efficiency. Hash Partitioned Apriori (HPA), for example, is a parallel algorithm which partitions the candidate itemsets between processors by using a hash function. Since HPA uses the entire memory space of all the processors, it is considered an efficient algorithm for large data mining processes in distributed and parallel environments (Sujni & Saravanan 2008).
Moreover, Coenen, Leng & Ahmed (2003) proposed the vertical partitioning (DATA-VP) algorithm. This algorithm utilises the Apriori-T algorithm, which ‘combines the classic Apriori ARM algorithm with the T-tree data structure.’ Initially, the DATA-VP algorithm splits the set of single attributes between sites. As the algorithm adds candidate itemsets to the T-tree, their support counts are computed and those with insufficient support are pruned. At the termination of the algorithm, each site contains a T-tree holding the large itemsets. Experiments on several DBs indicate that the execution time of the DATA-VP algorithm is much less than that of the Apriori algorithm (around five times smaller).
2.1.12 Direct Hashing and Pruning (DHP) algorithm
The DHP algorithm was presented by Park, Chen & Yu in 1995. This algorithm is an extension of the Apriori algorithm that applies a hashing technique. It is intended to address the problem of the immense number of candidate itemsets at every stage, especially in the second pass, where the number of candidates is very large. The solution is that, using an appropriate hash function, the candidate itemsets are hashed into the buckets of a hash table. In this method, instead of counting the support of every itemset individually, the support of each bucket is calculated. After each iteration, if the support of a bucket is less than the minimum support threshold, then all the candidate itemsets associated with this bucket are eliminated. Consequently, a quicker method for counting the supports is obtained.
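A minimal sketch of this idea for 2-itemsets follows: while the items are being counted in the first pass, every 2-itemset of each transaction is hashed into a small table of buckets, and a candidate 2-itemset is later kept only if both of its items are frequent and its bucket count reaches the threshold. The number of buckets and the hash function are illustrative choices.

from itertools import combinations

def dhp_candidate_2_itemsets(transactions, minsup, num_buckets=7):
    item_counts = {}
    buckets = [0] * num_buckets

    def bucket(pair):                                  # illustrative hash on the sorted pair
        return hash(tuple(sorted(pair))) % num_buckets

    # single pass: count 1-itemsets and hash every 2-itemset into a bucket
    for t in transactions:
        for item in t:
            item_counts[item] = item_counts.get(item, 0) + 1
        for pair in combinations(sorted(t), 2):
            buckets[bucket(pair)] += 1

    frequent_items = {i for i, c in item_counts.items() if c >= minsup}
    # a 2-itemset survives only if its items are frequent AND its bucket passed the threshold
    return {
        frozenset(pair)
        for pair in combinations(sorted(frequent_items), 2)
        if buckets[bucket(pair)] >= minsup
    }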
2.1.13 Dynamic Itemset Counting (DIC algorithm)
Brin et al. (1997) introduced the DIC algorithm, which intends to minimise the number of passes over the database. The DIC algorithm uses a tree as its data structure, in which each itemset is stored according to its items: every itemset and each of its prefixes has a node. The root node corresponds to the empty itemset, and all the 1-itemsets are attached to the root node; every other itemset is inserted under its prefix containing all but its last item. When an itemset is inserted, a counter is maintained for it, and while a transaction is being read, the counters of the active itemsets contained in that transaction are incremented. Furthermore, the state of each itemset is tracked through transitions from active to counted and from small to large, and the occurrence of such transitions is detected as counting proceeds.
DIC has several advantages over the Apriori algorithm. According to Brin et al. (1997), ‘the main one is performance. If the data is fairly homogeneous throughout the file and the interval is reasonably small, this algorithm generally makes on the order of two passes.’
2.1.14 Frequent Pattern (FP) Growth method
FP-growth is an alternative algorithm that takes a radically different approach to discovering frequent itemsets. It does not follow the generate-and-test model of Apriori (Tan, Steinbach & Kumar 2006, p.363).
The FP-growth method was first introduced by Han, Pei & Yin (2000) and was later optimised by Han et al. (2001). The algorithm converts the data set into a compressed data structure called an FP-tree, from which the frequent itemsets can be extracted directly. In the first pass, the whole DB is read and the frequent items are found. In the second pass, the frequent items of each transaction are considered and arranged in order of their frequency, and these items are added to a tree called the FP-tree. At the end of the second pass, the whole DB, except for the infrequent items, is stored in main memory. Arranging the itemsets based on their frequency produces a smaller tree in main memory. After the construction of the FP-tree, the algorithm ignores the DB and, working recursively, mines the frequent itemsets by exploring this tree.
Tan, Steinbach & Kumar (2006, p.364) describe the construction of the FP-tree as follows:
i. The DB is scanned to determine the support count of each item. Infrequent items are removed, while the frequent items are sorted in decreasing order of support count.
ii. Initially, the FP-tree consists only of the root node, represented by the null symbol. To construct the tree, the algorithm makes a second pass over the data. After reading the first transaction, nodes are created and labelled for the items of that transaction; at this stage all these nodes have a frequency count of 1.
iii. When the second transaction is read, if it does not share a common prefix with the first transaction, its path is disjoint and a new set of nodes is created from the root (null) node; otherwise, new nodes are created only below the shared prefix, and the frequency counts along the shared prefix are increased to 2. Subsequently, a list of pointers is maintained which connects nodes holding the same item.
iv. This procedure continues until every transaction has been mapped onto one of the paths of the tree.
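The construction steps above can be sketched in a few lines of Python. This is a minimal illustration that builds only the tree and the same-item node links (a header table); the recursive FP-growth mining step is omitted, and the example transactions are made up.

class FPNode:
    def __init__(self, item, parent):
        self.item = item
        self.parent = parent
        self.count = 1
        self.children = {}              # item -> FPNode

def build_fp_tree(transactions, minsup):
    # pass 1: item support counts
    counts = {}
    for t in transactions:
        for item in t:
            counts[item] = counts.get(item, 0) + 1
    frequent = {i: c for i, c in counts.items() if c >= minsup}

    root = FPNode(None, None)           # the null root node
    header = {}                         # item -> list of nodes holding that item (node links)

    # pass 2: insert each transaction's frequent items in decreasing support order
    for t in transactions:
        items = sorted((i for i in t if i in frequent), key=lambda i: (-frequent[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:           # no shared prefix from here on: create a new node
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            else:                       # shared prefix: just increment its count
                child.count += 1
            node = child
    return root, header

root, header = build_fp_tree(
    [["bread", "butter", "yogurt"], ["bread", "coke"], ["bread", "butter"], ["butter", "yogurt"]],
    minsup=2)
print({item: sum(n.count for n in nodes) for item, nodes in header.items()})
# {'bread': 3, 'butter': 3, 'yogurt': 2} -- per-item totals recovered from the node links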
The FP-growth algorithm produces the frequent itemsets by exploring the FP-tree in a bottom-up manner. Since every transaction is mapped onto a path of the tree, the frequent itemsets ending with a specific item can be derived easily; the relevant paths can be located rapidly by following the pointers related to that item.
Usually, the size of the FP-tree is smaller than the size of the original data, as many transactions in market basket data have items in common. On the other hand, because of the storage space needed for the pointers, it requires more physical memory (Tan, Steinbach & Kumar 2006, p.366).
The most important benefit of this method is that it reduces the cost of reading from disk and eliminates the generation and counting of candidate itemsets. Therefore, it is considered an efficient algorithm. However, according to Wang (2009), ‘it has problems about space flexibility and costs much time in dense data mining.’
Each association rules mining method has its own advantages and disadvantages. To eliminate the disadvantages of the different methods and make more use of their advantages, two or more algorithms may be combined, producing a more efficient algorithm.
2.1.15 Association rules mining in XML documents
The development of the eXtensible Markup Language (XML) as a standard for information transportation and storage on the web has been dramatic. Consequently, there is an urgent need for tools, such as association rules mining, to extract interesting knowledge from XML documents.
There are still some limitations on the use of XML inside the data mining community: some approaches exist for knowledge discovery tasks, but most of them are based on the traditional relational framework with an XML interface (Braga et al. 2002).
Much research has addressed mining association rules from native XML documents; Braga et al. (2002), Feng et al. (2003), Zhang et al. (2005) and Wan & Dobbie (2004) are some examples in this regard. Each of these papers uses a different method for association rules mining in XML documents. For instance, Feng et al. (2003) suggest a tree structure; they believe that building association rules among trees, rather than using a simple structure, is more powerful from both the structural and the semantic aspects.
Zhang et al. (2005) propose a framework called XAR-Miner which includes three main steps:
• pre-processing (i.e., construction of the Indexed XML Tree (IX-tree) or Multiple Relational Databases (Multi-DB));
• generation of generalised meta-patterns;
• generation of large association rules of the generalised meta-patterns.
The resulting generalised meta-patterns are used to generate large association rules that meet the appropriate support and confidence levels.
The paper published by Wan & Dobbie (2004) is another example of association rules mining for XML documents. The authors implement an algorithm using the query language XQuery; in this method, the need for pre-processing or post-processing is eliminated. Achieving better efficiency is not the only target of the methods and algorithms for association rules mining over XML; attention to the tree structure of XML documents is just as important.
2.1.16 Trie data structure
According to Bodon (2004), the Trie is a well-known data structure in Frequent Itemset Mining (FIM), first introduced by Fredkin (1960) and Briandais (1959) for storing and retrieving words of a dictionary.
A Trie is a directed, labelled and weighted tree. The root of the tree is defined at depth zero, and a node at depth d can point to other nodes at depth d+1. If node u points to node v, then u is considered the parent of v and v a child of u. The edges are labelled with items, and the concatenation of the labels of the edges on the path from the root to a node represents an itemset; the value stored in a node is the support count of the itemset that the node represents. Figure 4 illustrates a Trie: for instance, the support count of the itemset {a, b} is 6 and that of {a} is 9.
Figure 4. An example of a Trie, with its levels labelled depth 0, depth d and depth d+1 (Ansari et al. 2008)
Tries are suited to storing any finite sets or sequences, as well as words. Tries can be implemented in several ways, such as the compact and non-compact representations. In the compact representation, the edges of a node are stored in a vector; each element of the vector is a pair representing one edge, where the first element stores the label of the edge and the second stores the address of the node that the edge points to. A linked list can also be used for this purpose. In the non-compact representation, only the pointers are stored in a vector whose length equals the total number of items: the element at index i belongs to the edge whose label is the ith item, and if there is no edge with that label the element is nil. The advantage of this method is that it finds a specific label in O(1) time, whereas in the compact representation the search is implemented by a binary search with time complexity O(log n). The non-compact representation needs more memory than the compact representation for nodes with few edges and less memory for nodes with many edges. Depending on the memory requirements, the two methods can be combined (Bodon 2004).
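A minimal Python sketch of a Trie that stores itemsets together with their support counts is given below. It uses a dictionary of children per node, which is closest to the compact representation, and the example reproduces the two counts quoted for Figure 4; the class and method names are chosen only for this illustration.

class TrieNode:
    def __init__(self):
        self.count = 0
        self.children = {}                 # item label -> child TrieNode

class ItemsetTrie:
    def __init__(self):
        self.root = TrieNode()             # the root at depth 0 represents the empty itemset

    def insert(self, itemset, count):
        # store (or overwrite) the support count of an itemset; items are kept in sorted order
        node = self.root
        for item in sorted(itemset):
            node = node.children.setdefault(item, TrieNode())
        node.count = count

    def support(self, itemset):
        node = self.root
        for item in sorted(itemset):
            node = node.children.get(item)
            if node is None:
                return 0
        return node.count

trie = ItemsetTrie()
trie.insert({"a"}, 9)
trie.insert({"a", "b"}, 6)
print(trie.support({"a"}), trie.support({"a", "b"}))   # 9 6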
Distributed Trie-based Frequent Itemset Mining (DTFIM) is one of the most recent distributed Trie-based algorithms for multi-computer environments, proposed by Ansari et al. (2008). They state that ‘using the Trie data structure outperforms the other implementations using hash tree.’ In the DTFIM algorithm, a Trie-based structure is constructed at each local site. At the beginning, each site determines the local support counts of all its items (1-itemsets) and stores them in a vector. At the end of this stage, the local sites synchronise their data to identify the globally large 1-itemsets. Subsequently, each site initialises its local Trie, so that all sites possess the same Trie at the end of this stage. In the second pass, a two-dimensional array is created for storing the support counts of the 2-itemsets; at the end of the pass the global support counts of the 2-itemsets are determined and each site appends the large 2-itemsets to its local Trie. Similarly, the same process is repeated for each pass k (k ≥ 3): at the end of each pass the local support counts are synchronised and the infrequent k-itemsets are pruned.
One of the properties of this algorithm is that the more skewed the distributed DB is, the more efficiently the algorithm behaves.
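The first-pass synchronisation step can be illustrated with a minimal sketch in which communication is abstracted away: each site is assumed to report a plain dictionary of its local 1-itemset counts, and the globally large items are those whose summed counts reach the global threshold. The site data and the threshold are made up for the example.

def globally_large_items(local_counts, global_minsup):
    # sum the per-site support counts and keep the items that reach the global threshold
    totals = {}
    for site in local_counts:              # one dict {item: local support count} per site
        for item, count in site.items():
            totals[item] = totals.get(item, 0) + count
    return {item: c for item, c in totals.items() if c >= global_minsup}

site_a = {"bread": 40, "butter": 25, "coke": 5}
site_b = {"bread": 35, "butter": 10, "yogurt": 30}
print(globally_large_items([site_a, site_b], global_minsup=50))
# {'bread': 75} -- butter (35) and yogurt (30) are present locally but are not globally large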
2.1.17 Non-derivable itemsets
According to Calders & Goethals (2007), in most association rules mining algorithms, if the minimum support threshold is set low or the data is highly correlated, the number of frequent itemsets becomes immense. In these circumstances, producing all the frequent itemsets is not feasible. To overcome this issue, various proposals have been made for building a concise representation of the frequent itemsets. Calders & Goethals (2002) introduce the non-derivable itemsets as a concise representation of the frequent itemsets, eliminating the need to mine all the frequent itemsets. The non-derivable itemsets are obtained from the deduction rules. This section first discusses the deduction rules, which derive tight bounds on the support of candidate itemsets, and then the way in which these rules divide the itemsets of a DB into the two groups of derivable and non-derivable itemsets. In addition, it discusses how the non-derivable itemsets can be used as a condensed representation of the entire data.
2.1.17.1 Deduction rules
The deduction rules are rules for deducing tight bounds on the support of an itemset without accessing the DB. A transaction DB over a finite set I is a finite set of pairs (tid, J), where tid is a positive integer called the transaction identifier and J is a subset of I (J ⊆ I). For each itemset I and every X ⊆ I, the deduction rules are as follows (Calders & Goethals 2007):
If |I\X| is odd, then
    supp(I) ≤ Σ_{X ⊆ J ⊂ I} (−1)^(|I\J|+1) supp(J)
If |I\X| is even, then
    supp(I) ≥ Σ_{X ⊆ J ⊂ I} (−1)^(|I\J|+1) supp(J)
These rules are denoted RX(I). Depending on whether |I\X| is odd or even, the computed value is an upper or a lower bound on the support of the itemset I: if |I\X| is odd then RX(I) gives an upper bound, otherwise it gives a lower bound. Therefore, given the supports of all the subsets of an itemset I, the rules RX(I) for all X ⊆ I yield lower and upper bounds on the support of I.
Calders (2004) proves that these bounds are tight, meaning that, considering only the information from the subsets of an itemset and without scanning the DB, no lower or upper bounds can be obtained that are better than the bounds derived by the deduction rules. Before the derivation rules are discussed further, the principle behind them is illustrated by an example (Calders & Goethals 2007).
Figure 5. An example of the transactions of a DB (Calders & Goethals 2007)
Let D be a transaction DB, and let a, b and c denote the sets of transactions that contain the items a, b and c respectively. According to the inclusion-exclusion principle of Galambos and Simonelli (1996), the following equality holds (Calders & Goethals 2007):
supp(ab̄c̄) = supp(a) − supp(ab) − supp(ac) + supp(abc)
where supp(ab̄c̄) denotes the number of transactions that contain a but contain neither b nor c. Since supp(ab̄c̄) is always greater than or equal to zero:
0 ≤ supp(a) − supp(ab) − supp(ac) + supp(abc)
supp(abc) ≥ supp(ab) + supp(ac) − supp(a)
This inequality gives a lower bound on the support of abc whenever the supports of all its subsets are known. The same procedure can be repeated for every generalised itemset over a, b and c. The following inequalities list all the possible rules for deducing tight bounds on the support of the itemset abcd (Calders 2002).
R0 : supp(abcd) ≥ supp(abc) + supp(abd) + supp(acd) + supp(bcd) − supp(ab) − supp(ac) − supp(ad)
−supp(bc) −supp(bd)− supp(cd) + supp(a)+supp(b) +supp(c)+supp(d)− supp({})
Ra : supp(abcd) ≤ supp(a) − supp(ab) − supp(ac) − supp(ad) + supp(abc) + supp(abd) + supp(acd)
Rb : supp(abcd) ≤ supp(b) − supp(ab) − supp(bc) − supp(bd) + supp(abc) + supp(abd) + supp(bcd)
Rc : supp(abcd) ≤ supp(c) − supp(ac) − supp(bc) − supp(cd) + supp(abc) + supp(acd) + supp(bcd)
Rd : supp(abcd) ≤ supp(d) − supp(ad) − supp(bd) − supp(cd) + supp(abd) + supp(acd) + supp(bcd)
Rab : supp(abcd) ≥ supp(abc) + supp(abd) − supp(ab)
Rac : supp(abcd) ≥ supp(abc) + supp(acd) − supp(ac)
Rad : supp(abcd) ≥ supp(abd) + supp(acd) − supp(ad)
Rbc : supp(abcd) ≥ supp(abc) + supp(bcd) − supp(bc)
Rbd : supp(abcd) ≥ supp(abd) + supp(bcd) − supp(bd)
Rcd : supp(abcd) ≥ supp(acd) + supp(bcd) − supp(cd)
Rabc : supp(abcd) ≤ supp(abc)
Rabd : supp(abcd) ≤ supp(abd)
Racd : supp(abcd) ≤ supp(acd)
Rbcd : supp(abcd) ≤ supp(bcd)
Rabcd : supp(abcd) ≥ 0
Figure 6. Tight bounds on support of (abcd) (Calders 2002)
If numerical support counts of the subsets are substituted into the above formulas, each right-hand side evaluates to a number. The formulas of the form supp(abcd) ≥ … therefore give lower bounds on the support of the itemset abcd, while the formulas of the form supp(abcd) ≤ … give upper bounds. The tight bounds for the itemset are the least upper bound and the greatest lower bound obtained from these formulas. If, for an itemset I, the least upper bound (I.u) equals the greatest lower bound (I.l), then the exact support count of I can be calculated from the supports of its subsets. That is, if I.l = I.u in the database D then:
Supp(I, D) = I.u = I.l
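To make the rule evaluation concrete, the following Python sketch computes the pair (I.l, I.u) for an itemset from the supports of its proper subsets. The function names and the small numeric example are illustrative assumptions, not Calders & Goethals' implementation.

from itertools import combinations

def subsets(itemset):
    """All subsets of an itemset, yielded as frozensets (including the empty set)."""
    items = sorted(itemset)
    for r in range(len(items) + 1):
        for combo in combinations(items, r):
            yield frozenset(combo)

def deduction_bounds(itemset, supp):
    """Greatest lower bound and least upper bound on supp(itemset), derived from
    the rules R_X(I); `supp` maps every proper subset (as a frozenset) to its count."""
    I = frozenset(itemset)
    lower, upper = 0, float("inf")
    for X in subsets(I):
        # sigma_X(I) = sum over X <= J < I of (-1)^(|I\J|+1) * supp(J)
        sigma = sum((-1) ** (len(I - J) + 1) * supp[J]
                    for J in subsets(I) if X <= J and J != I)
        if len(I - X) % 2 == 1:   # odd depth  -> R_X(I) is an upper bound
            upper = min(upper, sigma)
        else:                     # even depth -> R_X(I) is a lower bound
            lower = max(lower, sigma)
    return lower, upper

# hypothetical supports: 10 transactions, supp(a) = 6, supp(b) = 5
supp = {frozenset(): 10, frozenset("a"): 6, frozenset("b"): 5}
print(deduction_bounds("ab", supp))   # -> (1, 5), so ab is non-derivable here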
Itemsets for which I.l = I.u are called derivable itemsets. In contrast, an itemset is considered non-derivable if the exact value of its support cannot be calculated by the deduction rules; that is, for non-derivable itemsets the deduction rules cannot produce equal lower and upper bounds. The non-derivable itemsets are considered a concise representation of the entire collection of itemsets (Calders & Goethals 2002). This representation is a subset of all the itemsets which contains all the information necessary for deriving every itemset, including the frequent itemsets.
The NDI algorithm was proposed by Calders & Goethals (2002) to produce all the non-derivable itemsets of a DB. The algorithm, similar to the Apriori algorithm, works level by level, but the difference is that the NDI algorithm prunes more itemsets from the candidate list in each stage. The next section describes the NDI algorithm. In effect, the NDI algorithm provides a concise representation of a DB by generating the non-derivable itemsets.
2.1.17.2 Non Derivable Itemset (NDI) algorithm
Like the Apriori algorithm, the NDI algorithm produces a set of candidate itemsets in each iteration. Subsequently, the tight bounds for the candidate itemsets are calculated using the deduction rules, and the derivable itemsets are removed from the candidate list. The DB is then scanned to calculate the support counts of the non-derivable candidate itemsets. Finally, the frequent non-derivable itemsets are determined by comparing these counts with the support threshold. The following pseudocode illustrates the NDI algorithm (Calders & Goethals 2002):
NDI(D, s)
  i := 1; NDI := {}; C1 := {{i} | i ∈ I};
  for all I in C1 do I.l := 0; I.u := |D|;
  while Ci not empty do
    Count the supports of all candidates in Ci in one pass over D;
    Fi := {I ∈ Ci | support(I, D) ≥ s};
    NDI := NDI ∪ Fi;
    Gen := {};
    for all I ∈ Fi do
      if support(I) ≠ I.l and support(I) ≠ I.u then
        Gen := Gen ∪ {I};
    PreCi+1 := AprioriGenerate(Gen);
    Ci+1 := {};
    for all J ∈ PreCi+1 do
      Compute bounds [l, u] on support of J;
      if l ≠ u then J.l := l; J.u := u; Ci+1 := Ci+1 ∪ {J}
    i := i + 1
  end while
  return NDI
In this algorithm, the candidate itemsets at each stage are produced from the frequent itemsets whose support counts are not equal to their lower or upper bounds. The candidate itemsets are produced in two steps. The first step is the same as candidate generation in the Apriori algorithm. In the second step, the upper and lower bounds are determined by applying the deduction rules to the candidate itemsets produced in step one, and the derivable itemsets are pruned from the candidate list. The supports of the non-derivable candidate itemsets are then obtained by reading the DB. The output of the algorithm is the set of frequent non-derivable itemsets. Since evaluating all the deduction rules is time consuming, Calders & Goethals (2002) use only part of these rules to derive the tight bounds of the itemsets. They apply the deduction rules up to depth k for an itemset I, where only the rules RJ(I) with |I − J| ≤ k are evaluated. As this condition indicates, these rules are only a part of the entire set of deduction rules for an itemset. Empirically, the rules of greater depth contribute little additional pruning; thus Calders & Goethals (2002) conclude that ‘in practice most of the pruning is done by the rules of limited depth’. Furthermore, they prove that derivability is monotone: if an itemset I is derivable then all its supersets are derivable and, equivalently, if I is non-derivable then all its subsets are non-derivable.
Moreover, Calders (2004) shows that, for a non-derivable itemset, the width of the interval between the lower and upper bound, w(I) = UB(I) − LB(I), decreases exponentially with the length of I; specifically, w(I ∪ {i}) ≤ w(I) / 2 holds for an itemset I and an item i ∉ I. This property guarantees that non-derivable itemsets cannot be too long, because the interval can only be halved a logarithmic number of times (log(n) + 1, where n is the number of transactions).
The number of rules RX(I) grows exponentially with the size of I. The number |I\X| is called the depth of the rule RX(I). Since evaluating all the rules requires considerable resources, in practice only the rules of limited depth are used. The greatest lower bound and the least upper bound on the support of I obtained by evaluating the rules of depth at most k are denoted LBk(I) and UBk(I) respectively; the interval [LBk(I), UBk(I)] is obtained by evaluating the rules {RX(I) | X ⊆ I, |I \ X| ≤ k}. It should be noted that if only the deduction rules up to depth k are used for computing the frequent non-derivable itemsets, a group of derivable itemsets might not be pruned, so at the end of the mining operation there may be some derivable itemsets in addition to the non-derivable ones. Clearly, if k is large enough, the number of such derivable itemsets will be small or zero.
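As a rough illustration of the depth-limited evaluation, the sketch below restricts the bound computation to the rules of depth at most k; the helper and names are again illustrative, not the authors' code.

from itertools import chain, combinations

def all_subsets(I):
    """Every subset of I, as frozensets."""
    items = sorted(I)
    return (frozenset(c) for c in
            chain.from_iterable(combinations(items, r) for r in range(len(items) + 1)))

def bounds_up_to_depth(itemset, supp, k):
    """[LBk(I), UBk(I)]: evaluate only the rules R_X(I) with |I \\ X| <= k."""
    I = frozenset(itemset)
    lower, upper = 0, float("inf")
    for X in all_subsets(I):
        if len(I - X) > k:                        # skip the deeper rules
            continue
        sigma = sum((-1) ** (len(I - J) + 1) * supp[J]
                    for J in all_subsets(I) if X <= J and J != I)
        if len(I - X) % 2 == 1:
            upper = min(upper, sigma)             # odd depth: upper bound
        else:
            lower = max(lower, sigma)             # even depth: lower bound
    return lower, upper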
Experiments on real DBs indicate that the size of the concise representation formed by the non-derivable itemsets is much smaller than the total number of frequent itemsets produced by Apriori. The following figure compares the number of itemsets generated by the NDI and Apriori algorithms at the same support thresholds.
Figure 7. Size of concise representation (Calders & Goethals 2002)
As Figure 7 illustrates, for the support threshold 0.1% the Apriori algorithm generates 990097 frequent itemsets, whereas the NDI algorithm generates only 162821 itemsets (Calders & Goethals 2002).
2.1.17.3 Producing all the frequent itemsets by the non-derivable itemsets:
The concept of derivability can be applied to produce all the frequent itemsets of a DB efficiently. The itemsets of a DB can be divided into two groups: derivable and non-derivable itemsets. To find the frequent non-derivable itemsets, the DB must be scanned and their support counts calculated. The support counts of the derivable itemsets, on the other hand, can be calculated by applying the deduction rules. This property can be used to create an efficient method for mining all the frequent itemsets, in which the derivable itemsets are retrieved without reading the DB. Since some DBs contain an immense number of derivable frequent itemsets, all the frequent itemsets can be obtained more efficiently by using the concept of derivability. As previously mentioned, when all the frequent itemsets are explored using the concept of derivability, two groups of itemsets are discovered: derivable and non-derivable. Evaluating all the deduction rules is time consuming and can eliminate the advantage of economising on DB reads. Calders & Goethals (2002) examine two methods for discovering the derivable itemsets, which show that for a specific group of derivable itemsets the evaluation of all the deduction rules is unnecessary. They are as follows:
• First method: suppose I is a non-derivable itemset, but after scanning the DB one of the deduction rules RX(I) turns out to give a value equal to the exact support count of I. Then all the supersets of I of the form I ∪ {i} are derivable, and their exact supports can be calculated using the rules RX(I ∪ {i}) or RX∪{i}(I ∪ {i}). This observation can be used to avoid checking all the possible deduction rules when calculating the support bounds of I ∪ {i}. It follows that, when the bounds on the support of the itemset I are determined, the lower and upper bounds (I.l and I.u) should be saved. If the itemset I is non-derivable (I.l ≠ I.u) then its support count must be obtained by scanning the DB. After the calculation of the support count, the following conditions need to be tested:
support(I) = I.l
support(I) = I.u
If one of these two conditions is true, then all the supersets of I are derivable and there is no need to calculate their bounds; the deduction rules that gave the exact support count can be used to find the support counts of the supersets of I.
• Second method: suppose that I is a derivable itemset and that the rule RX(I) derives the exact support count of I. Then, for calculating the support count of I ∪ {i}, only the rule RX∪{i}(I ∪ {i}) needs to be evaluated. The concept of derivability is monotone, which means that every superset of a derivable itemset is derivable. Therefore, to obtain the support of a superset of a derivable itemset, the evaluation of a single deduction rule is enough.
Using the concept of derivability, these two methods can be used to retrieve all the frequent itemsets. In each iteration, some derivable and some non-derivable itemsets are identified among the candidate itemsets. In the early iterations the number of non-derivable itemsets exceeds the number of derivable ones but, as the process continues, the number of non-derivable itemsets decreases and the number of derivable itemsets increases; in the last iterations all the itemsets might be derivable. As mentioned before, the frequent non-derivable itemsets are a minimal, or compact, representation of the entire set of frequent itemsets, and for this reason all the frequent itemsets can be produced from the non-derivable itemsets (as a concise representation).
Other techniques and algorithms for obtaining a concise representation of a DB have also been proposed, such as closed itemsets (Pasquier et al. 1998) and the GrGrowth algorithm (Liu, Li & Wong 2007). Experiments indicate that the non-derivable itemsets are more efficient than other compact representations (Calders 2004): they form a more compact representation, meaning that their number, compared with the total number of frequent itemsets in a given DB, is smaller. Therefore, producing all the frequent itemsets from the compact representation given by the non-derivable itemsets is more economical than using other compact representations. Hence, the algorithm proposed by this thesis applies this technique.
2.2 Distributed association rules mining
This chapter introduces distributed data mining and some of the most important
Distributed Association Rules Mining (DARM) algorithms. Moreover, some of the
discussed problems and advanced progress in this regard are covered.
2.2.1 Distributed data mining
‘Mining association rules in the distributed environment is a distributed problem and
must be preformed using a distributed algorithm that does not need raw data exchange
between participating sites.’(Ansari et al. 2008)
The subject of distributed data mining has attracted a great deal of attention from
research and commercial communities for finding useful and interesting hidden patterns
in large transaction logs. Guo & Grossman (1999); Zaki (2000); Kargupta & Chan
(2000); Zaki (1999); Agrawal & Shafer (1996); Cheung et al. (1996) and Sujni &
Saravanan (2008) have surveyed the issues of distributed data mining.
Distributed data mining is the operation of data mining on distributed data sets. According to Zaki (1999), two dominant architectures exist in distributed environments: distributed memory and shared memory. In the distributed memory architecture, each processor has a private DB or memory to which only it has access; access to other local DBs is possible only via message exchange. This architecture offers a simple programming method, but limited bandwidth may reduce its scalability. This thesis assumes the distributed memory architecture. The figure below indicates a simple architecture for distributed memory systems.
Figure 8. Distributed memory architecture for distributed data mining
In the shared memory architecture, each processor has direct and equal access to the DB in the system (the global DB). Parallel programs can be implemented on such systems easily. The figure below indicates an architecture for shared memory systems.
Figure 9. Shared memory architecture for distributed data mining
Distributed data mining is often mentioned together with parallel data mining. Both are designed to optimise the efficiency of data mining in distributed environments, but each uses a different system architecture and different methods. The computers in distributed data mining are physically distributed and communicate with each other by exchanging messages. Parallel data mining assumes a parallel computer whose processors share memory, whereas the computers in distributed data mining systems do not share anything. This discrepancy in architecture has a significant effect on algorithm design, the cost model and the measurement of efficiency in parallel and distributed data mining (Kargupta & Chan 2000, p.24). However, according to Fang et al. (2005), they are the same ‘from the point of concept, the architecture of parallel and distributed calculating is a kind of layer calculating structure’.
2.2.2 The necessity of studying distributed data mining
• Distributed DBs
With the global change of organisational structures from centralisation to decentralisation and the emergence of computer networks and distributed DBs, the need for aggregating distributed data has become vital. Many organisations, such as departments of health, have to deal with distributed, heterogeneous and independent DBs. The survey by Chiu, Koh & Chang (2007) on the Department of Health of Taiwan indicates the need for integrating distributed DBs into a central one for management purposes. Furthermore, the investigation of Amoroso, Atkinson & Secor (1993) reveals that the problem of data management in organisations is inevitable. In this regard, they explore a data management construct intended to clarify the weak points of data management by managers.
Depending on the type of organisation, the distributed DBs may need to be integrated into a central DB (for example, the health histories of patients), or there may be no need to keep a copy of all the data of the distributed DBs in a central DB (for example, market basket data).
A primary solution for the data management problem in distributed DBs is to transfer all the data from the different sites to a centralised site and apply the data mining operations there. Even if a central site has enough storage space (memory) and the capacity to perform all the heavy data mining tasks, transferring all the data of the local sites to a central site is extremely time consuming and costly. In addition, in some cases, transferring the local data is not permitted because of site ownership and security issues. In distributed data mining:
no site should be able to learn the contents of a transaction at any other site, what rules
are supported by any other site, or the specific value of support/confidence for any rule at
any other site, unless that information is revealed by knowledge of one’s own data and
the final result (Kantarcioglu & Clifton 2004).
Most of the distributed algorithms, which are based on the Apriori, suffer from lack of
privacy for the participating sites. Some algorithms have been proposed to protect the
sites’ privacy, such as, SDDM (Secure Distributed Data Mining) presented by
Fukasawa et al. (2004). This algorithm satisfies privacy requirements and has the
ability to resist collusion. Besides this, since only random numbers are used for
preserving privacy, it is considered an efficient algorithm.
• Efficiency and Scalability
Even if data is not distributed, distributed data mining can be useful for data stored at a single site. In particular, a site with massive data can send parts of its data to other sites and share the burden of the data mining operations; the results from those sites are then combined to achieve the desired result. Even though the site may be able to perform the data mining operation by itself, sending parts of the data to other sites can enhance efficiency. There are two approaches to distributed data mining. In the first approach, each site mines a part of the data (data distribution). In the second, each site performs a part of the data mining for the entire distributed DB (distributing operations). In both cases, the results from all sites are combined. Since the burden of the operations is distributed between sites, this method is much quicker than centralised data mining. The above discussion shows that distributed data mining methods are scalable. A data mining method is considered scalable if its efficiency does not degrade as the volume of data increases; in a distributed data mining system, the number of sites can be adjusted to the increasing or decreasing volume of data. Distributed data mining is a desirable method in most cases, especially with recent progress in computer networks, specifically the Intranet and Internet. However, in comparison with centralised data mining, it is more costly to implement, the methods used are more complicated, and accuracy in designing the data mining system is more important (Agrawal & Shafer 1996).
2.2.3 Important instances and issues in distributed data mining
Even though the different local sites can be considered as parts of one general DB system, executing centralised data mining instead of distributed data mining is not recommended, because the data in a distributed DB is naturally distributed. For designing a suitable data mining system, attention to the features of the data, the computer network and the kind of data mining operation is necessary. Some important instances are listed as follows:
• Homogeneous versus heterogeneous data:
In most studies of distributed data mining it is assumed that the local DBs are homogeneous, which means the local data reside on the same platform, with the same DB management system and the same schema. For heterogeneous local DBs, the data mining system must be adjusted to work on the different local DBs. In addition, the different schemas should be standardised and unified into a general schema; otherwise it would be very difficult or even impossible to achieve the desired result.
• Data layout:
There are many methods to represent a DB. According to Zaki (1999), the DB layout can be horizontal or vertical. In the horizontal layout, each customer transaction is stored along with its related items. This representation is more common and is the one adopted by the Apriori algorithm. In the vertical DB layout, all the transactions which contain a specific item are listed under that item. In this representation,
the length of the TID-lists shrinks as we progress to larger sized itemsets. However, one problem with this approach is that the initial set of TID-lists may be too large to fit into main memory thus requiring more sophisticated techniques to compress the TID-lists (Tan, Steinbach & Kumar 2006, pp. 362).
The figures below indicate these layouts.
TID   Items
1     A, B, D, L, Q
2     B, D, R
3     B, L, Q, R
4     A, B
Figure 10. Horizontal DB Layout

A: 1, 4
B: 1, 2, 3, 4
D: 1, 2
L: 1, 3
Q: 1, 3
R: 2, 3
Figure 11. Vertical DB Layout
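To illustrate the relationship between the two layouts, the short Python sketch below converts a horizontal transaction list into vertical TID-lists; the variable names are illustrative only.

from collections import defaultdict

def to_vertical(horizontal):
    """Convert {tid: set of items} into {item: sorted list of tids} (TID-lists)."""
    tid_lists = defaultdict(list)
    for tid in sorted(horizontal):
        for item in horizontal[tid]:
            tid_lists[item].append(tid)
    return dict(tid_lists)

horizontal = {1: {"A", "B", "D", "L", "Q"}, 2: {"B", "D", "R"},
              3: {"B", "L", "Q", "R"}, 4: {"A", "B"}}
print(to_vertical(horizontal))   # e.g. 'B': [1, 2, 3, 4], 'R': [2, 3]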
• Data replication:
All or part of the local data can be replicated on other sites. Data replication increases the availability of data. Basically, data replication is not carried out only for data mining purposes; it is a decision based on computational or other needs. Data can nevertheless be replicated for data mining processes, in which case the data miner should decide what data, or which part of the data, should be replicated.
• Information exchange cost:
The major concerns in distributed data mining are the time spent reading from disk and the execution time. The time spent exchanging information should be considered as well; in a slow network, data exchange is the major cost. The data exchange cost is determined by the bandwidth and the number of messages sent over the network. Therefore, the cost models for centralised and distributed data mining should be different.
• Drawing conclusions from the results:
Obtaining the final result is not simply a matter of gathering the results from all sites and putting them together. An interesting rule discovered in a local DB could be quite useless in the global DB; for example, a frequent itemset in a local DB can be infrequent in the global DB. Since the aim of distributed data mining is to find rules which are useful for the global DB, the locally discovered rules and their features should be examined with respect to the global DB.
• Data skew:
The statistical distribution of the data, such as the values of attributes and membership in different classes, usually differs among local DBs. The local model obtained by mining a local DB is unavoidably affected by this distribution. Skew in the data can make the local models useless; for example, a classifier learnt from a local DB may fail to classify further samples of a class.
The cases mentioned above are not independent of each other. For instance, fragmentation of a global DB causes the local DBs to be heterogeneous, and careless horizontal data fragmentation can cause data skew. Replication of data tables can reduce the transfer cost of accessing data, but it increases the transfer cost of keeping the data consistent. In addition to the cases mentioned, there are other important factors in distributed data mining, such as security, the privacy of local sites, the autonomy of local DBs, the network topology, how data is transferred and the amount of load each local site can handle.
2.2.4 The distributed algorithms for association rules mining
This section introduces how to mine association rules in a distributed environment and
some of the most significant algorithms in this regard.
Suppose DB is a database with D transactions. Also assume that in a distributed system there are n sites S1, S2, …, Sn, and that the database DB is partitioned over the n sites into {DB1, DB2, …, DBn} respectively, where partition DBi belongs to site Si. Additionally, let Di denote the size of partition DBi, for i = 1, …, n. Suppose that the support counts of an itemset X in DB and DBi are represented by X.sup and X.supi respectively; these are called the global support count and the local support count of X at site Si. With a specified support threshold s, X is globally large if X.sup ≥ s × D. Likewise, X is locally large at site Si if X.supi ≥ s × Di. L represents the globally large itemsets in DB, and L(k) the globally large k-itemsets in L. Discovering the large itemsets L is the major concern of a distributed association rules mining algorithm. The following sections contrast the model proposed by Cheung et al. (1996) with the Count Distribution (CD) algorithm.
2.2.4.1 Count Distribution (CD) algorithm
The CD algorithm is a parallel algorithm for association rules mining in distributed environments presented by Agrawal & Shafer (1996). This algorithm aims to minimise communication.
In each iteration, the algorithm produces the candidate itemsets by executing the Apriori_gen function on the large itemsets of the previous iteration. Subsequently, the local support counts of all these candidate itemsets are calculated by each site and the obtained results are sent to the other sites. Therefore, each site is able to calculate the frequent itemsets of that iteration and proceed to the next iteration.
In this method, the set of all candidate itemsets is produced redundantly at every site, and each node holds a part of the DB. Each processor is responsible for counting the local support of the global candidate itemsets. Then each site calculates the global support of its candidate itemsets, which is the sum of the local supports of each candidate itemset over the whole distributed DB. Calculating the global support is performed by exchanging the local supports between sites (global reduction). The globally frequent itemsets can then be determined by each site using the global supports of the candidate itemsets. The figure below indicates the execution of the Count Distribution algorithm on a distributed system with three sites which are executing the second iteration of the algorithm:
Figure 12. The second replication from count distribution algorithm
In this figure, each site computes the local supports of the candidate itemsets. Subsequently, the sites exchange their local counts; as a result, each site obtains the global supports of the candidate itemsets.
The Count Distribution algorithm is given below. In this algorithm, {D1, D2, …, Dp} are the partitions of the distributed data, where Di is the part of the DB stored at the ith site and p is the number of sites.
The following pseudocode indicates the Count Distribution (CD) algorithm presented by Agrawal & Shafer (1996):
Input: I, s, { D1, D2,…., Dp}
Output: L
1) C1 =I;
2) for k=1; Ck ≠ 0 ; k++ do begin
//step one: counting to get the local counts
3) count(Ck, Di); //local processor is i
//step two: exchanging the local counts with other processors to obtain the global counts in the whole
DB.
4) forall itemsets X ∈ Ck do begin
5) X.count = Σ_{j=1..p} Xj.count;
6) end
//step three: identifying the large itemsets and generating the candidates of size k+1
7) Lk ={c ∈Ck | c.count ≥ s × | D1 ∪D2 ∪…∪Dp |};
8) Ck+1 =apriori_gen(Lk);
9) end
10) return L= L1 ∪L2 ∪…∪Lk;
There are three major steps in the Count Distribution algorithm. In the first step, each site finds the local supports of the candidate itemsets Ck in its own local DB. In the second step, each site exchanges its local supports with the other sites to obtain the global supports of all the candidate itemsets. In the third step, each site obtains Lk, the globally frequent k-itemsets, and the candidate itemsets of length k+1 are generated at each site by executing the Apriori_gen() function on Lk. The algorithm repeats steps 1 to 3 until no further candidate itemsets are produced.
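The following single-process Python sketch mimics the three steps of CD, with the message exchange of step 2 reduced to a plain sum over the partitions; it omits the prune step of Apriori_gen and all names are illustrative assumptions.

from collections import Counter

def local_counts(candidates, partition):
    """Step 1: one site counts the candidate itemsets in its own partition."""
    counts = Counter()
    for transaction in partition:
        for c in candidates:
            if c <= transaction:
                counts[c] += 1
    return counts

def count_distribution(partitions, items, minsup):
    """Steps 1-3 of CD; the 'exchange' of step 2 is simulated by summing counters."""
    k, large = 1, []
    Ck = {frozenset([i]) for i in items}
    total = sum(len(p) for p in partitions)
    while Ck:
        # step 2: add up the local counts to obtain the global counts
        global_counts = sum((local_counts(Ck, p) for p in partitions), Counter())
        Lk = {c for c in Ck if global_counts[c] >= minsup * total}
        large.extend(Lk)
        # step 3: generate candidates of size k+1 (join step only, no pruning)
        Ck = {a | b for a in Lk for b in Lk if len(a | b) == k + 1}
        k += 1
    return large

partitions = [[frozenset("abc"), frozenset("acd")],    # site 1
              [frozenset("abd"), frozenset("bcd")]]    # site 2
print(count_distribution(partitions, "abcd", minsup=0.5))   # all 1- and 2-itemsets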
2.2.4.2 A Fast Distributed algorithm
The Fast Distributed Mining (FDM) algorithm is one of the most significant algorithms in distributed data mining and was presented by Cheung et al. (1996). Other algorithms, such as ODAM and the algorithm proposed by this thesis, are designed based on it. Firstly, some of the techniques presented in this algorithm are introduced and defined, and then the algorithm is explained.
2.2.4.2.1 Candidate sets generation:
Observing some interesting properties of large itemsets in distributed environments can considerably lessen the number of messages exchanged over the network. For instance, according to Cheung et al. (1996), in a distributed DB there is a significant relationship between large itemsets and sites: ‘every globally large itemsets must be locally large at some site(s).’
To formalise this, suppose ‘itemset X is both globally and locally large at a site Si, then X is called gl-large at site Si’ (Cheung et al. 1996). Just as there are monotone relationships between frequent itemsets in centralised DBs, similar properties hold for both locally frequent itemsets and gl-large itemsets in a distributed DB. These properties are summarised as follows:
• If an itemset X is locally large at a site Si, then all of its subsets are also locally large at site Si.
• If an itemset X is gl-large at a site Si, then all of its subsets are also gl-large at site Si. (Cheung et al. 1996)
By use of the following lemma, an effective technique for generating the candidate itemsets can be developed in a distributed environment:
Lemma 1: if an itemset X is globally large, there is at least one site Si (1 ≤ i ≤ n) at which X and all its subsets are gl-large.
Proof: if X is not locally large at any site, then X.supi < s × Di for all i = 1, …, n. Therefore X.sup < s × D and X cannot be globally large. By contradiction, X must be locally large at some site Si, and hence X is gl-large at Si. Consequently, all the subsets of X must be gl-large at Si (Cheung et al. 1996).
GLi denotes the gl-large itemsets at Si and GLi(k) the gl-large k-itemsets at site Si. Lemma 1 shows that if X ∈ L(k) then there is at least one site Si at which all the (k-1)-subsets of X are gl-large and consequently all of them belong to GLi(k-1).
As in the Apriori algorithm, the candidate itemsets at the kth iteration are denoted CA(k). These candidate itemsets are obtained by applying the Apriori_gen function to L(k-1); therefore CA(k) = Apriori_gen(L(k-1)).
For each site Si, CGi(k) denotes the candidate itemsets obtained by applying the Apriori_gen function to GLi(k-1); therefore CGi(k) = Apriori_gen(GLi(k-1)). CG denotes the candidate itemsets produced from the gl-large itemsets. Since GLi(k-1) ⊆ L(k−1), CGi(k) is a subset of CA(k). In the following, CG(k) is used in place of ∪ⁿi=1 CGi(k) (Cheung et al. 1996).
Theorem 1: for every k > 1, the large itemset L(k) is a subset of CG(k) = ∪ⁿi=1 CGi(k), where CGi(k) = Apriori_gen(GLi(k-1)).
Proof: suppose X ∈ L(k). It follows from Lemma 1 that there exists a site Si (1 ≤ i ≤ n) such that all the size (k-1) subsets of X are gl-large at site Si. Hence X ∈ CGi(k). Therefore L(k) ⊆ CG(k) = ∪ⁿi=1 CGi(k) = ∪ⁿi=1 Apriori_gen(GLi(k-1)) (Cheung et al. 1996).
The theorem shows that CG(k) is a subset of CA(k) and therefore smaller than CA(k). This set is used as the candidate set for the large k-itemsets, and the theorem is the basis of candidate itemset generation in the FDM algorithm. The candidate itemsets CGi(k) can be produced locally at each site Si at the kth iteration. At the end of each iteration, the list of globally large itemsets is available at every site, and the candidate itemsets at Si for the (k+1)st iteration are produced based on GLi(k).
According to the experiments, this technique reduces the number of candidate itemsets by about 10-25%. The following example shows the effectiveness of Theorem 1 in reducing the number of candidate itemsets:
Example 1: suppose there are three sites in a distributed system and DB is the database of the system, which is split into three partitions DB1, DB2 and DB3, each site holding one of these partitions. Assume that the large itemset obtained from the first iteration is L(1) = {I, L, M, N, O, P, Q}, where I, L and M are locally large at the first site (S1), L, M and N are locally large at S2, and O, P and Q are locally large at S3. Thus GL1(1) = {I, L, M}, GL2(1) = {L, M, N} and GL3(1) = {O, P, Q}.
Based on Theorem 1, the candidate 2-itemsets at site S1 are CG1(2) = Apriori_gen(GL1(1)) = {IL, IM, LM}. Similarly, CG2(2) = Apriori_gen(GL2(1)) = {LM, LN, MN} and CG3(2) = Apriori_gen(GL3(1)) = {OP, OQ, PQ}. Hence, the candidate itemsets of length 2 are CG(2) = CG1(2) ∪ CG2(2) ∪ CG3(2) = {IL, IM, LM, LN, MN, OP, OQ, PQ}, which is 8 candidates. In contrast, applying the Apriori_gen function to L(1) generates 21 candidate itemsets. This example indicates the effectiveness of applying Theorem 1 in producing the candidate itemsets.
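A minimal Python sketch of this candidate generation, assuming each GLi(k-1) is given as a set of frozensets; the helper below implements a generic Apriori join-and-prune step and is illustrative, not the FDM authors' code.

from itertools import combinations

def apriori_gen(frequent):
    """Candidates of size k from a set of frequent (k-1)-itemsets (join + prune)."""
    candidates = set()
    for a, b in combinations(frequent, 2):
        union = a | b
        if len(union) == len(a) + 1:                              # join step
            if all(frozenset(sub) in frequent
                   for sub in combinations(union, len(union) - 1)):   # prune step
                candidates.add(frozenset(union))
    return candidates

GL1 = {frozenset(x) for x in ("I", "L", "M")}
GL2 = {frozenset(x) for x in ("L", "M", "N")}
GL3 = {frozenset(x) for x in ("O", "P", "Q")}
CG2 = apriori_gen(GL1) | apriori_gen(GL2) | apriori_gen(GL3)
print(len(CG2))   # 8 candidate 2-itemsets, as in the example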
2.2.4.2.2 Local pruning of candidate itemsets:
The previous section showed that, by using Theorem 1, the number of candidate itemsets can be reduced compared to applying the Apriori algorithm directly. This has a significant effect on the efficiency of a distributed algorithm. After generating the candidate itemsets, their support counts must be exchanged between all the sites in order to obtain the globally large itemsets. However, some of the candidate itemsets can be pruned locally before the support counts are exchanged. The idea is that if a candidate itemset is not locally large at a site Si, then that site does not need to compute its global support count, because either the itemset is not globally large, or it is locally large at some other site(s) and only those sites need to compute its global support count. Hence, to compute the large itemsets of length k at each site Si, the candidate set is restricted to the itemsets that are locally large at that site.
In the following, LLi(k) denotes the locally large candidates from CGi(k) at Si. In each iteration, the gl-large k-itemsets at each site Si are computed by the following procedure:
• Candidate sets generation: generate the candidate sets CGi(k) based on the gl-large itemsets found at site Si at the (k-1)st iteration, using the formula CGi(k) = Apriori_gen(GLi(k-1)).
• Local pruning: for each X ∈ CGi(k), the partition DBi is scanned to compute the local support count X.supi. If X is not locally large at site Si, it is excluded from the candidate set LLi(k).
• Support count exchange: broadcast the candidate sets in LLi(k) to the other sites to collect support counts. Compute their global support counts and find all the gl-large k-itemsets at site Si.
• Broadcast mining results: broadcast the computed gl-large k-itemsets to all the other sites (Cheung et al. 1996).
For more clarity, Example 1 is expanded as Example 2.
Example 2: suppose the DB in the above example contains 150 transactions, where each partition contains 50 transactions. Additionally, assume the minimum support threshold is s = 10%. As in Example 1, the candidate itemsets produced at site S1 are CG1(2) = {IL, IM, LM}, at S2 CG2(2) = {LM, LN, MN} and at S3 CG3(2) = {OP, OQ, PQ}. To obtain the large 2-itemsets, the local support counts at each site are computed first.
Table 6. Locally large itemsets
X     X.sup1      X     X.sup2      X     X.sup3
IL    5           LM    8           OP    4
IM    3           LN    10          OQ    8
LM    10          MN    4           PQ    4
As the table illustrates, IM.sup1 = 3 < s × D1 = 0.1 × 50 = 5. Hence IM is not locally large and is pruned at site S1, whereas IL and LM satisfy the minimum support count and are not pruned. Thus LL1(2) = {IL, LM}. Similarly, LL2(2) = {LM, LN} and LL3(2) = {OQ}. Consequently, the number of candidate itemsets is reduced to 5, which is much smaller than the original size. After the local pruning, the support counts of the remaining candidate itemsets are obtained by broadcasting these candidates to the other sites. The result is illustrated in the following table:
Table 7. Globally large itemsets
Locally large    Broadcast request    X.sup1   X.sup2   X.sup3
candidates       from
IL               S1                   5        4        4
LM               S1, S2               10       8        2
LN               S2                   4        10       4
OQ               S3                   4        4        8
At the end of the iteration, only LM is gl-large at S1, because LM.sup = 10 + 8 + 2 = 20 > s × D = 0.1 × 150 = 15, whereas IL.sup = 5 + 4 + 4 = 13 < s × D = 15. Therefore, the gl-large itemsets at sites S1, S2 and S3 are GL1(2) = {LM}, GL2(2) = {LM, LN} and GL3(2) = {OQ} respectively. Consequently, the large 2-itemsets are L(2) = {LM, LN, OQ}.
Some itemsets, such as LM, are locally large at more than one site. It is not necessary for all of those sites to poll for their global support; one site is sufficient. Cheung et al. (1996) introduce an optimisation technique to remove such redundancies. To support steps 2 and 3, each site needs to find the support counts of its own candidate itemsets and, in the support count exchange stage, each site also has to compute the support counts of candidate sets received from other sites. An elementary solution is for each local site to scan its DB twice, but this reduces the efficiency of the algorithm because each local DB is essentially scanned twice. A more feasible solution is that, since every site holds the globally large itemsets of the previous iteration, the support counts of all the candidate itemsets can be computed in a single scan of each local DB and stored in a hash tree data structure.
To decrease the number of messages per candidate itemset, the algorithm uses a technique called count polling. In this technique, an assignment function assigns each candidate itemset to a polling site, independently of the sites at which the itemset is locally large. The polling site for an itemset is responsible for determining whether that itemset is globally large. Consequently, this method reduces the number of exchanged messages. Cheung et al. (1996) illustrate the method with the following example.
Suppose that in the previous example S1 is the polling site for IL and LM, S2 for LN, and S3 for OQ. As polling site, S1 is responsible for polling the support counts of IL and LM. In the simple case of IL, S1 broadcasts the polling requests to the other sites. For LM, which is locally large at S1 and S2, S2 transfers the pair (LM, LM.sup2) = (LM, 8) in response; S1 then repeats this procedure for S3. Once S3 sends back the support count LM.sup3 = 2 to S1, S1 computes the support count of LM as LM.sup = 10 + 8 + 2 = 20 > 15. Consequently, S1 finds that LM is a globally large itemset. As this example indicates, the duplicate polling messages for LM have been removed.
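A common way to realise such an assignment function is a deterministic hash of the itemset over the participating sites; the Python sketch below is an illustrative assumption, not the function used by Cheung et al.

import hashlib

def polling_site(itemset, n_sites):
    """Deterministically map an itemset to one of the n sites,
    independently of the sites at which the itemset is locally large."""
    key = ",".join(sorted(itemset)).encode()             # canonical form of the itemset
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:4], "big") % n_sites   # site index 0 .. n_sites-1

print(polling_site({"L", "M"}, 3))   # always the same site for LM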
2.2.4.2.3 FDM algorithm
The basic version of the FDM algorithm (FDM-LP, FDM with Local Pruning) presented by Cheung et al. (1996) is as follows:
Input: DBi (i = 1, …, n): the part of the distributed DB at site Si.
Output: L: the frequent itemsets in the global distributed DB.
Method: each site Si iteratively executes the following code (for the kth iteration) in a distributed manner. The algorithm terminates when CG(k) = ∅ or L(k) = ∅.
(1) if k = 1 then
(2) Ti(1) = get_local_count(DBi, ∅, 1)
(3) else {
(4) CG(k) = ∪ⁿi=1 CGi(k) = ∪ⁿi=1 Apriori_gen(GLi(k-1))
(5) Ti(k) = get_local_count(DBi, CG(k), i)
}
(6) for all X ∈Ti(k) do
(7) if X. supi ≥ s×Di then
(8) for j=1 to n do
(9) if polling_site(X) = Sj then
insert(X,X. supi) into LLi,j(k) ;
(10) for j=1 to n do send LLi,j(k) to site Sj;
(11) for j=1 to n do{
(12) recive LLi,j(k)
(13) for all X∈LLi,j(k) do{
(14) if X ∉ LP i(k) then
Insert X into LP i(k) ;
(15) update X.large_sites;}}
(16) for all X∈LP i(k) do
(17) send_polling_request(X);
(18) reply_polling_request(Ti(k));
(19) for all X∈LP i(k) do {
(20) receive X.supj from the site Sj,
where Sj ∉ X.large_sites;
(21) X.sup = Σ_{i=1..n} X.supi;
(22) if X.sup ≥ s × D then
Insert X into Gi(k); }
(23) broadcast Gi(k);
(24) receive Gj(k) from all other sites Sj ,(j ≠ i) ;
(25) L(k) = ∪ⁿ i=1 Gi(k), (i = 1 ,…, n);
(26) divide L(k) into Gi(k), (i=1 ,…,n);
(27) return L(k)
During the execution of the FDM algorithm, each site Si plays different roles. In the beginning, a site acts as the ‘home site’ for the candidate sets it produces; subsequently it acts as a polling site to collect responses from other sites, and later as a remote site. The different stages of the FDM algorithm, with the different roles of each site, are as follows:
1. Home site: generate candidate itemsets and send them to the related polling sites (lines 1-10).
2. Polling site: receive candidate sets and send polling requests (lines 11-17).
3. Remote site: return support counts to the polling site (line 18).
4. Polling site: receive support counts and find the large itemsets (lines 19-23).
5. Home site: receive the large itemsets (lines 24-27) (Cheung et al. 1996).
2.2.4.3 ODAM algorithm
In contrast to other DARM algorithms, the Optimised Distributed Association Rules Mining (ODAM) algorithm ‘offers better performance by minimising candidate itemset generation costs’ (Ashrafi, Taniar & Smith 2004). This algorithm intends to reduce the communication and synchronisation costs.
A DARM algorithm performs better if the communication cost (the number of exchanged messages) is minimal. Synchronisation is another essential factor: a certain amount of each site's time is wasted waiting for the globally frequent itemsets to be generated. Ashrafi, Taniar & Smith (2004) divide the optimisation techniques for communication cost into two methods.
The first method is called direct support count exchange. In this method ‘all sites share a common globally frequent itemset with identical support count… this approach focuses on a rule’s exactness and correctness.’ CD and FDM are instances of this method.
The second method, indirect support count exchange, intends to reduce the communication cost by eliminating the exchange of global support counts. On the other hand, the correctness of a DARM algorithm relies on each itemset’s global support, and using partial support counts of itemsets to generate rules may cause discrepancies in the resulting rule set. The DDM algorithm applies this method.
To preserve the correctness and compactness of the association rules, the ODAM algorithm employs the first approach, and the total number of exchanged messages is reduced by applying several techniques. To compute the candidate support counts more efficiently, ODAM removes the infrequent items from the transactions during each pass over the DB and inserts the reduced transactions into main memory. Eliminating the infrequent items (items with less than 50% support) from every transaction increases the chance of observing identical transactions; in addition to reducing the average transaction size, this technique discovers more identical transactions. The following example illustrates this technique.
Original dataset         After loading into memory      After removing infrequent items
No    Items              No       Items                 No         Items
1     abcde              1        abcde                 1,4,8      abcd
2     abe                2        abe                   2,6,7      ab
3     cd                 3,10     cd                    3,9,10     cd
4     abcd               4        abcd                  5          abd
5     abd                5        abd
6     abef               6        abef
7     ab                 7        ab
8     abcdef             8        abcdef
9     cdf                9        cdf
10    cd
Figure 13. ODAM algorithm on 3 sites
The first dataset is the original one. As the middle dataset indicates, if the dataset is loaded into main memory directly, only one pair of identical transactions (cd) is found. However, if the dataset is loaded into main memory after elimination of the infrequent items from every transaction, more identical transactions are found. The support of each item is calculated as follows: s(a) = 0.7, s(b) = 0.8, s(c) = 0.6, s(d) = 0.7, s(e) = 0.4, s(f) = 0.3. Therefore e and f are recognised as infrequent items. As the last dataset shows, the size of the transactions is much smaller after removing the infrequent 1-itemsets.
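A rough Python sketch of this reduction step (drop the infrequent items, then merge identical transactions and keep a counter per distinct reduced transaction); the function name is illustrative.

from collections import Counter

def reduce_transactions(transactions, infrequent):
    """Remove infrequent items from each transaction, then merge identical
    reduced transactions, keeping a count for every distinct one."""
    reduced = Counter()
    for items in transactions:
        kept = frozenset(items) - infrequent
        if kept:
            reduced[kept] += 1
    return reduced

dataset = ["abcde", "abe", "cd", "abcd", "abd", "abef", "ab", "abcdef", "cdf", "cd"]
counts = reduce_transactions(dataset, infrequent={"e", "f"})
# -> abcd: 3, ab: 3, cd: 3, abd: 1, matching the right-hand dataset above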
Like the Apriori algorithm, ODAM first computes the support counts of the 1-itemsets at each site and broadcasts them to the other sites to discover which of them are globally large. Subsequently, each site produces the candidate 2-itemsets and calculates their support counts. During the second pass over the DB, while counting the supports of the candidate 2-itemsets, the globally infrequent 1-itemsets are eliminated from all the transactions and the new transactions (those without infrequent items) are inserted into main memory. While inserting the new transactions into memory, identical transactions are detected: the counter of an identical transaction is increased by one, and the counter of a non-identical transaction is set to one. Each site then scans main memory to count the supports of the candidate itemsets, and the globally frequent itemsets of that pass are obtained by broadcasting the obtained support counts from each site to the other sites.
In this method, the total number of transactions may exceed the capacity of main memory. To overcome this problem, a horizontal fragmentation technique is proposed: the dataset is fragmented into several horizontal partitions, the infrequent items are deleted from each partition and the transactions are inserted into main memory. As mentioned, the existence of each transaction is checked and the appropriate value is assigned to the transaction counters.
Finally, all the memory entries for a partition are transferred to a temporary file. This process is repeated for every partition. By combining the temporary files and merging the identical transactions, a new dataset is produced at every site, which is then used by the algorithm.
In contrast to the CD and FDM algorithms, ODAM is more efficient in terms of execution time and message exchange cost. Additionally, ODAM scales well as the number of participating sites in a distributed system increases. Fundamentally, the ODAM algorithm is designed for geographically distributed data. It computes the support counts of the candidate itemsets more quickly and reduces the average size of the transactions, the datasets and the exchanged messages.
2.2.4.4 DDM, PDDM and DDDM algorithms
The Distributed Decision Miner (DDM), Preemptive Distributed Decision Miner (PDDM) and Distributed Dual Decision Miner (DDDM) algorithms were proposed by Schuster and Wolf (2004) for discovering association rules in distributed environments. A brief description of each of them follows.
DDM performs well even with skewed data or when the sizes of the data partitions differ between sites; furthermore, it overcomes the scalability problem.
According to Schuster and Wolf (2001), ‘the basic idea of this algorithm is to verify that an item set is frequent before collecting its support counts from all parties’. Although this algorithm differs from FDM in many aspects, this is the most important difference between the two. In this algorithm, after a site discovers a locally large itemset, the support counts for that itemset are not immediately requested from all sites; rather, the sites negotiate (through message exchange) to decide which itemsets are globally large. Only then is the support count request broadcast to all the sites containing that itemset. Obviously this method reduces the number of exchanged messages, as no message is wasted on an itemset which is only locally large and not globally large.
The PDDM algorithm is designed to improve the communication complexity of DDM. Experience indicates that the participating sites are not all equally important: a site may contain more large itemsets than others. For instance, the DB of a superstore contains more significant data than the DB of a small grocery store. This algorithm allows the sites with the highest support counts to broadcast their supports at an earlier stage of the negotiation. Consequently, preventing DBs with less significant data from broadcasting their messages may result in better usage of the bandwidth.
DDDM intends to reduce the communication cost in a different way. The fundamental idea of this algorithm is that a DDM-type algorithm is employed to detect the large itemsets while also considering the confidence of the rules: if one site proposes that a rule is globally confident and there are no objections from the other sites, then the rule is considered globally significant.
The number of partitions and the number of candidate itemsets are two important factors in the complexity of all DARM algorithms. These algorithms can be used when the number of sites is very large, for example 10,000, or when the bandwidth is limited.
A parallel data mining operation with many separate computers can be executed using these algorithms: since they divide the data sets into small partitions that fit in the memory of each computer, the association rules can be produced more quickly.
2.2.5 Comparing the distributed algorithms
A brief comparison of some of the most important DARM algorithms is provided below.
Apart from the distributed sampling algorithm, all the DARM algorithms are extended versions of the Apriori algorithm. For example, the Count Distribution algorithm is obtained by parallelising the Apriori algorithm and has high computation and communication complexity. The FDM algorithm improves the efficiency of the Count Distribution algorithm by exploiting the fact that each globally frequent itemset must be a frequent itemset at at least one local site. The computation and communication complexity of the FDM algorithm is much lower than that of the Count Distribution algorithm when the number of sites is small or the data skewness between sites is high. Apart from these points, the two algorithms are the same.
Count Distribution (CD) and FDM are two well-known DARM algorithms and serve as a standard benchmark for new algorithms in terms of execution time and communication complexity.
The main difference between FDM and DDM is that FDM communicates with all sites to find a frequent itemset and its global support even though that itemset might not be globally frequent, whereas if DDM recognises that a frequent itemset cannot be globally frequent, it ignores the itemset and does not send any extra messages to calculate the exact value of its support.
The FDM algorithm is the basis for the design of the ODAM algorithm. In ODAM, the cost of reading from the disk is reduced by several techniques and the communication cost between sites is optimised. The DDM and distributed sampling algorithms perform quite well, especially from the communication aspect and in modern distributed systems such as peer-to-peer systems; both are scalable and have the ability to be extended.
The DTFIM algorithm is the distributed version of the FIM algorithm, which is a Trie-based algorithm. DTFIM uses some of the FDM techniques, but it is more efficient than FDM as it uses the Trie data structure.
3. Proposed algorithm by this research
3.1 Mining the non derivable itemsets in distributed environments
The experiments indicate that discovering a ‘concise representation and then creating the frequent itemsets from this representation, outperforms existing frequent set mining algorithms’ (Calders & Goethals 2002). Since the concept of derivability also holds in distributed environments, this section introduces the distributed deduction rules, which are an extension of the deduction rules presented by Calders & Goethals (2002) for centralised environments.
For applying NDI algorithm in a distributed environment there are two possibilities.
The first approach is to transfer all the data of different sites into a DB and then apply
the NDI algorithm. However, due to high transmission costs and security issues, this
solution seems not to be as efficient as the second approach which is applying the
distributed NDI algorithm. In this method there is no need for transferring data. The
derivable itemsets are mined locally whereas the non-derivable itemsets are mined in
the distributed way. The distributed deduction rules are defined as follows:
Assume that I = {i1, i2, …, im} is a set of items. Additionally, let DB be a distributed database which contains D transactions, and suppose that there are n distributed sites S1, S2, …, Sn, where {DB1, DB2, …, DBn} are the DBs of the sites. Furthermore, Suppi(J) denotes the support count of an itemset J at site Si. As mentioned, finding the non-derivable itemsets in these distributed DBs is the target. The extended derivation rules for obtaining the tight bounds on the support of an itemset I in a distributed DB are as follows:
• If |I\X| is odd then
Supp(I) = Supp1(I) + Supp2(I) + … + Suppn(I)
        ≤ ∑_{X⊆J⊂I} (-1)^(|I\J|+1) (Supp1(J) + Supp2(J) + … + Suppn(J))
        = ∑_{X⊆J⊂I} ((-1)^(|I\J|+1) ∑_{1≤i≤n} Suppi(J))
• If |I\X| is even then
Supp(I) = Supp1(I) + Supp2(I) + … + Suppn(I)
        ≥ ∑_{X⊆J⊂I} (-1)^(|I\J|+1) (Supp1(J) + Supp2(J) + … + Suppn(J))
        = ∑_{X⊆J⊂I} ((-1)^(|I\J|+1) ∑_{1≤i≤n} Suppi(J))
Therefore the deduction rules in the distributed environment are summarised as
follows:
• If |I\X| is odd then
Supp(I) ≤ ∑_{X⊆J⊂I} ((-1)^(|I\J|+1) ∑_{1≤i≤n} Suppi(J))
• If |I\X| is even then
Supp(I) ≥ ∑_{X⊆J⊂I} ((-1)^(|I\J|+1) ∑_{1≤i≤n} Suppi(J))
As the formulas illustrate, the non-derivable itemsets can be extracted without the need to transfer the data from the distributed DBs into a centralised DB. The deduction rules for an itemset are calculated from the support counts of its subsets, which are obtained from the other sites.
Since, in the proposed algorithm, the support counts of the subsets of a frequent itemset have already been summed over all sites, there is no need to calculate the inner sum in the above formulas separately.
As mentioned before, the deduction rules can generate a concise representation of the frequent itemsets, and the production of all the frequent itemsets from the non-derivable itemsets has been explained. A similar procedure can be used to generate all the frequent itemsets in a distributed environment.
In each iteration of any DARM algorithm there are a number of candidate itemsets. To determine whether the candidate itemsets are globally large (frequent), the distributed algorithm has to scan the local DBs and exchange the support counts of the candidate itemsets. However, for the derivable candidate itemsets there is no need to scan the local DBs or transfer the support counts, as each local site can independently recognise whether a derivable itemset is frequent or not; only the non-derivable candidate itemsets require scanning the local DBs and message exchange. Moreover, after some iterations there may be no non-derivable itemsets left, and each site can then continue its operations independently, without communicating with the other sites or even scanning its local DB. The following example indicates the effectiveness of using the deduction rules in improving the efficiency of a DARM algorithm.
Example 1: suppose a distributed DB with three sites S1, S2 and S3, and let D1, D2 and D3 be the local DBs. The target is to mine all the frequent itemsets, where the minimum support threshold (minsup) and the iteration depth are both set to 3.
D1:
Tid   Items
1     a,b,c
2     a,c,d,f

D2:
Tid   Items
1     a,b,d
2     c,d,e,f
3     b,c,d

D3:
Tid   Items
1     a,d
2     b,d
3     b,c,d
4     b,c,d
5     a,b,c,d

Figure 14. Implementation of the new algorithm on the sample distributed DBs
In this example, Bodon's Trie data structure is used for mining the frequent itemsets, together with the deduction rules, and the candidate itemset generation is based on the FDM algorithm. A Trie is built at each local site and the Tries are updated by the local sites at the end of each iteration. The 1-itemsets are kept in a simple vector and the 2-itemsets in a two-dimensional array; the frequent k-itemsets (k ≥ 3) are stored in a Trie data structure.
At the beginning, each site scans its local DB independently and the support counts of the 1-itemsets are determined as follows:
S1:                          S2:                          S3:
Local candidate   Support    Local candidate   Support    Local candidate   Support
itemset           count      itemset           count      itemset           count
{a}               2          {a}               1          {a}               2
{b}               1          {b}               2          {b}               4
{c}               2          {c}               2          {c}               3
{d}               1          {d}               3          {d}               5
{f}               1          {e}               1          {f}               1
Figure 15. Support counting at the distributed sites
The support count of each item is stored in a vector at each local site. Subsequently, the local support counts are exchanged and the globally large 1-itemsets are determined as:
Global candidate itemsets    Support counts
{a}                          5
{b}                          7
{c}                          7
{d}                          9
Figure 16. The global support counts
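As a minimal sketch of this exchange, the per-site counters can simply be added up (the names below are illustrative; the data are the local counts of the example):

from collections import Counter

def global_supports(local_supports):
    """Add up the per-site support counts to obtain the global counts."""
    total = Counter()
    for site_counts in local_supports:
        total.update(site_counts)
    return total

s1 = {"a": 2, "b": 1, "c": 2, "d": 1, "f": 1}
s2 = {"a": 1, "b": 2, "c": 2, "d": 3, "e": 1}
s3 = {"a": 2, "b": 4, "c": 3, "d": 5, "f": 1}
print(global_supports([s1, s2, s3]))   # a: 5, b: 7, c: 7, d: 9, f: 2, e: 1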
Following this, the support counts of the local candidates are updated and those whose support count is less than the support threshold are deleted; in this example the items e and f are eliminated.
From this stage onward, before sending its local candidate k-itemsets to the central site, each site first checks the derivability of its candidate itemsets using the distributed deduction rules. The deduction rules are calculated from the support counts of the previous iterations, which are kept by the central site. Consequently, the tight bounds for each candidate itemset are determined.
In this example the candidate 2-itemsets are calculated as { ab,ac,ad,bc,bd,cd }. For
instance, the deduction rules for ab itemset are computed as follows:
R0 : supp(ab) ≥ supp(a) + supp(b) − supp({}) = 5 + 7 − 10 = 2
Ra : supp(ab) ≤ supp(a) = 5
Rb : supp(ab) ≤ supp(b) = 7
Rab : supp(ab) ≥ 0
As the rules illustrate, the support of the itemset ab lies in the interval [2, 5]. The upper bound 5 comes from the rule which states that the support of ab cannot exceed the support of a. The lower bound is 2 because only 3 of the 10 transactions do not contain b; therefore, of the 5 transactions that contain a, at least 5 − 3 = 2 must also contain b. As a general rule, the minimum of the upper bounds and the maximum of the lower bounds are taken as the upper and lower bounds of an itemset. Consequently, ab is a non-derivable itemset. Similarly, in this iteration, the tight bounds for the rest of the itemsets are calculated and the derivable and non-derivable itemsets are determined, after which the derivable itemsets are deleted from the list of candidate 2-itemsets. In this example there are no derivable 2-itemsets at this iteration, so nothing is eliminated and the candidate 2-itemsets are sent to the central site. The following figure indicates the local and global support counts for the candidate 2-itemsets:
Items   Support at S1   Support at S2   Support at S3   Global Support
ab      1               1               1               3
ac      2               0               1               3
ad      1               1               2               4
bc      1               1               3               5
bd      0               2               4               6
cd      1               2               3               6

Figure 17. Candidate 2-itemsets support counting
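The bound computation illustrated above can be sketched in code as follows. This is a minimal Python illustration of evaluating the deduction rules, not the implementation proposed by this thesis; the function names and the dictionary-based support table are assumptions made only for this sketch:

from itertools import combinations

def subsets(itemset):
    # All proper subsets of an itemset (a frozenset), including the empty set.
    items = sorted(itemset)
    for size in range(len(items)):
        for combo in combinations(items, size):
            yield frozenset(combo)

def tight_bounds(itemset, supports, max_depth=None):
    # Tight [l, u] bounds on supp(itemset) obtained from the deduction rules.
    # 'supports' maps every proper subset (as a frozenset) of 'itemset' to its
    # known support; supports[frozenset()] is the total number of transactions.
    # If max_depth is given, only the rules R_X with |itemset \ X| <= max_depth
    # are evaluated.
    lower, upper = 0, supports[frozenset()]
    for X in subsets(itemset):
        depth = len(itemset) - len(X)
        if max_depth is not None and depth > max_depth:
            continue
        # delta_X(I) = sum over X <= J < I of (-1)^(|I\J|+1) * supp(J)
        delta = sum((-1) ** (len(itemset) - len(J) + 1) * supports[J]
                    for J in subsets(itemset) if X <= J)
        if depth % 2 == 1:   # |I \ X| odd  -> upper bound rule
            upper = min(upper, delta)
        else:                # |I \ X| even -> lower bound rule
            lower = max(lower, delta)
    return lower, upper

# Bounds on supp(ab) in the running example: supp({}) = 10, supp(a) = 5, supp(b) = 7.
supports = {frozenset(): 10, frozenset('a'): 5, frozenset('b'): 7}
print(tight_bounds(frozenset('ab'), supports))   # -> (2, 5)

When the returned lower and upper bounds coincide, the candidate is derivable and its exact support is known without any further DB scan or message exchange.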
At the third iteration, the candidate 3-itemsets are determined as {abc, abd, acd, bcd}.
Since all three sites keep the support counts of the candidate itemsets of the first and
second iterations, each site can evaluate the deduction rules independently. For
instance, the deduction rules for the candidate itemset abc are obtained as follows:

R0:   supp(abc) ≤ supp(ab)+supp(ac)+supp(bc)-supp(a)-supp(b)-supp(c)+supp({}) = 3+3+5-5-7-7+10 = 2
Ra:   supp(abc) ≥ supp(ab)+supp(ac)-supp(a) = 3+3-5 = 1
Rb:   supp(abc) ≥ supp(ab)+supp(bc)-supp(b) = 3+5-7 = 1
Rc:   supp(abc) ≥ supp(ac)+supp(bc)-supp(c) = 3+5-7 = 1
Rab:  supp(abc) ≤ supp(ab) = 3
Rac:  supp(abc) ≤ supp(ac) = 3
Rbc:  supp(abc) ≤ supp(bc) = 5
Rabc: supp(abc) ≥ 0
Thus the support of the abc itemset must lie in the interval [1,3]. Similarly, the support
of the abd itemset lies in the interval [2,5].
Additionally, the bounds on the support of the acd itemset are calculated as follows:

R0:   supp(acd) ≤ supp(ac)+supp(ad)+supp(cd)-supp(a)-supp(c)-supp(d)+supp({}) = 3+4+6-5-7-9+10 = 2
Ra:   supp(acd) ≥ supp(ad)+supp(ac)-supp(a) = 4+3-5 = 2
Rc:   supp(acd) ≥ supp(ac)+supp(cd)-supp(c) = 3+6-7 = 2
Rd:   supp(acd) ≥ supp(ad)+supp(cd)-supp(d) = 4+6-9 = 1
Rac:  supp(acd) ≤ supp(ac) = 3
Rad:  supp(acd) ≤ supp(ad) = 4
Rcd:  supp(acd) ≤ supp(cd) = 6
Racd: supp(acd) ≥ 0
As the deduction rules show, the minimum of the upper bounds and the maximum of
the lower bounds are both 2, so the support of this itemset lies in the interval [2,2].
Therefore acd is a derivable candidate itemset and is removed from the list of candidate
3-itemsets. Likewise, the support of the bcd itemset lies in the interval [4,4] and it is
also deleted. Consequently, for derivable itemsets such as {bcd} and {acd} there is no
need to scan the local DBs or to exchange their support counts in order to obtain the
global support count. After applying the deduction rules, the candidate 3-itemsets are
therefore reduced to {abc, abd}. Since the global support count of both of these
candidates is 2, there are no globally large 3-itemsets, and the result consists of the
globally large 1- and 2-itemsets. The following figure shows the final Trie:
[Figure: the final Trie. Its first level holds the globally large 1-itemsets a, b, c and d with their global support counts, and its second level holds the globally large 2-itemsets ab, ac, ad, bc, bd and cd with theirs.]
Figure 18. Final Trie
If the frequent itemsets in the above example were found by the FDM algorithm alone
(without the deduction rules), the extra cost of exchanging information and of scanning
the local DBs for the derivable itemsets in the third and fourth iterations would be
incurred. Moreover, since this method uses the Trie data structure, it is memory
efficient.
Most distributed DBs contain a number of derivable itemsets, and how many there are
depends on the nature of the data. Since there is no need to calculate the global support
count of derivable itemsets, the costs of scanning the local DBs and exchanging
information for them are eliminated. This also reduces network traffic, so the result can
be obtained in a shorter time. However, the efficiency of this method depends heavily
on the nature of the distributed DBs and the number of derivable itemsets.
3.2 Proposed algorithm
As mentioned earlier, the algorithm proposed by this thesis employs the distributed
NDI algorithm and the DTFIM algorithm, and also uses Lemma 1 of Cheung et al.
(1996). The algorithm finds all the frequent itemsets, both derivable and non-derivable.
In each iteration it generates candidate itemsets, and each site computes their support
counts and determines whether they are frequent. After some iterations all the
remaining candidate itemsets are derivable, so each site can proceed independently.
This process continues until no further candidate itemsets are produced. In the above
example all the deduction rules are evaluated, but empirically, in most cases evaluating
the deduction rules up to a limited depth is sufficient. Calders & Goethals (2002) state
that ‘in practice most pruning is done by the rules of limited depth.’ The reason is that
evaluating all the deduction rules for itemsets with a large number of items is very time
consuming. The chosen depth depends on the size of the distributed DBs: in DBs with a
large number of transactions, the deduction rules should only be evaluated up to a low
depth, for instance 3.
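Building on the tight_bounds sketch given after Figure 17, such a depth limit simply restricts which rules R_X are evaluated. A hypothetical call for the candidate abc of the running example (subset supports taken from Figures 16 and 17) would look like this:

# Depth-limited evaluation of the deduction rules for the candidate {a, b, c};
# only the rules R_X with |I \ X| <= 2 are evaluated.
supports = {
    frozenset(): 10,
    frozenset('a'): 5, frozenset('b'): 7, frozenset('c'): 7,
    frozenset('ab'): 3, frozenset('ac'): 3, frozenset('bc'): 5,
}
lower, upper = tight_bounds(frozenset('abc'), supports, max_depth=2)

Evaluating deeper rules can only tighten the interval, so a shallow depth trades slightly wider bounds for far less computation per candidate.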
3.3 Step by step explanation of the new algorithm

Production of the global frequent 1-itemsets (GL1):
i. Developing the local 1-itemset vectors: the local DBs are scanned by their local sites
independently, and the names of the 1-itemsets (the item names) together with their
local support counts are stored in a local vector.
ii. Global 1-itemsets: the support counts are exchanged among the sites to determine
the globally large 1-itemsets (GL1). (A minimal code sketch of steps i and ii is given
after this list.)
iii. Initialising the local Tries: each local site initialises its local Trie based on GL1, so
all the sites hold the same Trie at the end of the pass.

Production of the global frequent 2-itemsets (GL2):
iv. Local candidate 2-itemset generation: based on its Trie, each site specifies the local
candidate 2-itemsets and stores them in a two-dimensional array.
v. Applying the deduction rules: each site applies the deduction rules to its local
candidate 2-itemsets. (The deduction rules are calculated from the support counts of the
prior iteration, which are already stored in the local Trie of each site.) Subsequently,
the derivable itemsets are removed from the list of non-derivable local candidate
2-itemsets.
vi. Global large 2-itemsets: the support counts of the non-derivable locally frequent
2-itemsets are exchanged among the sites and the global support count of each
2-itemset is computed.
vii. Updating the local Tries: the local sites update their Tries by inserting the globally
large 2-itemsets.

Production of the global large k-itemsets (k≥3):
viii. Building local Tries: the candidate k-itemsets (CGk) are built by each site. For this
purpose, at each local site, while NDLk-1 is not empty, a candidate tree (Trie[i]) for the
local candidate large k-itemset collection (CGk) is built based on NDLk-1.
ix. Developing local vectors and applying the deduction rules: each site traverses its
Trie (using a depth-first traversal) down to the leaves and stores their support counts in
a vector. Since the vectors of the different sites have the same order, only the new
support counts from the local Tries are transferred. As in the prior iterations, before the
support counts are exchanged the deduction rules are applied and the derivable
k-itemsets are eliminated from the non-derivable candidate k-itemsets.
x. Final pruning: after the exchange, each site traverses its local Trie once more and
updates the local support count of each leaf node from the global support count vector.
Those k-itemsets whose updated support counts are smaller than the support threshold
are deleted from the list of local large k-itemsets.
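A minimal Python sketch of steps i and ii, using the local DBs of the running example (Figure 14), is given below. The function names are illustrative only, and the exchange of the count vectors is simulated in-process rather than over the network:

from collections import Counter

def local_item_counts(transactions):
    # Step i: scan the local DB once and count the support of every 1-itemset.
    counts = Counter()
    for transaction in transactions:
        counts.update(transaction)
    return counts

def global_large_1_itemsets(local_counts, minsup):
    # Step ii: combine the exchanged local count vectors and keep the globally large items.
    total = Counter()
    for counts in local_counts:
        total.update(counts)
    return {item: supp for item, supp in total.items() if supp >= minsup}

# The three local DBs of the running example (Figure 14).
D1 = [{'a', 'b', 'c'}, {'a', 'c', 'd', 'f'}]
D2 = [{'a', 'b', 'd'}, {'c', 'd', 'e', 'f'}, {'b', 'c', 'd'}]
D3 = [{'a', 'd'}, {'b', 'd'}, {'b', 'c', 'd'}, {'b', 'c', 'd'}, {'a', 'b', 'c', 'd'}]

local = [local_item_counts(db) for db in (D1, D2, D3)]
GL1 = global_large_1_itemsets(local, minsup=3)   # {'a': 5, 'b': 7, 'c': 7, 'd': 9}

In the actual algorithm each site would send only its local count vector to the other sites, exactly as described in step ii.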
The new algorithm consists of several sub-algorithms, two of which are presented
below. For clarity, the notations used in the algorithm are listed in the table below.
Table 8. Notations used in the new algorithm
TRi(k)   The Trie data structure at the i-th site, which contains the non-derivable large k-itemsets
NDL(k)   An array which consists of the globally non-derivable large k-itemsets
DL(k)    An array which consists of the globally derivable large k-itemsets
CG(k)    An array which holds the non-derivable candidate k-itemsets
GLi(k)   An array which keeps the globally large k-itemsets
L(k)     An array of the large k-itemsets
Inputs:
DBi (i=1,…,n): the databases at each site Si.
iterationDepth: number of iterations
minSup: the support threshold
Output: The set of all globally large itemsets L.
Method: Execution of the following program fragment (for the k-th iteration) at the
participating sites.
(1)  k := 1;
(2)  while k ≤ iterationDepth do
(3)  {
(4)      if k = 1 then
(5)          TRi(1) := findLocalCandidate(DBi, 0, 1);
(6)      else
(7)      {
(8)          candidateGen(TRi(k-1), NDL(k-1), CG(k), DL(k), DL(k-1));
(9)          if DL(k-1) ≠ 0 then
(10)             dFrequent(DL(k-1), NDL(k-1), DL(k));
(11)         TRi(k) := findLocalCandidate(DBi, CG(k), k);
(12)     }
(13)     if CG(k) ≠ 0 then                        // if CG(k) is not empty
(14)         TRi(k-1) := findNDFrequent(DBi, CG(k), k);
(15)     passLocalCandidate(TRi(k));
(16)     GLi(k) := getGlobalFrequent();           // globally large k-itemsets
(17)     updateLocalCandidates(TRi(k), GLi(k));   // prunes the local candidates which are not globally large
(18)     NDL(k) := ∪ⁿi=1 GLi(k);
(19)     k := k + 1;
(20) }
(21) L(k) := NDL(k) ∪ DL(k);
(22) return L(k);
The candidateGen procedure generates the non-derivable candidate sets and the
derivable frequent itemsets.
procedure candidateGen(TRi(k-1), NDL(k-1), CG(k), DL(k), DL(k-1))
(1)  for all Z ∈ TRi(k-1) do
(2)  {
(3)      compute the [l,u] bounds of Z;
(4)      if Z.sup = Z.l or Z.sup = Z.u then
(5)      {
(6)          prune Z from NDL(k-1) and TRi(k-1) and insert it into DL(k-1);
(7)          if Z.sup = Z.l then
(8)              Z.sup = Z.l;
(9)          else
(10)             Z.sup = Z.u;
(11)     }
(12) pCG(k) = ∪ⁿi=1 CGi(k) = ∪ⁿi=1 aprioriGen(NDLi(k-1));   // FDM candidate itemset generator
(13) for all Y ∈ pCG(k) do
(14) {
(15)     compute the [l,u] bounds on the support of Y;
(16)     if l ≠ u then
(17)     {
(18)         Y.l = l;
(19)         Y.u = u;
(20)         insert Y into CG(k);
(21)     }
(22)     else
(23)     {
(24)         if u ≥ minSup then
(25)         {
(26)             insert Y into DL(k), delete it from NDLi(k-1) and TRi(k-1);
(27)             Y.sup = u;
(28)         }
(29)     }
(30) }
(31) }
end procedure
procedure dFrequent(DL(k-1), NDL(k-1), DL(k))
(1) DCG(k) := aprioriGen2(DL(k-1), NDL(k-1));   // FDM apriori candidate generator
(2) for all Z ∈ DCG(k) do
(3) {
(4)     compute Z.sup;                          // compute the support of Z
(5)     if Z.sup ≥ minSup then
(6)         insert Z into DL(k), delete it from NDLi(k-1) and TRi(k-1);
(7) }
end procedure
At the first iteration of the main algorithm, the local support counts of the candidate
1-itemsets are calculated by the findLocalCandidate procedure at line 5. Subsequently,
the local Tries (which are simple vectors at this stage) are passed from the local sites to
the central site by the passLocalCandidate procedure. The central site acts as the root
for the local Tries: it receives the local candidates, determines the globally frequent
k-itemsets and sends them back to the local sites (line 16). Each site then updates its
Trie based on the globally frequent k-itemsets (line 17).
From the second iteration onwards, the candidateGen procedure determines the
non-derivable candidate itemsets and the derivable large k-itemsets. As mentioned,
since the derivable k-itemsets are the same at all the sites, there is no need to scan the
DBs or exchange their support counts. The dFrequent procedure retrieves the derivable
frequent itemsets that can be obtained from the derivable and non-derivable itemsets of
the prior iteration and updates the set of derivable frequent k-itemsets. For this purpose,
the dFrequent procedure uses Theorem 1 of the FDM algorithm to produce the
derivable itemsets, whereas the candidateGen procedure relies on the deduction rules,
evaluated up to a predefined depth, to generate the derivable k-itemsets.
In the main algorithm, the non-derivable and derivable frequent itemsets together form
the set of frequent itemsets (line 21).
It should be mentioned that, in each iteration of this algorithm, the set of derivable
frequent itemsets is the same at all the sites. In the early iterations there are only a few
derivable itemsets, but as the execution of the algorithm continues their number
increases. This process may continue until no non-derivable itemsets are found. As
discussed before, a large number of derivable itemsets increases the efficiency of this
algorithm.
The first loop of the candidateGen procedure (lines 1 to 11) identifies the non-derivable
itemsets from the prior iteration whose support counts are equal to their upper or lower
bounds. As mentioned before, the supersets of those itemsets are derivable, so the
itemsets are removed from the set of non-derivable itemsets and from the local Tries.
Line 12 produces the candidate itemsets using the FDM candidate generator. In lines
13-29 the bounds on the support of each candidate itemset are calculated; the
non-derivable candidates are then inserted into the list of non-derivable candidate
itemsets (CG(k)) and the frequent derivable candidates are added to the list of derivable
itemsets (DL(k)).
The dFrequent procedure, listed above, uses the rule R_X∪{i}(I∪{i}) to simplify the
computation of the support counts. The first step of this procedure is the production of
the derivable candidate itemsets (DCG(k)). The aprioriGen2 function joins the frequent
derivable itemsets with the non-derivable itemsets remaining from the prior iteration;
in this way all the possible extensions of the frequent derivable itemsets are obtained,
and their support counts are then calculated.
4. Conclusion
With the rapid growth in the use of distributed DBs, alongside large centralised DBs, in
organisations and commercial centres, data mining in these environments has attracted
a great deal of attention.
All kinds of data mining techniques, such as clustering and classification, are applicable
in distributed data mining. The focus of this thesis is on association rules mining, one of
the most important data mining techniques, which has many uses in commercial and
non-commercial fields.
Although a great deal of research has been conducted on association rules mining in
centralised environments, association rules mining in distributed environments is a
newer topic and relatively few methods exist. In distributed data mining it is not
recommended to transfer the raw data into a centralised DB, due to security concerns,
network traffic and the ownership of the participating sites.
Improving efficiency, reducing communication volume, preserving security and
respecting local site ownership are the most important issues in DARM and, more
generally, in distributed data mining.
The aim of this research is to develop an efficient method for association rules mining
in distributed environments that outperforms the previous algorithms. In this research,
data mining operations and their different methods were studied first. Following this,
in order to present a distributed algorithm for association rules mining, association
rules and their methods and algorithms in centralised environments were investigated.
Association rules mining consists of two major steps: the first is finding the frequent
itemsets in the DB, and the second is generating the association rules from them. Since
finding the frequent itemsets is the main and most time-consuming step, while
generating the rules is simple and straightforward, the discussion of association rules
mining reduces to finding the frequent itemsets. The key issue in association rules
mining is efficiency, and the two most important factors in the efficiency of DARM are
reducing the amount of scanning and reducing the computation required by the mining
operations.
In this research, after studying the methods of association rules mining in centralised
environments, distributed data mining and its important issues were examined. To
develop a new method, the existing methods and algorithms in this area were studied
carefully.
One of the most important DARM algorithms is DTFIM. The proposed algorithm is
based on the DTFIM algorithm and uses several existing techniques, such as pruning
local candidate itemsets, producing candidate sets, gathering support counts from the
sites and reducing the number of candidate itemsets.
The proposed algorithm addresses the market basket analysis problem in distributed
environments by reducing the number of candidate itemsets. Moreover, it simplifies the
whole process of discovering interesting customer preferences and patterns.
5. Future works
Implementing and testing the algorithm on a real DB, in order to demonstrate its
efficiency, is one direction for future work. Furthermore, several other opportunities for
future work can be identified from the outcomes of this research. Grid computing is one
of the most rapidly progressing research fields; recently, some research has been
performed on distributed data mining using grid services, but this work is still at an
early stage. Association rules mining can also support other data mining operations
such as classification and clustering. Some research has been done on finding
classification rules through association rules mining, and on clustering based on
association rules; in some cases, using association rules for clustering and classification
is very effective and more efficient than the usual methods for these operations.
6. References
Agrawal, R, Imielinski, T & Swami, A 1993, ‘Mining association rules between sets of
items in large databases’, In Proc. of the ACM SIGMOD Conference on Management of
Data, Washington, D.C., pp. 207-216.
Agrawal, R & Srikant, R 1994, ‘Fast Algorithms for mining Association Rules’,
Proceedings of the 20th VLDB Conference, Santiago de Chile, pp. 487-499.
Agrawal, R & Shafer, J 1996, ‘Parallel Mining of Association Rules: Design,
Implementation and Experience’, IEEE Transaction on Knowledge and Data
Engineering, vol. 8, no. 6, pp.962-969.
Ansari, E, Dastghaibifard, GH, Keshtkaran, M & Kaabi, H 2008, ‘Distributed Frequent
Itemset Mining using Trie Data Structure’, IAENG International Journal of Computer
Science, vol. 35, no. 3, pp. 377-381.
Ashrafi, MZ, Taniar, D & Smith, K 2004, ‘ODAM: an Optimized Distributed
Association Rule Mining Algorithm’, IEEE Distributed Systems Online, vol. 5, no. 3.
Berry, MJA & Linoff, GS 2003, Data Mining Techniques for Marketing, Sales, and
Customer Relationship Management, Wiley Publishing Inc., Canada.
Bodon, F 2003, ‘A Fast Apriori Implementation’, In Proceedings of the IEEE ICDM
Workshop on Frequent Itemset Mining Implementations.
Bodon, F 2004, ‘Surprising Results of Trie-Based FIM Algorithms’, In Proceedings of
the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI’04),
vol. 126, Brighton, UK, 2004.
Borgelt, C 2003, ‘Efficient Implementations of Apriori and Eclat’, In proceedings of
the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI’03),
vol. 90, Melbourne, Florida, USA.
Braga, D, Campi, A, Ceri, S, Klemettinen, M & Lanzi, PL 2002, ‘A Tool for Extracting
XML Association Rules from XML Documents’, In Proceedings of IEEE-ICTAI 2002,
Washington DC, USA, pp. 57-64.
Briandais, RDL 1959, ‘File searching using variable-length keys’, In Western Joint
Computer Conference, pp. 295-298.
Brin, S, Motwani, R, Ullman, JD & Tsur, S 1997, ‘Dynamic Itemset Counting and
Implication Rules for Market Basket Data’, In Proceedings of the 1997 ACM SIGMOD
International Conference on Management of Data, vol. 26(2), pp. 255–264.
Calders, T 2004, ‘Deducing Bounds on the Support of Itemsets’, In Database
Technologies for Data Mining- Discovering Knowledge with Inductive Queries, vol.
2682 , pp. 214-233.
Calders, T& Goethals, B 2002, ‘Mining all Non Derivable Frequent Itemsets’, In Proc.
Principles and Practice of Knowledge Discovery in Databases PKDD’02, vol. 243, pp.
74-85.
Calders, T& Goethals, B 2007, ‘Non Derivable Itemsets mining’, data mining
Knowledge Discovery in Databases, vol. 14, pp. 171-206.
Calvanese, D, Giacomo, GD, Lenzerini, M, Nardi, D & Rosati, R 1998, ‘Source
Integration in Data Warehousing’, DEXA, Springer, pp. 192-197.
Cheung, DW, Han, J, Ng, VT, Fu, AW & Fu, Y 1996, ‘A Fast Distributed Algorithm
for Mining Association Rules’, In Proc. Parallel and Distributed Information Systems,
IEEE CS Press, pp. 31-42.
Cheung, DW, Lee, SD, Xiao, Y 2002, ‘Effect of Data Skewness and Workload Balance
in Parallel Data Mining’, IEEE Transactions on knowledge and data engineering, vol.
14, no. 3, pp.489-514.
Coenen, F, Leng, P & Ahmed, Sh 2003, ‘T-Trees, Vertical Partitioning and Distributed
Association Rule Mining’, Proceedings of the Third IEEE International Conference on
data mining, pp. 513-516.
Fang, YW, Zhao, XB, Zhang, GP, Wang, Y, Sun, Y & Zhang, YF 2005, ‘Study on
Algorithms of Parallel and Distributed Data Mining Calculation Process’, Proceedings
of the Fourth International Conference on Machine Learning and Cybernetics,
Guangzhou, vol. 4, pp. 2084-2089.
Fayyad, UM, Piatetsky-Shapiro, G, Smyth, P & Uthurusamy, R 1996, Advances in
Knowledge Discovery and Data Mining, Menlo Park : AAAI Press, United States of
America.
Feng, L, Dillon, TS, Weigand, H & Chang, E 2003, ‘An XML-Enabled Association
Rule Framework’, In Proceedings of DEXA’03, Prague, Czech Republic, pp. 88-97.
Filho, AH, Prado, HA, Toscani, SS 2000, ‘Evolving a Legacy Data Warehouse System
to an Object-oriented Architecture’, Computer Science Society, 2000. SCCC '00.
Proceedings. XX International Conference of the Chilean, pp. 32-40.
Frawley, WJ, Piatetsky-Shapiro, G & Matheus, CJ 1991, ‘Knowledge Discovery in
Databases’ , AI magazine, vol. 13, no. 3, pp. 57-70.
Fukasawa, T, Wang, J, Takata, T & Miyazaki M 2004, ‘An Effective Distributed
Privacy-Preserving Data Mining Algorithm’, Faculty of Software and Information
Science, Iwate Prefectural University 152-52 Sugo, Takizawa, Iwate 020-0193, Japan,
Digitally Advanced Integrated Solutions Labs, Ltd., Japan, pp. 320-325.
Guo, Y & Grossman, R 1999, ‘Scalable Parallel and Distributed Data mining’, Data
Mining and Knowledge Discovery.
Han, J, Kamber, M 2006, Data mining: concepts and techniques, Diane Cerra, United
States of America.
Han, J, Pei, J & Yin, Y 2000, ‘Mining frequent patterns without candidate generation’,
In Proc. 2000 ACM SIGMOD Int. Conf. Management of Data (SIGMOD’00), pp. 1-12.
Han, J, Pei, J, Yin, Y & Mao, R 2001, ‘Mining frequent patterns without candidate
generation: A frequent-pattern tree approach’, Data Mining and Knowledge Discovery,
pp. 53-87.
Kantardzic, M 2003, Data Mining Concepts, Models, Methods, and Algorithms, A John
Wiley & Sons, INC, United State of America.
Kantarcioglu, M & Clifton, C 2004, ‘Privacy-preserving Distributed Mining of
Association Rules on Horizontally Partitioned Data’, IEEE Transactions on Knowledge
and Data Engineering, pp. 2-13.
Kargupta H & Chan, P 2000, Advances in Distributed and Parallel Knowledge
Discovery, AAAI Press.
Le, DX, Rahayu, JW & Taniar, D 2006, ‘Web Data Warehousing Convergence: from
Schematic to Systematic’, International Journal of Information Technology and Web
Engineering, vol. 1, no. 4, pp. 68-92.
Li, Z, Sun, J, Yu, H & Zhang, J 2005, ‘CommonCube-based Conceptual Modeling of
ETL Processes’, 2005 International Conference on Control and Automation
(ICCA2005), vol. 1, pp. 131-136.
Liu, G, Li, J & Wong, L 2007, ‘A new concise representation of frequent itemsets using
generators and a positive border’, School of computing, National University of
Singapore, No.17, pp.35-56.
Park, JS, Chen, M-S & Yu, PS 1995, ‘An effective Hash-Based Algorithm for Mining
Association Rules’. In proceedings of the 1995 ACM SIGMOD International
Conference on Management of Data, vol. 24(2), pp.175-186.
Pasquier, N, Bastide, Y, Taouil, R & Lakhal, L 1999, ‘Discovering Frequent Closed
Itemsets for Association Rules’, In Proc. ICDT Int. Conf. Database Theory, pp. 398-416.
Rahm, E & Do, HH (n.d.), ‘Data Cleaning: Problems and Current Approaches’.
Savasere, A, Omiecinski, E & Navathe, S 1995, ‘An Efficient Algorithm for Mining
Association Rules in Large Databases’, Technical Report No. GIT-CC-95-04.
Schuster, A & Wolf, R 2004, ‘Communication-Efficient Distributed Mining of
Association Rules’, Computer Science Department, Technion, Israel Institute of
Technology, Technion City, Haifa 3200, Israel , pp. 171-196.
Schuster, A, Wolf, R & Trock, D 2005, ‘A High-Performance Distributed Algorithm
for Mining Association Rules’, Knowledge And Information Systems (KAIS) Journal,
vol. 7, no. 4, pp. 458-475.
Shintani, T & Kitsuregawa, M 1996, ‘Hash Based Parallel Algorithms for Mining
Association Rules’, Proceedings of the International Conference on Parallel and
Distributed Information Systems, pp. 19-30.
Sujni, P & Saravanan, V 2008, ‘Hash Partitioned apriori in Parallel and Distributed
Data Mining Environment with Dynamic Data Allocation Approach’ , Computer
Science and Information Technology, 2008. ICCSIT '08. International Conference, pp.
481-485.
Tan, PN, Steinbach, M, Kumar, V 2006, Introduction to Data Mining, Pearson
Education, Inc., United State of America.
Toivonen, H 1996, ‘Sampling Large Databases for Association Rules’, Proceedings of
the 22nd VLDB Conference Mumbai (Bombay), India, pp. 134–145.
Wan, JWW, Dobbie, J 2004, ‘Mining Association Rules from XML Data using
XQuery’, In Proceedings of the second workshop on Australasian information security,
Data Mining and Web Intelligence, and Software Internationalization – vol. 32, pp.
169-174.
Wang, B 2009, ‘A Research on Extraction Method of Distributed Heterogeneous
Dataset in Multi-Support Association Rule Mining’, 2009 ISECS International
Colloquium on Computing, Communication, Control and Management, pp. 17-20.
Zaki, M 1999, ‘Parallel and Distributed Association Mining: A survey’, IEEE
Concurrency, vol. 7, no. 4, pp. 14-25.
Zhang, S, Zhang, J, Liu, H & Wang, W 2005, ‘XAR-Miner: Efficient Association Rules
Mining for XML Data’, In Proceedings of the 14th International Conference on World
Wide Web, Chiba, Japan, pp. 894-895.