University of South Australia
M.S. Thesis in Computer Science (Minor Thesis)

ASSOCIATION RULES MINING IN DISTRIBUTED ENVIRONMENT

By: Shamila Mafazi
Supervisor: Abrar Haider
June 2010

Table of Contents
1. Introduction
   1.1. Motivation
   1.2. Research Question
   1.3. Purpose
   1.4. Methodology
   1.5. Thesis Plan
   1.6. Contribution of the Thesis
2. Literature Review
   2.1. Data mining and association rules mining in centralised environments
      2.1.1. Data mining
      2.1.2. Data pre-processing
      2.1.3. Data cleaning as a problem of distributed DBs
      2.1.4. Association rules mining
      2.1.5. Definition of association rules mining difficulties
      2.1.6. Apriori algorithm
      2.1.7. Subset function
      2.1.8. Applied optimisations on the Apriori algorithm
      2.1.9. AprioriTid and AprioriHybrid algorithms
      2.1.10. Sampling
      2.1.11. Partitioning
      2.1.12. Direct Hashing and Pruning (DHP) algorithm
      2.1.13. Dynamic Itemset Counting (DIC) algorithm
      2.1.14. Frequent Pattern (FP) Growth method
      2.1.15. Association rules mining in XML documents
      2.1.16. Trie data structure
      2.1.17. Non-derivable itemsets
         2.1.17.1. Deduction rules
         2.1.17.2. Non-Derivable Itemsets (NDI) algorithm
         2.1.17.3. Producing the frequent itemsets
   2.2. Distributed association rules mining
      2.2.1. Distributed data mining
      2.2.2. Necessity of studying distributed data mining
      2.2.3. Important instances and issues in distributed data mining
      2.2.4. Distributed algorithms for association rules mining
         2.2.4.1. Count Distribution (CD) algorithm
         2.2.4.2. A Fast Distributed algorithm
            2.2.4.2.1. Candidate set generation
            2.2.4.2.2. Local pruning of candidate itemsets
            2.2.4.2.3. FDM algorithm
         2.2.4.3. ODAM algorithm
         2.2.4.4. DDM, PDDM and DDDM algorithms
      2.2.5. Comparing the distributed algorithms
3. Proposed algorithm by this thesis
   3.1. Mining the non-derivable itemsets in distributed environments
   3.2. Proposed algorithm
   3.3. Step by step explanation of the new algorithm
4. Conclusion
5. Future works
6. References

List of Figures
Figure 1. ETL processes
Figure 2. Producing candidate itemsets by the Apriori algorithm
Figure 3. Hash tree
Figure 4. An example of a Trie
Figure 5. An example of transactions of a DB
Figure 6. Tight bounds on the support of (abcd)
Figure 7. Size of concise representation
Figure 8. Distributed memory architecture for distributed data mining
Figure 9. Shared memory architecture for distributed data mining
Figure 10. Horizontal DB layout
Figure 11. Vertical DB layout
Figure 12. Second replication from the Count Distribution algorithm
Figure 13. ODAM algorithm on 3 sites
Figure 14. Implementation of the new algorithm on the sample distributed DBs
Figure 15. Support counting at distributed sites
Figure 16. Global support counts
Figure 17. Candidate 2-itemset support counting
Figure 18. Final Trie

List of Tables
Table 1. User DB1
Table 2. Client DB2
Table 3. Users (integrated DB with cleaned data)
Table 4. An example of a DB
Table 5. Notations used in the Apriori algorithm (Agrawal & Srikant 1994)
Table 6. Locally large itemsets
Table 7. Globally large itemsets
Table 8. Notations used in the new algorithm

Abstract
The tremendous growth of information technology within companies, businesses and governments has created immense databases (DBs). This trend creates a pressing need for novel tools and techniques for intelligent DB analysis.
As John Naisbitt put it, 'We are drowning in information but starving for knowledge!' These tools and techniques are the topic of the field called "data mining" or "Knowledge Discovery in Databases" (KDD). Data mining, or KDD, is the process of finding hidden and potentially useful patterns and knowledge in databases. A large body of research has been performed thus far in the field of data mining for traditional centralised databases. Data mining is applicable not only in the centralised setting but also in distributed environments where distributed databases are used. Mining distributed data is itself a distributed problem and needs distributed algorithms. A distributed data mining algorithm provides data mining results, including knowledge and patterns, without exchanging raw data among the participating sites of a distributed system. Distributed data mining covers all data mining tasks, such as classification, clustering and so on. Association rules mining is one of the most well-known methods of data mining and has wide applications. In this thesis, mining association rules in the distributed environment, particularly on market basket data, is considered. Computation and communication are two important factors in distributed association rules mining (DARM) and, generally, in distributed data mining. In this study, a new technique to improve the performance of finding association rules in distributed environments is presented. This technique may be utilised in any existing DARM algorithm. One well-known family of association rules mining algorithms is Frequent Itemset Mining (FIM). The algorithm proposed by this research is the result of using a new technique inside the DTFIM (distributed Trie-based FIM) algorithm.

Declaration
I declare that: this thesis presents work carried out by myself and does not incorporate without acknowledgment any material previously submitted for a degree or diploma in any university; to the best of my knowledge it does not contain any materials previously published or written by another person except where due reference is made in the text; and all substantive contributions by others to the work presented, including jointly authored publications, are clearly acknowledged.
Shamila Mafazi
14/06/2010

Acknowledgment
Firstly, I would like to thank my supervisor, Dr Abrar Haider, for his valuable support in this thesis. I would especially like to extend my thanks to Dr Jiuyong Li, who introduced me to the data mining world.

1. Introduction
The ubiquity of information technology and its rapid improvement within organisations have profoundly impacted management systems. A significant amount of information is stored in the databases of companies, businesses and government centres. Locally, these data are often used only for producing reports for users and managers. Another usage of data, which is more common and important, is data management and data mining operations. This thesis addresses the problem of discovering frequent itemsets in distributed DBs. In data mining, the target is to discover hidden and useful patterns. Many of these patterns can help both managers and customers: managers can make wiser decisions and customers can shop more easily. The patterns discovered by the different data mining operations are of different types. One of the most useful and famous pattern types is association rules. The most famous usage of association rules is analysing market basket data for supermarkets.
For instance, mining the DB of a supermarket may reveal that 60% of customers who buy milk also buy butter. Finding such rules helps managers organise the shelves and guide customers, and is also useful in management. The DBs used for data mining are typically large, growing from gigabytes towards a terabyte or more. Because of time and space limitations, it is hard and even impossible to manage and process these DBs on a single site. Additionally, some DBs are naturally distributed; hence, the importance of parallel and distributed data mining environments becomes evident. Association rules mining is one of the most effective methods of data mining in distributed DBs; however, other methods, such as clustering or classification, are also discussed in distributed environments. Association rules mining in a distributed environment is the main focus of this research.

1.1 Motivation
With the emergence of distributed DBs, alongside large DBs, in organisations and different commercial centres, the concept of data mining in these environments arises. It is vital and inevitable for almost every organisation to analyse its DBs to discover useful and interesting patterns. For instance, airlines analyse their DBs to target the appropriate customers for special marketing promotions. Banks require customer behaviour patterns for bankruptcy prediction and for loan and credit card approvals. Insurance companies demand patterns for making better decisions regarding customer premiums, and, finally, packaged goods manufacturers and supermarkets need shopping patterns for supplying goods. Although many achievements have been made regarding association rules mining in distributed environments, there is still room for improvement. Increasing efficiency, reducing communication, and preserving security and site privacy in distributed environments are important considerations in DARM and, generally, in distributed data mining.

1.2 Research Question
Discovering association rules in a distributed environment is a relatively new area. Most of the distributed algorithms are not efficient enough and have high communication and computation complexity. The main question this research asks is: how can a more efficient method for discovering association rules in a distributed environment be created? To address this question, a new algorithm is created which is more efficient in comparison to the previous methods. Efficiency is defined as less execution time for achieving results and less communication volume.

1.3 Purpose
A new algorithm is presented for association rules mining in a distributed environment which has better outcomes in comparison with the previous algorithms. The new algorithm is based on the FIM algorithm and utilises some of the existing techniques, such as pruning local candidate itemsets, producing candidate sets, gathering support counts from sites and candidate itemset reduction. The algorithm is tested on distributed data.

1.4 Methodology
The research methodology used to answer the research question is discussed in this section. The main research question, 'How can an efficient method for discovering association rules in distributed environments be created?', has been answered by designing a new algorithm which is more efficient than existing algorithms. Examination on different distributed data shows the efficiency of this new algorithm in terms of execution time and transfer volume.
In order to resolve the research question, this research has reviewed literature relating to the following areas:
- Data mining in centralised environments.
- Data mining in distributed environments.
- Association rules mining in distributed/centralised environments.
- Comparing the existing algorithms in distributed/centralised environments and studying their advantages and disadvantages.
- Developing a concise representation, particularly distributed deduction rules.
- Designing the new algorithm based on DTFIM.

1.5 Thesis Plan
This thesis is constructed as follows: the first section contains an introduction to the research and a definition of the research question. The second section includes the literature review. The third section explains the method used to answer the research question; subsequently, the new method is tested on a distributed DB sample, and this section also presents the new algorithm. The last section concludes the thesis.

1.6 Contribution of the thesis
The algorithm proposed by this thesis intends to resolve marketing problems in distributed environments. Additionally, it aims to profile the needs and preferences of customers in transaction-oriented systems such as credit cards. One of the most significant problems of market basket data is dealing with a large number of candidate itemsets; retrieving interesting and meaningful patterns from these candidates is extremely difficult. This thesis presents an algorithm which reduces the number of candidate sets and simplifies the process of producing interesting customer preferences and patterns.

2. Literature Review
This section deals with the major areas related to the research and covers existing work in these respective areas. The first part considers data mining in centralised environments and discusses some algorithms in this regard. The second part looks at association rules mining in distributed environments and related algorithms.

2.1 Data mining and association rules mining in centralised environments

2.1.1 Data mining
According to Han & Kamber (2006), 'data mining refers to extraction or mining knowledge from large amounts of data'. Data mining has attracted significant attention in the information industry and in society in recent years due to the enormous availability of data and the urgent need for turning this data into useful information and knowledge. This information may apply to applications ranging from market analysis and fraud detection to production control and scientific data mining. According to Frawley, Piatetsky-Shapiro & Matheus (1991), knowledge discovery is defined as 'the nontrivial extraction of implicit, previously unknown and potentially useful information from data.' Although an immense number of patterns can be extracted from a DB, only patterns which are novel, useful and nontrivial to compute are considered interesting. Knowledge is useful when it can satisfy the expectation of the system or the user. Large databases can be considered rich and trustworthy resources for producing knowledge and information. The discovered knowledge can be utilised in information management, report processing, decision making, etc. Data mining has two elementary targets: prediction and description. According to Fayyad et al. (1996, p.12), 'prediction involves using some variables or fields in the database to predict unknown or future values of other variables of interest.
Description focuses on finding human-interpretable patterns describing the data.' Moreover, they believe that in the KDD context description is more important than prediction. There are various types of data mining techniques, such as association rules mining, classification, clustering, prediction and time series analysis. These are the most important methods and techniques; a brief definition of each is as follows:

i. Association rules mining retrieves relations and correlations between the data in a DB. This data mining operation produces a set of rules called association rules.

ii. Classification is another important method of data mining. In this method, the objects in a DB are divided into separate groups based on their attributes; subsequently, a model based on the data attributes is built for each class of the test data. Classification predicts categorical (discrete, unordered) labels (Han & Kamber 2006, p.285). The result of classification could be a decision tree or a set of rules. According to Berry & Linoff (2003, p.166), a decision tree is 'a structure that can be used to divide up a large collection of records into successively smaller sets of records by applying a sequence of simple decision rules.' Additionally, classification can be performed by association rules. For example, if a car dealer can divide its customers based on their interest in different kinds of cars, then the company can send the right catalogues to the right customers and consequently increase its income.

iii. Data clustering is another significant method of data mining. Clustering is defined as 'the process of grouping a set of physical or abstract objects into classes of similar objects. A cluster is a collection of data objects that are similar to one another within the same cluster and are dissimilar to the objects in other clusters' (Han & Kamber 2006, p.383). Clustering intends to maximise the similarity between the data of the same class and minimise the similarity between the data of two different classes. The difference between clustering and classification is that clustering is unsupervised learning, which denotes that the number of classes and the class label of each training tuple are unknown at the beginning of the process, whereas classification is supervised learning, which signifies that the number of classes has been recognised in advance (Han & Kamber 2006, p.287).

iv. Prediction is another technique of data mining, which predicts possible values for unknown variables. Berry & Linoff (2003, p.10) consider prediction to be classification or estimation, with the difference that in prediction records are classified based on a predicted or estimated future value. In prediction, the unknown variables are first determined by statistical analysis; subsequently, intelligent methods, such as genetic algorithms and neural networks, perform the prediction. For instance, the salary of an employee is predictable from those of other employees. There are other methods which are quite effective in prediction, such as regression analysis, correlation analysis and decision trees (Han & Kamber 2006, p.285).

v. Time series analysis is one of the other methods of data mining. According to Berry & Linoff (2003, p.128), the key point is the constancy of the frequency of values over time. Time series analysis requires selecting a proper time frame for the data. In this method, an immense amount of time series data is analysed to discover notable features and specific orderings.
The occurrence of continuous events, sets of events which occur after a specific event, processes and corruptions are among these notable features. For example, the process of changing the price of a specific good in a factory is predictable by using historical data, commercial conditions and competitor information (Fayyad et al. 1996, p.229).

2.1.2 Data pre-processing
In distributed data mining, data comes from several sources with different forms and types. This data is typically dirty, noisy, incomplete and inconsistent. Fundamentally, data is considered the raw material of data mining, just as oil is mined with impurities and becomes usable only by passing through different stages of refinement. The most powerful engine is unable to use crude oil as fuel, just as the most powerful algorithm cannot find interesting patterns in raw, un-pre-processed data (Berry & Linoff 2003, p.540). According to Han & Kamber (2006, p.47), knowledge discovery consists of an iterative sequence of steps, recognised as data pre-processing and KDD, as follows:

i. Data cleaning: removing noise and inconsistency from data and filling in missing values.
ii. Data integration: combining or integrating multiple data sources into a coherent data store.
iii. Data selection/reduction: retrieving the data relevant to the analysis task from the DB. Moreover, the size of a database can be reduced by aggregating, eliminating redundant tuples or clustering.
iv. Data transformation: transformation and consolidation of data into forms appropriate for the mining tasks, for instance through summary or aggregation operations.
v. Data mining: an essential process where intelligent methods are applied to extract data patterns, such as clustering, classification, association rules mining, and so on.
vi. Pattern evaluation: evaluating the interesting patterns representing knowledge based on some measures.
vii. Knowledge presentation: where knowledge presentation techniques are used to present the mined knowledge to the user.

The first four steps may be executed during the process of data storage, reporting or data integration. The last three steps can be performed in one step called data mining. According to Le, Rahayu & Taniar (2006), three types of databases are recognised based on database design approaches: 'well-defined and structured data such as relational, object oriented and object relational data, semi-structured data such as XML, and unstructured data, such as HTML documents'. There are also several approaches regarding integration techniques that unify data of different types. The first approach can integrate only relational data into the data warehouse; the study of Calvanese et al. (1998) is an instance in this regard. The second approach can handle more complex data types, for instance the transition from relational data to object oriented data; Filho et al. (2000) have developed an object oriented model which transforms the data warehouse system into a dimensional object model. The third approach integrates XML documents into a data warehouse system. Jensen, Moller & Pedersen (2001) believe that XML is becoming a new standard for data presentation on the internet, and they developed an integrated architecture which transfers XML and relational data sources, by using OLAP (On-Line Analytical Processing) tools, to the data warehouse system. The final approach, proposed by Le, Rahayu & Taniar (2006), handles all three types of data, including HTML documents.
Furthermore, ETL (Extraction, Transformation and Loading) is a novel approach for data preparation. The duties of ETL include data extraction from various sources, cleaning, and loading into a target data warehouse (Li et al. 2005). According to Rahm & Do (n.d.), the major part of ETL is data cleaning. When multiple DBs need to be integrated, redundancy often occurs because different distributed sources usually hold the same data in different representations. Since the integrated DB or data warehouse is used for decision making, the correctness of its data is paramount and the need for data cleaning is vital. The following figure indicates the process of ETL.

Figure 1. ETL processes (Rahm & Do n.d.)

In the ETL process illustrated in the above figure, all data cleaning processes are performed in a separate data staging area before loading data into the data warehouse. However, a significant fraction of the cleaning and transformation has to be executed manually or by low-level programs (Rahm & Do n.d.). As the figure indicates, in the extraction stage an instance and a schema for each DB are produced. In the integration stage, the extracted schemas and instances are matched and integrated into a single implementation schema and into a data staging area respectively. Executing filtering and aggregation rules on the final schema takes it to the last stage, which is storage in the data warehouse.

2.1.3 Data cleaning as a problem of distributed DBs
The major concern in cleaning data from different sources is the recognition of corresponding data which point to the same real-world entity (the object identity problem) and duplicate elimination. Following is an example of two distributed DBs which are intended to be integrated (Rahm & Do n.d.).

Table 1. User DB1
uId | firstName | lastName | sex | address                 | phone         | mobile        | vehicleIdNo
1   | carry     | bradshaw | f   | 3 main st, richmond, sa | 0061882331234 |               | dd 4
9   | sara      | smith    | f   | 3 main st, Richmond, sa |               | 0061423312415 | 24 423-410

Table 2. Client DB2
cId | name           | streetNo | suburb   | state | gender | vNo
14  | pitter smith   | 4 lane   | rochmond | sa    | 1      |
147 | carry bradshaw | 3 main   | richmond | sa    | 0      |

These tables contain the following conflicts and problems:
- Name conflicts: such as user/client, sex/gender, uId/cId, vehicleIdNo/vNo.
- Structural conflicts: different presentations of name and address.
- Heterogeneous data: different presentations of the gender value (0/1 and f/m).
- Duplicated records: the client named Carry Bradshaw is repeated in both DBs.
- Incomparable DBs: the same records have different IDs in the DBs; for instance uId=1 and cId=147 indicate the same record with different ID numbers.
- Null attributes: some of the attributes contain null values, such as phone and mobile.

The following table is the integration of both tables with possible corrections:

Table 3. Users (integrated DB with cleaned data)
id | lName    | fName  | gender | streetNo  | suburb   | state | phone         | mobile        | uId | cId
1  | bradshaw | carry  | f      | 3 main st | richmond | sa    | 0061882331234 |               | 1   | 147
2  | smith    | sara   | f      | 3 main st | richmond | sa    |               | 0061423312415 | 9   |
3  | smith    | pitter | m      | 4 lane st | rochmond | sa    |               |               |     | 14

The integrated DB is still not ready for data mining algorithms. The following preparation actions are recommended by Berry & Linoff (2003, p.555):
- Extracting information from a field: in some cases, numbers or IDs encode meaningful information. For example, phone numbers contain a country code and an area code, both of which carry geographical information.
- Ignoring names: in general, names do not carry useful information for data mining (there might be some exceptions).
- Using special software to standardise the address field: 'address describes the geography of customers which is important for understanding customer behaviour.'
- Correcting misspelled values: often, sorting the values brings a misspelled value next to the correct one (in the example, the incorrectness of the suburb field of the Users table, rochmond, is obvious after sorting).
- Ignoring columns with only one value.
- Eliminating duplicated records: tuples should be sorted by their occurrence; more than one occurrence represents duplication.

Since the algorithm suggested by this thesis assumes integrated and clean DBs, a deeper survey in this regard is out of the scope of the thesis.

2.1.4 Association Rules Mining
Association rules mining is one of the most important and useful methods of data mining in unsupervised learning systems. The aim of this method is to find rules in immense DBs. Originally, association rules mining arose from point-of-sale data, which reveals which items are likely to be purchased together. Agrawal, Imielinski & Swami (1993) define the operation of finding such interesting rules as association rules mining. They also describe the necessity and usefulness of discovering such rules as follows:
- Observing all rules which contain a specific item may help the store decide how the sale of that item can be enhanced. Additionally, these rules may reveal the effect of continuing or discontinuing the sale of an item on the sale of other items.
- Discovering all rules that include two or more specific items may disclose the items which are mostly purchased together and result in better management of the inventory.
- Finding all rules that consist of items placed on two special shelves helps managers arrange the shelves efficiently.

Suppose that the table below belongs to the database of a supermarket. Each record consists of the items purchased by a customer in a single transaction.

Table 4. An example of a DB
Tid | Items
1   | {bread, coke, yogurt}
2   | {bread, butter, yogurt, cream}
3   | {cream, coke}
4   | {detergent, bread, butter, yogurt}
5   | {bread, butter, yogurt}
6   | {detergent, cream, bread, butter}

This DB has only six transactions, but real DBs contain an enormous number of transactions. One rule which is evident in this DB is that 80% of the customers who have bought bread also buy butter; this rule holds in 66% of the transactions of this supermarket. Such rules are called association rules, and the sets of items which are frequently repeated in a DB and produce the rules are called frequent itemsets or frequent patterns. For instance, the set {bread, butter} in this sample database represents a frequent itemset. The probability discussed above is called confidence, and the percentage of the transactions which contain an itemset is called the itemset's support (Kantardzic 2003, p.166). According to Han & Kamber (2006, p.40), the support and confidence of a rule are two measures of rule interestingness; they demonstrate the usefulness and certainty of discovered rules respectively. Generally, a rule is interesting if its support and confidence are higher than the user-specified minimum thresholds, and an itemset whose support exceeds the minimum support is called a frequent (large) itemset. Additional analysis can be performed to uncover interesting statistical correlations between associated items. A small computational illustration of support and confidence on this sample DB is given below.
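As a minimal, illustrative sketch (not part of the original thesis; the function names are chosen only for illustration), the following Python code computes the support and confidence just described for the rule {bread} => {butter} on the transactions of Table 4.

# Transactions of Table 4
transactions = [
    {"bread", "coke", "yogurt"},
    {"bread", "butter", "yogurt", "cream"},
    {"cream", "coke"},
    {"detergent", "bread", "butter", "yogurt"},
    {"bread", "butter", "yogurt"},
    {"detergent", "cream", "bread", "butter"},
]

def support(itemset, db):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(1 for t in db if itemset <= t) / len(db)

def confidence(antecedent, consequent, db):
    """Confidence of the rule antecedent => consequent."""
    return support(antecedent | consequent, db) / support(antecedent, db)

print(support({"bread", "butter"}, transactions))        # 4/6, about 0.67
print(confidence({"bread"}, {"butter"}, transactions))    # 4/5 = 0.80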
2.1.5 Definition of association rules mining difficulties
The first difficulty is the immense number of transactions, which may not fit in the memory of a computer. Secondly, the number of frequent itemsets increases exponentially with the number of items; hence, a scalable algorithm is needed. Kantardzic (2003, p.166) mentions a formal and mathematical definition of the problem, originally presented by Agrawal, Imielinski and Swami (1993), as follows. Let I = {i1, i2, ..., im} be a set of items and D be a set of database transactions, where each transaction t is a set of items such that t ⊆ I. Each transaction has an identifier, called its TID. A transaction t contains x if and only if x ⊆ t. An association rule is an implication of the form x => y, where x ⊂ I, y ⊂ I, and x ∩ y = Ø. The rule x => y has support s in the transaction set D if s% of the transactions in D contain x ∪ y, and it holds with confidence c if c% of the transactions in D that contain x also contain y. Given the transaction set D, association rules mining consists of producing all the association rules that satisfy the minimum support and confidence thresholds. D can be a data file, a table of a relational DB or the product of a report from a DB. The major aim of association rules mining is to discover strong and meaningful rules in immense DBs. Therefore, the problem can be summarised into two phases: finding the large itemsets and producing the association rules from them. The first phase is the more important one, and most of the running time of the association rules mining process belongs to it. The Apriori algorithm represents an early solution to the problem.

2.1.6 Apriori Algorithm
The Apriori algorithm is a basic algorithm which was proposed by Agrawal & Srikant in 1994 for mining frequent itemsets for Boolean association rules (Han & Kamber 2006, p.234). It is a famous algorithm and the basis for almost all of the existing methods, both in centralised association rules mining and in parallel and distributed rules mining. This algorithm is considered a dynamic programming algorithm.

Algorithm: Apriori
Input: database D and minimum support minsup
Output: all large itemsets
1) C1 = all distinct items in D
2) L1 = large itemsets in C1; k = 1
3) while Lk is not empty
4)   Ck+1 = candidateGen(Lk)
5)   Lk+1 = large itemsets in Ck+1
6)   k++
7) return ∪k Lk

As the pseudocode of the algorithm indicates, the results of each stage are used for the next stage. The following table illustrates the notations used in this algorithm.

Table 5. Notations used in the Apriori algorithm (Agrawal & Srikant 1994)
k-itemset: An itemset with k items.
Lk: The set of large k-itemsets. Each member of this set has two fields: 1. the itemset and 2. its support count.
Ck: The set of candidate k-itemsets (potentially large). Each member of this set has two fields: 1. the itemset and 2. its support count.
Ĉk: The set of candidate k-itemsets when the TIDs of the generating transactions are kept associated with the candidates.

To count the number of occurrences of each candidate itemset in Ck, it must be determined which candidate itemsets from Ck are contained in each transaction t. After the number of occurrences of each candidate set is determined, the frequent itemsets are obtained as those whose number of occurrences is not less than the minimum support threshold. The figure below indicates the operation of the Apriori algorithm on a simple example of a DB with four transactions; the support threshold is set to 2.

Figure 2. Producing candidate itemsets by the Apriori algorithm (Kantardzic 2003, p.168)

As illustrated in the figure, in each stage those itemsets whose support is less than 2 are eliminated. The candidateGen (candidate generation) function takes the set of frequent (k−1)-itemsets, Lk−1, as input; its output is the set of candidate k-itemsets Ck. A small, illustrative sketch of this generate-and-prune step, together with the surrounding Apriori loop, is given below; the join and prune steps are then described in detail.
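The following Python sketch is added here purely for illustration (it is not code from the thesis or from Agrawal & Srikant; function and variable names are assumptions) and shows one straightforward way to implement the candidate generation step and the level-wise Apriori loop described above.

from itertools import combinations

def apriori_gen(prev_large, k):
    """Generate candidate k-itemsets from the large (k-1)-itemsets.
    Join step: combine two (k-1)-itemsets sharing their first k-2 items.
    Prune step: drop candidates having an infrequent (k-1)-subset."""
    prev = sorted(tuple(sorted(s)) for s in prev_large)
    prev_set = set(prev)
    candidates = set()
    for i in range(len(prev)):
        for j in range(i + 1, len(prev)):
            a, b = prev[i], prev[j]
            if a[:k - 2] == b[:k - 2] and a[k - 2] < b[k - 2]:
                c = a[:k - 2] + (a[k - 2], b[k - 2])
                if all(s in prev_set for s in combinations(c, k - 1)):
                    candidates.add(c)
    return candidates

def apriori(transactions, minsup):
    """Return every large itemset (as a frozenset) with absolute support >= minsup."""
    items = {frozenset([i]) for t in transactions for i in t}
    current = {c for c in items
               if sum(1 for t in transactions if c <= t) >= minsup}
    large, k = {}, 1
    while current:
        large.update({c: sum(1 for t in transactions if c <= t) for c in current})
        k += 1
        cand = apriori_gen([tuple(sorted(c)) for c in current], k)
        current = {frozenset(c) for c in cand
                   if sum(1 for t in transactions if frozenset(c) <= t) >= minsup}
    return large

# The worked example discussed next: joining L3 = {123, 124, 134, 135, 234}
# yields {1234, 1345}; pruning removes 1345 because 145 is not in L3.
# apriori_gen([(1,2,3), (1,2,4), (1,3,4), (1,3,5), (2,3,4)], 4) -> {(1, 2, 3, 4)}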
Firstly, this function joins Lk−1 with itself:

insert into Ck
select p.item1, p.item2, ..., p.itemk−1, q.itemk−1
from Lk−1 p, Lk−1 q
where p.item1 = q.item1, ..., p.itemk−2 = q.itemk−2, p.itemk−1 < q.itemk−1
(Agrawal & Shafer 1996, p.4)

Then, in the pruning stage, for each itemset c in Ck (c ∈ Ck), if one or more of its subsets of length k−1 is not in Lk−1, then c is deleted from Ck:

forall itemsets c in Ck do
  forall (k−1)-subsets s of c do
    if (s is not in Lk−1) then
      delete c from Ck
(Agrawal & Shafer 1996, p.4)

Example: suppose that L3 = {{1,2,3}, {1,2,4}, {1,3,4}, {1,3,5}, {2,3,4}}. After the join, C4 would be {{1,2,3,4}, {1,3,4,5}}. In the pruning stage, the itemset {1,3,4,5} is deleted because the itemset {1,4,5} is not in L3, and {1,2,3,4} remains in C4 (Agrawal & Shafer 1996, p.4). As the figure indicates, for m items the Apriori algorithm may generate up to 2^m subsets that might be frequent. As discussed, the numbers of items and of customer transactions are immense; consequently, Apriori is not considered a scalable algorithm.

2.1.7 Subset function
In this function, the Ck candidate itemsets are kept in a data structure called a hash tree. A node of the tree contains either a list of itemsets (a leaf node) or a hash table (an interior node). In an interior node, each bucket of the hash table can point to other nodes. The root of the hash tree is defined to be at depth one; a node at depth d can point to other nodes at depth d+1. The itemsets are stored in the leaves. When adding an itemset c to the hash tree, the tree is traversed from the root towards the leaves. At an interior node, the decision of which branch to follow is made by applying the hash function to the dth item of the itemset. In the beginning, all the nodes are created as leaf nodes; when the number of itemsets in a leaf node exceeds a predetermined threshold, the leaf node becomes an interior node (Brin et al. 1997).

In the Apriori algorithm, the subset function is used to determine which candidate itemsets a given transaction t contains. The subset function begins at the root and identifies all the candidate itemsets contained in the transaction. At a leaf node, it is determined which of the itemsets in the leaf's list are contained in t, and their support counts are incremented. At an interior node which has been reached by hashing the ith item of the transaction, each item that comes after the ith item in t is hashed in turn; this procedure is executed on all the reached nodes recursively. The following figure illustrates how these procedures are applied. The hash tree presented in the figure holds candidate 3-itemsets, and the figure demonstrates the procedure of the subset function for the transaction {1, 2, 3, 5, 6}.

Figure 3. Hash tree

2.1.8 Applied optimisations on the Apriori algorithm
A significant amount of research has been carried out to optimise the Apriori algorithm. Most of it tends to maintain the basic structure of the Apriori algorithm while enhancing efficiency and decreasing time costs. Agrawal and Srikant (1994); Toivonen (1996); Savasere, Omiecinski & Navathe (1995); Park, Chen & Yu (1995); Brin et al. (1997); Han et al. (2001); Bodon (2003); Borgelt (2003); Bodon (2004) are some examples.
Since the efficiency of the Apriori algorithm depends on the method used to count the supports of the candidate itemsets and on the number of candidate itemsets present in each transaction, most of the research focuses on these parts of the Apriori algorithm. Some of the optimisations applied to the Apriori algorithm resulted in new algorithms; the following two algorithms are examples.

2.1.9 AprioriTid and AprioriHybrid Algorithms
Together with the Apriori algorithm, Agrawal and Srikant (1994) offered two further algorithms, called AprioriTid and AprioriHybrid. The AprioriTid algorithm reduces the execution time of support counting by replacing the transactions with the candidate itemsets contained in the transactions; this replacement is performed in every iteration, and the replaced transaction database at the kth iteration is denoted Ĉk. AprioriTid is faster than the Apriori algorithm in the final iterations, but slower in the early ones. This is because producing Ĉk causes some extra overhead and, in the early iterations, Ĉk does not fit in main memory and has to be stored on disk. If a transaction does not include any candidate itemset of length k, then Ĉk has no entry for this transaction; therefore the number of entries in Ĉk may be smaller than the number of transactions in the database, especially in the later iterations. In addition, in the later iterations each entry can be smaller than the corresponding transaction, whereas in the early iterations it can be larger.

AprioriHybrid, a combination of Apriori and AprioriTid, is another algorithm proposed by Agrawal & Srikant (1994). In practice, the Apriori algorithm performs better than AprioriTid in the earlier passes over a DB, whereas AprioriTid performs better in the later passes. Therefore, AprioriHybrid uses the Apriori algorithm in the initial passes and switches to AprioriTid whenever Ĉk is expected to become small enough to fit in main memory. Since the size of Ĉk is proportional to the number of candidate itemsets, a heuristic procedure which estimates the size of Ĉk can be used at each iteration. If the estimated size of Ĉk is small enough and there are fewer large itemsets in the current iteration than in the previous one, the algorithm decides to switch to AprioriTid. According to Agrawal & Srikant (1994), 'experiments show that the Apriori-Hybrid has excellent scale-up properties, opening up the feasibility of mining association rules over very large databases'.

2.1.10 Sampling
According to Toivonen (1996), the sampling algorithm reads the entire database at most twice to find all the frequent itemsets. Firstly, the algorithm selects a random sample of the database and discovers all the frequent patterns in the sample. Secondly, it verifies the result against the entire database. In the cases where the sampling method cannot find all the frequent patterns, the missed patterns can be discovered by generating the remaining potentially frequent patterns and checking their supports in a second pass over the database. The possibility of rescanning can be reduced by lowering the support threshold used on the data sample. Experiments indicate that in most cases the sampling algorithm can find all the frequent patterns with only one pass over the DB, and there is no need for rescanning. The Apriori algorithm may be applied to mine the sample data.
This method, compared to the Apriori algorithm, needs less reading from the disk and is therefore considered efficient.

2.1.11 Partitioning
The most apparent obstacle to exploring interesting association rules revolves around the size of the DB, because the DBs used for data mining are typically enormous. One basic approach to discovering association rules under this constraint is database partitioning. The partitioning method divides a database into several partitions such that each partition fits in main memory. Therefore, because the data resides in main memory, the data mining procedures are executed more efficiently. The partitioning method is based on the fact that if an itemset is frequent in the whole database, it must be frequent in at least one partition. Since loading the transactions into main memory reduces the amount of reading from the disk, the partitioning method is considered to be very fast (Savasere, Omiecinski & Navathe 1995).

Some distributed association rules mining algorithms utilise the partitioning method to increase their efficiency. Hash Partitioned Apriori (HPA), for example, is a parallel algorithm which partitions the candidate itemsets between processors by using a hash function. Since HPA uses the entire memory space of all processors, it is considered an efficient algorithm for large data mining processes in distributed and parallel environments (Sujni & Saravanan 2008). Moreover, Coenen, Leng & Ahmed (2003) proposed the vertical partitioning (DATA-VP) algorithm. This algorithm utilises the Apriori-T algorithm, which 'combines the classic Apriori ARM algorithm with the T-tree data structure.' Initially, the DATA-VP algorithm splits the set of single attributes between sites. As the algorithm affixes candidate itemsets to the T-tree, their support counts are computed and those with insufficient support are pruned. At the termination of the algorithm, each site contains a T-tree including the large itemsets. Experiments on DBs indicate that the execution time of the DATA-VP algorithm is much less than that of the Apriori algorithm (around five times smaller).

2.1.12 Direct Hashing and Pruning (DHP) algorithm
The DHP algorithm was presented by Park, Chen & Yu in 1995. This algorithm is an extension of the Apriori algorithm that applies a hashing technique. It is intended to address the problem of the immense number of candidate itemsets at every stage, especially in the second pass, where the number of candidate 2-itemsets is very large. A feasible solution is that, by an appropriate hash function, the candidate itemsets are hashed into the buckets of a hash table. In this method, instead of counting the support of every itemset one by one, the support of each bucket is calculated. After each iteration, if the support of a bucket is less than the minimum support threshold, then all the candidate itemsets associated with this bucket are eliminated. Consequently, a quicker method for counting the support is obtained.

2.1.13 Dynamic Itemset Counting (DIC) algorithm
Brin et al. (1997) introduced the DIC algorithm, which intends to minimise the number of traversals through the database. According to them, the DIC algorithm benefits from a hash tree as its data structure. In this tree, each itemset is stored by its items: every itemset and all of its prefixes have a node. The root node is the empty itemset, and all the 1-itemsets are attached to the root node. Every other itemset is inserted under the node of its prefix containing all but its last item.
Together with each inserted itemset, a counter is maintained. To detect frequent itemsets, whenever a transaction is read the counter of each active itemset it contains is incremented. Furthermore, the state of each itemset is tracked by managing transitions from active to counted and from small to large, and the occurrence of such transitions has to be detected. DIC has numerous advantages in contrast to the Apriori algorithm. According to Brin et al. (1997), 'the main one is performance. If the data is fairly homogeneous throughout the file and the interval is reasonably small, this algorithm generally makes on the order of two passes.'

2.1.14 Frequent Pattern (FP) Growth method
FP-growth is an alternative algorithm that takes a radically different approach to discovering frequent itemsets; it does not follow the generate-and-test model of Apriori (Tan, Steinbach & Kumar 2006, p.363). The FP-growth method was first introduced by Han, Pei & Yin (2000) and was later optimised by Han et al. (2001). The algorithm converts the data set into a compressed data structure called an FP-tree, from which the frequent itemsets can be extracted directly. In the first pass, the whole DB is read and the frequent items are found. In the second pass, the frequent items of each transaction are considered and arranged based on their frequency, and these frequent items are added to a tree called the FP-tree. At the end of the second pass, the whole DB, except for the infrequent items, is stored in main memory. Arranging the itemsets based on their frequency makes the tree in main memory smaller. After the FP-tree has been produced, the algorithm ignores the DB, and a recursive procedure mines the frequent itemsets by exploring this tree. Tan, Steinbach & Kumar (2006, p.364) describe the construction of the FP-tree as follows (a minimal sketch of this construction is given after the list):

i. The DB is scanned to determine the support count of each item. Infrequent items are removed, and the frequent items are sorted in decreasing order of support count.
ii. Initially, the FP-tree consists of the root node, represented by the null symbol. To construct the tree, the algorithm makes a second pass over the data. After scanning the first transaction, a node is created and labelled for each item of the transaction; at this stage, all the nodes have a frequency count of 1.
iii. After scanning the second transaction, if it does not share a common prefix with the first transaction, its path is disjoint and a new set of nodes is produced from the root (null) node; otherwise, new nodes are created only beyond the shared prefix, and the frequency counts along the shared prefix are increased to 2. Subsequently, a list of pointers is maintained which connects nodes holding the same item.
iv. This procedure proceeds until every transaction has been mapped onto one of the paths of the tree.
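As a rough illustration of steps i–iv, the following Python sketch (not taken from the cited sources; class and variable names are illustrative assumptions) builds an FP-tree together with a header table of per-item node pointers.

class FPNode:
    def __init__(self, item, parent):
        self.item = item          # item label (None for the root)
        self.count = 1            # frequency count of this path prefix
        self.parent = parent
        self.children = {}        # item -> FPNode

def build_fptree(transactions, minsup):
    """Prune infrequent items, sort each transaction by decreasing item
    support, then insert it along a (possibly shared) path of the tree."""
    # First pass: item support counts.
    support = {}
    for t in transactions:
        for item in t:
            support[item] = support.get(item, 0) + 1
    frequent = {i for i, c in support.items() if c >= minsup}

    root = FPNode(None, None)
    header = {}                   # item -> list of nodes holding that item
    # Second pass: insert each transaction into the tree.
    for t in transactions:
        items = sorted((i for i in t if i in frequent),
                       key=lambda i: (-support[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:     # no shared prefix from here on: new node
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            else:                 # shared prefix: increase the count
                child.count += 1
            node = child
    return root, header

Sorting the items of each transaction by decreasing support is what keeps the tree small: transactions that share their most frequent items also share a prefix path.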
On the other hand, because of the storage space for the pointers, it requires higher physical memory space (Tan, Steinbach & Kumar 2006, pp. 366). The most important benefit of this method is that, it reduces the cost of reading from the disk and eliminates the production and counting candidate itemsets. Therefore, it is considered as an efficient algorithm. However, according to Wang (2009), ‘it has problems about space flexibility and costs much time in dense data mining.’ Each of the association rules mining has its own advantages and disadvantages. For eliminating the disadvantages of different methods and more use from their advantages, one or more algorithms may combine together and consequently a more efficient algorithm can be produced. 2.1.15 Association rules mining in XML documents The dramatic development of the eXtensible Markup Language (XML) documents as a standard for information transportation and storage on the web has been significant. Consequently, there is an urgent need for tools such as association rules mining, to extract interesting knowledge from them. There are some limitations for use of XML inside the data mining communities. There are some approaches for the knowledge discovery tasks but most of them are based on the traditional relational framework with an XML interface (Braga et al. 2002). Much research has introduced association rules from native XML documents so far, Braga et al. (2002), Feng et al. (2003), Zhang (2005) and Wan & Dobbie (2004) are some samples in this regard. Each of these papers uses a different method for association rules mining in XML documents. For instance, Feng et al. (2003) suggests 22 tree structure. They believe that building up association rules among trees rather than using a simple structure, is more powerful both from structural and semantic aspects. Zhang et al (2005) proposes a framework called XAR-Miner which includes three main steps as follows: Pre-processing (i.e., construction of the Indexed XML Tree (IX-tree) or Multiple Relational Databases (Multi-DB)); Generation of generalised meta-patterns; Generation of large association rules of generalised meta-patterns. Resultant generalised meta-patterns are used to generate large association rules that meet the appropriate support and confidence levels. The paper published by Wan & Dobbie (2004) is another example for association rules mining for XML documents. The authors implement an algorithm by using query language X-query. In this method the need for pre-processing or post-processing is eliminated. Achieving a better efficiency is not the only target in the methods and algorithms of association rules mining; the attention to the tree structure of XML documents is more important. 2.1.16 Trie data structure According to (Bodon 2004), Trie is a famous data structure in Frequent Itemset Mining (FIM) which was introduced by Fredkin (1960) and Briandais (1959) for the first time for saving and retrieving words from dictionaries. Trie is a weighted and labelled tree which has direction. The root of the tree is defined at depth zero. A node of this tree in d depth can point to the other nodes which are at depth d+1. If node u points to node v then it is considered as father for v and v as a child for u. The concatenation of labels of the edges, on the path from the root to a node represents an itemset. The value of each node contains the support count for the itemset which it symbolises and the links are labelled with a frequent itemsets. In figure 1.3 a Trie is illustrated. 
Tries are suited to storing and retrieving any finite sets or sequences, as well as words. Tries can be implemented in several ways, such as the compact and non-compact representations. In the compact representation, the edges of a node are stored in a vector. Each element of the vector is a pair which represents an edge of the tree: the first element stores the label of the edge and the second stores the address of the node the edge points to. Alternatively, a linked list can be used for this purpose. In the non-compact representation, only the pointers are stored in a vector whose length equals the total number of items. The element at index i belongs to the edge whose label is the ith item; if there is no edge with such a label, the element is nil. The advantage of this method is that it finds a specific label in O(1) time, whereas in the compact representation the search is implemented by a binary search with time complexity O(log n). Compared to the compact representation, the non-compact representation needs more memory for nodes with few edges and less memory for nodes with many edges. Based on the memory requirements, the two methods can be combined (Bodon 2004).

Distributed Trie-based Frequent Itemset Mining (DTFIM) is one of the most recent distributed trie-based algorithms for multi-computer environments and was proposed by Ansari et al. (2008). They state that 'using the Trie data structure outperforms the other implementations using hash tree.' In the DTFIM algorithm, a trie-based structure is constructed at each local site. At the beginning, each site determines the local support counts of all its items (1-itemsets) and stores them in a vector. At the end of this stage, the local sites synchronise their data to specify the globally large 1-itemsets. Subsequently, each site initialises its local trie, so that all sites possess the same trie at the end of this stage. In the second pass, a two-dimensional array is created for storing the support counts of the itemsets with two elements; at the end, the global support counts of the 2-itemsets are determined and each site appends the large 2-itemsets to its local trie. Similarly, for each pass k (k ≥ 3) the same process is repeated: at the end of the pass, the local support counts are synchronised and the infrequent k-itemsets are pruned. One of the properties of this algorithm is that the more skewed the distributed DB is, the more efficiently the algorithm behaves. A minimal sketch of the local-count/global-count synchronisation step is given below.
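The following Python sketch is only an illustration of this count-and-synchronise pattern (it abstracts DTFIM's message exchange into an ordinary function call and is not the authors' implementation; all names are assumptions): each site counts the candidates on its local partition, the local counts are summed into global counts, and only the globally large itemsets are kept.

def local_counts(candidates, local_db):
    """Support counts of the candidate itemsets in one site's partition."""
    return {c: sum(1 for t in local_db if c <= t) for c in candidates}

def globally_large(candidates, site_dbs, global_minsup):
    """One synchronisation step: sum the per-site counts and keep the
    candidates whose global support reaches the global threshold."""
    global_counts = {c: 0 for c in candidates}
    for db in site_dbs:           # in DTFIM this is a message exchange, not a loop
        for c, n in local_counts(candidates, db).items():
            global_counts[c] += n
    return {c: n for c, n in global_counts.items() if n >= global_minsup}

# Example with two sites and three candidate 1-itemsets:
site1 = [{"a", "b"}, {"a", "c"}]
site2 = [{"a", "b", "c"}, {"b"}]
cands = [frozenset({"a"}), frozenset({"b"}), frozenset({"c"})]
print(globally_large(cands, [site1, site2], global_minsup=3))
# {frozenset({'a'}): 3, frozenset({'b'}): 3}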
2.1.17 Non-derivable itemsets
According to Calders & Goethals (2007), in most association rules mining algorithms, if the minimum support threshold is set low or the data is highly correlated, the number of frequent itemsets becomes immense. In these circumstances, producing all the frequent itemsets is not feasible. To overcome this issue, various proposals have been presented to build a concise representation of the frequent itemsets. Calders & Goethals (2002) introduce the non-derivable itemsets as a concise representation of the frequent itemsets which eliminates the need for mining all frequent itemsets. The non-derivable itemsets are obtained from the deduction rules. Firstly, this section discusses the deduction rules, which derive tight bounds on the support of candidate itemsets; then the way these deduction rules divide the itemsets of a DB into the two groups of derivable and non-derivable itemsets is discussed. Finally, it is discussed how non-derivable itemsets can be used as a condensed representation of the entire data.

2.1.17.1 Deduction rules
The deduction rules are rules for deducing tight bounds on the support of an itemset without the need to access the DB. A transaction DB over a finite set of items I is a finite set of pairs (tid, J), where tid is a positive integer called the identifier and J is a subset of I (J ⊆ I). For an itemset I and every X ⊆ I, the deduction rules are as follows (Calders & Goethals 2007):

If |I\X| is odd:   supp(I) ≤ Σ_{X ⊆ J ⊂ I} (−1)^(|I\J|+1) supp(J)
If |I\X| is even:  supp(I) ≥ Σ_{X ⊆ J ⊂ I} (−1)^(|I\J|+1) supp(J)

These rules are denoted RX(I). Depending on whether |I\X| is odd or even, the rule gives an upper or a lower bound on the support of the itemset I: if |I\X| is odd then RX(I) gives an upper bound, otherwise it gives a lower bound. Therefore, given the supports of all the proper subsets of the itemset I, evaluating the rules RX(I) for all X ⊆ I yields lower and upper bounds on the support of I. Calders (2004) proves that these bounds are tight, which means that, using only the information from the subsets of an itemset and without scanning the DB, no lower or upper bounds better than those detected by the deduction rules can be obtained. Before the derivation of these rules is discussed, their principle is first illustrated by an example (Calders & Goethals 2007).

Figure 5. An example of transactions of a DB (Calders & Goethals 2007)

Let D be a transaction DB and let a, b and c be items, where supp(ab̄c̄) denotes the number of transactions that contain a but contain neither b nor c. According to the inclusion-exclusion principle of Galambos and Simonelli (1996), the following equality holds (Calders & Goethals 2007):

supp(ab̄c̄) = supp(a) − supp(ab) − supp(ac) + supp(abc)

Since supp(ab̄c̄) is always greater than or equal to zero,

supp(a) − supp(ab) − supp(ac) + supp(abc) ≥ 0
supp(abc) ≥ supp(ab) + supp(ac) − supp(a)

This inequality gives a lower bound on the support of abc if the supports of all its subsets are known. The procedure can be repeated for every generalised itemset over {a, b, c}. The following inequalities demonstrate all the possible rules for deducing the tight bounds on the support of the itemset abcd (Calders 2002):

R0 : supp(abcd) ≥ supp(abc) + supp(abd) + supp(acd) + supp(bcd) − supp(ab) − supp(ac) − supp(ad) − supp(bc) − supp(bd) − supp(cd) + supp(a) + supp(b) + supp(c) + supp(d) − supp({})
Ra : supp(abcd) ≤ supp(a) − supp(ab) − supp(ac) − supp(ad) + supp(abc) + supp(abd) + supp(acd)
Rb : supp(abcd) ≤ supp(b) − supp(ab) − supp(bc) − supp(bd) + supp(abc) + supp(abd) + supp(bcd)
Rc : supp(abcd) ≤ supp(c) − supp(ac) − supp(bc) − supp(cd) + supp(abc) + supp(acd) + supp(bcd)
Rd : supp(abcd) ≤ supp(d) − supp(ad) − supp(bd) − supp(cd) + supp(abd) + supp(acd) + supp(bcd)
Rab : supp(abcd) ≥ supp(abc) + supp(abd) − supp(ab)
Rac : supp(abcd) ≥ supp(abc) + supp(acd) − supp(ac)
Rad : supp(abcd) ≥ supp(abd) + supp(acd) − supp(ad)
Rbc : supp(abcd) ≥ supp(abc) + supp(bcd) − supp(bc)
Rbd : supp(abcd) ≥ supp(abd) + supp(bcd) − supp(bd)
Rcd : supp(abcd) ≥ supp(acd) + supp(bcd) − supp(cd)
Rabc : supp(abcd) ≤ supp(abc)
Rabd : supp(abcd) ≤ supp(abd)
Racd : supp(abcd) ≤ supp(acd)
Rbcd : supp(abcd) ≤ supp(bcd)
Rabcd : supp(abcd) ≥ 0

Figure 6. Tight bounds on support of (abcd) (Calders 2002)
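To make the rules concrete, the following Python sketch (added for illustration only; it is not code from Calders & Goethals, and the function names are assumptions) evaluates every rule RX(I) for an itemset I, given the supports of its proper subsets, and returns the greatest lower bound and the least upper bound.

from itertools import combinations

def subsets(s):
    """All subsets of frozenset s, including the empty set and s itself."""
    items = list(s)
    for r in range(len(items) + 1):
        for c in combinations(items, r):
            yield frozenset(c)

def deduction_bounds(I, supp):
    """Evaluate every rule R_X(I) and return (greatest lower, least upper) bound.
    `supp` maps each proper subset J of I (as a frozenset, including the empty
    set, whose support is the number of transactions) to supp(J)."""
    I = frozenset(I)
    lowers, uppers = [], []
    for X in subsets(I):
        total = sum((-1) ** (len(I - J) + 1) * supp[J]
                    for J in subsets(I) if X <= J and J != I)
        if len(I - X) % 2 == 1:   # |I \ X| odd  -> upper bound
            uppers.append(total)
        else:                     # |I \ X| even -> lower bound (X = I gives 0)
            lowers.append(total)
    return max(lowers), min(uppers)

# An itemset is derivable exactly when the two bounds coincide:
#   lb, ub = deduction_bounds({"a", "b", "c"}, supp)
#   derivable = (lb == ub)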
If the support counts of the subsets are replaced by their numeric values in the above formulas, a number is obviously obtained on the right-hand side of each formula. Therefore, the formulas which state that the support of the itemset abcd is greater than or equal to (≥) a specific value give lower bounds on the support of this itemset; conversely, the formulas which state that it is less than or equal to (≤) a specific value give upper bounds. The tight bounds for the itemset are the greatest lower bound and the least upper bound obtained from these formulas. If, for an itemset I, the least upper bound (I.u) equals the greatest lower bound (I.l), then the exact support count of the itemset can be calculated by using the supports of its subsets. Basically, if in database D we have I.l = I.u, then:

supp(I, D) = I.u = I.l

Such itemsets are called derivable itemsets. In contrast, an itemset is considered non-derivable if the exact value of its support cannot be calculated by using the deduction rules; basically, for the non-derivable itemsets the deduction rules cannot produce equal lower and upper bounds. The non-derivable itemsets are considered a concise representation of the entire set of itemsets (Calders & Goethals 2002). This representation consists of a subset of all the itemsets which carries all the information necessary for deriving the entire set of itemsets, including the frequent itemsets. The NDI algorithm was proposed by Calders & Goethals (2002) to produce all the non-derivable itemsets of a DB. This algorithm, similar to the Apriori algorithm, works level by level, but the difference is that in each level the NDI algorithm prunes more itemsets from the candidate list. The next section describes the NDI algorithm; in fact, the NDI algorithm provides a concise representation of a DB by generating the non-derivable itemsets.

2.1.17.2 Non-Derivable Itemsets (NDI) algorithm
Like in the Apriori algorithm, in each iteration some candidate itemsets are produced. Subsequently, the tight bounds are calculated by applying the deduction rules to the candidate itemsets, and the derivable itemsets are removed from the list of candidates. The DB is then scanned to calculate the support counts of the non-derivable candidate itemsets. Finally, the frequent non-derivable itemsets are determined by applying the support threshold. The following pseudocode illustrates the NDI algorithm (Calders & Goethals 2002):

NDI(D, s)
  i := 1; NDI := {};
  C1 := {{i} | i ∈ I};
  for all I in C1 do I.l := 0; I.u := |D|;
  while Ci not empty do
    Count the supports of all candidates in Ci in one pass over D;
    Fi := {I ∈ Ci | support(I, D) ≥ s};
    NDI := NDI ∪ Fi;
    Gen := {};
    for all I ∈ Fi do
      if support(I) ≠ I.l and support(I) ≠ I.u then
        Gen := Gen ∪ {I};
    PreCi+1 := AprioriGenerate(Gen);
    Ci+1 := {};
    for all J ∈ PreCi+1 do
      Compute bounds [l, u] on the support of J;
      if l ≠ u then
        J.l := l; J.u := u;
        Ci+1 := Ci+1 ∪ {J};
    i := i + 1
  end while
  return NDI

In this algorithm, in each level the candidate itemsets are produced from the frequent itemsets whose support counts are not equal to their lower or upper bounds. Moreover, the candidate itemsets are produced in two stages. The first stage is similar to the production of the itemsets in the Apriori algorithm. In the second stage, the upper and lower bounds are determined by applying the deduction rules to the candidate itemsets already produced in stage one; subsequently, the derivable itemsets are pruned from the candidate list. The supports of the non-derivable candidate itemsets are obtained by reading the DB.
At the end, the output of the algorithm is the set of non-derivable frequent itemsets. Since evaluating all the deduction rules is very time consuming, Calders & Goethals (2002) use only a part of these rules for deriving the tight bounds of an itemset. They apply the deduction rules up to a depth k for an itemset I, that is, only the rules RJ(I) with |I − J| ≤ k are evaluated. As the condition indicates, these form only a part of all the deduction rules for an itemset. Empirically, the additional pruning obtained by increasing k diminishes quickly; Calders & Goethals (2002) thus conclude that 'in practice most of the pruning is done by the rules of limited depth'. Furthermore, they prove that derivability is monotone: if an itemset I is derivable then all its supersets are derivable and, conversely, if an itemset I is non-derivable then all its subsets are non-derivable. Moreover, Calders (2004) shows that, for a non-derivable itemset, the width of the interval between the lower and upper bound, w(I) = UB(I) − LB(I), decreases exponentially with the length of I; specifically, w(I ∪ {i}) ≤ w(I)/2 holds for every itemset I and item i ∉ I. This property guarantees that the non-derivable itemsets cannot be too long, because the interval can be halved at most a logarithmic number of times (log(n) + 1, where n is the number of transactions).

The number of rules RX(I) increases exponentially with the cardinality of I\X; the number |I\X| is called the depth of the rule RX(I). Since evaluating the deep rules requires considerable resources, in practice only the rules of limited depth are used. The greatest lower bound and the least upper bound on the support of I obtained by evaluating the rules up to depth k are denoted LBk(I) and UBk(I) respectively; the interval [LBk(I), UBk(I)] is obtained by evaluating the rules {RX(I) | X ⊆ I, |I\X| ≤ k}. It should be noted that, if the frequent non-derivable itemsets are calculated using the deduction rules only up to depth k, some derivable itemsets might not be pruned, so at the end of the mining operation there may be some derivable itemsets among the non-derivable ones. Clearly, if k is large enough, the number of such remaining derivable itemsets will be small or zero.

Experiments on real DBs indicate that the size of the concise representation formed by the non-derivable itemsets is much smaller than the total number of frequent itemsets produced by Apriori. The following figure compares the number of itemsets generated by the NDI and Apriori algorithms for the same thresholds.

Figure 7. Size of concise representation (Calders & Goethals 2002)

As Figure 7 illustrates, for the support threshold 0.1%, for instance, the Apriori algorithm generates 990097 frequent itemsets whereas the NDI algorithm generates only 162821 frequent itemsets (Calders & Goethals 2002).

2.1.17.3 Producing all the frequent itemsets from the non-derivable itemsets

The concept of derivability can be applied to produce all the frequent itemsets of a DB efficiently. The itemsets of a DB can be divided into two groups: derivable and non-derivable itemsets. To find the frequent non-derivable itemsets, the DB needs to be scanned and their support counts calculated. The support counts of all the derivable itemsets, on the other hand, are calculated by applying the deduction rules.
This property can be applied to create an efficient method for mining all the frequent itemsets: the derivable itemsets can be retrieved without reading the DB. Since some DBs contain an immense number of derivable frequent itemsets, using the concept of derivability allows all the frequent itemsets to be obtained more efficiently. As previously mentioned, when exploring all the frequent itemsets using the concept of derivability, two groups of itemsets are discovered: derivable and non-derivable. Evaluating all the deduction rules is, however, time consuming and can eliminate the advantage of the savings in DB reads. For this reason, Calders & Goethals (2002) examine two methods for discovering the derivable itemsets which show that, for a specific group of derivable itemsets, the evaluation of all the deduction rules is unnecessary. They are as follows:

First method: suppose I is a non-derivable itemset but, after scanning the DB, the deduction rule RX(I) turns out to give a value equal to the exact support count of I. Then all the supersets of I of the form I ∪ {i} are derivable, and their exact supports can be calculated using the rules RX∪{i}(I ∪ {i}) or RX(I ∪ {i}). This observation can be used to avoid checking all the possible deduction rules when calculating the support bounds of I ∪ {i}. It follows that, when the bounds on the support of the itemset I are determined, the lower and upper bounds (I.l and I.u) of this itemset should be saved. If the itemset I is non-derivable (I.l ≠ I.u), its support count is obtained by scanning the DB. After the support count has been calculated, the following conditions need to be tested:

support(I) = I.l
support(I) = I.u

If one of these two conditions is true, then all the supersets of I are derivable and there is no need to calculate their bounds; the deduction rules that gave the exact support count can be used to find the support counts of the supersets of I.

Second method: suppose that I is a derivable itemset and that RX(I) derives the exact support count of I. Then, for calculating the support count of I ∪ {i}, only the rule RX∪{i}(I ∪ {i}) needs to be evaluated. The concept of derivability is monotone: the superset of a derivable itemset is derivable. Therefore, to obtain the support of a superset of a derivable itemset, evaluating a single deduction rule is enough.

Using the concept of derivability, these two methods can be used to retrieve all the frequent itemsets. In each iteration, some derivable and some non-derivable itemsets are identified among the candidate itemsets. In the early iterations the number of non-derivable itemsets exceeds the number of derivable itemsets but, as the process continues, the number of non-derivable itemsets decreases and the number of derivable itemsets increases; in the last iterations all the itemsets may be derivable. As mentioned before, the frequent non-derivable itemsets are a condensed (compact) representation of the entire set of frequent itemsets, and for this reason all the frequent itemsets can be produced from the non-derivable itemsets. Other techniques and algorithms for obtaining a concise representation of a DB have also been proposed, such as closed itemsets (Pasquier et al. 1998) and the GrGrowth algorithm (Liu, Li & Wong 2007).
Experiments indicate that the non-derivable itemsets are more efficient than other compact representations (Calders 2004). Indeed, the non-derivable itemsets form a more compact representation than the others; that is, their number, compared with the total number of frequent itemsets in a given DB, is smaller. Therefore, producing all the frequent itemsets from the compact representation given by the non-derivable itemsets is more economical than using other compact representations. Hence, the algorithm proposed by this thesis applies this technique.

2.2 Distributed association rules mining

This section introduces distributed data mining and some of the most important Distributed Association Rules Mining (DARM) algorithms. Moreover, some of the known problems and recent progress in this area are covered.

2.2.1 Distributed data mining

'Mining association rules in the distributed environment is a distributed problem and must be performed using a distributed algorithm that does not need raw data exchange between participating sites.' (Ansari et al. 2008)

The subject of distributed data mining has attracted a great deal of attention from the research and commercial communities for finding useful and interesting hidden patterns in large transaction logs. Guo & Grossman (1999); Zaki (2000); Kargupta & Chan (2000); Zaki (1999); Agrawal & Shafer (1996); Cheung et al. (1996) and Sujni & Saravanan (2008) have surveyed the issues of distributed data mining. Distributed data mining is the operation of data mining on distributed data sets. According to Zaki (1999), two dominant architectures exist in distributed environments: distributed memory and shared memory architectures. In the distributed memory architecture each processor has a private DB or memory to which only it has direct access; access to the other local DBs is possible only via message exchange. This architecture offers a simple programming method, but limited bandwidth may reduce its scalability. This thesis assumes the distributed memory architecture. The figure below indicates a simple architecture for distributed memory systems.

Figure 8. Distributed memory architecture for distributed data mining

In the shared memory architecture, each processor has direct and equal access to the DB in the system (global DB). Parallel programs can be implemented on such systems easily. The figure below indicates an architecture for shared memory systems.

Figure 9. Shared memory architecture for distributed data mining

Distributed data mining is often mentioned together with parallel data mining. Both are designed to optimise the efficiency of data mining, but each uses a different system architecture and different methods. The computers in distributed data mining are physically distributed and communicate with each other by exchanging messages. Parallel data mining assumes a parallel computer whose processors share memory, whereas the computers in distributed data mining systems share nothing. This discrepancy in architecture has a significant effect on algorithm design, the cost model and the measurement of efficiency in parallel and distributed data mining (Kargupta & Chan 2000, p. 24). However, according to Fang et al. (2005), they are the same 'from the point of concept, the architecture of parallel and distributed calculating is a kind of layer calculating structure'.
2.2.2 The necessity of studying distributed data mining

Distributed DBs

With the global shift of organisational structures from centralisation to decentralisation, and the emergence of computer networks and distributed DBs in this computerised era, the need for aggregating distributed data has become vital. Many organisations, such as the Department of Health, have to deal with distributed, heterogeneous and independent DBs. The survey by Chiu, Koh & Chang (2007) of the Department of Health of Taiwan indicates the need for integrating distributed DBs into a central one for management purposes. Furthermore, the investigation of Amoroso, Atkinson & Secor (1993) reveals that the problem of data management in organisations is inevitable; in this regard, they develop a data management construct for clarifying the weak points of data management by managers. Depending on the type of organisation, the distributed DBs may need to be integrated into a central DB (for example, the health history of patients), or there may be no need to keep a copy of all the data of the distributed DBs in a central DB (for example, market basket data).

A primary solution to the data management problem in distributed DBs is to transfer all the data from the different sites to a centralised site and perform the data mining operations there. Even if a central site with enough storage space (memory) and the capacity to perform all the heavy data mining tasks exists, the transfer of all the data of the local sites to a central site is extremely time consuming and costly. In addition, in some cases transferring the local data is not permitted, due to site ownership and security issues. In distributed data mining: no site should be able to learn the contents of a transaction at any other site, what rules are supported by any other site, or the specific value of support/confidence for any rule at any other site, unless that information is revealed by knowledge of one's own data and the final result (Kantarcioglu & Clifton 2004).

Most of the distributed algorithms that are based on Apriori suffer from a lack of privacy for the participating sites. Some algorithms have been proposed to protect the sites' privacy, such as SDDM (Secure Distributed Data Mining) presented by Fukasawa et al. (2004). This algorithm satisfies the privacy requirements and has the ability to resist collusion. Moreover, since only random numbers are used for preserving privacy, it is considered an efficient algorithm.

Efficiency and scalability

Even if data is not distributed, distributed data mining can be useful for data stored at a single site. In particular, a site with massive data can send parts of its data to other sites and share the burden of the data mining operations; the results from those sites are then combined to achieve the desired result. Even though the site may be able to perform the data mining operation by itself, sending parts of the data to other sites can enhance efficiency. There are two approaches to distributed data mining. In the first approach, each site mines a part of the data (data distribution). In the second, each site performs a part of the data mining for the entire distributed DBs (distributing the operations). In both cases, the results from all sites are combined. Since the burden of the operations is distributed between the sites, this is much quicker than centralised data mining. The above discussion shows that distributed data mining methods are scalable.
A data mining method is considered scalable if its efficiency does not degrade as the volume of data increases. In a distributed data mining system, the number of participating sites can be adjusted as the volume of data increases or decreases. Distributed data mining is a desirable method in most cases, especially with recent progress in computer networks, specifically intranets and the Internet. However, in comparison with centralised data mining, it is more costly to implement, the methods used are more complicated, and accuracy in designing the data mining system is more important (Agrawal & Shafer 1996).

2.2.3 Important instances and issues in distributed data mining

Treating the different local sites as parts of one general DB system and executing centralised data mining instead of distributed data mining is not recommended, because the data in a distributed DB is distributed naturally. For designing a suitable data mining system, attention to the features of the data, the computer network and the kind of data mining operation is necessary. Some important considerations are listed as follows:

Homogeneous data in contrast to heterogeneous data: In most studies of distributed data mining it is assumed that the local DBs are homogeneous, which means the local data reside on the same platform, with the same DB management system and the same schema. With heterogeneous local DBs, the data mining system has to be adjusted to work on the different local DBs. In addition, the different schemas should be standardised and unified into a general schema; otherwise it would be very difficult or even impossible to achieve the desired result.

Data layout: There are many ways to represent a DB. According to Zaki (1999), a DB layout can be horizontal or vertical. In the horizontal layout each customer transaction is stored along with its related items. This representation is more common and is adopted by the Apriori algorithm. In the vertical DB layout, all the transactions which contain a specific item are listed under that item. In this representation, the length of the TID-lists shrinks as we progress to larger sized itemsets. However, one problem with this approach is that the initial set of TID-lists may be too large to fit into main memory, thus requiring more sophisticated techniques to compress the TID-lists (Tan, Steinbach & Kumar 2006, p. 362). The figures below indicate these layouts.

Figure 10. Horizontal DB layout (each transaction identifier listed with its items)

Figure 11. Vertical DB layout (each item listed with the identifiers of the transactions containing it)

Data replication: All or part of the local data can be replicated on other sites. Data replication increases the availability of data. Basically, data replication is not done only for data mining purposes; it is a decision based on computational or other needs. Although data can be replicated for the data mining process, in that case the data miner should decide what data, or which part of the data, should be replicated.

Information exchange cost: The major concerns in distributed data mining are the time spent reading from disk and the execution time. The time needed for information exchange should be considered as well; in a slow network, data exchange is the major cost. The data exchange cost is determined by the bandwidth and the number of messages sent over the network. Therefore, the cost model should be different for centralised and distributed data mining.

Drawing conclusions from the results: Obtaining the final result is not merely a matter of gathering the results from all sites and putting them together.
An interesting rule discovered in a local DB could be quite useless in the global DB; for example, an itemset that is frequent in a local DB can be infrequent in the global DB. Since the aim of distributed data mining is to find rules which are useful for the global DB, the discovered local rules and their features should be examined with respect to the global DB.

Data skew: The distribution of the data statistics, such as the values of attributes and the membership in different classes, usually differs among the local DBs. The local model obtained by mining a local DB is unavoidably affected by this distribution. Skew in the data can make the local models useless and without value; for example, a classifier learnt from one local DB may fail to classify new samples of a class it has hardly seen.

The cases mentioned above are not separate from each other; they are quite interdependent. For instance, data fragmentation of a global DB causes the local DBs to be heterogeneous, and horizontal data fragmentation can cause data skew if it is implemented carelessly. Replication of data tables can reduce the transfer cost of accessing data, but it increases the transfer cost of keeping the data consistent. In addition to the cases mentioned, there are other factors that are important in distributed data mining, such as security, the privacy of local sites, the autonomy of the local DBs, the network topology, how to transfer data and the load which each local site can handle.

2.2.4 Distributed algorithms for association rules mining

This section introduces how to mine association rules in a distributed environment and some of the most significant algorithms in this regard. Suppose DB is a database with D transactions. Also, assume that in a distributed system there are n sites named S1, S2, ..., Sn, and that the database DB is partitioned over the n sites into {DB1, DB2, ..., DBn}, where partition DBi belongs to site Si. Additionally, let Di denote the size of partition DBi, for i = 1, ..., n. Let the support counts of an itemset X in DB and in DBi be denoted X.sup and X.supi respectively; these are called the global support count and the local support count of X at site Si. Given a support threshold s, X is globally large if X.sup ≥ s × D. Likewise, X is locally large at site Si if X.supi ≥ s × Di. L denotes the globally large itemsets in DB, and L(k) the globally large k-itemsets (the itemsets of length k in L). Discovering the large itemsets L is the major concern of a distributed association rules mining algorithm. The following sections contrast the Count Distribution (CD) algorithm with the model proposed by Cheung et al. (1996).

2.2.4.1 Count Distribution (CD) algorithm

The CD algorithm is a parallel algorithm for association rules mining in distributed environments, presented by Agrawal & Shafer in 1996. This algorithm aims to minimise communication. In each iteration, it produces the candidate itemsets by executing the Apriori_gen function on the large itemsets of the previous iteration. Subsequently, the local support counts of all these candidate itemsets are calculated by each site and the obtained results are sent to the other sites. Therefore, each site is able to calculate the frequent itemsets of that iteration and proceed to the next iteration. In this method, the set of all candidate itemsets is produced repeatedly at all of the sites and each node holds a part of the DB.
Each processor is responsible for counting the local support of the global candidate itemsets. Then each site calculates the global support of its candidate itemsets; this global support is the summation of the local supports of each candidate itemset over the global distributed DB. Calculating the global support is carried out by exchanging the local supports between sites (global reduction). The globally frequent itemsets can then be computed by each site using the global supports of the candidate itemsets. The figure below indicates the execution of the Count Distribution algorithm in a distributed system with three sites performing the second iteration:

Figure 12. The second iteration of the Count Distribution algorithm

In this figure, each site, having the candidate itemsets, obtains their local supports. Subsequently, the sites exchange their local counts; as a result each site obtains the supports of the global candidate itemsets.

The Count Distribution algorithm is given below. In this algorithm, {D1, D2, ..., Dp} are the different parts of the distributed data, where Di is the part of the DB on the ith site and p is the number of sites. The following pseudocode indicates the Count Distribution (CD) algorithm presented by Agrawal & Shafer (1996):

Input: I, s, {D1, D2, ..., Dp}
Output: L
1) C1 = I;
2) for (k = 1; Ck ≠ ∅; k++) do begin
   // step one: counting to get the local counts
3)   count(Ck, Di);   // local processor is i
   // step two: exchanging the local counts with the other processors to obtain the global counts over the whole DB
4)   forall itemsets X ∈ Ck do begin
5)     X.count = Σ_{j=1..p} Xj.count;
6)   end
   // step three: identifying the large itemsets and generating the candidates of size k+1
7)   Lk = {c ∈ Ck | c.count ≥ s × |D1 ∪ D2 ∪ ... ∪ Dp|};
8)   Ck+1 = apriori_gen(Lk);
9) end
10) return L = L1 ∪ L2 ∪ ... ∪ Lk;

There are three major steps in the Count Distribution algorithm. In the first step, each site finds the local supports of the Ck candidate itemsets in its own local DB. In the second step, each site exchanges its local supports with the other sites to obtain the global supports of all the candidate itemsets. In the third step, each site obtains Lk, the globally frequent k-itemsets, and the candidate itemsets of length k+1 are obtained at each site by executing the apriori_gen() function on Lk. The algorithm repeats steps 1 to 3 as long as new candidate itemsets are produced.

2.2.4.2 A Fast Distributed algorithm

The Fast Distributed Mining (FDM) algorithm is one of the most significant algorithms in distributed data mining; it was presented by Cheung et al. (1996). Other algorithms, such as ODAM and also the algorithm proposed by this thesis, are designed on the basis of this algorithm. First, some of the techniques introduced by this algorithm are presented and defined, and then the algorithm itself is explained.

2.2.4.2.1 Candidate set generation

Observing some interesting properties of large itemsets in distributed environments can considerably lessen the number of messages exchanged over the network. For instance, according to Cheung et al. (1996), in a distributed DB there is a significant relationship between large itemsets and the sites: 'every globally large itemsets must be locally large at some site(s).' To state this precisely, if an 'itemset X is both globally and locally large at a site Si, then X is called gl-large at site Si' (Cheung et al. 1996). A small sketch of these notions is given below.
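To make the definitions concrete, the following is a small Python sketch of the locally large, globally large and gl-large tests. The function names and the data shapes (dictionaries of per-site support counts and partition sizes) are illustrative assumptions rather than notation from Cheung et al. (1996).

def is_locally_large(X, site, local_supp, local_size, s):
    """X is locally large at `site` if X.sup_i >= s * D_i."""
    return local_supp[site].get(X, 0) >= s * local_size[site]

def is_globally_large(X, local_supp, local_size, s):
    """X is globally large if the sum of its local counts is at least s * D."""
    total = sum(counts.get(X, 0) for counts in local_supp.values())
    return total >= s * sum(local_size.values())

def is_gl_large(X, site, local_supp, local_size, s):
    """gl-large at S_i: X is both globally large and locally large at S_i."""
    return (is_globally_large(X, local_supp, local_size, s)
            and is_locally_large(X, site, local_supp, local_size, s))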
Just as there are monotone relationships between frequent itemsets in centralised DBs, analogous properties hold for both locally large and gl-large itemsets in a distributed DB. These properties are summarised as follows: if an itemset X is locally large at a site Si, then all of its subsets are also locally large at site Si; if an itemset X is gl-large at a site Si, then all of its subsets are also gl-large at site Si (Cheung et al. 1996). Using the following lemma, an effective technique for generating the candidate itemsets in a distributed environment can be developed.

Lemma 1: If an itemset X is globally large, then there is at least one site Si (1 ≤ i ≤ n) at which X and all its subsets are gl-large.

Proof: If X is not locally large at any site, then X.supi < s × Di for all i = 1, ..., n. Therefore X.sup < s × D and X cannot be globally large. By contradiction, X must be locally large at some site Si, and hence X is gl-large at Si. Consequently, all the subsets of X must be gl-large at Si (Cheung et al. 1996).

GLi denotes the gl-large itemsets at Si and GLi(k) the gl-large k-itemsets at site Si. Lemma 1 shows that if X ∈ L(k), then there is at least one site Si at which all the (k−1)-subsets of X are gl-large and, consequently, all of them belong to GLi(k-1).

As in the Apriori algorithm, the candidate itemsets at the kth iteration are denoted CA(k); they are obtained by applying the Apriori_gen function to L(k-1). Therefore: CA(k) = Apriori_gen(L(k-1)). For each site Si, CGi(k) denotes the candidate itemsets obtained by applying the Apriori_gen function to GLi(k-1). Therefore: CGi(k) = Apriori_gen(GLi(k-1)). CG denotes the candidate itemsets produced from the gl-large itemsets; thus CGi(k) is generated from GLi(k-1). Since GLi(k-1) ⊆ L(k−1), CGi(k) is a subset of CA(k). In the following, CG(k) is used for ∪ⁿi=1 CGi(k) (Cheung et al. 1996).

Theorem 1: For every k > 1, the set of large itemsets L(k) is a subset of CG(k) = ∪ⁿi=1 CGi(k), where CGi(k) = Apriori_gen(GLi(k-1)).

Proof: Suppose X ∈ L(k). It follows from Lemma 1 that there exists a site Si (1 ≤ i ≤ n) such that all the size-(k−1) subsets of X are gl-large at site Si. Hence X ∈ CGi(k). Therefore L(k) ⊆ CG(k) = ∪ⁿi=1 CGi(k) = ∪ⁿi=1 Apriori_gen(GLi(k-1)) (Cheung et al. 1996).

The theorem reveals that CG(k) is a subset of CA(k) and is therefore smaller than CA(k). This set is used as the candidate set for the large k-itemsets, and the theorem is the basis of candidate itemset generation in the FDM algorithm. The CGi(k) candidate itemsets can be produced locally at each site Si at the kth iteration. At the end of each iteration, the list of globally large itemsets is available at each site, and the candidate itemsets at Si for the (k+1)st iteration are produced based on GLi(k). According to the experiments, this technique reduces the number of candidate itemsets by 10-25%. The following example shows the effectiveness of Theorem 1 in reducing the number of candidate itemsets.

Example 1: Suppose there are three sites in a distributed system and DB is the database of the system, which is split into three partitions DB1, DB2 and DB3, each site owning one of these partitions. Assume that the large itemsets obtained from the first iteration are L(1) = {I, L, M, N, O, P, Q}, where I, L and M are locally large at the first site (S1), L, M and N are locally large at site S2, and O, P and Q are locally large at site S3.
Thus, GL1(1) = {I, L, M}, GL2(1) = {L, M, N} and GL3(1) = {O, P, Q}. Based on Theorem 1, the candidate 2-itemsets at site S1 are CG1(2) = Apriori_gen(GL1(1)) = {IL, IM, LM}. Similarly, CG2(2) = Apriori_gen(GL2(1)) = {LM, LN, MN} and CG3(2) = Apriori_gen(GL3(1)) = {OP, OQ, PQ}. Hence, the set of candidate 2-itemsets is CG(2) = CG1(2) ∪ CG2(2) ∪ CG3(2) = {IL, IM, LM, LN, MN, OP, OQ, PQ}, which contains 8 candidates. In contrast, applying the Apriori_gen function to L(1) generates 21 candidate itemsets. This example indicates the effectiveness of applying Theorem 1 in producing the candidate itemsets.

2.2.4.2.2 Local pruning of candidate itemsets

The previous section showed that, by using Theorem 1, the number of candidate itemsets can be reduced compared with applying the Apriori algorithm directly. This has a significant effect on the efficiency of a distributed algorithm. After generating the candidate itemsets, their support counts must be exchanged between all the sites in order to obtain the globally large itemsets. However, some of the candidate itemsets can be pruned locally before the support counts are exchanged. The idea is that, at each site Si, if a candidate itemset is not locally large at Si, there is no need for that site to take part in computing its global support count, because either the itemset is not globally large at all, or it is locally large at some other site(s), which will compute the count. Hence, to compute the large k-itemsets, at every site Si the candidate set is restricted to the itemsets that are locally large at that site. In the following, LLi(k) denotes the locally large candidates from CGi(k) at Si. In each iteration, the gl-large k-itemsets at each site Si are computed by the following procedure:

Candidate set generation: generate the candidate sets CGi(k) based on the gl-large itemsets found at site Si at the (k−1)st iteration, using the formula CGi(k) = Apriori_gen(GLi(k-1)).

Local pruning: for each X ∈ CGi(k), scan the partition DBi to compute the local support count X.supi. If X is not locally large at site Si, it is excluded from the candidate set LLi(k).

Support count exchange: broadcast the candidate sets in LLi(k) to the other sites to collect support counts, compute their global support counts and find all the gl-large k-itemsets at site Si.

Broadcast mining results: broadcast the computed gl-large k-itemsets to all the other sites (Cheung et al. 1996).

For more clarity, Example 1 is expanded into Example 2.

Example 2: Suppose the DB in the above example contains 150 transactions, with each local DB containing 50 transactions, and assume the minimum support threshold is s = 10%. As in Example 1, the candidate itemsets produced at site S1 are CG1(2) = {IL, IM, LM}, at S2 CG2(2) = {LM, LN, MN} and at S3 CG3(2) = {OP, OQ, PQ}. To obtain the large 2-itemsets, the local support counts at each site are computed first.

Table 6. Locally large itemsets
At S1: IL.sup1 = 5, IM.sup1 = 3, LM.sup1 = 10
At S2: LM.sup2 = 8, LN.sup2 = 10, MN.sup2 = 4
At S3: OP.sup3 = 4, OQ.sup3 = 8, PQ.sup3 = 4

As the table illustrates, IM.sup1 = 3 < s × D1 = 0.1 × 50 = 5. Hence IM is not locally large and is pruned at site S1, whereas IL and LM satisfy the minimum support count and are not pruned. Thus LL1(2) = {IL, LM}. Similarly, LL2(2) = {LM, LN} and LL3(2) = {OQ}. Consequently, the number of candidate itemsets reduces to 5, which is much smaller than the original size. A minimal sketch of this per-site candidate generation and local pruning step is given below.
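The following Python sketch, under illustrative assumptions (the site's partition given as a list of item sets, a simplified join step standing in for Apriori_gen, and hypothetical function names), shows the candidate generation and local pruning that one site performs in each FDM iteration.

from itertools import combinations

def apriori_gen(itemsets):
    """Join step of Apriori: combine k-itemsets that share k-1 items."""
    return {a | b for a, b in combinations(itemsets, 2) if len(a | b) == len(a) + 1}

def fdm_local_step(site_db, gl_large_prev, s):
    """One FDM iteration at a single site S_i (sketch): generate CG_i(k) from
    the gl-large (k-1)-itemsets, count locally, and keep only the locally
    large candidates LL_i(k), which are the only ones broadcast afterwards."""
    cg = apriori_gen(gl_large_prev)                          # CG_i(k)
    local_count = {X: sum(1 for t in site_db if X <= t) for X in cg}
    threshold = s * len(site_db)                             # s * D_i
    return {X: c for X, c in local_count.items() if c >= threshold}   # LL_i(k)

For site S1 in Example 2, the gl-large 1-itemsets {I}, {L} and {M} would yield CG1(2) = {IL, IM, LM}, and with s = 10% over 50 transactions only IL and LM would be kept for broadcasting.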
After the local pruning is finished, the support counts of the remaining candidate itemsets are collected by distributing these candidates to the other sites. The result is illustrated in the following table:

Table 7. Globally large itemsets
Locally large candidate | Broadcast request from | X.sup1 | X.sup2 | X.sup3
IL | S1 | 5 | 4 | 4
LM | S1, S2 | 10 | 8 | 2
LN | S2 | 4 | 10 | 4
OQ | S3 | 4 | 4 | 8

At the end of the iteration, LM.sup = 10 + 8 + 2 = 20 > s × D = 0.1 × 150 = 15, whereas IL.sup = 5 + 4 + 4 = 13 < s × D = 15, so IL is not globally large. The gl-large 2-itemsets at sites S1, S2 and S3 are GL1(2) = {LM}, GL2(2) = {LM, LN} and GL3(2) = {OQ} respectively. Consequently, the large 2-itemsets are L(2) = {LM, LN, OQ}.

Some itemsets, such as LM, are locally large at more than one site. There is no need for all of those sites to announce them as large itemsets; one site is sufficient. Cheung et al. (1996) introduce an optimisation technique to remove such redundancies. To support steps 2 and 3, each site needs to find the support counts of its own candidate itemsets and, at the support count exchange stage, also the support counts of the candidate sets received from other sites. An elementary solution is for each local site to scan its DB twice, but this reduces the efficiency of the algorithm because each local DB is essentially read twice. The more feasible solution is that, since every site already has the globally large itemsets of the previous iteration, the support counts of all the candidate itemsets are computed in a single scan of each local DB and stored in a hash tree.

To decrease the number of messages per candidate itemset, the algorithm uses a technique called count polling. In this technique, an assignment function assigns each candidate itemset to a polling site, which is independent of the sites at which the itemset is large. The polling site of an itemset is responsible for determining whether that itemset is globally large. Consequently, this method reduces the number of exchanged messages. Cheung et al. (1996) illustrate the method with the following example. Suppose that, in the previous example, S1 is the polling site for IL and LM, S2 for LN and S3 for OQ. Considering the polling sites, S1 is responsible for polling the support counts of IL and LM. In the simple case of IL, S1 broadcasts the polling request to the other sites. But for LM, which is locally large at S1 and S2, S2 transfers the pair (LM, LM.sup2) = (LM, 8), so S1 only needs to repeat the procedure for S3. Once S3 sends back the support count LM.sup3 = 2 to S1, S1 computes the support count of LM as LM.sup = 10 + 8 + 2 = 20 > 15. Consequently, S1 finds that LM is a globally large itemset. As this example indicates, the duplicate polling messages for LM have been eliminated.

2.2.4.2.3 FDM algorithm

The basic version of the FDM algorithm (FDM-LP, FDM with Local Pruning) presented by Cheung et al. (1996) is as follows:

Input: DBi (i = 1, ..., n): the part of the distributed DB at site Si.
Output: L: the frequent itemsets in the global distributed DB.
Method: iterative execution of the following code (for the kth iteration) by each site Si in a distributed manner. The algorithm finishes when CG(k) = ∅ or L(k) = ∅.

(1) if k = 1 then
(2)   Ti(1) = get_local_count(DBi, ∅, 1)
(3) else {
(4)   CG(k) = ∪ⁿi=1 CGi(k) = ∪ⁿi=1 Apriori_gen(GLi(k-1))
(5)   Ti(k) = get_local_count(DBi, CG(k), i) }
(6) for all X ∈ Ti(k) do
(7)   if X.supi ≥ s × Di then
(8)     for j = 1 to n do
(9)       if polling_site(X) = Sj then insert (X, X.supi) into LLi,j(k);
(10) for j = 1 to n do send LLi,j(k) to site Sj;
(11) for j = 1 to n do {
(12)   receive LLj,i(k);
(13)   for all X ∈ LLj,i(k) do {
(14)     if X ∉ LPi(k) then insert X into LPi(k);
(15)     update X.large_sites; } }
(16) for all X ∈ LPi(k) do
(17)   send_polling_request(X);
(18) reply_polling_request(Ti(k));
(19) for all X ∈ LPi(k) do {
(20)   receive X.supj from the sites Sj, where Sj ∉ X.large_sites;
(21)   X.sup = Σⁿj=1 X.supj;
(22)   if X.sup ≥ s × D then insert X into Gi(k); }
(23) broadcast Gi(k);
(24) receive Gj(k) from all other sites Sj (j ≠ i);
(25) L(k) = ∪ⁿi=1 Gi(k);
(26) divide L(k) into GLi(k), (i = 1, ..., n);
(27) return L(k)

During the execution of the FDM algorithm, each site Si plays different roles. In the beginning, a site acts as the 'home site' for the candidate sets it produces; subsequently it acts as a polling site collecting responses from the other sites; later, it acts as a remote site. The stages of the FDM algorithm, with the corresponding role of each site, are as follows:

1. Home site: generate the candidate itemsets and send them to the related polling sites (lines 1-10).
2. Polling site: receive candidate sets and send polling requests (lines 11-17).
3. Remote site: return support counts to the polling sites (line 18).
4. Polling site: receive support counts and find the large itemsets (lines 19-23).
5. Home site: receive the large itemsets (lines 24-27) (Cheung et al. 1996).

2.2.4.3 ODAM algorithm

In contrast to other DARM algorithms, the Optimised Distributed Association Rules Mining (ODAM) algorithm 'offers better performance by minimising candidate itemset generation costs' (Ashrafi, Taniar & Smith 2004). This algorithm intends to reduce the communication and synchronisation costs. A DARM algorithm performs better if the communication cost (the number of exchanged messages) is minimal. Likewise, synchronisation is another essential factor: a certain amount of each site's time is wasted waiting during globally frequent itemset generation.

Ashrafi, Taniar & Smith (2004) divide the optimisation techniques for communication cost into two methods. The first method is called direct support count exchange. In this method, 'all sites share a common globally frequent itemset with identical support count ... this approach focuses on a rule's exactness and correctness.' CD and FDM are instances of this method. The second method, indirect support count exchange, intends to reduce the communication costs by eliminating the exchange of global support counts. On the other hand, the correctness of DARM algorithms relies on each itemset's global support, and using partial support counts of itemsets to generate rules may result in discrepancies in the resulting rule set. The DDM algorithm applies this method.

To preserve the correctness and compactness of the association rules, the ODAM algorithm employs the first approach, and the total number of exchanged messages is reduced by applying several techniques. To speed up the counting of candidate supports, in each pass over the DB ODAM replaces the transactions with new transactions from which the globally infrequent items have been removed. Eliminating the infrequent items from every transaction increases the chance of observing identical transactions; in addition to reducing the average transaction size, this technique therefore discovers more identical transactions. The following example illustrates this technique.
Figure 13. ODAM algorithm on 3 sites
Original dataset: 1: abcde, 2: abe, 3: cd, 4: abcd, 5: abd, 6: abef, 7: ab, 8: abcdef, 9: cdf, 10: cd
Loaded directly (identical transactions merged): 1: abcde, 2: abe, 3,10: cd, 4: abcd, 5: abd, 6: abef, 7: ab, 8: abcdef, 9: cdf
After removing infrequent items: 1,4,8: abcd, 2,6,7: ab, 3,9,10: cd, 5: abd

The first dataset is the original one. As the middle dataset indicates, if the dataset is loaded into main memory directly, only one pair of identical transactions (cd) is found. However, if the dataset is loaded into main memory after the infrequent items have been eliminated from every transaction, more identical transactions are found. The support of each item is as follows: s(a) = 0.7, s(b) = 0.8, s(c) = 0.6, s(d) = 0.7, s(e) = 0.4, s(f) = 0.3; therefore e and f are recognised as infrequent items. As the last dataset shows, the transactions are much smaller after the infrequent 1-itemsets have been removed.

First, like the Apriori algorithm, ODAM computes the support counts of the 1-itemsets at each site. Second, it broadcasts these itemsets to the other sites to discover whether they are globally large or not. Subsequently, each site produces the candidate 2-itemsets and calculates their support counts. During the second pass over the DB, while the supports of the candidate 2-itemsets are being counted, the globally infrequent 1-itemsets are eliminated from all the transactions and the new transactions (those without infrequent items) are placed in main memory. While the new transactions are inserted into memory, identical transactions are detected: the counter of an already-present transaction is increased by one, and the counter of a new transaction is set to one. Each site then scans main memory to count the supports of the candidate itemsets. Subsequently, the globally frequent itemsets of that pass are determined by broadcasting the support counts obtained at each site to the other sites.

In this method, the total number of transactions may exceed the capacity of main memory. To overcome this problem, a horizontal fragmentation technique is proposed: the dataset is fragmented into several horizontal partitions, the infrequent items are deleted from each partition and the transactions are inserted into main memory. As before, the existence of each transaction is checked and the appropriate value is assigned to its counter. Finally, all the memory entries for that partition are written to a temporary file, and this procedure is repeated for every partition. By combining the temporary files and merging the identical transactions, a new dataset is produced at every site, which is then used by the algorithm.

Compared with the CD and FDM algorithms, ODAM is more efficient in terms of execution time and message exchange cost. Additionally, ODAM scales well as the number of participating sites in a distributed system increases. Fundamentally, the ODAM algorithm is designed for data which is geographically distributed. It computes the support counts of candidate itemsets more quickly and reduces the average size of the transactions, the datasets and the exchanged messages. A minimal sketch of the transaction reduction step described above is given below.
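The sketch below, in Python, removes the globally infrequent items from each transaction and merges identical reduced transactions into a counter. The function name and data shapes are assumptions made for illustration, not code from Ashrafi, Taniar & Smith (2004).

from collections import Counter

def reduce_transactions(transactions, frequent_items):
    """ODAM-style transaction reduction (sketch): drop globally infrequent
    1-itemsets from every transaction, then merge identical transactions,
    keeping a multiplicity counter instead of storing duplicates."""
    merged = Counter()
    for t in transactions:
        trimmed = frozenset(t) & frequent_items      # remove infrequent items
        if trimmed:                                  # skip transactions that become empty
            merged[trimmed] += 1
    return merged                                    # {reduced transaction: multiplicity}

Applied to the ten transactions of Figure 13 with the frequent items {a, b, c, d}, this collapses the dataset to abcd ×3, ab ×3, cd ×3 and abd ×1, as in the last dataset of the figure.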
2.2.4.4 DDM, PDDM and DDDM algorithms

The Distributed Decision Miner (DDM), Preemptive Distributed Decision Miner (PDDM) and Distributed Dual Decision Miner (DDDM) algorithms were proposed by Schuster and Wolf (2004) for discovering association rules in a distributed environment. A brief description of each of them follows.

DDM performs well even with skewed data or when the sizes of the data partitions at the different sites vary; furthermore, it overcomes the scalability problem. According to Schuster and Wolf (2001), 'the basic idea of this algorithm is to verify that an item set is frequent before collecting its support counts from all parties'. Although this algorithm differs from FDM in many aspects, this is the most important discrepancy between the two algorithms. In DDM, after a site discovers a locally large itemset, the support count of that itemset is not immediately requested from all sites; instead, the sites negotiate (through message exchange) to decide which itemsets are globally large, and only then is the support count request broadcast to all sites containing that itemset. Obviously this method reduces the number of exchanged messages, as no message is wasted on an itemset which is only locally, but not globally, large.

The PDDM algorithm is designed to improve the communication complexity of DDM. Experience indicates that not all participating sites are equally important: a site may contain more large itemsets than the others. For instance, the DB of a superstore includes more significant data than the DB of a grocery store. This algorithm allows sites with extreme support counts to broadcast their supports at an earlier stage of the negotiation. Consequently, restraining the DBs with less significant data from broadcasting their messages may result in better usage of the bandwidth.

DDDM intends to reduce the communication cost in a different way. The fundamental idea of this algorithm is that a DDM-type algorithm is employed to detect the large itemsets while also considering the confidence of the rules: if one site proposes that a rule is globally confident and there are no objections from the other sites, then the rule is considered globally significant.

The number of partitions and the number of candidate itemsets are two important factors in the complexity of all DARM algorithms. These algorithms can be used when the number of sites is very large, for example 10,000, or when there is a bandwidth limit. A parallel data mining operation over many separate computers can be executed using these algorithms; as they divide the data sets into small partitions that fit in the memory of every computer, the association rules can be produced more quickly.

2.2.5 Comparing the distributed algorithms

A brief comparison between some of the most important DARM algorithms is provided below. Apart from the distributed sampling algorithm, all the DARM algorithms are extended versions of the Apriori algorithm. For example, the Count Distribution algorithm is obtained by parallelising the Apriori algorithm and has high computation and communication complexity. The FDM algorithm improves the efficiency of the Count Distribution algorithm by exploiting the fact that every globally frequent itemset must be frequent at least at one local site. The computation and communication complexity of the FDM algorithm is much lower than that of the Count Distribution algorithm when the number of sites is small or the data skewness across sites is high; apart from these two points, the two algorithms are the same. The Count Distribution (CD) and FDM algorithms are two well-known DARM algorithms and also serve as a standard benchmark for new algorithms in terms of execution time and communication complexity.
The main discrepancy between FDM and DDM is that FDM communicates with all sites to find a frequent itemset and its global support even when that itemset may turn out not to be globally frequent, whereas if DDM recognises that a frequent itemset cannot be globally frequent, it ignores the itemset and does not send any extra messages to calculate the exact value of its support.

The FDM algorithm is the basis for the design of the ODAM algorithm. In the ODAM algorithm the cost of reading from the disk is reduced by several techniques and the communication cost between the sites is also optimised. The DDM and distributed sampling algorithms behave quite well, especially from the communication aspect and in modern distributed systems such as peer-to-peer systems; these two are scalable and have room for further development. The DTFIM algorithm is the distributed version of the FIM algorithm, which is a Trie-based algorithm. This algorithm uses some of the FDM algorithm's techniques, but it is more efficient than FDM as it uses the Trie data structure.

3. Proposed algorithm by this research

3.1 Mining the non-derivable itemsets in distributed environments

The experiments indicate that discovering a 'concise representation and then creating the frequent itemsets from this representation, outperforms existing frequent set mining algorithms' (Calders & Goethals 2002). Since the concept of derivability also holds in distributed environments, this section introduces the distributed deduction rules, which are an extension of the deduction rules presented by Calders & Goethals (2002) for centralised environments.

For applying the NDI algorithm in a distributed environment there are two possibilities. The first approach is to transfer all the data of the different sites into one DB and then apply the NDI algorithm. However, due to high transmission costs and security issues, this solution is not as efficient as the second approach, which is to apply a distributed NDI algorithm. In this method there is no need to transfer the data: the derivable itemsets are mined locally, whereas the non-derivable itemsets are mined in a distributed way.

The distributed deduction rules are defined as follows. Assume that I = {i1, i2, ..., im} is a set of items. Additionally, let DB be a distributed database which contains D transactions. Also, suppose that there are n distributed sites S1, S2, ..., Sn, where {DB1, DB2, ..., DBn} are the DBs of the sites, and let suppi(J) denote the support count of an itemset J at site Si. As mentioned, the target is finding the non-derivable itemsets in these distributed DBs. The extended derivation rules for obtaining the tight bounds on the support of an itemset I in a distributed DB are as follows:

If |I\X| is odd then
supp(I) = supp1(I) + supp2(I) + ... + suppn(I) ≤ ∑_{X ⊆ J ⊂ I} (−1)^{|I\J|+1} (supp1(J) + supp2(J) + ... + suppn(J)) = ∑_{X ⊆ J ⊂ I} ((−1)^{|I\J|+1} ∑_{1 ≤ i ≤ n} suppi(J))

If |I\X| is even then
supp(I) = supp1(I) + supp2(I) + ... + suppn(I) ≥ ∑_{X ⊆ J ⊂ I} (−1)^{|I\J|+1} (supp1(J) + supp2(J) + ... + suppn(J)) = ∑_{X ⊆ J ⊂ I} ((−1)^{|I\J|+1} ∑_{1 ≤ i ≤ n} suppi(J))

Therefore the deduction rules in the distributed environment are summarised as follows:

If |I\X| is odd then
supp(I) ≤ ∑_{X ⊆ J ⊂ I} ((−1)^{|I\J|+1} ∑_{1 ≤ i ≤ n} suppi(J))

If |I\X| is even then
supp(I) ≥ ∑_{X ⊆ J ⊂ I} ((−1)^{|I\J|+1} ∑_{1 ≤ i ≤ n} suppi(J))

As the formulas show, the non-derivable itemsets can be extracted without the need to transfer the data from the distributed DBs into a centralised DB; a small sketch of this observation is given below.
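The following Python sketch makes this explicit: the global support of every subset is simply the sum of its per-site counts, after which the earlier deduction_bounds sketch can be applied unchanged. The variable names and data shapes are illustrative assumptions.

def global_supports(local_supps):
    """Combine per-site support counts into global counts:
    supp(J) = supp_1(J) + ... + supp_n(J)."""
    total = {}
    for site_counts in local_supps:
        for J, c in site_counts.items():
            total[J] = total.get(J, 0) + c
    return total

# The distributed bounds on an itemset I are then obtained exactly as in the
# centralised case, by feeding the summed counts to the earlier sketch:
#   l, u = deduction_bounds(I, global_supports([supp_S1, supp_S2, supp_S3]))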
The deduction rules for an itemset are calculated from the support counts of its subsets, which are obtained from the other sites. Since, in the proposed algorithm, the support counts of the subsets of a frequent itemset have already been calculated by each site, there is no need to recompute the inner sum in the above formulas. As mentioned before, the deduction rules can generate a concise representation of the frequent itemsets, and it has been explained how all the frequent itemsets can be produced from the non-derivable itemsets; a similar procedure can be used to generate all the frequent itemsets in a distributed environment.

In each iteration of any DARM algorithm there is a number of candidate itemsets. To determine whether the candidate itemsets are globally large (frequent) or not, a distributed algorithm has to scan the local DBs and exchange the support counts of the discovered candidate itemsets. However, for calculating the support count of a derivable candidate itemset there is no need to scan the local DBs or transfer support counts, as each local site can independently recognise whether a derivable itemset is frequent or not; calculating the support count of a non-derivable candidate itemset, in contrast, requires scanning the local DBs and exchanging messages. Moreover, after some iterations there may be no non-derivable itemsets left, and each site can then carry out its operations independently, without communicating with the other sites or even scanning its local DB. The following example indicates the effectiveness of using the deduction rules in improving the efficiency of a DARM algorithm.

Example 1: Suppose a distributed DB with three sites S1, S2 and S3, and let D1, D2 and D3 be the local DBs. The target is mining all the frequent itemsets, where the minimum support threshold (minsup) and the iteration depth are both set to 3.

Figure 14. Implementation of the new algorithm on the sample distributed DBs
D1: Tid 1: a,b,c; Tid 2: a,c,d,f
D2: Tid 1: a,b,d; Tid 2: c,d,e,f; Tid 3: b,c,d
D3: Tid 1: a,d; Tid 2: b,d; Tid 3: b,c,d; Tid 4: b,c,d; Tid 5: a,b,c,d

In this example, Bodon's Trie data structure is used for mining the frequent itemsets, together with the deduction rules; additionally, the candidate itemset generation is based on the FDM algorithm. In this method a Trie is built at each local site, and the Tries are updated by the local sites at the end of each iteration. All the 1-itemsets, however, are kept in a simple vector, and the 2-itemsets in a two-dimensional array; the frequent k-itemsets (k ≥ 3) are stored in the Trie data structure. At the beginning each site scans its local DB independently and the support counts of the 1-itemsets are determined, as follows:

Figure 15. Supports counting at distributed sites
At S1: {a}: 2, {b}: 1, {c}: 2, {d}: 1, {f}: 1
At S2: {a}: 1, {b}: 2, {c}: 2, {d}: 3, {e}: 1
At S3: {a}: 2, {b}: 4, {c}: 3, {d}: 5, {f}: 1

The support count of every item is stored in a vector at each local site. Subsequently, the local support counts are exchanged and the globally large 1-itemsets are determined:

Figure 16. The global support counts
{a}: 5, {b}: 7, {c}: 7, {d}: 9

Following this, the support counts of the local candidates are updated and those whose support count is less than the support threshold are deleted; in this example the items e and f are eliminated.
From this stage on, before sending its local candidate k-itemsets to the central site, each site first checks the derivability of its candidate itemsets using the distributed deduction rules. The deduction rules are calculated from the support counts of the previous iteration, which are kept by the central site; consequently, the tight bounds for each candidate itemset are determined. In this example the candidate 2-itemsets are {ab, ac, ad, bc, bd, cd}. For instance, the deduction rules for the itemset ab are computed as follows:

R0 : supp(ab) ≥ supp(a) + supp(b) − supp({}) = 5 + 7 − 10 = 2
Ra : supp(ab) ≤ supp(a) = 5
Rb : supp(ab) ≤ supp(b) = 7
Rab : supp(ab) ≥ 0

As the rules show, the support of the itemset ab lies in the interval [2,5]. The upper bound 5 comes from the principle that the support of ab cannot exceed the support of a. The lower bound is 2 because there are 5 transactions containing a and 7 containing b in a total of 10 transactions, so at least 2 transactions must contain both a and b. In general, the minimum of the upper bounds and the maximum of the lower bounds are taken as the upper and lower bound of an itemset. Consequently, ab is a non-derivable itemset. Similarly, in this iteration the tight bounds for the rest of the itemsets are calculated and the derivable and non-derivable itemsets are determined; the derivable itemsets are then deleted from the list of candidate 2-itemsets. In this example there is no derivable 2-itemset at this iteration, so nothing is eliminated and the candidate 2-itemsets are sent to the central site. The following figure indicates the local and global support counts of the candidate 2-itemsets:

Figure 17. Candidate 2-itemsets support counting
Itemset | Support at S1 | Support at S2 | Support at S3 | Global support
ab | 1 | 1 | 1 | 3
ac | 2 | 0 | 1 | 3
ad | 1 | 1 | 2 | 4
bc | 1 | 1 | 3 | 5
bd | 0 | 2 | 4 | 6
cd | 1 | 2 | 3 | 6

At the third iteration, the candidate 3-itemsets are determined as {abc, abd, acd, bcd}. Since all three sites keep the support counts of the candidate itemsets of the first and second iterations, each site calculates the deduction rules independently. For instance, the deduction rules for the candidate itemset abc are obtained as follows:

R0 : supp(abc) ≤ supp(ab) + supp(ac) + supp(bc) − supp(a) − supp(b) − supp(c) + supp({}) = 3 + 3 + 5 − 5 − 7 − 7 + 10 = 2
Ra : supp(abc) ≥ supp(ab) + supp(ac) − supp(a) = 3 + 3 − 5 = 1
Rb : supp(abc) ≥ supp(ab) + supp(bc) − supp(b) = 3 + 5 − 7 = 1
Rc : supp(abc) ≥ supp(ac) + supp(bc) − supp(c) = 3 + 5 − 7 = 1
Rab : supp(abc) ≤ supp(ab) = 3
Rac : supp(abc) ≤ supp(ac) = 3
Rbc : supp(abc) ≤ supp(bc) = 5
Rabc : supp(abc) ≥ 0

Thus the support of the itemset abc must lie in the interval [1,2]. Similarly, the support of the itemset abd lies in the interval [2,5]. Additionally, the bounds on the support of the itemset acd are calculated as follows:

R0 : supp(acd) ≤ supp(ac) + supp(ad) + supp(cd) − supp(a) − supp(c) − supp(d) + supp({}) = 3 + 4 + 6 − 5 − 7 − 9 + 10 = 2
Ra : supp(acd) ≥ supp(ad) + supp(ac) − supp(a) = 4 + 3 − 5 = 2
Rc : supp(acd) ≥ supp(ac) + supp(cd) − supp(c) = 3 + 6 − 7 = 2
Rd : supp(acd) ≥ supp(ad) + supp(cd) − supp(d) = 4 + 6 − 9 = 1
Rac : supp(acd) ≤ supp(ac) = 3
Rad : supp(acd) ≤ supp(ad) = 4
Rcd : supp(acd) ≤ supp(cd) = 6
Racd : supp(acd) ≥ 0

As the deduction rules show, the minimum of the upper bounds and the maximum of the lower bounds are both 2, so the support of this itemset lies in the interval [2,2]. Therefore acd is a derivable candidate itemset and is removed from the list of candidate 3-itemsets.
Likewise, the support of the itemset bcd lies in the interval [4,4] and it is deleted as well. Consequently, for derivable itemsets such as bcd and acd there is no need to scan the local DBs or exchange their support counts in order to calculate the global support count. The list of candidate 3-itemsets after applying the deduction rules is thus reduced to {abc, abd}. Since the global support count of both of these non-derivable candidate 3-itemsets is 2, neither is globally large, and the globally large non-derivable itemsets are summarised by the globally large 2-itemsets. The following figure indicates the final Trie:

Figure 18. Final Trie (the Trie holding the globally large 1- and 2-itemsets together with their support counts)

If the frequent itemsets in the above example were found by the FDM algorithm alone (without the deduction rules), the extra cost of information exchange and of scanning the local DBs for the derivable itemsets in the third and fourth iterations would be added. Moreover, since this method uses the Trie data structure, it is memory efficient. Most distributed DBs contain a number of derivable itemsets, and how many there are depends on the nature of the distributed DBs. Since there is no need to calculate the global support counts of the derivable itemsets, the costs of scanning the local DBs and exchanging the information are eliminated; this method therefore also reduces the network traffic, and the result can be achieved in a shorter time. However, the efficiency of this method depends heavily on the nature of the distributed DBs and the number of derivable itemsets.

3.2 Proposed algorithm

As mentioned earlier, the algorithm proposed by this thesis employs the distributed NDI algorithm and the DTFIM algorithm; moreover, Lemma 1 proposed by Cheung et al. (1996) is used. The algorithm finds all the frequent derivable and non-derivable itemsets. Each iteration of this algorithm involves large numbers of candidate itemsets whose support counts each site can compute in order to determine their frequency. After some iterations, all the candidate itemsets are derivable and consequently each site can proceed independently. This process continues until no candidate itemsets are produced. In the above example all the deduction rules are evaluated but, empirically, in most cases, evaluating the deduction rules up to a limited depth is sufficient; Calders & Goethals (2002) state that 'in practice most pruning is done by the rules of limited depth.' The reason is that evaluating all the deduction rules for itemsets with a large number of elements (items) is very time consuming. Admittedly, the appropriate depth is related to the size of the distributed DBs; in DBs with a significant number of transactions, the deduction rules should be evaluated only up to low depths, for instance 3.

3.3 Step by step explanation of the new algorithm

Production of the globally frequent 1-itemsets (GL1):

i. Developing the local 1-itemset vectors: the local DBs are scanned by their local sites independently. The names of the 1-itemsets (the item names) and their local support counts are stored locally in a vector.

ii. Global 1-itemsets: the support counts are exchanged between the sites to determine the globally large 1-itemsets (GL1).

iii. Initialising the local Tries: each local site initialises its local Trie based on GL1, so all the sites have the same Trie at the end of the pass.

Production of the globally frequent 2-itemsets (GL2):

iv.
3.3 Step by step explanation of the new algorithm

Production of the globally frequent 1-itemsets (GL1):

i. Developing the local 1-itemset vectors: Each local DB is scanned by its local site independently; the names of the 1-itemsets (the item names) and their local support counts are stored locally in a vector.

ii. Global 1-itemsets: The support counts are exchanged among the sites to determine the globally large 1-itemsets (GL1).

iii. Initialising the local Tries: Each local site initialises its local Trie from GL1, so all the sites hold the same Trie at the end of this pass. (A minimal sketch of such an itemset Trie is given after Table 8.)

Production of the globally frequent 2-itemsets (GL2):

iv. Local candidate 2-itemset generation: Based on its Trie, each site determines the local candidate 2-itemsets and stores them in a two-dimensional array.

v. Applying the deduction rules: Each site applies the deduction rules to its local candidate 2-itemsets. (The deduction rules are calculated from the support counts of the prior iteration, which are already stored in the local Trie of each site.) The derivable itemsets are then removed from the list of non-derivable local candidate 2-itemsets.

vi. Globally large 2-itemsets: The support counts of the non-derivable locally frequent 2-itemsets are exchanged among the sites and the global support counts of the 2-itemsets are computed.

vii. Updating the local Tries: The local sites update their Tries by inserting the globally large 2-itemsets.

Production of the globally large k-itemsets (k ≥ 3):

viii. Building the local Tries: The candidate k-itemsets, CG(k), are built by each site. For this purpose, at each local site, while NDL(k-1) is not empty, a candidate tree (Trie[i]) for the local candidate large k-itemset collection CG(k) is built from NDL(k-1).

ix. Developing the local vectors and applying the deduction rules: Each site traverses its Trie (using a depth-first traversal) down to the leaves and stores their support counts in a vector. Since the vectors at the different sites have the same order, only the new support counts from the local Tries need to be transferred. As in the prior iterations, before the support counts are exchanged the deduction rules are applied and the derivable k-itemsets are eliminated from the non-derivable candidate k-itemsets.

x. Final pruning: After the exchange, each site traverses its local Trie once more and updates the local support count of each leaf node from the vector of global support counts. The k-itemsets whose updated support counts are smaller than the support threshold are deleted from the list of locally large k-itemsets.

The new algorithm consists of several sub-algorithms, two of which are presented below. For clarity, the notations used in the algorithm are listed in the following table.

Table 8. Notations used in the new algorithm

TRi(k)   The Trie data structure at site Si, which contains the non-derivable large k-itemsets
NDL(k)   An array of the globally non-derivable large k-itemsets
DL(k)    An array of the globally derivable large k-itemsets
CG(k)    An array of the non-derivable candidate k-itemsets
GLi(k)   An array of the globally large k-itemsets
L(k)     An array of the large k-itemsets
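The itemset Trie referred to in steps iii, viii and ix could be organised roughly as in the following Python sketch; the class and method names are illustrative assumptions, and the usage example simply reloads the global support counts of the running example.

class TrieNode:
    """One node of an itemset Trie: the path from the root to a node spells an
    itemset (items kept in a fixed order) and `support` stores its count."""
    def __init__(self):
        self.children = {}          # item -> TrieNode
        self.support = 0

class ItemsetTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, itemset, support):
        node = self.root
        for item in sorted(itemset):
            node = node.children.setdefault(item, TrieNode())
        node.support = support

    def get_support(self, itemset):
        node = self.root
        for item in sorted(itemset):
            node = node.children.get(item)
            if node is None:
                return None
        return node.support

    def support_vector(self, k):
        """Depth-first traversal collecting the k-itemsets (the leaves of the
        k-th iteration) in a deterministic order, so the support vectors of
        different sites line up for the exchange step."""
        vector = []
        def walk(node, prefix):
            if len(prefix) == k:
                vector.append((tuple(prefix), node.support))
                return
            for item in sorted(node.children):
                walk(node.children[item], prefix + [item])
        walk(self.root, [])
        return vector

# Reloading (roughly) the Trie of Figure 18 with the global counts of the
# running example:
trie = ItemsetTrie()
for itemset, supp in [('a', 5), ('b', 7), ('c', 7), ('d', 9),
                      ('ab', 3), ('ac', 3), ('ad', 3),
                      ('bc', 5), ('bd', 6), ('cd', 6)]:
    trie.insert(itemset, supp)
print(trie.support_vector(2))   # [(('a', 'b'), 3), (('a', 'c'), 3), ...]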
Inputs:
DBi (i = 1, ..., n): the database at each site Si
iterationDepth: the number of iterations
minSup: the support threshold
Output: the set L of all globally large itemsets.
Method: execution of the following program fragment (for the k-th iteration) at each participating site.

(1)  k := 1;
(2)  while k ≤ iterationDepth do
(3)  {
(4)      if k = 1 then
(5)          TRi(1) := findLocalCandidate(DBi, 0, 1);
(6)      else
(7)      {
(8)          candidateGen(TRi(k-1), NDL(k-1), CG(k), DL(k), DL(k-1));
(9)          if DL(k-1) ≠ 0 then
(10)             dFrequent(DL(k-1), NDL(k-1), DL(k));
(11)         TRi(k) := findLocalCandidate(DBi, CG(k), k);
(12)     }
(13)     if CG(k) ≠ 0 then                        // if CG(k) is not empty
(14)         TRi(k-1) := findNDFrequent(DBi, CG(k), k);
(15)     passLocalCandidate(TRi(k));
(16)     GLi(k) := getGlobalFrequent();           // globally large k-itemsets
(17)     updateLocalCandidates(TRi(k), GLi(k));   // prunes the local candidates which are not globally large
(18)     NDL(k) := ∪ⁿi=1 GLi(k);
(19)     k := k + 1;
(20) }
(21) L(k) := NDL(k) ∪ DL(k);
(22) return L(k);

The candidateGen procedure generates the non-derivable candidate sets and the derivable frequent itemsets.

procedure candidateGen(TRi(k-1), NDL(k-1), CG(k), DL(k), DL(k-1))
(1)  for all Z ∈ TRi(k-1) do
(2)  {
(3)      compute the [l, u] bounds of Z;
(4)      if Z.sup = Z.l or Z.sup = Z.u then
(5)      {
(6)          prune Z from NDL(k-1) and TRi(k-1) and insert it into DL(k-1);
(7)          if Z.sup = Z.l then
(8)              Z.sup := Z.l;
(9)          else
(10)             Z.sup := Z.u;
(11)     }
(12)     pCG(k) := ∪ⁿi=1 CGi(k) = ∪ⁿi=1 aprioriGen(NDLi(k-1));   // FDM candidate itemset generator
(13)     for all Y ∈ pCG(k) do
(14)     {
(15)         compute the [l, u] bounds on the support of Y;
(16)         if l ≠ u then
(17)         {
(18)             Y.l := l;
(19)             Y.u := u;
(20)             insert Y into CG(k);
(21)         }
(22)         else
(23)         {
(24)             if u ≥ minSup then
(25)             {
(26)                 insert Y into DL(k) and delete it from NDLi(k-1) and TRi(k-1);
(27)                 Y.sup := u;
(28)             }
(29)         }
(30)     }
(31) }
end procedure

procedure dFrequent(DL(k-1), NDL(k-1), DL(k))
(1)  DCG(k) := aprioriGen2(DL(k-1), NDL(k-1));   // FDM Apriori candidate generator
(2)  for all Z ∈ DCG(k) do
(3)  {
(4)      compute Z.sup;                          // compute the support of Z
(5)      if Z.sup ≥ minSup then
(6)          insert Z into DL(k) and delete it from NDLi(k-1) and TRi(k-1);
(7)  }
end procedure

At the first iteration of the main algorithm, the local support counts of the candidate 1-itemsets are calculated by the findLocalCandidate procedure at line 5. Subsequently, the local Tries (which at this stage are simply vectors) are passed from the local sites to the central site by the passLocalCandidate procedure. The central site is the root of the local Tries; it is responsible for receiving the local candidates and for determining the globally frequent k-itemsets and sending them back to the local sites (line 16). Each site then updates its Trie based on the globally frequent k-itemsets (line 17).

From the second iteration onwards, the candidateGen procedure sets the values of the non-derivable candidate and the derivable large k-itemsets. As mentioned, since the derivable k-itemsets are the same at all sites, there is no need to scan the DBs or to exchange their support counts. The dFrequent procedure retrieves the derivable frequent itemsets that remain from the sets of derivable and non-derivable itemsets of the prior iteration and updates the set of derivable frequent k-itemsets. For this purpose the dFrequent procedure uses Theorem 1 of the FDM algorithm to produce the derivable itemsets, whereas the candidateGen procedure relies on the deduction rules, evaluated up to a predefined depth, to generate the derivable k-itemsets. In the main algorithm, the non-derivable and the derivable frequent itemsets together form the set of frequent itemsets (line 21). It should be mentioned that, in each iteration of this algorithm, the set of derivable frequent itemsets is the same at all sites.
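The bound-based split performed by candidateGen (lines 13 to 30) can be pictured with the following hedged Python sketch, which reuses tight_bounds() from the earlier sketch; the function and variable names are illustrative, not the thesis implementation.

def split_candidates(candidates, supports, min_sup):
    """Split candidate itemsets into non-derivable candidates (which must
    still be counted locally and exchanged) and derivable frequent itemsets
    (whose support equals their common bound, so no scan or exchange is
    needed)."""
    cg_k = []   # non-derivable candidates, kept with their [l, u] bounds
    dl_k = []   # derivable frequent itemsets with their exact support
    for itemset in candidates:
        l, u = tight_bounds(itemset, supports)
        if l != u:
            cg_k.append((itemset, l, u))
        elif u >= min_sup:
            dl_k.append((itemset, u))
    return cg_k, dl_k

# Illustrative call for the third iteration of the running example
# (assuming `supports` holds the global 1- and 2-itemset counts):
# cg3, dl3 = split_candidates([frozenset('abc'), frozenset('abd'),
#                              frozenset('acd'), frozenset('bcd')],
#                             supports, min_sup=3)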
In the early iterations of this algorithm there are only a few derivable itemsets, but as the execution continues their number increases. This process may continue until no non-derivable itemsets are found. As discussed before, a large number of derivable itemsets increases the efficiency of this algorithm.

The candidateGen procedure (lines 1 to 11) identifies the subset of the non-derivable itemsets from the prior iteration whose support counts are equal to their upper or lower bounds. As mentioned before, the supersets of those itemsets are derivable and are therefore eliminated from the set of non-derivable itemsets and from the local Tries. Line 12 produces the candidate itemsets with the FDM candidate generator. In lines 13 to 30 the bounds on the support of the candidate itemsets are calculated; the non-derivable itemsets are then inserted into the list of non-derivable candidate itemsets (CG(k)) and the frequent derivable itemsets are added to the list of derivable itemsets (DL(k)).

The dFrequent procedure, listed above, uses the rule R_X∪{i}(I∪{i}) to simplify the computation of the support counts. The production of the derivable candidate itemsets (DCG(k)) is the first step of this procedure. The aprioriGen2 function joins the frequent derivable itemsets with the non-derivable itemsets remaining from the prior iteration; in this way all the possible extensions of the frequent derivable itemsets are obtained. The procedure therefore generates all the possible extensions and then calculates their support counts.
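The thesis does not list aprioriGen2 itself, so the following is only a rough, hypothetical Python sketch of a join of that kind: each derivable frequent (k-1)-itemset is extended by the other frequent (k-1)-itemsets, and a k-itemset is kept only if all of its (k-1)-subsets are frequent (the usual Apriori pruning); all names and details are assumptions.

def apriori_gen2(derivable_frequent, non_derivable_frequent, k):
    """Hypothetical aprioriGen2-style join: produce every k-item extension of
    a derivable frequent (k-1)-itemset whose (k-1)-subsets are all frequent."""
    frequent = set(derivable_frequent) | set(non_derivable_frequent)
    candidates = set()
    for d in derivable_frequent:
        for other in frequent:
            joined = d | other                     # union of two (k-1)-itemsets
            if len(joined) != k:
                continue
            if all(joined - {item} in frequent for item in joined):
                candidates.add(frozenset(joined))
    return candidates

# Purely illustrative call (not the thesis example):
print(apriori_gen2({frozenset('acd')},
                   {frozenset('abc'), frozenset('abd'), frozenset('bcd')}, 4))
# -> a single candidate: frozenset({'a', 'b', 'c', 'd'})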
4. Conclusion

With the immense growth in the use of distributed DBs in organisations and commercial centres, alongside large centralised DBs, the concept of data mining in these environments has attracted a great deal of attention. All kinds of data mining techniques, such as clustering and classification, are applicable in distributed data mining. The focus of this thesis is on association rules mining. Association rules mining is one of the most important data mining techniques and has many uses in commercial and non-commercial fields. Although a great deal of research has been carried out on association rules mining in centralised environments, association rules mining in distributed environments is comparatively new and there are not many methods in this regard. In distributed data mining it is not recommended to transfer the raw data to a centralised DB, because of security concerns, network traffic and the ownership rights of the participating sites. Improving efficiency, reducing communication volume, security and local site ownership are the most important issues in DARM and, more generally, in distributed data mining. The aim of this research is to achieve an efficient method for association rules mining in distributed environments that performs better than the previous algorithms.

In this research, data mining operations and their different methods were studied first. Following this, in order to present a distributed algorithm for association rules mining, association rules and their methods and algorithms in centralised environments were investigated. Association rules mining consists of two major steps: the first is finding the frequent itemsets in the DB, and the second is generating the association rules from them. Since the main and time consuming step is finding the frequent itemsets, and generating the rules from them is simple and straightforward, the discussion of association rules mining reduces to finding the frequent itemsets. The key issue in association rules mining is efficiency; the two important parameters for the efficiency of DARM are considered to be decreasing the amount of scanning and reducing the computation procedures in the data mining operations.

In this research, after studying the methods of association rules mining in centralised environments, distributed data mining and its important issues were discussed. In order to develop a new method, all the existing methods and algorithms in this area were studied carefully. One of the most important DARM algorithms is DTFIM. The proposed algorithm is based on the DTFIM algorithm and uses some of the existing techniques, such as pruning the local candidate itemsets, producing the candidate sets, gathering the support counts from the sites and reducing the number of candidate itemsets. The proposed algorithm addresses the market basket analysis problem in distributed environments by reducing the number of candidate itemsets. Moreover, it simplifies the whole process of producing interesting customer preferences and patterns.

5. Future works

Implementing and testing the algorithm on a real DB, to demonstrate its efficiency, is one of the future works. Furthermore, several opportunities for future work will be identified from the outcomes of the research. Grid computing is one of the most progressive research fields; recently, some research has been performed in the field of distributed data mining using grid services, although this work is still at an early stage. By using association rules mining, other data mining operations such as classification or clustering can also be supported. Some research has been done on finding classification rules by association rules mining, or on clustering by association rules. In some cases, using association rules for clustering and classification is very effective and more efficient than the previous, usual methods of performing these operations.

6. References

Agrawal, R, Imielinski, T & Swami, A 1993, 'Mining association rules between sets of items in large databases', In Proc. of the ACM SIGMOD Conference on Management of Data, Washington, D.C., pp. 207-216.

Agrawal, R & Srikant, R 1994, 'Fast Algorithms for Mining Association Rules', In Proceedings of the 20th VLDB Conference, Santiago, Chile, pp. 487-499.

Agrawal, R & Shafer, J 1996, 'Parallel Mining of Association Rules: Design, Implementation and Experience', IEEE Transactions on Knowledge and Data Engineering, vol. 8, no. 6, pp. 962-969.

Ansari, E, Dastghaibifard, GH, Keshtkaran, M & Kaabi, H 2008, 'Distributed Frequent Itemset Mining using Trie Data Structure', IAENG International Journal of Computer Science, vol. 35, no. 3, pp. 377-381.

Ashrafi, MZ, Taniar, D & Smith, K 2004, 'ODAM: an Optimized Distributed Association Rule Mining Algorithm', IEEE Distributed Systems Online, vol. 5, no. 3.

Berry, MJA & Linoff, GS 2003, Data Mining Techniques for Marketing, Sales, and Customer Relationship Management, Wiley Publishing Inc., Canada.

Bodon, F 2003, 'A Fast Apriori Implementation', In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations.

Bodon, F 2004, 'Surprising Results of Trie-Based FIM Algorithms', In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'04), vol. 126, Brighton, UK.
Borgelt, C 2003, 'Efficient Implementations of Apriori and Eclat', In Proceedings of the IEEE ICDM Workshop on Frequent Itemset Mining Implementations (FIMI'03), vol. 90, Melbourne, Florida, USA.

Braga, D, Campi, A, Ceri, S, Klemettinen, M & Lanzi, PL 2002, 'A Tool for Extracting XML Association Rules from XML Documents', In Proceedings of IEEE ICTAI 2002, Washington DC, USA, pp. 57-64.

Briandais, RDL 1959, 'File searching using variable-length keys', In Western Joint Computer Conference, pp. 295-298.

Brin, S, Motwani, R, Ullman, JD & Tsur, S 1997, 'Dynamic Itemset Counting and Implication Rules for Market Basket Data', In Proceedings of the 1997 ACM SIGMOD International Conference on Management of Data, vol. 26(2), pp. 255-264.

Calders, T 2004, 'Deducing Bounds on the Support of Itemsets', In Database Technologies for Data Mining - Discovering Knowledge with Inductive Queries, vol. 2682, pp. 214-233.

Calders, T & Goethals, B 2002, 'Mining All Non-Derivable Frequent Itemsets', In Proc. Principles and Practice of Knowledge Discovery in Databases (PKDD'02), vol. 243, pp. 74-85.

Calders, T & Goethals, B 2007, 'Non-Derivable Itemset Mining', Data Mining and Knowledge Discovery, vol. 14, pp. 171-206.

Calvanese, D, Giacomo, GD, Lenzerini, M, Nardi, D & Rosati, R 1998, 'Source Integration in Data Warehousing', In Proceedings of DEXA, Springer, pp. 192-197.

Cheung, DW, Han, J, Ng, VT, Fu, AW & Fu, Y 1996, 'A Fast Distributed Algorithm for Mining Association Rules', In Proc. Parallel and Distributed Information Systems, IEEE CS Press, pp. 31-42.

Cheung, DW, Lee, SD & Xiao, Y 2002, 'Effect of Data Skewness and Workload Balance in Parallel Data Mining', IEEE Transactions on Knowledge and Data Engineering, vol. 14, no. 3, pp. 489-514.

Coenen, F, Leng, P & Ahmed, Sh 2003, 'T-Trees, Vertical Partitioning and Distributed Association Rule Mining', In Proceedings of the Third IEEE International Conference on Data Mining, pp. 513-516.

Fang, YW, Zhao, XB, Zhang, GP, Wang, Y, Sun, Y & Zhang, YF 2005, 'Study on Algorithms of Parallel and Distributed Data Mining Calculation Process', In Proceedings of the Fourth International Conference on Machine Learning and Cybernetics, Guangzhou, vol. 4, pp. 2084-2089.

Fayyad, UM, Piatetsky-Shapiro, G, Smyth, P & Uthurusamy, R 1996, Advances in Knowledge Discovery and Data Mining, AAAI Press, Menlo Park, United States of America.

Feng, L, Dillon, TS, Weigand, H & Chang, E 2003, 'An XML-Enabled Association Rule Framework', In Proceedings of DEXA'03, Prague, Czech Republic, pp. 88-97.

Filho, AH, Prado, HA & Toscani, SS 2000, 'Evolving a Legacy Data Warehouse System to an Object-oriented Architecture', In Proceedings of the XX International Conference of the Chilean Computer Science Society (SCCC '00), pp. 32-40.

Frawley, WJ, Piatetsky-Shapiro, G & Matheus, CJ 1991, 'Knowledge Discovery in Databases', AI Magazine, vol. 13, no. 3, pp. 57-70.

Fukasawa, T, Wang, J, Takata, T & Miyazaki, M 2004, 'An Effective Distributed Privacy-Preserving Data Mining Algorithm', Faculty of Software and Information Science, Iwate Prefectural University, Japan, and Digitally Advanced Integrated Solutions Labs, Ltd., Japan, pp. 320-325.

Guo, Y & Grossman, R 1999, 'Scalable Parallel and Distributed Data Mining', Data Mining and Knowledge Discovery.

Han, J & Kamber, M 2006, Data Mining: Concepts and Techniques, Diane Cerra, United States of America.
Han, J, Pei, J & Yin, Y 2000, 'Mining frequent patterns without candidate generation', In Proc. 2000 ACM SIGMOD Int. Conf. on Management of Data (SIGMOD'00), pp. 1-12.

Han, J, Pei, J, Yin, Y & Mao, R 2001, 'Mining frequent patterns without candidate generation: A frequent-pattern tree approach', Data Mining and Knowledge Discovery, pp. 53-87.

Kantardzic, M 2003, Data Mining: Concepts, Models, Methods, and Algorithms, John Wiley & Sons, Inc., United States of America.

Kantarcioglu, M & Clifton, C 2004, 'Privacy-preserving Distributed Mining of Association Rules on Horizontally Partitioned Data', IEEE Transactions on Knowledge and Data Engineering, pp. 2-13.

Kargupta, H & Chan, P 2000, Advances in Distributed and Parallel Knowledge Discovery, AAAI Press.

Le, DX, Rahayu, JW & Taniar, D 2006, 'Web Data Warehousing Convergence: from Schematic to Systematic', International Journal of Information Technology and Web Engineering, vol. 1, no. 4, pp. 68-92.

Li, Z, Sun, J, Yu, H & Zhang, J 2005, 'CommonCube-based Conceptual Modeling of ETL Processes', In Proceedings of the 2005 International Conference on Control and Automation (ICCA2005), vol. 1, pp. 131-136.

Liu, G, Li, J & Wong, L 2007, 'A new concise representation of frequent itemsets using generators and a positive border', School of Computing, National University of Singapore, no. 17, pp. 35-56.

Park, JS, Chen, M-S & Yu, PS 1995, 'An Effective Hash-Based Algorithm for Mining Association Rules', In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data, vol. 24(2), pp. 175-186.

Pasquier, N, Bastide, Y, Taouil, R & Lakhal, L 1999, 'Discovering Frequent Closed Itemsets for Association Rules', In Proc. ICDT Int. Conf. on Database Theory, pp. 398-416.

Rahm, E & Do, HH (n.d.), 'Data Cleaning: Problems and Current Approaches'.

Savasere, A, Omiecinski, E & Navathe, S 1995, 'An Efficient Algorithm for Mining Association Rules in Large Databases', Technical Report No. GIT-CC-95-04.

Schuster, A & Wolf, R 2004, 'Communication-Efficient Distributed Mining of Association Rules', Computer Science Department, Technion, Israel Institute of Technology, Haifa, Israel, pp. 171-196.

Schuster, A, Wolf, R & Trock, D 2005, 'A High-Performance Distributed Algorithm for Mining Association Rules', Knowledge and Information Systems (KAIS) Journal, vol. 7, no. 4, pp. 458-475.

Shintani, T & Kitsuregawa, M 1996, 'Hash Based Parallel Algorithms for Mining Association Rules', In Proceedings of the International Conference on Parallel and Distributed Information Systems, pp. 19-30.

Sujni, P & Saravanan, V 2008, 'Hash Partitioned Apriori in Parallel and Distributed Data Mining Environment with Dynamic Data Allocation Approach', In Proceedings of the International Conference on Computer Science and Information Technology (ICCSIT '08), pp. 481-485.

Tan, PN, Steinbach, M & Kumar, V 2006, Introduction to Data Mining, Pearson Education, Inc., United States of America.

Toivonen, H 1996, 'Sampling Large Databases for Association Rules', In Proceedings of the 22nd VLDB Conference, Mumbai (Bombay), India, pp. 134-145.

Wan, JWW & Dobbie, J 2004, 'Mining Association Rules from XML Data using XQuery', In Proceedings of the Second Workshop on Australasian Information Security, Data Mining and Web Intelligence, and Software Internationalisation, vol. 32, pp. 169-174.

Wang, B 2009, 'A Research on Extraction Method of Distributed Heterogeneous Dataset in Multi-Support Association Rule Mining', 2009 ISECS International Colloquium on Computing, Communication, Control and Management, pp. 17-20.

Zaki, M 1999, 'Parallel and Distributed Association Mining: A Survey', IEEE Concurrency, vol. 7, no. 4, pp. 14-25.
Zhang, S, Zhang, J, Liu, H & Wang, W 2005, 'XAR-Miner: Efficient Association Rules Mining for XML Data', In Proceedings of the 14th International Conference on World Wide Web, Chiba, Japan, pp. 894-895.