Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Association Rules Mining in Distributed Environments By: Shamila Mafazi Supervised by: Dr. Abrar Haider Acknowledgment First I would like to thank my supervisor, Dr. Abrar Haider for his valuable support in this thesis. I specially like to extend my thank to Dr Jiuyong Li who introduced me to the Data Mining world. Introduction Data Mining refers to extraction or mining knowledge from large amount of data (Han & Kamber 2006). Data Mining Techniques include: Association rules mining, clustering, classification, prediction and so on. Association Rules Mining retrieves relation and correlation between the data in a database. Research Question How can a more efficient method for discovering association rules in a distributed environment be created? Methodology To address the research question, this research has reviewed literature relating to the following areas: Data mining in the centralised environments. Data mining in the distributed environments. Association rules mining in the distributed/centralised environments. Comparing the existing algorithms in the distributed/centralised environments and studying the advantages and disadvantages of them. Developing a concise representation particularly, distributed deduction rules. Designing the new algorithm based on DTFIM . Thesis Contribution The proposed algorithm by this research intends to resolve marketing problem in distributed environments. Additionally, it aims to profile needs of customers and their preferences in the transaction oriented systems such as, credit cards. One of the most significant problems of market basket data is dealing with large number of candidate itemsets. Retrieving interesting and meaningful patterns from these candidates is extremely difficult. Thus, this research, presents an algorithm which reduces the number of candidate sets and simplifies the process to produce interesting customer preferences and patterns. The Importance of Association Rules Better management of inventory. Better arrangement of the shelves. Enhancement in sale of items. These rules can reveal the effect of continuing or discontinuing the sale of an item on the sale of other items. Two Common Measures of Rules Interestingness The discovered rules are interesting if they satisfy the minimum support and confidence thresholds. Support of an itemset is the percentage of the transactions which contain that itemset. Confidence is the probability of occuring (such as buying) items together. Association Rules Mining (ARM) ARM in centralised environments refers to mining association rules from an integrated database Apriori AprioriTid AprioriHybrid FIM Sampling Partitioning Algorithm DIC algorithm FP Growth algorithm NDI algorithm ARM in distributed environments refers to mining association rules from distributed databases. CD FDM ODAM DTFIM Apriori Algorithm (Agrawal & Srikant 1994) Minimum support thershold=2 1-itemsets 2-itemsets 3-itemsets (Kantardzic 2003, p.168) Centralised ARM Algorithm Frequent Itemsets Mining (FIM) algorithm (Bodon 2004): Uses Trie data structure Fast construction and information retrieval Supp(a)=9 Supp(a,b)=6 Memory efficient Trie data structure (Ansari et al. 2008) Centralised ARM Algorithm Non derivable itemsets (NDI) (Calder & Geothals 2002) Deduction rules divide the itemsets of a data base into derivable and non derivable itemsets. The deduction rules set tight bounds on the support of an itemset. The least of the upper bounds and the greatest of the lower bounds are the tight bounds for an itemset. Derivable itemsets have equal upper and lower bounds. Derivable itemsets do not represent any new information, not being introduced by their subsets. The non derivable itemsets are considered as a concise representation of the frequent itemsets. Deduction Rules: An example of a DB transactions (Calders & Geothals 2007) __ supp (abc) = supp(a) –supp(ab) – supp(ac) + supp(abc) 0 ≥ supp(a) –supp(ab) – supp(ac) + supp(abc) supp(abc) ≥ supp(ab) + supp(ac) – supp(a) Summary of Deduction Rules If |I\X| is odd then (odd-itemsets) |I\J| +1 Supp (I) ≤ ∑ (-1) supp(J) X⊆ J⊂I If |I\X| is even then (even-itemsets) |I\J| +1 Supp (I) ≥ ∑ (-1) X ⊆ J ⊂I supp(J) The Importance of Distributed Data Mining Distributed databases. Transferring data within sites is extremely costly and time consuming. Due to security issues and data ownership, transferring the local data is not permitted Distributed ARM Algorithms Distributed Trie-based Frequent Itemset Mining (DTFIM) (Ansari et al. 2008) Trie based data structure. Memory efficient Fast searching The more skewed data bases, the algorithm acts more efficient. My Contribution Creating a concise representation by applying distributed deduction rules. Applying the distributed deduction rules inside the DTFIM algorithm Distributed Deduction Rules If |I/X| is odd then |I\J| +1 Supp (I) ≤ ∑ ∑ Suppi (J)) ((-1) 1 ≤i ≤ n X⊆ J⊂I If |I/X| is even then |I\J| +1 Supp (I) ≥ ∑ X⊆ J⊂I ((-1) ∑ Suppi (J)) 1 ≤i ≤ n The Proposed Algorithm (cont.) Inputs: DBi (i=1,…,n): the databases at each site Si. iterationDepth: number of iterations minSup: the support threshold Output: The set of all globally large itemsets L. Method: Execution of the following program fragment (for the k-th iteration) at the participating sites. The Proposed Algorithm (cont.) k:=1; while k ≤ iterationDepth do { if k=1 then TR i(1) := findLocalCandidate (DB i,0,1); else { candidateGen (TR i(k-1), NDL(k-1), CG(k), DL(k), DL(k-1)); if DL(k-1) ≠ 0 then dFrequent (DL(k-1), NDL(k-1), DL(k)); TR i(k) := findLocalCandidate (DBi, CG(k), k); } if CG(k) ≠ 0 then // if the CG(k) is not empty TRi(k-1) :=findNDFrequent (DBi, CG(k), k); passLocalCandidate (TRi(k)); GLi(k) := getGlobalFrequent (); // globally large k-itemsets updateLocalCandidates (TRi(k), GLi(k)); // prunes the local candidates which are not globally large NDL(k) := ∪ⁿ i=1 GLi(k) ; k:=k+1; } L(k) := NDL(k) ∪ DL (k) ; return L(k); Candidate Set Generating Procedure procedure candidateGen (TR i(k-1), NDL(k-1), CG(k), DL(k), DL(k-1)) for all Z ∈TR i(k-1) do { compute the [l,u] bounds of Z if Z.sup=Z.l or Z.sup=Z.u then { Prune Z from NDL(k-1) and TRi(k-1) and insert it into DL(k-1) ; if Z.sup=Z.l then Z.sup=Z.l; else Z.sup=Z.u; } pCG(k) =∪ⁿ i=1 CG i(k) =∪ⁿ i=1 aprioriGen(NDLi(k-1)); //FDM candidate itemsets generator for all Y ∈ pCG(k) do { compute [l,u] bounds on support of Y if l≠u then { Y.l=l; Y.u=u; Insert Y into CG(k) ; } else { If u≥ minSup then { Insert Y into DL(k), delete it from NDLi(k-1) and TRi(k-1) ; Y.sup=u } } } } end procedure Derivable Frequent Itemsets Procedure Procedure dFrequent (DL(k-1), NDL(k-1), DL(k)) DCG(k) := aprioriGen2(DL(k-1), NDL(k-1)); // FDM apriori candidate generator. for all Z ∈ DCG(k) do { compute Z.sup; //compute the s support of Z if Z.sup ≥ minSup then Insert Z into DL(k), delete it from NDLi(k-1) and TRi(k-1) ; } end procedure Explanation of the Proposed Algorithm Developing the local 1-itemsets vectors: The local DBs are scanned by their local sites independently. the 1-itemsets are determined and their local support counts are stored in the local vectors. Global 1-itemsets: The support counts are exchanged within the sites The globally large 1-itemsets are determined. Initialising the local Tries: Each local site initialise their local Trie based on Global 1-itemsets Explanation of the Proposed Algorithm Production of the global frequent k-itemsets (k≥2): The local candidate 2-itemsets are generated based on the local Tries the local candidate 2-itemsets are stored in a two dimensional array. Applying the deduction rules: The deduction rules are applied on the local candidate 2-itemsets by each site. The derivable itemsets are deleted from the list of the non-derivable local candidate 2-itemsets. Global large 2-itmsests: The support counts of non-derivable locally frequent 2-itemsets are exchanged within the sites. The global support count for 2-itemsets are determined. Updating the local Tries: Local sites update their Trie by inserting the global large 2-itmsets. Conclusion This research resolves the problem of market basket data in distributed environments and presents an algorithm which reduces the number of candidate sets and simplifies the process to produce interesting customer preferences and patterns. References Ansari, E, Dastghaibifard, GH, Keshtkaran, M, Kaabi, H 2008,‘Distributed Frequent Itemset Mining using Trie Data Structure’, IAENG International Journal of Computer Science, vol. 35, no. 3, pp. 377-381. Calders, T& Goethals, B 2007, ‘Non Derivable Itemsets mining’, data mining Knowledge Discovery in Databases, vol. 14, pp. 171-206. Han, J, Kamber, M 2006, Data mining: concepts and techniques, Diane Cerra, United States of America. Kantardzic, M 2003, Data Mining Concepts, Models, Methods, and Algorithms, A John Wiley & Sons, INC, United State of America. Thank you