Survey
Indian Journal of Science and Technology, Vol 8(24), DOI: 10.17485/ijst/2015/v8i24/80157, September 2015
ISSN (Print): 0974-6846
ISSN (Online): 0974-5645

Survey on Mining Association Rule with Data Structures

Y. Jeya Sheela and S. H. Krishnaveni*
Department of Information Technology, Noorul Islam University, Kumaracoil, Kanyakumari - 629180, Tamil Nadu, India; [email protected], [email protected]

Abstract
The development of data mining applications is gaining importance. Data mining is the process of extracting interesting, useful and understandable patterns from huge databases. Mining association rules from large databases is one of the important tasks in data mining. Association rule mining has two steps: finding frequent item sets and generating rules from them. There are many algorithms for finding frequent item sets. Processing large databases to generate association rules requires repeated scans, which increases the computing time. Data structures play a main role in reducing the complexity of these computations. This paper focuses on the standard data structures used in mining association rules efficiently and illustrates them with examples.

Keywords: Data Mining, Data Structures, FP-Tree, Pre-Large Tree, Tries

1. Introduction
Data mining is the process of extracting interesting, hidden and useful patterns from large databases such as relational and transactional databases, data warehouses and XML repositories. It is also known as Knowledge Discovery in Databases (KDD). In general there are three phases: pre-processing, data mining and post-processing. Real-world data is often inconsistent and incorrect and may contain many errors. Data pre-processing is the established technique for dealing with such issues.
Data goes through a sequence of steps during pre-processing: data cleaning, integration, transformation, reduction and discretization. The main step in KDD is the data mining process, where various mining algorithms are applied to extract useful, previously unseen information. The third step, post-processing, assesses the mining outcome against the users' requirements and domain knowledge. The result is presented only if it is acceptable; otherwise some or all of these processes are repeated until a satisfactory result is obtained. Finally, the result can be presented in any of these forms: raw data, tables, decision trees, rules, charts, data cubes or 3D graphics.

There are many efficient mining algorithms, and association rule mining plays an important role among them. It is used to find relationships among data items and frequent patterns in a large transactional database. For example, in a supermarket, once milk is bought, a list of related items such as butter (40%) and bread (25%) can be presented for additional purchase. The association rule here is that when milk is bought, butter is bought with it 40% of the time and bread 25% of the time. Such rules support strong decisions in marketing management. Association rules are applied in telecommunication networks, market and risk management, inventory control and other areas.

Association rule mining can be used along with data structures. A data structure is a way of organizing data so that it can be processed efficiently. Section 2 studies various existing data structures, Section 3 studies the data structures used in generating association rules together with their advantages and disadvantages, and Section 4 discusses the results and conclusion.

*Author for correspondence

2. Various Data Structures used in Data Mining Process

2.1 Trie Data Structure
A trie, also called a digital tree or prefix tree, is an ordered tree data structure used to store keys that are strings. The trie was originally introduced to store and efficiently retrieve the words of a dictionary. A trie [1] is a rooted, directed tree. The root is at depth 0, and a node at depth i points to nodes at depth i + 1. A pointer, called a link, is tagged with a letter. A special letter * represents the null character. If node i points to node j, then node i is the parent of node j and node j is the child of node i. Each leaf node l denotes a word, which is the concatenation of the letters on the path from the root to l. If two words share the same first k letters, the first k steps of their paths are identical.

Let S be a set of words, S = {fear, fell, fit, full, fun, tell, talk, tap}. Figure 1 shows the trie T that stores these words. Initially a root node is drawn. The characters of each input word are read one by one. If the current node has no link tagged with the character read, a new node is created and linked; otherwise the existing link is followed. The node reached after the last character marks the end of the word. The length of the longest word determines the depth of the trie.

To search [2] for a word, start from the root node and follow the links tagged with its letters in sequence. If at some node there is no link with the required tag, the word is not in the trie. To insert a word, start from the root node and proceed as in a search. When a node has no link tagged with the next letter L of the word, a new node and a link pointing to it (tagged L) are created. This process is repeated until the end of the word is reached.

2.2 FP-tree
Mining association rules is based on finding frequent patterns and then generating rules from them.
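The two steps named above can be illustrated with a minimal sketch. It counts how often item sets occur in a handful of baskets (step one) and then derives the confidence of a rule from those counts (step two). The basket contents, the candidate list, the threshold min_count and the function names support_count and confidence are assumptions made for illustration; they are not taken from the surveyed papers.

```python
# Hypothetical market baskets, not data from the text.
BASKETS = [
    {"milk", "butter"},
    {"milk", "bread"},
    {"milk", "butter", "bread"},
    {"milk"},
    {"bread"},
]


def support_count(itemset, baskets):
    """Number of baskets that contain every item of `itemset`."""
    return sum(1 for basket in baskets if itemset <= basket)


def confidence(antecedent, consequent, baskets):
    """Share of baskets containing `antecedent` that also contain `consequent`."""
    both = support_count(antecedent | consequent, baskets)
    return both / support_count(antecedent, baskets)


# Step 1: keep the candidate item sets that occur in at least `min_count` baskets.
min_count = 2  # assumed threshold
candidates = [
    {"milk"}, {"butter"}, {"bread"},
    {"milk", "butter"}, {"milk", "bread"}, {"butter", "bread"},
]
frequent = [s for s in candidates if support_count(s, BASKETS) >= min_count]

# Step 2: generate a rule from a frequent item set and measure its confidence.
rule_confidence = confidence({"milk"}, {"butter"}, BASKETS)
print(frequent)
print("milk -> butter confidence:", rule_confidence)
```

In this sketch step one enumerates every candidate explicitly. Algorithms such as Apriori and FP-growth differ mainly in how they avoid that enumeration.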
A pattern that occurs very often in a data set is a frequent pattern [3]. There are several algorithms for finding frequent patterns or item sets, such as Apriori and FP-growth. The Apriori algorithm first generates candidate sets from a set of items and then checks whether they occur frequently. It requires many database scans and is therefore expensive. The FP-growth algorithm first constructs a compact data structure called the FP-tree and then mines frequent patterns from it. The FP-tree can be built in two passes over the database.

Figure 1. Trie T stores the words fear, fell, fit, full, fun, tell, talk, tap.

Example [4]: Find all frequent item sets in the database given in Table 1, taking the minimum support as 30%.

Table 1. Given transactional dataset
Tid   Items
1     X, Y
2     Y, Z, W
3     X, W, U, V
4     X, U, V
5     X, Y, Z
6     X, Y, Z, U
7     X
8     Y, Z, V

Step 1: Minimum support calculation
Number of transactions = 8
Minimum support = 30%
Minimum support count = 30/100 × 8 = 2.4, rounded up to 3

Step 2: Finding frequency of occurrence and priority of items
Table 2 shows the frequency of occurrence and the priority of the items in the given transactional database.

Table 2. Frequency of occurrence and priority of items
Item   Frequency   Priority
X      6           1
Y      5           2
Z      4           3
U      3           4
V      3           5
W      2           6

For example, item X occurs in rows 1, 3, 4, 5, 6 and 7, six times in total, so the frequency of item X is 6. The frequencies of the remaining items are found in the same way. The items are prioritized by frequency: item X, with frequency 6, has the highest priority and item W, with frequency 2, has the lowest. Items that do not satisfy the minimum support requirement are dropped. With a minimum support count of 3, item W falls below the threshold, although the construction that follows still shows W in the tree.

Step 3: Ordering of items on the basis of priority
Table 3 shows the ordering of items. Items are ordered by the priorities shown in Table 2, which gives the order X, Y, Z, U, V, W.
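Steps 1 to 3 can be sketched as follows, under a few assumptions. The sketch also inserts the ordered rows into a prefix-sharing tree, which is the construction that Step 4 walks through row by row, and compares the node counts with the item frequencies as in Step 5. The class FPNode and all function names are hypothetical. Ties in frequency are broken alphabetically, an assumption that happens to reproduce the priorities of Table 2. Item W is kept in the tree, as in the walkthrough, even though its frequency is below the minimum support count.

```python
from collections import Counter

# Table 1 of the text, one list of items per transaction.
TRANSACTIONS = [
    ["X", "Y"], ["Y", "Z", "W"], ["X", "W", "U", "V"], ["X", "U", "V"],
    ["X", "Y", "Z"], ["X", "Y", "Z", "U"], ["X"], ["Y", "Z", "V"],
]


class FPNode:
    """One node of the tree: an item, its count, and child nodes by item."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}


def min_support_count(n_transactions, percent):
    """Step 1: support count rounded up, using integer arithmetic."""
    return -(-n_transactions * percent // 100)


def item_priority(transactions):
    """Step 2: item frequencies and items sorted by descending frequency."""
    freq = Counter(item for t in transactions for item in t)
    order = sorted(freq, key=lambda item: (-freq[item], item))
    return freq, order


def build_fp_tree(transactions, order, keep):
    """Steps 3 and 4: order each row by priority and insert it below the root."""
    rank = {item: r for r, item in enumerate(order)}
    root = FPNode()  # the NULL root
    for t in transactions:
        node = root
        for item in sorted((i for i in t if i in keep), key=rank.__getitem__):
            node = node.children.setdefault(item, FPNode(item))
            node.count += 1
    return root


def tree_counts(node, totals=None):
    """Step 5: sum the counts of all nodes per item."""
    totals = Counter() if totals is None else totals
    for child in node.children.values():
        totals[child.item] += child.count
        tree_counts(child, totals)
    return totals


freq, order = item_priority(TRANSACTIONS)
threshold = min_support_count(len(TRANSACTIONS), 30)
frequent = {item for item in order if freq[item] >= threshold}
tree = build_fp_tree(TRANSACTIONS, order, keep=set(order))  # W kept, as in the text
print(threshold, order, sorted(frequent))
print(tree_counts(tree) == freq)
```

Passing keep=frequent instead of keep=set(order) would drop W before insertion, which is what the minimum support rule in Step 2 prescribes.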
Step 4: FP-tree construction
The root node is taken as NULL. The root node is drawn first, and then the items of each row of Table 3 are inserted one after the other. Figures 2 to 9 show the FP-tree after each row is inserted:

Row 1 (Figure 2): X, Y
Row 2 (Figure 3): Y, Z, W
Row 3 (Figure 4): X, U, V, W
Row 4 (Figure 5): X, U, V
Row 5 (Figure 6): X, Y, Z
Row 6 (Figure 7): X, Y, Z, U
Row 7 (Figure 8): X
Row 8 (Figure 9): Y, Z, V

Table 3. Order of items
Item   Frequency   Priority
X      6           1
Y      5           2
Z      4           3
U      3           4
V      3           5
W      2           6

Figures 2 to 8. FP-tree construction for the items in Rows 1 to 7.
Figure 9. Final FP-tree constructed after inserting the items of Row 8.

Step 5: Validation process
The constructed FP-tree is validated by comparing the frequency of each item in the FP-tree of Figure 9 with the values in Table 2. If they match, the FP-tree is correct. Here the values match, and hence the constructed FP-tree is valid.

2.3 Pre-Large Tree
The concept of the pre-large tree [5] can be used to generate association rules from large databases while minimizing the number of scans. The pre-large concept is defined by a lower support threshold and an upper support threshold. The upper support threshold is the same as the minimum support threshold set by the user. An item set is considered large if its support ratio reaches the upper support threshold, and small if its support ratio is below the lower support threshold. Item sets whose support ratio lies between the two thresholds are pre-large. Pre-large item sets act as a buffer during incremental mining and reduce the movement of item sets directly from large to small and vice versa. The lower support threshold is based on the number of updated records permitted in the database. If the number of updated records goes beyond the permitted number, the database must be rescanned, which increases the computing time.

To insert data efficiently, the pre-large concept can be combined with the FP-tree to design the pre-large tree structure. Initially a pre-large tree is constructed from the original database. The database is scanned to find the large and pre-large items, which are then sorted in descending order of frequency. The pre-large tree is built from this sorted sequence of items, step by step, from the first transaction to the last. Two tables, Header_Table and Pre_Header_Table, are maintained; they store the frequency values of the large and pre-large items respectively.

An example of pre-large tree construction is given below. The given database contains 10 transactions and 9 items, denoted {m} to {u}. The lower support threshold Sl is set at 40% and the upper support threshold Su at 70%.
The large items are {n}, {o}, {p} and {q}, and the pre-large items are {r}, {s} and {t}. From these items the Header_Table and the Pre_Header_Table are constructed: Header_Table contains the frequency values of the large items and Pre_Header_Table those of the pre-large items. The pre-large tree is then constructed.

Table 4. Given transaction item set
Tid   Items
1     m, n, o, p, q, r
2     n, o, p, r
3     n, o, p, q, t
4     n, o, p, q, r
5     n, o, p, r, s, u
6     n, o, p, q, t
7     n, r, s
8     o, p, q, t
9     o, r, s, t, u
10    m, n, o, q, s

Step 1: Finding frequency and priority of items
Table 5 shows the frequency of the items in the given transaction item set. Item 'n' occurs 8 times and so has the highest priority; items 'm' and 'u' occur only twice and so have the lowest priorities.

Table 5. Frequency and priority of large items
Item   Frequency   Priority
m      2           8
n      8           1
o      8           2
p      6           3
q      6           4
r      6           5
s      4           6
t      4           7
u      2           9

Step 2: Ordering of items based on priority
In Table 6, the items of each transaction are arranged by the priorities shown in Table 5, i.e. in descending order of frequency.

Table 6. Ordering and prioritizing of items
Tid   Items               Ordering
1     m, n, o, p, q, r    n, o, p, q, r, m
2     n, o, p, r          n, o, p, r
3     n, o, p, q, t       n, o, p, q, t
4     n, o, p, q, r       n, o, p, q, r
5     n, o, p, r, s, u    n, o, p, r, s, u
6     n, o, p, q, t       n, o, p, q, t
7     n, r, s             n, r, s
8     o, p, q, t          o, p, q, t
9     o, r, s, t, u       o, r, s, t, u
10    m, n, o, q, s       n, o, q, s, m

Step 3: Header_Table construction
Table 7 shows the Header_Table construction. Header_Table contains the frequency of occurrence of the large items in the given database.

Table 7. Header_Table
Item   Frequency   Head
n      8
o      8
p      6
q      6

Step 4: Pre_Header_Table construction
Table 8 shows the Pre_Header_Table construction. Pre_Header_Table contains the frequency of occurrence of the pre-large items in the given database.
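The split into large, pre-large and small items and the two header tables of Steps 3 and 4 can be sketched as below. This is only an illustration under stated assumptions: the item counts are hypothetical and are not the values of Table 5, the names classify_items and as_table are invented here, and an item is treated as large when its support ratio is at least the upper threshold and as pre-large when it is at least the lower threshold. The thresholds of 40% and 70% and the database size of 10 are taken from the example.

```python
def classify_items(counts, n_transactions, lower_pct, upper_pct):
    """Split items into large, pre-large and small by support ratio.

    Percentages are compared with integer arithmetic to avoid rounding issues.
    """
    large, pre_large, small = {}, {}, {}
    for item, count in counts.items():
        if count * 100 >= upper_pct * n_transactions:
            large[item] = count
        elif count * 100 >= lower_pct * n_transactions:
            pre_large[item] = count
        else:
            small[item] = count
    return large, pre_large, small


def as_table(group):
    """Rows of (item, frequency), highest frequency first, ties alphabetical."""
    return sorted(group.items(), key=lambda row: (-row[1], row[0]))


# Hypothetical item counts over 10 transactions (not the counts of Table 5).
counts = {"a": 9, "b": 7, "c": 6, "d": 4, "e": 3, "f": 1}

large, pre_large, small = classify_items(counts, 10, lower_pct=40, upper_pct=70)
header_table = as_table(large)          # plays the role of Header_Table
pre_header_table = as_table(pre_large)  # plays the role of Pre_Header_Table
print(header_table)
print(pre_header_table)
print(sorted(small))
```

Keeping the pre-large items in their own table means that an item whose count rises or falls after an update can usually be reclassified from the stored frequencies alone, without a rescan of the database.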
Step 5: Constructing the pre-large tree
In Figure 10, the Header_Table and the Pre_Header_Table store the frequency values of the large and pre-large items. The pre-large tree is constructed from these values.

Table 8. Pre_Header_Table
Item   Frequency   Head
r      6
s      4
t      4

Figure 10. Pre-large tree construction.

3. Role of Data Structures in Mining Association Rules

3.1 The Trie Data Structure for Finding Frequent Item Set
Association rule mining finds frequent item sets and then generates strong rules from them. Finding frequent item sets is one of the important tasks in data mining. Algorithms such as Apriori (based on hash-trees), frequent pattern growth (FP-growth) and the vertical data format approach are used to find frequent item sets and to speed up the search for items. In general, hash-trees are used for finding frequent item sets. In [1], a trie replaces the hash-tree for this task. A trie is an ordered data structure used to store keys that are strings. Let I = {i1, i2, i3, ..., in} be the set of items. These items are combined to generate candidate sets, and the frequent item sets are found among the candidates. A trie searches for k-item sets faster than a hash-tree.

For good performance the hash-tree depends on two parameters: the table size and leaf_mm-size (the number of candidates a leaf stores). One setting of these parameters may not suit all datasets. The trie does not depend on such parameters, so it is easier to work with. Experimental results show that the performance of the trie is close to that of hash-trees with tuned parameters at high support thresholds, and that the trie outperforms the hash-tree at low thresholds. Tries are also well suited to candidate generation, because the pairs of item sets from which a candidate is generated have the same parent, so candidates can be obtained easily.

3.2 Mining Association Rules using TCOM
Association rule mining is one of the most significant parts of the data mining process. Its purpose is to find association relationships or correlations among a set of items. In [6], an efficient way to discover valid association rules among infrequently occurring items is presented. A novel data structure called the Transactional Co-Occurrence Matrix (TCOM) is built from the original transactional database in two passes. The occurrence counts of item sets are then calculated and valid association rules are mined from the TCOM. The main advantage is that item sets can be randomly accessed and counted without a full scan of the original database or the TCOM, which makes the mining process more efficient. Discovering rule patterns is the ultimate goal of association rule mining, and this form of rule mining requires very little memory compared with frequent pattern mining. In many cases rules among infrequently occurring items are very valuable. Experimental results show that the method is efficient and capable of mining massive transaction databases. It can be improved further by using closed item sets and efficient pruning techniques.

3.3 Pre-large Tree for Association Rule Mining
The Frequent Pattern tree (FP-tree) is an efficient data structure for mining association rules without generating candidate item sets. It condenses a database into a tree structure and stores only large items. When the data are updated, however, it requires all transactions to be reprocessed in batch. In [5], an algorithm based on the pre-large concept is used to handle updated records of the original database. The algorithm sets a lower and an upper support threshold to define pre-large items, which prevents small items from directly becoming large. It first divides the items into three parts according to whether they are large, pre-large or small in the original database, checks whether their count differences are positive, zero or negative, and then processes each part separately. In data mining the minimum support threshold is generally set by the user; here it is taken to be the upper support threshold. The lower threshold is based on the number of updated records permitted. If the number of updated records goes beyond the permitted number, the algorithm scans the database again to obtain the final results; otherwise the rescan is avoided and execution time is saved. The algorithm also uses pruning techniques to minimize the number of scans of the original database, so it maintains the pre-large tree with a good execution time, especially when handling a small number of updated records. Experimental results show that the pre-large tree maintenance algorithm executes faster than the batch FP-tree and FUFP-tree maintenance algorithms for handling updated records.

3.4 Mining Frequent Ordered Sub Trees in a Tree-Structured Database
Mining frequent subtrees is a key research area in knowledge discovery from tree-structured databases. To find relationships among tree data items under a user-defined threshold, the frequently occurring substructures must be found first. This is known as frequent subtree mining (FSM) [7]. In [7], a new method is used to find all frequent ordered subtrees in a tree-structured database. The main idea is to extract the structural features of the input tree instances into a transactional form that allows standard item set mining methods to be used.
The ultimate aim is to mine frequent subtrees from input tree instances that are represented in a transactional database, using a standard item set mining method. In this way the subtree enumeration process is avoided, and the subtrees can be reconstructed in a post-processing phase. This allows more structured and more complex tree data to be handled at much lower support thresholds. The method can find position-constrained subtrees: every node in a position-constrained subtree is annotated with its occurrence and embedding level in the original database tree. In addition, disconnected subtree relationships can be specified through implicit linking nodes. Experiments on synthetic and real-world datasets confirm the expected benefits of this method over competing methods in terms of efficiency, mining capabilities and informativeness of the extracted patterns. The method can integrate any standard item set mining algorithm.

3.5 Frequent Closed Enumeration Table (FCET) to Find Generalized Association Rule
Association rule mining is one of the important tasks in the field of data mining. It is used to find strong relationships among data items. The main aim of [8] is to use an efficient data structure to find generalized association rules between items at different hierarchical levels of a tree, on the assumption that the original frequent item sets and association rules were generated beforehand. Consider a large transactional database in which every transaction contains a set of items, together with a taxonomy (a tree-like structure) on the items. Relationships among items at any level of the tree are to be found. All ancestors of every item in a transaction are appended to the transaction, and association rules can then be found from these extended transactions using any mining algorithm.
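The extension of transactions with taxonomy ancestors can be sketched as follows. This is a minimal illustration, not the FCET method of [8]: the taxonomy, the baskets and the function names ancestors, extend_transaction and support_count are all hypothetical, and the taxonomy is assumed to be a tree in which each item has at most one parent.

```python
# Hypothetical taxonomy, mapping each item to its parent category.
TAXONOMY = {
    "skim milk": "milk",
    "whole milk": "milk",
    "milk": "dairy",
    "butter": "dairy",
    "wheat bread": "bread",
}


def ancestors(item, taxonomy):
    """All ancestors of `item`, nearest first (assumes the taxonomy has no cycles)."""
    found = []
    while item in taxonomy:
        item = taxonomy[item]
        found.append(item)
    return found


def extend_transaction(transaction, taxonomy):
    """The transaction plus every ancestor of each of its items."""
    extended = set(transaction)
    for item in transaction:
        extended.update(ancestors(item, taxonomy))
    return extended


def support_count(itemset, transactions):
    """Number of transactions that contain every item of `itemset`."""
    return sum(1 for t in transactions if itemset <= t)


# Hypothetical baskets containing only leaf-level items.
baskets = [
    {"skim milk", "wheat bread"},
    {"whole milk", "butter"},
    {"butter", "wheat bread"},
]
extended = [extend_transaction(b, TAXONOMY) for b in baskets]

print(support_count({"milk"}, baskets))              # counted on raw baskets
print(support_count({"milk"}, extended))             # counted on extended baskets
print(support_count({"dairy", "bread"}, extended))   # a rule across two categories
```

Relationships at higher levels, such as dairy with bread, become countable only after the extension, which is why the extended transactions can be passed to any ordinary item set mining algorithm.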
The main challenge in designing an efficient mining algorithm is how to use the original frequent item sets and association rules to generate new generalized association rules directly, without scanning the database again and again. In [8], an efficient data structure called the Frequent Closed Enumeration Table (FCET) is used to store the relevant information. It stores maximal item sets (a maximal item set is a frequent item set none of whose immediate supersets is frequent) and derives the information about their subsets from them. Two algorithms, GMAR and GMFI, are used along with various pruning techniques to generate new generalized association rules. Experimental results show that the GMAR and GMFI algorithms perform much better than the BASIC and Cumulate algorithms because they generate fewer candidate sets. GMAR removes a large number of extraneous rules based on the minimum confidence. GMAR performs better than GMFI on sparse databases, while GMFI performs better than GMAR on dense databases where the number of frequent item sets is large. The time complexity for finding maximal item sets is O(log2n), where n is the total number of maximal item sets. The memory requirement of the FCET is somewhat large, but limiting the size of the maximal item sets saves a large amount of memory in the case of dense databases.

4. Conclusion
In this paper we have studied the concepts of various data structures and the advantages of using data structures in various applications of data mining. Generating association rules from large transactional databases is an important task in the field of data mining. The features of data structures such as the trie, FP-tree, pre-large tree, TCOM and FCET help in minimizing the complex computational tasks involved in generating rules.
Data mining issues like memory requirement for processing voluminous data and time complexity can be effectively solved by selecting proper data structures.

5. References
1. Bodon F, Ronyai B. Trie: An alternative data structure for data mining algorithms. Mathematical and Computer Modelling; 2003.
2. Available from: http://www.geeksforgeeks.org/trie-insertand-search
3. Available from: http://www.1.se.cuhk.edu.hk/~seem4630/tuto/Tutorial03.ppt
4. Available from: http://www.hareenlaks.blogspot.com/2011/06/fp-tree-example-how-to-identify.html
5. Lin C, Hong T. Maintenance of pre-large trees for data mining with updated records. Information Sciences; 2014.
6. Ding J, Yau SST. TCOM, an innovative data structure for mining association rules among infrequent items. Computers and Mathematics with Applications. 2009; 57(2):290–301.
7. Hadzic F, Hecker M, Tagarelli A. Ordered subtree mining via transactional mapping using a structure-preserving tree database schema. Information Sciences; 2015.
8. Wu C, Huang Y. Generalized association rule mining using an efficient data structure. Expert Systems with Applications; 2011.