A Survey on Techniques For Speeding Up Association Rule Mining

Vinayak Suresh Shukla1, Prof. Dr. S. A. Itkar2
Computer Department, P. E. S. Modern College of Engineering, Shivajinagar, Pune-05

Abstract- Interest in digital data storage has increased in recent years owing to ease of use and low-cost storage media. This, in turn, has affected the process of mining knowledge from large volumes of distributed and heterogeneous data. It led to the evolution of association mining algorithms such as Apriori and FP-Growth to meet time and memory requirements. However, these algorithms underperform when the amount of stored data to be mined is extremely large. Hence, instead of proposing better algorithms or modifying the existing ones, mining performance can be improved by defining a new data structure on which the existing algorithms are applied. Here we study a technique that speeds up association rule mining by combining traditional algorithms with a new data structure, so that time and memory requirements are met efficiently.

General Terms- Data Compression, Data Structure, Knowledge Discovery.

Keywords- Association Rule Mining (ARM), Index Compression.

I. INTRODUCTION

In recent years, the production of digital data has increased in every field [1], and so has the storage of data. This has made knowledge discovery more interesting but also more time consuming. Association rule mining, a well-known and well-researched mining technique, lags behind when the volume of data is extreme. An association rule is defined as an implication X → Y, where X and Y are sets of items from a data set. Let I = {i1, i2, ..., in} be the set of all items in the data set, and T = {t1, t2, ..., tn} the set of all transactions within that data set. Then X and Y are subsets of I, i.e., X ⊂ I and Y ⊂ I.
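To make the notation concrete, the following minimal sketch represents I and T as Python sets and measures how often a rule X → Y holds. The item names, the toy transactions, and the helper names `support` and `confidence` are assumptions chosen for illustration; support and confidence are the standard measures for association rules and are not defined in the text above.

```python
# Toy data; item names and transactions are invented for illustration.
I = {"bread", "milk", "butter", "jam"}
T = [
    {"bread", "milk"},
    {"bread", "butter", "jam"},
    {"bread", "milk", "butter"},
    {"milk"},
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item of `itemset`."""
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(X, Y, transactions):
    """Among transactions containing X, the fraction that also contain Y."""
    return support(X | Y, transactions) / support(X, transactions)

X, Y = {"bread"}, {"milk"}
assert X < I and Y < I                  # X and Y are proper subsets of I
print(support(X | Y, T))                # 0.5
print(round(confidence(X, Y, T), 2))    # 0.67
```

Here two of the four transactions contain both bread and milk, and three contain bread, so the rule {bread} → {milk} holds in two out of three cases where its antecedent is satisfied.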
An association rule X → Y states that if X is satisfied, then Y is highly likely to be satisfied as well; X and Y are the antecedent and the consequent, respectively. Many efficient ARM algorithms, e.g. [2], have been proposed for mining data sets while reducing the number of candidate sets, the number of transactions, and the comparisons between the two. However, these algorithms become time consuming and hard to run when data sets are extremely large. As a solution, instead of modifying how the mining algorithms work, a new compressed data structure [3] is defined for the algorithms to mine. It allows better handling of the data, faster access to it, and faster computation when used by any existing mining algorithm.

The paper is organized into four sections: Section II gives a brief review of techniques for speeding up rule mining. Section III describes the performance parameters considered to compare these approaches and, finally, Section IV summarizes and presents the conclusions.

International Journal of Recent Trends in Engineering & Research (IJRTER), Volume 02, Issue 12; December - 2016 [ISSN: 2455-1457]. @IJRTER-2016, All Rights Reserved.

II. RELATED WORK

Association rule mining [2] is a key technique for mining rules. It aims to extract strong relationships among the items of a given data set. It was originally designed for market basket analysis, which helps shopkeepers arrange sale items in order to grow the business. Apriori [4] is one of the earliest and best-known algorithms for association rule mining. It is based on exhaustive search and works in two steps: (i) finding all frequent itemsets in the data set, and (ii) deriving association rules from those frequent itemsets. If the data set has k single items and N transactions, then 2^k - 1 itemsets can be generated.
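The first of the two steps can be sketched as a level-wise search. This is a simplified illustration under stated assumptions, not the optimized implementation of [4]: the function names, the `min_count` threshold, and the toy transactions are hypothetical. Candidates are pruned with the standard property that every subset of a frequent itemset must itself be frequent.

```python
from itertools import combinations

def count_candidate_itemsets(k):
    """Number of non-empty itemsets over k single items: 2^k - 1."""
    return 2 ** k - 1

def frequent_itemsets(transactions, min_count):
    """Level-wise (Apriori-style) search for itemsets that occur in
    at least `min_count` transactions. Returns {itemset: count}."""
    items = sorted({i for t in transactions for i in t})
    current = [frozenset([i]) for i in items]
    frequent = {}
    while current:
        # Count every candidate of the current size against all transactions.
        counts = {c: sum(1 for t in transactions if c <= t) for c in current}
        level = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(level)
        # Join frequent sets of size s into candidates of size s + 1 and keep
        # only those whose size-s subsets are all frequent.
        size = len(next(iter(level))) + 1 if level else 0
        candidates = {a | b for a in level for b in level if len(a | b) == size}
        current = [c for c in candidates
                   if all(frozenset(s) in level
                          for s in combinations(c, size - 1))]
    return frequent

transactions = [{"a", "b"}, {"a", "b", "c"}, {"a", "c"}, {"b"}]
result = frequent_itemsets(transactions, min_count=2)
print(len(result))                    # 5
print(count_candidate_itemsets(20))   # 1048575
```

With only 20 single items the candidate space already exceeds one million itemsets, which is why pruning and compact data representations matter.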
When the data set is huge, this processing becomes hard, because the complexity of the above operation is O(N × M × k) just to obtain the candidate sets. The frequent pattern (FP-growth) algorithm [5] was introduced to reduce the number of transactions and the related comparisons. The data needs to be scanned only a small, fixed number of times (twice in the original formulation [5]), since FP-growth stores the frequent items in a tree structure. However, it suffers from a large number of I/O operations and from the large amount of memory required to store all the sets. To overcome these drawbacks, namely large memory requirements and long computation times, many evolutionary association rule mining algorithms [6], [7], [8], [9] were proposed. Yet even when rule mining was approached from these different angles, working with high-dimensional data remained hard.

As a solution, it was found that creating a new data structure [3] that reduces the original data set size is advantageous, and that it can be used to speed up mining with current algorithms without changing how they work. The data structure proposed in [3] is meant to reduce the data set size and provide faster data access. Despite this reduction, it guarantees data validity and does not change the original data values. The conversion of the data into the new data structure comprises three steps: 1. Shuffling, 2. Inverted Index Mapping, and 3. Data Compression [3].

1. Shuffling: The shuffling strategy exploits a property of the data set, namely the similarity among records and transactions within the same or specific data sets. It aims to cluster records or transactions that have a short Hamming distance between them. Instead of storing all the consecutive records, it stores the distance between consecutive records after the initial record in each cluster/group. This strategy has a complexity of O(N) for N records.

2. Inverted Index Mapping: On the resulting data, this step uses a mapping strategy to index the attribute values according to the transactions that satisfy them.
Inverted Index Mapping is the most important step in building the proposed data structure, and it is responsible for constructing an index-based structure. This step takes the sorted data as input. Following the inverted index strategy, it assigns a key-value pair to each list of attributes. The complexity of this step is O(P × Q), where P is the number of unique values and Q is the number of instances.

3. Data Compression: This step uses a procedure similar to Run Length Encoding (RLE). The consecutive indices for each attribute are grouped into a two-tuple, i.e., the starting index and the total number of consecutive records/transactions. The complexity of this step is O(P × Q), where P is the number of unique values and Q is the number of instances.

III. PERFORMANCE PARAMETERS

1. Data Size, 2. Processing Time, and 3. Memory Requirements: Traditional data sets are structured in 2-D tables [10], which store continuous data and sometimes duplicate data. The new data structure compresses this layout using the techniques described above, and the resulting structure is much smaller than the traditional one. This reduces loading time and memory requirements.

IV. CONCLUSION

With reduced main memory requirements and shorter data access time, the proposed data structure allows ARM algorithms to work efficiently with larger volumes of data. It overcomes the drawbacks of large memory requirements and long data access times in ARM algorithms without changing their original schema. Use of this data structure improved the performance of existing algorithms. The experimental results showed that the proposed data structure ensures lossless data reduction, substantially reducing main memory requirements and shortening data access time. Hence, it can be used wherever efficient data access is required for large data sets.
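As an illustration of the Inverted Index Mapping and Data Compression steps described in Section II, the following minimal sketch builds an index over a toy data set and compresses it into (start, length) tuples. It is not the implementation of [3]: the function names and records are hypothetical, and a plain lexicographic sort stands in for the shuffling step, whose Hamming-distance clustering is not reproduced here.

```python
from collections import defaultdict

def build_inverted_index(records):
    """Map each (attribute position, value) key to the ascending list of
    record indices in which that value occurs."""
    index = defaultdict(list)
    for row_id, record in enumerate(records):
        for attr, value in enumerate(record):
            index[(attr, value)].append(row_id)
    return dict(index)

def compress_runs(indices):
    """Group consecutive indices into (start, length) tuples, RLE-style."""
    runs = []
    for i in indices:
        if runs and runs[-1][0] + runs[-1][1] == i:
            start, length = runs[-1]
            runs[-1] = (start, length + 1)   # extend the current run
        else:
            runs.append((i, 1))              # start a new run
    return runs

def decompress_runs(runs):
    """Recover the original index list; the encoding is lossless."""
    return [start + k for start, length in runs for k in range(length)]

records = [
    ("sunny", "hot"), ("rainy", "mild"), ("sunny", "mild"),
    ("sunny", "hot"), ("rainy", "mild"),
]
ordered = sorted(records)   # assumption: sorting replaces the shuffling step
index = build_inverted_index(ordered)
compressed = {key: compress_runs(ids) for key, ids in index.items()}

print(compressed[(1, "mild")])                          # [(0, 2), (4, 1)]
print(sum(n for _, n in compressed[(0, "sunny")]))      # 3
```

Placing similar records next to each other lengthens the runs, so each attribute value needs fewer tuples. The number of records that hold a value can then be read by summing run lengths without decompressing.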
REFERENCES

1. V. Marx, “Biology: The big challenges of big data,” Nature, vol. 498, no. 7453, pp. 255–260, 2013.
2. R. Agrawal and R. Srikant, “Fast algorithms for mining association rules in large databases,” in Proc. 20th Int. Conf. Very Large Data Bases (VLDB), Santiago, Chile, 1994, pp. 487–499.
3. J. M. Luna, A. Cano, and M. Pechenizkiy, “Speeding-up association rule mining with inverted index compression,” IEEE Trans. Cybern., 2016.
4. R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. I. Verkamo, “Fast discovery of association rules,” in Advances in Knowledge Discovery and Data Mining. Menlo Park, CA, USA: Amer. Assoc. Artif. Intell., 1996, pp. 307–328.
5. J. Han, J. Pei, Y. Yin, and R. Mao, “Mining frequent patterns without candidate generation: A frequent-pattern tree approach,” Data Min. Knowl. Disc., vol. 8, no. 1, pp. 53–87, 2004.
6. B. Goethals and M. J. Zaki, “Advances in frequent itemset mining implementations: Report on FIMI’03,” ACM SIGKDD Explor. Newslett., vol. 6, no. 1, pp. 109–117, 2004.
7. C. Lucchese, S. Orlando, and R. Perego, “DCI closed: A fast and memory efficient algorithm to mine frequent closed itemsets,” in Proc. IEEE ICDM Workshop Frequent Itemset Min. Implement. (FIMI), Brighton, U.K., 2004, pp. 20–28.
8. J. M. Luna, J. R. Romero, C. Romero, and S. Ventura, “On the use of genetic programming for mining comprehensible rules in subgroup discovery,” IEEE Trans. Cybern., vol. 44, no. 12, pp. 2329–2341, Dec. 2014.
9. C. Borgelt, “Efficient implementations of Apriori and Eclat,” in Proc. 1st IEEE ICDM Workshop Frequent Item Set Min. Implement. (FIMI), Melbourne, FL, USA, 2003, pp. 90–99.
10. M. Hall et al., “The WEKA data mining software: An update,” SIGKDD Explor., vol. 11, no. 1, pp. 10–18, 2009.
