Probabilistic Static Load-Balancing of Parallel Mining of Frequent Sequences

ABSTRACT

Frequent sequence mining is a well-known and well-studied problem in data mining. Its output is used in many other areas, such as bioinformatics, chemistry, and market basket analysis. Unfortunately, frequent sequence mining is computationally expensive. In this paper, we present a novel parallel algorithm for mining frequent sequences based on static load-balancing. The static load-balancing is done by estimating the computational time with a probabilistic algorithm. For instances of reasonable size, the algorithm achieves speedups of up to approximately 3/4 · P, where P is the number of processors. In the experimental evaluation, we show that our method performs significantly better than the current state-of-the-art methods. The presented approach is general: it can also be used for static load-balancing of other pattern mining algorithms, such as itemset, tree, and graph mining algorithms.

EXISTING SYSTEM

Frequent pattern mining is an important data mining technique that covers a wide variety of pattern types. The mined frequent patterns can be sets of items (itemsets), sequences, graphs, trees, etc. The GSP algorithm was the first to solve the problem of frequent sequence mining. Just as frequent sequence mining is an extension of itemset mining, the GSP algorithm is an extension of the Apriori algorithm. Both Apriori and GSP are breadth-first search algorithms. The GSP algorithm suffers from the same problems as the Apriori algorithm: it is slow and memory-consuming.

Disadvantages of Existing System:
1. Frequent sequence mining is computationally expensive.
2. Existing algorithms are slow and memory-consuming.

PROPOSED ALGORITHM

In this paper we propose a novel parallel method that statically load-balances the computation.
That is, the set of all frequent sequences is first split into prefix-based equivalence classes (PBECs), then the relative execution time of each PBEC is estimated, and finally the PBECs are assigned to processors. The method uses sampling to estimate the time the sequential PrefixSpan algorithm needs to process one PBEC. In the proposed system, it is important to be aware that the running time of the sequential algorithm scales with: 1) the database size; 2) the number of frequent sequences; 3) the number of embeddings of a frequent sequence in database transactions.

Advantages of Proposed System:
1. It performs significantly better than the existing methods.
2. It improves the estimate of the processing time of a single PBEC.

SYSTEM REQUIREMENTS

Hardware Requirements:
Processor - Pentium IV
Speed - 1.1 GHz
RAM - 256 MB
Hard Disk - 20 GB
Keyboard - Standard Windows Keyboard
Mouse - Two or Three Button Mouse
Monitor - SVGA

Software Requirements:
Operating System : Windows XP
Coding Language : C#
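
The last step of the proposed method, assigning PBECs to processors once their relative execution times have been estimated, can be illustrated with a small sketch. It is written in Python for brevity and rests on assumptions that do not come from the paper: the estimated costs are supplied as plain numbers, the function names `assign_pbecs` and `speedup_bound` are hypothetical, and the greedy "largest estimate first" heuristic is a common scheduling rule substituted here for illustration, not necessarily the assignment rule the authors use.

```python
import heapq


def assign_pbecs(estimated_cost, num_processors):
    """Assign each PBEC to the currently least-loaded processor.

    estimated_cost: dict mapping a PBEC label to its estimated relative
    execution time (assumed to be given, e.g. from a sampling step).
    Greedy largest-first heuristic; an illustrative assumption only.
    """
    # Min-heap of (current load, processor index).
    heap = [(0, p) for p in range(num_processors)]
    heapq.heapify(heap)
    assignment = {p: [] for p in range(num_processors)}

    # Place the most expensive PBECs first so small ones can fill gaps.
    by_cost = sorted(estimated_cost.items(), key=lambda kv: kv[1], reverse=True)
    for pbec, cost in by_cost:
        load, p = heapq.heappop(heap)
        assignment[p].append(pbec)
        heapq.heappush(heap, (load + cost, p))

    loads = {p: sum(estimated_cost[b] for b in assignment[p]) for p in assignment}
    return assignment, loads


def speedup_bound(loads):
    """Best achievable speedup if the estimates were exact: total / max load."""
    return sum(loads.values()) / max(loads.values())


# Hypothetical relative cost estimates for five PBECs (made-up numbers).
costs = {"a": 40, "b": 25, "c": 20, "d": 10, "e": 5}
assignment, loads = assign_pbecs(costs, 2)
print(assignment, loads, speedup_bound(loads))
```

Because the assignment is fixed before mining starts, the quality of the balance depends entirely on how accurate the estimated costs are, which is why the method puts its effort into estimating the processing time of a single PBEC.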
