Probabilistic Static Load-Balancing of Parallel Mining of Frequent Sequences
ABSTRACT
Frequent sequence mining is a well-known and well-studied problem in data mining.
The output of the algorithm is used in many other areas, such as bioinformatics,
chemistry, and market basket analysis. Unfortunately, frequent sequence mining
is computationally expensive. In this paper, we present a novel parallel
algorithm for mining frequent sequences based on static load-balancing. The
static load-balancing is done by estimating the computational time using a
probabilistic algorithm. For instances of reasonable size, the algorithm achieves
speedups of up to approximately 3/4 · P, where P is the number of processors. In the
experimental evaluation, we show that our method performs significantly better than
the current state-of-the-art methods. The presented approach is general: it can also
be used for static load-balancing of other pattern mining algorithms, such as
itemset, tree, and graph mining algorithms.
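The relation between load balance and speedup can be illustrated with a short calculation. This is a minimal sketch under assumptions not stated in the source: parallel running time is taken to be the load of the busiest processor, communication cost is ignored, and the function names and load values are hypothetical.

```python
def speedup(loads):
    """Speedup over a sequential run, assuming the sequential time is the
    sum of all per-processor loads and the parallel time is the largest one."""
    return sum(loads) / max(loads)


def efficiency(loads):
    """Speedup divided by the number of processors P (1.0 is a perfect balance)."""
    return speedup(loads) / len(loads)


# Hypothetical per-processor workloads for P = 4, in arbitrary time units.
loads = [4, 3, 3, 2]
print(speedup(loads))     # 3.0, i.e. 3/4 * P for P = 4
print(efficiency(loads))  # 0.75
```

Under these assumptions, a speedup of 3/4 · P means the busiest processor carries about 4/3 of the average load.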
EXISTING SYSTEM
Frequent pattern mining is an important data mining technique that covers a wide
variety of patterns. The mined frequent patterns can be sets of items (itemsets),
sequences, graphs, trees, etc. The GSP algorithm was the first to solve the problem of
frequent sequence mining. Just as frequent sequence mining is an extension of
itemset mining, the GSP algorithm is an extension of the Apriori algorithm. Both
Apriori and GSP are breadth-first search algorithms. The GSP algorithm suffers from
problems similar to those of the Apriori algorithm: it is slow and
memory-consuming.
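To make the breadth-first strategy concrete, the following minimal sketch mines frequent sequences level by level. It is not the GSP algorithm itself: it assumes each sequence element is a single item (no itemsets), it extends every frequent pattern with every frequent item instead of using GSP's candidate join and pruning, and all names are hypothetical.

```python
def is_subsequence(pattern, sequence):
    """True if the items of pattern occur in sequence in the same order (gaps allowed)."""
    remaining = iter(sequence)
    return all(item in remaining for item in pattern)


def support(pattern, database):
    """Number of database sequences that contain pattern as a subsequence."""
    return sum(is_subsequence(pattern, seq) for seq in database)


def levelwise_frequent_sequences(database, min_support):
    """Breadth-first search: all frequent patterns of length k are found
    before any pattern of length k + 1 is generated."""
    if min_support < 1:
        raise ValueError("min_support must be at least 1")
    items = sorted({item for seq in database for item in seq})
    level = {(i,): support((i,), database) for i in items}
    level = {p: s for p, s in level.items() if s >= min_support}
    frequent_items = [p[0] for p in level]
    result = {}
    while level:
        result.update(level)
        # The whole level is held in memory and every candidate needs a
        # database scan, which is where the time and memory cost comes from.
        candidates = [p + (i,) for p in level for i in frequent_items]
        level = {}
        for candidate in candidates:
            s = support(candidate, database)
            if s >= min_support:
                level[candidate] = s
    return result


database = [("a", "b", "c"), ("a", "c"), ("b", "c"), ("a", "b")]
patterns = levelwise_frequent_sequences(database, min_support=2)
```

For this toy database, `patterns` contains the three single items with support 3 and the sequences `("a", "b")`, `("a", "c")`, and `("b", "c")` with support 2.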
Disadvantages of Existing System:
1. Frequent sequence mining is computationally expensive.
2. Existing algorithms are slow and memory-consuming.
PROPOSED ALGORITHM
In this paper, we propose a novel parallel method that statically load-balances the
computation. That is, the set of all frequent sequences is first split into Prefix-Based
Equivalence Classes (PBECs), the relative execution time of each PBEC is then estimated,
and finally the PBECs are assigned to processors. The method uses sampling to estimate
the time the sequential PrefixSpan algorithm needs to process one PBEC.
In the proposed system, it is important to be aware that the running time of the sequential
algorithm scales with: 1) the database size; 2) the number of frequent sequences; and 3)
the number of embeddings of a frequent sequence in database transactions.
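The three steps (split into PBECs, estimate relative cost, assign to processors) can be sketched as follows. This is an illustration under stated assumptions, not the method's actual implementation: a uniform sample of frequent sequences is assumed to be already available, each sequence is a tuple of single items, the relative cost of a PBEC is approximated by the fraction of sampled sequences that share its prefix, and the assignment uses a greedy largest-first heuristic that the source does not specify. All function names are hypothetical.

```python
import heapq
from collections import Counter


def estimate_pbec_weights(sample, prefix_len=1):
    """Group sampled frequent sequences by their first prefix_len items and
    return the fraction of the sample that falls into each class."""
    counts = Counter(tuple(seq[:prefix_len]) for seq in sample)
    total = sum(counts.values())
    return {prefix: n / total for prefix, n in counts.items()}


def assign_pbecs(weights, num_processors):
    """Greedy assumption: take classes from heaviest to lightest and give
    each one to the processor with the smallest load so far."""
    heap = [(0.0, p) for p in range(num_processors)]
    heapq.heapify(heap)
    assignment = {p: [] for p in range(num_processors)}
    ordered = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))
    for prefix, weight in ordered:
        load, p = heapq.heappop(heap)
        assignment[p].append(prefix)
        heapq.heappush(heap, (load + weight, p))
    loads = {p: sum(weights[x] for x in assignment[p]) for p in assignment}
    return assignment, loads


# Hypothetical sample of frequent sequences.
sample = [("a", "b"), ("a", "c"), ("a",), ("a", "b", "c"),
          ("b", "c"), ("b",), ("c",), ("d", "a")]
weights = estimate_pbec_weights(sample)
assignment, loads = assign_pbecs(weights, num_processors=2)
```

In this example the class with prefix `("a",)` holds half of the sample, so it is placed alone on one processor while the three smaller classes share the other, giving both processors an estimated load of 0.5.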
Advantages of Proposed System:
1. It performs significantly better than the existing methods.
2. It improves the estimate of the processing time of a single PBEC.
SYSTEM REQUIREMENTS
Hardware Requirements:
Processor - Pentium IV
Speed - 1.1 GHz
RAM - 256 MB
Hard Disk - 20 GB
Keyboard - Standard Windows Keyboard
Mouse - Two or Three Button Mouse
Monitor - SVGA
Software Requirements:
Operating System : Windows XP
Coding Language : C#