TITLE OF THE THESIS
Parallel Approach for Implementing Data Mining
Algorithms
A
RESEARCH PROPOSAL
SUBMITTED TO THE
SHRI RAMDEOBABA COLLEGE OF ENGINEERING AND MANAGEMENT,
FOR THE DEGREE
OF
DOCTOR OF PHILOSOPHY
IN
COMPUTER SCIENCE AND ENGINEERING
By
MANISH BHARDWAJ
Registration No < >
UNDER THE GUIDANCE OF
DR. D.S.ADANE
DEPARTMENT OF COMPUTER SCIENCE AND ENGINEERING
SHRI RAMDEOBABA COLLEGE OF ENGINEERING AND MANAGEMENT,
NAGPUR, MAHARASHTRA - 440015
Year 2016
MANISH BHARDWAJ
Doctorate Research Proposal
RCOEM, Nagpur
About this proposal
This document describes the working title of the research proposal and gives
a general overview of the area. The research plan outlined here may be
modified based on the approval of this proposal.
Research Proposal, RCOEM Nagpur
Confidential
Page ii
CONTENTS
1. RESEARCH TITLE ......................................... 1
2. ABSTRACT ............................................... 1
3. LITERATURE SURVEY ...................................... 1
4. PROBLEM DEFINITION ..................................... 8
5. PROPOSED METHODOLOGY ................................... 9
6. REFERENCES ............................................. 9
1. Research Proposal Title
Parallel Approach for Implementing Data Mining Algorithms.
2. Abstract
Parallel data mining concerns the algorithms, techniques and tools for
extracting useful, implicit and novel patterns from datasets using
high-performance architectures. The huge volumes of data generated by online
transactions, by social networking sites, and by government organizations
working in the space and bioinformatics fields create new problems for data
mining and knowledge discovery methods. Owing to this size, most currently
available data mining algorithms are not applicable to many problems: they
do not give good results when datasets become very large, and the time
required to execute them grows accordingly. Parallel techniques allow mining
to be performed more efficiently by exploiting the available
high-performance architectures. Parallel approaches such as data
partitioning, task partitioning, divide-and-conquer, single-dimension
reduction, scalable thread scheduling and local sorting help to implement
data mining algorithms whose performance is high and whose time requirement
is low compared with a straightforward implementation. A graphics processing
unit with the CUDA-enabled model allows a task to be carried out in parallel
by thread blocks running concurrently, while the OpenMP API, with its
fork-join model and its many constructs and directives, supports parallel
implementations with multiple-core support.
3. Literature Review and Related Work
3.1 Research Issues and Challenges
This section outlines some important research issues and a set of open
problems in designing and implementing large-scale data mining algorithms.
3.1.1 High Dimensionality
Available methods are able to handle hundreds of attributes. New parallel
algorithms are needed that can handle a larger number of attributes.
3.1.2 Large Size
Data warehouses continue to increase in size. Available techniques can
handle data in the gigabyte range but are not yet well suited to
terabyte-sized data.
3.1.3 Data Type
Most data mining research has focused on structured data, due to its
simplicity, but support for other data types is also required. Examples
include semi-structured, unstructured, spatial, temporal and multimedia
databases.
3.1.4 Dynamic Load Balancing
Static partitioning is used for homogeneous environments; dynamic load
balancing is crucial for handling heterogeneous environments.
3.1.5 Multi-table Mining
Applying mining over multiple tables, or over distributed databases with
different database schemas, is very difficult with available mining
methods. Better methods are required to handle the multi-table mining
problem [1].
3.2 Scaling up Methods for Data Mining
Scaling up is the only way to handle large datasets. Parallel approaches
such as one-dimensional reduction, scalable thread scheduling and local
sorting make it possible to implement data mining algorithms that can
handle large datasets.
3.2.1 Modifying Algorithm
The main aim of modifying an algorithm is to make it faster. Different
optimizing search techniques are used for this purpose; they can also
reduce complexity, produce an optimized representation, or find an
approximate solution instead of an exact one.
3.2.1.1 Model restriction and reducing the search space
Restricting the model space has an immediate advantage in that the search
space is also reduced. Furthermore, simple solutions are usually faster to
obtain and evaluate and, in many cases, are competitive with more complex
solutions. The major problem is when the intrinsic complexity of the problem
cannot be met by a simple solution. Examples of this strategy are many,
including linear models, perceptrons, and decision stumps.
3.2.1.2 Using powerful search heuristics
Using a more efficient search heuristic avoids artificially constraining the
possible models and tries to make the search process faster. The method
consists of three steps: first, it must derive an upper bound on the relative
loss between using a subset of the available data and the whole dataset in
each step of the learning algorithm. Then, it must derive an upper bound of
the time complexity of the learning algorithm as a function of the number of
samples used in each step. Finally, it must minimize the time bound, via the
number of samples used in each step, subject to the target limits on the loss
of performance of using a subset of the dataset.
3.2.2 Change the way to deal Problem
This strategy consists of modifying the way the problem is solved and is
based on the general principle of divide-and-conquer. The idea is to perform
some kind of data partitioning or problem decomposition.
3.3 Parallelization
Parallelization helps in that the most costly parts are performed
concurrently. With parallelization there is the possibility of addressing
the scaling up of mining methods without simplifying either the algorithm or
the task.
3.4 Graphics Processing Unit with Compute Unified Device Architecture
Graphics processing units (GPUs) have enabled inexpensive high-performance
computing for general-purpose applications. The Compute Unified Device
Architecture (CUDA) programming model provides programmers with adequate
C-like APIs to better exploit the parallel power of the GPU.
GPUs have evolved into highly parallel, multithreaded, many-core
processors with tremendous computational horsepower and very high
memory bandwidth. NVIDIA's GPUs with the CUDA programming model
provide an adequate API for non-graphics applications.
Fig. 3.1 A set of SIMD stream multiprocessors with memory hierarchy
3.4.1 CUDA Programming Model
At the software level, CUDA is a collection of thread blocks running in
parallel. The unit of work assigned to the GPU is called a kernel, and a
CUDA program runs in a thread-parallel way. Computation is organized as a
grid of thread blocks, each consisting of a set of threads, as shown in the
figure below.
At the instruction level, 32 consecutive threads in a thread block make up
the minimum unit of execution, called a thread warp. Each stream
multiprocessor executes one or more thread blocks concurrently [2].
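To illustrate the thread indexing at the heart of this model, the following sketch simulates, in plain Python rather than CUDA C, how each (block, thread) pair in a grid maps to one global element index via blockIdx.x * blockDim.x + threadIdx.x. The saxpy kernel and the grid/block sizes are hypothetical examples for illustration, not taken from the cited work.

```python
# Simulate CUDA's grid/block indexing sequentially: each (block, thread)
# pair maps to one global element index, exactly as a kernel computes
# i = blockIdx.x * blockDim.x + threadIdx.x.
def launch_kernel(grid_dim, block_dim, kernel, *args):
    for block_idx in range(grid_dim):        # blocks of the grid
        for thread_idx in range(block_dim):  # threads within a block
            i = block_idx * block_dim + thread_idx
            kernel(i, *args)

def saxpy(i, a, x, y, out):
    # Guard against threads past the end of the data, as real kernels must.
    if i < len(x):
        out[i] = a * x[i] + y[i]

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [10.0, 20.0, 30.0, 40.0, 50.0]
out = [0.0] * len(x)
launch_kernel(2, 3, saxpy, 2.0, x, y, out)  # 2 blocks x 3 threads = 6 threads
print(out)  # → [12.0, 24.0, 36.0, 48.0, 60.0]
```

On a real GPU the two loops run concurrently across stream multiprocessors; only the index arithmetic is shown here.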
Fig. 3.2 Serial execution on the host and parallel execution on the device
3.4.2 Parallelization techniques on a CUDA-enabled platform
Three schemes for data mining parallelization on a CUDA-based platform are
as follows:
3.4.2.1 Scalable thread scheduling scheme for irregular patterns
Whether a task is assigned to the CPU or the GPU, and the number of thread
blocks, is usually determined by the size of the problem before the GPU
kernel starts. For irregular-pattern problems, however, the problem size is
not known in advance, so this static CUDA computing model is not suitable.
Solution: scalable thread scheduling. An upper bound on the number of
threads/thread blocks is calculated first and the GPU resources are
allocated accordingly; if some thread blocks turn out to be idle, the
corresponding thread blocks quit immediately.
3.4.2.2 Parallel distributed top-k scheme
The top-k problem is to select the k minimum or maximum elements from a
data collection. Insertion sort has been proved efficient when k is small,
but a CUDA-based insertion sort is not.
Solution: reduce the computation and tackle the weakness of the CUDA-based
insertion sort by using local sorts rather than a global sort.
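The local-sort idea can be sketched as follows. This is a minimal illustration in Python, with each chunk standing in for the data assigned to one thread block: every chunk selects its own k smallest elements locally, and only those survivors are merged, so no global sort of the full collection is ever needed. The chunk count and data are arbitrary example values.

```python
def topk_local(data, k, num_chunks=4):
    """Select the k smallest elements via per-chunk (local) top-k,
    mimicking per-thread-block selection instead of one global sort."""
    chunk_size = (len(data) + num_chunks - 1) // num_chunks
    survivors = []
    for c in range(num_chunks):
        chunk = data[c * chunk_size:(c + 1) * chunk_size]
        survivors.extend(sorted(chunk)[:k])  # local sort of a small chunk
    return sorted(survivors)[:k]             # merge only the survivors

data = [9, 3, 7, 1, 8, 2, 6, 4, 5, 0]
print(topk_local(data, 3))  # → [0, 1, 2]
```

Only num_chunks * k survivors reach the merge step, which is why the scheme stays cheap when k is small relative to the data size.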
3.4.2.3 Parallel high dimension reduction scheme
A record in text mining may consist of hundreds of attributes, exceeding the
size of the shared memory allocated to each thread block on the GPU. In such
a case, the record has to be broken into multiple sub-records to fit in the shared
memory, but breaking it into too many sub-records is not a solution either,
because the cost of manipulating the records and the temporary results would
be high.
Solution: observe that the different attributes in a record are independent,
so each thread block can take care of one distinct attribute across all the
records. Rather than performing a reduction on the high-dimensional data,
perform a one-dimensional reduction on each attribute.
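The per-attribute reduction can be sketched as below. In this illustrative Python version, the outer loop over attributes plays the role of independent thread blocks, each reducing one column of the records; the sum is used as the reduction operation, and the data are arbitrary example values.

```python
def per_attribute_sum(records):
    """Reduce each attribute (column) independently, as one thread block
    per attribute would, instead of reducing whole high-dim records."""
    num_attrs = len(records[0])
    totals = [0.0] * num_attrs
    for j in range(num_attrs):       # each j would be one block's job
        for rec in records:
            totals[j] += rec[j]
    return totals

records = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
print(per_attribute_sum(records))  # → [12.0, 15.0, 18.0]
```

Because no column's result depends on another, the outer loop parallelizes with no synchronization between blocks.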
3.5 CUDA-based implementations of data mining algorithms
3.5.1 CU-Apriori
In CUDA-based Apriori, candidate generation and support counting take most
of the computation.
3.5.1.1 Candidate generation
The candidate generation procedure joins two frequent (k-1)-itemsets and
prunes the unpromising k-candidates. Since the task of joining two itemsets
is independent across threads, it is suitable for parallelization; the
scalable thread scheduling scheme for irregular patterns is used here.
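The join-and-prune step can be sketched sequentially as follows; this is an illustrative Python version of the classic Apriori candidate generation, not the CU-Apriori kernel itself. In a CUDA implementation each candidate join of the nested loop would be handled by an independent thread.

```python
def generate_candidates(frequent_k_minus_1):
    """Join frequent (k-1)-itemsets sharing a (k-2)-prefix, then prune
    candidates that have an infrequent (k-1)-subset. Each join is
    independent, which makes the step parallelizable."""
    freq = set(frequent_k_minus_1)
    itemsets = sorted(frequent_k_minus_1)
    candidates = []
    for a in itemsets:
        for b in itemsets:
            if a[:-1] == b[:-1] and a[-1] < b[-1]:  # join step
                cand = a + (b[-1],)
                # prune step: every (k-1)-subset must itself be frequent
                if all(cand[:i] + cand[i + 1:] in freq
                       for i in range(len(cand))):
                    candidates.append(cand)
    return candidates

freq_2 = [(1, 2), (1, 3), (2, 3), (2, 4)]
print(generate_candidates(freq_2))  # → [(1, 2, 3)]
```

Here (2, 3, 4) is generated by the join but pruned because its subset (3, 4) is not frequent.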
3.5.1.2 Support counting
The support counting procedure records the number of occurrences of a
candidate itemset by scanning the transaction database. Since the counting
for each candidate is independent of the others, it is suitable for
parallelization. Transactions are loaded into the shared memory and shared
by all the threads within a thread block [3].
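The counting itself amounts to a containment test per (candidate, transaction) pair, as the illustrative Python sketch below shows; on the GPU, each thread would own one candidate while the transactions reside in shared memory. The data are arbitrary examples.

```python
def count_support(candidates, transactions):
    """Count each candidate's occurrences by scanning the transactions;
    each candidate's count is independent of the others, so the outer
    loop over candidates parallelizes naturally."""
    counts = {c: 0 for c in candidates}
    for t in transactions:
        t = set(t)
        for c in candidates:
            if set(c) <= t:  # candidate itemset contained in transaction
                counts[c] += 1
    return counts

transactions = [[1, 2, 3], [1, 3], [2, 3, 4], [1, 2, 3, 4]]
print(count_support([(1, 3), (2, 4)], transactions))
# → {(1, 3): 3, (2, 4): 2}
```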
3.5.2 CU-KNN
In the CUDA-based k-nearest-neighbour classifier, distance calculation and
selection of the k nearest neighbours account for most of the computation.
3.5.2.1 Distance calculation
It can be fully parallelized, since each pair-wise distance calculation is
independent. This property makes KNN perfectly suited to a GPU parallel
implementation. The goal is to maximize the concurrency of the distance
calculations invoked by different threads and to minimize global memory
access.
3.5.2.2 Selection of k nearest neighbours
The selection of the k nearest neighbours of a query object is essentially
finding the k shortest distances, which is a typical top-k problem, so it is
implemented with the parallel distributed top-k scheme.
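The two stages combine as in the illustrative Python sketch below: stage one computes pair-wise distances (on a GPU, one thread per object), and stage two selects the top-k shortest. A plain sort stands in here for the distributed top-k scheme; the query point and dataset are arbitrary examples.

```python
def knn(query, dataset, k):
    """KNN as two independent, parallel-friendly stages: pair-wise
    distance calculation, then top-k selection over the distances."""
    def sq_dist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    distances = [(sq_dist(query, obj), idx)           # stage 1: distances
                 for idx, obj in enumerate(dataset)]
    return [idx for _, idx in sorted(distances)[:k]]  # stage 2: top-k

data = [(0, 0), (5, 5), (1, 1), (9, 9), (2, 2)]
print(knn((0, 0), data, 2))  # → [0, 2]
```

Squared distances suffice for ranking, which saves the square root that a GPU kernel would also typically skip.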
3.5.3 CU-K-means
In CUDA-based K-means, cluster label update, centroid update and centroid
movement detection take most of the computation.
3.5.3.1 Cluster label update
Each thread performs the distance calculation from an object to all the
centroids and selects the nearest one; each object is assigned to the
cluster whose centroid is closest to it. Attribute partitions of the objects
are loaded into the shared memory, so the bandwidth between the global
memory and the shared memory is used efficiently.
3.5.3.2 Centroid update
Each new centroid is calculated by averaging the attribute values of all the
records belonging to the same cluster. The parallel high dimension reduction
scheme is used for this task.
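The averaging step can be sketched as below; this illustrative Python version accumulates per-attribute sums and counts per cluster, the per-attribute independence being exactly what the parallel reduction scheme exploits. Records and labels are arbitrary example data.

```python
def update_centroids(records, labels, k):
    """Recompute each centroid by averaging, attribute by attribute, the
    records assigned to its cluster; each attribute's average is an
    independent one-dimensional reduction."""
    dims = len(records[0])
    sums = [[0.0] * dims for _ in range(k)]
    counts = [0] * k
    for rec, lbl in zip(records, labels):
        counts[lbl] += 1
        for j in range(dims):  # one reduction per attribute
            sums[lbl][j] += rec[j]
    return [[s / counts[c] for s in sums[c]] for c in range(k)]

records = [[1, 1], [3, 3], [10, 10], [12, 12]]
labels = [0, 0, 1, 1]
print(update_centroids(records, labels, 2))  # → [[2.0, 2.0], [11.0, 11.0]]
```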
3.5.3.3 Centroid movement detection
This step checks whether the new centroids have moved away from the
centroids of the last iteration. First, the square of the difference between
every attribute of the new and old centroids is calculated, giving a
centroid difference matrix. Second, the parallel high dimension reduction
scheme is performed on the centroid difference matrix. Third, since the
number of attributes in the resulting record is small, it is transferred to
the main memory and summed up to obtain the global squared error. The cost
of data transfer between the main and global memory is negligible [7].
3.5.4 FP-Growth
Although the FP-Growth association-rule mining algorithm is more efficient
than the Apriori algorithm, it has two disadvantages. The first is that the
FP-tree can become too large to be created in memory; the second is the
serial processing approach used.
A parallel FP-Growth approach based on a distributed application data
framework does not require generating the overall FP-tree, which may be too
large to create in shared memory. The algorithm uses a parallel processing
approach in all important steps, which improves the processing capability
and efficiency of the association-rule mining algorithm [4].
3.5.5 Parallel Bees Swarm Optimization
The association mining problem on huge datasets is solved by applying the
behaviour of bees. The method takes advantage of the GPU architecture to
deal with large datasets and solve real-time problems. A master-slave
paradigm is used: the master executes on the CPU and the slave is offloaded
to the GPU. First, the master randomly initializes the reference solution.
It then determines the regions of all the bees by generating the
neighbours of each bee. Each solution is evaluated on the GPU in parallel.
Afterwards, the master receives back the fitness of all rules; each bee
sequentially calculates its best rule and puts it in the dance table. The
best rule of the dance table becomes the reference solution for the next
iteration [5].
3.5.6 Accelerating Parallel Frequent Itemset Mining on Graphic
Processors with Sorting
This method constructs a transaction identifier table and sorts all frequent
itemsets, which helps to reduce the candidate itemsets using the GPU
architecture. GPU thread blocks are allocated after sorting the itemsets in
descending order, so checking and support counting take less time [6].
3.5.7 Parallel Highly Informative K-ItemSet
PHIKS is a highly scalable, parallel miki mining algorithm, able to handle
the mining of huge databases (terabytes of data in size). Miki is the
problem of discovering maximally informative k-itemsets in massive datasets,
where informativeness is expressed by means of joint entropy and k is the
size of the itemset.
Miki mining is a key problem in data analytics with a high potential impact
on tasks such as unsupervised learning, supervised learning and information
retrieval, to cite a few. A typical application is the discovery of
discriminative sets of features based on joint entropy [9].
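The joint-entropy criterion behind miki can be made concrete with a small sketch: for an itemset (a set of feature indices), project each row onto those features and compute H = -Σ p(v) log2 p(v) over the observed value combinations v. This is an illustrative Python computation of the criterion, not the PHIKS algorithm itself; the rows are arbitrary example data.

```python
import math

def joint_entropy(rows, itemset):
    """Joint entropy of a set of binary features over the dataset;
    higher entropy means the itemset is more informative."""
    n = len(rows)
    counts = {}
    for row in rows:
        key = tuple(row[i] for i in itemset)  # projection onto the itemset
        counts[key] = counts.get(key, 0) + 1
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

rows = [(1, 0, 1), (1, 1, 1), (0, 0, 1), (0, 1, 1)]
print(joint_entropy(rows, (0, 1)))  # → 2.0  (all four combinations occur)
print(joint_entropy(rows, (2,)))    # a constant feature: entropy 0
```

A miki of size k is the k-itemset maximizing this quantity, which is why constant or redundant features are never selected.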
4. Problem Definition
Large volumes of data are generated by online transactions, social
networking sites and government organizations in the space and
bioinformatics fields, and available data mining algorithms do not perform
well on such datasets.
Another problem concerns performance: some algorithms that can solve the
mining problem face a search space that prevents efficient execution, and
the solutions generated are not of a satisfactory level.
5. Proposed Methodology
To deal with very large datasets, the only way is to apply a parallel
approach for scaling up the data mining algorithms; this can be done by
modifying the algorithm, by data partitioning, by problem decomposition,
and by parallelization.
For parallelization, graphics processing units enable inexpensive
high-performance computing. The Compute Unified Device Architecture
programming model provides programmers with adequate C-like APIs to better
exploit the parallel power of the GPU. The GPU has evolved into a highly
parallel, multithreaded, many-core processor, so work is distributed among
thread blocks and the threads perform operations in a thread-parallel
fashion.
The other approach is based on OpenMP, a shared-memory API working on the
fork-join model. It has a large set of constructs and directives that allow
work to be done in parallel; in this way a task utilizes the computing power
of multiple cores, and the parallel approach is applied to scale up data
mining algorithms.
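The fork-join pattern underlying both proposed directions can be sketched as follows. This is an illustrative Python version using a thread pool as a stand-in for OpenMP threads (OpenMP itself is a C/C++/Fortran API): fork by splitting the data into chunks counted in parallel, join by summing the partial counts. The support-counting task and data are hypothetical examples.

```python
from concurrent.futures import ThreadPoolExecutor

def parallel_count(transactions, item, workers=4):
    """Fork-join sketch in the spirit of OpenMP's model: fork -- split the
    data and count each chunk in parallel; join -- sum the partial counts."""
    chunk = (len(transactions) + workers - 1) // workers
    parts = [transactions[i:i + chunk]
             for i in range(0, len(transactions), chunk)]
    count = lambda part: sum(1 for t in part if item in t)
    with ThreadPoolExecutor(max_workers=workers) as pool:  # fork
        partials = pool.map(count, parts)
    return sum(partials)                                   # join

transactions = [[1, 2], [2, 3], [1, 3], [2], [1, 2, 3]]
print(parallel_count(transactions, 2))  # → 4
```

In OpenMP the same shape would be a `#pragma omp parallel for reduction(+:count)` loop; the point of the sketch is the split-work-merge structure, not the threading library.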
6. References
1. M. J. Zaki, Large-Scale Parallel Data Mining, LNAI 1759, pp. 1-23,
Springer, 2000.
2. N. Garcia-Pedrajas, A. de Haro-Garcia, Scaling up data mining
algorithms: review and taxonomy, Springer-Verlag, 2011.
3. L. Jian, C. Wang, Y. Liu, Y. Shi, Parallel data mining techniques on
Graphics Processing Unit with Compute Unified Device Architecture (CUDA),
pp. 943-967, Springer Science+Business Media, LLC, 2011.
4. Zhi-gang Wang, Chi-she Wang, A Parallel Association-Rule Mining
Algorithm, pp. 125-129, Springer-Verlag Berlin Heidelberg, 2012.
5. Y. Tan, Parallel Bees Swarm Optimization for Association Rules Mining
Using GPU Architecture, pp. 50-57, Springer International Publishing
Switzerland, 2014.
6. H. Hsu, Accelerating Parallel Frequent Itemset Mining on Graphics
Processors with Sorting, pp. 245-256, IFIP, 2013.
7. H. Decker, Parallel and Distributed Mining of Probabilistic Frequent
Itemsets Using Multiple GPUs, Springer-Verlag Berlin Heidelberg, 2013.
8. S. Tsutsui and P. Collet, Data Mining Using Parallel Multi-objective
Evolutionary Algorithms on Graphics Processing Units, Springer-Verlag
Berlin, 2013.
9. Saber Salah, A highly scalable parallel algorithm for maximally
informative k-itemset mining, Springer-Verlag London, 2016.