Survey
International Journal of Emerging Technology and Advanced Engineering
Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 4, Issue 3, March 2014)

Data Mining Techniques in Parallel and Distributed Environment: A Comprehensive Survey

Shraddha Masih1, Sanjay Tanwani2
1,2 School of Computer Science & IT, DAVV, Indore, India

Abstract: Distributed sources of voluminous data have raised the need for distributed data mining. Conventional data mining techniques work well on structured data that is clean, pre-processed and properly arranged in structured files, databases or data warehouses. These techniques assume a centralised data store, so they have several limitations in a distributed scenario where the data is scattered across data servers in different geographical locations on the network. Accumulating huge volumes of data on a centralised node in real time is costly. To overcome these limitations, the application of distributed data mining techniques has become essential. This paper describes various data mining tools and techniques that can be used in a distributed environment. Different distributed mining techniques follow different algorithmic and architectural approaches. The latest approaches in distributed data mining are explored, and research issues and challenges in the field are also discussed.

Abbreviations: KDD - Knowledge Discovery in Databases, ARM - Association Rule Mining, DDM - Distributed Data Mining, GPU - Graphics Processing Unit

I. INTRODUCTION

Organizations need to accumulate vast and growing amounts of data in different databases. This data may be either transactional data, such as sales, inventory, payroll and accounting, or analytical data that is helpful in decision support systems. To be useful, this data must be analyzed thoroughly. Many analytical tools are available in the market. Data mining techniques also belong to the category of analytical systems that give insight into hidden information. They can help to find patterns, relationships and categories in data [2]. Data mining is considered a part of the KDD process. The main steps of KDD are data accumulation, cleaning, pre-processing, storing, mining and finally representing the patterns in a presentable format.

In the last twenty years a lot of research has been done on improving the performance of data mining techniques. From past to present, three different trends have been observed. The first trend is based on the centralized approach, where all data needs to be stored on a central node. Mostly sequential algorithms were part of this approach. The second trend was the parallelization of centralized algorithms. Two main approaches were used for parallelization: task parallelism and data parallelism. Parallel computing techniques received a boost with the advent of multi-core CPUs and cheaper GPUs, and combining CPU and GPU resulted in a multi-fold performance benefit. The latest trend is distributed data mining, where data mining techniques are applied in different distributed computing paradigms such as peer-to-peer systems, clusters, grids and cloud environments [1].

II. ABOUT DATA MINING

In this competitive world, top-level management needs to take the right decisions at the right time to give better service to customers and to build a better organizational image. Decisions based on better analysis increase profit and decrease loss. To make such decisions, management depends on better analytical and data mining services.

Fig. 1 Data Mining Process

Data mining offers a wide range of algorithms used for analysis, pattern discovery and prediction. It includes techniques such as association rule mining, decision trees, regression, support vector machines and many more. Data mining techniques evolved as a requirement when enormous amounts of data started accumulating in digital format.
A wide variety of profitable solutions are hidden inside this large pool of data. The existing data mining algorithms can work in three different computing environments: centralised, parallel and distributed.

A. Centralised Approach for Data Mining

Organizations may have multiple repositories of transactional data, depending on the location of their offices. In the centralized approach, data is extracted and accumulated in a central store after cleaning and pre-processing. From this central store, task-relevant data is selected and mining techniques are applied. Initially, data mining techniques were restricted to centralized processing [3], [4], [5], [6], [7]. Data mining algorithms help to dig out hidden, previously unknown information from existing data. Xindong Wu et al. [2] conducted a survey in 2007 and presented the top 10 algorithms most used by analysts worldwide. The algorithms were rated on the basis of their popularity, performance and utility. The centralised algorithms considered most influential are C4.5, k-Means, SVM, Apriori, EM, PageRank, AdaBoost, kNN, Naive Bayes and CART [8, 9, 10, 11]. A brief description of these algorithms is presented below:

Table 1. Top Algorithms of Data Mining

| Data mining technique | Centralized algorithm | Feature |
|---|---|---|
| Association rule | Apriori | Bottom-up approach: requires n scans of the database to find association rules for itemsets of up to n items. |
| Association rule | Pincer Search | Bottom-up with top-down approach: terminates early if large itemsets are found to be frequent during the top-down comparison. |
| Association rule | FP-Tree Growth | Requires only two database scans, since the rules are derived from the Frequent Pattern Tree. |
| Clustering | K-Means | Hill-climbing method of clustering for creating K clusters. |
| Clustering | EM Algorithm | Mathematical modelling method based on random phenomena. |
| Classification | C4.5 | Decision-tree-based method for classification from which rules can also be derived. A multiway tree can be generated. Information-based method for splitting nodes. |
| Classification | CART | Based upon the Gini Diversity Index. Generates a binary decision tree. |
| Classification | SVM | Derives a classification function to distinguish the different classes of the training dataset. |
| Classification | kNN | Not based on exact match for classification. Finds a group of k objects in the training set that are closest to the test object. |
| Prediction | Naive Bayes | Supervised classification method based on comparing a score with a threshold. |
| Prediction | AdaBoost | Ensemble learning method that combines many weak rules to create accurate prediction rules. |
| Others | PageRank | Search ranking algorithm based on the hyperlinks between web pages. |

B. Parallel Approach for Data Mining

Many large scientific and compute-intensive problems can be better solved using a parallel programming approach. Data mining can be executed in a highly parallel environment over multiple processors.

General-purpose programming can also be done on Graphics Processing Units (GPUs), whose many cores can be exploited for highly parallel processing. Many data mining algorithms have been specifically implemented in CUDA and show drastic improvements in performance. A discussion of parallel programming is incomplete without the recent approach called MapReduce [17], which can process large data sets in a highly parallel manner. MapReduce was introduced by Google in 2004 and has become one of the most popular frameworks for mining large-scale datasets in parallel as well as distributed environments. Different computing environments require different programming paradigms depending on the problem type. As data mining techniques are both data intensive and compute intensive, they can be better exploited by using one or a combination of the parallel programming approaches given in Table 2.
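As a minimal sketch of what executing a mining step over multiple processors can look like, the Python fragment below applies data parallelism to the assignment step of k-Means: the points are split into chunks and the same function runs on every chunk. The function names (`nearest`, `assign_chunk`, `parallel_assign`) and the sample data are illustrative assumptions, not taken from any of the surveyed systems. Threads are used only to keep the sketch short and self-contained; in CPython a real speed-up for this kind of CPU-bound work would normally need processes or a GPU.

```python
from concurrent.futures import ThreadPoolExecutor


def nearest(point, centroids):
    """Return the index of the centroid closest to point (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda j: sum((p - c) ** 2 for p, c in zip(point, centroids[j])),
    )


def assign_chunk(chunk, centroids):
    """The task every worker runs: label each point of its own chunk."""
    return [nearest(point, centroids) for point in chunk]


def parallel_assign(points, centroids, n_workers=2):
    """Data parallelism: same task, different contiguous chunks of the data."""
    size = max(1, -(-len(points) // n_workers))  # ceiling division, at least 1
    chunks = [points[i:i + size] for i in range(0, len(points), size)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(assign_chunk, chunks, [centroids] * len(chunks)))
    # pool.map preserves chunk order, so the labels line up with the input points
    return [label for part in parts for label in part]


points = [(0.0, 0.1), (0.2, 0.0), (5.0, 5.1), (4.9, 5.0), (0.1, 0.2)]
centroids = [(0.0, 0.0), (5.0, 5.0)]
labels = parallel_assign(points, centroids)
print(labels)  # [0, 0, 1, 1, 0]
```

Task parallelism would instead give each worker a different job (for example, one worker per candidate number of clusters) over the same data.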
Parallel implementations of data mining algorithms can be distinguished on the basis of task-parallel and data-parallel approaches [16]. Modern programming languages are also structured to utilize novel architectures efficiently. Dedicated parallel programming paradigms exist for parallelizing algorithms over multiprocessor and networked systems. OpenMP and MPI are widely used to achieve shared-memory and distributed-memory parallelization, respectively [24, 25]. CUDA is a parallel programming model, exposed as an extension of C, that is designed for programming NVIDIA GPUs [23]. CUDA offers a data-parallel programming model. In CUDA, threads access the different memory spaces of the GPU.

Table 2. Parallel approaches

| MPI | OpenMP | CUDA | MapReduce |
|---|---|---|---|
| A framework for distributed-memory parallelism. It is a concept (a standard), not a software package. | A framework for threaded parallelism. Shared-memory model. | A parallel programming model for multiprocessing environments on GPUs. | Multithreaded framework; threads are assigned either a map or a reduce task. |
| Multiple tasks run concurrently across separate nodes. | Multiple threads run concurrently; the usual mapping is 1 thread : 1 core. | Multiple lightweight threads run concurrently in each block of the GPU. | The MapReduce library expresses the computation as two functions: map and reduce. |
| Each task has its own private memory. | Shared memory is accessible to all threads. | Threads access shared memory as well as registers. Each thread has its own registers. | Map tasks run on worker (slave) nodes; reduce tasks combine their outputs, and a master node coordinates the job. |
| Runs on a distributed network. | Runs on multi-core processors. | Specially designed for GPUs. | Runs on multicore CPUs, GPUs, grids and the cloud. |
| Message based: message passing with send and receive. | Directive based (C/C++): #pragma omp directives. | A kernel function runs on the GPU. | Based on key-value pairs. |
| Flexible and expressive: can be used on a wider range of problems than OpenMP. | Easier to program and debug than MPI. | A C extension, so much easier for C programmers. | Generally used when the data size is very large. |
| Each process has its own local variables. | Directives can be added incrementally. | A kernel function has its own local variables. | The map task is highly scalable and works on distributed data. |

Future Scope in Parallel Data Mining

With the availability of cheaper, highly parallel GPUs in the market, a lot of research is being done on parallelizing data mining algorithms for these devices. GPUMiner is a novel parallel data mining system that utilizes new-generation graphics processing units (GPUs). The system relies on the massively multi-threaded SIMD architecture [26]. Various data mining algorithms, including association rules, clustering and classification, have been modified for parallel processing architectures [27, 28, 29, 30, 31, 32]. Parallel mining on multidimensional data storage has also been explored by S. Goil and A. Choudhary [33]. Ruoming Jin, Ge Yang and Gagan Agrawal focused on shared-memory parallelization of data mining algorithms. Their technique applies to a large number of data mining problems, and they proposed a reduction-object-based interface for specifying a data mining algorithm [63].

We identify the following future directions in the field of parallel data mining:
i. A CPU+GPU combination can be used for performance enhancement in compute-intensive tasks [79]. CUDA can make computations on a single computer run faster by using its CPU+GPU combination.
ii. Using GPUs in clusters of computers can achieve large-scale, cost-effective and power-efficient data mining solutions [80].
iii. There is a lot of scope for developing MapReduce-like models for programming heterogeneous CPU-GPU clusters.

C. Distributed Approach for Data Mining

The larger the amount of data stored on a single machine, the longer it takes to access. Over time the data grows so large that running analytical queries on it becomes very time consuming. When the data is divided and distributed over several machines, strong indexing techniques are needed to point at the appropriate servers. The distributed approach to data mining is useful when the data sources are at multiple sites. Data extraction, cleaning, pre-processing and integration consume the majority of the time, thereby delaying the analysis process. In time-critical applications, this delay cannot be tolerated. Thus, there is a need to mine such data in a distributed manner.

D. Distributed Data Mining Challenges

Distributed data: As an organization grows, managing its data is more convenient when the data is distributed by location or functionality. Storing and managing distributed data is a challenge, especially when it has to be reused for global processing. Wilford-Rivera et al. [11] have explored methods for applying data mining to distributed databases.

BIGDATA: Big data is a collection of structured and unstructured data sets that is so large and complex that it becomes difficult to process using conventional database management tools. Big data comes from varied sources such as web logs, click histories, e-commerce applications, retail purchase histories, bank and credit card transactions, social networking and media, mobile device call and text data, networked devices and sensors. Traditional data mining techniques are not up to the mark for processing and analyzing big data in a time-efficient and cost-efficient manner. For such applications, the MapReduce framework has recently attracted a lot of attention.
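The map and reduce functions over key-value pairs can be sketched in a few lines of plain Python. This is a single-process illustration under stated assumptions, not Hadoop or Google MapReduce code: the helper names (`map_phase`, `shuffle`, `reduce_phase`, `item_counts`) and the sample transactions are hypothetical, and the grouping step that a real framework performs across machines is simulated with a dictionary.

```python
from collections import defaultdict


def map_phase(transaction):
    """Map: emit an (item, 1) key-value pair for every item in one transaction."""
    return [(item, 1) for item in transaction]


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key into a single count."""
    return key, sum(values)


def item_counts(transactions):
    """Run map on every transaction, group by key, then reduce each group."""
    emitted = [pair for t in transactions for pair in map_phase(t)]
    return dict(reduce_phase(k, v) for k, v in shuffle(emitted).items())


transactions = [["bread", "milk"], ["bread", "butter"], ["milk", "butter", "bread"]]
counts = item_counts(transactions)
print(counts)  # {'bread': 3, 'milk': 2, 'butter': 2}
```

Because each call to `map_phase` looks at one transaction only, the map calls can run on whichever machine holds that part of the data, which is what makes the model attractive for large data sets.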
Google's MapReduce and its open-source equivalent Hadoop are powerful tools for building such applications. The main benefit of Hadoop is that it takes advantage of distributed processing and is scalable and fault tolerant. Kyuseok Shim [17] describes MapReduce as a parallel programming method that can be used for many machine learning applications. That work discusses the MapReduce framework based on Hadoop and presents the state of the art in MapReduce algorithms for data mining.

Unstructured and Complex Data: Unstructured data is data that cannot be retrieved through SQL. It is generally non-tabular and does not follow any fixed pattern [14]. A related new style of database called NoSQL (Not Only SQL) has emerged recently. NoSQL encompasses a wide variety of data management techniques, but exploring NoSQL data for analysis is still a challenging job. The main NoSQL databases [15] currently available include HBase, Cassandra, MarkLogic, Aerospike and MongoDB.

Distributed operations: Distributed queries are issued when an application distributes its tasks among different computers in a network. The challenge is to apply data mining techniques in a distributed fashion while reducing the overall data transfer over the network. Service-oriented architecture can be exploited for implementing data mining in distributed environments [18, 19].

Data privacy and security: Automated data mining in distributed environments raises serious issues in terms of data privacy, security and governance. Various algorithms have been modified to preserve privacy in distributed environments [37, 38, 39, 40].

E. Distributed Data Mining Algorithms

With the fast-growing business intelligence market, the exponential increase in the amount of data and the distributed locations of data, a requirement for distributed data mining has arisen. The distribution may be either of computation or of data. For distributing the mining task, either of two strategies can be used:
i. Message passing among nodes or processors: Nodes in a distributed system communicate via messages. This method incurs a lot of synchronization overhead.
ii. Centralized ensemble methods: This method generates local models and transmits them (asynchronously) to a central site. The central site forms a combined global model. These methods require only a single round of message passing, resulting in modest synchronization requirements [41].

First, we present different distributed data mining techniques proposed by researchers that have helped to enhance the performance of basic data mining techniques. We include association rules, classification and clustering in our study.

Distributed Association Rules: Association rule mining has been studied intensively in the last 20 years. Hundreds of algorithms have been proposed to date, but the recent focus is on mining association rules in a distributed fashion. The Apriori, Pincer-Search and FP-Tree growth algorithms have been implemented in different ways using different data structures. Moving toward the distributed approach, researchers have tried to parallelize existing algorithms and proposed CD (Count Distribution), FDM (Fast Distributed algorithm), FPM (Frequent Pattern Mining) and DDM (Distributed Data Mining). Later, researchers started optimizing the ARM algorithms by using hybrid methods. Assaf Schuster, Ran Wolff and Dan Trock used a combination of sampling and storing in a vertical trie data structure, and then mined this data structure using the DDM method [42]. A tree-based algorithm for generating frequent item sets was also proposed, which uses a Pattern Count Tree to represent the database [44]. A parallel algorithm for mining association rules on shared-memory multiprocessors was tested for optimizations of fast frequency computation. Degree of parallelism, synchronization and data locality issues have also been discussed for shared-memory systems [46]. The D-ARM algorithm [47] proposed by Assaf Schuster performed well on a number of computing nodes with lower communication cost.

Distributed Classification Algorithms: Standard classification algorithms include C4.5, ID3, SLIQ and SPRINT. Many researchers have worked on parallel implementations of these algorithms [48]. An algorithm for classification on multi-relational data that handles missing values and has a lower communication cost has been proposed by Anna Atramentov [49]. Current database systems are mostly distributed in nature, and performing classification on such distributed data is highly challenging. Inducing decision trees in a large distributed network of databases requires an algorithm that can reduce the communication overhead by sending just a fraction of the statistical data [50]. In a distributed environment, data is distributed and every data server holds only a partial set of the data. A classification algorithm for vertically partitioned data assumes that classifiers can be constructed locally. These local classifiers can be used to support decision making at each location. A global classifier can then be constructed that has access to the entire feature set [51].

A K-Means-based P2P mining technique [60] for clustering homogeneously distributed data works by communicating with the neighbouring nodes in an asynchronous manner. The work also offers a theoretical analysis of the algorithm that bounds the error of the distributed clustering process compared to the centralized approach. The Newscast model of computation, used by Wojtek Kowalczyk, Mark Jelasity and A.E. Eiben, efficiently mines data over P2P overlay networks [61].

Distributed Clustering Algorithms: S. Datta, C. Giannella and H. Kargupta [52] presented a K-Means algorithm for clustering large data distributed over a dynamic network.
This algorithm is robust to network change and does not require global synchronization; it is based upon local synchronization. S. Bandyopadhyay, C. Giannella, U. Maulik, H. Kargupta, K. Liu and S. Datta [53] described a technique for clustering homogeneously distributed data in a peer-to-peer environment. The proposed technique is based on the principles of the K-Means algorithm. In this technique, the neighbouring nodes communicate in a localized, asynchronous manner. The clustering process can be optimised by sending the best representatives to a server site. The process can be very efficient, because the local representatives at different sites can be determined quickly and independently of each other. Based on the most suitable local representatives, global clustering can be done efficiently [54, 55]. A novel distributed clustering algorithm, KDEC, uses sampling-based methods for non-parametric kernel density estimation on local sites. It also takes into account the issues of privacy and communication cost that arise in a distributed environment [56].

F. Data Mining in Distributed Computing Environments

GRIDS: A grid consists of many loosely coupled, possibly geographically distributed, heterogeneous computers that are made to work together on either a single problem or related problems. Grids are required by professional communities that need access to remote resources and distributed datasets, and that perform large-scale data analyses. A grid can play a significant role in providing effective computational support for distributed data mining applications. Cannataro M, Congiusta A, Pugliese A, Talia D and Trunfio P. [62] designed a system called Knowledge Grid. Their work describes the Knowledge Grid framework and presents the toolset it provides for implementing distributed data mining on a grid. A three-layered architecture called the Data Mining Grid system was proposed to enable the creation of grids dedicated to data mining tasks [67]. Globus Toolkit 4 is used as middleware between the upper and lower layers. A grid can offer an infrastructure for supporting decentralized and parallel data analysis. Service-oriented grid computing allows end users to focus on the knowledge discovery process without worrying about the details of the grid infrastructure [68]. Data mining services on grids can now also be accessed through web services [69]. The Knowledge Grid framework provides a toolset for implementing distributed knowledge discovery. The tool supports the whole process, starting from searching grid resources and ending with executing the resulting data mining process on a grid [70]. An intra-grid-based data mining tool, DMGCE, has been developed using competitive directed acyclic graphs in a heterogeneous computing environment. It works on a dynamic scheduling framework in which existing DM algorithms are reused by encapsulating them into agents [71].

PEER to PEER systems: The idea behind peer-to-peer computing is to create a group of connected computers that combine their computing and processing abilities to solve complex problems. Each computer has equal capability. This architecture is widely used for large-scale data storage, scientific computation and data analytics. DDM applications and algorithms for peer-to-peer environments are described by Datta et al., where both exact and approximate local P2P data mining algorithms work in a decentralized and communication-efficient manner [57]. Ran Wolff and Assaf Schuster proposed association rule mining in peer-to-peer systems [58]. They presented an algorithm by which every node in the system can arrive at approximately the same confidence levels even though each node works only on its own data partition. Assaf Schuster and Ran Wolff [59] also presented a set of new algorithms that solve the Distributed Association Rule Mining problem. These algorithms are very efficient and also extremely robust on skewed and imbalanced data partitions.
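The core idea of mining association rules over data partitions can be illustrated with a count-distribution style sketch: every site counts candidate itemsets on its own partition, and only those counts are combined. The code below is an illustration under stated assumptions, not the algorithm of [58] or [59]: a simple summation stands in for peer-to-peer message exchange, only two Apriori levels are mined, and the names (`local_counts`, `global_frequent`, `mine_pairs`) and sample data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def local_counts(transactions, candidates):
    """One site's support counts for the candidates, computed from its own data only."""
    counts = Counter()
    for transaction in transactions:
        items = set(transaction)
        for candidate in candidates:
            if candidate <= items:  # candidate itemset is contained in the transaction
                counts[candidate] += 1
    return counts


def global_frequent(sites, candidates, min_count):
    """Sum the per-site counts; counts are exchanged, raw transactions are not."""
    total = Counter()
    for site in sites:
        total.update(local_counts(site, candidates))
    return {c: n for c, n in total.items() if n >= min_count}


def mine_pairs(sites, min_count):
    """Two Apriori levels: frequent single items, then pairs built from them."""
    # Each site would report only the item names it has seen locally.
    singles = {frozenset([item]) for site in sites for t in site for item in t}
    f1 = global_frequent(sites, singles, min_count)
    frequent_items = sorted(next(iter(c)) for c in f1)
    pairs = {frozenset(p) for p in combinations(frequent_items, 2)}
    return f1, global_frequent(sites, pairs, min_count)


site_a = [["bread", "milk"], ["bread", "butter"]]
site_b = [["milk", "butter", "bread"], ["milk"], ["bread", "milk", "eggs"]]
f1, f2 = mine_pairs([site_a, site_b], min_count=3)
print(f2)  # {frozenset({'bread', 'milk'}): 3}
```

In this example neither site alone reaches the threshold for the pair {bread, milk}; the pair only becomes frequent once the local counts (1 and 2) are added, which is why skewed partitions make the distributed problem harder than the centralized one.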
CLUSTERS: All machines in a cluster are homogeneous and work as a single unit. The computers in a cluster are normally located at a single site. They work as a single computing resource and are connected by high-speed networks. A cluster is a cheaper alternative to a single high-performance system such as a supercomputer. Clusters are ideal for users who need to run similar jobs, as is the case in data mining. A tree data structure is generally used to compress the database, and such tree structures can be mined efficiently using the frequent pattern growth methodology. A PC-cluster-based framework can be used for tree mining, resulting in an improved support counting procedure [72]. However, there is a problem when the FP-tree cannot fit into memory. A parallel execution of FP-growth on a PC cluster has been implemented for execution efficiency in a shared-nothing environment [73]. The performance of the FP-tree decreases as the size of the database increases. To handle this problem, a combination of parallel and distributed techniques can be applied. The use of the Tidset-based Parallel FP-tree (TPFP-tree) and the Balanced Tidset-based Parallel FP-tree (BTP-tree) for frequent pattern mining on PC clusters and multi-cluster grids shortens the execution time significantly [74]. Recently, a framework called MATE-CG was proposed that provides a MapReduce-like framework for data-intensive computations on a heterogeneous cluster of multi-core CPUs and many-core GPUs [75]. Data mining on very large datasets can be optimised by using the open-source framework Hadoop. Hadoop MapReduce is a highly parallel programming paradigm that is used by major companies such as Yahoo, Facebook, eBay, Twitter and many more.

Future Scope in Distributed Data Mining: We have discussed several issues related to distributed data mining. After carefully examining the current trends, we propose that data mining techniques in the near future will be oriented towards the following areas:
i. Use of Hadoop MapReduce for large data [81, 82]. The Hadoop Distributed File System automatically handles scalability and fault tolerance issues.
ii. By combining Hadoop ecosystem tools such as Mahout, Sqoop, Flume and the Mongo-Hadoop Connector, NoSQL big databases can be mined [83].
iii. CUDA, a parallel programming model designed for GPUs, can run within MapReduce to further improve the efficiency of mining compute-intensive tasks over petabytes of data [84].
iv. Use of G-Hadoop with the Gfarm file system, a MapReduce framework that can be used for large-scale distributed computing on distributed data [85].
v. Cloud-based techniques for data mining are almost unexplored, so there is a lot of scope in this direction.

CLOUD: The cloud is an infrastructure that provides services and resources through the internet. The main services are Infrastructure as a Service (IaaS), Platform as a Service (PaaS) and Software as a Service (SaaS). The cloud can be used to utilize virtual resources for data-intensive and compute-intensive analyses. Data mining computations can be optimised using parallel programming paradigms such as Hadoop MapReduce, CGL-MapReduce and Dryad. However, many scientific applications still require the low-latency communication mechanisms provided by runtimes such as MPI. Different MapReduce implementations of data mining algorithms have been run on virtualised resources in the cloud [76]. To efficiently support many important data mining algorithms in a cloud environment, a distributed framework called GraphLab was recently proposed. It is a graph-based extension that is fault tolerant and reduces network congestion [77]. A high-performance cloud can be used to mine large distributed data sets [78]. Sector is a distributed file system whose data can be processed by Sphere, a high-performance parallel data processing engine.
Sector and Sphere are designed for analyzing large data sets using computer clusters connected by wide-area high-performance networks. A distributed data mining application has been developed using Sector and Sphere.

III. CONCLUSION

Data mining has become more relevant today with the increase in the amount of data generated every minute. With issues such as growing size, data distribution, unstructured data, cleaning and pre-processing, it remains an open challenge. Data mining techniques can be sped up by a proper combination of parallel and distributed approaches. As data travels over the network in distributed systems, privacy preservation techniques must be applied in every DDM technique. In a distributed scenario, better performance in terms of memory utilization and speedup can be obtained with a proper blend of resources. A lot of advancement in the field of data mining has been observed in the last decade, yet several network-related and computing-related bottlenecks still exist. We have addressed many challenges and recent research areas in the field of distributed data mining. Distributed data mining still has a long way to go before it fully benefits scientists, academicians and industry.

REFERENCES AND BIBLIOGRAPHY

[1] Zeng, Li, et al. "Distributed data mining: a survey." Information Technology and Management 13.4 (2012): 403-409.
[2] X. Wu, V. Kumar, J.R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, A. Ng, B. Liu, P.S. Yu, Z. Zhou, M. Steinbach, D. J. Hand, D. Steinberg, "Top 10 Algorithms in Data Mining," Knowl Inf Syst (2008) 14:1-37.
[3] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In J. B. Bocca, M. Jarke, and C. Zaniolo, editors, VLDB, pages 487-499. Morgan Kaufmann, 1994.
[4] P. S. Bradley, U. M. Fayyad, and C. Reina. Scaling clustering algorithms to large databases. In Proceedings of Knowledge discovery in Data Conference, pages 9-15, 1998.
[5] Chiang D, Lin C, Chen M (2011) The adaptive approach for storage assignment by mining data of warehouse management system for distribution centres. Enterp Inf Syst 5(2):219-234
[6] Duan L, Xu L, Guo F, Lee J, Yan B (2007) A local-density based spatial clustering algorithm with noise. Inf Syst 32:978-986
[7] Duan L, Xu L, Liu Y, Lee J (2009) Cluster-based outlier detection. Ann Oper Res 168:151-168
[8] Agrawal R, Srikant R. Fast algorithms for mining association rules. In: Proceedings of the 20th VLDB conference 1994, pp 487-499
[9] Ahmed S, Coenen F, Leng PH. Tree-based partitioning of date for association rule mining. 2006 Knowl Inf Syst 10(3):315-331
[10] Banerjee A, Merugu S, Dhillon I, Ghosh J. Clustering with Bregman divergences. J Mach Learn Res 6 (2005):1705-1749
[11] Wilford-Rivera, Ingrid, et al. "Integrating Data Mining Models from Distributed Data Sources." Distributed Computing and Artificial Intelligence. Springer Berlin Heidelberg, 2010. 389-396.
[12] http://www.sas.com/en_us/insights/big-data/
[13] Lämmel, Ralf. "Google's MapReduce programming model - Revisited." Science of computer programming 70.1 (2008): 1-30.
[14] Blumberg, Robert, and Shaku Atre. "The problem with unstructured data." DM REVIEW 13 (2003): 42-49.
[15] Han, Jing, et al. "Survey on NoSQL database." Pervasive computing and applications (ICPCA), 2011 6th international conference on. IEEE, 2011.
[16] Andrade, Diego, et al. "Task-parallel versus data-parallel library-based programming in multicore systems." Parallel, Distributed and Network-based Processing, 2009 17th Euromicro International Conference on. IEEE, 2009.
[17] Shim, Kyuseok. "MapReduce algorithms for Big Data analysis." Proceedings of the VLDB Endowment 5.12 (2012): 2016-2017.
[18] Talia, Domenico, Paolo Trunfio, and Oreste Verta. "Weka4ws: a wsrf-enabled weka toolkit for distributed data mining on grids." Knowledge Discovery in Databases: PKDD 2005. Springer Berlin Heidelberg, 2005. 309-320.
[19] Talia, Domenico, Paolo Trunfio, and Oreste Verta. "The Weka4WS framework for distributed data mining in service-oriented Grids." Concurrency and Computation: Practice and Experience 20.16 (2008): 1933-1951.
[20] Pujari, Arun K. Data mining techniques. Universities press, 2001.
[21] Han, Jiawei, Micheline Kamber, and Jian Pei. Data mining: concepts and techniques. Morgan Kaufmann, 2006.
[22] Piatetsky-Shapiro, Gregory, et al. "What are the grand challenges for data mining?: KDD-2006 panel report." ACM SIGKDD Explorations Newsletter 8.2 (2006): 70-77.
[23] Nickolls, John, et al. "Scalable parallel programming with CUDA." Queue 6.2 (2008): 40-53.
[24] Anuradha, T., R. Satya Pasad, and S. N. Tirumalarao. "Parallelizing Apriori on Dual Core using OpenMP." International Journal of Computer Applications 43 (2012).
[25] http://www.cs.ucla.edu/~palsberg/course/cs239/papers/EECS2006-183.pdf
[26] Zaki, Mohammed Javeed, et al. "New Algorithms for Fast Discovery of Association Rules." KDD. Vol. 97. 1997.
[27] Parthasarathy, Srinivasan, et al. "Parallel data mining for association rules on shared-memory systems." Knowledge and Information Systems 3.1 (2001): 1-29.
[28] Zaïane, Osmar R., Mohammad El-Hajj, and Paul Lu. "Fast parallel association rule mining without candidacy generation." Data Mining, 2001. ICDM 2001, Proceedings IEEE International Conference on. IEEE, 2001.
[29] Zaki, Mohammed Javeed, Ching-Tien Ho, and Rakesh Agrawal. "Parallel classification for data mining on shared-memory multiprocessors." Data Engineering, 1999. Proceedings., 15th International Conference on. IEEE, 1999.
[30] Huang, Zhexue. "A Fast Clustering Algorithm to Cluster Very Large Categorical Data Sets in Data Mining." DMKD. 1997.
[31] Kwok, Terence, et al. "Parallel fuzzy c-means clustering for large data sets." Euro-Par 2002 Parallel Processing. Springer Berlin Heidelberg, 2002. 365-374.
[32] Foti, D., et al. "Scalable parallel clustering for data mining on multicomputers." Parallel and Distributed Processing. Springer Berlin Heidelberg, 2000. 390-398.
[33] Goil, Sanjay, and Alok Choudhary. "PARSIMONY: An infrastructure for parallel multidimensional analysis and data mining." Journal of parallel and distributed computing 61.3 (2001): 285-321.
[34] Du, Wenliang, and Zhijun Zhan. "Building decision tree classifier on private data." Proceedings of the IEEE international conference on Privacy, security and data mining-Volume 14. Australian Computer Society, Inc., 2002.
[35] Du, Wenliang, Yunghsiang S. Han, and Shigang Chen. "Privacy-Preserving Multivariate Statistical Analysis: Linear Regression and Classification." SDM. Vol. 4. 2004.
[36] Zhan, Zhijun, and Wenliang Du. "Privacy-Preserving Data Mining Using Multi-Group Randomized Response Techniques." Group 1.2 (2010): 3.
[37] Kiran, P., and N. P. Kavya. "A Survey on Methods, Attacks and Metric for Privacy Preserving Data Publishing." International Journal of Computer Applications 53 (2012).
[38] Han, Jiawei, Micheline Kamber, and Jian Pei. Data mining: concepts and techniques. Morgan Kaufmann, 2006.
[39] Schuster, Assaf, Ran Wolff, and Dan Trock. "A high-performance distributed algorithm for mining association rules." Knowledge and Information Systems 7.4 (2005): 458-475.
[40] Agarwal, Ramesh C., Charu C. Aggarwal, and V. V. V. Prasad. "A tree projection algorithm for generation of frequent item sets." Journal of parallel and Distributed Computing 61.3 (2001): 350-371.
[41] Ananthanarayana, V. S., D. K. Subramanian, and M. Narasimha Murty. "Scalable, distributed and dynamic mining of association rules." High Performance Computing - HiPC 2000. Springer Berlin Heidelberg, 2000. 559-566.
[42] Nestorov, S. "Mining Qualified Association Rules in Distributed Databases." Work-shop on Data Mining and Exploration Middleware for Distributed and Grid Computing, Minneapolis, MINI (2003).
[43] Parthasarathy, Srinivasan, et al. "Parallel data mining for association rules on shared-memory systems." Knowledge and Information Systems 3.1 (2001): 1-29.
[44] Schuster, Assaf, Ran Wolff, and Dan Trock. "A high-performance distributed algorithm for mining association rules." Knowledge and Information Systems 7.4 (2005): 458-475.
[45] Amado, Nuno, Joao Gama, and Fernando Silva. "Exploiting Parallelism in Decision Tree Induction." Proceedings from the ECML/PKDD Workshop on Parallel and Distributed computing for Machine Learning. 2003.
[46] Atramentov, Anna, Hector Leiva, and Vasant Honavar. "A multi-relational decision tree learning algorithm - implementation and experiments." Inductive Logic Programming. Springer Berlin Heidelberg, 2003. 38-56.
[66] Richard Olejnik, Teodor-Florin Fortiş, Bernard Toursel "Web services oriented data mining in knowledge architecture" Future Generation Computer Systems, Volume 25, Issue 4, April 2009, Pages 436-443
[67] Cannataro, Mario, et al. "Distributed data mining on grids: services, tools, and applications." Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on 34.6 (2004): 2451-2465.
[68] Luo, Ping, et al. "Distributed data mining in grid computing environments." Future Generation Computer Systems 23.1 (2007): 84-91.
[69] Pramudiono, Iko, and Masaru Kitsuregawa. "Tree structure based parallel frequent pattern mining on pc cluster." Database and Expert Systems Applications. Springer Berlin Heidelberg, 2003.
[70] Pramudiono, Iko, and Masaru Kitsuregawa. "Parallel FP-growth on PC cluster." Advances in Knowledge Discovery and Data Mining. Springer Berlin Heidelberg, 2003. 467-473.
[71] Yu, Kun-Ming, and Jiayi Zhou.
"Parallel TID-based frequent pattern mining algorithm on a PC Cluster and grid computing system." Expert Systems with Applications 37.3 (2010): 24862494. [72] Jiang, Wei, and Gagan Agrawal. "Mate-cg: A map reduce-like framework for accelerating data-intensive computations on heterogeneous clusters." Parallel & Distributed Processing Symposium (IPDPS), 2012 IEEE 26th International. IEEE, 2012. [73] Ekanayake, Jaliya, and Geoffrey Fox. "High performance parallel computing with clouds and cloud technologies." Cloud Computing. Springer Berlin Heidelberg, 2010. 20-38. [74] Low, Yucheng, et al. "Distributed GraphLab: a framework for machine learning and data mining in the cloud." Proceedings of the VLDB Endowment 5.8 (2012): 716-727. [75] Grossman, Robert, and Yunhong Gu. "Data mining using high performance data clouds: experimental studies using sector and sphere." Proceedings of the 14th ACM SIGKDD international conference on Knowledge discovery and data mining. ACM, 2008. [76] Linchuan Chen, Xin Huo , and Gagan Agrawal . ―Accelerating MapReduce on a coupled CPU - GPU architecture‖. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC ’12, pages 25:11, Los Alamitos, CA, USA, 2012. IEEE Computer Society Press. [77] J.A. Stuart and J.D. Owens. ―Multi GPU MapReduce on GPU Clusters‖. In Parallel Distributed Processing Symposium (IPDPS), 2011 IEEE International, pages 1068 1079, may 2011. [78] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating mapreduce for multi-core and multiprocessor systems. In HPCA ’07: proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, pages 13–24, Washington, DC, USA, 2007. IEEE Computer Society. [79] Jeffrey Dean and Sanjay Ghemawat. ― MapReduce: simplified data processing on large clusters‖. Commun. ACM, 51(1):107– 113, January 2008. [80] Hadoop. 
http://hadoop.apache.org/ [81] Bingsheng He, Wenbin Fang, Qiong Luo, Naga K. Govindaraju, and Tuyong Wang. ―Mars: a MapReduce framework on graphics processors‖. In Proceedings of the 17th international conference on Parallel architectures and compilation technique s, PACT ’08, pages 260 -269, New York, NY, USA, 2008. ACM. [47] Bar-Or, Amir, et al. "Hierarchical decision tree induction in distributed genomic databases." Knowledge and Data Engineering, IEEE Transactions on 17.8 (2005): 1138-1151. [48] Basak, Jayanta, and Ravi Kothari. "A classification paradigm for distributed vertically partitioned data." Neural computation 16.7 (2004): 1525-1544. [49] S. Datta, C. Giannella, and H. Kargupta. K-Means Clustering over a Large, Dynamic Network. In Proceedings of 2006 SIAM Conference on Data Mining, Bethesda, MD, April 2006. [50] K.HammoudaandM .Kamel. HP2PC: Scalable Hierarchically Distributed Peer-to-Peer Clustering. In Proceedings of the 2007 SIAM International Conference on Data Mining (SDM ’07), Philadelphia, PA, 2007. [51] Klusch, Matthias, Stefano Lodi, and Gianluca Moro. "Distributed clustering based on sampling local density estimates." IJCAI. 2003. [52] Januzaj, Eshref, Hans-Peter Kriegel, and Martin Pfeifle. "Scalable density-based distributed clustering." Knowledge Discovery in Databases: PKDD 2004. Springer Berlin Heidelberg, 2004. 231244. [53] Klusch, Matthias, Stefano Lodi, and Gianluca Moro. "Distributed clustering based on sampling local density estimates." IJCAI. 2003. [54] Datta, Souptik, et al. "Distributed data mining in peer-to-peer networks." Internet Computing, IEEE 10.4 (2006): 18-26. [55] Wolff, Ran, and Assaf Schuster. "Association rule mining in peerto-peer systems." Systems, Man, and Cybernetics, Part B: Cybernetics, IEEE Transactions on 34.6 (2004): 2426-2438. [56] Schuster, Assaf, and Ran Wolff. "Communication-efficient distributed mining of association rules." Data Mining and Knowledge Discovery 8.2 (2004): 171-196. 
[57] Bandyopadhyay, Sanghamitra, et al. "Clustering distributed data streams in peer-to-peer environments." Information Sciences 176.14 (2006): 1952-1985. [58] Kowalczyk, Wojtek, Márk Jelasity, and A. Eiben. "Towards data mining in large and fully distributed peer-to-peer overlay networks." Proceedings of BNAIC’03. 2003. [59] Cannataro M, Congiusta A, Pugliese A, Talia D, Trunfio P.‖Distributed data mining on grids: services, tools, and applications‖. IEEE Trans Syst Man Cybern B Cybern. 2004 Dec;34(6):2451-65. [60] Jin, Ruoming, Ge Yang, and Gagan Agrawal. "Shared memory parallelization of data mining algorithms: Techniques, programming interface, and performance." Knowledge and Data Engineering, IEEE Transactions on 17.1 (2005): 71-89. [61] K. Bhaduri, R. Wolf, C. Giannella, and H. Kargupta. Distributed decision-tree induction in peer-to-peer systems. Stat. Anal. Data Min., 1(2):85–103, 2008. [62] P.Luo,H.Xiong,K.Lu,andZ.Shi. distributed Classification inPeerto-PeerNetworks. In Proceedings of the 13th International Conference on Knowledge Discovery and Data Mining (KDD ’07), pages 968–976, New York NY, 2007. [63] Stankovski, Vlado, et al. "Grid-enabling data mining applications with DataMiningGrid: An architectural perspective." Future Generation Computer Systems 24.4 (2008): 259-279. [64] María S. Pérez, Alberto Sánchez, Víctor Robles, Pilar Herrero, José M. Peña ―Design and implementation of a data mining gridaware architecture‖ Future Generation Computer Systems, Volume 23, Issue 1, 1 January 2007, Pages 42–47 [65] Antonio Congiusta, Domenico Talia, Paolo Trunfio ‖ Serviceoriented middleware for distributed data mining on the grid‖ Journal of Parallel and Distributed Computing, Volume 68, Issue 1, January 2008, Pages 3–15 460 International Journal of Emerging Technology and Advanced Engineering Website: www.ijetae.com (ISSN 2250-2459, ISO 9001:2008 Certified Journal, Volume 4, Issue 3, March 2014) [82] F. Ahmad, S. T. Chakradhar, A. Raghunathan, and T. N. 
Vijaykumar. Tarazu: optimizing mapreduce on heterogeneous clusters. In Proceedings of the seventeenth international conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS ’12, pages 6174, New York, NY, USA, 2012. ACM [83] Wang, Lizhe, et al. "G-Hadoop: MapReduce across distributed data centers for data-intensive computing." Future Generation Computer Systems 29.3 (2013): 739-750. 461