Survey
IDEA GROUP PUBLISHING
701 E Chocolate Avenue, Hershey PA 17033, USA
Tel: 717/533-8845; Fax: 717/533-8661; URL: www.idea-group.com
ITB7039

Chapter XIII
Parallel Data Mining

David Taniar
Monash University, Australia

J. Wenny Rahayu
La Trobe University, Australia

Data mining refers to the nontrivial extraction of implicit, previously unknown and potentially useful information (such as knowledge rules, constraints, regularities) from data in databases. With the availability of inexpensive storage and the progress in data capture technology, many organizations have created ultra-large databases of business and scientific data, and this trend is expected to grow. Since the databases to be mined are likely to be very large (measured in terabytes and even petabytes), there is a critical need to investigate parallel data mining techniques. Without parallelism, it is generally difficult for a single-processor system to provide reasonable response times. In this chapter, we present a comprehensive survey of parallelism techniques for data mining. Parallel data mining introduces additional complexity, as it incorporates techniques from parallel databases and parallel programming. Challenges that remain open for future research will also be presented.

INTRODUCTION
Data mining refers to the nontrivial extraction of implicit, previously unknown and potentially useful information (such as knowledge rules, constraints, regularities) from data in databases. Techniques for data mining include mining association rules, data classification, generalization, clustering, and searching for patterns (Chen, Han, & Yu, 1996).
The focus of data mining is to reveal information that is hidden and unexpected, as there is little value in finding patterns and relationships that are already intuitive. By discovering hidden patterns and relationships in the data, data mining enables users to extract greater value from their data than simple query and analysis approaches.

This chapter appears in the book, Data Mining: A Heuristic Approach, edited by Hussein Abbass, Ruhul Sarker and Charles Newton. Copyright 2002, Idea Group Inc.

To discover the hidden patterns in data, we need to build a model consisting of independent variables (e.g., income, marital status) that can be used to determine a dependent variable (e.g., credit risk). Building a data mining model consists of identifying the relevant independent variables and minimizing the generalization error. Identifying the model that has the least error and is the best predictor may require building hundreds of models in order to select the best one.

We have now reached a point in terms of computational power, storage capacity and cost that enables us to gather, analyze and mine unprecedented amounts of data. Because of the size and complexity of these data, a high-performance data mining product is critically required. High performance in data mining means taking advantage of parallel database management systems and additional CPUs in order to gain performance benefits. By adding processing elements, more data can be processed, more models can be built and the accuracy of the models can be improved.

In this chapter, we present a study of how parallelism can be achieved in data mining. To explain this, we need to study parallelism in more detail. We also need to highlight data mining techniques. The merging of these two technologies, namely parallelism and data mining, is then presented, including various existing parallel data mining algorithms.
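The remark that many candidate models may have to be built, and that extra processing elements allow more of them to be built, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and does not come from the chapter: the toy records in `HELD_OUT`, the one-variable rule "income below a threshold means high credit risk", and the names `error_rate`, `evaluate` and `select_best`. The error measured on held-out records stands in for the generalization error.

```python
from functools import partial

# Hypothetical held-out records: (income in thousands, is_high_risk).
HELD_OUT = [(22, True), (30, True), (38, False), (45, False), (60, False)]


def error_rate(threshold, records):
    """Fraction of records misclassified by the rule 'income < threshold => high risk'."""
    wrong = sum((income < threshold) != risky for income, risky in records)
    return wrong / len(records)


def evaluate(threshold, records):
    """Score one candidate model; each call is independent of all the others."""
    return threshold, error_rate(threshold, records)


def select_best(thresholds, records, mapper=map):
    """Score every candidate via `mapper` and return the (threshold, error) pair with least error."""
    scored = mapper(partial(evaluate, records=records), thresholds)
    # Ties on error are broken in favour of the smaller threshold.
    return min(scored, key=lambda pair: (pair[1], pair[0]))
```

With the default `mapper=map` the candidates are scored one after another. Because no candidate depends on another, passing the `map` method of a `concurrent.futures.ProcessPoolExecutor` as `mapper` would spread the same calls over several CPUs without changing the selection logic.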
Finally, we highlight the challenges, including research topics that still have to be investigated.

PARALLELISM
In Parallel Data Mining, one of the most important keywords is “Parallel.” In the following sections, we describe the architectures of parallel technology, the forms of parallelism available in data mining, the objectives of parallelism, and the obstacles to employing parallelism in data mining.

Parallel Technology
The motivation for the use of parallel technology in data mining comes not only from the need for performance improvement, but also from the fact that parallel computers are no longer a monopoly of supercomputers and are now available in many forms, such as systems consisting of a small number of powerful processors (e.g., SMP machines), clusters of workstations (e.g., loosely coupled shared-nothing architectures), massively parallel processors (MPP), and clusters of SMP machines (i.e., hybrid architectures) (Almasi & Gottlieb, 1994). Parallel architectures used for data-intensive applications, including data mining, are commonly classified into several categories, namely shared-memory, shared-disk, shared-nothing, and shared-something architectures (Bergsten, Couprie &