IDEA GROUP PUBLISHING
701 E Chocolate Avenue, Hershey PA 17033, USA
Tel: 717/533-8845; Fax: 717/533-8661; URL: www.idea-group.com
Chapter XIII
Parallel Data Mining
David Taniar
Monash University, Australia
J. Wenny Rahayu
La Trobe University, Australia
Data mining refers to the nontrivial extraction of implicit, previously
unknown, and potentially useful information (such as knowledge rules,
constraints, and regularities) from data in databases. With the availability
of inexpensive storage and progress in data capture technology, many
organizations have created ultra-large databases of business and scientific
data, and this trend is expected to continue. Since the databases to be mined
are likely to be very large (measured in terabytes and even petabytes), there
is a critical need to investigate parallel data mining techniques. Without
parallelism, it is generally difficult for a single-processor system to
provide reasonable response time. In this chapter, we present a comprehensive
survey of parallelism techniques for data mining. Parallel data mining
introduces new complexity, as it incorporates techniques from parallel
databases and parallel programming. Challenges that remain open for future
research are also presented.
This chapter appears in the book, Data Mining: A Heuristic Approach, edited by Hussein Abbass,
Ruhul Sarker and Charles Newton. Copyright 2002, Idea Group Inc.
INTRODUCTION
Data mining refers to the nontrivial extraction of implicit, previously
unknown, and potentially useful information (such as knowledge rules,
constraints, and regularities) from data in databases. Techniques for data
mining include mining association rules, data classification, generalization,
clustering, and searching for patterns (Chen, Han, & Yu, 1996). The focus of
data mining is to reveal information that is hidden and unexpected, as there
is little value in finding patterns and relationships that are already
intuitive. By discovering hidden patterns and relationships in the data, data
mining enables users to extract greater value from their data than simple
query and analysis approaches allow. To discover the hidden patterns in data,
we need to build a model consisting of independent variables (e.g., income,
marital status) that can be used to determine a dependent variable (e.g.,
credit risk). Building a data mining model consists of identifying the
relevant independent variables and minimizing the generalization error.
Identifying the model that has the least error, and is therefore the best
predictor, may require building hundreds of models and selecting the best one.
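The selection step described above can be sketched in a few lines. This is only an illustration, not the chapter's method: the candidate models are simple threshold rules, the generalization error is approximated by the error rate on a small held-out sample, and every name, variable, and data value below (make_stump, HELD_OUT, the income figures) is hypothetical.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical record layout: (income in thousands, married flag as 0.0 or 1.0).
Record = Tuple[float, float]
# A model maps a record to 1 ("high credit risk") or 0 ("low credit risk").
Model = Callable[[Record], int]


def make_stump(feature: int, threshold: float) -> Model:
    """Candidate model: predict high risk when the chosen variable is below the threshold."""
    def predict(record: Record) -> int:
        return 1 if record[feature] < threshold else 0
    return predict


def error_rate(model: Model, records: Sequence[Record], labels: Sequence[int]) -> float:
    """Fraction of records the model misclassifies (assumed proxy for generalization error)."""
    wrong = sum(1 for record, label in zip(records, labels) if model(record) != label)
    return wrong / len(records)


def select_best(candidates: List[Tuple[str, Model]],
                records: Sequence[Record],
                labels: Sequence[int]) -> Tuple[str, float]:
    """Score every candidate on held-out data and return (name, error) of the best one."""
    scored = [(error_rate(model, records, labels), name) for name, model in candidates]
    best_error, best_name = min(scored)
    return best_name, best_error


# Made-up held-out sample and candidate models, for illustration only.
HELD_OUT: List[Record] = [(20, 0), (25, 1), (40, 0), (60, 1), (80, 1), (30, 0)]
LABELS: List[int] = [1, 1, 0, 0, 0, 1]
CANDIDATES: List[Tuple[str, Model]] = [
    ("income<35", make_stump(0, 35)),
    ("income<50", make_stump(0, 50)),
    ("unmarried", make_stump(1, 0.5)),
]

if __name__ == "__main__":
    print(select_best(CANDIDATES, HELD_OUT, LABELS))
```

With hundreds of candidates instead of three, the scoring loop is the expensive part, and each candidate can be scored independently of the others.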
We have now reached a point in terms of computational power, storage
capacity, and cost that enables us to gather, analyze, and mine unprecedented
amounts of data. Because of the size and complexity of these data,
high-performance data mining products are critically needed. High performance
in data mining means taking advantage of parallel database management systems
and additional CPUs to gain performance benefits. By adding processing
elements, more data can be processed, more models can be built, and the
accuracy of the models can be improved.
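As a rough illustration of how additional processing elements let more data be processed, the sketch below partitions a list of transactions, counts item occurrences in each partition on a separate worker, and merges the partial counts. It is a minimal example under stated assumptions, not an algorithm taken from this chapter: the function names, the round-robin partitioning, and the sample transactions are all hypothetical, and Python's standard concurrent.futures module stands in for a parallel database system.

```python
from collections import Counter
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
from typing import List, Sequence


def count_items(partition: Sequence[Sequence[str]]) -> Counter:
    """Count, for one partition, how many transactions contain each item."""
    counts: Counter = Counter()
    for transaction in partition:
        counts.update(set(transaction))  # count an item once per transaction
    return counts


def split(transactions: Sequence[Sequence[str]], n_parts: int) -> List[Sequence[Sequence[str]]]:
    """Round-robin partitioning (an assumed, simple placement scheme)."""
    return [transactions[i::n_parts] for i in range(n_parts)]


def parallel_item_counts(transactions: Sequence[Sequence[str]],
                         n_workers: int = 2,
                         executor_cls=ProcessPoolExecutor) -> Counter:
    """Count items per partition on separate workers, then merge the partial counts."""
    partitions = split(transactions, n_workers)
    with executor_cls(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_items, partitions))
    total: Counter = Counter()
    for counts in partial_counts:
        total.update(counts)  # Counter.update adds counts together
    return total


# Made-up transactions, for illustration only.
TRANSACTIONS = [
    ["bread", "milk"],
    ["bread", "butter"],
    ["milk"],
    ["bread", "milk", "butter"],
]

if __name__ == "__main__":
    # Worker processes by default; pass ThreadPoolExecutor for a lightweight test run.
    print(parallel_item_counts(TRANSACTIONS, n_workers=2))
    print(parallel_item_counts(TRANSACTIONS, n_workers=2, executor_cls=ThreadPoolExecutor))
```

The merged result equals what a single worker would compute over the whole data set, because each transaction is counted in exactly one partition. Item counting was chosen only because it is a common building block of association rule mining, one of the techniques listed in the introduction.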
In this chapter, we present a study of how parallelism can be achieved in
data mining. To explain this, we first examine parallelism in more detail and
then highlight data mining techniques. The merging of these two technologies,
namely parallelism and data mining, is then presented, including various
existing parallel data mining algorithms. Finally, we highlight the
challenges, including research topics that still have to be investigated.
PARALLELISM
In parallel data mining, one of the most important keywords is “parallel.”
The following sections describe the architectures of parallel technology, the
forms of parallelism available in data mining, the objectives of parallelism,
and the obstacles to employing parallelism in data mining.
Parallel Technology
The use of parallel technology in data mining is motivated not only by the
need for performance improvement, but also by the fact that parallel
computers are no longer a monopoly of supercomputers. They are now available
in many forms, such as systems consisting of a small number of powerful
processors (e.g., SMP machines), clusters of workstations (e.g., loosely
coupled shared-nothing architectures), massively parallel processors (MPP),
and clusters of SMP machines (i.e., hybrid architectures) (Almasi & Gottlieb,
1994). Parallel architectures used for data-intensive applications, including
data mining, are commonly classified into several categories, namely
shared-memory, shared-disk, shared-nothing, and shared-something
architectures (Bergsten, Couprie &
27 more pages are available in the full version of this
document, which may be purchased using the "Add to Cart"
button on the publisher's webpage:
www.igi-global.com/chapter/parallel-data-mining/7593