Download Distributed Computing Environment: Approaches and Applications

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cluster analysis wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Distributed Computing Environment: Approaches and Applications
Li Zeng, Lida Xu, Zhongzhi Shi, Maoguang Wang, Wenjuan Wu
Abstract—In information systems in particular Business
Intelligence systems, for data produced at physically distributed
locations most traditional data mining approaches require data
to be transmitted to a single location for centralized processing
and mining. However, the continual transmission of a large
number of data to a central location must be impractical and
expensive. Thus, distributed and parallel data mining
algorithms and applications were rapidly developed. The paper
surveys the-state-of-the art in approaches and applications of
distributed computing environment. The goal is to summarize a
brief introduction to this field with pointers for further
exploration.
Keywords—Distributed Computing; Distributed and Parallel;
Data Mining
I. INTRODUCTION
istributed computing environment is not far from
reality since its inception. With the rapid development of
Internet, intranets, and wireless sensor networks, et al.
analyzing and mining geographically distributed data sources
require data mining algorithms to design for distributed or
parallel applications, which have attracted tremendous
interest among researcher and practitioners. This paper is
intended as a short introduction to the study of distributed and
parallel data mining, emphasis on the approaches, categories,
and applications.
D
II. THE STATE OF THE ART
Traditional data mining algorithms often make the
assumption that the data is centralized, memory-resident, and
static. However, this assumption is no longer feasible
anywhere. More and more huge amount of data are producing
and storing in local databases every day. In these distributed
circumstances, centralized data mining algorithms must
This work was supported by the Outstanding Overseas Chinese Scholars
Fund of Chinese Academy of Sciences, the National Science Foundation of
China (No. 60435010, 90604017, 60675010), 863 National High-Tech
Program (No.2006AA01Z128), National Basic Research Priorities
Programme (No. 2003CB317004) and the Nature Science Foundation of
Beijing (No. 4052025).
L. Zeng, L. D. Xu, Z.Z. Shi, and M. G. Wang are with the Key Laboratory
of Intelligent Information Processing, Institute of Computing Technology,
Chinese Academy of Sciences, and Graduate University of Chinese
Academy of Sciences (e-mail: [email protected]).
W. J. Wu is with the School of Information, Renmin University of China.
1-4244-0991-8/07/$25.00/©2007 IEEE
waste lots of I/O resources, and excessive communication
overhead to transmit huge amount of data. Another difficulty
is to update and mine incrementally immense amounts of data
for centralized data mining algorithms when the data is
dynamic or immediate-change. Several inherent fields where
large amount of data are stored in decentralized structure
include WWW, multinational Corp, sensor network,
scientific field, manufacturing industries troubleshooting [1],
and health care where daily diagnostic data and medical
information stored by hospital management information
system.
All these have prompted the need for distributed and
parallel data mining algorithms. Either distributed or parallel
data mining is still a particular process involving the
application of specific algorithms, that is, distributed or
parallel algorithms, for extracting pattern from distributed
and large-scale data. In the meantime, the additional steps in
the data mining process can not be ignored, such as data
preparation, data reduction, data cleaning, and data
transformation.
III. APPROACHES AND APPLICATIONS
A. Distributed mining systems
A. Most distributed data mining projects are interested in
discovering hidden patterns in a complete data set that is
essentially partitioned and physically distributed at different
data sources. That is, the distributed algorithms perform
further analysis based on local results to achieve global
results. Different strategies are possible depending upon the
data, its distribution, and the resources availability. The most
representative example of DDM is distributed association
rule mining [16] and distributed decision tree [17]. Kargupta
et al. [21, 22] first proposed collective data mining which
works on vertically partitioned data and combines immediate
results from local data sources. Several systems have been
developed for distributed data mining. Perhaps the most
outstanding systems include JAM system developed by
Stolfo et al. [26], Kensington system developed by Guo et al.
[30], and Papyrus developed by Bailey et al. [31] JAM
employs meta-learning, while Kensington employs
3240
knowledge probing. Papyrus is designed to support different
data, task and model strategies. In contrast to JAM, Papyrus
can not only move models from node to node, but can also
move data from node to node, when that strategy is desired.
Also, Papyrus is a specialized system which is designed for
clusters, meta-clusters, and super-clusters, while most
systems are designed only for mining data distributed over the
Internet.
Distributed data mining can choose to move data, to move
intermediate results, to move predictive models or to move
the final results of a data mining algorithm. Distributed data
mining systems which employ local learning build models at
each site and move the models to a central location. Systems
which employ centralized learning move the data to a central
location for model building. Systems can also employ hybrid
learning, that is, strategies which combine local and
centralized learning. Most distributed data mining systems all
employ local learning. [31]
In respect of task strategy, distributed data mining systems
can choose to coordinate a data mining algorithm over several
sites or to apply data mining algorithms independently at each
site. With independent learning, data mining algorithms are
applied to each site independently. With coordinated learning,
one or more sites coordinate the tasks within a data mining
algorithm across several sites.
Several different methods have been employed for
combining predictive models built at different sites. The
simplest method is to use voting and combine the outputs of
the various methods with a majority vote. Meta-learning
combines several models by building a separate meta-model
whose inputs are the outputs of the various models and whose
output is the desired outcome [26, 30]. Multiple models, or
what are often called ensembles or committees of models,
have been used for quite a while in centralized data mining.
Methods for combining models in an ensemble include
Bayesian model averaging and model selection [32], partition
learning [33].
B. Peer-to-peer based
Communication bandwidth is often a scarce resource for
data mining in decentralized circumstances. Approaches
collecting all local data at one given node (or centre node)
will absolutely result in communication bottlenecks or
messages implosion at somewhere. Remarkably, peer-to-peer
(P2P) network provide a natural and well-suited platform for
such distributed data mining, as well as information
dissemination, file sharing, Voice over IP, and e-commerce.
As mentioned, DDM has introduced distribution version of
many traditional data mining algorithms, such as association
rule mining, clustering, and classification. However, most
DDM efforts assume a stable network and data, and they can
not be applied directly to dynamic topology such as node
failure, link failure, and immediate join-leave behavior. In
distributed data mining over P2P network, most recent
researches focused on developing local algorithms for
primitive operations such as average, sum, max, random
sampling, quantize, and other aggregate functions.
Wojtek et al. [8] used the newscast model to estimate the
mean value of distributed data, and provided the theoretical
and experimental evidence for its feasibility to distributed
data mining over P2P networks. They used empirical
probabilities instead of guaranteed correctness. Another
approach conducted by David [9] relied on uniform
gossip-based randomized algorithms for computing
aggregate information.
Mortada et al. [7] proposed a class of asynchronous
distributed algorithms, which is a Laplacian-based approach
to averaging the local inputs of nodes on a P2P network.
Remarkably, their algorithm did not rely on synchronization,
coordinated parameter values, and the apriori knowledge of
global topology. Souptik et al. [2, 5] presents an overview of
efforts to develop DDM application and algorithms in P2P
networks, and describe both exact and approximate local P2P
data mining algorithms that work in a decentralized manner.
Ran et al. [3, 4] introduced a local algorithm for computing
the majority vote, and proposed an algorithm for monitoring
K-means clustering in P2P networks. Similar work by Brian
et al. [6] assumed a centralized coordinator site and a
hierarchical topology for global conflict resolution, and
proved that distributed communication is only necessary on
occasion.
C. Internet and gird computing based
Internet and Grid computing seek to change the way we
tackle complex problems. Internet provides not only huge
volume of distributed data, but also plenty distributed
computing resources to construct a supercomputer than ever
before. Complex and data-intensive computing such as
bioinformatics data mining, climate prediction models or
airplane computer-aided design systems used to run only on
supercomputer. Recently, however, rapid improvements in
Internet, distributed algorithms and the increase in speed and
memory of the PC machine, lead to consider more
decentralized approaches to the problem of computing power.
There are over 400 million PCs around the world, and most
are idle much of the time. Internet computing exploits these
3241
idle workstations and PCs to create powerful distributed
computing systems with global reach. It has been possible to
undertake many significant projects required supercomputer
capabilities before on normal and cheap equipments. Some
examples of popular internet computing projects include:
z Models@Home[14]
Distributed computing in bioinformatics;
z SETI@home [12,15] setiathome.ssl.berkeley.edu
z FightAIDSatHome [13] www.fightaidsathome.org
z ClimatePrediction [11] www.climateprediction.net
z Casino 21: www.climate-dynamics.rl.ac.uk
Projects Models@Home [14] which is distributed
computing in bioinformatics using a screensaver based
approach, as done by SETI@Home, [12, 15] the world's
largest distributed computing project to detect intelligent life
outside Earth, have demonstrated the principles and
techniques of distributed computing which provides a
mechanism for engineer or researchers to tap into the unused
resource of millions of distributed PCs. Other similar projects
are FightAIDSatHome [13] which is the first biomedical
distributed computing project ever launched. It is run by the
Olson Laboratory at The Scripps Research Institute in
California. FightAIDSatHome uses scattered computer's idle
resources to assist fundamental research in discovering new
drugs, building on growing knowledge of the structural
biology of AIDS. These projects above require the
participant’s machines to download client, data and models
that executed on local machine, and then upload a result of
specified form to server. The ClimatePrediction project
developed by Stainforth et al. [11] takes the distributed
computing paradigm a step further by inviting participants to
download a full-scale climate model and run it locally to
simulate 100 years of the Earth’s climate. Each participant
independently carries out a subset of result which is then
combined with other participant’s results to ensemble a
global climate prediction result. Obviously, low
communication overhead was generated during the running
as each participant has the complete model.
A grid is a geographically distributed computation
infrastructure composed of a set of heterogeneous machines
that users can access via a single interface. Grids provide
common resource-access technology and operational services
across widely distributed boundary. Grid computing therefore
is described as seamless, pervasive access to resources and
services. It plays a significant role in providing an effective
computational support for application of distributed data
mining. Grid computing has been proposed as a novel
computational model, distinguished from conventional
distributed computing by its focus on large-scale resource
sharing, innovative applications, and, in some cases,
high-performance orientation. Today grids can be used as
effective infrastructures for distributed high-performance
computing and data processing [24, 25]. Cannataro et al. [24]
discussed how to design and implement data mining
applications by using the knowledge grid tools starting from
searching grid resources, composing software and data
components, and executing the data mining process on a grid.
In addition, Luo [27] proposed a two-phase scheduling
framework including external scheduling and internal
scheduling in two-level grid environment, upon which a
system named DMGCE (Data Mining Grid Computing
Environment) has been developed and implemented.
D. Meta-learning, agent based and Service-oriented
Meta-learning, loosely defined as learning from learned
knowledge, is another technique recently developed that
deals with the problem of computing a “global” model from
large and inherently distributed databases. The goal of
meta-learning is to compute a number of independent model
(classifiers) by applying learning programs in parallel, that is,
without transferring or directly accessing the data sites. Stolfo
et al. [26] developed such a system so-called JAM (Java
Agent for Meat-learning) mentioned above to achieved the
goal. JAM is a distributed, scalable and portable agent-based
data mining system that provides a set of meta-learning
agents for combining multiple models that were learned at
different sites. As long as a machine learning program is
defined and encapsulated as an object conforming to the
interface requirements, it can be imported and used directly in
JAM. [26] This plug and play characteristic makes JAM truly
powerful and extensible data mining facility, although using a
centralized approach to maintain the global configuration
may be an obvious sequential bottleneck. JAM is so far the
most representative system to mine distributed databases by
means of meta-learning and intelligent agents.
Service-oriented architecture (SOA) provides seamless
integration of self-contained computational services that can
communicate and coordinate with each other to perform
goal-directed computation. The concept has blossomed in the
past few years due to the development of Web service related
standards and technologies, including WSDL, Universal
Description, Discovery, and Integration (UDDI), and SOAP.
WEKA4WS [27], an open-source framework derived from
the Weka toolkit for supporting distributed data mining on
Grid environments, exposes all the data mining algorithms
3242
provided by as WSRF-compliant services (WSRF: Web
Services Resource Framework), which enable important
benefits, such as dynamic service discovery and composition,
standard support for authorization and cryptography, and so
on. Grid-enable Weka [28] is another proposed system found
on Weka, which is a widely used toolkit for machine learning
and data mining written in Java. In Grid-enabled Weka,
execution of these tasks can be distributed across computers
in an ad hoc Grid environment. Tasks that can be executed
using Grid-enable Weka include building a classifier on a
remote machine, testing a classifier on a dataset, and
cross-validation. Grid-enable Weka modifies the Weka
toolkit to enable the use of multiple computational resources
when performing data analysis. WekaG [29] is another
system derived from the Weka. It’s based on client/server
architecture. The server side provides a set of services that
implement the functionalities and the process of the different
data mining algorithms. The client side is responsible for
communicating with server side and offering user interface.
FAEHIM
(Federated
Analysis
Environment
for
Heterogeneous Intelligent Mining) is a Web Services-based
toolkit for supporting distributed data mining. This toolkit
consists of a set of data mining services, a set of tools to
interact with these services, and a workflow system used to
assemble these services and tools.
IV. CONCLUSIONS
Both distributed computing environment and parallel
computing environment can speed up the data mining
progress, however, they are still distinct in several unaware
ways. Distributed data mining often applies the same or
different mining algorithms to tackles local data, to exchange
or communicate among multiple process units, and then
combining the local pattern discovery by local data mining
algorithms from local databases into a global knowledge. In
this case the discovered knowledge is often different from the
knowledge discovered by sequentially applying the data
mining algorithms to the entire datasets. The accuracy or
efficiency of distributed data mining is somewhat difficult to
predict. Because it depends on data partitioning, task
scheduling, and global synthesizing. In contrast to distributed
data mining, in principle a parallel data mining algorithm
discovers the knowledge as same as the knowledge found by
its sequential algorithm due to applying the global parallel
algorithm on the entire data sets. Its accuracy may be more
guaranteed than the distributed data mining in some case.
Distributed
computing
environment
enables
geographically distributed science and engineering teams to
collaborate in new ways. Despite recent advances in
Peer-to-Peer and service-oriented, grid computing,
high-performance computing, issues on distributed and
parallel computing such as limitation on data transmission
bandwidth, skew distribution, very-large data size, privacy
preserve still need serious and immediate attention. For
example, the execution environments of geographically
distributed applications are far less deterministic than those of
locally centralized. In addition, distributed applications
become highly complex when large-scale and skewed data
distributes in unpredictable locations in which identifying and
correcting performance bottlenecks exposed may not be
useful repeatedly, because the network bandwidths and
computing resources often vary from one place to another.
The work here is only a part of related research piece.
Although we are convinced that distributed data mining or
parallel data mining is the way to go, it will still be quite a
journey.
REFERENCES
[1]
R. Heider, Troubleshooting CFM 56-3 Engines for the Boeing
737—Using CBR and Data-Mining, Spinger-Verlag, New York, vol.
1168, pp. 512–523, 1996. Lecture Notes in Computer Science.
[2] S. Datta, K. Bhaduri, Chris Giannella and Hillol Kargupta, IEEE
Internet Computing, IEEE Computer Society, July 2006, pp.18-27
[3] R. Wolff and A. Schuster, “Association Rule Mining in Peer-to-Peer
Systems,” IEEE Trans. Systems, Man, and Cybernetics, Part B, vol. 34,
no. 6, 2004, pp.2426–2438.
[4] R. Wolff, K. Bhaduri, and H. Kargupta,“Local l2-Thresholding Based
Data Mining in Peer-to-Peer Systems,” Proc. 2006 SIAM Conf. Data
Mining (SDM06), SIAM Press, 2006, pp. 430-441.
[5] S. Datta, C. Giannella, and H. Kargupta,“K-Means Clustering over
Large, Dynamic Networks,” Proc. 2006 SIAM Conf. Data Mining
(SDM 06), SIAM Press, 2006, pp. 153–164.
[6] B. Babcock and C. Olston, “Distributed Top-K Monitoring,” Proc.
ACM SIGMOD 2003 Int’l Conf. Management of Data, ACM Press,
2003, pp. 28–39.
[7] M. Mehyar, D. Spanos, J. Pongsajapan, et al., “Distributed Averaging
on a Peer-to-Peer Network,” Proc. IEEE Conf. Decision and Control,
IEEE CS Press, 2005.
[8] W. Kowalczyk, M. Jelasity, and A. Eiben,“Towards Data Mining in
Large and Fully Distributed Peerto-Peer Overlay Networks,” Proc. 15th
Belgian-Dutch Conf. Artificial Intelligence (BNAIC 03), Univ.of
Nijmegen Press, 2003, pp. 203–210.
[9] D. Kempe, A. Dobra, J. Gehrke, “Gossip-based computation of
aggregate information”, Proc. 44th Annual IEEE Symposium on.
Foundation of Computer Science, IEEE CS Press, 2003, pp.1-10
[10] I. Flockhart, N. Radcliffe, “GA-MINER: Parallel Data Mining with
Hierarchical Genetic Algorithms”, Final Report. University of
Edinburgh, 1995
[11] D. Stainforth, J. Kettleborough, M. Allen, et al. “Distributed
Computing for Public-interest Climate Modeling Research”, IEEE
Computing in Science & Engineering, 2002, pp.82-90
3243
[12] E. Korpela et al., “SETI@home: Massively Distributed Computing for
SETI,” Computing in Science & Eng., vol. 3, no. 1, Jan./Feb. 2001, pp.
78–83.
[13] Fight AIDS at Home Project, http://fightaidsathome.scripps.edu/
[14] E. Krieger, G. Vriend. “Distributed Computing in Bioinformatics using
a screensaver based approach”, Oxford Journals, Oxford University
Press, 2007
[15] D. Anderson, J. Cobb, E. Korpela, et al. “SETI@home: an experiment
in public-resource computing”, Communications of the ACM,
2002,45(11):56-61.
[16] M.J. Zaki, “Parallel and Distributed Association Mining: A Survey,”
IEEE Concurrency, vol. 7, no. 4, 1999, pp. 14–25.
[17] K. Kubota, A. Nakase, H. Sakai, et al. “Parallelization of Decision Tree
Algorithm and Its Performance Evaluation”, High Performance
Computing in the Asia-Pacific Region'00, The 4th International
Conference, Volume 2, 2000
[18] The Predictive Model Markup Language Version 0.9, The Data Mining
Group, 1998, http://www.dmg.org.
[19] R. Grossman, S. Bailey, A. Ramu, et al. The Management and Mining
of Multiple Predictive Models Using the Predictive Modeling Markup
Language (PMML), Information and System Technology, Volume 41,
pages 589-595, 1999
[20] A. L. Prodromidis, P. K. Chan, “Meta-learning in distributed data
mining systems: Issues and Approaches”. Book on Advances of
Distributed Data Mining, editors Hillol Kargupta and Philip Chan,
AAAI press, 2000
[21] H. Karagupta, E. Johnson, E. R. Sanseverino, et al. Scalable Data
Mining from Vertically Partitioned Feature Space Using Collective
Mining and Gene Expression Based Genetic Algorithms, KDD-98
Workshop on Distributed Data Mining, 1999.
[22] H. Kargupta et al., “Collective Data Mining: A New Perspective
towards Distributed Data Mining,” Advances in Distributed and
Parallel Knowledge Discovery, H. Kargupta and P. Chan, eds.,
MIT/AAAI Press, 2000, pp. 133–184.
[23] M. Cannataro, C. Comito, “A Data Mining Ontology for Grid
Programming”, In Proc 1st Workshop Semantics in Peer-to-Peer and
Grid Computing, 2003, pp 113–134.
[24] M. Cannataro, A. Congiusta, A. Pugliese, et al. “Distributed Data
Mining on Grids: Services, Tools, and Applications”, IEEE Trans.
Systems, Man, and Cybernetics, Part B, vol. 34, no. 6, 2004,
pp.2451–2466.
[25] I. Foster and C. Kesselman, “The anatomy of the grid: Enabling
scalable virtual organizations. S. Tuecke,” Int. J. Supercomput.
Applicat., vol. 15, no. 3, 2001.
[26] S. Stolfo, A. L. Prodromidis, and P. K. Chan, JAM: Java Agents for
Meta-Learning over Distributed Databases, Proceedings of the Third
International Conference on Knowledge Discovery and Data Mining,
AAAI Press, Menlo Park, California, 1997.
[27] D. Talia, P. Trunfio, and O. Verta, “Weka4WS: A WSRF-Enabled
Weka Toolkit for Distributed Data Mining on Grids”, PKDD 2005,
pp.309-320, 2005
[28] R. Khoussainov, X. Zuo and N. Kushmerick “Grid-enabled Weka: A
Toolkit for Machine Learning on the Grid”, ERCIM News No.59, Oct
2004.
[29] M. Perez, A. Sanchez, P. Herrero, et al. “Adapting the Weka Data
Mining Toolkit to a Grid based environment”, 3rd Atlantic Web
Intelligence Conf. 2005.
[30] Y. Guo, S. M. Rueger, J. Sutiwaraphun, et al., “Meta-Learning for
Parallel Data Mining”, in Proceedings of the Seventh Parallel
Computing Workshop, pp1-2, 1997
[31] R. Grossman, S. Bailey, A. Ramu, et al., “The Preliminary Design of
Papyrus: A System for High Performance, Distributed Data Mining
over Clusters”, in Advances in Distributed and Parallel Knowledge
Discovery, H. Kargupta and P. Chan, editors, AAAI Press/The MIT
Press, MenloPark, California, 2000, pages 259-275.
[32] A. Raftery, D. Madigan, and J. Hoeting, “Bayesian Model Averaging
for Linear Regression Models”, Journal of the American Statistical
Association, Volume92, pages 179–191, 1996.
[33] R. Grossman, H. Bodek, D. Northcutt, et al., Data Mining and
Tree-based Optimization, in the Proceedings of the Second
International Conference on Knowledge Discovery and Data Mining,
AAAI Press, MenloPark, California, page 323-326, 1996.
3244