International Journal of Electronics and Computer Science Engineering
Available Online at www.ijecse.org
ISSN- 2277-1956
Analysis of Grid Based Distributed Data Mining
System for Service Oriented Frameworks
Praseeda Manoj
Department of Computer Science
Muscat College, Sultanate of Oman
Abstract- Distributing data and computation makes it possible to solve larger problems and to execute applications that are
distributed in nature. A Grid is a distributed computing infrastructure that makes it possible to manage large amounts of data and
run business applications supporting consumers and end users. The Grid can play a significant role in providing an
effective computational infrastructure that enables coordinated resource sharing within dynamic organizations. Several
systems have been proposed for building distributed data mining applications. This paper analyses different Grid-based distributed
data mining applications to give an overview of how Grid computing can be used to support distributed data
mining. In addition, the synergy between data mining and Grid technology is discussed. This concept is implemented
in Weka4WS, a framework that extends the widely used open source Weka toolkit to support distributed data mining on
WSRF-enabled Grids. Weka4WS adopts the WSRF technology for running remote data mining algorithms and
managing distributed computations.
Keywords – Data mining, Grid, DDM, WSRF
I. INTRODUCTION
Every organization that has embraced the concept of a data warehouse believes that data mining is a distinct part of
its future. Data mining has attracted a great deal of attention in the information industry in
recent years because of the wide availability of huge amounts of data and the pressing need to turn such data into
useful information and knowledge. Data mining is the process of analyzing data from different perspectives and
summarizing it into useful information. Technically, the data mining process finds correlations and patterns
among the fields of a large relational database. Traditional on-line transaction processing (OLTP) systems
have contributed substantially to the evolution and wide acceptance of relational technology as a major tool
for efficient storage, retrieval and management of large amounts of data, but they are not good at delivering meaningful
analysis in return. This is where data mining, or Knowledge Discovery in Databases (KDD), has obvious benefits for
any enterprise. Using a combination of techniques, including statistical analysis, multidimensional analysis,
intelligent agents, and data visualization, KDD can discover highly informative patterns within the data that
can be used to develop predictive models of behavior. Data mining, or KDD, is the nontrivial extraction of implicit,
previously unknown, potentially useful and understandable patterns from data.
II. AIM AND OBJECTIVES
The aim of this paper is to analyse the advantages of WSRF-enabled Grid computing compared to OGSI-based Grid
computing and to discuss applications such as WekaG, DataMiningGrid and Weka4WS. The paper investigates the
synergy between data mining and Grid technology using Globus Toolkit 4.0, a Grid computing framework. It describes
how the two paradigms, data mining and Grid technology, can benefit from each other. The paper also describes
Weka4WS, a framework that extends the widely used open source Weka toolkit to support distributed data mining
on WSRF-enabled Grids. Weka4WS adopts the WSRF technology for running remote data mining algorithms and
managing distributed computations. The Weka4WS user interface supports the execution of both local and remote
data mining tasks. A performance analysis of Weka4WS for executing distributed data mining tasks in different
network scenarios is presented.
III. DISTRIBUTED DATA MINING AND GRIDS
Data mining and knowledge discovery can benefit from distributed data mining (DDM) techniques to improve mining performance
on huge or distributed data. Although there are many efficient algorithms and techniques for mining centralized
data sets, they are inefficient at, or incapable of, dealing with huge or distributed data sets. There are two main reasons
to choose DDM.
The first is that the data is very large. If the data is too large, it is hard to store at a single site, and mining
it at a single site is inefficient or impossible. In such cases, the data may be decomposed into parts that are
distributed across different sites. The data mining operations are then performed at each site, and finally the mining results
of all sites are combined to obtain global results. This improves on centralized data mining because the workload is
distributed among the sites.
The second reason is that some data sets are inherently distributed. In fact,
various wired and wireless networks, such as the Internet, intranets, local area networks and wireless networks, produce
many distributed data sources. These distributed data need to be mined to obtain global patterns, models or
knowledge. The straightforward solution is to transfer all data to a central site, where data mining is done. However,
even if a central site has enough capacity to handle the data storage and data mining, it may be too expensive
to transfer the local data sets to it. In addition, privacy plays an important role for
distributed data: the distributed data sets may not be transferable because of the privacy, security or
autonomy requirements of their owners. Therefore, DDM is an effective and scalable solution for mining huge and distributed
data sets in distributed computing environments.
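The partition, mine locally, then combine scheme described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not an algorithm taken from any of the systems surveyed in this paper: the two sites, their transactions, and the function names local_pair_counts and combine are hypothetical, and item-pair counting is chosen only because local counts can be merged exactly by addition.

```python
from collections import Counter
from itertools import combinations


def local_pair_counts(transactions):
    """Mine one site: count how often each item pair co-occurs locally."""
    counts = Counter()
    for items in transactions:
        for pair in combinations(sorted(set(items)), 2):
            counts[pair] += 1
    return counts


def combine(site_counts, total_transactions, min_support):
    """Merge per-site counts and keep pairs with enough global support."""
    merged = Counter()
    for counts in site_counts:
        merged.update(counts)
    return {
        pair: n / total_transactions
        for pair, n in merged.items()
        if n / total_transactions >= min_support
    }


# Hypothetical data: two sites, each holding its own transactions.
site_a = [["bread", "milk"], ["bread", "milk", "eggs"], ["eggs"]]
site_b = [["bread", "milk"], ["milk", "eggs"], ["bread", "eggs", "milk"]]

sites = [site_a, site_b]
partial = [local_pair_counts(s) for s in sites]  # independent per-site mining
total = sum(len(s) for s in sites)
frequent = combine(partial, total, min_support=0.5)
print(frequent)
```

Only the per-site counts travel to the combining site, never the raw transactions, which is what keeps the transfer cost low and leaves the local data sets in place. Mining tasks whose partial results are not additive need a more elaborate combination step.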
In recent years, DDM has attracted a lot of attention in both research and applications. Many DDM techniques
and systems have been proposed. However, DDM problems such as heterogeneous data, complex data,
security, privacy and autonomy of local databases, network topology and transmission schemes remain open. As
the Grid is becoming a well-accepted computing infrastructure in science and industry, it is necessary to provide
general data mining services, algorithms and applications that help analysts, scientists, organizations and
professionals to leverage Grid capacity for high-performance distributed computing and to solve their data
mining problems in a distributed way.
IV. FEATURES OF WSRF ON WEB SERVICES
WSRF is a family of technical specifications concerned with the creation, addressing, inspection and lifetime
management of resources using Web services. A Web service is a software component that can be accessed by
remote entities using standard Internet protocols such as HTTP. The capabilities offered by a service are defined
using the Web Services Description Language (WSDL), an XML-based formalism for defining the
operations exposed by a Web service and for specifying the input and output messages that must be exchanged
to invoke those operations. The set of operations and associated messages constitutes the interface of a service. An
important feature of Web services is the independence of the service interface from the implementation of the
operations. To invoke a Web service, a remote entity needs to know only its WSDL interface, without worrying
about the programming language used to implement its operations. This makes it easy to couple
distributed software components implemented in different languages and running on heterogeneous platforms.
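The separation between a service interface and its implementation can be illustrated with a small, purely local Python analogy. It is a sketch only: no WSDL or network call is involved, and the names ClassifierService, ThresholdService, MeanService and client are hypothetical. The client is written against the declared operation and its input and output types, and works unchanged with either implementation.

```python
from typing import Protocol


class ClassifierService(Protocol):
    """The published interface: one operation with typed input and output."""

    def classify(self, features: list[float]) -> str: ...


class ThresholdService:
    """One implementation: labels by the sum of the features."""

    def classify(self, features: list[float]) -> str:
        return "high" if sum(features) > 1.0 else "low"


class MeanService:
    """Another implementation: labels by the mean of the features."""

    def classify(self, features: list[float]) -> str:
        mean = sum(features) / len(features)
        return "high" if mean > 0.5 else "low"


def client(service: ClassifierService, features: list[float]) -> str:
    # The client depends only on the interface, not on how it is implemented.
    return service.classify(features)


print(client(ThresholdService(), [0.9, 0.3]))
print(client(MeanService(), [0.9, 0.3]))
```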
In Grid computing, Web services are used as uniform interfaces for accessing remote resources and composing
distributed applications, independently of their location and specific implementation. The Open Grid
Services Architecture (OGSA) defines an architectural model for Grid systems in which distributed resources and
applications are modeled as Web services that interact with each other using Internet-based standards. WSRF
implements the OGSA philosophy by defining a set of Web service standards for the implementation of Grid
systems. WSRF mainly focuses on managing stateful resources using Web services. The combination of a stateful
resource with a Web service is termed a WS-Resource. The ability to associate a “state” with a Web service
is the most important difference between WSRF-compliant Web services and pre-WSRF ones. This is a key feature
in implementing Grid systems, because Grid applications can be composed of multiple long-running processes
whose state needs to be accessed and monitored to control the overall execution. In this context, WS-Resources
provide a standard way to represent, advertise, and access properties associated with processes, as required by complex
Grid applications.
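The idea of a stateful resource that is addressed, inspected and given a lifetime through otherwise stateless service operations can be modeled in memory as follows. This is a conceptual sketch, not the WSRF API or the Globus Toolkit library: the class ResourceHome, its method names, the key format and the property names are all assumptions made for illustration.

```python
import itertools
import time


class ResourceHome:
    """Holds stateful resources; the operations keep no per-client state."""

    def __init__(self):
        self._resources = {}
        self._ids = itertools.count(1)

    def create(self, properties, lifetime_s):
        """Create a resource and return the key that addresses it."""
        key = f"task-{next(self._ids)}"
        self._resources[key] = {
            "properties": dict(properties),
            "expires_at": time.monotonic() + lifetime_s,
        }
        return key

    def get_property(self, key, name):
        """Inspect one property of the addressed resource."""
        return self._lookup(key)["properties"][name]

    def set_property(self, key, name, value):
        self._lookup(key)["properties"][name] = value

    def destroy(self, key):
        """Explicitly end the lifetime of a resource."""
        del self._resources[key]

    def _lookup(self, key):
        res = self._resources[key]
        if time.monotonic() > res["expires_at"]:
            del self._resources[key]  # lifetime elapsed
            raise KeyError(f"{key} has expired")
        return res


home = ResourceHome()
task = home.create({"status": "RUNNING", "progress": 0}, lifetime_s=60)
home.set_property(task, "progress", 40)
print(task, home.get_property(task, "status"), home.get_property(task, "progress"))
home.destroy(task)
```

A client that holds only the key can poll the state of a long-running process at any time, which is the monitoring pattern the paragraph above describes.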
V. OVERVIEW OF EXISTING DATA MINING APPLICATIONS ON THE GRID
The following applications aim to bring data mining tools, notably the Weka toolkit, to a Grid environment using OGSI or WSRF technology.
i. WekaG
This application uses open architectures such as OGSI and Globus Toolkit 3.0. As it is an extension of the open
source Weka tool, it can be further extended with data mining techniques and algorithms when needed. WekaG also
implements authorized access to resources, combined with the security measures of the Globus Toolkit.
WekaG implements a vertical architecture called Data Mining Grid Architecture (DMGA), which is based on the
data mining phases: preprocessing, data mining and post-processing. The application implements a client/server
architecture. The server side is responsible for a set of Grid services that implement the different data mining
algorithms and data mining phases. The client side interacts with the server and provides a user interface that is
integrated in the Weka interface. WekaG is implemented to include the following features: coupling of data sources,
authorized access to resources, discovery based on metadata, planning and scheduling of tasks, and identification of the
available and appropriate resources. It does not support OLAP, and little is known about the scalability of the
application.
ii. DataMiningGrid
The DataMiningGrid system was developed to provide generic and sector-independent data mining interfaces and tools to be
exploited on the Grid. The system addresses the following requirements: massive and distributed data, distributed
operations, data privacy and security, user friendliness, and resource identification and metadata. The main
objectives of the project are the development of Grid interfaces that can be used by data mining tools, a user-friendly
workflow editor for configuration, text mining and ontology learning services, and a test bed with some
demonstrator applications, all developed with emerging Grid standards. To join the
DataMiningGrid test bed, a user needs a Linux or Windows machine on which the core GT4 services are
installed.
iii. Weka4WS
Weka4WS extends Weka by allowing the execution of all its data mining algorithms on remote Grid nodes. To enable remote
invocation, the data mining algorithms provided by the Weka library are exposed as a Web service, which can be
easily deployed on the available Grid nodes. The architecture of Weka4WS includes three kinds of nodes: storage
nodes, which contain the datasets to be mined; compute nodes, on which remote data mining algorithms are run; and
user nodes, which are the local machines of users. Remote execution is managed using basic WSRF mechanisms
such as state management and notifications, while the Globus Toolkit 4 services are used for standard Grid
functionalities, such as security and file transfer. Weka4WS can only handle a dataset contained on a single storage
node. This dataset is then transferred to the compute nodes to be mined. If the dataset is large, this transfer
causes high communication overhead.
Figure 1 - Local task and Remote task execution
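The communication overhead of moving a dataset from a storage node to a compute node can be estimated with simple arithmetic. The sketch below is a back-of-the-envelope model under stated assumptions, not a measurement of Weka4WS: the function names and all numbers (dataset size, bandwidth, latency and compute times) are hypothetical.

```python
def remote_time_s(dataset_mb, bandwidth_mbit_s, latency_s, compute_s):
    """Estimated time for a remote task: move the dataset, then mine it."""
    transfer_s = dataset_mb * 8 / bandwidth_mbit_s  # megabytes to megabits
    return latency_s + transfer_s + compute_s


def remote_is_faster(dataset_mb, bandwidth_mbit_s, latency_s,
                     local_compute_s, remote_compute_s):
    """Compare a remote run (including transfer) with a purely local run."""
    remote = remote_time_s(dataset_mb, bandwidth_mbit_s, latency_s,
                           remote_compute_s)
    return remote < local_compute_s


# Hypothetical scenario: a 100 MB dataset, 120 s locally, 60 s on a Grid node.
print(remote_time_s(100, 100, 0.5, 60))          # fast local network
print(remote_time_s(100, 2, 0.5, 60))            # slow wide-area link
print(remote_is_faster(100, 100, 0.5, 120, 60))
print(remote_is_faster(100, 2, 0.5, 120, 60))
```

Under these assumed figures the remote run pays off on the fast network but not on the slow link, where the transfer alone takes longer than mining the data locally.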
VI. ARCHITECTURE OF WSRF ENABLED GRID
This paper focuses on Weka4WS, which it regards as the more efficient of the surveyed DDM approaches. Weka4WS is a
framework that extends the widely used Weka toolkit to support distributed data mining in WSRF-enabled Grid environments.
Weka provides a large collection of machine learning algorithms written in Java for data pre-processing,
classification, clustering, association rules, and visualization, which can be invoked through a common graphical
user interface. In Weka, the overall data mining process takes place on a single machine, since the algorithms can be
executed only locally. The goal of Weka4WS is to extend Weka to support remote execution of the data mining
algorithms. In this way, distributed data mining tasks can be executed concurrently on decentralized Grid nodes,
exploiting data distribution and improving application performance.
The first prototype of Weka4WS was developed using the Java WSRF library provided by GT4.
Each task is managed by a single thread, so a user can start multiple tasks in parallel, taking full advantage
of the Grid environment. By leveraging the OGSA and WSRF standards, Weka4WS aims to provide open distributed data
mining service middleware with which users can design higher-level distributed data mining services that cover
the main steps of the KDD process and offer typical distributed data mining patterns.
Figure 2 : Weka4WS architecture.
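The one-thread-per-task model, in which a user starts several remote tasks in parallel, can be sketched as follows. The sketch does not call Weka4WS or any Grid service: run_remote_task is a hypothetical stand-in that only sleeps, the node names are invented, and the algorithm names are used purely as labels.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def run_remote_task(node, algorithm):
    """Stand-in for a remote invocation; sleeps instead of calling a service."""
    time.sleep(0.1)
    return f"{algorithm} finished on {node}"


# Hypothetical tasks: (compute node, algorithm label).
tasks = [("node-a", "J48"), ("node-b", "SimpleKMeans"), ("node-c", "Apriori")]

# One worker thread per task, so all tasks are in flight at the same time.
with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
    futures = [pool.submit(run_remote_task, node, algo) for node, algo in tasks]
    results = [f.result() for f in futures]  # collected in submission order

for line in results:
    print(line)
```

Because each thread spends most of its time waiting for the remote node, the total elapsed time is close to that of the slowest task rather than the sum of all tasks.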
VII. CONCLUSION
With the advancement of information technology, increasingly complex and resource-demanding applications have
become possible. As a result, even larger-scale problems are envisaged, and in many areas so-called grand challenge
problems are being tackled. These problems put an even greater demand on the underlying computing resources.
Modern data mining applications in science, engineering and other areas form a large class of applications that need many
resources. Grid technology is an answer to the increasing demand for affordable large-scale computing resources.
Grid technology and the complex nature of data mining applications have led to a new relationship between data mining and
the Grid. A data mining Grid enables data mining applications and provides a comprehensive solution for affordable
high-performance resources that satisfy the needs of large-scale data mining problems. Conversely, mining Grid data can be
understood as a methodology that helps to address the complex issues involved in running and maintaining large
Grid computing environments.
To support complex data mining applications, Grid environments must provide adaptive data management and data
analysis tools and techniques by offering resources, services and decentralized data access mechanisms.
This paper discussed the importance of Grid computing in distributed data mining. The Grid can offer an effective
infrastructure for managing data mining and knowledge discovery applications. In the near future it can become an
effective infrastructure for managing very large data sources and providing high-level mechanisms for extracting
valuable knowledge from them. To solve this class of tasks, advanced tools and services for knowledge discovery
are vital. This paper described one such advanced tool, Weka4WS. In the coming years the Grid will be used as
a platform for implementing and deploying geographically distributed knowledge discovery and knowledge
management services and applications. Weka4WS adopts the emerging Web Services Resource Framework (WSRF)
for remotely running data mining algorithms and composing distributed knowledge discovery applications that
integrate data, tools, and resources available from dispersed sites through the SOA paradigm.
This paper described the architecture of Weka4WS, which exploits the WSRF library provided by Globus Toolkit 4.
Weka4WS provides an effective way to perform compute-intensive distributed data analysis on large-scale Grid
environments. The Weka4WS Web services can be directly invoked within ad hoc programs to implement
applications that coordinate the invocation of multiple data mining services in a distributed scenario. Thus, a
distributed data mining application can be composed of several tasks that execute on multiple Grid nodes in parallel
and/or in sequence.
VIII. FUTURE DEVELOPMENT
High-performance data mining is increasingly going to be considered a real added value. The Grid can offer an
effective infrastructure for deploying data mining and knowledge discovery applications. The future use of the Grid
is mainly related to its ability to embody many of those properties and to manage world-wide complex distributed
applications. Among those, knowledge-based applications are a major goal. To reach this goal, the Grid needs to
evolve towards an open decentralized infrastructure based on interoperable high-level services that make use of
knowledge both in providing resources and in giving results to end users. Software technologies for the
implementation and deployment of knowledge Grids, as discussed in this paper, will provide important elements for
building knowledge-based applications on a local Grid or on a World Wide Grid. These models, techniques, and
tools can provide the basic components for developing complex Grid-based systems, such as distributed knowledge
management systems that provide pervasive access, adaptivity and high performance for virtual organizations in
science, engineering and industry that need to produce knowledge-based applications. In future, the Weka Knowledge
Flow environment could support the visual design of distributed data mining applications that are able to handle
different storage nodes using Globus Toolkit 5. This would allow users to design and execute complex data mining
applications on the Grid in a simple and effective way.