International Journal of Electronics and Computer Science Engineering
Available online at www.ijecse.org
ISSN 2277-1956, Volume 2, Number 1, pp. 290-295

Analysis of Grid Based Distributed Data Mining System for Service Oriented Frameworks

Praseeda Manoj
Department of Computer Science, Muscat College, Sultanate of Oman

Abstract- Distributing data and computation makes it possible to solve larger problems and to execute applications that are distributed in nature. A Grid is a distributed computing infrastructure that makes it possible to manage large amounts of data and to run business applications supporting consumers and end users. The Grid can play a significant role in providing an effective computational infrastructure that enables coordinated resource sharing within dynamic organizations. Several systems have been proposed for building distributed data mining applications. This paper analyses different Grid-based distributed data mining applications to give an overview of how Grid computing can be used to support distributed data mining. In addition, the synergy between data mining and Grid technology is discussed. This concept is implemented in Weka4WS, a framework that extends the widely used open source Weka toolkit to support distributed data mining on WSRF-enabled Grids. Weka4WS adopts the WSRF technology for running remote data mining algorithms and managing distributed computations.

Keywords – Data mining, Grid, DDM, WSRF

I. INTRODUCTION

Every organization that has embraced the concept of a data warehouse believes that data mining is a distinct part of its future. Data mining has attracted a great deal of attention in the information industry in recent years mainly because of the wide availability of huge amounts of data and the pressing need to turn such data into useful information and knowledge. Data mining is the process of analyzing data from different perspectives and summarizing it into useful information.
Technically, the data mining process finds the correlations and patterns that exist among several fields in a large relational database. Traditional on-line transaction processing (OLTP) systems have contributed substantially to the evolution and wide acceptance of relational technology as a major tool for efficient storage, retrieval and management of large amounts of data, but they are not good at delivering meaningful analysis in return. This is where data mining, or Knowledge Discovery in Databases (KDD), has obvious benefits for any enterprise. Using a combination of techniques, including statistical analysis, multidimensional analysis, intelligent agents, and data visualization, KDD can discover highly informative patterns within the data that can be used to develop predictive models of behavior. Data mining, or KDD, is the nontrivial extraction of implicit, previously unknown, potentially useful and understandable patterns from data.

II. AIM AND OBJECTIVES

The aim of this paper is to analyse the advantages of WSRF-enabled grid computing compared with OGSI-based grid computing and to discuss applications such as WekaG, DataMiningGrid and Weka4WS. The paper investigates the synergy between data mining and grid technology using Globus Toolkit 4.0, a grid computing framework. It describes how the two paradigms, data mining and grid technology, can benefit from each other. The paper also describes Weka4WS, a framework that extends the widely used open source Weka toolkit to support distributed data mining on WSRF-enabled Grids. Weka4WS adopts the WSRF technology for running remote data mining algorithms and managing distributed computations. The Weka4WS user interface supports the execution of both local and remote data mining tasks. A performance analysis of Weka4WS for executing distributed data mining tasks in different network scenarios is presented.

III. DISTRIBUTED DATA MINING AND GRIDS

Data mining and knowledge discovery can benefit from distributed data mining (DDM) techniques, which improve mining performance on huge or distributed data. Although there are many efficient algorithms and techniques for mining centralized data sets, they are inefficient at, or incapable of, dealing with huge or distributed data sets. There are two main reasons to choose DDM.

The first is that the data is very large. If the data is too large, it is hard to store at a single site, and mining it at a single site is inefficient or impossible. In such cases, the data may be decomposed into parts that are distributed across different sites. The data mining operations are then performed at each site, and at the end the mining results of all sites are combined to obtain global results. This improves on centralized data mining because the workload is distributed among the sites.

The second reason is the need to deal with inherently distributed data sets. Wired and wireless networks such as the Internet, intranets and local area networks produce many distributed sources of data. These distributed data need to be mined to obtain global patterns, models or knowledge. The straightforward solution is to transfer all data to a central site where the data mining is done. However, even if a central site has enough capacity to handle the data storage and data mining, it may be too expensive to transfer the local data sets to it. In addition, privacy plays an important role in emerging distributed data: the distributed data sets may not be transferable because of privacy, security or autonomy constraints. Therefore, DDM is an effective and scalable solution for mining huge and distributed data sets in distributed computing environments.
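The decompose, mine locally, and combine pattern described above can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the site names, the toy transactions, the `mine_local` and `combine` helpers and the `min_support` threshold are all hypothetical and do not come from any of the systems surveyed here. Each site counts item occurrences in its own transactions, and only these small summaries, not the raw data, are merged into a global result.

```python
from collections import Counter

# Hypothetical local data sets: each site holds its own transactions.
sites = {
    "site_a": [{"bread", "milk"}, {"bread", "butter"}],
    "site_b": [{"milk"}, {"bread", "milk", "eggs"}],
}


def mine_local(transactions):
    """Local step: count in how many transactions each item occurs at one site."""
    counts = Counter()
    for transaction in transactions:
        counts.update(set(transaction))
    return counts, len(transactions)


def combine(local_results, min_support):
    """Global step: merge the local counts and keep items above the support threshold."""
    total_counts = Counter()
    total_size = 0
    for counts, size in local_results:
        total_counts += counts
        total_size += size
    if total_size == 0:
        return {}
    return {
        item: count / total_size
        for item, count in total_counts.items()
        if count / total_size >= min_support
    }


# Only the per-site summaries travel to the combining step.
local_results = [mine_local(transactions) for transactions in sites.values()]
global_frequent = combine(local_results, min_support=0.5)
```

In this toy setting the combined result equals what a centralized count over all four transactions would give, which is the property a real DDM algorithm has to establish for less trivial models.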
In recent years, DDM has attracted a lot of attention in both research and applications, and many DDM techniques and systems have been proposed. However, DDM problems such as heterogeneous data, complex data, security, privacy and autonomy of local databases, network topology and transmission schemes remain open. As the grid is becoming a well-accepted computing infrastructure in science and industry, it is necessary to provide general data mining services, algorithms and applications that help analysts, scientists, organizations and professionals to leverage grid capacity for high-performance distributed computing, so that they can solve their data mining problems in a distributed way.

IV. FEATURES OF WSRF ON WEB SERVICES

WSRF is a family of technical specifications concerned with the creation, addressing, inspection and lifetime management of resources using Web services. A Web service is a software component that can be accessed by remote entities using standard Internet protocols such as HTTP. The capabilities offered by a service are defined using the Web Services Description Language (WSDL), an XML-based formalism that defines the operations exposed by a Web service and specifies the input and output messages that must be exchanged to invoke them. The set of operations and associated messages constitutes the interface of a service. An important feature of Web services is the independence of the service interface from the implementation of the operations. To invoke a Web service, a remote entity needs to know only its WSDL interface, not the programming language used to implement its operations. This makes it easy to couple distributed software components implemented in different languages and running on heterogeneous platforms.
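The separation between a service interface and its implementation can be illustrated by analogy in ordinary code. The sketch below is not WSDL and does not use any Web service library: the `ClassificationService` interface, the two implementations and the `income` field are hypothetical names chosen for illustration. The point is only that the client function is written against the declared operation and works unchanged with either implementation.

```python
from typing import Protocol


class ClassificationService(Protocol):
    """Hypothetical service interface: the only thing a client needs to know."""

    def classify(self, record: dict[str, float]) -> str:
        ...


class RuleBasedService:
    """One implementation: a fixed threshold rule."""

    def classify(self, record: dict[str, float]) -> str:
        return "high" if record["income"] >= 50_000 else "low"


class NearestMeanService:
    """Another implementation of the same interface: nearest class mean."""

    def __init__(self, class_means: dict[str, float]) -> None:
        self._class_means = class_means

    def classify(self, record: dict[str, float]) -> str:
        income = record["income"]
        return min(
            self._class_means,
            key=lambda label: abs(self._class_means[label] - income),
        )


def label_all(service: ClassificationService, records: list[dict[str, float]]) -> list[str]:
    """Client code that depends on the interface only, not on an implementation."""
    return [service.classify(record) for record in records]


records = [{"income": 20_000.0}, {"income": 80_000.0}]
rule_labels = label_all(RuleBasedService(), records)
mean_labels = label_all(NearestMeanService({"low": 25_000.0, "high": 75_000.0}), records)
```

With real Web services the same decoupling additionally holds across programming languages and platforms, because the interface is published as an XML document instead of a language-level type.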
Web services are used in Grid computing as uniform interfaces for accessing remote resources and composing distributed applications, independently of their location and specific implementation. The Open Grid Services Architecture (OGSA) defines an architectural model for Grid systems in which distributed resources and applications are modeled as Web services that interact with each other using Internet-based standards. WSRF implements the OGSA philosophy by defining a set of Web service standards for the implementation of Grid systems. WSRF mainly focuses on managing stateful resources using Web services. The combination of a stateful resource with a Web service is termed a WS-Resource. The possibility of defining a "state" associated with a Web service is the most important difference between WSRF-compliant Web services and pre-WSRF ones. This is a key feature in implementing Grid systems, because Grid applications can be composed of multiple long-running processes whose state needs to be accessed and monitored to control the overall execution. In this context, WS-Resources provide a standard way to represent, advertise, and access the properties associated with processes, as required by complex Grid applications.

V. OVERVIEW OF EXISTING DATA MINING APPLICATIONS ON THE GRID

The following applications aim to adapt the Weka toolkit to a Grid environment using Grid service technologies such as OGSI and WSRF.

i. WekaG

This application uses open architectures such as OGSI and the Globus Toolkit 3.0. As it is an extension of the open source Weka tool, it can be further extended with data mining techniques and algorithms when needed. WekaG also implements authorized access to resources, combined with the security measures of the Globus Toolkit. WekaG implements a vertical architecture called Data Mining Grid Architecture (DMGA), which is based on the data mining phases: preprocessing, data mining and post-processing.
The application implements a client/server architecture. The server side is responsible for a set of grid services that implement the different data mining algorithms and data mining phases. The client side interacts with the server and provides a user interface that is integrated into the Weka interface. WekaG is implemented to include the following features: coupling of data sources, authorized access to resources, discovery based on metadata, planning and scheduling of tasks, and identification of the available and appropriate resources. It does not support OLAP, and little is known about the scalability of the application.

ii. DataMiningGrid

The DataMiningGrid system was developed to provide generic, sector-independent data mining interfaces and tools for the grid. The system addresses the following requirements: massive and distributed data, distributed operations, data privacy and security, user friendliness, and resource identification and metadata. The main objectives of the project are the development of grid interfaces that can be used by data mining tools, a user-friendly workflow editor for configuration, text mining and ontology learning services, and a test bed with some demonstrator applications, all built on emerging grid standards. To join the DataMiningGrid test bed, a user needs a Linux or Windows machine on which the core GT4 services have to be installed.

iii. Weka4WS

Weka4WS extends Weka by allowing the execution of all its data mining algorithms on remote Grid nodes. To enable remote invocation, the data mining algorithms provided by the Weka library are exposed as a Web service, which can be easily deployed on the available Grid nodes. The architecture of Weka4WS includes three kinds of nodes: storage nodes, which contain the datasets to be mined; compute nodes, on which remote data mining algorithms are run; and user nodes, which are the local machines of users.
Remote execution is managed using basic WSRF mechanisms such as state management and notifications, while the Globus Toolkit 4 services are used for standard Grid functionalities such as security and file transfer. Weka4WS can only handle a dataset contained in a single storage node. This dataset is then transferred to the compute nodes to be mined. If the data are considerably large, this transfer causes high communication overhead.

Figure 1 - Local task and Remote task execution

VI. ARCHITECTURE OF WSRF ENABLED GRID

This section describes the architecture of Weka4WS, a DDM framework for WSRF-enabled Grids. Weka4WS extends the widely used Weka toolkit to support distributed data mining in Grid environments; its first prototype was developed using the Java WSRF library provided by Globus Toolkit 4 (GT4). Weka provides a large collection of machine learning algorithms written in Java for data pre-processing, classification, clustering, association rules, and visualization, which can be invoked through a common graphical user interface. In Weka, the overall data mining process takes place on a single machine, since the algorithms can be executed only locally. The goal of Weka4WS is to extend Weka to support remote execution of the data mining algorithms, so that distributed data mining tasks can be executed concurrently on decentralized Grid nodes, exploiting data distribution and improving application performance.
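The idea of running several mining tasks concurrently, one per Grid node, can be sketched with a thread pool. This is a simplified illustration, not Weka4WS code: `run_remote_task` is a hypothetical stand-in that computes a mean locally where a real client would invoke a remote Web service, and the node names and data are invented for the example.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical data already available on each compute node.
node_data = {
    "node1": [1.0, 2.0, 3.0],
    "node2": [10.0, 20.0],
}


def run_remote_task(node: str, values: list[float]) -> tuple[str, float]:
    """Stand-in for a remote mining call; here it only computes a mean."""
    return node, sum(values) / len(values)


# One worker thread per task, so all tasks can be in flight at the same time.
with ThreadPoolExecutor(max_workers=len(node_data)) as pool:
    futures = [
        pool.submit(run_remote_task, node, values)
        for node, values in node_data.items()
    ]
    # Collect each task's result; result() blocks until that task has finished.
    results = dict(future.result() for future in futures)
```

Because each thread would spend most of its time waiting for a remote node, a thread per task is a reasonable fit for this kind of workload even though the threads do little local computation.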
Each task is managed by a single thread, so a user can start multiple tasks in parallel and take full advantage of the grid environment. By leveraging the OGSA and WSRF standards, Weka4WS provides an open service middleware for distributed data mining, with which users can design higher-level distributed data mining services that cover the main steps of the KDD process and offer typical distributed data mining patterns.

Figure 2: Weka4WS architecture.

VII. CONCLUSION

With the advancement of information technology, increasingly complex and resource-demanding applications have become possible. As a result, ever larger problems are envisaged, and in many areas so-called grand challenge problems are being tackled. These problems put an even greater demand on the underlying computing resources. Modern data mining applications in science, engineering and other areas form a large class of applications that need many resources. Grid technology is an answer to the increasing demand for affordable large-scale computing resources. Grid technology and the complex nature of data mining applications have led to a new relationship between data mining and the grid. A data mining grid enables data mining applications and provides a comprehensive solution for affordable high-performance resources that satisfy the needs of large-scale data mining problems. Mining grid data can be understood as a methodology that helps to address the complex issues involved in running and maintaining large grid computing environments. To support complex data mining applications, grid environments must provide adaptive data management and data analysis tools and techniques by offering resources, services and decentralized data access mechanisms. This paper discussed the importance of grid computing in distributed data mining.
The Grid can offer an effective infrastructure for managing data mining and knowledge discovery applications. In the near future it can become an effective infrastructure for managing very large data sources and for providing high-level mechanisms for extracting valuable knowledge from them. To solve this class of tasks, advanced tools and services for knowledge discovery are vital. This paper described one such advanced tool, Weka4WS. In the coming years the Grid will be used as a platform for implementing and deploying geographically distributed knowledge discovery and knowledge management services and applications. Weka4WS adopts the emerging Web Services Resource Framework (WSRF) for remotely running data mining algorithms and for composing distributed knowledge discovery applications that integrate data, tools, and resources available from dispersed sites through the SOA paradigm. This paper described the architecture of Weka4WS, which exploits the WSRF library provided by Globus Toolkit 4. Weka4WS provides an effective way to perform compute-intensive distributed data analysis in large-scale Grid environments. The Weka4WS Web services can be invoked directly from ad hoc programs to implement applications that coordinate the invocation of multiple data mining services in a distributed scenario. Thus, a distributed data mining application can be composed of several tasks that execute on multiple Grid nodes in parallel and/or in sequence.

VIII. FUTURE DEVELOPMENT

High-performance data mining will increasingly be regarded as a real added value. The Grid can offer an effective infrastructure for deploying data mining and knowledge discovery applications. The future use of the Grid is mainly related to its ability to embody many of those properties and to manage world-wide complex distributed applications. Among those, knowledge-based applications are a major goal.
To reach this goal, the Grid needs to evolve towards an open decentralized infrastructure based on interoperable high-level services that make use of knowledge both in providing resources and in giving results to end users. Software technologies for the implementation and deployment of knowledge Grids, as discussed in this paper, will provide important elements for building knowledge-based applications on a local Grid or on a World Wide Grid. These models, techniques, and tools can provide the basic components for developing complex Grid-based systems, such as distributed knowledge management systems that provide pervasive access, adaptivity and high performance for virtual organizations in science, engineering and industry that need to produce knowledge-based applications. In future, the Weka Knowledge Flow environment could support the visual design of distributed data mining applications that are able to handle different storage nodes using Globus Toolkit 5. This would allow users to design and execute complex data mining applications on the Grid in a simple and effective way.