Weka4WS: Enabling Distributed Data Mining on Grids
Domenico Talia, Paolo Trunfio, Oreste Verta
[email protected]
DEIS, University of Calabria, Rende, Italy
Venezia, October 17-18, 2005

Outline
- Goal of this work
- Weka4WS, WSRF and GT4
- Weka4WS architecture
- Software components
- Graphical user interface
- Web Services operations
- Execution mechanisms
- Performance analysis
- Conclusions

Goal of this work
Computational Grids have emerged in recent years as effective infrastructures for distributed high-performance computing and data processing.
By exploiting a service-oriented approach, knowledge discovery applications can be developed on Grids to deliver high performance and manage data and knowledge distribution.
The goal of this work is to extend the Weka toolkit to support distributed data mining through standard Grid technologies, such as the emerging Web Services Resource Framework (WSRF) and the Globus Toolkit 4.

The Weka4WS framework
Weka provides a large collection of machine learning algorithms for data pre-processing, classification, clustering, association rules, and visualization, which can be used through a common GUI.
In Weka, the overall data mining process takes place on a single machine, since the algorithms can be executed only locally.
The goal of Weka4WS is to extend Weka to support remote execution of Weka data mining algorithms.
In Weka4WS, the data mining algorithms for classification, clustering, and association rules can be executed on remote Grid resources to implement distributed data mining and speed up applications.
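The speedup comes from running independent mining tasks concurrently on different nodes. The idea can be sketched in plain Java with a thread pool; this is not the Weka4WS API, and the task labels and pool-based dispatch are invented for illustration:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class ParallelTasks {
    // Run every task concurrently and gather the results.
    // Each string stands for an independent data mining task that
    // Weka4WS would dispatch to a different Grid node.
    static List<String> runAll(List<String> tasks) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(tasks.size());
        List<Future<String>> futures = new ArrayList<>();
        for (String t : tasks) {
            // submit() returns immediately; the caller is not blocked
            futures.add(pool.submit(() -> t + " -> model"));
        }
        List<String> results = new ArrayList<>();
        for (Future<String> f : futures) {
            results.add(f.get()); // wait for each result in turn
        }
        pool.shutdown();
        return results;
    }

    public static void main(String[] args) throws Exception {
        System.out.println(runAll(List.of("clustering@node1", "classification@node2")));
    }
}
```

With one task per node, the wall-clock time approaches the longest single task rather than the sum of all tasks, which is the speedup Weka4WS targets.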
To enable remote invocation, each data mining algorithm provided by the Weka library is exposed as a Web Service, which can be easily deployed on the available Grid nodes.
To achieve integration and interoperability with standard Grid environments, Weka4WS has been designed and developed using the emerging Web Services Resource Framework (WSRF) as its enabling technology.

WSRF and Globus Toolkit 4
WSRF is a family of technical specifications concerned with the creation, addressing, inspection, and lifetime management of stateful resources.
The framework codifies the relationship between Web Services and stateful resources, which are termed WS-Resources in the WSRF language.
The Globus Alliance recently released the Globus Toolkit 4 (GT4), which provides an open source implementation of the WSRF library and incorporates Web Services implemented according to the WSRF specifications.
Weka4WS has been developed using the Java WSRF library provided by GT4.

Weka4WS architecture
In Weka4WS all nodes use GT4 services for standard Grid functionalities, such as security and data management.
We distinguish these nodes in two categories:
- user nodes, the local machines of the users, which provide the Weka4WS client software
- computing nodes, which provide the Weka4WS Web Services allowing the execution of remote data mining tasks
Data can be located on computing nodes, user nodes, or third-party nodes.
If the dataset to be mined is not locally available on a computing node, it can be downloaded or replicated by means of the GT4 data management services.

Software components
[Figure: a user node (Graphical User Interface, Client Module, Weka Library, GT4 Services) interacting with a computing node (Web Service, Weka Library, GT4 Services)]
User nodes include three software components: the Graphical User Interface (GUI), the Client Module (CM), and the Weka Library (WL).
Computing nodes include two software components: the Web Service (WS) and the Weka Library (WL).
The GUI extends the Weka Explorer environment to allow the execution of both local and remote data mining tasks:
- local tasks are executed by directly invoking the local WL
- remote tasks are executed through the CM, which operates as an intermediary between the GUI and the Web Services on remote computing nodes
The WS is a WSRF-compliant Web Service that exposes the data mining algorithms provided by the underlying WL. Therefore, requests to the WS are executed by invoking the corresponding WL algorithms.

Graphical user interface
A "Remote pane" has been added to the original Weka Explorer environment.
This pane provides a list of the remote Web Services that can be invoked, and two buttons to start and stop the data mining task on the selected Web Service.
Through the GUI a user can both:
- start the execution locally, using the standard Local pane
- start the execution remotely, using the Remote pane
Each task in the GUI is managed by an independent thread. Therefore, a user can start multiple data mining tasks in parallel on different Web Services, taking full advantage of the distributed Grid environment.
Whenever the output of a data mining task is received from a remote computing node, it is visualized in the standard Output pane.
Web Services operations
WSRF-specific operations:
- createResource: creates a new WS-Resource
- subscribe: subscribes to notifications about resource property changes
- destroy: explicitly requests the destruction of a WS-Resource
Data mining operations:
- classification: submits the execution of a classification task
- clustering: submits the execution of a clustering task
- associationRules: submits the execution of an association rules task
The first three operations are related to WSRF-specific invocation mechanisms; the last three are used to request the execution of a specific data mining task.

The classification operation provides access to the complete set of classifiers in the Weka Library (currently, 71 algorithms).
The clustering and associationRules operations expose all the clustering and association rules algorithms provided by the Weka Library (5 and 2 algorithms, respectively).
To improve concurrency, the data mining operations are invoked in an asynchronous way: the client submits the execution in a non-blocking mode, and results are notified to the client whenever they have been computed.

Input parameters of the data mining operations
classification:
- algorithm: name of the classification algorithm
- arguments: arguments to be passed to the algorithm
- testOptions: options to be used during the testing phase
- classIndex: index of the attribute to use as the class
- dataSet: URL of the dataset to be mined
clustering:
- algorithm: name of the clustering algorithm
- arguments: algorithm arguments
- testOptions: testing phase options
- selectedAttrs: indexes of the selected attributes
- classIndex: index of the class w.r.t. which to evaluate clusters
- dataSet: URL of the dataset to be mined
associationRules:
- algorithm: name of the association rules algorithm
- arguments: algorithm arguments
- dataSet: URL of the dataset to be mined

Three parameters are required in the invocation of all the data mining operations: algorithm, arguments, and dataSet.
The algorithm parameter specifies the name of the Java class in the Weka Library to be invoked (example: "weka.classifiers.trees.J48").
The arguments parameter specifies a sequence of arguments to be passed to the algorithm (example: "-C 0.25 -M 2").
The dataSet parameter specifies the URL of the dataset to be mined (example: "gsiftp://hostname/path/file.arff").

Task execution steps
[Figure: the Client Module on the user node invokes createResource, subscribe, clustering, deliver, and destroy on the Web Service of a computing node; the dataset is transferred between GridFTP servers through the GT4 RFT service]
Scenario: the Client Module (CM) requests the execution of a clustering analysis on a dataset local to the user node. To perform this task, a sequence of steps needs to be executed.

1. Resource creation. The CM invokes the createResource operation to create a new WS-Resource. A clustering model property is used to store the result of the clustering task.
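The three parameters required by every data mining operation can be illustrated with a short Java sketch: the algorithm name is resolved by reflection and the arguments string is split into the option array. A JDK class stands in for the algorithm here because the Weka jar is not assumed on the classpath; in Weka4WS the name would be, e.g., "weka.classifiers.trees.J48":

```java
public class InvocationParams {
    // Resolve the 'algorithm' parameter: a fully qualified Java class
    // name is loaded and instantiated via reflection.
    static Object newAlgorithm(String className) throws Exception {
        return Class.forName(className).getDeclaredConstructor().newInstance();
    }

    // Split the 'arguments' parameter (e.g. "-C 0.25 -M 2") into the
    // option array that a Weka algorithm would receive.
    static String[] splitArguments(String arguments) {
        return arguments.trim().split("\\s+");
    }

    public static void main(String[] args) throws Exception {
        // java.util.ArrayList is a stand-in for a Weka algorithm class.
        Object algo = newAlgorithm("java.util.ArrayList");
        String[] opts = splitArguments("-C 0.25 -M 2");
        System.out.println(algo.getClass().getName() + " with " + opts.length + " options");
    }
}
```

The dataSet parameter stays a plain URL string (e.g. "gsiftp://hostname/path/file.arff") and is handled by the transfer services rather than the algorithm itself.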
The WS returns the endpoint reference (EPR) of the created resource.

2. Notification subscription. The CM invokes the subscribe operation, which subscribes to notifications about changes that will occur to the clustering model resource property.

3. Task submission. The CM invokes the clustering operation to request the execution of the clustering task. The operation is invoked in an asynchronous way.

4. Dataset download. The WS requests the download of the dataset to be mined from the URL specified in the clustering invocation. The download request is managed by the GT4 Reliable File Transfer (RFT) service.

5. Data mining. The clustering analysis is started by invoking the appropriate Java class in the WL.
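The create/subscribe/submit/notify/destroy lifecycle can be sketched as a plain-Java analogue. This is not the GT4 WSRF API: the Resource class, its methods, and the mocked mining step are invented stand-ins that only mirror the roles of the service operations:

```java
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

public class TaskLifecycle {
    // Stand-in for a WS-Resource holding the clustering model property.
    static class Resource {
        volatile String clusteringModel;
        Consumer<String> listener = m -> {};                  // notification subscriber
        void subscribe(Consumer<String> l) { listener = l; }
        void setModel(String m) { clusteringModel = m; listener.accept(m); }
        void destroy() { clusteringModel = null; }
    }

    static CompletableFuture<String> runClustering(Resource r, String dataSetUrl) {
        CompletableFuture<String> done = new CompletableFuture<>();
        r.subscribe(done::complete);                          // step 2: subscribe to property changes
        CompletableFuture.runAsync(() -> {                    // step 3: non-blocking submission
            String model = "model(" + dataSetUrl + ")";       // steps 4-5: download + mining (mocked)
            r.setModel(model);                                // step 6: change triggers delivery
        });
        return done;
    }

    public static void main(String[] args) throws Exception {
        Resource r = new Resource();                          // step 1: resource creation
        System.out.println(runClustering(r, "gsiftp://host/data.arff").get());
        r.destroy();                                          // step 7: resource destruction
    }
}
```

The essential point mirrored here is that the client thread is free between submission and delivery, which is what lets the GUI run many remote tasks in parallel.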
The execution is handled within the WS-Resource created in Step 1, and the result of the computation is stored in the clustering model property.

6. Results notification. Whenever the clustering model property has been changed, its new value is notified to the CM by invoking its implicit deliver operation.

7. Resource destruction. The CM invokes the destroy operation, which explicitly destroys the WS-Resource created in Step 1.

Performance analysis
Goals:
- evaluating the execution times of the different steps needed to perform a typical data mining task in different network scenarios
- evaluating the efficiency of the WSRF mechanisms and Weka4WS as methods to execute distributed data mining services
We used 10 datasets extracted from the census dataset available at the UCI repository:
- number of instances: from 1700 to 17000
- dataset size: from 0.5 to 5 MB
Weka4WS has been used to perform a clustering analysis on each of these datasets:
- algorithm used: Expectation Maximization (EM)
- number of clusters to be identified: 10
The clustering analysis on each dataset was executed in two network scenarios:
- LAG: the computing node and the user node are connected by a local area grid (Bw = 94.4 Mbps, RTT = 1.4 ms)
- WAG: the computing node and the user node are connected by a wide area grid (Bw = 213 kbps, RTT = 19 ms)
For each dataset size and network scenario we ran 20 independent executions; the values reported in the following graphs are averages over the 20 executions.

Execution times - LAG scenario
[Figure: log-scale execution time (ms) vs. dataset size (0.5 to 5 MB) for resource creation, notification subscription, task submission, dataset download, data mining, results notification, resource destruction, and the total]
This graph represents the execution times of the different steps of the clustering task in the LAG scenario for a dataset size ranging from 0.5 to 5 MB.
The execution times of the WSRF-specific steps are independent of the dataset size: resource creation (1698 ms on average), notification subscription (275 ms), task submission (342 ms), results notification (1354 ms), and resource destruction (214 ms).
On the contrary, the execution times of the dataset download and data mining steps are proportional to the dataset size:
- dataset download: from 218 ms (0.5 MB) to 665 ms (5 MB)
- data mining: from 107474 ms (0.5 MB) to 1026584 ms (5 MB)
The total execution time ranges from 111798 ms for the 0.5 MB dataset to 1031209 ms for the 5 MB dataset. The lines representing the total execution time and the data mining execution time appear coincident, because the data mining step takes from 96% to 99% of the total execution time.

Execution times - WAG scenario
[Figure: as above, for the WAG scenario]
The execution times of the WSRF-specific steps are similar to those measured in the LAG scenario. The only significant difference is the execution time of the results notification step (2790 ms), due to the additional time needed to transfer the clustering model through a low-speed network.
For the same reason, the transfer of the dataset to be mined requires an execution time significantly greater than the one measured in the LAG scenario:
- dataset download: from 14638 ms (0.5 MB) to 132463 ms (5 MB)

Execution times percentage - LAG scenario
[Figure: percentage of the total execution time taken by data mining, dataset download, and the other steps, for dataset sizes from 0.5 to 5 MB]
This graph shows the percentage of the execution times of the data mining, dataset download, and the other steps (i.e., resource creation, notification subscription, task submission, results notification, resource destruction) w.r.t. the total execution time in the LAG scenario.
In the LAG scenario the data mining step represents from 96.13% to 99.55% of the total execution time, the dataset download ranges from 0.19% to 0.06%, and the other steps range from 3.67% to 0.38%.

Execution times percentage - WAG scenario
[Figure: as above, for the WAG scenario]
In the WAG scenario the data mining step represents from 84.62% to 88.32% of the total execution time, the dataset download ranges from 11.22% to 11.20%, while the other steps range from 4.16% to 0.48%.

Performance considerations
The performance analysis demonstrates the efficiency of the WSRF mechanisms and Weka4WS as a means to execute data mining tasks on remote machines.
In the LAG scenario, neither the dataset download nor the other steps represent a significant overhead with respect to the total execution time.
In the WAG scenario, on the contrary, the dataset download is a critical step that can affect the overall execution time: for this reason, the use of data replication mechanisms and high-performance file transfer protocols such as GridFTP can be of great importance.

Conclusions (1)
Gathering data in a central site raises several privacy concerns that generally prevent centralized mining of different multi-owner data sources.
Privacy-preserving mining over data distributed among many sites can be implemented through the cooperation of classifiers that learn global data mining results without revealing the data sources available at the single sites.
Flexible middleware and services running in distributed environments can help users to follow this approach.

Conclusions (2)
Weka4WS can be used to implement distributed data mining applications.
By exploiting the mechanisms described above, Weka4WS can provide an effective way to perform privacy-preserving distributed data analysis on large-scale Grids.
The experimental results demonstrate the efficiency of the WSRF mechanisms as a means to execute data mining tasks on remote resources.
A Weka4WS software prototype running under Globus Toolkit 4.0.1 will soon be made available to the research community.

Thanks!