Weka4WS: Enabling Distributed Data Mining on Grids
Domenico Talia, Paolo Trunfio, Oreste Verta
DEIS
University of Calabria
Rende, Italy
Venezia, October 17-18, 2005
Outline
Goal of this work
Weka4WS, WSRF and GT4
Weka4WS architecture
Software components
Graphical user interface
Web Services operations
Execution mechanisms
Performance analysis
Conclusions
Goal of this work
Computational Grids have emerged in recent years as
effective infrastructures for distributed high-performance
computing and data processing
By exploiting a service-oriented approach, knowledge
discovery applications can be developed on Grids to
deliver high performance and manage data and
knowledge distribution
The goal of this work is to extend the Weka toolkit to
support distributed data mining using standard Grid
technologies, such as the emerging Web Services
Resource Framework (WSRF) and the Globus Toolkit 4
The Weka4WS framework
Weka provides a large collection of machine learning
algorithms for data pre-processing, classification,
clustering, association rules, and visualization, which can
be used through a common GUI
In Weka, the overall data mining process takes place on
a single machine, since the algorithms can be executed
only locally
The goal of Weka4WS is to extend Weka to support
remote execution of Weka data mining algorithms
In Weka4WS the data mining algorithms for
classification, clustering and association rules can be
executed on remote Grid resources to implement
distributed data mining and speed up applications.
To enable remote invocation, each data mining
algorithm provided by the Weka library is
exposed as a Web Service, which can be easily
deployed on the available Grid nodes
To achieve integration and interoperability with standard
Grid environments, Weka4WS has been designed and
developed by using the emerging Web Services Resource
Framework (WSRF) as enabling technology
WSRF and Globus Toolkit 4
WSRF is a family of technical specifications concerned
with the creation, addressing, inspection, and lifetime
management of stateful resources
The framework codifies the relationship between Web
Services and stateful resources, which are termed
WS-Resources in WSRF terminology
The Globus Alliance recently released the Globus Toolkit
4 (GT4), which provides an open source implementation
of the WSRF library and incorporates Web Services
implemented according to the WSRF specifications
Weka4WS has been developed by using the Java WSRF
library provided by GT4
Weka4WS architecture
In Weka4WS all nodes use GT4 services for standard Grid
functionalities, such as security and data management
We divide these nodes into two categories:
user nodes, which are the local machines of the users
providing the Weka4WS client software
computing nodes, which provide the Weka4WS Web
Services allowing the execution of remote data mining tasks
Data can be located on computing nodes, user nodes, or
third-party nodes
If the dataset to be mined is not locally available on a
computing node, it can be downloaded or replicated by
means of the GT4 data management services
Software components
[Figure: a user node (Graphical User Interface, Client Module, Weka Library, GT4 Services) and a computing node (Web Service, Weka Library, GT4 Services)]
User nodes include three software components:
Graphical User Interface (GUI)
Client Module (CM)
Weka Library (WL)
Computing nodes include two software components:
Web Service (WS)
Weka Library (WL)
The GUI extends the Weka Explorer environment to allow
the execution of both local and remote data mining tasks:
local tasks are executed by directly invoking the local WL
remote tasks are executed through the CM, which operates
as an intermediary between the GUI and Web Services on
remote computing nodes
The WS is a WSRF-compliant Web Service that exposes
the data mining algorithms provided by the underlying
WL
Therefore, requests to the WS are executed by invoking
the corresponding WL algorithms
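As a rough illustration of how a request naming an algorithm could be mapped to the class that implements it, the following Python sketch resolves a fully qualified class name at run time. It is only an analogy: the real service invokes Java classes of the Weka Library, whereas here the helper names load_class and run_algorithm are hypothetical and a standard-library class stands in for a Weka algorithm.

```python
import importlib


def load_class(qualified_name: str):
    """Resolve a dotted name such as 'package.module.ClassName' to a class object."""
    module_name, _, class_name = qualified_name.rpartition(".")
    if not module_name:
        raise ValueError(f"expected a fully qualified name, got {qualified_name!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def run_algorithm(qualified_name: str, *args, **kwargs):
    """Instantiate the named class; a real service would then run it on the dataset."""
    cls = load_class(qualified_name)
    return cls(*args, **kwargs)


# Stand-in for a Weka class name: a class from the Python standard library.
counter = run_algorithm("collections.Counter", "abracadabra")
print(counter.most_common(1))  # [('a', 5)]
```

Resolving the class by name keeps the service interface generic: new algorithms in the library become reachable without adding new operations.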
Graphical user interface
A “Remote pane” has been added to the original Weka
Explorer environment
This pane provides a list of the remote Web Services that
can be invoked, and two buttons to start and stop the
data mining task on the selected Web Service
Through the GUI a user can both:
start the execution locally by using the standard Local pane
start the execution remotely by using the Remote pane
Each task in the GUI is managed by an independent
thread
Therefore, a user can start multiple data mining tasks in
parallel on different Web Services, thereby taking full
advantage of the distributed Grid environment
Whenever the output of a data mining task has been
received from a remote computing node, it is visualized
in the standard Output pane
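The one-thread-per-task behaviour described above can be sketched in Python with a thread pool. This is a minimal sketch under stated assumptions, not the Weka4WS GUI code: run_remote_task is a hypothetical stand-in that sleeps instead of contacting a service, and the node addresses are invented for illustration.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time


def run_remote_task(service_url: str, algorithm: str) -> str:
    """Hypothetical stand-in for a remote invocation; sleeps instead of calling a service."""
    time.sleep(0.05)
    return f"{algorithm} finished on {service_url}"


# Hypothetical service addresses, one per computing node.
tasks = [
    ("https://node1.example.org/Weka4WS", "weka.classifiers.trees.J48"),
    ("https://node2.example.org/Weka4WS", "weka.clusterers.EM"),
    ("https://node3.example.org/Weka4WS", "weka.associations.Apriori"),
]

outputs = []
with ThreadPoolExecutor(max_workers=len(tasks)) as pool:
    # Each task gets its own worker thread, so the three run concurrently.
    futures = [pool.submit(run_remote_task, url, alg) for url, alg in tasks]
    for future in as_completed(futures):
        outputs.append(future.result())  # collected as each task completes

for line in sorted(outputs):
    print(line)
```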
Web Services operations
WSRF-specific operations:
  createResource: Creates a new WS-Resource.
  subscribe: Subscribes to notifications about resource property changes.
  destroy: Explicitly requests the destruction of a WS-Resource.
Data mining operations:
  classification: Submits the execution of a classification task.
  clustering: Submits the execution of a clustering task.
  associationRules: Submits the execution of an association rules task.
The first three operations are related to WSRF-specific
invocation mechanisms
The last three operations are used to request the
execution of a specific data mining task
The classification operation provides access to the
complete set of classifiers in the Weka Library (currently,
71 algorithms)
The clustering and associationRules operations expose all
the clustering and association rules algorithms provided
by the Weka Library (5 and 2 algorithms, respectively)
To improve concurrency the data mining operations are
invoked in an asynchronous way:
the client submits the execution in a non-blocking mode,
and results are notified to the client whenever they have
been computed
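The non-blocking submission with later notification can be sketched in Python as follows. This only illustrates the pattern: the clustering function and the on_result handler are hypothetical local stand-ins, and no WSRF notification machinery is involved.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

results = []
done = threading.Event()


def clustering(dataset_url: str) -> str:
    """Hypothetical long-running operation executed on the service side."""
    return f"clustering model for {dataset_url}"


def on_result(future) -> None:
    """Notification handler: runs once the result has been computed."""
    results.append(future.result())
    done.set()


executor = ThreadPoolExecutor(max_workers=1)
# submit() returns immediately with a Future; the caller is not blocked.
future = executor.submit(clustering, "gsiftp://hostname/path/file.arff")
future.add_done_callback(on_result)

# The client is free to do other work here; it waits only when it needs the result.
done.wait(timeout=5)
executor.shutdown()
print(results[0])
```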
Input parameters of the DM operations
classification
  algorithm: Name of the classification algorithm
  arguments: Arguments to be passed to the algorithm
  testOptions: Options to be used during the testing phase
  classIndex: Index of the attribute to use as the class
  dataSet: URL of the dataset to be mined
clustering
  algorithm: Name of the clustering algorithm
  arguments: Algorithm arguments
  testOptions: Testing phase options
  selectedAttrs: Indexes of the selected attributes
  classIndex: Index of the class with respect to which clusters are evaluated
  dataSet: URL of the dataset to be mined
associationRules
  algorithm: Name of the association rules algorithm
  arguments: Algorithm arguments
  dataSet: URL of the dataset to be mined
Three parameters are required in the invocation of all
the data mining operations: algorithm, arguments, and
dataSet
The algorithm parameter specifies the name of the Java
class in the Weka Library to be invoked
example: “weka.classifiers.trees.J48”
The arguments parameter specifies a sequence of
arguments to be passed to the algorithm
example: “-C 0.25 -M 2”
The dataSet parameter specifies the URL of the dataset
to be mined
example: “gsiftp://hostname/path/file.arff”
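To make the parameter rules concrete, here is a hedged Python sketch that assembles and checks a request for one of the three data mining operations. The parameter names and example values come from the slides; the build_request helper, the returned dictionary layout, and the argv and scheme fields are assumptions made for illustration and are not part of the Weka4WS interface.

```python
from urllib.parse import urlparse
import shlex

# Parameters accepted by each operation, as listed for the DM operations.
ALLOWED = {
    "classification": {"algorithm", "arguments", "testOptions", "classIndex", "dataSet"},
    "clustering": {"algorithm", "arguments", "testOptions", "selectedAttrs",
                   "classIndex", "dataSet"},
    "associationRules": {"algorithm", "arguments", "dataSet"},
}
# Parameters required by all three operations.
REQUIRED = {"algorithm", "arguments", "dataSet"}


def build_request(operation: str, **params) -> dict:
    """Validate the parameters of a data mining operation (hypothetical helper)."""
    if operation not in ALLOWED:
        raise ValueError(f"unknown operation: {operation}")
    missing = REQUIRED - set(params)
    if missing:
        raise ValueError(f"missing required parameters: {sorted(missing)}")
    unknown = set(params) - ALLOWED[operation]
    if unknown:
        raise ValueError(f"parameters not accepted by {operation}: {sorted(unknown)}")
    url = urlparse(params["dataSet"])
    if not url.scheme or not url.path:
        raise ValueError(f"dataSet must be a URL: {params['dataSet']!r}")
    request = dict(params)
    request["operation"] = operation
    request["argv"] = shlex.split(params["arguments"])  # "-C 0.25 -M 2" -> list of tokens
    request["scheme"] = url.scheme
    return request


request = build_request(
    "classification",
    algorithm="weka.classifiers.trees.J48",
    arguments="-C 0.25 -M 2",
    dataSet="gsiftp://hostname/path/file.arff",
)
print(request["argv"])    # ['-C', '0.25', '-M', '2']
print(request["scheme"])  # gsiftp
```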
Task execution steps
[Figure: the Client Module on the user node and the Web Service on the computing node, with the operations createResource, subscribe, clustering, deliver and destroy; the dataset, the Weka Library, the GT4 RFT Service and a GridFTP Server on each node are also shown]
Scenario: the Client Module (CM) requests the execution
of a clustering analysis on a dataset local to the user node
Performing this task requires the following sequence of steps
Task execution: Resource creation
1 Resource creation. The CM invokes the createResource
operation to create a new WS-Resource. A clustering model
property is used to store the result of the clustering task. The
WS returns the EPR of the created resource
Task execution: Notification subscription
2 Notification subscription. The CM invokes the subscribe
operation to be notified of changes to the clustering model
resource property
Task execution: Task submission
3 Task submission. The CM invokes the clustering operation
to request the execution of the clustering task. The operation
is invoked asynchronously
Task execution: Dataset download
4 Dataset download. The WS requests the download of the
dataset to be mined from the URL specified in the clustering
invocation. The download request is managed by the GT4
Reliable File Transfer (RFT) service
Task execution: Data mining
5 Data mining. The clustering analysis is started by invoking
the appropriate Java class in the WL. The execution is handled
within the WS-Resource created in Step 1, and the result of
the computation is stored in the clustering model property
Task execution: Results notification
6 Results notification. When the clustering model
property changes, its new value is sent to the
CM by invoking the CM's implicit deliver operation
Task execution: Resource destruction
7 Resource destruction. The CM invokes the destroy
operation, which explicitly destroys the WS-Resource created
in Step 1
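The seven steps can be traced end to end with a small in-memory mock, sketched below in Python. Everything here is hypothetical: MockClusteringService, its method names and the "-N 10" argument string are illustrative, no Grid, WSRF or Weka calls are made, and the mock runs synchronously whereas the real clustering operation is invoked asynchronously.

```python
class MockClusteringService:
    """In-memory stand-in for the Web Service; it only records the step order."""

    def __init__(self):
        self.resources = {}    # EPR -> resource properties
        self.subscribers = {}  # EPR -> list of deliver callbacks
        self.log = []
        self._next_id = 0

    def create_resource(self) -> str:
        self._next_id += 1
        epr = f"resource-{self._next_id}"
        self.resources[epr] = {"clusteringModel": None}
        self.subscribers[epr] = []
        self.log.append("1 resource creation")
        return epr

    def subscribe(self, epr: str, deliver) -> None:
        self.subscribers[epr].append(deliver)
        self.log.append("2 notification subscription")

    def clustering(self, epr: str, algorithm: str, arguments: str, data_set: str) -> None:
        self.log.append("3 task submission")
        dataset = f"<contents of {data_set}>"  # a real service would transfer the file
        self.log.append("4 dataset download")
        model = f"{algorithm} {arguments} model of {dataset}"  # placeholder for mining
        self.resources[epr]["clusteringModel"] = model
        self.log.append("5 data mining")
        for deliver in self.subscribers[epr]:
            deliver(model)  # property changed: notify every subscriber
        self.log.append("6 results notification")

    def destroy(self, epr: str) -> None:
        del self.resources[epr]
        del self.subscribers[epr]
        self.log.append("7 resource destruction")


# Client Module side: the same call order as in the steps above.
received = []
ws = MockClusteringService()
epr = ws.create_resource()
ws.subscribe(epr, received.append)
ws.clustering(epr, "weka.clusterers.EM", "-N 10", "gsiftp://hostname/path/file.arff")
ws.destroy(epr)

print("\n".join(ws.log))
```

Keeping the result in a resource property, rather than in the return value of clustering, is what lets the client submit the task and be notified later.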
Performance analysis
Goals:
evaluating the execution times of the different steps needed to
perform a typical data mining task in different network scenarios
evaluating the efficiency of the WSRF mechanisms and Weka4WS
as methods to execute distributed data mining services
We used 10 datasets extracted from the census dataset
available at the UCI repository:
number of instances: from 1700 to 17000
dataset size: from 0.5 to 5 MB
Weka4WS has been used to perform a clustering analysis
on each of these datasets:
algorithm used: Expectation Maximization (EM)
number of clusters to be identified: 10
The clustering analysis on each dataset was executed in
two network scenarios:
LAG: the computing node and the user node are connected
by a local area grid (Bw = 94.4 Mbps, RTT = 1.4 ms)
WAG: the computing node and the user node are
connected by a wide area grid (Bw = 213 kbps, RTT = 19
ms)
For each dataset size and network scenario we ran 20
independent executions:
the values reported in the following graphs are
averages of the values measured in the 20 executions
Execution times - LAG scenario
[Figure: line chart of execution time (ms, logarithmic axis from 1.0E+02 to 1.0E+07) versus dataset size (MB, from 0.5 to 5.0), with one line per step (resource creation, notification subscription, task submission, dataset download, data mining, results notification, resource destruction) plus the total. Data labels on the plot, in increasing order: 1.12E+05, 2.12E+05, 3.11E+05, 4.16E+05, 5.19E+05, 6.25E+05, 7.26E+05, 8.30E+05, 9.24E+05, 1.03E+06]
This graph represents the execution times of the different
steps of the clustering task in the LAG scenario for a dataset
size ranging from 0.5 to 5 MB
The execution times of the WSRF-specific steps are
independent of the dataset size:
resource creation (1698 ms, on the average), notification
subscription (275 ms), task submission (342 ms), results
notification (1354 ms), and resource destruction (214 ms)
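Adding up the five averages quoted above gives the fixed cost of the WSRF-specific steps in the LAG scenario. The short Python sketch below only does this arithmetic; the variable names are illustrative.

```python
# Average execution times (ms) of the WSRF-specific steps in the LAG scenario.
wsrf_steps_ms = {
    "resource creation": 1698,
    "notification subscription": 275,
    "task submission": 342,
    "results notification": 1354,
    "resource destruction": 214,
}

overhead_ms = sum(wsrf_steps_ms.values())
print(f"WSRF-specific overhead: {overhead_ms} ms")  # 3883 ms, i.e. just under 4 seconds
```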
In contrast, the execution times of the dataset download
and data mining steps grow with the dataset size:
Dataset download: 218 ms (0.5 MB) ... 665 ms (5 MB)
Data mining: 107474 ms (0.5 MB) ... 1026584 ms (5 MB)
The total execution time ranges from 111798 ms for the
dataset of 0.5 MB, to 1031209 ms for the dataset of 5 MB
The lines representing the total execution time and the data
mining execution time appear coincident, because the data mining
step takes from 96% to 99% of the total execution time
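The share quoted above can be recomputed from the reported times. The following Python sketch uses only the data mining and total times given for the 0.5 MB and 5 MB datasets in the LAG scenario; the dictionary layout is an illustrative choice.

```python
# (data mining time, total time) in ms for the smallest and largest dataset, LAG scenario.
lag_times_ms = {
    0.5: (107474, 111798),
    5.0: (1026584, 1031209),
}

shares = {}
for size_mb, (mining_ms, total_ms) in lag_times_ms.items():
    shares[size_mb] = round(100.0 * mining_ms / total_ms, 2)
    print(f"{size_mb} MB: data mining takes {shares[size_mb]}% of the total time")
# 0.5 MB -> 96.13%, 5.0 MB -> 99.55%
```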
Execution times - WAG scenario
[Figure: line chart of execution time (ms, logarithmic axis from 1.0E+02 to 1.0E+07) versus dataset size (MB, from 0.5 to 5.0), with one line per step (resource creation, notification subscription, task submission, dataset download, data mining, results notification, resource destruction) plus the total. Data labels on the plot, in increasing order: 1.30E+05, 2.46E+05, 3.61E+05, 4.86E+05, 6.16E+05, 7.25E+05, 8.49E+05, 9.64E+05, 1.06E+06, 1.18E+06]
The execution times of the WSRF-specific steps are similar to
those measured in the LAG scenario
The only significant difference is the execution time of the results
notification step (2790 ms), due to additional time needed to
transfer the clustering model through a low-speed network
For the same reason, the transfer of the dataset to be mined
requires an execution time significantly greater than the one
measured in the LAG scenario:
Dataset download: 14638 ms (0.5 MB) ... 132463 ms (5 MB)
Execution times percentage - LAG scenario
[Figure: chart of the percentage of the total execution time (axis from 0 to 100) versus dataset size (MB, from 0.5 to 5.0), with the series data mining, dataset download and other steps. Data labels on the plot, in increasing order: 96.13%, 98.02%, 98.71%, 99.02%, 99.21%, 99.32%, 99.40%, 99.48%, 99.52%, 99.55%]
This graph shows the percentage of the execution times of the
data mining, dataset download, and the other steps (i.e.,
resource creation, notification subscription, task submission,
results notification, resource destruction), w.r.t. the total
execution time in the LAG scenario
In the LAG scenario the data mining step represents from
96.13% to 99.55% of the total execution time, the dataset
download ranges from 0.19% to 0.06%, and the other steps
range from 3.67% to 0.38%
Execution times percentage - WAG scenario
[Figure: chart of the percentage of the total execution time (axis from 0 to 100) versus dataset size (MB, from 0.5 to 5.0), with the series data mining, dataset download and other steps. Data labels on the plot: 84.62%, 86.40%, 87.29%, 87.56%, 88.10%, 88.19%, 88.31%, 88.32%, 88.25%, 88.32%]
In the WAG scenario the data mining step represents from
84.62% to 88.32% of the total execution time, the dataset
download ranges from 11.22% to 11.20%, while the other
steps range from 4.16% to 0.48%
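As a rough consistency check, the WAG download times reported earlier (14638 ms for 0.5 MB and 132463 ms for 5 MB) can be combined with the percentages above to estimate the total and data mining times. The Python sketch below assumes that the first value of each quoted range refers to the 0.5 MB dataset and the second to the 5 MB dataset; the results are estimates limited by the rounding of the percentages, not measured values.

```python
def implied_total_ms(step_ms: float, share_percent: float) -> float:
    """Total time implied by one step's duration and its share of the total."""
    return step_ms / (share_percent / 100.0)


# size (MB) -> (download time in ms, download share in %), WAG scenario.
download = {0.5: (14638, 11.22), 5.0: (132463, 11.20)}
# size (MB) -> data mining share in %, WAG scenario.
mining_share = {0.5: 84.62, 5.0: 88.32}

estimates = {}
for size_mb, (download_ms, download_pct) in download.items():
    total_ms = implied_total_ms(download_ms, download_pct)
    mining_ms = total_ms * mining_share[size_mb] / 100.0
    estimates[size_mb] = (total_ms, mining_ms)
    print(f"{size_mb} MB: total ~{total_ms:.0f} ms, data mining ~{mining_ms:.0f} ms")
```

The implied totals come out at roughly 1.3E+05 ms and 1.18E+06 ms.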
Performance considerations
The performance analysis demonstrates the efficiency of
the WSRF mechanisms and Weka4WS as a means to
execute data mining tasks on remote machines
In the LAG scenario neither the dataset download nor
the other steps represent a significant overhead with
respect to the total execution time
In the WAG scenario, on the contrary, the dataset
download is a critical step that can affect the overall
execution time:
for this reason, the use of data replication mechanisms and
high-performance file transfer protocols such as GridFTP
can be of great importance
Conclusions (1)
Gathering data in a central site raises several privacy
considerations that generally prevent centralized mining
of different multi-owner data sources.
Privacy-preserving mining when data is distributed
among many sites can be implemented through the
cooperation of classifiers that learn global data mining
results without revealing the data sources available at
the single sites.
Flexible middleware and services running in distributed
environments can help users to follow this approach.
Conclusions (2)
Weka4WS can be used to implement distributed data
mining applications
By exploiting such mechanisms, Weka4WS can provide
an effective way to perform privacy-preserving
distributed data analysis on large-scale Grids
The experimental results demonstrate the efficiency of
the WSRF mechanisms as a means to execute data
mining tasks on remote resources
A Weka4WS software prototype running under Globus
Toolkit 4.0.1 will soon be made available to the research
community
Thanks!