Download Distributed Data Mining Patterns as Services in Grids and

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Nonlinear dimensionality reduction wikipedia , lookup

Transcript
Distributed Data Mining Tasks and Distributed
Data Mining Tasks and
Patterns as Services
Large Grain Programming in Grids and Distributed Infrastructures
Domenico Talia
UNIVERSITY OF CALABRIA, Italy
talia@deis unical it
[email protected]
DPA Workshop 2008 – Las Palmas
Goal
• Di
Discuss a strategy based on the use of services for the b d
h
f
i
f h
design of open distributed knowledge discovery tasks and applications on Grids and distributed systems
and applications on Grids and distributed systems.
• Outline
Outline how Grid‐based and service‐oriented how Grid‐based and service‐oriented
programming mechanisms can be developed as a collection of Grid/Web/Cloud services.
• Investigate how they can be used to develop distributed g
y
p
data analysis tasks and knowledge discovery applications exploiting the SOA model.
2
C
Complex
l
Big Problems
• Bi
Bigger and more complex
d
l
problems must be solved
by distributed computing.
•
DATA SOURCES are l
larger
and larger
dl
and distributed.
d di t ib t d
• The main problem is not storing DATA, it is analyse, mine, and process DATA.
3
Data Availability or Data Deluge ?
Data Availability or Data Deluge ?
• TToday
d the
th information
i f
ti stored
t d in
i digital
di it l data
d t archives
hi
is enormous and its size is still growing very rapidly.
The world has created or 161 exabytes (161 billion
gigabytes) of digital information in 2006.
(source: IDC)
• Whereas until some decades ago the main problem was the
shortage
h t
off information,
i f
ti
th challenge
the
h ll
now seems to
t be
b
• the very large volume of information to deal with and
• the associated complexity to process it and to extract significant and
useful p
parts or summaries.
4
Distributed Data Analysis
Data Analysis Patterns
•
•
•
•
•
•
•
Data parallelism? Task parallelism?
Managing data dependencies
Data management: input, intermediate, output
Dynamic task graphs/workflows (data dependencies)
Dynamic data access involving large amounts of data
Parallel data mining and/or Distributed data mining
Programming distributed mining operations/taks/patterns
5
Programming Levels
Grain size
Web Services, Grid Services, Workflows, Mushup, …
p
Components, Patterns, C
t P tt
Distributed Objects, …
MPI, OpenMP, threads, MapReduce RMI HPF
MapReduce, RMI, HPF,…
Process #
6
Distributed Data Mining on Grids
Distributed Data Mining on Grids
• The Grid extends the distributed and parallel computing paradigms allowing resource negotiation, dynamical allocation, heterogeneity, open protocols and services.
• As
As Grids and Clouds became well accepted computing Grids and Clouds became well accepted computing
infrastructures it is necessary to provide data mining services, algorithms, and applications. • Those may help users to leverage Grid/Cloud/… capability in s pporting high performance distrib ted comp ting for
supporting high‐performance distributed computing for solving their data mining problems in a distributed way.
Grid services for distributed data mining
• EExploiting the SOA model and the Web Services Resource l ii
h SOA
d l d h W bS i
R
Framework (WSRF) it is possible to define basic services for supporting distributed data mining tasks
pp
g
g
in Grids • Those services can address all the aspects that must be considered in data mining and in knowledge discovery processes • data selection and transport services,
d
l
d
• data analysis services,
• knowledge models representation services, and
k
l d
d l
d
• visualization services.
Grid services for distributed data mining
• It is possible to define services corresponding to
Single Steps
that compose a KDD process such as preprocessing,
filtering, and visualization.
Single Data Mining Tasks
such as classification, clustering, and association rules
discovery.
Distributed Data Mining Patterns
such as collective learning, parallel classification and
meta-learning models.
Data Mining Applications or KDD processes
including all or some of the previous tasks expressed
through a multi-step
multi step workflow.
workflow
Data mining Grid services
Data mining Grid services
• This collection of data mining services can constitute an Open Service Framework for Grid‐based Data Mining
Open Service Framework for Grid-based Data Mining
• Allowing developers to program distributed KDD processes
as a composition
p
of single
g and/or
/ aggregated
gg g
services
available over a Grid.
• Those services should exploit other basic Grid services for
data transfer and management for data tranfer, replica
management, data
d integration
i
i and
d querying.
i
Data mining Grid services
Data mining Grid services
• By exploiting the Grid services features it is possible
to develop data mining services accessible every
time and everywhere.
• This approach may result in
•
•
•
•
Service‐based
Service
based distributed data mining applications
Data mining services for virtual organizations.
Distributed data analysis services on demand.
demand
A sort of knowledge discovery eco‐system formed of a
large numbers of decentralized data analysis services.
Data mining Grid services: Are they programming abstractions?
• Apparently not, in a traditional approach.
• Yes, if we consider the user and application
requirements
i
t in
i handling
h dli data
d t and
d in
i understanding
d t di
what is useful in it.
•
•
•
•
Basic services as simple operations;
Service programming languages for composing them;
Complex services and their complex composition;
Towards distributed programming patterns for services.
Grid services for distributed data Grid
services for distributed data
mining
• Service‐based systems we developed
• Weka4WS
• Knowledge Grid
• Mobile Data Mining Grid Services
Mobile Data Mining Grid Services
• Mining@home
Weka4WS KnowledgeFlow
Weka4WS KnowledgeFlow
Programming a data mining workflows and run them on several Grid nodes
Programming a data mining workflows and run them on several Grid nodes.
Knowledge Grid: application design
Knowledge Grid: application design
start
schema: “/../DMService.wsdl”
operation: “classify”
argument: “J48”
argument: “-C 0.25 -M 2”
...
data mining
data mining
data mining
data mining
data mining
data mining
data mining
data mining
file transfer
file transfer
file transfer
file transfer
voting
end
schema: “/../DMService.wsdl”
operation: “clusterize”
clusterize”
argument: “EM”
argument: “-I 100 -N -1 -S 90”
...
Grid Services for Mobile Data Mining
Grid Services for Mobile Data Mining
• A user can choose the mining algorithm and select which part of a result (data mining model) he wants to visualize.
Mining@home
•
The Public Resource Computing paradigm (PRC) is currently used to h
bl
d
(
)
l
d
execute large scientific applications with the help of private computers (Seti@home, Climate@home, Einstein@home).
•
PRC model can be exploited to program to P2P data mining tasks involving hundreds or thousands of nodes
hundreds or thousands of nodes.
•
Highly decentralized data analysis
g y
y
tasks can be programmed as large
collections of threads or services.
17
1..18E+06
1..06E+06
8.4
49E+05
7.2
25E+05
6.16E+05
4.86
6E+05
3.61E
E+05
1.0E+06
1.30E+0
05
Execution time (ms)
1.0E+07
2.46E
E+05
Execution times
9..64E+05
Impact of the WSRF overhead
Impact of the WSRF overhead
1.0E+05
dataset download
1.0E+04
1.0E+03
data mining
Resource creation
Notification subscription
Task submission
Dataset download
Data mining
Results notification
Resource destruction
Total
1.0E+02
0.5 1.0 1.5
2.0 2.5 3.0 3.5 4.0 4.5 5.0
Dataset size (MB)
ƒ In a Grid scenario the data mining step represents from 85% to 88% of the total execution time, the dataset download takes about 11%, while the other steps range from 0.5% to 4%.
Weka4WS: application speedup on a Grid
Summary
• New
New HPC infrastructures allow us to attack new problems, BUT HPC infrastructures allow us to attack new problems, BUT
require to solve more challenging problems.
• New programming models and environments
are required
• Data is becoming a BIG player, programming data analysis
applications and services is a must.
li ti
d
i
i
t
• New ways to efficiently compose different models and paradigms are needed.
• Relationships between different programming levels must be addressed.
• In a long‐term vision, pervasive collections of data analysis services and applications must be accessed and used as public utilities. • We must be ready for managing with this scenario.
20
Thanks
21