Knowledge Discovery in Grid Datasets – Goals,
Design Concepts and the Architecture
Peter Brezany
University of Vienna
Collecting Data
• Sources: laboratories (microscopes, MRI/CT scanners, ...), satellites, business, experiments (high energy physics, ...), computer simulations
• Data Repositories
• Analysis
Motivation
• Computational Grid – a new-generation infrastructure
• Challenge: Advanced analysis of data managed by Grid
• Typical data in modern Grid applications:
– files, file collections, relational and XML DBs, virtual data, data objects
• The data is often large and geographically distributed, and
its complexity is increasing; some applications require
special security precautions.
• Our research aims:
– Phase 1 : Knowledge discovery Grid system (GridMiner)
– Phase 2 : Intelligent Grid system (WisdomGrid)
Outline
• Motivation
• Background and Related Work
• Basic Concepts and GridMiner Architecture
• Grid Data Integration System
• Data Mining Layer
• Implementation Issues and Experiments
• Future Research
• Conclusions
Background and Related Work
• Basic Grid development (Globus 1) – metacomputing
• Data Grid (Globus 2, DataGrid of CERN, etc.)
• Semantic Grid (myGrid)
• Open Grid Service Architecture (Globus 3, OGSA-DAIS)
• Parallel and Distributed Data Mining and Data Warehousing
• Knowledge Grid (GridMiner and work of others)
• Web Intelligence
GridMiner Requirements
• Open architecture
• Data distribution, complexity, heterogeneity, and large data size
• Applying different kinds of analysis strategies
• Compatibility with existing Grid infrastructure
• Openness to tools and algorithms
• Scalability
• Grid, network, and location transparency
• Security and data privacy
• OLAP support
GridMiner (Layered) Abstract Architecture
[Figure: layered architecture. Layers from top to bottom: User Interface, Knowledge Grid, Information Grid, Computational & Data Grid. The diagram also carries the labels "Data to Knowledge" and "Control".]
Built on K. G. Jeffery's proposal
GridMiner Conceptual Architecture
[Figure: conceptual architecture diagram; the only surviving label is "Job Control".]
Service Architecture
Based on OGSA-DAIS
Data Distribution Scenarios
1. Single data source
2. Federated data sources with different types of partitioning
Example
Vertical and horizontal distribution of the virtual data source
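
A minimal Python sketch of how a mediator might reassemble such a virtual data source: horizontally partitioned fragments (same columns, different rows) are unioned, and vertically partitioned fragments (same keys, different columns) are joined on a shared key. The site names, the `id` key, and the sample rows are hypothetical; this is an illustration of the two partitioning types, not the GridMiner mediation service.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def union_horizontal(fragments: List[List[Row]]) -> List[Row]:
    """Horizontal partitioning: every site holds different rows with the same columns."""
    merged: List[Row] = []
    for fragment in fragments:
        merged.extend(fragment)
    return merged


def join_vertical(fragments: List[List[Row]], key: str) -> List[Row]:
    """Vertical partitioning: every site holds different columns for the same keys."""
    by_key: Dict[Any, Row] = {}
    for fragment in fragments:
        for row in fragment:
            by_key.setdefault(row[key], {}).update(row)
    return [by_key[k] for k in sorted(by_key)]


# Hypothetical fragments held at three sites.
site_a = [{"id": 1, "age": 34}, {"id": 2, "age": 58}]  # rows 1-2
site_b = [{"id": 3, "age": 41}]                        # row 3, same columns
site_c = [                                             # other columns, same keys
    {"id": 1, "outcome": "good"},
    {"id": 2, "outcome": "poor"},
    {"id": 3, "outcome": "good"},
]

rows = union_horizontal([site_a, site_b])
virtual_table = join_vertical([rows, site_c], key="id")
```

The client only ever sees `virtual_table`; where each fragment lives stays hidden behind the two helper functions.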
Mapping Schema
Grid Data Mediation Services
Architecture of a Data Mining System
Components of the Data Mining Layer
• GridMiner Service Factory
• GridMiner Service Registry
• GridMiner Data Mining Service
• GridMiner Preprocessing Service
• GridMiner Presentation Service
• GridMiner Orchestration Service
Centralized Data Mining
[Figure: service interaction diagram. Components: Client, GMSR, GMSF, GDSF, GMPPS, GMDMS, GDS 1, GDS 2, GDT, DataSource. Other labels: factory GSHs, query SDEs, notifications, GMDM, <read>, <write>.]
Steps shown in the figure:
1. browse
2. create GMPPS
3. create GDS
4. perform
5. use it
6. create GMDMS
7. create GDS
8. perform
9. use it
10. evaluate Model
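
The browse, create, and perform steps of the centralized scenario can be sketched in Python as a registry that hands out factories, which in turn create service instances. All class and method names here (`ServiceRegistry`, `ServiceFactory`, `perform`) are hypothetical stand-ins, and the "mining" step just computes a mean; none of this is the OGSA or GridMiner API.

```python
from typing import Callable, Dict, List, Optional


class ServiceFactory:
    """Creates instances of one service type (stand-in for a factory service)."""

    def __init__(self, service_type: str, constructor: Callable[[], object]) -> None:
        self.service_type = service_type
        self._constructor = constructor

    def create(self) -> object:
        return self._constructor()


class ServiceRegistry:
    """Lets a client browse and look up the registered factories."""

    def __init__(self) -> None:
        self._factories: Dict[str, ServiceFactory] = {}

    def register(self, factory: ServiceFactory) -> None:
        self._factories[factory.service_type] = factory

    def browse(self) -> List[str]:
        return sorted(self._factories)

    def lookup(self, service_type: str) -> ServiceFactory:
        return self._factories[service_type]


class PreprocessingService:
    def perform(self, values: List[Optional[float]]) -> List[float]:
        # Toy preprocessing: drop missing values.
        return [v for v in values if v is not None]


class DataMiningService:
    def perform(self, values: List[float]) -> Dict[str, float]:
        # Toy "model": the mean of the cleaned values.
        return {"mean": sum(values) / len(values)}


registry = ServiceRegistry()
registry.register(ServiceFactory("GMPPS", PreprocessingService))
registry.register(ServiceFactory("GMDMS", DataMiningService))

available = registry.browse()                  # browse the registry
gmpps = registry.lookup("GMPPS").create()      # create a preprocessing service
cleaned = gmpps.perform([2.0, None, 4.0])      # perform preprocessing
gmdms = registry.lookup("GMDMS").create()      # create a data mining service
model = gmdms.perform(cleaned)                 # perform mining, then evaluate
```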
Parallel and Distributed Data Mining
[Figure: service interaction diagram. Components: Client, GMSR, GMSF (one per node), GMDMS 0, GMDMS 1 to GMDMS 4, data sets dat1 to dat4 (each accessed via <read>). Other labels: factory GSHs, query SDEs, notifications, GMDM. Communication between the services: SOAP / RMI / JXTA / MPI / etc.]
Steps shown in the figure:
1. browse
2. create GMDMS
3. to 6. create
7. perform DataMining
8. perform; 8. control (four times)
9. evaluate Model
GridMiner Orchestration Service
[Figure: service interaction diagram. Components: Client, GMSR, GMSF, GMOrchS with its Workflow Engine, GMPPS 1, GMPPS 2, GMDMS, GMPRS. Other labels: GSHs, query SDEs, notifications, GMDM, <read>, <write>.]
GridMiner Job Description: Header, Resource Declarations, Workflow
Steps shown in the figure:
1. browse
2. create GMDMS
3. execute Workflow
4. create
5. perform
6. create
7. perform
8. create
9. perform
10. create
11. perform
Workflow Outline:
• Activity: use GMPPS for filling missing values, remove noise
• Activity: use GMPPS for selection and preliminary aggregations
• Activity: use GMDMS for generating a decision tree
• Activity: use GMPRS for a graphical, interactive representation
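
A rough Python sketch of a workflow engine running the four activities of the outline in order. The activity functions, the sample rows, and the threshold of 40 are hypothetical, and a majority-class "model" stands in for the decision tree; the point is only the sequential create-and-perform pattern, not the actual GMOrchS behaviour.

```python
from collections import Counter
from typing import Any, Callable, Dict, List, Tuple

State = Dict[str, Any]
Activity = Callable[[State], State]


def fill_missing_ages(state: State) -> State:
    """GMPPS activity: replace missing ages with the mean of the known ages."""
    rows = state["rows"]
    known = [r["age"] for r in rows if r["age"] is not None]
    mean_age = sum(known) / len(known)
    filled = [{**r, "age": mean_age if r["age"] is None else r["age"]} for r in rows]
    return {**state, "rows": filled}


def select_older_patients(state: State) -> State:
    """GMPPS activity: selection step (hypothetical threshold of 40)."""
    return {**state, "rows": [r for r in state["rows"] if r["age"] >= 40]}


def mine_majority_class(state: State) -> State:
    """GMDMS activity: stand-in for decision tree construction."""
    counts = Counter(r["outcome"] for r in state["rows"])
    return {**state, "model": counts.most_common(1)[0][0]}


def present(state: State) -> State:
    """GMPRS activity: stand-in for the graphical representation."""
    return {**state, "report": f"predicted outcome: {state['model']}"}


WORKFLOW: List[Tuple[str, Activity]] = [
    ("GMPPS", fill_missing_ages),
    ("GMPPS", select_older_patients),
    ("GMDMS", mine_majority_class),
    ("GMPRS", present),
]


def run_workflow(workflow: List[Tuple[str, Activity]], state: State) -> Tuple[State, List[str]]:
    """Execute the activities in order and log which service ran each one."""
    log: List[str] = []
    for service_name, activity in workflow:
        state = activity(state)
        log.append(service_name)
    return state, log


rows = [
    {"age": 30, "outcome": "good"},
    {"age": None, "outcome": "poor"},
    {"age": 50, "outcome": "good"},
    {"age": 70, "outcome": "good"},
]
final_state, executed = run_workflow(WORKFLOW, {"rows": rows})
```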
GridMiner Job Specification Language
Implementation Prototype
• Implementation of the Mediation Service for
horizontal data partitioning
• Implementation of Data Mining Services for decision
tree construction as OGSA-compliant Grid services,
based on the Globus Toolkit 3 release
• We use
– Weka, a freely available Java-based data mining system, for data
preprocessing and data mining tasks (main-memory oriented)
– a home-grown Java implementation of the SPRINT algorithm
(disk-oriented)
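
SPRINT selects decision tree splits with the gini index. The Python sketch below evaluates candidate split points on one numeric attribute in memory; it omits SPRINT's disk-resident attribute lists and parallelization, and the sample ages and outcome labels are hypothetical. It is an illustration of the split criterion, not the prototype's Java code.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple


def gini(labels: Sequence[str]) -> float:
    """Gini index of a label set: 1 - sum of squared class frequencies."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_numeric_split(
    values: Sequence[float], labels: Sequence[str]
) -> Tuple[Optional[float], float]:
    """Return (threshold, weighted gini) of the best binary split value <= threshold."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best_threshold: Optional[float] = None
    best_score = float("inf")
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no split point between equal attribute values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left: List[str] = [label for _, label in pairs[:i]]
        right: List[str] = [label for _, label in pairs[i:]]
        score = len(left) / n * gini(left) + len(right) / n * gini(right)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score


# Hypothetical training records: patient age and outcome class.
ages = [23, 35, 47, 59, 68]
outcomes = ["good", "good", "good", "poor", "poor"]
threshold, score = best_numeric_split(ages, outcomes)
```

With these records the classes separate cleanly between ages 47 and 59, so the midpoint is chosen and the weighted gini is zero.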
Experimental Environment
• Test data suites
– synthetic data (generated by an extended version of the IBM
Quest Synthetic Data Generation Code)
– TBI (Traumatic Brain Injury) databases
• Grid testbed
– Vienna
– CERN
– Dublin
– Zagreb
– Cracow
• Goals in the first phases
– Verifying model accuracy
– Overhead of the service layers
Extending the Functionality
OLAM
Example: Mining Patterns for Data Classification and Associations
use database dat1, dat2
mine classifications
analyze patient_outcome
using g_parsimony
display as tree

use database DBs attributes
mine associations
using method_attributes
display as rules
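
To make the structure of such a mining query concrete, here is a small Python sketch that parses the classification example into a dictionary. The clause keywords are taken from the example above; the dictionary field names and the parsing rules are assumptions, since the slides do not define the language's grammar.

```python
from typing import Dict, List, Tuple, Union

QueryValue = Union[str, List[str]]

# Clause keyword -> field name in the parsed query (field names are hypothetical).
CLAUSES: List[Tuple[str, str]] = [
    ("use database", "databases"),
    ("mine", "task"),
    ("analyze", "target"),
    ("using", "method"),
    ("display as", "display"),
]


def parse_query(text: str) -> Dict[str, QueryValue]:
    """Parse one mining query, one clause per line, into a dictionary."""
    query: Dict[str, QueryValue] = {}
    for raw_line in text.strip().splitlines():
        line = raw_line.strip()
        for keyword, field in CLAUSES:
            if line.startswith(keyword + " "):
                value = line[len(keyword):].strip()
                if field == "databases":
                    # Several data sources may be listed, separated by commas.
                    query[field] = [name.strip() for name in value.split(",")]
                else:
                    query[field] = value
                break
        else:
            raise ValueError(f"unrecognized clause: {line!r}")
    return query


classification_query = parse_query(
    """
    use database dat1, dat2
    mine classifications
    analyze patient_outcome
    using g_parsimony
    display as tree
    """
)
```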
Workflow 1: Interactive Mode
Workflow 2: Batch Mode
Workflow 3: Hybrid Mode
Execution Model Based on Static Workflow
Execution Model Based on Dynamic Workflow
Towards the Wisdom Grid (WG)
WG Architecture
[Figure: Wisdom Grid architecture. Labels inside the Wisdom Grid: Agent Grid Service, Knowledge Base Service (with KB), Knowledge Discovery Service, Grid. Surrounding components: Agent Platform, Domain Knowledge Agents, Knowledge Explorer Agent, End User (personal) Agent, External Knowledge Base, External Services.]
Work-Flow
[Figure: workflow among End User Agent, External Agents, Agent Service, Knowledge Agent, Knowledge Explorer Agent, Knowledge Base service, Knowledge Base, Knowledge discovery service, and further services.]
Knowledge Discovery Service
Client for other services
Knowledge Discovery in Databases
GridMiner
data mining
on-line analytical processing (OLAP)
Web Mining
semantic web
Online libraries
Web/Grid Services
Knowledge Explorer Agent
Knowledge Base Service / KB
KBS: search, query, and expand the Knowledge Base
KB: a database that stores particular data about real
objects, the relations between these objects, and their
properties
Consists of ontologies and instances
Information about resources (location, query language)
on the Web
Web/Grid services, agents
references to the online database
Languages
XML/RDF/DAML-OIL/DAML-S/OWL
Ontology - example
Patient is Human; Human has Age
DAML-OIL Language:
<daml:Class rdf:ID="Human">
  <rdfs:subClassOf>
    <daml:Restriction daml:cardinality="1">
      <daml:onProperty rdf:resource="#Age"/>
    </daml:Restriction>
  </rdfs:subClassOf>
</daml:Class>
<daml:DatatypeProperty rdf:ID="Age">
  <rdfs:domain rdf:resource="#Human"/>
</daml:DatatypeProperty>
<daml:Class rdf:ID="Patient">
  <rdfs:subClassOf rdf:resource="#Human"/>
</daml:Class>
Knowledge Base - example
[Figure: knowledge base graph. Nodes: Human, Patient, Temperature, Value, Attribute (attribute:PAT_ID), Tables (table:PATIENTS), Database (jdbc://foo/hospital). Edges are labelled "is" and "has".]
Semantic mediator
• Distributed heterogeneous databases
– Different database schemas
– Different query languages
– Different names of attributes/tables…
but the same semantics!
• WG enables semantic mediation at a higher level
Semantic mediator (cont.)
[Figure: one ontology mapped onto two hospital databases.]
Ontology: Patient is Human; has Age, Blood Type
Database in Hospital X: table PAT_TAB with columns ID, AGE, BT, ...
Database in Hospital Z: table PATIENTS with columns PAT_ID, PAT_AGE, PAT_BLOOD_TYPE, ...
samePropertyAs: AGE and PAT_AGE (Age); BT and PAT_BLOOD_TYPE (Blood Type)
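
A minimal Python sketch of the mediation idea: a query phrased in ontology terms (Age, Blood Type) is translated into the column names each hospital database actually uses. The table and column names come from the slide; the source keys `hospital_x` and `hospital_z`, the `MAPPINGS` structure, and the generated SQL form are assumptions made for illustration.

```python
from typing import Dict, List

# Ontology property -> physical column, per data source.
# Table and column names follow the slide; the dictionary layout is hypothetical.
MAPPINGS: Dict[str, Dict[str, object]] = {
    "hospital_x": {
        "table": "PAT_TAB",
        "columns": {"Age": "AGE", "BloodType": "BT"},
    },
    "hospital_z": {
        "table": "PATIENTS",
        "columns": {"Age": "PAT_AGE", "BloodType": "PAT_BLOOD_TYPE"},
    },
}


def translate(properties: List[str], source: str) -> str:
    """Rewrite a list of ontology properties as a SELECT for one database."""
    mapping = MAPPINGS[source]
    columns: Dict[str, str] = mapping["columns"]  # type: ignore[assignment]
    select_list = ", ".join(f"{columns[p]} AS {p}" for p in properties)
    return f"SELECT {select_list} FROM {mapping['table']}"


query_x = translate(["Age", "BloodType"], "hospital_x")
query_z = translate(["Age", "BloodType"], "hospital_z")
```

Both generated queries return columns named `Age` and `BloodType`, so their results can be combined even though the underlying schemas differ.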
Distributed Knowledge base
[Figure: classes and properties spread over several knowledge bases. Nodes: uri:fooY#Human, uri:fooZ#Temperature, uri:fooX#Patient, uri:fooX#Ill_Person. Edge labels: "is subclass", "has property", "is same class as".]
Agent Grid Service
Gives the system the ability to communicate with the
outside world in standard languages
FIPA Standards
ACL – Agent Communication Language
KQML – Knowledge Query and Manipulation Language
Agent Platform (JADE, FIPA-OS)
Agents
Domain Knowledge Agent
Knowledge Explorer Agent
End-user Agent (personal)
Querying
End-user agent
with own ontology – subset of ontology
Merging of ontologies
without own ontology
Negotiating about domain of interest
Queries created from ontology
Templates
<Patient rdf:ID="ID001">
<Temperature/>
</Patient>
Answers
• Mined Knowledge (GridMiner)
– Decision trees / rules
» (clinical pathways)
– Association rules
• Instances of domain ontology
– Particular data
– References
– Links to Web sites
– Information about other knowledge providers
Case Study - Medical Application
[Figure: medical case study. Query (Q): Outcome? + data about patient's condition. Answer (A): probability of survival + references to the diagnoses. Components: End User (personal) Agent, Knowledge Agent, Knowledge Explorer Agent, Semantic Web/Grid, Knowledge Base, Knowledge Discovery Service, GridMiner resources, Hospital Databases, training set, test set.]
Conclusions and Future Work
• Application and extension of the Grid technology to
knowledge discovery – an important, but nontraditional Grid application domain
• Introduction of a new Grid Data Mediation Service
• Future work
– Performance evaluation on large synthetic data volumes
– Coupling of the Data Mining services architecture with the OLAP
services architecture
– Development of a knowledge discovery oriented Grid Workflow
Language and the appropriate Workflow Engine
– Application of GridMiner to a real medical application (management
of patients with severe traumatic brain injuries)
– Development of the Wisdom Grid