Knowledge Discovery in Grid Datasets – Goals, Design Concepts and the Architecture
Peter Brezany
University of Vienna

Collecting Data
[Figure: data sources feeding data repositories and analysis. Labels: Laboratories; Satellites; Business; Experiments (high energy physics, ...); (microscopes, MRI/CT scanners, ...); Computer simulations; Data Repositories; Analysis]

Motivation
• Computational Grid – a new-generation infrastructure
• Challenge: advanced analysis of data managed by the Grid
• Typical data in modern Grid applications:
  – files, file collections, relational and XML DBs, virtual data, data objects
• The data is often large and geographically distributed, and its complexity is increasing; some applications require special security precautions.
• Our research aims:
  – Phase 1: Knowledge discovery Grid system (GridMiner)
  – Phase 2: Intelligent Grid system (WisdomGrid)

Outline
• Motivation
• Background and Related Work
• Basic Concepts and GridMiner Architecture
• Grid Data Integration System
• Data Mining Layer
• Implementation Issues and Experiments
• Future Research
• Conclusions

Background and Related Work
• Basic Grid development (Globus 1) – metacomputing
• Data Grid (Globus 2, DataGrid of CERN, etc.)
• Semantic Grid (myGrid)
• Open Grid Service Architecture (Globus 3, OGSA-DAIS)
• Parallel and Distributed Data Mining and Data Warehousing
• Knowledge Grid (GridMiner and work of others)
• Web Intelligence

GridMiner Requirements
• Open architecture
• Data distribution, complexity, heterogeneity, and large data size
• Applying different kinds of analysis strategies
• Compatibility with existing Grid infrastructure
• Openness to tools and algorithms
• Scalability
• Grid, network, and location transparency
• Security and data privacy
• OLAP support

GridMiner (Layered) Abstract Architecture
[Figure: layered architecture. Layers: User Interface; Knowledge Grid; Information Grid; Computational & Data Grid. Side labels: Data to Knowledge; Control]
Built on K. G. Jeffery's proposal

GridMiner Conceptual Architecture
[Figure; recoverable label: Job Control]

Service Architecture Based on OGSA-DAIS

Data Distribution Scenarios
1. Single data source
2. Federated data sources with different types of partitioning

Example
Vertical and horizontal distribution of the virtual data source

Mapping Schema

Grid Data Mediation Services

Architecture of a Data Mining System

Components of the Data Mining Layer
• GridMiner Service Factory
• GridMiner Service Registry
• GridMiner Data Mining Service
• GridMiner Preprocessing Service
• GridMiner Presentation Service
• GridMiner Orchestration Service

Centralized Data Mining
[Figure: service interaction diagram. Components: Client, GMSR, GMSF, GMPPF, GMPPS, GMDMF, GMDMS, GDSF, GDS 1, GDS 2, GDT, NSrc, DataSource. Numbered steps include: 1. browse; 2. create GMPPS; 3. create GDS; 5. use it; 6. create GMDMS; 7. create GDS; 8. perform; 9. use it; 10. evaluate Model. Other labels: factory GSHs; query SDEs; notifications; <read>; <write>]

Parallel and Distributed Data Mining
[Figure: service interaction diagram. Components: Client, GMSR, GMSF, NSrc, GMDMS 0, GMDMS 1 to GMDMS 4, each of the latter reading one of dat1 to dat4. Numbered steps: 1. browse; 2. create GMDMS; 3. to 6. create; 7. perform DataMining; 8. perform / control; 9. evaluate Model. Other labels: factory GSHs; query SDEs; notifications; SOAP / RMI / JXTA / MPI / etc.]
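
The parallel scenario above, in which several data mining services each read one partition (dat1 to dat4) and a coordinating service evaluates the result, can be illustrated with a small sketch. This is only one common pattern for distributed mining (workers compute local counts, a coordinator merges them); the slides do not state which algorithm GridMiner uses here. The function names, the sample records and the use of threads in place of Grid services are all assumptions made for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

# A record is (attribute value, class label); both fields are hypothetical.
Record = Tuple[str, str]


def local_counts(partition: List[Record]) -> Counter:
    """Worker step: count (attribute value, class label) pairs in one partition."""
    return Counter(partition)


def merge_counts(partials: List[Counter]) -> Counter:
    """Coordinator step: sum the partial counts from all workers."""
    total: Counter = Counter()
    for partial in partials:
        total.update(partial)
    return total


# Hypothetical partitions standing in for dat1..dat4.
dat1 = [("high", "poor"), ("low", "good")]
dat2 = [("low", "good")]
dat3 = [("high", "poor"), ("high", "good")]
dat4 = [("low", "good")]

# Threads stand in for the distributed services; each handles one partition.
with ThreadPoolExecutor(max_workers=4) as pool:
    partials = list(pool.map(local_counts, [dat1, dat2, dat3, dat4]))

global_counts = merge_counts(partials)
```

Only the small count tables travel to the coordinator, not the raw records, which is the usual motivation for this style of decomposition when the data is large or must stay at its site.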

GridMiner Orchestration Service
[Figure: service interaction diagram. Components: Client, GMSR, GMSF, NSrc, GMOrchS with Workflow Engine, GMPPS 1, GMPPS 2, GMDMS, GMPRS. Numbered steps: 1. browse; 2. create GMDMS; 3. execute Workflow; 4. create; 5. perform; 6. create; 7. perform; 8. create; 9. perform; 10. create; 11. perform. GridMiner Job Description parts: Header; Resource Declarations; Workflow (Activities). Other labels: query SDEs; GSHs; notifications; <read>; <write>]
Workflow Outline
  – Activity: use GMPPS for filling missing values and removing noise
  – Activity: use GMPPS for selection and preliminary aggregations
  – Activity: use GMDMS for generating a decision tree
  – Activity: use GMPRS for a graphical, interactive representation

GridMiner Job Specification Language

Implementation Prototype
• Implementation of the Mediation Service for horizontal data partitioning
• Implementation of Data Mining Services for decision tree construction as OGSA-conformant Grid services, based on the Globus Toolkit 3 release
• We use
  – Weka, a freely available Java-based data mining system (data preprocessing and data mining tasks); main-memory oriented
  – a home-grown Java implementation of the SPRINT algorithm (disk-oriented)

Experimental Environment
• Test data suites
  – synthetic data (generated by an extended version of the IBM Quest Synthetic Data Generation Code)
  – TBI (Traumatic Brain Injury) databases
• Grid testbed
  – Vienna
  – CERN
  – Dublin
  – Zagreb
  – Cracow
• Goals in the first phases
  – Verifying model accuracy
  – Overhead of the service layers

Extending the Functionality

OLAM

Example: Mining Patterns for Data Classification and Associations

    use database dat1, dat2
    mine classifications
    analyze patient_outcome
    using g_parsimony
    display as tree

    use database DBs attributes
    mine associations
    using method_attributes
    display as rules

Workflow 1: Interactive Mode

Workflow 2: Batch Mode

Workflow 3: Hybrid Mode

Execution Model Based on Static Workflow

Execution Model Based on Dynamic Workflow

Towards the Wisdom Grid (WG)

WG Architecture
[Figure labels: Domain Knowledge Agents; Knowledge Explorer Agent; Wisdom Grid Agent Platform; External Knowledge Base; External Services; Agent Grid Service; Knowledge Base Service; Knowledge Discovery Service; Grid; End User (personal) Agent; KB]

Work-Flow
[Figure labels: External Agents; End User Agent; Knowledge Base service; Knowledge Agent; Agent Service; Knowledge discovery service; Services ...; Knowledge Base; Knowledge Explorer Agent]

Knowledge Discovery Service
• Client for other services
• Knowledge Discovery in Databases
• GridMiner
• data mining
• on-line analytical processing (OLAP)
• Web Mining
• semantic web
• Online libraries
• Web/Grid Services
• Knowledge Explorer Agent

Knowledge Base Service / KB
• KBS – Search, Query, Expand Knowledge Base
• KB – Database that stores particular data about real objects and relations between these objects and their properties
• Consists of ontologies and instances
• Information about resources (location, query lang.) on the Web
  – web/grid services, agents
  – references to the online database
• Languages: XML/RDF/DAML-OIL/DAML-S/OWL

Ontology - example
Patient is Human; Human has Age
DAML-OIL Language:
    <daml:Class rdf:ID="Human">
      <rdfs:subClassOf>
        <daml:Restriction daml:cardinality="1">
          <daml:onProperty rdf:resource="#Age"/>
        </daml:Restriction>
      </rdfs:subClassOf>
    </daml:Class>
    <daml:DatatypeProperty rdf:ID="Age">
      <rdfs:domain rdf:resource="#Human"/>
    </daml:DatatypeProperty>
    <daml:Class rdf:ID="Patient">
      <rdfs:subClassOf rdf:resource="#Human"/>
    </daml:Class>

Knowledge Base - example
[Figure: graph linking ontology concepts to database resources. Nodes: Human; Temperature; Value; Patient; Attribute (attribute:PAT_ID); Tables (table:PATIENTS); Database (jdbc://foo/hospital). Edge labels: is; has]

Semantic mediator
• Distributed heterogeneous databases
  – Different database schemas
  – Different query languages
  – Different names of attributes/tables ... but the same semantics!
• WG enables semantic mediation at a higher level

Semantic mediator (cont.)
[Figure: ontology (Patient is Human; has Age; has Blood Type) mapped onto two databases. samePropertyAs links AGE with PAT_AGE and BT with PAT_BLOOD_TYPE]
Database in Hospital X, table PAT_TAB: ID | AGE | BT | ...
Database in Hospital Z, table PATIENTS: PAT_ID | PAT_AGE | PAT_BLOOD_TYPE | ...

Distributed Knowledge base
[Figure: classes and properties spread over several namespaces. Nodes: uri:fooY#Human; uri:fooZ#Temperature; uri:fooX#Patient; uri:fooX#Ill_Person. Edge labels: is subclass; has property; is same class as]

Agent Grid Service
• Supports the system with the ability to communicate with the outside world in standard languages
• FIPA Standards
• ACL – Agent Communication Language
• KQML – Knowledge Query and Manipulation Language
• Agent Platform (JADE, FIPA-OS)
• Agents
  – Domain Knowledge Agent
  – Knowledge Explorer Agent
  – End-user Agent (personal)

Querying
• End-user agent
  – with own ontology – subset of ontology; merging of ontologies
  – without own ontology – negotiating about domain of interest
• Queries created from ontology
  – Templates

    <Patient rdf:ID="ID001">
      <Temperature/>
    </Patient>
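
The semantic mediation idea from the Semantic mediator slides (different attribute and table names, same semantics) can be sketched as a rewrite from ontology property names to local column names. The table and column names below are taken from the Hospital X and Hospital Z example; the `Source` class, the `SOURCES` dictionary, the source keys and the generated SQL shape are assumptions for illustration and are not described in the slides.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Source:
    """One database: its table and a map from ontology property to local column."""
    table: str
    columns: Dict[str, str]


# Names follow the slide; the structure itself is a hypothetical layout.
SOURCES: Dict[str, Source] = {
    "hospital_x": Source("PAT_TAB", {"Age": "AGE", "BloodType": "BT"}),
    "hospital_z": Source(
        "PATIENTS", {"Age": "PAT_AGE", "BloodType": "PAT_BLOOD_TYPE"}
    ),
}


def rewrite_query(properties: List[str], source_name: str) -> str:
    """Translate a request for ontology properties into SQL for one source."""
    source = SOURCES[source_name]
    select_list = ", ".join(f"{source.columns[p]} AS {p}" for p in properties)
    return f"SELECT {select_list} FROM {source.table}"


def mediate(properties: List[str]) -> Dict[str, str]:
    """Produce one rewritten query per registered source."""
    return {name: rewrite_query(properties, name) for name in SOURCES}


queries = mediate(["Age", "BloodType"])
```

Aliasing every local column back to the ontology name means the results from both hospitals come back under the same column names, so they can be combined without further renaming.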

Answers
• Mined Knowledge (GridMiner)
  – Decision trees/rules (clinical pathways)
  – Association rules
• Instances of domain ontology
  – Particular data
  – References
  – Links to Web sites
  – Information about other knowledge providers

Case Study - Medical Application
[Figure labels: Semantic Web/Grid; Knowledge Explorer Agent; Knowledge Agent; Knowledge Discovery Service; GridMiner resources; Training set; Test set; Knowledge Base; End User (personal) Agent; Hospital Databases]
Q: Outcome? + data about patient's condition
A: probability of survival + references to the diagnoses

Conclusions and Future Work
• Application and extension of the Grid technology to knowledge discovery – an important but nontraditional Grid application domain
• Introduction of a new Grid Data Mediation Service
• Future work
  – Performance evaluation on large synthetic data volumes
  – Coupling of the Data Mining services architecture with the OLAP services architecture
  – Development of a knowledge-discovery-oriented Grid Workflow Language and the appropriate Workflow Engine
  – Application of GridMiner to a real medical application (management of patients with severe traumatic brain injuries)
  – Development of the Wisdom Grid