Discovery Net: A UK e-Science Pilot Project for Grid-based Knowledge Discovery Services

Patrick Wendel
Imperial College, London

Data Mining and Exploration Middleware for Distributed and Grid Computing, September 18-19, 2003

Why Discovery Net?

Data Challenge:
• Distributed, heterogeneous & large-scale data sets
• Novel and real-time data sources

Resource Challenge:
• Novel specialised data analysis components/services continually being published/made available
• Computational resources provided

Information Challenge:
• Data cleaning, normalisation & calibration
• New data needs to be related to existing data

Knowledge Challenge:
• Collaborative, interactive & people-intensive
• Result interpretation & validation in relation to existing knowledge
• Knowledge sharing is key

What is Discovery Net

Goal: Construct an infrastructure for global-wide knowledge discovery services

Key Technologies:
• Grid and Distributed Computing
• Workflow and service composition
• Data Mining & Visualisation
• Data Access & Information Structuring
• High Throughput Screening Devices: real-time

Discovery Net: Unifying the World’s Knowledge

• Data Integration: Dynamic Real Time Construction of “Data Grids”
• Application Integration: Component and Service-based Integration
• People Integration: Global-wide Discovery Groupware
• Knowledge Integration: Multi-subjects and Multi-modality Integrative Analysis to Cross Validate and Annotate Related Discovery Work

What is Discovery Net

[Figure: scientific information (literature, databases, operational data, images, instrument data) feeding scientific discovery through real-time integration, dynamic application integration, workflow construction and interactive visual analysis, using distributed resources]

Discovery Net Layer Model (Life Science Application)

• D-Net Clients: End-user applications and user interface allowing scientists to construct and drive knowledge discovery activities
• D-Net Middleware: Provides execution logic for distributed knowledge discovery and access to distributed resources
• Computation & Data Resources: Distributed databases, compute servers and scientific devices

Interface labels shown between the layers:
• Deployment: Web/Grid Services, OGSA
• High Performance and Grid-enabled Transfer Protocol (GSI-FTP, DSTP..)
• Grid-enabled Infrastructure (GSI)

A Knowledge Grid based on D-Net Servers

Goal: Plug & Play Data Sources, Analysis Components and Knowledge Discovery Processes

[Figure: DNet servers (DNet API, deployment, computation components, data access & storage, InfoGrid, knowledge discovery services) connected over the Internet via XML/DPML to DNet clients, DNet participating clients and web clients, and to data sources (WWW, RDBMS) and computational services]

• Several types of clients for different usage (from thin web client to participating client)
• Current implementation based on Java distributed objects (EJB), moving towards Web/Grid services
• But deployment and API access are through standard Web/Grid services

Discovery Process Management

• Workflow-based service composition
• Data-flow approach fits the Knowledge Discovery process
• Allows scientists to develop processes
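
To make the data-flow approach concrete, here is a minimal sketch of a workflow as a graph of components whose outputs feed downstream components. It is written in Python for brevity and is not Discovery Net code: the `Workflow` class, its `add` and `run` methods and the three toy components are all hypothetical names introduced for illustration. It only shows the general idea that each component runs once all of its inputs are available.

```python
from graphlib import TopologicalSorter
from typing import Callable, Dict, List, Sequence


class Workflow:
    """Hypothetical data-flow graph: nodes are components, edges carry data."""

    def __init__(self) -> None:
        self.components: Dict[str, Callable[..., object]] = {}
        self.inputs: Dict[str, List[str]] = {}

    def add(self, name: str, func: Callable[..., object],
            inputs: Sequence[str] = ()) -> None:
        # Register a component together with the upstream components feeding it.
        self.components[name] = func
        self.inputs[name] = list(inputs)

    def run(self) -> Dict[str, object]:
        # Visit components in dependency order, so every component sees
        # the results of its inputs before it executes.
        results: Dict[str, object] = {}
        for name in TopologicalSorter(self.inputs).static_order():
            args = [results[dep] for dep in self.inputs[name]]
            results[name] = self.components[name](*args)
        return results


# Toy three-step process: load data, normalise it, then rank it.
wf = Workflow()
wf.add("load", lambda: [3.0, 1.0, 2.0])
wf.add("normalise", lambda xs: [x / max(xs) for x in xs], inputs=["load"])
wf.add("rank", lambda xs: sorted(xs), inputs=["normalise"])
results = wf.run()
```

In this sketch the graph is kept separate from the execution loop, which is what would allow the same process description to be stored, inspected or executed elsewhere.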

Towards a Standard Workflow Representation for Discovery Informatics: Discovery Process Markup Language (DPML)

• Contains component data-flow graphs, but also
• Records collaboration information (user, changes)
• Records execution constraints (location, parameterisation)
• Becomes a key intellectual property: Discovery Processes can be stored, reused, audited, refined and deployed in various forms

[Figure: D-Net Workflow for Genome Annotation: 16 services executing across the Internet]

InfoGrid: Dynamic Data Integration

Dynamic Data Integration = On-demand access to heterogeneous data sources + information structuring

Towards a Dynamic Information Integration Methodology:
• Specialised Information Source Access: InfoGrid allows users to register, locate and connect to various specialised information sources.
• On-the-fly Integration: InfoGrid allows users to build their own integration structure on the fly (worst case: proprietary protocol/format; best case: JDBC/HTTP-XML-XPath/Web Service).
• Easy Maintenance: Wrappers/Drivers to new data sources can be added through a clean API.

[Figure: specialised information sources feeding integrative analysis, grouped around clinical, biological, chemistry, protein and gene data; labels include trials, journals, patents, reports, screening protocols, toxicology, metabolic pathways, structures, sequences, targets, libraries, catalogues, expression, location and function]

Dynamic Application Integration Services

Dynamic Application Integration = On-demand access and composition of remote analysis components

Towards a Dynamic Component Integration:

Example components shown on the slide: clustering, classification, regression, gene function prediction.

• Component service: allows users to register, locate and remotely execute D-NET API components (Java component interface or Web Service port type).
• Execution service: allows users to control the execution of components in distributed environments.
• Easy Maintenance: New components can be added through a clean API.

Further example components shown on the slide: promoter prediction, homology search.

Discovery Deployment

Discovery Deployment = On-demand rapid application construction and publishing

Towards a Dynamic Deployment of Knowledge Discovery Procedures:
• Deployment Engine: allows users to build and publish applications based on DPML code coordinating remotely executed components, as a Web page, Web/Grid Service or command-line tool.
• Easy Maintenance: New discovery procedures are described in DPML, a standardised representation of “composed” discovery procedures.
• Storage & Reporting Servers: allow users to share DPML procedures and to generate workflow audit reports.

[Figure: a discovery process in DPML deployed as a discovery component, report, discovery service or batch processing job]

Knowledge Integration & Interpretation

Dynamic Knowledge Interpretation = cross-reference and verify analysis results against background knowledge

Towards a Knowledge Integration Framework:
• Multi-subject data analysis (figure labels: text mining, genetic analysis, sequence analysis, pathway analysis)
• Specialised Client Interfaces: Interactive analysis and dynamic component interaction
• Result Annotation, Structuring and Storage: Information source query, result browsing, sharing and markup

[Figure: life science example application]

Workflow execution

Component execution location resolution:
• User list of known resources
• A component can explicitly require execution on a particular resource
• A component can choose from a set of proposed resources (and could use Grid resource information systems and network weather information to determine where to go)
• For unconstrained components, a simple “near the data” execution policy applies:
  • If there is a single input data location, execute there
  • Otherwise, fall back to the original execution location
• Allows usual DPKD workflows to be designed
• Handles data management and transfer (serialisation, Java based, FTP based)

Discovery Net and Grid technologies

Cluster/Campus Grid level:
• Partial or complete workflow execution on Condor / SGE
• Task farming on a subset of the workflow

Global Grid:
• GSI integration (Java CoG Kit)
• GSI-FTP transfer functionality (Java CoG Kit)
• OGSA Grid Service access to functionalities (GT3)
• Potential use of GRIS or NWS in component implementation
• Globus scheduler? Unicore? SRB?

Discovery Net Application Testbeds

Life Science Testbed:
• Gene sequencing, Protein Chips
• High Throughput real-time genome annotation testbed: analyse and interpret new sequences using existing distributed bioinformatics tools and databases

Environmental Modelling:
• Pollution Sensors (GUSTO): SO2, Benzene, ..
• High Throughput real-time pollution monitoring testbed: analyse and interpret time-resolved correlations among remote stations, and with other environmental data sets

[Figure: GUSTO units with wireless connectivity]

Geo-hazard Prediction:
• Multi-spectral, multi-temporal satellite imagery
• Real-time geo-hazard prediction testbed: analyse and interpret satellite images with other data sets to generate thematic knowledge

Case Study: SC2002 HPC Challenge

D-Net based Global Collaborative Real-Time Genome Annotation

[Figure: DNA from high-throughput sequencers passes through three annotation stages. Labels recovered from the slide:
• Nucleotide-level Annotation: identify organism, chromosomes, genes, gene markers, regulatory elements, tRNAs, rRNAs, non-translated RNAs, repetitive regions, segmental duplication, SNP variations; tools and sources: EMBL, NCBI, TIGR, SNP, blast, genscan, grail, RepeatMasker, E-PCR; literature references
• Protein-level Annotation: identify proteins, functional characterisation, domain, fold prediction, classify into protein families, homologues, 3-D structure, secondary structure; tools and sources: InterPro, SWISS-PROT, SMART, blast, 3D-PSSM, PFAM, motif search, predator, DSC; literature references
• Process-level Annotation: relate to pathways, biological processes (cell cycle, cell death, metabolism, embryogenesis), drugs, ontologies; tools and sources: GO, CSNDB, KEGG, GK, GeneMaps, GenNav, virtual chip, …..]
[Figure labels: references, maps, AmiGO; 15 DBs, 21 applications]

How It Works

• Interactive Editor & Visualisation
• Nucleotide Annotation Workflows
• Download sequence from Reference Server
• Save to Distributed Annotation Server
• Execute distributed annotation workflow

[Figure labels: InterPro, SMART, EMBL, KEGG, SWISS-PROT, NCBI, TIGR, SNP, GO]

1800 clicks, 500 Web accesses, 200 copy/pastes and 3 weeks of work become 1 workflow and a few seconds of execution.

Conclusion and Future works

• Towards an open integration platform that enables scientists to conduct their KD activities
• Several levels of integration required
• Enable use of available resources
• Evolution towards cost model integration (performance, value, QoS)
• Semantic-based service retrieval and composition
• Other useful standards? (OGSA-DAI?)
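
As an illustration of the component execution location rules from the workflow execution slide, the following minimal Python sketch resolves where a component runs. It is an assumption-laden sketch rather than the Discovery Net implementation (which the slides describe as Java-based): the `Component` dataclass, the `resolve_location` function and the resource names are hypothetical, the check against the list of known resources is an added assumption, and the case where a component chooses among proposed resources is left out.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Component:
    """Hypothetical workflow component with optional placement constraints."""
    name: str
    required_resource: Optional[str] = None  # explicit constraint, if any
    input_locations: List[str] = field(default_factory=list)


def resolve_location(component: Component,
                     known_resources: List[str],
                     origin: str) -> str:
    # Rule 1: an explicit requirement always wins.
    # (Assumption: the required resource must be one the user has listed.)
    if component.required_resource is not None:
        if component.required_resource not in known_resources:
            raise ValueError(f"unknown resource: {component.required_resource}")
        return component.required_resource

    # Rule 2: "near the data". If all inputs live in one place, run there.
    distinct = set(component.input_locations)
    if len(distinct) == 1:
        return distinct.pop()

    # Rule 3: otherwise fall back to the original execution location.
    return origin


resources = ["hostA", "hostB", "cluster1"]
single = Component("blast", input_locations=["hostA"])
mixed = Component("merge", input_locations=["hostA", "hostB"])
pinned = Component("fold", required_resource="cluster1",
                   input_locations=["hostA"])

placements = [resolve_location(c, resources, origin="client")
              for c in (single, mixed, pinned)]
```

Here the single-input component runs next to its data, the component with inputs in two places stays at the original location, and the pinned component goes to its required resource regardless of where its data is.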