The NERC DataGrid

Bryan Lawrence, BADC
David Boyd, Deputy Director, CLRC e-Science Centre
Kerstin Kleese, DL: Climate Database Expert
Roy Lowry, BODC: Marine Database Expert
Dean Williams, PCMDI: ESG Principal Investigator
Bob Drach, PCMDI: ESG Metadata Architecture
Mike Fiorino, PCMDI: Meteorologist

Acronym Summary
PCMDI: Program for Climate Model Diagnosis and Intercomparison (US Department of Energy, Lawrence Livermore National Laboratory)
ESG: Earth System Grid (US Grid Project: NCAR, Argonne, PCMDI, USC …)

Outline
• Motivation
• The Earth System Grid
  – definitions of "portals" and applications
  – ontologies
• Relations with other NERC e-science programmes
• Architecture
  – querying
  – software stack
• Initial steps and Project Management
• Connectivity with other grid projects
• Success and Failure
• Summary of what we are doing and the road to the future

The BADC – part of NCAS!
The Role: Key words: Curation and Facilitation!
http://www.badc.rl.ac.uk

Just under half of BADC users are NOT atmospheric scientists
[Chart: BADC users by discipline. Categories: Earth Observation, Earth Science, Engineering, Geography, Marine Sciences, Mathematics, Biological/Medical, Terrestrial/Fresh Water. Values shown: 160, 126, 42, 132, 56, 104, 132, 152; the pairing of values with categories is not recoverable.]

Motivation – Town meeting 2001
E-science should be involved with:
• delivering an enhanced metadata record of archived data.
• 'dictionary' building.
• building systems to translate data and link databases.
• integrating computer and natural science communities.
• the ability to generate a single query across multiple datasets (in different catalogues) returning both metadata and data.
• the ability to acquire large datasets in near real time (NRT).
• the automatic production of metadata, both by models, and where possible, by observing systems.
Summary from two of the four working groups!

Relevant to many stakeholders
Energy, Water Management, Food Chain, Health, Weather Risk
(Slide from Julia Slingo's introduction to CGAM as part of NCAS)

Motivation
Page 22: NERC will … ensure that Earth system science is underpinned by e-science investments to enable access, manipulation … of data from diverse sources.

The Data Use Chain
Time-line: Discovery, Authentication, Authorisation, Extraction, SubSampling, Regridding, Formatting, Processing, Delivery, Display

NERC Metadata Gateway - SST
• Geospatial coordinates forgotten. Time reference forgotten. Need to get entire field(s), and find correct time!
• And if I want to compare data from different locations?
  - multiple logins
  - multiple formats
  - discovery?

Searching: need comprehensive metadata!
A priori would any user know to look in the COAPEC data set? Earth system science means we have to remove these boundaries!
• Detailed file-level metadata isn't visible, so data-mining applications are impossible.
  - need ontologies to help queries match actual data descriptions.
NB: Dynamic catalogues!

What is an Ontology?
An ontology defines the terms used to describe and represent an area of knowledge by specifying the following kinds of concepts:
• Classes (general things) in the many domains of interest
• The relationships that can exist among things
• The properties (or attributes) those things may have
Ontologies are usually expressed in a logic-based language, so that detailed, accurate, consistent, sound, and meaningful distinctions can be made among the classes, properties, and relations.

Ontology Example
An example of part of an ontology defined using OIL (e.g. see OIL in a Nutshell, D. Fensel et al.)
(Slide annotations: Relationships, Classes, Properties)

    ontology-definitions
      slot-def eats
        inverse is-eaten-by
      slot-def has-part
        inverse is-part-of
        properties transitive
      class-def animal
      class-def plant
        subclass-of NOT animal
      class-def tree
        subclass-of plant
      class-def branch
        slot-constraint is-part-of
          has-value tree
      class-def leaf
        slot-constraint is-part-of
          has-value branch
      class-def defined carnivore
        subclass-of animal
        slot-constraint eats
          value-type animal
      class-def defined herbivore
        subclass-of animal
        slot-constraint eats
          value-type plant OR (slot-constraint is-part-of has-value plant)
      class-def giraffe
        subclass-of animal
        slot-constraint eats
          value-type leaf
      class-def lion
        subclass-of animal
        slot-constraint eats
          value-type herbivore

With current funding, the NDG does not aim to build a formal ontology, but we do aim to begin to build a thesaurus that can form the basis of one, and we do hope to spin off a project to build one and integrate it in the NDG.
(OIL: Ontology Inference Layer)

ESG: Example of a Web-based Data Portal
ESG will provide support for:
• large but simple data sets,
• limited metadata, but not searchable.
NDG will provide support for:
• small-but-complex datasets.
• data-mining (searchable metadata).
NDG is complementary to ESG!

Live Access Server (1)
… we will keep the basic structure, but gradually replace components.

Live Access Server (2)
Data Request Structure:

ESG: Example of a Client Application
We will:
• Provide Python-based classes for our observational data to complement the access to 3D gridded data.
• Provide a web services wrapper so that other grid applications can access NDG data.
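
As an aside on the OIL ontology example earlier in this section, the following is a minimal Python sketch of how its three ingredients (subclass relations, the transitive is-part-of slot, and the defined class herbivore) fit together. It is not an OIL parser or reasoner and is not part of the NDG: the dictionaries and function names are hypothetical, and the value-type constraint on eats is simplified to a single class per animal.

```python
# Hypothetical, simplified mirror of the OIL fragment; not an OIL reasoner.

# subclass-of relations taken from the slide
SUBCLASS_OF = {
    "tree": "plant",
    "giraffe": "animal",
    "lion": "animal",
    "herbivore": "animal",
    "carnivore": "animal",
}

# has-value constraints on the slot is-part-of
IS_PART_OF = {
    "leaf": "branch",
    "branch": "tree",
}

# value-type constraints on the slot eats (simplified to one class each)
EATS = {
    "giraffe": "leaf",
    "lion": "herbivore",
}


def ancestors(cls):
    """Follow subclass-of links upward, e.g. tree -> plant."""
    found = []
    while cls in SUBCLASS_OF:
        cls = SUBCLASS_OF[cls]
        found.append(cls)
    return found


def wholes(cls):
    """Follow is-part-of transitively, e.g. leaf -> branch -> tree."""
    found = []
    while cls in IS_PART_OF:
        cls = IS_PART_OF[cls]
        found.append(cls)
    return found


def is_plant_or_part_of_plant(cls):
    """True if cls is a plant, or is (transitively) part of a plant."""
    candidates = [cls] + ancestors(cls)
    for whole in wholes(cls):
        candidates.append(whole)
        candidates.extend(ancestors(whole))
    return "plant" in candidates


def is_herbivore(cls):
    """Defined class: an animal whose food is a plant or part of a plant."""
    return (
        "animal" in ancestors(cls)
        and cls in EATS
        and is_plant_or_part_of_plant(EATS[cls])
    )


def is_carnivore(cls):
    """Defined class: an animal whose food is itself an animal."""
    food = EATS.get(cls)
    return (
        "animal" in ancestors(cls)
        and food is not None
        and "animal" in [food] + ancestors(food)
    )
```

In this sketch the giraffe is classified as a herbivore only because is-part-of is transitive: a leaf is part of a branch, which is part of a tree, which is a plant. This is the kind of inference that a thesaurus alone cannot make and a formal ontology can.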

Applications and Portals
[Diagram. Labelled elements: Data Sources (Satellite, Supercomputer, Research Group); tape robot; Online Data and XML database at BADC, at BODC and at a research group, each with an NDG Wrapper; Internet Link; NERC Grid; Wider Internet; NDG Web Portal; Internet User; Grid User; Software Agent; ESG (& other) Applications.]

Relationship to GODIVA (Haines et al.)
(Grid for Ocean Diagnostics, Interactive Visualisation and Analysis)
Architecture of the GODIVA Grid:
NDG will:
• improve data discovery tools for GODIVA (even for their own datasets).
• provide metadata creation tools for GODIVA participants.
• provide access to data held outside GODIVA participants.
The GODIVA team have already discovered issues with the XML database interface they are going to use.

ClimatePrediction.com
CP.COM will need the NDG to make best use of observational data in evaluating their parameter space.
[Diagram. Labelled elements: Scientific investigators; Participants & policy-makers; Summary statistics; Obs; Datamining; Live Access Server; ESG-II/NERC DataGrid; Peer-to-peer visualisation; 100Tb of key output at 10-20 sites; 1Pb total output on 1M participants' PCs; links labelled HTTP, HTTP (DODS URL), Conventional FTP/HTTP and GridFTP.]

Mining on the Grid
[Diagram. Labelled elements: Satellite Data Archive X; Satellite Data Archive Y; Grid Mining Agents; Grid Processors.]
From Hinke's NASA IPG presentation at CEOS, Rome, May 2002

Data mining: Grid Miner Architecture
[Diagram. Labelled elements: Satellite Data Archive X; Satellite Data Archive Y; IPG Mining Agents; IPG Processors; Mining Operations Repository; Mining Config Info; Mining Daemon; Control Database.]
The devil is in the detail: how does the data mining agent get at the data?
Need data mining clients – objects which can read specific datatypes and present themselves to agents!
From Hinke's NASA IPG presentation at CEOS, Rome, May 2002

Finding data: Querying!
• Requires databases of metadata & querying those databases.
• Each part of the NDG will have an internal metadata catalogue (&/or database), and data (either in flat files or the database).
  – so the querying strategy must support centralised querying on partially indexed data, followed (if necessary) by distributed querying, which may or may not need mapping into a local database schema.
  – In the grid environment the indexes themselves will be replicated, and some data may also be replicated.
• Major NDG design issue: developing appropriate data models, database schema and indexing strategies!
  – This is not a generic problem, it will be specific to our datatypes.
  – Technology needs to be public domain (i.e. free) for uptake!
  – NDG approach to database technology will be developed in conjunction with DBTF.

Query Pathway; software components
[Flow diagram, BNL V1.01 - 12/01. Layers: Application Level; NERC DataGrid Interfaces (NERC, international, generic); Existing and Required Grid Middleware; Existing Data and Services; New Data Interfaces and Services. Paths: Data Extraction Path for Known Datasets; Discovery and Extraction Path; Data Path into Archive. Steps and decisions: User Query; Generate Expansion Query (e.g. time and space); Query Distributor (Check Authentication); Query Distributor (Check Authorisation against "Locating"); Parallel Queries; Query Handler with "Dataset" Catalogue Search (Check Authorisation against "Looking"); Query Handler with Granule Catalogue Search, Return Satisfactory Granule Metadata; Collate Multiple Returns; Reformat Metadata; Response: DataSet Metadata; User Assessment (inadequate, Potentially Interesting); Data Exists? (Yes, No); Continue to Extraction?; Define Requirements for SubSampling and Reformatting; Check Authorisation for "Extraction" (OK for extraction, Not OK); Network Path and Cache Identification; Extract Data File; Sub-Sample and Reformat; Deliver Data to Processor(s) (and cache); User Processing, Display and/or Visualisation; Exit or return to previous step at this level; New Model and Data Ingestion and Metadata Creation Interfaces; Data and Metadata Archives. Legend: Information Structure; Joint Interfaces; PCMDI Components; NDG Components; Existing Components.]

Simplified Software Stack
Key point: make use of existing technology, allow component replacement with time!
Achievable by: interface definition and integration.
Note: Any application will be able to access our data services via the OGSA wrapper in the middleware.

Software stack

NDG: Ingestion Tasks

Draft Project Schedule
Phase One Delivery

Metadata Gateway
Replace with Globus Giggle?
Next steps include:
• Replacing the transport layers in the metadata gateway with SOAP
• Replacing the SGML in the metadata gateway with XML
…etc

Connectivity? Innovation? Evolution!
[Diagram centred on the NERC DataGrid. Labelled elements: ClimatePrediction.com; GODIVA; BADC; BODC; NEODC; ESG II; UK DataBase Task Force; QinetiQ; CEOS; BNSC; CLRC e-science Data Portal; PARADISE; Other Programmes; Other DDC - CEH; U.S. Thredds/NOMADS; EU DataGrid WP9; Ontologies - NeSC - MyGrid; Digital Libraries (Zoom); ? Future ?]
Plagiarism: Copying from one person
Research: Copying from many people
… we can't afford to be too innovative!

Indicators of Success
Finding and making use of data:
• Possible to find, reformat, and visualise disparate datasets from disparate organisations within one application.
• No longer necessary to rely on personal contacts to locate and acquire data of interest if it's held in the BADC/BODC.
• Key requirement for interdisciplinarity: the ability to test data comparison ideas without learning foreign formats and establishing personal relationships every time.
• Other NERC designated data centres implementing NDG.
Take up by community:
• NDG software (but not necessarily graphics tools) in use in GODIVA project and in wider UK university community (including data repositories in research groups).
• Earth System Grid uses NDG components.

Risks Of Failure
• Someone else does it first – unlikely!
• Performance too slow for users!
  – More cache and replication
  – Improve database performance (UK DBTF!)
  – Data-compression layer for XML
  – Reduce scope and search depth (don't want to do this!)
• Globus 3 (OGSA) delivery heavily delayed
  – Web services implementation + Globus2 + datagrid service registry
• Availability of people with appropriate skills
  – re-deploy existing staff where possible
  – Schedule begins with three months training.
• ESG-II architecture delayed or incompatible with UK architecture
  – Close relationship with PCMDI means we will be able to proceed effectively anyway.

NDG expected evolution
[Diagram with numbered stages 1 to 6. Labelled elements: Data Repositories (Satellite; NERC DDC; Other: e.g. PML/ESSC); At USER Institution; Computation; Data File; Catalogue Ingestor; Local Catalogue; Catalogue Client; Python API; XML Catalogue Server; Based on LAS; Docs; Graphics; Evolving to OGSA.]

Beyond the next three years: The NDG and earth systems science
Extension to the other NERC data centres requires:
– online (or near-line) data.
– appropriate ingestion tools, appropriate mappings between discipline-specific metadata and generic metadata.
– GRID enabling data centres.
– Decisions about policy and access.

The NERC DataGrid
Bryan Lawrence, BADC
David Boyd, CLRC E-science
Kerstin Kleese, CLRC E-science
Roy Lowry, BODC
Dean Williams, PCMDI
Bob Drach, PCMDI
Mike Fiorino, PCMDI

Project Management
• Weekly workgroup meetings (teleconference and physical).
• Milestoning code and documentation reviews at quarterly intervals.
• Quarterly liaison with both US colleagues and other NERC projects (GODIVA, ClimatePrediction.com etc).
• Bi-annual target reprofiling.
• Professional project management at the code level:
  – Both RAL SSTD and RAL e-Science have considerable experience managing and delivering large software projects.
• Two key tenets of management philosophy:
  – Build early, build often.
  – Evolve from a working system.

The NDG: What will we do?
Key components: BADC/BODC
• Project Management.
• Ingestion tools for station data, Oracle database data, and other (e.g. PP - includes tools based on ESML and Marine XML).
• Format conversion tools within CDAT.
• Ingestion! Migrate NERC Metadata gateway to WSDL/SOAP (Zoom?).
Key components: CLRC e-science
• Globus installation at all sites.
• Functional decomposition and interface definitions.
• Search database schema; search software Python API, wrappers.
• Database population.
• Logical to Physical File Manager.
• Amalgamating search API into
  – LAS (or successor), VCDAT, metadata gateway.
• Add data retrieval interfaces into metadata gateway.
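
The querying strategy described under "Finding data: Querying!" (a centralised query on a partial index, followed if necessary by distributed queries mapped into each local schema) could look roughly like the sketch below. This is an illustration under stated assumptions, not the NDG search API: the class DataCentre, the function find_datasets, the field names and the dataset identifiers are all invented for the example.

```python
from dataclasses import dataclass, field


@dataclass
class DataCentre:
    """Hypothetical stand-in for one NDG node holding its own local catalogue."""
    name: str
    field_map: dict  # generic metadata field -> this centre's local field name
    records: list = field(default_factory=list)

    def local_query(self, generic_field, value):
        """Map the generic field into the local schema, then search locally."""
        local_field = self.field_map.get(generic_field, generic_field)
        return [r["id"] for r in self.records if r.get(local_field) == value]


def find_datasets(central_index, centres, generic_field, value):
    """Return (centre name, dataset id) pairs matching generic_field == value."""
    # Stage 1: the central index is only partial, so a miss is not a "no".
    hits = central_index.get((generic_field, value))
    if hits is not None:
        return list(hits)
    # Stage 2: fall back to querying every centre's own catalogue.
    return [
        (centre.name, dataset_id)
        for centre in centres
        for dataset_id in centre.local_query(generic_field, value)
    ]


# Illustrative data only: identifiers and field names are invented.
badc = DataCentre("BADC", {"parameter": "param"},
                  [{"id": "badc-001", "param": "sea_surface_temperature"}])
bodc = DataCentre("BODC", {"parameter": "variable"},
                  [{"id": "bodc-042", "variable": "sea_surface_temperature"},
                   {"id": "bodc-043", "variable": "salinity"}])
central_index = {("parameter", "salinity"): [("BODC", "bodc-043")]}

# Answered from the central index without contacting any centre.
indexed = find_datasets(central_index, [badc, bodc], "parameter", "salinity")
# Not in the central index, so each centre is queried in its own schema.
distributed = find_datasets(central_index, [badc, bodc],
                            "parameter", "sea_surface_temperature")
```

The sketch queries the centres one after another for brevity; the query pathway in the slides shows parallel queries with authentication and authorisation checks at each stage, and replicated indexes, none of which are modelled here.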