South African Research Data Infrastructure
Open Data Platform Architecture

Wim Hugo
CDIO, SAEON
Systems Architect, DIRISA
Vice-Chair, ICSU-WDS

Anwar Vahed
Manager, DIRISA
CSIR Meraka Institute

Context
• “Free and Open Access”
• Tax-Funded Data
• Reproducibility of Science

Governance: Stakeholder Groupings
Diagram labels: DST; Research Institutions; SARIR Programme(s); SARVA; BEA; DRDLR; SASDI Custodians; ICSU World Data System; African Networks; Shared Platform Stakeholders; DST; SA-GEO; SAEOSS; GEO-BON/ GEOSS; Communities of Practice; DoE (REDIS); DEA (NSIF, SANBI, CC M&E), O&C; NRF, HEIs and National Facilities; NEDICC; DIRISA; ASSAf; Consolidated Roadmaps

Emerging E&EO Research Data Infrastructure
Diagram labels: Services; ORCID, DataCite, …; Linked Open Data; Services/ Composites; APIs; Global Registries; Data Providers; Brokers and Harvesters; SARVA; BioEnergy Atlas; SAEON Data Portal; Community or Thematic Portals; External Systems; Earth and Environmental Sciences; GEOSS, ICSU WDS, …; Open Data Platform; Global Infrastructures; Shared Metadata; Data Hosting; Services; Components; Portals; Gateways; Guidance; Physical Infrastructure; SAEOSS; Gateways; DIRISA; DEA; SASDI; SARIR; Other Disciplines

Six DIRISA Architectures
• Requirements and Specifications – detailing the scope that needs to be addressed by DIRISA, as well as the typical solutions and specifications or standards that will apply;
• The RDI Landscape – summarising international best practice and precedents, and reviewing local infrastructure, initiatives, and status quo;
• Structuring a Data Alliance – dealing in more detail with governance, community participation, and capacity building requirements.
• Hardware Specifications and Standards
• Hardware and Networks
• Software Specifications and Standards
• Software and Systems
• Business and Governance
• Guidance, Capacity, and Soft Skills
• Accreditation

Generalised Scientific Data Infrastructure Use Case
“Predictable Assembly from Reliable Components”
Diagram labels: Curate; Mediate; Cite; Access/ Download; Data/ Services; “Bind”; “Publish”; Metadata; Analyse/ Visualise; Process; “Find”; Discover; Assess/ Rate

Technical Standards Support

Harvesters/ Discovery
• CS/W
• OAI-PMH
• REST Service API

Data Services
• OGC WxS, KML, GeoRSS, GeoJSON, GML
• NetCDF/ HDF5 (Multidimensional Data)
• Time Series/ Signal Data
• Media (Images, Video, Audio)
• Document Objects
• Tabular Data (CSV, Excel, …)

Metadata Standards
• ISO 19115/ p2
• SANS 1878
• EML
• FGDC
• Dublin Core
• DDI
• Darwin Core
• DataCite

Linked Open Data
• DataCite DOIs
• ORCID
• Digital Samples
• Vocabulary Registries

Options for Participation
Diagram labels: Meta-Data Management Portal; Discovery Portal; Data Hosting, Visualisation and Download Portal; Embedding*; Embedding; Embedding; Adapters and Harvesters; CS/W Adapter; Standard Data Services; REST Services; Plug-Ins/ Own Development; REST Services; Reporting Portal; No Infrastructure; Embedding; Shared Infrastructure; REST Services; Mostly Own Infrastructure

Open Data Architecture
Diagram labels: Hardware Architecture: Research Cloud(s); RIMS/ NRF; DMP Tools (Multiple); DOI Registration/ DataCite; Service and Component Interfaces (REST/ JSON, XML, JavaScript); DIRISA (SAEON) Service and Portal Infrastructure; DataFirst; SANBI; DEPOSIT | DISCOVERY | APPLICATION | REPORTING;
Component-based Integration; Distributed Data Cloud Management (iRODS, Resonant, or equivalent); ORCID/ Re3data/ …; WebDAV; HTTP; DEA; HTTP/ FTP; Research Cloud; Basic Portals; SAEON; DIRISA; Other; T2; T1; T2/3; SASDI/ NSIF; DataCite SA; SAEOSS; Full Function Portals; Middleware – Archiving, Backup; Desktop Deposit Tool (OwnCloud); Service-based Integration; Other Accr.; SARVA; BEA; SAEON; DIRISA

Main Use Cases: #0 - Registration
Diagram labels: Request data from ORCID; I have an ORCID already; Request Data from BI Staging; Grant(s) Registered in RIMS; DSA; WDS; re3data; Select a Type of Participation; Individual Researcher; Supplement known information; Institutional Participant; Register ORCID; NRF Registry of Repositories (including DIRISA); Select 1 or more Repositories

Main Use Cases: #1 - Deposit
Diagram labels: Optional Data Upload; Repositories of Last Resort; Optional DOI Registration; Online Capture; Manual Meta-Data Provision; File Upload; REST Services Push; Automated Meta-Data Processes; Standard Harvesting Protocols; Web Folders and FTP; Semi-Automated Meta-Data Processes; DMPs, RIMS; Online Resource URLs and Pointers; Earth and Environmental Sciences; Applied Sciences, Built Env., Engineering; Social Sciences and Humanities; Health and BioInformatics; National Aggregate; Business Science, Law, Economics; Institutional or Domain Repositories; Physics, Chemistry, Astronomy

Main Use Cases: #2 - Discovery
Diagram labels: Repositories of Last Resort; DOI Resolvers; Citations; Earth and Environmental Sciences; Portal Interfaces; Indexed Meta-Data; Search and Discovery Options; Applied Sciences, Built Env., Engineering; Standardised Search Interfaces (Machines); Social Sciences and Humanities; National Aggregate; Business Science, Law, Economics; REST Services; GEOSS Broker, ICSU WDS, …; Standard Harvesting End Points; Health and BioInformatics; Institutional or Domain Repositories; Physics, Chemistry, Astronomy

Main Use Cases: #3 - Application
Diagram labels: Repositories of Last Resort; Citation; Application; Google Analytics; Request Previews; Indexed Meta-Data; Application Options; Chain into Web Processes; Event Logs and User Feedback; Download; Brokers and Mediators; Institutional or Domain Repositories

Main Use Cases: #4 - Reporting
Diagram labels: RIMS (Grant Administration); Indexed Meta-Data; Google Analytics; CrossRef/ DataCite; Reporting Scope; Portal-Based; Depositor Summaries; Reporting Options; REST-Based; Statistics; Depositor Summaries; Application Statistics; Meta-Data Status/ Search History; Page Views and User Behaviour; Citations and Mentions; User Rating and Comments, Data Quality; Context and Knowledge Network; Grant Policy Compliance; Event Logs and User Feedback

Accreditation: Minimum Scope of Evaluation
Diagram labels: Security and ICT Management; Access and Licensing Policies; External Expertise; Conference and Publication Record; Quality Assurance; Networking and Sharing; Products and Services; ICSU-WDS; Communication and Outreach; Data Seal of Approval; Hardware Infrastructure; Ingest and Publication; Depositor Authenticity; TRAC; Preservation Practice; Infrastructure; Legal Compliance; Software Infrastructure; Sustainability; Host Organisation; Interoperability; Funding Mechanisms; Business Continuity Planning

Accreditation Options
Option | Nature of Accreditation Process | Local (NRF) Context | Notes and Comments
ISO 16363:2012 | On-Site Audit | Exceeds Requirement | Expensive but a medium-term goal
NESTOR/ DIN | Remote Audit | Exceeds Requirement | Not applicable locally, mainly used in Europe/ Germany
ICSU-WDS (World Data System) | Remote confirmation with peer review | Meets Requirement | Traditionally Earth and Environmental Science. Allows ‘Network Members’
DSA (Data Seal of Approval) | Remote confirmation with peer review | Meets Requirement | Traditionally Social Science and Humanities
TRAC | Self-Evaluation | Meets Requirement, but needs subsequent formalisation (NRF?) | Not Recommended

Some Questions Remain …
• True scalability:
  – research infrastructure maintenance is human-resource intensive and cannot remain so;
• Universally accepted, machine-readable licenses for non-open data:
  – the equivalent of Creative Commons licenses for data that is legitimately restricted in one of several generic ways (privacy and ethics, commercial interest, and classified information) does not exist, but is required for large-scale, automated processing.
• External dependencies:
  – in an increasingly interconnected systems environment, how do we sustainably fund critical components of globally shared infrastructure (for example vocabulary services or persistent identifier resolvers)?

Funded by NRF/ SAEON, Department of Science and Technology, and CSIR/ Meraka Institute

Biodiversity Data Management

Elements of Interoperability
• Syntax: describes service protocols, parameters
• Schema: describes structure of content
• Semantic: describes the meaning of content
• Temporal – easy; Spatial – easy; Topic – difficult

Essential Biodiversity Variables
• Genetic composition
  – Co-ancestry, Allelic diversity, Population genetic differentiation, Breed and variety diversity
• Species populations
  – Species distribution, Population abundance, Population structure by age/size class
• Species traits
  – Phenology, Body mass, Natal dispersion distance, Migratory behavior, Demographic traits, Physiological traits
• Community composition
  – Taxonomic diversity, Species interactions
• Ecosystem function
  – Net primary productivity, Secondary productivity, Nutrient retention, Disturbance regime
• Ecosystem structure
  – Habitat structure, Ecosystem extent and fragmentation, Ecosystem composition by functional type

Not all traditional spatial data! Not all remotely sensed!
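The three elements of interoperability (syntax, schema, semantics) can be illustrated with two of the standards named under Technical Standards Support, OAI-PMH and Dublin Core, using the Essential Biodiversity Variables as a controlled vocabulary. The following is a minimal sketch under stated assumptions, not a description of the DIRISA or SAEON implementation: the endpoint URL, the sample record, and the function names are hypothetical, and the response is an inline string so that no network access is needed.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://repository.example.org/oai"

# Standard OAI-PMH 2.0 and Dublin Core namespaces.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# EBV classes and variables as listed on the slide.
EBV_CLASSES = {
    "Genetic composition": ["Co-ancestry", "Allelic diversity",
                            "Population genetic differentiation",
                            "Breed and variety diversity"],
    "Species populations": ["Species distribution", "Population abundance",
                            "Population structure by age/size class"],
    "Species traits": ["Phenology", "Body mass", "Natal dispersion distance",
                       "Migratory behavior", "Demographic traits",
                       "Physiological traits"],
    "Community composition": ["Taxonomic diversity", "Species interactions"],
    "Ecosystem function": ["Net primary productivity", "Secondary productivity",
                           "Nutrient retention", "Disturbance regime"],
    "Ecosystem structure": ["Habitat structure",
                            "Ecosystem extent and fragmentation",
                            "Ecosystem composition by functional type"],
}

# Made-up record standing in for a harvested response.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Bird counts at a hypothetical site</dc:title>
          <dc:subject>Population abundance</dc:subject>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc", from_date=None):
    """Syntax: build an OAI-PMH ListRecords request (protocol and parameters)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if from_date:
        params["from"] = from_date
    return f"{base_url}?{urlencode(params)}"


def parse_records(xml_text):
    """Schema: read identifier, title, and subjects from Dublin Core records."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": rec.findtext("oai:header/oai:identifier", namespaces=NS),
            "title": rec.findtext(".//dc:title", namespaces=NS),
            "subjects": [s.text for s in rec.iterfind(".//dc:subject", NS)],
        })
    return records


def ebv_class_for(subject):
    """Semantics: map a free-text subject onto an EBV class, if it is a known variable."""
    wanted = subject.strip().lower()
    for ebv_class, variables in EBV_CLASSES.items():
        if wanted in (v.lower() for v in variables):
            return ebv_class
    return None


if __name__ == "__main__":
    print(list_records_url(BASE_URL, from_date="2016-01-01"))
    for record in parse_records(SAMPLE_RESPONSE):
        for subject in record["subjects"]:
            print(record["identifier"], subject, "->", ebv_class_for(subject))
```

The request and the XML parsing are mechanical once the protocol and schema are agreed. The topic match is the fragile step, since it only works when the depositor used a term from the shared vocabulary, which is consistent with the slide's rating of topic as the difficult dimension.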
Simple or Core Information Model
Diagram labels: Genes and Alleles; Species and Taxa; Sampling Event; Spatial and Temporal Coverage; Life Stages, Traits and Characters; Physical Phenomena

The model diagram, with an added “Relationship” label, is repeated for each of the following examples:
• Example: Taxon Abundance, Presence and Absence
• Example: Phylogenetic Data
• Example: Morphology
• Example: Biome Definition, Ecosystem Services

Generic Dimensions of Data
• Spatial Coverage – XYZ
  – Continuous or Near-Continuous: Uppercase
  – Discrete or dispersed: Lowercase
• Temporal Coverage: T
• Topic or Semantic/ Ontological Coverage
  – P: Phenomenon
    • mostly physical, chemical, or other contextual data
  – B: Biological
    • Tx: Species and Taxonomy (with some extensions)
    • Al: Allele/ Genome/ Phylogenetic
    • Ch: Characteristics, Traits, and Life Stages
• Each unique combination of these, supported by vocabularies or an ontology, is a generic data family

Some Generic Data Families and Crosswalk Requirements
Data Family | Typical Dimensions/ Content | Typical Infrastructure | Typical Syntax/ Schema
Multi-dimensional | XYZ, t, P | Cube Data | OPeNDAP
Traditional Spatial | XY, t, P | S-DB | WxS
Signals | XYZ, t, P/ B | O&M | SOS
General Ecosystem | XYZ, t, P/ B | MetaCat | CSV
Occurrence | XYZ, T, Tx | GBIF Index | DwC
Genetic | XYZ, T, Al | GenBank | FTP/ ASN.1

Still Thinking About:
✪ HDF-5 for Everything
✪ Directed Graphs/ RDF for Everything

Typical Guidance For Each EBV …
http://bit.ly/1W8YPxx
Work started within GEO BON WG 8

Vocabulary and Name Services
• Important to Limit Diversity of Interfaces
• Mappings of Vocabularies to Schema
• RDA has started thinking about this …
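Returning to the generic data families and their crosswalk requirements: the table can be held as a small lookup structure, so that a broker or mediator can ask which families, and therefore which infrastructure and syntax, cover a given topic code. This is only a sketch. The class, field, and function names are hypothetical, the values are transcribed from the slide, and applying the uppercase/lowercase rule to the temporal code as well as the spatial one is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFamily:
    name: str
    spatial: str          # e.g. "XYZ" or "XY"
    temporal: str         # "T" or "t", as written on the slide
    topics: tuple         # topic codes: "P", "B", "Tx", "Al", "Ch"
    infrastructure: str
    syntax: str


# Rows transcribed from the "Generic Data Families" slide.
FAMILIES = [
    DataFamily("Multi-dimensional", "XYZ", "t", ("P",), "Cube Data", "OPeNDAP"),
    DataFamily("Traditional Spatial", "XY", "t", ("P",), "S-DB", "WxS"),
    DataFamily("Signals", "XYZ", "t", ("P", "B"), "O&M", "SOS"),
    DataFamily("General Ecosystem", "XYZ", "t", ("P", "B"), "MetaCat", "CSV"),
    DataFamily("Occurrence", "XYZ", "T", ("Tx",), "GBIF Index", "DwC"),
    DataFamily("Genetic", "XYZ", "T", ("Al",), "GenBank", "FTP/ ASN.1"),
]

# Tx, Al and Ch are listed on the slide as sub-topics of B (Biological).
BIOLOGICAL_SUBTOPICS = {"Tx", "Al", "Ch"}


def families_for_topic(topic):
    """Return the names of families covering `topic`.

    Asking for "B" also matches families tagged with a biological sub-topic.
    """
    matches = []
    for family in FAMILIES:
        covered = set(family.topics)
        if topic in covered or (topic == "B" and covered & BIOLOGICAL_SUBTOPICS):
            matches.append(family.name)
    return matches


def is_continuous(code):
    """Uppercase codes mean continuous or near-continuous coverage (assumed to
    apply to temporal codes as well as spatial ones)."""
    return code.isupper()


if __name__ == "__main__":
    for name in families_for_topic("B"):
        family = next(f for f in FAMILIES if f.name == name)
        print(f"{family.name}: {family.infrastructure} via {family.syntax}")
```

A crosswalk between two families would then need a mapping for each dimension in which they differ, which is where shared vocabulary and name services come in.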