Download Presentation

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
South African Research Data Infrastructure
Open Data Platform Architecture
Wim Hugo
CDIO, SAEON
Systems Architect, DIRISA
Vice-Chair, ICSU-WDS
Anwar Vahed
Manager, DIRISA
CSIR Meraka Institute
Context
“Free and Open Access”
Tax-Funded Data
Reproducibility of Science
Governance: Stakeholder Groupings
DST
Research Institutions
SARIR Programme(s)
SARVA
BEA
DRDLR
SASDI
Custodians
ICSU World Data
System
African Networks
Shared Platform
Stakeholders
DST
SA-GEO
SAEOSS
GEO-BON/ GEOSS
Communities of
Practice
DoE
(REDIS)
DEA
(NSIF, SANBI, CC
M&E), O&C
NRF, HEIs and
National Facilities
NEDICC
DIRISA
ASSAf
Consolidated Roadmaps
Emerging E&EO Research Data Infrastructure
Services
ORCID,
DataCite, …
Linked Open
Data
Services/
Composites
APIs
Global
Registries
Data Providers
Brokers and
Harvesters
SARVA
BioEnergy
Atlas
SAEON Data
Portal
Community
or Thematic
Portals
External Systems
Earth and
Environmental Sciences
GEOSS,
ICSU WDS,
…
Open Data Platform
Global Infrastructures
Shared Metadata
Data Hosting
Services
Components
Portals
Gateways
Guidance
Physical Infrastructure
SAEOSS
Gateways
DIRISA
DEA
SASDI
SARIR
Other
Disciplines
Six DIRISA Architectures
Requirements and
Specifications – detailing the
scope that needs to be
addressed by DIRISA, as well as
the typical solutions and
specifications or standards that
will apply;
The RDI Landscape –
summarising international best
practice and precedents, and
reviewing local infrastructure,
initiatives, and status quo;
Structuring a Data Alliance –
dealing in more detail with
governance, community
participation, and capacity
building requirements.
Hardware
Specifications and
Standards
Hardware and
Networks
Software
Specifications and
Standards
Software and
Systems
Business and Governance
Guidance, Capacity, and Soft Skills
Accreditation
Generalised Scientific Data Infrastructure Use Case
Access/
Download
Data/
Services
“Bind”
“Publish”
Metadata
Analise/
Visualise
Process
“Find”
Discover
“Predictable Assembly from Reliable Components”
Generalised Scientific Data Infrastructure Use Case
Curate
Cite
Access/
Download
Data/
Services
“Bind”
“Publish”
Metadata
Analise/
Visualise
Process
“Find”
Discover
“Predictable Assembly from Reliable Components”
Assess/
Rate
Generalised Scientific Data Infrastructure Use Case
Curate
Mediate
Cite
Access/
Download
Data/
Services
“Bind”
“Publish”
Metadata
Analise/
Visualise
Process
“Find”
Discover
“Predictable Assembly from Reliable Components”
Assess/
Rate
Technical Standards Support
Harvesters/ Discovery
Data Services
• CS/W
• OAI-PMH
• REST Service API
• OGC WxS, KML, GeoRSS,
GeoJSON, GML
• NetCDF/ HDF5
(Multidimensional Data)
• Time Series/ Signal Data
• Media (Images, Video, Audio)
• Document Objects
• Tabular Data (CSV, Excel, …)
Metadata Standards
• ISO 19115/ p2
• SANS 1878
• EML
• FGDC
• Dublin Core
• DDI
• Darwin Core
• DataCite
Linked Open Data
• DataCite DOIs
• ORCID
• Digital Samples
• Vocabulary Registries
Options for Participation
Meta-Data
Management
Portal
Discovery
Portal
Data Hosting,
Visualisation and
Download
Portal
Embedding*
Embedding
Embedding
Adapters and
Harvesters
CS/W Adapter
Standard Data
Services
REST Services
Plug-Ins/ Own
Development
REST Services
Reporting
Portal
No
Infrastructure
Embedding
Shared
Infrastructure
REST Services
Mostly
Own
Infrastructure
Open Data Architecture
Hardware Architecture: Research Cloud(s)
RIMS/
NRF
DMP Tools
(Multiple)
DOI
Registration/
DataCITE
Service and Component Interfaces
(REST/ JSON, XML, Javascript)
DIRISA (SAEON) Service and Portal
Infrastructure
DataFirst
SANBI
DEPOSIT | DISCOVERY | APPLICATION | REPORTING
Componentbased
Integration
Distributed Data Cloud Management
(iRODS, Resonant, or equivalent)
ORCID/
Re3data/
…
Web
Dav
HTTP
DEA
HTTP
/ FTP
Research Cloud
Basic Portals
SAEON
DIRISA
Other
T2
T1
T2/3
SASDI/ NSIF
DataCite SA
SAEOSS
Full Function
Portals
Middleware – Archiving, Backup
Desktop
Deposit Tool
(OwnCloud)
ServiceBased
Integration
Other
Accr.
SARVA
BEA
SAEON
DIRISA
Main Use Cases: #0 - Registration
Request
data from
ORCID
I have an ORCID
already
Request
Data from
BI Staging
Grant(s)
Registered in RIMS
DSA
WDS
re3data
Select a Type of
Participation
Individual
Researcher
Supplement
known information
Institutional
Participant
Register
ORCID
NRF
Registry of
Repositories
(including
DIRISA)
Select 1 or more
Repositories
Main Use Cases: #1 - Deposit
Optional
Data Upload
Repositories of
Last Resort
Optional
DOI
Registration
Online Capture
Manual Meta-Data
Provision
File Upload
REST Services
Push
Automated MetaData Processes
Standard
Harvesting
Protocols
Web Folders and
FTP
Semi-Automated
Meta-Data
Processes
DMPs, RIMS
Online Resource URLs
and Pointers
Earth and
Environmental
Sciences
Applied
Sciences,
Built Env.,
Engineering
Social
Sciences
and
Humanities
Health
and BioInformatics
National
Aggregate
Business
Science,
Law,
Economics
Institutional or
Domain
Repositories
Physics,
Chemistry,
Astronomy
Main Use Cases: #2 - Discovery
Repositories of
Last Resort
DOI
Resolvers
Citations
Earth and
Environmental
Sciences
Portal Interfaces
Indexed
Meta-Data
Search and
Discovery Options
Applied
Sciences,
Built Env.,
Engineering
Standardised
Search Interfaces
(Machines)
Social
Sciences
and
Humanities
National
Aggregate
Business
Science,
Law,
Economics
REST Services
GEOSS Broker,
ICSU WDS, …
Standard
Harvesting
End Points
Health
and BioInformatics
Institutional or
Domain
Repositories
Physics,
Chemistry,
Astronomy
Main Use Cases: #3 - Application
Repositories of
Last Resort
Citation
Application
Google
Analytics
Request Previews
Indexed
Meta-Data
Application
Options
Chain into Web
Processes
Event Logs
and User
Feedback
Download
Brokers and
Mediators
Institutional or
Domain
Repositories
Main Use Cases: #4 - Reporting
RIMS
(Grant
Administrat
ion)
Indexed
Meta-Data
Google
Analytics
CrossRef/ DataCite
Reporting Scope
Portal-Based
Depositor
Summaries
Reporting Options
REST-Based
Statistics
Depositor
Summaries
Application
Statistics
Meta-Data
Status/ Search
History
Page Views and
User Behaviour
Citations and
Mentions
User Rating and
Comments, Data
Quality
Context and
Knowledge
Network
Grant Policy
Compliance
Event Logs
and User
Feedback
Accreditation: Minimum Scope of Evaluation
Security and
ICT
Management
Access and
Licensing
Policies
External
Expertise
Conference
and
Publication
Record
Quality
Assurance
Networking
and Sharing
Products and
Services
ICSU-WDS
Communicaton
and Outreach
Data Seal of
Approval
Hardware
Infrastructure
Ingest and
Publication
Depositor
Authenticity
TRAC
Preservation
Practice
Infrastructure
Legal
Compliance
Software
Infrastructure
Sustainability
Host
Organisation
Interoperability
Funding
Mechanisms
Business
Continuity
Planning
Accreditation Options
Option
Nature of Accreditation
Process
Local (NRF) Context
Notes and Comments
ISO
16363:2012
On-Site Audit
Exceeds Requirement
Expensive but a mediumterm goal
Exceeds Requirement
Not applicable locally,
mainly used in Europe/
Germany
Remote confirmation
with peer review
Meets Requirement
Traditionally Earth and
Environmental Science.
Allows ‘Network
Members’
Remote confirmation
with peer review
Meets Requirement
Self-Evaluation
Meets Requirement, but
needs subsequent
formalisation (NRF?)
NESTOR/ DIN
ICS-WDS
(World Data System)
DSA
(Data Seal of Approval)
TRAC
Remote Audit
Traditionally Social
Science and Humanities
Not Recommended
Some Questions Remain …
• True scalability:
– research infrastructure maintenance is human-resource intensive and
cannot remain so;
• Universally accepted, machine-readable licenses for non-open data:
– the equivalent of Creative Commons licenses for data that is
legitimately restricted in one of several generic ways (privacy and
ethics, commercial interest, and classified information) does not exist,
but are required for large-scale, automated processing.
• External dependencies:
– in an increasingly interconnected systems environment, how do we
sustainably fund critical components of globally shared infrastructure
(for example vocabulary services or persistent identifier resolvers).
?
Funded by
NRF/ SAEON,
Department of Science and Technology,
and CSIR/ Meraka Institute
Biodiversity Data Management
Elements of Interoperability
Syntax
Describes service
protocols, parameters
Schema
Describes structure of
content
Temporal – easy
Semantic
Describes the meaning
of content
Spatial – easy
Topic - difficult
Essential Biodiversity Variables
•
Genetic composition
– Co-ancestry, Allelic diversity, Population genetic differentiation, Breed and variety diversity
•
Species populations
– Species distribution, Population abundance, Population structure by age/size class
•
Species traits
– Phenology, Body mass, Natal dispersion distance, Migratory behavior, Demographic traits,
Physiological traits
•
Community composition
– Taxonomic diversity, Species interactions
•
Ecosystem function
– Net primary productivity, Secondary productivity, Nutrient retention, Disturbance regime
•
Ecosystem structure
–
Habitat structure, Ecosystem extent and fragmentation, Ecosystem composition by
functional type
Not all traditional
spatial data!
Not all remotely
sensed!
Simple or Core Information Model
Genes and
Alleles
Species and
Taxons
Sampling Event
Spatial and
Temporal
Coverage
Life Stages,
Traits and
Characters
Physical
Phenomena
Example: Taxon Abundance, Presence and Absence
Genes and
Alleles
Relationship
Species and
Taxons
Sampling Event
Spatial and
Temporal
Coverage
Life Stages,
Traits and
Characters
Physical
Phenomena
Example: Phylogenetic Data
Genes and
Alleles
Relationship
Species and
Taxons
Sampling Event
Spatial and
Temporal
Coverage
Life Stages,
Traits and
Characters
Physical
Phenomena
Example: Morphology
Genes and
Alleles
Relationship
Species and
Taxons
Sampling Event
Spatial and
Temporal
Coverage
Life Stages,
Traits and
Characters
Physical
Phenomena
Example: Biome Definition, Ecosystem Services
Genes and
Alleles
Relationship
Species and
Taxons
Sampling Event
Spatial and
Temporal
Coverage
Life Stages,
Traits and
Characters
Physical
Phenomena
Generic Dimensions of Data
• Spatial Coverage
– XYZ
Continuous or Near-Continuous: Uppercase
Discrete or dispersed: Lowercase
• Temporal Coverage: T
• Topic or Semantic/ Ontological Coverage
– P: Phenomenon
• mostly physical, chemical, or other contextual data
– B: Biological
• Tx: Species and Taxonomy (with some extensions)
• Al: Allele/ Genome/ Phylogenetic
• Ch: Characteristics, Traits, and, and Life Stages
• Each unique combination of these, supported by a vocabularies/
ontology is a generic data family
Some Generic Data Families and Crosswalk Requirements
Typical Dimensions/ Content
Typical Infrastructure
Typical Syntax/ Schema
Multi-dimensional
XYZ, t, P
Cube Data
OPeNDAP
Traditional Spatial
XY, t, P
S-DB
WxS
Signals
XYZ, t, P/ B
O&M
SOS
General Ecosystem
XYZ, t, P/ B
MetaCat
CSV
Occurrence
XYZ, T, Tx
GBIF Index
DwC
Genetic
XYZ, T, Al
GenBank
FTP/ ASN.1
Still Thinking About: ✪ HDF-5 for Everything ✪ Directed Graphs/ RDF for Everything
Typical Guidance
For Each EBV …
http://bit.ly/1W8YPxx
Work started within GEO BON WG 8
Vocabulary and Name Services
• Important to Limit Diversity of Interfaces
• Mappings of Vocabularies to Schema
• RDA has started thinking about this …
Related documents