EMBL/ELIXIR use-cases for EGI/EUDAT
Tony Wildish
www.ebi.ac.uk

Data resources available from EMBL-EBI
• Genes, genomes & variation: European Nucleotide Archive, Ensembl, European Variation Archive, Ensembl Genomes, European Genome-phenome Archive, GWAS Catalog, Metagenomics portal
• Gene, protein & metabolite expression: RNA Central, ArrayExpress, Expression Atlas, MetaboLights, PRIDE
• Literature & ontologies: Europe PubMed Central, Gene Ontology, Experimental Factor Ontology
• Protein sequences, families & motifs: InterPro, Pfam, UniProt
• Molecular structures: Protein Data Bank in Europe, Electron Microscopy Data Bank
• Chemical biology: ChEMBL, ChEBI
• Reactions, interactions & pathways: IntAct, Reactome, MetaboLights
• Systems: BioModels, Enzyme Portal, BioSamples

ELIXIR: Driven by 4 scientific use-cases
• Marine Metagenomics
• Genomic & Phenotypic data for Crop and Forest plants
• Rare Diseases
• Human Genetic Data
All scientific use cases require either private or public data sets to be replicated from the source or between analysis sites

Three types of metadata
• Content metadata
  – Scientific/biological content, the value of the data
  – Structure specific to the archive hosting the data
• File metadata
  – File size, checksum, creation date
  – Logical File Name (LFN) – filename relative to archive root
• Access metadata
  – Physical File Name (PFN) – host, protocol/port, path to root of archive, LFN
  – Defined per site, per protocol (HTTP, FTP…)
  – Site-specific, part of the fabric

Use-case characteristics
• Data volumes from 10s to several 100s of GB monthly
  – Human data likely to be largest volume/traffic
• Replication between a handful of sites
  – Periodic updates to reference datasets
  => metadata handling to describe datasets consistently
• Download smaller subsets for individual analyses
• End-users widely distributed, communities of all sizes/scales

Use-case characteristics (continued)
• Content metadata replication not a target
  – Complex, domain-specific, well established
  – No clear gain in replicating it at this time
• Decouple dataset-description metadata from file-location and transfer/access metadata
  – Allow file-distribution to be explored and understood without digging into details of what the data is about

Abstract model: Users browse metadata to discover or define datasets which are located at multiple sites
Complexity comes from the dataset structure…
[Diagram: a user interface over the content metadata and dataset definitions of the source archive (Dataset, Dataset Version, Files, LFN); Site A and Site B each have an access path catalogue (PFN) and storage.]

Example dataset structure: three datasets each have their own files
#1 and #2 don’t overlap with each other, but both overlap with #3
[Diagram: Dataset 1, Dataset 2 and Dataset 3 drawn as overlapping groups of LFNs.]

3 overlapping Datasets -> 5 non-overlapping Filesets
[Diagram: Dataset 1, Dataset 2 and Dataset 3 divided into Fileset 1 to Fileset 5.]

Filesets in releases are ‘closed’ (immutable)
As-yet unreleased filesets may be ‘open’ (mutable)
[Diagram: Dataset A with Release 1 and Release 2, and Dataset B; Fileset 1 (closed), Fileset 2 (closed), Fileset 3 (open), … Fileset n; each fileset contains files identified by LFN.]

[Diagram: the metadata/dataset catalogue (Dataset, Dataset Version, Files, LFN) sits above the topology, infrastructure and fabric layer; a file/fileset replica catalogue and an access protocol catalogue map filesets to PFNs in storage at Site A, Site B, …]

Data stewardship
• Data stewardship requires data policies
  – This site must maintain a copy of my data
  – I want a copy of my data somewhere in EGI, but I don’t care where
  – This data can be deleted after 10 years
  – …
• Infrastructure providers need access to those policies
  – Can I delete my copy of the data? Does it matter if I lose it accidentally?
  – Is my copy ‘custodial’ (do I need to keep it backed up?)
  – Does my copy have to be permanently online? Near-line?
• Belongs as part of the replica catalogue

Summary
• Three types of metadata for ELIXIR data mgmt
  – Content (value), file, access: manage separately
  – Content metadata is/will-be managed by ELIXIR
  – File, access metadata needs catalogues, tools
• Dataset/data-organization metadata is complex
  – Not cleanly separable, overlapping, multi-scale…
  – Need to explore real use-cases, understand details
• Proposed data-model can address needs
  – Needs validation: prototype, deploy...
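
The step from overlapping datasets to non-overlapping filesets can be illustrated with a short sketch. This is one possible reading of the slides, not the project's implementation: it assumes a fileset is the group of files that belong to exactly the same combination of datasets. The function names and the LFNs are hypothetical.

```python
from collections import defaultdict


def partition_into_filesets(datasets):
    """Split overlapping datasets into non-overlapping filesets.

    Files are grouped by the exact set of datasets that contain them, so
    every LFN ends up in exactly one fileset (assumed definition).
    """
    membership = defaultdict(set)      # LFN -> names of datasets holding it
    for name, lfns in datasets.items():
        for lfn in lfns:
            membership[lfn].add(name)

    filesets = defaultdict(set)        # combination of dataset names -> LFNs
    for lfn, names in membership.items():
        filesets[frozenset(names)].add(lfn)
    return dict(filesets)


def files_of(dataset_name, filesets):
    """Rebuild one dataset as the union of the filesets it takes part in."""
    lfns = set()
    for names, members in filesets.items():
        if dataset_name in names:
            lfns |= members
    return lfns


# Hypothetical LFNs: datasets 1 and 2 are disjoint, both overlap dataset 3.
datasets = {
    "dataset1": {"d1/a", "d1/b", "shared13/c", "shared13/d"},
    "dataset2": {"d2/e", "d2/f", "shared23/g"},
    "dataset3": {"shared13/c", "shared13/d", "shared23/g", "d3/h", "d3/i"},
}

filesets = partition_into_filesets(datasets)
for names in sorted(filesets, key=sorted):
    print(sorted(names), sorted(filesets[names]))
```

With this sample input the three datasets split into five filesets, and each dataset can be rebuilt as a union of filesets with `files_of`. A replica catalogue would then only need to track which sites hold which filesets, rather than every dataset separately.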
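
The relationship between file metadata (LFN) and access metadata (PFN) can also be sketched in code. This is a minimal illustration that assumes a PFN can be written as a URL built from protocol, host, port, archive root and LFN, as the metadata slide lists them. The `AccessPath` class, the `resolve` helper, the hostnames and the paths are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AccessPath:
    """Access metadata for one site and one protocol (hypothetical fields)."""
    site: str
    protocol: str
    host: str
    port: int
    root: str  # path to the root of the archive at this site

    def pfn(self, lfn: str) -> str:
        """Build a PFN from protocol, host, port, archive root and LFN."""
        root = self.root.strip("/")
        prefix = f"/{root}" if root else ""
        return f"{self.protocol}://{self.host}:{self.port}{prefix}/{lfn.lstrip('/')}"


# A toy access path catalogue: defined per site and per protocol.
catalogue = [
    AccessPath("site-a", "https", "data.site-a.example.org", 443, "/archive"),
    AccessPath("site-a", "ftp", "ftp.site-a.example.org", 21, "/pub/archive"),
    AccessPath("site-b", "https", "store.site-b.example.org", 443, "/mirror/archive/"),
]


def resolve(lfn, catalogue, protocol=None):
    """Return every PFN for an LFN, optionally restricted to one protocol."""
    return [
        path.pfn(lfn)
        for path in catalogue
        if protocol is None or path.protocol == protocol
    ]


lfn = "study42/run1/reads.fastq.gz"  # hypothetical LFN, relative to archive root
pfns = resolve(lfn, catalogue, protocol="https")
for pfn in pfns:
    print(pfn)
```

Because the LFN is the same at every site, only the site-specific access paths change when a replica is added or moved. That is the decoupling of file metadata from access metadata the slides argue for.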