From research data to new knowledge: a lifecycle approach
Dr Liz Lyon, Director, UKOLN, University of Bath, UK
JISC/SURF/CNI Conference, May 2005, Amsterdam

UKOLN is supported by:
www.ukoln.ac.uk, a centre of expertise in digital information management
www.bath.ac.uk

Overview
1. Scholarly communications in flux
2. e-Research and the diversity of data
3. Repositories & meta-functionality
  • Realising the link to learning: eBank UK
  • Providing value-added services
  • Enabling knowledge extraction & post-processing
4. Look at (some of) the issues en route

1. Scholarly communications in flux

A medieval scriptorium…..

The scholarly knowledge cycle. Liz Lyon, Ariadne, July 2003.
Diagram (research and e-Science workflows). Labelled elements:
• Data creation / capture / gathering: laboratory experiments, Grids, fieldwork, surveys, media
• Data analysis, transformation, mining, modelling
• Research & e-Science workflows
• Deposit / self-archiving
• Repositories: institutional, e-prints, subject, data, learning objects
• Harvesting metadata
• Aggregator services: national, commercial
• Searching, harvesting, embedding
• Presentation services: subject, media-specific, data, commercial portals
• Resource discovery, linking, embedding
• Validation
• Publication
• Peer-reviewed publications: journals, conference proceedings

Diagram (learning and teaching workflows). Labelled elements:
• Learning object creation, re-use
• Learning & Teaching workflows
• Deposit / self-archiving
• Repositories: institutional, e-prints, subject, data, learning objects
• Harvesting metadata
• Aggregator services: national, commercial
• Searching, harvesting, embedding
• Presentation services: subject, media-specific, data, commercial portals
• Institutional presentation services: portals, Learning Management Systems, u/g, p/g courses, modules
• Resource discovery, linking, embedding
• Validation
• Peer-reviewed publications: journals, conference proceedings
• Quality assurance bodies

Diagram (combined view): the research and e-Science elements and the learning and teaching elements listed above, shown together in one cycle.

2. e-Research and the diversity of data

Assuring permanent open access to the records of science & the humanities?
Long term access to primary data
• Increasing data volumes from eScience and Grid-enabled / cyberinfrastructure applications
• Changing research paradigm: data-driven science, “big science”
• Observational data, simulations, large-scale experimentation, computations
• Multi-media resources, statistical data, surveys, geo-spatial data……

Diversity of data collections
• Very large, relatively homogeneous: Large Hadron Collider (LHC) outputs from CERN
• Smaller, heterogeneous and richer collections: World Data Centre for Solar-terrestrial Physics, CCLRC
• Small-scale laboratory results: “jumping robots” project at the University of Bath
• Population survey data: UK Biobank
• Highly sensitive, personal data: patient care records

Taxonomy of data collections
• Research collections: jumping robots
• Community collections: Flybase at Indiana (with UC Berkeley)
• Reference collections: Protein Data Bank
Evolution……
Source: NSF Long-Lived Digital Data Collections Draft report March 2005

Repository evolution:
• 1971: Research collection, <12 files
• 2005: Reference collection, >2700 structures deposited in 6 months

1. Issues: research data as content
• Sharing it!
• Data diversity
  – Homo- or heterogeneous
  – Raw and derived / processed
  – Sensitivity
  – Fast or slow growth in volume
• Repository evolution:
  – Likelihood to scale up (from bytes to petabytes)
  – Quality assurance (from the start)
  – Community-based standards development (“folksonomies”)
  – Build robust services

3. Repositories & meta-functionality

eBank UK: linking research data to learning
• JISC-funded September 2003, Phase 2 February 2005
• UKOLN at the University of Bath (lead), University of Southampton, University of Manchester
• Exemplar: e-Science testbed ‘Combechem’
  – Grid-enabled combinatorial chemistry
  – Crystallography, laser and surface chemistry examples
  – Development of an e-Lab using pervasive computing technology
  – National Crystallography Service
• Resource Discovery Network / PSIgate physical sciences portal
• http://www.ukoln.ac.uk/projects/ebank-uk/

Diagram: the combined research and learning & teaching cycle (presentation services, aggregator services, repositories, peer-reviewed publications, quality assurance bodies, and the workflows linking them), with the aggregator services labelled “eBank UK”.

Data Flow in eBank UK
Diagram labels (as extracted): Create; HTML; Submit; OAI-PMH present; Store/link; Index and Search; Harvest (XML); Institutional repository; eBank aggregator; HTML present; Data files; Metadata.

Comb-e-Chem Project
Diagram labels: Video; Simulation; Diffractometer; Properties; Analysis; Structures Database; X-Ray e-Lab; Properties e-Lab; Grid Middleware.

The digital repository
ecrystals.chem.soton.ac.uk
Acknowledgement: Simon Coles
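
The “OAI-PMH present” and “Harvest (XML)” steps in the eBank UK data flow can be sketched roughly as follows. This is a minimal illustration only, not the eBank UK implementation: the base URL, the sample response and the helper names `list_records_url` and `parse_records` are all hypothetical, and the sketch assumes the repository exposes simple Dublin Core (`oai_dc`), which every OAI-PMH repository is required to support. It parses an embedded sample response instead of contacting a real repository.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

# Namespaces defined by OAI-PMH 2.0 and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Hypothetical response, shaped like an OAI-PMH ListRecords reply.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header>
        <identifier>oai:repository.example.org:1</identifier>
      </header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example crystal structure dataset</dc:title>
          <dc:creator>Example, A.</dc:creator>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build the URL an aggregator would request to harvest all records."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def parse_records(xml_text):
    """Extract identifier, titles and creators from a ListRecords response."""
    root = ET.fromstring(xml_text)
    records = []
    for record in root.iterfind(".//oai:record", NS):
        records.append({
            "identifier": record.findtext("oai:header/oai:identifier", namespaces=NS),
            "titles": [t.text for t in record.iterfind(".//dc:title", NS)],
            "creators": [c.text for c in record.iterfind(".//dc:creator", NS)],
        })
    return records


if __name__ == "__main__":
    # The base URL is a placeholder, not a real repository endpoint.
    print(list_records_url("http://repository.example.org/oai"))
    for rec in parse_records(SAMPLE_RESPONSE):
        print(rec["identifier"], rec["titles"], rec["creators"])
```

A real harvester would also need to follow OAI-PMH resumption tokens for large result sets and handle error responses; both are omitted here to keep the sketch short.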
Access to the underlying data

Harvesting: OAIster

Aggregating: search & discover

Linking to publications

eBank embedded in a science portal

eBank Phase 2: linking to learning
• Embedding in e-Learning processes
• Evaluating the pedagogical benefits
  – MChem course
  – Chemical informatics course

2. Issues: generic data models, metadata schema & terminology
• Validation against other schema
  – CCLRC Scientific Data Model Version 2
• Complex digital objects and packaging options
  – METS
  – MPEG-21 DIDL
• Terminologies
  – Domain: crystallography
  – Inter-disciplinary, e.g. biomaterials
  – Metadata enhancement: subject keyword additions to datasets based on knowledge of keywords in related publications
  – Meaningful resource discovery?

3. Issues: linking and identifiers
• Links to individual datasets within an experiment
• Links to all datasets associated with an experiment or a data collection
• Links to derived eprints and published literature
• Context sensitive linking: find me
  – Datasets by this author / creator
  – Datasets related to this subject
  – Learning objects by this author / creator
  – Learning objects related to this subject
• Identifiers and persistence
  – “generic”
  – domain: International Chemical Identifier (InChI code)
• Resource discovery: Google Scholar?
• Provenance: authenticity, authority, integrity?

4. Issues: embedding and workflow
• Into the crystallographic publishing community: International Union of Crystallography
• Into the chemistry research workflow
  – SMART TEA Digital Lab Book e-synthesis Lab
  – Other analytical techniques and instrumentation
• Into the curriculum and e-Learning workflows
  – MChem course
  – Undergraduate Chemical Informatics courses

Repositories and digital curation
• Data preservation: static; for later use?
• Data curation: dynamic; in use now (and the future)?
“maintaining and adding value to a trusted body of digital information for current and future use”

Provide value-added services
Annotation
• e-Lab books (Smart Tea Project in chemistry)
• Gene and protein sequences

Enable “post-processing” and knowledge extraction
The acquisition of newly-derived information and knowledge from repository content
• Run complex algorithms over primary datasets
• Mining (data, text, structures)
• Modelling (economic, climate, mathematical, biological)
• Analysis (statistical, lexical, pattern matching, gene)
• Presentation (visualisation, rendering)

5. Issues: “knowledge services”
• Layered over repositories
  – Annotation
  – Mining, modelling, analysis
  – Visualisation
• Across multiple repositories
  – Grid enabled applications
  – Highly distributed, dynamic and collaborative
• Associated with curatorial responsibility
  – UK Digital Curation Centre http://www.dcc.ac.uk

Issues summary
1. Research data is diverse, increasing rapidly in volume and complexity
2. Repository collections are dynamic and evolve
3. Technical challenges associated with interoperability, persistence, provenance, resource discovery and infrastructure provision
4. Embedding in workflow is critical: scholarly communications, research practice, learning
5. Knowledge extraction tools will generate new discoveries based on repository content
6. Repository solutions must scale: machine-to-machine (M2M) processing will become the norm……
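
The context-sensitive “find me” queries listed under the linking and identifiers issues (datasets or learning objects by a given creator, or related to a given subject) could be answered over harvested metadata along the following lines. This is a sketch under stated assumptions, not a description of any eBank UK service: the `Record` fields, the `find_me` function and the sample records are all hypothetical, and a production aggregator would query an index, not an in-memory list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Hypothetical minimal metadata record held by an aggregator."""
    identifier: str
    kind: str                      # "dataset" or "learning_object"
    creator: str
    subjects: frozenset = frozenset()


def find_me(records, kind, creator=None, subject=None):
    """Return identifiers of records of the given kind, optionally
    narrowed to one creator and/or one subject keyword."""
    matches = []
    for rec in records:
        if rec.kind != kind:
            continue
        if creator is not None and rec.creator != creator:
            continue
        if subject is not None and subject not in rec.subjects:
            continue
        matches.append(rec.identifier)
    return matches


# Invented sample records, for illustration only.
RECORDS = [
    Record("ds-1", "dataset", "Example, A.", frozenset({"crystallography"})),
    Record("ds-2", "dataset", "Sample, B.", frozenset({"biomaterials"})),
    Record("lo-1", "learning_object", "Example, A.", frozenset({"crystallography"})),
]

if __name__ == "__main__":
    print(find_me(RECORDS, "dataset", creator="Example, A."))
    print(find_me(RECORDS, "learning_object", subject="crystallography"))
```

Matching on a creator name string is the weakest part of this sketch; it is one reason the identifiers and persistence question raised alongside these queries matters.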