Astronomical Data Archiving and Curation
Clive Page
AstroGrid Project, University of Leicester
2004 March 22

Importance of Data Archiving in Astronomy
• No observation can be repeated exactly, as the sky is always changing
  – After a violent event (e.g. supernova explosion) earlier observations are crucial
• Observations over a long period can identify
  – Variability
  – Proper motions
• In recent years all data come in digital form
• Important earlier datasets on photographic plates have now mostly been digitised.

Principal Data Types in Archives
• Raw data from telescopes
• Observing logs
• Calibration datasets
• Calibrated/reduced data:
  – Images
  – Spectra
  – Time-series
• Derived data products:
  – Source catalogues
  – Sky survey image collections

Data Formats
• A variety, but FITS format predominates:
  – FITS can store arrays and tables, and encapsulates data and metadata, but…
• Standards have evolved, older FITS files less compatible
• Individual observatory conventions also exist
• Metadata vital - sometimes to be found only:
  – In associated software packages or documentation
  – In the heads of those developing the software

Important UK data archive sites
• Cambridge - Astronomical Survey Unit (CASU):
  – INT wide-field survey, APM catalogue, VIZIER mirror, UKIRT archive. In future: WFCAM, VISTA.
• Edinburgh - Wide-field Astronomy Unit (WFAU):
  – SuperCOSMOS images and catalogue, 6df galaxy survey, SLOAN DSS copy. In future: WFCAM, VISTA.
• Leicester - Data Archive Service (LEDAS):
  – EXOSAT, GINGA, ASCA, ROSAT, XMM; Chandra mirror, many optical datasets. In future: SWIFT, SuperWASP source archive.

Important UK data archive sites (continued)
• Manchester - Jodrell Bank:
  – Merlin, HI surveys, European VLBI datasets, pulsar catalogues. Future: e-Merlin archive.
• Rutherford Laboratory:
  – World Data Centre for STP, CLUSTER and ISO UK data centres, Starlink software collection and data archive. In future: SuperWASP image archive.
• UCL - Mullard Space Science Laboratory:
  – YOHKOH, SOHO, TRACE, ReSIK and other solar/STP archives.

Database management systems
• DBMS currently used by UK archives include:
  – BROWSE – written at ESOC/ESTEC in the 1980s
  – DB2 (IBM)
  – Ingres
  – miniSQL – free simple DBMS
  – MySQL – open source, supports many web sites
  – PostgreSQL – open source, good spatial indexing
  – SQL Server (Microsoft)
  – Sybase ASE
  – WFCtools – written at Harvard/SAO for accessing large optical catalogues

User access methods
• Residual telnet/ssh services
  – Allow registered users to perform DBMS operations, store their own subsets, etc.
  – Mostly obsolescent
• FTP access for large downloads
• Web interfaces use CGI with Perl, PHP, or Python
  – Results mostly returned as HTML tables/GIFs, with some FITS and VOTable.
• No use (pre-AstroGrid) of XML-based Web Services (XForms, SOAP, WSDL etc.)

Problems – (1) technical
• Data storage: thanks to Moore’s Law, new datasets are much bigger than old ones. May get adequate storage for existing data from:
  – new big projects like WFCAM, SWIFT, e-MERLIN, VISTA?
  – SRIF funding?
• International Virtual Observatory Alliance (IVOA) is developing new standards, e.g. for tabular data, registry, query language.
  – These have to be implemented before they are fully stable.
• DBMS: freeware like MySQL, PostgreSQL improving rapidly, probably adequate.
  – If not, licence costs may be substantial.
• Database middleware (OGSA-DAI, ELDAS) – still developing, not quite ready for large-scale use

Problems – (2) structural
• Data preservation requires migration to new platforms, new DBMS every few years
• Many DBMS in use are incapable of supporting the functionality required, e.g. no spatial indexing
  – Also implies migration to new DBMS
• AstroGrid (and other VO projects) will supply the middleware, but have no remit (and no funding) to update the archives themselves.
• Serious data mining research will require serious processing power near the data stores (e.g. an Astronomical Data Warehouse).
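
To illustrate why the lack of spatial indexing matters, here is a minimal sketch of a cone search (all sources within a given radius of a sky position) on a DBMS that offers only ordinary one-dimensional indexes. It is not taken from any of the archives listed; the table name `sources`, its columns, and the function names are hypothetical, and SQLite stands in for the DBMS purely because it ships with Python. The idea is to use a plain index on declination for a coarse cut, then compute the exact angular separation in application code.

```python
import math
import sqlite3


def angular_separation_deg(ra1, dec1, ra2, dec2):
    """Great-circle separation in degrees, using the haversine formula."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    h = (math.sin((dec2 - dec1) / 2) ** 2
         + math.cos(dec1) * math.cos(dec2) * math.sin((ra2 - ra1) / 2) ** 2)
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return math.degrees(2 * math.asin(min(1.0, math.sqrt(h))))


def build_catalogue(rows):
    """Load (name, ra, dec) rows, in degrees, into a hypothetical table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sources (name TEXT, ra REAL, dec REAL)")
    # An ordinary B-tree index on declination; no spatial index is assumed.
    conn.execute("CREATE INDEX idx_sources_dec ON sources (dec)")
    conn.executemany("INSERT INTO sources VALUES (?, ?, ?)", rows)
    return conn


def cone_search(conn, ra0, dec0, radius):
    """Return (name, separation) pairs within `radius` degrees, nearest first."""
    # Stage 1: coarse cut. Any source within `radius` of the centre must lie
    # in this declination band, so the one-dimensional index can be used.
    candidates = conn.execute(
        "SELECT name, ra, dec FROM sources WHERE dec BETWEEN ? AND ?",
        (dec0 - radius, dec0 + radius),
    )
    # Stage 2: exact test on the surviving candidates (handles RA wrap-around).
    hits = []
    for name, ra, dec in candidates:
        sep = angular_separation_deg(ra0, dec0, ra, dec)
        if sep <= radius:
            hits.append((name, sep))
    return sorted(hits, key=lambda item: item[1])


if __name__ == "__main__":
    demo = build_catalogue([("A", 10.0, 20.0), ("B", 10.05, 20.02), ("C", 10.0, 25.0)])
    for name, sep in cone_search(demo, 10.0, 20.0, 0.1):
        print(name, round(sep, 4))
```

The declination band still contains every source at that declination around the whole sky, so the exact test runs on far more rows than a true two-dimensional index would return. On large catalogues this is the cost that pushes archives towards a DBMS with spatial indexing.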
Problems – (3) managerial
• VO software from AstroGrid includes MySpace: a temporary user space on remote systems.
  – Optional, but highly desirable because of the need to “shift the results not the data”
  – Will sites give space to users unknown to them?
  – How to administer many ad-hoc groups of users?
• Creation of the VO Registry will require considerable input from managers of existing data archives – exact mechanism TBD.

Manpower
Additional manpower needed for:
• Migration of existing data collections to new platforms, and often to new DBMS
• Installation of AstroGrid and other VO software
• Provision of metadata to the Registry
• Implementation and operation of MySpace
• Setting up astronomical data warehouse facilities at a few sites

Funding problems
• SRIF funding is for hardware only, not manpower
• AstroGrid2 bid failed to get support for elements of data centre support
• PPARC grant applications to support data archiving and curation have an unhappy history: they tend to fall between research and projects funding lines.

Summary
• Archives have a vital role in astronomy
  – They are basically in good shape in that no important bits have been lost (as far as I know)
  – But we have been muddling through
• Technical problems look soluble
• Data storage – we may be able to find enough
• Much work needed on current archives for them to survive into the VO era.
• Additional skilled manpower will be essential – sources of support for this are lacking
• Continuity is vital for archives – this is a long-term problem with no obvious solution.