Astronomical e-Science in Edinburgh

Introduction: astronomy and e-science

Astronomy and e-science are perfect partners. Astronomy needs the new techniques being developed in e-science to help it cope with the data avalanche it is experiencing: the volume of astronomical data available online doubles every eighteen months or so, with the largest sky survey databases growing at several Terabytes per year. Conversely, astronomy data provide an ideal testbed for many e-science developments: in the words of Jim Gray of Microsoft Research, “It has no commercial value [and] no privacy concerns. It is real and well documented…high-dimensional, spatial and temporal data (with confidence intervals), [generated by] many different instruments from many different places and many different times. There is a lot of it.”

The principal e-science initiative within astronomy is the creation of an international Virtual Observatory (VO), which will federate the world’s significant astronomical data sources and integrate them with the hardware and software required to exploit that federation scientifically.

Astronomy and e-science in Edinburgh

The University of Edinburgh is home to the Wide Field Astronomy Unit (WFAU[1]), which curates the largest databases in UK astronomy: the science archives from optical and near-infrared sky surveys. WFAU is also one of the UK data centres within AstroGrid[2], the UK’s VO project, itself part of the Astrophysical Virtual Observatory (AVO[3]) consortium, which is undertaking an EU-funded R&D study into the design of a data Grid for European astronomy. Collaborations exist between WFAU astronomers and researchers in the University’s School of Informatics interested in data management and in data mining, since, in both cases, the scale and complexity of sky survey databases present interesting problems. The astronomical e-science community in Edinburgh is completed by NeSC, whose workshops and training courses have greatly benefited AstroGrid, and which collaborates with both WFAU and AstroGrid through the edikt[4] and OGSA-DAI[5] projects. In what follows we describe in more detail some of the projects currently underway within this community.

Sky Survey Science Archives

As astronomical databases grow to multi-TB scales they must change from being passive data repositories to incorporate data exploration, as well as data access, facilities. AstroGrid aims to provide data exploration services for a set of major UK databases, which it will federate. The largest amongst these is the new 1.5TB SuperCOSMOS Science Archive (SSA), based upon a legacy photographic sky survey, scanned by the SuperCOSMOS measuring machine (shown right). The SSA is also being used as a testbed for the larger WFCAM Science Archive (WSA), which will hold sky survey data from the new wide field camera (WFCAM) on the UK Infrared Telescope. WFCAM (shown below) will generate over 20TB of data per year for seven years from 2004.

A key requirement for these large sky survey databases is effective spatial indexing, so WFAU and NeSC are collaborating with IBM, Microsoft and Oracle to study the SSA as a prototype of a large, spatially-indexed database, comparing the spatial options supported by the three companies’ products. The SSA will also be a scalability testbed for the data access and data exploration services being developed by AstroGrid. Over the next few years the SSA and WSA will become key components of the Virtual Observatory, with a growing array of web and Grid services deployed on them, allowing astronomers to do research on the TB scale.

Data Mining and Machine Learning

Foremost amongst the services deployed on these new science archives will be data mining algorithms, to enable scientists to extract scientific knowledge from the terabytes of data.
The first application of data mining with the SSA has been to help clean up the dataset itself[6]. Optical sky surveys include a number of “junk” objects (due to artificial satellites, aeroplanes, optical effects around bright stars, etc.) which often cannot be distinguished from stars or galaxies on the basis of their measured attributes alone. They can, however, be identified from their unlikely spatial arrangements. The plots below show the detection of a satellite trail (left) and a diffraction halo around a bright star (right) through the application of a machine learning algorithm looking for statistically unlikely linear and elliptical arrangements of objects, respectively. In each case the junk objects number only a few hundred, from a catalogue of several hundred thousand in the particular survey image. This algorithm has applicability beyond the SSA, and it is intended that it will be implemented as a prototype Grid service by WFAU and NeSC.

Association techniques

A second area where machine learning is being used is in the association of entries in different databases which represent observations of the same celestial source in different passbands. The angular resolution inherent in the two sets of observations can differ markedly, as shown schematically below, where many objects from a high resolution observation (in red) lie within the blue ellipse, which denotes the region of sky within which a source from a much lower resolution image is constrained to lie. In this situation proximity alone cannot decide which of the red sources is the most likely counterpart of the blue source. As part of a PhD project funded by a PPARC e-science studentship, machine learning techniques are being used to discover which attributes of the population of red sources correlate with being close to blue sources, with the aim of delivering a suite of association services for use by AstroGrid.
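
The ambiguity described above can be made concrete with a small sketch. This is not the association method being developed in the PhD project; it is only a purely geometric baseline, assuming a flat-sky (small-angle) approximation and an error ellipse given by semi-axes in arcseconds and a position angle measured east of north. All function names, field names and coordinates are invented for illustration.

```python
import math


def offsets_arcsec(ra0, dec0, ra, dec):
    """Flat-sky (east, north) offsets in arcseconds of (ra, dec) from (ra0, dec0).

    All inputs are in degrees. The small-angle approximation is only
    reasonable for arcsecond-to-arcminute separations, away from the poles.
    """
    d_ra = (ra - ra0 + 180.0) % 360.0 - 180.0  # wrap the RA difference into [-180, 180)
    east = d_ra * math.cos(math.radians(dec0)) * 3600.0
    north = (dec - dec0) * 3600.0
    return east, north


def ellipse_radius_sq(east, north, a, b, pa_deg):
    """Squared normalised radius of an offset within an error ellipse.

    a and b are the semi-major and semi-minor axes in arcseconds; pa_deg is
    the position angle of the major axis, measured east of north.
    Values <= 1 lie inside the ellipse.
    """
    pa = math.radians(pa_deg)
    u = east * math.sin(pa) + north * math.cos(pa)  # component along the major axis
    v = east * math.cos(pa) - north * math.sin(pa)  # component along the minor axis
    return (u / a) ** 2 + (v / b) ** 2


def candidate_counterparts(low, high_res_sources):
    """Return (name, radius_sq) for each high-resolution source inside the
    low-resolution source's error ellipse, smallest normalised radius first."""
    found = []
    for name, ra, dec in high_res_sources:
        east, north = offsets_arcsec(low["ra"], low["dec"], ra, dec)
        r2 = ellipse_radius_sq(east, north, low["a"], low["b"], low["pa"])
        if r2 <= 1.0:
            found.append((name, r2))
    return sorted(found, key=lambda item: item[1])


# Invented example values, for illustration only.
blue = {"ra": 180.0, "dec": 0.0, "a": 10.0, "b": 5.0, "pa": 90.0}
red = [
    ("A", 180.002, 0.0),
    ("B", 180.0, 0.001),
    ("C", 180.0, 0.002),
    ("D", 179.999, -0.0005),
]

matches = candidate_counterparts(blue, red)
for name, r2 in matches:
    print(f"{name}: normalised radius squared = {r2:.2f}")
```

With these invented values, three of the four red sources fall inside the blue ellipse, so position alone leaves several plausible counterparts. Ranking by normalised distance is only a starting point; the approach described above goes further by learning which source attributes correlate with being a true counterpart.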

Virtual Observatory data formats

The interoperability of data sets is the key to the Virtual Observatory, so XML has been advocated as a VO exchange format, with the development of VOTable[7]. VOTable is a new XML standard for tabular astronomical datasets, developed under the auspices of the International Virtual Observatory Alliance (IVOA[8]), which is the standards agency for the VO. Fully tagged XML is, of course, verbose and thus inefficient for storing the large datasets often found in astronomy. Researchers in Edinburgh are working on two possible solutions to this problem. The first is BinX[9], an XML schema for binary data, being developed by the edikt project and described in more detail in their flyer. BinX performs the conversion between VOTable and FITS[10], a compact binary data format commonly used in astronomy: this means that astronomers can choose when to store their data in readily interoperable XML and when in compact binary. The second approach, being developed by Peter Buneman’s database research group in the School of Informatics, is to restructure the VOTable file into a vectorized format, which is more compact and can be queried much more quickly. Both these approaches will be developed into working prototypes for AstroGrid.

AstroGrid

WFAU is one of the data centres comprising the AstroGrid consortium, and AstroGrid developers and researchers based in Edinburgh are engaged in the full range of the project’s activities. Some of the first fruits of this labour will be demonstrated on the NeSC stand at All Hands 2003, as well as on the PPARC stand.

References

1. Wide Field Astronomy Unit (WFAU): www-wfau.roe.ac.uk
2. AstroGrid: www.astrogrid.org
3. Astrophysical Virtual Observatory (AVO): www.euro-vo.org
4. edikt: www.edikt.org
5. OGSA-DAI: www.ogsadai.org.uk
6. Amos Storkey’s “junk” detection page: www.anc.ed.ac.uk/~amos/sattrackres.html
7. VOTable: http://cdsweb.u-strasbg.fr/doc/VOTable
8. International Virtual Observatory Alliance (IVOA): www.ivoa.net
9. BinX: www.edikt.org/binx
10. Flexible Image Transport System (FITS): fits.gsfc.nasa.gov