The Australian Virtual Observatory
Clusters and Grids

David Barnes
Astrophysics Group

Overview
• What is a Virtual Observatory?
• Scientific motivation
• International scene
• Australian scene
• DataGrids for VOs
• ComputeGrids for VOs
• Sketch of AVO DataGrid and ComputeGrid
• Clustering experience at Swinburne

What is a Virtual Observatory?
• A Virtual Observatory (VO) is a distributed, uniform interface to the data archives of the world's major astronomical observatories.
• A VO is explored with advanced data-mining and visualisation tools which exploit the unified interface to enable cross-correlation and combined processing of distributed and diverse datasets.
• VOs will rely on, and provide motivation for, the development of national and international computational and data grids.

Scientific motivation
• Understanding of astrophysical processes depends on multi-wavelength observations and input from theoretical models.
• As telescopes and instruments grow in complexity, surveys generate massive databases which require increasing expertise to comprehend.
• Theoretical modelling codes are growing in sophistication and readily consume all available compute time.
• Major advances in astrophysics will be enabled by transparently cross-matching, cross-correlating and inter-processing otherwise disparate data.

[Figure: sample multi-wavelength data for the galaxy IC5332 (Ryan-Weber) – blue and infrared images; H-alpha; HI spectral-line column density, velocity field and velocity dispersion; HI profile from the public release.]

International scene
• AstroGrid (www.uk-vo.org) – phase A (1 yr R&D) complete; phase B (3 yr implementation) funded at £3.7M.
• Astrophysical Virtual Observatory (www.euro-vo.org) – phase A (3 yr R&D) funded at €4.0M.
• National Virtual Observatory (www.usvo.org) – 5 yr framework development funded at USD 10M.

Australian scene
• Australian Virtual Observatory (www.aus-vo.org) – phase A (1 yr common-format archive implementation) funded at AUD 260K (2003 LIEF grant [Melb, Syd, ATNF, AAO]).
• The data archives are:
  – HIPASS: 1.4 GHz continuum and HI spectral-line survey
  – SUMSS: 843 MHz continuum survey
  – S4: digital images of the southern sky in five optical filters
  – ATCA archive: continuum and spectral-line images of the southern sky
  – 2dFGRS: optical spectra of >200K southern galaxies
  – and more...

DataGrids for VOs
• The archives listed on the previous slide range from ~10 GB to ~10 TB in processed (reduced) size.
• Providing just the processed images and spectra on-line requires a distributed, high-bandwidth network of data servers – that is, a DataGrid.
• Users may also want simple operations, such as smoothing or filtering, applied at the data server; this is a Virtual DataGrid (see the sketch below).
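As an illustration of such a server-side operation, here is a minimal sketch (not part of the original slides) that smooths an archived image before it is served, assuming Python with the astropy and scipy libraries available on the data server; the file names and smoothing scale are placeholders.

```python
# Minimal sketch: a simple "Virtual DataGrid" operation applied at the
# data server, so only the processed product travels over the network.
from astropy.io import fits
from scipy.ndimage import gaussian_filter

def smooth_image(in_path, out_path, sigma_pix=2.0):
    """Read a reduced FITS image, convolve it with a Gaussian kernel,
    and write the smoothed result."""
    data, header = fits.getdata(in_path, header=True)
    smoothed = gaussian_filter(data, sigma=sigma_pix)
    fits.writeto(out_path, smoothed, header, overwrite=True)

# Hypothetical file names, for illustration only.
smooth_image("archive_image.fits", "archive_image_smoothed.fits", sigma_pix=3.0)
```

The same pattern extends to filtering or cutout extraction: the user's request names the operation and its parameters, and the heavy I/O stays at the archive.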
ComputeGrids for VOs
• More complex operations requiring significant processing may also be applied:
  – source detection and parameterisation
  – reprocessing of raw or intermediate data products with new calibration algorithms
  – combined processing of raw, intermediate or "final product" data from different archives
• These operations require a distributed, high-bandwidth network of computational nodes – that is, a ComputeGrid.

Possible initial players in the Australian Virtual Observatory Data and Compute Grids…
[Diagram: candidate sites – Parkes?, ATNF/AAO, Sydney, Canberra, Adelaide, Melbourne, Swinburne, Gemini? – linked by GrangeNet, contributing data archives (2dFGRS, RAVE, ATCA, SUMSS, HIPASS), CPU (APAC, VPAC, Swinburne) and theory input.]

Clustering @ Swinburne
• 1998 – 2000: 40 Compaq Alpha workstations
• 2001: +16 Dell dual-PIII rackmount servers
• 2002: +30 Dell dual-P4 workstations
• mid 2002: +60 Dell dual-P4 rackmount servers
• November 2002: placed 180th in the Top500 with 343 sustained Gflop/s (APAC placed 63rd with 825 Gflop/s).
• +30 Dell dual-P4 rackmount servers installed mid 2002 at the Parkes telescope in NSW.
• This forms a pseudo-Grid: data are pre-processed in real time at the telescope, then shipped back in "slow time".

Swinburne activities
• N-body simulation codes:
  – galaxy formation
  – stellar disk astrophysics
  – cosmology
• Pulsar searching and timing (1 GB/min data recording)
• Survey processing as a coarse-grained problem
• Rendering of virtual reality content

Clustering costs
  node configuration                                price/node   price/cpu
  1 cpu, 256 MB std mem, 20 GB disk, ethernet       1.3K         1.3K
  2 cpu, 1 GB fast mem, 20 GB disk, ethernet        4.4K         2.2K
  2 cpu, 2 GB fast mem, 60 GB SCSI disk, ethernet   8.0K         4.0K
  Giganet, Myrinet, ... interconnect (per node)     1.5K         1.5K (1 cpu) / 0.8K (2 cpu)
  (estimates incl. on-site warranty; 2nd-fastest cpu; excl. infrastructure)

Some ideas...
• "desktop cluster"
  – The astro group has 6 dual-cpu workstations.
  – Add MPI, PVM and Nimrod libraries plus the Ganglia monitoring tool to get a loose 12-cpu cluster with 8 GB of memory (a minimal sketch follows this slide).
  – Use MOSIX to provide transparent job migration, with workstations joining the cluster at night-time.
• "pre-purchase cluster"
  – The university buys ~500 desktops/yr – use them for ~6 months!
  – Build up a cluster of desktops purchased ahead of demand, and replace them as they are deployed to desktops.
  – Gain the compute power of new CPUs without any real effect on end-users.
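To make the "desktop cluster" idea concrete, here is a minimal sketch of a coarse-grained survey job distributed with MPI, using mpi4py as an assumed stand-in for the MPI libraries named on the slide; the field count and the process_field body are hypothetical placeholders.

```python
# Minimal sketch: coarse-grained survey processing on a loose cluster.
# Launch one process per cpu, e.g.:  mpirun -np 12 python survey_mpi.py
from mpi4py import MPI

def process_field(field_id):
    # Placeholder for per-field work (source detection, calibration, ...).
    print(f"rank {MPI.COMM_WORLD.Get_rank()}: processing field {field_id}")

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Fields are independent, so the problem is embarrassingly parallel:
# rank r simply takes every size-th field (round-robin assignment).
NUM_FIELDS = 388  # illustrative field count
for field_id in range(NUM_FIELDS):
    if field_id % size == rank:
        process_field(field_id)
```

Because each field is processed independently, the same script runs unchanged whether the "cluster" is 6 workstations joining at night or the full rackmount system.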