PHAT, Pasadena, Dec 4th 2008

Robust Machine Learning Applied to Terascale Astronomical Datasets
Nick Ball
Department of Astronomy
University of Illinois, Urbana-Champaign

Outline
• Intro
• Data
• Tools
• Quasar photo-zs and PDFs
• What next

Motivation
• Current data is already > 100 million objects, > TB file size
• Upcoming data will be > 10 billion objects, PB file size
• We need to cope with this!
• For photo-z, we want PDFs

A Unique Collaboration
• Laboratory for Cosmological Data Mining (LCDM) at NCSA and UIUC Astronomy: Robert Brunner, Nick Ball, Adam Myers
• Automated Learning Group, NCSA: David Tcheng, Xavier Llorà
• This is novel because we are performing data mining, not simulation
• LCDM is a top-20 user of NCSA supercomputing resources

Highlights
• P(G,N,S) for 1.43×10^8 SDSS DR3 objects
• N = neither star nor galaxy, e.g., quasar
• Quasar photo-zs with SDSS and GALEX
• Improved dispersion
• Photo-z PDFs for SDSS QSO, MSG, and LRG
• Vastly reduce catastrophic failures in QSOs
• Ball et al.
2006, 2007, 2008

Photometric Data
• SDSS DR6: 2.5×10^8 u g r i z to r ~ 22
• UKIDSS DR2
• GALEX AIS GR3 NUV, FUV
• COSMOS: ACS, ground-based, GALEX DIS, Spitzer S-COSMOS

Spectroscopic Data
• SDSS: 10^6 galaxies to r < 17.77, 5×10^4 quasars to i < 19, 21, LRGs
• zCOSMOS: 4089 to I_AB < 22.5
• COSMOS deeper training data
• IMACS/MMT quasars: 1334 to I_AB < ~24
• SDSS deeper training data: 2SLAQ, 2QZ, CNOC2, CFRS, DEEP2, MGCz, SDSS Southern, TKRS, VVDS

NCSA Supercomputing
• LCDM has > 10^6 processor hours on NCSA supercomputers
• Xeon Linux Cluster Tungsten (now retired)
• Intel 64 Cluster Abe + GPU cluster Tesla
• Peak performances 16.4, 89.5 TF
• ~100 TB Lustre filesystems
• Access to 5 PB Unitree mass storage system

Machine Learning
• Supervised learning: training set of examples
• Train on spectra and classify photometry
• Trained learner classifies new examples
• Examples include artificial neural networks, decision trees, support vector machines, instance-based learning
• The training set should be representative
• Perform blind test

Instance-Based Learning (kNN)
• Memorize the positions in parameter space of each training object
• For new objects, calculate the weighted average redshift of the k nearest neighbors
• Most of the work is done in the latter stage
• Computationally intensive

Quasar Photometric Redshifts
• We assign photo-zs to 55,746 SDSS DR5 quasars and 7,642 SDSS DR5+GALEX GR2 quasars (i < 19.1)
• We use a CZR and compare it to instance-based learning
• We train on 80%
and blind test on 20%
• This gives blind testing samples of 11,149 for SDSS and 1,528 for SDSS+GALEX

Probability Density Functions
• P(z) generated with single-neighbor NN by perturbing the inputs within the errors
• mag_perturbed = mag + N(0, err), i.e., a Gaussian draw with standard deviation equal to the magnitude error
• Typically produce ~10^3 photo-zs per object
• The distribution of these is the PDF for the object
• Gives fraction of objects within photo-z bins
• For a given dataset, no adjustable parameters: the more perturbations the better

Quasar PDF Results
• Bad photo-zs tend to have multiple peaks
• Often the second peak is correct if the first is not
• But cannot select the correct peak without a spectrum
• PDF spread correlates to true photo-z accuracy
• Can eliminate catastrophic failures by selecting single peak
• Single peak alters the selection function

Figure: zphot vs. zspec, σ = 0.45657. SDSS DR5 quasars: kNN single nearest neighbor

Figure: zmean vs. zspec, σ = 0.34397. Photometric redshifts for SDSS DR5 quasars (mean of PDF). Ball et al. 2008, ApJ 683 12

Figure: z(one peak) vs. zspec, σ = 0.11096. SDSS DR5 quasars with one PDF peak. Ball et al. 2008, ApJ 683 12

Figure: n(1 peak) / n(all) vs. zspec; curves: All, One peak, Frac. one peak (z), Frac. one peak. Alteration of the quasar selection function. Ball et al.
2008, ApJ 683 12

Ongoing Work
• Results for PHAT data
• kNN as kd-tree in C to generate PDFs for ~10^8 objects
• Incorporate missing values, e.g., bad mags in one band or from cross-matched data
• Working with NCSA ISL, throughput of 10^6+ objects per second in ANN mapped to FPGA, limit will be disk I/O
• A multi-output ANN produces a PDF directly in binned redshift

Future Work
• Multiwavelength datasets at low and high z
• Compare low and high z in unified framework
• Move towards petascale via continued terascale improvement and innovative systems, e.g. FPGA, GPU, Cell, NCSA Blue Waters
• Combining classification and photo-z, e.g. P(star formation, AGN)
• Variability and time domain

Problems/Questions
• A ‘PDF’ generated from perturbations is not Bayesian
• Errors on the magnitudes
• Missing values: too faint vs. not in the survey footprint
• How to combine P(z), P(star, galaxy), P(SF, AGN), P(quasar), etc.
• Data distribution: 10 billion PDFs??

Another Hybrid: Semi-Supervised Learning
• A theme in the workshop is combining the best of templates and empirical
• Empirical training is supervised or unsupervised
• Semi-supervised learning has so far been done only for classification in SDSS spectra (Bazell et al. 2005, ApJ 618 723), i.e., very little used
• But photo-z in bins is classification (cf.
the FPGA ANN), thus we have a method for working beyond the spectral regime
• Supervised and unsupervised training can be combined: supervised where there is training data, unsupervised where there is not

Summary
• Illinois Astronomy, NCSA: LCDM group
• Classifications and redshifts via data mining, extensible to petascale
• Photometric redshift PDFs for quasars with kNN using perturbations
• References: ^Ball, N in ADS
• http://nball.astro.uiuc.edu
• http://lcdm.astro.uiuc.edu

Spare slides

Causes of Bad Quasar Photometric Redshifts
• Reddening
• Degeneracy in color-redshift relation
• Emission lines crossing filter edges
• Emission lines simulating other lines

Figure: zmean vs. zspec, MgII (2798.75 Å, Flux = 14.725), crossings of the u g r i z filters marked. Quasar emission lines crossing filters: MgII. Ball et al. 2008, ApJ 683 12

Figure: zmean vs. zspec, all lines. Quasar emission lines crossing filters: all. Ball et al. 2008, ApJ 683 12

Figure: zmean vs. zspec, Lyα (1215.67 Å, Flux = 100), crossings of the u g r i filters marked. Quasar emission lines crossing filters: Lyα. Ball et al. 2008, ApJ 683 12

Figure: zmean vs. zspec, CIV (1549.06 Å, Flux = 25.291), crossings of the u g r i z filters marked. Quasar emission lines crossing filters: CIV. Ball et al. 2008, ApJ 683 12

Figure: zmean vs. zspec, CIII (1908.73 Å, Flux = 15.943), crossings of the u g r i z filters marked. Quasar emission lines crossing filters: CIII. Ball et al. 2008, ApJ 683 12

Figure: zmean vs. zspec, Hα (6564.61 Å, Flux = 30.832), crossings of the i z filters marked. Quasar emission lines crossing filters: Hα. Ball et al. 2008, ApJ 683 12
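
Spare sketch: PDFs from perturbations

The perturbation method on the Probability Density Functions slide can be illustrated with a minimal sketch. It is not the LCDM code. The function names (`nearest_neighbor_z`, `photoz_pdf`, `count_peaks`) and the two-object toy training set are hypothetical, the neighbor search is brute force rather than a kd-tree, the perturbation is assumed to be a zero-mean Gaussian with standard deviation equal to the quoted error, and a "peak" is assumed to be any contiguous run of non-empty redshift bins.

```python
import math
import random


def nearest_neighbor_z(train_x, train_z, query):
    """Redshift of the single training object closest to `query` (Euclidean)."""
    best = min(range(len(train_x)), key=lambda i: math.dist(train_x[i], query))
    return train_z[best]


def photoz_pdf(train_x, train_z, inputs, errs, n_perturb=1000,
               z_max=6.0, n_bins=30, rng=None):
    """Binned photo-z PDF for one object, built by perturbing its inputs.

    Each perturbation adds Gaussian noise (assumed sigma = quoted error) to
    every input, looks up the single nearest neighbor, and histograms the
    resulting redshift. The returned bin fractions sum to 1.
    """
    if rng is None:
        rng = random.Random(0)  # fixed seed so the sketch is repeatable
    width = z_max / n_bins
    counts = [0] * n_bins
    for _ in range(n_perturb):
        perturbed = [m + rng.gauss(0.0, e) for m, e in zip(inputs, errs)]
        z = nearest_neighbor_z(train_x, train_z, perturbed)
        # Clamp to the histogram range [0, z_max].
        counts[min(max(int(z / width), 0), n_bins - 1)] += 1
    return [c / n_perturb for c in counts]


def count_peaks(pdf):
    """Number of contiguous runs of non-empty bins (assumed peak definition)."""
    peaks, inside = 0, False
    for p in pdf:
        if p > 0 and not inside:
            peaks += 1
        inside = p > 0
    return peaks


if __name__ == "__main__":
    # Invented toy training set: two objects in a 2-D input space.
    train_x = [[0.0, 0.0], [1.0, 1.0]]
    train_z = [1.0, 3.0]
    well_constrained = photoz_pdf(train_x, train_z, [0.0, 0.0], [0.01, 0.01])
    ambiguous = photoz_pdf(train_x, train_z, [0.5, 0.5], [0.2, 0.2])
    print(count_peaks(well_constrained), count_peaks(ambiguous))
```

An object sitting on top of one training point with small errors keeps the same neighbor under every perturbation, so its PDF has one peak. An object midway between two training points at different redshifts splits its perturbations between them, which gives the multi-peaked PDF that the Quasar PDF Results slide associates with bad photo-zs. The brute-force search costs one pass over the training set per perturbation, which is why the Ongoing Work slide mentions a kd-tree for ~10^8 objects.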