PHAT, Pasadena, Dec 4th 2008
Robust Machine Learning Applied to Terascale Astronomical Datasets
Nick Ball
Department of Astronomy
University of Illinois, Urbana-Champaign
Outline
• Intro
• Data
• Tools
• Quasar photo-zs and PDFs
• What next
Motivation
• Current data is already > 100 million objects, > TB file size
• Upcoming data will be > 10 billion objects, PB file size
• We need to cope with this!
• For photo-z, we want PDFs
A Unique Collaboration
• Laboratory for Cosmological Data Mining (LCDM) at NCSA and UIUC Astronomy: Robert Brunner, Nick Ball, Adam Myers
• Automated Learning Group, NCSA: David Tcheng, Xavier Llorà
• This is novel because we are performing data mining, not simulation
• LCDM is a top-20 user of NCSA supercomputing resources
Highlights
• P(G,N,S) for 1.43×10^8 SDSS DR3 objects
• N = neither star nor galaxy, e.g., quasar
• Quasar photo-zs with SDSS and GALEX
• Improved dispersion
• Photo-z PDFs for SDSS QSO, MSG, and LRG
• Vastly reduce catastrophic failures in QSOs
• Ball et al. 2006, 2007, 2008
Photometric Data
• SDSS DR6: 2.5×10^8 u g r i z to r ~ 22
• UKIDSS DR2
• GALEX AIS GR3 NUV, FUV
• COSMOS: ACS, ground-based, GALEX DIS, Spitzer S-COSMOS
Spectroscopic Data
• SDSS: 10^6 galaxies to r < 17.77, 5×10^4 quasars to i < 19, 21, LRGs
• zCOSMOS: 4089 to I_AB < 22.5
• COSMOS deeper training data
• IMACS/MMT quasars: 1334 to I_AB < ~24
• SDSS deeper training data: 2SLAQ, 2QZ, CNOC2, CFRS, DEEP2, MGCz, SDSS Southern, TKRS, VVDS
NCSA Supercomputing
• LCDM has > 10^6 processor hours on NCSA supercomputers
• Xeon Linux Cluster Tungsten (now retired)
• Intel 64 Cluster Abe + GPU cluster Tesla
• Peak performances 16.4, 89.5 TF
• ~100 TB Lustre filesystems
• Access to 5 PB Unitree mass storage system
Machine Learning
• Supervised learning: training set of examples
• Train on spectra and classify photometry
• Trained learner classifies new examples
• Examples include artificial neural networks, decision trees, support vector machines, instance-based learning
• The training set should be representative
• Perform blind test
Instance-Based Learning (kNN)
• Memorize the positions in parameter space of each training object
• For new objects, calculate the weighted average redshift of the k nearest neighbors
• Most of the work is done in the latter stage
• Computationally intensive
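The two stages of instance-based learning can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code behind the results in this talk: the Euclidean metric, the inverse-distance weights, the function name `knn_photoz`, and the toy colors and redshifts are all hypothetical choices.

```python
import math

def knn_photoz(train_features, train_z, query, k=3, eps=1e-12):
    """Weighted-average redshift of the k nearest training objects.

    Assumptions (not from the talk): Euclidean distance in feature
    space and inverse-distance weights.
    """
    # "Training" is only memorizing positions; all the work happens here,
    # because every query is compared against every training object.
    dists = [(math.dist(f, query), z) for f, z in zip(train_features, train_z)]
    nearest = sorted(dists)[:k]
    weights = [1.0 / (d + eps) for d, _ in nearest]
    return sum(w * z for w, (_, z) in zip(weights, nearest)) / sum(weights)

# Toy training set: two colors per object, with spectroscopic redshifts.
train_features = [(0.1, 0.2), (0.3, 0.1), (0.9, 0.8), (1.0, 1.1)]
train_z = [0.5, 0.7, 2.0, 2.4]

# The query sits midway between the first two objects, so z_est is close to 0.6.
z_est = knn_photoz(train_features, train_z, query=(0.2, 0.15), k=2)
```

The brute-force scan over the whole training set for every query is what makes the method computationally intensive at large sample sizes.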
Quasar Photometric Redshifts
• We assign photo-zs to 55,746 SDSS DR5 quasars and 7,642 SDSS DR5+GALEX GR2 quasars (i < 19.1)
• We use a CZR (color-redshift relation) and compare it to instance-based learning
• We train on 80% and blind test on 20%
• This gives blind testing samples of 11,149 for SDSS and 1,528 for SDSS+GALEX
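The sample sizes follow directly from the 80/20 split: 20% of 55,746 is 11,149 and 20% of 7,642 is 1,528, rounding down. A minimal sketch of such a split is shown here; the random shuffle, the fixed seed, the rounding rule, and the function name `split_train_blind` are assumptions made for illustration.

```python
import random

def split_train_blind(n_objects, blind_fraction=0.2, seed=0):
    """Randomly split object indices into training and blind-test sets."""
    indices = list(range(n_objects))
    random.Random(seed).shuffle(indices)
    n_blind = int(n_objects * blind_fraction)  # round down (assumed)
    return indices[n_blind:], indices[:n_blind]

# Sample sizes quoted on the slide.
sizes = {}
for name, n in [("SDSS", 55746), ("SDSS+GALEX", 7642)]:
    train, blind = split_train_blind(n)
    sizes[name] = (len(train), len(blind))

# sizes == {"SDSS": (44597, 11149), "SDSS+GALEX": (6114, 1528)}
```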
Probability Density Functions
• P(z) generated with single neighbor NN by perturbing the inputs within the errors
• mag_perturbed = mag + N(mag,err)
• Typically produce ~10^3 photo-zs per object
• The distribution of these is the PDF for the object
• Gives fraction of objects within photo-z bins
• For a given dataset, no adjustable parameters - just the more perturbations the better
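The perturbation procedure can be sketched as follows. This is a toy illustration, not the production code: it reads the perturbation as a Gaussian draw centered on the observed magnitude with standard deviation equal to the quoted error, which is an assumption about the slide's notation. The bin layout, the function names, and the training values are also hypothetical.

```python
import math
import random

def nearest_neighbor_z(train_mags, train_z, mags):
    """Redshift of the single nearest training object (Euclidean distance)."""
    best = min(range(len(train_mags)),
               key=lambda i: math.dist(train_mags[i], mags))
    return train_z[best]

def perturbation_pdf(train_mags, train_z, mags, errs, n_perturb=1000,
                     z_max=6.0, n_bins=30, seed=0):
    """Binned P(z) from the photo-zs of Gaussian-perturbed copies of one object."""
    rng = random.Random(seed)
    counts = [0] * n_bins
    for _ in range(n_perturb):
        # Assumed reading: draw each magnitude from a Gaussian centered on
        # the observed value with sigma equal to its error.
        perturbed = [rng.gauss(m, e) for m, e in zip(mags, errs)]
        z = nearest_neighbor_z(train_mags, train_z, perturbed)
        counts[min(int(z / z_max * n_bins), n_bins - 1)] += 1
    return [c / n_perturb for c in counts]

# Toy training set: two magnitudes per object, in two redshift groups.
train_mags = [(19.0, 18.5), (19.2, 18.6), (20.5, 19.0), (20.7, 19.1)]
train_z = [0.45, 0.55, 2.15, 2.25]

# An object lying between the groups, with large errors.
pdf = perturbation_pdf(train_mags, train_z, mags=(19.8, 18.8), errs=(0.5, 0.3))
```

Because the toy object sits between the two groups, its perturbed copies land on both, and the resulting PDF has probability near z ~ 0.5 and near z ~ 2.2.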
Quasar PDF Results
• Bad photo-zs tend to have multiple peaks
• Often the second peak is correct if the first is not
• But cannot select the correct peak without a spectrum
• Can eliminate catastrophic failures by selecting single peak
• Single peak alters the selection function
• PDF spread correlates with true photo-z accuracy
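Selecting single-peaked objects requires a working definition of a peak in a binned PDF. The talk does not give one, so the sketch below uses a hypothetical rule: a peak is a contiguous run of bins whose probability exceeds a threshold. The threshold value and the function name `count_peaks` are assumptions.

```python
def count_peaks(pdf, min_height=0.05):
    """Count contiguous runs of bins whose probability exceeds min_height."""
    peaks, in_peak = 0, False
    for p in pdf:
        above = p > min_height
        if above and not in_peak:
            peaks += 1  # a new run of high-probability bins starts here
        in_peak = above
    return peaks

# Toy binned PDFs for two objects.
pdfs = {
    "single": [0.0, 0.1, 0.8, 0.1, 0.0, 0.0],
    "double": [0.0, 0.6, 0.0, 0.0, 0.4, 0.0],
}

# Keep only the objects whose PDF has exactly one peak.
one_peak = [name for name, pdf in pdfs.items() if count_peaks(pdf) == 1]
```

With this rule, `one_peak` contains only `"single"`; the cut discards the double-peaked object whether or not its first peak happens to be correct, which is why the selection function changes.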
[Figure: zphot vs. zspec, both axes 0 to 6; σ = 0.45657]
SDSS DR5 quasars: kNN single nearest neighbor
[Figure: zmean vs. zspec, both axes 0 to 6; σ = 0.34397]
Photometric redshifts for SDSS DR5 quasars (mean of PDF)
Ball et al. 2008, ApJ 683 12
[Figure: z_one peak vs. zspec, both axes 0 to 6; σ = 0.11096]
SDSS DR5 quasars with one PDF peak
Ball et al. 2008, ApJ 683 12
[Figure: n_1 peak / n_all (0 to 1) vs. zspec (0 to 6); curves labeled All, One peak, Frac. one peak (z), and Frac. one peak]
Alteration of the quasar selection function
Ball et al. 2008, ApJ 683 12
Ongoing Work
• Results for PHAT data
• Incorporate missing values, e.g., bad magnitudes in one band or from cross-matched data
• Working with NCSA ISL, throughput of 10^6+ objects per second in ANN mapped to FPGA, limit will be disk I/O
• A multi-output ANN produces a PDF directly in binned redshift
• kNN as kd-tree in C to generate PDFs for ~10^8 objects
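A kd-tree speeds up the nearest-neighbor search by skipping whole regions of parameter space that cannot contain a closer match. The implementation mentioned above is in C; the sketch below uses Python only to show the idea, and its tuple-based tree layout, function names, and toy data are assumptions.

```python
import math

def build_kdtree(points, depth=0):
    """Recursively build a kd-tree as nested (point, left, right) tuples."""
    if not points:
        return None
    axis = depth % len(points[0])  # cycle through the feature axes
    points = sorted(points, key=lambda p: p[axis])
    mid = len(points) // 2
    return (points[mid],
            build_kdtree(points[:mid], depth + 1),
            build_kdtree(points[mid + 1:], depth + 1))

def nearest(node, query, depth=0, best=None):
    """Return (distance, point) for the tree point closest to query."""
    if node is None:
        return best
    point, left, right = node
    d = math.dist(point, query)
    if best is None or d < best[0]:
        best = (d, point)
    axis = depth % len(query)
    diff = query[axis] - point[axis]
    near, far = (left, right) if diff < 0 else (right, left)
    best = nearest(near, query, depth + 1, best)
    # Search the far side only if the splitting plane is closer than the
    # best match so far; otherwise that whole subtree is skipped.
    if abs(diff) < best[0]:
        best = nearest(far, query, depth + 1, best)
    return best

# Toy training set: feature tuple -> spectroscopic redshift.
train = {(0.1, 0.2): 0.5, (0.3, 0.1): 0.7, (0.5, 0.5): 1.2,
         (0.9, 0.8): 2.0, (1.0, 1.1): 2.4}
tree = build_kdtree(list(train))

dist, point = nearest(tree, (0.85, 0.75))
z_nn = train[point]  # redshift of the nearest training object
```

The tree is built once and reused for every query, including each perturbed copy of an object when generating a PDF.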
Future Work
• Multiwavelength datasets at low and high z
• Compare low and high z in unified framework
• Variability and time domain
• Move towards petascale via continued terascale improvement and innovative systems, e.g. FPGA, GPU, Cell, NCSA Blue Waters
• Combining classification and photo-z, e.g. P(star formation, AGN)
Problems/Questions
• A ‘PDF’ generated from perturbations is not Bayesian
• Errors on the magnitudes
• How to combine P(z), P(star, galaxy), P(SF, AGN), P(quasar), etc.
• Data distribution: 10 billion PDFs??
• Missing values: too faint vs. not in the survey footprint
Another Hybrid: Semi-Supervised Learning
• A theme in the workshop is combining the best of template and empirical methods
• Empirical training is supervised or unsupervised
• But can combine the two: supervised where there is training data, unsupervised where there is not
• So far done for classification in SDSS spectra (Bazell et al. 2005, ApJ 618 723), i.e., very little used
• But photo-z in bins is classification (cf. the FPGA ANN), thus we have a method for working beyond the spectral regime
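Treating photo-z in bins as classification amounts to mapping each redshift to an integer class label and back. A minimal sketch follows; the redshift range of 0 to 6, the 12 bins, and the function names are assumptions chosen for illustration.

```python
def redshift_to_class(z, z_min=0.0, z_max=6.0, n_bins=12):
    """Map a redshift to an integer bin index, used as the class label."""
    width = (z_max - z_min) / n_bins
    label = int((z - z_min) / width)
    return min(max(label, 0), n_bins - 1)  # clamp to the valid range

def class_to_redshift(label, z_min=0.0, z_max=6.0, n_bins=12):
    """Bin-center redshift for a class label."""
    width = (z_max - z_min) / n_bins
    return z_min + (label + 0.5) * width

# With 12 bins of width 0.5, these redshifts fall in classes 0, 3 and 11.
labels = [redshift_to_class(z) for z in (0.3, 1.7, 5.9)]
z_center = class_to_redshift(3)  # 1.75
```

Any classifier, including a semi-supervised one, can then be trained on the labels in place of the continuous redshifts.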
Summary
• Illinois Astronomy, NCSA: LCDM group
• Classifications and redshifts via data mining, extensible to petascale
• Photometric redshift PDFs for quasars with kNN using perturbations
• References: ^Ball, N in ADS
• http://nball.astro.uiuc.edu
• http://lcdm.astro.uiuc.edu
Spare slides
Causes of Bad Quasar Photometric Redshifts
• Reddening
• Degeneracy in color-redshift relation
• Emission lines crossing filter edges
• Emission lines simulating other lines
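A line with rest wavelength λ_rest is observed at λ_rest × (1 + z), so it crosses a filter edge at z = λ_edge / λ_rest − 1. The sketch below applies this to the rest wavelengths quoted on the spare slides. The filter edge values are rough round numbers assumed for illustration, not the true SDSS bandpass limits, and the function names are hypothetical.

```python
# Rest wavelengths in Angstroms, as quoted on the spare slides.
REST_WAVELENGTH = {"Lya": 1215.67, "CIV": 1549.06, "CIII": 1908.73,
                   "MgII": 2798.75, "Ha": 6564.61}

# ASSUMED approximate boundaries between adjacent SDSS bands, in Angstroms.
FILTER_EDGES = {"u/g": 4000.0, "g/r": 5500.0, "r/i": 7000.0, "i/z": 8500.0}

def crossing_redshift(rest_wavelength, edge_wavelength):
    """Redshift at which a line observed at rest * (1 + z) reaches an edge."""
    return edge_wavelength / rest_wavelength - 1.0

def crossings(line, z_max=6.0):
    """Filter edges that the named line crosses between z = 0 and z_max."""
    rest = REST_WAVELENGTH[line]
    result = {}
    for edge, wavelength in FILTER_EDGES.items():
        z = crossing_redshift(rest, wavelength)
        if 0.0 <= z <= z_max:
            result[edge] = round(z, 2)
    return result

mgii = crossings("MgII")  # MgII crosses all four assumed edges below z = 6
halpha = crossings("Ha")  # H-alpha starts redward of the u/g and g/r edges
```

Under these assumed edges, MgII moves from the u band into the g band near z ≈ 0.43, while Hα only ever crosses the r/i and i/z edges.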
[Figure: zmean vs. zspec, both axes 0 to 6, with filter labels u, g, r, i, z; MgII (2798.75 Å, Flux = 14.725)]
Quasar emission lines crossing filters: MgII
Ball et al. 2008, ApJ 683 12
[Figure: zmean vs. zspec, both axes 0 to 6; All lines]
Quasar emission lines crossing filters: all
Ball et al. 2008, ApJ 683 12
[Figure: zmean vs. zspec, both axes 0 to 6, with filter labels u, g, r, i; Lyα (1215.67 Å, Flux = 100)]
Quasar emission lines crossing filters: Lyα
Ball et al. 2008, ApJ 683 12
[Figure: zmean vs. zspec, both axes 0 to 6, with filter labels u, g, r, i, z; CIV (1549.06 Å, Flux = 25.291)]
Quasar emission lines crossing filters: CIV
Ball et al. 2008, ApJ 683 12
[Figure: zmean vs. zspec, both axes 0 to 6, with filter labels u, g, r, i, z; CIII (1908.73 Å, Flux = 15.943)]
Quasar emission lines crossing filters: CIII
Ball et al. 2008, ApJ 683 12
[Figure: zmean vs. zspec, both axes 0 to 6, with filter labels i, z; Hα (6564.61 Å, Flux = 30.832)]
Quasar emission lines crossing filters: Hα
Ball et al. 2008, ApJ 683 12