Data-Intensive Statistical Challenges in Astrophysics
Alex Szalay
The Johns Hopkins University
Collaborators:
T. Budavari, C-W Yip (JHU),
M. Mahoney (Stanford),
I. Csabai, L. Dobos (Hungary)
The Age of Surveys
CMB Surveys (pixels)
• 1990 COBE 1000
• 2000 Boomerang 10,000
• 2002 CBI 50,000
• 2003 WMAP 1 Million
• 2008 Planck 10 Million
Time Domain
• QUEST
• SDSS Extension survey
• Dark Energy Camera
• Pan-STARRS
• LSST…
Angular Galaxy Surveys (obj)
• 1970 Lick 1M
• 1990 APM 2M
• 2005 SDSS 200M
• 2011 PS1 1000M
• 2020 LSST 30000M
Galaxy Redshift Surveys (obj)
• 1986 CfA 3500
• 1996 LCRS 23000
• 2003 2dF 250000
• 2008 SDSS 1000000
• 2012 BOSS 2000000
• 2012 LAMOST 2500000
Petabytes/year …
Sloan Digital Sky Survey
• “The Cosmic Genome Project”
• Two surveys in one
– Photometric survey in 5 bands
– Spectroscopic redshift survey
• Data is public
– 2.5 Terapixels of images => 5 Tpx
– 10 TB of raw data => 120TB processed
– 0.5 TB catalogs => 35TB in the end
• Started in 1992, finished in 2008
• Extra data volume enabled by
– Moore’s Law
– Kryder’s Law
Analysis of Galaxy Spectra
• Sparse signal in large dimensions
• Much noise, and very rare events
• 4Kx1M SVD problem, perfect for randomized algorithms
• Motivated our work on robust incremental PCA
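The randomized approach mentioned above can be illustrated with a generic random range finder. This is a minimal sketch, not the authors' implementation: the function name `randomized_svd` and the parameters `oversample` and `n_iter` are assumptions chosen for illustration.

```python
import numpy as np

def randomized_svd(X, k, oversample=10, n_iter=2, seed=0):
    """Approximate the top-k SVD of X using a random projection."""
    rng = np.random.default_rng(seed)
    # Project the columns of X onto k + oversample random directions.
    omega = rng.standard_normal((X.shape[1], k + oversample))
    Y = X @ omega
    # A few power iterations sharpen the separation of singular values.
    for _ in range(n_iter):
        Y, _ = np.linalg.qr(Y)
        Y = X @ (X.T @ Y)
    Q, _ = np.linalg.qr(Y)          # orthonormal basis for the sampled range
    # Exact SVD of the small projected matrix.
    Ub, s, Vt = np.linalg.svd(Q.T @ X, full_matrices=False)
    return (Q @ Ub)[:, :k], s[:k], Vt[:k]
```

Only the small projected matrix is decomposed exactly, which is what makes a tall problem such as 4K x 1M tractable.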
Galaxy Properties from Galaxy Spectra
Spectral Lines
Continuum Emissions
Galaxy Diversity from PCA
PC
• 1st [Average Spectrum]
• 2nd [Stellar Continuum]
• 3rd [Finer Continuum Features + Age]
• 4th [Age] Balmer series hydrogen lines
• 5th [Metallicity] Mg b, Na D, Ca II Triplet
Streaming PCA
• Initialization
– Eigensystem of a small, random subset
– Truncate at p largest eigenvalues
• Incremental updates
– Mean and the low-rank A matrix
– SVD of A yields new eigensystem
• Randomized algorithm!
T. Budavari, D. Mishin 2011
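The initialization and update steps above can be sketched as follows. This is an illustration under stated assumptions, not the cited implementation: the weighting `gamma = n / (n + 1)` (a plain running average), the function names, and the exact layout of the low-rank matrix `A` are choices made here.

```python
import numpy as np

def init_eigensystem(X0, p):
    """Eigensystem of a small subset (rows are spectra), truncated at p."""
    mu = X0.mean(axis=0)
    _, s, Vt = np.linalg.svd(X0 - mu, full_matrices=False)
    return mu, Vt[:p].T, s[:p] ** 2 / len(X0)

def update(mu, E, lam, x, gamma):
    """Fold one new spectrum x into the mean and the eigensystem."""
    mu = gamma * mu + (1.0 - gamma) * x
    y = x - mu
    # Low-rank A with A @ A.T = gamma * E diag(lam) E.T + (1 - gamma) * y y.T
    A = np.column_stack([E * np.sqrt(gamma * lam), np.sqrt(1.0 - gamma) * y])
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    p = len(lam)
    return mu, U[:, :p], s[:p] ** 2   # truncate back to p components

def streaming_pca(X, p, n_init=50):
    """Initialize on the first n_init rows, then stream the rest."""
    mu, E, lam = init_eigensystem(X[:n_init], p)
    for n, x in enumerate(X[n_init:], start=n_init):
        mu, E, lam = update(mu, E, lam, x, gamma=n / (n + 1))
    return mu, E, lam
```

Each update costs one SVD of a d x (p+1) matrix, so the full data set never has to be held in memory.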
Robust PCA
• PCA minimizes $\sigma_{\mathrm{RMS}}$ of the residuals $r = y - Py$
– Quadratic formula: $r^2$ extremely sensitive to outliers
• We optimize a robust M-scale $\sigma^2$ (Maronna 2005)
– Implicitly given by
• Fits in with the iterative method!
• Outliers can be processed separately
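As an illustration of why an M-scale resists outliers, the sketch below solves the textbook implicit definition $\frac{1}{N}\sum_n \rho(r_n^2/\sigma^2) = \delta$ by fixed-point iteration. The equation on the slide did not survive, so this form, the bounded loss `rho`, and `delta = 0.5` are all assumptions rather than the authors' choices.

```python
import numpy as np

def rho(t):
    """Bounded loss (assumed): about t for small t, saturating at 1."""
    return t / (1.0 + t)

def m_scale(r2, delta=0.5, tol=1e-12, max_iter=1000):
    """Solve mean(rho(r2 / s2)) = delta for s2 by fixed-point iteration.

    r2 holds squared residual norms; its median must be nonzero.
    """
    r2 = np.asarray(r2, dtype=float)
    s2 = np.median(r2)                      # robust starting value
    for _ in range(max_iter):
        s2_new = s2 * np.mean(rho(r2 / s2)) / delta
        if abs(s2_new - s2) <= tol * s2:
            return s2_new
        s2 = s2_new
    return s2
```

Because `rho` is bounded, a single extreme residual contributes at most 1 to the sum, whereas it contributes its full square to the RMS.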
Eigenvalues in Streaming PCA
[Figure panels: Classic, Robust]
Examples with SDSS Spectra
Built on top of the Incremental Robust PCA
• Principal Component Pursuit (I. Csabai et al)
• Importance sampling (C-W Yip et al)
Principal component pursuit
• Low rank approximation of data matrix: $X$
• Standard PCA: $\min \|X - E\|_2$ subject to $\mathrm{rank}(E) \le k$
– works well if the noise distribution is Gaussian
– outliers can cause bias
• Principal component pursuit: $\min \|A\|_0$ subject to $X = N + A$, $\mathrm{rank}(N) \le k$
– “sparse” spiky noise/outliers: try to minimize the number of outliers while keeping the rank low
– NP-hard problem
• The L1 trick: $\min_{N,A} \|N\|_* + \lambda \|A\|_1$ subject to $X = N + A$
– numerically feasible convex problem (Augmented Lagrange Multiplier)
– $\min_{N,A} \|N\|_* + \lambda \|A\|_1$ subject to $\|X - (N + A)\|_2 \le \varepsilon$
* E. Candes, et al. “Robust Principal Component Analysis”. preprint, 2009.
Abdelkefi et al. ACM CoNEXT Workshop (traffic anomaly detection)
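The equality-constrained convex problem above can be solved with a simple augmented Lagrangian scheme that alternates singular-value thresholding for N and entrywise soft thresholding for A. This is a minimal sketch: the default `lam`, the penalty `mu`, and the stopping rule are commonly used heuristics assumed here, not values taken from the slides, and the noise-tolerant variant with $\varepsilon$ is not implemented.

```python
import numpy as np

def shrink(M, tau):
    """Soft-threshold the entries of M toward zero by tau."""
    return np.sign(M) * np.maximum(np.abs(M) - tau, 0.0)

def svd_shrink(M, tau):
    """Soft-threshold the singular values of M by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * shrink(s, tau)) @ Vt

def pcp(X, lam=None, mu=None, tol=1e-7, max_iter=2000):
    """Split X into low-rank N and sparse A with X = N + A."""
    X = np.asarray(X, dtype=float)
    if lam is None:
        lam = 1.0 / np.sqrt(max(X.shape))       # assumed default weight
    if mu is None:
        mu = X.size / (4.0 * np.abs(X).sum())   # assumed penalty parameter
    A = np.zeros_like(X)
    Y = np.zeros_like(X)                        # Lagrange multipliers
    for _ in range(max_iter):
        N = svd_shrink(X - A + Y / mu, 1.0 / mu)
        A = shrink(X - N + Y / mu, lam / mu)
        R = X - N - A                           # constraint violation
        Y = Y + mu * R
        if np.linalg.norm(R) <= tol * np.linalg.norm(X):
            break
    return N, A
```

For spectra, rows of `X` would be galaxies and columns wavelengths, with `N` expected to capture the continuum and `A` the spiky emission lines.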
Testing on Galaxy Spectra
• Slowly varying continuum + absorption lines
• Highly variable “sparse” emission lines
• This is the simple version of PCP: the positions of the lines are known
• but there are many of them, so automatic detection can be useful
• spiky noise can bias standard PCA
DATA:
• Streaming robust PCA implementation for galaxy spectrum catalog (L. Dobos et al.)
• SDSS 1M galaxy spectra
• Morphological subclasses
• Robust averages + first few PCA directions
[Figure: standard PCA (PCA reconstruction, Residual) vs. principal component pursuit (Low rank, Sparse, Residual); λ=0.6/sqrt(n), ε=0.03]
Not Every Data Direction is Equal
[Figure: A (Galaxy ID × Wavelength) = C (Galaxy ID × Selected Wavelengths) X]
Procedure:
1. Perform SVD of $A = U \Sigma V^T$
2. Pick number of eigenvectors = $K$
3. Calculate Leverage Score = $\sum_{i=1}^{K} \|V^T_{ij}\|^2 / K$
Mahoney and Drineas 2009
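The three-step procedure above can be sketched directly with NumPy. The function names, the sampling without replacement, and the least-squares fit for X are assumptions made for illustration. Rows of `A` are galaxies and columns are wavelengths.

```python
import numpy as np

def leverage_scores(A, k):
    """Normalized leverage score of each column (wavelength) of A."""
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    # Sum the squared entries of the top-k right singular vectors.
    return np.sum(Vt[:k] ** 2, axis=0) / k     # sums to 1 over columns

def sample_columns(A, k, c, seed=0):
    """Draw c distinct columns with probability given by leverage score."""
    rng = np.random.default_rng(seed)
    p = leverage_scores(A, k)
    cols = np.sort(rng.choice(A.shape[1], size=c, replace=False, p=p))
    return cols, A[:, cols]

def cx_decomposition(A, k, c, seed=0):
    """Return (cols, C, X) with A approximated by C @ X."""
    cols, C = sample_columns(A, k, c, seed)
    X = np.linalg.pinv(C) @ A                  # least-squares coefficients
    return cols, C, X
```

Because the scores sum to one, they can be used directly as the wavelength sampling probability.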
Selected Wavelengths
[Figure: Wavelength Sampling Probability vs. Wavelength; panels: k=2, c=7; k=4, c=16; k=6, c=25; k=8, c=29]
Ranking Astronomical Line Indices
Subspace Analysis of Spectra Cutouts:
- Orthogonality
- Divergence
- Commonality
(Worthey et al. 94; Trager et al. 98)
(Yip et al. 2012 in prep.)
Identify Informative Regions
“NewMethod”
1. Pick the λ with largest $P_\lambda$
2. Define its region of influence using $\sum_\lambda P_\lambda$ convergence.
Mask λ’s from future selection.
3. Go back to Step 1, or quit.
“MahoneySecond”
1. Over-select λ’s from the targeted number.
2. Merge selected λ’s if two pixels lie within a certain distance.
3. Quit.
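The “MahoneySecond” steps can be sketched as below. The slides do not say how the over-selection is done, so taking the highest-probability pixels is an assumption here, as are the function name and the parameters `overselect` and `max_gap` (in the same units as the wavelengths).

```python
import numpy as np

def merge_selected(wavelengths, scores, n_target, overselect=10, max_gap=30.0):
    """Over-select pixels by score, then merge close neighbors into regions."""
    wavelengths = np.asarray(wavelengths, dtype=float)
    scores = np.asarray(scores, dtype=float)
    n_pick = min(len(scores), n_target * overselect)
    # Step 1: keep the n_pick highest-scoring pixels, ordered by wavelength.
    picked = np.sort(wavelengths[np.argsort(scores)[-n_pick:]])
    # Step 2: extend the current region while the next pixel is within max_gap.
    regions = [[picked[0], picked[0]]]
    for w in picked[1:]:
        if w - regions[-1][1] < max_gap:
            regions[-1][1] = w
        else:
            regions.append([w, w])
    return [(float(lo), float(hi)) for lo, hi in regions]
```

Each returned pair is the lower and upper wavelength of one merged region.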
Identifying New Line Indices, Objectively
(Yip et al. 2012 in prep.)
New Spectral Regions (MahoneySecond; k = 5;
Overselecting 10 X; Combining if < 30 Å)
NewMethod vs MahoneySecond
[Figure panels: NM, M2]
Angle between Subspaces (Gunawan & Neswan 2000)
[Figure: JHU vs. Lick]
$\sum_\lambda P_\lambda$
[Figure: JHU vs. Lick]
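One common way to quantify the angle between two subspaces is through principal angles. The sketch below uses that standard construction; the definition in the cited reference may differ, and the function name is an assumption.

```python
import numpy as np

def principal_angles(A, B):
    """Principal angles (radians, ascending) between the column spaces of A and B."""
    Qa, _ = np.linalg.qr(A)     # orthonormal basis for span(A)
    Qb, _ = np.linalg.qr(B)     # orthonormal basis for span(B)
    # Singular values of Qa^T Qb are the cosines of the principal angles.
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.arccos(np.clip(cosines, -1.0, 1.0))
```

Angles near π/2 indicate nearly orthogonal subspaces, and angles near 0 indicate shared directions.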
Line Indices for Galaxy Parameter Estimations
Importance Sampling and Galaxies
• Lick indices are ad hoc
• The new indices are objective
– Recover atomic lines
– Recover molecular bands
– Recover Lick indices
– Informative regions are orthogonal to each other, in contrast to Lick
• Future
– Emission line indices
– More accurate parameter estimation of galaxies
Summary
Non-Incremental changes on the way
• Science is moving increasingly from hypothesis-driven to data-driven discoveries
• Need randomized, incremental algorithms
– Best result in 1 min, 1 hour, 1 day, 1 week
• New computational tools and strategies
… not just statistics, not just computer science,
not just astronomy, not just genomics…
Astronomy has always been data-driven…
now becoming more generally accepted