Data-Intensive Statistical Challenges in Astrophysics
Alex Szalay, The Johns Hopkins University
Collaborators: T. Budavari, C-W Yip (JHU), M. Mahoney (Stanford), I. Csabai, L. Dobos (Hungary)

The Age of Surveys

CMB Surveys (pixels)
• 1990 COBE 1000
• 2000 Boomerang 10,000
• 2002 CBI 50,000
• 2003 WMAP 1 Million
• 2008 Planck 10 Million

Time Domain
• QUEST
• SDSS Extension survey
• Dark Energy Camera
• Pan-STARRS
• LSST…

Angular Galaxy Surveys (obj)
• 1970 Lick 1M
• 1990 APM 2M
• 2005 SDSS 200M
• 2011 PS1 1000M
• 2020 LSST 30000M

Galaxy Redshift Surveys (obj)
• 1986 CfA 3500
• 1996 LCRS 23000
• 2003 2dF 250000
• 2008 SDSS 1000000
• 2012 BOSS 2000000
• 2012 LAMOST 2500000

Petabytes/year …

Sloan Digital Sky Survey
• “The Cosmic Genome Project”
• Two surveys in one
  – Photometric survey in 5 bands
  – Spectroscopic redshift survey
• Data is public
  – 2.5 Terapixels of images => 5 Tpx
  – 10 TB of raw data => 120TB processed
  – 0.5 TB catalogs => 35TB in the end
• Started in 1992, finished in 2008
• Extra data volume enabled by
  – Moore’s Law
  – Kryder’s Law

Analysis of Galaxy Spectra
• Sparse signal in large dimensions
• Much noise, and very rare events
• 4Kx1M SVD problem, perfect for randomized algorithms
• Motivated our work on robust incremental PCA

Galaxy Properties from Galaxy Spectra
[Figure labels: Spectral Lines; Continuum Emissions]

Galaxy Diversity from PCA
PC
• 1st [Average Spectrum]
• 2nd [Stellar Continuum]
• 3rd [Finer Continuum Features + Age]
• 4th [Age] Balmer series hydrogen lines
• 5th [Metallicity] Mg b, Na D, Ca II Triplet

Streaming PCA
• Initialization
  – Eigensystem of a small, random subset
  – Truncate at p largest eigenvalues
• Incremental updates
  – Mean and the low-rank A matrix
  – SVD of A yields new eigensystem
• Randomized algorithm!
T. Budavari, D. Mishin 2011

Robust PCA
• PCA minimizes σRMS of the residuals r = y – Py
  – Quadratic formula: r² extremely sensitive to outliers
• We optimize a robust M-scale σ² (Maronna 2005)
  – Implicitly given by [equation not preserved in the transcript]
• Fits in with the iterative method!
• Outliers can be processed separately

Eigenvalues in Streaming PCA
[Figure panels: Classic; Robust]

Examples with SDSS Spectra
Built on top of the Incremental Robust PCA
• Principal Component Pursuit (I. Csabai et al)
• Importance sampling (C-W Yip et al)

Principal component pursuit
• Low rank approximation of data matrix: X
• Standard PCA: $\min \lVert X - E \rVert_2$ subject to $\mathrm{rank}(E) \le k$
  – works well if the noise distribution is Gaussian
  – outliers can cause bias
• Principal component pursuit: $\min \lVert A \rVert_0$ subject to $X = N + A$, $\mathrm{rank}(N) \le k$
  – “sparse” spiky noise/outliers: try to minimize the number of outliers while keeping the rank low
  – NP-hard problem
• The L1 trick: $\min_{N,A} \lVert N \rVert_* + \lambda \lVert A \rVert_1$ subject to $X = N + A$
  – numerically feasible convex problem (Augmented Lagrange Multiplier)
  – $\min_{N,A} \lVert N \rVert_* + \lambda \lVert A \rVert_1$ subject to $\lVert X - (N + A) \rVert_2 \le \varepsilon$
* E. Candes, et al. “Robust Principal Component Analysis”. preprint, 2009.
Abdelkefi et al. ACM CoNEXT Workshop (traffic anomaly detection)

Testing on Galaxy Spectra
• Slowly varying continuum + absorption lines
• Highly variable “sparse” emission lines
• This is the simple version of PCP: the positions of the lines are known
• but there are many of them, automatic detection can be useful
• spiky noise can bias standard PCA
DATA:
Streaming robust PCA implementation for galaxy spectrum catalog (L. Dobos et al.)
SDSS 1M galaxy spectra
Morphological subclasses
Robust averages + first few PCA directions
[Figure panels: PCA (PCA reconstruction, Residual); Principal component pursuit (Low rank, Sparse, Residual); λ = 0.6/sqrt(n), ε = 0.03]

Not Every Data Direction is Equal
[Figure: A = C X; axes: Galaxy ID, Wavelength, Selected Wavelengths]
Procedure:
1. Perform SVD of $A = U \Sigma V^T$
2. Pick number of eigenvectors = K
3. Calculate Leverage Score = $\sum_i \lVert V^T_{ij} \rVert^2 / K$
Mahoney and Drineas 2009

Selected Wavelengths
[Figure: Sampling Probability vs. Wavelength; panels: k=2, c=7; k=4, c=16; k=6, c=25; k=8, c=29]

Ranking Astronomical Line Indices
Subspace Analysis of Spectra Cutouts:
- Orthogonality
- Divergence
- Commonality
(Worthey et al. 94; Trager et al. 98)
(Yip et al. 2012 in prep.)

Identify Informative Regions
“NewMethod”
1. Pick the λ with largest Pλ
2. Define its region of influence using λ Pλ convergence. Mask λ’s from future selection.
3. Go back to Step 1, or quit.
“MahoneySecond”
1. Over-select λ’s from the targeted number.
2. Merge selected λ if two pixels lie within a certain distance
3. Quit.

Identifying New Line Indices, Objectively
(Yip et al. 2012 in prep.)

New Spectral Regions
(MahoneySecond; k = 5; Overselecting 10 X; Combining if < 30 Å)

NewMethod vs MahoneySecond
[Figure: NM vs. M2; Angle between Subspaces (Gunawan & Neswan 2000); JHU vs. Lick; λ Pλ]

Line Indices for Galaxy Parameter Estimations

Importance Sampling and Galaxies
• Lick indices are ad hoc
• The new indices are objective
  – Recover atomic lines
  – Recover molecular bands
  – Recover Lick indices
  – Informative regions are orthogonal to each other, in contrast to Lick
• Future
  – Emission line indices
  – More accurate parameter estimation of galaxies

Summary
Non-Incremental changes on the way
• Science is moving increasingly from hypothesis-driven to data-driven discoveries
• Need randomized, incremental algorithms
  – Best result in 1 min, 1 hour, 1 day, 1 week
• New computational tools and strategies
… not just statistics, not just computer science, not just astronomy, not just genomics…
Astronomy has always been data-driven…. now becoming more generally accepted
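
Sketch for the "Streaming PCA" slide. The slide lists an initialization from a small subset and an incremental update through the mean and a low-rank A matrix whose SVD yields the new eigensystem. The following is a minimal illustration of one way such an update can be written, not the authors' implementation. The function names, the forgetting factor `gamma`, and its default value are assumptions made for this example; the robust reweighting from the "Robust PCA" slide is left out.

```python
import numpy as np

def init_eigensystem(X0, p):
    """Eigensystem of a small initial subset, truncated at the p largest eigenvalues."""
    mu = X0.mean(axis=0)
    # Right singular vectors of the centered subset are covariance eigenvectors.
    _, s, Vt = np.linalg.svd(X0 - mu, full_matrices=False)
    E = Vt[:p].T                      # (d, p) eigenvectors as columns
    lam = s[:p] ** 2 / X0.shape[0]    # matching eigenvalues
    return mu, E, lam

def update(mu, E, lam, x, gamma=0.99):
    """One incremental step for a new observation x (gamma is an assumed forgetting factor)."""
    mu = gamma * mu + (1.0 - gamma) * x
    y = x - mu
    # Low-rank factor A with A @ A.T == gamma * E diag(lam) E.T + (1 - gamma) * y y.T
    A = np.hstack([np.sqrt(gamma) * E * np.sqrt(lam),
                   np.sqrt(1.0 - gamma) * y[:, None]])
    U, s, _ = np.linalg.svd(A, full_matrices=False)
    p = E.shape[1]
    # Left singular vectors of A are the new eigenvectors; truncate back to p.
    return mu, U[:, :p], s[:p] ** 2
```

Each step costs one SVD of a d × (p + 1) matrix, so the full data matrix never has to be held in memory. A typical loop would call `init_eigensystem` on the first few spectra and then `update` once per incoming spectrum.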
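
Sketch for the "Principal component pursuit" slide. The slide states the convex problem (nuclear norm of N plus λ times the L1 norm of A, subject to X = N + A) and names the Augmented Lagrange Multiplier method as the solver. Below is a small illustration of a standard ALM-style iteration for the equality-constrained form, alternating singular value thresholding and elementwise soft thresholding. It is a sketch under stated assumptions, not the code used for the SDSS spectra. The function name `pcp`, the default λ = 1/sqrt(max(m, n)), the penalty `mu`, and the stopping rule are assumptions taken from common practice; the slides quote λ = 0.6/sqrt(n) for their own runs.

```python
import numpy as np

def pcp(X, lam=None, mu=None, tol=1e-7, max_iter=1000):
    """Split X into low-rank N and sparse A with X ~= N + A (hypothetical helper)."""
    m, n = X.shape
    if lam is None:
        lam = 1.0 / np.sqrt(max(m, n))          # assumed default weight
    if mu is None:
        mu = m * n / (4.0 * np.abs(X).sum())    # assumed penalty parameter
    A = np.zeros_like(X, dtype=float)
    Y = np.zeros_like(X, dtype=float)           # Lagrange multipliers
    norm_X = np.linalg.norm(X)
    for _ in range(max_iter):
        # Low-rank step: shrink singular values by 1/mu.
        U, s, Vt = np.linalg.svd(X - A + Y / mu, full_matrices=False)
        N = (U * np.maximum(s - 1.0 / mu, 0.0)) @ Vt
        # Sparse step: elementwise soft threshold at lam/mu.
        R = X - N + Y / mu
        A = np.sign(R) * np.maximum(np.abs(R) - lam / mu, 0.0)
        # Multiplier update on the constraint residual.
        Z = X - N - A
        Y = Y + mu * Z
        if np.linalg.norm(Z) <= tol * norm_X:
            break
    return N, A
```

For spectra, X would hold one spectrum per row, N would carry the slowly varying continuum, and A would pick up the spiky emission lines. The relaxed constraint with ε on the slide is not implemented here.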
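
Sketch for the "Not Every Data Direction is Equal" procedure. The three steps on that slide (SVD, pick K eigenvectors, compute the leverage score per wavelength) can be written in a few lines. The function names and the sampling helper are assumptions for illustration; in particular, drawing c distinct columns with `Generator.choice` is a simplification and may differ from the sampling scheme used in the cited work.

```python
import numpy as np

def leverage_scores(A, k):
    """Leverage score of each column of A (rows: galaxies, columns: wavelengths)."""
    # Step 1: SVD of A = U S V^T
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    # Steps 2 and 3: keep the top k right singular vectors and
    # average their squared entries per column; the scores sum to 1.
    return (Vt[:k] ** 2).sum(axis=0) / k

def sample_columns(A, k, c, rng):
    """Draw c distinct column indices with probability given by the scores."""
    p = leverage_scores(A, k)
    cols = rng.choice(A.shape[1], size=c, replace=False, p=p)
    return np.sort(cols), p
```

Because the scores form a probability distribution over wavelengths, they can be plotted directly as the sampling probability shown on the "Selected Wavelengths" slide.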