Computational Mathematics for Large-scale Data Analysis

Alexander Gray
Georgia Institute of Technology, College of Computing
FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory

The FASTlab (Fundamental Algorithmic and Statistical Tools Laboratory)
1. Arkadas Ozakin: Research scientist, PhD Theoretical Physics
2. Dong Ryeol Lee: PhD student, CS + Math
3. Ryan Riegel: PhD student, CS + Math
4. Parikshit Ram: PhD student, CS + Math
5. William March: PhD student, Math + CS
6. James Waters: PhD student, Physics + CS
7. Hua Ouyang: PhD student, CS
8. Sooraj Bhat: PhD student, CS
9. Ravi Sastry: PhD student, CS
10. Long Tran: PhD student, CS
11. Michael Holmes: PhD student, CS + Physics (co-supervised)
12. Nikolaos Vasiloglou: PhD student, EE (co-supervised)
13. Wei Guan: PhD student, CS (co-supervised)
14. Nishant Mehta: PhD student, CS (co-supervised)
15. Wee Chin Wong: PhD student, ChemE (co-supervised)
16. Abhimanyu Aditya: MS student, CS
17. Yatin Kanetkar: MS student, CS
18. Praveen Krishnaiah: MS student, CS
19. Devika Karnik: MS student, CS
20. Prasad Jakka: MS student, CS

Exponential growth in dataset sizes
[Figure: dataset sizes vs. year, 1985-2010, log scale; from Science, Szalay & J. Gray, 2001]

Data: Local Redshift Surveys
• 1986 CfA: 3,500
• 1996 LCRS: 23,000
• 2003 2dF: 250,000
• 2005 SDSS: 800,000

Data: Angular Surveys
• 1970 Lick: 1M
• 1990 APM: 2M
• 2005 SDSS: 200M
• 2010 LSST: 2B

Data: CMB Maps
• 1990 COBE: 1,000
• 2000 Boomerang: 10,000
• 2002 CBI: 50,000
• 2003 WMAP: 1 Million
• 2008 Planck: 10 Million

The astrophysicist's questions:
1. How did galaxies evolve?
2. What was the early universe like?
3. Does dark energy exist?
4. Is our model (GR+inflation) right?

Collaborators: R. Nichol (Inst. Cosmol. Gravitation), A. Connolly (U. Pitt Physics), C. Miller (NOAO), R. Brunner (NCSA), G. Djorgovsky (Caltech), G. Kulkarni (Inst. Cosmol. Gravitation), D. Wake (Inst. Cosmol. Gravitation), R. Scranton (U. Pitt Physics), M. Balogh (U. Waterloo Physics), I. Szapudi (U. Hawaii Inst. Astronomy), G. Richards (Princeton Physics), A. Szalay (Johns Hopkins Physics)

The machine learning / statistics guy's tools, with their naive costs:
• Kernel density estimator: O(N^2)
• n-point spatial statistics: O(N^n)
• Nonparametric Bayes classifier: O(N^2)
• Support vector machine: O(N^3)
• Nearest-neighbor statistics: O(N^2)
• Gaussian process regression: O(N^3)
• Hierarchical clustering: O(N^3)

The astrophysicist: "But I have 1 million points!"
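That complaint is easy to make concrete. A kernel density estimate evaluates a kernel between every query point and every reference point, which is where the O(N^2) above comes from. A minimal naive sketch, assuming a spherical Gaussian kernel (the function and variable names are ours, for illustration only):

```python
import numpy as np

def naive_kde(queries, references, h):
    """Naive Gaussian KDE: one kernel evaluation per (query, reference) pair,
    so M queries against N references cost O(M * N)."""
    d = queries.shape[1]
    norm = (2.0 * np.pi * h**2) ** (-d / 2.0)     # Gaussian normalizing constant
    densities = np.empty(len(queries))
    for i, q in enumerate(queries):               # outer loop over queries
        sq_dists = np.sum((references - q) ** 2, axis=1)  # pass over all references
        densities[i] = norm * np.mean(np.exp(-0.5 * sq_dists / h**2))
    return densities
```

At N = 10^6 points, evaluating the density at every point means about 10^12 kernel evaluations; that wall is what the rest of the talk is about.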
Uncertainty quantification
There are many types, and all of them generally add computational expense:
• Uncertainty of the statistical model: cross-validation, bootstrap
• Uncertainty of a simulation: run it many times, estimating a model each time
• Uncertainty of simulation vs. real data: expensive spatial statistics

How to do Stats/Machine Learning on Very Large Survey Datasets?
1. Choose the appropriate statistical task and method for the scientific question.
2. Use the fastest algorithm and data structure for the statistical method.
3. Apply the method to the data!

10 data analysis problems, and scalable tools we'd like for them

1. Querying (e.g. characterizing a region of space):
• spherical range-search: O(N)
• orthogonal range-search: O(N)
• k-nearest-neighbors: O(N)
• all-k-nearest-neighbors: O(N^2)

2. Density estimation (e.g. comparing galaxy types):
• mixture of Gaussians
• kernel density estimation: O(N^2)
• L2 density tree [Ram and Gray, in prep]
• manifold kernel density estimation: O(N^3) [Ozakin and Gray 2008, to be submitted]
• hyper-kernel density estimation: O(N^4) [Sastry and Gray 2008, submitted]

3. Regression (e.g. photometric redshifts; see the sketch after this list):
• linear regression: O(D^2)
• kernel regression: O(N^2)
• Gaussian process regression (kriging): O(N^3)

4. Classification (e.g. quasar detection, star-galaxy separation):
• k-nearest-neighbor classifier: O(N^2)
• nonparametric Bayes classifier: O(N^2)
• support vector machine (SVM): O(N^3)
• non-negative SVM: O(N^3) [Guan and Gray, in prep]
• false-positive-limiting SVM: O(N^3) [Sastry and Gray, in prep]
• separation map: O(N^3) [Vasiloglou, Gray, and Anderson 2008, submitted]

5. Dimension reduction (e.g. galaxy or spectra characterization):
• principal component analysis: O(D^2)
• non-negative matrix factorization
• kernel PCA: O(N^3)
• maximum variance unfolding: O(N^3)
• co-occurrence embedding: O(N^3) [Ozakin and Gray, in prep]
• rank-based manifolds: O(N^3) [Ouyang and Gray 2008, ICML]
• isometric non-negative matrix factorization: O(N^3) [Vasiloglou, Gray, and Anderson 2008, submitted]

6. Outlier detection (e.g. new object types, data cleaning):
• by density estimation or by dimension reduction
• by robust Lp estimation [Ram, Riegel and Gray, in prep]

7. Clustering (e.g. automatic Hubble sequence):
• by dimension reduction or by density estimation
• k-means
• mean-shift segmentation: O(N^2)
• hierarchical clustering ("friends-of-friends"): O(N^3)

8. Time series analysis (e.g. asteroid tracking, variable objects):
• Kalman filter: O(D^2)
• hidden Markov model: O(D^2)
• trajectory tracking: O(N^n)
• Markov matrix factorization [Tran, Wong, and Gray 2008, submitted]
• functional independent component analysis [Mehta and Gray 2008, submitted]

9. Feature selection and causality (e.g. which features predict star vs. galaxy):
• LASSO regression
• L1 SVM
• Gaussian graphical model inference and structure search
• discrete graphical model inference and structure search
• 0-1 feature-selecting SVM [Guan and Gray, in prep]
• L1 Gaussian graphical model inference and structure search [Tran, Lee, Holmes, and Gray, in prep]

10. 2-sample testing and matching (e.g. cosmological validation, matching multiple surveys):
• minimum spanning tree: O(N^3)
• n-point correlation: O(N^n)
• bipartite matching / Gaussian graphical model inference: O(N^3) [Waters and Gray, in prep]
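As an example of where these costs come from, the O(N^3) for Gaussian process regression in item 3 is the factorization of the N x N kernel matrix. A minimal sketch of the predictive mean, assuming a squared-exponential kernel (the names and the jitter value are our illustrative choices, not from the talk):

```python
import numpy as np

def gp_mean(X, y, X_star, h=1.0, noise=1e-3):
    """Gaussian process regression predictive mean. The Cholesky factorization
    of the N x N kernel matrix is the O(N^3) bottleneck."""
    def kern(A, B):   # squared-exponential kernel matrix between two point sets
        sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-0.5 * sq / h**2)
    K = kern(X, X) + noise * np.eye(len(X))  # N x N covariance, jitter for stability
    L = np.linalg.cholesky(K)                # O(N^3): the scaling wall
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    return kern(X_star, X) @ alpha           # predictive mean at the test points
```

For N = 10^6, the factorization alone is on the order of 10^18 floating-point operations, which is why the O(N^3) entries above are the sore points.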
How to do Stats/Machine Learning on Very Large Survey Datasets?
1. Choose the appropriate statistical task and method for the scientific question.
2. Use the fastest algorithm and data structure for the statistical method.
3. Apply the method to the data!

Core computational problems
What are the basic mathematical operations, or bottleneck subroutines, that we can focus on developing fast algorithms for? Aggregations, GNPs (generalized N-body problems), graphs, linear algebra, optimization, integration:
• Querying: nearest-neighbor, spherical range-search, orthogonal range-search, all-nearest-neighbors
• Density estimation: kernel density estimation, mixture of Gaussians
• Regression: linear regression, kernel regression, Gaussian process regression
• Classification: nearest-neighbor classifier, nonparametric Bayes classifier, support vector machine
• Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA, maximum variance unfolding
• Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
• Clustering: k-means, mean-shift, hierarchical clustering ("friends-of-friends"), by dimension reduction, by density estimation
• Time series analysis: Kalman filter, hidden Markov model, trajectory tracking
• Feature selection and causality: LASSO regression, L1 support vector machine, Gaussian graphical models, discrete graphical models
• 2-sample testing and matching: n-point correlation, bipartite matching

Simple example: spherical range search. How can we compute this efficiently?

kd-trees: the most widely-used space-partitioning tree [Bentley 1975], [Friedman, Bentley & Finkel 1977], [Moore & Lee 1995]
[Figure sequence: building a kd-tree, levels 1 through 6]

Range-count recursive algorithm
[Figure sequence: the recursion descends the kd-tree; a node whose bounding box lies entirely inside the query sphere is pruned by inclusion (its whole count is added at once), and a node whose box lies entirely outside is pruned by exclusion (it is skipped).]
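The pictured recursion compresses to a few lines of code. Below is a sketch of a median-split kd-tree and the range-count recursion with both prunes; the leaf size, class name, and function names are our illustrative choices:

```python
import numpy as np

class Node:
    """kd-tree node: bounding box, point count, and (at leaves) the points
    themselves plus their original indices (used by the later sketches)."""
    def __init__(self, pts, idx=None, leaf_size=32):
        idx = np.arange(len(pts)) if idx is None else idx
        self.lo, self.hi = pts.min(axis=0), pts.max(axis=0)
        self.count, self.idx = len(pts), idx
        if len(pts) <= leaf_size:                   # small enough: make a leaf
            self.pts, self.left, self.right = pts, None, None
        else:                                       # split widest dimension at median
            d = int(np.argmax(self.hi - self.lo))
            order = np.argsort(pts[:, d])
            m = len(pts) // 2
            self.pts = None
            self.left = Node(pts[order[:m]], idx[order[:m]], leaf_size)
            self.right = Node(pts[order[m:]], idx[order[m:]], leaf_size)

def range_count(node, q, r):
    """Count points within distance r of query q."""
    dmin = np.linalg.norm(q - np.clip(q, node.lo, node.hi))  # to nearest box point
    dmax = np.linalg.norm(np.maximum(np.abs(q - node.lo),
                                     np.abs(node.hi - q)))   # to farthest box corner
    if dmin > r:             # exclusion prune: box entirely outside the sphere
        return 0
    if dmax <= r:            # inclusion prune: box entirely inside the sphere
        return node.count
    if node.pts is not None:                                 # leaf: test points directly
        return int(np.sum(np.linalg.norm(node.pts - q, axis=1) <= r))
    return range_count(node.left, q, r) + range_count(node.right, q, r)
```

Building the tree once is O(N log N); each count then only descends the nodes whose boxes straddle the sphere's boundary.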
This is the fastest practical algorithm for the problem [Bentley 1975], and our algorithms can use any such tree.

How to do Stats/Machine Learning on Very Large Survey Datasets?
1. Choose the appropriate statistical task and method for the scientific question.
2. Use the fastest algorithm and data structure for the statistical method.
3. Apply the method to the data!

All-k-nearest-neighbors
• Applications: querying; the first step of adaptive density estimation [Budavari et al., in prep]
• Naive cost: O(N^2)
• Fastest algorithm: generalized fast multipole method, a.k.a. multi-tree methods
– Technique: use two trees simultaneously
– [Gray and Moore 2001, NIPS; Riegel, Boyer and Gray, to be submitted]
– O(N), exact

Kernel density estimation
• Applications: accurate density estimates, e.g. galaxy types [Balogh et al. 2004]
• Naive cost: O(N^2)
• Fastest algorithm: dual-tree fast Gauss transform, Monte Carlo multipole method
– Technique: Hermite operators, Monte Carlo
– [Lee, Gray and Moore 2005, NIPS; Lee and Gray 2006, UAI; Holmes, Gray and Isbell 2007, NIPS; Lee and Gray 2008, NIPS]
– O(N), relative error guarantee

Nonparametric Bayes classification (or KDA)
• Applications: accurate classification, e.g. quasar detection [Richards et al. 2004, 2009; Scranton et al. 2005]
• Naive cost: O(N^2)
• Fastest algorithm: dual-tree bounding with hybrid tree expansion
– Technique: bound each posterior
– [Liu, Moore, and Gray 2004; Gray and Riegel 2004, CompStat; Riegel and Gray 2007, SDM]
– O(N), exact

Friends-of-friends
• Applications: cluster-finding, 2-sample testing
• Naive cost: O(N^3)
• Fastest algorithm: dual-tree Boruvka algorithm
– Technique: similar to all-k-nearest-neighbors
– [March and Gray, to be submitted]
– O(N log N), exact

n-point correlations
• Applications: 2-sample testing, e.g. AGN [Wake et al. 2004], ISW [Giannantonio et al. 2006], 3-point validation [Kulkarni et al. 2007]
• Naive cost: O(N^n)
• Fastest algorithm: n-tree algorithms
– Technique: use n trees simultaneously
– [Gray and Moore 2001, NIPS; Moore et al. 2001, Mining the Sky; Waters, Riegel and Gray, to be submitted]
– O(N^log n), exact

3-point runtime (biggest previous: 20K)
VIRGO simulation data, N = 75,000,000
• naive: 5x10^9 sec (~150 years)
• multi-tree: 55 sec (exact)
Scaling in n: n=2: O(N); n=3: O(N^log 3); n=4: O(N^2)
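The "use two trees simultaneously" technique behind the all-nearest-neighbors result above can also be sketched. Reusing the Node class from the range-count sketch, here is a simplified k = 1 version; real implementations cache per-node bounds rather than recomputing them, and the names are again our own:

```python
import numpy as np

def box_min_dist(a, b):
    """Smallest possible distance between points in two bounding boxes."""
    gap = np.maximum(0.0, np.maximum(a.lo - b.hi, b.lo - a.hi))
    return float(np.linalg.norm(gap))

def all_nn(q, r, best):
    """Dual-tree all-nearest-neighbors: best[i] holds the current upper bound
    on query point i's nearest-neighbor distance."""
    if box_min_dist(q, r) > best[q.idx].max():   # prune: pair can't improve anyone
        return
    if q.pts is not None and r.pts is not None:  # leaf-leaf: brute force the block
        d = np.linalg.norm(q.pts[:, None, :] - r.pts[None, :, :], axis=2)
        d[d == 0] = np.inf                       # ignore self-matches (and duplicates)
        best[q.idx] = np.minimum(best[q.idx], d.min(axis=1))
    elif q.pts is None:                          # descend the query tree
        all_nn(q.left, r, best)
        all_nn(q.right, r, best)
    else:                                        # descend references, nearer child first
        for child in sorted((r.left, r.right), key=lambda c: box_min_dist(q, c)):
            all_nn(q, child, best)

# Usage: root = Node(X); best = np.full(len(X), np.inf); all_nn(root, root, best)
```

Visiting the nearer reference child first tightens the bounds early, which is what makes the prune fire often enough to approach, in favorable settings, the O(N) behavior claimed above.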
Now fast!
(Legend from the slide: each family below was marked either "very fast" or "as fast as possible (conjecture)".)
• Querying: nearest-neighbor, spherical range-search, orthogonal range-search, all-nearest-neighbors
• Density estimation: kernel density estimation, mixture of Gaussians
• Regression: linear regression, kernel regression, Gaussian process regression
• Classification: nearest-neighbor classifier, nonparametric Bayes classifier, support vector machine
• Dimension reduction: principal component analysis, non-negative matrix factorization, kernel PCA, maximum variance unfolding
• Outlier detection: by robust L2 estimation
• Clustering: k-means, mean-shift, hierarchical clustering ("friends-of-friends")
• Time series analysis: Kalman filter, hidden Markov model, trajectory tracking
• Feature selection and causality: LASSO regression, L1 support vector machine, Gaussian graphical models, discrete graphical models
• 2-sample testing and matching: n-point correlation, bipartite matching

The end
• You can now use many of the state-of-the-art data analysis methods on large datasets:
– MLPACK software
– CAS Analytics in SQL Server
• Computational astronomy workshop and large-scale data analysis workshop coming soon

Alexander Gray, [email protected] (email is best; webpage sorely out of date)

Appendix: how do we characterize an entire distribution?

2-point correlation: "How many pairs have distance < r?" The pair count is
\[ \sum_{i}^{N} \sum_{j \ne i}^{N} I(\|x_i - x_j\| < r), \]
which estimates the 2-point correlation function at scale r.

The n-point correlation functions
• Spatial inferences: filaments, clusters, voids, homogeneity, isotropy, 2-sample testing, ...
• Foundation for the theory of point processes [Daley, Vere-Jones 1972]; unifies spatial statistics [Ripley 1976]
• Used heavily in biostatistics, cosmology, particle physics, statistical physics

2pcf definition: \( dP = \lambda^2 \, dV_1 \, dV_2 \, [1 + \xi(r)] \)
3pcf definition: \( dP = \lambda^3 \, dV_1 \, dV_2 \, dV_3 \, [1 + \xi(r_{12}) + \xi(r_{23}) + \xi(r_{13}) + \zeta(r_{12}, r_{23}, r_{13})] \)
Standard model (Gaussian fluctuations): the n > 2 terms should be zero!

[Figure: a triangle configuration with side lengths r_1, r_2, r_3]

3-point correlation: "How many triples have pairwise distances < r?" The triple count is
\[ \sum_{i}^{N} \sum_{j \ne i}^{N} \sum_{k \ne j, k \ne i}^{N} I(\delta_{ij} < r_1) \, I(\delta_{jk} < r_2) \, I(\delta_{ki} < r_3). \]

How can we count n-tuples efficiently? Use n trees! [Gray & Moore, NIPS 2000]

"How many valid triangles a-b-c (where a ∈ A, b ∈ B, c ∈ C) could there be?" The recursion over three tree nodes A, B, C:
• In general, split one node and recurse: count{A,B,C} = count{A,B,C.left} + count{A,B,C.right}
• Exclusion: if some pair of bounding boxes can never realize its required distance, count{A,B,C} = 0
• Inclusion: if every pair of boxes always satisfies its required distance, count{A,B,C} = |A| × |B| × |C|
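As code, the three-tree recursion just described looks like the following sketch. It reuses Node and box_min_dist from the earlier sketches, adds the matching upper bound, and assumes A, B, C are three distinct point sets with a single threshold r; a same-set count would also have to exclude coincident indices:

```python
import numpy as np

def box_max_dist(a, b):
    """Largest possible distance between points in two bounding boxes."""
    span = np.maximum(a.hi - b.lo, b.hi - a.lo)
    return float(np.linalg.norm(span))

def triangle_count(A, B, C, r):
    """Count triples (a, b, c) with all pairwise distances < r."""
    pairs = ((A, B), (B, C), (A, C))
    if any(box_min_dist(x, y) >= r for x, y in pairs):   # exclusion: count is 0
        return 0
    if all(box_max_dist(x, y) < r for x, y in pairs):    # inclusion: all triples valid
        return A.count * B.count * C.count
    internal = [n for n in (A, B, C) if n.pts is None]
    if not internal:                                     # three leaves: brute force
        ab = (np.linalg.norm(A.pts[:, None] - B.pts[None, :], axis=2) < r).astype(np.int64)
        bc = (np.linalg.norm(B.pts[:, None] - C.pts[None, :], axis=2) < r).astype(np.int64)
        ac = (np.linalg.norm(A.pts[:, None] - C.pts[None, :], axis=2) < r).astype(np.int64)
        return int(np.einsum('ab,bc,ac->', ab, bc, ac))  # sum over all (a,b,c) products
    node = max(internal, key=lambda n: n.count)          # split the largest node
    if node is A:
        return sum(triangle_count(h, B, C, r) for h in (A.left, A.right))
    if node is B:
        return sum(triangle_count(A, h, C, r) for h in (B.left, B.right))
    return sum(triangle_count(A, B, h, r) for h in (C.left, C.right))
```

Generalizing to per-pair thresholds (r_12, r_23, r_13), as in the 3pcf definition above, just means checking each pair of boxes against its own radius.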