Fast Statistical Algorithms in Databases
Alexander Gray
Georgia Institute of Technology
College of Computing
FASTlab: Fundamental Algorithmic and Statistical Tools Laboratory
The FASTlab
Fundamental Algorithmic and Statistical Tools Laboratory
1. Arkadas Ozakin: Research scientist, PhD Theoretical Physics
2. Nikolaos Vasiloglou: Research scientist, PhD Elec Comp Engineering
3. Abhimanyu Aditya: MS, CS
4. Leif Poorman: MS, Physics
5. Dong Ryeol Lee: PhD student, CS + Math
6. Ryan Riegel: PhD student, CS + Math
7. Parikshit Ram: PhD student, CS + Math
8. William March: PhD student, Math + CS
9. James Waters: PhD student, Physics + CS
10. Hua Ouyang: PhD student, CS
11. Sooraj Bhat: PhD student, CS
12. Ravi Sastry: PhD student, CS
13. Long Tran: PhD student, CS
14. Wei Guan: PhD student, CS (co-supervised)
15. Nishant Mehta: PhD student, CS (co-supervised)
16. Wee Chin Wong: PhD student, ChemE (co-supervised)
17. Jaegul Choo: PhD student, CS (co-supervised)
18. Ailar Javadi: PhD student, ECE (co-supervised)
19. Shakir Sharfraz: MS student, CS
20. Subbanarasimhiah Harish: MS student, CS
Our challenge
• Allow statistical analyses (both simple and state-of-the-art) on terascale or petascale datasets
• Return answers with error guarantees
• Rigorous theoretical runtimes
• Ensure small constants (small actual runtimes, not just good asymptotic runtimes)
10 data analysis problems, and scalable tools we’d like for them
1. Querying (e.g. characterizing a region of space):
   – spherical range-search O(N)
   – orthogonal range-search O(N)
   – k-nearest-neighbors O(N)
   – all-k-nearest-neighbors O(N^2)
2. Density estimation (e.g. comparing galaxy types):
   – mixture of Gaussians
   – kernel density estimation O(N^2)
   – manifold kernel density estimation O(N^3) [Ozakin and Gray 2009, in submission]
   – hyper-kernel density estimation O(N^4) [Sastry and Gray 2009, in submission]
   – L2 density tree [Ram and Gray, in prep]
3. Regression (e.g. photometric redshifts):
   – linear regression O(ND^2)
   – kernel regression O(N^2)
   – Gaussian process regression/kriging O(N^3)
4. Classification (e.g. quasar detection, star-galaxy separation):
   – k-nearest-neighbor classifier O(N^2)
   – nonparametric Bayes classifier O(N^2)
   – support vector machine (SVM) O(N^3)
   – isometric separation map O(N^3) [Vasiloglou, Gray, and Anderson 2008]
   – non-negative SVM O(N^3) [Guan and Gray, in prep]
   – false-positive-limiting SVM O(N^3) [Sastry and Gray, in prep]
5. Dimension reduction (e.g. galaxy or spectra characterization):
   – principal component analysis O(D^3)
   – non-negative matrix factorization
   – kernel PCA (including LLE, IsoMap, et al.) O(N^3)
   – maximum variance unfolding O(N^3)
   – rank-based manifolds O(N^3) [Ouyang and Gray 2008]
   – isometric non-negative matrix factorization O(N^3) [Vasiloglou, Gray, and Anderson 2008]
   – density-preserving maps O(N^3) [Ozakin and Gray 2009, in submission]
6. Outlier detection (e.g. new object types, data cleaning):
   – by density estimation, by dimension reduction
   – by robust Lp estimation [Ram, Riegel and Gray, in prep]
7. Clustering (e.g. automatic Hubble sequence):
   – by dimension reduction, by density estimation
   – k-means
   – mean-shift segmentation O(N^2)
   – hierarchical clustering (“friends-of-friends”) O(N^3)
8. Time series analysis (e.g. asteroid tracking, variable objects):
   – Kalman filter O(D^2)
   – hidden Markov model O(D^2)
   – trajectory tracking O(N^n)
   – functional independent component analysis [Mehta and Gray 2008]
   – Markov matrix factorization [Tran, Wong, and Gray 2009, in submission]
9. Feature selection and causality (e.g. which features predict star/galaxy):
   – LASSO regression
   – L1 SVM
   – Gaussian graphical model inference and structure search
   – discrete graphical model inference and structure search
   – mixed-integer SVM [Guan and Gray 2009, in submission]
   – L1 Gaussian graphical model inference and structure search [Tran, Lee, Holmes, and Gray, in prep]
10. 2-sample testing and matching (e.g. cosmological validation, multiple surveys):
   – minimum spanning tree O(N^3)
   – n-point correlation O(N^n)
   – bipartite matching/Gaussian graphical model inference O(N^3) [Waters and Gray, in prep]
Core methods of statistics / machine learning / mining
• Querying: nearest-neighbor O(N), spherical range-search O(N), orthogonal range-search O(N), contingency table
• Density estimation: kernel density estimation O(N^2), mixture of Gaussians O(N)
• Regression: linear regression O(ND^2), kernel regression O(N^2), Gaussian process regression O(N^3)
• Classification: nearest-neighbor classifier O(N^2), nonparametric Bayes classifier O(N^2), support vector machine
• Dimension reduction: principal component analysis O(D^3), non-negative matrix factorization, kernel PCA O(N^3), maximum variance unfolding O(N^3)
• Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
• Clustering: k-means O(N), hierarchical clustering O(N^3), by dimension reduction
• Time series analysis: Kalman filter O(D^3), hidden Markov model, trajectory tracking
• Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
• 2-sample testing and matching: n-point correlation O(N^n), bipartite matching O(N^3)
4 main computational bottlenecks:
generalized N-body, graphical models, linear algebra, optimization
GNPs (generalized N-body problems): Data structure needed
kd-trees: the most widely-used space-partitioning tree for general dimension
[Bentley 1975], [Friedman, Bentley & Finkel 1977], [Moore & Lee 1995]
(others: ball-trees, cover-trees, etc.)
What algorithmic methods do we use?
• Generalized N-body algorithms (multiple trees) for distance/similarity-based computations [2000, 2003, 2009]
• Hierarchical series expansions for kernel summations [2004, 2006, 2008]
• Multi-scale Monte Carlo for linear algebra and summations [2007, 2008]
• Parallel computing [1998, 2006, 2009]
Computational complexity using fast algorithms
• Querying: nearest-neighbor O(log N), spherical range-search O(log N), orthogonal range-search O(log N), contingency table
• Density estimation: kernel density estimation O(N) or O(1), mixture of Gaussians O(log N)
• Regression: linear regression O(D) or O(1), kernel regression O(N) or O(1), Gaussian process regression O(N) or O(1)
• Classification: nearest-neighbor classifier O(N), nonparametric Bayes classifier O(N), support vector machine O(N)
• Dimension reduction: principal component analysis O(D) or O(1), non-negative matrix factorization, kernel PCA O(N) or O(1), maximum variance unfolding O(N)
• Outlier detection: by robust L2 estimation, by density estimation, by dimension reduction
• Clustering: k-means O(log N), hierarchical clustering O(N log N), by dimension reduction
• Time series analysis: Kalman filter O(D) or O(1), hidden Markov model, trajectory tracking
• Feature selection and causality: LASSO, L1 SVM, Gaussian graphical models, discrete graphical models
• 2-sample testing and matching: n-point correlation O(N^(log n)), bipartite matching O(N) or O(1)
All-k-nearest-neighbors
• Naïve cost: O(N^2)
• Fastest algorithm: Generalized Fast Multipole Method, aka multi-tree methods (sketched below)
  – Technique: use two trees simultaneously
  – [Gray and Moore 2001, NIPS; Ram, Lee, March, and Gray 2009, submitted; Riegel, Boyer and Gray, to be submitted]
  – O(N), exact
• Applications: querying, adaptive density estimation first step [Budavari et al., in prep]
Kernel density estimation
• Naïve cost: O(N^2)
• Fastest algorithm: dual-tree fast Gauss transform, Monte Carlo multipole method (the bounding idea is sketched below)
  – Technique: Hermite operators, Monte Carlo
  – [Lee, Gray and Moore 2005, NIPS; Lee and Gray 2006, UAI; Holmes, Gray and Isbell 2007, NIPS; Lee and Gray 2008, NIPS]
  – O(N), relative error guarantee
• Applications: accurate density estimates, e.g. galaxy types [Balogh et al. 2004]
Nonparametric Bayes classification
• Naïve cost: O(N^2)
• Fastest algorithm: dual-tree bounding with hybrid tree expansion
  – Technique: bound each posterior
  – [Liu, Moore, and Gray 2004; Gray and Riegel 2004, CompStat; Riegel and Gray 2007, SDM]
  – O(N), exact
• Applications: accurate classification, e.g. quasar detection [Richards et al. 2004, 2009], [Scranton et al. 2005]
Principal component analysis
• Naïve cost: O(D^3)
• Fastest algorithm: QUIC-SVD
  – Technique: tree-based Monte Carlo using a cosine tree
  – [Holmes, Gray, and Isbell 2009]
  – O(1), relative error guarantee
• Applications: spectral decomposition
Friends-of-friends
• Naïve cost: O(N^3)
• Fastest algorithm: dual-tree Boruvka algorithm (the plain Boruvka rounds are sketched below)
  – Technique: maintain implicit friend sets
  – [March and Gray, to be submitted]
  – O(N log N), exact
• Applications: cluster-finding, 2-sample testing
n-point correlations
• Naïve cost: O(N^n)
• Fastest algorithm: n-tree algorithms
  – Technique: use n trees simultaneously
  – [Gray and Moore 2001, NIPS; Moore et al. 2001, Mining the Sky; Waters, Riegel and Gray 2009, to be submitted]
  – O(N^(log n)), exact
• Applications: 2-sample, e.g. AGN [Wake et al. 2004], ISW [Giannantonio et al. 2006], 3pt validation [Kulkarni et al. 2007]
3-point runtime (biggest previous: 20K)
• n=2: O(N)
• n=3: O(N^(log 3))
• n=4: O(N^2)
VIRGO simulation data, N = 75,000,000
naïve: 5×10^9 sec. (~150 years)
multi-tree: 55 sec. (exact)
Software
• MLPACK (C++)
  – First scalable comprehensive ML library
• MLPACK Pro
  – Very-large-scale data (parallel)
• MLPACK-db
  – Fast data analytics in relational databases (SQL Server; on-disk)
  – Thanks to Tamas Budavari
Fast algorithms in SQL Server: Challenges
• Don’t copy the data: in-RAM kd-trees → disk-based data structures
• Commercial database: no access to guts (e.g. caching, memory)
• Already optimized for different operations (SQL queries) and data structures (B+ trees)
• Big learning curve
Building a kd-tree in SQL Server
• Piggyback onto B+ trees and disk layout layers: use SQL queries
• Hybrid approach: upper tree using the Morton z-curve (SQL), lower tree using in-RAM building; 3 passes total (the z-order key is sketched below)
• Clustered indices for fast sequential access of nodes and data
• Abstracted algorithms in managed C#
Current state
• So far: All-NN, KDE*, NBC*, 2pt*
• Performance for all-nearest-neighbors on 40M-point, 4-D SDSS color data:
  – Naïve SQL Server: 50 yrs
  – Batched SQL Server: 3 yrs
  – Single-tree MLPACK-db: 14.5 hrs
  – Dual-tree MLPACK-db: 227 min
  – TPIE: 23 min, MLPACK-Pro: 10 min
• We believe there is a 10x factor of fat
  – e.g. this uses only 100 MB of 2 GB RAM
The end
• You can now use many of the state-of-the-art data analysis methods on large survey datasets
• We need more funding :)
• Talk to me… I have a lot of problems (ahem) but always want more!
Alexander Gray [email protected]
www.fast-lab.org
Ex: support vector machine
Data: IJCNN1 [DP01a]
2 classes
49,990 training points
91,701 testing points
22 features
SMO: O(N^2–N^3)
SFW: O(N/e + 1/e^2)
SMO: 12,831 SVs, 84,360 iterations, 98.3% accuracy, 765 sec
SFW: 4,145 SVs, 4,500 iterations, 98.1% accuracy, 21 sec
[Ouyang and Gray 2009, in submission]
A few new ones…
• Non-vector data (like time series, images, graphs, trees):
  – Functional ICA (Mehta and Gray 2008)
  – Generalized mean map kernel (Mehta and Gray, in submission)
• Visualizing high-dimensional data:
  – Rank-based manifolds (Ouyang and Gray 2008)
  – Density-preserving maps (Ozakin and Gray, in submission)
  – Intrinsic dimension estimation (Ozakin and Gray, in prep)
• Accounting for measurement errors
• Probabilistic cross-match
Characterization of an entire distribution?
2-point correlation
“How many pairs have distance < r ?”
Σ_{i=1..N} Σ_{j≠i} I(|x_i − x_j| < r)   (the 2-point correlation function)
The n-point correlation functions
• Spatial inferences: filaments, clusters, voids, homogeneity, isotropy, 2-sample testing, …
• Foundation for the theory of point processes [Daley, Vere-Jones 1972]; unifies spatial statistics [Ripley 1976]
• Used heavily in biostatistics, cosmology, particle physics, statistical physics
2pcf definition: dP = λ^2 dV_1 dV_2 [1 + ξ(r)]
3pcf definition: dP = λ^3 dV_1 dV_2 dV_3 [1 + ξ(r_12) + ξ(r_23) + ξ(r_13) + ζ(r_12, r_23, r_13)]
Standard model: n>0 terms should be zero!
[Figure: triangle of three points with side lengths r_1, r_2, r_3]
3-point correlation
“How many triples have pairwise distances < r ?”
Σ_{i=1..N} Σ_{j≠i} Σ_{k≠j,i} I(d_ij < r_1) I(d_jk < r_2) I(d_ki < r_3),   where d_ij = |x_i − x_j|
How can we count n-tuples efficiently?
“How many triples have pairwise distances < r ?”
Use n trees! [Gray & Moore, NIPS 2000]
“How many valid triangles a-b-c (where a ∈ A, b ∈ B, c ∈ C) could there be?”
[Figure: three tree nodes A, B, C, with the pairwise distance threshold r]
count{A,B,C} = ?
Split one of the nodes and recurse on its two children:
count{A,B,C} = count{A,B,C.left} + count{A,B,C.right}
Exclusion: if some pair of nodes can never be within distance r of each other, count{A,B,C} = 0!
Inclusion: if every pair of nodes is certainly within distance r, count{A,B,C} = |A| x |B| x |C|
(A code sketch of this recursion follows.)