SDM06 TUTORIAL: Randomized Algorithms for Matrices and Massive Data Sets
Petros Drineas (CS, RPI) and Michael W. Mahoney (Yahoo! Research)
Tutorial given at the SIAM Data Mining Meeting, April 22, 2006. Most recent copy available at: http://www.cs.yale.edu/homes/mmahoney and http://www.cs.rpi.edu/~drinep

Randomized Linear Algebra Algorithms
Goal: To develop and analyze (fast) Monte Carlo algorithms for performing useful computations on large (and, later, not so large!) matrices and tensors.
• Matrix Multiplication
• Computation of the Singular Value Decomposition
• Computation of the CUR Decomposition
• Testing Feasibility of Linear Programs
• Least Squares Approximation
• Tensor computations: SVD generalizations
• Tensor computations: CUR generalization
Such computations generally require time that is superlinear in the number of nonzero elements of the matrix/tensor, e.g., O(n^3) for n x n matrices.

2 Example: the CUR decomposition
Algorithmic Motivation: To speed up computations in applications where extremely large data sets are modeled by matrices and, e.g., O(n^2) space and O(n^3) time is not an option.
Structural Motivation: To reveal novel structural properties of the datasets, given sufficient computational time, that are useful in applications.
[Figure: A approximated as the product of O(1) columns (C), a carefully chosen U, and O(1) rows (R).]
Why? Given a sample consisting of a few columns (C) and a few rows (R) of A, we can compute U and "reconstruct" A as CUR. If the sampling probabilities are not "too bad", we get provably good accuracy. Goal: make ||A-CUR|| small.
Why? (Algorithmic) After making two passes over A, we can compute provably good C, U, and R and store them (a "sketch") instead of A: O(m+n) vs. O(n^2) space.
Why? (Structural) Given sufficient time, we can find C, U, and R such that A - CUR is "very" small. This might lead to better understanding of the data.
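The CUR reconstruction idea above can be sketched in a few lines of numpy. The test matrix, the sample sizes, the uniform sampling, and the choice U = W^+ (the pseudoinverse of the intersection of C and R, with noise-level singular values dropped) are illustrative assumptions here, not the tutorial's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, k = 100, 80, 5
# Hypothetical low-rank-plus-noise data matrix.
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n)) \
    + 0.01 * rng.standard_normal((m, n))

cols = rng.choice(n, size=15, replace=False)     # a few columns -> C
rows = rng.choice(m, size=15, replace=False)     # a few rows    -> R
C, R = A[:, cols], A[rows, :]
W = A[np.ix_(rows, cols)]                        # intersection of C and R
U = np.linalg.pinv(W, rcond=0.05)                # U = W^+, tiny noise svs dropped

# Storing C, U, R costs O(m + n) numbers; storing A costs O(mn).
rel_err = np.linalg.norm(A - C @ U @ R) / np.linalg.norm(A)
```

Because A here is nearly rank 5, a handful of columns and rows already reconstructs it to within roughly the noise level.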
3 Applications of such algorithms
Matrices arise, e.g., since m objects (documents, genomes, images, web pages), each with n features, may be represented by an m x n matrix A.
• Covariance Matrices
• Latent Semantic Indexing
• DNA Microarray Data
• Eigenfaces and Image Recognition
• Similarity Queries
• Matrix Reconstruction
• LOTS of other data applications!!
More generally,
• Linear and Nonlinear Programming Applications
• Design of Approximation Algorithms
• Statistical Learning Theory Applications

4 Overview (1/2)
• Data Streaming Models and Random Sampling
• Matrix Multiplication
• Singular Value Decomposition
• CUR Matrix Decomposition
• Applications of Matrix CUR
  • Data mining
  • DNA microarray (and DNA SNP) data
  • Recommendation Systems
  • Kernel-CUR and the Nystrom Method

5 Overview (2/2)
• Tensor-based data sets
  • Tensor-CUR
  • Hyperspectral data
  • Recommendation systems
• From Very-Large to Medium-Sized Data
  • Relative-error CX and CUR Matrix Decompositions
  • L2 Regression Problems
  • Application to DNA SNP Data
• Conclusions and Open Problems

6 The Pass-Efficient Model
Motivation: The amount of disk/tape space has increased enormously; RAM and computing speeds have increased less rapidly.
• We can store large amounts of data, but
• we cannot process these data with traditional algorithms.
In the Pass-Efficient Model:
• Data are assumed to be stored on disk/tape.
• The algorithm has access to the data via passes over the data.
• The algorithm is allowed additional RAM space and additional computation time.
An algorithm is pass-efficient if it uses a small constant number of passes and sublinear additional time and space to compute a description of the solution.
Note: If the data are an m x n matrix A, then algorithms which require additional time and space that is O(m+n) or O(1) are pass-efficient.

7 Random Sampling
Random Sampling and Randomized Algorithms:
• Better complexity properties (randomization as a resource).
• Simpler algorithms and/or analysis (maybe de-randomize later).
Uniform Sampling:
• Typically things work in expectation, but with poor variance properties.
Non-uniform Sampling:
• With "good" probabilities, we can make the variance small.
Randomized Algorithms for Linear Algebra problems: A "sketch" consisting of a small number of judiciously chosen and randomly sampled rows and columns (or elements) is sufficient for provably rapid and efficient approximation of many matrix operations.

9 Approximating Matrix Multiplication … (D. & Kannan FOCS ’01, and D., Kannan, & M. TR ’04, SICOMP ’06)
Problem Statement: Given an m-by-n matrix A and an n-by-p matrix B, approximate the product A·B, or, equivalently, approximate the sum of n rank-one matrices: each term A(i)·B(i) in the summation (the i-th column of A times the i-th row of B) is a rank-one matrix.

10 …by random sampling
Algorithm:
1. Fix a set of probabilities pi, i=1,…,n, summing up to 1.
2. For t=1 up to s, set jt = i with probability Pr(jt = i) = pi. (Pick s terms of the sum, with replacement, with respect to the pi.)
3. Approximate AB by the sum of the s terms, after scaling.

11 Random sampling (cont'd): keep the terms j1, j2, …, js.

12 The algorithm (matrix notation)
• Create C and R by performing s i.i.d. trials, with replacement.
• For t=1 up to s, pick a column A(jt) and a row B(jt) with probability pi = |A(i)| |B(i)| / Σj |A(j)| |B(j)|.
• Include A(jt)/(s·pjt)^{1/2} as a column of C, and B(jt)/(s·pjt)^{1/2} as a row of R.

13 Simple Lemmas
The input matrices are given in "sparse unordered representation"; e.g., their non-zero entries are presented as triples (i, j, Aij) in any order.
• The expectation of CR (element-wise) is AB.
• Our nonuniform sampling minimizes the variance of the estimator.
• It is easy to implement the sampling in two passes.
• If the matrices are dense, the algorithm runs in O(smp) time, instead of O(nmp) time.
• It requires O(sm+sp) RAM space.
• It does not tamper with the sparsity of the matrices.

14 Error Bounds
For the above algorithm, E[ ||AB - CR||F ] ≤ ||A||F ||B||F / s^{1/2}.
For the above algorithm, with probability at least 1-δ, ||AB - CR||F ≤ O(log(1/δ)) ||A||F ||B||F / s^{1/2}.
• This is a relative error bound if ||AB||F = Ω(||A||F ||B||F), i.e., if there is "not much cancellation" in the multiplication.
• We removed the expectation (by applying a martingale argument) and so have an extra log(1/δ) factor.
• Markov's inequality would also remove the expectation, introducing an extra 1/δ factor.

15 Special case: B = AT
If B = AT, then the sampling probabilities are pi = |A(i)|^2 / ||A||F^2. Also, R = CT, and the error bounds become bounds on ||AAT - CCT||F.

16 Special case: B = AT (cont'd) (Rudelson & Vershynin ’04, Vershynin ’04)
Improvement of the spectral norm bound for the special case B = AT.
• Uses a result of M. Rudelson for random vectors in isotropic position.
• Tight concentration results can be proven using Talagrand's theory.
• The sampling procedure is slightly different; s columns/rows are kept in expectation, i.e., column i is picked with probability pi = min{1, s |A(i)|^2 / ||A||F^2}.

18 Singular Value Decomposition (SVD)
A = U S VT, where U (V) is an orthogonal matrix containing the left (right) singular vectors of A, and S is a diagonal matrix containing the singular values of A.
1. Exact computation of the SVD takes O(min{mn^2, m^2n}) time.
2. The top few singular vectors/values can be approximated faster (Lanczos/Arnoldi methods).

19 Rank k approximations (Ak)
Uk (Vk): orthogonal matrix containing the top k left (right) singular vectors of A.
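The sampling-based matrix multiplication algorithm described earlier can be sketched as follows. The matrix sizes and the number of samples s are illustrative; the probabilities are the variance-minimizing choice pi proportional to |A(i)| |B(i)|:

```python
import numpy as np

def approx_matmul(A, B, s, rng):
    """Sample s rank-one terms A[:, j] B[j, :] with probabilities
    p_j ~ |A(j)| * |B(j)|, rescaling each sampled term by 1/(s * p_j)."""
    n = A.shape[1]
    p = np.linalg.norm(A, axis=0) * np.linalg.norm(B, axis=1)
    p = p / p.sum()
    idx = rng.choice(n, size=s, replace=True, p=p)  # s i.i.d. trials
    scale = 1.0 / np.sqrt(s * p[idx])
    C = A[:, idx] * scale                # m x s: sampled, rescaled columns
    R = B[idx, :] * scale[:, None]       # s x p: sampled, rescaled rows
    return C @ R                         # unbiased estimator of A @ B

rng = np.random.default_rng(1)
A = rng.standard_normal((50, 400))
B = rng.standard_normal((400, 30))
est = approx_matmul(A, B, s=200, rng=rng)
# Error should be on the order of ||A||_F ||B||_F / sqrt(s).
rel_err = np.linalg.norm(A @ B - est) / (np.linalg.norm(A) * np.linalg.norm(B))
```

With s = 200 of n = 400 terms, the relative error (measured against ||A||F ||B||F, matching the error bound above) is typically around 1/sqrt(s).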
Sk: diagonal matrix containing the top k singular values of A. Also, Ak = Uk UkT A.
Ak is a matrix of rank k such that ||A-Ak||2,F is minimized over all rank-k matrices! This property is very useful in the context of Principal Component Analysis.

20 Approximating SVD in O(n) time (D., Frieze, Kannan, Vempala & Vinay SODA ’99, JML ’04, D., Kannan, & M. TR ’04, SICOMP ’06)
Given: m x n matrix A
• Sample c columns from A and rescale to form the m x c matrix C.
• Compute the m x k matrix Hk of the top k left singular vectors of C.
Structural Theorem: For any probabilities and number of columns:
||A-HkHkTA||2,F^2 ≤ ||A-Ak||2,F^2 + 2√k ||AAT-CCT||F
Algorithmic Theorem: If pi = |A(i)|^2/||A||F^2 and c = O(k/ε^2), then:
||A-HkHkTA||2,F^2 ≤ ||A-Ak||2,F^2 + ε||A||F^2.
Proof: via the matrix multiplication theorem and matrix perturbation theory.

21 Example of randomized SVD
[Figure: a 512 x 512 test image A (baboon) and the sampled-column matrix C.]
Original matrix; after sampling columns, compute the top k left singular vectors of the matrix C and store them in the 512-by-k matrix Hk.

22 Example of randomized SVD (cont'd)
[Figure: A next to HkHkTA.] A and HkHkTA are close.

23 Element-wise sampling (Achlioptas & McSherry, STOC ’01, JACM ’05)
The Algorithm in 2 lines:
• To approximate a matrix A, keep a few elements of the matrix (instead of rows or columns) and zero out the remaining elements.
• Compute a rank k approximation to this sparse matrix (using Lanczos methods).
More details: Let pij ∈ [0,1] for all i,j. Create the matrix S from A such that Sij = Aij/pij with probability pij, and Sij = 0 otherwise. Then ||A-S||2 is bounded, so that:
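The column-sampling SVD approximation described above can be sketched directly. The exactly rank-6 test matrix and the sample size c = 100 are illustrative assumptions; with exact low rank, the sampled columns span the column space and the projection recovers A:

```python
import numpy as np

def linear_time_svd(A, c, k, rng):
    """Sample c columns of A with p_i = |A(i)|^2 / ||A||_F^2, rescale each
    by 1/sqrt(c p_i), and return H_k: the top k left singular vectors of
    the sampled m x c matrix C."""
    p = np.linalg.norm(A, axis=0) ** 2
    p = p / p.sum()
    idx = rng.choice(A.shape[1], size=c, replace=True, p=p)
    C = A[:, idx] / np.sqrt(c * p[idx])
    Hk = np.linalg.svd(C, full_matrices=False)[0][:, :k]
    return Hk

rng = np.random.default_rng(2)
A = rng.standard_normal((60, 6)) @ rng.standard_normal((6, 500))  # exactly rank 6
Hk = linear_time_svd(A, c=100, k=6, rng=rng)
# H_k H_k^T A should recover A here, since rank(A) = k.
rel_err = np.linalg.norm(A - Hk @ Hk.T @ A) / np.linalg.norm(A)
```

For approximately low-rank matrices the same projection satisfies the additive-error bound quoted above instead of being exact.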
(i) the singular values of A and S are close, and (ii, under additional assumptions) the top k left (right) singular vectors of S span a subspace that is close to the subspace spanned by the top k left (right) singular vectors of A.

24 Element-wise sampling (cont'd)
Approximating singular values fast:
• Zero out (a large number of) elements of A, and scale the remaining ones appropriately.
• Compute the singular values of the resulting sparse matrix using iterative techniques.
• (A good choice for the pij: pij = s·Aij^2 / Σi,j Aij^2, where s denotes the expected number of elements that we seek to keep in S.)
• Note: Each element is kept or discarded independently of the others.
Similar ideas have been used to:
• explain the success of Latent Semantic Indexing (LSI): (Papadimitriou, Raghavan, Tamaki, Vempala, PODS ’98 & Azar, Fiat, Karlin, McSherry, Saia, STOC ’01)
• design recommendation systems: (Azar, Fiat, Karlin, McSherry, Saia, STOC ’01)
• speed up kernel computations: (Achlioptas, McSherry, and Schölkopf, NIPS ’02)

25 Element-wise vs. row/column sampling
Quite different techniques!
• Row/column sampling preserves subspace/structural properties of the matrices.
• Element-wise sampling explains how adding noise and/or quantizing the elements of a matrix perturbs its singular values/vectors.
• The two techniques should be complementary!
Some similarities and differences:
• Similar error bounds.
• Element-wise sampling is doable in one pass!
• The running time of element-wise sampling depends on the speed of, e.g., Arnoldi methods.
• Element-wise methods do not seem amenable to many of the extensions we will present.

27 A novel CUR matrix decomposition (D. & Kannan, SODA ’03, D., Kannan, & M.
TR ’04, SICOMP ’06)
Create an approximation to the original matrix of the following form: A ≈ CUR, where C consists of O(1) columns of A, R consists of O(1) rows of A, and U is carefully chosen.
An application: Given a query vector x, instead of computing A·x, compute CUR·x to identify its nearest neighbors.

28 The CUR decomposition
Given a large m-by-n matrix A (stored on disk), compute a decomposition CUR of A such that:
1. C consists of c = O(k/ε^2) columns of A.
2. R consists of r = O(k/ε^2) rows of A.
3. C (R) is created using importance sampling; e.g., columns (rows) are picked in i.i.d. trials with probabilities proportional to their squared Euclidean lengths.
4. C, U, R can be stored in O(m+n) space, after making two passes through the entire matrix A, using O(m+n) additional space and time.
5. The product CUR satisfies (with high probability): ||A-CUR||F ≤ ||A-Ak||F + ε||A||F.

29 Computing U
Intuition (which can be formalized; see later): The CUR algorithm expresses every row of the matrix A as a linear combination of a small subset of the rows of A.
• This small subset consists of the rows in R.
• Given a row of A, say A(i), the algorithm computes a good fit for the row A(i) using the rows in R as the basis, by approximately solving min_u ||A(i) - uR||.
But only c = O(1) elements of the i-th row are given as input, so we only approximate the optimal vector u instead of computing it exactly. Actually, the pass-efficient CUR decomposition approximates the approximation.

30 Error bounds for CUR
Assume Ak is the "best" rank-k approximation to A (through the SVD). Then:
• if we pick O(k/ε^2) rows and O(k/ε^2) columns, ||A-CUR||F ≤ ||A-Ak||F + ε||A||F;
• if we pick O(1/ε^2) rows and O(1/ε^2) columns, ||A-CUR||2 ≤ ||A-Ak||2 + ε||A||F.

31 Previous CUR-type decompositions
• Goreinov, Tyrtyshnikov, & Zamarashkin (LAA ’97, …): C: columns that span max volume; U: W+; R: rows that span max volume. Existential result; error bounds depend on ||W+||2; spectral norm bounds!
• Berry, Stewart, & Pulatova (Num. Math. ’99, TR ’04, …): C: variant of the QR algorithm; R: variant of the QR algorithm; U: minimizes ||A-CUR||F. No a priori bounds; A must be known to construct U; solid experimental performance.
• Williams & Seeger (NIPS ’01, …): C: uniformly at random; U: W+; R: uniformly at random. Experimental evaluation; A is assumed PSD; connections to the Nystrom method.
• D., Kannan, & M. (SODA ’03, TR ’04, SICOMP ’06): C: w.r.t. column lengths; U: in linear/constant time; R: w.r.t. row lengths. "Sketching" massive matrices; provable, a priori, bounds; explicit dependency on A - Ak.
• D., M., & Muthukrishnan (TR ’06): C: depends on singular vectors of A; U: (almost) W+; R: depends on singular vectors of C. (1+ε) approximation to A - Ak; computable in low polynomial time (it suffices to compute the SVD of A).
(For details see Drineas & Mahoney, "A Randomized Algorithm for a Tensor-Based Generalization of the SVD", ’05.)

32 Lower Bounds
Question: How many queries does a sampling algorithm need to approximate a given function accurately with high probability?
Lower bounds for the low-rank matrix approximation problem and the matrix reconstruction problem (Ziv Bar-Yossef ’03, ’04):
• Any sampling algorithm that w.h.p. finds a good low-rank approximation requires Ω(m+n) queries.
• Even if the algorithm is given the exact weight distribution over the columns of a matrix, it will still require Ω(k/ε^4) queries.
• Finding a matrix D such that ||A-D||F ≤ ε||A||F requires Ω(mn) queries, and finding a D such that ||A-D||2 ≤ ε||A||2 requires Ω(m+n) queries.
Applied to our results:
• The LinearTimeSVD algorithm is optimal w.r.t. ||·||F bounds.
• The ConstantTimeSVD algorithm is optimal w.r.t. ||·||2 bounds up to polynomial factors.
• The CUR algorithm is optimal for constant ε.

34 CUR application: Data Mining
Database: An m-by-n matrix A, e.g., m (>10^6) objects and n (>10^5) features.
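The nearest-neighbor use of CUR, answering queries against the sketch instead of against A, can be sketched as follows. The exactly rank-8 object-feature matrix, uniform sampling of 40 columns/rows, and U = W^+ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 2000, 300
A = rng.standard_normal((m, 8)) @ rng.standard_normal((8, n))  # object-feature matrix

cols = rng.choice(n, size=40, replace=False)
rows = rng.choice(m, size=40, replace=False)
C, R = A[:, cols], A[rows, :]
U = np.linalg.pinv(A[np.ix_(rows, cols)], rcond=1e-6)  # U = W^+, tiny svs dropped

x = rng.standard_normal(n)            # query vector
exact = A @ x                         # O(mn) work, needs all of A
approx = C @ (U @ (R @ x))            # O((m+n)*40) work, needs only the sketch
cos = exact @ approx / (np.linalg.norm(exact) * np.linalg.norm(approx))
```

Evaluating CUR·x right-to-left never forms an m x n matrix, which is the point: queries are answered from the O(m+n)-sized sketch.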
Queries: Given a new object x, find similar objects (nearest neighbors) in A.
Closeness: Two normalized objects x and d are "close" if xT·d = cos(x,d) is high.
• Given a query vector x, the matrix product A·x computes all the angles/distances.
Key observation: The exact value xT·d might not be necessary.
• The feature values in the vectors are set by coarse heuristics.
• It is in general enough to check whether xT·d > Threshold.
Algorithm: Given a query vector x, compute CUR·x to identify nearest neighbors.
Theorem: We have a bound on the worst-case error of using CUR instead of A: ||Ax - CURx||2 ≤ ||A - CUR||2 ||x||2.

35 CUR application: Genetic Microarray Data
Exploit structural properties of CUR in biological applications: the rows of A are genes and the columns are experimental conditions/arrays. Find a "good" set of genes and arrays to include in C and R; provable and/or heuristic strategies are acceptable.
Common in Biological/Chemical/Medical applications of PCA:
• Explain the singular vectors by mapping them to meaningful biological processes.
• This is a "challenging" task (think: reification)!
CUR is a low-rank decomposition in terms of the data that practitioners understand.
• Use it to explain the data and to do dimensionality reduction, classification, and clustering.
Gene microarray data: M., D., & Alter (UT Austin) (sporulation and cell cycle data).

36 CUR application: Recommendation Systems (D., Raghavan, & Kerenidis, STOC ’02)
The problem: m customers and n products; Aij is the (unknown) utility of product j for customer i.
The goal: recreate A from a few samples to recommend high-utility products.
• (KRRT98): Assuming strong clustering of the products, competitive algorithms exist even with only 2 samples/customer.
• (AFKMS01): Assuming sampling of Ω(mn) entries of A and a gap requirement, one can accurately recreate A.
Question: Can we get competitive performance by sampling o(mn) elements?
Answer: Apply the CUR decomposition: sample products (purchases, small surveys) and sample customers ("guinea pigs").

37 Kernel-CUR Motivation
Kernel-based learning methods extract non-linear structure:
• Choose features to define a (dot product) space F.
• Map the data, X, to F by φ: X -> F.
• Do classification, regression, and clustering in F with linear methods (SVMs, GPs, SVD).
If the Gram matrix G, where Gij = kij = (φ(X(i)), φ(X(j))), is dense but has low numerical rank, then calculations of interest need O(n^2) space and O(n^3) time:
• matrix inversion in GP prediction,
• quadratic programming problems in SVMs,
• computation of the eigendecomposition of G.
Relevant recent work using low-rank methods:
• (Williams and Seeger, NIPS ’01, etc.): Nystrom method for out-of-sample extensions.
• (Achlioptas, McSherry, and Schölkopf, NIPS ’02): randomized kernels.

38 Kernel-CUR Decomposition (D. & M., COLT ’05, TR ’05, JMLR ’05)
Input: n x n SPSD matrix G, probabilities {pi, i=1,…,n}, c <= n, and k <= c.
Algorithm:
• Let C be the n x c matrix containing c randomly sampled columns of G.
• Let W be the c x c matrix containing the intersection of C and CT.
Theorem: Let pi = Gii^2 / Σi Gii^2. If c = O(k log(1/δ)/ε^4), then with probability at least 1-δ the Frobenius-norm error bound holds; if c = O(log(1/δ)/ε^4), then with probability at least 1-δ the analogous spectral-norm bound holds.

40 Datasets modeled as tensors
Tensors (naively, a dataset subscripted by multiple indices) appear both in Math and CS:
• Represent high-dimensional functions.
• Connections to complexity theory (e.g., matrix multiplication complexity).
• Statistical applications (e.g., ICA, HOS, etc.).
• Large data-set applications (e.g., Medical Imaging & Hyperspectral Imaging).
Problem: There does not exist a definition of tensor rank (and an associated tensor SVD) with the nice properties found in the matrix case.
• (Lek-Heng Lim ’05: strong impossibility results!)
Common heuristic: "unfold" the tensor along a mode and apply Linear Algebra.
• We will do this, but note that it kills the essential tensor structure.

41 Datasets modeled as tensors (cont'd)
Goal: Extract structure from a tensor dataset A (naively, a dataset subscripted by multiple indices; e.g., an m x n x p tensor with modes 1, 2, and 3) using a small number of samples.
• Tensor rank (the minimum number of rank-one tensors summing to the tensor) is NP-hard to compute.
• Tensor α-rank ("unfold" along the α-th mode and apply the matrix SVD) is a commonly-used heuristic.
Randomized-Tensor-CUR: unfold along a "distinguished" mode and reconstruct.
Randomized-Tensor-SVD: unfold along every mode and choose columns.
(Drineas & Mahoney, "A Randomized Algorithm for a Tensor-Based Generalization of the SVD," TR05.)

42 The TensorCUR algorithm (3 modes)
Choose the preferred mode (e.g., time). Pick a few representative "slabs" (let R denote the tensor of the sampled slabs). Use only the information in a small number of representative "fibers" (let C denote the tensor of sampled fibers and U a low-dimensional encoding tensor). Express the remaining slabs as linear combinations of the basis of sampled slabs.

43 Tensor-CUR application: Hyperspectral Image Analysis (with M. Maggioni and R. Coifman at Yale)
Goal: Extract structure from temporally-resolved or spectrally-resolved images of medical interest using a small number of samples (images and/or pixels). [Data: ca. 500 x 500 pixels at 128 frequencies.]
Note: A temporally or spectrally resolved image may be viewed as an m x n x p tensor A (naively, a dataset subscripted by multiple indices) or as an mn x p matrix A (whose columns have internal structure that is not modeled).
Note: The chosen images are a dictionary from the data with which to express every image.
Note: The chosen pixels are a dictionary from the data with which to express every pixel.

45 Sampling hyperspectral data
Sample slabs depending on total absorption (for example, absorption at two pixel types). Sample fibers uniformly (since intensity depends on the stain).

46 Eigen-analysis of slabs and fibers.

47 Look at the exact (65-th) slab.

48 The (65-th) slab approximately reconstructed: this slab was reconstructed by an approximate least-squares fit to the basis from slabs 41 and 50, using 1000 (of 250K) pixels/fibers.

49 Tissue Classification - Exact Data
50 Tissue Classification - Ns=12 & Nf=1000

51 Tensor-CUR application: Recommendation Systems
Important Comment:
• Utility is an ordinal, not a cardinal, concept.
• Compare products; don't assign utility values.
Recommendation Model Revisited:
• Every customer has an n-by-n matrix (whose entries are +1/-1) representing pairwise product comparisons.
• There are m such matrices, forming an n-by-n-by-m 3-mode tensor A.
• Extract the "structure" of this tensor.

52 Application to Jester Joke Recommendations
Use just the 14,140 "full" users who rated all 100 Jester jokes. For each user, convert the utility vector to a 100 x 100 pairwise preference matrix. Choose, e.g., 300 slabs/users, and a small number of fibers/comparisons.
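The TensorCUR steps above, unfold along the distinguished mode, sample slabs and fibers, then refit every slab against the sampled slabs using only the sampled fibers, can be sketched as follows. The tensor dimensions, sample counts, and the exactly low-rank synthetic data are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
p, m, n = 20, 50, 40                 # p slabs (e.g., time), each an m x n image
base = rng.standard_normal((3, m * n))
T = (rng.standard_normal((p, 3)) @ base).reshape(p, m, n)  # slabs share a basis

A = T.reshape(p, m * n)              # unfold along the distinguished mode
slabs = rng.choice(p, size=6, replace=False)          # sampled slabs (R)
fibers = rng.choice(m * n, size=200, replace=False)   # sampled fibers (C)

# Express every slab as a linear combination of the sampled slabs, fitting
# the coefficients by least squares using only the sampled fibers.
coef, *_ = np.linalg.lstsq(A[np.ix_(slabs, fibers)].T, A[:, fibers].T, rcond=None)
approx = (coef.T @ A[slabs, :]).reshape(p, m, n)
rel_err = np.linalg.norm(T - approx) / np.linalg.norm(T)
```

Only the sampled slabs and fibers are read when fitting the coefficients, which is what makes the scheme a small-sample sketch of the full tensor.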
[Figure: histogram of the number of users vs. successful recommendations (out of 10), for 300 slabs with 50, 75, 100, or 10,000 fibers.]

54 Modeling data as matrices
People studying data:
• put the data onto a graph or into a vector space,
• even if the data don't naturally or obviously live there,
• and perform graph operations or vector space operations to extract information from the data.
Such data often have structure unrelated to the graphical or linear algebraic structure implicit in the modeling.
• This non-modeled structure is difficult to formalize.
Practitioners often have extensive field-specific intuition about the data.
• This intuition is often used to choose "where the data live."
• The choice of where the data live may capture non-modeled structure.

55 Modeling data as matrices (cont'd)
Matrices often arise since n objects ("documents," genomes, images, web pages), each with m features, may be represented by an m x n matrix A.
Such data matrices often have structure:
• for linear structure, the SVD or PCA is often used;
• for non-linear structure, kernel (e.g., diffusion-based) methods are used;
• other structures include sparsity, nonnegativity, etc.
Note: We know what the rows/columns "mean" from the application area.
Goal: Develop principled, provably accurate algorithmic methods such that:
• they are agnostic with respect to any particular field,
• one can fruitfully couple them to field-specific intuition,
• they perform well on complex non-toy data sets.

56 SVD and low-rank approximations
Theorem: Let A be an m x n matrix of rank ρ. Truncate the SVD of A by keeping k ≤ ρ terms: Ak = Uk Sk VkT. This gives the "best" rank-k approximation to A.
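The truncated-SVD statement above can be checked numerically; the random 40 x 30 matrix is just an illustration. The last line also computes the squared row lengths of Uk, the "information dispersal" quantity that reappears later in the tutorial:

```python
import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((40, 30))
U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 5
Ak = (U[:, :k] * s[:k]) @ Vt[:k, :]        # A_k = U_k S_k V_k^T

# Eckart-Young: the Frobenius error of A_k is exactly the tail of the spectrum.
err = np.linalg.norm(A - Ak)
tail = np.sqrt(np.sum(s[k:] ** 2))

# Squared Euclidean lengths of the rows of U_k ("information dispersal").
lev = np.sum(U[:, :k] ** 2, axis=1)        # these sum to k
```

The row lengths lev are not 0/1 and not equal in general, which is the point of the "rows of Uk" discussion that follows.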
Interesting properties of the truncated SVD:
• Used in data analysis via Principal Components Analysis (PCA).
• The rows of Uk (= UA,k) are NOT orthogonal and are NOT unit length.
• The lengths/Euclidean norms of the rows of Uk capture a notion of information dispersal.
• Gives a low-rank approximation with a very particular structure (rotate-rescale-rotate).
• Best at capturing the Frobenius (and other) norms.
• Problematic w.r.t. sparsity, interpretability, etc.

57 Problems with SVD/Eigen-Analysis
Problems arise since structure in the data is not respected by mathematical operations on the data:
• Reification - maximum variance directions are just that.
• Interpretability - what does a linear combination of 6000 genes mean?
• Sparsity - is destroyed by orthogonalization.
• Non-negativity - is a convex and not a linear algebraic notion.
The SVD gives two bases to diagonalize the matrix.
• Truncating gives a low-rank matrix approximation with a very particular structure.
• Think: rotation-with-truncation; rescaling; rotation-back-up.
Question: Do there exist "better" low-rank matrix approximations?
• "better" structural properties for certain applications;
• "better" at respecting relevant structure;
• "better" for interpretability and informing intuition.

58 Exactly and approximately rank-k matrices
Theorem: Let the m x n matrix A be exactly rank k. Then:
• There exist k linear combinations of columns, rows, etc. such that A = Ak = UkSkVkT.
• There exist k actual columns C and k actual rows R of A, permuted, whose intersection W has rank k, such that A = CW+R.
Take-home message: Low-rank structure IS redundancy in columns/rows.
Theorem: Let the m x n matrix A be approximately rank k. Then A ≈ Ak = UkSkVkT is the "best" approximation.
Question: Can we express approximately rank-k matrices in terms of their actual columns and/or rows?

59 Dictionaries for data analysis
Discrete Cosine Transform (DCT):
• fj = Σn=0,…,N-1 xn cos[π j (n+1/2)/N]
• the basis is fixed.
• O(N^2) or O(N log(N)) computation to determine the coefficients.
Singular Value Decomposition (SVD):
• A = Σi=1,…,ρ σi U(i) V(i)T = Σi=1,…,ρ σi A[i]
• O(N^3) computation to determine the basis and coefficients.
Many other, more complex/expensive procedures exist, depending on the application.
Question: Can actual data points and/or feature vectors be the dictionary?
• "Core-sets" on graphs.
• "CUR-decompositions" on matrices.

60 CX and CUR matrix decompositions
Recall: Matrices are about their rows and columns.
Recall: Low-rank matrices have redundancy in their rows and columns.
Def: A CX matrix decomposition is a low-rank approximation explicitly expressed in terms of a small number of columns of the original matrix A (e.g., PCA = CC+A).
Def: A CUR matrix decomposition is a low-rank approximation explicitly expressed in terms of a small number of columns and rows of the original matrix A.

61 Dictionaries & the SVD
A = U S VT = Σi=1,…,ρ σi U(i) V(i)T, where the U(i), V(i) are eigen-cols and eigen-rows.
Approximate: A(j) ≈ Σi=1,…,k zij U(i), by minzij || A(j) - Σi=1,…,k zij U(i) ||2.
Z = UkT A --> A ≈ Ak = (Uk UkT) A: project onto the space of the top k eigen-cols.
Z = Sk VkT --> A ≈ Ak = Uk (Sk VkT): approximate every column of A in terms of a small number of eigen-rows and a low-dimensional encoding matrix Sk.

62 Dictionaries & columns and rows
A = CUR = Σij uij C(i) R(j), where U = W+ and W is the intersection of C and R, and the C(i), R(j) are actual columns and actual rows of A.
Approximate: A(j) ≈ Σi=1,…,c yij C(i), by minyij || A(j) - Σi=1,…,c yij C(i) ||2.
Y = C+A --> A ≈ PCA = (CC+)A: project onto the space of those c actual columns.
Y ≈ W+R --> A ≈ PCA ≈ C(W+R): approximate every column of A in terms of a small number of actual rows and a low-dimensional encoding matrix U = W+.
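A CX decomposition as defined above can be sketched directly: choose columns C and set X = C+A, the best coefficients given C. The data matrix, the 20 uniformly sampled columns, and the noise level are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(6)
m, n, k = 50, 200, 6
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n)) \
    + 0.01 * rng.standard_normal((m, n))

cols = rng.choice(n, size=20, replace=False)
C = A[:, cols]                    # actual columns of A as the dictionary
X = np.linalg.pinv(C) @ A         # best fit of every column of A onto span(C)
cx_err = np.linalg.norm(A - C @ X) / np.linalg.norm(A)

# Compare with the best rank-20 error (a CX decomposition can never beat it).
s = np.linalg.svd(A, compute_uv=False)
svd_err = np.sqrt(np.sum(s[20:] ** 2)) / np.linalg.norm(A)
```

Each column of X says how to mix actual columns of A, so the approximation is interpretable in the data's own terms, unlike the SVD's eigen-columns.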
64 Problem formulation (1 of 3)
Never mind columns and rows; just deal with columns (for now) of the matrix A.
• We could ask to find the "best" k of the n columns of A.
• This is a combinatorial problem; the trivial algorithm takes n^k time.
• It is probably NP-hard if k is not fixed.
Let's ask a different question.
• Fix a rank parameter k.
• Over-sample columns by a little (e.g., k+3, 10k, k^2, etc.).
• Try to get close (additive error or relative error) to the "best" rank-k approximation.

65 Problem formulation (2 of 3)
Ques: Do there exist O(k), or O(k^2), or …, columns such that ||A-CC+A||2,F < ||A-Ak||2,F + ε||A||F?
Ans: Yes, and we can find them in O(m+n) space and time after two passes over the data! (DFKVV99, DKM04)
Ques: Do there exist O(k), or O(k^2), or …, columns such that ||A-CC+A||2,F < (1-ε)^{-1}||A-Ak||2,F + ε^t||A||F?
Ans: Yes, and we can find them in O(m+n) space and time after t passes over the data! (RVW05, DM05)
Ques: Do there exist, and can we find, O(k), or O(k^2), or …, columns such that ||A-CC+A||F < (1+ε)||A-Ak||F?
Ans: Yes, they exist (an existential proof; no non-exhaustive algorithm was given)! (RVW05, DRVW06)
Ans: ...

66 Problem formulation (3 of 3)
Ques: Do there exist O(k), or O(k^2), or …, columns and rows such that ||A-CUR||2,F < ||A-Ak||2,F + ε||A||F?
Ans: Yes, lots of them, and we can find them in O(m+n) space and time after two passes over the data! (DK03, DKM04)
Note: "lots of them", since these are randomized Monte Carlo algorithms!
Ques: Do there exist O(k), or O(k^2), or …, columns and rows such that ||A-CUR||F < (1+ε)||A-Ak||F?
Ans: …

67 Theorem: Relative-Error CUR
Fix any k, ε, δ.
Then, there exists a Monte Carlo algorithm that uses O(SVD(Ak)) time to find C and R and construct U such that ||A-CUR||F ≤ (1+ε)||A-Ak||F holds with probability at least 1-δ, by picking c = O( k^2 log(1/δ) / ε^2 ) columns and r = O( k^4 log^2(1/δ) / ε^6 ) rows.
(Current theory work: we can improve the sampling complexity to c, r = O(k poly(1/ε, 1/δ)).)
(Current empirical work: we can usually choose c, r ≤ k+4.)
(Don't worry about δ: choose δ=1 if you want!)

68 L2 Regression problems
First consider overconstrained problems: n >> d.
• Typically, there is no x such that Ax = b.
• We can generalize to non-overconstrained problems if rank(A) = k.
We seek sampling-based algorithms for approximating l2 regression:
• nontrivial structural insights in overconstrained problems;
• nontrivial algorithmic insights for non-overconstrained problems.

69 Creating an induced subproblem
Algorithm:
1. Fix a set of probabilities pi, i=1,…,n, summing up to 1.
2. Pick r indices from {1,…,n} in r i.i.d. trials, with respect to the pi's.
3. For each sampled index j, keep the j-th row of A and the j-th element of b; rescale both by (1/(r pj))^{1/2}.

70 The induced subproblem: the sampled rows of A, rescaled, and the sampled elements of b, rescaled.

71 Our main L2 Regression result
If the pi satisfy certain conditions, then with probability at least 1-δ, the solution x̃opt of the induced subproblem satisfies ||b - A x̃opt||2 ≤ (1+ε) ||b - A xopt||2; the accompanying bound on ||xopt - x̃opt||2 involves κ(A), the condition number of A.
(New improvement: we can reduce the sampling complexity to r = O(d).)

72 Conditions for the probabilities
The conditions that the pi must satisfy, for some β1, β2, β3 ∈ (0,1], involve the lengths of the rows of the matrix of left singular vectors of A and the component of b not in the span of the columns of A. Smaller βi implies more sampling; the sampling complexity scales with the 1/βi.

73 Rows of left singular vectors
What do the lengths of the rows of the n x d matrix U = UA "mean"?
Consider possible n x d matrices U of d left singular vectors:
• I_{n|k} = k columns from the identity: row lengths are 0 or 1; I_{n|k} x -> x.
• H_{n|k} = k columns from the n x n Hadamard (real Fourier) matrix: row lengths all equal; H_{n|k} x -> maximally dispersed.
• U_k = k columns from any orthogonal matrix: row lengths between 0 and 1.
The lengths of the rows of U = U_A correspond to a notion of information dispersal: where in R^m the information in A is sent, not what the information is.

74 Comments on L2 Regression
Main point: if n >> d, the relevant information for l2 regression is contained in an induced subproblem of size O(d^2)-by-d.
In O(nd^2) = O(SVD(A)) = O(SVD(A_d)) time we can easily compute p_i's that satisfy all three conditions, with β_1 = β_2 = β_3 = 1/3.
• Too expensive in practice for this over-constrained problem!
• NOT too expensive when applied to CX and CUR matrix problems!!
Key observation (FKV98, DK01, DKM04, RV04): U_S is almost orthogonal, and we can bound the spectral and the Frobenius norm of U_S^T U_S - I.
NOTE: K. Clarkson (SODA 2005) analyzed sampling-based algorithms for overconstrained l1 regression (p = 1) problems.

75 L2 Regression and CUR Approximation
Extended L2 Regression Algorithm:
• Input: an m x n matrix A, an m x p matrix B, and a rank parameter k.
• Output: an n x p matrix X approximately solving min_X ||A_k X - B||_F.
• Algorithm: randomly sample r = O(d^2) or r = O(d) rows from A_k and B, then solve the induced subproblem: X_opt = A_k^+ B ≈ (S A_k)^+ S B.
• Complexity: O(SVD(A_k)) time and space.
Corollary 1: Approximately solve min_X ||A_k^T X - A^T||_F to get columns C such that ||A - CC^+A||_F ≤ (1+ε)||A - A_k||_F.
Corollary 2: Approximately solve min_X ||CX - A||_F to get rows R such that ||A - CUR||_F ≤ (1+ε)||A - CC^+A||_F.

76 Theorem: Relative-Error CUR
Fix any k, ε, δ. Then, there exists a Monte Carlo algorithm that uses O(SVD(A_k)) time to find C and R and construct U s.t.:
  ||A - CUR||_F ≤ (1+ε)||A - A_k||_F
holds with probability at least 1-δ, by picking
  c = O( k^2 log(1/δ) / ε^2 ) columns, and
  r = O( k^4 log^2(1/δ) / ε^6 ) rows.
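The column-selection step behind Corollary 1 can be illustrated with a small numpy sketch, under simplifying assumptions: exact SVD, exact leverage scores, sampling without replacement. `cx_sketch` is a made-up name, not the tutorial's implementation.

```python
import numpy as np

def cx_sketch(A, k, c, rng=None):
    """Pick c columns of A with probabilities proportional to the
    squared row lengths of V_k (the top-k right singular vectors),
    i.e. the column leverage scores, and return C together with
    X = C^+ A, so that A ≈ CX with C made of actual columns of A."""
    rng = np.random.default_rng(rng)
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    lev = (Vt[:k] ** 2).sum(axis=0) / k        # column leverage scores; sum to 1
    idx = rng.choice(A.shape[1], size=c, replace=False, p=lev)
    C = A[:, idx]
    return C, np.linalg.pinv(C) @ A, idx

rng = np.random.default_rng(0)
k = 5
A = rng.standard_normal((80, k)) @ rng.standard_normal((k, 60)) \
    + 0.01 * rng.standard_normal((80, 60))     # near rank-k matrix
C, X, idx = cx_sketch(A, k, c=3 * k, rng=1)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
best_k = np.linalg.norm(A - (U[:, :k] * s[:k]) @ Vt[:k], "fro")
err = np.linalg.norm(A - C @ X, "fro")         # compare to ||A - A_k||_F
```

On a near low-rank input, the reconstruction error `err` lands close to the optimal rank-k error `best_k`, which is the behavior the relative-error theorem formalizes.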
(Current theory work: we can improve the sampling complexity to c, r = O(k poly(1/ε, 1/δ)).)
(Current empirical work: we can usually choose c, r ≤ k+4.)
(Don't worry about δ: choose δ = 1 if you want!)

77 Subsequent relative-error algorithms
November 2005: Drineas, Mahoney, and Muthukrishnan
• The first relative-error low-rank matrix approximation algorithm.
• O(SVD(A_k)) time and O(k^2) columns for both CX and CUR decompositions.
January 2006: Har-Peled
• Used ε-nets and VC-dimension arguments on optimal k-flats.
• O(mn k^2 log(k)) - "linear in mn" - time to get a 1+ε approximation.
March 2006: Deshpande and Vempala
• Used the volume-sampling / adaptive-sampling procedure of RVW05, DRVW06.
• O(Mk/ε) ≈ O(SVD(A_k)) time and O(k log(k)) columns for a CX-like decomposition.
April 2006: Drineas, Mahoney, and Muthukrishnan
• Improved the DMM November 2005 result to O(k log(k)) columns.

78 Overview (2/2)
• Tensor-based data sets
  • Tensor-CUR
  • Hyperspectral data
  • Recommendation systems
• From Very-Large to Medium-Sized Data
  • Relative-error CX and CUR Matrix Decompositions
  • L2 Regression Problems
  • Application to DNA SNP Data
• Conclusions and Open Problems

79 CUR data application: DNA tagging-SNPs
(data from K. Kidd's lab at Yale University; joint work with Dr. Paschou at Yale University)
Single Nucleotide Polymorphisms (SNPs): the most common type of genetic variation in the genome across different individuals. They are known locations in the human genome where two alternate nucleotide bases (alleles) are observed (out of A, C, G, T).
individuals (rows) × SNPs (columns), one two-allele genotype per entry:
… AG CT GT GG CT CC CC CC CC AG AG AG AG AG AA CT AA GG GG CC GG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CT AA GG GG CC GG AA GG AA CC AA CC AA GG TT AA TT GG GG GG TT TT CC GG TT GG GG TT GG AA …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AA AG CT AA GG GG CC AG AG CG AC CC AA CC AA GG TT AG CT CG CG CG AT CT CT AG CT AG GG GT GA AG …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA CC GG AA CC CC AG GG CC AC CC AA CG AA GG TT AG CT CG CG CG AT CT CT AG CT AG GT GT GA AG …
… GG TT TT GG TT CC CC CC CC GG AA GG GG GG AA CT AA GG GG CT GG AA CC AC CG AA CC AA GG TT GG CC CG CG CG AT CT CT AG CT AG GG TT GG AA …
… GG TT TT GG TT CC CC CG CC AG AG AG AG AG AA CT AA GG GG CT GG AG CC CC CG AA CC AA GT TT AG CT CG CG CG AT CT CT AG CT AG GG TT GG AA …
… GG TT TT GG TT CC CC CC CC GG AA AG AG AG AA TT AA GG GG CC AG AG CG AA CC AA CG AA GG TT AA TT GG GG GG TT TT CC GG TT GG GT TT GG AA …
There are ~10 million SNPs in the human genome, so this table could have ~10 million columns.

80 Recall "the" human genome
• Human genome ≈ 3 billion base pairs.
• 30,000 - 40,000 genes.
• The functionality of 97% of the genome is unknown.
• BUT: individual differences (polymorphic variation) at ≈ 1 b.p. per thousand.
(Same individuals × SNPs genotype table as above.)
SNPs occur quite frequently within the genome, allowing the tracking of disease genes and population histories. Thus, SNPs are effective markers for genomic research.

81 Focus at a specific locus and assay the observed alleles: C T.
SNP: exactly two alternate alleles appear.
Two copies of a chromosome (one from the father, one from the mother). An individual could be:
- heterozygous (in our study, CT = TC)
(Same individuals × SNPs genotype table as above.)

82 Focus at a specific locus and assay the observed alleles: C C.
SNP: exactly two alternate alleles appear.
Two copies of a chromosome (one from the father, one from the mother). An individual could be:
- heterozygous (in our study, CT = TC)
- homozygous in the first allele, e.g., C
(Same individuals × SNPs genotype table as above.)

83 Focus at a specific locus and assay the observed alleles: T T.
SNP: exactly two alternate alleles appear.
Two copies of a chromosome (one from the father, one from the mother). An individual could be:
- heterozygous (in our study, CT = TC)
- homozygous in the first allele, e.g., C
- homozygous in the second allele, e.g., T
(Same individuals × SNPs genotype table as above.)

84 Why are SNPs important?
Genetic Association Studies:
• Locate causative genes for common complex disorders (e.g., diabetes, heart disease, etc.) by identifying associations between affection status and known SNPs.
• No prior knowledge about the function of the gene(s) or the etiology of the disorder is necessary.
Biology and Association Studies: The subsequent investigation of candidate genes that are in physical proximity to the associated SNPs is the first step towards understanding the etiological "pathway" of a disorder and designing a drug.
Data Analysis and Association Studies: Susceptibility alleles (and the genotypes carrying them) should be more common in the patient population.

85 SNPs carry redundant information
Key observation: there are non-random relationships between SNPs.
The human genome is organized into a block-like structure, with strong intra-block correlations. We can therefore focus only on "tagSNPs".
• Among different populations (e.g., European, Asian, African, etc.), different patterns of SNP allele frequencies or SNP correlations are often observed.
• Understanding such differences is crucial in order to develop the "next generation" of drugs that will be "population specific" (eventually "genome specific") and not just "disease specific".

86 Funding ...
• Mapping the whole genome sequence of a single individual is very expensive.
• Mapping all the SNPs is also quite expensive, but the costs are dropping fast.
HapMap project (~$100,000,000 in funding from NIH and other sources):
• Map all 10,000,000 SNPs for 270 individuals from 4 different populations (YRI, CEU, CHB, JPT), in order to create a "genetic map" to be used by researchers.
• Also, funding from pharmaceutical companies, NSF, the Department of Justice*, etc.
*Is it possible to identify the ethnicity of a suspect from his DNA?

87 Research directions
Research questions (working within a population), and why each matters:
(i) Are different SNPs correlated, within or across populations?
    - Understand structural properties of the human genome.
(ii) Find a "good" set of tagging-SNPs capturing the diversity of a chromosomal region of the human genome.
    - Save time/money by assaying only the tSNPs and predicting the rest.
(iii) Find a set of individuals that capture the diversity of a chromosomal region.
    - Save time/money by running (drug) tests only on the cell lines of the selected individuals.
(iv) Is extrapolation feasible?
Existing literature:
• Pairwise metrics of SNP correlation, called LD (linkage disequilibrium) distance, based on nucleotide frequencies and co-occurrences. Almost no metrics exist for measuring correlation between more than 2 SNPs, and LD is very difficult to generalize.
• Exhaustive and semi-exhaustive algorithms for picking "good" htSNPs that have small LD distance with all other SNPs.
• Using linear algebra: an SVD-based algorithm was proposed by Lin & Altman, Am. J. Hum. Gen. 2004.

88 The DNA SNP data
• Samples from 38 different populations.
• Average size: 50 subjects/population.
• For each subject, 63 SNPs were assayed, from a region of chromosome 17 called SORCS3, ≈ 900,000 bases long.
• We are in the process of analyzing HapMap data as well as 3 more regions assayed by Kidd's lab (with Asif Javed).

89 (Figure: world map of the sampled populations; marker size indicates sample size, N > 50 or N: 25-50.)
Populations: Finns, Komi Zyrian, Yakut, Khanty, Irish, European (Mixed), Danes, Chuvash, Russians, African Americans, Jews (Ashkenazi), Adygei, Druze, Samaritans, Pima (Arizona), Cheyenne, Chinese (Hakka), Japanese, Han Chinese (Taiwan), Cambodians, Jews (Yemenite), Ibo, Maya, Atayal, Hausa, Yoruba, Pima (Mexico), Ami, Biaka, Jews (Ethiopian), Mbuti, Ticuna, Micronesians, Chagga, Nasioi, Surui, Karitiana.
Regions: Africa, Europe, NW Siberia, NE Siberia, SW Asia, E Asia, N America, S America, Oceania.

90 Encoding the data
(Matrix of the same individuals × SNPs, now with entries in {+1, 0, -1}; the first row begins 0 0 0 1 0 -1 1 1 1 0 ...)
How?
• Exactly two nucleotides (out of A, G, C, T) appear in each column.
• Thus, the two alleles might both equal the first one (encode by +1), both equal the second one (encode by -1), or differ (encode by 0).
Notes
• The order of the alleles is irrelevant, so TG is the same as GT.
• Encoding, e.g., GG as +1 and TT as -1 is not any different (for our purposes) from encoding GG as -1 and TT as +1. (Flipping the signs of the columns of a matrix does not affect our techniques.)

91 Evaluating (linear) structure
For each population:
• We ran SVD to determine the "optimal" number k of eigenSNPs covering 90% of the variance. If we pick the top k left singular vectors, we can express every column (i.e., SNP) of A as a linear combination of the left singular vectors, losing 10% of the data.
• We ran CUR to pick a small number (e.g., k+2) of columns of A, and express every column (i.e., SNP) of A as a linear combination of the picked columns, losing 10% of the data.
(Same encoded ±1/0 matrix as above.)

92 (figure)

93 Predicting SNPs within a population
Split the individuals into two sets: training and test. Given a small number of SNPs for all individuals (tagging-SNPs), and all SNPs for the individuals in the training set, predict the unassayed SNPs. Tagging-SNPs are selected using only the training set.
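The encoding and the SVD step above can be sketched together in numpy. This is a hedged toy illustration: `encode_snp` and `eigensnps` are made-up names, the data are synthetic, and real SNP data would need care with missing entries.

```python
import numpy as np

def encode_snp(genotypes):
    """Encode one SNP column of two-letter genotypes as +1/0/-1:
    +1 if both alleles equal the first allele (alphabetically), -1 if
    both equal the second, 0 if heterozygous; TG and GT encode alike.
    Assumes exactly two alternate alleles appear, as in the slides."""
    alleles = []
    for g in genotypes:
        for a in sorted(g):              # sort so TG and GT list alleles alike
            if a not in alleles:
                alleles.append(a)
    return np.array([0 if g[0] != g[1] else (1 if g[0] == alleles[0] else -1)
                     for g in genotypes])

def eigensnps(A, coverage=0.90):
    """Smallest k whose top-k singular values cover `coverage` of the
    total variance (sum of squared singular values) of A."""
    s = np.linalg.svd(A, compute_uv=False)
    frac = np.cumsum(s ** 2) / np.sum(s ** 2)
    return int(np.searchsorted(frac, coverage)) + 1

col = encode_snp(["GG", "GT", "TT", "TG", "GG"])   # -> [1, 0, -1, 0, 1]
rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 63))  # rank-3 toy "population"
k = eigensnps(A)
```

On the rank-3 toy matrix, `eigensnps` returns a k of at most 3, mirroring how the number of eigenSNPs reflects the amount of redundancy in a population.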
(Figure: the individuals × SNPs matrix; "training" individuals, chosen uniformly at random, for whom we are given all SNPs, plus an SNP sample - a small number of SNPs given for all subjects.)

94-95 (figures)

96 Predicting SNPs across populations
Given all SNPs for all individuals in population X, and a small number of tagging-SNPs for population Y, predict all unassayed SNPs for all individuals of Y. Tagging-SNPs are selected using only the training set.
Training set: individuals in X. Test set: individuals in Y. A contains all individuals in both X and Y.
(Figure: the individuals × SNPs matrix; all individuals in population X, plus an SNP sample - a small number of SNPs given for all subjects.)

97-98 Transferability of tagging SNPs (FIG. 6)
(Figure: pipeline. IN: population A → select tSNPs → OUT: set of tSNPs → assay the tSNPs in population B → IN: population A and the assayed tSNPs in B → reconstruct SNPs → OUT: the unassayed SNPs in B. Entries marked "??" are unassayed genotypes to be predicted.)

99 (figure)

100 Keeping both SNPs and individuals
Given a small number of SNPs for all individuals, and all SNPs for some judiciously chosen individuals, predict the values of the remaining SNPs.
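The prediction scheme running through slides 93-100 - learn, on fully assayed individuals, a linear map from tagging-SNPs to all SNPs, then apply it to new individuals - might look like the following toy sketch. `fit_predict`, the synthetic data, and the tag set are all made up here; the actual tutorial selects the tags via CUR.

```python
import numpy as np

def fit_predict(train, test_tags, tag_idx):
    """On the training individuals, regress every SNP on the tagging
    SNPs (least squares); then reconstruct all SNPs for the test
    individuals from their tag assays, rounding to the codes {-1,0,+1}."""
    W = np.linalg.pinv(train[:, tag_idx]) @ train    # map: tags -> all SNPs
    return np.clip(np.rint(test_tags @ W), -1, 1)

rng = np.random.default_rng(0)
# Synthetic encoded matrix: 40 SNPs that are copies of 8 underlying patterns,
# so a small tag set carries (almost) all the information - the redundancy
# that makes tagging-SNPs work in the first place.
base = rng.integers(-1, 2, size=(60, 8))
A = np.hstack([base, base[:, rng.integers(0, 8, size=32)]])
train, test = A[:40], A[40:]           # training vs. test individuals
tag_idx = np.arange(8)                 # here the first 8 SNPs tag every pattern
pred = fit_predict(train, test[:, tag_idx], tag_idx)
accuracy = (pred == test).mean()
```

With fully redundant synthetic columns the reconstruction is essentially exact; on real data, accuracy degrades gracefully with the number of tags, which is what the experiments in these slides measure.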
(Figure: the individuals × SNPs matrix; "basis" individuals, judiciously chosen, for whom we are given all SNPs, plus an SNP sample - a small number of SNPs given for all subjects.)

101-102 (figures)

103 Overview (2/2)
• Tensor-based data sets
  • Tensor-CUR
  • Hyperspectral data
  • Recommendation systems
• From Very-Large to Medium-Sized Data
  • Relative-error CX and CUR Matrix Decompositions
  • L2 Regression Problems
  • Application to DNA SNP Data
• Conclusions and Open Problems

104 Conclusions & Open Problems
• Impose other structural properties in CUR-type decompositions:
  • non-negativity;
  • element quantization, e.g., to {0, 1, -1};
  • block-SVD type structure.
• Robust heuristics and robust extensions:
  • especially for noisy data;
  • L1 norm bounds.
• Extension to different statistical learning problems:
  • matrix reconstruction;
  • regression;
  • classification, e.g., SVM;
  • clustering.

105 Conclusions & Open Problems
• Relate to traditional numerical linear algebra:
  • Gu and Eisenstat - deterministically find well-conditioned columns;
  • Goreinov and Tyrtyshnikov - volume-maximization and conditioning criteria;
  • Stewart - backward error analysis.
• Empirical evaluation of different sampling probabilities:
  • uniform;
  • non-uniform: norms of rows/columns or of left/right singular vectors;
  • others, depending on the problem to be solved?
• Use CUR and CW^+C^T for improved interpretability of data matrices.
Structural vs. algorithmic issues.

106 Workshop on "Algorithms for Modern Massive Data Sets"
(http://www.stanford.edu/group/mmds/)
@ Stanford University and Yahoo! Research, June 21-24, 2006
Objectives:
- Explore novel techniques for modeling and analyzing massive, high-dimensional, and nonlinearly-structured data.
- Bring together computer scientists, computational and applied mathematicians, statisticians, and practitioners to promote cross-fertilization of ideas.
Organizers: G. Golub, M. W. Mahoney, L-H. Lim, and P. Drineas.
Sponsors: NSF, Yahoo! Research, Ask!