A sparse factor analysis model for high dimensional latent spaces

Chuan Gao
Institute for Genome Sciences & Policy
Duke University
[email protected]

Barbara E. Engelhardt
Department of Biostatistics & Bioinformatics
Institute for Genome Sciences & Policy
Duke University
[email protected]

Abstract

Inducing sparsity in factor analysis has become increasingly important as applications have arisen that are best modeled by a high dimensional, sparse latent space, where the interpretability of that latent space is critical. Applying latent factor models with a high dimensional latent space but without sparsity yields factors that may be artifactual and are prone to overfitting the data. In the Bayesian context, a number of sparsity-inducing priors have been proposed, but none that specifically addresses the setting of a high dimensional latent space. Here we describe a Bayesian sparse factor analysis model that uses a general three parameter beta prior which, given specific settings of its hyperparameters, can recapitulate other sparsity-inducing priors with appropriate modeling assumptions and computational properties. We apply the model to simulated and real gene expression data sets to illustrate the model's properties and to identify large numbers of sparse, possibly correlated factors in this space.

1 Introduction

Factor analysis has been used in a variety of settings to extract useful features from high dimensional data sets [1, 2]. In a general context, factor analysis has a number of drawbacks, such as unidentifiability with respect to rotation of the latent matrices and the difficulty of selecting the appropriate number of factors. One solution that addresses both drawbacks is to induce sparsity in the loading matrix.
By imposing substantial regularization on the loading matrix, the identifiability issue can be alleviated when the latent space is sufficiently sparse, and model selection criteria appear to be more effective at choosing the number of factors because the model does not overfit to the same extent as a non-sparse model.

There are currently a number of options for inducing sparsity constraints on the latent parameter space. We choose to work in the Bayesian context, where a sparsity-inducing prior should have substantial mass around zero to provide strong shrinkage near zero, and also heavy tails to allow signals to escape that shrinkage [3]. In the context of sparse regression, there have been a number of proposed solutions [4, 5, 6, 7, 8], some of which have been applied to latent factor models [9, 10]. However, all of these approaches in the factor analysis context impose an equal amount of shrinkage on all parameters, which may sacrifice small signals to achieve high levels of sparsity. To address this issue in the Bayesian latent factor model context, one can place a mixture of a point mass at zero and a normal distribution, a so-called 'spike-and-slab' prior, on the loading matrix [1]. Unfortunately, there is no closed-form solution for the parameter estimates, so MCMC is used to estimate them, which is computationally intractable for large data sets.

In this work, we use a three parameter beta (TPB) prior [11] as a general shrinkage prior for the factor loading matrix of a latent factor model. TPB(a, b, φ) is a generalized form of the beta distribution, with the third parameter φ further controlling the shape of the density. It has been shown that a linear transformation of the beta distribution, producing the inverse beta distribution, has desirable shrinkage properties in sparse modeling (the 'horseshoe' prior) [7]. The TPB distribution can be used to mimic this distribution, with the inverse beta variable scaled by φ.
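To make the scaling relationship concrete, a TPB(a, b, φ) draw can be sketched as a transformed, φ-scaled inverse beta (beta prime) variable, itself a ratio of two unit-rate gamma variables. This is an illustrative sketch, not the authors' code; the function name and parameterization are assumptions:

```python
import numpy as np

def sample_tpb(a, b, phi, size, rng):
    """Sketch: draw rho ~ TPB(a, b, phi) by scaling an inverse beta variable.

    An inverse beta (beta prime) draw is a ratio of two unit-rate gammas;
    scaling it by phi and mapping x -> 1 / (1 + x) gives the TPB draw.
    """
    beta_prime = rng.gamma(a, 1.0, size) / rng.gamma(b, 1.0, size)
    return 1.0 / (1.0 + phi * beta_prime)

rng = np.random.default_rng(0)
# With a = b = 1/2 and phi = 1 this reduces to Beta(1/2, 1/2), the
# horseshoe's shrinkage-weight distribution: mass piles up near 0 and 1.
rho = sample_tpb(0.5, 0.5, 1.0, 100_000, rng)
```

Draws of ρ near 1 correspond to strong shrinkage toward zero, while draws near 0 let signal escape nearly unshrunk, mirroring the horseshoe behavior described above.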
The TPB is thus appealing because a) it can be used to recapitulate the sparsity-inducing properties of the horseshoe prior, with substantial mass around zero to provide strong shrinkage for noise and heavy tails to avoid shrinking signal, and b) by carefully controlling its parameters, it recapitulates the two-groups model [12, 3] for priors with different shrinkage characteristics. This allows us to recreate a two-groups sparsity-inducing prior, with one mode centered at zero and the other at the posterior mean; it also has a straightforward posterior distribution whose parameters can be estimated via expectation maximization, making it computationally tractable. In the setting of identifying a large number of factors that may individually contribute minimally to the data variance, these two features, namely computational tractability and two-groups sparsity modeling, are critical to effectively model the data.

2 Bayesian sparse factor model via TPB

We define the factor analysis model as follows:

Y = ΛX + W,  (1)

where Y has dimension n × m, Λ is the loading matrix with dimension n × p, X is the factor matrix with dimension p × m, and W is the n × m residual error matrix, where we assume W ∼ N(0, Ψ). For computational tractability, we assume Ψ is diagonal (but the diagonal elements are not necessarily equal). For the latent variable X, we follow convention by giving it a standard normal prior, X ∼ N(0, I). To induce sparsity in the factor loading matrix Λ, we put the following priors on each element λik of the parameter matrix Λ:

λik ∼ N(0, 1/ρik − 1),  (2)
ρik ∼ TPB(a, b, 1/γk − 1),  (3)
γk ∼ TPB(c, d, ν).  (4)

In the prior on λik, ρik provides local shrinkage for each element, while γk controls the global shrinkage and is specific to each factor k. As in the horseshoe, the variance 1/ρik − 1 ∈ [0, ∞) has the desirable properties of strong shrinkage to zero while not overly shrinking signals.
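To make the dimensions in Eq. (1) concrete, the generative model can be sketched as follows. This is an illustrative simulation, not the authors' code; the Bernoulli mask stands in for the TPB-induced sparsity, and the Ψ values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, m = 200, 20, 500               # Y: n x m, Lambda: n x p, X: p x m

Lambda = rng.normal(size=(n, p))     # loading matrix
Lambda *= rng.random((n, p)) < 0.1   # crude sparsity: keep ~10% of loadings
X = rng.normal(size=(p, m))          # factors, X ~ N(0, I)
psi = rng.uniform(0.5, 1.5, size=n)  # diagonal of Psi (unequal variances)
W = rng.normal(size=(n, m)) * np.sqrt(psi)[:, None]  # W ~ N(0, Psi)

Y = Lambda @ X + W                   # Eq. (1)
```

Because Ψ is diagonal, each row of W is drawn independently with its own variance, which is what makes the residual covariance cheap to work with.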
This general model is able to capture a number of different shrinkage scenarios, depending on the values of a, b, and ρ (Table 1).

Table 1: Shrinkage effects for different values of a, b, and ρ.

        horseshoe (a = b = 1/2)   weak (a ↑ and b ↓)   strong (a ↓ and b ↑)
ρ > 1   strong                    variable             strong
ρ < 1   weak                      weak                 variable

3 Posterior distribution

We generalize this prior further for the latent factor model. For a given parameter θ and scale φ, the following relationship holds [11]:

θ/φ ∼ β′(a, b)  ⇔  θ ∼ G(a, λ) and λ ∼ G(b, φ),  (5)

where β′(a, b) and G indicate an inverse beta and a gamma distribution, respectively. From Equations 2, 3, and 4, if we make the substitutions θik = 1/ρik − 1 and φk = 1/γk − 1, it can be shown that θik/φk ∼ β′(a, b). This relationship implies the following simple hierarchical structure for the latent factor model: λik ∼ N(0, θik), θik ∼ G(a, δik), δik ∼ G(b, φk), φk ∼ G(c, η), and η ∼ G(d, ν), where φk controls the global shrinkage and θik controls the local shrinkage. We give Ψ an uninformative prior.

Based on the posterior distributions, a Gibbs sampler can be constructed to iteratively sample the parameter values. For faster computation, we use expectation maximization (EM), where the expectation step takes the expected value of the latent variable X, and the maximization step identifies MAP parameter estimates (see paper for details).

4 Results

4.1 Simulations

We simulated five data sets with different levels of sparsity to test the performance of our model, with sample size n = 200, m = 500, and p = 20 factors. The 200 × 20 loading matrices Λ were generated from the above model, setting a = b = 1. To adjust the sparsity of the matrix, we let ν take values in the range [10⁻⁴, 1], where smaller values of ν produce more sparsity in the matrix. Both X and W were drawn from N(0, I) with the appropriate dimensions.
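A loading matrix like the ones used in these simulations can be drawn directly from the hierarchical gamma representation of the prior. This sketch is illustrative, not the authors' code; it assumes the text's G(shape, rate) parameterization, so NumPy's scale argument is the reciprocal of the rate, and the hyperparameter values are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 200, 20
a, b, c, d, nu = 1.0, 1.0, 1.0, 1.0, 1e-2     # illustrative hyperparameters

# G(shape, rate) in the text; numpy's gamma takes (shape, scale = 1/rate).
eta = rng.gamma(d, 1.0 / nu)                  # eta      ~ G(d, nu)
phi = rng.gamma(c, 1.0 / eta, size=p)         # phi_k    ~ G(c, eta)
delta = rng.gamma(b, 1.0 / phi, size=(n, p))  # delta_ik ~ G(b, phi_k)
theta = rng.gamma(a, 1.0 / delta)             # theta_ik ~ G(a, delta_ik)
Lam = rng.normal(0.0, np.sqrt(theta))         # lambda_ik ~ N(0, theta_ik)
```

Smaller ν pushes η up and the variances θik down, so most loadings are shrunk toward zero, consistent with the sparsity behavior described above; the actual fit, by contrast, estimates these quantities by EM rather than sampling them.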
We set a = b = 1/2 to recapitulate the horseshoe prior and differentiate the fitted prior from the simulated distribution. For ν, we used values between 1 and 0.01, with minimal changes in the estimates. We ran EM from a random starting point ten times and used the parameters from the run with the best fit. We compared our results with the Bayesian Factor Regression Model (BFRM) [1] and the K-SVD model [9]. BFRM was run with default settings, with a burn-in period of 2,000 iterations and a sampling period of 20,000 iterations; K-SVD was run with the same settings as the demonstration file included in the package.

We compared the three models by looking at the sparsity level of each method versus the amount of information represented in the latent subspace. The sparsity level was measured by the Gini index [13]. For a list of values ~c ordered ascendingly, the Gini index is

1 − 2 Σ_{k=1}^{N} (c_(k)/‖~c‖₁) · (N − k + 1/2)/N,

where N is the total number of elements in the list; larger values indicate a sparser representation. The accuracy of the prediction is reflected in the mean squared error (MSE), computed from the residuals. We find that, compared to BFRM, TPB achieves equivalent or better MSE with far more sparsity (a range of [0.5, 0.9] for TPB versus [0.4, 0.6] for BFRM). Compared to K-SVD, our method adaptively learned the sparsity of the data, keeping MSE low, while K-SVD maintains the same sparsity level for all simulations, sacrificing accuracy for sparsity. Interestingly, for sparser simulations, both K-SVD and TPB achieve sparser estimates than the real data while maintaining the same prediction accuracies. We find that the Bayesian Information Criterion (BIC) score, depending on the value of ν, is fairly accurate in terms of the number of selected features in this context (Figure 2).

Figure 1: Comparison of the sparsity level (left panel, Y-axis) and the MSE (right panel, Y-axis) of TPB, BFRM, and K-SVD. The underlying sparsity is on the X-axis.
The true sparsity in the left panel corresponds to the line with slope 1.

4.2 Gene Expression Analysis

Microarrays can generate gene expression levels for tens of thousands of genes in a sample quickly and at low cost. Biologists know that genes do not function as independent units, but instead as parts of complicated networks with different biochemical purposes [14, 15]. As a result, genes that share similar functions tend to have expression levels that are correlated across samples because, for example, they may be regulated by common transcription factors. Identifying these co-regulated sets of genes from high dimensional gene expression measurements is critical for the analysis of gene networks and for identifying genetic variants that impact transcription over long genomic distances. The number of co-regulated sets of genes may be very large relative to the number of genes in the gene expression matrix.

Figure 2: Plot showing the fitness of the model, with factor number on the x-axis and log likelihood (left panel) or BIC score (right panel) on the y-axis, for ν ∈ {0.1, 0.01, 0.001, 0.0001}. In this simulated data with twenty factors, the BIC score is fairly accurate in terms of the number of selected features across different values of ν.

To this end, we applied our method to a subset of 18,262 genes measured in 354 human cerebellum samples [unpublished]. We set K = 1000 and ran EM from ten starting points with a = 0.5, b = 1000, and ν = 10⁻⁴ to induce strong shrinkage; we used the result with the best fit. We note that the sparse prior alleviates the problem of overfitting by shrinking unnecessary factors to 0. By looking at the correlation of the genes that load on each factor (those with values ≠ 0), we found that the genes on the same factors clustered well (Figure 3, left).
The sizes of the gene clusters range from 0 to 500 genes, with most around 50 (Figure 3, right).

Figure 3: Correlation of genes loaded on the first few factors (left) and the distribution of gene cluster sizes for a total of 1000 factors (right). Factors on the left are delimited by black lines.

5 Conclusions

We built a model for sparse factor analysis that uses a three parameter beta prior to induce shrinkage. We found that this model has favorable characteristics for estimating possibly high dimensional latent spaces. We are further testing the robustness of the estimates from our model and will use the factors to identify genetic variants that are associated with long-distance genetic regulation of each factor.

Acknowledgments

The authors would like to thank Sayan Mukherjee for helpful conversations. The gene expression data were generated by Merck Research Laboratories in collaboration with the Harvard Brain Tissue Resource Center and were obtained through the Synapse data repository (data set id: syn4505 at https://synapse.sagebase.org/#Synapse:syn4505).

References

[1] C. M. Carvalho, J. E. Lucas, Q. Wang, J. Chang, J. R. Nevins, and M. West. High-dimensional sparse factor modelling: applications in gene expression genomics. Journal of the American Statistical Association, 103:1438–1456, 2008. PMCID 3017385.
[2] Mingyuan Zhou, Lauren Hannah, David Dunson, and Lawrence Carin. Beta-negative binomial process and Poisson factor analysis. December 2011.
[3] Nicholas G. Polson and James G. Scott. Shrink globally, act locally: sparse Bayesian regularization and prediction. In Bayesian Statistics 9, 2010.
[4] Michael E. Tipping. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res., 1:211–244, September 2001.
[5] Jim E. Griffin and Philip J. Brown. Inference with normal-gamma prior distributions in regression problems. Bayesian Analysis, 5(1):171–188, 2010.
[6] Trevor Park and George Casella. The Bayesian lasso. Journal of the American Statistical Association, 103(482):681–686, 2008.
[7] Carlos M.
Carvalho, Nicholas G. Polson, and James G. Scott. Handling sparsity via the horseshoe. Journal of Machine Learning Research: Proceedings Track, pages 73–80, 2009.
[8] Barbara Engelhardt and Matthew Stephens. Analysis of population structure: a unifying framework and novel methods based on sparse factor analysis. PLoS Genetics, 6(9):e1001117, 2010.
[9] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, November 2006.
[10] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res., 11:19–60, March 2010.
[11] Artin Armagan, David Dunson, and Merlise Clyde. Generalized beta mixtures of Gaussians. In J. Shawe-Taylor, R.S. Zemel, P. Bartlett, F.C.N. Pereira, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 24, pages 523–531. 2011.
[12] B. Efron. Microarrays, empirical Bayes and the two-groups model. Statistical Science, 23:1–47, 2008.
[13] Niall Hurley and Scott Rickard. Comparing measures of sparsity. In IEEE Workshop on Machine Learning for Signal Processing (MLSP 2008), pages 55–60, October 2008.
[14] Yoo-Ah Kim, Stefan Wuchty, and Teresa Przytycka. Identifying causal genes and dysregulated pathways in complex diseases. PLoS Computational Biology, 7(3):e1001095, 2011.
[15] Yanqing Chen, Jun Zhu, Pek Lum, Xia Yang, Shirly Pinto, Douglas MacNeil, Chunsheng Zhang, John Lamb, Stephen Edwards, Solveig Sieberts, Amy Leonardson, Lawrence Castellini, Susanna Wang, Marie-France Champy, Bin Zhang, Valur Emilsson, Sudheer Doss, Anatole Ghazalpour, Steve Horvath, Thomas Drake, Aldons Lusis, and Eric Schadt. Variations in DNA elucidate molecular networks that cause disease. Nature, 452(7186):429–435, 2008.