A sparse factor analysis model for high dimensional
latent spaces
Chuan Gao
Institute for Genome Sciences & Policy
Duke University
[email protected]
Barbara E Engelhardt
Department of Biostatistics & Bioinformatics
Institute for Genome Sciences & Policy
Duke University
[email protected]
Abstract
Inducing sparsity in factor analysis has become increasingly important as applications have arisen that are best modeled by a high dimensional, sparse latent
space, and the interpretability of this latent space is critical. Applying latent factor models with a high dimensional latent space but without sparsity yields nonsense factors that may be artifactual and are prone to overfitting the data. In
the Bayesian context, a number of sparsity-inducing priors have been proposed,
but none that specifically address the context of a high dimensional latent space.
Here we describe a Bayesian sparse factor analysis model that uses a general three
parameter beta prior, which, given specific settings of hyperparameters, can recapitulate sparsity inducing priors with appropriate modeling assumptions and computational properties. We apply the model to simulated and real gene expression
data sets to illustrate the model properties and to identify large numbers of sparse,
possibly correlated factors in this space.
1 Introduction
Factor analysis has been used in a variety of settings to extract useful features from high dimensional data sets [1, 2]. In its general form, however, factor analysis has a number of drawbacks, such as
unidentifiability with respect to the rotation of the latent matrices, and the difficulty of selecting the
appropriate number of factors. One solution that addresses these drawbacks is to induce sparsity in
the loading matrix. By imposing substantial regularization on the loading matrix, the identifiability
issue can be alleviated when the latent space is sufficiently sparse, and model selection criteria appear to be more effective at choosing the number of factors because the model does not overfit to the
same extent as a non-sparse model.
There are currently a number of options for how to induce sparsity constraints on the latent parameter
space. We choose to work in the Bayesian context, where a sparsity-inducing prior should have
substantial mass around zero to provide strong shrinkage near zero, and also have heavy tails to
allow signals to escape strong shrinkage [3]. In the context of sparse regression, there have been
a number of proposed solutions [4, 5, 6, 7, 8], some of which have been applied to latent factor
models [9, 10]. However, all of these approaches in the factor analysis context impose an equal
amount of shrinkage on all parameters, which may sacrifice small signals to achieve high levels of
sparsity. To address this issue in the Bayesian latent factor model context, one can use a mixture
with a point mass at zero and a normal distribution, a so-called ‘spike and slab’ prior, on the loading
matrix [1]. Unfortunately, there is no closed form solution for the parameter estimates, so MCMC is used to estimate them, which is computationally intractable for large data sets.
In this work, we use a three parameter beta (TPB) prior [11] as a general shrinkage prior for the factor loading matrix of a latent factor model. TPB(a, b, φ) is a generalized form of the Beta distribution, with the third parameter φ further controlling the shape of the density. It has been shown that a linear transformation of the Beta distribution, producing the inverse Beta distribution, has
desirable shrinkage properties in sparse modeling (the ‘horseshoe’ prior) [7]. The TPB distribution can be used to mimic this distribution, with the inverse beta variable scaled by φ. The TPB is thus appealing because a) it can be used to recapitulate the sparsity-inducing properties of the horseshoe prior, with substantial mass around zero to provide strong shrinkage for noise and heavy tails to avoid shrinking signal, and b) by carefully controlling its parameters, it recapitulates the two-groups model [12, 3] for priors with different shrinkage characteristics. This allows us to recreate a two-groups sparsity-inducing prior, with one mode centered at zero and the other at the posterior mean, and it also has a straightforward posterior distribution for which the parameters can be estimated via expectation maximization, making it computationally tractable. In the setting of identifying a large number of factors that may individually contribute minimally to the data variance, these two features, namely computational tractability and two-groups sparsity modeling, are critical to effectively model the data.
2 Bayesian sparse factor model via TPB
We will define the factor analysis model as follows:
Y = ΛX + W,    (1)
where Y has dimension n × m, Λ is the loading matrix with dimension n × p, X is the factor matrix
with dimension p × m, and W is the n × m residual error matrix, where we assume W ∼ N(0, Ψ).
For computational tractability, we assume Ψ is diagonal (but the diagonal elements are not necessarily the same). For the latent variable X, we follow convention by giving it a standard normal
prior, X ∼ N(0, I). To induce sparsity in the factor loading matrix Λ, we put the following priors
on each element λik of the parameter matrix Λ:
λik ∼ N(0, 1/ρik − 1),    (2)
ρik ∼ TPB(a, b, 1/γk − 1),    (3)
γk ∼ TPB(c, d, ν).    (4)
In the prior on λik, ρik provides local shrinkage for each element, while γk controls the global shrinkage and is specific to each factor k. As in the horseshoe, 1/ρik − 1 ∈ [0, ∞) has the desirable properties of strong shrinkage to zero while not overly shrinking signals. This general model is able to capture a number of different types of shrinkage scenarios, depending on the values of a, b, and ρ (Table 1).
Table 1: Shrinkage effects for different values of a, b, and ρ.

                             ρ > 1       ρ < 1
  horseshoe   a = b = 1/2    strong      weak
  weak        a ↑ and b ↓    variable    weak
  strong      a ↓ and b ↑    strong      variable

3 Posterior distribution
We will generalize this prior further for the latent factor model. For a given parameter θ and scale
φ, the following relationship holds [11]:
θ/φ ∼ β′(a, b)  ⇔  θ ∼ G(a, λ) and λ ∼ G(b, φ),    (5)
where β′(a, b) and G indicate an inverse beta and a gamma distribution, respectively. From Equations 2, 3 and 4, if we make the substitutions θik = 1/ρik − 1 and φk = 1/γk − 1, it can be shown that θik/φk ∼ β′(a, b). This relationship implies the following simple hierarchical structure for the latent factor model: λik ∼ N(0, θik), θik ∼ G(a, δik), δik ∼ G(b, φk), φk ∼ G(c, η) and η ∼ G(d, ν), where the parameter φk controls the global shrinkage and θik controls the local shrinkage. We give Ψ an uninformative prior.
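The β′ relationship in Equation 5 is easy to check numerically. Below is a minimal Python sketch (our own illustration, not the authors' code) that draws θ ∼ G(a, λ) with λ ∼ G(b, φ) and tests θ/φ against the β′(a, b) distribution; note that numpy parameterizes the gamma by shape and scale, so a rate λ becomes a scale 1/λ.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    a, b, phi = 0.5, 0.5, 2.0                      # illustrative hyperparameters
    lam = rng.gamma(b, 1.0 / phi, size=100_000)    # lambda ~ G(b, rate=phi)
    theta = rng.gamma(a, 1.0 / lam)                # theta | lambda ~ G(a, rate=lambda)

    # By Equation 5, theta/phi should follow the inverse beta law beta'(a, b).
    ks = stats.kstest(theta / phi, stats.betaprime(a, b).cdf)
    print(f"KS p-value: {ks.pvalue:.3f}")          # large value => consistent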
Based on the posterior distributions, a Gibbs sampler can be constructed to iteratively sample the
parameter values from their posterior distributions. For faster computation, we use Expectation
Maximization (EM), where the expectation step involves taking the expected value of the latent
variable X, and the maximization step identifies MAP parameter estimates (see paper for details).
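The M-step updates depend on the TPB prior and are deferred to the paper, but the expectation in the E-step is the standard Gaussian factor analysis computation. A minimal sketch of that step, under our own naming and assuming Ψ = diag(ψ):

    import numpy as np

    def e_step(Y, Lambda, psi):
        """E-step for Y = Lambda X + W with X ~ N(0, I), W ~ N(0, diag(psi)).

        Returns E[X | Y] (p x m) and the shared posterior covariance of each
        column of X (p x p), the quantities needed before maximizing over
        Lambda and psi.
        """
        Lp = Lambda / psi[:, None]                 # Psi^{-1} Lambda (Psi diagonal)
        Sigma_x = np.linalg.inv(np.eye(Lambda.shape[1]) + Lambda.T @ Lp)
        EX = Sigma_x @ (Lp.T @ Y)                  # posterior mean, one column per sample
        return EX, Sigma_x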
4 Results

4.1 Simulations
We simulated five data sets with different levels of sparsity to test the performance of our model, with sample size n = 200, m = 500, and p = 20 factors. The 200 × 20 loading matrices Λ were generated from the above model, setting a = b = 1. To adjust the sparsity of the matrix, we let ν take values in the range [10⁻⁴, 1], where smaller values of ν produce more sparsity in the matrix. Both X and W were drawn from N(0, I) with appropriate dimensions. For fitting, we set a = b = 1/2 to recapitulate the horseshoe prior and to differentiate the prior from the simulated distribution. For ν, we used values between 1 and 0.01, with minimal changes in the estimates. We ran EM from a random starting point ten times, and used the parameters from the run with the best fit. We compared our results with the Bayesian Factor Regression Model (BFRM) [1] and the K-SVD model [9]. BFRM was run with default settings, with a burn-in period of 2,000 and a sampling period of 20,000, and K-SVD was run with the same settings as the demonstration file included in the package.
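For concreteness, data sets of this form can be drawn directly from the generative model. The following sketch is our own code, using the gamma-mixture representation from Section 3, with c and d set to illustrative defaults since the paper does not report them:

    import numpy as np

    def simulate_dataset(n=200, m=500, p=20, a=1.0, b=1.0, c=1.0, d=1.0,
                         nu=1e-2, seed=0):
        """Draw (Y, Lambda, X) from the sparse factor model of Section 2.

        a = b = 1 and nu in [1e-4, 1] match the simulation settings above;
        smaller nu yields a sparser loading matrix. c and d are illustrative
        defaults, since the paper does not report them.
        """
        rng = np.random.default_rng(seed)
        eta = rng.gamma(d, 1.0 / nu)                 # eta ~ G(d, rate=nu)
        phi = rng.gamma(c, 1.0 / eta, size=p)        # per-factor global scales
        delta = rng.gamma(b, 1.0 / phi, size=(n, p))
        theta = rng.gamma(a, 1.0 / delta)            # per-element variances
        Lambda = rng.normal(0.0, np.sqrt(theta))     # approximately sparse loadings
        X = rng.normal(size=(p, m))                  # X ~ N(0, I)
        W = rng.normal(size=(n, m))                  # residuals with Psi = I
        return Lambda @ X + W, Lambda, X

    Y, Lambda, X = simulate_dataset(nu=1e-4)         # sparsest setting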
We compared the three models by looking at the sparsity level of the three methods versus the amount of information represented in the latent subspace. The sparsity level was measured by the Gini index [13]. For a list of values c ordered so that c(1) ≤ · · · ≤ c(N), the Gini index is:

1 − 2 ∑_{k=1}^{N} (c(k) / ||c||1) · ((N − k + 1/2) / N),

where N is the total number of elements in the list; bigger values indicate a sparser representation.
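As a reference implementation, the index can be computed in a few lines (our own helper, not from the paper):

    import numpy as np

    def gini_index(c):
        """Gini index of a coefficient list; larger values mean sparser.

        Implements 1 - 2 * sum_k (c(k)/||c||1) * ((N - k + 1/2)/N) with the
        magnitudes c(1) <= ... <= c(N) sorted in ascending order.
        """
        c = np.sort(np.abs(np.asarray(c, dtype=float)))   # ascending magnitudes
        N = c.size
        l1 = c.sum()
        if l1 == 0.0:
            return 0.0                                    # convention for all zeros
        k = np.arange(1, N + 1)
        return 1.0 - 2.0 * np.sum((c / l1) * ((N - k + 0.5) / N))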
The accuracy of the prediction is reflected in the mean squared error (MSE), computed from the residuals. We find that, compared to BFRM, TPB achieves equivalent or better MSE with far more sparsity (range of [0.5, 0.9] for TPB versus [0.4, 0.6] for BFRM). Compared to K-SVD, our method adaptively learned the sparsity of the data, keeping MSE low, while K-SVD maintains the same sparsity level for all simulations, sacrificing accuracy for sparsity. Interestingly, for sparser simulations, both K-SVD and TPB achieve sparser estimates than the real data, while maintaining the same prediction accuracies. We find that the Bayes Information Criterion (BIC) score, depending on the value of ν, is fairly accurate in terms of the number of selected factors in this context (Figure 2).
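The paper does not spell out how the BIC degrees of freedom are counted for a sparse loading matrix; one plausible instantiation, under our assumption that the effective parameter count is the number of nonzero loadings, is:

    import numpy as np

    def bic_score(loglik, Lambda, n, m, tol=1e-8):
        """BIC = -2 log L + df * log(n * m).

        Our assumption, not stated in the paper: the effective degrees of
        freedom df are counted as the number of nonzero loadings.
        """
        df = int(np.sum(np.abs(Lambda) > tol))
        return -2.0 * loglik + df * np.log(n * m)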
Figure 1: Comparison of the sparsity level (left panel, y-axis) and the MSE (right panel, y-axis) of TPB, BFRM and K-SVD. The underlying sparsity is on the x-axis. The true sparsity in the left panel corresponds to the line with slope 1.
4.2 Gene Expression Analysis
Microarrays are able to generate gene expression levels for tens of thousands of genes in a sample
quickly and at low cost. Biologists know that genes do not function as independent units, but instead
as parts of complicated networks with different biochemical purposes [14, 15]. As a result, genes
that share similar functions tend to have gene expression levels that are correlated across samples
because, for example, they may be regulated by common transcription factors. Identifying these co-regulated sets of genes from high dimensional gene expression measurements is critical for analysis
of gene networks and for identifying genetic variants that impact transcription from long genomic
distances. The number of co-regulated sets of genes may be very large, relative to the number of
genes in the gene expression matrix.
[Figure 2: log likelihood (left) and BIC (right) versus the number of factors, with curves for the data and for ν = 0.1, 0.01, 0.001, 0.0001.]
Figure 2: Model fit, with the number of factors on the x-axis and the log likelihood (left panel) or the BIC score (right panel) on the y-axis. In this simulated data set with twenty factors, the BIC score is fairly accurate in terms of the number of selected factors across different values of ν.
To this end, we applied our method to a subset of 18,262 genes with a sample size of 354 human cerebellum samples [unpublished]. We set K = 1000 and ran EM from ten starting points with a = 0.5, b = 1000 and ν = 10⁻⁴ to induce strong shrinkage; we used the result with the best fit. We note that the sparse prior alleviates the problem of overfitting by shrinking unnecessary factors to 0. By looking at the correlation of the genes that load on each factor (those that have values ≠ 0), we found that the genes on the same factors clustered well (Figure 3, left). The sizes of the gene clusters range from 0 to 500, with most around 50 (Figure 3, right).
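Reading off the clusters amounts to taking the support of each column of Λ and checking within-cluster correlation; a minimal sketch (our own code, with a small threshold standing in for exact zeros):

    import numpy as np

    def factor_clusters(Lambda, tol=1e-8):
        """Indices of genes with nonzero loadings, one list per factor."""
        return [np.flatnonzero(np.abs(Lambda[:, k]) > tol)
                for k in range(Lambda.shape[1])]

    def mean_within_factor_correlation(Y, genes):
        """Average absolute pairwise correlation of a factor's genes across samples."""
        if len(genes) < 2:
            return np.nan
        C = np.corrcoef(Y[genes, :])                  # genes x genes correlation
        return float(np.mean(np.abs(C[np.triu_indices_from(C, k=1)])))

    # clusters = factor_clusters(Lambda)
    # sizes = [len(g) for g in clusters]              # histogram -> Figure 3, right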
Figure 3: Correlation of genes loaded on the first few factors (left) and the distribution of gene cluster sizes for a total of 1000 factors (right). Factors on the left are denoted by black lines.
5 Conclusions
We built a model for sparse factor analysis using a three parameter beta prior to induce shrinkage. We found that this model has favorable characteristics for estimating possibly high-dimensional latent spaces. We are further testing the robustness of estimates from our model and will use the factors to identify genetic variants that are associated with long-distance genetic regulation of each factor.
Acknowledgments
The authors would like to thank Sayan Mukherjee for helpful conversations. The gene expression data were generated by Merck Research Laboratories in collaboration with the Harvard Brain Tissue Resource Center and were obtained through the Synapse data repository (data set id: syn4505 at https://synapse.sagebase.org/#Synapse:syn4505).
References
[1] C. M. Carvalho, J. E. Lucas, Q. Wang, J. Chang, J. R. Nevins, and M. West. High-dimensional sparse factor modelling: Applications in gene expression genomics. Journal of the American Statistical Association, 103:1438–1456, 2008. PMCID 3017385.
[2] Mingyuan Zhou, Lauren Hannah, David Dunson, and Lawrence Carin. Beta-Negative Binomial Process and Poisson Factor Analysis. December 2011.
[3] Nicholas G. Polson and James G. Scott. Shrink globally, act locally: Sparse Bayesian regularization and prediction.
[4] Michael E. Tipping. Sparse Bayesian learning and the relevance vector machine. J. Mach. Learn. Res., 1:211–244, September 2001.
[5] Jim E. Griffin and Philip J. Brown. Inference with normal-gamma prior distributions in regression problems. Bayesian Analysis, 5(1):171–188, 2010.
[6] Trevor Park and George Casella. The Bayesian Lasso. Journal of the American Statistical
Association, 103(482):681–686, 2008.
[7] Carlos M. Carvalho, Nicholas G. Polson, and James G. Scott. Handling sparsity via the horseshoe. Journal of Machine Learning Research - Proceedings Track, pages 73–80, 2009.
[8] Barbara Engelhardt and Matthew Stephens. Analysis of population structure: A unifying
framework and novel methods based on sparse factor analysis. PLoS Genetics, 6(9):e1001117,
2010.
[9] M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, November 2006.
[10] Julien Mairal, Francis Bach, Jean Ponce, and Guillermo Sapiro. Online learning for matrix
factorization and sparse coding. J. Mach. Learn. Res., 11:19–60, March 2010.
[11] Artin Armagan, David Dunson, and Merlise Clyde. Generalized beta mixtures of gaussians.
In J. Shawe-Taylor, R.S. Zemel, P. Bartlett, F.C.N. Pereira, and K.Q. Weinberger, editors,
Advances in Neural Information Processing Systems 24, pages 523–531. 2011.
[12] B. Efron. Microarrays, Empirical Bayes and the Two-Groups Model. Statistical Science,
23:1–47, 2008.
[13] Niall Hurley and Scott Rickard. Comparing measures of sparsity. In IEEE Workshop on Machine Learning for Signal Processing (MLSP 2008), pages 55–60, October 2008.
[14] Yoo-Ah Kim, Stefan Wuchty, and Teresa Przytycka. Identifying causal genes and dysregulated
pathways in complex diseases. PLoS computational biology, 7(3):e1001095, 2011.
[15] Yanqing Chen, Jun Zhu, Pek Lum, Xia Yang, Shirly Pinto, Douglas MacNeil, Chunsheng Zhang, John Lamb, Stephen Edwards, Solveig Sieberts, Amy Leonardson, Lawrence
Castellini, Susanna Wang, Marie-France Champy, Bin Zhang, Valur Emilsson, Sudheer Doss,
Anatole Ghazalpour, Steve Horvath, Thomas Drake, Aldons Lusis, and Eric Schadt. Variations
in DNA elucidate molecular networks that cause disease. Nature, 452(7186):429–435, 2008.