(Semi-)Supervised Probabilistic Principal Component Analysis
Shipeng Yu, University of Munich, Germany / Siemens Corporate Technology
http://www.dbs.ifi.lmu.de/~spyu
Joint work with Kai Yu, Volker Tresp, Hans-Peter Kriegel, Mingrui Wu

Dimensionality Reduction
- We are dealing with high-dimensional data:
  - Texts: e.g. bag-of-words features
  - Images: color histograms, correlograms, etc.
  - Web pages: texts, linkages, structures, etc.
- Motivations:
  - Noisy features: how to remove or down-weight them?
  - Learnability: the "curse of dimensionality"
  - Inefficiency: high computational cost
  - Visualization
  - A pre-processing step for many data mining tasks

Unsupervised versus Supervised
- Unsupervised dimensionality reduction: only the input data are given.
  - PCA (principal component analysis)
- Supervised dimensionality reduction: should be biased by the outputs.
  - Classification: FDA (Fisher discriminant analysis)
  - Regression: PLS (partial least squares)
  - Real-valued outputs: CCA (canonical correlation analysis)
- More general solutions? Semi-supervised?

Our Settings and Notations
- N data points, M input features, L output labels.
- We aim to derive a mapping $\Psi: \mathcal{X} \mapsto \mathcal{V}$ such that $\mathcal{V} \subset \mathbb{R}^K$, $K < M$.
- (Figure: the data matrices $X \in \mathbb{R}^{N \times M}$ and $Y \in \mathbb{R}^{N \times L}$. Unsupervised: only $X$ is given. Supervised: every row of $X$ has a corresponding row of $Y$. Semi-supervised: only the first $N_1$ rows of $Y$ are observed; the remaining points are unlabeled data!)

Outline
- Principal Component Analysis
- Probabilistic PCA
- Supervised Probabilistic PCA
- Related Work
- Conclusion

PCA: Motivation
- Find the K orthogonal projection directions that carry the most data variance.
- Applications: visualization, de-noising, latent semantic indexing, eigenfaces, ...
- (Figure: 2-D toy data with the 1st and 2nd principal component directions.)

PCA: Algorithm
Basic algorithm (a NumPy sketch of these steps is given at the end of this part):
- Centralize the data: $x_i \leftarrow x_i - \bar{x}$, where $\bar{x} = \frac{1}{N}\sum_{i=1}^{N} x_i$.
- Compute the sample covariance matrix $S = \frac{1}{N}\sum_{i=1}^{N} x_i x_i^\top = \frac{1}{N} X^\top X$.
- Do the eigen-decomposition $S = U \Lambda U^\top$ (eigenvalues sorted decreasingly).
- The PCA directions are given by $U_K$, the first K columns of $U$.
- The PCA projection of a test point $x^*$ is $v^* = U_K^\top x^*$.

Supervised PCA?
- PCA is unsupervised. But output information may be available:
  - Classification labels: 0/1
  - Regression responses: real values
  - Ranking orders: rank labels / preferences
  - Multi-outputs: output dimension > 1
  - Structured outputs, ...
- Can PCA be biased by the outputs? And how?

Outline (next: Probabilistic PCA)

Latent Variable Model for PCA
- Another interpretation of PCA [Pearson 1901]: PCA minimizes the reconstruction error of $X$:
  $\min_{A,V} \|X - VA\|_F^2 \quad \text{s.t.} \quad V^\top V = I$
- $V \in \mathbb{R}^{N \times K}$ are the latent variables: the PCA projections of $X$.
- $A \in \mathbb{R}^{K \times M}$ are the factor loadings: the PCA mappings.
- Equivalent to PCA up to a scaling factor.
- Leads to the idea of PPCA.

Probabilistic PCA [TipBis99]
- Latent variable model: $x = W_x v + \mu_x + \epsilon_x$
  - Latent variables: $v \sim \mathcal{N}(0, I)$
  - Noise process: $\epsilon_x \sim \mathcal{N}(0, \sigma_x^2 I)$
  - Hence $x \mid v \sim \mathcal{N}(W_x v + \mu_x, \sigma_x^2 I)$: the input dimensions are conditionally independent given $v$.
- $x$ is Gaussian distributed: $P(x) = \int P(x \mid v) P(v)\, dv$.
- If $\sigma_x^2 \to 0$, PPCA leads to the PCA solution (up to a rotation and a scaling factor).
- (A generative-model sketch follows at the end of this part.)

From Unsupervised to Supervised
- Key insights of PPCA:
  - All M input dimensions are conditionally independent given the K latent variables.
  - In PCA we seek the K latent variables that best explain the data covariance.
- When we have outputs $Y \in \mathbb{R}^{N \times L}$, we believe:
  - There are inter-covariances between $X$ and $Y$.
  - There are intra-covariances within $Y$ if $L > 1$.
- Idea: let the latent variables explain all of them!
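As referenced on the PCA: Algorithm slide, here is a minimal NumPy sketch of the basic algorithm (centering, sample covariance, eigen-decomposition, projection). The function names and the usage line are illustrative additions, not part of the original slides.

```python
import numpy as np

def pca_fit(X, K):
    """Basic PCA as on the 'PCA: Algorithm' slide.

    X : (N, M) data matrix, one row per data point.
    K : number of principal components to keep.
    Returns the mean vector and the (M, K) projection matrix U_K.
    """
    x_bar = X.mean(axis=0)              # centralize the data
    Xc = X - x_bar
    S = Xc.T @ Xc / X.shape[0]          # sample covariance S = (1/N) X^T X
    eigvals, U = np.linalg.eigh(S)      # eigen-decomposition (ascending order)
    order = np.argsort(eigvals)[::-1]   # sort eigenvalues decreasingly
    return x_bar, U[:, order[:K]]       # first K eigenvectors = PCA directions

def pca_project(x_star, x_bar, U_K):
    """Projection of a test point: v* = U_K^T (x* - x_bar)."""
    return U_K.T @ (x_star - x_bar)

# usage: X = np.random.randn(100, 10); mu, U_K = pca_fit(X, 2); v = pca_project(X[0], mu, U_K)
```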
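And a small sketch of the PPCA latent variable model [TipBis99] as stated on the slide: sampling from the generative process $x = W_x v + \mu_x + \epsilon_x$ and projecting a point via the posterior mean $E[v \mid x] = (W_x^\top W_x + \sigma_x^2 I)^{-1} W_x^\top (x - \mu_x)$. The posterior-mean formula is the standard PPCA result, assumed here rather than quoted from the slides.

```python
import numpy as np

def sample_ppca(Wx, mu_x, sigma_x, N, rng=None):
    """Sample N points from x = Wx v + mu_x + eps_x with v ~ N(0, I), eps_x ~ N(0, sigma_x^2 I)."""
    rng = np.random.default_rng() if rng is None else rng
    M, K = Wx.shape
    V = rng.standard_normal((N, K))            # latent variables
    E = sigma_x * rng.standard_normal((N, M))  # isotropic Gaussian noise
    return V @ Wx.T + mu_x + E                 # marginally x ~ N(mu_x, Wx Wx^T + sigma_x^2 I)

def ppca_project(x, Wx, mu_x, sigma_x):
    """Posterior mean E[v | x] = (Wx^T Wx + sigma_x^2 I)^{-1} Wx^T (x - mu_x)."""
    K = Wx.shape[1]
    M_mat = Wx.T @ Wx + sigma_x**2 * np.eye(K)
    return np.linalg.solve(M_mat, Wx.T @ (x - mu_x))
```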
Outline (next: Supervised Probabilistic PCA)

Supervised Probabilistic PCA
- Supervised latent variable model:
  $x = W_x v + \mu_x + \epsilon_x$
  $y = W_y v + \mu_y + \epsilon_y$
- Noise processes: $\epsilon_x \sim \mathcal{N}(0, \sigma_x^2 I)$, $\epsilon_y \sim \mathcal{N}(0, \sigma_y^2 I)$.
- All input and output dimensions are conditionally independent given $v$:
  $x \mid v \sim \mathcal{N}(W_x v + \mu_x, \sigma_x^2 I), \qquad y \mid v \sim \mathcal{N}(W_y v + \mu_y, \sigma_y^2 I)$
- $(x, y)$ are jointly Gaussian distributed: $P(x, y) = \int P(x \mid v) P(y \mid v) P(v)\, dv$.
- (Figure: graphical model in which the latent $v$ generates $x$ through $W_x, \sigma_x$ and $y$ through $W_y, \sigma_y$.)

Semi-Supervised Probabilistic PCA
- (Figure: data matrices $X \in \mathbb{R}^{N \times M}$ and $Y \in \mathbb{R}^{N_1 \times L}$; only the first $N_1$ points carry outputs.)
- Idea: an SPPCA model with missing data!
- Likelihood:
  $P(X, Y) = \prod_{n=1}^{N_1} P(x_n, y_n) \prod_{n'=N_1+1}^{N} P(x_{n'})$

S2PPCA: EM Learning
- Model parameters: $W_x, W_y, \mu_x, \mu_y, \sigma_x^2, \sigma_y^2$.
- E-step (a projection / inference problem): estimate $v$ for each data point,
  $P(v_n \mid x_n, y_n) \propto P(x_n \mid v_n) P(y_n \mid v_n) P(v_n)$ for labeled points,
  $P(v_{n'} \mid x_{n'}) \propto P(x_{n'} \mid v_{n'}) P(v_{n'})$ for unlabeled points.
- M-step: maximize the data log-likelihood w.r.t. the parameters.
- An extension of EM learning for the PPCA model.
- Can be kernelized! By-product: an EM learning algorithm for kernel PCA.

S2PPCA: Toy Problems
(Figures: projections on a linear toy problem and on two nonlinear toy problems, plotted in the X and Y spaces.)

S2PPCA: Properties
- Semi-supervised projection; applicable to large data sets; takes PCA and kernel PCA as special cases.
- Primal: $O(t(M+L)NK)$ time, $O((M+L)N)$ space.
- Dual: $O(tN^2K)$ time, $O(N^2)$ space; cheaper than the primal if $M > N$.
- A related latent variable solution [Yu et al., SIGIR'05]:
  $\min_{A,B,W} (1-\beta)\|X - VA\|_F^2 + \beta\|Y - VB\|_F^2 \quad \text{s.t.} \quad V^\top V = I,\; V = XW$
  It cannot deal with the semi-supervised setting!
- Closed-form solutions exist for SPPCA; there are no closed-form solutions for S2PPCA.

SPPCA: Primal Solution
Theorem (Primal). Let $S$ denote the normalized sample covariance matrix of the centered observations $\{(x_n, y_n)\}_{n=1}^N$,
$$S = \frac{1}{N}\sum_{n=1}^{N} \begin{pmatrix} x_n/\sigma_x \\ y_n/\sigma_y \end{pmatrix} \begin{pmatrix} x_n/\sigma_x \\ y_n/\sigma_y \end{pmatrix}^{\!\top} = \begin{pmatrix} \frac{1}{\sigma_x^2} S_x & \frac{1}{\sigma_x\sigma_y} S_{xy} \\ \frac{1}{\sigma_y\sigma_x} S_{yx} & \frac{1}{\sigma_y^2} S_y \end{pmatrix} \in \mathbb{R}^{(M+L)\times(M+L)},$$
and let $\lambda_1 \ge \dots \ge \lambda_{M+L}$ be its eigenvalues with eigenvectors $u_1, \dots, u_{M+L}$. If the latent space is K-dimensional, the following results hold:
(i) In the SPPCA model, $W_x$ and $W_y$ are calculated as
$$W_x = \sigma_x U_x (\Lambda_K - I)^{\frac{1}{2}} R, \qquad W_y = \sigma_y U_y (\Lambda_K - I)^{\frac{1}{2}} R,$$
where $\Lambda_K = \mathrm{diag}(\lambda_1, \dots, \lambda_K)$, $U_x$ ($U_y$) contains the first M (last L) rows of $[u_1, \dots, u_K]$, and $R$ is an arbitrary $K \times K$ orthogonal rotation matrix.
(ii) The projection $v^*$ of a centered new input $x^*$ is given as
$$v^* = \frac{1}{\sigma_x}\, R^\top (\Lambda_K - I)^{-\frac{1}{2}} \left[ U_x^\top U_x + (\Lambda_K - I)^{-1} \right]^{-1} U_x^\top x^*.$$

SPPCA: Dual Solution
A new kernel matrix! Theorem (Dual). Let $\hat{K} = \frac{1}{\sigma_x^2} K + \frac{1}{\sigma_y^2} Y Y^\top$, and let $\lambda_1 \ge \dots \ge \lambda_N$ be its eigenvalues with eigenvectors $v_1, \dots, v_N$. If the latent space in SPPCA is K-dimensional, then:
(i) The projections of the training data, which are encoded in the rows of the matrix $Z$, are calculated as
$$Z = \sqrt{N}\, V_K (I - N \Lambda_K^{-1})^{\frac{1}{2}} R,$$
where $\Lambda_K = \mathrm{diag}(\lambda_1, \dots, \lambda_K)$, $V_K = [v_1, \dots, v_K]$, and $R$ is an arbitrary $K \times K$ orthogonal matrix.
(ii) The projection $v^*$ for a new input $x^*$ is given as
$$v^* = \sqrt{N}\, R^\top D^{-\frac{1}{2}} (V_K^\top K V_K + D)^{-1} V_K^\top k(X, x^*), \qquad \text{where } D = I - N \Lambda_K^{-1}.$$
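For concreteness, here is a minimal NumPy sketch of the primal solution above, with the arbitrary rotation fixed to $R = I$, the noise levels treated as known, and inputs and outputs assumed already centered. It is an illustrative reading of the theorem, not the authors' reference implementation.

```python
import numpy as np

def sppca_primal(X, Y, sigma_x, sigma_y, K):
    """SPPCA primal solution (Theorem, primal) with R = I.

    X : (N, M) centered inputs, Y : (N, L) centered outputs.
    Returns Wx, Wy and a projection function for a centered new input x*.
    Assumes the top-K eigenvalues of S exceed 1, as the model requires.
    """
    N, M = X.shape
    Z = np.hstack([X / sigma_x, Y / sigma_y])
    S = Z.T @ Z / N                        # normalized sample covariance, (M+L) x (M+L)
    lam, U = np.linalg.eigh(S)
    order = np.argsort(lam)[::-1][:K]      # top-K eigenpairs
    lam_K, U_K = lam[order], U[:, order]
    Ux, Uy = U_K[:M, :], U_K[M:, :]        # first M rows / last L rows
    d = lam_K - 1.0                        # diagonal of Lambda_K - I
    Wx = sigma_x * Ux * np.sqrt(d)         # Wx = sigma_x Ux (Lambda_K - I)^{1/2}
    Wy = sigma_y * Uy * np.sqrt(d)

    def project(x_star):
        # v* = (1/sigma_x) (Lambda_K - I)^{-1/2} [Ux^T Ux + (Lambda_K - I)^{-1}]^{-1} Ux^T x*
        inner = Ux.T @ Ux + np.diag(1.0 / d)
        return np.linalg.solve(inner, Ux.T @ x_star) / (np.sqrt(d) * sigma_x)

    return Wx, Wy, project
```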
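Similarly, a sketch of the dual (kernel) solution, again with $R = I$; `Kmat` is the centered input kernel matrix and `k_star` the kernel column $k(X, x^*)$ of a new point. Treat this as an illustrative transcription of the dual theorem under those assumptions rather than verified reference code.

```python
import numpy as np

def sppca_dual(Kmat, Y, sigma_x, sigma_y, K):
    """SPPCA dual solution with R = I.

    Kmat : (N, N) centered kernel matrix of the inputs, Y : (N, L) centered outputs.
    Returns the (N, K) training projections Z and a projector for a new point.
    Assumes the top-K eigenvalues of K_hat exceed N, so that D is positive definite.
    """
    N = Kmat.shape[0]
    K_hat = Kmat / sigma_x**2 + (Y @ Y.T) / sigma_y**2   # the modified kernel matrix
    lam, V = np.linalg.eigh(K_hat)
    order = np.argsort(lam)[::-1][:K]
    lam_K, V_K = lam[order], V[:, order]
    d = 1.0 - N / lam_K                                  # diagonal of D = I - N Lambda_K^{-1}
    Z = np.sqrt(N) * V_K * np.sqrt(d)                    # training projections (rows of Z)

    def project(k_star):
        # v* = sqrt(N) D^{-1/2} (V_K^T Kmat V_K + D)^{-1} V_K^T k(X, x*)
        inner = V_K.T @ Kmat @ V_K + np.diag(d)
        return np.sqrt(N) * np.linalg.solve(inner, V_K.T @ k_star) / np.sqrt(d)

    return Z, project
```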
Experiments
- Train a nearest neighbor classifier after the projection.
- Evaluation metrics: error rate for multi-class classification; F1-measure and AUC for multi-label classification.

Multi-class Classification
- S2PPCA is almost always better than SPPCA.
- LDA is very good on the FACE data.
- S2PPCA is very good on the TEXT data.
- S2PPCA has good scalability.

Multi-label Classification
- S2PPCA is always better than SPPCA.
- S2PPCA is better in most cases.
- S2PPCA has good scalability.

Extensions
- Put priors on the factor loading matrices: learn MAP estimates for them.
- Relax the Gaussian noise model for the outputs, $y = f(v)$: a better way to incorporate supervised information; directly predict missing outputs (i.e., in a single step); we need to do more approximations (using, e.g., EP).
- Mixture modeling in the latent variable space: achieve local PCA instead of global PCA.
- Robust supervised PCA mapping: replace the Gaussian with a Student-t; outlier detection in PCA.

Related Work
- Fisher discriminant analysis (FDA). Goal: find directions that maximize the between-class distance while minimizing the within-class distance. Only deals with the outputs of multi-class classification; limited number of projection dimensions.
- Canonical correlation analysis (CCA). Goal: maximize the correlation between inputs and outputs. Ignores the intra-covariance of both inputs and outputs.
- Partial least squares (PLS). Goal: sequentially find orthogonal directions that maximize the covariance with respect to the outputs. A penalized CCA; poor generalization on new output dimensions.

Conclusion
- A general framework for (semi-)supervised dimensionality reduction.
- We can solve the problem analytically (EIG) or via iterative algorithms (EM). The trade-off:
  - EIG: optimization-based, easily extended to other losses.
  - EM: semi-supervised projection, good scalability.
- Both algorithms can be kernelized.
- PCA and kernel PCA are special cases.
- More applications expected.

Thank you! Questions?
http://www.dbs.ifi.lmu.de/~spyu