Lecture VIII – Dimensionality Reduction (I)

Contents:
• Subset Selection & Shrinkage
• Ridge Regression, Lasso
• PCA, PCR, PLS

Data From Human Movement

• Measure arm movement and full-body movement of humans and anthropomorphic robots.
• Perform local dimensionality analysis with a growing variational mixture of factor analyzers.

Dimensionality of Full Body Motion

[Figure: Global PCA – cumulative variance vs. number of retained dimensions; Local FA – probability vs. local dimensionality]

About 8 dimensions in the space formed by joint positions, velocities, and accelerations are needed to model an inverse dynamics model.

Dimensionality Reduction

Goals:
• fewer dimensions for subsequent processing
• better numerical stability due to removal of correlations
• simpler post-processing due to the "advanced statistical properties" of the preprocessed data
• don't lose important information, only redundant or irrelevant information
• perform dimensionality reduction spatially localized for nonlinear problems (e.g., each RBF has its own local dimensionality reduction)

Subset Selection & Shrinkage Methods

Subset Selection
Refers to methods which select a set of variables to be included, discard the other dimensions based on some optimality criterion, and then perform regression on this reduced-dimensional input set. These are discrete methods – variables are either selected or discarded.
• leaps and bounds procedure (Furnival & Wilson, 1974)
• forward/backward stepwise selection

Shrinkage Methods
Refers to methods which shrink the redundant or irrelevant variables in a more continuous fashion.
• Ridge Regression
• Lasso
• Derived Input Direction Methods – PCR, PLS

Ridge Regression

Ridge regression shrinks the coefficients by imposing a penalty on their size. It minimizes a penalized residual sum of squares:

\hat{w}^{ridge} = \arg\min_w \Big\{ \sum_{i=1}^{N} \big(t_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j\big)^2 + \lambda \sum_{j=1}^{M} w_j^2 \Big\}

where \lambda is the complexity parameter controlling the amount of shrinkage.

Equivalent representation:

\hat{w}^{ridge} = \arg\min_w \Big\{ \sum_{i=1}^{N} \big(t_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j\big)^2 \Big\}   subject to   \sum_{j=1}^{M} w_j^2 \le s

Ridge Regression (cont'd)

Some notes:
• When there are many correlated variables in a regression problem, their coefficients can be poorly determined and exhibit high variance, e.g. a wildly large positive coefficient can be canceled by a similarly large negative coefficient on its correlated cousin.
• The bias term w_0 is not included in the penalty.
• When the inputs are orthogonal, the ridge estimates are just a scaled version of the least squares estimates.

Matrix representation of the criterion and its solution:

J(w, \lambda) = \tfrac{1}{2} (t - Xw)^T (t - Xw) + \tfrac{\lambda}{2} w^T w

\hat{w}^{ridge} = (X^T X + \lambda I)^{-1} X^T t

Ridge Regression (cont'd)

Ridge regression shrinks the direction with least variance the most.

Shrinkage factor: each direction is shrunk by d_j^2 / (d_j^2 + \lambda), where d_j^2 refers to the corresponding eigenvalue. [Page 62: Elem. Stat. Learning]

Effective degrees of freedom:

df(\lambda) = \mathrm{tr}\big[ X (X^T X + \lambda I)^{-1} X^T \big] = \sum_{j=1}^{M} \frac{d_j^2}{d_j^2 + \lambda}   [Page 63: Elem. Stat. Learning]

Note: df(\lambda) = M if \lambda = 0 (no regularization).
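As a rough illustration of the closed-form solution above, the following Matlab sketch computes the ridge weights directly from the regularized normal equations. It is not part of the original slides: the variable names, the value of lambda, and the assumption that X (N x M) is already centred with t an N x 1 target vector are all illustrative choices.

    % Ridge regression via the regularized normal equations (illustrative sketch).
    % Assumes: X is an N x M centred input matrix, t is an N x 1 target vector.
    lambda  = 0.1;                                    % complexity parameter (arbitrary choice)
    [N, M]  = size(X);
    w_ridge = (X' * X + lambda * eye(M)) \ (X' * t);  % (X'X + lambda*I)^(-1) X' t
    w0      = mean(t);                                % unpenalized bias term (valid for centred X)
    t_hat   = X * w_ridge + w0;                       % predictions on the training inputs

Solving with the backslash operator avoids forming the inverse explicitly; with lambda = 0 this reduces to ordinary least squares, and increasing lambda shrinks the low-variance directions most strongly, as described above.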
Lasso

Lasso is also a shrinkage method like ridge regression. It minimizes a penalized residual sum of squares:

\hat{w}^{lasso} = \arg\min_w \Big\{ \sum_{i=1}^{N} \big(t_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j\big)^2 + \lambda \sum_{j=1}^{M} |w_j| \Big\}

Equivalent representation:

\hat{w}^{lasso} = \arg\min_w \Big\{ \sum_{i=1}^{N} \big(t_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j\big)^2 \Big\}   subject to   \sum_{j=1}^{M} |w_j| \le s

The L2 ridge penalty is replaced by the L1 lasso penalty. This makes the solution non-linear in t.

Derived Input Direction Methods

These methods essentially involve transforming the input directions to some low-dimensional representation and using these derived directions to perform the regression.

Example methods:
• Principal Components Regression (PCR) – based on input variance only
• Partial Least Squares (PLS) – based on input-output correlation and output variance

Principal Component Analysis

From earlier discussions: PCA finds the eigenvectors of the correlation (covariance) matrix of the data.

Here, we show how PCA can be learned by bottleneck neural networks (auto-associators) and by Hebbian learning frameworks.

PCA Implementation: Hebbian Learning

• obtain a measure of familiarity of a new data point
• the more familiar a data point, the larger the output y

[Figure: single linear unit with inputs x_1, ..., x_d, weight vector w, and output y]

w_{n+1} = w_n + \alpha y x

Hebbian Learning

Properties of Hebbian learning:
• unstable learning rule (at most neutrally stable)
• finds the direction of maximal variance of the data

\Delta w = \alpha x y

J = \tfrac{1}{2} y^2 = \tfrac{1}{2} w^T x x^T w

E\{J\} = E\{\tfrac{1}{2} w^T x x^T w\} = \tfrac{1}{2} w^T E\{x x^T\} w = \tfrac{1}{2} w^T C w

Here, C is the correlation matrix.

[Figure: data cloud in the (x_1, x_2) plane with the weight vector w]

Fixing the Instability (Oja's Rule)

Renormalization:

w_{n+1} = \frac{w_n + \alpha x y}{\| w_n + \alpha x y \|}

Oja's rule:

\Delta w = w_{n+1} - w_n = \alpha y (x - y w) = \alpha (y x - y^2 w)

Verify Oja's rule (does it do the right thing?):

E\{\Delta w\} = E\{\alpha (x y - y^2 w)\} = \alpha E\{x x^T w - w^T x x^T w \, w\} = \alpha (C w - w^T C w \, w)

After convergence, E\{\Delta w\} = 0 implies C w = (w^T C w) w = \lambda w, thus w is an eigenvector of C. Furthermore, w^T C w = w^T (w^T C w) w = (w^T C w)(w^T w), thus w^T w = 1.

Application Examples of Oja's Rule

Oja's rule finds the direction (line) which minimizes the sum of the total squared distances from each point to its orthogonal projection onto the line.

X = U D V^T   (Singular Value Decomposition, SVD)

PCA: Batch vs. Stochastic

PCA in batch update:
• just subtract the mean from the data
• calculate the eigenvectors to obtain the principal components (Matlab function "eig")

PCA in stochastic update (a short Matlab sketch of these updates follows after the next slide):
• stochastically estimate the mean and subtract it from the input data
• use Oja's rule on the mean-subtracted data

\bar{x}_{n+1} = \frac{n \bar{x}_n + x}{n+1} = \bar{x}_n + \frac{1}{n+1}(x - \bar{x}_n) = \bar{x}_n + \alpha (x - \bar{x}_n)

Autoencoder as Motivation for Oja's Rule

Note that Oja's rule looks like a supervised learning rule:
• the update looks like a reverse delta-rule: it depends on the difference between the actual input and the back-propagated output

\Delta w = w_{n+1} - w_n = \alpha y (x - y w)

[Figure: autoencoder with input x, weights w to the feature y, and weights w back to the reconstruction]

Convert unsupervised learning into a supervised learning problem by trying to reconstruct the inputs from few features!
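The stochastic recipe above (running mean estimate plus Oja's rule) can be written in a few lines. The Matlab sketch below is not from the slides; X is assumed to hold one sample per row, and the learning rate alpha and the number of passes are arbitrary illustrative choices.

    % Stochastic PCA: online mean estimate followed by Oja's rule (illustrative sketch).
    % Assumes: X is an N x d data matrix with one sample per row.
    [N, d] = size(X);
    alpha  = 0.01;                      % learning rate (arbitrary small constant)
    xbar   = zeros(d, 1);               % running mean estimate
    w      = randn(d, 1);  w = w / norm(w);
    count  = 0;
    for pass = 1:20
        for n = 1:N
            x     = X(n, :)';
            count = count + 1;
            xbar  = xbar + (x - xbar) / count;     % stochastic mean update
            xc    = x - xbar;                      % mean-subtracted input
            y     = w' * xc;                       % output of the linear unit
            w     = w + alpha * y * (xc - y * w);  % Oja's rule
        end
    end
    % After convergence, w approximates the leading eigenvector of the
    % correlation matrix of the centred data, with w'*w close to 1.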
Learning in the Autoencoder

Minimize the cost:

J = \tfrac{1}{2}(x - \hat{x})^T (x - \hat{x}) = \tfrac{1}{2}(x - y w)^T (x - y w)

\frac{\partial J}{\partial w} = -\Big(\frac{\partial y}{\partial w} w + y \frac{\partial w}{\partial w}\Big)^T (x - y w) = -\Big(\frac{\partial (w^T x)}{\partial w} w + y \frac{\partial w}{\partial w}\Big)^T (x - y w) = -(x^T w + y)(x - y w) = -(y + y)(x - y w) = -2 y (x - y w)

\Delta w = -\alpha \frac{\partial J}{\partial w} = \alpha' y (x - y w)

This is Oja's rule!

PCA with More Than One Feature

Oja's rule in multiple dimensions:

y = W x,   \hat{x} = W^T y,   \Delta W = \alpha y (x - W^T y)^T

Sanger's rule – a clever trick added to Oja's rule:

y = W x,   \hat{x} = W^T y

[\Delta W]_{r\text{-th row}} = \alpha y_r \Big(x - \sum_{i=1}^{r} ([W]_{i\text{-th row}})^T y_i\Big)^T = \alpha y_r \big(x - ([W]_{1:r\text{-th rows}})^T [y]_{1:r}\big)^T

Matlab notation (update of the r-th row):  W(r,:) = W(r,:) + alpha * y(r) * (x - W(1:r,:)' * y(1:r))'

This rule makes the rows of W become the eigenvectors of the data, ordered in descending sequence according to the magnitude of the eigenvalues.

Discussion about PCA

• If the data is noisy, we may represent the noise instead of the data.
  Way out: Factor Analysis (which also handles noise in the input dimensions)
• PCA has no data-generating model.
• PCA has no probabilistic interpretation (not quite true!).
• PCA ignores the possible influence of subsequent (e.g., supervised) learning steps.
• PCA is a linear method.
  Way out: nonlinear PCA
• PCA can converge very slowly.
  Way out: EM versions of PCA

But PCA is a very reliable method for dimensionality reduction if it is appropriate!

PCA Preprocessing for Supervised Learning

In batch learning notation for a linear network:

U = \Big[\text{eigenvectors}\Big(\frac{\sum_{n=1}^{N} (x^n - \bar{x})(x^n - \bar{x})^T}{N-1}\Big)\Big]_{\max(1:k)} = \Big[\text{eigenvectors}\Big(\frac{\tilde{X}^T \tilde{X}}{N-1}\Big)\Big]_{\max(1:k)}

where \tilde{X} contains the mean-zero data.

Subsequent linear regression for the network weights:

W = (U^T \tilde{X}^T \tilde{X} U)^{-1} U^T \tilde{X}^T Y

NOTE: Inversion of the above matrix is very cheap since it is diagonal! No numerical problems!

Problem of this preprocessing: important regression data might have been clipped!

Principal Component Regression (PCR)

(A short Matlab sketch of this procedure is given at the end of the lecture.)

Build the matrix X and the target vector t:

X = (\tilde{x}^1, \tilde{x}^2, \ldots, \tilde{x}^n)^T,   t = (t_1, t_2, \ldots, t_n)^T

Eigendecomposition:  X^T X = V D^2 V^T

Data projection:  U = [v_1 \; v_2 \; \cdots \; v_k]  (the eigenvectors belonging to the k largest eigenvalues)

Compute the linear model:

\hat{w}^{pcr} = (U^T X^T X U)^{-1} U^T X^T t

PCA in Joint Data Space

A straightforward extension to take the supervised learning step into account:
• perform PCA in the joint data space of inputs and outputs
• extract the linear weight parameters from the PCA results

[Figure: data in the joint (x, y) space]

PCA in Joint Data Space: Formalization

z = \begin{bmatrix} x - \bar{x} \\ y - \bar{y} \end{bmatrix}

U = \Big[\text{eigenvectors}\Big(\frac{\sum_{n=1}^{N} (z^n - \bar{z})(z^n - \bar{z})^T}{N-1}\Big)\Big]_{\max(1:k)} = \Big[\text{eigenvectors}\Big(\frac{Z^T Z}{N-1}\Big)\Big]_{\max(1:k)}

The network weights become:

W = -U_x U_y^T (U_y U_y^T - I)^{-1},   where   U = \begin{bmatrix} U_x \; (= d \times k) \\ U_y \; (= c \times k) \end{bmatrix}

Note: this new kind of linear network can actually tolerate noise in the input data! But only the same noise level in all (joint) dimensions!
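To make the PCR recipe from the earlier slide concrete, here is a short Matlab sketch. It is not from the slides; the number of retained components k, the variable names, and the assumption that X (N x d) is centred with t an N x 1 target vector are illustrative choices.

    % Principal Component Regression (illustrative sketch).
    % Assumes: X is an N x d centred input matrix, t is an N x 1 target vector, k <= d.
    k        = 3;                                    % number of retained directions (arbitrary)
    [V, D2]  = eig(X' * X);                          % eigendecomposition X'X = V*D2*V'
    [~, idx] = sort(diag(D2), 'descend');            % order directions by eigenvalue
    U        = V(:, idx(1:k));                       % leading k eigenvectors
    w_pcr    = (U' * (X' * X) * U) \ (U' * X' * t);  % weights in the projected space
    t_hat    = X * U * w_pcr;                        % project inputs, then apply the weights

Because the columns of U are eigenvectors of X'X, the matrix U'*(X'*X)*U being inverted is diagonal, which is exactly why the inversion in the preprocessing slide is described as very cheap.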