Covariance and Correlation Matrix

Given a sample $\{x_n\}_{n=1}^{N}$, where $x_n \in \mathbb{R}^d$, $x_n = (x_{1n}, x_{2n}, \dots, x_{dn})^T$:

• sample mean $\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n$, with entries $\bar{x}_i = \frac{1}{N}\sum_{n=1}^{N} x_{in}$
• the sample covariance matrix is the $d \times d$ matrix $Z$ with entries $Z_{ij} = \frac{1}{N-1}\sum_{n=1}^{N} (x_{in} - \bar{x}_i)(x_{jn} - \bar{x}_j)$
• the sample correlation matrix is the $d \times d$ matrix $C$ with entries $C_{ij} = \frac{1}{N-1}\sum_{n=1}^{N} \frac{(x_{in} - \bar{x}_i)(x_{jn} - \bar{x}_j)}{\sigma_{x_i}\sigma_{x_j}}$, where $\sigma_{x_i}$ and $\sigma_{x_j}$ are the sample standard deviations

Covariance and Correlation Matrix Example

Given the sample
$\begin{pmatrix}1.2\\0.9\end{pmatrix}, \begin{pmatrix}2.5\\3.9\end{pmatrix}, \begin{pmatrix}0.7\\0.4\end{pmatrix}, \begin{pmatrix}4.2\\5.8\end{pmatrix}$,
the sample mean is $\bar{x} = \begin{pmatrix}2.15\\2.75\end{pmatrix}$, and

$Z = \begin{pmatrix} 2.443333 & 3.940000 \\ 3.940000 & 6.523333 \end{pmatrix}$

$C = \begin{pmatrix} \frac{2.443333}{1.563117\cdot 1.563117} & \frac{3.940000}{1.563117\cdot 2.554082} \\ \frac{3.940000}{2.554082\cdot 1.563117} & \frac{6.523333}{2.554082\cdot 2.554082} \end{pmatrix} = \begin{pmatrix} 1.000000 & 0.986893 \\ 0.986893 & 1.000000 \end{pmatrix}$

Observe: if the sample is z-normalized ($x^{new}_{ij} = \frac{x_{ij} - \bar{x}_i}{\sigma_{x_i}}$, mean 0, standard deviation 1), then $C$ equals $Z$. See cov(), cor(), scale() in R.

Principal Component Analysis with NN

Principal Component Analysis (PCA) is a technique for
• dimensionality reduction
• lossy data compression
• feature extraction
• data visualization

Idea: orthogonal projection of the data onto a lower-dimensional linear space, such that the variance of the projected data is maximized.

Maximize Variance of Projected Data

Given data $\{x_n\}_{n=1}^{N}$, where $x_n$ has dimensionality $d$.
• Goal: project the data onto a space of dimensionality $m < d$ while maximizing the variance of the projected data.

Let us consider the projection onto a one-dimensional space ($m = 1$).
• Define the direction of this space by a $d$-dimensional vector $u_1$.

The mean of the projected data is $u_1^T \bar{x}$, where $\bar{x}$ is the sample mean $\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n$.

Maximize Variance of Projected Data (cont.)

The variance of the projected data is
$\frac{1}{N}\sum_{n=1}^{N} \left(u_1^T x_n - u_1^T \bar{x}\right)^2 = u_1^T S u_1$,
where $S$ is the data covariance matrix
$S = \frac{1}{N}\sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T$.

Goal: maximize the projected variance $u_1^T S u_1$ with respect to $u_1$. To prevent $\|u_1\|$ from growing to infinity, use the constraint $u_1^T u_1 = 1$, which gives the optimization problem:
maximize $u_1^T S u_1$ subject to $u_1^T u_1 = 1$

Maximize Variance of Projected Data (cont.)

Lagrangian form (one Lagrange multiplier $\lambda_1$):
$L(u_1, \lambda_1) = u_1^T S u_1 - \lambda_1 (u_1^T u_1 - 1)$

Setting the derivative with respect to $u_1$ to zero,
$\frac{\partial L(u_1, \lambda_1)}{\partial u_1} = 0$,
gives
$S u_1 = \lambda_1 u_1$,
which says that $u_1$ must be an eigenvector of $S$. Finally, by left-multiplying by $u_1^T$ and making use of $u_1^T u_1 = 1$, one sees that the variance is given by $u_1^T S u_1 = \lambda_1$. Observe that the variance is maximized when $u_1$ equals the eigenvector with the largest eigenvalue $\lambda_1$.

Second Principal Component

The second eigenvector $u_2$ should also be of unit length and orthogonal to $u_1$ (after projection, uncorrelated with $u_1^T x$):
maximize $u_2^T S u_2$ subject to $u_2^T u_2 = 1$, $u_2^T u_1 = 0$

Lagrangian form (two Lagrange multipliers $\lambda_2$, $\varphi$):
$L(u_2, \lambda_2, \varphi) = u_2^T S u_2 - \lambda_2 (u_2^T u_2 - 1) - \varphi (u_2^T u_1 - 0)$

Setting the derivative with respect to $u_2$ to zero and left-multiplying by $u_1^T$ shows that $\varphi = 0$, so $S u_2 = \lambda_2 u_2$ and $u_2^T S u_2 = \lambda_2$, which implies that $u_2$ should be the eigenvector of $S$ with the second largest eigenvalue $\lambda_2$. Further principal directions are given by the remaining eigenvectors in order of decreasing eigenvalue.

PCA Example

[Figure: three panels on synthetic 2-D data — (1) the data with the first and second eigenvector, (2) the projection onto the first eigenvector, (3) the projection onto both orthogonal eigenvectors; R axes data.xy[,1], data.xy[,2] and data.x.eig[,1], data.x.eig[,2].]
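As a concrete illustration, here is a minimal R sketch (not part of the original slides) that reproduces the 2-D example above with the built-in functions cov(), cor() and scale(), and then finds the direction of maximum projected variance via eigen(); only the four sample points from the example are assumed.

# Sample from the example above: one row per observation, one column per dimension
X <- rbind(c(1.2, 0.9), c(2.5, 3.9), c(0.7, 0.4), c(4.2, 5.8))

colMeans(X)     # sample mean (2.15, 2.75)
Z <- cov(X)     # sample covariance matrix Z
C <- cor(X)     # sample correlation matrix C
cov(scale(X))   # covariance of the z-normalized sample; equals C

# PCA: the direction of maximum projected variance is the eigenvector of the
# covariance matrix with the largest eigenvalue (eigen() returns them in
# decreasing order for symmetric matrices).
e    <- eigen(Z)
u1   <- e$vectors[, 1]                   # first principal direction
e$values[1]                              # projected variance t(u1) %*% Z %*% u1,
                                         # here with the 1/(N-1) convention of cov()
proj <- scale(X, scale = FALSE) %*% u1   # 1-D projection of the centred data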
Proportion of Variance

• In image and speech processing problems the inputs are usually highly correlated.
• If the dimensions are highly correlated, there will be a small number of eigenvectors with large eigenvalues ($m \ll d$). As a result, a large reduction in dimensionality can be attained.

Proportion of variance explained by the first $m$ eigenvectors:
$\frac{\lambda_1 + \lambda_2 + \dots + \lambda_m}{\lambda_1 + \lambda_2 + \dots + \lambda_m + \dots + \lambda_d}$

[Figure: proportion of variance explained versus number of eigenvectors, digit class 1 (USPS database).]

PCA Second Example

• Given a 256 × 256 image, segment it into 32 · 32 = 1024 image pieces of size 8 × 8 ≡ 1 × 64: $x_1, x_2, \dots, x_{1024} \in \mathbb{R}^{64}$
• Determine the mean $\bar{x} = \frac{1}{1024}\sum_{n=1}^{1024} x_n$
• Determine the covariance matrix $S$ and the $m$ eigenvectors $u_1, u_2, \dots, u_m$ having the largest corresponding eigenvalues $\lambda_1, \lambda_2, \dots, \lambda_m$
• Create the eigenvector matrix $U$, where $u_1, u_2, \dots, u_m$ are column vectors
• Project the image pieces $x_i$ into the subspace as follows: $z_i = U^T (x_i - \bar{x})$

PCA Second Example (cont.)

• Reconstruct the image pieces by back-projecting them to the original space: $\tilde{x}_i = U z_i + \bar{x}$. Note, the mean is added (subtracted in the step before) because the data is not normalized.

[Figure: proportion of variance explained in the image versus number of eigenvectors; the original image and reconstructions with 16, 32, 48, and 64 eigenvectors.]

PCA with a Neural Network

[Figure: single linear unit with inputs $x_1, x_2, \dots, x_d$, weights $w_1, w_2, \dots, w_d$ and output $V$.]

$V = w^T x = \sum_{j=1}^{d} w_j x_j$

Apply the Hebbian learning rule $\Delta w_i = \eta V x_i$, such that after some update steps the weight vector $w$ should point in the direction of maximum variance.

PCA with a Neural Network (cont.)

Suppose that there is a stable equilibrium point for $w$ such that the average weight change is zero:
$0 = \langle \Delta w_i \rangle = \langle V x_i \rangle = \langle \textstyle\sum_j w_j x_j x_i \rangle = \textstyle\sum_j C_{ij} w_j$, i.e. $Cw = 0$.

Angle brackets $\langle \cdot \rangle$ indicate an average over the input distribution $P(x)$, and $C$ denotes the correlation matrix with $C_{ij} \equiv \langle x_i x_j \rangle$, or $C \equiv \langle x x^T \rangle$.

Note, $C$ is symmetric ($C_{ij} = C_{ji}$) and positive semi-definite, which implies that its eigenvalues are positive or zero and its eigenvectors can be taken as orthogonal.

PCA with a Neural Network (cont.)

• At our hypothetical equilibrium point, $w$ is an eigenvector of $C$ with eigenvalue 0
• This is never stable: $C$ has some positive eigenvalues, and any component of $w$ along a corresponding eigenvector would grow exponentially
• Constrain the growth of $w$, e.g. by renormalization ($\|w\| = 1$) after each update step
• More elegant idea: add a weight decay proportional to $V^2$ to the Hebbian learning rule (Oja's Rule):
$\Delta w_i = \eta V (x_i - V w_i)$

PCA with a Neural Network Example

[Figure: 2-D data (data.xy) with the weight vector found by Oja's rule (blue vector) and the largest eigenvector (red vector).]

Some insights into Oja's Rule

Oja's rule converges to a weight vector $w$ with the following properties:
• $\|w\| = 1$ (unit length),
• eigenvector direction: $w$ lies in a maximal eigenvector direction of $C$,
• variance maximization: $w$ lies in a direction that maximizes $\langle V^2 \rangle$.

Oja's learning rule is still limited, because we can construct only the first principal component of the z-normalized data.

Construct the first m principal components

• Single-layer network with the $i$-th output $V_i$ given by $V_i = \sum_j w_{ij} x_j = w_i^T x$, where $w_i$ is the weight vector for the $i$-th output
• Oja's m-unit learning rule: $\Delta w_{ij} = \eta V_i \left(x_j - \sum_{k=1}^{m} V_k w_{kj}\right)$
• Sanger's learning rule: $\Delta w_{ij} = \eta V_i \left(x_j - \sum_{k=1}^{i} V_k w_{kj}\right)$
• Both rules reduce to Oja's 1-unit rule for the $m = 1$ and $i = 1$ case (see the R sketch below)
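The following is a minimal R sketch of Sanger's rule (it is not taken from the lecture material; the synthetic data, learning rate and epoch count are illustrative assumptions). It extracts the first m = 2 principal directions of z-normalized data and compares them with the eigenvectors of the correlation matrix.

set.seed(1)
N <- 500; d <- 4; m <- 2
# z-normalized, correlated synthetic data (one row per observation)
X <- scale(matrix(rnorm(N * d), N, d) %*% matrix(rnorm(d * d), d, d))
W <- matrix(rnorm(m * d, sd = 0.1), m, d)   # row i = weight vector w_i of output V_i
eta <- 0.01
for (epoch in 1:200) {
  for (n in sample(N)) {                    # present the patterns in random order
    x <- X[n, ]
    V <- as.vector(W %*% x)                 # outputs V_i = w_i^T x
    for (i in 1:m) {
      # Sanger: Delta w_ij = eta * V_i * (x_j - sum_{k=1..i} V_k w_kj)
      back   <- colSums(V[1:i] * W[1:i, , drop = FALSE])
      W[i, ] <- W[i, ] + eta * V[i] * (x - back)
    }
  }
}
# Rows of W should approach +/- the leading eigenvectors of C = <x x^T>,
# so this product should be close to a signed 2 x 2 identity matrix:
round(W %*% eigen(cor(X))$vectors[, 1:m], 2)

Replacing the upper summation limit i by m in the back-projection term turns this into Oja's m-unit rule, which only spans the principal subspace instead of recovering the individual eigenvector directions.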
Oja's and Sanger's Rule

• In both cases the $w_i$ vectors converge to orthogonal unit vectors
• With Sanger's rule the weight vectors become exactly the first $m$ principal components, in order: $w_i = \pm c_i$, where $c_i$ is the normalized eigenvector of the correlation matrix $C$ belonging to the $i$-th largest eigenvalue $\lambda_i$
• Oja's m-unit rule converges to $m$ weight vectors that span the same subspace as the first $m$ eigenvectors, but it does not find the eigenvector directions themselves

Linear Auto-Associative Network

[Figure: network mapping the original features $x_1, \dots, x_d$ through a bottleneck of $m$ extracted features $z_1, \dots, z_m$ (extraction) to the reconstructed features $\tilde{x}_1, \dots, \tilde{x}_d$ (reconstruction).]

• The network is trained to perform the identity mapping
• Idea: the bottleneck units represent significant features of the input data
• Train the network by minimizing the sum-of-squares error $\frac{1}{2}\sum_{n=1}^{N}\sum_{k=1}^{d} \left(y_k(x^{(n)}) - x_k^{(n)}\right)^2$ (a minimal R sketch is given at the end of this section)

Linear Auto-Associative Network (cont.)

• As with Oja's/Sanger's update rules, this type of learning can be considered unsupervised learning, since no independent target data is provided
• The error function has a unique global minimum when the hidden units have linear activation functions
• At this minimum the network performs a projection onto the $m$-dimensional subspace spanned by the first $m$ principal components of the data
• Note, however, that these vectors need not be orthogonal or normalized

Non-Linear Auto-Associative Network

[Figure: auto-associative network with additional layers (layer sequence labelled non-linear, linear, non-linear, linear), mapping the original features $x_1, \dots, x_d$ through $m$ extracted features $z_1, \dots, z_m$ to the reconstructed features $\tilde{x}_1, \dots, \tilde{x}_d$.]
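To make the bottleneck idea concrete, here is a minimal R sketch (again not from the slides) of a linear auto-associative network with d inputs, m linear hidden units and d outputs, trained by batch gradient descent on the sum-of-squares reconstruction error; the synthetic data, layer sizes, learning rate and epoch count are illustrative assumptions.

set.seed(1)
N <- 200; d <- 5; m <- 2
# centred, correlated synthetic data (one row per observation)
X  <- scale(matrix(rnorm(N * d), N, d) %*% matrix(rnorm(d * d), d, d), scale = FALSE)
W1 <- matrix(rnorm(d * m, sd = 0.1), d, m)   # extraction weights     (d -> m)
W2 <- matrix(rnorm(m * d, sd = 0.1), m, d)   # reconstruction weights (m -> d)
eta <- 1e-3
for (epoch in 1:5000) {
  Z    <- X %*% W1                 # extracted features
  Xhat <- Z %*% W2                 # reconstruction
  E    <- Xhat - X                 # reconstruction error
  gW2  <- t(Z) %*% E / N           # gradients of the mean squared error
  gW1  <- t(X) %*% E %*% t(W2) / N
  W1   <- W1 - eta * gW1
  W2   <- W2 - eta * gW2
}
# At the minimum the hidden layer spans the same subspace as the first m
# principal components (the columns of W1 need not be orthogonal or normalized):
U <- eigen(cov(X))$vectors[, 1:m]  # leading principal directions
Q <- qr.Q(qr(W1))                  # orthonormal basis for the span of W1
svd(t(U) %*% Q)$d                  # singular values near 1 => same subspace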