Lagrange multipliers

Given the following optimization problem:

    maximize   f(x, y) = 2 − x² − 2y²
    subject to g(x, y) = x² + y² − 1 = 0.

With Lagrange multipliers we can find the extrema of a function of several variables subject to one or more constraints.

[Figure: surface plot of f(x, y) over the (x, y) plane.]

Lagrange multipliers (cont.)

The gradient of f,

    ∇f = grad f(x) = (∂f/∂x_1, ∂f/∂x_2, ..., ∂f/∂x_n),

is a vector field, where the vectors point in the directions of the greatest increase of f. The direction of greatest increase is always perpendicular to the level curves. The circle (blue curve) is the feasible region satisfying the constraint x² + y² − 1 = 0.

[Figure: plot of f(x, y) and the constraint circle in the (x, y) plane.]

Lagrange multipliers (cont.)

At extreme points (x, y) the gradients of f and g are parallel vectors, that is

    ∇f(x, y) = λ∇g(x, y).

To find these points we have to solve

    ∇f(x, y) − λ∇g(x, y) = 0.

[Figure: level sets of f and the constraint circle in the (x, y) plane.]

Lagrange multipliers Ex. 1

Back to our optimization problem:

    maximize   f(x, y) = 2 − x² − 2y²
    subject to g(x, y) = x² + y² − 1 = 0.

    L(x, y, λ) = f(x, y) − λg(x, y) = 2 − x² − 2y² − λ(x² + y² − 1)

    ∂L(x, y, λ)/∂x = −2x − 2λx = 0
    ∂L(x, y, λ)/∂y = −4y − 2λy = 0
    ∂L(x, y, λ)/∂λ = −x² − y² + 1 = 0

Solving the equation system gives: x = ±1 and y = 0 (λ = −1), and x = 0 and y = ±1 (λ = −2).

Lagrange multipliers Ex. 2

Find the point p_t on the circle formed by the intersection of the unit sphere with the plane x + y + z = 1/2 that is closest to the point p_g = (1, 2, 3), i.e. minimize ‖p_g − p_t‖² ≡ min f(x, y, z) with

    f(x, y, z) = (x − 1)² + (y − 2)² + (z − 3)²
    g_1(x, y, z) = x² + y² + z² − 1
    g_2(x, y, z) = x + y + z − 1/2

Lagrange multipliers Ex. 2 (cont.)

    L(x, y, z, λ_1, λ_2) = (x − 1)² + (y − 2)² + (z − 3)² + λ_1(x² + y² + z² − 1) + λ_2(x + y + z − 1/2)

    ∂L/∂x   = 2(x − 1) + 2λ_1 x + λ_2 = 0
    ∂L/∂y   = 2(y − 2) + 2λ_1 y + λ_2 = 0
    ∂L/∂z   = 2(z − 3) + 2λ_1 z + λ_2 = 0
    ∂L/∂λ_1 = x² + y² + z² − 1 = 0
    ∂L/∂λ_2 = x + y + z − 1/2 = 0

Lagrange multipliers Ex. 2 (cont.)

Solving this equation system gives:

    x_1 = 1/6 − √66/12, y_1 = 1/6, z_1 = 1/6 + √66/12   (x_1 ≈ −0.51, y_1 ≈ 0.16, z_1 ≈ 0.84)
    x_2 = 1/6 + √66/12, y_2 = 1/6, z_2 = 1/6 − √66/12   (x_2 ≈ 0.84, y_2 ≈ 0.16, z_2 ≈ −0.51)

Covariance and Correlation Matrix

Given a sample {x_n}_1^N, where x_n ∈ R^d and x_n = (x_1n, x_2n, ..., x_dn)ᵀ:

• the sample mean is x̄ = (1/N) Σ_{n=1}^N x_n, and the entries of the sample mean are x̄_i = (1/N) Σ_{n=1}^N x_in
• the sample covariance matrix is a d × d matrix Z with entries Z_ij = 1/(N − 1) Σ_{n=1}^N (x_in − x̄_i)(x_jn − x̄_j)
• the sample correlation matrix is a d × d matrix C with entries C_ij = [1/(N − 1) Σ_{n=1}^N (x_in − x̄_i)(x_jn − x̄_j)] / (σ_xi σ_xj), where σ_xi and σ_xj are the sample standard deviations

Covariance and Correlation Matrix Example

Given the sample

    (1.2, 0.9)ᵀ, (2.5, 3.9)ᵀ, (0.7, 0.4)ᵀ, (4.2, 5.8)ᵀ,

the sample mean is x̄ = (2.15, 2.75)ᵀ and

    Z = | 2.443333  3.940000 |
        | 3.940000  6.523333 |

    C = | 2.443333/(1.563117·1.563117)  3.940000/(1.563117·2.554082) |   =   | 1.000000  0.986893 |
        | 3.940000/(2.554082·1.563117)  6.523333/(2.554082·2.554082) |       | 0.986893  1.000000 |

Observe: if the sample is z-normalized (x_ij^new = (x_ij − x̄_i)/σ_xi, i.e. mean 0 and standard deviation 1), then C equals Z. See cov(), cor(), scale() in R.
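Since the slides point to cov(), cor() and scale(), a minimal R snippet reproducing the example above could look as follows (the printed matrices match the values shown, up to rounding):

```r
# Reproduce the 2-dimensional example: N = 4 samples, d = 2
X <- rbind(c(1.2, 0.9),
           c(2.5, 3.9),
           c(0.7, 0.4),
           c(4.2, 5.8))

colMeans(X)    # sample mean: 2.15 2.75
cov(X)         # sample covariance matrix Z (1/(N-1) normalization)
cor(X)         # sample correlation matrix C
cov(scale(X))  # covariance of the z-normalized sample equals cor(X)
```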
Principal Component Analysis

Principal Component Analysis (PCA) is a technique for
• dimensionality reduction
• lossy data compression
• feature extraction
• data visualization

Idea: orthogonal projection of the data onto a lower-dimensional linear space, such that the variance of the projected data is maximized.

[Figure: two-dimensional data cloud with the first principal direction u_1.]

Maximize Variance of Projected Data

Given data {x_n}_1^N, where x_n has dimensionality d.
• Goal: project the data onto a space having dimensionality m < d while maximizing the variance of the projected data.

Let us consider the projection onto a one-dimensional space (m = 1).
• Define the direction of this space using a d-dimensional vector u_1.

The mean of the projected data is u_1ᵀx̄, where x̄ is the sample mean

    x̄ = (1/N) Σ_{n=1}^N x_n

Maximize Variance of Projected Data (cont.)

The variance of the projected data is given by

    (1/N) Σ_{n=1}^N (u_1ᵀx_n − u_1ᵀx̄)² = u_1ᵀS u_1

where S is the data covariance matrix defined by

    S = (1/N) Σ_{n=1}^N (x_n − x̄)(x_n − x̄)ᵀ

Goal: maximize the projected variance u_1ᵀS u_1 with respect to u_1. To prevent ‖u_1‖ from growing to infinity, use the constraint u_1ᵀu_1 = 1, which gives the optimization problem:

    maximize   u_1ᵀS u_1
    subject to u_1ᵀu_1 = 1

Maximize Variance of Projected Data (cont.)

Lagrangian form (one Lagrange multiplier λ_1):

    L(u_1, λ_1) = u_1ᵀS u_1 − λ_1(u_1ᵀu_1 − 1)

Setting the derivative with respect to u_1 to zero, ∂L(u_1, λ_1)/∂u_1 = 0, gives

    S u_1 = λ_1 u_1,

which says that u_1 must be an eigenvector of S. Finally, by left-multiplying by u_1ᵀ and making use of u_1ᵀu_1 = 1, one can see that the variance is given by u_1ᵀS u_1 = λ_1. Observe that the variance is maximized when u_1 equals the eigenvector having the largest eigenvalue λ_1.

Second Principal Component

The second eigenvector u_2 should also be of unit length and orthogonal to u_1 (after projection, uncorrelated with u_1ᵀx):

    maximize   u_2ᵀS u_2
    subject to u_2ᵀu_2 = 1, u_2ᵀu_1 = 0

Lagrangian form (two Lagrange multipliers λ_1, λ_2):

    L(u_2, λ_1, λ_2) = u_2ᵀS u_2 − λ_2(u_2ᵀu_2 − 1) − λ_1(u_2ᵀu_1 − 0)

This gives the solution u_2ᵀS u_2 = λ_2, which implies that u_2 should be the eigenvector of S with the second largest eigenvalue λ_2. Further directions are given by the eigenvectors with decreasing eigenvalues.

PCA Example

[Figure: three panels of a two-dimensional example: "First and second eigenvector", "Projection on first eigenvector", and "Projection on both orthogonal eigenvectors".]

Proportion of Variance

• In image and speech processing problems the inputs are usually highly correlated. If the dimensions are highly correlated, then there will be a small number of eigenvectors with large eigenvalues (m ≪ d). As a result, a large reduction in dimensionality can be attained.

Proportion of variance explained by the first m eigenvectors:

    (λ_1 + λ_2 + ... + λ_m) / (λ_1 + λ_2 + ... + λ_m + ... + λ_d)

[Figure: proportion of variance explained versus number of eigenvectors, digit class 1 (USPS database).]

PCA Second Example

• Given a 256 × 256 image, segment the image into 32 · 32 = 1024 image pieces of size 8 × 8 ≡ 1 × 64: x_1, x_2, ..., x_1024 ∈ R^64
• Determine the mean: x̄ = (1/1024) Σ_{n=1}^{1024} x_n
• Determine the covariance matrix S and the m eigenvectors u_1, u_2, ..., u_m having the largest corresponding eigenvalues λ_1, λ_2, ..., λ_m
• Create the eigenvector matrix U, where u_1, u_2, ..., u_m are the column vectors
• Project the image pieces x_i into the subspace as follows: z_i = Uᵀ(x_i − x̄) (see the sketch below)
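As a rough sketch of the computation described above, using illustrative toy data (note that R's cov() uses the 1/(N − 1) normalization rather than the 1/N on the slides, which only rescales the eigenvalues):

```r
# PCA by eigen-decomposition of the sample covariance matrix S
set.seed(1)
X <- matrix(rnorm(200 * 2), ncol = 2) %*% matrix(c(2, 1, 1, 1), 2, 2)  # toy correlated data

Xc  <- scale(X, center = TRUE, scale = FALSE)  # subtract the sample mean
S   <- cov(Xc)                                 # covariance matrix S
eig <- eigen(S)                                # eigenvalues returned in decreasing order
U   <- eig$vectors[, 1:2]                      # u_1, u_2 as column vectors

Z <- Xc %*% U   # projected data: row n holds z_n = U^T (x_n - xbar)
var(Z[, 1])     # equals lambda_1 = eig$values[1] (up to floating point)

cumsum(eig$values) / sum(eig$values)  # proportion of variance explained by the first m eigenvectors
# Base R alternative: prcomp(X) computes the same projection via SVD.
```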
PCA Second Example (cont.)

• Reconstruct the image pieces by back-projecting them to the original space as x̃_i = U z_i + x̄. Note that the mean is added (it was subtracted in the step before) because the data is not normalized.

[Figure: proportion of variance explained in the image; the original image and reconstructions with 16, 32, 48 and 64 eigenvectors.]

Nonlinear Dimensionality Reduction

Nonlinear techniques for dimensionality reduction can be subdivided into three main types:
1. Preserve global properties of the original data in the low-dimensional representation.
2. Preserve local properties of the original data in the low-dimensional representation.
3. Global alignment of a mixture of linear models.

• Given a dataset represented in an n × D matrix X consisting of n data vectors x_i, i = 1, 2, ..., n, with dimensionality D.
• Assume that the dataset has intrinsic dimensionality d (where d < D, and often d ≪ D).

Nonlinear Dimensionality Reduction (cont.)

(Nonlinear) dimensionality reduction techniques transform the dataset X with dimensionality D into a new dataset Y with dimensionality d, while retaining the geometry or structure of the data as much as possible.

[Figure: picture taken from Sam Roweis' website.]

Multidimensional Scaling (MDS)

MDS maps a high-dimensional data representation to a low-dimensional representation while retaining the pairwise distances between the data points as much as possible.
• The quality of the mapping is expressed in the stress function, a measure of the difference between the pairwise distances in the low-dimensional and the high-dimensional representation of the data.

Raw stress function (squared error cost) to be minimized:

    E_M = Σ_{i<j} [d(x_i, x_j) − d̂(y_i, y_j)]²

where d(x_i, x_j) is the distance (e.g. Euclidean) between x_i and x_j in the high-dimensional space and d̂(y_i, y_j) the distance in the low-dimensional space.

Sammon's projection

Sammon's mapping is closely related to the MDS technique. Sammon's stress function to be minimized:

    E_S = (1 / Σ_{i<j} d(x_i, x_j)) · Σ_{i<j} [d(x_i, x_j) − d̂(y_i, y_j)]² / d(x_i, x_j)

The stress function puts more emphasis on retaining distances that were originally small.
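As a sketch of how such embeddings can be computed in R on illustrative data (assuming the MASS package is available; note that cmdscale() performs classical metric MDS rather than direct minimization of the raw stress, while sammon() implements Sammon's mapping):

```r
# Embed n data vectors with dimensionality D = 5 into d = 2 dimensions
library(MASS)  # provides sammon()

set.seed(1)
X  <- matrix(rnorm(100 * 5), ncol = 5)  # n = 100 illustrative data vectors
Dx <- dist(X)                           # pairwise Euclidean distances d(x_i, x_j)

Y_mds    <- cmdscale(Dx, k = 2)         # classical (metric) MDS embedding
Y_sammon <- sammon(Dx, k = 2)$points    # Sammon's mapping, emphasizes small distances

# Raw stress E_M = sum over i<j of [d(x_i, x_j) - dhat(y_i, y_j)]^2
raw_stress <- function(Dx, Y) sum((as.vector(Dx) - as.vector(dist(Y)))^2)
raw_stress(Dx, Y_mds)
raw_stress(Dx, Y_sammon)
```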