Lagrange multipliers
Given the following optimization problem:

    maximize    f(x, y) = 2 − x^2 − 2y^2
    subject to  g(x, y) = x^2 + y^2 − 1 = 0.

With Lagrange multipliers we can find the extrema of a function of several variables subject to one or more constraints.
[Figure: surface plot of f(x, y) = 2 − x^2 − 2y^2 over the (x, y) plane.]
Lagrange multipliers (cont.)
The gradient of f,

    ∇f = grad f(x) = (∂f/∂x_1, ∂f/∂x_2, ..., ∂f/∂x_n),

is a vector field whose vectors point in the direction of the greatest increase of f.

The direction of greatest increase is always perpendicular to the level curves. The circle (blue curve) is the feasible region satisfying the constraint x^2 + y^2 − 1 = 0.

[Figure: level curves of f in the (x, y) plane together with the constraint circle.]
Lagrange multipliers (cont.)
At extreme points (x, y) the gradients of f and g are parallel vectors, that is

    ∇f(x, y) = λ∇g(x, y).

To find these extreme points we have to solve

    ∇f(x, y) − λ∇g(x, y) = 0.

[Figure: level curves of f with the constraint circle; at the extrema the gradients of f and g are parallel.]
Lagrange multipliers Ex. 1
Back to our optimization problem:

    maximize    f(x, y) = 2 − x^2 − 2y^2
    subject to  g(x, y) = x^2 + y^2 − 1 = 0.

The Lagrangian is

    L(x, y, λ) = f(x, y) − λ g(x, y) = 2 − x^2 − 2y^2 − λ(x^2 + y^2 − 1)

    ∂L(x, y, λ)/∂x = −2x − 2λx = 0
    ∂L(x, y, λ)/∂y = −4y − 2λy = 0
    ∂L(x, y, λ)/∂λ = −x^2 − y^2 + 1 = 0

Solving this equation system gives: x = ±1 and y = 0 (λ = −1), and x = 0 and y = ±1 (λ = −2).
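As a quick numerical check, one can evaluate f at the four stationary points; a minimal sketch in R (the language used for the examples later in these slides):

    f <- function(x, y) 2 - x^2 - 2 * y^2
    f( 1, 0); f(-1, 0)    # 1 and 1  (the points with lambda = -1)
    f( 0, 1); f( 0,-1)    # 0 and 0  (the points with lambda = -2)

So the constrained maximum f = 1 is attained at (x, y) = (±1, 0).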
Lagrange multipliers Ex. 2
Find the point p_t on the circle formed by the intersection of the unit sphere with the plane x + y + z = 1/2 that is closest to the point p_g = (1, 2, 3), i.e. min ||p_g − p_t||^2 ≡ min f(x, y, z), with

    f(x, y, z) = (x − 1)^2 + (y − 2)^2 + (z − 3)^2
    g_1(x, y, z) = x^2 + y^2 + z^2 − 1
    g_2(x, y, z) = x + y + z − 1/2
Lagrange multipliers Ex. 2 (cont.)
L(x, y, z, λ) = (x − 1)^2 + (y − 2)^2 + (z − 3)^2
                + λ_1 (x^2 + y^2 + z^2 − 1) + λ_2 (x + y + z − 1/2)

    ∂L(x, y, z, λ)/∂x  = 2(x − 1) + 2λ_1 x + λ_2 = 0
    ∂L(x, y, z, λ)/∂y  = 2(y − 2) + 2λ_1 y + λ_2 = 0
    ∂L(x, y, z, λ)/∂z  = 2(z − 3) + 2λ_1 z + λ_2 = 0
    ∂L(x, y, z, λ)/∂λ_1 = x^2 + y^2 + z^2 − 1 = 0
    ∂L(x, y, z, λ)/∂λ_2 = x + y + z − 1/2 = 0
Lagrange multipliers Ex. 2 (cont.)
Solving this equation system gives:

    x_1 = 1/6 − (1/12)√66,  y_1 = 1/6,  z_1 = 1/6 + (1/12)√66
    (x_1 ≈ −0.51,  y_1 ≈ 0.16,  z_1 ≈ 0.84)

    x_2 = 1/6 + (1/12)√66,  y_2 = 1/6,  z_2 = 1/6 − (1/12)√66
    (x_2 ≈ 0.84,  y_2 ≈ 0.16,  z_2 ≈ −0.51)
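A minimal sketch in R that checks both candidate points against the constraints and compares their squared distances to p_g = (1, 2, 3):

    p1 <- c(1/6 - sqrt(66)/12, 1/6, 1/6 + sqrt(66)/12)
    p2 <- c(1/6 + sqrt(66)/12, 1/6, 1/6 - sqrt(66)/12)
    pg <- c(1, 2, 3)
    sum(p1^2); sum(p1)       # constraints: 1 and 0.5 (likewise for p2)
    sum((pg - p1)^2)         # approx. 10.3
    sum((pg - p2)^2)         # approx. 15.7, so p1 is the closest point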
Covariance and Correlation Matrix
Given a sample {x_n}_{n=1}^N, where x_n ∈ R^d and x_n = (x_1n, x_2n, ..., x_dn)^T:

• the sample mean is x̄ = (1/N) Σ_{n=1}^N x_n, with entries x̄_i = (1/N) Σ_{n=1}^N x_in

• the sample covariance matrix is a d × d matrix Z with entries
  Z_ij = (1/(N−1)) Σ_{n=1}^N (x_in − x̄_i)(x_jn − x̄_j)

• the sample correlation matrix is a d × d matrix C with entries
  C_ij = [(1/(N−1)) Σ_{n=1}^N (x_in − x̄_i)(x_jn − x̄_j)] / (σ_{x_i} σ_{x_j}),
  where σ_{x_i} and σ_{x_j} are the sample standard deviations
Covariance and Correlation Matrix Example
Given the sample

    (1.2, 0.9)^T,  (2.5, 3.9)^T,  (0.7, 0.4)^T,  (4.2, 5.8)^T,        x̄ = (2.15, 2.75)^T

    Z = | 2.443333  3.940000 |
        | 3.940000  6.523333 |

    C = | 2.443333/(1.563117·1.563117)  3.940000/(1.563117·2.554082) |   =   | 1.000000  0.986893 |
        | 3.940000/(2.554082·1.563117)  6.523333/(2.554082·2.554082) |       | 0.986893  1.000000 |

Observe: if the sample is z-normalized (x_ij^new = (x_ij − x̄_i)/σ_{x_i}, i.e. mean 0 and standard deviation 1), then C equals Z.

See cov(), cor(), scale() in R.
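The example can be reproduced directly in base R; a minimal sketch:

    X <- rbind(c(1.2, 0.9), c(2.5, 3.9), c(0.7, 0.4), c(4.2, 5.8))
    colMeans(X)    # sample mean: 2.15 2.75
    cov(X)         # sample covariance matrix Z
    cor(X)         # sample correlation matrix C
    cov(scale(X))  # covariance of the z-normalized sample, equals cor(X)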
Principal Component Analysis
Principal Component Analysis (PCA) is a technique for
• dimensionality reduction
• lossy data compression
• feature extraction
• data visualization
Idea: orthogonal projection of the data onto a lower dimensional
linear space, such that the variance of the projected data is
maximized.
[Figure: two-dimensional data cloud with the principal direction u_1.]
Maximize Variance of Projected Data
Given data {x_n}_{n=1}^N where x_n has dimensionality d.

• Goal: project the data onto a space having dimensionality m < d while maximizing the variance of the projected data.

Let us consider the projection onto a one-dimensional space (m = 1).

• Define the direction of this space using a d-dimensional vector u_1.

The mean of the projected data is u_1^T x̄, where x̄ is the sample mean

    x̄ = (1/N) Σ_{n=1}^N x_n
Maximize Variance of Projected Data (cont.)
The variance of the projected data is given by

    (1/N) Σ_{n=1}^N (u_1^T x_n − u_1^T x̄)^2 = u_1^T S u_1

where S is the data covariance matrix defined by

    S = (1/N) Σ_{n=1}^N (x_n − x̄)(x_n − x̄)^T

Goal: maximize the projected variance u_1^T S u_1 with respect to u_1. To prevent ||u_1|| from growing to infinity, use the constraint u_1^T u_1 = 1, which gives the optimization problem:

    maximize    u_1^T S u_1
    subject to  u_1^T u_1 = 1
Maximize Variance of Projected Data (cont.)
Lagrangian form (one Lagrange multiplier λ_1):

    L(u_1, λ_1) = u_1^T S u_1 − λ_1 (u_1^T u_1 − 1)

Setting the derivative with respect to u_1 to zero,

    ∂L(u_1, λ_1)/∂u_1 = 0,

gives

    S u_1 = λ_1 u_1,

which says that u_1 must be an eigenvector of S. Finally, by left-multiplying by u_1^T and making use of u_1^T u_1 = 1, one can see that the variance is given by

    u_1^T S u_1 = λ_1.

Observe that the variance is maximized when u_1 equals the eigenvector having the largest eigenvalue λ_1.
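A small numerical check of this result in R (toy data; all variable names are illustrative). Note that cov() uses the factor 1/(N−1) instead of 1/N, which rescales the eigenvalues but leaves the eigenvectors unchanged:

    set.seed(42)
    X <- cbind(rnorm(200), rnorm(200))
    X[, 2] <- X[, 1] + 0.5 * X[, 2]     # make the two dimensions correlated
    S  <- cov(X)                        # data covariance matrix S
    e  <- eigen(S)                      # eigenvalues returned in decreasing order
    u1 <- e$vectors[, 1]                # first principal direction
    drop(t(u1) %*% S %*% u1)            # equals e$values[1]
    var(X %*% u1)                       # variance of the projected data, again e$values[1]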
Second Principal Component
The second eigenvector u_2 should also be of unit length and orthogonal to u_1 (after projection, uncorrelated with u_1^T x):

    maximize    u_2^T S u_2
    subject to  u_2^T u_2 = 1,  u_2^T u_1 = 0

Lagrangian form (two Lagrange multipliers λ_1, λ_2):

    L(u_2, λ_1, λ_2) = u_2^T S u_2 − λ_2 (u_2^T u_2 − 1) − λ_1 (u_2^T u_1 − 0)

This gives the solution

    u_2^T S u_2 = λ_2,

which implies that u_2 should be the eigenvector of S with the second largest eigenvalue λ_2. The remaining dimensions are given by the eigenvectors with decreasing eigenvalues.
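As a sketch of how the projections shown in the following example figures could be produced in R (toy data; the variable names below are illustrative, not the ones used in the figures):

    set.seed(42)
    X <- cbind(rnorm(200), rnorm(200))
    X[, 2] <- X[, 1] + 0.5 * X[, 2]             # correlated toy data
    e  <- eigen(cov(X))                         # eigenvectors u_1, u_2 as columns
    Xc <- scale(X, center = TRUE, scale = FALSE)
    proj1  <- Xc %*% e$vectors[, 1]             # projection on the first eigenvector
    proj12 <- Xc %*% e$vectors                  # projection on both orthogonal eigenvectors
    round(cov(proj12), 3)                       # diagonal matrix: the components are uncorrelated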
PCA Example
[Figures: "First and second eigenvector" (data.xy with both eigenvectors drawn), "Projection on first eigenvector" (data.x.eig.1 plotted against zero), and "Projection on both orthogonal eigenvectors" (data.x.eig).]
Proportion of Variance
• In image and speech processing problems the inputs are usually highly correlated.

If the dimensions are highly correlated, then there will be a small number of eigenvectors with large eigenvalues (m ≪ d). As a result, a large reduction in dimensionality can be attained.

Proportion of variance explained by the first m eigenvectors:

    (λ_1 + λ_2 + ... + λ_m) / (λ_1 + λ_2 + ... + λ_m + ... + λ_d)

[Figure: "Proportion of variance explained, digit class 1 (USPS database)": proportion of variance versus number of eigenvectors.]
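In R this curve can be computed from the eigenvalues of the covariance matrix; a minimal sketch, assuming a covariance matrix S as in the earlier sketches:

    lambda <- eigen(S, only.values = TRUE)$values   # eigenvalues, decreasing
    prop   <- cumsum(lambda) / sum(lambda)          # (lambda_1 + ... + lambda_m) / (lambda_1 + ... + lambda_d)
    plot(prop, type = "b", xlab = "Eigenvectors", ylab = "Proportion of variance")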
PCA Second Example
• Take a 256 × 256 image.

• Segment the image into 32 · 32 = 1024 image pieces of size 8 × 8 ≡ 1 × 64: x_1, x_2, ..., x_1024 ∈ R^64

• Determine the mean: x̄ = (1/1024) Σ_{n=1}^{1024} x_n

• Determine the covariance matrix S and the m eigenvectors u_1, u_2, ..., u_m having the largest corresponding eigenvalues λ_1, λ_2, ..., λ_m

• Create the eigenvector matrix U, where u_1, u_2, ..., u_m are the column vectors

• Project the image pieces x_i into the subspace as follows: z_i^T = U^T (x_i^T − x̄^T)
PCA Second Example (cont.)
• Reconstruct the image pieces by back-projecting them to the original space as x̃_i^T = U z_i^T + x̄^T. Note that the mean is added back (it was subtracted in the previous step) because the data is not normalized.

[Figures: "Proportion of variance explained in image" (proportion of variance versus number of eigenvectors); the original image and reconstructions with 16, 32, 48, and 64 eigenvectors.]
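A sketch of the procedure in R, assuming the 1024 × 64 matrix X of image pieces (one 8 × 8 piece per row) has already been extracted from the image; X, m and the other names are placeholders:

    m    <- 16                                  # number of eigenvectors kept
    xbar <- colMeans(X)                         # mean piece (length 64)
    U    <- eigen(cov(X))$vectors[, 1:m]        # eigenvector matrix, largest eigenvalues first
    Z    <- sweep(X, 2, xbar) %*% U             # project: z_i = U^T (x_i - xbar)
    Xrec <- Z %*% t(U) + matrix(xbar, nrow(X), 64, byrow = TRUE)   # back-project and add the mean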
Nonlinear Dimensionality Reduction
Nonlinear techniques for dimensionality reduction can be
subdivided into three main types:
1. Preserve global properties of the original data in the
low-dimensional representation.
2. Preserve local properties of the original data in the
low-dimensional representation.
3. Global alignment of a mixture of linear models.
• Given a dataset represented as an n × D matrix X consisting of n data vectors x_i, i = 1, 2, ..., n with dimensionality D.

• Assume that the dataset has intrinsic dimensionality d (where d < D, and often d ≪ D).
Nonlinear Dimensionality Reduction (cont.)
(Nonlinear) dimensionality reduction techniques transform the dataset X with dimensionality D into a new dataset Y with dimensionality d, while retaining the geometry or structure of the data as much as possible.

[Figure: picture taken from Sam Roweis’ website.]
Multidimensional Scaling (MDS)
MDS maps a high-dimensional data representation to a low-dimensional representation while retaining the pairwise distances between the data points as much as possible.

• The quality of the mapping is expressed by the stress function, a measure of the error between the pairwise distances in the low-dimensional and high-dimensional representations of the data.

The raw stress function (squared error cost) to be minimized is

    E_M = Σ_{i<j} [d(x_i, x_j) − d̂(y_i, y_j)]^2

where d(x_i, x_j) is the distance (e.g. Euclidean) between x_i and x_j in the high-dimensional space and d̂(y_i, y_j) the distance in the low-dimensional space.
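The raw stress can be minimized with a general-purpose optimizer; a minimal sketch in R on toy data (the data, the target dimensionality of 2, and the use of optim() are illustrative choices, not a prescribed MDS implementation):

    set.seed(1)
    X  <- matrix(rnorm(20 * 5), ncol = 5)      # 20 points in 5 dimensions (toy data)
    DX <- dist(X)                              # pairwise distances d(x_i, x_j)
    raw_stress <- function(y) {
      Y <- matrix(y, ncol = 2)                 # candidate low-dimensional configuration
      sum((DX - dist(Y))^2)                    # E_M: sum over i < j of squared differences
    }
    fit <- optim(rnorm(20 * 2), raw_stress, method = "BFGS", control = list(maxit = 500))
    Y <- matrix(fit$par, ncol = 2)             # low-dimensional representation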
Sammon’s projection
Sammon’s mapping is closely related to the MDS technique.
Sammon’s stress function to be minimized is

    E_S = (1 / Σ_{i<j} d(x_i, x_j)) Σ_{i<j} [d(x_i, x_j) − d̂(y_i, y_j)]^2 / d(x_i, x_j).

The stress function puts more emphasis on retaining distances that were originally small.
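In R, Sammon’s mapping is available as sammon() in the MASS package; a minimal usage sketch on toy data:

    library(MASS)
    set.seed(1)
    X   <- matrix(rnorm(50 * 10), ncol = 10)   # 50 points in 10 dimensions (toy data)
    fit <- sammon(dist(X), k = 2)              # minimize Sammon's stress, map to 2 dimensions
    plot(fit$points)                           # low-dimensional configuration
    fit$stress                                 # final stress value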