PCA and Oja's Rule
Covariance and Correlation Matrix

Given sample $\{x_n\}_1^N$, where $x_n \in \mathbb{R}^d$, $x_n = (x_{1n}, x_{2n}, \ldots, x_{dn})^T$:
• sample mean $\bar{x} = \frac{1}{N}\sum_{n=1}^{N} x_n$, and entries of sample mean are $\bar{x}_i = \frac{1}{N}\sum_{n=1}^{N} x_{in}$
• sample covariance matrix is a $d \times d$ matrix $Z$ with entries $Z_{ij} = \frac{1}{N-1}\sum_{n=1}^{N} (x_{in} - \bar{x}_i)(x_{jn} - \bar{x}_j)$
• sample correlation matrix is a $d \times d$ matrix $C$ with entries $C_{ij} = \frac{\frac{1}{N-1}\sum_{n=1}^{N} (x_{in} - \bar{x}_i)(x_{jn} - \bar{x}_j)}{\sigma_{x_i} \sigma_{x_j}}$, where $\sigma_{x_i}$ and $\sigma_{x_j}$ are the sample standard deviations
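To make the definitions concrete, here is a minimal R sketch that builds $Z$ and $C$ entry by entry from the formulas above and compares them with the built-in cov() and cor(); the random data matrix (one observation per row) is purely illustrative.

```r
set.seed(1)
X <- matrix(rnorm(50 * 3), ncol = 3)        # N = 50 observations in d = 3 dimensions
N <- nrow(X); d <- ncol(X)
xbar <- colMeans(X)                         # sample mean, entries xbar_i

Z <- matrix(0, d, d)                        # sample covariance matrix
for (i in 1:d)
  for (j in 1:d)
    Z[i, j] <- sum((X[, i] - xbar[i]) * (X[, j] - xbar[j])) / (N - 1)

sigma <- sqrt(diag(Z))                      # sample standard deviations
C <- Z / outer(sigma, sigma)                # C_ij = Z_ij / (sigma_i * sigma_j)

all.equal(Z, cov(X))                        # TRUE: matches the built-in cov()
all.equal(C, cor(X))                        # TRUE: matches the built-in cor()
```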
Covariance and Correlation Matrix Example

Given sample:
$$\begin{bmatrix} 1.2 \\ 0.9 \end{bmatrix}, \begin{bmatrix} 2.5 \\ 3.9 \end{bmatrix}, \begin{bmatrix} 0.7 \\ 0.4 \end{bmatrix}, \begin{bmatrix} 4.2 \\ 5.8 \end{bmatrix}, \qquad \bar{x} = \begin{bmatrix} 2.15 \\ 2.75 \end{bmatrix}$$

$$Z = \begin{bmatrix} 2.443333 & 3.940000 \\ 3.940000 & 6.523333 \end{bmatrix}$$

$$C = \begin{bmatrix} \frac{2.443333}{1.563117 \cdot 1.563117} & \frac{3.940000}{1.563117 \cdot 2.554082} \\ \frac{3.940000}{2.554082 \cdot 1.563117} & \frac{6.523333}{2.554082 \cdot 2.554082} \end{bmatrix} = \begin{bmatrix} 1.000000 & 0.986893 \\ 0.986893 & 1.000000 \end{bmatrix}$$

Observe: if the sample is z-normalized ($x^{\text{new}}_{ij} = \frac{x_{ij} - \bar{x}_i}{\sigma_{x_i}}$, mean 0, standard deviation 1), then $C$ equals $Z$.

See cov(), cor(), scale() in R.
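The R functions mentioned above reproduce these numbers directly; a quick check on the sample from this slide, laid out as a matrix with one observation per row:

```r
# The four sample vectors as rows of a 4 x 2 matrix.
X <- matrix(c(1.2, 0.9,
              2.5, 3.9,
              0.7, 0.4,
              4.2, 5.8), ncol = 2, byrow = TRUE)

colMeans(X)                  # sample mean: 2.15 2.75
cov(X)                       # Z: 2.443333 3.94 / 3.94 6.523333
cor(X)                       # C: off-diagonal entries 0.986893

# z-normalize each column (mean 0, standard deviation 1);
# the covariance of the scaled data equals the correlation matrix.
Xz <- scale(X)
all.equal(cov(Xz), cor(X))   # TRUE
```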
Principal Component Analysis with NN
Principal Component Analysis (PCA) is a technique for
• dimensionality reduction
• lossy data compression
• feature extraction
• data visualization
Idea: orthogonal projection of the data onto a lower dimensional linear space, such that the variance of the projected data
is maximized.
[Figure: two-dimensional data cloud with the principal direction $u_1$ drawn through it]
Maximize Variance of Projected Data

Given data $\{x_n\}_1^N$ where $x_n$ has dimensionality $d$.
• Goal: project the data onto a space having dimensionality $m < d$ while maximizing the variance of the projected data.

Let us consider the projection onto a one-dimensional space ($m = 1$).
• Define the direction of this space using a $d$-dimensional vector $u_1$.

The mean of the projected data is $u_1^T \bar{x}$, where $\bar{x}$ is the sample mean
$$\bar{x} = \frac{1}{N} \sum_{n=1}^{N} x_n$$
Maximize Variance of Projected Data (cont.)

The variance of the projected data is given by
$$\frac{1}{N} \sum_{n=1}^{N} \left( u_1^T x_n - u_1^T \bar{x} \right)^2 = u_1^T S u_1$$
where $S$ is the data covariance matrix defined by
$$S = \frac{1}{N} \sum_{n=1}^{N} (x_n - \bar{x})(x_n - \bar{x})^T$$
Goal: maximize the projected variance $u_1^T S u_1$ with respect to $u_1$. To prevent $\|u_1\|$ from growing to infinity, use the constraint $u_1^T u_1 = 1$, which gives the optimization problem:
$$\text{maximize } u_1^T S u_1 \quad \text{subject to } u_1^T u_1 = 1$$
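As a quick numerical check of this identity, one can project a data set onto an arbitrary unit vector and compare the variance of the projections with $u_1^T S u_1$. The correlated random data below and the choice of $u_1$ are assumptions for illustration; the $1/N$ convention matches the definition of $S$ above.

```r
set.seed(2)
N <- 200
X  <- matrix(rnorm(N * 2), ncol = 2) %*% matrix(c(2, 0.8, 0.8, 1), 2, 2)  # correlated 2-D data
Xc <- sweep(X, 2, colMeans(X))          # centered data x_n - xbar
S  <- t(Xc) %*% Xc / N                  # covariance matrix with the 1/N convention

u1 <- c(1, 1) / sqrt(2)                 # some unit vector, u1' u1 = 1
proj <- Xc %*% u1                       # projections u1' (x_n - xbar)
mean(proj^2)                            # variance of the projected data
t(u1) %*% S %*% u1                      # the same value: u1' S u1
```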
Maximize Variance of Projected Data (cont.)

Lagrangian form (one Lagrange multiplier $\lambda_1$):
$$L(u_1, \lambda_1) = u_1^T S u_1 - \lambda_1 (u_1^T u_1 - 1)$$
Setting the derivative with respect to $u_1$ to zero,
$$\frac{\partial L(u_1, \lambda_1)}{\partial u_1} = 0,$$
gives
$$S u_1 = \lambda_1 u_1,$$
which says that $u_1$ must be an eigenvector of $S$. Finally, by left-multiplying by $u_1^T$ and making use of $u_1^T u_1 = 1$, one can see that the variance is given by
$$u_1^T S u_1 = \lambda_1.$$
Observe that the variance is maximized when $u_1$ equals the eigenvector having the largest eigenvalue $\lambda_1$.
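The same result can be checked numerically with eigen(): the leading eigenvector of $S$ satisfies $S u_1 = \lambda_1 u_1$, and the variance of the data projected onto it equals $\lambda_1$ (the data below is again illustrative).

```r
set.seed(3)
X  <- matrix(rnorm(500 * 2), ncol = 2) %*% matrix(c(2, 0.8, 0.8, 1), 2, 2)
Xc <- sweep(X, 2, colMeans(X))
S  <- t(Xc) %*% Xc / nrow(Xc)           # covariance matrix (1/N convention)

e  <- eigen(S)                          # eigenvalues are returned in decreasing order
u1 <- e$vectors[, 1]                    # first principal direction
lambda1 <- e$values[1]

max(abs(S %*% u1 - lambda1 * u1))       # ~0: S u1 = lambda1 u1
c(t(u1) %*% S %*% u1, lambda1)          # projected variance equals lambda1
mean((Xc %*% u1)^2)                     # the same value computed from the data
```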
Second Principal Component

The second eigenvector $u_2$ should also be of unit length and orthogonal to $u_1$ (after projection, uncorrelated with $u_1^T x$):
$$\text{maximize } u_2^T S u_2 \quad \text{subject to } u_2^T u_2 = 1, \; u_2^T u_1 = 0$$
Lagrangian form (two Lagrange multipliers $\lambda_1$, $\lambda_2$):
$$L(u_2, \lambda_1, \lambda_2) = u_2^T S u_2 - \lambda_2 (u_2^T u_2 - 1) - \lambda_1 (u_2^T u_1 - 0)$$
This gives the solution
$$u_2^T S u_2 = \lambda_2,$$
which implies that $u_2$ should be the eigenvector of $S$ with the second largest eigenvalue $\lambda_2$. The other dimensions are given by the eigenvectors with decreasing eigenvalues.
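Continuing the numerical sketch, one can verify that the top two eigenvectors are orthogonal and that the corresponding projected components are uncorrelated (again on illustrative data):

```r
set.seed(3)
X  <- matrix(rnorm(500 * 2), ncol = 2) %*% matrix(c(2, 0.8, 0.8, 1), 2, 2)
Xc <- sweep(X, 2, colMeans(X))
e  <- eigen(t(Xc) %*% Xc / nrow(Xc))
u1 <- e$vectors[, 1]; u2 <- e$vectors[, 2]

sum(u1 * u2)                # ~0: u2' u1 = 0, the directions are orthogonal
cor(Xc %*% u1, Xc %*% u2)   # ~0: the projected components are uncorrelated
e$values                    # lambda1 >= lambda2: variances along u1 and u2
```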
PCA Example

[Figures: (1) "First and second eigenvector": the data (data.xy[,1] vs. data.xy[,2]) with both eigenvectors drawn in; (2) "Projection on first eigenvector": cbind(data.x.eig.1, rep(0, N)); (3) "Projection on both orthogonal eigenvectors": data.x.eig[,1] vs. data.x.eig[,2]]
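The R code behind these figures is not part of the transcript; a minimal sketch that produces comparable plots could look as follows. The variable names data.xy, data.x.eig.1 and data.x.eig are taken from the axis labels; the data itself is made up.

```r
set.seed(4)
N <- 200
data.xy <- matrix(rnorm(N * 2), ncol = 2) %*% matrix(c(2, 1, 1, 1), 2, 2)
data.xy <- sweep(data.xy, 2, colMeans(data.xy))     # center the data

e <- eigen(cov(data.xy))                            # eigenvectors of the covariance matrix
data.x.eig   <- data.xy %*% e$vectors               # projection on both eigenvectors
data.x.eig.1 <- data.xy %*% e$vectors[, 1]          # projection on the first eigenvector only

par(mfrow = c(1, 3))
plot(data.xy, asp = 1, main = "First and second eigenvector")
arrows(0, 0, 2 * e$vectors[1, ], 2 * e$vectors[2, ], col = c("red", "blue"), lwd = 2)
plot(cbind(data.x.eig.1, rep(0, N)), asp = 1, main = "Projection on first eigenvector")
plot(data.x.eig, asp = 1, main = "Projection on both orthogonal eigenvectors")
```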
Proportion of Variance

• In image and speech processing problems the inputs are usually highly correlated.

If the dimensions are highly correlated, then there will be a small number of eigenvectors with large eigenvalues ($m \ll d$). As a result, a large reduction in dimensionality can be attained.

Proportion of variance explained by the first $m$ eigenvectors:
$$\frac{\lambda_1 + \lambda_2 + \ldots + \lambda_m}{\lambda_1 + \lambda_2 + \ldots + \lambda_m + \ldots + \lambda_d}$$

[Figure: proportion of variance explained vs. number of eigenvectors, digit class 1 (USPS database)]
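Given the eigenvalues, the curve in such a figure is just a cumulative sum; a short sketch (random surrogate data stands in for the USPS digits, which are not included here):

```r
set.seed(5)
X <- matrix(rnorm(300 * 20), ncol = 20) %*% matrix(rnorm(20 * 20), 20, 20)  # correlated surrogate data
lambda <- eigen(cov(X))$values                 # eigenvalues in decreasing order

prop.var <- cumsum(lambda) / sum(lambda)       # (lambda_1 + ... + lambda_m) / (lambda_1 + ... + lambda_d)
plot(prop.var, type = "b",
     xlab = "Eigenvectors", ylab = "Proportion of variance")
```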
PCA Second Example

[Figure: a 256 × 256 image divided into pieces of size 8 × 8]

• Segment the image into $32 \cdot 32 = 1024$ image pieces of size $8 \times 8 \equiv 1 \times 64$: $x_1, x_2, \ldots, x_{1024} \in \mathbb{R}^{64}$
• Determine the mean: $\bar{x} = \frac{1}{1024} \sum_{n=1}^{1024} x_n$
• Determine the covariance matrix $S$ and the $m$ eigenvectors $u_1, u_2, \ldots, u_m$ having the largest corresponding eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_m$
• Create the eigenvector matrix $U$, where $u_1, u_2, \ldots, u_m$ are the column vectors
• Project the image pieces $x_i$ into the subspace as follows: $z_i = U^T (x_i - \bar{x})$
PCA Second Example (cont.)

• Reconstruct the image pieces by back-projecting them to the original space as $\tilde{x}_i = U z_i + \bar{x}$. Note that the mean is added back (it was subtracted in the step before) because the data is not normalized.

[Figures: proportion of variance explained in the image vs. number of eigenvectors; the original image next to reconstructions with 16, 32, 48 and 64 eigenvectors]
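A minimal sketch of the projection and reconstruction steps above; a random matrix stands in for the 256 × 256 image, and the patch extraction details are an assumption.

```r
set.seed(6)
img <- matrix(runif(256 * 256), 256, 256)       # stand-in for the 256 x 256 image

# Cut the image into 32 * 32 = 1024 pieces of size 8 x 8, each flattened to 1 x 64.
pieces <- matrix(0, nrow = 1024, ncol = 64)
k <- 1
for (r in seq(1, 256, by = 8))
  for (s in seq(1, 256, by = 8)) {
    pieces[k, ] <- as.vector(img[r:(r + 7), s:(s + 7)])
    k <- k + 1
  }

xbar <- colMeans(pieces)                        # mean piece
e    <- eigen(cov(pieces))
m    <- 16
U    <- e$vectors[, 1:m]                        # 64 x m eigenvector matrix

Z   <- sweep(pieces, 2, xbar) %*% U             # z_i = U' (x_i - xbar), one row per piece
rec <- sweep(Z %*% t(U), 2, xbar, "+")          # x~_i = U z_i + xbar
mean((pieces - rec)^2)                          # reconstruction error with m = 16 eigenvectors
```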
PCA with a Neural Network

[Figure: a single linear unit with inputs $x_1, x_2, \ldots, x_d$, weights $w_1, w_2, \ldots, w_d$ and output $V$]

$$V = w^T x = \sum_{j=1}^{d} w_j x_j$$

Apply the Hebbian learning rule
$$\Delta w_i = \eta V x_i,$$
such that after some update steps the weight vector $w$ should point in the direction of maximum variance.
PCA with a Neural Network (cont.)

Suppose that there is a stable equilibrium point for $w$ such that the average weight change is zero:
$$0 = \langle \Delta w_i \rangle = \eta \langle V x_i \rangle = \eta \Big\langle \sum_j w_j x_j x_i \Big\rangle = \eta \sum_j C_{ij} w_j, \quad \text{i.e. } Cw = 0.$$
The angle brackets $\langle \cdot \rangle$ indicate an average over the input distribution $P(x)$, and $C$ denotes the correlation matrix with
$$C_{ij} \equiv \langle x_i x_j \rangle, \quad \text{or} \quad C \equiv \langle x x^T \rangle.$$
Note that $C$ is symmetric ($C_{ij} = C_{ji}$) and positive semi-definite, which implies that its eigenvalues are positive or zero and the eigenvectors can be taken as orthogonal.
PCA with a Neural Network (cont.)

• At our hypothetical equilibrium point, $w$ is an eigenvector of $C$ with eigenvalue 0.
• This is never stable, because $C$ has some positive eigenvalues and some corresponding eigenvector component would grow exponentially.
• One can constrain the growth of $w$, e.g. by renormalization ($\|w\| = 1$) after each update step.
• A more elegant idea: add a weight decay proportional to $V^2$ to the Hebbian learning rule (Oja's rule)
$$\Delta w_i = \eta V (x_i - V w_i)$$
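A minimal sketch of Oja's rule on zero-mean synthetic data, comparing the learned weight vector with the leading eigenvector of $C = \langle x x^T \rangle$; the data, learning rate and number of epochs are illustrative assumptions.

```r
set.seed(7)
N <- 1000
X <- matrix(rnorm(N * 2), ncol = 2) %*% matrix(c(2, 1, 1, 1), 2, 2)
X <- sweep(X, 2, colMeans(X))            # zero-mean data

eta <- 0.005
w <- rnorm(2); w <- w / sqrt(sum(w^2))   # random initial weight vector of unit length
for (epoch in 1:100) {
  for (n in sample(N)) {                 # present the patterns in random order
    x <- X[n, ]
    V <- sum(w * x)                      # V = w' x
    w <- w + eta * V * (x - V * w)       # Oja's rule: dw_i = eta * V * (x_i - V * w_i)
  }
}

u1 <- eigen(t(X) %*% X / N)$vectors[, 1] # leading eigenvector of C = <x x'>
sqrt(sum(w^2))                           # ~1: w keeps unit length
abs(sum(w * u1))                         # ~1: w points along +/- u1, the maximum variance direction
```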
PCA with a Neural Network Example

[Figure: scatter plot of the data (data.xy[,1] vs. data.xy[,2]) with the weight vector found by Oja's rule shown as a blue vector and the largest eigenvector shown as a red vector]
Some insights into Oja's Rule

Oja's rule converges to a weight vector $w$ with the following properties:
• $\|w\| = 1$ (unit length),
• eigenvector direction: $w$ lies in a maximal eigenvector direction of $C$,
• variance maximization: $w$ lies in a direction that maximizes $\langle V^2 \rangle$.

Oja's learning rule is still limited, because we can construct only the first principal component of the z-normalized data.
Construct the first m principal components

• Single-layer network with the $i$-th output $V_i$ given by $V_i = \sum_j w_{ij} x_j = w_i^T x$, where $w_i$ is the weight vector for the $i$-th output
• Oja's $m$-unit learning rule
$$\Delta w_{ij} = \eta V_i \Big( x_j - \sum_{k=1}^{m} V_k w_{kj} \Big)$$
• Sanger's learning rule
$$\Delta w_{ij} = \eta V_i \Big( x_j - \sum_{k=1}^{i} V_k w_{kj} \Big)$$
• Both rules reduce to Oja's 1-unit rule for the $m = 1$ and $i = 1$ case
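A minimal sketch of Sanger's rule extracting the first $m = 2$ principal directions of zero-mean synthetic data and comparing them with the eigenvectors of $C$; learning rate, data and iteration counts are illustrative assumptions.

```r
set.seed(8)
N <- 2000
X <- matrix(rnorm(N * 3), ncol = 3) %*% matrix(c(3, 1, 0, 1, 2, 0.5, 0, 0.5, 1), 3, 3)
X <- sweep(X, 2, colMeans(X))                          # zero-mean data
d <- ncol(X); m <- 2

W <- 0.1 * matrix(rnorm(m * d), m, d)                  # row i is the weight vector w_i
eta <- 0.002
for (epoch in 1:100) {
  for (n in sample(N)) {
    x <- X[n, ]
    V <- as.vector(W %*% x)                            # V_i = w_i' x
    for (i in 1:m) {
      back <- colSums(V[1:i] * W[1:i, , drop = FALSE]) # sum_{k=1..i} V_k w_kj
      W[i, ] <- W[i, ] + eta * V[i] * (x - back)       # Sanger's rule
    }
  }
}

U <- eigen(t(X) %*% X / N)$vectors[, 1:m]              # first m eigenvectors of C
abs(W %*% U)                                           # ~ identity: w_i aligns with +/- u_i
```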
Oja's and Sanger's Rule

• In both cases the $w_i$ vectors converge to orthogonal unit vectors
• In Sanger's rule the weight vectors become exactly the first $m$ principal components, in order $w_i = \pm c_i$, where $c_i$ is the normalized eigenvector of the correlation matrix $C$ belonging to the $i$-th largest eigenvalue $\lambda_i$
• Oja's $m$-unit rule converges to $m$ weight vectors that span the same subspace as the first $m$ eigenvectors, but does not find the eigenvector directions themselves
Linear Auto-Associative Network

[Figure: a network with the original features $x_1, \ldots, x_d$ as inputs, a bottleneck of $m$ extracted features $z_1, \ldots, z_m$ (extraction), and the reconstructed features $\tilde{x}_1, \ldots, \tilde{x}_d$ as outputs (reconstruction)]

• The network is trained to perform the identity mapping
• Idea: the bottleneck units represent significant features of the input data
• Train the network by minimizing the sum-of-squares error
$$E = \frac{1}{2} \sum_{n=1}^{N} \sum_{k=1}^{d} \left( y_k(x^{(n)}) - x_k^{(n)} \right)^2$$
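A hedged sketch of such a network: a linear autoencoder $\tilde{x} = B(Ax)$ trained by batch gradient descent on the sum-of-squares error. Working on centered data without biases, the learning rate and the epoch count are assumptions; after training, the error approaches that of a PCA projection with $m$ components.

```r
set.seed(9)
N <- 500; d <- 5; m <- 2
X <- matrix(rnorm(N * d), ncol = d) %*% matrix(rnorm(d * d), d, d)
X <- sweep(X, 2, colMeans(X))               # centered data, so biases are omitted

A <- 0.1 * matrix(rnorm(m * d), m, d)       # extraction weights:     z  = A x
B <- 0.1 * matrix(rnorm(d * m), d, m)       # reconstruction weights: x~ = B z
eta <- 0.001

for (epoch in 1:5000) {                     # batch gradient descent on E
  Z  <- X %*% t(A)                          # N x m extracted features
  E  <- Z %*% t(B) - X                      # reconstruction residuals
  gB <- t(E) %*% Z / N                      # gradient of the error w.r.t. B
  gA <- t(B) %*% t(E) %*% X / N             # gradient of the error w.r.t. A
  B  <- B - eta * gB
  A  <- A - eta * gA
}

U <- eigen(cov(X))$vectors[, 1:m]           # first m principal directions
sum((X %*% t(A) %*% t(B) - X)^2) / 2        # autoencoder sum-of-squares error ...
sum((X %*% U %*% t(U) - X)^2) / 2           # ... approaches the PCA reconstruction error
```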
Linear Auto-Associative Network (cont.)

• As with Oja's/Sanger's update rules, this type of learning can be considered unsupervised learning, since no independent target data is provided
• The error function has a unique global minimum when the hidden units have linear activation functions
• At this minimum the network performs a projection onto the $m$-dimensional subspace which is spanned by the first $m$ principal components of the data
• Note, however, that these vectors need not be orthogonal or normalized
Non-Linear Auto-Associative Network

[Figure: an auto-associative network with alternating non-linear and linear layers: original features $x_1, \ldots, x_d$, a non-linear hidden layer, a linear bottleneck with $m$ extracted features $z_1, \ldots, z_m$, a non-linear hidden layer, and reconstructed features $\tilde{x}_1, \ldots, \tilde{x}_d$]