Lecture VIII – Dim. Reduction (I)
Contents:
• Subset Selection & Shrinkage
  • Ridge regression, Lasso
• PCA, PCR, PLS
Data From Human Movement
• Measure arm movement and full-body movement of humans and anthropomorphic robots
• Perform local dimensionality analysis with a growing variational mixture of factor analyzers
Dimensionality of Full Body Motion
[Figure: left — cumulative variance vs. number of retained dimensions for global PCA of the 105-dimensional full-body motion data; right — probability vs. local dimensionality found by local FA. Both indicate that far fewer than the 105 recorded dimensions are needed.]
About 8 dimensions in the space formed by joint positions, velocities, and accelerations
are needed to model the inverse dynamics.
Dimensionality Reduction
Goals:
• fewer dimensions for subsequent processing
• better numerical stability due to removal of correlations
• simplify post-processing due to “advanced statistical properties” of the preprocessed data
• don’t lose important information, only redundant information or irrelevant information
• perform dimensionality reduction spatially localized for nonlinear problems (e.g., each RBF has its own local dimensionality reduction)
Subset Selection & Shrinkage Methods
Subset Selection
Refers to methods which select a set of variables to be included and discard the other
dimensions based on some optimality criterion; regression is then performed on this
reduced-dimensional input set. Discrete method: variables are either selected or discarded.
• Leaps and bounds procedure (Furnival & Wilson, 1974)
• Forward/backward stepwise selection (see the sketch below)
Shrinkage Methods
Refers to methods which reduce/shrink the redundant or irrelevant variables in a
more continuous fashion.
• Ridge Regression
• Lasso
• Derived Input Direction Methods – PCR, PLS
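To make the discrete nature of subset selection concrete, here is a minimal forward stepwise selection sketch (not from the slides; NumPy is used instead of the slides' Matlab notation, and the greedy residual-sum-of-squares criterion, toy data, and fixed number of selected variables are my own assumptions):

import numpy as np

def forward_stepwise(X, t, n_select):
    # Greedy forward selection: repeatedly add the column of X that most reduces the RSS.
    N, M = X.shape
    selected = []
    for _ in range(n_select):
        best_j, best_rss = None, np.inf
        for j in range(M):
            if j in selected:
                continue
            cols = selected + [j]
            A = np.column_stack([np.ones(N), X[:, cols]])   # candidate subset plus a bias column
            w, *_ = np.linalg.lstsq(A, t, rcond=None)       # least-squares fit on the subset
            rss = np.sum((t - A @ w) ** 2)
            if rss < best_rss:
                best_j, best_rss = j, rss
        selected.append(best_j)
    return selected

# toy usage: only the first two inputs actually matter
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))
t = 2.0 * X[:, 0] - 1.0 * X[:, 1] + 0.1 * rng.normal(size=100)
print(forward_stepwise(X, t, n_select=2))    # typically [0, 1]

Backward stepwise selection works the same way in reverse: start from the full model and greedily remove the variable whose deletion increases the RSS the least.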
Ridge Regression
Ridge regression shrinks the coefficients by imposing a penalty on their size. It
minimizes a penalized residual sum of squares:

\hat{w}^{\mathrm{ridge}} = \arg\min_{w} \left\{ \sum_{i=1}^{N} \Big( t_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j \Big)^2 + \lambda \sum_{j=1}^{M} w_j^2 \right\}

Here \lambda is the complexity parameter controlling the amount of shrinkage.

Equivalent representation:

\hat{w}^{\mathrm{ridge}} = \arg\min_{w} \left\{ \sum_{i=1}^{N} \Big( t_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j \Big)^2 \right\}   subject to   \sum_{j=1}^{M} w_j^2 \le s
Ridge regression (cont’d)
Some notes:
• when there are many correlated variables in a regression problem, their coefficients can
be poorly determined and exhibit high variance, e.g., a wildly large positive coefficient on one
variable can be canceled by a similarly large negative coefficient on its correlated cousin.
• The bias term is not included in the penalty.
• When the inputs are orthogonal, the ridge estimates are just a scaled version of the least
squares estimate
Matrix representation of the criterion and its solution:

J(w, \lambda) = \tfrac{1}{2} (t - Xw)^T (t - Xw) + \tfrac{\lambda}{2}\, w^T w

\hat{w}^{\mathrm{ridge}} = (X^T X + \lambda I)^{-1} X^T t
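A minimal NumPy sketch of the closed-form solution above (my own variable names and toy data; the unpenalized bias w_0 is handled by centring the inputs and targets, consistent with the note that the bias term is not included in the penalty):

import numpy as np

def ridge_fit(X, t, lam):
    # Ridge regression on centred data: w = (X^T X + lambda*I)^(-1) X^T t
    x_mean, t_mean = X.mean(axis=0), t.mean()
    Xc, tc = X - x_mean, t - t_mean
    M = Xc.shape[1]
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(M), Xc.T @ tc)
    w0 = t_mean - x_mean @ w                 # bias recovered from the means
    return w0, w

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))
t = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=50)
for lam in (0.0, 1.0, 100.0):
    w0, w = ridge_fit(X, t, lam)
    print(lam, np.round(w, 3))               # coefficients shrink towards zero as lambda grows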
Ridge regression (cont’d)
Ridge regression shrinks the directions with the least variance the most.

Shrinkage factor: each direction is shrunk by

d_j^2 / (d_j^2 + \lambda),

where d_j^2 refers to the corresponding eigenvalue.   [Page 62: Elem. Stat. Learning]

Effective degrees of freedom:

df(\lambda) = \mathrm{tr}\big[ X (X^T X + \lambda I)^{-1} X^T \big] = \sum_{j=1}^{M} \frac{d_j^2}{d_j^2 + \lambda}   [Page 63: Elem. Stat. Learning]

Note: df(\lambda) = M if \lambda = 0 (no regularization).
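The shrinkage factors and df(λ) above can be checked numerically from the singular values d_j of the centred input matrix; a small sketch with my own toy data:

import numpy as np

def ridge_df(X, lam):
    # Effective degrees of freedom: df(lambda) = sum_j d_j^2 / (d_j^2 + lambda)
    Xc = X - X.mean(axis=0)
    d = np.linalg.svd(Xc, compute_uv=False)          # singular values d_j
    return np.sum(d**2 / (d**2 + lam))

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 5))
print(ridge_df(X, 0.0))      # equals M = 5 when lambda = 0
print(ridge_df(X, 10.0))     # drops below 5 as the penalty grows

# cross-check against the trace formula tr[X (X^T X + lambda*I)^(-1) X^T]
Xc = X - X.mean(axis=0)
H = Xc @ np.linalg.solve(Xc.T @ Xc + 10.0 * np.eye(5), Xc.T)
print(np.trace(H))           # matches ridge_df(X, 10.0)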
Lasso
Lasso is also a shrinkage method like ridge regression. It minimizes a penalized
residual sum of squares:

\hat{w}^{\mathrm{lasso}} = \arg\min_{w} \left\{ \sum_{i=1}^{N} \Big( t_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j \Big)^2 + \lambda \sum_{j=1}^{M} |w_j| \right\}

Equivalent representation:

\hat{w}^{\mathrm{lasso}} = \arg\min_{w} \left\{ \sum_{i=1}^{N} \Big( t_i - w_0 - \sum_{j=1}^{M} x_{ij} w_j \Big)^2 \right\}   subject to   \sum_{j=1}^{M} |w_j| \le s
The L2 ridge penalty is replaced by the L1 lasso penalty.
This makes the solution non-linear in t.
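There is no closed-form lasso solution, but a short coordinate-descent sketch with soft-thresholding (a standard algorithm, not taken from the slides; the 1/2 scaling of the squared error, the toy data, and the fixed iteration count are my own choices) shows how the L1 penalty drives individual coefficients exactly to zero:

import numpy as np

def soft_threshold(rho, lam):
    # Soft-thresholding operator: the solution of the one-dimensional lasso problem.
    return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

def lasso_cd(X, t, lam, n_iter=200):
    # Coordinate descent for 0.5*||t - X w||^2 + lam * sum_j |w_j| (centred data, no bias).
    N, M = X.shape
    w = np.zeros(M)
    col_sq = np.sum(X**2, axis=0)                 # X_j^T X_j for each column
    for _ in range(n_iter):
        for j in range(M):
            r_j = t - X @ w + X[:, j] * w[j]      # partial residual excluding feature j
            rho = X[:, j] @ r_j
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w

rng = np.random.default_rng(3)
X = rng.normal(size=(80, 6))
t = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.1 * rng.normal(size=80)
print(np.round(lasso_cd(X, t, lam=5.0), 3))       # irrelevant coefficients come out exactly 0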
Derived Input Direction Methods
These methods essentially involve transforming the input directions to some low-dimensional
representation and then using these derived directions to perform the regression.
Example Methods
• Principal Components Regression (PCR)
• Based on input variance only
• Partial Least Squares (PLS)
• Based on input-output correlation and output variance
Principal Component Analysis
From earlier discussions:
• PCA finds the eigenvectors of the correlation (covariance) matrix of the data
• Here, we show how PCA can be learned by bottleneck neural networks (auto-associators) and Hebbian learning frameworks
PCA Implementation
• Hebbian Learning
  • obtain a measure of familiarity of a new data point
  • the more familiar a data point, the larger the output y

[Figure: a single linear unit with inputs x_1, x_2, x_3, ..., x_d, weight vector w, and output y]

w^{n+1} = w^n + \alpha\, y\, x
Hebbian Learning
Properties of Hebbian Learning:
• unstable learning rule (at most neutrally stable)
• finds the direction of maximal variance of the data

\Delta w = \alpha\, x\, y

J = \tfrac{1}{2} y^2 = \tfrac{1}{2}\, w^T x\, x^T w

E\{J\} = E\left\{ \tfrac{1}{2}\, w^T x\, x^T w \right\} = \tfrac{1}{2}\, w^T E\{x\, x^T\}\, w = \tfrac{1}{2}\, w^T C\, w

Here, C is the correlation matrix.

[Figure: 2-D data cloud in the (x_1, x_2) plane with weight vector (initial value w_0) and projection y]
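A quick simulation of the plain Hebbian rule Δw = αxy (my own toy data and learning rate) illustrates both properties at once: the weight direction aligns with the direction of maximal variance, but the norm of w grows without bound:

import numpy as np

rng = np.random.default_rng(11)
X = rng.normal(size=(3000, 2)) * np.array([2.0, 0.3])   # most variance along the first axis

w = np.array([0.1, 0.1])
for x in X:
    y = w @ x
    w += 0.002 * y * x                # plain Hebbian update: delta_w = alpha * x * y

print(w / np.linalg.norm(w))          # direction ~ (+/-1, 0): the maximal-variance direction
print(np.linalg.norm(w))              # huge norm: the rule is unstable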
Fixing the Instability (Oja’s rule)
• Renormalization:

w^{n+1} = \frac{w^n + \alpha\, x\, y}{\left\| w^n + \alpha\, x\, y \right\|}

• Oja’s Rule:

\Delta w = w^{n+1} - w^n = \alpha\, y (x - y\, w) = \alpha (y\, x - y^2 w)

• Verify Oja’s Rule (does it do the right thing?):

E\{\Delta w\} = E\left\{ \alpha (x\, y - y^2 w) \right\} = \alpha\, E\left\{ x x^T w - w^T x x^T w\, w \right\} = \alpha \left( C w - (w^T C w)\, w \right)

After convergence, E\{\Delta w\} = 0, hence

C w = (w^T C w)\, w = \lambda w, thus w is an eigenvector.

w^T C w = w^T (w^T C w) w = (w^T C w)(w^T w), thus w^T w = 1.
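A small simulation of Oja's rule (my own toy data and learning rate), checking the two fixed-point properties derived above: w converges to the leading eigenvector of the correlation matrix and w^T w converges to 1:

import numpy as np

rng = np.random.default_rng(4)
# anisotropic 2-D data: standard deviations 3 and 0.5 along two rotated axes
A = np.diag([3.0, 0.5])
R = np.array([[np.cos(0.6), -np.sin(0.6)], [np.sin(0.6), np.cos(0.6)]])
X = rng.normal(size=(5000, 2)) @ A @ R.T

w = rng.normal(size=2)
alpha = 1e-3
for x in X:                           # one pass of online updates
    y = w @ x
    w += alpha * y * (x - y * w)      # Oja's rule: delta_w = alpha*(y*x - y^2*w)

C = X.T @ X / len(X)                  # correlation matrix of the (zero-mean) data
evals, evecs = np.linalg.eigh(C)
print(np.linalg.norm(w))              # ~1: the weight vector normalizes itself
print(np.abs(w @ evecs[:, -1]))       # ~1: aligned with the leading eigenvector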
Application Examples of Oja’s Rule
Finds the direction (line) which minimizes the sum of the total squared distance from each
point to its orthogonal projection onto the line.

Singular Value Decomposition (SVD):   X = U D V^T
PCA: Batch vs Stochastic
• PCA in Batch Update:
  • just subtract the mean from the data
  • calculate eigenvectors to obtain the principal components (Matlab function “eig”)
• PCA in Stochastic Update:
  • stochastically estimate the mean and subtract it from the input data
  • use Oja’s rule on the mean-subtracted data

Stochastic mean estimate:

\bar{x}^{\,n+1} = \frac{n\, \bar{x}^{\,n} + x}{n+1} = \bar{x}^{\,n} + \frac{1}{n+1} (x - \bar{x}^{\,n}) = \bar{x}^{\,n} + \alpha (x - \bar{x}^{\,n})
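A sketch contrasting the two recipes (NumPy in place of the Matlab “eig” call; toy data, learning rate, and names are my own): batch PCA via the eigendecomposition of the covariance of mean-subtracted data, and a stochastic running-mean estimate feeding Oja's rule:

import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(5000, 3)) @ np.diag([2.0, 1.0, 0.2]) + np.array([5.0, -1.0, 0.5])

# batch: subtract the mean, eigendecompose the covariance
Xc = X - X.mean(axis=0)
evals, evecs = np.linalg.eigh(Xc.T @ Xc / (len(X) - 1))
pc1_batch = evecs[:, -1]                      # leading principal component

# stochastic: running mean estimate + Oja's rule on mean-subtracted samples
x_bar = np.zeros(3)
w = rng.normal(size=3)
for n, x in enumerate(X, start=1):
    x_bar += (x - x_bar) / n                  # running mean: x_bar <- x_bar + (x - x_bar)/n
    xc = x - x_bar
    y = w @ xc
    w += 2e-3 * y * (xc - y * w)              # Oja's rule

print(np.abs(pc1_batch @ w / np.linalg.norm(w)))   # ~1: both find the same leading direction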
Autoencoder as motivation for Oja’s Rule
• Note that Oja’s rule looks like a supervised learning rule
  • the update looks like a reverse delta-rule: it depends on the difference
    between the actual input and the back-propagated output

\Delta w = w^{n+1} - w^n = \alpha\, y (x - y\, w)

[Figure: autoencoder with tied weights, x \to w \to y \to w \to \hat{x}]

Convert unsupervised learning into a supervised learning problem by trying
to reconstruct the inputs from a few features!
Learning in the Autoencoder
• Minimize the cost

J = \tfrac{1}{2} (x - \hat{x})^T (x - \hat{x}) = \tfrac{1}{2} (x - y\, w)^T (x - y\, w)

\frac{\partial J}{\partial w}
 = -\left( \frac{\partial y}{\partial w}\, w + y\, \frac{\partial w}{\partial w} \right)^T (x - y\, w)
 = -\left( \frac{\partial (w^T x)}{\partial w}\, w + y\, \frac{\partial w}{\partial w} \right)^T (x - y\, w)
 = -(x^T w + y)(x - y\, w)
 = -(y + y)(x - y\, w)
 = -2 y (x - y\, w)

\Delta w = -\alpha \frac{\partial J}{\partial w} = \alpha' y (x - y\, w)

This is Oja’s rule!
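The equivalence can be sanity-checked numerically: a finite-difference gradient of J(w) = ½||x − (wᵀx)w||² is compared with the Oja direction y(x − yw). The check below (my own, not from the slides) uses a unit-norm w, since the derivation above implicitly relies on wᵀw = 1; at unit norm the exact gradient reproduces the Oja direction, with any constant factor absorbed into α′:

import numpy as np

def J(w, x):
    # Autoencoder reconstruction cost with tied weights: 0.5*||x - (w.x)*w||^2
    y = w @ x
    return 0.5 * np.sum((x - y * w) ** 2)

rng = np.random.default_rng(6)
x = rng.normal(size=4)
w = rng.normal(size=4)
w /= np.linalg.norm(w)                # use a unit-norm weight vector

eps = 1e-6                            # central finite-difference gradient of J
grad = np.array([(J(w + eps * e, x) - J(w - eps * e, x)) / (2 * eps) for e in np.eye(4)])

y = w @ x
oja = y * (x - y * w)                 # Oja direction y*(x - y*w)
print(np.allclose(-grad, oja, atol=1e-5))   # True: -dJ/dw is Oja's update direction at ||w|| = 1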
PCA with More Than One Feature
• Oja’s Rule in Multiple Dimensions

y = W x, \quad \hat{x} = W^T y

\Delta W = \alpha\, y\, (x - W^T y)^T

• Sanger’s Rule: a clever trick added to Oja’s rule!

y = W x, \quad \hat{x} = W^T y

[\Delta W]_{r\text{-th row}} = \alpha \left( y_r \Big( x - \sum_{i=1}^{r} ([W]_{i\text{-th row}})^T y_i \Big) \right)^T
 = \alpha \left( y_r \big( x - ([W]_{1:r\text{-th rows}})^T [y]_{1:r} \big) \right)^T

Matlab notation:   W(r,:) = W(r,:) + alpha * y(r) * ( x - W(1:r,:)' * y(1:r) )'

This rule makes the rows of W become the eigenvectors of the data, ordered in
descending sequence according to the magnitude of the eigenvalues.
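A NumPy version of the Matlab one-liner above (my own toy data, learning rate, and iteration budget), checking that the rows of W converge to the top eigenvectors in descending eigenvalue order:

import numpy as np

rng = np.random.default_rng(7)
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))          # random orthonormal basis
X = rng.normal(size=(10000, 4)) * np.array([3.0, 2.0, 1.0, 0.5]) @ Q.T   # eigenvalues 9, 4, 1, 0.25

k, alpha = 2, 2e-3
W = 0.1 * rng.normal(size=(k, 4))                     # k x d weight matrix, y = W x
for x in X:
    y = W @ x
    for r in range(k):                                # Sanger's rule, row by row
        W[r] += alpha * y[r] * (x - W[:r + 1].T @ y[:r + 1])

C = X.T @ X / len(X)
evals, evecs = np.linalg.eigh(C)
print(np.abs(W[0] @ evecs[:, -1]))                    # ~1: row 1 = leading eigenvector
print(np.abs(W[1] @ evecs[:, -2]))                    # ~1: row 2 = second eigenvector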
Discussion about PCA
• If the data is noisy, we may represent the noise instead of the data
  • The way out: Factor Analysis (handles noise in the input dimensions as well)
• PCA has no data generating model
• PCA has no probabilistic interpretation (not quite true!)
• PCA ignores the possible influence of subsequent (e.g., supervised) learning steps
• PCA is a linear method
  • Way out: Nonlinear PCA
• PCA can converge very slowly
  • Way out: EM versions of PCA
• But PCA is a very reliable method for dimensionality reduction if it is appropriate!
PCA preprocessing for Supervised Learning
• In batch learning notation for a linear network:

U = \left[ \mathrm{eigenvectors}\!\left( \frac{ \sum_{n=1}^{N} (x^n - \bar{x})(x^n - \bar{x})^T }{ N - 1 } \right) \right]_{\max(1:k)}
  = \left[ \mathrm{eigenvectors}\!\left( \frac{ \tilde{X}^T \tilde{X} }{ N - 1 } \right) \right]_{\max(1:k)},   where \tilde{X} contains the mean-zero data

Subsequent linear regression for the network weights:

W = \left( U^T \tilde{X}^T \tilde{X}\, U \right)^{-1} U^T \tilde{X}^T Y

NOTE: Inversion of the above matrix is very cheap since it is diagonal!
No numerical problems!

Problem of this pre-processing:
Important regression data might have been clipped!
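A sketch of this preprocessing pipeline (NumPy; toy data and names are my own): project mean-zero inputs onto the top-k eigenvectors, then solve the normal equations for the network weights, whose system matrix is diagonal as noted above:

import numpy as np

def pca_then_regress(X, Y, k):
    # W = (U^T Xc^T Xc U)^(-1) U^T Xc^T Yc with U = top-k eigenvectors of the input covariance
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)
    evals, evecs = np.linalg.eigh(Xc.T @ Xc / (len(X) - 1))
    U = evecs[:, np.argsort(evals)[::-1][:k]]          # columns: top-k principal directions
    Z = Xc @ U                                          # k-dimensional projected inputs
    A = Z.T @ Z                                         # diagonal up to round-off: cheap to invert
    W = np.linalg.solve(A, Z.T @ Yc)
    return U, W

rng = np.random.default_rng(8)
X = rng.normal(size=(200, 10))
Y = X @ rng.normal(size=(10, 2)) + 0.05 * rng.normal(size=(200, 2))
U, W = pca_then_regress(X, Y, k=4)
print(W.shape)                                          # (4, 2): weights on the k retained directions

Because only the k highest-variance input directions are kept, input directions that matter for predicting Y but carry little variance are clipped away, which is exactly the problem noted above.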
Principal Component Regression (PCR)
Build the matrix X and the target vector t:

X = (\tilde{x}_1, \tilde{x}_2, \ldots, \tilde{x}_n)^T, \quad t = (t_1, t_2, \ldots, t_n)^T

Eigen decomposition:

X^T X = V D^2 V^T

Data projection (keep the k eigenvectors with the largest eigenvalues):

U = [\, v_1 \; v_2 \; \cdots \; v_k \,]

Compute the linear model:

\hat{w}^{\mathrm{pcr}} = (U^T X^T X\, U)^{-1} U^T X^T t
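A PCR sketch following the recipe above (NumPy; toy data and names are my own; here the k-dimensional ŵ_pcr is mapped back to input coordinates via U before predicting):

import numpy as np

rng = np.random.default_rng(9)
X = rng.normal(size=(150, 8)) @ np.diag([3, 2.5, 2, 1.5, 1, 0.1, 0.05, 0.01])
t = X @ rng.normal(size=8) + 0.1 * rng.normal(size=150)

# eigen decomposition X^T X = V D^2 V^T, obtained here from the SVD of X
_, d, Vt = np.linalg.svd(X, full_matrices=False)
for k in (2, 4, 8):
    U = Vt[:k].T                                                # top-k eigenvectors v_1 ... v_k
    w_red = np.linalg.solve(U.T @ X.T @ X @ U, U.T @ X.T @ t)   # k-dimensional PCR weights
    w_pcr = U @ w_red                                           # back to the original input space
    print(k, round(float(np.linalg.norm(t - X @ w_pcr)), 2))    # training residual shrinks as k grows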
PCA in Joint Data Space
• A straightforward extension to take the supervised learning step into account:
  • perform PCA in the joint data space
  • extract the linear weight parameters from the PCA results
PCA in Joint Data Space: Formalization
Z = \begin{bmatrix} x - \bar{x} \\ y - \bar{y} \end{bmatrix}

U = \left[ \mathrm{eigenvectors}\!\left( \frac{ \sum_{n=1}^{N} (z^n - \bar{z})(z^n - \bar{z})^T }{ N - 1 } \right) \right]_{\max(1:k)}
  = \left[ \mathrm{eigenvectors}\!\left( \frac{ Z^T Z }{ N - 1 } \right) \right]_{\max(1:k)}

The network weights become:

W = U_x \left( U_y^T - U_y^T \left( U_y U_y^T - I \right)^{-1} U_y U_y^T \right), \quad \text{where } U = \begin{bmatrix} U_x\ (= d \times k) \\ U_y\ (= c \times k) \end{bmatrix}

Note: this new kind of linear network can actually tolerate noise in the input data!
But only the same noise level in all (joint) dimensions!
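A numerical sketch of the joint-space construction (my own toy data; it relies on the weight formula above, which algebraically simplifies to W = U_x U_y^T (I − U_y U_y^T)^{-1}): on noise-free data whose joint dimensionality is exactly k, the prediction ŷ = Wᵀ(x − x̄) + ȳ recovers the targets:

import numpy as np

rng = np.random.default_rng(10)
d, c, k, N = 5, 2, 3, 500
A = rng.normal(size=(N, k))                           # latent coordinates
Bx, By = rng.normal(size=(k, d)), rng.normal(size=(k, c))
X, Y = A @ Bx, A @ By                                 # noise-free joint data of rank k

Z = np.column_stack([X - X.mean(axis=0), Y - Y.mean(axis=0)])
evals, evecs = np.linalg.eigh(Z.T @ Z / (N - 1))
U = evecs[:, np.argsort(evals)[::-1][:k]]             # top-k joint eigenvectors, (d+c) x k
Ux, Uy = U[:d], U[d:]                                 # U_x is d x k, U_y is c x k

W = Ux @ Uy.T @ np.linalg.inv(np.eye(c) - Uy @ Uy.T)  # joint-PCA regression weights, d x c
Y_hat = (X - X.mean(axis=0)) @ W + Y.mean(axis=0)
print(np.allclose(Y_hat, Y))                          # True on this noise-free low-rank example

With equal isotropic noise added to all joint dimensions, the same construction still estimates the underlying relation, in line with the note above about tolerating (equal) input noise.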