Radial Basis Function networks
for regression and classification
April 5, 2011
Outline
1. Radial Basis Function Networks
Introduction
Classification and Regression
Given a dataset (TRAINING SET) of input-target pairs:
D = { (x_i, t_i) }, i = 1, 2, ..., N
the learning task is to predict the corresponding t for each new input vector x ∉ D
1. classification: the t_i are discrete (class labels)
2. regression: the t_i are continuous values
The underlying relation between x and t is given by:
t_i = f(x_i) + ε_i
GOAL: approximate f with a parametric function y(x; w) (polynomial, feed-forward network, etc.)
Radial basis function exact interpolation
Consider a mapping from an input space X ⊆ R^D to a target space Y ⊆ R:
f(x) : x ∈ X ⊆ R^D → y ∈ Y ⊆ R
In general we do not know the function f(x)
We only have a set of input-target pairs, called the Training Set (TS):
x_1 → y_1
...
x_N → y_N
Radial basis function exact interpolation
The goal of exact interpolation is to find a function h(x) such that:
h(x_i) = t_i ,  i = 1, ..., N
The radial basis function approach introduces a set of N basis functions (one for each data point) of the form:
φ_n(x) = φ(||x − x_n||)
where φ(·) is some nonlinear function and ||x − x_n|| denotes the Euclidean distance between the input x and the point x_n
Form of the basis functions
Gaussian: φ(x) = exp(−x² / (2σ²))
φ(x) = (x² + σ²)^(−α), with α > 0
Thin-plate: φ(x) = x² ln(x)
...
We will consider the case of Gaussian basis functions:
φ_j(x) = exp(−||x − µ_j||² / (2σ_j²))
Radial basis function exact interpolation
The output of the mapping is a linear combination of the basis functions:
h(x) = Σ_{j=1}^{N} w_j φ_j(x)
Thus the interpolation condition can be expressed as:
Σ_{j=1}^{N} w_j Φ_ij = t_i ,   i = 1, 2, ..., N
Radial basis function exact interpolation
Thus the interpolation condition can be expressed as:
Σ_{j=1}^{N} w_j Φ_ij = t_i ,   i = 1, 2, ..., N
This condition can be rewritten in matrix form, Φ w^T = t:
| Φ_11 Φ_12 ... Φ_1N | | w_1 |   | t_1 |
| Φ_21 Φ_22 ... Φ_2N | | w_2 |   | t_2 |
| ...   ...  ...  ... | | ... | = | ... |
| Φ_N1 Φ_N2 ... Φ_NN | | w_N |   | t_N |
Φ is a matrix of dimension N × N with components Φ(i, j) = φ_j(x_i)
w and t are vectors of dimension N
Radial basis function exact interpolation
Φ w^T = t
If Φ is a non-singular matrix, the solution for the parameters can be found simply by:
w = Φ^{-1} t
MATLAB EXPERIMENTS
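The slides defer the demonstration to MATLAB; as a hedged stand-in, here is a minimal NumPy sketch of the exact-interpolation construction above (the 1-D data set and the width sigma are illustrative assumptions, not taken from the slides):

import numpy as np

# Illustrative 1-D training set (assumed): N = 10 noisy samples of a smooth function
rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 10)
t = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(10)
sigma = 0.1   # common Gaussian width (assumption)

# Phi(i, j) = phi_j(x_i) = exp(-||x_i - x_j||^2 / (2 sigma^2)), one basis per data point
Phi = np.exp(-((x[:, None] - x[None, :]) ** 2) / (2 * sigma ** 2))

# Exact interpolation: w = Phi^{-1} t (requires Phi to be non-singular)
w = np.linalg.solve(Phi, t)

# The interpolant h(x) = sum_j w_j phi_j(x) passes through every training point
x_grid = np.linspace(0.0, 1.0, 200)
h = np.exp(-((x_grid[:, None] - x[None, :]) ** 2) / (2 * sigma ** 2)) @ w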
Radial basis function exact interpolation
This method can be generalized to mappings into a multidimensional output space Y ⊆ R^C
In this case each input vector x^n must be mapped exactly onto an output vector t^n with components t_k^n, k = 1, ..., C
Thus the interpolation condition can be written:
Φ W^T = T
where W is a matrix of dimension C × N while T is a matrix of dimension N × C
To find the solution we must invert the matrix Φ and perform the matrix product Φ^{-1} T
Radial basis function networks
We make the following changes to the previous (exact interpolation) model:
1. The number M of basis functions φ_1(x), ..., φ_M(x) is much less than N (the number of data points);
2. The centers µ_j and the widths σ_j of the basis functions are determined during the training process;
3. The bias parameters are included in the linear sum:
y_k(x) = Σ_{j=1}^{M} w_kj φ_j(x) + w_k0
Radial basis function networks training
Two-stage training procedure:
1. Unsupervised training to determine the parameters of the basis functions;
2. With the basis function parameters fixed, determine the weights w_kj by supervised training
Determining the weights of the network
y_k(x) = Σ_{j=1}^{M} w_kj φ_j(x) + w_k0
We can absorb the bias parameters into the weights to give:
y_k(x) = Σ_{j=0}^{M} w_kj φ_j(x)
where the extra basis function φ_0(x) = 1
Determining the weights of the network
Consider a Training Set consisting of N data points and an RBF network with M basis functions (internal nodes)
We can construct the matrix Φ of dimension N × (M + 1) as follows:
| Φ_10 Φ_11 Φ_12 ... Φ_1M |
| Φ_20 Φ_21 Φ_22 ... Φ_2M |
| ...   ...   ...  ...  ... |
| Φ_N0 Φ_N1 Φ_N2 ... Φ_NM |
Determining the weights of the network
The expression for the output of the RBF network
y_k(x) = Σ_{j=0}^{M} w_kj φ_j(x)
can be expressed in matrix form as:
Y = Φ W^T
where Y, Φ and W are matrices of dimension N × C, N × (M + 1) and C × (M + 1) respectively (the extra column corresponds to the bias basis function φ_0)
Determining the weights of the network
Y = Φ W^T
We can find the parameters W by minimizing a suitable error function (e.g. sum-of-squares):
E = (1/2) Σ_{n=1}^{N} Σ_{k=1}^{C} { y_k(x^n) − t_k^n }²
The weight values are given by the solution of the linear equation:
Φ^T Φ W^T = Φ^T T
The solution is W^T = (Φ^T Φ)^{-1} Φ^T T
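A hedged NumPy sketch of this least-squares step: it builds Φ (including the bias column φ_0 = 1) from given centers and widths and solves for the weights; the data, the centers and the widths are placeholder assumptions:

import numpy as np

def design_matrix(X, centers, widths):
    # Phi[n, j+1] = exp(-||x^n - mu_j||^2 / (2 sigma_j^2)); column 0 is the bias phi_0 = 1
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)    # N x M squared distances
    Phi = np.exp(-d2 / (2.0 * widths[None, :] ** 2))
    return np.hstack([np.ones((X.shape[0], 1)), Phi])                # N x (M + 1)

# Placeholder data: N = 50 points in R^2, C = 1 target dimension
rng = np.random.default_rng(0)
X = rng.random((50, 2))
T = np.sin(2 * np.pi * X[:, :1])
centers = X[rng.choice(50, size=5, replace=False)]    # M = 5 centers taken from the data
widths = np.full(5, 0.3)

Phi = design_matrix(X, centers, widths)
# W^T = (Phi^T Phi)^{-1} Phi^T T, computed stably as a least-squares solve
W_T, *_ = np.linalg.lstsq(Phi, T, rcond=None)
Y = Phi @ W_T    # network outputs on the training inputs, N x C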
Determining the weights of the network
E = (1/2) Σ_{n=1}^{N} Σ_{k=1}^{C} { y_k(x^n) − t_k^n }² = (1/2) || Φ W^T − T ||²
Differentiating with respect to the weights:
∂E/∂W^T = Φ^T (Φ W^T − T) = Φ^T Φ W^T − Φ^T T
Setting the derivative to zero and solving for W we obtain:
W^T = (Φ^T Φ)^{-1} Φ^T T
Determining the weights of the network
It can be shown that the solution for W can also be found using the Singular Value Decomposition (SVD):
1. Decompose the matrix Φ = U Σ V^T
2. Compute V Σ^{-1} U^T
3. W^T = V Σ^{-1} U^T T
where U and V are orthogonal matrices of dimension N × N and M × M respectively, and Σ is a matrix of dimension N × M with the singular values on the diagonal
Note that Σ^{-1} denotes the M × N matrix constructed as follows:
Σ^{-1}(i, i) = 1 / Σ(i, i)   if Σ(i, i) > 0
Σ^{-1}(i, i) = 0             otherwise
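A hedged NumPy sketch of this pseudo-inverse computation (Phi and T are assumed to come from the previous construction):

import numpy as np

def svd_weights(Phi, T, tol=1e-12):
    # Solve Phi @ W^T = T via the SVD-based pseudo-inverse described above
    U, s, Vt = np.linalg.svd(Phi, full_matrices=False)   # Phi = U diag(s) V^T
    s_inv = np.where(s > tol, 1.0 / s, 0.0)              # invert only the non-zero singular values
    return Vt.T @ (s_inv[:, None] * (U.T @ T))           # W^T = V Sigma^{-1} U^T T

# Equivalent in a single call: W_T = np.linalg.pinv(Phi) @ T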
Determining the parameters of the basis functions
We consider the case of Gaussian basis functions:
φ_j(x) = exp(−||x − µ_j||² / (2σ_j²))
Thus we have to determine the parameters µ_j and σ_j for each basis function
Determining centers and widths of the basis functions
Different approaches exist:
Subset of data points
Orthogonal least squares
Clustering algorithms
Gaussian mixture models
Clustering algorithms for selecting the parameters of the basis functions
If we use a clustering algorithm for which the number of clusters is predefined (e.g. K-means):
1. Set the number of clusters to M and run the clustering algorithm
2. Set the centers of the basis functions equal to the centers of the clusters
3. Set the widths (variances) of the basis functions equal to the variances of the clusters
(a small sketch of this procedure is given below)
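A hedged sketch of steps 1-3 using scikit-learn's KMeans; the data set X and the choice of M are placeholders:

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.random((200, 2))      # placeholder training inputs
M = 5                         # number of basis functions = number of clusters (assumption)

km = KMeans(n_clusters=M, n_init=10, random_state=0).fit(X)
centers = km.cluster_centers_                     # mu_j: one center per basis function

# sigma_j: spread of the points assigned to cluster j (spherical approximation)
widths = np.array([
    np.sqrt(((X[km.labels_ == j] - centers[j]) ** 2).sum(axis=1).mean())
    for j in range(M)
])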
How do we select the model complexity?
Choosing a very simple model may give rise to poor results! (e.g. M = 1)
Choosing a very complex model may give rise to over-fitting and thus poor generalization performance!
One technique that is often used to control over-fitting is to still use a complex model but to add a penalty term to the error function, in order to discourage the coefficients from reaching large values (regularization):
E = (1/2) Σ_{n=1}^{N} Σ_{k=1}^{C} { y_k(x^n) − t_k^n }² + (λ/2) ||w||²
where λ is called the regularization coefficient
The solution is W^T = (Φ^T Φ + λI)^{-1} Φ^T T
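A minimal NumPy sketch of this regularized solution (the design matrix Phi and the targets T are assumed to be built as before; λ is an illustrative value):

import numpy as np

def ridge_weights(Phi, T, lam=1e-3):
    # W^T = (Phi^T Phi + lambda I)^{-1} Phi^T T  (regularized sum-of-squares solution)
    A = Phi.T @ Phi + lam * np.eye(Phi.shape[1])
    return np.linalg.solve(A, Phi.T @ T)

# Usage: W_T = ridge_weights(Phi, T, lam=0.01); Y = Phi @ W_T

Larger λ shrinks the weights toward zero; λ = 0 recovers the unregularized solution.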
Determining the weights of the network (regularized case)
It can be shown that the regularized solution for W can also be computed using the Singular Value Decomposition (SVD):
1. Decompose the matrix Φ = U Σ V^T
2. Compute V Σ^{-1} U^T
3. W^T = V Σ^{-1} U^T T
where U and V are orthogonal matrices of dimension N × N and M × M respectively, and Σ is a matrix of dimension N × M with the singular values on the diagonal
Note that Σ^{-1} now denotes the M × N matrix constructed as follows (the regularization coefficient λ damps the inversion of small singular values):
Σ^{-1}(i, i) = 1 / (Σ(i, i) + λ)   if Σ(i, i) > 0
Σ^{-1}(i, i) = 0                   otherwise
MATLAB EXPERIMENTS
Supervised Learning
The use of unsupervised techniques to determine the basis function parameters is not, in general, an optimal procedure as far as the subsequent supervised training is concerned
Indeed, with unsupervised techniques the setting up of the basis functions takes no account of the target labels
In order to obtain the best results we should include the target data in the training procedure, that is, we should perform supervised training
Supervised Learning
y_k(x) = Σ_{j=1}^{M} w_kj exp(−||x − µ_j||² / (2σ_j²)) + w_k0
To find the parameters µ_j and σ_j we should minimize the error function (e.g. sum-of-squares) with respect to these parameters
This can be done by differentiating the error function with respect to µ_j and σ_j and using these derivatives in the gradient descent optimization algorithm.
Supervised Learning
∂E/∂σ_j = Σ_n Σ_k { y_k(x^n) − t_k^n } w_kj exp(−||x^n − µ_j||² / (2σ_j²)) · ||x^n − µ_j||² / σ_j³
∂E/∂µ_ji = Σ_n Σ_k { y_k(x^n) − t_k^n } w_kj exp(−||x^n − µ_j||² / (2σ_j²)) · (x_i^n − µ_ji) / σ_j²
Gradient descent
Consider a differentiable error function E depending on parameters Θ = (θ_1, ..., θ_S)
1. We begin with some initial guess for Θ (e.g. random)
2. We update the parameters by moving a small distance in Θ-space in the direction in which E decreases most rapidly (−∇_Θ E):
θ_j^(τ+1) = θ_j^(τ) − η ∂E/∂θ_j |_{Θ=Θ^(τ)}
where η is called the learning rate and is usually taken in the range [0, 1]
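A hedged NumPy sketch of one gradient-descent update of the centers and widths, implementing the two derivatives given above (array shapes, the bias handling and the learning rate η are assumptions):

import numpy as np

def gradient_step(X, T, centers, widths, W, b, eta=0.01):
    # One gradient-descent update of mu_j and sigma_j for a Gaussian RBF network.
    # X: N x D inputs, T: N x C targets, centers: M x D, widths: (M,), W: C x M, b: (C,)
    diff = X[:, None, :] - centers[None, :, :]                 # x^n - mu_j, shape N x M x D
    d2 = (diff ** 2).sum(axis=2)                               # ||x^n - mu_j||^2, shape N x M
    Phi = np.exp(-d2 / (2.0 * widths ** 2))                    # phi_j(x^n), shape N x M
    R = Phi @ W.T + b - T                                      # residuals y_k(x^n) - t_k^n, N x C
    G = (R @ W) * Phi                                          # sum_k {y_k - t_k} w_kj phi_j, N x M

    grad_sigma = (G * d2 / widths ** 3).sum(axis=0)            # dE/dsigma_j, shape (M,)
    grad_mu = (G[:, :, None] * diff / widths[None, :, None] ** 2).sum(axis=0)   # dE/dmu_j, M x D

    return centers - eta * grad_mu, widths - eta * grad_sigma

In practice the weights w_kj and the bias would also be updated (or re-solved in closed form) between such steps.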
Model complexity and PCA
April 5, 2011
Outline
1. A toy example of regression
2. Model complexity
A simple regression problem
We introduce some key concepts by means of a toy regression problem
In this toy example we know the regression function u(x) = sin(2πx)
The Training Set comprises 10 input data points {x_i}, i = 1, ..., 10, spaced uniformly in the range [0, 1], with corresponding target values t_i = u(x_i) + ε_i, where ε_i is small random noise (drawn from a Gaussian distribution)
[Figure: the 10 noisy training points on [0, 1]]
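A small NumPy sketch that generates this kind of toy data set (the noise level 0.2 is an assumption; the slides only say the noise is small and Gaussian):

import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0.0, 1.0, 10)                                # 10 inputs spaced uniformly in [0, 1]
t = np.sin(2 * np.pi * x) + 0.2 * rng.standard_normal(10)    # targets t_i = u(x_i) + eps_i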
A simple regression problem
To approximate the “unknown” function u(x) by means of the function y(x, w):
choose the function model (linear, polynomial, neural network, ...)
determine the parameters w of the model by the learning algorithm (this step usually involves):
choosing an error function: for a given value of w̃, the error function measures the misfit between the function y(x, w̃) and the training data;
Learning Algorithm: the procedure that selects the parameters w minimizing the error function.
A first model: Polynomial
Polynomial Model:
y(x, w) = w_0 + w_1 x + w_2 x² + ... + w_M x^M = Σ_{j=0}^{M} w_j x^j ,  with M ∈ N
Sum-of-squares Error Function:
E(w) = (1/2) Σ_{n=1}^{N} { y(x_n, w) − t_n }²
[Figure: the error measures the vertical displacement t_n − y(x_n, w) at each training point x_n]
Thus we have to select the value of M and then determine the values of the parameters w (see the sketch below).
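A hedged NumPy sketch of fitting this polynomial model to the toy data by minimizing the sum-of-squares error (M is an illustrative choice; x and t are the arrays generated above):

import numpy as np

def fit_polynomial(x, t, M):
    # Design matrix with columns 1, x, x^2, ..., x^M, then least-squares minimization of E(w)
    A = np.vander(x, M + 1, increasing=True)
    w, *_ = np.linalg.lstsq(A, t, rcond=None)
    return w

def predict(x, w):
    return np.vander(x, len(w), increasing=True) @ w

# Example: w3 = fit_polynomial(x, t, M=3); y_grid = predict(np.linspace(0, 1, 200), w3)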
A second model: Neural Network
In the case of a single input and output and an identity output activation we can write:
y(x, w) = Σ_{j=0}^{M} w_j^(2) φ_j(x) ,  with M ∈ N
φ_j(x) = g(w_j^(1) x)
where g(·) is some nonlinear function (e.g. a sigmoid, or a radial basis function as in RBF networks)
Thus we have to select the value of M and then determine the values of the parameters w.
Outline
1. A toy example of regression
2. Model complexity
Determining the parameters w
Having chosen the function model (polynomial, neural network, and so on) and the related value of M, several techniques exist to determine the parameter values w:
Maximum Likelihood;
Bayesian approach;
...
We denote by w∗ the values of the parameters that minimize the error function for a given model (in our case identified by the value of M)
What happens when we change the model complexity?
[Figure: fits of the model to the toy data for several values of M]
The Over-fitting Problem
[Figure: Training and Test error as a function of M]
Measuring the generalization performance on the Test Set:
Root Mean Square error function E_RMS = sqrt(2 E(w∗) / N)
We choose the M that gives the best generalization performance (that is, the minimum error on the test set), as in the sketch below.
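A hedged sketch of this model-selection loop: fit each M on the training set, then compare the root-mean-square error on an independent test set drawn from the same u(x) (the helper functions are the ones sketched earlier; the test set is an assumption):

import numpy as np

def rms_error(w, x, t):
    E = 0.5 * np.sum((predict(x, w) - t) ** 2)     # sum-of-squares error E(w*)
    return np.sqrt(2.0 * E / len(x))               # E_RMS = sqrt(2 E(w*) / N)

rng = np.random.default_rng(2)
x_test = rng.random(100)
t_test = np.sin(2 * np.pi * x_test) + 0.2 * rng.standard_normal(100)

for M in range(10):
    w = fit_polynomial(x, t, M)
    print(M, rms_error(w, x, t), rms_error(w, x_test, t_test))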
What happens to w∗ when we change the model complexity?
        M=0     M=1      M=3        M=9
w0∗     0.19    0.28     0.31       0.35
w1∗             −1.27    7.99       232.37
w2∗                      −25.43     −5321.83
w3∗                      17.37      45868.31
w4∗                                 −231639.30
w5∗                                 640042.26
w6∗                                 −1061800.52
w7∗                                 1042400.18
w8∗                                 −557682.99
w9∗                                 125201.43
How do we select the model complexity?
Choosing a very simple model may give rise to poor results! (e.g. with M = 1 we are using a linear model)
Choosing a very complex model may give rise to over-fitting and thus poor generalization performance!
One technique that is often used to control over-fitting is to still use a complex model but to add a penalty term to the error function, in order to discourage the coefficients from reaching large values (regularization):
Ẽ(w) = (1/2) Σ_{n=1}^{N} { y(x^n; w) − t^n }² + (λ/2) ||w||²
where λ is called the regularization coefficient
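A hedged sketch of the regularized fit: the same polynomial design matrix as before, now solved with the penalty term (the value of λ is illustrative):

import numpy as np

def fit_polynomial_regularized(x, t, M, lam):
    # Minimizes E~(w): w = (A^T A + lambda I)^{-1} A^T t, with A the polynomial design matrix
    A = np.vander(x, M + 1, increasing=True)
    return np.linalg.solve(A.T @ A + lam * np.eye(M + 1), A.T @ t)

# Example: even with M = 9, a small lambda keeps the coefficients from exploding
# w9 = fit_polynomial_regularized(x, t, M=9, lam=1e-4)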
Using Regularization
[Figure: polynomial fits to the toy data obtained with regularization]
MATLAB EXPERIMENTS
Dimensionality Reduction: 2D example
[Figure: Data — the original two-dimensional data set in the (x1, x2) plane]
Dimensionality Reduction: 2D example
[Figure: Data — the data set together with the two principal directions V1 and V2]
Dimensionality Reduction: 2D example
[Figure: Projected data onto V1, shown along the one-dimensional coordinate u1]
Dimensionality Reduction: 2D example
[Figure: Reconstructed Data, mapped back into the (x1, x2) plane]
Dimensionality Reduction: 2D example
[Figure: original Data and Reconstructed Data, side by side in the (x1, x2) plane]
Reconstruction error: 0.18 (RMS error)
Principal Component Analysis
Consider a data set of N observations {x^n}, n = 1, ..., N, where x^n ∈ R^d
Objective: project the data onto a k < d dimensional space while maximizing the variance of the projected data
Take k = 1 and a unit vector u ∈ R^d:
u^T x^n is the projection of the n-th point
x̄ = (1/N) Σ_{n=1}^{N} x^n is the sample mean of the data
u^T x̄ is the projected sample mean
Principal Component Analysis
Variance of the projected data:
(1/N) Σ_{n=1}^{N} (u^T x^n − u^T x̄)² = u^T S u
where S is:
S = (1/N) Σ_{n=1}^{N} (x^n − x̄)(x^n − x̄)^T
Optimization problem:
arg max_u u^T S u   s.t. ||u||² = 1
Principal Component Analysis
Using a Lagrange multiplier, maximize:
u^T S u + λ(1 − u^T u)
Setting the derivative with respect to u equal to zero gives an eigenvalue problem:
S u = λ u
Multiplying both sides by u^T (note that u^T u = 1):
u^T S u = λ
So the variance is maximized when we choose u equal to the eigenvector of S having the largest eigenvalue λ
PCA Algorithm
function PCA(X, k)
    Xmean ← computeMean(X)
    Xc ← X − Xmean                      ▷ Centering data
    CovX ← Xc^T Xc                      ▷ Computing covariance matrix
    [U, Lambda] ← diagonalize(CovX)     ▷ Eigenvectors, eigenvalues
    U ← sortDescending(U, Lambda)       ▷ Sorting eigenvectors
    Uk ← getFirstKcomponents(U, k)      ▷ First k eigenvectors
    Y ← Xc Uk                           ▷ Projecting data
    return Y
end function
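A runnable NumPy version of this pseudocode, written as a hedged sketch (the covariance is computed as Xc^T Xc exactly as in the algorithm above; a 1/N factor would only rescale the eigenvalues):

import numpy as np

def pca(X, k):
    X_mean = X.mean(axis=0)
    Xc = X - X_mean                            # centering data
    cov = Xc.T @ Xc                            # (unnormalized) covariance matrix
    eigvals, U = np.linalg.eigh(cov)           # eigenvalues ascending, eigenvectors in columns
    order = np.argsort(eigvals)[::-1]          # sort eigenvectors by decreasing eigenvalue
    Uk = U[:, order[:k]]                       # first k eigenvectors
    Y = Xc @ Uk                                # projecting data
    return Y, Uk, X_mean

# Reconstruction back in the original space (as in the 2D example): X_rec = Y @ Uk.T + X_mean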