Radial-Basis Function Networks
CS/CMPE 333 - Neural Networks (Sp 2002/2003) - Asim Karim @ LUMS
Introduction
- Typical tasks performed by neural networks are association, classification, and filtering. This categorization has historical significance as well.
- These tasks involve input-output mappings, and the network is designed to learn a mapping from knowledge of the problem environment.
- Thus, the design of a neural network can be viewed as a curve-fitting or function approximation problem.
  - This viewpoint is the motivation for radial-basis function networks.
Radial-Basis Function Networks
- RBF networks are two-layer networks: an input layer of source nodes, hidden neurons with (nonlinear) basis functions, and output neurons with linear/nonlinear activation functions.
- The theory of radial-basis function networks is built upon function approximation theory in mathematics.
- RBF networks were first used in 1988. Major work was done by Moody and Darken (1989) and Poggio and Girosi (1990).
- In RBF networks, the mapping from the input space to the high-dimensional hidden space is nonlinear, while that from the hidden space to the output space is linear.
- What is the basis for this?
Radial-Basis Function Network
[Figure: network diagram with source nodes x1 … xp, hidden neurons with RBF activation functions Φ1(.) … ΦM(.), weights w on the hidden-to-output connections, and output neurons y1, y2]
Cover’s Theorem (1)
- Cover’s theorem (1965) gives the motivation for RBF networks.
- Cover’s theorem on the separability of patterns:
  - A complex pattern-classification problem cast nonlinearly into a high-dimensional space is more likely to be linearly separable than in a low-dimensional space.
Cover’s Theorem (2)
- Consider a set X of N p-dimensional vectors (input patterns) x1 to xN. Let X+ and X- be a binary partition of X, and φ(x) = [φ1(x), φ2(x), …, φM(x)]^T.
- Cover’s theorem:
  - A binary partition (dichotomy) [X+, X-] of X is said to be φ-separable if there exists an M-dimensional vector w such that
    w^T φ(x) ≥ 0 when x belongs to X+
    w^T φ(x) < 0 when x belongs to X-
  - Decision boundary or surface: w^T φ(x) = 0
Cover’s Theorem (3)
Example (1)
- Consider the XOR problem to illustrate the significance of φ-separability and Cover’s theorem.
- Define a pair of Gaussian hidden functions
  φ1(x) = exp(-||x - t1||^2),  t1 = [1, 1]^T
  φ2(x) = exp(-||x - t2||^2),  t2 = [0, 0]^T
- The outputs of these functions for each input pattern are computed in the sketch below; in the (φ1, φ2) plane the four XOR patterns become linearly separable.
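A minimal Python sketch (not from the original slides) that evaluates the two Gaussian hidden functions above on the four XOR patterns:

```python
import numpy as np

t1 = np.array([1.0, 1.0])            # center of phi_1
t2 = np.array([0.0, 0.0])            # center of phi_2

def phi(x, t):
    """Gaussian hidden function exp(-||x - t||^2)."""
    return np.exp(-np.sum((np.asarray(x, dtype=float) - t) ** 2))

patterns = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]                # XOR truth table

for x, d in zip(patterns, labels):
    print(x, d, round(phi(x, t1), 4), round(phi(x, t2), 4))
# The four patterns map to (0.1353, 1.0), (0.3679, 0.3679), (0.3679, 0.3679),
# (1.0, 0.1353) in the (phi_1, phi_2) plane, where the two XOR classes can be
# separated by a straight line.
```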

Example (2)
Function Approximation (1)
- Function approximation seeks to describe the behavior of complex functions by ensembles of simpler functions.
  - Describe f(x) by F(x).
- F(x) can be described in a compact region of the input space by
  F(x) = Σ_{i=1}^{N} w_i φ_i(x)
  such that
  |f(x) - F(x)| < ε
  where ε can be made arbitrarily small.
- Choice of the function φ(.)?
Function Approximation (2)
- Find F(x) that “best” approximates the map/function f. The best approximation is problem dependent; it can be strict interpolation or good generalization (regularized interpolation).
- Design decisions:
  - Choice of elementary functions φ(.)
  - How to compute the weights w?
  - How many elementary functions to use (i.e., what should N be)?
  - How to obtain good generalization?
Choice of Elementary Functions φ
- Let f(x) belong to the function space L2(R^p) (true for almost all physical systems).
- We want φ to be a basis of L2.
- What is meant by a basis?
  - A set of functions φ_i (i = 1, …, M) is a basis of L2 if linear superpositions of the φ_i can generate any function in L2. Moreover, they must be linearly independent:
    w1 φ1 + w2 φ2 + … + wM φM = 0 iff wi = 0 for all i
- Demos from the Neural and Adaptive Systems book.
Interpolation Problem (1)
- In general, the map from an input space to an output space is given by
  f: R^p -> R^q
  where p and q are the input and output space dimensions, and f is the map or hypersurface.
- Strict interpolation problem:
  - Given a set of N different points x_i (i = 1, …, N) and a corresponding set of N real numbers d_i (i = 1, …, N), find a function F: R^p -> R^1 that satisfies the interpolation condition
    F(x_i) = d_i,  i = 1, …, N
  - The function F passes through all the points.
Interpolation Problem (2)
- A common choice of φ(.) is a radially symmetric basis function:
  F(x) = Σ_{i=1}^{N} w_i φ(||x - x_i||)
- Substituting the interpolation conditions and writing in matrix form:
  Φw = d
  - Φ = {φ_ji} (i, j = 1, …, N) = interpolation matrix, where φ_ji = φ(||x_j - x_i||)
  - w = linear weight vector; d = desired response vector
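A minimal NumPy sketch of this construction, assuming a Gaussian basis function (the slide leaves φ unspecified) and a small toy data set:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(10, 2))          # N = 10 training inputs in R^2
d = np.sin(X[:, 0]) + np.cos(X[:, 1])         # desired responses (toy target)
sigma = 0.5                                   # assumed RBF width

# Interpolation matrix: phi_ji = phi(||x_j - x_i||) with a Gaussian phi
R = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
Phi = np.exp(-R**2 / (2 * sigma**2))

w = np.linalg.solve(Phi, d)                   # solves Phi w = d exactly
print(np.max(np.abs(Phi @ w - d)))            # ~0: F(x_i) = d_i at every point
```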
Interpolation Problem (3)
- Φ is known to be positive definite for a certain class of radial-basis functions. Thus,
  w = Φ^{-1} d
- In theory, w can be computed. In practice, however, Φ is often close to singular.
  - Then what?
  - Regularization theory can be used to perturb Φ and make it nonsingular.
- But there is another problem: poor generalization, or overfitting.
Ill-Posed Problems
- Supervised learning is an ill-posed problem.
  - There is not enough information in the training data to reconstruct the input-output mapping uniquely.
  - The presence of noise or imprecision in the input data adds uncertainty to the reconstruction of the input-output mapping.
- To achieve good generalization, additional information about the domain is needed.
  - In other words, the input-output patterns should exhibit redundancy.
  - Redundancy is achieved when the physical generator of the data is smooth, and can thus be used to generate redundant input-output examples.
Regularization Theory (1)
- How can an ill-posed problem be made well-posed?
- By constraining the mapping with additional information (e.g., smoothness) in the form of a nonnegative functional.
- Regularization was proposed by Tikhonov in 1963 in the context of function approximation in mathematics.
Regularization Theory (2)
- Input-output examples: x_i, d_i (i = 1, …, N)
- Find the mapping F(x): R^p -> R^1 for the input-output examples.
- In regularization theory, F is found by minimizing the cost functional ξ(F):
  ξ(F) = ξ_s(F) + λ ξ_c(F)
- Standard error term:
  ξ_s(F) = (1/2) Σ_{i=1}^{N} (d_i - y_i)^2 = (1/2) Σ_{i=1}^{N} (d_i - F(x_i))^2
- Regularization term:
  ξ_c(F) = (1/2) ||P F(x)||^2
  where P is a linear differential operator.
Regularization Theory (3)
- The regularization term depends on the geometric properties of the approximating function.
  - The selection of the operator P is therefore problem dependent, based on prior knowledge of the geometric properties of the actual function f(x) (e.g., the smoothness of f(x)).
- Regularization parameter λ: a positive real number.
  - This parameter indicates the sufficiency of the given input-output examples in capturing the underlying function f(x).
- The solution of the regularization problem is a function F(x).
  - We won’t go into the details of how to find F, as it requires a good understanding of functional analysis.
Regularization Theory (4)
- The solution of the regularization problem yields
  F(x) = (1/λ) Σ_{i=1}^{N} [d_i - F(x_i)] G(x, x_i) = Σ_{i=1}^{N} w_i G(x, x_i)
  - G(x, x_i) = Green’s function centered on x_i
- In matrix form, F = Gw; combining this with w_i = (1/λ)[d_i - F(x_i)] gives
  (G + λI)w = d
  and hence
  w = (G + λI)^{-1} d
- G depends only on the operator P.
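A minimal sketch of this solution, assuming the Gaussian Green’s function derived on the next slide and illustrative choices of λ and σ:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(20, 2))              # N training inputs
d = np.sin(np.pi * X[:, 0]) * X[:, 1]             # toy desired responses
lam, sigma = 0.1, 0.5                             # illustrative lambda and width

# G_ij = G(x_i, x_j), here a Gaussian Green's function
R = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
G = np.exp(-R**2 / (2 * sigma**2))

w = np.linalg.solve(G + lam * np.eye(len(X)), d)  # w = (G + lambda I)^{-1} d

def F(x):
    """Regularized solution F(x) = sum_i w_i G(x, x_i)."""
    g = np.exp(-np.linalg.norm(x - X, axis=1)**2 / (2 * sigma**2))
    return g @ w

print(F(X[0]), d[0])   # close to, but no longer exactly equal to, the target
```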
Type of Function G(x; xi)
- If P is translationally invariant, then G(x; x_i) depends only on the difference of x and x_i, i.e.,
  G(x; x_i) = G(x - x_i)
- If P is both translationally and rotationally invariant, then G(x; x_i) depends only on the Euclidean norm of the difference vector x - x_i, i.e.,
  G(x; x_i) = G(||x - x_i||)
  - This is a radial-basis function.
- If P is further constrained, and G(x; x_i) is positive definite, then we have the Gaussian radial-basis function, i.e.,
  G(x; x_i) = exp(-(1/(2σ^2)) ||x - x_i||^2)
Regularization Network (1)
Regularization Network (2)
- The regularization network is based on the regularized interpolation problem
  F(x) = Σ_{i=1}^{N} w_i G(x, x_i)
- It has 3 layers:
  - An input layer of p source nodes, where p is the dimension of the input vector x (or the number of independent variables)
  - A hidden layer with N neurons, where N is the number of input-output examples; each neuron uses the activation function G(x; x_i)
  - An output layer with q neurons, where q is the output dimension
- The unknowns are only the weights w from the hidden layer to the output layer.
RBF Networks (in Practice) (1)
- The regularization network requires N hidden neurons, which becomes computationally expensive for large N.
  - The complexity of the network is reduced to obtain an approximate solution to the regularization problem.
- The approximate solution F*(x) is then given by
  F*(x) = Σ_{i=1}^{M} w_i φ_i(x)
  - φ_i(x) (i = 1, …, M) = new set of basis functions; M is typically less than N
- Using radial-basis functions,
  F*(x) = Σ_{i=1}^{M} w_i φ_i(||x - t_i||)
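A minimal sketch of the reduced network, assuming Gaussian basis functions and centers taken (for illustration only) as a random subset of the training inputs; with M < N the design matrix is rectangular, so the weights are obtained by least squares:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.uniform(-1, 1, size=(100, 2))               # N = 100 examples
d = np.sin(np.pi * X[:, 0]) + 0.1 * rng.normal(size=100)
M, sigma = 10, 0.5                                  # M << N centers, assumed width

T = X[rng.choice(len(X), size=M, replace=False)]    # centers t_i (a data subset here)

# N x M design matrix: Phi_ij = phi(||x_i - t_j||)
Phi = np.exp(-np.linalg.norm(X[:, None, :] - T[None, :, :], axis=-1)**2
             / (2 * sigma**2))

w, *_ = np.linalg.lstsq(Phi, d, rcond=None)         # minimizes ||Phi w - d||^2
print(np.mean((Phi @ w - d)**2))                    # training mean-squared error
```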
RBF Networks (2)
RBF Networks (3)
- Unknowns in the RBF network:
  - M, the number of hidden neurons (M < N)
  - The centers t_i of the radial-basis functions
  - The weights w
How to Train RBF Networks - Learning
- Normally the training of the hidden-layer parameters (number of hidden neurons, centers, and variance of the Gaussians) is done prior to the training of the weights (i.e., on a different ‘time scale’).
  - This is justified by the fact that the hidden layer performs a different (nonlinear) task than the output-layer weights (linear).
- The weights are learned by supervised learning using an appropriate algorithm (LMS or BP).
- The hidden-layer parameters are learned by (in general, but not always) unsupervised learning.
Fixed Centers Selected at Random
- Randomly select M inputs as centers for the activation functions.
- Fix the variance of the Gaussians based on the distance between the selected centers. A radial-basis function centered at t_i is then given by
  φ(||x - t_i||) = exp(-(M/d^2) ||x - t_i||^2)
  where d = maximum distance between the chosen centers.
  - The ‘width’ or standard deviation of the functions is fixed, given by σ = d/√(2M).
- The linear weights are then computed by solving the regularization problem or by using supervised learning.
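A minimal sketch of this recipe, assuming Gaussian basis functions and a pseudoinverse solution for the linear weights (supervised learning such as LMS would also work):

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, size=(200, 2))                  # training inputs
d = np.sign(X[:, 0] * X[:, 1])                         # toy +/-1 targets
M = 12

T = X[rng.choice(len(X), size=M, replace=False)]       # M centers chosen at random
d_max = np.max(np.linalg.norm(T[:, None] - T[None, :], axis=-1))
sigma = d_max / np.sqrt(2 * M)                         # sigma = d / sqrt(2M)

# phi(||x - t_i||) = exp(-||x - t_i||^2 / (2 sigma^2)) = exp(-(M/d^2)||x - t_i||^2)
Phi = np.exp(-np.linalg.norm(X[:, None] - T[None, :], axis=-1)**2
             / (2 * sigma**2))

w = np.linalg.pinv(Phi) @ d                            # linear weights (pseudoinverse)
print(np.mean(np.sign(Phi @ w) == d))                  # training accuracy
```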
Self-Organized Selection of Centers
- Use a self-organizing or clustering technique to determine the number and centers of the Gaussian functions.
  - A common choice is the k-means clustering algorithm, which partitions the input vectors into k clusters and places a center at the mean (centroid) of each cluster.
- Then compute the weights using a supervised error-correction learning rule such as LMS.
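A minimal sketch, assuming a plain k-means loop for the centers and the LMS rule for the weights; the width σ, cluster count k, and learning rate η are illustrative choices, not values from the slides:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.uniform(-1, 1, size=(200, 2))                  # training inputs
d = (X[:, 0]**2 + X[:, 1]**2 < 0.5).astype(float)      # toy 0/1 targets
k, sigma, eta = 8, 0.4, 0.05                           # illustrative hyperparameters

# k-means: alternate nearest-center assignment and centroid update
centers = X[rng.choice(len(X), size=k, replace=False)].copy()
for _ in range(20):
    labels = np.argmin(np.linalg.norm(X[:, None] - centers[None, :], axis=-1), axis=1)
    for j in range(k):
        if np.any(labels == j):
            centers[j] = X[labels == j].mean(axis=0)

# LMS: error-correction updates of the linear output weights
Phi = np.exp(-np.linalg.norm(X[:, None] - centers[None, :], axis=-1)**2 / (2 * sigma**2))
w = np.zeros(k)
for _ in range(100):                                   # epochs over the training set
    for phi_n, d_n in zip(Phi, d):
        w += eta * (d_n - phi_n @ w) * phi_n           # LMS weight update
print(np.mean((Phi @ w > 0.5) == (d > 0.5)))           # training accuracy
```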
Supervised Selection of Centers
- All unknown parameters are trained using error-correcting supervised learning.
- A gradient-descent approach is used to find the minimum of the cost function with respect to the weights w_i, the activation-function centers t_i, and the spread σ of the centers.
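A minimal sketch of this fully supervised scheme, assuming a shared spread σ, a mean-squared-error cost, and illustrative learning rates; the gradient expressions follow from differentiating the Gaussian activations:

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.uniform(-1, 1, size=(100, 1))        # N one-dimensional inputs
d = np.sin(2 * np.pi * X[:, 0])              # toy targets
M, eta, N = 6, 0.05, len(X)

t = rng.uniform(-1, 1, size=(M, 1))          # centers t_i
w = np.zeros(M)                              # linear weights w_i
sigma = 0.5                                  # shared spread sigma

for _ in range(2000):
    diff = X[:, None, :] - t[None, :, :]     # differences x_n - t_j, shape (N, M, p)
    sq = np.sum(diff**2, axis=-1)            # squared distances, shape (N, M)
    Phi = np.exp(-sq / (2 * sigma**2))       # Gaussian activations
    e = d - Phi @ w                          # output errors, shape (N,)

    # gradients of E = (1/2N) sum_n e_n^2 with respect to w, t, sigma
    grad_w = -(Phi.T @ e) / N
    grad_t = -(e[:, None, None] * w[None, :, None] * Phi[:, :, None]
               * diff).sum(axis=0) / (sigma**2 * N)
    grad_sigma = -np.sum(e[:, None] * w[None, :] * Phi * sq) / (sigma**3 * N)

    w -= eta * grad_w                        # update all three parameter sets
    t -= eta * grad_t
    sigma -= 0.1 * eta * grad_sigma

print(np.mean((Phi @ w - d)**2))             # final training MSE
```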
Example (1)
- Classify between two ‘overlapping’ two-dimensional, Gaussian-distributed patterns.
- Conditional probability density functions for the two classes:
  f(x | C1) = (1/(2πσ1^2)) exp[-(1/(2σ1^2)) ||x - μ1||^2],  μ1 = mean = [0 0]^T, σ1^2 = variance = 1
  f(x | C2) = (1/(2πσ2^2)) exp[-(1/(2σ2^2)) ||x - μ2||^2],  μ2 = mean = [2 0]^T, σ2^2 = variance = 4
  - x = [x1 x2]^T = two-dimensional input
  - C1 and C2 = class labels
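A minimal sketch (assuming equal priors and an illustrative sample size) that generates training patterns from these two class-conditional densities:

```python
import numpy as np

rng = np.random.default_rng(6)
n_per_class = 500                        # illustrative sample size

mu1, var1 = np.array([0.0, 0.0]), 1.0    # class C1 parameters
mu2, var2 = np.array([2.0, 0.0]), 4.0    # class C2 parameters

X1 = mu1 + np.sqrt(var1) * rng.standard_normal((n_per_class, 2))
X2 = mu2 + np.sqrt(var2) * rng.standard_normal((n_per_class, 2))

X = np.vstack([X1, X2])                  # training inputs
d = np.hstack([np.zeros(n_per_class),    # class labels: 0 for C1,
               np.ones(n_per_class)])    #               1 for C2
print(X.shape, d.shape)
```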
Example (2)
Example (3)
Example (4)
- Consider an RBF network with two inputs, M hidden neurons, and two outputs.
  - Decision rule: an input x is classified to C1 if y1 >= 0.
  - The training set is generated from the probability distribution functions.
  - Using the perceptron algorithm, the network is trained for minimum mean-square error.
- The testing set is generated from the probability distribution functions.
  - The trained network is tested for correct classification.
- For other implementation details, see the Matlab code.
Example: Function Approximation (1)
- Approximate the relationship between a car’s fuel economy (in miles per gallon) and its characteristics.
- Input data description: 9 independent discrete-valued, boolean, and continuous variables
  - X1: number of cylinders
  - X2: displacement
  - X3: horsepower
  - X4: weight
  - X5: acceleration
  - X6: model year
  - X7: Made in US? (0,1)
  - X8: Made in Europe? (0,1)
  - X9: Made in Japan? (0,1)
- Output f(X) is fuel economy in miles per gallon.
Example: Function Approximation (2)
- Using the NNET toolbox, create and train an RBF network with the function newrb.
- The function parameters allow you to set the mean-squared-error goal of the training, the spread of the radial-basis functions, and the maximum number of hidden-layer neurons.
- newrb uses the following approach to find the unknowns (it is a self-organizing approach):
  - Start with one hidden neuron; compute the network error.
  - Add another neuron with its center equal to the input vector that produced the maximum error; compute the network error.
  - If the network error does not improve significantly, stop; otherwise go to the previous step and add another neuron.
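A Python sketch of the same greedy idea, written from the description above rather than from the actual MATLAB newrb implementation; the spread, error goal, and neuron limit are illustrative parameters:

```python
import numpy as np

def greedy_rbf(X, d, spread=1.0, goal=1e-3, max_neurons=30):
    """Add one Gaussian center at a time, always at the worst-fit input."""
    centers = [int(np.argmax(np.abs(d - d.mean())))]   # first center: largest deviation
    prev_mse = np.inf
    while True:
        T = X[centers]
        Phi = np.exp(-np.linalg.norm(X[:, None] - T[None, :], axis=-1)**2
                     / (2 * spread**2))
        w, *_ = np.linalg.lstsq(Phi, d, rcond=None)    # least-squares output weights
        err = d - Phi @ w
        mse = np.mean(err**2)
        # stop at the error goal, the neuron limit, or when improvement is < 0.1%
        if mse < goal or len(centers) >= max_neurons or mse > 0.999 * prev_mse:
            return T, w, mse
        centers.append(int(np.argmax(np.abs(err))))    # new center at max-error input
        prev_mse = mse

rng = np.random.default_rng(7)
X = rng.uniform(-1, 1, size=(200, 1))
d = np.sinc(3 * X[:, 0])
T, w, mse = greedy_rbf(X, d, spread=0.3)
print(len(T), mse)                                     # number of neurons used, final MSE
```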
Comparison of RBF Network and MLP (1)
- Both are universal approximators. Thus, an RBF network exists for every MLP, and vice versa.
- An RBF network has a single hidden layer, while an MLP can have multiple hidden layers.
- The computational neurons of an MLP all share the same model, while the neurons in the hidden and output layers of an RBF network have different models.
- The activation function of a hidden node in an RBF network is based on the Euclidean norm of the input with respect to a center, while that of an MLP is based on the inner product of the input and the weights.
Comparison of RBF Network and MLP (2)
- MLPs construct global approximations to nonlinear input-output mappings. This is a consequence of the global (sigmoidal) activation functions used in MLPs.
  - As a result, MLPs can generalize in regions where input data are not available (i.e., extrapolate).
- RBF networks construct local approximations to input-output data. This is a consequence of the local Gaussian functions.
  - As a result, RBF networks are capable of fast learning from the training data.