The Multi-Layer Perceptron
non-linear modelling by adaptive pre-processing
Rob Harrison
AC&SE
[Block diagram: input data x → pre-processing layer → adaptive layer → output y; y is compared with the target z to form the error e]
Adaptive Basis Functions
• “linear” models
– fixed pre-processing
– parameter → cost mapping is “benign”
– easy to optimize
but
– combinatorial growth in candidate basis functions
– arbitrary choices
what is the best pre-processor to choose?
The Multi-Layer Perceptron
• formulated from loose biological principles
• popularized mid 1980s
– Rumelhart, Hinton & Williams 1986
• Werbos 1974, Ho 1964
• “learn” pre-processing stage from data
• layered, feed-forward structure
– sigmoidal pre-processing
– task-specific output
non-linear model
Generic MLP
[Diagram: inputs x1, x2, …, xu, …, xd feed the INPUT layer L0; HIDDEN LAYERS L1 … LM-1 (with "more layers" possible) contain units j and k; the OUTPUT LAYER LM contains units i and produces outputs y1, y2, …, yi, …, yn]
Two-Layer MLP
[Diagram: inputs x1 … xd form layer L0, a hidden layer of units j forms L1, and a single output y forms L2]
$y = f\left(\mathbf{w}^{\top}\mathbf{v} + b\right)$
$v_j = f\left(\mathbf{w}_j^{\top}\mathbf{x} + b_j\right)$
A Sigmoidal Unit
[Diagram: inputs x1, x2, …, xu, …, xd are weighted by w_{j1}, w_{j2}, …, w_{ju}, …, w_{jd}, summed with the bias b_j to give the activation a_j, which passes through f to give the unit output y_j]
$a_j = b_j + \sum_i w_{ji} x_i$
$y_j = f(a_j)$
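To make these two slides concrete, here is a minimal NumPy sketch (an illustration, not the lecture's code) of a sigmoidal unit and the forward pass of a two-layer MLP; the logistic choice of f, the layer sizes and all variable names are assumptions.

```python
import numpy as np

def sigmoid(a):
    # logistic activation: f(a) = 1 / (1 + exp(-a))
    return 1.0 / (1.0 + np.exp(-a))

def two_layer_mlp(x, W1, b1, w2, b2):
    # hidden layer: v_j = f(w_j . x + b_j)
    v = sigmoid(W1 @ x + b1)
    # output layer: y = f(w . v + b)
    return sigmoid(w2 @ v + b2)

# illustrative sizes: d = 3 inputs, H = 4 hidden units, one output
rng = np.random.default_rng(0)
x = rng.normal(size=3)
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
w2, b2 = rng.normal(size=4), rng.normal()
print(two_layer_mlp(x, W1, b1, w2, b2))
```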
Combinations of Sigmoids
[Plot: weighted combinations of sigmoids forming a non-linear function of one input over the interval -3 to 3; a companion panel shows the same construction in 2-D]
from Pattern Classification, 2e, by Duda, Hart & Stork, © 2001 John Wiley
Universal Approximation
• linear combination of “enough” sigmoids
– Cybenko, 1989
• single hidden layer adequate
– more may be better
• choose hidden parameters (w, b) optimally
problem solved?
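As a hand-built illustration of the idea (not from the slides): two shifted sigmoids can be subtracted to form a localized bump, and a weighted sum of such bumps approximates a 1-D function. The steepness, centres and weights below are arbitrary illustrative choices.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def bump(x, centre, width=0.25, steep=20.0):
    # difference of two shifted sigmoids ~ a localized bump of height ~1
    return sigmoid(steep * (x - centre + width)) - sigmoid(steep * (x - centre - width))

x = np.linspace(-3, 3, 601)
centres = np.linspace(-3, 3, 13)
# weight each bump by the target value at its centre
approx = sum(np.sin(c) * bump(x, c) for c in centres)
print(np.max(np.abs(approx - np.sin(x))))  # coarse error; more sigmoids give a better fit
```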
Pros
• compactness
– potential to obtain the same accuracy with a much smaller model
• cf. sparsity/complexity control in linear models
Compactness of Model
[Plot: reduction in SSE against the 'size' of the pre-processor, H, on a log scale from 1 to 1000; the MLP curve behaves as O(1/H), while the fixed series expansion (SER) behaves as O(1/H^(2/d)), shown for d = 5, 10 and 50]
& Cons
• parameter → cost mapping is “malign”
– optimization difficult
– many solutions possible
– effect of the hidden weights on the output is non-linear
[Diagram: the two-layer MLP again, inputs x1 … xd (L0), hidden units j (L1), output y (L2)]
Multi-Modal Cost Surface
[Surface plot: a multi-modal cost surface, annotated "gradient?", with a global minimum and a nearby local minimum]
Heading Downhill
• assume
– minimization (e.g. SSE)
– analytically intractable
• step parameters downhill
• w_new = w_old + a step in the right direction
• backpropagation (of error)
– slow but efficient
• conjugate gradients, Levenberg-Marquardt
– for preference
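For illustration, a minimal sketch of one steepest-descent (backprop) step for the two-layer network sketched earlier, using a sum-of-squares cost; the learning rate and names are assumptions, and practical training would use the better optimizers listed above.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def sgd_step(x, z, W1, b1, w2, b2, lr=0.1):
    # forward pass
    v = sigmoid(W1 @ x + b1)        # hidden activations
    y = sigmoid(w2 @ v + b2)        # network output
    e = y - z                       # error for the cost 0.5 * (y - z)**2
    # backward pass (chain rule through the sigmoids)
    delta_out = e * y * (1.0 - y)
    delta_hid = (w2 * delta_out) * v * (1.0 - v)
    # steepest descent: w_new = w_old - lr * gradient
    W1 = W1 - lr * np.outer(delta_hid, x)
    b1 = b1 - lr * delta_hid
    w2 = w2 - lr * delta_out * v
    b2 = b2 - lr * delta_out
    return W1, b1, w2, b2, 0.5 * e ** 2   # updated parameters and the cost before the step
```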
[Screenshots: the Neural Network DESIGN "Steepest Descent Backprop #1" demo; radio buttons select which network parameters (e.g. W1(1,1), b1(1)) are trained with backpropagation, and clicking in the contour graph starts the steepest-descent learning algorithm on the sum-squared-error surface]
[Screenshot: the Neural Network DESIGN "Conjugate Gradient Backprop" demo (Chapter 12), comparing the conjugate-gradient trajectory with the steepest-descent (backprop) trajectory on the same error contours over W1(1,1) and b1(1)]
Implications
• correct structure can get “wrong” answer
– dependency on initial conditions
– might be good enough
• train / test (cross-validation) required
– is poor behaviour due to
• the network structure?
• the initial conditions (ICs)?
an additional dimension in model development
Is It a Problem?
• pros seem to outweigh cons
• good solutions often arrived at quickly
• all previous issues apply
– sample density
– sample distribution
– lack of prior knowledge
How to Use
• to “generalize” a GLM
– linear regression – curve-fitting
• linear output + SSE
– logistic regression – classification
• logistic output + cross-entropy (deviance)
• extend to multinomial, ordinal
– e.g. softmax output + cross-entropy
– Poisson regression – count data
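A small sketch (not from the slides) of the output-activation and loss pairings listed above, evaluated for a single example; the function names and toy values are assumptions.

```python
import numpy as np

def sse_loss(y, z):
    # linear output + sum-of-squares error: regression / curve-fitting
    return 0.5 * np.sum((y - z) ** 2)

def bernoulli_cross_entropy(p, z):
    # logistic output + cross-entropy (deviance): two-class classification
    return -(z * np.log(p) + (1 - z) * np.log(1 - p))

def softmax(a):
    a = a - np.max(a)               # subtract the max for numerical stability
    e = np.exp(a)
    return e / e.sum()

def multinomial_cross_entropy(a, k):
    # softmax output + cross-entropy, true class index k
    return -np.log(softmax(a)[k])

print(sse_loss(np.array([1.2]), np.array([1.0])))
print(bernoulli_cross_entropy(0.8, 1))
print(multinomial_cross_entropy(np.array([2.0, 0.5, -1.0]), 0))
```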
What is Learned?
• the right thing
– in a maximum likelihood sense
theoretical
• conditional mean of the target data, E(z|x)
– implies the probability of class membership for classification, P(Ci|x)
estimated
• if the estimate is good then y → E(z|x)
Simple 1-D Function
[Screenshots: the Neural Network DESIGN "Generalization" demo (Chapter 11); a slide bar sets the number of hidden neurons and the [Train] button trains a logsig-linear network on the data points; a simple 1-D target (difficulty index 1) is fitted with 4 and 9 hidden neurons]
More Complex 1-D Example
[Screenshots: the same "Generalization" demo on a harder 1-D target (difficulty index 9), fitted with 4 and 7 hidden neurons]
Local Solutions
[Screenshots: two runs of the "Generalization" demo, both with 9 hidden neurons on the difficulty-index-9 target, illustrating different local solutions]
A Dichotomy
[Scatter plot: a two-class dichotomy in the (x1, x2) plane]
data courtesy B Ripley
Class-Conditional Probabilities
[Surface plots: P(C1|x) and P(C2|x) over the (x1, x2) plane]
Linear Decision Boundary
[Scatter plot: the linear decision boundary in the (x1, x2) plane, induced by P(C1|x) = P(C2|x)]
Class-Conditional Probabilities
[Surface plots: P(C1|x) and P(C2|x) over the (x1, x2) plane]
Non-Linear Decision Boundary
[Scatter plot: a non-linear decision boundary in the (x1, x2) plane]
Class-Conditional Probabilities
[Surface plots: P(C1|x) and P(C2|x) over the (x1, x2) plane]
Over-fitted Decision Boundary
[Scatter plot: an over-fitted decision boundary in the (x1, x2) plane]
Invariances
• e.g. scale, rotation, translation
• augment training data
• design regularizer
– tangent propagation (Simard et al., 1992)
• build into MLP structure
– convolutional networks (LeCun et al, 1989)
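As a minimal sketch of the "augment training data" option (an illustration, not the cited methods): shifted and rotated copies of 2-D training inputs are generated so the network sees the desired invariances explicitly; the ranges and names are assumptions.

```python
import numpy as np

def augment(X, z, n_copies=5, max_shift=0.1, max_angle=np.pi / 12, rng=None):
    # return X and z together with randomly rotated and translated copies
    rng = rng if rng is not None else np.random.default_rng(0)
    X_aug, z_aug = [X], [z]
    for _ in range(n_copies):
        theta = rng.uniform(-max_angle, max_angle)
        R = np.array([[np.cos(theta), -np.sin(theta)],
                      [np.sin(theta),  np.cos(theta)]])
        shift = rng.uniform(-max_shift, max_shift, size=2)
        X_aug.append(X @ R.T + shift)
        z_aug.append(z)
    return np.vstack(X_aug), np.concatenate(z_aug)

X = np.array([[0.0, 1.0], [1.0, 0.0]])   # toy 2-D inputs
z = np.array([0, 1])                     # toy targets
X_big, z_big = augment(X, z)
print(X_big.shape, z_big.shape)          # (12, 2) (12,)
```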
Multi-Modal Outputs
• assumed distribution of outputs unimodal
• sometimes not
– e.g. inverse robot kinematics
• E(z|x) not unique
• use mixture model for output density
• Mixture Density Networks (Bishop, 1996)
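A minimal sketch of the mixture-density idea (a generic illustration, not Bishop's implementation): the network's output vector is read as the mixing coefficients, means and widths of a Gaussian mixture over the target, rather than a single conditional mean; sizes and names are assumptions.

```python
import numpy as np

def mdn_density(a, z, n_components=3):
    # interpret a raw output vector a (length 3K) as a K-component
    # Gaussian mixture over a scalar target z and return p(z|x)
    K = n_components
    logits, means, log_widths = a[:K], a[K:2 * K], a[2 * K:]
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()                       # softmax -> mixing coefficients
    sigma = np.exp(log_widths)           # exp keeps the widths positive
    comps = np.exp(-0.5 * ((z - means) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return float(pi @ comps)

a = np.array([0.1, -0.2, 0.3, -1.0, 0.0, 1.0, -0.5, 0.0, 0.2])
print(mdn_density(a, z=0.5))
```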
Bayesian MLPs
• deterministic treatment
• point estimates, E(z|x)
• can use MLP as estimator in a Bayesian
treatment
– non-linearity → approximate solutions
• approximate estimate of the spread, E((z-y)²|x)
NETTalk
• Introduced by T Sejnowski & C Rosenberg
• First realistic ANN demo
• English text-to-speech is non-phonetic
• Context sensitive
– brave, gave, save, have (within word)
– read, read (within sentence)
• Mimicked DECTalk (rule-based)
Representing Text
• 29 “letters” (a–z + space, comma, stop)
– 29 binary inputs indicating letter
• Context
– local (within word) – global (sentence)
• 7 letters at a time: the middle letter → articulatory features (AFs)
– 3 letters on each side give context, e.g. word boundaries
Coding
[Diagram: the 7-letter window "H E _ C A T _" presented to the MLP, which outputs the phoneme /k/ for the centre letter C]
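A minimal sketch (an assumption, not Sejnowski & Rosenberg's code) of the input coding just described: each of the 7 letters in the window is encoded as a one-of-29 binary vector, giving 7 x 29 = 203 inputs.

```python
import numpy as np

ALPHABET = list("abcdefghijklmnopqrstuvwxyz") + [" ", ",", "."]  # 29 "letters"

def encode_window(window):
    # one-hot encode a 7-character window into a 7 x 29 = 203-element input vector
    assert len(window) == 7
    x = np.zeros((len(window), len(ALPHABET)))
    for pos, ch in enumerate(window.lower()):
        x[pos, ALPHABET.index(ch)] = 1.0
    return x.ravel()

x = encode_window("he cat ")   # centre letter 'c' -> target phoneme /k/
print(x.shape, x.sum())        # (203,) 7.0
```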
NETTalk Performance
• 1024 words of continuous, informal speech
• 1000 most common words
• best performing MLP: 26 o/p, 7x29 i/p, 80 hidden units
• 309 units, 18,320 weights
• 94% training accuracy
• 85% testing accuracy