Download weights

Last lecture summary
• biologically motivated
• synapses
• Neuron accumulates (Σ) positive/negative stimuli
from other neurons.
• Then Σ is processed further – f(Σ) – to produce an
output, i.e. neuron sends an output signal to
neurons connected to it.
Neural networks for applied science and engineering, Samarasinghe
x – inputs
w – weights
f(Σ) – activation (transfer)
function
y - output
• threshold neuron (McCulloch-Pitts)
– only binary inputs and output
– the weights are pre-set, no learning
– set the threshold so that the classification is
correct
Heaviside (threshold) activation function
• Threshold w0 is incorporated as a weight of
one additional input with input value x0 = 1.0.
• Such input is called bias.
$\sum_{j=0}^{2} w_j x_j = w_0 \cdot 1.0 + w_1 x_1 + w_2 x_2$
Perceptron
• binary classifier, maps its input x (real-valued
vector) to f(x) – a binary value (0 or 1)
• f(x) =
• 1 … w∙x > 0 (including bias)
• 0 … otherwise
• perceptron can adjust its weights (i.e. can
learn) – perceptron learning algorithm
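The perceptron rule above can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function name `perceptron_output` and the AND weights are chosen by hand for the example.

```python
def perceptron_output(weights, inputs):
    """Return 1 if w.x > 0 (bias included), otherwise 0.

    weights[0] is the threshold weight w0; it multiplies the constant
    bias input x0 = 1.0, which is prepended here.
    """
    x = [1.0] + list(inputs)
    s = sum(w * xi for w, xi in zip(weights, x))
    return 1 if s > 0 else 0


# Hand-picked (not learned) weights that realise logical AND.
and_weights = [-1.5, 1.0, 1.0]
and_table = [perceptron_output(and_weights, p)
             for p in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

Only the pattern (1, 1) pushes the weighted sum above zero, so `and_table` marks that pattern alone as class 1.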
Multiple output perceptron
• for multicategory (i.e. more than 2 classes) classification
• one output neuron for each class
output layer
input layer
Learning
• Learning means there exists an algorithm for
setting the neuron’s weights (the threshold w0 is also
set).
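– delta rule – gradient descent
$w_1^{i+1} = w_1^{i} + \Delta w_1 = w_1^{i} + \beta (t - y) x$
(t – target output, y – neuron output, t − y – error for the pattern)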
– β – learning rate
• iterative algorithm, one pass through the
whole training set (epoch) is not enough
• online learning
– adjust weights after each input pattern
presentation
– weight oscillation may occur
• batch learning
– obtain the error gradient for each input pattern,
average them at the end of the epoch
• Supervised learning using delta rule
1. Transmit an input pattern 𝒙 = (𝑥1, … , 𝑥𝑛)
through connections whose weights are initially
set to random values.
2. The weighted inputs are summed, the output 𝑦
is produced, and 𝑦 is compared with the given
target output (𝑡) to determine error for this
pattern.
3. Inputs and target outputs are presented
repeatedly, and the weights are adjusted using
the delta rule at each iteration or after an epoch
until the minimum possible square error is
achieved.
4. This usually involves the iterative presentation of
the entire training dataset many times.
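The four steps above might look as follows for a single linear neuron trained online. This is a sketch under assumptions: the function name, the toy data (targets generated from t = 2x + 1), the learning rate and the number of epochs are all chosen for the illustration.

```python
import random


def train_delta_rule(patterns, beta=0.1, epochs=500, seed=0):
    """Online delta rule for one linear neuron y = w0 * 1.0 + w1 * x."""
    rng = random.Random(seed)
    # Step 1: weights start at random values.
    w = [rng.uniform(-0.5, 0.5), rng.uniform(-0.5, 0.5)]
    for _ in range(epochs):                 # step 4: many passes (epochs)
        for x, t in patterns:               # step 3: repeated presentation
            y = w[0] * 1.0 + w[1] * x       # step 2: weighted sum, output y
            error = t - y                   # step 2: compare with target t
            w[0] += beta * error * 1.0      # delta rule, bias input x0 = 1.0
            w[1] += beta * error * x        # delta rule, input x
    return w


# Toy training set (an assumption for this example): t = 2x + 1.
patterns = [(x, 2.0 * x + 1.0) for x in (0.0, 0.25, 0.5, 0.75, 1.0)]
w = train_delta_rule(patterns)
```

With noise-free targets the weights approach w0 = 1 and w1 = 2; with a single epoch they would still be far from those values, which is why one pass is not enough.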
New stuff
Finishing perceptron
Perceptron failure
• Please help me and draw the following functions
on the blackboard:
– AND, OR, XOR (eXclusive OR, true when exactly one of the
operands is true, otherwise false)
[Figure: three plots on axes with ticks 0 and 1, titled AND, OR and XOR; the XOR plot is marked "???"]
Play with
http://lcn.epfl.ch/tutorial/english/perceptron/html/index.html
• A perceptron separates its inputs with a linear boundary
(w∙x = 0), so only linearly separable problems can be solved.
• 1969 – famous book “Perceptrons” by Marvin
Minsky and Seymour Papert showed that it was
impossible for these classes of network to learn
an XOR function.
• They conjectured (incorrectly!) that a similar
result would hold for a perceptron with three or
more layers.
• The often-cited Minsky/Papert text caused a
significant decline in interest and funding of
neural network research. It took ten more years
until neural network research experienced a
resurgence in the 1980s.
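The XOR failure can be reproduced with a small experiment. The sketch below uses the perceptron learning rule in the form w ← w + β (t − y) x; the function names, the zero initial weights, β = 1 and the epoch limit are assumptions made for the illustration.

```python
def predict(w, x):
    """Threshold unit: 1 if w0 + w1*x1 + w2*x2 > 0, else 0."""
    return 1 if w[0] + w[1] * x[0] + w[2] * x[1] > 0 else 0


def train_perceptron(samples, beta=1.0, max_epochs=100):
    """Return (weights, converged); converged means an error-free epoch."""
    w = [0.0, 0.0, 0.0]
    for _ in range(max_epochs):
        errors = 0
        for x, t in samples:
            err = t - predict(w, x)
            if err != 0:
                errors += 1
                w[0] += beta * err * 1.0    # bias input x0 = 1.0
                w[1] += beta * err * x[0]
                w[2] += beta * err * x[1]
        if errors == 0:
            return w, True
    return w, False


AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
XOR = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]

and_weights, and_converged = train_perceptron(AND)
xor_weights, xor_converged = train_perceptron(XOR)
```

AND is linearly separable, so training reaches an epoch without errors. For XOR no weight vector classifies all four patterns, so the loop always runs until `max_epochs`.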
Play with
http://www.eee.metu.edu.tr/~halici/courses/543java/NNOC/Perceptron.html
Multilayer perceptron
New stuff
Nonlinear activation functions
• So far we have met threshold and linear activation
functions.
• With them a neuron forms only a linear boundary, and
consequently the solved problems must also be linear.
• The nonlinearity is introduced by using
nonlinear activation functions.
logistic (sigmoid, unipolar): $f(x) = \frac{1}{1 + e^{-x}}$
tanh (bipolar): $\tanh(x) = \frac{e^{x} - e^{-x}}{e^{x} + e^{-x}}$
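Both activation functions are easy to write down directly. A minimal sketch (the function names are chosen here; `math.tanh` from the standard library would do the same job as the hand-written version):

```python
import math


def logistic(x):
    """Unipolar sigmoid, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Bipolar sigmoid, output in (-1, 1)."""
    ex, enx = math.exp(x), math.exp(-x)
    return (ex - enx) / (ex + enx)


# The two are rescaled versions of each other: tanh(x) = 2*logistic(2x) - 1.
x0 = 0.7
difference = tanh(x0) - (2.0 * logistic(2.0 * x0) - 1.0)
```

The hand-written `tanh` overflows for very large |x|, so it is meant only to mirror the formula, not for production use.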
Multilayer perceptron
• MLP, the most famous type of neural network
[Figure: MLP with an input layer, a hidden layer and an output layer; caption: three-layer vs. two-layer]
Backpropagation training algorithm
• How to train MLP?
• Gradient descent type of algorithm called
backpropagation.
• MLP works in two passes:
• forward pass
– present a training sample to the neural network
– compare the network's output to the desired
output from that sample
– calculate the error in each output neuron
• backward pass
– compute the amount ∆w by which the weights should
be updated
– first calculate gradient for hidden-to-output weights
– then calculate gradient for input-to-hidden weights
• the knowledge of $\mathrm{grad}_{\text{hidden-output}}$ is necessary to calculate
$\mathrm{grad}_{\text{input-hidden}}$
– update the weights in the network
$w_{m+1} = w_m + \Delta w_m$
$\Delta w_m = -\beta d_m$
• It is a gradient descent method
– learning rate β is used
– can get trapped in local minima
input signal propagates forward
error propagates backward
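The forward and backward pass can be sketched for a very small MLP. Everything specific here is an assumption for the illustration: one hidden layer, logistic activations everywhere, a single output, the error E = ½ (y − t)², batch updates, and the XOR data as training set.

```python
import math
import random


def logistic(s):
    return 1.0 / (1.0 + math.exp(-s))


def forward(W1, W2, x):
    """W1[j]: weights of hidden neuron j, W2: weights of the output neuron.
    Index 0 of every weight list is the bias weight (input fixed at 1.0)."""
    xb = [1.0] + list(x)
    h = [logistic(sum(w * xi for w, xi in zip(row, xb))) for row in W1]
    hb = [1.0] + h
    y = logistic(sum(w * hi for w, hi in zip(W2, hb)))
    return xb, hb, y


def backprop(W1, W2, x, t):
    """Gradients of E = 0.5 * (y - t)**2 for one pattern."""
    xb, hb, y = forward(W1, W2, x)
    delta_out = (y - t) * y * (1.0 - y)        # error signal at the output
    grad_W2 = [delta_out * hi for hi in hb]    # hidden-to-output gradient
    grad_W1 = []
    for j in range(len(W1)):
        h = hb[j + 1]
        # The output error signal is needed to get the hidden one.
        delta_h = delta_out * W2[j + 1] * h * (1.0 - h)
        grad_W1.append([delta_h * xi for xi in xb])  # input-to-hidden
    return grad_W1, grad_W2


def train_epoch(W1, W2, data, beta):
    """Batch learning: sum the gradients over the epoch, then update."""
    g1 = [[0.0] * len(row) for row in W1]
    g2 = [0.0] * len(W2)
    for x, t in data:
        d1, d2 = backprop(W1, W2, x, t)
        g1 = [[a + b for a, b in zip(r, s)] for r, s in zip(g1, d1)]
        g2 = [a + b for a, b in zip(g2, d2)]
    for j, row in enumerate(W1):
        for i in range(len(row)):
            row[i] -= beta * g1[j][i]          # delta_w = -beta * d
    for k in range(len(W2)):
        W2[k] -= beta * g2[k]


def total_error(W1, W2, data):
    return sum(0.5 * (forward(W1, W2, x)[2] - t) ** 2 for x, t in data)


rng = random.Random(0)
W1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
W2 = [rng.uniform(-1, 1) for _ in range(3)]
xor_data = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
error_before = total_error(W1, W2, xor_data)
for _ in range(200):
    train_epoch(W1, W2, xor_data, beta=0.2)
error_after = total_error(W1, W2, xor_data)
```

The error goes down, but 200 epochs from a random start are not guaranteed to solve XOR: the run may still be on a plateau or heading for a local minimum.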
• online learning vs. batch learning
– In online learning the weights are changed after
each presentation of a training pattern.
• Weights may oscillate.
• Suitable when training patterns arrive one at a time.
– In batch learning, the total gradient for the whole
epoch is represented as the sum of the gradient
for each of the n patterns.
• Batch learning improves the stability by
averaging.
• Another averaging approach providing
stability is using the momentum.
• This method basically tags the average of the past
weight changes onto the new weight increment
at every weight change, thereby smoothing out
the net weight change.
$\Delta w_m = \mu \Delta w_{m-1} - (1 - \mu) \beta d_m$
• Momentum μ is between 0 and 1.
• It indicates the relative importance of the past
weight change $\Delta w_{m-1}$ on the new weight
increment $\Delta w_m$
• Thus, the current gradient and the past weight
change together decide how much the new
weight increment will be.
• For example, if μ is equal to 0, momentum
does not apply at all, and the past history has
no place.
• If μ is equal to 1, the current change is totally
based on the past change.
• Values of μ between 0 and 1 result in a
combined response to weight change.
• The equation is recursive, so the influence of
the past weight change incorporates that of all
previous weight changes as well.
• Momentum can be used with both batch and
online learning.
• In batch learning, it can provide further
stability to the gradient descent.
• Momentum can be especially useful in online
learning to minimize oscillations in error after
the presentation of each pattern.
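The momentum update can be tried on a one-weight toy problem. In this sketch the function name `momentum_step` and the error surface E(w) = w² (gradient d = 2w) are assumptions made for the example; the update itself follows the momentum equation from the slides.

```python
def momentum_step(w, prev_dw, d, beta, mu):
    """One update: dw_m = mu * dw_{m-1} - (1 - mu) * beta * d_m.

    Returns the new weight w_{m+1} = w_m + dw_m and the increment dw_m,
    which has to be passed back in as prev_dw on the next call.
    """
    dw = mu * prev_dw - (1.0 - mu) * beta * d
    return w + dw, dw


# Toy error surface E(w) = w**2, so the gradient is d = 2 * w.
w, dw = 1.0, 0.0
for _ in range(50):
    d = 2.0 * w
    w, dw = momentum_step(w, dw, d, beta=0.1, mu=0.5)
```

Setting `mu=0.0` reduces the step to plain gradient descent, and `mu=1.0` simply repeats the previous weight change, which matches the two limiting cases described above.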
Delta-Bar-Delta
• In backpropagation the same learning rate β
applies to all of the weights.
• More flexibility could be achieved if each weight is
adjusted independently.
• This method is called delta-bar-delta (TurboProp).
• Each weight has its own learning rate; the rates are
adjusted as follows:
– if the direction in which the error decreases at the
current point is the same as the direction in which the
error has been decreasing recently, then the learning
rate is increased.
– if the opposite is true, the learning rate is decreased
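A per-weight learning-rate update along these lines could look like the sketch below. It follows the usual description of the delta-bar-delta rule (additive increase, multiplicative decrease, exponentially averaged gradient), but the parameter names `kappa`, `phi`, `theta` and their default values are assumptions, not taken from the lecture.

```python
def delta_bar_delta_step(w, rate, bar, grad, kappa=0.01, phi=0.5, theta=0.7):
    """Update one weight that carries its own learning rate.

    bar   -- running average of this weight's recent gradients
    grad  -- current gradient of the error with respect to this weight
    """
    if bar * grad > 0:
        rate += kappa            # same direction as recently: speed up
    elif bar * grad < 0:
        rate *= (1.0 - phi)      # direction flipped: slow down
    w -= rate * grad             # gradient descent with the adapted rate
    bar = (1.0 - theta) * grad + theta * bar   # refresh the running average
    return w, rate, bar


# Gradient agrees with the recent average: the rate grows from 0.1 to 0.11.
w_a, rate_a, bar_a = delta_bar_delta_step(0.0, 0.1, 1.0, 2.0)
# Gradient has the opposite sign: the rate is halved to 0.05.
w_b, rate_b, bar_b = delta_bar_delta_step(0.0, 0.1, 1.0, -2.0)
```

Each weight needs its own `rate` and `bar`, so a full implementation keeps two extra numbers per weight.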
Second order methods
• Surface curvature can be used to guide the
error down the error surface more efficiently.
grad is a vector pointing in
the direction of the greatest
rate of increase of the
function.
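How fast does the rate of increase of the function
change in a small neighbourhood?
$$
H = \begin{pmatrix}
\frac{\partial^2 E}{\partial w_1^2} & \frac{\partial^2 E}{\partial w_1 \partial w_2} & \cdots & \frac{\partial^2 E}{\partial w_1 \partial w_n} \\
\frac{\partial^2 E}{\partial w_2 \partial w_1} & \frac{\partial^2 E}{\partial w_2^2} & \cdots & \frac{\partial^2 E}{\partial w_2 \partial w_n} \\
\vdots & \vdots & \ddots & \vdots \\
\frac{\partial^2 E}{\partial w_n \partial w_1} & \frac{\partial^2 E}{\partial w_n \partial w_2} & \cdots & \frac{\partial^2 E}{\partial w_n^2}
\end{pmatrix}
$$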
This is given as the derivative
of gradient, derivative of
derivative, i.e. second
derivative.
The second derivatives with
respect to all pairs of
weights are given as the
Hessian matrix.
• Common methods using the Hessian
– QuickProp
– Gauss-Newton
– Levenberg-Marquardt (LM)
• These methods are an order of magnitude faster
(i.e. they reach minima in far fewer epochs)
than first order methods (i.e. gradient based).
• However the efficiency is gained at a
considerable computational cost.
– Computing and inverting the Hessian for large
networks with a large number of training patterns is
expensive (large storage requirements) and slow.
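Why curvature helps can be seen on a quadratic error surface with two weights, where a single Newton step (w − H⁻¹·grad) lands on the minimum. The surface, the helper names and the learning rate below are assumptions for the illustration; QuickProp, Gauss-Newton and LM use approximations of the Hessian rather than this exact step.

```python
# Assumed toy error surface:
# E(w) = 2*w1**2 + w1*w2 + 1.5*w2**2 - w1 - 2*w2
HESSIAN = [[4.0, 1.0],
           [1.0, 3.0]]       # constant second derivatives of E


def gradient(w):
    return [4.0 * w[0] + 1.0 * w[1] - 1.0,
            1.0 * w[0] + 3.0 * w[1] - 2.0]


def solve_2x2(H, g):
    """Solve H * step = g for a 2x2 matrix (Cramer's rule)."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    return [(H[1][1] * g[0] - H[0][1] * g[1]) / det,
            (H[0][0] * g[1] - H[1][0] * g[0]) / det]


def newton_step(w):
    step = solve_2x2(HESSIAN, gradient(w))
    return [w[0] - step[0], w[1] - step[1]]


def gradient_descent_step(w, beta):
    g = gradient(w)
    return [w[0] - beta * g[0], w[1] - beta * g[1]]


start = [0.0, 0.0]
w_newton = newton_step(start)                  # one second-order step
w_gd = gradient_descent_step(start, beta=0.1)  # one first-order step
```

After one step the Newton iterate has zero gradient, while gradient descent has only started moving. The price is the linear solve with the Hessian, which is trivial for two weights and expensive for a large network.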
Bias-variance
• Just a small reminder
• bias (lack of fit, underfitting) – model does not
fit data enough, not flexible enough (too few
parameters)
• variance (overfitting) – model is too flexible
(too many parameters), fits noise
• bias-variance tradeoff – improving the
generalization ability of the model (i.e. find
the correct amount of flexibility)
• Parameters in MLP: weights
• If you use one more hidden neuron, the
number of weights increases by how much?
– # input neurons + # output neurons
• If MLP is used for regression task, be careful!
• To use an MLP in a statistically correct way, the number of
degrees of freedom (i.e. weights) must not
exceed the number of data points.
– Compare to polynomial regression example from
the 2nd lecture
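The weight count for an MLP with one hidden layer is simple enough to check in code. The function name and the `bias` switch are assumptions for this sketch; the answer on the slide corresponds to counting without bias weights.

```python
def count_weights(n_in, n_hidden, n_out, bias=False):
    """Number of weights in an MLP with a single hidden layer."""
    extra = 1 if bias else 0     # one bias weight per receiving neuron
    return (n_in + extra) * n_hidden + (n_hidden + extra) * n_out


# One more hidden neuron in a 3-5-2 network.
increase_no_bias = count_weights(3, 6, 2) - count_weights(3, 5, 2)
increase_with_bias = (count_weights(3, 6, 2, bias=True)
                      - count_weights(3, 5, 2, bias=True))
```

Without bias weights the increase is n_in + n_out (here 5); if every neuron also has a bias weight, the new hidden neuron brings one more (here 6). Comparing `count_weights(...)` with the number of data points is a quick check of the degrees-of-freedom rule above.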
Improving generalization of MLP
• Flexibility comes from hidden neurons.
• Choose the number of hidden neurons so that neither
underfitting nor overfitting occurs.
• Three most common approaches:
– exhaustive search
– early stopping
– regularization
Exhaustive search
• Increase the number of hidden units step by step, and
monitor the performance on the validation
data set.
[Figure: plot with the number of neurons on the horizontal axis]
Early stopping
• fixed and large number of neurons is used
• network is trained while testing its performance
on a validation set at regular intervals
• minimum of the validation error – the weights to keep
[Figure: plot with epochs on the horizontal axis]
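The bookkeeping behind early stopping can be sketched independently of the network. In the code below the function name, the callback arguments `train_step` and `validation_error`, and the one-weight toy problem are all hypothetical; the sketch trains for the full number of epochs and returns the weights seen at the validation minimum.

```python
import copy


def train_with_early_stopping(weights, train_step, validation_error,
                              max_epochs=100, check_every=5):
    """Keep the weights that gave the lowest validation error."""
    best_weights = copy.deepcopy(weights)
    best_err = validation_error(weights)
    best_epoch = 0
    for epoch in range(1, max_epochs + 1):
        weights = train_step(weights)          # one epoch of training
        if epoch % check_every == 0:           # test at regular intervals
            err = validation_error(weights)
            if err < best_err:
                best_err, best_epoch = err, epoch
                best_weights = copy.deepcopy(weights)
    return best_weights, best_err, best_epoch


# Toy stand-ins: training pulls the weight toward 2.0 (the training
# optimum), but the validation error is smallest near 1.5.
def toy_train_step(w):
    return [w[0] + 0.1 * (2.0 - w[0])]


def toy_validation_error(w):
    return (w[0] - 1.5) ** 2


best_weights, best_err, best_epoch = train_with_early_stopping(
    [0.0], toy_train_step, toy_validation_error)
```

In the toy run the validation error first falls and then rises again as training continues, so the returned weights come from an intermediate epoch rather than from the end of training.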
Regularization
• Who remembers from the polynomial
example what is regularization?
• In neural networks it is called weight decay.
• Idea: keep the growth of weights to a
minimum in such a way that non-important
weights are pulled toward zero
• Only the important weights are allowed to
grow, others are forced to decay
• This is achieved not by minimizing MSE, but by
minimizing
$W = \mathrm{MSE} + \delta \sum_{j=1}^{m} w_j^2$
• second term – regularization term
• m – number of weights in the network
• δ – regularization parameter
– the larger the δ, the more important the
regularization
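The regularized criterion and its effect on the gradient can be written down directly. The function names are chosen for this sketch; the extra gradient term 2·δ·w_j is simply the derivative of the regularization term.

```python
def regularized_criterion(mse, weights, delta):
    """W = MSE + delta * sum of squared weights."""
    return mse + delta * sum(w * w for w in weights)


def decay_gradient(weights, delta):
    """Gradient of the regularization term: 2 * delta * w_j per weight.

    Added to the MSE gradient, it pulls every weight toward zero in
    proportion to its size.
    """
    return [2.0 * delta * w for w in weights]


criterion = regularized_criterion(0.5, [1.0, -2.0], delta=0.1)
pull = decay_gradient([1.0, -2.0], delta=0.1)
```

A weight can only grow if the reduction in MSE outweighs this pull, which is how the unimportant weights end up decaying.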
Network pruning
• Both early stopping and weight decay use all
weights in the NN. They do not reduce the
complexity of the model.
• Network pruning – reduce complexity by
keeping only essential weights/neurons.
• Several pruning approaches, e.g.
– optimal brain damage (OBD)
– optimal brain surgeon (OBS)
– optimal cell damage (OCD)
OBD
• Based on sensitivity analysis
– systematically change parameters in a model to
determine the effects of such changes
• Weights that are not important for input-output mapping are removed.
• The importance (saliency) of the weight is
measured based on the cost of setting a
weight to zero.
• The saliency can be computed from the
Hessian.
• Hessian is nonlocal – i.e. it uses the derivative
with respect to all pairs of weights.
Computationally costly for large networks.
• Local approximation of Hessian – use only
diagonal weights (i.e. ignore all second
derivatives with respect to weights other than
itself)
– It implies that the weights of the network are
independent
• Saliency $s_i$ of weight $w_i$ is defined as
$s_i = \frac{H_{ii} w_i^2}{2}$
• $H_{ii}$ (diagonal entry of the Hessian) indicates
the acceleration of the error with respect to a
small perturbation to a weight $w_i$.
• By multiplying $H_{ii}$ by $w_i^2$ an indication of the
total effect of $w_i$ on the error is obtained.
• The larger the $s_i$, the larger the influence of $w_i$
on error.
• How to perform OBD?
1. Train a flexible network in the normal way (i.e. use
early stopping, weight decay, …)
2. Compute the saliency of each weight. Remove
weights with small saliencies.
3. Retrain the reduced network with the kept
weights. Initialize the training with their values
obtained in the previous step.
4. Repeat from step 1.
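Step 2 of the procedure can be sketched as follows. The function names, the weight values and the diagonal Hessian entries are made up for the example; in practice the diagonal of the Hessian has to be computed from the trained network.

```python
def saliencies(weights, hessian_diag):
    """s_i = H_ii * w_i**2 / 2 for every weight."""
    return [h * w * w / 2.0 for w, h in zip(weights, hessian_diag)]


def prune_smallest(weights, hessian_diag, n_remove):
    """Set the n_remove weights with the smallest saliency to zero."""
    s = saliencies(weights, hessian_diag)
    order = sorted(range(len(weights)), key=lambda i: s[i])
    removed = set(order[:n_remove])
    pruned = [0.0 if i in removed else w for i, w in enumerate(weights)]
    return pruned, removed


# Hypothetical trained weights and diagonal Hessian entries.
weights = [0.5, -2.0, 0.1, 1.0]
hessian_diag = [1.0, 0.5, 4.0, 0.01]
pruned, removed = prune_smallest(weights, hessian_diag, n_remove=2)
```

In this example the weight 1.0 is removed although it is larger than 0.5, because the error is almost flat in its direction (H_ii = 0.01). Saliency, not magnitude, decides which weights go.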