The rise of neural networks
In the mid-2000s, there was a resurgence of neural networks, mainly due to two reasons:
• high computational power became available at low cost via general-purpose graphics processing units (GPGPUs);
• major players like Google, Microsoft, and Facebook needed to analyze their huge amounts of data.
This resurgence led to a new neural model, known as "Deep Learning", for training networks of many layers, called "Deep Networks".
Deep networks
Deep networks are networks with more than 3 layers; the others are called shallow networks. Both are feed-forward networks trained with a supervised learning paradigm.
[Figure: a feed-forward network with an input layer, hidden layers 1, 2, and 3, and an output layer.]

Why many layers?
Why should we use many layers, if the universal approximation theorem proved that a neural network with three layers can approximate any function?
Why many layers?
More precisely, the universal approximation theorem says that, given any function f(x) of the input vector x and an error ε > 0, there exists a number N of hidden units and a set of weights that can approximate the function within the given error.
However, the theorem does not say how to find such weights! In other words, even though a 3-layer network can theoretically approximate any function, finding the right weights could take a very long time for complex functions.
Ok, but why could a network with many layers solve the same problem more easily?
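To make the theorem concrete, here is a minimal sketch (the target function, hidden-unit count, and learning rate are illustrative assumptions, not from the lecture) of a 3-layer network, i.e. one hidden layer of sigmoid units, fitted by gradient descent to a 1-D function:

```python
# Minimal sketch: one hidden layer of sigmoid units approximating f(x) = sin(2x).
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 200).reshape(-1, 1)   # inputs
t = np.sin(2 * x)                            # target function f(x)

N = 50                                       # number of hidden units (assumption)
W1 = rng.normal(0, 1, (1, N)); b1 = np.zeros(N)
W2 = rng.normal(0, 1, (N, 1)); b2 = np.zeros(1)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

eta = 0.05
for _ in range(20000):
    h = sigmoid(x @ W1 + b1)                 # hidden activations
    y = h @ W2 + b2                          # linear output
    d_out = (y - t) / len(x)                 # gradient of E = mean((y - t)^2) / 2
    d_hid = (d_out @ W2.T) * h * (1 - h)     # backpropagated to the hidden layer
    W2 -= eta * h.T @ d_out; b2 -= eta * d_out.sum(0)
    W1 -= eta * x.T @ d_hid; b1 -= eta * d_hid.sum(0)

y = sigmoid(x @ W1 + b1) @ W2 + b2
print("mean squared error:", float(np.mean((y - t) ** 2)))
```

Whether 20,000 plain gradient-descent steps get the error low enough depends on the function and on N, which is exactly the practical limitation noted above.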
Why many layers?
Consider a 3-layer network for handwritten digit recognition from the MNIST image set, with 28 x 28 = 784 input neurons and 10 output neurons.
Why do we use 10 output neurons and not 4 (in binary code)?

Why many layers?
Typically, what the hidden neurons learn is to recognize the presence of elementary features in the input image. What each output neuron has to learn is to integrate these features by increasing the corresponding weights.

Why many layers?
If we have 4 outputs, each neuron has to learn one bit of the binary code, and there is no easy way to relate this information to simple features of the input image:
y3 = 1 ⇒ 8, 9
y2 = 1 ⇒ 4, 5, 6, 7
y1 = 1 ⇒ 2, 3, 6, 7
y0 = 1 ⇒ 1, 3, 5, 7, 9

Why many layers?
However, it is much easier to learn a bitwise representation by adding an extra layer that makes the conversion. Each layer learns a more sophisticated data representation:
simple features → digits → binary code

Why many layers?
Let's consider a more complex problem, such as face recognition ("It's a face"). Assuming a 3-layer network is able to learn to classify faces, it is quite hard to understand from the weights what the network really learned.
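Going back to the digit-to-binary conversion above, the extra layer is easy to represent explicitly. A minimal sketch (illustrative, not from the lecture):

```python
# Minimal sketch: a fixed extra layer converting a one-hot "digit" layer
# into the 4-bit binary code, showing why the conversion is easy to represent.
import numpy as np

# W[d, j] = 1 if bit j is set in the binary code of digit d
W = np.array([[(d >> j) & 1 for j in range(4)] for d in range(10)], float)

digits = np.eye(10)          # one-hot activations of the "digit" layer
codes = digits @ W           # binary-code activations y0..y3
print(codes[6])              # digit 6 -> [0. 1. 1. 0.] (bits y0..y3)
```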
Mysterious behavior
In other words, the final behavior is quite mysterious: weights
are discovered automatically and we do not understand how
the network is doing its job. We cannot make predictions!
Would you trust an autonomous car that takes driving decisions based on a neural network that nobody understands?
A better approach
A better way to approach face recognition is to decompose the problem into sub-problems:
• is there an "eye" in the top left?
• is there an "eye" in the top right?
• is there a "nose" in the middle?
• is there a "mouth" in the bottom middle?
• …
If the answer to several of these questions is "yes", then we conclude that the image is likely to be a face.
Feature detectors
Hence the architecture becomes:
[Figure: an "eye in the top left" detector, an "eye in the top right" detector, a "nose in the middle" detector, and a "mouth in the bottom" detector, all feeding the face classifier.]

Sub-feature detectors
For example, the Eye detector can be decomposed into modules for detecting eyebrow, eyelashes, iris, eye shape, and so on:
[Figure: eyebrow, eyelashes, iris, and eye shape detectors feeding the eye detector.]
Clearly, sub-problems can also be decomposed to detect simpler features, and so on.
Feature composition
This idea can be iterated, until features are decomposed into many very simple features located in small regions of pixels. Hence the network architecture becomes:
[Figure: a hierarchy of feature detectors feeding an N-class classifier.]

Feature composition
The result is a network of many layers (a deep neural network), with early layers detecting very simple features of the image, and later layers building more and more complex abstractions:
elementary shapes → simple features → macro features → face → N-class classifier

Two main problems
Unfortunately, increasing the number of hidden layers leads to two known problems:
1. Vanishing gradient: as we add more and more hidden layers, Backpropagation becomes less and less effective for the lower layers, since the gradient becomes smaller and smaller (hence "vanishing").
2. Overfitting: in networks with a large number of neurons (hence, many degrees of freedom) the network tends to fit the training data too closely, performing really well on the training set, but very poorly on other examples.
These problems remained unsolved until 2006, when several methods were developed to learn deep neural networks.
Vanishing gradient
If we train a network for digit classification with an increasing number of hidden layers, we see that the accuracy does not improve as expected.
[Figure: accuracy (%) vs. number of hidden layers (1 to 4).]
Experiment settings:
Input neurons: 28x28 = 784
Output neurons: 10
Hidden neurons: 30 per layer
Learning rate: η = 0.1
Regularization: λ = 5.0
Mini-batch size: m = 10
Max epochs: 30

Vanishing gradient
This problem was formally identified for the first time in 1991 by Sepp Hochreiter in his master's thesis.
To understand what is going on, let's monitor the gradient of the neurons in each hidden layer, since it gives an indication of how quickly each neuron is learning. Each neuron updates its weights according to:
$$\Delta w_{ji} = -\eta\,\delta_j\,x_i$$
If $\delta^l = (\delta_1^l, \ldots, \delta_n^l)$ is the gradient vector of layer $l$, its norm
$$\lVert \delta^l \rVert = \sqrt{\sum_{j=1}^{n} \left(\delta_j^l\right)^2}$$
gives a rough measure of the speed of learning of layer $l$.

Vanishing gradient
Monitoring ∥δl∥ for a network of 4 hidden layers, at the start of learning we get:
[Figure: bar chart of ∥δl∥ on a log scale (10^0 down to 10^-3) per hidden layer; the measured values are roughly 0.003, 0.017, 0.070, and 0.285, from the first hidden layer to the last.]
This is the situation at the beginning of learning. Note that the speed of learning decreases exponentially proceeding from the output to the input.

Vanishing gradient
Monitoring ∥δl∥ during learning we get:
[Figure: speed of learning ∥δl∥ (log scale, 10^-1 down to 10^-6) vs. number of epochs (0 to 500), one curve per hidden layer l = 1, 2, 3, 4.]
This test confirms that the gradient decreases exponentially as we move backward through the hidden layers.
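A minimal sketch of this monitoring experiment (layer sizes and the random input are illustrative assumptions; the lecture's experiment used MNIST data):

```python
# One backpropagation pass through a sigmoid network, printing the
# per-layer gradient norms ||delta^l|| to expose the vanishing gradient.
import numpy as np

rng = np.random.default_rng(0)
sizes = [784, 30, 30, 30, 30, 10]            # 4 hidden layers of 30 units
Ws = [rng.normal(0, 1, (m, n)) for m, n in zip(sizes[:-1], sizes[1:])]
bs = [np.zeros(n) for n in sizes[1:]]

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x = rng.random(784)                          # a fake "image" (assumption)
t = np.eye(10)[3]                            # a fake one-hot target

ys = [x]                                     # forward pass, storing activations
for W, b in zip(Ws, bs):
    ys.append(sigmoid(ys[-1] @ W + b))

# backward pass with quadratic loss: delta_L = (y - t) * sigma'(a)
delta = (ys[-1] - t) * ys[-1] * (1 - ys[-1])
deltas = [delta]
for W, y in zip(reversed(Ws[1:]), reversed(ys[1:-1])):
    delta = (delta @ W.T) * y * (1 - y)      # propagate delta one layer back
    deltas.insert(0, delta)

for l, d in enumerate(deltas[:-1], start=1):
    print(f"hidden layer {l}: ||delta|| = {np.linalg.norm(d):.5f}")
```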
Why does gradient vanish?
To understand the issue, consider the simplest deep network, a chain of single neurons:
[Figure: input x feeding a chain of neurons with weights w2, w3, w4, w5 and output y5.]
To understand why the gradient vanishes, let's derive ∂E/∂b2, that is, the gradient of the bias of the first hidden neuron:
$$\frac{\partial E}{\partial b_2} = \sigma'(a_2)\; w_3\,\sigma'(a_3)\; w_4\,\sigma'(a_4)\; w_5\,\sigma'(a_5)\; \frac{\partial E}{\partial y_5}$$
Except for ∂E/∂y5, it is a product of terms of the form wj σ′(aj).

Why does gradient vanish?
In general, for the bias of the neuron at layer l:
$$\frac{\partial E}{\partial b_l} = \sigma'(a_l) \left[\, \prod_{j=l+1}^{L} w_j\,\sigma'(a_j) \right] \frac{\partial E}{\partial y_L}$$
[Figure: plot of σ′(a) for a in [-4, 4]; its maximum value is σ′(0) = 0.25.]
If we initialize weights using a Gaussian(0,1), then
$$|w_j| < 1 \;\Rightarrow\; |w_j\,\sigma'(a_j)| < 0.25$$
Hence, the product of many such terms decreases exponentially with the number of terms!
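A quick numeric check (Gaussian(0,1) weights and random pre-activations are illustrative assumptions):

```python
# The product of terms w_j * sigma'(a_j) along a chain shrinks exponentially.
import numpy as np

rng = np.random.default_rng(1)

def sigmoid_prime(a):
    s = 1.0 / (1.0 + np.exp(-a))
    return s * (1 - s)

for depth in (2, 4, 8, 16):
    w = rng.normal(0, 1, depth)              # one weight per layer
    a = rng.normal(0, 1, depth)              # pre-activations (illustrative)
    prod = np.prod(w * sigmoid_prime(a))
    print(f"depth {depth:2d}: |product| = {abs(prod):.2e}")
```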
Why does gradient vanish?
$$\frac{\partial E}{\partial b_l} = \sigma'(a_l) \left[\, \prod_{j=l+1}^{L} w_j\,\sigma'(a_j) \right] \frac{\partial E}{\partial y_L}$$
You could argue that if the weights wj grow during training, then it could no longer be true that |wj σ′(aj)| < 1/4. Indeed, if the terms get large enough, greater than 1, then we don't have a vanishing gradient, but an exploding gradient.

Why does gradient vanish?
Note that with sigmoid neurons the gradient is more likely to vanish than to explode. In fact, to explode we need |wj σ′(aj)| ≥ 1, but this is not so easy to happen, because σ′(a) also depends on w:
σ′(a) = σ′(wx + b), so: large w ⇒ large a ⇒ small σ′(a)
The only way to make |wj σ′(aj)| ≥ 1 is if the input x falls within a small range of values. Sometimes that happens, but more often it does not. Hence, the gradient is more likely to vanish.
Improving Backpropagation
Several techniques have been proposed to improve Backpropagation to make it suitable for deep networks. They include:
• a better choice of the loss function;
• a better choice of the activation function;
• regularization methods to address overfitting and improve generalization.
Loss function
We have seen that initial weight values strongly affect the learning speed:
[Figure: error E vs. epoch (0 to 300) for a single neuron initialized with w = 0.6, b = 0.9 and with w = 2.0, b = 2.0.]
The problem is due to the fact that Δw is proportional to ∂E/∂w and, since
$$E = \frac{1}{2}(t - y)^2,$$
∂E/∂w is proportional to σ′(a), which is nearly zero when the neuron is saturated.
A better loss function
We said that this problem can be solved by initializing the weights with small random values.
Another approach is to replace the quadratic cost function E = ½(t − y)² with a different loss function C, known as the cross-entropy function:
$$C = -\,t \ln y - (1 - t) \ln (1 - y)$$
Note that C satisfies two properties of good cost functions:
1. C > 0, since both log arguments are in the range (0,1);
2. C → 0 when y → t. However, this is true only if t can be either 0 or 1, as in most classification problems.
A better loss function
Unlike the quadratic error function, the derivative of the cross-entropy loss function does not depend on σ′(a):
$$\frac{\partial C}{\partial w_i} = \frac{\partial C}{\partial y}\,\frac{\partial y}{\partial a}\,\frac{\partial a}{\partial w_i} = (y - t)\,x_i$$
Since Δw = −η ∂C/∂w, the larger the error, the faster the neuron will learn!

A better loss function
If we run the experiment with the cross-entropy loss function, we get (with η = 0.005):
[Figure: loss vs. epoch (0 to 300) for initial values w = 0.6, b = 0.9 and for w = 2.0, b = 2.0.]
Note that for different loss functions the values of η cannot be compared.
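A minimal sketch of that single-neuron experiment (input, target, and the quadratic learning rate are illustrative assumptions; the lecture only gives η = 0.005 for cross-entropy). Even with a much smaller learning rate, the cross-entropy neuron escapes the saturated start faster:

```python
# Compare learning from a saturated start under the two losses.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

x, t = 1.0, 0.0                        # single input, target 0 (assumption)

def train(loss, eta, epochs=300):
    w, b = 2.0, 2.0                    # saturated initial values
    for _ in range(epochs):
        y = sigmoid(w * x + b)
        if loss == "quadratic":
            g = (y - t) * y * (1 - y)  # dE/da includes sigma'(a) = y(1 - y)
        else:
            g = y - t                  # dC/da for cross-entropy: no sigma'(a)
        w -= eta * g * x
        b -= eta * g
    return sigmoid(w * x + b)          # output should approach the target 0

print("quadratic     :", train("quadratic", eta=0.15))
print("cross-entropy :", train("cross-entropy", eta=0.005))
```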
Cross-entropy loss function
The cross-entropy function used before is relative to a single output neuron, but it can easily be defined for the output layer and for the entire training set (global loss).
For a single output neuron j on example k:
$$C_{kj} = -\,t_{kj} \ln y_j - (1 - t_{kj}) \ln (1 - y_j)$$
For the output layer L on example k:
$$C_k = \sum_{j=1}^{n_L} C_{kj}$$
For the output layer L on the entire training set of size M:
$$C = \frac{1}{M} \sum_{k=1}^{M} C_k$$
The cross-entropy function is nearly always the best choice, provided the output neurons are sigmoid neurons.

Where does C derive from?
We would like to get rid of σ′(a) in the error gradient:
$$\frac{\partial E}{\partial w_i} = (y - t)\,\sigma'(a)\,x_i$$
So we would like to have a loss function C such that:
$$\frac{\partial C}{\partial w_i} = \frac{\partial C}{\partial y}\,\frac{\partial y}{\partial a}\,\frac{\partial a}{\partial w_i} = (y - t)\,x_i$$
Note that, with respect to ∂E/∂wi, only the term σ′(a) is missing. Since
$$\frac{\partial y}{\partial a} = \sigma'(a) = y(1 - y) \qquad \text{and} \qquad \frac{\partial a}{\partial w_i} = x_i,$$
we need:
$$\frac{\partial C}{\partial y}\,\sigma'(a)\,x_i = (y - t)\,x_i$$
That is:
$$\frac{\partial C}{\partial y} = \frac{y - t}{\sigma'(a)} = \frac{y - t}{y(1 - y)}$$

Where does C derive from?
By adding yt − yt to the numerator we get:
$$\frac{\partial C}{\partial y} = \frac{y - t + yt - yt}{y(1 - y)} = \frac{y(1 - t) - t(1 - y)}{y(1 - y)} = \frac{1 - t}{1 - y} - \frac{t}{y}$$
Integrating with respect to y we have:
$$C = -\,t \int \frac{dy}{y} + (1 - t) \int \frac{dy}{1 - y} = -(1 - t) \ln (1 - y) - t \ln y$$
That is:
$$C = -\,t \ln y - (1 - t) \ln (1 - y)$$
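The key property, ∂C/∂wi = (y − t) xi, can be verified numerically with a finite-difference check (all values are illustrative):

```python
# Verify that for the cross-entropy loss dC/dw = (y - t) * x,
# i.e. the sigma'(a) factor cancels out.
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def C(w, b, x, t):
    y = sigmoid(w * x + b)
    return -t * np.log(y) - (1 - t) * np.log(1 - y)

w, b, x, t, eps = 0.7, -0.3, 1.5, 1.0, 1e-6
y = sigmoid(w * x + b)
numeric = (C(w + eps, b, x, t) - C(w - eps, b, x, t)) / (2 * eps)
analytic = (y - t) * x
print(numeric, analytic)    # the two values should match closely
```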
Is sigmoid the best?
One problem of the sigmoid function is that neuron outputs are always positive and, according to Backpropagation, the weight variations of a neuron are computed as:
$$\Delta \mathbf{w}_j = -\eta\,\delta_j\,\mathbf{x}$$
Since the elements of x are all positive, all weights of a neuron either increase or decrease together, depending on the sign of δj. That is a problem, since some of the weights may need to increase while others need to decrease. That can only happen if the elements of x can have different signs.
This reasoning suggests replacing the sigmoid with a function that allows both positive and negative activations.

The tanh function
A function often used to replace the sigmoid is the tanh:
$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}}$$
[Figure: plot of tanh(x) for x in [-4, 4]; it ranges from -1 to 1.]
The tanh function
Note that:
$$\tanh(x) = 2\,\sigma(2x) - 1 \qquad \text{and} \qquad \sigma(x) = \frac{1 + \tanh(x/2)}{2}$$
That is, tanh is just a rescaled version of the sigmoid function.
Proof:
$$\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = \frac{1 - e^{-2x}}{1 + e^{-2x}} = \frac{2 - (1 + e^{-2x})}{1 + e^{-2x}} = \frac{2}{1 + e^{-2x}} - 1 = 2\,\sigma(2x) - 1$$

The tanh function
It is worth observing that:
• Since tanh ranges from -1 to 1, inputs and outputs need to be normalized differently than with sigmoid networks.
• Backpropagation and stochastic gradient descent can also be applied to a network of tanh neurons.
• A network of tanh neurons can learn any function.
• Extensive experiments showed that, with respect to sigmoid neurons, tanh provides only small improvements.
• There is no formal proof that says which of the two functions allows learning faster or generalizing better for any application.

Are there better functions?
In both sigmoid and tanh, neurons stop learning when they saturate, since a small σ′(a) reduces the gradient.
Instead of changing the loss function from quadratic to cross-entropy, another approach could be to choose an activation function whose derivative does not decrease as the activation grows.
A commonly used function with this feature is the rectified linear function.
The rectified linear function
$$f(x) = \max(0, x)$$
[Figure: plot of f(x), zero for x < 0 and linear for x ≥ 0.]
The rectified linear function
Note that:
• Increasing the weights, and hence the input activation, a rectified linear unit will never saturate, so there is no learning slowdown.
• On the other hand, when the activation is negative, the gradient vanishes, so the neuron stops learning.
• Backpropagation and stochastic gradient descent can also be applied to a network of rectified linear neurons.
• A network of rectified linear neurons can learn any function.
• Experiments showed that rectified linear units can achieve considerable benefits over sigmoid and tanh neurons.
• There is no proof stating when such units are preferable.
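A quick comparison of the derivatives (the activation values are illustrative):

```python
# Derivatives of sigmoid, tanh, and the rectified linear function at growing
# activations: the first two vanish as a grows; ReLU's stays 1 for a > 0.
import numpy as np

def sigmoid_prime(a):
    s = 1.0 / (1.0 + np.exp(-a))
    return s * (1 - s)

def tanh_prime(a):
    return 1.0 - np.tanh(a) ** 2

def relu_prime(a):
    return (a > 0).astype(float)

a = np.array([0.5, 2.0, 5.0, 10.0])
print("sigmoid':", sigmoid_prime(a))   # -> tiny values for large a
print("tanh'   :", tanh_prime(a))
print("relu'   :", relu_prime(a))      # -> stays 1.0
```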
Softmax neurons
A softmax neuron is normally used in the output layer. Its activation is the weighted sum of its inputs, but its output is computed by the so-called softmax function, defined as:
$$y_j = f(a_j) = \frac{e^{a_j}}{\sum_{i=1}^{n} e^{a_i}}$$
where n is the number of neurons in the output layer.
Note that the outputs are positive and always sum up to 1:
$$\sum_{i=1}^{n} y_i = 1$$
Hence, the output from a softmax layer can be seen as a probability distribution, where each output yj is the estimated probability that the correct output is j. This property is quite convenient for classification!

Softmax neurons
Softmax neurons are used in combination with the so-called log-likelihood loss function, defined as:
$$C_k = -\sum_{i=1}^{n} t_{ki} \ln y_i$$
For example, with actual outputs y1 = 0.70, y2 = 0.10, y3 = 0.08, y4 = 0.07, y5 = 0.05 and target outputs tk1 = 0, tk2 = 1, tk3 = 0, tk4 = 0, tk5 = 0, we have:
$$C_k = -\ln y_2 \approx 2.3$$
This definition is sound since, for a pattern k of class j (tkj = 1):
yj → 1 ⇒ Ck is low
yj → 0 ⇒ Ck is high
Note also that $\sum_{j=1}^{n} t_{kj} = 1$.

Softmax neurons
Like for the previous cross-entropy loss function, the derivative ∂Ck/∂aj does not depend on σ′(a):
$$\frac{\partial C_k}{\partial a_j} = y_j - t_{kj}$$
To derive this result, we compute:
$$\frac{\partial C_k}{\partial a_j} = \sum_{i=1}^{n} \frac{\partial C_k}{\partial y_i}\,\frac{\partial y_i}{\partial a_j} \qquad \text{with} \qquad \frac{\partial C_k}{\partial y_i} = -\frac{t_{ki}}{y_i}$$
Since every output yi depends on all the activations, to compute ∂yi/∂aj we have to distinguish two cases: i = j and i ≠ j.
Case i = j:
$$\frac{\partial y_j}{\partial a_j} = \frac{e^{a_j} \sum_i e^{a_i} - (e^{a_j})^2}{\left(\sum_i e^{a_i}\right)^2} = y_j\,(1 - y_j)$$
Case i ≠ j:
$$\frac{\partial y_i}{\partial a_j} = \frac{0 - e^{a_i}\, e^{a_j}}{\left(\sum_i e^{a_i}\right)^2} = -\,y_i\, y_j$$

Softmax neurons
In summary:
$$\frac{\partial y_i}{\partial a_j} = \begin{cases} y_j\,(1 - y_j) & \text{if } i = j \\ -\,y_i\, y_j & \text{if } i \neq j \end{cases}$$
Hence:
$$\frac{\partial C_k}{\partial a_j} = -\frac{t_{kj}}{y_j}\, y_j (1 - y_j) - \sum_{i \neq j} \frac{t_{ki}}{y_i} (-\,y_i\, y_j) = -\,t_{kj}(1 - y_j) + \sum_{i \neq j} t_{ki}\, y_j = y_j \sum_{i=1}^{n} t_{ki} - t_{kj} = y_j - t_{kj}$$
since $\sum_{i=1}^{n} t_{ki} = 1$.
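A minimal sketch (activations and the one-hot target are illustrative) that computes the softmax and checks ∂Ck/∂aj = yj − tkj with finite differences:

```python
# Softmax output and a finite-difference check of dCk/da_j = y_j - t_kj.
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())            # subtracting max(a) for stability;
    return e / e.sum()                 # it cancels out in the ratio

def loglik_loss(a, t):
    return -np.sum(t * np.log(softmax(a)))

a = np.array([2.0, 0.1, -1.0, 0.5])
t = np.array([0.0, 1.0, 0.0, 0.0])     # one-hot target

analytic = softmax(a) - t
eps = 1e-6
numeric = np.array([
    (loglik_loss(a + eps * np.eye(4)[j], t) -
     loglik_loss(a - eps * np.eye(4)[j], t)) / (2 * eps)
    for j in range(4)
])
print(np.allclose(analytic, numeric))  # -> True
```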
Softmax neurons
In summary, a softmax output layer with log-likelihood cost is quite similar to a sigmoid output layer with cross-entropy cost:
$$y_j = \frac{e^{a_j}}{\sum_{i=1}^{n} e^{a_i}}, \qquad C_k = -\sum_{i=1}^{n} t_{ki} \ln y_i, \qquad \frac{\partial C_k}{\partial a_j} = y_j - t_{kj}$$
In many situations both approaches work well. The only real difference between the two cases is that a softmax layer provides estimates of classification probabilities.

How to address overfitting
Overfitting is the situation in which a neural network starts learning too many details of the training set (TS), losing the ability to generalize to new examples. This occurs when the network has many hidden neurons compared with the TS size.
[Figure: accuracy vs. epochs; the accuracy on the TS keeps growing toward 100%, while the accuracy on the validation set (VS) flattens out.]
This is a sign that the network is overfitting the TS, learning the noise in the data!
How to address overfitting
Overfitting is a serious problem when training large networks with many hidden neurons (like deep networks). Several techniques have been proposed to address overfitting and improve generalization. They include:
• early stopping;
• L2 regularization;
• L1 regularization;
• dropout;
• averaging or voting techniques;
• artificial expansion of the training set.

How to address overfitting
A method we considered to avoid overfitting is early stopping. Note that some non-trivial judgment is required to determine when to stop, looking at the trend of the accuracy on both the TS and the VS:
[Figure: accuracy vs. epochs for the TS and the VS, with the STOP point where the VS accuracy stops improving.]
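A sketch of the early-stopping logic (net, train_epoch, accuracy, weights, and set_weights are hypothetical helpers assumed for illustration, not an API from the lecture):

```python
# Early stopping with a "patience" window on validation accuracy.
def train_with_early_stopping(net, train_set, val_set, patience=10):
    best_acc, best_weights, epochs_since_best = 0.0, net.weights(), 0
    while epochs_since_best < patience:
        net.train_epoch(train_set)           # one pass of SGD (hypothetical)
        acc = net.accuracy(val_set)          # accuracy on the VS (hypothetical)
        if acc > best_acc:
            best_acc, best_weights = acc, net.weights()
            epochs_since_best = 0
        else:
            epochs_since_best += 1           # VS accuracy not improving
    net.set_weights(best_weights)            # roll back to the best model
    return best_acc
```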
How to address overfitting
One of the best approaches for reducing overfitting is to increase the size of the TS. With enough training data it is difficult to overfit, even for a very large network. Unfortunately, training data can be expensive or difficult to acquire, so this is not always a practical option.
Another approach is to reduce the number of hidden neurons (hence the number of degrees of freedom). However, large networks have the potential to be more powerful than small networks, so this is an option we would rather not adopt.
Regularization methods
Fortunately, there are other methods to reduce overfitting, even with a fixed network and fixed training data. These are known as regularization methods.
Regularization methods tend to prevent weights from reaching high values during learning. But why does keeping weights at low values help avoid overfitting?
Why regularization helps
A network with small weights is less sensitive to small input changes, so it learns based on patterns seen more often in the TS. By contrast, a network with large weights is more sensitive to small variations and tends to learn noise peculiarities of the TS.
For this reason, weight regularization helps networks to generalize better from what they learn.

L2 regularization
The most common method is known as L2 regularization or weight decay. The idea is to add an extra regularization term R to the cost function:
$$C = C_0 + R$$
where C0 is the original cost function and R is given by:
$$R = \frac{\lambda}{2M} \sum_{w} w^2$$
where λ > 0 is the regularization parameter and M is the size of the training set. Note that R does not include biases: biases are not subject to regularization, since regularizing them often does not change the results very much.

L2 regularization
Intuitively, the effect of R is to keep weights small. Large weights will only be allowed if they considerably improve C0. Basically, regularization balances between finding small weights and minimizing the original cost function:
minimize C0 ↔ small weights
The relative importance of these two elements depends on λ.

L2 regularization
To understand why R reduces overfitting, let's compute the weight variation Δw by differentiating the regularized cost function:
$$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{M}\,w$$
$$\Delta w = -\eta\,\frac{\partial C}{\partial w} = -\eta\,\frac{\partial C_0}{\partial w} - \frac{\eta\lambda}{M}\,w$$
$$w(t) = \left(1 - \frac{\eta\lambda}{M}\right) w(t-1) - \eta\,\frac{\partial C_0}{\partial w}$$
This is exactly the same as the gradient descent rule, except that each weight is rescaled by the factor (1 − ηλ/M), called weight decay.
Note that, since 0 < 1 − ηλ/M < 1, at first glance it seems that w should exponentially decrease towards zero. But that's not true, since w may also increase due to the other term.

L2 regularization
Using L2 regularization, the TS error decreases as without R, but the accuracy on the VS continues to increase.
[Figure: accuracy vs. epochs; curves for the TS accuracy, the VS accuracy with R, and the VS accuracy without R.]
Increasing the TS size and the number of hidden units may further improve the accuracy.
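The rescaling form of the update derived above can be checked numerically (η, λ, M, and the gradient values are illustrative):

```python
# One L2-regularized SGD step, written both as the C0 gradient plus
# (lambda/M)*w and as the equivalent "weight decay" rescaling of w.
import numpy as np

eta, lam, M = 0.1, 5.0, 50000            # learning rate, lambda, TS size
w = np.array([0.8, -1.5, 2.0])
grad_C0 = np.array([0.05, -0.02, 0.10])  # d(C0)/dw from backprop (fake values)

w_explicit = w - eta * (grad_C0 + (lam / M) * w)
w_decay = (1 - eta * lam / M) * w - eta * grad_C0
print(np.allclose(w_explicit, w_decay))  # -> True: same update rule
```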
L1 regularization
Another form of regularization is L1 regularization, in which the regularization term is defined as:
$$R = \frac{\lambda}{M} \sum_{w} |w|$$
Hence:
$$\frac{\partial C}{\partial w} = \frac{\partial C_0}{\partial w} + \frac{\lambda}{M}\,\mathrm{sgn}(w)$$
$$w(t) = w(t-1) - \frac{\eta\lambda}{M}\,\mathrm{sgn}\big(w(t-1)\big) - \eta\,\frac{\partial C_0}{\partial w}$$

L1 regularization
When |w| is large, L1 regularization shrinks the weight much less than L2 regularization. When |w| is small, L1 regularization shrinks the weight much more than L2 regularization.
The result is that L1 regularization tends to concentrate the weights in a relatively small number of high-importance connections, while the other weights are driven toward zero.
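A quick numeric comparison of the two shrinkage terms (η, λ, M, and the weight values are illustrative):

```python
# Per-step shrinkage applied to a weight by L1 (constant eta*lam/M toward 0)
# vs. L2 (proportional to w), ignoring the C0 gradient.
import numpy as np

eta, lam, M = 0.5, 5.0, 1000
for w in (5.0, 0.01):
    l1_step = eta * lam / M * np.sign(w)   # constant-size shrink
    l2_step = eta * lam / M * w            # proportional shrink
    print(f"w = {w:5.2f}: L1 shrink = {l1_step:.5f}, L2 shrink = {l2_step:.6f}")
```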
Dropout
Dropout is a radically different technique for regularization, proposed by Hinton et al. in 2012.
Unlike L1 and L2 regularization, dropout does not modify the cost function, but the network itself, by randomly deleting half of the hidden neurons at every learning iteration.
[Figure: during learning, each mini-batch of examples (xk, tk) is processed by a network in which a random half of the hidden neurons has been deleted.]

Dropout
In the operation phase, all hidden neurons are active, hence twice the neurons used in the learning phase. To compensate for that, the weights outgoing from the hidden neurons are halved:
$$w_{ji} \leftarrow \frac{w_{ji}}{2}$$
Why does dropout help reduce overfitting?
Using dropout is like training several networks with the same TS and then averaging the results. The averaging scheme is quite effective (although expensive) at reducing overfitting, because different networks overfit in different ways, so averaging is like filtering out the noisy behavior.
Dropout has been very successful in improving the performance of neural networks, reaching very high accuracy when used together with regularization methods.
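A minimal sketch of the mechanics described above (layer sizes and input are illustrative; the drop probability is 1/2 as in the lecture):

```python
# Dropout on a hidden layer during learning, with outgoing weights
# halved in the operation phase.
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(0, 1, (784, 30)); b1 = np.zeros(30)
W2 = rng.normal(0, 1, (30, 10));  b2 = np.zeros(10)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, training):
    h = sigmoid(x @ W1 + b1)
    if training:
        mask = rng.random(30) < 0.5     # keep a random half of the neurons
        return sigmoid((h * mask) @ W2 + b2)
    # operation phase: all neurons active, outgoing weights halved
    return sigmoid(h @ (W2 / 2) + b2)

x = rng.random(784)
print(forward(x, training=True))        # uses a random half of hidden units
print(forward(x, training=False))       # uses all units with halved weights
```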