The rise of neural networks

In the mid-2000s there was a resurgence of neural networks, mainly due to two reasons:
- high computational power became available at low cost via general-purpose graphics processing units (GPGPUs);
- major players like Google, Microsoft, and Facebook needed to analyze their huge amounts of data.

This resurgence led to a new neural model, known as "Deep Learning", for training networks of many layers, called "Deep Networks".

Deep networks

Deep networks are networks with more than 3 layers; the others are called shallow networks. Both are feed-forward networks trained with a supervised learning paradigm.

[Figure: a feed-forward network with an input layer, three hidden layers, and an output layer.]

Why many layers?

Why should we use many layers, if the universal approximation theorem proved that a neural network with three layers can approximate any function? More precisely, the universal approximation theorem says that, given any function f(x) of the input vector x and an error ε > 0, there exist a number N of hidden units and a set of weights that approximate the function within the given error. However, the theorem does not say how to find such weights! In other words, even though a 3-layer network can theoretically approximate any function, finding the right weights could take a very long time for complex functions.

OK, but why could a network with many layers solve the same problem more easily?

Consider a 3-layer network for handwritten digit recognition on the MNIST image set, with 28 × 28 input neurons and 10 output neurons. Why do we use 10 output neurons and not 4 (in binary code)? Typically, what the hidden neurons learn is to recognize the presence of elementary features in the input image, and what each output neuron has to learn is to integrate these features by increasing the corresponding weights. If we had 4 outputs instead, each output neuron would have to learn one bit of the binary code of the digit (for example, the most significant bit is 1 only for digits 8 and 9, while the least significant bit is 1 for all odd digits 1, 3, 5, 7, 9), and there is no easy way to relate this information to simple features of the input image.

However, it is much easier to learn a bitwise representation by adding an extra layer that makes the conversion. Each layer then learns a more sophisticated representation of the data: simple features → digits → binary code.
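As a concrete reference for the discussion above, here is a minimal sketch (in NumPy) of the forward pass of such a shallow 3-layer network. The hidden-layer size of 30, the Gaussian initialization, and the random stand-in input are assumptions made only for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Sizes from the lecture: 28x28 = 784 inputs, 10 outputs (one per digit);
# 30 hidden units is an assumption for the sketch.
rng = np.random.default_rng(0)
W1 = rng.normal(0, 1, (30, 784))   # input -> hidden weights
b1 = rng.normal(0, 1, 30)
W2 = rng.normal(0, 1, (10, 30))    # hidden -> output weights
b2 = rng.normal(0, 1, 10)

def forward(x):
    """Forward pass of the shallow (3-layer) network."""
    h = sigmoid(W1 @ x + b1)       # hidden layer: elementary features
    return sigmoid(W2 @ h + b2)    # output layer: one neuron per digit

x = rng.random(784)                # stand-in for a flattened MNIST image
print(forward(x).argmax())         # predicted digit: index of largest output
```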
Why many layers?

Let's now consider a more complex problem, such as face recognition. Assuming a 3-layer network is able to learn to classify faces ("it's a face"), it is quite hard to understand from its weights what the network has really learned.

Mysterious behavior

In other words, the final behavior is quite mysterious: the weights are discovered automatically, and we do not understand how the network is doing its job. We cannot make predictions! Would you trust an autonomous car that takes driving decisions based on a neural network that nobody understands?

A better approach

A better way to approach face recognition is to decompose the problem into sub-problems:
- is there an "eye" in the top left?
- is there an "eye" in the top right?
- is there a "nose" in the middle?
- is there a "mouth" in the bottom middle?
- ...

If the answer to several of these questions is "yes", then we conclude that the image is likely to be a face.

Feature detectors

Hence the architecture becomes: an eye detector in the top left, an eye detector in the top right, a nose detector in the middle, and a mouth detector in the bottom, all feeding a final face classifier.

Sub-feature detectors

For example, the eye detector can in turn be decomposed into modules for detecting the eyebrow, the eyelashes, the iris, the eye shape, and so on. Clearly, sub-problems can be decomposed further to detect simpler and simpler features, and this idea can be iterated until features are decomposed into many very simple features located in small regions of pixels.

Feature composition

The result is a network of many layers (a deep neural network), with early layers detecting very simple features of the image, and later layers building more and more complex abstractions: elementary shapes → simple features → macro features → face (an N-class classifier).

Two main problems

Unfortunately, increasing the number of hidden layers leads to two known problems:
1. Vanishing gradient: as we add more and more hidden layers, Backpropagation becomes less and less effective on the lower layers, since the gradient becomes smaller and smaller (hence "vanishing").
2. Overfitting: in networks with a large number of neurons (hence, many degrees of freedom), the network tends to fit the training data too closely, performing really well on the training set but very poorly on other examples.

These problems remained unsolved until 2006, when several methods were developed to learn deep neural networks.

Vanishing gradient

This problem was formally identified for the first time in 1991 by Sepp Hochreiter in his master's thesis. If we train a network for digit classification with an increasing number of hidden layers, we see that the accuracy does not improve as expected.

[Figure: accuracy (%) as a function of the number of hidden layers (1 to 4): the accuracy does not grow with the depth.]

To understand what is going on, let's monitor the gradient of the neurons in each hidden layer, since it gives an indication of how quickly each neuron is learning. Each neuron updates its weights according to Δw_ji = −η δ_j x_i, so if δ^l is the gradient vector of layer l, its norm ∥δ^l∥ gives a rough measure of the learning speed of layer l.

The experiment uses 28 × 28 = 784 input neurons, 10 output neurons, 30 hidden neurons per layer, learning rate η = 0.1, regularization λ = 5.0, mini-batch size m = 10, and at most 30 epochs.

Monitoring ∥δ^l∥ for a network of 4 hidden layers, at the start of learning we get ∥δ^1∥ ≈ 0.003, ∥δ^2∥ ≈ 0.017, ∥δ^3∥ ≈ 0.070, ∥δ^4∥ ≈ 0.285: the speed of learning decreases exponentially proceeding from the output towards the input.

[Figure: speed of learning ∥δ^l∥ on a log scale (10⁻⁶ to 10⁻¹) over 500 epochs, for hidden layers l = 1 to 4.]

Monitoring ∥δ^l∥ during learning confirms that the gradient decreases exponentially as we move backward through the hidden layers.

Why does the gradient vanish?

To understand the issue, consider the simplest deep network, a chain with one neuron per layer: x → w2 → w3 → w4 → w5 → y5. Let's derive ∂E/∂b2, that is, the gradient of the bias of the first hidden neuron:

∂E/∂b2 = σ′(a2) · w3 σ′(a3) · w4 σ′(a4) · w5 σ′(a5) · ∂E/∂y5

Except for ∂E/∂y5, this is a product of terms of the form wj σ′(aj).
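The shrinking of this product can be checked numerically. The following sketch multiplies together terms of the form wj σ′(aj) for a chain of single neurons with Gaussian(0, 1) weights; the depth of 20 and the random pre-activations are illustrative assumptions.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def dsigmoid(a):
    s = sigmoid(a)
    return s * (1.0 - s)            # maximum value is 0.25, at a = 0

rng = np.random.default_rng(1)
depth = 20                          # one neuron per layer, chained
w = rng.normal(0, 1, depth)         # Gaussian(0,1) initialization
a = rng.normal(0, 1, depth)         # pre-activations (arbitrary here)

# |dE/db_1| is proportional to the product of the w_j * sigma'(a_j) terms.
grad = 1.0
for j in range(depth):
    grad *= w[j] * dsigmoid(a[j])
    print(f"after layer {j + 1:2d}: |gradient factor| = {abs(grad):.3e}")
```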
In general, for a chain of L layers, the gradient of the bias of the neuron in layer l is:

∂E/∂b_l = σ′(a_l) · [ ∏_{j = l+1 .. L} wj σ′(aj) ] · ∂E/∂y_L

[Figure: plot of σ′(a) for a in [−4, 4]: a bell-shaped curve whose maximum value is 0.25, reached at a = 0.]

The derivative of the sigmoid satisfies σ′(a) ≤ 0.25. If we initialize the weights using a Gaussian(0, 1), then typically |wj| < 1, hence |wj σ′(aj)| < 0.25. Therefore, the product of many such terms decreases exponentially with the number of terms!

You could argue that if the weights wj grow during training, then it could no longer be true that |wj σ′(aj)| < 1/4. Indeed, if the terms get large enough, greater than 1, we no longer have a vanishing gradient but an exploding gradient. Note, however, that with sigmoid neurons the gradient is more likely to vanish than to explode. To explode we need |wj σ′(aj)| ≥ 1, and this is not so easy to achieve, because σ′(a) also depends on w: σ′(a) = σ′(w x + b), so a large w gives a large a, and hence a small σ′(a). The only way to make |wj σ′(aj)| ≥ 1 is for the input x to fall within a small range of values. Sometimes that happens, but more often it does not. Hence, the gradient is more likely to vanish.

Improving Backpropagation

Several techniques have been proposed to improve Backpropagation and make it suitable for deep networks. They include:
- a better choice of the loss function;
- a better choice of the activation function;
- regularization methods to address overfitting and improve generalization.

Loss function

We have seen that the initial weight values strongly affect the learning speed.

[Figure: error E over 300 epochs for a single sigmoid neuron: with w = 0.6, b = 0.9 the error decreases quickly; with w = 2.0, b = 2.0 learning starts very slowly.]

The problem is due to the fact that Δw is proportional to ∂E/∂w and, since E = ½(t − y)², ∂E/∂w is proportional to σ′(a).

A better loss function

We said that this problem can be mitigated by initializing the weights with small random values. Another approach is to replace the quadratic cost function E = ½(t − y)² with a different loss function C, known as the cross-entropy function:

C = −[t ln y + (1 − t) ln (1 − y)]

Note that C satisfies two properties of good cost functions:
1. C > 0, since both log arguments are in the range (0, 1);
2. C → 0 when y → t.

The second property, however, holds only if t can be either 0 or 1, as in most classification problems.

[Figure: rerunning the experiment with the cross-entropy loss (and η = 0.005), learning is fast for both initializations, w = 0.6, b = 0.9 and w = 2.0, b = 2.0.]

Note that the values of η used with different loss functions cannot be compared.

Unlike that of the quadratic error function, the derivative of the cross-entropy loss function does not depend on σ′(a):

∂C/∂wi = (∂C/∂y) (∂y/∂a) (∂a/∂wi) = (y − t) xi

Since Δw = −η ∂C/∂w, the larger the error, the faster the neuron will learn!

Cross-entropy loss function

The cross-entropy function used above refers to a single output neuron, but it can easily be extended to the whole output layer and to the entire training set (global loss):

- for a single output neuron j on example k: C_kj = −[t_kj ln y_j + (1 − t_kj) ln (1 − y_j)];
- for the output layer L on example k: C_k = Σ_{j = 1 .. n_L} C_kj;
- for the entire training set: C = (1/M) Σ_{k = 1 .. M} C_k.

The cross-entropy function is nearly always the best choice, provided the output neurons are sigmoid neurons.
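The difference between the two gradients can be checked numerically for a single sigmoid neuron in the badly-initialized case from the earlier experiment (w = 2.0, b = 2.0); the input x = 1 and target t = 0 are assumptions made for illustration.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# One sigmoid neuron, initialized so that it saturates near 1
# while the target is 0 (the w = 2.0, b = 2.0 case from the slides).
w, b, x, t = 2.0, 2.0, 1.0, 0.0
a = w * x + b
y = sigmoid(a)

# Quadratic loss E = 1/2 (t - y)^2: the gradient carries a sigma'(a) factor.
grad_quadratic = (y - t) * y * (1.0 - y) * x

# Cross-entropy C = -[t ln y + (1 - t) ln(1 - y)]: the sigma'(a)
# factor cancels and the gradient is just (y - t) x.
grad_cross_entropy = (y - t) * x

print(grad_quadratic)      # tiny: the saturated neuron barely learns
print(grad_cross_entropy)  # close to 1: learning stays fast
```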
Where does C derive from?

We would like to get rid of σ′(a) in the gradient of the error function:

∂E/∂wi = (y − t) σ′(a) xi

So we would like a loss function C such that:

∂C/∂wi = (∂C/∂y) (∂y/∂a) (∂a/∂wi) = (y − t) xi

Note that, with respect to ∂E/∂wi, only the term σ′(a) is missing. Since ∂a/∂wi = xi and ∂y/∂a = σ′(a) = y (1 − y), we need:

∂C/∂y · y (1 − y) = y − t, that is, ∂C/∂y = (y − t) / (y (1 − y))

By adding yt − yt to the numerator we get:

∂C/∂y = (y − t + yt − yt) / (y (1 − y)) = (y (1 − t) − t (1 − y)) / (y (1 − y)) = (1 − t)/(1 − y) − t/y

Integrating with respect to y we have:

C = −(1 − t) ln (1 − y) − t ln y = −[t ln y + (1 − t) ln (1 − y)]

which is exactly the cross-entropy function.

Is sigmoid the best?

One problem of the sigmoid function is that neuron outputs are always positive, and according to Backpropagation the weight variations are computed as Δwj = −η δj x. Since the elements of x are all positive, all the weights of a neuron either increase or decrease together, depending on the sign of δj. That is a problem, since some of the weights may need to increase while others need to decrease, and that can only happen if the elements of x can have different signs. This reasoning suggests replacing the sigmoid with a function that allows both positive and negative activations.

The tanh function

A function often used to replace the sigmoid is the tanh:

tanh(x) = (e^x − e^−x) / (e^x + e^−x)

[Figure: plot of tanh(x), which ranges from −1 to 1.]

Note that tanh is just a rescaled version of the sigmoid: tanh(x) = 2σ(2x) − 1, or equivalently σ(x) = (1 + tanh(x/2)) / 2. Proof:

tanh(x) = (e^x − e^−x) / (e^x + e^−x) = (1 − e^−2x) / (1 + e^−2x) = 2 / (1 + e^−2x) − 1 = 2σ(2x) − 1

It is worth observing that:
- since tanh ranges from −1 to 1, inputs and outputs need to be normalized differently than with sigmoid networks;
- Backpropagation and stochastic gradient descent can also be applied to a network of tanh neurons;
- a network of tanh neurons can learn any function.

Extensive experiments showed that tanh provides only small improvements with respect to sigmoid neurons, and there is no formal proof saying which of the two functions learns faster or generalizes better for a given application.

Are there better functions?

With both sigmoid and tanh, neurons stop learning when they saturate, since f′(a) reduces the gradient. Instead of changing the loss function from quadratic to cross-entropy, another approach is to choose an activation function f whose derivative does not decrease as the activation grows. A commonly used function with this feature is the rectified linear function:

f(x) = max(0, x)

The rectified linear function

However much we increase the weights, and hence the input activation, a rectified linear unit will never saturate, so there is no learning slowdown. On the other hand, when the activation is negative the gradient vanishes, and the neuron stops learning. As before:
- Backpropagation and stochastic gradient descent can also be applied to a network of rectified linear neurons;
- a network of rectified linear neurons can learn any function.

Experiments showed that rectified linear units can achieve considerable benefits over sigmoid and tanh neurons, although there is no proof stating when such units are preferable.
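The saturation argument can be seen directly from the derivatives. Here is a small sketch comparing the three derivatives at growing activations; the sample points are arbitrary.

```python
import numpy as np

def dsigmoid(a):
    s = 1.0 / (1.0 + np.exp(-a))
    return s * (1.0 - s)              # at most 0.25, and -> 0 as |a| grows

def dtanh(a):
    return 1.0 - np.tanh(a) ** 2      # at most 1, but still -> 0 as |a| grows

def drelu(a):
    return 1.0 if a > 0 else 0.0      # exactly 1 for any positive activation

for a in [0.5, 2.0, 5.0, 10.0]:
    print(f"a = {a:5.1f}   sigmoid': {dsigmoid(a):.5f}   "
          f"tanh': {dtanh(a):.5f}   relu': {drelu(a):.1f}")
```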
Softmax neurons

A softmax neuron is normally used in the output layer. Its activation a_j is the weighted sum of its inputs, but its output is computed by the so-called softmax function, defined as:

y_j = f(a_j) = e^{a_j} / Σ_{i = 1 .. n} e^{a_i}

where n is the number of neurons in the output layer. Note that the outputs are positive and always sum up to 1. Hence, the output of a softmax layer can be seen as a probability distribution, where each output y_j is the estimated probability that the correct output is j. This property is quite convenient for classification!

Softmax neurons are used in combination with the so-called log-likelihood loss function, defined as:

C_k = −Σ_{i = 1 .. n} t_ki ln y_i

For example, with actual outputs y = (0.70, 0.10, 0.08, 0.07, 0.05) and target outputs t_k = (0, 1, 0, 0, 0), we get C_k = −ln y_2 = −ln 0.10 ≈ 2.3.

This definition is sound since, for a pattern k of class j (t_kj = 1):
- y_j → 1 implies C_k is low;
- y_j → 0 implies C_k is high.

Like the cross-entropy loss function seen before, the derivative ∂C_k/∂a_j does not depend on σ′(a):

∂C_k/∂a_j = y_j − t_kj

To derive this result, we compute:

∂C_k/∂a_j = Σ_{i = 1 .. n} (∂C_k/∂y_i) (∂y_i/∂a_j), with ∂C_k/∂y_i = −t_ki / y_i

Since every y_i depends on a_j, to compute ∂y_i/∂a_j we have to distinguish two cases, i = j and i ≠ j.

Case i = j:

∂y_j/∂a_j = [e^{a_j} Σ_i e^{a_i} − (e^{a_j})²] / (Σ_i e^{a_i})² = y_j (1 − y_j)

Case i ≠ j:

∂y_i/∂a_j = −e^{a_i} e^{a_j} / (Σ_i e^{a_i})² = −y_i y_j

Hence, using Σ_{i = 1 .. n} t_ki = 1:

∂C_k/∂a_j = −(t_kj / y_j) y_j (1 − y_j) + Σ_{i ≠ j} (t_ki / y_i) y_i y_j
          = −t_kj (1 − y_j) + y_j Σ_{i ≠ j} t_ki
          = −t_kj + t_kj y_j + y_j (1 − t_kj)
          = y_j − t_kj

In summary, a softmax output layer with the log-likelihood cost is quite similar to a sigmoid output layer with the cross-entropy cost. In many situations both approaches work well. The only real difference is that a softmax layer provides estimates of the classification probabilities.
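A minimal sketch of a softmax layer with the log-likelihood loss, set up so the outputs reproduce the five-neuron example above. The activations are chosen as ln y_j, an illustrative trick that works because the example outputs already sum to 1.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())            # subtracting the max avoids overflow
    return e / e.sum()

# Activations chosen so the outputs match the example in the slides:
# softmax(ln y) = y whenever the y values already sum to 1.
a = np.log(np.array([0.70, 0.10, 0.08, 0.07, 0.05]))
t = np.array([0.0, 1.0, 0.0, 0.0, 0.0])   # pattern of class 2

y = softmax(a)
Ck = -np.sum(t * np.log(y))           # log-likelihood loss: -ln y_2 here
grad = y - t                          # dCk/da_j = y_j - t_kj, no sigma'(a)

print(y.round(2), Ck.round(2))        # [0.7 0.1 0.08 0.07 0.05] 2.3
print(grad.round(2))                  # [0.7 -0.9 0.08 0.07 0.05]
```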
How to address overfitting

Overfitting is the situation in which a neural network learns too many details of the training set (TS), losing the ability to generalize to new examples. It occurs when the network has many hidden neurons compared with the TS size.

[Figure: classification accuracy over the epochs: the accuracy on the TS approaches 100%, while the accuracy on the validation set (VS) stops improving.]

This is a sign that the network is overfitting the TS, learning the noise in the data!

Overfitting is a serious problem when training large networks with many hidden neurons (like deep networks). Several techniques have been proposed to address overfitting and improve generalization. They include:
- early stopping;
- L2 regularization;
- L1 regularization;
- dropout;
- averaging or voting techniques;
- artificial expansion of the training set.

A method we have already considered to avoid overfitting is early stopping. Note that some non-trivial judgment is required to determine when to stop, looking at the trends on both the TS and the VS.

[Figure: accuracy on TS and VS over the epochs, with the stopping point marked where the VS accuracy flattens.]

One of the best approaches for reducing overfitting is to increase the size of the TS: with enough training data it is difficult to overfit, even for a very large network. Unfortunately, training data can be expensive or difficult to acquire, so this is not always a practical option. Another approach is to reduce the number of hidden neurons (hence the number of degrees of freedom). However, large networks have the potential to be more powerful than small networks, so this is an option we would rather not adopt.

Regularization methods

Fortunately, there are other methods to reduce overfitting, even when we have a fixed network and fixed training data. These are known as regularization methods. Regularization methods tend to prevent the weights from reaching high values during learning. But why does keeping the weights at low values help avoid overfitting?

Why regularization helps

A network with small weights is less sensitive to small input changes, so it learns based on patterns seen more often in the TS. By contrast, a network with large weights is more sensitive to small variations and tends to learn the noise peculiarities of the TS. For this reason, weight regularization helps networks generalize better from what they learn.

L2 regularization

The most common regularization method is known as L2 regularization, or weight decay. The idea is to add an extra regularization term R to the cost function:

C = C0 + R, with R = (λ / 2M) Σ_w w²

where C0 is the original cost function, λ > 0 is the regularization parameter, and M is the size of the training set. Note that R does not include biases: biases are not subject to regularization, since regularizing them often does not change the results very much.

Intuitively, the effect of R is to keep the weights small: large weights will only be allowed if they considerably improve C0. Basically, regularization balances between finding small weights and minimizing the original cost function; the relative importance of these two elements depends on λ.

To understand why R reduces overfitting, let's compute the weight variation Δw by differentiating the regularized cost function:

∂C/∂w = ∂C0/∂w + (λ/M) w

so the update rule becomes:

w(t) = (1 − ηλ/M) w(t−1) − η ∂C0/∂w

This is exactly the same as the gradient descent rule, except that each weight is first rescaled by a factor (1 − ηλ/M), called the weight decay. Since 0 < 1 − ηλ/M < 1, at first glance it seems that w should exponentially decrease towards zero. But that is not true, since w may increase due to the other term.

Using L2 regularization, the TS error decreases as it does without R, but the accuracy on the VS continues to increase. Increasing the TS size and the number of hidden units may further improve the accuracy.

[Figure: accuracy over the epochs for A_TS, A_VS with R, and A_VS without R: the VS accuracy with R keeps growing.]

L1 regularization

Another form of regularization is L1 regularization, in which the regularization term is defined as:

R = (λ/M) Σ_w |w|

Hence:

∂C/∂w = ∂C0/∂w + (λ/M) sgn(w)

w(t) = w(t−1) − (ηλ/M) sgn(w(t−1)) − η ∂C0/∂w

When |w| is large, L1 regularization shrinks the weight much less than L2 regularization; when |w| is small, it shrinks the weight much more. The result is that L1 regularization tends to concentrate the weights in a relatively small number of high-importance connections, while the other weights are driven toward zero.
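The two update rules can be compared side by side. A minimal sketch, using the η and λ values from the earlier experiment; the training-set size M = 50,000 is an assumption (the usual MNIST TS size).

```python
import numpy as np

# eta and lam are the values used in the slides' experiments;
# M = 50000 is an assumption (the usual MNIST training-set size).
eta, lam, M = 0.1, 5.0, 50000

def step_l2(w, grad_C0):
    # weight decay: each weight is first rescaled by (1 - eta*lam/M)
    return (1.0 - eta * lam / M) * w - eta * grad_C0

def step_l1(w, grad_C0):
    # L1: a constant shrink of eta*lam/M toward zero, via sgn(w)
    return w - (eta * lam / M) * np.sign(w) - eta * grad_C0

w = np.array([2.0, 0.01, -0.5])
print(step_l2(w, np.zeros(3)))   # large weights shrink more (proportionally)
print(step_l1(w, np.zeros(3)))   # every weight shrinks by the same amount
```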
Dropout

Dropout is a radically different regularization technique, proposed by Hinton et al. in 2012. Unlike L1 and L2 regularization, dropout does not modify the cost function, but the network itself, by randomly deleting half of the hidden neurons at every learning iteration.

[Figure: at each learning iteration on a mini-batch example (x_k, t_k), a different random half of the hidden neurons is removed before computing the output y and backpropagating the error.]

In the operation phase, all hidden neurons are active, hence twice as many neurons are used as in the learning phase. To compensate for that, the weights outgoing from the hidden neurons are halved: w_ji ← w_ji / 2.

Why does dropout help reduce overfitting? Using dropout is like training several networks on the same TS and then averaging their results. The averaging scheme is quite effective (although expensive) at reducing overfitting, because different networks overfit in different ways, so averaging is like filtering out the noisy behavior.

Dropout has been very successful in improving the performance of neural networks, reaching very high accuracy when used together with regularization methods.
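A minimal sketch of the two phases, assuming a sigmoid hidden layer and toy layer sizes (784 → 30 → 10, as in the earlier digit example).

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, W1, b1, W2, b2, train):
    h = sigmoid(W1 @ x + b1)
    if train:
        # learning phase: randomly delete (zero out) half of the
        # hidden neurons; each neuron is kept with probability 0.5
        h = h * (rng.random(h.shape) < 0.5)
        return sigmoid(W2 @ h + b2)
    # operation phase: all hidden neurons are active, so the weights
    # outgoing from the hidden layer are halved to compensate
    return sigmoid((W2 / 2.0) @ h + b2)

# toy sizes and random weights, assumed only for the sketch
W1 = rng.normal(0, 1, (30, 784)); b1 = np.zeros(30)
W2 = rng.normal(0, 1, (10, 30));  b2 = np.zeros(10)
x = rng.random(784)

print(forward(x, W1, b1, W2, b2, train=True).round(2))
print(forward(x, W1, b1, W2, b2, train=False).round(2))
```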