DEEP LEARNING REVIEW
by Girish Dharamveer Sukhwani

What are Neural Networks?
• Consider an N-dimensional input x.
• Neural networks contain:
  o An input layer
  o Zero or more hidden layers
  o An output layer
  o Weights associated with all connections between units
• Neural networks with more than one hidden layer are called Deep Neural Networks.
[Figure: a single unit with inputs x_1, ..., x_n, weights w_1, ..., w_n, and output y]

Simple Models of Neurons
• Linear Neurons
• Binary Threshold Neurons
• Rectified Linear Neurons (Linear Threshold Neurons)
• Sigmoid Neurons
• Stochastic Binary Neurons

Linear Neurons
• Simple to implement.
• Computationally limited.
• The output function:
$$y = \sum_{i=1}^{n} w_i x_i + b$$
where
  o n is the number of dimensions or features
  o w is the weight vector
  o x is the input vector
  o b is the bias (offset from the origin)
[Figure: plot of the output y against $\sum_i w_i x_i + b$]

Binary Threshold Neurons
• First compute the weighted sum ($w^T x$) of the inputs.
• If the weighted sum exceeds a threshold, the output is 1, otherwise 0.
• There are two equivalent ways to write the equations for binary threshold neurons.
  o With an explicit threshold:
$$z = \sum_i x_i w_i, \qquad y = \begin{cases} 1 & \text{if } z \ge \theta \\ 0 & \text{otherwise} \end{cases}$$
  o With a bias:
$$z = b + \sum_i x_i w_i, \qquad y = \begin{cases} 1 & \text{if } z \ge 0 \\ 0 & \text{otherwise} \end{cases}$$
• Comparing the two forms gives $\theta = -b$.

Linear Threshold Neurons
• Also known as Rectified Linear Neurons.
• Combine the properties of binary threshold and linear neurons.
• They compute the weighted sum ($w^T x$) of their inputs.
• They then give an output that is a non-linear function of the total input:
$$z = b + \sum_i x_i w_i, \qquad y = \begin{cases} z & \text{if } z \ge 0 \\ 0 & \text{otherwise} \end{cases}$$

Sigmoid Neurons
• Very commonly used neurons.
• They give a real-valued output that is a smooth and bounded function of their total input.
• First, the weighted sum ($w^T x$) of the inputs is computed:
$$z = b + \sum_i x_i w_i$$
• Then the logistic function is applied to the weighted sum:
$$y = \frac{1}{1 + e^{-z}}$$
• The logistic function has smooth derivatives, which makes learning easier.
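The four deterministic neuron models above differ only in the function applied to the total input z. The following is a minimal sketch in plain Python, not taken from the slides; the function names and the example weights, inputs and bias are illustrative assumptions.

```python
import math

def total_input(w, x, b):
    """Weighted sum of the inputs plus the bias: z = b + sum_i x_i * w_i."""
    return b + sum(wi * xi for wi, xi in zip(w, x))

def linear(z):
    """Linear neuron: the output is the total input itself."""
    return z

def binary_threshold(z):
    """Binary threshold neuron in bias form: 1 if z >= 0, otherwise 0."""
    return 1 if z >= 0 else 0

def rectified_linear(z):
    """Rectified linear neuron: z if z >= 0, otherwise 0."""
    return z if z >= 0 else 0.0

def sigmoid(z):
    """Sigmoid (logistic) neuron: smooth, bounded output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical example values, chosen only for illustration.
w = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 0.5]
b = 0.25
z = total_input(w, x, b)  # 0.25 + (0.5 - 2.0 + 1.0) = -0.25

outputs = {
    "linear": linear(z),
    "binary_threshold": binary_threshold(z),
    "rectified_linear": rectified_linear(z),
    "sigmoid": sigmoid(z),
}
```

With a slightly negative total input, the threshold and rectified units output 0, while the sigmoid unit outputs a value just below 0.5.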
Stochastic Binary Neurons
• Use the same equations as the logistic neurons:
$$z = b + \sum_i x_i w_i, \qquad p(s = 1) = \frac{1}{1 + e^{-z}}$$
• The output of the logistic function (whose input z is known as the “logit”) is treated as the probability of producing a spike (output label 1).
• Instead of outputting the probability as a real number, they make a probabilistic decision to output either 1 or 0.
• A similar trick works for rectified linear units: the real-valued output is treated as the Poisson rate of producing spikes.

Types of Learning
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning

Supervised Learning
• Learn to predict an output given an input vector.
• Each training case consists of an input vector x and a target output t.
• Two types of tasks:
  i. Regression: The target output is a real number or a whole vector of real numbers. For example, the temperature at a given time tomorrow or the price of a stock in six months' time.
  ii. Classification: The target output is a class label. For example, binary (0 or 1) or multiclass (classifying images of hand-written digits).
• First, select a model $y = f(x; w)$ that maps the input vector x to the output y using some parameters w (weights).
• Then, compute an objective function that measures the error (or distance) between the predicted output y and the target output t.
• Finally, use the error to modify the adjustable parameters w so that the error is reduced.

Unsupervised Learning
• Discover a good, meaningful internal representation of the input that can then be used for supervised or reinforcement learning.
• Provides a compact, low-dimensional representation of higher-dimensional inputs.
• PCA is one such linear method for finding a low-dimensional representation.
• Provides an economical high-dimensional representation of the input in terms of learned features.
  o Binary features require only one bit per value.
  o Real-valued features that are nearly all zero are also economical.
• Finds similarity between data points using clustering.
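The three-step supervised learning recipe described above (select a model, compute an objective, adjust the weights to reduce the error) can be sketched for the simplest case. This is an illustrative example only: the linear model, the squared-error objective, the toy data set (targets generated by t = 2x + 1) and the learning rate are all assumptions, not part of the slides.

```python
def predict(w, b, x):
    """Model y = f(x; w): a linear neuron."""
    return b + sum(wi * xi for wi, xi in zip(w, x))

def squared_error(w, b, cases):
    """Objective: half the summed squared difference between y and t."""
    return 0.5 * sum((t - predict(w, b, x)) ** 2 for x, t in cases)

def gradient_step(w, b, cases, lr):
    """Move the parameters a small step against the error gradient."""
    grad_w = [0.0] * len(w)
    grad_b = 0.0
    for x, t in cases:
        err = predict(w, b, x) - t  # dE/dy for this training case
        for i, xi in enumerate(x):
            grad_w[i] += err * xi
        grad_b += err
    new_w = [wi - lr * g for wi, g in zip(w, grad_w)]
    new_b = b - lr * grad_b
    return new_w, new_b

# Hypothetical regression data: each case is (input vector x, target t).
cases = [([0.0], 1.0), ([1.0], 3.0), ([2.0], 5.0), ([3.0], 7.0)]

w, b = [0.0], 0.0
initial_error = squared_error(w, b, cases)
for _ in range(2000):
    w, b = gradient_step(w, b, cases, lr=0.05)
final_error = squared_error(w, b, cases)
# After training, w[0] should be close to 2 and b close to 1.
```

The same loop applies to classification once the model and the objective are swapped for ones suited to class labels.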
Reinforcement Learning
• Learn to select actions or sequences of actions that maximize the payoff.
• The only supervisory signal is a scalar reward, which may occur only occasionally.
• A discount factor is usually applied to delayed rewards, so we don’t have to look too far into the future.
• Reinforcement learning is difficult:
  o The blame attribution problem.
  o The explore-exploit dilemma.
  o Dynamic environments.
• Typically learns far fewer parameters than supervised or unsupervised learning, since a scalar reward carries little information.

Types of Neural Network Architectures
• Feed-forward Neural Networks
• Recurrent Neural Networks
• Symmetrically Connected Neural Networks

Feed-forward Neural Networks
• The most common type of neural network architecture.
• Information flows in one direction, from input to output, hence the name “feed-forward”.
• They compute a series of transformations that change the similarities between cases.
  o This is achieved by making the activities in each layer a non-linear function of the activities in the layer below.

Recurrent Neural Networks
• These have directed cycles in their connection graph.
• Information can flow around in cycles and can sometimes get back to where it started.
• More complicated to train because of the more complicated dynamics.
• More biologically realistic.
• Can efficiently model sequential data.
• They have the ability to remember information in their hidden state for a long time.
  o It is very hard to train them to use this potential.

Symmetrically Connected Neural Networks
• Similar to recurrent networks, but the connections between units have the same weight in both directions (symmetry).
• John Hopfield realized that symmetric networks are easier to analyze than recurrent networks.
• More restricted in what they can do, because they obey an energy function.
• Symmetrically connected neural networks without hidden units are called “Hopfield Nets”.
• Hopfield nets provide a model for understanding human memory.

Statistical Pattern Recognition
• Convert the raw input vector to a vector of feature activations.
• Use hand-written programs to do so.
• Learn a weight for each feature activation, and combine them into a single scalar quantity.
• If the quantity is greater than a threshold, decide that the input vector is a positive example; otherwise decide that it is a negative example.
• The standard perceptron architecture is the first generation of neural networks and is derived from this paradigm of statistical pattern recognition.

Perceptron
• The decision units are Binary Threshold Neurons.
• To learn the bias, append a constant 1 to every input vector and treat the bias as one more weight, learned in the same way as the others.
• The perceptron algorithm:
  o Pick a data point and compute the output y by thresholding the weighted sum $w^T x$ of the input vector.
  o If y == t, leave the weights alone.
  o If y != t with t = 1 and y = 0, add the input vector to the weight vector.
  o If y != t with t = 0 and y = 1, subtract the input vector from the weight vector.
• This is guaranteed to find a set of weights that gets the right answer for all the training cases, if any such set of weights exists.

A Geometrical View of Perceptrons
• Consider weight space, which has one dimension per weight.
• Each point represents a setting of all the weights, and each training case is a hyperplane passing through the origin (once the threshold has been folded into the weights).
• The weights must lie on one side of the hyperplane to get the correct answer.
• For class label 1, the weight vector must lie on the same side as the input vector.
• For class label 0, the weight vector must lie on the opposite side from the input vector.

Limitations of Perceptron
• Requires hand-picked or hand-coded features.
• The XOR problem.
• Discriminating simple patterns under translation with wrap-around.
• The “Group Invariance Theorem” states that a perceptron cannot learn to do pattern recognition if the transformations form a group.
• Requires hand-coded features to deal with these transformations.

Weight Update Methods
• Full-batch:
  a) Weights are updated after going through all training examples.
  b) The change in weights is the sum of the gradients of all training cases.
  c) Fast, but can be unstable (overshooting).
• Mini-batch:
  a) Weights are updated after going through a small batch of training examples.
  b) The change in weights is the sum of the gradients of the training examples in that batch.
  c) Comparatively slow, but more efficient.
• Online:
  a) Weights are updated after each training case.
  b) The change in weights is the gradient of that single training example.
  c) Slowest and most effective.

Motivation: Stochastic Gradient Descent
• For highly redundant datasets, the gradient on the first half is almost identical to the gradient on the second half.
• Instead of computing the full gradient, update the weights using the gradient on the first half, and then compute a gradient for the new weights on the second half.
• The extreme version of this is online learning.
• Mini-batches are usually better than online learning because less computation is spent updating the weights.
• The gradients for the cases in a mini-batch can be computed in parallel on GPUs.
• Mini-batches need to be balanced across classes.

Tricks for minibatch gradient descent
a) Initializing the weights:
• Two hidden units that have the same bias and the same incoming and outgoing weights will always get exactly the same gradients.
• They can never learn different features.
• Break the symmetry by initializing the weights to small random values.
• Big weights cannot be used, because for a hidden unit with a big fan-in, small changes to many incoming weights can cause learning to overshoot.
• Therefore, use smaller incoming weights when the fan-in is big: initialize the weights with a scale proportional to $1/\sqrt{\text{fan-in}}$.
• The learning rate can be scaled the same way.
b) Shifting the inputs:
• Transform each component of the input vector so that it has zero mean over the whole training set.
• This helps turn an elongated elliptical error surface into a more circular one, on which the gradient points directly towards the minimum.
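Tricks a) and b) can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: the helper names `init_weights` and `center_inputs` are hypothetical, the Gaussian draw is just one way to obtain "small random values", and the 1/sqrt(fan-in) scale is the commonly used choice that gives smaller weights to units with larger fan-in.

```python
import math
import random

def init_weights(fan_in, n_hidden, rng):
    """Small random incoming weights for each hidden unit.

    Random values break the symmetry between hidden units; the
    1/sqrt(fan_in) scale (an assumption of this sketch) shrinks the
    weights as the fan-in grows.
    """
    scale = 1.0 / math.sqrt(fan_in)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(fan_in)]
            for _ in range(n_hidden)]

def center_inputs(inputs):
    """Shift each input component to zero mean over the training set."""
    n = len(inputs)
    dims = len(inputs[0])
    means = [sum(x[d] for x in inputs) / n for d in range(dims)]
    centered = [[x[d] - means[d] for d in range(dims)] for x in inputs]
    return centered, means

rng = random.Random(0)  # fixed seed so the sketch is repeatable
weights = init_weights(fan_in=4, n_hidden=3, rng=rng)

# Hypothetical training inputs with a large offset in the first component.
inputs = [[101.0, 2.0], [99.0, 4.0], [100.0, 0.0]]
centered, means = center_inputs(inputs)
```

The means computed on the training set would normally be stored and reused to shift any later test inputs in the same way.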
Tricks for minibatch gradient descent (Continued)
c) Scaling the inputs:
• Transform each component of the input vector so that it has unit variance over the whole training set.
• This also gives a more circular error surface, on which the gradients point directly towards the minimum.
d) Decorrelating the input components:
• PCA can be used to decorrelate the features.
• Drop the principal components with the smallest eigenvalues.
• This also achieves some dimensionality reduction.
• Divide the remaining principal components by the square roots of their eigenvalues.
• For a linear neuron, this gives a circular error surface.

Four ways to speed up minibatch learning
a) Momentum:
• Use the gradient to change the “velocity” of the weight vector rather than its “position”.
b) Separate adaptive learning rates:
• Slowly adjust the learning rate of each parameter using the consistency of the gradient for that parameter.
c) rmsprop:
• Divide the learning rate by a running average of the magnitudes of recent gradients for that weight.
d) Fancy method from the optimization literature:
• Adapt a fancy method from the optimization literature to work for neural nets and mini-batches.

The Momentum Method
• Damps oscillations in directions of high curvature by combining gradients with opposite signs.
• Builds up speed in directions with a gentle but consistent gradient.
• The update equation is:
$$\Delta w(t) = \alpha \, \Delta w(t-1) - \varepsilon \frac{\partial E}{\partial w}(t)$$
where $\alpha$ is the momentum coefficient and $\varepsilon$ is the learning rate.
• The momentum $\alpha$ usually has a value greater than 0 and less than 1.
• Use a small momentum (e.g. 0.5) at the beginning of learning, when the gradients are big.
• When the large gradients have disappeared and the weights are stuck in a ravine, the momentum can be smoothly raised (e.g. to 0.9 or greater).

Nesterov Momentum
Standard Momentum:
• First computes the gradient at the current location.
• Then takes a big jump in the direction of the updated accumulated gradient.
Nesterov Momentum:
• First makes a big jump in the direction of the previous accumulated gradient.
• Then measures the gradient where you end up and makes a correction.

Adaptive Learning Rates
• Use a global learning rate multiplied by an appropriate local gain that is determined empirically for each weight.
• Start with a local gain of 1 for every weight.
• Increase the local gain if the gradient for that weight does not change sign; otherwise decrease it.
• Use small additive increases and multiplicative decreases.
• Formulation:
$$\Delta w_{ij} = -\varepsilon \, g_{ij} \frac{\partial E}{\partial w_{ij}}$$
$$g_{ij}(t) = \begin{cases} g_{ij}(t-1) + 0.05 & \text{if } \dfrac{\partial E}{\partial w_{ij}}(t) \, \dfrac{\partial E}{\partial w_{ij}}(t-1) > 0 \\ g_{ij}(t-1) \times 0.95 & \text{otherwise} \end{cases}$$

rprop
• Addresses the problem that the magnitude of the gradient varies between weights and during learning.
• Combines the idea of using only the sign of the gradient with the idea of separate adaptive learning rates.
• Increases the step size multiplicatively (e.g. times 1.2) if the signs of the last two gradients agree.
• Otherwise decreases the step size multiplicatively (e.g. times 0.5).
• Range: one-millionth < step size < 50.
• Doesn’t work for mini-batches.
• Consider a weight that gets a gradient of +0.1 on nine mini-batches and a gradient of -0.9 on the tenth mini-batch.
• We would want the weight to stay roughly where it is.
• Because the size of the weight change in rprop depends only on the sign of the gradient, the weight would be moved nine times in one direction and only once in the other, so it would grow a lot.

rmsprop
• A mini-batch version of rprop.
• rprop is equivalent to using the gradient but also dividing by the size of the gradient.
• In mini-batch rprop, we divide by a different number for each mini-batch.
• In rmsprop, we force the number we divide by to be very similar for adjacent mini-batches.
• We keep a moving average of the squared gradient for each weight:
$$\text{MeanSquare}(w, t) = 0.9 \, \text{MeanSquare}(w, t-1) + 0.1 \left( \frac{\partial E}{\partial w}(t) \right)^2$$
• Then divide the gradient by $\sqrt{\text{MeanSquare}(w, t)}$.
• Can be combined with momentum, Nesterov momentum and/or adaptive learning rates.

The SoftMax output function
Drawbacks of squared error:
• Easily affected by outliers.
• Cannot be used to assign probabilities to mutually exclusive class labels.
SoftMax output:
• Forces the outputs to represent a probability distribution.
• The outputs sum to 1.
• The inputs to the SoftMax function are the “logits”, i.e. the total input to each output unit:
$$z_i = b_i + w_i^T x$$
• The SoftMax is computed as:
$$y_i = \frac{e^{z_i}}{\sum_{j \in \text{group}} e^{z_j}}$$

Cross-entropy cost function
• Since the predicted outputs are now probabilities of the class labels, the appropriate cost function is the negative log probability of the correct answer:
$$C = -\sum_j t_j \log y_j$$
• C has a very big gradient when the target value is 1 and the predicted output is almost 0.
$$\frac{\partial C}{\partial z_i} = \sum_j \frac{\partial C}{\partial y_j} \frac{\partial y_j}{\partial z_i} = y_i - t_i$$
• The steepness of $\partial C / \partial y$ exactly balances the flatness of $\partial y / \partial z$.

Backpropagation Algorithm
• Two steps:
  o Forward Pass: Computes the output and the error associated with it.
  o Backward Pass: Computes the gradient for each weight and updates the weights.
• Worked example: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/

Thank You
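As a closing illustration of the SoftMax and cross-entropy slides, the sketch below checks numerically that the gradient of the cross-entropy cost with respect to the logits equals y_i - t_i. It is an illustrative example, not the author's code: the logits and the one-hot target are made-up values, and subtracting the largest logit before exponentiating is a standard numerical-stability step that does not appear in the slides.

```python
import math

def softmax(z):
    """y_i = exp(z_i) / sum_j exp(z_j), shifted by max(z) for stability."""
    m = max(z)
    exps = [math.exp(zi - m) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(y, t):
    """C = -sum_j t_j * log(y_j); terms with t_j = 0 contribute nothing."""
    return -sum(tj * math.log(yj) for tj, yj in zip(t, y) if tj > 0)

def grad_wrt_logits(y, t):
    """Analytic gradient dC/dz_i = y_i - t_i."""
    return [yi - ti for yi, ti in zip(y, t)]

# Hypothetical logits and one-hot target (the correct class is index 1).
z = [2.0, 1.0, -1.0]
t = [0.0, 1.0, 0.0]

y = softmax(z)
analytic = grad_wrt_logits(y, t)

# Central finite differences as an independent check of the formula.
eps = 1e-6
numeric = []
for i in range(len(z)):
    z_plus = list(z)
    z_minus = list(z)
    z_plus[i] += eps
    z_minus[i] -= eps
    c_plus = cross_entropy(softmax(z_plus), t)
    c_minus = cross_entropy(softmax(z_minus), t)
    numeric.append((c_plus - c_minus) / (2 * eps))
```

The analytic and numeric gradients should agree to within the finite-difference error, which is the balance between steepness and flatness that the cross-entropy slide describes.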