DEEP LEARNING REVIEW
by Girish Dharamveer Sukhwani
What are Neural Networks?
• Consider an N-dimensional input vector x.
• Neural networks contain:
 Input Layer
 Zero or more Hidden Layers
 Output Layer
 Weights associated with all connections between units.
• Neural networks with more than one hidden layer are called Deep Neural Networks.
[Diagram: inputs x1, x2, x3, …, xn connected to a single output unit y through weights w1, w2, w3, …, wn]
Simple Models of Neurons
• Linear Neurons
• Binary Threshold Neurons
• Rectified Linear Neurons (Linear Threshold Neurons)
• Sigmoid Neurons
• Stochastic Binary Neurons
Linear Neurons
• Simple to implement.
• Computationally limited.
• The output function:
  y = \sum_{i=1}^{n} w_i x_i + b
where,
n  number of dimensions or features
w  weight vector
x  input vector
b  bias (distance from the origin)
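As a concrete illustration (an addition, not from the original slides), here is a minimal sketch of a linear neuron in NumPy; the input, weight, and bias values are hypothetical:

```python
import numpy as np

def linear_neuron(x, w, b):
    """Linear neuron: y = sum_i w_i * x_i + b."""
    return np.dot(w, x) + b

# Hypothetical 3-dimensional input
y = linear_neuron(x=np.array([1.0, 2.0, 3.0]),
                  w=np.array([0.5, -0.2, 0.1]),
                  b=0.3)
```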
Binary Threshold Neurons
• First compute the weighted sum (z = w^T x) of the inputs.
• If the weighted sum exceeds a threshold, the output is 1, else 0.
• Two ways to write the equation for binary threshold neurons:
 1) z = \sum_i x_i w_i ;   y = 1 if z \ge \theta, 0 otherwise
 2) z = b + \sum_i x_i w_i ;   y = 1 if z \ge 0, 0 otherwise
• Therefore, we can derive that \theta = -b.
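A minimal sketch of the second formulation in NumPy (an illustrative addition, with the threshold folded into the bias as \theta = -b):

```python
import numpy as np

def binary_threshold_neuron(x, w, b):
    """Output 1 if z = b + w.x >= 0, else 0 (i.e. threshold theta = -b)."""
    z = b + np.dot(w, x)
    return 1 if z >= 0 else 0
```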
Linear Threshold Neurons
• Also known as Rectified Linear Neurons.
• Combine the properties of binary threshold and linear neurons.
• They compute the weighted sum (w^T x) of their inputs.
• Then they give an output which is a non-linear function of the total input:
  z = b + \sum_i x_i w_i
  y = z if z \ge 0, 0 otherwise
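A minimal NumPy sketch (an illustrative addition):

```python
import numpy as np

def rectified_linear_neuron(x, w, b):
    """Output the total input z = b + w.x if it is positive, else 0."""
    z = b + np.dot(w, x)
    return max(z, 0.0)
```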
Sigmoid Neurons
• Very commonly used neurons.
• They give a real-valued output that is a smooth and bounded function of their total input.
• First, the weighted sum (w^T x) of their inputs is computed:
  z = b + \sum_i x_i w_i
• Then the logistic function is applied to the weighted sum:
  y = \frac{1}{1 + e^{-z}}
• Has smooth derivatives.
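A minimal NumPy sketch (an illustrative addition):

```python
import numpy as np

def sigmoid_neuron(x, w, b):
    """Smooth, bounded output: y = 1 / (1 + exp(-z)), z = b + w.x."""
    z = b + np.dot(w, x)
    return 1.0 / (1.0 + np.exp(-z))
```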
Stochastic Binary Neurons
• Use the same equations as the logistic neurons:
  z = b + \sum_i x_i w_i
  p(s = 1) = \frac{1}{1 + e^{-z}}
• The output of the logistic function is treated as the probability of producing a spike (output label = 1); its input z is known as the “logit”.
• Instead of giving a real-valued output for the probability, they make a probabilistic decision and output either 1 or 0.
• A similar trick works for rectified linear units: the real-valued output is treated as the Poisson rate of producing spikes.
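A minimal NumPy sketch (an illustrative addition; the input values are hypothetical):

```python
import numpy as np

def stochastic_binary_neuron(x, w, b, rng):
    """Treat the logistic output as P(spike) and sample a 0/1 output."""
    z = b + np.dot(w, x)
    p_spike = 1.0 / (1.0 + np.exp(-z))
    return 1 if rng.random() < p_spike else 0

rng = np.random.default_rng(0)
s = stochastic_binary_neuron(np.array([1.0, -0.5]), np.array([0.8, 0.4]), 0.1, rng)
```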
Types of Learning
• Supervised Learning
• Unsupervised Learning
• Reinforcement Learning
Supervised Learning
• Learn to predict an output given an input vector.
• Each training case consists of an input vector x and a target output t.
• Two types of tasks:
i. Regression: The target output is a real number or a whole vector of real numbers. For example, the temperature at a given time tomorrow or the price of a stock in 6 months' time.
ii. Classification: The target output is a class label. For
example, Binary (0 or 1) or Multiclass (Classifying images of
hand-written digits).
• First, select a model, y = f(x; w), that maps the input vector, x, to the output, y, using some parameters, w (weights).
• Then, compute an objective function that measures the error (or
distance) between the predicted output, y, and the target
output, t.
• Now, use the error to modify the adjustable parameters, w, to
reduce this error.
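To make the loop concrete, here is a minimal sketch (an illustrative addition; the data, model, and learning rate are hypothetical) of fitting a linear model y = f(x; w) by repeatedly reducing the squared error:

```python
import numpy as np

# Hypothetical training data: inputs X (one row per case), real-valued targets t
X = np.array([[0.0], [1.0], [2.0], [3.0]])
t = np.array([1.0, 3.0, 5.0, 7.0])           # generated by t = 2x + 1

w, b, lr = 0.0, 0.0, 0.05                     # model y = f(x; w) = w*x + b
for epoch in range(500):
    y = X[:, 0] * w + b                       # 1. map inputs to outputs
    error = y - t                             # 2. objective: squared error
    w -= lr * np.mean(error * X[:, 0])        # 3. adjust the parameters to
    b -= lr * np.mean(error)                  #    reduce the error
```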
Unsupervised Learning
• Discover a good meaningful internal representation of
the input to perform supervised or reinforcement
learning.
• Provides a compact, low-dimensional representation of
higher dimensional inputs.
• PCA is one such linear method for finding a low-dimensional representation.
• Provides an economical high-dimensional representation
of the input in terms of learned features.
o Binary features require only one bit per value.
o Real-valued features that are nearly all zero.
• Clustering reveals similarity between data points.
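A minimal PCA sketch via the SVD (an illustrative addition; the data and the number of retained components are hypothetical):

```python
import numpy as np

X = np.random.randn(100, 5)              # hypothetical 5-dimensional inputs

Xc = X - X.mean(axis=0)                  # zero-mean each component
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 2                                    # keep the top-k principal components
Z = Xc @ Vt[:k].T                        # compact, low-dimensional representation
```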
Reinforcement Learning
• Learn to select actions or sequences of actions to
maximize the payoffs.
• The rewards may only occur occasionally.
• The only supervisory signal is an occasional scalar
reward.
• A discount factor is usually used for delayed rewards, so
we don’t have to look too far into the future.
• Reinforcement learning is difficult:
o The blame attribution problem.
o The explore-exploit dilemma.
o Dynamic environments.
• Typically learns far fewer parameters than supervised or unsupervised learning, because the occasional scalar reward carries little information.
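As a small illustration of the discount factor (an addition, not from the slides):

```python
def discounted_return(rewards, gamma=0.9):
    """R = sum_t gamma^t * r_t: later rewards count for less."""
    return sum(gamma ** t * r for t, r in enumerate(rewards))

discounted_return([0.0, 0.0, 0.0, 1.0])   # = 0.9**3 = 0.729
```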
Types of Neural Network Architectures
• Feed-forward Neural Networks
• Recurrent Neural Networks
• Symmetrically Connected Neural Networks
Feed-forward Neural Networks
• The most common type of neural network architecture.
• The information flows in one direction, hence the name “feed-forward”.
• Compute a series of transformations that change the similarities between cases.
o This is achieved by applying a non-linear function to the output of the layer below.
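A minimal forward-pass sketch (an illustrative addition; the layer sizes, random weights, and ReLU non-linearity are hypothetical choices):

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def feed_forward(x, layers):
    """One-directional flow: each layer non-linearly transforms the one below."""
    for W, b in layers:
        x = relu(W @ x + b)
    return x

# Hypothetical 3-4-2 network with random weights
rng = np.random.default_rng(0)
layers = [(rng.standard_normal((4, 3)), np.zeros(4)),
          (rng.standard_normal((2, 4)), np.zeros(2))]
y = feed_forward(np.array([1.0, 0.5, -0.5]), layers)
```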
Recurrent Neural Networks
• These have directed cycles in their
connection graph.
• The information can flow around in
cycles and can sometimes get back to
where it started.
• More complicated to train because of their cyclic architecture.
• More biologically realistic.
• Can efficiently model sequential data.
• They have the ability to remember
information in their hidden state for a
long time.
o Very hard to train them to use this
potential.
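A minimal sketch of one recurrent step (an illustrative addition; the sizes, tanh non-linearity, and toy sequence are hypothetical):

```python
import numpy as np

def rnn_step(x_t, h_prev, W_xh, W_hh, b):
    """The hidden state feeds back into itself, letting the network
    carry information in its hidden state across time steps."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b)

# Hypothetical sizes: 3-dim inputs, 5-dim hidden state
rng = np.random.default_rng(0)
W_xh = rng.standard_normal((5, 3))
W_hh = rng.standard_normal((5, 5))
b, h = np.zeros(5), np.zeros(5)
for x_t in [np.ones(3), np.zeros(3)]:      # a toy 2-step sequence
    h = rnn_step(x_t, h, W_xh, W_hh, b)
```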
Symmetrically Connected Neural Networks
• Similar to recurrent networks, but the connections between units have the same weight in both directions (symmetry).
• John Hopfield realized that symmetric
networks are easier to analyze than
recurrent networks.
• More restricted in what they can do,
because they obey an energy function.
• Symmetrically connected neural
networks without hidden units are
called “Hopfield Nets”.
• Hopfield nets provide a model to
understand human memory.
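For reference (an addition, not on the original slide), the energy function such a network obeys is commonly written, for binary states s_i, biases b_i, and symmetric weights w_ij, as:

  E = -\sum_i s_i b_i - \sum_{i<j} s_i s_j w_{ij}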
Statistical Pattern Recognition
• Convert raw input vector to a vector
of feature activations.
• Use hand-written programs to do so.
• Learn the weights for each input
value to get a single scalar quantity.
• If the quantity is greater than a
threshold, decide that the input
vector is a positive example, else
decide that it is a negative example.
• The standard perceptron architecture
is the first generation of neural
networks and is derived from the
above paradigm of statistical pattern
recognition.
Perceptron
• The decision units used are Binary Threshold Neurons.
• To learn the bias, we append a constant input of 1 to every input vector; the bias then becomes an extra weight that is learned along with the others.
• The perceptron algorithm:
• Pick a data point and compute the output y by thresholding the weighted sum (w^T x) of the input vector.
• If y == t, then leave the weights alone.
• If y != t, such that t = 1 and y = 0, then add the input vector to the weight vector.
• If y != t, such that t = 0 and y = 1, then subtract the input vector from the weight vector.
• This is guaranteed to find the set of weights that gets the right answer for all the training cases, if
any such set of weights exists.
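A minimal sketch of this procedure (an illustrative addition; the toy data and epoch count are hypothetical):

```python
import numpy as np

def train_perceptron(X, T, epochs=100):
    """Perceptron learning: binary threshold unit, with the bias folded in
    as an extra weight on a constant input of 1."""
    X = np.hstack([X, np.ones((len(X), 1))])    # append 1 to each input
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, t in zip(X, T):
            y = 1 if np.dot(w, x) >= 0 else 0
            if t == 1 and y == 0:
                w += x                           # add input to weights
            elif t == 0 and y == 1:
                w -= x                           # subtract input from weights
    return w

# Hypothetical linearly separable data (logical AND)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
T = np.array([0, 0, 0, 1])
w = train_perceptron(X, T)
```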
A Geometrical View of Perceptrons
• Consider weight space, which has one
dimension per weight.
• A point represents a setting of all the weights, and training cases are hyperplanes passing through the origin (with thresholds eliminated using the bias trick).
• The weights must lie on one side of the
hyper-plane to get the correct answer.
• For class label 1, the weight vector must
lie on the same side as the input vector.
• For class label 0, the weight vector must
lie on the other side of the input vector.
Limitations of Perceptron
• Requires hand-picked or hand-coded
features.
• The XOR problem.
• Discriminating simple patterns under
translation with wrap-around.
• The “Group Invariance Theorem”, which states that a perceptron cannot learn to do pattern recognition if the transformations form a group.
• Requires hand-coded features to deal with
these transformations.
Weight Update Methods
• Full-batch:
a) Weights are updated after going through all training examples.
b) The change in weights is the sum of the gradients over all training cases.
c) Fast but can be unstable (overshooting).
• Mini-batch:
a) Weights are updated after going through small batches of training examples.
b) The change in weights is the sum of the gradients of the training examples in that batch.
c) Comparatively slow but more efficient.
• Online:
a) Weights are updated after each training case.
b) The change in weights is the gradient of a single training example.
c) Slowest and most effective.
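A sketch contrasting the three schedules (an illustrative addition; `gradient` is a hypothetical per-example gradient function for a linear model with squared error):

```python
import numpy as np

def gradient(w, X, T):
    """Hypothetical per-example gradients; one gradient row per case."""
    return (X @ w - T)[:, None] * X                    # linear model, squared error

def full_batch_step(w, X, T, lr):
    return w - lr * gradient(w, X, T).sum(axis=0)      # all cases per update

def mini_batch_step(w, X, T, lr, batch=10):
    for i in range(0, len(X), batch):                  # small groups per update
        w = w - lr * gradient(w, X[i:i+batch], T[i:i+batch]).sum(axis=0)
    return w

def online_step(w, X, T, lr):
    for i in range(len(X)):                            # one case per update
        w = w - lr * gradient(w, X[i:i+1], T[i:i+1])[0]
    return w
```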
Motivation: Stochastic Gradient Descent
• For highly redundant datasets, the gradient on the first half
is almost identical to the gradient on the second half.
• Instead of computing a full gradient, update the weights
using the gradient on the first half and then get a gradient
for the new weights on the second half.
• The extreme version of this is online learning.
• Mini-batches are better than online because less
computation is used updating the weights.
• The gradients for mini-batches can be computed in parallel
on GPUs.
• Mini-batches need to be balanced across classes.
Tricks for mini-batch gradient descent
a) Initializing the weights:
• Two hidden units having the same bias, and the same incoming and outgoing weights, will always get exactly the same gradients.
• They can never learn different features.
• Break the symmetry by initializing the weights to small random values.
• We cannot use big initial weights, because hidden units with a big fan-in can cause learning to overshoot.
• Therefore, initialize the weights to be proportional to 1/\sqrt{\text{fan-in}}.
• The learning rate can be scaled the same way.
b) Shifting the inputs:
• Transform each component of the input vector so that it has zero mean over the whole training set.
• This converts an elliptical error surface into a circular one, in which the gradient points directly at the minimum.
Tricks for mini-batch gradient descent (Continued)
c) Scaling the inputs:
• Transform each component of the input vector so that it
has unit variance over the whole training set.
• This also gives a circular error surface in which the
gradients point directly to the minimum.
d) Decorrelate the input components:
• PCA can be used to decorrelate the features.
• Drop the principal components with the smallest
eigenvalues.
• Also achieves some dimensionality reduction.
• Divide the remaining principal components by the
square root of their eigenvalues.
• This will give a circular error surface.
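A sketch of tricks (a)–(d) in NumPy (an illustrative addition; the data, the component-dropping threshold, and the layer sizes are hypothetical):

```python
import numpy as np

X = np.random.randn(200, 10) * 3.0 + 5.0       # hypothetical raw inputs

X = X - X.mean(axis=0)                          # b) shift: zero mean
X = X / X.std(axis=0)                           # c) scale: unit variance

# d) decorrelate with PCA, drop the smallest components, then whiten
cov = np.cov(X, rowvar=False)
eigval, eigvec = np.linalg.eigh(cov)            # eigenvalues in ascending order
keep = eigval > 1e-2                            # drop tiny principal components
X = (X @ eigvec[:, keep]) / np.sqrt(eigval[keep])

# a) small random weights scaled by 1/sqrt(fan-in)
fan_in, fan_out = X.shape[1], 20
W = np.random.randn(fan_out, fan_in) / np.sqrt(fan_in)
```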
Four ways to speed up mini-batch learning
a) Momentum:
• Use the gradient to change the “velocity” and not the “position” of the weight vector.
b) Separate adaptive learning rates:
• Slowly adjust the learning rate using the consistency of the gradient for that parameter.
c) rmsprop:
• Divide the learning rate by a running average of the magnitudes of recent gradients for that weight.
d) Fancy methods from the optimization literature:
• Adapt a fancy method from the optimization literature to work for neural nets and mini-batches.
The Momentum Method
• Damps oscillations in directions of high curvature by combining gradients with opposite signs.
• The momentum coefficient α usually has a value greater than 0 and less than 1.
• Builds up speed in directions with a gentle but consistent gradient.
• The update equation is:
  \Delta w(t) = \alpha \, \Delta w(t-1) - \varepsilon \, \frac{\partial E}{\partial w}(t)
• Use small momentum values at the beginning of
learning when the gradients are big (e.g. 0.5).
• When the large gradients disappear and the weights are
stuck in a ravine, the momentum can smoothly be raised
(e.g. 0.9 or greater).
Nesterov Momentum
• Standard momentum: first computes the gradient at the current location, then takes a big jump in the direction of the updated accumulated gradient.
• Nesterov momentum: first makes a big jump in the direction of the previous accumulated gradient, then measures the gradient where it ends up and makes a correction.
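A sketch of both updates (an illustrative addition; the quadratic error surface and the hyperparameter values are hypothetical):

```python
import numpy as np

def momentum_step(w, v, grad_fn, lr=0.01, alpha=0.9):
    """Standard momentum: gradient at the current location, then jump."""
    v = alpha * v - lr * grad_fn(w)
    return w + v, v

def nesterov_step(w, v, grad_fn, lr=0.01, alpha=0.9):
    """Nesterov: jump along the accumulated gradient first, then measure
    the gradient at the new location and correct."""
    v = alpha * v - lr * grad_fn(w + alpha * v)
    return w + v, v

# Hypothetical quadratic error surface E(w) = 0.5 * w.T A w
A = np.diag([1.0, 50.0])                 # high curvature in one direction
grad_fn = lambda w: A @ w
w, v = np.array([1.0, 1.0]), np.zeros(2)
for _ in range(100):
    w, v = nesterov_step(w, v, grad_fn)
```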
Adaptive Learning Rates
• Use a global learning rate multiplied by an appropriate local gain that is determined empirically for each weight.
• Start with a local gain of 1 for every weight.
• Increase the local gain if the gradient for that weight does not change sign.
• Use small additive increases and multiplicative decreases.
• Formulation:
  \Delta w_{ij} = -\varepsilon \, g_{ij} \, \frac{\partial E}{\partial w_{ij}}
  g_{ij}(t) = g_{ij}(t-1) + 0.05   if   \frac{\partial E}{\partial w_{ij}}(t) \cdot \frac{\partial E}{\partial w_{ij}}(t-1) > 0
  g_{ij}(t) = g_{ij}(t-1) \times 0.95   otherwise
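A sketch of the gain update (an illustrative addition; the clipping range for the gains is an assumption, not from the slide):

```python
import numpy as np

def adaptive_gain_step(w, gains, grad, prev_grad, lr=0.01):
    """Additive increase / multiplicative decrease of per-weight gains."""
    agree = grad * prev_grad > 0                 # did the gradient keep its sign?
    gains = np.where(agree, gains + 0.05, gains * 0.95)
    gains = np.clip(gains, 0.1, 10.0)            # assumed sanity limits
    return w - lr * gains * grad, gains
```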
rprop
• Addresses the problem of varying magnitudes of the gradient during learning.
• Combines the idea of using only the sign of the gradient with the idea of separate adaptive learning rates.
• Increase the step size multiplicatively (e.g. times 1.2) if the signs of the last two gradients agree.
• Otherwise, decrease the step size multiplicatively (e.g. times 0.5).
• Limit the step sizes: one-millionth < step size < 50.
• Doesn’t work for mini-batches:
• Consider a weight that gets a gradient of +0.1 on nine mini-batches and a gradient of -0.9 on the tenth mini-batch.
• We would want the weight to stay roughly where it is.
• Since the magnitude of the weight change in rprop is the same in either direction (it depends only on the sign), the weight would grow a lot in this case.
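A simplified rprop step (an illustrative addition; full rprop also treats the sign-change case specially, which is omitted here):

```python
import numpy as np

def rprop_step(w, step, grad, prev_grad,
               up=1.2, down=0.5, step_min=1e-6, step_max=50.0):
    """Use only the sign of the gradient; grow the per-weight step size
    when the last two gradient signs agree, shrink it otherwise."""
    agree = grad * prev_grad > 0
    step = np.where(agree, step * up, step * down)
    step = np.clip(step, step_min, step_max)
    return w - np.sign(grad) * step, step
```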
rmsprop
• A mini-batch version of rprop.
• rprop is equivalent to using the gradient but also dividing by the size of the gradient.
• In mini-batch rprop, we would divide by a different number for each mini-batch.
• In rmsprop, we force the number we divide by to be very similar for adjacent mini-batches.
• We keep a moving average of the squared gradient for each weight:
  MeanSquare(w, t) = 0.9 \, MeanSquare(w, t-1) + 0.1 \left( \frac{\partial E}{\partial w} \right)^2
• Then divide the gradient by \sqrt{MeanSquare(w, t)}.
• Can be combined with momentum, Nesterov momentum, and/or adaptive learning rates.
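A sketch of the rmsprop update (an illustrative addition; the small epsilon added for numerical safety is an assumption, not from the slide):

```python
import numpy as np

def rmsprop_step(w, mean_sq, grad, lr=0.001, decay=0.9, eps=1e-8):
    """Divide the gradient by a moving average of its recent magnitude."""
    mean_sq = decay * mean_sq + (1 - decay) * grad ** 2
    return w - lr * grad / (np.sqrt(mean_sq) + eps), mean_sq
```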
The SoftMax output function
Drawbacks of squared error:
• Easily affected by outliers.
• Cannot be used to assign probabilities to mutually exclusive class labels.
SoftMax output:
• Forces the outputs to represent a probability distribution.
• Outputs will sum to 1.
• The inputs to the SoftMax function are the “logits”, z_i = w^T x_i.
• The SoftMax can be computed as:
  y_i = \frac{e^{z_i}}{\sum_{j \in group} e^{z_j}}
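A minimal SoftMax sketch (an illustrative addition; subtracting the maximum logit is a standard numerical-stability trick, not from the slide):

```python
import numpy as np

def softmax(z):
    """Turn a vector of logits into a probability distribution."""
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

softmax(np.array([2.0, 1.0, 0.1]))   # outputs are positive and sum to 1
```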
Cross-entropy cost function
• Since the predicted outputs are now probabilities of the class labels, the appropriate cost function is the negative log probability of the target:
  C = -\sum_j t_j \log y_j
• C has a very big gradient when the target value is 1 and the predicted output is almost 0.
• Combined with the SoftMax output, the gradient simplifies nicely:
  \frac{\partial C}{\partial z_i} = \sum_j \frac{\partial C}{\partial y_j} \frac{\partial y_j}{\partial z_i} = y_i - t_i
• The steepness of \partial C / \partial y exactly balances the flatness of \partial y / \partial z.
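A sketch that shows the simplified gradient y - t in action (an illustrative addition; the logits and one-hot target are hypothetical):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def cross_entropy(y, t):
    """C = -sum_j t_j log y_j for a one-hot target t."""
    return -np.sum(t * np.log(y))

z = np.array([2.0, 1.0, 0.1])
t = np.array([1.0, 0.0, 0.0])
y = softmax(z)
grad_z = y - t                       # the combined gradient dC/dz
```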
Backpropagation Algorithm
• Two steps:
• Forward Pass: computes the output and the error associated with it.
• Backward Pass: computes the gradient for each weight and updates it.
• A step-by-step worked example: https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
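A minimal end-to-end sketch of the two passes (an illustrative addition using sigmoid hidden units and squared error; the sizes, data, and learning rate are hypothetical):

```python
import numpy as np

# One-hidden-layer network trained with backpropagation on a toy target
rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 2)) * 0.1, np.zeros(4)
W2, b2 = rng.standard_normal((1, 4)) * 0.1, np.zeros(1)
x, t, lr = np.array([0.5, -0.3]), np.array([0.7]), 0.1

for _ in range(1000):
    # Forward pass: compute the output and the error
    h = 1.0 / (1.0 + np.exp(-(W1 @ x + b1)))
    y = W2 @ h + b2
    # Backward pass: compute the gradient for each weight and update it
    dy = y - t                               # dE/dy for squared error
    dW2, db2 = np.outer(dy, h), dy
    dh = W2.T @ dy * h * (1 - h)             # chain rule through the sigmoid
    dW1, db1 = np.outer(dh, x), dh
    W2, b2 = W2 - lr * dW2, b2 - lr * db2
    W1, b1 = W1 - lr * dW1, b1 - lr * db1
```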
Thank You