Deep Learning
CAP5610: Machine Learning
Instructor: Guo-Jun Qi
Deep Networks
• Deep networks are composed of multiple layers of non-linear computations, e.g. a three-layer network:
  y^(3) = tanh(W^(3) y^(2) + b^(3))
  y^(2) = sigmoid(W^(2) y^(1) + b^(2))
  y^(1) = sigmoid(W^(1) x + b^(1))
  (the layers stack as x → y^(1) → y^(2) → y^(3))
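A minimal NumPy sketch of this three-layer forward pass. The layer sizes (4 → 5 → 3 → 2) and random weights are illustrative assumptions, not values from the slides:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
sizes = [4, 5, 3, 2]   # input dimension and three layer widths (assumed for illustration)
W = [0.1 * rng.standard_normal((m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
b = [np.zeros(m) for m in sizes[1:]]

x = rng.standard_normal(sizes[0])
y1 = sigmoid(W[0] @ x  + b[0])   # y^(1) = sigmoid(W^(1) x     + b^(1))
y2 = sigmoid(W[1] @ y1 + b[1])   # y^(2) = sigmoid(W^(2) y^(1) + b^(2))
y3 = np.tanh(W[2] @ y2 + b[2])   # y^(3) = tanh(W^(3) y^(2)    + b^(3))
print(y3)
```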
Multiple Layers of Feature Representation
• The neurons at each layer provide a distinct level of abstraction
• Higher-layer neurons model more abstract features than those at lower layers
  • e.g., for an image: lower layers recognize edge points, middle layers recognize different parts of the object, and the top layer recognizes the whole object
Shallow vs. Deep
• Many learning algorithms are based on a shallow architecture
  • SVM: a single-layer classifier
    y = sign(Wx + b)
    (a single output y computed directly from the inputs x1, x2, x3, x4)
• Our brain has a deep architecture.
History
• Before 2006, no successful results had been reported on training deep architectures
  • Typically, with three or four layers of neural network (one or two hidden layers), the performance saturates
  • More layers yield poorer performance
• Exception: Convolutional Neural Networks (CNNs), LeCun, 1998
• Breakthrough: Deep Belief Networks
Two of the most typical deep networks
• Deep Belief Network: multiple layers of Restricted Boltzmann Machines
  • Effective training algorithms
• Autoencoders
  • Recap: a one-hidden-layer autoencoder
    • May only capture simple feature structures
      • corner points, interest points, edge points
  • Multiple layers: stacked autoencoders
    • Capture more complex feature structures
Advantages of Deep Networks
• Main argument:
  • Shallow structures may oversimplify the complexity of the function mapping input feature vectors to target output variables
  • Deep structures are better suited to model these complex functions
    • e.g., our visual/acoustic systems
• Reason: a shallow structure may require exponentially more neurons to model a function that can be represented more compactly by a deeper structure.
Challenges
• More attention must be paid to tuning a deep structure
• A shallow structure often has a convex objective function
  • linear model / kernelized non-linear model
  • It has a globally optimal solution in the training phase, which can be found by many well-developed optimization algorithms
• The objective function associated with a deep structure is often non-convex
  • Many local optima, and special attention must be paid to avoid being trapped in a bad local optimum
  • Finding the global optimum is generally infeasible
Several Techniques used in training deep networks: Momentum
• First, recall the plain gradient-descent update (no momentum term yet). For each iteration t:
  • For each second-layer weight w_kj^(2):
    • Δw_kj^(2)(t) ≡ ∂E/∂w_kj^(2) (t)
    • Update: w_kj^(2)(t) ← w_kj^(2)(t) − α Δw_kj^(2)(t)
  • For each first-layer weight w_ji^(1):
    • Δw_ji^(1)(t) ≡ ∂E/∂w_ji^(1) (t)
    • Update: w_ji^(1)(t) ← w_ji^(1)(t) − α Δw_ji^(1)(t)
• Note: α > 0 is a learning rate which controls the step size for each weight update
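A minimal sketch of this update in Python/NumPy, applied to a whole weight matrix at once (it works the same for W^(1) and W^(2)); here `grad` stands in for ∂E/∂w at iteration t, assumed to come from backpropagation, which is not shown:

```python
import numpy as np

def sgd_step(w, grad, alpha=0.1):
    """One plain gradient-descent step: w(t) <- w(t) - alpha * dE/dw(t)."""
    return w - alpha * grad

# Example usage with a dummy weight matrix and gradient:
w = np.zeros((3, 4))
grad = np.ones((3, 4))          # stands in for dE/dw from backpropagation
w = sgd_step(w, grad, alpha=0.1)
```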
Several Techniques used in training deep networks: Momentum
• Idea: use an exponentially weighted sum of recent derivatives
• For each iteration t:
  • For each second-layer weight w_kj^(2):
    • Δw_kj^(2)(t) ≡ ∂E/∂w_kj^(2) (t)
    • Update: w_kj^(2)(t) ← w_kj^(2)(t) − α { Δw_kj^(2)(t) + ε Δw_kj^(2)(t − 1) }
  • For each first-layer weight w_ji^(1):
    • Δw_ji^(1)(t) ≡ ∂E/∂w_ji^(1) (t)
    • Update: w_ji^(1)(t) ← w_ji^(1)(t) − α { Δw_ji^(1)(t) + ε Δw_ji^(1)(t − 1) }
• ε is typically set in [0.7, 0.95].
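A sketch of the momentum update exactly as written above: the step combines the current derivative with ε times the previous one. The function and variable names are illustrative assumptions:

```python
import numpy as np

def momentum_step(w, grad, prev_grad, alpha=0.1, eps=0.9):
    """w(t) <- w(t) - alpha * (dE/dw(t) + eps * dE/dw(t-1)), eps typically in [0.7, 0.95]."""
    return w - alpha * (grad + eps * prev_grad)

# Example usage: keep the previous gradient around between iterations.
w = np.zeros(5)
prev_grad = np.zeros(5)
for t in range(3):
    grad = np.ones(5)                 # stands in for dE/dw(t) from backpropagation
    w = momentum_step(w, grad, prev_grad)
    prev_grad = grad
```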
Momentum
• The search for the minimum point of f(x) can be trapped in a steep local basin
• At such a point x(t), the gradient ∂f/∂x (t) is very small or even vanishes
Momentum
• Momentum can give an extra force that pushes the current search for the optimal point out of the trap of a local minimum
• The combined step ∂f/∂x (t) + ε ∂f/∂x (t − 1) can still move x(t) even where the current gradient alone is very small
Adaptive learning rate
• Too large a learning rate
  • causes oscillation in the search for the minimum point
• Too small a learning rate
  • too slow convergence to the minimum point
• Adaptive learning rate
  • At the beginning, the learning rate can be large when the current point is far from the optimal point
  • Gradually, the learning rate decays as time goes by.
Adaptive learning rate
• The learning rate should not be too large or too small
• Annealing schedule: α(t) = α_0 / (1 + t/T)
• α(t) eventually goes to zero, but at the beginning it is almost a constant.
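A small sketch of this annealing schedule; α_0 and T are hyperparameters, and the values below are only illustrative:

```python
def annealed_lr(t, alpha0=0.5, T=1000.0):
    """alpha(t) = alpha_0 / (1 + t/T): nearly constant early on, decaying towards zero."""
    return alpha0 / (1.0 + t / T)

print(annealed_lr(0), annealed_lr(100), annealed_lr(10000))  # 0.5, ~0.45, ~0.045
```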
Deep Convolutional Neural Network
• In a conventional neural network,
  • each neuron in an upper layer takes inputs from all the neurons in the lower layer
  • all the weights on the connections between two layers are distinct
• Problem: too many parameters
  • Two layers with N and M neurons have N×M connection weights between them
Deep Convolutional Neural Network
• This parameterization is not necessary
  • There are many similar local structures in an input image
• Each neuron in the upper layer can take a more localized input from the neurons in the lower layer
• The weights on the connections are shared: the same kernel is reused at every location
• Much smaller number of parameters
An example of convolutional layer
• Given input X = (2, 1, 1, 3)
• Convolution kernel W = (2, 1)
• Output Y = X * W = (5, 3, 5)
  • y_1 = 2·2 + 1·1 = 5
  • y_2 = 1·2 + 1·1 = 3
  • y_3 = 1·2 + 3·1 = 5
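The same example in NumPy. Note that the "convolution" used in CNNs is usually cross-correlation (no kernel flip), which is what np.correlate computes and what reproduces the numbers above:

```python
import numpy as np

X = np.array([2, 1, 1, 3])
W = np.array([2, 1])
Y = np.correlate(X, W, mode="valid")   # slides the kernel over X without flipping it
print(Y)                               # [5 3 5]
```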
A real CNN
• C1: Convolutional layer
• S2: Subsampling layer, reducing the number of neurons
• C3: Convolutional layer
• S4: Subsampling layer
• C5, F6, output: full connection
A real CNN: Convolutional Layer
• C1: Convolutional layer
  • 6 feature maps of size 28×28
  • Each C1 neuron has a kernel of size 5×5
  • In total we have (5×5 + 1) × 6 = 156 parameters
  • If Input and C1 were fully connected, there would be (32×32 + 1) × 28×28 × 6 = 4,821,600 parameters
A real CNN: Subsampling
• S2: Subsampling layer
  • Each neuron in S2 takes the maximum output of the 2×2 neurons underneath it in C1 (see the sketch below)
  • S2 halves each spatial dimension of the feature maps (28×28 → 14×14)
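A minimal NumPy sketch of this 2×2 max subsampling, assuming a single 2-D feature map whose height and width are even (as with the 28×28 maps in C1); the small 4×4 example is only illustrative:

```python
import numpy as np

def max_pool_2x2(fmap):
    """Take the maximum over each non-overlapping 2x2 block of a 2-D feature map."""
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

fmap = np.arange(16).reshape(4, 4)
print(max_pool_2x2(fmap))   # [[ 5  7]
                            #  [13 15]]
```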
A real CNN: Convolutional layer
• C3: Convolutional layer
  • Each neuron in C3 is connected to some of the S2 feature maps, with a kernel of size 5×5
  • Its activation is the sum of the convolutions over these maps
A real CNN: Subsampling
• S4: Subsampling layer
  • Each neuron in S4 takes the maximum output of the 2×2 neurons underneath it in C3
A real CNN: Convolutional layer
• C5: Convolutional layer
  • Each neuron in C5 takes a kernel convolution with each of the 16 feature maps in S4, and sums the convolution results together.
A real CNN: Fully connected layer
• F6: Fully connected layer
  • Each neuron in F6 is fully connected with all neurons in C5.
A real CNN: Output
• Output: Gaussian connections
  • Each output from F6 is transformed by a Gaussian kernel before it is fed into the output layer
  • The output layer is fully connected with all neurons in F6
CNN Results
Challenges
• Parts of the neural network are still fully connected
  • The model may be too complex for many applications
• More elegant regularization methods are needed to avoid the risk of overfitting
  • Reducing the magnitudes of the connection weights
  • Making the connections sparser, while not losing too much modeling power
• Dropout (see the sketch below):
  • Some neurons (half of them) in a fully connected layer become inactive; their outputs do not participate in the forward pass or in backpropagation
  • Each time, a neural network with reduced complexity is generated to process the input signal forwards and to be updated by backpropagation
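A minimal NumPy sketch of dropout applied to one fully connected layer's activations; the rescaling by the keep probability is the common "inverted dropout" convention, not something stated in the slides:

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(activations, p_keep=0.5, train=True):
    if not train:
        return activations                           # all neurons are active at test time
    mask = rng.random(activations.shape) < p_keep    # keep each neuron with probability p_keep
    return activations * mask / p_keep               # "inverted dropout" rescaling (common convention)

h = rng.standard_normal(8)                           # activations of a fully connected layer
print(dropout(h))
```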
Summary
• Deep network: multiple layers of neurons
• Practical issues
  • Momentum
  • Adaptive learning rate
• Convolutional neural networks
• Making neural networks sparser
  • Dropout