Deep Learning
CAP5610: Machine Learning
Instructor: Guo-Jun Qi

Deep Networks
• Deep networks are composed of multiple layers of non-linear computations:
  y^(3) = tanh(W^(3) y^(2) + b^(3))
  y^(2) = sigmoid(W^(2) y^(1) + b^(2))
  y^(1) = sigmoid(W^(1) x + b^(1))

Multiple Layers of Feature Representation
• The neurons at each layer provide a distinct level of abstraction
• Higher-layer neurons model more abstract features than those at lower layers
  • Top layers: recognize the whole object
  • Middle layers: recognize different parts of the object
  • Bottom layers: recognize edge points

Shallow vs. Deep
• Many learning algorithms are based on a shallow architecture
  • SVM: a single-layer classifier, y = sign(Wx + b)
• Our brain has a deep architecture

History
• Before 2006, no successful results had been reported on training deep architectures
  • Typically, with three or four layers of neural network (one or two hidden layers), performance saturates
  • More layers yield poorer performance
• Exception: Convolutional Neural Networks (CNNs), LeCun, 1998
• Breakthrough: Deep Belief Networks

Two of the most typical deep networks
• Deep Belief Network: multiple layers of Restricted Boltzmann Machines
  • Effective training algorithms
• Autoencoders
  • Recap: a one-hidden-layer autoencoder may only capture simple feature structures
    • Corner points, interest points, edge points
  • Multiple layers (stacked autoencoders) capture more complex feature structures

Advantages of Deep Networks
• Main argument:
  • Shallow structures may oversimplify the complexity underlying a function mapping input feature vectors to target output variables
  • Deep structures are more adequate for simulating these complex functions
  • E.g., our visual/acoustic systems
• Reason: a shallow structure may require exponentially more neurons to model a function that can be represented more compactly by a deeper structure

Challenges
• More attention must be paid to tuning a deep structure
• A shallow structure often has a convex objective function
  • Linear model / kernelized nonlinear model
  • It has a global optimal solution in the training phase, which can be found by many well-developed optimization algorithms
• The objective function associated with a deep structure is often nonconvex
  • Many local optima; special attention must be paid to avoid being trapped in a bad local optimum
  • Finding the global optimum is generally infeasible

Several Techniques used in training deep networks: Momentum
• The plain gradient-descent update, for reference:
• For each iteration t
  • For each weight w_ij^(2): Δw_ij^(2)(t) ≡ ∂E/∂w_ij^(2)(t)
    • Update w_ij^(2)(t) ← w_ij^(2)(t) − α Δw_ij^(2)(t)
  • For each weight w_ij^(1): Δw_ij^(1)(t) ≡ ∂E/∂w_ij^(1)(t)
    • Update w_ij^(1)(t) ← w_ij^(1)(t) − α Δw_ij^(1)(t)
• Note: α > 0 is a learning rate that controls the step size of each weight update

Several Techniques used in training deep networks: Momentum
• Idea: an exponentially weighted sum of recent derivatives
• For each iteration t
  • For each weight w_ij^(2): Δw_ij^(2)(t) ≡ ∂E/∂w_ij^(2)(t)
    • Update w_ij^(2)(t) ← w_ij^(2)(t) − α {Δw_ij^(2)(t) + μ Δw_ij^(2)(t−1)}
  • For each weight w_ij^(1): Δw_ij^(1)(t) ≡ ∂E/∂w_ij^(1)(t)
    • Update w_ij^(1)(t) ← w_ij^(1)(t) − α {Δw_ij^(1)(t) + μ Δw_ij^(1)(t−1)}
• μ is typically set in [0.7, 0.95] (a minimal code sketch of this update follows below)
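The momentum update above can be made concrete with a short sketch; this is only an illustration, not an implementation from the course. The toy objective E(w) = ||w||^2, the values α = 0.1 and μ = 0.9, and the recursive accumulation of past derivatives (which unrolls into the exponentially weighted sum mentioned on the slide) are all assumptions chosen for the example.

```python
import numpy as np

def gd_momentum(grad_fn, w0, alpha=0.1, mu=0.9, n_iters=200):
    """Gradient descent with momentum: v(t) = dE/dw(t) + mu*v(t-1), then w <- w - alpha*v(t)."""
    w = np.asarray(w0, dtype=float)
    v = np.zeros_like(w)        # accumulated momentum term, v(-1) = 0
    for t in range(n_iters):
        g = grad_fn(w)          # current derivative dE/dw at iteration t
        v = g + mu * v          # Δw(t) plus μ times the accumulated past derivatives
        w = w - alpha * v       # step of size α against the smoothed gradient
    return w

# Toy check on an assumed objective E(w) = ||w||^2, whose gradient is 2w:
w_min = gd_momentum(lambda w: 2.0 * w, w0=[3.0, -2.0])
print(w_min)                    # approaches the minimum at (0, 0)
```

With μ = 0 the sketch reduces to the plain gradient update shown on the first of the two slides above.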
Momentum
• The search for the minimum can be trapped in a steep local basin, where the derivative ∂E/∂x(t) at the current point x(t) is very small or even vanishes
• Momentum can give an extra force, ∂E/∂x(t) + μ ∂E/∂x(t−1), to push the current search for the optimal point out of the trap of a local minimum

Adaptive learning rate
• Too large a learning rate
  • Causes oscillation in the search for the minimal point
• Too small a learning rate
  • Too slow convergence to the minimal point
• Adaptive learning rate
  • At the beginning, the learning rate can be large, since the current point is far from the optimal point
  • Gradually, the learning rate decays as time goes by

Adaptive learning rate
• The learning rate should not be too large or too small
• Annealing rate: α(t) = α_0 / (1 + t/T), where T is a decay constant
• α(t) eventually goes to zero, but at the beginning it is almost constant (the sketch after the Challenges slide evaluates this schedule)

Deep Convolutional Neural Network
• In a conventional neural network,
  • Each neuron in an upper layer takes inputs from all the neurons in the lower layer
  • All the weights on the connections between two layers are distinct
• Problem: too many parameters
  • Two layers with N and M neurons have NM connection weights between them

Deep Convolutional Neural Network
• This parameterization is not necessary
  • There are many similar local structures in an input image
• Each neuron in the upper layer can take a more localized input from neurons in the lower layer
• The weights on the connections are shared: the same kernel is reused at every location
• Much smaller number of parameters

An example of a convolutional layer
• Given input X = (2, 1, 1, 3)
• Convolutional kernel W = (2, 1)
• Output Y = X * W = (5, 3, 5), i.e., (2·2+1·1, 1·2+1·1, 1·2+3·1); the sketch further below recomputes this example

A real CNN (LeNet-5, LeCun 1998)
• C1: convolutional layer
• S2: subsampling layer, reducing the number of neurons
• C3: convolutional layer
• S4: subsampling layer
• C5, F6, output: full connection

A real CNN: Convolutional layer
• C1: convolutional layer
  • 6 feature maps of size 28×28
  • Each C1 neuron has a kernel of size 5×5
  • In total we have (5×5+1)×6 = 156 parameters
  • If Input and C1 were fully connected, there would be (32×32+1)×28×28×6 = 4,821,600 parameters

A real CNN: Subsampling
• S2: subsampling layer
  • Each neuron in S2 takes the maximum output of the 2×2 neurons underneath it in C1
  • S2 halves each spatial dimension (28×28 → 14×14), reducing the number of neurons to a quarter

A real CNN: Convolutional layer
• C3: convolutional layer
  • Each neuron in C3 is connected to some of the S2 feature maps with a kernel of size 5×5
  • Its activation is the sum of the convolutions over these maps

A real CNN: Subsampling
• S4: subsampling layer
  • Each neuron in S4 takes the maximum output of the 2×2 neurons underneath it in C3

A real CNN: Convolutional layer
• C5: convolutional layer
  • Each neuron in C5 takes a kernel convolution with all 16 feature maps in S4 and sums the convolution results together

A real CNN: Fully connected layer
• F6: fully connected layer
  • Each neuron in F6 is fully connected with all neurons in C5

A real CNN: Output
• Output: Gaussian connections
  • Each neuron output from F6 is transformed by a Gaussian kernel before it is fed into the output layer
  • The output layer is fully connected with all neurons in F6

CNN Results

Challenges
• The neural network still contains fully connected layers
• The model may be too complex for many applications
• More elegant regularization methods are needed to avoid the risk of overfitting
  • Reducing the magnitudes of the connection weights
  • Making the connections sparser, while not losing too much modeling power
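The sketch below ties together two of the slides above under stated assumptions: it evaluates the annealing schedule α(t) = α_0/(1 + t/T) (α_0 = 0.1 and T = 100 are arbitrary illustrative values) and recomputes the 1-D convolution example and the C1 parameter counts. np.correlate is used because CNN "convolution" is conventionally the sliding window without a kernel flip.

```python
import numpy as np

def annealed_lr(t, alpha0=0.1, T=100.0):
    # alpha(t) = alpha0 / (1 + t/T): nearly constant while t << T, then decays toward zero.
    return alpha0 / (1.0 + t / T)

print(annealed_lr(0), annealed_lr(1000))    # large at the start, small later

# The 1-D convolution example from the slides:
X = np.array([2, 1, 1, 3])
W = np.array([2, 1])
Y = np.correlate(X, W, mode="valid")        # sliding window, no kernel flip
print(Y)                                    # -> [5 3 5], i.e., Y = X * W = (5, 3, 5)

# Parameter counts quoted for the C1 layer:
c1_params = (5 * 5 + 1) * 6                 # shared 5x5 kernel + bias, for 6 feature maps
dense_params = (32 * 32 + 1) * 28 * 28 * 6  # if Input -> C1 were fully connected instead
print(c1_params, dense_params)              # -> 156 4821600
```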
Dropout
• Some neurons (half of them) in a fully connected layer become inactive; their outputs do not participate in the forward pass or in backpropagation
• Each time, a neural network with reduced complexity is generated to process the input signals forward and to be updated by backpropagation (a minimal code sketch appears after the Summary)

Summary
• Deep network: multiple layers of neurons
• Practical issues
  • Momentum
  • Adaptive learning rate
• Convolutional neural networks
• Making neural networks sparser
  • Dropout
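As a rough illustration of the dropout slide above, the sketch below deactivates about half of the units in a fully connected layer during one training-time forward pass. The sigmoid activation, the layer sizes, and the 0.5 drop probability are assumptions made for the example, and test-time rescaling details are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dense_with_dropout(x, W, b, drop_prob=0.5, training=True):
    """Fully connected layer whose units are randomly deactivated during training."""
    h = sigmoid(x @ W + b)                        # ordinary fully connected activation
    if training:
        mask = rng.random(h.shape) >= drop_prob   # keep each unit with probability 0.5
        h = h * mask                              # dropped units output 0 for this pass
    return h

# Toy usage with assumed sizes: a batch of 4 examples, 8 inputs, 16 hidden units.
x = rng.standard_normal((4, 8))
W = 0.1 * rng.standard_normal((8, 16))
b = np.zeros(16)
print(dense_with_dropout(x, W, b).round(2))       # roughly half the entries are zero
```

Each call draws a new mask, so every forward/backward pass effectively trains a different thinned network, matching the slide's description.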