How to learn a generative model of images
Geoffrey Hinton
Canadian Institute for Advanced Research & University of Toronto

How the brain works
• Each neuron receives inputs from thousands of other neurons.
  – A few neurons also get inputs from the sensory receptors.
  – A few neurons send outputs to muscles.
  – Neurons use binary spikes of activity to communicate.
• The effect that one neuron has on another is controlled by a synaptic weight.
  – The weights can be positive or negative.
• The synaptic weights adapt so that the whole network learns to perform useful computations.
  – Recognizing objects, understanding language, making plans, controlling the body.

How to make an intelligent system
• The cortex has about a hundred billion neurons.
• Each neuron has thousands of connections.
• So all you need to do is find the right values for the weights on thousands of billions of connections.
• This task is much too difficult for evolution to solve directly.
  – A blind search would be much too slow.
  – DNA doesn't have enough capacity to store the answer.
• So there must be an intelligent designer.
  – What does she look like?
  – Where did she come from?

The intelligent designer
• The intelligent designer is a learning algorithm.
  – The algorithm adjusts the weights to give the neural network a better model of the data it encounters.
• A learning algorithm is the differential equation of knowledge.
• Evolution produced the learning algorithm.
  – Trial and error in the space of learning algorithms is a much better strategy than trial and error in the space of synapse strengths.
• To understand the learning algorithm, we first need to understand the type of network it produces.
  – Shape recognition is a good task to consider: we are much better at it than computers, and it uses a lot of neurons.

Hopfield nets
• Model each pixel in an image using a binary neuron that has states of 1 or 0.
• Connect the neurons together with symmetric connections.
• Update the neurons one at a time based on the total input they receive.
• Stored patterns correspond to the energy minima of the network.
• To store a pattern, we change the weights to lower the energy of that pattern.
• The energy of a binary configuration s is
    E(s) = -\sum_i s_i b_i - \sum_{i<j} s_i s_j w_{ij}
  where s_i is the binary state of unit i in configuration s, b_i is the bias of unit i, w_{ij} is the weight between units i and j, and the sum over i<j indexes every non-identical pair of i and j once.
• [Figure: a small Hopfield net with binary states 0/1 and symmetric weights such as 3.7 and -4.2.]

Why a Hopfield net doesn't work
• The ways in which shapes vary are much too complicated to be captured by pair-wise interactions between pixels.
  – To capture all the allowable variations of a shape we need extra "hidden" variables that learn to represent the features that the shape is composed of.

Some examples of real handwritten digits
• [Figure: sample handwritten digit images.]

From Hopfield nets to Boltzmann Machines
• Boltzmann machines are stochastic Hopfield nets with hidden variables.
• They have a simple learning algorithm that adapts all of the interactions so that the equilibrium distribution over the visible variables matches the distribution of the observed data.
  – The pair-wise interactions with the hidden variables can model higher-order correlations between the visible variables.

Stochastic binary neurons
• These have a state of 1 or 0 which is a stochastic function of the neuron's bias, b_i, and the input it receives from other neurons:
    p(s_i = 1) = 1 / (1 + exp(-b_i - \sum_j s_j w_{ji}))
• [Figure: the logistic function, rising from 0 to 1, with p(s_i = 1) = 0.5 when b_i + \sum_j s_j w_{ji} = 0.]

How a Boltzmann Machine models data
• The aim of learning is to discover weights that cause the equilibrium distribution of the whole network to match the data distribution on the visible variables.
• Everything is defined in terms of energies of joint configurations of the visible and hidden units.
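The energy function and the stochastic update rule above can be sketched in a few lines of NumPy. This is a minimal illustration, not code from the lecture; the function names and the tiny example weights are made up:

```python
import numpy as np

def prob_on(b, s, w):
    """Probability that a stochastic binary unit turns on, given its bias b,
    the states s of the other units, and the weights w on their connections:
    p(s_i = 1) = 1 / (1 + exp(-b - sum_j s_j * w_j))."""
    return 1.0 / (1.0 + np.exp(-b - s @ w))

def energy(s, b, W):
    """Energy of a binary configuration s with biases b and a symmetric
    weight matrix W (zero diagonal):
    E(s) = -sum_i s_i b_i - sum_{i<j} s_i s_j w_ij.
    The factor 0.5 counts each non-identical pair of units once."""
    return -s @ b - 0.5 * s @ W @ s

# Illustrative states and weights: this unit's total input is exactly zero,
# so it turns on with probability 0.5.
s = np.array([1.0, 0.0, 1.0])
w = np.array([0.5, -1.0, -0.5])
print(prob_on(0.0, s, w))   # 0.5
```

Updating the neurons one at a time with `prob_on` performs Gibbs sampling; at zero temperature (always picking the more probable state) it descends to an energy minimum, which is how a Hopfield net retrieves a stored pattern.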
• [Figure: a layer of hidden units above a layer of visible units.]

The energy of a joint configuration
• With configuration v on the visible units and h on the hidden units:
    E(v,h) = -\sum_i s_i^{v,h} b_i - \sum_{i<j} s_i^{v,h} s_j^{v,h} w_{ij}
  where s_i^{v,h} is the binary state of unit i in joint configuration (v,h), b_i is the bias of unit i, w_{ij} is the weight between units i and j, and i<j indexes every non-identical pair of i and j once.

Using energies to define probabilities
• The probability of a joint configuration over both visible and hidden units depends on the energy of that joint configuration compared with the energies of all other joint configurations:
    p(v,h) = e^{-E(v,h)} / \sum_{u,g} e^{-E(u,g)}
  The denominator is the partition function.
• The probability of a configuration of the visible units is the sum of the probabilities of all the joint configurations that contain it:
    p(v) = \sum_h e^{-E(v,h)} / \sum_{u,g} e^{-E(u,g)}

A very surprising fact
• Everything that one weight needs to know about the other weights and the data in order to do maximum-likelihood learning is contained in the difference of two correlations:
    \partial log p(v) / \partial w_{ij} = <s_i s_j>_v - <s_i s_j>_free
  The left-hand side is the derivative of the log probability of one training vector. <s_i s_j>_v is the expected value of the product of states at thermal equilibrium when the training vector is clamped on the visible units; <s_i s_j>_free is the expected value at thermal equilibrium when nothing is clamped.

The batch learning algorithm
• Positive phase
  – Clamp a data vector on the visible units.
  – Let the hidden units reach thermal equilibrium at a temperature of 1 (annealing may be used to speed this up).
  – Sample s_i s_j for all pairs of units.
  – Repeat for all data vectors in the training set.
• Negative phase
  – Do not clamp any of the units.
  – Let the whole network reach thermal equilibrium at a temperature of 1 (where do we start?).
  – Sample s_i s_j for all pairs of units.
  – Repeat many times to get good estimates.
• Weight updates
  – Update each weight by an amount proportional to the difference in <s_i s_j> between the two phases.
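For a network with only a handful of units, the probabilities defined above can be computed exactly by enumerating every joint configuration. This brute-force sketch (my own illustration, with arbitrary small random weights) makes the partition function and the visible marginal concrete:

```python
import itertools
import numpy as np

def boltzmann_distribution(b, W, n_vis):
    """Exact Boltzmann distribution for a tiny network:
    p(v,h) = exp(-E(v,h)) / sum_{u,g} exp(-E(u,g)).
    Enumeration is only feasible for a few units, but it shows the definitions
    at work.  The first n_vis units are treated as visible."""
    n = len(b)
    configs = list(itertools.product([0, 1], repeat=n))
    # E(s) = -sum_i s_i b_i - sum_{i<j} s_i s_j w_ij  (0.5 counts each pair once)
    energies = np.array([-(np.dot(s, b) + 0.5 * np.dot(s, W @ np.array(s)))
                         for s in map(np.array, configs)])
    Z = np.exp(-energies).sum()          # partition function
    p_joint = np.exp(-energies) / Z
    # p(v) sums p(v,h) over every hidden configuration that contains v.
    p_vis = {}
    for cfg, p in zip(configs, p_joint):
        v = cfg[:n_vis]
        p_vis[v] = p_vis.get(v, 0.0) + p
    return p_joint, p_vis

# 2 visible units + 1 hidden unit with small random symmetric weights.
rng = np.random.default_rng(0)
W = rng.normal(0, 1, (3, 3)); W = (W + W.T) / 2; np.fill_diagonal(W, 0)
b = rng.normal(0, 1, 3)
p_joint, p_vis = boltzmann_distribution(b, W, n_vis=2)
# Both the joint distribution and the visible marginal sum to 1.
```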
Three reasons why learning is impractical in Boltzmann Machines
• If there are many hidden layers, it can take a long time to reach thermal equilibrium when a data vector is clamped on the visible units.
• It takes even longer to reach thermal equilibrium in the "negative" phase, when the visible units are unclamped.
  – The unconstrained energy surface needs to be highly multimodal to model the data.
• The learning signal is the difference of two sampled correlations, which is very noisy.

Restricted Boltzmann Machines
• We restrict the connectivity to make inference and learning easier.
  – Only one layer of hidden units.
  – No connections between hidden units.
• In an RBM, the hidden units are conditionally independent given the visible states, so it only takes one step to reach thermal equilibrium when the visible units are clamped.
  – So we can quickly get the exact value of <s_i s_j>_v.
• [Figure: a layer of hidden units j above a layer of visible units i, with connections only between the layers.]

A picture of the Boltzmann machine learning algorithm for an RBM
• Start with a training vector on the visible units. Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.
• <s_i s_j>_data is measured at t = 0; <s_i s_j>_fantasy is measured at t = infinity, when the alternating chain has reached its equilibrium distribution and produces a "fantasy".
    \Delta w_{ij} = \epsilon ( <s_i s_j>_data - <s_i s_j>_fantasy )

Contrastive divergence learning: a quick way to learn an RBM
• Start with a training vector on the visible units (t = 0, the data).
• Update all the hidden units in parallel.
• Update all the visible units in parallel to get a "reconstruction" (t = 1).
• Update the hidden units again.
    \Delta w_{ij} = \epsilon ( <s_i s_j>_data - <s_i s_j>_recon )
• This is not following the gradient of the log likelihood, but it works well. It is trying to make the free-energy gradient be zero at the data distribution.
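The contrastive divergence procedure above (CD-1) can be sketched in NumPy. This is a minimal sketch of the general recipe, not Hinton's actual code; the function names, learning rate, and toy dimensions are my own choices:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, W, b_vis, b_hid, rng, lr=0.1):
    """One CD-1 update on a binary RBM.  v0 is a batch of training vectors,
    one per row; W has one row per visible unit and one column per hidden unit."""
    # Positive phase: sample the hidden units given the data.
    p_h0 = sigmoid(v0 @ W + b_hid)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # One step of alternating Gibbs sampling gives the "reconstruction".
    p_v1 = sigmoid(h0 @ W.T + b_vis)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(v1 @ W + b_hid)
    # delta w_ij = lr * ( <v_i h_j>_data - <v_i h_j>_recon ), averaged over the batch.
    n = v0.shape[0]
    W += lr * (v0.T @ p_h0 - v1.T @ p_h1) / n
    b_vis += lr * (v0 - v1).mean(axis=0)
    b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_vis, b_hid

# Toy run: 6 visible units, 4 hidden units, random binary "data".
rng = np.random.default_rng(0)
W = rng.normal(0, 0.01, (6, 4))
b_vis, b_hid = np.zeros(6), np.zeros(4)
data = (rng.random((20, 6)) < 0.5).astype(float)
for _ in range(100):
    W, b_vis, b_hid = cd1_update(data, W, b_vis, b_hid, rng)
```

Using the probabilities `p_h1` rather than sampled binary states for the final hidden update is a common variance-reduction choice; the sign of each term matches the rule \Delta w_{ij} = \epsilon(<s_i s_j>_data - <s_i s_j>_recon).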
How to learn a set of features that are good for reconstructing images of the digit 2
• 50 binary feature neurons connected to a 16 x 16 pixel image.
• For the data (reality): increment the weights between an active pixel and an active feature.
• For the reconstruction (which has lower energy than reality): decrement the weights between an active pixel and an active feature.

The weights of the 50 feature detectors
• We start with small random weights to break symmetry.
• [Figure: the final 50 x 256 weights; each neuron grabs a different feature.]

How well can we reconstruct the digit images from the binary feature activations?
• [Figure: new test images from the digit class that the model was trained on, alongside their reconstructions from the activated binary features.]
• [Figure: images from an unfamiliar digit class and their reconstructions; the network tries to see every image as a 2.]

Training a deep network
• First train a layer of features that receive input directly from the pixels.
• Then treat the activations of the trained features as if they were pixels and learn features of features in a second hidden layer.
• It can be proved that each time we add another layer of features we get a better model of the set of training images.
  – i.e. we assign lower free energy to the real data and higher free energy to all other possible images.
  – The proof uses the fact that the variational free energy of a non-equilibrium distribution is always higher than the variational free energy of the equilibrium distribution.
  – The proof depends on a neat equivalence.

A causal network that is equivalent to an RBM
• Learning the weights in an RBM is exactly equivalent to learning in an infinite causal network with tied weights.
• [Figure: an infinite stack of layers v, h1, h2, h3, ... connected by alternating weight matrices W and W^T.]

Learning a deep causal network
• First learn with all the weights tied.
• Then freeze the bottom layer and relearn all the other layers.
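The greedy layer-by-layer recipe ("treat the activations of the trained features as if they were pixels") can be sketched as follows. This is an illustrative NumPy sketch with made-up function names and toy sizes, using mean-field CD-1 inside each RBM rather than any particular published training setup:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, n_hidden, rng, lr=0.1, epochs=50):
    """Train one binary RBM with a mean-field flavour of CD-1 and return
    its weights and hidden biases (a sketch, not an exact recipe)."""
    n_visible = data.shape[1]
    W = rng.normal(0, 0.01, (n_visible, n_hidden))  # small random weights break symmetry
    b_vis, b_hid = np.zeros(n_visible), np.zeros(n_hidden)
    for _ in range(epochs):
        p_h0 = sigmoid(data @ W + b_hid)
        h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
        v1 = sigmoid(h0 @ W.T + b_vis)              # mean-field reconstruction
        p_h1 = sigmoid(v1 @ W + b_hid)
        n = data.shape[0]
        W += lr * (data.T @ p_h0 - v1.T @ p_h1) / n
        b_vis += lr * (data - v1).mean(axis=0)
        b_hid += lr * (p_h0 - p_h1).mean(axis=0)
    return W, b_hid

def train_deep_net(data, layer_sizes, seed=0):
    """Greedy layer-wise training: train an RBM on the data, then treat its
    hidden activations as the data for the next RBM, and so on."""
    rng = np.random.default_rng(seed)
    layers, x = [], data
    for n_hidden in layer_sizes:
        W, b_hid = train_rbm(x, n_hidden, rng)
        layers.append((W, b_hid))
        x = sigmoid(x @ W + b_hid)   # activations become the next layer's "pixels"
    return layers

rng = np.random.default_rng(1)
images = (rng.random((50, 16)) < 0.5).astype(float)
stack = train_deep_net(images, [12, 8, 4])
print([W.shape for W, _ in stack])   # [(16, 12), (12, 8), (8, 4)]
```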
• Then freeze the bottom two layers and relearn all the other layers.
• [Figure: the tied weight matrix being untied layer by layer into W1, W2, W3, ...]

The generative model after learning 3 layers
• To generate data:
  1. Get an equilibrium sample from the top-level RBM by performing alternating Gibbs sampling.
  2. Perform a top-down pass to get states for all the other layers.
• So the lower-level bottom-up connections are not part of the generative model.
• [Figure: layers h3, h2, h1 and the data, connected top-down by W3, W2, W1.]

Why the hidden configurations should be treated as data when learning the next layer of weights
• After learning the first layer of weights, a variational bound relates the log probability of the data to an energy and an entropy term:
    log p(v) >= \sum_h p(h|v) [ log p(h) + log p(v|h) ] + entropy(p(h|v))
• If we freeze the generative weights that define the likelihood term and the recognition weights that define the distribution over hidden configurations, we get:
    log p(v) >= \sum_h p(h|v) log p(h) + constant
• Maximizing the RHS is equivalent to maximizing the log probability of "data" h that occurs with probability p(h|v).

A neural model of digit recognition
• The top two layers form an associative memory whose energy landscape models the low-dimensional manifolds of the digits. The energy valleys have names.
• Architecture: a 28 x 28 pixel image feeds 500 neurons, then 500 neurons, then 2000 top-level neurons that are also connected to 10 label neurons.
• The model learns to generate combinations of labels and images.
• To perform recognition we do an up-pass from the image followed by a few iterations of the top-level associative memory.

Fine-tuning with the up-down algorithm: a contrastive divergence version of wake-sleep
• Replace the top layer of the causal network by an RBM.
  – This eliminates explaining away at the top level.
  – It is nice to have an associative memory at the top.
• Replace the sleep phase by a top-down pass starting with the state of the RBM produced by the wake phase.
  – This makes sure the recognition weights are trained in the vicinity of the data.
  – It also reduces mode averaging.
  – If the recognition weights prefer one mode, they will stick with that mode even if the generative weights like some other mode just as much.

SHOW THE MOVIE

Examples of correctly recognized handwritten digits that the neural network had never seen before
• [Figure: previously unseen test digits, all recognized correctly.] It's very good.

How well does it discriminate on the MNIST test set with no extra information about geometric distortions?
• Generative model based on RBMs: 1.25%
• Support Vector Machine (Decoste et al.): 1.4%
• Backprop with 1000 hidden units (Platt): 1.6%
• Backprop with 500 --> 300 hidden units: 1.6%
• K-Nearest Neighbor: ~3.3%
• It's better than backprop and much more neurally plausible, because the neurons only need to send one kind of signal, and the teacher can be another sensory input.

Learning perceptual physics
• Suppose we have a video sequence of some balls bouncing in a box.
• A physicist would model the data using Newton's laws. To do this, you need to decide:
  – How many objects are there?
  – What are the coordinates of their centers at each time step?
  – How elastic are they?
• Does a baby do the same as a physicist?
  – Maybe we can just learn a model of how the world behaves from the raw video.
  – It doesn't learn the abstractions that the physicist has, but it does know what it likes.
  – And what it likes is videos that obey Newtonian physics.

The conditional RBM model
• Given the data, the previous hidden states, and the previous visible frames, the hidden units at time t are conditionally independent.
  – So it is easy to sample from their conditional equilibrium distribution.
• Learning can be done by using contrastive divergence.
  – Reconstruct the data at time t from the inferred states of the hidden units.
  – The temporal connections between hiddens can be learned as if they were additional biases:
      \Delta w_{ij} = \epsilon s_i ( <s_j>_data - <s_j>_recon )
• [Figure: visible frames at t-2, t-1, and t, with directed connections from the past frames and hidden states into the hidden units j at time t.]

SHOW ILYA'S MOVIES

THE END

For more on this type of learning see: www.cs.toronto.edu/~hinton/science.pdf
For the proof that adding extra layers makes the model better, see the paper on my web page: "A fast learning algorithm for deep belief nets".

Learning with realistic labels
• [Figure: a 28 x 28 pixel image feeds 500 units, then 500 units, then 2000 top-level units connected to 10 label units.]
• This network treats the labels in a special way, but they could easily be replaced by an auditory pathway.

Learning with auditory labels
• Alex Kaganov replaced the class labels by binarized cepstral spectrograms of many different male speakers saying digits.
• The auditory pathway then had multiple layers, just like the visual pathway. The auditory and visual inputs shared the top-level layer.
• After learning, he showed it a visually ambiguous digit and then reconstructed the visual input from the representation that the top-level associative memory had settled on after 10 iterations.
• [Figure: an ambiguous original visual input, with the reconstruction after hearing "six" and the reconstruction after hearing "five".]

The features learned in the first hidden layer
• [Figure: the first-layer feature detectors.]

Seeing what it is thinking
• The top-level associative memory has activities over thousands of neurons.
  – It is hard to tell what the network is thinking by looking at the patterns of activation.
• To see what it is thinking, convert the top-level representation into an image by using the top-down connections.
  – A mental state is the state of a hypothetical world in which the internal representation is correct.
• [Figure: a brain scan showing the extra activation of cortex caused by a speech task. What were they thinking?]

What goes on in its mind if we show it an image composed of random pixels and ask it to fantasize from there?
• [Figure: paired "mind" (top-down fantasy) and "brain" (internal activity) images at each level of the network: the 2000 top-level neurons with 10 label neurons, the two 500-neuron layers, and the 28 x 28 pixel image, together with feature, data, and reconstruction panels.]