Can computer simulations of the brain allow us to see into the mind?
Geoffrey Hinton
Canadian Institute for Advanced Research & University of Toronto
Overview
• Some old theories of how cortex learns and why they fail.
• Causal generative models and how to learn them.
• Energy-based generative models and how to learn them.
  – An example: modeling a class of highly variable shapes by using a set of learned features.
• A fast learning algorithm for deep networks that have many layers of neurons.
  – A really good generative model of handwritten digits.
  – How to see into the network's mind.
How to make an intelligent system
• The cortex has about a hundred billion neurons.
• Each neuron has thousands of connections.
• So all you need to do is find the right values for the weights
on hundreds of thousands of billions of connections.
• This task is much too difficult for evolution to solve directly.
– A blind search would be much too slow.
– DNA doesn’t have enough capacity to store the answer.
• So evolution has found a learning algorithm and provided
the right hardware environment for it to work in.
– Searching the space of learning algorithms is a much
better bet than searching for weights directly.
A very simple learning task
• Consider a neural network with
two layers of neurons.
– neurons in the top layer
represent known shapes.
– neurons in the bottom layer
represent pixel intensities.
• A pixel gets to vote if it has ink
on it.
– Each inked pixel can vote
for several different shapes.
• The shape that gets the most
votes wins.
[Figure: a two-layer network in which the top-layer neurons represent the shapes 0-9 and the bottom-layer neurons represent pixel intensities.]
How to learn the weights (1960’s)
[Figure: the weights from the image pixels to the ten class units 1-9 and 0, shown for a sequence of training images and ending with the learned weights.]
Show the network an image and increment the weights from active pixels to the correct class.
Then decrement the weights from active pixels to whatever class the network guesses.
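A minimal sketch of this 1960's rule, assuming binary images flattened into vectors and integer class labels 0-9; all names here are illustrative, not from the talk.

```python
# A minimal sketch of the 1960s learning rule above, assuming binary images
# flattened into vectors and integer class labels 0-9.
import numpy as np

def train_simple_classifier(images, labels, n_classes=10, epochs=5):
    """images: (N, n_pixels) binary array; labels: (N,) integer class labels."""
    weights = np.zeros((images.shape[1], n_classes))
    for _ in range(epochs):
        for x, y in zip(images, labels):
            guess = int(np.argmax(x @ weights))  # the shape with the most votes wins
            weights[:, y] += x                   # increment weights to the correct class
            weights[:, guess] -= x               # decrement weights to the guessed class
    return weights
```

Note that when the guess is already correct the two updates cancel, so the weights only change on mistakes.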
Why the simple system does not work
• A two layer network with a single winner in the top layer
is equivalent to having a rigid template for each shape.
– The winner is the template that has the biggest
overlap with the ink.
• The ways in which shapes vary are much too
complicated to be captured by simple template matches
of whole shapes.
– To capture all the allowable variations of a shape we
need to learn the features that it is composed of.
Examples of handwritten digits from a test set
Good Old-Fashioned Neural Networks
(1980’s)
• The network is given an input vector and it must produce
an output that represents:
– a classification (e.g. the identity of a face)
– or a prediction (e.g. the price of oil tomorrow)
• The network is made of multiple layers of non-linear
neurons.
– Each neuron sums its weighted inputs from the layer
below and non-linearly transforms this sum into an
output that is sent to the layer above.
• The weights are learned by looking at a big set of
labeled training examples.
Good old-fashioned neural networks
[Figure: a network with an input vector, hidden layers, and outputs. The outputs are compared with the correct answer to get an error signal, which is back-propagated to get derivatives for learning.]
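A hedged sketch of this 1980's recipe: one hidden layer of logistic neurons trained by back-propagating a squared-error signal. The function names and the learning rate are illustrative choices, not something specified in the talk.

```python
# One-hidden-layer back-propagation sketch with logistic neurons and squared error.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, target, W1, W2, lr=0.1):
    """x: input vector; target: desired output vector; W1, W2: weight matrices."""
    # Forward pass: each neuron sums its weighted inputs and squashes the sum.
    h = sigmoid(x @ W1)                 # hidden activities
    y = sigmoid(h @ W2)                 # outputs
    # Backward pass: propagate dE/dy down through the layers to get derivatives.
    dy = (y - target) * y * (1.0 - y)   # error signal at the output layer
    dh = (dy @ W2.T) * h * (1.0 - h)    # error signal at the hidden layer
    W2 -= lr * np.outer(h, dy)          # gradient step on the output weights
    W1 -= lr * np.outer(x, dh)          # gradient step on the input weights
    return W1, W2
```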
What is wrong with back-propagation?
• It requires labeled training data.
– Almost all data is unlabeled.
• We need to fit about 10^14 connection weights in only
about 10^9 seconds.
– Unless the weights are highly redundant, labels cannot
possibly provide enough information.
• The learning time does not scale well
– It is very slow in networks with more than two or three
hidden layers.
• The neurons need to send two different types of signal
– Forward pass: signal = activity = y
– Backward pass: signal = dE/dy
Overcoming the limitations of back-propagation
• We need to keep the efficiency of using a gradient method
for adjusting the weights, but use it for modeling the
structure of the sensory input.
– Adjust the weights to maximize the probability that a
generative model would have produced the sensory
input. This is the only place to get 10^5 bits per second.
– Learn p(image) not p(label | image)
• What kind of generative model could the brain be using?
The building blocks: Binary stochastic neurons
• y is the probability of producing a spike:

  $y_j = \frac{1}{1 + e^{-x_j}}, \qquad x_j = \text{external input} + \sum_i y_i w_{ij}$

  where $y_i$ is the output of neuron $i$ and $w_{ij}$ is the synaptic weight from $i$ to $j$.
[Figure: the logistic curve of $y_j$ against the total input to neuron j, rising from 0 through 0.5 to 1.]
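A minimal sketch of such a neuron, assuming the logistic form shown above; the function names are illustrative.

```python
# Binary stochastic neuron: spike probability is the logistic of the total input.
import numpy as np

def spike_probability(external_input, y, w):
    """y: activities of the neurons feeding in; w: synaptic weights into neuron j."""
    total_input = external_input + np.dot(y, w)
    return 1.0 / (1.0 + np.exp(-total_input))

def sample_spike(external_input, y, w, rng=np.random.default_rng()):
    # Emit a 1 (spike) with the probability given by the logistic function.
    return int(rng.random() < spike_probability(external_input, y, w))
```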
Sigmoid Belief Nets
• It is easy to generate an unbiased example at the leaf nodes.
• It is typically hard to compute the posterior distribution over all possible configurations of hidden causes.
• Given samples from the posterior, it is easy to learn the local interactions.
[Figure: a belief net with a layer of hidden causes connected top-down to a layer of visible effects.]
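A minimal sketch of how the easy part works: ancestral (top-down) sampling in a one-layer sigmoid belief net, sampling the hidden causes from their biases and then the visible effects given the causes. Parameter names are illustrative.

```python
# Ancestral sampling in a one-layer sigmoid belief net.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate_from_sbn(hidden_bias, visible_bias, W, rng=np.random.default_rng()):
    """W[i, j]: weight from hidden cause i to visible effect j."""
    h = (rng.random(hidden_bias.shape) < sigmoid(hidden_bias)).astype(float)
    v_prob = sigmoid(visible_bias + h @ W)
    v = (rng.random(v_prob.shape) < v_prob).astype(float)
    return h, v   # an unbiased sample at the leaf nodes, as the slide says
```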
Explaining away
• Even if two hidden causes are independent, they can
become dependent when we observe an effect that they can
both influence.
– If we learn that there was an earthquake it reduces the
probability that the house jumped because of a truck.
[Figure: a belief net with two hidden causes, "truck hits house" and "earthquake", each with a bias of -10, both connected by weights of +20 to the visible effect "house jumps", which has a bias of -20.]
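A brute-force check of this example, using the biases and weights in the figure (-10 on each cause, -20 on the effect, +20 from each cause to the effect); the function name is illustrative.

```python
# Enumerate the posterior over the two causes given that the house jumped.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def posterior_given_house_jumps():
    b_truck, b_quake, b_jump, w = -10.0, -10.0, -20.0, 20.0
    unnorm = {}
    for truck in (0, 1):
        for quake in (0, 1):
            # Prior probability of this setting of the two independent causes.
            prior = (sigmoid(b_truck) if truck else 1 - sigmoid(b_truck)) \
                  * (sigmoid(b_quake) if quake else 1 - sigmoid(b_quake))
            # Likelihood of the observed effect given the causes.
            p_jump = sigmoid(b_jump + w * truck + w * quake)
            unnorm[(truck, quake)] = prior * p_jump
    total = sum(unnorm.values())
    return {config: p / total for config, p in unnorm.items()}

# Almost all the posterior mass lands on (truck=1, quake=0) and (truck=0, quake=1);
# the configuration with both causes active gets very little mass, even though the
# causes are independent a priori -- observing one cause explains away the other.
print(posterior_given_house_jumps())
```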
The wake-sleep algorithm
• Wake phase: Use the recognition weights to perform a bottom-up pass.
  – Train the generative weights to reconstruct activities in each layer from the layer above.
• Sleep phase: Use the generative weights to generate samples from the model.
  – Train the recognition weights to reconstruct activities in each layer from the layer below.
[Figure: a stack of layers data, h1, h2, h3 with generative weights W1, W2, W3 pointing down and recognition weights R1, R2, R3 pointing up.]
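A hedged, one-hidden-layer sketch of the wake-sleep idea: R are recognition weights, W are generative weights, and gen_bias_h is the generative bias on the hidden layer. The delta-rule updates and the omission of the other biases are simplifications made for this sketch.

```python
# One-hidden-layer wake-sleep step with logistic units.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def wake_sleep_step(v_data, R, W, gen_bias_h, lr=0.01, rng=np.random.default_rng()):
    # Wake phase: recognition weights do a bottom-up pass on real data, and the
    # generative weights learn to reconstruct the layer below from the layer above.
    h = sample(sigmoid(v_data @ R), rng)
    v_recon = sigmoid(h @ W)
    W += lr * np.outer(h, v_data - v_recon)
    # Sleep phase: generative weights produce a fantasy, and the recognition weights
    # learn to recover the hidden cause that actually generated it.
    h_fantasy = sample(sigmoid(gen_bias_h), rng)
    v_fantasy = sample(sigmoid(h_fantasy @ W), rng)
    h_guess = sigmoid(v_fantasy @ R)
    R += lr * np.outer(v_fantasy, h_fantasy - h_guess)
    return R, W
```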
How good is the wake-sleep algorithm?
• It solves the problem of where to get target values for
learning
– The wake phase provides targets for learning the
generative connections
– The sleep phase provides targets for learning the
recognition connections (because the network knows how
the fantasy data was generated)
• It only requires neurons to send one kind of signal.
• It approximates the true posterior by assuming
independence.
– This ignores explaining away which causes problems.
Two types of generative neural network
• If we connect binary stochastic neurons in a
directed acyclic graph we get Sigmoid Belief
Nets (Neal 1992).
• If we connect binary stochastic neurons using
symmetric connections we get a Boltzmann
Machine (Hinton & Sejnowski, 1983)
How a Boltzmann Machine models data
• It is not a causal generative
model (like a sigmoid belief net)
in which we first generate the
hidden states and then
generate the visible states
given the hidden ones.
• To generate a sample from the
model, we just keep
stochastically updating the
binary states of all the units
– After a while, the probability
of observing any particular
vector on the visible units will
have reached its equilibrium
value.
[Figure: a Boltzmann Machine with symmetrically connected hidden units and visible units.]
Restricted Boltzmann Machines
• We restrict the connectivity to make
learning easier.
– Only one layer of hidden units.
– No connections between hidden
units.
• In an RBM, the hidden units really
are conditionally independent given
the visible states. It only takes one
step to reach conditional equilibrium
distribution when the visible units
are clamped.
– So we can quickly get an unbiased sample from the posterior distribution when given a data-vector.
[Figure: a Restricted Boltzmann Machine: a layer of hidden units j connected to a layer of visible units i, with no hidden-to-hidden connections.]
Weights → Energies → Probabilities
• Each possible joint configuration of the visible
and hidden units has an energy
– The energy is determined by the weights and
biases.
• The energy of a joint configuration of the visible
and hidden units determines its probability.
• The probability of a configuration over the visible
units is found by summing the probabilities of all
the joint configurations that contain it.
The Energy of a joint configuration
$$E(v,h) = -\sum_i v_i b_i \;-\; \sum_j h_j b_j \;-\; \sum_{i,j} v_i h_j w_{ij}$$

Here $E(v,h)$ is the energy with configuration $v$ on the visible units and $h$ on the hidden units, $v_i$ and $h_j$ are the binary states of visible unit $i$ and hidden unit $j$, $b_i$ and $b_j$ are the biases of units $i$ and $j$, $w_{ij}$ is the weight between units $i$ and $j$, and the last sum indexes every connected visible-hidden pair.
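The same energy function written out directly. In this sketch b and c are the visible and hidden bias vectors and W[i, j] is the weight between visible unit i and hidden unit j; the naming is a choice made here, not notation from the talk.

```python
# The RBM energy of a joint configuration (v, h).
import numpy as np

def rbm_energy(v, h, b, c, W):
    """v, h: binary state vectors; b, c: biases; W: weights. Returns E(v, h)."""
    return -np.dot(v, b) - np.dot(h, c) - v @ W @ h
```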
Using energies to define probabilities
• The probability of a joint
configuration over both visible
and hidden units depends on
the energy of that joint
configuration compared with
the energy of all other joint
configurations.
• The probability of a
configuration of the visible
units is the sum of the
probabilities of all the joint
configurations that contain it.
$$p(v,h) = \frac{e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}} \qquad\qquad p(v) = \frac{\sum_h e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$$

The sum over all configurations in the denominator is the partition function.
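A brute-force check of these definitions on a tiny RBM: enumerate every joint configuration, exponentiate negative energies, and normalize by the partition function. This is only feasible for a handful of units and is meant to illustrate the definitions, not to scale; the function name is illustrative.

```python
# Exact p(v, h) and p(v) for a tiny RBM by enumeration.
import itertools
import numpy as np

def exact_probabilities(b, c, W):
    """b, c: visible and hidden biases; W: (n_visible, n_hidden) weights."""
    n_v, n_h = len(b), len(c)
    weights = {}
    for v in itertools.product([0, 1], repeat=n_v):
        for h in itertools.product([0, 1], repeat=n_h):
            vv, hh = np.array(v, float), np.array(h, float)
            energy = -np.dot(vv, b) - np.dot(hh, c) - vv @ W @ hh
            weights[(v, h)] = np.exp(-energy)
    Z = sum(weights.values())                       # the partition function
    p_joint = {cfg: w / Z for cfg, w in weights.items()}
    p_visible = {}
    for (v, h), p in p_joint.items():
        p_visible[v] = p_visible.get(v, 0.0) + p    # sum out the hidden units
    return p_joint, p_visible
```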
A picture of the maximum likelihood learning
algorithm for an RBM
[Figure: an alternating Gibbs chain. At t = 0 the visible units hold a training vector and we measure $\langle v_i h_j \rangle^0$; the chain runs through t = 1, 2, ... until t = infinity, where the visible units hold a "fantasy" and we measure $\langle v_i h_j \rangle^\infty$.]
Start with a training vector on the visible units.
Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.

$$\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty$$
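A sketch of the alternating Gibbs chain in the picture above: start at a data vector, then repeatedly update all hidden units given the visibles and all visible units given the hiddens. Run long enough, the chain yields a "fantasy" from the model. The parameters b, c, W follow the naming used in the earlier energy sketch.

```python
# Alternating Gibbs sampling in an RBM, starting from a data vector.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def gibbs_chain(v0, b, c, W, steps, rng=np.random.default_rng()):
    """steps >= 1; returns the final visible and hidden states."""
    v = v0.astype(float)
    for _ in range(steps):
        h = sample(sigmoid(c + v @ W), rng)   # update all hidden units in parallel
        v = sample(sigmoid(b + W @ h), rng)   # update all visible units in parallel
    return v, h
```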
Contrastive divergence learning:
A quick way to learn an RBM
[Figure: a truncated Gibbs chain. At t = 0 the visible units hold the data and we measure $\langle v_i h_j \rangle^0$; at t = 1 they hold the reconstruction and we measure $\langle v_i h_j \rangle^1$.]
Start with a training vector on the visible units.
Update all the hidden units in parallel.
Update all the visible units in parallel to get a "reconstruction".
Update the hidden units again.

$$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1 \right)$$
This is not following the gradient of the log likelihood. But it works well.
It is approximately following the gradient of another objective function.
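A hedged sketch of this CD-1 update: one bottom-up pass on the data, one reconstruction, one more bottom-up pass, then move the weights toward the data statistics and away from the reconstruction statistics. Using hidden probabilities rather than sampled states for the statistics is a common choice made here, not something mandated by the slide.

```python
# One contrastive-divergence (CD-1) update for an RBM.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, b, c, W, lr=0.05, rng=np.random.default_rng()):
    """v0: a binary training vector; b, c: visible/hidden biases; W: weights."""
    h0_prob = sigmoid(c + v0 @ W)                              # bottom-up pass on the data
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    v1 = sigmoid(b + W @ h0)                                   # the "reconstruction"
    h1_prob = sigmoid(c + v1 @ W)                              # hidden units updated again
    W += lr * (np.outer(v0, h0_prob) - np.outer(v1, h1_prob))
    b += lr * (v0 - v1)
    c += lr * (h0_prob - h1_prob)
    return b, c, W
```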
How to learn a set of features that are good for
reconstructing images of the digit 2
[Figure: an RBM with 50 binary feature neurons connected to a 16 x 16 pixel image. On the data (reality) we increment the weights between each active pixel and each active feature; on the reconstruction (which has lower energy than reality) we decrement the weights between each active pixel and each active feature.]
The weights of the 50 feature detectors
We start with small random weights to break symmetry.
[Figure: the final 50 x 256 weights. Each neuron grabs a different feature.]
How well can we reconstruct the digit images
from the binary feature activations?
[Figure: data and the reconstruction from activated binary features, first for new test images from the digit class that the model was trained on, then for images from an unfamiliar digit class (the network tries to see every image as a 2).]
Training a deep network
• First train a layer of features that receive input directly
from the pixels.
• Then treat the activations of the trained features as if
they were pixels and learn features of features in a
second hidden layer.
• It can be proved that each time we add another layer of
features we get a better model of the set of training
images.
– i.e. we assign lower energy to the real data and
higher energy to all other possible images.
– The proof is complicated. It uses variational free
energy, a method that physicists use for analyzing
complicated non-equilibrium systems.
– But it is based on a neat equivalence.
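A sketch of this greedy layer-by-layer recipe: train an RBM on the data, treat its hidden probabilities as if they were pixels, and train the next RBM on those. Here train_rbm is assumed to be a CD-style trainer like the one sketched earlier that returns (visible_bias, hidden_bias, weights); all names are illustrative.

```python
# Greedy layer-wise training of a stack of RBMs.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_deep_net(data, hidden_sizes, train_rbm):
    """data: (N, n_visible) array; hidden_sizes: layer widths from bottom to top."""
    layers, inputs = [], data
    for n_hidden in hidden_sizes:
        b, c, W = train_rbm(inputs, n_hidden)    # learn one layer of features
        layers.append((b, c, W))
        inputs = sigmoid(c + inputs @ W)         # these features become the "pixels"
                                                 # for the next layer
    return layers
```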
A causal network that is
equivalent to an RBM
• The variables in h1 are exactly conditionally independent given v.
  – Inference is trivial: we just multiply v by $W^T$.
  – This is because the explaining away is eliminated by the layers above h1.
• Inference in the causal network is exactly equivalent to letting a Restricted Boltzmann Machine settle to equilibrium starting at the data.
[Figure: an infinite directed net ... → h3 → h2 → h1 → v with the same weight matrix W between every pair of adjacent layers.]
A causal network that is equivalent to an RBM
• Learning the weights in an RBM is exactly equivalent to learning in an infinite causal network with tied weights.
[Figure: the same infinite directed net with tied weights W, shown next to the equivalent RBM with hidden layer h and visible layer v.]
Learning a deep causal network
• First learn with all the weights tied.
  [Figure: the infinite net with every weight matrix equal to W1.]
• Then freeze the bottom layer and relearn all the other layers.
  [Figure: the bottom weights stay at W1 while the layers above are retrained with tied weights W2.]
• Then freeze the bottom two layers and relearn all the other layers.
  [Figure: the bottom two weight matrices stay at W1 and W2 while the layers above are retrained with tied weights W3.]
The generative model after learning 3 layers
To generate data:
1. Get an equilibrium sample from the top-level RBM by performing alternating Gibbs sampling.
2. Perform a top-down pass to get states for all the other layers.
So the lower-level bottom-up connections are not part of the generative model.
[Figure: a stack of layers h3, h2, h1, data with weights W3, W2, W1; the top two layers (h3 and h2, connected by W3) form the RBM.]
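A sketch of generation from the trained stack: run alternating Gibbs sampling in the top-level RBM, then do a single top-down pass through the generative weights. Here top_rbm is (visible_bias, hidden_bias, weights) for the top RBM and generative_layers lists (bias, weights) for each top-down step; all names are illustrative.

```python
# Generate data: settle the top-level RBM, then do one top-down pass.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def generate(top_rbm, generative_layers, steps=1000, rng=np.random.default_rng()):
    b, c, W = top_rbm
    v = sample(np.full(len(b), 0.5), rng)        # random start for the Gibbs chain
    for _ in range(steps):                       # settle toward an equilibrium sample
        h = sample(sigmoid(c + v @ W), rng)
        v = sample(sigmoid(b + W @ h), rng)
    state = v
    for bias, W_down in generative_layers:       # top-down pass to the lower layers
        state = sample(sigmoid(bias + state @ W_down), rng)
    return state                                 # the generated data vector
```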
A neural model of digit recognition
The top two layers form an associative memory whose energy landscape models the low-dimensional manifolds of the digits. The energy valleys have names.
The model learns to generate combinations of labels and images.
To perform recognition we start with a neutral state of the label units and do an up-pass from the image followed by one or two iterations of the top-level associative memory.
[Figure: the architecture: a 28 x 28 pixel image feeding 500 neurons, then 500 neurons, then 2000 top-level neurons that are also connected to 10 label neurons.]
Fine-tuning with the up-down algorithm:
A contrastive divergence version of wake-sleep
• Replace the top layer of the causal network by an RBM
– This eliminates explaining away at the top-level.
– It is nice to have an associative memory at the top.
• Replace the sleep phase by a top-down pass starting with
the state of the RBM produced by the wake phase.
– This makes sure the recognition weights are trained in
the vicinity of the data.
– It also reduces mode averaging. If the recognition
weights prefer one mode, they will stick with that mode
even if the generative weights like some other mode
just as much.
Examples of correctly recognized handwritten digits
that the neural network had never seen before
It's very good.
How well does it discriminate on MNIST test set with
no extra information about geometric distortions?
• Up-down net with RBM pre-training + CD10: 1.25%
• Support Vector Machine (Decoste et al.): 1.4%
• Backprop with 1000 hiddens (Platt): 1.5%
• Backprop with 500 → 300 hiddens: 1.5%
• K-Nearest Neighbor: ~3.3%
• It's better than backprop and much more neurally plausible because the neurons only need to send one kind of signal, and the teacher can be another sensory input.
The features learned in the first hidden layer
Seeing what it is thinking
• The top level associative memory
has activities over thousands of
neurons.
– It is hard to tell what the network
is thinking by looking at the
patterns of activation.
• To see what it is thinking, convert
the top-level representation into an
image by using top-down
connections.
– A mental state is the state of a
hypothetical world in which the
internal representation is
correct.
[Figure: a "brain state": the extra activation of cortex caused by a speech task. What were they thinking?]
What goes on in its mind if we show it an
image composed of random pixels and ask it
to fantasize from there?
[Figure: the same architecture (28 x 28 pixel image, 500 neurons, 500 neurons, 2000 top-level neurons, 10 label neurons), with the network's "brain" states paired with the images that show what is on its "mind".]
Samples generated by running the associative memory
with one label clamped. Initialized by an up-pass from a
random binary image. 20 iterations between samples.
Samples generated by letting the associative
memory run with one label clamped. There are
1000 iterations of alternating Gibbs sampling
between samples.
Learning with realistic labels
This network treats the labels in a special way, but they could easily be replaced by an auditory pathway.
[Figure: the architecture: a 28 x 28 pixel image feeding 500 units, then 500 units, then 2000 top-level units connected to 10 label units.]
Learning with auditory labels
• Alex Kaganov replaced the class labels by binarized cepstral
spectrograms of many different male speakers saying digits.
• The auditory pathway then had multiple layers, just like the visual
pathway. The auditory and visual inputs shared the top level layer.
• After learning, he showed it a visually ambiguous digit and then reconstructed the visual input from the representation that the top-level associative memory had settled on after 10 iterations.
[Figure: an ambiguous original visual input, together with the reconstruction produced when the auditory input says "six" and the reconstruction produced when it says "five".]
And now for something a bit more realistic
• Handwritten digits are convenient for research
into shape recognition, but natural images of
outdoor scenes are much more complicated.
  – If we train a network on patches from natural images, does it produce sets of features that look like the ones found in real brains?
A network with local connectivity
[Figure: an image feeding a network with two hidden layers; some connections are labeled "local connectivity" and others "global connectivity".]
The local connectivity between the two hidden layers induces a topography on the hidden units.
[Figure: features learned by a net that sees 100,000 patches of natural images. The feature neurons are locally connected to each other.]
THE END
All 125 errors
[Figure: the 125 errors made on the test set.]
Best other machine learning technique gets 140 errors on this task.
(These results are without added distortions or geometric prior knowledge.)
The receptive fields of the first hidden layer
The generative fields of the first hidden layer
A simple learning algorithm
[Figure: "hidden" neurons represent features and "visible" neurons represent pixels. On the data we measure $\langle s_i s_j \rangle^{\text{data}}$; on the reconstruction we measure $\langle s_i s_j \rangle^{\text{recon}}$.]
Start with a training vector on the visible units.
Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.

$$\Delta w_{ij} = \varepsilon \left( \langle s_i s_j \rangle^{\text{data}} - \langle s_i s_j \rangle^{\text{recon}} \right)$$

where $\varepsilon$ is the learning rate.
Types of connectivity
• Feedforward networks
– These compute a series of
transformations
– Typically, the first layer is the
input and the last layer is the
output.
• Recurrent networks
– These have directed cycles
in their connection graph.
They can have complicated
dynamics.
– More biologically realistic.
[Figure: a stack of input units, hidden units, and output units.]
Binary stochastic neurons
• y is the probability of producing a spike:

  $y_j = \frac{1}{1 + e^{-x_j}}, \qquad x_j = \text{external input} + \sum_i y_i w_{ij}$

  where $y_i$ is the output of neuron $i$ and $w_{ij}$ is the synaptic weight from $i$ to $j$.
[Figure: the logistic curve of $y_j$ against the total input to j, rising from 0 through 0.5 to 1.]
A simple learning algorithm
[Figure: "hidden" neurons represent features and "visible" neurons represent pixels; the data is shown on the visible units, followed by a reconstruction.]
1. Start with a training image on the visible neurons.
2. Pick binary states for the hidden neurons. Increment weights
between active pairs.
3. Then pick binary states for the visible neurons
4. Then pick binary states for the hidden neurons again. Decrement
weights between active pairs.
This changes the weights to increase the goodness of the data and
decrease the goodness of the reconstruction.
A more interesting type of network
• Several politicians can be elected at the same time.
– A politician corresponds to a feature and each familiar
object corresponds to a coalition of features.
• The politicians decide who can vote (like in Florida).
• The whole system of voters and politicians can have several
different stable states.
– Conservative politicians discourage liberal voters from
voting.
– Liberal politicians discourage conservative voters from
voting.
• If we add some noise to the voting process, the system will
occasionally jump from one regime to another.
Stochastic binary neurons
• These have a state of 1 or 0 which is a stochastic
function of the neuron’s bias, b, and the input it receives
from other neurons.
$$p(s_i = 1) = \frac{1}{1 + \exp\!\left(-b_i - \sum_j s_j w_{ji}\right)}$$

[Figure: the logistic curve of $p(s_i = 1)$ against $b_i + \sum_j s_j w_{ji}$, rising from 0 through 0.5 to 1.]
How the brain works
• Each neuron receives inputs from thousands of other neurons.
  – A few neurons also get inputs from the sensory receptors.
  – A few neurons send outputs to muscles.
  – Neurons use binary spikes of activity to communicate.
• The effect that one neuron has on another is controlled by a synaptic weight.
  – The weights can be positive or negative.
• The synaptic weights adapt so that the whole network learns to perform useful computations.
  – Recognizing objects, understanding language, making plans, controlling the body.