Can computer simulations of the brain allow us to see into the mind?
Geoffrey Hinton
Canadian Institute for Advanced Research & University of Toronto
Overview
• Some old theories of how cortex learns and why they fail.
• Causal generative models and how to learn them.
• Energy-based generative models and how to learn them.
  – An example: modeling a class of highly variable shapes by using a set of learned features.
• A fast learning algorithm for deep networks that have many layers of neurons.
  – A really good generative model of handwritten digits.
  – How to see into the network's mind.
How to make an intelligent system
• The cortex has about a hundred billion neurons.
• Each neuron has thousands of connections.
• So all you need to do is find the right values for the weights
on hundreds of thousands of billions of connections.
• This task is much too difficult for evolution to solve directly.
– A blind search would be much too slow.
– DNA doesn’t have enough capacity to store the answer.
• So evolution has found a learning algorithm and provided
the right hardware environment for it to work in.
– Searching the space of learning algorithms is a much
better bet than searching for weights directly.
A very simple learning task
• Consider a neural network with
two layers of neurons.
– neurons in the top layer
represent known shapes.
– neurons in the bottom layer
represent pixel intensities.
• A pixel gets to vote if it has ink
on it.
– Each inked pixel can vote
for several different shapes.
• The shape that gets the most
votes wins.
[Figure: a two-layer network in which the top-layer neurons represent the shapes 0-9 and the bottom-layer neurons represent pixel intensities.]
How to learn the weights (1960’s)
[Figure: the weights from the image pixels to the ten class units 1-9 and 0, shown for a sequence of training images and ending with the learned weights.]
Show the network an image and increment the weights from active pixels to the correct class.
Then decrement the weights from active pixels to whatever class the network guesses.
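A minimal sketch of this 1960's rule, assuming binary images flattened into vectors and integer class labels 0-9; all names here are illustrative, not from the talk.

```python
# A minimal sketch of the 1960s learning rule above, assuming binary images
# flattened into vectors and integer class labels 0-9.
import numpy as np

def train_simple_classifier(images, labels, n_classes=10, epochs=5):
    """images: (N, n_pixels) binary array; labels: (N,) integer class labels."""
    weights = np.zeros((images.shape[1], n_classes))
    for _ in range(epochs):
        for x, y in zip(images, labels):
            guess = int(np.argmax(x @ weights))  # the shape with the most votes wins
            weights[:, y] += x                   # increment weights to the correct class
            weights[:, guess] -= x               # decrement weights to the guessed class
    return weights
```

Note that when the guess is already correct the two updates cancel, so the weights only change on mistakes.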
Why the simple system does not work
• A two layer network with a single winner in the top layer
is equivalent to having a rigid template for each shape.
– The winner is the template that has the biggest
overlap with the ink.
• The ways in which shapes vary are much too
complicated to be captured by simple template matches
of whole shapes.
– To capture all the allowable variations of a shape we
need to learn the features that it is composed of.
Examples of handwritten digits from a test set
Good Old-Fashioned Neural Networks
(1980’s)
• The network is given an input vector and it must produce
an output that represents:
– a classification (e.g. the identity of a face)
– or a prediction (e.g. the price of oil tomorrow)
• The network is made of multiple layers of non-linear
neurons.
– Each neuron sums its weighted inputs from the layer
below and non-linearly transforms this sum into an
output that is sent to the layer above.
• The weights are learned by looking at a big set of
labeled training examples.
Good old-fashioned neural networks
[Figure: a network with an input vector, hidden layers, and outputs. The outputs are compared with the correct answer to get an error signal, which is back-propagated to get derivatives for learning.]
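A hedged sketch of this 1980's recipe: one hidden layer of logistic neurons trained by back-propagating a squared-error signal. The function names and the learning rate are illustrative choices, not something specified in the talk.

```python
# One-hidden-layer back-propagation sketch with logistic neurons and squared error.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def backprop_step(x, target, W1, W2, lr=0.1):
    """x: input vector; target: desired output vector; W1, W2: weight matrices."""
    # Forward pass: each neuron sums its weighted inputs and squashes the sum.
    h = sigmoid(x @ W1)                 # hidden activities
    y = sigmoid(h @ W2)                 # outputs
    # Backward pass: propagate dE/dy down through the layers to get derivatives.
    dy = (y - target) * y * (1.0 - y)   # error signal at the output layer
    dh = (dy @ W2.T) * h * (1.0 - h)    # error signal at the hidden layer
    W2 -= lr * np.outer(h, dy)          # gradient step on the output weights
    W1 -= lr * np.outer(x, dh)          # gradient step on the input weights
    return W1, W2
```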
What is wrong with back-propagation?
• It requires labeled training data.
– Almost all data is unlabeled.
• We need to fit about 10^14 connection weights in only
about 10^9 seconds.
– Unless the weights are highly redundant, labels cannot
possibly provide enough information.
• The learning time does not scale well
– It is very slow in networks with more than two or three
hidden layers.
• The neurons need to send two different types of signal
– Forward pass: signal = activity = y
– Backward pass: signal = dE/dy
Overcoming the limitations of back-propagation
• We need to keep the efficiency of using a gradient method
for adjusting the weights, but use it for modeling the
structure of the sensory input.
– Adjust the weights to maximize the probability that a
generative model would have produced the sensory
input. This is the only place to get 10^5 bits per second.
– Learn p(image) not p(label | image)
• What kind of generative model could the brain be using?
The building blocks: Binary stochastic neurons
• y is the probability of producing a spike:

  $y_j = \frac{1}{1 + e^{-x_j}}, \qquad x_j = \text{external input} + \sum_i y_i w_{ij}$

  where $y_i$ is the output of neuron $i$ and $w_{ij}$ is the synaptic weight from $i$ to $j$.
[Figure: the logistic curve of $y_j$ against the total input to neuron j, rising from 0 through 0.5 to 1.]
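A minimal sketch of such a neuron, assuming the logistic form shown above; the function names are illustrative.

```python
# Binary stochastic neuron: spike probability is the logistic of the total input.
import numpy as np

def spike_probability(external_input, y, w):
    """y: activities of the neurons feeding in; w: synaptic weights into neuron j."""
    total_input = external_input + np.dot(y, w)
    return 1.0 / (1.0 + np.exp(-total_input))

def sample_spike(external_input, y, w, rng=np.random.default_rng()):
    # Emit a 1 (spike) with the probability given by the logistic function.
    return int(rng.random() < spike_probability(external_input, y, w))
```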
Sigmoid Belief Nets
• It is easy to generate an unbiased example at the leaf nodes.
• It is typically hard to compute the posterior distribution over all possible configurations of hidden causes.
• Given samples from the posterior, it is easy to learn the local interactions.
[Figure: a belief net with a layer of hidden causes connected top-down to a layer of visible effects.]
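A minimal sketch of how the easy part works: ancestral (top-down) sampling in a one-layer sigmoid belief net, sampling the hidden causes from their biases and then the visible effects given the causes. Parameter names are illustrative.

```python
# Ancestral sampling in a one-layer sigmoid belief net.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def generate_from_sbn(hidden_bias, visible_bias, W, rng=np.random.default_rng()):
    """W[i, j]: weight from hidden cause i to visible effect j."""
    h = (rng.random(hidden_bias.shape) < sigmoid(hidden_bias)).astype(float)
    v_prob = sigmoid(visible_bias + h @ W)
    v = (rng.random(v_prob.shape) < v_prob).astype(float)
    return h, v   # an unbiased sample at the leaf nodes, as the slide says
```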
Explaining away
• Even if two hidden causes are independent, they can
become dependent when we observe an effect that they can
both influence.
– If we learn that there was an earthquake it reduces the
probability that the house jumped because of a truck.
[Figure: a belief net with two hidden causes, "truck hits house" and "earthquake", each with a bias of -10, both connected by weights of +20 to the visible effect "house jumps", which has a bias of -20.]
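A brute-force check of this example, using the biases and weights in the figure (-10 on each cause, -20 on the effect, +20 from each cause to the effect); the function name is illustrative.

```python
# Enumerate the posterior over the two causes given that the house jumped.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def posterior_given_house_jumps():
    b_truck, b_quake, b_jump, w = -10.0, -10.0, -20.0, 20.0
    unnorm = {}
    for truck in (0, 1):
        for quake in (0, 1):
            # Prior probability of this setting of the two independent causes.
            prior = (sigmoid(b_truck) if truck else 1 - sigmoid(b_truck)) \
                  * (sigmoid(b_quake) if quake else 1 - sigmoid(b_quake))
            # Likelihood of the observed effect given the causes.
            p_jump = sigmoid(b_jump + w * truck + w * quake)
            unnorm[(truck, quake)] = prior * p_jump
    total = sum(unnorm.values())
    return {config: p / total for config, p in unnorm.items()}

# Almost all the posterior mass lands on (truck=1, quake=0) and (truck=0, quake=1);
# the configuration with both causes active gets very little mass, even though the
# causes are independent a priori -- observing one cause explains away the other.
print(posterior_given_house_jumps())
```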
The wake-sleep algorithm
• Wake phase: Use the recognition weights to perform a bottom-up pass.
  – Train the generative weights to reconstruct activities in each layer from the layer above.
• Sleep phase: Use the generative weights to generate samples from the model.
  – Train the recognition weights to reconstruct activities in each layer from the layer below.
[Figure: a stack of layers data, h1, h2, h3 with generative weights W1, W2, W3 pointing down and recognition weights R1, R2, R3 pointing up.]
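A hedged, one-hidden-layer sketch of the wake-sleep idea: R are recognition weights, W are generative weights, and gen_bias_h is the generative bias on the hidden layer. The delta-rule updates and the omission of the other biases are simplifications made for this sketch.

```python
# One-hidden-layer wake-sleep step with logistic units.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def wake_sleep_step(v_data, R, W, gen_bias_h, lr=0.01, rng=np.random.default_rng()):
    # Wake phase: recognition weights do a bottom-up pass on real data, and the
    # generative weights learn to reconstruct the layer below from the layer above.
    h = sample(sigmoid(v_data @ R), rng)
    v_recon = sigmoid(h @ W)
    W += lr * np.outer(h, v_data - v_recon)
    # Sleep phase: generative weights produce a fantasy, and the recognition weights
    # learn to recover the hidden cause that actually generated it.
    h_fantasy = sample(sigmoid(gen_bias_h), rng)
    v_fantasy = sample(sigmoid(h_fantasy @ W), rng)
    h_guess = sigmoid(v_fantasy @ R)
    R += lr * np.outer(v_fantasy, h_fantasy - h_guess)
    return R, W
```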
How good is the wake-sleep algorithm?
• It solves the problem of where to get target values for
learning
– The wake phase provides targets for learning the
generative connections
– The sleep phase provides targets for learning the
recognition connections (because the network knows how
the fantasy data was generated)
• It only requires neurons to send one kind of signal.
• It approximates the true posterior by assuming
independence.
– This ignores explaining away which causes problems.
Two types of generative neural network
• If we connect binary stochastic neurons in a
directed acyclic graph we get Sigmoid Belief
Nets (Neal 1992).
• If we connect binary stochastic neurons using
symmetric connections we get a Boltzmann
Machine (Hinton & Sejnowski, 1983)
How a Boltzmann Machine models data
• It is not a causal generative
model (like a sigmoid belief net)
in which we first generate the
hidden states and then
generate the visible states
given the hidden ones.
• To generate a sample from the
model, we just keep
stochastically updating the
binary states of all the units
– After a while, the probability
of observing any particular
vector on the visible units will
have reached its equilibrium
value.
[Figure: a Boltzmann Machine with symmetrically connected hidden units and visible units.]
Restricted Boltzmann Machines
• We restrict the connectivity to make
learning easier.
– Only one layer of hidden units.
– No connections between hidden
units.
• In an RBM, the hidden units really
are conditionally independent given
the visible states. It only takes one
step to reach conditional equilibrium
distribution when the visible units
are clamped.
– So we can quickly get an unbiased sample from the posterior distribution when given a data-vector.
[Figure: a Restricted Boltzmann Machine: a layer of hidden units j connected to a layer of visible units i, with no hidden-to-hidden connections.]
Weights → Energies → Probabilities
• Each possible joint configuration of the visible
and hidden units has an energy
– The energy is determined by the weights and
biases.
• The energy of a joint configuration of the visible
and hidden units determines its probability.
• The probability of a configuration over the visible
units is found by summing the probabilities of all
the joint configurations that contain it.
The Energy of a joint configuration
$$E(v,h) = -\sum_i v_i b_i \;-\; \sum_j h_j b_j \;-\; \sum_{i,j} v_i h_j w_{ij}$$

Here $E(v,h)$ is the energy with configuration $v$ on the visible units and $h$ on the hidden units, $v_i$ and $h_j$ are the binary states of visible unit $i$ and hidden unit $j$, $b_i$ and $b_j$ are the biases of units $i$ and $j$, $w_{ij}$ is the weight between units $i$ and $j$, and the last sum indexes every connected visible-hidden pair.
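The same energy function written out directly. In this sketch b and c are the visible and hidden bias vectors and W[i, j] is the weight between visible unit i and hidden unit j; the naming is a choice made here, not notation from the talk.

```python
# The RBM energy of a joint configuration (v, h).
import numpy as np

def rbm_energy(v, h, b, c, W):
    """v, h: binary state vectors; b, c: biases; W: weights. Returns E(v, h)."""
    return -np.dot(v, b) - np.dot(h, c) - v @ W @ h
```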
Using energies to define probabilities
• The probability of a joint
configuration over both visible
and hidden units depends on
the energy of that joint
configuration compared with
the energy of all other joint
configurations.
• The probability of a
configuration of the visible
units is the sum of the
probabilities of all the joint
configurations that contain it.
$$p(v,h) = \frac{e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}} \qquad\qquad p(v) = \frac{\sum_h e^{-E(v,h)}}{\sum_{u,g} e^{-E(u,g)}}$$

The sum over all configurations in the denominator is the partition function.
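A brute-force check of these definitions on a tiny RBM: enumerate every joint configuration, exponentiate negative energies, and normalize by the partition function. This is only feasible for a handful of units and is meant to illustrate the definitions, not to scale; the function name is illustrative.

```python
# Exact p(v, h) and p(v) for a tiny RBM by enumeration.
import itertools
import numpy as np

def exact_probabilities(b, c, W):
    """b, c: visible and hidden biases; W: (n_visible, n_hidden) weights."""
    n_v, n_h = len(b), len(c)
    weights = {}
    for v in itertools.product([0, 1], repeat=n_v):
        for h in itertools.product([0, 1], repeat=n_h):
            vv, hh = np.array(v, float), np.array(h, float)
            energy = -np.dot(vv, b) - np.dot(hh, c) - vv @ W @ hh
            weights[(v, h)] = np.exp(-energy)
    Z = sum(weights.values())                       # the partition function
    p_joint = {cfg: w / Z for cfg, w in weights.items()}
    p_visible = {}
    for (v, h), p in p_joint.items():
        p_visible[v] = p_visible.get(v, 0.0) + p    # sum out the hidden units
    return p_joint, p_visible
```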
A picture of the maximum likelihood learning
algorithm for an RBM
[Figure: an alternating Gibbs chain. At t = 0 the visible units hold a training vector and we measure $\langle v_i h_j \rangle^0$; the chain runs through t = 1, 2, ... until t = infinity, where the visible units hold a "fantasy" and we measure $\langle v_i h_j \rangle^\infty$.]
Start with a training vector on the visible units.
Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.

$$\frac{\partial \log p(v)}{\partial w_{ij}} = \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^\infty$$
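A sketch of the alternating Gibbs chain in the picture above: start at a data vector, then repeatedly update all hidden units given the visibles and all visible units given the hiddens. Run long enough, the chain yields a "fantasy" from the model. The parameters b, c, W follow the naming used in the earlier energy sketch.

```python
# Alternating Gibbs sampling in an RBM, starting from a data vector.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def gibbs_chain(v0, b, c, W, steps, rng=np.random.default_rng()):
    """steps >= 1; returns the final visible and hidden states."""
    v = v0.astype(float)
    for _ in range(steps):
        h = sample(sigmoid(c + v @ W), rng)   # update all hidden units in parallel
        v = sample(sigmoid(b + W @ h), rng)   # update all visible units in parallel
    return v, h
```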
Contrastive divergence learning:
A quick way to learn an RBM
[Figure: a truncated Gibbs chain. At t = 0 the visible units hold the data and we measure $\langle v_i h_j \rangle^0$; at t = 1 they hold the reconstruction and we measure $\langle v_i h_j \rangle^1$.]
Start with a training vector on the visible units.
Update all the hidden units in parallel.
Update all the visible units in parallel to get a "reconstruction".
Update the hidden units again.

$$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle^0 - \langle v_i h_j \rangle^1 \right)$$
This is not following the gradient of the log likelihood. But it works well.
It is approximately following the gradient of another objective function.
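A hedged sketch of this CD-1 update: one bottom-up pass on the data, one reconstruction, one more bottom-up pass, then move the weights toward the data statistics and away from the reconstruction statistics. Using hidden probabilities rather than sampled states for the statistics is a common choice made here, not something mandated by the slide.

```python
# One contrastive-divergence (CD-1) update for an RBM.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cd1_update(v0, b, c, W, lr=0.05, rng=np.random.default_rng()):
    """v0: a binary training vector; b, c: visible/hidden biases; W: weights."""
    h0_prob = sigmoid(c + v0 @ W)                              # bottom-up pass on the data
    h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
    v1 = sigmoid(b + W @ h0)                                   # the "reconstruction"
    h1_prob = sigmoid(c + v1 @ W)                              # hidden units updated again
    W += lr * (np.outer(v0, h0_prob) - np.outer(v1, h1_prob))
    b += lr * (v0 - v1)
    c += lr * (h0_prob - h1_prob)
    return b, c, W
```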
How to learn a set of features that are good for
reconstructing images of the digit 2
[Figure: an RBM with 50 binary feature neurons connected to a 16 x 16 pixel image. On the data (reality) we increment the weights between each active pixel and each active feature; on the reconstruction (which has lower energy than reality) we decrement the weights between each active pixel and each active feature.]
The weights of the 50 feature detectors
We start with small random weights to break symmetry.
[Figure: the final 50 x 256 weights. Each neuron grabs a different feature.]
How well can we reconstruct the digit images
from the binary feature activations?
[Figure: data and the reconstruction from activated binary features, first for new test images from the digit class that the model was trained on, then for images from an unfamiliar digit class (the network tries to see every image as a 2).]
Training a deep network
• First train a layer of features that receive input directly
from the pixels.
• Then treat the activations of the trained features as if
they were pixels and learn features of features in a
second hidden layer.
• It can be proved that each time we add another layer of
features we get a better model of the set of training
images.
– i.e. we assign lower energy to the real data and
higher energy to all other possible images.
– The proof is complicated. It uses variational free
energy, a method that physicists use for analyzing
complicated non-equilibrium systems.
– But it is based on a neat equivalence.
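A sketch of this greedy layer-by-layer recipe: train an RBM on the data, treat its hidden probabilities as if they were pixels, and train the next RBM on those. Here train_rbm is assumed to be a CD-style trainer like the one sketched earlier that returns (visible_bias, hidden_bias, weights); all names are illustrative.

```python
# Greedy layer-wise training of a stack of RBMs.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_deep_net(data, hidden_sizes, train_rbm):
    """data: (N, n_visible) array; hidden_sizes: layer widths from bottom to top."""
    layers, inputs = [], data
    for n_hidden in hidden_sizes:
        b, c, W = train_rbm(inputs, n_hidden)    # learn one layer of features
        layers.append((b, c, W))
        inputs = sigmoid(c + inputs @ W)         # these features become the "pixels"
                                                 # for the next layer
    return layers
```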
A causal network that is
equivalent to an RBM
• The variables in h1 are exactly conditionally independent given v.
  – Inference is trivial: we just multiply v by $W^T$.
  – This is because the explaining away is eliminated by the layers above h1.
• Inference in the causal network is exactly equivalent to letting a Restricted Boltzmann Machine settle to equilibrium starting at the data.
[Figure: an infinite directed net ... → h3 → h2 → h1 → v with the same weight matrix W between every pair of adjacent layers.]
A causal network that is equivalent to an RBM
• Learning the weights in an RBM is exactly equivalent to learning in an infinite causal network with tied weights.
[Figure: the same infinite directed net with tied weights W, shown next to the equivalent RBM with hidden layer h and visible layer v.]
Learning a deep causal network
• First learn with all the weights tied.
  [Figure: the infinite net with every weight matrix equal to W1.]
• Then freeze the bottom layer and relearn all the other layers.
  [Figure: the bottom weights stay at W1 while the layers above are retrained with tied weights W2.]
• Then freeze the bottom two layers and relearn all the other layers.
  [Figure: the bottom two weight matrices stay at W1 and W2 while the layers above are retrained with tied weights W3.]
The generative model after learning 3 layers
To generate data:
1. Get an equilibrium sample from the top-level RBM by performing alternating Gibbs sampling.
2. Perform a top-down pass to get states for all the other layers.
So the lower-level bottom-up connections are not part of the generative model.
[Figure: a stack of layers h3, h2, h1, data with weights W3, W2, W1; the top two layers (h3 and h2, connected by W3) form the RBM.]
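A sketch of generation from the trained stack: run alternating Gibbs sampling in the top-level RBM, then do a single top-down pass through the generative weights. Here top_rbm is (visible_bias, hidden_bias, weights) for the top RBM and generative_layers lists (bias, weights) for each top-down step; all names are illustrative.

```python
# Generate data: settle the top-level RBM, then do one top-down pass.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample(p, rng):
    return (rng.random(p.shape) < p).astype(float)

def generate(top_rbm, generative_layers, steps=1000, rng=np.random.default_rng()):
    b, c, W = top_rbm
    v = sample(np.full(len(b), 0.5), rng)        # random start for the Gibbs chain
    for _ in range(steps):                       # settle toward an equilibrium sample
        h = sample(sigmoid(c + v @ W), rng)
        v = sample(sigmoid(b + W @ h), rng)
    state = v
    for bias, W_down in generative_layers:       # top-down pass to the lower layers
        state = sample(sigmoid(bias + state @ W_down), rng)
    return state                                 # the generated data vector
```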
A neural model of digit recognition
The top two layers form an associative memory whose energy landscape models the low-dimensional manifolds of the digits. The energy valleys have names.
The model learns to generate combinations of labels and images.
To perform recognition we start with a neutral state of the label units and do an up-pass from the image followed by one or two iterations of the top-level associative memory.
[Figure: the architecture: a 28 x 28 pixel image feeding 500 neurons, then 500 neurons, then 2000 top-level neurons that are also connected to 10 label neurons.]
Fine-tuning with the up-down algorithm:
A contrastive divergence version of wake-sleep
• Replace the top layer of the causal network by an RBM
– This eliminates explaining away at the top-level.
– It is nice to have an associative memory at the top.
• Replace the sleep phase by a top-down pass starting with
the state of the RBM produced by the wake phase.
– This makes sure the recognition weights are trained in
the vicinity of the data.
– It also reduces mode averaging. If the recognition
weights prefer one mode, they will stick with that mode
even if the generative weights like some other mode
just as much.
Examples of correctly recognized handwritten digits
that the neural network had never seen before
It's very good.
How well does it discriminate on MNIST test set with
no extra information about geometric distortions?
• Up-down net with RBM pre-training + CD10: 1.25%
• Support Vector Machine (Decoste et al.): 1.4%
• Backprop with 1000 hiddens (Platt): 1.5%
• Backprop with 500 → 300 hiddens: 1.5%
• K-Nearest Neighbor: ~3.3%
• It's better than backprop and much more neurally plausible because the neurons only need to send one kind of signal, and the teacher can be another sensory input.
The features learned in the first hidden layer
Seeing what it is thinking
• The top level associative memory
has activities over thousands of
neurons.
– It is hard to tell what the network
is thinking by looking at the
patterns of activation.
• To see what it is thinking, convert
the top-level representation into an
image by using top-down
connections.
– A mental state is the state of a
hypothetical world in which the
internal representation is
correct.
[Figure: a "brain state": the extra activation of cortex caused by a speech task. What were they thinking?]
What goes on in its mind if we show it an
image composed of random pixels and ask it
to fantasize from there?
[Figure: the same architecture (28 x 28 pixel image, 500 neurons, 500 neurons, 2000 top-level neurons, 10 label neurons), with the network's "brain" states paired with the images that show what is on its "mind".]
Samples generated by running the associative memory
with one label clamped. Initialized by an up-pass from a
random binary image. 20 iterations between samples.
Samples generated by letting the associative
memory run with one label clamped. There are
1000 iterations of alternating Gibbs sampling
between samples.
Learning with realistic labels
This network treats the labels in a special way, but they could easily be replaced by an auditory pathway.
[Figure: the architecture: a 28 x 28 pixel image feeding 500 units, then 500 units, then 2000 top-level units connected to 10 label units.]
Learning with auditory labels
• Alex Kaganov replaced the class labels by binarized cepstral
spectrograms of many different male speakers saying digits.
• The auditory pathway then had multiple layers, just like the visual
pathway. The auditory and visual inputs shared the top level layer.
• After learning, he showed it a visually ambiguous digit and then reconstructed the visual input from the representation that the top-level associative memory had settled on after 10 iterations.
[Figure: an ambiguous original visual input, together with the reconstruction produced when the auditory input says "six" and the reconstruction produced when it says "five".]
And now for something a bit more realistic
• Handwritten digits are convenient for research
into shape recognition, but natural images of
outdoor scenes are much more complicated.
  – If we train a network on patches from natural images, does it produce sets of features that look like the ones found in real brains?
A network with local connectivity
[Figure: an image feeding a network with two hidden layers; some connections are labeled "local connectivity" and others "global connectivity".]
The local connectivity between the two hidden layers induces a topography on the hidden units.
[Figure: features learned by a net that sees 100,000 patches of natural images. The feature neurons are locally connected to each other.]
THE END
All 125 errors
[Figure: the 125 errors made on the test set.]
Best other machine learning technique gets 140 errors on this task.
(These results are without added distortions or geometric prior knowledge.)
The receptive fields of the first hidden layer
The generative fields of the first hidden layer
A simple learning algorithm
[Figure: "hidden" neurons represent features and "visible" neurons represent pixels. On the data we measure $\langle s_i s_j \rangle^{\text{data}}$; on the reconstruction we measure $\langle s_i s_j \rangle^{\text{recon}}$.]
Start with a training vector on the visible units.
Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.

$$\Delta w_{ij} = \varepsilon \left( \langle s_i s_j \rangle^{\text{data}} - \langle s_i s_j \rangle^{\text{recon}} \right)$$

where $\varepsilon$ is the learning rate.
Types of connectivity
• Feedforward networks
– These compute a series of
transformations
– Typically, the first layer is the
input and the last layer is the
output.
• Recurrent networks
– These have directed cycles
in their connection graph.
They can have complicated
dynamics.
– More biologically realistic.
[Figure: a stack of input units, hidden units, and output units.]
Binary stochastic neurons
• y is the probability of producing a spike:

  $y_j = \frac{1}{1 + e^{-x_j}}, \qquad x_j = \text{external input} + \sum_i y_i w_{ij}$

  where $y_i$ is the output of neuron $i$ and $w_{ij}$ is the synaptic weight from $i$ to $j$.
[Figure: the logistic curve of $y_j$ against the total input to j, rising from 0 through 0.5 to 1.]
A simple learning algorithm
[Figure: "hidden" neurons represent features and "visible" neurons represent pixels; the data is shown on the visible units, followed by a reconstruction.]
1. Start with a training image on the visible neurons.
2. Pick binary states for the hidden neurons. Increment weights
between active pairs.
3. Then pick binary states for the visible neurons
4. Then pick binary states for the hidden neurons again. Decrement
weights between active pairs.
This changes the weights to increase the goodness of the data and
decrease the goodness of the reconstruction.
A more interesting type of network
• Several politicians can be elected at the same time.
– A politician corresponds to a feature and each familiar
object corresponds to a coalition of features.
• The politicians decide who can vote (like in Florida).
• The whole system of voters and politicians can have several
different stable states.
– Conservative politicians discourage liberal voters from
voting.
– Liberal politicians discourage conservative voters from
voting.
• If we add some noise to the voting process, the system will
occasionally jump from one regime to another.
Stochastic binary neurons
• These have a state of 1 or 0 which is a stochastic
function of the neuron’s bias, b, and the input it receives
from other neurons.
$$p(s_i = 1) = \frac{1}{1 + \exp\!\left(-b_i - \sum_j s_j w_{ji}\right)}$$

[Figure: the logistic curve of $p(s_i = 1)$ against $b_i + \sum_j s_j w_{ji}$, rising from 0 through 0.5 to 1.]
How the brain works
• Each neuron receives inputs from thousands of other neurons.
  – A few neurons also get inputs from the sensory receptors.
  – A few neurons send outputs to muscles.
  – Neurons use binary spikes of activity to communicate.
• The effect that one neuron has on another is controlled by a synaptic weight.
  – The weights can be positive or negative.
• The synaptic weights adapt so that the whole network learns to perform useful computations.
  – Recognizing objects, understanding language, making plans, controlling the body.