Introduction to Computational Neuroscience 2017
Lecturers
Boris Gutkin ([email protected])
Matty Chalk ([email protected])
Gregory Dumont ([email protected])
Tuesdays 17-19, Salle Prestige
HW: Help sessions TBA
http://iec-lnc.ens.fr/group-for-neural-theory/teaching-260/article/co6-course
Textbook: Dayan & Abbott, Theoretical Neuroscience. MIT Press, 2001.
Behavior
1.1 The Rescorla-Wagner Rule (Gutkin)
1.2 Exploration-Exploitation / Reinforcement Learning: Model-free approaches (Gutkin)
1.3 Reinforcement Learning: Model-based approaches (Gutkin)
1.4 Seminar in RL

Neural Computation
2.1 Neural Decoding (Chalk)
2.2 Population Coding (Chalk)
2.3 Decision-Making (Chalk)
2.4 Neural Encoding (Chalk)
2.5 Seminar (Deneve)

Neural Biophysics
3.1 Binary Neurons and networks (Dumont)
3.2 Firing rate models (Dumont)
3.3 Biophysics of Neurons (Dumont)
3.4 Hodgkin-Huxley (Dumont)
Evaluation:
Homeworks 80%: taken up the next week, returned 2 weeks later. No late HW!
Written exam: 20%, June 6
Part 1.1
Classical Conditioning
The Rescorla-Wagner Learning Rule
Cogmaster CO6 / Boris Gutkin
Classical conditioning
Ivan Pavlov (1849-1936)
• Work on digestive tract of dogs
• Nobel prize in 1904 for work on digestive glands
• Most famous for “classical conditioning”
Cogmaster CO6 / Christian Machens
Classical conditioning
[Diagram of Pavlov's paradigm:
1. testing: sound → no salivation
2. training: food → salivation
3. training: sound + food → salivation
4. testing: sound → salivation]
Cogmaster CO6 / Christian Machens
Extinction
[Diagram:
1. training: sound + food → salivation
2. testing: sound → salivation
3. training: sound alone (no food) → salivation disappears
4. testing: sound → no salivation]
Cogmaster CO6 / Christian Machens
Interpretation
Why is this happening?
• involuntary behavior
• learning of associations
• temporal proximity of stimuli
Ivan Pavlov (1849-1936), work: 1890s
John B. Watson (1878-1958), work: 1920s

Robert Rescorla & Allan R. Wagner:
Why is this happening?
• learning to predict
Cogmaster CO6 / Christian Machens
The Rescorla-Wagner model
[Diagram: training (sound + food → salivation); testing (sound → salivation)]
Assume: The dog wants to be able to predict the reward!
ui: stimulus (sound) in trial i: ui = 0 or ui = 1
ri: reward (food) in trial i: ri = 0 or ri = 1
Cogmaster CO6 / Christian Machens
The Rescorla-Wagner model

Trial:         1      2      3      4
stimulus u:    u1=1   u2=1   u3=0   u4=1
reward r:      r1=1   r2=1   r3=1   r4=0
prediction v:  v1=0   v2=0.5 v3=1   v4=1

Assume: The dog wants to be able to predict the reward!
ui: stimulus in trial i: ui = 0 or ui = 1
ri: reward in trial i: ri = 0 or ri = 1
vi: reward that the dog expects in trial i
Cogmaster CO6 / Christian Machens
The Rescorla-Wagner model
Measure the ability of the dog to predict the reward!
Prediction error in the i-th trial:
δi = ri − vi   (actual reward minus predicted reward)
δi > 0: more reward than predicted
δi < 0: less reward than predicted
"Loss" in the i-th trial:
Li = δi² = (ri − vi)²
Minimize this loss function to maximize the ability to predict the reward!
Cogmaster CO6 / Christian Machens
The Rescorla-Wagner model
Minimize the loss: Li = (ri − vi)²
Assume: dog's model of the world: vi = w ui
(w is a parameter that the dog needs to learn from observations)
Then: Li = (ri − w ui)²
[Plot: loss function L as a function of the weight w — a parabola with a single minimum]
Cogmaster CO6 / Christian Machens
The Rescorla-Wagner model
Minimize the loss: Li = (ri − w ui)²
Update parameter w to decrease the loss (gradient descent):
w → w − ε dLi/dw
where ε is the learning rate.
[Plot: loss function L as a function of the weight w, with the slope dLi/dw indicated]
Cogmaster CO6 / Christian Machens
The Rescorla-Wagner model
"Loss" in the i-th trial: Li = (ri − w ui)²
Update parameter w to decrease the loss: w → w − ε dLi/dw
Compute the derivative ("chain rule": outer derivative times inner derivative):
dLi/dw = d/dw (ri − w ui)² = 2 (ri − w ui) (−ui) = −2 ui δi
with prediction error δi = ri − w ui = ri − vi.
Therefore (absorbing the factor 2 into the learning rate, ε → ε/2):
w → w + ε δi ui
This is the Rescorla-Wagner rule, also known as the "delta rule".
Cogmaster CO6 / Christian Machens
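For concreteness, here is a minimal Python sketch of this single-stimulus delta rule (an illustration added here, not part of the original slides); the names w, u, r and eps mirror the notation above.

# Minimal sketch of the single-stimulus Rescorla-Wagner ("delta") rule.
def rw_update(w, u, r, eps=0.1):
    """One update step: w -> w + eps * delta * u."""
    v = w * u          # predicted reward v_i = w * u_i
    delta = r - v      # prediction error delta_i = r_i - v_i
    return w + eps * delta * u

# Example: a single rewarded trial starting from w = 0 gives w = 0.1.
w = rw_update(0.0, u=1, r=1)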
Conditioning explained
stimulus (CS): ui = 1; reward (US): ri = 1
expected reward: vi = w ui
prediction error: δi = ri − vi
learning rule: w → w + ε δi ui, with ε = 0.1
• initially w = 0, so in the first trial v1 = 0 and δ1 = 1 ⇒ w = 0.1
• second trial: v2 = 0.1, δ2 = 0.9 ⇒ w = 0.19
• after many trials: w = 1
[Plot: expected reward as a function of trial number (0-100), rising from 0 toward 1]
Cogmaster CO6 / Christian Machens
Extinction explained
stimulus (CS): ui = 1; reward (US): ri = 0
expected reward: vi = w ui
prediction error: δi = ri − vi
learning rule: w → w + ε δi ui, with ε = 0.1
• first extinction trial: w = 1, vi = 1, δi = −1 ⇒ w = 0.9
• second extinction trial: w = 0.9, vi = 0.9, δi = −0.9 ⇒ w = 0.81
• after many extinction trials: w → 0
[Plot: expected reward as a function of trial number — rising during the stimulus + reward (CS + US) trials, then decaying back toward 0 during the stimulus-only (CS) extinction trials]
Cogmaster CO6 / Christian Machens
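The acquisition and extinction curves above can be reproduced with a short simulation. This is a sketch building on the rw_update helper introduced earlier; the trial counts and ε = 0.1 follow the slides.

# Sketch: 100 trials with stimulus + reward (acquisition), then 100 trials
# with the stimulus alone (extinction), using the delta rule.
def simulate_conditioning(n_acq=100, n_ext=100, eps=0.1):
    w, history = 0.0, []
    for _ in range(n_acq):                    # CS + US: r = 1
        w = rw_update(w, u=1, r=1, eps=eps)
        history.append(w)
    for _ in range(n_ext):                    # CS alone: r = 0
        w = rw_update(w, u=1, r=0, eps=eps)
        history.append(w)
    return history                            # expected reward v_i on each trial

curve = simulate_conditioning()
print(curve[0], curve[99], curve[-1])         # ~0.1, ~1.0, ~0.0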
A Conditioning Experiment
Rabbit eye blinking
Schneiderman et al., Science, 1962
Classical conditioning: blocking
[Diagram:
training
1. sound + food → salivation
2. clapping + sound + food → salivation
testing
3. sound → salivation
4. clapping → no salivation]
Rescorla-Wagner rule vectorized
Reward prediction with two stimuli (i-th trial):
vi = w1 u1i + w2 u2i = w · ui
(expected reward = weight vector · stimulus vector; w1, w2 are the weights for stimulus 1 and stimulus 2)
Rescorla-Wagner learning rule ("delta rule"):
w1 → w1 + ε δi u1i
w2 → w2 + ε δi u2i
with prediction error δi = ri − vi
Short-hand vector notation: w → w + ε δi ui
Cogmaster CO6 / Christian Machens
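In vector form the same rule is one line of numpy; the following sketch is illustrative (not course code) and keeps the slide notation.

import numpy as np

# Vectorized Rescorla-Wagner rule: w -> w + eps * delta * u.
def rw_update_vec(w, u, r, eps=0.1):
    v = np.dot(w, u)        # expected reward v_i = w . u_i
    delta = r - v           # prediction error delta_i = r_i - v_i
    return w + eps * delta * u

w = np.zeros(2)
w = rw_update_vec(w, np.array([1.0, 0.0]), r=1.0)   # only the weight of the presented stimulus changes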
Classical conditioning: blocking (simulation)
Initially: w = (w1, w2) = (0, 0)
expected reward: vi = w · ui
prediction error: δi = ri − vi
learning rule: w → w + ε δi ui
Phase 1: stimulus 1 is paired with the reward. After several trials: w = (w1, w2) = (1, 0).
Phase 2: stimulus 2 is added (stimulus 1 + stimulus 2 + reward). The reward is already fully predicted by stimulus 1, so there is no prediction error and no learning: w2 stays at 0.
[Plot: stimulus 1, stimulus 2, reward, and expected reward as a function of trial (1-50)]
Cogmaster CO6 / Christian Machens
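A sketch of this blocking protocol, using the vectorized update above; the phase lengths of 25 trials each are illustrative choices that roughly match the 50-trial plot.

def simulate_blocking(eps=0.1):
    w = np.zeros(2)
    for _ in range(25):                                       # phase 1: stimulus 1 + reward
        w = rw_update_vec(w, np.array([1.0, 0.0]), 1.0, eps)
    for _ in range(25):                                       # phase 2: stimulus 1 + stimulus 2 + reward
        w = rw_update_vec(w, np.array([1.0, 1.0]), 1.0, eps)
    return w

print(simulate_blocking())   # w1 close to 1, w2 close to 0: stimulus 2 is blocked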
Inhibitory conditioning
[Diagram:
training
1. sound + food → salivation
2. clapping + food → salivation
3. light + clapping (no food) → salivation disappears
testing
4. sound → salivation
5. sound + light → reduced salivation]
Cogmaster CO6 / Christian Machens
Rescorla-Wagner rule vectorized
Simple linear model (i-th trial):
vi = w · ui = w1 u1i + w2 u2i + w3 u3i + ...
(expected reward = weight vector · stimulus vector)
Cogmaster CO6 / Christian Machens
Inhibitory conditioning (simulation)
Stimuli s1, s2, s3
expected reward: vi = w · ui
prediction error: δi = ri − vi
learning rule: w → w + ε δi ui
Initially: w = (w1, w2, w3) = (0, 0, 0)
Phase 1: s1 + reward → learn w1 = 1
Phase 2: s2 + reward → learn w2 = 1
Phase 3: s2 + s3, no reward → learn w = (1, 0.5, −0.5)
Learned weights: w1 = 1, w2 = 0.5, w3 = −0.5 (s3 acquires a negative weight and suppresses the predicted reward)
[Plot: s1, s2, s3, reward, and expected reward as a function of trial (1-70)]
Cogmaster CO6 / Christian Machens
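The same vectorized rule reproduces the inhibitory-conditioning weights. Here is a sketch; the phase lengths are illustrative, and rw_update_vec is the helper defined earlier.

def simulate_inhibitory(eps=0.1, n=200):
    w = np.zeros(3)
    for _ in range(n):                                             # phase 1: s1 + reward
        w = rw_update_vec(w, np.array([1.0, 0.0, 0.0]), 1.0, eps)
    for _ in range(n):                                             # phase 2: s2 + reward
        w = rw_update_vec(w, np.array([0.0, 1.0, 0.0]), 1.0, eps)
    for _ in range(n):                                             # phase 3: s2 + s3, no reward
        w = rw_update_vec(w, np.array([0.0, 1.0, 1.0]), 0.0, eps)
    return w

print(simulate_inhibitory())   # approaches (1, 0.5, -0.5): s3 ends up with a negative weight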
Secondary conditioning
[Diagram:
training
1. sound + food → salivation
2. clapping + sound → salivation (*)
testing
3. sound → salivation
4. clapping → salivation
* if you do this too often, you get extinction, of course!]
Cogmaster CO6 / Christian Machens
Now what?
HOW DO WE ACT BASED ON VALUES WE LEARN?
(OPERANT CONDITIONING)
Part 1.2
The Exploration-Exploitation Dilemma
Choosing Actions in a Changing World
Cogmaster CO6 / Boris Gutkin
Exploration-Exploitation Dilemma
Lunch break
Bee searching for nectar
Possible choices (actions) of the bee: land on a blue or yellow flower
Rewards (in drops of nectar): rb = 8, ry = 2

"Policy": Bee's plan of action
Assume: choices or actions a are taken at random, according to a probabilistic "policy":
p(a = yellow), p(a = blue)
Math reminder
Playing dice
Probability that you get A=1:
p(A = 1) = 1/6
Probability that you get either A=1 or A=6:
p(A = 1 or A = 6) = 1/6 + 1/6 = 1/3
Probability to get A=1 on the first throw and A=6 on the second:
p(A = 1 and A = 6) = 1/6 · 1/6 = 1/36
Probability that you will not get A=1:
p(not A = 1) = 1 − p(A = 1) = 5/6
Math reminder
Probabilities: general rules
Probability of event A:
p(A) ∈ [0, 1]
Probability of two (mutually exclusive) events:
p(A or B) = p(A) + p(B)
Probability of two independent events:
p(A and B) = p(A)p(B)
Tertium non datur: (either A happens or not)
p(A) + p(not A) = 1
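These rules are easy to check numerically; the following small Monte Carlo sketch is illustrative only.

import random

random.seed(0)
N = 100_000
rolls = [random.randint(1, 6) for _ in range(N)]

p_one = sum(r == 1 for r in rolls) / N                                  # ~ 1/6
p_one_or_six = sum(r in (1, 6) for r in rolls) / N                      # ~ 1/6 + 1/6 = 1/3
pairs = zip(rolls[0::2], rolls[1::2])
p_one_then_six = sum(a == 1 and b == 6 for a, b in pairs) / (N // 2)    # ~ 1/36
print(p_one, p_one_or_six, p_one_then_six)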
"Policy": Bee's plan of action
Assume: choices or actions a are taken at random, according to a probabilistic "policy":
p(a = blue) + p(a = yellow) = 1
Example: p(a = yellow) = 0.5, p(a = blue) = 0.5
[Plot: sequence of choices (yellow or blue) over 20 trials, roughly half of each]
Example: p(a = yellow) = 0.2, p(a = blue) = 0.8
[Plot: sequence of choices over 20 trials, mostly blue]
"Optimal Policy": The greedy bee
Optimal policy: p(a = blue) = 1, p(a = yellow) = 0
Rewards: rb = 8, ry = 2
BUT: What happens if the environment changes?

Day   1   2   3   ...
rb    8   2   3   ...
ry    2   8   5   ...
Bee needs to explore and exploit
"greedy" policy:
p(a = blue) = 1
p(a = yellow) = 0
BAD IF THINGS CHANGE!
"ε-greedy" policy (ε ≪ 1):
p(a = blue) = 1 − ε
p(a = yellow) = ε   (exploratory trials)
softmax Gibbs-policy (depends on the rewards!):
p(a = blue) = exp(βrb) / [exp(βrb) + exp(βry)]
p(a = yellow) = exp(βry) / [exp(βrb) + exp(βry)]
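An ε-greedy choice between the two flowers can be written in a few lines. This sketch assumes the bee ranks the flowers by internal reward estimates m_b and m_y (these estimates are introduced a few slides below).

import random

def epsilon_greedy(m_b, m_y, eps=0.1):
    """Pick 'blue' or 'yellow': explore with probability eps, otherwise exploit."""
    if random.random() < eps:                       # exploratory trial
        return random.choice(["blue", "yellow"])
    return "blue" if m_b >= m_y else "yellow"       # exploit the higher estimate

choice = epsilon_greedy(m_b=8.0, m_y=2.0)           # "blue" on about 95% of trials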
Softmax-Gibbs Policy
p(b) = exp(βrb) / [exp(βrb) + exp(βry)]
p(y) = exp(βry) / [exp(βrb) + exp(βry)]
Short-hand notation: we write p(b) for p(a = blue), etc.
Softmax-Gibbs Policy
p(b) = exp(βrb) / [exp(βrb) + exp(βry)]
p(y) = exp(βry) / [exp(βrb) + exp(βry)]
Rewards: rb = 8, ry = 2
[Plot: p(b) and p(y) as a function of β; small β means exploration (both probabilities near 0.5), large β means exploitation (p(b) → 1, p(y) → 0)]
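A quick sketch of how these probabilities depend on β (illustrative, with rb = 8 and ry = 2 as above):

import math

def softmax_policy(r_b, r_y, beta):
    z_b, z_y = math.exp(beta * r_b), math.exp(beta * r_y)
    return z_b / (z_b + z_y), z_y / (z_b + z_y)     # p(blue), p(yellow)

for beta in (0.0, 0.2, 1.0):
    print(beta, softmax_policy(8, 2, beta))
# beta = 0 gives (0.5, 0.5): pure exploration.
# Larger beta pushes p(blue) toward 1: exploitation.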
What does the bee know?
actual reward: rb, ry
internal estimate: mb, my

How can the bee learn the rewards?
"greedy" update:
mb = rb,i   (last reward received on a blue flower)
my = ry,i   (last reward received on a yellow flower)
How can the bee learn the rewards?
"greedy" update: mb = rb,i, my = ry,i
[Plot: choices, rewards rb,i and ry,i, and the estimates mb and my over 20 trials; with the greedy update each estimate simply jumps to the most recently received reward on that flower colour]
How can the bee learn the rewards?
"greedy" update:
mb = rb,i
my = ry,i
"batch" update:
mb = (1/N) Σi=1..N rb,i   (average reward on the last N visits to a blue flower)
my = (1/N) Σi=1..N ry,i   (average reward on the last N visits to a yellow flower)
"online" update:
mb → mb + ε (rb,i − mb)
This is again a "delta" rule, mb → mb + ε δ, with learning rate ε and prediction error δ = rb,i − mb.
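A sketch of the online update (the names follow the slides; repeated rewarded visits drive the estimate toward the true average reward):

def online_update(m, r, eps=0.1):
    delta = r - m            # prediction error
    return m + eps * delta   # m -> m + eps * delta

m_b = 0.0
for _ in range(50):          # 50 visits to a blue flower worth 8 drops of nectar
    m_b = online_update(m_b, r=8.0)
print(m_b)                   # converges toward 8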
How can the bee learn the rewards?
"online" update: mb → mb + ε δb, my → my + ε δy
prediction error: δb = rb,i − mb, δy = ry,i − my
[Plot: choices, rewards, and the estimates mb and my over 20 trials; the online estimates converge gradually toward the average reward of each flower]
Idea: the bee could use its estimates about the flower rewards to change its policy!
Changing the policy online
p(b) = exp(βmb) / [exp(βmb) + exp(βmy)]
p(y) = exp(βmy) / [exp(βmb) + exp(βmy)]
The softmax is now applied to the internal estimates mb and my rather than to the actual rewards rb and ry.
[Plot: p(b) and p(y) as a function of β, for mb > my; small β means exploration, large β means exploitation]
Learning the nectar reward: exploration-exploitation trade-off
Simulations with the softmax Gibbs-policy and the online estimate updates, for β = 0 (a 50/50 policy), β = 0.5, and β = 2.
[Plots for each β: actual average rewards rb and ry, and the bee's estimated average rewards mb and my, over 500 time steps]
Exploration-exploitation tradeoff
[Plot: average reward obtained with the softmax Gibbs-policy as a function of β (0 to 3)]
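Putting the pieces together, the trade-off can be explored with a small simulation: a bee that chooses flowers with the softmax policy over its online estimates, in an environment where the two flowers swap rewards halfway through. All parameter values here (rewards, switch point, ε, β values) are illustrative, not taken from the slides.

import math, random

def run_bee(beta, n_trials=500, eps=0.1, seed=0):
    rng = random.Random(seed)
    m = {"blue": 0.0, "yellow": 0.0}                         # internal estimates
    total = 0.0
    for t in range(n_trials):
        true_r = ({"blue": 8.0, "yellow": 2.0} if t < n_trials // 2
                  else {"blue": 2.0, "yellow": 8.0})         # environment switches
        z_b = math.exp(beta * m["blue"])
        z_y = math.exp(beta * m["yellow"])
        a = "blue" if rng.random() < z_b / (z_b + z_y) else "yellow"
        r = true_r[a]
        m[a] += eps * (r - m[a])                             # online delta rule
        total += r
    return total / n_trials

for beta in (0.0, 0.5, 2.0):
    print(beta, round(run_bee(beta), 2))
# Very small beta explores too much; very large beta adapts slowly after the
# switch; an intermediate beta tends to collect the most reward.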
What do real bees do?
Bumblebees (N=5).
Solid line: blue = constantly rewarded, yellow = randomly rewarded; at trial 17 the reward contingencies switch (blue = randomly rewarded, yellow = constantly rewarded).
Dashed line: less variability on the randomly rewarded flowers.
Real, 1991, Science 253:980-986