Introduction to Computational Neuroscience 2017

Lecturers:
Boris Gutkin ([email protected])
Matty Chalk ([email protected])
Gregory Dumont ([email protected])

Tuesdays 17-19, Salle Prestige
HW: help sessions TBA
http://iec-lnc.ens.fr/group-for-neural-theory/teaching-260/article/co6-course
Textbook: Dayan & Abbott, Theoretical Neuroscience, MIT Press, 2001.

Course outline

Behavior
1.1 The Rescorla-Wagner Rule (Gutkin)
1.2 Exploration-Exploitation / Reinforcement Learning: Model-free approaches (Gutkin)
1.3 Reinforcement Learning: Model-based approaches (Gutkin)
1.4 Seminar in RL

Neural Computation
2.1 Neural Decoding (Chalk)
2.2 Population Coding (Chalk)
2.3 Decision-Making (Chalk)
2.4 Neural Encoding (Chalk)
2.5 Seminar (Deneve)

Neural Biophysics
3.1 Binary Neurons and Networks (Dumont)
3.2 Firing Rate Models (Dumont)
3.3 Biophysics of Neurons (Dumont)
3.4 Hodgkin-Huxley (Dumont)
Evaluation:
Homeworks 80%: taken up next week, returned 2 weeks later. No late HW!
Written exam: 20%, June 6.

Part 1.1 Classical Conditioning: The Rescorla-Wagner Learning Rule
Cogmaster CO6 / Boris Gutkin

Classical conditioning
Ivan Pavlov (1849-1936)
- Worked on the digestive tract of dogs
- Nobel prize in 1904 for work on the digestive glands
- Most famous for "classical conditioning"
Cogmaster CO6 / Christian Machens

Conditioning protocol:
1. Testing: sound → no salivation
2. Training: sound + food → salivation
3. Training: sound + food → salivation
4. Testing: sound → salivation

Extinction protocol:
1. Training: sound + food → salivation
2. Testing: sound → salivation
3. Training: sound, no food → salivation disappears
4. Testing: sound → no salivation

Interpretation: why is this happening?
Behaviorist answer (Ivan Pavlov, 1849-1936, work in the 1890s; John B. Watson, 1878-1958, work in the 1920s): involuntary behavior, learning of associations, temporal proximity of stimuli.
Answer of Robert Rescorla and Allan R. Wagner: the animal is learning to predict.

The Rescorla-Wagner model
Training: sound + food. Testing: sound → salivation.
Assume: the dog wants to be able to predict the reward!
ui: stimulus (sound) in trial i; ui = 0 or ui = 1
ri: reward (food) in trial i; ri = 0 or ri = 1
vi: reward that the dog expects in trial i

Example trials:
Trial 1: u1 = 1, r1 = 1, prediction v1 = 0
Trial 2: u2 = 1, r2 = 1, prediction v2 = 0.5
Trial 3: u3 = 0, r3 = 1, prediction v3 = 1
Trial 4: u4 = 1, r4 = 0, prediction v4 = 1

Measure the ability of the dog to predict the reward!
Prediction error in the i-th trial:
δi = ri − vi   (actual reward minus predicted reward)
δi > 0: more reward than predicted
δi < 0: less reward than predicted
"Loss" in the i-th trial:
Li = δi² = (ri − vi)²
Minimize this loss function to maximize the ability to predict the reward!

Assume the dog's model of the world:
vi = w ui
where w is a parameter that the dog needs to learn from observations. Then:
Li = (ri − w ui)²

Update the parameter w to decrease the loss (gradient descent):
w → w − ε dLi/dw
where ε is the learning rate.

Compute the derivative (chain rule):
dLi/dw = d/dw (ri − w ui)²
= 2(ri − w ui) · (−ui)   (outer derivative times inner derivative)
= −2 ui δi   with prediction error δi = ri − w ui = ri − vi

Therefore, absorbing the factor of 2 into the learning rate (ε → ε/2):
w → w + ε δi ui
This is the Rescorla-Wagner rule, also called the "delta rule".

Conditioning explained
Stimulus (CS): ui = 1. Reward (US): ri = 1. Expected reward: vi = w ui.
Prediction error: δi = ri − vi. Learning rule: w → w + ε δi ui with ε = 0.1.
First trial: w = 0, so v1 = 0 and δ1 = 1 ⇒ w = 0.1.
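The resulting update can be sketched in a few lines of Python; this is a minimal illustration (the learning rate ε = 0.1 and the trial count are the illustrative values used throughout these slides):

```python
# Minimal sketch of the Rescorla-Wagner ("delta") rule for one stimulus.
# eps = 0.1 and the number of trials are illustrative values.

def rw_update(w, u, r, eps=0.1):
    """One Rescorla-Wagner update: w -> w + eps * delta * u."""
    v = w * u          # expected reward
    delta = r - v      # prediction error
    return w + eps * delta * u

w = 0.0
ws = []
for trial in range(100):       # conditioning: stimulus u = 1, reward r = 1
    w = rw_update(w, u=1, r=1)
    ws.append(w)

print(round(ws[0], 2), round(ws[1], 2))   # 0.1, then 0.19, as in the worked example
print(round(ws[-1], 3))                   # the weight saturates near 1
```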
Second trial: w = 0.1, so v2 = 0.1 and δ2 = 0.9 ⇒ w = 0.19.
After many trials: w = 1, and the expected reward saturates at 1.

Extinction explained
Stimulus (CS): ui = 1. Reward (US): ri = 0. Expected reward: vi = w ui.
First extinction trial: w = 1, so vi = 1 and δi = −1 ⇒ w = 0.9.
Second extinction trial: w = 0.9, so vi = 0.9 and δi = −0.9 ⇒ w = 0.81.
After many extinction trials, w decays back to 0.

A conditioning experiment: rabbit eye blinking (Schneiderman et al., Science, 1962).

Classical conditioning: blocking
Training:
1. sound + food → salivation
2. clapping + sound + food → salivation
Testing:
3. sound → salivation
4. clapping → no salivation
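Extinction is the same delta rule with the reward withheld; a minimal sketch (again with the illustrative ε = 0.1), starting from a fully conditioned weight:

```python
# Extinction under the Rescorla-Wagner rule: stimulus present (u = 1),
# reward absent (r = 0), starting from w = 1. Parameters are illustrative.

eps = 0.1
w = 1.0
ws = []
for trial in range(100):
    delta = 0 - w * 1          # prediction error r - v, with r = 0 and v = w*u
    w = w + eps * delta * 1    # delta rule
    ws.append(w)

print(round(ws[0], 2), round(ws[1], 2))   # 0.9, then 0.81, as in the slides
print(ws[-1] < 1e-3)                      # the association has decayed away
```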
Rescorla-Wagner rule vectorized
Reward prediction with two stimuli (i-th trial):
vi = w1 u1i + w2 u2i = w · ui
(expected reward = weight vector · stimulus vector; w1, w2 are the weights for stimuli 1 and 2)

Rescorla-Wagner learning rule ("delta rule"):
w1 → w1 + ε δi u1i
w2 → w2 + ε δi u2i
with prediction error δi = ri − vi.
Short-hand vector notation: w → w + ε δi ui

Blocking simulation
Initially w = (w1, w2) = (0, 0); expected reward vi = w · ui;
prediction error δi = ri − vi; learning rule w → w + ε δi ui.
For the first half of the 50 trials, stimulus 1 alone is paired with the reward;
in the second half, stimulus 2 is presented together with stimulus 1.
After several trials of training with stimulus 1 alone: w = (w1, w2) = (1, 0).
When stimulus 2 is then added, the reward is already fully predicted:
no prediction error, so no learning! The new stimulus acquires no weight; it is "blocked".

Inhibitory conditioning
Training:
1. sound + food → salivation
2. clapping + food → salivation
3. light + clapping, no food → salivation disappears
Testing:
4. sound → salivation
5. light + sound → reduced salivation

Rescorla-Wagner rule vectorized: the same simple linear model extends to any number of stimuli (i-th trial):
vi = w · ui = w1 u1i + w2 u2i + w3 u3i + ...
(expected reward = weight vector · stimulus vector)
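Both compound-stimulus effects can be reproduced with the vectorized delta rule. A minimal sketch (the trial counts and ε = 0.1 are illustrative assumptions, not taken from the slides):

```python
# Vectorized Rescorla-Wagner rule, w -> w + eps * delta * u, applied to the
# blocking and inhibitory-conditioning schedules. Parameters are illustrative.

eps = 0.1

def rw_step(w, u, r):
    v = sum(wi * ui for wi, ui in zip(w, u))   # expected reward v = w . u
    delta = r - v                              # prediction error
    return [wi + eps * delta * ui for wi, ui in zip(w, u)]

# Blocking: stimulus 1 alone with reward, then both stimuli with reward.
w_block = [0.0, 0.0]
for _ in range(100):
    w_block = rw_step(w_block, [1, 0], 1)
for _ in range(100):
    w_block = rw_step(w_block, [1, 1], 1)
print([round(wi, 2) for wi in w_block])   # stimulus 2 acquires almost no weight

# Inhibitory conditioning: s1 + reward, s2 + reward, then s2 + s3 without reward.
w_inhib = [0.0, 0.0, 0.0]
for _ in range(100):
    w_inhib = rw_step(w_inhib, [1, 0, 0], 1)
for _ in range(100):
    w_inhib = rw_step(w_inhib, [0, 1, 0], 1)
for _ in range(100):
    w_inhib = rw_step(w_inhib, [0, 1, 1], 0)
print([round(wi, 2) for wi in w_inhib])   # w3 ends up negative: an inhibitor
```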
Inhibitory conditioning simulation
Initially w = (w1, w2, w3) = (0, 0, 0); expected reward vi = w · ui;
prediction error δi = ri − vi; learning rule w → w + ε δi ui.
Phase 1 (s1 + reward): learn w1 = 1.
Phase 2 (s2 + reward): learn w2 = 1.
Phase 3 (s2 + s3, no reward): the compound over-predicts, so the shared negative
prediction error pushes both weights down equally until w2 + w3 = 0,
ending at w = (1, 0.5, −0.5).
Learned weights: w1 = 1, w2 = 0.5, w3 = −0.5.
Stimulus 3 has acquired a negative weight: it has become an inhibitor.

Secondary conditioning
Training:
1. sound + food → salivation
2. clapping + sound → salivation (*)
Testing:
3. sound → salivation
4. clapping → salivation
(*) If you do this too often, you get extinction, of course!

Now what? HOW DO WE ACT BASED ON VALUES WE LEARN?
(OPERANT CONDITIONING)

Part 1.2 The Exploration-Exploitation Dilemma: Choosing Actions in a Changing World
Cogmaster CO6 / Boris Gutkin

Exploration-Exploitation Dilemma
[Slide illustration: lunch break.]
A bee searching for nectar. Possible choices (actions) of the bee: land on a blue or a yellow flower.
Rewards (in drops of nectar): rb = 8, ry = 2.

"Policy": the bee's plan of action
Assume: choices or actions a are taken at random, according to a probabilistic "policy":
p(a = yellow), p(a = blue).

Math reminder: probabilities (playing dice)
Probability that you get A = 1:
p(A = 1) = 1/6
Probability that you get either A = 1 or A = 6:
p(A = 1 or A = 6) = 1/6 + 1/6 = 1/3
Probability to get A = 1 on the first throw and A = 6 on the second:
p(A = 1 and A = 6) = 1/6 · 1/6 = 1/36
Probability that you will not get A = 1:
p(not A = 1) = 1 − p(A = 1) = 5/6

Math reminder: probabilities, general rules
Probability of event A: p(A) ∈ [0, 1]
Two mutually exclusive events: p(A or B) = p(A) + p(B)
Two independent events: p(A and B) = p(A) p(B)
Tertium non datur (either A happens or not): p(A) + p(not A) = 1

"Policy": the bee's plan of action
The policy is normalized: p(a = blue) + p(a = yellow) = 1.
Example: p(a = blue) = 0.5, p(a = yellow) = 0.5 — over 20 trials, blue and yellow are chosen about equally often.
Example: p(a = blue) = 0.8, p(a = yellow) = 0.2 — blue is chosen most of the time.

"Optimal policy": the greedy bee
With rewards rb = 8, ry = 2, the optimal policy is p(a = blue) = 1, p(a = yellow) = 0.
BUT: what happens if the environment changes?

Day | rb | ry
 1  |  8 |  2
 2  |  2 |  8
 3  |  3 |  5
... | ...| ...

The bee needs to explore and exploit.
The "greedy" policy, p(a = blue) = 1, p(a = yellow) = 0, is BAD IF THINGS CHANGE!

"ε-greedy" policy (ε ≪ 1):
p(a = blue) = 1 − ε
p(a = yellow) = ε   (exploratory trials)

Softmax (Gibbs) policy — it depends on the rewards!
p(a = blue) = exp(β rb) / [exp(β rb) + exp(β ry)]
p(a = yellow) = exp(β ry) / [exp(β rb) + exp(β ry)]
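The softmax policy can be computed directly; a small sketch (rb = 8 and ry = 2 are the rewards of the bee example):

```python
import math

def softmax_policy(rb, ry, beta):
    """Softmax (Gibbs) action probabilities p(blue), p(yellow)."""
    zb, zy = math.exp(beta * rb), math.exp(beta * ry)
    return zb / (zb + zy), zy / (zb + zy)

print(softmax_policy(8, 2, beta=0.0))    # (0.5, 0.5): pure exploration
pb, py = softmax_policy(8, 2, beta=2.0)
print(round(pb, 4), round(py, 4))        # p(blue) is nearly 1: near-greedy
```

At β = 0 the rewards are ignored entirely; as β grows the policy approaches the greedy one, so β sets the position on the exploration-exploitation axis.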
Short-hand notation: we write p(b) for p(a = blue), etc.

[Figure: p(b) and p(y) as functions of β, for rewards rb = 8, ry = 2. At β = 0
both equal 0.5; as β grows, p(b) → 1 and p(y) → 0. Small β: exploration;
large β: exploitation.]

What does the bee know?
Not the actual rewards rb and ry, but internal estimates mb and my.

How can the bee learn the rewards?
"Greedy" update: set the estimate to the last observed reward,
mb = rb,i   (last reward on a blue flower)
my = ry,i   (last reward on a yellow flower)

[Figure: noisy rewards rb,i, ry,i over 20 trials, with the greedy estimates mb,
my jumping to each new sample.]
"Batch" update: average over the last N visits,
mb = (1/N) Σᵢ rb,i   (average reward on the last N visits to a blue flower)
my = (1/N) Σᵢ ry,i   (average reward on the last N visits to a yellow flower)

"Online" update: a "delta" rule with learning rate ε,
mb → mb + ε (rb,i − mb), i.e. mb → mb + ε δ with prediction error δ = rb,i − mb
(and likewise my → my + ε δy with δy = ry,i − my).

[Figure: noisy rewards over 20 trials, with the online estimates mb, my smoothly
tracking the average reward of each flower.]
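The online update is a leaky running average; a quick sketch with noisy rewards (the Gaussian noise model and ε = 0.1 are assumptions for illustration):

```python
import random

random.seed(0)

eps = 0.1
m = 0.0                                   # internal estimate of the blue reward
for visit in range(1000):
    r = 8 + random.gauss(0, 1)            # noisy nectar reward, true mean 8
    m = m + eps * (r - m)                 # online "delta" update

print(round(m, 1))                        # the estimate hovers near the mean of 8
```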
Idea: the bee could use its estimates of the flower rewards to change its policy!

Changing the policy online
p(b) = exp(β mb) / [exp(β mb) + exp(β my)]
p(y) = exp(β my) / [exp(β mb) + exp(β my)]
If mb > my, then p(b) > p(y), increasingly so as β grows
(small β: exploration; large β: exploitation).

Learning the nectar reward: exploration-exploitation trade-off
[Figure: actual average rewards rb, ry and estimated average rewards mb, my over
500 time steps, for the softmax (Gibbs) policy with β = 0 (a 50/50 policy),
β = 0.5, and β = 2.]

Exploration-exploitation trade-off
[Figure: average reward versus β for the softmax (Gibbs) policy; the average
reward is maximal at an intermediate value of β.]

What do real bees do?
Bumblebees (N = 5); Real, 1991, Science 253:980-986.
Solid line: blue constantly rewarded, yellow randomly rewarded; after the switch
at trial 17, blue randomly rewarded, yellow constantly rewarded.
Dashed line: less variability on the randomly rewarded flowers.
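The full loop — a softmax choice over online estimates, in a world that switches — can be sketched as below. All parameter values, the 500-step horizon, and the halfway reward swap are illustrative assumptions standing in for the simulation details in the slides:

```python
import math
import random

random.seed(1)

def simulate_bee(beta, eps=0.1, n_steps=500):
    """Softmax (Gibbs) policy over online reward estimates mb, my.
    The rewards rb, ry swap halfway through the run."""
    mb = my = 0.0
    total = 0.0
    for t in range(n_steps):
        rb, ry = (8, 2) if t < n_steps // 2 else (2, 8)
        zb = math.exp(beta * mb)
        zy = math.exp(beta * my)
        if random.random() < zb / (zb + zy):   # choose blue with prob. p(b)
            total += rb
            mb += eps * (rb - mb)              # update only the visited flower
        else:
            total += ry
            my += eps * (ry - my)
    return total / n_steps

for beta in (0.0, 0.5, 2.0):
    print(beta, round(simulate_bee(beta), 2))
# an intermediate beta typically harvests the most reward per trial
```

With β = 0 the bee samples both flowers but earns only the average reward; with a very large β it can lock onto one flower and miss the switch; an intermediate β tends to do best, which is the trade-off the slides plot.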