Transcript
```
Decision making
?
Probability in games of chance
Blaise Pascal
1623 - 1662
How much should I bet on ’20’?
E[gain] = Σ_x gain(x) · Pr(x)
Decisions under uncertainty
Maximize expected value
(Pascal)
Bets should be assessed
according to
Σ_x p(x) · gain(x)
Decisions under uncertainty
The value of an alternative is a monotonic function
of the
• Probability of reward
• Magnitude of reward
Do Classical Decision Variables
Influence Brain Activity in LIP?
LIP
Varying Movement Value
Platt and Glimcher 1999
What Influences LIP?
Related to Movement Desirability
• Value/Utility of Reward
• Probability of Reward
Varying Movement Probability
Decisions under uncertainty
Neural activity in area LIP depends on:
• Probability of reward
• Magnitude of reward
Relative or absolute reward?
Dorris and Glimcher 2004
[Figure: a choice among alternatives offering different monetary amounts]
Maximization of utility
Consider a set of alternatives X and a binary
relation ≽ on it, ≽ ⊆ X × X, interpreted as "preferred
at least as much as".
Consider the following three axioms:
C1. Completeness: For every x, y ∈ X, x ≽ y or y ≽ x
C2. Transitivity: For every x, y, z ∈ X, x ≽ y and y ≽ z imply x ≽ z
C3. Separability
Theorem: A binary relation can be represented by a
real-valued function u if and only if it satisfies C1–C3.
Under these conditions, the function u is unique up to
increasing transformation.
(Cantor 1915)
A face utility function?
Is there an explicit representation of the
'value' of a choice in the brain?
Neurons in the orbitofrontal cortex encode value
Examples of neurons encoding the chosen value
A neuron encoding the value of A
A neuron encoding the value of B
A neuron encoding the chosen juice taste
Encoding takes place at different times
post-offer (a, d, e, blue),
pre-juice (b, cyan),
post-juice (c, f, black)
How does the brain learn the values?
The computational problem
The goal is to maximize the sum of rewards
V_t = E[ Σ_{τ=t}^{end} r_τ ]
The computational problem
The value of the state S1 depends on the policy
If the animal chooses ‘right’ at S1,
V  S1   R  ice cream  V  S2 
How to find the optimal policy in a
complicated world?
How to find the optimal policy in a
complicated world?
• If values of the different states are known
V  St   rt  V  St 1 
How to find the optimal policy in a
complicated world?
• If values of the different states are known
How can the values of the different states
be learned?
V  St   rt  V  St 1 
V(St) = the value of the state at time t
rt = the (average) reward delivered at time t
V(St+1) = the value of the state at time t+1
The TD (temporal difference) learning algorithm
V  St   V  St   t
where
t  rt  V  St 1   V  St  
is the TD error.
Schultz, Dayan and Montague, Science, 1997
[Figure: a chain of states 1–9 traversed on each trial, with the CS early in the trial and a reward of size 1 delivered at state 8]
Before trial 1:
V(S_1) = V(S_2) = … = V(S_9) = 0
In trial 1:
• no reward in states 1–7:
δ_t = r_t + V(S_{t+1}) − V(S_t) = 0
V(S_t) ← V(S_t) + α·δ_t = 0
• reward of size 1 in state 8:
δ_t = r_t + V(S_9) − V(S_8) = 1
V(S_8) ← V(S_8) + α·δ_t = α
Before trial 2:
V(S_1) = V(S_2) = … = V(S_7) = V(S_9) = 0
V(S_8) = α
In trial 2, for states 1–6:
δ_t = r_t + V(S_{t+1}) − V(S_t) = 0
V(S_t) ← V(S_t) + α·δ_t = 0
For state 7:
δ_t = r_t + V(S_8) − V(S_7) = α
V(S_7) ← V(S_7) + α·δ_t = α²
In trial 2, for state 8:
δ_t = r_t + V(S_9) − V(S_8) = 1 − α
V(S_8) ← V(S_8) + α·δ_t = α + α(1 − α) = 2α − α²
Before trial 3:
V(S_1) = V(S_2) = … = V(S_6) = V(S_9) = 0
V(S_7) = α², V(S_8) = 2α − α²
In trial 3, for states 1–5:
δ_t = r_t + V(S_{t+1}) − V(S_t) = 0
V(S_t) ← V(S_t) + α·δ_t = 0
For state 6:
δ_t = r_t + V(S_7) − V(S_6) = α²
V(S_6) ← V(S_6) + α·δ_t = α³
In trial 3, for state 7:
δ_t = r_t + V(S_8) − V(S_7) = (2α − α²) − α² = 2α(1 − α)
V(S_7) ← V(S_7) + α·δ_t = α² + 2α²(1 − α) = 3α² − 2α³
For state 8:
δ_t = r_t + V(S_9) − V(S_8) = 1 − (2α − α²) = (1 − α)²
V(S_8) ← V(S_8) + α·δ_t = (2α − α²) + α(1 − α)² = 3α − 3α² + α³
After many trials:
V(S_1) = V(S_2) = … = V(S_8) = 1, V(S_9) = 0
δ_t = r_t + V(S_{t+1}) − V(S_t) = 0
Except for the CS, whose time is unknown
Schultz, 1998
“We found that these neurons encoded the difference between
the current reward and a weighted average of previous rewards,
a reward prediction error, but only for outcomes that were
better than expected”.
Bayer and Glimcher, 2005
```
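Pascal's rule from the opening slides, E[gain] = Σ_x p(x) · gain(x), can be sketched in a few lines of Python. The concrete bet (one unit on a single number of a 37-slot roulette wheel, paying 35:1) is a hypothetical illustration, not taken from the slides:

```python
# Expected value of a bet: E[gain] = sum over outcomes x of p(x) * gain(x).
def expected_gain(outcomes):
    """outcomes: list of (probability, gain) pairs covering all cases."""
    return sum(p * g for p, g in outcomes)

# Hypothetical bet: 1 unit on a single number, 37 slots, 35:1 payout.
roulette_bet = [(1 / 37, 35.0), (36 / 37, -1.0)]
ev = expected_gain(roulette_bet)  # -1/37, about -0.027: a losing bet on average
```

Under Pascal's maximize-expected-value criterion, this bet would be declined in favor of any alternative with a non-negative expected gain.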
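The representation theorem above says that a preference relation satisfying C1–C3 can be encoded by a real-valued utility function, unique up to increasing transformation. A minimal sketch of the easy direction (any utility function induces a complete, transitive preference, and an increasing transform preserves it); the alternatives and utility values are hypothetical:

```python
from itertools import product

# A preference "x is at least as good as y" induced by a utility function u.
# It inherits completeness and transitivity from the total order on the reals.
def prefers(u, x, y):
    return u[x] >= u[y]

u = {"apple": 2.0, "banana": 1.0, "cherry": 2.0}  # hypothetical utilities
X = list(u)

# C1 (completeness): every pair is ranked one way or the other.
complete = all(prefers(u, x, y) or prefers(u, y, x) for x, y in product(X, X))

# C2 (transitivity): x >= y and y >= z imply x >= z.
transitive = all(not (prefers(u, x, y) and prefers(u, y, z)) or prefers(u, x, z)
                 for x, y, z in product(X, repeat=3))

# Uniqueness up to increasing transformation: v**3 + 1 is increasing,
# so it represents exactly the same preference relation.
u2 = {k: v ** 3 + 1 for k, v in u.items()}
same_order = all(prefers(u, x, y) == prefers(u2, x, y) for x, y in product(X, X))
```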
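The trial-by-trial TD walkthrough above can be reproduced with a short simulation. The 9-state chain, the unit reward at state 8, and the update rule V(S_t) ← V(S_t) + α·δ_t follow the slides; the function name and the sweep order (states visited 1 through 8 within each trial) are implementation assumptions:

```python
def run_td_trials(n_trials, alpha=0.5, n_states=9, reward_state=8):
    # V[i] is the value of state i+1; all values are 0 before trial 1.
    V = [0.0] * n_states
    for _ in range(n_trials):
        for t in range(n_states - 1):            # visit states 1..8 in order
            r = 1.0 if t == reward_state - 1 else 0.0
            delta = r + V[t + 1] - V[t]          # TD error: r_t + V(S_{t+1}) - V(S_t)
            V[t] += alpha * delta                # V(S_t) <- V(S_t) + alpha * delta
    return V

# Matches the slides' derivation: after trial 1 only V(S8) = alpha has moved;
# after trial 2, V(S7) = alpha**2 and V(S8) = 2*alpha - alpha**2;
# after many trials V(S1) = ... = V(S8) = 1 and delta is 0 everywhere.
```

Running it with α = 0.5 reproduces each intermediate value in the derivation, and the values converge to 1 for states 1–8 as claimed, since each state's value comes to predict the eventual unit reward.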