Decision making

Probability in games of chance
Blaise Pascal (1623-1662): how much should I bet on '20'?
$E[\text{gain}] = \sum_x \text{gain}(x)\,\Pr(x)$

Decisions under uncertainty
Maximize expected value (Pascal): bets should be assessed according to $\sum_x p(x)\,\text{gain}(x)$.
The value of an alternative is a monotonic function of the
• Probability of reward
• Magnitude of reward

Do classical decision variables influence brain activity in LIP?
Varying movement value and varying movement probability (Platt and Glimcher, 1999).
What influences LIP? Activity is related to movement desirability:
• Value/utility of reward
• Probability of reward
Neural activity in area LIP thus depends on both the probability and the magnitude of reward.
Relative or absolute reward? (Dorris and Glimcher, 2004) [Figure: a set of choice options with different monetary payoffs ($X, $Y, $Z, ...), omitted.]

Maximization of utility
Consider a set of alternatives $X$ and a binary relation $\succeq\,\subseteq X \times X$, interpreted as "preferred at least as much as". Consider the following three axioms:
C1. Completeness: for every $x, y \in X$, $x \succeq y$ or $y \succeq x$.
C2. Transitivity: for every $x, y, z \in X$, $x \succeq y$ and $y \succeq z$ imply $x \succeq z$.
C3. Separability.
Theorem: a binary relation can be represented by a real-valued function $u$ (with $x \succeq y \iff u(x) \ge u(y)$) if and only if it satisfies C1-C3. Under these conditions, the function $u$ is unique up to an increasing transformation (Cantor, 1915).

A face utility function? Is there an explicit representation of the 'value' of a choice in the brain?
Neurons in the orbitofrontal cortex encode value (Padoa-Schioppa and Assad, 2006).
Examples of neurons encoding the chosen value:
• A neuron encoding the value of A
• A neuron encoding the value of B
• A neuron encoding the chosen juice taste
Encoding takes place at different times: post-offer (a, d, e; blue), pre-juice (b; cyan), post-juice (c, f; black).

How does the brain learn the values?

The computational problem
The goal is to maximize the sum of future rewards:
$V_t = E\!\left[\sum_{\tau=t}^{\text{end}} r_\tau\right]$
The value of the state $S_1$ depends on the policy: if the animal chooses 'right' at $S_1$, then $V(S_1) = R_{\text{ice cream}} + V(S_2)$.

How to find the optimal policy in a complicated world?
• If the values of the different states are known, the task is easy, because consecutive values satisfy
$V(S_t) = r_t + V(S_{t+1})$
where $V(S_t)$ is the value of the state at time $t$, $r_t$ is the (average) reward delivered at time $t$, and $V(S_{t+1})$ is the value of the state at time $t+1$.
• How can the values of the different states be learned?

The TD (temporal difference) learning algorithm
$V(S_t) \leftarrow V(S_t) + \eta\,\delta_t$, where $\delta_t = r_t + V(S_{t+1}) - V(S_t)$ is the TD error.
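The update above fits in a few lines of code. The following is a minimal sketch, not from the original slides: the dict-based value table, the learning-rate value, and the function name are illustrative assumptions.

```python
# Minimal sketch of the tabular TD(0) update defined above.
# The dict-based value table, the default learning rate, and the function name
# are illustrative assumptions, not part of the original slides.

def td_update(V, s_t, s_next, r_t, eta=0.1):
    """Apply one temporal-difference update to the value table V (state -> value)."""
    delta = r_t + V.get(s_next, 0.0) - V.get(s_t, 0.0)  # TD error: r_t + V(S_{t+1}) - V(S_t)
    V[s_t] = V.get(s_t, 0.0) + eta * delta              # V(S_t) <- V(S_t) + eta * delta_t
    return delta
```

Calling td_update once per visited state, trial after trial, produces the kind of trial-by-trial value estimates traced in the worked example below.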
Schultz, Dayan and Montague, Science, 1997
[Diagram: a chain of states $S_1, \ldots, S_9$; the CS marks the start of the sequence and a reward of size 1 is delivered at state $S_8$.]

Before trial 1: $V(S_1) = V(S_2) = \ldots = V(S_9) = 0$.
In trial 1:
• No reward in states 1-7: $\delta_t = r_t + V(S_{t+1}) - V(S_t) = 0$, so $V(S_t) \leftarrow V(S_t) + \eta\,\delta_t$ leaves these values at 0.
• Reward of size 1 in state 8: $\delta_t = r_t + V(S_9) - V(S_8) = 1$, so $V(S_8) \leftarrow V(S_8) + \eta\,\delta_t = \eta$.

Before trial 2: $V(S_1) = \ldots = V(S_7) = V(S_9) = 0$, $V(S_8) = \eta$.
In trial 2:
• For states 1-6: $\delta_t = r_t + V(S_{t+1}) - V(S_t) = 0$, so the values remain 0.
• For state 7: $\delta_t = r_t + V(S_8) - V(S_7) = \eta$, so $V(S_7) \leftarrow V(S_7) + \eta\,\delta_t = \eta^2$.
• For state 8: $\delta_t = r_t + V(S_9) - V(S_8) = 1 - \eta$, so $V(S_8) \leftarrow V(S_8) + \eta\,\delta_t = 2\eta - \eta^2$.

Before trial 3: $V(S_1) = \ldots = V(S_6) = V(S_9) = 0$, $V(S_7) = \eta^2$, $V(S_8) = 2\eta - \eta^2$.
In trial 3:
• For states 1-5: $\delta_t = 0$, so the values remain 0.
• For state 6: $\delta_t = r_t + V(S_7) - V(S_6) = \eta^2$, so $V(S_6) \leftarrow V(S_6) + \eta\,\delta_t = \eta^3$.
• For state 7: $\delta_t = r_t + V(S_8) - V(S_7) = 2\eta - 2\eta^2$, so $V(S_7) \leftarrow V(S_7) + \eta\,\delta_t = 3\eta^2 - 2\eta^3$.
• For state 8: $\delta_t = r_t + V(S_9) - V(S_8) = 1 - 2\eta + \eta^2 = (1-\eta)^2$, so $V(S_8) \leftarrow V(S_8) + \eta\,\delta_t = 2\eta - \eta^2 + \eta(1-\eta)^2$.

After many trials: $V(S_1) = \ldots = V(S_8) = 1$ and $V(S_9) = 0$, so $\delta_t = r_t + V(S_{t+1}) - V(S_t) = 0$ at every state, except at the CS, whose time of occurrence is unknown.

Dopamine neurons behave like this TD error signal (Schultz, 1998).
"We found that these neurons encoded the difference between the current reward and a weighted average of previous rewards, a reward prediction error, but only for outcomes that were better than expected." (Bayer and Glimcher, 2005)
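To check the arithmetic of the worked example, here is a small simulation sketch. The chain length, the placement of the single unit reward at state 8, the particular learning rate $\eta = 0.5$, and the number of trials are assumptions chosen for illustration, not values taken from the slides.

```python
# Sketch of the nine-state example above: reward of size 1 delivered at state 8,
# V(S9) held at 0, tabular TD(0) with learning rate eta.
# eta = 0.5 and 200 trials are illustrative assumptions.

eta = 0.5
V = [0.0] * 10                    # V[1] .. V[9]; index 0 unused
rewards = {8: 1.0}                # r_t = 1 at state 8, 0 elsewhere

for trial in range(1, 201):
    for s in range(1, 9):         # visit states 1..8 in order; state 9 ends the trial
        r = rewards.get(s, 0.0)
        delta = r + V[s + 1] - V[s]          # TD error
        V[s] += eta * delta                  # TD update
    if trial in (1, 2, 3):
        print(f"after trial {trial}: V(S6)={V[6]:.3f}  V(S7)={V[7]:.3f}  V(S8)={V[8]:.3f}")

print("final values V(S1)..V(S9):", [round(v, 2) for v in V[1:]])
```

With $\eta = 0.5$ the printout reproduces the hand calculation ($V(S_8) = \eta$ after trial 1, $V(S_7) = \eta^2$ and $V(S_8) = 2\eta - \eta^2$ after trial 2, and so on), and the final values approach $V(S_1) = \ldots = V(S_8) = 1$ with $V(S_9) = 0$, so the TD error inside the trial vanishes; the only remaining surprise is the unpredictable onset of the CS itself.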