Lecture 4
Decision making
Probability in games of chance
Blaise Pascal
1623 - 1662
How much should I bet on ’20’?
E[gain] = Σ_x gain(x)·Pr(x)
Decisions under uncertainty
Maximize expected value
(Pascal)
Bets should be assessed according to
Σ_x p(x)·gain(x)
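As a quick illustration of Pascal's rule, here is a minimal sketch that computes the expected gain of a single-number bet. The 37-slot wheel and the 35-to-1 payout are illustrative assumptions, not figures from the lecture.

```python
# Expected gain of a bet: E[gain] = sum over outcomes of p(x) * gain(x).
# Illustrative numbers: a 1-unit bet on '20' on a 37-slot wheel paying 35 to 1.

def expected_gain(outcomes):
    """outcomes: iterable of (probability, gain) pairs."""
    return sum(p * gain for p, gain in outcomes)

bet_on_20 = [(1 / 37, 35.0),     # win: payout of 35
             (36 / 37, -1.0)]    # lose: the stake is gone
print(expected_gain(bet_on_20))  # about -0.027, a slightly negative expected value
```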
Decisions under uncertainty
The value of an alternative is a monotonic function of the
• Probability of reward
• Magnitude of reward
Do Classical Decision Variables
Influence Brain Activity in LIP?
Varying Movement Value
Platt and Glimcher 1999
What Influences LIP?
Related to Movement Desirability
• Value/Utility of Reward
• Probability of Reward
Varying Movement Probability
Decisions under uncertainty
Neural activity in area LIP depends on:
• Probability of reward
• Magnitude of reward
Relative or absolute reward?
Dorris and Glimcher 2004
Maximization of utility
Consider a set of alternatives X and a binary relation ≽ ⊆ X × X on it,
interpreted as "preferred at least as".
Consider the following three axioms:
C1. Completeness: For every x, y ∈ X, x ≽ y or y ≽ x
C2. Transitivity: For every x, y, z ∈ X, x ≽ y and y ≽ z imply x ≽ z
C3. Separability
Theorem: A binary relation ≽ can be represented by a real-valued function u
(that is, x ≽ y if and only if u(x) ≥ u(y)) if and only if it satisfies C1-C3.
Under these conditions, the function u is unique up to an increasing transformation.
(Cantor 1915)
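A minimal sketch of what the representation theorem buys us on a finite set: given a complete and transitive relation, counting how many alternatives each option is weakly preferred to yields a utility u with x ≽ y exactly when u(x) ≥ u(y). The alternatives and the preference table below are made-up examples, not data from the lecture.

```python
# Finite-case illustration of the representation theorem: a complete (C1) and
# transitive (C2) relation is represented by u(x) = #{y : x is weakly preferred to y}.
from itertools import product

X = ["apple", "banana", "cherry"]
# prefs[(x, y)] is True when x is "preferred at least as" much as y.
prefs = {(x, y): True for x, y in product(X, repeat=2)}
prefs[("banana", "apple")] = False    # apple strictly preferred to banana
prefs[("banana", "cherry")] = False   # cherry strictly preferred to banana

def complete(prefs, X):
    return all(prefs[(x, y)] or prefs[(y, x)] for x, y in product(X, repeat=2))

def transitive(prefs, X):
    return all(not (prefs[(x, y)] and prefs[(y, z)]) or prefs[(x, z)]
               for x, y, z in product(X, repeat=3))

def utility(prefs, X):
    return {x: sum(prefs[(x, y)] for y in X) for x in X}

if complete(prefs, X) and transitive(prefs, X):
    print(utility(prefs, X))  # {'apple': 3, 'banana': 1, 'cherry': 3}: apple ~ cherry, both above banana
```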
A face utility function?
Is there an explicit representation of the
'value' of a choice in the brain?
Neurons in the orbitofrontal cortex encode value
Padoa-Schioppa and Assad, 2006
Examples of neurons encoding the chosen value
A neuron encoding the value of A
A neuron encoding the value of B
A neuron encoding the chosen juice taste
Encoding takes place at different times:
post-offer (a, d, e, blue), pre-juice (b, cyan), post-juice (c, f, black)
How does the brain learn the values?
The computational problem
The goal is to maximize the sum of rewards:
Vt = E[ Σ_{τ=t}^{end} rτ ]
The computational problem
The value of the state S1 depends on the policy
If the animal chooses ‘right’ at S1,
V(S1) = R(ice cream) + V(S2)
How to find the optimal policy in a
complicated world?
• If the values of the different states are known,
then this task is easy:
V(St) = rt + V(St+1)
How can the values of the different states
be learned?
V(St) = the value of the state at time t
rt = the (average) reward delivered at time t
V(St+1) = the value of the state at time t+1
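A minimal sketch of why the task is easy once the recursion V(St) = rt + V(St+1) can be applied directly: with the rewards along a fixed path known, the values can be filled in backwards from the end of the episode. The reward sequence is an illustrative assumption.

```python
# Filling in values backwards along a known path, using V(S_t) = r_t + V(S_t+1).
# Illustrative assumption: a single reward of size 1 in state S4.

rewards = [0.0, 0.0, 0.0, 1.0, 0.0]   # r_t for states S1..S5
V = [0.0] * (len(rewards) + 1)        # one extra slot for the terminal state, value 0

for t in reversed(range(len(rewards))):
    V[t] = rewards[t] + V[t + 1]

print(V)  # [1.0, 1.0, 1.0, 1.0, 0.0, 0.0]: every state before the reward is worth 1
```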
The TD (temporal difference) learning algorithm
V(St) ← V(St) + α·δt
where α is the learning rate and
δt = rt + V(St+1) − V(St)
is the TD error.
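A minimal sketch of one trial of this update over a chain of states. The learning rate α and the reward placement are assumptions chosen to match the worked example below (a reward of size 1 delivered in state S8).

```python
# One trial of the TD(0) update along a chain of states:
#   delta_t = r_t + V(S_t+1) - V(S_t),   V(S_t) <- V(S_t) + alpha * delta_t.

def td_trial(V, rewards, alpha):
    """Sweep forward through the chain once, updating V in place."""
    for t in range(len(V) - 1):
        delta = rewards[t] + V[t + 1] - V[t]   # TD error at time t
        V[t] += alpha * delta                  # move V(S_t) toward r_t + V(S_t+1)
    return V

alpha = 0.5
V = [0.0] * 9                 # V[0..8] stand for V(S1)..V(S9), all zero before trial 1
rewards = [0.0] * 9
rewards[7] = 1.0              # reward of size 1 delivered in state S8

td_trial(V, rewards, alpha)
print(V[7])                   # 0.5 = alpha, as in trial 1 of the worked example
td_trial(V, rewards, alpha)
print(V[6], V[7])             # 0.25 = alpha**2 and 0.75 = 2*alpha - alpha**2, as in trial 2
```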
Schultz, Dayan and Montague, Science, 1997
[Diagram: a chain of states 1-9; a CS is presented along the chain and a reward of size 1 is delivered in state 8]
Before trial 1:
V(S1) = V(S2) = … = V(S9) = 0
In trial 1:
• no reward in states 1-7:
δt = rt + V(St+1) − V(St) = 0
V(St) ← V(St) + α·δt = 0
• reward of size 1 in state 8:
δt = rt + V(S9) − V(S8) = 1
V(S8) ← V(S8) + α·δt = α
Before trial 2:
V(S1) = V(S2) = … = V(S7) = V(S9) = 0
V(S8) = α
In trial 2, for states 1-6:
δt = rt + V(St+1) − V(St) = 0
V(St) ← V(St) + α·δt = 0
For state 7,
δt = rt + V(S8) − V(S7) = α
V(S7) ← V(S7) + α·δt = α²
For state 8,
δt = rt + V(S9) − V(S8) = 1 − α
V(S8) ← V(S8) + α·δt = α + α(1 − α) = 2α − α²
Before trial 3:
V(S1) = V(S2) = … = V(S6) = V(S9) = 0
V(S7) = α², V(S8) = α(2 − α)
In trial 3, for states 1-5:
δt = rt + V(St+1) − V(St) = 0
V(St) ← V(St) + α·δt = 0
For state 6,
δt = rt + V(S7) − V(S6) = α²
V(S6) ← V(S6) + α·δt = α³
For state 7,
δt = rt + V(S8) − V(S7) = α(2 − α) − α² = 2α(1 − α)
V(S7) ← V(S7) + α·δt = α² + 2α²(1 − α) = 3α² − 2α³
For state 8,
δt = rt + V(S9) − V(S8) = 1 − α(2 − α) = (1 − α)²
V(S8) ← V(S8) + α·δt = α(2 − α) + α(1 − α)² = 1 − (1 − α)³
After many trials:
V(S1) = … = V(S8) = 1, V(S9) = 0
δt = rt + V(St+1) − V(St) = 0
except at the CS, whose time is unknown (so a positive prediction error remains at the CS).
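For completeness, a self-contained sketch that reruns the same update for many trials (α = 0.2 is an arbitrary choice). It reproduces the per-trial values derived above, V(S8) = 1 − (1 − α)^n after n trials, and the converged values V(S1) = … = V(S8) = 1, V(S9) = 0.

```python
# Rerun of the chain example for many trials (alpha = 0.2 is an assumed learning rate).

alpha = 0.2
V = [0.0] * 9                  # V[0..8] stand for V(S1)..V(S9)
rewards = [0.0] * 9
rewards[7] = 1.0               # reward of size 1 in state S8

for trial in range(1, 1001):
    for t in range(8):         # one forward sweep over the transitions S_t -> S_t+1
        delta = rewards[t] + V[t + 1] - V[t]
        V[t] += alpha * delta
    if trial <= 3:
        # V(S8) after n trials matches 1 - (1 - alpha)**n from the slides
        print(trial, round(V[7], 4), round(1 - (1 - alpha) ** trial, 4))

print([round(v, 2) for v in V])  # V(S1)..V(S8) converge to 1.0, V(S9) stays 0.0
```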
Schultz, 1998
“We found that these neurons encoded the difference between
the current reward and a weighted average of previous rewards,
a reward prediction error, but only for outcomes that were
better than expected”.
Bayer and Glimcher, 2005
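A small sketch of the quantity described in the quote: the current reward minus a recency-weighted average of previous rewards. The weighting factor and the reward sequence are illustrative assumptions; the reported asymmetry (encoding only better-than-expected outcomes) is an empirical property of the neurons, not something modeled here.

```python
# Prediction error as current reward minus an exponentially weighted average
# of previous rewards. Weight and reward sequence are illustrative choices.

def prediction_errors(rewards, weight=0.7):
    """Return r_t minus a recency-weighted average of the rewards before trial t."""
    errors, expectation = [], 0.0
    for r in rewards:
        errors.append(r - expectation)                          # better/worse than expected
        expectation = weight * expectation + (1 - weight) * r   # update the running average
    return errors

print([round(e, 3) for e in prediction_errors([1.0, 1.0, 1.0, 0.0, 2.0])])
# [1.0, 0.7, 0.49, -0.657, 1.54]: positive when the outcome beats the running average
```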