Classical Conditioning
Pavlov’s finding (1890s-1900s):
Initially, the sight of food leads to the dog salivating. If the sound of a bell consistently accompanies or precedes the presentation of food, then after a while the sound of the bell alone leads to salivating.
Terminology
food: unconditioned stimulus, US (reward) → salivating: unconditioned response, UR
Sound of bell consistently precedes food. Afterwards, bell leads to salivating:
bell: conditioned stimulus, CS (expectation of reward) → salivating: conditioned response, CR
Variations of Conditioning 1
Extinction:
• Stimulus (bell) repeatedly presented without reward (food).
• Result: conditioned response (salivating) is reduced.
Partial reinforcement:
• Stimulus only sometimes precedes reward.
• Result: conditioned response weaker than in classical case.
Blocking (2 stimuli S1+S2):
• First: S1 associated with reward (classical case).
• Then: S1 and S2 presented together, followed by reward.
• Result: association between S2 and reward is not learned.
Variations of Conditioning 2
Inhibitory Conditioning (2 stimuli): alternate 2 types of trials:
1. S1 followed by reward.
2. S1+S2 followed by absence of reward.
Result:
S1 becomes predictor of presence of reward,
S2 becomes predictor of absence of reward.
To show this, use for example the following 2 methods:
• Train animal to predict reward based on S2. Result: learning is slowed.
• Train animal to predict reward based on S3, then present S2+S3. Result: conditioned response weaker than for S3 alone.
Variations of Conditioning 3
Overshadowing (2 stimuli):
• Repeatedly present S1+S2 followed by reward.
• Result: often, reward prediction is shared unequally between the stimuli.
Example (made up):
• Red light + high-pitch beep precede pigeon food.
• Result: red light is more effective in predicting the food than the high-pitch beep.
Secondary Conditioning:
• S1 precedes reward. Then, S2 precedes S1.
• Result: S2 leads to prediction of reward.
• But: if S2 followed by S1 is presented too often without reward, extinction will occur.
Summary of Conditioning Findings
(incomplete; conditioning has been studied extensively for decades, and there are many books on the topic)
Figure taken from Dayan & Abbott
Rescorla-Wagner Model of Classical Conditioning
(Rescorla & Wagner, 1972):
• simple trial-level model that explains a number of findings about classical conditioning
• there are many things it doesn’t explain
• a number of extensions exist
The setting: (Rescorla & Wagner, 1972):
Consider stimulus variable u representing presence (u = 1) or
absence (u = 0) of stimulus. Correspondingly, reward variable r
represents presence or absence of reward.
The expected reward v is modeled as “stimulus times weight”:
v = wu
Learning is done by adjusting the weight to minimize error
between predicted reward v and actual reward r.
How to change the weight?
Denote the prediction error by δ (delta):
δ = r-v
Learning rule:
w←w+εδu,
where ε is a learning rate.
Q: Why is this useful?
A: This rule performs stochastic gradient descent to minimize the expected squared error (r − v)²; w converges to <r>, the average reward. The Rescorla-Wagner rule is a variant of the “delta rule” used in neural networks.
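Goal: minimize squared error E = δ²
\[
\frac{\partial E}{\partial w}
= \frac{\partial}{\partial w}\,\delta^2
= \frac{\partial}{\partial w}\,(r - v)^2
= \frac{\partial}{\partial w}\,(r - wu)^2
= 2(r - wu)(-u)
= -2\delta u
\]
[Figure: δ² plotted against w, with the regions ∂E/∂w > 0 and ∂E/∂w < 0 marked]
Gradient descent step:
\[
w \leftarrow w - \eta\,\frac{\partial E}{\partial w}
\]
w ← w + εδu, where ε = 2η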
Note: in psychological terms the learning rate is a measure
of associability of stimulus with reward.
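The gradient result can be checked numerically. The following is a minimal sketch, not part of the original slides: the function names and the values of w, u, r and η are arbitrary assumptions chosen for illustration. It compares the analytic gradient −2δu with a finite-difference estimate and confirms that one gradient descent step with rate η equals one Rescorla-Wagner step with ε = 2η.

```python
def squared_error(w, u, r):
    """E = (r - w*u)**2 for a single stimulus."""
    return (r - w * u) ** 2


def analytic_gradient(w, u, r):
    """dE/dw = -2 * delta * u, with delta = r - w*u."""
    delta = r - w * u
    return -2.0 * delta * u


def numeric_gradient(w, u, r, h=1e-6):
    """Central finite-difference estimate of dE/dw."""
    return (squared_error(w + h, u, r) - squared_error(w - h, u, r)) / (2 * h)


# Illustrative values (assumptions, not taken from the slides).
w, u, r = 0.3, 1.0, 1.0
eta = 0.05
epsilon = 2 * eta

grad_analytic = analytic_gradient(w, u, r)
grad_numeric = numeric_gradient(w, u, r)

# One gradient descent step versus one Rescorla-Wagner step.
w_gradient_step = w - eta * grad_analytic
w_rw_step = w + epsilon * (r - w * u) * u

print("analytic gradient:", grad_analytic)
print("numeric gradient: ", grad_numeric)
print("gradient step:", w_gradient_step, "RW step:", w_rw_step)
```

The two gradients should agree up to floating-point error, and both update rules should yield the same new weight.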
Example:
prediction error δ = r-v; learning rule: w ← w + ε δ u
[Figure taken from Dayan & Abbott, showing conditioning, extinction, and partial reinforcement]
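The three cases can be reproduced qualitatively with a short simulation. This is a sketch under stated assumptions, not the code behind the Dayan & Abbott figure: the function name `rescorla_wagner`, the trial counts, the learning rates, and the 50% reward probability are all illustrative choices. The stimulus is present on every trial (u = 1).

```python
import random


def rescorla_wagner(rewards, epsilon=0.1, w=0.0, u=1.0):
    """Return the weight after each trial for a single stimulus.

    rewards: sequence of rewards r, one per trial.
    epsilon: learning rate; w: initial weight; u: stimulus value.
    """
    history = []
    for r in rewards:
        v = w * u                    # predicted reward
        delta = r - v                # prediction error
        w = w + epsilon * delta * u  # Rescorla-Wagner update
        history.append(w)
    return history


# Conditioning: the stimulus is always followed by reward.
acquisition = rescorla_wagner([1.0] * 200)

# Extinction: start from the learned weight, reward is withheld.
extinction = rescorla_wagner([0.0] * 200, w=acquisition[-1])

# Partial reinforcement: reward on a random half of the trials (assumed p = 0.5).
rng = random.Random(0)
partial_rewards = [1.0 if rng.random() < 0.5 else 0.0 for _ in range(2000)]
partial = rescorla_wagner(partial_rewards, epsilon=0.02)
partial_mean_w = sum(partial[-1000:]) / 1000

print("weight after conditioning:", acquisition[-1])
print("weight after extinction:  ", extinction[-1])
print("mean weight under partial reinforcement:", partial_mean_w)
```

With these settings the weight approaches 1 during conditioning, decays toward 0 during extinction, and fluctuates around the average reward under partial reinforcement.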
Multiple Stimuli
Essentially the same idea/learning rule:
In case of multiple stimuli: v = w·u
(predicted reward = dot product of stimulus vector and weight vector)
Prediction error: δ = r-v
Learning rule: w ← w + ε δ u
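\[
\frac{\partial}{\partial w_i}\,\delta^2
= \frac{\partial}{\partial w_i}\,(r - v)^2
= \frac{\partial}{\partial w_i}\,(r - \mathbf{w}\cdot\mathbf{u})^2
= 2(r - \mathbf{w}\cdot\mathbf{u})\,\frac{\partial}{\partial w_i}\Big(-\sum_j w_j u_j\Big)
= -2\delta u_i
\]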
To what extent does the Rescorla-Wagner rule account for variants of classical conditioning?
(prediction: v = w·u; error: δ = r-v; learning: w← w + ε δ u)
figure taken from Dayan&Abbott
(prediction: v = w·u; error: δ = r-v; learning: w ← w + ε δ u)
• Extinction, partial reinforcement: o.k., since w converges to <r>.
• Blocking: during pre-training, w1 converges to r. During training, v = w1u1 + w2u2 = r already, hence δ = 0 and w2 does not grow.
• Inhibitory conditioning: on S1-only trials, w1 gets a positive value. On S1+S2 trials, v = w1 + w2 must converge to zero, hence w2 becomes negative.
• Overshadowing: v = w1 + w2 goes to r, but w1 and w2 may become different if they have different learning rates εi.
• Secondary conditioning: the Rescorla-Wagner rule predicts a negative S2 weight!
• The Rescorla-Wagner rule qualitatively accounts for a wide range of conditioning phenomena, but not for secondary conditioning (and some other things).
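These predictions can be illustrated with a small multi-stimulus simulation. It is a sketch only: the helper `train`, the trial counts, and the learning rates are assumptions made for this example. In particular, secondary conditioning is modeled here as unrewarded S1+S2 compound trials, which is how a trial-level model without timing has to treat "S2 precedes S1"; that modeling choice is an assumption of this sketch.

```python
def train(trials, weights, rates):
    """Apply the Rescorla-Wagner rule over a list of (u, r) trials.

    u is a list of 0/1 stimulus indicators, r is the reward on that trial,
    and rates holds one learning rate per stimulus.
    """
    w = list(weights)
    for u, r in trials:
        v = sum(wi * ui for wi, ui in zip(w, u))  # prediction v = w . u
        delta = r - v                             # prediction error
        w = [wi + eps * delta * ui for wi, eps, ui in zip(w, rates, u)]
    return w


equal_rates = [0.1, 0.1]

# Blocking: pre-train S1 alone, then S1+S2 together with reward.
pretrained = train([([1, 0], 1.0)] * 200, [0.0, 0.0], equal_rates)
blocking = train([([1, 1], 1.0)] * 200, pretrained, equal_rates)

# Inhibitory conditioning: alternate S1 -> reward and S1+S2 -> no reward.
inhibitory = train([([1, 0], 1.0), ([1, 1], 0.0)] * 500, [0.0, 0.0], equal_rates)

# Overshadowing: S1+S2 rewarded, with different (assumed) learning rates.
overshadowing = train([([1, 1], 1.0)] * 200, [0.0, 0.0], [0.2, 0.05])

# Secondary conditioning, modeled as unrewarded S1+S2 trials after S1 training.
secondary = train([([1, 1], 0.0)] * 20, pretrained, equal_rates)

print("blocking      (w1, w2):", blocking)
print("inhibitory    (w1, w2):", inhibitory)
print("overshadowing (w1, w2):", overshadowing)
print("secondary     (w1, w2):", secondary)
```

In this sketch w2 stays near zero under blocking, becomes negative under inhibitory conditioning, and ends up smaller than w1 under overshadowing. The secondary conditioning run also drives w2 negative, which is the mismatch with the experimental finding noted above.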
Instrumental Conditioning
• Classical conditioning: only concerned with prediction of reward; doesn’t consider the agent’s actions. Reward usually depends on what you’ve done!
Edward L. Thorndike’s “Law of Effect” (1911):
• Responses to a situation that are followed by satisfaction are strengthened.
• Responses that are followed by discomfort are weakened.
“Of several responses made to the same situation, those
which are accompanied or closely followed by
satisfaction to the animal will, other things being equal,
be more firmly connected with the situation, so that,
when it recurs, they will be more likely to recur; those
which are accompanied or closely followed by discomfort
to the animal will, other things being equal, have their
connections with that situation weakened, so that, when
it recurs, they will be less likely to occur. The greater
the satisfaction or discomfort, the greater the
strengthening or weakening of the bond”
(Thorndike, 1911, p. 244)
cat in “puzzle box”
rat in “Skinner box”
multi-modal “Skinner box”
Clicker Training
• used by animal trainers to teach behaviors
• developed by Marian Kruse and Keller Breland (two of Skinner’s first students)
• used on over 150 species and robots [Kaplan et al., 2002]
• basic idea: use “click” as a secondary reinforcer
• trainer “marks” desired behavior with click at precisely the right time