Bayesian model selection and estimation: Simultaneous mixed effects for models and parameters

Daniel J. Schad a,*, Michael A. Rapp a,b, Quentin J.M. Huys c
a Klinik für Psychiatrie und Psychotherapie, Campus Mitte
* contact: [email protected]

Introduction
Bayesian model selection and estimation (BMSE):
Powerful methods for determining the most likely among sets of
competing hypotheses about the mechanisms and parameters that
generated observed data, e.g., from experiments on decision-making.
Mixed-effects (or empirical / hierarchical Bayes') models:
Provide full inference in group studies (with repeated observations for
each individual) by adequately capturing:
-  Individual differences (random effects / posteriors)
-  Mechanisms & parameters common to all individuals (fixed effects /
priors)
Previous models have assumed mixed effects
-  either for model parameters: Huys et al. [1] applied empirical Bayes'
via Expectation Maximization (EM) to reinforcement learning models
-  or for the model identity: Stephan et al. [2] developed a Variational
Bayes' (VB) method for treating models as random effects
Here:
A)  We evaluate the empirical Bayes' method assuming mixed effects
for parameters for reinforcement learning models [1]
B)  We present a novel Variational Bayes' (VB) model which considers
mixed effects for models and parameters simultaneously

B) Variational Bayes: Methods
-  Sufficient statistics approach: combine empirical Bayes [1] with
random effects for models [2]
-  Full random effects inference: Variational Bayes
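The random-effects treatment of models from [2], which the sufficient statistics approach builds on, can be illustrated with a minimal sketch. It assumes that an approximate log evidence per subject and model is already available (for example from an empirical Bayes fit); the function names, the toy numbers, and the hand-written `digamma` helper are illustrative assumptions, and this is not the full VB method presented on the poster.

```python
import math

def digamma(x):
    """Digamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:                      # shift upward, where the series is accurate
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series

def random_effects_bms(log_evidence, alpha0=1.0, tol=1e-8, max_iter=1000):
    """Variational updates for a Dirichlet over model frequencies.

    log_evidence[n][k] is an (assumed given) approximation to log p(y_n | m_k).
    Returns the Dirichlet counts alpha and per-subject model posteriors g.
    """
    n_models = len(log_evidence[0])
    alpha = [alpha0] * n_models
    g = []
    for _ in range(max_iter):
        g = []
        for row in log_evidence:
            # Unnormalised log posterior of each model for this subject.
            # The term digamma(sum(alpha)) is constant over k and cancels.
            log_u = [row[k] + digamma(alpha[k]) for k in range(n_models)]
            top = max(log_u)
            u = [math.exp(v - top) for v in log_u]
            total = sum(u)
            g.append([v / total for v in u])
        new_alpha = [alpha0 + sum(gn[k] for gn in g) for k in range(n_models)]
        change = max(abs(a - b) for a, b in zip(new_alpha, alpha))
        alpha = new_alpha
        if change < tol:
            break
    return alpha, g

# Hypothetical log evidences: four subjects favour model 0, two favour model 1.
log_evidence = [[-10.0, -14.0], [-11.0, -15.0], [-9.0, -13.0],
                [-12.0, -16.0], [-14.0, -10.0], [-15.0, -11.0]]
alpha, g = random_effects_bms(log_evidence)
expected_freq = [a / sum(alpha) for a in alpha]
print("Dirichlet counts:", alpha)
print("Expected model frequencies:", expected_freq)
```

The returned `g` gives, for each subject, the posterior probability of each model, which is the quantity that treats model identity as a random effect.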
THEORY
0.2 Parameter Estimation: Empirical Bayes
In an empirical Bayes' approach the prior is inferred from the group information by integrating over the vector of individual subject parameters, θ,

$$P(X \mid \mu_\theta, \sigma_\theta) \propto \int_{-\infty}^{\infty} d\theta \, P(X, \theta \mid \mu_\theta, \sigma_\theta) \qquad (4)$$

and prior parameters are set to their ML estimates

$$\hat{\mu}_\theta, \hat{\sigma}_\theta = \operatorname*{argmax}_{\mu_\theta, \sigma_\theta} P(X \mid \mu_\theta, \sigma_\theta) = \operatorname*{argmax}_{\mu_\theta, \sigma_\theta} \int_{-\infty}^{\infty} d\theta \, P(X \mid \theta) \, P(\theta \mid \mu_\theta, \sigma_\theta) \qquad (5)$$

Figure 21: Bayesian dependency graphs for our random effects generative model for multi-subject data. Rectangles denote deterministic parameters and shaded circles represent observed values. α = parameters of the Dirichlet distribution (number of model "occurrences"); r = parameters of the multinomial distribution (probabilities of the models); m = model labels; θ = individual subject parameters; µ = prior group mean; σ² = prior group variance; µ0, ν = hyper-priors for the µ parameter; a0, b0 = hyper-priors for the σ² parameter; y = observed data; y | m = probability of the data given model k; k = model index; K = number of models; n = subject index; N = number of subjects.

Simulations
Simulations from known decision processes with N = 90 simulated subjects:
-  Simple RL: ab = simple RL model, assuming 1 state, 2 actions, learning rate (a) and inverse noisiness (b) parameters (N = 60)
-  Rep = repetition model (N = 30)

A) Empirical Bayes
-  Generating prior parameters can be recovered from simulated data
-  The precision scales with the number of data points, as theoretically expected
[Figure, panels A and B: Prior Mean and Prior Variance for the parameters Beta and Learning Rate.]

B) Variational Bayes: Results
Simple RL + 200 trials: With sufficient data, the correct model can be identified for all subjects and for both methods with high certainty.
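To make the simulation setup under "Simulations" concrete, here is a minimal sketch of how data for the simple RL ("ab") model could be generated. The poster only states 1 state, 2 actions, a learning rate (a), and an inverse noisiness (b); the reward probabilities, the Gaussian group prior in unconstrained space, the logistic and exponential transforms, the prior values, and all function names below are assumptions made for illustration.

```python
import math
import random

def simulate_ab_agent(learning_rate, beta, n_trials, rng, reward_probs=(0.8, 0.2)):
    """Simulate one 'ab' agent: 1 state, 2 actions, delta-rule learning.

    reward_probs is a hypothetical payoff schedule, not taken from the poster.
    """
    q = [0.0, 0.0]                      # action values
    choices, rewards = [], []
    for _ in range(n_trials):
        # Softmax over two actions is a logistic function of the value difference.
        p_action1 = 1.0 / (1.0 + math.exp(-beta * (q[1] - q[0])))
        action = 1 if rng.random() < p_action1 else 0
        reward = 1.0 if rng.random() < reward_probs[action] else 0.0
        q[action] += learning_rate * (reward - q[action])   # prediction-error update
        choices.append(action)
        rewards.append(reward)
    return choices, rewards

def simulate_group(n_subjects, n_trials, prior_mean, prior_sd, seed=0):
    """Draw subject parameters from a Gaussian group prior and simulate choices."""
    rng = random.Random(seed)
    group = []
    for _ in range(n_subjects):
        # Assumed parameterisation: sample in unconstrained space, then transform
        # so that 0 < learning_rate < 1 and beta > 0.
        learning_rate = 1.0 / (1.0 + math.exp(-rng.gauss(prior_mean["a"], prior_sd["a"])))
        beta = math.exp(rng.gauss(prior_mean["b"], prior_sd["b"]))
        choices, rewards = simulate_ab_agent(learning_rate, beta, n_trials, rng)
        group.append({"learning_rate": learning_rate, "beta": beta,
                      "choices": choices, "rewards": rewards})
    return group

# 60 simulated 'ab' subjects with 200 trials each (prior values are hypothetical).
group = simulate_group(n_subjects=60, n_trials=200,
                       prior_mean={"a": 0.0, "b": 1.0},
                       prior_sd={"a": 1.0, "b": 0.5})
print(len(group), "subjects,", len(group[0]["choices"]), "trials each")
```

Changing `n_trials` from 200 to 20 gives the scarce-data regime in which the poster compares the methods.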
[Figure, panels C and D: Group Level Precision [1/SE] by # Subjects and # Trials; Log Likelihood [Relative to Maximum] for Prior Mean: Beta and Prior Mean: Learning Rate.]
[Figure: Average Correct Probability Mass by Computational Model (Rep, ab) and # Data Total [Rank]; Parameter Values (True and Estimates +/− 95% CI) by Model Parameters for Model: ab and Model: Rep. Legend: EM, Sufficient Statistics, Full Variational Bayes, True.]

0.2.1 Expectation Maximization with Laplace approximation
To estimate model parameters we use Expectation Maximization (EM), where we iterate between an expectation step, where we approximate the distribution of latent subject parameters, and a maximization step, where we search for maximum likelihood estimates (MLE) for the prior parameters. In this procedure, each parameter update is obtained conditional on the currently best estimates for the other parameters.

To ease computations with the likelihood of the prior parameters we take its logarithm (see Equation 6). As the distribution of individual subject parameters cannot in general be solved analytically and directly, we introduce the distribution q(θ) ≡ Normal(θ; θ_0, S) to approximate the distribution over θ via a normal distribution (i.e., Laplace approximation). In the E-step, we approximate this distribution based on the prior parameters and the data. In the M-step, we can subsequently update our estimates for the other (prior) parameters given this approximation to θ. To solve equation (7) we utilize Jensen's inequality and move the logarithm into the integral in (8).

$$\begin{aligned}
\log p(X \mid \mu_\theta, \sigma_\theta) &= \log \int_{-\infty}^{\infty} d\theta \, p(X, \theta \mid \mu_\theta, \sigma_\theta) && (6) \\
&= \log \int_{-\infty}^{\infty} d\theta \, q(\theta) \, \frac{p(X, \theta \mid \mu_\theta, \sigma_\theta)}{q(\theta)} && (7) \\
&\geq \int_{-\infty}^{\infty} d\theta \, q(\theta) \, \log \frac{p(X, \theta \mid \mu_\theta, \sigma_\theta)}{q(\theta)} && (8)
\end{aligned}$$

Table. Δ BIC scores (model BIC minus BIC for the best model for the dataset) for six different models for six different datasets generated from these same models. The highlighted diagonal shows that for all simulated data sets the generating model was also recovered from the data.

Generating    Fitted model
model         m2b2alr      mr  2b2alr  m2b2al       m   2b2al
m2b2alr             0     337      49     441    1297     531
mr                 42       0     428     800     801    1490
2b2alr             12     841       0     280    2678     271
m2b2al              6     452      95       0     514      83
m                  40      21     408      45       0     436
2b2al              16    1391       5      18    2271       0

BICint extracts the true generating model from the data.

The likelihood for the prior is approximately Gaussian, providing a basis for a Laplace approximation to derive error bars, with normative alpha errors.
[Figure axes: Deviation from MLE [dx]; Total # of observations [sqrt N x T].]

Simple RL + 20 trials: With scarce data per subject, the full VB method improves model comparison compared to the sufficient statistics approach.

2step: Hybrid = model-based + model-free; model-free; non-learner [3]
2step + 201 trials: Using three similar models with differing model parameters in the 2step task also yields posterior uncertainty, and the full VB performs best.
[Figure label: N = 30.]
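The EM scheme with a Laplace approximation described in section 0.2.1 can be sketched on a deliberately simple stand-in model. Instead of a reinforcement learning model, each hypothetical subject here produces binary outcomes with success probability logistic(θ_n), and θ_n follows a Gaussian group prior with mean `mu` and variance `var`. The toy likelihood, the Newton optimiser, the starting values, and all names are assumptions for illustration, not the authors' implementation.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def laplace_e_step(k, n_trials, mu, var, n_newton=50):
    """E-step for one subject: q(theta) = Normal(theta0, S).

    theta0 is the mode of log p(x | theta) + log Normal(theta; mu, var),
    found by Newton's method; S is minus the inverse second derivative there.
    """
    theta = mu
    for _ in range(n_newton):
        p = sigmoid(theta)
        grad = k - n_trials * p - (theta - mu) / var
        hess = -n_trials * p * (1.0 - p) - 1.0 / var
        step = grad / hess
        theta -= step
        if abs(step) < 1e-10:
            break
    p = sigmoid(theta)
    hess = -n_trials * p * (1.0 - p) - 1.0 / var
    return theta, -1.0 / hess

def em_empirical_bayes(successes, n_trials, n_iter=200, tol=1e-8):
    """Alternate Laplace E-steps and closed-form M-steps for the prior."""
    mu, var = 0.0, 1.0                  # arbitrary starting values
    post = []
    for _ in range(n_iter):
        # E-step: approximate each subject's posterior given the current prior.
        post = [laplace_e_step(k, n_trials, mu, var) for k in successes]
        # M-step: update prior mean and variance given the approximations.
        new_mu = sum(t for t, _ in post) / len(post)
        new_var = sum((t - new_mu) ** 2 + s for t, s in post) / len(post)
        converged = abs(new_mu - mu) + abs(new_var - var) < tol
        mu, var = new_mu, new_var
        if converged:
            break
    return mu, var, post

# Simulate 200 hypothetical subjects with 100 trials each (true prior: mean 1.0, SD 0.5).
rng = random.Random(1)
n_trials = 100
true_thetas = [rng.gauss(1.0, 0.5) for _ in range(200)]
successes = [sum(rng.random() < sigmoid(t) for _ in range(n_trials)) for t in true_thetas]
mu_hat, var_hat, posteriors = em_empirical_bayes(successes, n_trials)
print("estimated prior mean:", mu_hat, "estimated prior variance:", var_hat)
```

The M-step adds the posterior variance S to the squared deviation of each mode, so subjects with little data contribute their uncertainty to the group variance rather than being treated as exactly known.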
[Figure: histograms of BIC0 − BIC1, t-value and p-value for Parameter: a and Parameter: b, each with # sim = 1000/1000. Reported values: ∆BIC < −2: N = 0/1000 and N = 1/1000; P(|t| > Tkrit) = 0.048 and 0.059.]
[Figure: Average Correct Probability Mass by Computational Model, and Parameter Values by Model Parameters (b1, b2, a1, a2, l, w, r; r1, r2; a, b; rep), for Model: MB+MF+Rep (Hybrid), Model: MF+Rep (Model-Free), Model: Rep (Non-Learner) and Model: ab. Legend: EM, Sufficient Statistics, Full Variational Bayes, True. Panel labels: Ntrue = 30; Ntrue = 21, Ntrue = 3, Ntrue = 3.]

2step + 201 trials + parameters based on real data [4]: The advantage for the full VB is visible also for parameters obtained from real (observed) data.

Conclusions
-  Fitting empirical Bayes' models of reinforcement learning using Expectation
Maximization (EM) [1] exhibits desirable normative properties
-  Our new Variational Bayes method suggests that we can and should understand
the heterogeneity and homogeneity observed in group studies of decision-making
by investigating the contributions of both the underlying mechanisms and their
parameters
-  We find increased accuracy in Bayesian model comparison for our new VB method
compared to previous approaches [1, 2]
-  We expect that this new mixed-effects method will prove useful for a wide range of
computational modeling approaches in group studies of cognition and biology
References
[1] Huys, Q. J. M., Eshel, N., O'Nions, E., Sheridan, L., Dayan, P., & Roiser, J. P. (2012). Bonsai trees in your head: How the Pavlovian system sculpts goal-directed choices by pruning decision trees. PLoS Computational Biology, 8(3), e1002410.
[2] Stephan, K. E., Penny, W. D., Daunizeau, J., Moran, R. J., & Friston, K. J. (2009). Bayesian model selection for group studies. NeuroImage, 46(4), 1004–1017.
[3] Daw, N. D., Gershman, S. J., Seymour, B., Dayan, P., & Dolan, R. J. (2011). Model-based influences on humans' choices and striatal prediction errors. Neuron, 69(6), 1204–1215.
[4] Schad, D. J., Jünger, E., Garbusow, M., Sebold, M., Bernhardt, N., Javadi, A. H., et al. (2014). Individual differences in processing speed and working memory capacity moderate the balance between habitual and goal-directed choice behaviour. Submitted.
Acknowledgements
This research was supported by Deutsche Forschungsgemeinschaft (DFG FOR 1617 Alcohol and Learning), Grant RA1047/2-1.