9. Representation and Rationality: The Case of Backwards Blocking [1]

[1] This chapter was motivated by enlightening discussions with Alison Gopnik and Joshua Tenenbaum. Indeed, the entire chapter is my side of an extended correspondence with Tenenbaum, who suggested the extended form network (for experiments with only two trials) discussed below, and raised the objections I describe to the normal form, as well as objections I do not describe to the bootstrap estimation of Cheng models. Gopnik and Tenenbaum are of course not responsible for, and in some cases may not share, my opinions.
1. Representation
Representation and inference covary. Bayes nets are a form of representation of causal
and probability claims, but the details of the representation may be uncertain in
particular cases. Which choices are appropriate as a psychological model, and how the
representation is conjoined with an inference procedure, can be constrained by empirical
facts about the inferences of subjects whose viewpoint and procedures the Bayes net
representation is intended to model. Backwards blocking provides a vivid example
of some of the uncertainties and interactions.
2. Backwards Blocking
One of Isaac Newton’s Rules of Reasoning in Natural Philosophy, the Vera Causa rule in
Book III of the Principia, recommends this: postulate no more causes except such as are true
and sufficient to save the phenomena. Newton intended the rule to justify assuming that
only the force of gravitation acts on the planets and their satellites, and on the tides, for
he had established that cause, and it sufficed.
Some recent work on associative learning (Van Hamme, et al., 1994) argues that adults
sometimes apply a version of the Vera Causa rule in contexts with less gravitas. In “cue
competition” or “backwards blocking” features A and B appear together followed by an
effect, E, and, with neither A nor B present, either E does not appear or appears more
rarely, and people are said to give some weight to the causal role of both A and B in
producing E. But if A also appears alone followed by a higher probability of the effect
(than when A and B are both absent), the judgement of the efficacy of B is reduced. The
causal role of A is established, and suffices to explain the data.
It is well known that backwards blocking is inconsistent with the familiar Rescorla-Wagner model of associative learning, and various modifications of the model have been
proposed to deal with the phenomenon. Backwards blocking illustrates the variety of
Bayes net representations of causal relations subjects might tacitly be using, and the
variety of ways those representations might be used to make inferences about the causal
powers of an object or kind of object. I will use variations of an experimental design
suggested by Gopnik and Tenenbaum.
3. Experiments
In what follows I will consider variations of the following simple experiment to test
backwards blocking:
Experiment: Subjects are given information that objects of some unobserved kind K
(corresponding to no observed quality such as color or shape) are rare, but produce an
effect E with some high probability in some circumstance C, which is observable.
Subjects are given no reason to think that two or more objects of kind K interact to
produce E. In fact, E will occur with some high probability when an object of kind K is
present in C, and E may otherwise occur with some small probability. Subjects see m trials
in which two objects, A, B, are both simultaneously in circumstance C. A is actually of
kind K; B may actually be of kind K or not. Subjects see whether E occurs or not according
to the probabilities assigned by the experimenter. That is, in Rescorla-Wagner terms, the
cues A, B are always presented together. Subjects’ judgements of the probability that B is
of kind K are elicited. Subjects then see n trials in which A alone is present in C and E
occurs or does not. Subjects’ judgements of the probability that B is of kind K are then
elicited.
There are many variations of this design depending on the choices of the numbers of
trials m and n, and on the probability that E occurs with and without the presence of an
observed object of kind K. In the simplest case, E always occurs whenever A is in
circumstance C and never otherwise.
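To fix ideas, here is a minimal sketch (in Python) of the trial structure just described, for the case in which B is not actually of kind K; the function name and default parameter values are mine, chosen for illustration only.

```python
import random

def simulate_trials(m, n, p_e_given_k=1.0, p_e_background=0.0, seed=0):
    """Generate the experiment's two-phase trial sequence: m trials with
    A and B both present in circumstance C, then n trials with A alone.
    A is actually of kind K, so a K-object is present on every trial;
    E occurs via a noisy combination of the K-object's effect and the
    small background probability of E."""
    rng = random.Random(seed)
    p_e = 1 - (1 - p_e_given_k) * (1 - p_e_background)
    phase1 = [("A and B present", rng.random() < p_e) for _ in range(m)]
    phase2 = [("A alone present", rng.random() < p_e) for _ in range(n)]
    return phase1 + phase2

# Simplest case: E always occurs when A is in C and never otherwise.
print(simulate_trials(m=3, n=2))
```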
Whatever the outcomes of such experiments, any plausible model must not only save the
phenomena of the elicited judgements of probabilities, it must also not presume
unrealistic memory or computational powers of the subjects. The following surveys
some of the predictions about experiments of the kind described above obtained from
various combinations of representation and inference procedure.
4. The Cheng Model
Cheng’s model of adult causal judgement in typical experiments is a Bayes
net that can be represented by a directed graph with vertices representing potential causal
factors, say A, B, U, and the effect, E, and with directed edges from A, B, U into E.
[Figure: a directed graph with edges from A, B, and U into E, labeled with the causal powers qa, qb, and qu respectively.]

E = qa·A ∨ qb·B ∨ qu·U, where, as in previous chapters, ∨ is Boolean addition.
In the Cheng model A, B indicate the presence or absence of objects A, B in a trial, and
the qs are the causal powers. p(qa = 1) ~ 1 indicates that A is of kind K, and so forth. The
model is applied as follows: On each trial, A, B, U and qa, qb and qu take on values,
either 0 or 1. For any two trials, the values of these random variables on one trial are
independent of their values on the other trial, and have the same probabilities in the first
m trials, in which A and B always occur, and the same probabilities in the next n trials, in
which A occurs alone. In these terms, the subjects’ task is to estimate p(qb = 1), although
responses would not, of course, be expressed in that way.
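A minimal sketch of how a single trial is generated under this model, assuming the presence of A and B and the probabilities of U and of the causal powers are given; all names and numbers here are mine, for illustration.

```python
import random

rng = random.Random(0)

def cheng_trial(A, B, p_U, p_qa, p_qb, p_qu):
    """One trial of the Cheng model. A and B record whether objects A
    and B are present on the trial; U and the causal powers qa, qb, qu
    independently take the value 1 with their respective probabilities,
    afresh on each trial. E = qa*A v qb*B v qu*U, with v Boolean
    addition (inclusive or)."""
    U = rng.random() < p_U
    qa = rng.random() < p_qa
    qb = rng.random() < p_qb
    qu = rng.random() < p_qu
    return (qa and A) or (qb and B) or (qu and U)

# First m trials: A and B both present; next n trials: A alone.
print(cheng_trial(A=True, B=True, p_U=0.5, p_qa=0.9, p_qb=0.1, p_qu=0.1))
print(cheng_trial(A=True, B=False, p_U=0.5, p_qa=0.9, p_qb=0.1, p_qu=0.1))
```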
Suppose we try a naive Bayesian treatment of backwards blocking in the context of the
Cheng model for the special case in which A always produces E and B always produces
E and E never occurs otherwise. In the experiment, subjects are given information about
the rarity of properties that cause E and information about the frequency of E in the
absence of both A and B. We might take that information as specifying the initial
estimates of the causal powers, p(qa = 1) and p(qb = 1), for A and B. That turns out to be a
mistake.
Suppose subjects are then given a segment of data, D, for m trials in which A and B both
occur, and E always occurs.
The posterior probability that B brings about E when A and B are both present takes the
familiar form of Bayes rule:
(1)
p’(qb = 1) = p(qb = 1) p(D | qb = 1, A, B) / p(D) = p(qb = 1) / p(D) =
p(qb = 1) / [p(qa = 1) + p(qb = 1) + p(Uqu = 1) − p(qa = 1) p(qb = 1) −
p(qa = 1) p(Uqu = 1) − p(qb = 1) p(Uqu = 1) + p(qa = 1) p(qb = 1) p(Uqu = 1)]^m =
p(qb = 1) / p(E)^m
According to (1), unless the prior probability of E is 1, p’(qb = 1) exceeds one for
sufficiently large m. Probabilities cannot be greater than one. Something went wrong,
namely, taking the frequency of properties of kind K to be an initial probability that
qb = 1 and trying to update it. Similar difficulties occur if E does not always occur when
A is present or when B is present.
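A quick numerical check of (1), with hypothetical priors chosen only for illustration, shows how quickly this "posterior" escapes the unit interval.

```python
# Hypothetical priors, chosen only for illustration.
p_qa, p_qb, p_uqu = 0.1, 0.1, 0.05

# p(E) for a single trial with A and B present, by inclusion-exclusion,
# as in the denominator of (1).
p_e = (p_qa + p_qb + p_uqu
       - p_qa * p_qb - p_qa * p_uqu - p_qb * p_uqu
       + p_qa * p_qb * p_uqu)            # = 0.2305

for m in (1, 2, 3):
    print(m, p_qb / p_e ** m)            # 0.43..., 1.88..., 8.16...: > 1 by m = 2
```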
5. A Bayesian Calculation with Cheng Models
To give a coherent Bayesian analysis we must distinguish between probabilities as
subjective degrees of belief and probabilities as objective measures of causal powers;
that is, the prior probability distribution must specify the (subjective) probability of the
(objective) probability that qb = 1, etc. One might argue that information about the rarity
of things of kind K creates such a prior, but I will not fuss about where it comes from.
Denote the density of such a prior probability distribution by P. For each (measurable) set
of values between 0 and 1 of p(qa = 1), etc., P specifies a probability for that set. I
will assume that P respects the independencies assumed in the Cheng model; in particular,
when a feature such as B is absent, the Cheng model postulates that the frequency of E
does not depend on p(qb = 1), and I will assume that P implies such independencies.
Then the marginal posterior density, P’, which is P conditional on data D, over values of
p(qa = 1) in a (measurable) set a and p(qb = 1) in a (measurable) set b, when A and B
are both present, equals

(2)

[∫a ∫b P(p(qa = 1), p(qb = 1)) P(D | p(qa = 1), p(qb = 1), A, B) dp(qa = 1) dp(qb = 1)] /
[∫[0,1] ∫[0,1] P(p(qa = 1), p(qb = 1)) P(D | p(qa = 1), p(qb = 1), A, B) dp(qa = 1) dp(qb = 1)]
When A is present and B is not present, according to the Cheng model the data, D’, are
independent of p(qb = 1) and the new posterior density, P’’, obtained by conditioning on
D’ is:
(3)
ab P’(p(qa = 1), p(qb = 1))P’(D’ | p(qa = 1, A, ~B)dp(qa =1)dp(qb=1)]
[0,1] P(D’ | p (qa =1), A, ~B)dp(qa = 1)
If p(qa = 1) and p(qb = 1) are independent in P’, then
P’(p(qa = 1), p(qb = 1)) = P’(p(qa = 1)) P’(p(qb = 1)), and (3) becomes:
(4)
[∫a P’(p(qa = 1)) P(D’ | p(qa = 1), A, ~B) dp(qa = 1)] [∫b P’(p(qb = 1)) dp(qb = 1)] /
[∫[0,1] P’(p(qa = 1)) P(D’ | p(qa = 1), A, ~B) dp(qa = 1)]
(4) implies that the marginal of the posterior density P’’ over p(qb = 1) obtained by
conditioning on data obtained when B is not present equals the marginal of the previous
density, P’, over p(qb = 1). That is, (4) implies there is no backwards blocking.
On a Bayesian analysis of inference with Cheng models, backwards blocking occurs only
if the causal powers of A and of B are not independent in the probability density over the
causal powers obtained from the initial (prior) density by conditioning on the frequency
of E when A and B are both present. In other words, subjects must have the seemingly
odd but not incoherent P’ such that
P(p(qa = 1), p(qb = 1)) = P(p(qa = 1)) P(p(qb = 1)), but
P’(p(qa = 1), p(qb = 1)) ≠ P’(p(qa = 1)) P’(p(qb = 1))
In general, for small samples, depending on the prior distribution assumed, evaluating the
integrals requires iterative numerical methods. Scientists evaluate such quantities with
digital computers and computationally intensive numerical analysis or simulation
algorithms. The Bayesian model of inference applied to the Cheng model has intense
computational requirements.
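To give a sense of what such a computation involves, here is a minimal grid approximation of the second-order update, under simplifying assumptions that are mine: no background cause (p(Uqu = 1) = 0), a uniform prior density over the causal powers, and E occurring on every trial. It also exhibits the point just made: the joint trials leave the two powers dependent in the posterior, and the A-alone trials then lower the expected power of B.

```python
import numpy as np

# Grid over the causal powers wa = p(qa = 1) and wb = p(qb = 1).
grid = np.linspace(0.005, 0.995, 100)
wa, wb = np.meshgrid(grid, grid, indexing="ij")
P = np.ones_like(wa)                 # uniform, factored prior density

# Phase 1: m trials with A and B both present, E occurring every time.
# Under the Cheng model, p(E | A, B) = wa + wb - wa*wb.
m = 5
P1 = P * (wa + wb - wa * wb) ** m
P1 /= P1.sum()                       # posterior is no longer a product

# Phase 2: n trials with A alone, E occurring every time;
# p(E | A, ~B) = wa, independent of wb.
n = 5
P2 = P1 * wa ** n
P2 /= P2.sum()

print("E[p(qb = 1)] after phase 1:", (wb * P1).sum())
print("E[p(qb = 1)] after phase 2:", (wb * P2).sum())  # lower: blocking
```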
6. Bootstrapping to Backwards Blocking in the Cheng Model
Consider a simple non-Bayesian treatment of backwards blocking in the context of the
Cheng model. The principle I will use is simple: estimate as much as you can about the
unknown quantities, or their Boolean combinations, from the data, assuming the
observed frequencies of E on various combinations of the presence or absence of the
cues, A, B, equal the corresponding conditional probabilities of E, and assuming the
Cheng model. The principle is an uncomplicated version of an inference strategy
described in Glymour (1980), there called “bootstrapping” (not to be confused with a
statistical procedure of the same name) because the hypothesis to be tested is used in
calculating quantities that occur in the hypothesis. Similar inferences are made every day
in scientific work, and, indeed, Newton’s argument for universal gravitation uses
analogous methods.
Suppose subjects are given data about the frequency of E in the absence of both A and B.
They can estimate:
(5) p(qu = 1)p(U) = fr(E | ~A, ~B).
Now suppose they are given data about the frequency of E in the presence of both A and
B. They can estimate:
(6) p(qa = 1 or qb = 1) = [fr(E | A,B) – fr(E | ~A,~B)] / (1 – fr(E | ~A,~B))
but they cannot estimate p(qa = 1) or p(qb = 1), other than that these probabilities are each
between 0 and p(qa = 1 or qb = 1).
Suppose, finally, they are given data about the frequency of E in the presence of A
without B. Then they can estimate:
(7) p(qa = 1) = [fr(E | A, ~B) – fr(E | ~A, ~B)] / (1 – fr(E | ~A, ~B))
Further, although B was not present in the previous data, they can now also estimate p(qb
= 1).
(8) p(qb = 1) = [p(qa = 1 or qb = 1) − p(qa = 1)] / (1 − p(qa = 1)) =
[fr(E | A, B) − fr(E | A, ~B)] / [1 − fr(E | A, ~B)]
Note that in the case in which E always occurs if either A or B occurs, the causal power of
B cannot be estimated this way, because the denominator in (8) is zero. Other things
equal, the closer the frequency of E when A alone is present is to the frequency of E
when A and B are both present, the smaller is the estimated value of p(qb = 1), the causal
power of B. We have a form of backwards blocking for the Cheng model, estimated with
elementary algebra, presumably not used consciously.
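The whole bootstrap computation, (5) through (8), takes only a few lines; the function and variable names are mine and the input frequencies are hypothetical.

```python
def bootstrap_powers(fr_neither, fr_both, fr_a_only):
    """Estimate the Cheng-model quantities from observed frequencies of E,
    assuming the frequencies equal the corresponding conditional
    probabilities. Undefined (zero denominators) when E always occurs."""
    background = fr_neither                                   # (5) p(qu=1)p(U)
    p_a_or_b = (fr_both - fr_neither) / (1 - fr_neither)      # (6)
    p_a = (fr_a_only - fr_neither) / (1 - fr_neither)         # (7)
    p_b = (fr_both - fr_a_only) / (1 - fr_a_only)             # (8)
    return background, p_a_or_b, p_a, p_b

# The closer fr(E | A, ~B) is to fr(E | A, B), the smaller p(qb = 1):
print(bootstrap_powers(0.1, 0.9, 0.8))    # p(qb = 1) = 0.5
print(bootstrap_powers(0.1, 0.9, 0.88))   # p(qb = 1) = 0.1666...
```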
The Cheng model so estimated has low memory and computational requirements.
7. Alternative Bayes Net Representations
A reasonable thought is that subjects’ inferences in these experiments have a simpler
representation than the Cheng model. In the first m trials A and B occur together in the
relevant circumstance, and in the remaining n trials only A is presented. The causal and
probabilistic relations in the m + n trials can be represented accurately by the following
Bayes net:
[Figure: a Bayes net with nodes “A is a K” and “B is a K”; both have directed edges into each of “E on trial 1” through “E on trial m”, and “A is a K” alone has directed edges into “E on trial m+1” through “E on trial m+n”.]
For reference, I will call this network fully-extended. The network comes with a prior
probability distribution on the nodes, and the probabilities of the trial nodes conditional
on the nodes labeled “A is a K” and “B is a K” are either fixed or uncertain (that is, given
a second order probability distribution). If these conditional probabilities are unknown,
then, as with Cheng models, Bayesian updating on the evidence requires integrating over
a second order probability distribution. Even if we assume these conditional probabilities
are known to the subject from the instructions or cover story in the experiment (e.g., they
are told that any object of kind K produces E 95% of the time), the model still imposes
implausible memory requirements for m and n bigger than, say, 5. A natural
question is whether, under the same assumption of known conditional probabilities, there
are more compact representations that can correctly represent the probabilities.
I will call the following a normal form network:
[Figure: a Bayes net with nodes “A present”, “A is a K”, “B present”, and “B is a K”, each with a directed edge into E.]
The normal form network is applied to each trial: the probability that A is a K is updated
and the probability that B is a K is updated, and then the same network, with the new
probabilities, is applied to the next case. The memory and computational requirements
are small, assuming the conditional probabilities of E on each pair of values of “A is a K”
and “B is a K” are known when one or the other or both are present and when they are
both absent.
The normal form network does not give backwards blocking. “A is a K” and “B is a K”
remain independent on each new trial, no matter the outcomes of previous trials. So,
when A alone is present, the outcome of the trial is independent of “B is a K” and the
outcome does not change the probability that “B is a K.” Moreover, as Joshua
Tenenbaum has pointed out (private communication), if the joint probability distribution,
including all of the trials, is faithful to the fully-extended model, the procedure using the
normal form network violates the Markov assumption.
Another Bayes net model, which I will call an extended form network, is essentially a
truncated version of the fully-extended network:
[Figure: a Bayes net with nodes “A is a K” and “B is a K”; both have directed edges into “E occurs when A and B are present”, and “A is a K” alone has a directed edge into “E occurs when A alone is present”.]
Tenenbaum has pointed out (private communication) that this network satisfies the
Markov assumption and implies backwards blocking in an experiment with only two
trials, in one of which A and B are both present and in the other A alone is present. But in
longer series of trials it is problematic. If we apply this network in the same way as the
normal form network, one trial at a time, updating as we go but not changing the network
topology, we again get no backwards blocking, and we again violate the Markov
assumption. We can imagine more elaborate procedures, for example: Apply the model
one trial at a time so long as the trials only have both A and B present, but when the first
case in which A alone is present occurs, keep fixed the value of the node “E occurs when
A and B are present” from the last trial in which A and B were both present. This procedure
gives backwards blocking, but (for most values of the conditional probabilities) for n > 1
it mistakenly estimates the probability that A is a K, because the last value of “E occurs
when A and B are present” is counted multiple times.
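For contrast, a two-trial sketch of the extended form network under the same deterministic simplification and hypothetical priors of 0.3: conditioning jointly on both trial nodes, rather than re-imposing independence after each trial, yields the blocking Tenenbaum describes.

```python
# Joint prior over ("A is a K", "B is a K"), each 0.3 (hypothetical).
prior = {(a, b): (0.3 if a else 0.7) * (0.3 if b else 0.7)
         for a in (0, 1) for b in (0, 1)}

def posterior_b(likelihood):
    """p("B is a K" | evidence), for a likelihood function of (a, b)."""
    post = {h: prior[h] * likelihood(*h) for h in prior}
    z = sum(post.values())
    return (post[(0, 1)] + post[(1, 1)]) / z

# Condition only on "E occurs when A and B are present" = true:
print(posterior_b(lambda a, b: float(a or b)))           # ~0.588
# Condition also on "E occurs when A alone is present" = true:
print(posterior_b(lambda a, b: float((a or b) and a)))   # 0.3: blocking
```

Applied one trial at a time with independence re-imposed after each update, the same network reproduces the normal form behavior and gives no blocking.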
8. Conclusion
These examples scarcely exhaust the alternative Bayes net/inference models with
implications for backwards blocking, but they should be sufficient to illustrate how easy
it is to make mistakes about the representations and inference methods people use to
make causal judgements. A model that fits beautifully and is Bayesian coherent when
there are only two trials and E is a deterministic function, known to the subjects, of “A is
of kind K” and “B is of kind K” may fail, either failing to accommodate subjects’
probability judgements or imposing implausible memory and computational demands, if
there are multiple trials, or if E is not a deterministic function, or if the conditional
probabilities are unknown to the subjects. An experiment that appears to refute Cheng’s
parameterization may not do so if it is varied to allow subjects to decline to estimate
probabilities, or to express uncertainty. An experiment that agrees with that
parameterization may fail to do so if the number of trials is very small. An experiment
that appears to refute Bayesian updating may not do so on another representation. Certain
representations invite special inference procedures. Because of their parameterization,
Cheng models, for example, allow algebraic bootstrap inferences that may not be
available in more general parameterizations of Bayes nets. But representations do not
typically determine a unique inference method, especially in sequential trials, and
experiments that confirm or disconfirm a representation assuming one inference method
may not do so on another, and symmetrically for confirming or disconfirming inference
procedures assuming representations. The difficulties are made worse by the quite
plausible thought that representations and inference procedures may vary with slight
changes in context, not only because of framing heuristics, but also because of variations
in memory and computational demands.