9. Representation and Rationality: The Case of Backwards Blocking[1]

[1] This chapter was motivated by enlightening discussions with Alison Gopnik and Joshua Tenenbaum. Indeed, the entire chapter is my side of an extended correspondence with Tenenbaum, who suggested the extended form network (for experiments with only two trials) discussed below, and raised the objections I describe to the normal form, as well as objections I do not describe to the bootstrap estimation of Cheng models. Gopnik and Tenenbaum are of course not responsible for, and in some cases may not share, my opinions.

1. Representation

Representation and inference covary. Bayes nets are a form of representation of causal and probability claims, but the details of the representation may be uncertain in particular cases. Which choices are appropriate as a psychological model can be constrained by empirical facts about the inferences of subjects whose viewpoint and procedures the Bayes net representation is intended to model, and by how the representation is conjoined with an inference procedure. Backwards blocking provides a vivid example of some of the uncertainties and interactions.

2. Backwards Blocking

One of Isaac Newton's Rules of Reasoning in Natural Philosophy, the Vera Causa rule in Book III of the Principia, recommends this: postulate no more causes than are true and sufficient to save the phenomena. Newton intended the rule to justify assuming that only the force of gravitation acts on the planets and their satellites, and on the tides, for he had established that cause, and it sufficed. Some recent work on associative learning (Van Hamme et al., 1994) argues that adults sometimes apply a version of the Vera Causa rule in contexts with less gravitas. In "cue competition" or "backwards blocking," features A and B appear together followed by an effect, E, and, with neither A nor B present, E either does not appear or appears more
rarely, and people are said to give some weight to the causal role of both A and B in producing E. But if A also appears alone followed by a higher probability of the effect (than when A and B are both absent), the judgement of the efficacy of B is reduced: the causal role of A is established, and suffices to explain the data. It is well known that backwards blocking is inconsistent with the familiar Rescorla-Wagner model of associative learning, and various modifications of the model have been proposed to deal with the phenomenon. Backwards blocking illustrates the variety of Bayes net representations of causal relations subjects might tacitly be using, and the variety of ways those representations might be used to make inferences about the causal powers of an object or kind of object. I will use variations of an experimental design suggested by Gopnik and Tenenbaum.

3. Experiments

In what follows I will consider variations of the following simple experiment to test backwards blocking:

Experiment: Subjects are given information that objects of some unobserved kind K (corresponding to no observed quality such as color or shape) are rare, but produce an effect E with some high probability in some circumstance C, which is observable. Subjects are given no reason to think that two or more objects of kind K interact to produce E. In fact, E will occur with some high probability when an object of kind K is present in C, and E may otherwise occur with some small probability. Subjects see m trials in which two objects, A and B, are both simultaneously in circumstance C. A is actually of kind K; B may actually be of kind K or not. Subjects see whether E occurs or not, according to the probabilities assigned by the experimenter. That is, in Rescorla-Wagner terms, the cues A and B are always presented together. Subjects' judgements of the probability that B is of kind K are elicited. Subjects then see n trials in which A alone is present in C and E occurs or does not.
Subjects' judgements of the probability that B is of kind K are then elicited.

There are many variations of these designs depending on the choices of the numbers of trials m and n, and on the probability that E occurs with and without the presence of an observed object of kind K. In the simplest case, E always occurs whenever A is in circumstance C and never otherwise. Whatever the outcomes of such experiments, any plausible model must not only save the phenomena of the elicited judgements of probabilities; it must also not presume unrealistic memory or computational powers in the subjects. The following surveys some of the predictions about experiments of the kind described above, obtained from various combinations of representation and inference procedure.

4. The Cheng Model

Cheng's model of adults' representation of causal relations in typical experiments is a Bayes net that can be represented by a directed graph with vertices representing potential causal factors, say A, B, U, and the effect, E, with directed edges from A, B, and U into E carrying the causal-power parameters qa, qb, and qu, and with the structural equation

E = qa·A ∨ qb·B ∨ qu·U,

where, as in previous chapters, ∨ is Boolean addition. In the Cheng model A, B indicate the presence or absence of objects A, B in a trial, and the qs are the causal powers; p(qa = 1) ~ 1 indicates that A is of kind K, and so forth. The model is applied as follows: on each trial, A, B, U and qa, qb and qu take on values, either 0 or 1. For any two trials, the values of these random variables on one trial are independent of their values on the other trial, and have the same probabilities in the first m trials, in which A and B always occur, and the same probabilities in the next n trials, in which A occurs alone.
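As a concrete illustration, the generative assumptions just described can be sketched in a few lines of code. The function names are mine, and I collapse p(U = 1) and p(qu = 1) into a single background parameter, purely for illustration:

```python
import random

def p_effect(a_present, b_present, p_qa, p_qb, p_u_qu):
    """Probability of E in the Cheng model, E = qa·A v qb·B v qu·U:
    one minus the chance that every potential cause fails to act.
    p_u_qu collapses p(U = 1) and p(qu = 1) into one parameter."""
    p_no_e = 1 - p_u_qu
    if a_present:
        p_no_e *= 1 - p_qa
    if b_present:
        p_no_e *= 1 - p_qb
    return 1 - p_no_e

def cheng_trial(a_present, b_present, p_qa, p_qb, p_u_qu, rng=random):
    """One trial: the qs and the background take 0/1 values independently,
    and E occurs if any present cause 'fires'."""
    return ((a_present and rng.random() < p_qa)
            or (b_present and rng.random() < p_qb)
            or rng.random() < p_u_qu)
```

The expansion of p_effect by inclusion-exclusion gives exactly the sum-and-products expression for p(E) that figures in the calculations below.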
In these terms, the subjects' task is to estimate p(qb = 1), although responses would not, of course, be expressed in that way.

Suppose we try a naive Bayesian treatment of backwards blocking in the context of the Cheng model for the special case in which A always produces E, B always produces E, and E never occurs otherwise. In the experiment, subjects are given information about the rarity of properties that cause E and information about the frequency of E in the absence of both A and B. We might take that information as specifying the initial estimates of the causal powers, p(qa = 1) and p(qb = 1), for A and B. That turns out to be a mistake. Suppose subjects are then given a segment of data, D, for m trials in which A and B both occur and E always occurs. The posterior probability that B brings about E when A and B are both present takes the familiar form of Bayes rule:

(1) p'(qb = 1) = p(qb = 1) p(D | qb = 1, A, B) / p(D)
              = p(qb = 1) / p(D)
              = p(qb = 1) / [p(qa = 1) + p(qb = 1) + p(U qu = 1) − p(qa = 1) p(qb = 1) − p(qa = 1) p(U qu = 1) − p(qb = 1) p(U qu = 1) + p(qa = 1) p(qb = 1) p(U qu = 1)]^m
              = p(qb = 1) / p(E)^m

According to (1), p'(qb = 1) grows without bound as m increases, and so exceeds one, unless the prior probability of E is 1. Probabilities cannot be greater than one. Something went wrong: taking the frequency of properties of kind K to be an initial probability that qb = 1 and trying to update. Similar difficulties occur if E does not always occur when A is present or when B is present.

5. A Bayesian Calculation with Cheng Models

To give a coherent Bayesian analysis we must distinguish between probabilities as subjective degrees of belief and probabilities as objective measures of causal powers; that is, the prior probability distribution must specify the (subjective) probability of the (objective) probability that qb = 1, etc. One might argue that information about the rarity of things of kind K creates such a prior, but I will not fuss about where it comes from.
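Before developing that analysis, it is easy to check numerically that the naive update in (1) misbehaves; the prior values below are purely illustrative:

```python
# Checking equation (1) with made-up priors: p'(qb = 1) = p(qb = 1) / p(E)^m,
# where p(E) = 1 - (1 - pa)(1 - pb)(1 - pu) is the bracketed
# inclusion-exclusion expression in (1).
pa, pb, pu = 0.1, 0.1, 0.05      # illustrative "rarity" priors and background
p_e = 1 - (1 - pa) * (1 - pb) * (1 - pu)
m = 20                           # trials with A and B present on which E occurs
naive_posterior = pb / p_e ** m
print(naive_posterior > 1)       # True: the "posterior" is not a probability
```

For these numbers p(E) is about 0.23, and dividing p(qb = 1) by p(E)^20 yields a value far above one.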
Denote the density of such a prior probability distribution by P. For each (measurable) set of values between 0 and 1 of p(qa = 1), etc., P specifies a probability density for that set. I will assume that P respects the independencies assumed in the Cheng model; in particular, when a feature such as B is absent, the Cheng model postulates that the frequency of E does not depend on p(qb = 1), and I will assume that P implies such independencies. Then the posterior density P', which is P conditional on data D obtained when A and B are both present, assigns to values of p(qa = 1) in a (measurable) set a and values of p(qb = 1) in a (measurable) set b the probability

(2) P'(a, b) =
    [∫_a ∫_b P(p(qa = 1), p(qb = 1)) P(D | p(qa = 1), p(qb = 1), A, B) dp(qa = 1) dp(qb = 1)]
    / [∫_[0,1] ∫_[0,1] P(p(qa = 1), p(qb = 1)) P(D | p(qa = 1), p(qb = 1), A, B) dp(qa = 1) dp(qb = 1)]

When A is present and B is not present, according to the Cheng model the data, D', are independent of p(qb = 1), and the new posterior density, P'', obtained by conditioning on D' is:

(3) P''(a, b) =
    [∫_a ∫_b P'(p(qa = 1), p(qb = 1)) P(D' | p(qa = 1), A, ~B) dp(qa = 1) dp(qb = 1)]
    / [∫_[0,1] P'(p(qa = 1)) P(D' | p(qa = 1), A, ~B) dp(qa = 1)]

where P'(p(qa = 1)) in the denominator denotes the marginal of P'. If p(qa = 1) and p(qb = 1) are independent in P', then P'(p(qa = 1), p(qb = 1)) = P'(p(qa = 1)) P'(p(qb = 1)) and (3) becomes:

(4) P''(a, b) =
    [∫_a P'(p(qa = 1)) P(D' | p(qa = 1), A, ~B) dp(qa = 1)] [∫_b P'(p(qb = 1)) dp(qb = 1)]
    / [∫_[0,1] P'(p(qa = 1)) P(D' | p(qa = 1), A, ~B) dp(qa = 1)]

(4) implies that the marginal of the posterior density P'' over p(qb = 1), obtained by conditioning on data obtained when B is not present, equals the marginal of the previous density, P', over p(qb = 1). That is, (4) implies there is no backwards blocking. On a Bayesian analysis of inference with Cheng models, backwards blocking occurs only if the causal powers of A and of B are not independent in the probability density over the causal powers obtained from the initial (prior) density by conditioning on the frequency of E when A and B are both present.
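The integrals in (2)-(4) can be approximated on a grid. The sketch below assumes a uniform prior over the pair of causal powers, ignores the background cause, and uses made-up trial counts; its point is that conditioning on the A-and-B data already couples the two powers (the likelihood does not factor), so the subsequent A-alone data lowers the expected causal power of B: backwards blocking.

```python
# Grid approximation to (2)-(4): a discretized second-order distribution
# over the pair (p(qa = 1), p(qb = 1)). The uniform prior, the grid, and
# the trial counts are illustrative; the background cause is ignored.
grid = [i / 100 for i in range(1, 100)]
m, n = 5, 5                        # A-and-B trials, then A-alone trials

def normalized(weights):
    z = sum(weights.values())
    return {k: w / z for k, w in weights.items()}

prior = {(pa, pb): 1.0 for pa in grid for pb in grid}

# (2): condition on D, m trials with A and B present on which E occurs
post_d = normalized({(pa, pb): w * (1 - (1 - pa) * (1 - pb)) ** m
                     for (pa, pb), w in prior.items()})

# (3): condition further on D', n trials with A alone on which E occurs;
# this likelihood depends only on pa
post_dd = normalized({(pa, pb): w * pa ** n
                      for (pa, pb), w in post_d.items()})

def expected_qb(dist):
    return sum(pb * w for (_, pb), w in dist.items())

# Conditioning on D couples the two powers, so the A-alone data lowers
# the expected causal power of B
print(expected_qb(post_d), expected_qb(post_dd))
```

Here the dependence required for backwards blocking arises automatically: the likelihood of the A-and-B data, (1 − (1 − pa)(1 − pb))^m, is not a product of a function of pa and a function of pb.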
In other words, subjects must have the seemingly odd but not incoherent P' whose joint density does not factor:

P'(p(qa = 1), p(qb = 1)) ≠ P'(p(qa = 1)) P'(p(qb = 1)).

In general, for small samples, depending on the prior distribution assumed, evaluating the integrals requires iterative numerical methods. Scientists evaluate such quantities with digital computers and computationally intensive numerical analysis or simulation algorithms. The Bayesian model of inference applied to the Cheng model has intense computational requirements.

6. Bootstrapping to Backwards Blocking in the Cheng Model

Consider a simple non-Bayesian treatment of backwards blocking in the context of the Cheng model. The principle I will use is simple: estimate as much as you can about the unknown quantities, or their Boolean combinations, from the data, assuming the observed frequencies of E on various combinations of the presence or absence of the cues A, B equal the corresponding conditional probabilities of E, and assuming the Cheng model. The principle is an uncomplicated version of an inference strategy described in Glymour (1980), there called "bootstrapping" (not to be confused with a statistical procedure of the same name) because the hypothesis to be tested is used in calculating quantities that occur in the hypothesis. Similar inferences are made every day in scientific work, and, indeed, Newton's argument for universal gravitation uses analogous methods.

Suppose subjects are given data about the frequency of E in the absence of both A and B. They can estimate:

(5) p(qu = 1) p(U) = fr(E | ~A, ~B).

Now suppose they are given data about the frequency of E in the presence of both A and B. They can estimate:

(6) p(qa = 1 or qb = 1) = [fr(E | A, B) − fr(E | ~A, ~B)] / (1 − fr(E | ~A, ~B))

but they cannot estimate p(qa = 1) or p(qb = 1), other than that these probabilities are each between 0 and p(qa = 1 or qb = 1).
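Estimates (5) and (6) are direct calculations from observed frequencies; a minimal sketch, with made-up frequencies:

```python
def bootstrap_step_one(fr_e_ab, fr_e_neither):
    """Estimates (5) and (6) from observed frequencies: the background
    rate p(qu = 1)p(U), and the disjunctive power p(qa = 1 or qb = 1),
    which bounds each individual causal power from above."""
    background = fr_e_neither                                    # (5)
    p_a_or_b = (fr_e_ab - fr_e_neither) / (1 - fr_e_neither)     # (6)
    return background, p_a_or_b

bg, disj = bootstrap_step_one(fr_e_ab=0.8, fr_e_neither=0.2)     # made-up frequencies
print(bg, disj)   # each of p(qa = 1), p(qb = 1) lies between 0 and the second value
```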
Suppose, finally, they are given data about the frequency of E in the presence of A without B. Then they can estimate:

(7) p(qa = 1) = [fr(E | A, ~B) − fr(E | ~A, ~B)] / (1 − fr(E | ~A, ~B))

Further, although B was not present in the previous data, they can now also estimate p(qb = 1):

(8) p(qb = 1) = [p(qa = 1 or qb = 1) − p(qa = 1)] / (1 − p(qa = 1)) = [fr(E | A, B) − fr(E | A, ~B)] / [1 − fr(E | A, ~B)]

Note that in the case in which E always occurs if either A or B occurs, the causal power of B cannot be estimated this way, because the denominator in (8) is zero. Other things equal, the closer the frequency of E when A alone is present is to the frequency of E when A and B are both present, the smaller is the estimated value of p(qb = 1), the causal power of B. We have a form of backwards blocking for the Cheng model, estimated with elementary algebra, presumably not used consciously. The Cheng model so estimated has low memory and computational requirements.

7. Alternative Bayes Net Representations

A reasonable thought is that subjects' inferences in these experiments have a simpler representation than the Cheng model. In the first m trials A and B occur together in the relevant circumstance, and in the remaining n trials only A is presented. The causal and probabilistic relations in the m + n trials can be represented accurately by a Bayes net with two parent nodes, "A is a K" and "B is a K," and a child node for each trial, "E on trial 1" through "E on trial m + n": "A is a K" has a directed edge into every trial node, and "B is a K" has a directed edge into the first m of them, the trials on which B is present. For reference, I will call this network fully-extended. The network comes with a prior probability distribution on the parent nodes and either fixed or uncertain (that is, with a second order probability distribution) probabilities of the trial nodes conditional on the nodes labeled "A is a K" and "B is a K." If these conditional probabilities are unknown, then, as with Cheng models, Bayesian updating on the evidence requires integrating over a second order probability distribution.
Even if we assume these conditional probabilities are known to the subject from the instructions or cover story in the experiment (e.g., they are told that any object of kind K produces E 95% of the time), the model still imposes implausible memory requirements for m and n bigger than, say, 5. A natural question is whether, under the same assumption of known conditional probabilities, there are more compact representations that can correctly represent the probabilities. I will call the following a normal form network: four parent nodes, "A present," "A is a K," "B present," and "B is a K," each with a directed edge into a single effect node, E.

The normal form network is applied to each trial: the probability that A is a K is updated, the probability that B is a K is updated, and then the same network, with the new probabilities, is applied to the next case. The memory and computational requirements are small, assuming the conditional probabilities of E on each pair of values of "A is a K" and "B is a K" are known when one or the other or both are present and when they are both absent. The normal form network does not give backwards blocking. "A is a K" and "B is a K" remain independent on each new trial, no matter the outcomes of previous trials. So, when A alone is present, the outcome of the trial is independent of "B is a K" and the outcome does not change the probability that "B is a K." Moreover, as Joshua Tenenbaum has pointed out (private communication), if the joint probability distribution, including all of the trials, is faithful to the fully-extended model, the procedure using the normal form network violates the Markov assumption.
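The normal form procedure can be sketched as a per-trial Bayes rule update. The 95%/5% conditional probabilities below are assumptions in the spirit of the cover story, and I instantiate the known conditional probabilities with a noisy-OR form for concreteness; the run shows that an A-alone trial leaves the probability that B is a K unchanged:

```python
def normal_form_update(p_a_k, p_b_k, a_present, b_present, e_occurred,
                       p_e_given_k=0.95, p_e_base=0.05):
    """One application of the normal form network: update p("A is a K") and
    p("B is a K") from a single trial by Bayes rule, treating the two
    hypotheses as independent and the conditional probabilities as known."""
    def likelihood(ka, kb):
        # noisy-OR chance of E given kind-membership and presence (assumed form)
        p_no_e = 1 - p_e_base
        if a_present and ka:
            p_no_e *= 1 - p_e_given_k
        if b_present and kb:
            p_no_e *= 1 - p_e_given_k
        p_e = 1 - p_no_e
        return p_e if e_occurred else 1 - p_e

    # joint over (ka, kb) assuming independence, conditioned on the outcome
    joint = {(ka, kb): (p_a_k if ka else 1 - p_a_k)
                       * (p_b_k if kb else 1 - p_b_k)
                       * likelihood(ka, kb)
             for ka in (0, 1) for kb in (0, 1)}
    z = sum(joint.values())
    return (joint[(1, 0)] + joint[(1, 1)]) / z, (joint[(0, 1)] + joint[(1, 1)]) / z

# Five A-and-B trials on which E occurs, then one A-alone trial
pa, pb = 0.1, 0.1
for _ in range(5):
    pa, pb = normal_form_update(pa, pb, True, True, True)
pb_before = pb
pa, pb = normal_form_update(pa, pb, True, False, True)
print(pb_before, pb)   # equal (up to rounding): no backwards blocking
```

Because the outcome of an A-alone trial has the same likelihood whether or not B is a K, the update returns p("B is a K") untouched, which is exactly the failure of backwards blocking described above.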
Another Bayes net model, which I will call an extended form network, is essentially a truncated version of the fully-extended network: parent nodes "A is a K" and "B is a K," a child node "E occurs when A and B are present" with directed edges into it from both parents, and a child node "E occurs when A alone is present" with a directed edge into it from "A is a K" only.

Tenenbaum has pointed out (private communication) that this network satisfies the Markov assumption and implies backwards blocking in an experiment with only two trials, in one of which A and B are both present and in the other A alone is present. But in longer series of trials it is problematic. If we apply this network in the same way as the normal form network, one trial at a time, updating as we go but not changing the network topology, we again get no backwards blocking, and we again violate the Markov assumption. We can imagine more elaborate procedures; for example: apply the model one trial at a time so long as the trials have both A and B present, but from the first trial in which A alone is present onward, keep fixed the value of the node "E occurs when A and B are present" from the last trial in which A and B were both present. This procedure gives backwards blocking, but (for most values of the conditional probabilities) for n > 1 it mistakenly estimates the probability that A is a K, because the last value of "E occurs when A and B are present" is counted multiple times.

8. Conclusion

These examples scarcely exhaust the alternative Bayes net/inference models with implications for backwards blocking, but they should be sufficient to illustrate how easy it is to make mistakes about the representations and inference methods people use to make causal judgements.
A model that fits beautifully and is Bayesian coherent when there are only two trials and E is a deterministic function, known to the subjects, of "A is of kind K" and "B is of kind K" may fail if there are multiple trials, or if E is not a deterministic function, or if the conditional probabilities are unknown to the subjects: it may fail to accommodate subjects' probability judgements, or it may impose implausible memory and computational demands. An experiment that appears to refute Cheng's parameterization may not do so if it is varied to allow subjects to decline to estimate probabilities, or to express uncertainty. An experiment that agrees with that parameterization may fail to do so if the number of trials is very small. An experiment that appears to refute Bayesian updating may not do so on another representation. Certain representations invite special inference procedures. Because of their parametrization, Cheng models, for example, allow algebraic bootstrap inferences that may not be available in more general parametrizations of Bayes nets. But representations do not typically determine a unique inference method, especially in sequential trials, and experiments that confirm or disconfirm a representation assuming one inference method may not do so on another, and symmetrically for confirming or disconfirming inference procedures assuming representations. The difficulties are made worse by the quite plausible thought that representations and inference procedures may vary with slight changes in context, not only because of framing heuristics, but also because of variations in memory and computational demands.