Information theory, MDL and human cognition
Nick Chater
Department of Psychology, University College London
[email protected]

Overview
- Bayes and MDL: an overview
- Universality and induction
- Some puzzles about model fitting
- Cognitive science applications

Bayes and MDL: a simplified story
- Shannon's coding theorem: for a distribution Pr(.), the optimal code assigns a code of length -log2 Pr(A) to event A.
- MDL model selection: choose the model M that yields the shortest code for the data D, i.e. minimize -log2 Pr(D, M).

A simple equivalence
- Minimizing -log2 Pr(D, M) is the same as maximizing Pr(D, M).
- Since Pr(D, M) = Pr(M|D) Pr(D), and Pr(D) is fixed by the data, this amounts to maximizing Pr(M|D): just what Bayes recommends (if choosing a single model).
- The equivalence generalizes to parametric models M(θ), to 'full' Bayes, and in other ways.
- See Chater (1996) for an application to the simplicity and likelihood principles in perceptual organization.

Codes or priors? Which comes first? 1. The philosophical issue
- Bayesian viewpoint: probabilities as basic
  - a calculus for degrees of belief (probability theory)
  - decision theory (probabilities meet action)
  - the brain as a probabilistic calculating machine (whether belief propagation, dynamic programming, ...)
- Simplicity/MDL viewpoint: codes as basic
  - Rissanen: the data is all there is; distributions are a fiction
  - code structure is primary; code interpretation is secondary
  - probabilities are defined over events, but "events" are themselves cognitive constructs (Leeuwenberg & Boselie, 1988)

Codes or priors? Which comes first? 2. The practical issue
- Bayesian viewpoint: take probabilities as basic when we know most about probability, e.g. image statistics. For example (Geisler et al., 2001), good continuation: most lines continue in the same direction in real images.
- Simplicity/MDL viewpoint: take codes as basic when we know most about representation, e.g. grammars (Goldsmith, 2001):
  S -> NP VP
  VP -> V NP
  VP -> V NP PP
  NP -> Det Noun
  NP -> NP PP
  "Binding constraints"
- In linguistics, for instance, representations are given by theory, and we can roughly assess the complexity of grammars (by their length); it is not so clear how to set a prior directly over all grammars (though a generative process can be defined in simple cases).
- Gzip as a handy approximation!?! (A toy compression sketch appears at the end of this section.)

Simplicity/MDL and Bayes are closely related. Let's now explore the simplicity perspective.

Overview
- Bayes and MDL: an overview
- Universality and induction
- Some puzzles about model fitting
- Cognitive science applications

The most neutral possible prior...
- Suppose we want a prior so neutral that it never rules out a model.
- This is possible, if we limit ourselves to computable models.
- Take a mixture of all (computable) priors p_i, with weights w_i that decline fairly fast:
  m(x) = Σ_i w_i p_i(x)
- Then m multiplicatively dominates all priors, though such a neutral prior means slow learning.
- m(x) is a "universal" prior.

The most neutral possible coding language
- Universal programming languages (Java, Matlab, universal Turing machines, etc.).
- K(x) = the length of the shortest program, in Java, Matlab, or on a UTM, that generates x. (K is uncomputable.)
- Invariance theorem: for any languages L1 and L2 there is a constant c such that, for all x, |K_L1(x) - K_L2(x)| ≤ c.
- This mathematically justifies talk of K(x), rather than K_Java(x), K_Matlab(x), ...
- So does this mean that the choice of language doesn't matter? Not quite! c can be large, and for any L1 and c0 there exist an L2 and x such that |K_L1(x) - K_L2(x)| ≥ c0: the problem of the one-instruction code for the entire data set.
- But Kolmogorov complexity can be made concrete: compact universal Turing machines exist, of 210 bits (λ-calculus) and 272 bits (combinators), due to John Tromp (2007). Not much room to hide, here!
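The "gzip as a handy approximation" remark, and the point that Kolmogorov complexity can be made concrete, can be illustrated in a few lines. The sketch below is my own toy example, not part of the talk, using Python's zlib in place of gzip: the compressed length of a string is a crude, compressor-dependent upper bound on its description length, so patterned data gets a much shorter code than noise.

```python
import random
import zlib

def code_length_bits(data: bytes) -> int:
    # Length of the zlib-compressed form, in bits: a crude,
    # compressor-dependent upper bound on description length,
    # used here as a practical stand-in for the uncomputable K(x).
    return 8 * len(zlib.compress(data, 9))

patterned = b"ab" * 500                                     # 1000 bytes of obvious structure
random.seed(0)
noise = bytes(random.getrandbits(8) for _ in range(1000))   # 1000 bytes with no usable pattern

print("patterned:", code_length_bits(patterned), "bits")    # far fewer than 8000
print("noise:    ", code_length_bits(noise), "bits")        # roughly 8000, or a little more
```

The same trick can be applied to two candidate grammars written out as text: on this crude measure, the one that compresses better counts as the simpler hypothesis.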
Neutral priors and Kolmogorov complexity
- A key result: K(x) = -log2 m(x) + O(1), where m is a universal prior. This is analogous to Shannon's source coding theorem.
- And for any computable q, K(x) ≤ -log2 q(x) + O(1), for typical x drawn from q(x).
- So any data x that is likely under any sensible probability distribution has low K(x).

Prediction by simplicity
- Find the shortest 'program/explanation' for the current 'corpus' (a binary string).
- Predict using that program. (Strictly, use a 'weighted sum' of explanations, weighted by brevity.)
- Prediction is possible (Solomonoff, 1978): the summed error has a finite bound,
  Σ_{j=1..∞} s_j ≤ (K(μ)/2) log_e 2
  where s_j is the summed squared error between the prediction and the true probability on item j, and μ is the true (computable) distribution generating the corpus.
- So prediction converges, faster than 1/(n log n), for corpus size n.
- Computability assumptions only (no stationarity needed).

Summary so far...
- Simplicity/MDL has close and deep connections with Bayes.
- It defines a universal prior (i.e., one based on simplicity).
- It can be made "concrete".
- There are general prediction results.
- It is a convenient "dual" framework to Bayes, when codes are easier than probabilities.
- Li, M. & Vitányi, P. (1997). An introduction to Kolmogorov complexity and its applications (2nd ed.). Berlin: Springer.

Overview
- Bayes and MDL: an overview
- Universality and induction
- Some puzzles about model fitting
- Cognitive science applications

A problem of model selection? Or: why simplicity won't go away
- Where do priors come from? Well, priors can be given by hyper-priors, and hyper-priors by hyper-hyper-priors. But it can't go on forever!
- And we need priors over models we've only just thought of, and, in some contexts, over models we haven't yet thought of (!).
- Code length in our representation language is a fixed basis.
- NB: building probabilistic models = augmenting our language with new coding schemes.

The hidden role of simplicity...
- Bayesian model selection prefers y(x) = a2 x^2 + a1 x + a0, not y(x) = a125 x^125 + a124 x^124 + ... + a0.
- But who says how many parameters a function has got?
- A trick: convert parameters to constants.
  y(x) = a125 x^125 + a124 x^124 + ... + a0: 126 parameters
  y(x) = .003 x^125 + .02 x^124 + ... + 3x - 24.3: 0 parameters
- The second, having no free parameters, is hence favoured by Bayesian (and all other) model selection criteria.
- All the virtues of theft over honest toil.

Zoubin's problem for ML
- The maximum-likelihood Gaussian is a delta function on one data point. An impressive fit!

A related problem for Bayes?
- The mixture-of-delta-functions model (!). An even more impressive fit!
- Should the "cheating" model get a huge boost from this data?
- No! (A sense of moral outrage.) The model must have been fitted post hoc; it would be different if I'd thought of it before the data arrived (cf. empirical Bayes).
- Yes! But the order of data acquisition has no role in Bayes; confirmation is just the same, whenever I thought of the model.
- The model gets a spectacular boost from the data, but it is even more spectacularly unlikely to start with...
- So we need to take care with priors!
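To make the polynomial example concrete, here is a toy two-part MDL comparison. This is my own illustrative sketch, not from the talk: the 16-bits-per-parameter cost and the Gaussian residual code are arbitrary modelling choices. Each candidate degree pays a fixed cost per coefficient plus the bits needed to encode the residuals; on data generated from a noisy quadratic, degree 2 should yield the shortest total code, since higher degrees buy only slightly better fits at a much larger parameter cost.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(-1, 1, 60)
y = 2.0 * x**2 + 0.5 * x - 1.0 + rng.normal(0.0, 0.1, size=x.size)   # quadratic + noise

def two_part_code_length(degree: int, bits_per_param: float = 16.0) -> float:
    """Crude two-part code: bits to state the coefficients (a fixed, assumed
    cost per parameter) plus bits to encode the residuals under a Gaussian
    model (-log2 likelihood at the fitted residual variance).  Code lengths
    for continuous data are only defined up to a shared discretization
    constant, so only differences between degrees are meaningful."""
    coeffs = np.polyfit(x, y, degree)
    residuals = y - np.polyval(coeffs, x)
    sigma2 = max(float(residuals.var()), 1e-12)
    data_bits = 0.5 * x.size * np.log2(2 * np.pi * np.e * sigma2)
    param_bits = (degree + 1) * bits_per_param
    return param_bits + data_bits

for d in (1, 2, 5, 12):
    print(f"degree {d:2d}: {two_part_code_length(d):8.1f} bits")
```

The slide's "trick" of freezing the coefficients into constants amounts to charging zero bits per parameter for the rigged model, which is exactly why that hidden cost has to be paid somewhere, e.g. in the prior over models.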
y = x: high prior; compact to state.
y = .003 x^125 + .02 x^124 + ... + 3x - 24: low prior; not compact to state.
- With a different representation language, we could have the opposite bias.
- But we start from where we are: from our actual representations.
- We can discover that things are simpler than we thought (i.e., simplicity is not quite so subjective...).

Overview
- Bayes and MDL: an overview
- Universality and induction
- Some puzzles about model fitting
- Cognitive science applications

There are quite a few applications...

Domain: principle (references)
- Perceptual organization: find the grouping that minimizes code cost (Koffka, 1935; Attneave & Frost, 1969; Leeuwenberg, 1971).
- Early vision: efficient coding and transmission (Barlow, 1974; Srinivasan & Laughlin; Blakemore, 1990).
- Causal reasoning: find the minimal belief network (Wedelind).
- Similarity: similarity between representations measured by the code length needed to get between them (Chater & Vitányi, 2003; Hahn, Chater & Richardson, 2003).
- Categorization: categorize items so as to find the shortest code, i.e. high-level perceptual organization (Pothos & Chater, 2002; Feldman, 2000).
- Memory storage: shorter codes are easier to store (Chater, 1999).
- Memory retrieval: explain interference by cue-trace complexity; a rational foundation for distinctiveness models such as SIMPLE (Brown, Neath & Chater, 2005).
- Language acquisition: find the grammar that best explains the child's input (Chomsky, 1955; J. D. Fodor & Crain; Chater, 2004; Chater & Vitányi, 2005).

Here: perceptual organization, language acquisition, similarity and generalization.

Perceptual organization
- There is a long tradition of simplicity in perception (Mach, Koffka, Leeuwenberg), e.g. the Gestalt laws.
- Example: six elements. Grouped description: 6 + 1 vectors (an (x) vector for each element, plus one shared (v) vector for the group). Ungrouped description: 6 × 2 vectors (a separate (x, v) pair for each element). A toy count of the two encodings is sketched after this section.

Language acquisition
- And language acquisition, where simplicity helps resolve an apparent learnability paradox.
- Undergeneral grammars predict that good sentences are not allowed; just wait till one turns up.
- Overgeneral grammars predict that bad sentences are actually OK; this seems to need negative evidence (say a bad sentence, and get corrected).

The logical problem of language acquisition (e.g., Hornstein & Lightfoot, 1981; Pinker, 1979)
- Without negative evidence, the learner can never eliminate overgeneral grammars.
- "Mere" non-occurrence of sentences is not enough, because almost all acceptable sentences also never occur.
- This is backed up by formal results (Gold, 1967; though cf. Feldman, Horning, et al.).
- An argument for innateness?
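As a back-of-the-envelope illustration of the grouped versus ungrouped descriptions above (my own sketch; the cost per vector is an arbitrary assumption, and only the ratio between the two totals matters), the grouped encoding of six elements needs 6 + 1 vectors while the ungrouped encoding needs 6 × 2:

```python
# Hypothetical fixed cost per vector; the absolute number is arbitrary,
# only the comparison between the two descriptions is of interest.
BITS_PER_VECTOR = 10
N_ELEMENTS = 6

# Ungrouped: each element is described independently by its own (x, v) pair.
ungrouped_bits = (N_ELEMENTS * 2) * BITS_PER_VECTOR      # 12 vectors

# Grouped: one vector shared by the whole group, plus one per element.
grouped_bits = (N_ELEMENTS + 1) * BITS_PER_VECTOR        # 7 vectors

print("ungrouped:", ungrouped_bits, "bits")   # 120
print("grouped:  ", grouped_bits, "bits")     # 70
```

On the simplicity/MDL view, the perceptual system prefers the grouped interpretation because it yields the shorter code.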
An "ideal" learning set-up (cf. ideal observers)
- Linguistic environment: positive evidence only; computability assumptions.
- Measure of learning performance: statistical.
- Learning method: simplicity.

Overgeneralization theorem (Chater & Vitányi)
- Suppose the learner has probability ε_j of erroneously guessing an ungrammatical jth word. Then
  Σ_{j=1..∞} ε_j ≤ K(μ) log_e 2
  where μ is the true (computable) distribution generating the linguistic input.
- Intuitive explanation: overgeneralization underloads the probabilities of grammatical sentences, and small probabilities imply longer code lengths.

Absence as implicit negative evidence
- Overgeneral grammars predict sentences that never in fact occur, and their absence is a clue that the grammar is wrong.
- The method can be "scaled down" to consider the learnability of specific linguistic constructions.

Similarity and categorization
- Cognitive dissimilarity: the representational "distortion" required to get from x to y: D_U(x, y) = K(y|x).
- This is not symmetrical: K(y|x) > K(x|y) when K(y) > K(x). Deletion is easy...

Shepard's (1987) universal law
- Generalization (strictly, confusability) is an exponential function of psychological "distance":
  G(a, b) = A e^(-B D_U(S_a, S_b))

A derivation
- Shepard's generalization measure:
  G(a, b) = [ Pr(R_a|S_b) Pr(R_b|S_a) / (Pr(R_a|S_a) Pr(R_b|S_b)) ]^(1/2)
- Taking logarithms:
  -log2 G(a, b) = -1/2 [ log2 Pr(R_a|S_b) + log2 Pr(R_b|S_a) - log2 Pr(R_a|S_a) - log2 Pr(R_b|S_b) ]
- For "typical" items:
  -log2 G(a, b) = 1/2 [ K(R_a|S_b) + K(R_b|S_a) ] + o(1)
                = D_U(S_a, S_b) + o(1), assuming the items are of roughly the same complexity.
- Hence G(a, b) = A e^(-B D_U(S_a, S_b)): the universal law.

The asymmetry of similarity
- What is this thing like? And what is this like?
- A heuristic measure of amount of information: Shannon's guessing game.
  Starting from Horse: 1. Pony? 2. Cow? 3. Dog? ... 345. Pegasus ✓ (345 guesses!)
  Starting from Pegasus: Horse comes up by guess #2.
- Asymmetry of code lengths implies asymmetry of similarity:
  Horse: guess #345 gets Pegasus, and Pr(#345) is very small, so the code length -log2 Pr(#345) is long.
  Pegasus: guess #2 gets Horse, so the code length -log2 Pr(#2) is very small.
- So Pegasus is more like Horse than Horse is like Pegasus.
- There are many other examples of asymmetry, and many measures (search times, memory confusions, ...), which seem to fit this pattern; e.g. Treisman & Souther (1985): search through a simple array versus a complex array.
- A rough compression-based illustration of this asymmetry is sketched after the summary below.

Summary
- MDL/Kolmogorov complexity has a close relation with Bayes.
- It provides the basis for a "universal" prior.
- There is a variety of applications to cognitive science.
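Finally, the conditional-complexity asymmetry behind the Horse/Pegasus example can be illustrated with the same compression trick used earlier. This is again my own sketch rather than part of the talk: the toy strings are invented, K(y|x) is approximated very roughly by the extra compressed length needed for y once x is already available, and the exact numbers (though typically not the direction of the asymmetry) depend on the compressor.

```python
import zlib

def c(s: str) -> int:
    """Compressed length in bytes: a crude stand-in for K(.)."""
    return len(zlib.compress(s.encode("utf-8"), 9))

def cond(y: str, x: str) -> int:
    """Rough proxy for K(y|x): extra bytes needed to encode y
    when x has already been encoded."""
    return c(x + y) - c(x)

# Toy 'representations': pegasus = horse plus extra structure, so getting
# from horse to pegasus means adding information, while getting back
# is mostly deletion.
horse = "four legs, mane, tail, hooves, gallops, eats grass, domesticated; " * 20
pegasus = horse + "wings, flies, mythical; "

print("K(pegasus | horse) ~", cond(pegasus, horse), "bytes")   # larger
print("K(horse | pegasus) ~", cond(horse, pegasus), "bytes")   # smaller
```

On this measure, as with the guessing-game measure above, Pegasus comes out closer to Horse than Horse is to Pegasus.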