Learning Syntax with Minimum Description Length
Mike Dowman
9 March 2006

Syntactic Theory
Chomsky (1957):
• Discovery procedure
• Decision procedure
• Evaluation procedure
Chomsky (1965): language acquisition

Poverty of the Stimulus
• The evidence available to children is utterances produced by other speakers
• There are no direct cues to sentence structure
• Or to word categories
So children need prior knowledge of possible structures (Universal Grammar)

Negative Evidence
• Some constructions seem impossible to learn without negative evidence
John gave a painting to the museum
John gave the museum a painting
John donated a painting to the museum
* John donated the museum a painting

Implicit Negative Evidence
If constructions don't appear, can we just assume they're not grammatical?
No – we only see a tiny proportion of possible, grammatical sentences
People generalize from examples they have seen to form new utterances
'[U]nder exactly what circumstances does a child conclude that a nonwitnessed sentence is ungrammatical?' (Pinker, 1989)

Minimum Description Length (MDL)
MDL may be able to solve the poverty of the stimulus problem
It prefers the grammar that results in the simplest overall description of the data
• So it prefers simple grammars
• And grammars that result in simple descriptions of the data
Simplest means specifiable using the least amount of information

[Diagram: the space of possible sentences, showing the observed sentences and three inferred grammars – a simple but non-constraining grammar, a complex but constraining grammar, and a grammar that is a good fit to the data]

MDL and Bayes' Rule
P(h | d) ∝ P(h) P(d | h)
• h is a hypothesis (= a grammar)
• d is some data (= sentences)
The probability of a grammar given some data is proportional to its a priori probability times how likely the observed sentences would be if that grammar were correct

Probabilities and Complexities
• Information theory relates probability P and amount of information I:
I = -log2 P
It takes less information to encode likely events than unlikely ones
P(h | d) ∝ 2^-(I(h) + I(d | h))
The best grammar is the one that allows both itself and the data to be encoded using the least amount of information

Complexity and Probability
• A more complex grammar → lower probability
• A more restrictive grammar → fewer choices for the data, so each possibility has a higher probability
MDL finds a middle ground between always generalizing and never generalizing

Encoding Grammars and Data
[Diagram: a decoder turns a single binary string into the grammar followed by the data coded in terms of that grammar]
Grammar:
A → B C
B → D E
E → {kangaroo, aeroplane, comedian}
D → {the, a, some}
C → {died, laughed, burped}
Data coded in terms of the grammar:
The comedian died
A kangaroo burped
The aeroplane laughed
Some comedian burped
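A small Python sketch of the correspondence between the Bayesian and the description-length view. The two grammars, their priors and their likelihoods below are invented for illustration; this is not the model's actual encoding scheme.

import math

def bits(p):
    # Information needed to encode an event of probability p: I = -log2 p
    return -math.log2(p)

# Two invented hypotheses (grammars), with made-up prior probabilities and
# made-up probabilities that each assigns to the observed sentences.
hypotheses = {
    "simple, permissive grammar":   {"prior": 0.10, "likelihood": 1e-12},
    "complex, restrictive grammar": {"prior": 0.01, "likelihood": 1e-9},
}

for name, h in hypotheses.items():
    bayes_score = h["prior"] * h["likelihood"]              # proportional to P(h|d)
    total_bits  = bits(h["prior"]) + bits(h["likelihood"])  # I(h) + I(d|h)
    print(f"{name}: P(h)P(d|h) = {bayes_score:.1e}, I(h) + I(d|h) = {total_bits:.1f} bits")

# The grammar with the highest P(h)P(d|h) is exactly the grammar with the
# smallest total description length, because I = -log2 P is decreasing in P.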
MDL and Prior (Innate?) Bias
• MDL solves the difficult problem of deciding the prior probability of each grammar
• But MDL is still subjective – the prior bias is just hidden in the formalism chosen to represent grammars, and in the encoding scheme

MDL in Linguistics
• Morphology (John Goldsmith, 2001; Michael Brent, 1993)
• Phonology (Mark Ellison, 1992)
• Syntax (Andreas Stolcke, 1994; Langley and Stromsten, 2000; Grünwald, 1994; Onnis, Roberts and Chater, 2002)
• Iterated Learning (Teal and Taylor, 2000; Brighton and Kirby, 2001; Roberts, Onnis and Chater, 2005)

Learning Phrase Structure Grammars: Dowman (2000)
• Binary or non-branching rules:
S → B C
B → E
C → tomato
• All derivations start from the special symbol S

Encoding Grammars
Grammars can be coded as lists of three symbols
• A null symbol in the 3rd position indicates a non-branching rule
• The first symbol is a rule's left-hand side, the second and third its right-hand side
S, B, C, B, E, null, C, tomato, null

Statistical Encoding of Grammars
• First we encode the frequency of each symbol
• Then we encode each symbol using the frequency information
S, B, C, B, E, null, C, tomato, null
I(S) = -log2 1/9
I(null) = -log2 2/9
Uncommon symbols have a higher coding length than common ones

Encoding Data
1 S → NP VP
2 NP → john
3 NP → mary
4 VP → screamed
5 VP → died
Data: John screamed, John died, Mary screamed
Data encoding: 1, 2, 4, 1, 2, 5, 1, 3, 4
There is a restricted range of choices at each stage of the derivation
Fewer choices = higher probability

Statistical Encoding of Data
If we record the frequency of each rule, this information can help us make a more efficient encoding
• 1 S → NP VP (3)      Total frequency for S = 3
• 2 NP → john (2)      Total frequency for NP = 3
• 3 NP → mary (1)
• 4 VP → screamed (2)  Total frequency for VP = 3
• 5 VP → died (1)
Data: 1, 2, 4, 1, 2, 5, 1, 3, 4
Probabilities: 1 → 3/3, 2 → 2/3, 4 → 2/3, 1 → 3/3, 2 → 2/3, …

Encoding in My Model
Number of bits decoded = evaluation
[Diagram: a decoder turns a single binary string into the symbol frequencies, the grammar, the rule frequencies and the data]
Symbol frequencies: S (1), NP (3), VP (3), john (1), mary (1), screamed (1), died (1), null (4)
Grammar:
1 S → NP VP
2 NP → john
3 NP → mary
4 VP → screamed
5 VP → died
Rule frequencies: rule 1: 3, rule 2: 2, rule 3: 1, rule 4: 2, rule 5: 1
Data: John screamed, John died, Mary screamed
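The Python sketch below re-computes an evaluation for the worked example above using the same two ideas: symbol-frequency coding of the grammar and rule-frequency coding of the data. It is a simplification of the model, since the real encoding also has to transmit the frequency tables and the boundaries between rules; that overhead is omitted here.

import math
from collections import Counter

# Rule number -> (left-hand side, right-hand side); the grammar from the slides.
rules = {
    1: ("S",  ("NP", "VP")),
    2: ("NP", ("john",)),
    3: ("NP", ("mary",)),
    4: ("VP", ("screamed",)),
    5: ("VP", ("died",)),
}

# Grammar cost: each rule is written as three symbols (LHS, RHS1, RHS2 or null),
# and each symbol occurrence costs -log2(symbol frequency / total symbols).
symbols = []
for lhs, rhs in rules.values():
    symbols.extend([lhs, rhs[0], rhs[1] if len(rhs) > 1 else "null"])
sym_freq = Counter(symbols)
grammar_bits = sum(-math.log2(sym_freq[s] / len(symbols)) for s in symbols)

# Data cost: each sentence is a sequence of rule choices, and each choice costs
# -log2(rule frequency / total frequency of rules with the same left-hand side).
derivation = [1, 2, 4, 1, 2, 5, 1, 3, 4]   # John screamed, John died, Mary screamed
rule_freq = Counter(derivation)
lhs_total = Counter()
for r, f in rule_freq.items():
    lhs_total[rules[r][0]] += f
data_bits = sum(-math.log2(rule_freq[r] / lhs_total[rules[r][0]]) for r in derivation)

print(f"grammar: {grammar_bits:.1f} bits, data: {data_bits:.1f} bits, "
      f"evaluation: {grammar_bits + data_bits:.1f} bits")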
Creating Candidate Grammars
• Start with a simple grammar that allows all sentences
• Make a simple change and see if it improves the evaluation (add a rule, delete a rule, change a symbol in a rule, etc.)
• Annealing search
• First stage: just look at the data coding length
• Second stage: look at the overall evaluation

Example: English
John hit Mary
Mary hit Ethel
Ethel ran
John ran
Mary ran
Ethel hit John
Noam hit John
Ethel screamed
Mary kicked Ethel
John hopes Ethel thinks Mary hit Ethel
Ethel thinks John ran
John thinks Ethel ran
Mary ran
Ethel hit Mary
Mary thinks John hit Ethel
John screamed
Noam hopes John screamed
Mary hopes Ethel hit John
Noam kicked Mary

Learned Grammar
S → NP VP
VP → ran
VP → screamed
VP → Vt NP
VP → Vs S
Vt → hit
Vt → kicked
Vs → thinks
Vs → hopes
NP → John
NP → Ethel
NP → Mary
NP → Noam

Evaluations
[Chart: evaluation in bits for the initial grammar and the learned grammar, split into overall evaluation, grammar and data components]

Dative Alternation
John gave a painting to the museum
John gave the museum a painting
John donated a painting to the museum
* John donated the museum a painting
• Two sub-categories of verb
• Productive use of new verbs
• U-shaped learning

Training Data
• Three alternating verbs: gave, passed, lent
• One non-alternating verb: donated
• One verb seen only once: sent
The museum lent Sam a painting
John gave a painting to Sam
Sam donated John to the museum
The museum sent a painting to Sam

Dative Evaluations
[Chart: evaluation in bits for the initial grammar and the learned grammar on the dative training data, split into overall evaluation, grammar and data components]

Learned Structures
[Tree diagram: the structure learned for 'John gave a painting to Sam', built from the categories S, X, Y, Z, NP, VA, DET, N and P]
But all sentences generated by the grammar are grammatical

Two Verb Classes Learned
• The learned grammar distinguishes alternating and non-alternating verbs
A grammar with one verb class would be simpler
So why two classes?

A Grammar with One Verb Class
donated was placed in the same class as the other verbs and the redundant rule deleted

                            donated doesn't alternate   donated alternates
Overall evaluation (bits)   1703.4                      1710.4
Grammar (bits)               321.0                       298.2
Data (bits)                 1382.3                      1412.2

The new grammar is simpler
But it predicts many ungrammatical sentences

U-shaped Learning
• With less data, a grammar with only one class of verbs was learned (so donated could appear in both constructions)
• In this case the benefit derived from a better description of the data was not enough to justify a more complex grammar
So a simpler grammar was preferred in this case

Regular and Irregular Rules
• The model places sent in the alternating class. Why?
Y → VA NP
Y → VA Z
Y → VP Z
VA → passed
VA → gave
VA → lent
VP → donated
VA / VP → sent

                            sent doesn't alternate      sent alternates
Overall evaluation (bits)   1703.6                      1703.4
Grammar (bits)               322.2                       321.0
Data (bits)                 1381.4                      1382.3

Regular constructions are preferred because the grammar is coded statistically
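A rough Python sketch of this effect, applying the same symbol-frequency coding as before to just the verb-class fragment of the grammar shown above. The absolute numbers are not the model's (the full evaluation also covers the rest of the grammar and the data term); only the direction of the difference is the point.

import math
from collections import Counter

def grammar_bits(rules):
    # Cost of writing the grammar when each symbol occurrence is coded as
    # -log2(frequency / total), with null padding non-branching rules.
    symbols = []
    for lhs, rhs in rules:
        symbols.extend([lhs, rhs[0], rhs[1] if len(rhs) > 1 else "null"])
    freq = Counter(symbols)
    return sum(-math.log2(freq[s] / len(symbols)) for s in symbols)

# Only the verb-related fragment of the learned dative grammar, not the full grammar.
base = [("Y", ("VA", "NP")), ("Y", ("VA", "Z")), ("Y", ("VP", "Z")),
        ("VA", ("passed",)), ("VA", ("gave",)), ("VA", ("lent",)),
        ("VP", ("donated",))]

sent_regular   = base + [("VA", ("sent",))]   # sent grouped with the alternating verbs
sent_irregular = base + [("VP", ("sent",))]   # sent grouped with donated

print(f"sent alternates (in VA):       {grammar_bits(sent_regular):.1f} bits")
print(f"sent doesn't alternate (in VP): {grammar_bits(sent_irregular):.1f} bits")
# VA already occurs often, so its code is short; putting sent in the larger,
# regular class therefore makes the grammar itself slightly cheaper to encode.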
Why use Statistical Grammars?
Statistics are a valuable source of information
They help to infer when absences are due to chance
The learned grammar predicted that sent should appear in the double object construction
• but in 150 sentences it was only seen in the prepositional dative construction
• With a non-statistical grammar we need an explanation as to why this is
• A statistical grammar knows that sent is rare, which explains the absence of double object occurrences

Non-statistical Coding of Grammars and Non-statistical Grammars

Status of rules    Encoding of grammar   Component   sent does not alternate   sent alternates
statistical        statistical           Total       1703.6                    1703.4
                                         Grammar      322.2                     321.0
                                         Data        1381.4                    1382.3
statistical        non-statistical       Total       1734.2                    1735.2
                                         Grammar      352.8                     352.8
                                         Data        1381.4                    1382.3
non-statistical    statistical           Total       1653.9                    1659.0
                                         Grammar      227.2                     226.0
                                         Data        1426.7                    1433.0
non-statistical    non-statistical       Total       1684.6                    1690.8
                                         Grammar      257.8                     257.8
                                         Data        1426.7                    1433.0

(all values in bits)
Sent only alternates when the grammar is both statistical and encoded statistically
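As a hedged illustration of why the statistical data coding is shorter, the Python sketch below uses invented verb counts (not the actual training data) to show the two effects at once: coding rule choices by their observed frequencies shortens the data description, and it makes the absence of a rare verb from one construction unsurprising.

import math
from collections import Counter

# Invented verb counts for a 150-sentence sample in which 'sent' occurs only once.
verb_counts = Counter({"gave": 60, "passed": 50, "lent": 39, "sent": 1})
total = sum(verb_counts.values())

# Statistical grammar: each verb choice is coded with its observed frequency.
statistical_bits = sum(n * -math.log2(n / total) for n in verb_counts.values())
# Non-statistical grammar: every verb choice is coded as one of four equal options.
uniform_bits = total * -math.log2(1 / len(verb_counts))

print(f"statistical coding of verb choices: {statistical_bits:.0f} bits")
print(f"non-statistical (uniform) coding:   {uniform_bits:.0f} bits")
# The statistical code is shorter, and a verb seen only once is expected to be
# missing from most constructions by chance, so the absence of 'sent' in the
# double-object frame carries little weight against a grammar that allows it.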
Is this really how alternating verbs are learned?
• Change-of-possession verbs (send, give) alternate
• Unless they are Latinate (donate)
Children are aware of such cues (Gropen et al., 1989)
So a complete theory of the dative alternation must take account of semantic and morphophonological cues
• In MDL, rules can be conditioned on semantic or phonological information

Does Pinker's / Mazurkewich and White's Solution Still Need MDL?
• The correspondence between morphophonological/semantic cues and subcategorizations is language-specific
It must be learned from distributions
• Which semantic/phonological cues are relevant?
• Which are due to chance similarity?
The same kinds of issues resurface as with a purely distributional account

Is the Learnability Problem Solved?
• MDL can learn very simple grammars for small data sets
• No one has succeeded in scaling it up
A second learnability problem arises from the impossibility of considering all possible grammars
• Do we need innate constraints if the search for a correct grammar is going to be successful?

Alternative Grammar Formalisms
Phrase structure grammars are too simple to capture some phenomena
• Agreement
• Movement
• etc.
But MDL is compatible with other grammar formalisms

Neurologically and Psychologically Plausible?

Take Home Messages
• Example sentences are a rich source of evidence about linguistic structure
• Statistical information is very valuable – so don't ignore it
• Maybe syntax is more learned than innate
• MDL tells us which generalizations are appropriate (justified by the data) and which are not
• The lack of negative evidence is not a particular problem when using MDL

References
• Brent, M. (1993). Minimal Generative Explanations: A Middle Ground between Neurons and Triggers. Proceedings of the 15th Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Lawrence Erlbaum Associates.
• Brighton, H. & Kirby, S. (2001). The Survival of the Smallest: Stability Conditions for the Cultural Evolution of Compositional Language. In J. Kelemen & P. Sosík (Eds.), Advances in Artificial Life. Berlin: Springer.
• Chomsky, N. (1957). Syntactic Structures. The Hague: Mouton & Co.
• Chomsky, N. (1965). Aspects of the Theory of Syntax. Cambridge, MA: MIT Press.
• Ellison, T. M. (1992). The Machine Learning of Phonological Structure. Doctoral thesis, University of Western Australia.
• Goldsmith, J. (2001). Unsupervised Learning of the Morphology of a Natural Language. Computational Linguistics, 27(2), 153-198.
• Gropen, J., Pinker, S., Hollander, M., Goldberg, R. & Wilson, R. (1989). The Learnability and Acquisition of the Dative Alternation in English. Language, 65, 203-257.
• Grünwald, P. (1994). A Minimum Description Length Approach to Grammar Inference. In G. Scheler, S. Wernter & E. Riloff (Eds.), Connectionist, Statistical and Symbolic Approaches to Learning for Natural Language. Berlin: Springer Verlag.
• Langley, P. & Stromsten, S. (2000). Learning Context-Free Grammars with a Simplicity Bias. In R. L. de Mantaras & E. Plaza (Eds.), Proceedings of the Eleventh European Conference on Machine Learning. Barcelona: Springer-Verlag.
• Onnis, L., Roberts, M. & Chater, N. (2002). Simplicity: A Cure for Overgeneralizations in Language Acquisition? In W. Gray & C. Schunn (Eds.), Proceedings of the Twenty-Fourth Annual Conference of the Cognitive Science Society. Hillsdale, NJ: Lawrence Erlbaum Associates.
• Pinker, S. (1989). Learnability and Cognition: The Acquisition of Argument Structure. Cambridge, MA: MIT Press.
• Roberts, M., Onnis, L. & Chater, N. (2005). Acquisition and Evolution of Quasi-Regular Languages: Two Puzzles for the Price of One. In M. Tallerman (Ed.), Language Origins: Perspectives on Evolution. Oxford: Oxford University Press.
• Stolcke, A. (1994). Bayesian Learning of Probabilistic Language Models. Doctoral dissertation, Department of Electrical Engineering and Computer Science, University of California at Berkeley.
• Teal, T. K. & Taylor, C. E. (2000). Effects of Compression on Language Evolution. Artificial Life, 6, 129-143.