Statistical learning of grammars
Mark Johnson, Brown University
BUCLD 2005

Outline
- Introduction
- Stochastic grammars
- Supervised learning
- Unsupervised learning
- Learning a real language
- Conclusion

Statistical learning and implicit negative evidence
- Logical approach to acquisition:
  - no negative evidence ⇒ subset problem: guess L2 when the true language is L1 (where L1 ⊂ L2)
- Statistical approach to learning:
  - if L2 − L1 is expected to occur but doesn't ⇒ L2 is probably wrong
  - implicit negative evidence succeeds where logical learning fails (e.g., PCFGs)
  - stronger input assumptions (the input follows a distribution)
  - weaker success criteria (probabilistic)
- Both logic and statistics are kinds of inference

What is statistical learning?
- Statistical learners learn from statistical (distributional) properties of the input
  - not just whether something occurs (logical learning), but how often
- Statistical learning (a.k.a. machine learning) is a separate field
  - assumes the input follows some (unknown) probability distribution
  - mathematical theories relating the learning goal with statistics
- Statistical inference uses more information from the input
- The most informative statistic depends on:
  - what the learner is trying to learn
  - the current state of the learner
- Much more than transitional probabilities!
- Vapnik (1998) Statistical Learning Theory

What are the right units of generalization?
- Grammars are tools for investigating different units of generalization
- Grammars can model a wide variety of phenomena:
  - various types of grammatical dependencies
  - word segmentation (Brent)
  - syllable structure (Goldwater and Johnson)
  - morphological dependencies (Goldsmith)
- [Figures: phrase-structure grammar and dependency grammar analyses of "Sam promised to write an article", showing the +control and +transitive dependencies]

Units of generalization in learning
1. Colorless green ideas sleep furiously.
2. *Furiously sleep ideas green colorless.
- Both sentences have zero frequency ⇒ frequency ≠ well-formedness
- Hidden class bigram model: P(colorless green ideas sleep furiously) = P(colorless) P(green | colorless) ... = 2 × 10^5 × P(furiously sleep ideas green colorless)
  (a small code sketch of such a model follows below)
- Chomsky (1957) Syntactic Structures
- Pereira (2000) "Formal grammar and information theory: Together again?"
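The hidden class bigram factorization above can be made concrete with a short sketch. This is an illustrative toy, not Pereira's trained model: the class-transition and word-emission tables below are invented numbers over a five-word vocabulary, so the resulting ratio is far smaller than the 2 × 10^5 factor reported on the slide. Only the model form, P(w_i | w_{i−1}) = Σ_c P(c | w_{i−1}) P(w_i | c), is taken from the class-based bigram idea.

```python
# Hedged sketch of a class-based ("hidden class") bigram model.
# All probability tables are made up for illustration, not estimated from a corpus.

# P(class of next word | previous word): hypothetical transition table
trans = {
    "<s>":       {"ADJ": 0.4,  "N": 0.4,  "V": 0.1,  "ADV": 0.1},
    "colorless": {"ADJ": 0.5,  "N": 0.4,  "V": 0.05, "ADV": 0.05},
    "green":     {"ADJ": 0.2,  "N": 0.6,  "V": 0.1,  "ADV": 0.1},
    "ideas":     {"ADJ": 0.05, "N": 0.15, "V": 0.6,  "ADV": 0.2},
    "sleep":     {"ADJ": 0.05, "N": 0.25, "V": 0.1,  "ADV": 0.6},
    "furiously": {"ADJ": 0.1,  "N": 0.2,  "V": 0.4,  "ADV": 0.3},
}
# P(word | class): hypothetical emission table (each row sums to 1 over the toy vocabulary)
emit = {
    "ADJ": {"colorless": 0.5,  "green": 0.4,  "ideas": 0.04, "sleep": 0.03, "furiously": 0.03},
    "N":   {"colorless": 0.05, "green": 0.1,  "ideas": 0.6,  "sleep": 0.2,  "furiously": 0.05},
    "V":   {"colorless": 0.02, "green": 0.03, "ideas": 0.05, "sleep": 0.8,  "furiously": 0.1},
    "ADV": {"colorless": 0.05, "green": 0.05, "ideas": 0.05, "sleep": 0.05, "furiously": 0.8},
}

def sentence_prob(words):
    """P(w_1 ... w_n) = prod_i sum_c P(c | w_{i-1}) P(w_i | c), starting from <s>."""
    p, prev = 1.0, "<s>"
    for w in words:
        p *= sum(pc * emit[c][w] for c, pc in trans[prev].items())
        prev = w
    return p

good = "colorless green ideas sleep furiously".split()
bad  = "furiously sleep ideas green colorless".split()
# Ratio is about 74 with these made-up tables; the point is only that it exceeds 1,
# i.e. the model prefers the grammatical order even though both strings have zero frequency.
print(sentence_prob(good) / sentence_prob(bad))
```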
There are stochastic variants of most grammars
- The grammar generates candidate structures (e.g., strings of words, trees, OT candidates, construction grammar analyses, minimalist derivations, ...)
- Associate numerical weights with configurations that occur in these structures:
  - pairs of adjacent words
  - rules used to derive the structure
  - constructions occurring in the structure
  - P&P parameters (e.g., HeadFinal)
- Combine (e.g., multiply) the weights of the configurations occurring in a structure to get its score
- Abney (1997) "Stochastic Attribute-Value Grammars"

Probabilistic Context-Free Grammars
- The probability of a tree is the product of the probabilities of the rules used to construct it

  1.0  S → NP VP     0.75  NP → George    0.6  V → barks
  1.0  VP → V        0.25  NP → Al        0.4  V → snores

  P([S [NP George] [VP [V barks]]]) = 1.0 × 0.75 × 1.0 × 0.6 = 0.45
  P([S [NP Al] [VP [V snores]]])    = 1.0 × 0.25 × 1.0 × 0.4 = 0.1

Learning as optimization
- Pick a task that the correct grammar should be able to do well:
  - predicting sentences and their structures (supervised learning)
  - predicting the (next) words in sentences (unsupervised learning)
- Find weights that optimize performance on the task
- Searching for optimal weights is usually easier than searching for optimal categorical grammars
- Rumelhart and McClelland (1986) Parallel Distributed Processing
- Tesar and Smolensky (2000) Learnability in Optimality Theory

Learning PCFGs from trees (supervised)
- The grammar determines the units of generalization
- Training trees: [S [NP rice] [VP grows]] (×2), [S [NP corn] [VP grows]] (×1)

  Rule         Count   Rel Freq
  S → NP VP    3       1
  NP → rice    2       2/3
  NP → corn    1       1/3
  VP → grows   3       1

- P([S [NP rice] [VP grows]]) = 2/3,  P([S [NP corn] [VP grows]]) = 1/3
- Relative frequency is the maximum likelihood estimator: it selects the rule probabilities that maximize the probability of the trees
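A minimal sketch of the relative-frequency (maximum likelihood) estimator on the toy treebank above. The nested-tuple tree encoding and the helper names are my own choices for illustration; the counts and probabilities it prints match the table on the slide.

```python
from collections import Counter

# Toy treebank from the slide: trees are nested tuples (label, child, child, ...).
treebank = [
    ("S", ("NP", "rice"), ("VP", "grows")),
    ("S", ("NP", "rice"), ("VP", "grows")),
    ("S", ("NP", "corn"), ("VP", "grows")),
]

def rules(tree):
    """Yield the (lhs, rhs) productions used in a tree."""
    label, *children = tree
    rhs = tuple(c[0] if isinstance(c, tuple) else c for c in children)
    yield (label, rhs)
    for c in children:
        if isinstance(c, tuple):
            yield from rules(c)

# Count each rule and each left-hand side, then take relative frequencies
# (the maximum likelihood estimate of the PCFG rule probabilities).
rule_counts = Counter(r for t in treebank for r in rules(t))
lhs_counts = Counter()
for (lhs, rhs), n in rule_counts.items():
    lhs_counts[lhs] += n
probs = {rule: n / lhs_counts[rule[0]] for rule, n in rule_counts.items()}

def tree_prob(tree):
    """Probability of a tree = product of the probabilities of its rules."""
    p = 1.0
    for r in rules(tree):
        p *= probs[r]
    return p

print(probs[("NP", ("rice",))])                           # 2/3
print(tree_prob(("S", ("NP", "corn"), ("VP", "grows"))))  # 1/3
```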
Grammars and generalizations
- The grammar determines the units of generalization
- Training data: 50% N, 30% N PP, 20% N PP PP
- With flat rules NP → N, NP → N PP, NP → N PP PP, the predicted probabilities replicate the training data:
  50% [NP N],  30% [NP N PP],  20% [NP N PP PP]
- But with adjunction rules NP → N, NP → NP PP, the maximum-likelihood grammar predicts a geometric fall-off in the number of PPs:
  roughly 58% with no PP, 24% with one PP, 10% with two PPs, 5% with three PPs, ...
- The adjunction grammar therefore generalizes beyond the training data: it assigns probability to NPs with three or more PPs, which never occurred

Finding best units of generalization
- Head-argument dependencies in Dependency Grammar
- Predicate and argument structure in Lexicalized Tree-Adjoining Grammar
- [Figures: dependency and LTAG analyses illustrating these units of generalization]

Learning from words alone (unsupervised)
- Training data consists of strings of words w
- Optimize the grammar's ability to predict w: find the grammar that makes w as likely as possible
- Expectation maximization (EM) is an iterative procedure for building unsupervised learners out of supervised learners (a code sketch of this loop appears below):
  - parse a bunch of sentences with the current guess at the grammar
  - weight each parse tree by its probability under the current grammar
  - estimate the grammar from these weighted parse trees as before
- Each iteration is guaranteed not to decrease P(w) (but EM can get trapped in local minima)
- Dempster, Laird and Rubin (1977) "Maximum likelihood from incomplete data via the EM algorithm"

Expectation maximization with a toy grammar
- Initial rule probabilities (excerpt):

  Rule            Prob      Rule        Prob
  VP → V          0.2       Det → the   0.1
  VP → V NP       0.2       N → the     0.1
  VP → NP V       0.2       V → the     0.1
  VP → V NP NP    0.2
  VP → NP NP V    0.2

- "English" input: the dog bites; the dog bites a man; a man gives the dog a bone; ...
- "pseudo-Japanese" input: the dog bites; the dog a man bites; a man the dog a bone gives; ...
- [Figures: geometric average sentence probability of the "English" and "pseudo-Japanese" inputs over EM iterations 0–5, and the rule probabilities re-estimated from each input (VP → V NP, VP → NP V, VP → V NP NP, VP → NP NP V, Det → the, N → the, V → the) over the same iterations]

Statistical grammar learning
- Simple algorithm: learn from your best guesses
  - requires the learner to parse the input
- "Glass box" models: the learner's prior knowledge and learnt generalizations are explicitly represented
- Optimization of a smooth function of the rule weights ⇒ learning can involve small, incremental updates
- Learning structure (rules) is hard, but parameter estimation can approximate rule learning:
  - start with a "superset" grammar
  - estimate rule probabilities
  - discard low-probability rules

The importance of starting small
- EM works by learning from its own parses:
  - each parse is weighted by its probability
  - rules used in high-probability parses receive strong reinforcement
- In grammar-based models, ambiguity grows with sentence length
  - longer sentences are typically highly ambiguous ⇒ lower average parse probability ⇒ less clear information about which rules are most useful
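The EM loop just described can be sketched in a few dozen lines. The grammar and corpus below are illustrative assumptions (a PP-attachment ambiguity rather than the talk's English / pseudo-Japanese experiment), and all parses are enumerated by brute force instead of the inside-outside algorithm, but the loop is the one on the "Learning from words alone" slide: parse with the current grammar, weight each parse by its probability, re-estimate, repeat.

```python
from collections import defaultdict

# Illustrative PCFG with a PP-attachment ambiguity (not the talk's toy grammar).
GRAMMAR = {                                  # lhs -> list of possible right-hand sides
    "S":   [("NP", "VP")],
    "NP":  [("Det", "N"), ("NP", "PP")],
    "VP":  [("V", "NP"), ("VP", "PP")],
    "PP":  [("P", "NP")],
    "Det": [("the",)],
    "N":   [("man",), ("dog",), ("telescope",), ("park",)],
    "V":   [("saw",)],
    "P":   [("with",), ("in",)],
}
NONTERMS = set(GRAMMAR)

def uniform_probs():
    """Start each nonterminal with a uniform distribution over its rules."""
    return {(lhs, rhs): 1.0 / len(rhss)
            for lhs, rhss in GRAMMAR.items() for rhs in rhss}

def parses(sym, words, probs):
    """Yield (rules_used, probability) for every way `sym` can derive `words`."""
    if sym not in NONTERMS:                  # terminal: must match the span exactly
        if list(words) == [sym]:
            yield [], 1.0
        return
    for rhs in GRAMMAR[sym]:
        p_rule = probs[(sym, rhs)]
        if len(rhs) == 1:                    # unary (lexical) rule
            for rules, p in parses(rhs[0], words, probs):
                yield [(sym, rhs)] + rules, p_rule * p
        else:                                # binary rule: try every split point
            for i in range(1, len(words)):
                for rl, pl in parses(rhs[0], words[:i], probs):
                    for rr, pr in parses(rhs[1], words[i:], probs):
                        yield [(sym, rhs)] + rl + rr, p_rule * pl * pr

def em_iteration(corpus, probs):
    """E-step: expected rule counts over all parses; M-step: relative frequency."""
    counts = defaultdict(float)
    for sentence in corpus:
        trees = list(parses("S", sentence.split(), probs))
        z = sum(p for _, p in trees)         # P(sentence) under the current grammar
        for rules, p in trees:
            for rule in rules:
                counts[rule] += p / z        # parse weighted by its posterior probability
    new_probs = {}
    for lhs, rhss in GRAMMAR.items():
        total = sum(counts[(lhs, rhs)] for rhs in rhss) or 1.0
        for rhs in rhss:
            new_probs[(lhs, rhs)] = counts[(lhs, rhs)] / total
    return new_probs

corpus = ["the man saw the dog with the telescope",
          "the dog saw the man in the park"]
probs = uniform_probs()
for it in range(5):
    probs = em_iteration(corpus, probs)
    print(it, round(probs[("VP", ("VP", "PP"))], 3),   # VP-attachment rule rises
              round(probs[("NP", ("NP", "PP"))], 3))   # NP-attachment rule falls
```

On this corpus the two attachment analyses start out tied; after the first re-estimation the VP-attachment rule has higher relative frequency (its competitor NP → NP PP is diluted by the many NP → Det N uses), so later iterations reinforce it further, a small illustration of how rules used in high-probability parses receive strong reinforcement.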
Training from real language
- The ATIS treebank consists of 1,300 hand-constructed parse trees
- Ignore the words (in this experiment)
- About 1,000 PCFG rules are needed to build these trees
- [Figure: example ATIS tree for "Show me all the nonstop flights from Dallas to Denver early in the morning"]

Applying EM learning to real language
1. Extract productions from the trees and estimate their probabilities from the trees to produce a PCFG.
2. Initialize EM with this treebank grammar and its MLE probabilities.
3. Apply EM (to the strings alone) to re-estimate the production probabilities.
4. At each iteration:
   - measure the likelihood of the training data and the quality of the parses produced by each grammar
   - test on the training data (so poor performance is not due to overlearning)

[Figure: accuracy (precision and recall) of parses produced using the learnt grammar, over EM iterations 0–20]
[Figure: log probability of the training strings, over EM iterations 0–20]

Discussion
- Predicting words ≠ finding the correct structure
- Why didn't the learner find the right structures?
  - predicting words is the wrong objective; the grammar ignores semantics (Zettlemoyer and Collins)
  - wrong kind of grammar (Klein and Manning)
  - wrong training data (Yang)
  - wrong learning algorithm (much work in CL and ML)
- de Marcken (1995) "Lexical heads, phrase structure and the induction of grammar"

Bayesian learning
- A statistical learning framework that integrates:
  - the likelihood of the data (prediction)
  - bias or prior knowledge (e.g., innate constraints)
- "Hard" priors ignore some analyses and focus on others
- "Soft" priors bias the learner toward certain hypotheses:
  - markedness constraints (e.g., syllables have onsets)
  - can be overridden by sufficient data (a small numerical sketch appears at the end of this transcript)
- Glass box: grammars use explicit representations
  - evaluate different kinds of universals

Summary
- Statistical learning extracts more information from the input
- Curse of dimensionality: something must guide the learner to focus on the correct generalizations
- There are stochastic versions of most kinds of grammar
- Statistical grammar learning combines:
  - compositional representations
  - optimization-based learning
  - generalizations learnt, prior knowledge assumed
- Predicting the input ≠ correctly analysing the input
- Applied to psycholinguistics (Jurafsky, Crocker); should be useful for child language

Grammars in computational linguistics
- 1980s: hand-written linguistic grammars, evaluated on linguistically interesting examples
- Early 1990s: simple statistical models dominate speech recognition and computational linguistics
  - they can learn
  - corpus-based evaluation methodology
- Late 1990s: techniques for statistical learning of probabilistic grammars
- Today: loosely linguistic grammar-based approaches are competitive, but so are non-grammar-based approaches
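As a small numerical illustration of the "soft priors" point on the Bayesian learning slide above, the sketch below estimates one nonterminal's rule probabilities as the posterior mean of a Dirichlet-multinomial, i.e. (observed count + pseudo-count) / (total count + total pseudo-counts). The rules, counts and pseudo-counts are invented for illustration.

```python
def posterior_mean(counts, pseudo):
    """Dirichlet posterior mean for one nonterminal's rule probabilities."""
    total = sum(counts[r] + pseudo[r] for r in counts)
    return {r: (counts[r] + pseudo[r]) / total for r in counts}

# Hypothetical prior bias: strongly favour "NP -> Det N" over the adjunction rule.
pseudo = {"NP -> Det N": 10.0, "NP -> NP PP": 1.0}

small_data = {"NP -> Det N": 2,   "NP -> NP PP": 3}    # few observations
large_data = {"NP -> Det N": 200, "NP -> NP PP": 300}  # same proportions, 100x more data

print(posterior_mean(small_data, pseudo))  # prior dominates: 0.75 vs 0.25
print(posterior_mean(large_data, pseudo))  # data dominates: about 0.41 vs 0.59
```

With few observations the prior's preference for NP → Det N dominates; with a hundred times more data in the same proportions, the estimate moves close to the data's own relative frequencies (0.4 vs 0.6), i.e. the soft prior is overridden by sufficient data.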