Ambiguity Management in Deep Grammar Engineering
Tracy Holloway King

Ambiguity: bug or feature?
Bug in computer programming languages
Feature in natural language
– People are good at resolving ambiguity in context
– Ambiguity consequently often goes unperceived ("Readjust paper holding clip"), even though thousand-fold ambiguities are common
– Ambiguity promotes conciseness
Computers can't resolve ambiguity like humans
If we are going to build large-scale, linguistically sophisticated grammars, we need ways to handle ambiguity

Talk Outline
Sources of ambiguity
Grammar engineering approaches
– Shallow markup
– (Dis)preference marks
Stochastic disambiguation
Efficiency in ambiguity management

Sources of Ambiguity
Phonetic: "I scream" or "ice cream"
Tokenization: "I like Jan." --- |Jan|. or |Jan.|. (abbreviation of January)
Morphological: "walks" --- plural noun or 3sg verb; "untieable knot" --- un(tieable) or (untie)able
Lexical: "bank" --- river bank or financial institution
Syntactic: "The turkeys are ready to eat." --- fattened or hungry
Semantic: "Two boys ate fifteen pizzas." --- 15 each or 15 total
Pragmatic: "Sue won. Ed gave her a good luck charm." --- cause or result

PP Attachment
A classic example of syntactic ambiguity
PP adjuncts can attach to VPs and NPs
Strings of PPs in the VP are ambiguous
– I see the girl with the telescope.
  I see [the girl with the telescope].
  I see [the girl] [with the telescope].
Ambiguities proliferate exponentially
– I see the girl with the telescope in the park
  I see [the girl with [the telescope in the park]]
  I see [the [girl with the telescope] in the park]
  I see the girl [with the [telescope in the park]]
  I see the girl [with the telescope] [in the park]
  I see [the girl with the telescope] [in the park]
– The syntax has no way to determine the attachment, even if humans can.

Coverage entails ambiguity
I fell in the park. + I know the girl in the park.
– covering both attachments means "I see the girl in the park." comes out ambiguous

Ambiguity can be explosive
If alternatives multiply within or across components…
[Figure: processing pipeline — Tokenize, Morphology, Syntax, Semantics, Discourse — with alternatives multiplying at each level]

Ambiguity figures
Deep grammars are massively ambiguous
Example: 700 sentences from section 23 of the WSJ
– average # of words: 19.6
– average # of optimal parses: 684
  » for 1-10 word sentences: 3.8
  » for 11-20 word sentences: 25.2
  » for 50-60 word sentences: 12,888

Managing Ambiguity
Grammar engineering approaches
– Trim early with shallow markup
– (Dis)preference marks on rules
Choose the most probable parse for applications that need a single input
Use packing to parse and manipulate the ambiguities efficiently

Talk Outline
Sources of ambiguity
Grammar engineering approaches
– Shallow markup
– (Dis)preference marks
Stochastic disambiguation
Efficiency in ambiguity management

Shallow markup
Part-of-speech marking as a filter
– I saw her duck/VB.
– accuracy of the tagger (very good for English)
– can use partial tagging (verbs and nouns)
Named entities
– <company>Goldman, Sachs & Co.</company> bought IBM.
– good for proper names and times
– hard to parse internal structure
Fall-back technique if the marked-up parse fails
– slows parsing
– accuracy vs. speed

Example shallow markup: Named entities
Allow the tokenizer to accept marked-up input:
  parse {<person>Mr. Thejskt Thejs</person> arrived.}
Tokenized string (TB = token boundary), with the named-entity token alongside the regular tokenization:
  Mr. Thejskt Thejs TB +NEperson | Mr(TB). TB Thejskt TB Thejs
  TB arrived TB . TB
Add lexical entries and rules for the NE tags
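A minimal sketch (not from the talk) of how such marked-up input might be produced: given entity spans from a named-entity recognizer, wrap them in XML-style tags before handing the string to the parser. The <person> tag and the parse {...} wrapper follow the slide's example; the function name, data structures, and spans below are hypothetical.

```python
# Sketch: wrap named-entity spans in XML-style tags so the tokenizer/parser
# can treat each entity as a single unit. Function and variables are
# hypothetical; only the tag format and "parse {...}" follow the slide.

def mark_up(text, entities):
    """entities: list of (start, end, label) character spans, non-overlapping."""
    pieces, last = [], 0
    for start, end, label in sorted(entities):
        pieces.append(text[last:start])
        pieces.append(f"<{label}>{text[start:end]}</{label}>")
        last = end
    pieces.append(text[last:])
    return "parse {" + "".join(pieces) + "}"

sentence = "Mr. Thejskt Thejs arrived."
spans = [(0, 17, "person")]          # "Mr. Thejskt Thejs"
print(mark_up(sentence, spans))
# parse {<person>Mr. Thejskt Thejs</person> arrived.}
```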
Resulting C-structure [figure omitted]
Resulting F-structure [figure omitted]

Results for shallow markup (pairs are Full/All; Kaplan and King 2003)
             % Full parses   Optimal sol'ns   Best F-score   Time %
Unmarked          76            482/1753         82/79        65/100
Named ent         78            263/1477         86/84        60/91
POS tag           62            248/1916         76/72        40/48

(Dis)preference marks (OT marks)
Want to (dis)prefer certain constructions
– prefer: use when possible
– disprefer: do not use unless there is no other analysis
Implementation
– Put marks in rules and lexical entries
– Rank those marks
  » the ranking can differ across grammars/corpora
– Use the most preferred parse(s)
  » can be used as a two-pass system for robust parsing

Ungrammatical input
Real-world text contains ungrammatical input
– Deep grammars tend to cover only grammatical output
Common errors can be coded in the rules
– may want to know that an error occurred (e.g., to provide feedback in CALL grammars)
Disprefer parses of ungrammatical structures
– tools for the grammar writer to rank rules
– two+ pass system
  1. standard rules
  2. rules for known ungrammatical constructions
  3. default fall-back rules

Sample ungrammatical structures
Mismatched subject-verb agreement
  Verb3Sg = { SUBJ PERS = 3
              SUBJ NUM = sg
            | BadVAgr }
Missing copula
  VPcop ==> { Vcop: ^=!
            | e: (^ PRED)='NullBe<(^ SUBJ)(^XCOMP)>'
                 MissingCopularVerb }
            { NP: (^ XCOMP)=!
            | AP: (^ XCOMP)=!
            | … }

Dispreferred grammatical structures
Prefer subcategorized infinitives to adverbials
– I want it. / I finished up (in order) to leave.
– I want it to leave.
  VP --> V (NP: (^ OBJ)=!)
           (VPinf: { (^ XCOMP)=! +InfSubcat
                   | ! $ (^ ADJUNCT) InfAdjunct } ).
Post-copular gerunds
– He is a boy. / (His) going is difficult.
– He is going. (prefer the progressive reading over copula + gerund NP)

OT Mark summary
Use (dis)preference marks to (dis)prefer constructions or words
Allows inclusion of marginal/ungrammatical constructions
Issues:
– Only works for ambiguities with known preferences (not PP attachment)
– Hard to determine the ranking for many marks
– Two-pass parsing can be slow

Talk Outline
Sources of ambiguity
Grammar engineering approaches
– Shallow markup
– (Dis)preference marks
Stochastic disambiguation
Efficiency in ambiguity management

Packing & Pruning in XLE
XLE produces (too) many candidates
– All valid (with respect to the grammar and OT marks)
– Not all equally likely
– Some applications require a single best parse, or at most a handful (n best)
The grammar writer can't specify the correct choices
– Many implicit properties of words and structures with unclear significance

Pruning in XLE
Appeal to a probability model to choose the best parse
Assume: previous experience is a good guide for future decisions
Collect a corpus of training sentences; build a probability model that optimizes for previous good results
– partially labelled training data is OK
  [NP-SBJ They] see [NP-OBJ the girl with the telescope]
Apply the model to choose the best analysis of new sentences
– efficient (XLE English grammar: 5% of parse time)

Exponential models are appropriate
(aka Maximum Entropy or Log-linear models)
Assign probabilities to representations, not to choices in a derivation
No independence assumption
Arithmetic combined with human insight
– Human:
  » define properties of representations that may be relevant
  » based on any computable configuration of features, trees
– Arithmetic:
  » train to figure out the weight of each property

Properties employed in WSJ Experiment
~800 property-functions:
– c-structure nodes and subtrees
– recursively embedded phrases
– f-structure attributes (grammatical functions)
– atomic attribute-value pairs
– left/right branching
– (non)parallelism in coordination
– lexical elements (subcategorization frames)
Some end up with no discrimination power after training
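To make the log-linear idea concrete, here is a small sketch (not part of the talk) of how property counts and trained weights combine into parse probabilities, P(x) = exp(λ·f(x)) / Σ_{x'∈X} exp(λ·f(x')). The property names, counts, and weights below are invented for illustration.

```python
import math

# Sketch of log-linear (maximum entropy) disambiguation.
# Each candidate parse x is reduced to a vector of property counts f(x);
# trained weights lambda score it as lambda . f(x), and a softmax over the
# candidate set turns the scores into probabilities.
# Property names, counts, and weights are invented for illustration.

weights = {"obj_attach_pp": 1.3, "adjunct_pp": -0.4, "right_branching": 0.2}

candidates = {
    "see [the girl with the telescope]": {"obj_attach_pp": 1, "right_branching": 2},
    "see [the girl] [with the telescope]": {"adjunct_pp": 1, "right_branching": 1},
}

def score(features):
    """Unnormalized log score: lambda . f(x)."""
    return sum(weights.get(name, 0.0) * count for name, count in features.items())

# P(x) = exp(score(x)) / sum over all candidates x' of exp(score(x'))
scores = {parse: score(f) for parse, f in candidates.items()}
z = sum(math.exp(s) for s in scores.values())
for parse, s in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{math.exp(s) / z:.3f}  {parse}")
```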
Stochastic Disambiguation Summary
Training:
– Define a set of features by hand
– Train on partially labelled data
– Can train on low-ambiguity data
Use:
– Choose just one structure for applications that want just one
– XLE displays the most probable first
– 5% of parse time to disambiguate
– 30% gain in F-score

Talk Outline
Sources of ambiguity
Grammar engineering approaches
– Shallow markup
– (Dis)preference marks
Stochastic disambiguation
Efficiency in ambiguity management

Computational consequences of ambiguity
A serious problem for computational systems
– Broad-coverage, hand-written grammars frequently produce thousands of analyses, sometimes millions
– Machine-learned grammars easily produce hundreds of thousands of analyses if allowed to parse to completion
Three approaches to ambiguity management:
– Pruning: block unlikely analysis paths early
– Procrastination: do not expand analysis paths that will lead to ambiguity explosion until something else requires them
  » also known as underspecification
– Packing: compact representation and computation of all possible analyses

The Problem with Pruning: premature disambiguation
The conventional approach: use heuristics to prune as soon as possible
Strong constraints may reject the so-far-best (= only) option
Fast computation, wrong result
[Figure: pipeline (Tokenize, Morphology, Syntax, Semantics, Discourse, Statistics) with pruned paths marked X]

The problem with procrastination: passing the buck
Chunk parsing as an example:
– Collect noun groups, verb groups, PP groups
– Leave it to later processing to figure out the correct way of putting these together
– Not all combinations are grammatically acceptable
Later processing must either
– call the parser to check grammatical constraints,
– have its own model of grammatical constraints, or
– in the best case, solve a set of constraints the partial parser includes with its output

The Problem with Packing
There may be too many analyses to pack efficiently
A major problem for relatively unconstrained machine-induced grammars
– Grammars overgenerate massively
– Statistics are used to prune out unlikely sub-analyses
Less of a problem for carefully hand-coded broad-coverage grammars

Packing
The explosion of ambiguity results from a small number of sub-analyses combining in different ways to produce a large number of total analyses (e.g., PP attachment)
Compute and represent each sub-analysis just once
Compute a factored representation of how these sub-analyses combine

Generalizing Free Choice Packing
The sheep saw the fish. (How many sheep? How many fish?)
Options multiplied out:
– The sheep-sg saw the fish-sg.
– The sheep-pl saw the fish-sg.
– The sheep-sg saw the fish-pl.
– The sheep-pl saw the fish-pl.
Options packed:
– The sheep {sg|pl} saw the fish {sg|pl}
In principle, a verb might require agreement of Subject and Object: have to check it out.
But English doesn't do that: any combination of choices is OK.

Dependent choices
Das Mädchen {nom|acc} sah die Katze {nom|acc}
"The girl saw the cat"
Again, packing avoids duplication… but it's wrong: it doesn't encode all dependencies; the choices are not free.
– Das Mädchen-nom sah die Katze-nom --- bad
– Das Mädchen-nom sah die Katze-acc --- The girl saw the cat
– Das Mädchen-acc sah die Katze-nom --- The cat saw the girl
– Das Mädchen-acc sah die Katze-acc --- bad
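A tiny sketch (not from the talk) of the free-choice idea: represent each ambiguous feature as a packed set of alternatives plus a constraint over complete choice assignments. Independent choices multiply out, while a dependency like the German case pattern cuts the combinations down. All function and variable names here are hypothetical.

```python
from itertools import product

# Sketch: a packed analysis is a set of named choices plus a constraint over
# complete assignments. Free choices (constraint always true) multiply out;
# dependent choices are filtered by the constraint.

def readings(choices, constraint=lambda assignment: True):
    names = list(choices)
    for values in product(*(choices[n] for n in names)):
        assignment = dict(zip(names, values))
        if constraint(assignment):
            yield assignment

# English: sheep/fish number is a free choice -> 2 x 2 = 4 readings
english = {"sheep": ["sg", "pl"], "fish": ["sg", "pl"]}
print(len(list(readings(english))))            # 4

# German: exactly one of Mädchen/Katze may be nominative -> 2 readings
german = {"Mädchen": ["nom", "acc"], "Katze": ["nom", "acc"]}
one_nom = lambda a: (a["Mädchen"] == "nom") != (a["Katze"] == "nom")
for r in readings(german, one_nom):
    print(r)   # {'Mädchen': 'nom', 'Katze': 'acc'}, then {'Mädchen': 'acc', 'Katze': 'nom'}
```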
Solution: Label dependent choices
  Das Mädchen-nom sah die Katze-nom --- bad
  Das Mädchen-nom sah die Katze-acc --- The girl saw the cat
  Das Mädchen-acc sah die Katze-nom --- The cat saw the girl
  Das Mädchen-acc sah die Katze-acc --- bad
  Packed: Das Mädchen {p:nom | ¬p:acc} sah die Katze {q:nom | ¬q:acc}
  Acceptable combinations = (p ∧ ¬q) ∨ (¬p ∧ q)
• Label each choice with a distinct Boolean variable (p, q, etc.)
• Record the acceptable combinations as a Boolean expression
• Each analysis corresponds to a satisfying truth-value assignment (a line of the expression's truth table that assigns it "true")

The Free Choice Gamble
Worst case, where everything interacts:
– as many choice variables as there are readings
– packing blows up and becomes exponential
Best case, no interactions:
– N completely independent choices represent 2^N readings
Language interactions are mostly limited and local
– tends towards the best case
– free choice packing pays off for linguistic analysis

Conclusions
Ambiguity has to be dealt with
Deep grammars use a variety of approaches
– preprocessing
– grammar engineering
– stochastic disambiguation
Why use deep grammars if they are so ambiguous?

Deep analysis matters… if you care about the answer
Example: A delegation led by Vice President Philips, head of the chemical division, flew to Chicago a week after the incident.
Question: Who flew to Chicago?
Candidate answers:
– division --- closest noun --- shallow but wrong
– head --- next closest --- shallow but wrong
– V.P. Philips --- next --- shallow but wrong
– delegation --- furthest away, but Subject of flew --- deep and right

Applications of Language Engineering
[Figure: applications plotted by domain coverage (narrow to broad) against functionality (low to high); labels include keyword search (Alta Vista, Google), manually-tagged keyword search, post-search sifting (AskJeeves), document base management, knowledge filtering, knowledge fusion, good translation, useful summary, restricted dialogue (Microsoft Paperclip), natural dialogue, autonomous knowledge management]

What to do with them?
Define yes-no (1-0) features f that seem important
Training determines weights λ on these features to reflect their actual importance
To select a parse x: count occurrences of the features and multiply by the corresponding weights, giving λ·f(x)
Convert weighted feature counts to probabilities:
  P(x) = exp(λ·f(x)) / Σ_{x'∈X} exp(λ·f(x'))
  – exp(λ·f(x)): un-normalized probability
  – Σ_{x'∈X} exp(λ·f(x')): normalizing factor

Issues in Stochastic Disambiguation
What kind of probability model?
What kind of training data?
Efficiency of training, efficiency of disambiguation?
Benefit vs. a random choice of parse

Advantages of Free Choice Packing
Avoids procrastination
– Nogoods are constraints that the parser sends to other components
– Eliminating nogoods: other components don't do the parser's work
Independence between choices: allows processing that relies on independence assumptions
– counting the number of readings
  » apparently trivial, but of crucial importance, since statistical modelling requires the ability to count
– hence, statistical processing
A general mechanism extending beyond parsing

Simplifying Truth Tables
Das Mädchen {p:nom | ¬p:acc} sah die Katze {q:nom | ¬q:acc}
  = (p ∧ ¬q) ∨ (¬p ∧ q)

  p  q  | (p ∧ ¬q) ∨ (¬p ∧ q)
  1  1  |         0
  1  0  |         1
  0  1  |         1
  0  0  |         0

The expression is true exactly when q = ¬p, so q can be replaced by ¬p:
Das Mädchen {p:nom | ¬p:acc} sah die Katze {¬p:nom | p:acc}
  = (p ∨ ¬p) = true
Freely choose any line from the truth table
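A quick sketch (not from the talk) that checks the simplification by brute force: enumerate the truth table of (p ∧ ¬q) ∨ (¬p ∧ q) and confirm it holds exactly on the lines where q = ¬p, which is what licenses dropping the variable q.

```python
from itertools import product

# Enumerate the truth table of the packed-choice constraint
# (p and not q) or (not p and q) and check that it is equivalent to
# q == (not p), the relabelling used on the "Simplifying Truth Tables" slide.
for p, q in product([True, False], repeat=2):
    constraint = (p and not q) or (not p and q)
    simplified = (q == (not p))
    print(f"p={int(p)} q={int(q)}  constraint={int(constraint)}  q==¬p={int(simplified)}")
    assert constraint == simplified
```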