* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download PowerPoint Presentation - META-Net
Survey
Document related concepts
Semantic holism wikipedia , lookup
Latin syntax wikipedia , lookup
Context-free grammar wikipedia , lookup
Focus (linguistics) wikipedia , lookup
Antisymmetry wikipedia , lookup
Lexical analysis wikipedia , lookup
Musical syntax wikipedia , lookup
Construction grammar wikipedia , lookup
Determiner phrase wikipedia , lookup
Cognitive semantics wikipedia , lookup
Pipil grammar wikipedia , lookup
Integrational theory of language wikipedia , lookup
Transformational grammar wikipedia , lookup
Dependency grammar wikipedia , lookup
Distributed morphology wikipedia , lookup
Probabilistic context-free grammar wikipedia , lookup
Transcript
MultiWord Expressions in NLP Jan Odijk LOT Summerschool Utrecht, June 2004 Overview • • • • • • NLP MWEs MWEs in NLP MWE Types Treatment of MWEs in selected frameworks MWEs and the lexicon Overview • • • • • • NLP MWEs MWEs in NLP MWE Types Treatment of MWEs in selected frameworks MWEs and the lexicon Natural Language Processing • Automatic processing of natural language – Generation: Semantic Repr String – Analysis: String Semantic Representation • Example applications – – – – Machine Translation (MT) Information Retrieval (IR) Cross-language Information Retrieval (CLIR) Question-Answering Natural Language Processing • Based on Grammars – (Popular) frameworks • Feature structure based – Head-driven Phrase Structure Grammar (HPSG) – Lexical-Functional Grammar (LFG) • Tree-based – Tree-Adjoining Grammar (TAG) – M-Grammar • Based on grammar components or dedicated modules – – – – – – Decompounding PoS-tagging Chunking Named Entity Recognition Name/Address grammars Date / Amount grammars Natural Language Processing • Based on Statistics – No explicit grammar – Statistics • Derived from (annotated) training corpus • Tested with test corpus • Applied to new corpora • Combinations of grammar and statistics NLP Grammar • Defines <form, meaning> pairs and structural descriptions at various levels • Components – – – – Semantics Syntax Morphology Orthography (Phonology) NLP Grammar • Semantics – Defines the meaning of an utterance – usually synchronized with syntax (compositionality) • HPSG: CONTENTS attribute • M-Grammar: in-tandem build up • Synchronous TAG: in-tandem build-up with derivation trees • LFG: in tandem with f-structure NLP Grammar • Syntax – – – – Defines the syntactic structure of an utterance Object types: Trees, DAGs Features: attribute-value pairs Value: atomic or structured NLP Grammar • Syntax – Often surface syntax and deep syntax (not necessarily on a separate level) • HPSG: surface tree v. DAG • M-Grammar: surface trees v. derivation trees • LFG: c-structure v. f-structure • TAG: derived tree v. derivation tree • Alpino: surface tree v. dependency tree NLP Grammar • Morphology – Relates (word structure, string) – Word-internal structure build-up usually in the syntactic component – Usually a rule system (intensional definition) – Simple Inflection: sometimes list of triples <base form, morph prop, word form> (extensional definition) NLP Grammar • Orthography – Relates ([String], String) • [he, said, :, “, come, in, !, “] • He said: “come in!” – Usually trivial in generation – Easy in analysis (tokenization) for many languages – Sometimes split (erop, opgebeld) – Very problematic for Chinese, Japanese, etc. Overview • • • • • • NLP MWEs MWEs in NLP MWE Types Treatment of MWEs in selected frameworks MWEs and the lexicon What are MWEs? What are MWEs? • sequence of words that has lexical, orthographic, phonological, morphological, syntactic, semantic, pragmatic or translational properties not predictable from the individual components or their normal mode of combination What are MWEs? • sequence of – Not necessarily contiguous in a concrete utterance • ...omdat hij de plaat wilde poetsen – Not necessarily always in the same order in each utterance • Hij poetste gisteren de plaat • words – Ambiguity between type and token (intentional) – Inflected word form v. lemma – Ambiguity between • Character sequences separated from other character sequences by spaces and other separators (Narrow interpretation) • Abstract lexical units of the grammar (Broad interpretation) What are MWEs? • that has properties not predictable from the individual components and their normal mode of combination What are MWEs? • Lexical – – – – – – – De plaat poetsen Een poging wagen / doen / *maken Dat varkentje eens wassen Zware / *sterke shag Scherpe kritiek Perdre la tête/ la boule / *la cervelle Se creuser la tête / * la boule / la cervelle What are MWEs? • Orthographic – – – – – – – – viz. Bijv. www.uilots.nl i.v.m. Yahoo! Groen! Aujourd’hui (v. l’homme) ‘s (avonds/morgens/middags) What are MWEs? • phonological, – – – – – – – – Over de rooie/*rode (gaan/zijn/raken) om de dooie/*dode donder niet op zijn dooie akkertje/gemak op zijn dooie eentje De kwaaie/*kwade Piet toegespeeld krijgen Je niet in de kouwe/*koude kleren gaan zitten Een gouwe ouwe (but geen rode/rooie cent/duit (hebben)) What are MWEs? • morphological, – – – – – – – Ten gevolge van Ter wereld Van goeden huize Zonder aanzien des persoons Het lood*(je) leggen Dat varken*(tje) wassen De *raap is / rapen zijn gaar What are MWEs? • Syntactic – – – – – Ten gevolge van In opdracht van (no article) Iemand een oor aannaaien Rekening houden met (obligatorily indefinite) Het bijvoeglijk(*e) naamwoord (v. een groot/grote man) What are MWEs? • Semantic – – – – De plaat poetsen Dat varkentje wassen Een bok schieten Een flater slaan What are MWEs? • Pragmatic – – – – Ladies and Gentlemen Ik heb gezegd. Eet smakelijk! (Bon appétit!, Enjoy!) Sincerely yours What are MWEs? • Translational properties – Laten zien (F. montrer, E. show) – Witte wijn (P. vinho verde) – Nuclear power plant (D. atoomcentrale, G. Kernkraftwerk) – Space probe (F. sonde spatiale) – Iemand iets laten weten • inform someone of something Overview • • • • • • NLP MWEs MWEs in NLP MWE Types Treatment of MWEs in selected frameworks MWEs and the lexicon MWEs in NLP • MWEs occur very often in natural language – Esp. in languages with little compounding • Especially in specialized domains – Multi-word terminology MWEs in NLP • MT – Improves parsing and translation of the MWEs – Also improves parsing hence translation of the sentence containing the MWEs (Nivre & Nilsson LREC 2004) • CLIR – Nuclear power plant • Kern- macht plant • Kern- Macht Pflanz • v. atoomcentrale / Kernkraftwerk MWEs in NLP • Problems MWEs pose for NLP – How are MWEs to be dealt with in the grammar of an NLP system? – What lexical representation of MWEs is required for this? – How can we obtain lexicons containing MWEs with such lexical representations Overview • • • • • • NLP MWEs MWEs in NLP MWE Types Treatment of MWEs in selected frameworks MWEs and the lexicon Types of MWEs (I) • Fixed • Semi-flexible • Flexible Fixed MWEs • Fixed MWEs – Words of the MWE in a fixed order – No variation in lexical item choice – Always contiguous (no other elements in between) – No inflectional processes except at the edges Fixed MWEs • Fixed MWEs – ad hoc, stante pede, ter plaatse – Hong Kong, Kuala Lumpur, New York, San Francisco – credit card, travel agency, real estate agency • NOT – in plaats van (cf. in plaats daarvan) (‘instead of’) – carta telefonica (cf. carte telefoniche) – de plaat poetsen (‘polish the plate’, ‘bolt’) Semi-Flexible MWEs • Semi-Flexible MWEs – MWEs with fixed order of elements – That are impenetrable for other words – Parts can be inflected Semi-Flexible MWEs • Examples: – Chambre des représentants • House of representatives – Patatas fritas • French fries – Mise au point automatique • Autofocus – Calculateur analogique • Analogue computer Semi-Flexible MWEs • Examples: – Cité plus haut • Above-stated – Résistant aux acides • Acid-proof – Malade en altitude • Airsick Flexible MWEs • Flexible MWEs • • • • Allow or require inflection in multiple parts, and Allow permutations of subphrases, or Allow intrusion by other phrases, or Have controlled variation (bound pronouns) Flexible MWEs – de plaat poetsen (‘bolt’) • Hij heeft gisteren de plaat gepoetst • …omdat hij de plaat wilde poetsen • Hij poetste gisteren de plaat – to lose one’s temper • He lost his temper • She lost her temper Treatment • Fixed MWEs – No inflection: Relate single string to sequence of strings (in Orthography) • ([ad_hoc] , [ad, hoc]) • Lexical entry for ad_hoc – With inflection: Relate single stem to sequence of stems in Morphology • ([real, estate, agency, Plur] -> [real_estate_agency, Plur]) • Lexical entry for real_estate_agency Treatment • Semi-flexible MWEs – Require local syntax – Chunking may be enough Treatment • Flexible MWEs – Require sophisticated syntax Types of MWEs (II) • Verb –particle combinations (English, German, Dutch, Hungarian) – Ik sloeg hem over – I looked the passage up Types of MWEs (II) • Verb + prepositional complement – I looked after her – Hij heeft altijd van haar gehouden Types of MWEs (II) • Circumpositions (Dutch, German) – Op iemand af / ?toe / *heen – Auf jemanden *ab / zu – Over de brug heen / *af / *toe Types of MWEs (II) • Lexical item (from open or closed class) • + closed class lexical item – Finite (actually small) list • Limited variety of predictable syntactic structures • Dealt with by almost any grammar-based NLP system Types of MWEs (II) • Multiword Names – Examples • • • • Fifth Avenue Koning Leopold III-laan Krimpen aan de IJssel Koninklijke Nederlandse Philips N.V. Types of MWEs (II) • Multiword Names – Issues • Keys – variation – (Koning) Leopold III-laan – Fifth (Avenue) – ((Calle) Roberto) González – Many different ones, continuously new ones – Very important for correct parsing and translation • Minister Kohl Minister Cabbage Types of MWEs(II) • Compounds (in English) – Examples • • • • • Real estate agency Nuclear power plant Blue cheese Private eye High school Types of MWEs(II) • Idioms – No or unpredictable meaning of the components – Fixed (or very limited ) lexical item selection – Opaque • Kick the bucket • De plaat poetsen • Casser sa pipe Types of MWEs(II) • Idioms – Semi-transparant • `een bok schieten’ – Bok (male goat) = blunder – Schieten (shoot) = make • `dat varkentje wassen’ – Varkentje (little pig) = problem – Wassen (wash) = address, take care of Types of MWEs(II) • Idioms – Usually completely normal syntactic structure – Both a literal and an idiomatic reading – Participate normally in many grammatical processes – BUT: often restrictions on participating in grammatical processes Types of MWEs(II) • Idioms (opaque) – Normal participation • Hij poetste de plaat (V2) • Poetste hij de plaat? (V1, question formation) • ...omdat hij de plaat wilde poetsen (VR) – But not • • • • #De plaat heeft hij niet gepoetst (topicalization) #Welke plaat heeft hij gepoetst? (wh-Q) #Hij heeft de mooie plaat gepoetst (internal modification) #Hij heeft de plaat waarschijnlijk niet gepoetst (Mid-field NP-Adv permutation) • #De plaat die hij gepoetst heeft (Relativization) • #De plaat werd door hem gepoetst (Passive) • #Wat een plaat (independent occurrence)) Types of MWEs(II) • Idioms (semi-transparant) – Normal participation • Hij schoot een bok (V2) • Schoot hij een bok? (V1, question formation) • ...omdat hij een bok zou schieten (VR) – Also • • • • Een bok heeft hij niet geschoten (topicalization) Wat voor een bok heeft hij nu weer geschoten? (wh-Q) Hij heeft een enorme bok geschoten (internal modification) Hij heeft die bok waarschijnlijk geschoten omdat ... (Mid-field NP-Adv permutation) • De bok die hij geschoten heeft (Relativization) • Er werd door hem een enorme bok geschoten (Passive) – But not: • Wat een bok! (independent occurrence) Types of MWEs(II) • Idioms – In some cases irregular syntactic structure • Ten gevolge van (fossilized portmanteau words, noun e-form) • Het bijvoeglijk naamwoord (no –e) • Iemand de oren wassen (inalienable possession construction) – Regular syntactic structure but not predictable from the components’ properties • Iemand welkom heten – (`heten’ on its own can only take a pred. complement) – *Hij heet hem aardig / president / Jan • to lose face (count noun in determinerless NP) Types of MWEs(II) • Idioms – Cranberry words • Ergens de brui aan geven Types of MWEs(II) • Semi-idioms (collocations) – One element occurs in its normal meaning – The lexical selection of the other element is fixed or very limited – The other element has a special meaning – Examples • Zware tabak (heavy tobacco) `strong tobacco’ • Scherpe kritiek (sharp criticism) `severe criticism’ • Heavy smoker Types of MWEs(II) • Support verb constructions – Type I • • • • Een poging wagen Een lezing houden / geven To pay attention to (aandacht schenken aan) To take advantage of Types of MWEs (II) • Arguments of the noun outside the NP – – – – – De kritiek die we hadden op hem ?de kritiek op hem die we hadden *De kritiek die we naar voren brachten op hem De kritiek op hem die we naar voren brachten De kritiek op hem verstomde Types of MWEs (II) • Arguments of the noun outside the NP – De aandacht die we schonken aan hem – *de aandacht aan hem die we schonken – *De aandacht die we becommentarieerden aan hem – De aandacht aan hem die we becommentarieerden – De aandacht *aan / voor hem verflauwde Types of MWEs (II) • Arguments of the noun outside the NP – – – – – The attention that we paid to this subject *The attention to this subject that we paid *The attention that we criticised to this subject The attention to this subject that we criticized The attention to this subject Types of MWEs (II) • Arguments of the noun outside the NP – – – – No attention was paid to this subject This subject was paid no attention to Advantage was taken of this proposition This proposition was taken advantage of Types of MWEs (II) – Type II • Iemand een stomp geven • Iemand een klap geven • To give someone a kiss – Noun signifying bodily touch – `give’ + indirect object as Patient Types of MWEs (II) – Type III(?) ..\Utrecht\MWEs\copulasetc\selectzijn edited Current.xls • In de war zijn / raken / * gaan / ?komen / brengen • In zijn nopjes zijn / raken / * komen / *brengen • De pijp uit zijn / ?raken / gaan/ *komen / *brengen Types of MWEs (II) • Quasi-idioms – Characteristics: Regular meaning + something extra (specialization) – Examples • Huisdeur (`front door’) • Bijvoeglijk naamwoord • Fried eggs – Terms – Compounds Overview • • • • • NLP MWEs MWEs in NLP MWE Types Treatment of MWEs in selected frameworks • MWEs and the lexicon Requirements • Account for the fact that MWEs usually have `normal’ syntactic structures • Account for the normal participation of MWEs in most syntactic processes • Account for the restrictions on the participation in some syntactic processes • Recognize it as an MWE and assign it the associated semantics Tree-Adjoining Grammar (TAG) • • • • • • Originally developed by Aravind Joshi Extended by Shieber, Schabes Applied to French by Abeillé Basic object: trees Enriched with features (incl. unification) Known parsing algorithm and complexity properties • Defines mildly context-sensitive languages TAG • Lexicalized TAG (LTAG) • Trees: – Elementary Trees • Associated with a lexical item • Initial – Leaves labeled by » Terminals » Substitution nodes TAG • Initial Trees (examples) – – – – N[Jean] S[N0↓ V[dormir]] S[N0↓ V[aimer] N1↓] S[N0↓ V[ressembler] PP[P[à] N1↓] • Used for words (verbs) taking nominal, adjectival, prepositional arguments TAG • Auxiliary Trees – One leaf node (foot node) same category as root node – Used for modifiers, auxiliary verbs, raising verbs, verbs taking sentential arguments – Examples • • • • N[A[beau] N*] N[Det[le] N*] S[N0↓ V[penser] S1’[C[que] S1*]] V[V[sembler] V*] TAG • Constraints on Elementary Trees – Lexicalization – Predicate-Argument co-occurrence – Semantic Consistency TAG • Operations – Substitution – Adjunction • NO – Deletion – Movement – Permutation TAG • Operations – Substitution • Substitutes a tree at a leaf node marked for substitution (↓) • Example – S[N0↓ V[dormir]] + N[Jean] – S[N[Jean] V[dormir]] TAG • Operations – Adjunction • Inserts an auxiliary tree (or a tree derived from an auxiliary tree) at any node (with the same label) • If this node dominates a subtree, this subtree is copied under the foot node of the auxiliary tree – Example • S[N[Jean] V[dormir]] + V[semble V*] • S[N[Jean] V[semble V[dormir]]] TAG • Derived tree – Tree created by substitution or adjunction – Encodes word order, inflection, morphosyntactic features • Derivation Tree – α-dormir[ 1/α’-Jean 2/β3-semble] – Close to dependency trees – Basis for semantic interpretation TAG • Lexical rules – Elementary tree elementary tree – For passive, wh-questions, cleft-constructions, relatve clauses, cliticization – Define the lexical item’s tree family • Examples of derived elementary trees – Passive: S[N1↓ V[être] V[aimé] PP[P[par] N0↓] – Object-cleft S[V[CI[ce] V[être] N1↓ S’[C[que] S[N0↓ V[aimer]]]] – Object-Rel.: N[N1* S’[C[que] S[N 0↓V[aimer]]]] TAG • Synchronous TAG • Each elementary tree is associated with a semantic tree • Links between nodes from the elementary tree and the semantic tree TAG • N-1[Jean] – T-1[jean’] • S-2[N0↓-1 V[dormir]] – F-2[R[dormir’] T1↓-1] • S-1[N0↓-2 V[aimer] N1↓-3] – F-1[R[aimer] T0↓-2 T1↓-3] • S-1[N0↓-2 V[ressembler] PP[P[à] N1↓-3] – F-1[R[ressembler’] T0↓-2 T1↓-3] TAG • N-1[A[beau] N*] – F-1[R[beau’] T0*] • N[Det[le] N*] – F-1[R[le’] T0*] • S-1[N0↓-2 V[penser] S1’[C[que] S1*]] – F-1[R[penser’] T0↓-2 T1*] • V-1[V[sembler] V*] – F-1[R[sembler’] T0*] TAG • Given a pair, select a link (nondeterministically) with roots A and B • Select another pair with roots A and B • Combine the syntactic trees (at node A) and the semantic trees (at node B), by adjunction or substitution, and remove the link between A and B • Do this recursively TAG • <S[N0↓-1 V-2[dormir]], F-2[R[dormir’] T1↓-1]>, select link 1 • + <N-1[Jean], T-1[jean’]> • <S[N[Jean] V-2[dormir]], F-2[ R[dormir’] T[jean’]] • !+ < V-1[V[sembler] V*], F-1[R[sembler’] T0*]> • <S[N[Jean] V[V[sembler] V-2[dormir]]], F[R[sembler] F-2[ R[dormir’] T[jean’]]] TAG • Idiomatic Expressions in LTAG – Each idiomatic expression represented by an elementary tree • Examples – S[N0↓ V[briser] N1[D[la] N[glace]] – S[N0↓ V[prendre] N1↓ PP[P[en] N[compte]]] – S[N0[D[des] N[ailes]] V[pousser] PP1[P[à] N1↓]] TAG • • • • Associated with a semantic tree F[R[briser-la-glace’] T0↓] F[R[prendre-en-compte’] T0↓ T1↓] F[R[des-ailes-pousser- à’] T0↓] • With the appropriate links TAG • Lexical rules – Apply to idiom elementary trees normally – But can be individually constrained • Links between parts of an idiomatic elementary tree and semantic trees are allowed: – internal syntactic modification can correspond to extenal semantic modification TAG • MWEs are normal elementary trees • Lexical rules apply to idiom elementary trees as usual • Lexical rules can be restricted to apply to certain elementary trees • Recognition and semantics: same as with single word elementary trees TAG • But – No restrictions on elementary trees – Idiomatic elementary trees can deviate – Elementary trees are complex (esp. features): difficult to maintain – Restrictions on grammatical processes basically stipulated Overview • • • • • NLP MWEs MWEs in NLP MWE Types Treatment of MWEs in selected frameworks • MWEs and the lexicon M-Grammar • Developed by Jan Landsbergen • Inspired by Montague grammar • Compositional Grammars – The meaning of an expression is a function of the meaning of its parts and the way they are combined • Use traditional syntactic surface trees (but with relations) M-Grammar • Used for Machine Translation • Research Prototype MT System – Dutch, English, Spanish • Developed at Philips Research Labs – Rosetta project – Rosetta3 System • Compositional Translation Method M-Grammar • Compositionality of Meaning – The grammars are organised in such a way that the meaning of an expression is a function of the meaning of its parts and the way they are combined. • Implemented by Compositional Grammars – Basic Expressions (BE), with a meaning – Rules, with a meaning (recursively applicable) M-Grammar • Basic Expressions – With basic meaning • M-Rules – With meaning operation • Basic object: S-trees • S-tree = N[r1/T1,...rn/Tn] – N: node = CAT{a-v pairs} – Ti: S-trees – Ri: grammatical relation (subject, object, head,...) • S-tree of a basic expression is basic S-tree M-Grammar • S-tree of a full utterance is created by applying Mrules to S-trees, initially basic S-trees • Derivation history is recorded in syntactic derivation tree (syntactic D-tree) • M-rules – Powerful rules – Structure creation, deletion, permutations, movements, insertions M-Grammar • M-Rules – Relate [T1,..,Tn] to T – Reversible • Analytic and generative versions can be derived automatically – Measure condition • Each Ti in [T1,...,Tn] must be `smaller’ (according to some measure) than T – Reversibility and Measure Condition guarantee effectiveness of parsing M-Grammar • Syntactic D-trees contain names of basic expressions and names of rules • Can be mapped into (isomorphic) Semantic D-trees, containing names of basic meanings and names of meaning operations M-Grammar • Principle of Compositionality of Translation – Two expressions are each other's translation if they are built up from parts which are each other's translation, by means of rules which are each other’s translation. • Implemented by tuning Compositional Grammars G1 and G2 – For each BE in G1 at least one BE in G2 that is translationally equivalent – For each rule in G1 at least one rule in G2 that is translationally equivalent M-Grammar • Interlingual System • Interlingua obtained as a side-effect of tuning compositional grammars (no independent interlingua) • Interlingua expresses translational equivalence – B1 and B1’ are translations of each other – Not necessarily the meaning of B1/B1’ M-Grammar • G1: – BEs: N-boek (M-book:book’), A-interessant (M-interesting: interesting’) – Mrule R1: meaning name: IndefPlMod • syntax – <[N, A], NP[mod/A{e}, head/N{pl}]> • Semantics: [[A]](x) & [[N]](x) • G2 – BEs: N-livre (M-book: book’), A-intéressant( M-interesting: interesting’) – Mrule R2 meaning name: IndefPlMod • Syntax – <[N,A], NP[det/D-des, head/N’[head/N{pl, m}, mod/A{pl, m}]] • Semantics – [[A]](x) & [[N]](x) M-Grammar • NP[mod/A{e}-interessant head/N{pl}-boek] – interessante boeken • Syn D-Tree: R1 [boek interessant • NP[det/D{pl}-des head/N’[head/N{pl,m}-livre], mod/A{pl,m}-interéssant] – des livres interéssants • Syn D-Tree: R2 [livre interessant] • Sem D-Tree IndefPlMod[M-book M-interesting] • Semantics: book’(x) & interesting’(x) – (ignoring plurality, indefiniteness) M-Grammar • Full System • • • • • • • • • Sem D-Trees (IL) A-TRANSFER G-TRANSFER Syn D-Trees Syn D-Trees M-PARSER M-GENERATOR S-Trees S-Trees S-PARSER LEAVES [Lexical S-Trees] [Lexical S-Trees] A-MORPH G-MORPH String String M-Grammar:MWEs • Method for dealing with idioms • Each idiom – is a basic expression – Associated with a complex syntactic structure – complex basic expressions • Syntactic Structure of an idiom: – Canonical (abstracting from syntactic operations: passive, topicalization, verb movements, wh-questioning, ... – Represented as a D-tree • • • • Much more stable part of the grammar Much simpler than S-trees (no features etc) Makes it easier to deal with bound variation Better guarantee of correctness of structures – In the lexicon a identifier to a D-tree (idiom pattern) M-Grammar:MWEs • Start rules: – Combine a BE with its arguments (variables) – If the BE is complex structure created by applying the rules in the associated D-Tree • Resulting Structure runs through all the normal rules of grammar • In analysis: incoming structure is analyzed using the rules in the D-tree M-Grammar:MWEs • In analysis: – Guide the analysis by the idiom’s D-tree – If successful, and the arguments are also correctly analyzed: extend the sentence’s D-tree with the start rule dominating a BE (for the idiom) and the D-trees for the arguments M-Grammar:MWEs • De pijp uit gaan • D-tree for vpid30 (simplified): Rsubst,i [RVP [$aV_00_ga, VAR_j RPPpost [$s_prep1286700, VAR_i ] ], RNPdef [$aV_00_pijp] ] M-Grammar:MWEs • Start Rule 1: BE(verb) + VAR • S-Tree created in case of an idiom by applying the D-Tree for the idiom • S[subj/VAR_j, head/VP[compl/PP [obj/NP[det/D-de head/N-pijp ] head/P-uit ] head/V-ga ] ] M-Grammar:MWEs • Normal participation in grammatical processes: structures for idioms are normal • Restrictions: – Rules that affect meaningful elements cannot apply to meaningless parts of idioms – relativization, topicalization, wh-questioning, NPAdv order switch, exclamative formation, adjectival modification, independent occurrence, ... – (parts of ) Rules that are purely syntactic not restricted – V2, VR, V1 (if question formation and V1 are separated) M-Grammar:MWEs • Deviant syntax – Portmanteau words (ten, ter) – E-form of nouns – Inalienable possession construction • Minor Rules: rules that can only occur in an idiom D-Tree M-Grammar:MWEs • Idioms have normal syntactic structures generated by applying the normal M-rules following the associated idiom D-Tree • Other M-Rules apply to these structures in the normal way • Restrictions on applications accounted for because M-Rules applying to meaningful elements cannot apply to meaningless parts of idioms • Deviant syntactic structure can be accounted for by Minor Rules • The idiom D-trees can be derived using the system itself (putting minor rules on in the grammar) • Recognition of idioms: by guiding the analysis along the idiom D-tree • Semantics: each idiom is (complex) basic expression with associated basic meaning M-Grammars and TAG • Neither has an adequate treatment of semiidioms (yet) • Neither has an adequate treatment of transparent idioms (yet) • Idiom representation as D-trees can also be done in TAG (Abeillé) Overview • • • • • • NLP MWEs MWEs in NLP MWE Types Treatment of MWEs in selected frameworks MWEs and the lexicon Lexical representation • How do we obtain lists of MWEs? • How do we represent them lexically • How can we improve exchangeability of MWE lexical representations? MWE Acquisition • Lists from existing dictionaries – Always incomplete – Especially for specific domains – Do not necessarily reflect actual usage • Semi-automatic acquisition from text corpora is called for – Especially for rapid tuning NLP system to specific domain, company, organization MWE Acquisition • Use statistical properties of MWEs to acquire them (semi-)automatically • Examples: – Mutual Information: log Pr(xy)/(Pr(x)Pr(y)) – Salience: Pr(xy) * MI – Dice, chi-square, log-likelyhood, em metric, ... MWE Acquisition • Mutual Information: – Pr(z) estimated by F(z)/T – Pr(xy)/Pr(x)Pr(y) = (F(xy)*T)/F(x)F(y) – Experiment: (just for adjacent 2-word MWEs) • Salience – Compensates for favoring low frequency items – Experiment: (just for adjacent 2-word MWEs) MWE Acquisition • Extend to 3-word MWEs, etc. • Combine with NLP system/components – PoS-tag to obtain syntactically meaningful combinations (yes: A N, no: Adv N) – Parsing and compute statistics on D-trees • Allows acquisition of discontinuous MWEs • Very lively and active research area Lexical Representation of Idioms • Many grammatical treatments of idioms require – Whatever is needed for single word lexical items – Syntactic structure – Unique references to lexical items • Highly framework / theory / implementation-specific Lexical Representation • M-Grammar system-specific matters: – D-trees, compatible with the specific implementation – Unique references to items in Rosettalexicon – Order of the items (`gaan uit pijp’) – Presence of items (articles absent) Lexical Representation • LTAG system-specific matters: – Syntactic trees, compatible with the specific implementation – Unique references to items in system’s lexicon – Order of the items (= canonical surface order) – Presence of items (all present) SEQCI • Lexical Representation – Maximally theory-neutral • Incorporation Method – Generic – Maximally reuses existing NLP system • Core Idea: – Describes which idioms have the same structure – Structural Equivalence Classes for Idioms SEQCI • Lexical Representation= – Idiom Descriptions – Idiom Pattern Descriptions • Idiom description= – Idiom pattern (identifier) • Identifier for idiom structures • Used to define the equivalence classes – Idiom Component List (ICL), with base forms • All • Any order, but the same within one equivalence class – Example sentence containing the idiom • Same syntactic structure for each idiom of the same equivalence class SEQCI • Idiom pattern description • Idiom pattern identifier • Comments (free text) SEQCI • Example: – Idiom Descriptions • Idp30;De pijp uit gaan;Hij is de pijp uit gegaan • Idp30;De boot in gaan;Hij is de boot in gegaan • Idp30:Het schip in gaan;Hij is het schip in gegaan – Idiom pattern definition • Idp30 • Idiom headed by a verb taking a postpositional PP containing a definite singular NP and one free argument as subject SEQCI • Incorporation Method – Manual part, once for each idiom pattern – Automatic Part, for each idiom description SEQCI • Manual part (`hij is de pijp uit gegaan’) 1. Parse the example sentence of an idiom description with idiom pattern P, yielding the Reference Parse 2. Define a transformation to turn the reference parse into the idiom structure ( Parse Transformation, PT) 3. Determine the list of unique IDs of the lexical items in the idiom structure for the system derived from the reference parse (Idiom Component ID List, ICIL) 4. Define a transformation to relate ICL and ICIL (Idiom Component Transformation, ICT) 5. Apply the ICT to the ICL, yielding the transformed ICL (TICL) and check that each item in it equals the base form of the corresponding element on the ICIL SEQCI Automatic part, for each idiom description I (`hij is de boot in gegaan’) 1. Parse example sentence (Syntactic Structure) 2. Apply IPT and check identity with idiom structure modulo the lexical items 3. Select the component IDs from the parse tree, in order to obtain the ICIL) 4. Apply ICT to the ICL of I, yielding the TICL 5. Check that <bf(c1),…bf(cn)>=TICL where ICIL = <c1, …cn> ( TICL check) SEQCI • Advantages – – – – Technically Simple As theory/grammar/implementationindependent as possible No need for prescribing syntactic structures System-specific aspects are derived from the NLP-system itself SEQCI • Will it work? – – If there are not too many different idiom patterns, and Sufficient number of instantiations per idiom pattern SEQCI • Various improvements possible – Parameterization • Over local morphosyntactic differences (sg/pl; pos/dim; pos/comp/sup; ...) – Abstraction • Especially for large fixed parts – Weten waar Abraham de mosterd haalt – Use of underspecified syntactic structures • Optional, of any kind – Guidelines for selection of example sentences SEQCI SAID-Database Dutch Minidatabase Coverage #idioms #patterns#idioms #patterns 50% 7383 28 449 21 60% 8853 54 539 36 70% 10304 140 628 59 80% 11773 481 716 98 85% 12509 908 760 134 90% 13245 1644 804 178 95% 13981 2380 849 223 100% 14716 3116 893 267 Conclusions • SEQCI: – Technically simple – Highly theory, grammar, implementation independent • • • • • Can reduce/share lexicon development efforts significantly Candidate for a standard lexical representation for idioms Extension to other types of MWEs looks promising Initial experiments started More testing required, with more NLP systems – I can provide test data – Try it out in your own NLP system! SEQCI • Possible Problems/Objections? – Manual Part • • • – Pattern sentence does not yield a parse Pattern sentence yields multiple parses Pattern corresponds to 2 or more different structures in my system Automatic Part • • • Sentence does not yield a parse Sentence yields multiple parses Multiple skeys result for the same base form SEQCI: Reference Parse Rdecl[Rperf [Rsubst(j) [Rsent [Rsubst(i) [RVP[$aV_00_ga, RPPpost [$s_prep1286700, VAR_i ] ] RNPdef [$aN_00_pijp] ], VAR_j ], RNP[$hij_PRON] ] ] SEQCI: Idiom Structure • IPT: IPT: Delete Rdecl, Rperf, Rsubj(j), RNP[$hij_Pron] • D-tree for vpid30 (simplified): Rsubst,i [RVP [$aV_00_ga, RPPpost [$s_prep1286700, VAR_i ] ], RNPdef [$aN_00_pijp] ] ICIL < $aV_00_ga, $prep1286700, $aN_00_pijp > SEQCI: Illustration Manual Part, applied to `de pijp uitgaan’ 1. 2. 3. 4. Reference Parse: See D-tree next slide IPT: Delete Rdecl, Rperf, Rsubj(j), RNP[$hij_Pron] ICIL: < $aV_00_ga, $aN_00_pijp, $prep1286700> ICT: 1 2 3 4 => 4 3 2 5. TICL = ICT(<de, pijp, uit, gaan>) = <gaan, pijp, uit> = < Bf($aV_00_ga), Bf($aN_00_pijp), Bf($prep1286700) > ICT ICL: Must be turned into: <de, pijp, uit, gaan> < gaan, uit, pijp> ICT : 1 2 3 4 => 4 3 2 TICL TICL = ICT(ICL) = ICT(<de, pijp, uit, gaan>) = <gaan, uit, pijp> = < Bf($aV_00_ga), Bf($prep1286700), Bf($aN_00_pijp) > Syntactic Structure Rdecl[Rperf [Rsubst(j) [Rsent [Rsubst(i) [RVP[$aV_00_ga, RPPpost [$s_prep1286800, VAR_i ] ], RNPdef [$aN_00_boot] ], VAR_j ], RNP[$hij_PRON] ] ] Apply IPT Rsubst,i [RVP [$aV_00_ga, RPPpost [$s_prep1286800, VAR_i ] ], RNPdef [$aN_00_boot] ] ICIL ICIL=< $aV_00_ga , $s_prep1286800, $aN_00_boot>) TICL ICT(ICL) = ICT(<de, boot, in, gaan>)= <gaan, in, boot> TICL check <bf($aV_00_ga), bf($s_prep1286800), bf($aN_00_boot) > = TICL = <gaan, in, boot>