MultiWord Expressions in NLP
Jan Odijk
LOT Summerschool
Utrecht, June 2004
Overview
• NLP
• MWEs
• MWEs in NLP
• MWE Types
• Treatment of MWEs in selected frameworks
• MWEs and the lexicon
Natural Language Processing
• Automatic processing of natural language
– Generation: Semantic Representation → String
– Analysis: String → Semantic Representation
• Example applications
– Machine Translation (MT)
– Information Retrieval (IR)
– Cross-language Information Retrieval (CLIR)
– Question-Answering
Natural Language Processing
• Based on Grammars
– (Popular) frameworks
• Feature structure based
– Head-driven Phrase Structure Grammar (HPSG)
– Lexical-Functional Grammar (LFG)
• Tree-based
– Tree-Adjoining Grammar (TAG)
– M-Grammar
• Based on grammar components or dedicated modules
– Decompounding
– PoS-tagging
– Chunking
– Named Entity Recognition
– Name/Address grammars
– Date / Amount grammars
Natural Language Processing
• Based on Statistics
– No explicit grammar
– Statistics
• Derived from (annotated) training corpus
• Tested with test corpus
• Applied to new corpora
• Combinations of grammar and statistics
NLP Grammar
• Defines <form, meaning> pairs and
structural descriptions at various levels
• Components
– Semantics
– Syntax
– Morphology
– Orthography (Phonology)
NLP Grammar
• Semantics
– Defines the meaning of an utterance
– usually synchronized with syntax
(compositionality)
• HPSG: CONTENTS attribute
• M-Grammar: in-tandem build up
• Synchronous TAG: in-tandem build-up with
derivation trees
• LFG: in tandem with f-structure
NLP Grammar
• Syntax
– Defines the syntactic structure of an utterance
– Object types: Trees, DAGs
– Features: attribute-value pairs
– Value: atomic or structured
NLP Grammar
• Syntax
– Often surface syntax and deep syntax
(not necessarily on a separate level)
• HPSG: surface tree v. DAG
• M-Grammar: surface trees v. derivation trees
• LFG: c-structure v. f-structure
• TAG: derived tree v. derivation tree
• Alpino: surface tree v. dependency tree
NLP Grammar
• Morphology
– Relates (word structure, string)
– Word-internal structure build-up usually in the
syntactic component
– Usually a rule system (intensional definition)
– Simple Inflection: sometimes list of triples
<base form, morph prop, word form>
(extensional definition)
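The extensional definition of simple inflection can be illustrated with a minimal Python sketch. The entries, the property labels (`sg`, `pl`) and the function names `analyse` and `generate` are illustrative assumptions, not taken from any of the systems discussed here.

```python
# Extensional morphology: a finite list of
# <base form, morphological properties, word form> triples.
# The entries and the property labels are illustrative only.
TRIPLES = [
    ("boek", "sg", "boek"),
    ("boek", "pl", "boeken"),
    ("agency", "sg", "agency"),
    ("agency", "pl", "agencies"),
]


def analyse(word_form):
    """Analysis direction: word form -> all (base form, properties) pairs."""
    return [(base, props) for base, props, form in TRIPLES if form == word_form]


def generate(base, props):
    """Generation direction: (base form, properties) -> matching word forms."""
    return [form for b, p, form in TRIPLES if b == base and p == props]


print(analyse("boeken"))
print(generate("agency", "pl"))
```

The same list serves both directions, which is what makes the extensional definition attractive for small, closed inflection paradigms.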
NLP Grammar
• Orthography
– Relates ([String], String)
• [he, said, :, “, come, in, !, ”]
• He said: “come in!”
– Usually trivial in generation
– Easy in analysis (tokenization) for many
languages
– Sometimes split (erop, opgebeld)
– Very problematic for Chinese, Japanese, etc.
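For languages where tokenization is easy, the relation between a string and its token list can be sketched as follows. This is a simplified illustration under stated assumptions: the regular expression and the function names `tokenize` and `detokenize` are hypothetical, and the sketch does not handle split forms such as Dutch `erop` / `opgebeld`, nor languages written without word separators.

```python
import re

# One token per run of word characters, or per single non-space symbol.
TOKEN = re.compile(r"\w+|[^\w\s]")


def tokenize(text):
    """Analysis direction: String -> [String]."""
    return TOKEN.findall(text)


def detokenize(tokens):
    """Naive generation direction: join with spaces, then remove the space
    before closing punctuation and after an opening quote."""
    text = " ".join(tokens)
    text = re.sub(r"\s+([:;,.!?”])", r"\1", text)
    text = re.sub(r"([“])\s+", r"\1", text)
    return text


sentence = "He said: “come in!”"
print(tokenize(sentence))
print(detokenize(tokenize(sentence)))
```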
What are MWEs?
• sequence of words that has lexical,
orthographic, phonological, morphological,
syntactic, semantic, pragmatic or
translational properties not predictable from
the individual components or their normal
mode of combination
What are MWEs?
• sequence of
– Not necessarily contiguous in a concrete utterance
• ...omdat hij de plaat wilde poetsen
– Not necessarily always in the same order in each utterance
• Hij poetste gisteren de plaat
• words
– Ambiguity between type and token (intentional)
– Inflected word form v. lemma
– Ambiguity between
• Character sequences separated from other character sequences by
spaces and other separators (Narrow interpretation)
• Abstract lexical units of the grammar (Broad interpretation)
What are MWEs?
• that has properties not predictable from the
individual components and their normal
mode of combination
What are MWEs?
• Lexical
– De plaat poetsen
– Een poging wagen / doen / *maken
– Dat varkentje eens wassen
– Zware / *sterke shag
– Scherpe kritiek
– Perdre la tête / la boule / *la cervelle
– Se creuser la tête / *la boule / la cervelle
What are MWEs?
• Orthographic
– viz.
– Bijv.
– www.uilots.nl
– i.v.m.
– Yahoo!
– Groen!
– Aujourd’hui (v. l’homme)
– ‘s (avonds/morgens/middags)
What are MWEs?
• Phonological
– Over de rooie/*rode (gaan/zijn/raken)
– om de dooie/*dode donder niet
– op zijn dooie akkertje/gemak
– op zijn dooie eentje
– De kwaaie/*kwade Piet toegespeeld krijgen
– Je niet in de kouwe/*koude kleren gaan zitten
– Een gouwe ouwe
– (but geen rode/rooie cent/duit (hebben))
What are MWEs?
• Morphological
– Ten gevolge van
– Ter wereld
– Van goeden huize
– Zonder aanzien des persoons
– Het lood*(je) leggen
– Dat varken*(tje) wassen
– De *raap is / rapen zijn gaar
What are MWEs?
• Syntactic
– Ten gevolge van
– In opdracht van (no article)
– Iemand een oor aannaaien
– Rekening houden met (obligatorily indefinite)
– Het bijvoeglijk(*e) naamwoord (v. een groot/grote man)
What are MWEs?
• Semantic
– De plaat poetsen
– Dat varkentje wassen
– Een bok schieten
– Een flater slaan
What are MWEs?
• Pragmatic
– Ladies and Gentlemen
– Ik heb gezegd.
– Eet smakelijk! (Bon appétit!, Enjoy!)
– Sincerely yours
What are MWEs?
• Translational properties
– Laten zien (F. montrer, E. show)
– Witte wijn (P. vinho verde)
– Nuclear power plant (D. atoomcentrale, G.
Kernkraftwerk)
– Space probe (F. sonde spatiale)
– Iemand iets laten weten
• inform someone of something
MWEs in NLP
• MWEs occur very often in natural language
– Esp. in languages with little compounding
• Especially in specialized domains
– Multi-word terminology
MWEs in NLP
• MT
– Improves parsing and translation of the MWEs
– Also improves parsing hence translation of the sentence
containing the MWEs (Nivre & Nilsson LREC 2004)
• CLIR
– Nuclear power plant
• Kern- macht plant
• Kern- Macht Pflanz
• v. atoomcentrale / Kernkraftwerk
MWEs in NLP
• Problems MWEs pose for NLP
– How are MWEs to be dealt with in the
grammar of an NLP system?
– What lexical representation of MWEs is
required for this?
– How can we obtain lexicons containing MWEs
with such lexical representations
Types of MWEs (I)
• Fixed
• Semi-flexible
• Flexible
Fixed MWEs
• Fixed MWEs
– Words of the MWE in a fixed order
– No variation in lexical item choice
– Always contiguous (no other elements in
between)
– No inflectional processes except at the edges
Fixed MWEs
• Fixed MWEs
– ad hoc, stante pede, ter plaatse
– Hong Kong, Kuala Lumpur, New York, San Francisco
– credit card, travel agency, real estate agency
• NOT
– in plaats van (cf. in plaats daarvan) (‘instead of’)
– carta telefonica (cf. carte telefoniche)
– de plaat poetsen (‘polish the plate’, ‘bolt’)
Semi-Flexible MWEs
• Semi-Flexible MWEs
– MWEs with fixed order of elements
– That are impenetrable for other words
– Parts can be inflected
Semi-Flexible MWEs
• Examples:
– Chambre des représentants
• House of representatives
– Patatas fritas
• French fries
– Mise au point automatique
• Autofocus
– Calculateur analogique
• Analogue computer
Semi-Flexible MWEs
• Examples:
– Cité plus haut
• Above-stated
– Résistant aux acides
• Acid-proof
– Malade en altitude
• Airsick
Flexible MWEs
• Flexible MWEs
• Allow or require inflection in multiple parts, and
• Allow permutations of subphrases, or
• Allow intrusion by other phrases, or
• Have controlled variation (bound pronouns)
Flexible MWEs
– de plaat poetsen (‘bolt’)
• Hij heeft gisteren de plaat gepoetst
• …omdat hij de plaat wilde poetsen
• Hij poetste gisteren de plaat
– to lose one’s temper
• He lost his temper
• She lost her temper
Treatment
• Fixed MWEs
– No inflection: Relate single string to sequence of
strings (in Orthography)
• ([ad_hoc] , [ad, hoc])
• Lexical entry for ad_hoc
– With inflection: Relate single stem to sequence of stems
in Morphology
• ([real, estate, agency, Plur] -> [real_estate_agency, Plur])
• Lexical entry for real_estate_agency
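The treatment of uninflected fixed MWEs in orthography can be sketched as a longest-match merge over the token list. This is a minimal sketch under stated assumptions: the MWE list `FIXED_MWES` and the function name `merge_fixed_mwes` are hypothetical, and inflection at the edges (e.g. `real estate agencies`) is not handled here, since that would require the morphological component.

```python
# Hypothetical list of fixed MWEs, each given as a tuple of tokens.
FIXED_MWES = {("ad", "hoc"), ("stante", "pede"), ("real", "estate", "agency")}
MAX_LEN = max(len(mwe) for mwe in FIXED_MWES)


def merge_fixed_mwes(tokens):
    """Replace each fixed MWE by one underscore-joined token,
    preferring the longest match at each position."""
    out, i = [], 0
    while i < len(tokens):
        # Try candidate lengths from longest to shortest (minimum 2 tokens).
        for n in range(min(MAX_LEN, len(tokens) - i), 1, -1):
            candidate = tuple(t.lower() for t in tokens[i:i + n])
            if candidate in FIXED_MWES:
                out.append("_".join(candidate))
                i += n
                break
        else:
            # No MWE starts here: keep the token unchanged.
            out.append(tokens[i])
            i += 1
    return out


print(merge_fixed_mwes(["an", "ad", "hoc", "decision"]))
```

After this step the grammar only needs an ordinary lexical entry for `ad_hoc`.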
Treatment
• Semi-flexible MWEs
– Require local syntax
– Chunking may be enough
Treatment
• Flexible MWEs
– Require sophisticated syntax
Types of MWEs (II)
• Verb-particle combinations (English,
German, Dutch, Hungarian)
– Ik sloeg hem over
– I looked the passage up
Types of MWEs (II)
• Verb + prepositional complement
– I looked after her
– Hij heeft altijd van haar gehouden
Types of MWEs (II)
• Circumpositions (Dutch, German)
– Op iemand af / ?toe / *heen
– Auf jemanden *ab / zu
– Over de brug heen / *af / *toe
Types of MWEs (II)
• Lexical item (from open or closed class)
• + closed class lexical item
– Finite (actually small) list
• Limited variety of predictable syntactic
structures
• Dealt with by almost any grammar-based
NLP system
Types of MWEs (II)
• Multiword Names
– Examples
• Fifth Avenue
• Koning Leopold III-laan
• Krimpen aan de IJssel
• Koninklijke Nederlandse Philips N.V.
Types of MWEs (II)
• Multiword Names
– Issues
• Keys – variation
– (Koning) Leopold III-laan
– Fifth (Avenue)
– ((Calle) Roberto) González
– Many different ones, continuously new ones
– Very important for correct parsing and translation
• Minister Kohl → Minister Cabbage
Types of MWEs(II)
• Compounds (in English)
– Examples
• Real estate agency
• Nuclear power plant
• Blue cheese
• Private eye
• High school
Types of MWEs(II)
• Idioms
– No or unpredictable meaning of the
components
– Fixed (or very limited ) lexical item selection
– Opaque
• Kick the bucket
• De plaat poetsen
• Casser sa pipe
Types of MWEs(II)
• Idioms
– Semi-transparent
• `een bok schieten’
– Bok (male goat) = blunder
– Schieten (shoot) = make
• `dat varkentje wassen’
– Varkentje (little pig) = problem
– Wassen (wash) = address, take care of
Types of MWEs(II)
• Idioms
– Usually completely normal syntactic structure
– Both a literal and an idiomatic reading
– Participate normally in many grammatical
processes
– BUT: often restrictions on participating in
grammatical processes
Types of MWEs(II)
• Idioms (opaque)
– Normal participation
• Hij poetste de plaat (V2)
• Poetste hij de plaat? (V1, question formation)
• ...omdat hij de plaat wilde poetsen (VR)
– But not
• #De plaat heeft hij niet gepoetst (topicalization)
• #Welke plaat heeft hij gepoetst? (wh-Q)
• #Hij heeft de mooie plaat gepoetst (internal modification)
• #Hij heeft de plaat waarschijnlijk niet gepoetst (Mid-field NP-Adv permutation)
• #De plaat die hij gepoetst heeft (Relativization)
• #De plaat werd door hem gepoetst (Passive)
• #Wat een plaat (independent occurrence)
Types of MWEs(II)
• Idioms (semi-transparent)
– Normal participation
• Hij schoot een bok (V2)
• Schoot hij een bok? (V1, question formation)
• ...omdat hij een bok zou schieten (VR)
– Also
• Een bok heeft hij niet geschoten (topicalization)
• Wat voor een bok heeft hij nu weer geschoten? (wh-Q)
• Hij heeft een enorme bok geschoten (internal modification)
• Hij heeft die bok waarschijnlijk geschoten omdat ... (Mid-field NP-Adv permutation)
• De bok die hij geschoten heeft (Relativization)
• Er werd door hem een enorme bok geschoten (Passive)
– But not:
• Wat een bok! (independent occurrence)
Types of MWEs(II)
• Idioms
– In some cases irregular syntactic structure
• Ten gevolge van (fossilized portmanteau words, noun e-form)
• Het bijvoeglijk naamwoord (no –e)
• Iemand de oren wassen (inalienable possession construction)
– Regular syntactic structure but not predictable from the
components’ properties
• Iemand welkom heten
– (`heten’ on its own can only take a pred. complement)
– *Hij heet hem aardig / president / Jan
• to lose face (count noun in determinerless NP)
Types of MWEs(II)
• Idioms
– Cranberry words
• Ergens de brui aan geven
Types of MWEs(II)
• Semi-idioms (collocations)
– One element occurs in its normal meaning
– The lexical selection of the other element is
fixed or very limited
– The other element has a special meaning
– Examples
• Zware tabak (heavy tobacco) `strong tobacco’
• Scherpe kritiek (sharp criticism) `severe criticism’
• Heavy smoker
Types of MWEs(II)
• Support verb constructions
– Type I
• Een poging wagen
• Een lezing houden / geven
• To pay attention to (aandacht schenken aan)
• To take advantage of
Types of MWEs (II)
• Arguments of the noun outside the NP
– De kritiek die we hadden op hem
– ?de kritiek op hem die we hadden
– *De kritiek die we naar voren brachten op hem
– De kritiek op hem die we naar voren brachten
– De kritiek op hem verstomde
Types of MWEs (II)
• Arguments of the noun outside the NP
– De aandacht die we schonken aan hem
– *de aandacht aan hem die we schonken
– *De aandacht die we becommentarieerden aan
hem
– De aandacht aan hem die we
becommentarieerden
– De aandacht *aan / voor hem verflauwde
Types of MWEs (II)
• Arguments of the noun outside the NP
– The attention that we paid to this subject
– *The attention to this subject that we paid
– *The attention that we criticized to this subject
– The attention to this subject that we criticized
– The attention to this subject
Types of MWEs (II)
• Arguments of the noun outside the NP
– No attention was paid to this subject
– This subject was paid no attention to
– Advantage was taken of this proposition
– This proposition was taken advantage of
Types of MWEs (II)
– Type II
• Iemand een stomp geven
• Iemand een klap geven
• To give someone a kiss
– Noun signifying bodily touch
– `give’ + indirect object as Patient
Types of MWEs (II)
– Type III(?)
• In de war zijn / raken / * gaan / ?komen / brengen
• In zijn nopjes zijn / raken / * komen / *brengen
• De pijp uit zijn / ?raken / gaan/ *komen / *brengen
Types of MWEs (II)
• Quasi-idioms
– Characteristics: Regular meaning + something
extra (specialization)
– Examples
• Huisdeur (`front door’)
• Bijvoeglijk naamwoord
• Fried eggs
– Terms
– Compounds
Requirements
• Account for the fact that MWEs usually
have `normal’ syntactic structures
• Account for the normal participation of
MWEs in most syntactic processes
• Account for the restrictions on the
participation in some syntactic processes
• Recognize it as an MWE and assign it the
associated semantics
Tree-Adjoining Grammar (TAG)
• Originally developed by Aravind Joshi
• Extended by Shieber, Schabes
• Applied to French by Abeillé
• Basic object: trees
• Enriched with features (incl. unification)
• Known parsing algorithm and complexity properties
• Defines mildly context-sensitive languages
TAG
• Lexicalized TAG (LTAG)
• Trees:
– Elementary Trees
• Associated with a lexical item
• Initial
– Leaves labeled by
» Terminals
» Substitution nodes
TAG
• Initial Trees (examples)
– N[Jean]
– S[N0↓ V[dormir]]
– S[N0↓ V[aimer] N1↓]
– S[N0↓ V[ressembler] PP[P[à] N1↓]]
• Used for words (verbs) taking nominal, adjectival, prepositional arguments
TAG
• Auxiliary Trees
– One leaf node (foot node) same category as root node
– Used for modifiers, auxiliary verbs, raising verbs, verbs
taking sentential arguments
– Examples
• N[A[beau] N*]
• N[Det[le] N*]
• S[N0↓ V[penser] S1’[C[que] S1*]]
• V[V[sembler] V*]
TAG
• Constraints on Elementary Trees
– Lexicalization
– Predicate-Argument co-occurrence
– Semantic Consistency
TAG
• Operations
– Substitution
– Adjunction
• NO
– Deletion
– Movement
– Permutation
TAG
• Operations
– Substitution
• Substitutes a tree at a leaf node marked for
substitution (↓)
• Example
– S[N0↓ V[dormir]] + N[Jean] →
– S[N[Jean] V[dormir]]
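Substitution can be sketched in a few lines of Python. This is an illustrative toy under stated assumptions, not the data structure of any actual TAG implementation: the `Node` class, the `substitute` and `show` functions are hypothetical, argument indices such as the `0` in `N0↓` are omitted, and the sketch always fills the leftmost matching substitution node.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    label: str                                     # category, e.g. "S", "N", "V"
    children: list = field(default_factory=list)
    word: str = ""                                 # lexical anchor, if any
    subst: bool = False                            # True for a substitution node (↓)


def substitute(tree, initial):
    """Return a copy of `tree` in which the leftmost substitution node with
    the same label as the root of `initial` is replaced by `initial`."""
    done = False

    def walk(node):
        nonlocal done
        if not done and node.subst and node.label == initial.label:
            done = True
            return initial
        return Node(node.label, [walk(c) for c in node.children],
                    node.word, node.subst)

    return walk(tree)


def show(node):
    """Render a tree in the bracket notation used on the slides."""
    if node.word:
        return f"{node.label}[{node.word}]"
    if node.subst:
        return f"{node.label}↓"
    return f"{node.label}[{' '.join(show(c) for c in node.children)}]"


dormir = Node("S", [Node("N", subst=True), Node("V", word="dormir")])
jean = Node("N", word="Jean")
print(show(substitute(dormir, jean)))
```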
TAG
• Operations
– Adjunction
• Inserts an auxiliary tree (or a tree derived from an
auxiliary tree) at any node (with the same label)
• If this node dominates a subtree, this subtree is
copied under the foot node of the auxiliary tree
– Example
• S[N[Jean] V[dormir]] + V[semble V*] →
• S[N[Jean] V[semble V[dormir]]]
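Adjunction can be sketched in the same spirit. The representation below (nested lists of the form `[label, child, ...]` with words as plain strings) and the function names `adjoin` and `show` are assumptions made for illustration. The sketch adjoins at the first node, top-down and left to right, that carries the root label of the auxiliary tree, whereas TAG itself allows adjunction at any node with that label.

```python
def adjoin(tree, aux, foot):
    """Adjoin auxiliary tree `aux` at the first node of `tree` whose label
    equals the root label of `aux`. The subtree found at that node ends up
    in the position of the foot node `foot` (e.g. "V*")."""
    root_label = aux[0]
    done = False

    def plug(node, subtree):
        # Copy the auxiliary tree, replacing its foot node by `subtree`.
        if node == foot:
            return subtree
        if isinstance(node, str):
            return node
        return [node[0]] + [plug(c, subtree) for c in node[1:]]

    def walk(node):
        nonlocal done
        if isinstance(node, str):
            return node
        if not done and node[0] == root_label:
            done = True
            return plug(aux, node)
        return [node[0]] + [walk(c) for c in node[1:]]

    return walk(tree)


def show(node):
    """Render a tree in the bracket notation used on the slides."""
    if isinstance(node, str):
        return node
    return f"{node[0]}[{' '.join(show(c) for c in node[1:])}]"


jean_dormir = ["S", ["N", "Jean"], ["V", "dormir"]]
sembler = ["V", ["V", "sembler"], "V*"]
print(show(adjoin(jean_dormir, sembler, "V*")))
```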
TAG
• Derived tree
– Tree created by substitution or adjunction
– Encodes word order, inflection,
morphosyntactic features
• Derivation Tree
– α-dormir[ 1/α’-Jean 2/β3-semble]
– Close to dependency trees
– Basis for semantic interpretation
TAG
• Lexical rules
– Elementary tree → elementary tree
– For passive, wh-questions, cleft constructions, relative clauses, cliticization
– Define the lexical item’s tree family
• Examples of derived elementary trees
– Passive: S[N1↓ V[être] V[aimé] PP[P[par] N0↓]]
– Object-cleft: S[V[CI[ce] V[être]] N1↓ S’[C[que] S[N0↓ V[aimer]]]]
– Object-Rel.: N[N1* S’[C[que] S[N0↓ V[aimer]]]]
TAG
• Synchronous TAG
• Each elementary tree is associated with a
semantic tree
• Links between nodes from the elementary
tree and the semantic tree
TAG
• N-1[Jean]
– T-1[jean’]
• S-2[N0↓-1 V[dormir]]
– F-2[R[dormir’] T1↓-1]
• S-1[N0↓-2 V[aimer] N1↓-3]
– F-1[R[aimer’] T0↓-2 T1↓-3]
• S-1[N0↓-2 V[ressembler] PP[P[à] N1↓-3]]
– F-1[R[ressembler’] T0↓-2 T1↓-3]
TAG
• N-1[A[beau] N*]
– F-1[R[beau’] T0*]
• N[Det[le] N*]
– F-1[R[le’] T0*]
• S-1[N0↓-2 V[penser] S1’[C[que] S1*]]
– F-1[R[penser’] T0↓-2 T1*]
• V-1[V[sembler] V*]
– F-1[R[sembler’] T0*]
TAG
• Given a pair, select a link (nondeterministically) with roots A and B
• Select another pair with roots A and B
• Combine the syntactic trees (at node A) and
the semantic trees (at node B), by
adjunction or substitution, and remove the
link between A and B
• Do this recursively
TAG
• <S[N0↓-1 V-2[dormir]], F-2[R[dormir’] T1↓-1]>, select link 1
• + <N-1[Jean], T-1[jean’]>
• → <S[N[Jean] V-2[dormir]], F-2[R[dormir’] T[jean’]]>
• + <V-1[V[sembler] V*], F-1[R[sembler’] T0*]>
• → <S[N[Jean] V[V[sembler] V-2[dormir]]], F[R[sembler’] F-2[R[dormir’] T[jean’]]]>
TAG
• Idiomatic Expressions in LTAG
– Each idiomatic expression represented by an
elementary tree
• Examples
– S[N0↓ V[briser] N1[D[la] N[glace]]]
– S[N0↓ V[prendre] N1↓ PP[P[en] N[compte]]]
– S[N0[D[des] N[ailes]] V[pousser] PP1[P[à]
N1↓]]
TAG
• Associated with a semantic tree
• F[R[briser-la-glace’] T0↓]
• F[R[prendre-en-compte’] T0↓ T1↓]
• F[R[des-ailes-pousser-à’] T0↓]
• With the appropriate links
TAG
• Lexical rules
– Apply to idiom elementary trees normally
– But can be individually constrained
• Links between parts of an idiomatic
elementary tree and semantic trees are
allowed:
– internal syntactic modification can correspond
to external semantic modification
TAG
• MWEs are normal elementary trees
• Lexical rules apply to idiom elementary
trees as usual
• Lexical rules can be restricted to apply to
certain elementary trees
• Recognition and semantics: same as with
single word elementary trees
TAG
• But
– No restrictions on elementary trees
– Idiomatic elementary trees can deviate
– Elementary trees are complex (esp. features):
difficult to maintain
– Restrictions on grammatical processes basically
stipulated
M-Grammar
• Developed by Jan Landsbergen
• Inspired by Montague grammar
• Compositional Grammars
– The meaning of an expression is a function of
the meaning of its parts and the way they are
combined
• Use traditional syntactic surface trees (but
with relations)
M-Grammar
• Used for Machine Translation
• Research Prototype MT System
– Dutch, English, Spanish
• Developed at Philips Research Labs
– Rosetta project
– Rosetta3 System
• Compositional Translation Method
M-Grammar
• Compositionality of Meaning
– The grammars are organised in such a way that
the meaning of an expression is a function of
the meaning of its parts and the way they are
combined.
• Implemented by Compositional Grammars
– Basic Expressions (BE), with a meaning
– Rules, with a meaning (recursively applicable)
M-Grammar
• Basic Expressions
– With basic meaning
• M-Rules
– With meaning operation
• Basic object: S-trees
• S-tree = N[r1/T1,...,rn/Tn]
– N: node = CAT{a-v pairs}
– Ti: S-trees
– ri: grammatical relation (subject, object, head,...)
• S-tree of a basic expression is basic S-tree
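The S-tree definition can be mirrored by a small recursive data structure. This is a minimal sketch, assuming a Python representation; the class name `STree`, its field names and the attribute names `form` and `num` are hypothetical and do not come from the Rosetta implementation.

```python
from dataclasses import dataclass, field


@dataclass
class STree:
    cat: str                                       # syntactic category (CAT)
    attrs: dict = field(default_factory=dict)      # attribute-value pairs
    branches: list = field(default_factory=list)   # (relation, STree) pairs

    def render(self):
        """Render as CAT{a-v pairs}[r1/T1, ..., rn/Tn]."""
        feats = ""
        if self.attrs:
            feats = "{" + ", ".join(f"{k}={v}" for k, v in self.attrs.items()) + "}"
        if not self.branches:
            return f"{self.cat}{feats}"
        inner = ", ".join(f"{rel}/{t.render()}" for rel, t in self.branches)
        return f"{self.cat}{feats}[{inner}]"


# An NP consisting of a modifying adjective and a plural head noun.
np = STree("NP", {}, [
    ("mod", STree("A", {"form": "e"})),
    ("head", STree("N", {"num": "pl"})),
])
print(np.render())
```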
M-Grammar
• S-tree of a full utterance is created by applying M-rules to S-trees, initially basic S-trees
• Derivation history is recorded in syntactic
derivation tree (syntactic D-tree)
• M-rules
– Powerful rules
– Structure creation, deletion, permutations, movements,
insertions
M-Grammar
• M-Rules
– Relate [T1,..,Tn] to T
– Reversible
• Analytic and generative versions can be derived automatically
– Measure condition
• Each Ti in [T1,...,Tn] must be `smaller’ (according to some
measure) than T
– Reversibility and Measure Condition guarantee
effectiveness of parsing
M-Grammar
• Syntactic D-trees contain names of basic
expressions and names of rules
• Can be mapped into (isomorphic) Semantic
D-trees, containing names of basic
meanings and names of meaning operations
M-Grammar
• Principle of Compositionality of Translation
– Two expressions are each other's translation if they are
built up from parts which are each other's translation,
by means of rules which are each other’s translation.
• Implemented by tuning Compositional Grammars
G1 and G2
– For each BE in G1 at least one BE in G2 that is
translationally equivalent
– For each rule in G1 at least one rule in G2 that is
translationally equivalent
M-Grammar
• Interlingual System
• Interlingua obtained as a side-effect of
tuning compositional grammars (no
independent interlingua)
• Interlingua expresses translational
equivalence
– B1 and B1’ are translations of each other
– Not necessarily the meaning of B1/B1’
M-Grammar
• G1:
– BEs: N-boek (M-book: book’), A-interessant (M-interesting: interesting’)
– M-rule R1: meaning name: IndefPlMod
• Syntax
– <[N, A], NP[mod/A{e}, head/N{pl}]>
• Semantics
– [[A]](x) & [[N]](x)
• G2:
– BEs: N-livre (M-book: book’), A-intéressant (M-interesting: interesting’)
– M-rule R2: meaning name: IndefPlMod
• Syntax
– <[N, A], NP[det/D-des, head/N’[head/N{pl, m}, mod/A{pl, m}]]>
• Semantics
– [[A]](x) & [[N]](x)
M-Grammar
• NP[mod/A{e}-interessant, head/N{pl}-boek]
– → interessante boeken
• Syn D-Tree: R1 [boek interessant]
• NP[det/D{pl}-des, head/N’[head/N{pl,m}-livre, mod/A{pl,m}-intéressant]]
– → des livres intéressants
• Syn D-Tree: R2 [livre intéressant]
• Sem D-Tree: IndefPlMod [M-book M-interesting]
• Semantics: book’(x) & interesting’(x)
– (ignoring plurality, indefiniteness)
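The role of the shared semantic D-tree can be sketched as a pair of mappings over derivation trees. This is a toy illustration under stated assumptions: the dictionaries `G1` and `G2`, the tuple encoding of D-trees and the function names `to_semantic` and `to_syntactic` are hypothetical, and the actual M-rules that build S-trees are left out entirely.

```python
# Toy fragments of G1 (Dutch) and G2 (French): names of basic expressions
# and of rules, each mapped to the name of its meaning.
G1 = {"be": {"boek": "M-book", "interessant": "M-interesting"},
      "rules": {"R1": "IndefPlMod"}}
G2 = {"be": {"livre": "M-book", "intéressant": "M-interesting"},
      "rules": {"R2": "IndefPlMod"}}


def to_semantic(dtree, grammar):
    """Syntactic D-tree -> semantic D-tree (analysis side)."""
    if isinstance(dtree, str):
        return grammar["be"][dtree]
    rule, *args = dtree
    return (grammar["rules"][rule], *[to_semantic(a, grammar) for a in args])


def to_syntactic(sem, grammar):
    """Semantic D-tree -> syntactic D-tree (generation side).
    Picks the first basic expression or rule with the required meaning."""
    if isinstance(sem, str):
        return next(be for be, m in grammar["be"].items() if m == sem)
    meaning, *args = sem
    rule = next(r for r, m in grammar["rules"].items() if m == meaning)
    return (rule, *[to_syntactic(a, grammar) for a in args])


dutch = ("R1", "boek", "interessant")
interlingua = to_semantic(dutch, G1)
print(interlingua)
print(to_syntactic(interlingua, G2))
```

Because the two grammars are tuned to each other, the syntactic D-trees of both languages map to the same semantic D-tree, and no independent interlingua has to be defined.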
M-Grammar
• Full System
– Analysis: String → A-MORPH → [Lexical S-Trees] → S-PARSER → S-Trees → M-PARSER → Syn D-Trees → A-TRANSFER → Sem D-Trees (IL)
– Generation: Sem D-Trees (IL) → G-TRANSFER → Syn D-Trees → M-GENERATOR → S-Trees → LEAVES → [Lexical S-Trees] → G-MORPH → String
M-Grammar:MWEs
• Method for dealing with idioms
• Each idiom
– is a basic expression
– Associated with a complex syntactic structure
– → complex basic expressions
• Syntactic Structure of an idiom:
– Canonical (abstracting from syntactic operations: passive, topicalization, verb movements, wh-questioning, ...)
– Represented as a D-tree
• Much more stable part of the grammar
• Much simpler than S-trees (no features etc.)
• Makes it easier to deal with bound variation
• Better guarantee of correctness of structures
– In the lexicon an identifier referring to a D-tree (idiom pattern)
M-Grammar:MWEs
• Start rules:
– Combine a BE with its arguments (variables)
– If the BE is complex, its structure is created by
applying the rules in the associated D-Tree
• Resulting Structure runs through all the
normal rules of grammar
• In analysis: incoming structure is analyzed
using the rules in the D-tree
M-Grammar:MWEs
• In analysis:
– Guide the analysis by the idiom’s D-tree
– If successful, and the arguments are also
correctly analyzed: extend the sentence’s D-tree
with the start rule dominating a BE (for the
idiom) and the D-trees for the arguments
M-Grammar:MWEs
• De pijp uit gaan
• D-tree for vpid30 (simplified):
Rsubst,i
  [RVP
    [$aV_00_ga,
     VAR_j,
     RPPpost
       [$s_prep1286700,
        VAR_i
       ]
    ],
   RNPdef [$aN_00_pijp]
  ]
M-Grammar:MWEs
• Start Rule 1: BE(verb) + VAR
• S-Tree created in case of an idiom by applying the D-Tree for the
idiom
• S[subj/VAR_j,
head/VP[compl/PP
[obj/NP[det/D-de
head/N-pijp
]
head/P-uit
]
head/V-ga
]
]
M-Grammar:MWEs
• Normal participation in grammatical processes:
structures for idioms are normal
• Restrictions:
– Rules that affect meaningful elements cannot apply to
meaningless parts of idioms
– → relativization, topicalization, wh-questioning, NP-Adv order switch, exclamative formation, adjectival modification, independent occurrence, ...
– (Parts of) rules that are purely syntactic are not restricted
– → V2, VR, V1 (if question formation and V1 are separated)
M-Grammar:MWEs
• Deviant syntax
– Portmanteau words (ten, ter)
– E-form of nouns
– Inalienable possession construction
• Minor Rules: rules that can only occur in an
idiom D-Tree
M-Grammar:MWEs
• Idioms have normal syntactic structures generated by applying the
normal M-rules following the associated idiom D-Tree
• Other M-Rules apply to these structures in the normal way
• Restrictions on applications accounted for because M-Rules applying
to meaningful elements cannot apply to meaningless parts of idioms
• Deviant syntactic structure can be accounted for by Minor Rules
• The idiom D-trees can be derived using the system itself (by switching the
minor rules on in the grammar)
• Recognition of idioms: by guiding the analysis along the idiom D-tree
• Semantics: each idiom is (complex) basic expression with associated
basic meaning
M-Grammars and TAG
• Neither has an adequate treatment of semi-idioms (yet)
• Neither has an adequate treatment of
transparent idioms (yet)
• Idiom representation as D-trees can also be
done in TAG (Abeillé)
Lexical representation
• How do we obtain lists of MWEs?
• How do we represent them lexically
• How can we improve exchangeability of
MWE lexical representations?
MWE Acquisition
• Lists from existing dictionaries
– Always incomplete
– Especially for specific domains
– Do not necessarily reflect actual usage
• Semi-automatic acquisition from text
corpora is called for
– Especially for rapid tuning NLP system to
specific domain, company, organization
MWE Acquisition
• Use statistical properties of MWEs to
acquire them (semi-)automatically
• Examples:
– Mutual Information: log( Pr(xy) / (Pr(x) Pr(y)) )
– Salience: Pr(xy) * MI
– Dice, chi-square, log-likelihood, em metric, ...
MWE Acquisition
• Mutual Information:
– Pr(z) estimated by F(z)/T
– Pr(xy) / (Pr(x) Pr(y)) = (F(xy) * T) / (F(x) F(y))
– Experiment: (just for adjacent 2-word MWEs)
• Salience
– Compensates for favoring low frequency items
– Experiment: (just for adjacent 2-word MWEs)
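Both measures can be computed directly from frequency counts for adjacent 2-word candidates. The sketch below is illustrative only: the function name `bigram_scores` is hypothetical, the base-2 logarithm is an assumption (the slides do not fix the base), and the corpus size T is taken to be the number of tokens for both unigrams and bigrams as a simplification.

```python
import math
from collections import Counter


def bigram_scores(tokens):
    """Score every adjacent word pair by mutual information and salience.

    Pr(z) is estimated as F(z)/T, so that
    MI = log2((F(xy) * T) / (F(x) * F(y))) and salience = Pr(xy) * MI.
    """
    T = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (x, y), f_xy in bigrams.items():
        mi = math.log2((f_xy * T) / (unigrams[x] * unigrams[y]))
        salience = (f_xy / T) * mi
        scores[(x, y)] = (mi, salience)
    return scores


scores = bigram_scores("a b a b c".split())
for pair, (mi, salience) in sorted(scores.items()):
    print(pair, round(mi, 3), round(salience, 3))
```

In this toy corpus the pairs `a b` (frequency 2) and `b c` (frequency 1) receive the same MI, while salience ranks the more frequent pair higher, which is the compensation for low-frequency items mentioned above.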
MWE Acquisition
• Extend to 3-word MWEs, etc.
• Combine with NLP system/components
– PoS-tag to obtain syntactically meaningful
combinations (yes: A N, no: Adv N)
– Parsing and compute statistics on D-trees
• Allows acquisition of discontinuous MWEs
• Very lively and active research area
Lexical Representation of Idioms
• Many grammatical treatments of idioms
require
– Whatever is needed for single word lexical
items
– Syntactic structure
– Unique references to lexical items
• Highly framework / theory /
implementation-specific
Lexical Representation
• M-Grammar system-specific matters:
– D-trees, compatible with the specific
implementation
– Unique references to items in the Rosetta lexicon
– Order of the items (`gaan uit pijp’)
– Presence of items (articles absent)
Lexical Representation
• LTAG system-specific matters:
– Syntactic trees, compatible with the
specific implementation
– Unique references to items in
system’s lexicon
– Order of the items (= canonical
surface order)
– Presence of items (all present)
SEQCI
• Lexical Representation
– Maximally theory-neutral
• Incorporation Method
– Generic
– Maximally reuses existing NLP system
• Core Idea:
– Describes which idioms have the same structure
– Structural Equivalence Classes for Idioms
SEQCI
• Lexical Representation=
– Idiom Descriptions
– Idiom Pattern Descriptions
• Idiom description=
– Idiom pattern (identifier)
• Identifier for idiom structures
• Used to define the equivalence classes
– Idiom Component List (ICL), with base forms
• All
• Any order, but the same within one equivalence class
– Example sentence containing the idiom
• Same syntactic structure for each idiom of the same
equivalence class
SEQCI
• Idiom pattern description
• Idiom pattern identifier
• Comments (free text)
SEQCI
• Example:
– Idiom Descriptions
• Idp30;De pijp uit gaan;Hij is de pijp uit gegaan
• Idp30;De boot in gaan;Hij is de boot in gegaan
• Idp30;Het schip in gaan;Hij is het schip in gegaan
– Idiom pattern definition
• Idp30
• Idiom headed by a verb taking a postpositional PP
containing a definite singular NP and one free
argument as subject
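Idiom descriptions in this shape are easy to read in and group into equivalence classes. The following is a minimal sketch, assuming that a description is stored as one semicolon-separated record as shown on the slide; the class `IdiomDescription` and the functions `parse_description` and `equivalence_classes` are hypothetical names.

```python
from collections import defaultdict
from typing import NamedTuple


class IdiomDescription(NamedTuple):
    pattern: str    # idiom pattern identifier, e.g. "Idp30"
    icl: tuple      # Idiom Component List (base forms)
    example: str    # example sentence containing the idiom


def parse_description(line):
    """Parse one 'pattern;components;example' record."""
    pattern, components, example = (part.strip() for part in line.split(";"))
    return IdiomDescription(pattern, tuple(components.lower().split()), example)


def equivalence_classes(lines):
    """Group idiom descriptions by their idiom pattern identifier."""
    classes = defaultdict(list)
    for line in lines:
        description = parse_description(line)
        classes[description.pattern].append(description)
    return dict(classes)


LINES = [
    "Idp30;De pijp uit gaan;Hij is de pijp uit gegaan",
    "Idp30;De boot in gaan;Hij is de boot in gegaan",
    "Idp30;Het schip in gaan;Hij is het schip in gegaan",
]
classes = equivalence_classes(LINES)
print(len(classes["Idp30"]), classes["Idp30"][0].icl)
```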
SEQCI
• Incorporation Method
– Manual part, once for each idiom pattern
– Automatic Part, for each idiom
description
SEQCI
• Manual part (`hij is de pijp uit gegaan’)
1. Parse the example sentence of an idiom description
with idiom pattern P, yielding the Reference Parse
2. Define a transformation to turn the reference parse into
the idiom structure (Idiom Parse Transformation, IPT)
3. Determine the list of unique IDs of the lexical items in
the idiom structure for the system derived from the
reference parse (Idiom Component ID List, ICIL)
4. Define a transformation to relate ICL and ICIL (Idiom
Component Transformation, ICT)
5. Apply the ICT to the ICL, yielding the transformed ICL
(TICL) and check that each item in it equals the base
form of the corresponding element on the ICIL
SEQCI
• Automatic part, for each idiom description I (`hij is de boot in gegaan’)
1. Parse the example sentence (Syntactic Structure)
2. Apply the IPT and check identity with the idiom structure, modulo the lexical items
3. Select the component IDs from the parse tree to obtain the ICIL
4. Apply the ICT to the ICL of I, yielding the TICL
5. Check that <bf(c1), …, bf(cn)> = TICL, where ICIL = <c1, …, cn> (TICL check)
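Steps 4 and 5 of the automatic part can be sketched as follows. Steps 1 to 3 depend on the parser of the NLP system and are not modelled; the ICIL is simply given. The lexicon fragment `BASE_FORM` (standing in for the bf function), the list encoding of the ICT as 1-based positions, and the function names `transform_icl` and `ticl_check` are assumptions made for illustration.

```python
# Hypothetical lexicon fragment: unique ID -> base form (the bf function).
BASE_FORM = {
    "$aV_00_ga": "gaan",
    "$s_prep1286800": "in",
    "$aN_00_boot": "boot",
}


def transform_icl(ict, icl):
    """Step 4: apply an ICT, given as 1-based ICL positions in target
    order, to an Idiom Component List, yielding the TICL."""
    return [icl[position - 1] for position in ict]


def ticl_check(icil, icl, ict):
    """Step 5: the base forms of the IDs on the ICIL must equal the TICL."""
    return [BASE_FORM[c] for c in icil] == transform_icl(ict, icl)


icil = ["$aV_00_ga", "$s_prep1286800", "$aN_00_boot"]
icl = ["de", "boot", "in", "gaan"]
ict = [4, 3, 2]          # "1 2 3 4 => 4 3 2"
print(transform_icl(ict, icl), ticl_check(icil, icl, ict))
```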
SEQCI
• Advantages
– Technically simple
– As theory/grammar/implementation-independent as possible
– No need for prescribing syntactic structures
– System-specific aspects are derived from the NLP system itself
SEQCI
• Will it work?
– If there are not too many different idiom patterns, and
– Sufficient number of instantiations per idiom pattern
SEQCI
• Various improvements possible
– Parameterization
• Over local morphosyntactic differences (sg/pl; pos/dim;
pos/comp/sup; ...)
– Abstraction
• Especially for large fixed parts
– Weten waar Abraham de mosterd haalt
– Use of underspecified syntactic structures
• Optional, of any kind
– Guidelines for selection of example sentences
SEQCI
            SAID-Database          Dutch Minidatabase
Coverage    #idioms  #patterns     #idioms  #patterns
 50%          7383       28          449       21
 60%          8853       54          539       36
 70%         10304      140          628       59
 80%         11773      481          716       98
 85%         12509      908          760      134
 90%         13245     1644          804      178
 95%         13981     2380          849      223
100%         14716     3116          893      267
Conclusions
• SEQCI:
– Technically simple
– Highly theory, grammar, implementation independent
• Can reduce/share lexicon development efforts significantly
• Candidate for a standard lexical representation for idioms
• Extension to other types of MWEs looks promising
• Initial experiments started
• More testing required, with more NLP systems
– I can provide test data
– Try it out in your own NLP system!
SEQCI
• Possible Problems/Objections?
– Manual Part
• Pattern sentence does not yield a parse
• Pattern sentence yields multiple parses
• Pattern corresponds to 2 or more different structures in my system
– Automatic Part
• Sentence does not yield a parse
• Sentence yields multiple parses
• Multiple skeys result for the same base form
SEQCI: Reference Parse
Rdecl
  [Rperf
    [Rsubst(j)
      [Rsent
        [Rsubst(i)
          [RVP
            [$aV_00_ga,
             RPPpost
               [$s_prep1286700,
                VAR_i
               ]
            ],
           RNPdef [$aN_00_pijp]
          ],
         VAR_j
        ],
       RNP[$hij_PRON]
      ]
    ]
  ]
SEQCI: Idiom Structure
• IPT: Delete Rdecl, Rperf, Rsubj(j), RNP[$hij_Pron]
• D-tree for vpid30 (simplified):
Rsubst,i
  [RVP
    [$aV_00_ga,
     RPPpost
       [$s_prep1286700,
        VAR_i
       ]
    ],
   RNPdef [$aN_00_pijp]
  ]
ICIL
< $aV_00_ga, $s_prep1286700, $aN_00_pijp >
SEQCI: Illustration
Manual Part, applied to `de pijp uitgaan’
1. Reference Parse: see the D-tree on the slide `SEQCI: Reference Parse’
2. IPT: Delete Rdecl, Rperf, Rsubj(j), RNP[$hij_Pron]
3. ICIL: < $aV_00_ga, $s_prep1286700, $aN_00_pijp >
4. ICT: 1 2 3 4 => 4 3 2
5. TICL = ICT(<de, pijp, uit, gaan>) = <gaan, uit, pijp> =
< Bf($aV_00_ga), Bf($s_prep1286700), Bf($aN_00_pijp) >
ICT
• ICL: <de, pijp, uit, gaan>
• Must be turned into: <gaan, uit, pijp>
• ICT: 1 2 3 4 => 4 3 2
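The ICT notation can be read as a selection and reordering of ICL positions: position 1 (the article `de`) is dropped and the remaining items are reversed. A minimal sketch of how such a specification could be parsed and applied is given below; the function names `parse_ict` and `apply_ict_spec` and the error handling are hypothetical.

```python
def parse_ict(spec):
    """Parse an ICT written as '1 2 3 4 => 4 3 2' into
    (number of ICL positions, list of 1-based target positions)."""
    source, target = spec.split("=>")
    arity = len(source.split())
    positions = [int(p) for p in target.split()]
    if any(not 1 <= p <= arity for p in positions):
        raise ValueError("ICT refers to a position outside the ICL")
    return arity, positions


def apply_ict_spec(spec, icl):
    """Apply the ICT to an Idiom Component List, yielding the TICL."""
    arity, positions = parse_ict(spec)
    if len(icl) != arity:
        raise ValueError("ICL length does not match the ICT")
    return [icl[p - 1] for p in positions]


print(apply_ict_spec("1 2 3 4 => 4 3 2", ["de", "pijp", "uit", "gaan"]))
```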
TICL
TICL = ICT(ICL) =
ICT(<de, pijp, uit, gaan>) =
<gaan, uit, pijp> =
< Bf($aV_00_ga), Bf($s_prep1286700), Bf($aN_00_pijp) >
Syntactic Structure
Rdecl
  [Rperf
    [Rsubst(j)
      [Rsent
        [Rsubst(i)
          [RVP
            [$aV_00_ga,
             RPPpost
               [$s_prep1286800,
                VAR_i
               ]
            ],
           RNPdef [$aN_00_boot]
          ],
         VAR_j
        ],
       RNP[$hij_PRON]
      ]
    ]
  ]
Apply IPT
Rsubst,i
  [RVP
    [$aV_00_ga,
     RPPpost
       [$s_prep1286800,
        VAR_i
       ]
    ],
   RNPdef [$aN_00_boot]
  ]
ICIL
ICIL = < $aV_00_ga, $s_prep1286800, $aN_00_boot >
TICL
ICT(ICL) =
ICT(<de, boot, in, gaan>)=
<gaan, in, boot>
TICL check
<bf($aV_00_ga), bf($s_prep1286800),
bf($aN_00_boot) > =
TICL =
<gaan, in, boot>