Prague Dependency Treebank 1.0
CD-ROM PRESENTATION, Dec 18, 2000

Prague Dependency Treebank 1.0 - Functional Generative Description

Functional Generative Description
  theoretical framework based on the findings of European structural linguistics, esp. of the classical Prague School
  methodological requirements of a formal description
  levels:
    tectogrammatical (underlying) representations (TRs) with dependency-based syntax
    morphemics
    phonemics and phonetics
  TRs: see Sgall, Hajičová and Panevová 1986; formally specified by Petkevič, also in a declarative way

Dependency tree
  My younger brother arrived there yesterday.
  Linearized form, one-to-one relation:
    ((I)Appurt (younger)Rstr brother)Act arrive.Pret.Indic (Dir there) (Temp yesterday)

Dependency Tree
  labels - lexical meanings (abstract symbols) with indices
    functors - subscripts at parentheses, oriented towards the head
    grammatemes - values of morphological categories: Tense, Modality, Number, Definiteness, etc.
  projectivity
  valency
    arguments (inner participants) and adjuncts (circumstantials or 'free modifications')
    obligatory and optional with a given head, deletable or not

Dependency Tree
  participants (arguments) of verbs
    Actor/Bearer (underlying subject)
    Objective (Patient, underlying direct object)
    Addressee (underlying indirect object)
    Effect ('second' object: to choose sb. as sth.)
    Origin (to make sth. out of sth.)
  adjuncts
    Locative, several Directional and Temporal modifications
    Condition, Means, Manner, etc.

Dependency Tree
  Complementations dependent mainly on nouns
  inner participants
    Material (Partitive): two baskets of sth.
    Identity: the river Danube; the notion of operator
  free modifications
    Possession (Appurtenance): my table; Jim's brother
    Restrictive: rich man
    Descriptive: the Swedes, who are a Scandinavian nation

Dependency Tree
  syntactic grammatemes
    Loc, Dir - in, on, under, between...
    Regard - with, without
  operational (testable) criteria for distinguishing arguments from adjuncts, and from each other
  deletability (dialogue test)

Simplified valency frames
  read      V   Act Addr Obj
  change    V   Act Obj Orig Eff
  give      V   Act Addr Obj
  brother   N   Appurt
  man       N
  glass     N   Material
  full      A   Material
  obligatory complementations in blue (colour is not preserved in this transcript)

Topic-focus articulation
  [Figure: dependency tree; surviving labels: T, "there", "young"]
  contextual boundness
    main verb CB/NB (T/F)
    dependents to the left/right
  communicative dynamism
    left-right (mother, sisters, transitive)
    partial ordering
  the left-to-right order of nodes, together with the index T or (prototypically) F, indicates the TFA of the sentence (of the TR)
  underlying word order
    left-right linear ordering

Topic-focus articulation
  [Figure: dependency tree; surviving labels: T, F, "yesterday", "there", "young"]
  TFA - one of the basic aspects of underlying structures

Complex sentence
  My brother, whom you know, arrived there yesterday.
  a subordinated (dependent) clause (i.e. its main verb) depends on a word contained in its governing clause

Complex sentence
  Martin came there late, since he had to accompany his sick mother.
  function words (synsemantic) are viewed as function morphemes, syntactically fixed to certain lexical (autosemantic) words - prepositions and articles to nouns, conjunctions and auxiliaries to verbs

Complex sentence
  Martin arrived late to the session, since he had to accompany his sick mother.
  schematically (morphemes):
    Martin arrive.ed late to the session since he have.ed to accompany he.s sick mother.
  dot - close connection of morphemes ('semes')

  deleted items restored
  order of items - difference between 'underlying' and surface (morphemic) word order
  transductive components - Panevová, Oliva, Borota
  coordination (multidimensional)
    Jim and Mary, who have two children, went to Boston.
    the linearized notation is adequate:
      ((Jim Mary)Conj ((who)Act have (Pat (two)Rstr children)))Act went (Dir Boston)
  structures close to Boolean, i.e. no complex 'innate properties' specific for natural language are needed.

Prague Dependency Treebank - corpus annotation
  an intermediate level - 'analytical' representations
    dependency trees, not always projective
    nodes for all word tokens, even for punctuation marks
  tectogrammatical tree: coordinating conjunction as the head

Prague Dependency Treebank 1.0 - Morphological Layer

ACKNOWLEDGEMENTS

ANNOTATED CORPORA
  PDT version 1.0, 2000 (1996 - 2000)
  Penn Treebank, release 3, 1999 (1989 - 1999)

TAG SETs
  Czech - ambiguous inflective language
    nový, nového, novému, novém, novým, nová, nové, novou, nových, novým, novými, …
    novější, novějšího, novějšímu, novějším, …
    nejnovější, nejnovějšího, nejnovějšímu, nejnovějším, …
    nejnovějších, nejnovějším, …
  English - language with poor inflection
    work, works, worked, working

TEXT SOURCES
  [Two-column slide; items listed in extraction order]
  Lidové noviny
  ´88, ´89
  WSJ articles
  Mladá Fronta Dnes
  Air Travel Information System transcripts
  Vesmír
  Českomoravský Profit
  Brown Corpus
  Switchboard transcripts
  ...taken from Czech National Corpus

ANNOTATION STRATEGY - Penn Treebank
  TEXT
  Ken Church's stochastic tagger, Eric Brill's transformation tagger
  corrections by annotator (GNU Emacs Lisp based package)

ANNOTATION STRATEGY - PDT
  Automatic Morphological Analyzer (AMA)
  two independent annotators; Linux, Win tools
  differences resolved by third annotator
  comparison with the current AMA; manual resolution; Win tools

INTERNAL FORMAT
  PDT: SGML coding, csts dtd
  Penn Treebank: word/tag(|tag)*

SAMPLES
  PDT (Czech: "Pokus o zázrak." = "An attempt at a miracle."):
    <s id="ln95040:020-p1s1">
    <f>Pokus<l>pokus<t>NNIS1-----A---
    <f>o<l>o<t>RR--4---------
    <f>zázrak<l>zázrak<t>NNIS4-----A---
    <d>.<l>.<t>Z:------------
  Penn Treebank:
    The/DT envelope/NN arrives/VBZ in/IN the/DT mail/NN ./.
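The relation between the two sample formats above can be illustrated with a minimal Python sketch. It assumes only the four markers visible in the PDT sample (`<f>` form, `<d>` punctuation, `<l>` lemma, `<t>` tag); the real csts DTD defines more elements, and the function names here are hypothetical, not those of any tool shipped on the CD-ROM.

```python
import re

# One token = <f> (word form) or <d> (punctuation), followed by <l>lemma and <t>tag.
# ASSUMPTION: only the markers seen in the slide sample are handled.
TOKEN_RE = re.compile(
    r"<(?P<kind>[fd])>(?P<form>[^<]+)<l>(?P<lemma>[^<]+)<t>(?P<tag>[^<\s]+)"
)

def parse_csts_tokens(text):
    """Return a list of (form, lemma, tag) triples found in the text."""
    return [(m["form"], m["lemma"], m["tag"]) for m in TOKEN_RE.finditer(text)]

def to_word_tag(tokens):
    """Render tokens in the Penn-style word/tag notation."""
    return " ".join(f"{form}/{tag}" for form, _lemma, tag in tokens)

def to_word_lemma_tag(tokens):
    """Render tokens as word/lemma/tag."""
    return " ".join(f"{form}/{lemma}/{tag}" for form, lemma, tag in tokens)

# The sample sentence from the slide, joined into one string.
sample = (
    "<f>Pokus<l>pokus<t>NNIS1-----A---"
    "<f>o<l>o<t>RR--4---------"
    "<f>zázrak<l>zázrak<t>NNIS4-----A---"
    "<d>.<l>.<t>Z:------------"
)

if __name__ == "__main__":
    tokens = parse_csts_tokens(sample)
    print(to_word_tag(tokens))
    print(to_word_lemma_tag(tokens))
```

Keeping the lemma in the parsed triple means the same token list can feed either output notation.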
CONVERSION
  SGML coding -> word/tag: pdt2wsj.pl
  SGML coding -> word/lemma/tag: pdt2wsjFLT.pl

DATA SIZE
                             # word tokens   # sentences
  PDT 1.0                    1 730K          112K
  Penn Treebank release 3    4 600K          350K

DATA SETs of MORPHOLOGICALLY ANNOTATED DATA
                               # tokens / # sentences
  for tagging only
    training data              1 470K / 95K
    development test data        130K / 8K
    evaluation test data         127K / 8K
  for parsing (preprocessing step)
    training data                475K / 29K
    development test data        130K / 8K
    evaluation test data         127K / 8K

TOOLS
  Automatic Morphological Analyser/Generator of Czech
    HMAnalyze.pl, HMGenerate.pl
    Dictionary: CZE_a
    Remote Access
  Czech Taggers
    HMM
    Exponential

Prague Dependency Treebank 1.0 - Analytical Layer in PDT

Introduction
  Input: morphologically tagged sentences
  Graph Editor: "user-friendly" software
  Output: ATS structure
    "surface" syntax
    tree structure
    nodes labelled by the analytical functions

Two stages (chronologically)
  (A) manual "analytic" annotation (ATS)
      training data for (B)(a)
  (B) (a) semiautomatic procedure (Collins' parser)
      (b) manual correcting of (B)(a)

Constraints and limitations
  any string has a node of its own
    word-form, punctuation mark, etc.
    AuxV, AuxP, AuxC, AuxX, AuxG…
  reflecting the coordination and apposition relations
    so-called third dimension of the graph in the plain tree (X_Co, X_Ap, X_Pa, where X is one of the analytic functions, such as Sb, Obj, Adv, etc.)
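The X_Co / X_Ap / X_Pa convention above can be sketched as a small label-splitting helper in Python. This is an illustration only: the function name is hypothetical, labels such as "Sb_Co" are built from the pattern on the slide, and the gloss of Pa as parenthesis comes from the PDT annotation guidelines rather than from the slide itself.

```python
# Suffixes that encode the "third dimension" on an analytic function label.
# ASSUMPTION: the "Pa" gloss is taken from PDT guidelines, not from this slide.
SUFFIXES = {"Co": "coordination", "Ap": "apposition", "Pa": "parenthesis"}

def split_afun(label):
    """Split 'Sb_Co' into ('Sb', 'Co'); return (label, None) if no known suffix."""
    base, sep, suffix = label.rpartition("_")
    if sep and suffix in SUFFIXES:
        return base, suffix
    return label, None

if __name__ == "__main__":
    for label in ["Sb_Co", "Obj_Ap", "Adv_Pa", "AuxV"]:
        base, suffix = split_afun(label)
        relation = SUFFIXES.get(suffix, "plain dependency")
        print(f"{label}: function={base}, relation={relation}")
```

Only the three listed suffixes are stripped, so any other label containing an underscore is returned unchanged.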
Constraints and limitations
  no missing nodes (on the surface) can be added
    analytic function Ex_D is used
  relations between semi-automatic and manual procedure
    80% of edges are established correctly automatically

Project organization
  team consisting of 5-6 annotators
  handbook for ATS structure annotation
  1999: 100000 sentences on ATS
  tectogrammatical annotation follows

[Figure: analytical tree for the sentence below; surviving labels: AuxT, Adv]
  První restituční zákon českého parlamentu se do sněmovních lavic může vrátit jako bumerang.
  (English: "The first restitution law of the Czech parliament may return to the parliamentary benches like a boomerang.")

Prague Dependency Treebank 1.0 - From the Analytical towards the Tectogrammatical layer

Introduction
  ATS annotation
    nodes: word forms, punctuation, graphical symbols
    edges: surface relations
  TGTS annotation
    nodes: autosemantic words, deletions
    edges: deep layer functions

Annotation process
  [Flow diagram; box labels listed in extraction order]
  Input Czech sentence
  Tokenization
  Morphological tagging and lexical disambiguation
  ATS
  Tree structure pruning
  Syntactic parsing and analytic function assignment
  PDT1.0
  Attribute assignments

TGTS Transition procedure
  deterministic procedure operating on trees
  macro language for Graph Editor (C++ like)
  automatic changes & tools for annotators
  Requirements
    new attributes for tectogrammatical layer
    ATS is recoverable from TGTS
    automatized to a maximally high degree

New attributes
  trlemma - lemma of the original node or lemma composed of joined nodes
  morphological grammatemes
    gender, number, degree of comparison, tense, aspect, iterativeness, verbal modality, deontic modality, sentence modality
  position of the node
  functor, topic-focus articulation, syntactic grammateme, type of relation (dependency, coordination, apposition), phraseme, deletion, quoted word, direct speech, coreference, antecedent

Tree Structure Pruning
  U toho, kdo začíná opravdu od nuly, není daňový výnos pro stát podstatný.
  For those, who start actually at zero, the tax outcome for the state is not substantial.

Tree Structure Pruning
  [Figure: tree for the same sentence; surviving label: REG]

Verbal Nodes
  [Figure labels: PRED, verbmod=CDN, deontmod=HRT]
  … podnikatelé by měli mít daně …
  … entrepreneurs should have (their) taxes …

Attribute Assignments
  prepositions
    stored as fw attribute
  quoted words
    clause in quotes -> DSP
    one pair of quotes in the sentence -> DSPP
    string in quotes -> QUOT
  gender, number, tense, degcmp, aspect
    default values

Macros for Annotators
  keyboard shortcuts (in Graph editor)
  structure changes
    hide/recover nodes
    merge nodes
    add new nodes
  functor assignments

Manual annotation
  structure checking
  functors
  deletions of obligatory modifications
  feedback for formulating the handbook for annotators

Prague Dependency Treebank 1.0 - Tectogrammatical Layer

[Figure: tectogrammatical tree; surviving labels: F T C T T T T T T]

[Figure: tectogrammatical tree; surviving label: F]
  Jirka se včera opil do němoty a Honza dneska.
  George himself yesterday drank to silence and Honza today.

Attributes of Coreferential relations
  only in MC
  attribute   values
  coref       the lemma of the antecedent
  corsnt      NIL - in the same sentence
              PREV1 ... PREVi - position of the sentence which includes the antecedent
  grammatical coreference
  antec       the functor of the antecedent

Example
  Honza slíbil přijít včas.
  Honza promised to come in time.
  coref: Honza
  corsnt: NIL
  cornum: 1
  antec: ACT
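
The coreference attributes above can be sketched as a small Python record. This is a hedged illustration: the class and function names are hypothetical, reading PREVi as "i sentences before the current one" is an assumption about the slide's wording, and cornum is left out because the slides show a value for it but do not define it.

```python
from dataclasses import dataclass

@dataclass
class Coreference:
    coref: str   # lemma of the antecedent
    corsnt: str  # "NIL" (same sentence) or "PREV1" ... "PREVi"
    antec: str   # functor of the antecedent (grammatical coreference)

def antecedent_sentence(current_index, corsnt):
    """Return the index of the sentence that holds the antecedent.

    ASSUMPTION: PREVi is read as i sentences back from the current sentence.
    """
    if corsnt == "NIL":
        return current_index
    if corsnt.startswith("PREV") and corsnt[4:].isdigit():
        return current_index - int(corsnt[4:])
    raise ValueError(f"unexpected corsnt value: {corsnt!r}")

# Values from the example slide (cornum omitted, see above).
example = Coreference(coref="Honza", corsnt="NIL", antec="ACT")

if __name__ == "__main__":
    print(example)
    print(antecedent_sentence(7, example.corsnt))
```

Raising an error on any other corsnt value keeps unexpected annotations visible instead of silently treating them as same-sentence coreference.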