Premises of a Morphological Dictionary of Romanian

Emil Ionescu

1. The Aim

The aim of the present enterprise is the construction of a dictionary of annotated inflected word forms for Romanian. The dictionary (the file Romdict.txt) is expressed in the DELAF format (INTEX, Silberztein 1993) and is accompanied by a second file, Definitions.txt, which describes the labels used to characterize the inflected forms. Both files serve as resources for the analyzer constructed by the colleagues from Bulgaria. The analyzer is available at http://www.larflast.bas.bg/balric/tag/default.htm

The dictionary is a full-form lexicon. It is therefore based on morphological features. It is not ‘inflectionally-oriented’ (because it does not deal with the inner structure of the word), nor is it based on principles of derivational morphology¹. The main concern was to capture the properties of word forms in terms of features specific to Romanian morphology. This orientation is worth noting because it is representative of the present-day trend in HLT, which takes into account the relation between texts and their ultimate building blocks, the words.

¹ For other tools and resources concerning the morphology of Romanian, see Tufis, L. Diaconu, Barbu and C. Diaconu (1996), Tufis, L. Diaconu, C. Diaconu and Barbu (1996), Peev, Bibolar and Endre (1996), Curteanu, Holban, I. Pavaloi, C. Pavaloi, Negulescu and Todirascu (1996), Vuscan (1996), all of which are papers in Dan Tufis (ed.), “Limbaj si Tehnologie” (Language and Technology), Editura Academiei Romane, Bucuresti, 1996.

2. The General Structure of the Dictionary

An entry in this dictionary has the following structure:

    Word form
    Lemma
    Characteristics: lemma characteristics, word form characteristics

A sample:

    Word form = agresivă (‘aggressive’, singular, feminine, nominative or accusative)
    Lemma = agresiv
    Characteristics:
        Lemma characteristics: A (= adjective) + GR (= gradable)
        Word form characteristics: ufsr (= undefinite, feminine, singular, nominative or accusative)

The dictionary comprises only synthetic forms. Analytic forms are considered collocations. Synthetic forms may in turn be simple or compound: agresivă, for instance, is a simple synthetic form, but nici_un (the determiner ‘no’) is compound. Compound forms are written with an underscore. In its present form, the dictionary contains 6768 entries, which cover all the parts of speech of Romanian.

3. The Phases of the Research

The main steps that led to the results mentioned above were the following:

    Establishing the corpus of the dictionary
    Identifying the lexical forms which belong to the corpus
    Building up the inflectional paradigms (wherever needed)
    Establishing the morphological features involved in the characterization of lemmas and word forms

3.1. The Corpus

The corpus is composed of a set of 13 articles collected from one of the most important Romanian newspapers, “Evenimentul Zilei” (The Daily Event). It is a representative corpus of present-day standard Romanian.

3.2. Identifying Lexical Forms

The newspaper articles supplied 1478 word forms, which were isolated through tokenization. Tokenization was performed with a tokenizer built within the MULTEX Project developed at the Romanian Academy Center of Artificial Intelligence (RACAI), available at www.racai.ro. A sample of a tokenized text is given below:

    #el# (he)
    #a# (has)
    #afirmat# (stated)
    #că# (that)
    #singura# (the only)
    #soluţie# (solution)
    #pentru_ca# (for)
    #populaţia# (the population)
    #să# (to)
    #-şi# (its own)
    #poată# (be able)
    #achita# (to pay)
    #facturile# (the bills)
    #este# (is)
    #creşterea# (the augmentation)
    #reală# (real)
    #a# (of)
    #salariilor# (the salaries)
    .
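The tokenized sample above can be read back into token/gloss pairs with a few lines of code. The following is only a minimal sketch: it assumes every token is enclosed in a pair of `#` marks and followed by a parenthesized English gloss, as in the sample. The function name `parse_tokenized` is hypothetical and is not part of the MULTEX tokenizer.

```python
import re

# One token between '#' marks, followed by its gloss in parentheses.
# Stray spaces around the token (as in the printed sample) are tolerated.
TOKEN_RE = re.compile(r"#\s*([^#()]+?)\s*#\s*\(([^)]*)\)")


def parse_tokenized(text):
    """Return (token, gloss) pairs from a '#token# (gloss)' string."""
    return [(tok, gloss.strip()) for tok, gloss in TOKEN_RE.findall(text)]


sample = (
    "#el# (he) #a# (has) #afirmat# (stated) #că# (that) "
    "#pentru_ca# (for) #-şi# (its own)"
)
pairs = parse_tokenized(sample)

# Compound forms are the ones written with an underscore.
compounds = [tok for tok, _ in pairs if "_" in tok]
```

Here `compounds` picks out pentru_ca, the only token in this excerpt written with an underscore.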
The word forms identified by tokenization represent 523 lemmas covering both inflectional and non-inflectional parts of speech of Romanian.

3.3. Inflectional Paradigms

For every lemma representing an inflectional part of speech, the corresponding inflectional paradigm was built. This operation was performed manually. The result was 6768 word forms, which constitute the present form of the dictionary.

3.4. Morphological Features

3.4.1. Establishing the set of morphological features for Romanian represented the linguistic part of the research. In this respect, the approach used in constructing the Bulgarian dictionary was followed. Accordingly, a feature was considered morphologically relevant if it was found important for the production and/or distinction of the paradigm members. For instance, the feature A(uxiliary) is in Romanian a syntactic feature of verbs. Nevertheless, it also counts as a morphological one, because the paradigms of the auxiliary verbs a avea (to have) and a vrea (to want) are distinct from the paradigms of the corresponding ‘main’ verbs a avea and a vrea. Likewise, in the case of adjectives, gender, a semantic feature, must also be considered morphologically relevant (just as in Bulgarian, but unlike English), because it serves to distinguish between members of the same paradigm².

3.4.2. With these criteria at hand, twelve (main) parts of speech were identified: noun, verb, adjective, determiner, pronoun, numeral, article, adverb, preposition, interjection, particle and abbreviation. The last two are not usually encountered in morphological descriptions of Romanian, so some explanation is in order. The category of particles contains certain words with special distribution and behavior: the adverb of negation nu, the ‘conjunction’ să (which marks the subjunctive mood), and the so-called ‘morpheme’ a (which marks a certain form of the infinitive).
As for abbreviations (for instance, tel, from telefon), the decision to treat them as a distinct category stems from the fact that, in general, an abbreviation differs substantially from its corresponding ‘full’ part of speech. For example, tel has no case inflection, nor can it be used in the plural. In any case, given the marginal character of abbreviations (especially from the quantitative point of view), their unexpected presence in the morphology of Romanian does not raise serious problems.

² It is precisely for this reason that the very important verb feature transitivity is not considered here a morphological feature: it contributes in no way to the formation of the verb paradigm in Romanian.

3.4.3. The same feature may count as a lemma feature for one part of speech but as a word form feature for another. This is the case, for instance, with gender, which characterizes the lemma of nouns and the word form of adjectives.

3.4.4. There is a slight difference in the way the Bulgarian and the Romanian dictionaries deal with homonymous morphological forms. The difference may be illustrated by the following example. For the word form fly, the analysis in the Bulgarian dictionary is pr12sg:123pl (= present tense, 1st or 2nd person, singular, or 1st or 2nd or 3rd person, plural), while in the Romanian one it is fly pr12sg; fly 123pl. In the Romanian dictionary the word form is therefore registered twice (if the homonymy is illustrated by two word forms).
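The difference between the two treatments of homonymy can be sketched as a pair of conversions between a compact analysis and a list of separate entries. This is only an illustration under stated assumptions: the colon is taken as the separator between readings, as in the example pr12sg:123pl, and each reading is copied as written, so the tense prefix is not repeated on the second entry. The function names `split_homonyms` and `merge_homonyms` are hypothetical.

```python
from collections import defaultdict


def split_homonyms(form, compact):
    """Romanian-style treatment: one (form, reading) entry per reading."""
    return [(form, reading) for reading in compact.split(":")]


def merge_homonyms(entries):
    """Bulgarian-style treatment: one compact analysis per word form."""
    merged = defaultdict(list)
    for form, reading in entries:
        merged[form].append(reading)
    return {form: ":".join(readings) for form, readings in merged.items()}


# The word form 'fly' is registered twice after splitting.
entries = split_homonyms("fly", "pr12sg:123pl")
compact = merge_homonyms(entries)
```

Splitting yields two entries for fly, and merging them restores the single compact analysis.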
3.4.5. The set of morphological features of Romanian is as follows:

Nouns
    Lemma Features
        Noun
        Common or Proper
        Masculine or Feminine or Neuter
    Word Form Features
        Singular or Plural
        Definite or Undefinite
        Nominative or Accusative; Genitive or Dative

Verbs
    Lemma Features
        Verb
        Auxiliary (default feature)
    Word Form Features
        Indicative or Subjunctive or Imperative or Infinitive or Gerund or Participle
        Present or Imperfect or Recent Perfect or Past Perfect
        1st person or 2nd person or 3rd person
        Singular or Plural

Adjectives
    Lemma Features
        Adjective
        Gradable or Non-gradable
    Word Form Features
        Definite or Undefinite
        Masculine or Feminine
        Singular or Plural
        Nominative or Accusative or Dative or Genitive

Pronouns
    Lemma Features
        Pronoun
        Personal or Reflexive or Demonstrative or Negative or Indefinite or Interro-Relative
    Word Form Features
        For Personal Pronouns
            Strong or Weak Form
            1st person or 2nd person or 3rd person
            Singular or Plural
            Nominative or Accusative or Dative or Genitive
            Masculine or Feminine
        For Reflexive Pronouns
            Strong or Weak Form
            1st person or 2nd person or 3rd person
            Singular or Plural
            Nominative or Accusative or Dative
        For Demonstrative Pronouns
            Singular or Plural
            Nominative or Accusative or Dative or Genitive
        For Negative Pronouns
            Singular or Plural
            Nominative or Accusative or Dative or Genitive
        For Interro-Relative Pronouns
            Nominative or Accusative or Dative or Genitive
        For Indefinite Pronouns
            Nominative or Accusative or Dative or Genitive

Determiners
    Lemma Features
        Determiner
        Demonstrative or Possessive or Negative or Indefinite or Interro-Relative or Emphatic
    Word Form Features
        For Demonstrative Determiners
            Singular or Plural
            Nominative or Accusative or Dative or Genitive
        For Negative Determiners
            Singular or Plural
            Nominative or Accusative or Dative or Genitive
        For Indefinite Determiners
            Nominative or Accusative or Dative or Genitive
        For Interro-Relative Determiners
            Nominative or Accusative or Dative or Genitive
        For Possessive Determiners
            1st person or 2nd person or 3rd person
            Singular or Plural
        For Emphatic Determiners
            1st person or 2nd person or 3rd person
            Singular or Plural
            Nominative or Accusative or Dative or Genitive

Articles
    Lemma Features
        Article
        Demonstrative or Possessive or Indefinite
    Word Form Features
        Masculine or Feminine
        Singular or Plural
        Nominative or Accusative or Dative or Genitive

Numerals
    Lemma Features
        Numeral
        Cardinal or Ordinal
        Masculine or Feminine
        Singular or Plural
    Word Form Features

Adverbs
    Lemma Features
        Adverb
        General or Verbal or Interro-Relative
    Word Form Features: none

Conjunctions
    Lemma Features
        Conjunction
        Coordinating or Subordinating
    Word Form Features: none
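The entry structure of Section 2 (word form, lemma, lemma characteristics, word form characteristics) can be illustrated with a small parser. This is only a sketch under assumptions. It assumes the conventional DELAF line layout `form,lemma.CODE+FEATURE:inflection`; the actual lines of Romdict.txt may differ in detail. The letter meanings for `ufsr` are taken from the sample entry in Section 2, but the reading of one letter per feature is an assumption about how Definitions.txt is organized. The names `Entry`, `parse_delaf` and `ADJ_CODES` are hypothetical.

```python
from dataclasses import dataclass

# Letter codes from the sample entry (ufsr). Treating each letter as one
# feature value is an assumption, not something Definitions.txt confirms.
ADJ_CODES = {
    "u": "undefinite",
    "f": "feminine",
    "s": "singular",
    "r": "nominative or accusative",
}


@dataclass
class Entry:
    form: str             # inflected word form
    lemma: str            # dictionary lemma
    lemma_features: list  # e.g. ["A", "GR"]
    form_features: str    # e.g. "ufsr"


def parse_delaf(line):
    """Parse one 'form,lemma.CODE+FEATURE:inflection' line (assumed layout)."""
    form, rest = line.split(",", 1)
    lemma_part, form_features = rest.split(":", 1)
    lemma, tags = lemma_part.split(".", 1)
    return Entry(form, lemma, tags.split("+"), form_features)


entry = parse_delaf("agresivă,agresiv.A+GR:ufsr")
decoded = [ADJ_CODES[code] for code in entry.form_features]
```

Keeping lemma features and word form features in separate fields mirrors the distinction drawn in 3.4.3, where the same feature (gender, for instance) may belong to the lemma for one part of speech and to the word form for another.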