Premises of a Morphological Dictionary of Romanian
Emil Ionescu
1. The Aim
The aim of the present enterprise is the construction of a dictionary of annotated inflected
word forms for Romanian. The dictionary (the file Romdict.txt) is expressed in the DELAF
format (INTEX, Silberztein 1993) and is accompanied by a second file (Definitions.txt), which
describes the labels used to characterize the inflected forms. Both files
serve as resources for the analyzer constructed by our colleagues from Bulgaria. The analyzer
is available at http://www.larflast.bas.bg/balric/tag/default.htm
The dictionary is a full-form lexicon and is therefore based on morphological features. It is not
‘inflectionally oriented’ (it does not deal with the inner structure of the
word), nor is it based on principles of derivational morphology [1]. The main concern was to
capture the properties of word forms in terms of features specific to Romanian morphology.
This orientation is worth noting, because it is representative of the present-day trend in
HLT: a trend which takes into account the relation between texts and their ultimate building
blocks, the words.
2. The General Structure of the Dictionary
An entry in this dictionary has the following structure:
- Word form
- Lemma
- Characteristics: lemma characteristics, word form characteristics
A sample:
Word form = agresivă (‘aggressive’: singular, feminine, nominative or accusative)
Lemma = agresiv
Characteristics:
  Lemma characteristics: A (= adjective) + GR (= gradable)
  Word form characteristics: ufsr (= undefinite, feminine, singular, nominative or accusative)
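In DELAF-style dictionaries an entry is conventionally written on one line as form,lemma.Category+features:codes. Assuming Romdict.txt follows that convention, the sample above would correspond to a line such as the one used below; that line is an illustration and is not quoted from the file. The names Entry and parse_delaf_line are hypothetical, and the sketch ignores escaped commas or dots inside forms.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    word_form: str
    lemma: str
    pos: str               # first lemma label, e.g. "A" for adjective
    lemma_features: list   # remaining lemma labels, e.g. ["GR"]
    analyses: list         # word form label strings, e.g. ["ufsr"]


def parse_delaf_line(line: str) -> Entry:
    """Parse 'form,lemma.POS+FEAT:code1:code2' (assumed DELAF-style layout)."""
    word_form, rest = line.strip().split(",", 1)
    lemma, labels = rest.split(".", 1)
    # Everything before the first ':' describes the lemma,
    # every ':'-separated item after it is one word form analysis.
    lemma_part, *analyses = labels.split(":")
    pos, *lemma_features = lemma_part.split("+")
    return Entry(word_form, lemma, pos, lemma_features, analyses)


# Illustrative line built from the sample entry; not taken from Romdict.txt.
entry = parse_delaf_line("agresivă,agresiv.A+GR:ufsr")
print(entry)
```

The split into lemma labels and word form labels mirrors the two kinds of characteristics listed in the entry structure.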
[1] For other tools and resources concerning the morphology of Romanian, see Tufis, L.
Diaconu, Barbu and C. Diaconu (1996), Tufis, L. Diaconu, C. Diaconu and Barbu (1996),
Peev, Bibolar and Endre (1996), Curteanu, Holban, I. Pavaloi, C. Pavaloi, Negulescu and
Todirascu (1996), Vuscan (1996), all of which are papers in
Dan Tufis (ed.), “Limbaj si Tehnologie” (Language and Technology), Editura Academiei
Romane, Bucuresti, 1996.
The dictionary comprises only synthetic forms; analytic forms are treated as
collocations. Synthetic forms may in turn be simple or compound: agresivă, for instance, is a
simple synthetic form, whereas nici_un (the determiner ‘no’) is compound. Compound forms are
written with an underscore.
In its present form, the dictionary contains 6768 entries, which cover all the parts of speech of
Romanian.
3. The Phases of the Research
The main steps which led to the results mentioned above were the following:
- Establishing the corpus of the dictionary
- Identifying the lexical forms which belong to the corpus
- Building up the inflectional paradigms (wherever needed)
- Establishing the morphological features involved in the characterization of lemmas
  and word forms.
3.1. The Corpus
The corpus consists of 13 articles collected from one of the most important
Romanian newspapers, “Evenimentul Zilei” (The Daily Event). It is representative of
present-day standard Romanian.
3.2. Identifying Lexical Forms
The newspaper articles supplied 1478 word forms, which were isolated through
tokenization. Tokenization was performed with a tokenizer built within the MULTEX
Project developed at the Romanian Academy Center of Artificial Intelligence (RACAI),
available at www.racai.ro. A sample of tokenized text is given below:
#el# (he)
#a# (has)
#afirmat# (stated)
#că# (that)
#singura# (the only)
#soluţie# (solution)
#pentru_ca# (for)
#populaţia# (the population)
#să# (to)
#-şi# (its own)
#poată# (be able)
#achita# (to pay)
#facturile# (the bills)
#este# (is)
#creşterea# (the augmentation)
#reală# (real)
#a# (of)
#salariilor# (the salaries)
.
The word forms identified by tokenization represent 523 lemmas covering both inflectional
and non-inflectional parts of speech of Romanian.
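The tokenized sample above can be read back mechanically. The following is a minimal sketch, assuming that each token is delimited by # as in the sample and that the parenthesized English glosses are annotations for the reader rather than tokenizer output. The name read_tokens is hypothetical, and this is not the MULTEX tokenizer itself.

```python
import re

# A token is whatever sits between two '#' marks; stray spaces are tolerated.
TOKEN = re.compile(r"#\s*([^#\s]+)\s*#")


def read_tokens(lines):
    """Return the tokens found in '#token# (gloss)' lines, in order."""
    tokens = []
    for line in lines:
        match = TOKEN.search(line)
        if match:                      # lines without '#...#' are skipped
            tokens.append(match.group(1))
    return tokens


# A few lines taken from the sample above.
sample = [
    "#el# (he)",
    "#a# (has)",
    "#afirmat# (stated)",
    "#că# (that)",
    "#pentru_ca# (for)",
    "#a# (of)",
]

tokens = read_tokens(sample)
word_forms = set(tokens)                       # distinct word forms
compounds = [t for t in tokens if "_" in t]    # compound forms use '_'
print(len(tokens), len(word_forms), compounds)
```

Note that a occurs twice in the sample with different glosses; counting distinct word forms collapses the two occurrences, and it is the dictionary that has to separate the analyses.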
3.3. Inflectional Paradigms
For every lemma representing an inflectional part of speech, the corresponding inflectional
paradigm was built up. This was done manually. The result was
a set of 6768 word forms, which constitute the present form of the dictionary.
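Once a paradigm has been written out by hand, turning it into dictionary lines is mechanical. The sketch below assumes the DELAF-style layout form,lemma.Category+features:code. The function name paradigm_to_delaf is hypothetical. Only the code ufsr comes from the sample entry; the code umsr is an assumption extrapolated from it (m for masculine), and the two-member paradigm is deliberately incomplete.

```python
def paradigm_to_delaf(lemma, lemma_labels, paradigm):
    """Serialise a hand-built paradigm as one dictionary line per word form.

    paradigm is a list of (word_form, word_form_code) pairs.
    """
    head = "+".join(lemma_labels)
    return [f"{form},{lemma}.{head}:{code}" for form, code in paradigm]


# Hand-written fragment of the paradigm of agresiv.
# "umsr" is an assumed code, not one attested in the source.
fragment = [
    ("agresiv", "umsr"),
    ("agresivă", "ufsr"),
]

lines = paradigm_to_delaf("agresiv", ["A", "GR"], fragment)
for line in lines:
    print(line)
```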
3.4. Morphological Features
3.4.1. Establishing the set of morphological features for Romanian was the linguistic
part of the research. Here the approach used in constructing the Bulgarian dictionary was
followed: a feature was considered morphologically relevant if it was
found to be important for producing and/or distinguishing the members of a paradigm.
For instance, the feature A(uxiliary) is in Romanian a syntactic feature of verbs.
Nevertheless, it also counts as a morphological one, because the paradigms of the auxiliary
verbs a avea (to have) and a vrea (to want) are distinct from the paradigms of the
corresponding ‘main’ verbs a avea and a vrea. Likewise, for adjectives, gender
(a semantic feature) has to be considered morphologically relevant, as in
Bulgarian but unlike in English, because it serves to distinguish between members of the same
paradigm [2].
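The relevance criterion can be made concrete with a small check: a feature counts as relevant if changing its value alone changes the word form somewhere in the paradigm. This is a sketch of that reading only. The function name distinguishes and the feature names are hypothetical, and the English paradigm is annotated with gender purely for the sake of the comparison.

```python
from collections import defaultdict


def distinguishes(paradigm, feature):
    """True if varying only `feature` yields different word forms somewhere.

    paradigm is a list of (word_form, features) pairs, features being a dict.
    """
    groups = defaultdict(set)
    for form, feats in paradigm:
        if feature not in feats:
            continue
        # Members that agree on every other feature are compared together.
        rest = tuple(sorted((k, v) for k, v in feats.items() if k != feature))
        groups[rest].add((feats[feature], form))
    for members in groups.values():
        values = {value for value, _ in members}
        forms = {form for _, form in members}
        if len(values) > 1 and len(forms) > 1:
            return True
    return False


romanian = [
    ("agresiv", {"gender": "masculine", "number": "singular"}),
    ("agresivă", {"gender": "feminine", "number": "singular"}),
]
english = [
    ("aggressive", {"gender": "masculine", "number": "singular"}),
    ("aggressive", {"gender": "feminine", "number": "singular"}),
]

print(distinguishes(romanian, "gender"))
print(distinguishes(english, "gender"))
```

Under this reading, gender is relevant for the Romanian adjective and irrelevant for the English one, which matches the contrast drawn in the text.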
3.4.2. With these criteria at hand, twelve (main) parts of speech were identified: noun, verb,
adjective, determiner, pronoun, numeral, article, adverb, preposition, interjection, particle
and abbreviation.
The last two are not usually encountered in morphological descriptions of Romanian, so
some explanation is in order. The category of particles contains certain words with special
distribution and behavior: the adverb of negation nu, the ‘conjunction’ să (which
marks the subjunctive mood), and the so-called ‘morpheme’ a (which marks a certain form of
the infinitive).
As for abbreviations (for instance, tel, from telefon), the decision to treat them as a distinct
category stems from the fact that, in general, an abbreviation differs substantially from its
corresponding ‘full’ part of speech. For example, tel does not take case inflections, nor
can it be used in the plural. In any case, given the marginal character of abbreviations (especially
from the quantitative point of view), their unexpected presence in the morphology of
Romanian does not raise serious problems.
[2] It is for just this reason that the very important verb feature of transitivity is not considered
here a morphological feature: it contributes in no way to the formation of the verb paradigm
in Romanian.
3.4.3. The same feature may count as a lemma feature for one
part of speech but as a word form feature for another. This is the case,
for instance, with gender, which characterizes the lemma of nouns and the word form
of adjectives.
3.4.4. There is a slight difference in the way the Bulgarian and the Romanian dictionaries deal
with homonymous morphological forms. The difference may be illustrated with
the following example. For the word form fly, the analysis in the Bulgarian dictionary is
pr12sg:123pl (= present tense, 1st or 2nd person, singular, or 1st or 2nd or 3rd person, plural),
while in the Romanian one it is fly pr12sg; fly 123pl. In the Romanian dictionary the word
form is therefore registered twice (if the homonymy is illustrated by two word forms).
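The difference between the two treatments amounts to splitting a compact analysis into separate entries. A minimal sketch, assuming that the colon is the separator between alternative analyses, as in the example pr12sg:123pl; the function name expand_homonyms is hypothetical.

```python
def expand_homonyms(word_form, compact_analysis):
    """Turn one entry with ':'-separated analyses into one entry per analysis."""
    return [(word_form, analysis)
            for analysis in compact_analysis.split(":")
            if analysis]


# Compact (Bulgarian-style) analysis from the example in the text.
entries = expand_homonyms("fly", "pr12sg:123pl")
for form, analysis in entries:
    print(form, analysis)
```

Registering each analysis separately makes the dictionary longer but keeps every entry unambiguous.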
3.4.5. The set of morphological features of Romanian is as follows:
Nouns
  Lemma Features
    - Noun
    - Common or Proper
    - Masculine or Feminine or Neuter
  Word Form Features
    - Singular or Plural
    - Definite or Undefinite
    - Nominative/Accusative or Genitive/Dative
Verbs
  Lemma Features
    - Verb
    - Auxiliary (default feature)
  Word Form Features
    - Indicative or Subjunctive or Imperative or Infinitive or Gerund or Participle
    - Present or Imperfect or Recent Perfect or Past Perfect
    - 1st person or 2nd person or 3rd person
    - Singular or Plural
Adjectives
  Lemma Features
    - Adjective
    - Gradable or Non-gradable
  Word Form Features
    - Definite or Undefinite
    - Masculine or Feminine
    - Singular or Plural
    - Nominative or Accusative or Dative or Genitive
Pronouns
  Lemma Features
    - Pronoun
    - Personal or Reflexive or Demonstrative or Negative or Indefinite or Interro-Relative
  Word Form Features
    For Personal Pronouns
      - Strong or Weak Form
      - 1st person or 2nd person or 3rd person
      - Singular or Plural
      - Nominative or Accusative or Dative or Genitive
      - Masculine or Feminine
    For Reflexive Pronouns
      - Strong or Weak Form
      - 1st person or 2nd person or 3rd person
      - Singular or Plural
      - Nominative or Accusative or Dative
    For Demonstrative Pronouns
      - Singular or Plural
      - Nominative or Accusative or Dative or Genitive
    For Negative Pronouns
      - Singular or Plural
      - Nominative or Accusative or Dative or Genitive
    For Interro-Relative Pronouns
      - Nominative or Accusative or Dative or Genitive
    For Indefinite Pronouns
      - Nominative or Accusative or Dative or Genitive
Determiners
  Lemma Features
    - Determiner
    - Demonstrative or Possessive or Negative or Indefinite or Interro-Relative or Emphatic
  Word Form Features
    For Demonstrative Determiners
      - Singular or Plural
      - Nominative or Accusative or Dative or Genitive
    For Negative Determiners
      - Singular or Plural
      - Nominative or Accusative or Dative or Genitive
    For Indefinite Determiners
      - Nominative or Accusative or Dative or Genitive
    For Interro-Relative Determiners
      - Nominative or Accusative or Dative or Genitive
    For Possessive Determiners
      - 1st person or 2nd person or 3rd person
      - Singular or Plural
    For Emphatic Determiners
      - 1st person or 2nd person or 3rd person
      - Singular or Plural
      - Nominative or Accusative or Dative or Genitive
Articles
  Lemma Features
    - Article
    - Demonstrative or Possessive or Indefinite
  Word Form Features
    - Masculine or Feminine
    - Singular or Plural
    - Nominative or Accusative or Dative or Genitive
Numerals
  Lemma Features
    - Numeral
    - Cardinal or Ordinal
  Word Form Features
    - Masculine or Feminine
    - Singular or Plural
Adverbs
  Lemma Features
    - Adverb
    - General or Verbal or Interro-Relative
  Word Form Features: none
Conjunctions
  Lemma Features
    - Conjunction
    - Coordinating or Subordinating
  Word Form Features: none
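The inventory above lends itself to a simple lookup table that records, for each part of speech, which features belong to the lemma and which to the word form. The sketch below encodes only nouns and adjectives. The feature names (type, gender, number, definiteness, case, gradability) and the names SCHEMA and level_of are assumptions introduced for illustration; the value sets are taken from the lists above.

```python
# Partial feature schema: part of speech -> level -> feature -> allowed values.
SCHEMA = {
    "noun": {
        "lemma": {
            "type": {"common", "proper"},
            "gender": {"masculine", "feminine", "neuter"},
        },
        "word_form": {
            "number": {"singular", "plural"},
            "definiteness": {"definite", "undefinite"},
            "case": {"nominative/accusative", "genitive/dative"},
        },
    },
    "adjective": {
        "lemma": {
            "gradability": {"gradable", "non-gradable"},
        },
        "word_form": {
            "definiteness": {"definite", "undefinite"},
            "gender": {"masculine", "feminine"},
            "number": {"singular", "plural"},
            "case": {"nominative", "accusative", "dative", "genitive"},
        },
    },
}


def level_of(pos, feature):
    """Return 'lemma' or 'word_form' for a feature of the given part of speech."""
    for level, features in SCHEMA[pos].items():
        if feature in features:
            return level
    return None  # the feature is not used for this part of speech


print(level_of("noun", "gender"))
print(level_of("adjective", "gender"))
```

The two lookups reproduce the observation made in 3.4.3: gender is a lemma feature for nouns and a word form feature for adjectives.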