Annotation guidelines
The following guidelines are rules for annotating a corpus with XML tags in order to
delimit multiword expressions with adverbial function.
1. General definition
A multiword expression is defined as an expression made of several words with
several of its elements frozen together. For example, de nos jours ‘nowadays’ should be
tagged as <ADV fs='PDETC'>de nos jours</ADV>. It is multiword because it is made
of three words; it is frozen because the words do not belong to a paradigm of words
which could be freely substituted for them.
The criterion for being made of several words instead of one is the presence of at
least one character which is not a letter inside the word. Thus, adverbs with an internal
apostrophe, such as d'ailleurs ‘by the way’, should be tagged. Discontinuous
expressions should not be tagged, except if the discontinuity consists of an embedded
phrase (cf. section 3 below).
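The multiword criterion above can be sketched as a small check. This is only an illustration: the function name is_multiword is hypothetical, and the sketch assumes that "inside the word" means any position other than the first and last character.

```python
def is_multiword(expression: str) -> bool:
    """Return True if the expression has at least one internal non-letter character.

    Assumption: "internal" means strictly between the first and last character.
    """
    inner = expression.strip()[1:-1]
    # Spaces, apostrophes and hyphens all count as non-letter characters.
    return any(not ch.isalpha() for ch in inner)


print(is_multiword("de nos jours"))  # spaces separate three words
print(is_multiword("d'ailleurs"))    # internal apostrophe
print(is_multiword("ailleurs"))      # letters only
```

Such a check covers only the multiword criterion; frozenness and adverbial function still require the annotator's judgement.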
The criterion of frozenness is the fact that the combination of elements in the
multiword expression does not obey productive rules of syntactic and semantic
compositionality. For example, de nos jours ‘nowadays’ is frozen in Il est facile de nos
jours de s'informer ‘Getting informed is easy nowadays’, and therefore it should be
tagged; the same phrase is not frozen, and therefore should not be tagged, in Voici la
liste de nos jours de fermeture ‘Here is the list of our closing days’.
An expression has an adverbial function if it is a complement (of a predicative
expression or of an adverb), but not an object. For example, au hasard has an adverbial
function in Ils erraient au hasard ‘They were wandering at random’, but not in Ils
faisaient confiance au hasard ‘They trusted chance’. In the latter sentence, au hasard
is an object of the predicative expression faisaient confiance ‘trusted’.
In the following, we give more detailed guidelines on the application of these
rules in various cases of doubt.
2. Adverbial function
In these guidelines, a complement of a predicative expression or of an adverb is
said to have an adverbial function if and only if it is not an object. The distinction
between objects (or essential complements) and complements with adverbial function
should be made on the basis of criteria (Gross, 1986, 1990a, 1990b) involving the fact
that
- complements with adverbial function are optional (but some objects are optional too),
- they combine freely with a wide variety of predicates,
- and some of them pronominalize with specific forms.
For example, à neuf heures ‘at nine’ has an adverbial function and should be tagged in
Nous avons résolu le problème à neuf heures ‘We fixed the problem at nine’, but not in
Nous avons fixé la réunion à neuf heures ‘We set the meeting at nine’, because the first
sentence is paraphrased by Nous avons résolu le problème et cela s'est produit à neuf
heures ‘We fixed the problem and that happened at nine’, whereas the second is not
paraphrased by Nous avons fixé la réunion et cela s'est produit à neuf heures ‘We set
the meeting and that happened at nine’.
In French, the essential/adverbial distinction is particularly difficult in the case
of locative complements. In case of doubt, annotators should use the criterion of support
sentences (Guillet, Leclère, 1992). For example, au fond ‘at/to the bottom’ has an
adverbial function and should be tagged in
(1)
Un ruisseau coule au fond
‘A stream flows at the bottom’
but not in
(2)
Nous sommes descendus au fond
‘We descended to the bottom’
To check this, construct both support sentences Nous sommes au fond ‘We are at the
bottom’ and Nous ne sommes pas au fond ‘We are not at the bottom’ and observe that
one of them holds before the process denoted by sentence (2) and the other holds after.
The same is not observed with (1).
Complements of predicative expressions should be analysed in order to
determine whether they have an adverbial function. This includes complements:
- of verbs, as tous les jours ‘every day’ in Il se promène tous les jours ‘He takes a walk
every day’;
- of adjectives, as de plus en plus ‘more and more’ in L'eau est de plus en plus froide
‘The water is colder and colder’;
- and of support-verb constructions, as tous les jours ‘every day’ in Il fait une promenade
tous les jours ‘He takes a walk every day’, or à travers les frontières ‘across borders’ in
Le public manifeste sa solidarité à travers les frontières ‘The public shows solidarity
across borders’.
Complements of adverbs should also be analysed, as de plus en plus ‘more and
more’ in L'eau coule de plus en plus vite ‘The water is flowing faster and faster’.
However, complements of nouns, as de tous les jours ‘everyday’ in Il fait sa
promenade de tous les jours ‘He takes his everyday walk’, or à travers les frontières
‘across borders’ in Les victimes s'en remettent à la solidarité à travers les frontières
‘The victims hope for solidarity across borders’ should not be tagged.
3. Embedded free parts
When a multiword expression with adverbial function contains a modifier with an
embedded free phrase, the free phrase should be annotated according to its
syntactic category. For example, du fait de cette décision ‘as a consequence of this
decision’ should be tagged as
<ADV fs='PCDN'>du fait de <NP>cette décision</NP></ADV>
because the noun phrase cette décision ‘this decision’ is embedded in the complement
with adverbial function. If the preposition is contracted with the determiner, the tagging
should leave the contraction out of the noun phrase. For example, du fait du temps ‘as a
consequence of the weather’ should be tagged as
<ADV fs='PCDN'>du fait du <NP>temps</NP></ADV>
When the embedded phrase is a clause, it should be tagged as a sentence, i.e.
with the S element: du fait qu'il a plu ‘since it has rained’ should be tagged as
<ADV fs='PCDN'>du fait qu'<S>il a plu</S></ADV>
A complementizer introducing a sentential complement should be left out of the
embedded sentence, as in the preceding example. A relative pronoun introducing a
relative clause should be included in the embedded sentence: au moment où il a plu ‘at
the moment when it rained’ should be tagged as
<ADV fs='PCDN'>au moment <S>où il a plu</S></ADV>
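As an illustration of how such annotations could be read back, the following sketch separates the frozen part of an ADV element from its embedded free phrases, using Python's standard xml.etree.ElementTree module. The function name frozen_pattern and the convention of replacing each free phrase by its tag name (as in à l’insu de NP) are assumptions made for this example, not part of the guidelines.

```python
import xml.etree.ElementTree as ET

FREE_TAGS = ("NP", "S")  # elements used for embedded free phrases


def frozen_pattern(xml_fragment: str):
    """Return (pattern, free_phrases) for a single annotated ADV element.

    Each embedded NP or S is replaced by its tag name in the pattern and
    collected as a (tag, text) pair in free_phrases.
    """
    adv = ET.fromstring(xml_fragment)
    if adv.tag != "ADV":
        raise ValueError("expected an ADV element")
    pattern = [adv.text or ""]
    free_phrases = []
    for child in adv:
        child_text = "".join(child.itertext())
        if child.tag in FREE_TAGS:
            free_phrases.append((child.tag, child_text))
            pattern.append(child.tag)
        else:
            # Other children (e.g. a nested ADV) are kept as plain text.
            pattern.append(child_text)
        pattern.append(child.tail or "")
    return "".join(pattern), free_phrases


print(frozen_pattern("<ADV fs='PCDN'>du fait de <NP>cette décision</NP></ADV>"))
print(frozen_pattern("<ADV fs='PCDN'>au moment <S>où il a plu</S></ADV>"))
```

In the first call the contracted or plain preposition stays in the frozen pattern, and in the second the relative pronoun où stays inside the embedded sentence, in line with the rules above.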
4. Named entities
Named entities should be tagged only when they are multiword and have an
adverbial function. For example, le soir ‘the evening’ should be tagged in Le soir, le
vent tomba ‘In the evening, the wind fell’, but not in Le soir arriva ‘The evening came’.
For named entities, the criterion of frozenness mentioned in section 1 is less
relevant. (Named entities which are not clearly frozen obey a specific syntax, but this
syntax is usually largely independent of the rest of the syntax of the language.)
5. Inclusion in larger multiword units
When an adverbial multiword expression is embedded in another, the inner one
should be tagged only if it has an adverbial function with respect to the embedding
expression:
<ADV fs='PCDN'>du fait, <ADV fs='PC Conj'>en somme</ADV>, de
<NP>cette décision</NP></ADV>
‘as a consequence, in short, of this decision’
or if it is embedded in a free phrase:
<ADV fs='PCDN'>du fait qu'<S><ADV fs='PCA'>à coup sûr</ADV>, il a
plu</S></ADV>
‘since it has certainly rained’
In other cases, the inner expression should not be tagged, for example when a named
entity denoting a date or a time is a part of another named entity:
<ADV fs='DATE Conj'>Le lendemain 2 mai à midi</ADV>
‘On the day after, May 2nd, at noon’
In particular, annotating multiword expressions with adverbial function involves
analysing sentences and detecting whether sequences are included in larger frozen units.
Such larger frozen units may be verbal idioms, e.g. s'attendre au pire ‘expect the worst’,
or frozen prepositional phrases used with être ‘be’, e.g. au mieux ‘at one's best’. In these
phrases, au pire and au mieux should not be tagged, even though they can have
adverbial function in other contexts: au pire ‘at worst’, au mieux ‘at best’.
6. Coordination
When a multiword unit is coordinated with another one and appears as reduced
because a common part is pronominalized, it should be tagged as if it were not reduced.
For example, in dans les rangs de la fonction publique et dans ceux du privé ‘in the
ranks of civil servants and in those of the private sector’, the noun rangs ‘ranks’ is
pronominalized into ceux ‘those’; therefore, dans les rangs de la fonction publique ‘in
the ranks of civil servants’ should be tagged on its own, and dans ceux du privé should
be tagged with the same tags as if it had the form of dans les rangs du privé ‘in the
ranks of the private sector’.
The rules above do not apply when a modifier embedded in the expression is
occupied by a coordination of embedded free phrases, as in dans les rangs de la fonction
publique et du privé ‘in the ranks of civil servants and of the private sector’, which
should be tagged as a single occurrence of an expression, with two embedded free noun
phrases.
The rules above do not apply either when the whole coordination is frozen, as in
en tout et pour tout ‘altogether, only’, which is recognizable by the impossibility of
permuting the co-ordinated parts (pour tout et en tout ‘for everything and in everything’ is
interpretable only compositionally).
7. Subcategories
Multiword expressions annotated in the corpus should be assigned the name of
the subcategory to which they belong. These subcategories are based upon the surface
constituency of the internal structure of multiword expressions, except for the case of
named entities, in which it depends on the semantic content. A closed list of
subcategories should be used:
Cat. name  Description of morphosyntactic structure        Example
PC         Preposition and noun                            par exemple
PDETC      Preposition, determiner and noun                de nos jours
PAC        Preposition, determiner, preposed adjective     à la dernière minute
           and noun
PCA        Preposition, determiner, noun and postposed     à la nuit tombante
           adjective
PCDC       Prepositional phrase containing a               dans la limite du possible
           prepositional phrase with preposition de and
           frozen noun phrase
PCPC       Prepositional phrase containing a               à cent pour cent
           prepositional phrase with preposition other
           than de and frozen noun phrase
PCONJ      Co-ordination                                   tôt ou tard
PCDN       Prepositional phrase containing a               à l’insu de NP
           prepositional phrase with preposition de and
           free noun phrase
PCPN       Prepositional phrase containing a               en comparaison avec NP
           prepositional phrase with preposition other
           than de and free noun phrase
PV         Expression with a subjectless verb              à dire vrai
PF         Expression with an embedded sentence            jusqu'à ce que mort s'ensuive
PECO       Comparative phrase with comme and a noun        <fidèle> comme un chien
           phrase, compatible with an adjective
PVCO       Comparative phrase with comme and a noun        <travailler> comme un chien
           phrase, compatible with a verb
PPCO       Comparative phrase with comme and a             <disparaître> comme par enchantement
           prepositional phrase, compatible with a verb
PJC        Expression beginning with a coordinating        mais aussi et surtout
           conjunction
DATE       Named entity denoting a date                    le 22 mai 2008
DURATION   Named entity denoting a duration                pendant vingt-quatre heures
TIME       Named entity denoting a time                    à huit heures du soir
FREQUENCE  Named entity denoting a frequency               deux fois par jour
Not all multiword expressions with adverbial function in French strictly match one of these descriptions.
Some of them match variants of them: for instance, à nouveau matches the PC structure,
except for the part of speech of nouveau, which is an adjective rather than a noun. In
that case, annotators are requested to select the closest structure, here PC, so that the
closed list above is respected.
If the expression to be annotated is a variant of an expression with an embedded
free phrase, the morpho-syntactic structure is assigned according to the form with the
embedded free phrase. For example, à nos yeux ‘in our opinion’ is a variant of aux yeux
de NP ‘in the opinion of NP’ where NP is possessivized; therefore, it should be assigned
the PCDN structure. Similarly, à ce sujet ‘about this’ is a variant of au sujet de NP
‘about NP’, and should be assigned the PCDN structure.
8. Conjunctive function
An expression with adverbial function assumes a conjunctive function in
discourse if it connects the clause in which it occurs with the previous clause, as en
somme ‘in short’. This function is marked by the identifier ‘Conj’ in the ‘fs’ attribute.
Example: <ADV fs='PC Conj'>en somme</ADV>.
9. XML Syntax
The XML syntax for tagging multiword expressions with adverbial function
involves
- the ADV element
- the fs attribute in the ADV element.
The value of the fs attribute is a list of feature identifiers separated by spaces.
Example: <ADV fs='PCDN Conj'>En conséquence</ADV>. Feature identifiers may be
subcategory names such as PCDN and binary feature names such as Conj.
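A value of the fs attribute could be checked along the following lines. This is a minimal sketch: the function name parse_fs is hypothetical, the set of subcategories is copied from the closed list in section 7, and the requirement of exactly one subcategory per expression is an assumption drawn from the wording of that section rather than an explicit rule.

```python
# Closed list of subcategory names from section 7.
SUBCATEGORIES = {
    "PC", "PDETC", "PAC", "PCA", "PCDC", "PCPC", "PCONJ", "PCDN", "PCPN",
    "PV", "PF", "PECO", "PVCO", "PPCO", "PJC",
    "DATE", "DURATION", "TIME", "FREQUENCE",
}
BINARY_FEATURES = {"Conj"}


def parse_fs(fs: str):
    """Split an fs value into (subcategory, has_conjunctive_function)."""
    identifiers = fs.split()
    unknown = [i for i in identifiers if i not in SUBCATEGORIES | BINARY_FEATURES]
    if unknown:
        raise ValueError(f"unknown feature identifiers: {unknown}")
    subcategories = [i for i in identifiers if i in SUBCATEGORIES]
    # Assumption: each expression carries exactly one subcategory name.
    if len(subcategories) != 1:
        raise ValueError(f"expected exactly one subcategory, got {subcategories}")
    return subcategories[0], "Conj" in identifiers


print(parse_fs("PCDN Conj"))
print(parse_fs("PDETC"))
```

Note that PCONJ (the co-ordination subcategory) and Conj (the conjunctive-function feature) are distinct identifiers, so an exact match on each identifier is needed rather than a case-insensitive one.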
The syntax for tagging embedded free parts in multiword expressions involves
- the NP element for embedded noun phrases
- the S element for embedded clauses.
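The element inventory above lends itself to a simple consistency check on annotated sentences. The sketch below uses Python's standard xml.etree.ElementTree module; the function name check_annotation and the wrapping text element are hypothetical, and two of the checks (every ADV carries an fs attribute, and NP and S occur only inside an ADV) are assumptions inferred from the examples in these guidelines.

```python
import xml.etree.ElementTree as ET

ALLOWED_ELEMENTS = {"ADV", "NP", "S"}


def check_annotation(sentence: str):
    """Return a list of problems found in one annotated sentence."""
    # A sentence with inline tags has no single root, so wrap it in one.
    root = ET.fromstring(f"<text>{sentence}</text>")
    problems = []

    def visit(element, inside_adv):
        for child in element:
            if child.tag not in ALLOWED_ELEMENTS:
                problems.append(f"unexpected element <{child.tag}>")
            elif child.tag == "ADV":
                if "fs" not in child.attrib:
                    problems.append("ADV element without fs attribute")
            elif not inside_adv:
                problems.append(f"<{child.tag}> outside any ADV element")
            visit(child, inside_adv or child.tag == "ADV")

    visit(root, False)
    return problems


nested = ("<ADV fs='PCDN'>du fait qu'<S><ADV fs='PCA'>à coup sûr</ADV>, "
          "il a plu</S></ADV>")
print(check_annotation(nested))
print(check_annotation("Il pleut <NP>ce soir</NP>"))
```

A sentence that is not well-formed XML makes ET.fromstring raise a ParseError, which already signals unbalanced tags before any of these checks run.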
Bibliography
Gross, Maurice. 1986. Lexicon-Grammar. The representation of compound words. In
Proceedings of the Eleventh International Conference on Computational
Linguistics, Bonn, West Germany, pp. 1-6.
Gross, Maurice. 1990a. Grammaire transformationnelle du français: 3. Syntaxe de
l’adverbe. Paris, ASSTRIL.
Gross, Maurice. 1990b. La caractérisation des adverbes dans un lexique-grammaire.
Langue Française, 86, pp. 90-102.
Guillet, Alain; Christian Leclère. 1992. La structure des phrases simples en français.
Les constructions transitives locatives, Genève, Droz, 446 p.