Universität Stuttgart
Institut für maschinelle Sprachverarbeitung
Azenbergstraße 12
D-70174 Stuttgart
Diploma Thesis No. 77
Null Subjects in Statistical Machine Translation:
A Case Study on Aligning English and Italian Verb Phrases with Pronominal Subjects
Supervisor: Dr. Alexander Fraser
First examiner: Dr. Helmut Schmid
Second examiner: apl. Prof. Dr. Ulrich Heid
Author: Anita Gojun
Registered: 1 June 2010
Submitted: 30 August 2010
Abstract
In this thesis, I present a method for aligning English and Italian parallel verb phrases which have pronominal subjects. The phrases contain the pronominal subject, the verbal elements of a verb phrase (VP) and the negation. I use English parse trees and part-of-speech-tagged Italian sentences. The process of aligning parallel phrases consists of several steps. An Italian sentence is searched in order to find all Italian VPs. In the parallel English sentence, the clauses with pronominal subjects are detected. The base word alignment (created by GIZA++) of the elements of an English VP is used to identify the matching Italian VP. The alignment of parallel phrases is computed by applying alignment rules which define the alignment between words with a specific part of speech tag.
The rule-based VP alignment reaches an f-score of 81%, whereas the f-score of the base word alignment is 64%. The rules compute correct alignments for most parallel VPs. However, they produce erroneous alignments if false parallel phrases are identified. This is the case when the English VP is not translated, or when it corresponds to an Italian phrase of an arbitrary type (e.g. a prepositional phrase). These cases are analyzed and a few experiments are carried out in order to solve these problems. They lead to higher recall (best recall is 84%), but lower precision.
I use the rule-based word alignment to build phrase-based SMT systems with Moses and to examine whether improved word alignment of English pronominal subjects leads to better results when pronominal subjects are translated between the null subject language Italian and the non-null subject language English. SMT systems built using the rule-based VP alignment receive lower BLEU scores even though the translations are comparable with the translations generated by SMT systems which are built using the base alignment. In the translation direction EN → IT, the SMT system built using the base alignment has a BLEU score of 19.15, whereas the SMT system built using the rule-based VP alignment has a BLEU score of 18.18. In the opposite translation direction, the SMT system built using the base alignment has a BLEU score of 22.07, whereas the SMT system built using the rule-based VP alignment has a BLEU score of 21.81. The systems perform equally with respect to the translation of pronominal subjects, which means that the improved VP alignment does not lead to an improvement of the subject pronoun translation between English and Italian.
The analysis of translations of example sentences will show that pronoun resolution and syntactic analysis of both languages are necessary to ensure the correct generation of the corresponding subject pronoun. Furthermore, when English pronouns are translated into Italian, the decision must be made as to whether the Italian subject pronoun should be overtly expressed.
I hereby declare that I have written this thesis independently and have used no literature other than that cited. All quotations and paraphrased borrowings are marked as such with a precise indication of the source.
Contents
1 Introduction
1.1 Motivation
1.2 Methodology
1.3 Outline
2 Pro-drop and Null Subject Languages
2.1 Pro-drop theories
2.1.1 Rich inflection morphology
2.1.2 Zero topic theory
2.2 Null subjects and syntax
2.2.1 Null subjects and English syntax
2.2.2 Null subjects and Italian syntax
2.3 Null subjects and pragmatics
2.4 Statistics on null subjects in Italian
2.5 Summary
3 Pro-drop in machine translation
3.1 Previous work on zero pronouns in MT
3.2 Translation between English and Italian
3.2.1 Italian to English
3.2.2 English to Italian
3.3 Summary
4 Statistical machine translation
4.1 Word alignment
4.2 Phrase-based SMT
5 Word alignment of English and Italian verb phrases
5.1 Motivation
5.2 Data preparation
5.2.1 English
5.2.2 Italian
5.2.3 Data preprocessing errors
5.3 Applying alignment rules
5.3.1 Identification of Italian VPs
5.3.2 Identification of the most probable Italian VP
5.4 Alignment rules
5.4.1 Syntax of the English and Italian VPs
5.4.2 Subject pronouns
5.4.3 Finite verbs
5.4.4 Participles, infinitives and gerundives
5.4.5 Negation
5.4.6 Infinitival particle
5.4.7 Alignment examples
5.5 Evaluation
5.5.1 Precision, Recall, F-score
5.5.2 Error analysis
5.6 System extensions
5.6.1 Lexical search for the matching Italian VP
5.6.2 Retaining the base alignment
5.7 Summary
6 Evaluation of SMT systems
6.1 The BLEU score
6.2 Evaluation of SMT systems
6.3 Error analysis
6.4 Adequate training data
7 Conclusion
7.1 Summary
7.2 Future work
A Italian tag set
B English tag set (Penn Treebank Tagset)
C English subject pronoun occurrences
List of Tables
List of Figures
References
1 Introduction
In my diploma thesis, I addressed the problem of pro-drop in statistical machine translation using the language pair English - Italian. I carried out a linguistic analysis of the phenomenon with respect to machine translation, and I developed rules based on part of speech tags which define the word alignment of the English subject pronoun and its verb phrase with elements of the corresponding Italian verb phrase. I examined the generated translations as well as the translation parameters to find an explanation for the deficient translation of pronominal subjects between English and Italian.
1.1 Motivation
English is a language in which the subject position must always be occupied. In Italian, this is not the case. When the Italian subject is expressed by a pronoun, it can be dropped. This means that the English pronominal subject does not necessarily have a pronominal counterpart in Italian. In the context of (statistical) machine translation, this leads to problems in both translation directions, as well as within the automatic word alignment task. The following questions concerning pronominal subjects arise:
(Q1) When word alignment of parallel sentences is carried out, with which Italian word
should an English subject pronoun be aligned when a subject pronoun in Italian
is omitted?
(Q2) How can we make sure that the correct pronoun is generated when translating an
Italian null subject into English?
(Q3) How can we decide when to generate a null pronoun when translating an English
subject pronoun into Italian?
The theoretical discussion of the problem of pro-drop within (statistical) machine translation will show which information can be used to solve the problems formulated in (Q2) and (Q3). In the practical part of the work, I will present the method which handles question (Q1). Improved word alignment of English pronominal subjects does not solve the problems in (Q2) and (Q3). Translations of example sentences will be analyzed thoroughly in order to explain why the improved word alignment does not contribute to the translation of pronominal subjects between English and Italian.
1.2 Methodology
This work concentrates on the improvement of the word alignment of English and Italian verb phrases that contain a subject pronoun (cf. question (Q1) in the preceding section). Since English subject pronouns do not always have Italian counterparts, they are often aligned incorrectly. I therefore develop a set of rules which define the alignment of English subject pronouns (cf. section 5.4). Since Italian verbs correspond to English phrases containing the subject pronoun and verbs, the alignment rules compute the alignment of entire English and Italian parallel verb phrases (VPs). The alignment rules define only the alignment of the verbal elements of the VPs and of negation. Therefore, I use the term VP to denote the part of a verb phrase that contains only verbal elements and negation. The other elements of verb phrases are not handled within this work. I make three important assumptions for the computation of the VP alignment:
(A1) Each English VP which has a pronominal subject has a parallel Italian VP,
(A2) The base alignment¹ is correct enough to allow the identification of English and Italian parallel VPs,
(A3) English and Italian parallel phrases have parallel part of speech sequences.
I use English parse trees and Italian part-of-speech-tagged sentences (an Italian parser was not available). The program for the computation of the VP alignment is applied to English and Italian phrase pairs in which the English phrase has a pronominal subject (cf. section 5.3). To ensure that the Italian VPs are correct, i.e. that they contain only verbal elements, I first identify all VPs in an Italian sentence by searching for PoS sequences which build a VP (cf. section 5.3.1). The parallel Italian VP is then identified on the basis of the base alignment (cf. section 5.3.2). The alignment rules compute alignments for the matching part of speech tags of the phrase pair elements (cf. section 5.4). All links in the base alignment for aligned phrase pairs are removed. Then, the alignment computed for the VP pairs is integrated into the base word alignment. The resulting word alignment of a sentence pair does not have any base alignments for the phrase pairs which are handled by the alignment rules.
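The steps described in this paragraph can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the program developed in this work: the tag names in VERBAL_TAGS are hypothetical placeholders (the actual Italian tag set is listed in Appendix A), an Italian VP is approximated as a maximal run of verbal tags, the matching VP is chosen by counting base alignment links, and every base link touching either side of the phrase pair is treated as removed. The function names, the example sentence pair and the rule output are likewise invented for illustration.

```python
# Hypothetical PoS tags for verbal elements and negation (not the thesis tag set).
VERBAL_TAGS = {"NEG", "AUX", "VFIN", "VINF", "VPP", "VGER"}


def find_italian_vps(tags):
    """Return each maximal run of verbal tags as a list of token indices."""
    vps, current = [], []
    for index, tag in enumerate(tags):
        if tag in VERBAL_TAGS:
            current.append(index)
        elif current:
            vps.append(current)
            current = []
    if current:
        vps.append(current)
    return vps


def pick_matching_vp(english_vp, italian_vps, base_links):
    """Choose the Italian VP that receives the most base links from the English VP."""
    best, best_votes = None, 0
    for vp in italian_vps:
        votes = sum(1 for (e, i) in base_links if e in english_vp and i in vp)
        if votes > best_votes:
            best, best_votes = vp, votes
    return best  # None if no base link points into any Italian VP


def integrate(base_links, english_vp, italian_vp, rule_links):
    """Drop base links that touch the phrase pair, then add the rule-based links."""
    kept = {(e, i) for (e, i) in base_links
            if e not in english_vp and i not in italian_vp}
    return kept | set(rule_links)


# Invented sentence pair: "He has not slept ." / "Non ha dormito ."
italian_tags = ["NEG", "AUX", "VPP", "PUNCT"]
english_vp = {0, 1, 2, 3}                      # He has not slept
base_links = {(0, 1), (1, 1), (3, 2), (4, 3)}  # (English index, Italian index)
rule_links = {(0, 1), (1, 1), (2, 0), (3, 2)}  # assumed output of the alignment rules

italian_vps = find_italian_vps(italian_tags)
match = pick_matching_vp(english_vp, italian_vps, base_links)
# If no Italian VP is found, this sketch simply keeps the base alignment.
final_links = (integrate(base_links, english_vp, set(match), rule_links)
               if match is not None else set(base_links))
print(sorted(final_links))
```

In this toy pair the base alignment leaves "not" unaligned; after integration, only the punctuation link survives from the base alignment and all links inside the phrase pair come from the rules.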
I evaluate the VP alignment by computing precision, recall and f-score (cf. section 5.5.1). I created the gold alignment manually by defining the alignment only of the relevant English phrases. The alignments of other tokens in a sentence were ignored in the evaluation. The rule-based VP alignment outperforms the base alignment. Expressed in f-score, the rule-based VP alignment achieves an improvement of 17 percentage points (f-score = 81%).
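As a rough illustration of this evaluation, the following Python sketch computes precision, recall and f-score over sets of alignment links. It assumes the balanced f-score (the harmonic mean of precision and recall) and a single kind of gold link; the exact definitions used in this work are given in section 5.5.1 and may differ. The function name and the example links are invented.

```python
def precision_recall_f(predicted_links, gold_links):
    """Compare predicted alignment links with gold links (both sets of index pairs)."""
    predicted, gold = set(predicted_links), set(gold_links)
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Invented example: three of four predicted links are in the gold alignment.
gold = {(0, 1), (1, 1), (2, 0), (3, 2)}
predicted = {(0, 1), (1, 1), (3, 2), (3, 0)}
p, r, f = precision_recall_f(predicted, gold)
print(p, r, f)
```

Because only the links of the relevant English phrases are annotated in the gold alignment, the predicted links would have to be restricted to the same tokens before the comparison.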
The assumptions (A1) and (A2) do not always hold, which leads to false alignments. Not every English VP has a parallel Italian VP, which contradicts (A1). Sometimes, the phrases are not translated, or they correspond to other Italian phrases (prepositional phrases, participles, etc.). Since the alignment rules are defined only for PoS sequences of English and Italian VPs (cf. assumption (A3)), they compute false alignments in such cases. The assumption (A2) can lead to the identification of false phrase pairs since the base alignment is not error-free (cf. section 5.5.2). I show some experiments which were carried out in order to solve these problems (cf. section 5.6). For example, to deal with problems with respect to (A1), the base alignment can be retained if the parallel Italian VP could not be identified. In general, the experiments that I carried out lead to higher recall but lower precision.
The base alignment and the rule-based VP alignment are used to build statistical machine translation (SMT) systems for both translation directions. The quality of the generated translations is given in BLEU scores. The rule-based VP alignment leads to lower BLEU scores, but the manual analysis of the generated translations revealed that the translations are nearly the same (cf. section 6.2). This leads to the conclusion that the improved VP alignment does not contribute to the translation of pronominal subjects between English and Italian. The discussion of the translation probabilities of the relevant phrases will show that phrase-based SMT is not an appropriate machine translation approach for subject pronoun translation since it does not have access to the context (preceding sentences) of the input sentence. When translating null subjects into English subject pronouns, in many cases, the characteristics of the omitted pronoun (number, gender, person) can be derived from the inflected verbs (cf. section 3.2.1). In general, however, pronoun resolution (using the preceding sentences) is needed to ensure the generation of the correct English pronoun (cf. question (Q2) in the previous section). Furthermore, only the syntactic analysis of the Italian input can provide clear information on whether the Italian sentence has an (omitted) pronominal subject or an NP subject. When translating English pronominal subjects into Italian, the decision must be made as to whether the Italian subject pronoun has to be expressed overtly (cf. question (Q3) in the previous section). The data observation revealed that some words (adjectives) often occur with overtly expressed Italian subject pronouns (cf. section 3.2.2).
¹ Base word alignment is created by GIZA++ (cf. section 4.1).
1.3 Outline
This work is organized as follows: in Chapter 2, I introduce the phenomenon of pro-drop. Two theories are briefly presented which mention a number of linguistic characteristics which allow or prohibit pro-drop. In Chapter 3, pro-drop is discussed with respect to machine translation. The features of English and Italian are identified which could simplify the generation of correct subject pronouns. In Chapter 4, the characteristics of phrase-based statistical machine translation are described. Chapter 5 contains a detailed description of the rules and of the program for computing word alignment between English and Italian verb phrases. The evaluation results of the VP alignment rules are presented and the most common errors are discussed. In Chapter 6, the evaluation of the SMT systems is carried out. I report BLEU/NIST scores and take a closer look at generated translations and translation parameters in order to find an explanation for false pronoun translations. Finally, in Chapter 7, the findings of the work are summarized and future work is outlined.
2 Pro-drop and Null Subject Languages
In this chapter, I introduce the terms pro-drop and null subject language and present two theories which explain why some languages are able to omit subject (and object) pronouns (cf. section 2.1). In section 2.2, I present syntactic constructions in English and Italian in which null subjects can occur, whereas section 2.3 discusses the functions which overtly expressed Italian subject pronouns fulfill. Statistics about null subjects in Italian are presented in section 2.4.
Consider a simple sentence in English with a subject (SUBJ) and a verbal predicate (VPRED) as shown in (1).
(1) He_SUBJ sleeps_VPRED.
The German translation of the sentence in (1) is shown in (2). If we compare the syntax of these two sentences, we see that both of them have the same sentence elements: a subject and a verbal predicate.
(2) Er_SUBJ schläft_VPRED.
    he      sleeps
    'He sleeps.'
Let us now take a look at Italian and Croatian sentences which are equivalent to the German and English sentences in the previous examples.
(3) a. Egli_SUBJ dorme_VPRED.
       he        sleeps
       'He sleeps.'
    b. Dorme_VPRED.
       sleeps
       'He/she/it sleeps.'
(4) a. On_SUBJ spava_VPRED.
       he      sleeps
       'He sleeps.'
    b. Spava_VPRED.
       sleeps
       'He/she sleeps.'
The Italian sentences in (3) are both correct translations of the English and German sentences above. But there is one important difference between them: the sentence in (3a) contains a subject and a predicate, whereas the sentence in (3b) has only a verbal predicate. While Italian and, for example, Croatian (cf. examples in (4)) grammars allow for omission of the subject pronoun, English and German grammars require subjects to be overtly expressed. Languages such as Italian, Croatian and Spanish can omit only subject pronouns. Thus, they are called null subject languages (NSLs). Many Romance (like Italian, Spanish, Portuguese etc.) and Slavic (like Croatian, Czech, Polish etc.) languages belong to this group of languages.
There are also languages, such as Chinese, which allow for omission of both subject and object pronouns. These are called pro-drop languages. The set of NSLs is a subset of pro-drop languages.
Examples (1) and (2) show grammatically correct sentences of English and German.
However, they would become ungrammatical if the subject pronouns were omitted. In
these languages, pronoun dropping (pro-drop) is not allowed. English and German are
neither pro-drop languages nor NSLs.
Let us now take a look at the following German sentences.
(5) Er sagte, dass ∅ gefeiert   wurde.
    He said,  that   celebrated has been.
    'He said that there was a celebration.'
(6) Heute ∅ wird gefeiert.
    Today   will be celebrated.
    'Today, there will be a celebration.'
The dass-sentence (corresponding to the English that-sentence) in (5) does not contain a subject. However, the sentence is grammatically correct. In German, there are a few constructions which allow the expletive to be dropped, so German can be called a semi-NSL. The German examples show that in some cases, it is not simple to say whether a language is an NSL or not. Some languages, like modern Hebrew and the Scandinavian languages, generally do not allow zero subject pronouns; in a number of constructions, however, the subject pronoun can be omitted [Haegeman, 96].
2.1 Pro-drop theories
In the following, I briefly introduce two theories that try to explain why some languages are able to omit subject and/or object pronouns, and some do not exhibit this property. The theories account both for the omission of subject and object pronouns. In further discussion though, only the omission of subjects will be considered, since this work concentrates on the problem of translating subject pronouns between an NSL and a non-NSL.
2.1.1 Rich inflection morphology
It is widely accepted that the possibility of pro-drop often correlates with the existence of a rich inflectional morphology (verb-subject, verb-object agreement). The agreement marking on a verb has to be rich enough to determine, or to allow the recovery of, the content (reference) of a missing pronoun [Huang, 84]. The Italian example sentence in (7) should clarify this thesis.
(7) Leggo un libro.
    read  a  book.
    'I read a book.'
Although the subject pronoun in (7) is not phonetically realised, its content has to be determined. To achieve this, [Huang, 84] proposes the co-indexing of the missing pronoun with the closest nominal element. In our example sentence, this is the Agr (Agreement) of the verb leggo. The verb in (7) clearly defines the person and number of the missing subject: 1st person singular.
Let us take a look at a literal translation of (7) into English.
(8) * Read a book.
The English verb read in (8) cannot unambiguously define the content of the missing subject pronoun. It is ambiguous and could be combined with the 1st and 2nd person singular and plural and with the 3rd person plural. So we need a lexical element (pronoun) to identify the number and person of the subject.
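The contrast between (7) and (8) can be made concrete with a small Python sketch. It is only an illustration: the two tables list present-tense indicative forms of Italian leggere and English read, and the function names are invented. A form identifies the missing subject only if it is compatible with exactly one person/number combination.

```python
# Present indicative forms mapped to the (person, number) pairs they agree with.
ITALIAN_LEGGERE = {
    "leggo": {("1", "sg")},
    "leggi": {("2", "sg")},
    "legge": {("3", "sg")},
    "leggiamo": {("1", "pl")},
    "leggete": {("2", "pl")},
    "leggono": {("3", "pl")},
}
ENGLISH_READ = {
    "read": {("1", "sg"), ("2", "sg"), ("1", "pl"), ("2", "pl"), ("3", "pl")},
    "reads": {("3", "sg")},
}


def subject_features(form, paradigm):
    """Return every person/number combination the verb form is compatible with."""
    return paradigm.get(form, set())


def identifies_subject(form, paradigm):
    """True if the agreement marking alone determines person and number."""
    return len(subject_features(form, paradigm)) == 1


print(identifies_subject("leggo", ITALIAN_LEGGERE))   # form from example (7)
print(identifies_subject("read", ENGLISH_READ))       # form from example (8)
print(len(subject_features("read", ENGLISH_READ)))    # number of possible subjects
```

Every Italian form in the table picks out a single combination, whereas English read leaves five combinations open, which is the ambiguity described above.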
According to this theory, pro-drop languages are also able to omit objects if they
have a verb-object-agreement. Since Italian and English do not exhibit any verb-object
agreement, object pronouns cannot be dropped.
In languages like German (cf. examples (5) and (6)), which have some constructions that allow the omission of subject pronouns, there is one restriction regarding the subject pronouns: they can be realized as null subjects only if they are non-referential. [Haegeman, 96] explains this by the fact that German inflection is richer than English inflection but poorer than Italian inflection. The inflection may license null subjects in German, but the verb agreement does not enable us to identify a referent for a null subject pronoun.
The theory about morphological richness and pro-drop holds for many languages, but there is a group of languages like Chinese or Japanese which have no morphology at all, but still allow for pro-drop. In the next section, I discuss one theory that tries to explain the ability of pro-drop in the mentioned languages.
2.1.2 Zero topic theory
The zero topic theory proposed by [Huang, 84] is based on the language classification of [Tsao, 77]. [Tsao, 77] proposed that languages like Chinese may be distinguished from languages like English by a parameter called discourse-oriented vs. sentence-oriented. He observed many properties by which languages can be grouped into discourse-oriented and sentence-oriented ones. Among these is the property of Topic NP deletion, which is only observed in languages that are characterized as discourse-oriented. They allow for deletion of the topic of a sentence under identity with the topic in the preceding sentence. The ability of a language to map an empty topic to an appropriate preceding topic is called the topic chain interpretation rule. The grammars of sentence-oriented languages lack this topic interpretation rule. Their sentences must have a subject. This also accounts for the presence of the expletive in such languages.
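The idea of mapping an empty topic to a preceding topic can be sketched in a few lines of Python. This is a deliberately simplified toy and not a formalization from [Huang, 84] or [Tsao, 77]: each sentence is reduced to its topic, an empty topic is written as None, and the function name and the example referents are invented.

```python
def resolve_topic_chain(topics):
    """Fill each empty topic (None) with the most recent overt topic before it."""
    resolved, last_overt = [], None
    for topic in topics:
        if topic is not None:
            last_overt = topic
        # An empty topic inherits the preceding topic; it stays None if there is none.
        resolved.append(topic if topic is not None else last_overt)
    return resolved


# Five sentences; the second, third and fifth have an empty topic.
discourse = ["Zhangsan", None, None, "Lisi", None]
print(resolve_topic_chain(discourse))
```

A sentence-oriented language has no such rule in its grammar, so in this picture every entry of the list would have to be overt.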
[Huang, 84] assumes that languages like Chinese allow binding of empty categories (which arise when some syntactic elements like subject and object are omitted) with a zero topic. Assuming that a topic can be deleted only if it refers to a preceding topic, we can now recover the content of the missing element.
Languages like Italian or Spanish do not have zero topics, which could explain why they cannot omit object pronouns. To recover the content of an omitted element, we refer here again to the morphology of the language. An empty subject pronoun can be recovered by examining verb inflection, but this is not possible for object pronouns.
The theory of [Huang, 84] is thus based on several factors which consider several
properties of a language (zero topics, morphological richness) and some principles and
conditions formulated in the government and binding theory of Chomsky (for more
details, see [Huang, 84]).
2.2 Null subjects and syntax
In the following sections, various syntactic constructions in English and Italian are shown
in which the subject pronouns can be omitted.
2.2.1 Null subjects and English syntax
Although English does not belong to the group of NSLs, there are indeed some constructions, like infinitival subclauses and imperatives, in which the subject is absent.
(9) a. Speak! (Imperative)
    b. I would like [to come]_XCOMP.
    c. I must [read this]_XCOMP.
    d. John preferred [seeing Mary]_GER.
In English, an empty pronoun may occur only as a subject of an imperative, an infinitival clause or of a gerund, but nowhere else. It cannot occur at all as a subject of the tensed clause or as an object [Huang, 84]. However, the subjects in (9b-d) have somewhat different properties from null subjects as in (10), in so far as the subject of an infinitive must be coreferential with the given subject of the main clause (subject control).
(10) a. Joe_i eats a banana and ∅_i watches TV.
     b. You_i should wash the dishes or ∅_i vacuum the apartment.
     c. * Joe_i eats a banana while ∅_i watches TV.
     d. * You_i should wash the dishes although ∅_i vacuumed the apartment.
The example sentences in (10) show, though, that some finite subclauses, i.e. coordinated sentences, do not need a subject. In (10a), the subject of the clause watches TV does not exist locally, but this kind of construction allows the identification of the subject of a coordinated sentence with the subject of the main sentence, namely Joe. In contrast, subordinating conjunctions do not provide this kind of subject sharing. The clauses introduced by a subordinating conjunction require the subject to be overtly expressed (cf. (10c) and (10d)).
Yet, examples of subject omission in English finite clauses can be found in some nonstandard language constructions.
(11) a. ∅_SUBJ cried yesterday morning.
     b. She_i is Alsatian. ∅_i,SUBJ Seems intelligent.
[Haegeman, 00] found that English allows null subjects in some special discourse environments like short diary entries or notes (cf. sentences in (11)).
In this work, I will not deal with this kind of null subjects in English. Nevertheless,
it is important to discuss these constructions to show that there is a gradation rather
than a hard boundary between NSLs and non-NSLs.
2.2.2 Null subjects and Italian syntax
Italian counterparts to the English sentences in (9) are shown in (12).
(12) a. Parla/Parlate! (Imperative)
        speak!
        'Speak!'
     b. Vorrei [venire]_XCOMP.
        I would come.
        'I would like to come.'
     c. Devo [leggere questo]_XCOMP.
        I must read this.
        'I must read this.'
     d. John preferisce [di veder Mary]_GER.
        John prefers to seeing Mary.
        'John prefers to see Mary.'
The examples in (9) and (12) show that there are some syntactically isomorphic constructions in English and Italian which exhibit the same characteristics regarding the occurrence of the subject pronoun. But Italian has more constructions in which the subject pronoun can be omitted.
Finite clauses
(13) È  stanca.
     is tired
     'She is tired.'
(14) Ti  hanno imbrogliato.
     you have  cheated
     'They cheated you.'
Example (13) shows a typical use of the null subject pronoun. The verb è gives information about the missing subject: it can only be the 3rd person singular. The predicative adjective stanca reveals another important characteristic of the null subject. Its ending can only match a feminine subject. Now, we can derive the correct form of the subject pronoun although it is not overtly expressed: ella (= she). It is important to notice that the information about the gender of the missing subject is not always available in the sentence (cf. example (3b)). Thus, in some cases, the gender can only be derived if more context of the sentence is available.
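How the verb and the predicative adjective jointly constrain the null subject can be sketched as a simple feature merge in Python. The two small lexicons and the function name are invented for illustration and cover only the forms from (3b) and (13); a gender value of None stands for the case in which the gender has to be taken from the wider context.

```python
# Toy lexicons (assumed for illustration): features contributed by each word form.
VERB_FEATURES = {
    "è": {"person": 3, "number": "sg"},
    "dorme": {"person": 3, "number": "sg"},
}
ADJECTIVE_FEATURES = {
    "stanca": {"number": "sg", "gender": "f"},
    "stanco": {"number": "sg", "gender": "m"},
}


def null_subject_features(verb, adjective=None):
    """Merge verb and adjective features; gender stays None if nothing marks it."""
    features = dict(VERB_FEATURES[verb])
    for key, value in ADJECTIVE_FEATURES.get(adjective, {}).items():
        if features.get(key, value) != value:
            raise ValueError(f"conflicting {key}: {features[key]} vs {value}")
        features[key] = value
    features.setdefault("gender", None)
    return features


print(null_subject_features("è", "stanca"))  # example (13)
print(null_subject_features("dorme"))        # example (3b): gender unknown
```

For (13) the merge yields 3rd person singular feminine, while for (3b) the gender remains open, matching the observation that it can only be recovered from more context.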
One interesting fact about the use of subject pronouns in finite subclauses is shown in one example sentence of modern Italian in [Vanelli, Renzi, et al., 06], here example (15).
(15) Il  professore_i ha  parlato dopo  lui_*i è arrivato.
     the professor    has spoken  after he     arrived
     'The professor spoke after he arrived.'
[Vanelli, Renzi, et al., 06] claim that the subject pronoun in the subclause cannot be coreferential with the subject of the main clause. [Roberts, 07] notes, though, that this interpretation is unusual rather than impossible (footnote number 2, page 40). If the pronoun is stressed (cf. example (16a)), modified (cf. example (16b)) or coordinated (cf. example (16c)), the reference is possible in the subordinate clause [Cardinaletti & Repetti, 03]:
(16) a. Mario_i ha  detto che  LUI_i verrà     domani.
        Mario   has said  that HE    will-come tomorrow
        'Mario has said that HE will come tomorrow.'
     b. Mario ha  detto che  solo lui verrà     domani.
        Mario has said  that only he  will-come tomorrow
        'Mario has said that only he will come tomorrow.'
     c. Mario ha  detto che  lui e   sua madre  verranno  domani.
        Mario has said  that he  and his mother will-come tomorrow
        'Mario has said that he and his mother will come tomorrow.'
Constructions like the one in (14) can also be used as an impersonal construction. The agreement of the auxiliary hanno uniquely identifies the subject as the 3rd person plural, but this is not necessarily some specific group of referents. Such sentences emphasize the described fact whereas the subject is irrelevant (or simply not known).
Impersonal expressions
a. With impersonal verbs
(17) Piove.
     rains
     'It rains.'
b. Impersonal passive
(18) È stato detto che viene.
     is said that comes
     'It was said that he/she comes.'
c. Si impersonale
(19) In Italia si parla italiano.
     In Italy speaks Italian
     'In Italy one/people speak(s) Italian.'
Impersonal verbs (sometimes also called weather verbs) do not take any subject at all. The subject in the English translation of (17) is not a true subject. It occurs because subjects are obligatory, but it does not have a thematic role. Such impersonal subject pronouns are also called expletive it. The example in (18) is an Italian construction in which a subject pronoun does not occur. As in the previous example, the English translation of the sentence has an expletive as its subject.
Another way to express something impersonal in Italian is to use si impersonale. The reflexive pronoun si in (19), which could be seen as the subject of the given sentence, allows for expressing a given fact without specifying the subject.
2.3 Null subjects and pragmatics
The optionality of using subject (and object) pronouns raises the question of why one should use them at all. When they occur as subjects, do they fulfill some specific function? If this is not the case, it could be assumed that subject pronouns in Italian can generally be dropped and are simply never used. In the literature, it is often said that optional pronouns are used when they are stressed. This explains why expletives, the subjects of so-called weather verbs, are not possible in Italian: since they do not contribute to the interpretation of the sentence, they would never be stressed, and they will therefore never be overt [Haegeman, 96].
Beyond this explanation for overt subject pronouns, there are some other functions that overt subject pronouns fulfill. [Duranti, 84] observed the use of subject pronouns in spoken Italian and specified these functions. Pronouns, nouns and, generally, all defining phrases are used to draw attention to some specific referent. [Duranti, 84] suggests that Italian subject pronouns are devices through which speakers define main characters in a narrative and/or convey empathy or positive affect toward certain referents. We start with an example of the common use of zero pronouns.
(20)
Mio padre è andato a casa. Vuole cucinare.
my father went home. wants cook.
'My father went home. He wants to cook.'
A null subject (or zero anaphora) is typically used for talking about some referent that
has been mentioned in the immediate prior context (usually one or two clauses back).
After introducing the referent (in example (20), mio padre ) the omitted subject personal
pronoun is used to make additional statements about the introduced referent.
[Duranti, 84] determined that in some situations the subject pronoun should be used.
In these cases, it has to have some special function. He identified these functions by
observing and analysing sketches of conversations of Italian native speakers.
1. Introducing and keeping track of referents in discourse
If a referent is not part of the recent context, it can be brought back into the
context by using the pronoun that refers to it. In this case, the pronoun can be seen
as an attention-getting device: It draws the addressee's attention to a particular
referent.
Sometimes, subject pronouns are used although their referents have been mentioned in
the immediate context. In these cases, there is some discontinuity in the
temporal or spatial dimension of the discourse. For example, the pronoun is used to
reintroduce an already mentioned referent, but in the context of some new
specific event.
2. 'Main' characters and 'minor' characters
There is a difference between the use of pronouns for referents who are important in
a story (main characters) and for those who are not (minor characters). The
more important the character, the more often he/she is referred to by means of
a personal pronoun. For referring to minor characters, on the other hand, NPs or
demonstratives are used.
3. Expressing empathy toward the referent
Besides the personal pronouns, one can refer to someone in Italian by using
demonstrative pronouns. Closer observation of the use of personal and demonstrative
pronouns showed that demonstrative pronouns are used to express a certain emotional
distance or negative affect toward the referent, whereas personal pronouns are
used to express empathy with the referent.
[Duranti, 84] also points out that the prior mention of a referent is not a necessary
condition for using a subject pronoun that refers to someone or something. For
example, in some cases, the 3rd person subject pronoun is used without prior identification
of any referent. It can be used for referents that can be implied by a previous
identification set. Table 1 from [Duranti, 80] shows how often the referents are introduced
before they are referred to by a null subject pronoun, by a pronoun, and by a noun. The length
of the context for the introduction of the referent has been set to 2 preceding sentences. In
72,1% of cases, the referents of null subjects can be found in one of the two preceding clauses.
In the other cases, the referent is either not mentioned at all, or the distance between the
referent and the subject pronoun is greater than two clauses. Overt pronouns behave similarly
to nouns. Their referents are rarely mentioned in the immediate context.
Referent of          introduced   not introduced
null subject (111)   72,1%        27,9%
pronoun (29)         34,5%        65,5%
noun (62)            27,4%        72,6%
Table 1: Statistics on referents of 3rd person subjects in Italian
2.4 Statistics on null subjects in Italian
In the previous chapter we have seen that the subject pronoun in Italian is rarely used.
To get an idea of how often the subject pronoun is omitted, I examined 45 randomly
selected sentences (93 main and subordinate clauses) from Europarl (cf. chapter 5.2). I
identified the sentence subjects and counted how often they are realised as zero pronouns,
overt pronouns and nominal phrases (NPs). The results are presented in table 2.
SUBJ-NP    SUBJ-PRON   null-SUBJ
42 (45%)   7 (7%)      45 (48%)
Table 2: Occurrence of SUBJ in Italian
Nearly half of all clauses have zero subjects. The subject pronoun is used in only 7% of
cases. I also examined which subject pronouns are omitted (cf. table 3).
Num/Pers   1    2    3    3P
Sg         24   ∅    8    4
Pl         4    3    2
Table 3: Occurrence of null-SUBJ in 93 observed clauses
The majority of the omitted subject pronouns are for the 1st person singular. This is not
really surprising: The corpus that I worked with (cf. chapter 5.2) consists of parliament
discussions in which a certain person expresses his or her opinion about something. The
speakers speak for themselves, so most pronouns are 1st person singular. Sometimes,
they also speak for some group of people to which they belong, e.g. a party. In these
cases, the omitted subject refers to 1st person plural referents. We see that there are no
pronouns for 2nd person singular. This is also not surprising because in such meetings,
people do not address each other informally.
Regarding the 3rd person singular, we have to distinguish between the polite form in
Italian (column 3P in table 3), which is expressed by the 3rd person singular when only one
person is the addressee, and the other cases. These other cases of 3rd person singular
pronouns refer either to someone or something already mentioned, or they correspond to
English expletives.
Now, let us take a look at the clauses in which the subject pronoun has not been
omitted. In a set of 95 examined clauses, I found 7 occurrences of overt subject pronouns,
three of which are the polite form. Let us take a closer look at these sentences.
(21)
Sì, onorevole Evans_i, ... , che lei_i propone ...
yes, honourable Evans, ... , that you suggest ...
'Yes, honourable Evans, ... , you are suggesting ...'
(22)
Onorevole Lynne_i, lei_i ha perfettamente ragione ...
honourable Lynne, you have perfectly right
'Honourable Lynne, you are perfectly right ...'
(23)
Onorevole collega Barón Crespo_i, lei_i non ha potuto partecipare ...
honourable colleague Barón Crespo, you not could participate ...
'Honourable colleague Barón Crespo, you couldn't participate ...'
Examples (21) - (23) show that the referents of the 3rd person singular pronoun in
subclauses are situated in the same sentence. This is rather unusual if we refer to the
observations of [Duranti, 80]. I assume that the subject pronoun is used here to
disambiguate the referent which can serve as a subject of the 3rd person singular verbs: the
NP introduced in the main clause or a referent from the preceding context (sentences).
Another three occurrences of pronouns are in the 1st person singular or plural.
(24)
Noi tutti siamo lieti ...
we all are pleased ...
'We all are pleased ...'
(25)
... che proprio noi non rispettiamo ...
... that ourselves we not adhere to
'... that ourselves not adhere to ...'
(26)
... l' onorevole Díez González e io avevamo presentato ...
... the honourable Díez González and I have presented ...
'Honourable colleague Díez González and I have presented ...'
Examples (24) and (25) show that the pronouns are used to stress something, e.g. the
subject of the sentence. It is peculiar that the pronouns occur with adverbs like tutti
and proprio that in some way emphasize the subject. The subject of the sentence in
(26) differs from the subjects we have observed so far. The Italian subject pronoun io is
used as a part of the coordinated subject NP which also contains the NP l' onorevole
Díez González. As a part of a coordinated subject NP, the subject pronoun cannot be
omitted.
Finally, there is one occurrence of the subject pronoun of the 3rd person singular:
(27)
... che esso stesso approva.
... which it itself adheres to.
'... which itself adheres to.'
The last example shows that the pronoun is also emphasized, in this case by the adjective
stesso. Similar cases of emphasis have already been shown in examples (16b) and (16c).
2.5 Summary
Pro-drop is a linguistic phenomenon which can be found in many languages. Some
languages allow for omitting both subject and object pronouns (pro-drop languages),
whereas some languages like Italian permit only the subject pronoun to be omitted.
Italian is therefore called a null subject language (NSL). On the other hand, we have
observed that some languages like English must have overtly expressed (pronominal)
subjects. English belongs to the group of non-null subject languages (non-NSL). Whereas
English morphology is not rich enough to allow the recovery of the characteristics of
missing subjects, the Italian verb inflection enables the derivation of the linguistic
characteristics (for example, number and person) of the omitted pronominal subject.
The analysis of subject pronouns in the given language pair showed that English as a
non-NSL also has constructions in which the subject can be omitted (cf. examples (9) -
(11)). However, these constructions are not relevant for this work, in which I deal solely
with finite English sentences which do not allow for omitted subjects.
The analysis of Italian sentences revealed that the pronouns in Italian (according
to the observed corpus) are omitted in most cases (cf. table 2). If they are overtly
expressed, they are often emphasized by accompanying adjectives or adverbs (cf. examples
in (24) and (25)). In specific contexts, the 3rd person pronoun lei is used to enable
unambiguous identification of the NP that it refers to (cf. examples (22) and (23)).
The difference in the usage of subject pronouns in Italian and English (cf. example (7))
leads to problems in machine translation (MT). In the following chapter, the problem of
pro-drop within MT is discussed. After previous work on pro-drop in MT is presented,
different cases of problems regarding the translation of pronominal subjects in both
translation directions IT → EN and EN → IT are shown.
3 Pro-drop in machine translation
In this chapter, subject pronoun omission within machine translation (MT) is discussed.
Although this work concentrates on statistical machine translation, I discuss previous
work regarding pro-drop in different MT systems. In section 3.2, a detailed analysis of
pronominal subject translation between English and Italian is carried out. Example
sentences containing pronominal subjects have been translated by the rule-based system
Systran² and the statistical MT systems Google Translate³ and Moses⁴.
When Italian null subjects are translated into English, their properties like number,
person and gender have to be derived in order to generate the correct English subject
pronoun. For human translators, it is relatively easy to do this, since they are able
to identify the person, animal or thing to which the omitted subject pronoun refers.
These referents are not necessarily in the same sentence: They can occur in one of the
preceding sentences. Problems occur when single Italian sentences containing a null
pronoun have to be translated. Even without context and access to world knowledge, it is
possible to derive the right person and number of the omitted pronoun. But if, for example,
it is known that the missing pronoun is 3rd person singular, but we do not know which
gender the pronoun has, how can we decide whether we should translate the missing pronoun
as the feminine pronoun she or as the masculine pronoun he?
When the translation task is in the other direction, the decision must be made whether the
Italian pronominal subject should be expressed overtly or be dropped. Furthermore, the
gender discrepancy between English and Italian can lead to the generation of incorrect
Italian pronouns (for example, 3rd person pronouns).
Machine translation is confronted with the same problems when translating between
the non-NSL English and the NSL Italian. Most MT systems operate on single-sentence
input and do not use the context of previous sentences. When translating into English, the
correct pronoun for a null subject in Italian has to be found. But often, the context
(previous sentences) of an observed sentence has to be taken into account to resolve the
missing pronoun. When translating into Italian, it has to be determined whether the subject
pronoun should be generated or omitted.
We summarize the questions that have to be answered:
(Q1) Automatic word alignment
How to align the existing subject pronoun in non-NSL (English) with an omitted
subject in NSL (Italian)?
(Q2) Translation: NSL → non-NSL
How can we automatically generate the right subject pronoun in the target language
for the missing subject pronoun in the source language?
2 http://www.systranet.com/
3 http://www.google.com/language_tools
4 I built a baseline SMT system with Moses (cf. chapter 6.2).
(Q3) Translation: non-NSL → NSL
When should the non-NSL subject pronoun be omitted in the NSL target language?
The answer to this question is important if we want the automatically generated
translations to sound natural.
3.1 Previous work on zero pronouns in MT
The problems regarding automatic translation of null subjects from an NSL to a
non-NSL and vice versa have been dealt with only indirectly.
[Goldwater & McClosky, 05] dealt with the statistical machine translation of the
language pair Czech (NSL) and English. The aim of their work was to find out whether the
translation from Czech, a morphologically rich language, to English, which is a language
with weak morphological inflection, can be improved if morphological information
is available. Their idea was to use morphological analysis on Czech. The Czech input
has been lemmatized and pseudowords have been inserted in order to eliminate some
morphological differences between the two languages and to deal with the sparse data
problem. These pseudowords are morphological tags that express some specific properties.
[Goldwater & McClosky, 05] inserted pseudowords with information about the
verb person (among others) into the Czech input. The pseudowords should simulate the
existence of pronouns for the English pronouns to align with. [Goldwater & McClosky, 05]
reported that person pseudowords have indeed been aligned to English pronouns with
high probability. However, it has not been reported whether these pseudowords solve all
problems regarding null subjects. The question is how often the null subjects are correctly
translated. Erroneous translations are possible when ambiguous verbs have to be
translated, or when the referents of the omitted subject pronouns have a different grammatical
gender. For the opposite translation direction, this approach could be somewhat
problematic: If English pronouns are in most cases aligned to Czech pseudowords (with
surface form ∅), this translation alternative receives high likelihood. Are then (nearly)
all English subject pronouns translated as null subjects in Czech?
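The pseudoword idea can be illustrated with a minimal sketch. It is not the implementation of [Goldwater & McClosky, 05]: the token representation (lemma, POS tag, person), the tag "V" for finite verbs, the pseudoword format PER_<n> and the position directly before the verb are all assumptions made for this illustration, and an Italian lemma is used instead of Czech.

```python
from typing import List, NamedTuple, Optional


class Token(NamedTuple):
    lemma: str             # lemmatized word form
    pos: str               # coarse POS tag; "V" marks a finite verb (assumed tag set)
    person: Optional[int]  # 1, 2 or 3 for finite verbs, None otherwise


def insert_person_pseudowords(tokens: List[Token]) -> List[str]:
    """Return the lemma sequence with a person pseudoword before each finite verb."""
    out: List[str] = []
    for tok in tokens:
        if tok.pos == "V" and tok.person is not None:
            # Hypothetical pseudoword format; it gives an English subject
            # pronoun a source-side token to align with.
            out.append(f"PER_{tok.person}")
        out.append(tok.lemma)
    return out


# Illustration only: a null subject clause with a 1st person verb.
sentence = [Token("parlare", "V", 1), Token("italiano", "N", None)]
print(insert_person_pseudowords(sentence))
```

In such a setup the word aligner sees one additional source token per finite verb, which is what allows the English pronoun to receive an alignment at all.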
Another work on translation between an NSL (Spanish) and a non-NSL (English) has been
done by [Peral & Ferrández, 03]. They developed a system which identifies and resolves
all pronouns (not only the omitted subject pronouns) in Spanish as a source language.
Their translation system is based on an interlingua approach. The input text undergoes
several analysis steps: morphological analysis, POS-tagging, parsing and word-sense
disambiguation. The enriched input text serves as input to a component which deals
with different NLP problems like anaphora identification and resolution. After dealing
with anaphora, the generation of the interlingua representation of the whole input text is
carried out. This representation contains all information needed to translate pronouns
into the target language. Although the authors report very good results in the tasks of
anaphora identification and generation, there are some additional problems that their
MT system had to solve. For example, if it is clear that the omitted subject pronoun
in Spanish as source language is 3rd person feminine, this does not automatically mean
that the correct English pronoun should also be of the same gender (e.g. el_masc with
the referent el perro_masc vs. it_neut with the referent dog_neut). In English, animals have
neutral grammatical gender. So, we have to have the information that the referent of el
is an animal in order to correctly translate the pronoun (possibly an omitted pronoun)
into English. Evaluating their system, [Peral & Ferrández, 03] translated all occurrences
of English (as source language) pronouns into their Spanish equivalents. They note,
though, that a subsequent task must decide whether the pronoun in Spanish must be generated,
substituted by some other pronoun or eliminated.
[Nakaiwa & Ikehara, 92] developed an anaphora resolution system for Japanese (a
pro-drop language) and integrated it into a machine translation system for Japanese
to English called ALT-J/E. The anaphora resolution process is based on semantic
attributes of verbs and their relationship to the arguments. For each verb it is necessary
to determine its semantic category and its relationship to its arguments (SUBJ, OBJ).
These arguments can be anaphora and nominal phrases. Using this information, rules allow
the derivation of the correct referent for a particular anaphora, which can be a zero
pronoun. For example, let us assume that we want to resolve an anaphora a_i
governed by some verb v_i with some semantic attribute vsa_i. a_i is a subject of v_i. We
have the same information about some verb v_j of a so-called topicalized unit sentence⁵
which governs some phrase which could be a referent of a_i. Given this information, the
rules are searched in order to find the right referent for a_i. The rules have the following
form: If v_i has a verb category vsa_i and governs an anaphora a_i as its argument arg_i (e.g.
SUBJ) and we have some verb v_j with verb category vsa_j, then the argument arg_j (e.g.
OBJ) of verb v_j can be assumed to be a referent of a_i. To apply these rules, the verb in
the sentence with the zero pronoun and the verb of the unit sentence have to be extracted.
Their verb categories are identified. According to the rules describing verb relationships
as sketched above and the identified verb categories, the referent of the zero pronoun is
established.
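The rule form described above can be read as a table lookup. The following is a minimal sketch under that reading only; it does not reproduce ALT-J/E. The verb category names, the two example rules and the function name resolve_zero_pronoun are invented placeholders.

```python
from typing import Dict, Optional, Tuple

# Key:   (category vsa_i of v_i, argument slot arg_i of the anaphora, category vsa_j of v_j)
# Value: argument slot arg_j of v_j whose filler is assumed to be the referent.
# All category names and both rules are invented for illustration.
RULES: Dict[Tuple[str, str, str], str] = {
    ("MOTION", "SUBJ", "POSSESSION"): "SUBJ",
    ("PERCEPTION", "SUBJ", "TRANSFER"): "OBJ",
}


def resolve_zero_pronoun(
    vsa_i: str,
    arg_i: str,
    vsa_j: str,
    args_of_vj: Dict[str, str],
) -> Optional[str]:
    """Return the phrase filling arg_j of v_j, or None if no rule applies."""
    arg_j = RULES.get((vsa_i, arg_i, vsa_j))
    if arg_j is None:
        return None
    return args_of_vj.get(arg_j)


# The unit sentence verb v_j has a subject and an object; the zero pronoun
# is the subject of a verb of the (invented) category PERCEPTION.
unit_sentence_args = {"SUBJ": "teacher", "OBJ": "student"}
print(resolve_zero_pronoun("PERCEPTION", "SUBJ", "TRANSFER", unit_sentence_args))
```

A lookup of this kind returns nothing when the pair of verb categories is not covered, which corresponds to the case in which no rule is found for the anaphora.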
When translating the resolved zero anaphora (i.e. their referents), it could happen
that the translation in English becomes verbose. In this case, elliptical pronouns and
definite articles should be used [Nakaiwa & Ikehara, 92]. This leads again to the problem
of generating the correct English subject (and object) pronoun.
3.2 Translation between English and Italian
In this chapter, I will describe differences between English and Italian regarding the null
subject that cause problems for automatic translation between the two languages. Some
of the cases have already been mentioned in the preceding discussion. Now, we look
at concrete examples and translations that three MT systems provided: S - the
rule-based MT system SYSTRAN⁶, G - the statistical MT system Google translator⁷, and
M - the statistical MT system Moses (cf. chapter 4). The translation under R represents
the reference. Some of the source language sentences are extracted from Europarl (cf.
section 5.2), whereas others were constructed by myself.
5 This is a sentence that contains nominal phrases which can serve as referents of the
anaphora in the following sentences.
6 Free translation at http://www.systranet.com/ (November 2009).
7 Free translation at http://www.google.com/language_tools (November 2009).
Since the example analysis describes linguistic knowledge needed for resolving some
problems regarding null subjects, it is important to point out that phrase-based statistical MT systems in their original form do not have access to any linguistic knowledge, so
they are certainly disadvantaged when linguistic knowledge is needed to generate correct
translations. Rule-based systems are more likely to recognise which pronominal subject
can occur with a given verb form.
3.2.1 Italian to English
We already know that in Italian, the properties of the missing subject like number and
person can be derived from the verb inflection (cf. section 2.1). We will now examine
how well this works in available MT systems. The words set in bold in the Italian input
sentences are finite verbs. The pronouns in bold in the English translations represent
subjects corresponding to the omitted subject in Italian.
First person subject pronouns
Let us begin with the omitted pronouns of the first person singular and plural.
(28)
So che il governo americano condivide i nostri obiettivi.
R: I know that the American government shares our goals.
G: I know that the U.S. government shares our goals.
S: I know that the government American shares our objectives.
M: I know that the american government shares our objectives.
(29)
Hanno compreso, come noi, quanto sia importante che svolgiamo insieme ...
R: They understood, as we did, how important it is that we carry out together ...
G: They understood, like us, it is important that we do together ...
S: They have comprised, like we, how much is important that we carry out ...
M: They understood, as we, how important it is that *∅ perform together ...
All translations but one are correct. In (29), Moses does not generate the subject
pronoun of the verb perform. Verb forms for 1st person singular and plural are not
ambiguous, so that the right pronoun in English can be derived from the analysis of
the Italian verb form.⁸ The translation possibilities can be summarised as shown in (30).
(30)
IT.Verb.1.P.Sg → I + EN.Verb.1.P.Sg
IT.Verb.1.P.Pl → We + EN.Verb.1.P.Pl
Second person subject pronouns
Let us go on with the second person singular and plural.
(31)
Hai detto che parli italiano.
R: You said that you speak Italian.
G: You said that you speak Italian.
S: You have said that it speaks Italian.
M: You have said that *∅ speaks Italian.
(32)
Avete giocato con i genitori.
R: You played with parents.
G: You played with their parents.
S: You had played with the parents.
M: *You with their parents.
8 An explanation for false Moses output is given later in chapter 6.
The VPs (auxiliary + participle) in the main clauses in example sentences (31) and
(32) can be uniquely translated into English. In the Italian subclause in (31), we face
an ambiguous verb parli: It can occur with the 2nd person singular, as recognised by
Google. But, as a subjunctive, parli can furthermore occur with the 3rd person singular,
as recognised by Systran. Moses does not generate any subject pronoun, leading to the
grammatically incorrect subclause translation.
Beyond the ambiguity regarding some verbs in indicative and subjunctive, there is
another problem regarding verbs in present tense. The indicative and imperative verbs
for the second person are the same.
(33)
Dite che parlate italiano.
R: Say that you speak Italian.
G: ∅ Say you speak Italian.
S: You say that *∅ speeches Italian.
M: You say that you are italian.
(34)
Dite se parlate italiano.
R: Say if you speak Italian.
G: ∅ Say if you speak Italian.
S: You say if *∅ speeches Italian.
M: You say if *∅ spoken italian.
(35)
Scrivi una lettera.
R: You are writing a letter.
G: *∅ Write a letter.
S: You write a letter.
M: *∅ Refer a letter.
(36)
Scrivi una lettera!
R: Write a letter!
G: ∅ Write a letter!
S: *You write a letter!
M: ∅ Refer a letter!
The only difference between (33) and (34) is the conjunction used: che (= that) and
se (= if). Whereas the conjunction che could be used both in an indicative and an
imperative sentence, the conjunction se instead calls for the interpretation
of the verb dite as imperative. So, the Google translations are both acceptable, but
Systran's are not. Whether the subject of the main clause in (33) should be used (for
the indicative reading) or not (for the imperative reading) cannot be derived directly. This
would probably be easier if we had access to the context of the given sentence. If the
sentence mode is marked by punctuation, it is possible to derive the right sentence
mode (cf. examples (35) and (36)). Unfortunately, the MT systems do not seem to use
this information for deciding whether the subject in English should be generated (for
indicative) or not (for imperative).
Let us summarise the translation alternatives for the omitted subject for the 2nd
person singular and plural.
(37)
IT.Verb.2.P.Sg → You + EN.Verb.2.P.Sg (indicative)
IT.Verb.2.P.Sg → ∅ + EN.Verb.2.P.Sg (imperative)
IT.Verb.2.P.Pl → You + EN.Verb.2.P.Pl (indicative)
IT.Verb.2.P.Pl → ∅ + EN.Verb.2.P.Pl (imperative)
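The alternatives in (37) can be sketched as a small function. This is an illustration only and not part of any of the tested systems: the function name second_person_rule is hypothetical, and the decision between indicative and imperative is reduced to the punctuation cue from examples (35) and (36), i.e. a sentence ending in an exclamation mark is treated as imperative and everything else as indicative. Sentences like (33), which are ambiguous without context, are not resolved by this heuristic.

```python
def second_person_rule(number: str, sentence: str) -> str:
    """Return the right-hand side of (37) for an Italian 2nd person verb form.

    number:   "Sg" or "Pl", read off the Italian verb inflection
    sentence: the Italian source sentence, used only for its final punctuation
    """
    if number not in ("Sg", "Pl"):
        raise ValueError(f"unknown number: {number!r}")
    # Assumption: a final "!" marks the imperative reading.
    imperative = sentence.rstrip().endswith("!")
    subject = "∅" if imperative else "You"
    return f"{subject} + EN.Verb.2.P.{number}"


print(second_person_rule("Sg", "Scrivi una lettera."))   # example (35)
print(second_person_rule("Sg", "Scrivi una lettera!"))   # example (36)
```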
Third person subject pronouns
The most complicated case is that of the 3rd person pronouns that have been omitted.
We will start with the cases in singular.
(38)
Dice che parla italiano.
R: He/She says that he/she speaks Italian.
G: She says she speaks Italian.
S: It says that *it speaks Italian.
M: *∅ Says that *∅ speaks Italian.
(39)
Pensa che non è malata.
R: He/She thinks that she is not ill.
G: *∅ Think that is *∅ not sick.
S: *It thinks that *it is not sick.
M: *∅ Does that *∅ not is sick .
Examples (38) and (39) already show the limitations of the tested systems regarding
the null subject. Indeed, in the first example, it is not possible to derive the gender of
the missing subject pronoun. Google proposes the pronoun for 3rd person feminine as
subject for both clauses in the source sentence. Since we do not know anything about
the context of the sentence, we can accept this solution.⁹ The translation that Systran
suggested has at least one error. The proposed subject for the main clause can be seen
as correct if the subject refers, for example, to some book or note or the like. Knowing,
though, that only humans can speak, the subject pronoun it for the subclause cannot
be correct. The Moses translation does not contain subject pronouns and is therefore
grammatically incorrect.
9 I have been told by a native speaker of Italian that masculine is used when a decision
about the gender cannot be made.
In contrast to example (38), at least the subclause in (39) provides all information
needed to generate the right subject pronoun in English. Predicative adjectives which
occur with copula verbs match in number and gender with the referents that they modify.
So, it is possible to determine the subject of the subclause in (39) as feminine singular.
The verb provides the information that the subject is in the 3rd person, so we can clearly
say that the subject in the English translation should be she. Concerning the subject of the
main clause, the translation should be at least he or she if we assume that only humans
have the ability to think.
The property of Italian described for the subclause in (39) also holds for compound
tense forms which take essere (= be) as an auxiliary.
(40)
È andata a casa.
R: She went home.
G: *∅ Went home.
S: *It has gone to house.
M: *∅ Has gone home.
(41)
Era rimasto a scuola.
R: He stayed at school.
G: He had stayed in school.
S: *Era remained to school.
M: *∅ Remained at school.
The underlined participle in (40) provides information about the gender of the omitted
subject pronoun. Together with the information which the inflected verb È provides, it
is possible to identify the subject as 3rd person singular feminine: she. The same form
of analysis for the verb Era and the participle rimasto leads us to the conclusion
that the subject in English in (41) should be he.
The 3rd person singular is additionally used in the polite form of address. It is used
with the Italian 3rd person pronoun lei, which is unfortunately also the pronoun for the 3rd
person singular feminine. So, this is another case of ambiguity to deal with.
(42)
Lei non è stata a casa?
R: Was she not at home?
G: She was not at home?
S: Hasn't *it been to house?
M: *You was not at home?
Google translator recognises the subject pronoun Lei as 3rd person singular feminine,
which is one interpretation alternative of this pronoun. The other translation possibility,
namely you, is found by Moses, but the generated pronoun does not agree with the
corresponding verb was.
The next examples show impersonal constructions in Italian. We begin with an example
of a so-called weather verb.
(43)
Piove.
G: *∅ Rains.
S: It rains.
M: *∅ Rain.
Weather verbs as in (43) need expletives in English. Only Systran generates the correct
subject pronoun for the example sentence in (43).
Let us now examine the si sentences and their English equivalents. The first three
examples contain intransitive verbs. These constructions are called si impersonale.
(44)
In Germania si beve la birra.
R: In Germany, people drink beer.
G: * In Germany, ∅ drinking beer.
S: In Germany the beer is drunk.
M: In Germany we drink beer.
(45)
In Germania si è letto molto.
R: In Germany, people have read a lot.
G: In Germany *you have read a lot.
S: In Germany a lot has been read.
M: Germany has read.
(46)
Quando eravamo studenti, si è andati a scuola.
R: When we were students, we went to school.
G: When we were students, *he went to school.
S: When we were students, it has been gone to school.
M: When we were studenti, *∅ has gone to school.
Examples (44) - (46) show the use of si impersonale. The subjects in the English
translations of (44) and (45) should be people or one. The translations of the first
clause in (46) are correct, but the translations of the second clause, which contains the
si construction, are a bit problematic. This clause consists of the finite verb for the
3rd person singular and the participle andati that matches a subject in plural. MT systems
use only the information about the finite verb and generate the corresponding pronouns in
the target language, though they have different values for gender.
But if the VP è andati refers to the same set of referents as the first clause, the
pronoun we should be used as the subject of the second clause. This is not trivial since we
are dealing with the verb è, which needs a subject of the 3rd person singular, but we want
to generate a pronoun of the 1st person plural in the target language.
Until now, we have taken a look only at cases of 3rd person singular. In (47) and (48)
follow examples for 3rd person plural.
(47)
Hanno cantato la mia canzone.
R: They sang my song.
G: They sang my song.
S: They have sung my song.
M: My song have been sung.
(48)
Sono state in Croazia.
R: They were in Croatia.
G: *∅ Were in Croatia.
S: They have been in the Croatia.
M: *∅ Were in Croatia.
The only alternative for translating 3rd person plural in English is they. All information
(3rd person plural feminine) can be derived for the subject in the example sentence (48).
Since there are no gender distinctions for 3rd person plural in English, this translation
case is unambiguous and should be they.
Let us now summarise the observations made by examining examples (38) - (48).
(49)
IT.Copula.3.P.Sg + IT.PastPart.F → She + EN.Verb.3.P.Sg
IT.Copula.3.P.Sg + IT.PastPart.F → You + EN.Verb.2.P.Sg (polite)
IT.Copula.3.P.Sg + IT.Predicative.F → She + EN.Verb.3.P.Sg
IT.Copula.3.P.Sg + IT.Predicative.F → You + EN.Verb (polite)
IT.Copula.3.P.Sg + IT.PastPart.M → He + EN.Verb.3.P.Sg
IT.Copula.3.P.Sg + IT.Predicative.M → He + EN.Verb.3.P.Sg
IT.Verb.3.P.Sg → He/She + EN.Verb.3.P.Sg (if only human referents possible)
IT.Verb.3.P.Sg → It + EN.Verb (if human referents not possible)
IT.si + IT.Verb.3.P.Sg → one/people + EN.Verb.3.P.Sg/Pl
IT.Impers.3.P.Sg → It + EN.Verb.3.P.Sg
IT.Verb.3.P.Pl → They + EN.Verb.3.P.Pl
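The summary in (49) can be read as a decision procedure over the features of the Italian VP. The sketch below is one possible encoding and not an implementation used in this work: the function name, the parameter names and the feature values are hypothetical, the features are assumed to be already available (e.g. from POS tags), and the value "mixed" for cases in which both human and non-human referents are possible is an addition that (49) does not cover.

```python
from typing import List, Optional


def third_person_subjects(
    number: str,
    gender: Optional[str] = None,
    si: bool = False,
    impersonal: bool = False,
    referents: str = "mixed",
) -> List[str]:
    """Return candidate English subjects for an omitted Italian 3rd person subject.

    number:     "Sg" or "Pl", taken from the finite verb
    gender:     "F" or "M" if a past participle or predicative adjective
                after a copula shows it, otherwise None
    si:         True for si + 3rd person singular verb
    impersonal: True for impersonal (weather) verbs
    referents:  "human_only", "no_human" or "mixed" (assumed default)
    """
    if number == "Pl":
        return ["They"]
    if si:
        return ["one", "people"]
    if impersonal:
        return ["It"]
    if gender == "F":
        return ["She", "You"]  # "You" is the polite reading
    if gender == "M":
        return ["He"]
    if referents == "human_only":
        return ["He", "She"]
    if referents == "no_human":
        return ["It"]
    return ["He", "She", "It"]  # no evidence: keep all singular candidates


print(third_person_subjects("Sg", gender="F"))       # cf. (40) È andata a casa.
print(third_person_subjects("Sg", impersonal=True))  # cf. (43) Piove.
print(third_person_subjects("Pl", gender="F"))       # cf. (48) Sono state in Croazia.
```

The function returns a list because several rules in (49) share the same left-hand side; choosing between she and polite you, or between he and she, still requires context.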
There is another interesting construction in Italian which does not contain a subject,
namely the negated imperative for 2nd person singular.
(50)
Non mangiare nelle ore di lezione!
R: Do not eat in the hours of lessons!
G: ∅ Do not eat in the hours of lessons!
S: ∅ Not to eat in the hours of lesson!
M: ∅ Do not eat in hours of lesson!
The negated imperative form for the 2nd person singular consists of the negation non
and the infinitive, in our case mangiare. This kind of sentence should be translated by
a do not ... construction, as Google translator suggested. Though Systran's translation
does not have a subject, which is correct, it also contains the infinitive marker to, which
makes the sentence grammatically incorrect. The analysis of the example in (50) leads
to the following rule:
(51)
IT.non + IT.infin → ∅ + do not EN.infin
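Rule (51) can be sketched as a pattern check on a POS-tagged Italian sentence. The sketch is an illustration under assumptions: the tag name "INF" for infinitives, the function name negated_imperative and the restriction to the sentence-initial position are hypothetical, and the English infinitive is passed in by the caller because lexical translation is outside the scope of the rule.

```python
from typing import List, Optional, Tuple


def negated_imperative(
    tagged: List[Tuple[str, str]],
    en_infinitive: str,
) -> Optional[str]:
    """Apply rule (51) to a tagged sentence given as (word, tag) pairs.

    Returns "Do not <infinitive>" without a subject if the sentence starts
    with non followed by an infinitive, otherwise None.
    """
    if len(tagged) < 2:
        return None
    first_word, _ = tagged[0]
    _, second_tag = tagged[1]
    # "INF" is an assumed tag name for Italian infinitives.
    if first_word.lower() == "non" and second_tag == "INF":
        return f"Do not {en_infinitive}"
    return None


# cf. (50): Non mangiare nelle ore di lezione!
print(negated_imperative([("Non", "NEG"), ("mangiare", "INF")], "eat"))
```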
3.2.2 English to Italian
As already mentioned at the beginning of the chapter, the main question in translation
direction EN → IT is whether the Italian subject pronouns should be generated or
omitted. In principle, they could always be generated or always omitted. Neither of
these decisions is ideal: Whereas the omission of all subject pronouns can lead
to problems with respect to the adequacy of the translations, the generation of all
subject pronouns would very likely result in a text that sounds rather unnatural. A
text consisting of a sequence of sentences in which almost every sentence has a subject
pronoun contains a lot of redundant information (number, person, gender) coded at the
same time both in the subject pronouns and in the finite verbs. So, the subject pronouns
should be omitted to avoid the redundancy and to preserve the text fluency.
If just one isolated sentence should be translated, it is rather imaginable that such
a sentence contains a subject pronoun. The explicit occurrence of the subject pronoun
in such isolated sentences can be explained by the fact that without the context, it is
not possible to determine the referent which the omitted subject pronoun refers to. In
such a context, the use of a subject pronoun can thus be compared with the use of
a NP subject. It introduces a referent and provides information about it. In isolated
sentences, this information can only be provided by the referent that is situated in the
given sentence.
Since translation is more often carried out on a text, it should be examined in which
contexts the subject pronoun should be dropped or realized overtly. In our discussion
so far, we saw that the use of a subject pronoun often has pragmatic reasons (cf. section
2.3) which are not easy to capture in an automatic translation system.
Some cases in which the pronoun is overtly used have already been shown and discussed
in section 2.2.2. A much more detailed examination is needed to find the contexts in
which subject pronouns in Italian are used. The pronoun triggers shown in (16b), (24),
(25), (27) have to be identified and it should be investigated how probable it is that they
really occur with the subject pronoun.
This kind of rather local regularity can be captured by the SMT systems. They work
on the word level and can identify word sequences which are often translated to each
other. So, if itself corresponds relatively often to the phrase esso stesso, it has a good
chance to be translated to it without using the heuristics to decide whether the pronoun
should be generated (see footnote 10).
3.3 Summary
In MT, the problem of pro-drop has been dealt with only marginally. But in my opinion,
this is an important issue since the absence of the subject (pronoun) in a non-NSL
leads to grammatically incorrect sentences. If the subject is not generated because the
corresponding element in the source language does not exist, it should be examined
which information in the source language could be used to generate the correct subject
pronoun. The analysis of source sentences of an NSL, Italian, (cf. section 3.2.1) showed
that in many cases, Italian verbs bear enough information to enable the generation of
the correct English pronoun. However, in a number of cases, Italian verbs are ambiguous
and therefore require the observation of the context (preceding sentences) in order to
derive the correct English subject pronoun.
When an Italian text should be generated out of an English input, it has to be determined if the subject pronouns should be absent or not. Since their use has pragmatic
reasons, more detailed analysis of Italian is needed to answer this question.
In the following chapter, the details of statistical machine translation are sketched.
In chapter 5, a method for the word alignment of Italian and English VPs is described.
SMT systems are built to test if the rule-based VP alignment contributes to better
translation of pronominal subjects between Italian and English. The evaluation results
of the systems are shown in chapter 6.
Footnote 10: Details on phrase-based SMT are discussed in chapter 4.
4 Statistical machine translation
This chapter describes phrase-based statistical machine translation (SMT). In section
4.1, the statistical models for the automatic word alignment are introduced. We take a
closer look at GIZA++, the open source word alignment tool developed by [Och & Ney, 03]
since this tool was used to create a baseline word alignment which has been improved
by applying the alignment rules described in chapter 5. In section 4.2, the concept of
phrase-based SMT is described. The phrase-based SMT approach is implemented in an
open source SMT system Moses [Koehn et al., 07] which has been used within this work.
4.1 Word alignment
Word alignment is a very important task within SMT. In the training process of an SMT
system, it is necessary to identify word equivalences to gain the translation tables which
are needed in the translation process. Phrase-based SMT systems (cf. section 4.2) use
the word alignment to extract translation phrases (word sequences). So, the quality of
the word alignment is crucial for extracting good parallel phrases.
There are five statistical models, the so-called IBM Models, which are used to automatically compute the word alignment of a parallel sentence-aligned corpus [Brown et al., 03].
Word alignment models are trained by the Expectation Maximization Algorithm (EM).
The EM consists of two steps: (i) expectation in which the alignment model is applied
to the data, and (ii) maximization in which the model parameters are recalculated. The
simplest way to start the EM training is to assume that all words are equally probable
to be aligned to each other. The model is applied to the data resulting in the word
aligned parallel corpus. On the basis of the counts of the alignment pairs, the lexical
translation probabilities are re-estimated. These recalculated model parameters are used
as the model for the next iteration. The algorithm stops when convergence is reached.
In the first statistical word alignment model, IBM Model 1, the sentences are treated
as a bag of words which means that the word order does not play any role in the word
alignment process. The improvement of this model leads to Model 2 in which the
target word also depends on its position in the TL sentence. Since some words can be
aligned to a sequence of words in some other language, it is desirable to model and allow
1-to-n alignments. This is done by modeling the word fertility in Model 3. In
Model 4, the position of the previously translated word is taken into account. In the
following, the IBM models for the word alignment are briefly described (see footnote 11).
IBM Model 1
When computing word alignment of a sentence pair, we are interested in the most
probable alignment $a$ for a sentence pair containing the target language (TL) sentence
$e = (e_1, ..., e_{l_e})$ and the source language (SL) sentence $f = (f_1, ..., f_{l_f})$.
Formally, we need to compute the alignment probability $p(a|e, f)$ (cf. equation (1)).

$$p(a|e, f) = \prod_{j=1}^{l_e} \frac{t(e_j|f_{a(j)})}{\sum_{i=0}^{l_f} t(e_j|f_i)} \qquad (1)$$

Equation (1) uses the lexical translation probabilities $t(e_j|f_i)$ which express the
probability of generating the TL word $e_j$ from the SL word $f_i$. Furthermore, the numerator
$t(e_j|f_{a(j)})$ models the probability of generating the word $e_j$ from the word $f_i$ given an
alignment function $a(j) = i$.
Footnote 11: For a more detailed discussion about the methods in statistical machine translation,
please refer to [Koehn, 09].
After the most probable alignment of a sentence pair is computed using equation (1),
the model parameters are re-estimated. The weighted counts $c(e|f; e, f)$ for translating a
particular SL word $f$ into a particular TL word $e$ in the sentence pair $(e, f)$ are collected.
Having these counts, the new translation probability $t(e|f)$ can be estimated. As the initial
lexical probability distribution, the uniform probability distribution is taken indicating
that every TL word is equally likely to be generated out of each SL word.
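The training loop described above can be sketched in a few lines of Python. This is a minimal illustration and not the GIZA++ implementation: the function names `train_ibm_model1` and `best_alignment`, the toy corpus and the fixed number of iterations are assumptions made for this example, and the counts are collected as expected (fractional) counts over all possible links, which is the usual textbook formulation of EM for Model 1.

```python
from collections import defaultdict

def train_ibm_model1(corpus, iterations=10):
    """Estimate t(e|f) with EM. corpus: list of (target_words, source_words)."""
    # Position 0 of every source sentence is the empty word NULL.
    corpus = [(e_sent, ["NULL"] + f_sent) for e_sent, f_sent in corpus]
    target_vocab = {e for e_sent, _ in corpus for e in e_sent}
    # Uniform initialisation: every TL word is equally likely for each SL word.
    t = defaultdict(lambda: 1.0 / len(target_vocab))
    for _ in range(iterations):
        count = defaultdict(float)   # weighted counts c(e|f)
        total = defaultdict(float)   # sum of counts per source word f
        for e_sent, f_sent in corpus:
            for e in e_sent:
                norm = sum(t[(e, f)] for f in f_sent)   # denominator of (1)
                for f in f_sent:
                    c = t[(e, f)] / norm                # expectation step
                    count[(e, f)] += c
                    total[f] += c
        # Maximization step: re-estimate t(e|f) from the collected counts.
        t = defaultdict(float, {(e, f): c / total[f] for (e, f), c in count.items()})
    return t

def best_alignment(e_sent, f_sent, t):
    """Most probable a(j) for each target position j (0 stands for NULL)."""
    f_sent = ["NULL"] + f_sent
    return [max(range(len(f_sent)), key=lambda i: t[(e, f_sent[i])]) for e in e_sent]

corpus = [(["the", "house"], ["la", "casa"]),
          (["the", "door"], ["la", "porta"]),
          (["a", "door"], ["una", "porta"])]
t = train_ibm_model1(corpus)
alignment = best_alignment(["the", "house"], ["la", "casa"], t)
```

On this toy corpus the link between house and casa becomes stronger than the link between the and casa after a few iterations, because the is better explained by la.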
IBM Model 2
IBM Model 1 does not incorporate any knowledge about the word order in the target
sentence. In contrast, IBM Model 2 has an explicit model for an alignment based on
the position of the input and output words (cf. equation (2)).

$$a(i|j, l_e, l_f) \qquad (2)$$

The alignment probability distribution in (2) models the probability of translating some
source word in the position $i$ into a target word in a position $j$. The model predicts the
source word positions conditioned on the generated target word positions. Expanding
IBM Model 1 with the position based alignment probability distribution shown in (2),
we obtain a new equation for computing the most probable alignment $a$ for a sentence
pair $(e, f)$. The equation is shown in (3).

$$p(a|e, f) = \prod_{j=1}^{l_e} \frac{t(e_j|f_{a(j)}) \, a(a(j)|j, l_e, l_f)}{\sum_{i=0}^{l_f} t(e_j|f_i) \, a(i|j, l_e, l_f)} \qquad (3)$$

As in Model 1, new lexical translation probabilities are estimated from the weighted
counts for lexical translations $c(e|f; e, f)$. In addition to the lexical translations, the
position based probability distribution is computed using the counts for the translation
of the words in specific positions: $c(i|j, l_e, l_f; e, f)$. As the initial lexical probability
distribution, Model 2 uses the lexical probabilities computed by Model 1. The position
based alignment probabilities are initialised as $\frac{1}{l_f + 1}$.
IBM Model 3
IBM Model 3 contains an additional model which expresses the fertility of a source
word. It contains the probabilities of translating a source word into one, two or more target
words. Artificial fertility probabilities for the Italian word all (= to the) are shown
in (4). The probability that all generates two English words is much higher than the
probability that it generates only one English word.

$$n(2|\text{all}) = 0.8 \qquad n(1|\text{all}) = 0.2 \qquad (4)$$

The fertility model also allows the insertion of target words that do not have a counterpart
in a source sentence. These words are treated as being generated from a special token
NULL with fertility $n(\phi|\text{NULL})$. Additionally, the fertility model permits that a source
word is not translated at all. In other words, it can be dropped. This is expressed
by a fertility $n(0|w)$, where $w$ is a source word.
Instead of the alignment probability distribution in Model 2, Model 3 uses a
distortion probability distribution $d(j|i, l_e, l_f)$ which predicts target word positions based
on the source word positions.
For the re-estimation of the model parameters, only the most probable word alignments
for a sentence pair $(e, f)$ are used. As the initial lexical probability distribution,
the estimates from Model 2 are used. Since the distortion probabilities are not available
in the first iteration step, the alignment probabilities estimated by Model 2 are
used as the starting distortion probability distribution.
IBM Model 4
IBM Model 4 introduces a relative distortion model which is an improvement of the
absolute distortion model from IBM Model 3. The absolute distortion model does not do
well when large source and target sentences are dealt with. The movement probabilities
for such sentence pairs are sparse and not very realistic [Koehn, 09].
Since the position of a generated target word depends in particular on the position
of the generated word for a preceding source word, Model 4 introduces a distortion
probability distribution based on the position of the alignment of the previous source
word.
The distortion model implemented in IBM Model 4 is based on cepts. A cept
consists of a source word $f_j$ which is aligned at least with one target word. A center of
a cept $i$ (denoted by $\odot_i$) is defined as the ceiling of the average of the word positions.
Relative distortion $d_1$ of a target word $e_j$ in a position $j$, which is also the starting
position of a cept $i$, is defined as shown in (5).

$$d_1(j - \odot_{i-1}) \qquad (5)$$

If a target word $e_j$ is not the start element of a cept, its relative distortion is defined
as shown in (6). With the term $\pi_{i,k-1}$, we refer to the position of the preceding target
word in the cept which $e_j$ belongs to.

$$d_{>1}(j - \pi_{i,k-1}) \qquad (6)$$

Computed relative distortion values $d_1$ and $d_{>1}$ express the movement of a target word
$e_j$ depending on the position of the preceding target word $e_{j-1}$.
The training of the model starts with the estimates of Model 3 as the initial model
parameters. As in Model 3, the most probable alignments are computed from which the
counts for the parameter re-estimation are gathered.
GIZA++
The basis for the presented work is the base word alignment computed by the system
called GIZA++ developed by [Och & Ney, 03]. It is a combination of Model 1, an
HMM (Hidden Markov Alignment Model) $p_{HMM}$ shown in (7) and Model 4 $p_4$.

$$p_{HMM}(f, a|e) = p(B_0|B_1^I) \cdot \prod_{i=1}^{I} p(B_i|B_{i-1}, e_i) \cdot \prod_{i=0}^{I} \prod_{j \in B_i} p(f_j|e_i) \qquad (7)$$

In HMM, inverted alignments $B_0^I$ are used for representation of the alignment $a_1^J$.
They represent the mapping from a TL word to a SL word. $B_i$ is a partition of the SL sentence
marking the word (sequence) of a SL. The alignments with empty words are modeled by
the probability distribution $p(B_0|B_1^I)$, where the set $B_0$ consists of all positions of SL
words which are aligned with the empty word. $p(B_i|B_{i-1}, e_i)$ expresses the probability of
SL word (sequence) $B_i$ given the translation of the preceding SL word (sequence) $B_{i-1}$
and a target word $e_i$.
GIZA++ combines Model 1, HMM and Model 4. First, the parameters for Model
1 are computed. They serve as the initial model parameters for HMM. The estimates of
the HMM are finally used in Model 4 for deriving the final model parameters.
To allow n-to-m alignments, the alignment symmetrization is carried out. The word
alignment is carried out in both directions. In the next step, the produced alignments
are combined to compute the output alignment. In GIZA++, the intersection of the
alignments is computed. Thus, the alignments which are a part of both alignments are
taken. These alignments are considered to be very reliable since they can be found in
both alignments. After these links are identified, the alignments for the neighbouring
words are computed using the union of the two alignments (refined symmetrization)
[Och & Ney, 03][Koehn et al., 03].
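The symmetrization step can be illustrated with a small Python sketch. It is a simplified variant written for this text and not the heuristic implemented in GIZA++ or Moses: the function name `symmetrize`, the restriction to horizontal and vertical neighbours and the condition that a new link must cover a so far unaligned word are assumptions of this example.

```python
def symmetrize(e2f, f2e):
    """Combine two directional alignments given as sets of (e_pos, f_pos) links."""
    intersection = e2f & f2e          # reliable links found in both directions
    union = e2f | f2e                 # candidate links for the extension
    alignment = set(intersection)
    neighbours = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    added = True
    while added:                      # repeat until no further link can be added
        added = False
        for (i, j) in sorted(alignment):
            for di, dj in neighbours:
                cand = (i + di, j + dj)
                if cand in union and cand not in alignment:
                    e_free = all(a[0] != cand[0] for a in alignment)
                    f_free = all(a[1] != cand[1] for a in alignment)
                    if e_free or f_free:   # the link covers an unaligned word
                        alignment.add(cand)
                        added = True
    return alignment

e2f = {(0, 0), (1, 1), (2, 1), (4, 4)}
f2e = {(0, 0), (1, 1), (1, 2)}
combined = symmetrize(e2f, f2e)
```

In this example, the links (2, 1) and (1, 2) are added because they are neighbours of the reliable link (1, 1), whereas the isolated link (4, 4), which is found in only one direction, is left out.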
In this work, GIZA++ has been applied to the English-Italian parallel corpus producing the baseline word alignment which has been partially improved (cf. section 5).
[Pianta & Bentivogli, 04] evaluated the statistical word alignment computed by
GIZA++ for Italian and English. They used a corpus consisting of 25,000 sentence
pairs. Table 4 shows the evaluation results. As a symmetrization method,
[Pianta & Bentivogli, 04] used the intersection of the alignments computed for
English → Italian and Italian → English. The reported results on
the word alignment evaluation show that the GIZA++ word alignment for English-Italian
leaves some room for improvement.
Alignment       Precision   Recall
IT → EN         73.4        55.2
Intersection    95.2        38.8
Table 4: Evaluation of GIZA++ word alignment for English and Italian
4.2 Phrase-based SMT
The SMT belongs to the group of word-based machine translation systems. This means
that the input sentence that should be translated does not undergo any analysis (syntactic, semantic), but it is translated word-by-word. A large bilingual dictionary is needed
to carry out word-by-word translation. There are many cases in which word-by-word
translation fails. One word in SL does not always correspond to only one word in TL,
which also holds for the opposite translation direction. This leads to an assumption that
instead of words, phrases (word sequences) should be translated as one translation
unit. These phrases are not necessarily identical to linguistic phrases. For example, the
Italian word sequence Io sono (pronoun as a subject + sentence predicate) can be a
phrase which is translated as one translation unit into the English phrase I am. To carry out
this type of translation, we need translation probabilities for phrase pairs as shown in
table 5.
Translation       Probability p(e|f)
i am              0.80
i was             0.10
i have been       0.05
myself am also    0.03
we are            0.02
Table 5: Example phrase translation probabilities for io sono
When TL phrases are generated, they have to be reordered in order to appear in the
correct phrase order in the generated sentence. This is modelled by a reordering model.
Instead of learning reordering probabilities from the data, a cost function is applied.
The cost function expresses how expensive the movement of some phrase is.
In the following, the details on phrase-based SMT are described with respect to the implementation of phrase-based SMT in an open source SMT system Moses [Koehn et al., 07].
Phrase translation table
The first step in obtaining translation phrases is word alignment of parallel sentences.
In Moses, the word alignment tool GIZA++ (cf. section 4.1) is used. GIZA++ allows
one-to-many word alignment, where at most one TL word can be aligned with each
SL word. To account for this flaw, Moses expands the word alignment by aligning the
words in both directions. The result of bidirectional alignment is a many-to-many word
alignment of the sentence pair. The two alignments can be combined in several ways:
They can be intersected or the union can be built. In Moses, these two methods are
combined. Firstly, the intersection of the bidirectional alignments is computed. In the
next step, the additional alignment points are heuristically chosen from the alignment
union.
When word alignment is given, translation phrases can be derived. The phrases must
be consistent with the word alignment which means that the words of a phrase pair are
only aligned with the words within these phrases and not to the words outside.
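The consistency condition can be written down as a short check. The following Python sketch is only an illustration: the function name `is_consistent`, the representation of phrases as inclusive position spans and the additional requirement that a phrase pair contains at least one alignment link are assumptions of this example and are not taken from the Moses source code.

```python
def is_consistent(alignment, e_span, f_span):
    """Check a phrase pair given as inclusive (start, end) word position spans.

    alignment is a set of (e_pos, f_pos) links of the sentence pair.
    """
    e_lo, e_hi = e_span
    f_lo, f_hi = f_span
    has_link = False
    for e, f in alignment:
        inside_e = e_lo <= e <= e_hi
        inside_f = f_lo <= f <= f_hi
        if inside_e != inside_f:      # a link leaves the phrase pair
            return False
        if inside_e:                  # the link lies within both phrases
            has_link = True
    return has_link

# English: I(0) do(1) not(2) believe(3), Italian: non(0) credo(1)
links = {(0, 1), (1, 0), (2, 0), (3, 1)}
whole_pair = is_consistent(links, (0, 3), (0, 1))     # I do not believe / non credo
negation = is_consistent(links, (1, 2), (0, 0))       # do not / non
pronoun_only = is_consistent(links, (0, 0), (1, 1))   # I / credo
```

The last phrase pair is rejected because credo is also aligned with believe, which lies outside the English phrase. This shows why the alignment of the English subject pronoun has a direct influence on the phrases that can be extracted.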
After the phrase pairs are collected, their translation probability is estimated by relative frequency as shown in (8).

$$\phi(\bar{f}|\bar{e}) = \frac{count(\bar{f}, \bar{e})}{\sum_{\bar{f}_i} count(\bar{f}_i, \bar{e})} \qquad (8)$$

The probability of a phrase $\bar{f}$ given a phrase $\bar{e}$ is the count of how often
the phrases occur together divided by the total number of occurrences of the phrase $\bar{e}$.
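The relative frequency estimate in (8) amounts to two counting steps, as the following Python sketch shows. The function name `phrase_translation_probs` and the toy list of extracted phrase pairs are made up for this illustration.

```python
from collections import Counter

def phrase_translation_probs(phrase_pairs):
    """phrase_pairs: list of extracted (f_phrase, e_phrase) pairs, with repetitions."""
    pair_counts = Counter(phrase_pairs)                 # count(f, e)
    e_totals = Counter(e for _, e in phrase_pairs)      # sum over f_i of count(f_i, e)
    return {(f, e): c / e_totals[e] for (f, e), c in pair_counts.items()}

pairs = [("io sono", "i am")] * 3 + [("sono", "i am")] + [("io sono", "i was")]
phi = phrase_translation_probs(pairs)
```

Here the English phrase i am has been extracted four times, three times together with io sono and once with sono, so that the two Italian phrases share its probability mass in the ratio 3:1.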
Reordering models
The reordering model in Moses is based on the phrase reordering relative to the previous
phrase. We define $start_i$ as the start position of the phrase $i$ and $end_i$ as the position
of the last word of that phrase, so that $end_{i-1}$ is the end position of the preceding phrase.
The reordering distance is computed as shown in (9).

$$x = start_i - end_{i-1} - 1 \qquad (9)$$

On the computed reordering distance, the cost function in (10) is applied, where $\alpha \in [0; 1]$.

$$d(x) = \alpha^{|x|} \qquad (10)$$
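Equations (9) and (10) can be tried out directly. In the following Python sketch, the function names and the value 0.5 for the parameter alpha are chosen freely for the illustration.

```python
def reordering_distance(start_i, end_prev):
    """Equation (9): number of words skipped relative to the previous phrase."""
    return start_i - end_prev - 1

def reordering_cost(x, alpha=0.5):
    """Equation (10): the cost decays exponentially with the absolute distance."""
    return alpha ** abs(x)

monotone = reordering_cost(reordering_distance(4, 3))   # phrase follows directly
forward = reordering_cost(reordering_distance(7, 3))    # three words are skipped
backward = reordering_cost(reordering_distance(1, 3))   # jump back by three words
```

A phrase which directly follows the previous one has the distance 0 and the value 1, whereas jumps in both directions are scored with a value that becomes smaller with the size of the jump.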
Generally, this reordering model punishes any movement. This works fine for the language pairs with similar syntax, but it leads to bad translation for the language pairs
which differ significantly with respect to the word order. Although the language models
should account for the different word order in SL and TL sentences, they are limited as
they consider only small word sequences. For this reason, phrase-based SMT uses an
additional reordering model: the lexicalized reordering model. It models the orientation of
an extracted phrase pair. The orientation specifies the position of the TL phrase. It can
be monotone which means that its position is equal to the position of the SL phrase.
Furthermore, it can be swap indicating that the SL and the TL phrases are swapped.
Finally, the phrases can be discontinuous, thus interrupted by other phrases.
Language model
Different word order in different languages poses a problem for statistical machine
translation. A language model which is built out of a large target language
text should account for this. It consists of automatically computed n-grams which
express the probability of a target word $e_j$ if it is preceded by $n-1$ already generated target
words. The computation of the probability of a target sentence $e = e_1, ..., e_l$ given a
trigram language model is shown in (11).

$$\begin{aligned}
p(e) &= p(e_1, e_2, ..., e_l) \\
     &= p(e_1)\, p(e_2|e_1) \dots p(e_l|e_1, e_2, ..., e_{l-1}) \\
     &\simeq p(e_1)\, p(e_2|e_1) \dots p(e_l|e_{l-2}, e_{l-1})
\end{aligned} \qquad (11)$$

The model parameters are computed using the counts of the word sequences as shown
in (12).

$$p(w_3|w_1, w_2) = \frac{count(w_1, w_2, w_3)}{\sum_w count(w_1, w_2, w)} \qquad (12)$$
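The estimation in (12) can be sketched as follows. The function name `train_trigram_lm`, the sentence boundary markers `<s>` and `</s>` and the three toy sentences are assumptions of this example, and no smoothing is applied, so that unseen trigrams receive the probability 0.

```python
from collections import Counter

def train_trigram_lm(sentences):
    """Return a function prob(w3, w1, w2) estimated by relative frequency."""
    trigrams = Counter()
    histories = Counter()            # sum over w of count(w1, w2, w)
    for words in sentences:
        padded = ["<s>", "<s>"] + words + ["</s>"]
        for w1, w2, w3 in zip(padded, padded[1:], padded[2:]):
            trigrams[(w1, w2, w3)] += 1
            histories[(w1, w2)] += 1

    def prob(w3, w1, w2):
        if histories[(w1, w2)] == 0:  # unseen history, no estimate available
            return 0.0
        return trigrams[(w1, w2, w3)] / histories[(w1, w2)]

    return prob

prob = train_trigram_lm([["i", "am", "here"],
                         ["i", "am", "late"],
                         ["i", "was", "here"]])
p_am = prob("am", "<s>", "i")
p_here = prob("here", "i", "am")
```

In the toy data, two of the three sentences which start with i continue with am, and one of the two occurrences of i am is followed by here.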
Log-Linear Model
The translation model in phrase-based SMT uses the phrase translation table $\phi(\bar{f}|\bar{e})$,
the reordering model $d$ and the language model $p_{LM}(e)$. The models are combined in a
log-linear model shown in (13).

$$e_{best} = \operatorname{argmax}_e \prod_{i=1}^{I} \phi(\bar{f}_i|\bar{e}_i)^{\lambda_\phi} \, d(start_i - end_{i-1} - 1)^{\lambda_d} \prod_{i=1}^{|e|} p_{LM}(e_i|e_1 ... e_{i-1})^{\lambda_{LM}} \qquad (13)$$

Different models used in the phrase-based translation are weighted by $\lambda$. The weights
for the translation model $\lambda_\phi$, the reordering model $\lambda_d$ and the language model $\lambda_{LM}$ are
learned from the bilingual data in order to maximize the likelihood of the training data.
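The combination in (13) is usually computed as a weighted sum of logarithms, which is equivalent to the weighted product. The following Python sketch scores a single translation hypothesis. The function name `log_linear_score`, the keys of the weight dictionary and all numbers are invented for the illustration, and the search over all hypotheses (the argmax) is not shown.

```python
import math

def log_linear_score(phrase_probs, reorder_costs, lm_probs, weights):
    """Logarithm of the product in (13) for one hypothesis."""
    score = weights["phi"] * sum(math.log(p) for p in phrase_probs)
    score += weights["d"] * sum(math.log(d) for d in reorder_costs)
    score += weights["lm"] * sum(math.log(p) for p in lm_probs)
    return score

weights = {"phi": 1.0, "d": 1.0, "lm": 1.0}
# Hypothesis A translates monotonically, hypothesis B moves the second phrase.
score_a = log_linear_score([0.8, 0.5], [1.0, 1.0], [0.1, 0.2], weights)
score_b = log_linear_score([0.8, 0.5], [1.0, 0.5], [0.1, 0.2], weights)
```

With all weights set to 1, the exponential of the score is the plain product of the model probabilities. A higher weight for the reordering model would increase the difference between the two hypotheses.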
5 Word alignment of English and Italian verb phrases
This chapter describes a method for improving the base word alignment with respect
to the problem of null subject pronouns in Italian and obligatory subject pronouns in
English. Since the English subject pronoun does not necessarily have a counterpart in
Italian, it is often aligned with incorrect words in a given parallel Italian sentence.
I present a rule-based method for the correction of the base alignment of English
subject pronouns. Since the alignment of the subject pronouns depends on the alignment
of the sentence predicate, rules have been developed which dene the alignment not only
of English subject pronouns, but also of entire English verb phrases (VP) which belong
to a subject pronoun. In the following, the term verb phrase is used for the combination
of the (null) subject pronoun, negation and the verbal elements of the VP.
After a short motivation for the base alignment improvement in the following section,
I describe the data that I worked with (cf. section 5.2). In section 5.3, the algorithm used
to compute the VP alignment is presented. The rules based on part of speech tags which
have been developed and applied on the base word alignment, English parses and Italian
tagged sentences, are described in section 5.4. The evaluation results of the improved
alignment are discussed in section 5.5. Some extensions of the proposed method are
shown in section 5.6.
5.1 Motivation
Since the pronominal subject in Italian can be omitted, the English subject pronoun
is often aligned with arbitrary Italian words. These include for example conjunctions,
punctuation, etc. Knowing about the word category of these Italian words, rules can be
applied which prohibit the alignment of the English subject pronoun with these words.
The rules are based on the PoS of the words whose alignment should be computed.
If an Italian subject pronoun is omitted, the information about the subject is provided
by a finite verb which is aligned with the English finite verb (cf. section 2.2.2). What
I would like to achieve is the alignment of the English subject pronoun with the Italian
verb that corresponds to the English finite verb. This is the reason why not only the
English subjects are examined, but also all verbal elements of VPs. In the following, the
term VP denotes a part of a VP which contains only verbal elements and a negation.
Since the sequence of the verbs in English VPs can be interrupted by adverbs or
embedded clauses, parse trees of the English input are used to identify English VPs
(verbal elements and negation) correctly. The Italian input has been PoS tagged to
provide information about word categories. Since an Italian parser was not available,
the Italian VPs are defined as PoS sequences. For each sentence pair, the English parse
tree is searched to find clauses with a pronominal subject. The tagged Italian sentence is
searched in order to find all Italian VPs. Using the base alignment of the elements of an
English phrase which has a pronominal subject, the parallel Italian VP is identified. The
alignment rules compute alignment for PoS sequences of the parallel phrases, whereby
only the PoS which mark verbs, negation and personal pronouns are taken into account.
The rule-based VP alignment is integrated in the base alignment of the sentence so that
the base alignments of the phrase elements are removed.
I assume that every English VP with a pronominal subject has a parallel Italian VP
(cf. (A1) in section 1.2). This assumption is made to limit the number of PoS sequences
for which the alignment rules are defined (cf. (A3) in section 1.2). Furthermore, I assume
that the base alignment is correct enough to allow for identification of parallel English
and Italian VPs (cf. (A2) in section 1.2). The assumptions hold in many cases but they
also lead to problems which will be shown in section 5.5.2.
In the following, the algorithm for the application of alignment rules is presented (cf.
section 5.3), as well as the rules based on word categories (expressed by PoS tags) of
English and Italian words (cf. section 5.4).
5.2 Data preparation
The alignment rules have been developed and applied on a reduced version of the
Europarl corpus [Koehn, 05] consisting of 749,646 parallel sentences.
Since the alignment rules do not operate on word level but on PoS level, it was
necessary to preprocess the parallel corpus. English sentences have been parsed in order
to simplify the search for pronominal subjects and VPs. Since English parse tree nodes
are underspecified with respect to the grammatical function of phrases, I wrote a program
which enriches relevant nodes with their grammatical function. Since a parser for Italian
was not available, the Italian sentences have been PoS tagged in order to get information
about the word categories.
In the following, I describe the steps in the data preparation process.
5.2.1 English
English sentences have been parsed with the generative parser [Charniak, 00]. The parse
trees make it possible to identify the subclauses of the input sentence, the subjects and the VPs.
The parser also assigns to each word its part of speech tag (see footnote 12), which is needed
to match conditions in the alignment rules. An example parse tree is shown in (52).
Footnote 12: English PoS are listed in appendix B.
(52)
I would like your advice about Rule 143 concerning inadmissibility.
(TOP
(S (NP (PRP I))
(VP (MD would)
(VP (VB like)
(NP (NP (PRP your)
(NN advice ))
(PP (IN about)
(NP (NNP Rule)
(CD 143)))
(VP (VBG concerning)
(NP (NN inadmissibility))))))
(. .)))
NP nodes in the parse tree in (52) are not specified with respect to their grammatical
functions. To determine if some NP is a subject or an object, the context has to be taken
into account. The assumption that the first NP under S (representing topic position)
is a subject does not always hold (cf. parse tree in (53)) which makes the search for a
(pronominal) subject more complicated.
(53)
This makes it necessary to also take account of the ways in which materials and
packaging are affected by cold of this kind.
(TOP
(S
(NP1 (DT This))
(VP1 (VBZ makes)
(S
(NP2 (PRP it))
(ADJP (JJ necessary)
(S
(VP2 (TO to)
(ADVP (RB also))
(VP (VB take)
(NP ...
(. .)))
NP2 in (53) is actually an object of the verb in the preceding subclause (with sentence
predicate makes) and a subject of VP2. Thus, the underspecification of NP nodes
requires a context check (father and sister nodes) in order to identify the subject of a
VP.
Not only the underspecification of NP poses a problem for a correct identification of a
subject and its VPs. There are verbs which subcategorize an infinitival verb phrase (to-infinitive), for example I would like [to say]_XCOMP (cf. examples (9b) - (9d) in section
2.2.1). The extracted VP which belongs to a pronominal subject should also contain a
subcategorized infinitive. To-infinitives can be embedded in various nodes, for example
in VP or ADJP (as in the parse tree in (53)). Since I wanted to handle only to-infinitives
which are subcategorized by a verb in a preceding clause (and not for example by an
adjective as in (53)), it was also necessary to examine the context of to-infinitives
(VP nodes) in order to decide whether a to-infinitive should be treated as a part of
a finite VP whose alignment should be computed.
There are two ways to solve the problem of identifying subjects and VPs. One way
is a runtime examination of the context of corresponding nodes, and the other way is
to enrich the parse trees with function tags as a part of data preprocessing. I chose the
second approach which resulted in a program which enriches English parse trees
(see footnote 13), and a relatively simple method for subject and VP extraction from a
modified parse tree.
The tool which transforms original Charniak parse trees examines only NP and S nodes
enriching them with a tag expressing their grammatical function. The transformation
rules examine the context of the relevant nodes. If the conditions for a specific function tag
are met, the original node is enriched by the corresponding function tag.
Transformation rules for NP are given in (54). NP nodes are marked as subject NPs
only if the father node is S or SBAR, they are not preceded by a VP (for example
[Let_VB]_VP [me_PRP]_NP [say_VB]_VP) and a sister VP is not a to-infinitive (for example
[[us_PRP]_NP [to_TO say_VB]_VP]_S). In an interrogative sentence, the finite verb in front of
a subject is not embedded in a VP, for example [[Can_MD] [you_PRP]_NP [say_VB]_VP ... ]_S,
so that, in this case, the NP would be identified as a subject NP. The conditions for
subject NPs are summarized in the rule (54a). If these conditions are not fulfilled, the
NP is an object (NP rule (54b)). Furthermore, the NP node is identified as an object
when the father node is a VP (NP rule (54c)).
(54)
a. NP → NP-SUBJ
   if the father node is S or SBAR, and there is no preceding VP under the
   father node, and if there is a sister VP node, it is not a to-infinitive
b. NP → NP-OBJ
   if the father node is S or SBAR, and there is a preceding VP or sister VP
   node which is a to-infinitive
c. NP → NP-OBJ
   if the father node is a VP
It was also necessary to examine S nodes to determine if they consist of a to-infinitive.
If this is the case, and the infinitive is subcategorized by a verb in the preceding clause,
the S node should be annotated by a function tag that reflects these features, namely
S-XCOMP. For example, the examination of the phrase I would like to say, in which the
to-infinitive to say is embedded in the category S, should identify the to-infinitive as an
infinitive which belongs to the preceding finite verb. If the example phrase is modified
resulting in a phrase I would like you to say, the to-infinitive should not be determined
as a part of a VP [would like]_VP, since its subject is not I. The parses for such sentences
are shown in (55) and (56).
Footnote 13: [Blaheta, 2004] developed a function tagger which provides parse trees with
function tags annotated to the phrases and words. I was not able to run this tagger, for which
reason I implemented my own tool for this task. It is important to note that my program
enriches only the nodes which are relevant for the presented work.
(55)
I would like once again to wish you ...
(S1
(NP (PRP I))
(VP (MD would)
(VP (VB like)
(S2 (ADVP (RB once)
(RB again))
(VP (TO to)
(VP (VB wish)
(NP (PRP you))
...
(. .)))
(56)
I would therefore once more ask you to ensure ...
(TOP
(S3
(NP (PRP I))
(VP (MD would)
(ADVP (RB therefore))
(VP (ADVP (RB once)
(JJR more))
(VB ask)
(NP (PRP you))
(S4
(VP (TO to)
(VP (VB ensure)
...
(. .)))
The transformation rules for S nodes are given in (57). The condition for applying these
transformation rules is that an S or SBAR node is embedded in a VP. If the nodes have
a preceding sister node NP (as S4 in (56)), they should be marked as S-OBJXCOMP
expressing that the to-infinitive does not have the same subject as the VP in a superordinate clause (cf. S rule (57b)). If this is not the case, the node should be marked as
S-XCOMP expressing that the to-infinitive belongs to the superordinate VP (cf. S2 in
(55)). This is achieved by the S rule (57a).
(57)
a. S, SBAR → S-XCOMP, SBAR-XCOMP
   if the father node is VP and it is not preceded by a sister node NP
b. S, SBAR → S-OBJXCOMP, SBAR-OBJXCOMP
   if the father node is VP and it is preceded by a sister node NP
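The transformation rules in (54) and (57) can be sketched as a small tree rewriting procedure. The following Python code is not the program written for this work, which is not shown here. It is a simplified illustration in which the names `parse_tree`, `enrich` and `labels` are invented, the trees are nested lists, and the S rules are applied to every S or SBAR node under a VP without checking that the node really contains a to-infinitive.

```python
import re

def parse_tree(text):
    """Read a bracketed tree into nested lists [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        pos += 1                          # skip "("
        node = [tokens[pos]]              # node label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(read())
            else:
                node.append(tokens[pos])  # a word
                pos += 1
        pos += 1                          # skip ")"
        return node

    return read()

def is_to_infinitive(node):
    """A VP counts as a to-infinitive if its first child is tagged TO."""
    return (node[0] == "VP" and len(node) > 1
            and isinstance(node[1], list) and node[1][0] == "TO")

def enrich(node):
    """Apply the rules (54) and (57) to all nodes below `node`, in place."""
    parent = node[0].split("-")[0]        # "S-XCOMP" is still an S node
    children = [c for c in node[1:] if isinstance(c, list)]
    for k, child in enumerate(children):
        before = children[:k]
        sisters = before + children[k + 1:]
        if child[0] == "NP" and parent in ("S", "SBAR"):
            vp_before = any(s[0] == "VP" for s in before)
            to_inf_sister = any(is_to_infinitive(s) for s in sisters)
            child[0] = "NP-OBJ" if vp_before or to_inf_sister else "NP-SUBJ"
        elif child[0] == "NP" and parent == "VP":
            child[0] = "NP-OBJ"           # rule (54c)
        elif child[0] in ("S", "SBAR") and parent == "VP":
            np_before = any(s[0].startswith("NP") for s in before)
            child[0] += "-OBJXCOMP" if np_before else "-XCOMP"
        enrich(child)
    return node

def labels(node):
    """List the node labels of a tree in preorder."""
    return [node[0]] + [l for c in node[1:] if isinstance(c, list) for l in labels(c)]

tree_56 = enrich(parse_tree(
    "(TOP (S (NP (PRP I)) (VP (MD would) (VP (VB ask) (NP (PRP you)) "
    "(S (VP (TO to) (VP (VB ensure)))))) (. .)))"))
tree_55 = enrich(parse_tree(
    "(S (NP (PRP I)) (VP (MD would) (VP (VB like) "
    "(S (VP (TO to) (VP (VB wish) (NP (PRP you)))))))))"))
```

The two inputs are shortened versions of the parses in (55) and (56). For the first tree, the embedded S is marked as S-OBJXCOMP since it is preceded by the object NP you, and for the second tree it is marked as S-XCOMP.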
The application of the transformation rules in (54) and (57) to the parse trees in (55) and
(56) results in the modified parse trees shown in (58) and (59).
(58)
I would like once again to wish you ...
(S1
(NP-SUBJ (PRP I))
(VP (MD would)
(VP (VB like)
(S2-XCOMP (ADVP (RB once)
(RB again))
(VP (TO to)
(VP (VB wish)
(NP-OBJ (PRP you))
...
(. .)))
(59)
I would therefore once more ask you to ensure ...
(TOP
(S3
(NP-SUBJ (PRP I))
(VP (MD would)
(ADVP (RB therefore))
(VP (ADVP (RB once)
(JJR more))
(VB ask)
(NP-OBJ (PRP you))
(S4-OBJXCOMP
(VP (TO to)
(VP (VB ensure)
...
(. .)))
Modified parse trees such as (58) and (59) simplify the search for English subjects and
corresponding VPs. Having enriched NP and VP nodes, we can search directly for nodes
that correspond to the phrases we are interested in.
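The S rules in (57) can be illustrated with a short Python sketch. It is not the tool used in this work: the Node class is a hypothetical minimal tree structure, and the sketch assumes that "preceded by a sister node NP" means any NP sister to the left of the S or SBAR node.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Minimal parse tree node (hypothetical, for illustration only)."""
    label: str
    children: List["Node"] = field(default_factory=list)


def base_category(label: str) -> str:
    """Strip a function tag, e.g. 'NP-OBJ' -> 'NP'."""
    return label.split("-")[0]


def mark_s_nodes(node: Node) -> None:
    """Relabel S/SBAR daughters of a VP following rules (57a) and (57b)."""
    if base_category(node.label) == "VP":
        np_seen = False
        for child in node.children:
            if base_category(child.label) == "NP":
                np_seen = True
            elif child.label in ("S", "SBAR"):
                # (57b) if an NP sister precedes the node, otherwise (57a)
                child.label += "-OBJXCOMP" if np_seen else "-XCOMP"
    for child in node.children:
        mark_s_nodes(child)


# Simplified inner VP of (56): (VP (VB ask) (NP (PRP you)) (S (VP ...)))
tree_56 = Node("VP", [Node("VB"), Node("NP", [Node("PRP")]), Node("S", [Node("VP")])])
mark_s_nodes(tree_56)
print(tree_56.children[2].label)  # S-OBJXCOMP
```

Already enriched labels are left untouched, so running the function twice on the same tree does not add a second function tag.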
5.2.2 Italian
Italian sentences have been tagged with TreeTagger [Schmid, 95], creating an input text
consisting of the words with their PoS.¹⁴
The PoS tagged Italian sentence in (60b) corresponds to the sentence in (60a). The
words are enriched with their PoS. # is a delimiter between a word and its PoS.
(60)
a. Non credo però che la relazione arrivi tardi.
   not believe but that the report comes late.
   'But I do not believe that the report comes too late.'
b. Non#NEG credo#VER:fin però#ADV che#CHE la#ART
   relazione#NOUN arrivi#VER:fin tardi#ADV .#SENT
On the basis of the PoS, we can identify the verbs (VER:fin, VER:infi, VER:ppast,
VER2:fin, VER2:geru, etc.), negation (NEG), and subject pronouns (PRO:pers, PRO:demo)
in the tagged Italian input. For example, the PoS VER:fin bears the information that credo is
a finite verb.
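As an illustration of how the tagged input can be read, the following Python sketch splits the word#PoS tokens of (60b) and selects the verbs. The function names are hypothetical, and the sketch assumes that every token contains the delimiter # and that the verbal tags start with VER, VER2 or AUX.

```python
from typing import List, Tuple


def parse_tagged(sentence: str) -> List[Tuple[int, str, str]]:
    """Return (position, word, PoS) triples for a 'word#PoS' tagged sentence."""
    tokens = []
    for position, token in enumerate(sentence.split()):
        # Split at the last '#', so a word containing '#' is still handled.
        word, pos = token.rsplit("#", 1)
        tokens.append((position, word, pos))
    return tokens


def is_verb(pos: str) -> bool:
    return pos.split(":")[0] in ("VER", "VER2", "AUX")


def is_negation(pos: str) -> bool:
    return pos == "NEG"


def is_subject_pronoun(pos: str) -> bool:
    return pos in ("PRO:pers", "PRO:demo")


sentence = ("Non#NEG credo#VER:fin però#ADV che#CHE la#ART "
            "relazione#NOUN arrivi#VER:fin tardi#ADV .#SENT")
tokens = parse_tagged(sentence)
print([(i, w) for i, w, p in tokens if is_verb(p)])  # [(1, 'credo'), (6, 'arrivi')]
```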
5.2.3 Data preprocessing errors
The tagger for Italian which was used in this work was trained using the Italian morphological lexicon MorphIt [Zanchetta & Baroni, 05] and a set of about 100,000 manually tagged words from the newspaper corpus Repubblica [Baroni et al., 04].
The accuracy of the tagger can be compared with the accuracy of the Italian TreeTagger
[Schmid, Baroni et al., 2007] reported in the Part of Speech Tagging Task of EVALITA
2007.¹⁵ The TreeTagger reaches an accuracy of 97%.
However, the examination of the tagged Italian input revealed that some words are
often tagged incorrectly. Example (61a) shows the Italian counterpart of the English sentence
in (52).¹⁶
(61)
a. Gradirei avere il suo parere riguardo all' articolo 143 sull' inammissibilità.
b.
Gradirei#VER:fin avere#VER:infi il#ART suo#DET:poss parere#NOUN
riguardo#VER:fin all#NOUN '#PUN articolo#NOUN 143#NUM
sull#NOUN '#PUN inammissibilità#ADJ .#SENT
The example sentence in (61b) contains relatively many incorrectly tagged words. For example,
the ambiguous word form riguardo, which can be a noun (= consideration) or a verb
(= I concern), is treated in (61b) as a finite verb, which is not correct in this context. One of
the common tagging errors concerns prepositions merged with the definite article. When
these word forms appear in front of a word which begins with a vowel, they end with
an apostrophe: sull', all'. The tagger does not recognize these word forms as merged
forms of an article and a preposition, which would receive the tag ARTPRE, but as words
of an arbitrary category (for example, as a noun (NOUN) or finite verb (VER:fin)). I
corrected the article tag errors manually to reduce the number of false verb tags, since
they lead to erroneous identification of the Italian VPs.
14 Italian PoS are listed in appendix A.
15 The comparison of the evaluation results was suggested by M. Baroni (pers. comm.).
16 These examples are taken from the parallel corpus Europarl.
5.3 Applying alignment rules
The program for base alignment improvement expects a set of parallel sentences of Italian
(with PoS) and English (as a parse tree) as input. Details about the corpus preparation
have already been described in section 5.2.
The parallel sentences are automatically
word aligned with GIZA++ [Och & Ney, 03]. This base word alignment is the basis for
the rule-based VP alignment.
 1: function correct_align(en_parse, it_tagged, base_align)
 2:     new_align                                                  ▷ New alignment
 3:     for e in en_parse do                                       ▷ English sentence
 4:         e_subj_verb ← search_subj_verb(e)
 5:         phrase_pair ← search_it_vp(e_subj_verb, it_tagged, base_align)
 6:         new_align ← align(phrase_pair, base_align)
 7:         pos_pattern ← derive_pos_pattern(new_align)
 8:     end for
 9:     return new_align
10: end function
Figure 1: Main program: correct_align
The main program is shown in figure 1. Several steps are carried out for each sentence pair, beginning with a check whether the English sentence e contains a pronominal subject. After identifying the English pronominal subject and its verbs (line 4 in figure 1), the program looks for the Italian VP which the English words are aligned with (line 5 in figure 1). The procedure which fulfils this task is described in section 5.3.2. The output of this procedure is a phrase pair containing words enriched with information about the word category (PoS) and the position of the word in the sentence.
Having a phrase pair whose alignment should be computed, we can now call the function which computes the alignment of the given phrase pair by applying PoS-based alignment rules (line 6 in figure 1). The program also derives PoS and PoS patterns of the parallel phrases (line 7 in figure 1).¹⁷ The main program returns the computed VP alignment, which is then integrated into the word alignment of the sentence pair.
17 The counts of the PoS occurrences could be used to compute the probability of translating an English PoS ep_i into an Italian PoS ip_j. The derived PoS patterns could be used to check the correctness of the found PoS patterns. Furthermore, the PoS translation pairs could be used to examine which tenses are mostly used in the given language pair. They would also allow the examination of tense similarity: how often the same tense is used in the parallel sentence pair, or how often the tense and voice diverge.
A graphical illustration of the complete system is shown in figure 2. Each box represents one processing step. The processing steps are explained in detail in the following
sections.
Figure 2: System components
(Boxes of the diagram. Preprocessing: IT-EN parallel corpus; EN: Charniak parser, enriching parse trees; IT: TreeTagger; word alignment with GIZA++. Alignment improvement system: searching sentences with pronominal subject; searching IT VPs; identification of VP pairs; aligning VP elements; merging base alignment with VP alignment.)
After the VP alignments are produced, they are merged with the base word alignment.
In the resulting alignment, the pronominal subjects and VPs in both languages have
only the alignments computed by the program for the VP alignment.
The baseline
alignments for these words are deleted.
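The merging step can be sketched in a few lines of Python. This is only an illustration under the assumption that an alignment is stored as a set of (English position, Italian position) pairs; the function name and its parameters are hypothetical.

```python
from typing import Iterable, Set, Tuple

Link = Tuple[int, int]  # (English position, Italian position)


def merge_alignments(base_align: Iterable[Link],
                     vp_align: Iterable[Link],
                     en_positions: Set[int],
                     it_positions: Set[int]) -> Set[Link]:
    """Replace the base links of all words in the VP pair by the rule-based links."""
    kept = {(e, i) for (e, i) in base_align
            if e not in en_positions and i not in it_positions}
    return kept | set(vp_align)


# Phrase pair "if you wish" / "se lo desidera": the base alignment links
# you-lo; the VP alignment links both "you" and "wish" to "desidera".
base = {(9, 8), (10, 9), (11, 10)}
vp = {(10, 10), (11, 10)}
merged = merge_alignments(base, vp, en_positions={10, 11}, it_positions={10})
print(sorted(merged))  # [(9, 8), (10, 10), (11, 10)]
```

Links of words outside the VP pair, such as if and se in the example, are kept unchanged.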
The function align(phrase_pair, base_align) (line 6 in figure 1), which computes the
alignment of the VP pairs, is shown in figure 3. The functions for the alignment of different word classes of English (align_subj(e, it), align_vfin(e, it), etc.) implement the
alignment rules described in section 5.4. For the given English word, compatible Italian
words are identified. The examination of the alignment takes only PoS into account.
If there is no appropriate Italian word (with an appropriate PoS), the given English word
stays unaligned.
 1: function align(phrase_pair, base_align)
 2:     new_align                                                  ▷ Computed alignment
 3:     en ← english_words(phrase_pair)
 4:     it ← italian_words(phrase_pair)
 5:     for e in en do
 6:         if subject(e) = True then
 7:             new_align.append(align_subj(e, it))
 8:         else if finverb(e) = True then
 9:             new_align.append(align_vfin(e, it))
10:         else if infpartger(e) = True then
11:             new_align.append(align_infpartger(e, it))
12:         else if negation(e) = True then
13:             new_align.append(align_neg(e, it))
14:         else if infparticle(e) = True then
15:             new_align.append(align_infpart(e, it))
16:         end if
17:     end for
18:     return new_align
19: end function
Figure 3: Alignment check and improvement
5.3.1 Identification of Italian VPs
Since no Italian parser was available, the extraction of correct Italian verb
phrases using the base word alignment posed a great problem. In this work, I made
the assumption that the base alignment is sufficiently correct to make it possible to
find the Italian phrase corresponding to the given English phrase. Unfortunately, the
Italian phrases found in this way were often incomplete, i.e. some VP elements were missing.
Therefore, I identify all Italian VPs before the search for a matching Italian VP, which is described in the following section, is carried
out.
The identification of Italian VPs is based on PoS. I defined PoS which mark the start
of a VP, and PoS of other verbal elements which can be a part of a VP. An Italian
sentence is searched until a PoS is found that can be the start of a VP. From
this sentence position, the search for other elements continues until the sentence end
or another VP-starting PoS is reached. The search function returns the Italian word
sequences that contain a personal pronoun, negation and verbal elements of a VP. Other
VP elements are ignored.
This method finds not only finite VPs starting with a pronoun, finite verb or negation,
but also infinitival VPs, and gerundive phrases which often consist of only one gerundive.
(62)
a. Perché non esistono istruzioni da seguire in caso di incendio?
   why not exist instructions to continue in case of fire?
   'Why are there no instructions in case of fire?'
b. Perché#WH non#NEG esistono#VER:fin istruzioni#NOUN da#PRE
   seguire#VER:infi in#PRE caso#NOUN di#PRE incendio#NOUN ?#SENT
The sentence in (62b) consists of two VPs. The implemented method for the identification
of Italian VPs finds the following verb phrases:
1. non#NEG esistono#VER:fin
2. da#PRE seguire#VER:infi
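The PoS-based search can be sketched in Python as follows. The sketch is a reconstruction under assumptions, not the implemented method: it assumes that a VP is opened by a pronoun, a negation, a finite verb, a gerundive, or a preposition directly followed by an infinitive; that a finite verb or gerundive joins an open VP which has no verb yet; and that a gerundive directly adjacent to the last collected element (as in sto dormendo) stays in the same VP. All names are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[int, str, str]  # (sentence position, word, PoS)

PREFIX_TAGS = {"PRO:pers", "PRO:demo", "NEG"}


def has_feature(pos: str, feature: str) -> bool:
    """True if a tag such as 'VER:infi:cli' carries the given feature."""
    return feature in pos.split(":")[1:]


def find_italian_vps(tagged_sentence: str) -> List[List[Token]]:
    tokens = [tuple(t.rsplit("#", 1)) for t in tagged_sentence.split()]
    vps: List[List[Token]] = []
    current: List[Token] = []
    has_verb = False

    for i, (word, pos) in enumerate(tokens):
        next_pos = tokens[i + 1][1] if i + 1 < len(tokens) else ""
        finite = has_feature(pos, "fin")
        gerund = has_feature(pos, "geru")
        nonfinite = has_feature(pos, "ppast") or has_feature(pos, "infi")
        pre_inf = pos == "PRE" and has_feature(next_pos, "infi")
        adjacent = bool(current) and current[-1][0] == i - 1

        if pos in PREFIX_TAGS or pre_inf:
            opens_new = has_verb or pre_inf
        elif finite or gerund:
            opens_new = has_verb and not (gerund and adjacent)
        elif nonfinite:
            opens_new = False  # participles and infinitives join the open VP
        else:
            continue  # other VP elements are ignored

        if opens_new:
            if has_verb:
                vps.append(current)  # a prefix without a verb is dropped
            current, has_verb = [], False
        current.append((i, word, pos))
        has_verb = has_verb or finite or gerund or nonfinite

    if has_verb:
        vps.append(current)
    return vps


example = ("Perché#WH non#NEG esistono#VER:fin istruzioni#NOUN da#PRE "
           "seguire#VER:infi in#PRE caso#NOUN di#PRE incendio#NOUN ?#SENT")
for vp in find_italian_vps(example):
    print(" ".join(f"{w}#{p}" for _, w, p in vp))
# non#NEG esistono#VER:fin
# da#PRE seguire#VER:infi
```

Each returned element keeps its sentence position, which is needed later when the base alignment links of a candidate VP are counted.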
Non-finite phrases like [da seguire]XCOMP are extracted as independent phrases since they
often correspond to complete English clauses. An example of such a case is shown in figure 4. The English VP [would ask]VP does not include the to-infinitive [to request]XCOMP
since the to-infinitive and the finite verb phrase do not have the same subject. For this
reason, it would be wrong to handle the Italian infinitive [di chiedere]VP as a part of
the Italian VP [prego]VP which corresponds to the English VP [would ask]VP.
It is also possible to translate a finite English sentence as an Italian infinitival clause.
This is an additional reason why I handle Italian infinitives as separate VPs.
Figure 4: Alignment of I would ask you to request and la prego di chiedere
(English tokens: I/PRP would/MD ask/VB you/PRP to/TO request/VB; Italian tokens: la/CLI prego/VER:fin di/PRE chiedere/VER:infi)
These rather simple rules for the detection of complete Italian VPs do not always provide
correct verb phrases. Mistakes are made if the word order is changed, or if a sequence
of VP start elements occurs. Furthermore, incorrect tagging also leads to the identification
of false Italian VPs, as shown in (63b) (cf. section 5.2.3).
(63)
a. Come avrete avuto modo di constatare il grande baco del millennio non si è realizzato.
   as have had way to observe the big bug of the millennium not was realized.
   'As you could have seen, the millennium bug did not materialize.'
b. come#WH avrete#AUX:fin avuto#VER:ppast modo#NOUN di#PRE
   constatare#VER:infi il#ART grande#ADJ "#PUN baco#VER:fin del#ARTPRE
   millennio#NOUN "#PUN non#NEG si#CLI è#AUX:fin
   materializzato#VER:ppast .#SENT
The method for the identification of Italian VPs finds the following verb phrases for the
sentence (63b):
1. avrete#AUX:fin avuto#VER:ppast
2. di#PRE constatare#VER:infi
3. baco#VER:fin
4. non#NEG è#AUX:fin materializzato#VER:ppast
The VP in 3 (baco#VER:fin) is not correct. Baco (= the bug (noun)) has been assigned
the wrong PoS, resulting in the extraction of a false VP. Although the rules for the identification of the Italian VPs can lead to false VPs, they provide a relatively good basis for
the process of searching for an Italian VP that corresponds to a given English phrase,
which is described in the following section.
5.3.2 Identification of the most probable Italian VP
Good VP alignment results can be achieved by applying alignment rules only if the
rules are applied to parallel English and Italian VPs. The procedure for searching
for the matching Italian VP given an English VP is shown in figure 5. The method
for the determination of the best Italian VP given an English VP is based on a count of
alignments between English and Italian words in these phrases. I thus assume that the
base alignment is correct on the level of phrase alignment. This means that the best
Italian phrase has the most base alignment links for the given English word sequence.
The search function in figure 5 receives as input the English subject and its verbs, a list of
Italian VPs extracted from the parallel sentence (as described in the previous section),
and the base alignment. For each Italian VP, the number of alignment links between
its elements and the English input is computed. The VP with the most alignment links
represents the best Italian VP for the English input.
 1: function search_it_vp(en_subj_vp, it_vps, base_align)
 2:     word_pairs ← []              ▷ EN and IT words which belong to parallel phrases
 3:     en_al ← Alignments of EN words
 4:     vp_links ← []                ▷ Pairs: (IT VP, # links to its elements)
 5:     for it_vp in it_vps do       ▷ Loop over all Italian VPs
 6:         links ← 0                ▷ # alignments for EN phrase and IT candidate VP
 7:         for (en, it) in en_al do ▷ Alignment pairs
 8:             if it ∈ it_vp then   ▷ Italian word it is a part of Italian VP it_vp
 9:                 links += 1
10:             end if
11:         end for
12:         vp_links.append(it_vp, links)
13:     end for
14:     best_vp ← max(vp_links)      ▷ Italian VP with most alignments
15:     word_pairs.append(elements_of(best_vp))
16:     return word_pairs
17: end function
Figure 5: Search for the best Italian VP
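The link counting in figure 5 can be written compactly in Python. The following is a minimal sketch, assuming that the base alignment is a collection of (English position, Italian position) pairs and that each Italian VP is given as a list of word positions. Returning None when no candidate has a link, and preferring the first candidate in case of a tie, are assumptions of this sketch and are not specified by figure 5.

```python
from typing import Iterable, List, Optional, Tuple

Link = Tuple[int, int]  # (English position, Italian position)


def search_it_vp(en_positions: Iterable[int],
                 it_vps: List[List[int]],
                 base_align: Iterable[Link]) -> Optional[List[int]]:
    """Return the Italian VP with the most base alignment links to the English words."""
    en_positions = set(en_positions)
    # Base alignment links that start at one of the English input words
    en_links = [(e, i) for (e, i) in base_align if e in en_positions]
    best_vp, best_links = None, 0
    for it_vp in it_vps:
        vp_positions = set(it_vp)
        links = sum(1 for (_, i) in en_links if i in vp_positions)
        if links > best_links:
            best_vp, best_links = it_vp, links
    return best_vp


# English "I feel" at positions 5 and 6; two Italian candidate VPs
base = [(5, 6), (6, 6), (6, 7), (12, 13)]
print(search_it_vp([5, 6], [[6], [13, 14]], base))  # [6]
```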
5.4 Alignment rules
The alignment rules define the alignment of the relevant sentence parts in an English
and Italian parallel corpus. They are based on an already created alignment (base
alignment) and the PoS of the words that are observed. In previous sections, it was
already mentioned that these sentence parts are subject pronouns, negation and verbal
sentence predicates.
In figure 3 in section 5.3, I showed the program for alignment improvement.
It consists of a loop over the words of the English input phrase. For
each word, its word category is derived (on the basis of its PoS), and the function
which computes the alignment for the found word category is called. There are five functions which
compute the alignment for five word category groups. If the input word e is:
1. subject, i.e. its PoS is PRP,
   the function align_subj(e, it) is called.
2. finite verb, i.e. its PoS is:
   • AUX: auxiliary
   • MD: modal verb
   • VBZ: 3rd person singular present
   • VBP: non-3rd person singular present
   • VBD: past tense
   the function align_vfin(e, it) is called.
3. infinitive, participle or gerundive, i.e. its PoS is:
   • VB: infinitive
   • VBN: past participle
   • VBG: gerundive
   the function align_infpartger(e, it) is called.
4. negation, i.e. its PoS is RB,¹⁸
   the function align_neg(e, it) is called.
5. infinitive particle to, i.e. its PoS is TO,
   the function align_infpart(e, it) is called.
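The mapping from PoS to the five word category groups can be sketched as a small Python function. The category names follow the predicates in figure 3; the function itself is hypothetical. The modal tag is written MD, as in the tagged examples and rules of this chapter, and the set of negation word forms is an assumption, since RB alone does not distinguish negation from other adverbs.

```python
from typing import Optional

FINITE_TAGS = {"AUX", "MD", "VBZ", "VBP", "VBD"}
NONFINITE_TAGS = {"VB", "VBN", "VBG"}
NEGATION_FORMS = {"not", "n't"}  # assumed list of negation word forms


def word_category(word: str, pos: str) -> Optional[str]:
    """Map an English word of the input phrase to its alignment rule group."""
    if pos == "PRP":
        return "subject"
    if pos in FINITE_TAGS:
        return "finverb"
    if pos in NONFINITE_TAGS:
        return "infpartger"
    if pos == "RB" and word.lower() in NEGATION_FORMS:
        return "negation"
    if pos == "TO":
        return "infparticle"
    return None  # no alignment rule applies; the word is skipped


phrase = [("I", "PRP"), ("do", "AUX"), ("not", "RB"), ("sleep", "VB")]
print([word_category(w, p) for w, p in phrase])
# ['subject', 'finverb', 'negation', 'infpartger']
```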
In the following sections, the alignment rules for each word category are presented (cf.
sections 5.4.2 - 5.4.6). But before the rules are described, we should examine the syntax
of English and Italian VPs. The PoS sequence which occurs in a specific tense is crucial
for defining alignment rules. Each rule is applicable only if the context constraints are
fulfilled. The context is defined by the PoS of the words of the given phrases.
5.4.1 Syntax of the English and Italian VPs
The VP alignment rules use word categories expressed by PoS to compute the word
alignment of parallel VPs. Since the alignment of a specific PoS is not always the
same but context-dependent, it is necessary to examine which contexts (PoS sequences)
are possible. Having this information, constraints can be defined which limit the word
alignment of a PoS in a specific PoS context.
In the following, we take a closer look at the composition of English and Italian VPs
(PoS sequences).
English
English tenses can be realized by one verb only or by a sequence of verbal elements.
Since the alignment rules are based on PoS, we have to know which PoS sequences in
English VPs are possible. We start with examples of tenses which have only one verb.
In the following examples, only relevant tokens are marked with their PoS.
(64)
a. He/PRP sleeps/VBZ.
b. It/PRP is/AUX nice.
c. I/PRP sleep/VBP.
d. He/PRP went/VBD home.
If we would like to negate the sentences in (64), we would get composed VPs containing
an auxiliary, a negation and an infinitive.
(65)
a. I/He/PRP do/does/did/AUX not/RB sleep/VB.
b. I/He/PRP do/does/did/AUX not/RB have/do/AUX it.
18 Since there is no difference between tags for negation and other adverbs, the word forms tagged with RB had to be examined to identify the negation.
Constructions with modal verbs are shown in (66).
(66)
a. He/PRP will/would/MD (not/RB) sleep/VB.
b. He/PRP will/would/MD (not/RB) have/do/AUX it.
c. He/PRP will/would/MD (not/RB) be/AUX sleeping/VBG.
d. He/PRP will/would/MD (not/RB) be/AUX having/doing/AUXG.
e. He/PRP will/would/MD (not/RB) have/AUX slept/VBN.
f. He/PRP will/would/MD (not/RB) have/AUX had/done/AUX it.
g. He/PRP would/MD (not/RB) have/AUX been/AUX sleeping/VBG.
h. He/PRP would/MD (not/RB) have/AUX been/AUX having/doing/AUXG this.
The following example sentences show the tenses which contain an auxiliary.
(67)
a. He/PRP is/was/AUX (not/RB) sleeping/VBG.
b. He/PRP is/was/AUX (not/RB) having/doing/AUXG this.
c. He/PRP has/had/AUX (not/RB) slept/VBN.
d. He/PRP has/had/AUX (not/RB) been/AUX sleeping/VBG.
e. He/PRP has/had/AUX (not/RB) been/AUX having/doing/AUXG.
f. I/PRP am/AUX going/VBG to/TO sleep/VB.
g. I/PRP am/AUX going/VBG to/TO have/do/AUX this.
If, for example, English auxiliaries should be aligned differently depending on the VP
that they belong to, the composition of the English VP has to be determined by examining its PoS sequence. If, for example, has/AUX (cf. example (67c)) should be aligned
with the corresponding Italian auxiliary only if has/AUX is a part of a composed VP,
we would require the English VP to contain a participle. Thus, in addition to AUX,
the PoS sequence of the English VP should also contain the PoS VBN.
A closer observation of the examples in (66) and (67) reveals that the PoS AUX is
used not only for the auxiliaries. The verbs am and do in (67g) have the same PoS
although do should be considered here as a main verb.¹⁹ This causes a problem because
different word categories are handled with different sets of alignment rules. If the
word category is erroneous, false alignment rules can be applied. We will come back to
this problem in section 5.4.4.
19 [Charniak, 00] expands the Penn Treebank Tagset (listed in appendix B) with the tags AUX and AUXG which are assigned to the auxiliaries.
Italian
Again, we start with tenses which have only one verb. The optional subject pronoun
and the negation are put in brackets.
(68)
(Egli/PRO:pers) (non/NEG) dorme/dormivo/dormii/dormirò/VER:fin.
He not sleeps/has slept/had slept/will sleep
'He sleeps/has slept/had slept/will (not) sleep.'
(69)
(Io/PRO:pers) (non/NEG) dormo/dormissi/dormirei/VER:fin.
I not would sleep/would have slept/would sleep
'I would sleep/would have slept/would (not) sleep.'
Italian composed tenses require an auxiliary or a modal verb, and a participle, an infinitive or a gerundive. Examples in (70) and (71) show simple sentences with the verb
dormire (= to sleep) which takes the auxiliary avere (= to have). In Italian, there are also
verbs like andare (= to go) and venire (= to come) which occur with the auxiliary essere (=
to be). Since the PoS sequence is the same for tenses with both auxiliaries, I do not list
example sentences with these verbs. Examples in (70) - (74) show all composed Italian
tenses with all possible PoS sequences.
(70)
(Egli/PRO:pers) (non/NEG) ha/avrà/AUX:fin dormito/avuto/VER:ppast.
He (not) has/will have slept/had
'He has/will have (not) slept.'
(71)
(Io/PRO:pers) (non/NEG) abbia/avrei/AUX:fin dormito/VER:ppast.
I (not) would have/will have slept
'I would (not) have/would (not) have/will (not) have slept.'
(72)
(Io/PRO:pers) (non/NEG) sto/AUX:fin dormendo/VER:geru.
I (not) am sleeping
'I am not sleeping.'
(73)
(Io/PRO:pers) (non/NEG) posso/potrò/VER2:fin dormire/VER:infi.
I (not) can sleep
'I can not sleep.'
(74)
(Io/PRO:pers) (non/NEG) sto/stavo/VER:fin facendo/VER:geru ...
I not am/was doing
'I am/was (not) doing ...'
Modal verbs subcategorize an infinitive as shown in (73). When these verbs are used in
some of the tenses which are composed of an auxiliary and a participle, a different PoS
sequence is generated (cf. example (75)).
(75)
(Io/PRO:pers) (non/NEG) ho/avrei/AUX:fin potuto/VER:ppast constatare/VER:infi ...
I (not) have can observe ...
'I would (not) have/would (not) have/will (not) have observe ...'
Whereas some tenses in passive voice (cf. example in (76)) do not differ from the composed
tenses shown in (74) regarding the PoS sequence, some past tenses in passive voice require
two forms of an auxiliary, as shown in (77).
(76)
(Egli/PRO:pers) (non/NEG) è/sarà/AUX:fin amato/VER:ppast.
He (not) is/was/will be loved
'He is (not)/was (not)/will (not) be loved.'
(77)
(Egli/PRO:pers) (non/NEG) è/era/AUX:fin stato/AUX:ppast amato/VER:ppast.
He (not) is/will be been loved
'He has (not) been/will (not) have been loved.'
There is one construction in Italian which is often used to abbreviate a finite sentence. It
consists of a gerundive and, if the verb is modal, of an infinitive. Since these constructions
can be used as translations of English finite clauses, they should also be taken into
account.
(78)
(Non/NEG) ribadendo/VER:geru ...
(not) stressing ...
'(Not) stressing ... / I (do not) stress ...'
(79)
(Non/NEG) volendo/VER:geru affrontare/VER:infi ...
not wishing confront ...
'(Not) wishing to confront ... / I (do not) wish to confront ...'
Let us now consider the parallel sentences (67c) and (70). We have already defined the
English context constraints that have to be fulfilled if has/AUX should be aligned with
the Italian auxiliary, in our example with ha/AUX:fin. If we would like to allow the
alignment of the English finite auxiliary with any Italian finite verb form, the condition
on Italian is that the Italian verb (here, the auxiliary ha/AUX:fin) is finite, i.e. its PoS
must contain fin (e.g. AUX:fin, VER:fin, VER2:fin).
In the following sections, the alignment rules for different word categories are presented.
5.4.2 Subject pronouns
Since the pronominal subject in English does not have to have a pronominal counterpart
in the Italian parallel sentence, the alignment of the subject pronoun is often not correct.
Figure 6 shows an example of an incorrect base word alignment.²⁰
Figure 6: Incorrect base alignment of if you wish and se lo desidera
(links: if9/IN ↔ se8/CON, you10/PRP ↔ lo9/CLI, wish11/VBP ↔ desidera10/VER:fin)
20 The subscripts in the alignment figures mark the word position in a sentence.
The phrases in figure 6 are taken from the sentences which are shown in (80) and (81).
(80)
That is precisely the time when you may, if you wish, raise this question, ...
(81)
È appunto in quell' occasione che, se lo desidera, avrà modo di sollevare la sua questione ...
is exactly in this occasion that, if you wish, will have chance to raise the your question ...
Figure 6 shows an alignment of the embedded clauses if you wish and se lo desidera. The
English subject pronoun is aligned with the Italian object clitic lo whereas the predicate wish is correctly aligned with the Italian predicate desidera. In this case, it would be
correct to align the English pronoun with the Italian finite verb since you wish should
be translated as desidera.²¹ The correct alignment is shown in figure 7.
Figure 7: Correct alignment of if you wish and se lo desidera
(links: if9/IN ↔ se8/CON, you10/PRP ↔ desidera10/VER:fin, wish11/VBP ↔ desidera10/VER:fin; lo9/CLI is unaligned)
The correctness of the alignment in figure 7 is linguistically motivated. The information
provided by the English subject pronoun you and the finite verb wish is the same as the
information provided by the Italian finite verb desidera regarding person and number
of the subject. This is why both English words should be aligned with the Italian
verb. In general, the English pronoun should be aligned with the Italian finite verb
that corresponds to the English finite verb. The subject alignment rule leads to a link
between the English word with PoS PRP (English subject pronoun) and the Italian
word with PoS VER:fin (finite verb), VER2:fin (finite modal verb), or AUX:fin (finite
form of an auxiliary). This rule is summarized in (82a).
In figure 7, the alignment rule (82a) leads to the deletion of the base alignment link
between the English pronoun you and the Italian object clitic lo. Since lo does not have
further base alignment links, it remains unaligned in the given sentence pair.
Yet, the Italian subject pronoun is not always omitted. If it is expressed overtly, the
English subject pronoun should be aligned only with it. The pronouns bear the same
information about number, person and gender of the subject. In such a context, the
Italian verb is not needed to derive these characteristics of the English subject pronoun,
so I do not align it with the Italian finite verb. This rule is presented in (82b). The rule
associates the English pronoun (PRP) with the Italian pronouns (PRO:pers - personal
pronoun, PRO:demo - demonstrative pronoun).
The English finite clause can also be translated as a gerundive construction in Italian.
The gerundive bears the semantics that corresponds to the semantics of the English
predicate (for example, I/PRP think/VBP ↔ pensando/VER:geru). In such constructions, the
aim is to align the English pronoun and finite verb with the same Italian verb (here, the
gerundive). This rule is expressed in (82c). When the Italian VP is an infinitive construction (for example, I/PRP have/AUX thought/VBN ↔ aver/AUX:infi pensato/VER:ppast), the
English subject pronoun should be aligned with the infinitive form of the Italian auxiliary. Thus, the alignment between the pronoun (PRP) and the infinitival auxiliary (AUX:infi)
has to be allowed.
Another possible infinitival construction in Italian consists of a preposition (PRE)
and an infinitive (*:infi:*), for example I believe, I/PRP know/VBP this ↔ Credo di/PRE
saperlo/VER:infi:cli. The pronoun I should be aligned with the Italian preposition di to
satisfy the condition of being aligned with the same word as its finite verb.²² This is
expressed in the rule (82d).
21 The Italian finite verb is 3rd person singular, so this should be understood as a polite form of address where the addressee is one person.
(82)
a. EN: subject pronoun → IT: finite verb
   if IT does not have a subject pronoun
   EN: PRP → IT: {VER:fin, VER2:fin, AUX:fin}
b. EN: subject pronoun → IT: subject pronoun
   if IT has a subject pronoun
   EN: PRP → IT: {PRO:pers, PRO:demo}
c. EN: subject pronoun → IT: gerundive
   if IT is a gerundive construction
   EN: PRP → IT: {VER:geru, VER2:geru}
d. EN: subject pronoun → IT: infinitival particle or infinitive auxiliary
   if IT is an infinitive construction
   EN: PRP → IT: {PRE, AUX:infi}
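The four rules in (82) can be sketched as an ordered lookup in Python. This is only an illustration: the order in which the rules are tried (overt pronoun first, then finite verb, gerundive, and infinitive construction) and the decision to link the pronoun to every Italian word matching the first applicable rule are assumptions of this sketch, and all names are hypothetical.

```python
from typing import List, Tuple

Token = Tuple[int, str]  # (position in the Italian sentence, PoS)
Link = Tuple[int, int]   # (English position, Italian position)

# Rules of (82), tried in this (assumed) order of precedence
SUBJECT_RULES = [
    ("82b", {"PRO:pers", "PRO:demo"}),            # overt Italian subject pronoun
    ("82a", {"VER:fin", "VER2:fin", "AUX:fin"}),  # finite verb
    ("82c", {"VER:geru", "VER2:geru"}),           # gerundive construction
    ("82d", {"PRE", "AUX:infi"}),                 # infinitive construction
]


def align_subj(en_position: int, it_vp: List[Token]) -> List[Link]:
    """Align an English subject pronoun with the compatible Italian words."""
    for _rule, tags in SUBJECT_RULES:
        targets = [position for position, tag in it_vp if tag in tags]
        if targets:
            return [(en_position, target) for target in targets]
    return []  # no compatible Italian word: the pronoun stays unaligned


# "you wish" / "desidera": the pronoun is linked to the finite verb (82a)
print(align_subj(10, [(10, "VER:fin")]))                    # [(10, 10)]
# "it ... passes" / "esso ... approva": overt Italian pronoun (82b)
print(align_subj(20, [(19, "PRO:pers"), (21, "VER:fin")]))  # [(20, 19)]
```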
Figure 8 is an example for the alignment rule (82a).²³ The rule (82a) leads to the
alignment of the English subject pronoun I with the Italian finite verb posso (VER2:fin).
Figure 8: Alignment of I can tell you and posso risponderle
(English: I4/PRP can5/MD tell6/VB you7/PRP; Italian: posso4/VER2:fin risponderle5/VER:infi:cli; link: I4/PRP ↔ posso4/VER2:fin)
Figure 9 shows an example for the alignment rule (82b). The English personal subject
pronoun it is only aligned with the Italian pronominal subject esso.
In figure 10, a phrase pair is shown to which the alignment rule (82c) can be applied.
It allows the alignment of an English pronoun with an Italian gerundive.
Figure 11 shows an example for the alignment rule (82d) which allows the alignment of an
English pronoun with an Italian infinitive auxiliary.
22 Finite verb alignment rules are discussed in the next section.
23 The alignments marked with dotted lines are of no interest at this point.
Figure 9: Alignment of it actually passes and esso stesso approva
(English: it20/PRP actually21/RB passes22/VBZ; Italian: esso19/PRO:pers stesso20/ADJ approva21/VER:fin; link: it20/PRP ↔ esso19/PRO:pers)
Figure 10: Alignment of I would say and volendo dire
(English: I0/PRP would1/MD say2/VB; Italian: volendo0/VER2:geru dire1/VER:infi; link: I0/PRP ↔ volendo0/VER2:geru)
Figure 11: Alignment of I have said and aver detto
(English: I0/PRP have1/AUX said2/VBN; Italian: aver0/AUX:infi detto1/VER:ppast; link: I0/PRP ↔ aver0/AUX:infi)
5.4.3 Finite verbs
After dealing with English subject pronouns, we now examine the verbal elements of English
sentences containing a subject pronoun. Let us first examine the example base alignment
presented in figure 12.
Figure 12: Incorrect base alignment of I feel and ritengo
(links: I5/PRP ↔ ritengo6/VER:fin, feel6/VBP ↔ ritengo6/VER:fin, feel6/VBP ↔ che7/CHE)
The sentences which contain the phrases in figure 12 are shown in (83) and (84).
(83)
Yes, Mr Evans, I feel an initiative of the type you have just suggested would be
entirely appropriate.
(84)
Sì, Onorevole Evans, ritengo che un' iniziativa del tipo che lei propone sia assolutamente opportuna.
yes, mr evans, believe that a initiative of the type that you suggest would be absolutely appropriate.
The English finite verb feel should be aligned only with the Italian finite verb form
ritengo. The motivation for this assumption is that their semantic features are similar.
They have the same word category and share the same verbal features (tense, finiteness,
person, number).²⁴ Following this idea, the correct alignment of the phrases in figure 12 is
presented in figure 13. The base alignment link between feel and che is removed. The
English verb is only aligned with the Italian verb.
Figure 13: Correct base alignment of I feel and ritengo²⁵
(links: I5/PRP ↔ ritengo6/VER:fin, feel6/VBP ↔ ritengo6/VER:fin; che7/CHE is unaligned)
If parallel sentences both consist of a nite VP, the nite verbs in both languages should
be aligned to each other. This means that English words with PoS VBZ, VBD, VBP,
AUX, MD are aligned with the Italian words with PoS VER:n, VER2:n, AUX:n.
This is stated in the rule in (85a).
The English nite verbs can also be auxiliaries (AUX ) or modals (MD ). I refer to
both types of the verbs as auxiliaries. If an English (nite) auxiliary is to be aligned,
it should be aligned with Italian (nite) auxiliary (or auxiliaries). If we have VPs that
dier in a voice (active vs.
passive), the English nite verb or auxiliaries should be
aligned with Italian nite auxiliaries or their participles (cf. rule (85b)).
If the English VP consists only of one verb whereas the Italian VP is composed,
the English nite verb should be aligned to all Italian verbs.
amine parallel VPs youP RP saidV BD
↔
For example, if we ex-
abbiateAU X:f in dettoV ER:ppast , we see that the
English nite verb said bears the same verb features as the Italian composed VP [ab-
biate detto ]V P . They both express a past action. Thus, we would like to translate the
English past tense (in the example said ) into the corresponding past tense in Italian
which not only contains of the participle trascorso, but also of the auxiliary abbiate. So,
the English verb should be aligned to both Italian verbs (verb alignment rules (85a)
and (85c)). This alignment rule should lead to an alignment between an English word
with PoS VBZ, VBD, VBP, AUX, MD and an Italian participle with PoS AUX:ppast,
VER:ppast, VER2:ppast, VER:in, VER2:in.
Furthermore, the combination of the
rules (85a) and (85c) satises the condition that the English pronoun and its nite verb
should both be aligned with the same Italian nite verb (if the subject pronoun does
not exist in Italian).
If the Italian parallel VP is a gerundive or an infinitive construction consisting of a
preposition, the English finite verb should be aligned with the Italian gerundive (cf. rule
(85e)), or with the Italian preposition (cf. rule (85d)), resp.
24 Although parallel verbs sometimes have different verbal features, they should be aligned
satisfying the condition that same word categories should be associated to each other.
25 In this work, only the alignment of verbal sentence elements has been modified. The
definition of an alignment of subcategorized conjunctions is out of scope of this thesis.
Furthermore, the removal of the link between feel and che in figure 13 still allows the
extraction of the translation phrase pairs (I feel ↔ ritengo) and (I feel ↔ ritengo che).
Finite verb alignment rules are summarized in (85).
(85) a. EN: finite verb → IT: finite verb
        EN: {VBZ, VBD, VBP, AUX, MD} → IT: {VER:fin, VER2:fin, AUX:fin}
     b. EN: finite verb → IT: participle form of auxiliary
        if IT VP has a passive voice
        EN: {VBZ, VBD, VBP, AUX, MD} → IT: {AUX:ppast}
     c. EN: finite verb → IT: participle or infinitive
        if EN VP is not composed
        EN: {VBZ, VBD, VBP, AUX, MD} → IT: {VER:ppast, VER2:ppast, VER:infi, VER2:infi}
     d. EN: finite verb → IT: infinitival particle
        if IT is an infinitive construction
        EN: {VBZ, VBD, VBP, AUX, MD} → IT: {PRE}
     e. EN: finite verb → IT: gerundive
        if IT is a gerundive construction
        EN: {VBZ, VBD, VBP, AUX, MD} → IT: {VER:geru, VER2:geru}
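The rules in (85) can be read as a lookup from English finite-verb tags to permitted Italian tags, guarded by conditions on the two VPs. The following Python sketch illustrates this reading; it is not the program used in this work. The function name, the two boolean flags, and the way infinitive and gerundive constructions are detected (an Italian VP without a finite verb) are assumptions made for illustration only.

```python
# Illustrative encoding of the finite-verb rules in (85); not the original program.
EN_FINITE = {"VBZ", "VBD", "VBP", "AUX", "MD"}
IT_FINITE = {"VER:fin", "VER2:fin", "AUX:fin"}
IT_PART_OR_INF = {"VER:ppast", "VER2:ppast", "VER:infi", "VER2:infi"}
IT_GERUND = {"VER:geru", "VER2:geru"}


def finite_verb_targets(en_pos, it_tags, en_composed, it_passive):
    """Return positions in the Italian VP that an English finite verb may link to.

    en_pos      -- PoS tag of the English verb
    it_tags     -- PoS tags of the Italian VP, in order
    en_composed -- hypothetical flag: the English VP has more than one verb
    it_passive  -- hypothetical flag: the Italian VP is in passive voice
    """
    if en_pos not in EN_FINITE:
        return []
    allowed = set(IT_FINITE)                      # (85a)
    if it_passive:
        allowed.add("AUX:ppast")                  # (85b)
    if not en_composed:
        allowed |= IT_PART_OR_INF                 # (85c)
    # Assumption: an Italian VP without a finite verb is treated as an
    # infinitive or gerundive construction.
    if not any(tag in IT_FINITE for tag in it_tags):
        if any(tag.endswith(":infi") for tag in it_tags):
            allowed.add("PRE")                    # (85d)
        allowed |= IT_GERUND                      # (85e)
    return [j for j, tag in enumerate(it_tags) if tag in allowed]
```

Under these assumptions, an English VBD facing the tag sequence AUX:fin, VER:ppast is linked to both Italian positions when the English VP is not composed, and only to the finite auxiliary when it is composed.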
Figure 14 shows an alignment of the English finite verb enjoyed after applying alignment
rules (85a) and (85c). The link between the English finite verb and the Italian participle
should only be possible if the English VP is not composed and the Italian VP consists
of an auxiliary and a participle or infinitive.
EN: you/PRP enjoyed/VBD
IT: abbiate/AUX:fin trascorso/VER:ppast
Figure 14: Alignment of you enjoyed and abbiate trascorso
If the English VP is composed, and thus the condition for applying the rule (85c) is not
fulfilled, only the alignment rule (85a) can be applied, resulting in the alignment shown
in figure 15.
EN: you/PRP have/AUX requested/VBN
IT: avete/AUX:fin chiesto/VER:ppast
Figure 15: Alignment of you have requested and avete chiesto
Figure 16 shows an example for alignment rules (85a) and (85b). The English finite verb
were is aligned with two Italian words: the finite verb siamo and the second auxiliary stati,
which is a participle. As already mentioned, English auxiliaries should be aligned with
Italian auxiliaries.
EN: we/PRP were/AUX elected/VBN
IT: siamo/AUX:fin stati/AUX:ppast eletti/VER:ppast
Figure 16: Alignment of we were elected and siamo stati eletti
Figure 17 shows parallel VPs of a different type. Whereas in English, we have a finite
subclause with the predicate had, in Italian, the infinitival construction di avere is used.
The sentences that the phrases in figure 17 are extracted from are shown in (86) and
(87).
(86) ... that everybody would make certain that they had adequate ...
(87) ... che  tutti si accertino di avere   una formazione adeguata ...
     ... that all   ensure       to have    a   education  adequate ...
Alignment rule (85d) allows the English finite verb had to be aligned with the Italian
infinitival particle di. Combining this rule with rules (82d) for pronouns and (85c) for
finite verbs, the alignment shown in figure 17 is computed.
EN: they/PRP had/AUX
IT: di/PRE avere/VER:infi
Figure 17: Complete alignment of they had and di avere
5.4.4 Participles, infinitives and gerundives
The infinitive, participle or gerundive form of a verb is a part of a VP if the VP is composed.
Auxiliaries are used to build some tenses, but the meaning of a VP is provided by an
infinitive or participle form of the main verb. For this reason, the alignment rules
for English infinitives, participles and gerundives should lead to an alignment between
English infinitives, participles and gerundives and Italian infinitives, participles and
gerundives. This is stated in the rule (88b). An example is shown in figure 18.
EN: you/PRP have/AUX requested/VBN
IT: avete/AUX:fin chiesto/VER:ppast
Figure 18: Alignment of you have requested and avete chiesto
The alignment between English and Italian participles, infinitives and gerundives is
possible only if the Italian VP is composed, which is not necessarily the case. Italian
may use a tense that does not require an auxiliary, so that all English verbs, including
the participle, should be aligned with the Italian finite verb as shown in figure 19. The
same holds for infinitives which occur with modal verbs. The English participle should be
aligned with the Italian verb which has the same or similar semantic features. The Italian
verb form should be aligned with the English auxiliary and the main verb in order to
express the same tense. This leads to the definition of the rule in (88a).
EN: you/PRP have/AUX requested/VBN
IT: chiedevate/AUX:fin
Figure 19: Alignment of you have requested and chiedevate
The rules handling these cases are summarized in (88). If we take a closer look at rule
(88a), we see that in some cases, the information provided by PoS is not enough to
apply the correct alignment rule. For example, the verb been has the same PoS (AUX)
no matter whether it is used as an auxiliary or as a main verb. When computing the alignment
for been, we have to decide whether been is used as an auxiliary or as a main verb. If, for
example, a composed Italian VP is given, been as an auxiliary (for example in they have
been said) should be aligned with the Italian auxiliary. If it is a main verb, it should be
aligned with the Italian main verb (cf. footnote 26).
(88) a. EN: participle, infinitive or gerundive → IT: finite verb
        if IT VP is not composed, or English verb is like subcategorizing a to-infinitive,
        or English verb is be
        EN: {VBN, VB, VBG, TO} → IT: {VER:fin, VER2:fin, AUX:fin}
     b. EN: participle, infinitive or gerundive → IT: participle, infinitive or gerundive
        if IT VP is composed
        EN: {VBN, VB, VBG, AUXG, TO} →
        IT: {VER:ppast, VER2:ppast, VER:infi, VER2:infi, AUX:infi}
     c. EN: participle → IT: participle of an auxiliary
        if EN VP is not composed and IT not in passive voice
        EN: {VBN} → IT: {AUX:ppast}
     d. EN: participle → IT: infinitival particle
        if IT is infinitival construction
        EN: {VBN} → IT: {PRE}
26 The Italian verb form stata has two different PoS depending on the context in which it
occurs. If it is a part of a passive construction, it has the PoS AUX:ppast, otherwise it
is tagged as VER:ppast.
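The choice between rules (88a) and (88b) depends on whether the Italian VP is composed. The following Python sketch illustrates only this switch; it deliberately leaves out the like and be conditions of (88a) as well as rules (88c) and (88d). The function name and the it_composed flag are hypothetical and not part of the original program.

```python
# Illustrative sketch of the switch between rules (88a) and (88b).
EN_88A = {"VBN", "VB", "VBG", "TO"}
IT_88A = {"VER:fin", "VER2:fin", "AUX:fin"}
EN_88B = {"VBN", "VB", "VBG", "AUXG", "TO"}
IT_88B = {"VER:ppast", "VER2:ppast", "VER:infi", "VER2:infi", "AUX:infi"}


def nonfinite_targets(en_pos, it_tags, it_composed):
    """Return Italian VP positions an English non-finite verb form may link to.

    it_composed -- hypothetical flag: the Italian VP has an auxiliary plus a
                   participle or infinitive
    """
    if it_composed:
        en_set, it_set = EN_88B, IT_88B   # (88b): non-finite to non-finite
    else:
        en_set, it_set = EN_88A, IT_88A   # (88a): non-finite to the finite verb
    if en_pos not in en_set:
        return []
    return [j for j, tag in enumerate(it_tags) if tag in it_set]
```

With the tags of figure 18 (AUX:fin, VER:ppast) the participle requested is linked to the Italian participle, whereas with the single tag AUX:fin given for chiedevate in figure 19 it is linked to the finite verb.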
The English infinitive like is another verb which I treat as an auxiliary, but only if it
occurs with a to-infinitive, for example I would like to say as shown in figure 20. If like is
a part of a construction containing a modal verb (MD) and a to-infinitive (TO + VB),
it should be treated as an auxiliary, i.e. as a finite verb. This ensures that it is aligned
with the same Italian finite verb as the modal (here, would).
EN: I/PRP would/MD like/VB to/TO say/VB
IT: vorrei/VER:fin dire/VER:fin
Figure 20: Alignment of I would like to say and vorrei dire
5.4.5 Negation
In this work, the English negation particle not is treated as a part of the VP and its
alignment should also be taken into account. The simplest case of the alignment of the
English negation is to associate it with the Italian negation as shown in figure 21.
EN: we/PRP do/AUX not/RB adhere/VB
IT: noi/PRO:pers non/NEG rispettiamo/VER:infi
Figure 21: Alignment of we do not adhere and noi non rispettiamo
But this is not always possible. Since the sentences in the parallel corpus are not
always one-to-one translations of each other, it can happen that the negation exists in
only one of the two languages. It is also possible that the verb in one language already
contains the negation (for example as an attached prefix) whereas its counterpart does
not and therefore requires an explicit occurrence of the negation.
(89) EN: negation → IT: negation
     if IT VP contains a negation
     EN: {RB} → IT: {NEG}
The negation alignment rule in (89) allows the English negation to be aligned only with
the Italian negation particle. If there is a mismatch, the English negation stays
unaligned.
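The behaviour of rule (89) can be sketched in a few lines of Python. This is an illustration under assumptions, not the original program: VPs are represented as lists of (word, PoS) pairs, and the function name align_negation is hypothetical.

```python
def align_negation(en_vp, it_vp):
    """Return (en_index, it_index) links for the English negation particle.

    en_vp, it_vp -- lists of (word, pos) pairs for the parallel VPs
    (a representation assumed here for illustration).
    """
    # Positions of Italian negation particles (PoS NEG).
    it_neg = [j for j, (_, pos) in enumerate(it_vp) if pos == "NEG"]
    links = []
    for i, (word, pos) in enumerate(en_vp):
        # Rule (89): English negation (RB) may only link to Italian NEG.
        if pos == "RB" and word.lower() == "not":
            links.extend((i, j) for j in it_neg)
    # If the Italian VP has no NEG, the list stays empty (negation unaligned).
    return links
```

For the VPs of figure 21 this yields a single link between not and non; if the Italian VP contains no word tagged NEG, no link is produced.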
5.4.6 Infinitival particle
Since the English to-infinitives which are considered as being subcategorised by the verbs
are also handled, we need alignment rules for the English infinitival particle to. The rule
is simple: It should be aligned with the Italian infinitival particle (PRE) if the Italian VP
is an infinitival construction (PRE + *:infi or simply *:infi). If this condition is not
met, it should be handled as an infinitive.
(90) a. EN: infinitival particle → IT: infinitival particle
        if IT is infinitival construction
        EN: {TO} → IT: {PRE}
     b. EN: infinitival particle ↔ EN: infinitive
        if IT is not an infinitival construction
Figure 22 shows an example for the rule (90b). Besides this alignment, the figure also
shows alignments for other English tokens computed by applying the rules (82a) for I,
(85a) for suggest, and (85b) for present.
EN: I/PRP suggest/VBP to/TO present/VB
IT: raccomando/VER:fin di/PRE presentare/VER:infi
Figure 22: Alignment of I suggest to present and raccomando di presentare
5.4.7 Alignment examples
In the following, a few examples of computed VP alignment are presented. The sentences
are taken from Europarl.
EN: I/PRP shall/MD do/AUX
IT: seguirò/VER:fin
Figure 23: Alignment of I shall do and seguirò
The phrases in figure 23 are simple to align. Since there is only one verb in Italian, all
English words are aligned with it. To achieve an alignment between the English subject
pronoun I and the Italian finite verb seguirò, the alignment rule (82a) has to be applied.
The modal shall, which is recognised as a finite verb, is aligned with seguirò according
to the alignment rule (85a). Finally, the link between the auxiliary do, which represents
the main verb of the given VP, and seguirò is also computed by the rule (85a). This example
also shows that the same tense (future tense) is formed differently in the given language
pair. Whereas English needs a modal verb and an infinitive, the Italian verb takes a
suffix to express the future tense.
EN: we/PRP have/AUX upheld/VBN
IT: abbiamo/VER:fin sostenuto/VER:ppast
Figure 24: Alignment of we have upheld and abbiamo sostenuto
Figure 24 shows the alignment of composed VPs. The English personal pronoun and
finite auxiliary are aligned with the Italian finite verb abbiamo. The subject pronoun is
aligned according to the alignment rule (82a) whereas the English auxiliary is aligned to
the same verb according to the rule (85a). The participles are aligned with each other,
which is determined by the rule (88b).
Figure 25 shows an example of a VP pair in which the Italian VP contains a
subject pronoun.
EN: you/PRP have/AUX suggested/VBN
IT: lei/PRO:pers propone/VER:fin
Figure 25: Alignment of you have suggested and lei propone (= you proposed)
In this context, the English subject pronoun should only be aligned with the Italian
subject pronoun. This is stated in the alignment rule (82b). The English finite verb is
aligned with the Italian finite verb according to rule (85a) whereas the alignment of the
English participle (as a main verb) is defined by the rule (88a).
This example pair shows another discrepancy that I noticed when examining the identified
phrase pairs. Often, VPs are not in the same tense. In figure 25, the English VP is in
past tense whereas the corresponding Italian VP denotes an action in the present.
The VP pair shown in figure 26 illustrates the case in which the English subcategorized
to-infinitive should be aligned with the Italian participle as a main verb (not subcategorized).
The phrases are extracted from the sentences in (91) and (92) (cf. footnote 27).
EN: he/PRP is/AUX to/TO go/VB
IT: verrà/AUX:fin messo/VER:ppast
Figure 26: Alignment of he is to go and verrà messo
27 The phrases are rather idioms. In Europarl, I found only 39 sentences containing the
English VP whereas the Italian VP occurs solely in 16 sentences.
(91) Now, however, he is to go before the courts once more because the public prosecutor
     is appealing.
(92) Ora, però, verrà messo nuovamente in stato di accusa perché il pubblico ministero
     ricorrerà in appello.
     now, but, will come put again in position of accusation because the public government
     recurs the appeal.
Again, the English pronoun and finite verb are aligned according to the rules (82a) and
(85a) with the Italian finite verb verrà. The infinitive particle to is treated as an English
participle, infinitive or gerundive, since the Italian VP does not contain a preposition
(as an infinitive particle) which would be seen as an alignment candidate for English
to. As a participle, infinitive or gerundive, the infinitive particle, as well as the English
infinitive go, is aligned with the Italian participle. The rule applied to compute this
alignment is the rule (88b).
An Italian VP corresponding to an English finite VP can also consist of only one verb
which is not necessarily finite. Figure 27 shows an Italian VP consisting of only one
gerundive. The link between the English pronoun and the Italian gerundive is produced
by applying the alignment rule (82c). In the given context, the English finite verb is
aligned according to the rule (85e).
EN: you/PRP hear/VBP
IT: ascoltando/VER:geru
Figure 27: Alignment of you hear and ascoltando
5.5 Evaluation
In this section, the VP alignment computed on the basis of the rules which take the PoS
of the words into account is evaluated. Precision, recall and f-score are computed for
the base alignment and the rule-based VP alignment. After comparing the results,
errors made by the rule-based VP alignment are shown and discussed. Furthermore,
some examples of syntactic divergences between English and Italian that are problematic
for the system are shown.
After an evaluation of the improved alignment, translation systems are built and
tested. BLEU scores are reported and the translations of example sentences with
pronominal subjects are discussed.
5.5.1 Precision, Recall, F-score
The program for word alignment computation of the English and the Italian VPs has
been applied to 200 parallel sentences randomly chosen from Europarl (cf. section 5.2).
The sentences contain both NP and pronominal subjects.
The program for the alignment improvement produces a set of partial alignments,
containing alignments only for identified pronominal subjects and their VPs. The alignments
of other words are not a part of the output of the program.
I manually annotated the alignment of the English pronominal subjects and VPs in the
test set with their Italian counterparts. The manual annotation of the test set provided
the partial gold alignment G containing 563 gold alignment links. The alignments of the
English words outside of the phrases (pronominal subject + VP) that were of interest
for this work were ignored (they are simply not word aligned in the hypothesis and the
gold alignment).
To evaluate the base alignment of English pronominal subjects and VPs, it was necessary
to extract the alignments of the relevant words out of the complete word alignment
for a sentence pair. This was done on the basis of the word positions of aligned English
words in the gold alignment. The extracted base word alignment contains all links for
the elements of English VPs which are annotated in the gold alignment. So, if there are
links to Italian words which are not a part of matching Italian VPs, they have a negative
impact on precision.
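The extraction step described above amounts to filtering the complete base alignment by the English word positions that occur in the gold alignment. A minimal Python sketch, assuming that alignments are represented as sets of (English position, Italian position) pairs and using a hypothetical function name:

```python
def restrict_to_gold_positions(base_links, gold_links):
    """Keep only base links whose English position is annotated in the gold set.

    Both arguments are iterables of (en_position, it_position) pairs; this
    representation is an assumption made for illustration.
    """
    annotated_en = {en for en, _ in gold_links}
    # Links to Italian words outside the matching VP are kept on purpose:
    # they count against precision, as described in the text.
    return {(en, it) for en, it in base_links if en in annotated_en}
```

All base links of an annotated English word survive the filter, including those pointing to Italian words outside the matching VP.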
The alignment that is tested is called hypothesis H. Given the gold alignment and the
hypothesis, the evaluation method based on precision, recall and f-score can be applied.
Precision is a measure for the correctness of the hypothesis and is calculated as shown
in equation (14).
$$P = \frac{|H \cap G|}{|H|} \tag{14}$$
Recall is the percentage of gold alignments that are found by the hypothesis (cf. equation
(15)).
$$R = \frac{|H \cap G|}{|G|} \tag{15}$$
F-score is the harmonic mean of precision and recall, and it is computed as shown in (16).
$$F = \frac{2PR}{P + R} \tag{16}$$
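Equations (14) to (16) translate directly into set operations on alignment links. The following Python sketch is illustrative only; the function name and the toy link sets in the usage lines are assumptions and do not come from the test set.

```python
def evaluate(hypothesis, gold):
    """Compute precision, recall and f-score for two sets of alignment links."""
    h, g = set(hypothesis), set(gold)
    correct = len(h & g)                       # |H ∩ G|
    precision = correct / len(h) if h else 0.0
    recall = correct / len(g) if g else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy data: three of four hypothesis links are also in the gold alignment.
hyp = {(0, 0), (1, 0), (1, 1), (2, 3)}
gold = {(0, 0), (1, 0), (1, 1), (2, 2)}
p, r, f = evaluate(hyp, gold)
```

Empty hypothesis or gold sets are mapped to 0.0 here to avoid a division by zero; this is a design choice of the sketch.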
The evaluation results are shown in table 6.
Alignment    # alignments   Precision   Recall   F-score
Base         522            0.66        0.61     0.64
Rule-based   572            0.80        0.81     0.81
Table 6: Evaluation of the VP alignment
In all measures, the rule-based VP alignment is better than the base alignment. Measured
by f-score, the rule-based VP alignment reaches an improvement of 17 percentage points
(0.64 to 0.81) compared with the base alignment. In a large number of sentences, the base
alignment of VPs is both incomplete and incorrect. Since the method described in this work
identifies entire VPs, all VP elements are examined and aligned, producing a VP alignment
which contains links between all VP elements. The alignment rules allow only alignments
between elements which share some characteristics (word category, number, etc.) so that
the alignments to some other word categories, which are incorrect, are excluded (cf.
footnote 28).
In the following, examples are presented in which the rule-based VP alignment leads
to an improvement of the alignment compared to the base word alignment. We start
with an example that shows the most frequent correction of the base alignment. As
already mentioned in section 5.1, the English subject pronoun is often aligned with
different Italian words because its direct counterpart is missing. These include, among
others, the Italian object clitics. The Italian syntax allows the object clitic to occur in
front of the finite verb. Therefore, it is often aligned with the English subject pronoun
which is always situated in front of the verbal sentence predicate (cf. figure 28). The
base alignment links are marked with wavy lines whereas the rule-based alignments are
displayed by straight lines. Overlapping alignments are marked as a combination of a
wavy and a straight line.
EN: I/PRP accept/VBP
IT: la/CLI accetto/VER:fin
Figure 28: Alignment comparison: I accept and lo accetto
In figure 28, the alignment rules for the subject pronouns lead to the alignment of the
English subject pronoun I with the Italian finite verb accetto (straight line) whereas
the link between the pronoun and the Italian object clitic is deleted (wavy line). Both
alignments contain the link between the main verbs accept and accetto.
The example in figure 29 shows the advantage of using English parse trees. The
English VP is interrupted by an embedded sentence, but the derivation of the English
VP from the parse trees leads to the extraction of the complete VP, which is then
correctly aligned with the Italian counterpart. The base alignment does not produce
any alignments for the beginning part of the English VP, namely the word sequence it
will.
EN: it/PRP will/MD ... be/AUX examined/VBN
IT: sarà/AUX:fin esaminata/VER:ppast
Figure 29: Alignment comparison: it will (, I hope,) be examined and sarà esaminata
28 For now, we assume that VP elements can only be aligned to VP elements, i.e. verbs,
negation and subject pronouns, leading to higher precision. However, the assumption is not
completely correct, which has a negative impact on recall. This will be discussed in the
following chapter.
The following figure shows the case in which the base alignment assigns the English
subject pronoun to the Italian adverb pertanto. The rule-based VP alignment deletes
this link and creates the alignment between the pronoun and the Italian finite verb form
può which corresponds to the English modal can. Furthermore, the link between the
English main verb give and the Italian preposition su is removed (cf. footnote 29).
EN: I/PRP can/MOD ... give/VBN
IT: pertanto/ADV può/VER2:fin contare/VER:infi su/PRE
Figure 30: Alignment comparison: I can (,therefore,) give and pertanto può contare su
In figure 31, there is an example of an Italian infinitival clause corresponding to the
English finite clause. The contexts of the VPs are shown in (93) and (94).
(93) ... I would ask you to request that the commission express its opinion on this
     issue and that we then proceed to the vote.
(94) ... la prego di chiedere alla commissione di esprimersi subito e poi di procedere
     al voto.
     ... you I ask to request the commission to express itself soon and afterwards to
     proceed to the vote.
Again, the English pronoun is aligned with the Italian adverb whereas the English
infinitive is only aligned with the Italian main verb. The alignment rules correct the
alignment of the English subject pronoun and align it with the Italian preposition di
which is considered to be a part of the Italian VP. Since the intention was to align the
English subject pronoun and its finite verb with the same Italian word, the English
finite verb proceed is also aligned with di. Additionally, it is also aligned with its Italian
counterpart procedere.
EN: we/PRP ... proceed/VBN
IT: poi/ADV di/PRE procedere/VER:infi
Figure 31: Alignment comparison: we (then) proceed and poi di procedere
Figure 32 shows one of the common base alignments for the English subject pronoun:
it is often aligned with sentence punctuation, in our case with a comma. The VP
alignment rules remove this link and lead to the resulting alignment in which the English
subject pronoun and its finite verb have are both aligned with the same Italian word.
29 Cf. footnote 25 in section 5.4.3.
EN: I/PRP have/AUX ... proposed/VBN
IT: ,/PUN ho/AUX:fin proposto/VER:ppast
Figure 32: Alignment comparison: I have (thus) proposed and , ho proposto
In figure 33, the VPs including the negation are shown. In the base alignment, the
English negation does not have any alignments whereas the rule-based VP alignment
assigns it to its Italian counterpart non. If the English VP contains a negation and an
auxiliary which is needed to negate the verbal predicate (here reflect), the auxiliary is
aligned solely with the Italian main verb (here rifletterà). Certainly, the auxiliary could
also be aligned with the Italian negation since it is used to build a negated English VP. I
decided though to align the auxiliary with the Italian main verb because there are many
other English constructions containing an auxiliary and a main verb which correspond to
the Italian main verb (for example, I [do think]VP ↔ io [penso]VP, he [is playing]VP ↔
egli [gioca]VP). When such a context is given, the auxiliary and the main verb are both
aligned with the corresponding Italian verb if the Italian VP does not have an auxiliary
(cf. figure 25).
EN: they/PRP do/AUX not/RB ... reflect/VB
IT: esso/PRO:pers non/NEG rifletterà/VER:fin
Figure 33: Alignment comparison: they do not (properly) reflect and esso non rifletterà
In the VP pair in figure 34, an Italian VP is shown consisting of the reflexive verb
permettersi (= allow, permit). The Italian reflexive pronoun occupies the position in front
of the Italian finite verb permettesse. This can be compared with the position of the
Italian object clitics shown in figure 28. The base alignment contains a link between the
English subject pronoun and the Italian reflexive pronoun. The rule-based VP alignment,
however, deletes this link and creates the alignment between the English pronoun
and the Italian finite verb. Since the Italian reflexive pronouns have the same PoS as
Italian object clitics, I excluded the alignment of the English subject pronouns with
Italian words tagged with the PoS CLI. This allows for a deletion of many links created
between the English subject pronouns and the Italian object clitics, but it also prohibits
the alignment between the English subject and the Italian reflexive pronouns which
could be considered as correct (cf. footnote 30). Furthermore, the rule-based VP alignment
creates a link between allowed and permettesse which was incorrectly not included in the
base word alignment.
30 Since I do not allow the alignment of English subject pronouns with Italian reflexive
pronouns, these alignments are not a part of the gold alignment.
EN: I/PRP might/VBP be/AUX allowed/VBN to/TO give/VB
IT: mi/CLI permettesse/VER:fin di/PRE rilasciare/VER:infi
Figure 34: Alignment comparison: I might be allowed to give and mi permettesse di rilasciare
The preceding examples show the cases in which the alignment rules lead to an improvement
of the alignment of English and Italian VPs and, especially, of the English subject pronoun.
But the rule-based VP alignment still makes errors in the alignment of about 20% of the
tested sentences. In the following, we examine the errors that are made by the described
method for the VP alignment.
5.5.2 Error analysis
Manual examination of the erroneous alignments revealed problems which can be divided
into four categories:
1. Correct Italian VP not found
   The parallel Italian VP is not found.
2. Extended VPs
   The VPs can contain coordinated verbs or infinitives which do not have a correspondence
   in the other language.
3. Alignment rules
   The rules compute false alignments when the VPs are too complex.
4. Erroneous preprocessing
   VP elements can have false PoS.
In the following, the error categories are discussed. Sentence pairs and alignment examples
are shown to demonstrate the problems within the task of the VP alignment.
Correct Italian VP not found
Alignment rules define alignments between English and Italian VPs. The correctness of
the computed alignment for a given VP pair depends not only on the definition of the
rules, but also on the assumption that the VPs correspond to each other. The method
for searching the corresponding Italian VP given an English VP has been described in
section 5.3.2. The method is based on the base alignment: The Italian VP which
has the most alignments in the base alignment to the English VP is considered to be
the corresponding Italian VP. This is not always correct, so an incorrect Italian VP can
be chosen.
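The selection criterion can be sketched as a count of base alignment links per candidate VP. The Python code below is a minimal illustration, not the method of section 5.3.2 itself: the function name, the representation of VPs as sets of word positions and the tie-breaking (first candidate wins) are assumptions.

```python
def select_italian_vp(en_vp_positions, it_vps, base_links):
    """Return the index of the Italian VP with the most base links, or None.

    en_vp_positions -- set of English word positions of the VP
    it_vps          -- list of sets of Italian word positions, one per candidate VP
    base_links      -- iterable of (en_position, it_position) pairs
    """
    best_index, best_count = None, 0
    for index, vp in enumerate(it_vps):
        count = sum(1 for en, it in base_links
                    if en in en_vp_positions and it in vp)
        # Strict comparison: on ties, the earlier candidate is kept (assumption).
        if count > best_count:
            best_index, best_count = index, count
    return best_index
```

A return value of None corresponds to the case in which no candidate VP receives any link from the English VP.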
If the English VP does not have any alignments to an Italian VP, an empty Italian
VP is chosen. In this case, the English VP stays unaligned. A sentence pair for this case
is shown in (95).
(95) a. As you know, like Mr. Rack, I come from a transit country ...
     b. Anch' io, come l'  Onorevole  Rack, provengo da   un paese   di transito
        Also  I,  as   the honourable Rack, come     from a  country of transit
        'Like Mr. Rack, I also come from a transit country ...'
Whereas the English VP [you know]VP does not have a corresponding Italian VP, for the
VP [I come]VP, the Italian VP [provengo]VP should be identified as the corresponding
phrase. Unfortunately, the base alignment does not reveal this fact, so that the English
VP stays unaligned, which lowers recall.
Until now, I postulated that for every found English VP with a pronominal subject, there
is a parallel Italian VP. This phrase parallelism is not always present in the sentence
pair which is to be processed. The English VP can correspond to an Italian phrase of
some other category, for example, to a PP as shown in (96).
(96) a. We understand that ...
     b. A  nostro avviso
        At our    notice
        'In our opinion'
Having identified the English pronominal subject we and its VP [understand]VP, the
search for the Italian VP is carried out. VP search allows only VPs as corresponding
phrases to the given English phrase, so that the PP [A nostro avviso]PP cannot be
determined as the parallel phrase of the English VP, even though the base alignment
indicates this correspondence. In most cases of this kind of divergence, the English VP
stays unaligned. Since the phrases are parallel, in the gold alignment they are aligned to
each other, so this leads to a loss of recall.
Another phrase divergence which has been observed is shown in (97).
(97) a. Your group was alone in advocating what you are saying now.
     b. Soltanto un gruppo politico condivideva l' opinione da lei espressa in questa sede.
        Only one group political shared the opinion of you expressed in this seat.
        'Only one political group shared the opinion that you expressed in this seat.'
The English finite VP [you are saying]VP corresponds to the Italian PP [da lei espressa]PP
consisting of the preposition da, the subject pronoun lei, and the participle espressa. In
this form, it is a counterpart to the finite English VP, but in passive voice. In the
process of identification of Italian VPs in a given Italian sentence (cf. chapter 5.3.1),
this kind of phrase is not identified as a VP, because it starts with a preposition and
it does not contain a finite verb form. The same problem occurs if the English finite
VP corresponds only to a participle in Italian. So, in these cases, we have English VPs
which stay unaligned, leading to a reduction of recall.
The problems with regard to the parallel Italian VPs can be summarized as follows:
1. Base alignment
   • False VP because the base alignment is incorrect
   • No VP because the base alignment does not contain links to any possible Italian VP
2. Phrase divergence (free translation, idioms)
   • EN: VP ↔ IT: PP
   • EN: finite VP ↔ IT: participle
In section 5.6, I present experiments that I carried out in order to account for these
problems.
Extended VPs
In the previous discussion, examples of VPs have been shown which consist only of
one main verb or subcategorized infinitive. The VPs can also contain a sequence of
verbs which are either combined by a coordination, or form an enumeration separated
by commas.
(98) a. It is irresponsible of EU Member States to refuse to renew the embargo.
     b. Gli stati membri dell' unione sono stati irresponsabili a non rinnovare l' embargo.
        The states member of the union were irresponsible to not renew the embargo.
        'It is irresponsible of Member States not to renew the embargo.'
The sentence pair in (98) shows the English VP [It is ... to refuse to renew]VP and
its Italian counterpart [sono stati ... a non rinnovare]VP. The English VP contains
two to-infinitives. The first one, namely [to refuse]XCOMP, does not have a direct VP
correspondence in Italian. Moreover, semantically, it is equal to the Italian negation non.
This type of correspondence is not described by the alignment rules. In this context,
[to refuse]XCOMP as well as [to renew]XCOMP are aligned with [a rinnovare]XCOMP whereas
the Italian negation remains unaligned.
This sentence pair reveals another divergence in the way the same fact is expressed in
the given language pair. Whereas in English, the expletive has the role of the sentence
subject, in Italian, the subject is an NP [Gli stati membri dell' unione]NP which is a
translation of the English PP [of EU Member States]PP. This inequality also exists in
the processed VP pairs. In some cases in which the English VP has a pronominal
subject, the corresponding Italian VP has a nominal phrase as a subject. Since the
Italian subject was not taken into account unless it is a pronoun, the English pronoun is
not aligned with the Italian subject NP, but instead with the corresponding element of the
VP.
Example (99) shows the Italian VP [è chiedere], which contains the infinitive chiedere.
This infinitive does not correspond to any part of the English VP [It is not]. It is
nevertheless treated as an extension of the Italian VP, which leads to false alignments.
(99)
a. It is not a lot to ask.
b. Non è  chiedere molto.
   Not is to ask   lot
   'It is not a lot to ask.'
According to the parse tree for the English sentence in (99), the VP with the pronominal
subject does not contain the to-infinitive [to ask]XCOMP. It is instead embedded in an
adverbial phrase together with the adverbial lot. On the other hand, the search for VPs
in the Italian sentence returns only one VP, namely [è chiedere]VP. Given the VPs [It is
not]VP and [è chiedere]VP, the Italian infinitive is aligned with the English finite verb is,
and not with [to ask]XCOMP. This additional false link leads to a reduction of precision.
The described problems can be summarized as follows:
1. Subcategorization of infinitives
Poses a problem if the infinitive does not have an equivalent phrase in the other
language
2. Coordination
A coordination of verbs in which not every verb has a counterpart in the other
language
Alignment rules
The definitions of the contexts in which a specific alignment rule should be applied are not
error-free. Additionally, since n-m alignments have to be allowed, a VP element that has
already been aligned is not prohibited from being aligned with further words. In complex VPs
which contain additional elements such as subcategorized infinitives or a sequence of
verbs, the rules generate too many links. For example, if two coordinated finite verbs are
present in both VPs, six links are generated: one from each English finite verb to each
Italian finite verb, and one from the English subject to each of the two Italian finite
verbs. This is shown in the sentence pair in (100).
(100)
a. With regard to the budget and annual appropriations, we agree with the rapporteur's position and fully support it.
b. Per quanto attiene al bilancio e   alle dotazioni     annuali, condividiamo e   appoggiamo la posizione della  relatrice.
   Regarding          the budget  and the appropriations annual,  share        and support    the position of the rapporteur.
   'Regarding the budget and the annual appropriations, we share and support
   the position of the rapporteur.'
The computed alignment for the underlined VPs in (100) is shown in figure 35.
Rule (82a) leads to the alignments between the English subject pronoun and both Italian
finite verbs. Rule (85a) is responsible for the alignments between both English finite
verbs and both Italian verbs.
[Alignment diagram: the English words we9/PRP, agree10/VBP and support18/VBP are each linked to both condividiamo10/VER:fin and appoggiamo12/VER:fin.]
Figure 35: Alignment of we agree (...) support and condividiamo (...) appoggiamo
While the alignments between the English subject and both Italian verbs can be considered
correct, each English verb should be aligned only with its corresponding Italian verb.
The alignment shown thus contains two additional false links, which reduce precision.
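The over-generation discussed for example (100) can be made concrete with a small sketch. It is not the thesis implementation: the function name `cross_product_links` and the hand-written gold links are assumptions made purely for illustration. The sketch links every English VP element to every Italian finite verb, as unrestricted rules would, and measures precision against the links described as correct above.

```python
from itertools import product

def cross_product_links(english_tokens, italian_verbs):
    """Link every English VP element to every Italian finite verb (no limit per word)."""
    return set(product(english_tokens, italian_verbs))

english_vp = ["we", "agree", "support"]       # subject pronoun + two finite verbs
italian_vp = ["condividiamo", "appoggiamo"]   # two coordinated finite verbs

computed = cross_product_links(english_vp, italian_vp)

# Hypothetical gold standard following the discussion of example (100): the subject
# may link to both Italian verbs, each English verb only to its own counterpart.
gold = {
    ("we", "condividiamo"), ("we", "appoggiamo"),
    ("agree", "condividiamo"), ("support", "appoggiamo"),
}

correct = computed & gold
precision = len(correct) / len(computed)
recall = len(correct) / len(gold)
print(len(computed), round(precision, 2), recall)
```

Under these assumptions, six links are computed, four of which are correct, so recall is unaffected while precision drops to two thirds.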
English VPs with modal verbs pose the main problem for the rules and the context
definitions for the alignment of VP elements. Figure 36 shows the computed alignment of an
English VP containing auxiliaries and modals, and its Italian counterpart.
[Alignment diagram: it10/PRP, may11/MD, have12/AUX and been13/AUX on the English side; sia4/AUX:fin and stata5/VER:ppast on the Italian side.]
Figure 36: Alignment of it may have been and sia stato
In this example, too many links are computed. The English finite verb should only be
aligned with the Italian finite verb, so the link between may and stata is false. English
have, as an auxiliary, should also only be aligned with the Italian finite verb sia. The
participle been, as a main verb, should only be aligned with the Italian participle and
main verb stata.
False links are generated because of the complex English context. Many rules check
whether the English VP contains auxiliaries or modals. The links for the English verbs
are computed according to the result of such a context check. In the given
example, this leads to the generation of both correct and incorrect links.
Head switching
Head switching is a phenomenon which involves syntactic and semantic differences between
languages. The main semantic contributor of a phrase in one language does not
correspond to the head of the corresponding phrase in the other language [Butt, 94].
For example, the main verb of a VP, which bears the semantic information of the VP,
need not always correspond to the main verb of the parallel VP in another language.
This kind of divergence is shown in (101). The semantics of the English verb answer
corresponds to the semantics of the Italian noun risposta.
(101)
a. ... they had been answered in a previous part-session.
b. ... avessero già     ottenuto risposta in una tornata precedente.
   ... have     already received answer   in one session previous
   '... they have already received the answer in the previous session.'
With respect to word alignment, one verb in one language is equivalent to a combination
of a verb and an NP in the other language. In example (101), the English verb answered
corresponds to the Italian verb ottenuto and the object NP risposta. For the given VPs,
the alignment rules produce the alignments shown in figure 37.
[Alignment diagram: they14/PRP, had15/AUX, been16/AUX and answered17/VBN on the English side; avessero18/VER:fin and ottenuto20/VER:ppast on the Italian side.]
Figure 37: Alignment of they have been answered and avessero ottenuto (risposta)
The Italian object is not a part of the Italian VP with which the English VP should be
aligned. So, the English verb answered is not aligned with the Italian object
risposta. The produced alignment is incomplete since the link between answered and
risposta is missing.
It is likely that the statistically computed base alignment contains the alignments
between a verb on the one side, and a verb and object NP on the other side (cf. example
(101)). On the other hand, even if the object NP were a part of the Italian VP, the
alignment rules would not allow for alignments between verbs and nouns or articles,
since they only allow for alignments between words whose PoS indicates that the
alignment candidates are part of a VP, i.e. verbs, negation and subject pronouns. So,
for the given case, the alignment rules produce only some of the correct word alignments.
The sentence pair in (102) shows another construction difference that is comparable
with the divergence in example (101).
(102)
a. ... you were unable to attend the Conference of Presidents last Thursday.
b. ... lei non ha   potuto partecipare giovedì  scorso alla   conferenza dei    presidenti.
   ... you not have could  participate Thursday last   to the conference of the presidents.
   '... you could not participate on the Conference of Presidents last Thursday.'
The English VP [you were]VP is a part of a predicative phrase consisting of this
VP and the adjective unable, where unable subcategorizes the following to-infinitive [to
attend]XCOMP. In the parse tree, the XCOMP is embedded in the adjectival phrase ADJP,
so that it is not identified as a part of the VP [you were]VP. This causes two difficulties:
(i) unable, as an adjective with the PoS JJ, cannot be aligned with the equivalent Italian
phrase [non ha potuto]VP (negation and verbs), and (ii) the Italian infinitive partecipare
is not aligned with its English equivalent [to attend]XCOMP. The computed alignment
for the example in (102) is given in figure 38.
Figure 39 shows a combination of the computed VP alignments (solid lines) and the
base alignment (dashed lines) for the VPs in (102). This combination of VP alignment and
base alignment would be desirable as the output alignment, but the dashed alignments
are not a part of the resulting word alignment for the given sentence pair. When the
words belonging to the VPs which should be aligned to each other are identified, all
base alignments for these words are first deleted. Subsequently, the phrase elements are
aligned according to the alignment rules, so that the words of the given VP pair can only
be aligned to each other, and not to words outside of the pair. In the given example,
this leads to the deletion of correct links.
[Alignment diagram: you4/PRP and were5/VBD on the English side; lei5/PRO:pers, non6, ha7/AUX:fin, potuto8/VER:ppast and partecipare9/VER:infi on the Italian side.]
Figure 38: Alignment of you were (unable to attend) and lei non ha potuto partecipare
[Alignment diagram: you4/PRP, were5/VBD, unable6/JJ, to7/TO and attend8/VB on the English side; lei5/PRO:pers, non6, ha7/AUX:fin, potuto8/VER:ppast and partecipare9/VER:infi on the Italian side. Rule-based VP links are drawn as solid lines, base alignment links as dashed lines.]
Figure 39: Alignment of you were unable to attend and lei non ha potuto partecipare
This type of link deletion problem could be solved by checking how reliable the
alignments of a given word with words outside of the corresponding VP are. Because
of the assumption that the elements of VPs should only be aligned to each other, these
cases of divergence have not been investigated further.
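The delete-then-realign step described for example (102) can be sketched as follows. This is a minimal illustration, not the thesis code: the function `realign_vp`, the base links and the rule links are all hypothetical, and words stand in for token positions. It shows how base links that cross the VP boundary are lost once all links of the VP words are removed.

```python
def realign_vp(base_links, english_vp, italian_vp, rule_links):
    """Drop every base link touching a word of either VP, then add the rule-based links."""
    vp_words_en = set(english_vp)
    vp_words_it = set(italian_vp)
    kept = {(en, it) for (en, it) in base_links
            if en not in vp_words_en and it not in vp_words_it}
    return kept | set(rule_links)

# Hypothetical base alignment for example (102).
base_links = {
    ("you", "lei"), ("unable", "potuto"),
    ("attend", "partecipare"), ("Conference", "conferenza"),
}
english_vp = ["you", "were"]
italian_vp = ["lei", "non", "ha", "potuto", "partecipare"]
# Hypothetical output of the alignment rules for this VP pair.
rule_links = {("you", "lei"), ("were", "ha"), ("were", "potuto")}

result = realign_vp(base_links, english_vp, italian_vp, rule_links)
lost = base_links - result
print(sorted(lost))
```

In this toy setting, the links of unable and attend disappear because their Italian partners belong to the Italian VP, even though the English words themselves lie outside the English VP. A reliability check of the kind suggested above would have to be applied before such links are dropped.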
5.6 System extensions
In the previous section, I presented the errors made by the rule-based method for
computing the word alignment between an English VP with a pronominal subject and its Italian
counterpart. Some assumptions were made that, unfortunately, did not always lead to
correct alignments. We saw that the process of searching for an Italian
VP on the basis of the base alignment can be erroneous (cf. assumption (A2) in section
1.2). Furthermore, the assumption that a given English VP can only be expressed
by a VP in Italian does not hold in all cases (cf. assumptions (A1) and (A3) in section
1.2). In the following, I suggest some improvements to the presented work that
address the observed problems.
5.6.1 Lexical search for the matching Italian VP
In section 5.5.2, parallel sentences were shown in which the wrong Italian VP has been
identified as the counterpart of the given English VP. Let us consider once more the
example shown in (103).
(103)
a. As you know, like Mr. Rack, I come from a transit country ...
b. Anch' io, come l'  Onorevole  Rack, provengo da   un paese   di transito
   Also  I,  as   the honourable Rack, come     from a  country of transit
   'Like Mr. Rack, I also come from a transit country ...'
The Italian VP [provengo]VP has not been identified as the parallel VP to the English
VP [I come]VP. In this example, the similarity in meaning of come and provengo could
provide us with the information that these two VPs correspond to each other. Lexical
translation probabilities could therefore be helpful for identifying the matching
Italian VP.
There are two ways to include lexical translation probabilities in the subroutine for
finding the matching Italian VP. The search could be changed so that only a lexical
search for the Italian VP is carried out. Or, we can combine the lexical search and the
search based on the base alignment. I carried out two experiments including lexical
translation probabilities for the identification of the parallel Italian VP. The first one
includes only lexical probabilities, whereas in the second, base alignment and lexical
translation probabilities are combined.
The lexical search uses lexical translation probabilities computed by Moses based on
the base word alignment. For a given English VP e = e_1, ..., e_n, the matching probability
of an Italian VP i = i_1, ..., i_m is computed using equation (17).

\[ p(i \mid e) = \sqrt[m]{\prod_{k=1}^{m} \max_{e_l \in e} p(i_k \mid e_l)} \qquad (17) \]

For each Italian word i_k which is a part of an Italian candidate VP i, the highest
probability of generating it from one of the elements e_l of the English VP e is taken and
multiplied with the highest probabilities of the other Italian words within i. The m-th root of
the product is computed in order to normalize for the length of the Italian VP: without it,
every additional word would lower the product, so that longer Italian VPs would be
dispreferred compared to shorter phrases.
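As an illustration of equation (17), the following sketch computes the length-normalized matching probability for a candidate Italian VP. The function name, the dictionary layout `(italian_word, english_word) -> probability` and all probability values are assumptions chosen for the example; real values would come from the lexical translation table produced during Moses training.

```python
def matching_probability(italian_vp, english_vp, lex_prob):
    """Equation (17): m-th root of the product of the best p(i_k | e_l) per Italian word."""
    if not italian_vp or not english_vp:
        return 0.0
    product = 1.0
    for i_k in italian_vp:
        # Highest probability of generating i_k from any word of the English VP.
        product *= max(lex_prob.get((i_k, e_l), 0.0) for e_l in english_vp)
    return product ** (1.0 / len(italian_vp))

# Toy lexical probabilities p(italian | english); the numbers are invented.
lex_prob = {
    ("provengo", "come"): 0.2,
    ("provengo", "I"): 0.05,
    ("sono", "I"): 0.1,
    ("stato", "come"): 0.001,
}

english_vp = ["I", "come"]
p_short = matching_probability(["provengo"], english_vp, lex_prob)
p_long = matching_probability(["sono", "stato"], english_vp, lex_prob)
print(p_short, p_long)
```

With these toy numbers the one-word candidate scores 0.2 and the two-word candidate scores the geometric mean of 0.1 and 0.001, so candidates of different lengths remain comparable.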
The most probable matching Italian VP i_max for a given English VP e is the Italian
VP with the highest matching probability. The probability of the most probable Italian
VP has to be higher than a threshold t, since the probabilities can be relatively small,
indicating that the phrases are not very likely to be parallel. This is shown in equation
(18). I manually set the threshold to t = 0.001. On the test set, this threshold led to
the best evaluation results. If the probability of i_max lies under the threshold, an empty
Italian VP is returned.

\[ i_{max} = \begin{cases} \arg\max_{i} p(i \mid e), & \text{if } \max_{i} p(i \mid e) > 0.001 \\ [\,], & \text{else} \end{cases} \qquad (18) \]
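The selection step of equation (18) can be sketched as a thresholded argmax. The function name `best_italian_vp` and the candidate scores are hypothetical; only the threshold value 0.001 is taken from the text.

```python
def best_italian_vp(candidates, score, threshold=0.001):
    """Equation (18): return the best-scoring candidate VP, or [] if it is not above the threshold."""
    if not candidates:
        return []
    best = max(candidates, key=score)
    return list(best) if score(best) > threshold else []

# Invented matching probabilities for two sets of candidate Italian VPs.
scores_a = {("provengo",): 0.2, ("sono", "stato"): 0.01}
scores_b = {("assume",): 0.0005}

chosen_a = best_italian_vp(list(scores_a), scores_a.get)
chosen_b = best_italian_vp(list(scores_b), scores_b.get)
print(chosen_a, chosen_b)
```

The second call returns an empty VP because its only candidate stays below the threshold, which corresponds to leaving the English VP unaligned.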
The evaluation results for the different approaches to searching for the Italian VP are
shown in table 7.

IT-VP search     # alignments   Precision   Recall   F-score
Lexical          556            0.68        0.67     0.67
Base             572            0.80        0.81     0.81
Base + lexical   604            0.79        0.84     0.81
Table 7: Evaluation of VP alignment for different IT-VP identification approaches
The search based only on lexical probabilities does not lead to desirable results. This
is due to the fact that verbs can have many different translations, so that the most
probable translation is not correct in every context. Furthermore, equation (17) does
not take word or phrase positions into account. For instance, the position of the most
probable Italian VP can differ significantly from the position of the English phrase.
This could indicate that the phrases do not match each other, but the proposed
computation does not have access to this kind of knowledge. Finally,
there are no checks as to whether a found Italian VP has already been identified as the
parallel VP of some other English phrase.
Some of the mentioned problems can be partially solved if the base alignment and
lexical search are combined. This is done as follows: First, the base alignment search is
carried out. If no Italian VP is found, the lexical search is applied. The combination of
base alignment and lexical search leads to a higher recall since some VPs are found which
have not been identified by the base alignment search. As an example for this case, we
consider the VP [I come]VP from the English sentence in (103). The correct alignment is
shown in figure 40. The base alignment does not identify the Italian VP [provengo]VP
as a counterpart of the English phrase [I come]VP. In fact, it fails to find any Italian VP
for the given English VP, which results in unaligned words of the English VP. Since no
Italian VP has been found, the lexical search is applied. This search process finds the
correct Italian VP and the alignment rules define correct alignments between the phrase
elements.
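The combined strategy (base alignment search first, lexical search only as a fallback) can be sketched as below. The two search functions are passed in as callables and are stubbed with lookup tables here; their names and return values are assumptions for illustration, not the thesis interface.

```python
def find_italian_vp(english_vp, base_search, lexical_search):
    """Try the base-alignment search first; fall back to the lexical search if it finds nothing."""
    found = base_search(english_vp)
    if found:
        return found, "base"
    found = lexical_search(english_vp)
    if found:
        return found, "lexical"
    return [], "none"

# Stub searches: lookup tables keyed by the English VP (hypothetical results).
base_table = {("we", "agree"): ["condividiamo"]}
lexical_table = {("I", "come"): ["provengo"], ("you", "know"): ["assume"]}

def base_search(vp):
    return base_table.get(tuple(vp), [])

def lexical_search(vp):
    return lexical_table.get(tuple(vp), [])

print(find_italian_vp(["we", "agree"], base_search, lexical_search))
print(find_italian_vp(["I", "come"], base_search, lexical_search))
print(find_italian_vp(["you", "know"], base_search, lexical_search))
```

The fallback recovers [provengo] for [I come], which raises recall. It also proposes a VP for [you know], and nothing in this control flow can tell whether such a proposal is correct.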
[Alignment diagram: I8/PRP and come9/VBP on the English side; provengo10/VER:fin on the Italian side.]
Figure 40: Alignment of I come and provengo
Unfortunately, the combination of the two search methods leads to lower precision
compared with the precision of the base alignment search. This has two reasons. First, there are
contexts in which an English VP should stay unaligned since it has no counterpart in
Italian. The lexical search nevertheless returns a parallel VP if its translation probability is
higher than the threshold. Second, a false VP is identified if it has a higher probability
than the correct parallel phrase. To demonstrate this, the English VP [you know]VP
from the example sentence (103) is taken. Its alignment is shown in figure 41.
[Alignment diagram: you1/PRP and know2/VBP on the English side; assume20/VER:fin on the Italian side.]
Figure 41: Alignment of you know and assume
The base alignment does not identify any Italian VP as a counterpart for the English [you
know]VP. So, the lexical search is applied, suggesting that the Italian VP [assume]VP is a
parallel phrase to the given English VP, which is false.
5.6.2 Retaining the base alignment
As already discussed, an English VP does not need to have an Italian VP as its
counterpart. It can correspond to a phrase of some other type, for example to a prepositional
phrase, or simply to a participle. The implemented method for VP alignment does not
allow for this kind of parallelism. For an English VP, only an Italian VP can be found
as its parallel phrase. If the lexical search does not find any parallel Italian VP, the
base alignment of the English VP could be retained instead of leaving the VP unaligned.
This could lead to correct alignments which cannot be created by the alignment rules,
but it could also lead to alignments which are incorrect. The results of the experiment
for retaining the base alignments are shown in table 8.
Alignment           # alignments   Precision   Recall   F-score
Base                522            0.66        0.61     0.64
Rule-based          567            0.80        0.81     0.81
Rule-based + base   588            0.79        0.82     0.80
Table 8: Evaluation of different VP alignments
Evaluation results show that retaining the base alignment for the phrases for which no
alignment could be computed has a negative impact on precision.
5.7 Summary
In this chapter, I presented a method for the alignment of English and Italian VPs
which have pronominal subjects. The aim of the rules developed for VP alignment was
to correct the alignment of the English pronominal subject, which often does not have
an Italian counterpart and is therefore often aligned with incorrect Italian words.
Since the alignment of English subject pronouns depends on the alignment of their
VPs, the rules were written to cover the alignment of entire English and Italian VPs.
The definition of the alignment rules was motivated by both the linguistic and semantic
characteristics of the verbs. Words which bear similar features (for example, number,
definiteness, person, etc.) are aligned to each other. The rules do not have any lexical
knowledge. They operate on the PoS sequences of the parallel VPs. The evaluation
revealed that the rule-based VP alignment reaches higher precision, recall and f-score
than the base word alignment (cf. table 8). The f-score of the base alignment is 0.64 whereas
the f-score of the rule-based VP alignment is 0.81.
Parallel VPs have been extracted on the basis of the base alignment of each English
VP. Since English parse trees were available, in most cases the correct English VPs have
been extracted. The Italian VPs were identified on the basis of PoS sequences that form
a VP (cf. section 5.3.1). The identification of correct Italian VPs is not ideal; more
error-free VPs (consisting only of verbal elements that belong to the specific VP) could
have been extracted if Italian parse trees had been available.
The identification of parallel VPs, which is based on the base word alignment, is not
always correct. When, in addition to the base alignment, the lexical translation
probabilities are included in the search for parallel VPs, recall improves slightly,
but precision falls (cf. table 7 in section 5.6.1). This is due to the fact that the search
method nearly always finds a matching Italian VP for the English input. In some cases,
the found Italian VP is correct, but in other cases it is not.
The examination of the parallel corpus showed that there are many syntactic
divergences between English and Italian (cf. section 5.5.2). Frequently, the English VP does
not have an Italian counterpart because the whole clause has not been translated (free
translation). Furthermore, English VPs can also correspond to Italian PPs, participles
or to the arguments of Italian verbs. Such cases of phrase divergence have not been
dealt with in this work.
The alignment rules lead to satisfying alignments of the PoS sequences in the majority
of VPs, but they produce false links if they are applied to complex VPs (coordinated
verbs or to-infinitives). They search through all VP elements and compute all possible
links, sometimes associating one English verb with two Italian verbs, and vice versa
(cf. figure 35, section 5.5.2). This is due to the implementation of the rules. There
is no limitation on the number of links that can be computed for an input word.
This could be improved by using the lexical translation probabilities. If there are a
number of candidates that an English main verb could be aligned with, the lexical
translation probabilities and the word positions in the VP could be used to determine
which alignment is the most probable, while the other links would be discarded.
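The proposed improvement was not implemented in this work. The following sketch only illustrates one way it might look, under stated assumptions: the function `prune_links`, the probability table and its values are hypothetical, and word positions, which the text also mentions as a possible cue, are ignored.

```python
def prune_links(links, lex_prob, max_links=1):
    """Keep at most `max_links` Italian partners per English word, preferring higher p(it | en)."""
    by_english = {}
    for en, it in links:
        by_english.setdefault(en, []).append(it)
    pruned = set()
    for en, partners in by_english.items():
        ranked = sorted(partners, key=lambda it: lex_prob.get((it, en), 0.0), reverse=True)
        pruned.update((en, it) for it in ranked[:max_links])
    return pruned

# The four verb-to-verb links computed for the coordinated verbs in example (100).
verb_links = {
    ("agree", "condividiamo"), ("agree", "appoggiamo"),
    ("support", "condividiamo"), ("support", "appoggiamo"),
}
# Invented lexical translation probabilities p(italian | english).
lex_prob = {
    ("condividiamo", "agree"): 0.3, ("appoggiamo", "agree"): 0.02,
    ("appoggiamo", "support"): 0.4, ("condividiamo", "support"): 0.05,
}

print(sorted(prune_links(verb_links, lex_prob)))
```

With these toy values, each English verb keeps only its most probable Italian counterpart. The links of the subject pronoun are deliberately left out of the pruning, since the pronoun may legitimately align with both verbs.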
In this work, only VPs with pronominal subjects have been dealt with, but
the method presented in this section could also be applied to VPs with NP subjects.
In this section, we have defined the VP alignment rules and conducted an evaluation and
an error analysis of the generated alignments. In the following section, the SMT systems
built using the two different word alignments (base alignment and base alignment
combined with rule-based VP alignment) will be presented and evaluated. A detailed
examination of translation parameters will be carried out in order to explain the evaluation
results.
6 Evaluation of SMT systems
In this section, I present an evaluation of four SMT systems. For each translation
direction, I built two SMT systems: a baseline system (M1) and a system using
rule-based VP alignment (Mmod). I introduce the evaluation measure BLEU and present the
BLEU scores of the M1 and Mmod systems. I discuss why the improved VP alignment does
not lead to an improvement in the translation of null subjects. Subsequently, I discuss
possible solutions to the problem.
6.1 The BLEU score
In the previous chapter, it has been demonstrated that the word alignment is improved
by applying the alignment rules to the base alignment. We will now evaluate whether
the word alignment improvement has an impact on the quality of generated translations.
In this work, the quality of translation is measured using BLEU [Papineni et al., 02].
The computation of the BLEU score takes into account the similarity between the
generated translation (hypothesis) and one or more reference translations, which are
correct translations of the sentence to be translated. The similarity is expressed
by a modified n-gram precision p_n. The sentences are viewed as sets of n-grams, i.e.
word sequences of length n. The count of an n-gram is clipped to the maximum
number of occurrences of the n-gram in one of the references. The modified precision for
n-grams of length n is computed by summing over the matches for every hypothesis C
in the whole corpus Candidates. This is expressed in equation (19).

\[ p_n = \frac{\sum_{C \in \{Candidates\}} \sum_{n\text{-gram} \in C} Count_{clip}(n\text{-gram})}{\sum_{C' \in \{Candidates\}} \sum_{n\text{-gram}' \in C'} Count(n\text{-gram}')} \qquad (19) \]
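The modified n-gram precision of equation (19) can be sketched in a few lines. This is an illustrative re-implementation rather than the scoring tool used for the experiments; the function names are assumptions, and tokenization is reduced to lower-cased whitespace splitting. The first example pair reuses sentence (104) from the error analysis in section 6.3.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous word sequences of length n."""
    return [tuple(tokens[k:k + n]) for k in range(len(tokens) - n + 1)]

def modified_precision(hypotheses, references_per_hyp, n):
    """Equation (19): clipped n-gram matches over all hypotheses / total hypothesis n-grams."""
    clipped_total = 0
    total = 0
    for hyp, refs in zip(hypotheses, references_per_hyp):
        hyp_counts = Counter(ngrams(hyp, n))
        max_ref = Counter()
        for ref in refs:
            for gram, count in Counter(ngrams(ref, n)).items():
                max_ref[gram] = max(max_ref[gram], count)
        # Clip each hypothesis count to the maximum count in any single reference.
        clipped_total += sum(min(count, max_ref[gram]) for gram, count in hyp_counts.items())
        total += sum(hyp_counts.values())
    return clipped_total / total if total else 0.0

hyp = "you have said that speaks italian".split()
ref = "you said that you speak italian".split()
p1 = modified_precision([hyp], [[ref]], 1)
p2 = modified_precision([hyp], [[ref]], 2)

# Clipping prevents a degenerate hypothesis from being rewarded for repetition.
p_repeat = modified_precision(["the the the the the the the".split()],
                              [["the cat is on the mat".split()]], 1)
print(p1, p2, p_repeat)
```

For the sentence pair from (104), four of six unigrams and one of five bigrams match, which illustrates how a single missing pronoun and a wrong verb form already lower the higher-order precisions.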
In addition to the modified n-gram precision, the BLEU score also considers the length
c of the hypothesis: it should not be too short compared with the reference, which has
a length r. This is expressed by the brevity penalty BP in (20). Sentences that are too
long are already penalized by lower precision.

\[ BP = \begin{cases} 1, & \text{if } c > r \\ e^{1 - r/c}, & \text{if } c \le r \end{cases} \qquad (20) \]
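A direct transcription of equation (20) is given below as a small sketch. The function name is an assumption, and the guard for an empty hypothesis is an addition that the equation itself does not cover.

```python
import math

def brevity_penalty(c, r):
    """Equation (20): 1 if the hypothesis is longer than the reference, else exp(1 - r/c)."""
    if c <= 0:
        return 0.0  # assumption: an empty hypothesis receives the maximal penalty
    return 1.0 if c > r else math.exp(1.0 - r / c)

print(brevity_penalty(10, 8), brevity_penalty(8, 8), brevity_penalty(5, 10))
```

A hypothesis of the same length as the reference is not penalized, since exp(0) = 1, while a hypothesis half as long as the reference is scaled by exp(-1).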
The BLEU score is computed by combining the n-gram precisions and the brevity penalty as
shown in equation (21). N represents the maximum length of n-grams (usually
N = 4), whereas the weights w_n are uniform: w_n = 1/N.

\[ BLEU = BP \cdot \exp\left( \sum_{n=1}^{N} w_n \log p_n \right) \qquad (21) \]
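Equation (21) amounts to the brevity penalty times the geometric mean of the n-gram precisions. The sketch below assumes that the precisions p_1, ..., p_N and the lengths c and r have already been computed; the function name and the handling of a zero precision are assumptions, since the logarithm is undefined in that case.

```python
import math

def bleu(precisions, c, r):
    """Equation (21) with uniform weights w_n = 1/N, where N = len(precisions)."""
    if not precisions or c <= 0 or min(precisions) <= 0.0:
        return 0.0  # log(0) is undefined; treat a zero precision as a zero score
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    log_mean = sum(math.log(p) for p in precisions) / len(precisions)
    return bp * math.exp(log_mean)

print(bleu([0.5, 0.5, 0.5, 0.5], 10, 10))
print(bleu([0.8, 0.2], 12, 10))
print(bleu([1.0, 1.0, 1.0, 1.0], 5, 10))
```

Because the precisions are combined by a geometric mean, one low higher-order precision pulls the whole score down.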
6.2 Evaluation of SMT systems
I have built four SMT systems, two for each translation direction. The baseline SMT
systems (M1) use the base word alignment produced by GIZA++ whereas the other two systems
(Mmod) use the modified word alignment. All systems are built on a parallel corpus
containing 749,646 sentence pairs. The same corpus was used to build the language models.
As a dev and a test set, I used the WMT Newstest 2009 (see footnote 31).
All sets (development and test sets) contain 1000 sentences. For the computation of
BLEU, one reference sentence was used. The evaluation results (BLEU scores) are
shown in table 9.

                     IT → EN   EN → IT
Baseline SMT (M1)    22.07     19.15
Improved WA (Mmod)   21.81     18.18
Table 9: BLEU scores of the SMT systems for EN ↔ IT
6.3 Error analysis
The BLEU scores are slightly worse for Mmod. However, a closer look at the
translations generated by the base and modified systems shows that the translations are nearly the
same. Often, the sentences differ only in synonyms. Such differences can unfortunately
have a strong impact on BLEU scores. In the following, a detailed analysis of the
translation of subject pronouns is presented and discussed.
Translation direction IT → EN
Manual examination of subject pronoun translations revealed that both systems perform
equally well. Looking at the translations, I noticed that some null subject pronouns are
better translated than others. The 1st and 2nd person pronouns seem to be easier to
translate than 3rd person pronouns. This could be explained by the fact that the use of
the pronouns for the 1st and 2nd person is more common. For example, speakers
referring to themselves would rather use a pronoun than an NP. For a
parallel corpus, this means that the English subject pronoun for the 1st person singular
is very likely to occur together with an inflected Italian verb with an omitted subject. This
leads to a higher probability of translating the Italian VP with an inflected finite verb
into the corresponding English pronoun and VP (and vice versa). In table 10, two
possible translations for the Italian verb form so (= I know) are shown (see footnote 32).
so →    i know   know     phrase count   phrase pair types
M1      0.5546   0.1611   1,850          190
Mmod    0.6202   0.0113   1,851          228
Table 10: Translation probabilities for so into (I) know

Table 10 also shows the difference in probabilities between the two systems, indicating
that the rule-based word alignment of VPs does have an impact on translation
probabilities. The English pronoun I occurs in 56% of the possible translations for the Italian
verb so in M1. In Mmod, the English pronoun is found in 68% of the phrases. The five most
probable phrase translations for so are shown in table 11.

so →    M1                    Mmod
        0.5546  i know        0.6202  i know
        0.1611  know          0.0918  i am
        0.0497  i am          0.0448  i am aware
        0.0373  i am aware    0.0189  i understand
        0.0178  i             0.0113  know
Table 11: Top five translation phrases for so

31 http://www.statmt.org/wmt09/
32 The column phrase count shows how often the SL phrase has been extracted. The column
phrase pair types denotes the number of different translations of the SL phrase.
The probability of generating the English verb know without a subject, given the Italian
phrase so, is higher in M1 than in Mmod. This is the result of the rule-based
VP alignment. The rules align the English subject pronoun with
the same Italian verb as the corresponding English finite verb. Thus, they enforce that
the Italian verb so is aligned only with the English word sequence I know. This alignment
leads to a higher probability of extracting the phrase pair (so, I know) compared to the phrase
pair (so, know). In M1, the phrase pair (so, know) was extracted 298 times, whereas in
Mmod it was extracted only 21 times. This means that in Mmod the
inflected Italian verb so was not aligned with the English pronoun in only 21 cases in which it
occurred with the English verb know. In these cases, it is likely that the English clauses had
NP subjects (due to free translation), so that the VP alignment rules were not applied.
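The extraction counts can be related to the probabilities in table 10 by a short calculation. The sketch assumes that the phrase translation probabilities are relative frequencies over the extracted phrase pairs, which is the usual estimate in phrase-based systems but is not stated explicitly in the text; the function name is hypothetical.

```python
def phrase_translation_probability(pair_count, source_phrase_count):
    """Relative frequency: extractions of the phrase pair / extractions of the source phrase."""
    return pair_count / source_phrase_count

p_m1 = phrase_translation_probability(298, 1850)    # (so, know) in M1
p_mmod = phrase_translation_probability(21, 1851)   # (so, know) in Mmod
print(round(p_m1, 4), round(p_mmod, 4))
```

Under this assumption, the counts 298 and 21 reproduce the values 0.1611 and 0.0113 listed for know in table 10.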
The 2nd person singular pronouns are a little more complicated. My intuition is
that their usage is comparable with the usage of the 1st person pronouns. But looking
at the generated translations, I observed that incorrect English subject pronouns
are often generated. This is due to the ambiguity of the Italian verbs. Example sentence (31)
(cf. chapter 3.2.1) already showed such a case of ambiguous Italian verbs. The same
example is shown again in (104a). The translation produced by M1 and Mmod is shown
in (104b).
(104)
a. Hai  detto che  parli italiano.
   have said  that speak Italian.
   'You said that you speak Italian.'
b. You have said that speaks Italian.
Both translation systems generate the same translation for the sentence in (104a). The
input was segmented into the following phrases.
(105)
[Hai]p1 [detto]p2 [che parli]p3 [italiano.]p4
The phrases p1 and p2 generate the correct English pronoun and verb, but the phrase p3 leads
to a false translation, which does not have the obligatory subject pronoun. Furthermore,
the verb parli could indicate the 2nd person singular indicative or the 3rd person singular
conjunctive. The phrases that speaks and che parli are parallel if that and che are relative
pronouns. They are very likely to be translated into each other. In this example,
this interpretation of that and che is wrong. Since the SMT systems do not have this
knowledge, they produce incorrect translations in this case.
che parli →   that you speak   that she speaks   that speaks   phrase count   phrase pair types
M1            -                0.125             0.125         8              8
Mmod          -                0.125             0.125         8              8
Table 12: Translation probabilities of che parli into that (you/she) speak(s)
Translation table 12 for the phrase che parli shows that the correct translation of the
phrase is not included in the table at all. There is no difference in the probability distribution
for the phrase che parli between M1 and Mmod since che is used as a relative pronoun
in nearly all sentences. Hence, the parallel English sentence does not have a personal
subject pronoun, which is necessary for applying the VP alignment rules. Looking at the
translations of parli shown in table 13, the correct translation phrase is present, but it
has a very small probability.
parli →   you speak   speak   you talk   talk     phrase count   phrase pair types
M1        0.009       0.099   -          0.054    111            71
Mmod      0.0083      0.075   0.0083     0.0333   120            85
Table 13: Translation probabilities of parli into that (you) speak/talk
With respect to the 2nd person pronouns, I also noticed that the Italian verbs for the
second person singular indicative are very rare in the corpus that I was working with (cf.
chapter 2.4), which has a negative impact on translating them into English.
The 3rd person pronouns are the most difficult to translate. They do not occur as
frequently as the other pronouns with the corresponding verb form. The intuition is that
the verbs marking the 3rd person occur very often with a subject NP. In the word
alignment, the verbs of the given language pair are aligned to each other. In the phrase
extraction step, they are then extracted as parallel phrases, and the English VP does
not contain a subject pronoun. As a translation example, we consider the sentence in
(106a). M1 and Mmod generated the translation shown in (106b).
(106)
a. Hanno cantato la mia canzone.
   have  sang    my     song.
   'They sang my song.'
b. Have been sung my song.
The input sentence has been segmented as follows:
(107)
[Hanno]p1 [cantato]p2 [la mia]p3 [canzone.]p4
In this example, the translation of the phrases p1 and p2 is crucial for obtaining correct
output. First, we examine the translation probabilities of hanno as an inflected Italian
verb (cf. table 14).
hanno →   have     they have   phrase count   phrase pair types
M1        0.5761   0.0264      15,589         1961
Mmod      0.5552   0.0388      15,292         2230
Table 14: Translation probabilities of hanno into (they) have
The difference in translation probabilities between the phrase pairs (hanno, have ) and
(hanno, they have ) is huge. Even if the language model gives lower scores to generated
sentences which do not have a subject (pronoun) at the beginning of a sentence,
such sentences might still be generated. The rule-based VP alignment does, however, lead
to a higher number of occurrences of the phrase pair (hanno, they have ). Whereas in M1 it
was extracted 412 times, in Mmod the phrase pair was extracted 593 times. The English
pronoun they occurs in 6% of possible translations in M1 whereas in Mmod , it occurs in
12% of translation phrases.
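The extraction counts quoted above can be related to table 14 with a small sketch. It assumes that the phrase translation probability is the usual relative-frequency estimate, and that the "phrase count" column is the total number of extractions of the source phrase; both are assumptions about how the tables were built, and the function names are hypothetical.

```python
def relative_frequency(pair_count, source_count):
    """Relative-frequency estimate: count(source, target) / count(source)."""
    return pair_count / source_count


def pair_count(prob, source_count):
    """Approximate number of extractions of a phrase pair, recovered from
    its probability and the total count of the source phrase."""
    return round(prob * source_count)


# Values for (hanno, they have) taken from table 14.
m1_count = pair_count(0.0264, 15589)
mmod_count = pair_count(0.0388, 15292)
print(m1_count, mmod_count)  # 412 593
```

Under these assumptions, the recovered counts agree with the 412 and 593 extractions reported in the text.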
Similar behaviour is also observed in the case of the inflected main verbs (cf. table 15):
They is a part of 15% of the translation phrases in M1 whereas in Mmod , it is included
in 27% of the phrases. Compared to the phrase pairs in table 14, the rule-based VP
alignment leads to small changes regarding the counts for the translation phrases of
pensano.
pensano →   think    they think   believe   they believe   phrase count   phrase pair types
M1          0.2251   0.0471       0.0733    0.0157         191            92
Mmod        0.2126   0.0435       0.0628    0.0241         207            104
Table 15: Translation probabilities of pensano into (they) think
Further examination revealed another problem, namely regarding the morphology of Italian
and the corpus characteristics. The sentences in (108a) and (109a) differ only in the
gender of the subject, which is marked by the Italian participles stata (feminine) and
stato (masculine). The sentences in (108b) and (109b) are the generated translations of
the sentences (108a) and (109a).
(108) a. Lei     non è    stata a casa.
         you/she not have been  at home.
         'You/she were/was not at home.'
      b. You was not at home.
(109) a. Lei non è    stato a casa.
         you not have been  at home.
         'You were not at home.'
      b. You were not at home.
Whereas the word sequence Lei non è stato has been extracted as a translation unit,
this was not the case for Lei non è stata. In the translation process, this led to the
following segmentation of the sentences.
(110) [Lei]p1 [non è stata]p2 [a casa.]p3
(111) [Lei non è stato]p1 [a casa.]p2
Translation of the phrase p2 in (110) leads to the generation of the false English verb.
But, when the subject pronoun is a part of the translation phrase as in phrase p1 in
(111), the correct translation is generated. Unfortunately, the Italian pronoun lei occurs
only with the masculine participle of the verb essere (= be ) in the training data, so the
needed phrase was not extracted.
In conclusion, we have shown that the improved VP alignment does not contribute
to the improvement of translating the omitted Italian subject pronouns into English.
The rule-based VP alignment does change the translation probabilities of the relevant
translation pairs. Correct translation pairs were found, which have higher probabilities
in Mmod . Furthermore, it has been observed that the English subject pronouns are more
frequently a part of the phrases extracted from the modified word alignment. Unfortunately,
these changes do not have an impact on the generated translations. Incorrect
subject pronouns in English are generated not because of erroneous word alignment, but
because of the way subject pronouns are used. Frequently used subject pronouns (1st
and 2nd person pronouns) are often correctly generated. They occur more often with the
corresponding Italian inflected verb and can therefore be extracted as translation pairs
with relatively high probabilities. 3rd person pronouns are relatively rare and lead to the
extraction of the corresponding translation pairs with relatively small translation
probabilities.[33] The English language model which I used was trained on a relatively small
monolingual data set. A better language model could in some cases lead to generation
of correct obligatory English subject pronouns which were false in the examples shown
in this section.
In the preceding discussion, I claimed that an Italian inflected verb should generate
an English subject pronoun and verb. This will, however, lead to erroneous translations
if a NP subject exists in Italian. When, for example, the Italian verb hanno has to be
translated, it is required that it can be translated both as the English verb have and
the English phrase they have (cf. table 14). Which translation is correct depends on the
Italian input. If the input sentence does not have a subject (the pronominal subject is
dropped), the English phrase they have should be generated. If the Italian input contains
a NP subject, the Italian verb should be translated as the corresponding English verb
(without the subject pronoun). Therefore, both translation phrase pairs are correct in
an adequate context.

[33] Statistics on the occurrence of different subject pronouns in English are shown in
appendix C.
Translation direction EN → IT
In the following, we examine the opposite translation direction and check if the
rule-based VP alignment contributes to the translation of the English subject pronoun and its
VP into the correct Italian VP. As already discussed in section 3.2.2, when translating
the English subject pronoun into Italian, it has to be decided if the Italian pronoun
should be expressed overtly, or if it should be omitted. In SMT, this decision is made
implicitly by using the translation probabilities of English phrases in combination with
the Italian language model.
When examining test sentences, I noticed that 3rd person singular pronouns are often
generated whereas the others are more often omitted. Again, I presume that this is due
to the usage of the pronouns. English 1st person pronouns are very frequent and are
very likely to occur with different Italian VPs with omitted subject. Therefore, they are
very likely to be extracted as parallel phrases in which the Italian phrase does not have
a pronoun. This is confirmed by the phrase translation tables 16 and 17.
i can →   posso    io posso   phrase count   phrase pair types
M1        0.5712   0.0024     2,902          594
Mmod      0.6341   0.0034     2,963          372
Table 16: Translation probabilities of i can into (io) posso
In M1, the phrase pair (I can, posso ) was extracted 1650 times whereas in Mmod it was
extracted 1879 times. The difference in the counts of the phrase pair (I can, io posso ) is
relatively small. In M1, the phrase pair (I can, io posso ) occurs 7 times, and in Mmod
10 times. When the English and the Italian sentence both have pronominal subjects, it is
very likely that the pronouns are aligned with each other. But it is not excluded that the
English pronoun is aligned with additional Italian words, which would have an impact on
the extraction of translation phrases. The VP alignment rules prohibit these additional
alignments, which could explain the higher count of (I can, io posso ) in Mmod than in
M1. The same observation applies to the phrase pairs shown in table 17.
we know →   sappiamo   noi sappiamo   phrase count   phrase pair types
M1          0.6125     0.0157         2,145          372
Mmod        0.5358     0.0139         2,736          551
Table 17: Translation probabilities of we know into (noi) sappiamo
The five most probable translations for we know are shown in table 18.
we know →   M1                       Mmod
            0.6125  sappiamo         0.5358  sappiamo
            0.0181  è noto           0.0259  conosciamo
            0.0176  sappiamo bene    0.0186  è
            0.0171  conosciamo       0.0179  sappiamo che
            0.0167  si sa            0.0157  sappiamo bene
Table 18: Top five translation phrases for we know
I examined if there is a difference in the number of Italian phrases aligned with the
English phrase we know which contain verbs which are equivalent to the English verb
know, namely sappiamo and conosciamo. Whereas these verbs are found in 34% of
phrases in M1, in Mmod , they are a part of 38% of the translation phrases. The difference
is due to the VP alignment rules which allow English VPs to be aligned exclusively with
Italian VPs (which in this work contain only verbal elements and negation). This is also the
reason why the Italian phrase è noto (VER:fin + ADJ) has a smaller probability in Mmod
whereas the finite verb form è is more probable than in M1.
The third person pronouns are not very frequent and occur with a relatively small
number of verbs. In the process of translation, if the English subject pronoun and VP
are not in the phrase table as a translation unit, they are split resulting in two separate
translation phrases: a phrase with the subject pronoun and a phrase with its VP. This is
shown by the sentence in (112a) and its segmentation in (113). The generated translation
is shown in (112b).
(112) a. He has spoken with his father.
      b. Egli ha  parlato con  il  suo padre.
         he   has spoken  with the his father.
(113) [He]p1 [has spoken with]p2 [his]p3 [father]p4 [.]p5
It is very likely that he will be translated into the corresponding Italian pronoun (cf.
table 19).
The phrase has spoken with generates the correct Italian VP. The result
is a sentence with an explicit subject pronoun.
When the sentence is isolated, this
is acceptable, but in a larger text, if a large number of Italian subject pronouns are
generated where null subjects could be used, the translation would sound unnatural.
he →   egli     lui      ha       phrase count   phrase pair types
M1     0.1634   0.03     0.1146   5,981          1616
Mmod   0.4719   0.0861   0.0168   2,740          505
Table 19: Translation probabilities of he into egli, lui and ha
Table 19 also shows the impact of the rule-based VP alignment on the phrase translation
table. If the Italian pronoun is available, the English pronoun is only aligned with it. If
this is not the case, it is aligned with the Italian verb form. In M1, the phrase pair (he,
egli ) was extracted 967 times, whereas in Mmod it was extracted 1293 times. This leads
to a very high probability of translating the English pronoun into the Italian pronoun,
which is correct, but it leads to too few occurrences of the null subject in Italian.
he has →             M1                 Mmod
                     0.2639  ha         0.3391  ha
                     0.0926  che ha     0.0739  che ha
                     0.0847  egli ha    0.0716  egli ha
                     0.0236  è          0.0414  è
                     0.0197  abbia      0.0235  abbia
phrase count         1514               1787
phrase pair types    456                402
Table 20: Top five translation phrases for he has
However, if the segmentation of the English sentence had included the phrase he has,
it would have been more probable that the generated Italian sentence does not have a
subject pronoun (cf. table 20).
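The effect of the phrase inventory on segmentation can be illustrated with a simplified sketch. It uses greedy left-to-right longest matching, which is an assumption made for illustration only: a phrase-based decoder such as Moses scores many competing segmentations rather than committing to one greedily. The phrase inventory below is hypothetical and not taken from the trained models.

```python
def segment(tokens, phrase_table, max_len=4):
    """Greedy longest-match segmentation of a token list.

    Single tokens are always accepted so that segmentation never fails.
    """
    phrases, i = [], 0
    while i < len(tokens):
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if n == 1 or candidate in phrase_table:
                phrases.append(candidate)
                i += n
                break
    return phrases


# Hypothetical source-side phrase inventory (illustration only).
table = {"he", "has spoken with", "his", "father", "."}
tokens = "he has spoken with his father .".split()

# Without a unit covering pronoun + verb, the pronoun becomes its own phrase,
# as in the segmentation shown in (113).
print(segment(tokens, table))

# With "he has" available, the pronoun is translated together with the verb.
print(segment(tokens, table | {"he has"}))
```

In the first call the pronoun is isolated and is therefore likely to be rendered as an overt Italian pronoun; in the second it stays attached to the auxiliary.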
Splitting the English subject pronoun from its VP could also lead to the generation
of false Italian inflection since the English verbs have poor morphology. Given, for
example, the translation phrase desired without the subject pronoun, it is very likely
that the wrong Italian verb form is generated if the language model does not penalise the
erroneous Italian word sequence. Although I expected to see such errors, I was not able
to find them in the tested sentences.
I also noticed that the 2nd person pronoun you is often translated as lei meaning you
in the polite form of address. An example of such a case is shown in the sentence in
(114a) whose translation is shown in (114b).
(114) a. I can understand that you are annoyed.
      b. Capisco    che  lei è   arrabbiato.
         understand that you are annoyed
The English phrase you are corresponds to three possible Italian phrases. You can
correspond to the pronouns for the 2nd person singular tu and plural voi, and the 3rd
person singular lei (polite form of address). The SMT systems cannot resolve this
ambiguity and choose the most probable phrase, in this case, the phrase for the polite
form of address: lei è. Without any context, a human translator, however, would also
have problems deciding which Italian VP is the correct translation of the English one.
Within the context, if it is clear that the 2nd person singular is meant, the generated
translation would be wrong. Certainly, this way of translating you are is caused by the
corpus that has been worked with. But, even if we had more evidence for translating
you are into other possible Italian constructions, the ambiguity would still be a problem.
In summary, if the English subject pronoun and its VP are not included in the translation
table as a translation unit, they are split, resulting in the generation of the Italian
subject pronoun which could (or should) be omitted. Also, at least theoretically, English
translation phrases consisting of verbs without the subject pronoun could lead to the
generation of Italian VPs with false inflection, since one English verb form often
corresponds to a number of Italian verb forms.
6.4 Adequate training data
After the discussion in the previous sections, infrequent use of pronouns seems to pose
the greatest problem for SMT in translating pronominal subjects. The question this
raises is: if we had a corpus containing a large number of sentences with pronominal
subjects occurring with many different verbs, would this solve the problems presented
previously?
We would certainly have phrase pairs $EN: prp_i + vp_j \rightarrow IT: vp_k$ with high
translation probabilities which could improve translation results when translating the English
subject pronoun $prp_i$ with VP $vp_j$ into the Italian null subject and the correct VP $vp_k$.
If the English pronominal subject is not split into a separate phrase from its VP, the
overgeneration of Italian subject pronouns could be avoided.
But, if we would like to translate the Italian null pronoun into the correct English
pronoun and VP, this would lead to another problem. Suppose that an Italian VP with
NP subject should be translated, and the Italian VP is a translation unit. If it has a
high probability of being translated as an English pronoun and a VP, we would incorrectly
have two subjects: a translation of the Italian NP subject and the English pronoun
generated out of the Italian VP. Since both translations have to be possible, it is important
that both translation alternatives have comparable probabilities:
$IT: vp_i \rightarrow EN: prp_j + vp_k$ and $IT: vp_i \rightarrow EN: vp_k$.
The Italian VP $vp_i$ must have a probable English translation phrase consisting both of
the pronoun $prp_j$ and the VP $vp_k$, and a phrase only consisting of the VP $vp_k$.
To make sure that an additional subject pronoun in
English is not generated, it would be necessary to determine the subject of the Italian
sentence. Having information about the subject, the correct translation phrase pair could
be favored over the other.
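As a minimal sketch of how subject information could favor one phrase pair over the other, consider the two translations of hanno from table 14. The flag has_overt_subject stands for the output of a syntactic analysis of the Italian input, which is assumed to exist and is not part of the systems described here; the function and the candidate format are hypothetical.

```python
def choose_translation(candidates, has_overt_subject):
    """Pick the most probable candidate compatible with the subject information.

    candidates: list of (english_phrase, probability, adds_pronoun) tuples.
    A candidate that adds a pronoun is compatible only with a null-subject
    input; a bare verb is compatible only with an overt (NP) subject.
    """
    compatible = [c for c in candidates if c[2] != has_overt_subject]
    pool = compatible or candidates  # fall back if nothing is compatible
    return max(pool, key=lambda c: c[1])[0]


# Probabilities for hanno in M1, taken from table 14.
hanno = [("have", 0.5761, False), ("they have", 0.0264, True)]

print(choose_translation(hanno, has_overt_subject=False))  # they have
print(choose_translation(hanno, has_overt_subject=True))   # have
```

The sketch filters candidates with a hard constraint; a softer variant could instead add a feature to the model score, which would fit the log-linear framework of phrase-based SMT more naturally.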
Problems regarding ambiguities of verb inflection would, however, still exist. To resolve
them, information from the context outside of the phrase pairs is needed. This
lack of a model of context is a known flaw of phrase-based statistical machine translation
which has only recently been addressed in a preliminary fashion in the literature.
7 Conclusion
In this work, a detailed analysis of the problem regarding the translation of pronominal
subjects within statistical machine translation is carried out. Italian, a null subject
language (NSL), and English, a non-null subject language (non-NSL), were used. A
rule-based method for aligning English and Italian VPs with pronominal subjects is
presented. The rule-based VP alignment was used to build phrase-based SMT systems
in order to examine if the more accurate word alignment of VPs would lead to the
improvement of the pronominal subject translation. Unfortunately, this was not the case.
The usage of subject pronouns and the corpus characteristics have a significant influence
on extracting the correct translation pairs. Phrase-based SMT is not adequate for the
pronoun translation and generation since it does not have any information about the
context outside the translation phrases.
The main findings of the work are summarized in section 7.1. Future effort in improving
translation of (null) subject pronouns is outlined in section 7.2.
7.1 Summary
In some languages like Italian, overt subject pronouns are not obligatory. The verbal
morphology is rich enough to reveal characteristics like person and number of the missing
pronoun (cf. section 2.1). Italian subject pronouns are used when they fulfill some
specific functions like emphasis, reintroducing referents, etc. (cf. section 2.3). On the
other hand, some languages like English rarely allow the omission of subject pronouns.
English syntax generally requires that the subject position is occupied, otherwise, the
sentence is not grammatically correct.
The optional use of the subject pronoun in Italian and the obligatory use of the subject
pronoun in English lead to problems in word alignment of parallel sentences with
pronominal subjects, as well as in statistical machine translation. Until now, the problem
of translating (null) subjects between a NSL and a non-NSL has been dealt with
only indirectly. An overview of previous work was given in section 3.1. The analysis of
different translation cases showed that in many cases, Italian inflected verbs can provide
information needed to generate the correct English subject pronoun. Problems arise
when the verbs are ambiguous with respect to the person, number and/or gender. For
example, an Italian finite verb which is 3rd person singular does not have information
about the gender of the missing subject. This can lead to the generation of the false
English pronoun. A further problem is the gender discrepancy between languages. For
example, whereas animals are grammatically neuter in English, in Italian
they can be both feminine and masculine. Various translation cases and problems are
discussed in section 3.2. In many cases, the examination of the context (previous
sentence(s)) is required to derive all information that would ensure the generation of the
correct English pronoun. Most (statistical) machine translation systems do not use the
context, but translate sentences as isolated translation units.
As already mentioned, the absence of Italian subject pronouns causes problems in the
word alignment task. Suppose that an Italian and an English sentence pair containing
pronominal subjects has to be word aligned automatically. It is very likely that the
English subject pronoun does not have a direct Italian counterpart since Italian allows
for subject pronoun omission (cf. table 2, section 2.4). For this reason, English subject
pronouns are often aligned with Italian object clitics, conjunctions, etc. I developed
alignment rules which define the word alignment of English subject pronouns. English
subject pronouns have to be aligned with Italian words with the same linguistic
information (person, number, gender). If the Italian subject pronoun is expressed overtly, the
English subject pronoun is aligned with it. If the subject is dropped, the English subject
is aligned with the Italian finite verb form. In addition to the rules for the alignment of
English subject pronouns, I developed rules for the alignment of VPs (verbal elements of
a VP and negation). The rules are based on the category of the VP elements (finite verb,
auxiliary, participle, etc.). I used English parse trees enriched with functional tags (cf.
section 5.2.1) and part of speech tagged Italian sentences (cf. section 5.2.2). The process
of aligning parallel phrases consists of several steps. An Italian sentence is searched in
order to find all Italian VPs (cf. section 5.3.1). In the parallel English sentence, the
clauses with pronominal subjects are detected. Baseline word alignment of the elements
of an English VP (created by GIZA++ ) is used to identify the matching Italian VP (cf.
section 5.3.2). The alignment rules compute the alignment of the phrase pair elements
by searching for specific PoS pairs in a specific PoS sequence. A detailed description of
15 alignment rules for Italian and English VPs is presented in section 5.4.
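The rule for English subject pronouns summarized above can be sketched as follows. This is a simplified illustration and not the implementation used in the thesis: the agreement check on person, number and gender is omitted, the function name is hypothetical, and the tags in the examples are assigned by hand using the tag set of appendix A.

```python
# Tag prefixes of finite verbal elements (covers the ":cli" variants as well).
FINITE_PREFIXES = ("VER:fin", "AUX:fin", "VER2:fin")


def align_subject_pronoun(italian_vp):
    """Return the index of the Italian token the English subject pronoun
    should be linked to, or None if no suitable token exists.

    italian_vp: list of (word, pos_tag) pairs.
    """
    # Case 1: an overt personal pronoun is present, so align with it.
    for i, (_, tag) in enumerate(italian_vp):
        if tag == "PRO:pers":
            return i
    # Case 2: the subject is dropped, so align with the finite verb form.
    for i, (_, tag) in enumerate(italian_vp):
        if tag.startswith(FINITE_PREFIXES):
            return i
    return None


null_subject = [("hanno", "AUX:fin"), ("cantato", "VER:ppast")]
overt_subject = [("lei", "PRO:pers"), ("non", "NEG"), ("è", "AUX:fin")]
negated = [("non", "NEG"), ("è", "AUX:fin"), ("stato", "VER:ppast")]

print(align_subject_pronoun(null_subject))   # 0 (finite auxiliary hanno)
print(align_subject_pronoun(overt_subject))  # 0 (overt pronoun lei)
print(align_subject_pronoun(negated))        # 1 (finite auxiliary è)
```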
The rules were applied to a test set containing 200 parallel sentences. The evaluation
results (precision, recall, f-score) indicate that the VP alignment computed by the rules
is better than the baseline alignment computed by GIZA++ (cf. table 6, section 5.5.1).
Expressed in f-score, the rule-based VP alignment exhibits an improvement of 17
percentage points (f-score = 81%). Precision of the baseline VP alignment is 66% whereas
the precision of the rule-based VP alignment is 80%. Recall of the baseline alignment is
61% whereas the recall of the rule-based VP alignment is 81%.
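For reference, the three measures can be computed from sets of alignment links as in the following sketch. It assumes that an alignment is represented as a set of (English index, Italian index) pairs and that the f-score is the balanced harmonic mean of precision and recall; the toy links are invented for illustration.

```python
def f_score(precision, recall):
    """Balanced f-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def evaluate(predicted, gold):
    """Precision, recall and f-score of predicted links against gold links."""
    correct = len(predicted & gold)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall, f_score(precision, recall)


# Toy example: two of three predicted links are correct.
gold = {(0, 0), (1, 1), (2, 2)}
predicted = {(0, 0), (1, 1), (1, 2)}
p, r, f = evaluate(predicted, gold)
print(round(p, 3), round(r, 3), round(f, 3))  # 0.667 0.667 0.667

# F-score implied by the reported precision and recall of the rule-based alignment.
print(round(f_score(0.80, 0.81), 3))  # 0.805
```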
False alignments are computed if false parallel VPs are identified. Not every English
VP has a parallel Italian VP. Due to free translation, English VPs can correspond to
Italian PPs, participles, or they are simply not translated. These cases cause problems
for the rule-based VP alignment. The process of the identification of the matching
Italian VP for an English VP does not always find the correct Italian VP. Since the VP
alignment rules take only the PoS of the phrase elements into account, in these cases,
they compute false word alignment. Furthermore, the implementation of the rules is
insufficient as they do not have any constraints on the number of the links that can be
computed for a VP element. In some cases, this leads to additional alignments which are
erroneous. For example, the VPs can be extended, containing participles or infinitives
that do not correspond to any element of the parallel phrase. Such phrases can lead to
an alignment between, for example, one English main verb and two Italian main verbs.
The program does not verify which alignment is more probable (i.e., lexical parallelism
of the aligned words) and should therefore exist in the resulting alignment. Instead, all
possible alignments are included in the computed VP alignment. The errors made by
the rule-based VP alignment are discussed in section 5.5.2.
I built four SMT systems to examine whether the improved VP alignment leads to
the improvement of the pronominal subject translation between English and Italian.
For each translation direction, two systems have been built: (i) a phrase-based SMT
system using the baseline word alignment (M1), and (ii) a phrase-based SMT system
using baseline word alignment combined with the rule-based VP alignment (Mmod ). In
the translation direction EN → IT, M1 has a BLEU score of 19.15 whereas the BLEU
score of Mmod is 18.18. In the opposite translation direction, the BLEU score of M1 is
22.07 whereas the BLEU score of Mmod is 21.81 (cf. table 9, section 6.2). The BLEU
scores are slightly worse for the Mmod systems. Manual examination of the generated
sentences, however, revealed that M1 and Mmod produce nearly identical output, leading to
the conclusion that the rule-based VP alignment does not have any impact on the (null)
subject translation between English and Italian.
However, the rule-based word alignment does change the translation parameters. The
number of the phrases (VPs) in which the English phrase contains the subject pronoun
whereas the Italian VP has only the inflected verb form is greater in Mmod than in M1
(cf. section 6.3). In some cases, the translation probability of the correct translation
pair is higher in Mmod than in M1 (cf. table 16, section 6.3). These observations lead
to two important conclusions: (i) When translating Italian into English, Mmod is more
likely to generate the English subject pronoun; (ii) The probability of generating the
correct Italian inflected verb is higher in Mmod than in M1. Despite the fact that there
are differences in translation probabilities for the relevant translation phrases indicating
that Mmod should generate better translations, an improvement in translation output
was not observed.
This can be explained by the fact that the translation probabilities of the phrases
consisting of a subject pronoun with a VP are relatively small. The verbs in such
phrases do not only occur with the pronominal subjects, but also with NP subjects. In
such contexts, the verb (or VP) pairs are extracted without a subject pronoun. Their
likelihood is high since they occur often and with a large number of different NP subjects.
When translating inflected Italian verbs into English, it is therefore very likely that the
verb is translated into the corresponding English verb. If the Italian verb has a NP
subject, this translation is correct. But if the Italian subject is dropped, an English
sentence is generated that does not have a subject.
I also noticed that some pronouns are more often correctly translated than others.
This is due to the relatively infrequent use of subject pronouns and the characteristics
of the corpus that I have been working with. 1st and 2nd person pronouns are used more
frequently than 3rd person pronouns. Observation of the generated sentences showed
that 1st person pronouns are correctly translated in most cases. 2nd person pronouns
are problematic because of the ambiguity of Italian verbs and the characteristics of the
corpus (cf. examples (108) and (109), section 6.3 and table 3, section 2.4). 3rd person
pronouns cause the most problems because the verbs they occur with can also have
NP subjects, as already mentioned above. When translating English into Italian, it is
very likely that the English phrase containing the subject pronoun (for example, 3rd
person pronoun) and the VP is not included in the translation table. The pronoun is
then translated separately from the VP leading to the generation of the Italian subject
pronoun. If this occurs in many subsequent sentences, one is faced with overgeneration
of pronouns in the Italian output.
The problem regarding the small translation probabilities of the phrases consisting of
a pronominal subject and a VP cannot be solved by better (or perfect) word alignment
of the VPs with pronominal subjects. In fact, a parallel corpus is needed in which the
pronouns occur much more frequently with a large number of different verbs. Within a
SMT system, this would increase their translation probabilities automatically. However,
when translating Italian into English, a syntactic analysis of the Italian input is needed to
derive whether the sentence has a pronominal or a NP subject. Given this information,
the correct translation phrase can be chosen. The linguistic characteristics (person,
number, gender) of Italian pronominal subjects can be determined if the Italian (null)
subject is resolved, which requires access to previous sentences. In the opposite
translation direction, it has to be decided whether the Italian subject pronouns have
to be dropped or expressed overtly. I noticed that some adjectives (for example, tutti
(= all )) trigger the use of the overt Italian subject pronouns (cf. examples (24) - (27),
section 2.4). However, since the use of the Italian subject pronouns has pragmatic
reasons (cf. section 2.3), it is not trivial for a (statistical) machine translation system
to decide whether the subject pronoun should be realized overtly or be dropped.
7.2 Future work
In my thesis, I showed a method for aligning English VPs with pronominal subjects
with parallel Italian VPs. Improved alignment of English subject pronouns with Italian
inflected verbs did not result in the improvement of pronominal subject translation
between English and Italian. In the following, I outline further possible methods to
improve the alignment of relevant phrases and the translation of pronominal subjects
between a null subject language Italian and a non-null subject language English.
Word alignment of English and Italian VPs
The method for the VP alignment that I presented in this thesis is based on an
assumption that every English VP with a pronominal subject has a parallel Italian VP. This
assumption does not always hold since the translations are not always literal. Some
English VPs do not have an Italian counterpart or they correspond to an Italian phrase
of an arbitrary type.
The rule-based method for the VP alignment could be extended in order to handle
these cases. The method for identification of a parallel Italian phrase should allow
Italian phrases like PPs to be identified as parallel phrases of English VPs. The
translation probabilities of English and Italian PoS sequences could be used to derive parallel
phrases.
The rules for the VP alignment handle only verbal elements of a VP, negation and
subject pronouns. In a case of a syntactic divergence in which the words of a phrase
pair do not have a matching PoS, they remain unaligned. In some cases, this leads to
removal of correct alignment links. A deletion of such links could be avoided if their
reliability were computed (for example, by using lexical translation probabilities and
the alignment of the neighbouring words).
If we assume that syntactic phrases of different types in English and Italian correspond
to each other, we need parse trees of English and Italian sentences in order to identify
correctly the parallel phrases.
In this work, the VP alignment rules have been applied only to VPs with a pronominal
subject. The rules could as well be used to align all VPs regardless of the type of a subject
(pronominal or NP subject).
Translation direction IT → EN
Italian sentences often do not have overtly expressed subjects. Their characteristics like
person, number and gender can be derived from the inflected verb and the preceding
context (sentences). Statistical machine translation systems do not have access to
the preceding context of a sentence that should be translated. The translation phrases
which contain Italian inflected verbs and the English language model should therefore
lead to the generation of correct English subject pronouns. Correct phrase pairs could
be learned from a corpus which contains many sentences with pronominal subjects (cf.
section 6.4). But, if the correct translation phrases had high translation probabilities,
we would run into problems when the Italian source sentence contains a NP subject.
If the inflected verb generates an English pronoun, the English translation could have
two subjects which would be incorrect. The information about the subject in a source
sentence could be used to choose the correct translation phrase pair.
Another approach to the problem of the generation of English pronominal subjects
is incorporation of pronoun resolution in the translation process. If the referent of an
Italian omitted subject pronoun is determined, all characteristics (number, gender, etc.)
of the missing pronoun could be derived and used to generate the corresponding English
pronoun.
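Once the characteristics of the omitted subject are known, generating the English pronoun is a simple lookup, as the following sketch shows. It assumes that a hypothetical resolution component supplies person, number and, for the 3rd person singular, the gender appropriate for English (which need not coincide with the Italian grammatical gender, cf. section 3.2). The names and the feature encoding are illustrative only.

```python
# Pronouns that are fully determined by person and number.
BY_PERSON_NUMBER = {
    (1, "sg"): "I",
    (2, "sg"): "you",
    (1, "pl"): "we",
    (2, "pl"): "you",
    (3, "pl"): "they",
}

# 3rd person singular additionally depends on gender.
THIRD_SINGULAR = {"masc": "he", "fem": "she", "neut": "it"}


def english_subject_pronoun(person, number, gender=None):
    """Return the English subject pronoun for the given features,
    or None if a 3rd person singular subject has unresolved gender."""
    if person == 3 and number == "sg":
        return THIRD_SINGULAR.get(gender)
    return BY_PERSON_NUMBER[(person, number)]


print(english_subject_pronoun(3, "pl"))         # they
print(english_subject_pronoun(3, "sg", "fem"))  # she
print(english_subject_pronoun(3, "sg"))         # None (gender must be resolved)
```

The None case corresponds to the situation described in section 3.2, in which the finite verb alone does not determine the pronoun.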
Translation direction EN → IT
When translating English subject pronouns into Italian, it is important that the correct
inflected verb is generated. Furthermore, a decision has to be made whether the subject
pronoun should be generated or omitted. The use of an appropriate corpus as training
data (cf. section 6.4) could lead to an improvement of translation of English pronouns
into Italian (null) pronouns. A corpus containing many sentences with different
pronominal subjects would lead to an extraction of many different English phrases consisting of
a subject pronoun and verbs with their Italian counterparts ((null subject) + inflected
verb). This would, however, not solve the problems which concern the gender discrepancy
between English and Italian. To ensure the generation of a correct Italian subject
pronoun, it would be necessary to resolve the co-reference of an English pronominal subject.
[Le Nagard & Koehn, 10] show a method for integration of co-reference resolution into
phrase-based statistical machine translation.
In some cases, Italian subject pronouns are expressed overtly. Statistical models could
learn such contexts (word or PoS sequences) in order to predict how the Italian subject
pronoun should be realized.
A Italian tag set
ADJ adjective
ADV adverb (excluding -mente forms)
ADV:mente adverb ending in -mente
ART article
ARTPRE preposition + article
AUX:fin finite form of auxiliary
AUX:fin:cli finite form of auxiliary with clitic
AUX:geru gerundive form of auxiliary
AUX:geru:cli gerundive form of auxiliary with clitic
AUX:infi infinitival form of auxiliary
AUX:infi:cli infinitival form of auxiliary with clitic
AUX:ppast past participle of auxiliary
AUX:ppre present participle of auxiliary
CHE che
CLI clitic
CON conjunction
DET:demo demonstrative determiner
DET:indef indefinite determiner
DET:num numeral determiner
DET:poss possessive determiner
DET:wh wh determiner
NEG negation
NOCAT non-linguistic element
NOUN noun
NPR proper noun
NUM number
PRE preposition
PRO:demo demonstrative pronoun
PRO:indef indefinite pronoun
PRO:num numeral pronoun
PRO:pers personal pronoun
PRO:poss possessive pronoun
PUN non-sentence-final punctuation mark
SENT sentence-final punctuation mark
VER2:fin finite form of modal/causal verb
VER2:fin:cli finite form of modal/causal verb with clitic
VER2:geru gerundive form of modal/causal verb
VER2:geru:cli gerundive form of modal/causal verb with clitic
VER2:infi infinitival form of modal/causal verb
VER2:infi:cli infinitival form of modal/causal verb with clitic
VER2:ppast past participle of modal/causal verb
VER2:ppre present participle of modal/causal verb
VER:fin finite form of verb
VER:fin:cli finite form of verb with clitic
VER:geru gerundive form of verb
VER:geru:cli gerundive form of verb with clitic
VER:infi infinitival form of verb
VER:infi:cli infinitival form of verb with clitic
VER:ppast past participle of verb
VER:ppast:cli past participle of verb with clitic
VER:ppre present participle of verb
WH wh word
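The Italian tags above are hierarchical: a main category, an optional subtype after the first colon, and an optional `:cli` suffix for forms with an attached clitic. A minimal sketch of how such tags could be decomposed is given below. The names `ItalianTag`, `parse_tag` and `is_finite_verb` are hypothetical helpers introduced for illustration and are not taken from the thesis implementation.

```python
from typing import NamedTuple, Optional


class ItalianTag(NamedTuple):
    category: str            # e.g. "VER", "VER2", "AUX", "PRO"
    subtype: Optional[str]   # e.g. "fin", "infi", "pers"; None for plain tags
    clitic: bool             # True only if the tag ends in ":cli"


def parse_tag(tag: str) -> ItalianTag:
    """Split a tag such as "VER:fin:cli" into its components."""
    parts = tag.split(":")
    clitic = len(parts) > 1 and parts[-1] == "cli"
    if clitic:
        parts = parts[:-1]
    subtype = parts[1] if len(parts) > 1 else None
    return ItalianTag(parts[0], subtype, clitic)


# Categories that mark verbal elements in the tag set above.
VERBAL_CATEGORIES = {"VER", "VER2", "AUX"}


def is_finite_verb(tag: str) -> bool:
    """True for finite main, modal/causal and auxiliary verb tags."""
    parsed = parse_tag(tag)
    return parsed.category in VERBAL_CATEGORIES and parsed.subtype == "fin"
```

For example, `parse_tag("VER:fin:cli")` yields the category `VER`, the subtype `fin` and the clitic flag set, while the stand-alone tag `CLI` is returned as a plain category without a subtype.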
B English tag set (Penn Treebank Tagset)
CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX Existential there
FW Foreign word
IN Preposition or subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
PRP$ Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
TO to
UH Interjection
VB Verb, base form
VBD Verb, past tense
VBG Verb, gerund or present participle
VBN Verb, past participle
VBP Verb, non-3rd person singular present
VBZ Verb, 3rd person singular present
WDT Wh-determiner
WP Wh-pronoun
WP$ Possessive wh-pronoun
WRB Wh-adverb
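To illustrate how these tags relate to the phrases studied here (pronominal subject, verbal elements and negation), the sketch below filters a PoS-tagged English clause down to those elements. It is only an approximation under stated assumptions: the function `vp_elements` is hypothetical, it works on flat (word, tag) pairs rather than on parse trees, and it treats every PRP as a candidate subject, so object pronouns would also be kept.

```python
# Penn Treebank tags for modals and verb forms (see the list above).
VERB_TAGS = {"MD", "VB", "VBD", "VBG", "VBN", "VBP", "VBZ"}
# Negation is tagged RB like other adverbs, so it is matched by word form.
NEGATIONS = {"not", "n't"}


def vp_elements(tagged_clause):
    """Keep pronouns, verbal elements and negation from (word, tag) pairs."""
    kept = []
    for word, tag in tagged_clause:
        if tag == "PRP" or tag in VERB_TAGS:
            kept.append((word, tag))
        elif tag == "RB" and word.lower() in NEGATIONS:
            kept.append((word, tag))
    return kept


clause = [("it", "PRP"), ("actually", "RB"), ("passes", "VBZ")]
negated = [("we", "PRP"), ("do", "VBP"), ("not", "RB"), ("adhere", "VB")]
```

Applied to `clause`, the adverb is dropped and the pronoun and verb remain; applied to `negated`, all four tokens are kept because the negation is part of the phrase.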
C English subject pronoun occurrences
In the process of computing VP alignment, clauses in the English part of the parallel
corpus (cf. chapter 5.2) are identified and checked for whether they contain a subject
pronoun. I counted subject pronoun occurrences and clauses in which the subject is
not pronominal. The counting results are shown in table 21. The entire corpus consists of
749,646 sentences which can be divided into 1,254,086 clauses (see footnote 34).
I     we    you   he    she   it    they  NP
14%   15%   2%    0.8%  0.2%  9%    0.2%  54%
Table 21: Pronoun occurrence in English
More than half of the corpus clauses (54%) have NP subjects. In the context of dealing with subject pronouns, these clauses (their verbs) cannot be used to extract English verbs and pronouns
with their correspondences in Italian. In fact, they contribute to the probabilities of
phrases consisting only of verbs without a subject pronoun.
34 The missing 5% are due to false recognition of subjects.
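The figures in table 21 can be checked with a few lines of arithmetic. The sketch below uses only the numbers given above; the variable names are chosen for illustration. It shows that the listed shares add up to about 95%, which matches the roughly 5% attributed to falsely recognized subjects, and that pronominal subjects account for about 41% of all clauses.

```python
# Shares of clause subjects in percent, as listed in table 21.
shares = {
    "I": 14.0, "we": 15.0, "you": 2.0, "he": 0.8,
    "she": 0.2, "it": 9.0, "they": 0.2, "NP": 54.0,
}
sentences = 749_646
clauses = 1_254_086

total = sum(shares.values())           # share covered by the table
pronominal = total - shares["NP"]      # clauses with a pronominal subject
unaccounted = 100.0 - total            # remainder (false subject recognition)
clauses_per_sentence = clauses / sentences

print(f"covered: {total:.1f}%, pronominal: {pronominal:.1f}%, "
      f"unaccounted: {unaccounted:.1f}%")
print(f"clauses per sentence: {clauses_per_sentence:.2f}")
```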
List of Tables
1 Statistics on referents of 3rd person subjects in Italian
2 Occurrence of SUBJ in Italian
3 Occurrence of null-SUBJ in 93 observed clauses
4 Evaluation of GIZA++ word alignment for English and Italian
5 Example phrase translation probabilities for io sono
6 Evaluation of the VP alignment
7 Evaluation of VP alignment for different IT-VP identification approaches
8 Evaluation of different VP alignments
9 BLEU scores of the SMT systems for EN ↔ IT
10 Translation probabilities for so into (I) know
11 Top five translation phrases for so
12 Translation probabilities of che parli into that (you/she) speak(s)
13 Translation probabilities of parli into that (you) speak/talk
14 Translation probabilities of hanno into (they) have
15 Translation probabilities of pensano into (they) think
16 Translation probabilities of i can into (io) posso
17 Translation probabilities of we know into (noi) sappiamo
18 Top five translation phrases for we know
19 Translation probabilities of he into egli, lui and ha
20 Top five translation phrases for he has
21 Pronoun occurrence in English
List of Figures
1 Main program: correct_align
2 System components
3 Alignment check and improvement
4 Alignment of I would ask you to request and la prego di chiedere
5 Search for the best Italian VP
6 Incorrect base alignment of if you wish and se lo desidera
7 Correct alignment of if you wish and se lo desidera
8 Alignment of I can tell you and posso risponderle
9 Alignment of it actually passes and esso stesso approva
10 Alignment of I would say and volendo dire
11 Alignment of I have said and aver detto
12 Incorrect base alignment of I feel and ritengo
13 Correct base alignment of I feel and ritengo
14 Alignment of you enjoyed and abbiate trascorso
15 Alignment of you have requested and avete chiesto
16 Alignment of we were elected and sono stati eletti
17 Complete alignment of they had and di avere
18 Alignment of you have requested and avete chiesto
19 Alignment of you have requested and chiedevate
20 Alignment of I would like to say and vorrei dire
21 Alignment of we do not adhere and noi non rispettiamo
22 Alignment of I suggest to present and raccomando di presentare
23 Alignment of I shall do and seguirò
24 Alignment of we have upheld and abbiamo sostenuto
25 Alignment of you have suggested and lei propone (= you proposed)
26 Alignment of he is to go and verrà messo
27 Alignment of you hear and ascoltando
28 Alignment comparison: I accept and lo accetto
29 Alignment comparison: it will (, I hope,) be examined and sarà esaminata
30 Alignment comparison: I can (,therefore,) give and pertanto può contare su
31 Alignment comparison: we (then) proceed and poi di procedere
32 Alignment comparison: I have (thus) proposed and , ho proposto
33 Alignment comparison: they do not (properly) reflect and esso non rifletterà
34 Alignment comparison: I might be allowed to give and mi permettesse di rilasciare
35 Alignment of we agree (...) support and condividiamo (...) appoggiamo
36 Alignment of it may have been and sia stato
37 Alignment of they have been answered and avessero ottenuto (risposta)
38 Alignment of you were (unable to attend) and lei non ha potuto partecipare
39 Alignment of you were unable to attend and lei non ha potuto partecipare
40 Alignment of I come and provengo
41 Alignment of you know and assume
References
[Baroni et al., 04] Baroni, M. et al. Introducing the "la Repubblica" corpus: A large, annotated,
TEI(XML)-compliant corpus of newspaper Italian in Proceedings of LREC
2004, Lisbon, Portugal, 2004
[Bennis, 06] Bennis, H. Agreement, Pro, and Imperatives in Ackema, P.; Brandt, P. et
al. (eds.) Arguments and Agreement, Oxford University Press, New York, 2006
[Brown et al., 03] Brown, P. F. et al. The Mathematics of Statistical Machine Translation:
Parameter Estimation, Computational Linguistics, 1993
[Butt, 94] Butt, M. Machine Translation and Complex Predicates, Konvens, Wien, 1994
[Charniak, 00] Charniak, E. A Maximum-Entropy-Inspired Parser in Proceedings of the
conferences and Proceedings of the ANLP-NAACL 2000 Student Research Workshop, Seattle, USA, 2000
[Duranti, 80] Duranti, A. Sull' uso dei pronomi tonici nelle conversazioni in Berrettoni,
P. (ed.) Problemi di analisi linguistica, Rome, 1980
[Duranti, 84] Duranti, A. The social meaning of subject pronouns in Italian conversation
in Van Dijk, T. (ed.) Text. An interdisciplinary journal for the study of discourse,
Mouton publishers, 1984
[Goldwater & McClosky, 05] Goldwater, S.; McClosky, D. Improving Statistical MT
through Morphological Analysis in Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Vancouver, 2005
[Haegeman, 96] Haegeman, L. Introduction to Government & Binding Theory, 2nd edition,
Blackwell Publishing, 1996
[Huang, 84] Huang, C.T.J. On the distribution and reference of empty pronouns in
Roberts, I. (ed.) Comparative grammar. Critical concepts in linguistics, Routledge,
2007
[Koehn et al., 03] Koehn, P.; Och, F. J.; Marcu, D. Statistical phrase based translation
in Proceedings of the Joint Conference on Human Language Technologies and the
Annual Meeting of the North American Chapter of the Association of Computational
Linguistics (HLT-NAACL), 2003.
[Koehn, 05] Koehn, P. Europarl: A Parallel Corpus for Statistical Machine Translation,
MT Summit, 2005
[Koehn et al., 07] Koehn, P. et al. Moses: Open Source Toolkit for Statistical Machine
Translation, Annual Meeting of the Association for Computational Linguistics
(ACL), demonstration session, Prague, Czech Republic, June 2007
[Koehn, 09] Koehn, P. Statistical machine translation, Cambridge University Press, 2009
[Le Nagard & Koehn, 10] Le Nagard, R.; Koehn, P. Aiding Pronoun Translation with
Co-Reference Resolution in Proceedings of the Joint 5th Workshop on Statistical
Machine Translation and MetricsMATR, Uppsala, Sweden, 2010
[Nakaiwa & Ikehara, 92] Nakaiwa, H.; Ikehara, S. Zero pronoun Resolution in Japanese
to English Machine Translation System using Verbal Semantic Attributes in Applied Natural Language Conferences. Proceedings of the third conference on Applied
natural language processing, Trento, Italy, 1992
[Och & Ney, 03] Och, F. J.; Ney, H. A Systematic Comparison of Various Statistical
Alignment Models in Computational Linguistics, vol. 29, num. 1, MIT Press, 2003
[Papineni et al., 02] Papineni, K.; Roukos, S.; Ward, T.; Zhu, W. BLEU: A Method for
Automatic Evaluation of Machine Translation in Proceedings of the 40th Annual
Meeting of the Association for Computational Linguistics (ACL), Philadelphia, 2002
[Peral & Ferrández, 03] Peral, J.; Ferrández, A. Translation of Pronominal Anaphora
between English and Spanish: Discrepancies and Evaluation in Journal of Artificial
Intelligence Research 18, 2003
[Pianta & Bentivogli, 04] Pianta E.; Bentivogli, L. Knowledge Intensive Word Alignment
with KNOWA, Proceedings of the 20th international conference on Computational Linguistics, Geneva, Switzerland, 2004
[Rizzi, 82] Rizzi, L. Negation, Wh-movement and the null subject parameter in Comparative
Grammar, Volume II, The Null-Subject Parameter, Roberts, I. (ed.), Routledge, 2007
[Roberts, 07] Roberts, I. Introduction. The Null-Subject Parameter in Roberts, I. (ed.)
Comparative grammar. Critical concepts in linguistics, Routledge, 2007
[Schmid, 95] Schmid, H. Probabilistic Part-of-Speech Tagging Using Decision Trees in
Proceedings of International Conference on New Methods in Language Processing,
1995
[Schmid, Baroni et al., 2007] Schmid, H. et al. The enriched TreeTagger System in
Intelligenza Artificiale IV-2, 2007
[Stolcke, 02] Stolcke, A. SRILM - An Extensible Language Modeling Toolkit in Proc.
Intl. Conf. on Spoken Language Processing, vol. 2, Denver, 2002
[Tsao, 77] Tsao, F. A Functional Study of Topic in Chinese: The First Step toward
Discourse Analysis, Dissertation, USC, Los Angeles, 1977
[Vanelli, Renzi, et al., 06] Vanelli, L.; Renzi, L.; Benincà, P. A typology of romance subject
pronouns in Roberts, I. (ed.) Comparative grammar. Critical concepts in linguistics, Routledge, 2007
[Zanchetta & Baroni, 05] Zanchetta, E.; Baroni, M. Morph-it! A free corpus-based morphological
resource for the Italian language in Corpus Linguistics 2005, University
of Birmingham, Birmingham, UK, 2005