Universität Stuttgart
Institut für maschinelle Sprachverarbeitung
Azenbergstraße 12, D-70174 Stuttgart

Diploma thesis no. 77

Null Subjects in Statistical Machine Translation: A Case Study on Aligning English and Italian Verb Phrases with Pronominal Subjects

Supervisor: Dr. Alexander Fraser
First examiner: Dr. Helmut Schmid
Second examiner: apl. Prof. Dr. Ulrich Heid
Author: Anita Gojun
Registered: 1 June 2010
Submitted: 30 August 2010

Abstract

In this thesis, I present a method for aligning English and Italian parallel verb phrases which have pronominal subjects. The phrases contain the pronominal subject, the verbal elements of a verb phrase (VP) and the negation. I use English parse trees and part of speech tagged Italian sentences. The process of aligning parallel phrases consists of several steps. First, the Italian sentence is searched in order to find all Italian VPs. In the parallel English sentence, the clauses with pronominal subjects are detected. The base word alignment (created by GIZA++) of the elements of an English VP is then used to identify the matching Italian VP. The alignment of the parallel phrases is computed by applying alignment rules which define the alignment between words with a specific part of speech tag. The rule-based VP alignment reaches an f-score of 81%, whereas the f-score of the base word alignment is 64%. The rules compute correct alignments for most parallel VPs. However, they produce erroneous alignments if false parallel phrases are identified. This is the case when the English VP is not translated, or when it corresponds to an Italian phrase of a different type (e.g. a prepositional phrase). These cases are analyzed, and a few experiments are carried out in order to solve these problems. They lead to higher recall (best recall is 84%), but lower precision.
I use the rule-based word alignment to build phrase-based SMT systems with Moses and to examine whether improved word alignment of English pronominal subjects leads to better translation of pronominal subjects between the null subject language Italian and the non-null subject language English. SMT systems built using the rule-based VP alignment receive lower BLEU scores, even though their translations are comparable with the translations generated by SMT systems built using the base alignment. In the translation direction EN → IT, the SMT system built using the base alignment has a BLEU score of 19.15, whereas the SMT system built using the rule-based VP alignment has a BLEU score of 18.18. In the opposite translation direction, the SMT system built using the base alignment has a BLEU score of 22.07, whereas the SMT system built using the rule-based VP alignment has a BLEU score of 21.81. The systems perform equally with respect to the translation of pronominal subjects, which means that the improved VP alignment does not improve subject pronoun translation between English and Italian. The analysis of translations of example sentences shows that pronoun resolution and syntactic analysis of both languages are necessary to ensure the correct generation of the corresponding subject pronoun. Furthermore, when English pronouns are translated into Italian, a decision must be made as to whether the Italian subject pronoun should be overtly expressed.

Declaration

I hereby declare that I wrote this thesis independently and used no literature other than that cited. All quotations and paraphrased passages are marked as such, with exact references to their sources.

Contents

1 Introduction
  1.1 Motivation
  1.2 Methodology
  1.3 Outline
2 Pro-drop and Null Subject Languages
  2.1 Pro-drop theories
    2.1.1 Rich inflection morphology
    2.1.2 Zero topic theory
  2.2 Null subjects and syntax
    2.2.1 Null subjects and English syntax
    2.2.2 Null subjects and Italian syntax
  2.3 Null subjects and pragmatics
  2.4 Statistics on null subjects in Italian
  2.5 Summary
3 Pro-drop in machine translation
  3.1 Previous work on zero pronouns in MT
  3.2 Translation between English and Italian
    3.2.1 Italian to English
    3.2.2 English to Italian
  3.3 Summary
4 Statistical machine translation
  4.1 Word alignment
  4.2 Phrase-based SMT
5 Word alignment of English and Italian verb phrases
  5.1 Motivation
  5.2 Data preparation
    5.2.1 English
    5.2.2 Italian
    5.2.3 Data preprocessing errors
  5.3 Applying alignment rules
    5.3.1 Identification of Italian VPs
    5.3.2 Identification of the most probable Italian VP
  5.4 Alignment rules
    5.4.1 Syntax of the English and Italian VPs
    5.4.2 Subject pronouns
    5.4.3 Finite verbs
    5.4.4 Participles, infinitives and gerundives
    5.4.5 Negation
    5.4.6 Infinitival particle
    5.4.7 Alignment examples
  5.5 Evaluation
    5.5.1 Precision, Recall, F-score
    5.5.2 Error analysis
  5.6 System extensions
    5.6.1 Lexical search for the matching Italian VP
    5.6.2 Retaining the base alignment
  5.7 Summary
6 Evaluation of SMT systems
  6.1 The BLEU score
  6.2 Evaluation of SMT systems
  6.3 Error analysis
  6.4 Adequate training data
7 Conclusion
  7.1 Summary
  7.2 Future work
A Italian tag set
B English tag set (Penn Treebank Tagset)
C English subject pronoun occurrences
List of Tables
List of Figures
References

1 Introduction

In my diploma thesis, I addressed the problem of pro-drop in statistical machine translation using the language pair English - Italian. I carried out a linguistic analysis of the phenomenon with respect to machine translation, and I developed rules based on part of speech tags which define the word alignment of the English subject pronoun and its verb phrase with elements of the corresponding Italian verb phrase. I examined the generated translations as well as the translation parameters to find an explanation for the deficient translation of pronominal subjects between English and Italian.

1.1 Motivation

English is a language in which the subject position must always be occupied. In Italian, this is not the case. When the Italian subject is expressed by a pronoun, it can be dropped. This means that the English pronominal subject does not necessarily have a pronominal counterpart in Italian. In the context of (statistical) machine translation, this leads to problems in both translation directions, as well as within the automatic word alignment task. The questions that arise concerning pronominal subjects are:

(Q1) When word alignment of parallel sentences is carried out, with which Italian word should an English subject pronoun be aligned when the subject pronoun in Italian is omitted?
(Q2) How can we make sure that the correct pronoun is generated when translating an Italian null subject into English?
(Q3) How can we decide when to generate a null pronoun when translating an English subject pronoun into Italian?

The theoretical discussion of the problem of pro-drop within (statistical) machine translation will show which information can be used to solve the problems formulated in (Q2) and (Q3).
In the practical part of the work, I will present the method which handles question (Q1). Improved word alignment of English pronominal subjects does not solve the problems in (Q2) and (Q3). Translations of example sentences will be analyzed thoroughly in order to explain why the improved word alignment does not contribute to the translation of pronominal subjects between English and Italian.

1.2 Methodology

This work concentrates on improving the word alignment of English and Italian verb phrases which have a pronominal subject (cf. question (Q1) in the preceding section). Since English subject pronouns do not always have Italian counterparts, they are often aligned incorrectly. I therefore develop a set of rules which define the alignment of English subject pronouns (cf. section 5.4). Since Italian verbs correspond to English phrases containing the subject pronoun and verbs, the alignment rules compute the alignment of entire English and Italian parallel verb phrases (VPs). The alignment rules define only the alignment of the verbal elements of the VPs and of negation. Therefore, I use the term VP to denote the part of a verb phrase which contains only verbal elements and negation. The other elements of verb phrases are not handled within this work.

I make three important assumptions for the computation of the VP alignment:

(A1) Each English VP which has a pronominal subject has a parallel Italian VP.
(A2) The base alignment¹ is correct enough to allow the identification of English and Italian parallel VPs.
(A3) English and Italian parallel phrases have parallel part of speech sequences.

I use English parse trees and Italian part of speech tagged sentences (an Italian parser was not available). The program for the computation of the VP alignment is applied to English and Italian phrase pairs in which the English phrase has a pronominal subject (cf. section 5.3). To ensure that the Italian VPs are correct, i.e. that they contain only verbal elements, I first identify all VPs in an Italian sentence by searching for PoS sequences which build a VP (cf. section 5.3.1). The parallel Italian VP is then identified on the basis of the base alignment (cf. section 5.3.2). The alignment rules compute alignments for the matching part of speech tags of the phrase pair elements (cf. section 5.4). All links in the base alignment for the aligned phrase pairs are removed. Then, the alignment computed for the VP pairs is integrated into the base word alignment. The resulting word alignment of a sentence pair therefore contains no base links for the phrase pairs which are handled by the alignment rules.

I evaluate the VP alignment by computing precision, recall and f-score (cf. section 5.5.1). I created the gold alignment manually by defining the alignment only of the relevant English phrases. The alignments of other tokens in a sentence were ignored in the evaluation. The rule-based VP alignment outperforms the base alignment. Expressed in f-score, the rule-based VP alignment achieves an improvement of 17 percentage points (f-score = 81%).

The assumptions (A1) and (A2) do not always hold, which leads to false alignments. Not every English VP has a parallel Italian VP, which contradicts (A1). Sometimes, the phrases are not translated, or they correspond to other Italian phrases (prepositional phrases, participles, etc.). Since the alignment rules are defined only for PoS sequences of English and Italian VPs (cf. assumption (A3)), they compute false alignments in such cases. The assumption (A2) can lead to the identification of false phrase pairs since the base alignment is not error-free (cf. section 5.5.2). I show some experiments which were carried out in order to solve these problems (cf. section 5.6). For example, to deal with problems with respect to (A1), the base alignment can be retained if the parallel Italian VP could not be identified.
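The integration and evaluation steps described in this section can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the program developed in the thesis: the representation of links as (English index, Italian index) pairs, the function names, the reading of "links for aligned phrase pairs" as every base link touching a token of either VP, the balanced f-score formula and the toy sentence pair are all hypothetical.

```python
from typing import Optional, Set, Tuple

Link = Tuple[int, int]  # (English token index, Italian token index); assumed format


def integrate_vp_alignment(base: Set[Link], en_vp: Set[int],
                           it_vp: Optional[Set[int]],
                           rule_links: Set[Link]) -> Set[Link]:
    """Replace the base links of one VP pair with the rule-based links."""
    if not it_vp:
        # Extension mentioned above: no parallel Italian VP found, keep base links.
        return set(base)
    # Assumption: drop every base link that touches a token of either VP.
    kept = {(e, i) for (e, i) in base if e not in en_vp and i not in it_vp}
    return kept | rule_links


def precision_recall_f(predicted: Set[Link], gold: Set[Link],
                       relevant_en: Set[int]) -> Tuple[float, float, float]:
    """Score only links whose English token belongs to a relevant phrase."""
    pred = {(e, i) for (e, i) in predicted if e in relevant_en}
    ref = {(e, i) for (e, i) in gold if e in relevant_en}
    correct = len(pred & ref)
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(ref) if ref else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Hypothetical toy pair: "I do not know ." / "non so ."
base = {(0, 0), (2, 0), (3, 1), (4, 2)}        # invented base links, "I" misaligned
en_vp, it_vp = {0, 1, 2, 3}, {0, 1}            # token indices of the two VPs
rule_links = {(0, 1), (1, 1), (2, 0), (3, 1)}  # invented rule output
gold = set(rule_links)                         # invented gold links for the VP

merged = integrate_vp_alignment(base, en_vp, it_vp, rule_links)
print(precision_recall_f(base, gold, en_vp))
print(precision_recall_f(merged, gold, en_vp))
```

Restricting both link sets to the relevant English tokens mirrors the statement that alignments of other tokens were ignored in the evaluation; whether the thesis filters in exactly this way is not stated here.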
In general, the experiments that I carried out lead to higher recall but lower precision.

The base alignment and the rule-based VP alignment are used to build statistical machine translation (SMT) systems for both translation directions. The quality of the generated translations is given in BLEU scores. The rule-based VP alignment leads to lower BLEU scores, but the manual analysis of the generated translations revealed that the translations are nearly the same (cf. section 6.2). This leads to the conclusion that the improved VP alignment does not contribute to the translation of pronominal subjects between English and Italian. The discussion of the translation probabilities of the relevant phrases will show that phrase-based SMT is not an appropriate machine translation approach for subject pronoun translation, since it does not have access to the context (the preceding sentences) of the input sentence.

¹ Base word alignment is created by GIZA++ (cf. section 4.1).

When translating null subjects into English subject pronouns, the characteristics of the omitted pronoun (number, gender, person) can in many cases be derived from the inflected verbs (cf. section 3.2.1). In general, however, pronoun resolution (using the preceding sentences) is needed to ensure the generation of the correct English pronoun (cf. question (Q2) in the previous section). Furthermore, only a syntactic analysis of the Italian input can provide clear information on whether the Italian sentence has an (omitted) pronominal subject or an NP subject.

When translating English pronominal subjects into Italian, a decision must be made as to whether the Italian subject pronoun has to be expressed overtly (cf. question (Q3) in the previous section). Observation of the data revealed that some words (adjectives) often occur with overtly expressed Italian subject pronouns (cf. section 3.2.2).

1.3 Outline

This work is organized as follows: in Chapter 2, I introduce the phenomenon of pro-drop.
Two theories are briefly presented which name a number of linguistic characteristics that allow or prohibit pro-drop. In Chapter 3, pro-drop is discussed with respect to machine translation. The features of English and Italian which could simplify the generation of correct subject pronouns are identified. In Chapter 4, the characteristics of phrase-based statistical machine translation are described. Chapter 5 contains a detailed description of the rules and of the program for computing the word alignment between English and Italian verb phrases. The evaluation results of the VP alignment rules are presented and the most common errors are discussed. In Chapter 6, the evaluation of the SMT systems is carried out. I report BLEU/NIST scores and take a closer look at the generated translations and translation parameters in order to find an explanation for false pronoun translations. Finally, in Chapter 7, the findings of the work are summarized and future work is outlined.

2 Pro-drop and Null Subject Languages

In this chapter, I introduce the terms pro-drop and null subject language and present two theories which explain why some languages are able to omit subject (and object) pronouns (cf. section 2.1). In section 2.2, I present syntactic constructions in English and Italian in which null subjects can occur, whereas section 2.3 discusses the functions which overtly expressed Italian subject pronouns fulfill. Statistics on null subjects in Italian are presented in section 2.4.

Consider a simple sentence in English with a subject (SUBJ) and a verbal predicate (VPRED) as shown in (1).

(1) He_SUBJ sleeps_VPRED.

The German translation of the sentence in (1) is shown in (2). If we compare the syntax of these two sentences, we see that both of them have the same sentence elements: a subject and a verbal predicate.

(2) Er_SUBJ schläft_VPRED.
    he sleeps
    'He sleeps.'
Let us now take a look at the Italian and Croatian sentences which are equivalent to the German and English sentences in the previous examples.

(3) a. Egli_SUBJ dorme_VPRED.
       he sleeps
       'He sleeps.'
    b. Dorme_VPRED.
       sleeps
       'He/she/it sleeps.'

(4) a. On_SUBJ spava_VPRED.
       he sleeps
       'He sleeps.'
    b. Spava_VPRED.
       sleeps
       'He/she sleeps.'

The Italian sentences in (3) are both correct translations of the English and German sentences above. But there is one important difference between them: the sentence in (3a) contains a subject and a predicate, whereas the sentence in (3b) has only a verbal predicate. While Italian and, for example, Croatian (cf. the examples in (4)) grammars allow for omission of the subject pronoun, English and German grammars require subjects to be overtly expressed. Languages such as Italian, Croatian and Spanish are only able to omit subject pronouns. Thus, they are called null subject languages (NSLs). Many Romance (like Italian, Spanish, Portuguese etc.) and Slavic (like Croatian, Czech, Polish etc.) languages belong to this group of languages. There are also languages, such as Chinese, which allow for omission of both subject and object pronouns. These are called pro-drop languages. The set of NSLs is a subset of the pro-drop languages.

Examples (1) and (2) show grammatically correct sentences of English and German. However, they would become ungrammatical if the subject pronouns were omitted. In these languages, pronoun dropping (pro-drop) is not allowed. English and German are neither pro-drop languages nor NSLs. Let us now take a look at the following German sentences.

(5) Er sagte, dass ∅ gefeiert wurde.
    He said, that ∅ celebrated has been.
    'He said that there was a celebration.'

(6) Heute ∅ wird gefeiert.
    Today ∅ will be celebrated.
    'Today, there will be a celebration.'

The dass-sentence (corresponding to the English that-sentence) in (5) does not contain a subject. However, the sentence is grammatically correct.
In German, there are a few constructions which allow the expletive to be dropped, so German can be called a semi-NSL. The German examples show that in some cases it is not simple to say whether a language is an NSL or not. Some languages, like modern Hebrew and the Scandinavian languages, do not generally allow zero subject pronouns; in a number of constructions, however, subject pronouns can be omitted [Haegeman, 96].

2.1 Pro-drop theories

In the following, I briefly introduce two theories that try to explain why some languages are able to omit subject and/or object pronouns while others do not exhibit this property. The theories account for the omission of both subject and object pronouns. In the further discussion, though, only the omission of subjects will be considered, since this work concentrates on the problem of translating subject pronouns between an NSL and a non-NSL.

2.1.1 Rich inflection morphology

It is widely accepted that the possibility of pro-drop often correlates with the existence of a rich inflectional morphology (verb-subject, verb-object agreement). The agreement marking on a verb has to be rich enough to determine, or to allow the recovery of, the content (reference) of a missing pronoun [Huang, 84]. The Italian example sentence in (7) should clarify this thesis.

(7) Leggo un libro.
    read a book.
    'I read a book.'

Although the subject pronoun in (7) is not phonetically realised, its content has to be determined. To achieve this, [Huang, 84] proposes the co-indexing of the missing pronoun with the closest nominal element. In our example sentence, this is the Agr (Agreement) of the verb leggo. The verb in (7) clearly defines the person and number of the missing subject: 1st person singular. Let us take a look at a literal translation of (7) into English.

(8) * Read a book.

The English verb read in (8) cannot unambiguously define the content of the missing subject pronoun. It is ambiguous and could be combined with the 1st and 2nd person singular and plural and with the 3rd person plural.
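The contrast between (7) and (8) can be made concrete with a small lookup over verb paradigms. The sketch below is only an illustration of the argument, not part of the thesis: the dictionary layout, the names and the function are hypothetical, and only the present tense forms of Italian leggere and English read are covered.

```python
from typing import Dict, List, Tuple

Cell = Tuple[int, str]  # (person, number), e.g. (1, "sg")

# Present indicative of Italian "leggere": one distinct form per cell.
ITALIAN_LEGGERE: Dict[Cell, str] = {
    (1, "sg"): "leggo", (2, "sg"): "leggi", (3, "sg"): "legge",
    (1, "pl"): "leggiamo", (2, "pl"): "leggete", (3, "pl"): "leggono",
}

# Present tense of English "read": only the 3rd person singular is marked.
ENGLISH_READ: Dict[Cell, str] = {
    (1, "sg"): "read", (2, "sg"): "read", (3, "sg"): "reads",
    (1, "pl"): "read", (2, "pl"): "read", (3, "pl"): "read",
}


def recoverable_subjects(form: str, paradigm: Dict[Cell, str]) -> List[Cell]:
    """Return every person/number cell that is compatible with a verb form."""
    return sorted(cell for cell, f in paradigm.items() if f == form)


print(recoverable_subjects("leggo", ITALIAN_LEGGERE))
print(recoverable_subjects("read", ENGLISH_READ))
```

For leggo the lookup yields a single cell (1st person singular), whereas read is compatible with five of the six cells, which is why the English form alone cannot identify the missing subject.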
So we need a lexical element (pronoun) to identify the number and person of the subject.

According to this theory, pro-drop languages are also able to omit objects if they have verb-object agreement. Since Italian and English do not exhibit any verb-object agreement, object pronouns cannot be dropped.

In languages like German (cf. examples (5) and (6)), which have some constructions that allow the omission of subject pronouns, there is one restriction regarding the subject pronouns: they can be realized as null subjects only if they are non-referential. [Haegeman, 96] explains this by the fact that German inflection is richer than English inflection but poorer than Italian inflection. The inflection may license null subjects in German, but the verb agreement does not enable us to identify a referent for a null subject pronoun.

The theory about morphological richness and pro-drop holds for many languages, but there is a group of languages like Chinese or Japanese which have no morphology at all, but still allow for pro-drop. In the next section, I discuss one theory that tries to explain the ability of these languages to drop pronouns.

2.1.2 Zero topic theory

The zero topic theory proposed by [Huang, 84] is based on the language classification of [Tsao, 77]. [Tsao, 77] proposed that languages like Chinese may be distinguished from languages like English by a parameter called discourse-oriented vs. sentence-oriented. He observed many properties by which languages can be grouped into discourse-oriented and sentence-oriented ones. Among these is the property of topic NP deletion, which is only observed in languages characterized as discourse-oriented. They allow for deletion of the topic of a sentence under identity with the topic of the preceding sentence. The ability of a language to map an empty topic to an appropriate preceding topic is called the topic chain interpretation rule. The grammars of sentence-oriented languages lack this topic interpretation rule.
Their sentences must have a subject. This also accounts for the presence of the expletive in such languages.

[Huang, 84] assumes that languages like Chinese allow binding of empty categories (which arise when syntactic elements such as subject and object are omitted) with a zero topic. Assuming that a topic can be deleted only if it refers to a preceding topic, we can now recover the content of the missing element. Languages like Italian or Spanish do not have zero topics, which could explain why they are not able to omit the object pronoun. To recover the content of an omitted element, we refer here again to the morphology of the language. An empty subject pronoun can be recovered by examining verb inflection, but this is not possible for object pronouns.

The theory of [Huang, 84] is thus based on several properties of a language (zero topics, morphological richness) and on some principles and conditions formulated in Chomsky's government and binding theory (for more details, see [Huang, 84]).

2.2 Null subjects and syntax

In the following sections, various syntactic constructions in English and Italian are shown in which subject pronouns can be omitted.

2.2.1 Null subjects and English syntax

Although English does not belong to the group of NSLs, there are some constructions, such as infinitival subclauses and imperatives, in which the subject is absent.

(9) a. Speak! (Imperative)
    b. I would like [to come]_XCOMP.
    c. I must [read this]_XCOMP.
    d. John preferred [seeing Mary]_GER.

In English, an empty pronoun may occur only as the subject of an imperative, of an infinitival clause or of a gerund, but nowhere else. It cannot occur at all as the subject of a tensed clause or as an object [Huang, 84]. However, the subjects in (9b-d) have somewhat different properties from null subjects as in (10), in so far as the subject of an infinitive must be coreferential with the given subject of the main clause (subject control).

(10) a. Joe_i eats a banana and ∅_i watches TV.
     b. You_i should wash the dishes or ∅_i vacuum the apartment.
     c. * Joe_i eats a banana while ∅_i watches TV.
     d. * You_i should wash the dishes although ∅_i vacuumed the apartment.

The example sentences in (10) show, though, that some finite clauses, namely coordinated clauses, do not need a subject. In (10a), the subject of the clause watches TV does not exist locally, but this kind of construction allows the subject of a coordinated clause to be identified with the subject of the main clause, namely Joe. In contrast, subordinating conjunctions do not provide this kind of subject sharing. Clauses introduced by a subordinating conjunction require the subject to be overtly expressed (cf. (10c) and (10d)).

Yet, examples of subject omission in English finite clauses can be found in some non-standard language constructions.

(11) a. ∅_SUBJ cried yesterday morning.
     b. She_i is Alsatian. ∅_i,SUBJ Seems intelligent.

[Haegeman, 00] found that English allows null subjects in some special discourse environments like short diary entries or notes (cf. the sentences in (11)). In this work, I will not deal with this kind of null subject in English. Nevertheless, it is important to discuss these constructions to show that there is a gradation rather than a hard boundary between NSLs and non-NSLs.

2.2.2 Null subjects and Italian syntax

The Italian counterparts to the English sentences in (9) are shown in (12).

(12) a. Parla/Parlate! (Imperative)
        speak!
        'Speak!'
     b. Vorrei [venire]_XCOMP.
        I would come.
        'I would like to come.'
     c. Devo [leggere questo]_XCOMP.
        I must read this.
        'I must read this.'
     d. John preferisce [di veder Mary]_GER.
        John prefers to seeing Mary.
        'John prefers to see Mary.'

The examples in (9) and (12) show that there are some syntactically isomorphic constructions in English and Italian which exhibit the same characteristics regarding the occurrence of the subject pronoun.
But Italian has more constructions in which the subject pronoun can be omitted.

Finite clauses

(13) È stanca.
     is tired
     'She is tired.'

(14) Ti hanno imbrogliato.
     you have cheated
     'They cheated you.'

Example (13) shows a typical use of the null subject pronoun. The verb è gives information about the missing subject: it can only be the 3rd person singular. The predicative adjective stanca reveals another important characteristic of the null subject: its ending can only match a feminine subject. Now we can derive the correct form of the subject pronoun although it is not overtly expressed: ella (= she). It is important to notice that the information about the gender of the missing subject is not always available in the sentence (cf. example (3b)). Thus, in some cases, the gender can only be derived if more context of the sentence is available.

One interesting fact about the use of subject pronouns in finite subclauses is shown in an example sentence of modern Italian in [Vanelli, Renzi, et al., 06], given here as example (15).

(15) Il professore_i ha parlato dopo lui_*i è arrivato.
     the professor has spoken after he arrived
     'The professor spoke after he arrived.'

[Vanelli, Renzi, et al., 06] claim that it is not possible to identify the subject pronoun in the subclause with the subject of the main clause. [Roberts, 07] notes, though, that this interpretation is unusual rather than impossible (footnote number 2, page 40). If the pronoun is stressed (cf. example (16a)), modified (cf. example (16b)) or coordinated (cf. example (16c)), coreference with the main clause subject is possible in the subordinate clause [Cardinaletti & Repetti, 03]:

(16) a. Mario_i ha detto che LUI_i verrà domani.
        Mario has said that HE will-come tomorrow
        'Mario has said that HE will come tomorrow.'
     b. Mario ha detto che solo lui verrà domani.
        Mario has said that only he will-come tomorrow
        'Mario has said that only he will come tomorrow.'
     c. Mario ha detto che lui e sua madre verranno domani.
        Mario has said that he and his mother will-come tomorrow
        'Mario has said that he and his mother will come tomorrow.'

Constructions like the one in (14) can also be used as impersonal constructions. The agreement of the auxiliary hanno uniquely identifies the subject as the 3rd person plural, but this is not necessarily some specific group of referents. Such sentences emphasize the described fact, whereas the subject is irrelevant (or simply not known).

Impersonal expressions

a. With impersonal verbs

(17) Piove.
     rains
     'It rains.'

b. Impersonal passive

(18) È stato detto che viene.
     is said that comes
     'It was said that he/she comes.'

c. Si impersonale

(19) In Italia si parla italiano.
     In Italy speaks Italian
     'In Italy one/people speak(s) Italian.'

Impersonal verbs (sometimes also called weather verbs) do not take any subject at all. The subject in the English translation of (17) is not a true subject. It occurs because subjects are obligatory, but it does not have a thematic role. Such impersonal subject pronouns are also called expletive it. The example in (18) is an Italian construction in which a subject pronoun does not occur. In the English translation of the sentence, we again have an expletive as subject, as in the previous example. Another way to express something impersonal in Italian is to use the si impersonale. The reflexive pronoun si in (19), which could be seen as the subject of the given sentence, allows a fact to be expressed without specifying the subject.

2.3 Null subjects and pragmatics

The optionality of subject (and object) pronouns raises the question of why one should use them at all. When they occur as subjects, do they fulfill some specific function? If this were not the case, it could be assumed that subject pronouns in Italian can generally be dropped and are simply never used. In the literature, it is often said that optional pronouns are used when they are stressed.
This explains why expletives, subjects of so called weather verbs, are not possible in Italian: Since they do not contribute to the interpretation of the sentence, they would never be stressed, and they will therefore never be overt [Haegeman, 96]. Beyond this explanation for overt subject pronouns, there are some other functions that overt subject pronouns fulll. [Duranti, 84] observed the use of subject pronouns in spoken Italian and specied these functions. Pronouns, nouns and, generally, all dening phrases are used to draw attention to some specic referent. [Duranti, 84] suggests that Italian subject pronouns are devices through which speakers dene main characters in a narrative and/or convey empathy or positive aect toward certain referents. We start with an example of the common use of zero pronouns. (20) Mio padre è andato a casa. my father went Vuole cucinare. home. wants cook. 'My father went home. He wants to cook.' 15 A null subject (or zero anaphora) is typically used for talking about some referent that has been mentioned in the immediate prior context (usually one or two clauses back). After introducing the referent (in example (20), mio padre ) the omitted subject personal pronoun is used to make additional statements about the introduced referent. [Duranti, 84] determined that in some situations the subject pronoun should be used. In these cases, it has to have some special function. He identied these functions by observing and analysing sketches of conversations of Italian native speakers. 1. Introducing and keeping track of referents in discourse If one referent is not a part of the recent context, it can be brought back to the context by using the pronoun that refers to it. In this case, the pronoun can be seen as an attention-getting device : It draws the addressee's attention to a particular referent. Sometimes, subject pronouns are used although their referents have been mentioned in the immediate context. 
In these cases, there is some discontinuity in the temporal or spatial dimension of a discourse. For example, the pronoun is used for the reintroduction of some already mentioned referent, but in the context of some new specific event.

2. 'Main' characters and 'minor' characters

There is some difference in using pronouns for referents who are important in a story (main characters) and for those who are not (minor characters). The more important the character, the more often he/she is referred to by means of a personal pronoun. On the other hand, for referring to minor characters, NPs or demonstratives are used.

3. Expressing empathy toward referent

Besides the personal pronouns, in Italian one can refer to someone by using demonstrative pronouns. Closer observation of the use of personal and demonstrative pronouns showed that demonstrative pronouns are used to express a certain emotional distance or negative affect toward the referent, whereas personal pronouns are used to express empathy with the referent.

[Duranti, 84] also points out that the prior mention of some referent is not a necessary condition for using a subject pronoun that should refer to someone or something. For example, in some cases, the 3rd person subject pronoun is used without prior identification of any referent. It can be used for referents that can be implied by a previous identification set.

Table 1 from [Duranti, 80] shows how often the referents are introduced before referring to them by a null subject pronoun, by a pronoun, and by a noun. The length of context for introduction of the referent has been set to 2 preceding sentences. In 72,5% cases, the null subjects referents can be found in one of the two preceding clauses. In other cases, the referent is either not mentioned at all, or the distance between the referent and subject pronoun is greater than two clauses. Overt pronouns behave similarly to nouns. Their referents are rarely mentioned in the immediate context.
Referent of           introduced   not introduced
null subject (111)    72,1%        27,9%
pronoun (29)          34,5%        65,5%
noun (62)             27,4%        72,6%

Table 1: Statistics on referents of 3rd person subjects in Italian

2.4 Statistics on null subjects in Italian

In the previous chapter we have seen that the subject pronoun in Italian is rarely used. To get an idea of how often the subject pronoun is omitted, I examined 45 randomly selected sentences (93 main and subordinate clauses) from Europarl (cf. chapter 5.2). I identified sentence subjects and counted how often they are realised as zero pronouns, overt pronouns and nominal phrases (NPs). The results are presented in table 2.

SUBJ-NP     SUBJ-PRON   null-SUBJ
42 (45%)    7 (7%)      45 (48%)

Table 2: Occurrence of SUBJ in Italian

Nearly half of all clauses have zero subjects. The subject pronoun is used in only 7% of cases. I also examined which zero pronouns are omitted (cf. table 3).

Num/Pers   1     2     3     3P
Sg         24    ∅     8     4
Pl         4     3     2

Table 3: Occurrence of null-SUBJ in 93 observed clauses

The majority of the omitted subject pronouns are for the 1st person singular. This is not really surprising: The corpus that I worked with (cf. chapter 5.2) consists of parliament discussions in which a certain person exposes his or her opinion about something. The speakers speak for themselves, so most pronouns are 1st person singular. Sometimes, they also speak for some group of people to which they belong, e.g. a party. In these cases, the omitted subject refers to 1st person plural referents. We see that there are no pronouns for 2nd person singular. This is also not surprising because in such meetings, people do not address each other informally.

Regarding the 3rd person singular, we have to distinguish between the polite form in Italian (column 3P in table 3) which is expressed by 3rd person singular when only one person is the addressee.
The other cases of 3rd person singular pronouns refer either to someone or something already mentioned, or they correspond to English expletives.

Now, let us take a look at the clauses in which the subject pronoun has not been omitted. In a set of 95 examined clauses, I found 7 occurrences of overt subject pronouns; three of these are the polite form. Let us take a closer look at these sentences.

(21) Sì, onorevole Evans_i, ..., che lei_i propone ...
     yes, honourable Evans, ..., that you suggest ...
     'Yes, honourable Evans, ... , you are suggesting ...'

(22) Onorevole Lynne_i, lei_i ha perfettamente ragione ...
     honourable Lynne, you have perfectly right
     'Honourable Lynne, you are perfectly right ...'

(23) Onorevole collega Barón Crespo_i, lei_i non ha potuto partecipare ...
     honourable colleague Barón Crespo, you not could participate ...
     'Honourable colleague Barón Crespo, you couldn't participate ...'

Examples (21) - (23) show that the referents of the 3rd person singular pronoun in subclauses are situated in the same sentence. This is rather unusual if we refer to the observations of [Duranti, 80]. I assume that the subject pronoun is used here to disambiguate the referent which can serve as a subject of the 3rd person singular verbs: the NP introduced in the main clause or a referent from the preceding context (sentences).

Another three occurrences of pronouns are in the 1st person singular or plural.

(24) Noi tutti siamo lieti ...
     we all are pleased ...
     'We all are pleased ...'

(25) ... che proprio noi non rispettiamo ...
     ... that ourselves we not adhere to
     '... that ourselves not adhere to ...'

(26) ... l' onorevole Díez González e io avevamo presentato ...
     ... the honourable Díez González and I have presented ...
     'Honourable colleague Díez González and I have presented ...'

Examples (24) and (25) show that the pronouns are used to stress something, e.g. the subject of the sentence.
It is notable that the pronouns occur with adverbs like tutti and proprio that in some way emphasize the subject. The subject of the sentence in (26) differs from the subjects we have observed until now. The Italian subject pronoun io is used as a part of the coordinated subject NP which also consists of the NP l' onorevole Díez González. As a part of a coordinated subject NP, the subject pronoun cannot be omitted.

Finally, there is one occurrence of the subject pronoun of the 3rd person singular:

(27) ... che esso stesso approva.
     ... which it itself adheres to.
     '... which itself adheres to.'

The last example shows that the pronoun is also emphasized, in this case by the adjective stesso. Similar cases of emphasis have already been shown in examples (16b) and (16c).

2.5 Summary

Pro-drop is a linguistic phenomenon which can be found in many languages. Some languages allow for omitting both subject and object pronouns (pro-drop languages), whereas some languages like Italian permit only the subject pronoun to be omitted. Italian is therefore called a null subject language (NSL). On the other hand, we have observed that some languages like English must have overtly expressed (pronominal) subjects. English belongs to the group of non-null subject languages (non-NSL).

Whereas English morphology is not rich enough to allow the recovery of the characteristics of the missing subjects, the Italian verb inflection enables the derivation of the linguistic characteristics (for example, number and person) of the omitted pronominal subject.

The analysis of subject pronouns in the given language pair showed that English as a non-NSL also has constructions in which the subject can be omitted (cf. examples (9) - (11)). However, these constructions are not relevant for this work, in which I deal solely with finite English sentences which do not allow for omitted subjects.
The analysis of Italian sentences revealed that the pronouns in Italian (according to the observed corpus) are omitted in most cases (cf. table 2). If they are overtly expressed, they are often emphasized by accompanying adjectives or adverbs (cf. examples in (24) and (25)). In specific contexts, the 3rd person pronoun lei is used to enable unambiguous identification of the NP that it refers to (cf. examples (22) and (23)).

The difference in the usage of subject pronouns in Italian and English (cf. example (7)) leads to problems in machine translation (MT). In the following chapter, the problem of pro-drop within MT is discussed. After previous work on pro-drop in MT is presented, different cases of problems regarding the translation of pronominal subjects in both translation directions IT → EN and EN → IT are shown.

3 Pro-drop in machine translation

In this chapter, subject pronoun omission within machine translation (MT) is discussed. Although this work concentrates on statistical machine translation, I discuss previous work regarding pro-drop in different MT systems. In section 3.2, a detailed analysis of pronominal subject translation between English and Italian is carried out. Example sentences containing pronominal subjects have been translated by the rule-based system Systran [2] and the statistical MT systems Google Translate [3] and Moses [4].

When Italian null subjects are translated into English, their properties like number, person and gender have to be derived in order to generate the correct English subject pronoun. For human translators, it is relatively easy to do this, since they are able to identify the person, animal or thing to which the omitted subject pronoun refers. These referents are not necessarily in the same sentence: They can occur in one of the preceding sentences. Problems occur when single Italian sentences containing a null pronoun should be translated.
Even without context and access to world knowledge, it is possible to derive the right person and number of the omitted pronoun. But, for example, if it is known that the missing pronoun is 3rd person singular, but we do not know which gender the pronoun has, how can we decide if we should translate the missing pronoun as the feminine pronoun she or as the masculine subject pronoun he?

When the translation task is in the other direction, the decision must be made if the Italian pronominal subject should be expressed overtly or be dropped. Furthermore, the gender discrepancy between English and Italian can lead to the generation of incorrect Italian pronouns (for example, 3rd person pronouns).

Machine translation is confronted with the same problems when translating between the non-NSL English and the NSL Italian. Most MT systems operate on single sentence input and do not use previous sentence context. When translating into English, the correct pronoun for a null subject in Italian has to be found. But often, the context (previous sentences) of an observed sentence should be taken into account to resolve the missing pronoun. When translating into Italian, it has to be determined if the subject pronoun should be generated or omitted. We summarize the questions that have to be answered:

(Q1) Automatic word alignment
How to align the existing subject pronoun in a non-NSL (English) with an omitted subject in an NSL (Italian)?

(Q2) Translation: NSL → non-NSL
How can we automatically generate the right subject pronoun in the target language for the missing subject pronoun in the source language?

(Q3) Translation: non-NSL → NSL
When should the non-NSL subject pronoun be omitted in the NSL target language? The answer to this question is important if we want the automatically generated translations to sound natural.

[2] http://www.systranet.com/
[3] http://www.google.com/language_tools
[4] I built a baseline SMT system with Moses (cf. chapter 6.2).
3.1 Previous work on zero pronouns in MT

The problems regarding automatic translation of null subjects from an NSL to some non-NSL, and vice versa, have been dealt with only indirectly.

[Goldwater & McClosky, 05] dealt with the statistical machine translation of the language pair Czech (NSL) and English. The aim of their work was to find out if the translation from Czech, a morphologically rich language, to English, which is a language with weak morphological inflection, can be improved if the morphological information is available. Their idea was to use morphological analysis on Czech. The Czech input has been lemmatized and pseudowords have been inserted in order to eliminate some morphological differences between the two languages and to deal with the sparse data problem. These pseudowords are morphological tags that express some specific properties.

[Goldwater & McClosky, 05] inserted the pseudowords with information about the verb person (among others) into the Czech input. The pseudowords should simulate the existence of pronouns for the English pronouns to align with. [Goldwater & McClosky, 05] reported that person pseudowords indeed have been aligned to English pronouns with high probability. However, it has not been reported if these pseudowords solve all problems regarding null subjects. The question is how often the null subjects are correctly translated. Erroneous translations are possible when ambiguous verbs should be translated, or when the referents of the omitted subject pronouns have a different grammatical gender.

For the opposite translation direction this approach could be somewhat problematic: If English pronouns are in most cases aligned to Czech pseudowords (with surface form ∅), this translation alternative receives high likelihood. Are then (nearly) all English subject pronouns translated as null subjects in Czech?

Another work on translation between NSL (Spanish) and non-NSL (English) has been done by [Peral & Ferrández, 03].
They developed a system which identifies and resolves all pronouns (not only the omitted subject pronouns) in Spanish as a source language. Their translation system is based on an interlingua approach. The input text undergoes several analysis steps: morphological analysis, POS-tagging, parsing and word-sense disambiguation. The enriched input text serves as input to a component which deals with different NLP problems like anaphora identification and resolution. After dealing with anaphora, the generation of the interlingua representation of the whole input text is carried out. This representation contains all information needed to translate pronouns in the target language.

Although the authors report very good results in the tasks of anaphora identification and generation, there are some additional problems that their MT system had to solve. For example, if it is clear that the omitted subject pronoun in Spanish as source language is 3rd person feminine, this does not mean automatically that the correct English pronoun should also be of the same gender (e.g. el_masc with the referent el perro_masc vs. it_neut with the referent dog_neut). In English, animals have neuter grammatical gender. So, we have to have the information that the referent of el is an animal in order to correctly translate the pronoun (possibly an omitted pronoun) into English.

Evaluating their system, [Peral & Ferrández, 03] translated all occurrences of English (as source language) pronouns into their Spanish equivalents. They note, though, that a subsequent task must decide if the pronoun in Spanish must be generated, substituted by some other pronoun or eliminated.

[Nakaiwa & Ikehara, 92] developed an anaphora resolution system for Japanese (a pro-drop language) and integrated it into a machine translation system for Japanese to English called ALT-J/E. The anaphora resolution process is based on semantic attributes of verbs and their relationship to the arguments.
For each verb it is necessary to determine its semantic category and its relationship to its arguments (SUBJ, OBJ). These arguments can be the anaphora and nominal phrases. Using this information, rules allow the derivation of the correct referent for a particular anaphora, which can be a zero pronoun.

For example, let us assume that we want to resolve an anaphora a_i governed by some verb v_i with some semantic attribute vsa_i. a_i is a subject of v_i. We have the same information about some verb v_j of a so-called topicalized unit sentence [5] which governs some phrase which could be a referent of a_i. Given this information, the rules are searched in order to find the right referent for a_i. The rules have the following form: If v_i has a verb category vsa_i and governs an anaphora a_i as its argument arg_i (e.g. SUBJ) and we have some verb v_j with verb category vsa_j, then the argument arg_j (e.g. OBJ) of verb v_j can be assumed to be a referent of a_i.

To apply these rules, the verb in the sentence with the zero pronoun and the verb of the unit sentence have to be extracted. Their verb categories are identified. According to the rules describing verb relationships as sketched above and the identified verb categories, the referent of the zero pronoun is established.

When translating the resolved zero anaphora (i.e. their referents), it could happen that the translation in English becomes verbose. In this case, elliptical pronouns and definite articles should be used [Nakaiwa & Ikehara, 92]. This leads again to the problem of generating the correct English subject (and object) pronoun.

3.2 Translation between English and Italian

In this section, I will describe differences between English and Italian regarding the null subject that cause problems for automatic translation between the two languages. Some of the cases have already been mentioned in the preceding discussion.
Now, we look at concrete examples and the translations that three MT systems provided: S - the rule-based MT system SYSTRAN [6], G - the statistical MT system Google translator [7], and M - the statistical MT system Moses (cf. chapter 4). The translation under R represents the reference. Some of the source language sentences are extracted from Europarl (cf. section 5.2), whereas others were constructed by myself.

[5] This is a sentence that contains nominal phrases which can serve as referents of the anaphora in the following sentences.
[6] Free translation at http://www.systranet.com/ (November 2009).
[7] Free translation at http://www.google.com/language_tools (November 2009).

Since the example analysis describes linguistic knowledge needed for resolving some problems regarding null subjects, it is important to point out that phrase-based statistical MT systems in their original form do not have access to any linguistic knowledge, so they are certainly disadvantaged when linguistic knowledge is needed to generate correct translations. Rule-based systems are more likely to recognise which pronominal subject can occur with a given verb form.

3.2.1 Italian to English

We already know that in Italian, the properties of the missing subject like number and person can be derived from the verb inflection (cf. section 2.1). We will now examine how well this works in available MT systems. The words set in bold in the Italian input sentences are finite verbs. The pronouns in bold in the English translations represent subjects corresponding to the omitted subject in Italian.

First person subject pronouns

Let us begin with the omitted pronouns of the first person singular and plural.

(28) So che il governo americano condivide i nostri obiettivi.
     R: I know that the American government shares our goals.
     G: I know that the U.S. government shares our goals.
     S: I know that the government American shares our objectives.
     M: I know that the american government shares our objectives.
(29) Hanno compreso, come noi, quanto sia importante che svolgiamo insieme ...
     R: They understood, as we did, how important it is that we carry out together ...
     G: They understood, like us, it is important that we do together ...
     S: They have comprised, like we, how much is important that we carry out ...
     M: They understood, as we, how important it is that *∅ perform together ...

All translations but one are correct. In (29), Moses does not generate the subject pronoun of the verb perform. Verb forms for 1st person singular and plural are not ambiguous, so that the right pronoun in English can be derived from the analysis of the Italian verb form. [8] The translation possibilities can be summarised as shown in (30).

(30) IT.Verb.1.P.Sg → I + EN.Verb.1.P.Sg
     IT.Verb.1.P.Pl → We + EN.Verb.1.P.Pl

[8] An explanation for the false Moses output is given later in chapter 6.

Second person subject pronouns

Let us go on with the second person singular and plural.

(31) Hai detto che parli italiano.
     R: You said that you speak Italian.
     G: You said that you speak Italian.
     S: You have said that it speaks Italian.
     M: You have said that *∅ speaks Italian.

(32) Avete giocato con i genitori.
     R: You played with parents.
     G: You played with their parents.
     S: You had played with the parents.
     M: *You with their parents.

The VPs (auxiliary + participle) in the main clauses in example sentences (31) and (32) can be uniquely translated into English. In the Italian subclause in (31), we face an ambiguous verb parli: It can occur with the 2nd person singular, as recognised by Google. But, as a subjunctive, parli can furthermore occur with the 3rd person singular, as recognised by Systran. Moses does not generate any subject pronoun, leading to a grammatically incorrect subclause translation.

Beyond the ambiguity regarding some verbs in indicative and subjunctive, there is another problem regarding verbs in present tense. The indicative and imperative verbs for the second person are the same.
(33) Dite che parlate italiano.
     R: Say that you speak Italian.
     G: ∅ Say you speak Italian.
     S: You say that *∅ speeches Italian.
     M: You say that you are italian.

(34) Dite se parlate italiano.
     R: Say if you speak Italian.
     G: ∅ Say if you speak Italian.
     S: You say if *∅ speeches Italian.
     M: You say if *∅ spoken italian.

(35) Scrivi una lettera.
     R: You are writing a letter.
     G: *∅ Write a letter.
     S: You write a letter.
     M: *∅ Refer a letter.

(36) Scrivi una lettera!
     R: Write a letter!
     G: ∅ Write a letter!
     S: *You write a letter!
     M: ∅ Refer a letter!

The only difference between (33) and (34) is the conjunction used: che (= that) and se (= if). Whereas the conjunction che could be used both in an indicative and an imperative sentence, the conjunction se should instead be used with the interpretation of the verb dite as imperative. So, the Google translations are both acceptable, but Systran's are not. Whether the subject of the main clause in (33) should be used (for indicative reading) or not (for imperative reading) cannot be derived directly. This would probably be easier if we had access to the context of the given sentence.

If the sentence mode is marked by punctuation, it is possible to derive the right sentence mode (cf. examples (35) and (36)). Unfortunately, the MT systems do not seem to use this information for deciding whether the subject in English should be generated (for indicative) or not (for imperative).

Let us summarise the translation alternatives for the omitted subject for the 2nd person singular and plural.

(37) IT.Verb.2.P.Sg → You + EN.Verb.2.P.Sg (indicative)
     IT.Verb.2.P.Sg → ∅ + EN.Verb.2.P.Sg (imperative)
     IT.Verb.2.P.Pl → You + EN.Verb.2.P.Pl (indicative)
     IT.Verb.2.P.Pl → ∅ + EN.Verb.2.P.Pl (imperative)

Third person subject pronouns

The most complicated case is that of the 3rd person pronouns that have been omitted. We will start with the cases in singular.

(38) Dice che parla italiano.
     R: He/She says that he/she speaks Italian.
     G: She says she speaks Italian.
     S: It says that *it speaks Italian.
     M: *∅ Says that *∅ speaks Italian.

(39) Pensa che non è malata.
     R: He/She thinks that she is not ill.
     G: *∅ Think that is *∅ not sick.
     S: *It thinks that *it is not sick.
     M: *∅ Does that *∅ not is sick.

Examples (38) and (39) already show the limitations of the tested systems regarding the null subject. Indeed, in the first example, it is not possible to derive the gender of the missing subject pronoun. Google proposes the pronoun for 3rd person feminine as subject for both clauses in the source sentence. Since we do not know anything about the context of the sentence, we can accept this solution. [9] The translation that Systran suggested has at least one error. The proposed subject for the main clause can be seen as correct if the subject refers, for example, to some book or note or the like. Knowing, though, that only humans can speak, the subject pronoun it for the subclause cannot be correct. The Moses translation does not contain subject pronouns and is therefore grammatically incorrect.

[9] I have been told by a native speaker of Italian that masculine is used when a decision about the gender cannot be made.

In contrast to example (38), at least the subclause in (39) provides all information needed to generate the right subject pronoun in English. Predicative adjectives which occur with copula verbs match in number and gender with the referents that they modify. So, it is possible to determine the subject of the subclause in (39) as feminine singular. The verb provides the information that the subject is in the 3rd person, so we can clearly say that the subject in the English translation should be she. Concerning the subject of the main clause, the translation should be at least he or she if we assume that only humans have the ability to think.

The property of Italian described for the subclause in (39) holds also for composed tense forms which take essere (= be) as an auxiliary.
(40) È andata a casa.
     R: She went home.
     G: *∅ Went home.
     S: *It has gone to house.
     M: *∅ Has gone home.

(41) Era rimasto a scuola.
     R: He stayed at school.
     G: He had stayed in school.
     S: *Era remained to school.
     M: *∅ Remained at school.

The participle andata in (40) provides information about the gender of the omitted subject pronoun. Together with the information which the inflected verb È provides, it is possible to identify the subject as 3rd person singular feminine: she. The same form of analysis for the verb Era and the participle rimasto leads us to the conclusion that the subject in English in (41) should be he.

The 3rd person singular is additionally used in the polite form of address. It is used with the Italian 3rd person pronoun lei, which is unfortunately also a pronoun for the 3rd person singular feminine. So, this is another case of ambiguity to deal with.

(42) Lei non è stata a casa?
     R: Was she not at home?
     G: She was not at home?
     S: Hasn't *it been to house?
     M: *You was not at home?

Google translator recognises the subject pronoun Lei as 3rd person singular feminine, which is one interpretation alternative of this pronoun. The other translation possibility, namely as you, is found by Moses, but the generated pronoun does not match the corresponding verb was.

The next examples show impersonal constructions in Italian. We begin with an example of a so-called weather verb.

(43) Piove.
     G: *∅ Rains.
     S: It rains.
     M: *∅ Rain.

Weather verbs as in (43) need expletives in English. Only Systran generates the correct subject pronoun for the example sentence in (43).

Let us now examine the si sentences and their English equivalents. The first three examples contain intransitive verbs. These constructions are called si impersonale.

(44) In Germania si beve la birra.
     R: In Germany, people drink beer.
     G: *In Germany, ∅ drinking beer.
     S: In Germany the beer is drunk.
     M: In Germany we drink beer.

(45) In Germania si è letto molto.
     R: In Germany, people have read a lot.
     G: In Germany *you have read a lot.
     S: In Germany a lot has been read.
     M: Germany has read.

(46) Quando eravamo studenti, si è andati a scuola.
     R: When we were students, we went to school.
     G: When we were students, *he went to school.
     S: When we were students, it has been gone to school.
     M: When we were studenti, *∅ has gone to school.

Examples (44) - (46) show the use of si impersonale. The subjects in the English translations of (44) and (45) should be people or one. The translations of the subordinate clause in (46) (Quando eravamo studenti) are correct, but the translations of the main clause are a bit problematic. The main clause consists of the finite verb for the 3rd person singular and the participle andati, which matches a subject in plural. MT systems use only the information about the finite verb and generate the corresponding pronouns in the target language, though they have different values for gender. But if the VP è andati refers to the same set of referents as in the subordinate clause, the pronoun we should be used as the subject of the main clause. This is not trivial, since we are dealing with the verb è, which needs a subject of the 3rd person singular, but we want to generate a pronoun of the 1st person plural in the target language.

Until now, we have taken a look only at cases of 3rd person singular. In (47) and (48) follow examples for 3rd person plural.

(47) Hanno cantato la mia canzone.
     R: They sang my song.
     G: They sang my song.
     S: They have sung my song.
     M: My song have been sung.

(48) Sono state in Croazia.
     R: They were in Croatia.
     G: *∅ Were in Croatia.
     S: They have been in the Croatia.
     M: *∅ Were in Croatia.

The only alternative for translating 3rd person plural in English is they. All information (3rd person plural feminine) can be derived for the subject in the example sentence (48). Since there are no gender distinctions for 3rd person plural in English, this translation case is unambiguous and should be they.
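The derivations discussed for examples (28) - (48) can be sketched as a small lookup over verb features. The following Python sketch is only an illustration under strong assumptions: the class VerbFeatures, its field names and the function english_subject are hypothetical and not part of any system used in this work, the features are assumed to be delivered by some morphological analysis, and the polite reading of lei as well as coreference with the context (as in (46)) are not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VerbFeatures:
    """Features of an Italian finite verb group (hypothetical container)."""
    person: int                            # 1, 2 or 3
    number: str                            # "sg" or "pl"
    mood: str = "indicative"               # "indicative" or "imperative"
    gender: Optional[str] = None           # "f"/"m" if a participle or predicative adjective shows it
    weather_verb: bool = False             # e.g. piove
    si_impersonale: bool = False           # e.g. si beve
    human_referent: Optional[bool] = None  # world knowledge, if available


def english_subject(v: VerbFeatures) -> str:
    """Return a candidate English subject; "" means no overt subject."""
    if v.person == 1:
        return "I" if v.number == "sg" else "we"
    if v.person == 2:
        # Imperatives take no subject, indicatives take "you".
        return "" if v.mood == "imperative" else "you"
    # Third person: impersonal constructions first.
    if v.weather_verb:
        return "it"
    if v.si_impersonale:
        return "one"
    if v.number == "pl":
        return "they"
    # Third person singular: use gender if the verb group reveals it.
    if v.gender == "f":
        return "she"
    if v.gender == "m":
        return "he"
    if v.human_referent is True:
        return "he/she"
    if v.human_referent is False:
        return "it"
    return "he/she/it"  # cannot be resolved without context


# (40) È andata a casa: 3rd person singular, feminine participle
print(english_subject(VerbFeatures(person=3, number="sg", gender="f")))
```

The last return value makes the ambiguity explicit instead of guessing a gender, which mirrors the observation that such cases require context from preceding sentences.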
Let us now summarise the observations made by examining examples (38) - (48).

(49) IT.Copula.3.P.Sg + IT.PastPart.F    → She + EN.Verb.3.P.Sg
     IT.Copula.3.P.Sg + IT.PastPart.F    → You + EN.Verb.2.P.Sg (polite)
     IT.Copula.3.P.Sg + IT.Predicative.F → She + EN.Verb.3.P.Sg
     IT.Copula.3.P.Sg + IT.Predicative.F → You + EN.Verb (polite)
     IT.Copula.3.P.Sg + IT.PastPart.M    → He + EN.Verb.3.P.Sg
     IT.Copula.3.P.Sg + IT.Predicative.M → He + EN.Verb.3.P.Sg
     IT.Verb.3.P.Sg                      → He/She + EN.Verb.3.P.Sg (if only human referents possible)
     IT.Verb.3.P.Sg                      → It + EN.Verb (if human referents not possible)
     IT.si + IT.Verb.3.P.Sg              → one/people + EN.Verb.3.P.Sg/Pl
     IT.Impers.3.P.Sg                    → It + EN.Verb.3.P.Sg
     IT.Verb.3.P.Pl                      → They + EN.Verb.3.P.Pl

There is another interesting construction in Italian which does not contain a subject, namely the negated imperative for 2nd person singular.

(50) Non mangiare nelle ore di lezione!
     R: Do not eat in the hours of lessons!
     G: ∅ Do not eat in the hours of lessons!
     S: ∅ Not to eat in the hours of lesson!
     M: ∅ Do not eat in hours of lesson!

The negated imperative form for the 2nd person singular consists of the negation non and the infinitive, in our case mangiare. This kind of sentence should be translated by a do not ... construction, as Google translator suggested. Though Systran's translation does not have a subject, which is correct, it also contains the infinitive marker to, which makes the sentence grammatically incorrect. The analysis of the example in (50) leads to the following rule:

(51) IT.non + IT.infin → ∅ + do not EN.infin

3.2.2 English to Italian

As already mentioned at the beginning of the chapter, the main question in translation direction EN → IT is whether the Italian subject pronouns should be generated or omitted. In principle, they could always be generated or always omitted.
Neither of these decisions is ideal: Whereas the omission of all subject pronouns can lead to problems with respect to the adequacy of the translations, the generation of all subject pronouns would very likely result in a text that sounds rather unnatural. A text consisting of a sequence of sentences in which almost every sentence has a subject pronoun contains a lot of redundant information (number, person, gender) coded at the same time both in the subject pronouns and in the finite verbs. So, the subject pronouns should be omitted to avoid the redundancy and to preserve the text fluency.

If just one isolated sentence should be translated, it is rather imaginable that such a sentence contains a subject pronoun. The explicit occurrence of the subject pronoun in such isolated sentences can be explained by the fact that without the context, it is not possible to determine the referent which the omitted subject pronoun refers to. In such a context, the use of a subject pronoun can thus be compared with the use of an NP subject. It introduces a referent and provides information about it. In isolated sentences, this information can only be provided by the referent that is situated in the given sentence.

Since translation is more often carried out on a text, it should be examined in which contexts the subject pronoun should be dropped or realized overtly. In our discussion so far, we saw that the use of a subject pronoun often has pragmatic reasons (cf. section 2.3) which are not easy to capture in an automatic translation system. Some cases in which the pronoun is overtly used have already been shown and discussed in section 2.2.2. A much more detailed examination is needed to find the contexts in which subject pronouns in Italian are used. The pronoun triggers shown in (16b), (24), (25), (27) have to be identified, and it should be investigated how probable it is that they really occur with the subject pronoun.
This kind of rather local regularity can be captured by SMT systems. They work on the word level and can identify word sequences which are often translated to each other. So, if itself corresponds relatively often to the phrase esso stesso, it has a good chance of being translated to it without using heuristics to decide whether the pronoun should be generated.¹⁰

¹⁰ Details on phrase-based SMT are discussed in chapter 4.

3.3 Summary

In MT, the problem of pro-drop has been dealt with only marginally. In my opinion, however, this is an important issue, since the absence of the subject (pronoun) in a non-NSL leads to grammatically incorrect sentences. If the subject is not generated because the corresponding element in the source language does not exist, it should be examined which information in the source language could be used to generate the correct subject pronoun. The analysis of source sentences of an NSL, Italian, (cf. section 3.2.1) showed that in many cases, Italian verbs carry enough information to enable the generation of the correct English pronoun. However, in a number of cases, Italian verbs are ambiguous and therefore require the observation of the context (preceding sentences) in order to derive the correct English subject pronoun.

When an Italian text is to be generated from an English input, it has to be determined whether the subject pronouns should be realized or omitted. Since their use has pragmatic reasons, a more detailed analysis of Italian is needed to answer this question.

In the following chapter, the details of statistical machine translation are sketched. In chapter 5, a method for the word alignment of Italian and English VPs is described. SMT systems are built to test whether the rule-based VP alignment contributes to a better translation of pronominal subjects between Italian and English. The evaluation results of the systems are shown in chapter 6.
4 Statistical machine translation

This chapter describes phrase-based statistical machine translation (SMT). In section 4.1, the statistical models for automatic word alignment are introduced. We take a closer look at GIZA++, the open source word alignment tool developed by [Och & Ney, 03], since this tool was used to create the baseline word alignment which has been improved by applying the alignment rules described in chapter 5. In section 4.2, the concept of phrase-based SMT is described. The phrase-based SMT approach is implemented in the open source SMT system Moses [Koehn et al., 07], which has been used within this work.

4.1 Word alignment

Word alignment is a very important task within SMT. In the training process of an SMT system, it is necessary to identify word equivalences in order to obtain the translation tables which are needed in the translation process. Phrase-based SMT systems (cf. section 4.2) use the word alignment to extract translation phrases (word sequences). So, the quality of the word alignment is crucial for extracting good parallel phrases.

There are five statistical models, the so-called IBM Models, which are used to automatically compute the word alignment of a parallel sentence-aligned corpus [Brown et al., 03]. Word alignment models are trained by the Expectation Maximization algorithm (EM). EM consists of two steps: (i) expectation, in which the alignment model is applied to the data, and (ii) maximization, in which the model parameters are recalculated. The simplest way to start the EM training is to assume that all words are equally likely to be aligned to each other. The model is applied to the data, resulting in a word-aligned parallel corpus. On the basis of the counts of the alignment pairs, the lexical translation probabilities are re-estimated. These recalculated model parameters are used as the model for the next iteration. The algorithm stops when convergence is reached.
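The EM loop described above can be sketched for the simplest case, in which only the lexical translation probabilities t(e|f) are estimated. This is a minimal illustration and not the GIZA++ implementation: the function name train_model1, the toy English-Italian corpus and the fixed number of iterations (used here instead of a convergence test) are assumptions made for the example.

```python
from collections import defaultdict


def train_model1(corpus, iterations=10):
    """Estimate t(e|f) with EM; corpus is a list of (tl_words, sl_words) pairs."""
    tl_vocab = {e for tl, _ in corpus for e in tl}
    uniform = 1.0 / len(tl_vocab)
    # Start: every TL word is equally likely to be generated from any SL word.
    t = defaultdict(lambda: uniform)
    for _ in range(iterations):  # fixed count instead of a convergence test
        count = defaultdict(float)  # weighted counts c(e|f)
        total = defaultdict(float)  # normalisation constant per SL word f
        # Expectation: apply the current model to the data.
        for tl, sl in corpus:
            sl = ["NULL"] + list(sl)  # position 0 is the empty word
            for e in tl:
                norm = sum(t[(e, f)] for f in sl)
                for f in sl:
                    weight = t[(e, f)] / norm
                    count[(e, f)] += weight
                    total[f] += weight
        # Maximization: recompute t(e|f) from the collected counts.
        t = defaultdict(float)
        for (e, f), c in count.items():
            t[(e, f)] = c / total[f]
    return t


# Hypothetical toy corpus: English is the TL (e), Italian the SL (f).
toy_corpus = [
    ("the house".split(), "la casa".split()),
    ("the table".split(), "la tavola".split()),
    ("a house".split(), "una casa".split()),
]
t = train_model1(toy_corpus)
print(round(t[("house", "casa")], 3), round(t[("the", "casa")], 3))
```

On this toy corpus, the probability mass of casa moves towards house over the iterations, because the competing word the is better explained by la. A real training run would monitor the change in the parameters (or the data likelihood) to decide when to stop.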
In the first statistical word alignment model, IBM Model 1, the sentences are treated as a bag of words, which means that the word order does not play any role in the word alignment process. The improvement of this model leads to Model 2, in which the target word also depends on its position in the TL sentence. Since some words can be aligned to a sequence of words in some other language, it is desirable to model and allow 1-to-n alignments. This is done by modeling the word fertility in Model 3. In Model 4, the position of the previously translated word is taken into account. In the following, the IBM models for the word alignment are briefly described.¹¹

¹¹ For a more detailed discussion of the methods in statistical machine translation, please refer to [Koehn, 09].

IBM Model 1

When computing the word alignment of a sentence pair, we are interested in the most probable alignment a for a sentence pair containing the target language (TL) sentence e = (e_1, ..., e_{l_e}) and the source language (SL) sentence f = (f_1, ..., f_{l_f}). Formally, we need to compute the alignment probability p(a|e, f) (cf. equation (1)).

\[ p(a \mid e, f) = \prod_{j=1}^{l_e} \frac{t(e_j \mid f_{a(j)})}{\sum_{i=0}^{l_f} t(e_j \mid f_i)} \tag{1} \]

Equation (1) uses the lexical translation probabilities t(e_j|f_i), which express the probability of generating the TL word e_j from the SL word f_i. Furthermore, the numerator t(e_j|f_{a(j)}) models the probability of generating the word e_j from the word f_i given the alignment function a(j) = i.

After the most probable alignment of a sentence pair is computed using equation (1), the model parameters are re-estimated. The weighted counts c(e|f; e, f) for translating a particular SL word f into a particular TL word e in the sentence pair (e, f) are collected. Having these counts, new translation probabilities t(e|f) can be estimated. As the initial lexical probability distribution, the uniform probability distribution is taken, indicating that every TL word is equally likely to be generated from each SL word.

IBM Model 2

IBM Model 1 does not incorporate any knowledge about the word order in the target sentence. In contrast, IBM Model 2 has an explicit model for an alignment based on the positions of the input and output words (cf. equation (2)).

\[ a(i \mid j, l_e, l_f) \tag{2} \]

The alignment probability distribution in (2) models the probability of translating a source word in position i into a target word in position j. The model predicts the source word positions conditioned on the generated target word positions. Expanding IBM Model 1 with the position-based alignment probability distribution shown in (2), we obtain a new equation for computing the most probable alignment a for a sentence pair (e, f). The equation is shown in (3).

\[ p(a \mid e, f) = \prod_{j=1}^{l_e} \frac{t(e_j \mid f_{a(j)}) \, a(a(j) \mid j, l_e, l_f)}{\sum_{i=0}^{l_f} t(e_j \mid f_i) \, a(i \mid j, l_e, l_f)} \tag{3} \]

As in Model 1, new lexical translation probabilities are estimated from the weighted counts for lexical translations c(e|f; e, f). In addition to the lexical translations, the position-based probability distribution is computed using the counts for the translation of the words in specific positions: c(i|j, l_e, l_f; e, f). As the initial lexical probability distribution, Model 2 uses the lexical probabilities computed by Model 1. The position-based alignment probabilities are initialised as 1/(l_f + 1).

IBM Model 3

IBM Model 3 contains an additional model which expresses the fertility of a source word. It contains the probabilities of translating a source word into one, two or more target words. Artificial fertility probabilities for the Italian word all (= to the) are shown in (4). The probability that all generates two English words is much higher than the probability that it generates only one English word.
    n(2|all) = 0.8
    n(1|all) = 0.2                                                    (4)

The fertility model also allows the insertion of target words that do not have a counterpart in the source sentence. These words are treated as being generated from a special token NULL with fertility n(φ|NULL). Additionally, the fertility model permits that a source word is not translated at all. In other words, it can be dropped. This is expressed by a fertility n(0|w), where w is a source word.

Instead of the alignment probability distribution of Model 2, Model 3 uses a distortion probability distribution d(j|i, l_e, l_f) which predicts target word positions based on the source word positions.

For the re-estimation of the model parameters, only the most probable word alignments for a sentence pair (e, f) are used. As the initial lexical probability distribution, the estimates from Model 2 are used. Since the distortion probabilities are not available in the first iteration step, the alignment probabilities estimated by Model 2 are used as the starting distortion probability distribution.

IBM Model 4

IBM Model 4 introduces a relative distortion model, which is an improvement over the absolute distortion model of IBM Model 3. The absolute distortion model does not perform well on long source and target sentences. The movement probabilities for such sentence pairs are sparse and not very realistic [Koehn, 09]. Since the position of a generated target word depends in particular on the position of the word generated for the preceding source word, Model 4 introduces a distortion probability distribution based on the position of the alignment of the previous source word.

The distortion model implemented in IBM Model 4 is based on cepts. A cept consists of a source word f_j which is aligned with at least one target word. The center of a cept i (denoted by ⊙_i) is defined as the ceiling of the average of the word positions. The relative distortion d_1 of a target word e_j in a position j which is also the starting position of a cept i is defined as shown in (5).

\[ d_1(j - \odot_{i-1}) \tag{5} \]

If a target word e_j is not the start element of a cept, its relative distortion is defined as shown in (6). With the term π_{i,k−1}, we refer to the position of the preceding target word in the cept which e_j belongs to.

\[ d_{>1}(j - \pi_{i,k-1}) \tag{6} \]

The computed relative distortion values d_1 and d_{>1} express the movement of a target word e_j depending on the position of the preceding target word e_{j−1}.

The training of the model starts with the estimates of Model 3 as the initial model parameters. As in Model 3, the most probable alignments are computed, from which the counts for the parameter re-estimation are gathered.

GIZA++

The basis for the presented work is the base word alignment computed by the system GIZA++, developed by [Och & Ney, 03]. It is a combination of Model 1, an HMM (Hidden Markov Alignment Model) shown in (7), and Model 4 (p_4).

\[ p_{HMM}(f, a \mid e) = p(B_0 \mid B_1^I) \cdot \prod_{i=1}^{I} p(B_i \mid B_{i-1}, e_i) \cdot \prod_{i=0}^{I} \prod_{j \in B_i} p(f_j \mid e_i) \tag{7} \]

In the HMM, inverted alignments B_0^I are used for the representation of the alignment a_1^J. They represent the mapping from a TL word to an SL word. B_i is a partition of the SL sentence marking the word (sequence) of the SL. The alignments with empty words are modeled by the probability distribution p(B_0|B_1^I), where the set B_0 contains all positions of SL words which are aligned with the empty word. p(B_i|B_{i−1}, e_i) expresses the probability of the SL word (sequence) B_i given the translation of the preceding SL word (sequence) B_{i−1} and a target word e_i.

GIZA++ combines Model 1, the HMM and Model 4. First, the parameters for Model 1 are computed. They serve as the initial model parameters for the HMM. The estimates of the HMM are finally used in Model 4 for deriving the final model parameters.

To allow n-to-m alignments, alignment symmetrization is carried out. The word alignment is carried out in both directions.
In the next step, the produced alignments are combined to compute the output alignment. In GIZA++, the intersection of the alignments is computed. Thus, the links which are part of both alignments are taken. These links are considered to be very reliable since they can be found in both alignments. After these links are identified, the alignments for the neighbouring words are computed using the union of the two alignments (refined symmetrization) [Och & Ney, 03][Koehn et al., 03].

In this work, GIZA++ has been applied to the English-Italian parallel corpus, producing the baseline word alignment which has been partially improved (cf. section 5).

[Pianta & Bentivogli, 04] evaluated the statistical word alignment computed by GIZA++ for Italian and English. They used a corpus consisting of 25,000 sentence pairs. Table 4 shows the evaluation results. As a symmetrization method, [Pianta & Bentivogli, 04] used the intersection of the alignments computed for English → Italian and Italian → English. The reported results of the word alignment evaluation show that the GIZA++ word alignment for English-Italian leaves some room for improvement.

    Alignment       Precision   Recall
    IT → EN         73.4        55.2
    Intersection    95.2        38.8

Table 4: Evaluation of GIZA++ word alignment for English and Italian

4.2 Phrase-based SMT

SMT belongs to the group of word-based machine translation systems. This means that the input sentence that is to be translated does not undergo any analysis (syntactic, semantic), but is translated word by word. A large bilingual dictionary is needed to carry out word-by-word translation.

There are many cases in which word-by-word translation fails. One word in the SL does not always correspond to exactly one word in the TL, which also holds for the opposite translation direction. This leads to the assumption that instead of words, phrases (word sequences) should be translated as one translation unit. These phrases do not necessarily coincide with linguistic phrases.
For example, the Italian word sequence Io sono (pronoun as a subject + sentence predicate) can be a phrase which is translated as one translation unit into the English phrase I am. To carry out this type of translation, we need translation probabilities for phrase pairs as shown in table 5.

    Translation       Probability p(e|f)
    i am              0.80
    i was             0.10
    i have been       0.05
    myself am also    0.03
    we are            0.02

Table 5: Example phrase translation probabilities for io sono

When TL phrases are generated, they have to be reordered so that they appear in the correct order in the generated sentence. This is modelled by a reordering model. Instead of learning reordering probabilities from the data, a cost function is applied. The cost function expresses how expensive the movement of a phrase is.

In the following, the details of phrase-based SMT are described with respect to its implementation in the open source SMT system Moses [Koehn et al., 07].

Phrase translation table

The first step in obtaining translation phrases is the word alignment of parallel sentences. In Moses, the word alignment tool GIZA++ (cf. section 4.1) is used. GIZA++ allows one-to-many word alignment, where at most one TL word can be aligned with each SL word. To account for this flaw, Moses expands the word alignment by aligning the words in both directions. The result of the bidirectional alignment is a many-to-many word alignment of the sentence pair.

The two alignments can be combined in several ways: they can be intersected, or their union can be built. In Moses, these two methods are combined. Firstly, the intersection of the bidirectional alignments is computed. In the next step, additional alignment points are heuristically chosen from the alignment union.

When the word alignment is given, translation phrases can be derived. The phrases must be consistent with the word alignment, which means that the words of a phrase pair are only aligned with words within these phrases and not with words outside.
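The two steps just described, combining the bidirectional alignments and checking whether a phrase pair is consistent with the result, can be illustrated with a short sketch. It is a simplified stand-in and not the heuristic implemented in Moses: the function names, the neighbourhood-based growing rule and the example links for I am not / non sono are assumptions chosen for illustration.

```python
def grow(intersection, union):
    """Start from the intersection and add neighbouring links from the union.

    Simplified growing rule (an assumption of this sketch): a union link is
    added if it is adjacent to an accepted link and at least one of its two
    words is still unaligned.
    """
    aligned = set(intersection)
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1),
             (-1, -1), (-1, 1), (1, -1), (1, 1)]
    added = True
    while added:
        added = False
        for e_pos, f_pos in sorted(aligned):
            for de, df in steps:
                cand = (e_pos + de, f_pos + df)
                if cand in union and cand not in aligned:
                    e_free = all(link[0] != cand[0] for link in aligned)
                    f_free = all(link[1] != cand[1] for link in aligned)
                    if e_free or f_free:
                        aligned.add(cand)
                        added = True
    return aligned


def is_consistent(links, e_span, f_span):
    """True if no link leaves the phrase pair and at least one link is inside.

    e_span and f_span are inclusive (first, last) word positions.
    """
    inside = False
    for e_pos, f_pos in links:
        in_e = e_span[0] <= e_pos <= e_span[1]
        in_f = f_span[0] <= f_pos <= f_span[1]
        if in_e != in_f:  # the link crosses the phrase boundary
            return False
        inside = inside or in_e
    return inside


# Hypothetical links for EN "I(0) am(1) not(2)" and IT "non(0) sono(1)".
en_to_it = {(0, 1), (1, 1), (2, 0)}
it_to_en = {(1, 1), (2, 0)}
links = grow(en_to_it & it_to_en, en_to_it | it_to_en)
print(sorted(links))
print(is_consistent(links, (0, 1), (1, 1)), is_consistent(links, (1, 1), (1, 1)))
```

In this example, the pair I am / sono is consistent, whereas am / sono is not, because the pronoun I is also linked to sono and would be left outside the phrase pair.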
After the phrase pairs are collected, their translation probability is estimated by relative frequency as shown in (8).

\[ \phi(\bar{f} \mid \bar{e}) = \frac{count(\bar{f}, \bar{e})}{\sum_{\bar{f}_i} count(\bar{f}_i, \bar{e})} \tag{8} \]

The probability of a phrase f̄ given a phrase ē is the count of how often the two phrases occur together, divided by the total number of occurrences of the phrase ē.

Reordering models

The reordering model in Moses is based on the reordering of a phrase relative to the previous phrase. We define start_i as the start position of the phrase i, and end_i as the position of the last word of that phrase, so that end_{i−1} refers to the last word of the preceding phrase. The reordering distance is computed as shown in (9).

\[ x = start_i - end_{i-1} - 1 \tag{9} \]

The cost function in (10) is applied to the computed reordering distance, where α ∈ [0; 1].

\[ d(x) = \alpha^{|x|} \tag{10} \]

Generally, this reordering model punishes any movement. This works fine for language pairs with similar syntax, but it leads to bad translations for language pairs which differ significantly with respect to word order. Although the language models should account for the different word order in SL and TL sentences, they are limited as they consider only small word sequences. For this reason, phrase-based SMT uses an additional reordering model: the lexicalized reordering model. It models the orientation of an extracted phrase pair. The orientation specifies the position of the TL phrase. It can be monotone, which means that its position is equal to the position of the SL phrase. Furthermore, it can be swap, indicating that the SL and the TL phrases are swapped. Finally, the phrases can be discontinuous, thus interrupted by other phrases.

Language model

Different word order in different languages poses a problem for statistical machine translation. A language model which is built from a large target language text should account for this. It consists of automatically computed n-grams which express the probability of a target word e_j if it is preceded by n − 1 already generated target words. The computation of the probability of a target sentence e = e_1, ..., e_l given a trigram language model is shown in (11).

\[ p(e) = p(e_1, e_2, \dots, e_l) = p(e_1) \, p(e_2 \mid e_1) \dots p(e_l \mid e_1, e_2, \dots, e_{l-1}) \approx p(e_1) \, p(e_2 \mid e_1) \dots p(e_l \mid e_{l-2}, e_{l-1}) \tag{11} \]

The model parameters are computed using the counts of the word sequences as shown in (12).

\[ p(w_3 \mid w_1, w_2) = \frac{count(w_1, w_2, w_3)}{\sum_{w} count(w_1, w_2, w)} \tag{12} \]

Log-Linear Model

The translation model in phrase-based SMT uses the translation table φ(f̄|ē), the reordering model d and the language model p_LM(e). The models are combined in a log-linear model shown in (13).

\[ e_{best} = \operatorname{argmax}_e \prod_{i=1}^{I} \phi(\bar{f}_i \mid \bar{e}_i)^{\lambda_\phi} \, d(start_i - end_{i-1} - 1)^{\lambda_d} \prod_{i=1}^{|e|} p_{LM}(e_i \mid e_1 \dots e_{i-1})^{\lambda_{LM}} \tag{13} \]

The different models used in phrase-based translation are weighted by λ. The weights for the translation model λ_φ, the reordering model λ_d and the language model λ_LM are learned from the bilingual data in order to maximize the likelihood of the training data.

5 Word alignment of English and Italian verb phrases

This chapter describes a method for improving the base word alignment with respect to the problem of null subject pronouns in Italian and obligatory subject pronouns in English. Since the English subject pronoun does not necessarily have a counterpart in Italian, it is often aligned with incorrect words in the given parallel Italian sentence. I present a rule-based method for the correction of the base alignment of English subject pronouns. Since the alignment of the subject pronouns depends on the alignment of the sentence predicate, rules have been developed which define the alignment not only of English subject pronouns, but also of the entire English verb phrases (VPs) which belong to a subject pronoun. In the following, the term verb phrase is used for the combination of the (null) subject pronoun, the negation and the verbal elements of the VP.
After a short motivation for the base alignment improvement in the following section, I describe the data that I worked with (cf. section 5.2). In section 5.3, the algorithm used to compute the VP alignment is presented. The rules, which are based on part of speech tags and which have been developed and applied to the base word alignment, the English parses and the Italian tagged sentences, are described in section 5.4. The evaluation results of the improved alignment are discussed in section 5.5. Some extensions of the proposed method are shown in section 5.6.

5.1 Motivation

Since the pronominal subject in Italian can be omitted, the English subject pronoun is often aligned with arbitrary Italian words. These include, for example, conjunctions and punctuation. Knowing the word category of these Italian words, rules can be applied which prohibit the alignment of the English subject pronoun with these words. The rules are based on the PoS of the words whose alignment is to be computed.

If an Italian subject pronoun is omitted, the information about the subject is provided by a finite verb which is aligned with the English finite verb (cf. section 2.2.2). What I would like to achieve is the alignment of the English subject pronoun with the Italian verb that corresponds to the English finite verb. This is the reason why not only the English subjects are examined, but also all verbal elements of VPs. In the following, the term VP denotes the part of a VP which contains only verbal elements and a negation.

Since the sequence of the verbs in English VPs can be interrupted by adverbs or embedded clauses, parse trees of the English input are used to identify English VPs (verbal elements and negation) correctly. The Italian input has been PoS tagged to provide information about word categories. Since an Italian parser was not available, the Italian VPs are defined as PoS sequences.

For each sentence pair, the English parse tree is searched to find clauses with a pronominal subject.
The tagged Italian sentence is searched in order to find all Italian VPs. Using the base alignment of the elements of an English phrase which has a pronominal subject, the parallel Italian VP is identified. The alignment rules compute the alignment for PoS sequences of the parallel phrases, whereby only the PoS which mark verbs, negation and personal pronouns are taken into account. The rule-based VP alignment is integrated into the base alignment of the sentence in such a way that the base alignments of the phrase elements are removed.

I assume that every English VP with a pronominal subject has a parallel Italian VP (cf. (A1) in section 1.2). This assumption is made to limit the number of PoS sequences for which the alignment rules are defined (cf. (A3) in section 1.2). Furthermore, I assume that the base alignment is correct enough to allow for the identification of parallel English and Italian VPs (cf. (A2) in section 1.2). The assumptions hold in many cases, but they also lead to problems which will be shown in section 5.5.2.

In the following, the algorithm for the application of alignment rules is presented (cf. section 5.3), as well as the rules based on word categories (expressed by PoS tags) of English and Italian words (cf. section 5.4).

5.2 Data preparation

The alignment rules have been developed and applied on a reduced version of the Europarl corpus [Koehn, 05] consisting of 749,646 parallel sentences. Since the alignment rules do not operate on the word level but on the PoS level, it was necessary to preprocess the parallel corpus. The English sentences have been parsed in order to simplify the search for pronominal subjects and VPs. Since English parse tree nodes are underspecified with respect to the grammatical function of phrases, I wrote a program which enriches relevant nodes with their grammatical function. Since a parser for Italian was not available, the Italian sentences have been PoS tagged in order to get information about the word categories.
In the following, I describe the steps in the data preparation process.

5.2.1 English

The English sentences have been parsed with the generative parser [Charniak, 00]. The parse trees make it possible to identify the subclauses of the input sentence, the subjects and the VPs. The parser also assigns to each word its part of speech tag¹², which is needed to match conditions in the alignment rules. An example parse tree is shown in (52).

¹² English PoS are listed in appendix B.

(52) I would like your advice about Rule 143 concerning inadmissibility.

     (TOP (S (NP (PRP I))
             (VP (MD would)
                 (VP (VB like)
                     (NP (NP (PRP your) (NN advice))
                         (PP (IN about) (NP (NNP Rule) (CD 143)))
                         (VP (VBG concerning) (NP (NN inadmissibility))))))
             (. .)))

NP nodes in the parse tree in (52) are not specified with respect to their grammatical functions. To determine whether an NP is a subject or an object, the context has to be taken into account. The assumption that the first NP under S (representing the topic position) is a subject does not always hold (cf. the parse tree in (53)), which makes the search for a (pronominal) subject more complicated.

(53) This makes it necessary to also take account of the ways in which materials and packaging are affected by cold of this kind.

     (TOP (S (NP1 (DT This))
             (VP1 (VBZ makes)
                  (S (NP2 (PRP it))
                     (ADJP (JJ necessary)
                           (S (VP2 (TO to)
                                   (ADVP (RB also))
                                   (VP (VB take)
                                       (NP ...
             (. .)))

NP2 in (53) is actually an object of the verb in the preceding subclause (with the sentence predicate makes) and a subject of VP2. Thus, the underspecification of NP nodes requires a context check (father and sister nodes) in order to identify the subject of a VP.

The underspecification of NP nodes is not the only problem for the correct identification of a subject and its VPs. There are verbs which subcategorize an infinitival verb phrase (to-infinitive), for example I would like [to say]_XCOMP (cf. examples (9b) - (9d) in section 2.2.1).
The extracted VP which belongs to a pronominal subject should also contain a subcategorized infinitive. To-infinitives can be embedded in various nodes, for example in VP or ADJP (as in the parse tree in (53)). Since I wanted to handle only to-infinitives which are subcategorized by a verb in a preceding clause (and not, for example, by an adjective as in (53)), it was also necessary to examine the context of to-infinitives (VP nodes) in order to decide whether a to-infinitive should be treated as a part of the finite VP whose alignment is to be computed.

There are two ways to solve the problem of identifying subjects and VPs. One way is a runtime examination of the context of the corresponding nodes, and the other way is to enrich the parse trees with function tags as a part of the data preprocessing. I chose the second approach, which resulted in a program which enriches English parse trees¹³, and a relatively simple method for subject and VP extraction from a modified parse tree.

The tool which transforms the original Charniak parse trees examines only NP and S nodes, enriching them with a tag expressing their grammatical function. The transformation rules examine the context of the relevant nodes. If the conditions for a specific function tag are met, the original node is enriched with the corresponding function tag.

The transformation rules for NP are given in (54). NP nodes are marked as subject NPs only if the father node is S or SBAR, they are not preceded by a VP (for example [Let_VB]_VP [me_PRP]_NP [say_VB]_VP) and a sister VP is not a to-infinitive (for example [[us_PRP]_NP [to_TO say_VB]_VP]_S). In an interrogative sentence, the finite verb in front of a subject is not embedded in a VP, for example [[Can_MD] [you_PRP]_NP [say_VB]_VP ... ]_S, so that, in this case, the NP would be identified as a subject NP. The conditions for subject NPs are summarized in the rule (54a). If these conditions are not fulfilled, the NP is an object (NP rule (54b)).
Furthermore, the NP node is identified as an object when the father node is a VP (NP rule (54c)).

(54) a. NP → NP-SUBJ
        if the father node is S or SBAR, and there is no preceding VP under the father node, and if there is a sister VP node, it is not a to-infinitive
     b. NP → NP-OBJ
        if the father node is S or SBAR, and there is a preceding VP or a sister VP node which is a to-infinitive
     c. NP → NP-OBJ
        if the father node is a VP

It was also necessary to examine S nodes to determine whether they consist of a to-infinitive. If this is the case, and the infinitive is subcategorized by a verb in the preceding clause, the S node should be annotated with a function tag that reflects these features, namely S-XCOMP. For example, the examination of the phrase I would like to say, in which the to-infinitive to say is embedded in the category S, should identify the to-infinitive as an infinitive which belongs to the preceding finite verb. If the example phrase is modified, resulting in the phrase I would like you to say, the to-infinitive should not be determined as a part of the VP [would like]_VP, since its subject is not I. The parses for such sentences are shown in (55) and (56).

¹³ [Blaheta, 2004] developed a function tagger which provides parse trees with function tags annotated to the phrases and words. I was not able to run this tagger, which is why I implemented my own tool for this task. It is important to note that my program enriches only the nodes which are relevant for the presented work.

(55) I would like once again to wish you ...

     (S1 (NP (PRP I))
         (VP (MD would)
             (VP (VB like)
                 (S2 (ADVP (RB once) (RB again))
                     (VP (TO to)
                         (VP (VB wish)
                             (NP (PRP you)) ...
         (. .)))

(56) I would therefore once more ask you to ensure ...

     (TOP (S3 (NP (PRP I))
              (VP (MD would)
                  (ADVP (RB therefore))
                  (VP (ADVP (RB once) (JJR more))
                      (VB ask)
                      (NP (PRP you))
                      (S4 (VP (TO to)
                              (VP (VB ensure) ...
              (. .)))

The transformation rules for S nodes are given in (57).
The condition for applying these transformation rules is that an S or SBAR node is embedded in a VP. If the nodes have a preceding sister node NP (as S4 in (56)), they should be marked as S-OBJXCOMP, expressing that the to-infinitive does not have the same subject as the VP in the superordinate clause (cf. S rule (57b)). If this is not the case, the node should be marked as S-XCOMP, expressing that the to-infinitive belongs to the superordinate VP (cf. S2 in (55)). This is achieved by the S rule (57a).

(57) a. S, SBAR → S-XCOMP, SBAR-XCOMP
        if the father node is VP and it is not preceded by a sister node NP
     b. S, SBAR → S-OBJXCOMP, SBAR-OBJXCOMP
        if the father node is VP and it is preceded by a sister node NP

The application of the transformation rules in (54) and (57) to the parse trees in (55) and (56) results in the modified parse trees shown in (58) and (59).

(58) I would like once again to wish you ...

     (S1 (NP-SUBJ (PRP I))
         (VP (MD would)
             (VP (VB like)
                 (S2-XCOMP (ADVP (RB once) (RB again))
                           (VP (TO to)
                               (VP (VB wish)
                                   (NP-OBJ (PRP you)) ...
         (. .)))

(59) I would therefore once more ask you to ensure ...

     (TOP (S3 (NP-SUBJ (PRP I))
              (VP (MD would)
                  (ADVP (RB therefore))
                  (VP (ADVP (RB once) (JJR more))
                      (VB ask)
                      (NP-OBJ (PRP you))
                      (S4-OBJXCOMP (VP (TO to)
                                       (VP (VB ensure) ...
              (. .)))

Modified parse trees such as (58) and (59) simplify the search for English subjects and the corresponding VPs. Having enriched the NP and S nodes, we can search directly for the nodes that correspond to the phrases we are interested in.

5.2.2 Italian

The Italian sentences have been tagged with TreeTagger [Schmid, 95], creating an input text consisting of the words with their PoS¹⁴. The PoS tagged Italian sentence in (60b) corresponds to the sentence in (60a). The words are enriched with their PoS; # is the delimiter between a word and its PoS.

(60) a. Non credo   però che  la  relazione arrivi tardi.
        not believe but  that the report    comes  late.
        'But I do not believe that the report comes too late.'
     b.
Non#NEG credo#VER:fin però#ADV che#CHE la#ART relazione#NOUN arrivi#VER:fin tardi#ADV .#SENT

On the basis of the PoS, we can identify the verbs (VER:fin, VER:infi, VER:ppast, VER2:fin, VER2:geru, etc.), negation (NEG), and subject pronouns (PRO:pers, PRO:demo) in the tagged Italian input. For example, the PoS VER:fin bears the information that credo is a finite verb.

[14] Italian PoS are listed in appendix A.

5.2.3 Data preprocessing errors

The tagger for Italian which was used in this work was trained using the Italian morphological lexicon MorphIt [Zanchetta & Baroni, 05] and a set of about 100,000 manually tagged words from the newspaper corpus Repubblica [Baroni et al., 04]. The accuracy of the tagger can be compared with the accuracy of the Italian TreeTagger [Schmid, Baroni et al., 2007] reported for the part-of-speech tagging task of EVALITA 2007 [15]. The TreeTagger reaches an accuracy of 97%. However, the examination of the tagged Italian input revealed that some words are often tagged incorrectly. Example (61a) shows the Italian counterpart of the English sentence in (52) [16].

(61) a. Gradirei avere il suo parere riguardo all' articolo 143 sull' inammissibilità.
     b. Gradirei#VER:fin avere#VER:infi il#ART suo#DET:poss parere#NOUN riguardo#VER:fin all#NOUN '#PUN articolo#NOUN 143#NUM sull#NOUN '#PUN inammissibilità#ADJ .#SENT

The example sentence in (61b) contains a relatively large number of incorrectly tagged words. For example, the ambiguous word form riguardo, which can be a noun (= consideration) or a verb (= I concern), is treated in (61b) as a finite verb, which is not correct in this context. One of the common tagging errors concerns prepositions merged with a definite article. When these word forms appear in front of a word which begins with a vowel, they end with an apostrophe: sull', all'. The tagger does not recognize these word forms as merged forms of an article and a preposition, which would receive the tag ARTPRE, but as words of an arbitrary category (for example, as a noun (NOUN) or a finite verb (VER:fin)). I corrected the article tag errors manually to reduce the number of false verb tags, since they lead to erroneous identification of the Italian VPs.

[15] The comparison of the evaluation results was suggested by M. Baroni (pers. comm.).
[16] These examples are taken from the parallel corpus Europarl.

5.3 Applying alignment rules

The program for base alignment improvement expects a set of parallel sentences of Italian (with PoS) and English (as parse trees) as input. Details about the corpus preparation have already been described in section 5.2. The parallel sentences are automatically word aligned with GIZA++ [Och & Ney, 03]. This base word alignment is the basis for the rule-based VP alignment.

 1: function correct_align(en_parse, it_tagged, base_align)
 2:     new_align                                                  ▷ New alignment
 3:     for e in en_parse do                                       ▷ English sentence
 4:         e_subj_verb ← search_subj_verb(e)
 5:         phrase_pair ← search_it_vp(e_subj_verb, it_tagged, base_align)
 6:         new_align ← align(phrase_pair, base_align)
 7:         pos_pattern ← derive_pos_pattern(new_align)
 8:     end for
 9:     return new_align
10: end function

Figure 1: Main program: correct_align

The main program is shown in figure 1. Several steps are carried out for each sentence pair, beginning with a check whether the English sentence e contains a pronominal subject. After identifying the English pronominal subject and its verbs (line 4 in figure 1), the program looks for the Italian VP which the English words are aligned with (line 5 in figure 1). The procedure which fulfils this task is described in section 5.3.2. The output of this procedure is a phrase pair containing words enriched with information about the word category (PoS) and the position of the word in the sentence. Having a phrase pair whose alignment should be computed, we can now call the function which computes the alignment of the given phrase pair by applying PoS-based alignment rules (line 6 in figure 1). The program also derives PoS and PoS patterns of the parallel phrases (line 7 in figure 1).
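To make the Italian input format concrete, the following is a minimal sketch of how a tagged sentence of the form word#PoS could be read and searched for finite verbs and negation. It is an illustration only: the names Token, parse_tagged, is_finite_verb and is_negation are hypothetical and are not taken from the implementation described here.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    position: int  # word position in the sentence, starting at 0
    word: str
    pos: str       # TreeTagger-style PoS, e.g. "VER:fin"


# Finite verb tags as used in the tagged examples of this chapter.
FINITE_TAGS = {"VER:fin", "VER2:fin", "AUX:fin"}


def parse_tagged(sentence: str) -> List[Token]:
    """Split a whitespace-separated 'word#PoS' sentence into tokens."""
    tokens = []
    for i, item in enumerate(sentence.split()):
        # Split at the last '#', which is the word/PoS delimiter.
        word, _, pos = item.rpartition("#")
        tokens.append(Token(i, word, pos))
    return tokens


def is_finite_verb(token: Token) -> bool:
    return token.pos in FINITE_TAGS


def is_negation(token: Token) -> bool:
    return token.pos == "NEG"


tagged = ("Non#NEG credo#VER:fin però#ADV che#CHE la#ART "
          "relazione#NOUN arrivi#VER:fin tardi#ADV .#SENT")
tokens = parse_tagged(tagged)
finite_verbs = [t.word for t in tokens if is_finite_verb(t)]
```

For the sentence in (60b), the sketch selects credo and arrivi as finite verbs and Non as the negation.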
The main program returns the computed VP alignment, which is then integrated into the word alignment of the sentence pair.

[17] (on the PoS patterns derived in line 7 of figure 1) The counts of the PoS occurrences could be used to compute the probability of translating an English PoS ep_i into an Italian PoS ip_j. The derived PoS patterns could be used to check the correctness of the found PoS patterns. Furthermore, the PoS translation pairs could be used to examine which tenses are mostly used in the given language pair. They would also allow the examination of tense similarity: how often the same tense is used in the parallel sentence pair, or how often the tense and voice diverge.

A graphical illustration of the complete system is shown in figure 2. Each box represents one processing step. The processing steps are explained in detail in the following sections.

[Figure 2 boxes: IT-EN parallel corpus; Preprocessing (EN: Charniak parser, IT: TreeTagger); Word alignment with GIZA++; Alignment improvement system (Enriching parse trees; Searching sentences with pronominal subject; Searching IT VPs; Identification of VP pairs; Aligning VP elements; Merging base alignment with VP alignment)]

Figure 2: System components

After the VP alignments are produced, they are merged with the base word alignment. In the resulting alignment, the pronominal subjects and VPs in both languages have only the alignments computed by the program for the VP alignment. The baseline alignments for these words are deleted.

The function align(phrase_pair, base_align) (line 6 in figure 1), which computes the alignment of the VP pairs, is shown in figure 3. The functions for the alignment of different word classes of English (align_subj(e, it), align_vfin(e, it), etc.) implement the alignment rules described in section 5.4. For the given English word, compatible Italian words are identified. The examination of the alignment takes only PoS into account. If there is no appropriate Italian word (with an appropriate PoS), the given English word stays unaligned.
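The merging step described above (baseline links of the words in the VP pair are deleted and replaced by the rule-based links) can be sketched as follows. This is a minimal sketch under the assumption that an alignment is represented as a set of (English position, Italian position) pairs; the function name merge_alignments and this representation are hypothetical.

```python
from typing import Set, Tuple

Alignment = Set[Tuple[int, int]]  # (English position, Italian position)


def merge_alignments(base_align: Alignment,
                     vp_align: Alignment,
                     en_phrase: Set[int],
                     it_phrase: Set[int]) -> Alignment:
    """Replace the base links of the phrase words by the VP alignment.

    Every base link that touches an English word of the phrase or an
    Italian word of the matching VP is dropped; all other base links
    are kept unchanged.
    """
    kept = {(en, it) for en, it in base_align
            if en not in en_phrase and it not in it_phrase}
    return kept | set(vp_align)


# Positions follow the 'if you wish' / 'se lo desidera' example in
# section 5.4.2: if9-se8, you10-lo9, wish11-desidera10.
base = {(9, 8), (10, 9), (11, 10)}
vp = {(10, 10), (11, 10)}
merged = merge_alignments(base, vp, en_phrase={10, 11}, it_phrase={10})
```

In this example the base link between you and the clitic lo disappears because you belongs to the English phrase, while the link between if and se lies outside the phrase pair and is kept.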
 1: function align(phrase_pair, base_align)
 2:     new_align                                                  ▷ Computed alignment
 3:     en ← english_words(phrase_pair)
 4:     it ← italian_words(phrase_pair)
 5:     for e in en do
 6:         if subject(e) = True then
 7:             new_align.append(align_subj(e, it))
 8:         else if finverb(e) = True then
 9:             new_align.append(align_vfin(e, it))
10:         else if infpartger(e) = True then
11:             new_align.append(align_infpartger(e, it))
12:         else if negation(e) = True then
13:             new_align.append(align_neg(e, it))
14:         else if infparticle(e) = True then
15:             new_align.append(align_infpart(e, it))
16:         end if
17:     end for
18:     return new_align
19: end function

Figure 3: Alignment check and improvement

5.3.1 Identification of Italian VPs

Since no Italian parser was available, the extraction of correct Italian verb phrases using the base word alignment posed a great problem. In this work, I made the assumption that the base alignment is sufficiently correct to make it possible to find the Italian phrase corresponding to the given English phrase. Unfortunately, the Italian phrases found in this way were often incomplete, i.e. some VP elements were missing. Therefore, I identify all Italian VPs before the search for a matching Italian VP, which is described in the following section, is carried out.

The identification of Italian VPs is based on PoS. I defined the PoS which mark the start of a VP, and the PoS of other verbal elements which can be a part of a VP. An Italian sentence is searched until a PoS is found that can be the start of a VP. From this sentence position, the search for other elements continues until the sentence end or another VP-starting PoS is reached. The search function returns the Italian word sequences that contain a personal pronoun, negation and verbal elements of a VP. Other VP elements are ignored. This method finds not only finite VPs starting with a pronoun, finite verb or negation, but also infinitival VPs and gerundive phrases, which often consist of only one gerundive.

(62) a.
Perché non esistono istruzioni da seguire in caso di incendio?
        why not exist instructions to continue in case of fire?
        'Why are there no instructions in case of fire?'
     b. Perché#WH non#NEG esistono#VER:fin istruzioni#NOUN da#PRE seguire#VER:infi in#PRE caso#NOUN di#PRE incendio#NOUN ?#SENT

The sentence in (62b) consists of two VPs. The implemented method for the identification of Italian VPs finds the following verb phrases:

1. non#NEG esistono#VER:fin
2. da#PRE seguire#VER:infi

Infinitival phrases like [da seguire]XCOMP are extracted as independent phrases since they often correspond to complete English clauses. An example of such a case is shown in figure 4. The English VP [would ask]VP does not include the to-infinitive [to request]XCOMP since the to-infinitive and the finite verb phrase do not have the same subject. For this reason, it would be wrong to handle the Italian infinitive [di chiedere]VP as a part of the Italian VP [prego]VP which corresponds to the English VP [would ask]VP. It is also possible to translate a finite English clause as an Italian infinitival clause. This is an additional reason why I handle Italian infinitives as separate VPs.

EN: I/PRP would/MD ask/VB you/PRP to/TO request/VB
IT: la/CLI prego/VER:fin di/PRE chiedere/VER:infi

Figure 4: Alignment of I would ask you to request and la prego di chiedere

These rather simple rules for the detection of complete Italian VPs do not always provide correct verb phrases. Mistakes are made if the word order is changed, or if a sequence of VP start elements occurs. Furthermore, incorrect tagging also leads to the identification of false Italian VPs, as shown in (63b) (cf. section 5.2.3).

(63) a. Come avrete avuto modo di constatare il grande baco del millennio non si è realizzato.
        as have had way to observe the big bug of the millennium not
was realized.
        'As you could have seen, the millennium bug did not materialize.'
     b. come#WH avrete#AUX:fin avuto#VER:ppast modo#NOUN di#PRE constatare#VER:infi il#ART grande#ADJ "#PUN baco#VER:fin del#ARTPRE millennio#NOUN "#PUN non#NEG si#CLI è#AUX:fin materializzato#VER:ppast .#SENT

The method for the identification of Italian VPs finds the following verb phrases for the sentence in (63b):

1. avrete#AUX:fin avuto#VER:ppast
2. di#PRE constatare#VER:infi
3. baco#VER:fin
4. non#NEG è#AUX:fin materializzato#VER:ppast

The VP in 3 (baco#VER:fin) is not correct. Baco (= the bug (noun)) has been assigned the wrong PoS, resulting in the extraction of a false VP. Although the rules for the identification of Italian VPs can lead to false VPs, they provide a relatively good basis for the search for the Italian VP that corresponds to a given English phrase, which is described in the following section.

5.3.2 Identification of the most probable Italian VP

Good VP alignment results can be achieved by applying the alignment rules only if the rules are applied to parallel English and Italian VPs. The procedure for searching for the matching Italian VP given an English VP is shown in figure 5.

The method for determining the best Italian VP given an English VP is based on a count of the alignments between the English and Italian words in these phrases. I thus assume that the base alignment is correct on the level of phrase alignment. This means that the best Italian phrase has the most base alignment links to the given English word sequence. The search function in figure 5 receives as input the English subject and its verbs, a list of Italian VPs extracted from the parallel sentence (as described in the previous section), and the base alignment. For each Italian VP, the number of alignment links between its elements and the English input is computed. The VP with the most alignment links represents the best Italian VP for the English input.
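The link-counting idea described above can be sketched in a few lines. This is a minimal sketch, not the implementation used in this work: it assumes that words are represented by their sentence positions and the base alignment by a set of (English position, Italian position) pairs. The helper count_links is hypothetical, and the behaviour for ties (the first candidate wins) and for an empty candidate list (None) is an assumption, since neither case is specified in the text.

```python
from typing import List, Optional, Set, Tuple

Alignment = Set[Tuple[int, int]]  # (English position, Italian position)


def count_links(en_positions: Set[int],
                it_vp: List[int],
                base_align: Alignment) -> int:
    """Number of base links between the English phrase and one Italian VP."""
    it_set = set(it_vp)
    return sum(1 for en, it in base_align
               if en in en_positions and it in it_set)


def search_it_vp(en_positions: Set[int],
                 it_vps: List[List[int]],
                 base_align: Alignment) -> Optional[List[int]]:
    """Return the Italian VP with the most base links to the English phrase.

    Assumption: on a tie the first candidate is returned (behaviour of max),
    and None is returned if the sentence contains no Italian VP at all.
    """
    if not it_vps:
        return None
    return max(it_vps,
               key=lambda vp: count_links(en_positions, vp, base_align))


# Hypothetical positions: English phrase at 4-5, two Italian candidate VPs.
best = search_it_vp({4, 5}, [[1, 2], [4, 5]], {(4, 1), (5, 2), (5, 4)})
```

Here the first candidate receives two links and the second only one, so the first candidate is selected.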
 1: function search_it_vp(en_subj_vp, it_vps, base_align)
 2:     word_pairs ← []                        ▷ EN and IT words which belong to parallel phrases
 3:     en_al ← Alignments of EN words
 4:     vp_links ← []                          ▷ Pairs: (IT VP, # links to its elements)
 5:     for it_vp in it_vps do                 ▷ Loop over all Italian VPs
 6:         links ← 0                          ▷ # alignments for EN phrase and IT candidate VP
 7:         for (en, it) in en_al do           ▷ Alignment pairs
 8:             if it ∈ it_vp then             ▷ Italian word it is a part of Italian VP it_vp
 9:                 links += 1
10:             end if
11:         end for
12:         vp_links.append(it_vp, links)
13:     end for
14:     best_vp ← max(vp_links)                ▷ Italian VP with most alignments
15:     word_pairs.append(elements_of(best_vp))
16:     return word_pairs
17: end function

Figure 5: Search for the best Italian VP

5.4 Alignment rules

The alignment rules define the alignment of the relevant sentence parts in an English and Italian parallel corpus. They are based on an already created alignment (base alignment) and the PoS of the words that are observed. In previous sections, it was already mentioned that these sentence parts are subject pronouns, negation and verbal sentence predicates.

In figure 3 in section 5.3, I showed the program for alignment improvement. It consists of a loop over the words of the English input phrase. For each word, its word category is derived (on the basis of PoS), and the function which computes the alignment for the found word category is called. There are five functions which compute the alignment for five word category groups. If the input word e is:

1. subject, i.e. its PoS is PRP, the function align_subj(e, it) is called.
2. finite verb, i.e. its PoS is:
   • AUX: auxiliary
   • MD: modal verb
   • VBZ: 3rd person singular present
   • VBP: non-3rd person singular present
   • VBD: past tense
   the function align_vfin(e, it) is called
3. infinitive, participle or gerundive, i.e. its PoS is:
   • VB: infinitive
   • VBN: past participle
   • VBG: gerundive
   the function align_infpartger(e, it) is called
4. negation, i.e.
its PoS is RB [18], the function align_neg(e, it) is called
5. infinitive particle to, i.e. its PoS is TO, the function align_infpart(e, it) is called.

In the following sections, the alignment rules for each word category are presented (cf. sections 5.4.2 - 5.4.6). But before the rules are described, we should examine the syntax of English and Italian VPs. The PoS sequence which occurs in a specific tense is crucial for defining the alignment rules. Each rule is applicable only if its context constraints are fulfilled. The context is defined by the PoS of the words of the given phrases.

[18] Since there is no difference between the tags for negation and other adverbs, the word forms tagged with RB had to be examined to identify the negation.

5.4.1 Syntax of the English and Italian VPs

The VP alignment rules use word categories expressed by PoS to compute the word alignment of parallel VPs. Since the alignment of a specific PoS is not always the same but context-dependent, it is necessary to examine which contexts (PoS sequences) are possible. With this information, constraints can be defined which limit the word alignment of a PoS to a specific PoS context. In the following, we take a closer look at the composition of English and Italian VPs (PoS sequences).

English

English tenses can be realized by one verb only or by a sequence of verbal elements. Since the alignment rules are based on PoS, we have to know which PoS sequences are possible in English VPs. We start with examples of tenses which have only one verb. In the following examples, only the relevant tokens are marked with their PoS.

(64) a. He/PRP sleeps/VBZ.
     b. It/PRP is/AUX nice.
     c. I/PRP sleep/VBP.
     d. He/PRP went/VBD home.

If we would like to negate the sentences in (64), we would get composed VPs containing an auxiliary, a negation and an infinitive.

(65) a. I/He/PRP do/does/did/AUX not/RB sleep/VB.
     b. I/He/PRP do/does/did/AUX not/RB have/do/AUX it.

Constructions with modal verbs are shown in (66).

(66) a.
He/PRP will/would/MD (not/RB) sleep/VB.
     b. He/PRP will/would/MD (not/RB) have/do/AUX it.
     c. He/PRP will/would/MD (not/RB) be/AUX sleeping/VBG.
     d. He/PRP will/would/MD (not/RB) be/AUX having/doing/AUXG.
     e. He/PRP will/would/MD (not/RB) have/AUX slept/VBN.
     f. He/PRP will/would/MD (not/RB) have/AUX had/done/AUX it.
     g. He/PRP would/MD (not/RB) have/AUX been/AUX sleeping/VBG.
     h. He/PRP would/MD (not/RB) have/AUX been/AUX having/doing/AUXG this.

The following example sentences show the tenses which contain an auxiliary.

(67) a. He/PRP is/was/AUX (not/RB) sleeping/VBG.
     b. He/PRP is/was/AUX (not/RB) having/doing/AUXG this.
     c. He/PRP has/had/AUX (not/RB) slept/VBN.
     d. He/PRP has/had/AUX (not/RB) been/AUX sleeping/VBG.
     e. He/PRP has/had/AUX (not/RB) been/AUX having/doing/AUXG.
     f. I/PRP am/AUX going/VBG to/TO sleep/VB.
     g. I/PRP am/AUX going/VBG to/TO have/do/AUX this.

If English auxiliaries should be aligned differently depending on the VP that they belong to, the composition of the English VP has to be determined by examining its PoS sequence. If, for example, has/AUX (cf. example (67c)) should be aligned with the corresponding Italian auxiliary only if has/AUX is a part of a composed VP, we would require the English VP to contain a participle. Thus, in addition to AUX, the PoS sequence of the English VP should also contain the PoS VBN.

A closer look at the examples in (66) and (67) reveals that the PoS AUX is used not only for auxiliaries. The verbs am and do in (67g) have the same PoS although do should be considered here as a main verb [19]. This causes a problem because different word categories are handled by different sets of alignment rules. If the word category is erroneous, false alignment rules can be applied. We will come back to this problem in section 5.4.4.

Italian

Again, we start with tenses which have only one verb. The optional subject pronoun and the negation are put in brackets.
[19] [Charniak, 00] expands the Penn Treebank Tagset (listed in appendix B) with the tags AUX and AUXG, which are assigned to the auxiliaries.

(68) (Egli/PRO:pers) (non/NEG) dorme/dormivo/dormii/dormirò/VER:fin.
     He not sleeps/has slept/had slept/will sleep
     'He sleeps/has slept/had slept/will (not) sleep.'

(69) (Io/PRO:pers) (non/NEG) dormo/dormissi/dormirei/VER:fin.
     I not would sleep/would have slept/would sleep
     'I would sleep/would have slept/would (not) sleep.'

Italian composed tenses require an auxiliary or a modal verb, and a participle, an infinitive or a gerundive. The examples in (70) and (71) show simple sentences with the verb dormire (= to sleep), which takes the auxiliary avere (= to have). In Italian, there are also verbs like andare (= to go) and venire (= to come) which occur with the auxiliary essere (= to be). Since the PoS sequence is the same for tenses with both auxiliaries, I do not list example sentences with these verbs. The examples in (70) - (74) show all composed Italian tenses with all possible PoS sequences.

(70) (Egli/PRO:pers) (non/NEG) ha/avrà/AUX:fin dormito/avuto/VER:ppast.
     He (not) has/will have slept/had
     'He has/will have (not) slept.'

(71) (Io/PRO:pers) (non/NEG) abbia/avrei/AUX:fin dormito/VER:ppast.
     I (not) would have/will have slept
     'I would (not) have/would (not) have/will (not) have slept.'

(72) (Io/PRO:pers) (non/NEG) sto/AUX:fin dormendo/VER:geru.
     I am (not) sleeping
     'I am not sleeping.'

(73) (Io/PRO:pers) (non/NEG) posso/potrò/VER2:fin dormire/VER:infi.
     I (not) can sleep
     'I can not sleep.'

(74) (Io/PRO:pers) (non/NEG) sto/stavo/VER:fin facendo/VER:geru ...
     I not am/was doing
     'I am/was (not) doing ...'

Modal verbs subcategorize an infinitive as shown in (73). When these verbs are used in one of the tenses which are composed of an auxiliary and a participle, a different PoS sequence is generated (cf. example (75)).

(75) (Io/PRO:pers) (non/NEG) ho/avrei/AUX:fin potuto/VER:ppast constatare/VER:infi ...
     I (not) have can observe ...
     'I would (not) have/would (not) have/will (not) have observe ...'

Whereas some tenses in the passive voice (cf. the example in (76)) do not differ from the composed tenses shown in (74) regarding the PoS sequence, some past tenses in the passive voice require two forms of an auxiliary, as shown in (77).

(76) (Egli/PRO:pers) (non/NEG) è/sarà/AUX:fin amato/VER:ppast.
     He (not) is/was/will be loved
     'He is (not)/was (not)/will (not) be loved.'

(77) (Egli/PRO:pers) (non/NEG) è/era/AUX:fin stato/AUX:ppast amato/VER:ppast.
     He (not) is/will be been loved
     'He has (not) been/will (not) have been loved.'

There is one construction in Italian which is often used to abbreviate a finite clause. It consists of a gerundive and, if the verb is modal, of an infinitive. Since these constructions can be used as translations of English finite clauses, they should also be taken into account.

(78) (Non/NEG) ribadendo/VER:geru ...
     (not) stressing ...
     '(Not) stressing ... / I (do not) stress ...'

(79) (Non/NEG) volendo/VER:geru affrontare/VER:infi ...
     not wishing confront ...
     '(Not) wishing to confront ... / I (do not) wish to confront ...'

Let us now consider the parallel sentences (67c) and (70). We have already defined the English context constraints that have to be fulfilled if has/AUX should be aligned with the Italian auxiliary, in our example with ha/AUX:fin. If we would like to allow the alignment of the English finite auxiliary with any Italian finite verb form, the condition on Italian is that the Italian verb (here, the auxiliary ha/AUX:fin) is finite, i.e. its PoS must contain fin (e.g. AUX:fin, VER:fin, VER2:fin). In the following sections, the alignment rules for different word categories are presented.

5.4.2 Subject pronouns

Since the pronominal subject in English does not have to have a pronominal counterpart in the Italian parallel sentence, the alignment of the subject pronoun is often not correct. Figure 6 shows an example of an incorrect base word alignment [20].
EN: if9/IN you10/PRP wish11/VBP
IT: se8/CON lo9/CLI desidera10/VER:fin

Figure 6: Incorrect base alignment of if you wish and se lo desidera

[20] The subscripts in the alignment figures mark the word position in a sentence.

The phrases in figure 6 are taken from the sentences which are shown in (80) and (81).

(80) That is precisely the time when you may, if you wish, raise this question, ...

(81) È appunto in quell' occasione che, se lo desidera, avrà modo di sollevare la sua questione ...
     is exactly in this occasion that, if you wish, will have chance to raise the your question ...

Figure 6 shows an alignment of the embedded clauses if you wish and se lo desidera. The English subject pronoun is aligned with the Italian object clitic lo, whereas the predicate wish is correctly aligned with the Italian predicate desidera. In this case, it would be correct to align the English pronoun with the Italian finite verb since you wish should be translated as desidera [21]. The correct alignment is shown in figure 7.

EN: if9/IN you10/PRP wish11/VBP
IT: se8/CON lo9/CLI desidera10/VER:fin

Figure 7: Correct alignment of if you wish and se lo desidera

The correctness of the alignment in figure 7 is linguistically motivated. The information provided by the English subject pronoun you and the finite verb wish is the same as the information provided by the Italian finite verb desidera regarding person and number of the subject. This is why both English words should be aligned with the Italian verb. In general, the English pronoun should be aligned with the Italian finite verb that corresponds to the English finite verb. The subject alignment rule leads to a link between the English word with PoS PRP (English subject pronoun) and the Italian word with PoS VER:fin (finite verb), VER2:fin (finite modal verb), or AUX:fin (finite form of an auxiliary). This rule is summarized in (82a).
In figure 7, the alignment rule (82a) leads to the deletion of the base alignment link between the English pronoun you and the Italian object clitic lo. Since lo does not have further base alignment links, it remains unaligned in the given sentence pair.

Yet, the Italian subject pronoun is not always omitted. If it is expressed overtly, the English subject pronoun should be aligned only with it. The pronouns bear the same information about number, person and gender of the subject. In such a context, the Italian verb is not needed to derive these characteristics of the English subject pronoun, so I do not align the pronoun with the Italian finite verb. This rule is presented in (82b). The rule associates the English pronoun (PRP) with the Italian pronouns (PRO:pers - personal pronoun, PRO:demo - demonstrative pronoun).

An English finite clause can also be translated as a gerundive construction in Italian. The gerundive bears the semantics that corresponds to the semantics of the English predicate (for example, I/PRP think/VBP ↔ pensando/VER:geru). In such constructions, the aim is to align the English pronoun and the finite verb with the same Italian verb (here, the gerundive). This rule is expressed in (82c).

[21] The Italian finite verb is 3rd person singular, so this should be understood as a polite form of address where the addressee is one person.

When the Italian VP is an infinitive construction (for example, I/PRP have/AUX thought/VBN ↔ aver/AUX:infi pensato/VER:ppast), the English subject pronoun should be aligned with the infinitive form of the Italian auxiliary. Thus, the alignment between the pronoun (PRP) and the infinitival auxiliary (AUX:infi) has to be allowed. Another possible infinitival construction in Italian consists of a preposition (PRE) and an infinitive (*:infi:*), for example I believe, I/PRP know/VBP this ↔ Credo di/PRE saperlo/VER:infi:cli.
The pronoun I should be aligned with the Italian preposition di to satisfy the condition of being aligned with the same word as its finite verb [22]. This is expressed in the rule (82d).

(82) a. EN: subject pronoun → IT: finite verb if IT does not have a subject pronoun
        EN: PRP → IT: {VER:fin, VER2:fin, AUX:fin}
     b. EN: subject pronoun → IT: subject pronoun if IT has a subject pronoun
        EN: PRP → IT: {PRO:pers, PRO:demo}
     c. EN: subject pronoun → IT: gerundive if IT is a gerundive construction
        EN: PRP → IT: {VER:geru, VER2:geru}
     d. EN: subject pronoun → IT: infinitival particle or infinitive auxiliary if IT is an infinitive construction
        EN: PRP → IT: {PRE, AUX:infi}

Figure 8 is an example of the alignment rule (82a) [23]. The rule (82a) leads to the alignment of the English subject pronoun I with the Italian finite verb posso (VER2:fin).

EN: I4/PRP can5/MD tell6/VB you7/PRP
IT: posso4/VER2:fin risponderle5/VER:infi:cli

Figure 8: Alignment of I can tell you and posso risponderle

Figure 9 shows an example of the alignment rule (82b). The English personal subject pronoun it is aligned only with the Italian pronominal subject esso. In figure 10, a phrase pair is shown to which the alignment rule (82c) can be applied. It allows the alignment of an English pronoun with an Italian gerundive. Figure 11 shows an example of the alignment rule (82d), which allows the alignment of an English pronoun with an Italian infinitive auxiliary.

[22] Finite verb alignment rules are discussed in the next chapter.
[23] The alignments marked with dotted lines are not of interest at this point.
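The four subject rules in (82) can be read as a cascade of PoS checks on the Italian VP. The following is a minimal sketch of that reading; the function name follows the align_subj of figure 3, but the signature, the token representation and the order in which the rules are tried (pronoun first, then finite verb, gerundive, infinitive construction) are assumptions made for illustration.

```python
from typing import List, Tuple

ItToken = Tuple[str, str]  # (word, PoS)

PRONOUNS = {"PRO:pers", "PRO:demo"}          # rule (82b)
FINITE = {"VER:fin", "VER2:fin", "AUX:fin"}  # rule (82a)
GERUNDS = {"VER:geru", "VER2:geru"}          # rule (82c)
INF_CONSTRUCTION = {"PRE", "AUX:infi"}       # rule (82d)


def _positions(it_vp: List[ItToken], tags: set) -> List[int]:
    return [i for i, (_, pos) in enumerate(it_vp) if pos in tags]


def align_subj(it_vp: List[ItToken]) -> List[int]:
    """Positions in the Italian VP that the English pronoun is linked to.

    Assumed priority: an overt Italian subject pronoun wins; otherwise the
    finite verb, then a gerundive, then a preposition or infinitive
    auxiliary. An empty list means the pronoun stays unaligned.
    """
    for tags in (PRONOUNS, FINITE, GERUNDS, INF_CONSTRUCTION):
        hits = _positions(it_vp, tags)
        if hits:
            return hits
    return []


# 'if you wish' / 'se lo desidera': no overt subject, finite verb present.
links = align_subj([("non", "NEG"), ("desidera", "VER:fin")])
```

With an overt pronoun, as in esso approva, the sketch links the English pronoun to esso only, which corresponds to rule (82b).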
EN: it20/PRP actually21/RB passes22/VBZ
IT: esso19/PRO:pers stesso20/ADJ approva21/VER:fin

Figure 9: Alignment of it actually passes and esso stesso approva

EN: I0/PRP would1/MD say2/VB
IT: volendo0/VER2:geru dire1/VER:infi

Figure 10: Alignment of I would say and volendo dire

EN: I0/PRP have1/AUX said2/VBN
IT: aver0/AUX:infi detto1/VER:ppast

Figure 11: Alignment of I have said and aver detto

5.4.3 Finite verbs

After dealing with English subject pronouns, we now examine the verbal elements of English sentences containing a subject pronoun. Let us first examine the example base alignment presented in figure 12.

EN: I5/PRP feel6/VBP
IT: ritengo6/VER:fin che7/CHE

Figure 12: Incorrect base alignment of I feel and ritengo

The sentences which contain the phrases in figure 12 are shown in (83) and (84).

(83) Yes, Mr Evans, I feel an initiative of the type you have just suggested would be entirely appropriate.

(84) Sì, Onorevole Evans, ritengo che un' iniziativa del tipo che lei propone sia assolutamente opportuna.
     yes, mr evans, believe that a initiative of the type that you suggest would be absolutely appropriate.

The English finite verb feel should be aligned only with the Italian finite verb form ritengo. The motivation for this assumption is that their semantic features are similar. They have the same word category and share the same verbal features (tense, finiteness, person, number) [24]. Following this idea, the correct alignment of the phrases in figure 12 is presented in figure 13. The base alignment link between feel and che is removed. The English verb is aligned only with the Italian verb [25].

EN: I5/PRP feel6/VBP
IT: ritengo6/VER:fin che7/CHE

Figure 13: Correct base alignment of I feel and ritengo

If parallel sentences both consist of a finite VP, the finite verbs in both languages should be aligned to each other.
This means that English words with PoS VBZ, VBD, VBP, AUX, MD are aligned with the Italian words with PoS VER:fin, VER2:fin, AUX:fin. This is stated in the rule in (85a).

The English finite verbs can also be auxiliaries (AUX) or modals (MD). I refer to both types of verbs as auxiliaries. If an English (finite) auxiliary is to be aligned, it should be aligned with the Italian (finite) auxiliary (or auxiliaries). If we have VPs that differ in voice (active vs. passive), the English finite verb or auxiliaries should be aligned with the Italian finite auxiliaries or their participles (cf. rule (85b)).

If the English VP consists of only one verb whereas the Italian VP is composed, the English finite verb should be aligned with all Italian verbs. For example, if we examine the parallel VPs you/PRP said/VBD ↔ abbiate/AUX:fin detto/VER:ppast, we see that the English finite verb said bears the same verb features as the Italian composed VP [abbiate detto]VP. They both express a past action. Thus, we would like to translate the English past tense (in the example, said) into the corresponding past tense in Italian, which consists not only of the participle detto but also of the auxiliary abbiate. So, the English verb should be aligned with both Italian verbs (verb alignment rules (85a) and (85c)). This alignment rule should lead to an alignment between an English word with PoS VBZ, VBD, VBP, AUX, MD and an Italian participle or infinitive with PoS AUX:ppast, VER:ppast, VER2:ppast, VER:infi, VER2:infi. Furthermore, the combination of the rules (85a) and (85c) satisfies the condition that the English pronoun and its finite verb should both be aligned with the same Italian finite verb (if the subject pronoun does not exist in Italian).

If the Italian parallel VP is a gerundive or an infinitive construction consisting of a preposition, the English finite verb should be aligned with the Italian gerundive (cf. rule (85e)) or with the Italian preposition (cf. rule (85d)), respectively.

[24] Although parallel verbs sometimes have different verbal features, they should be aligned, satisfying the condition that the same word categories should be associated with each other.
[25] In this work, only the alignment of verbal sentence elements has been modified. The definition of an alignment of subcategorized conjunctions is out of the scope of this thesis. Furthermore, the removal of the link between feel and che in figure 13 still allows the extraction of the translation phrase pairs (I feel ↔ ritengo) and (I feel ↔ ritengo che).

Finite verb alignment rules are summarized in (85).

(85) a. EN: finite verb → IT: finite verb
        EN: {VBZ, VBD, VBP, AUX, MD} → IT: {VER:fin, VER2:fin, AUX:fin}
     b. EN: finite verb → IT: participle form of auxiliary if IT VP has a passive voice
        EN: {VBZ, VBD, VBP, AUX, MD} → IT: {AUX:ppast}
     c. EN: finite verb → IT: participle or infinitive if EN VP is not composed
        EN: {VBZ, VBD, VBP, AUX, MD} → IT: {VER:ppast, VER2:ppast, VER:infi, VER2:infi}
     d. EN: finite verb → IT: infinitival particle if IT is an infinitive construction
        EN: {VBZ, VBD, VBP, AUX, MD} → IT: {PRE}
     e. EN: finite verb → IT: gerundive if IT is a gerundive construction
        EN: {VBZ, VBD, VBP, AUX, MD} → IT: {VER:geru, VER2:geru}

Figure 14 shows an alignment of the English finite verb enjoyed after applying the alignment rules (85a) and (85c). The link between the English finite verb and the Italian participle should be possible only if the English VP is not composed and the Italian VP consists of an auxiliary and a participle or infinitive.

EN: you33/PRP enjoyed34/VBD
IT: abbiate23/AUX:fin trascorso24/VER:ppast

Figure 14: Alignment of you enjoyed and abbiate trascorso

If the English VP is composed, and thus the condition for applying the rule (85c) is not fulfilled, only the alignment rule (85a) can be applied, resulting in the alignment shown in figure 15.
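One possible way to combine the finite verb rules in (85) is sketched below. It is only an illustration under several assumptions that are not stated in the text: the English VP counts as composed if it contains more than one verbal PoS, the passive case of rule (85b) is detected by the presence of an AUX:ppast token, and rule (85c) is combined only with the finite-verb and the infinitive-construction cases. The signature of align_vfin is hypothetical.

```python
from typing import List, Tuple

ItToken = Tuple[str, str]  # (word, PoS)

EN_VERBAL = {"VBZ", "VBD", "VBP", "AUX", "MD", "VB", "VBN", "VBG", "AUXG"}
IT_FINITE = {"VER:fin", "VER2:fin", "AUX:fin"}
IT_GERUNDS = {"VER:geru", "VER2:geru"}
IT_PPAST = {"VER:ppast", "VER2:ppast"}
IT_INF_PREFIXES = ("VER:infi", "VER2:infi")  # also matches VER:infi:cli


def align_vfin(en_vp_pos: List[str], it_vp: List[ItToken]) -> List[int]:
    """Italian positions linked to the English finite verb."""
    def where(pred):
        return [i for i, (_, pos) in enumerate(it_vp) if pred(pos)]

    # Assumption: 'composed' means more than one verbal element in EN.
    en_composed = sum(pos in EN_VERBAL for pos in en_vp_pos) > 1
    finite = where(lambda p: p in IT_FINITE)
    gerunds = where(lambda p: p in IT_GERUNDS)

    links: List[int] = []
    if finite:
        links += finite                             # rule (85a)
        links += where(lambda p: p == "AUX:ppast")  # rule (85b)
    elif gerunds:
        return gerunds                              # rule (85e)
    else:
        links += where(lambda p: p == "PRE")        # rule (85d)
    if not en_composed:                             # rule (85c)
        links += where(lambda p: p in IT_PPAST
                       or p.startswith(IT_INF_PREFIXES))
    return sorted(set(links))


# 'you enjoyed' / 'abbiate trascorso': single English verb, composed IT VP.
enjoyed = align_vfin(["PRP", "VBD"],
                     [("abbiate", "AUX:fin"), ("trascorso", "VER:ppast")])
```

Under these assumptions, enjoyed is linked to both abbiate and trascorso, whereas have in you have requested would be linked to avete only.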
EN: you0/PRP have1/AUX requested2/VBN
IT: avete0/AUX:fin chiesto1/VER:ppast

Figure 15: Alignment of you have requested and avete chiesto

Figure 16 shows an example of the alignment rules (85a) and (85b). The English finite verb were is aligned with two Italian words: the finite verb siamo and the second auxiliary stati, which is a participle. As already mentioned, English auxiliaries should be aligned with Italian auxiliaries.

EN: we13/PRP were14/AUX elected15/VBN
IT: siamo17/AUX:fin stati18/AUX:ppast eletti19/VER:ppast

Figure 16: Alignment of we were elected and siamo stati eletti

Figure 17 shows parallel VPs of a different type. Whereas in English we have a finite subordinate clause with the predicate had, in Italian the infinitival construction di avere is used. The sentences that the phrases in figure 17 are extracted from are shown in (86) and (87).

(86) ... that everybody would make certain that they had adequate ...

(87) ... che tutti si accertino di avere una formazione adeguata ...
     ... that all ensure to have a education adequate ...

The alignment rule (85d) allows the English finite verb had to be aligned with the Italian infinitival particle di. Combining this rule with the rules (82d) for pronouns and (85c) for finite verbs, the alignment shown in figure 17 is computed.

EN: they17/PRP had18/AUX
IT: di2/PRE avere3/VER:infi

Figure 17: Complete alignment of they had and di avere

5.4.4 Participles, infinitives and gerundives

The infinitive, participle or gerundive form of a verb is a part of a VP if the VP is composed. Auxiliaries are used to build some tenses, but the meaning of a VP is provided by the infinitive or participle form of the main verb. For this reason, the alignment rules for English infinitives, participles and gerundives should lead to an alignment between English infinitives, participles and gerundives and Italian infinitives, participles and gerundives. This is stated in the rule (88b).
An example is shown in figure 18.

[Figure 18: Alignment of you have requested and avete chiesto. English: you0/PRP have1/AUX requested2/VBN; Italian: avete0/AUX:fin chiesto1/VER:ppast]

The alignment between English and Italian participles, infinitives and gerundives is possible only if the Italian VP is composed, which is not necessarily the case. Italian can use a tense that does not require an auxiliary, so that all English verbs, including the participle, should be aligned with the Italian finite verb, as shown in figure 19. The same holds for infinitives which occur with modal verbs. The English participle should be aligned with the Italian verb which has the same or similar semantic features. The Italian verb form should be aligned with both the English auxiliary and the main verb in order to express the same tense. This leads to the definition of the rule in (88a).

[Figure 19: Alignment of you have requested and chiedevate. English: you/PRP have/AUX requested/VBN; Italian: chiedevate/AUX:fin]

The rules handling these cases are summarized in (88). If we take a closer look at rule (88a), we see that in some cases the information provided by the PoS is not enough to apply the correct alignment rule. For example, the verb been has the same PoS (AUX) no matter whether it is used as an auxiliary or as a main verb. When computing the alignment for been, we have to decide which of the two uses is given. If, for example, a composed Italian VP is given, been as an auxiliary (for example in they have been said) should be aligned with the Italian auxiliary. If it is a main verb, it should be aligned with the Italian main verb (cf. footnote 26).

(88) a. EN: participle, infinitive or gerundive → IT: finite verb, if IT VP is not composed, or the English verb is like subcategorizing a to-infinitive, or the English verb is be
        EN: {VBN, VB, VBG, TO} → IT: {VER:fin, VER2:fin, AUX:fin}
     b.
EN: participle, infinitive or gerundive → IT: participle, infinitive or gerundive, if IT VP is composed
        EN: {VBN, VB, VBG, AUXG, TO} → IT: {VER:ppast, VER2:ppast, VER:infi, VER2:infi, AUX:infi}
     c. EN: participle → IT: participle of an auxiliary if EN VP is not composed and IT not in passive voice
        EN: {VBN} → IT: {AUX:ppast}
     d. EN: participle → IT: infinitival particle if IT is infinitival construction
        EN: {VBN} → IT: {PRE}

26 The Italian verb form stata has two different PoS depending on the context in which it occurs. If it is a part of a passive construction, it has the PoS AUX:ppast; otherwise it is tagged as VER:ppast.

The English infinitive like is another verb which I treat as an auxiliary, but only if it occurs with a to-infinitive, for example I would like to say as shown in figure 20. If like is a part of a construction containing a modal verb (MD) and a to-infinitive (TO + VB), it should be treated as an auxiliary, i.e. as a finite verb. This ensures that it is aligned with the same Italian finite verb as the modal (here, would).

[Figure 20: Alignment of I would like to say and vorrei dire. English: I/PRP would/MD like/VB to/TO say/VB; Italian: vorrei/VER:fin dire/VER:fin]

5.4.5 Negation

In this work, the English negation particle not is treated as a part of the VP, and its alignment should also be taken into account. The simplest case is to align the English negation with the Italian negation, as shown in figure 21.

[Figure 21: Alignment of we do not adhere and noi non rispettiamo. English: we5/PRP do9/AUX not10/RB adhere11/VB; Italian: noi6/PRO:pers non7/NEG rispettiamo8/VER:infi]

But this is not always possible. Since the sentences in the parallel corpus are not always one-to-one translations of each other, it can happen that the negation exists in only one of the two languages.
On the other hand, it is also possible that the verb in one language already contains the negation (for example as an attached prefix) whereas its counterpart does not and therefore requires an explicit occurrence of the negation.

(89) EN: negation → IT: negation if IT VP contains a negation
     EN: {RB} → IT: {NEG}

The negation alignment rule in (89) allows the English negation to be aligned only with the Italian negation particle. If there is a mismatch, the English negation stays unaligned.

5.4.6 Infinitival particle

Since the English to-infinitives which are considered to be subcategorised by the verbs are also handled, we need alignment rules for the English infinitival particle to. The rule is simple: it should be aligned with the Italian infinitival particle (PRE) if the Italian VP is an infinitival construction (PRE + *:infi or simply *:infi). If this condition is not met, it should be handled as an infinitive.

(90) a. EN: infinitival particle → IT: infinitival particle if IT is infinitival construction
        EN: {TO} → IT: {PRE}
     b. EN: infinitival particle ↔ EN: infinitive if IT is not an infinitival construction

Figure 22 shows an example for the rule (90b). Besides this alignment, the figure also shows the alignments for the other English tokens, computed by applying the rules (82a) for I, (85a) for suggest, and (85b) for present.

[Figure 22: Alignment of I suggest to present and raccomando di presentare. English: I/PRP suggest/VBP to/TO present/VB; Italian: raccomando/VER:fin di/PRE presentare/VER:infi]

5.4.7 Alignment examples

In the following, a few examples of computed VP alignments are presented. The sentences are taken from Europarl.

[Figure 23: Alignment of I shall do and seguirò. English: I13/PRP shall14/MD do15/AUX; Italian: seguirò8/VER:fin]

The phrases in figure 23 are simple to align. Since there is only one verb in Italian, all English words are aligned with it.
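Returning briefly to the negation and infinitival-particle rules (89) and (90), the following minimal sketch shows how such PoS-based rules could be expressed. It is not the implementation used in this work: the function names, the representation of a VP as a list of (word, PoS) pairs, and the flag `is_infinitival_construction` (supplied by the caller, since the exact test is not spelled out here) are assumptions made for illustration.

```python
def positions_with_tags(italian_vp, tags):
    """Return the indices of the Italian VP tokens whose PoS tag is in `tags`."""
    return [i for i, (_, tag) in enumerate(italian_vp) if tag in tags]


def align_negation(italian_vp):
    """Rule (89): EN negation (RB) -> IT negation (NEG).

    An empty list means that the English negation stays unaligned.
    """
    return positions_with_tags(italian_vp, {"NEG"})


def align_to_particle(italian_vp, is_infinitival_construction):
    """Rule (90): return the PRE positions for (90a), or None for (90b).

    None signals that `to` is to be handled like an infinitive.
    The flag is a hypothetical input computed elsewhere.
    """
    pre = positions_with_tags(italian_vp, {"PRE"})
    if is_infinitival_construction and pre:
        return pre
    return None


# Italian VPs taken from figures 21, 17 and 26 (indices are VP-internal).
noi_non_rispettiamo = [("noi", "PRO:pers"), ("non", "NEG"), ("rispettiamo", "VER:infi")]
di_avere = [("di", "PRE"), ("avere", "VER:infi")]
verra_messo = [("verrà", "AUX:fin"), ("messo", "VER:ppast")]

negation_links = align_negation(noi_non_rispettiamo)
unaligned_negation = align_negation(verra_messo)
to_links = align_to_particle(di_avere, True)
to_as_infinitive = align_to_particle(verra_messo, False)
```

Keeping the rules as small functions over tag sets mirrors the way (89) and (90) are stated: the condition on the Italian VP selects the rule, and the tag set on the right-hand side selects the alignment candidates.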
To achieve an alignment between the English subject pronoun I and the Italian finite verb seguirò, the alignment rule (82a) has to be applied. The modal shall, which is recognised as a finite verb, is aligned with seguirò according to the alignment rule (85a). Finally, the link for the auxiliary do, which represents the main verb of the given VP, is also computed by the rule (85a). This example also shows that the same tense (future tense) is formed differently in the given language pair. Whereas English needs a modal verb and an infinitive, the Italian verb takes a suffix to express the future tense.

[Figure 24: Alignment of we have upheld and abbiamo sostenuto. English: we9/PRP have10/AUX upheld12/VBN; Italian: abbiamo9/VER:fin sostenuto11/VER:ppast]

Figure 24 shows the alignment of composed VPs. The English personal pronoun and the finite auxiliary are aligned with the Italian finite verb abbiamo. The subject pronoun is aligned according to the alignment rule (82a), whereas the English auxiliary is aligned with the same verb according to the rule (85a). The participles are aligned with each other, which is determined by the rule (88b).

Figure 25 shows an example of a VP pair in which the Italian VP contains a subject pronoun.

[Figure 25: Alignment of you have suggested and lei propone (= you proposed). English: you12/PRP have13/AUX suggested14/VBN; Italian: lei13/PRO:pers propone14/VER:fin]

In this context, the English subject pronoun should only be aligned with the Italian subject pronoun. This is stated in the alignment rule (82b). The English finite verb is aligned with the Italian finite verb according to rule (85a), whereas the alignment of the English participle (as a main verb) is defined by the rule (88a). This example pair shows another discrepancy that I noticed when inspecting the identified phrase pairs: often, the VPs are not in the same tense.
In figure 25, the English VP is in past tense whereas the corresponding Italian VP denotes an action in the present.

The VP pair in figure 26 shows the case in which the English subcategorized to-infinitive should be aligned with the Italian participle as a main verb (not subcategorized) (cf. footnote 27). The phrases are extracted from the sentences in (91) and (92).

[Figure 26: Alignment of he is to go and verrà messo. English: he4/PRP is5/AUX to6/TO go7/VB; Italian: verrà4/AUX:fin messo5/VER:ppast]

27 The phrases are rather idioms. In Europarl, I found only 39 sentences containing the English VP, whereas the Italian VP occurs in only 16 sentences.

(91) Now, however, he is to go before the courts once more because the public prosecutor is appealing.
(92) Ora, però, verrà messo nuovamente in stato di accusa perché il pubblico ministero ricorrerà in appello.
     now, but, will come put again in position of accusation because the public government recurs the appeal.

Again, the English pronoun and finite verb are aligned with the Italian finite verb verrà according to the rules (82a) and (85a). The infinitive particle to is treated as an English participle, infinitive or gerundive, since the Italian VP does not contain a preposition (as an infinitive particle) which would be seen as an alignment candidate for English to. Treated in this way, the infinitive particle, as well as the English infinitive go, is aligned with the Italian participle. The rule applied to compute this alignment is the rule (88b).

An Italian VP corresponding to an English finite VP can also consist of only one verb, which is not necessarily finite. Figure 27 shows an Italian VP consisting of a single gerundive. The link between the English pronoun and the Italian gerundive is produced by applying the alignment rule (82c). In the given context, the English finite verb is aligned according to the rule (85e).
[Figure 27: Alignment of you hear and ascoltando. English: you4/PRP hear5/VBP; Italian: ascoltando4/VER:geru]

5.5 Evaluation

In this section, the VP alignment computed on the basis of the rules which take the PoS of the words into account is evaluated. Precision, recall and f-score are computed for the base alignment and the rule-based VP alignment. After comparing the results, errors made by the rule-based VP alignment are shown and discussed. Furthermore, some examples of syntactic divergences between English and Italian that are problematic for the system are shown. After an evaluation of the improved alignment, translation systems are built and tested. BLEU scores are reported and the translations of example sentences with pronominal subjects are discussed.

5.5.1 Precision, Recall, F-score

The program for computing the word alignment of the English and the Italian VPs has been applied to 200 parallel sentences randomly chosen from Europarl (cf. section 5.2). The sentences contain both NP and pronominal subjects.

The program for the alignment improvement produces a set of partial alignments, containing alignments only for the identified pronominal subjects and their VPs. The alignments of other words are not a part of the output of the program. I manually annotated the alignment of the English pronominal subjects and VPs in the test set with their Italian counterparts. The manual annotation of the test set provided the partial gold alignment G containing 563 gold alignment links. The alignments of the English words outside of the phrases (pronominal subject + VP) that were of interest for this work were ignored (they are simply not word aligned in the hypothesis and the gold alignment).

To evaluate the base alignment of English pronominal subjects and VPs, it was necessary to extract the alignments of the relevant words out of the complete word alignment for a sentence pair.
This was done on the basis of the word positions of the aligned English words in the gold alignment. The extracted base word alignment contains all links for the elements of English VPs which are annotated in the gold alignment. So, if there are links to Italian words which are not a part of the matching Italian VPs, they have a negative impact on precision.

The alignment that is tested is called the hypothesis H. Given the gold alignment and the hypothesis, the evaluation method based on precision, recall and f-score can be applied. Precision is a measure of the correctness of the hypothesis and is calculated as shown in equation (14).

\[ P = \frac{|H \cap G|}{|H|} \qquad (14) \]

Recall is the percentage of gold alignment links that are found by the hypothesis (cf. equation (15)).

\[ R = \frac{|H \cap G|}{|G|} \qquad (15) \]

F-score is the harmonic mean of precision and recall, and it is computed as shown in (16).

\[ F = \frac{2PR}{P + R} \qquad (16) \]

The evaluation results are shown in table 6.

Alignment     # alignments   Precision   Recall   F-score
Base          522            0.66        0.61     0.64
Rule-based    572            0.80        0.81     0.81

Table 6: Evaluation of the VP alignment

In all measures, the rule-based VP alignment is better than the base alignment. In terms of f-score, the rule-based VP alignment improves on the base alignment by 17 percentage points (0.81 versus 0.64). In a large number of sentences, the base alignment of VPs is both incomplete and incorrect. Since the method described in this work identifies entire VPs, all VP elements are examined and aligned, producing a VP alignment which contains links between all VP elements. The alignment rules allow only alignments between elements which share some characteristics (word category, number, etc.), so that the incorrect alignments to other word categories are excluded (cf. footnote 28).

In the following, examples are presented in which the rule-based VP alignment improves on the base word alignment. We start with an example that shows the most frequent correction of the base alignment.
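The three measures defined in equations (14) to (16) can be illustrated with a minimal sketch. It assumes that an alignment is represented as a set of (English position, Italian position) pairs; this representation, the function name and the toy data are illustrative assumptions and not the evaluation script used for table 6.

```python
def evaluate(hypothesis, gold):
    """Return (precision, recall, f_score) for two sets of alignment links.

    Each link is assumed to be a pair (english_position, italian_position).
    """
    correct = len(hypothesis & gold)                     # |H ∩ G|
    precision = correct / len(hypothesis) if hypothesis else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy data: three gold links, four hypothesis links, two of them correct.
gold = {(0, 0), (1, 0), (2, 1)}
hyp = {(0, 0), (1, 0), (2, 2), (1, 1)}
p, r, f = evaluate(hyp, gold)
```

In the toy data, two of the four hypothesis links are in the gold set, so precision is 2/4, recall is 2/3 and the f-score is their harmonic mean, 4/7.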
As already mentioned in section 5.1, the English subject pronoun is often aligned with various Italian words because its direct counterpart is missing. These include, among others, the Italian object clitics. Italian syntax allows the object clitic to occur in front of the finite verb. Therefore, it is often aligned with the English subject pronoun, which is always situated in front of the verbal predicate of the sentence (cf. figure 28). The base alignment links are marked with wavy lines whereas the rule-based alignments are displayed as straight lines. Overlapping alignments are marked as a combination of a wavy and a straight line.

[Figure 28: Alignment comparison: I accept and lo accetto. English: I13/PRP accept14/VBP; Italian: la15/CLI accetto16/VER:fin]

In figure 28, the alignment rules for the subject pronouns lead to the alignment of the English subject pronoun I with the Italian finite verb accetto (straight line), whereas the link between the pronoun and the Italian object clitic is deleted (wavy line). Both alignments contain the link between the main verbs accept and accetto.

The example in figure 29 shows the advantage of using English parse trees. The English VP is interrupted by an embedded sentence, but the derivation of the English VP from the parse tree leads to the extraction of the complete VP, which is then correctly aligned with its Italian counterpart. The base alignment does not produce any alignments for the beginning of the English VP, namely the word sequence it will.

[Figure 29: Alignment comparison: it will (, I hope,) be examined and sarà esaminata. English: it0/PRP will1/MD ... be6/AUX examined7/VBN; Italian: sarà2/AUX:fin esaminata3/VER:ppast]

28 For now, we assume that VP elements can only be aligned to VP elements, i.e.
verbs, negation and subject pronouns, which leads to higher precision. However, the assumption is not completely correct, which has a negative impact on recall. This will be discussed in the following chapter.

The following figure shows the case in which the base alignment assigns the English subject pronoun to the Italian adverb pertanto. The rule-based VP alignment deletes this link and creates the alignment between the pronoun and the Italian finite verb form può, which corresponds to the English modal can. Furthermore, the link between the English main verb give and the Italian preposition su is removed (cf. footnote 29).

[Figure 30: Alignment comparison: I can (, therefore,) give and pertanto può contare su. English: I12/PRP ... can13/MOD give17/VBN; Italian: pertanto13/ADV può14/VER2:fin contare15/VER:infi su16/PRE]

In figure 31, there is an example of an Italian infinitival clause corresponding to an English finite clause. The contexts of the VPs are shown in (93) and (94).

(93) ... I would ask you to request that the commission express its opinion on this issue and that we then proceed to the vote.
(94) ... la prego di chiedere alla commissione di esprimersi subito e poi di procedere al voto.
     ... you I ask to request the commission to express itself soon and afterwards to proceed to the vote.

Again, in the base alignment the English pronoun is aligned with the Italian adverb, whereas the English infinitive is only aligned with the Italian main verb. The alignment rules correct the alignment of the English subject pronoun and align it with the Italian preposition di, which is considered to be a part of the Italian VP. Since the intention was to align the English subject pronoun and its finite verb with the same Italian word, the English finite verb proceed is also aligned with di. Additionally, it is also aligned with its Italian counterpart procedere.
[Figure 31: Alignment comparison: we (then) proceed and poi di procedere. English: we12/PRP ... proceed17/VBN; Italian: poi13/ADV di14/PRE procedere15/VER:infi]

Figure 32 shows one of the common base alignments for the English subject pronoun. Namely, it is often aligned with sentence punctuation, in our case with a comma. The VP alignment rules remove this link and lead to an alignment in which the English subject pronoun and its finite verb have are both aligned with the same Italian word.

29 Cf. footnote 25 in section 5.4.3.

[Figure 32: Alignment comparison: I have (thus) proposed and , ho proposto. English: I0/PRP ... have1/AUX proposed3/VBN; Italian: ,2/PUN ho3/AUX:fin proposto4/VER:ppast]

In figure 33, VPs including a negation are shown. In the base alignment, the English negation does not have any alignments, whereas the rule-based VP alignment assigns it to its Italian counterpart non. If the English VP contains a negation and an auxiliary which is needed to negate the verbal predicate (here reflect), the auxiliary is aligned solely with the Italian main verb (here rifletterà). Admittedly, the auxiliary could also be aligned with the Italian negation, since it is used to build a negated English VP. I decided, though, to align the auxiliary with the Italian main verb because there are many other English constructions containing an auxiliary and a main verb which correspond to the Italian main verb (for example, I [do think]VP ↔ io [penso]VP, he [is playing]VP ↔ egli [gioca]VP). When such a context is given, the auxiliary and the main verb are both aligned with the corresponding Italian verb if the Italian VP does not have an auxiliary (cf. figure 25).
[Figure 33: Alignment comparison: they do not (properly) reflect and esso non rifletterà. English: they1/PRP do2/AUX not3/RB ... reflect4/VB; Italian: esso1/PRO:pers non2/NEG rifletterà17/VER:fin]

In the VP pair in figure 34, an Italian VP is shown which consists of the reflexive verb permettersi (= allow, permit). The Italian reflexive pronoun occupies the position in front of the Italian finite verb permettesse. This can be compared with the position of the Italian object clitics shown in figure 28. The base alignment contains a link between the English subject pronoun and the Italian reflexive pronoun. The rule-based VP alignment, however, deletes this link and creates the alignment between the English pronoun and the Italian finite verb. Since the Italian reflexive pronouns have the same PoS as Italian object clitics, I excluded the alignment of the English subject pronouns with Italian words tagged with the PoS CLI. This allows for the deletion of many links created between the English subject pronouns and the Italian object clitics, but it also prohibits the alignment between the English subject and the Italian reflexive pronouns, which could be considered correct (cf. footnote 30). Furthermore, the rule-based VP alignment creates a link between allowed and permettesse which was incorrectly missing from the base word alignment.

30 Since I do not allow the alignment of English subject pronouns with Italian reflexive pronouns, these alignments are not a part of the gold alignment.
[Figure 34: Alignment comparison: I might be allowed to give and mi permettesse di rilasciare. English: I15/PRP might16/VBP be17/AUX allowed18/VBN to19/TO give20/VB; Italian: mi15/CLI permettesse16/VER:fin di17/PRE rilasciare18/VER:infi]

The preceding examples show cases in which the alignment rules improve the alignment of English and Italian VPs and, especially, of the English subject pronoun. But the rule-based VP alignment still makes errors in the alignment of about 20% of the tested sentences. In the following, we examine the errors that are made by the described method for the VP alignment.

5.5.2 Error analysis

Manual examination of the erroneous alignments revealed problems which can be divided into four categories:

1. Correct Italian VP not found: The parallel Italian VP is not found.
2. Extended VPs: The VPs can contain coordinated verbs or infinitives which do not have a correspondence in the other language.
3. Alignment rules: The rules compute false alignments when the VPs are too complex.
4. Erroneous preprocessing: VP elements can have a false PoS.

In the following, the error categories are discussed. Sentence pairs and alignment examples are shown to demonstrate the problems within the task of the VP alignment.

Correct Italian VP not found

Alignment rules define alignments between English and Italian VPs. The correctness of the computed alignment for a given VP pair depends not only on the definition of the rules, but also on the assumption that the VPs correspond to each other.
The method for searching the corresponding Italian VP given an English VP has been described in section 5.3.2. The method relies on the base alignment: the Italian VP which has the most links in the base alignment to the English VP is considered to be the corresponding Italian VP. This is not always correct, so an incorrect Italian VP can be chosen. If the English VP does not have any alignments to an Italian VP, an empty Italian VP is chosen. In this case, the English VP stays unaligned. A sentence pair for this case is shown in (95).

(95) a. As you know, like Mr. Rack, I come from a transit country ...
     b. Anch' io, come l' Onorevole Rack, provengo da un paese di transito
        Also I, as the honourable Rack, come from a country of transit
        'Like Mr. Rack, I also come from a transit country ...'

Whereas the English VP [you know]VP does not have a corresponding Italian VP, for the VP [I come]VP the Italian VP [provengo]VP should be identified as the corresponding phrase. Unfortunately, the base alignment does not reveal this fact, so that the English VP stays unaligned, which lowers recall.

Until now, I postulated that for every English VP with a pronominal subject that is found, there is a parallel Italian VP. This phrase parallelism is not always present in the sentence pair to be processed. The English VP can correspond to an Italian phrase of some other category, for example to a PP as shown in (96).

(96) a. We understand that ...
     b. A nostro avviso
        At our notice
        'In our opinion'

Having identified the English pronominal subject we and its VP [understand]VP, the search for the Italian VP is carried out. The VP search allows only VPs as phrases corresponding to the given English phrase, so that the PP [A nostro avviso]PP cannot be determined as the parallel phrase of the English VP, even though the base alignment indicates this correspondence. In most cases of this kind of divergence, the English VP stays unaligned.
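The selection criterion recalled at the beginning of this discussion (the Italian VP with the most base alignment links to the English VP wins, and an empty VP is chosen if there are no links) can be sketched as follows. This is a simplified illustration, not the procedure of section 5.3.2 itself: the function name, the representation of VPs as sets of word positions, the tie-breaking (first candidate wins) and the toy positions are assumptions.

```python
def select_italian_vp(english_vp, italian_vps, base_links):
    """Pick the Italian VP with the most base alignment links to the English VP.

    english_vp  : set of English word positions
    italian_vps : list of sets of Italian word positions (candidate VPs)
    base_links  : set of (english_position, italian_position) pairs
    Returns None (the "empty" Italian VP) if no candidate shares a link.
    """
    best_vp, best_count = None, 0
    for vp in italian_vps:
        count = sum(1 for en, it in base_links if en in english_vp and it in vp)
        if count > best_count:          # assumption: ties keep the first candidate
            best_vp, best_count = vp, count
    return best_vp


# Hypothetical positions: the first candidate shares two links, the second one.
english_vp = {0, 1, 2}
candidates = [{0, 1}, {5}]
chosen = select_italian_vp(english_vp, candidates, {(0, 3), (1, 0), (2, 1), (2, 5)})

# No base link reaches any candidate VP: the English VP stays unaligned.
unaligned = select_italian_vp(english_vp, candidates, {(0, 3)})
```

The second call mirrors the situation in (95) and (96): when the base alignment links the English VP only to words outside every Italian VP candidate, no VP is returned and recall suffers.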
Since the phrases are parallel, they are aligned to each other in the gold alignment, so this leads to a loss of recall. Another phrase divergence which has been observed is shown in (97).

(97) a. Your group was alone in advocating what you are saying now.
     b. Soltanto un gruppo politico condivideva l' opinione da lei espressa in questa sede.
        Only one group political shared the opinion of you expressed in this seat.
        'Only one political group shared the opinion that you expressed in this seat.'

The English finite VP [you are saying]VP corresponds to the Italian PP [da lei espressa]PP consisting of the preposition da, the subject pronoun lei, and the participle espressa. In this form, it is a counterpart to the finite English VP, but in the passive voice. In the process of identifying the Italian VPs in a given Italian sentence (cf. section 5.3.1), this kind of phrase is not identified as a VP, because it starts with a preposition and does not contain a finite verb form. The same problem occurs if the English finite VP corresponds only to a participle in Italian. So, in these cases, we have English VPs which stay unaligned, leading to a reduction of recall.

The problems with regard to the parallel Italian VPs can be summarized as follows:

1. Base alignment
   • False VP because the base alignment is incorrect
   • No VP because the base alignment does not contain links to any possible Italian VP
2. Phrase divergence (free translation, idioms)
   • EN: VP ↔ IT: PP
   • EN: finite VP ↔ IT: participle

In section 5.6, I present experiments that I carried out in order to account for these problems.

Extended VPs

In the previous discussion, examples of VPs have been shown which consist of only one main verb or subcategorized infinitive. The VPs can also contain a sequence of verbs which are either combined by a coordination or form an enumeration separated by commas.

(98) a. It is irresponsible of EU Member States to refuse to renew the embargo.
     b.
Gli stati membri dell' unione sono stati irresponsabili a non rinnovare l' embargo.
        The states member of the union were irresponsible to not renew the embargo.
        'It is irresponsible of Member States not to renew the embargo.'

The sentence pair in (98) shows the English VP [It is ... to refuse to renew]VP and its Italian counterpart [sono stati ... a non rinnovare]VP. The English VP contains two to-infinitives. The first one, namely [to refuse]XCOMP, does not have a direct VP correspondence in Italian. Moreover, it is semantically equivalent to the Italian negation non. This type of correspondence is not described by the alignment rules. In this context, [to refuse]XCOMP as well as [to renew]XCOMP are aligned with [a rinnovare]XCOMP, whereas the Italian negation remains unaligned.

This sentence pair reveals another divergence in the way the same fact is expressed in the given language pair. Whereas in English the expletive has the role of the sentence subject, in Italian the subject is the NP [Gli stati membri dell' unione]NP, which is a translation of the English PP [of EU Member States]PP. This inequality also exists in the processed VP pairs. In some cases in which the English VP has a pronominal subject, the corresponding Italian VP has a nominal phrase as its subject. Since the Italian subject was not taken into account unless it is a pronoun, the English pronoun is not aligned with the Italian subject NP, but instead with the corresponding element of the VP.

Example (99) shows the Italian VP [è chiedere], which contains the infinitive chiedere that does not correspond to any part of the English VP [It is not]. It is, therefore, seen as an extension of the Italian VP which leads to false alignments.

(99) a. It is not a lot to ask.
     b. Non è chiedere molto.
        Not is to ask lot
        'It is not a lot to ask.'

According to the parse tree for the English sentence in (99), the VP with the pronominal subject does not contain the to-infinitive [to ask]XCOMP.
It is instead embedded in an adverbial phrase together with the adverbial lot. On the other side, the search for VPs in the Italian sentence returns only one VP, namely [è chiedere]VP. Given the VPs [It is not]VP and [è chiedere]VP, the Italian infinitive is aligned with the English finite verb is, and not with [to ask]XCOMP. This additional false link leads to a reduction of precision.

The described problems can be summarized as follows:

1. Subcategorization of infinitives: Poses a problem if the infinitive does not have an equivalent phrase in the other language.
2. Coordination: A coordination of verbs in which not every verb has a counterpart in the other language.

Alignment rules

The definition of the contexts in which a specific alignment rule should be applied is not error-free. Additionally, since n-m alignments have to be allowed, a VP element that is already aligned is not prohibited from being aligned with further words. In complex VPs which contain additional elements such as subcategorized infinitives or a sequence of verbs, the rules generate too many links. For example, if two coordinated finite verbs are present in both VPs, six links are generated, namely from each English finite verb to each Italian finite verb form, and from the English subject to both of the Italian finite verbs. This is shown in the sentence pair in (100).

(100) a. With regard to the budget and annual appropriations, we agree with the rapporteur's position and fully support it.
      b. Per quanto attiene al bilancio e alle dotazioni annuali, condividiamo e appoggiamo la posizione della relatrice.
         Regarding the budget and the appropriations annual, share and support the position of the rapporteur.
         'Regarding the budget and the annual appropriations, we share and support the position of the rapporteur.'

The computed alignment for the coordinated VPs in (100) is shown in figure 35. The rule (82a) leads to the alignments between the English subject pronoun and both Italian finite verbs.
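The count of six links for the coordinated VPs in (100) can be reproduced with a small sketch. It only models the combinatorics described above (every English finite verb is linked to every Italian finite verb, and the subject is linked to each Italian finite verb); the function name and the set of intended links are assumptions made for illustration, and the word positions are those given in figure 35.

```python
from itertools import product


def links_for_coordinated_vps(en_subject, en_finite_verbs, it_finite_verbs):
    """Generate links the way the unrestricted rules would for coordinated verbs."""
    # every English finite verb with every Italian finite verb, cf. rule (85a)
    links = set(product(en_finite_verbs, it_finite_verbs))
    # the English subject pronoun with every Italian finite verb, cf. rule (82a)
    links |= {(en_subject, it) for it in it_finite_verbs}
    return links


# we9 agree10 ... support18  vs.  condividiamo10 ... appoggiamo12
generated = links_for_coordinated_vps(9, [10, 18], [10, 12])

# Assumed intended alignment: subject to both verbs, each verb to its counterpart.
intended = {(9, 10), (9, 12), (10, 10), (18, 12)}
false_links = generated - intended
```

Under these assumptions, two of the six generated links are false (agree to appoggiamo and support to condividiamo), which is the loss of precision attributed to this VP pair.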
The rule (85a) is responsible for the alignments of both English finite verbs with both Italian verbs.

[Figure 35: Alignment of we agree (...) support and condividiamo (...) appoggiamo. English: we9/PRP agree10/VBP ... support18/VBP; Italian: condividiamo10/VER:fin ... appoggiamo12/VER:fin]

While the alignments between the English subject and both Italian verbs can be considered correct, each English verb should be aligned only with its corresponding Italian verb. So, the shown alignment contains two additional false links which lead to a reduction of precision.

English VPs with modal verbs pose the main problem for the rules and for the context definition for the alignment of VP elements. Figure 36 shows the computed alignment of an English VP containing auxiliaries and modals, and its Italian counterpart.

[Figure 36: Alignment of it may have been and sia stata. English: it10/PRP may11/MD have12/AUX been13/AUX; Italian: sia4/AUX:fin stata5/VER:ppast]

In this example, too many links are computed. The English finite verb should only be aligned with the Italian finite verb. So, the link between may and stata is false. English have as an auxiliary should also only be aligned with the Italian finite verb sia. The participle been as a main verb should only be aligned with the Italian participle and main verb stata. False links are generated because of the complex English context. Many rules check whether the English VP contains auxiliaries or modals. According to the result of such a context check, the links for English verbs are computed.
In the given example, this leads to the generation of both correct and incorrect links.

Head switching

Head switching is a phenomenon which involves syntactic and semantic differences between languages. The main semantic contributor of a phrase in one language does not correspond to the head of the corresponding phrase in the other language [Butt, 94]. For example, the main verb of a VP, which bears the semantic information of the VP, need not always correspond to the main verb of the parallel VP in another language. This kind of divergence is given in (101). The semantics of the English verb answer corresponds to the semantics of the Italian noun risposta.

(101) a. ... they had been answered in a previous part-session.
      b. ... avessero già ottenuto risposta in una tornata precedente.
         ... have already received answer in one session previous
         '... they have already received the answer in the previous session.'

With respect to the word alignment, we can say that one verb in one language is equivalent to a combination of a verb and an NP in the other language. In example (101), the English verb answered corresponds to the Italian verb ottenuto and the object NP risposta. For the given VPs, the alignment rules produce the alignments shown in figure 37.

[Figure 37: Alignment of they had been answered and avessero ottenuto (risposta). English: they14/PRP, had15/AUX, been16/AUX, answered17/VBN. Italian: avessero18/VER:fin, ottenuto20/VER:ppast.]

The Italian object is not a part of the Italian VP with which the English VP should be aligned. So, the English verb answered is not aligned with the Italian object risposta. The produced alignment is incomplete since the link between answered and risposta is missing.
It is likely that the statistically computed base alignment contains the alignments between a verb on the one side, and a verb and object NP on the other side (cf. example (101)). On the other hand, even if the object NP were a part of the Italian VP, the alignment rules would not allow for alignments between verbs and nouns or articles, since they only allow for alignments between words whose PoS indicates that the alignment candidates are a part of a VP, i.e. verbs, negation and subject pronouns. So, for the given case, the alignment rules produce only some of the correct word alignments.

The sentence pair in (102) shows another construction difference that is comparable with the divergence in example (101).

(102) a. ... you were unable to attend the Conference of Presidents last Thursday.
      b. ... lei non ha potuto partecipare giovedì scorso alla conferenza dei presidenti.
         ... you not have could participate Thursday last to the conference of the presidents.
         '... you could not participate on the Conference of Presidents last Thursday.'

The English VP [you were]VP is a part of a predicative phrase consisting of the mentioned VP and the adjective unable, where unable subcategorizes the following to-infinitive [to attend]XCOMP. In the parse tree, XCOMP is embedded in the adjective phrase ADJP, so that it is not identified as a part of the VP [you were]VP. This causes two difficulties: (i) unable, as an adjective with the PoS JJ, cannot be aligned with the equivalent Italian phrase [non ha potuto]VP (negation and verbs), and (ii) the Italian infinitive partecipare is not aligned with its English equivalent [to attend]XCOMP. The computed alignment for the example in (102) is given in figure 38. Figure 39 shows a combination of the computed VP alignments (straight lines) and the base alignment (dashed) for the VPs in (102).
This combination of VP alignment and the base alignment would be desirable as the output alignment, but the dashed alignments are not a part of the resulting word alignment for the given sentence pair. When the words belonging to the VPs which should be aligned to each other are identified, all base alignments for these words are first deleted. Subsequently, the phrase elements are aligned according to the alignment rules, so that the words of the given VP pair can only be aligned to each other, and not to the words outside of them. In the given example, this leads to the deletion of correct links.

[Figure 38: Alignment of you were (unable to attend) and lei non ha potuto partecipare. English: you4/PRP, were5/VBD. Italian: lei5/PRO:pers, non6, ha7/AUX:fin, potuto8/VER:ppast, partecipare9/VER:inf.]

[Figure 39: Alignment of you were unable to attend and lei non ha potuto partecipare. English: you4/PRP, were5/VBD, unable6/JJ, to7/TO, attend8/VB. Italian: lei5/PRO:pers, non6, ha7/AUX:fin, potuto8/VER:ppast, partecipare9/VER:inf. Straight lines: computed VP alignment; dashed lines: base alignment.]

This type of link deletion problem could be solved by checking how reliable the alignments of a given word with the words outside of the corresponding VP are. Because of the assumption that the elements of VPs should only be aligned to each other, these cases of divergence have not been investigated further.

5.6 System extensions

In the previous section, I presented the errors made by the rule-based method for computing the word alignment between the English VP with a pronominal subject and its Italian counterpart. Some assumptions were made that, unfortunately, did not always lead to the generation of correct alignments.
We saw that the process of searching for an Italian VP on the basis of the base alignments can be erroneous (cf. assumption (A2) in section 1.2). Furthermore, the assumption that the given English VP can only be expressed with a VP in Italian does not hold in all cases (cf. assumptions (A1) and (A3) in section 1.2). In the following, I suggest some improvements to the presented work which address the problems that have been observed.

5.6.1 Lexical search for the matching Italian VP

In section 5.5.2, parallel sentences were shown in which the wrong Italian VP had been identified as the counterpart of the given English VP. Let us consider once more the example shown in (103).

(103) a. As you know, like Mr. Rack, I come from a transit country ...
      b. Anch'io, come l'Onorevole Rack, provengo da un paese di transito
         Also I, as the honourable Rack, come from a country of transit
         'Like Mr. Rack, I also come from a transit country ...'

The Italian VP [provengo]VP has not been identified as the parallel VP to the English VP [I come]VP. In this example, the similarity in meaning of come and provengo could tell us that these two VPs correspond to each other. So, lexical translation probabilities may help to identify the matching Italian VP.

There are two ways to include lexical translation probabilities in the subroutine for finding the matching Italian VP. The search could be changed so that only a lexical search for the Italian VP is carried out. Or, we can combine lexical search and the search based on the base alignment. I carried out two experiments including lexical translation probabilities for the identification of the parallel Italian VP. The first one includes only lexical probabilities whereas in the second, base alignment and lexical translation probabilities are combined. The lexical search uses lexical translation probabilities computed by Moses based on the base word alignment.
For a given English VP e = e_1, ..., e_n, the matching probability of an Italian VP i = i_1, ..., i_m is computed using equation (17).

\[ p(i \mid e) = \sqrt[m]{\prod_{k=1}^{m} \max_{e_l \in e} p(i_k \mid e_l)} \tag{17} \]

For each Italian word i_k which is a part of an Italian candidate VP i, the highest probability of generating it out of one of the elements e_l of the English VP e is taken and multiplied with the highest probabilities of the other Italian words within i. The m-th root of the product is computed in order to normalize for phrase length: the unnormalized product shrinks with every additional word, so without the root, Italian VPs of different lengths could not be compared fairly.

The most probable matching Italian VP i_max for a given English VP e is the Italian VP with the highest matching probability. The probability of the most probable Italian VP has to be higher than a threshold t, since the probabilities can be relatively small, indicating that the phrases are not very likely to be parallel. This is shown in equation (18). I manually set the threshold to t = 0.001. On the test set, this threshold led to the best evaluation results. If the probability of i_max lies below the threshold, an empty Italian VP is returned.

\[ i_{max} = \begin{cases} \arg\max_{i} p(i \mid e), & \text{if } p(i_{max} \mid e) > 0.001 \\ [\,], & \text{else} \end{cases} \tag{18} \]

The evaluation results for the different approaches to searching for the Italian VP are shown in table 7.

IT-VP search     # alignments   Precision   Recall   F-score
Lexical          556            0.68        0.67     0.67
Base             572            0.80        0.81     0.81
Base + lexical   604            0.79        0.84     0.81

Table 7: Evaluation of VP alignment for different IT-VP identification approaches

The search based only on lexical probabilities does not lead to desirable results. This is due to the fact that verbs can have many different translations, so that the most probable translation is not correct in every context. Furthermore, in equation (17), the word or phrase positions are not taken into account. For instance, it can happen that the position of the most probable Italian VP differs significantly from the position of the English phrase.
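The matching probability in equation (17) and the thresholded selection in equation (18) can be illustrated with a minimal sketch. It assumes that the lexical translation probabilities are available as a dictionary keyed by (Italian word, English word); the function names and the toy probabilities below are hypothetical and are not taken from the thesis implementation.

```python
import math

def matching_probability(italian_vp, english_vp, lex_prob):
    """Equation (17): m-th root of the product, over all Italian words,
    of the highest lexical probability p(i_k | e_l) for any English word e_l."""
    if not italian_vp or not english_vp:
        return 0.0
    best_per_word = (
        max(lex_prob.get((i_word, e_word), 0.0) for e_word in english_vp)
        for i_word in italian_vp
    )
    return math.prod(best_per_word) ** (1.0 / len(italian_vp))

def best_italian_vp(candidates, english_vp, lex_prob, threshold=0.001):
    """Equation (18): return the candidate with the highest matching
    probability, or an empty VP if that probability is not above the threshold."""
    best_vp, best_score = [], 0.0
    for vp in candidates:
        score = matching_probability(vp, english_vp, lex_prob)
        if score > best_score:
            best_vp, best_score = vp, score
    return best_vp if best_score > threshold else []

# Hypothetical toy probabilities, for illustration only.
toy_lex_prob = {
    ("provengo", "come"): 0.2,
    ("provengo", "i"): 0.05,
    ("assume", "come"): 0.0005,
    ("ha", "were"): 0.04,
    ("potuto", "were"): 0.09,
}
english = ["i", "come"]
print(best_italian_vp([["assume"], ["provengo"]], english, toy_lex_prob))
```

With these toy values, the candidate [provengo] is selected, while a candidate set containing only [assume] yields an empty VP because its score stays below the threshold.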
Such a positional mismatch could indicate that the phrases do not match, but the proposed computation has no access to this kind of knowledge. Finally, there are no checks as to whether a found Italian VP has already been identified as the parallel VP of some other English phrase.

Some of the mentioned problems can be partially solved if the base alignment and lexical search are combined. This is done as follows: first, the base alignment search is carried out. If no Italian VP is found, the lexical search is applied. The combination of base alignment and lexical search leads to a higher recall since some VPs are found which have not been identified by the base alignment search. As an example of this case, we consider the VP [I come]VP from the English sentence in (103). The correct alignment is shown in figure 40. The base alignment does not identify the Italian VP [provengo]VP as a counterpart of the English phrase [I come]VP. In fact, it fails to find any Italian VP for the given English VP, which leaves the words of the English VP unaligned. Since no Italian VP has been found, the lexical search is applied. This search process finds the correct Italian VP, and the alignment rules define correct alignments between the phrase elements.

[Figure 40: Alignment of I come and provengo. English: I8/PRP, come9/VBP. Italian: provengo10/VER:fin.]

Unfortunately, the combination of the two search methods leads to lower precision compared with the precision of the base alignment search. There are two reasons for this. First, there are contexts in which an English VP should stay unaligned since it has no counterpart in Italian. The lexical search nevertheless returns a parallel VP if its translation probability is higher than the threshold. Second, a false VP is identified if it has a higher probability than the correct parallel phrase. To demonstrate this, the English VP [you know]VP from the example sentence (103) is taken. Its alignment is shown in figure 41.
[Figure 41: Alignment of you know and assume. English: you1/PRP, know2/VBP. Italian: assume20/VER:fin.]

The base alignment does not identify any Italian VP as a counterpart for English [you know]VP. So, the lexical search is applied, suggesting that the Italian VP [assume]VP is a parallel phrase to the given English VP, which is false.

5.6.2 Retaining the base alignment

As already discussed, the English VP does not need to have an Italian VP as its counterpart. It can correspond to a phrase of some other type, for example, to a prepositional phrase, or simply to a participle. The implemented method for VP alignment does not allow for this kind of parallelism. For an English VP, only an Italian VP can be found as its parallel phrase. If the lexical search does not find any parallel Italian VP, instead of not aligning the English VP, its base alignment could be retained. This could lead to correct alignments which cannot be created by the alignment rules, but it could also lead to alignments which are incorrect. The results of the experiment for retaining the base alignments are shown in table 8.

Alignment           # alignments   Precision   Recall   F-score
Base                522            0.66        0.61     0.64
Rule-based          567            0.80        0.81     0.81
Rule-based + base   588            0.79        0.82     0.80

Table 8: Evaluation of different VP alignments

Evaluation results show that retaining the base alignment for the phrases for which no alignment could be computed has a negative impact on precision.

5.7 Summary

In this chapter, I presented a method for the alignment of English and Italian VPs which have pronominal subjects. The aim of the rules developed for VP alignment was to correct the alignment of the English pronominal subject which often does not have an Italian counterpart, and which is therefore often aligned with incorrect Italian words. Since the alignment of English subject pronouns depends on the alignment of their VPs, the rules were written to cover the alignment of entire English and Italian VPs.
The definition of the alignment rules was motivated by both the linguistic and semantic characteristics of the verbs. Words which bear similar features (for example, number, definiteness, person, etc.) are aligned to each other. The rules do not have any lexical knowledge. They operate on the PoS sequences of the parallel VPs.

The evaluation revealed that the rule-based VP alignment reaches higher precision, recall and f-score than the base word alignment (cf. table 8). The f-score of the base alignment is 0.64 whereas the f-score of the rule-based VP alignment is 0.81.

Parallel VPs have been extracted on the basis of the base alignment of each English VP. Since English parse trees were available, in most cases the correct English VPs have been extracted. The Italian VPs were identified on the basis of PoS sequences that form a VP (cf. section 5.3.1). The identification of correct Italian VPs is not ideal; more error-free VPs (consisting only of verbal elements that belong to the specific VP) could have been extracted if Italian parse trees had been available.

The identification of parallel VPs, which is based on the base word alignment, is not always correct. When the lexical translation probabilities are included in the search for parallel VPs in addition to the base alignment, recall improves slightly, but precision falls (cf. table 7 in section 5.6.1). This is due to the fact that the search method nearly always finds a matching Italian VP for the English input. In some cases, the found Italian VP is correct, but in other cases it is not.

The examination of the parallel corpus showed that there are many syntactic divergences between English and Italian (cf. section 5.5.2). Frequently, the English VP does not have an Italian counterpart because the whole clause has not been translated (free translation). Furthermore, English VPs can also correspond to Italian PPs, participles or to the arguments of Italian verbs.
Such cases of phrase divergences have not been dealt with in this work.

The alignment rules lead to satisfying alignments of the PoS sequences in the majority of VPs, but they produce false links if they are applied to complex VPs (coordinated verbs or to-infinitives). They search through all VP elements and compute all possible links, sometimes associating one English verb with two Italian verbs, and vice versa (cf. figure 35, section 5.5.2). This is due to the implementation of the rules. There is no limitation on the number of links that can be computed for an input word. This could be improved by using the lexical translation probabilities. If there are several candidates that an English main verb could be aligned with, the lexical translation probabilities and the word positions in the VP could be used to determine which alignment is the most probable, while the other links would be discarded.

In this work, only those VPs have been dealt with that have pronominal subjects, but the method presented in this section could also be applied to VPs with NP subjects.

In this section, we have defined the VP alignment rules and conducted an evaluation and an error analysis of the generated alignments. In the following section, the SMT systems built using the two different word alignments (base alignment and base alignment combined with rule-based VP alignment) will be presented and evaluated. A detailed examination of translation parameters will be carried out in order to explain the evaluation results.

6 Evaluation of SMT systems

In this section, I present an evaluation of four SMT systems. For each translation direction, I built two SMT systems: a baseline system (M1) and a system using rule-based VP alignment (Mmod). I introduce the evaluation measure BLEU and present the BLEU scores of the M1 and Mmod systems. I discuss why the improved VP alignment does not lead to an improvement of the translation of null subjects.
Subsequently, I discuss possible solutions to the problem.

6.1 The BLEU score

In the previous chapter, it has been demonstrated that the word alignment is improved by applying the alignment rules to the base alignment. We will now evaluate whether the word alignment improvement has an impact on the quality of the generated translations. In this work, the quality of translation is measured using BLEU [Papineni et al., 02].

The computation of the BLEU score takes into account the similarity between the generated translation (hypothesis) and one or more reference translations, which are correct translations of the sentence to be translated. The similarity is expressed by a modified n-gram precision p_n. The sentences are viewed as a set of n-grams, i.e. word sequences of the length n. The count of an n-gram is clipped to the maximum number of occurrences of the n-gram in one of the references. The modified precision for n-grams of the length n is computed by summing over the matches for every hypothesis C in the whole corpus Candidates and dividing by the total number of n-grams in the hypotheses. This is expressed in equation (19).

\[ p_n = \frac{\sum_{C \in \{Candidates\}} \sum_{n\text{-gram} \in C} Count_{clip}(n\text{-gram})}{\sum_{C' \in \{Candidates\}} \sum_{n\text{-gram}' \in C'} Count(n\text{-gram}')} \tag{19} \]

In addition to the modified n-gram precision, the BLEU score also considers the length c of the hypothesis: it should not be too short compared with the reference, which has a length r. This is expressed by the brevity penalty BP in (20). Sentences that are too long are already penalized by lower precision.

\[ BP = \begin{cases} 1, & \text{if } c > r \\ e^{1 - r/c}, & \text{if } c \le r \end{cases} \tag{20} \]

The BLEU score is computed by combining the n-gram precision and the brevity penalty as demonstrated in equation (21). N represents the maximum length of n-grams (usually N = 4) whereas the weights w_n are uniform: w_n = 1/N.

\[ BLEU = BP \cdot \exp\Big( \sum_{n=1}^{N} w_n \log p_n \Big) \tag{21} \]

6.2 Evaluation of SMT systems

I have built four SMT systems, two for each translation direction.
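Before turning to the individual systems, the BLEU computation from equations (19) to (21) can be illustrated with a minimal sketch. It assumes tokenized input, exactly one reference per hypothesis (as in the evaluation reported here) and no smoothing; the function names are hypothetical, and the sketch is not the scoring script that was used for the reported results. It returns a value between 0 and 1, whereas BLEU scores are commonly reported multiplied by 100.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU with one reference per hypothesis and uniform weights."""
    matches = [0] * max_n   # clipped n-gram matches, numerator of equation (19)
    totals = [0] * max_n    # hypothesis n-gram counts, denominator of equation (19)
    c = r = 0               # total hypothesis length and total reference length
    for hyp, ref in zip(hypotheses, references):
        c += len(hyp)
        r += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts = ngram_counts(hyp, n)
            ref_counts = ngram_counts(ref, n)
            matches[n - 1] += sum(min(count, ref_counts[gram])
                                  for gram, count in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    if c == 0 or min(matches) == 0:
        return 0.0          # log(0) is undefined; unsmoothed BLEU is taken as 0
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)       # equation (20)
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    return brevity_penalty * math.exp(log_precision)              # equation (21)

reference = "they sang my song .".split()
print(corpus_bleu([reference], [reference]))
print(corpus_bleu(["have been sung my song .".split()], [reference]))
```

In the second call, the hypothesis shares no 4-gram with the reference, so the unsmoothed score is 0 even though several single words match. This is one reason why BLEU is normally computed over a whole test set rather than over single sentences.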
The baseline SMTs (M1) use the base word alignment produced by GIZA++ whereas the other two systems (Mmod) use the modified word alignment. All systems are built on a parallel corpus containing 749,646 sentence pairs. The same corpus was used to build the language models. As a dev and a test set, I used the WMT Newstest 2009 (see footnote 31). All sets (development and test sets) contain 1000 sentences. For the computation of BLEU, one reference sentence was used. The evaluation results (BLEU scores) are shown in table 9.

                      IT → EN   EN → IT
Baseline SMT (M1)     22.07     19.15
Improved WA (Mmod)    21.81     18.18

Table 9: BLEU scores of the SMT systems for EN ↔ IT

6.3 Error analysis

The BLEU scores are slightly worse for Mmod. However, a closer look at the translations generated by the base and modified systems shows that the translations are nearly the same. Often, the sentences differ only in synonyms. Such differences can unfortunately have a strong impact on BLEU scores. In the following, a detailed analysis of the translation of subject pronouns is presented and discussed.

Translation direction IT → EN

Manual examination of subject pronoun translations revealed that both systems perform equally well. Looking at the translations, I noticed that some null subject pronouns are translated better than others. The 1st and 2nd person pronouns seem to be easier to translate than 3rd person pronouns. This could be explained by the fact that the use of the pronouns for the 1st and 2nd person is more common. For example, if someone speaks for himself, he would rather refer to himself by a pronoun than by an NP. Concerning a parallel corpus, this means that the English subject pronoun for the 1st person singular is very likely to occur together with an inflected Italian verb with an omitted subject. This leads to a higher probability of translating the Italian VP with an inflected finite verb into the corresponding English pronoun and VP (and vice versa).
In table 10, two possible translations for the Italian verb form so (= I know) are shown (see footnote 32). Table 10 also shows the difference in probabilities between the two systems, indicating that the rule-based word alignment of VPs does have an impact on translation probabilities. The English pronoun I occurs in 56% of possible translations for the Italian verb so in M1. In Mmod, the English pronoun is found in 68% of phrases.

Footnote 31: http://www.statmt.org/wmt09/
Footnote 32: The column phrase count shows how often the SL phrase has been extracted. The column phrase pair types denotes the number of different translations of the SL phrase.

so →    i know   know     phrase count   phrase pair types
M1      0.5546   0.1611   1,850          190
Mmod    0.6202   0.0113   1,851          228

Table 10: Translation probabilities for so into (I) know

The five most probable phrase translations for so are shown in table 11.

so →    M1                     Mmod
        0.5546  i know         0.6202  i know
        0.1611  know           0.0918  i am
        0.0497  i am           0.0448  i am aware
        0.0373  i am aware     0.0189  i understand
        0.0178  i              0.0113  know

Table 11: Top five translation phrases for so

The probability of generating the English verb know without a subject, given the Italian phrase so, is higher in M1 than in Mmod. This is the result of the rule-based VP alignment. The rules align the English subject pronoun with the same Italian verb as the corresponding English finite verb. Thus, they enforce that the Italian verb so is aligned only with the English word sequence I know. This alignment leads to a higher probability of extracting the phrase pair (so, I know) compared to the phrase pair (so, know). In M1, the phrase pair (so, know) was extracted 298 times whereas in Mmod it was extracted only 21 times. This means that in Mmod there were only 21 cases in which the inflected Italian verb so was not aligned with the English pronoun when it occurred with the English verb know.
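The counts given above can be related to the probabilities in table 10 with a short worked example. The sketch assumes that the phrase translation probability is the relative frequency of the phrase pair among all extractions of the source phrase, which is the standard estimate in phrase-based SMT; the function name is hypothetical.

```python
def relative_frequency(pair_count, source_count):
    """Relative-frequency estimate: count(source, target) / count(source)."""
    return pair_count / source_count

# (so, know) was extracted 298 times in M1 and 21 times in Mmod; the phrase
# so itself was extracted 1,850 and 1,851 times (phrase count in table 10).
p_m1 = relative_frequency(298, 1850)
p_mmod = relative_frequency(21, 1851)
print(round(p_m1, 4), round(p_mmod, 4))
```

Under this assumption, the two estimates round to 0.1611 and 0.0113, which are the values listed for know in table 10.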
In these 21 cases, it is likely that the English clauses had NP subjects (due to free translation), so that the VP alignment rules were not applied.

The 2nd person singular pronouns are a little more complicated. My intuition is that their usage is comparable with the usage of the 1st person pronouns. But looking at the generated translations, I observed that incorrect English subject pronouns are often generated. This is due to the ambiguity of the Italian verbs. Example sentence (31) (cf. chapter 3.2.1) already showed such a case of ambiguous Italian verbs. The same example is shown again in (104a). The translation produced by M1 and Mmod is shown in (104b).

(104) a. Hai detto che parli italiano.
         have said that speak Italian.
         'You said that you speak Italian.'
      b. You have said that speaks Italian.

Both translation systems generate the same translation for the sentence in (104a). The input was segmented into the following phrases.

(105) [Hai]p1 [detto]p2 [che parli]p3 [italiano.]p4

The phrases p1 and p2 generate the correct English pronoun and verb, but the phrase p3 leads to a false translation which lacks the obligatory subject pronoun. Furthermore, the verb parli could indicate the 2nd person singular indicative, or the 3rd person singular subjunctive. The phrases that speaks and che parli are parallel if that and che are relative pronouns. They are very likely to be translated into each other. In this example, this interpretation of that and che is wrong. Since the SMT systems do not have this knowledge, they produce incorrect translations in this case.

che parli →   that you speak   that she speaks   that speaks   phrase count   phrase pair types
M1            -                0.125             0.125         8              8
Mmod          -                0.125             0.125         8              8

Table 12: Translation probabilities of che parli into that (you/she) speak(s)

Table 12 shows that the correct translation of the phrase che parli is not included in the translation table at all.
There is no difference in the probability distribution for the phrase che parli between M1 and Mmod since che is used as a relative pronoun in nearly all sentences. Hence, the parallel English sentence does not have a personal subject pronoun, which is necessary for applying the VP alignment rules. Looking at the translations of parli shown in table 13, the correct translation phrase is present, but it has a very small probability.

parli →   you speak   speak   you talk   talk     phrase count   phrase pair types
M1        0.009       0.099   -          0.054    111            71
Mmod      0.0083      0.075   0.0083     0.0333   120            85

Table 13: Translation probabilities of parli into that (you) speak/talk

With respect to the 2nd person pronouns, I also noticed that the Italian verbs for the second person singular indicative are very rare in the corpus that I was working with (cf. chapter 2.4), which has a negative impact on translating them into English.

The 3rd person pronouns are the most difficult to translate. They do not occur with the corresponding verb form as frequently as the other pronouns. The intuition is that the verbs marking the 3rd person occur very often with a subject NP. In the word alignment, the verbs of the given language pair are aligned to each other. In the phrase extraction step, they are then extracted as parallel phrases, and the English VP does not contain a subject pronoun. As a translation example, we consider the sentence in (106a). M1 and Mmod generated the translation shown in (106b).

(106) a. Hanno cantato la mia canzone.
         have sung my song.
         'They sang my song.'
      b. Have been sung my song.

The input sentence has been segmented as follows:

(107) [Hanno]p1 [cantato]p2 [la mia]p3 [canzone.]p4

In this example, the translation of the phrases p1 and p2 is crucial for obtaining correct output. First, we examine the translation probabilities of hanno as an inflected Italian verb (cf. table 14).
hanno →   have     they have   phrase count   phrase pair type
M1        0.5761   0.0264      15,589         1961
Mmod      0.5552   0.0388      15,292         2230

Table 14: Translation probabilities of hanno into (they) have

The difference in translation probabilities between the phrase pairs (hanno, have) and (hanno, they have) is huge. Even if the language model gives lower scores to generated sentences which do not have a subject (pronoun) at the beginning, such sentences might still be produced. The rule-based VP alignment nevertheless leads to a higher number of occurrences of the phrase pair (hanno, they have). Whereas in M1 it was extracted 412 times, in Mmod the phrase pair was extracted 593 times. The English pronoun they occurs in 6% of possible translations in M1 whereas in Mmod, it occurs in 12% of translation phrases.

Similar behaviour is also observed in the case of the inflected main verbs (cf. table 15): they is a part of 15% of the translation phrases in M1 whereas in Mmod, it is included in 27% of the phrases. Compared to the phrase pairs in table 14, the rule-based VP alignment leads to small changes regarding the counts for the translation phrases of pensano.

pensano →   think    they think   believe   they believe   phrase count   phrase pair type
M1          0.2251   0.0471       0.0733    0.0157         191            92
Mmod        0.2126   0.0435       0.0628    0.0241         207            104

Table 15: Translation probabilities of pensano into (they) think

Further examination revealed another problem, which concerns the morphology of Italian and the corpus characteristics. The sentences in (108a) and (109a) differ only in the gender of the subject, which is marked by the Italian participles stata (feminine) and stato (masculine). The sentences in (108b) and (109b) are the generated translations of the sentences (108a) and (109a).

(108) a. Lei non è stata a casa.
         you/she not have been at home.
         'You/she were/was not at home.'
      b. You was not at home.

(109) a. Lei non è stato a casa.
         you not have been at home.
         'You were not at home.'
      b. You were not at home.
Whereas the word sequence Lei non è stato has been extracted as a translation unit, this was not the case for Lei non è stata. In the translation process, this led to the following segmentation of the sentences.

(110) [Lei]p1 [non è stata]p2 [a casa.]p3
(111) [Lei non è stato]p1 [a casa.]p2

The translation of the phrase p2 in (110) leads to the generation of the false English verb. But when the subject pronoun is a part of the translation phrase, as in phrase p1 in (111), the correct translation is generated. Unfortunately, the Italian pronoun lei occurs only with the masculine participle of the verb essere (= be) in the training data, so the needed phrase was not extracted.

In conclusion, we have shown that the improved VP alignment does not contribute to the improvement of translating the omitted Italian subject pronouns into English. The rule-based VP alignment does change the translation probabilities of the relevant translation pairs. Correct translation pairs were found which have higher probabilities in Mmod. Furthermore, it has been observed that the English subject pronouns are more frequently a part of the phrases extracted from the modified word alignment. Unfortunately, these changes do not have an impact on the generated translations. Incorrect subject pronouns in English are generated not because of erroneous word alignment, but because of the way subject pronouns are used. Frequently used subject pronouns (1st and 2nd person pronouns) are often correctly generated. They occur more often with the corresponding Italian inflected verb and can therefore be extracted as translation pairs with relatively high probabilities. 3rd person pronouns are relatively rare and lead to the extraction of the corresponding translation pairs with relatively small translation probabilities (see footnote 33).

The English language model which I used was trained on a relatively small monolingual data set.
A better language model could in some cases lead to the generation of correct obligatory English subject pronouns which were false in the examples shown in this section.

In the preceding discussion, I claimed that an Italian inflected verb should generate an English subject pronoun and verb. This will, however, lead to erroneous translations if the Italian sentence has an NP subject. When, for example, the Italian verb hanno has to be translated, it is required that it can be translated both as the English verb have and as the English phrase they have (cf. table 14). Which translation is correct depends on the Italian input. If the input sentence does not have a subject (the pronominal subject is dropped), the English phrase they have should be generated. If the Italian input contains an NP subject, the Italian verb should be translated as the corresponding English verb (without the subject pronoun). Therefore, both translation phrase pairs are correct in an adequate context.

Footnote 33: Statistics on the occurrence of different subject pronouns in English are shown in appendix C.

Translation direction EN → IT

In the following, we examine the opposite translation direction and check whether the rule-based VP alignment contributes to the translation of the English subject pronoun and its VP into the correct Italian VP. As already discussed in section 3.2.2, when translating the English subject pronoun into Italian, it has to be decided whether the Italian pronoun should be expressed overtly, or whether it should be omitted. In SMT, this decision is made implicitly by using the translation probabilities of English phrases in combination with the Italian language model.

When examining test sentences, I noticed that 3rd person singular pronouns are often generated whereas the others are more often omitted. Again, I presume that this is due to the usage of the pronouns. English 1st person pronouns are very frequent and are very likely to occur with different Italian VPs with an omitted subject.
Therefore, they are very likely to be extracted as parallel phrases in which the Italian phrase does not have a pronoun. This is confirmed by the phrase translation tables 16 and 17.

i can →            M1       Mmod
posso              0.5712   0.6341
io posso           0.0024   0.0034
phrase count       2,902    2,963
phrase pair type   594      372
Table 16: Translation probabilities of i can into (io) posso

In M1, the phrase pair (I can, posso) was extracted 1650 times whereas in Mmod it was extracted 1879 times. The difference in the number of extractions of the phrase pair (I can, io posso) is relatively small: in M1 it occurs 7 times, and in Mmod 10 times. When English and Italian sentences both have pronominal subjects, it is very likely that the pronouns are aligned with each other. It is, however, possible that the English pronoun is aligned with additional Italian words, which would have an impact on the extraction of translation phrases. The VP alignment rules prohibit these additional alignments, which could explain the higher count of (I can, io posso) in Mmod than in M1. The same observation applies to the phrase pairs shown in table 17.

we know →          M1       Mmod
sappiamo           0.6125   0.5358
noi sappiamo       0.0157   0.0139
phrase count       2,145    2,736
phrase pair type   372      551
Table 17: Translation probabilities of we know into (noi) sappiamo

The five most probable translations for we know are shown in table 18.

we know →   M1                       Mmod
            0.6125  sappiamo         0.5358  sappiamo
            0.0181  è noto           0.0259  conosciamo
            0.0176  sappiamo bene    0.0186  è
            0.0171  conosciamo       0.0179  sappiamo che
            0.0167  si sa            0.0157  sappiamo bene
Table 18: Top five translation phrases for we know

I examined whether there is a difference in the number of Italian phrases aligned with the English phrase we know which contain verbs equivalent to the English verb know, namely sappiamo and conosciamo. Whereas these verbs are found in 34% of phrases in M1, in Mmod they are a part of 38% of the translation phrases.
The difference is due to the VP alignment rules, which allow English VPs to be aligned exclusively with Italian VPs (which in this work contain only verbal elements and negation). This is also the reason why the Italian phrase è noto (VER:fin + ADJ) has a smaller probability in Mmod whereas the finite verb form è is more probable than in M1.

The third person pronouns are not very frequent and occur with a relatively small number of verbs. In the process of translation, if the English subject pronoun and VP are not in the phrase table as a translation unit, they are split, resulting in two separate translation phrases: a phrase with the subject pronoun and a phrase with its VP. This is shown by the sentence in (112a) and its segmentation in (113). The generated translation is shown in (112b).

(112) a. He has spoken with his father.
      b. Egli ha  parlato con  il  suo padre.
         he   has spoken  with the his father.
(113) [He]p1 [has spoken with]p2 [his]p3 [father]p4 [.]p5

It is very likely that he will be translated into the corresponding Italian pronoun (cf. table 19). The phrase has spoken with generates the correct Italian VP. The result is a sentence with an explicit subject pronoun. When the sentence is isolated, this is acceptable, but in a larger text, if a large number of Italian subject pronouns are generated where null subjects could be used, the translation would sound unnatural.

he →               M1       Mmod
egli               0.1634   0.4719
lui                0.03     0.0861
ha                 0.1146   0.0168
phrase count       5,981    2,740
phrase pair type   1,616    505
Table 19: Translation probabilities of he into egli, lui and ha

Table 19 also shows the impact of the rule-based VP alignment on the phrase translation table. If the Italian pronoun is available, the English pronoun is only aligned with it. If this is not the case, it is aligned with the Italian verb form. In M1, the phrase pair (he, egli) was extracted 967 times, whereas in Mmod it was extracted 1293 times.
This leads to a very high probability of translating the English pronoun into the Italian pronoun, which is correct, but it leads to too few occurrences of the null subject in Italian.

he has →           M1                 Mmod
                   0.2639  ha         0.3391  ha
                   0.0926  che ha     0.0739  che ha
                   0.0847  egli ha    0.0716  egli ha
                   0.0236  è          0.0414  è
                   0.0197  abbia      0.0235  abbia
phrase count       1,514              1,787
phrase pair type   456                402
Table 20: Top five translation phrases for he has

However, if the segmentation of the English sentence had included the phrase he has, it would have been more probable that the generated Italian sentence does not have a subject pronoun (cf. table 20).

Splitting the English subject pronoun from its VP could also lead to the generation of wrong Italian inflection since English verbs have poor morphology. Given, for example, the translation phrase desired without the subject pronoun, it is very likely that the wrong Italian verb is generated if the language model does not penalise the erroneous Italian word sequence. Although I expected to see such errors, I was not able to find them in the tested sentences.

I also noticed that the 2nd person pronoun you is often translated as lei, meaning you in the polite form of address. An example of such a case is shown in the sentence in (114a), whose translation is shown in (114b).

(114) a. I can understand that you are annoyed.
      b. Capisco    che  lei è   arrabbiato.
         understand that you are annoyed

The English phrase you are corresponds to three possible Italian phrases. You can correspond to the pronouns for the 2nd person singular tu and plural voi, and the 3rd person singular lei (polite form of address). The SMT systems cannot resolve this ambiguity and choose the most probable phrase, in this case the phrase for the polite form of address: lei è. Without any context, however, a human translator would also have problems deciding which Italian VP is the correct translation of the English one.
Within the context, if it is clear that the 2nd person singular is meant, the generated translation would be wrong. Certainly, this way of translating you are is caused by the corpus that has been worked with. But even if we had more evidence for translating you are into other possible Italian constructions, the ambiguity would still be a problem.

In summary, if the English subject pronoun and its VP are not included in the translation table as a translation unit, they are split, resulting in the generation of the Italian subject pronoun which could (or should) be omitted. Also, at least theoretically, English translation phrases consisting of verbs without the subject pronoun could lead to the generation of Italian VPs with wrong inflection, since one English verb form often corresponds to a number of Italian verb forms.

6.4 Adequate training data

After the discussion in the previous sections, infrequent use of pronouns seems to pose the greatest problem for SMT in translating pronominal subjects. The question this raises is: if we had a corpus containing a large number of sentences with pronominal subjects occurring with many different verbs, would this solve the problems presented previously?

We would certainly have phrase pairs EN: prp_i + vp_j → IT: vp_k with high translation probabilities which could improve translation results when translating the English subject pronoun prp_i with VP vp_j into the Italian null subject and the correct VP vp_k. If the English pronominal subject is not split into a separate phrase from its VP, the overgeneration of Italian subject pronouns could be avoided.

But if we would like to translate the Italian null pronoun into the correct English pronoun and VP, this would lead to another problem. Suppose that an Italian VP with NP subject should be translated, and the Italian VP is a translation unit.
If it has a high probability of being translated as an English pronoun and a VP, we would incorrectly have two subjects: a translation of the Italian NP subject and the English pronoun generated from the Italian VP. Since both translations have to be possible, it is important that both translation alternatives have comparable probabilities: IT: vp_i → EN: prp_j + vp_k and IT: vp_i → EN: vp_k. The Italian VP vp_i must have a probable English translation phrase consisting both of the pronoun prp_j and the VP vp_k, and a phrase only consisting of the VP vp_k.

To make sure that an additional subject pronoun in English is not generated, it would be necessary to determine the subject of the Italian sentence. With information about the subject, the correct translation phrase pair could be favored over the other. Problems regarding ambiguities of verb inflection would, however, still exist. To resolve them, information from the context outside of the phrase pairs is needed. This lack of a model of context is a known flaw of phrase-based statistical machine translation which has only recently been addressed in a preliminary fashion in the literature.

7 Conclusion

In this work, a detailed analysis of the problem of translating pronominal subjects within statistical machine translation is carried out. Italian, a null subject language (NSL), and English, a non-null subject language (non-NSL), were used. A rule-based method for aligning English and Italian VPs with pronominal subjects is presented. The rule-based VP alignment was used to build phrase-based SMT systems in order to examine whether the more accurate word alignment of VPs would lead to an improvement of the pronominal subject translation. Unfortunately, this was not the case. The usage of subject pronouns and the corpus characteristics have a significant influence on extracting the correct translation pairs.
Phrase-based SMT is not adequate for pronoun translation and generation since it does not have any information about the context outside the translation phrases. The main findings of the work are summarized in section 7.1. Future effort in improving translation of (null) subject pronouns is outlined in section 7.2.

7.1 Summary

In some languages like Italian, overt subject pronouns are not obligatory. The verbal morphology is rich enough to reveal characteristics like person and number of the missing pronoun (cf. section 2.1). Italian subject pronouns are used when they fulfill some specific functions like emphasis, reintroducing referents, etc. (cf. section 2.3). On the other hand, some languages like English rarely allow the omission of subject pronouns. English syntax generally requires that the subject position is occupied; otherwise, the sentence is not grammatically correct.

The optional use of the subject pronoun in Italian and the obligatory use of the subject pronoun in English lead to problems in word alignment of parallel sentences with pronominal subjects, as well as in statistical machine translation. Until now, the problem of translating (null) subjects between a NSL and a non-NSL has been dealt with only indirectly. An overview of previous work was given in section 3.1.

The analysis of different translation cases showed that in many cases, Italian inflected verbs can provide the information needed to generate the correct English subject pronoun. Problems arise when the verbs are ambiguous with respect to person, number and/or gender. For example, an Italian finite verb which is 3rd person singular does not carry information about the gender of the missing subject. This can lead to the generation of the wrong English pronoun. A further problem is the gender discrepancy between languages. For example, whereas animals have neuter grammatical gender in English, in Italian they can be both feminine and masculine.
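The person and number ambiguity described above can be pictured with a small lookup. The following is only a minimal sketch: the toy lexicon `VERB_FEATURES` and both function names are assumptions introduced for illustration (a real system would take person and number from a morphological analyser), and the polite use of 3rd person forms for you is deliberately left out.

```python
# Candidate English subject pronouns for each person/number value that an
# Italian finite verb can express. Gender is not marked on the finite verb,
# so the 3rd person singular stays ambiguous between he, she and it.
# The polite form of address (lei + 3rd person verb = "you") is not modelled.
PRONOUN_CANDIDATES = {
    ("1", "sg"): ["I"],
    ("1", "pl"): ["we"],
    ("2", "sg"): ["you"],
    ("2", "pl"): ["you"],
    ("3", "sg"): ["he", "she", "it"],
    ("3", "pl"): ["they"],
}

# Hypothetical toy lexicon (verb form -> person, number), for illustration only.
VERB_FEATURES = {
    "posso": ("1", "sg"),
    "sappiamo": ("1", "pl"),
    "ha": ("3", "sg"),
    "hanno": ("3", "pl"),
}


def pronoun_candidates(verb_form):
    """Return the English pronouns compatible with an Italian finite verb."""
    features = VERB_FEATURES.get(verb_form.lower())
    if features is None:
        return []  # unknown verb form: no prediction
    return PRONOUN_CANDIDATES[features]


def needs_context(verb_form):
    """True if the verb alone does not determine a single English pronoun."""
    return len(pronoun_candidates(verb_form)) > 1


if __name__ == "__main__":
    for form in ("sappiamo", "ha", "hanno"):
        print(form, pronoun_candidates(form), needs_context(form))
```

In this sketch only the 3rd person singular form needs information from outside the verb, which corresponds to the case in which the preceding context has to be examined.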
Various translation cases and problems are discussed in section 3.2. In many cases, the examination of the context (previous sentence(s)) is required to derive all information that would ensure the generation of the correct English pronoun. Most (statistical) machine translation systems do not use the context, but translate sentences as isolated translation units.

As already mentioned, the absence of Italian subject pronouns causes problems in the word alignment task. Suppose that an Italian and an English sentence pair containing pronominal subjects has to be word aligned automatically. It is very likely that the English subject pronoun does not have a direct Italian counterpart since Italian allows for subject pronoun omission (cf. table 2, section 2.4). For this reason, English subject pronouns are often aligned with Italian object clitics, conjunctions, etc.

I developed alignment rules which define the word alignment of English subject pronouns. English subject pronouns have to be aligned with Italian words with the same linguistic information (person, number, gender). If the Italian subject pronoun is expressed overtly, the English subject pronoun is aligned with it. If the subject is dropped, the English subject pronoun is aligned with the Italian finite verb form.

In addition to the rules for the alignment of English subject pronouns, I developed rules for the alignment of VPs (verbal elements of a VP and negation). The rules are based on the category of the VP elements (finite verb, auxiliary, participle, etc.). I used English parse trees enriched with functional tags (cf. section 5.2.1) and part of speech tagged Italian sentences (cf. section 5.2.2). The process of aligning parallel phrases consists of several steps. An Italian sentence is searched in order to find all Italian VPs (cf. section 5.3.1). In the parallel English sentence, the clauses with pronominal subjects are detected.
Baseline word alignment of the elements of an English VP (created by GIZA++) is used to identify the matching Italian VP (cf. section 5.3.2). The alignment rules compute the alignment of the phrase pair elements by searching for specific PoS pairs in a specific PoS sequence. A detailed description of 15 alignment rules for Italian and English VPs is presented in section 5.4.

The rules were applied on a test set containing 200 parallel sentences. The evaluation results (precision, recall, f-score) indicate that the VP alignment computed by the rules is better than the baseline alignment computed by GIZA++ (cf. table 6, section 5.5.1). Expressed in f-score, the rule-based VP alignment exhibits an improvement of 17 percentage points (f-score = 81%). Precision of the baseline VP alignment is 66% whereas the precision of the rule-based VP alignment is 80%. Recall of the baseline alignment is 61% whereas the recall of the rule-based VP alignment is 81%.

Wrong alignments are computed if wrong parallel VPs are identified. Not every English VP has a parallel Italian VP. Due to free translation, English VPs can correspond to Italian PPs or participles, or they are simply not translated. These cases cause problems for the rule-based VP alignment. The process of identifying the matching Italian VP for an English VP does not always find the correct Italian VP. Since the VP alignment rules take only the PoS of the phrase elements into account, they compute wrong word alignments in these cases.

Furthermore, the implementation of the rules is insufficient as they do not place any constraints on the number of links that can be computed for a VP element. In some cases, this leads to additional alignments which are erroneous. For example, the VPs can be extended, containing participles or infinitives that do not correspond to any element of the parallel phrase. Such phrases can lead to an alignment between, for example, one English main verb and two Italian main verbs.
The program does not verify which alignment is more probable (e.g., by checking the lexical parallelism of the aligned words) and should therefore exist in the resulting alignment. Instead, all possible alignments are included in the computed VP alignment. The errors made by the rule-based VP alignment are discussed in section 5.5.2.

I built four SMT systems to examine whether the improved VP alignment leads to an improvement of the pronominal subject translation between English and Italian. For each translation direction, two systems have been built: (i) a phrase-based SMT system using the baseline word alignment (M1), and (ii) a phrase-based SMT system using the baseline word alignment combined with the rule-based VP alignment (Mmod). In the translation direction EN → IT, M1 has a BLEU score of 19.15 whereas the BLEU score of Mmod is 18.18. In the opposite translation direction, the BLEU score of M1 is 22.07 whereas the BLEU score of Mmod is 21.81 (cf. table 9, section 6.2). The BLEU scores are slightly worse for the Mmod systems. Manual examination of the generated sentences revealed, however, that all systems produce nearly identical output, leading to the conclusion that the rule-based VP alignment does not have any impact on the (null) subject translation between English and Italian.

However, the rule-based word alignment does change the translation parameters. The number of phrase pairs (VPs) in which the English phrase contains the subject pronoun whereas the Italian VP has only the inflected verb form is greater in Mmod than in M1 (cf. section 6.3). In some cases, the translation probability of the correct translation pair is higher in Mmod than in M1 (cf. table 16, section 6.3). These observations lead to two important conclusions: (i) when translating Italian into English, Mmod is more likely to generate the English subject pronoun; (ii) the probability of generating the correct Italian inflected verb is higher in Mmod than in M1.
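The change in translation parameters can be made concrete with the figures reported for i can in table 16. The sketch below rests on an assumption that is not stated in the text: that the phrase translation probability is a relative frequency, p(target | source) = count(source, target) / count(source), and that the "phrase count" row is count(source). The function name `estimated_pair_count` is introduced here for illustration only.

```python
def estimated_pair_count(probability, source_phrase_count):
    """Invert p(target | source) = count(source, target) / count(source).

    Assumes relative-frequency estimation; returns an estimated number of
    extractions of the phrase pair, not an exact count.
    """
    return probability * source_phrase_count


# Values copied from table 16 (source phrase: "i can").
TABLE_16 = {
    "M1": {"posso": 0.5712, "io posso": 0.0024, "phrase_count": 2902},
    "Mmod": {"posso": 0.6341, "io posso": 0.0034, "phrase_count": 2963},
}

if __name__ == "__main__":
    for system, row in TABLE_16.items():
        for target in ("posso", "io posso"):
            estimate = estimated_pair_count(row[target], row["phrase_count"])
            print(f"{system}: (i can, {target}) ~ {estimate:.0f} extractions")
```

Under this assumption the estimates for Mmod match the extraction counts quoted in section 6.3 (1879 and 10), and the estimates for M1 come within about 1% of the quoted counts (1650 and 7), which suggests that the reported probabilities and counts are related in this simple way.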
Although the differences in translation probabilities for the relevant translation phrases indicate that Mmod should generate better translations, an improvement in translation output was not observed. This can be explained by the fact that the translation probabilities of the phrases consisting of a subject pronoun with a VP are relatively small. The verbs in such phrases do not only occur with pronominal subjects, but also with NP subjects. In such contexts, the verb (or VP) pairs are extracted without a subject pronoun. Their likelihood is high since they occur often and with a large number of different NP subjects. When translating inflected Italian verbs into English, it is therefore very likely that the verb is translated into the corresponding English verb. If the Italian verb has a NP subject, this translation is correct. But if the Italian subject is dropped, an English sentence is generated that does not have a subject.

I also noticed that some pronouns are more often correctly translated than others. This is due to the relatively infrequent use of subject pronouns and the characteristics of the corpus that I have been working with. 1st and 2nd person pronouns are used more frequently than 3rd person pronouns. Observation of the generated sentences showed that 1st person pronouns are correctly translated in most cases. 2nd person pronouns are problematic because of the ambiguity of Italian verbs and the characteristics of the corpus (cf. examples (108) and (109), section 6.3 and table 3, section 2.4). 3rd person pronouns cause the most problems because the verbs they occur with can also have NP subjects, as already mentioned above.

When translating English into Italian, it is very likely that the English phrase containing the subject pronoun (for example, a 3rd person pronoun) and the VP is not included in the translation table.
The pronoun is then translated separately from the VP, leading to the generation of the Italian subject pronoun. If this occurs in many subsequent sentences, one is faced with overgeneration of pronouns in the Italian output.

The problem regarding the small translation probabilities of the phrases consisting of a pronominal subject and a VP cannot be solved by better (or perfect) word alignment of the VPs with pronominal subjects. Instead, a parallel corpus is needed in which the pronouns occur much more frequently with a large number of different verbs. Within a SMT system, this would increase their translation probabilities automatically. However, when translating Italian into English, a syntactic analysis of the Italian input is needed to derive whether the sentence has a pronominal or a NP subject. Given this information, the correct translation phrase can be chosen. The linguistic characteristics (person, number, gender) of Italian pronominal subjects can be determined if the Italian (null) subject is resolved, which requires access to previous sentences.

In the opposite translation direction, it has to be decided whether the Italian subject pronouns have to be dropped or expressed overtly. I noticed that some adjectives (for example, tutti (= all)) trigger the use of overt Italian subject pronouns (cf. examples (24) - (27), section 2.4). However, since the use of Italian subject pronouns has pragmatic reasons (cf. section 2.3), it is not trivial for a (statistical) machine translation system to decide whether the subject pronoun should be realized overtly or be dropped.

7.2 Future work

In my thesis, I showed a method for aligning English VPs with pronominal subjects with parallel Italian VPs. Improved alignment of English subject pronouns with Italian inflected verbs did not result in an improvement of pronominal subject translation between English and Italian.
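The idea of choosing the translation phrase according to whether the Italian input has an overt subject (section 7.1) can be sketched as follows. This is not the syntactic analysis the text calls for: the check below is a deliberately crude stand-in that only looks for a nominal before the first finite verb in a PoS-tagged clause, using tag names from the Italian tag set in appendix A. The function names and the two example clauses are hypothetical and introduced for illustration.

```python
# Tags from the Italian tag set (appendix A). Treating these as possible
# subject heads is an assumption of this sketch, not part of the thesis.
SUBJECT_TAGS = {"NOUN", "NPR", "PRO:pers", "PRO:demo", "PRO:indef"}
FINITE_TAGS = {"VER:fin", "AUX:fin", "VER2:fin"}


def has_overt_subject(tagged_clause):
    """Crude heuristic: is there a nominal before the first finite verb?

    tagged_clause is a list of (word, tag) pairs. Post-verbal subjects and
    preverbal non-subject nominals are not handled.
    """
    for _word, tag in tagged_clause:
        if tag in FINITE_TAGS:
            return False
        if tag in SUBJECT_TAGS:
            return True
    return False


def choose_translation(tagged_clause, bare_verb, pronoun_plus_verb):
    """Pick the bare verb if a subject is present, else pronoun + verb."""
    if has_overt_subject(tagged_clause):
        return bare_verb
    return pronoun_plus_verb


if __name__ == "__main__":
    with_np = [("i", "ART"), ("deputati", "NOUN"),
               ("hanno", "AUX:fin"), ("votato", "VER:ppast")]
    null_subject = [("hanno", "AUX:fin"), ("votato", "VER:ppast")]
    print(choose_translation(with_np, "have", "they have"))
    print(choose_translation(null_subject, "have", "they have"))
```

With the two invented clauses, the first call selects the bare verb and the second selects the phrase with the pronoun, mirroring the two translations of hanno discussed in section 6.3. A parser-based subject detection would be needed for anything beyond such simple word orders.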
In the following, I outline further possible methods to improve the alignment of relevant phrases and the translation of pronominal subjects between the null subject language Italian and the non-null subject language English.

Word alignment of English and Italian VPs

The method for the VP alignment that I presented in this thesis is based on the assumption that every English VP with a pronominal subject has a parallel Italian VP. This assumption does not always hold since the translations are not always literal. Some English VPs do not have an Italian counterpart or they correspond to an Italian phrase of an arbitrary type. The rule-based method for the VP alignment could be extended in order to handle these cases. The method for identification of a parallel Italian phrase should allow Italian phrases like PPs to be identified as parallel phrases of English VPs. The translation probabilities of English and Italian PoS sequences could be used to derive parallel phrases.

The rules for the VP alignment handle only verbal elements of a VP, negation and subject pronouns. In the case of a syntactic divergence in which the words of a phrase pair do not have a matching PoS, they remain unaligned. In some cases, this leads to the removal of correct alignment links. The deletion of such links could be avoided if their reliability were computed (for example, by using lexical translation probabilities and the alignment of the neighbouring words).

If we assume that syntactic phrases of different types in English and Italian correspond to each other, we need parse trees of English and Italian sentences in order to identify the parallel phrases correctly.

In this work, the VP alignment rules have been applied only to VPs with a pronominal subject. The rules could also be used to align all VPs regardless of the type of subject (pronominal or NP subject).

Translation direction IT → EN

Italian sentences often do not have overtly expressed subjects.
Their characteristics like person, number and gender can be derived from the inflected verb and the preceding context (sentences). Statistical machine translation systems do not have access to the preceding context of a sentence that should be translated. The translation phrases which contain Italian inflected verbs and the English language model should therefore lead to the generation of correct English subject pronouns. Correct phrase pairs could be learned from a corpus which contains many sentences with pronominal subjects (cf. section 6.4). But if the correct translation phrases had high translation probabilities, we would run into problems when the Italian source sentence contains a NP subject. If the inflected verb generates an English pronoun, the English translation could have two subjects, which would be incorrect. The information about the subject in a source sentence could be used to choose the correct translation phrase pair.

Another approach to the problem of the generation of English pronominal subjects is the incorporation of pronoun resolution into the translation process. If the referent of an omitted Italian subject pronoun is determined, all characteristics (number, gender, etc.) of the missing pronoun could be derived and used to generate the corresponding English pronoun.

Translation direction EN → IT

When translating English subject pronouns into Italian, it is important that the correct inflected verb is generated. Furthermore, a decision has to be made whether the subject pronoun should be generated or omitted. The use of an appropriate corpus as training data (cf. section 6.4) could lead to an improvement of the translation of English pronouns into Italian (null) pronouns. A corpus containing many sentences with different pronominal subjects would lead to the extraction of many different English phrases consisting of a subject pronoun and verbs with their Italian counterparts ((null subject) + inflected verb).
This would, however, not solve the problems which concern the gender discrepancy between English and Italian. To ensure the generation of a correct Italian subject pronoun, it would be necessary to resolve the co-reference of an English pronominal subject. [Le Nagard & Koehn, 10] show a method for the integration of co-reference resolution into phrase-based statistical machine translation.

In some cases, Italian subject pronouns are expressed overtly. Statistical models could learn such contexts (word or PoS sequences) in order to predict how the Italian subject pronoun should be realized.

A Italian tag set

ADJ adjective
ADV adverb (excluding -mente forms)
ADV:mente adverb ending in -mente
ART article
ARTPRE preposition + article
AUX:fin finite form of auxiliary
AUX:fin:cli finite form of auxiliary with clitic
AUX:geru gerundive form of auxiliary
AUX:geru:cli gerundive form of auxiliary with clitic
AUX:infi infinitival form of auxiliary
AUX:infi:cli infinitival form of auxiliary with clitic
AUX:ppast past participle of auxiliary
AUX:ppre present participle of auxiliary
CHE che
CLI clitic
CON conjunction
DET:demo demonstrative determiner
DET:indef indefinite determiner
DET:num numeral determiner
DET:poss possessive determiner
DET:wh wh determiner
NEG negation
NOCAT non-linguistic element
NOUN noun
NPR proper noun
NUM number
PRE preposition
PRO:demo demonstrative pronoun
PRO:indef indefinite pronoun
PRO:num numeral pronoun
PRO:pers personal pronoun
PRO:poss possessive pronoun
PUN non-sentence-final punctuation mark
SENT sentence-final punctuation mark
VER2:fin finite form of modal/causal verb
VER2:fin:cli finite form of modal/causal verb with clitic
VER2:geru gerundive form of modal/causal verb
VER2:geru:cli gerundive form of modal/causal verb with clitic
VER2:infi infinitival form of modal/causal verb
VER2:infi:cli infinitival form of modal/causal verb with clitic
VER2:ppast past participle of modal/causal verb
VER2:ppre present participle of modal/causal verb
VER:fin finite form of verb
VER:fin:cli finite form of verb with clitic
VER:geru gerundive form of verb
VER:geru:cli gerundive form of verb with clitic
VER:infi infinitival form of verb
VER:infi:cli infinitival form of verb with clitic
VER:ppast past participle of verb
VER:ppast:cli past participle of verb with clitic
VER:ppre present participle of verb
WH wh word

B English tag set (Penn Treebank Tagset)

CC Coordinating conjunction
CD Cardinal number
DT Determiner
EX Existential there
FW Foreign word
IN Preposition or subordinating conjunction
JJ Adjective
JJR Adjective, comparative
JJS Adjective, superlative
LS List item marker
MD Modal
NN Noun, singular or mass
NNS Noun, plural
NNP Proper noun, singular
NNPS Proper noun, plural
PDT Predeterminer
POS Possessive ending
PRP Personal pronoun
PRP$ Possessive pronoun
RB Adverb
RBR Adverb, comparative
RBS Adverb, superlative
RP Particle
SYM Symbol
TO to
UH Interjection
VB Verb, base form
VBD Verb, past tense
VBG Verb, gerund or present participle
VBN Verb, past participle
VBP Verb, non-3rd person singular present
VBZ Verb, 3rd person singular present
WDT Wh-determiner
WP Wh-pronoun
WP$ Possessive wh-pronoun
WRB Wh-adverb

C English subject pronoun occurrences

In the process of computing the VP alignment, clauses in the English part of the parallel corpus (cf. chapter 5.2) are identified and checked for whether they contain a subject pronoun. I counted subject pronoun occurrences and clauses in which the subject is not pronominal. The counting results are shown in table 21. The entire corpus consists of 749,646 sentences which can be divided into 1,254,086 clauses [34].

I     we    you   he     she    it    they   NP
14%   15%   2%    0.8%   0.2%   9%    0.2%   54%
Table 21: Pronoun occurrence in English

Half of the corpus clauses have NP subjects. In the context of dealing with subject pronouns, these clauses (and their verbs) cannot be used to extract English verbs and pronouns with their correspondences in Italian.
In fact, they contribute to the probabilities of phrases consisting only of verbs without a subject pronoun.

[34] The missing 5% are due to false recognition of subjects.

List of Tables

1 Statistics on referents of 3rd person subjects in Italian
2 Occurrence of SUBJ in Italian
3 Occurrence of null-SUBJ in 93 observed clauses
4 Evaluation of GIZA++ word alignment for English and Italian
5 Example phrase translation probabilities for io sono
6 Evaluation of the VP alignment
7 Evaluation of VP alignment for different IT-VP identification approaches
8 Evaluation of different VP alignments
9 BLEU scores of the SMT systems for EN ↔ IT
10 Translation probabilities for so into (I) know
11 Top five translation phrases for so
12 Translation probabilities of che parli into that (you/she) speak(s)
13 Translation probabilities of parli into that (you) speak/talk
14 Translation probabilities of hanno into (they) have
15 Translation probabilities of pensano into (they) think
16 Translation probabilities of i can into (io) posso
17 Translation probabilities of we know into (noi) sappiamo
18 Top five translation phrases for we know
19 Translation probabilities of he into egli, lui and ha
20 Top five translation phrases for he has
21 Pronoun occurrence in English

List of Figures

1 Main program: correct_align
2 System components
3 Alignment check and improvement
4 Alignment of I would ask you to request and la prego di chiedere
5 Search for the best Italian VP
6 Incorrect base alignment of if you wish and se lo desidera
7 Correct alignment of if you wish and se lo desidera
8 Alignment of I can tell you and posso risponderle
9 Alignment of it actually passes and esso stesso approva
10 Alignment of I would say and volendo dire
11 Alignment of I have said and aver detto
12 Incorrect base alignment of I feel and ritengo
13 Correct base alignment of I feel and ritengo
14 Alignment of you enjoyed and abbiate trascorso
15 Alignment of you have requested and avete chiesto
16 Alignment of we were elected and sono stati eletti
17 Complete alignment of they had and di avere
18 Alignment of you have requested and avete chiesto
19 Alignment of you have requested and chiedevate
20 Alignment of I would like to say and vorrei dire
21 Alignment of we do not adhere and noi non rispettiamo
22 Alignment of I suggest to present and raccomando di presentare
23 Alignment of I shall do and seguirò
24 Alignment of we have upheld and abbiamo sostenuto
25 Alignment of you have suggested and lei propone (= you proposed)
26 Alignment of he is to go and verrà messo
. . . . . . . 64 . . . . . . . . . . . 64 27 Alignment of you hear and ascoltando . . . . . . . . . . . . . . . . . . . 65 28 Alignment comparison: I accept and lo accetto . . . . . . . . . . . . . . . 67 29 Alignment comparison: it will (, I hope,) be examined and sarà esaminata 67 30 Alignment comparison: I can (,therefore,) give and pertanto può contare su 68 31 Alignment comparison: we (then) proceed and poi di procedere . . . . . . 68 32 Alignment comparison: I have (thus) proposed and , ho proposto . . . . . 69 33 Alignment comparison: they do not (properly) reect and esso non rietterà 69 34 Alignment comparison: I might be allowed to give and mi permettesse di rilasciare . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70 35 Alignment of we agree (...) support and condividiamo (...) appoggiamo 74 36 Alignment of it may have been and sia stato 37 Alignment of they have been answered and avessero ottenuto (risposta) 38 Alignment of you were (unable to attend) and lei non ha potuto partecipare 77 39 Alignment of you were unable to attend and lei non ha potuto partecipare 77 40 Alignment of I come and provengo . . . . . . . . . . . . . . . . . . . . . 80 41 Alignment of you know and assume . . . . . . . . . . . . . . . . . . . . . 80 104 . . . . . . . . . . . . . . . . . . 74 75 References [Baroni et al., 04] Baroni, M. et al. Introducing the "la Repubblica" corpus: A large, an- notated, TEI(XML)-compliant corpus of newspaper Italian in Proceedings of LREC 2004, Lisbon, Portugal, 2004 [Bennis, 06] Bennis, H. Agreement, Pro, and Imperatives in Ackema, P.; Brandt, P. et al. (eds.) Arguments and Agreement, Oxford University Press, New York, 2006 [Brown et al., 03] Brown, P. F. et al. The Mathematics of Statistical Machine Transla- tion: Parameter Estimation, Computational Linguistics, 1993 [Butt, 94] Butt, M. Machine Translation and Complex Predicates, Konvens, Wien, 1994 [Charniak, 00] Charniak, E. 
A Maximum-Entropy-Inspired Parser in Proceedings of the conferences and Proceedings of the ANLP-NAACL 2000 Student Research WorkshopSeattle, USA, 2000 [Duranti, 80] Duranti, A. Sull' uso dei pronomi tonici nelle conversazioni in Berrettoni, P. (ed.) Problemi di analisi linguistica, Rome, 1980 [Duranti, 84] Duranti, A. The social meaning of subject pronouns in Italian conversation in Van Dijk, T. (ed.) Text. An interdisciplinary journal for the study of discourse, Mouton publishers, 1984 [Goldwater & McClosky, 05] Goldwater, S.; McClosky, D. Improving Statistical MT through Morphological Analysis in Proceedings of Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, Vancouver, 2005 [Haegeman, 96] Haegeman, L. Introduction to Government & Binding Theory, 2nd edi- tion, Blackwell Publishing, 1996 [Huang, 84] Huang, C.T.J. On the distribution and reference of empty pronouns in Roberts, I. (ed.) Comparative grammar. Critical concepts in linguistics, Routledge, 2007 [Koehn et al., 03] Koehn, P.; Och, F. J.; Marcu, D. Statistical phrase based translation in Proceedings of the Joint Conference on Human Language Technologies and the Annual Meeting of the North American Chapter of the Association of Computational Linguistics (HLT-NAACL), 2003. [Koehn, 05] Koehn, P. Europarl: A Parallel Corpus for Statistical Machine Translation, MT Summit, 2005 [Koehn et al., 07] Koehn, P. et al. Moses: Open Source Toolkit for Statistical Ma- chine Translation, Annual Meeting of the Association for Computational Linguistics (ACL), demonstration session, Prague, Czech Republic, June 2007 105 [Koehn, 09] Koehn, P. Statistical machine translation, Cambridge University Press, 2009 [Le Nagard & Koehn, 10] Le Nagard, R.; Koehn, P. Aiding Pronoun Translation with Co-Reference Resolution in Proceedings of the Joint 5th Workshop on Statistical Machine Translation and MetricsMATR, Uppsala, Sweden, 2010 [Nakaiwa & Ikehara, 92] Nakaiwa, H.; Ikehara, S. 
Zero pronoun Resolution in Japanese to English Machine Translation System using Verbal Semantic Attributes in Applied Natural Language Conferences. Proceedings of the third conference on Applied natural language processing, Trento, Italy, 1992 [Och & Ney, 03] Och, F. J.; Ney, H. A Systematic Comparison of Various Statistical Alignment Models in Computational Linguistics, vol. 29, num. 1, MIT Press, 2003 [Papineni et al., 02] Papineni, K.; Roukos, S.; Ward, T.; Zhu, W. BLEU: A Method for Automatic Evaluation of Machine Translation in Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL), Philadelphia, 2002 [Peral & Ferrández, 03] Peral, J.; Ferrández, A. Translation of Pronominal Anaphora between English and Spanish: Discrepancies and Evaluation in Journal of Articial Intelligence Research 18, 2003 [Pianta & Bentivogli, 04] Pianta E.; Bentivogli, L. Knowledge Intensive Word Align- ment with KNOWA, Proceedings of the 20th international conference on Computational Linguistics, Geneva, Switzerland, 2004 [Rizzi, 82] Rizzi, L. Negation, Wh-movement and the null subject parameter in Compar- ative Grammar, Volume II, The Null-Subject Parameter, Roberts, I. (ed.), Routledge, 2007 [Roberts, 07] Roberts, I. Introduction. The Null-Subject Parameter in Roberts, I. (ed.) Comparative grammar. Critical concepts in linguistics, Routledge, 2007 [Schmid, 95] Schmid, H. Probabilistic Part-of-Speech Tagging Using Decision Trees in Proceedings of International Conference on New Methods in Language Processing, 1995 [Schmid, Baroni et al., 2007] Schmid, H. et al. The enriched TreeTagger System in In- telligenza Articiale IV-2, 2007 [Stolcke, 02] Stolcke, A. SRILM An Extensible Language Modeling Toolkit in Proc. Intl. Conf. on Spoken Language Processing, vol. 2, Denver, 2002 [Tsao, 77] Tsao, F. 
A Functional Study of Topic in Chinese: The First Step toward Discourse Analysis, Dissertation, USC, Los Angeles, 1977 [Vanelli, Renzi, et al., 06] Vanelli, L.; Renzi, L.; Benincà, P. A typology of romance sub- ject pronouns in Roberts, I. (ed.) Comparative grammar. Critical concepts in linguistics, Routledge, 2007 106 [Zanchetta & Baroni, 05] Zanchetta, E.; Baroni, M. Morph-it! A free corpus-based mor- phological resource for the Italian language in Corpus Linguistics 2005, University of Birmingham, Birmingham, UK, 2005 107