Universität Stuttgart
Institut für Maschinelle Sprachverarbeitung
Azenbergstraße 12, 70174 Stuttgart

An Empirical Analysis of Source Context Features for Phrase-Based Statistical Machine Translation

Marion Weller
Diplomarbeit Nr. 93
01.03.2010 - 01.09.2010
Supervisor: Dr. Alexander Fraser
Examiners: apl. Prof. Dr. Ulrich Heid, Prof. Dr. Hinrich Schütze

I hereby declare that I have written this thesis independently and have used no literature other than that stated. All quotations and paraphrased borrowings are marked as such, with exact indication of the source.

Abstract

Statistical phrase-based machine translation systems make little use of context information: while the language model takes target-side context into account, context information on the source side is typically not integrated into phrase-based translation systems. Translational features such as phrase translation probabilities are learned from phrase-translation pairs extracted from word-aligned parallel corpora. Since no information is available besides the co-occurrence frequencies of the phrase-translation pairs, all occurrences of a given source phrase are used for the estimation of translation probabilities, regardless of their contexts in the training data. However, information about the context of a source phrase, e.g. adjacent words or part-of-speech tags, might be a valuable resource for identifying appropriate translations in a given context. In this work, we analyze the use of source-side context features in phrase-based statistical machine translation. For every phrase in an input sentence, context-sensitive phrase translation probabilities will be estimated: by reducing the set of all phrase-translation pairs to the subset of those with the same context as the given phrase, we can compute individual translation probabilities depending on the respective context.
Assuming that the different translations of ambiguous source phrases occur within different contexts, contextually conditioned translation probabilities might help to resolve ambiguities by separating the entire set of translation candidates into subsets appropriate for different situations. Beyond this, the more refined probability estimates should also have a generally positive influence on translation quality. Furthermore, the integration of context features offers the possibility of including linguistic information which is not used in standard statistical machine translation. In our experiments, which are conducted on an English to German translation system, we focus on the integration of local context features, choosing a simple method for the computation of contextually conditioned phrase-translation probabilities and their incorporation into a standard phrase-based statistical translation system. For all experiments, we provide an extensive evaluation of the overall translation quality using standard automatic metrics such as bleu, but also attempt to rate fluency and adequacy individually.

Contents

1 Introduction
2 Related work
  2.1 Overview of statistical phrase-based machine translation
  2.2 Contextually conditioned translation probabilities
  2.3 Classification-based methods
  2.4 Word sense disambiguation
  2.5 Relation to example-based translation
3 Integrating context features into a phrase-based statistical translation system
  3.1 Motivation
  3.2 Translation phrase tables
    3.2.1 Estimating translation probabilities
    3.2.2 Reordering
    3.2.3 Parameter tuning
  3.3 Including context information
    3.3.1 A simple first example
    3.3.2 Expanded example: Using Pos-based context templates
    3.3.3 Implementation
  3.4 Experimental settings and tools
    3.4.1 Description of the Pos tag set
    3.4.2 Dimensions of the original and modified phrase-tables
4 Evaluation methods
  4.1 Bleu
  4.2 Error types in machine translation output
  4.3 Syntactically motivated evaluation
  4.4 Testing for significance
5 Rating the reliability of contexts and smoothing
  5.1 General reflections
  5.2 Interpolation and discounting
  5.3 Discount factors
    5.3.1 Count-based criteria for the usefulness of contexts
    5.3.2 Example: interpolation and discounting
    5.3.3 Good-Turing estimation
    5.3.4 Type-token relations as criteria for the evaluation of contexts
  5.4 Back-off
  5.5 Evaluation of the presented approaches
6 Basic context templates
  6.1 Word-based contexts
  6.2 Pos-based contexts
  6.3 Combining features
  6.4 Evaluation
    6.4.1 Content word oriented evaluation
    6.4.2 Manual evaluation
  6.5 Analysis of selected examples
    6.5.1 Adequacy
    6.5.2 Fluency: translational behavior of verbs
  6.6 Summary
7 Analysis of general aspects of context features
  7.1 Chunks as source-side context features
    7.1.1 Chunked data
    7.1.2 Chunk-based contexts
    7.1.3 Results and evaluation
  7.2 Granularity of context features
  7.3 Analysis of specific context realizations
  7.4 Phrasal segmentation
    7.4.1 Example
    7.4.2 Comparison with other systems
  7.5 Feature selection
8 Translating verbs: Filtering translation options with target side information
  8.1 Basic idea
  8.2 Evaluation
    8.2.1 Results
    8.2.2 Error analysis
    8.2.3 Comparison of the presented systems
9 Conclusion and future work
  9.1 Summary
  9.2 Discussion
  9.3 Morpho-syntactic preprocessing
References
A Manually evaluated sample
  A.1 Contextually conditioned system better than baseline
  A.2 Baseline better than contextually conditioned system

1 Introduction

In the course of this work, we discuss to what extent a phrase-based statistical machine translation system benefits from enriching phrase translation probabilities with source-side context features. The basic idea of phrase-based statistical machine translation (pbsmt) is to segment an input sentence into sequences of words (phrases) which are then translated. In order to find a good translation, the scores of a translation model and a target-side language model are maximized: this setting takes into account the relation between source and target language as well as target language similarity by requiring that a phrase is at the same time a good translation of the source phrase and also leads to a good target language string. Phrase translation probabilities (and other translational features) are learned from word-aligned parallel corpora by extracting all possible phrase-translation pairs from the corpus. The probability that a source phrase is translated into a given target phrase is estimated simply from relative frequencies. Since this estimation is carried out on the set of all phrase-translation pairs, there is no information available besides their co-occurrence frequency. This means that information about the context in which a source phrase occurred in the training data is lost even though it might contain key information necessary for a good translation. Context information used for a better estimation of phrase-translation probabilities could also include linguistically motivated features like part-of-speech (pos) tags. Since linguistic information is not used in standard pbsmt systems (despite providing valuable information), the use of context knowledge also offers genuinely new features in statistical machine translation.
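The relative-frequency estimation described above can be illustrated with a minimal sketch; the helper name and the example phrase pairs are illustrative and not taken from the thesis.

```python
from collections import Counter, defaultdict

def phrase_translation_probs(phrase_pairs):
    """Estimate p(e|f) as a relative frequency over all extracted
    phrase-translation pairs, ignoring the contexts the source
    phrases occurred in (the standard pbsmt estimation)."""
    phrase_pairs = list(phrase_pairs)
    pair_counts = Counter(phrase_pairs)                 # count(f, e)
    source_counts = Counter(f for f, _ in phrase_pairs)  # count(f)
    probs = defaultdict(dict)
    for (f, e), n in pair_counts.items():
        probs[f][e] = n / source_counts[f]
    return probs

# Toy example: 'bank' seen twice as 'Bank', once as 'Ufer'
p = phrase_translation_probs(
    [("bank", "Bank"), ("bank", "Bank"), ("bank", "Ufer")])
# p["bank"] mixes both senses: 2/3 vs. 1/3, regardless of context
```

Note how all occurrences of the ambiguous source phrase are pooled into one distribution, which is exactly the loss of information the thesis sets out to address.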
While the language model takes into account local context on the target side when rating the similarity between a translation and the target language, no source-side context information is used by the system. However, additional information about the context of a phrase, e.g. adjacent words or a linguistic description thereof, might be a useful resource for the identification of appropriate translations for a given context. The additional information provided by context features could help to deal with ambiguous phrases, but more refined probability estimates should also have a generally positive influence. Assuming that context information helps to improve translation quality, we intend to re-estimate phrase-translation probabilities conditioned on context information and thus produce context-sensitive translation distributions. This means that for every phrase in an input source sentence, individual phrase-translation probabilities will be computed depending on its respective context: only those phrase-translation pairs extracted from the training data with the corresponding context will be used for the estimation of translation probabilities for the given phrase. In our experiments, conducted on an English→German translation system, we focus on the integration of local source context features. We analyze the effect of one context (e.g. one adjacent word or pos tag) at a time: this will be called a context template, indicating that the type of context is fixed while the realizations (i.e. the respective words or pos tags) within this template vary depending on the situation. For all experiments, a detailed evaluation based on several automatic translation quality metrics will be provided, as well as a small-scale manual evaluation.

Summary

We begin this work with a brief introduction to pbsmt, followed by an overview of previous publications on the subject of using context features to improve phrase-based machine translation.
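The context-conditioned re-estimation sketched in the paragraph above can be written as a direct extension of the relative-frequency estimate: counts are restricted to occurrences sharing the same context. This is a minimal sketch under the assumption that each training occurrence is recorded as a (phrase, context, translation) triple; the example words are illustrative only.

```python
from collections import Counter

def context_conditioned_probs(occurrences):
    """Estimate p(e|f, cf): restrict the counts to occurrences whose
    source phrase f appeared with the same context cf (e.g. the
    left-adjacent word, or its pos tag).
    occurrences: (f, cf, e) triples, one per training occurrence."""
    occurrences = list(occurrences)
    triple_counts = Counter(occurrences)                  # count(f, cf, e)
    context_counts = Counter((f, c) for f, c, _ in occurrences)  # count(f, cf)
    return {(f, c, e): n / context_counts[(f, c)]
            for (f, c, e), n in triple_counts.items()}

# Toy example: 'interest' after 'of' vs. after 'rates'
occ = [("interest", "of", "Interesse"),
       ("interest", "of", "Interesse"),
       ("interest", "of", "Zins"),
       ("interest", "rates", "Zins")]
p = context_conditioned_probs(occ)
```

Within the context "rates", the candidate "Interesse" receives no probability mass at all, which is how contextual conditioning prunes translation candidates never seen in the given context.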
The methods of context integration presented in chapter 2 range from straightforward probability estimation using relative frequencies to sophisticated classification methods and techniques originally used for word sense disambiguation. The approach we chose for this work is situated at the simple end of the spectrum, both in terms of context design and integration techniques. A detailed motivation for context features is provided in chapter 3: After explaining the basic idea of pbsmt and the design of some of the most important features used by a standard system, we illustrate how source-side information allows the re-estimation of translation probabilities depending on the given context. Using the example of ambiguous source phrases, we show how contextually conditioned translation probabilities enable the system to find appropriate translations by excluding those translation candidates which have not been seen in the training data within the given context. This chapter also provides explanations of the technical aspects of using context-dependent translation probabilities and a description of data and tools. In chapter 4, we discuss general aspects of evaluating machine translation. We present standard evaluation metrics like bleu and briefly discuss advantages and disadvantages of automatic evaluation metrics. We also discuss the idea of computing bleu scores based on pos-tagged or lemmatized output to improve the measurement of the syntactic or lexical quality of the produced translations. So far, our simple approach of estimating contextually conditioned phrase-translation probabilities does not take into account the usefulness of a specific phrase-context combination; that is to say, whether conditioning a phrase on its context results in a better probability estimation or, in the worst case, even leads to a deterioration. In chapter 5, we define criteria to rate the usefulness of phrase-context combinations, and we also address smoothing issues.
In order to avoid over-estimated translation probabilities of low-frequency phrases conditioned on context features, we discuss the potential use of restricting contextual conditioning to phrases with a minimum occurrence frequency, but also analyze the shape of contextually conditioned distributions in an attempt to derive the significance of a phrase-context combination. The discussed criteria will then be verified in experiments, with the result that simple frequency-based conditions work as well as more sophisticated ones. In chapter 6, we present experiments with different context templates and combinations thereof: Contexts based on either words or pos tags will be integrated separately into translation systems. In addition to full phrase-based systems, we also analyze the output of word translation systems and systems with a minimal number of translation features, since in these simplified settings the effect of context-informed translation probabilities is greatly enhanced. The evaluation methods used are the standard metrics bleu and nist, but we also use modified versions based on pos-tagged or lemmatized mt output for evaluating grammatical and lexical quality. With an analysis of correctly translated content words, we try to rate adequacy. Furthermore, a manual evaluation will give information about three participants' preferences (baseline system vs. context-informed system) and provide data for a small-scale analysis of error types. We conclude this chapter by discussing example translations in order to illustrate the effects of contextual conditioning observed during the evaluation, as well as indirect and unexpected effects. Chapter 7 begins with a discussion of context granularity: In an attempt to generalize context features as much as possible, we use a very compact set of contexts, linguistically well-formed phrases, and also experiment with differently designed sets of pos tags.
Conditioning on phrasal elements also introduces the subject of phrasal segmentation of the source sentence, which is the second topic of this chapter. We compare translation probability distributions conditioned on either good or less informative contexts with the original distributions; this analysis will be helpful to understand the reason for effects observed in the previous chapter and the general influence on phrasal segmentation. Since our systems produce segmentations that differ greatly from results reported in previous publications, we attempt to find an explanation and specifically compare our approach to one of the methods presented in chapter 2. The translation of verbs is generally challenging due to the structural differences between English and German; in chapter 8, we focus entirely on the problem that verbs tend to be 'lost' during translation due to being translated as meaningless structures such as verb particles. While in all previous experiments only source-side context features were used for conditioning, we now also integrate target-side information: By requiring that source phrases containing a verb are translated as phrases containing at least one content word, we try to enforce that verbs are translated with meaningful phrases. We conclude this work with a summary of our results and ideas for improvement in chapter 9. Experiments showed significantly improved results in the case of word translation, but the outcome was less clear for full systems. However, our extensive evaluation revealed interesting effects, such as the fact that contextually conditioned systems translate more content words than a standard system and that especially linguistically motivated features tend to produce better translations in terms of grammaticality: Overall, we found evidence for a positive effect on the two traditional criteria of fluency and adequacy.
2 Related work

In phrase-based statistical machine translation, parallel word-aligned text is used to learn translation probabilities as well as other translational features. The estimation of translation probabilities is based on relations between target and source phrases, but does not take into account the relation of source phrases and the context they actually appear in. However, contextual information could be a valuable tool to identify translations that are especially appropriate for a given context. In this chapter, different publications integrating context knowledge into standard phrase-based statistical machine translation systems are presented. The methodologies range from a straightforward estimation of probabilities using relative frequencies to more complex methods such as classification and techniques originally used in word sense disambiguation. Given a certain similarity to example-based machine translation (ebmt), we conclude this chapter with a brief comparison of the basic concept of ebmt and the general idea of integrating context features in smt.

2.1 Overview of statistical phrase-based machine translation

The concept of phrase-based machine translation is illustrated in figure 1: the English input sentence is segmented into phrases, which are then translated. Translation units can be of variable length and are not necessarily linguistically well-formed. As can be seen from the phrase pairs [the minister]→[der minister] and [attends]→[besucht], the order of the phrases in the source sentence is not always maintained. The decisions of the translation system are largely based on the translation probabilities of the individual phrases and the target-side language model. By choosing translations with high translation probabilities, the system attempts to reproduce the content of the source phrase as precisely as possible.
The second factor, the language model, which rates the quality of a target language string, is meant to guarantee that the chosen translation fits well into the target translation. Additionally, a reordering model indicates how the phrases are to be positioned in the target sentence. Translational features are derived from phrase-translation pairs extracted from word-aligned training data. The translation probabilities for a given source phrase f to be translated into a target phrase e are estimated by calculating the relative frequencies of all observed translation candidates for the source phrase f.

[Figure 1: Example for phrase-based machine translation — the English phrases "tomorrow", "the minister", "attends", "a conference" aligned with the German phrases "morgen", "der minister", "besucht", "eine konferenz"]

smt systems use many different features whose respective relevance is trained by repeatedly translating a development data set with varying parameter settings until an optimal setting is found (minimum error rate training). The different features and their feature weights are represented by a log-linear model, which is explained in more detail in section 3.2.3. When translating, the system searches for phrases which maximize the combined scores of translational features and language model. Since the system does not only compute scores for the phrase to be translated, but also estimates the future cost of the subsequent phrases as part of the costs of the current phrase, the features of the next phrase can affect the translation of the current phrase. Additionally, the language model conditions translation decisions on target-side context by accepting only phrases that lead to a good target string when added to the already translated phrases. However, this is only a very indirect form of context information and does not directly take into account any specific source context of the phrase to be translated.
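The log-linear combination of weighted features mentioned above can be sketched compactly. The feature names below are hypothetical placeholders, not the actual feature set of the thesis; in a real system the weights would come from minimum error rate training.

```python
import math

def loglinear_score(features, weights):
    """Log-linear model: score = sum_i lambda_i * log h_i, i.e. the log
    of a weighted product of (positive) feature values for one
    translation hypothesis. The decoder prefers the hypothesis with
    the highest combined score."""
    return sum(weights[name] * math.log(value)
               for name, value in features.items())

# Hypothetical feature values for one hypothesis: phrase translation
# probability, inverse probability, and a language model score.
hyp = {"p_e_given_f": 0.4, "p_f_given_e": 0.3, "lm": 0.01}
w = {"p_e_given_f": 1.0, "p_f_given_e": 0.5, "lm": 1.2}
score = loglinear_score(hyp, w)
```

Because the model is a weighted sum of log features, adding a new feature function, such as a contextually conditioned translation probability, only requires one additional term and one additional weight.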
To condition translation probabilities on source-side context, not only the phrase-translation pairs are extracted from the training data, but also information about the context they appeared in. The translation probabilities of a phrase occurring in an input sentence are then estimated based on only the set of phrase-translation pairs with the same context, instead of using the entire set of all phrase-translation pairs.

2.2 Contextually conditioned translation probabilities

This work is mainly inspired by [Gimpel and Smith, 2008], who integrate context features into a standard phrase-based statistical translation system. As the translation probability p(e|f) for a given source phrase f does not take into account any form of context information, they adjusted the phrase translation probability by conditioning on the phrase f and its context cf; thus, they work with refined translation probabilities p(e|f, cf), which are estimated by using the relative frequencies of source phrases with the respective contexts and target phrases. Only source-side context is used for conditioning translation probabilities, since including target-side context features would require adapting the algorithms used for parameter training and decoding. Additionally, the target language model is already a very strong component, which effectively ensures that only those translation candidates are chosen that fit well into the already produced partial translation. Local ambiguities such as morphological agreement between two phrases can be resolved since the language model only gives good scores to strings which have been observed in the training data; thus, ungrammatical (i.e. not observed) phrase combinations are excluded. Phrase-based mt systems use many features in a setting that allows the addition of new features (log-linear model, cf. 3.2.3).
The contextually conditioned probabilities, some of which are based on very sparse data, are integrated into the log-linear model. By applying minimum error rate training to the new model, the new feature functions are weighted according to their usefulness. Thus, relevant context templates are assigned a high feature weight, whereas less useful ones are given a lower weight. Sparse data, or in this case infrequent phrase-translation-context combinations, often result in overestimated translation probabilities; [Gimpel and Smith, 2008] deal with this problem by merging features conditioned on different context templates into one single model, considering the different (sparse) contexts to be backed-off estimates of a complete context. Different types of context are used as the basis for a conditioned estimation of translation probabilities: The simplest form of context are the words next to the source phrase f. As actual words are likely to be too sparse, they can be replaced by their part-of-speech (pos) tags for a more general context. [Gimpel and Smith, 2008] also experimented with syntactically motivated features, such as the label of parse-tree nodes that span a source phrase or information about whether a phrase is a complete constituent. In addition to lexical and syntactic types of context features, the position of the phrase in the source sentence (start, end or relative position of the phrase) is also used as a context feature. As adding new context features did not always turn out to improve performance, [Gimpel and Smith, 2008] used a feature selection algorithm to find an optimal combination of context features. This procedure starts with no feature and iteratively adds the feature that leads to the largest improvement on unseen development data. Experiments with different context features were carried out on systems for Chinese→English, English→German and German→English translation.
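The idea of treating sparse context-conditioned estimates as backed-off versions of a richer context can be illustrated with a simple linear interpolation against the unconditioned distribution. This is a generic smoothing sketch, not the specific merging scheme of [Gimpel and Smith, 2008]; the interpolation weight lam is a free parameter here.

```python
def interpolate(dist_context, dist_plain, lam):
    """Smooth a sparse context-conditioned distribution p(e|f, cf)
    with the unconditioned p(e|f). Candidates unseen in the given
    context fall back on the general estimate instead of receiving
    zero probability; lam is the trust placed in the context."""
    return {e: lam * dist_context.get(e, 0.0) + (1.0 - lam) * p
            for e, p in dist_plain.items()}

# Toy example: the context saw only 'Zins', the plain distribution
# splits mass between 'Zins' and 'Interesse'.
p = interpolate({"Zins": 1.0}, {"Zins": 0.5, "Interesse": 0.5}, 0.8)
```

With lam near 1 the context dominates; with lam near 0 the estimate backs off almost entirely to the unconditioned probabilities, which is the behavior one wants for rarely observed phrase-context combinations.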
For Chinese→English, [Gimpel and Smith, 2008] report large improvements in three standard evaluation measures when using proceedings of the United Nations (un-proceedings) as training and test material, but no significant improvement when using a considerably smaller set of newswire data. This is not surprising, as there was not enough data to guarantee that the context features are useful. Additionally, un-proceedings are repetitive and formulaic and thus better suited for the use of additional context. A system trained on the combined un and news data performs worse without context features, but significantly outperforms the in-domain baseline (trained and tested on un-proceedings) when context information is used. While these results are not as high as the in-domain results, this outcome suggests that context features help to make better use of out-of-domain data. Unfortunately, the results for English→German and German→English were not as good as for Chinese→English. In most cases, features did not harm the outcome, but there was also no improvement. As reasons, [Gimpel and Smith, 2008] name the insufficient size of their training material, as well as the complex morphology and difficulties in parsing German. For the English→German translation, the best result was achieved by using the context templates proposed by the feature selection algorithm. Given that a rich variety of context features is used, including relatively sophisticated syntactic information, it is somewhat surprising that the best feature combination consists of the relatively simple combination of two pos tags on the left and one word on the right side of a source phrase. [Allauzen et al., 2009] implemented a system very similar to that presented by [Gimpel and Smith, 2008] for the wmt shared task 2009, translating French→English and English→French. As context features, they use words and pos tags on the left and the right side of source phrases.
While for the English→French translation the results were in the same range as the baseline system (a standard moses system), they report a small gain in bleu for the reverse direction French→English. [Allauzen et al., 2009] intend to redo their experiments on an especially large data set, which, at the point of publication, was not possible due to technical issues. In this work, we adopt the concept of estimating contextually conditioned translation probabilities based on relative frequencies. In contrast to the methodology presented by [Gimpel and Smith, 2008], we focus less on the simultaneous integration of many different context templates, and instead study different types of context separately and provide a detailed evaluation of translation results. Similarly to [Gimpel and Smith, 2008]'s unsatisfactory results for the language pair German-English, we will not achieve great results in standard bleu, but find modest improvements with alternative evaluation metrics. [Gimpel and Smith, 2008] use a feature selection algorithm to select context templates based on their relevance: While we also carry out experiments with combined context templates, we do not use a sophisticated method to rate the relative usefulness of contexts. However, since combined contexts tend to outperform single contexts in our experiments, this is an open point requiring more research.

2.3 Classification-based methods

The concept of maximum entropy in natural language processing is introduced by [Berger et al., 1996], who discuss efficient algorithms for parameter estimation and feature selection. They also present several practical applications, including modeling context-sensitive word translations for integration into the French→English translation system Candide (developed by ibm). [Garcia-Varea et al., 2001] adopt this idea and use a maximum entropy based approach to produce refined lexicon models for word translation.
They provide a pool of possibly useful features out of which a subset of relevant features is selected. Context information of both source and target side is taken from a 3-word window around the targeted word and includes words and word classes, allowing for generalization. They report promising results for this task on VerbMobil data. [Stroppa et al., 2007] choose a memory-based classifier to model context-dependent probability distributions. The input for such a classifier is a vector of fixed length containing the source phrase, context features and a label for the class, i.e. the target phrase e. The data are then stored in a decision-tree structure which is used to predict the conditioned probability p(e|f, cf). The tree is traversed top-down, with the most informative features tested first. The output of the classification is a weighted set of class labels representing the possible translations e of a source phrase f; this output needs to be normalized to obtain the targeted probability distribution. Classifiers as presented by [Stroppa et al., 2007] have two essential advantages, the first of which is that the output corresponds to the posterior probabilities of the target phrases and only needs normalization; if no context information is given to the classifier, the (normalized) output corresponds to the original probability distribution p(e|f). Additionally, such classifiers can efficiently process large amounts of data and produce any number of output classes. In addition to the context-dependent probability distribution, [Stroppa et al., 2007] use a binary feature that assigns the 'bonus value' 1 to the phrase e that is most probable within p(e|f, cf), while all other phrases are given the value 0. For the context features (adjacent words and pos tags), a list indicating the relevance (information gain) of each context is presented.
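The normalization step described above, turning the classifier's weighted set of class labels into the distribution p(e|f, cf), is straightforward; this sketch uses an illustrative dictionary representation for the weighted labels.

```python
def normalize(weighted_labels):
    """Convert a classifier's weighted set of class labels (candidate
    target phrases for a source phrase f) into a probability
    distribution. With no context features supplied, this would
    recover the plain distribution p(e|f)."""
    total = sum(weighted_labels.values())
    return {e: w / total for e, w in weighted_labels.items()}

# Toy weighted output of a memory-based classifier
dist = normalize({"Bank": 3.0, "Ufer": 1.0})
```

The resulting values sum to one and can be plugged into the log-linear model like any other translation probability.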
The rankings of the context features are very similar for Italian→English and Chinese→English, the two translation directions of the experiments. With the source phrase itself being the feature with the highest information gain, lexical features, i.e. adjacent words, score higher than pos information. Also, contexts on the right side of a phrase are found to have higher information gain than contexts on the left side of a source phrase. Interestingly, this ranking does not correspond well to the feature relevance that [Gimpel and Smith, 2008] obtained by applying a feature selection algorithm. For Chinese→English translation, the top-scoring context in [Stroppa et al., 2007] was only chosen as the third and least important feature by [Gimpel and Smith, 2008]'s method. Additionally, the context considered most relevant by [Gimpel and Smith, 2008] for English→German translation, 2 pos tags on the left side, ranks very low at the 9th position of 10. As [Gimpel and Smith, 2008] use a richer set of context features and also different language pairs, it is difficult to directly compare their results to [Stroppa et al., 2007]'s; however, it is surprising that they differ to such an extent. For both translation tasks, Italian→English and Chinese→English, [Stroppa et al., 2007] report significant improvement in comparison to the respective baseline system. The work by [Max et al., 2008] likewise describes a decision-tree based classification. While most other research groups experimented with translations into English, their experiments conducted on English→French concern a translation into a highly inflected language. In addition to the widely used context features of adjacent words and pos tags, they also included dependencies obtained from dependency parsing. The relevance of context features corresponds to the results of [Stroppa et al., 2007], with immediately adjacent words and pos tags being the most important context features.
Dependency features turned out to be less valuable. In addition to the probability distribution obtained by classifying, the target phrase with the highest probability is assigned the value 1 in a 'bonus feature' (0 otherwise) to force the system to decide on the most probable option. Despite being slightly better than their baseline, the results of their systems fail to show statistically significant improvement. However, a manual evaluation suggested a noticeable improvement. An essential advantage of the classification-based approaches is the fact that context features are included according to their relevance and are thus able to perform an estimation on an optimal feature subset for any given situation. Since our approach basically focuses on using only one context template at a time, we run the risk of including essentially useless contexts while leaving out potentially more relevant ones. However, even with our simple approach, we can report some modest improvements for some settings. Similarly to [Max et al., 2008], the target language of our translation pair English→German has a complex morphology which could be a factor in the outcome of the experiments. Thoughts on how to deal with rich morphology will be addressed in section 7.4.2.

2.4 Word sense disambiguation

Another approach to integrating source side information is the use of word-sense disambiguation (wsd) techniques: the integration of wsd features can be viewed as an extension of the classification-based methods. The traditional task of wsd is to find the sense of a word given a set of context information and possible word senses. wsd normally uses very fine-grained, manually created data sets that capture even very subtle differences of meaning. When applied to smt, the task is to predict the translation of a source word instead of its sense.
Assuming that better translation choices can be made if the semantic word sense is known for a source word, the integration of word-sense prediction into smt systems seems promising. Parallel corpora provide a large amount of training material: by extracting phrase-translation pairs, source phrases that are 'sense-tagged' with their respective translation can be obtained. This has the major advantage that there is no need to produce hand-crafted data sets, as both the smt and the wsd systems can be trained on the same data, which avoids possible domain mismatches. The design of the training data for classification-based methods (as presented in the previous section) is essentially the same as for word-sense disambiguation techniques; in both cases, the possible translations of a source phrase are regarded as either classes or senses and thus, the task of finding the most probable class or word sense is actually very similar. In comparison to the classification-based methods, word-sense disambiguation usually relies on a more elaborately designed set of features. While the first attempts to integrate wsd techniques into phrase-based statistical translation systems were not very promising, recent results indicate that smt could benefit from word-sense disambiguation. We present the work of two groups of authors who successfully re-purposed wsd for translation tasks. [Vickrey et al., 2005] do not work with a full-scale smt system, but focus on the partial task of word translation. Training data is obtained from a bilingual corpus (europarl): for each source word a, the set of translation possibilities is extracted using word alignment. Additionally, the source and target sentences in which a and its translations occurred are stored as well to provide contextual information. The resulting set of sentences is split into test and training data.
The context features consist of the respective pos tag of the word a, as well as binary features indicating, with values of either 0 or 1, whether a given word occurred within a predefined context window of a. After experiments with different window sizes, the optimal size turned out to be a one-word context on the left side and a two-word context on the right side of the targeted phrase; larger context windows generally led to overfitting. In traditional wsd, a model which is given a word with a set of senses and a sentence containing this word has to predict the correct sense for the word seen in the sentence. When adapted to smt, the model has to find a valid translation for the single word a; translations do not necessarily have to be single words but can also be multi-word units. A logistic regression model was trained, since the large amount of data enables the use of such a model. With regard to the requirements of a phrase-based smt system, the output of the model is required to provide not only a prediction for the most probable target phrase, but a complete list of translation probabilities for all possible target phrases e. For their experiments with different context settings and context window sizes, [Vickrey et al., 2005] report improvement of average word translation accuracy. In a next step, they attempt to use their context-aware translation model on a simplified translation problem. Instead of translating complete sentences, the system is limited to the translation of ambiguous words only. In a preprocessing step, ambiguous words in the source sentence are identified and the respective words on the target side are replaced by blanks. The mt system with the context-dependent word translation model is then used to fill the blanks with appropriate translations. This 'isolated setting' is to guarantee that the quality of the translation is purely determined by the probability model and not influenced by other factors.
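A minimal sketch of such binary context-window features, using the reported optimal window sizes (one word left, two words right); the example sentence and vocabulary are illustrative, not the exact feature set of [Vickrey et al., 2005]:

```python
def window_features(tokens, i, vocab, left=1, right=2):
    """Binary features for the word at position i: for each vocabulary
    word, 1 if it occurs in the asymmetric context window (left words
    to the left, right words to the right), 0 otherwise."""
    window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
    return [1 if w in window else 0 for w in vocab]

tokens = "the commission needs to get used to the idea".split()
vocab = ["get", "to", "idea", "commission"]
# Features for "used" (index 5): left window = ["get"],
# right window = ["to", "the"]
feats = window_features(tokens, 5, vocab)  # -> [1, 1, 0, 0]
```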
For this experiment, [Vickrey et al., 2005] also find improvements of word translation accuracy, which suggests that the integration of wsd components might help to improve the quality of smt. [Carpuat and Wu, 2007a] and [Carpuat and Wu, 2007b] present an approach in which the wsd task, i.e. training and prediction, is adapted to fit into a full-scale smt system and directly focuses on translation quality by being closely tied to the smt system instead of being a stand-alone component. While both papers essentially present the same method, [Carpuat and Wu, 2007b] focus on the general concept whereas [Carpuat and Wu, 2007a] provide a more detailed description of context-sensitive translation lexicons. In contrast to previous publications, their wsd component is not limited to single (content) words, but is capable of handling phrases of any length. [Carpuat and Wu, 2007b] stress the importance of performing word-sense disambiguation on multi-word phrases, as wsd systems focusing on single words force smt systems to decide between context-aware single words and context-independent multi-word phrases. Additionally, single-word wsd systems (as presented in previous publications) often do not allow for generalization to larger phrases without making compromises in order to meet the requirements of phrase-based translation systems. Training data is obtained by extracting phrase-translation pairs with context information from word-aligned bilingual corpora; the senses of a source phrase are the set of all translations of this phrase seen in the data. Extracted phrases are not necessarily syntactically correct phrases, but can be of any form, which differs from typical wsd tasks where only single content words (nouns, verbs, adjectives) are disambiguated. As context disambiguation is performed on all phrases, the phrases also contain non-content words like articles or even punctuation.
In most typical wsd scenarios, training data consists of carefully annotated material, whereas the data used for the translation task is produced by automatic word alignment and thus contains a considerable amount of incorrect phrase-translation pairs. The wsd model for Chinese→English translation is based on a system that yielded very good results on Chinese data in a pure wsd task. As the translation system cannot work with multiple, context-aware translation probabilities, it is necessary to use sentence-wise dynamically created translation lexicons in order to incorporate the wsd probabilities as an additional feature in the log-linear model. In [Carpuat and Wu, 2007a], who focus on the production of context-dependent translation lexicons, context features are described more precisely. The overall objective is to use a rich set of context features (as is typically done in wsd) to create a phrase translation lexicon that takes into account various forms of context that standard phrase-based smt does not factor in. It was found that the most valuable contexts to disambiguate Chinese phrases are pos tags on both sides of the phrase, as well as full sentence context represented as a bag of words. Full sentence context is usually not available in phrase-based smt, since smt systems only model local context. The bag-of-words concept makes it possible to generalize sentence context instead of representing full sentences as very long phrases in the phrase-translation table, which would be the only possibility to integrate full sentences in a standard smt system. The set of possible contexts also comprises local collocations and basic dependency features, although there is no detailed description of those features and their relevance. [Carpuat and Wu, 2007b] carried out a detailed evaluation using 8 common evaluation metrics on three data sets. The integration of wsd prediction yielded improved results on all data sets and for each of the 8 evaluation metrics.
A significance test on nist scores was successful at the 95% level. The positive overall outcome of their experiments indicates that a close cooperation of wsd and smt helps to improve translation results. A more detailed evaluation in terms of typical properties showed that the result of wsd is superior to the baseline translation probabilities, which has several effects on the outcome of translations: valid translation candidates with a low baseline probability are ranked higher by the context-aware system and consequently are more likely to be chosen by the system. Influenced by the strong scores of wsd, translation probabilities become more competitive compared to the (relatively strong) target language model. The most interesting point is that the stronger scores lead to a segmentation into larger translation units, whereas the baseline segmentation preferred smaller phrases containing frequent translation candidates that often turn out to be incorrect. In our experiments, we found that context-sensitive translation probabilities have a very interesting influence on segmentation: in contrast to the effect observed in [Carpuat and Wu, 2007a] and [Carpuat and Wu, 2007b], the context-sensitive translation probabilities in our system trigger a preference for shorter translation units. The reasons for this effect will be discussed in section 7.4, where we will also compare our system with the system presented by [Carpuat and Wu, 2007a] and [Carpuat and Wu, 2007b]. In most of the presented publications, adjacent words and pos tag information of adjacent words turned out to be the most useful context features regardless of the examined language pair, whereas more sophisticated features like dependency structures or local collocations were less relevant. With this in mind, we will mostly focus on word-based contexts and information derived from pos tags.
2.5 Relation to example-based translation

The integration of contextual knowledge into an smt system is reminiscent of the basic idea of example-based machine translation (ebmt). In ebmt, sentences and their (manually produced) translations are stored in a database. In order to translate a sentence, sophisticated matching strategies are applied to find database entries close to the input sentence. Matching strategies can be based on words, but also include different forms of linguistic analysis such as tagging or parsing in order to identify pos tags, constituents or semantic roles such as subjects or objects, which allows for more generalization. As an exact match with a stored example is very rare, the database is searched for fragments that partially match the input sentence. The translations of these fragments need to be extracted and recombined into a valid sentence of the target language. The basic idea is illustrated by the following example: if we were to translate sentence (1) and the database contains a sentence like (2), it can be used for translation in the form of (3), where x and y are variables that can be filled with the actual words Alice and book (or rather their translations). Another sentence like (4) is needed to provide the translation for book.

(1) Alice buys a book.
(2) Susan buys a bike. → Susan kauft ein Fahrrad.
(3) x buys a y. → x kauft ein y.
(4) I like the book. → Mir gefällt das Buch.

Loosely speaking, the task of ebmt is to find stored examples (or parts thereof) having the same structure as the input sentence, as well as to extract the corresponding translation. This is no trivial task if no reliable word alignment is provided; building a correct target string from the partial translations is not easy, either. The basic principle of phrase-based statistical translation is very similar: it also consists of splitting an input sentence into smaller units which are then translated and recombined on the target side.
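The matching-and-recombination idea behind examples (1)-(4) can be sketched with a toy template matcher; the template database and lexicon entries are invented for illustration and far simpler than real ebmt matching strategies:

```python
import re

# Toy example database: (source template, target template) pairs where
# "X"/"Y" mark variable slots, plus single-word translations to fill them.
templates = [("X buys a Y .", "X kauft ein Y .")]
lexicon = {"Alice": "Alice", "book": "Buch"}

def translate(sentence):
    """Match the input against stored templates and fill the variable
    slots with translations from the lexicon."""
    for src_tpl, trg_tpl in templates:
        pattern = re.escape(src_tpl).replace("X", r"(\w+)").replace("Y", r"(\w+)")
        m = re.fullmatch(pattern, sentence)
        if m:
            x, y = m.groups()
            return trg_tpl.replace("X", lexicon.get(x, x)).replace("Y", lexicon.get(y, y))
    return None  # no stored example matches

print(translate("Alice buys a book ."))  # -> "Alice kauft ein Buch ."
```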
An essential difference is the fact that ebmt requires the fragments used for translation to come from a sentence similar to the input sentence, while smt uses statistics computed on all occurrences of a phrase, which means that the origins of individual phrases do not have any influence. Depending on the matching strategy, ebmt requires (more or less) perfect matches on the word level or syntactic level and therefore, the structure of the source side is a crucial criterion. In contrast, the integration of source side context information into phrase-based smt is an attempt to find translations that are appropriate for a given context instead of using a general translation lexicon; the context-aware translation probabilities are not the main criterion for a translation, but can rather be regarded as an additional feature or refinement in an already complex system. In addition to conditioning the translation of a phrase on its context in the input sentence, ebmt and the previously presented smt systems also share the concept of generalization by using pos tags or, at least in some cases, syntactic dependencies.

3 Integrating context features into a phrase-based statistical translation system

In phrase-based statistical translation systems, the source language input sentence is segmented into (not necessarily linguistically well-formed) phrases that are translated into target language phrases and then undergo reordering. Additionally, a target side language model rates the fluency of the translated sentence. The probability of translating a phrase of the source language into a phrase of the target language is derived from aligned training data by first extracting all phrase-translation pairs and then computing translation probabilities for a given source string based on the extracted pairs.
An interesting aspect of mt systems is the length of translation units: a translation system benefits from both short phrases, which are more universally applicable, and long phrases, which translate longer sequences of text and therefore add to the fluency of the resulting target sentence. However, even larger phrases do not contain any information on the context they originally appeared in. The objective of this work is to refine the estimation of translation probabilities by conditioning source phrases on contextual features.

3.1 Motivation

As phrases, be it single-word phrases or multi-word chunks, usually have a large number of translation candidates, additional information may help to find appropriate translations and filter out inappropriate ones depending on the context in which the phrase appears. The translation probability for a source phrase to be translated into a target phrase is computed as a relative frequency: the number of times the source phrase was translated into the target phrase, divided by the total number of occurrences of the source phrase. This estimation does not take into account further information about the source phrase, but rather produces a somewhat imprecise estimation of how to generally translate that phrase. While this works for phrases where all translation possibilities fit equally well, it is a problem if within a specific context only a subset of the translations is possible. Considering the context of a phrase when estimating translation probabilities may help to reduce the set of all translation possibilities to a smaller set better suited for a given context, as only 'good' translations for this context are seen in the training data. Contexts can be as simple as adjacent words or more sophisticated such as part-of-speech tags (pos tags) of adjacent words or even complex syntactic information about a phrase's role in the sentence it originally occurred in.
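The contrast between the context-independent and the context-conditioned relative-frequency estimate can be sketched as follows; the toy phrase-pair triples (source phrase, target phrase, left-context word) are invented for illustration:

```python
from collections import Counter

def translation_probs(pairs, src, context=None):
    """Relative-frequency estimate of p(trg | src), restricted, if a
    context is given, to occurrences of src seen with that context
    (here: the word immediately to the left)."""
    counts = Counter(trg for s, trg, ctx in pairs
                     if s == src and (context is None or ctx == context))
    total = sum(counts.values())
    return {trg: c / total for trg, c in counts.items()}

# Toy extracted triples:
pairs = [("used", "verwendet", "is"), ("used", "verwendet", "is"),
         ("used", "gewöhnt", "get"), ("used", "benutzt", "is")]
print(translation_probs(pairs, "used"))          # p(e | f): "verwendet" gets 0.5
print(translation_probs(pairs, "used", "get"))   # p(e | f, c): "gewöhnt" gets 1.0
```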
Context features are not limited to the actual surroundings of a phrase, but can also contain information about the phrase itself, like a feature indicating whether the phrase is a constituent in a syntactic parse tree. Filtering the set of translation candidates may take place on a relatively subtle level, such as discarding translations whose morphological form is not quite right in the given context, or in the case of homonymous source words, where only those translations are to be kept which express the correct meaning. Actually, slight ambiguities like morphological syncretism can be expected to be comparatively frequent, whereas ambiguities with more severe effects (like homonymy) are likely to appear to a lesser extent. When thinking about phrase translation probabilities, we have to keep in mind that other factors also influence translation, such as the target side language model, the reordering model or the segmentation of the source sentence. While the target side language model and the reordering statistics are indispensable factors of the translation system, other features can be omitted or modified to behave in a certain way. In order to see the maximum impact refined translation probabilities have on the translation output, some experiments in the following sections are carried out with simplified systems.

Illustrating the need for source-side context

The following description of different scenarios is intended to illustrate the general idea of the usefulness of context knowledge and is therefore somewhat hypothetical; examples can be found in sections 3.3.1 and 3.3.2, as well as in the sections dealing with the evaluation of different context types. A simple example to demonstrate the need for context knowledge is that of collocational structures: for collocates, a straightforward, literal translation is often not the best choice.
In a standard phrase-based translation system using phrases of variable length, collocational units such as take a walk or get used are assumed to be translated as a whole rather than as a sequence of single-word phrases. Word sequences translated as one phrase need not necessarily be collocations or linguistically motivated phrases, but need to be quite frequent in the training data, so that the system can 'see' the phrase and 'learn' its translation. When translating sequences longer than one word, the length of the phrase itself restricts the number of translation possibilities for the simple reason that longer sequences are seen less often in the training data and therefore have fewer translations appropriate for different contexts than universally applicable single words. However, the 'context' of a longer phrase in a standard system is part of the phrase itself and has to be translated: it simply captures the whole expression; no context outside the phrase is used. Also, in the case of shorter phrases and especially of single-word phrases, there is little to no 'context' that can help to identify appropriate translations. If we integrate context information and this context happens to be one adjacent word of a phrase, then the translation probabilities of a phrase of length n should correspond, at least approximately, to the ones of the phrase with length n+1, i.e. consisting of the phrase and the context word. The lack of context becomes especially evident when looking at the simpler task of word translation: as already mentioned, simplified systems help to better understand the effect of contextually conditioned probabilities. Limiting the source phrase length to single words results in a system with no context information at all and thus can be used as a sort of baseline. Also, without variation in the segmentation of source sentences, this allows for a more direct comparison of the baseline system and the modified one.
The baseline character of such a limited system is illustrated by the following example: when translating a sentence containing a collocational structure like get used, a system restricted to single-word translation might choose the most probable translation of each word, i.e. the default translation. In the case of used, a literal translation with the meaning of applied is most probable but not a good choice in this situation.

(5) the commission also needs to get used to the idea that ...

When conditioning the translation probabilities on the word on the left side of the phrase, as shown in example (5), translation probabilities for used→gewöhnt (accustomed) can be assumed to become larger if used appears in the constellation get used, while the probabilities for translations with the meaning of applied should decrease. While there are a lot of occurrences in the training data where used is translated as applied, none of them are likely to co-occur with get and thus will be ignored when estimating translation probabilities. This example illustrates how word translation benefits from being conditioned on contexts; as single words are more or less universally applicable, context-aware translation probabilities help to find the correct word sense in a given context. When translating phrases, the system also profits from the integration of context information. As suggested in sentence (6), get used is translated as a unit and it is very likely that the system picks a valid translation with the meaning of to accustom. Being conditioned on the word on the left side, to, further narrows down the set of 'good' translations, e.g. by preferring infinitive verbs over finite verbs.

(6) the commission also needs to get used to the idea that ...
Refined phrase translation probabilities might also have a positive influence on the segmentation of the input sentence: by giving high probabilities to 'good' phrases, 'good' segmentation should be enhanced; in fact, contextually conditioned translation probabilities have an interesting effect on segmentation which will be discussed in detail in section 7.4. While phrases are not linguistically motivated, they need to be well-formed in terms of their alignment structure; this means that an input sentence cannot always be segmented into arbitrary combinations of phrases within a predefined length. This is illustrated by the following example containing a crossing alignment structure:

(7) that i have slept → dass ich geschlafen habe

When translating this sentence, the possibilities for segmentation are limited, as the phrase i have → ich habe is not possible since phrases need to be continuous. If the longer phrase i have slept is not available, then the 'link' between i and have on the source side is lost. Integrating context information would allow us to condition the phrase have on the word i to its left and thus approximate the probability estimation for the non-well-formed phrase i have. As German verbs have different forms depending on the subject, it could prove useful to know whether the source sentence contains e.g. i have or you have. Actually, the phrases i and have have a good chance to be translated correctly even if not conditioned on a context: the target side language model, which rates the similarity of translation candidates to sentences seen in the training data, would not give a high score to an invalid translation like [i] [have] → [ich] [hast], simply because the obviously wrong string ich hast cannot appear in the training corpus. Thus, the language model can also be seen as a tool to indirectly solve ambiguities by only accepting strings that are likely to occur in the target language.
The language model is a relatively strong component in the smt system; while it is restricted to relatively local ambiguities (limited by the size of the used n-grams), problems like the different realizations of have in German can be solved if the corresponding subject appears next to it on the target side. A disadvantage of the language model is its tendency to overrate translations containing frequent words; in the case of used, the default translation with the meaning of applied would probably be favored in every context, because highly frequent words occurring within different contexts in the training data are likely to receive high scores and fit well into translated sentences. Additionally, the 'direct link' between two phrases that were adjacent in the source sentence might be lost if target phrases are reordered. If the distance introduced by reordering exceeds the n-gram size, then the language model has to rate the translations of those phrases separately. While the language model is already very useful, an smt system should also profit from contextually modified translation probabilities, as they are able to capture connections between phrases that would otherwise be lost and to enhance translation probabilities of comparatively infrequent phrase-translation pairs. Another advantage of conditioning on source side context is the possibility of generalization: so far, we mostly discussed adjacent words as context features. While this seems promising, purely lexical context features are prone to data sparseness; by replacing words with e.g. pos tags, the data can be better exploited. In many cases, it is sufficient to know whether a phrase occurred to the right of a determiner in order to find out that the phrase is or begins with a noun and not e.g. a verb.
As translation statistics conditioned on pos-based contexts are estimated on the basis of more data than word-based contexts, they can be expected to be more representative and thus provide information that cannot be captured by larger phrases. Linguistically motivated information might turn out useful in cases like the following examples. A homonym like light can be both an adjective and a noun and means either the opposite of the adjective heavy or describes the effect of switching on a lamp; thus, there is a need for disambiguation in order to guarantee that the correct reading is chosen. We can take another step, leaving the disambiguation of word senses, and have a look at the morphological level: for example, knowing whether a (noun) phrase is on the left or the right side of a verb might help the system to pick a German target phrase having the morphological attributes of either a subject or an object phrase depending on the context. pos tags will not only be used as features on the source side: in the experiments in chapter 8, pos tags on the target side will also be integrated. In the course of this work, we want to add more refined probabilities for the translation of a given source language phrase that are not only based on phrase counts, but also on the context the source phrase appeared in. By integrating different types of context features into a standard phrase-based translation system, we hope for a general improvement of translation quality, not only in cases with some form of polysemy or homonymy. As not every possible context is likely to be useful, we also experiment with evaluations of the quality and reliability of contexts. In the next sections, the phrase extraction procedure and the construction of standard phrase translation tables as well as the integration of source-side context information are described. All experiments were carried out with the statistical machine translation system moses1. Thus, explanations specifically refer to this system.
For more detailed background on moses, see also [Koehn, 2010].

3.2 Translation phrase tables

Basically, a phrase-based translation system works with the translation probabilities of phrases, a reordering model and a target language model to judge whether the output is a good target language string. Phrase translation probabilities and additional features like lexical weighting and a phrase length penalty are listed in a phrase table. To a great extent, translation quality depends on phrase translation probabilities. The following sections illustrate the process of creating phrase tables as well as how their features are factored into the translation system.

3.2.1 Estimating translation probabilities

To build translation tables, phrase pairs are extracted from word-aligned text and for each phrase pair, several scores including the respective translation probabilities are computed. While phrases are not linguistically motivated, they need to be well-formed in terms of their alignment, i.e. multi-word phrases have to be continuous.

  English: but we also recognise that some of the practices are outdated .
  German:  allerdings sehen wir auch , dass bestimmte verfahrensweisen überholt sind .

  English phrase (src)      German phrase (trg)
  but                       allerdings
  but we also recognise     allerdings sehen wir auch
  but we also recognise     allerdings sehen wir auch ,
  recognise                 sehen
  we also recognise         sehen wir auch
  we also recognise         sehen wir auch ,
  we also recognise that    sehen wir auch , dass
  we                        wir
  we also                   wir auch
  we also                   wir auch ,
  also                      auch
  also                      auch ,

Figure 2: A sample of extracted phrase pairs from word-aligned text. The maximum phrase length is set to 5.

1 http://www.statmt.org/moses/index.php?n=Main.HomePage

  source         target          ϕ(src | trg)   lex(src | trg)   ϕ(trg | src)   lex(trg | src)
  are outdated   sich            1.2199e-05     8.0279e-08       0.11111        0.023719
  are outdated   sind veraltet   0.125          0.0529022        0.11111        0.018204
  are outdated   veraltet sind   0.0625         0.0529022        0.11111        0.018204
  are outdated   veraltet        0.009009       0.0018107        0.11111        0.095238
  are outdated   veralteten      0.0098039      0.0022989        0.11111        0.073016
  are outdated   überholt sind   0.210526       0.0367321        0.44444        0.029126

Table 1: Phrase table entries for the English phrase are outdated. The value for the feature phrase penalty is always exp(1) = 2.718.

As illustrated in figure 2, extracted phrase pairs are of different length, ranging from one word to a predefined maximum phrase length. The extraction method starts with listing a short phrase pair (i.e. two single words aligned to each other) and then continuously adds subsequent phrase chunks until the maximum phrase length is reached. The entries in the table in figure 2 show the procedure: starting with the minimal unit but - allerdings, the algorithm adds the next minimal phrase chunk we also recognise - sehen wir auch. While the single phrase pairs of the second block, recognise - sehen, we - wir and also - auch, are valid entries in the phrase collection, only the entire block can be added to the first phrase. This is due to the alignment structure: a phrase pair like but we - allerdings auch is not well-formed as the two words sehen wir on the German side would be skipped. Unaligned words, such as the comma in the fifth position, are added to their adjacent phrases, leading to multiple entries like the second and third entry in the table in figure 2, where one part of the pair remains the same.
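The well-formedness condition on extracted phrase pairs (no alignment link may cross the phrase boundary) can be sketched as a consistency check; the alignment points below cover the first four words of the example in figure 2:

```python
def consistent(alignment, src_span, trg_span):
    """A phrase pair (src span, trg span; inclusive word indices) is
    well-formed iff every alignment link touching either span lies
    entirely inside both spans, and at least one link is inside."""
    s0, s1 = src_span
    t0, t1 = trg_span
    inside = False
    for (i, j) in alignment:
        src_in = s0 <= i <= s1
        trg_in = t0 <= j <= t1
        if src_in != trg_in:
            return False  # a link crosses the phrase boundary
        if src_in and trg_in:
            inside = True
    return inside

# but-allerdings (0,0), we-wir (1,2), also-auch (2,3), recognise-sehen (3,1):
a = [(0, 0), (1, 2), (2, 3), (3, 1)]
print(consistent(a, (0, 0), (0, 0)))  # but - allerdings: True
print(consistent(a, (0, 1), (0, 2)))  # but we - allerdings sehen wir: False
print(consistent(a, (1, 3), (1, 3)))  # we also recognise - sehen wir auch: True
```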
When all phrase pairs are extracted, the probability for a given source phrase to translate into a target phrase can be estimated by the relative frequency:

  ϕ(trg | src) = count(src, trg) / Σ_trgi count(src, trgi)

The standard phrase table in the moses system lists 5 scores for each phrase pair: the translation probabilities ϕ(src | trg) and ϕ(trg | src), lexical weightings lex(trg | src) and lex(src | trg) and a phrase penalty. Table 1 shows some example entries for a phrase of the example in figure 2. Lexical weighting is used to judge the reliability of phrase pairs. Since translation probabilities are 1 in the case of a phrase pair occurring only once, such pairs are often overestimated. To compute lexical weight scores, phrases are basically backed off to single words whose - more reliable - translation probabilities are estimated from the corpus data. The lexical weight for a target phrase with given source phrase and word alignment a is defined as follows:

  lex(trg | src, a) = ∏_{i=1..length(trg)} 1 / |{j | (i, j) ∈ a}| · Σ_{(i,j) ∈ a} w(trgi | srcj)

Each target word is generated by the source word it is aligned to. In the case of a one-to-many alignment, the average value of all alignment pairs is taken. If a word is not aligned, i.e. if it is aligned to the null word, the factor w(trgi | null) is used. For the phrase pair überholt sind - are outdated, the lexical weighting is computed as:

  lex(überholt sind | are outdated, a) = w(überholt | outdated) · w(sind | are)

where a denotes the respective alignment between überholt and outdated and between sind and are. As both alignments are plausible, the respective word translation probabilities can be expected to be relatively high. Bad phrase pairs, i.e. pairs whose single-word translations have low probabilities, score lower than phrases with well-matching single-word alignments and might thus be disfavored by the system.
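The lexical weighting formula can be sketched directly; the word translation probabilities below are assumed toy values, not estimates from real data:

```python
def lex_weight(trg, src, alignment, w):
    """lex(trg | src, a): for each target word, average the word
    translation probabilities w(trg_i | src_j) over its alignment
    links (w(trg_i | NULL) if unaligned), then take the product."""
    score = 1.0
    for i, t in enumerate(trg):
        links = [j for (ti, j) in alignment if ti == i]
        if links:
            score *= sum(w[(t, src[j])] for j in links) / len(links)
        else:
            score *= w[(t, None)]  # unaligned: back off to the null word
    return score

# Assumed toy word translation probabilities:
w = {("überholt", "outdated"): 0.4, ("sind", "are"): 0.5}
# Alignment as (target index, source index): überholt-outdated, sind-are.
a = [(0, 1), (1, 0)]
print(lex_weight(["überholt", "sind"], ["are", "outdated"], a, w))  # 0.4 * 0.5 = 0.2
```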
For example, the phrase pair are outdated - sich has a significantly lower lex(src | trg) than the other phrases. As the single target word sich is aligned with two source words, its lexical score is the average of the two corresponding word translation probabilities:

    lex(sich | are outdated, a) = 1/2 · [w(sich | outdated) + w(sich | are)]

While the alignment of sich and are is at best very dubious, the alignment between sich and outdated is clearly wrong. Therefore, both word translation probabilities must be quite low, resulting in a low overall score. Given that the translation probability ϕ(trg | src) is the same as for most pairs, the low lexical weighting score helps to filter out bad translation candidates. Both lexical weighting and translation probabilities are computed for both translation directions.

Since an input sentence needs to be segmented into phrases before being translated, and at this point all segmentations are equally likely, the system needs to know whether to favor few long phrases or more short phrases. This is done by introducing a phrase penalty score ρ that prefers longer phrases when ρ < 1 and shorter phrases when ρ > 1. In the case of equally scoring phrases, the system generally prefers the longer one. Since longer phrases contain more context, and since potentially overestimated reliability caused by data sparseness has already been dealt with by the lexical weighting score, this is a positive effect.

3.2.2 Reordering

Once a phrase segmentation and the respective phrase translations have been found, translated phrases may be reordered. In the case of English-German translations, for example, the verb-subject order has to be fixed (cf. the phrase pair sehen wir - we ... recognise in figure 2), as does the position of verbs in verb-final structures (cf. überholt sind - are outdated in the example). In contrast to e.g. the switching of adjectives and nouns in French, both movements are not local but can span a distance of several words.
If reordering is handled with a (distance-based) model where a cost function generally punishes movements, reordering can only be justified when there is a huge gain in the language model score. For local changes, like the adjective-noun conversion when translating French, this method works quite well, but it reaches its limit when larger movements are necessary. In a lexicalized reordering model, reordering depends on the actual phrases, assuming that some phrases have more characteristic reordering properties than others. The reordering model learns a probability distribution from word-aligned text to predict reordering properties for the phrases listed in the phrase table.

3.2.3 Parameter tuning

When translating a source sentence into a target sentence, a linear combination of features and feature weights is maximised. Such combinations, called log-linear models, have the following form:

    x̂ = argmax_x exp Σ_{i=1}^{n} λ_i h_i(x)

where the random variable x represents the source and target sentence and its segmentation into phrases. There are n feature functions h_i with corresponding weights λ_i. Features include the language and reordering models and the probabilities listed in the phrase table. The setting of log-linear models has two advantages: it allows adding new features, and the features can be weighted differently. Feature weights are learned by optimizing a standard evaluation metric (such as bleu) on a development set. This means that the parameters λ_i are modified such that they maximize the bleu score yielded when translating the development set, until convergence. Since parameter tuning tries to achieve optimal scores of a standard evaluation criterion on development data, it is also called minimum error rate training.

3.3 Including context information

In this work, only source-side context features are used as the basis for a more refined estimation of translation probabilities.
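To make the log-linear combination concrete, here is a toy scoring sketch. The feature values, weights and candidate names are invented for illustration; a real decoder evaluates its actual feature functions over partial hypotheses.

```python
import math

def loglinear_score(features, weights):
    """features, weights: dicts mapping feature name -> h_i(x) resp. lambda_i."""
    return math.exp(sum(weights[k] * features[k] for k in features))

# Invented log-scale feature values for two competing translation candidates:
weights = {"lm": 0.5, "phi": 0.3, "reorder": 0.2}
candidates = {
    "cand_a": {"lm": -4.1, "phi": -2.0, "reorder": -0.5},
    "cand_b": {"lm": -3.2, "phi": -2.6, "reorder": -0.4},
}
best = max(candidates, key=lambda c: loglinear_score(candidates[c], weights))
print(best)  # cand_b: its weighted sum -2.46 beats cand_a's -2.75
```

Changing a weight λ_i re-ranks the candidates without touching the feature values, which is exactly what minimum error rate training exploits.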
Given that we know the entire source sentence when translating, we also have information about every imaginable context in this sentence at our disposal and can therefore work with translation probabilities computed specifically for this very context (provided it has been seen in the training data). We even have the option to either substitute the original ϕ(trg | src) with the new probability distribution conditioned on context features or to add the new distribution to the original phrase table, as the design of the log-linear model allows for additional features. The task of translating a sentence by means of various probability estimates remains the same; hence, there is no need to change the decoding algorithm or to make modifications beyond adding the new translations into the phrase table.

However, working with the context of phrases on the target side is considerably more difficult: during the translation process, the final target sentence is not yet known and therefore, as we do not know a translation candidate's environment and position in the target sentence, we cannot work with translation probabilities modified to fit a special context. The target-side phrases are dependent on each other; thus a context such as e.g. an immediately adjacent word would not be a fixed constant (as on the source side), but would also depend on adjacent phrases. The sole exception for target-side information is an invariant description of the target phrase itself, such as e.g. the pos tags of its words. Adding context features on the target side that are conditioned on phrases in the target output would require adapting the decoding method. Also, it might not be necessary to incorporate additional features on the target side: language models are a strong component in smt systems and, in combination with reordering models, cover at least the local context of phrases on the target side.
Context can be of different forms: the simplest contexts are words next to a phrase, i.e. a sequence of n words to the left or the right of a phrase. However, adjacent words, even when lemmatized, lead to sparse data. An obvious step to overcome this sparseness is a reduction to part-of-speech (pos) tags: not a sequence of adjacent words, but a sequence of pos tags is used as the context of a phrase. Examples for both types of context, words and pos tags, will be presented in the next sections.
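Reading such context windows off a tagged sentence is straightforward; the following sketch (helper and marker names are my own) extracts the n items to the left of a phrase, either as words or as pos tags, padding with a sentence-begin marker when the window overruns.

```python
BOS = "<s>"  # sentence-begin marker, used when the context window overruns

def left_context(tokens, start, n):
    """Return the n items immediately left of position `start` as a tuple."""
    padded = [BOS] * n + tokens
    return tuple(padded[start:start + n])

# Toy sentence with (invented but tag-set-conformant) pos tags:
words = ["we", "get", "used", "to", "it"]
tags = ["pp", "vvff", "vvn", "to", "pp"]
i = words.index("used")
print(left_context(words, i, 1))  # ('get',)
print(left_context(tags, i, 2))   # ('pp', 'vvff')
```

The same function serves both context types, which keeps word-based and tag-based experiments directly comparable.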
3.3.1 A simple first example

In this section, one of the examples briefly mentioned in 3.1 will be discussed in more detail: when the phrase length is restricted to 1, we found that in the case of the phrase used, one word on the left side helps to disambiguate between the default translation verwendet (applied) and a variant appropriate for the collocational structure get used. The three entries above the line in table 2 are the phrase-translation pairs for used with the highest probabilities; the remaining three entries are some of the translations one would use in the context of get. In total, there are 1883 translation possibilities, derived from 7394 extracted phrase-translation pairs. It is easy to see that the top three entries have a much higher translation probability than the translations suitable for the context get, even though the latter are comparatively frequent translation candidates with the meaning of gewöhnen (to accustom). When having a look at the column labelled context = get, we see that the first three translations were never seen when the phrase used co-occurs with get. In total, 14 translation possibilities were seen in this specific context, of which 13 occurred only once. While this probability distribution is rather flat, it only contains translations that are suitable for the context get.

                                           normal         context = get
source  target          gloss              f     ϕ        f   ϕ_get
used    verwendet       applied            731   0.0989   0   0
used    genutzt         utilized           557   0.0753   0   0
used    eingesetzt      deployed           557   0.0753   0   0
used    gewöhnen        to accustom        7     0.00095  5   0.2778
used    daran gewöhnt   accustomed to      5     0.00068  1   0.0556
used    zur gewohnheit  (to the) custom    1     0.00014  1   0.0556

Table 2: Translation probabilities and occurrence frequencies for the single-word phrase used without context and conditioned on the word get on the left side.

Without wanting to anticipate a more detailed evaluation of a system using only single-word source-side phrases, the following sentences illustrate the effect that the modification of translation probabilities has on machine translation output.

(8) the question is whether you get used to the minor annoyances ...
(9) die frage ist , ob sie verwendet werden die kleinen schikanen ...
(10) die frage ist , ob sie sich daran gewöhnt , die kleinen schikanen ...
(11) die frage ist bloß , ob man sich an diese kleinen unannehmlichkeiten gewöhnt ...

Both (9) and (10) are output of an mt-system, while (11) is a reference translation. Sentence (9) is the translation of (8) using a system without context information. As expected, used is translated as verwendet, the entry with the highest probability. In combination with werden, the translation of get, it forms a common German phrase which is grammatical but not appropriate here. In contrast, the translation of used in sentence (10) is daran gewöhnt. Though this is not the highest-scoring option from the set of translations in a system using context information, the word sequence sich daran gewöhnt (i.e. the translation of get used) is very common. This is also reflected in the language model, which might have contributed to the fact that daran gewöhnt was chosen over gewöhnen, the candidate with the highest translation probability.
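The context-conditioned columns of such a table can be re-estimated from extracted (source phrase, left-context word, translation) instances by simple relative frequency. The following is a toy sketch with invented counts, not the real table figures.

```python
from collections import Counter

def phi_context(instances, src, context):
    """Relative-frequency estimate of phi(trg | src, context)."""
    pairs = [t for (s, c, t) in instances if s == src and c == context]
    counts = Counter(pairs)
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

# Miniature set of extracted instances (invented for illustration):
instances = (
    [("used", "get", "gewöhnen")] * 5
    + [("used", "get", "daran gewöhnt")]
    + [("used", "be", "verwendet")] * 3
)
dist = phi_context(instances, "used", "get")
print(dist["gewöhnen"])  # 5 of 6 instances in this context -> 5/6
```

Restricting the counts to one context is all that changes relative to the standard estimate; verwendet, never seen next to get, simply drops out of the distribution.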
In table 3, some of the phrase-translation pairs and the respective translation probabilities for the phrase get used are listed. While the distribution is not exactly the same as in the case of used with the context get, the two are quite similar with respect to both the number of translation candidates and the candidates themselves. A more detailed comparison of a length-1 system with context information and a length-2 system without context information, as well as a more fine-grained evaluation, will be provided in section 6.1.

target          gloss         ϕ
gewöhnen        to accustom   0.2222
gewöhnt euch    accustom      0.1111
gewöhnt         accustomed    0.1111
sich            himself       0.1111
gewöhnen sich   accustom      0.037

Table 3: The most probable translations for the phrase get used without context information.

3.3.2 Expanded example: Using pos-based context templates

As already mentioned, pos tags are a simple but effective method for generalization. Replacing adjacent words with their respective pos tags has the advantage that sparse data is no longer a problem: while a specific word might be seen only once or twice or not at all, nouns or verbs occur sufficiently often. This means not only that phrases occurring in combination with infrequent words are not excluded from contextual conditioning, but also that the translation statistics are much richer, as their estimation is no longer based on the (modest) number of occurrences of a specific word, but on the entire class of this word's pos tag. On the downside, the loss of lexical information could also have adverse effects in cases where a specific word is needed for disambiguation instead of a pos tag.

In the following example, we discuss simple, pos-tag-based contexts2 of the English word fine. fine has more than one meaning: first, it can be an adjective with the meaning of nice. Actually, the meanings of fine as an adjective can also vary widely, ranging from beautiful or pretty, the most prominent readings (cf. 12), to ok or alright when used at the beginning of a sentence as in (13). When having a closer look at the translation phrase table, one can see that there is also a great variety of translations that loosely fit into the concept of something positive, such as hehr (noble) or feierlich (festive), but which were seen only once or twice. The second reading of fine is that of a financial penalty. It can be realized either as a verb (14) or a noun (15).

(12) they were given plenty of fine words, but no practical action.
(13) fine , i can accept that , but what gives you the right ...
(14) the recent decision to fine ryanair, europe's main budget airline, for its arrangement with local officials ...
(15) the ... national radio and television council ... decided to impose a fine of euro 100 000 on the private television station mega channel.

In table 4, translation probabilities for the English word fine to several German translation candidates are listed, as well as the number of occurrences with and without context (labelled f). In total, the English-German phrase pairs containing fine were seen 1266 times in the training data, resulting in a set of 379 translation candidates. The numbers in the column titled normal were computed without including context features. It is evident that different forms of schön (beautiful) and gut (good) are the most prominent translation candidates. Also, variants of the confirmation marker in ordnung (ok) are contained in the list. While these two readings might not always be separated clearly, the reading of fine as a financial penalty is clearly distinguishable from the other ones. However, the probabilities for fine to be translated into a phrase of the financial penalty reading (the penalty-related rows in table 4) are very low compared to the different forms of schön or gut.
Given that occurrences of fine meaning something like pretty or good are significantly more frequent than occurrences with the meaning of financial penalty, the phrase translation probabilities for gut or schön are relatively high, and thus fine is not very likely to be translated as geldstrafe. Also, a relatively high-frequency phrase like schönen or schöne can be expected to be well-liked by the language model. As illustrated in table 4, using context information can help to divide translation candidates into different sets: in the 'context 1' column, only phrases co-occurring with the pos tag to on the left side were considered.

                                                normal        context1      context2
target               gloss              f       ϕ             f   ϕ_c1      f   ϕ_c2
schönen              beautiful          160     0.1264        0   0         3   0.038
schöne               beautiful          120     0.0948        1   0.0435    3   0.038
gute                 good               70      0.05529       0   0         3   0.038
guten                good               62      0.04897       0   0         2   0.0253
gut                  good               60      0.04739       0   0         0   0
in ordnung           ok                 26      0.02054       0   0         0   0
schön                beautiful          21      0.01659       0   0         0   0
schönes              beautiful          19      0.01501       0   0         0   0
gutes                good               19      0.01501       0   0         0   0
geldstrafe           penalty            19      0.01501       2   0.087     11  0.139
ausgezeichneten      excellent          18      0.01422       0   0         1   0.0127
strafe               penalty            15      0.01184       2   0.087     8   0.101
verhängen            impose             2       0.00158       2   0.087     0   0
einer geldstrafe     of a penalty       1       0.00079       1   0.0435    0   0
bußgelder gegen      penalty against    1       0.00079       1   0.0435    0   0
geldbußen verhängen  impose a penalty   1       0.00079       1   0.0435    0   0
geldstrafe bezahlen  pay a penalty      1       0.00079       1   0.0435    0   0
geldstrafe gegen     penalty against    1       0.00079       1   0.0435    0   0
geldbuße             penalty            1       0.00079       0   0         6   0.076

Table 4: Translation probabilities for the English phrase fine. In total, there are 379 translation candidates, 290 of which have a probability of 0.00079. The entries above the line are the top 12 translation candidates; the ones below are representative for context1 and were manually selected. context1: the pos tag on the left side of the phrase is to. context2: one pos tag on the left side of the phrase (dt), plus the leftmost pos tag of the phrase (nn).

This pos tag, representing the word to before infinitive verbs (cf. example 14), almost always translates to a phrase related to financial penalty. Since fine occurring immediately after the pos tag to needs to be a verb, this context not only disambiguates the basic meanings of the phrase, but also tends to separate verbal from nominal translations. The distribution of this context is very flat: 23 English-German phrase pairs with this context have been extracted, resulting in 21 phrase-translation candidate pairs. The two translations with the highest probability are geldstrafe and verhängen (impose). While verhängen is not a good translation for fine, it is still considered appropriate in this context, since geldstrafe + verhängen is collocational.

In context 2, we look at the first pos of the phrase and the pos on its left: in the example, fine is a noun, preceded by an article. In contrast to context 1, verbal translations are not part of the resulting distribution. While some forms of schön or gut are also among the translation candidates, the nouns geldstrafe, strafe and geldbuße are clearly the favorites, with translation probabilities even higher than the highest-scoring translations without context. Summarizing, we can say that contextual features, as simple as an adjacent word as in section 3.3.1 or pos tags, can help to disambiguate translation candidates.

2 A description of the tag set is provided in section 3.4.1

3.3.3 Implementation

The phrase translation probability ϕ(trg | src_context) is computed like the standard translation probability ϕ(trg | src), using relative frequencies:

    ϕ(trg | src_context) = count(src_context, trg) / Σ_{trg_i} count(src_context, trg_i)

When extracting phrase-translation pairs from aligned training data (cf. figure 2), the respective contexts are extracted as well. In the case of more sophisticated context features (e.g.
pos tags), the training data as well as the test set and the development set require pos tagging or another form of pre-processing. The next step is to assign the modified translation probabilities to the corresponding phrases: since a phrase can appear in different contexts, there can be different sets of translation probabilities. In a standard system, a phrase and its translation candidates are listed only once in the phrase table, i.e. each time this phrase is to be translated, the same phrase table entries are applied. In order to find the corresponding contextually conditioned probabilities for each phrase during the translation process, each word is given a unique identifier, both in the sentences to be translated and in the phrase table entries. A simple way to do so is to consecutively number the words in the sentences to be translated and then transfer these numbers to the entries in the phrase table. For example, assume that the phrase front of occurred twice in the test set, once in context a and once in context b. The resulting phrase table thus needs to contain entries for the phrase front_i of_j conditioned on context a and entries for the phrase front_k of_l conditioned on context b. This principle is illustrated by the simple example given in figure 3, where the contextually conditioned probabilities prefer or disprefer the translation vorderseite von depending on the word to the left of the phrase front of. Since each word in the modified system is uniquely identified, the phrase table contains several entries for the same phrase pairs, depending on the context they appeared in. This approach leads to a huge inflation of phrase tables. To keep the phrase table size manageable, only phrase-translation pairs with a probability above a threshold are listed in the translation table.

phrase     translation      ϕ        phrase      translation      ϕ_c
front of   vor der          0.4      front2 of3  vor der          0.2
front of   vor dem          0.3      front2 of3  vor dem          0.2
front of   vorderseite von  0.3      front2 of3  vorderseite von  0.6
                                     front7 of8  vor der          0.6
                                     front7 of8  vor dem          0.4

a: the1 front2 of3 the4 hotel5
b: in6 front7 of8 the9 hotel10

Figure 3: A simple example demonstrating how modified translation probabilities are written into the phrase table. The table on the left side gives the original translation probabilities for the phrase front of, while the table on the right side shows the new probabilities conditioned on the context of one word on the left side of the phrase.

Another effect of this representation is that phrase tables are not universally applicable but need to be produced specifically for the sentences to be translated, i.e. the development and the test set. Unknown words require an additional post-processing step: while unknown words are not listed in the phrase table, they may still be represented in the language model. Since they cannot be translated, unknown source phrases are just integrated into the target language sentence (e.g. proper names). However, when marked with a unique identifier, these words cannot be found in the language model. Hence, input sentences need to be compared with the phrase table in order to remove the identifiers from those words not represented in the phrase table.

In the example in figure 3, it becomes evident that some translations are not seen in combination with a context and therefore obtain a probability of 0, i.e. they are discarded from the set of possible translations. For different reasons, it might be a good choice to keep these translations and give them a small pseudo-probability. This issue of smoothing will be addressed later in section 5, where different approaches will be discussed and compared.
3.4 Experimental settings and tools

For the translation experiments, we used the standard moses system; alignment was done using giza++. The training data consists of newspaper text and europarl3 (1,466,088 sentences in total). All words are lowercased in order to maximally exploit the data, and the sentence length does not exceed 70 tokens. 1025 sentences were used for parameter tuning (see 3.2.3), which is almost the same size as the test set (1026 sentences). Test set and tuning set are newspaper text and very similar in their domain, as for each sentence in the tuning set, an adjacent sentence was taken into the test set. Both test set and tuning set, as well as the training corpus, are data from the shared task 20094.

For pos tagging, we used Tree-Tagger [Schmid, 1994], a widely used tool for the annotation of pos tags and lemma information. It exists for many (European and non-European) languages, including English and German. A variant of Tree-Tagger is also capable of chunking and was used for the annotation of chunk-based contexts.

3 http://www.statmt.org/europarl/
4 http://www.statmt.org/wmt09/

3.4.1 Description of the pos tag set

The English tag set is based on the Penn tag set5, which was slightly adapted to our needs. Most modifications concern the tags of verbs: in contrast to the original set, we distinguish between full verbs and the auxiliary verbs to be and to have, but not between past tense and present tense. Furthermore, no difference is made between nouns in singular and plural. In total, the tag set was intentionally left relatively rich, especially the tags of verbs, which have different tags for auxiliaries and also represent a wide variety of grammatical forms.

5 http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html

tag     example          description          tag    example      description
cc      and, or          conjunction          vb     be (inf.)    aux: be
dt      the, a           determiner           vbff   be (fin.)    aux: be
jj      cheap            adjective            vbg    being        aux: be
jjr     cheaper          adjective (comp.)    vbn    been         aux: be
jjs     cheapest         adjective (sup.)     vh     have (inf.)  aux: have
pdt     half, all        quantifier           vhff   have (fin.)  aux: have
pos     's               possessive marker    vhg    having       aux: have
pp$     our, his, their  poss. pronoun        vhn    had          aux: have
pp      he, she, we      pronoun              vv     verb (inf.)  verb
nn      store, client    noun                 vvff   verb (fin.)  verb
in      under, about     preposition          vvg    verb + ing   verb
rp      (pull) out       particle             vvn    verb-part.   verb
wdt     that, which      rel. pronoun         md     would, can   modal verb
wp$     whose            question pron.       to     to           to
wp      who              question pron.       rb     late         adverb
wrb     when, how        question adv.        rbr    later        adverb
ex      there (is)       exist. 'there'       rbs    latest       adverb
fw      i.e.             abbreviation         cd     one, 1.9     number
( , ) " ( , ) "          punctuation          ls     4            list symbol
                                              sent   . ,          punctuation

Figure 4: English tags with examples. All tags are listed with an example and a description. Additionally, the beginning and the end of a sentence are marked as well and used as context for the very first or last word of a sentence.

3.4.2 Dimensions of the original and modified phrase tables

Phrase tables cannot be of unlimited size, as all translation options need to be available during translation and the system therefore needs to be capable of loading the entire table. When the phrase tables are used during parameter tuning or the actual translation, they are filtered so that they only contain phrase-translation pairs that also occur in the test set or the tuning set. In the case of a standard phrase table with 68,733,076 entries, only 7.85 percent (5,398,857 entries) are left after filtering for the translation of roughly 1000 sentences. The modified phrase tables only have entries for phrases seen in the test set in any case; otherwise, there would not be a context to be used in the estimation of probabilities. However, the design of the modified phrase tables, i.e. listing each phrase occurring in the test set separately in order to find the probabilities for each context, leads to an inflation of the phrase table. The dimension of inflation, and inherently also the need to cut off some entries, is illustrated with an example of the article the.
As the is highly frequent, a lot of translation options were seen during training, resulting in a set of 109,384 translation candidates for the single-word phrase the alone. In the test set, the phrase the occurs 1,809 times; if we insisted on including every entry for every occurrence of the, we would end up with 1,809 · 109,384 = 197,875,656 entries for this phrase alone. Less frequent words also have quite some effect: in the case of the word public (26 occurrences in the test set) with 2,495 translation possibilities, including every entry of every occurrence would result in 26 · 2,495 = 64,870 entries. In the experiments presented later, only entries with a translation probability of 0.0002 or more are listed in the phrase table. This threshold was also used in the experiments by [Gimpel and Smith, 2008], who found that such filtering had no apparent effect on the output of their systems. As there is no need to filter phrase tables in the baseline system, no filtering was applied to them.

4 Evaluation methods

When evaluating the performance of an mt-system, there are several aspects to consider: while there are also technical points like the translation speed or the amount of memory used during translation, the two main criteria are fluency and adequacy. These criteria are motivated by the idea that the translated sentence should convey the exact meaning of the source sentence without adding more content or leaving out information, and thus be adequate. The output should also be good with respect to grammar and word choice of the target language, i.e. consist of fluent English or German. While a human annotator would be perfectly capable of judging mt-output in terms of fluency and adequacy, there are two problems: first, a manual evaluation is too expensive and time-consuming, especially when the evaluation result is needed to rate different systems during development.
Also, judgements produced by human annotators can be expected to be subjective and therefore do not always correspond perfectly with those of another judge (inter-annotator agreement). When evaluating mt-systems, we need an evaluation metric that is both consistent across different situations and can be produced automatically in a reasonable amount of time. Also, we have to define a meaning for the score computed by the metric. This leads to the idea of ranking two systems against each other instead of using an absolute value for fluency or adequacy. That means that not the actual score a system achieved is relevant, but the difference to the score of another system, e.g. in a comparison between a baseline system and a modified system. The results of automatic evaluation methods are also required to correlate with human judgement.

The most obvious way to automatically evaluate the quality of a translated sentence is to compare it with a reference translation, assuming that a good translation is similar to the reference. However, such an evaluation of mt-output is tricky: as there are in most cases not one but very many good translations, it is not sufficient to match the output against a single reference translation expecting good sentences to match perfectly. This will hardly ever happen, not even with very short sentences. In order to judge less strictly, the compared units are not entire sentences but n-grams of different lengths. Checking whether the translation output roughly consists of the same words as the reference translation is an attempt to judge adequacy, while longer n-grams reflect fluency by capturing larger chunks of text at once while still, at least to a certain degree, allowing for a different constituent order. There are several n-gram-based evaluation metrics, such as n-gram precision and recall or the widely used bleu metric.
Given that there are countless possibilities to translate a sentence, and that good translations do not necessarily need to be very similar in terms of their word choice or constituent order, an automatic evaluation system benefits from using multiple reference translations. It is possible to compute measures such as bleu based on more than one reference translation. However, in the course of this work, only single reference translations will be used for evaluation.

4.1 Bleu

The bleu metric (bilingual evaluation understudy) [Papineni et al., 2002] works with n-grams of different lengths up to a predefined maximum n. These n-grams are then compared with a reference sentence by computing the n-gram precision for each n. As mt-systems sometimes overproduce highly frequent words such as articles, some modification is needed in order not to overestimate n-gram precisions in such cases. Instead of just counting the n-grams in the candidate translation which also occur in the reference translation and then dividing that number by the total number of n-grams, each n-gram is counted at most as often as it has an actual counterpart in the reference. The method is illustrated in the following example by computing the modified unigram precision:

(16) the the the the the the the
(17) the cat is on the mat

While (16) is clearly bad mt-output, it would still achieve a unigram precision of 7/7. As there are only two occurrences of the in the reference translation (17), the is only counted twice for the modified unigram precision, resulting in a value of only 2/7. Sentences which are longer than the corresponding reference translation have difficulty getting high modified n-gram scores. In contrast, short sentences have an advantage in terms of modified n-gram precision.
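The clipped counting behind the modified precision can be expressed compactly (a sketch, with whitespace tokenization):

```python
from collections import Counter

def modified_precision(candidate, reference, n=1):
    """Clipped n-gram precision: each candidate n-gram counts at most as
    often as it occurs in the reference."""
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    clipped = sum(min(c, ref[g]) for g, c in cand.items())
    total = sum(cand.values())
    return clipped / total if total else 0.0

cand = "the the the the the the the".split()
ref = "the cat is on the mat".split()
print(modified_precision(cand, ref))  # 2/7: only two "the" in the reference
```

Without the min() clipping, the same call would return 7/7, reproducing the overestimation described above.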
As good sentences should match their reference translation in word choice, word order and sentence length, bleu additionally uses a brevity penalty in order to guarantee that short sentences do not get too high a score. bleu is computed as follows:

    bleu_n = brevity-penalty · exp( Σ_{i=1}^{n} λ_i · log(precision_i) )

    brevity-penalty = 1            if |c| > |r|
                      e^(1 - r/c)  if |c| ≤ |r|

where λ_i are weights for the different n-gram precisions, |c| is the length of the candidate sentence and |r| is the length of the reference sentence. In the case of multiple reference sentences, the one whose length is closest to the candidate's length is taken. This formula can be simplified to the generally used form by setting the weights λ_i = 1 and the order of n-grams to n = 4:6

    bleu_4 = brevity-penalty · ∏_{i=1}^{4} precision_i

As the entire score for a sentence is 0 when one of the n-gram precisions is 0, bleu scores are usually computed for the entire test set rather than on sentence level. However, it is possible to compute bleu scores on sentence level by simply adding 1 to each n-gram precision (bleu-add-one). While sentence-based bleu is helpful to pick out interesting sentences when manually evaluating a test set, it does not reflect human judgement. However, when computed on the entire test set, bleu correlates well with human judgements.

6 taken from [Koehn, 2010]

A related evaluation metric is nist [Doddington, 2002]. It is also based on n-grams, but uses an arithmetic average of n-gram counts instead of a geometric average. It also gives higher weights to those n-grams that occur less frequently, assuming them to be more informative than highly frequent n-grams. As bleu has been shown to correlate better with human judgement than nist, our evaluation will mainly focus on bleu values, but nist scores are also reported for the sake of completeness.

Example

The sentences (20) and (21) are both translations of (18), with one reference translation (19)7.
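The simplified formula can be turned into a small single-segment scorer. This is a sketch with an invented toy example: real bleu is computed over the whole test set, and this version simply returns 0 as soon as any precision is 0.

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu4(candidate, reference):
    """Simplified bleu_4: brevity penalty times the product of the
    modified 1- to 4-gram precisions."""
    score = 1.0
    for n in range(1, 5):
        cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        total = sum(cand.values())
        if total == 0 or clipped == 0:
            return 0.0   # any zero precision makes the whole score 0
        score *= clipped / total
    c, r = len(candidate), len(reference)
    bp = 1.0 if c > r else math.exp(1 - r / c)
    return bp * score

ref = "die frauen sind alle vorbildlich gepflegt".split()
cand = "die frauen sind alle tadellos gepflegt".split()
print(round(bleu4(cand, ref), 4))  # 5/6 * 3/5 * 2/4 * 1/3 = 1/12 -> 0.0833
```

Note how a single substituted word already drags the product down through all four precisions, which is why bleu is usually aggregated over a whole test set.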
The translations are very similar; in fact, they differ only in two words: the article on the left side of vermögen (fortune) and the word wert (worth), which is missing in (20). Except for the phrase tadellos groomed, (21) is exactly the same as the reference translation.

(18) the women are all impeccably groomed and the men are all worth a fortune .
(19) die frauen sind alle vorbildlich gepflegt und die männer sind alle ein vermögen wert .
(20) die frauen sind alle tadellos groomed und die männer sind alle eine vermögen .
(21) die frauen sind alle tadellos groomed und die männer sind alle ein vermögen wert .

| translation candidate | n=1 | n=2 | n=3 | n=4 | brevity penalty | bleu4 |
| (20) | 11/14 | 7/13 | 5/12 | 3/11 | 0.9311 | 0.045 |
| (21) | 13/15 | 11/14 | 9/13 | 7/12 | 1 | 0.275 |

Table 5: Modified n-gram precisions (n = 1 to 4) for the sentences (20) and (21) when matched against the reference translation (19).

Despite their similarity, (21) yields higher n-gram precision scores and is therefore – rightly – assumed to be a better translation than (20). N-gram precisions and the final bleu score are given in table 5. This example also demonstrates one of the weak points of bleu: while matching the reference translation is a good indicator for a good translation, the fact that some words do not match does not necessarily indicate a bad translation. In the example, impeccably was translated in both sentences as tadellos, which is as good as vorbildlich. Yet, tadellos is not counted into the respective precision scores: bleu is not capable of recognizing words which are correct translations, but do not occur in the reference translation. A detailed critical examination of bleu and its role in machine translation can be found in [Callison-Burch et al., 2006].

7 (20) was translated with the baseline system and (21) was translated with a system conditioned on 2 pos tags on the left side of a phrase.
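The scores in table 5 can be reproduced with a short script. The following is a minimal sketch (function names are illustrative) implementing the clipped precisions and the simplified bleu4 formula with λ_i = 1; it recovers the values 0.045 and 0.275 for (20) and (21):

```python
from collections import Counter
from math import exp

def precisions(cand, ref, max_n=4):
    """Clipped n-gram precisions for n = 1 .. max_n."""
    prec = []
    for n in range(1, max_n + 1):
        cgrams = Counter(tuple(cand[i:i+n]) for i in range(len(cand) - n + 1))
        rgrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
        clipped = sum(min(c, rgrams[g]) for g, c in cgrams.items())
        prec.append(clipped / (len(cand) - n + 1))
    return prec

def bleu4(cand, ref):
    # brevity penalty: 1 if the candidate is longer than the reference
    bp = 1.0 if len(cand) > len(ref) else exp(1 - len(ref) / len(cand))
    score = bp
    for p in precisions(cand, ref):
        score *= p
    return score

ref = ("die frauen sind alle vorbildlich gepflegt und die männer "
       "sind alle ein vermögen wert .").split()
s20 = ("die frauen sind alle tadellos groomed und die männer "
       "sind alle eine vermögen .").split()
s21 = ("die frauen sind alle tadellos groomed und die männer "
       "sind alle ein vermögen wert .").split()
print(round(bleu4(s20, ref), 3), round(bleu4(s21, ref), 3))  # 0.045 0.275
```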
Meteor

An evaluation metric which deals with the problem illustrated in the example is meteor [Banerjee and Lavie, 2005]: it is designed to recognize near matches, such as decision and decide, by backing off to word stems. If an ontology like WordNet is provided, meteor can also use semantic classes in order to find synonyms or semantically closely related words: with the help of a German ontology (e.g. GermaNet), tadellos could be classified as an equally good translation as vorbildlich. Unfortunately, meteor was found not to correlate very well with human judgement on German newspaper data (cf. [Callison-Burch et al., 2008]); thus, we will not report meteor scores.

4.2 Error types in machine translation output

N-gram based precision metrics such as bleu, discussed in the previous section, are useful to evaluate the overall quality of a translation system in comparison with another. While bleu is designed to capture both fluency and adequacy, it does not provide an analysis of error types. However, the identification of different error types is crucial in order to improve translation quality. [Vilar et al., 2006] carried out a very fine-grained study of error types with the aim of providing an evaluation framework for human evaluators. They propose five main groups, missing words, word order, incorrect words, unknown words and punctuation errors, which can again be divided into subgroups. In the category of missing words, a distinction can be made between missing words which are crucial for the meaning of the sentence, and cases where the meaning of the sentence is preserved despite a missing word. Generally, 'content words' like verbs, nouns or proper nouns are assumed to be most essential for the meaning of a sentence. However, this assumption is overly simple considering that a preposition or negation can significantly change the meaning of a sentence.
Similarly, in the case of incorrect words, the most important point is whether the incorrect translation of a word disrupts the meaning or whether the content is still comprehensible. There are different forms of incorrect words: a word can be translated completely wrongly, especially if there was a need to disambiguate a source word, or be a valid translation in terms of meaning, i.e. the right base form, but with a wrong inflection. Furthermore, the translation of idiomatic expressions is challenging as they cannot be translated literally in most cases. Another point is that of bad word choice, i.e. constructions which carry the content of the source phrase but, in the given context, cannot be considered correct for not being a good, fluent phrase of the target language. Reordering errors can occur on phrase level or word level and can further be classified into local or long-range problems.

Incorrect word forms, as well as reordering problems, depend to a certain degree on the language pair one is looking at: finding morphologically correct forms is especially challenging for highly inflecting languages; typical examples of local reordering problems arise when translating Romance languages, where nouns normally precede adjectives, into a language like English where the order of nouns and adjectives is inverted. As the use of contextual information aims to find context-appropriate translations, the main focus of the evaluations will be on incorrect words as well as missing words, while other categories such as reordering issues, unknown words and incorrect punctuation will be discussed only marginally or ignored. Also, a more targeted evaluation of incorrect or missing words with respect to their assumed relevance, i.e. content words vs. non-content words, is an interesting aspect given that n-gram based measures such as bleu do not make a distinction between important and less important words.
This classification of error types, as proposed by [Vilar et al., 2006] and summarized here only briefly, was originally intended as an orientation framework for human annotators, as most of the criteria are too subtle to be entirely captured by automatic evaluation routines and categorized accordingly. [Popović et al., 2006] propose a method to automatically measure the degree of errors in local reordering (adj-noun vs. noun-adj) when translating Spanish-English, and the relative amount of inflection errors in Spanish noun phrases and verbs. To measure inflection errors, they compare the word error rate computed on fully inflected forms with the word error rate computed on base forms; a large difference indicates many inflection errors. Similarly, the difference between word error rate and position-independent word error rate is used as an indicator for reordering problems. While these approximations are not perfect, [Popović et al., 2006] showed that their evaluation results correspond to results obtained in evaluations carried out by human annotators.

4.3 Syntactically motivated evaluation

The integration of linguistically motivated features is a necessary step for a targeted evaluation of specific error types: in order to improve the performance of an automatic translation system, frequent error types or difficulties when translating certain structures need to be identified. A syntactically oriented evaluation method is presented in the work of [Popović and Ney, 2009]: not the actual words of the machine translation output are used for the computation of standard evaluation metrics, but their pos tags. The generalization to pos tags takes the evaluation from the purely lexical level to a more abstract level: as good translations do not necessarily have to be exactly like the reference translation, word-based bleu punishes lexical variation, even if the syntactic structure of the sentences is the same.
By introducing the pos-bleu score, we want to be independent of the actual realization of words and focus instead on the syntactic structure of the target sentence. If the translation and the reference sentence are similar on a syntactic level, then there is evidence that the translation is relatively well-formed, even if the actual words do not always correspond with the reference translation.

(22) the race remains unusually wide open .

ref: das/art rennen/nn bleibt/vvfin ungewöhnlich/adjd weit/adjd offen/adjd ./$.
bl: die/art jagd/nn nach/adv wie/kokom vor/adv außerordentlich/adjd weit/adjd offen/adjd ./$.
co: das/art rennen/nn bleibt/vvfin außerordentlich/adjd weit/adjd offen/adjd ./$.

In example (22), the reference translation as well as two different automatic translations (baseline and contextually conditioned) are given with pos tag annotations. While the second translation (co) is very good, there is one difference compared to the reference translation: instead of the adjective ungewöhnlich (unusually), the system chose the equivalent adjective außerordentlich (exceptionally). As the sequence of pos tags is the same in ref and co, the translation co receives the top pos-bleu score of 1. The baseline translation bl is well comprehensible, but definitely flawed as it contains no verb. While the construction nach wie vor (still) roughly expresses the meaning of remains, the structure differs greatly from the reference translation, and the sentence bl therefore receives a significantly smaller pos-bleu score of 0.4518. Even if there was a verb (tagged vvfin) in the correct position after the noun phrase die jagd (the chase) in bl, the pos-bleu score would only be slightly increased (0.4692), despite the sentence then being syntactically well-formed. The example shows that pos-bleu is capable of balancing out different word choices (as long as they are tagged identically), but also that it cannot deal with equivalent syntactic structures.
When computing bleu on pos tags, there is also the risk that, especially in long sentences, pos tag sequences of the test sentence randomly match pos tag sequences of the reference translation although they are located at different parts of the sentence and have no correspondence at all.

[Popović and Ney, 2009] investigate the bleu metric based on pos tags, as well as recall, precision and f-measure of pos n-grams. For these measures, they report a good correlation with human judgement. Correlation statistics are computed on the manually evaluated test data from the 2006, 2007 and 2008 shared tasks of the Statistical Machine Translation Workshop.8 In the first two data sets, both fluency and adequacy are annotated. The pos-based metrics, especially pos-bleu and the pos-f-measure, have a good correlation with the fluency and adequacy scores obtained by manual annotation. Additionally, pos-bleu and the pos-f-measure have better correlation scores with human judgement than standard bleu and meteor with respect to the fluency score, and pos-bleu also correlates better with the adequacy score than standard bleu and meteor.

Since metrics entirely based on pos tags do not take into account lexical aspects, a score including both word n-grams and pos n-grams was introduced. The correlation test for this variant was carried out on the 2008 data set, where translated sentences were ranked against each other, but no fluency or adequacy scores were annotated. A good correlation for the word and pos n-gram measure, as well as for pos-bleu and the pos-f-measure on the new data set, could be observed.

Given that the typical output of an mt-system is not well-formed, pos-tagging the mt-output seems a reasonable task.

8 http://www.statmt.org/wmt{06|07|08}
The output of a robust tagger which relies on a large lexicon, and which is therefore able to assign tags even if the sentence is not perfectly well-formed, is expected to be sufficiently reliable, especially if a relatively simple tag set is used. In contrast, a deep syntactic analysis such as parsing or assigning fine-grained morphological descriptions to the output would be quite difficult to carry out. As we hope for an improvement on both the lexical and the grammatical level, variations of pos-based evaluation seem a promising method to capture overall syntactic effects. In order to also include lexical aspects in the evaluation, we intend to measure precision and recall of translated content words in an attempt to compensate for the fact that standard bleu considers all words/n-grams to be equally important. We also hope that pos-bleu scores help to find individual sentences that are much better than the baseline and will thus be a useful tool for a manual evaluation of specific error types.

4.4 Testing for significance

As the absolute scores of evaluation metrics are meaningless, the respective scores of a modified system and the baseline are compared. However, it is not sufficient to observe an increase in bleu points (or in the score of any other evaluation metric); it must also be shown that the improvement is statistically significant. [Koehn, 2004] describes how bootstrap resampling can be used to compute the statistical significance of automatic evaluation metrics, focusing on bleu. Instead of comparing the performance of different translation systems, the same system was trained on 11 languages and used to translate the test set into English. Thus, the English output of 11 different source languages can be compared. The test set, consisting of 30,000 sentences, is especially large. In an attempt to find the 'true' bleu score, the test set is divided into smaller parts for each of which bleu scores are computed.
The choice of these test sets is crucial, as consecutive sentences tend to be alike and of similar translation quality and are thus not representative of the entire test set. [Koehn, 2004] shows empirically that by using a broad sample, i.e. drawing sentences from different parts of the main test set, the obtained bleu scores of the smaller sets lie within a smaller range (as compared to test sets of neighboring sentences) and are thus closer to the 'true' bleu value of the full set of 30,000 sentences.

To find the true translation quality of an smt system, we would need to consider all possible sentences of a domain. Since this is not possible, we need to figure out how to exploit our finite test set in order to simulate an infinite amount of test data. The objective of finding the 'true' bleu value is now redefined as the task of computing whether the true bleu score lies within a certain interval. For this computation, we demand a certain confidence, which is set at q = 0.95, or in other words, a p-level of 0.05. The basic idea of bootstrap resampling is to repeatedly draw a test set of size n from a very large (infinite) set and compute bleu scores for each of these test sets. If this is done a sufficient number of times, according to the law of large numbers, the individual bleu scores enclose the assumed true bleu score very tightly. By eliminating the 2.5 % highest and 2.5 % lowest individual bleu scores, the 95 % confidence interval around the true bleu score is found. Again, as there is no infinite set of test sentences, we need to resort to a trick: the test sets of size n are each built by drawing n sentences with replacement from the available collection of test sentences. [Koehn, 2004] makes the assumption that such a set is equivalent to a set drawn from an infinite collection of test sentences and can thus be used for the estimation of the confidence interval.
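The resampling procedure can be sketched as follows. For simplicity, this sketch works with per-sentence scores and uses their mean as the corpus score; for bleu itself one would instead resample sentences and recompute corpus bleu from the aggregated n-gram counts. The function name and the toy data are illustrative assumptions:

```python
import random

def bootstrap_confidence_interval(sentence_scores, samples=1000,
                                  confidence=0.95, seed=0):
    """Estimate a confidence interval for a corpus-level score by repeatedly
    drawing test sets of the same size with replacement."""
    rng = random.Random(seed)
    n = len(sentence_scores)
    scores = []
    for _ in range(samples):
        resample = [sentence_scores[rng.randrange(n)] for _ in range(n)]
        scores.append(sum(resample) / n)
    scores.sort()
    cut = int(samples * (1 - confidence) / 2)  # drop 2.5 % at each end
    return scores[cut], scores[-cut - 1]

# toy per-sentence scores standing in for a real test set
toy_scores = [0.3, 0.5, 0.2, 0.6, 0.4, 0.35] * 50
lo, hi = bootstrap_confidence_interval(toy_scores)
print(lo, hi)
```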
By comparing the performance on test sets of modest size with that on his exceptionally large test set, he shows empirically that this assumption is valid. In order to determine whether the quality of two systems differs significantly, the method of paired bootstrap resampling is introduced: for each of the test sets produced by the two systems, new test sets are created by drawing n sentences with replacement. This is repeated a sufficient number of times and the respective scores of the systems are compared. If one system outperforms the other in at least 95 % of the iterations, it is concluded that this system is better with 95 % statistical significance. The significance testing carried out in the course of this work is done using paired bootstrap resampling with 1000 samples and 95 % confidence.

5 Rating the reliability of contexts and smoothing

When computing contextually conditioned translation probabilities, it is indispensable to rate contexts in terms of their reliability, as we only want to work with useful and non-trivial context features. The next step is to figure out what to do with translation candidates that were seen in the distributions without contextual conditioning, but not in the contextually conditioned ones. As such phrase-translation pairs are given a translation probability of zero, they are eliminated from the set of translation possibilities. Since it might not always be a good idea to discard those unseen translations, they need to be assigned a relatively small translation probability. In the next section, criteria to identify good and harmful contexts will be discussed, as well as methods to assign non-zero translation probabilities to phrase-translation pairs that did not occur within a given context.

5.1 General reflections

Generally, a few factors have to be considered when computing contextually conditioned probabilities. The most important aspect is the reliability of the probability estimation.
While some contexts can be useful to reduce the set of translation candidates to a well-suited smaller set, other contexts can be trivial or even harmful when they favor bad translations at the expense of better ones. In the made-up toy example in table 6, three different contexts are compared to illustrate the effects of good and less good contexts when no criterion at all is used to rate the reliability of a context. Based on possible translations of the phrase fine, it is an extreme simplification of one of the examples presented previously. The possible translations are grouped according to their meaning: pretty (1), the nominal phrase financial penalty (2) and the verbal construction to impose a penalty (3). The candidates in section (4) (fine→it and fine→car) are always a bad choice regardless of the context. The task is to find appropriate translations for fine depending on the respective context, i.e. to disambiguate between penalty and pretty, while also excluding nonsense translations. Phrase translation probabilities are conditioned on one word on the left side of the phrase.

Context c1 clearly prefers the verbal variant of the penalty reading. As there is also a reasonable number of phrase-translation pairs in this context, we can consider context c1 to be reliable and useful. Similarly, context c3 gives relatively high probabilities to only two translation candidates. In this case, however, the two translation candidates that are kept for translation are most likely a product of alignment errors or bad pos tagging. The difference between the good context c1 and the bad context c3 is that the phrases in c1 were seen multiple times while the phrases in c3 only occurred once and can thus be assumed to be random. Given that alignment is very error-prone and that preprocessing procedures such as pos tagging are additional sources of error, it is not surprising to find a certain amount of such random pairs in most translation probability distributions. If a small number of these faulty translation candidates happens to be the only phrases seen in a context, then
If a small number of these faulty translation candidates happens to be the only phrases seen in a context, then 39 À Á  à translation gloss schön schöne strafe geldsrafe bestrafen beautiful beautiful penalty penalty penalize impose a penalty it it , car strafe auferlegen es es , auto original f ϕ 30 0.30 26 0.26 14 0.14 9 0.90 16 0.16 c1 = to f ϕc1 0 0 0 0 0 0 0 0 15 0.88 c2 = the f ϕc2 28 0.44 20 0.32 8 0.13 6 0.09 0 0 c3 = , f ϕc3 0 0 0 0 0 0 0 0 0 0 2 0.02 2 0.12 0 0 0 0 1 1 1 0.01 0.01 0.01 0 0 0 0 0 0 0 0 1 0 0 0.02 1 1 0 0.5 0.5 0 Table 6: Simple example illustrating useful, trivial and harmful contexts. Translations for the word fine are conditioned on the word on the left side of the phrase. their translation probabilities are extremely overestimated, which is definitely an unwanted effect: we thus need to define a criterion (e.g. a threshold) to prevent contextually conditioning in such situations. While context c2 is not as disastrous as c3 , it is not very useful either: It reduces the number of translation candidates and assigns reasonable translation probabilities, but fails to disambiguate between penalty and pretty. Although the system is not likely to be harmed by this context, it does not benefit either and the reduction of the set of possible translations results in the fact that the estimation of translation probabilities is based on less data than without the context. Therefore, we are not only interested in filtering out downright bad contexts, but we also want to have criteria helping to identify trivial contexts. By illustrating what the probability distributions of good, bad and trivial contexts look like, this constructed example helps to define criteria for the identification of useful contexts. Thus, a context could be considered useful and reliable if it occurs frequently and leads to a considerable reduction of translation candidates. 
Ideally, the resulting new probability distribution has a few favorite translations with relatively high probabilities, as illustrated with context c1 in the example. In reality, even useful contexts tend to produce quite flat probability distributions: this was already seen in some of the previous examples and will also be the topic of some of the following discussions. Since judging the distribution by its appearance might not work in all cases, we also take a step back and consider the number of phrase pairs co-occurring with a given context, assuming that a frequently seen context is more reliable (i.e. not random) than a context that was seen only once or twice. Criteria based on the number of co-occurring phrase-translation pairs and on the appearance of the contextually conditioned distribution will be presented with regard to their potential use as identifiers for good contexts, but also as discount/weighting factors for smoothing.

5.2 Interpolation and discounting

As can be seen in table 6, translations not seen in a context are given a translation probability of zero and are thus no longer eligible for translation. Smoothing the probabilities for unseen translations is a necessary step to deal with the reduced data set used to estimate the translation probabilities seen in the given context. In order to assign a small probability to unseen events, the translation probabilities of phrase-translation pairs seen in a context are slightly decreased by a factor λ. The probability mass taken away from seen events is then assigned to unseen events. There are many different ways to realize smoothing; one of the most crucial aspects is the design of the discount factor that is used to decrease the probability mass of seen occurrences. The underlying rationale is to fully exploit useful contexts by discounting them very little, while heavily discounting trivial contexts and thereby reproducing a distribution similar to the original ϕ(trg | src).
Another important point is the influence of the original ϕ(trg | src) on the new probability ϕ(trg | src_context); here we differentiate between the methods of discounting and interpolation. In the case of discounting, all translation probabilities of seen events are decreased by a discount factor and the remaining probability mass is then distributed to the unseen events proportionally to the original distribution. Thus, the original non-context distribution is only reflected in the entries of unseen translation candidates, and the probabilities for seen events are entirely based on the contextually conditioned distributions. When computing the new probability as an interpolation, a (small) fraction of the original distribution is part of every new probability. A weighting factor λ (with values between 0 and 1) is used to balance the relation between the two probabilities:

ϕ(trg | src_context)_int = λ · ϕ(trg | src_context) + (1 − λ) · ϕ(trg | src)

For a good context, λ should be close to 1 in order to use the contextually conditioned probability without noticeable influence from the original distribution. Conversely, λ should be close to zero in the case of a not very convincing context, in order to reproduce more or less the original distribution ϕ(trg | src). Regardless of the weighting factor, the original ϕ(trg | src) might suppress a good distribution yielded by a good context when ϕ(trg | src) is relatively high in contrast to ϕ(trg | src_context). One has to keep in mind that the example based on fine is somewhat extreme – it is relatively rare that phrases are ambiguous to such an extent. In most other cases, the differences in the appropriateness of translations are much more subtle, and too strong an influence from ϕ(trg | src) can ruin the effect achieved by using context knowledge.
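The two schemes can be contrasted in a small sketch. The function and the toy distributions are illustrative (the context probabilities 0.88/0.12 are taken from context c1 in table 6; λ = 0.8 is an arbitrary choice):

```python
def smooth(phi_context, phi_orig, lam, method="interpolation"):
    """Combine a contextually conditioned distribution with the original
    one; both arguments map translation -> probability."""
    if method == "interpolation":
        # every entry mixes both distributions
        return {t: lam * phi_context.get(t, 0.0) + (1 - lam) * p
                for t, p in phi_orig.items()}
    # discounting: seen events keep lam * phi_context; the freed mass
    # (1 - lam) goes to unseen events proportionally to phi_orig
    unseen_mass = sum(p for t, p in phi_orig.items() if t not in phi_context)
    return {t: lam * phi_context[t] if t in phi_context
            else (1 - lam) * p / unseen_mass if unseen_mass else 0.0
            for t, p in phi_orig.items()}

phi_o = {"schön": 0.30, "bestrafen": 0.16, "strafe auferlegen": 0.02}
phi_c = {"bestrafen": 0.88, "strafe auferlegen": 0.12}  # context c1
print(smooth(phi_c, phi_o, 0.8)["bestrafen"])  # 0.8*0.88 + 0.2*0.16 = 0.736
```

Note how under interpolation every translation retains a share of its original probability, whereas under discounting the seen events depend only on the contextually conditioned distribution.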
Additionally, translations that are bad regardless of the context, but have a relatively high translation probability, would still be represented in the new probability distribution even if they were not seen in the context used for conditioning. Given that alignment is a very error-prone procedure, a considerable number of faulty translation candidates can be expected in most sets of phrase-translation pairs. These bad translation candidates often consist of prepositions, articles or punctuation marks, i.e. tokens that have a high frequency and, to a certain degree, fit anywhere in a sentence. On the other hand, interpolation can be regarded as a way to slightly re-balance the translation probabilities of distributions that are already relatively good. In this case, the impact of contexts that are not exactly bad, but not very good either, could be too harsh when using the discount method. In our experiments, we will compare the outcome of both discount-based methods and interpolation. In both variants, discounting and interpolation, the design of the factor λ is an important point and closely related to the identification of useful contexts. Thus, we need to define criteria for the design of the discount factor and also for the reliability of contexts, i.e. decide to what extent we want to use the contextually conditioned probability and when to prefer the original one.

5.3 Discount factors

So far, we have only worked with a simple maximum likelihood estimator of the form

\[ \varphi(trg \mid src_{context}) = \frac{count(src_{context}, trg)}{\sum_{trg_i} count(src_{context}, trg_i)} \]

where the probability for the translation trg of a given phrase src_context is estimated as the ratio of how often trg is the translation of src_context to the total number of possible translations in the given context.
While this method is simple and straightforward, it has the disadvantage that translations that were not seen in the context are assigned a probability of zero, while seen events can be overestimated at the same time. In some of the experiments presented later, the step of giving non-zero probabilities to unseen events will be skipped and translations not occurring within a context will be discarded, in order to examine the maximal effect of context information. Furthermore, we have to keep in mind that the new design of the phrase tables (cf. the example on page 27), with each phrase of the test set listed separately, leads to a huge inflation of the phrase table. Therefore, entries with a translation probability below a certain threshold need to be excluded anyway, making the task of assigning non-zero translation probabilities, to a certain degree, less important. However, there are situations in which we prefer smoothing: a simple example is the combination of translation probability distributions based on two or more independent contexts, where a translation candidate not occurring in one context might (often) occur in the other one and therefore needs a non-zero translation probability in the distribution of the context it was not seen in. If the probability distributions of two independent contexts are given as separate features to the log-linear model, it might be too harsh to use only those translations that occur in both contexts, especially when considering that the system can decide itself on the relevance of features during parameter tuning. In the following sections, ideas on how to create factors to rate contexts and to 'steal' probability mass from seen events in order to maximally exploit good contexts will be presented, followed by a comparison of the different approaches.
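The maximum likelihood estimator above can be sketched directly from extracted phrase-translation pairs. The function name and the toy data are illustrative; the counts mirror context c1 of table 6 (15 × bestrafen, 2 × strafe auferlegen):

```python
from collections import Counter

def conditional_mle(phrase_pairs):
    """Maximum likelihood estimation of phi(trg | src, context) from
    (src, context, trg) tuples, e.g. phrase-translation pairs annotated
    with the word to the left of the source phrase."""
    totals = Counter((src, ctx) for src, ctx, _ in phrase_pairs)
    counts = Counter(phrase_pairs)
    return {(src, ctx, trg): c / totals[(src, ctx)]
            for (src, ctx, trg), c in counts.items()}

pairs = ([("fine", "to", "bestrafen")] * 15
         + [("fine", "to", "strafe auferlegen")] * 2)
phi = conditional_mle(pairs)
print(round(phi[("fine", "to", "bestrafen")], 2))  # 0.88
```

As discussed above, any translation absent from `pairs` for a given (src, context) simply receives no entry, i.e. an implicit probability of zero.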
5.3.1 Count-based criteria for the usefulness of contexts

As already mentioned, the number of phrase-translation pairs seen in a context can be taken as an indicator for the quality of a context. In order to find out whether a context is reliable or potentially harmful, the simplest approach is to set up a threshold to guarantee that contextually conditioned probabilities are only used if there are at least n phrase-translation pairs seen in this context. If there are fewer than n occurrences, this context is considered to be random or a product of alignment errors. In the case of trivial or 'bad' contexts, it can be very harmful to compute translation probabilities based on a very small number of counts: if there are, for example, only three translations seen under the restriction of a certain context and each of them occurs only once, then they each receive the relatively high probability of 1/3, minus the value taken away by a discount factor, while the remaining translations end up with comparatively low translation probabilities. This is definitely not what we want if the 'boosted' translations are bad choices, but also in the case of a trivial context where most translations are, to a certain degree, equally probable, but only a small part of them happens to co-occur with a specific context and ends up on top with a high probability. An extreme case of bad translations being enhanced by a random context was already shown in the made-up example in table 6. Another idea to impose a restriction on the minimal number of counts is to introduce a function based on counts that penalizes low-frequency contexts. There are many ways to do so; one possibility is to set

\[ \lambda = \frac{\ln(count)}{\ln(count) + 1} \]

This function slowly increases for increasing values of count, growing faster for smaller values than for larger ones.
This is a positive effect, as we want to give more differentiated scores in the critical range of ≈ 5 to 15 counts: a context seen 10 times might be considered far more reliable than a context with a frequency of only 5, while a context seen 100 times can be assumed to be as trustworthy as a context seen 105 or even 200 times. Using a count-based function avoids the problem of imposing an arbitrarily chosen threshold by allowing for a gradual transition between good, less good and bad contexts instead of classifying them into good and bad. So far, we can define three methods for the estimation of contextually conditioned translation probabilities:

• Set up a threshold to decide whether to use the contextually conditioned probabilities or to switch back to the non-context distribution. In our experiments, the threshold was set to count = 10.

• Use the count-based factor as weight λ for interpolation: ϕ_int = λ · ϕ_c + (1 − λ) · ϕ_o (version (1) in the evaluation)

• Use the count-based factor as factor for discounting: ϕ_disc = λ · ϕ_c. The remaining probability mass is assigned to unseen events proportionally to ϕ_o. (version (3) in the evaluation)

The methods using either a threshold of count = 10 or interpolation turned out to work better than always using the contextually conditioned probability. Experiments also showed that the results of these systems are in the same range, whereas the results of the system using the discounting method are considerably lower.
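The count-based factor and the three methods can be sketched as follows; the function names and the threshold parameter are illustrative, with the threshold defaulting to count = 10 as in the experiments:

```python
from math import log

def count_based_lambda(count):
    """lambda = ln(count) / (ln(count) + 1); grows quickly for small
    counts and saturates for large ones."""
    if count < 1:
        return 0.0
    return log(count) / (log(count) + 1)

def conditioned_probability(phi_c, phi_o, count, method, threshold=10):
    """Select the translation probability according to one of the three
    methods: phi_c is the contextually conditioned probability, phi_o the
    original one, count the number of pairs seen with the context."""
    if method == "threshold":
        return phi_c if count >= threshold else phi_o
    lam = count_based_lambda(count)
    if method == "interpolation":       # version (1)
        return lam * phi_c + (1 - lam) * phi_o
    return lam * phi_c                  # version (3); unseen events would
                                        # share the remaining mass

# lambda differentiates strongly between 5 and 10, little beyond 100
print(round(count_based_lambda(5), 2),
      round(count_based_lambda(10), 2),
      round(count_based_lambda(100), 2))  # 0.62 0.7 0.82
```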
5.3.2 Example: interpolation and discounting

| translation | gloss | original f, ϕo | threshold f, ϕc | interpol. | discount |
| schönen | beautiful | 160, 0.1264 | 3, 0.038 | 0.0545 | 0.0309 |
| schöne | beautiful | 120, 0.0948 | 3, 0.038 | 0.0486 | 0.0309 |
| gute | good | 70, 0.0553 | 3, 0.038 | 0.0412 | 0.0309 |
| guten | good | 62, 0.049 | 2, 0.025 | 0.0297 | 0.0206 |
| gut | good | 60, 0.0474 | 0, 0 | 0.0088 | 0.0164 |
| in ordnung | ok | 26, 0.0205 | 0, 0 | 0.0038 | 0.0071 |
| gutes | good | 19, 0.0150 | 0, 0 | 0.0028 | 0.0052 |
| geldstrafe * | penalty | 19, 0.0150 | 11, 0.139 | 0.1161 | 0.1133 |
| ausgezeichneten | excellent | 18, 0.0142 | 1, 0.013 | 0.013 | 0.0103 |
| strafe * | penalty | 15, 0.0118 | 8, 0.101 | 0.0846 | 0.0824 |
| geldbuße * | penalty | 7, 0.0053 | 6, 0.076 | 0.0619 | 0.0618 |
| verhängen * | impose | 2, 0.0016 | 0, 0 | 0.0003 | 0.0005 |
| einer geldstrafe * | a penalty | 1, 0.0008 | 0, 0 | 0.0001 | 0.0003 |
| geldstrafe bezahlen * | pay a penalty | 1, 0.0008 | 0, 0 | 0.0001 | 0.0003 |

Table 7: Some of the translation candidates for the English phrase fine and their translation probabilities when fine is tagged as a noun with a determiner on the left side. Candidates with the targeted meaning of penalty are marked with *. The weight λ = 0.8138 is count-based.

Table 7 shows contextually conditioned probabilities smoothed by the count-based factor discussed above, compared to the original probability distribution ϕo and the contextually conditioned distribution ϕc, which only has to fulfill the threshold criterion. This example is intended to show the differences to ϕc, but also the differences between interpolation and discounting. While the translations geldstrafe and strafe always have the highest probabilities, the discounting method, which scales the probabilities seen in the context by the count-based factor, reflects the contextually conditioned distribution better than the version using interpolation. In particular, the favorite translations in the original system (schönen and schöne) score considerably higher when we use interpolation. But the translations strafe and geldstrafe also profit from interpolation, as their original probability is not lost but added to the new value.
Interpolation might not be the best choice to bring out a candidate from the lower midrange of the original distribution to be by far the best translation candidate in the new distribution. However, it might be suitable to re-balance a distribution conditioned on a more subtle context where all valid translations already scored similarly high without context information.

5.3.3 Good-Turing estimation

The Good-Turing method is used to estimate the probability of previously seen and unseen events in large data sets, e.g. the probability of seeing a word in a large text corpus. Good-Turing estimators are of the following form:

F_x = (N_x + 1)/T · E(N_x + 1)/E(N_x)

In this equation, x is the event we are interested in (e.g. a word or a translation) that has been seen N_x times, T is the size of the data set, and E(n) denotes the estimate of how many events were seen n times. There are many ways to design a Good-Turing estimator; a very simple one is given in the following equation:

P(x) = N_x/T · (1 − E(1)/T)   and   P(unseen) = E(1)/T

Here, the probability for the event x is given by the maximum likelihood estimate N_x/T, which is discounted by the factor (1 − E(1)/T). As the probabilities need to sum to 1, the probability mass for unseen events is E(1)/T. Illustrated with the example of words in a corpus, the probability for a formerly unseen word to appear for the first time is estimated by the number of words seen only once in relation to the size of the data set. In our application to phrase tables, x is a translation candidate for a given phrase and a given context, T is the number of translations for this phrase and context, and N_x is the number of times in which x is the translation of the phrase. The left-over probability, E(1)/T, is the probability mass for all unseen translations in this context. Therefore, P(unseen) needs to be divided between all unseen translation candidates.
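The simple estimator above can be written down directly; a minimal sketch:

```python
def simple_good_turing(counts):
    """Simple Good-Turing variant from the text:
    P(x) = (N_x / T) * (1 - E(1)/T) for seen events,
    P(unseen) = E(1)/T, where T is the total number of observations
    and E(1) the number of events seen exactly once."""
    T = sum(counts.values())
    E1 = sum(1 for n in counts.values() if n == 1)
    p_unseen = E1 / T
    # discount every maximum-likelihood estimate by (1 - E(1)/T)
    probs = {x: (n / T) * (1 - p_unseen) for x, n in counts.items()}
    return probs, p_unseen
```

For the counts {a: 3, b: 1, c: 1} (made-up data), T = 5 and E(1) = 2, so P(unseen) = 0.4 and the seen probabilities are scaled by 0.6.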
This can be done by either assigning an equal probability to all unseen events, or by letting these probabilities reflect the original distribution without context information. While it could be argued that translations that were not seen in a specific context are equally improbable and should therefore be assigned the same pseudo-probability, this also means that information is thrown away: if it is known that, of two unseen translations, one is twice as frequent as the other when not contextually conditioned, then there is no reason not to use this information when computing smoothed probabilities. Good-Turing estimators are designed for large data sets with a Zipfian distribution, i.e. a relatively small number of events occurs very frequently while most events occur with very low frequency. When applying the Good-Turing method to our probability distributions, we find that these criteria are often not quite met. While some probability distributions are appropriate for Good-Turing estimation, others definitely are not. Generally, the distributions tend to be very flat and, given that we are looking at translations for one phrase in one specific context, i.e. under very restricted conditions, they are also often of modest size. It might also happen that every translation in the entire distribution was seen only once, i.e. E(1) = T, or that no translation was seen exactly once, i.e. E(1) = 0. While it would certainly be possible to create rules to deal with such exceptions, there is another aspect that is not considered in the Good-Turing approach as presented here: unlike in typical applications of Good-Turing (e.g. words in a corpus), we have detailed knowledge about the unseen events, and also a more 'robust' probability distribution to which we can switch back if we do not like the context.
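The proportional way of dividing P(unseen) among the unseen candidates, as discussed above, reduces to a short routine (function and variable names are ours):

```python
def spread_unseen(p_unseen, phi_o, seen):
    """Divide the unseen probability mass P(unseen) among the translations
    not seen in the context, proportionally to their probabilities in the
    original (non-context) distribution phi_o."""
    rest = {t: p for t, p in phi_o.items() if t not in seen}
    z = sum(rest.values())
    return {t: p_unseen * p / z for t, p in rest.items()}
```

With P(unseen) = 0.4 and an original distribution in which one unseen candidate is 1.5 times as frequent as the other, the shares keep that ratio instead of being uniform.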
Especially the relation between the contextually conditioned distribution and the original one promises to be a useful indicator of whether a context is good or trivial; the assumed relevance of a context should also be reflected in the design of a discount factor.

5.3.4 Type-token relations as criteria for the evaluation of contexts

The relation of types and tokens in the contextually conditioned distribution, as well as a comparison with the original distribution, are interesting aspects that might indicate whether a context is trustworthy and thus to which extent we want to use its probability distribution. If the number of types and tokens in the conditioned probability distribution is roughly the same, i.e. the distribution is very flat without one or a few favorite translations, then the context is not very promising, as the initial idea of incorporating context information when estimating probabilities was to find translations especially suitable for this context. While some inappropriate translations might have been filtered out, a very flat distribution suggests that the context is not very helpful. In this situation, we would like to introduce a penalty, such as the factor

λ1 = types(translation, context) / tokens(translation, context)

which is high if there are approximately as many types as tokens. It is also interesting to compare how many of the translation candidates in the original distribution are also seen in the context. If almost every translation of the original distribution was also seen in the context, then this context is likely to be trivial and we might as well use the entire data to estimate probabilities, i.e. reproduce the original probability distribution. The factor

λ2 = types(translation, context) / types(translation)

captures the percentage of possible translations of a phrase that come up in the modified distribution.
Assuming λ1 and λ2 to be equally important, they can be combined to

λ = 1/2 · [(1 − λ1) + (1 − λ2)]

and be used either as weighting factor in interpolation or as discount factor.

| phrase | context | tokens(t, c) | types(t, c) | types(t) | λ1 | λ2 | λ |
| general | c1 = vvff nn | 14 | 2 | 1075 | 0.1428 | 0.0019 | 0.9276 |
| general | c2 = in jj | 1710 | 240 | 1075 | 0.1404 | 0.2233 | 0.8182 |
| fine | c3 = to | 23 | 21 | 379 | 0.91 | 0.055 | 0.5157 |

Table 8: Effects of the type-token criteria on different types of distribution.

The effects of the type-token criteria on different types of distribution are illustrated by the following example. Table 8 shows three different distributions; in the case of general, the context used for conditioning consists of the pos tag on the left side and the pos tag of the phrase itself. The word general appears mostly with the meaning of generally/generic, e.g. in the very common construction in general. The second reading, in which general denotes a military rank, is comparatively rare. In context c1, in which general is tagged as a noun with a finite verb on its left side, only two different translation candidates occur, both with the meaning of military person. As this seems a good outcome for the given context, the weight λ should be relatively high in order to keep the majority of the probability mass and give only very little to unseen events, or, in the case of interpolation, to keep the influence of the original distribution to a minimum. In context c2 (general is an adjective with a preposition on its left side), the factor λ1 is almost the same as in context c1. However, the number of seen translation candidates is considerably larger and thus c2 does not score as high as c1. This also illustrates the positive effect of combining the two factors λ1 and λ2 instead of using just one of them.
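A direct transcription of the two factors and their combination, checked against the c1 row of table 8 (the per-type split of the 14 tokens is made up for illustration; only types, tokens and the original type count enter the formula):

```python
def type_token_weight(context_counts, n_types_original):
    """lambda_1 = types/tokens of the conditioned distribution,
    lambda_2 = share of the original translation types also seen in the
    context; combined as lambda = ((1 - l1) + (1 - l2)) / 2."""
    tokens = sum(context_counts.values())
    types = len(context_counts)
    l1 = types / tokens            # high for flat distributions
    l2 = types / n_types_original  # high for trivial contexts
    return ((1 - l1) + (1 - l2)) / 2

# context c1 for 'general': 2 types over 14 tokens, 1075 original types
lam_c1 = type_token_weight({"cand_1": 11, "cand_2": 3}, 1075)
```

lam_c1 reproduces the 0.9276 from table 8; a distribution that is both maximally flat and trivial (every type a singleton, all original types seen) gets λ = 0.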
As became evident in the case of context c1, contexts with a relatively small number of phrase-translation pairs (as long as they meet the threshold criterion) are not disadvantaged. However, flat distributions are difficult to handle: they are a strong indicator for trivial contexts with a considerable number of random singletons. Yet even good contexts can lead to flat distributions, as exemplified by context c3: the pos tag to on the left side of fine is a good context for finding only translations with the meaning of impose a penalty: of the 21 different translation candidates seen within this context, 17 can be considered appropriate. The main reason for the shape of this distribution is the fact that unaligned words are concatenated to adjacent phrases; as a result, two translation candidates have three entries each, and two other candidates are each listed twice. Although c3 is a good context, it does not look very promising and therefore gets a low weighting factor. With the new factor λ, which takes into account the type-token relation of the contextually conditioned distribution as well as a comparison of the number of translation candidates in the original and the new distribution, we can define two further methods for the computation of contextually conditioned translation probabilities:
• Use the type-token based factor as weight λ for interpolation (version ② in the evaluation)
• Use the type-token based factor for discounting (version ④ in the evaluation)
If, in the case of interpolation, both λ1 and λ2 are high (close to 1), thus indicating a bad context, the combined factor λ is low and therefore reproduces ϕ(trg|src) while hardly incorporating the contextually conditioned distribution. If λ is high for a promising context, translations seen in this context get high probabilities, while unseen events only get a small fraction of their original probability.
However, if the original probabilities are very large compared to the contextually conditioned ones, the influence of the original distribution can be quite strong, even if the weighting factor favors the conditioned distribution. Similarly, when not working with interpolation and only taking into account the contextually conditioned probabilities for seen translations, these are heavily reduced when λ happens to be low. If the leftover probability mass is distributed to the unseen translations proportionally to the original distribution, that distribution is, more or less, reproduced for the unseen translations. In fact, the seen translations can even be dispreferred if they are discounted heavily and the translation probabilities of some of the unseen translations are very high in the original distribution. While factors based on type-token relations might work reliably on probability distributions that look as they are expected to look, we have to keep in mind that even good distributions tend to be flat. As it is very difficult to say what distributions based on good contexts look like in practice, it is also difficult to formulate universally applicable criteria that work not only for somewhat extreme examples such as general, but also for more subtle ones where all translation options somehow work, but some of them better in certain contexts. When using the discount method, the results hardly outperform the system using the contextually conditioned distribution without restrictions. As for the count-based factor, interpolation yielded better results. The bleu scores and a more detailed evaluation can be found in section 5.5.

5.4 Back-off

Table 9 provides an example illustrating the need to impose criteria to judge the reliability of a context, but it also shows the (theoretical) gain of backing off to a smaller context: while rare contexts do not necessarily have to be wrong, they at least have to be treated with caution.
In the context studied in table 9, fine itself is a verb preceded by an adverb, a combination with a very low frequency of occurrence (count = 4). A preceding adverb is not the best context when the task is to find the appropriate meaning of fine, as a word like even can occur with adjectives, nouns and verbs alike. But since the word fine is tagged as a verb, valid translations should reflect the meaning of impose a penalty. However, none of the 4 translations seen in combination with this context is valid. Nevertheless, each one would be assigned the new probability of 0.25 when no threshold is set.

| full cont. | back off | translation | gloss | no threshold, no back off | threshold, back off |
| 1 | 1 | hervorragend | excellent | 0.1452 | 0.0028 |
| 1 | 1 | schönen | beautiful | 0.1452 | 0.0743 |
| 1 | 1 | sympathische | pleasant | 0.1452 | 0.0005 |
| 1 | 1 | all die schönen | all the pretty | 0.1452 | 0.0005 |
| 0 | 3 | geldstrafe | penalty | 0.0073 | 0.0878 |
| 0 | 2 | verhängen | impose | 0.0008 | 0.0585 |
| 0 | 1 | geldstrafe bezahlen | pay a penalty | 0.0004 | 0.0293 |
| 0 | 1 | geldbußen zu verhängen | to impose a p. | 0.0004 | 0.0293 |

Table 9: Translation candidates for the English phrase fine. The examined context is the pos tag rb (adverb) on the left side of the phrase fine, which itself is tagged as a verb. The two columns on the left show the counts with either the full context or backed off. The entries in the lower half benefit from backing off to the context fine = verb and have the targeted meaning of impose a penalty.

Although the value 0.25 is harshly decreased to 0.1452 by the count-based discounting factor, the probabilities of these bad translation choices would amount to nearly 60 percent of the overall probability mass. The faulty phrase-translation pairs in the first four lines of table 9 are not a result of bad alignment; rather, the word fine was tagged incorrectly.
Given that the precision of tagging is very high and that alignment is known to produce a considerable amount of errors, this is somewhat surprising: it reminds us of the fact that, especially in the case of ambiguous words, even a high-quality tagger can fail and thereby introduce an uncertainty about context reliability on another level than discussed before. In this case, the adjacency with adverbs might have led the tagger to choose the tag verb for fine instead of adjective. When imposing a threshold of e.g. count ≥ 10, the original probabilities replace the contextually influenced new probabilities. A problematic aspect of this method is that the threshold is chosen more or less arbitrarily. One could also argue that a count-based function penalizing small sets is the better choice, as those contexts are not completely lost. Discounting the probabilities with a count-based, penalizing factor is not sufficient in the example above; however, we cannot expect to find a solution that fits every single situation and therefore aim to find a way that works best on average. At this point, we wonder whether it would be helpful to split complex contexts into simpler ones instead of throwing them away completely. Backing off to a partial context if the threshold criterion is not met helps to better exploit the data. However, there is also the risk of 'diluting' contextually conditioned probability distributions by including probabilities of trivial contexts. When using back-off probabilities, translation candidates are basically divided into three groups: those occurring with the full context, those occurring with a part of the context, and the rest. Then, new translation probabilities for the first group are computed (or replaced with the old ones when the threshold criterion is not met) and discounted.
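The division into three groups can be sketched as follows; the shape of the count-based discount weight (n/(n + k)) and the exact threshold handling are assumptions, not fixed by the text:

```python
def count_weight(n, k=10.0):
    # assumed count-based discount factor, saturating for large n
    return n / (n + k)

def backoff_probs(full_counts, partial_counts, phi_o, threshold=10):
    """Back-off sketch: group 1 = translations seen with the full context,
    group 2 = seen with the partial context only, group 3 = all remaining
    translations of the phrase (from the original distribution phi_o)."""
    probs = {}
    mass = 1.0
    n_full = sum(full_counts.values())
    if n_full >= threshold:
        lam = count_weight(n_full)
        for t, c in full_counts.items():          # discounted group 1
            probs[t] = lam * c / n_full
        mass -= sum(probs.values())
    part = {t: c for t, c in partial_counts.items() if t not in probs}
    n_part = sum(part.values())
    if part and n_part >= threshold:
        lam = count_weight(n_part)
        for t, c in part.items():                 # discounted group 2
            probs[t] = mass * lam * c / n_part
        mass *= (1 - lam)
    rest = {t: p for t, p in phi_o.items() if t not in probs}
    z = sum(rest.values())
    for t, p in rest.items():                     # group 3, proportional
        probs[t] = mass * p / z                   # to phi_o
    return probs
```

By construction the three groups exhaust the probability mass, so the result again sums to 1.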
The remaining probability mass has to be divided between the translations in the second and the third group; again, probabilities for the translations of the second group are computed and discounted with a factor based on the counts of the translations occurring with the partial context, provided they meet the threshold criterion. The probability mass that then remains is distributed (proportionally to the original probabilities) to the translations in the third group. This procedure specifically refers to a 2-item context, and the contexts used here will not exceed this length: sparse data definitely is problematic, and long contexts that in most cases need to be backed off are not likely to be very useful. In contrast to the variant where neither backing off nor a threshold was used and the first four entries were assigned nearly 60 percent of the probability mass, they now only amount to 7.7 percent of the probability mass, with the backed-off translations representing 70.2 percent and the rest 22.1 percent. Disappointingly, backing off did not work in our experiments: its scores were even lower than the results of the baseline system.

5.5 Evaluation of the presented approaches

In order to capture the maximum effect of the λ-factors used for rating the usefulness of contexts and for discounting, experiments were not carried out on full systems containing all available parameters, but on a system using only the translation probability ϕ(trg|src), the language model on the target side and reordering statistics. Especially the reverse translation probability ϕ(src|trg), which remains unconditioned by contextual features, could 'interfere' with the new probability distribution if the probabilities in ϕc and ϕo are very different. This means that the lexical probabilities in both directions and the reverse translation probability ϕ(src|trg) are not used, i.e.
the phrase table only has values for the translation probability of a target-language phrase given a source phrase, and the phrase penalty, which is always set to e = 2.718. As translation systems are very complex, this is considered a necessary step for comparing the different methods of computing the probability distributions. During the tuning step, different weights are assigned to the features used for translation; by reducing the features to the one probability we are interested in and the indispensable language and reordering models, the comparison becomes more direct and less influenced by the other features and the different weights they would obtain through tuning. In table 10, results for experiments with the context of two pos-tags on the left side of a phrase are listed. This context was chosen because it reflects to a certain degree the kind of contexts used in later experiments. As opposed to words, pos-tags are not as prone to sparse-data related problems, but generalize contexts while still being relatively fine-grained. (For an overview of the tagset, see section 3.4.1.) With regard to both word-based contexts and more compressed contexts such as chunks, i.e. linguistically motivated phrases, pos-tags seem to be a reasonable intermediate solution. Table 10 shows the results for the different methods of estimating contextually conditioned translation probabilities presented above. For each method, two experiments were carried out, as we not only wanted to see the effect of substituting the original translation probabilities with the modified contextually conditioned ones, but also wanted to compare the relevance of ϕ(trg|srco) and ϕ(trg|srcc) as estimated during parameter tuning.

| system | bleu: ϕ(trg|srcc) only | bleu: ϕ(trg|srco) + ϕ(trg|srcc) | weight ϕ(trg|srco) | weight ϕ(trg|srcc) |
| no threshold | 12.51 | 12.40 | 0.031992 | 0.019339 |
| threshold | 12.88* | 12.68 | 0.006017 | 0.021211 |
| interpolation ① (count) | 12.90* | 12.94* | 0.041349 | 0.053002 |
| interpolation ② (distr.) | 12.68 | 12.96* | 0.011541 | 0.115852 |
| discount ③ (count) | 12.53 | 12.40 | 0.028883 | 0.079518 |
| discount ④ (distr.) | 12.54 | 12.71 | 0.018464 | 0.055335 |
| back-off | 12.39 | 12.18 | 0.017702 | 0.123757 |

Table 10: bleu values for the experiments with different smoothing and rating techniques. Experiments were carried out on a system using only uni-directional phrase-translation probabilities, a target-side language model and reordering statistics. The baseline system scored 12.45 bleu points. Also shown are the feature weights obtained by parameter tuning in the systems using both translation probability features. Scores which are significantly better than the baseline are in bold-face, and scores significantly better than the contextually conditioned distribution without threshold are marked with *.

A standard moses system with the same reduced feature settings was used as the baseline system, i.e. only the phrase-translation probability ϕ(trg|src), the language model and reordering statistics. The results in table 10 clearly show that using the contextually conditioned distribution without performing some sort of smoothing or setting up a threshold does not lead to improved translation quality. As the probabilities of phrase-translation pairs with a low number of occurrences tend to be overestimated, a threshold at count = 10 was introduced, resulting in a significantly better bleu score, at least when using only the contextually conditioned probabilities. Our choice to set the threshold at n = 10 might seem somewhat arbitrary; however, such thresholds are often set in the range between 5 and 10.
Since the phrase extraction routine concatenates unaligned words to adjacent phrases and thus often produces several entries for one seen occurrence, we opted for a comparatively high threshold. (An experiment with a threshold set at n = 5 showed almost no difference.) Substituting the original distribution with the contextually conditioned one (no-threshold system) results in a bleu score that is almost the same as the result of the baseline. The same applies to the combination of the original probabilities and the contextually conditioned ones. In this case, the feature weights obtained by parameter tuning are higher for the original probability distribution than for the new one, which was intended to be a more refined version of the old one. When imposing a threshold, the result for the one-feature version is significantly better than both the baseline system and the no-threshold system. The scores for the system using interpolation with a count-based weighting factor are in the same range as the scores of the system with the threshold, as well as the other system using interpolated probability estimates, though only for the two-feature variant. Discounting the probabilities as in ③ and ④ does not turn out to be very useful; the scores are hardly better than the baseline system, except for the two-feature system using both the original and new probability distributions, which still fails to be significantly better than either the baseline or the no-threshold system. While the differences between feature weights vary, it can be noted that in each case, the weight given to the contextually conditioned distribution is higher than that given to the original distribution, except for the no-threshold system. (Weights assigned to the phrase penalty responsible for segmenting the input phrase are not listed in this table, although they can be strikingly different from system to system.)
To a certain extent, the version using a threshold and the interpolated system ① are very alike. The main difference is that there is either a fixed threshold of count = 10 to decide whether the original or the contextually conditioned distribution is to be used, or a weight entirely based on counts according to which the old and the new distribution are incorporated into the new score ϕ(trg|srcc). The fact that the results of the version with threshold and of ① are almost identical suggests that not using unseen translations does not harm the system, and also that the choice of the threshold, be it static or variable, is not a crucial point either. Version ②, with a weight factor based on the type-token relation of the modified probability distribution and the comparison with the original one, gets a lower result (12.68) when using only the new probabilities than when using both distributions. In this case, the bleu value is in the same range as the results of ① and the first variant of the non-smoothed version. Although the contextually conditioned probability distribution in ② received a remarkably higher feature weight than the original one, using both distributions leads to a considerable gain in translation quality. The method of interpolation seems to work better than discounting, at least for the examined context. The reason for this might be that gradually switching back to the original distribution by giving it a high interpolation weight in non-trustworthy contexts is more reliable than keeping the contextually conditioned distributions, however heavily discounted when not trusted, as was done in ③ and ④. After all, the original probabilities are the best estimate in all those situations where context information does not turn out to be helpful, or where contextually conditioned probabilities are overestimated as a result of their low occurrence frequency.
However, the role of the original distribution is not quite clear when looking at the results in table 10. Especially in the first two experiments (no-threshold and threshold systems), including the original distribution as an additional feature turns out to be harmful: in this case, the distributions are the most different, as ϕ(trg|srcc) does not include information from ϕ(trg|srco), except for cases where the count criterion is not met. Thus, the distributions are more prone to be conflicting than the ones in ①-④, where there is always a certain influence of the old distribution and the two can be assumed to be more alike. As the probabilities in the phrase table are multiplied and only entries with a high overall score are likely to be chosen as translation candidates, translation probabilities in both the new and the original distribution need to be high in order to yield a high total value. To a certain degree, we expect ϕ(trg|srco) and ϕ(trg|srcc) to be different; after all, creating a new, refined probability distribution was the purpose of including context knowledge. On the other hand, the systems ② and ④ clearly benefit from combining both features. Their weighting/discounting factor is based on type-token relations of the new and original probability distributions: while the design of this factor is promising for specific probability distributions, it is difficult to capture its average effect on those distributions that do not look like the 'typical' good or bad context. While this factor might be able to identify exceptionally good contexts, it can also be assumed to fail in a considerable number of cases (e.g. flat distributions), and thus a more robust distribution such as the non-context distribution is needed as an additional feature. The experiment of backing off to a simpler context turned out to be quite disappointing.
In this example, the context of two pos-tags on the left side of the phrase was reduced to one pos-tag on the left side in cases where the full context was not seen or the count ≥ 10 criterion was not met. If there were also fewer than 10 occurrences with only one pos tag, the original probability was used, as there was not enough data to justify a contextually conditioned probability estimation. The underlying idea is to only use contexts with enough evidence (count = 10), but still include as much context information as possible by backing off. As we will see in the following chapter, one pos-tag on the left side yields results comparable to two pos-tags. However, switching back and forth between the two distributions within one context feature seems to be harmful.

Conclusion

The results, both bleu values and feature weights, suggest that including context features is useful, but that translation probabilities without context knowledge are also important, as illustrated in versions ①, ② and ④. Altogether, using both distributions might be harmful if they happen to be conflicting; this might also apply in the case of the back-off experiment, where three distributions are packed into one. Given the similarity of the threshold version and ①, setting a threshold and ignoring unseen translations seems reasonable for most experiments, especially when considering that a simple way of estimating translation probabilities matches the concept of a reduced experimental setting, allowing us to examine the influence of different contexts with the least amount of possibly conflicting features. In cases where it is indispensable to keep unseen translations, interpolation as done in ① might be a good choice.

6 Basic context templates

In this chapter, we will discuss the use of local context features and provide a detailed evaluation of the respective outputs.
Context templates are based on adjacent words or pos tags of phrases and will be used for conditioning either separately or as a combination of left-side and right-side contexts. As in the previous chapter, the experiments will be run on reduced systems in order to simplify the evaluation. As the version without smoothing and with a simple threshold of count = 10 turned out to work quite well, this approach was chosen for the experiments presented in the course of this chapter. In each case, we look at the results of a full system using the complete set of features where ϕ(trg|srco) has been replaced with ϕ(trg|srcc), a system with the single feature ϕ(trg|srcc), as well as a single-feature system restricted to source phrases of length one. While this last setting seems artificial, it enables us to directly compare the different translation choices in the baseline and the modified version, as there is no variability in segmenting the source sentence. To a certain degree, length-restricted translation units can also be regarded as an extreme case of the situation that comes up when a system is trained on a domain that differs significantly from the test data: as a result, mostly short phrases are selected for translation. Since the length-one system with original translation probabilities has no context information at all, we expect to see a considerable improvement when using context-informed translation probabilities; a significant improvement in this setting could even be regarded as a necessary criterion for improvement in a full setting. In the one-feature setting, the effect of contextual conditioning is likely to be weaker than in the length-one system, as the longer phrases contain internal context information.
Since the only (translational) feature ϕ(trg|srcc) is assumed to be a better estimate than the original translation probabilities, this system should outperform the baseline even though it is not as dependent on context information as the length-one system. In the full setting, we expect to see the least impact: since the other translational features are not modified, they might interfere with the new translation probabilities and thus block potentially good decisions. This might be the case if an entry with a formerly low translation probability is assigned a higher contextually conditioned probability but still has a comparatively low reverse translation probability and low lexical probabilities. We carried out a detailed analysis of all variants using the standard automatic evaluation metrics bleu and nist, as well as variants of these measures. As already mentioned in the evaluation chapter, there are more possibilities than the standard bleu metric to measure the quality of machine translation output. In order to detect an effect of our modifications on the syntactic level, bleu was computed on pos tags. As this does not take into account lexical aspects, bleu was also computed on lemmatized words, since this is less strict than bleu on full forms. As the use of context information is to a certain degree assumed to enhance literal translations as well as to produce a more well-formed output, we are interested in finding out whether there is a gain in both scores. [Popović and Ney, 2009] report a good correlation of human annotation with pos-based scores, especially the pos-bleu score. While they do not mention bleu values based on lemmata, it does not seem too far-fetched to use such a version of standard bleu as a measure to estimate the degree of adequacy for a morphologically rich language like German. Reducing full forms to stems is also part of the meteor metric: when evaluating English mt output, meteor works better than bleu.
Even though bleu and meteor use different approaches, this observation additionally supports the idea of using lemma-based bleu. Since bleu-based metrics do not take into account the relevance of individual words, we also compared the percentage of correctly translated content words, as they are the most important for expressing the meaning of a sentence. In addition to automatic evaluation methods, we carried out a small-scale manual evaluation in which three participants annotated which one of two presented sentences (baseline vs. contrastive) they preferred. In the last section of this chapter, we will discuss selected examples to illustrate the effects of contextual conditioning on the lexical and syntactic level.

6.1 Word-based contexts

Table 11 shows the respective results for conditioning on adjacent words; results that are significantly better than the baseline system are in bold-face. As expected, the reduced settings (especially the length-one system) are more influenced by the use of context features. In the full system, there is no real improvement in bleu points compared to the baseline. When switching to the simpler system with only ϕ(trg|srcc), the effect of the modification becomes more noticeable, especially with the 1-word-right context. An additional simplification, restricting the source phrase length to a minimum, leads to an even larger difference in bleu points to the baseline in the case of the one-word-left context. The system with the restricted phrase length is considerably worse than the one where phrases of every length up to a maximum length are available. This is mainly due to the fact that single-word phrases often cannot be translated very well. For example, a complex noun on the German target side would in most cases be expressed as a sequence of several words on the English source side.
Imposing a restriction on the source-side phrase length does not allow for such many-to-one or many-to-many translations, but leads to non-intuitive translations. Additionally, in a system translating only single-word phrases, there is no context information available at all, as opposed to a non-limited system where longer phrases inherently carry some context information. The assumption that this context information can be captured by considering a context feature such as an adjacent word on either the right or the left side of the phrase when computing the translation probabilities is illustrated by the results of the length-one systems. As expected, the difference to the baseline is most visible in this system, whereas there is only a very slight (non-significant) improvement for the 1-word-right context in the normal system.

                          full setting      ϕ(trg|src_c)      ϕ(trg|src_c), src-phrase-length=1
                          bleu     nist     bleu     nist     bleu     nist
  baseline                13.84    5.14     12.45    4.80     10.24    4.31
  1 word left             13.84    5.17     12.50    4.81     10.83    4.48
  1 word right            13.98    5.17     12.57    4.89     10.45    4.44

Table 11: Results for translations using either one word on the left or on the right side of the phrase. Numbers in bold print are significantly better than the baseline.

The effect of context information on different levels is demonstrated by the following example of single-word translations: the English sentence in (23) contains the expression far from, which is a form of negation: although the sentence is not explicitly negated, it is clear to a human reader that the outcome is not clear. While the expression bei weitem nicht (at far not) in the reference translation (26) is comparatively similar in structure to far from, an explicit negation in the form of nicht (not) is indispensable.
In the baseline translation (24), this special constellation and appropriate translation candidates are not known to the system; it thus chooses the more or less literal translation far from → viel von (many of), which does not contain any form of negation.

(23) this time the party is divided and the outcome is far from clear .
(24) diese zeit der partei ist , und das ergebnis ist viel von klar .
(25) dieses mal die partei ist gespalten und die ergebnis ist viel nicht klar .
(26) dieses mal ist die partei gespalten und das ergebnis ist bei weitem nicht klar .

In (25), one word on the left side of the phrase is known: now, a majority of the translations possible without context information are not seen, the translation probability of from→nicht gets considerably larger, and thus the negation is transferred to the translation. Additionally, the participle divided in the input sentence is translated correctly as gespalten in (25), while it is translated as a comma in the baseline. While it is generally challenging to translate verbs, this is not a problem of the same category as word combinations such as far from or get used, which can be solved by just looking at an adjacent word, but a problem mainly caused by the differences between German and English constituent order and the resulting alignment issues. However, the reduced set of translation candidates seems, to a certain degree, to boost literal translations for words with no ambiguous meanings. In the case of divided, some really bad translation candidates of the original distribution were not seen in the contextually conditioned one; for example the obviously wrong translation comma, which had nevertheless been chosen in the baseline system. Translations like divided→, are mainly a product of alignment errors. If such errors are accumulated over all occurrences of a phrase, they can result in relatively high translation probabilities.
While candidates with a considerably smaller translation probability than the ones with the highest probabilities are not likely to be chosen, there might arise situations in which low translation probabilities can be cancelled out by, for example, high language model scores, which seems to have happened in the baseline version. The context word is of the phrase divided suggests some sort of verbal or adjective-like translation, and indeed increases the translation probability of divided→gespalten from 13.9 % up to 25 %, while the phrase-translation pair divided→, (as well as comparable nonsense pairs) did not occur at all, since all alignments seen in this phrase-context combination are valid. The improved translations for both far from and divided illustrate that including context features not only works for really ambiguous words, but also in cases like divided, where some of the nonsense translations are filtered out due to a good context word, and the remaining translations are more likely to be appropriate in both meaning and syntactic form. While the translation in (25) is far from perfect, it is still better than the baseline translation (24): the meaning of far from is captured better with context, the verb divided is translated, and the choice of mal instead of zeit is also better. The decision to pick dieses mal instead of diese zeit might be strongly influenced by the language model, as the phrase dieses mal is a fixed formulation. However, limiting the length of segmentation units cannot lead to optimal results. When also allowing for phrases of length 2 in the baseline system, the translation (27) obtained for (23) is quite good.

(27) [dieses mal] [die] [partei] [ist] [gespalten , und] [das ergebnis] [ist] [alles andere als] [klar] [.]

Here, the phrase far from can be translated as a unit into alles andere als (everything but), which is as good as bei weitem nicht in the reference sentence (26).
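The filtering effect described for divided can be sketched in a few lines of code. The counts below are invented for illustration (they are not the thesis data); restricting the extracted phrase pairs to those observed with a given left-context word and renormalizing yields the contextually conditioned probability ϕ(trg|src_c):

```python
# Toy phrase-pair observations: (left_context, source_phrase, target_phrase, count).
# Counts are invented; "," stands for the alignment-error pair discussed above.
observations = [
    ("is",  "divided", "gespalten", 5),
    ("was", "divided", "gespalten", 2),
    ("the", "divided", "geteilten", 3),
    ("the", "divided", ",",         4),  # frequent alignment noise, but never after "is"
]

def phi(source, target, context=None):
    """phi(trg|src) from co-occurrence counts, optionally restricted to
    occurrences of the source phrase with the given left-context word."""
    pool = [(c, s, t, n) for (c, s, t, n) in observations
            if s == source and (context is None or c == context)]
    total = sum(n for (_, _, _, n) in pool)
    hit = sum(n for (_, _, t, n) in pool if t == target)
    return hit / total if total else 0.0

print(phi("divided", "gespalten"))        # 0.5  (global distribution)
print(phi("divided", "gespalten", "is"))  # 1.0  (conditioned on left context "is")
print(phi("divided", ",", "is"))          # 0.0  (noisy pair filtered out)
```

Conditioning shrinks the pool of phrase pairs before normalization, so translation candidates never seen in the given context receive zero probability.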
This system actually gets a bleu value of 12.26, which is almost as good as the one-feature baseline system with no restrictions on the source phrase length, which has a bleu value of 12.45.

Pos-based and lemma-based Bleu

For both pos-based bleu and lemma-based bleu (cf. table 12), the differences to the baseline system are larger than for standard bleu, especially for the one-word-left context in the full system and the version with restricted phrase length. Surprisingly, the values for the systems with one feature are not as high as the other ones, but only in the same range as the baseline. Generally, it is also somewhat surprising that there is no improvement of standard bleu values for the full system, but a gain in pos-bleu. In the case of the one-word-left context, we find that there is a significant improvement in the full system's pos-bleu value, while the standard bleu scores are the same. In the system using only single-word source phrases, improvements of bleu and nist values are significant both on the pos and the lemma level. Similarly, results of standard bleu and nist for this setting are also significant except in one case. As a good correlation with human judgement has been shown at least for pos-bleu, it seems justified to believe that there has to be some improvement, at least in the one-word-left context, although it is not noticeable in the standard bleu score. A more detailed analysis of both individual cases and different error types is necessary and will be provided later in this chapter.
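The pos- and lemma-based variants differ from standard bleu only in which token sequences are compared. The following minimal sentence-level sketch makes this explicit; the add-one smoothing is our own assumption, and this is not the exact implementation behind the scores reported here:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Sentence-level bleu: modified (clipped) n-gram precision up to
    max_n, geometric mean, brevity penalty; add-one smoothing keeps
    short sentences from zeroing out the score."""
    log_prec = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        matches = sum(min(c, ref[g]) for g, c in cand.items())  # clip by reference counts
        total = max(sum(cand.values()), 1)
        log_prec += math.log((matches + 1) / (total + 1)) / max_n
    bp = min(1.0, math.exp(1 - len(reference) / max(len(candidate), 1)))
    return bp * math.exp(log_prec)

# The very same function yields standard, lemma-based or pos-based bleu,
# depending on whether surface forms, lemmas or pos tags are passed in:
hyp = "ART NN VVFIN ART NN".split()   # hypothesis as an STTS tag sequence
ref = "ART NN VVFIN ART NN".split()
print(bleu(hyp, ref))  # 1.0 for identical sequences
```

Mapping both hypothesis and reference to pos tags (or lemmas) before scoring is all that distinguishes the metric variants compared in this chapter.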
Pos-based scores:
                          full setting        ϕ(trg|src_c)        ϕ(trg|src_c), src-phrase-length=1
                          pos-bleu  pos-nist  pos-bleu  pos-nist  pos-bleu  pos-nist
  baseline                36.72     5.22      34.76     4.98      30.65     4.56
  1 word left             37.14     5.24      34.70     5.00      31.50     4.67
  1 word right            36.95     5.29      34.88     5.00      31.19     4.63

Lemma-based scores:
                          lem-bleu  lem-nist  lem-bleu  lem-nist  lem-bleu  lem-nist
  baseline                17.34     5.80      15.95     5.45      13.27     4.94
  1 word left             17.51     5.84      16.04     5.46      14.05     5.11
  1 word right            17.53     5.85      15.95     5.52      13.70     5.09

Table 12: Results for translations using either one word on the left or on the right side of the phrase. The scores in the first table are based on pos-tags, the scores in the second table are computed on word lemmas. Numbers in bold print are significantly better than the baseline.

6.2 Pos-based contexts

While conditioning translation probabilities on adjacent words helps to capture context, as can be seen especially in the reduced settings, this is not optimal: a large number of words can be expected to occur only once or twice in combination with a phrase and therefore unnecessarily increase the number of possible contexts of a phrase. By using pos tags instead of words, we can better exploit the data, as all words of one category are merged into one context. While the number of possible contexts is heavily reduced, the tag set is still relatively fine-grained. However, lexical information is lost. As has already been illustrated in some of the examples presented before (e.g. section 3.3.2), pos-tags may be useful context features, as they may help to restrict translation possibilities not only on a mostly lexical level, but also on a syntactic level. By including linguistically motivated features, we especially hope for better fluency in the translated sentences, i.e. a gain in pos-bleu.
In order to illustrate this expectation, we take another look at the example from the previous section, where the context word is on the left side of divided resulted in an increase of verbal and adjective-like translations, while faulty phrase-translation pairs were discarded. However, all realizations of auxiliary verbs are listed separately when looking at word forms; thus, probability distributions of basically equal contexts are split up, which means that only a fraction of the available data is used for the estimation of translation probabilities. By replacing word forms with their respective pos tags and thus making contexts more general, this problem can be overcome. We carried out a detailed study of pos-based contexts on the left side of a phrase by examining contexts of different lengths (one and two pos tags) and modelling the transition between left-side context and the phrase itself. Furthermore, we conducted experiments conditioned on one pos tag on the right side. Table 13 shows the standard bleu and nist scores obtained for these experiments.

                            full setting      ϕ(trg|src_c)      ϕ(trg|src_c), src-phrase-length=1
                            bleu     nist     bleu     nist     bleu     nist
  baseline                  13.84    5.14     12.45    4.80     10.24    4.31
  1 pos left                14.01    5.15     12.64    4.84     10.80    4.48
  2 pos left                13.93    5.18     12.88    4.84     10.73    4.46
  leftmost pos              13.64    5.10     12.11    4.78     10.31    4.37
  1 pos left, leftmost pos  14.05    5.18     12.95    4.92     10.74    4.41
  1 pos right               14.05    5.19     12.32    4.78     10.53    4.32

Table 13: Results for translations using contexts based on pos-tags. The context with one tag on the left side of the phrase and the leftmost pos tag is treated as a single feature. Numbers in bold print are significantly better than the baseline.

Similarly to the word-based contexts, the difference between the modified systems and the baseline is larger for the reduced settings. The systems using one or two pos tags on the left side are better than the baseline, although there is not always a significant gain.
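The generalization step can be sketched as follows; the per-context counts and the tag lookup (standing in for a pos tagger) are hypothetical:

```python
from collections import defaultdict

# Hypothetical per-context counts for the source phrase "divided": kept
# separate, each auxiliary-verb context contributes only a few observations.
word_context_counts = {
    ("is",  "gespalten"): 3, ("was", "gespalten"): 2, ("are", "gespalten"): 1,
    ("is",  "geteilt"):   1, ("the", "geteilten"):  4,
}
# Toy tag lookup (VAFIN = finite auxiliary, ART = article in the STTS tag set).
pos_of = {"is": "VAFIN", "was": "VAFIN", "are": "VAFIN", "the": "ART"}

def generalize(counts):
    """Merge word contexts into pos contexts, so that all auxiliary-verb
    contexts feed a single, better-estimated distribution."""
    merged = defaultdict(int)
    for (word, target), n in counts.items():
        merged[(pos_of[word], target)] += n
    return dict(merged)

print(generalize(word_context_counts))
# {('VAFIN', 'gespalten'): 6, ('VAFIN', 'geteilt'): 1, ('ART', 'geteilten'): 4}
```

After merging, the three auxiliary-verb contexts contribute seven observations to one VAFIN context instead of three sparse distributions.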
In contrast, the results of the context leftmost-pos are disappointing. While there is a slight improvement in the length-one system, the scores of the other two systems are even worse than the baseline. As knowledge about the phrase itself may also help to find appropriate phrase-translation pairs, the idea of conditioning probability estimation on the phrase's pos tags seems plausible. As the phrase length can go up to 7 words, we did not want to use the entire tag sequence of a phrase, but only the tag of its first word, which enables the system to disambiguate at least that word. Obviously, this does not take into account the whole phrase. Accordingly, the only improvement of the leftmost-pos context is in the length-one system, where the context feature covers the entire phrase. In contrast to conditioning on adjacent words (which vary in most cases), the pos tag of the phrase is often the same, which means that the conditioned probabilities often correspond more or less to the original distribution. As a consequence, the improvement is not nearly as high as for the other two systems and also fails to be significant. However, there is a large improvement when conditioning on the leftmost pos and the adjacent pos on the left side of the phrase: in this case, we do not look at an arbitrarily chosen tag, but at the link between the focused phrase and its left neighbor. The last context in table 13, one pos tag on the right side, is somewhat different from the left-side contexts, as it has a comparatively high bleu score in the full setting, but comparably low scores in the reduced systems.
Pos-based scores:
                            full setting        ϕ(trg|src_c)        ϕ(trg|src_c), src-phrase-length=1
                            pos-bleu  pos-nist  pos-bleu  pos-nist  pos-bleu  pos-nist
  baseline                  36.72     5.22      34.76     4.98      30.65     4.56
  1 pos left                36.66     5.25      34.84     5.03      31.68     4.67
  2 pos left                36.91     5.28      34.89     4.99      31.72     4.64
  leftmost pos              36.49     5.23      34.36     4.98      31.04     4.61
  1 pos left, leftmost pos  37.20     5.25      35.23     5.08      31.45     4.65
  1 pos right               36.90     5.27      34.52     4.94      31.22     4.56

Lemma-based scores:
                            lem-bleu  lem-nist  lem-bleu  lem-nist  lem-bleu  lem-nist
  baseline                  17.34     5.80      15.95     5.45      13.27     4.94
  1 pos left                17.57     5.82      16.18     5.50      13.98     5.19
  2 pos left                17.54     5.86      16.37     5.47      13.86     5.07
  leftmost pos              17.12     5.78      15.65     5.43      13.45     5.00
  1 pos left, leftmost pos  17.80     5.85      16.61     5.59      13.91     5.19
  1 pos right               17.60     5.85      15.84     5.41      13.58     4.96

Table 14: Results for translations using contexts based on pos-tags. Numbers in bold print are significantly better than the baseline.

Since improvement in the length-one system can be considered a necessary condition for improvement in the full system, this outcome suggests that the scores of this context might not be fully reliable. As it is not guaranteed that parameter tuning finds the globally best setting, there might exist a better set of parameters for the length-one system that would produce a better result. On the other hand, the seemingly good result of the full system might be due to merely random effects and therefore not represent the 'true' improvement of this context. For a more focused evaluation on the lexical and syntactic level, bleu and nist scores were computed on pos tags and lemmatized words: the results are listed in table 14. Similarly to the results before, the effect of including context information is much more noticeable in the reduced settings.
For the 2-pos-left and 1-pos-left contexts, we find considerably higher pos-bleu scores in the length-one systems than when conditioning on adjacent words. This outcome supports our initial hypothesis that linguistically motivated context features have a positive effect on the grammatical level, i.e. enhance fluency. Although the pos-based contexts lack lexical aspects, not only improvements of pos-based scores can be observed, but also gains in the lemma-based scores, corresponding in most cases with gains in standard bleu. The results for the 1-pos-left context are disappointing in the full system: while not being as bad as those of the leftmost-pos context, they fail to be significant for the two systems other than the length-restricted one, although the standard bleu score of this context is comparatively high. This leaves us a bit surprised, especially when considering that the 1-pos-left context's scores in the length-one system tend to be relatively high. The scores of the leftmost-pos context correspond to the results of standard bleu, as they slightly increase for the length-one system, yet to a smaller extent than for the other two contexts. It is also surprising that the leftmost-pos-1-pos-left context, which has a very high pos-bleu score in the full setting, does not have equally high results in the length-one system. When comparing left-side and right-side contexts (e.g. one word/pos tag), it is remarkable that the contexts on the left side have especially good scores in the length-one systems, whereas the right-side contexts have lower scores in the length-one systems but equally high (or even better) results in the full systems.

6.3 Combining features

So far, only one context at a time on either the right or the left side of a phrase has been considered. However, combining two contexts might be useful, as only phrase-translation pairs with high translation probabilities in both contexts are likely to be taken for translation.
In this approach, we did not estimate translation probabilities conditioned on both contexts at the same time, but used the respective probabilities conditioned on one context as separate feature functions in the log-linear model. The respective relevance of the features is then to be determined by the parameter tuning. Translation probabilities were computed using interpolation (version ①, cf. table 10), and entries where both contextually conditioned translation probabilities were smaller than 0.0002 were not included in the phrase table. The basic idea of combining features is to perform an 'intersection' of translation candidates based on two independent criteria, with the anticipated effect of not using translation candidates that are appropriate for only one context. The context features are required to be independent, since this is a condition that feature functions of a log-linear model have to fulfill. But regardless of the technical requirements of the log-linear model, it would not be very useful to combine, say, the 1-pos-left context and the 1-word-left context, as one is a more detailed version of the other. In the following experiments, combined contexts consisted of features on the left and the right side of the phrase, with the exception of the leftmost-pos context, which was combined with the 1-pos-left context. The most obvious combinations are those of one adjacent word or pos tag on each side of the phrase; by combining pos tags on one side with an adjacent word on the other side, the system can benefit from both lexically and syntactically motivated contexts: this is an attempt to overcome the loss of lexical features that comes with the introduction of pos tags, while still profiting from generalization.
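The two mechanisms just described can be sketched together: the threshold filter on the phrase table, and the two conditioned probabilities entering the model as separate feature functions. The probabilities and weights below are invented; in the real system the weights come from parameter tuning:

```python
import math

# Hypothetical entries for one source phrase: each target carries two
# independently conditioned probabilities (left-context and right-context).
entries = [
    ("gespalten", 0.40,   0.30),    # plausible in both contexts
    ("geteilten", 0.20,   0.0001),  # plausible in the left context only
    (",",         0.0001, 0.0001),  # alignment noise
]

THRESHOLD = 0.0002

def filter_entries(entries):
    """Drop entries whose probability is below the threshold in *both* contexts."""
    return [e for e in entries if not (e[1] < THRESHOLD and e[2] < THRESHOLD)]

def score(phi_left, phi_right, w_left=1.0, w_right=1.0):
    """Log-linear combination of the two feature functions (weights from tuning)."""
    return w_left * math.log(phi_left) + w_right * math.log(phi_right)

kept = filter_entries(entries)
best = max(kept, key=lambda e: score(e[1], e[2]))
print(best[0])  # gespalten -- only a candidate good in *both* contexts wins
```

The log-linear product acts as the intended 'intersection': a candidate with a near-zero probability in either context receives a heavy log-penalty even if the other feature is high.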
The same pattern as seen in the previous evaluations emerges for the results of combined context features: as can be seen in table 15, scores are significantly better in the length-one systems, but mostly in the same range as the baseline for the full systems. So far, the best result of a full system has been achieved by the system where a pos tag on the left side of the phrase and the (subsequent) leftmost pos tag of the phrase have been considered for translation probability estimation. This corresponds to the result of the same context being integrated as a single feature instead of two feature functions: taking into account these two adjacent tags models, to a certain degree, the transition between them by not only conditioning on the surroundings of the phrase, but also including (linguistic) information about the very next word. Generally, the scores of the length-one systems are slightly better than the respective scores of the single-context systems. The same applies to the results when only using the modified translation probabilities; here, more results are significantly better than the baseline system. Interestingly, improvements seem to be concentrated more on the lexical level (lemma-based and standard bleu) than on the syntactic level, even though the pos-bleu scores of the length-one systems are higher than in the single-context experiments. However, improvements of the full systems still lack significance in most cases. A peculiarity of the 1-pos-left and leftmost-pos combination is that its improvement in the length-one system is comparatively small given that its full system achieved the best score. In contrast to the other combined systems, this context does also not work very well in the one-feature system, where, if at all, only one score (bleu or nist) is significant. It seems somewhat contradictory that in this case (i.e.
significant improvement in the full system), the gain in bleu/nist scores is not noticeable to the same extent in the reduced settings. In comparison, the scores of the one-feature system with the single context 1-pos-left-leftmost-pos (cf. tables 13 and 14) are better. We can make a similar observation about the 1-pos-left-1-pos-right context: while it has good values in the full settings, it does not perform equally well in the length-one system, except for pos-bleu, suggesting that a purely syntactic context setting tends to improve the grammatical quality. Equally, the purely lexical setting with one word on either side of the phrase results in a high lemma-based bleu (although not in the full setting) with a comparatively low pos-bleu. The general tendency of pos-based features having a positive effect on pos-bleu was already observed for (left-side) single features in table 14. Summarizing the results of the three bleu-based evaluation metrics on different feature designs, we can say that almost every context template resulted in an improvement in the length-one system. Scores of the full systems tend to be improved as well, but this improvement is mostly not statistically significant. Generally, combined contexts worked better than single-feature contexts, and there is also evidence that pos-based contexts have a positive influence on the syntactic structure of the resulting translation.
Furthermore, some of the (full) systems have a significantly better nist score, but no significantly improved bleu score: since nist weighs n-grams according to their assumed relevance, this might indicate that there are improvements on a lexical level that (lemma-based) bleu scores fail to detect. It seems indispensable to provide an evaluation of adequacy focusing on lexical aspects only.

Standard scores:
                             full setting      ϕ(trg|src_c)      ϕ(trg|src_c), src-phrase-length=1
                             bleu     nist     bleu     nist     bleu     nist
  baseline                   13.84    5.14     12.45    4.80     10.24    4.31
  1 word left, 1 word right  14.04    5.13     12.83    4.91     11.07    4.51
  1 pos left, 1 pos right    14.05    5.18     12.88    4.84     10.80    4.45
  1 pos left, 1 word right   14.02    5.16     12.97    4.87     11.10    4.50
  2 pos left, 1 word right   13.86    5.11     12.95    4.87     11.06    4.48
  1 pos left, leftmost pos   14.12    5.19     12.79    4.80     10.79    4.43

Pos-based scores:
                             pos-bleu pos-nist pos-bleu pos-nist pos-bleu pos-nist
  baseline                   36.72    5.22     34.76    4.98     30.65    4.56
  1 word left, 1 word right  36.75    5.13     34.98    5.01     31.56    4.67
  1 pos left, 1 pos right    37.16    5.28     34.92    5.00     31.84    4.67
  1 pos left, 1 word right   36.56    5.23     35.22    5.03     31.93    4.70
  2 pos left, 1 word right   36.75    5.21     35.17    5.01     31.68    4.66
  1 pos left, leftmost pos   37.03    5.25     34.72    4.96     31.03    31.48

Lemma-based scores:
                             lem-bleu lem-nist lem-bleu lem-nist lem-bleu lem-nist
  baseline                   17.34    5.80     15.95    5.45     13.27    4.94
  1 word left, 1 word right  17.53    5.78     16.52    5.58     14.30    5.14
  1 pos left, 1 pos right    17.65    5.85     16.42    5.49     13.98    5.11
  1 pos left, 1 word right   17.62    5.83     16.66    5.53     14.25    5.14
  2 pos left, 1 word right   17.56    5.78     16.49    5.52     14.26    5.14
  1 pos left, leftmost pos   17.78    5.85     16.26    5.44     13.80    5.06

Table 15: Results for combining two different contexts as separate features. Numbers in bold print are significantly better than the baseline.
While this method of combining features is very simple, being limited to two contexts on either side of the phrase, positive effects are noticeable. Yet, one could imagine using more sophisticated methods to find optimal combinations of two or more context features. [Gimpel and Smith, 2008] use a forward variable selection algorithm: starting with no context templates, new contexts are added iteratively based on the gain in bleu on unseen data after parameter tuning. In their experiments, the combination of the syntactic feature 2-pos-left and the lexical feature 1-word-right worked best for English→German. In our experiments, this feature combination had relatively good results in the reduced settings, but the improved bleu scores failed to be significant in the full system.

6.4 Evaluation

In order to get a better insight into the characteristics of the contextually conditioned systems, we will now discuss evaluations going beyond the standard bleu and nist scores. While the criterion of fluency has already been judged by the pos-based metrics, we now attempt to provide a better analysis in terms of adequacy. Although lemma-based bleu allows for more variation than standard bleu, both measures are still very similar, which is also illustrated by the fact that systems with good bleu scores tend to have good lemma-based bleu scores. In order to introduce a purely lexical evaluation method, we go a step further and concentrate on evaluating only content words, assuming them to be crucial for the reproduction of the source sentence. By restricting the evaluation to content words, syntactic aspects (represented by word sequences) are completely excluded. In addition to automatic evaluation, we carried out a manual evaluation focusing on adequacy and used the annotated material for an analysis of error types. Assumptions and outcomes resulting from the different approaches of evaluation are then illustrated with example sentences.
6.4.1 Content word oriented evaluation

Standard bleu makes it possible to measure both fluency and adequacy by working with n-grams of different lengths. Computing bleu on lemmatized words instead of full forms is an attempt to estimate more precisely the degree of adequacy for a morphologically relatively complex language like German. However, this does not take into account that some words are more important than others. If a word such as a determiner is omitted, the sentence is still comprehensible as long as the noun has been correctly translated. For the following experiments, we define nouns, verbs and adjectives to be content words. To measure the degree of adequacy, we compute recall, precision and f-measure only on correctly translated content words. This classification into content words and non-content words is only a simple approximation; it is very easy to come up with counter-examples such as negation: while being essential for the meaning of a sentence, negation particles are not classified as content words, as negated constructions might often not be translated literally. There are far too many ways of expressing negation for it to be detected automatically, such as variations between verbal negation nicht and nominal negation kein, or inherent negation in constructions like not want→ablehnen (refuse). This problem of non-isomorphic translation also applies to prepositions, auxiliary verbs, etc. While limiting the evaluation to 'basic' content words is an extreme simplification, it could still prove useful, especially considering the assumption that literal translations are enhanced if faulty nonsense translations are discarded by good contexts. Recall, precision and f-measure were first computed on sentence level and then averaged over the entire test set.
To compute recall and precision, the set of words tagged as noun, verb or adjective in the automatically produced translation was compared with the words tagged as noun, verb or adjective in the reference translation. The comparison of words was conducted on the lemma level. To be counted as a correct translation, it was sufficient for a word to be part of the group of content words and have a matching equivalent in the reference translation; we did not insist on verbs being translated as verbs, nouns as nouns and so on. Furthermore, we checked the number of occurrences of each word in order to guarantee that words occurring multiple times are only counted if there are corresponding words in the reference sentence. This means that a word occurring twice in a translated sentence can only be counted as correct twice if there are two corresponding words in the reference translation. This method is similar to the modified n-gram precision used for bleu. Additionally, we computed the average number of content words per sentence: in this case, every word tagged as noun, verb or adjective was counted without checking against the reference translation, and the counts were then averaged over the set of sentences. While sentence-level scores are somewhat unusual in machine translation evaluation, this approach helps to find sentences with especially high and low scores and therefore might be useful for detecting characteristic features of the system by identifying sentences where the modifications had an especially good or bad influence compared to the baseline system. Table 16 shows content-word-based statistics for the systems that were presented so far. Similarly to the outcome of the previous evaluations, results are best for the length-one system. Not only does every system score higher than the baseline system, but the differences between the baseline and the modified systems are also larger.
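The sentence-level computation described above can be sketched as follows; the tag subset and the example lemmas are our own illustration, following the thesis in counting nouns, verbs and adjectives as content words:

```python
from collections import Counter

CONTENT_TAGS = {"NN", "VVFIN", "VVPP", "ADJA", "ADJD"}  # illustrative STTS subset

def content_words(tagged):
    """Multiset of content-word lemmas from (lemma, tag) pairs."""
    return Counter(lemma for lemma, tag in tagged if tag in CONTENT_TAGS)

def prf(hyp_tagged, ref_tagged):
    """Sentence-level precision/recall/f-measure over content-word lemmas;
    multiple occurrences are clipped, as in modified n-gram precision."""
    hyp, ref = content_words(hyp_tagged), content_words(ref_tagged)
    correct = sum(min(n, ref[w]) for w, n in hyp.items())
    p = correct / max(sum(hyp.values()), 1)
    r = correct / max(sum(ref.values()), 1)
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

hyp = [("ergebnis", "NN"), ("sein", "VAFIN"), ("klar", "ADJD")]
ref = [("partei", "NN"), ("gespalten", "VVPP"), ("ergebnis", "NN"), ("klar", "ADJD")]
print(prf(hyp, ref))  # (1.0, 0.5, 0.666...): both hyp content words match, half of ref covered
```

Note how the auxiliary sein is ignored entirely: only the content-word multisets enter the comparison, so word order and function words play no role.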
In every system setting (complete set of features, one feature, restricted to length one), the average gain in recall is higher than the average gain in precision. This correlates with the fact that the average number of content words per sentence is often slightly higher in the modified systems (cf. table 17). It is difficult to decide in general whether a higher number of translated content words is good or bad, given that translated content words do not necessarily have to be correct. The fact that the average number of content words in the baseline translations is lower than in the reference translations suggests that an increase of translated content words is needed. Especially in the cases of the 1-word-right and 1-pos-left context (full setting), the average number of translated content words is higher than in the baseline system, resulting in slightly decreased precision scores.

                            full setting            ϕ(trg|src_c)            ϕ(trg|src_c), src-phrase-length=1
                            p      r      f         p      r      f         p      r      f
  baseline                  0.3973 0.3889 0.3867    0.3841 0.3383 0.3519    0.3538 0.3195 0.3293
  1 word left               0.3998 0.3916 0.3895    0.3828 0.3398 0.3522    0.3669 0.3457 0.3495
  1 word right              0.3942 0.3934 0.3874    0.3869 0.3347 0.3515    0.3684 0.3398 0.3473
  1 pos left                0.3969 0.3959 0.3901    0.3825 0.3519 0.3593    0.3699 0.3410 0.3482
  2 pos left                0.3980 0.3955 0.3905    0.3866 0.3351 0.3512    0.3692 0.3365 0.3451
  1 pos left, leftmost pos  0.4026 0.3942 0.3920    0.3941 0.3523 0.3648    0.3705 0.3399 0.3480
  1 pos right               0.3944 0.3930 0.3872    0.3795 0.3392 0.3514    0.3606 0.3245 0.3354
  1 word left, 1 word right 0.3954 0.3825 0.3821    0.3942 0.3519 0.3648    0.3672 0.3493 0.3517
  1 pos left, 1 pos right   0.3945 0.3932 0.3876    0.3837 0.3456 0.3562    0.3648 0.3456 0.3485
  1 pos left, 1 word right  0.3974 0.3879 0.3860    0.3870 0.3492 0.3599    0.3675 0.3451 0.3497
  2 pos left, 1 word right  0.3982 0.3783 0.3810    0.3882 0.3516 0.3618    0.3664 0.3508 0.3522
  1 pos left, leftmost pos (two features) 0.3987 0.3887 0.3872    0.3821 0.3405 0.3529    0.3635 0.3335 0.3416

Table 16: Precision, recall and f-measure for correctly translated content words. Values which are better than the baseline have a gray background. Additionally, the best respective scores (precision, recall and f-measure respectively) in each column are printed in bold.

                            full setting   ϕ(trg|src_c)   ϕ(trg|src_c), src-phrase-length=1
  baseline                  8.5507         7.6823         7.8509
  1 word left               8.5468         7.7378         8.2758
  1 word right              8.7212         7.4864         8.0700
  1 pos left                8.7398         8.0322         8.0565
  2 pos left                8.6979         7.5448         8.0117
  leftmost pos, 1 pos left  8.5302         7.7680         8.0039
  1 pos right               8.7281         7.7300         7.7534
  1 word left, 1 word right 8.4288         7.6988         8.3168
  1 pos left, 1 pos right   8.7359         7.8509         8.3148
  1 pos left, 1 word right  8.5214         7.9630         8.2310
  2 pos left, 1 word right  8.2788         7.9600         8.4337
  1 pos left, leftmost pos (two features) 8.5029  7.6481  7.9288

Table 17: Average number of content words per sentence. The reference sentences contain 8.6209 content words on average.

However, precision does not suffer dramatically, and the baseline's recall value of less than 0.39 is far from perfect, indicating that not enough content words are correctly translated. The main problem with this sort of automatic evaluation are mismatches between the reference sentence and good translations, as good translations that are not part of the reference sentence are not taken into account. Particularly interesting are the values of the leftmost-pos-1-pos-left context (full setting), where on average fewer content words than in the baseline are translated, but a gain in recall and precision can be observed. Similarly, three of the combined contexts have both fewer translated content words and a better precision than the baseline.
The fact that these three systems using feature combinations behave similarly in terms of precision and recall indicates that independent features complement and reinforce each other by being especially restrictive: since both features need to have high values for a high total score, only translation candidates that are appropriate in both contexts are likely to be chosen. The increased number of content words, particularly in the reduced systems, also strengthens our assumption that (literal) translations into content words are enhanced. In the one-feature systems, noticeably fewer content words are translated. As this applies to the baseline as well as to the modified systems, we assume that it is an inherent characteristic of this type of setting and has nothing to do with our modifications. Nevertheless, we have to keep in mind that the improvements are not tested for significance, as was the case with the bleu and nist evaluation, and that nothing is known about their correlation with human judgments. Therefore, the results based on content words cannot be regarded the same way as commonly accepted evaluation metrics. Yet, the results of the content-word based evaluation are to a certain degree consistent with the previous evaluation. This applies particularly to the leftmost-pos-1-pos-left context, which outperforms most other systems in all setting variants (bold-printed numbers in table 16): when looking at the bleu and nist scores of this context, we find significant improvements in all variants (standard, pos-based and lemma-based).

6.4.2 Manual evaluation

To complete the evaluation of translation results, we also carried out a small-scale manual evaluation. The test set comprises 100 randomly chosen sentences from the full system conditioned on one pos tag on either side of the phrase.
This setting was chosen because it has significantly improved pos-bleu and lem-bleu scores, and a comparatively high (although not significant) standard bleu score. Since the lack of a significantly increased standard bleu score combined with significantly better pos- or lemma-based scores applies to several settings, we want to find out whether there are phenomena that are not captured by the automatic evaluation metrics. The evaluation task consisted of annotating which one of two given translations (baseline or contrastive system) was better, or whether both translations were of equal quality. In order not to influence the participants, the two translations were presented in random order. Additionally, the English input sentence was given, but not the German reference translation: this should prevent participants from favoring sentences that resemble the reference translation despite being of the same quality as the alternative translation. The annotation directives were kept very simple: the most important criterion was adequacy, meaning that the sentence that better expresses the content of the input sentence was to be marked as the better one. The criterion of fluency came second. In most cases, better adequacy actually goes hand in hand with better fluency, since missing or wrong words tend to have negative effects on the syntactic level, too. As test set, 100 sentences whose baseline translation differed from the translation produced with the contextually conditioned system were randomly chosen, with the additional requirement of a sentence length between 5 and 25 words. The results of the three annotators (1 linguist and 2 non-linguists) are given in table 18. Generally, the correlation between the annotation results is not high: while À and Á judged roughly equally many sentences to belong to the respective categories, their agreement is not good.
The annotation produced by Â is somewhat different, as this participant found fewer translations to be of equal quality: being a linguist, he might be more sensitive to subtle criteria and therefore have a preference for one of the options. As a result, all three participants agreed on only 44 sentences, less than half the test set.

             equal quality   baseline better   contrastive better
À            44              21                35
Á            37              25                38
Â            21              33                46
À∩Á          23              12                22
Á∩Â          13              19                29
À∩Â          17              18                32
À∩Á∩Â        12              12                20

Table 18: Results of the manual evaluation with three annotators. The persons À and Á are non-linguists, Â has a linguistic background.

The differences between the annotations show the difficulty of rating machine translation output. Of course, one might argue that the annotation guidelines were formulated too loosely, but on the other hand, this experiment was intended to identify the translations that were intuitively judged to be the better variant. If we consider all sentences with no clear result to be more or less equal, then the number of 68 sentences with roughly the same quality suggests that the two systems are relatively alike. However, the fact that each participant preferred a higher number of sentences produced by the contextually conditioned system indicates a certain improvement over the baseline. Additionally, the set of sentences translated with the contrastive system that all three participants prefer is larger than the set of preferred baseline sentences. The set of sentences on which all annotators agreed was then used as the basis for a more detailed evaluation of error types. Since the test set for the manual evaluation was chosen randomly from the entire set of sentences, this sample can be considered representative.⁹ A closer inspection of the two sets containing sentences that are either better or worse than the baseline might help to understand the properties of the contextually conditioned system.
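The per-annotator and intersection counts of table 18 can be derived mechanically from the per-sentence judgments. A small sketch of that bookkeeping (the judgment lists below are invented toy data, not the actual annotations):

```python
from itertools import combinations

def agreement_counts(annotations):
    """annotations: dict annotator -> list of labels ('=', 'base', 'contr'),
    one label per sentence. Returns per-label counts for every annotator
    and for every intersection of annotators (sentences where all members
    of the group gave the same label)."""
    names = list(annotations)
    n_sent = len(annotations[names[0]])
    groups = [(a,) for a in names]
    for k in range(2, len(names) + 1):
        groups += list(combinations(names, k))
    counts = {}
    for group in groups:
        key = "∩".join(group)
        counts[key] = {
            label: sum(
                1 for i in range(n_sent)
                if all(annotations[a][i] == label for a in group)
            )
            for label in ("=", "base", "contr")
        }
    return counts
```

With the real data, the row À∩Á∩Â of table 18 (12/12/20) corresponds to the 44 sentences on which all three annotators agreed.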
For this analysis, three types of error were defined: as missing words are a major impediment to the understanding of a translation, we counted how often the contextually conditioned system managed to translate previously untranslated words, or lost words that were correctly translated in the baseline. Another criterion is better fluency; here, we do not distinguish between improvements on the syntactic or morphological level or a more fluent rendering of fixed expressions. The last point is word choice, which applies to cases where one word or expression is clearly better than the alternative in the other system. Of the set of 20 sentences that are better than the baseline, 10 contain words that are missing in the baseline translation; 7 of these formerly missing words are verbs. Furthermore, 9 sentences were found to contain more fluent word sequences, and in two cases the word choice was improved. (One of the sentences was improved both by containing a better-suited word and by a more fluent structure.) The distribution of error types found in the set of 12 sentences where the baseline system produced the better translation is slightly different: while the percentage of fluency-related cases (5 of 12) corresponds to the percentage of errors of this category in the other sample, there are roughly as many word-choice problems (4) as missing words (3). Interestingly, in two of the sentences that were found to be worse than the baseline, we can observe the opposite effect of missing words, namely redundant words (categorized as incorrect choice of word). One of the two redundant words does not make any sense at all and is clearly the result of overestimating a random phrase-translation pair. The other case is more interesting and is illustrated in the following example, where (28) is the baseline translation and (29) is the contrastive translation.
(28) [what country ,]1 [exactly]2 [, does the]3 [federal chancellor]4 [live in]5 [?]6
     [welches land ,]1 [genau]2 [die]3 [bundeskanzlerin]4 [leben in]5 [?]6

(29) [what country]1 [, exactly , does]2 [the]3 [federal chancellor]4 [live in]5 [?]6
     [welches land]1 [genau plant]2 [die]3 [bundeskanzlerin]4 [leben]5 [?]6

(30) which country does the federal chancellor plan to live in?

⁹ The complete set of the sentences used for the error analysis is listed in the appendix.

In German, the concept of do+verb does not exist; thus, it is optimal to translate only the main verb (live) and to ignore the do. The baseline translation, where does the is translated as an article, is perfectly comprehensible, although it is not syntactically well-formed. In the alternative translation in (29), does is translated by the verb plant (plans/intends), which gives a new meaning to the entire question (cf. 30). While this is definitely an unwanted effect, there is a certain logic behind this error, since the chosen verb belongs to a similar category (both do and plan introduce infinitives) and could even have been a valid choice in a different situation. In total, the results illustrate that the contextually conditioned systems tend to translate more content words. Given that especially verbs are often lost in the translation process, this is a positive effect, even if in some cases the system is overzealous and inserts unnecessary words. The evaluation showed that the number of correctly translated, formerly missing words clearly exceeds the number of redundant words. This observation is also consistent with the results of the content-word based evaluation.

6.5 Analysis of selected examples

In the chapter about evaluation, a classification of typical error types in machine translation output (cf. section 4.2) was presented, as well as different scenarios where the integration of context features could prove useful.
In an attempt to better understand the effects of our modifications, we will carry out a detailed analysis of example translations that are representative of some of the error categories mentioned before. The first part of the evaluation focuses mainly on adequacy, while the second part illustrates the translational behavior of verbs and takes the evaluation to the level of fluency. All examples were chosen manually, but with the search criterion of large differences in evaluation scores. For each example, the segmentation of the source sentence is shown; the baseline translations are abbreviated as bl, while the output of the contextually conditioned systems is labelled cont.

6.5.1 Adequacy

The task of finding appropriate translation candidates for polysemous source phrases was one of the main objectives for the use of context features. In the example in table 19, the words march and general are ambiguous. In the case of march, which denotes the action of marching in this sentence, both the baseline and the modified system fail to disambiguate the word. As a matter of fact, none of the other systems conditioned on different contexts was able to find a valid translation of march. In mixed-case text, the month name March is generally uppercased; however, this information is lost when the entire data is lowercased.
input: [other military]1 [units were]2 [to join the]3 [march]4 [shortly]5 [, said]6 [general]7 [danilo]8 [lim]9 [, former]10 [leader of the]11 [scout]12 [rangers]13 [elite]14 [unit]15 [.]16
bl:    [andere militärische]1 [einheiten]2 [zu den]3 [märz]4 [kurz]5 [gesagt ,]6 [allgemeine]7 [danilo]8 [lim]9 [, ehemaliger]10 [führer der]11 [pfadfinder]12 [rangers]13 [elite]14 [einheit]15 [.]16
gloss: [other military]1 [units]2 [to the]3 [march (month)]4 [shortly]5 [said ,]6 [general (adj)]7 [danilo]8 [lim]9 [, former]10 [leader of the]11 [scout]12 [rangers]13 [elite]14 [unit]15 [.]16

input: [other]1 [military]2 [units were]3 [to join]4 [the]5 [march]6 [shortly]7 [, said]8 [general]9 [danilo]10 [lim]11 [, former]12 [leader of the]13 [scout]14 [rangers]15 [elite]16 [unit]17 [.]18
cont:  [andere]1 [militärische]2 [einheiten]3 [im]5 [märz]6 [in kürze]7 [beitreten]4 [, sagte]8 [general]9 [danilo]10 [lim]11 [, ehemaliger]12 [führer der]13 [pfadfinder]14 [rangers]15 [elite]16 [einheit]17 [.]18
gloss: [other]1 [military]2 [units]3 [in]5 [march (month)]6 [shortly]7 [join]4 [, said]8 [general]9 [danilo]10 [lim]11 [, former]12 [leader of the]13 [scout]14 [rangers]15 [elite]16 [unit]17 [.]18

ref: general danilo lim, der früher die eliteeinheit der scout rangers befehligte sagte, andere armee-einheiten wollten sich dem marsch in kürze anschließen.

Table 19: The translation probabilities of this example were conditioned on the leftmost pos of the phrase and the pos on the left side (full system).

The word march is tagged as a noun regardless of its meaning; but given that the combination in march is more common when speaking of the month, it is surprising that the system did not manage to find a better translation. When looking at the phrase translation probabilities, it becomes evident that there actually was a big change: the translation probability for march → märz (March, i.e.
the unwanted translation) decreased from 0.82 to 0.33, while the probability of the targeted translation march → marsch (march) rose to 0.08 (formerly 0.01). However, 0.08 is still less than 0.33. Furthermore, with märz (March) occurring considerably more often in the training data, the language model might give it a higher score than the less frequent marsch (march). The language model's preference might additionally be enhanced by march's proximity to shortly, which can be expected to co-occur with indications of dates. The other ambiguous word in this example is general, which is either a noun denoting a military rank or an adjective meaning something like common or generic. Given that a part of the context used for disambiguation consists of the leftmost pos tag of the phrase, this context is able to reduce the set of all possible translations to a set of valid options. In combination with a finite verb (said) on the left side, only two translation candidates remain: general and der general (the general), both of which are correct. In this example, not only the translation probabilities of the ambiguous words were changed, but also the translation of the two verbs said and join was better: the phrase containing join was translated as to the in the baseline translation, with the result that the equivalent of join is missing in the target sentence. As join is a content word and thus important for the understanding of the target sentence, its correct translation in the contextually modified system leads to higher comprehensibility. While the baseline translation of said preserves the meaning of to say in past tense, the word form is not well chosen, as opposed to the contextually conditioned system, where it leads to an improvement in fluency.
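The shift from 0.82 to 0.33 described above is exactly what conditioning on context does to the relative-frequency estimates: instead of counting over all extracted instances of a source phrase, we count only over the subset whose source context matches. The following is a minimal sketch of this estimation step; the counts and context encodings are invented for illustration and are not the actual training-data figures:

```python
from collections import Counter, defaultdict

def conditional_phrase_probs(extractions):
    """extractions: iterable of (src_phrase, context, trg_phrase) tuples,
    one per extracted phrase-pair instance from the word-aligned corpus.
    Returns probs[(src, context)][trg] =
        count(src, context, trg) / count(src, context),
    i.e. the context-conditioned relative frequency ϕ(trg|src_c)."""
    extractions = list(extractions)
    pair_counts = Counter(extractions)
    context_totals = Counter((src, ctx) for src, ctx, _ in extractions)
    probs = defaultdict(dict)
    for (src, ctx, trg), c in pair_counts.items():
        probs[(src, ctx)][trg] = c / context_totals[(src, ctx)]
    return probs
```

The same source phrase thus receives a different distribution for each observed context, which is what allows a rarer sense to gain probability mass relative to the unconditioned estimate.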
While the baseline sentence has a bleu score of zero¹⁰, as it does not contain a 4-gram that also appears in the reference sentence, the sentence translated with the refined system has a positive bleu score as well as a higher pos-bleu, lem-bleu and percentage of translated content words. The next examples demonstrate effects on the aspect of lexical choice. In some cases, translations are well comprehensible but cannot be considered completely correct because the choice of words is inappropriate, e.g. in the case of disrupted collocations. The baseline translation in table 20 is relatively well-formed and expresses the meaning of the source sentence. However, the choice of the word recht (right/law) as translation for law is not optimal, and the verbal construction gebrochen worden (been broken) is not perfect, as it is missing an auxiliary verb. The modified system chose a better translation option: co-occurring with the verb brechen (break), gesetz (law) is the better translation because this structure is collocational. Additionally, the verbal construction is well-formed. In contrast to the example presented previously, the translation probabilities of the phrases containing law or has been broken did not change; the different translation outcome was caused indirectly by a different segmentation. While the segmentation depends on the respective weight obtained during parameter tuning, the considerably increased translation probability of if the → wenn das in the modified system is most likely what triggered the new segmentation, which made it possible to translate the phrase law has been broken as a unit.
In the original distribution, the translation probability for if the → wenn die (if the, but with the wrong form of the determiner) was highest, due to the fact that the determiner die is used in both singular and plural in the nominative and accusative cases and therefore occurs very often, whereas the appropriate article for this context, das, is comparatively infrequent. With law on the right side, the translation probability for the appropriate variant wenn das increases and allows the system to combine it with the phrase pair law has been broken → gesetz gebrochen wurde. This would not have been possible with wenn die as translation of the first phrase, since die cannot be a determiner for gesetz (law) and consequently, this combination would not be accepted by the language model.

¹⁰ Technically, it is undefined.

input: [if the law]1 [has been broken]2 [, it is a matter]3 [for the police .]4
bl:    [wenn das recht]1 [gebrochen worden]2 [, es ist eine frage]3 [für die polizei .]4
gloss: [if the law]1 [been broken]2 [, it is a question]3 [for the police .]4

input: [if the]1 [law has been broken]2 [, it is a matter]3 [for the police .]4
cont:  [wenn das]1 [gesetz gebrochen wurde]2 [, es ist eine frage]3 [für die polizei .]4
gloss: [if the]1 [law was broken]2 [, it is a question]3 [for the police .]4

ref: wenn das gesetz gebrochen wurde, ist es eine sache für die polizei.

Table 20: The translation probabilities of this example were conditioned on one word on the right side of the phrase (full system).

By translating the entire phrase, i.e. verb and noun, as a unit, the optimal translation for law in combination with break is found, with the additional advantage that the verbal construction is translated correctly. This example is intended to show that not only 'straightforward' contexts can lead to different translations, but that there are also indirect effects such as segmentation. Also, it shows that a (good) decision made at one point of the translation, i.e.
[if the law] vs. [if the], affects the translation of other phrases, even though their translation probabilities have not necessarily been changed. However, these indirect effects caused by the use of contextually conditioned translation probabilities are somewhat unpredictable and can also backfire, as illustrated in the next example. In the example shown in table 21, the translation of the baseline sentence is better than the translation of the modified system. Except for the position of the verbs, the baseline sentence is a valid translation of the source sentence. Again, the segmentation is different: the phrase were harmed was translated as one unit in the baseline, resulting in an appropriate translation. In the other case, the system chose to separate the phrase into two single words, probably due to the considerably increased translation probability of were → wurden. As there was no translation candidate with the meaning of injured for the phrase harmed, a translation with the meaning of damaged ended up being chosen. While this translation roughly expresses the content of the source phrase, it is not a good translation if the object of harm is a person. As in the previous example, enhanced translation probabilities of adjacent phrases triggered a different segmentation, possibly supported by the new weight of the segmentation feature, and thus led to a different translation of phrases whose translation probabilities might not even have been changed. These last two examples demonstrate that there are 'random influences' like segmentation that are not necessarily closely related to the change of translation probabilities of the focused phrase, but caused by different translation probabilities of seemingly trivial phrases. Additionally, the weights for segmentation and reordering result from the complex procedure of parameter tuning and are thus hard to trace back.
input: [during the night of]1 [monday]2 [, about]3 [80]4 [policemen]5 [were harmed]6 [in]7 [riots .]8
bl:    [in der nacht vom]1 [montag]2 [, über]3 [80]4 [polizisten]5 [verletzt worden seien]6 [in]7 [krawallen]8
gloss: [during the night of]1 [monday]2 [, about]3 [80]4 [policemen]5 [injured have been]6 [in]7 [riots .]8

input: [during the night of]1 [monday]2 [, about]3 [80]4 [policemen]5 [were]6 [harmed]7 [in riots]8 [.]9
cont:  [in der nacht vom]1 [montag]2 [, über]3 [80]4 [polizisten]5 [in krawallen]8 [geschädigt]7 [wurden]6 [.]9
gloss: [during the night of]1 [monday]2 [, about]3 [80]4 [policemen]5 [in riots]8 [damaged]7 [were]6 [.]9

ref: in der nacht auf dienstag waren bei krawallen in villiers-le-bel rund 80 polizisten verletzt worden.

Table 21: The translation probabilities of this example were conditioned on the leftmost pos of the phrase and the pos on the left side (full system).

While bad lexical choice is not as harmful as missing words or completely wrong translations as long as the output is comprehensible, it is also a factor in the evaluation of machine translation. The example in table 20 suggested that collocations are best translated as a single unit, which, to a certain degree, strengthens the hypothesis that contextual features are needed to artificially 'enlarge' phrases.

6.5.2 Fluency: translational behavior of verbs

In addition to improving comprehensibility by finding appropriate translations for ambiguous words in the source sentence, adding context information was also expected to result in more fluent translation output. Especially when using pos tags, translated sentences should be more fluent, as their translation probabilities were conditioned on linguistically motivated contexts.
The examples presented in this section will mainly focus on the translational behavior of verbs; we will look at cases where the modified system found better realizations of verbal constructions, but also at cases where verbs were not translated at all. In the following example (table 22), the verbal phrase compete with was translated as im wettbewerb mit (in competition with). While the meaning of the source phrase is preserved, this is not a good translation. A verb like compete is likely to be translated as a collocational verb-noun construction like konkurrenz darstellen (lit. to be/pose a competition) or im wettbewerb stehen (to be in competition). The version chosen by the baseline system is part of the latter, but the verb stehen is missing, which is a negative factor when rating fluency. The modified system decided to translate compete with the literal equivalent konkurrieren when conditioned on the tag to, resulting in a sentence that is completely grammatical except for word order: it would sound more natural if phrase 3 were moved to the position after phrase 5.

input: ["]1 [we don 't want to]2 [compete with]3 [our clients]4 [" ,]5 [says]6 [rivero .]7
bl:    ["]1 [wir wollen nicht]2 [im wettbewerb mit]3 [unsere kunden]4 ["]5 [sagt , dass]6 [rivero .]7
gloss: ["]1 [we don't want to]2 [in competition with]3 [our clients]4 ["]5 [says , that]6 [rivero .]7

input: ["]1 [we don 't want to]2 [compete]3 [with our]4 [clients]5 ["]6 [,]7 [says]8 [rivero]9 [.]10
cont:  ["]1 [wir wollen nicht]2 [konkurrieren]3 [mit unseren]4 [kunden]5 ["]6 [,]7 [sagt]8 [rivero]9 [.]10
gloss: ["]1 [we don't want to]2 [compete]3 [with our]4 [clients]5 ["]6 [,]7 [says]8 [rivero]9 [.]10

ref: " wir möchten für unsere kunden keine konkurrenz darstellen ", erklärt rivero .

Table 22: The translation probabilities of this example were conditioned on one pos on the left side (full system).
Additionally, conditioning on one pos tag on the left side helps to find a better translation for says: the context comma excludes the phrase-translation pair says → sagt , dass (says that), which is quite plausible considering that says that can be expected to be preceded by either a noun or a personal pronoun in most cases. We can also see an improvement on the morphological level: as with our was translated as one unit in the modified system, the German form unseren (our) is dative, as required by the preposition mit (with). The segmentation of the baseline source phrase divides with and our; thus, the link to the preposition and its case requirement is lost. The form unsere in the baseline is nominative or accusative, either of which generally occurs more often than the dative. The fact that nominative and accusative often share the same forms additionally enhances their translation probabilities¹¹. As it is difficult to evaluate morphological aspects on non-well-formed sentences, this will not be discussed any further. Altogether, the translation choices of the modified system are more or less literal; but despite being relatively well-formed, the output is not very similar to the reference translation, resulting in a standard bleu score of zero, the same score as the baseline. When computing bleu on pos tags, the two translations receive different scores: while the score of the baseline is still zero, the improved translation is rewarded with a pos-bleu score of 39.85: two matching 4-grams are provided by the now correct sequence [" , vvfin ne .], which guarantees a non-zero score.

¹¹ Conditioning on a preposition on the left side of a phrase would be no help, as no distinction is made between prepositions requiring the dative or the accusative.
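The zero scores discussed here follow directly from bleu's geometric mean over n-gram precisions: if no 4-gram matches, one factor is zero and so is the whole score. A compact, unsmoothed sentence-level sketch (not the exact mteval implementation; computing pos-bleu simply means running the same function on tag sequences instead of word sequences):

```python
import math
from collections import Counter

def sentence_bleu(hyp, ref, max_n=4):
    """Unsmoothed sentence-level BLEU with one reference.
    Returns 0.0 as soon as any n-gram precision is zero."""
    precisions = []
    for n in range(1, max_n + 1):
        h = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        matches = sum((h & r).values())        # clipped n-gram matches
        if matches == 0:
            return 0.0
        precisions.append(matches / max(sum(h.values()), 1))
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)
```

This also shows why a single restored 4-gram sequence such as [" , vvfin ne .] is enough to move a pos-bleu score away from zero.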
input: [nonetheless ,]1 [scientists]2 [suspect]3 [that both]4 [planets]5 [developed under]6 [very similar]7 [conditions .]8
bl:    [dennoch ,]1 [wissenschaftler]2 [vermuten]3 [, dass beide]4 [planeten]5 [unter]6 [ähnlichen]7 [bedingungen .]8
gloss: [nonetheless ,]1 [scientists]2 [suspect]3 [that both]4 [planets]5 [under]6 [similar]7 [conditions .]8

input: [nonetheless ,]1 [scientists]2 [suspect]3 [that]4 [both]5 [planets]6 [developed]7 [under]8 [very]9 [similar]10 [conditions]11 [.]12
cont:  [dennoch]1 [wissenschaftler]2 [vermuten]3 [, dass]4 [beide]5 [planeten]6 [entwickelten]7 [unter]8 [sehr]9 [ähnliche]10 [bedingungen]11 [.]12
gloss: [nonetheless ,]1 [scientists]2 [suspect]3 [that]4 [both]5 [planets]6 [developed]7 [under]8 [very]9 [similar]10 [conditions]11 [.]12

ref: [...] gehen wissenschaftler davon aus , dass beide planeten unter ganz ähnlichen voraussetzungen entstanden sind .

Table 23: The translation probabilities of this example were conditioned on one pos on the left side (full system).

The previous example showed that context features can filter the set of translation candidates down to a smaller, better-suited set of grammatically more appropriate translations. However, a much more serious problem are verbs on the source side that are not translated at all: while this is mostly an aspect of adequacy, missing verbs also have an impact on fluency. The next example shows a sentence where the baseline system failed to translate a verb occurring in a subordinate clause. In German subordinate clauses, verbs are normally placed at the very end of the clause, as opposed to the English structure, where they appear more or less at the beginning, in close proximity to the subject. For statistical machine translation systems, it is difficult to overcome this distance and put the verb in the right place, or even to find a translation which expresses the content of the source word at all.
A combination of different translation probabilities and segmentation differences led to the literal translation developed → entwickelten in the example in table 23, instead of the invalid translation in the baseline. In the translated sentence, entwickelten is placed at the position that corresponds to that of the verb on the English side. As can be seen in the reference translation, the position of the corresponding verb construction entstanden sind (have emerged) is at the very end of the sentence. Although the phrase entwickelten is not placed correctly and does not correspond to the reference, it is a valid translation of developed: this outcome is a large improvement over the baseline. It goes without saying that conditioning the translation probabilities on various context features does not allow us to translate every verb that was omitted in the baseline translation. As translating verbs remains challenging, chapter 8 specifically focuses on this issue by additionally using pos tags on the target side in an attempt to force the system to only pick content words as translations for source phrases containing verbs. As in the example in table 22, the modified system in table 23 chose a very literal translation. In the case of table 23, this variant turns out to be more accurate than the baseline, as not only the verb is translated, but the word very is also not lost. Clinging to literal translations is generally not a good idea (especially for idiomatic expressions or structures that cannot be translated isomorphically), but choosing a literal translation is better than 'no' translation, i.e., in this case, the translation of a full verb with a non-content word. While we must be careful not to over-generalize from the presented examples, it seems that contextually conditioned systems prefer – at least to a certain degree – literal translations in cases where the source phrase is not ambiguous.
Unfortunately, there is no way to prove or disprove this hypothesis; we will discuss this assumption in section 7.3.

6.6 Summary

In this chapter, we examined the effects of including local context templates in statistical translation systems, comparing different system settings (full systems vs. reduced systems). While most systems failed to significantly outperform the baseline system when using the full-feature setting, systems with reduced settings, especially those using only source phrases of length one, achieved results that were significantly better than the baseline. The best context in our experiments turned out to be the combination of one pos tag on the left side and the leftmost pos tag of the phrase: the bleu score of the full system was best when the two contexts were used as separate features, but the results of the different system settings were more stable when the contexts were combined into a single feature. In order to measure the syntactic quality of machine translation output, the pos-bleu score was introduced. In an attempt to better capture the changes on the lexical level, we also computed bleu scores on lemmas and statistics for translated content words. Similarly to the standard bleu scores, the new metrics showed significant improvements for the reduced settings, but also for some of the full systems. In addition to the automatic evaluation metrics, we carried out a simple manual evaluation. An interesting observation is the fact that most systems translate more content words than the baseline system; this was shown not only by the content-word statistics, but was also a result of the manual evaluation. We assume that contextual conditioning enhances the translation probabilities of literal translations and that, consequently, content words have a better chance of being picked as translations.
When looking at example translations, we found that the choice of phrase-translation pairs is not always directly determined by a high translation probability, but also by indirect factors like segmentation, which again depend on the translation probabilities, both during the process of parameter tuning and during the actual translation process. The correlation between segmentation and the modified translation probabilities favoring literal translations will be discussed in section 7.4, where we will also address the cause of the difference between left-side and right-side contexts observed at the beginning of the evaluation. So far, only context templates have been used: no distinction was made between the different realizations of e.g. the context 1-pos-left, though it is likely that some pos tags work better than others. In the next chapter, we will look at a reduction of the set of possible contexts and also address the issue of the relevance of specific realizations of contexts. When discussing the results of the different systems, the differences between bleu and nist were mentioned only marginally: in several cases (e.g. table 13), the nist score of the full setting is significantly better than the baseline, while the bleu score is not. By giving a higher weight to more informative (i.e. less frequent) n-grams, nist is more sensitive to lexical choice, while bleu considers all n-grams to be equally important and additionally requires matches of larger n-grams, thus giving more credit to word order. Generally, bleu is a better measure of translation quality; however, it has been shown that nist is able to detect small differences due to its sensitivity to lexical aspects, whereas bleu often cannot capture small lexical changes (see [Riezler and Maxwell, 2005]).
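The 'informativeness' weighting that distinguishes nist from bleu can be made concrete: each matched n-gram is weighted by how much rarer it is than its (n-1)-gram prefix, with counts estimated from the reference data. A sketch of that weight computation, under the standard NIST definition info(w1..wn) = log2(count(w1..w(n-1)) / count(w1..wn)); the toy corpus is invented:

```python
import math
from collections import Counter

def nist_info_weights(references, max_n=2):
    """Information weights for n-grams, estimated from reference sentences
    (lists of tokens). For unigrams the 'prefix count' is the total number
    of words. Rare n-grams receive high weights, frequent ones low weights."""
    counts = Counter()
    total_words = 0
    for ref in references:
        total_words += len(ref)
        for n in range(1, max_n + 1):
            for i in range(len(ref) - n + 1):
                counts[tuple(ref[i:i + n])] += 1
    info = {}
    for ngram, c in counts.items():
        prefix_count = total_words if len(ngram) == 1 else counts[ngram[:-1]]
        info[ngram] = math.log2(prefix_count / c)
    return info
```

Because a correctly translated rare content word contributes a large weight, this construction explains why nist reacts to the small lexical improvements discussed above even when bleu does not.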
Given that there are several cases in which the nist score is significantly better than the baseline, this is an interesting point, especially when considering that we also found that contextually conditioned systems tend to translate a higher percentage of content words.

7 Analysis of general aspects of context features

In this chapter, we want to look more closely at general characteristics of source side contexts. This includes the question of how fine-grained linguistically motivated contexts should be, which will be illustrated by using chunks as context information and by experimenting with differently designed sets of pos tags for auxiliary verbs. We will then focus on the influence of specific pos tags on translation quality by comparing distributions conditioned on either a specific pos tag or on the complement of this tag, in an attempt to understand the average effect of non-trivial contexts. The general effects observed in this experiment have an interesting influence on the phrasal segmentation of the input sentence, as the changed translation probabilities favor a segmentation into short phrases, which would not be considered a positive effect. Since segmentation is a crucial point for translation quality, we attempt to thoroughly explain this phenomenon and the resulting consequences for the overall performance, and we also provide a comparison to the system presented in [Carpuat and Wu, 2007a/b]. We conclude this chapter with a few thoughts on feature selection.

7.1 Chunks as source-side context features

In the same way as pos tags are a generalization of words, chunks can be regarded as a compressed set of tags. Chunks are linguistically motivated phrases like nominal chunks (nc) or prepositional chunks (pc); in addition to representing a compact type of context, they might also offer a key to better segmentation.
Intuitively, one could assume that an input sentence that is segmented according to linguistic criteria should be translated relatively well: decomposing a sentence into constituents which are then translated and concatenated on the target side should be a relatively safe method; after all, a well-formed sentence is nothing other than a sequence of constituents. However, it has been shown that restricting the phrase-translation pairs to constituents on both source side and target side harms the performance of an smt system. The results of [Koehn et al., 2003] show that eliminating non-constituent pairs excludes too many phrases. In our experiments, we regard chunks mainly as a more compact form of pos tags, but also as a means to improve segmentation. Similarly to the work reported in the previous chapter, source phrases will be conditioned on left side context information, using chunks instead of pos tags. Being more general than pos tags, chunks should better exploit the training data, and might also trigger a better segmentation. In order to encourage a syntactically motivated segmentation, we will introduce an additional feature function to represent the status of the focused phrase itself: if the source phrase is (partially) well-formed, it is assigned the value 1, otherwise 0. This is not the same form of context integration as described above, since the resulting values are either 0 or 1 instead of representing a probability distribution. In contrast to our previous experiments, this feature is not intended to filter the set of all extracted phrase-translation pairs to a smaller set of phrases with the same specific context as the focused phrase; it is designed to reward phrases which meet the general criterion of being (partially) well-formed.
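A minimal sketch of this 0/1 feature function, assuming phrases and chunks are represented as (start, end) token spans (end exclusive) and taking "partially well-formed" to mean that the phrase starts at a chunk border — both representational choices are ours:

```python
def wellformedness_bonus(phrase_span, chunk_spans):
    """Binary bonus feature: 1 if the source phrase starts at a chunk
    boundary (and is thus at least partially well-formed with respect
    to the chunking), 0 otherwise. Sentence-initial phrases also count,
    since position 0 is always a segment boundary."""
    start, _ = phrase_span
    if start == 0:
        return 1
    return 1 if any(start == c_start for c_start, _ in chunk_spans) else 0
```

The feature is computed per phrase and per input sentence; unlike the conditioned probabilities, it carries no distributional information of its own and relies on the tuned feature weight to influence the search.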
nc:    [genuinely/rb representative/jj democracies/nn]nc
pc:    [at/in [some/dt point/nn]nc ]pc
adjc:  [as/rb possible/jj]adjc
conjc: [rather/rb]conjc
advc:  [once/rb again/rb]advc
vc:    [to/to swim/vv]vc
prt:   [up/rp]prt
lst:   [1/lst]lst
intj:  [yes/uh]intj

Table 24: Different types of chunks with examples. A tag indicating begin and end of the sentence is also used for conditioning.

7.1.1 Chunked data

Training data was chunked with a variant of tree-tagger [Schmid, 1994]. Its output format can be seen in figure 5: the structure is generally flat and does not contain deeply nested chunks, with a few exceptions such as ncs embedded in pcs. As can be seen, some tokens like while or punctuation are not part of chunks. Since they might also be useful as context, they were replaced by their pos tag. Table 24 shows the different forms of chunks; even with ca. 10 additional tags, the set of possible contexts is much smaller than the tag set used in the previous section. When conditioning on chunks on the left side of a phrase, we want to benefit from the reduced number of contexts: in many cases, it does not matter whether a phrase is adjacent to e.g. a noun or a personal pronoun, and therefore tags with similar properties might as well be combined into one single representative of the group. In the case of a context consisting of two pos tags, essentially equal contexts might be split up into e.g. art noun, adj noun or conj noun, despite always representing the equivalent of a noun phrase. Furthermore, singular nouns often have an article while plural nouns often do not, so the leftmost pos tag in a 2-tag context could as well be something that is usually not associated with noun chunks. As a consequence, the number of pos tag based contexts is often unnecessarily inflated.
<s> <NC> energy/NN generation/NN and/CC distribution/NN </NC> <VC> must/MD continue/VV </VC> <VC> to/TO be/VB made/VVN </VC> <ADJC> more/RBR efficient/JJ </ADJC> ,/, while/IN <VC> complying/VVG </VC> <PC> with/IN <NC> environmental/JJ standards/NN </NC> </PC> ./SENT </s>

Figure 5: Example for chunked text (tokens with pos tags; the tagger output additionally contains a lemma column, e.g. make for made, comply for complying, standard for standards).

While context features based on chunks are extremely compact, an obvious problem is overlapping: since the phrases used for translation are only well-formed in terms of alignment structure, they are not linguistically motivated, and thus the available phrases in the phrase-table do not always correspond to the chunks in the input sentence.

7.1.2 Chunk-based contexts

When using chunks for conditioning on the left side, we distinguish between phrases having an adjacent chunk on the left side without overlapping and phrases whose left borders do not correspond with chunk borders: phrases with no overlapping chunk on the left side can simply be conditioned on this context, while phrases which do not start at a chunk border are conditioned on the chunk containing the left side of the phrase. In the case of embedded chunks, the outer chunk is used as context: if the left context consists of a complete nc in a complete pc, then pc is taken as context. The phrase p1 in figure 6 is simply conditioned on the adjacent context ncb sharing the boundary with p1, while the phrase p2 is conditioned on the chunk nco that overlaps with p2. In the case of p2, the context is not purely adjacent, but is used to model the transition between the context and the phrase itself.
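The selection rule just described — adjacent chunk if the phrase starts at a chunk border, otherwise the outermost chunk containing the phrase's left boundary — can be sketched as follows; the span representation and the ordering assumption are ours:

```python
def left_chunk_context(phrase_start, chunk_spans):
    """Pick the left-side chunk context. chunk_spans: list of
    (label, start, end) with end exclusive, outermost chunks listed
    first.
    - If some chunk ends exactly where the phrase starts, the phrase is
      non-overlapping and that adjacent chunk is the context (with
      embedded chunks, the outer one wins, e.g. an nc inside a pc -> pc).
    - Otherwise the phrase starts inside a chunk; the outermost chunk
      covering the phrase's left boundary is used."""
    if phrase_start == 0:
        return "<s>"                       # sentence-begin tag
    for label, start, end in chunk_spans:
        if end == phrase_start:
            return label                   # adjacent, non-overlapping
    for label, start, end in chunk_spans:  # outermost first
        if start < phrase_start < end:
            return label                   # overlapping chunk
    return None  # left neighbour outside any chunk (e.g. punctuation)
```

In practice the fallback case would return the pos tag of the left neighbour, since tokens outside chunks are replaced by their pos tags in our data.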
In the experiments discussed in section 6.3, the combination of the 1-pos-left and the leftmost-pos contexts turned out to work relatively well; the context of an overlapping phrase is somewhat similar to the 1-pos-left and leftmost-pos context in that it captures the information that the phrase begins inside a given chunk rather than occurring after a given chunk as in p1. Figure 7 shows an example of conditioning on overlapping chunks: the phrase vote for overlaps with either a vc or an nc on its left side. Without conditioning, the top-ranking translation candidates for vote for mainly consist of verbal translations. When conditioned on vc, the most probable translation candidates are more or less the same, albeit with increased translation probabilities, whereas conditioning on nc favors nominal translations. If the left boundary of a phrase corresponds to a linguistically well-formed segmentation, this means that the phrase is, loosely speaking, at least partially well-formed. As we are interested in a segmentation of the input sentence that approximates a flat linguistic analysis, conditioning on a non-overlapping context might enhance the translation probabilities of phrases that could trigger a good phrasal segmentation, while also profiting from source-side context information. In an attempt to additionally reward partially well-formed phrases of the type p1, such phrases are assigned a bonus value, which is realized as an additional context feature function consisting of the values 0 and 1. As mentioned earlier, this feature differs from the usual form of contextual conditioning as it does not represent translation probabilities conditioned on a specific context, but the general property of being (partially) well-formed.

Figure 6: Conditioning on overlapping or non-overlapping chunks (schematic: the phrase p1 starts at the boundary of the adjacent chunk ncb, while the phrase p2 starts inside the overlapping chunk nco).

Figure 7: Example for conditioning on overlapping chunks: he will [ vote for ]p the report (p overlaps a vco) vs. i cast my [ vote for ]p the report (p overlaps an nco).
In addition to using left side context, phrase-translation pairs can also be conditioned on the status of the phrase itself. Although the set of chunks is more compact than the set of pos tags, conditioning on the chunks a phrase consists of would be too complex. Instead, we choose a classification into phrase is a complete chunk or a sequence thereof vs. phrase is not a complete chunk or sequence thereof. Thus, there are only two contexts, being a constituent-like phrase or not, on which the phrase can be conditioned. We will refer to this setting as const. when discussing the results. Again, we introduce an additional bonus feature in order to give a better weight to linguistically well-formed translations. While chunks are a promising generalization of pos tags and might also help to find a better segmentation, they have two major disadvantages: the problem of overlapping makes it difficult to find a uniform definition of contexts (adjacent vs. is part of), and a phrase can be considered a complete chunk in one sentence but might only be a partial chunk in another sentence, as can be seen in the following example:

(1) [with [an internal market]nc ]pc [of [500 mio consumers]nc ]pc
(2) [consumers]nc [are provided]vc [with [consistent quality]nc ]pc

In the first example, the word consumers is only a part of a larger phrase, while it is a complete chunk in the second sentence. In this case, the splitting into a complete chunk vs. a partial chunk leads to a loss of data, as it introduces two different distributions of the word consumers. Of course, one could think about implementing a special treatment for the head words of chunks, trying not to lose translations of single words, but this would contradict the idea of using chunks.
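A sketch of the const. classification, assuming top-level chunks are given as sorted, non-overlapping (start, end) spans (the representation is our choice):

```python
def is_constituent(phrase_span, chunk_spans):
    """A source phrase counts as constituent-like iff it is exactly one
    complete chunk or a concatenation of complete chunks. chunk_spans
    must be sorted, non-overlapping top-level spans (end exclusive)."""
    start, end = phrase_span
    pos = start
    for c_start, c_end in chunk_spans:
        if c_start == pos:          # next chunk starts where we stand
            pos = c_end
            if pos == end:
                return True         # phrase covered by complete chunks
            if pos > end:
                return False        # phrase ends inside this chunk
    return False                    # gap: phrase boundary inside a chunk
```

As discussed above, the decision is made per phrase and per sentence, so the same word string can be classified differently in different input sentences.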
In contrast, distinguishing between words being a complete phrase or only a part thereof might prove useful in some situations: for example, a verb that is part of a verbal chunk such as to find or should find is likely to be non-finite, whereas it can be expected to be finite when it is a single-word constituent, since there are no other verbs in its immediate proximity. In an attempt to avoid splitting probability distributions of essentially the same phrases, we also experiment with using the original translation probabilities in combination with assigning a bonus value for syntactically well-formed source phrases in order to find an optimal segmentation of the source sentence. The decision whether a phrase is a chunk or not is made specifically for each phrase in each source sentence. Thus, a good segmentation for each sentence is proposed to the system by assigning a bonus value to each phrase with a corresponding chunk.

7.1.3 Results and evaluation

Unfortunately, the chunk-based contexts did not help to improve translation quality. While they generally did not harm the performance, there was no improvement either. Similarly to the results of pos-based or word-based contexts, the reduced settings showed only modest improvement. Since chunks normally consist of several words, experiments on a length-one system do not seem very promising, especially when trying to improve phrase segmentation. Table 25 shows the results for the different variants. Generally, the versions with the bonus feature seem to work slightly better. As already observed before, a significant improvement in a reduced setting does not necessarily correspond to a significant improvement in the full setting. In this case, the scores of the full setting systems 1-chunk-left+bonus and const+bonus (columns (2) and (4) in table (b)) are almost exactly the same as the baseline, while the systems using only the new translation probabilities yielded better scores (columns (2) and (4) in table (a)).
Although the criteria for receiving a bonus point are different in the 1-chunk-left+bonus and const+bonus systems (partially well-formed vs. well-formed), the bleu scores of the full systems are basically the same. Also, the standard bleu scores of all full setting systems lie within a smaller range than the scores of the one-feature systems, suggesting that the feature functions that were left out in the one-feature setting are 'stabilizing' the quality of translation regardless of the context feature. Of all systems with a full parameter setting, the version using the original translation probabilities in combination with the bonus feature works best (column (5) in table (b)).

(a) 1 feature
              bl ϕo   (1) 1 chunk left   (2) 1 chunk left + bonus   (3) const.   (4) const. + bonus   (5) ϕo + bonus
bleu          12.45   12.68              12.96                      12.45        12.84                12.54
pos-bleu      34.76   34.83              34.56                      34.46        34.07                34.35
lem-bleu      15.95   16.30              16.28                      15.87        16.30                15.91
nc-segm.      1624    1815               2168                       1662         2840                 2926
nc-vc-segm.   142     145                164                        131          116                  136
vc-nc-segm.   74      69                 100                        79           75                   99

(b) all features
              bl ϕo   (1) 1 chunk left   (2) 1 chunk left + bonus   (3) const.   (4) const. + bonus   (5) ϕo + bonus
bleu          13.84   13.79              13.85                      13.78        13.83                13.94
pos-bleu      36.72   36.15              36.72                      36.34        36.83                36.94
lem-bleu      17.34   17.29              17.39                      17.22        17.47                17.62
nc-segm.      1168    1403               1703                       1176         1929                 2406
nc-vc-segm.   148     138                188                        170          215                  190
vc-nc-segm.   78      70                 106                        73           80                   99

Table 25: Results of experiments with chunk based context features. Table (a) displays scores of a system using the modified translation probability as single feature (plus bonus feature), table (b) shows results of the full setting. The bonus feature refers to partially well-formed phrases in the 1-chunk-left context and to constituents otherwise. Bold printed numbers are significantly better than the baseline (bl), i.e. the original distribution. The lower half of each table shows the number of translation units corresponding to linguistic segmentation.
While the standard bleu score is still in the same range as the baseline, the two alternative bleu scores are better, although they fail to be statistically significant. Table 25 also shows an overview of how many translation units actually correspond to linguistic phrases. For this evaluation, we counted all source sentence segments over the entire test set that were either a noun chunk (nc-segm.) or a combination of a verb chunk and a noun chunk (nc-vc-segm., vc-nc-segm.). While these three types of chunks are not fully representative of the whole array of possible chunk combinations, they are among the most important ones. Since we look at both single chunks and combinations thereof, we try to determine whether both longer and shorter phrases benefit from being conditioned on chunk-based context features. The numbers for ncs are similar in the one-feature system and the full setting: with every context, more of the translated phrases are noun chunks, especially in the systems using the bonus feature. The same tendency applies to phrases consisting of combined chunks. However, there does not seem to be a strong positive correlation between the number of linguistically well-formed translation units and the bleu score: for example, all full systems with a bonus feature tend to have noticeably more well-formed translation units than the baseline, but still have more or less the same score as the baseline system. Similarly, the one-feature system using translation probabilities from the original distribution ϕo and bonus values (column (5) in table (a)) tends to have more well-formed translation units, but its bleu score is barely higher than the baseline score.
In contrast, the two one-feature systems conditioned on either the chunk on the left or on phrases being constituents or not, in combination with the bonus feature (columns (2) and (4)), have significantly improved bleu values, whereas the corresponding systems without a bonus feature have lower results. These results suggest that chunk based context features might help to increase the number of linguistically well-formed translation units, but also that the modified segmentation does not guarantee better translation results. In [Gimpel and Smith, 2008], constituent-based context features are mentioned, although no detailed description is provided, either of the design of these features or of their usefulness. However, the fact that these contexts were not explicitly listed suggests that they were not found to be very useful. Similarly, the syntactic features implemented by [Max et al., 2008] did not meet their expectations. Given the disappointing results of the chunk based contexts, we did not carry out similar experiments with chunk based contexts on the right side of a phrase.

7.2 Granularity of context features

In the pos tag based experiments in chapter 6, we used a relatively fine-grained tag set. The description of verbs is especially detailed, differentiating between the two auxiliary verbs to be and to have and the different verb forms: finite, infinitive, -ing form and past participle. While rich context features offer detailed information and thus the possibility to find appropriate translation candidates for specific contexts, contexts that are too fine-grained could also split up essentially equal distributions and thus reduce the benefit of context information. Since the first attempt to generalize context features by using chunked data turned out not to work very well, presumably due to problems caused by overlapping, we experimented with reducing the tag set.
It is difficult to decide on a level of granularity, as one could always come up with an example demonstrating the need for either a more detailed tag set or a more general representation. As our approach only works with context templates (e.g. 1 pos left), but does not take into account the informativeness of the specific realizations of this template, a context is considered good if its different realizations work well on average. The task of finding the right level of granularity is similar, as we cannot expect to find a solution that guarantees an optimal context for every phrase, but only a setting that gets good results on average. By merging the tags for nouns (nn) and personal pronouns (pp) into a single tag (np), words with similar properties are combined to represent the equivalent of a noun phrase: in many cases, it is not relevant whether a phrase was seen adjacent to e.g. the man or he; thus, there is no need to differentiate between the pos tags of these two types of context. The second object of reduction is the group of auxiliary verbs, which are simply merged into the single tag aux. In order to capture the effect of merging pos tags with similar properties, phrase translation probabilities were only conditioned if the context was part of either the aux or the np category, whereas in all other cases, the original translation probabilities were kept. By conditioning only on those pos tags we are interested in (labelled 'isolated context' in table 26), their translation probabilities are the only ones to change in comparison to the baseline, and we hope that the differences caused by the varying granularity of the tags are more noticeable in this isolated setting. Table 26 shows the results for a full system (i.e. all features and no restriction on the phrase length) when the context used for conditioning is one pos tag on the left side of the phrase.
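The two merges and the 'isolated' conditioning can be sketched as follows; the prefix-based matching of auxiliary tags is our simplification of the actual tag inventory:

```python
def generalize(tag, merge_np=True, merge_aux=True):
    """Map a detailed pos tag to its merged category: all auxiliary
    tags (to be: vb.*, to have: vh.*) become aux; nouns (nn) and
    personal pronouns (pp) become np; every other tag is kept."""
    if merge_aux and (tag.startswith("vb") or tag.startswith("vh")):
        return "aux"
    if merge_np and tag in ("nn", "pp"):
        return "np"
    return tag

def isolated_distribution(left_tag, phi_by_context, phi_o):
    """'Isolated context' setting: condition only if the merged tag
    falls into one of the categories of interest; otherwise keep the
    original distribution phi_o."""
    merged = generalize(left_tag)
    if merged in ("aux", "np"):
        return phi_by_context.get(merged, phi_o)
    return phi_o
```

In the isolated setting, only phrases to the right of an auxiliary, noun or pronoun receive modified probabilities, so any score difference relative to the baseline can be attributed to those contexts alone.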
isolated context:
  vb.* vh.*   13.44
  aux         13.89
  va.*        13.88
  vb vh       13.83
  nn pp       13.39
  np          13.72
all pos:
  aux         13.82
  np          13.97

Table 26: Different levels of context granularity; the systems were conditioned on 1 pos on the left. Conditioning on all realizations of the detailed tag set results in a bleu score of 14.01; the bleu score of the baseline is 13.84 (cf. chapter 6). (The detailed tags vb.* vh.* and nn pp were marked with a gray background in the original table.)

When looking at the subtable with the results for the isolated contexts, it can be seen that systems conditioned on either the simple tag aux (or variants thereof) or np achieved better results than those conditioned on the more complex tags vb.* vh.* and nn pp. This outcome suggests that a tag set containing less detailed tags for auxiliary verbs is a better inventory for contexts; similarly, the combined tag np seems to be more effective. The reduction of the detailed auxiliary tags to the category aux means a considerable loss of information. However, variations in the design of the aux tag did not change the results in the case of isolated contexts: bleu scores remained stable if all auxiliary forms were merged (aux, 13.89), if a distinction between have and be was made (13.83), or if the different verb forms were kept without differentiating between have and be (13.88). However, there is no corresponding improvement when translation probabilities are conditioned on all possible contexts: when conditioning on all pos tags with the aux tag instead of detailed auxiliary tags, the resulting bleu score (13.82) is almost identical to the result of the baseline (13.84), i.e. the improvement achieved with the detailed, complete tag set (14.01) is lost. In contrast, substituting nouns and personal pronouns with the more general tag np did not lead to an improvement corresponding to the result when using no other contexts, but scores nearly the same as the setting with the more detailed tag set (13.97 vs. 14.01).
This indicates that the np tag is equivalent to the noun and personal pronoun tags it replaces, whereas the introduction of the aux tag seems to harm the performance when conditioning on all pos tags, since its result is worse than using the individual auxiliary tags. However, this does not correspond to our observations for the isolated context, where a reduction of the complexity of tags led to an improvement. Given that the scores of 14.01 or 13.97 are not significantly better than the baseline's 13.84, it is difficult to draw firm conclusions from these results; but it seems that merging all auxiliary tags into a single one is too coarse, whereas the introduction of the np tag has no influence. Considering the fact that the results of the experiments with word-based context features are in the same range as the results of entirely pos-based context features, it is obvious that a very rich inventory of possible contexts is not harmful, but on the contrary offers a wide range of lexical information. In consequence, a detailed set of pos tags should not be harmful either. However, the main objective of using pos tags is to maximally exploit the available training material by generalizing linguistic concepts, and a too detailed tag set conflicts with this idea. It is difficult to find an explanation why reducing the context complexity leads to better results in the isolated setting, whereas there is no corresponding improvement in the full setting conditioned on every possible context. It might be possible that the unconditioned translation probabilities and the probabilities conditioned on only one context are too different and therefore interfere.
Since the distributions are likely to differ more when conditioned on a detailed context inventory than on a more general context, this difference is more harmful in the detailed setting, whereas the probabilities obtained when conditioning on the generalized contexts aux or np are more similar to the original distribution, so that a combination might work better. The assumed side effect of combining probability distributions of different quality could be explained by the fact that translation probabilities of context-aware phrase translation pairs tend to be higher, being estimated on a smaller data set. Those phrase-translation pairs then have an 'unfair advantage' in comparison to the phrase-translation pairs with the original translation probabilities and might be more likely to be chosen by the system despite not always being the optimal choice. Enhancing the translation probabilities of certain subgroups of phrases to a greater extent than other subgroups has an interesting effect on the segmentation of the input sentence, which will be discussed in section 7.4: in the course of this discussion, we will find that the translation probabilities of shorter phrases are enhanced more than the probabilities of longer phrases, which causes the system to prefer a segmentation into comparatively short units.

7.3 Analysis of specific context realizations

In this section, we focus on specific realizations of the context template 1-pos-left: in the previous experiments, every pos tag was considered to be equally important. However, we assume that some pos tags provide more information than others. For example, after the tag to, there will always be the infinitive form of a verb; this means that this tag is helpful for identifying verbal translations (e.g. the deal vs. to deal); it additionally indicates that the appropriate verb form might be infinitive rather than finite.
Similarly, if a word has a determiner (dt) on its left side, it is very likely that this word is either a noun or an adjective, whereas other tags such as adverbs or conjunctions can turn up in various positions and are thus less informative. Since it is difficult to measure the usefulness of a specific context realization, we resort to a method similar to that of the previous section, comparing the output of two systems where only one pos tag is used for conditioning. One setting corresponds to the experiments presented in the previous section: for phrases appearing to the right of context c, the conditioned probability distribution ϕc is used; otherwise, we switch back to the original distribution ϕo. In the contrastive setting, phrases appearing to the right of context c are assigned translation probabilities conditioned on c as well, but phrases with a context different from c are conditioned on the complement c̄ of c, i.e. the set of all other pos tags. In this section, we do not attempt to find a set of especially useful context realizations of the template 1-pos-left, but to find evidence for the assumption that good contexts enhance literal translations and exclude faulty phrase-translation pairs. As a consequence, the complementary set of contexts should contain most of the bad phrase-translation pairs and thus negatively influence the translation performance. If the context c happens to be a good context, then we would assume that conditioning on c assigns higher translation probabilities to appropriate translations, whereas less suited ones should be either excluded from the list or receive a comparatively low translation probability.
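The two contrastive settings can be sketched as a simple lookup; plain dicts stand in for the phrase table, and all names are our assumptions:

```python
def pick_distribution(left_tag, c, phi_c, phi_cbar, phi_o, use_complement):
    """For phrases to the right of context c, use phi_c; for any other
    context, either back off to the original distribution phi_o (first
    setting) or use phi_cbar, the distribution conditioned on the
    complement of c (contrastive setting). Arguments phi_* are dicts
    {translation: probability}."""
    if left_tag == c:
        return phi_c
    return phi_cbar if use_complement else phi_o
```

Comparing the two settings isolates the contribution of the complement distribution: any score drop in the contrastive system must come from the phrases conditioned on c̄.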
In contrast, we would expect a bad context to contain a proportionally larger amount of inappropriate translations and thus to have a flatter probability distribution, giving less probability mass to good translation candidates, due to the fact that a large part of all possible good translations have been seen in the good context. While we cannot prove or disprove this assumption, the classification into conditioned on c and conditioned on c̄ is an attempt to understand the distribution of all translation candidates that are excluded by a (supposedly) good context. Note that in this case we do not intend to disambiguate polysemous words, but compare the quality of probability distributions for plain, simple phrases whose translations all share the same basic meaning. Obviously, this concept is overly simple, as it does not take into account that there could be more than one good context, and it also leaves out the fact that in the case of phrases like fine, there are at least two distributions that are merged into a single one when not disambiguated. To take an example, we saw that the translation candidates of the phrase divided (cf. section 6.1, on page 57) with an auxiliary verb on the left side only consisted of valid options with higher translation probabilities and that all nonsense translation possibilities such as comma were excluded by this context. This is especially interesting when considering that the available translation candidates of certain groups of words, e.g. verbs, are often not very good and contain a considerable amount of bad translation possibilities. Experiments were carried out with the contexts dt (determiner), aux (auxiliary verbs) and to, since they have a high frequency and might turn out to be good contexts, being close to either nouns or verbs, which are important for the understanding of a sentence.
The results in table 27 show that the version that is partially conditioned on c̄, with c = dt, is considerably worse than the version switching back to the original distribution ϕo. This suggests that c = dt is actually a useful context, since it seems to keep a large part of the good and appropriate phrase-translation pairs, which allows a good or even better estimation of translation probabilities, while the distributions based on c̄ seem to provide worse translation options. The difference between the two systems with c = aux is not as large, and the results with c = to are almost identical. This might be due to the fact that alignment structures including verbs are generally critical and that the distributions of verbs contain a certain number of faulty or imprecise phrase-translation pairs which also appear within a good context. Especially the to context does not seem to work very well, although in theory, one might consider it to be relatively useful. An explanation might be that the context c = to is likely to introduce impersonal verbal constructions whose structure often cannot be preserved during translation and which are thus prone to bad alignment. Table 28 shows the 10 most probable translation candidates for the phrase discovered, conditioned on the context aux or its complement.

                                       c = dt   c = aux   c = to   bl
context = c: ϕc and context ≠ c: ϕo    13.98    13.89     13.75    13.84
context = c: ϕc and context ≠ c: ϕc̄   13.42    13.70     13.69

Table 27: Results of full systems conditioned on context c or its complement c̄.
ϕo                       ϕc                        ϕc̄                       ϕpp
entdeckt       0.183     entdeckt        0.280     entdeckt       0.095     festgestellt   0.092
festgestellt   0.078     festgestellt    0.075     festgestellt   0.080     feststellten   0.061
entdeckten     0.036     aufgedeckt      0.059     entdeckten     0.068     feststellen    0.061
aufgedeckt     0.034     gefunden        0.042     feststellen    0.042     entdeckt       0.051
feststellen    0.028     erkannt         0.034     entdeckte      0.042     haben*         0.041
gefunden       0.026     herausgefunden  0.021     erfahren       0.027     entdeckten     0.041
erfahren       0.022     erfahren        0.017     haben*         0.023     erfahren       0.031
entdeckte      0.022     entdeckung      0.017     feststellten   0.023     entdeckte      0.031
erfahren       0.020     feststellen     0.013     fest*          0.019     habe*          0.020
erkannt        0.020     entdecken       0.013     ,*             0.015     fest*          0.020

Table 28: Translation candidates for the phrase discovered, sorted according to their translation probability. The context (1-pos-left) in the first three columns is c = aux; in the last column, translations were conditioned on personal pronouns (pp). Invalid translations are marked with * (shown with a colored background in the original).

ϕo and ϕc are relatively similar, with the exception that the translation probabilities in ϕc are higher, especially the translation probability of the top entry: this is consistent with the observation that literal translations are enhanced. All translation candidates listed for ϕo and ϕc are valid: most of them are variations of entdecken (to discover), feststellen (to detect), aufdecken (to unveil) or finden (to find). However, in ϕc̄, the translation probability of the top entry is considerably lower, while the probabilities of the following entries are more or less the same as in the other settings. This means that the top translation of ϕc̄ is considerably less probable in comparison with the second entry. Furthermore, there are three invalid translation candidates among the top 10: the auxiliary verb haben (to have), the particle fest of the particle verb feststellen (to detect) and comma.
While these invalid options have comparatively small probabilities and might therefore only have a small chance of being chosen, at least haben and the comma are highly frequent and might therefore obtain high scores from the language model. These observations suggest that phrases like have discovered or was discovered are more likely to provide valid translations than e.g. the phrase he discovered. Table 28 also shows the probability distribution of discovered conditioned on a personal pronoun on the left side, which is similar to ϕc̄. The difference between ϕc and ϕpp might be partially due to the fact that discovered preceded by an auxiliary is more frequent than discovered preceded by a personal pronoun (239 vs. 98 occurrences), enabling a better estimation of translation probabilities within the context aux. But the primary reason might lie in the actual translations that are typically triggered by either auxiliaries or personal pronouns. Assuming that the syntactic structures of pp/aux + discovered and the respective translations in the training data are mostly identical, we find that the combination aux + discovered allows for less variation than pp + discovered. Preceded by an auxiliary, discovered is most likely to be translated by the participle form of a verb with corresponding meaning; German past participles are not subject to morphological variation¹². However, if discovered is to be translated as a finite verb after a personal pronoun, the candidate translations vary in number and person. Additionally, particles and verbs might be separated (depending on the constituent order of the sentence), probably causing incorrect alignments of the English verb with a German particle. In contrast, particles are always part of a past participle, and thus this form of error is diminished in distribution ϕc.
While the top translation candidates observed in the context pp in table 28 are by no means incorrect, the probability distribution produced by this context is flatter due to the morphological variations and, in contrast to ϕc, contains three invalid entries: the particle fest and forms of the auxiliary haben (to have). When conditioning on aux, we would expect a German auxiliary verb (if present) to be aligned with the English one, thus leaving the possibility to correctly align the main verbs. However, in the case of ϕpp, if there is no auxiliary on the English side, German auxiliaries might end up aligned with the English main verb, especially if the German auxiliary is roughly in the same position as the English verb and the German main verb is positioned far from its auxiliary. These assumptions are confirmed when looking at the entries in table 28: the seven top-ranking translation candidates in column ϕc are participles of different verbs with the meaning of discover, followed by the noun entdeckung (discovery) and two infinitives. In contrast, the top entries of ϕc̄ and ϕpp contain different forms of the same verbs, mixed with invalid translations. Particularly when conditioning on personal pronouns, the first three entries are forms of the same verb feststellen (to detect), followed by different forms of the verb entdecken (to discover). Summarizing, we found evidence that the context aux used for conditioning an English participle like discovered is likely to produce a good translation probability distribution, i.e. a non-flat distribution without invalid translation candidates in the top-ranking positions. This was indicated by the form of ϕc, and illustrated with the specific example of ϕpp.

Use of context information for alignment
Information about which pos tags are useful contexts is obviously interesting when using contextual information to refine translation probabilities.
But knowing which contexts are likely to produce reliable alignment structures could also help to improve alignment quality: alignment tools are trained with the em-algorithm, which basically starts with a set of uniformly initialized parameters, produces alignments for the training data and re-estimates all parameters based on the preceding alignment run until it converges. The re-estimation step could benefit from information about the reliability of alignments in certain structures by giving a higher weight to ‘safe’ alignments.

¹² When used as adjectives, past participles have to agree in gender, number and case with the rest of the phrase. However, this possibility is excluded when preceded by an auxiliary verb.

segm. length      bl   1 word    1 pos   1 word   1 pos   1 p. l.   1 p. r.
                        left     left    right    right   interp.   interp.
1              5.767   13.780   12.944    7.143    7.191    6.133     8.648
2              4.841    3.769    3.195    4.272    4.414    4.687     4.462
3              2.095      895    1.256    1.952    1.880    2.062     1.623
4                585      193      352      591      595      579       461
5                145       38       86      173      152      154       114
6                 30        7       10       31       30       31        23
7                  4        –        1        7        4        6         4
average        1.857    1.339    1.401    1.765    1.753    1.832     1.631
bleu           13.84    13.84    14.01    13.98    14.05    13.98     13.96

Table 29: Length of segments and average segment length for different full systems.

7.4 Phrasal segmentation
Segmenting the input sentence into translation units is a crucial step in smt. Generally, larger phrases are better since they capture longer expressions, providing more context. Unfortunately, our method of contextual conditioning favors short phrases. As can be seen in table 29, there are large differences in the number of single-word phrases, and consequently in the average segment length, between the baseline system and the context-aware systems. The table shows the length of the translation units for the baseline (bl) and for systems conditioned on a pos tag or word on either side of the phrase. For the computation of the translation probabilities of these systems, the established method with the threshold of n = 10 was used.
In the two columns on the right, segmentation statistics of systems using the same context (one pos tag on the left or right side), but with interpolated translation probabilities, are listed. In addition to the preference for shorter translation units, there is also a discrepancy between left-side contexts and right-side contexts: while conditioning on pos tags or words on the right side already leads to a noticeable increase in single-word phrases, the systems conditioned on pos tags or words on the left side use more than twice as many single-word phrases as the baseline system. This leaves us with two questions to answer: why does contextual conditioning lead to an increased number of single-word phrases, and why is there such a difference between left-side and right-side context? Since the parameters for segmentation are determined during parameter tuning, which is highly complex and influenced by a multitude of factors, it is not easy to find an explanation. However, in the previous evaluations, it became apparent that, at least in the case of non-ambiguous source phrases, the translation probabilities of the top translations are generally greatly enhanced. Translation probabilities of short phrases, especially single-word phrases, seem to be increased to a greater extent than those of longer phrases. Given that we introduced a threshold to prevent low-frequency phrases from being over-estimated, longer phrases are often excluded from conditioning for not meeting the threshold criterion, whereas short phrases generally occur a sufficient number of times and, as a result, are assigned higher translation probabilities. Considering that long translation units often already have relatively high translation probabilities, it seems somewhat counterintuitive to complain that their translation probabilities are not increased as well.
However, rarely seen phrases often have lower scores in the reverse translation probability or in the lexical weights, since these features were specifically added to counteract overestimation. While all scores of low-frequency phrases remain the same, the translation probabilities of all other phrases (presumably rather short ones) are boosted and, as a result, have an advantage over phrases with the original translation probability. During parameter tuning, the (shorter) phrases with the new translation probabilities seem more promising, causing the system to arrange the parameters so as to prefer short phrases over longer ones. This assumption is strengthened by the observation that, for the 1-pos-left context, using interpolation with a count-based interpolation weight instead of a fixed threshold leads to a more moderate number of single-word phrases and an average phrase length that is comparatively close to the baseline. Now that all translation probabilities are modified, the distributions for shorter and longer phrases are more alike, which means that shorter phrases are no longer at an advantage. Similarly, the 1-word-left context with interpolated translation probabilities (not shown in table 29) has a higher average phrase length. In both cases, the bleu results of the respective systems are almost identical regardless of whether interpolation or a fixed threshold is used. In the case of right-side contexts, the results are not as clear: with the 1-word-right context (not listed in the table), the segmentation length of 1.853 is closest to the baseline. However, with only 13.65 bleu points, translation quality suffered. With the 1-pos-right context, translation quality remained stable, but the number of short phrases is increased, although not as much as with left-side contexts without interpolation.
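The count-based interpolation just described can be sketched as follows; the weighting scheme n/(n+k) and all names are illustrative assumptions, not necessarily the exact factor defined in section 5.3.1:

```python
def interpolate(phi_c, phi_o, n, k=10.0):
    """Mix the context-conditioned distribution phi_c with the original
    distribution phi_o. The trust in phi_c grows with n, the number of
    times the phrase was observed in this context; the weight n/(n+k)
    is an illustrative choice of count-based interpolation factor."""
    lam = n / (n + k)
    support = set(phi_c) | set(phi_o)
    return {t: lam * phi_c.get(t, 0.0) + (1.0 - lam) * phi_o.get(t, 0.0)
            for t in support}
```

With a high context count the result approaches ϕc, while with few observations it stays close to ϕo, so low-frequency phrases are no longer categorically excluded from conditioning as they are with a fixed threshold.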
The great difference between left-side context and right-side context is very interesting and could partially be explained by differences in the quality of context features on the left or the right side of a phrase. When looking back at the results of the different evaluation metrics (bleu, pos-bleu and lem-bleu) discussed in chapter 6, we find that for the length-one systems, left-side context has higher scores than right-side context. Since this setting is independent of segmentation issues, we consider this as evidence that left-side context is more useful for a better estimation of phrase-translation probabilities: ironically, this would mean that by producing more ‘peaked’ distributions favoring short phrases, left-side contexts lead to an unfortunate phrasal segmentation. But this observation also suggests that right-side contexts might have a positive influence on segmentation, as full systems conditioned on right-side contexts are at least as good as systems conditioned on left-side context. The key might lie in the search procedure used when translating the sentence: as sentences are translated from left to right, there is basically only information about the left side (i.e. the already translated target words), which is used by the target side language model. When translating a phrase, the costs of the n best translation candidates of this phrase are calculated, as well as ‘future costs’ of subsequent translation candidates.
Phrase translation probabilities conditioned on the right side of a phrase might act as a look-ahead and thus provide information that is not available in a standard system. While the concept of future costs is designed to find a translation candidate causing reasonable costs for both the actual and the following phrase, translation probabilities conditioned on right-side contexts offer the possibility to indirectly look one position further and therefore strengthen especially suitable translations, as inappropriate candidates were not seen within the context and can therefore be filtered out. Although the average segmentation length is considerably smaller in the contextually conditioned systems, their overall performance is not as bad as one might assume, but actually tends to be better than the baseline: the single-word phrases of the modified systems are not single-word phrases in the same sense as in the baseline system, since they contain context information. The fact that the context-aware systems are relatively successful (at any rate not worse than the baseline) despite preferring small translation units could be regarded as a confirmation that context features are useful.

the
  ϕo: die (the) 0.337                    ϕc: die (the) 0.467
      der (the) 0.222                        der (the) 0.216
      den (the) 0.113                        das (the) 0.093

central
  ϕo: mittel- (middle-) 0.216            ϕc: zentrale (central) 0.348
      zentrale (central) 0.209               zentralen (central) 0.248
      zentralen (central) 0.158              mittel- (middle-) 0.129

the central
  ϕo: die zentrale (the central) 0.204   ϕc: die zentrale (the central) 0.304
      das zentrale (the central) 0.077       das zentrale (the central) 0.143
      die zentralen (the central) 0.061      die mittel- (the middle-) 0.071

the central banks (ϕo = ϕc)
      die zentralbanken (the central banks) 0.343
      den zentralbanken (the central banks) 0.239
      der zentralbanken (the central banks) 0.134

Table 30: Comparison of the three top translation probabilities in the original distribution and conditioned on 1 pos tag on the left side.
7.4.1 Example
So far, most examples included ambiguous words or otherwise interesting translation variants. Here, in an attempt to use a representative example, we chose to look at different segmentations of a simple nominal phrase whose translations are all more or less the same, differing only in terms of morphological realization or word choice. Table 30 shows the three top translation probabilities for different segmentations taken from a sentence beginning with the central banks worry ... . While the translation probabilities of the conditioned distribution ϕc generally increase, the top-ranking translation probability of the central increases to a lesser extent than those of the two single-word phrases. Furthermore, the phrase the central banks does not meet the threshold criterion and thus has the same translation probabilities in both ϕc and ϕo. The assumption that the probabilities of longer phrases are always less enhanced by conditioning is overly simple: to a large degree, this depends on the frequency of the phrases. Additionally, for local collocations or fixed word groups, we would expect contextually conditioned probabilities to benefit from conditioning to the same extent as short phrases. For example, the top translation probability of the phrase central banks, which corresponds to the German compound noun zentralbanken, rises from ca. 0.7 to 0.85 when contextually conditioned. This suggests that in addition to enhancing the top translations of single-word phrases, the translation candidates of collocations or similar structures are also boosted when conditioned. Since the words of collocational structures always occur together, they can, to a certain degree, be considered as one complex unit. However, in the case of trivial structures, translation probabilities seem to be increased to a lesser extent or are not increased at all.
In the example, we found that the translation probabilities of the and central banks gained more than the trivial phrase the central, whereas the longer phrase the central banks did not meet the threshold despite containing the non-trivial structure central banks, and therefore kept its original translation probabilities, making it less likely to be chosen. These reflections actually correspond to the segmentation of [the central banks] in the baseline system and [the][central banks] in the context-aware system (1-pos-left).

7.4.2 Comparison with other systems
In contrast to our results, [Carpuat and Wu, 2007b] and [Carpuat and Wu, 2007a] report a segmentation into longer translation units with word sense disambiguation techniques. They explain that the context-aware probabilities are better estimates and that the higher scores enable the decoder to choose longer phrases. When comparing the work presented by [Carpuat and Wu, 2007a/b] with the system discussed in the course of this work, there are two major differences, the most important of which is the method of integrating contextual features in order to refine probability distributions. While we opted for a simple maximum-likelihood model considering only one context template at a time, their approach is much more sophisticated, using a maximum entropy setting where all contexts are available, allowing the model to choose features depending on their respective usefulness for a given phrase. By requiring a minimum number of phrase-context co-occurrences, we exclude low-frequency phrases from being contextually conditioned and thus create a disadvantage for a subgroup of phrases. Since the approach presented by [Carpuat and Wu, 2007a/b] does not rely on relative frequencies but instead uses classification methods, phrases of all lengths can be treated identically.
Assuming that their classifier works equally well for short and for long phrases, all phrases can profit from the context-aware probabilities and thus offer equally high scores to the decoder. This assumption is somewhat inaccurate, as longer phrases usually provide less training data, which should negatively affect the performance of a classifier. However, there is a gradual transition in classification quality for low-frequency phrases instead of a fixed split into conditioned vs. non-conditioned phrases as in our approach. The other point is the fact that the language pairs used have different properties: English and Chinese have almost no morphology, and thus the data [Carpuat and Wu, 2007a/b] work with is relatively compact and less prone to sparse data problems. In contrast, German has a relatively rich morphology, which often leads to flat distributions. For example, in the case of adjectives, it is not unusual that the five top entries for an adjective are essentially the same word, but with different morphological features. With the translations of one source phrase split into several morphological realizations, it is more difficult to estimate translation probabilities, especially when trying not to over-estimate infrequent phrases. For the translation direction German→English, it would be possible to preprocess the German data by eliminating those morphological features that have no equivalent on the target side, such as gender and case marking. In the reverse direction, German morphological features cannot simply be eliminated, since the target sentence has to consist of full forms. When translating word stems to avoid data sparseness, morphological features need to be reconstructed on the target side. [Minkov et al., 2007] propose a method to generate morphology by modelling morphological agreement on the target side to predict fully inflected forms.
Basically, simplified data is used for translation, and a model trained on morpho-syntactic features from the source and target language is used to predict the morphological descriptions of the words in the translated target sentence. They report promising results for experiments on reference translations with the language pairs English→Russian and English→Arabic, whose target languages are both highly inflected. In [Toutanova et al., 2008], the method is integrated into a smt system. It would be interesting to combine this method with the idea of contextual conditioning. The translation direction English→German should generally benefit from the reduced data, and context features could generalize better and thus be more powerful. One could even consider ‘passing along’ context-dependent features as information for the model predicting the fully inflected forms. Since the idea presented by [Minkov et al., 2007] basically consists of the pre-processing step of simplifying the data and the prediction of inflected forms on the translated sentences, no alteration of the actual translation system is required. Thus, it could be combined with contextual conditioning without major modifications.

7.5 Feature selection
In our approach, contexts such as one pos tag on the left side of a phrase are treated as a context template, leaving out the possibility that the different realizations, i.e. the actual pos tags within this template, may be of different importance depending on the respective phrase. This means that a context template is considered to be good if the probability distributions it produces turn out to be useful on average. However, this concept is not optimal, as it ignores the fact that some context realizations are less useful than others.
In section 7.3, for example, we found that conditioning participles on auxiliary verbs is likely to improve the translation probability distribution, whereas conditioning on personal pronouns might even lead to a deterioration in comparison to the baseline. By applying criteria such as the count-based factor defined in section 5.3.1 or the factor based on type-token relations discussed in section 5.3.4, we tried to take this into account by putting less trust in presumably trivial and useless contexts. Similarly, the introduction of a simple threshold, which allows conditioning only for phrase-context combinations with a minimum number of occurrences, is an attempt to rate the individual relevance of phrase-context combinations. An essential drawback is that our approach does not allow switching between different contexts depending on their respective relevance, but only uses one fixed context for conditioning. In the experiments presented in section 6.3, we integrated two context templates as separate features into the log-linear model. This setting allowed us to give an individual feature weight to each of the context templates, and therefore the respective context templates can be used according to their relevance. Still, the feature weights obtained by parameter tuning only indicate the relevance of the context template, i.e. the averaged relevance, instead of rating individual phrase-context combinations. A maximum entropy based approach, as presented in chapter 2, would allow the system to select context features according to their relevance in the given situation. Since we mainly focus on the analysis of (individual) context templates and designed our methods accordingly, a more complex classification-based approach would go beyond the scope of this work.
However, we conducted preliminary experiments in which we attempted to rate the individual relevance of phrase-context combinations in order to determine which of the examined contexts is most promising for conditioning. The relevance of a context for a given phrase was estimated by the factor λ based on type-token relations that was defined in section 5.3.4. For our experiments, we used a system conditioned on the combination of one pos tag on the left and on the right side, as well as a system using either one or two pos tags on the left side. While the first setting is designed to combine different context features (i.e. left vs. right), the second setting could be regarded as a form of backing off between one or two pos tags on the left side, depending on which feature is more promising. First, we used the factor λ to decide which of the contexts should be used for conditioning, i.e. each phrase was only conditioned on the better context. Alternatively, the translation probabilities produced by the two different contexts were weighted according to λ and merged into a single score. However, the results of all variants were disappointing, as the bleu scores were hardly better than the baseline. Given that these context features achieved good results when used separately or, in the case of one pos tag on either side of the phrase, as combined context templates, this outcome is surprising, especially considering that, in comparison to the context template settings, translation probabilities should be either the same or replaced with better ones. Looking back at the results achieved with the type-token based factor λ applied to context templates in section 5.5, we find that the best bleu score was obtained when λ was used as the weight for an interpolation of the contextually conditioned distribution ϕc and the original distribution ϕo.
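The two variants just described, selecting only the better context versus merging the two scores, can be sketched as follows; the function name and the mixing scheme are illustrative assumptions, with the λ values assumed to come from the type-token factor of section 5.3.4:

```python
def combine_contexts(phi_left, lam_left, phi_right, lam_right, mode="select"):
    """Combine the distributions produced by two context templates,
    e.g. 1-pos-left and 1-pos-right, according to their estimated
    relevance lam_left and lam_right for the given phrase."""
    if mode == "select":
        # variant 1: condition only on the better context
        return phi_left if lam_left >= lam_right else phi_right
    # variant 2: weight the two scores by the lambdas and merge them
    w = lam_left / (lam_left + lam_right)
    support = set(phi_left) | set(phi_right)
    return {t: w * phi_left.get(t, 0.0) + (1.0 - w) * phi_right.get(t, 0.0)
            for t in support}
```

Both variants assume that λ is a reliable relevance estimate for each phrase-context combination, which, as the results show, is not the case for flat distributions.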
Attempting to recreate this setting for the 1-pos-left and 1-pos-right system, we used λ to determine the better context c and then computed interpolated translation probabilities based on ϕo and ϕc. Again, the result of this experiment did not meet our expectations. In the discussion in section 5.3.4, it became apparent that λ fails to produce good estimations for flat probability distributions. Additionally, its performance was not stable across the tested systems (cf. section 5.5), which suggests that λ is not a suitable criterion for the task of comparing two distributions. In the case of the 1-pos-left and 1-pos-right context, the differences between left-side and right-side contexts might also be of importance: we cannot rule out that the individual properties of the two context types are ‘interfering’ when merged into a single score.

8 Translating verbs: Filtering translation options with target side information
In this chapter, we will specifically focus on the translation of verbs: a motivational example illustrates the general problems of translating German verbs, which are mostly due to the different structures of German and English sentences. We then present an idea for integrating target side information in order to trigger valid translations for verbs, followed by an evaluation of our method, which reveals that conditioning on target side features works in principle, but is not overly successful due to the structural differences of the two languages. When translating between English and German, it is generally difficult to find good translations for verbs. This is mostly due to the fact that German constituent order can vary greatly and therefore does not always correspond well to the word order on the English side. The main problem is that in German sentences, the verb is often at the very end of the sentence, while the corresponding English verb is at the beginning, in close proximity to the subject.
It is difficult to align such structures, and as a result, the phrase-translation estimates are often not very good, since a considerable number of correct alignments is lost while wrong alignments are extracted. Considering more information than the mere number of occurrences of phrase-translation pairs could help to improve translation quality. In the previous sections, we presented experiments and thoughts about the use of source side context features. While our modifications did not turn out to be hugely successful, we can report small improvements for some of the settings. The probability distributions of verbs benefit from being conditioned on context features to a similar extent as other words; however, there are still many sentences where verbs are missing because they were translated with non-content words like auxiliary verbs, verb particles or punctuation. With respect to the criterion of adequacy, i.e. the task of reproducing the content of the source sentence in the target language, it is important that content words like verbs are not ‘lost’ during the translation process. If it is known that a source phrase contains a verb, this knowledge can be used to try to enforce appropriate translations. As a valid translation of a verb needs to express the meaning of the source phrase, we demand that translation candidates contain at least one content word. This means that translation probability distributions are not only conditioned on the source side (contains a verb vs. does not contain a verb), but target side information is considered as well: by pos-tagging the German part of the training data, content words (nouns, verbs, adjectives) can easily be identified. The extent to which information about phrases on the target side can be included is very limited; it can only consist of a description of the phrase itself (e.g. pos tags), but not of adjacent phrases, as nothing is known about the phrase’s final position in the translation.
While this is not strictly the same sort of context as before, it still shares the basic idea of reducing the set of translations and enhancing translation probabilities for good translations by means of additional information. 101 input bl gloss ref [mitt]1 [romney]2 [accused his]3 [opponent]4 [rudy]5 [giuliani]6 [ of turning ]7 [new york]8 [into a ”]9 [save haven]10 [for illegal immigrants]11 [” .]12 [mitt]1 [romney]2 [beschuldigte seine]3 [gegner]4 [rudy]5 [giuliani]6 [ aus ]7 [new york]8 [in einen ”]9 [sicheren hafen]10 [für illegale einwanderer]11 [” .]13 [mitt]1 [romney]2 [accused his]3 [opponent]4 [rudy]5 [giuliani]6 [ from ]7 [new york]8 [into a ”]9 [save haven]10 [for illegal immigrants]11 [” .]12 er habe new york zu einem ”zufluchtsort für illegale einwanderer” gemacht , wirft mitt romney seinem kontrahenten rudy giuliani vor. Table 31: Baseline translation with omitted verb. 8.1 Basic idea In order to get a picture of the scale of the problem of the structural differences of English and German and the resulting poor alignment quality, we discuss the example in table 31. As can be seen, the verbal construction of turning is translated as the ‘meaningless word’ aus. The given example was translated with the baseline system, but the results of the contextually conditioned systems are comparable and also fail to produce a valid translation for of turning. Admittedly, it would hardly be possible to translate this phrase into the same structure as in the reference translation. The part of turning new york into a ”save heaven for illegal immigrants” is marked in the reference translation, with the verbs in bold face. The main verb gemacht (made) is at the end of the subordinated clause, while the subject er (he) and an auxiliary verb habe (had) are at the beginning. Here, the impersonal construction of turning is translated entirely differently, involving verbs at different positions in the sentence. 
As in most cases, there are many good alternative translations; a simple possibility is given below:

    of turning new york into a ”save haven for illegal immigrants”
(1) new york in einen ”sicheren hafen für illegale einwanderer” zu verwandeln
(2) zu verwandeln new york in einen ”sicheren hafen für illegale einwanderer”

By choosing the more or less literal translation of turning→zu verwandeln, we avoid the problem of a ‘discontinuous’ translation with several verbs as in the reference translation in table 31, and more or less preserve the structure, as zu verwandeln is impersonal as well. Since reordering is expensive, we would rather expect zu verwandeln to be placed at a position similar to that of its equivalent of turning, as illustrated in (2), instead of the correct position in (1). Yet, even the ungrammatical version given in (2) would be a great improvement over the translation in table 31, as it expresses at least the semantic content of the source phrase. Looking at the phrase table entries for of turning, we find that the translation probability for of turning→zu verwandeln is relatively high. But it also becomes evident that there are a lot of faulty phrase-translation pairs resulting from bad alignment. In the original distribution, the most probable translation candidate is the comma, followed by the valid translation zu verwandeln and some other non-content words.

ϕo      trg             gloss          ϕc      trg             gloss
0.074   ,               ,              0.143   zu verwandeln   to turn
0.056   zu verwandeln   to turn        0.095   machen          make
0.056   , dass          , that         0.095   die umsetzung   the conversion
0.037   , um            , to           0.048   zu wenden       to turn
0.037   , sich          , itself       0.048   macht           makes
0.037   machen          make           0.048   vorhaben ,      to intend

Table 32: Translation candidates for of turning sorted according to their probability: the left table shows the entries of the baseline system, the right table contains probability estimations based on content words only.
When filtering out non-content words and estimating phrase-translation probabilities based only on content words, as shown on the right side of table 32, zu verwandeln is the top entry, and the other candidates mostly seem relatively reasonable, too. One might argue that not all content words are appropriate translations for a verb: there might still be random alignments or disrupted collocational structures which cannot provide a complete translation, such as e.g. decide→treffen (make) instead of entscheidung treffen (make a decision). However, the highest-ranking candidates in the second table have considerably more potential for becoming a good translation, with a very good candidate on top: we hope that by eliminating truly bad candidates, the remaining top-ranking translations have a better chance of being chosen by the system, even though their targeted position in the sentence is not ‘natural’. Another point of criticism could be our assumption that valid translations of verbs have to contain content words: with a Romance language as source language, we might face the phenomenon of chassé croisé, in which a verb of directional movement can be translated as a preposition expressing a direction, as illustrated by the following example:

il a traversé la place en courant.
he ran through the square.

Since this is a specific example that is of no immediate relevance for our language pair, we will not go into detail, but regard this example as a reminder that non-isomorphic translation structures sometimes allow verbs to be translated as non-content words: in section 8.2, we will come back to this point.

Realization
Since we do not know how strongly the elimination of non-content words will affect the system, we attempt to keep our modifications as minimal as possible.
Related experiments focusing on translating only linguistically well-formed phrases found that the elimination of non-well-formed phrases from the translation table had a negative effect on the system, as non-intuitive segmentation often works better (cf. [Koehn et al., 2003]). With this information in mind, modifications will be applied only to verbs; this is also due to the fact that the translation of e.g. nouns is far more reliable. Furthermore, there are no conditions imposed on the number of verbs on the source side or the number of content words on the target side, nor do we differentiate between the different content words on the target side, assuming verbs, nouns and adjectives to be equally important. In order to identify content words in the target language phrases, the German part of the parallel corpus was pos-tagged and the relevant tag information was included in the extracted phrases. As in our previous experiments, the new probability distributions are dependent on the input sentences: When computing translation probabilities for a phrase containing a verb, phrase-translation pairs without at least one content word on the target side are ignored. By doing so, we hope to bring out the literal translation while excluding all candidates that would be wrong in any situation. It is important to point out that the estimation of translation probabilities is not conditioned on the source language part of the extracted training data being a verb or e.g. a noun, but solely on the phrase in the input sentence. Otherwise, we would split the distribution of a word like e.g. deal into two separate distributions for the deal (noun) and to deal (verb). Since both have the same basic meaning, splitting the seen occurrences would mean a loss of data. Furthermore, there is no need to insist that a verb in the source sentence be translated as a verb in the target sentence.
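The filtering step just described can be sketched as follows. This is a minimal illustration, not the actual implementation: the phrase-pair tuple format, the subset of STTS content tags, and all function names are assumptions for the sake of the example.

```python
from collections import defaultdict

# Illustrative subset of German STTS tags counted as content words
# (nouns, full verbs, adjectives); the real tag set is larger.
CONTENT_TAGS = {"NN", "NE", "VVFIN", "VVINF", "VVPP", "ADJA", "ADJD"}

def has_content_word(target_tags):
    """True if at least one target-side token carries a content-word tag."""
    return any(tag in CONTENT_TAGS for tag in target_tags)

def reestimate(pairs):
    """pairs: iterable of (src_phrase, trg_phrase, trg_tags, count).
    Discards target phrases without content words and renormalizes,
    yielding content-word-filtered probabilities p(trg | src)."""
    counts = defaultdict(lambda: defaultdict(int))
    for src, trg, tags, c in pairs:
        if has_content_word(tags):
            counts[src][trg] += c
    probs = {}
    for src, trg_counts in counts.items():
        total = sum(trg_counts.values())
        probs[src] = {trg: c / total for trg, c in trg_counts.items()}
    return probs

# Toy counts loosely modelled on the of turning example (hypothetical numbers)
pairs = [
    ("of turning", ",", [","], 4),
    ("of turning", "zu verwandeln", ["PTKZU", "VVINF"], 3),
    ("of turning", "machen", ["VVINF"], 2),
]
probs = reestimate(pairs)
# the comma entry is discarded; zu verwandeln becomes the top candidate
```

As in the thesis, the filtering is applied at probability-estimation time, so the comma entry simply never enters the renormalized distribution.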
8.2 Evaluation

In the following sections, we will discuss and compare several variants of the general idea and also provide an error analysis.

8.2.1 Results

We carried out several experiments based on the idea of banning non-content words as translation candidates for verbs. Unfortunately, the results (presented in table 33) are not overwhelming. In version ①, exactly the method described above was applied to all phrases containing verbs in the input sentence. In version ②, we additionally opted for a threshold of at least 10 occurrences. This was considered a necessary step, as discarding phrase-translation entries is a harsh intervention: even if only undoubtedly bad entries are removed, there is no guarantee that the remaining entries are better. By demanding a certain number of occurrences, there is a better chance that the remaining entries are actually valid translations and not randomly aligned phrase pairs. While introducing the threshold did not lead to a significant gain in standard bleu, there is a considerable gain in pos-bleu. However, the results are still lower than the baseline. As already indicated, there might be situations in which the optimal translation of a verb does not contain a content word. While words such as like keep their non-content word translations when not tagged as verbs, participial prepositions are mostly tagged as verbs, but are generally translated as prepositions.

            bl     ①      ② threshold  ③ threshold     ④ bonus  ⑤ bonus
                                        + part. prep.            + src context
  bleu      13.84  13.44  13.59        13.69           13.98    13.94
  pos-bleu  36.72  36.18  36.61        36.68           36.87    37.15
  lem-bleu  17.34  16.91  17.16        17.24           17.47    17.63

Table 33: Results of the experiments focusing on the translation of verbs. Numbers in the column labelled bl are the scores of the baseline system, numbers in boldface are significantly better than the baseline. The source side context used in version ⑤ is the combination of 1-pos-left and leftmost-pos.
Participial prepositions are a group of verbal forms ending in -ing, such as concerning, regarding or following. While they can be used as ‘normal’ verbs, they also have the status of a preposition requiring a corresponding translation. This is exemplified by the following example:

(31) on a wednesday morning there were long queues at all cash registers following a sudden increase in shoppers .
(32) am mittwoch morgen gab es lange schlangen auf alle registrierkassen folgenden ein plötzlicher anstieg der käufer .

Except for the translation of following, (32) is a relatively good, more or less literal translation of the sentence in (31). For this sentence, the simplest translation would be following→nach (after), whereas the translation in (32) is as weird as subsequent a sudden increase in shoppers. Actually, the adjective folgenden is a literal translation as well, but it is not a good option in this situation. As the concept of -ing-forms does not exist in German, there is no ‘general’ way to translate them best: They might be realized as adjectives (e.g. the following year), as infinitive constructions as seen in the example in table 32, as prepositions, or they might even require a totally different sentence structure. Since participial prepositions are a closed word class, verbs with the ending -ing were checked against a list (see table 34) before translation candidates without content words were discarded. If a verbal form could possibly be a participial preposition, all translation options were kept. The results for this version are listed in column ③ in table 33; this small modification did not change much and the results are still not better than the baseline. We assume that completely discarding improbable translations is too harsh, as the remaining translations might ‘block’ good translations of other parts of the input sentence by not fitting well into the target sentence.
according, concerning, considering, consisting, corresponding, following, given, including, notwithstanding, pending, regarding, respecting, resulting, touching

Table 34: Participial prepositions that were excluded from having their non-content word translations removed

In version ④, we chose a slightly different approach: Instead of discarding non-promising translation candidates, we now opt for a ‘softer’ variant by giving a ‘bonus’ score to seemingly good translations of verbs. This means that an additional feature function is assigned the value 1 in the case of a verb → content word translation and 0 [13] if the translation consists of non-content words. All other phrases, i.e. phrases without a verb, are assigned the value 1 as well; thus, the phrases with the value 1 are the same as in the systems ①–③, but the ‘unwanted’ translations are still available, though they need to score relatively high in order to balance out the penalty feature. This might be useful if, despite the threshold, only random translations are left, or if a good (or nearly good) translation is too ‘expensive’ and blocks good translations of the rest of the sentence: in this case, a universally applicable token like a comma or a preposition allows for more possible translations than words that do not occur very often but are the only available translations. The phrase translation probabilities of this system are the same as in the baseline, and since all non-verbal phrases are assigned the value 1, they should be equally probable as in the baseline system. Verb → content word pairs also keep their original translation probabilities, but have an advantage in comparison to the seemingly bad phrase-translation pairs, which are penalized by the value of 0. This version yields better results than the previous versions ①–③, but is still not significantly better than the baseline. These results suggest that completely discarding translation candidates is too harsh.
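The bonus/penalty feature of version ④, together with the participial-preposition exemption from table 34, can be sketched as a single feature function. This is a hedged illustration: the function signature, the tag subset and the epsilon value are assumptions, not the thesis's actual implementation.

```python
# Closed class of participial prepositions (from table 34); verbs on this
# list keep all of their translation options.
PART_PREPS = {"according", "concerning", "considering", "consisting",
              "corresponding", "following", "given", "including",
              "notwithstanding", "pending", "regarding", "respecting",
              "resulting", "touching"}

# Illustrative subset of German STTS content-word tags.
CONTENT_TAGS = {"NN", "NE", "VVFIN", "VVINF", "VVPP", "ADJA", "ADJD"}

EPS = 1e-10  # near-zero penalty: the system does not accept exact zeros

def bonus_feature(src_tokens, src_is_verb, trg_tags):
    """Feature value for one phrase pair: 1 for non-verb phrases and for
    verb -> content-word translations, a near-zero penalty otherwise."""
    if not src_is_verb:
        return 1.0  # phrases without a verb behave as in the baseline
    if any(tok in PART_PREPS for tok in src_tokens):
        return 1.0  # possible participial preposition: keep all options
    has_content = any(t in CONTENT_TAGS for t in trg_tags)
    return 1.0 if has_content else EPS
```

The ‘unwanted’ translations thus remain in the search space but must score high enough on the other features to outweigh the penalty, exactly the soft behavior described above.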
This seems somewhat contradictory, since the results of the experiments presented in chapter 6 tend to be better than the baseline despite non-promising translation candidates being discarded completely. However, translations that were not seen within a specific source-side context included all sorts of phrases, leaving in most cases both content word phrases and non-content word phrases for translation. Furthermore, since the remaining translation candidates were actually seen in the given context, they have a relatively high probability of actually being valid translations. In contrast, when conditioning on target-side pos tags, all discarded non-promising translation candidates are non-content word phrases, including entries that are a ‘cheap’ option for the language model. This is a particularly critical point since the verbal translations are in many cases likely to be placed at a position in the target sentence that normally does not contain verbs and thus might ‘irritate’ the target side language model. As opposed to distributions conditioned entirely on source-side context, this method of filtering does not take into account how good a translation candidate is according to the (standard) translation model, but only considers target side information and thus can favor translations that would otherwise be relatively unlikely. To a certain degree, it is our intention to enhance the translation probabilities of otherwise unlikely translation candidates, as illustrated in the motivational example. However, if all remaining translation candidates have low scores in the conventional translation features, this might negatively influence the overall translation performance.

[13] Technically, the value is only close to zero, as the system does not accept zeros.

In version ⑤, we keep the bonus/penalty feature, but replace the original translation probabilities with contextually conditioned ones.
The context used for conditioning is the combination of the two contexts 1-pos-left and leftmost-pos, which was the best setting in chapter 6. The standard bleu score is still in the same range as the baseline, but there are significant improvements in both pos-bleu and lem-bleu.

8.2.2 Error analysis

Generally, the concept of enforcing content-word translations for verbs does work, but, when looking at bleu scores, it is evident that there are also negative effects. The sentence presented in table 35 contains the verb cut, which could be translated neither in the baseline system nor in any of the contextually conditioned systems discussed in chapter 6. The translation cut → senken in this version is correct in terms of lexical choice, but not on a grammatical level, as it is not correctly positioned and the word form is not correct. In German, plural verb forms are normally identical to the infinitive form of a verb; thus there is no way to decide whether senken is a finite or an infinitive form. However, it would not be grammatical in either case: As a finite form, it would not agree with the singular subject ecb; otherwise, the sentence would lack a finite verb. But despite being incorrectly positioned, the meaning of senken is crucial for understanding the sentence, and thus its semantically correct translation can be considered an improvement.
The fact that the verb is placed at a wrong position also has a negative influence on the pos-bleu score, as can be seen in the following example:

  ref  ,   dass  die  ezb  die  zinsen          zweimal  senken  wird
       ,   that  the  ezb  the  interests       twice    cut     will
       $,  kous  art  nn   art  nn              adv      vvinf   vafin

  bl   ,   dass  die  ezb  die  zinssätze       zweimal  im
       ,   that  the  ezb  the  interest rates  twice    in the
       $,  kous  art  nn   art  nn              adv      prep

  ④    ,   dass  die  ezb  die  zinssätze       senken  zweimal  im
       ,   that  the  ezb  the  interest rates  cut     twice    in the
       $,  kous  art  nn   art  nn              vvinf   adv      prep

Matching pos tag sequences are highlighted; as can be seen, there are 7 subsequent pos tags in the baseline translation that match the reference translation. By inserting the verb senken, this sequence is disrupted. While we gain the 1-gram vvinf, several n-grams of higher order are lost. The same would apply to the standard bleu score if interest rates had been translated by the same word in the reference translation and the automatic translation. While a single example is not necessarily universally valid, and automatically translated sentences are often structured differently than the respective reference translations, this example illustrates a systematic flaw of our approach to enforce valid translations of verbs: As they are expected to appear in an incorrect position in the target sentence, bleu scores can be expected to be negatively influenced by verbs inserted into matching pos tag sequences.

  input  [they also]1 [predict that the]2 [ecb will]3 [cut]4 [interest rates]5 [twice]6 [during the]7 [course of 2008]8 [.]9
  ④      [auch sie]1 [vorhersagen , dass die]2 [ezb]3 [die zinssätze]5 [senken]4 [zweimal]6 [im]7 [laufe des jahres 2008]8 [.]9
  gloss  [they also]1 [predict that the]2 [ecb]3 [the interest rates]5 [cut]4 [twice]6 [in the]7 [course of the year 2008]8 [.]9
  ref    für 2008 rechnen experten damit , dass die ezb die zinsen zweimal senken wird .

Table 35: Example for a successfully translated verb.
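The n-gram trade-off described above can be verified with a few lines of code. The sketch below recomputes clipped n-gram matches (as in bleu's modified precision) over the three tag sequences of the example; the function names are our own, and this is an illustration of the effect rather than the thesis's evaluation code.

```python
from collections import Counter

def ngrams(seq, n):
    """All n-grams of a sequence as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def matching_ngrams(cand, ref, n):
    """Clipped n-gram matches, as in bleu's modified precision."""
    c, r = Counter(ngrams(cand, n)), Counter(ngrams(ref, n))
    return sum(min(cnt, r[g]) for g, cnt in c.items())

# Tag sequences from the example above
ref = ["$,", "kous", "art", "nn", "art", "nn", "adv", "vvinf", "vafin"]
bl  = ["$,", "kous", "art", "nn", "art", "nn", "adv", "prep"]
v4  = ["$,", "kous", "art", "nn", "art", "nn", "vvinf", "adv", "prep"]

# Inserting vvinf gains one unigram match but disrupts the long
# matching sequence, so higher-order n-gram matches decrease.
```

Running this confirms the text: the baseline matches 7 unigrams and version ④ matches 8, but ④ loses bigram (and higher-order) matches relative to the baseline.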
In this example, it is also remarkable that the system did not choose the translation candidate with the highest translation probability, cut → entzieht (withdraws), but the entry working best in combination with the noun zinssätze (interest rates): This is most probably due to the influence of the language model, as it seems to have preferred zinssätze senken over any other possibility. Another problem is the fact that the remaining translation candidates are not necessarily good translations of the verbal phrase in the source sentence: It is possible that a phrase containing a verb is translated by only ‘half’ of the required structure. This is especially the case if the source phrase contains more than one content word, e.g. consists of a verb and a noun, but the target phrase only contains one content word. As there are no restrictions on the number of content words, this is not unusual. However, it would certainly not be a good idea to demand the same number of content words on the source and the target side: this would lead to a loss of data, as multi-word expressions can be translated by single words. This observation leads to the somewhat counterintuitive assumption that it is ‘safer’ to translate verbs as short phrases containing no other content words, as there is only the single risk of picking the wrong content word, whereas multi-word phrases could be translated incorrectly or by incomplete structures. In this context, it is important to point out again that the expected position in the target sentence is not suitable for verbs in most cases and that they might thus be dispreferred when competing with a non-verbal target phrase. As the language model has not seen n-grams with verbs at this position, it is likely to give bad scores to such seemingly inappropriate verbal translations. This is illustrated in table 36, where the sentence from the previous example is translated by system ③.
This time, the phrase cut interest rates was to be translated as one unit. While the target phrase [4] meets the requirement to contain at least one content word, it fails to be an adequate translation of the English phrase. For the phrase cut interest rates, there are 5 translation candidates, all of which are more or less equally probable and at least partially correct.

  input  [they also]1 [predict that the]2 [ecb will]3 [cut interest rates]4 [twice]5 [during the course of]6 [2008 .]7
  ③      [sie]1 [vorhersagen , dass die]2 [ezb]3 [die zinssätze]4 [zweimal]5 [im]6 [jahr 2008 .]7
  gloss  [they also]1 [predict that the]2 [ecb]3 [the interest rates]4 [twice]5 [in the]6 [year 2008 .]7

Table 36: Invalid translation of a verb as a result of different segmentation.

However, in some cases, it is surprising that the system did not take the most probable translation candidate, but instead a translation with a considerably smaller probability. In an attempt to illustrate this observation, the example discussed at the beginning of this chapter is presented again. The systems used to translate the sentence in table 37 are basically the same except for the treatment of participial prepositions. While the translation of of turning with system ③ can be considered valid, it is not the top-ranked translation, but a candidate with about one third of its probability. In contrast, the translation chosen by system ② is completely random. While both have the same translation probability, stillhaltepolitik occurs exactly once in the training data, whereas the phrase zu verwandeln , is far more frequent. Again, this might be due to the strong influence of the language model, which dislikes verbal translations at this position.
This observation shows that even completely random candidates are taken for translation when far better candidates are available, suggesting that the presented approach is not capable of dealing with the problem of the translation's wrong positioning in the target sentence.

8.2.3 Comparison of the presented systems

In an attempt to measure the success of our modifications on a more adequacy-oriented level, we computed statistics on precision, recall and f-measure of correctly translated verbs and try to analyze the correlation between these scores and the different presented systems. While systems ①–③ are basically the same, ④ is a softer variant, as it does not throw away non-content word translations, but penalizes them. ⑤ is additionally conditioned on source-side information. Table 38 shows statistics about the number of correctly translated verbs.

  input  [of turning]7 [new york]8 [into a ”]9 [save haven]10 [for illegal immigrants]11 [” .]12
  ③      [zu verwandeln ,]7 [new york]8 [in einen ”]9 [sicheren hafen]10 [für illegale einwanderer]11 [” .]13
  gloss  [to turn ,]7 [new york]8 [into a ”]9 [save haven]10 [for illegal immigrants]11 [” .]12

  input  [of turning]7 [new york]8 [into a ”]9 [save haven]10 [for illegal immigrants]11 [” .]12
  ②      [stillhaltepolitik]7 [new york]8 [in einen ”]9 [sicheren hafen]10 [für illegale einwanderer]11 [” .]13
  gloss  [’policy of turning a blind eye to events’]7 [new york]8 [into a ”]9 [save haven]10 [for illegal immigrants]11 [” .]12

Table 37: Different translation options

                             bleu   p       r       f
  bl                         13.84  0.2186  0.1789  0.1863
  ①                          13.44  0.2240  0.2082  0.2060
  ② threshold                13.59  0.2329  0.2043  0.2071
  ③ threshold + part. prep.  13.69  0.2227  0.1908  0.1953
  ④ bonus                    13.98  0.2063  0.1696  0.1759
  ⑤ bonus + src context      13.94  0.2131  0.1696  0.1773

Table 38: Precision, recall and f-measure of correctly translated verbs. The respective highest scores are in boldface.
Similarly to chapter 6, precision, recall and f-measure for verb lemmas were computed on sentence level and then averaged over the entire test set. Interestingly, the number of correctly translated verbs does not seem to correlate with bleu scores. As can be seen in the table, the system with the lowest bleu score has the best recall of correctly translated verbs. Similarly, the system with the second-lowest bleu score has the highest precision and f-measure of translated verbs, while the system with the best bleu score has the lowest scores in terms of correctly translated verbs. As verbs are crucial for the understanding of a sentence, this is somewhat surprising. When looking at the values for the systems ① and ②, we see that there is a noticeable increase in the number of correctly translated verbs in comparison to the baseline, which indicates that at least fluency should be improved. ①, the most straightforward system, has the highest recall: As only strings containing content words are provided as possible translations for verbs, there is a good chance that the correct translation is chosen. Due to the threshold introduced in system ②, some verbal phrases are left with their original set of translations in an attempt to avoid the overestimation of random alignments. While this leads to a small loss of recall, it definitely helps precision and also the f-measure. System ③ is still better than the baseline, but the special treatment of participial prepositions seems to have a negative influence, even though fewer than 20 words are directly affected. In ④, the same translation options as in the baseline are provided. Despite the penalty for non-content word translations, the system often took more or less the same non-content word translations as in the baseline. While its bleu score is slightly higher than the baseline, the values for correctly translated verbs are lower.
The increase in bleu points might be explained by the fact that the system is trained to maximize bleu: While in systems ①–③ only content-word translations are available, system ④ has more choice, and since forcing verbs to appear at unusual positions seems to harm bleu, the system might avoid this form of translation. The same assumption could be valid for system ⑤; in this case, however, the conditioning on the source side leads to better pos-bleu and lem-bleu scores by taking into account all sorts of phrases instead of focusing on only one group of pos tags. Since the main issue of our approach to enhance meaningful translations of verbs is the mismatch in syntactic structure between English and German, one might consider modifying the sentence structures in order to make them more alike. Experiments on this subject are presented in the work of [Collins et al., 2005], who restructure the German part of the training data in order to adapt the German data to the English structure. Reordering rules include moving verbs from the end of a (subordinate) clause to the beginning, to a position immediately adjacent to the subject, and the same holds for auxiliary verbs. The same applies to the verbal negation nicht (not) and to verb particles. A German→English system trained on the restructured data achieved better results than the baseline system; even though the adaptation for the reverse translation direction English→German is more complicated due to the fact that the reordered German has to be reconstructed into ‘normal’ German, results have been shown to be promising. However, this method does not guarantee perfectly translated verbs: Combining both ideas by enforcing meaningful translations of verbs on reordered training data would provide the system with data of equal structure and thus enable it to fit appropriate translations into the target sentence without irritating the language model.
9 Conclusion and future work

9.1 Summary

In this work, we presented a method for the integration of local source-side context features into a phrase-based statistical machine translation system in order to enable a more refined estimation of translation probabilities. Looking at adjacent words or pos tags, we made use of potentially very valuable information normally not used in statistical machine translation. Linguistically motivated context features in particular were expected to improve the syntactic quality of the translation output. We carried out experiments on the language pair English→German on three different system settings (a full system, a system with a minimal set of translation features and a system where source phrases were restricted to length one), as we expected to observe a stronger influence of our modifications in the simplified settings. After motivating the use of context information and illustrating the technical aspects of integrating additional features in chapter 3, we presented automatic evaluation metrics in chapter 4 and also introduced the idea of computing bleu on pos tags for evaluating the syntactic quality of machine translation output. In chapter 5, we discussed several criteria to avoid conditioning on contexts which would lead to deteriorated translation probabilities for a given phrase-context combination. After experimenting with different types of criteria to rate the reliability of phrase-context combinations, we decided to implement the simplest variant, using a fixed threshold to avoid overestimating the probabilities of low-frequency phrase-context pairs. Our extensive evaluation in chapter 6 showed that contextual information significantly increased the translation quality in the case of word translation, but results were less clear for full systems: only two settings achieved a significantly improved bleu score; in both cases, the systems were enriched with two separate context features.
However, improved bleu scores computed on lemmatized data indicated improvements on the lexical level. Furthermore, we detected slight syntactic improvements (represented by pos-bleu). The evaluation of content words revealed that the systems generally translated more content words, presumably resulting in better adequacy. A small-scale manual evaluation supported these assumptions. When examining examples, we found that in many cases, our modifications increased the translation probabilities of better suited candidates, which were consequently chosen by the system. But we also observed that contextual conditioning can indirectly cause the system to prefer some candidates over others: different translation decisions in the baseline and a contextually conditioned system are not always due to modified translation probabilities, but can also be caused by other factors such as different reordering or segmentation of the source sentence. We saw an example in which the re-estimated probability of a phrase positively affected (via segmentation and language model score) the translation of the subsequent phrase, but also an example in which a similar constellation indirectly caused the system to prefer a translation worse than that in the baseline. Generally, combinations of context templates tended to achieve better results than single context templates. We also found that right-side contexts had lower scores in the word-translation task than left-side contexts, whereas the results of left-side and right-side contexts in full systems were comparable: This is most likely correlated with the different effects on segmentation observed in the second part of chapter 7.
Chapter 7 started with a discussion of context granularity: since we were working with a relatively detailed tag set, we analyzed the effect of merging different pos tags with similar properties into one tag, with the result that auxiliary tags are better left detailed, whereas combining nouns and personal pronouns did not greatly influence the overall result. The introduction of chunks, i.e. linguistically motivated phrases, as contexts was intended to allow for a maximal generalization of context information, but also to trigger a linguistically well-formed segmentation. Chunk-based contexts turned out to be cumbersome due to overlapping phrases and also did not lead to the desired improvement. While chunk-based contexts did result in a more linguistically motivated segmentation, this did not lead to an improvement in translation quality. In the second part of chapter 7, we analyzed the correlation between contextual conditioning and the preference of the system for shorter segmentation units. A comparison of contextually conditioned translation probability distributions and the original distributions revealed that good contexts tended to favor literal translations by giving them comparatively high translation probabilities. In combination with the threshold, this led to an unexpected effect on segmentation: Since short phrases meet the threshold criterion far more often than longer phrases, they benefit to a larger extent from contextual conditioning and thus have an ‘advantage’ over the phrases left with their original translation probabilities. As a result, the system prefers shorter phrases. Although shorter phrases generally lead to bad translation results, we find that the quality of our systems tends to be improved rather than deteriorated, suggesting that contextual information is indeed valuable, as it managed to even out the inherent disadvantages of short phrases.
Interestingly, we found that contexts on the left side led to an immense increase in source phrases of length one, whereas contexts on the right side only caused a moderate increase in single-word source phrases. While we could not find an entirely satisfying explanation for this observation, we assume that right-side context might act as a ‘look-ahead’ during decoding and thus behaves differently from left-side context. Chapter 8 focused entirely on the translation of verbs. In contrast to the previously presented experiments, we additionally integrated target side information. Due to the different positions of verbs in English and German sentences, verbs are often translated by common, but meaningless phrases (e.g. particles or punctuation). Using information about the pos tags of the German translations, we tried to enforce meaningful translations of verbs. While the idea generally worked, the major structural differences between English and German caused the language model to disprefer essentially good verbal translation candidates for not fitting well into the sentence structure. Completely discarding entries of verbs with a non-content word translation led to better recall and precision of correctly translated verbs in the output sentence, but it negatively affected the overall bleu score, presumably because enforced meaningful translations blocked other translations. Penalizing non-meaningful translations of verbs resulted in better bleu scores, but lower precision and recall of correctly translated verbs. A combination of penalizing non-meaningful translations of verbs with the best source context detected in chapter 6 resulted in significantly increased pos-bleu and lemma-based bleu scores.

9.2 Discussion

The generally underwhelming outcome for full systems is consistent with the results for comparable language pairs reported in previous publications: While systems translating Chinese↔English were greatly improved by the use of context features (e.g.
[Gimpel and Smith, 2008], [Stroppa et al., 2007], [Carpuat and Wu, 2007b]), systems with a morphologically more complex target language were found to benefit less from context information (e.g. [Gimpel and Smith, 2008], [Max et al., 2008]). Since the authors of all papers presented in chapter 2 reported that simple context features such as adjacent words or pos tags turned out to work best, we mainly focused on this type of context feature. In contrast to most of the cited articles, our general method of context integration is much simpler, as we only use one context template at a time. This allows us to analyze more general aspects, such as the different effects of left-side and right-side contexts on phrasal segmentation, but also to rate the usefulness of different context templates. For example, we found that the first pos tag of a phrase was not useful, but its combination with the preceding tag outperformed every other setting. In addition to identifying good context templates, our extensive evaluation also revealed that contextually conditioned systems tend to translate more content words. This is related to our analysis of specific contexts in section 7.3; while we only studied an individual case, we found evidence that information about good phrase-context structures might also prove interesting for alignment. Yet, our approach has two significant disadvantages, the first of which is the evident lack of flexibility in terms of feature selection: since we only integrate one context feature at a time, the system is forced to use this context (given that the threshold criterion is met) even though this context might not be relevant at all, while ignoring potentially better contexts. A maximum-entropy-based approach (as presented in chapter 2) would offer the possibility to choose context features according to their respective relevance. However, such systems are far more complicated than the systems presented in the course of this work.
Our experiments showed that systems conditioned on two separate context functions tended to outperform single-context settings. The respective relevance of the context templates was then determined by parameter tuning in the form of feature weights. In the case of the combination of one pos tag on the left and the leftmost tag of the phrase, we observed that the system using separate feature functions yielded a slightly better result, possibly due to the fact that the two features received different weights representing their actual relevance instead of being assumed to be equally important. Yet, a feature weight assigned to the entire collection of probability distributions conditioned on one context template still does not take into account the respective relevance of the template in different phrase-context constellations. (Conditioning each phrase-context pair regardless of the respective usefulness might also have caused imprecise estimates.) Thus, the implementation of a more sophisticated feature selection seems a necessary next step. In an attempt to go beyond the concept of context templates, we carried out a small-scale experiment in which we tried to combine the values of two independent context templates, weighted according to their individual relevance in the respective phrase-context combination (cf. section 7.5). The relevance of a context for a given phrase was determined by a factor λ based on type-token relations of the contextually conditioned distribution (cf. section 5.3.4). Experiments were conducted on a system with the context combination of 1-pos-left and 1-pos-right as well as on a system conditioned on one or two pos tags on the left side. In the first variant of this experiment, the factor λ was used to determine the better context, which was then used for conditioning. Alternatively, the individual translation probabilities of both contexts were weighted according to the respective factor λ and merged into a single score.
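The second variant (merging the two context-conditioned distributions into a single score) can be sketched as follows. Note that the exact definition of the factor λ in section 5.3.4 is not reproduced here; the type-token heuristic below is a hypothetical stand-in, and all function names are our own.

```python
def type_token_lambda(dist_counts):
    """Hypothetical relevance factor for a contextually conditioned
    distribution, based on its type-token relation: few translation
    types per observed token suggest a more decisive context."""
    types = len(dist_counts)
    tokens = sum(dist_counts.values())
    return 1.0 - types / tokens if tokens else 0.0

def combine(p1, p2, lam1, lam2):
    """Merge two context-conditioned distributions p(trg | src, ctx)
    into one score, weighting each by its normalized relevance."""
    z = lam1 + lam2
    w1, w2 = (lam1 / z, lam2 / z) if z else (0.5, 0.5)
    keys = set(p1) | set(p2)
    return {t: w1 * p1.get(t, 0.0) + w2 * p2.get(t, 0.0) for t in keys}
```

The first variant of the experiment corresponds to picking only the distribution with the larger λ instead of interpolating; the linear interpolation above keeps the merged scores a proper distribution whenever both inputs are.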
Unfortunately, the results of these preliminary experiments were disappointing. Although the factor λ led to good results in some of the experiments presented in section 5.5, its performance was not stable throughout the tested systems, suggesting that it is not suited for every setting. The second problem is the estimation of translation probabilities using relative frequencies. Experiments in chapter 5 showed that a criterion to prevent the over-estimation of infrequent phrases was necessary. However, settings using a fixed threshold led to an unfortunate phrasal segmentation. Interestingly, using interpolation for the computation of translation probabilities (and thus avoiding the fixed threshold) does not lead to better bleu scores despite the higher average translation unit length. This outcome suggests that phrase-translation probabilities without the influence of the original distribution are better than the interpolated distributions: In order to benefit from contextually conditioned probabilities without triggering a strong preference for short phrases, it is necessary to treat all phrases alike regardless of their occurrence frequency. The key to doing so might lie in a maximum entropy based approach, as decisions are made based on the relevance of features instead of systematically dispreferring entire subgroups of infrequent phrase-context pairs. In connection with the phrasal segmentation, the different properties of left-side and right-side contexts seem to play an important role. However, it is not entirely clear what caused the extreme outcome of left-side contexts, and thus, this issue definitely requires more research. This observation could even be taken as evidence that left-side context and right-side context should be treated differently in order to suit their respective properties. If right-side context really does act as ‘look-ahead’ during decoding, it would be interesting to have a closer look at this behaviour.
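The interpolation alternative to the fixed threshold discussed above can be sketched as a count-dependent mixture of the contextually conditioned distribution and the original context-independent one. The smoothing constant k is a hypothetical choice, not the value used in chapter 5:

```python
def interpolate(p_ctx, p_global, n_ctx, k=5.0):
    """Count-based interpolation of the contextually conditioned
    distribution p_ctx with the original distribution p_global.

    The weight of the contextual distribution grows with the number
    n_ctx of observations of the phrase-context pair, so rare pairs
    lean on the global estimate instead of being cut off by a fixed
    frequency threshold.  k is an illustrative smoothing constant.
    """
    w = n_ctx / (n_ctx + k)
    targets = set(p_ctx) | set(p_global)
    return {t: w * p_ctx.get(t, 0.0) + (1 - w) * p_global.get(t, 0.0)
            for t in targets}

# toy example: a phrase-context pair seen 5 times, always with target "a"
p_new = interpolate({"a": 1.0}, {"a": 0.5, "b": 0.5}, n_ctx=5)
# p_new == {"a": 0.75, "b": 0.25}
```

As noted above, pulling every contextual estimate back towards the original distribution in this way did not improve bleu scores in our experiments, despite yielding longer translation units.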
One could even consider integrating information based on right-side contexts directly into the decoder: this would allow the search algorithm to make direct use of the information instead of deriving it somewhat indirectly from the contextually conditioned translation probabilities.

9.3 Morpho-syntactic preprocessing

We already suggested two different methods of preprocessing the training data: By reordering the training data in an attempt to adjust the structures of the English and German sides, verbal translations in particular are expected to benefit (cf. [Collins et al., 2005]). In addition to structural issues, German also has a relatively rich morphology, resulting in flat probability distributions. This becomes evident when looking at the probability distributions of e.g. adjectives: since German adjectives vary in number, gender and case, an English adjective is typically not translated by a single equivalent German adjective, but by several forms of the same word. The same applies to German nouns and verbs. When introducing additional features, sparse data becomes increasingly problematic. Thus, the idea of using simplified data for translation and relocating the task of modelling (monolingual) morphological features to a separate step seems promising: In this setting, the translation system should generally benefit from the simplified data, while contextual conditioning could make use of more generalized context settings. [Minkov et al., 2007] report good results for the prediction of inflected word forms in English-Russian and English-Arabic sentences, and the concept was successfully integrated into an smt system in [Toutanova et al., 2008]. Similarly, [Weller, 2009] adapted this idea for German; experiments conducted on reference translations were promising. These two pre-processing steps are mainly intended to enable a better general translation.
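A toy illustration of the two-step setup: the translation system outputs simplified (lemmatized) German annotated with morphological features predicted in a separate step, and a generation component produces the inflected surface forms. The paradigm entries and feature names below are invented for illustration and do not reflect the actual morphological tools:

```python
# Hypothetical fragment of an inflection paradigm (strong adjective
# inflection); a real system would use a morphological generator.
PARADIGM = {
    ("grün", "Nom", "Sg", "Neut"): "grünes",
    ("grün", "Dat", "Sg", "Neut"): "grünem",
}

def inflect(lemma, case, number, gender):
    """Generate the surface form for a lemma plus predicted features,
    falling back to the bare lemma when no form is known."""
    return PARADIGM.get((lemma, case, number, gender), lemma)

# e.g. the prediction step assigns dative singular neuter to "grün":
form = inflect("grün", "Dat", "Sg", "Neut")
```

The point of the setup is that the translation model only has to learn the single lemma, while the inflection decision is deferred to the generation step.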
However, morphological modelling could also be used to design context features: One of the main problems of translating into a morphologically rich language is the fact that morphological features not existing in the source language need to be realized in the target language. When eliminating all morphological features in order to simplify the training data, the tool for the prediction of morphological features needs to guess at some point. (This problem would be similar with ‘normal’ training data, since the source language does not indicate which realization to use.) In the case of simple and local phenomena, one could consider passing along target-side-inspired information to be used as a context feature. A simple example is the case government of German prepositions (dative or accusative), which can vary depending on whether they are used to describe a static situation or a directional movement. For example, in the phrase auf einem Baum sitzen (to sit on a tree), the case required by the preposition auf is dative, whereas in auf einen Baum klettern (to climb up a tree), the preposition requires accusative case. Since this is an entirely semantic problem, it is difficult to predict a correct form based only on morphological features. However, if the German training data is annotated with case information, we could use this as a feature to condition the source phrase on the target-side contexts accusative or dative. This concept is not limited to prepositions, but could also be applied to verbs.

References

[Allauzen et al., 2009] Allauzen, A., Crego, J., Max, A., and Yvon, F. (2009). LIMSI’s statistical translation system for WMT 2009. In Proceedings of the 4th EACL Workshop on Statistical Machine Translation, Athens, Greece. Association for Computational Linguistics.
[Banerjee and Lavie, 2005] Banerjee, S. and Lavie, A. (2005). METEOR: an automatic metric for MT evaluation with improved correlation with human judgements.
In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, Ann Arbor, Michigan.
[Berger et al., 1996] Berger, A., della Pietra, S., and della Pietra, V. (1996). A maximum entropy approach to natural language processing. Computational Linguistics, Volume 22.
[Callison-Burch et al., 2008] Callison-Burch, C., Fordyce, C., Koehn, P., Monz, C., and Schroeder, J. (2008). Further meta-evaluation of machine translation. In Proceedings of the Third Workshop on Statistical Machine Translation, Columbus, Ohio.
[Callison-Burch et al., 2006] Callison-Burch, C., Osborne, M., and Koehn, P. (2006). Re-evaluating the role of BLEU in machine translation research. In Proceedings of the 11th Conference of the European Chapter of the Association for Computational Linguistics: EACL 2006, Trento, Italy.
[Carpuat and Wu, 2007a] Carpuat, M. and Wu, D. (2007a). Context-dependent phrasal translation lexicons for statistical machine translation. In Proceedings of Machine Translation Summit XI, Copenhagen, Denmark.
[Carpuat and Wu, 2007b] Carpuat, M. and Wu, D. (2007b). Improving statistical machine translation using word sense disambiguation. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, Prague, Czech Republic.
[Collins et al., 2005] Collins, M., Koehn, P., and Kucerová, I. (2005). Clause restructuring for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the ACL, Ann Arbor, Michigan.
[Doddington, 2002] Doddington, G. (2002). Automatic evaluation of machine translation quality using n-gram co-occurrence statistics. In Proceedings of the Human Language Technology Conference (HLT), San Diego, CA.
[Garcia-Varea et al., 2001] Garcia-Varea, I., Och, F., Ney, H., and Casacuberta, F. (2001). Refined lexicon models for statistical machine translation using a maximum entropy approach.
In Proceedings of the 39th Annual Meeting of the Association for Computational Linguistics, Toulouse, France.
[Gimpel and Smith, 2008] Gimpel, K. and Smith, N. A. (2008). Rich source-side context for statistical machine translation. In Proceedings of the Third Workshop on Statistical Machine Translation, Columbus, Ohio.
[Koehn, 2004] Koehn, P. (2004). Statistical significance tests for machine translation evaluation. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Barcelona, Spain.
[Koehn, 2010] Koehn, P. (2010). Statistical Machine Translation. Cambridge University Press.
[Koehn et al., 2003] Koehn, P., Och, F. J., and Marcu, D. (2003). Statistical phrase-based translation. In Proceedings of HLT-NAACL, Edmonton, Canada.
[Max et al., 2008] Max, A., Makhloufi, R., and Langlais, P. (2008). Explorations in using grammatical dependencies for contextual phrase translation disambiguation. In Proceedings of EAMT, poster session, Hamburg, Germany.
[Minkov et al., 2007] Minkov, E., Toutanova, K., and Suzuki, H. (2007). Generating complex morphology for machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics, Prague, Czech Republic.
[Papineni et al., 2002] Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, USA.
[Popović et al., 2006] Popović, M., de Gispert, A., Gupta, D., Lambert, P., Ney, H., Mariño, J. B., Federico, M., and Banchs, R. (2006). Morpho-syntactic information for automatic error analysis of statistical machine translation output. In Proceedings of the Workshop on Statistical Machine Translation, New York City, USA. Association for Computational Linguistics.
[Popović and Ney, 2009] Popović, M. and Ney, H. (2009). Syntax-oriented evaluation measures for machine translation output.
In Proceedings of the 4th EACL Workshop on Statistical Machine Translation, Athens, Greece. Association for Computational Linguistics.
[Riezler and Maxwell, 2005] Riezler, S. and Maxwell, J. (2005). On some pitfalls in automatic evaluation and significance testing in MT. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for MT and/or Summarization, Ann Arbor, Michigan.
[Schmid, 1994] Schmid, H. (1994). Probabilistic part-of-speech tagging using decision trees. In International Conference on New Methods in Language Processing, Manchester, UK. www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/.
[Stroppa et al., 2007] Stroppa, N., van den Bosch, A., and Way, A. (2007). Exploiting source similarity for SMT using context-informed features. In Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI), Skövde, Sweden.
[Toutanova et al., 2008] Toutanova, K., Suzuki, H., and Ruopp, A. (2008). Applying morphology generation models to machine translation. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, Columbus, Ohio.
[Vickrey et al., 2005] Vickrey, D., Biewald, L., Teyssier, M., and Koller, D. (2005). Word-sense disambiguation for machine translation. In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP), Vancouver, Canada.
[Vilar et al., 2006] Vilar, D., Xu, J., D’Haro, F., and Ney, H. (2006). Error analysis of statistical machine translation output. In Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC), Genova, Italy.
[Weller, 2009] Weller, M. (2009). Separate Morphologiebehandlung als Methode zur Verbesserung statistischer maschineller Übersetzung. Studienarbeit, Universität Stuttgart.
A Manually evaluated sample

In the following subsections, all sentences with unambiguous annotation from the manual evaluation presented in section 6.4.2 are listed.

A.1 Contextually conditioned system better than baseline

Missing words

input the group responded with a new and even better cd.
bl die fraktion mit einem neuen und besseren cd.
cont die fraktion reagierte darauf mit einem neuen und besseren cd.
ref die band hat reagiert, wie es sollte: mit einem neuen, noch besseren album.

input among the five joined seven children.
bl von den fünf sind sieben kinder.
cont von den fünf beigetreten sind sieben kinder.
ref die fünf haben zusammen sieben kinder.

input a broker who does not take vacations is a broker who does not want anybody to look into his records,” kerviel concluded.
bl ein makler, die nicht urlaub ist ein vermittler, der nicht will, dass jemand, in seine aufzeichnungen,” kerviel abgeschlossen.
cont ein makler, die nicht urlaub ist ein makler, der nicht will, dass jemand zu prüfen, seine aufzeichnungen,” kerviel abgeschlossen.
ref ein makler, der keinen urlaub nimmt, ist einer, der nicht will, dass man ihm in die karten schaut,” sagte kerviel abschließend.

input d. it is less serious to participate in a fight than to participate in the interchange of degraded music.
bl d. es weniger schwerwiegend zu einem kampf als an der austausch von schlechterer musik.
cont d. es weniger schwerwiegend zur teilnahme an einem kampf als zur teilnahme an den austausch von schlechterer musik.
ref d. es ist weniger schwerwiegend an einer schlägerei teilzunehmen als an einem musikdaten-austausch.

input this same company has lowered the forecasted profit of antenna 3 to 4.5 % in 2008 and to 7.5 % in 2009 .
bl dieses unternehmen hat die voraussichtlichen profit des nebenstelle 3 bis 4 , 5 % im jahr 2008 und auf 7,5 % im jahr 2009 .
cont diese gleichen firma senkten die voraussichtlichen profit des nebenstelle 3 bis 4,5 % im jahr 2008 und auf 7,5 % im jahr 2009 .
ref diese hat auch die voraussichtliche gewinnberechnung von antena 3 für 2008 um 4,5 % und für 2009 um 7,5 % herabgesetzt.

input the latter then seized the trial to make a sudden escape along with 30 sympathizers.
bl letztere dann die prozess zu einem plötzlichen entkommen zusammen mit 30 sympathisanten.
cont letztere dann beschlagnahmt werden den prozess zu einem plötzlichen entkommen zusammen mit 30 sympathisanten.
ref er nutzte die verhandlung, um sich plötzlich mit rund 30 sympathisanten abzusetzen.

input in suburbs north of paris, there had been violent clashes between youths and the police in recent nights.
bl in den vororten von paris kam es zu gewalttätigen zusammenstößen zwischen jugendlichen und die polizei in den vergangenen nächten.
cont in vororten nördlich von paris kam es zu gewalttätigen zusammenstößen zwischen jugendlichen und die polizei in den vergangenen nächten.
ref in nördlichen vororten von paris hatten sich jugendliche in den vergangenen nächten schwere straßenschlachten mit der polizei geliefert.

input the average price per share was 14.46 euros, the company announced on wednesday evening.
bl der durchschnittliche preis pro wurde 14.46 euro, das unternehmen am mittwoch abend.
cont die durchschnittliche preis pro aktie war 14.46 euro, das unternehmen am mittwoch abend.
ref der durchschnittspreis je aktie beträgt 14,46 euro, wie das unternehmen am mittwochabend mitteilte.

input 600 thousand people did not manage to renew driver’s licenses.
bl 600 000 menschen nicht führerscheine zu erneuern.
cont 600 000 menschen nicht gelingt, führerscheine erneuern.
ref 600000 leuten gelang es nicht, ihren führerschein umzutauschen

input on wednesday (local time), chávez also announced a severance of diplomatic ties with neighbouring country colombia due to a hostage crisis.
bl am mittwoch (ortszeit), chávez kündigte auch eine severance der diplomatischen beziehungen mit nachbarstaat kolumbien eine geisel krise.
cont am mittwoch (ortszeit), chávez kündigte auch eine severance der diplomatischen beziehungen mit nachbarstaat kolumbien aufgrund einer geiseldrama.
ref zudem kündigte chávez am mittwoch (ortszeit) wegen einer geiselaffäre den abbruch der beziehungen zum nachbarland kolumbien angekündigt.

Word choice

input opportunities to combine vocational and academic education are developed comparatively strongly.
bl chancengleichheit zu verbinden, berufliche und wissenschaftliche ausbildung entwickelten vergleichsweise nachdrücklich.
cont möglichkeiten zu verbinden, berufliche und wissenschaftliche ausbildung entwickelten vergleichsweise stark.
ref auch die kombinationsmöglichkeiten von beruflicher und akademischer ausbildung sind vergleichsweise stark ausgebaut.

input but his report pointed out other key checks were properly carried out across a sample of five health authorities.
bl doch in seinem bericht darauf hingewiesen , anderen entscheidenden prüfungen wurden ordnungsgemäß durchgeführt, in einer stichprobe von fünf gesundheit behörden.
cont doch in seinem bericht darauf hingewiesen , andere entscheidende prüfungen wurden ordnungsgemäß durchgeführt, in einer stichprobe von fünf gesundheitsbehörden.
ref aber sein bericht wies darauf hin, dass andere wichtige kontrollen bei einer stichprobe von fünf gesundheitsbehörden korrekt durchgeführt wurden.

Fluency

input in total 200, located in the most commercial zones of spain.
bl insgesamt 200, in der die meisten kommerziellen zonen von spanien.
cont insgesamt 200 in den meisten kommerziellen zonen von spanien.
ref insgesamt 200 davon sind in den wirtschaftsstärksten regionen spaniens angesiedelt.

input ”oops, sorry. i forgot to tell you that marketing was looking for you,” says one of your colleagues.
bl ”hoppla, leid. ich vergessen habe, ihnen sagen zu können, dass die vermarktung hat sie,” sagt einem ihrer kollegen.
cont ”hoppla, es tut mir leid.
ich vergaß zu sagen, dass die vermarktung hat sie,” sagt einem ihrer kollegen.
ref ”sorry, ich habe vergessen dir auszurichten, dass jemand von der marketingabteilung angerufen hat” - sagt ein kollege.

input the number of violent acts rose 27 percent compared with last year, as much as 60 percent in the southern helmand province.
bl die zahl der gewalttaten stieg um 27 % im vergleich zum letzten jahr, als 60 prozent in den südlichen helmand.
cont die zahl der gewaltakte stieg um 27 % im vergleich zum letzten jahr, so viel wie 60 prozent im südlichen helmand.
ref die anzahl von gewaltaktionen ist im vergleich zum vorjahr um 27 prozent gestiegen, in der südlichen provinz helmand sogar um sechzig prozent.

input also arrested was a secretary of the consortium and the head of the municipal police.
bl ebenfalls verhaftet war ein sekretär des konsortiums und der polizeichef.
cont ebenfalls verhaftet wurde ein sekretär des konsortiums und der polizeichef.
ref außerdem wurden noch eine der sekretärinnen des gemeinderats und der chef der gemeindepolizei festgenommen.

input elected for the first time in 2000 and for a second time in 2004, vladimir putin can not afford a third mandate.
bl zum ersten mal gewählt in 2000 und für ein zweites mal im jahr 2004, wladimir putin nicht leisten können, eine dritte amtszeit.
cont zum ersten mal gewählt in 2000 und für ein zweites mal im jahr 2004, wladimir putin kann es sich nicht leisten, eine dritte amtszeit.
ref da er 2000 gewählt und 2004 wiedergewählt wurde, kann wladimir putin nicht für ein drittes mandat in folge antreten.

input fully 139 raf images have been uploaded onto the internet and so far over they have received over 75,000 hits.
bl voll 139 raf bilder haben uploaded auf das internet und so weit über sie haben über 75 zuschlägt.
cont voll 139 raf bilder wurden uploaded auf das internet und so weit über sie habe über 75.000 zuschlägt.
ref soldaten der luftwaffe haben bereits 139 solche aufnahmen beigesteuert, ihr ”kanal” wurde mehr als 75.000-mal angesehen.

A.2 Baseline better than contextually conditioned system

Missing words

input the offices, that will even have meeting rooms for smes, will not sell financial products.
bl die büros, haben sogar sitzungssälen für kmu, wird nicht verkaufen finanzprodukten.
cont die büros, das wird sogar sitzungsräumen für kmu, wird nicht verkaufen finanzprodukten.
ref die niederlassungen, die sogar mit zusätzlichen räumlichkeiten für die kmu ausgestattet sein werden, werden keine finanziellen produkte verkaufen.

input b.zs. : that keeps changing with every concert, especially now that i’ve already let go of the cd.
bl b.zs. : dass sich ständig verändernden mit jeder konzert, vor allem jetzt, da habe ich bereits die cd.
cont hält sich mit jedem konzert, vor allem jetzt, da habe ich bereits die cd.
ref b. zs.: das verändert sich mit den konzerten laufend und auch jetzt, wo ich das album ”losgelasssen” habe.

input havana became a storage and transfer centre for the treasures the spaniards had stolen in america and transferred to europe.
bl havanna wurde ein lagerung und übertragung zentrum für die reichtümer der spanier hätten gestohlenen in amerika und in europa.
cont havanna zu einem lagerung und übertragung zentrum für die schätze die spanier hätten gestohlenen in amerika und in europa.
ref havanna wurde einst zu einem lager oder durchgangslager von schätzen, die die spanier in amerika geraubt hatten und dann nach europa transportierten.

Word choice

input what country, exactly, does the federal chancellor live in?
bl welches land, genau die bundeskanzlerin leben in?
cont welches land genau plant die bundeskanzlerin leben?
ref in welchem land lebt diese bundeskanzlerin eigentlich?

input the basic price of the car with a 1.6 petrol engine of 329,900 czech crowns is not that bad.
bl der preis für das auto mit einem 1,6 benzin motor der 329,900 tschechischen kronen ist nicht schlecht.
cont die grundlegenden preis für das auto mit einem 1,6 benzin motor der 329,900 tschechischen kronen ist nicht so schlimm.
ref der grundpreis des autos mit dem 1400er benzinmotor in höhe von 329900 kronen geht noch an.

input signs of strain are clearly visible.
bl anzeichen einer belastung sind deutlich sichtbar.
cont anzeichen einer belastung sind eindeutig sichtbar.
ref anzeichen der anspannung sind deutlich sichtbar.

input around 1973, however, rother followed the cluster musicians into exile, to forst, in the weser highland.
bl um jedoch 1973, rother folgte die cluster musiker ins exil, forst, in der weser hochland.
cont etwa 1973, aber rother folgte das forstindustrie-cluster musiker ins exil, forst, in der weser hochland.
ref um 1973 folgte rother allerdings den cluster-musikanten ins exil nach forst im weserbergland.

Fluency

input belgrade fears a domino effect in a fragile region, weakened by the independence wars of the 1990s.
bl belgrad befürchtet eine dominoeffekt in einer instabilen region, geschwächt durch die unabhängigkeit kriege der 90er jahre.
cont belgrad befürchtungen ein dominoeffekt in einer instabilen region, geschwächt durch die unabhängigkeit kriege der 90er jahre.
ref belgrad meint, dass es besonders einen ”domino-effekt” in einer region fürchet, die noch so sehr durch die unabhängigkeitskriege der 90er jahre geschwächt ist.

input one of the local residents even classified the quarrels with eastern european immigrants as a fight for survival.
bl eine der anwohner sogar stuften die auseinandersetzungen mit den osteuropäischen migranten als einen kampf ums überleben.
cont eines der anwohner sogar stuften die auseinandersetzungen mit den osteuropäischen migranten als kampf ums überleben.
ref einer der einheimischen bezeichnete den streit mit den zuwanderern aus osteuropa sogar als überlebenskampf.
input those who are in power are not so bad, why changing them?
bl diejenigen, die an der macht sind nicht so schlecht, warum sie zu verändern?
cont diejenigen, die an der macht sind nicht so schlecht, warum sich verändernden ihnen?
ref diejenigen, die an der macht sind, sind nicht so schlecht, warum sie ablösen?

input still he knew he would never earn as much as the others.
bl noch immer er wusste er nie verdienen wie die anderen.
cont er wusste er sie nie verdienen wie die anderen.
ref dennoch wusste er, dass er nie so viel verdienen werde wie andere.

input russia has the world’s largest stocks of gas, but much of it remains under-developed.
bl russland hat die weltweit größten bestände von gas, aber vieles bleibt noch unterentwickelt.
cont russland hat die weltweit größte bestände von gas, aber vieles bleibt es unterentwickelt.
ref russland hat die weltweit größten gasvorkommen, aber viel davon bleibt unterentwickelt.