Universität Stuttgart
Institut für Maschinelle Sprachverarbeitung
Azenbergstraße 12, 70174 Stuttgart
An Empirical Analysis of Source Context
Features for Phrase-Based Statistical
Machine Translation
Marion Weller
Diplomarbeit Nr. 93
01.03.2010 - 01.09.2010
Advisor:
Dr. Alexander Fraser
Examiners:
apl. Prof. Dr. Ulrich Heid
Prof. Dr. Hinrich Schütze
I hereby declare that I have written this thesis independently and have used no literature other than that indicated. All quotations and paraphrases are marked as such, with precise indication of the source.
Abstract
Statistical phrase-based machine translation systems make little use of
context information: while the language model takes into account target side
context, context information on the source side is typically not integrated
into phrase-based translation systems. Translational features such as phrase
translation probabilities are learned from phrase-translation pairs extracted
from word-aligned parallel corpora. Since there is no information besides the
co-occurrence frequencies of the phrase-translation pairs, all occurrences of a
given source phrase are used for the estimation of translation probabilities,
regardless of their contexts in the training data. However, information about the
context of a source phrase, e.g. adjacent words or part-of-speech tags, might be
a valuable resource for the identification of appropriate translations in a given
context.
In this work, we want to analyze the use of source side context features
in phrase-based statistical machine translation. For every phrase in an input
sentence, context-sensitive phrase translation probabilities will be estimated:
by reducing the set of all phrase-translation pairs to the subset of those with
the same context as the given phrase, we can compute individual translation
probabilities depending on the respective context.
Assuming that the different translations of ambiguous source phrases occur
within different contexts, contextually conditioned translation probabilities might
help to solve ambiguities by separating the entire set of translation candidates
into subsets appropriate for different situations. Beyond this, the more refined
probability estimates should also have a general positive influence on translation
quality. Furthermore, the integration of context features offers the possibility to
include linguistic information which is not used in standard statistical machine
translation.
In our experiments, which are conducted on an English to German translation
system, we will focus on the integration of local context features, choosing a simple method for the computation of contextually conditioned phrase-translation
probabilities and their incorporation into a standard phrase-based statistical
translation system. For all experiments, we will provide an extensive evaluation
of the overall translation quality using standard automatic metrics such as bleu,
but also attempt to individually rate fluency and adequacy.
Contents

1 Introduction
2 Related work
   2.1 Overview of statistical phrase-based machine translation
   2.2 Contextually conditioned translation probabilities
   2.3 Classification based methods
   2.4 Word sense disambiguation
   2.5 Relation to example-based translation
3 Integrating context features into a phrase-based statistical translation system
   3.1 Motivation
   3.2 Translation phrase tables
      3.2.1 Estimating translation probabilities
      3.2.2 Reordering
      3.2.3 Parameter tuning
   3.3 Including context information
      3.3.1 A simple first example
      3.3.2 Expanded example: Using Pos-based context templates
      3.3.3 Implementation
   3.4 Experimental settings and tools
      3.4.1 Description of the Pos tag set
      3.4.2 Dimensions of the original and modified phrase-tables
4 Evaluation methods
   4.1 Bleu
   4.2 Error types in machine translation output
   4.3 Syntactically motivated evaluation
   4.4 Testing for significance
5 Rating the reliability of contexts and smoothing
   5.1 General reflections
   5.2 Interpolation and discounting
   5.3 Discount factors
      5.3.1 Count-based criteria for the usefulness of contexts
      5.3.2 Example: interpolation and discounting
      5.3.3 Good-Turing estimation
      5.3.4 Type-token relations as criteria for the evaluation of contexts
   5.4 Back-off
   5.5 Evaluation of the presented approaches
6 Basic context templates
   6.1 Word-based contexts
   6.2 Pos-based contexts
   6.3 Combining features
   6.4 Evaluation
      6.4.1 Content word oriented evaluation
      6.4.2 Manual evaluation
   6.5 Analysis of selected examples
      6.5.1 Adequacy
      6.5.2 Fluency: translational behavior of verbs
   6.6 Summary
7 Analysis of general aspects of context features
   7.1 Chunks as source-side context features
      7.1.1 Chunked data
      7.1.2 Chunk-based contexts
      7.1.3 Results and evaluation
   7.2 Granularity of context features
   7.3 Analysis of specific context realizations
   7.4 Phrasal segmentation
      7.4.1 Example
      7.4.2 Comparison with other systems
   7.5 Feature selection
8 Translating verbs: Filtering translation options with target side information
   8.1 Basic idea
   8.2 Evaluation
      8.2.1 Results
      8.2.2 Error analysis
      8.2.3 Comparison of the presented systems
9 Conclusion and future work
   9.1 Summary
   9.2 Discussion
   9.3 Morpho-syntactic preprocessing
References
A Manually evaluated sample
   A.1 Contextually conditioned system better than baseline
   A.2 Baseline better than contextually conditioned system
1 Introduction
In the course of this work, we want to discuss to what extent a phrase-based
statistical machine translation system benefits from enriching phrase translation
probabilities with source-side context features.
The basic idea of phrase-based statistical machine translation (pbsmt) is
to segment an input sentence into sequences of words (phrases) which are then
translated. In order to find a good translation, the scores of a translation model
and a target side language model are maximized: this setting takes into account
the relation between source and target language as well as target language
similarity by requiring that a phrase is at the same time a good translation of
the source phrase and also leads to a good target language string.
Phrase translation probabilities (and other translational features) are learned
from word aligned parallel corpora by extracting all possible phrase-translation
pairs from the corpus. The probability of a source phrase to be translated
into a target phrase is a simple estimation based on relative frequencies. Since
this estimation is carried out on the set of all phrase-translation pairs, there
is no information available besides their co-occurrence frequency. This means
that information about the context in which a source phrase occurred in the
training data is lost even though it might contain key information necessary for
a good translation. Context information used for a better estimation of phrase-translation probabilities could also include linguistically motivated features like
part-of-speech (pos) tags. Since linguistic information is not used in standard
pbsmt systems (despite providing valuable information), the use of context
knowledge also offers genuinely new features in statistical machine translation.
While the language model takes into account local context on the target
side when rating the similarity between a translation and the target language,
no source-side context information is used by the system. However, additional
information about the context of a phrase, e.g. adjacent words or a linguistic
description thereof, might be a useful resource for the identification of appropriate
translations for a given context. The additional information provided by context
features could help to deal with ambiguous phrases, but more refined probability
estimates should also have a general positive influence.
Assuming that context information helps to improve translation quality, we
intend to re-estimate phrase-translation probabilities conditioned on context
information and thus produce context-sensitive translation distributions. This
means that for every phrase in an input source sentence, individual phrase-translation probabilities will be computed depending on its respective context:
only those phrase-translation pairs extracted from the training data with the
corresponding context will be used for the estimation of translation probabilities
for the given phrase.
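As a toy sketch of this re-estimation (all phrase pairs, contexts and counts below are invented for illustration and do not come from the thesis data), restricting the extracted pairs to those sharing the input phrase's context and re-normalizing yields the context-sensitive distribution:

```python
from collections import Counter

def conditional_probs(phrase_pairs, source, context=None):
    """Estimate p(e | f) -- or p(e | f, c) when a context is given --
    by relative frequencies over extracted phrase-translation pairs.
    Each pair is a (source_phrase, target_phrase, context) tuple."""
    counts = Counter(
        tgt for src, tgt, ctx in phrase_pairs
        if src == source and (context is None or ctx == context)
    )
    total = sum(counts.values())
    return {tgt: n / total for tgt, n in counts.items()}

# Invented extracted pairs for "bank", with the left-neighbour word
# as context feature:
pairs = [
    ("bank", "Bank", "the"), ("bank", "Bank", "the"),
    ("bank", "Ufer", "river"), ("bank", "Bank", "the"),
]
plain = conditional_probs(pairs, "bank")            # unconditioned p(e|f)
riverside = conditional_probs(pairs, "bank", "river")  # p(e|f,c)
```

In the unconditioned distribution both translations compete, while conditioning on the context "river" leaves only the translation observed in that context.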
In our experiments conducted on an English→German translation system,
we will focus on the integration of local source context features. We will analyze
the effect of one context (e.g. one adjacent word or pos tag) at a time: this will
be called a context template, indicating that the type of context is fixed while
the realizations (i.e. the respective words or pos tags) within this template vary
depending on the situation. For all experiments, a detailed evaluation based
on several automatic translation quality metrics will be provided, as well as a
small-scale manual evaluation.
Summary We begin this work with a brief introduction to pbsmt, followed
by an overview of previous publications on the subject of using context features
to improve phrase-based machine translation. The methods of context integration presented in chapter 2 range from straightforward probability estimation
using relative frequencies to sophisticated classification methods and techniques
originally used for word sense disambiguation. The approach we chose for this
work is situated at the simple end of the spectrum, both in terms of context
design and integration techniques.
A detailed motivation for context features is provided in chapter 3: After
explaining the basic idea of pbsmt and the design of some of the most important
features used by a standard system, we illustrate how source side information
allows the re-estimation of translation probabilities depending on the given
context. With the example of ambiguous source phrases, we show how contextually conditioned translation probabilities enable the system to find appropriate
translations by excluding those translation candidates which have not been
seen in the training data within the given context. This chapter also provides
explanations of the technical aspects of using context-dependent translation
probabilities and a description of data and tools.
In chapter 4, we discuss general aspects of evaluating machine translation.
We present standard evaluation metrics like bleu and briefly discuss advantages
and disadvantages of automatic evaluation metrics. We also discuss the idea of
computing bleu scores based on pos tagged or lemmatized output to improve
the measurement of the syntactical or lexical quality of the produced translations.
So far, our simple approach of estimating contextually conditioned phrase-translation probabilities does not take into account the usefulness of a specific
phrase-context combination; that is to say, whether conditioning a phrase on
its context results in a better probability estimation or, in the worst case,
even leads to a deterioration. In chapter 5, we will define criteria to rate
the usefulness of phrase-context combinations and we also address smoothing
issues. In order to avoid over-estimated translation probabilities of low-frequency
phrases conditioned on context features, we discuss the potential use of restricting
contextual conditioning to phrases with a minimum occurrence frequency, but
also analyze the appearance of contextually conditioned distributions in an
attempt to derive the significance of a phrase-context combination. The discussed
criteria will then be verified in experiments, with the result that simple frequency
based conditions work as well as more sophisticated ones.
In chapter 6, we present experiments with different context templates and
combinations thereof: Contexts based on either words or pos tags will be
integrated separately into translation systems. In addition to full phrase-based
systems, we will also analyze the output of word translation systems and systems
with a minimal number of translation features since in these simplified settings
the effect of context-informed translation probabilities is greatly enhanced.
The evaluation methods used are the standard metrics bleu and nist, but
we also use modified versions based on pos tagged or lemmatized mt-output
for evaluating grammatical and lexical quality. With an analysis of correctly
translated content words, we try to rate adequacy. Furthermore, a manual
evaluation will give information about three participants’ preferences (baseline
system vs. context informed system) and provide data for a small-scale analysis
of error types. We conclude this chapter by discussing example translations in
order to illustrate the effects of contextually conditioning which were observed
during the evaluation, as well as indirect and unexpected effects.
Chapter 7 begins with a discussion about context granularity: In an attempt
to generalize context features as much as possible, we use a very compact set of
contexts, linguistically well-formed phrases, and also experiment with differently
designed sets of pos tags. Conditioning on phrasal elements also introduces
the subject of phrasal segmentation of the source sentence, which is the second
topic of this chapter. We will compare translation probability distributions
conditioned on either good or less good contexts with the original distributions;
this analysis will be helpful to understand the reason for effects observed in
the previous chapter and the general influence on phrasal segmentation. Since
our systems produce segmentations that differ greatly from results reported
in previous publications, we attempt to find an explanation and specifically
compare our approach to one of the methods presented in chapter 2.
The translation of verbs is generally challenging due to the structural differences between English and German, and in chapter 8 we focus entirely on the
problem that verbs tend to be ‘lost’ during translation due to being translated
as meaningless structures such as verb particles. While in all previous experiments only source side context features were used for conditioning, we now also
integrate target side information: By requiring that source phrases containing a
verb are translated as phrases containing at least one content word, we try to
enforce that verbs are translated with meaningful phrases.
We conclude this work with a summary of our results and ideas for improvement in chapter 9.
Experiments showed significantly improved results in the case of word translation, but the outcome was less clear for full systems. However, our extensive
evaluation revealed interesting effects such as the fact that contextually conditioned systems translate more content words than a standard system and that
especially linguistically motivated features tend to produce better translations
in terms of grammaticality: Overall, we found evidence for a positive effect on
the two traditional criteria of fluency and adequacy.
2 Related work
In phrase-based statistical machine translation, parallel word-aligned text is
used to learn translation probabilities as well as other translational features.
The estimation of translation probabilities is based on relations between target
and source phrases, but does not take into account the relation of source phrases
and the context they actually appear in. However, contextual information could
be a valuable tool to identify translations that are especially appropriate for a
given context.
In this chapter, different publications integrating context knowledge into standard phrase-based statistical machine translation systems are presented. The methodologies range from a straightforward estimation of probabilities using relative frequencies to more complex methods such as classification and techniques originally used in word sense disambiguation.
Given a certain similarity to example-based machine translation (ebmt), we
will conclude this chapter with a brief comparison of the basic concept of ebmt
and the general idea of integrating context features in smt.
2.1 Overview of statistical phrase-based machine translation
The concept of phrase-based machine translation is illustrated in figure 1: the
English input sentence is segmented into phrases, which are then translated.
Translation units can be of variable length and are not necessarily linguistically
well-formed. As can be seen from the phrase pairs [the minister]→[der minister]
and [attends]→[besucht], the order of the phrases in the source sentence is not
always maintained.
The decisions of the translation system are largely based on the translation
probabilities of the individual phrases and the target side language model. By
choosing translations with high translation probabilities, the system attempts
to reproduce the content of the source phrase as precisely as possible. The second factor, the language model, which rates the quality of a target language string, is meant to guarantee that the chosen translation fits well into the target sentence.
Additionally, a reordering model indicates how the phrases are to be positioned
in the target sentence.
Translational features are derived from phrase-translation pairs extracted
from word-aligned training data. The translation probabilities for a given source
phrase f to be translated into a target phrase e are estimated by calculating the
relative frequencies of all observed translation candidates for the source phrase f.
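Written out, this relative-frequency estimate is

```latex
\hat{p}(e \mid f) \;=\; \frac{\operatorname{count}(f,e)}{\sum_{e'} \operatorname{count}(f,e')}
```

where count(f, e) is the number of times the phrase pair (f, e) was extracted from the word-aligned corpus.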
smt systems use many different features whose respective relevance is trained by
[Figure 1: Example for phrase-based machine translation — the English input [tomorrow ,] [the minister] [attends] [a conference] [.] is segmented into phrases and translated as [morgen] [besucht] [der minister] [eine konferenz] [.], with [attends]→[besucht] reordered before [the minister]→[der minister].]
repeatedly translating a development data set with varying parameter settings
until an optimal setting is found (minimum error rate training). The different
features and their feature weights are represented by a log-linear model, which
is explained in more detail in section 3.2.3.
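Schematically, such a log-linear combination scores a hypothesis as the weighted sum of its (log) feature values; the feature names, values and weights below are invented placeholders, and in a real system the weights come out of minimum error rate training:

```python
import math

def loglinear_score(features, weights):
    """Weighted sum of log feature values h_i with weights lambda_i;
    the decoder searches for the hypothesis maximizing this score."""
    return sum(weights[name] * h for name, h in features.items())

# Illustrative feature values for a single translation hypothesis:
features = {
    "p_translation": math.log(0.4),
    "p_language_model": math.log(0.1),
    "p_reordering": math.log(0.6),
}
weights = {"p_translation": 1.0, "p_language_model": 0.8, "p_reordering": 0.3}
score = loglinear_score(features, weights)
```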
When translating, the system searches for phrases which maximize the
combined scores of translational features and language model. Since the system
not only computes scores for the phrase to be translated, but also estimates the future cost of the subsequent phrases as part of the cost of the current phrase, the features of the next phrase can affect the translation of the current phrase.
Additionally, the language model conditions translation decisions on target side
context by accepting only phrases that lead to a good target string when added
to the already translated phrases. However, this is only a very indirect form of
context information and does not directly take into account any specific source
context of the phrase to be translated.
For conditioning translation probabilities on source-side context, not only the phrase-translation pairs are extracted from the training data, but also information about the context they appeared in. The translation probabilities
of a phrase occurring in an input sentence will then be estimated based on only
the set of phrase-translation pairs with the same context instead of using the
entire set of all phrase-translation pairs.
2.2 Contextually conditioned translation probabilities
This work is mainly inspired by [Gimpel and Smith, 2008], who integrate context
features into a standard phrase-based statistical translation system. As the
translation probability p(e|f ) for a given source phrase f does not take into
account any form of context information, they adjusted the phrase translation
probability by conditioning on the phrase f and its context cf ; thus, they work
with refined translation probabilities p(e|f, cf ), which are estimated by using
the relative frequencies of source phrases with the respective contexts and target
phrases. Only context on the source side is used for conditioning translation probabilities, since including target side context features would require adapting
the algorithms used for parameter training and decoding. Additionally, the
target language model is already a very strong component, which effectively
ensures that only those translation candidates are chosen that fit well into the
already produced partial translation. Local ambiguities such as morphological
agreement between two phrases can be solved since the language model only
gives good scores to strings which have been observed in the training data; thus,
ungrammatical (i.e. not observed) phrase combinations are excluded.
Phrase-based mt systems use many features in a setting that allows the
addition of new features (log-linear model, cf. 3.2.3). The contextually conditioned probabilities, some of which are based on very sparse data, are integrated
into the log-linear model. By applying minimum-error-rate training on the new
model, the new feature functions are weighted according to their usefulness.
Thus, relevant context templates are assigned a high feature weight, whereas
less useful ones are given a lower weight. Sparse data, or in this case infrequent phrase-translation-context combinations, often result in overestimated
translation probabilities; [Gimpel and Smith, 2008] deal with this problem by
merging features conditioned on different context templates into one single
model, considering the different (sparse) contexts to be backed-off estimates of
a complete context.
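One simple way to picture such a combination of sparse and general estimates is linear interpolation of the context-conditioned distribution with the unconditioned one; this is an illustrative simplification, not the actual merging scheme of [Gimpel and Smith, 2008], who combine the context templates as features of the log-linear model:

```python
def interpolate(p_context, p_plain, lam):
    """Mix a (possibly sparse) context-conditioned distribution
    p(e|f,c) with the unconditioned p(e|f); lam in [0,1] expresses
    how much the context is trusted."""
    targets = set(p_context) | set(p_plain)
    return {e: lam * p_context.get(e, 0.0) + (1 - lam) * p_plain.get(e, 0.0)
            for e in targets}

# A context that only ever saw one translation, smoothed towards
# the general distribution (all numbers invented):
mixed = interpolate({"Bank": 1.0}, {"Bank": 0.75, "Ufer": 0.25}, 0.5)
```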
Different types of context are used as basis for a conditioned estimation of
translation probabilities: The simplest form of context is the words next to the source phrase f. As actual words are likely to be too sparse, they can be
replaced by their part-of-speech (pos) tags for a more general context. [Gimpel
and Smith, 2008] also experimented with syntactically motivated features, such
as the label of parse-tree nodes that span over a source phrase or information
about whether a phrase is a complete constituent. In addition to lexical and
syntactic types of context features, the position of the phrase in the source
sentence (start, end or relative position of the phrase) is also used as a context feature.
As adding new context features did not always turn out to improve performance, [Gimpel and Smith, 2008] used a feature selection algorithm to find
an optimal combination of context features. This procedure starts with no features and iteratively adds the feature that leads to the largest improvement on unseen development data.
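This procedure amounts to greedy forward selection; the sketch below treats the development-set metric as a black-box `evaluate` function (a placeholder for, e.g., a bleu evaluation), and the toy scorer in the usage line is invented:

```python
def greedy_feature_selection(candidates, evaluate):
    """Greedy forward selection: start with no features, repeatedly
    add the candidate that most improves the development-set score,
    and stop when no candidate helps."""
    selected, best = [], evaluate([])
    while True:
        gains = {f: evaluate(selected + [f]) for f in candidates if f not in selected}
        if not gains:
            break
        f, score = max(gains.items(), key=lambda kv: kv[1])
        if score <= best:
            break
        selected.append(f)
        best = score
    return selected

# Toy scorer rewarding two of three candidate context templates:
selected = greedy_feature_selection(
    ["pos_left", "word_right", "position"],
    lambda fs: len(set(fs) & {"pos_left", "word_right"}),
)
```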
Experiments with different context features were carried out on systems for
Chinese→English, English→German and German→English translation.
For Chinese→English, [Gimpel and Smith, 2008] report large improvements in
three standard evaluation measures when using proceedings of the United Nations
(un-proceedings) as training and test material, but no significant improvement
when using a considerably smaller set of newswire data. This is not surprising
as there was not enough data to guarantee that the context features are useful.
Additionally, un-proceedings are repetitive and formulaic and thus better suited
for the use of additional context. A system trained on the combined un and news
data performs worse without context features, but significantly outperforms
the in-domain baseline (trained and tested on un-proceedings) when context
information is used. While these results are not as high as the in-domain
results, this outcome suggests that context features help to make better use of
out-of-domain data.
Unfortunately, the results for English→German and German→English were
not as good as for Chinese→English. In most cases, features did not harm the
outcome, but there was also no improvement. As a reason, [Gimpel and Smith,
2008] name the insufficient size of their training material, as well as the complex
morphology and difficulties in parsing German. For the English→German
translation, the best result was achieved by using the context templates as
proposed by the feature selection algorithm. Given that a rich variety of context
features is used, including relatively sophisticated syntactic information, it is
somewhat surprising that the best feature combination is the relatively simple one of two pos tags on the left and one word on the right side of a
source phrase.
[Allauzen et al., 2009] implemented a system very similar to that presented by [Gimpel and Smith, 2008] for the wmt shared task 2009, translating
French→English and English→French. As context features, they use words
and pos tags on the left and the right side of source phrases. While for the
English→French translation, the results were in the same range as the baseline
system (a standard moses system), they report a small gain in bleu for
the reverse direction French→English. [Allauzen et al., 2009] intend to redo their
experiments on an especially large data set, which, at the point of publication,
was not possible due to technical issues.
In this work, we adopt the concept of estimating contextually conditioned
translation probabilities based on relative frequencies. In contrast to the methodology presented by [Gimpel and Smith, 2008], we focus less on the simultaneous integration of many different context templates and instead study different
types of context separately and provide a detailed evaluation of translation
results. Similarly to [Gimpel and Smith, 2008]’s unsatisfactory results for the
language pair German-English, we will not achieve great results in standard
bleu, but find modest improvements with alternative evaluation metrics. [Gimpel and Smith, 2008] use a feature selection algorithm to select context templates
based on their relevance: While we also carry out experiments with combined
context templates, we do not use a sophisticated method to rate the relative
usefulness of contexts. However, since combined contexts tend to outperform
single contexts in our experiments, this is an open point requiring more research.
2.3 Classification based methods
The concept of maximum entropy in natural language processing was introduced by [Berger et al., 1996], who discuss efficient algorithms for parameter estimation
and feature selection. They also present several practical applications, including modeling context-sensitive word translations for integration into the
French→English translation system Candide (developed by ibm).
[Garcia-Varea et al., 2001] adopt this idea and use a maximum entropy
based approach to produce refined lexicon models for word translation. They
provide a pool of possibly useful features out of which a subset of relevant
features is selected. Context information of both source and target side is taken
from a 3-word window around the targeted word and includes words and word
classes allowing for generalization. They report promising results for this task
on VerbMobil data.
[Stroppa et al., 2007] choose a memory-based classifier to model context
dependent probability distributions. The input for such a classifier is a vector
of fixed length containing the source phrase, context features and a label for
the class, i.e. the target phrase e. The data are then stored in a decision tree
structure which is used to predict the conditioned probability p(e|f, cf ). The
tree is traversed top-down, with the most informative features tested first. The
output of the classification is a weighted set of class labels representing the
possible translations e of a source phrase f ; this output needs to be normalized to
obtain the targeted probability distribution. Classifiers as presented by [Stroppa
et al., 2007] have two essential advantages, the first of which is that the output
corresponds to the posterior probabilities of the target phrases and only needs
normalization; if no context information is given to the classifier, the (normalized)
output corresponds to the original probability distribution p(e|f ). Additionally,
such classifiers can efficiently process large amounts of data and produce any
number of output classes.
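The normalization step mentioned above is straightforward; the classifier output below is a made-up example (the memory-based classifier itself is not reproduced here):

```python
def normalize(weighted_labels):
    """Turn a classifier's weighted set of class labels
    (target phrase -> weight) into a probability distribution
    p(e | f, c) by dividing by the total weight."""
    total = sum(weighted_labels.values())
    return {label: w / total for label, w in weighted_labels.items()}

# Hypothetical weighted output for some source phrase and context:
dist = normalize({"besucht": 3.0, "nimmt teil": 1.0})
```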
In addition to the context dependent probability distribution, [Stroppa et al.,
2007] use a binary feature that assigns the ’bonus value’ 1 to the phrase e that
is most probable within p(e|f, cf ), while all other phrases are given the value 0.
For the context features (adjacent words and pos-tags), a list indicating
the relevance (information gain) of each context is presented. The rankings of
the context features are very similar for Italian→English and Chinese→English,
the two translation directions of the experiments. With the source phrase itself
being the feature with the highest information gain, lexical features, i.e. adjacent
words, score higher than pos information. Also, contexts on the right side of a
phrase are found to have higher information gain than contexts on the left side
of a source phrase.
Interestingly, this ranking does not correspond well with the feature relevance
[Gimpel and Smith, 2008] retrieved by applying a feature selection algorithm.
For Chinese→English translation, the top-scoring context in [Stroppa et al.,
2007] was only chosen as third and least-important feature by [Gimpel and
Smith, 2008]’s method. Additionally, the context considered most relevant
by [Gimpel and Smith, 2008] for English→German translation, 2 pos tags on
the left side, ranks very low at the 9th position of 10. As [Gimpel and Smith,
2008] use a richer set of context features and also different language pairs, it is
difficult to directly compare their results to [Stroppa et al., 2007]’s; however,
it is surprising that they differ to such an extent. For both translation tasks,
Italian→English and Chinese→English, [Stroppa et al., 2007] report significant
improvement in comparison to the respective baseline system.
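The information gain that [Stroppa et al., 2007] use to rank context features is the reduction in entropy of the translation distribution once the value of a context feature is known; a minimal sketch (the function names are our own):

```python
import math
from collections import Counter, defaultdict

def entropy(counter):
    """Entropy (in bits) of the distribution given by raw counts."""
    total = sum(counter.values())
    return -sum((c / total) * math.log2(c / total) for c in counter.values())

def information_gain(instances):
    """instances: (context_value, target_phrase) pairs for one source phrase.
    Returns H(E) - H(E | C): how much knowing the context value reduces
    the uncertainty about the translation."""
    labels = Counter(e for _, e in instances)
    by_context = defaultdict(Counter)
    for c, e in instances:
        by_context[c][e] += 1
    n = len(instances)
    h_cond = sum(sum(ctr.values()) / n * entropy(ctr)
                 for ctr in by_context.values())
    return entropy(labels) - h_cond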
Similarly, the work by [Max et al., 2008] also describes a decision-tree based
classification. While most other research groups experimented with translations
into English, their experiments conducted on English→French concern a translation into a highly inflected language. In addition to the widely used context
features of adjacent words and pos tags, they also included dependencies obtained from dependency parsing. The relevance of context features corresponds
with the results of [Stroppa et al., 2007], with immediately adjacent words and
pos tags being the most important context features. Dependency features turned
out to be less valuable. In addition to the probability distribution obtained by
classifying, the target phrase with the highest probability is assigned the value
1 in a ‘bonus feature’ (0 otherwise) to force the system to decide for the most
probable option. Despite being slightly better than their baseline, the results
of their systems fail to show statistically significant improvement. However, a
manual evaluation suggested a noticeable improvement.
An essential advantage of the classification-based approaches is the fact that
context features are included according to their relevance and are thus able
to perform an estimation on an optimal feature subset for any given situation.
Since our approach basically focuses on using only one context template at a
time, we run the risk of including essentially useless contexts while leaving out
potentially more relevant ones. However, even with our simple approach, we
can report some modest improvements for some settings.
Similarly to [Max et al., 2008], the target language of our translation pair
English→German has a complex morphology which could be a factor in the
outcome of the experiments. Thoughts on how to deal with rich morphology
will be addressed in section 7.4.2.
2.4 Word sense disambiguation
Another approach to integrate source side information is the use of word-sense
disambiguation (wsd) techniques: the integration of wsd-features can be viewed
as an extension of the classification-based methods. The traditional task of wsd
is to find the sense of a word given a set of context information and possible
word senses. wsd normally uses very fine-grained, manually created data sets
that capture even very subtle differences of meaning. When applied to smt, the
task is to predict the translation of a source word instead of its sense. Assuming
that better translation choices can be made if the semantic word sense is known
for a source word, the integration of word-sense prediction into smt systems
seems promising. Parallel corpora provide a large amount of training material:
By extracting phrase-translation pairs, source phrases that are ‘sense-tagged’
with their respective translation can be obtained. This has the major advantage
that there is no need to produce hand-crafted data sets, as both the smt and
the wsd systems can be trained on the same data which avoids possible domain
mismatches.
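Extracting such 'sense-tagged' instances from a word-aligned sentence pair might look as follows; this is a simplified sketch restricted to single-word alignment links, with surrounding source words as context, and all names are illustrative:

```python
def sense_tagged_instances(src_tokens, trg_tokens, alignment, window=2):
    """For each alignment link (i, j), emit a training instance in which
    the aligned target word trg_tokens[j] serves as the 'sense tag' of
    the source word src_tokens[i]; context = surrounding source words."""
    instances = []
    for i, j in sorted(alignment):
        left = tuple(src_tokens[max(0, i - window):i])
        right = tuple(src_tokens[i + 1:i + 1 + window])
        instances.append((src_tokens[i], (left, right), trg_tokens[j]))
    return instances
```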
The design of the training data for classification based methods (as presented
in the previous section) is essentially the same as for word-sense disambiguation
techniques; in both cases, the possible translations of a source phrase are regarded
as either classes or senses and thus, the task of finding the most probable class
or word sense is actually very similar. In comparison, word-sense disambiguation
usually relies on an elaborately designed set of features.
While the first attempts to integrate wsd techniques into phrase-based
statistical translation systems were not very promising, recent results indicate
that smt could benefit from word-sense disambiguation. We present the work of
two groups of authors who successfully re-purposed wsd for translation tasks.
[Vickrey et al., 2005] do not work with a full-scale smt system, but focus on
the partial task of word-translation. Training data is obtained from a bilingual
corpus (europarl): For each source word a, the set of translation possibilities
is extracted using word alignment. Additionally, the source and target sentences
in which a and its translations occurred are stored as well to provide contextual
information. The resulting set of sentences is split into test and training data.
The context features consist of the respective pos tag of the word a, as well
as a binary feature with values of either 0 or 1 depending on whether a word
occurred within a predefined context window of a. After experiments with
different window sizes, the optimal size turned out to be a one-word context on
the left side and a two-word context on the right side of the targeted phrase;
larger context windows generally led to overfitting.
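The binary occurrence features can be sketched as follows; the one-word-left, two-words-right window mirrors the best-performing setting reported above, while the function name and vocabulary handling are our own:

```python
def window_features(tokens, i, vocab, left=1, right=2):
    """Binary context features for the word at position i: for every word
    in vocab, 1 if it occurs within the context window, 0 otherwise.
    [Vickrey et al., 2005] additionally include the POS tag of tokens[i]."""
    window = tokens[max(0, i - left):i] + tokens[i + 1:i + 1 + right]
    return {w: int(w in window) for w in vocab}
```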
In traditional wsd, a model which is given a word with a set of senses and a
sentence containing this word, has to predict the correct sense for the word seen
in the sentence. When adapted to smt, the model has to find a valid translation
for the single word a; translations do not necessarily have to be single words
but can also be multi word units.
A logistic regression model was trained, since the large amount of data enables
the use of such a model. With regard to the requirements of a phrase-based smt
system, the output of the model is required to provide not only a prediction for
the most probable target phrase, but a complete list of translation probabilities
for all possible target phrases e. For their experiments with different context
settings and context window sizes, [Vickrey et al., 2005] report improvement of
average word translation accuracy.
In a next step, they attempt to use their context-aware translation model
on a simplified translation problem. Instead of translating complete sentences,
the system is limited to the translation of ambiguous words only. In a preprocessing step, ambiguous words in the source sentence are identified and the
respective words on the target side are replaced by blanks. The mt system with
the context dependent word translation model is then used to fill the blanks
with appropriate translations. This ‘isolated setting’ is intended to guarantee that the
quality of the translation is purely determined by the probability model and not
influenced by other factors. For this experiment, [Vickrey et al., 2005] also find
improvements of word translation accuracy, which suggests that the integration
of wsd components might help to improve the quality of smt.
[Carpuat and Wu, 2007a] and [Carpuat and Wu, 2007b] present an approach
in which the wsd task, i.e. training and prediction, is adapted to fit into a
full-scale smt-system and directly focuses on translation quality by being closely
tied to the smt system instead of being a stand-alone component. While both
papers essentially present the same method, [Carpuat and Wu, 2007b] focus on
the general concept whereas [Carpuat and Wu, 2007a] provide a more detailed
description of context-sensitive translation lexicons. In contrast to previous
publications, their wsd component is not limited to single (content) words, but
is capable of handling phrases of any length. [Carpuat and Wu, 2007b] stress
the importance of performing word-sense disambiguation on multi-word phrases
as wsd systems focusing on single words force smt systems to decide between
context-aware single words and context-independent multi-word phrases. Additionally, single word wsd systems (as presented in previous publications) often
do not allow for generalization to larger phrases without making compromises
in order to meet the requirements of phrase-based translation systems.
Training data is obtained by extracting phrase-translation pairs with context
information from word-aligned bilingual corpora; the senses of a source phrase are
the set of all translations of this phrase seen in the data. Extracted phrases are
not necessarily syntactically correct phrases, but can be of any form, which differs
from typical wsd tasks where only single content words (nouns, verbs, adjectives)
are disambiguated. As context disambiguation is performed on all phrases, they
also contain non-content words like articles or even punctuation. In most typical
wsd-scenarios, training data consists of carefully annotated material, whereas
the data used for the translation task is produced by automatic word-alignments
and thus contains a considerable amount of incorrect phrase-translation pairs.
The wsd model for Chinese→English translation is based on a
system that yielded very good results on Chinese data in a pure wsd-task. As
the translation system cannot work with multiple, context-aware translation
probabilities, it is necessary to use sentence-wise dynamically created translation
lexicons in order to incorporate the wsd probabilities as an additional feature
in the log-linear model.
In [Carpuat and Wu, 2007a], who focus on the production of context-dependent translation lexicons, context features are described more precisely.
The overall objective is to use a rich set of context features (as is typically done
in wsd) to create a phrase translation lexicon that takes into account various
forms of context that standard phrase-based smt does not factor in.
It was found that the most valuable contexts to disambiguate Chinese
phrases are pos tags on both sides of the phrase, as well as full sentence context
represented as bag-of-words. Full sentence context is usually not available in
phrase-based smt, since smt systems only model local context. The bag-of-words concept makes it possible to generalize sentence context instead of representing full
sentences as very long phrases in the phrase-translation table, which would be
the only possibility to integrate full sentences in a standard smt system.
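A bag-of-words representation of the full sentence context, with the phrase itself excluded, is simple to construct (a sketch; the function name is our own):

```python
from collections import Counter

def bag_of_words_context(tokens, phrase_start, phrase_end):
    """Full-sentence context as an unordered bag of words; the phrase
    tokens[phrase_start:phrase_end + 1] itself is left out."""
    return Counter(w for k, w in enumerate(tokens)
                   if not phrase_start <= k <= phrase_end)
```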
The set of possible contexts also comprises local collocations and basic
dependency features, although there is no detailed description about those
features and their relevance.
[Carpuat and Wu, 2007b] carried out a detailed evaluation using 8 common
evaluation metrics on three data sets. The integration of wsd prediction yielded
improved results on all data sets and for each of the 8 evaluation metrics. A
significance test on nist scores was successful at the 95% level. The positive
overall outcome of their experiments indicates that a close cooperation of wsd
and smt helps to improve translation results.
A more detailed evaluation in terms of typical properties showed that the
result of wsd is superior to the baseline translation probabilities, which has
several effects on the outcome of translations: Valid translation candidates
with a low baseline probability are ranked higher by the context aware system
and consequently are more likely to be chosen by the system. Influenced by
the strong scores of wsd, translation probabilities become more competitive
compared to the (relatively strong) target language model. The most interesting
point is that the stronger scores lead to a segmentation into larger translation
units, whereas the baseline segmentation preferred smaller phrases containing
frequent translation candidates that often turn out to be incorrect.
In our experiments, we found that context-sensitive translation probabilities
have a very interesting influence on segmentation: in contrast to the effect
observed in [Carpuat and Wu, 2007a] and [Carpuat and Wu, 2007b], the context-sensitive translation probabilities in our system trigger a preference for shorter
translation units. The reasons for this effect will be discussed in section 7.4,
where we will also compare our system with the system presented by [Carpuat
and Wu, 2007a] and [Carpuat and Wu, 2007b].
In most of the presented publications, adjacent words and pos tag information
of adjacent words turned out to be the most useful context features regardless of
the examined language pair, whereas more sophisticated features like dependency
structures or local collocations were less relevant. With this in mind, we will
mostly focus on word-based contexts and information derived from pos tags.
2.5 Relation to example-based translation
The integration of contextual knowledge into a smt system is reminiscent of the
basic idea of example-based machine translation (ebmt). In ebmt, sentences
and their (manually produced) translations are stored in a database. In order
to translate a sentence, sophisticated matching strategies are applied to find
database entries close to the input sentence. Matching strategies can be based
on words, but also include different forms of linguistic analysis such as tagging
or parsing in order to identify pos tags, constituents or semantic roles such as
subjects or objects, which allows for more generalization. As an exact match
with a stored example is very rare, the database is searched for fragments that
partially match the input sentence. The translations of these fragments need to
be extracted and recombined into a valid sentence of the target language.
The basic idea is illustrated by the following example: If we were to translate
sentence (1) and the database contains a sentence like (2), it can be used for
translation in the form of (3), where x and y are variables that can be filled
with the actual words Alice and book (or rather their translations). Another
sentence like (4) is needed to provide the translation for book.
(1) Alice buys a book.
(2) Susan buys a bike. → Susan kauft ein Fahrrad.
(3) x buys a y. → x kauft ein y.
(4) I like the book. → Mir gefällt das Buch.
Loosely speaking, the task of ebmt is to find stored examples (or parts thereof)
having the same structure as the input sentence, as well as to extract the
corresponding translation. This is no trivial task if no reliable word-alignment
is provided; building a correct target string with the partial translations is not
easy, either.
The basic principle of phrase-based statistical translation is very similar: It
also consists of splitting an input sentence into smaller units which are then
translated and recombined on the target side. An essential difference is the
fact that ebmt requires the fragments used for translation to come from a
sentence similar to the input sentence, while smt uses statistics computed on
all occurrences of a phrase, which means that the origins of individual phrases
do not have any influence.
Depending on the matching strategy, ebmt requires (more or less) perfect
matches on word level or syntactic level and therefore, the structure of the
source side is a crucial criterion. In contrast, the integration of source side
context information into phrase-based smt is an attempt to find translations
that are appropriate for a given context instead of using a general translation
lexicon; the context-aware translation probabilities are not the main criterion for
a translation, but can rather be regarded as an additional feature or refinement
in an already complex system.
In addition to conditioning the translation of a phrase on its context in the
input sentence, ebmt and the previously presented smt systems also share the
concept of generalization by using pos tags or, at least in some cases, syntactic
dependencies.
3 Integrating context features into a phrase-based statistical translation system
In phrase-based statistical translation systems, the source language input sentence is segmented into (not necessarily linguistically well-formed) phrases, which are translated into target language phrases and then reordered.
Additionally, a target side language model rates the fluency of the translated
sentence.
The probability of translating a source language phrase into a target language phrase is derived from aligned training data by first extracting all phrase-translation pairs and then computing translation probabilities for a given source string based on the extracted pairs.
An interesting aspect of mt-systems is the length of translation units: a
translation system benefits from both short phrases being more universally
applicable and long phrases which translate longer sequences of text and therefore
add to the fluency of the resulting target sentence. However, even larger phrases
do not contain any information on the context they originally appeared in. The
objective of this work is to refine the estimation of translation probabilities by
conditioning source phrases on contextual features.
3.1 Motivation
As phrases, be it single word phrases or multi-word chunks, usually have a
large number of translation candidates, additional information may help to find
appropriate translations and filter out inappropriate ones depending on the
context in which the phrase appears. The translation probability for a source
phrase to be translated into a target phrase is computed as a relative frequency:
the number of times the source phrase was translated into the target phrase,
divided by the total number of occurrences of the source phrase. This estimation does not take into
account further information about the source phrase, but rather produces a
somewhat imprecise estimation of how to generally translate that phrase. While
this works for phrases where all translation possibilities fit equally well, this is a
problem if within a specific context only a subset of the translations is possible.
Considering the context of a phrase when estimating translation probabilities
may help to reduce the set of all translation possibilities to a smaller set better
suited for a given context, as only ’good’ translations for this context are seen
in the training data. Contexts can be as simple as adjacent words or more
sophisticated such as part-of-speech tags (pos-tags) of adjacent words or even
complex syntactic information about a phrase’s role in the sentence it originally
occurred in. Context features are not limited to the actual surrounding of a
phrase, but can also contain information about the phrase itself, like a feature
indicating whether the phrase is a constituent in a syntactic parse tree.
Filtering the set of translation candidates may take place on a relatively
subtle level such as discarding translations whose morphological form is not
quite right in the given context, or in the case of homonymous source words,
where only those translations are to be kept which express the correct meaning.
Actually, slight ambiguities like morphological syncretism can be expected to
be comparatively frequent, whereas ambiguities with more severe effects (like
homonymy) are likely to appear to a lesser extent.
When thinking about phrase translation probabilities, we have to keep in
mind that other factors also influence translation, such as the target side language
model or the reordering model or the segmentation of the source sentence. While
the target side language model and the reordering statistics are indispensable
factors of the translation system, other features can be omitted or modified to
behave in a certain way. In order to see the maximum impact refined translation
probabilities have on the translation output, some experiments in the following
sections are carried out with simplified systems.
Illustrating the need for source-side context The following description
of different scenarios is intended to illustrate the general idea of the usefulness
of context knowledge and is therefore somewhat hypothetical; examples can be
found in sections 3.3.1 and 3.3.2, as well as in the sections dealing with the
evaluation of different context types.
A simple example to demonstrate the need of context knowledge is that of
collocational structures: For collocates, a straightforward, literal translation
is often not the best choice. In a standard phrase based translation system
using phrases of variable length, collocational units such as take a walk or
get used are assumed to be translated as a whole rather than as a sequence
of single-word phrases. Word sequences translated as one phrase need not
necessarily be collocations or linguistically motivated phrases, but need to be
quite frequent in the training data, so that the system can ‘see’ the phrase
and ‘learn’ its translation. When translating sequences longer than one word,
the length of the phrase itself restricts the amount of translation possibilities
by the simple reason that longer sequences are seen less often in the training
data and therefore have fewer translations appropriate for different contexts than
universally applicable single words. However, the ‘context’ of a longer phrase in
a standard system is part of the phrase itself and has to be translated: it simply
captures the whole expression; no context outside the phrase is used. Also,
in the case of shorter phrases and especially of single word phrases, there is
little to no ‘context’ that can help to identify appropriate translations. If we
integrate context information and this context happens to be one adjacent word
of a phrase, then the translation probabilities of a phrase of length n should
correspond, at least approximately, to the ones of the phrase with length n+1,
i.e. consisting of the phrase and the context word.
The lack of context becomes especially evident when looking at the simpler
task of word translation: As already mentioned, simplified systems help to
better understand the effect of contextually conditioned probabilities. Limiting
the source phrase length to single words results in a system with no context
information at all and thus can be used as a sort of baseline. Also, without
variation in the segmentation of source sentences, this allows for a more direct
comparison of the baseline system and the modified one.
The baseline character of such a limited system is illustrated by the following
example: When translating a sentence containing a collocational structure like
get used, a system restricted to single word translation might choose the most
probable translation of each word, i.e. the default translation. In the case of
used, a literal translation with the meaning of applied is most probable but not
a good choice in this situation.
(5) the commission also needs to get used to the idea that ...
When conditioning the translation probabilities on the word on the left side of
the phrase, as shown in example (5), translation probabilities for used→gewöhnt
(accustomed) can be assumed to become larger if used appears in the constellation
get used, while the probabilities for translations with the meaning of applied
should decrease. While there are a lot of occurrences in the training data where
used is translated as applied, none of them are likely to co-occur with get and
thus will be ignored when estimating translation probabilities. This example
illustrates how word translation benefits from being conditioned on contexts; as
single words are more or less universally applicable, context-aware translation
probabilities help to find the correct word sense in a given context.
When translating phrases, the system also profits from the integration of
context information. As suggested in sentence (6), get used is translated as
a unit and it is very likely that the system picks a valid translation with the
meaning of to accustom. Being conditioned on the word on the left side, to,
further narrows down the set of ‘good’ translations, e.g. by preferring infinitive verbs over finite verbs.
(6) the commission also needs to get used to the idea that ...
Refined phrase translation probabilities might also have a positive influence
on the segmentation of the input sentence: By giving high probabilities to
‘good’ phrases, ‘good’ segmentation should be enhanced; in fact, contextually
conditioned translation probabilities have an interesting effect on segmentation
which will be discussed in detail in section 7.4.
While phrases are not linguistically motivated, they need to be well-formed
in terms of their alignment structure; this means that an input sentence cannot
always be segmented into arbitrary combinations of phrases within a predefined length.
This is illustrated by the following example containing a crossing alignment
structure:
(7) that i have slept → dass ich geschlafen habe
    (have is aligned to habe and slept to geschlafen, i.e. the alignment crosses)
When translating this sentence, the possibilities for segmentation are limited as
the phrase i have → ich habe is not possible since phrases need to be continuous.
If the longer phrase i have slept is not available, then the ‘link’ between i and
have on the source side is lost. Integrating context information would make it possible to
condition the phrase have on its left word i and thus approximate the probability
estimation for the non-well-formed phrase i have. As German verbs have different
forms depending on the subject, it could prove useful to know whether the source
sentence contains e.g. i have or you have.
Actually, the phrases i and have have a good chance to be translated correctly
even if not conditioned on a context: The target side language model, which
rates the similarity of translation candidates with sentences seen in training
data, would not give a high score to an invalid translation like [i] [have] → [ich]
[hast], simply because the obviously wrong string ich hast cannot appear in
the training corpus. Thus, the language model can also be seen as a tool to
indirectly solve ambiguities by only accepting strings that are likely to occur in
the target language. The language model is a relatively strong component in
the smt system; while it is restricted to relatively local ambiguities (limited by
the size of the used n-grams), problems like the different realizations of have in
German can be solved if the corresponding subject appears next to it on the
target side.
A disadvantage of the language model is its tendency to overrate translations
containing frequent words; in the case of used, the default translation with the
meaning of applied would probably be favored in every context, because relatively
highly frequent words occurring within different contexts in the training data are
likely to receive high scores and fit well into translated sentences. Additionally,
the ‘direct link’ between two phrases that were adjacent in the source phrase
might be lost if target phrases are reordered. If the distance introduced by
reordering exceeds the n-gram size, then the language model has to rate the
translations of those phrases separately.
While the language model already is very useful, a smt system should also
profit from contextually modified translation probabilities, as they are able to
capture connections between phrases that would otherwise be lost and to enhance
translation probabilities of comparatively infrequent phrase-translation pairs.
Another advantage of conditioning on source side context is the possibility
of generalization: So far, we mostly discussed adjacent words as context features.
While this seems promising, purely lexical context features are prone to data
sparseness; by replacing words with e.g. pos tags, the data can be better
exploited. In many cases, it is sufficient to know whether a phrase occurred to
the right of a determiner in order to find out that the phrase is/begins with a
noun and not e.g. a verb. As translation statistics conditioned on pos based
contexts are estimated on a basis of more data than word based contexts, they
can be expected to be more representative and thus provide information that cannot
be captured by larger phrases.
Linguistically motivated information might turn out useful in cases like the
following examples. A homonym like light can be both an adjective or a noun
and means either the opposite of the adjective heavy or describes the effect of
switching on a lamp; thus, there is need for disambiguation in order to guarantee
that the correct reading is chosen.
We can take a further step beyond the disambiguation of word senses and
look at the morphological level: for example, knowing whether a (noun) phrase is on the left or the right side of a verb might help the system to pick a
German target phrase having the morphological attributes of either a subject or
object phrase depending on the context.
pos tags will not only be used as features on the source side: in the experiments in chapter 8, pos tags on the target side will also be integrated.
In the course of this work, we want to add more refined probabilities for the
translation of a given source language phrase that are not only based on phrase
counts, but also on the context the source phrase appeared in. By integrating
different types of context features into a standard phrase-based translation
system, we hope for a general improvement of translation quality, not only in
cases with some form of polysemy or homonymy. As not every possible context
is likely to be useful, we also experiment with evaluations of the quality and
reliability of contexts.
In the next sections, the phrase extraction procedure and construction of
standard phrase translation tables as well as the integration of source-side
context information are described. All experiments were carried out with the
statistical machine translation system moses¹. Thus, explanations specifically
refer to this system. For more detailed background on moses, see also [Koehn,
2010].
3.2 Translation phrase tables
Basically, a phrase-based translation system works with the translation probabilities of phrases, a reordering model and a target language model to judge whether
the output is a good target language string. Phrase translation probabilities
and additional features like lexical weighting and a phrase length penalty are
listed in a phrase table.
To a great extent, translation quality depends on phrase translation probabilities. The following sections illustrate the process of creating phrase tables as well as how their features are factored into the translation system.
3.2.1 Estimating translation probabilities
To build translation tables, phrase pairs are extracted from word-aligned text
and for each phrase pair, several scores including the respective translation
probabilities are computed. While phrases are not linguistically motivated, they
allerdings sehen wir auch , dass bestimmte verfahrensweisen überholt sind .
but we also recognise that some of the practices are outdated .

English phrase (src)        German phrase (trg)
but                         allerdings
but we also recognise       allerdings sehen wir auch
but we also recognise       allerdings sehen wir auch ,
recognise                   sehen
we also recognise           sehen wir auch
we also recognise           sehen wir auch ,
we also recognise that      sehen wir auch , dass
we                          wir
we also                     wir auch
we also                     wir auch ,
also                        auch
also                        auch ,

Figure 2: A sample of extracted phrase pairs from word-aligned text. The maximum phrase length is set to 5.
¹ http://www.statmt.org/moses/index.php?n=Main.HomePage
source         target          ϕ(src | trg)   lex(src | trg)   ϕ(trg | src)   lex(trg | src)
are outdated   sich            1.2199e-05     8.0279e-08       0.11111        0.023719
are outdated   sind veraltet   0.125          0.0529022        0.11111        0.018204
are outdated   veraltet sind   0.0625         0.0529022        0.11111        0.018204
are outdated   veraltet        0.009009       0.0018107        0.11111        0.095238
are outdated   veralteten      0.0098039      0.0022989        0.11111        0.073016
are outdated   überholt sind   0.210526       0.0367321        0.44444        0.029126

Table 1: Phrase table entries for the English phrase are outdated. The value of the phrase penalty feature is always exp(1) = 2.718.
need to be well-formed in terms of their alignment, i.e. multi-word phrases have to
be continuous. As illustrated in the example below (figure 2), extracted phrase
pairs are of different length, ranging from one word to a predefined maximum
phrase length. The extraction method starts with listing a short phrase pair (i.e.
two single words aligned to each other) and then continuously adds subsequent
phrase chunks until the maximum phrase length is reached.
The entries in the table in figure 2 show the procedure: starting with the
minimal unit but - allerdings, the algorithm adds the next minimal phrase chunk
we also recognise - sehen wir auch.
While the single phrase pairs of the second block, recognise - sehen, we - wir
and also - auch, are valid entries in the phrase collection, only the entire
block can be added to the first phrase. This is due to the alignment structure:
a phrase pair like but we - allerdings auch is not well-formed, as the two words
sehen wir on the German side would be skipped.
Unaligned words, such as the comma in the fifth position, are added to their
adjacent phrases, leading to multiple entries like the second and third entry in
the table in figure 2, where one part of the pair remains the same.
When all phrase pairs are extracted, the probability for a given source phrase
to translate into a target phrase can be estimated by the relative frequency:
ϕ(trg | src) = count(src, trg) / Σ_{trg_i} count(src, trg_i)
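As a small sketch (not the thesis' actual extraction code), the relative-frequency estimate can be computed from a list of extracted phrase pairs; the toy counts below loosely mirror the are outdated - überholt sind entry, which is seen 4 times out of 9 extractions:

```python
from collections import Counter

def estimate_phrase_probs(phrase_pairs):
    """Estimate phi(trg | src) by relative frequency over extracted phrase pairs."""
    pair_counts = Counter(phrase_pairs)                    # count(src, trg)
    src_counts = Counter(src for src, _ in phrase_pairs)   # sum over all trg_i
    return {(src, trg): c / src_counts[src]
            for (src, trg), c in pair_counts.items()}

# Toy extraction data: "are outdated" seen 9 times in total,
# 4 of which are aligned to "überholt sind".
pairs = [("are outdated", "überholt sind")] * 4 + \
        [("are outdated", t) for t in
         ["sich", "sind veraltet", "veraltet sind", "veraltet", "veralteten"]]

probs = estimate_phrase_probs(pairs)
print(round(probs[("are outdated", "überholt sind")], 4))  # 0.4444
```

The 4/9 ≈ 0.4444 result matches the order of magnitude of ϕ(trg | src) = 0.44444 for überholt sind in table 1.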
The standard phrase table in the moses system lists 5 scores for each phrase
pair: the translation probabilities ϕ(src | trg) and ϕ(trg | src), lexical weightings
lex (trg | src) and lex (src | trg) and a phrase penalty. Table 1 shows some example
entries for a phrase of the example in figure 2.
Lexical weighting is used to judge the reliability of phrase pairs. Since
translation probabilities are 1 in the case of a phrase pair occurring only once,
such pairs are often overestimated. To compute lexical weight scores, phrases
are basically backed off to single words whose - more reliable - translation
probabilities are estimated from the corpus data. The lexical weight for a target
phrase with given source phrase and word alignment is defined as follows:
lex(trg | src, a) = ∏_{i=1}^{length(trg)}  1 / |{j | (i, j) ∈ a}|  ·  Σ_{(i,j) ∈ a} w(trg_i | src_j)
Each target word is generated by the source word it is aligned to. In the case of
a one-to-many alignment, the average value of all alignment pairs is taken. If a
word is not aligned, i.e. if it is aligned to the null word, the factor w(trg_i | null)
is used. For the phrase pair überholt sind - are outdated, the lexical weighting is
computed as:

lex(überholt sind | are outdated, a) = w(überholt | outdated) · w(sind | are)

where a denotes the respective alignment between überholt and outdated and
between sind and are. As both alignments are plausible, the respective word
translation probabilities can be expected to be relatively high.
Bad phrase pairs, i.e. pairs whose single word translations have low probabilities,
score lower than phrases with well-matching single word alignments and might
thus be dispreferred by the system. For example, the phrase pair are outdated
- sich has a significantly lower lex (src | trg) than the other phrases. As the single
target word sich is aligned with two source words, its lexical score is the average
of the two corresponding word translation probabilities:

lex(sich | are outdated, a) = 1/2 · [w(sich | outdated) + w(sich | are)]

While the alignment of sich and are is at best very dubious, the alignment
between sich and outdated is clearly wrong. Therefore, both word translation
probabilities must be quite low, resulting in a low overall score. Given that
the translation probability ϕ(trg | src) is the same as for most pairs, the low
lexical weighting score helps to filter out bad translation candidates. Both
lexical weighting and translation probabilities are computed for both translation
directions.
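The lexical weighting computation can be sketched as follows; the word translation probabilities w(·|·) are invented for illustration, only the arithmetic mirrors the formula above:

```python
def lexical_weight(trg_words, src_words, alignment, w):
    """lex(trg | src, a): for each target word, average w(trg_i | src_j)
    over all source words it is aligned to, then multiply the averages.
    (A NULL-aligned word would use w(trg_i | NULL) instead.)"""
    score = 1.0
    for i, t in enumerate(trg_words):
        links = [j for (ti, j) in alignment if ti == i]
        score *= sum(w.get((t, src_words[j]), 1e-9) for j in links) / len(links)
    return score

# Hypothetical word translation probabilities for the two examples in the text.
w = {("überholt", "outdated"): 0.5, ("sind", "are"): 0.4,
     ("sich", "are"): 0.001, ("sich", "outdated"): 0.001}

# Plausible alignment: each target word linked to one source word.
good = lexical_weight(["überholt", "sind"], ["are", "outdated"],
                      [(0, 1), (1, 0)], w)   # 0.5 * 0.4 = 0.2
# Bad pair: "sich" aligned to both source words, average of two low probs.
bad = lexical_weight(["sich"], ["are", "outdated"],
                     [(0, 0), (0, 1)], w)    # (0.001 + 0.001) / 2 = 0.001
print(good, bad)
```

As in the text, the implausible pair are outdated - sich receives a far lower lexical weight than überholt sind.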
Since an input sentence needs to be segmented into phrases before being
translated, and at this point all segmentations are equally likely, the system
needs to know whether to favor a few long phrases or more short phrases. This is
done by introducing a phrase penalty score ρ that prefers longer phrases when
ρ < 1 and shorter phrases when ρ > 1. In the case of equally scoring phrases,
the system generally prefers the longer one. This is a positive effect, since longer
phrases contain more context, and potentially overestimated reliability caused by
data sparseness has already been dealt with by the lexical translation score.
3.2.2 Reordering
Once a phrase segmentation and the respective phrase translations have been found,
translated phrases may be reordered. In the case of English-German translations,
the order of verb and subject has to be fixed (cf. the phrase pair sehen wir - we
... recognise in table 2), as does the position of verbs in verb-final structures (cf.
überholt sind - are outdated in the example). In contrast to e.g. the switching
of adjectives and nouns in French, both movements are not local but can span
a distance of several words.
If reordering is handled with a (distance-based) model where a cost function
generally punishes movements, reordering can only be justified when there is a
large gain in the language model score. For local changes, like adjective-noun
conversion when translating French, this method works quite well, but it reaches
its limit when larger movements are necessary.
In a lexicalized reordering model, reordering depends on the actual phrases,
assuming that some phrases have more characteristic reordering properties than
others. The reordering model learns a probability distribution from word-aligned
text to predict reordering properties for the phrases listed in the phrase table.
3.2.3 Parameter tuning
When translating a source sentence into a target sentence, a linear combination of
features and feature weights is maximised. Such combinations, called log-linear
models, have the following form:
p(x) = argmax_x exp Σ_{i=1}^{n} λ_i h_i(x)
where the random variable x represents the source and target sentence and its
segmentation into phrases. There are n feature functions h_i with corresponding
weights λ_i. Features include the language and reordering models and the
probabilities listed in the phrase table. The setting of log-linear models has
two advantages: it allows adding new features, and the features can be weighted
differently. Feature weights are learned by optimizing a standard evaluation metric
(such as bleu) on a development set. This means that the parameters λ_i are
modified until convergence such that they maximize the bleu score yielded when
translating the development set. Since parameter tuning tries to achieve
optimal scores of a standard evaluation criterion on development data, it is also
called minimum error rate training.
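The following sketch shows how a candidate is scored under such a log-linear model; the feature names, weights and log-probability values are invented for illustration and are not tuned values from the thesis:

```python
import math

def loglinear_score(features, weights):
    """sum_i lambda_i * h_i(x); the h_i are log-probabilities here,
    so exp() of the sum would recover the model score."""
    return sum(weights[name] * value for name, value in features.items())

# Hypothetical weights and feature values (log-probs) for two candidates.
weights = {"lm": 0.5, "phi": 0.3, "lex": 0.2}
candidates = {
    "sie verwendet werden":   {"lm": math.log(0.02), "phi": math.log(0.0989),
                               "lex": math.log(0.02)},
    "sie sich daran gewöhnt": {"lm": math.log(0.05), "phi": math.log(0.00068),
                               "lex": math.log(0.01)},
}

# The decoder picks the candidate maximizing the weighted feature sum.
best = max(candidates, key=lambda c: loglinear_score(candidates[c], weights))
print(best)
```

Changing the weights λ_i, as minimum error rate training does, can change which candidate wins; with these invented values the first candidate scores higher.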
3.3 Including context information
In this work, only source-side context features are used as the basis for a more refined
estimation of translation probabilities. Given that we know the entire source
sentence when translating, we also have information about every imaginable
context in this sentence at our disposal and can therefore work with translation
probabilities computed specifically for this very context (provided it has been
seen in the training data). We even have the option to either substitute the
original ϕ(trg | src) with the new probability distribution conditioned on context
features or to add the new distribution to the original phrase table, as the design
of the log-linear model allows for additional features. The task of translating a
sentence by means of various probability estimates remains the same - hence, there is
no need to change the decoding algorithm or make modifications other than
adding the new translations into the phrase table.
However, working with the context of phrases on the target side is considerably
more difficult: during the translation process, the final target sentence
is not yet known. As we do not know a translation candidate's
environment and position in the target sentence, we cannot work with translation
probabilities modified to fit a special context. The target side phrases are
dependent on each other; thus a context such as an immediately adjacent
word would not be a fixed constant (as on the source side), but would also depend
on adjacent phrases. The sole exception for target side information is an
invariant description of the target phrase itself, such as the pos tags of
its words. Adding context features on the target side that are conditioned on
phrases in the target output would require adapting the decoding method.
Also, it might not be necessary to incorporate additional features on the
target side: Language models are a strong component in smt-systems and, in
combination with reordering models, cover at least local context of phrases on
the target side.
Context can take different forms: the simplest contexts are the words next
to a phrase, i.e. a sequence of n words to the left or the right of a phrase.
However, adjacent words, even when lemmatized, lead to sparse data. An
obvious step to overcome this sparseness is a reduction to part-of-speech (pos)
tags: not a sequence of adjacent words, but a sequence of pos tags is used
as the context of a phrase. Examples for both types of context, words and
pos tags, will be presented in the next sections.
3.3.1 A simple first example
In this section, one of the examples briefly mentioned in 3.1 will be discussed in
more detail: When the phrase length is restricted to 1, we found that in the
case of the phrase used, one word on the left side helps to disambiguate between
the default translation verwendet (applied) and a variant appropriate for the
collocational structure get used.
The three entries above the line in table 2 are the phrase-translation pairs
for used with the highest probabilities, the remaining three entries are some
of the translations one would use in the context of get. In total, there are
1883 translation possibilities, derived from 7394 extracted phrase-translation
pairs. It is easy to see that the top three entries have a much higher translation
probability than the translations suitable for the context get, even though the
latter are comparatively frequent translation candidates with the meaning
of gewöhnen (to accustom).
When having a look at the column labelled context = get, we see that the
first three translations were never seen when the phrase used co-occurs with get.
In total, 14 translation possibilities were seen in this specific context, of which
13 occurred only once. While this probability distribution is rather flat, it only
                                           normal           context = get
source  target           gloss             f     ϕ          f    ϕ_get
used    verwendet        applied           731   0.0989     0    0
used    genutzt          utilized          557   0.0753     0    0
used    eingesetzt       deployed          557   0.0753     0    0
used    gewöhnen         to accustom       7     0.00095    5    0.2778
used    daran gewöhnt    accustomed to     5     0.00068    1    0.0556
used    zur gewohnheit   (to the) custom   1     0.00014    1    0.0556

Table 2: Translation probabilities and occurrence frequencies for the single word
phrase used without context and conditioned on the word get on the left side.
contains translations that are suitable for the context get.
Without wanting to anticipate a more detailed evaluation of a system using
only single word source side phrases, the following sentences illustrate the
effect that the modification of translation probabilities has on machine translation
output.
(8) the question is whether you get used to the minor annoyances ...
(9) die frage ist , ob sie verwendet werden die kleinen schikanen ...
(10) die frage ist , ob sie sich daran gewöhnt , die kleinen schikanen ...
(11) die frage ist bloß , ob man sich an diese kleinen unannehmlichkeiten gewöhnt ...
Both (9) and (10) are output of an mt-system, while (11) is a reference translation.
Sentence (9) is the translation of (8) using a system without context
information. As expected, used is translated as verwendet, the entry with the
highest probability. In combination with werden, the translation of get, it forms
a common German phrase which is grammatical but not appropriate here.
In contrast, the translation of used in sentence (10) is daran gewöhnt. Though
this is not the highest scoring option from the set of translations in a system
using context information, the word sequence sich daran gewöhnt (i.e. the
translation of get used), is very common. This is also reflected in the language
model which might have contributed to the fact that daran gewöhnt was chosen
over gewöhnen, the candidate with the highest translation probability.
In table 3, some of the phrase-translation pairs and the respective translation
probabilities for the phrase get used are listed. While the distribution is not
exactly the same as in the case of used with the context get, they are quite
similar with respect to both the number of translation candidates and the
translation candidates themselves.
A more detailed comparison of a length-1 system with context information
and a length-2 system without context information, as well as a more fine-grained
evaluation, will be provided in section 6.1.
target           gloss          ϕ
gewöhnen         to accustom    0.2222
gewöhnt euch     accustom       0.1111
gewöhnt          accustomed     0.1111
sich             himself        0.1111
gewöhnen sich    accustom       0.037

Table 3: The most probable translations for the phrase get used without context
information.
3.3.2 Expanded example: Using Pos-based context templates
As already mentioned, pos tags are a simple but effective method for generalization.
Replacing adjacent words with their respective pos tags has the advantage
that sparse data is no longer a problem: while a specific word might be seen
only once or twice, or not at all, nouns or verbs occur sufficiently often. This
means not only that phrases occurring in combination with infrequent words
are not excluded from contextual conditioning, but also that the translation
statistics are much richer, as their estimation is no longer based on the
(modest) number of occurrences of a specific word, but on the entire class of
this word's pos tag. On the downside, the loss of lexical information could also
have adverse effects in cases where a specific word is needed for disambiguation
instead of a pos tag.
In the following example, we discuss simple pos tag based contexts^2 of the
English word fine. fine has more than one meaning: first, it can be an adjective
with the meaning of nice. Actually, the meanings of fine as an adjective can
vary widely, ranging from beautiful or pretty, the most prominent readings
(cf. 12), to ok or alright when used at the beginning of a sentence as in (13).
When having a closer look at the translation phrase table, one can see that
there is also a great variety of translations that loosely fit the concept of
something positive, such as hehr (noble) or feierlich (festive), but which were
seen only once or twice. The second reading of fine is that of a financial penalty.
It can be realized either as a verb (14) or a noun (15).
(12) they were given plenty of fine words, but no practical action.
(13) fine , i can accept that , but what gives you the right ...
(14) the recent decision to fine ryanair, europe's main budget airline, for its arrangement with local officials ...
(15) the ... national radio and television council ... decided to impose a fine of euro 100 000 on the private television station mega channel.
In table 4, translation probabilities for the English word fine and several German
translation candidates are listed, as well as the numbers of occurrences with
and without context (labelled f). In total, the English-German phrase pairs
containing fine were seen 1266 times in the training data, resulting in a set of
379 translation candidates.
The numbers in the column titled normal were computed without including
context features. It is evident that different forms of schön (beautiful) and gut
(good) are the most prominent translation candidates. Also, variants of the
confirmation marker in ordnung (ok) are contained in the list. While these two
readings might not always be separated clearly, the reading of fine as financial
penalty is clearly distinguishable from the other ones.
However, the probabilities for fine to be translated into a phrase of the
financial penalty reading (rows with a grey background in table 4) are very
low compared to the different forms of schön or gut. Given that occurrences
of fine meaning something like pretty or good are significantly more frequent
than occurrences with the meaning of financial penalty, the phrase translation
probabilities for gut or schön are relatively high and thus, fine is not very likely
to be translated as geldstrafe. Also, a relatively high frequency phrase like
schönen or schöne can be expected to be well-liked by the language model.
As illustrated in table 4, using context information can help to divide
translation candidates into different sets: In the ‘context 1’ column, only phrases
co-occurring with the pos tag to on the left side were considered. This pos
2 A description of the tag set is provided in section 3.4.1
                                           normal           context1        context2
target                gloss               f     ϕ           f    ϕc1        f    ϕc2
schönen               beautiful           160   0.1264      0    0          3    0.038
schöne                beautiful           120   0.0948      1    0.0435     3    0.038
gute                  good                70    0.05529     0    0          3    0.038
guten                 good                62    0.04897     0    0          2    0.0253
gut                   good                60    0.04739     0    0          0    0
in ordnung            ok                  26    0.02054     0    0          0    0
schön                 beautiful           21    0.01659     0    0          0    0
schönes               beautiful           19    0.01501     0    0          0    0
gutes                 good                19    0.01501     0    0          0    0
geldstrafe            penalty             19    0.01501     2    0.087      11   0.139
ausgezeichneten       excellent           18    0.01422     0    0          1    0.0127
strafe                penalty             15    0.01184     2    0.087      8    0.101
verhängen             impose              2     0.00158     2    0.087      0    0
einer geldstrafe      of a penalty        1     0.00079     1    0.0435     0    0
bußgelder gegen       penalty against     1     0.00079     1    0.0435     0    0
geldbußen verhängen   impose a penalty    1     0.00079     1    0.0435     0    0
geldstrafe bezahlen   pay a penalty       1     0.00079     1    0.0435     0    0
geldstrafe gegen      penalty against     1     0.00079     1    0.0435     0    0
geldbuße              penalty             1     0.00079     0    0          6    0.076

Table 4: Translation probabilities for the English phrase fine. In total,
there are 379 translation candidates, 290 of which have a probability of
0.00079. The entries above the line are the top 12 translation candidates;
the ones below are representative for context1 and were manually selected.
context1: the pos tag on the left side of the phrase is to.
context2: one pos tag on the left side of the phrase (dt) and the leftmost pos tag
of the phrase (nn).
tag, representing the word to to the left of infinitive verbs (cf. example 14), almost
always translates to a phrase related to the financial penalty reading. Since fine
occurring immediately after the pos tag to needs to be a verb, this context not only
disambiguates the basic meanings of the phrase, but also tends to separate
verbal from nominal translations.
The distribution of this context is very flat: 23 English-German phrase
pairs with this context have been extracted, resulting in 21 phrase-translation
candidate pairs. The two translations with the highest probability are geldstrafe
and verhängen (impose). While verhängen is not a good translation for fine,
it is still considered appropriate in this context since geldstrafe+verhängen is
collocational.
In context 2, we look at the first pos tag of the phrase and the pos tag on its left:
in the example, fine is a noun, preceded by an article. In contrast to context 1,
verbal translations are not part of the resulting distribution. While some
forms of schön or gut are also among the translation candidates, the nouns geldstrafe,
strafe and geldbuße are clearly favored, with translation probabilities even higher
than the highest scoring translations without context.
Summarizing, we can say that contextual features, as simple as an adjacent
word as in section 3.3.1 or pos-tags, can help to disambiguate translation
candidates.
3.3.3 Implementation
The phrase translation probability ϕ(trg | src_context) is computed like the standard
translation probability ϕ(trg | src), using relative frequencies:

ϕ(trg | src_context) = count(src_context, trg) / Σ_{trg_i} count(src_context, trg_i)
When extracting phrase-translation pairs from aligned training data (cf. figure 2),
the respective contexts are extracted as well. In the case of more sophisticated
context features (e.g. pos-tags), the training data as well as the test set and
the development set require pos tagging or another form of pre-processing.
The next step is to assign the modified translation probabilities to the
corresponding phrases: since a phrase can appear in different contexts, there can
be different sets of translation probabilities. In a standard system, a phrase and
its translation candidates are listed only once in the phrase table, i.e. each time
this phrase is to be translated, the same phrase table entries are applied. In
order to find the corresponding contextually conditioned probabilities for each
phrase during the translation process, each word is given a unique identifier both
in the sentences to be translated and the phrase table entries. A simple way to
do so is to consecutively number the words in the sentences to be translated
and then transfer these numbers to the entries in the phrase table.
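The consecutive numbering scheme can be sketched as follows; the example sentences are those of figure 3, and the exact identifier format is illustrative:

```python
def number_words(sentences):
    """Consecutively number the words of the sentences to be translated,
    so that each occurrence of a phrase can receive its own, contextually
    conditioned phrase table entries."""
    numbered, idx = [], 1
    for sent in sentences:
        out = []
        for word in sent.split():
            out.append(f"{word}{idx}")  # attach the unique identifier
            idx += 1
        numbered.append(" ".join(out))
    return numbered

result = number_words(["the front of the hotel", "in front of the hotel"])
print(result)
# ['the1 front2 of3 the4 hotel5', 'in6 front7 of8 the9 hotel10']
```

The two occurrences of front of now carry distinct identifiers (front2 of3 vs. front7 of8), so the phrase table can list separate probability sets for each occurrence.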
For example, assume that the phrase front of occurred twice in the test set,
once in context a and once in context b. The resulting phrase table thus needs
to contain entries for the phrase fronti ofj conditioned on context a and entries
for the phrase frontk ofl conditioned on context b. This principle is illustrated
by the simple example given in figure 3, where the contextually conditioned
probabilities prefer or disprefer the translation vorderseite von depending on
the word to the left of the phrase front of.
Since each word in the modified system is uniquely identified, the translation
phrase table contains several entries for the same phrase pairs depending on
the context they appeared in. This approach leads to a huge inflation of phrase
tables. To make phrase table size manageable, only phrase translation pairs
phrase      translation       ϕ          phrase        translation       ϕc
front of    vor der           0.4        front2 of3    vor der           0.2
front of    vor dem           0.3        front2 of3    vor dem           0.2
front of    vorderseite von   0.3        front2 of3    vorderseite von   0.6
                                         front7 of8    vor der           0.6
                                         front7 of8    vor dem           0.4

a: the1 front2 of3 the4 hotel5
b: in6 front7 of8 the9 hotel10

Figure 3: A simple example demonstrating how modified translation probabilities
are written into the phrase table. The table on the left side gives the original
translation probabilities for the phrase front of, while the table on the right side
shows the new probabilities conditioned on the context one word on the left side
of the phrase.
with a probability above a threshold are listed in the translation table. Another
effect of this representation is that phrase tables are not universally applicable
but need to be produced specifically for the sentences to be translated, i.e. the
development and the test set.
Unknown words require an additional post-processing step: while unknown
words are not listed in the phrase table, they may still be represented in the
language model. Since they cannot be translated, unknown source phrases are
simply copied into the target language sentence (e.g. proper names). However,
when marked with a unique identifier, these words cannot be found in the
language model. Hence, input sentences need to be compared with the phrase
table to remove the identifiers from those words not represented in the phrase tables.
In the example in figure 3, it becomes evident that some translations are
not seen in combination with a given context and therefore obtain a probability of
0, i.e. they are discarded from the set of possible translations. For different
reasons, it might be a good choice to keep these translations and give them
a small pseudo-probability. This issue of smoothing will be addressed in
section 5, where different approaches will be discussed and compared.
3.4 Experimental settings and tools
For the translation experiments, we used the standard moses system; alignment
was done using giza++. The training data consists of newspaper text and
europarl^3 (1.466.088 sentences in total). All words are lowercased in order to
maximally exploit the data, and no sentence is longer than 70 tokens.
1025 sentences were used for parameter tuning (see 3.2.3), almost the
same size as the test set (1026 sentences). Test set and tuning set are newspaper
text and very similar in domain: for each sentence in the tuning set, an
adjacent sentence is taken into the test set. Both test set and tuning set, as well
as the training corpus, are data from the shared task 2009^4.
For pos-tagging, we used Tree-Tagger [Schmid, 1994], a widely used tool for
the annotation of pos-tags and lemma information. It exists for many (European
and non-European) languages, including English and German. A variant of
Tree-Tagger is also capable of chunking and was used for the annotation of
chunk-based contexts.
3.4.1 Description of the Pos tag set
The English tag set is based on the Penn tag set^5, which was slightly adapted
to our needs. Most modifications concern the tags of verbs: In contrast to the
original set, we distinguish between full verbs and the auxiliary verbs to be
and to have, but not between past tense and present tense. Furthermore, no
difference is made between nouns in singular and plural.
Overall, the tag set was intentionally left relatively rich, especially for the tags
of verbs, which have different tags for auxiliaries and also represent a wide variety
3 http://www.statmt.org/europarl/
4 http://www.statmt.org/wmt09/
5 http://www.comp.leeds.ac.uk/ccalas/tagsets/upenn.html
tag    example           description          tag    example        description
cc     and, or           conjunction          vb     be (inf.)      aux: be
dt     the, a            determiner           vbff   be (fin.)      aux: be
jj     cheap             adjective            vbg    being          aux: be
jjr    cheaper           adjective            vbn    been           aux: be
jjs    cheapest          adjective            vh     have (inf.)    aux: have
pdt    half, all         quantifier           vhff   have (fin.)    aux: have
pos    's                possessive           vhg    having         aux: have
pp$    our, his, their   pronoun              vhn    had            aux: have
pp     he, she, we       pronoun              vv     verb (inf.)    verb
nn     store, client     noun                 vvff   verb (fin.)    verb
in     under, about      preposition          vvg    verb + ing     verb
rp     (pull) out        particle             vvn    verb-part.     verb
wdt    that, which       rel. pronoun         md     would, can     modal verb
wp$    whose             rel. pronoun         to     to             to
wp     who               rel. pronoun         rb     late           adverb
wrb    when, how         question pron.       rbr    later          adverb
ex     there (is)        exist. ‘there‘       rbs    latest         adverb
fw     i.e.              abbreviation         cd     one, 1.9       number
(, )   (, )              punctuation          ls     4              number
”      ”                 punctuation          sent   . ,            punctuation

Figure 4: English tags with examples
of grammatical forms. In figure 4, all tags are listed with an example and a
description. Additionally, the beginning and the end of a sentence are marked
as well and used as context for the very first or last word of a sentence.
3.4.2 Dimensions of the original and modified phrase-tables
Phrase-tables cannot be of unlimited size, as all translation options need to be
available during translation and the system therefore needs to be capable of loading
the entire table. When the phrase-tables are used during parameter tuning or
the actual translation, they are filtered so that they only contain phrase-translation
pairs that also occur in the test set or the tuning set. In the case of a standard
phrase-table with 68.733.076 entries, only 7.85 percent (5.398.857 entries) are
left after filtering for the translation of roughly 1000 sentences.
The modified phrase-tables only have entries for phrases seen in the test set
in any case; otherwise, there would not be a context to be used in the estimation
of probabilities. However, the design of the modified phrase-tables, i.e. listing
each phrase occurring in the test set separately in order to find the probabilities
for each context, leads to an inflation of the phrase table.
The dimension of this inflation, and inherently also the need to cut off some
entries, is illustrated with the example of the article the. As the is highly
frequent, a lot of translation options were seen during training,
resulting in a set of 109.384 translation candidates for the single-word
phrase the alone. In the test set, the phrase the occurs 1.809 times; if we insisted
on including every entry for every occurrence of the, we would end up with
1.809 · 109.384 = 197.875.656 entries for this phrase alone. Less frequent words
also have quite some effect: in the case of the word public (26 occurrences in
the test set) with 2.495 translation possibilities, including every entry of every
occurrence would result in 26 · 2.495 = 64.870 entries.
In the experiments presented later, only entries with a translation probability
of 0.0002 or more are listed in the phrase table. This threshold was also used in
the experiments by [Gimpel and Smith, 2008], who found that such filtering had
no apparent effect on the output of their systems. As there is no need to filter
phrase-tables in the baseline system, no filtering was applied to them.
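A sketch of this threshold filtering; the entries and probabilities below are invented, only the 0.0002 cutoff comes from the text:

```python
def filter_phrase_table(entries, threshold=0.0002):
    """Keep only phrase-translation entries whose contextually conditioned
    translation probability reaches the threshold."""
    return [e for e in entries if e[2] >= threshold]

# Hypothetical (source, target, probability) entries for one occurrence of "the".
entries = [("the", "die", 0.35),
           ("the", "der", 0.30),
           ("the", "einem", 0.00015)]   # below the threshold, dropped

filtered = filter_phrase_table(entries)
print(len(filtered))  # 2
```

Applied per occurrence, this keeps the per-phrase entry count far below the worst-case product of occurrences and translation candidates computed above.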
4 Evaluation methods
When evaluating the performance of an mt-system, there are several aspects to
consider: while there are technical points like translation speed or the
amount of memory used during translation, the two main criteria are fluency
and adequacy. These criteria are motivated by the idea that the translated
sentence should convey the exact meaning of the source sentence without adding
content or leaving out information - and thus be adequate. The output
should also be good with respect to grammar and word choice of the target
language, i.e. consist of fluent English or German.
While a human annotator would be perfectly capable of judging mt-output
in terms of fluency and adequacy, there are two problems: first, a manual
evaluation is too expensive and time-consuming, especially when the evaluation
result is needed to rate different systems during development. Also, judgements
produced by human annotators can be expected to be subjective and therefore
not always correspond perfectly with those of another judge (inter-annotator
agreement).
When evaluating mt-systems, we need an evaluation metric that is both
consistent in different situations and can be produced automatically in a reasonable
amount of time. Also, we have to define a meaning for the score computed
by the metric. This leads to the idea of ranking two systems against each other
instead of using an absolute value for fluency or adequacy. That means that
not the actual score a system achieved is relevant, but the difference to the
score of another system, e.g. a comparison between a baseline system and a
modified system. The results of automatic evaluation methods are also required
to correlate with human judgement.
The most obvious way to automatically evaluate the quality of a translated
sentence is to compare it with a reference translation, assuming that a good
translation is similar to the reference. However, such an evaluation of mt-output
is tricky: as there are in most cases not one, but very many good translations,
it is not sufficient to match the output against a single reference translation
expecting good sentences to match perfectly. This will hardly ever happen, not
even with very short sentences. In order to judge less strictly, the compared units
are not entire sentences but n-grams of different length. Checking whether
the translation output roughly consists of the same words as the reference
translation is an attempt to judge adequacy, while longer n-grams reflect fluency
by capturing larger chunks of text at once while still, at least to a certain degree,
allowing for a different constituent order.
There are several n-gram-based evaluation metrics such as n-gram precision
and recall or the widely used bleu-metric. Given that there are countless
possibilities to translate a sentence, and that good translations do not necessarily
need to be very similar in terms of their word choice or constituent order, an
automatic evaluation system benefits from using multiple reference translations.
It is possible to compute measures such as bleu based on more than one
reference translation. However, in the course of this work, only single reference
translations will be used for evaluation.
4.1 Bleu
The bleu-metric (bilingual evaluation understudy) [Papineni et al., 2002] works with n-grams of different lengths up to a predefined maximum n. These
n-grams are then compared with a reference sentence by computing the n-gram
precision for each n. As mt-systems sometimes overproduce highly frequent
words such as articles, some modification is needed in order not to overestimate n-gram precisions in such cases. Instead of just counting the n-grams in the candidate translation which also occur in the reference translation and then dividing that number by the total number of n-grams, each n-gram is counted at most as often as it has an actual counterpart in the reference.
The method is illustrated in the following example by computing the modified
unigram precision:
(16) the the the the the the the
(17) the cat is on the mat
While (16) is clearly bad mt-output, it would still achieve a unigram precision of 7/7. As there are only two occurrences of the in the reference translation (17), the is only counted twice for the modified unigram precision, resulting in a value of only 2/7.
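The clipped counting described above can be sketched in a few lines of Python (an illustration with names of our own choosing, not part of any evaluation toolkit):

```python
from collections import Counter

def modified_precision(candidate, reference, n=1):
    # collect the n-grams of the candidate and the reference
    cand = Counter(tuple(candidate[i:i + n]) for i in range(len(candidate) - n + 1))
    ref = Counter(tuple(reference[i:i + n]) for i in range(len(reference) - n + 1))
    # clip each candidate n-gram count at its frequency in the reference
    matches = sum(min(count, ref[gram]) for gram, count in cand.items())
    return matches / sum(cand.values())

candidate = "the the the the the the the".split()  # sentence (16)
reference = "the cat is on the mat".split()        # sentence (17)
print(modified_precision(candidate, reference))    # 2/7, i.e. about 0.2857
```

Without the clipping, the same computation would return 7/7 for sentence (16).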
Sentences which are longer than the corresponding reference translation have difficulty achieving high modified n-gram scores. In contrast, short sentences
have an advantage in terms of modified n-gram precision. As good sentences
should match their reference translation in word choice, word order and sentence
length, bleu additionally uses a brevity penalty in order to guarantee that short
sentences do not receive too high a score.
bleu is computed as follows:

    bleu_n = brevity-penalty · exp( Σ_{i=1}^{n} λ_i · log(precision_i) )

    brevity-penalty = 1              if |c| > |r|
                      e^(1 − r/c)    if |c| ≤ |r|
where λi are weights for the different n-gram precisions, |c| is the length of the
candidate sentence and |r| is the length of the reference sentence. In the case of
multiple reference sentences, the one whose length is closest to the candidate’s
length is taken.
This formula can be simplified to the generally used form by setting the weights λ_i = 1 and the order of n-grams to n = 4 [6]:

    bleu_4 = brevity-penalty · Π_{i=1}^{4} precision_i
As the entire score for a sentence is 0 when one of the n-gram precisions is 0,
bleu-scores are usually computed for the entire test set rather than on sentence
[6] taken from [Koehn, 2010]
level. However, it is possible to compute bleu-scores on sentence level by simply
adding 1 to each n-gram precision (bleu_add-one). While sentence-based bleu is
helpful to pick out interesting sentences when manually evaluating a test set, it
does not reflect human judgement. However, when computed on the entire test
set, bleu correlates well with human judgements.
A related evaluation metric is nist [Doddington, 2002]. It is also based on
n-grams, but uses an arithmetic average of n-gram counts instead of a geometric
average. It also gives higher weights to those n-grams that occur less frequently, assuming them to be more informative than highly frequent n-grams. As bleu has been shown to correlate better with human judgement than nist, our evaluation will mainly focus on bleu values, but nist scores are also reported for the sake of completeness.
Example The sentences (20) and (21) are both translations of (18), with one reference translation (19) [7]. The translations are very similar; in fact, they differ only in two words: the article on the left side of vermögen (fortune) and the word wert (worth), which is missing in (20). Except for the phrase tadellos groomed, (21) is exactly the same as the reference translation.
(18) the women are all impeccably groomed and the men are all worth a
fortune .
(19) die frauen sind alle vorbildlich gepflegt und die männer sind alle ein
vermögen wert .
(20) die frauen sind alle tadellos groomed und die männer sind alle eine
vermögen .
(21) die frauen sind alle tadellos groomed und die männer sind alle ein
vermögen wert .
translation        modified n-gram precision             brevity
candidate       n=1      n=2      n=3      n=4       penalty    bleu_4
(20)            11/14    7/13     5/12     3/11      0.9311     0.045
(21)            13/15    11/14    9/13     7/12      1          0.275

Table 5: Modified n-gram precisions for the sentences (20) and (21) when matched against the reference translation (19).
Despite their similarity, (21) yields higher n-gram precision scores and is therefore
– rightly – assumed to be a better translation than (20). N-gram precisions and
the final bleu-score are given in table 5.
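The figures in table 5 can be reproduced with a short script implementing the simplified formula from section 4.1 (modified n-gram precisions multiplied for n = 1 to 4, times the brevity penalty). This is an illustrative sketch, not the evaluation tool used in our experiments:

```python
import math
from collections import Counter

def modified_precision(cand, ref, n):
    # clipped n-gram precision as used by bleu
    c = Counter(tuple(cand[i:i + n]) for i in range(len(cand) - n + 1))
    r = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    return sum(min(cnt, r[g]) for g, cnt in c.items()) / sum(c.values())

def bleu4(cand, ref):
    # brevity penalty: candidates shorter than the reference are penalized
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    product = 1.0
    for n in range(1, 5):
        product *= modified_precision(cand, ref, n)
    return bp * product

ref = "die frauen sind alle vorbildlich gepflegt und die männer sind alle ein vermögen wert .".split()
s20 = "die frauen sind alle tadellos groomed und die männer sind alle eine vermögen .".split()
s21 = "die frauen sind alle tadellos groomed und die männer sind alle ein vermögen wert .".split()
print(round(bleu4(s20, ref), 3), round(bleu4(s21, ref), 3))  # 0.045 0.275
```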
This example also demonstrates one of the weak points of bleu: While
matching the reference translation is a good indicator for a good translation,
the fact that some words are not matching does not necessarily indicate a bad
translation. In the example, impeccably was translated in both sentences as
[7] (20) was translated with the baseline system and (21) was translated with a system conditioned on 2 pos tags on the left side of a phrase.
tadellos, which is as good as vorbildlich. Yet, tadellos is not counted into the respective precision scores: bleu is not capable of recognizing words which are correct translations, but do not occur in the reference translation.
A detailed critical examination of bleu and its role in machine translation
can be found in [Callison-Burch et al., 2006].
Meteor An evaluation metric which deals with the problem illustrated in the example is meteor [Banerjee and Lavie, 2005]: it is designed to recognize
near matches, such as decision and decide, by backing off to word stems. If an
ontology like Wordnet is provided, meteor can also use semantic classes in
order to find synonyms or semantically closely related words: With the help of
a German ontology (e.g. Germanet), tadellos could be classified as an equally
good translation as vorbildlich.
Unfortunately, meteor was found not to correlate very well with human judgement on German newspaper data (cf. [Callison-Burch et al., 2008]); thus, we will not report meteor scores.
4.2 Error types in machine translation output
N-gram based precision metrics such as bleu, discussed in the previous section,
are useful to evaluate the overall quality of a translation system in comparison
with another. While bleu is designed to capture both fluency and adequacy,
it does not provide an analysis of error types. However, the identification of
different error types is crucial in order to improve the translation quality.
[Vilar et al., 2006] carried out a very fine-grained study of error types with
the aim to provide an evaluation framework for human evaluators. They propose
five main groups: missing words, word order, incorrect words, unknown words
and punctuation errors, which can again be divided into subgroups.
In the category of missing words, a distinction can be made between missing
words which are crucial for the meaning of the sentence, and cases where the
meaning of the sentence is preserved despite a missing word. Generally, ’content
words’ like verbs, nouns or proper nouns are assumed to be most essential for the
meaning of a sentence. However, this assumption is overly simple considering
that a preposition or negation can significantly change the meaning of a sentence.
Similarly, in the case of incorrect words, the most important point is whether
the incorrect translation of a word disrupts the meaning or whether its content
is still comprehensible. There are different forms of incorrect words: A word can
be translated completely wrongly, especially if there was a need to disambiguate
a source word, or be a valid translation in terms of meaning, i.e. the right base form, but a wrong inflection. Furthermore, the translation of idiomatic
expressions is challenging as they cannot be translated literally in most cases.
Another point is that of bad choice of words, i.e. constructions which carry the
content of the source phrase but, in the given context, cannot be considered
correct for not being a good, fluent phrase of the target language.
Reordering errors can occur on phrase level or word level and can further be
classified into local or long range problems.
Incorrect word forms, as well as reordering problems, depend to a certain degree on the language pair one is looking at: finding morphologically correct forms is especially challenging for highly inflecting languages; typical examples of local reordering problems arise when translating Romance languages, where nouns normally precede adjectives, into a language like English, where the order of nouns and adjectives is inverted.
As the use of contextual information aims at finding context-appropriate translations, the main focus of our evaluations will be on incorrect words as well as missing words, while other categories such as reordering issues, unknown words and incorrect punctuation will be discussed only marginally or ignored. Also, a more targeted evaluation of incorrect or missing words with respect to their assumed relevance, i.e. content words vs. non-content words, is an interesting aspect, given that n-gram based measures such as bleu do not make a distinction between important and less important words.
This classification of error types, as proposed by [Vilar et al., 2006] and summarized here only briefly, was originally intended as an orientation framework for human annotators, as most of the criteria are too subtle to be entirely captured and categorized by automatic evaluation routines.
[Popović et al., 2006] propose a method to automatically measure the
degree of errors in local reordering (adj-noun vs. noun-adj) when translating
Spanish-English, and the relative amount of inflection errors in Spanish noun
phrases and verbs. To measure inflection errors, they compare the word error
rate computed on fully inflected forms with the word error rate computed on
base forms; a large difference indicates many inflection errors. Similarly, the
difference between word error rate and position independent word error rate is
used as indicator for reordering problems. While these approximations are not
perfect, [Popović et al., 2006] showed that their evaluation results correspond to
results obtained in evaluations carried out by human annotators.
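The difference measures of [Popović et al., 2006] can be illustrated with a small sketch: wer is the word-level Levenshtein distance normalized by the reference length, while a position-independent rate only compares the multisets of words. A large gap between the two hints at reordering errors. This is a simplified illustration of the idea, not a reimplementation of their exact metrics:

```python
from collections import Counter

def wer(cand, ref):
    # word error rate: Levenshtein distance over tokens, normalized by reference length
    d = [[0] * (len(ref) + 1) for _ in range(len(cand) + 1)]
    for i in range(len(cand) + 1):
        d[i][0] = i
    for j in range(len(ref) + 1):
        d[0][j] = j
    for i in range(1, len(cand) + 1):
        for j in range(1, len(ref) + 1):
            sub = 0 if cand[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + sub)
    return d[len(cand)][len(ref)] / len(ref)

def per(cand, ref):
    # position-independent error rate: only the multiset of words matters
    matches = sum((Counter(cand) & Counter(ref)).values())
    return 1 - matches / len(ref)

ref = "das rennen bleibt weit offen".split()
hyp = "weit offen bleibt das rennen".split()  # same words, different order
print(per(hyp, ref), wer(hyp, ref))  # per is 0.0, wer is clearly larger
```

Applying the same wer comparison once to fully inflected forms and once to base forms yields the inflection-error indicator described above.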
4.3 Syntactically motivated evaluation
The integration of linguistically motivated features is a necessary step for a
targeted evaluation of specific error types: in order to improve the performance
of an automatic translation system, frequent error types or difficulties when
translating certain structures need to be identified.
A syntactically oriented evaluation method is presented in the work of
[Popović and Ney, 2009]: In their work, not the actual words of machine
translation output are used for the computation of standard evaluation metrics,
but pos-tags. The generalization to pos tags takes the evaluation from the purely
lexical level to a more abstract level: As good translations do not necessarily
have to be exactly like the reference translation, word based bleu punishes
lexical variation, even if the syntactic structure of the sentences is the same. By
introducing the pos-bleu score, we want to be independent from the actual
realization of words but focus on the syntactic structure of the target sentence.
If the translation and the reference sentence are similar on a syntactic level,
then there is evidence that the translation is relatively well-formed, even if the
actual words do not always correspond with the reference translation.
(22) the race remains unusually wide open .

ref   das/art  rennen/nn  bleibt/vvfin  ungewöhnlich/adjd  weit/adjd  offen/adjd  ./$.
bl    die/art  jagd/nn    nach/adv  wie/kokom  vor/adv  außerordentlich/adjd  weit/adjd  offen/adjd  ./$.
co    das/art  rennen/nn  bleibt/vvfin  außerordentlich/adjd  weit/adjd  offen/adjd  ./$.
In example (22), the reference translation as well as two different automatic translations (baseline and contextually conditioned) are given with pos tag annotations. While the second translation (co) is very good, there is one difference compared to the reference translation: instead of the adjective ungewöhnlich (unusually), the system chose the equivalent adjective außerordentlich (exceptionally). As the sequence of pos tags is the same in ref and co, the translation co receives the top pos-bleu score of 1. The baseline translation bl is well comprehensible, but definitely flawed as it contains no verb. While the construction nach wie vor (still) roughly expresses the meaning of remains, the structure differs greatly from the reference translation and the sentence bl therefore receives a significantly smaller pos-bleu score of 0.4518. Even if there were a verb (tagged vvfin) in the correct position after the noun phrase die jagd (the chase) in bl, the pos-bleu score would only be slightly increased (0.4692), despite the sentence being syntactically well-formed. The example shows that pos-bleu is capable of balancing out different word choices (as long as they are tagged equally), but also that it cannot deal with equivalent syntactic structures.
When computing bleu on pos tags, there is also the risk that, especially in
long sentences, pos tag sequences of the test sentence randomly match with pos
tag sequences of the reference translation although they are located at different
parts of the sentence and have no correspondence at all.
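The sentence-level pos-bleu values quoted for example (22) are consistent with the add-one variant of bleu from section 4.1, computed on the pos tag sequences. The following sketch (with names of our own choosing) assumes add-one smoothed precisions and a geometric mean over n = 1 to 4:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def pos_bleu_add_one(cand, ref, max_n=4):
    # add-one smoothed sentence-level bleu over pos tag sequences
    log_sum = 0.0
    for n in range(1, max_n + 1):
        c, r = ngram_counts(cand, n), ngram_counts(ref, n)
        matches = sum(min(cnt, r[g]) for g, cnt in c.items())
        log_sum += math.log((matches + 1) / (sum(c.values()) + 1))
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return bp * math.exp(log_sum / max_n)

ref = "art nn vvfin adjd adjd adjd $.".split()
bl = "art nn adv kokom adv adjd adjd adjd $.".split()
print(round(pos_bleu_add_one(bl, ref), 4))  # 0.4518
```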
[Popović and Ney, 2009] investigate the bleu metric based on pos-tags,
as well as recall, precision and f-measure of pos n-grams. For these measures,
they report a good correlation with human judgement. Correlation statistics are
computed on the manually evaluated test data from the 2006, 2007 and 2008
shared tasks of the Statistical Machine Translation Workshop [8]. In the first two
data sets, both fluency and adequacy are annotated. The pos-based metrics,
especially the pos-bleu and the pos-f-measure, have a good correlation with
the fluency and adequacy scores obtained by manual annotation. Additionally,
the pos-bleu and pos-f-measure have better correlation scores with human
judgement than the standard bleu and meteor with respect to the fluency
score, and pos-bleu also correlates better with the adequacy score than the
standard bleu and meteor.
Since metrics entirely based on pos-tags do not take into account lexical
aspects, a score including both word n-grams and pos n-grams was introduced.
The correlation test for this variant was carried out on the 2008 data set, where
[8] http://www.statmt.org/wmt{06|07|08}
translated sentences were ranked against each other, but no fluency or adequacy
scores were annotated. A good correlation for the word and pos n-gram measure,
as well as for the pos-bleu and pos-f-measure on the new data set, could be
observed.
Given that the typical output of an mt-system is not well-formed, pos-tagging the mt-output still seems a reasonable task. The output of a robust tagger which relies on a large lexicon and is therefore able to assign tags even if the sentence is not perfectly well-formed is expected to be sufficiently reliable, especially if a relatively simple tag set is used. In contrast, a deep syntactic analysis such as parsing or assigning fine-grained morphological descriptions to the output would be quite difficult to carry out.
As we hope for an improvement on both lexical and grammatical level,
variations of the pos based evaluation seem a promising method to capture
overall syntactic effects. In order to also include lexical aspects in the evaluation,
we intend to measure precision and recall of translated content words in an
attempt to compensate for the fact that standard bleu considers all words/n-grams to be equally important.
We also hope that pos-bleu scores help to find individual sentences that
are much better than the baseline and will thus be a useful tool for a manual
evaluation of specific error types.
4.4 Testing for significance
As the absolute scores of evaluation metrics are meaningless, the respective scores
of a modified system and the baseline are compared. However, it is not sufficient
to observe an increase of bleu points (or the score of any other evaluation metric),
but it must also be shown that the improvement is statistically significant. [Koehn,
2004] describes how bootstrap resampling can be used to compute statistical
significance of automatic evaluation metrics, focusing on bleu.
Instead of comparing the performance of different translation systems, the
same system was trained on 11 languages and used to translate the test set
into English. Thus, the English output of 11 different source languages can be
compared. The test set, consisting of 30,000 sentences, is especially large.
In an attempt to find the ‘true’ bleu score, the test set is divided into smaller
parts for each of which bleu scores are computed. The choice of these test sets
is crucial as consecutive sentences tend to be alike and of similar translation
quality and thus are not representative of the entire test set. [Koehn, 2004]
shows empirically that by using a broad sample, i.e. drawing sentences from
different parts of the main test set, the obtained bleu scores of the smaller sets
lie within a smaller range (as compared to test sets of neighboring sentences)
and are thus closer to the ‘true’ bleu value of the full set of 30,000 sentences.
To find the true translation quality of an smt system, we would need to
consider all possible sentences of a domain. Since this is not possible, we need
to figure out how to exploit our finite test set in order to simulate an infinite
amount of test data.
The objective to find the ‘true’ bleu value is now redefined to the task of
computing whether the true bleu score lies within a certain interval. For this
computation, we demand a certain confidence, which is set at q = 0.95, or in
other words, a p-level of 0.05. The basic idea of bootstrap resampling is to
repeatedly draw a test set of size n from a very large (infinite) set and compute bleu scores for each of these test sets. If this is done a sufficient number of times, then, according to the law of large numbers, the individual bleu
scores enclose the assumed true bleu score very tightly. By eliminating the 2.5 %
highest and 2.5 % lowest individual bleu scores, the 95 % confidence interval
around the true bleu score is found.
Again, as there is no infinite set of test sentences, we need to resort to
a trick: The test sets of size n are each built by drawing n sentences with
replacement from the available collection of test sentences. [Koehn, 2004] makes
the assumption that such a set is equivalent to a set drawn from an infinite
collection of test sentences and can thus be used for the estimation of the
confidence interval. By comparing the performances of test sets of modest size
and his exceptionally large test set, he shows empirically that his assumption is
valid.
In order to compare whether the quality of two systems is significantly
different, the method of paired bootstrap resampling is introduced: For each of
the test sets produced by the two systems, new test sets are created by drawing
n sentences with replacement. This is repeated a sufficient number of times and
the respective scores of the systems are compared. If one system outperforms
the other one in at least 95 % of the iterations, it is concluded that this system
is better with 95 % statistical significance.
The significance testing carried out in the course of this work is done using
pairwise bootstrap resampling with 1000 samples and 95 % confidence.
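Paired bootstrap resampling can be sketched as follows; for simplicity, the sketch assumes a quality score that decomposes into sentence-level scores (for corpus-level bleu, one would recompute the metric on each resampled set instead of summing sentence scores):

```python
import random

def paired_bootstrap(scores_a, scores_b, samples=1000, seed=42):
    # fraction of resampled test sets on which system A scores higher than system B
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(samples):
        idx = [rng.randrange(n) for _ in range(n)]  # draw n sentences with replacement
        if sum(scores_a[i] for i in idx) > sum(scores_b[i] for i in idx):
            wins += 1
    return wins / samples

# toy data: system A is better on most, but not all, sentences
a = [0.4, 0.5, 0.6, 0.3, 0.7, 0.5, 0.4, 0.6, 0.5, 0.2]
b = [0.3, 0.4, 0.5, 0.4, 0.5, 0.4, 0.3, 0.5, 0.4, 0.3]
print(paired_bootstrap(a, b))  # fraction of resampled sets won by A
```

If the returned fraction is at least 0.95, system A is concluded to be better than system B with 95 % statistical significance.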
5 Rating the reliability of contexts and smoothing
When computing contextually conditioned translation probabilities, it is indispensable to rate contexts in terms of their reliability as we only want to
work with useful and non-trivial context features. The next step is to figure
out what to do with translation candidates that were seen in the distributions
without contextual conditioning, but not in the contextually conditioned ones.
As such phrase-translation pairs are given a translation probability of zero,
they are eliminated from the set of translation possibilities. Since it might not
always be a good idea to discard those unseen translations, they need to be
assigned a relatively small translation probability. In the next section, criteria
to identify good and harmful contexts will be discussed, as well as methods to
assign non-zero translation probabilities to phrase-translation pairs that did not
occur within a given context.
5.1 General reflections
Generally, a few factors have to be considered when computing contextually
conditioned probabilities: The most important aspect is the reliability of the
probability estimation. While some contexts can be useful to reduce the set of
translation candidates to a well suited smaller set, other contexts can be trivial
or even harmful when they favor bad translations at the expense of better ones.
In the made-up toy-example in table 6, three different contexts are compared
to illustrate the effects of good and less good contexts when no criterion is used
at all to rate the reliability of a context. Based on possible translations of the
phrase fine, it is an extreme simplification of one of the examples presented
previously.
The possible translations are grouped according to their meaning: pretty
(À), the nominal phrase financial penalty (Á) and the verbal construction to
impose a penalty (Â). The candidates in section à (fine→it and fine→car) are
always a bad choice regardless of the context. The task is to find appropriate
translations for fine depending on the respective context, i.e. to disambiguate
between penalty and pretty, while also excluding nonsense translations. Phrase
translation probabilities are conditioned on one word on the left side of the
phrase. Context c1 clearly prefers the verbal variant of the penalty reading. As
there is also a reasonable number of phrase-translation pairs in this context,
we can consider context c1 to be reliable and useful. Similarly, context c3
gives relatively high probabilities to only two translation candidates. In this
case however, the two translation candidates that are kept for translation are
most likely a product of alignment errors or bad pos tagging. The difference
between the good context c1 and the bad context c3 is that the phrases in c1
were seen multiple times while the phrases in c3 only occurred once and can
thus be assumed to be random. Given that alignment is very error prone and
that preprocessing procedures such as pos tagging are additional sources of
error, it is not surprising to find a certain amount of such random pairs in
most translation probability distributions. If a small number of these faulty
translation candidates happens to be the only phrases seen in a context, then
     translation          gloss        original     c1 = to      c2 = the     c3 = ,
                                       f     ϕ      f     ϕc1    f     ϕc2    f     ϕc3
À    schön                beautiful    30    0.30   0     0      28    0.44   0     0
À    schöne               beautiful    26    0.26   0     0      20    0.32   0     0
Á    strafe               penalty      14    0.14   0     0      8     0.13   0     0
Á    geldstrafe           penalty      9     0.09   0     0      6     0.09   0     0
Â    bestrafen            penalize     16    0.16   15    0.88   0     0      0     0
Â    strafe auferlegen    impose a     2     0.02   2     0.12   0     0      0     0
                          penalty
Ã    es                   it           1     0.01   0     0      0     0      1     0.5
Ã    es ,                 it ,         1     0.01   0     0      0     0      1     0.5
Ã    auto                 car          1     0.01   0     0      1     0.02   0     0

Table 6: Simple example illustrating useful, trivial and harmful contexts. Translations for the word fine are conditioned on the word on the left side of the phrase.
their translation probabilities are extremely overestimated, which is definitely
an unwanted effect: we thus need to define a criterion (e.g. a threshold) to
prevent contextual conditioning in such situations.
While context c2 is not as disastrous as c3 , it is not very useful either: It
reduces the number of translation candidates and assigns reasonable translation
probabilities, but fails to disambiguate between penalty and pretty. Although
the system is not likely to be harmed by this context, it does not benefit either
and the reduction of the set of possible translations results in the fact that the
estimation of translation probabilities is based on less data than without the
context. Therefore, we are not only interested in filtering out downright bad
contexts, but we also want to have criteria helping to identify trivial contexts.
By illustrating what the probability distributions of good, bad and trivial
contexts look like, this constructed example helps to define criteria for the
identification of useful contexts.
Thus, a context could be considered useful and reliable if it occurs frequently
and leads to a considerable reduction of translation candidates. Ideally, the
resulting new probability distribution has a few favorite translations with relatively high probabilities, as illustrated with context c1 in the example. In
reality, even useful contexts tend to produce quite flat probability distributions:
This was already seen in some of the previous examples and will also be the
topic of some of the following discussions. Since judging the distribution by its
appearance might not work in all cases, we also take a step back and contemplate
the number of phrase pairs co-occurring with a given context assuming that
a frequently seen context is more reliable (i.e. not random) than a context
that was seen only once or twice. Criteria based on the number of co-occurring
phrase-translation pairs and on the appearance of the contextually conditioned
distribution will be presented with regard to their potential use as identifier for
good contexts, but also as discount/weighting factors for smoothing.
5.2 Interpolation and discounting
As can be seen in table 6, translations not seen in a context are given a translation
probability of zero and are thus no longer eligible for translation. Smoothing
the probabilities for unseen translations is a necessary step to deal with the
reduced data set used to estimate the translation probabilities seen in the given
context. In order to assign a small probability to unseen events, the translation
probabilities of phrase-translation pairs seen in a context are slightly decreased
by a factor λ. The probability mass taken away from seen events is then assigned
to unseen events.
There are many different ways to realize smoothing; one of the most crucial
aspects is the design of the discount factor that is used to decrease the probability
mass of seen occurrences: the underlying rationale is to only exploit useful
contexts by discounting them very little, while heavily discounting trivial contexts
and thereby reproducing a distribution similar to the original ϕ(trg | src).
Another important point is the influence of the original ϕ(trg | src) on
the new probability ϕ(trg | srccontext ); we differentiate between the methods
of discounting and interpolation. In the case of discounting, all translation
probabilities of seen events are decreased by a discount factor and the remaining
probability mass is then distributed to the unseen events proportionally to the
original distribution. Thus, the original non-context distribution is only reflected
in entries of unseen translation candidates and the probabilities for seen events
are entirely based on the contextually conditioned distributions.
When computing the new probability as an interpolation, a (small) fraction
of the original distribution is part of every new probability. A weighting factor
λ (with values between 0 and 1) is used to balance the relation between the two
probabilities:
    ϕ_int(trg | src_context) = λ · ϕ(trg | src_context) + (1 − λ) · ϕ(trg | src)
For a good context, λ should be close to 1 in order to use the contextually
conditioned probability without noticeable influence of the original distributions.
On the contrary, λ should be close to zero in the case of a not very convincing
context in order to reproduce more or less the original distribution ϕ(trg | src).
Regardless of the weighting factor, the original ϕ(trg | src) might suppress a
good distribution yielded by a good context when the ϕ(trg | src) is relatively
high in contrast to ϕ(trg | srccontext ).
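A sketch of the interpolation above; the probability dictionaries and the value of λ are illustrative assumptions loosely based on table 6, not values from the actual system:

```python
def interpolate(p_context, p_original, lam):
    # phi_int(trg | src_context) = lam * phi(trg | src_context) + (1 - lam) * phi(trg | src)
    keys = set(p_context) | set(p_original)
    return {t: lam * p_context.get(t, 0.0) + (1 - lam) * p_original.get(t, 0.0)
            for t in keys}

# original distribution for 'fine' and the distribution conditioned on the left context 'to'
p_original = {"schön": 0.30, "strafe": 0.26, "bestrafen": 0.16, "es": 0.01}
p_context = {"bestrafen": 0.88, "strafe auferlegen": 0.12}

smoothed = interpolate(p_context, p_original, lam=0.9)
print(round(smoothed["bestrafen"], 3))  # 0.9 * 0.88 + 0.1 * 0.16 = 0.808
print(round(smoothed["schön"], 3))      # unseen in the context: 0.1 * 0.30 = 0.03
```

With λ close to 1, the conditioned distribution dominates; with λ close to 0, the original distribution is reproduced.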
One has to keep in mind that the example based on fine is somewhat extreme
– it is relatively rare that phrases are ambiguous to such an extent. In most
other cases, the differences of appropriateness of translations are much more
subtle and a too strong influence from ϕ(trg | src) can ruin the effect achieved
by using context knowledge. Additionally, translations that are bad regardless
of the context, but have a relatively high translation probability would still be
represented in the new probability distribution even if a they were not seen
in the context used for conditioning. Given that alignment is a very error
prone procedure, a considerable number of faulty translation candidates can
be expected in most sets of phrase-translation pairs. These bad translation
candidates often consist of prepositions, articles or punctuation marks, i.e.
tokens that have a high frequency and, to a certain degree, fit everywhere in a
sentence.
On the other hand, interpolation could be regarded as a way to slightly
re-balance the translation probabilities of distributions that are already relatively
good. In this case, the impact of contexts that are not exactly bad, but also not
very good either could be too harsh when using the discount method. In our
experiments, we will compare the outcome of both discount based methods and
interpolation.
In both variants, discounting and interpolation, the design of the factor λ is
an important point and closely related to the identification of useful contexts.
Thus, we need to define criteria for the design of the discount factor and also for
the reliability of contexts, i.e. decide about the extent to which we want to use
the contextually conditioned probability and when to prefer the original one.
5.3 Discount factors
So far, we only worked with a simple maximum likelihood estimator of the form

    ϕ(trg | src_context) = count(src_context, trg) / Σ_{trg_i} count(src_context, trg_i)

where the probability for the translation trg of a given phrase src_context is estimated as the ratio of how often trg is the translation of src_context to the total number of possible translations in the given context. While this method is
simple and straightforward, it has the disadvantage that translations that were
not seen in the context are assigned a probability of zero while seen events can
be overestimated at the same time.
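The estimator amounts to relative-frequency counting over (source phrase + context, translation) pairs; a minimal sketch with made-up pairs mirroring context c1 from table 6:

```python
from collections import Counter, defaultdict

def conditional_mle(pairs):
    # phi(trg | src_context) = count(src_context, trg) / sum_i count(src_context, trg_i)
    counts = defaultdict(Counter)
    for src_context, trg in pairs:
        counts[src_context][trg] += 1
    return {src: {trg: c / sum(cnt.values()) for trg, c in cnt.items()}
            for src, cnt in counts.items()}

# 'fine' with left context 'to', as in table 6: 15x bestrafen, 2x strafe auferlegen
pairs = [("fine|to", "bestrafen")] * 15 + [("fine|to", "strafe auferlegen")] * 2
phi = conditional_mle(pairs)
print(round(phi["fine|to"]["bestrafen"], 2))  # 0.88
```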
In some of the experiments presented later, the step of giving non-zero
probabilities to unseen events will be skipped and translations not occurring
within a context are discarded, in order to examine the maximal effect of context
information. Furthermore, we have to keep in mind that the new design of the
phrase tables (cf. the example on page 27) with each phrase of the test set listed
separately, leads to a huge inflation of the phrase table. Therefore, entries with
a translation probability below a certain threshold need to be excluded anyway,
making the task to assign non-zero translation probabilities, to a certain degree,
less important. However, there are situations in which we prefer smoothing:
A simple example is the combination of translation probability distributions
based on two or more independent contexts, where a translation candidate not
occurring in one context might (often) occur in the other one and therefore
needs a non-zero translation probability in the distribution of the context it was
not seen in. If the probability distributions of two independent contexts are
given as separate features to the log-linear model, it might be too harsh to use
only those translations that occur in both contexts, especially when considering
that the system can decide itself on the relevance of features during parameter
tuning.
In the following sections, we present ideas for creating factors that rate contexts
and for ‘stealing’ probability mass from seen events in order to maximally
exploit good contexts, followed by a comparison of the different
approaches.
5.3.1  Count-based criteria for the usefulness of contexts
As already mentioned, the number of phrase-translation pairs seen in a context
can be taken as an indicator of the quality of that context. In order to find out
whether a context is reliable or potentially harmful, the simplest approach is
to set up a threshold to guarantee that contextually conditioned probabilities
are only used if there are at least n phrase-translation pairs seen in this context.
If there are fewer than n occurrences, the context is considered to be random
or a product of alignment errors. In the case of trivial or ’bad’ contexts, it
can be very harmful to compute translation probabilities based on a very small
number of counts: If there are, for example, only three translations seen under the
restriction of a certain context and each of them occurs only once, then they
each receive the relatively high probability of 1/3, minus the value taken away by
a discount factor, while the remaining translations end up with comparatively
low translation probabilities. This is definitely not what we want if the ‘boosted’
translations are bad choices, but it is also harmful for a trivial context where most
translations are, to a certain degree, equally probable, but only a small part
of them happens to co-occur with the specific context and ends up on top with
a high probability. An extreme case of bad translations being enhanced by a
random context was already shown in the made-up example in table 6.
Another idea to impose a restriction on the minimal number of counts is
to introduce a function based on counts that penalizes low-frequency contexts.
There are many ways to do so; one possibility is to set

λ = ln(count) / (ln(count) + 1)
This function slowly increases for increasing values of count, while growing faster
for smaller values than for larger ones. This is desirable, as we want
to give more differentiated scores in the critical range of ≈ 5 to 15 counts: a
context seen 10 times might be considered far more reliable than a context with
a frequency of only 5, while a context seen 100 times can be assumed to be as
trustworthy as a context seen 105 or even 200 times.
Using a count-based function avoids the problem of imposing an arbitrarily
chosen threshold by allowing for a gradual transition between good, less good
and bad contexts instead of classifying them into good and bad.
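A quick sketch of this factor, checking that it differentiates most in the low-count range:

```python
import math

def lam(count: int) -> float:
    """Count-based reliability factor: ln(count) / (ln(count) + 1)."""
    return math.log(count) / (math.log(count) + 1)

assert round(lam(10), 3) == 0.697
# The factor grows quickly in the critical low-count range and flattens out:
assert lam(10) - lam(5) > lam(200) - lam(100)
```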
So far, we can define three methods for the estimation of contextually
conditioned translation probabilities:
• Set up a threshold to decide whether to use the contextually conditioned
probabilities or to switch back to the non-context distribution. In our
experiments, the threshold was set to count = 10.
• Use the count-based factor as weight λ for interpolation:
  ϕint = λ · ϕc + (1 − λ) · ϕo
  (version 1 in the evaluation)
• Use the count-based factor for discounting: ϕdisc = λ · ϕc
  The remaining probability mass is assigned to unseen events proportionally
  to ϕo .
  (version 3 in the evaluation)
The methods using either a threshold of count = 10 or interpolation turned
out to work better than always using the contextually conditioned probability.
Experiments also showed that the results of these systems are in the same range,
whereas the results of the system using the discounting method are considerably
lower.
5.3.2  Example: interpolation and discounting
translation            gloss            original       threshold     interpol.   discount
                                        f      ϕo      f     ϕc      λ = 0.8138
schönen                beautiful        160    0.1264   3    0.038    0.0545      0.0309
schöne                 beautiful        120    0.0948   3    0.038    0.0486      0.0309
gute                   good              70    0.0553   3    0.038    0.0412      0.0309
guten                  good              62    0.049    2    0.025    0.0297      0.0206
gut                    good              60    0.0474   0    0        0.0088      0.0164
in ordnung             ok                26    0.0205   0    0        0.0038      0.0071
gutes                  good              19    0.0150   0    0        0.0028      0.0052
geldstrafe             penalty           19    0.0150  11    0.139    0.1161      0.1133
ausgezeichneten        excellent         18    0.0142   1    0.013    0.013       0.0103
strafe                 penalty           15    0.0118   8    0.101    0.0846      0.0824
geldbuße               penalty            7    0.0053   6    0.076    0.0619      0.0618
verhängen              impose             2    0.0016   0    0        0.0003      0.0005
einer geldstrafe       a penalty          1    0.0008   0    0        0.0001      0.0003
geldstrafe bezahlen    pay a penalty      1    0.0008   0    0        0.0001      0.0003

Table 7: Some of the translation candidates for the English phrase fine and
translation probabilities when fine is tagged as noun with a determiner on the left
side. Candidates with the targeted meaning of penalty have a grey background.
The weight λ = 0.8138 is count-based.
Table 7 shows contextually conditioned probabilities smoothed by the count-based factor discussed above, compared to the original probability distribution
ϕo and the contextually conditioned distribution ϕc that only has to fulfill
the threshold criterion. This example is intended to show the differences to
ϕc , but also the differences between interpolation and discounting. While the
translations geldstrafe and strafe always have the highest probabilities, the
discounting method, which rescales by the count-based factor, reflects the
contextually conditioned distribution better than the
version using interpolation. Especially the favorite translations in the original
system (schönen and schöne) are considerably higher when we use interpolation.
But the translations strafe and geldstrafe also profit from interpolation, as their
original probability is not lost but added to the new value. Interpolation might
not be the best choice to bring out a candidate from the lower midrange in
the original distribution to be by far the best translation candidate in the
new distribution. However, it might be suitable to re-balance a distribution
conditioned on a more subtle context where all valid translations were already
scoring similarly high without context information.
5.3.3  Good-Turing estimation
The Good-Turing method is used to estimate the probability of previously seen
and unseen events in large data sets, e.g. the probability to see a word in a
large text corpus, and is therefore often used to model probabilities in text
corpora.
Good-Turing estimators are of the following form:

Fx = (Nx + 1)/T · E(Nx + 1)/E(Nx )

In this equation, x is the event we are interested in (e.g. a word or a translation)
that has been seen Nx times, T is the size of the data set and E(N) denotes the
estimate of how many events were seen N times.
There are many ways to design a Good-Turing estimator; a very simple one
is given in the following equations:

P(x) = Nx /T · (1 − E(1)/T)

and

P(unseen) = E(1)/T

Here, the probability for the event x is given by the maximum likelihood estimate
Nx /T, which is discounted by the factor (1 − E(1)/T). As the probabilities need
to sum to 1, the probability mass for unseen events is E(1)/T.
Illustrated with the example of words in a corpus, the probability for a previously
unseen word to appear for the first time is estimated with the number of words
seen only once in relation to the size of the data set. In our application to phrase
tables, x is a translation candidate for a given phrase and a given context, T is
the number of translations for this phrase and context and Nx is the number of
times in which x is the translation of the phrase.
The left-over probability E(1)/T is the probability mass for all unseen translations in this context. Therefore, P(unseen) needs to be divided among all
unseen translation candidates. This can be done either by assigning an equal
probability to all unseen events, or the probabilities can reflect the original distribution without context information. While it could be argued that
translations that were not seen in a specific context are equally improbable
and therefore should be assigned the same pseudo-probability, this also means
that information is thrown away: if it is known that, of two
unseen translations, one is twice as frequent as the other one when not
contextually conditioned, then there is no reason not to use this information
during the computation of smoothed probabilities.
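The simple Good-Turing estimator, together with the proportional redistribution of P(unseen) just described, can be sketched as follows; the context counts and the original distribution ϕo are made up:

```python
from collections import Counter

def good_turing(ctx_counts: Counter, phi_o: dict) -> dict:
    """Simple Good-Turing estimate for one phrase in one context.

    P(unseen) = E(1)/T is split among the unseen translations
    proportionally to the original, context-free distribution phi_o.
    """
    T = sum(ctx_counts.values())
    E1 = sum(1 for c in ctx_counts.values() if c == 1)
    # Caveat noted in the text: E1 == 0 and E1 == T would need extra rules.
    p = {t: (c / T) * (1 - E1 / T) for t, c in ctx_counts.items()}
    unseen = {t: v for t, v in phi_o.items() if t not in ctx_counts}
    mass = sum(unseen.values())
    for t, v in unseen.items():
        p[t] = (E1 / T) * v / mass
    return p

# Made-up context counts and original distribution:
phi_o = {"geldstrafe": 0.4, "strafe": 0.3, "schoene": 0.2, "gute": 0.1}
p = good_turing(Counter({"geldstrafe": 5, "strafe": 2, "gute": 1}), phi_o)
assert abs(sum(p.values()) - 1.0) < 1e-12
assert p["schoene"] == 0.125  # E(1)/T = 1/8 goes to the single unseen candidate
```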
Good-Turing estimators are designed for large data sets with a Zipfian distribution,
which means that a relatively small number of events occurs very frequently while
most events are very infrequent. When applying the Good-Turing method to
the probability distributions, we will find that the criteria for the application of
Good-Turing are often not quite met. While there are probability distributions
appropriate for Good-Turing estimation, some are definitely not. Generally, the
distributions tend to be very flat and, given that we are looking at translations
for one phrase in one specific context, i.e. very restricted conditions, are also
often of modest size.
It might also happen that every translation in the entire distribution was seen
only once, i.e. E(1) = T , or that no translation was seen once, i.e. E(1) = 0.
While it would certainly be possible to create rules to deal with such exceptions, there is another aspect that is not considered in the Good-Turing approach
as it was presented here: Unlike in typical applications for Good-Turing (e.g.
words in a corpus), we have detailed knowledge about the unseen events, and
also a more ’robust’ probability distribution to which we can switch back if
we do not like the context. Especially the relation between the contextually
conditioned distribution and the original one promises to be a useful indicator
of whether a context is good or trivial; the assumed relevance of a context should
also be reflected in the design of a discount factor.
5.3.4  Type-token relations as criteria for the evaluation of contexts
The relation of types and tokens in the contextually conditioned distribution, as
well as a comparison with the original distribution, are interesting aspects that
might indicate whether a context is trustworthy and thus to what extent we
want to use its probability distribution.
If the number of types and tokens in the conditioned probability distribution
is roughly the same, i.e. the distribution is very flat without one or a few
favorite translations, then the context is not very promising, as the initial
idea of incorporating context information when estimating probabilities was to
find translations especially suitable for this context. While some inappropriate
translations might have been filtered out, a very flat distribution suggests that
the context is not very helpful. In this situation, we would like to introduce a
penalty, such as the factor

λ1 = types(translation, context) / tokens(translation, context)

which is high if there are approximately as many types as tokens.
It is also interesting to compare how many of the translation candidates
in the original distribution are also seen in the context. If almost every single
translation of the original distribution was also seen in the context, then this
context is likely to be trivial and we might as well use the entire data to
estimate probabilities, i.e. reproduce the original probability distribution. The factor

λ2 = types(translation, context) / types(translation)

captures the percentage of possible translations of a phrase coming up in the
modified distribution.
Assuming λ1 and λ2 to be equally important, they can be combined into

λ = [(1 − λ1 ) + (1 − λ2 )] / 2

and be used either as weighting factor in interpolation or as discount factor.
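A small sketch of the combined factor; the assertions check it against the values reported in table 8 for contexts c1 and c3:

```python
def type_token_lambda(tokens_ctx: int, types_ctx: int, types_all: int) -> float:
    lam1 = types_ctx / tokens_ctx   # high for flat conditioned distributions
    lam2 = types_ctx / types_all    # high for near-trivial contexts
    return 0.5 * ((1 - lam1) + (1 - lam2))

# Context c1 of "general": few types, most candidates filtered out.
assert round(type_token_lambda(14, 2, 1075), 4) == 0.9276
# The flat context c3 of "fine" receives a much lower weight:
assert type_token_lambda(23, 21, 379) < type_token_lambda(14, 2, 1075)
```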
                               phrase = general                  phrase = fine
                               c1 = vvfin nn     c2 = in jj      c3 = to
tokens(translation, context)   14                1710            23
types(translation, context)    2                 240             21
types(translation)             1075              1075            379
λ1                             0.1428            0.1404          0.91
λ2                             0.0019            0.2233          0.055
λ                              0.9276            0.8182          0.5157

Table 8: Effects of the type-token criteria on different types of distribution.
The effects of the type-token criteria on different types of distribution are
illustrated by the following example. Table 8 shows three different distributions;
in the case of general, the context used for conditioning consists of the pos
tag on the left side and the pos tag of the phrase itself. The word general
appears mostly with the meaning of generally/generic, e.g. in the very common
construction in general. The second possibility, in which general denotes a
military rank, is comparatively rare. In context c1 , in which general is tagged as
noun with a finite verb on its left side, only two different translation candidates
occur, both with the meaning of military person. As this seems a good outcome
for the given context, the weight λ should be relatively high in order to keep
the majority of probability mass and give only very little to unseen events, or,
in the case of interpolation, keep the influence of the original distribution to a
minimum.
In context c2 (general is an adjective with a preposition on its left side),
the factor λ1 is almost the same as in context c1 . However, the number of seen
translation candidates is considerably larger and thus, c2 does not score as high
as c1 . This also illustrates the positive effect of combining the two factors λ1
and λ2 instead of using just one of them.
As became evident in the case of context c1 , contexts with a relatively
small number of phrase-translation pairs (as long as they meet the threshold
criterion) are not disadvantaged. However, flat distributions are difficult to
handle: they are a strong indicator for trivial contexts with a considerable
number of random singletons, yet even good contexts can lead to flat
distributions, as exemplified by context c3 : the pos tag to on the left side of fine
is a good context to find only translations with the meaning of impose a penalty:
Of the 21 different translation candidates seen within this context, 17 can be
considered appropriate. The main reason for the form of this distribution is the
fact that unaligned words are concatenated to adjacent phrases; as a result, two
translation candidates have three entries each, and two other candidates are
each listed twice. Although c3 is a good context, it does not look very promising
and therefore gets a low weighting factor.
With the new factor λ, which takes into account the type-token relation
of the contextually conditioned distribution, as well as a comparison of the
number of translation candidates in the original and new distribution, we can
define two further methods for the computation of contextually conditioned translation
probabilities:
• Use the type-token based factor as weight λ for interpolation (version 2
in the evaluation)
• Use the type-token based factor for discounting (version 4 in
the evaluation)
If, in the case of interpolation, both λ1 and λ2 are high (close to 1), thus
indicating a bad context, the combined factor λ is low and therefore reproduces
ϕ(trg|src) while hardly incorporating ϕ(trg|srccontext ). If λ is high for a promising context, translations seen in this context should get high probabilities, while
unseen events only get a small fraction of their original probability. However, if
the original probabilities are very large compared to the contextually conditioned
ones, the influence of the original distribution can be quite strong, even if the
weighting factor favors the conditioned distribution.
Similarly, when not working with interpolation and only taking into account
the contextually conditioned probabilities for seen translations, they are heavily
reduced when λ happens to be low. If the leftover probability mass is distributed
to the unseen translations proportionally to the original distribution, it is likely
to be - more or less - reproduced for the unseen translations. In fact, the seen
translations can even be dispreferred if they are discounted heavily and the
translation probabilities of some of the unseen translations are very high in the
original distribution.
While factors based on type-token relations might work reliably on probability
distributions that look as expected, we have to keep in mind
that even good distributions tend to be flat. As it is very difficult to say what
distributions based on good contexts look like in reality, it is also difficult to
produce universally applicable criteria that work not only for somewhat
extreme examples such as general, but also for more subtle ones where all
translation options work to some degree, but some of them better in certain
contexts.
When using the discount method, results hardly outperform the system
using the contextually conditioned distribution without restrictions. As with the
count-based factor, interpolation yielded better results. The bleu scores and a
more detailed evaluation can be found in section 5.5.
5.4  Back-off
Table 9 provides an example illustrating the need to impose criteria to judge the
reliability of a context, but it also shows the (theoretical) gain when backing off
to a smaller context: While rare contexts do not necessarily have to be wrong, at
least they have to be treated with caution. In the context studied in table 9, fine
itself is a verb preceded by an adverb, a combination with a very low frequency
of occurrence (count = 4). A preceding adverb is not the best context when
the task is to find the appropriate meaning of fine, as a word like even can occur
with adjectives, nouns and verbs alike. But since the word fine is tagged as a
verb, valid translations should reflect the meaning of impose a penalty.
However, none of the 4 translations which were seen in combination with the
context is valid. Nevertheless, each one would be assigned the new probability
full    back
cont.   off     translation               gloss             no threshold   threshold
                                                            no back off    back off
1       1       hervorragend              excellent         0.1452         0.0028
1       1       schönen                   beautiful         0.1452         0.0743
1       1       sympathische              pleasant          0.1452         0.0005
1       1       all die schönen           all the pretty    0.1452         0.0005
------------------------------------------------------------------------------------
0       3       geldstrafe                penalty           0.0073         0.0878
0       2       verhängen                 impose            0.0008         0.0585
0       1       geldstrafe bezahlen       pay a penalty     0.0004         0.0293
0       1       geldbußen zu verhängen    to impose a p.    0.0004         0.0293

Table 9: Translation candidates for the English phrase fine. The examined
context is the pos tag rb (adverb) on the left side of the phrase fine which itself
is tagged as a verb. The two columns on the left side show the counts with either
a full context or backed off. The entries below the line benefit from backing off
to the context fine = verb. Entries with a grey background have the targeted
meaning of impose a penalty.
of 0.25 when no threshold is set. Although the value 0.25 is harshly decreased
to 0.1452 by the count-based discounting factor, the probabilities of these bad
translation choices would amount to nearly 60 percent of the overall probability
mass.
The faulty phrase-translation pairs in the first four lines of table 9 are not a
result of bad alignment; rather, the word fine was tagged incorrectly. Given that
the precision of tagging is very high and that alignment is known to produce a
considerable amount of errors, this is somewhat surprising: it reminds us of the
fact that especially in the case of ambiguous words even a high-quality tagger
can fail and therefore introduce an uncertainty of context reliability on another
level than discussed before. In this case, the adjacency with adverbs might have
led the tagger to choose the tag verb for fine instead of adjective.
When imposing a threshold of e.g. count ≥ 10, the original probabilities
replace the contextually influenced new probabilities. A problematic aspect of
this method is that the threshold is chosen more or less arbitrarily. One could
also argue that a count-based function penalizing small sets is the better choice,
as those contexts are not completely lost. Discounting the probabilities with a
count-based, penalizing factor is not sufficient in the example above; however,
we cannot expect to find a solution that fits every single situation and therefore
aim for a way that works best on average.
At this point, we wonder whether it would be helpful to split complex
contexts into simpler ones instead of throwing them away completely. Backing
off to a partial context if the threshold criterion is not met helps to better exploit
the data. However, there is also the risk of ‘diluting’ contextually conditioned
probability distributions by including probabilities of trivial contexts.
When using back-off probabilities, translation candidates are basically divided
into three groups: those occurring with the full context, those occurring with a
part of the context and the rest. Then, new translation probabilities for the first
group are computed (or replaced with the old ones when the threshold criterion
is not met) and discounted. The remaining probability mass has to be divided
between the translations in the second and the third group; again, probabilities
for the translations of the second group are computed and discounted with a
factor based on the counts of the translations occurring with a part of the full
context if they meet the threshold criterion. The now remaining probability mass
is distributed (proportionally to the original probabilities) to the translations in
the third group. This procedure specifically refers to a 2-item context and the
contexts used here will not exceed this: sparse data definitely is problematic
and long contexts that in most cases need to be backed off are not likely to be
very useful.
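The three-group procedure described above can be sketched as follows; the count-based discount factor from section 5.3.1 is reused, the counts and the original distribution are made up, and the count ≥ 10 threshold is omitted for brevity:

```python
import math

def lam(count: int) -> float:
    """Count-based discount factor from section 5.3.1."""
    return math.log(count) / (math.log(count) + 1)

def backoff_distribution(full_counts: dict, partial_counts: dict, phi_o: dict) -> dict:
    """Sketch of the three-group back-off scheme (threshold omitted)."""
    p = {}
    # Group 1: seen with the full context, discounted by a count-based factor.
    n_full = sum(full_counts.values())
    for t, c in full_counts.items():
        p[t] = lam(n_full) * c / n_full
    # Group 2: seen only with the partial context, sharing the leftover mass.
    leftover = 1.0 - sum(p.values())
    partial = {t: c for t, c in partial_counts.items() if t not in full_counts}
    n_part = sum(partial.values())
    for t, c in partial.items():
        p[t] = leftover * lam(n_part) * c / n_part
    # Group 3: the rest, proportional to the original distribution phi_o.
    leftover = 1.0 - sum(p.values())
    unseen = {t: v for t, v in phi_o.items() if t not in p}
    mass = sum(unseen.values())
    for t, v in unseen.items():
        p[t] = leftover * v / mass
    return p

# Made-up counts and original distribution:
phi_o = {"a": 0.5, "b": 0.3, "c": 0.1, "d": 0.1}
p = backoff_distribution({"a": 4}, {"a": 4, "b": 6}, phi_o)
assert abs(sum(p.values()) - 1.0) < 1e-12
assert p["a"] > p["b"] > p["c"]  # full context > backed off > rest
```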
In contrast to the variant where neither backing off nor a threshold was
used and the first four entries were assigned nearly 60 percent of the probability
mass, they now only amount to 7.7 percent of the probability mass, with the
backed-off translations representing 70.2 percent and the rest 22.1 percent.
Disappointingly, backing off did not work in our experiments: Its scores were
even lower than the results of the baseline system.
5.5  Evaluation of the presented approaches
In order to capture the maximum effect of the λ-factors used for rating the
usefulness of contexts and for discounting, experiments were not carried out on
full systems containing all available parameters, but on a system using only the
translation probability ϕ(trg|src), the language model on the target side and
reordering statistics. Especially the reverse translation probability ϕ(src|trg),
which remained unconditioned by contextual features, could ‘interfere’ with the
new probability distribution if probabilities in ϕc and ϕo are very different. This
means that lexical probabilities in both directions and the reverse translation
probability ϕ(src|trg) are not used, i.e. the phrase table only has values for
the translation probabilities of target-language phrases given a source
phrase, and the phrase penalty, which is always set to e = 2.718.
As translation systems are very complex, this is considered a necessary step
to compare the different methods for computing the probability distributions.
During the tuning step, different weights are assigned to the features used for
translation; by reducing the features to the one probability we are interested in
and the indispensable language and reordering models, the comparison
will be more direct and less influenced by the other features and the different
weights they would obtain during tuning.
In table 10, results for experiments with the context of two pos-tags on the
left side of a phrase are listed. This context was chosen because it reflects to a
certain degree the kind of contexts used in later experiments. As opposed to
words, pos-tags are less prone to sparse-data related problems and generalize
contexts while still being relatively fine-grained. (For an overview of the tagset,
see section 3.4.1.) With regard to both word-based contexts and more compressed
contexts such as chunks, i.e. linguistically motivated phrases, pos-tags seem to
be a reasonable intermediate solution.
Table 10 shows the results for the different methods of estimating contextually
conditioned translation probabilities presented above. For each method, two
experiments were carried out, as we not only wanted to see the effect of
substituting the original translation probabilities with the modified contextually
conditioned ones, but also wanted to compare the relevance of ϕ(trg|srco )
and ϕ(trg|srcc ) as estimated during parameter tuning.

                                bleu                               feature weights
                                ϕ(trg|srcc )   ϕ(trg|srco ) and   ϕ(trg|srco )   ϕ(trg|srcc )
                                               ϕ(trg|srcc )
no threshold                    12.51          12.40              0.031992       0.019339
threshold                       12.88*         12.68              0.006017       0.021211
interpolation   1 (count)       12.90*         12.94*             0.041349       0.053002
                2 (distr.)      12.68          12.96*             0.011541       0.115852
discount        3 (count)       12.53          12.40              0.028883       0.079518
                4 (distr.)      12.54          12.71              0.018464       0.055335
back-off                        12.39          12.18              0.017702       0.123757

Table 10: bleu values for the experiments with different smoothing and rating
techniques. Experiments were carried out on a system using only uni-directional
phrase-translation probabilities, a target-side language model and reordering
statistics. The baseline system scored 12.45 bleu-points. Also shown are the
feature weights obtained by parameter tuning in systems using both translation
probability features. Scores which are significantly better than the baseline are
in bold-face and scores significantly better than the contextually conditioned
distribution without threshold are marked with *.
A standard moses system with the same reduced feature settings was used
as the baseline system, i.e. only the phrase-translation probability ϕ(trg|src),
the language model and reordering statistics. The results in table 10 clearly
show that using the contextually conditioned distribution without performing
some sort of smoothing or setting up a threshold does not lead to an improved
translation quality. As the probabilities of phrase-translation pairs with a low
number of occurrences tend to be overestimated, a threshold at count = 10 was
introduced, resulting in a significantly better bleu score, at least
when using only the contextually conditioned probabilities. Our choice to set the
threshold at n = 10 might seem somewhat arbitrary; however, such thresholds
are often set in the range between 5 and 10. Since the phrase extraction routine
concatenates unaligned words to adjacent phrases and thus often produces several
entries for one seen occurrence, we opted for a comparatively high threshold.
(An experiment with a threshold set at n = 5 showed almost no difference.)
Substituting the original distribution with the contextually conditioned one
(no-threshold system) results in a bleu score that is almost the same as the
result of the baseline. The same applies to the combination of the original
probabilities and the contextually conditioned ones. In this case, the feature
weights obtained by parameter tuning are higher for the original probability
distribution than for the new one, which was intended to be a more refined version
of the old one. When imposing a threshold, the result for the one-feature version
is significantly better than both the baseline system and the no-threshold system.
The scores for the system using interpolation with a count-based weighting factor
are in the same range as the scores of the system with the threshold, as well as
those of the other system using interpolated probability estimates, though only for the
two-feature variant.
Discounting the probabilities as in versions 3 and 4 does not turn out to be very
useful; the scores are hardly better than the baseline system, except for the two-feature system using both the original and new probability distributions, which
still fails to be significantly better than either the baseline or the no-threshold
system.
While the differences between feature weights vary, it can be noted that in
each case, the weight given to the contextually conditioned distribution is higher
than that given to the original distribution, except for the no-threshold system.
(Weights assigned to the phrase penalty responsible for segmenting the input
phrase are not listed in this table, although they can be strikingly different
from system to system.)
To a certain extent, the version using a threshold and the interpolated system
in version 1 are very alike. The main difference is that there is either a fixed threshold
of count = 10 to decide whether the original or the contextually conditioned
distribution is to be used, or a weight entirely based on counts according to
which the old and the new distribution are incorporated into the new score
ϕ(trg|srcc ). The results of the threshold version and version 1 being almost
identical suggest that not using unseen translations does not harm the system,
and also that the choice of the threshold, whether static or variable, is not a crucial
point, either.
Version 2, with a weight factor based on the type-token relation of the
modified probability distribution and the comparison with the original one, gets
a lower result (12.68) when using only the new probabilities than when using
both distributions. In this case, the bleu-value is in the same range as the
results of version 1 and the first variant of the non-smoothed version. Although the
contextually conditioned probability distribution in version 2 received a remarkably
higher feature weight than the original one, using both distributions leads to a
considerable gain in translation quality.
The method of interpolation seems to work better than discounting, at least
for the examined context. The reason for this might be that gradually switching
back to the original distribution by giving it a high interpolation weight in untrustworthy contexts is more reliable than keeping the contextually conditioned
distributions, even heavily discounted when not trusted, as was
done in versions 3 and 4. After all, the original probabilities are the best estimate in
all those situations where context information does not turn out to be helpful,
or where contextually conditioned probabilities are overestimated as a result of their
low occurrence frequency.
However, the role of the original distribution is not quite clear when looking
at the results in table 10. Especially in the first two experiments (no-threshold
and threshold systems), including the original distribution as an additional feature
turns out to be harmful: in this case, the distributions differ the most,
as ϕ(trg|srcc ) does not include information from ϕ(trg|srco ), except for cases
where the count criterion is not met. Thus, the distributions are more prone to
conflict than those in versions 1-4, where there is always a certain influence of the
old distribution and they can be assumed to be more alike. As the probabilities
in the phrase table are multiplied and only entries with a high overall score are
likely to be chosen as translation candidates, translation probabilities in both the
new and the original distribution need to be high in order to yield a high total
value. To a certain degree, we expect ϕ(trg|srco ) and ϕ(trg|srcc ) to be different
– after all, creating a new, refined probability distribution was the purpose of
including context knowledge. On the other hand, the systems 2 and 4 clearly
benefit from combining both features. Their weighting/discounting factor is
based on type-token relations of the new and original probability distributions:
while the design of this factor is promising for specific probability distributions,
it is difficult to capture its average effect on those distributions that do not look
like the ‘typical’ good or bad context. While this factor might be able to identify
exceptionally good contexts, it can also be assumed to fail in a considerable
number of cases (e.g. flat distributions); thus, a more robust distribution
such as the non-context distribution is needed as an additional feature.
The experiment backing off to a simpler context turned out to be quite
disappointing. In this experiment, the context of two pos-tags on the left side of
the phrase was reduced to one pos-tag on the left side of the phrase in cases
where the full context was not seen or the count ≥ 10 criterion was not met.
If there were also fewer than 10 occurrences with only one pos tag, the original
probability was used, as there was not enough data to justify a contextually
conditioned probability estimation. The underlying idea is to use only contexts
with enough evidence (count ≥ 10), but still include as much context information
as possible by backing off. As we will see in the following chapter, one pos-tag
on the left side yields results comparable to two pos-tags. However, switching
back and forth between the two distributions within one context feature seems
to be harmful.
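The back-off scheme can be sketched as follows; the count tables, the tag names and the exact threshold handling are illustrative assumptions, not the implementation actually used in the experiments:

```python
from collections import Counter, defaultdict

def make_table(observations):
    """Build a count table: (source phrase, context) -> Counter over targets."""
    table = defaultdict(Counter)
    for src, ctx, trg in observations:
        table[(src, ctx)][trg] += 1
    return table

def backoff_prob(src, trg, ctx2, t2, t1, t0, min_count=10):
    """phi(trg | src, context) with back-off: try the two-tag left context,
    then the single tag next to the phrase, then the context-free estimate."""
    for table, ctx in ((t2, ctx2), (t1, ctx2[-1:])):
        dist = table[(src, ctx)]
        total = sum(dist.values())
        if total >= min_count:          # enough evidence at this context level
            return dist[trg] / total
    dist = t0[(src, ())]                # original, context-free distribution
    total = sum(dist.values())
    return dist[trg] / total if total else 0.0
```

With at least ten observations of a phrase in its full two-tag context, the context-conditioned estimate is used; with fewer, the estimate falls through the one-tag level to the original relative frequency.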
Conclusion The results, both bleu-values and feature weights, suggest that
including context features is useful, but that translation probabilities without
context knowledge are also important, as illustrated in versions ①, ② and ④.
Altogether, using both distributions might be harmful if they happen to be
conflicting; this might also apply in the case of the back-off experiment, where
three distributions are packed into one.
Given the similarity of the threshold version and ①, setting a threshold
and ignoring unseen translations seems reasonable for most experiments, especially
when considering that a simple way of estimating translation probabilities
matches the concept of a reduced experimental setting, allowing us to examine
the influence of different contexts with the least amount of possibly conflicting
features. In cases where it is indispensable to keep unseen translations,
interpolation as done in ① might be a good choice.
6 Basic context templates
In this chapter, we will discuss the use of local context features and provide a
detailed evaluation of the respective outputs. Context templates are based on
adjacent words or pos tags of phrases and will be used for conditioning either
separately or as a combination of left side and right side contexts.
As in the previous chapter, the experiments will be run on reduced systems
in order to simplify the evaluation. As the version without smoothing and a
simple threshold of count = 10 turned out to work quite well, this approach was
chosen for the experiments presented in the course of this chapter.
In each case, we look at the results of a full system using the complete set of
features where ϕ(trg|srco ) has been replaced with ϕ(trg|srcc ), a system with
the single feature ϕ(trg|srcc ), as well as a single-feature system restricted to
source phrases of length one. While this last setting seems artificial, it enables
us to directly compare the different translation choices in the baseline and the
modified version, as there is no variability in segmenting the source sentence.
To a certain degree, length-restricted translation units can also be regarded as
an extreme case of the situation that comes up when a system is trained on a
domain that differs significantly from the test data: As a result, mostly short
phrases are selected for translation.
Since the length-one system with original translation probabilities has no
context information at all, we expect to see a considerable improvement when
using context-informed translation probabilities; a significant improvement in
this setting could even be regarded as a necessary criterion for improvement in
a full setting. In the one-feature setting, the effect of contextual conditioning
is likely to be weaker than in the length-one system, as the longer phrases
contain internal context information. Since the only (translational) feature
ϕ(trg|srcc ) is assumed to be a better estimation than the original translation
probabilities, this system should outperform the baseline even though it is not
as dependent on context information as the length-one system. In the full
setting, we expect to see the least impact: Since the other translational features
are not modified, they might interfere with the new translation probabilities
and thus block potentially good decisions. This might be the case if an entry
with a formerly low translation probability is assigned a higher contextually
conditioned probability but still has a comparatively low reverse translation
probability and low lexical probabilities.
We carried out a detailed analysis of all variants using the standard automatic
evaluation metrics bleu and nist, as well as variants of these measures. As
already mentioned in the evaluation chapter, there are more possibilities than
the standard bleu metric to measure the quality of machine translation output.
In order to detect an effect of our modifications on the syntactic level, bleu was
computed on pos tags. As this does not take into account lexical aspects, bleu
was also computed on lemmatized words, since this is less strict than bleu on
full forms. As the use of context information is assumed, to a certain degree, to
enhance literal translations as well as to produce more well-formed output, we
are interested in finding out whether there is a gain in both scores. [Popović and
Ney, 2009] report a good correlation of human annotation with pos-based scores,
especially the pos-bleu score. While they do not mention bleu values based
on lemmata, it does not seem too farfetched to use such a version of standard
bleu as a measure to estimate the degree of adequacy for a morphologically
rich language like German. Reducing full forms to stems is also part of the
meteor metric: When evaluating English mt output, meteor works better
than bleu. Even though bleu and meteor use different approaches, this
observation additionally supports the idea of using lemma-based bleu.
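Computing bleu on pos tags or lemmas changes only the preprocessing step: hypothesis and reference are mapped token by token before the usual clipped n-gram matching. A minimal sketch (the lemma table below is invented for illustration; a real setup would use a tagger and a lemmatizer):

```python
from collections import Counter

def map_tokens(sentence, mapping):
    """Replace each token by its lemma (or pos tag) before scoring."""
    return [mapping.get(tok, tok) for tok in sentence]

def ngram_precision(hyp, ref, n):
    """Clipped n-gram precision, the building block of bleu."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    return overlap / max(1, sum(hyp_ngrams.values()))
```

For a hypothesis die ergebnisse sind klar against a reference das ergebnis ist klar, surface unigram precision is 0.25, while after lemma mapping it is 1.0: lemma-based bleu credits inflectional variants that standard bleu misses.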
Since bleu-based metrics do not take into account the relevance of individual
words, we also compared the percentage of correctly translated content words,
as they are most important to express the meaning of a sentence. In addition to
automatic evaluation methods, we carried out a small scale manual evaluation
in which three participants annotated which one of two presented sentences
(baseline vs. contrastive) they preferred.
In the last section of this chapter, we will discuss selected examples to
illustrate the effects of contextual conditioning on a lexical and syntactic level.
6.1 Word-based contexts
Table 11 shows the respective results for conditioning on adjacent words; results
that are significantly better than the baseline system are in bold-face. As
expected, the reduced settings (especially the length-one system) are more
influenced by the use of context features. In the full system, there is no real
improvement in bleu-points compared to the baseline. When switching to the
simpler system with only ϕ(trg|srcc ), the effect of the modification becomes more
noticeable, especially with the 1-word-right context. An additional simplification
by restricting the source phrase length to a minimum leads to an even larger
difference of bleu-points to the baseline in the case of the one-word-left context.
The system with the restricted phrase length is considerably worse than the
one where phrases of every length up to a maximum length are available. This
is mainly due to the fact that single word phrases often cannot be translated
very well. For example, a complex noun on the German target side would be
expressed as a sequence of several words on the English source side in most
cases. Imposing a restriction on the source side phrase length does not allow
for such many-to-one or many-to-many translations, but leads to non-intuitive
translations. Additionally, in a system translating only single word phrases,
there is no context information available at all, as opposed to a non-limited
system where longer phrases inherently have some context information. The
assumption that this context information can be captured by considering a
context feature like an adjacent word on either the right or the left side of the
phrase when computing the translation probabilities is illustrated by the results
of the length-one systems. As expected, the difference to the baseline is the
most visible in this system, whereas there is only a very slight (non-significant)
improvement for the 1-word-right context in the normal system.
The effect of context information on different levels is demonstrated by the
following example of single word translations: The English sentence in (23)
contains the expression far from, which is a form of negation: although the
sentence is not explicitly negated, it is clear to a human reader that the outcome
is not clear.

                full setting    ϕ(trg|srcc)     ϕ(trg|srcc), src-phrase-length=1
                bleu    nist    bleu    nist    bleu    nist
baseline        13.84   5.14    12.45   4.80    10.24   4.31
1 word left     13.84   5.17    12.50   4.81    10.83   4.48
1 word right    13.98   5.17    12.57   4.89    10.45   4.44

Table 11: Results for translations using either one word on the left or on the
right side of the phrase. Numbers in bold print are significantly better than the
baseline.

While the expression bei weitem nicht (at far not) in the reference
translation (26) is comparatively similar to the structure far from, an explicit
negation in the form of nicht (not) is indispensable. In the baseline translation (24), this special constellation and appropriate translation candidates are
not known to the system and it thus chooses the more or less literal translation
far from → viel von (many of), which does not contain any form of negation.
(23) this time the party is divided and the outcome is far from clear .
(24) diese zeit der partei ist , und das ergebnis ist viel von klar .
(25) dieses mal die partei ist gespalten und die ergebnis ist viel nicht klar .
(26) dieses mal ist die partei gespalten und das ergebnis ist bei weitem nicht klar .
In (25), one word on the left side of the phrase is known: Now, a majority
of the translations possible without context information are not seen, and
the translation probability from→nicht gets considerably larger and thus, the
negation is transferred to the translation.
Additionally, the participle divided in the input sentence is translated correctly as gespalten in (25), while it is translated as a comma in the baseline.
While it is generally challenging to translate verbs, this is not a problem of the
same category as word combinations such as far from or get used, which can
be solved by just looking at an adjacent word, but a problem mainly caused by
the differences between German and English constituent order and the resulting
alignment issues. However, the reduced set of translation candidates seems,
to a certain degree, to boost literal translations for words with no ambiguous
meanings. In the case of divided, some really bad translation candidates of
the original distribution were not seen in the contextually conditioned one;
for example the obviously wrong translation comma, which had nevertheless
been chosen in the baseline system. Translations like divided→, are mainly a
product of alignment errors. If such errors are accumulated over all occurrences
of a phrase, they can result in relatively high translation probabilities. While
candidates with a considerably smaller translation probability than the ones
with the highest probabilities are not likely to be chosen, there might arise
situations in which low translation probabilities can be cancelled out by, e.g., high
language model scores, which seems to have happened in the baseline version.
The context word is of the phrase divided suggests some sort of verbal or
adjective-like translation, and indeed enhances the translation probability of
divided→gespalten from 13.9 % up to 25 % while the phrase translation pair
divided→, (as well as comparable nonsense-pairs) did not occur at all since all
alignments seen in this phrase-context combination are valid.
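The effect described for divided can be reproduced with relative-frequency estimation over a handful of invented extraction counts: conditioning on the left context word filters out the occurrences whose alignments produced the nonsense pair. All counts below are illustrative, not taken from the actual training data.

```python
from collections import Counter

def rel_freq(counts):
    """Relative-frequency estimate of phi(trg | src) from a count table."""
    total = sum(counts.values())
    return {trg: c / total for trg, c in counts.items()}

# invented extraction counts for the source word "divided":
# (left context word, target) -> frequency; the comma entry mimics the
# alignment-error translation divided -> "," discussed above
extracted = Counter({("is", "gespalten"): 5, ("is", "geteilt"): 3,
                     ("was", "geteilt"): 10, ("be", ","): 4})

no_ctx, with_is = Counter(), Counter()
for (left, trg), c in extracted.items():
    no_ctx[trg] += c                   # pool all occurrences, as the baseline does
    if left == "is":
        with_is[trg] += c              # keep only occurrences in this context

p_plain = rel_freq(no_ctx)             # the comma keeps probability mass
p_ctx = rel_freq(with_is)              # alignment-error entry filtered out
```

In the context-conditioned distribution the comma entry disappears entirely, and the probability of gespalten rises from 5/22 to 5/8.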
The improved translations for both far from and divided illustrate that
including context features not only works for contexts with truly ambiguous
words, but also in cases like divided, where some of the nonsense translations
are filtered out due to a good context word, and the remaining translations are
more likely to be appropriate in both meaning and syntactic form.
While the translation in (25) is far from perfect, it is still better than the
baseline translation (24) since the meaning of far from is captured better with
context, the verb divided is translated and the choice of mal instead of zeit
is also better. The decision to pick dieses mal instead of diese zeit might be
strongly influenced by the language model, as the phrase dieses mal is a fixed
formulation.
However, limiting the length of segmentation units cannot lead to optimal
results. When also allowing for phrases of length 2 in the baseline system, the
translation (27) obtained for (23) is quite good.
(27) [dieses mal] [die] [partei] [ist] [gespalten , und] [das ergebnis] [ist] [alles andere als] [klar] [.]
Here, the phrase far from can be translated as a unit into alles andere als
(everything but), which is as good as bei weitem nicht in the reference sentence
(26). This system actually gets a bleu value of 12.26, which is almost as good
as the one-feature baseline system with no restrictions on the source phrase
length, which has a bleu value of 12.45.
Pos-based and lemma-based Bleu For both the pos-based bleu and
lemma-based bleu (cf. table 12), the differences to the baseline system are
larger than for standard bleu, especially for the one-word-left context in the full
system and the version with restricted phrase length. Surprisingly, the values
for the systems with one feature are not as high as the other ones, but only in
the same range as the baseline. Generally, it is also somewhat surprising that
there is no improvement of standard bleu-values for the full system, but a gain
in pos-bleu. In the case of the one-word-left context, we find that there is a
significant improvement in the full system’s pos-bleu value, while the standard
bleu-scores are the same. In the system using only single word source phrases,
improvements of bleu and nist values are significant both on the pos and
lemma level. Similarly, results of standard bleu and nist for this setting are
also significant except for one case.
Since a good correlation with human judgement has been shown at least for
pos-bleu, it seems justified to assume that there is some improvement, at
least in the one-word-left context, although it is not noticeable in the standard
bleu score. A more detailed analysis of both individual cases and different error
types is necessary and will be provided later in this chapter.
                full setting        ϕ(trg|srcc)         ϕ(trg|srcc), src-phrase-length=1
                pos-bleu  pos-nist  pos-bleu  pos-nist  pos-bleu  pos-nist
baseline        36.72     5.22      34.76     4.98      30.65     4.56
1 word left     37.14     5.24      34.70     5.00      31.50     4.67
1 word right    36.95     5.29      34.88     5.00      31.19     4.63

                full setting        ϕ(trg|srcc)         ϕ(trg|srcc), src-phrase-length=1
                lem-bleu  lem-nist  lem-bleu  lem-nist  lem-bleu  lem-nist
baseline        17.34     5.80      15.95     5.45      13.27     4.94
1 word left     17.51     5.84      16.04     5.46      14.05     5.11
1 word right    17.53     5.85      15.95     5.52      13.70     5.09

Table 12: Results for translations using either one word on the left or on the
right side of the phrase. The scores in the first table are based on pos-tags,
the scores in the second table are computed on word lemmas. Numbers in bold
print are significantly better than the baseline.
6.2 Pos-based contexts
While conditioning translation probabilities on adjacent words helps to capture
context as can be seen especially in the reduced setting, this is not optimal
as a large amount of words can be expected to occur only once or twice in
combination with a phrase and therefore unnecessarily increase the number of
possible contexts of a phrase. By using pos tags instead of words, we can better
exploit the data, as all words of one category are merged into one context. While
the number of possible contexts is heavily reduced, the tag set is still relatively
fine-grained. However, lexical information is lost.
As has already been illustrated in some of the examples presented before (e.g.
section 3.3.2), pos-tags may be useful context features as they may not only
help to restrict translation possibilities on a mostly lexical level, but also on a
syntactic level. By including linguistically motivated features, we especially hope
for better fluency in the translated sentences, i.e. a gain in pos-bleu. In order
to illustrate this expectation, we have another look at the example from the
previous section, where the context word is on the left side of divided resulted
in an increase of verbal and adjective-like translations, while faulty phrasetranslation pairs were discarded. However, all realizations of auxiliary verbs are
listed separately when looking at word forms and thus, probability distributions
of basically equal contexts are divided which means that only a fraction of
the available data is used for the estimation of translation probabilities: by
replacing word forms with their respective pos tags and thus making contexts
more general, this problem can be overcome.
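The merging of word-form contexts into pos contexts can be sketched as follows; the coarse tag VAUX and the count tables are invented for illustration:

```python
from collections import Counter, defaultdict

def generalize_contexts(table, pos_of):
    """Merge word-form contexts into pos contexts, so that e.g. 'is' and
    'was' contribute to a single auxiliary-conditioned distribution."""
    merged = defaultdict(Counter)
    for (src, ctx_word), dist in table.items():
        merged[(src, pos_of.get(ctx_word, "UNK"))].update(dist)
    return merged
```

After merging, the counts observed separately under is and was are pooled under one tag, so the full data supports the context-conditioned estimate.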
We carried out a detailed study of pos based contexts on the left side of
a phrase by examining contexts of different length (one and two pos tags)
and modelling the transition between the left-side context and the phrase itself.
Furthermore, we conducted experiments conditioned on one pos tag on the
right side. Table 13 shows the standard bleu and nist scores obtained for
these experiments. Similarly to the word-based contexts, the difference between
the modified system and the baseline is larger for the reduced settings.

                            full setting    ϕ(trg|srcc)     ϕ(trg|srcc), src-phrase-length=1
                            bleu    nist    bleu    nist    bleu    nist
baseline                    13.84   5.14    12.45   4.80    10.24   4.31
1 pos left                  14.01   5.15    12.64   4.84    10.80   4.48
2 pos left                  13.93   5.18    12.88   4.84    10.73   4.46
leftmost pos                13.64   5.10    12.11   4.78    10.31   4.37
1 pos left + leftmost pos   14.05   5.18    12.95   4.92    10.74   4.41
1 pos right                 14.05   5.19    12.32   4.78    10.53   4.32

Table 13: Results for translations using contexts based on pos-tags. The context
with one tag on the left side of the phrase and the leftmost pos tag is treated as
a single feature. Numbers in bold print are significantly better than the baseline.
The systems using one or two pos tags on the left side are better than
the baseline, although there is not always a significant gain. In contrast, the
results of the context leftmost-pos are disappointing. While there is a slight
improvement in the length-one system, the scores of the other two systems
are even worse than the baseline. As knowledge about the phrase itself may
also help to find appropriate phrase-translation pairs, the idea to condition
probability estimation on the phrase’s pos tags seems plausible. As the phrase
length can go up to 7 words, we did not want to use the entire tag sequence
of a phrase, but only the tag of the first word of a phrase, which enables the
system to disambiguate the first word of a phrase. Obviously, this does not
take into account the whole phrase. Accordingly, the only improvement of the
leftmost-pos context is in the length-one system, where the context feature
covers the entire phrase. In contrast to conditioning on adjacent words (which
vary in most cases), the pos tag of the phrase is often the same, which means
that the conditioned probabilities often correspond more or less to the original
distribution. As a consequence, the improvement is not nearly as high as for
the other two systems and also fails to be significant.
However, there is a large improvement when conditioning on the leftmost-pos
and the adjacent pos on the left side of the phrase: In this case, we do not look
at an arbitrarily chosen tag, but at the link between the focused phrase and its
left neighbor.
The last context in table 13, one pos tag on the right side, is somewhat
different from the left-side contexts as it has a comparatively high bleu score
in the full setting, but comparably low scores in the reduced systems. Since
improvement in the length-one system can be considered a necessary condition
for improvement in the full system, this outcome suggests that the scores of
this context might not be fully reliable. As it is not guaranteed that parameter
tuning finds the globally best setting, it might be possible that there exists a
better set of parameters for the length-one system allowing to produce a better
result. On the other hand, it might be possible that the seemingly good result of
the full system is due to merely random effects and therefore does not represent
the ‘true’ improvement of this context.

                            full setting        ϕ(trg|srcc)         ϕ(trg|srcc), src-phrase-length=1
                            pos-bleu  pos-nist  pos-bleu  pos-nist  pos-bleu  pos-nist
baseline                    36.72     5.22      34.76     4.98      30.65     4.56
1 pos left                  36.66     5.25      34.84     5.03      31.68     4.67
2 pos left                  36.91     5.28      34.89     4.99      31.72     4.64
leftmost pos                36.49     5.23      34.36     4.98      31.04     4.61
1 pos left + leftmost pos   37.20     5.25      35.23     5.08      31.45     4.65
1 pos right                 36.90     5.27      34.52     4.94      31.22     4.56

                            full setting        ϕ(trg|srcc)         ϕ(trg|srcc), src-phrase-length=1
                            lem-bleu  lem-nist  lem-bleu  lem-nist  lem-bleu  lem-nist
baseline                    17.34     5.80      15.95     5.45      13.27     4.94
1 pos left                  17.57     5.82      16.18     5.50      13.98     5.19
2 pos left                  17.54     5.86      16.37     5.47      13.86     5.07
leftmost pos                17.12     5.78      15.65     5.43      13.45     5.00
1 pos left + leftmost pos   17.80     5.85      16.61     5.59      13.91     5.19
1 pos right                 17.60     5.85      15.84     5.41      13.58     4.96

Table 14: Results for translations using contexts based on pos-tags; the first
table shows pos-based, the second lemma-based scores. Numbers in bold print
are significantly better than the baseline.
For a more focused evaluation on the lexical and syntactic level, bleu and
nist scores were computed on pos tags and lemmatized words: the results are
listed in table 14. Similarly to the results before, the effect of including context
information is much more noticeable in the reduced settings.
For the 2-pos-left and 1-pos-left contexts, we find considerably higher
pos-bleu scores in the length-one systems than when conditioned on adjacent
words. This outcome supports our initial hypothesis that linguistically motivated
context features have a positive effect on the grammatical level, i.e. enhance
fluency. Although the pos-based contexts lack lexical aspects, there are not
only improvements of pos-based scores to be observed, but also gains in the
lemma-based scores, corresponding in most cases with gains in standard bleu.
The results for the 1-pos-left context are disappointing in the full system:
while not being as bad as the ones of the leftmost-pos context, they fail to be
significant for the two systems other than the length restricted one although
the standard bleu score of this context is comparatively high. This is somewhat
surprising, especially considering that the 1-pos-left context’s scores in
the length-one system tend to be relatively high. The scores of the leftmost-pos
context correspond to the results of the standard bleu as they slightly increase
for the length-one system, yet to a smaller extent than the other two contexts.
It is also surprising that the 1-pos-left-leftmost-pos context, which has a
very high pos-bleu score in the full setting, does not achieve equally high results
in the length-one system.
When comparing left-side and right-side contexts (e.g. one word/pos tag),
it is remarkable that the contexts on the left side have especially good scores in
the length-one systems, whereas the right-side contexts have lower scores in the
length-one systems but equally high (or even better) results in the full systems.
6.3 Combining features
So far, only one context at a time, on either the right or the left side
of a phrase, has been considered. However, combining two contexts might be
useful, as only phrase-translation pairs with high translation probabilities in
both contexts are likely to be chosen for translation. In this approach, we did not
estimate translation probabilities conditioned on both contexts at the same time,
but used the respective probabilities conditioned on one context as separate
feature functions in the log-linear model. The respective relevance of the features
is then to be determined by the parameter tuning. Translation probabilities
were computed using interpolation (version ①, cf. table 10), and entries where
both contextually conditioned translation probabilities were smaller than 0.0002
were not included in the phrase table.
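The two context-conditioned probabilities entering the log-linear model as separate feature functions, together with the pruning criterion, can be sketched as follows; the entry fields, weights and numbers are illustrative assumptions:

```python
import math

def loglinear_score(feature_values, weights):
    """Log-linear model score: sum of weighted log feature values."""
    return sum(w * math.log(h) for h, w in zip(feature_values, weights))

def prune(entries, floor=0.0002):
    """Keep a phrase-table entry only if at least one of its two
    context-conditioned probabilities reaches the floor."""
    return [e for e in entries if e["p_left"] >= floor or e["p_right"] >= floor]
```

Because the log probabilities are summed, a candidate that is moderately likely under both contexts outscores one that is likely under only one of them, which is the intended ‘intersection’ effect.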
The basic idea of combining features is to do an ‘intersection’ of translation
candidates based on two independent criteria, with the anticipated effect of
not using translation candidates only appropriate for one context. The context
features are required to be independent since this is a condition feature functions
of a log-linear model have to fulfill. But regardless of the technical requirements
of the log-linear model, it would not be very useful to combine, say, the 1-pos-left
context and the 1-word-left context, as one is a more detailed version of the
other.
In the following section, experiments with combined contexts consisted of
features on the left and the right side of the phrase, with the exception of the
leftmost-pos context which was combined with the 1 pos-left context.
The most obvious combinations are those of one adjacent word or pos
tag on each side of the phrase; by combining pos tags on one side with an
adjacent word on the other side, the system can benefit from both lexically
and syntactically motivated contexts: this is an attempt to overcome the loss
of lexical features coming along with the introduction of pos tags, while still
profiting from generalization.
The same pattern as seen in the previous evaluations emerges for the results
of combined context features: As can be seen in table 15, scores are significantly
better in the length-one systems, but mostly in the same range as the baseline
for the full systems.
So far, the best result of a full system has been achieved by the system
where a pos tag on the left side of the phrase and the (subsequent) leftmost pos
tag of the phrase have been considered for translation probability estimation.
This corresponds to the result of the same context being integrated as a single
feature instead of two feature functions: Taking these two adjacent tags into
account models, to a certain degree, the transition between them, by not only
conditioning on the surroundings of the phrase, but also including (linguistic)
information about the very next word.
Generally, the scores of the length-one systems are slightly better than the
respective scores of the single-context systems. The same applies for the results
when only using the modified translation probabilities; here, more results are
significantly better than the baseline system. Interestingly, improvements seem
to be concentrated more on the lexical level (lemma-based and standard bleu)
than on the syntactic level, even though the pos-bleu scores of the length-one
systems are higher than in single-context experiments. However, improvements
of full systems still lack significance in most cases.
A peculiarity of the 1-pos-left and leftmost-pos combination is
that its improvement in the length-one system is comparatively small given
that its full system achieved the best score. In contrast to the other combined
systems, this context does also not work very well in the one-feature system,
where, if at all, only one score (bleu or nist) is significant. It seems somewhat
contradictory that in this case (i.e. significant improvement in the full system),
the gain in bleu/nist scores is not noticeable to the same extent in the reduced
settings. In comparison, the scores of the one-feature system with the single
context 1-pos-left-leftmost-pos (cf. tables 13 and 14) are better.
We can make a similar observation on the 1-pos-left-1-pos-right context:
While it has good values in the full settings, it does not perform equally well in
the length-one system, except for pos-bleu, suggesting that a purely syntactic
context setting tends to improve the grammatical quality. Equally, the purely
lexical setting with one word on either side of the phrase results in a high
lemma-based bleu (although not in the full setting) with a comparatively low
pos-bleu. The general tendency of pos-based features having a positive effect
on pos-bleu was already observed for (left-side) single features in table 14.
                            full setting    ϕ(trg|srcc)     ϕ(trg|srcc), src-phrase-length=1
                            bleu    nist    bleu    nist    bleu    nist
baseline                    13.84   5.14    12.45   4.80    10.24   4.31
1 word left + 1 word right  14.04   5.13    12.83   4.91    11.07   4.51
1 pos left + 1 pos right    14.05   5.18    12.88   4.84    10.80   4.45
1 pos left + 1 word right   14.02   5.16    12.97   4.87    11.10   4.50
2 pos left + 1 word right   13.86   5.11    12.95   4.87    11.06   4.48
1 pos left + leftmost pos   14.12   5.19    12.79   4.80    10.79   4.43

                            full setting        ϕ(trg|srcc)         ϕ(trg|srcc), src-phrase-length=1
                            pos-bleu  pos-nist  pos-bleu  pos-nist  pos-bleu  pos-nist
baseline                    36.72     5.22      34.76     4.98      30.65     4.56
1 word left + 1 word right  36.75     5.13      34.98     5.01      31.56     4.67
1 pos left + 1 pos right    37.16     5.28      34.92     5.00      31.84     4.67
1 pos left + 1 word right   36.56     5.23      35.22     5.03      31.93     4.70
2 pos left + 1 word right   36.75     5.21      35.17     5.01      31.68     4.66
1 pos left + leftmost pos   37.03     5.25      34.72     4.96      31.03     31.48

                            full setting        ϕ(trg|srcc)         ϕ(trg|srcc), src-phrase-length=1
                            lem-bleu  lem-nist  lem-bleu  lem-nist  lem-bleu  lem-nist
baseline                    17.34     5.80      15.95     5.45      13.27     4.94
1 word left + 1 word right  17.53     5.78      16.52     5.58      14.30     5.14
1 pos left + 1 pos right    17.65     5.85      16.42     5.49      13.98     5.11
1 pos left + 1 word right   17.62     5.83      16.66     5.53      14.25     5.14
2 pos left + 1 word right   17.56     5.78      16.49     5.52      14.26     5.14
1 pos left + leftmost pos   17.78     5.85      16.26     5.44      13.80     5.06

Table 15: Results for combining two different contexts as separate features. The
first table shows standard, the second pos-based, and the third lemma-based
scores. Numbers in bold print are significantly better than the baseline.

Summarizing the results of the three bleu-based evaluation metrics on
different feature designs, we can say that almost every context template resulted
in an improvement in the length-one system. Scores of the full systems tend to
be improved as well, but this improvement is mostly not statistically significant.
Generally, combined contexts worked better than single-feature contexts, and
there is also evidence that pos-based contexts have a positive influence on the
syntactic structure of the resulting translation. Furthermore, some of the (full)
systems have a significantly better nist score, but no significantly improved
bleu score: Since nist weighs n-grams according to their assumed relevance, this
might indicate that there are improvements on a lexical level that (lemma-based)
bleu scores fail to detect: It seems indispensable to provide an evaluation of
adequacy focusing on lexical aspects only.
While this method of combining features is very simple, being limited to two
contexts on either side of the phrase, positive effects are noticeable. Yet, one
could imagine using more sophisticated methods to find optimal combinations
of two or more context features. [Gimpel and Smith, 2008] use a forward variable
selection algorithm: Starting with no context templates, new contexts are
added iteratively based on the gain in bleu on unseen data after parameter
tuning. In their experiments, the combination of the syntactic feature 2-pos-left
and the lexical feature 1-word-right worked best for English→German. In our
experiments, this feature combination had relatively good results in the reduced
settings, but improved bleu scores failed to be significant in the full system.
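The selection procedure of [Gimpel and Smith, 2008] can be sketched as a greedy loop; here, evaluate stands in for a full tune-and-score run on held-out data, and the mocked bleu gains in the usage below are invented for illustration:

```python
def forward_selection(templates, evaluate):
    """Greedy forward selection: start with no context templates and
    repeatedly add the template with the largest bleu gain on held-out
    data; stop as soon as no candidate improves the best score."""
    selected, best = [], evaluate([])
    while True:
        candidates = [t for t in templates if t not in selected]
        if not candidates:
            break
        scored = [(evaluate(selected + [t]), t) for t in candidates]
        score, t = max(scored)          # best-scoring candidate this round
        if score <= best:
            break                       # no further gain: stop adding templates
        selected.append(t)
        best = score
    return selected, best
```

With additive mock gains of +0.4 for 2-pos-left, +0.3 for 1-word-right and -0.1 for leftmost-pos over a baseline of 13.84, the loop selects the first two templates and then stops.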
6.4 Evaluation
In order to gain better insight into the characteristics of the contextually conditioned systems, we will now discuss evaluations going beyond the standard
bleu and nist scores. While the criterion of fluency has already been judged by
the pos-based metrics, we now attempt to provide a better analysis in terms of
adequacy. Although lemma-based bleu allows for more variation than standard
bleu, both measures are still very similar, which is also illustrated by the fact
that systems with good bleu scores tend to have good lemma-based bleu scores.
In order to introduce a purely lexical evaluation method, we go a step further
and concentrate on evaluating only content words, assuming them to be crucial
for the reproduction of the source sentence. By restricting the evaluation to
content words, syntactic aspects (represented by word sequences) are completely
excluded.
In addition to automatic evaluation, we carried out a manual evaluation
focusing on adequacy and used the annotated material for an analysis of error
types. Assumptions and outcomes resulting from the different evaluation
approaches are then illustrated with example sentences.
6.4.1 Content word oriented evaluation
Standard bleu measures both fluency and adequacy by working with
n-grams of different lengths. Computing bleu on lemmatized words instead of
full forms is an attempt to estimate more precisely the degree of adequacy for a
morphologically relatively complex language like German. However, this does
not take into account that some words are more important than others. If
a word such as a determiner is omitted, the sentence is still comprehensible as
long as the noun has been correctly translated. For the following experiments,
we define nouns, verbs and adjectives to be content words. To measure
the degree of adequacy, we compute recall, precision and f-measure only on
correctly translated content words. This classification into content words and
non-content words is only a simple approximation; it is very easy to come up
with counter-examples, negation being a case in point: while essential for the meaning
of a sentence, negation particles are not classified as content words, as negated
constructions are often not translated literally. There are far too many
ways of expressing negation for it to be detected automatically, such as the variation
between verbal negation nicht and nominal negation kein, or inherent negation in
constructions like not want→ablehnen (refuse). This problem of non-isomorphic
translation also applies to prepositions, auxiliary verbs, etc.
While limiting the evaluation to ‘basic’ content words is an extreme simplification,
it could still prove useful, especially under the assumption that
literal translations are enhanced when faulty nonsense translations are discarded
thanks to good contexts.
Recall, precision and f-measure were first computed on sentence level and
then averaged over the entire test set. To compute recall and precision, the
set of words tagged as noun, verb or adjective in the automatically produced
translation was compared with the words tagged as noun, verb or adjective in the
reference translation. The comparison of words was conducted on lemma level.
To be counted as a correct translation, a word merely had to belong to the group
of content words and have a matching equivalent in the reference translation;
we did not insist on verbs being translated as verbs, nouns being translated as
nouns and so on. Furthermore, we checked the respective number of each word
in order to guarantee that words occurring multiple times are only counted if
there are corresponding words in the reference sentence. This means that a
word occurring twice in a translated sentence can only be considered correct
twice if there are two corresponding words in the reference translation. This
method is similar to the modified n-gram precision used for bleu. Additionally,
we computed the average number of content words per sentence: in this case,
every word tagged as noun, verb or adjective was counted, without checking against
the reference translation, and the counts were averaged over the set of sentences.
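The computation described above can be sketched as follows; the tag prefixes used to identify content words are an assumption for illustration (any mapping from pos tags to the three content-word classes would do):

```python
from collections import Counter

def content_word_prf(hyp, ref,
                     is_content=lambda t: t.startswith(("NN", "VV", "ADJ"))):
    """Sentence-level precision/recall/f-measure over content-word lemmas.

    hyp, ref: lists of (lemma, pos_tag) pairs for one sentence pair.
    Multiple occurrences of a lemma are clipped against the reference,
    analogous to bleu's modified n-gram precision.
    """
    hyp_c = Counter(lemma for lemma, tag in hyp if is_content(tag))
    ref_c = Counter(lemma for lemma, tag in ref if is_content(tag))
    overlap = sum((hyp_c & ref_c).values())  # multiset intersection = clipping
    p = overlap / sum(hyp_c.values()) if hyp_c else 0.0
    r = overlap / sum(ref_c.values()) if ref_c else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```

The corpus-level figures reported below are then simply the averages of these per-sentence values over the test set.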
While sentence-level scores are somewhat unusual in machine translation
evaluation, this approach helps to find sentences with especially high and low
scores and is therefore useful for detecting characteristic features of the
system, by identifying sentences where the modifications had an especially good or
bad influence compared to the baseline system.
Table 16 shows content-word-based statistics for the systems presented
so far. Similarly to the outcome of the previous evaluations, results
are best for the length-one system. Not only does every system score higher
than the baseline system, but the differences between baseline and modified
system are also larger. In every system setting (complete set of features, one
feature, restricted to length one), the average gain in recall is higher than
the average gain in precision. This correlates with the fact that the average
number of content words per sentence is often slightly higher in the modified
systems (cf. table 17). It is difficult to generally decide whether a higher
number of translated content words is good or bad, given that translated content
words do not necessarily have to be correct. The fact that the average number
of content words in the baseline translations is lower than in the reference
translations suggests that an increase of translated content words is needed.
Especially in the cases of the 1-word-right and 1-pos-left context (full setting),
                           full setting          one feature           src-phrase-length=1
                           ϕ(trg|src_c)          ϕ(trg|src_c)          ϕ(trg|src_c)
                           p      r      f       p      r      f       p      r      f
baseline                   0.3973 0.3889 0.3867  0.3841 0.3383 0.3519  0.3538 0.3195 0.3293
1 word left                0.3998 0.3916 0.3895  0.3828 0.3398 0.3522  0.3669 0.3457 0.3495
1 word right               0.3942 0.3934 0.3874  0.3869 0.3347 0.3515  0.3684 0.3398 0.3473
1 pos left                 0.3969 0.3959 0.3901  0.3825 0.3519 0.3593  0.3699 0.3410 0.3482
2 pos left                 0.3980 0.3955 0.3905  0.3866 0.3351 0.3512  0.3692 0.3365 0.3451
1 pos left, leftmost pos   0.4026 0.3942 0.3920  0.3941 0.3523 0.3648  0.3705 0.3399 0.3480
1 pos right                0.3944 0.3930 0.3872  0.3795 0.3392 0.3514  0.3606 0.3245 0.3354
1 word left, 1 word right  0.3954 0.3825 0.3821  0.3942 0.3519 0.3648  0.3672 0.3493 0.3517
1 pos left, 1 pos right    0.3945 0.3932 0.3876  0.3837 0.3456 0.3562  0.3648 0.3456 0.3485
1 pos left, 1 word right   0.3974 0.3879 0.3860  0.3870 0.3492 0.3599  0.3675 0.3451 0.3497
2 pos left, 1 word right   0.3982 0.3783 0.3810  0.3882 0.3516 0.3618  0.3664 0.3508 0.3522
1 pos left, leftmost pos   0.3987 0.3887 0.3872  0.3821 0.3405 0.3529  0.3635 0.3335 0.3416

Table 16: Precision, recall and f-measure for correctly translated content words.
Values which are better than the baseline have a gray background. Additionally,
the best score in each column (precision, recall and f-measure respectively) is
printed in bold.
                           full setting   one feature    src-phrase-length=1
                           ϕ(trg|src_c)   ϕ(trg|src_c)   ϕ(trg|src_c)
baseline                   8.5507         7.6823         7.8509
1 word left                8.5468         7.7378         8.2758
1 word right               8.7212         7.4864         8.0700
1 pos left                 8.7398         8.0322         8.0565
2 pos left                 8.6979         7.5448         8.0117
leftmost pos, 1 pos left   8.5302         7.7680         8.0039
1 pos right                8.7281         7.7300         7.7534
1 word left, 1 word right  8.4288         7.6988         8.3168
1 pos left, 1 pos right    8.7359         7.8509         8.3148
1 pos left, 1 word right   8.5214         7.9630         8.2310
2 pos left, 1 word right   8.2788         7.9600         8.4337
1 pos left                 8.5029         7.6481         7.9288

Table 17: Average number of content words per sentence. The reference sentences
contain 8.6209 content words on average.
the average number of translated content words is higher than in the baseline
system, resulting in slightly decreased precision scores. However, precision does
not suffer dramatically, and the baseline’s recall value of less than 0.39 is far from
perfect, indicating that not enough content words are correctly translated. The
main problem with this sort of automatic evaluation is mismatches between
the reference sentence and good translations: good translations that are not
part of the reference sentence are not taken into account.
Particularly interesting are the values of the leftmost-pos-1-pos-left context
(full setting), where on average fewer content words than in the baseline are
translated, but a gain in recall and precision can be observed. Similarly, three
of the combined contexts have both fewer translated content words and a better
precision than the baseline. The fact that these three systems using feature
combinations behave similarly in terms of precision and recall indicates that
independent features complement and reinforce each other by being especially
restrictive: since both features need to have high values for a high total score,
only translation candidates that are appropriate in both contexts are likely to
be taken.
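This conjunction-like behavior follows from the log-linear model underlying the decoder: feature values are combined multiplicatively (additively in log space), so a candidate that is weak with respect to either context feature receives a low total score. A minimal sketch with invented feature values:

```python
import math

def loglinear_score(feature_values, weights):
    """Combined model score in log space: sum_i w_i * log(phi_i).
    Because the features multiply, a candidate that is weak in any
    single feature is penalized -- two context features act as a
    conjunction of soft constraints."""
    return sum(w * math.log(v) for v, w in zip(feature_values, weights))

# invented feature values, equal weights: a candidate fitting both
# contexts moderately well beats one that fits only a single context
balanced = loglinear_score([0.6, 0.6], [1.0, 1.0])   # exp = 0.36
lopsided = loglinear_score([0.9, 0.1], [1.0, 1.0])   # exp = 0.09
```

With equal weights, the balanced candidate wins even though the lopsided one has the single highest feature value.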
The increased number of content words, particularly in the reduced systems,
also strengthens our assumption that (literal) translations into content words
are enhanced.
In the one-feature systems, noticeably fewer content words are translated. As
this applies to the baseline as well as to the modified systems, we assume that it
is an inherent characteristic of this type of setting and has nothing to do with
our modifications.
Nevertheless, we have to keep in mind that these improvements were not tested
for significance, as was the case with the bleu and nist evaluation, and that nothing
is known about their correlation with human judgments. Therefore, the results
based on content words cannot be regarded the same way as commonly accepted
evaluation metrics. Yet, the results of the content-word-based evaluation are to a
certain degree consistent with the previous evaluation. This applies particularly
to the leftmost-pos-1-pos-left context, which outperforms most other systems
in all setting variants (bold-printed numbers in table 16): when looking at
the bleu and nist scores of this context, we find that there are significant
improvements in all variants (standard, pos-based and lemma-based).
6.4.2 Manual evaluation
To complete the evaluation of translation results, we also carried out a small-scale
manual evaluation. The test set comprises 100 randomly chosen sentences
from the full system conditioned on one pos tag on either side of the phrase.
This setting was chosen because it has significantly improved pos-bleu and
lem-bleu scores, and a comparatively high (although not significant) standard
bleu score. Since the lack of a significantly increased standard bleu score despite
significantly better pos- or lemma-based scores applies to several settings,
we are interested in finding out whether there are phenomena that are not captured
by automatic evaluation metrics.
The evaluation task consisted of annotating which one of two given translations
(baseline or contrastive system) was better, or whether both translations
were of equal quality. In order not to influence participants, the two
translations were presented in random order. Additionally, the English input
sentence was given, but not the German reference translation: this should
prevent participants from favoring sentences resembling the reference translation
even when the alternative translation is of the same quality.
Annotation directives were kept very simple: the most important point was
adequacy, meaning that the sentence that better expresses the content of the
input sentence was to be marked as the better one. The criterion of fluency
came second. In fact, in most cases better adequacy goes hand in hand with
better fluency, since missing or wrong words tend to have negative effects on the
syntactic level, too.
As test set, 100 sentences whose baseline translation differed from the
translation produced by the contextually conditioned system were randomly
chosen, with the additional requirement of a sentence length between 5 and
25 words. The results of the three annotators (1 linguist and 2 non-linguists)
are given in table 18. Generally, the correlation between the annotation
results is not high: while À and Á judged roughly equally many sentences
to belong to the respective categories, their agreement is not good. The
annotation produced by Â is somewhat different, as this participant found fewer
translations to be of equal quality: being a linguist, he might be more sensitive
to subtle criteria and therefore have a preference for one of the options. As
annotator            À    Á    Â    À∩Á  Á∩Â  À∩Â  À∩Á∩Â
equal quality        44   37   21   23   13   17   12
baseline better      21   25   33   12   19   18   12
contrastive better   35   38   46   22   29   32   20

Table 18: Results of manual evaluation with three annotators. The persons À
and Á are non-linguists, Â has a linguistic background.
a result, all three participants agreed on only 44 sentences, fewer than half the
test set. The differences between the annotations show the difficulty
of rating machine translation output. Of course, one might argue that the
annotation guidelines were formulated too loosely, but on the other hand, this
experiment was intended to identify the translations that were intuitively judged
to be the better variant.
If we consider all sentences without a clear result to be more or less equal, then
the number of 68 sentences with roughly the same quality suggests that the two
systems are relatively alike. However, the fact that each participant preferred a
higher number of sentences produced by the contextually conditioned system
indicates that there is a certain improvement over the baseline.
Additionally, the set of sentences translated with the contrastive system that all
three participants prefer is larger than the set of preferred baseline sentences.
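The intersection counts reported in table 18 can be derived directly from the per-sentence judgments; the following sketch uses invented toy labels rather than the actual annotations:

```python
from itertools import combinations

def agreement_counts(judgments):
    """judgments: dict annotator -> list of per-sentence labels
    (e.g. 'equal', 'baseline', 'contrastive'), all lists equally long.
    For every subset of two or more annotators, count how many sentences
    all of them labeled identically, broken down by label."""
    names = sorted(judgments)
    n_sent = len(judgments[names[0]])
    counts = {}
    for size in range(2, len(names) + 1):
        for subset in combinations(names, size):
            tally = {}
            for i in range(n_sent):
                labels = {judgments[a][i] for a in subset}
                if len(labels) == 1:  # unanimous for this subset
                    lab = labels.pop()
                    tally[lab] = tally.get(lab, 0) + 1
            counts[subset] = tally
    return counts
```

On the real data, summing the three label counts for the full annotator set gives the 44 sentences with complete agreement.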
The set of sentences on which all annotators agreed was then used as the basis
for a more detailed evaluation of error types. Since the test set for the manual
evaluation was chosen randomly from the entire set of sentences, this sample
can be considered representative.9 A closer inspection of the two sets containing
sentences that are either better or worse than the baseline might help to understand
the properties of the contextually conditioned system. For this analysis, three
types of error were defined: as missing words are a major impediment to
the understanding of a translation, we counted how often the contextually
conditioned system managed to translate previously untranslated words, or lost
words that were correctly translated in the baseline. Another criterion is better
fluency; in this case, we do not distinguish between improvements on the syntactic
or morphological level and a more fluent rendering of fixed expressions. The
last point is word choice; it applies to cases where one word or expression is
clearly better than the alternative chosen by the other system.
Of the set of 20 sentences judged better than the baseline, 10 contain words
that are missing in the baseline translation; 7 of these formerly missing words
are verbs. Furthermore, 9 sentences were found to contain more fluent word
sequences, and in two cases the word choice was improved. (One of the sentences
was improved both by a better-suited word and by a more fluent structure.)
The distribution of error types found in the set of 12 sentences where
the baseline system produced the better translation is slightly different: while the
percentage of fluency-related cases (5 of 12) corresponds to the percentage of
errors of this category in the other sample, there are roughly as many word-choice
problems (4) as missing words (3). Interestingly, in two of the sentences
that were found to be worse than the baseline, we can observe the opposite
effect of missing words, namely redundant words (categorized as incorrect choice
of word). One of the two redundant words does not make any sense at all and is
clearly the result of overestimating a random phrase-translation pair. The other
case is more interesting and illustrated in the following example, where (28) is
the baseline translation and (29) is the contrastive translation.
(28)
[what country ,]1 [exactly]2 [, does the]3 [federal chancellor]4 [live in]5 [?]6
[welches land ,]1 [genau]2 [die]3 [bundeskanzlerin]4 [leben in]5 [?]6
9 The complete set of the sentences used for error analysis is listed in the appendix.
(29)
[what country]1 [, exactly , does ]2 [the ]3 [federal chancellor]4 [live in]5 [?]6
[welches land]1 [genau plant ]2 [die]3 [bundeskanzlerin]4 [leben]5 [?]6
(30)
which country does the federal chancellor plan to live in?
In German, the do+verb construction does not exist; thus, it is optimal to
translate only the main verb (live) and ignore the do. The baseline translation, where
does the is translated as an article, is perfectly comprehensible, although it is
not syntactically well-formed. In the alternative translation in (29), does is
translated by the verb plant (plans/intends), which gives a new meaning to
the entire question (cf. 30). While this is definitely an unwanted effect, there
is a certain logic behind this error, since the chosen verb belongs to a similar
category (both do and plan introduce non-finite verbs) and could even have been a
valid choice in a different situation.
In total, the results illustrate that the contextually conditioned systems tend
to translate more content words. Given that verbs in particular are often lost in
the translation process, this is a positive effect, even if the system is sometimes
overzealous and inserts unnecessary words. The evaluation showed that the
number of correctly translated, formerly missing words clearly exceeds the number
of redundant words. This observation is also consistent with the results of the
content-word-based evaluation.
6.5 Analysis of selected examples
In the chapter about evaluation, a classification of typical error types in machine
translation output was presented (cf. section 4.2), as well as different scenarios
where the integration of context features could prove useful. In an attempt to
better understand the effects of our modifications, we will carry out a detailed
analysis of example translations that are representative of some of the error
categories mentioned before. The first part of the evaluation focuses mainly on
adequacy, while the second part illustrates the translational behavior of verbs and
takes the evaluation to the level of fluency.
All examples were chosen manually, but with the search criterion of large
differences in evaluation scores. For each example, the segmentation of the
source sentence is shown; the baseline translations are abbreviated as bl, while
the output of the contextually conditioned systems is labelled cont.
6.5.1 Adequacy
The task of finding appropriate translation candidates for polysemous source
phrases was one of the main objectives for the use of context features.
In the example in table 19, the words march and general are ambiguous.
In the case of march, which in this sentence denotes the action of marching,
both the baseline and the modified system fail to disambiguate the word. As
a matter of fact, none of the other systems conditioned on different contexts
was able to find a valid translation of march either. In mixed-case text, the month
March is generally uppercase; however, this information is lost by lowercasing
the entire data.
input  [other military]1 [units were]2 [to join the]3 [march]4 [shortly]5
       [, said]6 [general]7 [danilo]8 [lim]9 [, former]10 [leader of the]11
       [scout]12 [rangers]13 [elite]14 [unit]15 [.]16
bl     [andere militärische]1 [einheiten]2 [zu den]3 [märz]4 [kurz]5
       [gesagt ,]6 [allgemeine]7 [danilo]8 [lim]9 [, ehemaliger]10 [führer der]11
       [pfadfinder]12 [rangers]13 [elite]14 [einheit]15 [.]16
gloss  [other military]1 [units]2 [to the]3 [march (month)]4 [shortly]5
       [said ,]6 [general (adj)]7 [danilo]8 [lim]9 [, former]10 [leader of the]11
       [scout]12 [rangers]13 [elite]14 [unit]15 [.]16

input  [other]1 [military]2 [units were]3 [to join]4 [the]5 [march]6 [shortly]7
       [, said]8 [general]9 [danilo]10 [lim]11 [, former]12 [leader of the]13
       [scout]14 [rangers]15 [elite]16 [unit]17 [.]18
cont   [andere]1 [militärische]2 [einheiten]3 [im]5 [märz]6 [in kürze]7
       [beitreten]4 [, sagte]8 [general]9 [danilo]10 [lim]11 [, ehemaliger]12
       [führer der]13 [pfadfinder]14 [rangers]15 [elite]16 [einheit]17 [.]18
gloss  [other]1 [military]2 [units]3 [in]5 [march (month)]6 [shortly]7 [join]4
       [, said]8 [general]9 [danilo]10 [lim]11 [, former]12 [leader of the]13
       [scout]14 [rangers]15 [elite]16 [unit]17 [.]18
ref    general danilo lim, der früher die eliteeinheit der scout rangers
       befehligte sagte, andere armee-einheiten wollten sich dem marsch in
       kürze anschließen.

Table 19: The translation probabilities of this example were conditioned on the
leftmost pos of the phrase and the pos on the left side (full system).
The word march is tagged as a noun regardless of its meaning; but given
that the combination in march is more common when speaking of the month
March, it is surprising that the system did not manage to find a better
translation. When looking at the phrase translation probabilities, it becomes
evident that there actually was a substantial change: the translation probability for
march → märz (March, i.e. the unwanted translation) decreased from 0.82 to
0.33, while the targeted translation probability for march → marsch (march)
rose to 0.08 (formerly 0.01). However, 0.08 is still far less than 0.33. Furthermore,
with märz (March) occurring considerably more often in the training data, the
language model might give it a higher score than the less frequent marsch
(march). The language model’s preference might be additionally reinforced by
march’s proximity to shortly, which can be expected to co-occur with indications
of dates.
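The contextually conditioned probabilities discussed here are relative frequencies over extracted phrase-pair instances, with the source context included in the conditioning event. A toy sketch (the context labels and counts are invented for illustration and do not correspond to the actual training data):

```python
from collections import Counter

def conditioned_phrase_probs(extracted):
    """Relative-frequency estimation of phi(trg | src, context).

    extracted: iterable of (src_phrase, context, trg_phrase) tuples,
    one per extracted phrase-pair instance."""
    joint = Counter((s, c, t) for s, c, t in extracted)
    marginal = Counter((s, c) for s, c, _ in extracted)
    return {(s, c, t): n / marginal[(s, c)] for (s, c, t), n in joint.items()}

# invented counts for illustration only
instances = (
    [("march", "APPR-left", "märz")] * 4      # e.g. after a preposition
    + [("march", "APPR-left", "marsch")] * 1
    + [("march", "ART-left", "marsch")] * 2   # e.g. after an article
)
probs = conditioned_phrase_probs(instances)
```

Splitting the counts by context in this way is exactly what redistributes probability mass between competing translations such as märz and marsch.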
The other ambiguous word in this example is general, which is either a noun
denoting a military rank or an adjective meaning something like
common or generic. Given that part of the context used for disambiguation
consists of the leftmost pos tag of the phrase, this context is able to reduce all
possible translations to a set of valid options. In combination with a finite verb
(said) on the left side, only two translation candidates remain: general and der
general (the general), both of which are correct.
In this example, not only were the translation probabilities of the ambiguous words
changed, but the translation of the two verbs said and join also improved:
the phrase containing join was translated as to the in the baseline translation,
with the result that the equivalent of join is missing in the target sentence. As
join is a content word and thus important for the understanding of the target
sentence, its correct translation in the contextually modified system leads to a
higher comprehensibility. While the translation of said in the baseline preserves
the meaning of to say in past tense, the word form is not well chosen, as opposed
to the contextually conditioned system, where it leads to an improvement in
fluency. While the baseline sentence had a bleu score of zero,10 as it does
not contain a 4-gram that also appears in the reference sentence, the sentence
translated with the refined system has a positive bleu score as well as a higher
pos-bleu, lem-bleu and percentage of translated content words.
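That a missing 4-gram forces a sentence-level score of zero follows directly from bleu's geometric mean over n-gram precisions; a minimal single-sentence sketch (brevity penalty omitted for simplicity):

```python
import math
from collections import Counter

def sentence_bleu(hyp, ref, max_n=4):
    """Geometric mean of clipped n-gram precisions, n = 1..max_n.

    Returns 0.0 as soon as one n-gram order has no match (log of zero),
    which is why a hypothesis without any matching 4-gram scores zero."""
    log_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_ngrams = Counter(tuple(hyp[i:i+n]) for i in range(len(hyp) - n + 1))
        ref_ngrams = Counter(tuple(ref[i:i+n]) for i in range(len(ref) - n + 1))
        matches = sum((hyp_ngrams & ref_ngrams).values())  # clipped counts
        total = sum(hyp_ngrams.values())
        if matches == 0 or total == 0:
            return 0.0
        log_sum += math.log(matches / total) / max_n
    return math.exp(log_sum)
```

This all-or-nothing behavior at the sentence level is one reason why sentence-level bleu comparisons, as in these examples, have to be interpreted with care.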
The next examples demonstrate effects on lexical choice. In some cases,
translations are well comprehensible but cannot be considered completely
correct, as the choice of words is inappropriate, e.g. in the case of
disrupted collocations.
The baseline translation in table 20 is relatively well-formed and expresses the
meaning of the source sentence. However, the choice of the word recht (right/law)
as translation for law is not optimal, and the verbal construction gebrochen
worden (been broken) is not perfect, as it is missing an auxiliary verb. In the
modified system, a better translation option was chosen: co-occurring
with the verb brechen (break), gesetz (law) is a better translation because this
structure is collocational. Additionally, the verbal construction is well-formed.
In contrast to the example presented previously, the translation probabilities
of the phrases containing law or has been broken did not change; the difference
in the translation outcome was caused indirectly by a different segmentation.
While the segmentation depends on the respective weights obtained during
parameter tuning, the considerably increased translation probability of if the → wenn
das in the modified system is most likely what triggered the new segmentation,
which allowed the phrase law has been broken to be translated as a unit.
In the original distribution, the translation probability for if the → wenn die
(if the, but wrong form of determiner) was highest, due to the fact that the
determiner die is applicable in both singular and plural in the accusative and
nominative cases and therefore occurs very often, whereas the article appropriate for
this context, das, is comparatively infrequent. With law on the right side, the
translation probability for the appropriate variant wenn das increases and allows
the system to combine it with the phrase pair law has been broken → gesetz
gebrochen wurde. This would not have been possible with wenn die as the
translation of the first phrase, since die cannot be a determiner for gesetz (law);
consequently, this combination would not be accepted by the language model.
10 Technically, it is undefined.
input  [if the law]1 [has been broken]2 [, it is a matter]3 [for the police .]4
bl     [wenn das recht]1 [gebrochen worden]2 [, es ist eine frage]3
       [für die polizei .]4
gloss  [if the law]1 [been broken]2 [, it is a question]3 [for the police .]4

input  [if the]1 [law has been broken]2 [, it is a matter]3 [for the police .]4
cont   [wenn das]1 [gesetz gebrochen wurde]2 [, es ist eine frage]3
       [für die polizei .]4
gloss  [if the]1 [law was broken]2 [, it is a question]3 [for the police .]4
ref    wenn das gesetz gebrochen wurde, ist es eine sache für die polizei.

Table 20: The translation probabilities of this example were conditioned on one
word on the right side of the phrase (full system).
By translating the entire phrase, i.e. verb and noun, the optimal translation
for law in combination with break is found, with the additional advantage that
the verbal construction is translated correctly.
This example is intended to show that not only ‘straightforward’ contexts
can lead to different translations, but that there are also indirect effects such
as segmentation. It also shows that a (good) decision made at one point of
the translation, i.e. [if the law] vs. [if the], affects the translation of other
phrases, even though their translation probabilities have not necessarily been
changed. However, these indirect effects caused by the use of contextually
conditioned translation probabilities are somewhat unpredictable and can also
backfire, as illustrated in the next example.
In the example shown in table 21, the translation of the baseline sentence
is better than the translation of the modified system. Except for the position
of the verbs, the baseline sentence is a valid translation of the source phrase.
Again, the segmentation is different: the phrase were harmed was translated as
one unit in the baseline, resulting in an appropriate translation. In the other
case, the system chose to separate the phrase into two single words, probably
due to the considerably increased translation probability of were → wurden. As
there was no translation candidate with the meaning of injured for the phrase
harmed, a translation with the meaning of damaged ended up being taken. While
this translation roughly expresses the content of the source phrase, it is not a
good translation when the object of harm is a person.
As in the previous example, enhanced translation probabilities of adjacent
phrases triggered a different segmentation, possibly supported by the new weight
of the segmentation feature, and thus led to a different translation of phrases
whose translation probabilities might not even have been changed. These last two
examples demonstrate that there are ‘random influences’ like segmentation that
are not necessarily closely related to the change of translation probabilities of
the focused phrase, but are caused by different translation probabilities of seemingly
trivial phrases. Additionally, the weights for segmentation and reordering result
from the complex procedure of parameter tuning and are thus hard to trace back.
input  [during the night of]1 [monday]2 [, about]3 [80]4 [policemen]5
       [were harmed]6 [in]7 [riots .]8
bl     [in der nacht vom]1 [montag]2 [, über]3 [80]4 [polizisten]5
       [verletzt worden seien]6 [in]7 [krawallen]8
gloss  [during the night of]1 [monday]2 [, about]3 [80]4 [policemen]5
       [injured have been]6 [in]7 [riots .]8

input  [during the night of]1 [monday]2 [, about]3 [80]4 [policemen]5
       [were]6 [harmed]7 [in riots]8 [.]9
cont   [in der nacht vom]1 [montag]2 [, über]3 [80]4 [polizisten]5
       [in krawallen]8 [geschädigt]7 [wurden]6 [.]9
gloss  [during the night of]1 [monday]2 [, about]3 [80]4 [policemen]5
       [in riots]8 [damaged]7 [were]6 [.]9
ref    in der nacht auf dienstag waren bei krawallen in villiers-le-bel
       rund 80 polizisten verletzt worden.

Table 21: The translation probabilities of this example were conditioned on the
leftmost pos of the phrase and the pos on the left side (full system).
While bad lexical choice is not as harmful as missing words or completely
wrong translations as long as the output remains comprehensible, it is nevertheless a
factor in the evaluation of machine translation. The example in table 20 suggested
that collocations are best translated as a single unit, which, to a certain degree,
strengthens the hypothesis that contextual features are needed to artificially
‘enlarge’ phrases.
6.5.2 Fluency: translational behavior of verbs
In addition to improving comprehensibility by finding appropriate translations
for ambiguous words in the source sentence, adding context information was also
expected to result in more fluent translation output. Especially when using pos
tags, translated sentences should be more fluent, as their translation probabilities
were conditioned on linguistically motivated contexts. The examples presented
in this section mainly focus on the translational behavior of verbs; we will
look at cases where the modified system found better realizations of verbal
constructions, but also at cases where verbs were not translated at all.
In the following example (table 22), the verbal phrase compete with was
translated as im wettbewerb mit (in competition with). While the meaning of the
source phrase is preserved, this is not a good translation. A verb like
compete is likely to be translated as a collocational verb-noun construction like
konkurrenz darstellen (lit. to be/pose a competition) or im wettbewerb stehen (to
be in competition). The version chosen by the baseline system is part of the latter,
but the verb stehen is missing, which is a negative factor when rating fluency.
The modified system decided to translate compete with the literal equivalent
konkurrieren when conditioned on the tag to, resulting in a sentence that is
input  [”]1 [we don ’t want to]2 [compete with]3 [our clients]4 [” ,]5
       [says]6 [rivero .]7
bl     [”]1 [wir wollen nicht]2 [im wettbewerb mit]3 [unsere kunden]4 [”]5
       [sagt , dass]6 [rivero .]7
gloss  [”]1 [we don’t want to]2 [in competition with]3 [our clients]4 [”]5
       [says , that]6 [rivero .]7

input  [”]1 [we don ’t want to]2 [compete]3 [with our]4 [clients]5 [”]6 [,]7
       [says]8 [rivero]9 [.]10
cont   [”]1 [wir wollen nicht]2 [konkurrieren]3 [mit unseren]4 [kunden]5 [”]6
       [,]7 [sagt]8 [rivero]9 [.]10
gloss  [”]1 [we don’t want to]2 [compete]3 [with our]4 [clients]5 [”]6 [,]7
       [says]8 [rivero]9 [.]10
ref    ” wir möchten für unsere kunden keine konkurrenz darstellen ”,
       erklärt rivero .

Table 22: The translation probabilities of this example were conditioned on one
pos on the left side (full system).
completely grammatical except for word order: It would sound more natural if
phrase 3 would change to the position after phrase 5. Additionally, conditioning
on one pos tag on the left side helps to find a better translation for said: the
context comma excludes the phrase-translation pair says → sagt, dass, (says
that), which is quite straightforward when considering that said that can be
expected to be preceded by either a noun or personal pronoun in most cases.
We also can see an improvement on a morphological level: As with our was
translated as one unit in the modified system, the German form unsere (our)
is dative, as required by the preposition mit (with). The segmentation of the
baseline source phrase divides with and our; thus, the link to the preposition and
its case requirement is lost. The form of unsere in the baseline is nominative or
accusative, either of which generally occurs more often than dative. The fact
that nominative and accusative often have the same forms additionally enhances
their translation probabilities.11 As it is difficult to evaluate morphological
aspects on non-well-formed sentences, this will not be discussed any further.
Altogether, the translation choices of the modified system are more or less
literal; but despite being relatively well-formed, there is not a great similarity to
the reference translation, resulting in a standard bleu score of zero, the same
score as the baseline. When computing bleu on pos tags, the two translations
receive different scores: while the score of the baseline still is zero, the improved
translation is rewarded with a pos-bleu score of 39.85: two matching 4-grams
are provided by the now correct sequence [“ , vvfin ne .], which guarantees to
obtain a non-zero score.
11 Conditioning on a preposition on the left side of a phrase would not help, as no
distinction is made between prepositions requiring dative or accusative.
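The pos-bleu computation amounts to standard BLEU applied to tag sequences instead of surface words. The following is a minimal sketch (single reference, floor-smoothed n-gram precisions), not the evaluation script used in this work; all names are invented:

```python
import math
from collections import Counter

# Minimal sketch of bleu computed over pos-tag sequences rather than
# surface words (single reference, floor-smoothed n-gram precisions).
def ngrams(seq, n):
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def pos_bleu(hyp_tags, ref_tags, max_n=4):
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts = Counter(ngrams(hyp_tags, n))
        ref_counts = Counter(ngrams(ref_tags, n))
        clipped = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        precisions.append(max(clipped, 1e-9) / total)  # floor zero counts
    brevity = min(1.0, math.exp(1 - len(ref_tags) / len(hyp_tags)))
    return 100 * brevity * math.exp(
        sum(math.log(p) for p in precisions) / max_n)
```

A hypothesis whose tag sequence matches the reference exactly scores 100, while a hypothesis sharing only unigrams with the reference scores close to zero, mirroring the role of the matching 4-grams mentioned above.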
input: [nonetheless ,]1 [scientists]2 [suspect]3 [that both]4 [planets]5 [ developed under ]6 [very similar]7 [conditions .]8
bl:    [dennoch ,]1 [wissenschaftler]2 [vermuten]3 [, dass beide]4 [planeten]5 [ unter ]6 [ähnlichen]7 [bedingungen .]8
gloss: [nonetheless ,]1 [scientists]2 [suspect]3 [that both]4 [planets]5 [ under ]6 [similar]7 [conditions .]8

input: [nonetheless ,]1 [scientists]2 [suspect]3 [that]4 [both]5 [planets]6 [ developed ]7 [under]8 [very]9 [similar]10 [conditions]11 [.]12
cont:  [dennoch]1 [wissenschaftler]2 [vermuten]3 [, dass]4 [beide]5 [planeten]6 [ entwickelten ]7 [unter]8 [sehr]9 [ähnliche]10 [bedingungen]11 [.]12
gloss: [nonetheless ,]1 [scientists]2 [suspect]3 [that]4 [both]5 [planets]6 [ developed ]7 [under]8 [very]9 [similar]10 [conditions]11 [.]12

ref:   [...] gehen wissenschaftler davon aus , dass beide planeten unter ganz
ähnlichen voraussetzungen entstanden sind .

Table 23: The translation probabilities of this example were conditioned on one
pos on the left side (full system).
The previous example showed that context features can filter the set of
translation candidates to a smaller, better suited set of grammatically more
appropriate translations. However, a much more serious problem is posed by verbs
on the source side that are not translated at all: while this is mostly an aspect
of adequacy, missing verbs also have an impact on fluency.
The next example shows a sentence where the baseline system failed to
translate a verb occurring in a subordinate clause. In German subordinate
clauses, verbs normally stand at the very end of the clause, as opposed
to the English structure, where they appear more or less at the beginning, in
close proximity to the subject. For statistical machine translation systems, it is
difficult to bridge this distance and put the verb in the right place, or even to
find a translation that expresses the content of the source word at all.
A combination of different translation probabilities and segmentation differences leads to the literal translation developed → entwickelten in the example
in table 23, instead of the invalid translation in the baseline. In the translated
sentence, entwickelten is placed at the position that corresponds to that of the
verb on the English side. As can be seen in the reference translation, the position
of the corresponding verbs entstanden sind (have emerged) is at the very end
of the sentence. Although the phrase entwickelten is not placed correctly and
does not correspond to the reference, it is a valid translation of developed: this
outcome is a large improvement in comparison to the baseline.
It goes without saying that conditioning the translation probabilities on
various context features does not allow us to translate every verb that was
omitted in the baseline translation. As translating verbs remains challenging,
chapter 8 specifically focuses on this issue by additionally using pos tags on
the target side in an attempt to force the system to only pick content words as
translations for source phrases containing verbs.
As in the example in table 22, the modified system in table 23 chose a very
literal translation. In the case of table 23, this variant turns out to be more
accurate than the baseline, as not only the verb is translated, but also the word
very is not lost. Clinging to literal translations is generally not a good idea
(especially for idiomatic expressions or structures that cannot be translated
isomorphically), but choosing a literal translation is better than ‘no’ translation,
i.e., in this case, the translation of a full verb with a non-content word.
While we must be careful to not over-generalize the presented examples,
it seems that contextually conditioned systems prefer – at least to a certain
degree – literal translations in cases where the source phrase is not ambiguous.
Unfortunately, there is no straightforward way to prove or disprove this hypothesis; we
will discuss this assumption in section 7.3.
6.6 Summary
In this chapter, we examined the effects of including local context templates in
statistical translation systems, comparing different system settings (full systems
vs. reduced systems). While most systems failed to significantly outperform the
baseline system when using a full-feature setting, systems with reduced settings,
especially when only using source phrases of length one, achieved results that
were significantly better than the baseline. The best context in our experiments
turned out to be a combination of one pos tag on the left side and the leftmost
pos tag of the phrase: The bleu score for the full system was best when the two
contexts were used as separate features, but the results of the different system
settings were more stable when the contexts were combined into a single feature.
In order to measure the syntactic quality of machine translation output, the
pos-bleu score was introduced. In an attempt to better capture the changes on
the lexical level, we also used bleu scores on lemmas and computed statistics
for translated content words. Similarly to the standard bleu scores, the new
metrics showed significant improvements for the reduced settings, and also for
some of the full systems. In addition to the automatic evaluation metrics, we
carried out a simple manual evaluation.
An interesting observation is that most systems translate more content
words than the baseline system; this was shown not only by the content-word
statistics, but also by the manual evaluation.
We assume that contextual conditioning enhances the translation probabilities
of literal translations and consequently, content words have a better chance to
be picked as translations.
When looking at example translations, we found that the choice of phrase-translation pairs is not determined solely by a high translation probability, but also by indirect factors like segmentation, which in turn depend on
the translation probabilities, both during the process of parameter tuning and
during the actual translation process. The correlation between segmentation
and the modified translation probabilities favoring literal translations will be
discussed in section 7.4, where we will also address the cause for the difference
of left-side and right-side contexts observed at the beginning of the evaluation.
So far, only context templates have been used: No distinction was made
between the different realizations of e.g. the context 1-pos-left, though it is
likely that some pos tags work better than others. In the next chapter, we will
look at a reduction of the set of possible contexts and also address the issue of
the relevance of specific realizations of contexts.
When discussing the results of the different systems, the differences between
bleu and nist were mentioned only marginally: In several cases (e.g. table 13),
the nist score of the full setting is significantly better than the baseline, while
the bleu score is not significantly better. By giving a higher weight to more
informative (i.e. less frequent) n-grams, nist is more sensitive to lexical choice
while bleu considers all n-grams to be equally important and additionally
requires matches of larger n-grams, thus giving more credit to word order.
Generally, bleu is a better measure for translation quality; however, it has been
shown that nist is able to detect small differences due to its sensitivity in lexical
aspects, whereas bleu often cannot capture small lexical changes (see [Riezler
and Maxwell, 2005]).
Given that there are several cases in which the nist score is significantly
better than the baseline, this is an interesting point, especially when considering
that we also found that contextually conditioned systems tend to translate a
higher percentage of content words.
7 Analysis of general aspects of context features
In this chapter, we want to look more closely at general characteristics of source
side contexts. This includes the question of how fine-grained linguistically
motivated contexts should be, which will be illustrated by using chunks as
context information and by experimenting with differently designed sets of pos
tags for auxiliary verbs. We will then focus on the influence of specific pos
tags on translation quality, by comparing distributions conditioned on either a
specific pos tag or on the complement of this tag in an attempt to comprehend
the average effect of non-trivial contexts. The general effects observed in this
experiment have an interesting influence on the phrasal segmentation of the
input sentence as the changed translation probabilities favor a segmentation into
short phrases, which would not be considered a positive effect. Since segmenting
is a crucial point for translation quality, we attempt to thoroughly explain this
phenomenon and the resulting consequences on the overall performance, and we
also provide a comparison to the system presented in [Carpuat and Wu, 2007a/b].
We conclude this chapter with a few thoughts on feature selection.
7.1 Chunks as source-side context features
In the same way as pos tags are a generalization of words, chunks can be regarded
as a compressed set of tags. Chunks are linguistically motivated phrases like
nominal chunks (nc) or prepositional chunks (pc); in addition to representing
a compact type of context, they might also offer a key to better segmentation.
Intuitively, one could assume that an input sentence that is segmented according
to linguistic criteria should be translated relatively well: Decomposing a sentence
into constituents which are then translated and concatenated on the target side
should be a relatively safe method; after all, a well-formed sentence is nothing
other than a sequence of constituents. However, it has been shown that restricting
the phrase-translation pairs to constituents on both source side and target side
does harm the performance of an smt system. The results of [Koehn et al., 2003]
show that by eliminating non-constituent pairs, too many phrases are excluded.
In our experiments, we regard chunks mainly as a more compact form of
pos tags, but also as a means to improve segmentation. Similarly to the work
reported in the previous chapter, source phrases will be conditioned on left side
context information, using chunks instead of pos tags. Being more general than
pos tags, chunks should better exploit the training data, and might also trigger
a better segmentation.
In order to boost a syntactically motivated segmentation, we will introduce
an additional feature function to represent the status of the focused phrase
itself: if the source phrase is (partially) well-formed, it is assigned the value 1,
otherwise 0. This is not the same form of context integration as described above,
since the resulting values are either 0 or 1, instead of representing a probability
distribution. In contrast to our previous experiments, this feature is not intended
to filter the set of all extracted phrase-translation pairs to a smaller set of phrases
with the same specific context as the focused phrase, but it is designed to reward
phrases which meet the general criterion of being (partially) well-formed.
nc     [genuinely/rb representative/jj democracies/nn]nc
pc     [at/in [some/dt point/nn]nc ]pc
adjc   [as/rb possible/jj]adjc
conjc  [rather/rb]conjc
advc   [once/rb again/rb]advc
vc     [to/to swim/vv]vc
prt    [up/rp]prt
lst    [1/lst]lst
intj   [yes/uh]intj

Table 24: Different types of chunks with examples (token/pos tag). A tag indicating begin and
end of the sentence is also used for conditioning.
7.1.1 Chunked data
Training data was chunked with a variant of tree-tagger [Schmid, 1994]. Its
output format can be seen in figure 5: the structure is generally flat and does
not contain deeply nested chunks, with a few exceptions like e.g. ncs embedded
in pcs. As can be seen, some tokens like while or punctuation are not part of
chunks. As they might also be useful as context, they were replaced by their pos
tag. Table 24 shows the different forms of chunks; even with ca. 10 additional
tags, the set of possible contexts is much smaller than the tag set used in the
previous section.
When conditioning on chunks on the left side of a phrase, we want to benefit
from the reduced number of contexts: In many cases, it does not matter whether
a phrase is adjacent to e.g. a noun or a personal pronoun, and therefore, tags
with similar properties might as well be combined into one single representative
of the group. In the case of a context consisting of two pos tags, essentially equal
contexts might be split up into e.g. art noun, adj noun or conj noun, despite
always representing the equivalent of a noun phrase. Furthermore, singular
nouns often have an article while plural nouns often have no article and thus the
leftmost pos tag in a 2-tag context could as well be something that is usually
not associated with noun chunks. As a consequence, the number of pos tag
based contexts is often unnecessarily inflated.
<s>
<NC> energy/NN/energy generation/NN/generation and/CC/and distribution/NN/distribution </NC>
<VC> must/MD/must continue/VV/continue </VC>
<VC> to/TO/to be/VB/be made/VVN/make </VC>
<ADJC> more/RBR/more efficient/JJ/efficient </ADJC>
,/,/,
while/IN/while
<VC> complying/VVG/comply </VC>
<PC> with/IN/with <NC> environmental/JJ/environmental standards/NN/standard </NC> </PC>
./SENT/.
</s>

Figure 5: Example for chunked text; each token is given as token/pos tag/lemma.
While context features based on chunks are extremely compact, an obvious
problem is overlapping: Since the phrases used for translation are only well-formed in terms of alignment structure, they are not linguistically motivated
and thus, the available phrases in the phrase-table do not always correspond to
the chunks in the input sentence.
7.1.2 Chunk-based contexts
When using chunks for conditioning on the left side, we distinguish between
phrases having an adjacent chunk on the left side without overlapping and
phrases whose left borders do not correspond with chunk borders: phrases with
no overlapping chunk on the left side can simply be conditioned on this context,
while phrases which do not start at a chunk border are conditioned on the chunk
containing the left side of the phrase. In the case of embedded chunks, the
outer chunk is used as context: if the left context consists of a complete nc in a
complete pc, then pc is taken as context.
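This selection of the left-side context can be sketched as follows. Chunks are assumed to be the outer-level, non-nested spans given as (start, end, label) over token indices; all names are invented:

```python
# Sketch of selecting the left-side chunk context of a phrase that starts
# at token index `start`. Chunks are outer-level, non-nested spans given
# as (start, end, label) over token indices.
def left_chunk_context(start, chunks):
    for c_start, c_end, label in chunks:
        if c_start < start < c_end:   # phrase begins inside this chunk
            return label              # -> overlapping context
        if c_end == start:            # chunk ends where the phrase begins
            return label              # -> adjacent (non-overlapping) context
    return "<s>"                      # sentence-initial phrase
```

For non-overlapping chunks, at most one of the two conditions can hold for a given start index, so the iteration order does not matter.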
The phrase p1 in figure 6 is simply conditioned on the adjacent context ncb
sharing the boundary with p1 , while the phrase p2 is conditioned on the chunk
nco that overlaps with p2 . In the case of p2 , the context is not purely adjacent,
but used to model the transition between the context and the phrase itself. In
the experiments discussed in section 6.3, the combination of the 1-pos-left and
the leftmost-pos contexts turned out to work relatively well; the context of
an overlapping phrase is somewhat similar to the 1-pos-left and leftmost-pos
context in that it captures the information of the phrase beginning inside a
given chunk rather than occurring after a given chunk as in p1 . Figure 7 shows
an example for conditioning on overlapping chunks: the phrase vote for is
overlapping with either a vc or a nc on its left side. Without conditioning,
the top ranking translation candidates for vote for mainly consist of verbal
translations. When conditioned on vc, the most probable translation candidates
are more or less the same, albeit with increased translation probabilities, whereas
conditioning on nc favors nominal translations.
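The conditioning itself amounts to relative-frequency estimation over context-annotated extraction events. A minimal sketch with invented toy counts (not drawn from the actual training data):

```python
from collections import Counter

# Sketch of context-conditioned relative-frequency estimation: each
# extraction event is (source phrase, target phrase, left context),
# where the context is e.g. the chunk on or overlapping the left side.
# The counts below are invented toy data.
events = [
    ("vote for", "stimmen für", "vc"),
    ("vote for", "stimmen für", "vc"),
    ("vote for", "stimme für", "nc"),
]

pair_counts = Counter((s, t, c) for s, t, c in events)
src_counts = Counter((s, c) for s, t, c in events)

def phi(src, tgt, ctx):
    """p(tgt | src, ctx) estimated by relative frequency."""
    denom = src_counts[(src, ctx)]
    return pair_counts[(src, tgt, ctx)] / denom if denom else 0.0
```

With these toy counts, the verbal translation gets all probability mass under the vc context, while the nominal translation gets all mass under nc, which is the filtering effect described above.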
If the left boundary of a phrase corresponds to a linguistically well-formed
segmentation, this means that the phrase is, loosely speaking, at least partially
well-formed. As we are interested in a segmentation of the input sentence that
approximates a flat linguistic analysis, conditioning on a non-overlapping context
might enhance translation probabilities of phrases that might trigger a good
phrasal segmentation, while also profiting from source-side context information.
Figure 6: Conditioning on overlapping or non-overlapping chunks: the phrase p1
starts exactly at the right boundary of the chunk ncb (non-overlapping context),
whereas the phrase p2 starts inside the chunk nco (overlapping context).

Figure 7: Example for conditioning on overlapping chunks:
he will [ vote for ]p the report (vote is part of a vc)
i cast my [ vote for ]p the report (vote is part of an nc)

In an attempt to additionally enhance partially well-formed phrases of the type
p1, such phrases are assigned a bonus value, which is realized as an additional
context feature function consisting of the values 0 and 1. As mentioned earlier,
this feature differs from the usual form of contextual conditioning as it does
not represent translation probabilities conditioned on specific context, but the
general property of being (partially) well-formed.
In addition to using left side context, phrase-translation pairs can also be
conditioned on the status of the phrase itself. Although the set of chunks is
more compact than the set of pos tags, conditioning on the chunks a phrase
consist of would be too complex. Instead, we choose a classification into phrase
is a complete chunk or a sequence thereof vs. phrase is not a complete chunk
or sequence thereof. Thus, there are only two contexts, being a constituent-like
phrase or not, on which the phrase can be conditioned. We will refer to this
setting as const. when discussing the results. Again, we introduce an additional
bonus feature in order to give a better weight to linguistically well-formed
translations.
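The binary feature can be sketched as a check whether a source span is exactly covered by a sequence of complete chunks. This is a simplified sketch assuming non-overlapping, non-nested chunk spans; all names are invented:

```python
# Sketch of the binary bonus feature: a source span gets the value 1 if it
# is exactly covered by one complete chunk or a sequence of complete
# chunks, and 0 otherwise. Chunks are non-overlapping, non-nested
# (start, end) token spans.
def is_chunk_sequence(span, chunks):
    start, end = span
    pos = start
    for c_start, c_end in sorted(chunks):
        if c_start == pos and c_end <= end:
            pos = c_end               # consume one complete chunk
            if pos == end:
                return True
    return pos == end

def bonus(span, chunks):
    return 1 if is_chunk_sequence(span, chunks) else 0
```

Under this definition, a span matching a single chunk and a span matching two adjacent chunks both receive the bonus, while a span starting in the middle of a chunk does not.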
While chunks are a promising generalization for pos tags and also might
help to find a better segmentation, they have two major disadvantages: The
problem of overlapping makes it difficult to find a uniform definition for contexts
(adjacent vs. is part of). The second problem is the fact that a phrase can be
considered a complete chunk in one sentence but might only be a partial chunk
in another sentence, as can be seen in the following example:
(1) [with [an internal market]nc ]pc [of [500 mio consumers ]nc ]pc
(2) [ consumers ]nc [are provided]vc [with [consistent quality]nc ]pc
In the first example, the word consumers is only a part of a larger phrase, while
it is a complete chunk in the second sentence. In this case, the splitting into a
complete chunk vs. a partial chunk leads to a loss of data as it introduces two
different distributions of the word consumers. Of course, one could think about
implementing a special treatment for the head words of chunks, trying not to
lose translations of single words, but this would contradict the idea of using
chunks. In contrast, distinguishing between words being a complete phrase or
only a part thereof might prove useful in some situations: For example, a verb
that is part of a verbal chunk such as to find or should find is likely to be
non-finite, whereas it can be expected to be finite when it is a single-word
constituent, since there are no other verbs in its immediate proximity.
In an attempt to avoid splitting probability distributions of essentially the
same phrases, we also experiment with using the original translation probabilities
in combination with assigning a bonus value for syntactically well-formed source
phrases in order to find an optimal segmentation of the source sentence. The
decision whether a phrase is a chunk or not is made specifically for each phrase in
each source sentence. Thus, a good segmentation for each sentence is proposed
to the system by assigning a bonus value to each phrase with a corresponding
chunk.
7.1.3 Results and evaluation
Unfortunately, the chunk-based contexts did not help to improve translation
quality. While they generally did not harm the performance, there was no
improvement either. Similarly to the results of pos-based or word-based
contexts, the reduced settings showed only modest improvement. Since chunks
normally consist of several words, experiments on a length-one system do not
seem very promising, especially when trying to improve phrase segmentation.
Table 25 shows the results for the different variants. Generally, the versions
with the bonus feature seem to work slightly better. As already observed before,
a significant improvement in a reduced setting does not necessarily correspond
to a significant improvement in the full setting. In this case, the scores of the
full setting systems 1-chunk-left+bonus and const+bonus (columns Á and à in
table (b)) are almost exactly the same as the baseline, while the systems using
only the new translation probabilities yielded better scores (columns Á and Ã
in table (a)).
(a) 1 feature

             | bl (ϕo) | À 1-chunk-left | Á 1-chunk-left+bonus | Â const. | Ã const.+bonus | Ä ϕo+bonus
bleu         |  12.45  |     12.68      |        12.96         |  12.45   |     12.84      |   12.54
pos-bleu     |  34.76  |     34.83      |        34.56         |  34.46   |     34.07      |   34.35
lem-bleu     |  15.95  |     16.30      |        16.28         |  15.87   |     16.30      |   15.91
nc-segm.     |  1624   |     1815       |        2168          |  1662    |     2840       |   2926
nc-vc-segm.  |   142   |      145       |         164          |   131    |      116       |    136
vc-nc-segm.  |    74   |       69       |         100          |    79    |       75       |     99

(b) all features

             | bl (ϕo) | À 1-chunk-left | Á 1-chunk-left+bonus | Â const. | Ã const.+bonus | Ä ϕo+bonus
bleu         |  13.84  |     13.79      |        13.85         |  13.78   |     13.83      |   13.94
pos-bleu     |  36.72  |     36.15      |        36.72         |  36.34   |     36.83      |   36.94
lem-bleu     |  17.34  |     17.29      |        17.39         |  17.22   |     17.47      |   17.62
nc-segm.     |  1168   |     1403       |        1703          |  1176    |     1929       |   2406
nc-vc-segm.  |   148   |      138       |         188          |   170    |      215       |    190
vc-nc-segm.  |    78   |       70       |         106          |    73    |       80       |     99

Table 25: Results of experiments with chunk-based context features. Table (a)
displays scores of a system using the modified translation probability as single
feature (plus bonus feature); table (b) shows results of the full setting. The
bonus feature refers to partially well-formed phrases in the 1-chunk-left context
and to constituents otherwise. Bold printed numbers are significantly better
than the baseline (bl), i.e. the original distribution. The lower half of each table
shows the number of translation units corresponding to linguistic segmentation.

Although the criteria for receiving a bonus point are different in the 1-chunk-left+bonus and const+bonus systems (partially well-formed vs. well-formed), the
bleu scores of the full systems are basically the same. Also, the standard bleu
scores of all full setting systems lie within a smaller range than the scores of the
one-feature systems, suggesting that the feature functions that were left out in
the one-feature setting are ‘stabilizing’ the quality of translation regardless of
the context feature.
Of all systems with a full parameter setting, the version using the original
translation probabilities in combination with the bonus feature works best
(column Ä in table (b)). While the standard bleu score is still in the same
range as the baseline, the two alternative bleu scores are better, although they
fail to be statistically significant.
Table 25 also shows an overview of how many translation units actually
correspond to linguistic phrases. For this evaluation, we counted all source
sentence segments that were either a noun chunk (nc-segm.) or a combination of
a verb chunk and a noun chunk (nc-vc-segm., vc-nc-segm.) over the entire test
set. While these three types of chunks are not fully representative of the whole
array of possible chunk combinations, they are some of the most important chunk
combinations. Since we look at both single chunks and combinations thereof,
we try to determine whether both longer and shorter phrases benefit from being
conditioned on chunk-based context features. The numbers for ncs are similar
in the one-feature system and the full setting: with every context, more of the
translated phrases are noun chunks, especially in the systems using the bonus
feature. The same tendency applies for phrases consisting of combined chunks.
However, there does not seem to be a strong positive correlation between the
number of linguistically well-formed translation units and the bleu-score: For
example, all full systems with a bonus feature tend to have noticeably more
well-formed translation units than the baseline, but still have more or less the
same score as the baseline system. Similarly, the one-feature system using
translation probabilities from the original distribution ϕo and bonus values
(column Ä in table (a)) tends to have more well-formed translation units but
its bleu score is barely higher than the baseline score. In contrast, the two
one-feature systems conditioned on either the chunk on the left or on phrases
being constituents or not in combination with the bonus feature (columns Á and
Ã) have significantly improved bleu values, whereas the corresponding systems
without a bonus feature have lower results.
These results suggest that chunk based context features might help to increase
the number of linguistically well-formed translation units, but also that the
modified segmentation does not guarantee better translation results.
In [Gimpel and Smith, 2008], constituent-based context features are mentioned, although no detailed description is provided, either of the
design of these features or of their usefulness. However, the fact that these
contexts were not explicitly listed suggests that they were not found to be very
useful. Similarly, the syntactic features implemented by [Max et al., 2008] did
not meet their expectations.
Given the disappointing result of the chunk based contexts, we did not carry
out similar experiments with chunk based contexts on the right side of a phrase.
7.2 Granularity of context features
In the pos tag based experiments in chapter 6, we used a relatively fine-grained
tag set. Especially the description of verbs is detailed, differentiating between
the two auxiliary verbs to be and to have and between the different verb forms:
finite, infinitive, -ing-form and past participle. While rich context features offer detailed
information and thus the possibility to find appropriate translation candidates for
specific contexts, contexts that are too fine-grained could also split up essentially
equal distributions and thus reduce the benefit of context information.
Since the first attempt to generalize context features by using chunked
data turned out not to work very well, presumably due to problems caused by
overlapping, we experimented with reducing the tag set. It is difficult to decide on
a level of granularity as one could always come up with an example demonstrating
the need for either a more detailed tag set or a more general representation. As
our approach only works with context templates (e.g. 1 pos left), but does not
take into account the informativeness of the specific realizations of this template,
a context is considered good if its different realizations work well on average.
The task of finding the right level of granularity is similar, as we cannot expect
to find a solution that guarantees an optimal context for every phrase, but only
a setting that gets good results on average.
By merging the tags for nouns (nn) and personal pronouns (pp) into a single
tag (np), words with similar properties are combined to represent the equivalent
of a noun-phrase: In many cases, it is not relevant whether a phrase was seen in
adjacency of e.g. the man or he; thus, there is no need to differentiate between
the pos tags of these two types of context. The second object of reduction is
the group of auxiliary verbs, which are simply merged into the single tag aux.
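The reduction can be sketched as a simple tag mapping. The mapping below is illustrative rather than the exact inventory used in the experiments; it assumes treetagger-style English tags where vb* marks forms of to be and vh* forms of to have:

```python
import re

# Sketch of the tag-set reduction: nouns and personal pronouns are merged
# into "np", the auxiliary tags for to be (vb*) and to have (vh*) into
# "aux"; all other tags are kept unchanged.
def reduce_tag(tag):
    t = tag.lower()
    if t in ("nn", "nns", "pp"):
        return "np"
    if re.match(r"v[bh]", t):   # vb, vbz, vbd, ... / vh, vhz, vhd, ...
        return "aux"
    return t
```

Lexical verbs (vv*) are deliberately left untouched, so only the context distinctions between auxiliary forms, and between nouns and personal pronouns, are collapsed.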
In order to capture the effect of merging pos tags with similar properties,
phrase translation probabilities were only conditioned if the context was either
part of the aux or the np category, whereas in all other cases, the original
translation probabilities were kept. By conditioning only on those pos tags
we are interested in (labelled ‘isolated context’ in table 26), their translation
probabilities are the only ones to change in comparison to the baseline and we
hope that the differences caused by the varying granularity of the tags are more
noticeable in this isolated setting.
Table 26 shows the results for a full system (i.e. all features and no restriction
on the phrase length) when the context used for conditioning is one pos tag
on the left side of the phrase.

isolated context
  vb.* vh.* (detailed)   13.44
  aux                    13.89
  va.*                   13.88
  vb vh                  13.83
  nn pp (detailed)       13.39
  np                     13.72
all pos
  aux                    13.82
  np                     13.97

Table 26: Different levels of context granularity; the systems were conditioned on
1 pos on the left. Conditioning on all realizations of the detailed tag set results
in a bleu score of 14.01; the bleu score of the baseline is 13.84 (cf. chapter 6).

When looking at the results for the isolated contexts, it can be seen that
systems conditioned on either the simple tag aux (or variants thereof) or np
achieved better results than those conditioned on the more complex, detailed
tags vb.* vh.* and nn pp. This outcome
suggests that a tag set containing less detailed tags for auxiliary verbs is a
better inventory for contexts; similarly, the combined tag np seems to be more
effective.
The reduction of the detailed auxiliary tags to the category aux means a
considerable loss of information. However, variations with the design of the
aux tag did not change the results in the case of isolated contexts: bleu
scores remained stable if all auxiliary forms were merged (aux, 13.89), or if a
distinction between have and be was made (13.83), or if the different verb forms
were kept without differentiating between have and be (13.88).
However, there is no corresponding improvement when translation probabilities are conditioned on all possible contexts: when conditioning on all pos
tags with the aux tag instead of detailed auxiliary tags, the resulting bleu
score (13.82) is almost identical with the result of the baseline (13.84), i.e. the
improvement achieved with the detailed, complete tag set (14.01) is lost. In
contrast, the substitution of nouns and personal pronouns with the more general
tag np did not lead to an improvement corresponding to the result when using
no other contexts, but is nearly the same as in the setting with the more detailed
tag set (13.97 vs. 14.01). This indicates that the np tag is equivalent to the
noun tag and the personal pronouns it represents, whereas the introduction
of the aux tag seems to harm the performance when conditioning on all pos
tags since its result is worse than using the individual auxiliary tags. However,
this does not correspond to our observations for the isolated context where a
reduction of the complexity of tags led to an improvement.
Given that the scores of 14.01 or 13.97 are not significantly better than the
baseline's 13.84, it is difficult to interpret these results; but it
seems that merging all auxiliary tags into a single one is too harsh, whereas the
introduction of the np tag has no influence.
Considering that the results of the experiments with word-based
context features are in the same range as the results of entirely pos-based
context features, a very rich inventory of possible contexts
is evidently not harmful, but rather offers a wide range of lexical information. Consequently, a detailed set of pos tags should not be harmful either. However,
the main objective of using pos tags is to maximally exploit the available
training material by generalizing linguistic concepts, and an overly detailed tag set
conflicts with this idea.
It is difficult to explain why reducing the context complexity leads to better
results in the isolated setting, whereas there is no corresponding improvement
in the full setting conditioned on every possible context. It might be that the
unconditioned translation probabilities and the probabilities conditioned on
only one context are too different and therefore interfere. Since the distributions
are likely to differ more when conditioned on a detailed context inventory than
on a more general one, this difference is more harmful in the detailed setting,
whereas the probabilities obtained when conditioning on the generalized contexts
aux or np are more similar to the original distribution, so that a combination
might work better.
The assumed side effect of combining probability distributions of different
quality could be explained by the fact that translation probabilities of
context-aware phrase-translation pairs tend to be higher because they are
estimated on a smaller data set. Those phrase-translation pairs then have an
‘unfair advantage’ over the phrase-translation pairs with the original translation
probabilities and might be more likely to be chosen by the system despite
not always being the optimal choice. Boosting the translation probabilities
of certain subgroups of phrases to a greater extent than other subgroups has
an interesting effect on the segmentation of the input sentence, which will be
discussed in section 7.4: in the course of that discussion, we will find that
the translation probabilities of shorter phrases are boosted more than the
probabilities of longer phrases, which causes the system to prefer a segmentation
into comparatively short units.
7.3 Analysis of specific context realizations
In this section, we focus on specific realizations of the context template 1-pos-left:
in the previous experiments, every pos tag was considered equally important.
However, we assume that some pos tags provide more information than
others. For example, the tag to is always followed by the infinitive form of a
verb; this means that this tag is helpful for identifying verbal translations (e.g.
the deal vs. to deal), and it additionally indicates that the appropriate verb form
is likely to be non-finite rather than finite. Similarly, if a word has a determiner (dt)
on its left side, it is very likely that this word is either a noun or an adjective,
whereas other tags such as adverbs or conjunctions can turn up in many different
positions and are thus less informative.
Since it is difficult to measure the usefulness of a specific context realization
directly, we resort to a method similar to that of the previous section by comparing
the output of two systems in which only one pos tag is used for conditioning.
One setting corresponds to the experiments presented in the previous section: for
phrases appearing to the right of context c, the conditioned probability distribution
ϕc is used; otherwise, we switch back to the original distribution ϕo. In the
contrastive setting, phrases appearing to the right of context c are likewise assigned
translation probabilities conditioned on c, but phrases with a context
different from c are conditioned on the complement c̄ of c, i.e. the set of all other
pos tags (distribution ϕc̄).
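The two settings can be sketched with toy counts; all observations and target labels below are invented for illustration, whereas the actual distributions in this work are estimated from the word-aligned training corpus:

```python
from collections import Counter

# invented (context, target) observations for one source phrase
observations = [("dt", "t1"), ("dt", "t1"), ("dt", "t2"),
                ("aux", "t1"), ("in", "t3"), ("in", "t3"), ("in", "t1")]

def distribution(obs):
    """Relative-frequency estimate p(target) over the given observations."""
    counts = Counter(t for _, t in obs)
    total = sum(counts.values())
    return {t: n / total for t, n in counts.items()}

c = "dt"
phi_o = distribution(observations)                               # original
phi_c = distribution([o for o in observations if o[0] == c])     # context = c
phi_cbar = distribution([o for o in observations if o[0] != c])  # complement

def lookup(setting, context):
    """Setting 1 falls back to phi_o outside c; setting 2 uses phi_cbar."""
    if context == c:
        return phi_c
    return phi_o if setting == 1 else phi_cbar
```

With these counts, ϕc assigns t1 a probability of 2/3, above its unconditioned 4/7, while ϕc̄ drops it to 1/2, which is exactly the kind of divergence the two settings are designed to expose.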
In this section, we do not attempt to find a set of especially useful context
realizations of the template 1-pos-left, but rather to find evidence for the
assumption that good contexts enhance literal translations and exclude faulty
phrase-translation pairs. As a consequence, the complementary set of contexts
should contain most of the bad phrase-translation pairs and thus negatively
influence translation performance.
If the context c happens to be a good context, then we would expect that
conditioning on c assigns higher translation probabilities to appropriate
translations, whereas less suitable ones should either be excluded from the list or
receive a comparatively low translation probability. In contrast, we would expect
a bad context to contain a proportionally larger amount of inappropriate
translations and thus to have a flatter probability distribution that assigns less
probability mass to good translation candidates, since a large part of all
possible good translations has already been seen in the good context.
While we cannot prove or disprove this assumption, the classification into
conditioned on c and conditioned on c̄ is an attempt to understand the distribution
of all translation candidates that are excluded by a (supposedly) good
context. Note that in this case we do not intend to disambiguate polysemous
words, but compare the quality of probability distributions for plain, simple
phrases whose translations all share the same basic meaning. Obviously, this
concept is overly simple, as it does not take into account that there could be
more than one good context, and it also ignores the fact that for phrases like
fine, there are at least two distributions that are merged into a single one when
not disambiguated.
To take an example, we saw that the translation candidates of the phrase
divided (cf. section 6.1, page 57) with an auxiliary verb on the left side
consisted only of valid options with higher translation probabilities, and that all
nonsensical translation possibilities such as comma were excluded by this context.
This is especially interesting considering that the available translation
candidates of certain groups of words, e.g. verbs, are often not very good and
contain a considerable amount of bad translation possibilities.
Experiments were carried out with the contexts dt (determiner), aux (auxiliary
verbs) and to, since they are highly frequent and, being adjacent to either nouns
or verbs, which are important for the understanding of a sentence, might turn out
to be good contexts. The results in table 27 show that the version that is
partially conditioned on c̄ with c = dt is considerably worse than the
version switching back to the original distribution ϕo. This suggests that c = dt
is actually a useful context, since it seems to keep a large part of the good and
appropriate phrase-translation pairs, which permits a good or even
better estimation of translation probabilities, while the distributions based on c̄
seem to provide worse translation options.
The difference between the two systems with c = aux is not as large, and the
results with c = to are almost identical. This might be due to the fact that alignment
structures involving verbs are generally problematic and that the distributions of
verbs contain a certain number of faulty or imprecise phrase-translation pairs which
also appear within a good context. Especially the to context does not seem
to work very well, although in theory one might consider it relatively
useful. An explanation might be that the context c = to is likely to introduce
impersonal verbal constructions whose structure often cannot be preserved
during translation and which are thus prone to bad alignment.
Table 28 shows the 10 most probable translation candidates for the phrase
discovered, conditioned on the context aux or its complement. The distributions
                                          c = dt   c = aux   c = to     bl
context = c: ϕc and context ≠ c: ϕo       13.98    13.89     13.75     13.84
context = c: ϕc and context ≠ c: ϕc̄       13.42    13.70     13.69

Table 27: Results of full systems conditioned on context c or its complement c̄.
ϕo                     ϕc                       ϕc̄                     ϕpp
entdeckt       0.183   entdeckt        0.280    entdeckt       0.095   festgestellt   0.092
festgestellt   0.078   festgestellt    0.075    festgestellt   0.080   feststellten   0.061
entdeckten     0.036   aufgedeckt      0.059    entdeckten     0.068   feststellen    0.061
aufgedeckt     0.034   gefunden        0.042    feststellen    0.042   entdeckt       0.051
feststellen    0.028   erkannt         0.034    entdeckte      0.042   haben*         0.041
gefunden       0.026   herausgefunden  0.021    erfahren       0.027   entdeckten     0.041
erfahren       0.022   erfahren        0.017    haben*         0.023   erfahren       0.031
entdeckte      0.022   entdeckung      0.017    feststellten   0.023   entdeckte      0.031
erfahren       0.020   feststellen     0.013    fest*          0.019   habe*          0.020
erkannt        0.020   entdecken       0.013    ,*             0.015   fest*          0.020

Table 28: Translation candidates for the phrase discovered, sorted by
their translation probability. The context (1-pos-left) in columns ϕc and ϕc̄ is
c = aux and its complement; in the last column, translations were conditioned
on personal pronouns (pp). Invalid translations are marked with an asterisk.
ϕo and ϕc are relatively similar with the exception that translation probabilities
in ϕc are higher, especially the translation probability of the top entry: This
is consistent with the observation that literal translations are enhanced. All
translation candidates listed for ϕo and ϕc are valid: most of them are variations
of entdecken (to discover), feststellen (to detect), aufdecken (to unveil) or finden
(to find).
However, in ϕc̄, the translation probability of the top entry is considerably
lower, while the probabilities of the following entries are more or less the same
in all settings. This means that the top translation of ϕc̄ is considerably less
probable relative to the second entry. Furthermore, there are three
invalid translation candidates among the top 10: the auxiliary verb haben (to
have), the particle fest of the particle verb feststellen (to detect), and the comma.
While these invalid options have comparatively small probabilities and might
therefore only have a small chance of being chosen, at least haben and the comma
are highly frequent and might therefore obtain high scores from the language model.
These observations suggest that phrases like have discovered or was discovered
are more likely to receive valid translations than e.g. the phrase he discovered.
Table 28 also shows the probability distribution of discovered conditioned on a
personal pronoun on the left side, which is similar to ϕc̄. The difference between
ϕc̄ and ϕpp might be partially due to the fact that discovered preceded by an
auxiliary is more frequent than discovered preceded by a personal pronoun (239
vs. 98 occurrences), enabling a better estimation of translation probabilities
within the context aux.
But the primary reason might lie in the actual translations that are typically triggered by either auxiliaries or personal pronouns. Assuming that the
syntactic structures of pp/aux + discovered and the respective translations
in the training data are mostly identical, we will find that the combination
aux + discovered allows for less variation than pp + discovered. Preceded by an
auxiliary, discovered is most likely to be translated by the participle form of a
verb with corresponding meaning; German past participles are not subject to
morphological variation.¹² However, if discovered is to be translated as a finite
verb after a personal pronoun, the candidate translations vary in number and
person. Additionally, particles and verbs might be separated (depending on the
constituent order of the sentence), probably causing incorrect alignments of the
English verb with a German particle. In contrast, particles are always part of a
past participle and thus, this form of error is diminished in distribution ϕc .
While the top translation candidates observed in the context pp in table 28
are by no means incorrect, the probability distribution produced by this context
is flatter due to the morphological variation and, in contrast to ϕc, contains
three invalid entries: the particle fest and forms of the auxiliary haben (to have).
When conditioning on aux, we would expect that a German auxiliary verb (if
present) is aligned with the English one, thus leaving the possibility to correctly
align the main verbs. However, if, as in the case of ϕpp, there is no auxiliary on the
English side, German auxiliaries might end up aligned with the English
main verb, especially if the German auxiliary is roughly in the same position as
the English verb and the German main verb is positioned far from its auxiliary
verb.
These assumptions are confirmed when looking at the entries in table 28: the
seven top-ranking translation candidates in column ϕc are participles of different
verbs with the meaning of discover, followed by the noun entdeckung (discovery)
and two infinitives. In contrast, the top entries of ϕc̄ and ϕpp contain different
forms of the same verbs, mixed with invalid translations. Particularly when
conditioning on personal pronouns, the first three entries are forms of the same
verb feststellen (to detect), followed by different forms of the verb entdecken (to
discover).
Summarizing, we found evidence that the context aux used for conditioning
an English participle like discovered is likely to produce a good translation
probability distribution, i.e. a non-flat distribution without invalid translation
candidates in the top-ranking positions. This was indicated by the shape of ϕc̄,
and illustrated with the specific example of ϕpp .
Use of context information for alignment Information about which pos
tags are useful contexts is obviously interesting when using contextual
information to refine translation probabilities. But knowing which contexts
are likely to produce reliable alignment structures could also help to improve
alignment quality: alignment tools are trained with the em algorithm, which
starts with a set of uniformly initialized parameters, produces alignments for
the training data, and re-estimates all parameters based on the preceding
alignment run until convergence. The re-estimation step could benefit from
information about the reliability of alignments in certain structures by giving a
higher weight to ‘safe’ alignments.
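The weighted re-estimation idea can be illustrated with a toy version of IBM Model 1 em training; the two-sentence corpus, the reliability heuristic and the boost factor are invented for this sketch, and real alignment toolkits are considerably more involved:

```python
from collections import defaultdict

# two toy sentence pairs (English, German)
corpus = [(["the", "house"], ["das", "haus"]),
          (["the", "book"], ["das", "buch"])]

t = defaultdict(lambda: 0.25)  # t(f|e), uniformly initialized

def reliability(e, f):
    # hypothetical heuristic: give a 'safe' link extra weight
    return 1.5 if (e, f) == ("the", "das") else 1.0

for _ in range(15):
    count, total = defaultdict(float), defaultdict(float)
    for es, fs in corpus:
        for f in fs:
            z = sum(t[(e, f)] for e in es)             # E-step: normalization
            for e in es:
                c = reliability(e, f) * t[(e, f)] / z  # weighted expected count
                count[(e, f)] += c
                total[e] += c
    for pair, c in count.items():                      # M-step: re-normalize
        t[pair] = c / total[pair[0]]
```

After a few iterations, t(das|the) dominates the competing hypotheses, since the pair co-occurs in both sentences and its links receive extra weight during re-estimation.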
¹² When used as adjectives, past participles have to agree in gender, number and case with
the rest of the phrase. However, this possibility is excluded when preceded by an auxiliary verb.
segm.        bl      1 word   1 pos    1 word   1 pos    1 p.l.   1 p.r.
length               left     left     right    right    interp.  interp.
1            5,767   13,780   12,944   7,143    7,191    6,133    8,648
2            4,841   3,769    3,195    4,272    4,414    4,687    4,462
3            2,095   895      1,256    1,952    1,880    2,062    1,623
4            585     193      352      591      595      579      461
5            145     38       86       173      152      154      114
6            30      7        10       31       30       31       23
7            4       –        1        7        4        6        4
average      1.857   1.339    1.401    1.765    1.753    1.832    1.631
bleu         13.84   13.84    14.01    13.98    14.05    13.98    13.96

Table 29: Length of segments and average segment length for different full
systems.
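The averages in table 29 follow directly from the segment counts; for instance, for the baseline and the 1-pos-left system (note that all systems segment the same total number of source words):

```python
counts_bl = [5767, 4841, 2095, 585, 145, 30, 4]   # baseline, segment lengths 1..7
counts_pl = [12944, 3195, 1256, 352, 86, 10, 1]   # 1-pos-left

def average_length(counts):
    """Weighted mean segment length: total words / total segments."""
    words = sum(length * n for length, n in enumerate(counts, start=1))
    return words / sum(counts)

# both segmentations cover the same source text
assert sum(l * n for l, n in enumerate(counts_bl, 1)) == \
       sum(l * n for l, n in enumerate(counts_pl, 1))

print(round(average_length(counts_bl), 3))  # 1.857
print(round(average_length(counts_pl), 3))  # 1.401
```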
7.4 Phrasal segmentation
Segmenting the input sentence into translation units is a crucial step in smt.
Generally, larger phrases are better, since they capture longer expressions
providing more context. Unfortunately, our method of contextual conditioning
favors short phrases. As can be seen in table 29, there are huge differences
between the baseline system and the context-aware systems in the number of
single-word phrases and, consequently, in the average segment length. The table
shows the length of the translation units for the baseline (bl) and for systems
conditioned on a pos tag or word on either side of the phrase. For the computation
of the translation probabilities of these systems, the established method with the
threshold of n = 10 was used. The two columns on the right list segmentation
statistics of systems using the same context (one pos tag on the left or right side),
but with interpolated translation probabilities.
In addition to the preference for shorter translation units, there is also a
discrepancy between left-side and right-side contexts: while conditioning
on pos tags or words on the right side already leads to a noticeable increase in
single-word phrases, the systems conditioned on pos tags or words on the left
side use more than twice as many single-word phrases as the baseline
system. This leaves us with two questions: why does contextual
conditioning lead to an increased amount of single-word phrases, and why is
there such a difference between left-side and right-side context?
Since the parameters governing segmentation are determined during parameter
tuning, which is highly complex and influenced by a multitude of factors, it
is not easy to find an explanation. However, in the previous evaluations it
became apparent that, at least in the case of non-ambiguous source phrases, the
translation probabilities of the top translations are generally greatly enhanced.
Translation probabilities of short phrases, especially single-word phrases, seem to
be increased to a greater extent than those of longer phrases. Given that we
introduced a threshold to prevent low-frequency phrases from being over-estimated,
longer phrases are often excluded from conditioning for not meeting the threshold
criterion, whereas short phrases generally occur a sufficient number of times and,
as a result, are assigned higher translation probabilities.
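The threshold mechanism can be sketched as follows; the threshold n = 10 matches the one used in the experiments, while the counts and context labels are invented:

```python
from collections import Counter

THRESHOLD = 10  # minimum phrase-context co-occurrence count

def translation_probs(pair_counts, context, threshold=THRESHOLD):
    """pair_counts: (context, target) -> count for one source phrase.
    Returns p(target | source, context) if the phrase occurred at least
    `threshold` times in that context, else the unconditioned p(target | source)."""
    in_ctx = {t: n for (ctx, t), n in pair_counts.items() if ctx == context}
    if sum(in_ctx.values()) >= threshold:
        total = sum(in_ctx.values())
        return {t: n / total for t, n in in_ctx.items()}
    agg = Counter()                      # fallback: aggregate over all contexts
    for (_, t), n in pair_counts.items():
        agg[t] += n
    total = sum(agg.values())
    return {t: n / total for t, n in agg.items()}

counts = {("dt", "die"): 8, ("dt", "der"): 4, ("in", "die"): 2}
print(translation_probs(counts, "dt"))   # conditioned: die 2/3, der 1/3
print(translation_probs(counts, "in"))   # below threshold: original distribution
```

Since long phrases rarely pass the threshold, their probabilities stay at ϕo, which is exactly the asymmetry described above.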
Considering that long translation units often already have relatively high
translation probabilities, it seems somewhat counterintuitive to complain that
their translation probabilities are not increased as well. However, rarely seen
phrases often have lower scores in the reverse translation probability or in the
lexical weights, since these features were specifically added to counteract
over-estimation. While all scores of low-frequency phrases remain the same, the
translation probabilities of all other phrases, presumably rather short ones, are
boosted and, as a result, have an advantage over phrases with the
original translation probability. During parameter tuning, the (shorter) phrases
with the new translation probabilities seem more promising, causing the system
to arrange the parameters so as to prefer short phrases over longer ones.
This assumption is strengthened by the observation that, for the 1-pos-left
context, using interpolation with a count-based interpolation weight instead
of a fixed threshold leads to a more moderate amount of single-word phrases
and an average phrase length that is comparatively close to the baseline. Now
that all translation probabilities are modified, the distributions for shorter and
longer phrases are more alike, which means that shorter phrases are no longer
advantaged. Similarly, the 1-word-left context with interpolated translation
probabilities (not shown in table 29) has a higher average phrase
length. In both cases, the bleu results of the respective systems are almost
identical regardless of whether interpolation or a fixed threshold is used.
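A count-based interpolation of this kind can be sketched as follows; the functional form λ = n/(n + k) is an assumption for illustration, not the exact factor defined in section 5.3.1:

```python
def interpolate(phi_c, phi_o, n, k=10.0):
    """Interpolate the conditioned distribution phi_c with the original phi_o.
    The weight lam grows with the phrase-context count n, so well-attested
    contexts dominate while rare ones fall back towards phi_o."""
    lam = n / (n + k)   # assumed count-based weight, not the thesis' formula
    targets = set(phi_c) | set(phi_o)
    return {t: lam * phi_c.get(t, 0.0) + (1.0 - lam) * phi_o.get(t, 0.0)
            for t in targets}

phi_c = {"die": 1.0}
phi_o = {"die": 0.5, "der": 0.5}
print(interpolate(phi_c, phi_o, n=10))  # die: 0.75, der: 0.25
```

Because every phrase is modified by the same smooth rule, no subgroup receives the ‘unfair advantage’ created by the hard threshold.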
In the case of right-side contexts, the results are not as clear: with the
1-word-right context (not listed in the table), the segmentation length of 1.853
is closest to the baseline. However, at only 13.65 bleu points, translation quality
suffered. With the 1-pos-right context, translation quality remained stable, but
the number of short phrases increased, although not as much as with left-side
contexts without interpolation.
The great difference between left-side context and right-side context is very
interesting and could partially be explained by differences in quality of context
features on the left or the right side of a phrase. When looking back at the
results of the different evaluation metrics (bleu, pos-bleu and lem-bleu)
discussed in chapter 6, we find that for the length-one systems, left-side context
has higher scores than right-side context. Since this setting is independent
of segmentation issues, we consider this as evidence that left-side context is
more useful for a better estimation of phrase-translation probabilities: ironically,
this would mean that by producing more ‘peaked’ distributions favoring short
phrases, left-side contexts lead to an unfortunate phrasal segmentation.
But this observation also suggests that right-side contexts might have a
positive influence on segmentation as full systems conditioned on right-side
contexts are at least as good as systems conditioned on left side context. The
key might lie in the searching procedure when translating the sentence: As
sentences are translated from left to right, there is basically only information
about the left side (i.e. the already translated target words), which is used by
the target side language model. When translating a phrase, the costs of the
n best translation candidates of this phrase are calculated as well as ‘future
costs’ of subsequent translation candidates. Phrase translation probabilities
conditioned on the right side of a phrase might act as a look-ahead and thus
the
  ϕo: die (the) 0.337           der (the) 0.222            den (the) 0.113
  ϕc: die (the) 0.467           der (the) 0.216            das (the) 0.093

central
  ϕo: mittel- (middle-) 0.216   zentrale (central) 0.209   zentralen (central) 0.158
  ϕc: zentrale (central) 0.348  zentralen (central) 0.248  mittel- (middle-) 0.129

the central
  ϕo: die zentrale (the central) 0.204   das zentrale (the central) 0.077   die zentralen (the central) 0.061
  ϕc: die zentrale (the central) 0.304   das zentrale (the central) 0.143   die mittel- (the middle-) 0.071

the central banks (ϕo = ϕc)
  die zentralbanken (the central banks) 0.343
  den zentralbanken (the central banks) 0.239
  der zentralbanken (the central banks) 0.134

Table 30: Comparison of the three top translation probabilities in the original
distribution and conditioned on 1 pos tag on the left side.
provide information that is not available in a standard system. While the concept
of future costs is designed to find a translation candidate causing reasonable
costs for both the current and the following phrase, translation probabilities
conditioned on right-side contexts offer the possibility to indirectly look one
position ahead and thereby strengthen especially suitable translations, as
inappropriate candidates were not seen within the context and can therefore be
filtered out.
Although the average segmentation length is considerably smaller in the
contextually conditioned systems, their overall performance is not as bad as one
might assume, but actually tends to be better than the baseline: the single word
phrases of the modified systems are not single word phrases in the same sense
as in the baseline system, since they contain context information. The fact that
the context-aware systems are relatively successful (at any rate not worse than
the baseline) despite preferring small translation units could be regarded as a
confirmation that context features are useful.
7.4.1 Example
So far, most examples involved ambiguous words or otherwise interesting
translation variants. Here, aiming for a representative example, we chose
to look at different segmentations of a simple nominal phrase whose translations
are all more or less the same, differing only in terms of morphological
realization or word choice. Table 30 shows the three top translation probabilities for
different segmentations taken from a sentence beginning with the central banks
worry ... . While the translation probabilities of the conditioned distribution ϕc
generally increase, the top-ranking translation probability of the central increases
to a lesser extent than those of the two single-word phrases. Furthermore, the
phrase the central banks does not meet the threshold criterion and thus has the
same translation probabilities in both ϕc and ϕo.
The assumption that the probabilities of longer phrases are always less
enhanced by conditioning is overly simple: to a large degree, this depends on
the frequency of phrases. Additionally, for local collocations or fixed word
groups, we would expect that contextually conditioned probabilities benefit
from conditioning to the same extent as short phrases. For example, the
top translation probability of the phrase central banks, which corresponds to
the German compound noun zentralbanken, rises from ca. 0.7 to 0.85 when
contextually conditioned. This suggests that in addition to enhancing the top
translations of single word phrases, the translation candidates of collocations
or similar structures are also boosted when conditioned. Since the words of
collocational structures always occur together, they can, to a certain degree, be
considered as one complex unit.
However, in the case of trivial structures, translation probabilities seem to
be increased to a lesser extent or are not increased at all. In the example, we
found that the translation probabilities of the and central banks gained more
than the trivial phrase the central, whereas the longer phrase the central banks
did not meet the threshold despite containing the non-trivial structure central
banks, and therefore kept its original translation probabilities, making it less
likely to be chosen. These reflections actually correspond to the segmentation
of [the central banks] in the baseline system and [the][central banks] in the
context-aware system (1-pos-left).
7.4.2 Comparison with other systems
In contrast to our results, [Carpuat and Wu, 2007b] and [Carpuat and Wu,
2007a] report a segmentation into longer translation units with word sense
disambiguation techniques. They explain that the context-aware probabilities
are better estimates and that the higher scores enable the decoder to choose
longer phrases.
When comparing the work presented by [Carpuat and Wu, 2007a/b] with
the system discussed in this work, there are two major differences,
the most important of which is the method of integrating contextual features in
order to refine probability distributions. While we opted for a simple maximum-likelihood
model considering only one context template at a time, their approach
is much more sophisticated, using a maximum-entropy setting in which all
contexts are available, allowing the model to choose features depending on their
respective usefulness for a given phrase.
By requiring a minimum number of phrase-context co-occurrences, we exclude
low-frequency phrases from being contextually conditioned and thus create a
disadvantage for a subgroup of phrases. Since the approach presented by
[Carpuat and Wu, 2007a/b] does not rely on relative frequencies but instead uses
classification methods, phrases of all lengths can be treated identically. Assuming
that their classifier works equally well for short and for long phrases, all phrases
can profit from the context-aware probabilities and thus offer equally high scores
to the decoder. This assumption is somewhat inaccurate, as longer phrases
usually provide less training data, which should negatively affect the performance
of a classifier. However, this yields a gradual transition of classification quality
for low-frequency phrases instead of a fixed division into conditioned vs.
non-conditioned as in our approach.
The other difference is that the language pairs used have different
properties: English and Chinese have almost no morphology, and thus the data
[Carpuat and Wu, 2007a/b] work with is relatively compact and less prone to
sparse-data problems. In contrast, German has a relatively rich morphology,
which often leads to flat distributions. For example, in the case of adjectives, it
is not unusual that the five top entries for an adjective are essentially the same
word, but with different morphological features. With the translations of one
source phrase split across several morphological realizations, it is more difficult to
estimate translation probabilities, especially when trying not to over-estimate
infrequent phrases.
For the translation direction German→English, it would be possible to
pre-process the German data by eliminating those morphological features that have
no equivalent on the target side, such as gender and case marking. In the reverse
direction, German morphological features cannot simply be eliminated, since the
target sentence has to consist of full forms. When translating word stems to
avoid data sparseness, morphological features need to be reconstructed on the
target side. [Minkov et al., 2007] propose a method to generate morphology by
modelling morphological agreement on the target side to predict fully inflected
forms. Essentially, simplified data is used for translation, and a model trained on
morpho-syntactic features from the source and target language is used to predict
the morphological descriptions of the words in the translated target sentence. They
report promising results for experiments on reference translations with the
language pairs English→Russian and English→Arabic, whose target languages
are both highly inflected. In [Toutanova et al., 2008], the method is integrated
into an smt system.
It would be interesting to combine this method with the idea of contextual
conditioning. The translation direction English→German should generally benefit
from the reduced data, and context features could generalize better and thus be
more powerful. One could even consider ‘passing along’ context-dependent features
as information for the model predicting the fully inflected forms.
Since the idea presented by [Minkov et al., 2007] basically consists of the
pre-processing step of simplifying the data and the prediction of inflected forms
on the translated sentences, no alteration of the actual translation system is
required. Thus, it could be combined with contextual conditioning without
major modifications.
7.5 Feature selection
In our approach, contexts such as one pos tag on the left side of a phrase are
treated as context templates, ignoring the possibility that the different
realizations, i.e. the actual pos tags within such a template, may be of different
importance depending on the respective phrase. This means that a context
template is considered good if the probability distributions it produces
turn out to be useful on average. However, this concept is not optimal, as it
ignores the fact that some context realizations are less useful than others. In
section 7.3, for example, we found that conditioning participles on auxiliary verbs
is likely to improve the translation probability distribution, whereas conditioning
on personal pronouns might even lead to a deterioration compared to the
baseline.
By applying criteria such as the count-based factor defined in section 5.3.1 or
the factor based on type-token relations discussed in section 5.3.4, we tried to take
this into account by putting less trust in presumably trivial and useless contexts.
Similarly, the introduction of a simple threshold, which allows conditioning only
for phrase-context combinations with a minimum number of occurrences, is an
attempt to rate the individual relevance of phrase-context combinations.
An essential drawback is that our approach does not allow switching between
different contexts depending on their respective relevance, but only uses one
fixed context for conditioning. In the experiments presented in section 6.3,
we integrated two context templates as separate features into the log-linear
model. This setting allowed us to give an individual feature weight to each
of the context templates, so that the respective context templates can
be used according to their relevance. Still, the feature weights obtained by
parameter tuning only indicate the relevance of a context template, i.e. its
averaged relevance, instead of rating individual phrase-context combinations.
A maximum-entropy-based approach, as presented in chapter 2, would allow
the system to select context features according to their relevance in the given
situation.
Since we mainly focus on the analysis of (individual) context templates
and designed our methods accordingly, a more complex classification-based
approach would go beyond the scope of this work. However, we conducted
preliminary experiments in which we attempted to rate the individual relevance
of phrase-context combinations in order to determine which of the examined
contexts is most promising for conditioning. The relevance of a context for a
given phrase was estimated by the factor λ based on type-token relations that
was defined in section 5.3.4. For our experiments, we used a system conditioned
on the combination of one pos tag on the left and on the right side, as well as a
system using either one or two pos tags on the left side. While the first setting
is designed to combine different context features (i.e. left vs. right), the second
setting could be regarded as a form of backing-off between one or two pos tags
on the left side, depending on which feature is more promising. First, we used
the factor λ to decide which of the context should be used for conditioning;
i.e. each phrase was only conditioned on the better context. Alternatively, the
translation probabilities produced by the two different contexts were weighted
according to λ and merged into a single score.
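In code, the two preliminary variants could look as follows; p_left and p_right stand for the distributions conditioned on the two competing contexts, and the lambda values for their estimated relevance. This is a sketch under our own naming, not the actual implementation:

```python
def merge_by_lambda(p_left, p_right, lam_left, lam_right, select_only=False):
    """Two preliminary variants: with select_only=True the phrase is
    conditioned only on the context with the higher relevance factor
    lambda; otherwise the two conditioned distributions are weighted by
    their normalised lambdas and merged into a single score."""
    targets = set(p_left) | set(p_right)
    if select_only:
        # condition only on the 'better' context
        best = p_left if lam_left >= lam_right else p_right
        return {t: best.get(t, 0.0) for t in targets}
    # weight the two distributions according to lambda and merge
    z = lam_left + lam_right
    wl, wr = lam_left / z, lam_right / z
    return {t: wl * p_left.get(t, 0.0) + wr * p_right.get(t, 0.0)
            for t in targets}
```
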
However, the results of all variants were disappointing as the bleu scores
were hardly better than the baseline. Given that these context features achieved
good results when used separately or, in the case of one pos tag on either
side of the phrase, as combined context templates, this outcome is surprising,
especially when considering that in comparison to the context template settings,
translation probabilities should be either the same or replaced with better ones.
When looking back at the results achieved with the type-token based factor
λ applied to context templates in section 5.5, we find that the best bleu score
was obtained when λ was used as weight for an interpolation of the contextually
conditioned distribution ϕc and the original distribution ϕo . Attempting to
recreate this setting for the 1-pos-left and 1-pos-right system, we used λ to
determine the better context c and then computed interpolated translation
probabilities based on ϕo and ϕc . Again, the result of this experiment did not
meet our expectations. In the discussion in section 5.3.4, it became apparent that
λ fails to produce good estimations for flat probability distributions. Additionally,
its performance was not stable throughout the tested systems (cf. section 5.5),
which suggests that λ is not a suitable criterion for the task of comparing
two distributions. In the case of the 1-pos-left and 1-pos-right context, the
differences between left-side and right-side contexts might also be of importance:
we cannot rule out that the individual properties of the two context types are
‘interfering’ when merged into a single score.
8 Translating verbs: Filtering translation options with target side information
In this chapter, we will specifically focus on the translation of verbs: A motivational example illustrates the general problems of translating German verbs,
which are mostly due to the different structures of German and English sentences.
We then present an idea of how to integrate target side information in order
to trigger valid translations for verbs, followed by an evaluation of our method,
which reveals that conditioning on target side features principally works, but is
not overly successful due to the structural differences of the two languages.
When translating between English and German, it is generally difficult to
find good translations for verbs. This is mostly due to the fact that German
constituent order can vary greatly and does therefore not always correspond very
well to the word order on the English side. The main problem is that in German
sentences, the verb is often at the very end of the sentence while the corresponding
English verb is at the beginning in close proximity to the subject. It is difficult
to align such structures and, as a result, the phrase-translation estimates are often
not very good since a considerable number of correct alignments is lost while
wrong alignments are extracted.
Considering more information than the mere number of occurrences of
phrase-translation pairs could help to improve translation quality. In the
previous sections, we presented experiments and thoughts about the use of
source side context features. While our modifications did not turn out to be
hugely successful, we can report small improvements for some of the settings. The
probability distributions of verbs benefit from being conditioned on context
features to a similar extent as other words do; however, there are still many sentences
where verbs are missing because they were translated with non-content words
like auxiliary verbs, verb particles or punctuation. With respect to the criterion
of adequacy, i.e. the task to reproduce the content of the source sentence in
the target language, it is important that content words like verbs are not ’lost’
during the translation process.
If it is known that a source phrase contains a verb, this knowledge can be
used to try to enforce appropriate translations. As a valid translation of a verb
needs to express the meaning of the source phrase, we demand that translation
candidates contain at least one content word. This means that translation
probability distributions are not only conditioned on the source side (contains a
verb vs. does not contain verb), but target side information is considered as well:
By pos-tagging the German part of the training data, content words (nouns,
verbs, adjectives) can be easily identified.
The extent to which information about phrases on the target side can be
included is very limited; it can only consist of a description of the phrase itself
(e.g. pos tags), but not of adjacent phrases as nothing is known about the
phrase’s final position in the translation. While this is not strictly the same
sort of context as before, it still shares the basic idea of reducing the set of
translations and enhancing translation probabilities for good translations by
means of additional information.
input  [mitt]1 [romney]2 [accused his]3 [opponent]4 [rudy]5 [giuliani]6
       [ of turning ]7 [new york]8 [into a ”]9 [save haven]10
       [for illegal immigrants]11 [” .]12
bl     [mitt]1 [romney]2 [beschuldigte seine]3 [gegner]4 [rudy]5 [giuliani]6
       [ aus ]7 [new york]8 [in einen ”]9 [sicheren hafen]10
       [für illegale einwanderer]11 [” .]13
gloss  [mitt]1 [romney]2 [accused his]3 [opponent]4 [rudy]5 [giuliani]6
       [ from ]7 [new york]8 [into a ”]9 [save haven]10
       [for illegal immigrants]11 [” .]12
ref    er habe new york zu einem ”zufluchtsort für illegale einwanderer”
       gemacht , wirft mitt romney seinem kontrahenten rudy giuliani vor.

Table 31: Baseline translation with omitted verb.
8.1 Basic idea
In order to get a picture of the scale of the problem of the structural differences
of English and German and the resulting poor alignment quality, we discuss
the example in table 31. As can be seen, the verbal construction of turning is
translated as the ‘meaningless word’ aus. The given example was translated with
the baseline system, but the results of the contextually conditioned systems are
comparable and also fail to produce a valid translation for of turning. Admittedly,
it would hardly be possible to translate this phrase into the same structure as
in the reference translation. The part of turning new york into a ”save haven
for illegal immigrants” is marked in the reference translation, with the verbs
in bold face. The main verb gemacht (made) is at the end of the subordinated
clause, while the subject er (he) and an auxiliary verb habe (had) are at the
beginning. Here, the impersonal construction of turning is translated entirely
differently, involving verbs at different positions in the sentence.
As in most cases, there are many good alternative translations; a simple
possibility is given below:
of turning new york into a ”save haven for illegal immigrants”
(1) new york in einen ”sicheren hafen für illegale einwanderer” zu verwandeln
(2) zu verwandeln new york in einen ”sicheren hafen für illegale einwanderer”
By choosing the more or less literal translation of turning→zu verwandeln, we
are avoiding the problem of a ‘discontinuous’ translation with several verbs
like in the reference translation in table 31, and more or less preserving the
structure as zu verwandeln is impersonal as well. Since reordering is expensive,
we would rather expect that zu verwandeln is placed at a similar position as its
equivalent of turning, as illustrated in (2), instead of the correct position in (1).
Yet, even the ungrammatical version given in (2) would be a great improvement
in comparison to the translation in table 31 as it expresses at least the semantic
content of the source phrase.
Looking at the phrase table entries for of turning, we find that the translation
probability for of turning→zu verwandeln is relatively high. But it becomes
ϕo      trg             gloss       ϕc      trg             gloss
0.074   ,               ,           0.143   zu verwandeln   to turn
0.056   zu verwandeln   to turn     0.095   machen          make
0.056   , dass          , that      0.095   die umsetzung   the conversion
0.037   , um            , to        0.048   zu wenden       to turn
0.037   , sich          , itself    0.048   macht           makes
0.037   machen          make        0.048   vorhaben ,      to intend

Table 32: Translation candidates for of turning sorted according to their probability: The left table shows the entries of the baseline system, the right table
contains probability estimations based on content words only.
also evident that there are a lot of faulty phrase-translation pairs resulting
from bad alignment. In the original distribution, the most probable translation
candidate is comma, followed by the valid translation zu verwandeln and some
other non-content words. When filtering out non-content words and estimating
phrase-translation probabilities based only on content words, as shown on the
right side of table 32, zu verwandeln is the top entry, and the other candidates
mostly seem relatively reasonable, too.
One might argue that not all content words are appropriate translations for
a verb: There might still be random alignments or disrupted collocational structures which cannot provide a complete translation, such as decide→treffen
(make) instead of entscheidung treffen (make a decision). However, the highest
ranking candidates in the second table have considerably more potential for
becoming a good translation, with a very good candidate on top: We hope that
by eliminating truly bad candidates, the remaining top-ranking translations
have a better chance of getting chosen by the system even though their targeted
position in the sentence is not ‘natural’.
Another point of criticism could be our assumption that valid translations
of verbs have to contain content words: With a Romance language as source
language, we might face the phenomenon of chassé croisé, in which a verb of
directional movement can be translated as a preposition expressing a direction,
as illustrated by the following example:
il a traversé la place en courant.
he ran through the square.
Since this is a specific example that is of no immediate relevance for our language
pair, we will not go into detail, but regard this example as a reminder that
non-isomorphic translation structures sometimes allow translating verbs into
non-content words: In section 8.2, we will come back to this point.
Realization Since we do not know how strongly the elimination of non-content
words will affect the system, we attempt to keep our modifications as minimal
as possible. Related experiments focusing on translating only linguistically
well-formed phrases found that the elimination of non-well-formed phrases
from the translation table had a negative effect on the system, as non-intuitive
segmentation often works better (cf. [Koehn et al., 2003]).
With this information in mind, modifications will be applied only for verbs;
this is also due to the fact that the translation of e.g. nouns is far more reliable.
Furthermore, there are no conditions imposed on the number of verbs on the
source side or on the number of content words on the target side; nor do we
differentiate between the different content words on the target side, assuming
verbs, nouns and adjectives to be equally important.
In order to identify content words in the target language phrases, the German
part of the parallel corpus was pos tagged and relevant tag information included
in the extracted phrases. As in our previous experiments, the new probability
distributions are dependent on the input sentences: When computing translation
probabilities for a phrase containing a verb, phrase translation pairs without at
least one content word on the target side are ignored. By doing so, we hope to
bring out the literal translation while excluding all candidates that would be
wrong in any situation.
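The filtering and re-estimation step can be sketched as follows; the tag set and all names are illustrative (the actual system works on the extracted phrase pairs of the tagged parallel corpus):

```python
# STTS-style content word tags (nouns, verbs, adjectives); illustrative subset
CONTENT_TAGS = {"NN", "NE", "VVFIN", "VVINF", "VVIZU", "VVPP", "ADJA", "ADJD"}

def filter_for_verb(candidates, tag_seqs):
    """If the input phrase contains a verb, drop all translation candidates
    whose target side contains no content word and renormalise the rest.
    `candidates` maps target phrases to their probability phi_o;
    `tag_seqs` maps target phrases to the pos tag sequences obtained from
    the tagged German training data."""
    kept = {t: p for t, p in candidates.items()
            if any(tag in CONTENT_TAGS for tag in tag_seqs[t])}
    if not kept:
        # never leave the phrase untranslatable: keep the original set
        return dict(candidates)
    z = sum(kept.values())
    return {t: p / z for t, p in kept.items()}
```
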
It is important to point out that the estimation of translation probabilities
is not conditioned on the source language part of extracted training data being
a verb or e.g. a noun, but solely on the phrase in the input sentence. Otherwise,
we would split distributions of a word like e.g. deal in two separate distributions
for the deal (noun) and to deal (verb). Since both have the same basic meaning,
splitting the seen occurrences would mean a loss of data. Furthermore, there is
no need to insist that a verb in the source sentence be translated as a verb in
the target sentence.
8.2 Evaluation
In the following sections, we will discuss and compare several variants of the
general idea and also provide an error analysis.
8.2.1 Results
We carried out several experiments based on the idea of banning non-content
words as translation candidates for verbs. Unfortunately, the results (presented
in table 33) are not overwhelming. In version À, exactly the method described
above was applied to all phrases containing verbs in the input sentence. However,
we opted for a threshold of at least 10 occurrences in version Á. This was
considered a necessary step, as discarding phrase-translation entries is a harsh
intervention: even if only undoubtedly bad entries are removed, there is no
guarantee that the remaining entries are better. By demanding a certain number
of occurrences, there is a better chance that remaining entries are actually
valid translations and not randomly aligned phrase pairs. While introducing
the threshold did not lead to a significant gain in standard bleu, there is a
considerable gain in pos-bleu. However, the results are still lower than the
baseline.
As already indicated, there might be situations in which the optimal translation of a verb does not contain a content word. While words such as like
keep their non-content word translations when not tagged as verb, participial
prepositions are mostly tagged as verbs, but are generally translated as preposi-
          bl      À       Á           Â             Ã       Ä
                          threshold   threshold     bonus   bonus
                                      part. prep.           src context
bleu      13.84   13.44   13.59       13.69         13.98   13.94
pos-bleu  36.72   36.18   36.61       36.68         36.87   37.15
lem-bleu  17.34   16.91   17.16       17.24         17.47   17.63

Table 33: Results of the experiments focusing on the translation of verbs.
Numbers in the column labelled bl are the scores of the baseline system,
numbers in boldface are significantly better than the baseline. The source side
context used in version Ä is the combination of 1-pos-left and leftmost-pos.
tions. Participial prepositions are a group of verbal forms ending in -ing, such
as concerning, regarding or following. While they can be used as ‘normal’ verbs,
they also have the status of a preposition requiring a corresponding translation.
This is exemplified by the following example:
(31) on a wednesday morning there were long queues at all cash registers
following a sudden increase in shoppers .
(32) am mittwoch morgen gab es lange schlangen auf alle registrierkassen
folgenden ein plötzlicher anstieg der käufer .
Except for the translation of following, (32) is a relatively good, more or less
literal translation of the sentence in (31). For this sentence, the simplest
translation would be following→nach (after), whereas the translation in (32) is
as weird as subsequent a sudden increase in shoppers.
Actually, the adjective folgenden is a literal translation as well, but it is not
a good option in this situation. As the concept of -ing-forms does not exist in
German, there is no ‘general’ way of how to translate them best: They might
be realized as adjectives (e.g. the following year), as infinitive constructions as
seen in the example in table 32, as prepositions or they might even require a
totally different sentence structure. Since participial prepositions are a closed
word class, verbs with the ending -ing were checked against a list (see table 34)
before discarding translation candidates without content words. If a verbal
form could possibly be a participial preposition, all translation options were
kept. The results for this version are listed in column  in table 33; this small
modification did not change much and the results are still not better than the
baseline. We assume that completely discarding improbable translations is too
harsh as the remaining translations might ‘block’ good translations of other
parts of the input sentence for not fitting well in the target sentence.
according
concerning
considering
consisting
corresponding
following
given
including
notwithstanding
pending
regarding
respecting
resulting
touching
Table 34: Participial prepositions that were exempted from the removal of
non-content word translations.
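The exemption check itself is trivial; a minimal sketch, assuming Penn-style verb tags (VB*) on the English side and using the closed class from table 34:

```python
# Closed class of participial prepositions (table 34)
PARTICIPIAL_PREPOSITIONS = {
    "according", "concerning", "considering", "consisting", "corresponding",
    "following", "given", "including", "notwithstanding", "pending",
    "regarding", "respecting", "resulting", "touching",
}

def keep_all_options(token, pos_tag):
    """True if the content-word filter should be skipped for this token:
    verbal forms that may act as prepositions keep their full set of
    translation options, content words or not."""
    return pos_tag.startswith("VB") and token.lower() in PARTICIPIAL_PREPOSITIONS
```
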
In version Ã, we choose a slightly different approach: Instead of discarding
non-promising translation candidates, we now opt for a ‘softer’ variant by giving
a ‘bonus’ score for seemingly good translations of verbs. This means that an
additional feature function is assigned the value 1 in the case of a verb → content
word translation and 0¹³ if the translation consists of non-content words. All
other phrases, i.e. phrases without a verb, are assigned the value 1 as well;
thus, the phrases with the value 1 are the same as in the systems À-Â, but the
‘unwanted’ translations are still available though they need to score relatively
high in order to balance out the penalty feature.
This might be useful if, despite the threshold, only random translations are
left or a good (or nearly good) translation is too ‘expensive’ and blocks good
translations of the rest of the sentence: in this case, a universally applicable
token like comma or a preposition allows for more possible translations than
words that do not occur very often but are the only available translations.
The phrase translation probabilities of this system are the same as in the
baseline, and since all non-verbal phrases are assigned the value 1, they should
be as probable as in the baseline system. Verb → content word pairs also
keep their original translation probabilities but have an advantage in comparison
to the seemingly bad phrase translation pairs which are penalized by the value
of 0. This version yields better results than the previous versions À-Â, but is
still not significantly better than the baseline.
These results suggest that completely discarding translation candidates is too
harsh. This seems somewhat contradictory since the results of the experiments
presented in chapter 6 tend to be better than the baseline despite non-promising
translation candidates being discarded completely. However, translations that
were not seen within a specific source-side context included all sorts of phrases,
leaving in most cases both content word phrases and non-content word phrases
for translation. Furthermore, since the remaining translation candidates were
actually seen in the given context, they have a relatively high probability of
actually being valid translations.
In contrast, when conditioning on target-side pos tags, all discarded non-promising translation candidates are non-content word phrases, including entries
that are a ‘cheap’ option for the language model. This is a particularly critical
point since the verbal translations are in many cases likely to be placed at a
position in the target sentence that normally does not contain verbs and thus
might ‘irritate’ the target side language model.
As opposed to distributions conditioned entirely on source-side context, this
method of filtering does not take into account how good a translation candidate
is according to the (standard) translation model, but only considers target side
information and thus can favor translations that would otherwise be relatively
unlikely. To a certain degree, it is our intention to enhance the translation
probabilities of otherwise unlikely translation candidates, as illustrated in the
motivational example. However, if all remaining translation candidates have low
scores in the conventional translation features, this might negatively influence
the overall translation performance.
¹³ Technically, the value is only close to zero as the system does not accept zeros.
In version Ä, we keep the bonus/penalty feature, but replace the original
translation probabilities with contextually conditioned ones. The context used
for conditioning is the combination of the two contexts 1-pos-left and leftmost-pos, which was the best setting in chapter 6. The standard bleu score is still in
the same range as the baseline, but there are significant improvements in both
pos-bleu and lem-bleu.
8.2.2 Error analysis
Generally, the concept of enforcing content-word translations for verbs does
work, but, when looking at bleu scores, it is evident that there are also negative
effects.
The sentence presented in table 35 contains the verb cut, which could be
translated neither by the baseline system nor by any of the contextually
conditioned systems discussed in chapter 6. The translation cut → senken in
this version is correct in terms of lexical choice, but not on a grammatical level,
as it is not correctly positioned and the word form is not correct. In German,
plural verb forms are normally the same as the infinitive form of a verb; thus
there is no way to decide whether senken is a finite or infinitive form. However, it
would not be grammatical in either case: Being a finite form, it would not suit
the singular subject ecb, otherwise, there would not be a finite verb. But despite
being incorrectly positioned, the meaning of senken is crucial for understanding
the sentence, and thus, its semantically correct translation can be considered an
improvement.
The fact that the verb is placed at a wrong position also has a negative
influence on the pos-bleu score as can be seen in the following example:
ref    ,    dass   die   ezb   die   zinsen           zweimal   senken   wird
gloss  ,    that   the   ezb   the   interests        twice     cut      will
pos    $,   kous   art   nn    art   nn               adv       vvinf    vafin

bl     ,    dass   die   ezb   die   zinssätze        zweimal   im
gloss  ,    that   the   ezb   the   interest rates   twice     in the
pos    $,   kous   art   nn    art   nn               adv       prep

Ã      ,    dass   die   ezb   die   zinssätze        senken    zweimal   im
gloss  ,    that   the   ezb   the   interest rates   cut       twice     in the
pos    $,   kous   art   nn    art   nn               vvinf     adv       prep
Matching pos tag sequences are highlighted; as can be seen, there are 7 subsequent pos tags in the baseline translation that match the reference
translation. By inserting the verb senken, this sequence is disrupted. While
we gain the 1-gram vvinf, several other n-grams of higher order are
lost. The same would apply to the standard bleu score if interest rates had
been translated as the same word in the automatic translation as in the
reference translation.
While a single example is not necessarily universally valid, and automatically
translated sentences are often structured differently than the respective reference
translations, this example illustrates a systematic flaw of our approach to enforce
input  [they also]1 [predict that the]2 [ecb will]3 [ cut ]4 [interest rates]5
       [twice]6 [during the]7 [course of 2008]8 [.]9
Ã      [auch sie]1 [vorhersagen , dass die]2 [ezb]3 [die zinssätze]5 [ senken ]4
       [zweimal]6 [im]7 [laufe des jahres 2008]8 [.]9
gloss  [they also]1 [predict that the]2 [ecb ]3 [the interest rates]5 [ cut ]4
       [twice]6 [in the]7 [course of the year 2008]8 [.]9
ref    für 2008 rechnen experten damit , dass die ezb die zinsen zweimal
       senken wird .

Table 35: Example for a successfully translated verb.
valid translations of verbs: As the verbs are expected to appear at an incorrect position
in the target sentence, bleu scores are likely to be negatively influenced
by verbs inserted into matching pos tag sequences.
In this example, it is also remarkable that the system did not choose the
translation candidate with the highest translation probability, cut → entzieht
(withdraws), but the entry working best in combination with the noun zinssätze
(interest rates): This is most probably due to the influence of the language
model as it seems to have preferred zinssätze senken over any other possibility.
Another problem is the fact that the remaining translation candidates are
not necessarily good translations of the verbal phrase in the source sentence:
It is possible that a phrase containing a verb is translated by only ‘half’ of
the required structure. This is especially the case if the source phrase contains
more than one content word, e.g. consists of a verb and a noun, but the target
phrase only contains one content word. As there are no restrictions on the
number of content words, this is not unusual. However, it would certainly not
be a good idea to demand the same number of content words on the source
and the target side: this would lead to a loss of data as multi word expressions
can be translated by single words. This observation leads to the somewhat
counterintuitive assumption that it is ‘safer’ to translate verbs as short phrases
containing no other content words, as there is only the single risk of picking the
wrong content word, whereas multi word phrases could be translated incorrectly
or by incomplete structures. In this context, it is important to point out again
that the expected position in the target sentence is not suitable for verbs in most
cases and that they might thus be dispreferred when competing with a non-verbal target phrase. As the language model has not seen n-grams with verbs
at this position, it is likely to give bad scores for such seemingly inappropriate
verbal translations.
This is illustrated in table 36, where the sentence from the previous example
is translated by system Â. This time, the phrase cut interest rates was to be
translated as one unit. While the target phrase 4 meets the requirement to
contain at least one content word, it fails to be an adequate translation for
the English phrase. For the phrase cut interest rates, there are 5 translation
candidates all of which are more or less equally probable and at least partially
correct.
input  [they also]1 [predict that the]2 [ecb will]3 [ cut interest rates ]4
       [twice]5 [during the course of]6 [2008 .]7
Â      [sie]1 [vorhersagen , dass die]2 [ezb]3 [ die zinssätze ]4
       [zweimal]5 [im]6 [jahr 2008 .]7
gloss  [they also]1 [predict that the]2 [ecb]3 [ the interest rates ]4
       [twice]5 [in the]6 [year 2008 .]7

Table 36: Invalid translation of a verb as a result of different segmentation.
However, in some cases, it is surprising that the system did not take the most
probable translation candidate, but instead a translation with a considerably
smaller probability. In an attempt to illustrate this observation, the example
discussed in the beginning of this chapter is presented again. The systems
used to translate the sentence in table 37 are basically the same except for the
treatment of participial prepositions. While the translation of of turning with
system  can be considered to be valid, it is not the top-ranked translation, but
a candidate with about one third of its probability. In contrast, the translation
chosen by system Á is completely random. While both have the same translation
probability, stillhaltepolitik occurs exactly once in the training data, whereas the
phrase zu verwandeln , is far more frequent. Again, this might be due to the
strong influence of the language model which dislikes verbal translations at this
position. This observation shows that even completely random candidates are
taken for translation even if there are far better candidates available, suggesting
that the presented approach is not capable of dealing with the problem of the
translation’s wrong positioning in the target sentence.
8.2.3 Comparison of the presented systems
In an attempt to measure the success of our modifications on a more adequacy-oriented level, we computed statistics on precision, recall and f-measure of
input  [ of turning ]7 [new york]8 [into a ”]9 [save haven]10
       [for illegal immigrants]11 [” .]12
Â      [ zu verwandeln , ]7 [new york]8 [in einen ”]9 [sicheren hafen]10
       [für illegale einwanderer]11 [” .]13
gloss  [ to turn , ]7 [new york]8 [into a ”]9 [save haven]10
       [for illegal immigrants]11 [” .]12

input  [ of turning ]7 [new york]8 [into a ”]9 [save haven]10
       [for illegal immigrants]11 [” .]12
Á      [ stillhaltepolitik ]7 [new york]8 [in einen ”]9 [sicheren hafen]10
       [für illegale einwanderer]11 [” .]13
gloss  [ ’policy of turning a blind eye to events’ ]7 [new york]8 [into a ”]9
       [save haven]10 [for illegal immigrants]11 [” .]12

Table 37: Different translation options
       bl       À        Á           Â             Ã        Ä
                         threshold   threshold     bonus    bonus
                                     part. prep.            src context
bleu   13.84    13.44    13.59       13.69         13.98    13.94
p      0.2186   0.2240   0.2329      0.2227        0.2063   0.2131
r      0.1789   0.2082   0.2043      0.1908        0.1696   0.1696
f      0.1863   0.2060   0.2071      0.1953        0.1759   0.1773

Table 38: Precision, recall and f-measure of correctly translated verbs. The
respective highest scores are in boldface.
correctly translated verbs and analyze the correlation between these
scores and the different systems presented.
While systems À-Â are basically the same, Ã is a softer variant as it does not
throw away non-content word translations, but penalizes them. Ä is additionally
conditioned on source-side information. Table 38 shows statistics about the
number of correctly translated verbs. Similarly to chapter 6, precision, recall and
f-measure for verb-lemmas were computed on sentence level and then averaged
over the entire test set.
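The sentence-level computation can be sketched as follows (our own naming; verb lemmas are assumed to come from the lemmatized, pos-tagged sentences):

```python
from collections import Counter

def verb_prf(ref_lemmas, hyp_lemmas):
    """Precision, recall and f-measure of verb lemmas for one sentence;
    these sentence-level scores are then averaged over the test set."""
    ref_c, hyp_c = Counter(ref_lemmas), Counter(hyp_lemmas)
    correct = sum(min(c, ref_c[l]) for l, c in hyp_c.items())
    p = correct / sum(hyp_c.values()) if hyp_c else 0.0
    r = correct / sum(ref_c.values()) if ref_c else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f
```
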
Interestingly, the number of correctly translated verbs does not seem to
correlate with bleu scores. As can be seen in the table, the system with the
lowest bleu score has the best recall of correctly translated verbs. Similarly,
the system with the second-lowest bleu score has the highest precision and
f-measure of translated verbs, while the system with the best bleu score has
the lowest scores in terms of correctly translated verbs. As verbs are crucial for
the understanding of a sentence, this is somewhat surprising.
When looking at the values for the systems À and Á, we see that there is a
noticeable increase in the number of correctly translated verbs in comparison to
the baseline, which indicates that at least adequacy should be improved. À, the
most straightforward system, has the highest recall: As only strings containing
content words are provided as possible translations for verbs, there is a good
chance that the correct translation is chosen. Due to the threshold introduced in
system Á, some verbal phrases are left with their original set of translations in
an attempt to avoid overestimation of random alignments. While this leads to a
small loss of recall, it definitely helps precision and also the f-measure. System
 is still better than the baseline, but the special treatment for participial
prepositions seems to have a negative influence, even though less than 20 words
are directly affected.
In Ã, the same translation options as in the baseline are provided. Despite
the penalty for non-content word translations, the system often took more or
less the same non-content word translations as in the baseline. While its bleu
score is slightly higher than the baseline, the values for correctly translated verbs
are lower. The increase of bleu points might be explained by the fact that the
system is trained to maximize bleu: While in systems À-Â only content-word
translations are available, system à has more choice and since forcing verbs to
appear at unusual positions seems to harm bleu, the system might avoid this
form of translation. The same assumption could be valid for system Ä; in this
case however, the conditioning on the source side leads to better pos-bleu and
lem-bleu scores by taking into account all sorts of phrases instead of focusing
on only one group of pos tags.
Since the main issue of our approach to enhance meaningful translations of
verbs is the mismatch in syntactic structure between English and German, one
might consider modifying the sentence structures in order to make them more
alike. Experiments on this subject are presented in the work of [Collins et al.,
2005], who restructure the German part of the training data in order to adapt
the German data to the English structure. Reordering rules include moving
verbs from the end of a (subordinate) clause to a position immediately
adjacent to the subject and auxiliary verbs. The same applies to the
verbal negation nicht (not) or verb particles. A German→English system trained
on the restructured data achieved better results than the baseline system; even
though the adaptation for the reverse translation direction English→German
is more complicated due to the fact that the reordered German has to be
reconstructed into ‘normal’ German, results have been shown to be promising.
However, this method does not guarantee perfectly translated verbs. Combining both ideas by enforcing meaningful translation of verbs on reordered training
data would provide the system with data of equal structure and thus enable it
to fit appropriate translations into the target sentence without irritating the
language model.
9 Conclusion and future work
9.1 Summary
In this work, we presented a method for the integration of local source-side
context features into a phrase-based statistical machine translation system in
order to enable a more refined estimation of translation probabilities. Looking at
adjacent words or pos tags, we made use of potentially very valuable information
normally not used in statistical translation. Linguistically motivated context features in particular were expected to improve the syntactic quality of translation
output. We carried out experiments on the language pair English→German
on three different system settings (full system, a system with a minimal set of
translation features and a system where source phrases were restricted to length
one) as we expected to observe a stronger influence of our modifications in the
simplified settings.
After motivating the use of context information and illustrating the technical
aspects of integrating additional features in chapter 3, we presented automatic
evaluation metrics in chapter 4 and also introduced the idea of computing bleu
on pos tags for evaluating the syntactic quality of machine translation output.
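The idea of pos-bleu can be illustrated with a minimal sketch: a standard sentence-level bleu (modified n-gram precision with a brevity penalty) is simply applied to tag sequences instead of surface words. The helper functions and the tag sequences below are illustrative; the actual evaluation used full tagger output over complete test sets.

```python
import math
from collections import Counter

def ngrams(seq, n):
    return Counter(tuple(seq[i:i + n]) for i in range(len(seq) - n + 1))

def bleu(hyp, ref, max_n=4):
    """Smoothed sentence-level bleu: modified n-gram precision with a
    brevity penalty. It works on any token sequence, so passing pos
    tags instead of words directly yields a pos-bleu score."""
    if not hyp:
        return 0.0
    precisions = []
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum((h & r).values())          # clipped n-gram matches
        precisions.append(max(overlap, 1e-9) / max(sum(h.values()), 1))
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))   # brevity penalty
    return bp * math.exp(sum(math.log(p) for p in precisions) / max_n)

# Tag sequences instead of surface words (STTS-style tags, illustrative):
hyp_tags = ["ART", "NN", "VVFIN", "ART", "NN", "$."]
ref_tags = ["ART", "NN", "VVFIN", "ART", "ADJA", "NN", "$."]
score = bleu(hyp_tags, ref_tags)
```

An identical hypothesis receives a score of 1.0; the mismatching adjective tag above lowers the higher-order precisions without affecting the unigram matches.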
In chapter 5, we discussed several criteria to avoid conditioning on contexts
which would lead to deteriorated translation probabilities for a given phrase-context combination. After experimenting with different types of criteria to
rate the reliability of phrase-context combinations, we decided to implement
the simplest variant, using a fixed threshold to avoid overestimating the
probabilities of low-frequency phrase-context pairs.
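The thresholding scheme can be sketched as follows; the counts, phrases and threshold value are invented for illustration, and the real system derived its counts from the word-aligned training data:

```python
from collections import defaultdict

def conditioned_phrase_table(pair_counts, threshold=10):
    """Estimate p(target | source, context) by relative frequency, but
    fall back to the context-independent distribution p(target | source)
    whenever a (source, context) pair occurs fewer than `threshold`
    times, to avoid overestimating low-frequency pairs.

    pair_counts: dict mapping (source, context, target) -> count
    """
    ctx_totals = defaultdict(int)   # counts per (source, context)
    src_totals = defaultdict(int)   # counts per source phrase
    src_tgt = defaultdict(int)      # counts per (source, target)
    for (src, ctx, tgt), c in pair_counts.items():
        ctx_totals[(src, ctx)] += c
        src_totals[src] += c
        src_tgt[(src, tgt)] += c

    table = {}
    for (src, ctx, tgt), c in pair_counts.items():
        if ctx_totals[(src, ctx)] >= threshold:
            table[(src, ctx, tgt)] = c / ctx_totals[(src, ctx)]
        else:  # too rare: keep the original, unconditioned estimate
            table[(src, ctx, tgt)] = src_tgt[(src, tgt)] / src_totals[src]
    return table

# Toy counts for one ambiguous source word in two pos contexts:
counts = {
    ("bank", "NN", "Bank"): 12, ("bank", "NN", "Ufer"): 3,
    ("bank", "APPR", "Ufer"): 2,
}
table = conditioned_phrase_table(counts, threshold=10)
```

With these counts, p(Bank | bank, NN) becomes 12/15, while the rare (bank, APPR) context keeps the unconditioned estimate 5/17.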
Our extensive evaluation in chapter 6 showed that contextual information
significantly increased the translation quality in the case of word translation, but
results were less clear for full systems: only two settings achieved a significantly
improved bleu score; in both cases, the systems were enriched with two separate
context features. However, improved bleu scores computed on lemmatized
data indicated improvements on a lexical level. Furthermore, we detected
slight syntactic improvements (represented by pos-bleu). The evaluation of
content words revealed that systems generally translated more content words,
presumably resulting in better adequacy. A small-scale manual evaluation
supported these assumptions.
When examining examples, we found that in many cases, our modifications
increased the translation probabilities for better suited candidates which were
consequently chosen by the system. But we also observed that contextual
conditioning can indirectly cause the system to prefer some candidates over
other ones: different translation decisions in the baseline and a contextually
conditioned system are not always due to modified translation probabilities, but
can also be caused by other factors such as a different reordering or segmentation
of the source sentence. We saw an example in which the re-estimated probability
of a phrase positively affected (via segmentation and language model score) the
translation of the subsequent phrase, but also an example in which a similar
constellation indirectly caused the system to prefer a translation worse than in
the baseline.
Generally, combinations of context templates tended to achieve better results
than single context templates. We also found that right-side contexts had lower
scores in the word-translation task than left-side contexts, whereas the results
of left-side and right-side contexts in full systems were comparable: This is
most likely correlated with the different effects on segmentation observed in the
second part of chapter 7.
Chapter 7 started with a discussion about context granularity: since we were
working with a relatively detailed tag set, we analyzed the effect of merging
different pos tags with similar properties into one tag: auxiliary tags are
better left detailed, whereas merging nouns and personal pronouns did not
greatly influence the overall result. The introduction of chunks, i.e.
linguistically motivated phrases, as contexts was intended to allow for a maximal
generalization of context information, but also to trigger a linguistically
well-formed segmentation. Chunk-based contexts turned out to be cumbersome due
to overlapping phrases and did not lead to the desired improvement: while they
did result in a more linguistically motivated segmentation, this did not
translate into better translation quality.
In the second part of chapter 7, we analyzed the correlation between contextual conditioning and the preference of the system for shorter segmentation units.
A comparison of contextually conditioned translation probability distributions
and the original distributions revealed that good contexts tended to favor literal
translations by giving them comparatively high translation probabilities. In
combination with the threshold, this led to an unexpected effect on segmentation:
Since short phrases meet the threshold criterion far more often than longer
phrases, they benefit to a larger extent from contextual conditioning and thus
have an ‘advantage’ in comparison to the phrases left with their original translation probabilities. As a result, the system prefers shorter phrases. Although
shorter phrases generally lead to bad translation results, we find that the quality
of our systems tends to be improved rather than deteriorated, suggesting that
contextual information is indeed valuable as it managed to even out the inherent
disadvantages of short phrases. Interestingly, we found that contexts on the left
side led to an immense increase of source phrases of length one, whereas contexts
on the right side only caused a moderate increase of single-word source phrases.
While we could not find an entirely satisfying explanation for this observation,
we assume that right-side context might act as a ‘look-ahead’ during decoding
and thus behaves differently from left-side context.
Chapter 8 focused entirely on the translation of verbs. In contrast to
the previously presented experiments, we additionally integrated target side
information. Due to the different positions of verbs in English and German
sentences, verbs are often translated by common, but meaningless phrases
(e.g. particles or punctuation). Using information about the pos tags of the
German translations, we tried to enforce meaningful translations of verbs. While
the idea generally worked, the major structural differences between English
and German caused the language model to disprefer essentially good verbal
translation candidates for not fitting well into the sentence structure.
Completely discarding entries of verbs with a non-content word translation
led to a better recall and precision of correctly translated verbs in the output
sentence, but it negatively affected the overall bleu score, presumably because
enforced meaningful translations blocked other translations. Penalizing non-meaningful translations of verbs resulted in better bleu scores, but lower
precision and recall of correctly translated verbs. A combination of penalizing
non-meaningful translations of verbs with the best source context detected in
chapter 6 resulted in significantly increased pos-bleu and lemma-based bleu.
9.2 Discussion
The generally underwhelming outcome for full systems is consistent with the
results for comparable language pairs reported in previous publications: While
systems translating Chinese↔English were greatly improved by the use of
context features (e.g. [Gimpel and Smith, 2008], [Stroppa et al., 2007], [Carpuat
and Wu, 2007b]), systems with a morphologically more complex target language
were found to benefit less from context information (e.g. [Gimpel and Smith,
2008], [Max et al., 2008]).
Since the authors of all papers presented in chapter 2 reported that simple
context features such as adjacent words or pos tags turned out to work best,
we mainly focused on this type of context features. In contrast to most of the
cited articles, our general method of context integration is much simpler as we
only use one context template at a time. This allows us to analyze more
general aspects like the different effects on phrasal segmentation of left-side and
right-side contexts, but also to rate the usefulness of different context templates.
For example, we found that the first pos tag of a phrase was not useful, but
the combination with the preceding tag outperformed every other setting. In
addition to identifying good context templates, our extensive evaluation also
revealed that contextually conditioned systems tend to translate more content
words. This is related to our analysis of specific contexts in section 7.3; while
we only studied an individual case, we found evidence that information about
good phrase-context structures might also prove interesting for alignment.
Yet, our approach has two significant disadvantages, the first of which is the
evident lack of flexibility in terms of feature selection: since we only integrate
one context feature at a time, the system is forced to use this context (given
that the threshold criterion is met) even though this context might not be
relevant at all, while ignoring potentially better contexts. A maximum entropy
based approach (as presented in chapter 2) would offer the possibility to choose
context features according to their respective relevance. However, such systems
are far more complicated than the systems presented in the course of this work.
Our experiments showed that systems conditioned on two separate context
functions tended to outperform single-context settings. The respective relevance
of the context templates was then determined by parameter tuning in the form
of feature weights. In the case of the combination of one pos tag left and the
leftmost tag of the phrase, we observed that the system using separate feature
functions yielded a slightly better result, possibly due to the fact that the two
features received different weights representing their actual relevance instead
of assuming them to be equally important. Yet, a feature weight assigned to
the entire collection of probability distributions conditioned on one context
template still does not take into account the respective relevance of the template
in different phrase-context constellations. (Conditioning each phrase-context
pair regardless of the respective usefulness might also have caused imprecise
estimates.) Thus, the implementation of a more sophisticated feature selection
seems a necessary next step.
In an attempt to go beyond the concept of context templates, we carried
out a small-scale experiment in which we tried to combine the values of two
independent context templates, weighted according to their individual relevance
in the respective phrase-context combination (cf. section 7.5). The relevance
of a context for a given phrase was determined by a factor λ based on type-token relations of the contextually conditioned distribution (cf. section 5.3.4).
Experiments were conducted on a system with the context combination of
1-pos-left and 1-pos-right as well as on a system conditioned on one or two
pos tags on the left side. In the first variant of this experiment, the factor λ
was used to determine the better context which was then used for conditioning.
Alternatively, the individual translation probabilities of both contexts were
weighted according to the respective factor λ and merged into a single score.
Unfortunately, the results of these preliminary experiments were disappointing.
Although the factor λ led to good results in some of the experiments presented
in section 5.5, its performance was not stable throughout the tested systems,
suggesting that it is not suited for every setting.
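A minimal sketch of this λ-weighted combination might look as follows; the type-token heuristic is a simplified stand-in for the factor described in section 5.3.4, and the toy distributions are invented:

```python
def type_token_lambda(dist_counts):
    """Crude reliability weight from the type/token relation of a
    contextually conditioned count distribution: few types relative to
    many tokens suggests a peaked, reliable distribution (a simplified
    stand-in for the factor lambda of section 5.3.4)."""
    tokens = sum(dist_counts.values())
    return 1.0 - len(dist_counts) / tokens if tokens else 0.0

def merge_contexts(dist_a, dist_b):
    """Weight the translation probabilities of two independent context
    templates by their respective lambda and merge them into one score."""
    la, lb = type_token_lambda(dist_a), type_token_lambda(dist_b)
    z = la + lb or 1.0                      # avoid division by zero
    targets = set(dist_a) | set(dist_b)
    def prob(d):
        total = sum(d.values())
        return {t: d.get(t, 0) / total for t in targets}
    pa, pb = prob(dist_a), prob(dist_b)
    return {t: (la * pa[t] + lb * pb[t]) / z for t in targets}

left = {"Haus": 9, "Gebäude": 1}                 # peaked: high lambda
right = {"Haus": 1, "Gebäude": 1, "Heim": 1}     # flat: lambda = 0
merged = merge_contexts(left, right)
```

Here the peaked left-side context dominates: it receives λ = 0.8, while the completely flat right-side distribution receives λ = 0 and contributes nothing to the merged score.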
The second problem is the estimation of translation probabilities using
relative frequencies. Experiments in chapter 5 showed that some criterion to
prevent the over-estimation of infrequent phrases was necessary. However,
settings using a fixed threshold led to an unfortunate phrasal segmentation.
Interestingly, using interpolation for the computation of translation probabilities
(and thus avoiding the fixed threshold) does not lead to better bleu scores despite
the higher average translation unit length. This outcome suggests that phrase-translation probabilities without the influence of the original distribution are
better than the interpolated distributions: In order to benefit from contextually
conditioned probabilities without triggering a strong preference for short phrases,
it is necessary to treat all phrases alike regardless of their occurrence frequency.
The key to do so might lie in a maximum entropy based approach as decisions are
made based on the relevance of features instead of systematically dispreferring
entire subgroups of infrequent phrase-context pairs.
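For comparison, the interpolation variant mentioned above can be sketched like this; the count-based weighting function is an assumption standing in for the variants explored in chapter 5:

```python
def interpolate(p_context, p_original, count, k=10.0):
    """Interpolate the contextually conditioned distribution with the
    original one, trusting the conditioned estimate more the more often
    the (phrase, context) pair was observed. The weighting function
    count / (count + k) is illustrative, not the thesis's exact formula."""
    w = count / (count + k)
    targets = set(p_context) | set(p_original)
    return {t: w * p_context.get(t, 0.0) + (1 - w) * p_original.get(t, 0.0)
            for t in targets}

# Toy distributions for one phrase; probabilities are invented:
p_ctx = {"Bank": 0.9, "Ufer": 0.1}    # conditioned on the context
p_orig = {"Bank": 0.6, "Ufer": 0.4}   # context-independent estimate
smoothed = interpolate(p_ctx, p_orig, count=5)
```

With count = 5 and k = 10, the conditioned estimate contributes one third of the final score, so even rare pairs are conditioned smoothly instead of being cut off by a hard threshold.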
In connection with the phrasal segmentation, the different properties of
left-side and right-side contexts seem to play an important role. However, it
is not entirely clear what caused the extreme outcome of left-side contexts,
and thus, this issue definitely requires more research. This observation could
even be taken as evidence that left-side context and right-side context should
be treated differently in order to suit their respective properties. If right-side
context does really act as ‘look-ahead’ during decoding, it would be interesting
to have a closer look at this observation. One could even consider integrating
information based on right-side contexts directly into the decoder: this would
allow the search algorithm to make direct use of the information instead of
deriving it somewhat indirectly from the contextually conditioned translation
probabilities.
9.3 Morpho-syntactic preprocessing
We already suggested two different methods of preprocessing the training data.
First, reordering the training data to align the structures of the English and
German sides is expected to benefit verbal translations in particular (cf.
[Collins et al., 2005]). In addition to structural issues, German
also has a relatively rich morphology, resulting in flat probability distributions.
This becomes evident when looking at the probability distributions of e.g.
adjectives: since German adjectives vary in number, gender and case, an English
adjective is typically not translated by an equivalent German adjective, but
by several forms of the same word. The same applies to German nouns and
verbs. When introducing additional features, data sparsity becomes increasingly
problematic. Thus, the idea of using simplified data for translation and
relocating the task of modelling (monolingual) morphological features to a
separate step seems promising: In this setting, the translation system should
generally benefit from the simplified data, while contextual conditioning could
make use of more generalized context settings. [Minkov et al., 2007] report
good results for the prediction of inflected word forms with English-Russian and
English-Arabic data, and successfully integrated the concept into an smt
system in [Toutanova et al., 2008]. Similarly, [Weller, 2009] adapted this idea
for German; experiments conducted on reference translations were promising.
These two pre-processing steps are mainly intended to enable a better general
translation. However, morphological modelling could also be used to design
context features: One of the main problems of translating into a morphologically
rich language is the fact that morphological features not existing in the source
language need to be realized in the target language. When eliminating all
morphological features in order to simplify the training data, the tool for the
prediction of morphological features needs to guess at some point. (This problem
would be similar with ‘normal’ training data since the source language does not
indicate which realization to use.) In the case of simple and local phenomena,
one could consider passing along target-side information to be used as a
context feature. A simple example is the case government of German prepositions
(dative or accusative), which can vary depending on whether they are used to
describe a situation or a directional movement. For example, in the phrase auf
einem Baum sitzen (to sit on a tree), the case required by the preposition auf is
dative, whereas in auf einen Baum klettern (to climb up a tree), the
preposition requires accusative case. Since this is an entirely semantic problem,
it is difficult to predict a correct form based only on morphological features.
However, if the German training data is annotated with case information, we
could use this as a feature to condition the source phrase on the target-side
contexts accusative or dative. This concept is not limited to prepositions, but
could also be applied to verbs.
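As a toy illustration of such a case-annotated phrase table (all entries, probabilities and helper names are invented), the two readings of on a tree would be split by the case feature:

```python
# Hypothetical phrase table keyed by the proposed target-side case
# feature; entries and probabilities are invented for illustration.
case_table = {
    ("on a tree", "dat"): {"auf einem Baum": 0.9, "auf einen Baum": 0.1},
    ("on a tree", "acc"): {"auf einem Baum": 0.1, "auf einen Baum": 0.9},
}

def best_translation(source, case):
    """Pick the most probable translation given the case that the
    German preposition governs in the annotated training context."""
    dist = case_table[(source, case)]
    return max(dist, key=dist.get)
```

Conditioning on the case feature thus separates the stative dative reading (auf einem Baum sitzen) from the directional accusative one (auf einen Baum klettern), which the morphological prediction step alone could only guess at.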
References
[Allauzen et al., 2009] Allauzen, A., Crego, J., Max, A., and Yvon, F. (2009).
LIMSI’s statistical translation system for WMT 2009. In Proceedings of the
4th EACL Workshop on Statistical Machine Translation, Athens, Greece.
Association for Computational Linguistics.
[Banerjee and Lavie, 2005] Banerjee, S. and Lavie, A. (2005). METEOR: an
automatic metric for MT evaluation with improved correlation with human
judgements. In proceedings of ACL Workshop on Intrinsic and Extrinsic
Evaluation Measures for MT and/or Summarization, Ann Arbor, Michigan.
[Berger et al., 1996] Berger, A., della Pietra, S., and della Pietra, V. (1996). A
maximum entropy approach to natural language processing. Computational
Linguistics, Volume 22.
[Callison-Burch et al., 2008] Callison-Burch, C., Fordyce, C., Koehn, P., Monz,
C., and Schroeder, J. (2008). Further meta-evaluation of machine translation.
In Proceedings of the Third Workshop on Statistical Machine Translation,
Columbus, Ohio.
[Callison-Burch et al., 2006] Callison-Burch, C., Osborne, M., and Koehn, P.
(2006). Re-evaluating the role of BLEU in machine translation research. In
Proceedings of 11th Conference of the European Chapter of the Association
for Computational Linguistics: EACL 2006, Trento, Italy.
[Carpuat and Wu, 2007a] Carpuat, M. and Wu, D. (2007a). Context-dependent
phrasal translation lexicons for statistical machine translation. In Proceedings
of Machine Translation Summit XI, Copenhagen, Denmark.
[Carpuat and Wu, 2007b] Carpuat, M. and Wu, D. (2007b). Improving statistical machine translation using word sense disambiguation. In Proceedings
of the 2007 Joint Conference on Empirical Methods in Natural Language
Processing and Computational Natural Language Learning, Prague, Czech
Republic.
[Collins et al., 2005] Collins, M., Koehn, P., and Kucerová, I. (2005). Clause
restructuring for statistical machine translation. In Proceedings of the 43rd
Annual Meeting of the ACL, Ann Arbor, Michigan.
[Doddington, 2002] Doddington, G. (2002). Automatic evaluation of machine
translation quality using n-gram co-occurrence statistics. In Proceedings of
the Human Language Technology conference (HLT), San Diego, CA.
[Garcia-Varea et al., 2001] Garcia-Varea, I., Och, F., Ney, H., and Casacuberta,
F. (2001). Refined lexicon models for statistical machine translation using a
maximum entropy approach. In Proceedings of the 39th annual meeting of
the association for computational linguistics, Toulouse, France.
[Gimpel and Smith, 2008] Gimpel, K. and Smith, N. A. (2008). Rich source-side context for statistical machine translation. In Proceedings of the Third
Workshop on Statistical Machine Translation, Columbus, Ohio.
[Koehn, 2004] Koehn, P. (2004). Statistical significance tests for machine translation evaluation. In Proceedings of Conference on Empirical Methods in
Natural Language Processing (EMNLP), Barcelona, Spain.
[Koehn, 2010] Koehn, P. (2010). Statistical Machine Translation. Cambridge University Press.
[Koehn et al., 2003] Koehn, P., Och, F. J., and Marcu, D. (2003). Statistical
phrase-based translation. In Proceedings of HLT-NAACL, Edmonton, Canada.
[Max et al., 2008] Max, A., Makhloufi, R., and Langlais, P. (2008). Explorations
in using grammatical dependencies for contextual phrase translation disambiguation. In Proceedings of EAMT, poster session, Hamburg, Germany.
[Minkov et al., 2007] Minkov, E., Toutanova, K., and Suzuki, H. (2007). Generating complex morphology for machine translation. In Proceedings of the
45th Annual Meeting of the Association of Computational Linguistics, Prague,
Czech Republic.
[Papineni et al., 2002] Papineni, K., Roukos, S., Ward, T., and Zhu, W.-J.
(2002). BLEU: a method for automatic evaluation of machine translation. In
Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, Philadelphia, USA.
[Popović et al., 2006] Popović, M., de Gispert, A., Gupta, D., Lambert, P., Ney,
H., Mariño, J. B., Federico, M., and Banchs, R. (2006). Morpho-syntactic
information for automatic error analysis of statistical machine translation
output. In Proceedings of the Workshop on Statistical Machine Translation,
New York City, USA. Association for Computational Linguistics.
[Popović and Ney, 2009] Popović, M. and Ney, H. (2009). Syntax-oriented
evaluation measures for machine translation output. In Proceedings of the
4th EACL Workshop on Statistical Machine Translation, Athens, Greece.
Association for Computational Linguistics.
[Riezler and Maxwell, 2005] Riezler, S. and Maxwell, J. (2005). On some pitfalls
in automatic evaluation and significance testing in MT. In Proceedings of
the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for MT
and/or Summarization, Ann Arbor, Michigan.
[Schmid, 1994] Schmid, H. (1994). Probabilistic part-of-speech tagging using
decision trees. In International Conference on New Methods in Language
Processing, Manchester, UK. www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/.
[Stroppa et al., 2007] Stroppa, N., van den Bosch, A., and Way, A. (2007).
Exploiting source similarity for SMT using context-informed features. In
Proceedings of the 11th International Conference on Theoretical and Methodological Issues in Machine Translation (TMI), Skövde, Sweden.
[Toutanova et al., 2008] Toutanova, K., Suzuki, H., and Ruopp, A. (2008). Applying morphology generation models to machine translation. In Proceedings
of the 46th Annual Meeting of the Association for Computational Linguistics:
Human Language Technologies, Columbus, Ohio.
[Vickrey et al., 2005] Vickrey, D., Biewald, L., Teyssier, M., and Koller, D.
(2005). Word-sense disambiguation for machine translation. In Proceedings
of Human Language Technology Conference and Conference on Empirical
Methods in Natural Language Processing (HLT/EMNLP), Vancouver, Canada.
[Vilar et al., 2006] Vilar, D., Xu, J., D’Haro, F., and Ney, H. (2006). Error
analysis of statistical machine translation output. In Proceedings of the Fifth
International Conference on Language Resources and Evaluation (LREC),
Genova, Italy.
[Weller, 2009] Weller, M. (2009). Separate Morphologiebehandlung als Methode zur Verbesserung statistischer maschineller Übersetzung (Studienarbeit,
Universität Stuttgart).
A Manually evaluated sample
In the following subsections, all sentences with unambiguous annotation from
the manual evaluation presented in section 6.4.2 are listed.
A.1 Contextually conditioned system better than baseline
Missing words
input the group responded with a new and even better cd.
bl die fraktion mit einem neuen und besseren cd.
cont die fraktion reagierte darauf mit einem neuen und besseren cd.
ref die band hat reagiert, wie es sollte: mit einem neuen, noch besseren album.
input among the five joined seven children.
bl von den fünf sind sieben kinder.
cont von den fünf beigetreten sind sieben kinder.
ref die fünf haben zusammen sieben kinder.
input a broker who does not take vacations is a broker who does not want anybody to
look into his records,” kerviel concluded.
bl ein makler, die nicht urlaub ist ein vermittler, der nicht will, dass jemand, in
seine aufzeichnungen,” kerviel abgeschlossen.
cont ein makler, die nicht urlaub ist ein makler, der nicht will, dass jemand zu prüfen,
seine aufzeichnungen,” kerviel abgeschlossen.
ref ein makler, der keinen urlaub nimmt, ist einer, der nicht will, dass man ihm in
die karten schaut,” sagte kerviel abschließend.
input d. it is less serious to participate in a fight than to participate in the interchange
of degraded music.
bl d. es weniger schwerwiegend zu einem kampf als an der austausch von schlechterer
musik.
cont d. es weniger schwerwiegend zur teilnahme an einem kampf als zur teilnahme an
den austausch von schlechterer musik.
ref d. es ist weniger schwerwiegend an einer schlägerei teilzunehmen als an einem
musikdaten-austausch.
input this same company has lowered the forecasted profit of antenna 3 to 4.5 % in
2008 and to 7.5 % in 2009 .
bl dieses unternehmen hat die voraussichtlichen profit des nebenstelle 3 bis 4 , 5 %
im jahr 2008 und auf 7,5 % im jahr 2009 .
cont diese gleichen firma senkten die voraussichtlichen profit des nebenstelle 3 bis
4,5 % im jahr 2008 und auf 7,5 % im jahr 2009 .
ref diese hat auch die voraussichtliche gewinnberechnung von antena 3 für 2008 um
4,5 % und für 2009 um 7,5 % herabgesetzt.
input he latter then seized the trial to make a sudden escape along with 30 sympathizers.
bl letztere dann die prozess zu einem plötzlichen entkommen zusammen mit 30
sympathisanten.
cont letztere dann beschlagnahmt werden den prozess zu einem plötzlichen entkommen
zusammen mit 30 sympathisanten.
ref er nutzte die verhandlung, um sich plötzlich mit rund 30 sympathisanten abzusetzen.
input in suburbs north of paris, there had been violent clashes between youths and the
police in recent nights.
bl in den vororten von paris kam es zu gewalttätigen zusammenstößen zwischen
jugendlichen und die polizei in den vergangenen nächten.
cont in vororten nördlich von paris kam es zu gewalttätigen zusammenstößen zwischen
jugendlichen und die polizei in den vergangenen nächten.
ref in nördlichen vororten von paris hatten sich jugendliche in den vergangenen
nächten schwere straßenschlachten mit der polizei geliefert.
input the average price per share was 14.46 euros, the company announced on wednesday
evening.
bl der durchschnittliche preis pro wurde 14.46 euro, das unternehmen am mittwoch
abend.
cont die durchschnittliche preis pro aktie war 14.46 euro, das unternehmen am mittwoch
abend.
ref der durchschnittspreis je aktie beträgt 14,46 euro, wie das unternehmen am
mittwochabend mitteilte.
input 600 thousand people did not manage to renew driver’s licenses.
bl 600 000 menschen nicht führerscheine zu erneuern.
cont 600 000 menschen nicht gelingt, führerscheine erneuern.
ref 600000 leuten gelang es nicht, ihren führerschein umzutauschen
input on wednesday (local time), chávez also announced a severance of diplomatic ties
with neighbouring country colombia due to a hostage crisis.
bl am mittwoch (ortszeit), chávez kündigte auch eine severance der diplomatischen
beziehungen mit nachbarstaat kolumbien eine geisel krise.
cont am mittwoch (ortszeit), chávez kündigte auch eine severance der diplomatischen
beziehungen mit nachbarstaat kolumbien aufgrund einer geiseldrama.
ref zudem kündigte chávez am mittwoch (ortszeit) wegen einer geiselaffäre den
abbruch der beziehungen zum nachbarland kolumbien angekündigt.
Word choice
input opportunities to combine vocational and academic education are developed comparatively strongly.
bl chancengleichheit zu verbinden, berufliche und wissenschaftliche ausbildung entwickelten vergleichsweise nachdrücklich.
cont möglichkeiten zu verbinden, berufliche und wissenschaftliche ausbildung entwickelten vergleichsweise stark.
ref auch die kombinationsmöglichkeiten von beruflicher und akademischer ausbildung
sind vergleichsweise stark ausgebaut.
input but his report pointed out other key checks were properly carried out across a
sample of five health authorities.
bl doch in seinem bericht darauf hingewiesen , anderen entscheidenden prüfungen
wurden ordnungsgemäß durchgeführt, in einer stichprobe von fünf gesundheit
behörden.
cont doch in seinem bericht darauf hingewiesen , andere entscheidende prüfungen wurden ordnungsgemäß durchgeführt, in einer stichprobe von fünf gesundheitsbehörden.
ref aber sein bericht wies darauf hin, dass andere wichtige kontrollen bei einer
stichprobe von fünf gesundheitsbehörden korrekt durchgeführt wurden.
Fluency
input in total 200, located in the most commercial zones of spain.
bl insgesamt 200, in der die meisten kommerziellen zonen von spanien.
cont insgesamt 200 in den meisten kommerziellen zonen von spanien.
ref insgesamt 200 davon sind in den wirtschaftsstärksten regionen spaniens angesiedelt.
input ”oops, sorry. i forgot to tell you that marketing was looking for you,” says one of
your colleagues.
bl ”hoppla, leid. ich vergessen habe, ihnen sagen zu können, dass die vermarktung
hat sie,” sagt einem ihrer kollegen.
cont ”hoppla, es tut mir leid. ich vergaß zu sagen, dass die vermarktung hat sie,” sagt
einem ihrer kollegen.
ref ”sorry, ich habe vergessen dir auszurichten, dass jemand von der marketingabteilung angerufen hat” - sagt ein kollege.
input the number of violent acts rose 27 percent compared with last year, as much as
60 percent in the southern helmand province.
bl die zahl der gewalttaten stieg um 27 % im vergleich zum letzten jahr, als 60
prozent in den südlichen helmand.
cont die zahl der gewaltakte stieg um 27 % im vergleich zum letzten jahr, so viel wie
60 prozent im südlichen helmand.
ref die anzahl von gewaltaktionen ist im vergleich zum vorjahr um 27 prozent
gestiegen, in der südlichen provinz helmand sogar um sechzig prozent.
input also arrested was a secretary of the consortium and the head of the municipal
police.
bl ebenfalls verhaftet war ein sekretär des konsortiums und der polizeichef.
cont ebenfalls verhaftet wurde ein sekretär des konsortiums und der polizeichef.
ref außerdem wurden noch eine der sekretärinnen des gemeinderats und der chef der
gemeindepolizei festgenommen.
input elected for the first time in 2000 and for a second time in 2004, vladimir putin
can not afford a third mandate.
bl zum ersten mal gewählt in 2000 und für ein zweites mal im jahr 2004, wladimir
putin nicht leisten können, eine dritte amtszeit.
cont zum ersten mal gewählt in 2000 und für ein zweites mal im jahr 2004, wladimir
putin kann es sich nicht leisten, eine dritte amtszeit.
ref da er 2000 gewählt und 2004 wiedergewählt wurde, kann wladimir putin nicht
für ein drittes mandat in folge antreten.
input fully 139 raf images have been uploaded onto the internet and so far over they have received over 75,000 hits.
bl voll 139 raf bilder haben uploaded auf das internet und so weit über sie haben über 75 zuschlägt.
cont voll 139 raf bilder wurden uploaded auf das internet und so weit über sie habe über 75.000 zuschlägt.
ref soldaten der luftwaffe haben bereits 139 solche aufnahmen beigesteuert, ihr ”kanal” wurde mehr als 75.000-mal angesehen.
A.2 Baseline better than contextually conditioned system
Missing words
input the offices, that will even have meeting rooms for smes, will not sell financial products.
bl die büros, haben sogar sitzungssälen für kmu, wird nicht verkaufen finanzprodukten.
cont die büros, das wird sogar sitzungsräumen für kmu, wird nicht verkaufen finanzprodukten.
ref die niederlassungen, die sogar mit zusätzlichen räumlichkeiten für die kmu ausgestattet sein werden, werden keine finanziellen produkte verkaufen.
input b.zs. : that keeps changing with every concert, especially now that i’ve already let go of the cd.
bl b.zs. : dass sich ständig verändernden mit jeder konzert, vor allem jetzt, da habe ich bereits die cd.
cont hält sich mit jedem konzert, vor allem jetzt, da habe ich bereits die cd.
ref b. zs.: das verändert sich mit den konzerten laufend und auch jetzt, wo ich das album ”losgelasssen” habe.
input havana became a storage and transfer centre for the treasures the spaniards had stolen in america and transferred to europe.
bl havanna wurde ein lagerung und übertragung zentrum für die reichtümer der spanier hätten gestohlenen in amerika und in europa.
cont havanna zu einem lagerung und übertragung zentrum für die schätze die spanier hätten gestohlenen in amerika und in europa.
ref havanna wurde einst zu einem lager oder durchgangslager von schätzen, die die spanier in amerika geraubt hatten und dann nach europa transportierten.
Word choice
input what country, exactly, does the federal chancellor live in?
bl welches land, genau die bundeskanzlerin leben in?
cont welches land genau plant die bundeskanzlerin leben?
ref in welchem land lebt diese bundeskanzlerin eigentlich?
input the basic price of the car with a 1.6 petrol engine of 329,900 czech crowns is not that bad.
bl der preis für das auto mit einem 1,6 benzin motor der 329,900 tschechischen kronen ist nicht schlecht.
cont die grundlegenden preis für das auto mit einem 1,6 benzin motor der 329,900 tschechischen kronen ist nicht so schlimm.
ref der grundpreis des autos mit dem 1400er benzinmotor in höhe von 329900 kronen geht noch an.
input signs of strain are clearly visible.
bl anzeichen einer belastung sind deutlich sichtbar.
cont anzeichen einer belastung sind eindeutig sichtbar.
ref anzeichen der anspannung sind deutlich sichtbar.
input around 1973, however, rother followed the cluster musicians into exile, to forst, in the weser highland.
bl um jedoch 1973, rother folgte die cluster musiker ins exil, forst, in der weser hochland.
cont etwa 1973, aber rother folgte das forstindustrie-cluster musiker ins exil, forst, in der weser hochland.
ref um 1973 folgte rother allerdings den cluster-musikanten ins exil nach forst im weserbergland.
Fluency
input belgrade fears a domino effect in a fragile region, weakened by the independence wars of the 1990s.
bl belgrad befürchtet eine dominoeffekt in einer instabilen region, geschwächt durch die unabhängigkeit kriege der 90er jahre.
cont belgrad befürchtungen ein dominoeffekt in einer instabilen region, geschwächt durch die unabhängigkeit kriege der 90er jahre.
ref belgrad meint, dass es besonders einen ”domino-effekt” in einer region fürchet, die noch so sehr durch die unabhängigkeitskriege der 90er jahre geschwächt ist.
input one of the local residents even classified the quarrels with eastern european immigrants as a fight for survival.
bl eine der anwohner sogar stuften die auseinandersetzungen mit den osteuropäischen migranten als einen kampf ums überleben.
cont eines der anwohner sogar stuften die auseinandersetzungen mit den osteuropäischen migranten als kampf ums überleben.
ref einer der einheimischen bezeichnete den streit mit den zuwanderern aus osteuropa sogar als überlebenskampf.
input those who are in power are not so bad, why changing them?
bl diejenigen, die an der macht sind nicht so schlecht, warum sie zu verändern?
cont diejenigen, die an der macht sind nicht so schlecht, warum sich verändernden ihnen?
ref diejenigen, die an der macht sind, sind nicht so schlecht, warum sie ablösen?
input still he knew he would never earn as much as the others.
bl noch immer er wusste er nie verdienen wie die anderen.
cont er wusste er sie nie verdienen wie die anderen.
ref dennoch wusste er, dass er nie so viel verdienen werde wie andere.
input russia has the world’s largest stocks of gas, but much of it remains under-developed.
bl russland hat die weltweit größten bestände von gas, aber vieles bleibt noch unterentwickelt.
cont russland hat die weltweit größte bestände von gas, aber vieles bleibt es unterentwickelt.
ref russland hat die weltweit größten gasvorkommen, aber viel davon bleibt unterentwickelt.