CLT310: Word-Based SMT

Jörg Tiedemann
University of Helsinki

Road Map
• Introduction to SMT and Word-based Models
• Parallel Corpora and Alignment
  – How to prepare training data
  – How to align a parallel corpus
• Phrase-Based SMT
• Decoding and Tuning
• ...
Goals for Today
• Hand-Crafted vs. Data-Driven MT
• High-level introduction to Statistical MT
• Language Modeling
• Word-Based Translation Models

Using course material from Philipp Koehn, Kevin Knight and Chris Callison-Burch

Hand-Crafted vs. Data-Driven MT

Hand-crafted systems encode translation knowledge manually, for instance as transfer rules between source-language and target-language structures (or via an interlingua):

VP → V PP[+Goal]  (source language)   ⇒   VP → PP[+Goal] V  (target language)

3.3 Synchronous Context Free Grammar (SCFG)

SMT systems often produce ungrammatical and disfluent translation hypotheses. This is mainly because of language models that only look at local contexts (n-gram models). Syntax-based approaches to machine translation aim at fixing the problem of producing such disfluent outputs. Such syntax-based systems may take advantage of Synchronous Context Free Grammars (SCFG): mappings between two context free rules that specify how source language phrase structures correspond to target language phrase structures. For example, the following basic rule maps noun phrases in the source and target languages, and defines the order of the daughter constituents (Noun and Adjective) in both languages:

NP::NP [Det ADJ N] [Det N ADJ]

This rule maps English NPs to French NPs, stating that an English NP is constructed from the daughters Det, ADJ and N, while the French NP is constructed from the daughters Det, N and ADJ, in this order. Figure 3a demonstrates translation using transfer of such derivation trees from English to French using this SCFG rule and lexical rules.

[Figure 3a – translation of the English NP "the red house" into the French NP "la maison rouge" using SCFG: the English tree NP → Det ADJ N (the, red, house) is transferred to the French tree NP → Det N ADJ (la, maison, rouge).]

Data-Driven Machine Translation

Learn how to translate from translated documents (training data); apply the learned models during decoding.

"Every time I fire a linguist, the performance goes up!"

"Hmm, every time he sees 'banco', he either types 'bank' or 'bench', but if he sees 'banco de', he always types 'bank', never 'bench'."
"Man, this is so boring."
Coverage and Robustness

It is difficult to cover all concepts in all contexts!

• Look , I hate plum jam .
  Jag avskyr plommonsylt .
• It doesn' t break , jam , or overheat .
  Den går inte sönder , fastnar eller överhettar aldrig .
• If you get into a jam , you can call me .
  Om du hamnar i knipa , kan du ringa mig .
• Prepare to jam with the bearded clam .
  Förbered dig på att festa med klanen .
• This is our jam , ladies .
  Det är vår låt , damer .
• Seems like you got yourself in a jam , huh ?
  Du har hamnat i en riktig smet , va ?
Learning MT from Example Translations

Idea 1: Translation by analogy: EBMT (lazy learning)

Translate: "He buys a book on international politics."

Database of Example Translations:
1. He buys a notebook.
   Kare wa nõto o kau.
   HE topic NOTEBOOK obj BUY.
2. I read a book on international politics.
   Watashi wa kokusai seiji nitsuite kakareta hon o yomu.
   I topic INTERNATIONAL POLITICS ABOUT CONCERNED BOOK obj READ.

Output: "Kare wa kokusai seiji nitsuite kakareta hon o kau."

Idea 2: Learn MT models from data: SMT (eager learning)
• translation models with language-specific parameters
• train model parameters & apply to unseen data

MT as Decoding

Learn the unknown code from translated documents:

"When I look at an article in Russian, I say: This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode."
[Weaver, 1947, 1949]

Statistical Machine Translation

Translation = decoding: find the most probable English sentence e given a Chinese input sentence c.

e* = argmax_e P(e|c)

Example translation (Chinese → English): "The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport."

What are the problems with this setup?
• Search problem (decoding): running over all possible English sentences to find the best translation given our Chinese sentence!
• How do we know the probability that an English sentence is a good translation of a Chinese one?

Statistical Machine Translation: Overview

Modeling Translation
• word-based models (today)
• phrase-based models (later)
• tree-based models (later)

Learning Model Parameters
• sentence and word alignment
• phrase extraction and scoring
• tuning

Decoding
Statistical MT Systems

Spanish/English Bilingual Text  →  Statistical Analysis  →  Translation Model p(s|e)
English Text                    →  Statistical Analysis  →  Language Model p(e)

Spanish input:    Que hambre tengo yo
Broken English:   What hunger have I / Hungry I am so / I am so hungry / Have I that hunger …
English output:   I am so hungry

Decoding algorithm:  argmax_e p(e) · p(s|e)

To estimate the probabilities for a trigram language model, we need to collect statistics for three-word sequences from large amounts of text. In statistical machine translation, we use the English side of the parallel corpus, but may also include additional text resources. Chapter 7 has more detail on how language models are built.

Noisy Channel Model

How can we combine a language model and our translation model? Recall that we want to find the best translation e for an input sentence f. We now apply the Bayes rule to include p(e):

argmax_e p(e|f) = argmax_e p(f|e) p(e) / p(f)
                = argmax_e p(f|e) p(e)                                   (4.23)
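To make the noisy-channel objective concrete, here is a minimal sketch (not part of the course material) that scores a handful of candidate English strings with hypothetical translation-model and language-model probabilities and picks the argmax; the probability values and the tiny candidate list are invented purely for illustration.

```python
import math

# Hypothetical model scores for the Spanish input "Que hambre tengo yo".
# In a real system p(f|e) and p(e) come from trained models; these numbers
# are made up purely to illustrate argmax_e p(f|e) * p(e).
candidates = {
    "what hunger have i": {"tm": 0.30, "lm": 0.000001},
    "hungry i am so":     {"tm": 0.20, "lm": 0.000010},
    "i am so hungry":     {"tm": 0.15, "lm": 0.000300},
    "have i that hunger": {"tm": 0.25, "lm": 0.000002},
}

def noisy_channel_score(scores):
    # Work in log space to avoid underflow: log p(f|e) + log p(e)
    return math.log(scores["tm"]) + math.log(scores["lm"])

best = max(candidates, key=lambda e: noisy_channel_score(candidates[e]))
print(best)  # "i am so hungry": fluent (high LM score) and adequate (decent TM score)
```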
Note that, mathematically, the translation direction has changed from p(e|f) to p(f|e). This may create a lot of confusion, since the concept of what constitutes the source language differs between the mathematics of the model and the actual application. We try to stay away from this confusion as much as possible by sticking to the notation p(e|f) when formulating a translation model.

Combining a language model and translation model this way is called the noisy channel model. This method is widely used in speech recognition, but can be traced back to early information theory. Shannon [Shannon, 1948] developed the noisy channel model for his work on error correcting codes. In this scenario, a message is passed through a noisy channel, such as a low-quality telephone line, from a source S to a receiver R. See Figure 4.6 for an illustration. The receiver only gets a corrupted message. The challenge is now to reconstruct the original message using knowledge about possible source messages.

Translation Model (prefer adequate translations)
• p(das haus ist klein | the house is small) >
• p(das haus ist klein | the building is tiny) >
• p(das haus ist klein | the shell is low)

Language Model (prefer grammatical / fluent strings)
• p(the house is small) > p(small the is house)
Language Modeling

Statistical Language Models

One essential component of any statistical machine translation system is the language model, which measures how likely a sequence of words may be uttered by an English speaker. It is easy to see the benefits of such a model. Obviously, we want a machine translation system to not only produce output words that are true to the original in meaning, but string them together in fluent English sentences.

Prefer one string over another (ensure fluency):
• "small step": 5,880,000 hits on Google
• "little step": 1,780,000 hits on Google

In fact, the language model typically does much more than just enable fluent output. It supports difficult decisions in word order and word translation. For instance, a probabilistic language model plm should prefer correct word order over incorrect word order, and the more natural word choice in context:

plm(the house is small) > plm(small the is house)                        (7.1)
plm(I am going home) > plm(I am going house)                             (7.2)

Formally, a language model is a function that takes an English sentence and returns a probability for how likely it is produced by an English speaker. According to the example above, it is more likely that an English speaker utters the sentence the house is small than the sentence small the is house. Hence, a good language model plm assigns a higher probability to the first sentence.

This preference of the language model aids a statistical machine translation system to find the right word order. Another area where the language model aids translation is word choice. If a foreign word (such as the German Haus) has multiple translations (house, home, ...), lexical translation probabilities already give preference to the more common translation (here: house). But in specific contexts, other translations are preferred. Again, the language model steps in: it gives higher probability to the most natural word choice in context.

This chapter presents the dominant language modelling methodology, n-gram language models, along with smoothing and back-off methods that address issues of data sparseness: the problems that arise from having to build models with limited training data.

7.1 N-Gram Language Models

The leading method for language models is n-gram language modelling. N-gram language models are based on statistics of how likely words are to follow each other. Recall the last example: if we analyze a large amount of text, we will observe that the word home follows the word going more often than the word house does. We will be exploiting such statistics.

Formally, in language modelling, we want to compute the probability of a string W = w1, w2, ..., wn. Intuitively, p(W) is the probability that if we pick a sequence of English words at random (by opening a random book or magazine at a random place, or by listening to a conversation) it turns out to be W.

How can we compute p(W)? The typical approach to statistical estimation calls for first collecting a large amount of text and counting how often W occurs in it. However, most long sequences of words will not occur in the text at all. So we have to break down the computation of p(W) into smaller steps, for which we can collect sufficient statistics and estimate probability distributions. Dealing with sparse data that limits us in collecting sufficient statistics to estimate reliable probability distributions is the fundamental problem in language modelling. In this chapter, we will examine methods that address this issue.

7.1.1 Markov Chain

In n-gram language modelling, we break up the process of predicting a word sequence W into predicting one word at a time. We first decompose the probability using the chain rule:

p(w1, w2, ..., wn) = p(w1) p(w2|w1) ... p(wn|w1, ..., wn−1)              (7.3)

The language model probability p(w1, w2, ..., wn) is a product of word probabilities given a history of preceding words. To be able to estimate these word probability distributions, we limit the history to m words:

p(wn|w1, w2, ..., wn−1) ≈ p(wn|wn−m, ..., wn−2, wn−1)                    (7.4)

This type of model, where we step through a sequence (here: a sequence of words) and consider for the transitions only a limited history, is called a Markov chain. The number of previous states (here: words) is the order of the model. The Markov assumption states that only a limited number of previous words affect the probability of the next word. It is technically wrong, and it is not too hard to come up with counter-examples that demonstrate that a longer history is needed. However, limited data restrict the collection of reliable statistics to short histories.

Typically, we choose the actual number of words in the history based on how much training data we have. More training data allows for longer histories. Most commonly, trigram language models are used. They consider a two-word history to predict the third word. This requires the collection of statistics over sequences of three words, so-called 3-grams (trigrams). Language models may also be estimated over 2-grams (bigrams), single words (unigrams), or any other order of n-grams.

7.1.2 Estimation

In its simplest form, the estimation of trigram word prediction probabilities p(w3|w1, w2) is straightforward. We count how often the sequence w1, w2 is followed by the word w3 in our training corpus, as opposed to other words. According to maximum likelihood estimation, we compute:

p(w3|w1, w2) = count(w1, w2, w3) / Σw count(w1, w2, w)                   (7.5)

Figure 7.1 gives some examples of how this estimation works on real data, in this case the European Parliament corpus. We consider three different histories: the green, the red, and the blue. The words that most frequently follow are quite different for the different histories.
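As a concrete illustration of Equation (7.5), here is a small sketch (not part of the original slides) that collects trigram counts from a toy corpus and computes maximum likelihood estimates; the toy sentences are invented.

```python
from collections import defaultdict

# Toy corpus; a real language model would be estimated from the English side
# of the parallel corpus (plus additional monolingual text).
corpus = [
    "i am going home",
    "i am going to the house",
    "we are going home",
]

trigram_counts = defaultdict(int)   # count(w1, w2, w3)
history_counts = defaultdict(int)   # sum_w count(w1, w2, w)

for sentence in corpus:
    # Pad with sentence-start/end markers so boundary words get a history too.
    words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        trigram_counts[(w1, w2, w3)] += 1
        history_counts[(w1, w2)] += 1

def p_mle(w3, w1, w2):
    # Maximum likelihood estimate: count(w1,w2,w3) / count(w1,w2,*)
    if history_counts[(w1, w2)] == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / history_counts[(w1, w2)]

print(p_mle("home", "am", "going"))   # 0.5 (1 of the 2 continuations of "am going")
print(p_mle("to", "am", "going"))     # 0.5
```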
Add sentence start and end-of-sentence markers!
• certain tokens appear often at the start or at the end of a sentence

Smoothing
• big problem: unseen words / n-grams → p(e) = 0
• smoothing = reserve probability mass for unseen events

Interpolation and Backoff
• combine higher-order and lower-order models

We then build an interpolated language model pI by linear combination of the language models pn:

pI(w3|w1, w2) = λ1 p1(w3) + λ2 p2(w3|w2) + λ3 p3(w3|w1, w2)              (7.29)

Each n-gram language model pn contributes its assessment, which is weighted by a factor λn. With a lot of training data, we can trust the higher-order models more and give them more weight.
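A minimal sketch of such linear interpolation (not from the slides): it combines hypothetical unigram, bigram, and trigram MLE estimates with fixed weights λ. In practice the component models come from the estimation step above, and the weights are tuned on held-out data.

```python
# Hypothetical component probabilities for predicting w3 = "home"
# after the history (w1, w2) = ("am", "going"); values are illustrative only.
p_unigram = 0.001   # p1(home)
p_bigram  = 0.2     # p2(home | going)
p_trigram = 0.5     # p3(home | am, going)

# Interpolation weights; must sum to 1 and would normally be tuned on held-out data.
lambdas = (0.1, 0.3, 0.6)

def interpolate(p1, p2, p3, lams):
    l1, l2, l3 = lams
    return l1 * p1 + l2 * p2 + l3 * p3

print(interpolate(p_unigram, p_bigram, p_trigram, lambdas))  # 0.3601
```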
Word-Based Translation Models

Generative model: the words of one sentence are generated by the words of its translation. Here we model p(e, a|f): the English sentence e is generated from the foreign sentence f together with an alignment a.

Reordering
• Words may be reordered during translation

  klein ist das Haus      1 2 3 4
  the house is small      1 2 3 4
  a : {1 → 3, 2 → 4, 3 → 2, 4 → 1}

One-to-many translation
• A source word may translate into multiple target words

  das Haus ist klitzeklein     1 2 3 4
  the house is very small      1 2 3 4 5
  a : {1 → 1, 2 → 2, 3 → 3, 4 → 4, 5 → 4}

Statistical Word Alignment Models
• the observed English word at position j is generated by the foreign word at position i
• introduce an alignment function a : j → i
• alignment model: p(e, a|f)

Translation: decode what kind of word sequence has generated the foreign string.

When applying the model to the data, we need to compute the probability of different alignments given a sentence pair in the data. To put this more formally, we need to compute p(a|e, f), the probability of an alignment given the English and the foreign sentence. Applying the chain rule (recall Section 3.2.3 on page 76) gives us

p(a|e, f) = p(e, a|f) / p(e|f)

We still need to derive p(e|f), the probability of translating the sentence f into e with any alignment. Sum over all possible ways of generating the string:

p(e|f) = Σ_a p(e, a|f)
       = Σ_{a(1)=0..lf} ... Σ_{a(le)=0..lf} p(e, a|f)
       = Σ_{a(1)=0..lf} ... Σ_{a(le)=0..lf}  ε / (lf + 1)^le  ∏_{j=1..le} t(ej | f_a(j))

(using the IBM Model 1 form of p(e, a|f) introduced below; each English position j may align to any foreign position 0, ..., lf, where position 0 is the special NULL token).
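To make the sum over alignments concrete, here is a small sketch (not from the slides) that enumerates every alignment function for a toy sentence pair and sums the term ε/(lf+1)^le · ∏ t(ej|f_a(j)); the tiny t-table values are placeholders. This brute force runs in O((lf+1)^le) time and is only feasible for very short sentences; the efficient factorization is shown at the end of the lecture.

```python
from itertools import product

# Placeholder lexical translation probabilities t(e|f); NULL is foreign position 0.
t = {
    ("the", "das"): 0.7, ("house", "Haus"): 0.8,
    ("is", "ist"): 0.8, ("small", "klein"): 0.4,
    ("the", "NULL"): 0.01, ("house", "NULL"): 0.01,
    ("is", "NULL"): 0.01, ("small", "NULL"): 0.01,
}

def t_prob(e_word, f_word):
    # Unlisted pairs get a tiny probability so every alignment is possible.
    return t.get((e_word, f_word), 1e-4)

def p_e_given_f_bruteforce(e, f, epsilon=1.0):
    f_with_null = ["NULL"] + f
    lf, le = len(f), len(e)
    norm = epsilon / (lf + 1) ** le
    total = 0.0
    # Each English position j may align to any foreign position 0..lf.
    for alignment in product(range(lf + 1), repeat=le):
        prob = norm
        for j, i in enumerate(alignment):
            prob *= t_prob(e[j], f_with_null[i])
        total += prob
    return total

e = "the house is small".split()
f = "das Haus ist klein".split()
print(p_e_given_f_bruteforce(e, f))  # sums over 5^4 = 625 alignments
```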
Translation Model Parameters

Lexical Translations
• das → the
• haus → house, home, building, household, shell
• ist → is
• klein → small, low

Multiple translation options?
• use the most common one in that context
• learn translation probabilities from example data

Parallel Corpus

Collection of example translations (parallel corpus):

• what is more , the relevant cost dynamic is completely under control .
  im übrigen ist die diesbezügliche kostenentwicklung völlig unter kontrolle .
• sooner or later we will have to be sufficiently progressive in terms of own resources as a basis for this fair tax system .
  früher oder später müssen wir die notwendige progressivität der eigenmittel als grundlage dieses gerechten steuersystems zur sprache bringen .
• we plan to submit the first accession partnership in the autumn of this year .
  wir planen , die erste beitrittspartnerschaft im herbst dieses jahres vorzulegen .
• it is a question of equality and solidarity .
  hier geht es um gleichberechtigung und solidarität .
• the recommendation for the year 1999 has been formulated at a time of favourable developments and optimistic prospects for the european economy .
  die empfehlung für das jahr 1999 wurde vor dem hintergrund günstiger entwicklungen und einer für den kurs der europäischen wirtschaft positiven perspektive abgegeben .
• that does not , however , detract from the deep appreciation which we have for this report .
  im übrigen tut das unserer hohen wertschätzung für den vorliegenden bericht keinen abbruch .

Context-Independent Models

Find the most probable English sentence given a foreign-language sentence:

ê = argmax_e p(e|f) = argmax_e p(e) p(f|e) / p(f) = argmax_e p(e) p(f|e)

Collect Statistics

Look at a parallel corpus (German text along with English translation) and count translation statistics:
• How often is Haus translated into ... ?

  Translation of Haus    Count
  house                  8,000
  building               1,600
  home                     200
  household                150
  shell                     50

Estimate Translation Probabilities

Maximum likelihood estimation (MLE):

t(e|f) = count(e, f) / Σ_e' count(e', f)

For f = Haus:

t(e|Haus) = 0.8     if e = house
            0.16    if e = building
            0.02    if e = home
            0.015   if e = household
            0.005   if e = shell

Special Cases in Word Alignment

Dropping words
• Words may be dropped when translated
  – The German article das is dropped

  das Haus ist klein      1 2 3 4
  house is small          1 2 3
  a : {1 → 2, 2 → 3, 3 → 4}

Inserting words
• Words may be added during translation
  – The English just does not have an equivalent in German
  – We still need to map it to something: introduce a special NULL token (foreign position 0)

  NULL das Haus ist klein      0 1 2 3 4
  the house is just small      1 2 3 4 5
  a : {1 → 1, 2 → 2, 3 → 3, 4 → 0, 5 → 4}
IBM Model 1

• Generative model: break up the translation process into smaller steps
  – IBM Model 1 only uses lexical translation
• Translation probability
  – for a foreign sentence f = (f1, ..., f_lf) of length lf
  – to an English sentence e = (e1, ..., e_le) of length le
  – with an alignment of each English word ej to a foreign word fi according to the alignment function a : j → i

p(e, a|f) = ε / (lf + 1)^le · ∏_{j=1..le} t(ej | f_a(j))

• the parameter ε is a normalization constant
• only lexical translation probabilities t(e|f) as parameters

IBM Model 1: Example

Lexical translation probability tables t(e|f):

  das:   e       t(e|f)      Haus:   e          t(e|f)
         the     0.7                 house      0.8
         that    0.15                building   0.16
         which   0.075               home       0.02
         who     0.05                household  0.015
         this    0.025               shell      0.005

  ist:   e       t(e|f)      klein:  e          t(e|f)
         is      0.8                 small      0.4
         's      0.16                little     0.4
         exists  0.02                short      0.1
         has     0.015               minor      0.06
         are     0.005               petty      0.04

Example: das Haus ist klein → the house is small, aligning each English word to the foreign word in the same position:

p(e, a|f) = ε/5⁴ × t(the|das) × t(house|Haus) × t(is|ist) × t(small|klein)
          = ε/5⁴ × 0.7 × 0.8 × 0.8 × 0.4
          ≈ 0.00029 ε
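A minimal sketch (not part of the slides) of computing the Model 1 probability p(e, a|f) for the example above, using the t-table values from the slide and ε = 1; the helper function and its name are illustrative only.

```python
# Lexical translation probabilities from the example tables above.
t = {
    ("the", "das"): 0.7, ("house", "Haus"): 0.8,
    ("is", "ist"): 0.8, ("small", "klein"): 0.4,
}

def ibm1_prob(e_words, f_words, alignment, epsilon=1.0):
    """p(e, a|f) = epsilon / (lf + 1)^le * prod_j t(e_j | f_a(j)).
    alignment[j] is the foreign position (1-based, 0 = NULL) for English position j+1."""
    lf, le = len(f_words), len(e_words)
    prob = epsilon / (lf + 1) ** le
    f_with_null = ["NULL"] + f_words
    for j, i in enumerate(alignment):
        prob *= t.get((e_words[j], f_with_null[i]), 0.0)
    return prob

e = "the house is small".split()
f = "das Haus ist klein".split()
a = [1, 2, 3, 4]            # each English word aligned to the foreign word in the same position
print(ibm1_prob(e, f, a))   # 0.1792 / 625 ≈ 0.000287, i.e. ≈ 0.00029 * epsilon
```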
IBM Model 2

As a consequence of ignoring word positions, according to IBM Model 1 the translation probabilities for the following two alternative translations are the same:

  natürlich ist das haus klein  →  of course the house is small
  natürlich ist das haus klein  →  the course small is of house

What does this mean for p(e|f) = Σ_a p(e, a|f)? The following are all equal:
• p( the house is small | das Haus ist klein )
• p( the house is small | das Haus klein ist )
• p( the house is small | ist Haus das klein )

IBM Model 2 addresses the issue of alignment with an explicit model for alignment based on the positions of the input and output words. The translation of a foreign input word in position i to an English word in position j is modelled by an alignment probability distribution

a(i | j, le, lf)                                                         (4.24)

Recall that the length of the input sentence f is denoted as lf, and the length of the output sentence e is le. We can view translation under IBM Model 2 as a two-step process with a lexical translation step and an alignment step:

  lexical translation step:  natürlich ist das haus klein  →  of course is the house small
  alignment step:            of course is the house small  →  of course the house is small

The first step is lexical translation as in IBM Model 1, again modelled by the translation probability t(e|f). The second step is the alignment step. For instance, translating ist into is has a lexical translation probability of t(is|ist) and an alignment probability of a(2|5, 6, 5) — the 5th English word is aligned to the 2nd foreign word. The two steps are combined mathematically to form IBM Model 2:

p(e, a|f) = ε ∏_{j=1..le} t(ej | f_a(j)) · a(a(j) | j, le, lf)           (4.25)

Note that the alignment function a maps each English output word j to a foreign input position a(j), and the alignment probability distribution is also set up in this reverse direction.

Fortunately, adding the alignment probability distribution does not make EM training much more complex. Recall that the number of possible word alignments for a sentence pair is exponential in the number of words; as for IBM Model 1, the computation can be reduced to a product of sums.
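A small sketch (not from the slides) of the Model 2 probability in Equation (4.25), extending the Model 1 helper with a hypothetical alignment distribution a(i|j, le, lf); all parameter values are invented for illustration.

```python
# Hypothetical parameters for translating "natürlich ist das haus klein" (lf=5)
# into "of course the house is small" (le=6); values are illustrative only.
t = {("of", "natürlich"): 0.4, ("course", "natürlich"): 0.4,
     ("the", "das"): 0.7, ("house", "haus"): 0.8,
     ("is", "ist"): 0.8, ("small", "klein"): 0.4}

# a[(i, j)] = a(i | j, le=6, lf=5): probability that English position j
# is aligned to foreign position i (sentence lengths fixed for this example).
a = {(1, 1): 0.6, (1, 2): 0.5, (3, 3): 0.5, (4, 4): 0.6, (2, 5): 0.5, (5, 6): 0.6}

def ibm2_prob(e_words, f_words, alignment, epsilon=1.0):
    """p(e, a|f) = epsilon * prod_j t(e_j|f_a(j)) * a(a(j)|j, le, lf)."""
    prob = epsilon
    for j, i in enumerate(alignment, start=1):   # j, i are 1-based positions
        prob *= t[(e_words[j - 1], f_words[i - 1])]
        prob *= a[(i, j)]
    return prob

e = "of course the house is small".split()
f = "natürlich ist das haus klein".split()
alignment = [1, 1, 3, 4, 2, 5]   # e.g. "is" (j=5) aligns to "ist" (i=2): t(is|ist) * a(2|5,6,5)
print(ibm2_prob(e, f, alignment))
```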
Interlude: HMM Model

Motivation: words do not move independently of each other; they often move in groups.
→ condition word movements on the previous word

HMM alignment model
• add a condition on the alignment of the previous word
• different alignment parameter: p(a(j) | a(j−1), le)
• a good replacement for IBM Model 2
• applying the EM algorithm is harder and requires dynamic programming
• IBM Model 4 is similar, and also conditions on word classes
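As a sketch of the HMM idea (not from the slides): the alignment probability depends only on the jump width from the previous alignment point, p(a(j) | a(j−1), le) ∝ c(a(j) − a(j−1)). The jump counts below are invented, and normalizing over the foreign positions of the current sentence is a simplification.

```python
# Hypothetical relative-jump counts c(d): how often the alignment moves
# d foreign positions between consecutive English words (d = a(j) - a(j-1)).
jump_counts = {-2: 5, -1: 10, 0: 20, 1: 100, 2: 30, 3: 10}

def hmm_jump_prob(i, i_prev, lf):
    """p(a(j) = i | a(j-1) = i_prev), normalized over the foreign positions 1..lf."""
    def c(d):
        return jump_counts.get(d, 1)   # small default count for unseen jumps
    norm = sum(c(k - i_prev) for k in range(1, lf + 1))
    return c(i - i_prev) / norm

# Moving one position forward (a monotone step) is much more likely than a long jump back:
print(hmm_jump_prob(3, 2, lf=5))   # jump +1
print(hmm_jump_prob(1, 4, lf=5))   # jump -3
```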
IBM Model 3

So far we have not explicitly modeled how many words are generated from each input word. In most cases, a German word translates to one single English word. However, some German words like zum typically translate to two English words, i.e., to the. Others, such as the flavoring particle ja, get dropped.

Motivation:
• Some words tend to be aligned to multiple words (German "zum" translates to "to the")
• Other words tend to be unaligned, i.e., dropped (German "ja", used as a filler)
→ Add a parameter modeling this property: fertility

We now want to model the fertility of input words directly with a probability distribution

n(φ | f)                                                                 (4.29)

For each foreign word f, this probability distribution indicates how many (φ = 0, 1, 2, ...) output words it usually translates to. Returning to the examples above, we expect, for instance, that

n(1|haus) ≈ 1        n(2|zum) ≈ 1        n(0|ja) ≈ 1                     (4.30)

Fertility deals explicitly with dropping input words by allowing φ = 0. But there is also the issue of adding words. Recall that we introduced the NULL token to account for words in the output that have no correspondent in the input. For instance, the English word do is often inserted when translating verbal negations into English. In the IBM models these added words are generated by the special NULL token.

We could model the fertility of the NULL token the same way as for all the other words by the conditional distribution n(φ|NULL). However, the number of inserted words clearly depends on the sentence length, so we chose to model NULL insertion as a special step: after the fertility step, we introduce one NULL token with probability p1 after each generated word, or no NULL token with probability p0 = 1 − p1.

The addition of fertility and NULL token insertion increases the translation process in Model 3 to four steps. Compared to Model 2, the other differences are:
• an explicit NULL insertion parameter
• distortion instead of alignment

To review, each of the four steps is modeled probabilistically:
• Fertility is modeled by the distribution n(φ|f). For instance, the duplication of the German word zum has the probability n(2|zum).
• NULL insertion is modeled by the probabilities p1 and p0 = 1 − p1. For instance, the probability p1 is factored in for the insertion of NULL after ich, and the probability p0 is factored in for no such insertion after nicht.
• Lexical translation is handled by the probability distribution t(e|f) as in IBM Model 1. For instance, translating nicht into not is done with probability t(not|nicht).
• Distortion is modeled almost the same way as in IBM Model 2 with a probability distribution d(j|i, le, lf), which predicts the output English word position j based on the foreign input word position i and the respective sentence lengths. For instance, the placement of go as the 4th word of the 7-word English sentence as a translation of gehe, which was the 2nd word in the 6-word German sentence, has probability d(4|2, 7, 6).

Note that we called the last step distortion instead of alignment. We make this distinction since we can produce the same translation with the same alignment also in a different way.

IBM Model 4: Motivation
• Absolute positions for distortion feel wrong
• Words do not move independently; they often move in groups
• Some words tend to move and some do not
→ Introduce a relative distortion model
→ Introduce dependence on word classes
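A rough sketch (not from the slides) of how the four Model 3 factor types multiply together for the example sentence pair ich gehe ja nicht zum haus → I do not go to the house; all parameter values are invented, and the exact combinatorial factors of the full Model 3 formula are omitted to keep the illustration short.

```python
# Invented Model 3 parameters for: ich gehe ja nicht zum haus -> I do not go to the house
fertility = {("ich", 1): 0.9, ("gehe", 1): 0.8, ("ja", 0): 0.7,
             ("nicht", 1): 0.8, ("zum", 2): 0.6, ("haus", 1): 0.9}
p1, p0 = 0.05, 0.95                      # NULL insertion / no insertion
lexical = {("I", "ich"): 0.8, ("do", "NULL"): 0.3, ("not", "nicht"): 0.7,
           ("go", "gehe"): 0.6, ("to", "zum"): 0.5, ("the", "zum"): 0.4,
           ("house", "haus"): 0.8}
# Distortion d(j | i, le=7, lf=6): English position j for foreign position i.
distortion = {(1, 1): 0.5, (4, 2): 0.3, (3, 4): 0.4, (5, 5): 0.4, (6, 5): 0.3, (7, 6): 0.5}

score = 1.0
for key, prob in fertility.items():      # fertility step: n(phi | f)
    score *= prob
score *= p1 * p0 ** 5                    # one NULL inserted (for "do"), none after the other 5 words
for key, prob in lexical.items():        # lexical translation step: t(e | f)
    score *= prob
for key, prob in distortion.items():     # distortion step: d(j | i, le, lf)
    score *= prob

print(f"unnormalized Model 3 score ≈ {score:.2e}")
```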
IBM Model 4: Relative Distortion

• A cept πi is the set of English words aligned to the same foreign word fi (foreign words with non-zero fertility form cepts).
• The center ⊙i of a cept is the ceiling of the average English position of its words.

Example alignment:

  NULL ich gehe ja nicht zum haus
  I    do  not go to the house

  cept πi                π1    π2    π3     π4       π5
  foreign position [i]   1     2     4      5        6
  foreign word f[i]      ich   gehe  nicht  zum      haus
  English words {ej}     I     go    not    to,the   house
  English positions {j}  1     4     3      5,6      7
  center of cept ⊙i      1     4     3      6        7

How does this scheme using cepts determine distortion for English words? For each output word, we now define its relative distortion. We distinguish between three cases: (a) words that are generated by the NULL token, (b) the first word of a cept, and (c) subsequent words in a cept.

(a) Words generated by the NULL token are uniformly distributed.

(b) For the likelihood of placements of the first word of a cept, we use the probability distribution

d1(j − ⊙[i−1])                                                           (4.37)

Placement is defined as the English word position j relative to the center ⊙ of the preceding cept i−1. Refer to Figure 4.9 for an example: the word not, English word position j = 3, is generated by cept π3. The center of the preceding cept π2 is ⊙2 = 4. Thus we have a relative distortion of −1, reflecting the forward movement of this word. In the case of translation in sequence, distortion is always +1: house and I are examples of this.

(c) For the placement of subsequent words in a cept, we consider the placement of the previous word in the cept, using the probability distribution

d>1(j − πi,k−1)                                                          (4.38)

πi,k refers to the word position of the kth word in the ith cept. We have one example for this in Figure 4.9: the English word the is the second word in the 4th cept (German word zum); the preceding English word in the cept is to. The position of to is π4,0 = 5, the position of the is j = 6, thus we factor in d>1(+1).

  j             1       2     3       4       5       6        7
  ej            I       do    not     go      to      the      house
  in cept πi,k  π1,0    π0,0  π3,0    π2,0    π4,0    π4,1     π5,0
  ⊙i−1          0       –     4       1       3       –        6
  j − ⊙i−1      +1      –     −1      +3      +2      –        +1
  distortion    d1(+1)  1     d1(−1)  d1(+3)  d1(+2)  d>1(+1)  d1(+1)

Figure 4.9: Distortion in IBM Model 4. Foreign words with non-zero fertility form cepts (here 5 cepts), which contain English words ej. The center ⊙i of a cept πi is ceiling(avg(j)). The distortion of each English word is modeled by a probability distribution: (a) for NULL-generated words such as do: uniform, (b) for first words in a cept such as not: based on the distance between the word and the center of the preceding cept (j − ⊙i−1), (c) for subsequent words in a cept such as the: the distance to the previous word in the cept. The probability distributions d1 and d>1 are learnt from the data. They are conditioned using lexical information; see the text for details.

IBM Model 4: Word Classes

Some words trigger reordering: conditional distortion! We may want to introduce richer conditioning on distortion. For instance, some words tend to get reordered during translation, while others remain in order. A typical example for this is adjective–noun inversion when translating French to English. Adjectives get moved back when preceded by a noun. For instance: Affaires extérieures becomes external affairs.

We may be tempted to condition the distortion probability distributions in Equations 4.37–4.38 on the words ej and f[i−1]:

for the initial word in a cept:   d1(j − ⊙[i−1] | f[i−1], ej)
for additional words:             d>1(j − πi,k−1 | ej)                   (4.39)

However, we will find ourselves in a situation where, for most ej and f[i−1], we will not be able to collect sufficient statistics to estimate realistic probability distributions. Conditioning on words creates too many parameters: a sparse data problem in training.

To have both some of the benefits of a lexicalized model and sufficient statistics, Model 4 introduces word classes, induced automatically for the source and the target language. When we group the vocabulary of a language into, say, 50 classes, we can condition the probability distributions on these classes. The result is an almost lexicalized model, but with sufficient statistics. To put this more formally, we introduce two functions A(f) and B(e) that map words to their word classes. This allows us to reformulate Equation 4.39 into:

for the initial word in a cept:   d1(j − ⊙[i−1] | A(f[i−1]), B(ej))
for additional words:             d>1(j − πi,k−1 | B(ej))                (4.40)

IBM Model 5

IBM Models 1–4 are deficient:
• some impossible translations have positive probabilities
• multiple output words may be placed in the same position
• probability mass is wasted!

IBM Model 5
• fixes deficiency by keeping track of vacancies
• details: see the text book

Summary: IBM Models

We have presented with IBM Model 1 a model for machine translation. However, at closer inspection, the model has many flaws. The model is very weak in terms of reordering, as well as adding and dropping words. In fact, according to IBM Model 1, the best translation for any input is the empty string (see also Exercise 3 at the end of this chapter).

Five models of increasing complexity were proposed in the original work on statistical machine translation at IBM; higher models include more information. The advances of the five models are:
• IBM Model 1: lexical translation
• IBM Model 2: adds an absolute alignment model
• IBM Model 3: adds a fertility model
• IBM Model 4: relative alignment model
• IBM Model 5: fixes deficiency

Alignment models introduce an explicit model for reordering words in a sentence. More often than not, words that follow each other in one language have translations that follow each other in the output language. However, IBM Model 1 treats all possible reorderings as equally likely.

Fertility is the notion that input words produce a specific number of output words in the output language. Most often, of course, one word in the input language translates into one single word in the output language. But some words produce multiple words or get dropped (producing zero words). A model for the fertility of words addresses this aspect of translation.

Adding additional components increases the complexity of training the models, but the general principles that we introduced for IBM Model 1 stay the same: we define the models mathematically and then devise an EM algorithm for training. This incremental march towards more complex models does not only serve didactic purposes. All the IBM models are relevant, because EM training starts with the simplest Model 1 for a few iterations, and then proceeds through iterations of the more complex models all the way to IBM Model 5.

Training estimates the parameters from word-aligned data; the trained models can in turn be used for word alignment.

For IBM Model 1, the sum over all alignments can be computed efficiently. Let us put Equations 4.9 and 4.10 together:

p(e|f) = Σ_{a(1)=0..lf} ... Σ_{a(le)=0..lf} p(e, a|f)
       = Σ_{a(1)=0..lf} ... Σ_{a(le)=0..lf}  ε / (lf + 1)^le  ∏_{j=1..le} t(ej | f_a(j))     (4.10)
       = ε / (lf + 1)^le  ∏_{j=1..le}  Σ_{i=0..lf} t(ej | fi)

Note the significance of the last step: instead of performing a sum over an exponential number ((lf + 1)^le) of products, we reduced the computational complexity to linear in lf and le (since lf ≈ le, computing p(e|f) is roughly quadratic with respect to sentence length).

This also gives a simple form for the probability of an alignment given a sentence pair:

p(a|e, f) = p(e, a|f) / p(e|f)
          = [ ε/(lf+1)^le ∏_{j=1..le} t(ej|f_a(j)) ] / [ ε/(lf+1)^le ∏_{j=1..le} Σ_{i=0..lf} t(ej|fi) ]
          = ∏_{j=1..le}  t(ej | f_a(j)) / Σ_{i=0..lf} t(ej | fi)                             (4.11)

What is Next?

Assignment on word-based SMT
• manipulate translation models
• train and manipulate language models
• decode a simple test corpus

Upcoming lectures
• Parallel corpora and alignment
• Phrase-based SMT
• SMT decoding and tuning
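As a pointer for the assignment, here is a compact sketch (not part of the course material) of EM training for IBM Model 1 on a toy corpus. It uses the factorized alignment posterior p(a(j)=i | e, f) = t(ej|fi) / Σ_i' t(ej|fi') from Equation (4.11) to collect expected counts and re-estimate t(e|f); NULL handling and convergence checks are omitted to keep it short.

```python
from collections import defaultdict

# Toy sentence-aligned corpus (foreign, english).
corpus = [
    ("das Haus", "the house"),
    ("das Buch", "the book"),
    ("ein Buch", "a book"),
]

# Initialize t(e|f) uniformly over co-occurring words.
t = defaultdict(float)
f_vocab_for_e = defaultdict(set)
for f_sent, e_sent in corpus:
    for e in e_sent.split():
        for f in f_sent.split():
            f_vocab_for_e[e].add(f)
for e, fs in f_vocab_for_e.items():
    for f in fs:
        t[(e, f)] = 1.0 / len(fs)

for iteration in range(10):
    count = defaultdict(float)   # expected count(e, f)
    total = defaultdict(float)   # expected count(f)
    # E-step: distribute each English word over the foreign words that may have generated it.
    for f_sent, e_sent in corpus:
        f_words, e_words = f_sent.split(), e_sent.split()
        for e in e_words:
            norm = sum(t[(e, f)] for f in f_words)   # sum_i t(e|f_i)
            for f in f_words:
                c = t[(e, f)] / norm                 # alignment posterior p(a(j)=i | e, f)
                count[(e, f)] += c
                total[f] += c
    # M-step: re-estimate t(e|f) = expected count(e,f) / expected count(f).
    for (e, f) in count:
        t[(e, f)] = count[(e, f)] / total[f]

print(round(t[("house", "Haus")], 3))   # converges towards 1.0
print(round(t[("the", "das")], 3))      # "the" is pulled towards "das"
```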