CLT310: Word-Based SMT

Jörg Tiedemann
University of Helsinki

Road Map
• Introduction to SMT and Word-based Models
• Parallel Corpora and Alignment
  – How to prepare training data
  – How to align a parallel corpus
• Phrase-Based SMT
• Decoding and Tuning
• ...
Goals for Today
• Hand-Crafted vs. Data-Driven MT
• High-level introduction to Statistical MT
• Language Modeling
• Word-Based Translation Models

Using course material from Philipp Koehn, Kevin Knight and Chris Callison-Burch

Hand-Crafted vs. Data-Driven MT

Hand-crafted systems encode translation knowledge manually, for instance as transfer rules between source-language and target-language structures (or via an interlingua):

VP → V PP[+Goal]  (source language)   ⇒   VP → PP[+Goal] V  (target language)

3.3 Synchronous Context Free Grammar (SCFG)

SMT systems often produce ungrammatical and disfluent translation hypotheses. This is mainly because of language models that only look at local contexts (n-gram models). Syntax-based approaches to machine translation aim at fixing the problem of producing such disfluent outputs. Such syntax-based systems may take advantage of Synchronous Context Free Grammars (SCFG): mappings between two context free rules that specify how source language phrase structures correspond to target language phrase structures. For example, the following basic rule maps noun phrases in the source and target languages, and defines the order of the daughter constituents (Noun and Adjective) in both languages:

NP::NP [Det ADJ N] [Det N ADJ]

This rule maps English NPs to French NPs, stating that an English NP is constructed from the daughters Det, ADJ and N, while the French NP is constructed from the daughters Det, N and ADJ, in this order. Figure 3a demonstrates translation using transfer of such derivation trees from English to French using this SCFG rule and lexical rules.

[Figure 3a – translation of the English NP "the red house" into the French NP "la maison rouge" using SCFG: the English tree NP → Det ADJ N (the, red, house) is transferred to the French tree NP → Det N ADJ (la, maison, rouge).]

Data-Driven Machine Translation

Learn how to translate from translated documents (training data); apply the learned models during decoding.

"Every time I fire a linguist, the performance goes up!"

"Hmm, every time he sees 'banco', he either types 'bank' or 'bench', but if he sees 'banco de', he always types 'bank', never 'bench'."
"Man, this is so boring."
Coverage and Robustness

It is difficult to cover all concepts in all contexts!

• Look , I hate plum jam .
  Jag avskyr plommonsylt .
• It doesn' t break , jam , or overheat .
  Den går inte sönder , fastnar eller överhettar aldrig .
• If you get into a jam , you can call me .
  Om du hamnar i knipa , kan du ringa mig .
• Prepare to jam with the bearded clam .
  Förbered dig på att festa med klanen .
• This is our jam , ladies .
  Det är vår låt , damer .
• Seems like you got yourself in a jam , huh ?
  Du har hamnat i en riktig smet , va ?
Learning MT from Example Translations

Idea 1: Translation by analogy: EBMT (lazy learning)

Translate: "He buys a book on international politics."

Database of Example Translations:
1. He buys a notebook.
   Kare wa nõto o kau.
   HE topic NOTEBOOK obj BUY.
2. I read a book on international politics.
   Watashi wa kokusai seiji nitsuite kakareta hon o yomu.
   I topic INTERNATIONAL POLITICS ABOUT CONCERNED BOOK obj READ.

Output: "Kare wa kokusai seiji nitsuite kakareta hon o kau."

Idea 2: Learn MT models from data: SMT (eager learning)
• translation models with language-specific parameters
• train model parameters & apply to unseen data

MT as Decoding

Learn the unknown code from translated documents:

"When I look at an article in Russian, I say: This is really written in English, but it has been coded in some strange symbols. I will now proceed to decode."
[Weaver, 1947, 1949]

Statistical Machine Translation

Translation = decoding: find the most probable English sentence e given a Chinese input sentence c.

e* = argmax_e P(e|c)

Example translation (Chinese → English): "The U.S. island of Guam is maintaining a high state of alert after the Guam airport and its offices both received an e-mail from someone calling himself the Saudi Arabian Osama bin Laden and threatening a biological/chemical attack against public places such as the airport."

What are the problems with this setup?
• Search problem (decoding): running over all possible English sentences to find the best translation given our Chinese sentence!
• How do we know the probability that an English sentence is a good translation of a Chinese one?

Statistical Machine Translation: Overview

Modeling Translation
• word-based models (today)
• phrase-based models (later)
• tree-based models (later)

Learning Model Parameters
• sentence and word alignment
• phrase extraction and scoring
• tuning

Decoding
Statistical MT Systems

Spanish/English Bilingual Text  →  Statistical Analysis  →  Translation Model p(s|e)
English Text                    →  Statistical Analysis  →  Language Model p(e)

Spanish input:    Que hambre tengo yo
Broken English:   What hunger have I / Hungry I am so / I am so hungry / Have I that hunger …
English output:   I am so hungry

Decoding algorithm:  argmax_e p(e) · p(s|e)

To estimate the probabilities for a trigram language model, we need to collect statistics for three-word sequences from large amounts of text. In statistical machine translation, we use the English side of the parallel corpus, but may also include additional text resources. Chapter 7 has more detail on how language models are built.

Noisy Channel Model

How can we combine a language model and our translation model? Recall that we want to find the best translation e for an input sentence f. We now apply the Bayes rule to include p(e):

argmax_e p(e|f) = argmax_e p(f|e) p(e) / p(f)
                = argmax_e p(f|e) p(e)                                   (4.23)
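To make the noisy-channel objective concrete, here is a minimal sketch (not part of the course material) that scores a handful of candidate English strings with hypothetical translation-model and language-model probabilities and picks the argmax; the probability values and the tiny candidate list are invented purely for illustration.

```python
import math

# Hypothetical model scores for the Spanish input "Que hambre tengo yo".
# In a real system p(f|e) and p(e) come from trained models; these numbers
# are made up purely to illustrate argmax_e p(f|e) * p(e).
candidates = {
    "what hunger have i": {"tm": 0.30, "lm": 0.000001},
    "hungry i am so":     {"tm": 0.20, "lm": 0.000010},
    "i am so hungry":     {"tm": 0.15, "lm": 0.000300},
    "have i that hunger": {"tm": 0.25, "lm": 0.000002},
}

def noisy_channel_score(scores):
    # Work in log space to avoid underflow: log p(f|e) + log p(e)
    return math.log(scores["tm"]) + math.log(scores["lm"])

best = max(candidates, key=lambda e: noisy_channel_score(candidates[e]))
print(best)  # "i am so hungry": fluent (high LM score) and adequate (decent TM score)
```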
Note that, mathematically, the translation direction has changed from p(e|f) to p(f|e). This may create a lot of confusion, since the concept of what constitutes the source language differs between the mathematics of the model and the actual application. We try to stay away from this confusion as much as possible by sticking to the notation p(e|f) when formulating a translation model.

Combining a language model and translation model this way is called the noisy channel model. This method is widely used in speech recognition, but can be traced back to early information theory. Shannon [Shannon, 1948] developed the noisy channel model for his work on error correcting codes. In this scenario, a message is passed through a noisy channel, such as a low-quality telephone line, from a source S to a receiver R. See Figure 4.6 for an illustration. The receiver only gets a corrupted message. The challenge is now to reconstruct the original message using knowledge about possible source messages.

Translation Model (prefer adequate translations)
• p(das haus ist klein | the house is small) >
• p(das haus ist klein | the building is tiny) >
• p(das haus ist klein | the shell is low)

Language Model (prefer grammatical / fluent strings)
• p(the house is small) > p(small the is house)
Language Modeling

Statistical Language Models

One essential component of any statistical machine translation system is the language model, which measures how likely a sequence of words may be uttered by an English speaker. It is easy to see the benefits of such a model. Obviously, we want a machine translation system to not only produce output words that are true to the original in meaning, but string them together in fluent English sentences.

Prefer one string over another (ensure fluency):
• "small step": 5,880,000 hits on Google
• "little step": 1,780,000 hits on Google

In fact, the language model typically does much more than just enable fluent output. It supports difficult decisions in word order and word translation. For instance, a probabilistic language model plm should prefer correct word order over incorrect word order, and the more natural word choice in context:

plm(the house is small) > plm(small the is house)                        (7.1)
plm(I am going home) > plm(I am going house)                             (7.2)

Formally, a language model is a function that takes an English sentence and returns a probability for how likely it is produced by an English speaker. According to the example above, it is more likely that an English speaker utters the sentence the house is small than the sentence small the is house. Hence, a good language model plm assigns a higher probability to the first sentence.

This preference of the language model aids a statistical machine translation system to find the right word order. Another area where the language model aids translation is word choice. If a foreign word (such as the German Haus) has multiple translations (house, home, ...), lexical translation probabilities already give preference to the more common translation (here: house). But in specific contexts, other translations are preferred. Again, the language model steps in: it gives higher probability to the most natural word choice in context.

This chapter presents the dominant language modelling methodology, n-gram language models, along with smoothing and back-off methods that address issues of data sparseness: the problems that arise from having to build models with limited training data.

7.1 N-Gram Language Models

The leading method for language models is n-gram language modelling. N-gram language models are based on statistics of how likely words are to follow each other. Recall the last example: if we analyze a large amount of text, we will observe that the word home follows the word going more often than the word house does. We will be exploiting such statistics.

Formally, in language modelling, we want to compute the probability of a string W = w1, w2, ..., wn. Intuitively, p(W) is the probability that if we pick a sequence of English words at random (by opening a random book or magazine at a random place, or by listening to a conversation) it turns out to be W.

How can we compute p(W)? The typical approach to statistical estimation calls for first collecting a large amount of text and counting how often W occurs in it. However, most long sequences of words will not occur in the text at all. So we have to break down the computation of p(W) into smaller steps, for which we can collect sufficient statistics and estimate probability distributions. Dealing with sparse data that limits us in collecting sufficient statistics to estimate reliable probability distributions is the fundamental problem in language modelling. In this chapter, we will examine methods that address this issue.

7.1.1 Markov Chain

In n-gram language modelling, we break up the process of predicting a word sequence W into predicting one word at a time. We first decompose the probability using the chain rule:

p(w1, w2, ..., wn) = p(w1) p(w2|w1) ... p(wn|w1, ..., wn−1)              (7.3)

The language model probability p(w1, w2, ..., wn) is a product of word probabilities given a history of preceding words. To be able to estimate these word probability distributions, we limit the history to m words:

p(wn|w1, w2, ..., wn−1) ≈ p(wn|wn−m, ..., wn−2, wn−1)                    (7.4)

This type of model, where we step through a sequence (here: a sequence of words) and consider for the transitions only a limited history, is called a Markov chain. The number of previous states (here: words) is the order of the model. The Markov assumption states that only a limited number of previous words affect the probability of the next word. It is technically wrong, and it is not too hard to come up with counter-examples that demonstrate that a longer history is needed. However, limited data restrict the collection of reliable statistics to short histories.

Typically, we choose the actual number of words in the history based on how much training data we have. More training data allows for longer histories. Most commonly, trigram language models are used. They consider a two-word history to predict the third word. This requires the collection of statistics over sequences of three words, so-called 3-grams (trigrams). Language models may also be estimated over 2-grams (bigrams), single words (unigrams), or any other order of n-grams.

7.1.2 Estimation

In its simplest form, the estimation of trigram word prediction probabilities p(w3|w1, w2) is straightforward. We count how often the sequence w1, w2 is followed by the word w3 in our training corpus, as opposed to other words. According to maximum likelihood estimation, we compute:

p(w3|w1, w2) = count(w1, w2, w3) / Σw count(w1, w2, w)                   (7.5)

Figure 7.1 gives some examples of how this estimation works on real data, in this case the European Parliament corpus. We consider three different histories: the green, the red, and the blue. The words that most frequently follow are quite different for the different histories.
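As a concrete illustration of Equation (7.5), here is a small sketch (not part of the original slides) that collects trigram counts from a toy corpus and computes maximum likelihood estimates; the toy sentences are invented.

```python
from collections import defaultdict

# Toy corpus; a real language model would be estimated from the English side
# of the parallel corpus (plus additional monolingual text).
corpus = [
    "i am going home",
    "i am going to the house",
    "we are going home",
]

trigram_counts = defaultdict(int)   # count(w1, w2, w3)
history_counts = defaultdict(int)   # sum_w count(w1, w2, w)

for sentence in corpus:
    # Pad with sentence-start/end markers so boundary words get a history too.
    words = ["<s>", "<s>"] + sentence.split() + ["</s>"]
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        trigram_counts[(w1, w2, w3)] += 1
        history_counts[(w1, w2)] += 1

def p_mle(w3, w1, w2):
    # Maximum likelihood estimate: count(w1,w2,w3) / count(w1,w2,*)
    if history_counts[(w1, w2)] == 0:
        return 0.0
    return trigram_counts[(w1, w2, w3)] / history_counts[(w1, w2)]

print(p_mle("home", "am", "going"))   # 0.5 (1 of the 2 continuations of "am going")
print(p_mle("to", "am", "going"))     # 0.5
```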
Add sentence start and end-of-sentence markers!
• certain tokens appear often at the start or at the end of a sentence

Smoothing
• big problem: unseen words / n-grams → p(e) = 0
• smoothing = reserve probability mass for unseen events

Interpolation and Backoff
• combine higher-order and lower-order models

We then build an interpolated language model pI by linear combination of the language models pn:

pI(w3|w1, w2) = λ1 p1(w3) + λ2 p2(w3|w2) + λ3 p3(w3|w1, w2)              (7.29)

Each n-gram language model pn contributes its assessment, which is weighted by a factor λn. With a lot of training data, we can trust the higher-order models more and give them more weight.
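A minimal sketch of such linear interpolation (not from the slides): it combines hypothetical unigram, bigram, and trigram MLE estimates with fixed weights λ. In practice the component models come from the estimation step above, and the weights are tuned on held-out data.

```python
# Hypothetical component probabilities for predicting w3 = "home"
# after the history (w1, w2) = ("am", "going"); values are illustrative only.
p_unigram = 0.001   # p1(home)
p_bigram  = 0.2     # p2(home | going)
p_trigram = 0.5     # p3(home | am, going)

# Interpolation weights; must sum to 1 and would normally be tuned on held-out data.
lambdas = (0.1, 0.3, 0.6)

def interpolate(p1, p2, p3, lams):
    l1, l2, l3 = lams
    return l1 * p1 + l2 * p2 + l3 * p3

print(interpolate(p_unigram, p_bigram, p_trigram, lambdas))  # 0.3601
```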
Word-Based Translation Models

Generative model: the words of one sentence are generated by the words of its translation. Here we model p(e, a|f): the English sentence e is generated from the foreign sentence f together with an alignment a.

Reordering
• Words may be reordered during translation

  klein ist das Haus      1 2 3 4
  the house is small      1 2 3 4
  a : {1 → 3, 2 → 4, 3 → 2, 4 → 1}

One-to-many translation
• A source word may translate into multiple target words

  das Haus ist klitzeklein     1 2 3 4
  the house is very small      1 2 3 4 5
  a : {1 → 1, 2 → 2, 3 → 3, 4 → 4, 5 → 4}

Statistical Word Alignment Models
• the observed English word at position j is generated by the foreign word at position i
• introduce an alignment function a : j → i
• alignment model: p(e, a|f)

Translation: decode what kind of word sequence has generated the foreign string.

When applying the model to the data, we need to compute the probability of different alignments given a sentence pair in the data. To put this more formally, we need to compute p(a|e, f), the probability of an alignment given the English and the foreign sentence. Applying the chain rule (recall Section 3.2.3 on page 76) gives us

p(a|e, f) = p(e, a|f) / p(e|f)

We still need to derive p(e|f), the probability of translating the sentence f into e with any alignment. Sum over all possible ways of generating the string:

p(e|f) = Σ_a p(e, a|f)
       = Σ_{a(1)=0..lf} ... Σ_{a(le)=0..lf} p(e, a|f)
       = Σ_{a(1)=0..lf} ... Σ_{a(le)=0..lf}  ε / (lf + 1)^le  ∏_{j=1..le} t(ej | f_a(j))

(using the IBM Model 1 form of p(e, a|f) introduced below; each English position j may align to any foreign position 0, ..., lf, where position 0 is the special NULL token).
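To make the sum over alignments concrete, here is a small sketch (not from the slides) that enumerates every alignment function for a toy sentence pair and sums the term ε/(lf+1)^le · ∏ t(ej|f_a(j)); the tiny t-table values are placeholders. This brute force runs in O((lf+1)^le) time and is only feasible for very short sentences; the efficient factorization is shown at the end of the lecture.

```python
from itertools import product

# Placeholder lexical translation probabilities t(e|f); NULL is foreign position 0.
t = {
    ("the", "das"): 0.7, ("house", "Haus"): 0.8,
    ("is", "ist"): 0.8, ("small", "klein"): 0.4,
    ("the", "NULL"): 0.01, ("house", "NULL"): 0.01,
    ("is", "NULL"): 0.01, ("small", "NULL"): 0.01,
}

def t_prob(e_word, f_word):
    # Unlisted pairs get a tiny probability so every alignment is possible.
    return t.get((e_word, f_word), 1e-4)

def p_e_given_f_bruteforce(e, f, epsilon=1.0):
    f_with_null = ["NULL"] + f
    lf, le = len(f), len(e)
    norm = epsilon / (lf + 1) ** le
    total = 0.0
    # Each English position j may align to any foreign position 0..lf.
    for alignment in product(range(lf + 1), repeat=le):
        prob = norm
        for j, i in enumerate(alignment):
            prob *= t_prob(e[j], f_with_null[i])
        total += prob
    return total

e = "the house is small".split()
f = "das Haus ist klein".split()
print(p_e_given_f_bruteforce(e, f))  # sums over 5^4 = 625 alignments
```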
Translation Model Parameters

Lexical Translations
• das → the
• haus → house, home, building, household, shell
• ist → is
• klein → small, low

Multiple translation options?
• use the most common one in that context
• learn translation probabilities from example data

Parallel Corpus

Collection of example translations (parallel corpus):

• what is more , the relevant cost dynamic is completely under control .
  im übrigen ist die diesbezügliche kostenentwicklung völlig unter kontrolle .
• sooner or later we will have to be sufficiently progressive in terms of own resources as a basis for this fair tax system .
  früher oder später müssen wir die notwendige progressivität der eigenmittel als grundlage dieses gerechten steuersystems zur sprache bringen .
• we plan to submit the first accession partnership in the autumn of this year .
  wir planen , die erste beitrittspartnerschaft im herbst dieses jahres vorzulegen .
• it is a question of equality and solidarity .
  hier geht es um gleichberechtigung und solidarität .
• the recommendation for the year 1999 has been formulated at a time of favourable developments and optimistic prospects for the european economy .
  die empfehlung für das jahr 1999 wurde vor dem hintergrund günstiger entwicklungen und einer für den kurs der europäischen wirtschaft positiven perspektive abgegeben .
• that does not , however , detract from the deep appreciation which we have for this report .
  im übrigen tut das unserer hohen wertschätzung für den vorliegenden bericht keinen abbruch .

Context-Independent Models

Find the most probable English sentence given a foreign-language sentence:

ê = argmax_e p(e|f) = argmax_e p(e) p(f|e) / p(f) = argmax_e p(e) p(f|e)

Collect Statistics

Look at a parallel corpus (German text along with English translation) and count translation statistics:
• How often is Haus translated into ... ?

  Translation of Haus    Count
  house                  8,000
  building               1,600
  home                     200
  household                150
  shell                     50

Estimate Translation Probabilities

Maximum likelihood estimation (MLE):

t(e|f) = count(e, f) / Σ_e' count(e', f)

For f = Haus:

t(e|Haus) = 0.8     if e = house
            0.16    if e = building
            0.02    if e = home
            0.015   if e = household
            0.005   if e = shell

Special Cases in Word Alignment

Dropping words
• Words may be dropped when translated
  – The German article das is dropped

  das Haus ist klein      1 2 3 4
  house is small          1 2 3
  a : {1 → 2, 2 → 3, 3 → 4}

Inserting words
• Words may be added during translation
  – The English just does not have an equivalent in German
  – We still need to map it to something: introduce a special NULL token (foreign position 0)

  NULL das Haus ist klein      0 1 2 3 4
  the house is just small      1 2 3 4 5
  a : {1 → 1, 2 → 2, 3 → 3, 4 → 0, 5 → 4}
IBM Model 1

• Generative model: break up the translation process into smaller steps
  – IBM Model 1 only uses lexical translation
• Translation probability
  – for a foreign sentence f = (f1, ..., f_lf) of length lf
  – to an English sentence e = (e1, ..., e_le) of length le
  – with an alignment of each English word ej to a foreign word fi according to the alignment function a : j → i

p(e, a|f) = ε / (lf + 1)^le · ∏_{j=1..le} t(ej | f_a(j))

• the parameter ε is a normalization constant
• only lexical translation probabilities t(e|f) as parameters

IBM Model 1: Example

Lexical translation probability tables t(e|f):

  das:   e       t(e|f)      Haus:   e          t(e|f)
         the     0.7                 house      0.8
         that    0.15                building   0.16
         which   0.075               home       0.02
         who     0.05                household  0.015
         this    0.025               shell      0.005

  ist:   e       t(e|f)      klein:  e          t(e|f)
         is      0.8                 small      0.4
         's      0.16                little     0.4
         exists  0.02                short      0.1
         has     0.015               minor      0.06
         are     0.005               petty      0.04

Example: das Haus ist klein → the house is small, aligning each English word to the foreign word in the same position:

p(e, a|f) = ε/5⁴ × t(the|das) × t(house|Haus) × t(is|ist) × t(small|klein)
          = ε/5⁴ × 0.7 × 0.8 × 0.8 × 0.4
          ≈ 0.00029 ε
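A minimal sketch (not part of the slides) of computing the Model 1 probability p(e, a|f) for the example above, using the t-table values from the slide and ε = 1; the helper function and its name are illustrative only.

```python
# Lexical translation probabilities from the example tables above.
t = {
    ("the", "das"): 0.7, ("house", "Haus"): 0.8,
    ("is", "ist"): 0.8, ("small", "klein"): 0.4,
}

def ibm1_prob(e_words, f_words, alignment, epsilon=1.0):
    """p(e, a|f) = epsilon / (lf + 1)^le * prod_j t(e_j | f_a(j)).
    alignment[j] is the foreign position (1-based, 0 = NULL) for English position j+1."""
    lf, le = len(f_words), len(e_words)
    prob = epsilon / (lf + 1) ** le
    f_with_null = ["NULL"] + f_words
    for j, i in enumerate(alignment):
        prob *= t.get((e_words[j], f_with_null[i]), 0.0)
    return prob

e = "the house is small".split()
f = "das Haus ist klein".split()
a = [1, 2, 3, 4]            # each English word aligned to the foreign word in the same position
print(ibm1_prob(e, f, a))   # 0.1792 / 625 ≈ 0.000287, i.e. ≈ 0.00029 * epsilon
```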
IBM Model 2

As a consequence of ignoring word positions, according to IBM Model 1 the translation probabilities for the following two alternative translations are the same:

  natürlich ist das haus klein  →  of course the house is small
  natürlich ist das haus klein  →  the course small is of house

What does this mean for p(e|f) = Σ_a p(e, a|f)? The following are all equal:
• p( the house is small | das Haus ist klein )
• p( the house is small | das Haus klein ist )
• p( the house is small | ist Haus das klein )

IBM Model 2 addresses the issue of alignment with an explicit model for alignment based on the positions of the input and output words. The translation of a foreign input word in position i to an English word in position j is modelled by an alignment probability distribution

a(i | j, le, lf)                                                         (4.24)

Recall that the length of the input sentence f is denoted as lf, and the length of the output sentence e is le. We can view translation under IBM Model 2 as a two-step process with a lexical translation step and an alignment step:

  lexical translation step:  natürlich ist das haus klein  →  of course is the house small
  alignment step:            of course is the house small  →  of course the house is small

The first step is lexical translation as in IBM Model 1, again modelled by the translation probability t(e|f). The second step is the alignment step. For instance, translating ist into is has a lexical translation probability of t(is|ist) and an alignment probability of a(2|5, 6, 5) — the 5th English word is aligned to the 2nd foreign word. The two steps are combined mathematically to form IBM Model 2:

p(e, a|f) = ε ∏_{j=1..le} t(ej | f_a(j)) · a(a(j) | j, le, lf)           (4.25)

Note that the alignment function a maps each English output word j to a foreign input position a(j), and the alignment probability distribution is also set up in this reverse direction.

Fortunately, adding the alignment probability distribution does not make EM training much more complex. Recall that the number of possible word alignments for a sentence pair is exponential in the number of words; as for IBM Model 1, the computation can be reduced to a product of sums.
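A small sketch (not from the slides) of the Model 2 probability in Equation (4.25), extending the Model 1 helper with a hypothetical alignment distribution a(i|j, le, lf); all parameter values are invented for illustration.

```python
# Hypothetical parameters for translating "natürlich ist das haus klein" (lf=5)
# into "of course the house is small" (le=6); values are illustrative only.
t = {("of", "natürlich"): 0.4, ("course", "natürlich"): 0.4,
     ("the", "das"): 0.7, ("house", "haus"): 0.8,
     ("is", "ist"): 0.8, ("small", "klein"): 0.4}

# a[(i, j)] = a(i | j, le=6, lf=5): probability that English position j
# is aligned to foreign position i (sentence lengths fixed for this example).
a = {(1, 1): 0.6, (1, 2): 0.5, (3, 3): 0.5, (4, 4): 0.6, (2, 5): 0.5, (5, 6): 0.6}

def ibm2_prob(e_words, f_words, alignment, epsilon=1.0):
    """p(e, a|f) = epsilon * prod_j t(e_j|f_a(j)) * a(a(j)|j, le, lf)."""
    prob = epsilon
    for j, i in enumerate(alignment, start=1):   # j, i are 1-based positions
        prob *= t[(e_words[j - 1], f_words[i - 1])]
        prob *= a[(i, j)]
    return prob

e = "of course the house is small".split()
f = "natürlich ist das haus klein".split()
alignment = [1, 1, 3, 4, 2, 5]   # e.g. "is" (j=5) aligns to "ist" (i=2): t(is|ist) * a(2|5,6,5)
print(ibm2_prob(e, f, alignment))
```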
Interlude: HMM Model

Motivation: words do not move independently of each other; they often move in groups.
→ condition word movements on the previous word

HMM alignment model
• add a condition on the alignment of the previous word
• different alignment parameter: p(a(j) | a(j−1), le)
• a good replacement for IBM Model 2
• applying the EM algorithm is harder and requires dynamic programming
• IBM Model 4 is similar, and also conditions on word classes
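As a sketch of the HMM idea (not from the slides): the alignment probability depends only on the jump width from the previous alignment point, p(a(j) | a(j−1), le) ∝ c(a(j) − a(j−1)). The jump counts below are invented, and normalizing over the foreign positions of the current sentence is a simplification.

```python
# Hypothetical relative-jump counts c(d): how often the alignment moves
# d foreign positions between consecutive English words (d = a(j) - a(j-1)).
jump_counts = {-2: 5, -1: 10, 0: 20, 1: 100, 2: 30, 3: 10}

def hmm_jump_prob(i, i_prev, lf):
    """p(a(j) = i | a(j-1) = i_prev), normalized over the foreign positions 1..lf."""
    def c(d):
        return jump_counts.get(d, 1)   # small default count for unseen jumps
    norm = sum(c(k - i_prev) for k in range(1, lf + 1))
    return c(i - i_prev) / norm

# Moving one position forward (a monotone step) is much more likely than a long jump back:
print(hmm_jump_prob(3, 2, lf=5))   # jump +1
print(hmm_jump_prob(1, 4, lf=5))   # jump -3
```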
IBM Model 3

So far we have not explicitly modeled how many words are generated from each input word. In most cases, a German word translates to one single English word. However, some German words like zum typically translate to two English words, i.e., to the. Others, such as the flavoring particle ja, get dropped.

Motivation:
• Some words tend to be aligned to multiple words (German "zum" translates to "to the")
• Other words tend to be unaligned, i.e., dropped (German "ja", used as a filler)
→ Add a parameter modeling this property: fertility

We now want to model the fertility of input words directly with a probability distribution

n(φ | f)                                                                 (4.29)

For each foreign word f, this probability distribution indicates how many (φ = 0, 1, 2, ...) output words it usually translates to. Returning to the examples above, we expect, for instance, that

n(1|haus) ≈ 1        n(2|zum) ≈ 1        n(0|ja) ≈ 1                     (4.30)

Fertility deals explicitly with dropping input words by allowing φ = 0. But there is also the issue of adding words. Recall that we introduced the NULL token to account for words in the output that have no correspondent in the input. For instance, the English word do is often inserted when translating verbal negations into English. In the IBM models these added words are generated by the special NULL token.

We could model the fertility of the NULL token the same way as for all the other words by the conditional distribution n(φ|NULL). However, the number of inserted words clearly depends on the sentence length, so we chose to model NULL insertion as a special step: after the fertility step, we introduce one NULL token with probability p1 after each generated word, or no NULL token with probability p0 = 1 − p1.

The addition of fertility and NULL token insertion increases the translation process in Model 3 to four steps. Compared to Model 2, the other differences are:
• an explicit NULL insertion parameter
• distortion instead of alignment

To review, each of the four steps is modeled probabilistically:
• Fertility is modeled by the distribution n(φ|f). For instance, the duplication of the German word zum has the probability n(2|zum).
• NULL insertion is modeled by the probabilities p1 and p0 = 1 − p1. For instance, the probability p1 is factored in for the insertion of NULL after ich, and the probability p0 is factored in for no such insertion after nicht.
• Lexical translation is handled by the probability distribution t(e|f) as in IBM Model 1. For instance, translating nicht into not is done with probability t(not|nicht).
• Distortion is modeled almost the same way as in IBM Model 2 with a probability distribution d(j|i, le, lf), which predicts the output English word position j based on the foreign input word position i and the respective sentence lengths. For instance, the placement of go as the 4th word of the 7-word English sentence as a translation of gehe, which was the 2nd word in the 6-word German sentence, has probability d(4|2, 7, 6).

Note that we called the last step distortion instead of alignment. We make this distinction since we can produce the same translation with the same alignment also in a different way.

IBM Model 4: Motivation
• Absolute positions for distortion feel wrong
• Words do not move independently; they often move in groups
• Some words tend to move and some do not
→ Introduce a relative distortion model
→ Introduce dependence on word classes
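A rough sketch (not from the slides) of how the four Model 3 factor types multiply together for the example sentence pair ich gehe ja nicht zum haus → I do not go to the house; all parameter values are invented, and the exact combinatorial factors of the full Model 3 formula are omitted to keep the illustration short.

```python
# Invented Model 3 parameters for: ich gehe ja nicht zum haus -> I do not go to the house
fertility = {("ich", 1): 0.9, ("gehe", 1): 0.8, ("ja", 0): 0.7,
             ("nicht", 1): 0.8, ("zum", 2): 0.6, ("haus", 1): 0.9}
p1, p0 = 0.05, 0.95                      # NULL insertion / no insertion
lexical = {("I", "ich"): 0.8, ("do", "NULL"): 0.3, ("not", "nicht"): 0.7,
           ("go", "gehe"): 0.6, ("to", "zum"): 0.5, ("the", "zum"): 0.4,
           ("house", "haus"): 0.8}
# Distortion d(j | i, le=7, lf=6): English position j for foreign position i.
distortion = {(1, 1): 0.5, (4, 2): 0.3, (3, 4): 0.4, (5, 5): 0.4, (6, 5): 0.3, (7, 6): 0.5}

score = 1.0
for key, prob in fertility.items():      # fertility step: n(phi | f)
    score *= prob
score *= p1 * p0 ** 5                    # one NULL inserted (for "do"), none after the other 5 words
for key, prob in lexical.items():        # lexical translation step: t(e | f)
    score *= prob
for key, prob in distortion.items():     # distortion step: d(j | i, le, lf)
    score *= prob

print(f"unnormalized Model 3 score ≈ {score:.2e}")
```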
IBM Model 4: Relative Distortion

• A cept πi is the set of English words aligned to the same foreign word fi (foreign words with non-zero fertility form cepts).
• The center ⊙i of a cept is the ceiling of the average English position of its words.

Example alignment:

  NULL ich gehe ja nicht zum haus
  I    do  not go to the house

  cept πi                π1    π2    π3     π4       π5
  foreign position [i]   1     2     4      5        6
  foreign word f[i]      ich   gehe  nicht  zum      haus
  English words {ej}     I     go    not    to,the   house
  English positions {j}  1     4     3      5,6      7
  center of cept ⊙i      1     4     3      6        7

How does this scheme using cepts determine distortion for English words? For each output word, we now define its relative distortion. We distinguish between three cases: (a) words that are generated by the NULL token, (b) the first word of a cept, and (c) subsequent words in a cept.

(a) Words generated by the NULL token are uniformly distributed.

(b) For the likelihood of placements of the first word of a cept, we use the probability distribution

d1(j − ⊙[i−1])                                                           (4.37)

Placement is defined as the English word position j relative to the center ⊙ of the preceding cept i−1. Refer to Figure 4.9 for an example: the word not, English word position j = 3, is generated by cept π3. The center of the preceding cept π2 is ⊙2 = 4. Thus we have a relative distortion of −1, reflecting the forward movement of this word. In the case of translation in sequence, distortion is always +1: house and I are examples of this.

(c) For the placement of subsequent words in a cept, we consider the placement of the previous word in the cept, using the probability distribution

d>1(j − πi,k−1)                                                          (4.38)

πi,k refers to the word position of the kth word in the ith cept. We have one example for this in Figure 4.9: the English word the is the second word in the 4th cept (German word zum); the preceding English word in the cept is to. The position of to is π4,0 = 5, the position of the is j = 6, thus we factor in d>1(+1).

  j             1       2     3       4       5       6        7
  ej            I       do    not     go      to      the      house
  in cept πi,k  π1,0    π0,0  π3,0    π2,0    π4,0    π4,1     π5,0
  ⊙i−1          0       –     4       1       3       –        6
  j − ⊙i−1      +1      –     −1      +3      +2      –        +1
  distortion    d1(+1)  1     d1(−1)  d1(+3)  d1(+2)  d>1(+1)  d1(+1)

Figure 4.9: Distortion in IBM Model 4. Foreign words with non-zero fertility form cepts (here 5 cepts), which contain English words ej. The center ⊙i of a cept πi is ceiling(avg(j)). The distortion of each English word is modeled by a probability distribution: (a) for NULL-generated words such as do: uniform, (b) for first words in a cept such as not: based on the distance between the word and the center of the preceding cept (j − ⊙i−1), (c) for subsequent words in a cept such as the: the distance to the previous word in the cept. The probability distributions d1 and d>1 are learnt from the data. They are conditioned using lexical information; see the text for details.

IBM Model 4: Word Classes

Some words trigger reordering: conditional distortion! We may want to introduce richer conditioning on distortion. For instance, some words tend to get reordered during translation, while others remain in order. A typical example for this is adjective–noun inversion when translating French to English. Adjectives get moved back when preceded by a noun. For instance: Affaires extérieures becomes external affairs.

We may be tempted to condition the distortion probability distributions in Equations 4.37–4.38 on the words ej and f[i−1]:

for the initial word in a cept:   d1(j − ⊙[i−1] | f[i−1], ej)
for additional words:             d>1(j − πi,k−1 | ej)                   (4.39)

However, we will find ourselves in a situation where, for most ej and f[i−1], we will not be able to collect sufficient statistics to estimate realistic probability distributions. Conditioning on words creates too many parameters: a sparse data problem in training.

To have both some of the benefits of a lexicalized model and sufficient statistics, Model 4 introduces word classes, induced automatically for the source and the target language. When we group the vocabulary of a language into, say, 50 classes, we can condition the probability distributions on these classes. The result is an almost lexicalized model, but with sufficient statistics. To put this more formally, we introduce two functions A(f) and B(e) that map words to their word classes. This allows us to reformulate Equation 4.39 into:

for the initial word in a cept:   d1(j − ⊙[i−1] | A(f[i−1]), B(ej))
for additional words:             d>1(j − πi,k−1 | B(ej))                (4.40)

IBM Model 5

IBM Models 1–4 are deficient:
• some impossible translations have positive probabilities
• multiple output words may be placed in the same position
• probability mass is wasted!

IBM Model 5
• fixes deficiency by keeping track of vacancies
• details: see the text book

Summary: IBM Models

We have presented with IBM Model 1 a model for machine translation. However, at closer inspection, the model has many flaws. The model is very weak in terms of reordering, as well as adding and dropping words. In fact, according to IBM Model 1, the best translation for any input is the empty string (see also Exercise 3 at the end of this chapter).

Five models of increasing complexity were proposed in the original work on statistical machine translation at IBM; higher models include more information. The advances of the five models are:
• IBM Model 1: lexical translation
• IBM Model 2: adds an absolute alignment model
• IBM Model 3: adds a fertility model
• IBM Model 4: relative alignment model
• IBM Model 5: fixes deficiency

Alignment models introduce an explicit model for reordering words in a sentence. More often than not, words that follow each other in one language have translations that follow each other in the output language. However, IBM Model 1 treats all possible reorderings as equally likely.

Fertility is the notion that input words produce a specific number of output words in the output language. Most often, of course, one word in the input language translates into one single word in the output language. But some words produce multiple words or get dropped (producing zero words). A model for the fertility of words addresses this aspect of translation.

Adding additional components increases the complexity of training the models, but the general principles that we introduced for IBM Model 1 stay the same: we define the models mathematically and then devise an EM algorithm for training. This incremental march towards more complex models does not only serve didactic purposes. All the IBM models are relevant, because EM training starts with the simplest Model 1 for a few iterations, and then proceeds through iterations of the more complex models all the way to IBM Model 5.

Training estimates the parameters from word-aligned data; the trained models can in turn be used for word alignment.

For IBM Model 1, the sum over all alignments can be computed efficiently. Let us put Equations 4.9 and 4.10 together:

p(e|f) = Σ_{a(1)=0..lf} ... Σ_{a(le)=0..lf} p(e, a|f)
       = Σ_{a(1)=0..lf} ... Σ_{a(le)=0..lf}  ε / (lf + 1)^le  ∏_{j=1..le} t(ej | f_a(j))     (4.10)
       = ε / (lf + 1)^le  ∏_{j=1..le}  Σ_{i=0..lf} t(ej | fi)

Note the significance of the last step: instead of performing a sum over an exponential number ((lf + 1)^le) of products, we reduced the computational complexity to linear in lf and le (since lf ≈ le, computing p(e|f) is roughly quadratic with respect to sentence length).

This also gives a simple form for the probability of an alignment given a sentence pair:

p(a|e, f) = p(e, a|f) / p(e|f)
          = [ ε/(lf+1)^le ∏_{j=1..le} t(ej|f_a(j)) ] / [ ε/(lf+1)^le ∏_{j=1..le} Σ_{i=0..lf} t(ej|fi) ]
          = ∏_{j=1..le}  t(ej | f_a(j)) / Σ_{i=0..lf} t(ej | fi)                             (4.11)

What is Next?

Assignment on word-based SMT
• manipulate translation models
• train and manipulate language models
• decode a simple test corpus

Upcoming lectures
• Parallel corpora and alignment
• Phrase-based SMT
• SMT decoding and tuning
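As a pointer for the assignment, here is a compact sketch (not part of the course material) of EM training for IBM Model 1 on a toy corpus. It uses the factorized alignment posterior p(a(j)=i | e, f) = t(ej|fi) / Σ_i' t(ej|fi') from Equation (4.11) to collect expected counts and re-estimate t(e|f); NULL handling and convergence checks are omitted to keep it short.

```python
from collections import defaultdict

# Toy sentence-aligned corpus (foreign, english).
corpus = [
    ("das Haus", "the house"),
    ("das Buch", "the book"),
    ("ein Buch", "a book"),
]

# Initialize t(e|f) uniformly over co-occurring words.
t = defaultdict(float)
f_vocab_for_e = defaultdict(set)
for f_sent, e_sent in corpus:
    for e in e_sent.split():
        for f in f_sent.split():
            f_vocab_for_e[e].add(f)
for e, fs in f_vocab_for_e.items():
    for f in fs:
        t[(e, f)] = 1.0 / len(fs)

for iteration in range(10):
    count = defaultdict(float)   # expected count(e, f)
    total = defaultdict(float)   # expected count(f)
    # E-step: distribute each English word over the foreign words that may have generated it.
    for f_sent, e_sent in corpus:
        f_words, e_words = f_sent.split(), e_sent.split()
        for e in e_words:
            norm = sum(t[(e, f)] for f in f_words)   # sum_i t(e|f_i)
            for f in f_words:
                c = t[(e, f)] / norm                 # alignment posterior p(a(j)=i | e, f)
                count[(e, f)] += c
                total[f] += c
    # M-step: re-estimate t(e|f) = expected count(e,f) / expected count(f).
    for (e, f) in count:
        t[(e, f)] = count[(e, f)] / total[f]

print(round(t[("house", "Haus")], 3))   # converges towards 1.0
print(round(t[("the", "das")], 3))      # "the" is pulled towards "das"
```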